Exercise 2 — Chunk Size Impact on Retrieval

We will:

1. Chunk the same document into:

   - 100-character chunks (small)

   - 200-character chunks (medium)

   - 400-character chunks (large)

2. Create embeddings for each set of chunks

3. Search using the query: "What is machine learning?"

4. Retrieve the top 3 chunks by cosine similarity

5. Compare the results across chunk sizes
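
The five steps above can be sketched as a single loop over chunk sizes. This is a minimal sketch, not the notebook's own code: `compare_chunk_sizes` and its `encode` parameter are hypothetical names introduced here, and `encode` is assumed to be any callable that maps a list of strings to a 2-D array with one embedding per row.

```python
import numpy as np

def chunk_by_char(text, size):
    """Split text into consecutive pieces of at most `size` characters."""
    pieces = (text[i:i + size].strip() for i in range(0, len(text), size))
    return [p for p in pieces if p]

def compare_chunk_sizes(text, query, encode, sizes=(100, 200, 400), top_k=3):
    """Return {size: [(chunk, score), ...]} with the top_k chunks per size.

    `encode` (an assumption of this sketch) maps a list of strings to a
    2-D array of embeddings, one row per string.
    """
    query_emb = np.asarray(encode([query]), dtype=float)[0]
    results = {}
    for size in sizes:
        chunks = chunk_by_char(text, size)
        embs = np.asarray(encode(chunks), dtype=float)
        # Cosine similarity between the query and every chunk.
        norms = np.linalg.norm(embs, axis=1) * np.linalg.norm(query_emb)
        sims = embs @ query_emb / norms
        # Indices of the top_k highest scores, best first.
        order = np.argsort(sims)[::-1][:top_k]
        results[size] = [(chunks[i], float(sims[i])) for i in order]
    return results
```

With a loaded `SentenceTransformer` model, a call such as `compare_chunk_sizes(document, "What is machine learning?", model.encode)` should work, since `model.encode` accepts a list of strings and returns one embedding per string.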

In [1]:
# Step 1 — Helper Functions

from sentence_transformers import SentenceTransformer
import numpy as np

def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def chunk_by_char(text, size):
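    """
    Split text into consecutive chunks of at most `size` characters.

    Boundaries fall at fixed offsets, so chunks can start or end mid-word.
    Whitespace is stripped from each chunk and empty chunks are dropped.
    """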
    chunks = []
    for i in range(0, len(text), size):
        chunk = text[i:i+size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks
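
To see what fixed-offset chunking does to word boundaries, the helper can be run on a short string. The sketch below repeats the same logic as `chunk_by_char` so that it runs on its own; the sample sentence and the chunk size of 20 are chosen only for illustration.

```python
def chunk_by_char(text, size):
    """Same logic as the helper above: fixed-size slices, stripped, non-empty."""
    chunks = []
    for i in range(0, len(text), size):
        chunk = text[i:i + size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks

# Illustrative sample sentence, not part of the exercise document.
sample = "Machine learning is a subset of artificial intelligence."
sample_chunks = chunk_by_char(sample, 20)

# repr() makes the exact chunk boundaries visible.
for c in sample_chunks:
    print(repr(c))
```

The 56-character sample yields three chunks, and the word "artificial" is split between the second and third. The same effect shows up at any chunk size whenever an offset lands inside a word.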




In [2]:
# Step 2 — Load Model and Document

model = SentenceTransformer('all-MiniLM-L6-v2')

document = """
Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to
the natural intelligence displayed by humans and animals. Leading AI textbooks define
the field as the study of intelligent agents: any device that perceives its environment
and takes actions that maximize its chance of successfully achieving its goals.

Machine learning is a subset of artificial intelligence that focuses on the use of data
and algorithms to imitate the way that humans learn, gradually improving its accuracy.
Machine learning is an important component of the growing field of data science.

Deep learning is part of a broader family of machine learning methods based on artificial
neural networks with representation learning. Learning can be supervised, semi-supervised
or unsupervised. Deep learning architectures such as deep neural networks, deep belief
networks, recurrent neural networks and convolutional neural networks have been applied
to fields including computer vision, speech recognition, natural language processing,
machine translation, and bioinformatics.

Natural language processing is a subfield of linguistics, computer science, and artificial
intelligence concerned with the interactions between computers and human language, in
particular how to program computers to process and analyze large amounts of natural
language data. Challenges in natural language processing frequently involve speech
recognition, natural language understanding, and natural language generation.
""".strip()

Step 3 — Test 3 Chunk Sizes

For each chunk size, we record:

   - Number of chunks

   - Top result for the query

   - Similarity score

   - Interpretation
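
The ranking behind the top result and its similarity score can also be computed in one vectorized step instead of a Python loop. The following is a sketch under stated assumptions: `top_k_by_cosine` is a hypothetical helper name, and the 2-D toy vectors stand in for real embeddings so the result can be checked by hand.

```python
import numpy as np

def top_k_by_cosine(query_emb, chunk_embs, k=3):
    """Return [(chunk_index, cosine_score), ...] for the k best chunks."""
    query_emb = np.asarray(query_emb, dtype=float)
    chunk_embs = np.asarray(chunk_embs, dtype=float)
    # Normalize to unit length so a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    sims = c @ q
    # argsort is ascending, so reverse it and keep the first k indices.
    order = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in order]

# Toy embeddings (illustrative only): same direction, orthogonal,
# 45 degrees off, and opposite to the query.
toy_chunks = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([1.0, 0.0])
ranked = top_k_by_cosine(toy_query, toy_chunks)
print(ranked)
```

The chunk pointing in the same direction as the query ranks first with a score of 1, the 45-degree chunk second, and the orthogonal chunk third with a score of 0. The returned indices can be used to look up the chunk text in the original list.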

In [5]:
# Small Chunks (100 chars)
small_chunks = chunk_by_char(document, 100)
small_embeddings = model.encode(small_chunks)

# Embed the query with the same model so the vectors are comparable
query = "What is machine learning?"
query_emb = model.encode(query)

# Score every chunk against the query
scores = [(c, cosine_similarity(query_emb, emb))
          for c, emb in zip(small_chunks, small_embeddings)]

# Sort by similarity, highest first, and keep the top 3
scores.sort(key=lambda x: x[1], reverse=True)
top_small = scores[:3]
print("Top 3 results for small chunks:")
for chunk, score in top_small:
    print(f"Score: {score:.4f}\nChunk: {chunk}\n")

Top 3 results for small chunks:
Score: 0.7230
Chunk: ce of successfully achieving its goals.

Machine learning is a subset of artificial intelligence tha

Score: 0.6666
Chunk: g its accuracy.
Machine learning is an important component of the growing field of data science.

De

Score: 0.5451
Chunk: Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to
the natural in

