Exercise 2: Chunk Size Impact on Retrieval

We will:

1. Chunk the same document into:

   - 100-character chunks (small)

   - 200-character chunks (medium)

   - 400-character chunks (large)

2. Create embeddings for every chunk

3. Search using the query: "What is machine learning?"

4. Retrieve the top 3 chunks by cosine similarity

5. Compare the results
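
As a rough illustration of step 1, the sketch below applies fixed-width character slicing to a short sample string. The helper name `split_fixed` and the sample text are made up for this example and are not part of the exercise code. It only shows that the chunk count is roughly the text length divided by the chunk size, and that a slice can begin or end in the middle of a word.

```python
import math

def split_fixed(text, size):
    """Cut text into consecutive slices of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Illustrative sample text (hypothetical, not the exercise document)
sample = "Machine learning is a subset of artificial intelligence."

for size in (10, 20, 40):
    pieces = split_fixed(sample, size)
    # Without stripping, the count is always ceil(len(text) / size)
    assert len(pieces) == math.ceil(len(sample) / size)
    # repr() makes mid-word cuts and trailing spaces visible
    print(size, len(pieces), repr(pieces[0]))
```

The exercise's own helper additionally strips whitespace and drops empty chunks, so its counts can come out slightly lower than this formula suggests.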

In [7]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Helper: cosine similarity
def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Helper: character-based chunking.
# Slices are fixed-width, so a chunk may begin or end in the middle of a word.
def chunk_by_char(text, size):
    chunks = []
    for i in range(0, len(text), size):
        chunk = text[i:i+size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks


# document to be chunked and evaluated
document = """
Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to
the natural intelligence displayed by humans and animals. Leading AI textbooks define
the field as the study of intelligent agents: any device that perceives its environment
and takes actions that maximize its chance of successfully achieving its goals.

Machine learning is a subset of artificial intelligence that focuses on the use of data
and algorithms to imitate the way that humans learn, gradually improving its accuracy.
Machine learning is an important component of the growing field of data science.

Deep learning is part of a broader family of machine learning methods based on artificial
neural networks with representation learning. Learning can be supervised, semi-supervised
or unsupervised. Deep learning architectures such as deep neural networks, deep belief
networks, recurrent neural networks and convolutional neural networks have been applied
to fields including computer vision, speech recognition, natural language processing,
machine translation, and bioinformatics.

Natural language processing is a subfield of linguistics, computer science, and artificial
intelligence concerned with the interactions between computers and human language, in
particular how to program computers to process and analyze large amounts of natural
language data. Challenges in natural language processing frequently involve speech
recognition, natural language understanding, and natural language generation.
""".strip()


# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What is machine learning?"
query_emb = model.encode(query)

# Function to evaluate chunk size
def evaluate_chunk_size(label, chunk_size):
    print(f"\n{label} Chunks ({chunk_size} chars):")
    print("-" * 50)

    # 1. Chunk document
    chunks = chunk_by_char(document, chunk_size)

    # 2. Embed all chunks
    chunk_embeddings = model.encode(chunks)

    # 3. Score each chunk against the query
    scores = [
        (chunk, cosine_similarity(query_emb, emb))
        for chunk, emb in zip(chunks, chunk_embeddings)
    ]

    # Sort by similarity, highest first, and keep the top 3
    scores.sort(key=lambda x: x[1], reverse=True)
    top = scores[:3]

    # Print results (only the best-scoring of the top 3 chunks is shown)
    print(f"- Number of chunks: {len(chunks)}")
    print(f"- Top result (first 150 chars): \"{top[0][0][:150]}\"")
    print(f"- Score: {top[0][1]:.3f}")

    # Analysis: fixed, hand-written notes per chunk size (not derived from the scores)
    if chunk_size == 100:
        analysis = "Good precision but lacks full context."
    elif chunk_size == 200:
        analysis = "Best balance: enough detail + good focus."
    else:
        analysis = "More context but less focused; too broad."

    print(f"- Analysis: {analysis}")

# Run evaluations

print("\n\nChunk Size Comparison:")

evaluate_chunk_size("Small", 100)
evaluate_chunk_size("Medium", 200)
evaluate_chunk_size("Large", 400)

print("\nBest chunk size for this use case: 200 chars because it provides the best balance between focus and context.")



Chunk Size Comparison:

Small Chunks (100 chars):
--------------------------------------------------
- Number of chunks: 16
- Top result (first 150 chars): "ce of successfully achieving its goals.

Machine learning is a subset of artificial intelligence tha"
- Score: 0.723
- Analysis: Good precision but lacks full context.

Medium Chunks (200 chars):
--------------------------------------------------
- Number of chunks: 8
- Top result (first 150 chars): "t focuses on the use of data
and algorithms to imitate the way that humans learn, gradually improving its accuracy.
Machine learning is an important c"
- Score: 0.704
- Analysis: Best balance: enough detail + good focus.

Large Chunks (400 chars):
--------------------------------------------------
- Number of chunks: 4
- Top result (first 150 chars): "t focuses on the use of data
and algorithms to imitate the way that humans learn, gradually improving its accuracy.
Machine learning is an important c"
- Score: 0.654
- Analysis: More c