## Exercise 1: Chunking Different Document Types

### Task
Apply appropriate chunking strategies to different document types.

### Documents

You'll chunk 3 different types of documents:

**Document A: FAQ**
```
Q: What is the return policy?
A: Items can be returned within 30 days of purchase with original receipt.

Q: Do you offer international shipping?
A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location.

Q: How do I track my order?
A: Use the tracking number sent to your email after shipment.
```

In [1]:
Document_A = """ Q: What is the return policy?
A: Items can be returned within 30 days of purchase with original receipt.

Q: Do you offer international shipping?
A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location.

Q: How do I track my order?
A: Use the tracking number sent to your email after shipment."""



print(f"Document length: {len(Document_A)} characters")
print(f"Document length: {len(Document_A.split())} words")
print(f"\nFirst 150 characters:\n{Document_A[:150]}...")


Document length: 318 characters
Document length: 57 words

First 150 characters:
 Q: What is the return policy?
A: Items can be returned within 30 days of purchase with original receipt.

Q: Do you offer international shipping?
A: ...


#### **Strategy_A** =  **Fixed-Size Chunking (Word-Based)** 
``` Reason_A = Fixed-size, word-based chunking was chosen because it splits on word boundaries rather than in the middle of words, and its small chunks keep retrieval precise. That makes it easier to return a specific answer to a specific question, which is what an FAQ usually requires. ```

In [6]:
def chunk_by_words(text, chunk_size=30, overlap=8):
    """
    Split text into chunks of specified word count.
    
    Args:
        text: The text to chunk
        chunk_size: Number of words per chunk
        overlap: Number of words to overlap between chunks
    
    Returns:
        List of text chunks
    """
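    # A step of zero or less would make the loop below run forever
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    # Split text into words
    words = text.split()
    chunks = []
    start = 0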
    
    while start < len(words):
        # Get chunk of words
        end = start + chunk_size
        chunk_words = words[start:end]
        
        # Join words back into text
        chunk = ' '.join(chunk_words)
        chunks.append(chunk)
        
        # Move start position (with overlap)
        start += chunk_size - overlap
    
    return chunks

# Test it
chunks = chunk_by_words(Document_A, chunk_size=30, overlap=8)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks[:3], 1):  # Show first 3 chunks
    print(f"Chunk {i} ({len(chunk.split())} words):")
    print(chunk)
    print("-" * 80)

Number of chunks: 3

Chunk 1 (30 words):
Q: What is the return policy? A: Items can be returned within 30 days of purchase with original receipt. Q: Do you offer international shipping? A: Yes, we ship to
--------------------------------------------------------------------------------
Chunk 2 (30 words):
offer international shipping? A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location. Q: How do I track my order? A: Use the tracking number sent
--------------------------------------------------------------------------------
Chunk 3 (13 words):
my order? A: Use the tracking number sent to your email after shipment.
--------------------------------------------------------------------------------
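
The chunk boundaries above follow directly from the step size, `chunk_size - overlap`. The sketch below reproduces that arithmetic on its own; `chunk_windows` is a hypothetical helper introduced here for illustration, not part of the exercise code.

```python
def chunk_windows(n_words, chunk_size=30, overlap=8):
    """Return (start, end) word indices for each chunk; end is exclusive."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [
        (start, min(start + chunk_size, n_words))
        for start in range(0, n_words, step)
    ]


# Document_A has 57 words, so the windows start at words 0, 22, and 44
for i, (start, end) in enumerate(chunk_windows(57), 1):
    print(f"Chunk {i}: words {start}-{end - 1} ({end - start} words)")
```

For the 57-word FAQ this gives the same three chunk sizes as the output above (30, 30, and 13 words). Chunk 1 ends at word 29, partway through the second answer, so with these settings a chunk boundary need not coincide with a question/answer boundary.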


**Document B: Technical Documentation**

In [7]:
Document_B = """ Technical Documentation

Installation Guide

Step 1: Download the installer from our website.
Extract the zip file to your desired location.

Step 2: Run setup.exe as administrator.
Follow the on-screen instructions.

Step 3: Configure your API key in the settings file.
The settings file is located at config/settings.json. """


#### **Strategy_B** = **Sentence-Based Chunking** 
``` Reason_B = Sentence-based chunking splits the text into sentences and then groups them into chunks. It is the most suitable of the available options here because it preserves more context, works better for complex, interconnected information, and leaves fewer chunks to manage. ```

In [9]:
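import re

def chunk_by_sentences(text, max_chunk_size=50):
    """
    Split text into sentences, then group them into chunks.

    Args:
        text: The text to chunk
        max_chunk_size: Target maximum number of characters per chunk
            (a single longer sentence still becomes its own chunk)

    Returns:
        List of text chunks
    """
    # Split after sentence-ending punctuation that is followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text)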
    
    chunks = []
    current_chunk = ""
    
    for sentence in sentences:
        # Check if adding this sentence would exceed max size
        if len(current_chunk) + len(sentence) > max_chunk_size and current_chunk:
            # Save current chunk and start new one
            chunks.append(current_chunk.strip())
            current_chunk = sentence
        else:
            # Add sentence to current chunk
            current_chunk += " " + sentence if current_chunk else sentence
    
    # Don't forget the last chunk
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks

# Test it
chunks = chunk_by_sentences(Document_B, max_chunk_size=400)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)

Number of chunks: 1

Chunk 1 (322 chars):
Technical Documentation

Installation Guide

Step 1: Download the installer from our website. Extract the zip file to your desired location. Step 2: Run setup.exe as administrator. Follow the on-screen instructions. Step 3: Configure your API key in the settings file. The settings file is located at config/settings.json.
--------------------------------------------------------------------------------
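
With `max_chunk_size=400` the whole 322-character document fits in one chunk, so the grouping logic is never exercised. The sketch below is a compact re-implementation of the same greedy grouping (`group_sentences` is a hypothetical name, and unlike the cell above it counts the joining space toward the limit), run on the step text alone to show how a smaller limit changes the result.

```python
import re


def group_sentences(text, max_chunk_size):
    """Greedily pack sentences into chunks of at most max_chunk_size characters."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) > max_chunk_size and current:
            # Adding this sentence would overflow: close the current chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# The installation steps from Document_B, without the two heading lines
steps = (
    "Step 1: Download the installer from our website. "
    "Extract the zip file to your desired location. "
    "Step 2: Run setup.exe as administrator. "
    "Follow the on-screen instructions. "
    "Step 3: Configure your API key in the settings file. "
    "The settings file is located at config/settings.json."
)

for limit in (400, 110):
    chunks = group_sentences(steps, limit)
    print(f"max_chunk_size={limit}: {len(chunks)} chunk(s)")
    for chunk in chunks:
        print(f"  ({len(chunk)} chars) {chunk}")
```

For this particular text a limit of 110 happens to yield one chunk per installation step. That is a coincidence of the sentence lengths rather than a property of the method, so the limit would need tuning for other documents.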


**Document C: Article**

In [10]:
Document_C = """Article

The Future of Renewable Energy

Solar and wind power have seen tremendous growth in recent years. As technology improves
and costs decrease, renewable energy becomes increasingly competitive with fossil fuels.

Energy storage solutions are critical for renewable adoption. Battery technology advances
enable better grid management and reliability. This addresses the intermittent nature of
solar and wind power.

Policy support and public awareness continue to drive the transition. Many countries have
set ambitious renewable energy targets for the coming decades.
"""


#### **Strategy_C** = **Paragraph-Based Chunking** 
``` Reason_C = Paragraph-based chunking works best for structured documents, which is why it was chosen here: the article is organized into clear thematic paragraphs. Keeping each paragraph whole preserves more context per chunk, which should help retrieval. ```


In [11]:
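def chunk_by_paragraphs(text, min_chunk_size=50):
    """
    Split text into chunks along paragraph boundaries.

    Args:
        text: The text to chunk
        min_chunk_size: Paragraphs shorter than this many characters are
            appended to the current chunk instead of starting a new one

    Returns:
        List of text chunks (a leading run of short paragraphs can still
        end up as its own chunk below min_chunk_size)
    """
    # Paragraphs are separated by blank lines
    paragraphs = text.split('\n\n')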
    
    chunks = []
    current_chunk = ""
    
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
            
        if len(para) < min_chunk_size:
            current_chunk += "\n\n" + para if current_chunk else para
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = para
    
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks

# Test it
chunks = chunk_by_paragraphs(Document_C, min_chunk_size=50)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)

Number of chunks: 4

Chunk 1 (39 chars):
Article

The Future of Renewable Energy
--------------------------------------------------------------------------------
Chunk 2 (177 chars):
Solar and wind power have seen tremendous growth in recent years. As technology improves
and costs decrease, renewable energy becomes increasingly competitive with fossil fuels.
--------------------------------------------------------------------------------
Chunk 3 (200 chars):
Energy storage solutions are critical for renewable adoption. Battery technology advances
enable better grid management and reliability. This addresses the intermittent nature of
solar and wind power.
--------------------------------------------------------------------------------
Chunk 4 (152 chars):
Policy support and public awareness continue to drive the transition. Many countries have
set ambitious renewable energy targets for the coming decades.
--------------------------------------------------------------------------------
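
Chunk 1 above is only 39 characters because the two heading lines are flushed as soon as the first full paragraph arrives. If headings should travel with the text they introduce, one possible variant carries short paragraphs forward into the next chunk. The sketch below assumes that behavior is wanted; `merge_short_paragraphs` is a hypothetical name, and the sample is a shortened excerpt of Document C.

```python
def merge_short_paragraphs(text, min_chunk_size=50):
    """Paragraph chunks in which short paragraphs join the text that follows."""
    chunks = []
    pending = ""  # short paragraphs waiting for enough text to form a chunk
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        pending = f"{pending}\n\n{para}" if pending else para
        if len(pending) >= min_chunk_size:
            chunks.append(pending)
            pending = ""
    if pending:
        # Trailing short text: attach it to the last chunk if there is one
        if chunks:
            chunks[-1] = f"{chunks[-1]}\n\n{pending}"
        else:
            chunks.append(pending)
    return chunks


article = """Article

The Future of Renewable Energy

Solar and wind power have seen tremendous growth in recent years.

Energy storage solutions are critical for renewable adoption."""

for i, chunk in enumerate(merge_short_paragraphs(article), 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)
```

Here the title lines become part of the first body chunk instead of standing alone. The trade-off is that a short paragraph following a long one is attached to the next chunk rather than the previous one, which suits headings but may not suit short closing remarks.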
