# Exercise 1: Chunking Different Documents

## Document A: FAQ

Strategy: Word-based chunking

Reason: Character chunking does not respect natural boundaries, so it can cut through words and break up the context. The provided text is too short for either sentence chunking or paragraph chunking, which makes word-based chunking the most appropriate choice for this text. With a suitable overlap, context is still retained across chunks.
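
The boundary problem can be seen in a small sketch. `char_chunk`, the sample sentence, and the sizes used here are hypothetical and chosen only for illustration; they are not part of the exercise code.

```python
def char_chunk(text, chunk_size=20):
    """Hypothetical helper: cut the text every chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

sample = "Items can be returned within 30 days of purchase."

# Character chunking cuts wherever the count runs out, even mid-word
for piece in char_chunk(sample):
    print(repr(piece))

# Word chunking only ever cuts between whole words (4 words per piece here)
words = sample.split()
word_pieces = [' '.join(words[i:i + 4]) for i in range(0, len(words), 4)]
for piece in word_pieces:
    print(repr(piece))
```

With these settings the first character chunk ends in `returne` and the second starts with `d`, while every word-based piece contains only complete words.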


In [9]:
document_A = """
Q: What is the return policy?
A: Items can be returned within 30 days of purchase with original receipt.

Q: Do you offer international shipping?
A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location.

Q: How do I track my order?
A: Use the tracking number sent to your email after shipment."""

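def word_chunk(text, chunk_size=40, overlap=10):
    """Split text into chunks of chunk_size words that share overlap words."""
    # Without this check the loop below would never advance
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    words = text.split()
    chunks = []
    start = 0

    while start < len(words):
        end = start + chunk_size
        chunk_words = words[start:end]

        # Join words back into text
        chunks.append(' '.join(chunk_words))

        # Stop at the end of the text so the last chunk is never
        # just a repeat of the previous overlap
        if end >= len(words):
            break

        # Step forward, keeping `overlap` words from the previous chunk
        start += chunk_size - overlap
    return chunks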

chunked = word_chunk(text=document_A)
print(f"Number of chunks: {len(chunked)}\n")
for i, chunk in enumerate(chunked[:3], 1):
    print(f"Chunk {i} ({len(chunk.split())} words):")
    print(chunk)
    print("-" * 80)


Number of chunks: 2

Chunk 1 (40 words):
Q: What is the return policy? A: Items can be returned within 30 days of purchase with original receipt. Q: Do you offer international shipping? A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location. Q:
--------------------------------------------------------------------------------
Chunk 2 (27 words):
over 50 countries worldwide. Shipping times vary by location. Q: How do I track my order? A: Use the tracking number sent to your email after shipment.
--------------------------------------------------------------------------------
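
The two chunks above share 10 words, which is how the overlap keeps context across the boundary. The sketch below checks this on placeholder data. `word_windows` is a hypothetical helper that uses the same window size and step as `word_chunk`, and the 57 placeholder words are an assumption matching the word count of the FAQ (40 + 27 - 10).

```python
def word_windows(words, chunk_size=40, overlap=10):
    """Hypothetical helper: sliding windows over a list of words."""
    step = chunk_size - overlap
    return [words[s:s + chunk_size] for s in range(0, len(words), step)]

# Placeholder words w0 ... w56 stand in for the FAQ text
words = [f"w{i}" for i in range(57)]
windows = word_windows(words)

print([len(w) for w in windows])

# The tail of the first window is repeated at the head of the second
shared = windows[0][-10:]
print(shared == windows[1][:10])
print(shared[0], "...", shared[-1])
```

Under these assumptions the windows hold 40 and 27 words, and the shared part runs from `w30` to `w39`.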


## Document B: Technical Documentation

In [11]:
document_B = """
Installation Guide

Step 1: Download the installer from our website.
Extract the zip file to your desired location.

Step 2: Run setup.exe as administrator.
Follow the on-screen instructions.

Step 3: Configure your API key in the settings file.
The settings file is located at config/settings.json.
"""

import re

def sentence_chunk(text, max_chunk_size=100):
    # Split on whitespace that follows sentence-ending punctuation
    sentences = re.split(r'(?<=[.!?])\s+', text)
    
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) > max_chunk_size and current_chunk:
            chunks.append(current_chunk.strip())
            current_chunk = sentence
        else:
            current_chunk += " "+sentence if current_chunk else sentence

    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

chunks = sentence_chunk(document_B)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)

Number of chunks: 4

Chunk 1 (68 chars):
Installation Guide

Step 1: Download the installer from our website.
--------------------------------------------------------------------------------
Chunk 2 (86 chars):
Extract the zip file to your desired location. Step 2: Run setup.exe as administrator.
--------------------------------------------------------------------------------
Chunk 3 (87 chars):
Follow the on-screen instructions. Step 3: Configure your API key in the settings file.
--------------------------------------------------------------------------------
Chunk 4 (53 chars):
The settings file is located at config/settings.json.
--------------------------------------------------------------------------------
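
The output above depends on how the split pattern treats periods. The following sketch applies the same regular expression to two short strings; the sample strings are assumptions made up for illustration.

```python
import re

# Same pattern as in sentence_chunk: whitespace preceded by . ! or ?
pattern = r'(?<=[.!?])\s+'

text = ("Run setup.exe as administrator. Follow the on-screen instructions.\n"
        "See config/settings.json.")
parts = re.split(pattern, text)
for part in parts:
    print(repr(part))

# A heading without end punctuation is not separated from the next sentence
heading_text = "Installation Guide\n\nStep 1: Download the installer."
heading_parts = re.split(pattern, heading_text)
print(len(heading_parts))
```

The periods inside `setup.exe` and `settings.json` do not trigger a split because no whitespace follows them. The heading stays attached to the first step for the same reason that "Installation Guide" and "Step 1" share Chunk 1 above.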


## Document C: Article

In [1]:
document_C = """
The Future of Renewable Energy

Solar and wind power have seen tremendous growth in recent years. As technology improves
and costs decrease, renewable energy becomes increasingly competitive with fossil fuels.

Energy storage solutions are critical for renewable adoption. Battery technology advances
enable better grid management and reliability. This addresses the intermittent nature of
solar and wind power.

Policy support and public awareness continue to drive the transition. Many countries have
set ambitious renewable energy targets for the coming decades.
"""

def paragraph_chunk(text, min_chunk_size=100):
    paragraphs = text.split('\n\n')
    
    chunks = []
    current_chunk = ""
    
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
            
        # Add the paragraph to the chunk being built
        current_chunk += "\n\n" + para if current_chunk else para

        # Emit the chunk once it reaches the minimum size, so a short
        # paragraph (such as the title) is combined with the next one
        if len(current_chunk) >= min_chunk_size:
            chunks.append(current_chunk)
            current_chunk = ""

    # Don't forget any short text left over at the end
    if current_chunk:
        chunks.append(current_chunk)

    return chunks

# Test it
chunks = paragraph_chunk(document_C)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)

Number of chunks: 3

Chunk 1 (209 chars):
The Future of Renewable Energy

Solar and wind power have seen tremendous growth in recent years. As technology improves
and costs decrease, renewable energy becomes increasingly competitive with fossil fuels.
--------------------------------------------------------------------------------
Chunk 2 (200 chars):
Energy storage solutions are critical for renewable adoption. Battery technology advances
enable better grid management and reliability. This addresses the intermittent nature of
solar and wind power.
--------------------------------------------------------------------------------
Chunk 3 (152 chars):
Policy support and public awareness continue to drive the transition. Many countries have
set ambitious renewable energy targets for the coming decades.
--------------------------------------------------------------------------------
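
`paragraph_chunk` assumes that paragraphs are separated by exactly `'\n\n'`. If a "blank" line contains spaces or tabs, that split misses the boundary. The sketch below shows one possible way to tolerate this; `split_paragraphs` and the sample text are hypothetical and not part of the exercise.

```python
import re

def split_paragraphs(text):
    """Hypothetical helper: split on blank lines, even if they contain spaces."""
    parts = re.split(r'\n\s*\n', text)
    return [p.strip() for p in parts if p.strip()]

# The first "blank" line here contains a single space
text = "Title\n \nFirst paragraph.\n\nSecond paragraph."

naive = text.split('\n\n')
print(len(naive))

for para in split_paragraphs(text):
    print(repr(para))
```

On this sample the plain `split('\n\n')` finds two pieces, with the title still attached to the first paragraph, while the regex version separates all three.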
