### **FAQ CHUNKING**

**Document A: FAQ**
```
Q: What is the return policy?
A: Items can be returned within 30 days of purchase with original receipt.

Q: Do you offer international shipping?
A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location.

Q: How do I track my order?
A: Use the tracking number sent to your email after shipment.
```



**Strategy:** Paragraph chunking

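**Why?** This is a Q&A document in which each question and its answer form one paragraph, separated by a blank line. Chunking by paragraph keeps every answer together with the question that gives it context.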

In [1]:
sample_document = """
Q: What is the return policy?
A: Items can be returned within 30 days of purchase with original receipt.

Q: Do you offer international shipping?
A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location.

Q: How do I track my order?
A: Use the tracking number sent to your email after shipment.
"""

def chunk_by_paragraphs(text, min_chunk_size=100):
    # Split by double newlines (paragraph separator)
    paragraphs = text.split('\n\n')
    
    chunks = []
    current_chunk = ""
    
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
            
        # A short paragraph is appended to the chunk being built
        # (it merges with the preceding text, or starts a chunk if none is open)
        if len(para) < min_chunk_size:
            current_chunk += ("\n\n" + para) if current_chunk else para
        else:
            # Save previous chunk if exists
            if current_chunk:
                chunks.append(current_chunk.strip())
            # Start new chunk with this paragraph
            current_chunk = para
    
    # Don't forget the last chunk
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks

# Test it
chunks = chunk_by_paragraphs(sample_document, min_chunk_size=100)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)


Number of chunks: 2

Chunk 1 (104 chars):
Q: What is the return policy?
A: Items can be returned within 30 days of purchase with original receipt.
--------------------------------------------------------------------------------
Chunk 2 (211 chars):
Q: Do you offer international shipping?
A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location.

Q: How do I track my order?
A: Use the tracking number sent to your email after shipment.
--------------------------------------------------------------------------------
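
With `min_chunk_size=100`, the last two Q&A pairs above land in a single chunk because the third pair is only 89 characters long. If one chunk per question is preferred, a minimal sketch could split on the `Q:` marker instead. It assumes every pair starts with a line beginning `Q:`, as in this sample; `chunk_by_qa_pairs` is a hypothetical helper name, and splitting on a zero-width match needs Python 3.7 or newer.

```python
import re

faq_text = """
Q: What is the return policy?
A: Items can be returned within 30 days of purchase with original receipt.

Q: Do you offer international shipping?
A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location.

Q: How do I track my order?
A: Use the tracking number sent to your email after shipment.
"""

def chunk_by_qa_pairs(text):
    # Hypothetical helper: split just before every line that starts with "Q:"
    parts = re.split(r'^(?=Q:)', text, flags=re.MULTILINE)
    # Drop empty fragments and surrounding whitespace
    return [part.strip() for part in parts if part.strip()]

qa_chunks = chunk_by_qa_pairs(faq_text)

print(f"Number of chunks: {len(qa_chunks)}\n")
for i, chunk in enumerate(qa_chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)
```

Splitting on the marker rather than on blank lines also keeps a pair together if an answer ever spans several paragraphs.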


**Document B: Technical Documentation**
```
Installation Guide

Step 1: Download the installer from our website.
Extract the zip file to your desired location.

Step 2: Run setup.exe as administrator.
Follow the on-screen instructions.

Step 3: Configure your API key in the settings file.
The settings file is located at config/settings.json.
```

**Strategy:** Sentence chunking

**Why?** Sentence chunking suits short documents like this one, where each sentence is a self-contained instruction. Splitting on sentence boundaries keeps every instruction intact.

In [3]:
sample_document = """
Installation Guide

Step 1: Download the installer from our website.
Extract the zip file to your desired location.

Step 2: Run setup.exe as administrator.
Follow the on-screen instructions.

Step 3: Configure your API key in the settings file.
The settings file is located at config/settings.json.
"""

import re

def chunk_by_sentences(text, chunk_size=200, overlap=50):
    # Simple sentence splitting: break after . ! or ? when followed by whitespace
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

    chunks = []
    current = []  # sentences collected for the chunk being built

    for sentence in sentences:
        # Close the current chunk if adding this sentence would exceed chunk_size
        if current and len(" ".join(current + [sentence])) > chunk_size:
            chunks.append(" ".join(current))

            # Carry over whole trailing sentences (at most `overlap` chars) as context
            carried, carried_len = [], 0
            for prev in reversed(current):
                if carried_len + len(prev) > overlap:
                    break
                carried.insert(0, prev)
                carried_len += len(prev)
            current = carried

        current.append(sentence)

    # Don't forget the last chunk
    if current:
        chunks.append(" ".join(current))

    return chunks

# Test it
chunks = chunk_by_sentences(sample_document, chunk_size=200, overlap=50)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks[:3], 1):  # Show first 3 chunks
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)

Number of chunks: 2

Chunk 1 (190 chars):
Installation Guide

Step 1: Download the installer from our website. Extract the zip file to your desired location. Step 2: Run setup.exe as administrator. Follow the on-screen instructions.
--------------------------------------------------------------------------------
Chunk 2 (141 chars):
Follow the on-screen instructions. Step 3: Configure your API key in the settings file. The settings file is located at config/settings.json.
--------------------------------------------------------------------------------
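
The sentence-splitting pattern in the cell above, `(?<=[.!?])\s+`, only breaks where terminal punctuation is followed by whitespace. The following quick check on this guide is a sketch with illustrative variable names. It lists the sentences the pattern yields: `setup.exe` and `config/settings.json` stay intact, while the unpunctuated title stays attached to the first step.

```python
import re

guide = """
Installation Guide

Step 1: Download the installer from our website.
Extract the zip file to your desired location.

Step 2: Run setup.exe as administrator.
Follow the on-screen instructions.

Step 3: Configure your API key in the settings file.
The settings file is located at config/settings.json.
"""

# Split after ., ! or ? only when whitespace follows
sentences = re.split(r'(?<=[.!?])\s+', guide.strip())

for n, sentence in enumerate(sentences, 1):
    # repr() makes embedded newlines visible
    print(f"{n}: {sentence!r}")
```

This simple pattern will still split after abbreviations such as "e.g." when a space follows, so longer prose may call for a dedicated sentence tokenizer.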


**Document C: Article**
```
The Future of Renewable Energy

Solar and wind power have seen tremendous growth in recent years. As technology improves
and costs decrease, renewable energy becomes increasingly competitive with fossil fuels.

Energy storage solutions are critical for renewable adoption. Battery technology advances
enable better grid management and reliability. This addresses the intermittent nature of
solar and wind power.

Policy support and public awareness continue to drive the transition. Many countries have
set ambitious renewable energy targets for the coming decades.
```

**Strategy:** Paragraph chunking

**Why?** This document is longer than the other two and is already organized into paragraphs, each covering one idea. Chunking by paragraph keeps the full context of each idea together.

In [15]:
sample_document = """
The Future of Renewable Energy

Solar and wind power have seen tremendous growth in recent years. As technology improves
and costs decrease, renewable energy becomes increasingly competitive with fossil fuels.

Energy storage solutions are critical for renewable adoption. Battery technology advances
enable better grid management and reliability. This addresses the intermittent nature of
solar and wind power.

Policy support and public awareness continue to drive the transition. Many countries have
set ambitious renewable energy targets for the coming decades.
"""

def chunk_by_paragraphs(text, min_chunk_size=100):
    # Split by double newlines (paragraph separator)
    paragraphs = text.split('\n\n')
    
    chunks = []
    current_chunk = ""
    
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
            
        # A short paragraph is appended to the chunk being built
        # (it merges with the preceding text, or starts a chunk if none is open)
        if len(para) < min_chunk_size:
            current_chunk += ("\n\n" + para) if current_chunk else para
        else:
            # Save previous chunk if exists
            if current_chunk:
                chunks.append(current_chunk.strip())
            # Start new chunk with this paragraph
            current_chunk = para
    
    # Don't forget the last chunk
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks

# Test it
chunks = chunk_by_paragraphs(sample_document, min_chunk_size=100)

print(f"Number of chunks: {len(chunks)}\n")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)


Number of chunks: 4

Chunk 1 (30 chars):
The Future of Renewable Energy
--------------------------------------------------------------------------------
Chunk 2 (177 chars):
Solar and wind power have seen tremendous growth in recent years. As technology improves
and costs decrease, renewable energy becomes increasingly competitive with fossil fuels.
--------------------------------------------------------------------------------
Chunk 3 (200 chars):
Energy storage solutions are critical for renewable adoption. Battery technology advances
enable better grid management and reliability. This addresses the intermittent nature of
solar and wind power.
--------------------------------------------------------------------------------
Chunk 4 (152 chars):
Policy support and public awareness continue to drive the transition. Many countries have
set ambitious renewable energy targets for the coming decades.
--------------------------------------------------------------------------------
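
In the output above, the 30-character title becomes a chunk of its own, which is unlikely to be useful for retrieval by itself. One possible variant holds short paragraphs back and prepends them to the next paragraph. This is a sketch that assumes a short paragraph such as a heading belongs with the text that follows it; `chunk_merge_forward` is a hypothetical name.

```python
def chunk_merge_forward(text, min_chunk_size=100):
    # Hypothetical variant: short paragraphs are merged into the NEXT paragraph
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks = []
    pending = ""  # short text waiting to be attached to what follows

    for para in paragraphs:
        combined = f"{pending}\n\n{para}" if pending else para
        if len(combined) < min_chunk_size:
            pending = combined   # still too short: keep accumulating
        else:
            chunks.append(combined)
            pending = ""

    # Trailing short text has no successor: attach it to the last chunk
    if pending:
        if chunks:
            chunks[-1] += "\n\n" + pending
        else:
            chunks.append(pending)

    return chunks

article = """
The Future of Renewable Energy

Solar and wind power have seen tremendous growth in recent years. As technology improves
and costs decrease, renewable energy becomes increasingly competitive with fossil fuels.

Energy storage solutions are critical for renewable adoption. Battery technology advances
enable better grid management and reliability. This addresses the intermittent nature of
solar and wind power.

Policy support and public awareness continue to drive the transition. Many countries have
set ambitious renewable energy targets for the coming decades.
"""

merged_chunks = chunk_merge_forward(article, min_chunk_size=100)

print(f"Number of chunks: {len(merged_chunks)}\n")
for i, chunk in enumerate(merged_chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)
```

With this variant the title travels with the first paragraph, so the article yields three chunks instead of four.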
