# 🎯 Practice Exercises

## Exercise 1: Chunking Different Document Types

### Task
Apply appropriate chunking strategies to different document types.

### Documents

You'll chunk 3 different types of documents:

**Document A: FAQ**
```
Q: What is the return policy?
A: Items can be returned within 30 days of purchase with original receipt.

Q: Do you offer international shipping?
A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location.

Q: How do I track my order?
A: Use the tracking number sent to your email after shipment.
```

**Document B: Technical Documentation**
```
Installation Guide

Step 1: Download the installer from our website.
Extract the zip file to your desired location.

Step 2: Run setup.exe as administrator.
Follow the on-screen instructions.

Step 3: Configure your API key in the settings file.
The settings file is located at config/settings.json.
```

**Document C: Article**
```
The Future of Renewable Energy

Solar and wind power have seen tremendous growth in recent years. As technology improves
and costs decrease, renewable energy becomes increasingly competitive with fossil fuels.

Energy storage solutions are critical for renewable adoption. Battery technology advances
enable better grid management and reliability. This addresses the intermittent nature of
solar and wind power.

Policy support and public awareness continue to drive the transition. Many countries have
set ambitious renewable energy targets for the coming decades.
```

### Instructions

For each document:
1. Choose the most appropriate chunking strategy (character, word, sentence, paragraph, or custom)
2. Explain why you chose that strategy
3. Implement the chunking
4. Show the resulting chunks
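
As a starting point for step 1, the named strategies can be sketched roughly as follows. This is a minimal sketch, not a reference implementation: the function names, default sizes, and the naive regex-based sentence split are all assumptions made for illustration.

```python
import re

def chunk_by_characters(text, size=100):
    """Fixed-width slices; simple, but may cut words in half."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_by_words(text, words_per_chunk=20):
    """Groups of whole words; ignores sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def chunk_by_sentences(text, sentences_per_chunk=2):
    """Groups of sentences, using a naive split after '.', '!' or '?'."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]

def chunk_by_paragraphs(text):
    """One chunk per paragraph, where paragraphs are separated by blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```

The naive sentence split will mis-handle abbreviations such as "e.g." and file names such as `setup.exe` followed by a space, which is worth keeping in mind when choosing a strategy for Document B.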

### Template

```python
# Document A: FAQ
strategy_A = "?"  # Your choice
reason_A = "?"    # Why this strategy?
chunks_A = ?      # Your implementation

# Document B: Technical Documentation
strategy_B = "?"
reason_B = "?"
chunks_B = ?

# Document C: Article
strategy_C = "?"
reason_C = "?"
chunks_C = ?
```


#### Document A: FAQ

- Strategy: Paragraph-Based Chunking
- Reason_A: The FAQ is well structured: each question-and-answer pair forms its own paragraph, separated from the next by a blank line. Paragraph-based chunking therefore keeps every question together with its answer.
- Chunk_A:

```python
document_A = """
Q: What is the return policy?
A: Items can be returned within 30 days of purchase with original receipt.

Q: Do you offer international shipping?
A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location.

Q: How do I track my order?
A: Use the tracking number sent to your email after shipment.
"""
```

```python
def chunk_by_paragraph(text, min_chunk_size=50):
    """
    Split text by paragraphs (double newlines).

    Args:
        text: The text to chunk
        min_chunk_size: Minimum characters per paragraph; shorter
            paragraphs are merged into the chunk currently being built

    Returns:
        List of text chunks
    """
    # Split by double newlines (paragraph separator)
    paragraphs = text.split("\n\n")

    chunks = []
    current_chunk = ""

    for para in paragraphs:
        para = para.strip()
        if not para:
            continue

        # A short paragraph is appended to the chunk being built
        if len(para) < min_chunk_size:
            current_chunk += "\n\n" + para if current_chunk else para
        else:
            # Save the previous chunk if one exists
            if current_chunk:
                chunks.append(current_chunk.strip())
            # Start a new chunk with this paragraph
            current_chunk = para

    # Save the last chunk
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

# Test it
chunks_A = chunk_by_paragraph(document_A, min_chunk_size=50)

print(f"Number of chunks: {len(chunks_A)}\n")
for i, chunk in enumerate(chunks_A, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("=" * 80)
```

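```
Number of chunks: 3

Chunk 1 (104 chars):
Q: What is the return policy?
A: Items can be returned within 30 days of purchase with original receipt.
================================================================================
Chunk 2 (120 chars):
Q: Do you offer international shipping?
A: Yes, we ship to over 50 countries worldwide. Shipping times vary by location.
================================================================================
Chunk 3 (89 chars):
Q: How do I track my order?
A: Use the tracking number sent to your email after shipment.
================================================================================
```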
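
Paragraph chunking works here only because a blank line separates each pair. If an FAQ arrives without blank lines, a "custom" strategy that splits on the `Q:` prefix may be more robust. The sketch below is one possible approach; the function name `chunk_faq` is hypothetical, and it assumes every question starts a line with the literal prefix `Q:`.

```python
import re

def chunk_faq(text):
    """Return one chunk per Q/A pair, splitting wherever a line starts with 'Q:'."""
    # Zero-width split: cut just before each line that begins with "Q:".
    # Splitting on empty matches requires Python 3.7 or later.
    parts = re.split(r"^(?=Q:)", text, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]

# Same FAQ content, but with no blank lines between the pairs
faq_without_blank_lines = (
    "Q: What is the return policy?\n"
    "A: Items can be returned within 30 days of purchase with original receipt.\n"
    "Q: How do I track my order?\n"
    "A: Use the tracking number sent to your email after shipment.\n"
)

for pair in chunk_faq(faq_without_blank_lines):
    print(pair)
    print("-" * 40)
```

Because the split keys on the question marker rather than on whitespace, multi-line answers stay attached to their question.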


#### Document B: Technical Documentation

- Strategy: Paragraph-Based Chunking
- Reason_B: The installation guide is organized into numbered steps, each forming its own paragraph separated by a blank line. Paragraph-based chunking keeps each step's instructions together as one coherent unit.
- Chunk_B:


```python
Document_B_tech_doc = """

Installation Guide

Step 1: Download the installer from our website.
Extract the zip file to your desired location.

Step 2: Run setup.exe as administrator.
Follow the on-screen instructions.

Step 3: Configure your API key in the settings file.
The settings file is located at config/settings.json.

"""
```

```python
def chunk_by_paragraphB(text, min_chunk_size=25):
    """
    Split text by paragraphs (double newlines).

    Args:
        text: The text to be chunked
        min_chunk_size: Minimum characters per paragraph; shorter
            paragraphs are merged into the chunk currently being built

    Returns:
        List of text chunks
    """
    # Split the text by double newlines (paragraph separator)
    paragraphs = text.split("\n\n")

    chunks = []
    current_chunk = ""

    for para in paragraphs:
        para = para.strip()
        if not para:
            continue

        # A short paragraph is appended to the chunk being built
        if len(para) < min_chunk_size:
            current_chunk += "\n\n" + para if current_chunk else para
        else:
            # Save the previous chunk if one exists
            if current_chunk:
                chunks.append(current_chunk.strip())
            # Start a new chunk with this paragraph
            current_chunk = para

    # Save the last chunk
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

# Test the chunking
chunks_B = chunk_by_paragraphB(Document_B_tech_doc, min_chunk_size=25)

print(f"Number of Chunks: {len(chunks_B)}\n")
for i, chunk in enumerate(chunks_B, 1):
    print(f"Chunk {i} ({len(chunk)} char)")
    print(chunk)
    print("=" * 80)
```

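```
Number of Chunks: 4

Chunk 1 (18 char)
Installation Guide
================================================================================
Chunk 2 (95 char)
Step 1: Download the installer from our website.
Extract the zip file to your desired location.
================================================================================
Chunk 3 (74 char)
Step 2: Run setup.exe as administrator.
Follow the on-screen instructions.
================================================================================
Chunk 4 (106 char)
Step 3: Configure your API key in the settings file.
The settings file is located at config/settings.json.
================================================================================
```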
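
In the output above, the 18-character title still ends up as its own chunk: a short paragraph is held back, but the next long paragraph flushes it instead of absorbing it. If the title should travel with the first step, one option is to prepend held-back short paragraphs to the next long one. The sketch below illustrates that variant; the name `chunk_with_heading_merge` is hypothetical, and attaching trailing short paragraphs to the last chunk is an assumed design choice.

```python
def chunk_with_heading_merge(text, min_chunk_size=25):
    """Paragraph chunking where short paragraphs are prepended to the next long one."""
    chunks = []
    pending = ""  # short paragraphs waiting for a longer paragraph to attach to

    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if len(para) < min_chunk_size:
            # Hold the short paragraph back instead of emitting it
            pending = f"{pending}\n\n{para}" if pending else para
        else:
            # Emit the long paragraph, prefixed by anything held back
            chunks.append(f"{pending}\n\n{para}" if pending else para)
            pending = ""

    # Leftover short paragraphs at the end: attach to the last chunk if there is one
    if pending:
        if chunks:
            chunks[-1] += "\n\n" + pending
        else:
            chunks.append(pending)
    return chunks

guide = (
    "Installation Guide\n\n"
    "Step 1: Download the installer from our website.\n"
    "Extract the zip file to your desired location.\n\n"
    "Step 2: Run setup.exe as administrator.\n"
    "Follow the on-screen instructions."
)

for chunk in chunk_with_heading_merge(guide):
    print(chunk)
    print("=" * 40)
```

With this variant the guide yields one chunk per step, and the first chunk carries the "Installation Guide" title as context.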


#### Document C: Article

- Strategy: Paragraph-Based Chunking
- Reason_C: The article is written in self-contained paragraphs, each developing one idea (growth, storage, policy) and separated from the next by a blank line. Paragraph-based chunking preserves those topical units.
- Chunk_C:


```python
document_C_article = """

The Future of Renewable Energy

Solar and wind power have seen tremendous growth in recent years. As technology improves
and costs decrease, renewable energy becomes increasingly competitive with fossil fuels.

Energy storage solutions are critical for renewable adoption. Battery technology advances
enable better grid management and reliability. This addresses the intermittent nature of
solar and wind power.

Policy support and public awareness continue to drive the transition. Many countries have
set ambitious renewable energy targets for the coming decades.

"""
```

```python
def chunk_by_paragraphC(text, min_chunk_size=100):
    """
    Split text by paragraphs (double newlines).

    Args:
        text: The text to be chunked
        min_chunk_size: Minimum characters per paragraph; shorter
            paragraphs are merged into the chunk currently being built

    Returns:
        List of text chunks
    """
    # Split by double newlines (paragraph separator)
    paragraphs = text.split("\n\n")

    chunks = []
    current_chunk = ""

    for para in paragraphs:
        para = para.strip()
        if not para:
            continue

        # A short paragraph is appended to the chunk being built
        if len(para) < min_chunk_size:
            current_chunk += "\n\n" + para if current_chunk else para
        else:
            # Save the previous chunk if one exists
            if current_chunk:
                chunks.append(current_chunk.strip())
            # Start a new chunk with this paragraph
            current_chunk = para

    # Save the last chunk
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

# Test the chunking (overrides the default of 100 with 50)
chunks_C = chunk_by_paragraphC(document_C_article, min_chunk_size=50)

# Print the output
print(f"Number of chunks: {len(chunks_C)}\n")
for i, chunk in enumerate(chunks_C, 1):
    print(f"Chunk {i} ({len(chunk)} chars):")
    print(chunk)
    print("-" * 80)
```

```
Number of chunks: 4

Chunk 1 (30 chars):
The Future of Renewable Energy
--------------------------------------------------------------------------------
Chunk 2 (177 chars):
Solar and wind power have seen tremendous growth in recent years. As technology improves
and costs decrease, renewable energy becomes increasingly competitive with fossil fuels.
--------------------------------------------------------------------------------
Chunk 3 (200 chars):
Energy storage solutions are critical for renewable adoption. Battery technology advances
enable better grid management and reliability. This addresses the intermittent nature of
solar and wind power.
--------------------------------------------------------------------------------
Chunk 4 (152 chars):
Policy support and public awareness continue to drive the transition. Many countries have
set ambitious renewable energy targets for the coming decades.
--------------------------------------------------------------------------------
```
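
The three functions above share the same logic and differ only in their default `min_chunk_size`, so a single helper called with a per-document threshold would likely be enough. The sketch below shows that consolidation on shortened excerpts of Documents A and B; the `report` helper and the sample variable names are assumptions introduced for illustration.

```python
def chunk_by_paragraph(text, min_chunk_size=50):
    """Same logic as the three functions above, written once."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        if len(para) < min_chunk_size:
            # Short paragraph: append to the chunk being built
            current = f"{current}\n\n{para}" if current else para
        else:
            # Long paragraph: flush the chunk being built, then start a new one
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

def report(name, text, min_chunk_size):
    """Chunk one document and print a short summary (hypothetical helper)."""
    chunks = chunk_by_paragraph(text, min_chunk_size)
    print(f"{name}: {len(chunks)} chunks")
    for i, chunk in enumerate(chunks, 1):
        print(f"  Chunk {i} ({len(chunk)} chars)")
    return chunks

faq_sample = (
    "Q: What is the return policy?\n"
    "A: Items can be returned within 30 days of purchase with original receipt.\n\n"
    "Q: How do I track my order?\n"
    "A: Use the tracking number sent to your email after shipment."
)
guide_sample = (
    "Installation Guide\n\n"
    "Step 1: Download the installer from our website.\n"
    "Extract the zip file to your desired location."
)

chunks_A = report("Document A", faq_sample, min_chunk_size=50)
chunks_B = report("Document B", guide_sample, min_chunk_size=25)
```

Keeping one function also means a fix to the merging behavior only has to be made in one place.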
