# Chunking Component
This notebook implements and compares three chunking strategies for the CuisineRAG system
using the `langchain-text-splitters` library:
1. CharacterTextSplitter (Fixed-size chunking)
2. RecursiveCharacterTextSplitter (Smarter fixed-size chunking)
3. MarkdownHeaderTextSplitter (Section-based chunking)

In [2]:
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter
)


In [3]:

import wikipediaapi

# Initialize the Wikipedia API
wiki = wikipediaapi.Wikipedia(
    language='en',
    user_agent='CuisineRAG/1.0 (your@email.com)'  # replace with your email
)


## Scrape Real Data from Wikipedia
Fetch the Chicken tikka masala page through the Wikipedia API and rebuild it as text with markdown headers.

In [4]:
def scrape_wikipedia_page(page_title: str) -> str:
    """
    Scrape a Wikipedia page and return formatted text with markdown headers.

    Args:
        page_title: The title of the Wikipedia page
    Returns:
        Formatted text with markdown headers
    """
    page = wiki.page(page_title)

    if not page.exists():
        print(f"Page '{page_title}' does not exist!")
        return ""

    # Build formatted text with markdown headers
    text = f"# {page.title}\n\n"
    text += page.summary + "\n\n"

    # Add each section with its header
    for section in page.sections:
        text += f"## {section.title}\n"
        text += section.text + "\n\n"

        # Handle subsections
        for subsection in section.sections:
            text += f"### {subsection.title}\n"
            text += subsection.text + "\n\n"

    return text


# Scrape the page
raw_text = scrape_wikipedia_page("Chicken_tikka_masala")

print(f"Total text length: {len(raw_text)} characters")
print("\n--- Preview (first 500 chars) ---")
print(raw_text[:500])

Total text length: 10382 characters

--- Preview (first 500 chars) ---
# Chicken tikka masala

Chicken tikka masala is a curry consisting of roasted marinated chicken pieces (chicken tikka) in a creamy spiced sauce (masala). It is widely reported to have been created in Glasgow by Ali Ahmed Aslam, a Pakistani-born chef. It is offered at restaurants around the world and is similar to butter chicken. 
It is one of the most popular dishes in Britain, and in 2001 was described by the British Foreign Secretary Robin Cook as "a true British national dish". The dish has b
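
The `scrape_wikipedia_page` function above walks two levels of sections, so anything nested deeper would be left out. A recursive variant might look like the sketch below. It assumes only that each section object exposes `title`, `text`, and `sections`, as used above; `FakeSection` and `sections_to_markdown` are hypothetical names introduced for illustration, not part of `wikipediaapi`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeSection:
    """Hypothetical stand-in exposing the attributes the scraper reads."""
    title: str
    text: str
    sections: List["FakeSection"] = field(default_factory=list)


def sections_to_markdown(sections, level: int = 2) -> str:
    """Render sections and all nested subsections as markdown text."""
    parts = []
    for section in sections:
        parts.append(f"{'#' * level} {section.title}\n{section.text}\n\n")
        # Recurse so that each deeper level gets one more '#'
        parts.append(sections_to_markdown(section.sections, level + 1))
    return "".join(parts)


demo = [
    FakeSection("History", "Origins are disputed.", [
        FakeSection("Glasgow claim", "One account.", [
            FakeSection("Details", "A deeper level."),
        ]),
    ]),
]
demo_markdown = sections_to_markdown(demo)
print(demo_markdown)
```

If deeper levels are emitted this way, the header list passed to `MarkdownHeaderTextSplitter` in section 3 would presumably need a matching `####` entry for those levels to become separate chunks.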


## 1. CharacterTextSplitter (Fixed-Size Chunking)
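Splits the text on a single separator (here `\n`) and merges the pieces into chunks of up to `chunk_size` characters, with up to `chunk_overlap` characters of overlap.
Simple and works on any text format, but a piece with no separator inside it is kept whole even when it exceeds `chunk_size`, which is what the warnings in the output below report.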
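
The size warnings in the output of the next cell follow from splitting on one separator only. The sketch below is a simplified illustration of that split-then-merge idea, not the library's implementation: `split_and_merge` is a made-up name, and overlap is left out to keep it short.

```python
from typing import List


def split_and_merge(text: str, separator: str, chunk_size: int) -> List[str]:
    """Split on one separator, then greedily merge pieces up to chunk_size."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than chunk_size is kept whole
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "short line\n" + "x" * 40 + "\nanother short line"
sample_chunks = split_and_merge(sample, separator="\n", chunk_size=30)
for chunk in sample_chunks:
    print(len(chunk), repr(chunk))
```

The middle line has no `\n` inside it, so it comes out as one 40-character chunk even though the limit is 30.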

In [5]:
char_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=300,
    chunk_overlap=50,
    length_function=len,
)

char_chunks = char_splitter.create_documents([raw_text])

print(f"Number of chunks: {len(char_chunks)}\n")
for i, chunk in enumerate(char_chunks):
    print(f"[Chunk {i}] {len(chunk.page_content)} chars")
    print(chunk.page_content)
    print("---")

Created a chunk of size 307, which is longer than the specified 300
Created a chunk of size 534, which is longer than the specified 300
Created a chunk of size 526, which is longer than the specified 300
Created a chunk of size 360, which is longer than the specified 300
Created a chunk of size 409, which is longer than the specified 300
Created a chunk of size 301, which is longer than the specified 300
Created a chunk of size 301, which is longer than the specified 300
Created a chunk of size 363, which is longer than the specified 300
Created a chunk of size 484, which is longer than the specified 300
Created a chunk of size 740, which is longer than the specified 300
Created a chunk of size 424, which is longer than the specified 300
Created a chunk of size 334, which is longer than the specified 300
Created a chunk of size 463, which is longer than the specified 300
Created a chunk of size 400, which is longer than the specified 300
Created a chunk of size 641, which is longer tha

Number of chunks: 29

[Chunk 0] 22 chars
# Chicken tikka masala
---
[Chunk 1] 306 chars
Chicken tikka masala is a curry consisting of roasted marinated chicken pieces (chicken tikka) in a creamy spiced sauce (masala). It is widely reported to have been created in Glasgow by Ali Ahmed Aslam, a Pakistani-born chef. It is offered at restaurants around the world and is similar to butter chicken.
---
[Chunk 2] 534 chars
It is one of the most popular dishes in Britain, and in 2001 was described by the British Foreign Secretary Robin Cook as "a true British national dish". The dish has been called inauthentic both by white Britons and by South Asians. Scholars and critics have debated the status of the dish, concluding variously that it has undergone an elaborate process of cultural interchange, and serves as symbol of Britain's multicultural society. Lizzie Collingham states that it feels quintessentially South Asian despite its British origins.
---
[Chunk 3] 14 chars
## Composition
---
[Chu

## 2. RecursiveCharacterTextSplitter (Smarter Fixed-Size Chunking)
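Tries to split on natural boundaries first (paragraphs → lines → sentences → words)
before falling back to cutting at arbitrary characters. Unlike plain
CharacterTextSplitter, it keeps splitting an oversized piece with the next, finer
separator, so in this run every chunk stays within the 300-character limit.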
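
The fallback through progressively finer separators can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's code: `recursive_split` is a hypothetical name, overlap is omitted, and the separator is simply dropped at each cut.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the coarsest separator, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the next, finer separator
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


demo_text = (
    "First paragraph is short.\n\n"
    "Second paragraph is quite a bit longer than the limit allows."
)
demo_chunks = recursive_split(demo_text, ["\n\n", "\n", ".", " ", ""], chunk_size=30)
for chunk in demo_chunks:
    print(len(chunk), repr(chunk))
```

The short first paragraph survives intact, while the long second paragraph falls through to the space separator and is cut between words.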

In [6]:
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", ".", " ", ""]
)

recursive_chunks = recursive_splitter.create_documents([raw_text])

print(f"Number of chunks: {len(recursive_chunks)}\n")
for i, chunk in enumerate(recursive_chunks):
    print(f"[Chunk {i}] {len(chunk.page_content)} chars")
    print(chunk.page_content)
    print("---")

Number of chunks: 63

[Chunk 0] 22 chars
# Chicken tikka masala
---
[Chunk 1] 225 chars
Chicken tikka masala is a curry consisting of roasted marinated chicken pieces (chicken tikka) in a creamy spiced sauce (masala). It is widely reported to have been created in Glasgow by Ali Ahmed Aslam, a Pakistani-born chef
---
[Chunk 2] 81 chars
. It is offered at restaurants around the world and is similar to butter chicken.
---
[Chunk 3] 232 chars
It is one of the most popular dishes in Britain, and in 2001 was described by the British Foreign Secretary Robin Cook as "a true British national dish". The dish has been called inauthentic both by white Britons and by South Asians
---
[Chunk 4] 204 chars
. Scholars and critics have debated the status of the dish, concluding variously that it has undergone an elaborate process of cultural interchange, and serves as symbol of Britain's multicultural society
---
[Chunk 5] 98 chars
. Lizzie Collingham states that it feels quintessentially South Asian de

## 3. MarkdownHeaderTextSplitter (Section-Based Chunking)
Splits on the markdown headers added during scraping.
This is the best fit for Wikipedia/Wikibooks pages: each chunk maps to one section,
and its headers are stored as metadata for better RAG retrieval.
Chunk size is not capped, so long sections produce long chunks.
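
To make the idea concrete, here is a minimal plain-Python sketch of header-based splitting with metadata. It is an illustration only, not how `MarkdownHeaderTextSplitter` is implemented: `split_by_headers` and the dictionary layout are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def split_by_headers(text: str, headers: List[Tuple[str, str]]) -> List[Dict]:
    """Group lines under their nearest headers and record the header path."""
    chunks, metadata, body = [], {}, []

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append({"metadata": dict(metadata), "content": content})
        body.clear()

    for line in text.splitlines():
        # The trailing space keeps '#' from matching '##' or '###'
        match = next(((m, n) for m, n in headers if line.startswith(m + " ")), None)
        if match is None:
            body.append(line)
            continue
        flush()
        marker, name = match
        # Drop entries at the same or a deeper level before setting the new one
        for m, n in headers:
            if len(m) >= len(marker):
                metadata.pop(n, None)
        metadata[name] = line[len(marker):].strip()
    flush()
    return chunks


demo_doc = (
    "# Dish\nIntro text.\n\n"
    "## Composition\nChicken and sauce.\n\n"
    "## Origins\n### Glasgow\nOne claim.\n"
)
demo_sections = split_by_headers(
    demo_doc, [("#", "title"), ("##", "section"), ("###", "subsection")]
)
for section in demo_sections:
    print(section["metadata"], "->", section["content"])
```

Each chunk carries the full header path that was active when its text appeared, which is the property the retrieval step relies on.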

In [8]:
headers_to_split_on = [
    ("#", "title"),
    ("##", "section"),
    ("###", "subsection"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

markdown_chunks = markdown_splitter.split_text(raw_text)

print(f"Number of chunks: {len(markdown_chunks)}\n")
for i, chunk in enumerate(markdown_chunks):
    print(f"[Chunk {i}] Metadata: {chunk.metadata}")
    print(chunk.page_content)
    print("---")

Number of chunks: 11

[Chunk 0] Metadata: {'title': 'Chicken tikka masala'}
Chicken tikka masala is a curry consisting of roasted marinated chicken pieces (chicken tikka) in a creamy spiced sauce (masala). It is widely reported to have been created in Glasgow by Ali Ahmed Aslam, a Pakistani-born chef. It is offered at restaurants around the world and is similar to butter chicken.
It is one of the most popular dishes in Britain, and in 2001 was described by the British Foreign Secretary Robin Cook as "a true British national dish". The dish has been called inauthentic both by white Britons and by South Asians. Scholars and critics have debated the status of the dish, concluding variously that it has undergone an elaborate process of cultural interchange, and serves as symbol of Britain's multicultural society. Lizzie Collingham states that it feels quintessentially South Asian despite its British origins.
---
[Chunk 1] Metadata: {'title': 'Chicken tikka masala', 'section': 'Composition'

## 4. Comparison Summary
Comparing all three strategies on the same scraped Wikipedia text.

In [9]:
strategies = {
    "CharacterTextSplitter": char_chunks,
    "RecursiveCharacterTextSplitter": recursive_chunks,
    "MarkdownHeaderTextSplitter": markdown_chunks,
}

print(f"{'Strategy':<35} {'Num Chunks':<15} {'Avg Size':<15} {'Min':<10} {'Max'}")
print("-" * 85)
for name, chunks in strategies.items():
    # Chunk lengths in characters for this strategy
    sizes = [len(c.page_content) for c in chunks]
    print(f"{name:<35} {len(chunks):<15} {sum(sizes)/len(sizes):<15.0f} {min(sizes):<10} {max(sizes)}")

Strategy                            Num Chunks      Avg Size        Min        Max
-------------------------------------------------------------------------------------
CharacterTextSplitter               29              356             14         929
RecursiveCharacterTextSplitter      63              165             1          294
MarkdownHeaderTextSplitter          11              918             16         2882


## 5. Conclusion

Based on real scraped Wikipedia data (Chicken Tikka Masala):

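- **CharacterTextSplitter**: Simple, but it splits only on the single `\n` separator, so chunk sizes are uneven: 29 chunks ranging from 14 to 929 characters despite `chunk_size=300`
- **RecursiveCharacterTextSplitter**: Smarter, splits on natural text boundaries where it can and stays within the limit (63 chunks, max 294 characters), at the cost of small fragments (min 1 character) and chunks that start with a leftover `. `
- **MarkdownHeaderTextSplitter**: Best for our use case, since each chunk maps cleanly to a section and the header is stored as metadata; chunks are large, though (11 chunks, average 918 and max 2882 characters)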

**Decision**: Primary strategy → `MarkdownHeaderTextSplitter`  
**Fallback** (for unstructured text) → `RecursiveCharacterTextSplitter`
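
One possible way to apply this decision automatically is to check whether a text contains markdown header lines before choosing a splitter. The sketch below is an assumption about how such a dispatch could look; `HEADER_PATTERN` and `choose_splitter_name` are hypothetical names, and the function returns only the strategy name rather than a configured splitter.

```python
import re

# Matches a markdown header line such as '# Title' or '## Section'
HEADER_PATTERN = re.compile(r"^#{1,6} \S", re.MULTILINE)


def choose_splitter_name(text: str) -> str:
    """Return the name of the strategy to use, following the decision above."""
    if HEADER_PATTERN.search(text):
        return "MarkdownHeaderTextSplitter"
    return "RecursiveCharacterTextSplitter"


print(choose_splitter_name("# Chicken tikka masala\n\nA curry."))
print(choose_splitter_name("Plain text with no headers at all."))
```

Requiring a space after the `#` characters keeps strings such as `#hashtag` from being treated as headers.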