# Chunking Component
This notebook implements and compares three chunking strategies for the CuisineRAG system
using the `langchain-text-splitters` library:
1. CharacterTextSplitter (Fixed-size chunking)
2. RecursiveCharacterTextSplitter (Smarter fixed-size chunking)
3. MarkdownHeaderTextSplitter (Section-based chunking)

In [1]:
from langchain_text_splitters import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter
)

In [2]:
import json
from pathlib import Path


## Load Corpus from Scraping Output

In [4]:
corpus_path = Path('data/raw/south_asian_corpus.json')
if not corpus_path.exists():
    raise FileNotFoundError(f'Missing corpus file: {corpus_path}. Run scraping notebook first.')

with corpus_path.open('r', encoding='utf-8') as f:
    corpus = json.load(f)

if not corpus:
    raise ValueError('Corpus is empty. Check scraping output before chunking.')

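# Filter once so that texts and metadatas are built from the same documents.
docs_with_text = [doc for doc in corpus if doc.get('text')]
texts = [doc['text'] for doc in docs_with_text]
metadatas = [
    {
        'title': doc.get('title', ''),
        'url': doc.get('url', ''),
    }
    for doc in docs_with_text
]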

print(f'Loaded {len(texts)} documents from {corpus_path}')
print(f"Total characters: {sum(len(t) for t in texts):,}")
print('\n--- First document preview (500 chars) ---')
print(texts[0][:500])


Loaded 257 documents from data/raw/south_asian_corpus.json
Total characters: 2,710,651

--- First document preview (500 chars) ---
# Acehnese cuisine

Acehnese cuisine is the cuisine of the Acehnese people of Aceh in Sumatra, Indonesia. This cuisine is popular and widely known in Indonesia. Arab, Persian, and Indian traders influenced food culture in Aceh although flavours have substantially changed their original forms. The spices combined in Acehnese cuisine are commonly found in Indian and Arab cuisine, such as ginger, pepper, coriander, cumin, cloves, cinnamon, cardamom, and fennel. 
A variety of Acehnese food is cooked


## 1. CharacterTextSplitter (Fixed-Size Chunking)
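Splits text on a single separator (here a space) and merges the pieces into chunks of at most
`chunk_size` characters, repeating up to `chunk_overlap` characters between consecutive chunks.
Simple and format-agnostic, but it ignores sentence and paragraph boundaries.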
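
The window arithmetic behind `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window. This is a minimal sketch, not the library's algorithm: `fixed_size_chunks` is a hypothetical helper that cuts at exact character offsets, whereas `CharacterTextSplitter` cuts at the separator, so its chunks are often slightly shorter than `chunk_size`.

```python
def fixed_size_chunks(text: str, chunk_size: int = 300, chunk_overlap: int = 50) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    Hypothetical helper for illustration only.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Toy input: 700 digits, so chunk boundaries are easy to check by hand.
sample = "".join(str(i % 10) for i in range(700))
chunks = fixed_size_chunks(sample, chunk_size=300, chunk_overlap=50)
print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])
```

With a step of 250 characters, the last 50 characters of one window are the first 50 of the next, which is the context the overlap is meant to preserve across a cut.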

In [5]:
char_splitter = CharacterTextSplitter(
    separator=" ",
    chunk_size=300,
    chunk_overlap=50,
    length_function=len,
)

char_chunks = char_splitter.create_documents(texts, metadatas=metadatas)

print(f"Number of chunks: {len(char_chunks)}\n")
for i, chunk in enumerate(char_chunks[:5]):
    print(f"[Chunk {i}] {len(chunk.page_content)} chars | title={chunk.metadata.get('title', '')}")
    print(chunk.page_content[:300])
    print("---")


Number of chunks: 10919

[Chunk 0] 297 chars | title=Acehnese cuisine
# Acehnese cuisine

Acehnese cuisine is the cuisine of the Acehnese people of Aceh in Sumatra, Indonesia. This cuisine is popular and widely known in Indonesia. Arab, Persian, and Indian traders influenced food culture in Aceh although flavours have substantially changed their original forms. The
---
[Chunk 1] 297 chars | title=Acehnese cuisine
substantially changed their original forms. The spices combined in Acehnese cuisine are commonly found in Indian and Arab cuisine, such as ginger, pepper, coriander, cumin, cloves, cinnamon, cardamom, and fennel. 
A variety of Acehnese food is cooked with curry or coconut milk, which is generally
---
[Chunk 2] 300 chars | title=Acehnese cuisine
with curry or coconut milk, which is generally combined with meat such as buffalo, beef, goat meat, lamb, mutton, fish, or chicken. Several Aceh dishes can trace their origins to India, such as roti canai, which was derived from Indian 

## 2. RecursiveCharacterTextSplitter (Smarter Fixed-Size Chunking)
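Tries the separators in order, coarsest first (paragraphs `\n\n` → lines `\n` → sentences `.` → words → single characters),
and only moves to a finer separator when a piece is still longer than `chunk_size`. Chunks tend to end at
paragraph or sentence boundaries, unlike plain CharacterTextSplitter, at the cost of more and shorter chunks.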
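
The separator hierarchy can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: `recursive_split` and `merge_pieces` are hypothetical names, overlap is left out, and the separator at a cut point is dropped (the output in this notebook suggests the library keeps it, since one chunk starts with `. The spices`).

```python
def merge_pieces(pieces: list[str], sep: str, chunk_size: int) -> list[str]:
    """Greedily join neighbouring pieces while the result fits in chunk_size."""
    merged: list[str] = []
    for piece in pieces:
        if merged and len(merged[-1]) + len(sep) + len(piece) <= chunk_size:
            merged[-1] += sep + piece
        else:
            merged.append(piece)
    return merged


def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the coarsest separator; recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    sep = separators[0] if separators else ""
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks: list[str] = []
    small: list[str] = []  # pieces that already fit, waiting to be merged
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            if piece.strip():
                small.append(piece)
        else:
            # Flush what fits, then split the oversized piece more finely.
            chunks.extend(merge_pieces(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(piece, separators[1:], chunk_size))
    chunks.extend(merge_pieces(small, sep, chunk_size))
    return chunks


toy_text = (
    "Para one. It has two sentences.\n\n"
    "Para two is a single sentence that is longer."
)
toy_chunks = recursive_split(toy_text, ["\n\n", "\n", ". ", " ", ""], chunk_size=40)
for chunk in toy_chunks:
    print(len(chunk), repr(chunk))
```

The first paragraph fits in 40 characters and is kept whole. The second does not, contains no finer paragraph or sentence break, and is therefore split at a word boundary.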

In [6]:
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", ".", " ", ""],
)

recursive_chunks = recursive_splitter.create_documents(texts, metadatas=metadatas)

print(f"Number of chunks: {len(recursive_chunks)}\n")
for i, chunk in enumerate(recursive_chunks[:5]):
    print(f"[Chunk {i}] {len(chunk.page_content)} chars | title={chunk.metadata.get('title', '')}")
    print(chunk.page_content[:300])
    print("---")


Number of chunks: 15062

[Chunk 0] 18 chars | title=Acehnese cuisine
# Acehnese cuisine
---
[Chunk 1] 272 chars | title=Acehnese cuisine
Acehnese cuisine is the cuisine of the Acehnese people of Aceh in Sumatra, Indonesia. This cuisine is popular and widely known in Indonesia. Arab, Persian, and Indian traders influenced food culture in Aceh although flavours have substantially changed their original forms
---
[Chunk 2] 170 chars | title=Acehnese cuisine
. The spices combined in Acehnese cuisine are commonly found in Indian and Arab cuisine, such as ginger, pepper, coriander, cumin, cloves, cinnamon, cardamom, and fennel.
---
[Chunk 3] 283 chars | title=Acehnese cuisine
A variety of Acehnese food is cooked with curry or coconut milk, which is generally combined with meat such as buffalo, beef, goat meat, lamb, mutton, fish, or chicken. Several Aceh dishes can trace their origins to India, such as roti canai, which was derived from Indian flatbread.
---
[Chunk 4] 91 chars | title=Acehne

## 3. MarkdownHeaderTextSplitter (Section-Based Chunking)
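Splits on the markdown headers preserved during scraping.
Well suited to Wikipedia/Wikibooks pages: each chunk maps to one section,
and the header path (title, section, subsection) is stored as metadata for better RAG retrieval.
Chunk size is not bounded, because a chunk is as long as its section.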
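
How header lines turn into metadata can be sketched without the library. This is a minimal illustration, assuming three header levels mapped to the same keys as in this notebook; `split_by_headers` is a hypothetical helper, and it ignores edge cases the real splitter handles (for example, `#` lines inside code blocks).

```python
import re

# One to three '#' characters, whitespace, then the header text.
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*\S)\s*$")
LEVEL_KEYS = {1: "title", 2: "section", 3: "subsection"}


def split_by_headers(markdown: str) -> list[dict]:
    """Group body lines under the header path that is active when they appear."""
    chunks: list[dict] = []
    path: dict[str, str] = {}  # currently active header at each level
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append({"metadata": dict(path), "page_content": content})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match is None:
            body.append(line)
            continue
        flush()
        level = len(match.group(1))
        # A new header closes every header at the same or a deeper level.
        for stale in range(level, 4):
            path.pop(LEVEL_KEYS[stale], None)
        path[LEVEL_KEYS[level]] = match.group(2)
    flush()
    return chunks


toy_markdown = "# Doc\nIntro.\n## A\n### A1\nText one.\n### A2\nText two.\n## B\nText three."
toy_sections = split_by_headers(toy_markdown)
for section in toy_sections:
    print(section["metadata"], "|", section["page_content"])
```

Header lines are moved into metadata rather than kept in the chunk text, and a section with no body of its own (`## A` here) produces no chunk.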

In [7]:
headers_to_split_on = [
    ("#", "title"),
    ("##", "section"),
    ("###", "subsection"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

markdown_chunks = []
for doc in corpus:
    text = doc.get('text', '')
    if not text:
        continue
    doc_chunks = markdown_splitter.split_text(text)
    for chunk in doc_chunks:
        chunk.metadata['doc_title'] = doc.get('title', '')
        chunk.metadata['url'] = doc.get('url', '')
    markdown_chunks.extend(doc_chunks)

print(f"Number of chunks: {len(markdown_chunks)}\n")
for i, chunk in enumerate(markdown_chunks[:5]):
    print(f"[Chunk {i}] Metadata: {chunk.metadata}")
    print(chunk.page_content[:300])
    print("---")


Number of chunks: 2803

[Chunk 0] Metadata: {'title': 'Acehnese cuisine', 'doc_title': 'Acehnese cuisine', 'url': 'https://en.wikipedia.org/wiki/Acehnese_cuisine'}
Acehnese cuisine is the cuisine of the Acehnese people of Aceh in Sumatra, Indonesia. This cuisine is popular and widely known in Indonesia. Arab, Persian, and Indian traders influenced food culture in Aceh although flavours have substantially changed their original forms. The spices combined in Ace
---
[Chunk 1] Metadata: {'title': 'Acehnese cuisine', 'section': 'List of Acehnese foods', 'subsection': 'Spices', 'doc_title': 'Acehnese cuisine', 'url': 'https://en.wikipedia.org/wiki/Acehnese_cuisine'}
Asam sunti, a condiment made of star fruit and salt.
---
[Chunk 2] Metadata: {'title': 'Acehnese cuisine', 'section': 'List of Acehnese foods', 'subsection': 'Dishes', 'doc_title': 'Acehnese cuisine', 'url': 'https://en.wikipedia.org/wiki/Acehnese_cuisine'}
Ayam tangkap, traditional fried chicken served with leaves such as temur

## 4. Comparison Summary
Comparing all three strategies on the same scraped Wikipedia text.

In [8]:
char_sizes = [len(c.page_content) for c in char_chunks]
recursive_sizes = [len(c.page_content) for c in recursive_chunks]
markdown_sizes = [len(c.page_content) for c in markdown_chunks]

print(f"{'Strategy':<35} {'Num Chunks':<15} {'Avg Size':<15} {'Min':<10} {'Max'}")
print("-" * 85)
print(f"{'CharacterTextSplitter':<35} {len(char_chunks):<15} {sum(char_sizes)/len(char_sizes):<15.0f} {min(char_sizes):<10} {max(char_sizes)}")
print(f"{'RecursiveCharacterTextSplitter':<35} {len(recursive_chunks):<15} {sum(recursive_sizes)/len(recursive_sizes):<15.0f} {min(recursive_sizes):<10} {max(recursive_sizes)}")
print(f"{'MarkdownHeaderTextSplitter':<35} {len(markdown_chunks):<15} {sum(markdown_sizes)/len(markdown_sizes):<15.0f} {min(markdown_sizes):<10} {max(markdown_sizes)}")

Strategy                            Num Chunks      Avg Size        Min        Max
-------------------------------------------------------------------------------------
CharacterTextSplitter               10919           294             40         300
RecursiveCharacterTextSplitter      15062           182             1          300
MarkdownHeaderTextSplitter          2803            944             4          16394
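
Average, minimum, and maximum hide how many chunks sit at the extremes. A small report like the one below can complement the table. It is a sketch only: `size_report` is a hypothetical helper, the `low` and `high` thresholds are arbitrary assumptions, and the sizes in the example are made-up toy values, not results from this corpus.

```python
from statistics import median


def size_report(sizes: list[int], low: int = 50, high: int = 1000) -> dict:
    """Summarise chunk sizes: count, median, and how many fall outside [low, high]."""
    return {
        "count": len(sizes),
        "median": median(sizes),
        "below_low": sum(s < low for s in sizes),
        "above_high": sum(s > high for s in sizes),
    }


# Toy sizes for illustration only.
toy_sizes = [5, 40, 120, 300, 300, 950, 4000]
report = size_report(toy_sizes)
print(report)
```

In the notebook, the same function could be applied to `char_sizes`, `recursive_sizes`, and `markdown_sizes` from the cell above.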


## 5. Conclusion

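Based on the scraped corpus (257 documents, 2,710,651 characters):

- **CharacterTextSplitter**: Simple, with near-uniform chunks (average 294 characters), but it cuts mid-sentence and mid-paragraph
- **RecursiveCharacterTextSplitter**: Smarter, respects natural text boundaries better, at the cost of more and shorter chunks (15,062 chunks, average 182 characters, some only 1 character long)
- **MarkdownHeaderTextSplitter**: Best for our use case: each chunk maps cleanly to a section and the header is stored as metadata. Section length varies widely (4 to 16,394 characters)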

**Decision**: Primary strategy → `MarkdownHeaderTextSplitter`  
**Fallback** (for unstructured text) → `RecursiveCharacterTextSplitter`
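
One way to apply this decision per document is to check whether the text contains any markdown header and route it accordingly. The sketch below is an assumption about how the routing could look, not part of the pipeline: `pick_strategy` and the returned labels are hypothetical names.

```python
import re

# A header is a line that starts with one to three '#' followed by whitespace and text.
HEADER_LINE = re.compile(r"^#{1,3}\s+\S", re.MULTILINE)


def pick_strategy(text: str) -> str:
    """Return the label of the splitter to use for this document."""
    if HEADER_LINE.search(text):
        return "markdown_header"
    return "recursive_character"


print(pick_strategy("# Title\nSome body text."))
print(pick_strategy("Plain text with a # in the middle."))
```

A caller could map the two labels to `markdown_splitter` and `recursive_splitter` from the cells above.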