### Chunking Exercise: The Baker’s Choice Challenge

Imagine you’ve got your hands on a **Baker’s Choice dessert recipe book** — a glorious, chaotic collection of cakes, cookies, pies, mousses, and pastries, each more tempting (and complicated) than the last. 

Your mission? **Find the dessert recipe you actually want to make** without getting lost in the sugar-coated chaos.  

---

Sounds simple? Think again. This book also has experimental “fusion desserts,” random chef notes in the margins, and a few mysterious recipes written in code-like shorthand. If you feed the **entire book** to an LLM, it might start explaining how to make macarons when all you wanted was a brownie. 

---

Enter **chunking** — your AI sous-chef. Instead of dumping the whole book on the model, you **slice it into bite-sized, meaningful chunks**. Now, the AI can find the exact dessert recipe without getting distracted by pastries it doesn’t need.  

- **Goal**: Serve the model only the **chunks it can actually digest**, so it returns the right recipe, not a kitchen disaster.  
- **Takeaway**: Chunking = making sure your AI finds the dessert, not a side of confusion.


In [1]:
import sys
import os
import json
project_root = os.path.abspath(os.path.join("..", ".."))
sys.path.append(project_root)
from common.helper import read_pdf, chunk_text, clean_text

In [2]:
# Extracts text from a PDF, splits it into fixed-size chunks, and saves the chunks to a JSON file.
# 1. Set chunk size and initialize variables.
# 2. Read the PDF and concatenate text from all pages.
# 3. Optionally print a preview of the extracted text (commented out below).
# 4. Split the text into chunks and optionally print chunk info (commented out below).
# 5. Save the chunks to a JSON file.

chunk_size = 1000
full_text = ""

pdf_path = os.path.join(project_root, "data", "input", "recipe-book.pdf")
doc = read_pdf(pdf_path)
output_chunks_dir = os.path.join(project_root, "data", "chunks")
os.makedirs(output_chunks_dir, exist_ok=True)  # create the output folder if it is missing


for page in doc:
    text = page.get_text()
    if text:
        full_text += text + "\n"
doc.close()

#print(full_text[:500])  # Print the first 500 characters of the extracted text

chunks = chunk_text(full_text, chunk_size=chunk_size)
#print(f"Total chunks created: {len(chunks)}")

#for i in range(min(3, len(chunks))):
#    print(f"\n--- Chunk {i+1} ---\n{chunks[i]}")

output_path = os.path.join(output_chunks_dir, "chunks_raw.json")
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(chunks, f, ensure_ascii=False, indent=4)
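
The implementation of `chunk_text` lives in `common.helper` and is not shown in this notebook. As a rough illustration of what fixed-size chunking does, here is a minimal sketch. The function name `fixed_size_chunks` is hypothetical, and the output shape (a list of dicts with a `content` key) is an assumption based on how the cleaning cell later reads `chunk["content"]`.

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000) -> list[dict]:
    """Split text into consecutive chunks of at most chunk_size characters.

    Hypothetical stand-in for the notebook's chunk_text helper; the real
    helper may behave differently (for example, it may add overlap).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        {"content": text[start:start + chunk_size]}
        for start in range(0, len(text), chunk_size)
    ]


sample = "Melt the butter. Stir in the cocoa. Bake for 25 minutes."
for chunk in fixed_size_chunks(sample, chunk_size=20):
    # Each chunk is cut purely by character count, so words and
    # sentences can be split in the middle.
    print(repr(chunk["content"]))
```

Running this on the short sample shows the cut landing in the middle of a word, which is the kind of break fixed-size splitting produces on real text.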

### Observations After Initial Chunking

After extracting text from the PDF and splitting it into equal-sized chunks, we observed some issues that could negatively impact retrieval quality:

1. **Unnecessary characters**:  
   Some chunks contain page numbers, headers, footers, or stray symbols from the PDF. These elements do not carry semantic meaning and can confuse the retrieval model.

2. **Breaking of text between chunks**:  
   Since the chunks are created purely based on fixed size, sentences or paragraphs are often split in the middle. This can result in chunks that are **incomplete or hard to interpret**, reducing the effectiveness of semantic search.

3. **Impact on retrieval**:  
   Feeding these imperfect chunks into a RAG pipeline can lead to:
   - Lower relevance of retrieved results
   - Fragmented context
   - Potential hallucinations in generated responses

To improve retrieval quality, we need to **clean the text** and consider **smarter chunking strategies**, such as splitting by paragraphs or semantic boundaries.

The snapshot below shows a chunk produced by the initial fixed-size chunking:

![Chunking Issue](https://github.com/Kunal627/rag-by-example/blob/main/data/images/raw_fixed_size.PNG)
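
One of the smarter strategies mentioned above is to respect sentence boundaries. The sketch below greedily packs whole sentences into chunks up to a character budget. It is an illustration, not part of the notebook's helpers: the name `sentence_chunks` is hypothetical, and the regex-based sentence split is naive (abbreviations such as `tbsp.` would be treated as sentence ends).

```python
import re


def sentence_chunks(text: str, max_chars: int = 1000) -> list[dict]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[dict] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append({"content": current})
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append({"content": current})
    return chunks


sample = "Melt the butter. Stir in the cocoa. Bake for 25 minutes."
for chunk in sentence_chunks(sample, max_chars=40):
    print(repr(chunk["content"]))
```

Chunk sizes now vary, but no chunk ends in the middle of a sentence, so each one is easier to interpret on its own.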

In [3]:
cleaned_text = []
for chunk in chunks:
    cleaned_chunk = clean_text(chunk["content"])
    cleaned_text.append({"content": cleaned_chunk})

output_path = os.path.join(output_chunks_dir, "chunks_cleaned.json")

with open(output_path, "w", encoding="utf-8") as f:
    json.dump(cleaned_text, f, ensure_ascii=False, indent=4)

### Minimal Text Cleaning and Its Limitations

In this notebook, I have applied **minimal cleaning techniques** to the PDF text:

- Replacing newline characters (`\n`) with spaces.  
- Removing the word **"Page"** and numeric page numbers or ranges like `12-13`.  
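
The `clean_text` helper itself is defined in `common.helper` and is not reproduced here. A minimal sketch of the two steps listed above might look like the following. The function name `clean_text_sketch` and both regular expressions are assumptions for illustration; in particular, this version only drops numbers that sit alone on a line or directly follow the word "Page", so that quantities inside a recipe are left untouched.

```python
import re

# Assumed patterns; the notebook's real clean_text helper may differ.
# A line that contains only a number or a range such as "12-13".
STANDALONE_NUMBER_LINE = re.compile(
    r"^[ \t]*\d+(?:[ \t]*-[ \t]*\d+)?[ \t]*$", re.MULTILINE
)
# The word "Page", optionally followed by a number or a range.
PAGE_LABEL = re.compile(r"\bPage\b[ \t]*(?:\d+(?:[ \t]*-[ \t]*\d+)?)?")


def clean_text_sketch(text: str) -> str:
    """Remove page labels and page-number lines, then flatten newlines."""
    text = STANDALONE_NUMBER_LINE.sub(" ", text)
    text = PAGE_LABEL.sub(" ", text)
    # Replace newlines (and any other whitespace runs) with single spaces.
    return re.sub(r"\s+", " ", text).strip()


raw = "Brownies\nPage 12\nMix 2 cups flour\n12-13\nBake."
print(clean_text_sketch(raw))
```

Note that the standalone-number rule needs line breaks to work, so a cleaner like this is best applied to the raw page text before newlines are flattened.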

While these steps help **make the chunks cleaner and easier for a model to read**, there are some important caveats:

1. **Context breaks:**  
   Since chunks were created using **fixed-size splitting**, sentences or paragraphs may be split in the middle, which can lead to **loss of context**.

2. **Text order issues:**  
   PDFs often mix text with images, tables, or multi-column layouts. Minimal cleaning cannot always preserve the **logical flow of information**, so text might appear **out of order**.

3. **Domain-specific cleaning:**  
   More advanced cleaning, like handling footnotes, references, special symbols, or domain-specific abbreviations, depends on the **complexity of the PDF and the domain**. I leave this as an **exercise for the reader** to explore and implement.
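
For the text-order issue in particular, one possible starting point is to sort text blocks by position before joining them. The sketch below is an assumption-heavy illustration: it presumes `read_pdf` returns a PyMuPDF document (suggested by the `page.get_text()` call), in which case `page.get_text("blocks")` yields tuples beginning with `(x0, y0, x1, y1, text, ...)` and `page.rect.width` gives the page width. The function `order_blocks` is hypothetical, and it operates on plain tuples so it can be tried without a PDF.

```python
def order_blocks(blocks, page_width, columns=2):
    """Return block texts in column-major reading order.

    Each block is a tuple starting with (x0, y0, x1, y1, text, ...).
    Blocks are assigned to a column by their left edge, then read
    top to bottom within each column.
    """
    column_width = page_width / columns

    def sort_key(block):
        x0, y0 = block[0], block[1]
        column = min(int(x0 // column_width), columns - 1)
        return (column, y0, x0)

    return [block[4].strip() for block in sorted(blocks, key=sort_key)]


# Hand-made sample blocks for a two-column page, deliberately out of order.
blocks = [
    (310.0, 80.0, 560.0, 200.0, "Method: whisk the eggs...", 1, 0),
    (40.0, 80.0, 290.0, 200.0, "Ingredients: 3 eggs...", 0, 0),
    (40.0, 40.0, 290.0, 70.0, "Lemon Tart", 2, 0),
]
print(order_blocks(blocks, page_width=600.0))
```

This simple rule assumes evenly sized columns and will misplace full-width headings or tables, so treat it as a first experiment rather than a general solution.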

---

**Example snapshots after cleaning**:

![Original pdf text](https://github.com/Kunal627/rag-by-example/blob/main/data/images/cleaned_fixed_size.PNG)
![Chunking Issues with Text & Images](https://github.com/Kunal627/rag-by-example/blob/main/data/images/funfacts.PNG)

By combining **minimal cleaning** with **semantic chunking**, we can produce chunks that are **both clean and contextually meaningful**, improving the performance of your retrieval-augmented generation pipeline.
