### Semantic Chunking  

Imagine you’re reading a dessert recipe. Fixed-size chunking would chop the text into equal blocks, like cutting a cake with a ruler — sometimes you get the whole "Add sugar and butter…" step in one chunk, and sometimes you just get "… for 10 minutes" dangling in the next. Not very helpful.  

Semantic chunking is more like cutting a layered cake along the frosting lines — you keep the steps and ideas intact. That way, "Mix flour, sugar, and eggs" stays together as one meaningful unit, instead of being split in the middle.  

In short:  
- **Fixed-size chunks**: Equal slices, context may get lost.  
- **Semantic chunks**: Natural breaks, meaning preserved.  

When working with recipes (or any domain text), semantic chunking keeps each instruction or paragraph self-contained, so a retrieved chunk makes sense on its own and gives the model complete context.  
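To make the contrast concrete, here is a minimal sketch on a made-up three-step recipe. Splitting on blank lines is only a simple stand-in for semantic boundaries, and the function names and the 40-character size are illustrative assumptions, not part of this project's helpers.

```python
import re

# Made-up recipe text, with one step per paragraph
recipe = (
    "Mix flour, sugar, and eggs in a large bowl.\n\n"
    "Pour the batter into a greased pan.\n\n"
    "Bake at 180 C for 10 minutes."
)


def fixed_size_chunks(text, size):
    """Cut text into blocks of `size` characters, ignoring meaning."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def semantic_chunks(text):
    """Split on blank lines so each step stays whole."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


fixed = fixed_size_chunks(recipe, 40)   # the first step is cut off mid-word
semantic = semantic_chunks(recipe)      # one chunk per step
```

Here `fixed[0]` ends partway through the first instruction, while every entry in `semantic` is a complete step.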

For this exercise, I have created a separate chunk for each recipe.

In [1]:
# imports and setup
import sys
import os
import json
project_root = os.path.abspath(os.path.join("..", ".."))
sys.path.append(project_root)
from common.helper import read_pdf, extract_recipes_from_pdf

pdf_path = os.path.join(project_root, "data", "input", "recipe-book.pdf")
out_dir = os.path.join(project_root, "data", "chunks")
os.makedirs(out_dir, exist_ok=True)  # make sure the output folder exists before writing to it

In [None]:
# extract_recipes_from_pdf() reads the PDF spans and groups them into logical sections such as directions and ingredients
doc = read_pdf(pdf_path)

skip_pages = 10   # skip initial 10 pages which do not have recipes
end_page = 62    # end at page 62 which is the last page with recipes

recipe_chunks = extract_recipes_from_pdf(doc, skip_pages=skip_pages, end_page=end_page)
with open(os.path.join(out_dir, "recipe_chunks.json"), "w", encoding="utf-8") as f:
    json.dump(recipe_chunks, f, ensure_ascii=False, indent=4)

### Challenges in Extracting Recipes from PDFs

While working with the recipe PDF, I faced a few key issues:

- **Column layout**: Many pages are formatted with two columns. When reading the file directly with `fitz` (PyMuPDF), the extracted text often comes out of order, mixing left and right column content.  
- **Non-recipe pages**: A significant number of pages don’t contain recipes at all (e.g., indexes, section dividers, or filler text), which adds noise if processed blindly.  
- **Inconsistent formatting**: Even on recipe pages, headings, ingredients, and steps are not consistently separated. This makes it difficult to chunk recipes cleanly.  

To address this, I wrote a small function that uses span-level information (e.g., font size, style, and color) to group text into meaningful sections before chunking. I also added page numbers to each chunk's metadata.
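The idea can be sketched roughly as follows. This is not the actual `extract_recipes_from_pdf` helper: `group_spans_by_style`, the `heading_min_size` threshold, and the sample spans are all hypothetical. The `"text"` and `"size"` keys mirror the span dictionaries PyMuPDF returns from `page.get_text("dict")`, and `"page"` is assumed to be attached while iterating over pages.

```python
def group_spans_by_style(spans, heading_min_size=14.0):
    """Group spans into sections, starting a new section at each heading-sized span.

    Each span is a dict with "text", "size", and "page" keys.
    The 14.0 pt heading threshold is an assumption; tune it per document.
    """
    sections = []
    current = None
    for span in spans:
        text = span["text"].strip()
        if not text:
            continue  # ignore whitespace-only spans
        if span["size"] >= heading_min_size:
            # A large font marks a recipe title: open a new section
            current = {"title": text, "body": [], "pages": [span["page"]]}
            sections.append(current)
        elif current is not None:
            current["body"].append(text)
            if span["page"] not in current["pages"]:
                current["pages"].append(span["page"])
        # Body text that appears before the first heading is dropped in this sketch
    return sections


# Made-up spans for illustration
spans = [
    {"text": "Chocolate Cake", "size": 18.0, "page": 11},
    {"text": "2 cups flour", "size": 10.0, "page": 11},
    {"text": "Bake for 30 minutes.", "size": 10.0, "page": 12},
    {"text": "Lemon Tart", "size": 18.0, "page": 12},
    {"text": "1 lemon", "size": 10.0, "page": 12},
]
sections = group_spans_by_style(spans)
```

A real document would likely need more signals than font size alone (for example, bold flags or color to tell ingredients from directions), but the grouping loop stays the same.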

Not all PDFs are structured the same way, so handling them often requires custom parsing strategies. 
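One such strategy, for two-column pages like the ones described above, is to assign each text block to a column and then read each column top to bottom. The sketch below is a hypothetical helper, not code from this project. It assumes blocks shaped like the first five fields of the tuples returned by PyMuPDF's `page.get_text("blocks")`, namely `(x0, y0, x1, y1, text)`, and a clean split at the page midpoint.

```python
def order_two_column_blocks(blocks, page_width):
    """Return blocks in reading order: left column top to bottom, then right column.

    Each block is a tuple (x0, y0, x1, y1, text).
    Assumes the column gutter sits at the horizontal midpoint of the page.
    """
    midpoint = page_width / 2
    left = [b for b in blocks if b[0] < midpoint]
    right = [b for b in blocks if b[0] >= midpoint]
    # Sort each column by its top edge (y0)
    return sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])


# Made-up blocks in the interleaved order a naive extraction might produce
blocks = [
    (310, 50, 560, 100, "Right top"),
    (40, 50, 290, 100, "Left top"),
    (310, 120, 560, 200, "Right bottom"),
    (40, 120, 290, 200, "Left bottom"),
]
ordered_text = [b[4] for b in order_two_column_blocks(blocks, page_width=600)]
```

Blocks that span the full page width, such as a title above both columns, would land in the left column here, so pages with mixed layouts need extra handling.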

With these preprocessing steps in place, we are almost ready for **vectorization**.

![semantic chunk](../../data/images/semantic_chunk.PNG)

### (Optional) Handling Tables in PDFs  

A few pages at the end of the PDF contain a measurement guide. This information is structured as tables, and we can use it to enrich our chunks.
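As a rough sketch of that enrichment, the helper below turns a table, given as a header row followed by data rows, into a text chunk with metadata. `table_to_chunk`, the metadata keys, the page number, and the sample rows are hypothetical. Rows in this shape could come from PyMuPDF's `page.find_tables()` followed by `Table.extract()`, assuming a PyMuPDF version that provides it (1.23 or later).

```python
def table_to_chunk(rows, page_number, title):
    """Turn a table (header row plus data rows) into a text chunk with metadata."""
    header, *data = rows
    # Pair every cell with its column name so each line is readable on its own
    lines = [
        ", ".join(f"{col}: {cell}" for col, cell in zip(header, row))
        for row in data
    ]
    return {
        "text": title + "\n" + "\n".join(lines),
        "metadata": {"page": page_number, "type": "table"},
    }


# Illustrative rows and page number, not values taken from the recipe book
rows = [
    ["Unit", "Equivalent"],
    ["1 tablespoon", "3 teaspoons"],
    ["1 cup", "16 tablespoons"],
]
chunk = table_to_chunk(rows, page_number=63, title="Measurement guide")
```

Repeating the column name on every line keeps each row meaningful even if the chunk is later split or only partially retrieved.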