### Semantic Chunking  

Imagine you’re reading a dessert recipe. Fixed-size chunking would chop the text into equal blocks, like cutting a cake with a ruler — sometimes you get the whole "Add sugar and butter…" step in one chunk, and sometimes you just get "… for 10 minutes" dangling in the next. Not very helpful.  

Semantic chunking is more like cutting a layered cake along the frosting lines — you keep the steps and ideas intact. That way, "Mix flour, sugar, and eggs" stays together as one meaningful unit, instead of being split in the middle.  

In short:  
- **Fixed-size chunks**: Equal slices, context may get lost.  
- **Semantic chunks**: Natural breaks, meaning preserved.  

When working with recipes (or any domain text), semantic chunking aims to keep each instruction or paragraph self-contained, which makes chunks easier to retrieve and easier for the model to interpret.  
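
To make the contrast concrete, here is a minimal sketch that chunks the same short recipe both ways. The sample text, the function names `fixed_size_chunks` and `paragraph_chunks`, and the 30-character size are all assumptions chosen for illustration. Splitting on blank lines is only a simple stand-in for semantic chunking and is not the method used by the helper in this notebook.

```python
import re

# Hypothetical sample text: three recipe steps separated by blank lines.
recipe_text = (
    "Preheat the oven to 180 C.\n"
    "\n"
    "Mix flour, sugar, and eggs until smooth.\n"
    "\n"
    "Bake for 10 minutes."
)


def fixed_size_chunks(text, size):
    """Cut text into blocks of `size` characters, ignoring meaning."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_chunks(text):
    """Split on blank lines so that each step stays in one piece."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


fixed = fixed_size_chunks(recipe_text, 30)
semantic = paragraph_chunks(recipe_text)

print("Fixed-size chunks:")
for chunk in fixed:
    print(repr(chunk))

print("Paragraph chunks:")
for chunk in semantic:
    print(repr(chunk))
```

The fixed-size version cuts the "Mix flour, sugar, and eggs" step across two chunks, while the paragraph version returns each step whole.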

For this exercise, I have created a separate chunk for each recipe.

In [7]:
# imports and setup
import sys
import os
import json
project_root = os.path.abspath(os.path.join("..", ".."))
sys.path.append(project_root)
from common.helper import read_pdf, is_numeric, semantic_cunks_unsorted

pdf_path = os.path.join(project_root, "data", "input", "recipe-book.pdf")
out_dir = os.path.join(project_root, "data", "chunks")

In [8]:
# Read the PDF file containing recipes.
# Skip the first 10 pages (these do not contain recipes).
# Perform semantic chunking on the remaining pages to extract recipe chunks.
# Save the resulting chunks to a JSON file for later use.

doc = read_pdf(pdf_path)

skip_pages = 10  # the first 10 pages do not contain recipes

unsorted_spans_chunks = semantic_cunks_unsorted(doc, skip_pages=skip_pages)

# Make sure the output directory exists before writing to it.
os.makedirs(out_dir, exist_ok=True)
with open(os.path.join(out_dir, "chunks_fitz.json"), "w", encoding="utf-8") as f:
    json.dump(unsorted_spans_chunks, f, ensure_ascii=False, indent=4)
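
Since the chunks are saved for later use, it can help to check that they survive a round trip through JSON. The sketch below is self-contained and uses a hypothetical list of strings as sample data, because the actual structure returned by `semantic_cunks_unsorted` is not shown here. The helper names `save_chunks` and `load_chunks` are also assumptions and are not part of `common.helper`.

```python
import json
import os
import tempfile


def save_chunks(chunks, path):
    """Write chunks to a UTF-8 JSON file, creating the parent folder if needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False, indent=4)


def load_chunks(path):
    """Read chunks back from a UTF-8 JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical sample data; the real chunks may be dicts or nested lists.
sample_chunks = [
    "Crème brûlée: whisk yolks and sugar.",
    "Pancakes: mix flour, milk, and eggs.",
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunks", "sample.json")
    save_chunks(sample_chunks, path)
    reloaded = load_chunks(path)

print(reloaded == sample_chunks)
```

Writing with `ensure_ascii=False` and reading with an explicit `encoding="utf-8"` keeps accented characters in recipe names readable in the file and intact after loading.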