## 🧠 Semantic Chunking for Coherent Retrieval | RAG100X

This notebook implements **Semantic Chunking** — a technique that intelligently splits documents at **meaningful breakpoints**, resulting in **context-preserving chunks** for downstream LLM retrieval and generation.

Unlike traditional chunking (e.g., every 500 characters), semantic chunking leverages **embedding-based similarity** to detect where ideas naturally begin and end — leading to more coherent text segments.

The result? Chunks that reflect complete thoughts, better retrieval alignment, and answers that actually make sense in context.

---

### ✅ What You’ll Learn

- Why fixed-size chunking often breaks up sentences mid-thought  
- How to use LangChain’s `SemanticChunker` to split by semantic shifts  
- The difference between percentile, standard deviation, and IQR breakpoints  
- How OpenAI embeddings guide both chunking and retrieval  
- Why semantically coherent chunks boost RAG performance  

---

### 🔍 Real-world Analogy

Imagine splitting a novel into chunks every 200 words — you’d likely cut across scenes, dialogues, or even mid-sentence.  
Now imagine chunking where the **story naturally pauses** — at paragraph ends, chapter breaks, or scene changes.

✅ **Semantic Chunking works like that — using embeddings to find conceptual "jumps" between sentences and split there.**

---

### 🔬 How Semantic Chunking Works Under the Hood

Let’s break it down step-by-step:

| Step                        | What Happens                                                                 |
|-----------------------------|------------------------------------------------------------------------------|
| 1. PDF Extraction           | The full PDF is converted into a raw text string                            |
| 2. Sentence Embeddings      | Each sentence is converted into an OpenAI embedding vector                   |
| 3. Semantic Distance Calc   | The system computes differences between adjacent sentence embeddings         |
| 4. Breakpoint Detection     | When a distance exceeds a threshold (e.g., 90th percentile), a split occurs  |
| 5. Chunk Assembly           | Sentences between breakpoints are grouped into coherent chunks               |
| 6. Embedding & Indexing     | Chunks are embedded again and stored in a FAISS vector store                 |
| 7. Query & Retrieval        | Queries are matched against these semantic chunks using vector similarity    |

🧠 This process ensures each chunk contains a **complete idea**, reducing the risk of retrieving incomplete or confusing fragments.
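Steps 2 to 5 of the table can be sketched on a toy example. This is a minimal illustration, not LangChain's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for the OpenAI embedding model, and `split_sentences`, `adjacent_cosine_distances`, and `semantic_chunks` are illustrative helper names.

```python
import re

import numpy as np


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def toy_embed(sentences: list[str]) -> np.ndarray:
    """Hypothetical stand-in for an embedding model: bag-of-words count vectors."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = np.zeros((len(sentences), len(vocab)))
    for row, toks in enumerate(tokenized):
        for tok in toks:
            vectors[row, index[tok]] += 1
    return vectors


def adjacent_cosine_distances(vectors: np.ndarray) -> np.ndarray:
    """Cosine distance (1 - cosine similarity) between each sentence and the next."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty sentences
    unit = vectors / norms
    return 1.0 - np.sum(unit[:-1] * unit[1:], axis=1)


def semantic_chunks(text: str, percentile: float = 90) -> list[str]:
    """Group sentences, starting a new chunk where the distance exceeds the percentile."""
    sentences = split_sentences(text)
    if len(sentences) < 2:
        return sentences
    distances = adjacent_cosine_distances(toy_embed(sentences))
    threshold = np.percentile(distances, percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # large semantic jump: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


text = (
    "Greenhouse gases trap heat in the atmosphere. "
    "Carbon dioxide is the most common greenhouse gas in the atmosphere. "
    "Solar panels convert sunlight into electricity. "
    "Rooftop solar panels convert sunlight cheaply."
)
chunks = semantic_chunks(text, percentile=90)
for i, chunk in enumerate(chunks, start=1):
    print(f"Chunk {i}: {chunk}")
```

With a real embedding model the distances reflect meaning rather than word overlap, but the grouping logic stays the same.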

---

### 🧪 Why This Works So Well

- ✂️ **No arbitrary cutoffs**: Chunks are split where semantic shifts actually happen  
- 🧩 **Improved coherence**: Each chunk is likely to contain a full concept or argument  
- 🔍 **Better retrieval**: Embedding-aligned chunks match query intent more precisely  
- 🤖 **LLM-friendly**: Language models handle semantically rich text better than fragmented ones  

---

### 🏗️ Why This Matters in Production

Fixed-size chunks often lead to:

- ❌ Mid-sentence breaks  
- ❌ Fragmented ideas  
- ❌ Poor retrieval quality  

But with semantic chunking, imagine retrieving this instead:

> *“Climate change is primarily driven by greenhouse gas emissions, especially CO₂ from fossil fuels. This has been confirmed by decades of atmospheric research.”*

✅ **Complete idea. Self-contained. Easy to ground an answer.**

That’s the power of semantic chunking — giving your RAG system **intelligent building blocks** to work with.
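To make the mid-sentence problem concrete, here is a small sketch of fixed-size chunking. The helper `fixed_size_chunks` and the 50-character window are illustrative assumptions, not part of this notebook's pipeline.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


sample = (
    "Climate change is primarily driven by greenhouse gas emissions. "
    "This has been confirmed by decades of atmospheric research."
)

fixed_chunks = fixed_size_chunks(sample, size=50)
for i, chunk in enumerate(fixed_chunks, start=1):
    # repr() makes the cut points visible, including leading and trailing spaces
    print(f"Chunk {i}: {chunk!r}")
```

The first window ends in the middle of the phrase "greenhouse gas", which is exactly the kind of fragment a retriever struggles to match against a question.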

---

### 🔄 Where This Fits in RAG100X

In earlier projects, you’ve explored:

1. Vanilla chunking + retrieval (PDF, CSV, web articles)
2. Chunk size sensitivity studies 
3. Propositional Chunking 
4. Query Enhancement (Query Transformations, HyDE, HyPE)  
5. Context Enrichment (RSE, CCH, CEW)  


Now in **Day 12**, we introduce **semantic awareness at the chunking level**:  
> 💡 **Let the *ideas* decide where to split — not the character count.**


## 📦 Installation & Setup

In [None]:
# Install required packages (faiss-cpu backs the FAISS vector store, pymupdf provides `fitz`)
!pip install langchain-experimental langchain-openai langchain-community faiss-cpu pymupdf python-dotenv

import os
from dotenv import load_dotenv

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Load environment variables from a .env file
load_dotenv()

# Fail early with a clear message if the OpenAI API key is missing
if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is not set; add it to your .env file.")


### Define the Path

In [None]:
path = "data/Understanding_Climate_Change.pdf"

### Read PDF to string

In [None]:
import fitz

def read_pdf_to_string(path):
    """
    Read a PDF document from the specified path and return its content as a string.

    Args:
        path (str): The file path to the PDF document.

    Returns:
        str: The concatenated text content of all pages in the PDF document.

    The function uses the 'fitz' library (PyMuPDF) to open the PDF document, iterate over each page,
    extract the text content from each page, and append it to a single string.
    """
    # Open the PDF document located at the specified path
    doc = fitz.open(path)
    content = ""
    # Iterate over each page in the document
    for page_num in range(len(doc)):
        # Get the current page
        page = doc[page_num]
        # Extract the text content from the current page and append it to the content string
        content += page.get_text()
    return content

content = read_pdf_to_string(path)

### ⚙️ Breakpoint Strategies in Semantic Chunking

At the heart of semantic chunking lies the decision of **where to split**. Unlike arbitrary character counts, this method relies on **semantic distances** — the difference in meaning between neighboring sentences, measured using their embedding vectors.

LangChain’s `SemanticChunker` offers three strategies for identifying these semantic breakpoints:

- `'percentile'`: Compute the distance between every pair of adjacent sentences, then split wherever a distance exceeds the Xth percentile of those distances (e.g., X = 90 splits at the top 10% most abrupt meaning shifts).
- `'standard_deviation'`: Split wherever the semantic gap between adjacent sentences is more than X standard deviations above the mean distance.
- `'interquartile'`: Use the interquartile range (IQR) of the distances to set the threshold. Gaps well above this typical range are treated as "semantic jumps".

**What does the `'percentile'` setting with a threshold of 90 (used later in this notebook) mean?**  
➡️ *“Split the text where the semantic difference between adjacent sentences falls in the top 10% of all differences.”*
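The three strategies differ only in how they turn the list of distances into a cutoff. The sketch below shows one common way to compute each threshold with NumPy. It reflects my understanding of the approach rather than a copy of the library code, so treat `breakpoint_threshold` as a hypothetical helper and check the source of your installed `langchain-experimental` version for the exact formulas. The distances are made-up numbers.

```python
import numpy as np


def breakpoint_threshold(distances, strategy: str, amount: float) -> float:
    """Return the distance above which a split is made (illustrative formulas)."""
    distances = np.asarray(distances, dtype=float)
    if strategy == "percentile":
        # amount is a percentile in [0, 100]
        return float(np.percentile(distances, amount))
    if strategy == "standard_deviation":
        # amount is the number of standard deviations above the mean
        return float(np.mean(distances) + amount * np.std(distances))
    if strategy == "interquartile":
        # amount scales the interquartile range added to the mean
        q1, q3 = np.percentile(distances, [25, 75])
        return float(np.mean(distances) + amount * (q3 - q1))
    raise ValueError(f"Unknown strategy: {strategy!r}")


# Made-up distances between six adjacent sentences (five gaps)
distances = [0.1, 0.2, 0.1, 0.9, 0.2]

settings = {"percentile": 90, "standard_deviation": 1.0, "interquartile": 1.5}
results = {}
for strategy, amount in settings.items():
    threshold = breakpoint_threshold(distances, strategy, amount)
    breakpoints = [i for i, d in enumerate(distances) if d > threshold]
    results[strategy] = (threshold, breakpoints)
    print(f"{strategy}: threshold={threshold:.3f}, split after gaps {breakpoints}")
```

On this toy input all three settings flag the same gap; on real documents they can disagree, which is why the threshold amount is worth tuning per corpus.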



### 🧠 What Happens Under the Hood?

1. **Sentence Splitting**  
   The document is first broken down into individual sentences.

2. **Embedding Sentences**  
   Each sentence is converted into a high-dimensional vector using OpenAI’s embedding model.

3. **Compute Semantic Distance**  
   For every pair of adjacent sentences, a semantic distance (typically cosine distance) is calculated.  
   These distances measure how much the meaning changes from one sentence to the next.

4. **Detect Breakpoints**  
   With the `'percentile'` strategy and a threshold of 90, the splitter marks a breakpoint wherever the semantic shift is in the **top 10%** of all measured shifts.

5. **Chunk Formation**  
   Sentences between these breakpoints are grouped into coherent chunks, preserving the flow of ideas.


### 📚 Simple Analogy

Think of reading a textbook:

- Sometimes, ideas flow smoothly — one sentence builds on the last.
- Other times, there’s a **clear topic shift** — like a new section, concept, or argument.

**Semantic distances act like a "topic change detector."**

When a big enough shift occurs, it’s as if the algorithm says:

> 🛑 *"Okay, time to start a new paragraph."*


### Why This Matters

Semantic chunking uses **vector math instead of intuition** to create chunks, and the threshold controls how coarse or fine they are:

- **Fewer, broader chunks:** Use a higher threshold (e.g., 95th percentile).
- **Finer-grained splits:** Use a lower threshold (e.g., 75th percentile).
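A quick way to see this trade-off is to count how many chunks a given percentile produces from the same distances. The numbers below are made up for illustration, and `count_chunks` is a hypothetical helper.

```python
import numpy as np


def count_chunks(distances, percentile: float) -> int:
    """Number of chunks when splitting at every distance above the percentile."""
    distances = np.asarray(distances, dtype=float)
    threshold = np.percentile(distances, percentile)
    # Each breakpoint adds one chunk; with no breakpoints there is a single chunk
    return int(np.sum(distances > threshold)) + 1


# Made-up distances between 11 adjacent sentences (10 gaps)
gaps = [0.12, 0.08, 0.55, 0.10, 0.31, 0.09, 0.72, 0.15, 0.40, 0.11]

for p in (75, 90, 95):
    print(f"percentile={p}: {count_chunks(gaps, p)} chunks")
```

Raising the percentile can only keep or reduce the number of breakpoints, so the chunks get fewer and longer.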


### 🎯 The Goal

> **Group sentences into logically self-contained chunks that preserve meaning — ideal for retrieval and generation in RAG systems.**


In [None]:
# Choose the embedding model, the breakpoint strategy, and the threshold
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=90,
)

### Split original text to semantic chunks

In [None]:

docs = text_splitter.create_documents([content])
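Because the splits follow meaning rather than length, chunk sizes can vary a lot, so it can help to inspect them before indexing. The helper below is a hypothetical convenience function, shown here on stand-in strings; in the notebook you could pass `[doc.page_content for doc in docs]` instead.

```python
from statistics import mean


def summarize_chunks(texts: list[str]) -> dict:
    """Return basic size statistics (in characters) for a list of chunk texts."""
    if not texts:
        return {"count": 0, "min_chars": 0, "max_chars": 0, "mean_chars": 0.0}
    lengths = [len(t) for t in texts]
    return {
        "count": len(lengths),
        "min_chars": min(lengths),
        "max_chars": max(lengths),
        "mean_chars": round(mean(lengths), 1),
    }


# Stand-in chunk texts; replace with the page_content of your own documents
example_texts = ["a" * 120, "b" * 480, "c" * 300]
summary = summarize_chunks(example_texts)
print(summary)
```

Very long chunks may exceed what you want to pass to the LLM, and very short ones may carry too little context, so both ends of the range are worth a look.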

### Create vector store and retriever

In [None]:
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(docs, embeddings)
chunks_query_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
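Conceptually, a retriever with `k=2` returns the two stored vectors closest to the query vector. The sketch below shows that idea with cosine similarity on tiny made-up vectors. It is not how FAISS is implemented, and the index may use a different distance metric; `cosine_similarity` and `top_k` are illustrative names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Made-up 3-dimensional "embeddings" for four chunks and one query
chunk_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.05, 0.0]

best = top_k(query_vec, chunk_vecs, k=2)
print(best)
```

A brute-force scan like this is fine for a handful of chunks; a vector store exists to make the same lookup fast at scale.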


### Test the retriever

In [None]:
def retrieve_context_per_question(question, chunks_query_retriever):
    """
    Retrieve the most relevant chunks for a given question.

    Args:
        question (str): The question to retrieve context for.
        chunks_query_retriever: A LangChain retriever, such as the one created above.

    Returns:
        list[str]: The text content of each retrieved document, in ranked order.
    """

    # Retrieve relevant documents for the given question
    # (`invoke` replaces the deprecated `get_relevant_documents`)
    docs = chunks_query_retriever.invoke(question)

    # Keep each chunk as a separate list item so it can be displayed individually
    context = [doc.page_content for doc in docs]

    return context

def show_context(context):
    """
    Display the contents of the provided context list.

    Args:
        context (list): A list of context items to be displayed.

    Prints each context item in the list with a heading indicating its position.
    """
    for i, c in enumerate(context):
        print(f"Context {i + 1}:")
        print(c)
        print("\n")

test_query = "What is the main cause of climate change?"
context = retrieve_context_per_question(test_query, chunks_query_retriever)
show_context(context)

---

## 📘 Summary & Credits

This notebook is based on the excellent open-source repository [RAG_Techniques by NirDiamant](https://github.com/NirDiamant/RAG_Techniques).  
I referred to that work to understand how the pipeline is structured and then reimplemented the same concept in a **fully self-contained** way, but using recent models — as part of my personal learning journey.

The purpose of this notebook is purely **educational**:  
- To deepen my understanding of Retrieval-Augmented Generation systems  
- To keep a clean, trackable log of what I’ve built and learned  
- And to serve as a future reference for myself or others starting from scratch

To support that, I’ve added clear, concise markdown cells throughout the notebook that explain *why* each package is installed, *why* each piece of code exists, and *how* each component fits into the overall RAG pipeline. It’s designed to help anyone (including my future self) grasp the **how** and the **why**, not just the **what**.

## 🔍 Why Use Semantic Chunking in RAG?

Standard chunking methods (like fixed character or token windows) often split text mid-sentence or mid-idea — hurting both retrieval relevance and answer quality.

**Semantic Chunking** addresses this by:
- ✂️ Splitting text at **natural meaning shifts**, not arbitrary lengths  
- 🧠 Creating **coherent, self-contained chunks** that represent full ideas  
- 🎯 Improving retrieval grounding and reducing fragmented or ambiguous answers  

---

## 🧠 What’s New in This Version?

This implementation includes:

- 📐 **Embedding-based sentence segmentation** using OpenAI embeddings  
- 🔍 Flexible **breakpoint strategies**: percentile, standard deviation, or IQR  
- ⚙️ Clean integration with FAISS + LangChain’s `SemanticChunker`  
- 📦 End-to-end pipeline from PDF → semantic chunks → retriever  

---

## 📈 Inferences & Key Takeaways

- ✅ Meaning-aware chunking improves **semantic alignment** during retrieval  
- 🔗 Better suited for **complex or structured documents** like reports, research papers, legal content  
- 🤖 Enhances downstream LLM performance by providing **richer, more coherent context**

---

## 🚀 What Could Be Added Next?

- 🧪 Compare against fixed-size chunking with relevance + faithfulness metrics  
- 📊 Visualize sentence embedding distances and breakpoint distributions  
- 🔄 Extend to support **hybrid chunking**: semantic + metadata-based + structure-aware  
- 🧩 Integrate with rerankers or long-context models for improved QA

---
## 💡 Final Word

This notebook is part of my larger personal project, **RAG100X**: a challenge to build and log my journey in RAG from 0 to 100 over the coming months.

It’s not built to impress — it’s built to **progress**.  
Everything here is structured to enable **daily iteration**, focused experimentation, and clean documentation.

If you're exploring RAG from first principles, feel free to use this as a scaffold for your own builds. And of course — check out the original repository for broader implementations and ideas.
