# Semantic Chunking
- `SemanticChunker` is a document splitter that uses the embedding similarity between adjacent sentences to decide where chunk boundaries fall.
- It aims to keep each chunk semantically coherent, instead of cutting text off mid-thought the way fixed-size character/token splitters can.

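As a rough illustration of the similarity measure behind those boundaries, the sketch below computes cosine similarity in plain Python. The vectors are made-up toy values rather than real sentence embeddings, and `cosine` is a hypothetical helper name used only here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings (assumed values).
same_direction = cosine([1.0, 0.0], [2.0, 0.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # no shared direction
print(same_direction, orthogonal)
```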
In [1]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [None]:
## Initialize the model
model=SentenceTransformer('all-MiniLM-L6-v2')

## Sample text
text="""
LangChain is a framework for building applications with LLMs.
Langchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.
You can create chains, agents, memory, and retrievers.
The Eiffel Tower is located in Paris.
France is a popular tourist destination.
"""

## Step 1: Split into sentences (the sample has one sentence per line)
sentences=[s.strip() for s in text.split("\n") if s.strip()]

## Step 2: Embed each sentence
embeddings=model.encode(sentences)

## Step 3: Initialize parameters
threshold = 0.7  # minimum cosine similarity for adjacent sentences to share a chunk
chunks = []
current_chunk=[sentences[0]]

## Step 4: Semantic grouping based on threshold

for i in range(1, len(sentences)):
    sim = cosine_similarity(
        [embeddings[i - 1]],
        [embeddings[i]]
    )[0][0]

    if sim>=threshold:
        current_chunk.append(sentences[i])
    else:
        chunks.append(" ".join(current_chunk))
        current_chunk=[sentences[i]]

# Append the last chunk
chunks.append(" ".join(current_chunk))

# Output the chunks
print("\n📌 Semantic Chunks:")
for idx, chunk in enumerate(chunks):
    print(f"\nChunk {idx+1}:\n{chunk}")


📌 Semantic Chunks:

Chunk 1:
LangChain is a framework for building applications with LLMs. Langchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.

Chunk 2:
You can create chains, agents, memory, and retrievers.

Chunk 3:
The Eiffel Tower is located in Paris.

Chunk 4:
France is a popular tourist destination.

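The result above has four chunks because only the first pair of sentences clears the 0.7 threshold. To get a feel for how the threshold changes the grouping without downloading a model, here is a minimal sketch that runs the same loop over precomputed similarity scores. The sentences, the scores, and the name `group_by_threshold` are all hypothetical and chosen only for illustration.

```python
def group_by_threshold(sentences, adjacent_sims, threshold):
    """Group consecutive sentences; start a new chunk when similarity drops below threshold.

    adjacent_sims[i] is the similarity between sentences[i] and sentences[i + 1].
    """
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], adjacent_sims):
        if sim >= threshold:
            current.append(sentence)
        else:
            chunks.append(" ".join(current))
            current = [sentence]
    chunks.append(" ".join(current))
    return chunks

# Hypothetical sentences and similarity scores (not produced by a real model).
sentences = ["A1.", "A2.", "A3.", "B1.", "B2."]
adjacent_sims = [0.8, 0.6, 0.1, 0.5]

for threshold in (0.3, 0.7):
    print(threshold, group_by_threshold(sentences, adjacent_sims, threshold))
```

With these toy scores, the lower threshold yields two chunks and the higher one yields four, so a higher threshold produces smaller, tighter chunks.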

### RAG Pipeline Modular Coding

In [None]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from langchain.schema import Document
from langchain.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.chat_models import init_chat_model
from langchain.schema.runnable import RunnableLambda, RunnableMap
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
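import os
# Assumes GROQ_API_KEY is already set in the environment (for example via a .env file).
# Re-assigning os.getenv(...) raises a TypeError when the variable is missing, so check explicitly.
if not os.getenv("GROQ_API_KEY"):
    raise EnvironmentError("GROQ_API_KEY is not set")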

In [None]:
### Custom Semantic Chunker With Threshold

class ThresholdSematicChunker:
    def __init__(self,model_name="all-MiniLM-L6-v2",threshold=0.7):
        self.model=SentenceTransformer(model_name)
        self.threshold=threshold 

    def split(self, text: str):
        # Naive sentence split on periods; adequate for this sample text
        sentences = [s.strip() for s in text.split('.') if s.strip()]
        if not sentences:
            return []  # avoid an IndexError on empty or whitespace-only input
        embeddings = self.model.encode(sentences)
        chunks = []
        current_chunk = [sentences[0]]

        for i in range(1, len(sentences)):
            # Compare each sentence with the one immediately before it
            sim = cosine_similarity([embeddings[i - 1]], [embeddings[i]])[0][0]
            if sim >= self.threshold:
                current_chunk.append(sentences[i])
            else:
                chunks.append(". ".join(current_chunk) + ".")
                current_chunk = [sentences[i]]

        chunks.append(". ".join(current_chunk) + ".")
        return chunks
    
    def split_documents(self,docs):
        result=[]
        for doc in docs:
            for chunk in self.split(doc.page_content):
                # Copy the metadata so chunks do not share one mutable dict
                result.append(Document(page_content=chunk, metadata=dict(doc.metadata)))

        return result

In [5]:
# Sample text
sample_text = """
LangChain is a framework for building applications with LLMs.
Langchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.
You can create chains, agents, memory, and retrievers.
The Eiffel Tower is located in Paris.
France is a popular tourist destination.
"""

doc = Document(page_content=sample_text)
doc

Document(metadata={}, page_content='\nLangChain is a framework for building applications with LLMs.\nLangchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.\nYou can create chains, agents, memory, and retrievers.\nThe Eiffel Tower is located in Paris.\nFrance is a popular tourist destination.\n')

In [6]:
### Chunking
chunker=ThresholdSematicChunker(threshold=0.7)
chunks=chunker.split_documents([doc])
chunks

[Document(metadata={}, page_content='LangChain is a framework for building applications with LLMs. Langchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.'),
 Document(metadata={}, page_content='You can create chains, agents, memory, and retrievers.'),
 Document(metadata={}, page_content='The Eiffel Tower is located in Paris.'),
 Document(metadata={}, page_content='France is a popular tourist destination.')]

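The chunker above splits on every period, which works for this sample but would also break text such as "3.14" and would drop question marks and exclamation marks. One possible alternative is a small regular-expression splitter, sketched below. `split_sentences` is a hypothetical helper and not part of the class above, and the pattern is a simple heuristic that still mishandles abbreviations such as "e.g. ".

```python
import re

# Split on whitespace that follows ., ! or ? (a simple heuristic, not a full tokenizer).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    """Return non-empty sentences, keeping their closing punctuation."""
    return [s.strip() for s in _SENTENCE_END.split(text.strip()) if s.strip()]

print(split_sentences("Version 3.14 is out. Is it stable? Yes!"))
```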
In [None]:
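### VectorStore
import os
# Assumes OPENAI_API_KEY is already set in the environment; OpenAIEmbeddings reads it from there.
if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is not set")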
embedding=OpenAIEmbeddings()
vectorstore=FAISS.from_documents(chunks,embedding)
retriever=vectorstore.as_retriever()

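Conceptually, the retriever embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea as a brute-force search in plain Python. It is a stand-in for illustration only, not how FAISS indexes or searches internally, and the 2-D vectors, chunk labels, and `top_k` helper are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query vector."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]

# Toy 2-D vectors standing in for real embeddings (assumed values).
chunks = ["LangChain overview", "Eiffel Tower", "LangChain components"]
chunk_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]
print(top_k([1.0, 0.0], chunk_vecs, chunks, k=2))
```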
In [9]:
## Prompt Template
template = """Answer the question based on the following context:

{context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
prompt

PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='Answer the question based on the following context:\n\n{context}\n\nQuestion: {question}\n')

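To see roughly what text the model ends up receiving, the same template can be filled by hand with Python's built-in `str.format`, used here as a plain-Python stand-in for the prompt object. The retrieved chunks below are hypothetical; in the pipeline they would come from the retriever.

```python
template = """Answer the question based on the following context:

{context}

Question: {question}
"""

# Hypothetical retrieved chunks (in the real pipeline these come from the retriever).
retrieved = [
    "LangChain is a framework for building applications with LLMs.",
    "You can create chains, agents, memory, and retrievers.",
]

# Join the chunks with blank lines so each one stays visually separate in the prompt.
rendered = template.format(
    context="\n\n".join(retrieved),
    question="What is LangChain used for?",
)
print(rendered)
```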
In [10]:
## LLM
llm=init_chat_model(model="groq:gemma2-9b-it",temperature=0.4)

### LCEL Chain With retrieval

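rag_chain=(
    RunnableMap(
        {
        # Join the retrieved chunks' text so the prompt gets plain text, not a list of Document objects
        "context": lambda x: "\n\n".join(d.page_content for d in retriever.invoke(x["question"])),
        "question": lambda x: x["question"],
        }
    )
    | prompt
    | llm
    | StrOutputParser()
)

## Run the query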
query = {"question": "What is LangChain used for?"}
result = rag_chain.invoke(query)

print(result)

According to the provided context, LangChain is a framework for building applications with LLMs. 



### Semantic Chunker With LangChain

In [1]:
from langchain_openai import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain.document_loaders import TextLoader

In [2]:
## Load the documents
loader=TextLoader("langchain_intro.txt")
docs=loader.load()

## Initialize embedding model
embedding=OpenAIEmbeddings()

## Create the semantic chunker
chunker=SemanticChunker(embedding)

## Split the documents
chunks=chunker.split_documents(docs)

## Result

for i,chunk in enumerate(chunks):
    print(f"\n chunk {i+1}:\n{chunk.page_content}")


 chunk 1:
LangChain is a framework for building applications with LLMs. Langchain provides modular abstractions to combine LLMs with tools like OpenAI and Pinecone.

 chunk 2:
You can create chains, agents, memory, and retrievers. The Eiffel Tower is located in Paris. France is a popular tourist destination.

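Unlike the fixed 0.7 threshold used earlier, LangChain's `SemanticChunker` picks its breakpoints relative to the document itself. As far as the library's documentation describes it, the default is a percentile rule: split wherever the distance between neighbouring sentences exceeds a high percentile of all such distances. The sketch below approximates that idea in plain Python. It is not the library's code, it ignores details such as sentence buffering, and the distances and helper names are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)

def percentile_breakpoints(distances, q=95):
    """Indices i where the gap between sentence i and i + 1 exceeds the q-th percentile."""
    cutoff = percentile(distances, q)
    return [i for i, d in enumerate(distances) if d > cutoff]

# Hypothetical cosine distances (1 - similarity) between adjacent sentences.
distances = [0.2, 0.6, 0.5, 0.3]
print(percentile_breakpoints(distances))
```

With only a handful of sentences, a high percentile leaves room for very few breakpoints, which is one plausible reason a short document ends up in just two chunks.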

# Note

| **Aspect**                      | **Traditional (Fixed-Size) Chunking**                                               | **Semantic Chunking (Meaning-Based)**                                                     | **Examples / Framework Implementations**                                              |
| ------------------------------- | ----------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
| **Definition**                  | Splits text into chunks of a fixed character or token length (e.g., 500–1000 tokens). | Splits text dynamically based on meaning, context, and sentence structure.                | LangChain `SemanticChunker`, LlamaIndex `SemanticSplitterNodeParser`, Haystack SemanticTextSplitter |
| **Chunk Boundaries**            | Determined by token count, not context.                                             | Determined by semantic similarity, coherence, or embedding similarity.                    | Uses cosine similarity between embeddings to decide split points.                     |
| **Context Preservation**        | Often breaks sentences or thoughts in half, losing meaning.                         | Tends to preserve complete semantic units: paragraphs, topics, or entities stay together. | LangChain's `SemanticChunker` aims to minimize semantic loss.                         |
| **Embedding Quality**           | Lower-quality embeddings due to broken context.                                     | Higher-quality embeddings since each chunk contains coherent meaning.                     | Especially effective with dense embedding models (e.g., `text-embedding-3-large`).    |
| **Retrieval Accuracy**          | May retrieve irrelevant or partial chunks.                                          | More relevant document retrieval, as chunks align better with queries.                    | Improves recall and precision in vector search.                                       |
| **Chunk Overlap**               | Requires manual overlap (e.g., 100–200 tokens) to maintain context continuity.      | Usually needs little or no overlap, because boundaries fall where the topic shifts rather than mid-thought. | Boundary placement is controlled by the similarity threshold.                         |
| **Processing Cost**             | Simpler and faster, with low compute overhead.                                      | Higher compute cost, since every sentence is embedded to detect boundaries.               | Overhead scales with the number of sentences embedded.                                |
| **Best Use Cases**              | Structured text, code, or fixed-length documents.                                   | Unstructured, narrative, or multi-topic documents (PDFs, research papers, transcripts).   | Particularly useful in knowledge bases and multi-topic RAG systems.                   |
| **Drawbacks**                   | Loses semantic meaning and cross-sentence context.                                  | Slower and more complex preprocessing pipeline.                                           | Trade-off: quality vs. speed.                                                         |
| **Integration with Vector DBs** | Works fine with any vector store (Chroma, Pinecone, FAISS).                         | Works with the same vector stores; chunking happens before indexing and is independent of the store. | For example, Chroma or FAISS with LangChain `SemanticChunker`.                        |
| **Example Pipeline**            | `RecursiveCharacterTextSplitter` → Embed → Store → Retrieve                         | `SemanticChunker` → Compute embeddings → Segment semantically → Store → Retrieve          | LangChain `SemanticChunker` + `OpenAIEmbeddings` + Chroma                             |

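For comparison with the semantic approach, here is a minimal sketch of the fixed-size chunking with manual overlap described in the table. Whitespace-separated words stand in for tokens, which is an assumption since real tokenizers differ, and `fixed_size_chunks` is a hypothetical helper rather than a LangChain API.

```python
def fixed_size_chunks(tokens, size, overlap):
    """Split a token list into windows of `size` tokens that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks

# Words stand in for tokens here (an assumption made for illustration).
words = "one two three four five six seven".split()
for chunk in fixed_size_chunks(words, size=4, overlap=2):
    print(" ".join(chunk))
```

Note that the boundaries depend only on position, so a window can start or end in the middle of a sentence, which is the behaviour the overlap is meant to soften.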

| **Metric**                | **Traditional Chunking** | **Semantic Chunking**            |
| ------------------------- | ------------------------ | -------------------------------- |
| Context Retention         | ❌ Often lost             | ✅ Preserved                      |
| Embedding Relevance       | ⚪ Medium                 | 🟢 High                          |
| Retrieval Precision       | ⚪ Average                | 🟢 Strong                        |
| Computation Cost          | 🟢 Low                   | 🔴 Moderate                      |
| Implementation Complexity | 🟢 Simple                | 🔴 Moderate                      |
| Ideal For                 | Structured / short text  | Unstructured / long-form content |

