# Semantic Chunking

- SemanticChunker is a document splitter that uses embedding similarity between adjacent sentences to decide where chunk boundaries go.
- It keeps each chunk semantically coherent, so text is not cut off mid-thought the way it can be with traditional character or token splitters.
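
The idea in the bullets above can be sketched without any embedding model. The block below is a minimal illustration, not the notebook's implementation: `cosine`, `group_by_similarity`, and the toy 2-D vectors are assumptions made up for the example, standing in for real sentence embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_by_similarity(sentences, vectors, threshold=0.7):
    """Start a new chunk whenever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) >= threshold:
            chunks[-1].append(sentences[i])  # same topic: extend current chunk
        else:
            chunks.append([sentences[i]])    # topic shift: open a new chunk
    return [" ".join(chunk) for chunk in chunks]


# Toy data (hypothetical): the first two vectors point the same way, the third does not.
toy_sentences = ["A1", "A2", "B1"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_chunks = group_by_similarity(toy_sentences, toy_vectors)
print(toy_chunks)
```

Only neighbouring sentences are compared, so a single low-similarity pair is enough to end a chunk.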

### 1. Basics

In [1]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np



In [2]:
# Initialize the model
model=SentenceTransformer("all-MiniLM-L6-v2")

# Sample text
text = """
LangChain is a framework for building applications with LLM.
LangChain provides more modular abstractions to combine LLMs with tools like PineCone and OpenAI
You can create chains, agents, memory and retriever
The Eiffel Tower is located in Paris.
France is a popular tourist destination.
"""

# 1. Split into sentences
sentences = [s.strip() for s in text.split("\n") if s.strip()]

# 2. Sentence embeddings
embeddings=model.encode(sentences)

# 3. Initialize the parameters & hyperparameters
threshold = 0.7
chunks = []
current_chunk=[sentences[0]]

# 4. Semantic grouping based on threshold

for i in range(1, len(sentences)):
    sim = cosine_similarity(
        [embeddings[i-1]],
        [embeddings[i]]
    )[0][0]

    if sim >= threshold:
        current_chunk.append(sentences[i])
    else:
        chunks.append(" ".join(current_chunk))
        current_chunk=[sentences[i]]
# Add the final chunk to the list of chunks
chunks.append(" ".join(current_chunk))

# Output the chunks
print("\n📌 Semantic Chunks:")
for idx, chunk in enumerate(chunks):
    print(f"\nChunk{idx+1}:\n{chunk}")



📌 Semantic Chunks:

Chunk1:
LangChain is a framework for building applications with LLM. LangChain provides more modular abstractions to combine LLMs with tools like PineCone and OpenAI

Chunk2:
You can create chains, agents, memory and retriever

Chunk3:
The Eiffel Tower is located in Paris.

Chunk4:
France is a popular tourist destination.
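
Note that the cell above splits on newlines, so each line of the sample text is treated as one sentence. For text that is not laid out one sentence per line, a punctuation-based split is one option. The helper below is a rough sketch under that assumption; `split_sentences` is a hypothetical name, and the regex does not handle abbreviations or decimal numbers.

```python
import re


def split_sentences(text):
    """Split after '.', '?' or '!' when followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


sample = "The Eiffel Tower is located in Paris. France is a popular tourist destination."
sentences = split_sentences(sample)
print(sentences)
```

With this approach, lines that lack a closing period (such as the second and third lines of the sample text) would stay attached to the line that follows them.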


### 2. RAG pipeline modularized

In [7]:
# Import libraries
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings  # the langchain_community import path is deprecated
from langchain.chat_models import init_chat_model
from langchain_core.runnables import RunnablePassthrough, RunnableLambda, RunnableMap
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.document_loaders import DirectoryLoader, TextLoader, PyPDFLoader
import os
from dotenv import load_dotenv
load_dotenv()
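# load_dotenv() has already copied GROQ_API_KEY into os.environ; fail early if it is missing
if not os.getenv("GROQ_API_KEY"):
    raise RuntimeError("GROQ_API_KEY is not set")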


In [37]:
# Custom semantic chunker with threshold

class ThresholdSemanticChunker:
    def __init__(self, modelname="all-MiniLM-L6-v2", threshold=0.7):
        self.model=SentenceTransformer(modelname)
        self.threshold=threshold
    
    def split_text(self,text:str):
        """ Split the text into semantic chunks of sentences """
        
        sentences=[s.strip() for s in text.split("\n") if s.strip()]
        if not sentences:
            return []  # nothing to chunk; avoids an IndexError on sentences[0]
        embeddings=self.model.encode(sentences)
        chunks=[]
        current_chunk=[sentences[0]]

        for i in range(1, len(sentences)):
            sim = cosine_similarity(
                [embeddings[i-1]],
                [embeddings[i]]
                )[0][0]
            if sim >= self.threshold:  # use the instance threshold, not a global variable
                current_chunk.append(sentences[i])
            else:
                chunks.append(" ".join(current_chunk))
                current_chunk=[sentences[i]]
        # Add the final chunk to the list of chunks
        chunks.append(" ".join(current_chunk))
        return chunks
    
    def split_docs(self, docs):
        """ Split documents into semantic chunks of sentences """
        doc_chunks = []

        for doc in docs:
            print(doc.page_content)
            print(doc.metadata)
            for chunk in self.split_text(doc.page_content):
                doc_chunks.append(Document(page_content=chunk, metadata=doc.metadata))
        
        return doc_chunks


In [38]:
# Sample text
text = """
LangChain is a framework for building applications with LLM.
LangChain provides more modular abstractions to combine LLMs with tools like PineCone and OpenAI
You can create chains, agents, memory and retriever
The Eiffel Tower is located in Paris.
France is a popular tourist destination.
"""
doc = Document(
    page_content=text,
    metadata={"source":"blog", "page":1}
)
doc

Document(metadata={'source': 'blog', 'page': 1}, page_content='\nLangChain is a framework for building applications with LLM.\nLangChain provides more modular abstractions to combine LLMs with tools like PineCone and OpenAI\nYou can create chains, agents, memory and retriever\nThe Eiffel Tower is located in Paris.\nFrance is a popular tourist destination.\n')

In [None]:
# Run the custom ThresholdSemanticChunker
chunker=ThresholdSemanticChunker(threshold=0.7)
chunks = chunker.split_docs([doc])


LangChain is a framework for building applications with LLM.
LangChain provides more modular abstractions to combine LLMs with tools like PineCone and OpenAI
You can create chains, agents, memory and retriever
The Eiffel Tower is located in Paris.
France is a popular tourist destination.

{'source': 'blog', 'page': 1}
[Document(metadata={'source': 'blog', 'page': 1}, page_content='LangChain is a framework for building applications with LLM. LangChain provides more modular abstractions to combine LLMs with tools like PineCone and OpenAI'), Document(metadata={'source': 'blog', 'page': 1}, page_content='You can create chains, agents, memory and retriever'), Document(metadata={'source': 'blog', 'page': 1}, page_content='The Eiffel Tower is located in Paris.'), Document(metadata={'source': 'blog', 'page': 1}, page_content='France is a popular tourist destination.')]


In [40]:
chunks

[Document(metadata={'source': 'blog', 'page': 1}, page_content='LangChain is a framework for building applications with LLM. LangChain provides more modular abstractions to combine LLMs with tools like PineCone and OpenAI'),
 Document(metadata={'source': 'blog', 'page': 1}, page_content='You can create chains, agents, memory and retriever'),
 Document(metadata={'source': 'blog', 'page': 1}, page_content='The Eiffel Tower is located in Paris.'),
 Document(metadata={'source': 'blog', 'page': 1}, page_content='France is a popular tourist destination.')]

In [44]:
# Create vectorstore and retriever
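# OpenAIEmbeddings reads OPENAI_API_KEY from the environment; fail early if it is missing
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set")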
embedding=OpenAIEmbeddings()
vectorstore=FAISS.from_documents(chunks,embedding)
retriever=vectorstore.as_retriever()

In [45]:
# Create prompt template and prompt
prompt_template="""You are a helpful AI assistant. Answer the question based on the provided context.
    If you cannot find the answer in the context, say "I don't have enough information to answer that question."
    Be concise and accurate in your responses. 
        
    Context: 
    {context}

    Question: {question}
    Answer:"""

prompt=ChatPromptTemplate.from_template(prompt_template)
prompt

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='You are a helpful AI assistant. Answer the question based on the provided context.\n    If you cannot find the answer in the context, say "I don\'t have enough information to answer that question."\n    Be concise and accurate in your responses. \n\n    Context: \n    {context}\n\n    Question: {question}\n    Answer:'), additional_kwargs={})])

In [46]:
retriever

VectorStoreRetriever(tags=['FAISS', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x3060eae50>, search_kwargs={})

In [57]:
# Create LLM and Build LCEL Chain
llm=init_chat_model(model="groq:llama-3.1-8b-instant", temperature=0.4)

rag_chain = (
    RunnableMap(
        {
        "context": lambda x: retriever.invoke(x['question']),
        "question": lambda x: x['question'],
        }
    
    )
    | prompt
    | llm
    | StrOutputParser()
)
rag_chain

{
  context: RunnableLambda(...),
  question: RunnableLambda(...)
}
| ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='You are a helpful AI assistant. Answer the question based on the provided context.\n    If you cannot find the answer in the context, say "I don\'t have enough information to answer that question."\n    Be concise and accurate in your responses. \n\n    Context: \n    {context}\n\n    Question: {question}\n    Answer:'), additional_kwargs={})])
| ChatGroq(profile={'max_input_tokens': 131072, 'max_output_tokens': 8192, 'image_inputs': False, 'audio_inputs': False, 'video_inputs': False, 'image_outputs': False, 'audio_outputs': False, 'video_outputs': False, 'reasoning_output': False, 'tool_calling': True}, client=<groq.resources.chat.completions.Completions object at 0x133e7b3

In [58]:
# Run the RAG Chain with a question
query = {"question": "What is langchain used for?"}
result = rag_chain.invoke(query)

In [59]:
result

'LangChain is a framework for building applications with LLM (Large Language Model).'
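
One thing to watch in the chain above: the `context` lambda returns a list of `Document` objects, so the prompt receives their full repr, metadata included. A common alternative is to join only the text before it reaches the prompt. The sketch below assumes a hypothetical `format_docs` helper and uses a small `Doc` dataclass as a stand-in for `Document` so it runs without LangChain installed.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of retrieved documents, separated by blank lines."""
    return "\n\n".join(d.page_content for d in docs)


retrieved = [Doc("LangChain is a framework."), Doc("You can create chains.")]
context = format_docs(retrieved)
print(context)
```

In the chain, this would presumably be wired in as `"context": lambda x: format_docs(retriever.invoke(x["question"]))`.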

### 3. Semantic Chunker with LangChain

In [60]:
from langchain_openai import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.document_loaders import TextLoader

In [64]:
### Load the documents
loader=TextLoader("./langchain_intro.txt")
docs=loader.load()

#Initialize the embedding model
embedding=OpenAIEmbeddings()

# Create semantic chunker
chunker=SemanticChunker(embedding)

# Split the documents
chunks=chunker.split_documents(docs)

# Display chunks
for i, chunk in enumerate(chunks):
    print(f"\nChunk{i+1}: \n")
    print(f"Content: {chunk.page_content}")
    print(f"\nMetadata: {chunk.metadata}")



Chunk1: 

Content: LangChain is a framework for building applications with LLM. LangChain provides more modular abstractions to combine LLMs with tools like PineCone and OpenAI
You can create chains, agents, memory and retriever
The Eiffel Tower is located in Paris.

Metadata: {'source': './langchain_intro.txt'}

Chunk2: 

Content: France is a popular tourist destination.

Metadata: {'source': './langchain_intro.txt'}
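
The split here differs from the fixed-threshold chunker in sections 1 and 2. As far as I know, LangChain's `SemanticChunker` defaults to a percentile-based breakpoint: it computes distances between neighbouring sentence groups and splits only where a distance exceeds a high percentile (95 by default) of all distances. The block below is a simplified sketch of that rule, not the library's code; `percentile`, `breakpoints`, and the toy distances are assumptions for illustration.

```python
def percentile(values, q):
    """Percentile with linear interpolation between ranks (q in 0..100)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def breakpoints(distances, q=95):
    """Indices of gaps whose distance exceeds the q-th percentile."""
    cutoff = percentile(distances, q)
    return [i for i, d in enumerate(distances) if d > cutoff]


# Toy distances (hypothetical) between five consecutive sentences.
toy_distances = [0.20, 0.35, 0.80, 0.30]
split_after = breakpoints(toy_distances)
print(split_after)
```

With only a handful of gaps, the 95th-percentile cutoff sits just below the largest distance, so a single split (two chunks) is the typical outcome. That may be why the short sample file above yields exactly two chunks.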
