In [1]:
# Semantic Chunking

**Semantic chunking** is a text processing technique that splits a document into meaningful, contextually coherent segments. Unlike traditional chunking methods, which divide text at fixed word or character counts, semantic chunking uses an embedding model to compare neighboring sentences and places breakpoints where the meaning shifts, so that each segment retains its context.

**How It Overcomes Traditional Chunking:** Traditional chunking methods often break text arbitrarily, which can disrupt the flow of information and scatter related content across different chunks. This leads to difficulties in understanding and retrieving relevant information. Semantic chunking addresses this by using language models to identify logical split points, ensuring that each chunk is a semantically coherent unit. This results in more relevant and contextually accurate retrieval.
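To make the idea of a "logical split point" concrete, here is a minimal sketch of percentile-based breakpoint detection. It is an illustration under stated assumptions, not the library's implementation: the 2-D vectors are hypothetical stand-ins for real sentence embeddings, and `cosine_distance`, `percentile`, and `semantic_chunks` are names invented for this example.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; larger values mean the vectors point in more different directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def percentile(values, q):
    """Linearly interpolated percentile of values, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def semantic_chunks(sentences, vectors, threshold_percentile=90):
    """Start a new chunk wherever the distance between adjacent sentences exceeds the percentile cutoff."""
    # distances[i] is the distance between sentence i and sentence i + 1
    distances = [cosine_distance(vectors[i], vectors[i + 1])
                 for i in range(len(vectors) - 1)]
    cutoff = percentile(distances, threshold_percentile)

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > cutoff:
            # Large jump in meaning: close the current chunk before adding this sentence
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Hypothetical toy data: three sentences about cats, two about finance
sentences = [
    "Cats purr when content.",
    "Kittens sleep most of the day.",
    "Cats groom themselves often.",
    "Interest rates rose last quarter.",
    "Bond yields followed them up.",
]
vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.1, 1.0], [0.2, 0.9]]

chunks = semantic_chunks(sentences, vectors)
for chunk in chunks:
    print(chunk)
```

With these toy vectors the only distance above the 90th-percentile cutoff is the one between the third and fourth sentences, so the text splits at the topic change rather than at a fixed length.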

In [None]:
from dotenv import load_dotenv
from langchain_community.vectorstores import FAISS
from langchain_core.embeddings import Embeddings
from langchain_experimental.text_splitter import SemanticChunker
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

# Load environment variables from a .env file
load_dotenv()

# Load local model (using a model from Hugging Face)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Read a PDF into a single string.
# Assumption: pypdf is used for extraction; any PDF text extractor would work here.
def read_pdf_to_string(path):
    reader = PdfReader(path)
    # extract_text() may return nothing for image-only pages, so fall back to ""
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Define file path
path = "sample.pdf"
content = read_pdf_to_string(path)

# Wrap the local model in the Embeddings interface that LangChain expects
class LocalModelEmbeddings(Embeddings):
    def embed_documents(self, texts):
        # One vector per input text, converted from a NumPy array to plain lists
        return model.encode(texts).tolist()

    def embed_query(self, text):
        # A single vector for the query string
        return model.encode(text).tolist()

embeddings = LocalModelEmbeddings()

# Use custom embeddings and breakpoint strategy for semantic chunking
text_splitter = SemanticChunker(embeddings,
                                breakpoint_threshold_type='percentile',
                                breakpoint_threshold_amount=90)

# Create semantic chunks
docs = text_splitter.create_documents([content])

# Create vector store using FAISS with local embeddings
vectorstore = FAISS.from_documents(docs, embeddings)

# Create a retriever to query the chunks
chunks_query_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# Helper that returns the text of the retrieved chunks.
# Assumption: the original helper is not shown, so this is a minimal stand-in.
def retrieve_context_per_question(question, retriever):
    return [doc.page_content for doc in retriever.invoke(question)]

# Test the retriever with a query
test_query = "Give me an overview of different traditional word embeddings"
context = retrieve_context_per_question(test_query, chunks_query_retriever)
print(context)

**Benefits of Using Semantic Chunking:**

**Improved Coherence:** Ensures that each chunk contains complete and meaningful segments of text, enhancing the relevance of retrieved information.

**Better Retrieval Accuracy:** By preserving the context within chunks, retrieval systems can provide more accurate answers.

**Enhanced Performance:** Downstream NLP tasks, such as question answering or summarization, perform better when processing coherent text segments.

**Adaptability:** The chunking process can be fine-tuned to different types of documents and tasks by adjusting the breakpoints and thresholds.

**Local Processing:** Running the process with local models eliminates dependency on external APIs, allowing for better control over the data and reducing costs.

This method is particularly valuable for processing long and complex documents where maintaining context and meaning within each chunk is crucial.
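The adaptability point can be illustrated with a small sketch of how the percentile threshold controls chunk granularity. The distance values below are hypothetical, and `percentile` and `count_chunks` are helper names made up for this example; in the notebook's code the corresponding knob is `breakpoint_threshold_amount`.

```python
import math

def percentile(values, q):
    """Linearly interpolated percentile of values, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def count_chunks(distances, threshold_percentile):
    """Each distance above the cutoff is a breakpoint, and n breakpoints give n + 1 chunks."""
    cutoff = percentile(distances, threshold_percentile)
    return 1 + sum(1 for d in distances if d > cutoff)

# Hypothetical distances between consecutive sentences of one document
distances = [0.05, 0.40, 0.08, 0.75, 0.10, 0.55, 0.07, 0.90, 0.06]

for q in (95, 75, 50):
    print(f"percentile={q}: {count_chunks(distances, q)} chunks")
```

A lower percentile treats more of the distance peaks as breakpoints and so produces more, smaller chunks. A higher percentile keeps only the sharpest topic shifts, which may suit documents with long, cohesive sections.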