### Chunking
- Language models do better when you increase the signal-to-noise ratio of the context you give them.
- Distracting information in the model's context window tends to measurably degrade the performance of the overall application.
- The act of gathering the right information for the LLM is called retrieval.
- [YouTube video link](https://www.youtube.com/watch?v=8OJC21T2SL4) 
- [Semantic Chunking NoteBook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)
- [Medium Semantic Chunking](https://medium.com/the-ai-forum/semantic-chunking-for-rag-f4733025d5f5)

In [None]:
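# PyPDFLoader lives in langchain_community in recent LangChain releases
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
# Semantic_Chunking is presumably a local module next to this notebook
from Semantic_Chunking import SemanticChunker
from langchain_community.vectorstores import FAISS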

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
model_e5_large_path = r'D:\AI Models\intfloat multilingual-e5-large'
model_kwargs = {'device':device}
embedding = HuggingFaceEmbeddings(model_name=model_e5_large_path, model_kwargs = model_kwargs)

In [None]:
loader = PyPDFLoader(r"C:\Users\laxmidhar.routa\Downloads\1810.04805v2.pdf")
documents = loader.load()

with open(r'C:\Users\laxmidhar.routa\Downloads\mit.txt') as file:
    essay = file.read()

In [None]:
# Fixed-size baseline: chunks of at most 1000 characters, no overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)
naive_chunks = text_splitter.split_documents(documents)
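
For intuition, the recursive character splitting used above can be sketched in plain Python. This is a simplified stand-in, not LangChain's implementation: `recursive_split` is a hypothetical helper, it assumes the separator order `"\n\n"`, `"\n"`, `" "`, `""`, drops the separator at chunk boundaries, and ignores overlap (which matches `chunk_overlap=0`).

```python
def recursive_split(text, chunk_size=1000, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    preferring coarse separators (paragraphs) over fine ones (words)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Still fits: keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks
```

The point to notice is that the cut positions depend only on character counts and separators, never on meaning, so a chunk can end in the middle of a topic.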

In [None]:
semantic_chunker = SemanticChunker(embedding, breakpoint_threshold_type="percentile")
semantic_chunks = semantic_chunker.create_documents([d.page_content for d in documents])
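
A rough sketch of what a `"percentile"` breakpoint strategy usually means: embed consecutive sentences, measure the distance between neighbours, and start a new chunk wherever the distance exceeds a chosen percentile of all distances. This assumes the `SemanticChunker` imported above follows that idea. `toy_embed` is a hypothetical bag-of-words stand-in for the real embedding model, and the default of 95 is an assumption for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: word-count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_split(text, pct=95, embed=toy_embed):
    """Group sentences into chunks, breaking where neighbours differ most."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if len(sentences) < 2:
        return sentences
    vectors = [embed(s) for s in sentences]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large jump in meaning: close the current chunk here.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

Lowering `pct` produces more, smaller chunks; raising it toward 100 produces fewer, larger ones.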

In [None]:
len(semantic_chunks)

In [None]:
semantic_chunk_vectorstore = FAISS.from_documents(semantic_chunks, embedding)
naive_chunk_vectorstore = FAISS.from_documents(naive_chunks, embedding)

In [None]:
semantic_chunk_retriever = semantic_chunk_vectorstore.as_retriever(search_kwargs={"k" : 1})
semantic_chunk_retriever.invoke("Describe the Feature-based Approach with BERT?")

In [None]:
naive_chunk_retriever = naive_chunk_vectorstore.as_retriever(search_kwargs={"k" : 1})
naive_chunk_retriever.invoke("Describe the Feature-based Approach with BERT?")
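
Both retrievers above return the single nearest chunk (`k=1`) for the query. Conceptually that is a nearest-neighbour search over the stored embeddings. A minimal sketch in plain Python, assuming Euclidean distance as the metric; `top_k` and the two-dimensional vectors are hypothetical and only illustrate the ranking step, not how FAISS indexes data:

```python
import math


def top_k(query_vec, stored, k=1):
    """Return the k (distance, text) pairs closest to query_vec.

    stored is a list of (text, vector) pairs; distance is Euclidean.
    """
    scored = [(math.dist(query_vec, vec), text) for text, vec in stored]
    return sorted(scored)[:k]


# Hypothetical toy "embeddings" for two chunks
stored = [
    ("chunk about fine-tuning", [1.0, 0.0]),
    ("chunk about the feature-based approach", [0.0, 1.0]),
]
best = top_k([0.1, 0.9], stored, k=1)
```

With `k=1`, retrieval quality depends entirely on whether the one returned chunk holds the full answer, which is why the chunk boundaries matter for this comparison.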