# Goal
    1.Data Ingestion - Load PDFs, text files, HTML, CSVs
    2.Advanced Chunking - Recursive, semantic
    3.Vector Indexing - ChromaDB
    4.Hybrid Search - Dense (embeddings) + Sparse (BM25)
    5.Re-ranking - Cohere API & Cross-Encoder models
    6.Query Transformation - Multi-query, HyDE, Step-back prompting
    7.Context Compression - LLM-based relevance filtering
    8.Generation with Citations - Answers with source attribution
    9.Evaluation Metrics - MRR, Recall@K, answer quality
    10.Complete Orchestration - Easy-to-use pipeline class

## 1.Data Ingestion - Load PDFs, text files, HTML, CSVs 

In [1]:

from langchain.document_loaders import (PyPDFLoader,TextLoader,Docx2txtLoader,DirectoryLoader,UnstructuredHTMLLoader,CSVLoader)
from typing import List,Dict,Tuple
import re

class DataIngestion:
    
    @staticmethod
    def load_pdfs(file_path:str):
        loader=PyPDFLoader(file_path)
        return loader.load()
    
    @staticmethod
    def load_text(file_path:str):
        loader=TextLoader(file_path)
        return loader.load()
    
    @staticmethod
    def load_directory(directory_path:str,glob_pattern:str='**/*.pdf'):
        loader=DirectoryLoader(
            directory_path,
            glob=glob_pattern,
            loader_cls=PyPDFLoader,
            show_progress=True
        )
        return loader.load()
        
    @staticmethod
    def load_docx(file_path:str):
        loader=Docx2txtLoader(file_path)
        return loader.load()
    
    @staticmethod
    def preprocess_text(text:str)->str:
        text=re.sub(r"\s+",' ',text)
        text=re.sub(r'[^\w\s\.\?\!\-\:\;]','',text)
        
        return text.strip()



In [2]:
document=DataIngestion.load_directory(r'C:\Users\evilk\OneDrive\Desktop\Projects\RAG-Complete-Pipeline\data')

 20%|██        | 7/35 [00:10<00:36,  1.31s/it]Ignoring wrong pointing object 8 0 (offset 0)
Ignoring wrong pointing object 26 0 (offset 0)
Ignoring wrong pointing object 28 0 (offset 0)
 77%|███████▋  | 27/35 [00:44<00:07,  1.14it/s]Ignoring wrong pointing object 10 0 (offset 0)
Ignoring wrong pointing object 18 0 (offset 0)
Ignoring wrong pointing object 42 0 (offset 0)
Ignoring wrong pointing object 45 0 (offset 0)
Ignoring wrong pointing object 48 0 (offset 0)
Ignoring wrong pointing object 51 0 (offset 0)
Ignoring wrong pointing object 71 0 (offset 0)
100%|██████████| 35/35 [00:58<00:00,  1.68s/it]


In [3]:
print(len(document))


1608


## 2.Advanced Chunking - Recursive, semantic

In [4]:
from langchain.text_splitter import (RecursiveCharacterTextSplitter,CharacterTextSplitter)
from sentence_transformers import SentenceTransformer
from langchain.schema import Document
import numpy as np

class Chunking:
    
    
    @staticmethod
    def recursive_chunking(documents,chunk_size=1000,chunk_overlap=200):
        textSplitter=RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
            separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""]
        )
        return textSplitter.split_documents(documents)
    
    @staticmethod
    def semantic_chunking(documents,embedding,chunk_size=1000):
         chunks=[]
         model=SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
         
         for doc in documents:
             sentences=re.split(r'(?<=[.!?])\s+',doc.page_content)
             
             if len(sentences)<=1:
                 chunks.append(doc)
                 continue
             
             embedding_array=model.encode(sentences)
             
             similarities=[]
             for i in range(len(embedding_array)-1):
                 sim=np.dot(embedding_array[i],embedding_array[i+1])
                 similarities.append(sim)
                 
             threshold=np.percentile(similarities,30)
             
             current_chunk=[]
             for i,sentence in enumerate(sentences):
                 current_chunk.append(sentence)
                 
                 if i <len(similarities) and similarities[i]<threshold:
                     chunk_text=' '.join(current_chunk)
                     if len(chunk_text)>chunk_size:
                         chunks.append(Document(
                             page_content=chunk_text,
                             metadata=doc.metadata
                         ))
                         current_chunk=[]
                         
             if current_chunk:
                chunks.append(Document(
                    page_content=' '.join(current_chunk),
                    metadata=doc.metadata
                ))
         return chunks

W1021 21:54:30.039000 11636 site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.


In [5]:
chunks=Chunking.recursive_chunking(document)
print(chunks[1000])


page_content='Salary Risk The present value of the defined plan liability is calculated by reference to the future salaries 
of plan participants. As such, an increase in the salary of the plan participants will increase the 
plan's liability.
  29.2 Share Based Payments
   a) Scheme details
    The Company has Employees’ Stock Option Scheme i.e. ESOS-2017 under which options have been granted at the 
exercise price of C 10 per share to be vested from time to time on the basis of performance and other eligibility criteria. 
Details of number of options outstanding have been tabulated below: 
Financial Year
(Year of Grant)
Number of Options Outstanding
Financial Year of Vesting Exercise 
Price (K)
Range of Fair value at Grant 
Date (K)
As at  
31st March, 
2024
As at
31st March, 
2023
ESOS - 2017
Details of Employee Stock Options granted from 1st April, 2020 to 31st March, 2024
2020-21 2,00,000 2,00,000 2021-22 to 2024-25 10.00 2,133.40 - 2,151.90' metadata={'producer': 'Adobe PDF Libra

## 3.Vector Indexing - ChromaDB

In [6]:
from sentence_transformers import SentenceTransformer
from langchain.vectorstores import FAISS,Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
import numpy as np

class Embeddings:
    
    def __init__(self,model_name='all-MiniLM-L6-v2'):
        
        self.embeddings=HuggingFaceEmbeddings(
            model_name=f"sentence-transformers/{model_name}",
            model_kwargs={'device':'cuda'},
            encode_kwargs={'normalize_embeddings':True}
        )
        
    def create_chroma_db(self,chunks,persist_directory="../chroma_db"):
        vectordb=Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            persist_directory=persist_directory
        )
        return vectordb
    

In [7]:
emb=Embeddings()
vectordb=emb.create_chroma_db(chunks)

  self.embeddings=HuggingFaceEmbeddings(


## 4.Hybrid Search - Dense (embeddings) + Sparse (BM25)

In [12]:
from rank_bm25 import BM25Okapi
import numpy as np

class HybridRetriever:
    
    def __init__(self,vectorstore,documents):
        self.vectorstore=vectorstore
        self.documents=documents
        
        tokenized_docs=[doc.page_content.lower().split() for doc in documents]
        self.bm25=BM25Okapi(tokenized_docs)
        print(f"Hybrid Retriever ready with {len(documents)} documents")
        
        
    def retrieve(self,query:str,k=10,alpha=0.5):
        
        # Vector Search 
        dense_results=self.vectorstore.similarity_search_with_score(query,k=k*2)
        
        # BM25 Search
        tokenized_query=query.lower().split()
        bm25_scores=self.bm25.get_scores(tokenized_query)
        
        #Normalized Scores between 0-1 
        dense_scores=np.array([1/(1+score) for _,score in dense_results])
        if dense_scores.max()>dense_scores.min():
            dense_scores=(dense_scores-dense_scores.min())/(dense_scores.max()-dense_scores.min())
            
        if bm25_scores.max()>bm25_scores.min():
            bm25_scores=(bm25_scores-bm25_scores.min())/(bm25_scores.max()-bm25_scores.min())
            
        doc_scores={}
        # ADD dense scores
        for i, (doc, _) in enumerate(dense_results):
            doc_id=id(doc)
            doc_scores[doc_id]={'doc':doc,'score':alpha*dense_scores[i]}
            
        #ADD Sparse scores
        for i,doc in enumerate(self.documents):
            doc_id=id(doc)
            if doc_id in doc_scores:
                doc_scores[doc_id]['score']+=(1-alpha)*bm25_scores[i]
            else:
                doc_scores[doc_id]={'doc':doc,'score':(1-alpha)*bm25_scores[i]}
                
        # Sort by combined score 
        sorted_docs=sorted(doc_scores.values(),key=lambda x:x['score'],reverse=True)[:k]
        
        return[(item['doc'],item['score']) for item in sorted_docs]

In [16]:
hybrid = HybridRetriever(vectorstore=vectordb, documents=chunks)

query = "company leave policy for new employees"
results = hybrid.retrieve(query, k=5, alpha=0.6)

for doc, score in results:
    print(score," ",doc.page_content[:100])

Hybrid Retriever ready with 4031 documents
0.6   • Absence from place of duty without permission. 
• Obtaining or attempting to obtain leave or absen
0.6   • Absence from place of duty without permission. 
• Obtaining or attempting to obtain leave or absen
0.4   4. TRAVELLING ALLOWANCES 
 
4.1 TRANSFER  GRANT 
Employees will be entitled to one month basic pay p
0.3996287450278538   STANDARD OPERATING PROCEDURE – HR                                                                   
0.39644279161865015   This will be effective October 1 2016. 
 
Maternity leave 
 
Maternity Leave Benefit for the woman e


##  5.Re-ranking - Cohere API & Cross-Encoder models