### Data Ingestion

In [1]:
### Document Structure

from langchain_core.documents import Document

In [2]:
doc=Document(
    page_content="This is the content for the RAG system.",
    metadata={
        "source": "file.txt",
        "author": "Laksh",
        "date_created": "2025-09-08"
    }
)
doc

Document(metadata={'source': 'file.txt', 'author': 'Laksh', 'date_created': '2025-09-08'}, page_content='This is the content for the RAG system.')

In [3]:
## create a simple txt file

import os
os.makedirs("../data/text_files", exist_ok=True)

In [4]:
sample_text ={
    "../data/text_files/Why_RAG.txt": """Why is Retrieval-Augmented Generation important?
LLMs are a key artificial intelligence (AI) technology powering intelligent chatbots and other natural language processing (NLP) applications. The goal is to create bots that can answer user questions in various contexts by cross-referencing authoritative knowledge sources. Unfortunately, the nature of LLM technology introduces unpredictability in LLM responses. Additionally, LLM training data is static and introduces a cut-off date on the knowledge it has.

Known challenges of LLMs include:

Presenting false information when it does not have the answer.
Presenting out-of-date or generic information when the user expects a specific, current response.
Creating a response from non-authoritative sources.
Creating inaccurate responses due to terminology confusion, wherein different training sources use the same terminology to talk about different things.
You can think of the Large Language Model as an over-enthusiastic new employee who refuses to stay informed with current events but will always answer every question with absolute confidence. Unfortunately, such an attitude can negatively impact user trust and is not something you want your chatbots to emulate!

RAG is one approach to solving some of these challenges. It redirects the LLM to retrieve relevant information from authoritative, pre-determined knowledge sources. Organizations have greater control over the generated text output, and users gain insights into how the LLM generates the response.
""",
    "../data/text_files/RAG_benefits.txt": """What are the benefits of Retrieval-Augmented Generation?
RAG technology brings several benefits to an organization's generative AI efforts.

Cost-effective implementation
Chatbot development typically begins using a foundation model. Foundation models (FMs) are API-accessible LLMs trained on a broad spectrum of generalized and unlabeled data. The computational and financial costs of retraining FMs for organization or domain-specific information are high. RAG is a more cost-effective approach to introducing new data to the LLM. It makes generative artificial intelligence (generative AI) technology more broadly accessible and usable.

Current information
Even if the original training data sources for an LLM are suitable for your needs, it is challenging to maintain relevancy. RAG allows developers to provide the latest research, statistics, or news to the generative models. They can use RAG to connect the LLM directly to live social media feeds, news sites, or other frequently-updated information sources. The LLM can then provide the latest information to the users.

Enhanced user trust
RAG allows the LLM to present accurate information with source attribution. The output can include citations or references to sources. Users can also look up source documents themselves if they require further clarification or more detail. This can increase trust and confidence in your generative AI solution.

More developer control
With RAG, developers can test and improve their chat applications more efficiently. They can control and change the LLM's information sources to adapt to changing requirements or cross-functional usage. Developers can also restrict sensitive information retrieval to different authorization levels and ensure the LLM generates appropriate responses. In addition, they can also troubleshoot and make fixes if the LLM references incorrect information sources for specific questions. Organizations can implement generative AI technology more confidently for a broader range of applications.
"""

}


for filepath, content in sample_text.items():
    with open(filepath,'w',encoding='utf-8') as f:
        f.write(content)

print("Sample text files created.")

Sample text files created.


In [5]:
### TextLoader

# TextLoader lives in langchain_community; the older langchain.document_loaders path is deprecated
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/Why_RAG.txt", encoding="utf-8")

document = loader.load()
print(document)

[Document(metadata={'source': '../data/text_files/Why_RAG.txt'}, page_content='Why is Retrieval-Augmented Generation important?\nLLMs are a key artificial intelligence (AI) technology powering intelligent chatbots and other natural language processing (NLP) applications. The goal is to create bots that can answer user questions in various contexts by cross-referencing authoritative knowledge sources. Unfortunately, the nature of LLM technology introduces unpredictability in LLM responses. Additionally, LLM training data is static and introduces a cut-off date on the knowledge it has.\n\nKnown challenges of LLMs include:\n\nPresenting false information when it does not have the answer.\nPresenting out-of-date or generic information when the user expects a specific, current response.\nCreating a response from non-authoritative sources.\nCreating inaccurate responses due to terminology confusion, wherein different training sources use the same terminology to talk about different things.\n

In [6]:
### Directory Loader

from langchain_community.document_loaders import DirectoryLoader

## load all the text files from the directory
dir_loader=DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt", ## Pattern to match files
    loader_cls=TextLoader, ## Loader class to use
    loader_kwargs={"encoding": 'utf-8'},
    show_progress=False
)

documents = dir_loader.load()
documents

[Document(metadata={'source': '..\\data\\text_files\\RAG_benefits.txt'}, page_content="What are the benefits of Retrieval-Augmented Generation?\nRAG technology brings several benefits to an organization's generative AI efforts.\n\nCost-effective implementation\nChatbot development typically begins using a foundation model. Foundation models (FMs) are API-accessible LLMs trained on a broad spectrum of generalized and unlabeled data. The computational and financial costs of retraining FMs for organization or domain-specific information are high. RAG is a more cost-effective approach to introducing new data to the LLM. It makes generative artificial intelligence (generative AI) technology more broadly accessible and usable.\n\nCurrent information\nEven if the original training data sources for an LLM are suitable for your needs, it is challenging to maintain relevancy. RAG allows developers to provide the latest research, statistics, or news to the generative models. They can use RAG to c
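The `glob="**/*.txt"` pattern above is a recursive pattern: it is expected to match `.txt` files in the target directory and in any subdirectory beneath it. The sketch below illustrates that matching behaviour with the standard library's `pathlib` only. The helper name `find_text_files` and the temporary file names are assumptions made for illustration, and the sketch does not call `DirectoryLoader` itself.

```python
import tempfile
from pathlib import Path


def find_text_files(root: str, pattern: str = "**/*.txt") -> list[str]:
    """Return matching file paths relative to root, sorted, with POSIX separators."""
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix() for p in base.glob(pattern) if p.is_file()
    )


# Build a throwaway directory tree to see what the pattern picks up.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "nested").mkdir()
    (base / "Why_RAG.txt").write_text("top level", encoding="utf-8")
    (base / "nested" / "notes.txt").write_text("one level down", encoding="utf-8")
    (base / "paper.pdf").write_bytes(b"%PDF-")  # wrong extension, should be skipped
    matched = find_text_files(tmp)

print(matched)
```

Both the top-level and the nested `.txt` file are matched, while the `.pdf` file is ignored, which is why one loader call per file type is used in this notebook.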

In [7]:
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader

## load all the pdf files from the directory
dir_loader=DirectoryLoader(
    "../data/pdf",
    glob="**/*.pdf", ## Pattern to match files
    loader_cls=PyMuPDFLoader, ## Loader class to use
    show_progress=False
)

pdf_documents = dir_loader.load()
pdf_documents

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'source': '..\\data\\pdf\\attention-is-all-you-need.pdf', 'file_path': '..\\data\\pdf\\attention-is-all-you-need.pdf', 'total_pages': 15, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'trapped': '', 'modDate': 'D:20240410211143Z', 'creationDate': 'D:20240410211143Z', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki Parmar∗\nGoogle Research\nnikip@google.com\nJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.com\nAidan N. Gomez∗†\nUniversity of Toronto\naidan@cs.tor

In [8]:
type(pdf_documents[0])

langchain_core.documents.base.Document
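The PDF output above carries `'page': 0` and `'total_pages': 15` in its metadata, which suggests the loader yields one `Document` per page rather than one per file. A quick way to check how many entries came from each file is to count by `metadata['source']`. This is a minimal sketch: `pages_per_source` is a hypothetical helper, and the `SimpleNamespace` objects with made-up file names stand in for real loader output.

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_source(docs) -> dict:
    """Count how many Document-like objects share each metadata['source']."""
    return dict(Counter(d.metadata.get("source", "unknown") for d in docs))


# Stand-ins for loaded Documents (hypothetical file names, not real loader output).
sample_docs = [
    SimpleNamespace(metadata={"source": "a.pdf", "page": 0}, page_content="..."),
    SimpleNamespace(metadata={"source": "a.pdf", "page": 1}, page_content="..."),
    SimpleNamespace(metadata={"source": "b.pdf", "page": 0}, page_content="..."),
]
summary = pages_per_source(sample_docs)
print(summary)
```

In the notebook the same helper could be called as `pages_per_source(pdf_documents)`, since it only relies on the `.metadata` attribute.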

In [10]:
from langchain_community.document_loaders import UnstructuredExcelLoader

dir_loader= DirectoryLoader(
    "../data/excel",
    glob="**/*.xls",
    loader_cls=UnstructuredExcelLoader
    )

excel_documents = dir_loader.load()
excel_documents

[Document(metadata={'source': '..\\data\\excel\\CollectionCentersMuthootFinal.xls'}, page_content='S. No. Sl. No. State City Axis Dhanlaxmi HDFC ICICI IDBI Indusind Total 1 1 UP Agra 0 1 1 2 2 Guj Ahmedabad 1 1 1 0 1 4 3 3 RAJSTHAN Ajmer 0 0 0 1 1 4 4 UP Allahabad 0 0 1 1 5 5 HARYANA Ambala 0 0 1 0 1 6 6 Punjab Amritsar 0 0 7 7 Guj Anand 0 0 0 0 8 8 WEST BENGAL Asansol 0 1 0 1 9 9 MAHARASHTRA Aurangabad 0 0 0 10 10 KARNATAKA Bangalore 1 1 1 3 11 11 UP Bareilly 0 0 0 12 12 KARNATAKA Belgaum 0 1 0 0 1 13 13 BIHAR Bhagalpur 0 0 0 0 0 14 14 Guj Bharuch 1 0 1 0 2 15 15 Punjab Bhatinda 0 0 0 0 16 16 Guj Bhavnagar 0 1 1 17 17 CHATTISGARH Bhilai 0 0 0 0 18 18 RAJSTHAN Bhilwara 0 1 0 0 1 19 19 MP Bhopal 1 1 20 20 ORISSA Bhubaneswar 1 0 1 0 2 21 21 KARNATAKA Bijapur 1 0 0 0 0 0 1 22 22 RAJSTHAN Bikaner 0 1 0 0 1 23 23 WEST BENGAL Burdwan 0 1 0 0 1 24 24 KERALA Calicut 1 1 0 0 2 25 25 Punjab Chandigarh 1 1 1 3 26 26 TAMILNADU Chennai 1 1 1 1 1 1 6 27 27 TAMILNADU Coimbatore 1 1 28 28 ORISSA Cutta

### embeddings and vectorStoreDB

In [12]:
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity

In [18]:
class EmbeddingManager:
    """Handles Document Embedding generation using SentenceTransformers"""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        """Initialize the embedding manager
        
        Args:
            model_name (str): HuggingFace model name for sentence embeddings
        """
        self.model_name = model_name
        self.model = None
        self._load_model()


    def _load_model(self):
        """Load the SentenceTransformer model"""
        try:
            print(f"Loading the embedding model: {self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded successfully. Embedding Dimension: {self.model.get_sentence_embedding_dimension()}")

        except Exception as e:
            print(f"Error loading model {self.model_name}: {e}")
            raise



    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
        """Generate embeddings for a list of texts

        Args:
            texts: List of text strings to embed

        Returns:
            numpy array of embeddings with shape (len(texts), embedding_dim)
        """
        if not self.model:
            raise ValueError("Model not loaded.")

        print(f"Generating embeddings for {len(texts)} texts...")
        embeddings = self.model.encode(texts, show_progress_bar=True)
        print(f"Generated embeddings with shape: {embeddings.shape}")
        return embeddings

    # def get_embedding_dimension(self) -> int:
    #     """Get the embedding dimension of the model"""
    #     if not self.model:
    #         raise ValueError("Model not loaded.")
    #     return self.model.get_sentence_embedding_dimension()


## initialize embedding manager

embedding_manager = EmbeddingManager()
embedding_manager

Loading the embedding model: all-MiniLM-L6-v2


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Model loaded successfully. Embedding Dimension: 384


<__main__.EmbeddingManager at 0x1dc22d8ae40>
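The imports earlier bring in `cosine_similarity`, the usual measure for comparing sentence embeddings such as the 384-dimensional vectors this model produces. As a rough illustration of what that measure computes, here is a dependency-free sketch. The function name `cosine` and the toy 3-dimensional vectors are assumptions for illustration; real code would normally use the scikit-learn function on NumPy arrays.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|); 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for 384-dimensional embeddings.
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]
orthogonal = [0.0, 1.0, 0.0]

print(cosine(query, same_direction))  # close to 1: same direction, different length
print(cosine(query, orthogonal))      # 0: no shared direction
```

Because the score depends only on direction, two texts with similar meaning can score highly even when their embedding magnitudes differ.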

### VectorStore

In [23]:
class VectorStore:
    """Manages document embedding in ChromaDB vector store"""

    def __init__(self, collection_name: str = "pdf_documents", persist_directory: str = "../data/vector_store"):
        """
        Initialize the vector store
        
        Args:
            collection_name: Name of the ChromaDB collection
            persist_directory: Directory to persist the vector store
        """
        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()

    def _initialize_store(self):
        """Initialize ChromaDB client and collection"""
        try:
            # Create persistent ChromaDB client
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path=self.persist_directory)

            # Get or create collection
            self.collection = self.client.get_or_create_collection(
                name=self.collection_name,
                metadata={"description": "PDF document embedding for RAG"}
            )
            print(f"Vector store initialized. Collection: {self.collection_name}")
            print(f"Existing document in collection: {self.collection.count()}")

        except Exception as e:
            print(f"Error initializing vector store: {e}")
            raise

    def add_document(self, documents: List[Any], embeddings: np.ndarray):
        """
        Add documents and their embeddings to the vector store

        Args:
            documents: List of LangChain Documents
            embeddings: Corresponding embeddings for the documents
        """
        if len(documents) != len(embeddings):
            raise ValueError("Number of documents must match number of embeddings")
        
        print(f"Adding {len(documents)} documents to vector store...")

        # Prepare data for ChromaDB
        ids = []
        metadatas = []
        documents_text = []
        embeddings_list = []

        for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
            # Generate unique ID
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)

            # Prepare Metadata
            metadata = dict(doc.metadata)
            metadata['doc_index'] = i
            metadata['content_length'] = len(doc.page_content)
            metadatas.append(metadata)

            # Document content
            documents_text.append(doc.page_content)

            # Embedding
            embeddings_list.append(embedding.tolist())

        # Add to collection
        try:
            # ChromaDB's Collection.add expects the plural keywords `ids` and `embeddings`
            self.collection.add(
                ids=ids,
                embeddings=embeddings_list,
                metadatas=metadatas,
                documents=documents_text
            )
            print(f"Successfully added {len(documents)} documents to vector store")
            print(f"Total documents in collection: {self.collection.count()}")

        except Exception as e:
            print(f"Error adding documents to the vector store: {e}")
            raise

vectorstore = VectorStore()
vectorstore

Vector store initialized. Collection: pdf_documents
Existing document in collection: 0


<__main__.VectorStore at 0x1dc2378ee40>
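To see the shape of the data that the loop in `add_document` hands to ChromaDB, the record preparation can be tried in isolation, without a database. This is only a sketch: `SimpleDoc` is a hypothetical stand-in for a LangChain `Document`, `prepare_records` is an illustrative helper that is not part of the class above, and the sample contents are made up.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def prepare_records(docs):
    """Build parallel ids, metadatas, and texts lists, one entry per document."""
    ids, metadatas, texts = [], [], []
    for i, doc in enumerate(docs):
        # Random 8-hex-character prefix plus the position keeps ids unique within a batch.
        ids.append(f"doc_{uuid.uuid4().hex[:8]}_{i}")
        metadata = dict(doc.metadata)  # copy so the original metadata is not mutated
        metadata["doc_index"] = i
        metadata["content_length"] = len(doc.page_content)
        metadatas.append(metadata)
        texts.append(doc.page_content)
    return ids, metadatas, texts


docs = [
    SimpleDoc("first page", {"source": "a.pdf"}),
    SimpleDoc("second", {"source": "a.pdf"}),
]
ids, metadatas, texts = prepare_records(docs)
print(ids)
print(metadatas)
```

The three lists stay aligned by position, which is what lets the store pair each id with its text, metadata, and embedding.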