### Data Ingestion

In [2]:
### Document Structure

from langchain_core.documents import Document


In [3]:
doc = Document(
    page_content="this is the content of the RAG document",
    metadata={
        "source": "example_source.txt",
        "author": "Aditya Rathore",
        "date": "2024-10-01",
        "page_number": 1,
    },
)
doc

Document(metadata={'source': 'example_source.txt', 'author': 'Aditya Rathore', 'date': '2024-10-01', 'page_number': 1}, page_content='this is the content of the RAG document')

In [4]:
import os
os.makedirs("../data/text_files", exist_ok=True)

In [5]:
sample_text = {
    "../data/text_files/python_intro.txt": """Python Programming Introduction
    
    Python is a high-level, interpreted programming language known for its readability and versatility. 
    It supports multiple programming paradigms, including procedural, object-oriented, and functional programming.
    Python's extensive standard library and vibrant ecosystem of third-party packages make it suitable for a wide range of applications, 
    from web development to data science and artificial intelligence.
    
    Key Features of Python:

    1. Readability: Python's syntax emphasizes code readability, making it easier for developers to write and maintain code.
    2. Versatility: Python can be used for various applications, including web development, data analysis, machine learning, automation, and more.
    3. Extensive Libraries: Python has a rich set of libraries and frameworks, such as Django for web development,
       NumPy and Pandas for data analysis, and TensorFlow and PyTorch for machine learning.
    4. Community Support: Python has a large and active community that contributes to its development and provides support through forums, tutorials, and documentation.
    5. Cross-Platform: Python is available on multiple platforms, including Windows, macOS, and Linux, allowing developers to write code that runs seamlessly across different operating systems.
    
    Overall, Python's simplicity, versatility, and strong community support have made it one of the most popular programming languages in the world.""",

    "../data/text_files/machine_learning_basics.txt": """Machine Learning Basics

    Machine learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models
    that enable computers to perform tasks without explicit instructions. It involves training models on large datasets to recognize patterns,
    make decisions, and improve their performance over time.

    Key Concepts in Machine Learning:

    1. Supervised Learning: In supervised learning, models are trained on labeled data, where the input features are paired with the correct output.
       The model learns to map inputs to outputs and can make predictions on new, unseen data.
    2. Unsupervised Learning: Unsupervised learning involves training models on unlabeled data, allowing them to discover patterns and relationships
       within the data without explicit guidance. Clustering and dimensionality reduction are common techniques in this category.
    3. Reinforcement Learning: Reinforcement learning is a type of machine learning where agents learn to make decisions by interacting with an environment.
       They receive feedback in the form of rewards or penalties and aim to maximize cumulative rewards over time.
    4. Neural Networks: Neural networks are a class of machine learning models inspired by the human brain's structure. They consist of interconnected nodes
       (neurons) organized in layers and are particularly effective for tasks such as image and speech recognition.
    5. Overfitting and Underfitting: Overfitting occurs when a model learns the training data too well, including noise and outliers, leading to poor
       generalization on new data. Underfitting happens when a model is too simple to capture the underlying patterns in the data.

    Applications of Machine Learning:

    Machine learning has a wide range of applications across various industries, including:

    - Healthcare: Predictive analytics for patient outcomes, medical image analysis, and drug discovery.
    - Finance: Fraud detection, algorithmic trading, and credit scoring.
    - Marketing: Customer segmentation, recommendation systems, and sentiment analysis.
    - Autonomous Systems: Self-driving cars, robotics, and drone navigation.

    Conclusion:

    Machine learning is a rapidly evolving field with the potential to transform industries and improve decision-making processes.
    As more data becomes available and computational power increases, machine learning will continue to advance and unlock new possibilities.
    """
}

for filepath, content in sample_text.items():
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(content)

print("Sample Text files created.")

Sample Text files created.
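
The triple-quoted strings above carry the notebook's four-space indentation into the written files, which is why the loader output further down shows `\n    ` at the start of each line. A minimal sketch, assuming you would rather store the text flush-left, is to pass each string through the standard library's `inspect.cleandoc` before writing. The `raw` value below is a hypothetical shortened sample, not the notebook's full text:

```python
import inspect

# Hypothetical shortened sample with the same shape as the notebook's strings:
# an unindented first line followed by indented body lines.
raw = """Python Programming Introduction

    Python is a high-level, interpreted programming language.
       A continuation line with extra indentation.
    """

# cleandoc removes the indentation shared by every line after the first
# and drops leading and trailing blank lines; relative indentation is kept.
cleaned = inspect.cleandoc(raw)
print(cleaned)
```

Whether to clean the text is a judgment call: the indentation is harmless for embedding, but it inflates character counts when chunk sizes are measured with `len`.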


In [6]:
### TextLoader Example

from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/python_intro.txt", encoding="utf-8")
document = loader.load()

print(document)


[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content="Python Programming Introduction\n\n    Python is a high-level, interpreted programming language known for its readability and versatility. \n    It supports multiple programming paradigms, including procedural, object-oriented, and functional programming.\n    Python's extensive standard library and vibrant ecosystem of third-party packages make it suitable for a wide range of applications, \n    from web development to data science and artificial intelligence.\n\n    Key Features of Python:\n\n    1. Readability: Python's syntax emphasizes code readability, making it easier for developers to write and maintain code.\n    2. Versatility: Python can be used for various applications, including web development, data analysis, machine learning, automation, and more.\n    3. Extensive Libraries: Python has a rich set of libraries and frameworks, such as Django for web development,\n       NumPy and Pandas fo

In [7]:
### DirectoryLoader Example

from langchain_community.document_loaders import DirectoryLoader

dir_loader = DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
    show_progress=False
)
documents = dir_loader.load()
documents

[Document(metadata={'source': '../data/text_files/machine_learning_basics.txt'}, page_content="Machine Learning Basics\n\n    Machine learning is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models\n    that enable computers to perform tasks without explicit instructions. It involves training models on large datasets to recognize patterns,\n    make decisions, and improve their performance over time.\n\n    Key Concepts in Machine Learning:\n\n    1. Supervised Learning: In supervised learning, models are trained on labeled data, where the input features are paired with the correct output.\n       The model learns to map inputs to outputs and can make predictions on new, unseen data.\n    2. Unsupervised Learning: Unsupervised learning involves training models on unlabeled data, allowing them to discover patterns and relationships\n       within the data without explicit guidance. Clustering and dimensionality reduction are comm

In [None]:
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader 

dir_loader = DirectoryLoader(
    "../data/pdf_files/",
    glob="**/*.pdf",
    loader_cls=PyMuPDFLoader,
    show_progress=False
)

pdf_documents = dir_loader.load()
pdf_documents  

In [14]:
import os
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path

In [17]:
### Read all the PDFs inside the directory
def process_all_pdfs(pdf_directory):
    """Process all PDF files in a directory"""
    all_documents = []
    pdf_dir = Path(pdf_directory)
    
    # Find all PDF files recursively
    pdf_files = list(pdf_dir.glob("**/*.pdf"))
    
    print(f"Found {len(pdf_files)} PDF files to process")
    
    for pdf_file in pdf_files:
        print(f"\nProcessing: {pdf_file.name}")
        try:
            loader = PyPDFLoader(str(pdf_file))
            documents = loader.load()
            
            # Add source information to metadata
            for doc in documents:
                doc.metadata['source_file'] = pdf_file.name
                doc.metadata['file_type'] = 'pdf'
            
            all_documents.extend(documents)
            print(f"  ✓ Loaded {len(documents)} pages")
            
        except Exception as e:
            print(f"  ✗ Error: {e}")
    
    print(f"\nTotal documents loaded: {len(all_documents)}")
    return all_documents

# Process all PDFs in the data directory
all_pdf_documents = process_all_pdfs("../data")
all_pdf_documents

Found 1 PDF files to process

Processing: Rental_Lease_Agreement.pdf
  ✓ Loaded 2 pages

Total documents loaded: 2


[Document(metadata={'producer': 'Skia/PDF m142 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'Untitled document', 'source': '../data/pdf_files/Rental_Lease_Agreement.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1', 'source_file': 'Rental_Lease_Agreement.pdf', 'file_type': 'pdf'}, page_content='Residential  Lease  Agreement  \nThis  Lease  Agreement  (the  "Agreement")  is  made  and  entered  into  this  21st  day  of  \nSeptember,\n \n2025,\n \nby\n \nand\n \nbetween\n \nRohit\n \nSoni\n \n(the\n \n"Landlord")\n \nand\n \nAmit\n \nKumar\n \n(the\n \n"Tenant").\n \n1.  PROPERTY  The  Landlord  agrees  to  lease  to  the  Tenant  the  property  located  at:  123  \nInnovation\n \nDrive,\n \nTechville,\n \nST\n \n54321\n \n(the\n \n"Premises").\n \n2.  LEASE  TERM  The  term  of  this  lease  shall  be  for  a  period  of  12  months ,  commencing  on  \nOctober\n \n1,\n \n2025,\n \nand\n \nending\n \non\n \nSeptember\n \n30,\n \n2026.\n \n3.  RENT  3.1.  T
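
The extracted `page_content` above contains doubled spaces and stray `\n \n` sequences between words, which is common for PDFs rendered from Google Docs. As a hedged sketch, assuming paragraph boundaries are not needed downstream, the whitespace can be collapsed before splitting. `normalize_whitespace` is a hypothetical helper, not part of the notebook:

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Fragment copied from the extracted page_content shown above.
sample = (
    'Residential  Lease  Agreement  \nThis  Lease  Agreement  (the  "Agreement")'
    "  is  made  and  entered  into  this  21st  day  of  \nSeptember,\n \n2025,"
)
print(normalize_whitespace(sample))
```

The trade-off is that this also removes the `\n\n` paragraph breaks that `RecursiveCharacterTextSplitter` tries first, so the splitter would fall back to splitting on single spaces.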

In [18]:
### Split text into chunks

def split_documents(documents, chunk_size=1000, chunk_overlap=200):
    """Split documents into smaller chunks for better RAG performance"""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]
    )
    split_docs = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(split_docs)} chunks")
    
    # Show example of a chunk
    if split_docs:
        print("\nExample chunk:")
        print(f"Content: {split_docs[0].page_content[:200]}...")
        print(f"Metadata: {split_docs[0].metadata}")
    
    return split_docs
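
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on the listed separators); it only illustrates how consecutive chunks share `chunk_overlap` characters. `sliding_window_chunks` is a hypothetical name:

```python
from typing import List


def sliding_window_chunks(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> List[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.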

In [19]:
chunks = split_documents(all_pdf_documents)
chunks

Split 2 documents into 4 chunks

Example chunk:
Content: Residential  Lease  Agreement  
This  Lease  Agreement  (the  "Agreement")  is  made  and  entered  into  this  21st  day  of  
September,
 
2025,
 
by
 
and
 
between
 
Rohit
 
Soni
 
(the
 
"Landlor...
Metadata: {'producer': 'Skia/PDF m142 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'Untitled document', 'source': '../data/pdf_files/Rental_Lease_Agreement.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1', 'source_file': 'Rental_Lease_Agreement.pdf', 'file_type': 'pdf'}


[Document(metadata={'producer': 'Skia/PDF m142 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'Untitled document', 'source': '../data/pdf_files/Rental_Lease_Agreement.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1', 'source_file': 'Rental_Lease_Agreement.pdf', 'file_type': 'pdf'}, page_content='Residential  Lease  Agreement  \nThis  Lease  Agreement  (the  "Agreement")  is  made  and  entered  into  this  21st  day  of  \nSeptember,\n \n2025,\n \nby\n \nand\n \nbetween\n \nRohit\n \nSoni\n \n(the\n \n"Landlord")\n \nand\n \nAmit\n \nKumar\n \n(the\n \n"Tenant").\n \n1.  PROPERTY  The  Landlord  agrees  to  lease  to  the  Tenant  the  property  located  at:  123  \nInnovation\n \nDrive,\n \nTechville,\n \nST\n \n54321\n \n(the\n \n"Premises").\n \n2.  LEASE  TERM  The  term  of  this  lease  shall  be  for  a  period  of  12  months ,  commencing  on  \nOctober\n \n1,\n \n2025,\n \nand\n \nending\n \non\n \nSeptember\n \n30,\n \n2026.\n \n3.  RENT  3.1.  T

### Embedding and VectorStoreDB Creation

In [8]:
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List, Tuple, Any, Dict
from sklearn.metrics.pairwise import cosine_similarity



In [10]:
class EmbeddingManager:
    """Handles document embedding generation using SentenceTransformer"""

    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        """
        Initialize the embedding manager
        
        Args:
            model_name: Hugging Face model name for sentence embeddings
        """
        self.model_name = model_name
        self.model = None
        self._load_model()
    
    def _load_model(self):
        """Load the sentence transformer model"""
        try:
            print(f"Loading embedding model: {self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model Loaded Successfully. Embedding Dimension: {self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"Error loading model {self.model_name}: {e}")
            raise

    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
        """
        Generate embeddings for a list of texts
        
        Args:
            texts: List of strings to be embedded

        Returns: 
            numpy array of embeddings with shape (len(texts), embedding_dimension)
        """
        if not self.model:
            raise ValueError("Model not loaded")
        
        print(f"Generating embedding for {len(texts)} texts...")
        embeddings = self.model.encode(texts, show_progress_bar=True)
        print(f"Generated embeddings with shape: {embeddings.shape}")
        return embeddings

### initialize the Embedding Manager

embedding_manager = EmbeddingManager()
embedding_manager


Loading embedding model: all-MiniLM-L6-v2
Model Loaded Successfully. Embedding Dimension: 384


<__main__.EmbeddingManager at 0x7410e08879b0>
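
The notebook imports `cosine_similarity` from scikit-learn, a common way to compare embedding vectors. As a dependency-free sketch of what that measure computes, the example below uses hypothetical 3-dimensional vectors in place of the model's 384-dimensional output:

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real embeddings.
query = [1.0, 0.0, 1.0]
doc_same_direction = [2.0, 0.0, 2.0]  # same direction, different length
doc_orthogonal = [0.0, 3.0, 0.0]      # shares no component with the query

print(cosine(query, doc_same_direction))
print(cosine(query, doc_orthogonal))
```

Because only direction matters, the longer vector scores the same as a unit-length one. Note that the distance metric Chroma uses is configured per collection and may not be cosine.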

### VectorStore

In [None]:
class VectorStore:
    """Manages document embeddings in a ChromaDB vector store"""

    def __init__(self, collection_name: str = "pdf_documents", persist_directory: str = "../data/vector_store/"):
        """
        Initialize the vector store
        
        Args:
            collection_name: Name of the collection in ChromaDB
            persist_directory: Directory to persist the ChromaDB data
        """
        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()

    def _initialize_store(self):
        """Initialize ChromaDB client and collection"""
        try:
            # Create persistent ChromaDB client
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path=self.persist_directory)

            # create or get collection
            self.collection = self.client.get_or_create_collection(
                name=self.collection_name, 
                metadata = {"description": "PDF document embeddings for RAG"}
            )
            print(f"Vector store initialized. Collection: {self.collection_name}")
            print(f"Existing documents in store/collection: {self.collection.count()}")
        
        except Exception as e:
            print(f"Error initializing vector store: {e}")
            raise 
            
    def add_documents(self, documents: List[Document], embeddings: np.ndarray):
        """
        Add documents and their embeddings to the vector store
        
        Args:
            documents: List of langchain documents
            embeddings: Corresponding embeddings for the documents
        """
        if len(documents) != len(embeddings):
            raise ValueError("Number of documents and embeddings must match")
        
        print(f"Adding {len(documents)} documents to vector store...")

        # Prepare data for ChromaDB
        ids = []
        metadatas = []
        documents_texts = []
        embeddings_list = []

        for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
            # Generate unique ID
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)

            # Prepare metadata
            metadata = dict(doc.metadata) 
            metadata['doc_index'] = i
            metadata['content_length'] = len(doc.page_content)
            metadatas.append(metadata)

            # Document content 
            documents_texts.append(doc.page_content)

            # Embedding
            embeddings_list.append(embedding.tolist())

        # Add to ChromaDB collection
        try:
            self.collection.add(
                ids=ids,
                metadatas=metadatas,
                documents=documents_texts,
                embeddings=embeddings_list
            )
            print(f"Successfully added {len(documents)} documents to vector store.")
            print(f"Total documents in store/collection: {self.collection.count()}")

        except Exception as e:
            print(f"Error adding documents to vector store: {e}")
            raise

vectorstore = VectorStore()
vectorstore





Vector store initialized. Collection: pdf_documents
Total documents in store/collection: 0


<__main__.VectorStore at 0x7410e0787170>

In [20]:
chunks

[Document(metadata={'producer': 'Skia/PDF m142 Google Docs Renderer', 'creator': 'PyPDF', 'creationdate': '', 'title': 'Untitled document', 'source': '../data/pdf_files/Rental_Lease_Agreement.pdf', 'total_pages': 2, 'page': 0, 'page_label': '1', 'source_file': 'Rental_Lease_Agreement.pdf', 'file_type': 'pdf'}, page_content='Residential  Lease  Agreement  \nThis  Lease  Agreement  (the  "Agreement")  is  made  and  entered  into  this  21st  day  of  \nSeptember,\n \n2025,\n \nby\n \nand\n \nbetween\n \nRohit\n \nSoni\n \n(the\n \n"Landlord")\n \nand\n \nAmit\n \nKumar\n \n(the\n \n"Tenant").\n \n1.  PROPERTY  The  Landlord  agrees  to  lease  to  the  Tenant  the  property  located  at:  123  \nInnovation\n \nDrive,\n \nTechville,\n \nST\n \n54321\n \n(the\n \n"Premises").\n \n2.  LEASE  TERM  The  term  of  this  lease  shall  be  for  a  period  of  12  months ,  commencing  on  \nOctober\n \n1,\n \n2025,\n \nand\n \nending\n \non\n \nSeptember\n \n30,\n \n2026.\n \n3.  RENT  3.1.  T

In [24]:
### convert the text to embeddings 
texts = [doc.page_content for doc in chunks]

### Generate embeddings

embeddings = embedding_manager.generate_embeddings(texts)


### store in vectorstore
vectorstore.add_documents(chunks, embeddings)

Generating embedding for 4 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00,  2.66it/s]

Generated embeddings with shape: (4, 384)
Adding 4 documents to vector store...
Successfully added 4 documents to vector store.
Total documents in store/collection: 8
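
The run above adds 4 chunks yet reports 8 in the collection. A likely explanation, given that `PersistentClient` keeps data on disk and `add_documents` builds each ID from `uuid.uuid4()`, is that the cell was executed twice and the second run stored the same chunks under fresh IDs. One possible remedy, sketched here under that assumption, is to derive the ID from the chunk itself so repeated runs produce the same ID. `stable_chunk_id` is a hypothetical helper, not part of the notebook:

```python
import hashlib


def stable_chunk_id(source: str, page: int, content: str) -> str:
    """Build a deterministic ID from the chunk's source, page, and text."""
    key = f"{source}|{page}|{content}".encode("utf-8")
    return f"doc_{hashlib.sha256(key).hexdigest()[:16]}"


# Identical inputs give identical IDs; changing any field changes the ID.
first = stable_chunk_id("Rental_Lease_Agreement.pdf", 0, "Residential Lease Agreement")
second = stable_chunk_id("Rental_Lease_Agreement.pdf", 0, "Residential Lease Agreement")
other = stable_chunk_id("Rental_Lease_Agreement.pdf", 1, "Residential Lease Agreement")
print(first == second, first == other)  # True False
```

With stable IDs, calling the collection's `upsert` method instead of `add` should overwrite existing entries rather than duplicate them. That swap is a suggestion, not what the notebook does.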





### Retrieval Pipeline from Vector Store

In [None]:
class RAGRetrieval:
    """Handles query-based retrieval from the vector store"""

    def __init__(self, vector_store: VectorStore, embedding_manager: EmbeddingManager):
        """
        Initialize the RAG retrieval system

        Args:
            vector_store: Vector store containing document embeddings
            embedding_manager: Manager for generating query embeddings
        """
        self.vector_store = vector_store
        self.embedding_manager = embedding_manager
        