### DATA INGESTION

What happens in the data ingestion pipeline:
- Load the source data
- Split it into chunks
- Convert the chunks into embeddings
- Store the embeddings in a vector DB
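
The chunking step is listed above but not implemented in this section. As a rough illustration only, here is a minimal sketch of fixed-size character chunking with overlap. The function name `chunk_text` and the `chunk_size` / `overlap` defaults are assumptions made for this example, not part of the notebook; a real pipeline might use a library text splitter instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; sizes are counted in characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "Python is a high-level, interpreted programming language. " * 5
pieces = chunk_text(sample, chunk_size=100, overlap=20)
print(len(pieces), [len(p) for p in pieces])
```

The overlap means the end of one chunk is repeated at the start of the next, which can help when a sentence straddles a chunk boundary.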


In [2]:
### First, we need to understand the Document data structure

from langchain_core.documents import Document

In [3]:
doc = Document(
    page_content="This is the content of the document.", 
    metadata={
        "source": "First_RAG_example.txt", 
        "pages": 1,
        "author": "Piyush Hemnani",
        "date_created": "2025-11-14"
    }
)

doc

Document(metadata={'source': 'First_RAG_example.txt', 'pages': 1, 'author': 'Piyush Hemnani', 'date_created': '2025-11-14'}, page_content='This is the content of the document.')

In [4]:
### Create a directory for the sample text files

import os
os.makedirs("../data/text_files", exist_ok=True)

In [5]:
sample_texts={
    "../data/text_files/python_intro.txt":"""Python Programming Introduction

Python is a high-level, interpreted programming language known for its simplicity and readability.
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
programming languages in the world.

Key Features:
- Easy to learn and use
- Extensive standard library
- Cross-platform compatibility
- Strong community support

Python is widely used in web development, data science, artificial intelligence, and automation.""",
    
    "../data/text_files/machine_learning.txt": """Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation systems
    
    
    """

}

for file_path, content in sample_texts.items():
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(content)

print("Sample files created successfully.")

Sample files created successfully.


In [8]:
### Text Loader

from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/python_intro.txt", encoding="utf-8")
document = loader.load()
print(document)

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity and readability.\nCreated by Guido van Rossum and first released in 1991, Python has become one of the most popular\nprogramming languages in the world.\n\nKey Features:\n- Easy to learn and use\n- Extensive standard library\n- Cross-platform compatibility\n- Strong community support\n\nPython is widely used in web development, data science, artificial intelligence, and automation.')]


In [9]:
### Directory Loader

from langchain_community.document_loaders import DirectoryLoader

dir_loader = DirectoryLoader(
    "../data/text_files", 
    glob="*.txt",                       ## Pattern to match files
    loader_cls=TextLoader,              ## Loader class applied to every matched file
    loader_kwargs={"encoding": "utf-8"},
    show_progress=True
    )
documents = dir_loader.load()
documents

100%|██████████| 2/2 [00:00<00:00, 149.74it/s]


[Document(metadata={'source': '..\\data\\text_files\\machine_learning.txt'}, page_content='Machine Learning Basics\n\nMachine learning is a subset of artificial intelligence that enables systems to learn and improve\nfrom experience without being explicitly programmed. It focuses on developing computer programs\nthat can access data and use it to learn for themselves.\n\nTypes of Machine Learning:\n1. Supervised Learning: Learning with labeled data\n2. Unsupervised Learning: Finding patterns in unlabeled data\n3. Reinforcement Learning: Learning through rewards and penalties\n\nApplications include image recognition, speech processing, and recommendation systems\n\n\n    '),
 Document(metadata={'source': '..\\data\\text_files\\python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity and readability.\nCreated by Guido van Rossum and first released in 1991, Python has become one of the most popu

In [10]:
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader

dir_loader = DirectoryLoader(
    "../data/pdfs", 
    glob="*.pdf",                       ## Pattern to match files
    loader_cls=PyMuPDFLoader,           ## Loader class applied to every matched file
    show_progress=True
    )

pdf_documents = dir_loader.load()
pdf_documents

100%|██████████| 1/1 [00:02<00:00,  2.33s/it]


[Document(metadata={'producer': 'WeasyPrint 65.1', 'creator': 'ChatGPT', 'creationdate': '', 'source': '..\\data\\pdfs\\Building a Personal Portfolio Q&A Chatbot.pdf', 'file_path': '..\\data\\pdfs\\Building a Personal Portfolio Q&A Chatbot.pdf', 'total_pages': 9, 'format': 'PDF 1.7', 'title': 'Building a Personal Portfolio Q&A Chatbot', 'author': 'ChatGPT Deep Research', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': '', 'page': 0}, page_content='Building a Personal Portfolio Q&A Chatbot\nFramework and Hosting Considerations\nChoosing the right framework for your portfolio site is important for ease of development and deployment.\nNext.js is a React-based framework that provides server-side rendering (SSR), static site generation (SSG),\nbuilt-in routing, and easy integration of backend logic via API routes\n. These features can improve\nperformance and SEO (since pages can be pre-rendered or SSR) and simplify development (routing and\nconfi

In [28]:
pdf_documents[0]

Document(metadata={'producer': 'WeasyPrint 65.1', 'creator': 'ChatGPT', 'creationdate': '', 'source': '..\\data\\pdfs\\Building a Personal Portfolio Q&A Chatbot.pdf', 'file_path': '..\\data\\pdfs\\Building a Personal Portfolio Q&A Chatbot.pdf', 'total_pages': 9, 'format': 'PDF 1.7', 'title': 'Building a Personal Portfolio Q&A Chatbot', 'author': 'ChatGPT Deep Research', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': '', 'page': 0}, page_content='Building a Personal Portfolio Q&A Chatbot\nFramework and Hosting Considerations\nChoosing the right framework for your portfolio site is important for ease of development and deployment.\nNext.js is a React-based framework that provides server-side rendering (SSR), static site generation (SSG),\nbuilt-in routing, and easy integration of backend logic via API routes\n. These features can improve\nperformance and SEO (since pages can be pre-rendered or SSR) and simplify development (routing and\nconfig

### Embedding and Vector Store DB

In [11]:
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity

In [12]:
class EmbeddingManager:
    '''Handles document embedding generation using Sentence Transformer'''

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):

        '''
        Initialize the embedding manager
        
        Args:
            model_name (str): HuggingFace model name for the sentence embedding.
        '''
        self.model_name = model_name
        self.model = None
        self._load_model()

    def _load_model(self):
        '''Load the sentence transformer model'''
        try:
            print(f"Loading embedding model: {self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded successfully. Embedding dimension: {self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"Error loading model {self.model_name}: {e}")
            raise

    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
        '''
        Generate embeddings for a list of texts
        
        Args:
            texts (List[str]): List of text strings to embed.

        Returns:
            np.ndarray: Array of embeddings with shape (len(texts), embedding_dimension)
        '''

        if not self.model:
            raise ValueError("Embedding model is not loaded.")
        
        print(f"Generating embeddings for {len(texts)} texts...")
        embeddings = self.model.encode(texts, show_progress_bar=True)
        print(f"Generated embeddings with shape: {embeddings.shape}")
        return embeddings
    

### Initialize the embedding manager
embedding_manager = EmbeddingManager(model_name="all-MiniLM-L6-v2")
embedding_manager

Loading embedding model: all-MiniLM-L6-v2


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Model loaded successfully. Embedding dimension: 384


<__main__.EmbeddingManager at 0x1b7380323c0>

In [13]:
docs = ["This is a test.", "RAG pipelines are fun."]
embs = embedding_manager.generate_embeddings(docs)
embs.shape   # (2, 384)

Generating embeddings for 2 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00,  8.56it/s]

Generated embeddings with shape: (2, 384)





(2, 384)
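
The `cosine_similarity` import above is not used in this section. As a hedged illustration of what that measure computes over embedding vectors, here is a plain-Python sketch. The helpers `cosine_sim` and `rank_by_similarity` and the toy 2-dimensional vectors are assumptions for this example; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
import math
from typing import List, Sequence


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: Sequence[float],
                       candidates: List[Sequence[float]]) -> List[int]:
    """Return candidate indices ordered from most to least similar to query."""
    scores = [cosine_sim(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings
query_vec = [1.0, 0.0]
candidate_vecs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(rank_by_similarity(query_vec, candidate_vecs))
```

A vector store performs the same kind of ranking at query time, typically with an index instead of a full scan.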

### VectorStore

In [24]:
class VectorStore:
    ''' Manages document embeddings in ChromaDB vector store '''

    def __init__(self, collection_name: str = "pdf_documents", persist_directory: str = "../data/vector_store"):
        '''
        Initialize the vector store
        
        Args:
            collection_name (str): Name of the ChromaDB collection.
            persist_directory (str): Directory to persist the vector store data.
        '''

        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()

    def _initialize_store(self):
        '''Initialize Chromadb client and collection'''

        try:
            # Create persistent ChromaDB client
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path = self.persist_directory)

            # Create or get collection
            self.collection = self.client.get_or_create_collection(
                name = self.collection_name, 
                metadata={"description": "Collection for PDF document embeddings"}
            )
            print(f"Vector store initialized with collection: {self.collection_name}")
            print(f"Existing documents in collection: {self.collection.count()}")

        except Exception as e:
            print(f"Error initializing vector store: {e}")
            raise

    def add_documents(self, documents: list[Any], embeddings:np.ndarray):
        '''
        Add documents and their embeddings to the vector store

        Args:
            documents: List of LangChain Document objects
            embeddings: Corresponding embeddings for the documents
        '''

        if len(documents) != len(embeddings):
            raise ValueError("The number of documents must match the number of embeddings.")
        
        print(f"Adding {len(documents)} documents to the vector store...")

        # Prepare data for ChromaDB

        ids = []
        metadatas = []
        documents_texts = []
        embeddings_list = []

        for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
            # Generate unique IDs
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)

            # prepare metadata
            metadata = dict(doc.metadata)
            metadata['doc_index'] = i
            metadata['context_length'] = len(doc.page_content)
            metadatas.append(metadata)

            # Document text
            documents_texts.append(doc.page_content)

            # Embedding 
            embeddings_list.append(embedding.tolist())


        # Add to collection once, after the loop has prepared every document
        try:
            self.collection.add(
                ids=ids,
                metadatas=metadatas,
                documents=documents_texts,
                embeddings=embeddings_list
            )
            print(f"Successfully added {len(documents)} documents.")
            print(f"Total documents in collection now: {self.collection.count()}")

        except Exception as e:
            print(f"Error adding documents to vector store: {e}")
            raise

vectorstore = VectorStore()
vectorstore

Vector store initialized with collection: pdf_documents
Existing documents in collection: 0


<__main__.VectorStore at 0x1b73bd2c1a0>
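
Because `add_documents` builds IDs from `uuid.uuid4()`, running the ingestion twice on the same files stores every document twice under fresh IDs. One possible alternative, sketched here under assumptions, is to derive the ID from the document's source, position, and content so that repeated runs produce the same ID. The helper `stable_doc_id` is hypothetical and not part of the notebook; ChromaDB collections also expose an `upsert` method, which may pair better with such IDs than `add`.

```python
import hashlib


def stable_doc_id(source: str, page_content: str, index: int) -> str:
    """Derive a deterministic ID from a document's source, index, and text.

    Hypothetical helper: the same inputs always give the same ID, so
    re-ingesting an unchanged file does not create a new entry.
    """
    # A NUL separator keeps ("ab", "c") distinct from ("a", "bc")
    payload = f"{source}\x00{index}\x00{page_content}".encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    return f"doc_{digest[:16]}"


first = stable_doc_id("python_intro.txt", "Python Programming Introduction", 0)
second = stable_doc_id("python_intro.txt", "Python Programming Introduction", 0)
changed = stable_doc_id("python_intro.txt", "Python Programming Introduction!", 0)
print(first == second, first == changed)
```

The trade-off is that any edit to the text yields a new ID, so stale entries for the old text would still need to be deleted separately.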
