# 📖 Project Introduction

This project implements a **Retrieval-Augmented Generation (RAG) system** tailored for PDF documents.  
The main goal is to allow users to **ask natural language questions** and receive **accurate, context-aware answers** based on the content of their documents.  

The pipeline consists of three main stages:  
1. **Data Ingestion** – extract and prepare text from PDFs.
2. **Retrieval** – find the most relevant document chunks using semantic embeddings.
3. **Generation** – use a Large Language Model (LLM) to produce clear answers from retrieved context.

The system is designed to be **modular, scalable, and transparent**, preserving metadata and sources for every answer.

![ragdiagram-ezgif.com-resize.gif](attachment:ragdiagram-ezgif.com-resize.gif)


# 📥 Step 1: Data Ingestion

Data ingestion is the **first stage of any RAG pipeline**.  
Here we simply take raw data (PDFs, text files, etc.), **extract the text**, and **prepare it** for the next steps like chunking, embedding, and retrieval.

This step builds the **foundation** of your entire RAG system: if ingestion is clean, everything after becomes easier.

📸 *Data Ingestion Overview Diagram*



### Part 1: Import Libraries 📚
In this part, we import the necessary Python libraries to read PDFs, handle paths, and split text into chunks for RAG. 

- `os` → for working with directories and paths
- `Path` → easier handling of paths
- `PyPDFLoader` / `PyMuPDFLoader` → load PDF files
- `RecursiveCharacterTextSplitter` → split large texts into smaller chunks

In [None]:
# Part 1: Import Libraries 📚
import os
from pathlib import Path

# LangChain community loaders for PDFs
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader

# Text splitter to break documents into smaller chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter


### Part 2: Define PDF Directory & List Files 📂
We set the folder where all PDFs are stored and then list all PDF files. 

This helps us process multiple PDFs automatically.


In [None]:
# Set the folder where all PDFs are stored
pdf_directory = "../data/pdf"  # folder that contains the PDF files

# Use Path to easily handle directory and list all PDFs recursively
pdf_dir = Path(pdf_directory)

# '**/*.pdf' finds all PDF files in the directory and subdirectories
pdf_files = list(pdf_dir.glob("**/*.pdf"))

# Print how many PDFs we found
print(f"Found {len(pdf_files)} PDF files:")

# Print each PDF file name
for f in pdf_files:
    print(f" - {f.name}")


Found 4 PDF files:
 - attention.pdf
 - embeddings.pdf
 - objectdetection.pdf
 - proposal.pdf


### Part 3: Load PDFs into Documents 📄
Each PDF can have multiple pages. Here we use `PyPDFLoader` (or `PyMuPDFLoader`) to read every page.

We also add metadata to each page so later we know which PDF it came from.


In [None]:

# Initialize a list to store all loaded documents
all_documents = []

# Loop through each PDF file
for pdf_file in pdf_files:
    print(f"\nLoading {pdf_file.name} ...")  # status update
    
    try:
        # Create a PDF loader for the current file
        # You can switch to PyMuPDFLoader for faster loading
        loader = PyPDFLoader(str(pdf_file))  
        
        # Load all pages from PDF into Document objects
        documents = loader.load()  # returns a list of Document objects
        
        # Add extra metadata to each page/document
        # This helps track which PDF each chunk came from
        for doc in documents:
            doc.metadata['source_file'] = pdf_file.name  # original PDF file
            doc.metadata['file_type'] = 'pdf'            # type of file
        
        # Add the loaded pages to our main list
        all_documents.extend(documents)
        print(f"  ✓ Loaded {len(documents)} pages from {pdf_file.name}")
        
    except Exception as e:
        # If any PDF fails to load, print error but continue
        print(f"  ✗ Error loading {pdf_file.name}: {e}")

# Final status
print(f"\nTotal pages loaded from all PDFs: {len(all_documents)}")



Loading attention.pdf ...
  ✓ Loaded 22 pages from attention.pdf

Loading embeddings.pdf ...
  ✓ Loaded 27 pages from embeddings.pdf

Loading objectdetection.pdf ...
  ✓ Loaded 11 pages from objectdetection.pdf

Loading proposal.pdf ...
  ✓ Loaded 8 pages from proposal.pdf

Total pages loaded from all PDFs: 68


### Part 4: Split Documents into Chunks ✂️
#### Recursive Text Chunking ✂️

- `chunk_size` → maximum number of characters per chunk
- `chunk_overlap` → number of characters shared by neighbouring chunks, for context retention

We will split all pages into smaller chunks.
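
To make the two parameters concrete, here is a minimal sketch of fixed-size windows with overlap. It is deliberately simpler than `RecursiveCharacterTextSplitter` because it ignores separators entirely, and `sliding_window_chunks` is a hypothetical helper written for this note, not a LangChain function.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 4 characters of one chunk reappear as the first 4 characters of the next.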


When working with large documents like PDFs, retrieval usually works better when the text is split into smaller, manageable pieces called *chunks*: each chunk gets its own embedding, and only the most relevant chunks need to fit into the LLM prompt.  
**Recursive chunking** tries a hierarchy of separators in order (paragraph breaks, then line breaks, then spaces, then single characters), so chunks break at the most natural boundary available while staying under the maximum size.  
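
The separator hierarchy can be illustrated with a small pure-Python sketch. This is a simplified stand-in written for this explanation, not LangChain's actual implementation: it has no overlap handling, and `recursive_split` is a hypothetical name.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text so that every piece has at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Nothing coarser left to try: fall back to hard cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate            # keep growing the current chunk
            continue
        if current:
            chunks.append(current)         # flush what has been collected so far
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry it with finer separators
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph is short.\n\n"
    "Second paragraph is quite a bit longer and will not fit in one chunk.\n\n"
    "Third."
)
demo_chunks = recursive_split(text, chunk_size=40, separators=["\n\n", "\n", " ", ""])
for c in demo_chunks:
    print(repr(c))
```

The short paragraphs survive intact, while the long one is broken at spaces because neither the paragraph nor the line separator can make it fit.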



In [None]:
# Part 4: Split Documents into Chunks ✂️

# RAG models perform better if documents are split into smaller chunks
chunk_size = 1000     # max characters per chunk
chunk_overlap = 200   # overlap between chunks to maintain context

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,           # how many characters per chunk
    chunk_overlap=chunk_overlap,     # number of overlapping characters between chunks
    length_function=len,             # function to measure text length
    separators=["\n\n", "\n", " ", ""]  # preferred separators for splitting
)

# Split all pages into chunks
chunks = text_splitter.split_documents(all_documents)

# Print summary
print(f"Split {len(all_documents)} pages into {len(chunks)} chunks")

# Show an example chunk
if chunks:
    print("\nExample chunk content (first 300 chars):")
    print(chunks[0].page_content[:300], "...")
    print("Metadata:", chunks[0].metadata)


Split 68 pages into 218 chunks

Example chunk content (first 300 chars):
Attention Mechanism in Neural Networks:
Where it Comes and Where it Goes
Derya Soydaner
Received: 22 July 2021 / Accepted: 27 April 2022
Abstract A long time ago in the machine learning literature, the idea of
incorporating a mechanism inspired by the human visual system into neural
networks was int ...
Metadata: {'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2022-04-29T00:26:20+00:00', 'author': '', 'keywords': '', 'moddate': '2022-04-29T00:26:20+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': '..\\data\\pdf\\attention.pdf', 'total_pages': 22, 'page': 0, 'page_label': '1', 'source_file': 'attention.pdf', 'file_type': 'pdf'}


In [None]:
chunks

[Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2022-04-29T00:26:20+00:00', 'author': '', 'keywords': '', 'moddate': '2022-04-29T00:26:20+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'source': '..\\data\\pdf\\attention.pdf', 'total_pages': 22, 'page': 0, 'page_label': '1', 'source_file': 'attention.pdf', 'file_type': 'pdf'}, page_content='Attention Mechanism in Neural Networks:\nWhere it Comes and Where it Goes\nDerya Soydaner\nReceived: 22 July 2021 / Accepted: 27 April 2022\nAbstract A long time ago in the machine learning literature, the idea of\nincorporating a mechanism inspired by the human visual system into neural\nnetworks was introduced. This idea is named the attention mechanism, and it\nhas gone through a long development period. Today, many works have been\ndevoted to this idea in a variety of tasks. Re

# Step 2: Embeddings 🚀

### Document Embeddings for RAG

In this notebook, we will convert PDF/text chunks into **vector embeddings**.  
Embeddings are numerical representations of text that capture semantic meaning. These embeddings are stored in a **vector database** for efficient similarity search.
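
As a rough illustration of what "capturing semantic meaning" means for similarity search, the sketch below compares hand-made toy vectors with cosine similarity. The three vectors are hypothetical values chosen for the example, not output of a real embedding model, and `cosine_sim` is a helper defined here.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 3-dimensional "embeddings"; real models use hundreds of dimensions
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.8, 0.2, 0.1])
invoice = np.array([0.0, 0.1, 0.9])

print("cat vs kitten :", round(cosine_sim(cat, kitten), 3))
print("cat vs invoice:", round(cosine_sim(cat, invoice), 3))
```

Texts with related meaning should end up with vectors pointing in similar directions, which is what the vector database exploits at query time.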


We will use:
- **SentenceTransformers** for generating embeddings.
- **NumPy** for handling vectors.
- **ChromaDB** for storing and querying embeddings.


In [None]:
# ===============================
# Part 1: Import Libraries 📚
# ===============================

# NumPy for numerical operations on embeddings
import numpy as np

# SentenceTransformer to convert text into embeddings
from sentence_transformers import SentenceTransformer

# ChromaDB for vector storage and retrieval
import chromadb
from chromadb.config import Settings

# UUID to generate unique IDs for each document/chunk
import uuid

# Type hints for better code readability
from typing import List, Dict, Any, Tuple

# cosine similarity to check similarity between vectors
from sklearn.metrics.pairwise import cosine_similarity


In [None]:
# ===============================
# Part 2: Initialize Embedding Model 🤖
# ===============================

class EmbeddingManager:
    """Handles document embedding using SentenceTransformers"""
    
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        """
        Initialize the EmbeddingManager with a SentenceTransformer model.
        
        Args:
            model_name (str): Pre-trained SentenceTransformer model name
        """
        self.model_name = model_name
        self.model = None
        self._load_model()
    
    def _load_model(self):
        """Load the SentenceTransformer model"""
        try:
            print(f"Loading embedding model: {self.model_name} ...")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded successfully. Embedding dimension: {self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"Error loading model {self.model_name}: {e}")
            raise
    
    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
        """
        Convert a list of texts into embeddings
        
        Args:
            texts: List of text strings
            
        Returns:
            np.ndarray: Embeddings matrix of shape (len(texts), embedding_dim)
        """
        if not self.model:
            raise ValueError("Model not loaded")
        
        print(f"Generating embeddings for {len(texts)} texts ...")
        embeddings = self.model.encode(texts, show_progress_bar=True)
        print(f"Generated embeddings shape: {embeddings.shape}")
        return embeddings

# Initialize the embedding manager
embedding_manager = EmbeddingManager()


Loading embedding model: all-MiniLM-L6-v2 ...
Model loaded successfully. Embedding dimension: 384


In [None]:
# ===============================
# Part 3: Encode Text Chunks 📝
# ===============================

# Example: assuming you already have 'chunks' from 1_Ingestion notebook
# chunks is a list of LangChain Document objects

texts = [doc.page_content for doc in chunks]  # Extract text from each chunk

# Generate embeddings
embeddings = embedding_manager.generate_embeddings(texts)

print(f"First embedding vector example:\n{embeddings[0][:10]} ...")  # show first 10 numbers


Generating embeddings for 218 texts ...


Batches: 100%|██████████| 7/7 [00:49<00:00,  7.06s/it]

Generated embeddings shape: (218, 384)
First embedding vector example:
[-0.02570609 -0.06741273  0.04305235 -0.03098401  0.06855831  0.0532016
  0.06542959  0.01391513  0.06434312 -0.0265669 ] ...





# Step 3: Vector Store: Save & Manage Embeddings 🔗

In this notebook, we will create a **vector store** to store the embeddings generated from our text chunks.  
A vector store allows us to **search and retrieve relevant chunks** efficiently when we query our RAG system.  

We will use **ChromaDB** for this, which is a lightweight vector database.  

Key steps:  
1. Initialize ChromaDB client  
2. Create a collection  
3. Add documents + embeddings  
4. Query for testing
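
Before bringing in ChromaDB, the add-then-query idea behind these steps can be sketched with a brute-force, in-memory store built on NumPy. `TinyVectorStore` is a hypothetical toy class for illustration only; it is not ChromaDB and has no persistence or indexing.

```python
import numpy as np


class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector database (not ChromaDB)."""

    def __init__(self):
        self.ids, self.texts, self.vectors = [], [], []

    def add(self, ids, texts, embeddings):
        """Store ids, raw texts, and their embedding vectors side by side."""
        self.ids.extend(ids)
        self.texts.extend(texts)
        self.vectors.extend(np.asarray(e, dtype=float) for e in embeddings)

    def query(self, query_embedding, n_results=2):
        """Return the n_results entries most similar to the query vector."""
        matrix = np.vstack(self.vectors)
        q = np.asarray(query_embedding, dtype=float)
        # Cosine similarity between the query and every stored vector
        sims = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
        best = np.argsort(-sims)[:n_results]
        return [(self.ids[i], self.texts[i], float(sims[i])) for i in best]


store = TinyVectorStore()
store.add(
    ids=["a", "b", "c"],
    texts=["attention paper", "embedding report", "project proposal"],
    embeddings=[[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],  # made-up 2-D vectors
)
hits = store.query([0.1, 0.9], n_results=2)
print(hits)
```

A real vector database does the same job but adds persistence, metadata filtering, and an approximate index so that queries stay fast as the collection grows.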



In [None]:
# ===============================
# Part 1: Import Libraries 📚
# ===============================

import os
import uuid  # to generate unique IDs for each chunk
import numpy as np

# ChromaDB for vector storage
import chromadb
from chromadb.config import Settings

# Type hints
from typing import List, Any, Dict


In [None]:
# ===============================
# Part 2: Vector Store Class 🏪
# ===============================

class VectorStore:
    """
    Handles storing document embeddings in ChromaDB
    """
    
    def __init__(self, collection_name: str = "pdf_documents", persist_directory: str = "../data/vector_store"):
        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()
    
    def _initialize_store(self):
        """
        Initialize ChromaDB client and collection
        """
        try:
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path=self.persist_directory)
            
            self.collection = self.client.get_or_create_collection(
                name=self.collection_name,
                metadata={"description": "PDF document embeddings for RAG"}
            )
            print(f"Vector store initialized. Collection: {self.collection_name}")
            print(f"Existing documents in collection: {self.collection.count()}")
        except Exception as e:
            print(f"Error initializing vector store: {e}")
            raise

    def add_documents(self, documents: List[Any], embeddings: np.ndarray):
        """
        Add documents and their embeddings to the vector store
        """
        if len(documents) != len(embeddings):
            raise ValueError("Number of documents must match number of embeddings")
        
        print(f"Adding {len(documents)} documents to vector store...")
        
        ids = []
        metadatas = []
        docs_text = []
        embeddings_list = []
        
        for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)
            
            metadata = dict(doc.metadata)
            metadata['doc_index'] = i
            metadata['content_length'] = len(doc.page_content)
            metadatas.append(metadata)
            
            docs_text.append(doc.page_content)
            embeddings_list.append(embedding.tolist())
        
        try:
            self.collection.add(
                ids=ids,
                embeddings=embeddings_list,
                metadatas=metadatas,
                documents=docs_text
            )
            print(f"Successfully added {len(documents)} documents ✅")
            print(f"Total documents in collection: {self.collection.count()}")
        except Exception as e:
            print(f"Error adding documents: {e}")
            raise


In [None]:
# Initialize Vector Store
vector_store = VectorStore()

# Add all chunks and embeddings
vector_store.add_documents(chunks, embeddings)


Vector store initialized. Collection: pdf_documents
Existing documents in collection: 0
Adding 218 documents to vector store...
Successfully added 218 documents ✅
Total documents in collection: 218


# Step 4: Document Retrieval 🔍
In this part, we import all the necessary libraries to handle:

- File paths and directories (`os`)
- Unique IDs for documents (`uuid`)
- Type hints (`typing`)
- Numerical operations (`numpy`)
- Vector storage (`chromadb`)
- Our previously created classes:
    - `EmbeddingManager` (from 2_Embeddings.ipynb)
    - `VectorStore` (from 3_VectorStore.ipynb)

These libraries allow us to query our PDF embeddings efficiently.





In [None]:
# ===============================
# Part 1: Import Libraries 📦
# ===============================

import os
import uuid
from typing import List, Dict, Any
import numpy as np
import chromadb



The `RAGRetriever` class is responsible for **query-based retrieval**.

- It takes a search query.
- Converts it into an embedding.
- Queries the vector store.
- Returns the most relevant PDF chunks with metadata and similarity score.

This is the core part of any RAG pipeline: **finding context for your questions**.
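
How a returned distance maps to a similarity score depends on the distance metric the collection was created with. The NumPy check below is a sketch under one assumption, namely that embeddings are unit length; it compares cosine distance with squared L2 distance, which to my knowledge is ChromaDB's default space (worth confirming for the version you run).

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)
b = rng.normal(size=384)
a /= np.linalg.norm(a)  # normalise to unit length
b /= np.linalg.norm(b)

cosine_similarity = float(a @ b)
cosine_distance = 1.0 - cosine_similarity
squared_l2 = float(np.sum((a - b) ** 2))

# For unit vectors, squared L2 distance is exactly twice the cosine distance
print(np.isclose(squared_l2, 2 * cosine_distance))
# "1 - distance" recovers cosine similarity for a cosine-distance collection
print(np.isclose(1 - cosine_distance, cosine_similarity))
# With squared L2 distances the matching conversion is 1 - distance / 2
print(np.isclose(1 - squared_l2 / 2, cosine_similarity))
```

If the collection uses squared L2, a score computed as `1 - distance` is still useful for ranking, but a threshold applied to it is stricter than the same threshold on true cosine similarity.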


In [None]:
# ===============================
# Part 2: RAG Retriever Class 🔍
# ===============================

class RAGRetriever:
    """
    Handles query-based retrieval from the vector store.
    """
    
    def __init__(self, vector_store: VectorStore, embedding_manager: EmbeddingManager):
        self.vector_store = vector_store
        self.embedding_manager = embedding_manager

    def retrieve(self, query: str, top_k: int = 5, score_threshold: float = 0.01) -> List[Dict[str, Any]]:
        """
        Retrieve the top_k most similar documents for a given query.
        """
        print(f"Retrieving documents for query: '{query}'")
        print(f"Top K: {top_k}, Score threshold: {score_threshold}")
        
        # Generate query embedding
        query_embedding = self.embedding_manager.generate_embeddings([query])[0]
        
        # Query the vector store
        results = self.vector_store.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=top_k
        )
        
        # Process results
        retrieved_docs = []
        if results['documents'] and results['documents'][0]:
            documents = results['documents'][0]
            metadatas = results['metadatas'][0]
            distances = results['distances'][0]
            ids = results['ids'][0]
            
            for i, (doc_id, document, metadata, distance) in enumerate(zip(ids, documents, metadatas, distances)):
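                # 1 - distance is a cosine similarity only if the collection uses cosine
                # distance (e.g. via the "hnsw:space" collection setting). The collection
                # above sets no metric, so treat this as a relative relevance score.
                similarity_score = 1 - distance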
                if similarity_score >= score_threshold:
                    retrieved_docs.append({
                        'id': doc_id,
                        'content': document,
                        'metadata': metadata,
                        'similarity_score': similarity_score,
                        'distance': distance,
                        'rank': i + 1
                    })
        
        print(f"Retrieved {len(retrieved_docs)} documents (after filtering)")
        return retrieved_docs


In [None]:
# ===============================
# Part 3: Initialize RAG Retriever ⚡
# ===============================

rag_retriever = RAGRetriever(vector_store=vector_store, embedding_manager=embedding_manager)

# Check the retriever
rag_retriever


<__main__.RAGRetriever at 0x26d96d477a0>

Now we test the retriever by asking a query.

- `query`: The question we want to answer.
- `top_k`: How many PDF chunks to retrieve.
- Prints a **snippet of content**, source file, and similarity score.

This ensures our **retriever works before connecting it to an LLM**.
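
A small formatting helper makes the retrieved chunks easier to scan than the raw list of dictionaries. The sketch below assumes the result shape returned by `RAGRetriever.retrieve` (keys `rank`, `content`, `metadata`, `similarity_score`); `format_results` and the sample data are hypothetical.

```python
def format_results(results, snippet_chars=80):
    """Return one readable line per retrieved chunk."""
    lines = []
    for doc in results:
        # Flatten newlines so each snippet stays on a single line
        snippet = doc["content"][:snippet_chars].replace("\n", " ")
        source = doc["metadata"].get("source_file", "unknown")
        lines.append(
            f"#{doc['rank']} [{source}] score={doc['similarity_score']:.3f} | {snippet}"
        )
    return lines


# Hypothetical results in the same shape that the retriever returns
sample_results = [
    {
        "rank": 1,
        "content": "Embeddings map text\nto vectors.",
        "similarity_score": 0.4213,
        "metadata": {"source_file": "embeddings.pdf"},
    },
]
for line in format_results(sample_results):
    print(line)
```

With real data, the same call would be `format_results(rag_retriever.retrieve(query, top_k=5))`.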


In [None]:
# ===============================
# Part 4: Test Retrieval 📝
# ===============================

query = "What is an embedding?"

results = rag_retriever.retrieve(query=query, top_k=5)

results 


Retrieving documents for query: 'What is an embedding?'
Top K: 5, Score threshold: 0.01
Generating embeddings for 1 texts ...


Batches: 100%|██████████| 1/1 [00:00<00:00,  3.60it/s]

Generated embeddings shape: (1, 384)
Retrieved 1 documents (after filtering)





[{'id': 'doc_610fb7dc_101',
  'content': 'Embedding[34] designs a diversiﬁed prompting strategy by assigning document-s peciﬁc\nroles to simulate potential users querying that document, enabling LLMs to generate\nstylistically authentic queries that enhance diversity and realism.\n4',
  'metadata': {'file_type': 'pdf',
   'source_file': 'embeddings.pdf',
   'author': 'Peng Yu; En Xu; Bin Chen; Haibiao Chen; Yinfei Xu',
   'title': 'QZhou-Embedding Technical Report',
   'total_pages': 27,
   'moddate': '2025-09-01T00:50:53+00:00',
   'arxivid': 'https://arxiv.org/abs/2508.21632v1',
   'page': 3,
   'producer': 'pikepdf 8.15.1',
   'source': '..\\data\\pdf\\embeddings.pdf',
   'content_length': 238,
   'doc_index': 101,
   'doi': 'https://doi.org/10.48550/arXiv.2508.21632',
   'keywords': '',
   'license': 'http://creativecommons.org/licenses/by/4.0/',
   'creator': 'arXiv GenPDF (tex2pdf:)',
   'page_label': '4',
   'creationdate': '2025-09-01T00:50:53+00:00'},
  'similarity_score':

# 🧠 **Step 5: Generation (LLM Response Creation)**

In this step, we use a **Large Language Model (LLM)** to generate the final answer.

After the retriever returns the most relevant chunks, the LLM:

1. **Reads the retrieved context**
2. **Understands the user's question**
3. **Generates a clear, concise, context-aware answer**

We use **Llama-3.1-8B-Instant**, served through Groq's API: a small, fast model that suits interactive question answering over retrieved context.

The process:

* Build a prompt that contains both the **context** and the **question**
* Send it to the LLM
* Return the generated answer to the user
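
The prompt-building step in the process above can be sketched on its own, without calling any LLM. `build_prompt` and the sample chunks are hypothetical; the chunk dictionaries follow the shape returned by the retriever, and labelling each block with its source file is one possible way to keep answers traceable.

```python
def build_prompt(question: str, retrieved: list[dict]) -> str:
    """Combine retrieved chunks and the question into a single prompt string."""
    context_blocks = []
    for i, chunk in enumerate(retrieved, start=1):
        source = chunk.get("metadata", {}).get("source_file", "unknown")
        context_blocks.append(f"[{i}] (source: {source})\n{chunk['content']}")
    context = "\n\n".join(context_blocks)
    return (
        "Use the following context to answer the question concisely.\n"
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )


# Hypothetical retrieved chunks
sample_chunks = [
    {"content": "Embeddings are vector representations of text.",
     "metadata": {"source_file": "embeddings.pdf"}},
    {"content": "Attention weighs parts of the input.",
     "metadata": {"source_file": "attention.pdf"}},
]
prompt = build_prompt("What is an embedding?", sample_chunks)
print(prompt)
```

The resulting string can be passed to any chat model; keeping prompt construction in its own function makes it easy to inspect exactly what the model sees.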

This completes the **RAG pipeline**:
**Retrieve → Augment → Generate**.

Below is the code that performs the generation.




In [None]:
### Simple RAG pipeline with a Groq-hosted LLM
import os

from dotenv import load_dotenv
from langchain_groq import ChatGroq

load_dotenv()

# Initialize the Groq LLM (set GROQ_API_KEY in your environment or .env file)
groq_api_key = os.getenv("GROQ_API_KEY")
llm = ChatGroq(
    groq_api_key=groq_api_key,
    model_name="llama-3.1-8b-instant",
    temperature=0.1,
    max_tokens=1024,
)


# Simple RAG function: retrieve context, then generate a response
def rag_simple(query, retriever, llm, top_k=5):
    # Retrieve the context
    results = retriever.retrieve(query, top_k=top_k)
    context = "\n\n".join(doc['content'] for doc in results) if results else ""
    if not context:
        return "No relevant context found to answer the question."

    # Generate the answer with the Groq LLM.
    # The f-string already fills in context and query, so no .format() call is
    # needed (it would fail if the context contained curly braces).
    prompt = f"""Use the following context to answer the question concisely.

Context:
{context}

Question: {query}

Answer:"""

    response = llm.invoke(prompt)
    return response.content

    


In [None]:
from dotenv import load_dotenv
import os
from langchain_groq import ChatGroq

# Load environment variables from .env
load_dotenv()

# Get your Groq API key
groq_api_key = os.getenv("GROQ_API_KEY")
print("Groq API Key loaded:")  # status message only; it does not verify or print the key

# Initialize the Groq LLM
llm = ChatGroq(
    groq_api_key=groq_api_key,
    model_name="llama-3.1-8b-instant",
    temperature=0.1,
    max_tokens=1024
)


Groq API Key loaded:


In [None]:
answer = rag_simple("What is embeddings in LLM?", rag_retriever, llm)
print(answer)


Retrieving documents for query: 'What is embeddings in LLM?'
Top K: 5, Score threshold: 0.01
Generating embeddings for 1 texts ...


Batches: 100%|██████████| 1/1 [00:00<00:00, 17.25it/s]

Generated embeddings shape: (1, 384)
Retrieved 3 documents (after filtering)





In the context of Large Language Models (LLMs), embeddings refer to mathematical vector representations of natural language text or multimodal data. These vector representations are used in various applications such as text mining, question-answering systems, recommendation systems, and retrieval-augmented generation.
