### Data Ingestion

![LangChain document components](1-langchain-document-components.svg)

In [8]:
# Import Document class from langchain
from langchain_core.documents import Document

In [9]:
doc = Document(
    page_content="This is the content of the document I am using to create RAG.",
    metadata={
        "source": "example.txt",
        "pages": 1,
        "author": "Rohan",
        "date_created": "2025-10-05"
    }
)
doc

Document(metadata={'source': 'example.txt', 'pages': 1, 'author': 'Rohan', 'date_created': '2025-10-05'}, page_content='This is the content of the document I am using to create RAG.')

In [10]:
## Create a simple txt file
import os 
os.makedirs("../data/text_files", exist_ok=True)

In [11]:
sample_texts = {
    "../data/text_files/python_intro.txt": """Python Programming Introduction

Python is a high-level, interpreted programming language known for its simplicity and readability.
It was created by Guido van Rossum and first released in 1991. Python supports multiple programming paradigms, 
including procedural, object-oriented, and functional programming.  

Key Features of Python:
- Easy to Learn and Use: Python's syntax is clear and concise, making it an excellent choice for beginners.
- Extensive Standard Library: Python comes with a vast standard library that provides modules and functions for various tasks, such as file I/O, regular expressions, and web development.
- Cross-Platform Compatibility: Python is available on various operating systems, including Windows, macOS, and Linux.
- Large Community: Python has a vibrant community that contributes to a rich ecosystem of third-party libraries
"""
}

for filepath, content in sample_texts.items():
    with open(filepath, "w", encoding="utf-8") as f:
        f.write(content)

print("Sample text files created successfully.")

Sample text files created successfully.


In [12]:
# TextLoader - Load a single text file
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/python_intro.txt", encoding="utf-8")
document = loader.load()  # List of Document objects
print(document)

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content="Python Programming Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity and readability.\nIt was created by Guido van Rossum and first released in 1991. Python supports multiple programming paradigms, \nincluding procedural, object-oriented, and functional programming.  \n\nKey Features of Python:\n- Easy to Learn and Use: Python's syntax is clear and concise, making it an excellent choice for beginners.\n- Extensive Standard Library: Python comes with a vast standard library that provides modules and functions for various tasks, such as file I/O, regular expressions, and web development.\n- Cross-Platform Compatibility: Python is available on various operating systems, including Windows, macOS, and Linux.\n- Large Community: Python has a vibrant community that contributes to a rich ecosystem of third-party libraries\n")]


In [13]:
# Text Splitter - Split documents into smaller chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter


def split_documents(documents, chunk_size=500, chunk_overlap=50):
    """ Split documents into smaller chunks
    Args:
        documents: List of Document objects or raw strings.
        chunk_size: Max characters per chunk.
        chunk_overlap: Overlap between chunks.
    Returns:
        List of Document chunks
    """
    # Validate inputs
    if not documents:
        raise ValueError("documents list cannot be empty")
    
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    
    if chunk_overlap < 0 or chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be non-negative and less than chunk_size")
    
    # ✅ Ensure all inputs are Document objects
    if isinstance(documents[0], str):
        documents = [Document(page_content=doc, metadata={}) for doc in documents]

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=['\n\n', '\n', ' ', '']
    )
    
    split_docs = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(split_docs)} chunks")
    
    # show example of chunk
    if split_docs:
        print(f"\n Example chunk")
        print(f"Content : {split_docs[0].page_content[:200]}...")
        print(f"Metadata : {split_docs[0].metadata}")
    
    return split_docs
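
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification for illustration, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on the listed separators first); the function name `chunk_with_overlap` is hypothetical.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into fixed-size character windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if chunk_overlap < 0 or chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be non-negative and less than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is what lets a sentence that straddles a boundary still appear whole in at least one chunk.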

In [14]:
### Directory Loader
from langchain_community.document_loaders import DirectoryLoader

## Load all the text files from the directory
dir_loader = DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt", ## Pattern to match files
    loader_cls=TextLoader, ## Loader class to use for loading files
    loader_kwargs={"encoding": "utf-8"}, ## Additional arguments for the loader class
    show_progress=True
)
documents = dir_loader.load()
documents

100%|██████████| 1/1 [00:00<?, ?it/s]


[Document(metadata={'source': '..\\data\\text_files\\python_intro.txt'}, page_content="Python Programming Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity and readability.\nIt was created by Guido van Rossum and first released in 1991. Python supports multiple programming paradigms, \nincluding procedural, object-oriented, and functional programming.  \n\nKey Features of Python:\n- Easy to Learn and Use: Python's syntax is clear and concise, making it an excellent choice for beginners.\n- Extensive Standard Library: Python comes with a vast standard library that provides modules and functions for various tasks, such as file I/O, regular expressions, and web development.\n- Cross-Platform Compatibility: Python is available on various operating systems, including Windows, macOS, and Linux.\n- Large Community: Python has a vibrant community that contributes to a rich ecosystem of third-party libraries\n")]

In [15]:
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader

## Load all the PDF files from the directory
dir_loader = DirectoryLoader(
    "../data/pdf_files",
    glob="**/*.pdf", ## Pattern to match files
    loader_cls=PyMuPDFLoader, ## Loader class to use for loading files
    # PyMuPDF's page.get_text does not accept an 'encoding' kwarg; remove it to avoid TypeError
    loader_kwargs={}, ## Additional arguments for the loader class
    show_progress=True
)

pdf_documents = dir_loader.load()
pdf_documents

100%|██████████| 2/2 [00:00<00:00,  2.44it/s]


[Document(metadata={'producer': 'pdfTeX-1.40.26', 'creator': 'LaTeX with hyperref', 'creationdate': '2025-08-27T08:25:26+00:00', 'source': '..\\data\\pdf_files\\rohan_cv_aug.pdf', 'file_path': '..\\data\\pdf_files\\rohan_cv_aug.pdf', 'total_pages': 1, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2025-08-27T08:25:26+00:00', 'trapped': '', 'modDate': 'D:20250827082526Z', 'creationDate': 'D:20250827082526Z', 'page': 0}, page_content='GitHub: github.com/PhoneixDeadeye\nLinkedIn: linkedin.com/in/rohan-agarwal007/\nEmail: agarwalrohan2004@gmail.com\nMobile: +91-9811229836\nSkills\nLanguages:\nJava, C, C++, Python, HTML\nFrameworks:\nPandas, OpenCV, Tesseract OCR, Tkinter, TensorFlow, REST API, ComfyUI, AWS\nTools/Platforms:\nVS Code, IntelliJ IDEA, Git, GitHub, Canva, Linux/Unix\nSoft Skills:\nTime Management, Fast Learner, Leadership, Team Collaboration\nTraining\nSummer Training in Competitive Programming\n• Mentored in algorithm design and opt

In [16]:
# Split the PDF documents into chunks for better retrieval
# Using appropriate chunk size for PDF content
pdf_chunks = split_documents(pdf_documents, chunk_size=500, chunk_overlap=50)
print(f"Total PDF chunks created: {len(pdf_chunks)}")

Split 8 documents into 41 chunks

 Example chunk
Content : GitHub: github.com/PhoneixDeadeye
LinkedIn: linkedin.com/in/rohan-agarwal007/
Email: agarwalrohan2004@gmail.com
Mobile: +91-9811229836
Skills
Languages:
Java, C, C++, Python, HTML
Frameworks:
Pandas, ...
Metadata : {'producer': 'pdfTeX-1.40.26', 'creator': 'LaTeX with hyperref', 'creationdate': '2025-08-27T08:25:26+00:00', 'source': '..\\data\\pdf_files\\rohan_cv_aug.pdf', 'file_path': '..\\data\\pdf_files\\rohan_cv_aug.pdf', 'total_pages': 1, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2025-08-27T08:25:26+00:00', 'trapped': '', 'modDate': 'D:20250827082526Z', 'creationDate': 'D:20250827082526Z', 'page': 0}
Total PDF chunks created: 41


In [17]:
type(pdf_documents[0])

langchain_core.documents.base.Document

In [18]:
# Import required libraries for embeddings and vector store
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List, Dict, Any, Tuple

In [19]:
class EmbeddingManager:
    """Handles document embedding generation using SentenceTransformer"""
    def __init__(self, model_name: str="all-MiniLM-L6-v2"):
        """
        Initialize the embedding manager with a specified model.

        Args:
            model_name (str): Name of the pre-trained model to use.
        """
        self.model_name = model_name
        self.model = None
        self._load_model()

    def _load_model(self):
        """Load the SentenceTransformer model."""
        try:
            print(f"Loading embedding model: {self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model Loaded successfully. Embedding Dimension: {self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"Error loading model {self.model_name}: {e}")
            raise

    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
        """
        Generate embeddings for a list of texts.

        Args:
            texts (List[str]): List of text strings to embed.

        Returns:
            Numpy Array of embeddings with shape (len(texts), embedding_dimension).
        """
        if not self.model:
            raise ValueError("Model not loaded. Call _load_model() before generating embeddings.")
        
        print(f"Generating embeddings for {len(texts)} texts...")
        embeddings = self.model.encode(texts, show_progress_bar=True)
        print(f"Generated embeddings with shape: {embeddings.shape}")
        return embeddings
    
    def get_embedding_dimension(self) -> int:
        """Get the dimension of the embeddings."""
        if not self.model:
            raise ValueError("Model not loaded. Call _load_model() before getting embedding dimension.")
        return self.model.get_sentence_embedding_dimension()

## Initialize the Embedding Manager

embedding_manager = EmbeddingManager()
embedding_manager

Loading embedding model: all-MiniLM-L6-v2
Model Loaded successfully. Embedding Dimension: 384


<__main__.EmbeddingManager at 0x2448b5f3ed0>

### VectorStore

In [20]:
import os
class VectorStore:
    """Manages document embeddings in a ChromaDB vector store."""

    def __init__(self, collection_name: str="pdf_documents", persist_directory: str = "../data/vector_store"):
        """
        Initialize the vector store
        
        Args:
            collection_name (str): Name of the ChromaDB collection.
            persist_directory (str): Directory to persist the vector store data.
        """

        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_vector_store()
    
    def _initialize_vector_store(self):
        """Initialize the ChromaDB client and collection."""
        try:
            # Create persist ChromaDB client (ChromaDB v0.4+ uses PersistentClient)
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path=self.persist_directory)

            # Get or create collection with COSINE similarity metric
            self.collection = self.client.get_or_create_collection(
                name=self.collection_name,
                metadata={"hnsw:space": "cosine", "description": "PDF document embeddings for RAG"})
            print(f"ChromaDB collection '{self.collection_name}' initialized successfully.")
            print(f"Existing documents in collection: {self.collection.count()}")

        except Exception as e:
            print(f"Error initializing ChromaDB: {e}")
            raise
    
    def add_documents(self, documents: List[Any], embeddings: np.ndarray):
        """
        Add documents and their embeddings to the vector store.

        Args:
            documents (List[Any]): List of Langchain Document.
            embeddings (np.ndarray): Corresponding embeddings for the documents.
        """
        if len(documents) != len(embeddings):
            raise ValueError("Number of documents must match number of embeddings.")
        
        print(f"Adding {len(documents)} documents to the vector store...")

        # Prepare data for insertion        
        ids = []
        metadatas = []
        documents_text = []
        embeddings_list = []

        for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
            # Generate a unique ID for each document
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)

            # Prepare metadata
            metadata = dict(doc.metadata)
            metadata['doc_index'] = i
            metadata['content_length'] = len(doc.page_content)
            metadatas.append(metadata)

            # Document text
            documents_text.append(doc.page_content)

            # Embedding
            embeddings_list.append(embedding.tolist()) 
        
        # Add to collection
        try:
            self.collection.add(
                ids=ids,
                metadatas=metadatas,
                documents=documents_text,
                embeddings=embeddings_list
            )
            print(f"Successfully added {len(documents)} documents. Total documents in collection: {self.collection.count()}")
        except Exception as e:
            print(f"Error adding documents to ChromaDB: {e}")
            raise

vectorstore = VectorStore()
vectorstore

ChromaDB collection 'pdf_documents' initialized successfully.
Existing documents in collection: 82


<__main__.VectorStore at 0x2448b732150>

# ⚠️ ONE-TIME FIX: Delete old collection and recreate with cosine similarity
# Run this cell ONCE if you have old data, then you can delete this cell
vectorstore.client.delete_collection(name="pdf_documents")
vectorstore = VectorStore()
print("✅ Vector store recreated with cosine similarity")

In [21]:
# Display the PDF chunks to verify
pdf_chunks

[Document(metadata={'producer': 'pdfTeX-1.40.26', 'creator': 'LaTeX with hyperref', 'creationdate': '2025-08-27T08:25:26+00:00', 'source': '..\\data\\pdf_files\\rohan_cv_aug.pdf', 'file_path': '..\\data\\pdf_files\\rohan_cv_aug.pdf', 'total_pages': 1, 'format': 'PDF 1.5', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2025-08-27T08:25:26+00:00', 'trapped': '', 'modDate': 'D:20250827082526Z', 'creationDate': 'D:20250827082526Z', 'page': 0}, page_content='GitHub: github.com/PhoneixDeadeye\nLinkedIn: linkedin.com/in/rohan-agarwal007/\nEmail: agarwalrohan2004@gmail.com\nMobile: +91-9811229836\nSkills\nLanguages:\nJava, C, C++, Python, HTML\nFrameworks:\nPandas, OpenCV, Tesseract OCR, Tkinter, TensorFlow, REST API, ComfyUI, AWS\nTools/Platforms:\nVS Code, IntelliJ IDEA, Git, GitHub, Canva, Linux/Unix\nSoft Skills:\nTime Management, Fast Learner, Leadership, Team Collaboration\nTraining\nSummer Training in Competitive Programming'),
 Document(metadata={'producer': 'pdf

In [22]:
# Convert the PDF chunks to embeddings
texts = [doc.page_content for doc in pdf_chunks]

# Generate the Embeddings
embeddings = embedding_manager.generate_embeddings(texts)

# Store in the vector database
vectorstore.add_documents(pdf_chunks, embeddings)

Generating embeddings for 41 texts...


Batches: 100%|██████████| 2/2 [00:00<00:00,  3.49it/s]

Generated embeddings with shape: (41, 384)
Adding 41 documents to the vector store...
Successfully added 41 documents. Total documents in collection: 123
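
The count rising from 82 to 123 suggests that each run of this cell appends another copy of the same 41 chunks, since `uuid.uuid4()` yields fresh IDs every time. One possible way around that is to derive the ID from the chunk itself, so the same chunk always maps to the same ID. The sketch below is an assumption, not part of the notebook's code: `make_chunk_id` is a hypothetical helper, and the sample strings are made up.

```python
import hashlib


def make_chunk_id(source: str, page: int, content: str) -> str:
    """Derive a stable ID from where the chunk came from and what it says."""
    key = f"{source}|{page}|{content}".encode("utf-8")
    return "doc_" + hashlib.sha256(key).hexdigest()[:16]


first_run = make_chunk_id("cv.pdf", 0, "Skills: Java, C, C++, Python")
second_run = make_chunk_id("cv.pdf", 0, "Skills: Java, C, C++, Python")
other_chunk = make_chunk_id("cv.pdf", 0, "Training: Competitive Programming")

print(first_run == second_run)   # same chunk, same ID on every run
print(first_run == other_chunk)  # different content, different ID
```

With stable IDs, passing them to ChromaDB's `collection.upsert` in place of `collection.add` should overwrite existing entries rather than grow the collection on every re-run.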





### Retriever Pipeline From VectorStore

In [23]:
class RAGRetriever:
    """Handles query-based retrieval from the vector store"""

    def __init__(self, vectorstore: VectorStore, embedding_manager: EmbeddingManager):
        """
        Initialize the retriever

        Args:
            vectorstore: Vector store containing document embeddings
            embedding_manager: Manager for generating query embeddings
        """

        self.vector_store = vectorstore
        self.embedding_manager = embedding_manager
    
    def retrieve(self, query: str, top_k: int=5, score_threshold: float=0.0) -> List[Dict[str, Any]]:
        """
        Retrieve top_k relevant documents for the given query.

        Args:
            query (str): The input query string.
            top_k (int): Number of top documents to retrieve.
            score_threshold (float): Minimum similarity score to consider a document relevant.
        
        Returns:
            List of dictionaries with the keys 'id', 'content', 'metadata',
            'similarity_score', 'distance', and 'rank'. Returns an empty list on error.
        """
        # Validate inputs
        if not query or not query.strip():
            raise ValueError("query cannot be empty")
        
        if top_k <= 0:
            raise ValueError("top_k must be positive")

        print(f"Retrieving documents for query: '{query}'")
        print(f"Top K: {top_k}, Score Threshold: {score_threshold}")

        # Generate embedding for the query
        query_embedding = self.embedding_manager.generate_embeddings([query])[0]

        # Search in the vector store
        try:
            results = self.vector_store.collection.query(
                query_embeddings=[query_embedding.tolist()],
                n_results=top_k
            )
            
            # Process results
            retrieved_docs = []

            if results['documents'] and results['documents'][0]:
                documents = results['documents'][0]
                metadatas = results['metadatas'][0]
                distances = results['distances'][0]
                ids = results['ids'][0]

                for i, (doc_id, document, metadata, distance) in enumerate(zip(ids, documents, metadatas, distances)):
                    # Convert cosine distance to similarity score
                    # Cosine distance ranges from 0 to 2; similarity = 1 - distance ranges from -1 to 1
                    similarity_score = 1 - distance

                    if similarity_score >= score_threshold:
                        retrieved_docs.append({
                            "id": doc_id,
                            "content": document,
                            "metadata": metadata,
                            "similarity_score": similarity_score,
                            "distance": distance,
                            "rank": i + 1
                        })
                
                print(f"Retrieved {len(retrieved_docs)} documents after filtering by score threshold.")
            else:
                print("No documents found in the vector store.")
            return retrieved_docs
        except Exception as e:
            print(f"Error during retrieval: {e}")
            return []

rag_retriever = RAGRetriever(vectorstore, embedding_manager)
rag_retriever

<__main__.RAGRetriever at 0x2448b5e9990>

In [24]:
rag_retriever.retrieve("What are Rohan's technical skills?", top_k=3)

Retrieving documents for query: 'What are Rohan's technical skills?'
Top K: 3, Score Threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 102.20it/s]

Generated embeddings with shape: (1, 384)
Retrieved 3 documents after filtering by score threshold.





[{'id': 'doc_a6baa274_0',
  'content': 'GitHub: github.com/PhoneixDeadeye\nLinkedIn: linkedin.com/in/rohan-agarwal007/\nEmail: agarwalrohan2004@gmail.com\nMobile: +91-9811229836\nSkills\nLanguages:\nJava, C, C++, Python, HTML\nFrameworks:\nPandas, OpenCV, Tesseract OCR, Tkinter, TensorFlow, REST API, ComfyUI, AWS\nTools/Platforms:\nVS Code, IntelliJ IDEA, Git, GitHub, Canva, Linux/Unix\nSoft Skills:\nTime Management, Fast Learner, Leadership, Team Collaboration\nTraining\nSummer Training in Competitive Programming',
  'metadata': {'producer': 'pdfTeX-1.40.26',
   'total_pages': 1,
   'author': '',
   'moddate': '2025-08-27T08:25:26+00:00',
   'page': 0,
   'creationdate': '2025-08-27T08:25:26+00:00',
   'file_path': '..\\data\\pdf_files\\rohan_cv_aug.pdf',
   'trapped': '',
   'doc_index': 0,
   'creator': 'LaTeX with hyperref',
   'content_length': 465,
   'format': 'PDF 1.5',
   'subject': '',
   'keywords': '',
   'source': '..\\data\\pdf_files\\rohan_cv_aug.pdf',
   'title': '',
  

In [25]:
rag_retriever.retrieve("What are vietnam's challenges", top_k=3)

Retrieving documents for query: 'What are vietnam's challenges'
Top K: 3, Score Threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 107.08it/s]

Generated embeddings with shape: (1, 384)
Retrieved 3 documents after filtering by score threshold.





[{'id': 'doc_e0d2fe25_17',
  'content': 'skills. High literacy rates make it easy to train people for new jobs.\nStrategic Location: Vietnam boasts a long coastline along the South China Sea, a very\nactive global shipping route. It’s simple to send products to customers in Asia and the rest',
  'metadata': {'source': '..\\data\\pdf_files\\want the complete content_no omission.pdf',
   'doc_index': 17,
   'file_path': '..\\data\\pdf_files\\want the complete content_no omission.pdf',
   'creationDate': "D:20250908155505+00'00'",
   'content_length': 249,
   'moddate': '2025-09-08T15:55:05+00:00',
   'subject': '',
   'page': 1,
   'modDate': "D:20250908155505+00'00'",
   'title': '',
   'keywords': '',
   'author': '',
   'format': 'PDF 1.4',
   'total_pages': 7,
   'creationdate': '2025-09-08T15:55:05+00:00',
   'trapped': '',
   'producer': 'Skia/PDF m127',
   'creator': 'Chromium'},
  'similarity_score': 0.6216747760772705,
  'distance': 0.3783252239227295,
  'rank': 1},
 {'id': 'doc
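
The `similarity_score` and `distance` values in these results always sum to 1 (here 0.6217 + 0.3783), because the retriever computes similarity as `1 - distance`. The following sketch works through that relationship on small hand-made vectors using only the standard library; the vectors and function names are illustrative and not taken from the notebook.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance as used for the 'cosine' space: 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)


query = [1.0, 0.0]
examples = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

for name, vector in examples.items():
    distance = cosine_distance(query, vector)
    print(f"{name}: distance={distance:.1f}, similarity={1 - distance:.1f}")
```

Distance runs from 0 (same direction) to 2 (opposite), so a `score_threshold` of 0.0 keeps everything that is not pointing away from the query.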

### Integrate VectorDB Context pipeline with LLM Output

In [32]:
### Simple RAG pipeline with GROQ LLM
from langchain_groq import ChatGroq
import os
from dotenv import load_dotenv
load_dotenv()

### Initialize the GROQ LLM
groq_api_key = os.getenv("GROQ_API_KEY")

llm = ChatGroq(api_key=groq_api_key, model="gemma2-9b-it", temperature=0.1, max_tokens=1024)

## 2. Simple RAG function: retrieve context and generate answer
def rag_simple(query, retriever, llm, top_k=3):
    """ Simple RAG function to retrieve context and generate answer
    Args:
        query: User query string
        retriever: RAGRetriever instance
        llm: Language model instance (e.g., Groq)
        top_k: Number of top documents to retrieve
    Returns:
        Generated answer string
    """
    # Step 1: Retrieve relevant documents
    results = retriever.retrieve(query, top_k=top_k)
    context = "\n\n".join([doc['content'] for doc in results]) if results else ""

    if not context:
        return "No relevant documents found to answer the query."
    
    ## Step 2: Generate answer using LLM
    prompt = f"""You are an expert assistant. Use the following context to answer the question concisely.
        Context: 
        {context}

        Question: 
        {query}

        Answer:
    """

    # The f-string above is already interpolated; a second .format() call would fail on literal braces in the context.
    response = llm.invoke(prompt)
    return response.content
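
One thing to watch when building a prompt from retrieved text is that the text may contain literal `{` or `}` characters (for example, JSON or code inside a PDF). The sketch below, with a hypothetical `build_prompt` helper and a made-up context string, interpolates once with an f-string and then shows that a second `str.format()` pass over the result raises an error.

```python
def build_prompt(context: str, query: str) -> str:
    """Interpolate once with f-strings; no second .format() pass is needed."""
    return (
        "You are an expert assistant. Use the following context to answer "
        "the question concisely.\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{query}\n\n"
        "Answer:"
    )


# Retrieved text can contain literal braces, e.g. a JSON snippet.
context = 'Config example: {"top_k": 3}'
prompt = build_prompt(context, "What is top_k?")
print(prompt)

# A second str.format() pass treats those braces as placeholders and fails.
second_pass_failed = False
try:
    prompt.format(context=context, query="What is top_k?")
except (KeyError, ValueError, IndexError) as exc:
    second_pass_failed = True
    print(f"format() failed: {type(exc).__name__}")
```

The finished string can be passed straight to `llm.invoke(...)`.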

In [33]:
answer = rag_simple("What are Rohan's technical skills?", rag_retriever, llm, top_k=3)
print(answer)

Retrieving documents for query: 'What are Rohan's technical skills?'
Top K: 3, Score Threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 125.05it/s]

Generated embeddings with shape: (1, 384)
Retrieved 3 documents after filtering by score threshold.





Rohan's technical skills include:

* **Languages:** Java, C, C++, Python, HTML
* **Frameworks:** Pandas, OpenCV, Tesseract OCR, Tkinter, TensorFlow, REST API, ComfyUI, AWS
* **Tools/Platforms:** VS Code, IntelliJ IDEA, Git, GitHub, Canva, Linux/Unix 



### Enhanced RAG Pipeline Features

In [40]:
# --- Enhanced RAG Pipeline Features ---

def rag_enhanced(query, retriever, llm, top_k=5, min_score=0.2, return_contexts=False):
    """ Enhanced RAG function with score filtering and optional context return
    Args:
        query: User query string
        retriever: RAGRetriever instance
        llm: Language model instance (e.g., Groq)
        top_k: Number of top documents to retrieve
        min_score: Minimum similarity score to consider a document relevant
        return_contexts: Whether to return the retrieved contexts along with the answer
    Returns:
        Dictionary with 'answer', 'sources', 'confidence', and optionally 'contexts'
    """
    # Step 1: Retrieve relevant documents with score filtering
    results = retriever.retrieve(query, top_k=top_k, score_threshold=min_score)
    if not results:
        return {'answer': "No relevant documents found to answer the query.", 'sources': [], 'confidence': 0.0}
    

    context = "\n\n".join([doc['content'] for doc in results]) if results else ""
    sources = [{
        'source': doc['metadata'].get('source_file', doc['metadata'].get('source', 'unknown')),
        'page': doc['metadata'].get('page', 'unknown'),
        'score': doc['similarity_score'],
        'preview': doc['content'][:120] + ("..." if len(doc['content']) > 120 else "")
    } for doc in results]

    confidence = max(doc['similarity_score'] for doc in results) if results else 0.0
    if not context:
        return {'answer': "No relevant documents found to answer the query.", 'sources': [], 'confidence': 0.0}
    
    ## Step 2: Generate answer using LLM
    prompt = f"""You are an expert assistant. Use the following context to answer the question concisely.
        Context: 
        {context}

        Question: 
        {query}

        Answer:
    """

    # The f-string above is already interpolated; a second .format() call would fail on literal braces in the context.
    response = llm.invoke(prompt)
    
    # Prepare return dictionary
    result_dict = {
        'answer': response.content,
        'sources': sources,
        'confidence': confidence
    }
    
    if return_contexts:
        result_dict['contexts'] = results
    
    return result_dict

In [41]:
# Example usage of the enhanced RAG function
result = rag_enhanced("What are Rohan's technical skills?", rag_retriever, llm, top_k=3, min_score=0.1, return_contexts=True)
print("Answer:", result['answer'])
print("\nSources:")
for i, source in enumerate(result['sources'], 1):
    print(f"  {i}. {source['source']} (page {source['page']}) - Score: {source['score']:.3f}")
    print(f"     Preview: {source['preview']}")
print(f"\nConfidence: {result['confidence']:.3f}")
if 'contexts' in result:
    print(f"\nNumber of contexts retrieved: {len(result['contexts'])}")

Retrieving documents for query: 'What are Rohan's technical skills?'
Top K: 3, Score Threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 129.99it/s]

Generated embeddings with shape: (1, 384)
Retrieved 3 documents after filtering by score threshold.





Answer: Rohan's technical skills include:

* **Languages:** Java, C, C++, Python, HTML
* **Frameworks:** Pandas, OpenCV, Tesseract OCR, Tkinter, TensorFlow, REST API, ComfyUI, AWS
* **Tools/Platforms:** VS Code, IntelliJ IDEA, Git, GitHub, Canva, Linux/Unix 


Sources:
  1. ..\data\pdf_files\rohan_cv_aug.pdf (page 0) - Score: 0.386
     Preview: GitHub: github.com/PhoneixDeadeye
LinkedIn: linkedin.com/in/rohan-agarwal007/
Email: agarwalrohan2004@gmail.com
Mobile: ...
  2. ..\data\pdf_files\rohan_cv_aug.pdf (page 0) - Score: 0.386
     Preview: GitHub: github.com/PhoneixDeadeye
LinkedIn: linkedin.com/in/rohan-agarwal007/
Email: agarwalrohan2004@gmail.com
Mobile: ...
  3. ..\data\pdf_files\rohan_cv_aug.pdf (page 0) - Score: 0.386
     Preview: GitHub: github.com/PhoneixDeadeye
LinkedIn: linkedin.com/in/rohan-agarwal007/
Email: agarwalrohan2004@gmail.com
Mobile: ...

Confidence: 0.386

Number of contexts retrieved: 3
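
All three sources above are the same chunk with the same score, which is what one would expect if the collection holds several copies of each chunk. A possible mitigation on the retrieval side is to drop repeated hits before building the context. This is a sketch under that assumption; `dedupe_results` is a hypothetical helper and the sample hits are made up to mirror the shape returned by `RAGRetriever.retrieve`.

```python
from typing import Any, Dict, List


def dedupe_results(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first (best-ranked) hit for each distinct chunk text."""
    seen = set()
    unique = []
    for hit in results:
        key = hit["content"]
        if key in seen:
            continue  # an identical chunk was already kept
        seen.add(key)
        unique.append(hit)
    return unique


hits = [
    {"id": "doc_a_0", "content": "Skills: Java, Python", "similarity_score": 0.386},
    {"id": "doc_b_0", "content": "Skills: Java, Python", "similarity_score": 0.386},
    {"id": "doc_c_5", "content": "Training: Competitive Programming", "similarity_score": 0.21},
]
unique_hits = dedupe_results(hits)
print([hit["id"] for hit in unique_hits])
```

Because duplicates consume slots in `top_k`, it may help to request a larger `top_k` and deduplicate afterwards.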


In [42]:
# --- Advanced RAG Pipeline: Streaming, Citations, History, Summarization ---
from typing import List, Dict, Any
import time

class AdvancedRAGPipeline:
    """Advanced RAG Pipeline with streaming, citations, history, and summarization."""
    
    def __init__(self, retriever: RAGRetriever, llm: Any):
        """
        Initialize the advanced RAG pipeline.

        Args:
            retriever: RAGRetriever instance for document retrieval.
            llm: Language model instance (e.g., Groq) for answer generation.
        """
        self.retriever = retriever
        self.llm = llm
        self.conversation_history = [] # To store past interactions

    def query(self, question: str, top_k: int=5, min_score: float=0.2, stream: bool=False, summarize: bool=False) -> Dict[str, Any]:
        """
        Process a user query with advanced features.

        Args:
            question (str): The user query string.
            top_k (int): Number of top documents to retrieve.
            min_score (float): Minimum similarity score to consider a document relevant.
            stream (bool): Whether to stream the response.
            summarize (bool): Accepted but not used by the code shown in this cell.

        Returns:
            Dictionary with 'answer', 'sources', 'confidence', and optionally 'contexts'.
        """
        # Step 1: Retrieve relevant documents
        results = self.retriever.retrieve(question, top_k=top_k, score_threshold=min_score)
        if not results:
            return {'answer': "No relevant documents found to answer the query.", 'sources': [], 'confidence': 0.0}
        
        context = "\n\n".join([doc['content'] for doc in results]) if results else ""
        sources = [{
            'source': doc['metadata'].get('source_file', doc['metadata'].get('source', 'unknown')),
            'page': doc['metadata'].get('page', 'unknown'),
            'score': doc['similarity_score'],
            'preview': doc['content'][:120] + ("..." if len(doc['content']) > 120 else "")
        } for doc in results]

        confidence = max(doc['similarity_score'] for doc in results) if results else 0.0
        if not context:
            return {'answer': "No relevant documents found to answer the query.", 'sources': [], 'confidence': 0.0}
        
        # Step 2: Generate answer using LLM with conversation history
        prompt = f"""You are an expert assistant. Use the following context to answer the question concisely.
            Context: 
            {context}

            Conversation History:
            {self._format_history()}

            Question: 
            {question}

            Answer:
        """

        if stream:
            return self._stream_response(prompt, context, question, sources, confidence)
        else:
            # The f-string above is already interpolated; a second .format() call would fail on literal braces.
            response = self.llm.invoke(prompt)
            answer = response.content
            self._update_history(question, answer)
            
            return {
                'answer': answer,
                'sources': sources,
                'confidence': confidence,
                'contexts': results
            }
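
The class above calls `_format_history`, `_update_history`, and `_stream_response`, none of which appear in the cell as shown. As a rough idea of what the two history helpers might do, here is a standalone sketch. Everything in it is an assumption: the class name `ConversationHistory`, the `max_turns` limit, and the rendered format are hypothetical and not taken from the notebook.

```python
from typing import Dict, List


class ConversationHistory:
    """Hypothetical helper: stores question/answer turns and renders them for a prompt."""

    def __init__(self, max_turns: int = 5):
        if max_turns <= 0:
            raise ValueError("max_turns must be positive")
        self.max_turns = max_turns
        self.turns: List[Dict[str, str]] = []

    def update(self, question: str, answer: str) -> None:
        """Record one exchange and keep only the most recent max_turns."""
        self.turns.append({"question": question, "answer": answer})
        self.turns = self.turns[-self.max_turns:]

    def format(self) -> str:
        """Render the stored turns as plain text for inclusion in a prompt."""
        if not self.turns:
            return "(no previous turns)"
        return "\n".join(
            f"User: {turn['question']}\nAssistant: {turn['answer']}"
            for turn in self.turns
        )


history = ConversationHistory(max_turns=2)
print(history.format())
history.update("Q1", "A1")
history.update("Q2", "A2")
history.update("Q3", "A3")
print(history.format())
```

Capping the number of turns keeps the prompt from growing without bound as the conversation continues.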

In [46]:
# ===== ENHANCED RAG QUERY DEMO =====
import json

print("🔍 RAG ENHANCED QUERY DEMO")
print("=" * 100)
print()

# Execute the enhanced RAG query
query = "What are Rohan's technical skills?"
print(f"📋 Query: '{query}'")
print(f"⚙️  Parameters: top_k=3, min_score=0.1, return_contexts=True")
print("-" * 100)
print()

result = rag_enhanced(query, rag_retriever, llm, top_k=3, min_score=0.1, return_contexts=True)

# ===== DISPLAY THE ANSWER =====
print("💡 GENERATED ANSWER:")
print("=" * 100)
print(result['answer'])
print("=" * 100)
print()

# ===== DISPLAY SOURCES =====
print("📚 SOURCES & EVIDENCE:")
print("-" * 100)
for i, source in enumerate(result['sources'], 1):
    print(f"\n┌─ SOURCE #{i}")
    print(f"│  📄 File: {source['source']}")
    print(f"│  📖 Page: {source['page']}")
    print(f"│  🎯 Similarity Score: {source['score']:.4f} ({source['score']*100:.2f}%)")
    print(f"│")
    print(f"│  📝 Preview:")
    print(f"│  {source['preview']}")
    print(f"└{'─' * 98}")

print()

# ===== DISPLAY CONFIDENCE & METRICS =====
print("📊 RETRIEVAL METRICS:")
print("-" * 100)
print(f"✅ Overall Confidence: {result['confidence']:.4f} ({result['confidence']*100:.2f}%)")
print(f"📦 Documents Retrieved: {len(result['sources'])}")

# ===== DISPLAY DETAILED CONTEXT INFO =====
if 'contexts' in result:
    print()
    print("🔎 DETAILED CONTEXT INFORMATION:")
    print("-" * 100)
    for i, ctx in enumerate(result['contexts'], 1):
        print(f"\n├─ CONTEXT #{i}")
        print(f"│  🆔 Document ID: {ctx['id']}")
        print(f"│  🏆 Rank: #{ctx['rank']}")
        print(f"│  📏 Distance: {ctx['distance']:.6f}")
        print(f"│  📐 Similarity: {ctx['similarity_score']:.6f}")
        print(f"│  📝 Content Length: {len(ctx['content'])} characters")
        print(f"│")
        print(f"│  📄 Metadata:")
        for key, value in ctx['metadata'].items():
            print(f"│     • {key}: {value}")
        print(f"│")
        print(f"│  📋 Content Snippet:")
        snippet = ctx['content'][:200].replace('\n', ' ')
        print(f"│     {snippet}...")
    print(f"└{'─' * 98}")

print()
print("=" * 100)
print("✅ Query Completed Successfully!")
print("=" * 100)

🔍 RAG ENHANCED QUERY DEMO

📋 Query: 'What are Rohan's technical skills?'
⚙️  Parameters: top_k=3, min_score=0.1, return_contexts=True
----------------------------------------------------------------------------------------------------

Retrieving documents for query: 'What are Rohan's technical skills?'
Top K: 3, Score Threshold: 0.1
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 111.14it/s]

Generated embeddings with shape: (1, 384)
Retrieved 3 documents after filtering by score threshold.





💡 GENERATED ANSWER:
Rohan's technical skills include:

* **Languages:** Java, C, C++, Python, HTML
* **Frameworks:** Pandas, OpenCV, Tesseract OCR, Tkinter, TensorFlow, REST API, ComfyUI, AWS
* **Tools/Platforms:** VS Code, IntelliJ IDEA, Git, GitHub, Canva, Linux/Unix 


📚 SOURCES & EVIDENCE:
----------------------------------------------------------------------------------------------------

┌─ SOURCE #1
│  📄 File: ..\data\pdf_files\rohan_cv_aug.pdf
│  📖 Page: 0
│  🎯 Similarity Score: 0.3863 (38.63%)
│
│  📝 Preview:
│  GitHub: github.com/PhoneixDeadeye
LinkedIn: linkedin.com/in/rohan-agarwal007/
Email: agarwalrohan2004@gmail.com
Mobile: ...
└──────────────────────────────────────────────────────────────────────────────────────────────────

┌─ SOURCE #2
│  📄 File: ..\data\pdf_files\rohan_cv_aug.pdf
│  📖 Page: 0
│  🎯 Similarity Score: 0.3863 (38.63%)
│
│  📝 Preview:
│  GitHub: github.com/PhoneixDeadeye
LinkedIn: linkedin.com/in/rohan-agarwal007/
Email: agarwalrohan2004@gmail.com
Mobile