In [10]:
## Document data structure

from langchain_core.documents import Document

In [11]:
doc = Document(
    page_content = "this is the main text content I am using to create RAG",
    metadata = {
        "source":"example.txt",
        "pages":1,
        "author": "Sannidhi",
        "date_created": "2025-10-29"
    }
)

In [12]:
doc

Document(metadata={'source': 'example.txt', 'pages': 1, 'author': 'Sannidhi', 'date_created': '2025-10-29'}, page_content='this is the main text content I am using to create RAG')

In [13]:
## create a simple text file
import os
os.makedirs('../data/text_files', exist_ok = True)

In [14]:
sample_texts = {
    "../data/text_files/python_intro.txt": """
    
    Python Introduction

Python is a high-level, interpreted programming language known for its simplicity, readability, and versatility. It was created by Guido van Rossum and first released in 1991. Python emphasizes code readability with its clean and easy-to-understand syntax, making it beginner-friendly yet powerful for advanced programming.

Key Features of Python

Simple and Readable: Easy syntax resembling natural language.

Interpreted Language: Code is executed line by line, no compilation needed.

Dynamically Typed: No need to declare variable types explicitly.

Cross-Platform: Runs on Windows, Linux, macOS, etc.

Extensive Libraries: Rich ecosystem for web development, data science, machine learning, AI, automation, and more.

Open Source: Free to use and supported by a large community.

Common Uses

Web development (Django, Flask)

Data Science & Machine Learning (Pandas, NumPy, Scikit-learn, TensorFlow, PyTorch)

Automation and scripting

Game development (Pygame)

Desktop applications (Tkinter, PyQt)

Python’s versatility and ease of learning make it one of the most popular languages for beginners and professionals alike.
    """
}

for filepath, content in sample_texts.items():
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(content)

print("Sample text files created!")

Sample text files created!


In [15]:
## TextLoader

from langchain_community.document_loaders import TextLoader

In [16]:
loader = TextLoader('../data/text_files/python_intro.txt', encoding='utf-8')
loader

<langchain_community.document_loaders.text.TextLoader at 0x174049240>

In [17]:
document_1 = loader.load()
document_1

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='\n    \n    Python Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity, readability, and versatility. It was created by Guido van Rossum and first released in 1991. Python emphasizes code readability with its clean and easy-to-understand syntax, making it beginner-friendly yet powerful for advanced programming.\n\nKey Features of Python\n\nSimple and Readable: Easy syntax resembling natural language.\n\nInterpreted Language: Code is executed line by line, no compilation needed.\n\nDynamically Typed: No need to declare variable types explicitly.\n\nCross-Platform: Runs on Windows, Linux, macOS, etc.\n\nExtensive Libraries: Rich ecosystem for web development, data science, machine learning, AI, automation, and more.\n\nOpen Source: Free to use and supported by a large community.\n\nCommon Uses\n\nWeb development (Django, Flask)\n\nData Science & Machine Learni

In [18]:
## Directory Loader

from langchain_community.document_loaders import DirectoryLoader


In [19]:
## load all the text files from the directory
dir_loader =  DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt",
    loader_cls = TextLoader,
    loader_kwargs = {'encoding': 'utf-8'},
    show_progress = False
)
documents = dir_loader.load()
documents

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='\n    \n    Python Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity, readability, and versatility. It was created by Guido van Rossum and first released in 1991. Python emphasizes code readability with its clean and easy-to-understand syntax, making it beginner-friendly yet powerful for advanced programming.\n\nKey Features of Python\n\nSimple and Readable: Easy syntax resembling natural language.\n\nInterpreted Language: Code is executed line by line, no compilation needed.\n\nDynamically Typed: No need to declare variable types explicitly.\n\nCross-Platform: Runs on Windows, Linux, macOS, etc.\n\nExtensive Libraries: Rich ecosystem for web development, data science, machine learning, AI, automation, and more.\n\nOpen Source: Free to use and supported by a large community.\n\nCommon Uses\n\nWeb development (Django, Flask)\n\nData Science & Machine Learni

In [20]:
## Read all the pdf files
from langchain_community.document_loaders import PyPDFLoader, PyMuPDFLoader


In [21]:
## load all the pdf files from the directory
dir_loader =  DirectoryLoader(
    "../data/pdf_files",
    glob="**/*.pdf",            # match PDF files; a "*.txt" pattern finds no PDFs in this directory
    loader_cls = PyMuPDFLoader, # PDFs are parsed as binary, so no text encoding is passed
    show_progress = False
)
documents = dir_loader.load()
documents

[]

In [22]:
## embedding and vector store DB
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity


In [23]:
class EmbeddingManager:
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        """
        Initialize the embedding manager
        args: model_name: HuggingFace model name for sentence embeddings
        """
        self.model_name = model_name
        self.model: SentenceTransformer | None = None
        self._load_model()

    def _load_model(self):
        # load the sentence transformer model 
        try:
            print(f"Loading embedding model: {self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded successfully. Embedding dimension: {self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"Error loading model {self.model_name}: {e}")
            raise

    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
        """Generate embeddings for a list of texts"""
        if not self.model:
            raise ValueError("Model not loaded")
        print(f"Generating embeddings for {len(texts)} texts...")
        embeddings = self.model.encode(texts, show_progress_bar=True)
        print(f"Generated embeddings with shape: {embeddings.shape}")
        return embeddings

## Initialize the embedding manager
embedding_manager = EmbeddingManager()
embedding_manager

Loading embedding model: all-MiniLM-L6-v2
Model loaded successfully. Embedding dimension: 384


<__main__.EmbeddingManager at 0x363a69360>

In [24]:
class VectorStore:
    # Manage document embeddings in a ChromaDB vector store

    def __init__(self, collection_name: str = "text_documents", persist_directory: str = "../data/vector_store"):
        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()

    def _initialize_store(self):
        try:
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path=self.persist_directory)

            # Get or create collection
            self.collection = self.client.get_or_create_collection(
                name=self.collection_name,
                metadata={'description': "Text document embeddings for RAG"}
            )
            print(f"Vector store initialized. Collection: {self.collection_name}")
            print(f"Existing documents in collection: {self.collection.count()}")
        except Exception as e:
            print(f"Error initializing vector store: {e}")
            raise

    def add_documents(self, documents: List[Any], embeddings: np.ndarray):
        """
        Add documents and their embeddings to the vector store

        Args:
            documents: List of LangChain Document objects
            embeddings: Corresponding embeddings as a numpy array
        """
        if len(documents) != len(embeddings):
            raise ValueError("Number of documents must match number of embeddings")
        print(f"Adding {len(documents)} documents to vector store...")

        ids = []
        metadatas = []
        documents_text = []
        embeddings_list = []

        for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)

            metadata = dict(doc.metadata)
            metadata['doc_index'] = i
            metadata['content_length'] = len(doc.page_content)
            metadatas.append(metadata)

            documents_text.append(doc.page_content)
            embeddings_list.append(embedding.tolist())

        try:
            self.collection.add(
                ids=ids,
                embeddings=embeddings_list,
                metadatas=metadatas,
                documents=documents_text
            )
            print(f"Successfully added {len(documents)} documents to vector store")
            print(f"Total documents in collection: {self.collection.count()}")
        except Exception as e:
            print(f"Error adding documents to vector store: {e}")
            raise

vectorstore = VectorStore()
vectorstore

Vector store initialized. Collection: text_documents
Existing documents in collection: 3


<__main__.VectorStore at 0x368942710>

In [25]:
## create chunks

from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # Number of characters per chunk
    chunk_overlap=50      # Overlap between chunks
)

In [26]:
chunked_documents = []
for doc in document_1:
    chunks = text_splitter.split_text(doc.page_content)
    for i, chunk in enumerate(chunks):
        chunked_documents.append(
            Document(
                page_content=chunk,
                metadata={**doc.metadata, "chunk_index": i}
            )
        )

print(f"Total chunks created: {len(chunked_documents)}")

Total chunks created: 3
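
How `chunk_size` and `chunk_overlap` interact can be seen with a much simpler splitter. The sketch below is not what `RecursiveCharacterTextSplitter` does internally (that class prefers to break on separators such as blank lines); it is a plain fixed-window version, and `sliding_window_chunks` is a hypothetical helper name used only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window starts after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

demo = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(demo, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the role the 50-character overlap plays for the 500-character chunks above.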


In [27]:
texts = [doc.page_content for doc in chunked_documents]
texts

['Python Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity, readability, and versatility. It was created by Guido van Rossum and first released in 1991. Python emphasizes code readability with its clean and easy-to-understand syntax, making it beginner-friendly yet powerful for advanced programming.\n\nKey Features of Python\n\nSimple and Readable: Easy syntax resembling natural language.',
 'Interpreted Language: Code is executed line by line, no compilation needed.\n\nDynamically Typed: No need to declare variable types explicitly.\n\nCross-Platform: Runs on Windows, Linux, macOS, etc.\n\nExtensive Libraries: Rich ecosystem for web development, data science, machine learning, AI, automation, and more.\n\nOpen Source: Free to use and supported by a large community.\n\nCommon Uses\n\nWeb development (Django, Flask)',
 'Common Uses\n\nWeb development (Django, Flask)\n\nData Science & Machine Learning (Pandas, NumPy, Scikit-learn, TensorFlo

In [28]:
## generate the embeddings
embeddings = embedding_manager.generate_embeddings(texts)

## store into the vector database
vectorstore.add_documents(chunked_documents, embeddings)

Generating embeddings for 3 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00,  3.11it/s]

Generated embeddings with shape: (3, 384)
Adding 3 documents to vector store...
Successfully added 3 documents to vector store
Total documents in collection: 6
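
The count goes from 3 to 6 because `add_documents` builds each ID from `uuid.uuid4()`, so re-running the cell stores the same three chunks again under new IDs. One possible way around this is to derive the ID from the chunk itself. The sketch below uses only `hashlib`; `stable_chunk_id` is a hypothetical helper, not part of the notebook or of ChromaDB.

```python
import hashlib

def stable_chunk_id(source: str, chunk_index: int, content: str) -> str:
    """Build a deterministic ID from the chunk's source, position, and text."""
    raw = f"{source}|{chunk_index}|{content}".encode("utf-8")
    digest = hashlib.sha256(raw).hexdigest()
    return f"doc_{digest[:16]}"

# The same chunk always maps to the same ID; a different chunk index changes it
id_a = stable_chunk_id("../data/text_files/python_intro.txt", 0, "Python Introduction")
id_b = stable_chunk_id("../data/text_files/python_intro.txt", 0, "Python Introduction")
id_c = stable_chunk_id("../data/text_files/python_intro.txt", 1, "Python Introduction")
print(id_a == id_b, id_a == id_c)
```

With IDs like these, writing through the collection's `upsert` method instead of `add` should overwrite existing entries rather than duplicate them; treat that as an assumption to check against your installed ChromaDB version.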





In [29]:
## Retriever pipeline from the vector store

class RAGRetriever:

    def __init__(self, vector_store: VectorStore, embedding_manager: EmbeddingManager):

        self.vector_store = vector_store
        self.embedding_manager = embedding_manager

    def retrieve(self, query: str, top_k: int = 5, score_threshold: float = 0.0) -> List[Dict[str, Any]]:
        """
        Retrieve relevant documents for a query
        
        Args:
            query: The search query
            top_k: Number of top results to return
            score_threshold: Minimum similarity score threshold
            
        Returns:
            List of dictionaries containing retrieved documents and metadata
        """
        print(f"Retrieving documents for query: '{query}'")
        print(f"Top K: {top_k}, Score threshold: {score_threshold}")
        
        # Generate query embedding
        query_embedding = self.embedding_manager.generate_embeddings([query])[0]
        
        # Search in vector store
        try:
            results = self.vector_store.collection.query(
                query_embeddings=[query_embedding.tolist()],
                n_results=top_k
            )
            
            # Process results
            retrieved_docs = []
            
            if results['documents'] and results['documents'][0]:
                documents = results['documents'][0]
                metadatas = results['metadatas'][0]
                distances = results['distances'][0]
                ids = results['ids'][0]
                
                for i, (doc_id, document, metadata, distance) in enumerate(zip(ids, documents, metadatas, distances)):
                    # Convert distance to a similarity score. This assumes cosine distance;
                    # Chroma defaults to squared L2 unless the collection sets 'hnsw:space' to 'cosine'.
                    similarity_score = 1 - distance
                    
                    if similarity_score >= score_threshold:
                        retrieved_docs.append({
                            'id': doc_id,
                            'content': document,
                            'metadata': metadata,
                            'similarity_score': similarity_score,
                            'distance': distance,
                            'rank': i + 1
                        })
                
                print(f"Retrieved {len(retrieved_docs)} documents (after filtering)")
            else:
                print("No documents found")
            
            return retrieved_docs
            
        except Exception as e:
            print(f"Error during retrieval: {e}")
            return []

rag_retriever=RAGRetriever(vectorstore,embedding_manager)
rag_retriever

<__main__.RAGRetriever at 0x368c1fcd0>
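
The `1 - distance` conversion in `retrieve` only equals cosine similarity when the collection reports cosine distance. As a rough check of how the two metrics relate, the sketch below compares cosine distance with squared L2 distance on unit-length vectors. It assumes the embeddings are L2-normalized; the vectors `a` and `b` are made-up values, not model output.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

# Two made-up vectors standing in for a query and a chunk embedding
a = l2_normalize(np.array([1.0, 2.0, 3.0]))
b = l2_normalize(np.array([2.0, 1.0, 0.5]))

cosine_sim = float(a @ b)                      # cosine similarity of unit vectors
cosine_distance = 1.0 - cosine_sim             # what a cosine-space index would report
squared_l2 = float(np.sum((a - b) ** 2))       # what a squared-L2 index would report

# For unit vectors: squared_l2 = 2 - 2 * cosine_sim = 2 * cosine_distance
sim_from_cosine = 1.0 - cosine_distance
sim_from_l2 = 1.0 - squared_l2 / 2.0

print(np.isclose(sim_from_cosine, cosine_sim), np.isclose(sim_from_l2, cosine_sim))
```

So if the collection is using squared L2, `1 - distance / 2` recovers the cosine similarity for normalized embeddings, while `1 - distance` understates it.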

In [30]:
rag_retriever.retrieve("What is Python?")

Retrieving documents for query: 'What is Python?'
Top K: 5, Score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00,  9.00it/s]

Generated embeddings with shape: (1, 384)
Retrieved 5 documents (after filtering)





[{'id': 'doc_d9978a5f_0',
  'content': 'Python Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity, readability, and versatility. It was created by Guido van Rossum and first released in 1991. Python emphasizes code readability with its clean and easy-to-understand syntax, making it beginner-friendly yet powerful for advanced programming.\n\nKey Features of Python\n\nSimple and Readable: Easy syntax resembling natural language.',
  'metadata': {'chunk_index': 0,
   'content_length': 431,
   'doc_index': 0,
   'source': '../data/text_files/python_intro.txt'},
  'similarity_score': 0.5865612924098969,
  'distance': 0.41343870759010315,
  'rank': 1},
 {'id': 'doc_f5a0da11_0',
  'content': 'Python Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity, readability, and versatility. It was created by Guido van Rossum and first released in 1991. Python emphasizes code readability with its clean and easy-

Integrating the VectorDB Context Pipeline with LLM Output

In [31]:
## simple RAG pipeline with Groq LLM

from langchain_groq import ChatGroq
import os
from dotenv import load_dotenv

load_dotenv()

groq_api_key = os.getenv('GROQ_API_KEY')  # the variable name must match the key in your .env file


In [46]:
LLM = ChatGroq(groq_api_key=groq_api_key, model_name = 'llama-3.1-8b-instant', temperature = 0.1, max_tokens = 1024)

In [47]:
def rag_simple(query, retriever, llm, top_k = 2):
    # Retrieve the top_k chunks and join them into one context string
    results = retriever.retrieve(query, top_k=top_k)
    context = "\n\n".join([doc['content'] for doc in results]) if results else ""
    print(f"retrieved context: \n {context} \n")
    if not context:
        return "No relevant answer found"

    # The f-string already interpolates context and query, so no .format() call is needed
    prompt = f""" Use the following context to answer the question precisely.

        Context: {context}
        question: {query}
        Answer: 
    """
    response = llm.invoke(prompt)
    return response.content


In [48]:
answer = rag_simple("Python is or difficult? ", rag_retriever, LLM)
print(answer)

Retrieving documents for query: 'Python is or difficult? '
Top K: 2, Score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 28.96it/s]

Generated embeddings with shape: (1, 384)
Retrieved 2 documents (after filtering)
retrieved context: 
 Python Introduction

Python is a high-level, interpreted programming language known for its simplicity, readability, and versatility. It was created by Guido van Rossum and first released in 1991. Python emphasizes code readability with its clean and easy-to-understand syntax, making it beginner-friendly yet powerful for advanced programming.

Key Features of Python

Simple and Readable: Easy syntax resembling natural language.

Python Introduction

Python is a high-level, interpreted programming language known for its simplicity, readability, and versatility. It was created by Guido van Rossum and first released in 1991. Python emphasizes code readability with its clean and easy-to-understand syntax, making it beginner-friendly yet powerful for advanced programming.

Key Features of Python

Simple and Readable: Easy syntax resembling natural language. 

Python is simple.
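
The retrieved context above contains the same chunk twice, since the collection holds two copies of every chunk. A small filter on the retriever output could drop repeats before the prompt is built. This is a sketch under that assumption: `dedupe_results` is a hypothetical helper, and `sample_results` imitates the dictionaries returned by `RAGRetriever.retrieve` with made-up IDs and scores.

```python
def dedupe_results(results: list[dict]) -> list[dict]:
    """Keep only the first (highest-ranked) result for each distinct chunk text."""
    seen = set()
    unique = []
    for doc in results:
        key = doc["content"].strip()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# Made-up results shaped like the retriever's output
sample_results = [
    {"id": "doc_a_0", "content": "Python Introduction", "similarity_score": 0.59},
    {"id": "doc_b_0", "content": "Python Introduction", "similarity_score": 0.59},
    {"id": "doc_a_1", "content": "Key Features of Python", "similarity_score": 0.41},
]
unique_results = dedupe_results(sample_results)
print([doc["id"] for doc in unique_results])
# → ['doc_a_0', 'doc_a_1']
```

Because results arrive sorted by rank, keeping the first occurrence preserves the best-scoring copy of each chunk.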





Enhanced RAG Pipeline Features

In [51]:
def rag_advance(query, retriever, llm, top_k=5, mini_score = 0.2, return_context = False):
    """
        RAG pipeline with extra features:
        - returns the answer, sources, a confidence score, and optionally the full context
    """

    results = retriever.retrieve(query, top_k=top_k, score_threshold = mini_score)
    if not results:
        return {'answer': "no relevant answer found", 'sources': [], 'confidence': 0.0, 'context': ''}

    # prepare context and sources
    context = "\n\n".join([doc['content'] for doc in results])
    sources = [{
        "sources": doc['metadata'].get('source_file', doc['metadata'].get('source', 'unknown')),
        "pages": doc['metadata'].get('page', 'unknown'),
        "score": doc['similarity_score'],
        "preview": doc['content'][:120] + '...'
    } for doc in results]

    confidence = max([doc['similarity_score'] for doc in results])
    
    prompt = f""" Use the following context to answer the question precisely.

        Context: {context}
        question: {query}
        Answer: 
    """

    # prompt is already an interpolated f-string, so it is passed to the model as-is
    response = llm.invoke(prompt)

    output = {
        'answer': response.content,
        'sources': sources,
        'confidence': confidence
    }

    if return_context:
        output['context'] = context

    return output

In [52]:
result =  rag_advance("how to learn python?", rag_retriever, LLM, top_k=3 )
result

Retrieving documents for query: 'how to learn python?'
Top K: 3, Score threshold: 0.2
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00, 32.10it/s]

Generated embeddings with shape: (1, 384)
Retrieved 2 documents (after filtering)





{'answer': 'To learn Python, follow these steps:\n\n1. **Start with the basics**: Begin with the fundamentals of Python, such as data types, variables, loops, conditional statements, functions, and object-oriented programming.\n2. **Online Resources**:\n   - **Codecademy**: Offers interactive coding lessons and exercises.\n   - **Python.org**: The official Python website provides tutorials, guides, and documentation.\n   - **W3Schools**: Provides tutorials, examples, and reference materials for Python.\n   - **Udemy**: Offers a wide range of Python courses for beginners and advanced learners.\n   - **Coursera**: Partners with top universities to offer Python courses.\n3. **Practice with Projects**:\n   - **Start with simple projects**: Build calculators, quizzes, or games to practice your skills.\n   - **Work on real-world projects**: Apply Python to real-world problems, such as data analysis, web scraping, or automation.\n4. **Join a Community**:\n   - **Reddit**: Participate in r/lea

In [57]:
print('Answer:', result['answer'])
print("Source:", result['sources'])
print("Confidence:", result['confidence'])
# print("score:", result['score'])


Answer: To learn Python, follow these steps:

1. **Start with the basics**: Begin with the fundamentals of Python, such as data types, variables, loops, conditional statements, functions, and object-oriented programming.
2. **Online Resources**:
   - **Codecademy**: Offers interactive coding lessons and exercises.
   - **Python.org**: The official Python website provides tutorials, guides, and documentation.
   - **W3Schools**: Provides tutorials, examples, and reference materials for Python.
   - **Udemy**: Offers a wide range of Python courses for beginners and advanced learners.
   - **Coursera**: Partners with top universities to offer Python courses.
3. **Practice with Projects**:
   - **Start with simple projects**: Build calculators, quizzes, or games to practice your skills.
   - **Work on real-world projects**: Apply Python to real-world problems, such as data analysis, web scraping, or automation.
4. **Join a Community**:
   - **Reddit**: Participate in r/learnpython, r/Pytho