### Data Ingestion

In [1]:
# What is a Document? A container with two fields:
#   page_content: the text itself
#   metadata: a dict describing that text (source, page, author, ...)
from langchain_core.documents import Document

In [2]:
doc = Document(page_content="this is the main text content I am using to create rag",
               metadata={
                   "source":"example.txt",
                   "page":1,
                   "author":"Charles",
                   "data_created": "2025-01-01"
               })

In [3]:
doc

Document(metadata={'source': 'example.txt', 'page': 1, 'author': 'Charles', 'data_created': '2025-01-01'}, page_content='this is the main text content I am using to create rag')
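
Conceptually, a `Document` is a text payload plus a free-form metadata dictionary. The sketch below makes that shape explicit with a plain dataclass. `SimpleDocument` is a hypothetical stand-in written for illustration, not LangChain's implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the two fields used above."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


doc_sketch = SimpleDocument(
    page_content="this is the main text content I am using to create rag",
    metadata={"source": "example.txt", "page": 1},
)

# The text is what gets embedded; the metadata travels with it for filtering and citation.
print(doc_sketch.page_content)
print(doc_sketch.metadata["source"])
```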

In [4]:
## create a directory for the sample text files
import os
os.makedirs("../data/text_files", exist_ok=True)

In [5]:
sample_texts={
    "../data/text_files/python_intro.txt":"""Python Programming Introduction

Python is a high-level, interpreted programming language known for its simplicity and readability.
Created by Guido van Rossum and first released in 1991, Python has become one of the most popular
programming languages in the world.

Key Features:
- Easy to learn and use
- Extensive standard library
- Cross-platform compatibility
- Strong community support

Python is widely used in web development, data science, artificial intelligence, and automation.""",
    
    "../data/text_files/machine_learning.txt": """Machine Learning Basics

Machine learning is a subset of artificial intelligence that enables systems to learn and improve
from experience without being explicitly programmed. It focuses on developing computer programs
that can access data and use it to learn for themselves.

Types of Machine Learning:
1. Supervised Learning: Learning with labeled data
2. Unsupervised Learning: Finding patterns in unlabeled data
3. Reinforcement Learning: Learning through rewards and penalties

Applications include image recognition, speech processing, and recommendation systems
    
    
    """

}

for filepath,content in sample_texts.items():
    with open(filepath,'w',encoding="utf-8") as f:
        f.write(content)

print("Sample text files created!")

Sample text files created!


In [6]:
### reading text using langchain

from langchain_community.document_loaders import TextLoader


loader = TextLoader("../data/text_files/python_intro.txt",encoding="utf-8")
document = loader.load()
print(document)


[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity and readability.\nCreated by Guido van Rossum and first released in 1991, Python has become one of the most popular\nprogramming languages in the world.\n\nKey Features:\n- Easy to learn and use\n- Extensive standard library\n- Cross-platform compatibility\n- Strong community support\n\nPython is widely used in web development, data science, artificial intelligence, and automation.')]


In [7]:
#### DirectoryLoader: load every matching file in a directory instead of a single file
from tqdm.auto import tqdm
from langchain_community.document_loaders import DirectoryLoader
dir_loader = DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt",  # pattern for the files to load (all .txt files, recursively)
    loader_cls=TextLoader,  # loader used for each file; a directory of PDFs would need a PDF loader instead
    loader_kwargs={'encoding': 'utf-8'},
    show_progress=True
)

dir_loader.load()


100%|██████████| 2/2 [00:00<00:00, 1026.63it/s]


[Document(metadata={'source': '..\\data\\text_files\\machine_learning.txt'}, page_content='Machine Learning Basics\n\nMachine learning is a subset of artificial intelligence that enables systems to learn and improve\nfrom experience without being explicitly programmed. It focuses on developing computer programs\nthat can access data and use it to learn for themselves.\n\nTypes of Machine Learning:\n1. Supervised Learning: Learning with labeled data\n2. Unsupervised Learning: Finding patterns in unlabeled data\n3. Reinforcement Learning: Learning through rewards and penalties\n\nApplications include image recognition, speech processing, and recommendation systems\n\n\n    '),
 Document(metadata={'source': '..\\data\\text_files\\python_intro.txt'}, page_content='Python Programming Introduction\n\nPython is a high-level, interpreted programming language known for its simplicity and readability.\nCreated by Guido van Rossum and first released in 1991, Python has become one of the most popu
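
The `source` values above use backslashes because the directory was walked on Windows, while the single-file `TextLoader` output earlier kept the forward-slash string that was passed in. If `source` is later used to group or compare documents, the two spellings will not match. A minimal sketch of one way to normalize them, where `normalize_source` is a hypothetical helper name:

```python
from pathlib import PureWindowsPath


def normalize_source(source: str) -> str:
    """Return the path with forward slashes, whichever separator it came with."""
    # PureWindowsPath accepts both "\\" and "/" as separators and never touches the filesystem.
    return PureWindowsPath(source).as_posix()


print(normalize_source("..\\data\\text_files\\python_intro.txt"))
print(normalize_source("../data/text_files/python_intro.txt"))
```

Both calls print `../data/text_files/python_intro.txt`.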

### Chunking, Embedding, and Vector Store

In [9]:
import numpy as np
from sentence_transformers import SentenceTransformer #embedding model
import chromadb
from chromadb.config import Settings
import uuid
from typing import List,Dict,Any,Tuple 
from sklearn.metrics.pairwise import cosine_similarity

In [10]:
class EmbeddingManager:
    """handles document embedding generation using the sentence transformer"""
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):  # this model produces 384-dimensional embeddings
        """
        Initialize the embed manager
        Args:
            model_name: Hugging face model name for sentence embeddings
        """
        self.model_name = model_name
        self.model = None
        self._load_model()
    def _load_model(self):
        """Load the sentenceTransformer model"""
        try:
            print(f"Loading embedding model: {self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded well. embed dim: {self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"Error loading model {self.model_name}: {e}")
            raise
    def generate_embeddings(self, texts: List[str])-> np.ndarray:
        """
        Generate embeddings for a list of texts

        Args:
            texts: List of text strings to embed

        Returns:
            numpy array of embeddings with shape (len(texts), embedding_dim)
        """

        if not self.model:
            raise ValueError("Model not loaded")
        
        print(f"Generating embeddings for {len(texts)} texts....")
        embeddings = self.model.encode(texts,show_progress_bar=True)
        print(f"generated embeddings with shape: {embeddings.shape}")
        return embeddings
    def get_embedding_dimension(self) -> int:
        """Get the embedding dim of the model"""
        if not self.model:
            raise ValueError('Model not loaded')
        return self.model.get_sentence_embedding_dimension()
    
embedding_manager = EmbeddingManager()
embedding_manager

Loading embedding model: all-MiniLM-L6-v2
Model loaded well. embed dim: 384


<__main__.EmbeddingManager at 0x1dc4635f4d0>

### VECTOR STORE

In [11]:
class VectorStore:
    """Manages document embeddings in a chromadb vector store"""

    def __init__(self, collection_name: str = "pdf_documents", persist_directory: str = "../data/vector_store"):
        """
        Initialize the vector store

        Args:
            collection_name: Name of the ChromaDB collection
            persist_directory: Directory where the chunk embeddings are persisted locally
        """
        self.collection_name = collection_name
        self.persist_directory = persist_directory
        self.client = None
        self.collection = None
        self._initialize_store()
    def _initialize_store(self):
        """Initialize the ChromaDB client and collection"""
        try:
            #Create persistent ChromaDb client
            os.makedirs(self.persist_directory, exist_ok=True)
            self.client = chromadb.PersistentClient(path=self.persist_directory)

            #Get or create collection
            self.collection = self.client.get_or_create_collection(
                name=self.collection_name,
                metadata={"description": "PDF document embeddings for RAG",
                          "hnsw:space": "cosine"}
            )
            print(f"Vector store initialized. Collecction: {self.collection_name}")
            print(f"Existing documents in collection: {self.collection.count()}")

        except Exception as e:
            print(f"Error initializing vector store: {e}")
            raise

        ## Add document function

    def add_documents(self, documents: List[Any], embeddings: np.ndarray):
        """
        Add documents and their embeddings to the vector store

        Args:
            documents: List of LangChain documents
            embeddings: Corresponding embeddings for the documents
        """
        if len(documents) != len(embeddings):
            raise ValueError("Number of documents must match number of embeddings")
        print(f"Adding {len(documents)} document to vector store..")

        #Prepare data for ChromaDB
        ids=[]
        metadatas = []
        documents_text = []
        embeddings_list = []

        for i, (doc,embedding) in enumerate(zip(documents, embeddings)):
            #Generate unique id
            doc_id = f"doc_{uuid.uuid4().hex[:8]}_{i}"
            ids.append(doc_id)

            #prep metadata
            metadata = dict(doc.metadata)
            metadata['doc_index'] = i
            metadata['content_length'] = len(doc.page_content)
            metadatas.append(metadata)

            #Document content
            documents_text.append(doc.page_content)

            #Embedding
            embeddings_list.append(embedding.tolist())
        try:
            self.collection.add(
                ids=ids,
                embeddings=embeddings_list,
                metadatas=metadatas,
                documents=documents_text,
            )
            print(f"Successfully Added {len(documents)} documents to vector store")
            print(f"Total documents in collection: {self.collection.count()}")

        except Exception as e:
            print(f"Error adding docs to vector store: {e}")
            raise

vectorstore = VectorStore()
vectorstore

Vector store initialized. Collecction: pdf_documents
Existing documents in collection: 0


<__main__.VectorStore at 0x1dc572c0590>

In [12]:
from pypdf import PdfReader

reader = PdfReader("../data/pdf_files/Code_Culture.pdf")
text = ""
for page in reader.pages:
    text+=page.extract_text()+"\n"
len(text)

3439

In [13]:
text

'Code Culture - User Guide and FAQ\nFrequently Asked Questions\nWhat is Code Culture Code Culture is a community and system designed to help people grow\nthrough technology disciplined thinking and shared learning It emphasizes building reflecting and\ncollaborating over passive consumption\nWho is Code Culture for Code Culture is for people who - Want to learn by building - Are curious\nabout how systems work - Value discipline reflection and long-term growth - Are willing to contribute\nnot just consume\nYou dont need to be an expert just intentional\nDo I need a technical background No Code Culture supports learners at all levels What matters is\nyour willingness to think rigorously practice consistently and engage respectfully\nIs there a fixed learning path No Learning paths emerge from problems projects and collaboration\nUsers are encouraged to define goals build solutions and reflect on outcomes rather than follow rigid\ntracks\nHow do I get started 1 Explore Code Cultures core

In [14]:
import re

def clean_pdf_text(text):
    # Fix hyphenated line breaks: exam-\nple → example
    text = re.sub(r"-\n", "", text)

    # Remove weird reference markers: † ‡ § ¶ ` *
    text = re.sub(r"[†‡§¶`∗]", "", text)

    # Add space between merged lower→upper case words
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)

    # Remove duplicate commas
    text = re.sub(r",,+", ",", text)

    # Normalize multiple newlines
    text = re.sub(r"\n{2,}", "\n\n", text)

    return text

cleaned_text = clean_pdf_text(text)
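
Two of the rules in `clean_pdf_text` are heuristics, so they also rewrite text that was never broken. The sketch below applies the same two patterns to made-up sample strings to show both the intended repair and the side effect. The helper names are hypothetical.

```python
import re


def join_hyphen_breaks(text: str) -> str:
    """Same pattern as the hyphenated-line-break rule above."""
    return re.sub(r"-\n", "", text)


def split_lower_upper(text: str) -> str:
    """Same pattern as the merged lower/upper case rule above."""
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", text)


print(join_hyphen_breaks("exam-\nple"))          # intended repair: "example"
print(join_hyphen_breaks("long-\nterm growth"))  # a real hyphen is lost too: "longterm growth"
print(split_lower_upper("learningIt"))           # intended repair: "learning It"
print(split_lower_upper("ChromaDB"))             # legitimate camel case is split: "Chroma DB"
```

Whether these side effects matter depends on the corpus; for prose like the FAQ above they are probably acceptable, while for text full of identifiers they may not be.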



In [15]:
#chunking the sample paper

def chunk_text(text, chunk_size=500):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start +=chunk_size
    return chunks

chunks = chunk_text(cleaned_text)

len(chunks)

7
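
Fixed-size slicing cuts wherever the character count lands, so a word or sentence can be split across two chunks. A common mitigation is to let consecutive chunks overlap. The following is a minimal sketch of that idea, not the method used in this notebook; `chunk_text_with_overlap` and its `overlap` parameter are assumptions introduced here.

```python
from typing import List


def chunk_text_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Slice text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end, to avoid a trailing fragment
        # that is entirely contained in the previous chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text_with_overlap("abcdefghij", chunk_size=4, overlap=2))
```

With `overlap=0` the function behaves like `chunk_text` above. Larger overlaps improve the odds that a sentence appears whole in at least one chunk, at the cost of more chunks to embed and store.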

In [16]:
# convert text to embeddings
text = chunks
#Generate the embeddings

embeddings = embedding_manager.generate_embeddings(texts=text)

#store in the vector db

from langchain_core.documents import Document
docs = [Document(page_content=c) for c in chunks]  # the chunks are plain strings, but add_documents expects Document objects, so wrap them

Generating embeddings for 7 texts....


Batches: 100%|██████████| 1/1 [00:00<00:00,  4.22it/s]

generated embeddings with shape: (7, 384)
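
The `Document` objects built above carry no metadata, so nothing downstream can tell which file or position a chunk came from. A minimal sketch of attaching that information per chunk follows. `build_chunk_records` is a hypothetical helper, and `start_char` assumes fixed-size, non-overlapping chunks as produced by `chunk_text`.

```python
from typing import Any, Dict, List


def build_chunk_records(chunks: List[str], source: str, chunk_size: int = 500) -> List[Dict[str, Any]]:
    """Pair each chunk with metadata describing where it came from."""
    records = []
    for index, chunk in enumerate(chunks):
        records.append({
            "page_content": chunk,
            "metadata": {
                "source": source,
                "chunk_index": index,
                # Valid only for fixed-size chunks without overlap.
                "start_char": index * chunk_size,
            },
        })
    return records


records = build_chunk_records(
    ["aaaaa", "bbb"],
    source="../data/pdf_files/Code_Culture.pdf",
    chunk_size=5,
)
print(records[1]["metadata"])
```

Each record could then be turned into `Document(page_content=r["page_content"], metadata=r["metadata"])` before being passed to the vector store.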





In [17]:
vectorstore.add_documents(docs, embeddings)

Adding 7 document to vector store..
Successfully Added 7 documents to vector store
Total documents in collection: 7
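
Because `add_documents` builds its IDs from `uuid.uuid4()`, running this cell a second time against the persisted collection would store the same seven chunks again under new IDs. One possible alternative is to derive the ID from the chunk itself. The sketch below assumes a hypothetical helper `make_chunk_id`; it is not part of the class above.

```python
import hashlib


def make_chunk_id(text: str, index: int) -> str:
    """Hypothetical helper: derive a stable ID from the chunk text and its position."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"doc_{digest}_{index}"


chunks_sketch = ["first chunk", "second chunk"]
ids_run_1 = [make_chunk_id(c, i) for i, c in enumerate(chunks_sketch)]
ids_run_2 = [make_chunk_id(c, i) for i, c in enumerate(chunks_sketch)]

# The same input yields the same IDs, so a repeated run can be detected instead of duplicated.
print(ids_run_1 == ids_run_2)
```

Stable IDs could be paired with the collection's `upsert` method rather than `add`, so a repeated run overwrites existing entries; whether that is the right behavior depends on how the store is meant to be refreshed.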


### RAG Retrieval

In [18]:
class RAGRetrieval:
    """Handles query based retrieval from the vector store"""
    def __init__(self, vector_store: VectorStore, embedding_manager: EmbeddingManager):
        """
        Initialize the retriever

        Args:
            vector_store: Vector store containing the document embeddings
            embedding_manager: Manager for generating the query embeddings
        """
        self.vector_store = vector_store
        self.embedding_manager = embedding_manager

    def retrieve(self,query:str, top_k: int =4, score_threshold: float = 0.0)-> List[Dict[str,Any]]:
        """
        Retrieve relevant documents for a query

        Args:
            query: the search (users question)
            top_k: Number of top results to return
            score_threshold: Minimum similarity score threshold

        Returns:
            List of dictionaries containing retrieved documents and metadata
        """
        print(f"Retrieving documents for query: '{query}")
        print(f"Top K: {top_k}, Score Threshold: {score_threshold}")

        #Generate query embedding

        query_embedding = self.embedding_manager.generate_embeddings([query])[0]

        #Search in vector store
        try:
            results = self.vector_store.collection.query(
                query_embeddings=[query_embedding.tolist()],
                n_results=top_k,
            )

            # process results

            retrieved_docs = []

            if results['documents'] and results['documents'][0]:
                documents = results['documents'][0]
                metadatas = results['metadatas'][0]
                distances = results['distances'][0]

                ids = results['ids'][0]

                for i, (doc_id,document,metadata,distance) in enumerate(zip(ids,documents,metadatas,distances)):
                    #convert distances to similarity scores (ChromaDB uses cosine distance)
                    similarity_score = 1 - distance

                    if similarity_score >= score_threshold:
                        retrieved_docs.append({
                            'id': doc_id,
                            'content':document,
                            'metadata': metadata,
                             'similarity_score': similarity_score,
                             'distance': distance,
                             'rank': i + 1
                        })

                print(f"Retrieved {len(retrieved_docs)} (documents after filtering)")
            else:
                print("No documents found")

            return retrieved_docs
        except Exception as e:
            print(f"Error during retrieval: {e}")
            return []
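
The `1 - distance` conversion in `retrieve` relies on the collection having been created with `"hnsw:space": "cosine"`, where distance is defined as one minus cosine similarity. Under a different metric the same subtraction would not yield a cosine similarity. A small pure-Python sketch of the relationship, with hypothetical helper names:

```python
import math
from typing import Sequence


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance as one minus cosine similarity."""
    return 1.0 - cosine_sim(a, b)


same_direction = cosine_dist([1.0, 0.0], [2.0, 0.0])   # distance 0, similarity 1
orthogonal = cosine_dist([1.0, 0.0], [0.0, 3.0])       # distance 1, similarity 0
opposite = cosine_dist([1.0, 0.0], [-1.0, 0.0])        # distance 2, similarity -1

for distance in (same_direction, orthogonal, opposite):
    print(f"distance={distance:.1f} similarity={1 - distance:.1f}")
```

Since similarity ranges from -1 to 1, the default `score_threshold=0.0` only drops results that point away from the query; a higher value is needed to filter weak matches.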

In [19]:
rag_ret = RAGRetrieval(vectorstore,embedding_manager)

In [20]:
rag_ret.retrieve(query="what is code culture")

Retrieving documents for query: 'what is code culture
Top K: 4, Score Threshold: 0.0
Generating embeddings for 1 texts....


Batches: 100%|██████████| 1/1 [00:00<00:00, 27.93it/s]

generated embeddings with shape: (1, 384)
Retrieved 4 (documents after filtering)





[{'id': 'doc_6ef2f2db_0',
  'content': 'Code Culture - User Guide and FAQ\nFrequently Asked Questions\nWhat is Code Culture Code Culture is a community and system designed to help people grow\nthrough technology disciplined thinking and shared learning It emphasizes building reflecting and\ncollaborating over passive consumption\nWho is Code Culture for Code Culture is for people who - Want to learn by building - Are curious\nabout how systems work - Value discipline reflection and long-term growth - Are willing to contribute\nnot just consu',
  'metadata': {'content_length': 500, 'doc_index': 0},
  'similarity_score': 0.7830911874771118,
  'distance': 0.21690881252288818,
  'rank': 1},
 {'id': 'doc_093424ad_1',
  'content': 'me\nYou dont need to be an expert just intentional\nDo I need a technical background No Code Culture supports learners at all levels What matters is\nyour willingness to think rigorously practice consistently and engage respectfully\nIs there a fixed learning path

In [21]:
# queries = [
#     "SIPIT",
#     "prompt recovery",
#     "SIPIT algorithm",
#     "exact prompt recovery",
#     "recover prompts SIPIT"
# ]

# for q in queries:
#     results = rag_ret.retrieve(query=q, top_k=3)
#     print(f"\nQuery: '{q}' → Found {len(results)} docs")
#     if results:
#         print(f"  Best score: {results[0]['similarity_score']:.3f}")

### Simple RAG pipeline with Gemini because it's free

In [22]:
from langchain_google_genai import ChatGoogleGenerativeAI
import os
from dotenv import load_dotenv

# 1. Load the variables from .env
load_dotenv() 

my_key = os.getenv("GOOGLE_API_KEY")

# Check if it loaded 
if my_key:
    print("API Key loaded successfully")
else:
    print("API Key not found")

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash",
                             google_api_key=my_key,
                             temperature=.1)

try: 
    response = llm.invoke("Hello, are you alive?")
    print(f"respose: {response.content}")
except Exception as e:
    print(f"Connection failed: {e}")

API Key loaded successfully
respose: Hello!

No, I am not alive in the biological sense. I am an artificial intelligence, a large language model. I don't have a physical body, consciousness, or feelings like living beings do. I exist as code and data on computer servers, designed to process information and communicate with you.
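
The cell above prints "API Key not found" when the key is missing but still goes on to construct the client, so the failure only surfaces at the first request. A minimal sketch of failing early instead; `require_env` and the `EXAMPLE_API_KEY` variable are hypothetical names used only for this demonstration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Placeholder value so the demo runs without a real key.
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
print(require_env("EXAMPLE_API_KEY"))
```

In the notebook this could replace the `os.getenv("GOOGLE_API_KEY")` check, assuming an exception at load time is preferable to a printed warning.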


In [23]:
# simple rag function to retrieve context and generate result

def rag_simple(query, retriever, llm, top_k=3):
    # retrieve context
    results = retriever.retrieve(query,top_k=top_k)
    context = "\n\n".join([doc["content"] for doc in results]) if results else "" 
    if not context:
        return "No relevant context found to answer the question."
    ## generate the answer using gemini llm
    prompt = f"""Use the following context to answer the question concisely.
        Context:
        {context}

        Question:
        {query}
        
        Answer: 
    
    """
    # The f-string above is already fully interpolated, so it is passed as-is.
    response = llm.invoke(prompt)
    return response.content
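
Retrieved chunks can contain literal braces (code snippets, JSON), and calling `str.format` on a string that already has the context interpolated treats those braces as placeholders. The sketch below substitutes the values exactly once. `build_prompt` is a hypothetical helper, and the wording of the template simply mirrors the prompt above.

```python
from textwrap import dedent


def build_prompt(context: str, query: str) -> str:
    """Fill the template once; braces inside context or query are left untouched."""
    template = dedent("""\
        Use the following context to answer the question concisely.

        Context:
        {context}

        Question:
        {query}

        Answer:""")
    return template.format(context=context, query=query)


context_with_braces = '{"key": 1}'
print(build_prompt(context_with_braces, "what is key?"))

# Formatting an already-interpolated string re-parses the braces from the context.
try:
    f"Context: {context_with_braces}".format(context=context_with_braces)
except (KeyError, IndexError, ValueError) as exc:
    print(f"second format pass failed: {type(exc).__name__}")
```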

In [24]:
answer = rag_simple("what is INJECTIVE?", rag_ret,llm)
print(answer)

Retrieving documents for query: 'what is INJECTIVE?
Top K: 3, Score Threshold: 0.0
Generating embeddings for 1 texts....


Batches: 100%|██████████| 1/1 [00:00<00:00,  7.65it/s]

generated embeddings with shape: (1, 384)
Retrieved 3 (documents after filtering)





The provided context does not define "INJECTIVE."


In [25]:
# import google.generativeai as genai

# genai.configure(api_key=my_key)

# for model in genai.list_models():
#     if 'generateContent' in model.supported_generation_methods:
#         print(model.name)

### Enhanced RAG Pipeline

In [26]:
def rag_advanced(query, retriever, llm, top_k=1, min_score=0.2, return_context=False):
    """
    RAG Pipeline with extra features:
    - Returns answer, sources, confidence score, and optionally full context.
    """

    results = retriever.retrieve(query,top_k,min_score)
    if not results:
        return {'answer': 'No relevant context found.', 'sources':[],'confidence':0.0, 'context':''}
    context = "\n\n".join([doc["content"] for doc in results]) if results else "" 
    sources = [{
        'source': doc['metadata'].get('source_file', doc['metadata'].get('source', 'unknown')),
        'page': doc['metadata'].get('page','unknown'),
        'score': doc['similarity_score'],
        'preview': doc['content'][:300]+'...',
    }for doc in results]
    confidence = max([doc['similarity_score'] for doc in results])

    #Generate answer

    prompt = f"""Use the following context to answer the question concisely. \nContext:\n{context}\n\nQuestion: {query}\n\nAnswer:"""
    # The f-string is already fully interpolated, so it is passed as-is.
    response = llm.invoke(prompt)

    output = {
        'answer': response.content,
        'sources': sources,
        'confidence': confidence,
    }
    if return_context:
        output['context'] = context
    return output
result = rag_advanced("What is Code culture?", rag_ret,llm,1,.4,return_context=True)

print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")
print(f"Confidence: {result['confidence']}")
print(f"Context Preview: {result['context']}")

Retrieving documents for query: 'What is Code culture?
Top K: 1, Score Threshold: 0.4
Generating embeddings for 1 texts....


Batches: 100%|██████████| 1/1 [00:00<00:00, 48.31it/s]

generated embeddings with shape: (1, 384)
Retrieved 1 (documents after filtering)





Answer: Code Culture is a community and system designed to help people grow through technology, disciplined thinking, and shared learning, emphasizing building, reflecting, and collaborating over passive consumption.
Sources: [{'source': 'unknown', 'page': 'unknown', 'score': 0.7837700247764587, 'preview': 'Code Culture - User Guide and FAQ\nFrequently Asked Questions\nWhat is Code Culture Code Culture is a community and system designed to help people grow\nthrough technology disciplined thinking and shared learning It emphasizes building reflecting and\ncollaborating over passive consumption\nWho is Code C...'}]
Confidence: 0.7837700247764587
Context Preview: Code Culture - User Guide and FAQ
Frequently Asked Questions
What is Code Culture Code Culture is a community and system designed to help people grow
through technology disciplined thinking and shared learning It emphasizes building reflecting and
collaborating over passive consumption
Who is Code Culture for Code Culture is for people who - Want to learn by building - Are curious
about how systems work - Value discipline reflection and long-term growth - Are willing to contribute
not just consu