## RAG From Scratch 

### Goals & architecture
•	Input: a corpus of documents (PDFs, HTML, text files).

•	Output: factual answers grounded in documents with citations and a confidence score.

Augment a generative model with retrieved documents so its outputs are grounded in external knowledge (our docs, DB, web, etc.).

In [1]:
!pip install openai faiss-cpu langchain tiktoken numpy pandas PyPDF2 sentence-transformers

Defaulting to user installation because normal site-packages is not writeable
Collecting openai
  Downloading openai-2.6.1-py3-none-any.whl.metadata (29 kB)
Collecting langchain
  Downloading langchain-1.0.3-py3-none-any.whl.metadata (4.7 kB)
Collecting tiktoken
  Downloading tiktoken-0.12.0-cp312-cp312-win_amd64.whl.metadata (6.9 kB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting jiter<1,>=0.10.0 (from openai)
  Downloading jiter-0.11.1-cp312-cp312-win_amd64.whl.metadata (5.3 kB)
Collecting langchain-core<2.0.0,>=1.0.0 (from langchain)
  Downloading langchain_core-1.0.2-py3-none-any.whl.metadata (3.5 kB)
Collecting langgraph<1.1.0,>=1.0.2 (from langchain)
  Downloading langgraph-1.0.2-py3-none-any.whl.metadata (7.4 kB)
Collecting langsmith<1.0.0,>=0.3.45 (from langchain-core<2.0.0,>=1.0.0->langchain)
  Downloading langsmith-0.4.39-py3-none-any.whl.metadata (14 kB)
Collecting tenacity!=8.4.0,<10.0.0,>=8.1.0 (from langchain-core<2.0.0,>=1.0.0-



In [5]:
# Next step is, we will load data and Next Text Chunking
import os
from PyPDF2 import PdfReader

def load_documents(data_path="data"):
    documents = []
    for file in os.listdir(data_path):
        file_path = os.path.join(data_path, file)
        if file.endswith(".pdf"):
            reader = PdfReader(file_path)
            text = ""
            for page in reader.pages:
                text += page.extract_text()
            documents.append({"filename": file, "text": text})
        elif file.endswith(".txt"):
            with open(file_path, "r", encoding="utf-8") as f:
                text = f.read()
            documents.append({"filename": file, "text": text})
    return documents

docs = load_documents("data")
print(f"Loaded {len(docs)} documents.")
print("Sample:", docs[0]["filename"])

Loaded 2 documents.
Sample: knowledge.txt


In [5]:
!pip install langchain-text-splitters

Defaulting to user installation because normal site-packages is not writeable
Collecting langchain-text-splitters
  Downloading langchain_text_splitters-1.0.0-py3-none-any.whl.metadata (2.6 kB)
Downloading langchain_text_splitters-1.0.0-py3-none-any.whl (33 kB)
Installing collected packages: langchain-text-splitters
Successfully installed langchain-text-splitters-1.0.0


In [7]:
pip install tf-keras

Defaulting to user installation because normal site-packages is not writeable
Collecting tf-keras
  Downloading tf_keras-2.20.1-py3-none-any.whl.metadata (1.8 kB)
Downloading tf_keras-2.20.1-py3-none-any.whl (1.7 MB)
   ---------------------------------------- 0.0/1.7 MB ? eta -:--:--
   ------------ --------------------------- 0.5/1.7 MB 4.2 MB/s eta 0:00:01
   ------------------------ --------------- 1.0/1.7 MB 3.1 MB/s eta 0:00:01
   ------------------------------ --------- 1.3/1.7 MB 2.9 MB/s eta 0:00:01
   ------------------------------------- -- 1.6/1.7 MB 2.3 MB/s eta 0:00:01
   ---------------------------------------- 1.7/1.7 MB 1.9 MB/s eta 0:00:00
Installing collected packages: tf-keras
Successfully installed tf-keras-2.20.1
Note: you may need to restart the kernel to use updated packages.


In [8]:
# Now, we will Chunk for embedding and retrival
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_documents(docs, chunk_size=500, overlap=100):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        separators=["\n\n", "\n", ".", "!", "?", " ", ""]
    )
    
    chunks = []
    for doc in docs:
        for chunk in splitter.split_text(doc["text"]):
            chunks.append({
                "filename": doc["filename"],
                "content": chunk
            })
    return chunks

chunks = chunk_documents(docs)
print(f"Created {len(chunks)} text chunks.")
print("Sample chunk:\n", chunks[0]['content'][:300])


Created 176 text chunks.
Sample chunk:
 1. Can we copy list2= list1 just like that- No

2. avg salary for years 2023,2034,2025 in sql

3. diff b/w union and union all
UNION combines results from two (or more) SELECTs and removes duplicates (it performs a DISTINCT across all columns), which adds sorting/hash work and can be slower.

UNION 


In [9]:
# We will create Embeddings and Vector Index

from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
import os

In [11]:
pip install huggingface_hub[hf_xet]

Defaulting to user installation because normal site-packages is not writeable
Collecting hf-xet<2.0.0,>=1.1.2 (from huggingface_hub[hf_xet])
  Downloading hf_xet-1.2.0-cp37-abi3-win_amd64.whl.metadata (5.0 kB)
Downloading hf_xet-1.2.0-cp37-abi3-win_amd64.whl (2.9 MB)
   ---------------------------------------- 0.0/2.9 MB ? eta -:--:--
   ---------------------------------------  2.9/2.9 MB 18.7 MB/s eta 0:00:01
   ---------------------------------------- 2.9/2.9 MB 13.0 MB/s eta 0:00:00
Installing collected packages: hf-xet
Successfully installed hf-xet-1.2.0
Note: you may need to restart the kernel to use updated packages.


In [12]:
# Load a small but efficient embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

In [13]:
# Extract text from chunks
texts = [chunk["content"] for chunk in chunks]

# Generate embeddings (as numpy array)
embeddings = model.encode(texts, show_progress_bar=True)
embeddings = np.array(embeddings).astype("float32")

print("Embeddings shape:", embeddings.shape)

Batches: 100%|██████████| 6/6 [00:16<00:00,  2.69s/it]

Embeddings shape: (176, 384)





In [14]:
# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)

# Add embeddings
index.add(embeddings)
print("Number of vectors in index:", index.ntotal)

# Save FAISS index and metadata
os.makedirs("index", exist_ok=True)
faiss.write_index(index, "index/vector.index")

# Save mapping (filenames/chunks)
import pandas as pd
metadata = pd.DataFrame(chunks)
metadata.to_csv("index/chunk_metadata.csv", index=False)

print("FAISS index and metadata saved successfully!")

Number of vectors in index: 176
FAISS index and metadata saved successfully!


In [15]:
# 4: Retrieve Relevant Chunks for a Query

import faiss
import pandas as pd

# Load index and metadata
index = faiss.read_index("index/vector.index")
metadata = pd.read_csv("index/chunk_metadata.csv")

print(f"Loaded index with {index.ntotal} vectors.")

Loaded index with 176 vectors.


In [16]:
# Define a retrieval function
def retrieve_relevant_chunks(query, top_k=3):
    # Convert query to embedding
    query_emb = model.encode([query])
    
    # Search in FAISS index
    distances, indices = index.search(np.array(query_emb).astype("float32"), top_k)
    
    # Fetch matching text chunks
    results = []
    for i, idx in enumerate(indices[0]):
        if idx < len(metadata):
            results.append({
                "rank": i + 1,
                "filename": metadata.iloc[idx]["filename"],
                "content": metadata.iloc[idx]["content"],
                "distance": float(distances[0][i])
            })
    return results

In [18]:
# Test it

query = "Explain the main topic of this document"
results = retrieve_relevant_chunks(query, top_k=3)

for r in results:
    print(f"\nRank {r['rank']} | Distance: {r['distance']:.4f}")
    print(f"Source: {r['filename']}")
    print(r['content'][:300], "...")


Rank 1 | Distance: 1.4879
Source: sample.pdf.pdf
  ...cent		ze		

Rank 2 | Distance: 1.5180
Source: sample.pdf.pdf
  Motivation ...

Rank 3 | Distance: 1.5349
Source: sample.pdf.pdf
  the	 ...	d	s	e	


In [23]:
!pip install transformers accelerate bitsandbytes

Defaulting to user installation because normal site-packages is not writeable
Collecting accelerate
  Downloading accelerate-1.11.0-py3-none-any.whl.metadata (19 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.48.2-py3-none-win_amd64.whl.metadata (10 kB)
Downloading accelerate-1.11.0-py3-none-any.whl (375 kB)
Downloading bitsandbytes-0.48.2-py3-none-win_amd64.whl (59.0 MB)
   ---------------------------------------- 0.0/59.0 MB ? eta -:--:--
   ---------------------------------------- 0.0/59.0 MB ? eta -:--:--
   ---------------------------------------- 0.0/59.0 MB ? eta -:--:--
   ---------------------------------------- 0.0/59.0 MB ? eta -:--:--
   ---------------------------------------- 0.3/59.0 MB ? eta -:--:--
   ---------------------------------------- 0.5/59.0 MB 1.4 MB/s eta 0:00:42
   ---------------------------------------- 0.5/59.0 MB 1.4 MB/s eta 0:00:42
    --------------------------------------- 0.8/59.0 MB 685.3 kB/s eta 0:01:25
    ----------------------------

ERROR: Exception:
Traceback (most recent call last):
  File "C:\ProgramData\anaconda3\Lib\site-packages\pip\_vendor\urllib3\response.py", line 438, in _error_catcher
    yield
  File "C:\ProgramData\anaconda3\Lib\site-packages\pip\_vendor\urllib3\response.py", line 561, in read
    data = self._fp_read(amt) if not fp_closed else b""
           ^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\site-packages\pip\_vendor\urllib3\response.py", line 527, in _fp_read
    return self._fp.read(amt) if amt is not None else self._fp.read()
           ^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\site-packages\pip\_vendor\cachecontrol\filewrapper.py", line 98, in read
    data: bytes = self.__fp.read(amt)
                  ^^^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\http\client.py", line 479, in read
    s = self.fp.read(amt)
        ^^^^^^^^^^^^^^^^^
  File "C:\ProgramData\anaconda3\Lib\socket.py", line 720, in readinto
    return self._sock.recv_into(b)
           ^

   ---- ----------------------------------- 7.3/59.0 MB 53.8 kB/s eta 0:16:01
   ---- ----------------------------------- 7.3/59.0 MB 53.8 kB/s eta 0:16:01
   ---- ----------------------------------- 7.3/59.0 MB 53.8 kB/s eta 0:16:01
   ---- ----------------------------------- 7.3/59.0 MB 53.8 kB/s eta 0:16:01
   ---- ----------------------------------- 7.3/59.0 MB 53.8 kB/s eta 0:16:01
   ---- ----------------------------------- 7.3/59.0 MB 53.8 kB/s eta 0:16:01
   ---- ----------------------------------- 7.3/59.0 MB 53.8 kB/s eta 0:16:01
   ---- ----------------------------------- 7.3/59.0 MB 53.8 kB/s eta 0:16:01
   ---- ----------------------------------- 7.3/59.0 MB 53.8 kB/s eta 0:16:01
   ---- ----------------------------------- 7.3/59.0 MB 53.8 kB/s eta 0:16:01
   ---- ----------------------------------- 7.3/59.0 MB 53.8 kB/s eta 0:16:01
   ---- ----------------------------------- 7.3/59.0 MB 53.8 kB/s eta 0:16:01
   ---- ----------------------------------- 7.3/59.0 MB 53.8 kB/

In [27]:
from transformers import pipeline
model_name = "google/flan-t5-small"   
rag_generator = pipeline("text2text-generation", model=model_name)

Device set to use cpu


In [29]:
def generate_answer_local(query, top_k=3):
    # Step 1: Retrieve relevant chunks
    results = retrieve_relevant_chunks(query, top_k)
    context = "\n\n".join([r["content"] for r in results])
    
    # Step 2: Build a context-aware prompt
    prompt = f"""
    Use the following context to answer the question concisely.
    If the answer is not in the context, say "I don't have enough information."

    Context:
    {context}

    Question:
    {query}

    Answer:
    """
    
    # Step 3: Generate the answer using Flan-T5
    output = rag_generator(prompt, max_new_tokens=200)
    answer = output[0]["generated_text"]
    
    return answer, results

In [30]:
# Test now
query = "Summarize the main topic of this document"
answer, refs = generate_answer_local(query)

print("Answer:\n", answer)

print("\nSources:")
for r in refs:
    print("-", r['filename'])

Answer:
 Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using gradient ascent for linear classifiers: Using

Sources:
- sample.pdf.pdf
- sample.pdf.pdf
- sample.pdf.pdf


In [31]:
# Now Add Transparency + Improve Output Quality

def generate_answer_local(query, top_k=3):
    # Step 1: Retrieve top relevant chunks
    results = retrieve_relevant_chunks(query, top_k)
    context = "\n\n".join([r["content"] for r in results])
    
    # Step 2: Build clear prompt
    prompt = f"""
    Context information:
    {context}

    Based only on the above context, answer the following question concisely.
    If the answer is not present in the context, say "I don't have enough information."

    Question: {query}
    """
    
    # Step 3: Generate answer (beam search prevents repetition)
    output = rag_generator(
        prompt,
        max_new_tokens=200,
        num_beams=4,           # explore multiple candidate responses
        temperature=0.3        # reduce randomness
    )
    answer = output[0]["generated_text"].strip()
    
    # Step 4: Attach source details with similarity score
    sources = []
    for r in results:
        score = round(1 - (r["distance"] / (max([x["distance"] for x in results]) + 1e-6)), 2)
        sources.append({
            "filename": r["filename"],
            "confidence": score,
            "snippet": r["content"][:200].replace("\n", " ") + "..."
        })
    
    return answer, sources

In [38]:
# AFter this stage, Improve quality and test it
query = "Explain the main concept discussed in the document"
answer, sources = generate_answer_local(query)

print("Final Answer:\n", answer)
print("\nSupporting Sources:\n")
for s in sources:
    print(f" {s['filename']} (confidence: {s['confidence']})")
    print(f"   {s['snippet']}\n")

Final Answer:
 I don't have enough information

Supporting Sources:

 sample.pdf.pdf (confidence: 0.01)
  ma...rning	s	e	

 sample.pdf.pdf (confidence: 0.0)
   .1is differentiated from the standard neuralnetwork diagram in two ways. A standard diagram for a neural network does not show thischoice of intermediate quantities nor the form of the computations.Th...

 sample.pdf.pdf (confidence: 0.0)
   .1is differentiated from the standard neuralnetwork diagram in two ways. A standard diagram for a neural network does not show thischoice of intermediate quantities nor the form of the computations.Th...



In [6]:
pip install pdfplumber

Defaulting to user installation because normal site-packages is not writeable
Collecting pdfplumber
  Downloading pdfplumber-0.11.7-py3-none-any.whl.metadata (42 kB)
Collecting pdfminer.six==20250506 (from pdfplumber)
  Downloading pdfminer_six-20250506-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-5.0.0-py3-none-win_amd64.whl.metadata (67 kB)
Downloading pdfplumber-0.11.7-py3-none-any.whl (60 kB)
Downloading pdfminer_six-20250506-py3-none-any.whl (5.6 MB)
   ---------------------------------------- 0.0/5.6 MB ? eta -:--:--
   ------- -------------------------------- 1.0/5.6 MB 6.3 MB/s eta 0:00:01
   -------------- ------------------------- 2.1/5.6 MB 5.9 MB/s eta 0:00:01
   ---------------- ----------------------- 2.4/5.6 MB 5.6 MB/s eta 0:00:01
   ---------------- ----------------------- 2.4/5.6 MB 5.6 MB/s eta 0:00:01
   ------------------ --------------------- 2.6/5.6 MB 2.4 MB/s eta 0:00:02
   ------------------ --------



In [7]:
import os
import pdfplumber

DATA_DIR = "data"   # your folder with sample.pdf.pdf etc.

def load_pdfs_pdfplumber(data_dir=DATA_DIR):
    docs = []
    for fname in os.listdir(data_dir):
        path = os.path.join(data_dir, fname)
        if not fname.lower().endswith(".pdf"):
            continue
        text_parts = []
        try:
            with pdfplumber.open(path) as pdf:
                for page in pdf.pages:
                    page_text = page.extract_text()
                    if page_text:
                        text_parts.append(page_text)
        except Exception as e:
            print("Error reading", fname, ":", e)
            continue
        full_text = "\n".join(text_parts)
        docs.append({"filename": fname, "text": full_text})
    return docs

documents = load_pdfs_pdfplumber()
print("Loaded", len(documents), "PDFs.")
# show first 800 chars of first doc to inspect
if documents:
    print(documents[0]['filename'])
    print(documents[0]['text'][:800])

Loaded 1 PDFs.
sample.pdf
Available online at https://www.iimk.ac.in/research/markconf20/
INDIAN INSTITUTE OF MANAGEMENT KOZHIKODE
04th International Conference on Marketing, Technology & Society 2020
Social Media Analytics on Swiggy using Sentiment analysis :
The Good, The Bad & The omg..
Dr.S.Ajitha1
Janani Prabu2
Sai saranesh B3
Abhinayah B4
Abstract
In recent times, social media is the main resource for collecting the information about an individual. All businesses are
turning online, so the data generation rate is gradually increasing day by day. In this technical paper, we have applied
the sentiment analysis on the reviews of Swiggy. Swiggy is an online food delivery start-up. It solves the crucial
logistics problem in the local environment by having own delivery fleet. It is a high risk for them because the 


In [8]:
# Clean text
import re

def clean_text_basic(text):
    # remove weird control characters, fix hyphenation at line breaks, collapse whitespace
    # 1) Join hyphenated line-break words: "exam-\nple" -> "example"
    text = re.sub(r'(-)\n(\w)', r'\1\2', text)
    text = re.sub(r'-\n', '', text)
    # 2) Replace multiple whitespace/newlines with single space
    text = re.sub(r'\s+', ' ', text)
    # 3) Remove very odd non-printable chars (keep basic punctuation)
    text = re.sub(r'[^\\x00-\\x7F]+', '', text)
    text = text.strip()
    return text

for d in documents:
    d['clean_text'] = clean_text_basic(d['text'])

# inspect first cleaned snippet
if documents:
    print(documents[0]['filename'])
    print(documents[0]['clean_text'][:800])

sample.pdf
A:20INDIANINSTITUTEOFMANAGEMENTKOZHIKODE04ICMTS2020SMASS:TGTBTDSA1JP2SB3AB4AIAISSIIUEPSSSSxBK:STMRFC1IITWIBITWDxDTTT:TSSHSTSSSSxODC1APITMBSC22@2SITMBSC2706@3SITMBSC@4SITMBSC20@P04ICMTS2020ISBN:97893541974822S2LR21OMSCSIATDMBJDSLOMGLPTATTESAAxTCMAPRRBYJSANLPxxTOMSCSCOOTxNLPWTx22TxEVBRTSTAxMxxISPMISxxGV3RM31MTF311F311S3TJTTASAI1A;1I32DCDCTGP321TTTF312WxTTNLTNLTKF312T322SNWF313FxATWWCABWF313S433AHRFCARITRFAFTTATRFT:1F=1B:DZNGT:SPS2OTBBBExRFTARF7030ARFCAF314F314A4CTTCINTxACTx55R1BDMBJDSLOMGL2012WBVASMAAAIICWSMSMV:WSMVAAAITRWS1203DI252CMAPRRBYJ2012ESATCMPICWWWL:F6956983EIU:TDF:BDDMI:CR12420124HVMKKR:PI:P8CEACLMNJUSAACL1997174185MGA:W:AxCACM3811199539416RPMTSA:ACA7NMCOMTIJCAV56N13O2012PN168SCASOMSPCIJETAEV3I1J20139VBRSOMSURWIJCSITIJCSITV5220141026103010GVSAOM:ASIJARCSSEIJARCSSEV2I6J2012


In [9]:
def simple_tokenize(text):
    # naive whitespace tokenizer
    return text.split()

def chunk_text_words(text, chunk_size=250, overlap=50):
    tokens = simple_tokenize(text)
    chunks = []
    i = 0
    n = len(tokens)
    while i < n:
        chunk_tokens = tokens[i:i+chunk_size]
        chunk = " ".join(chunk_tokens)
        chunks.append(chunk)
        i += (chunk_size - overlap)
    return chunks

# build chunk list
chunks = []
for d in documents:
    cks = chunk_text_words(d['clean_text'], chunk_size=250, overlap=50)
    for idx, c in enumerate(cks):
        chunks.append({
            "doc_filename": d['filename'],
            "chunk_id": f"{d['filename']}_chunk{idx}",
            "text": c
        })

print("Total chunks:", len(chunks))
print("Sample chunk (first):\n", chunks[0]['text'][:1000])

Total chunks: 1
Sample chunk (first):
 A:20INDIANINSTITUTEOFMANAGEMENTKOZHIKODE04ICMTS2020SMASS:TGTBTDSA1JP2SB3AB4AIAISSIIUEPSSSSxBK:STMRFC1IITWIBITWDxDTTT:TSSHSTSSSSxODC1APITMBSC22@2SITMBSC2706@3SITMBSC@4SITMBSC20@P04ICMTS2020ISBN:97893541974822S2LR21OMSCSIATDMBJDSLOMGLPTATTESAAxTCMAPRRBYJSANLPxxTOMSCSCOOTxNLPWTx22TxEVBRTSTAxMxxISPMISxxGV3RM31MTF311F311S3TJTTASAI1A;1I32DCDCTGP321TTTF312WxTTNLTNLTKF312T322SNWF313FxATWWCABWF313S433AHRFCARITRFAFTTATRFT:1F=1B:DZNGT:SPS2OTBBBExRFTARF7030ARFCAF314F314A4CTTCINTxACTx55R1BDMBJDSLOMGL2012WBVASMAAAIICWSMSMV:WSMVAAAITRWS1203DI252CMAPRRBYJ2012ESATCMPICWWWL:F6956983EIU:TDF:BDDMI:CR12420124HVMKKR:PI:P8CEACLMNJUSAACL1997174185MGA:W:AxCACM3811199539416RPMTSA:ACA7NMCOMTIJCAV56N13O2012PN168SCASOMSPCIJETAEV3I1J20139VBRSOMSURWIJCSITIJCSITV5220141026103010GVSAOM:ASIJARCSSEIJARCSSEV2I6J2012


In [11]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
import os
import pandas as pd

model = SentenceTransformer('all-MiniLM-L6-v2')  # small / fast
texts = [c['text'] for c in chunks]

# embed in batches to avoid memory spike
BATCH = 32
emb_list = []
for i in range(0, len(texts), BATCH):
    batch = texts[i:i+BATCH]
    arr = model.encode(batch, show_progress_bar=False)
    emb_list.append(arr)
embeddings = np.vstack(emb_list).astype('float32')

# normalize for cosine similarity if using IndexFlatIP (we will use cosine via normalize)
faiss.normalize_L2(embeddings)

# build index (inner product on normalized vectors = cosine)
dim = embeddings.shape[1]
index = faiss.IndexFlatIP(dim)
index.add(embeddings)
print("Index vectors:", index.ntotal)

# save index and metadata
os.makedirs("index", exist_ok=True)
faiss.write_index(index, "index/vector.index")

meta_df = pd.DataFrame(chunks)
meta_df.to_csv("index/chunk_metadata.csv", index=False)
print("Saved index and metadata.")

Index vectors: 1
Saved index and metadata.


In [12]:
import faiss, os, pandas as pd

# Create folder if not exists
os.makedirs("rag_store", exist_ok=True)

# Save FAISS index
faiss.write_index(index, "rag_store/vector.index")

# Save metadata (chunks list)
meta_df = pd.DataFrame(chunks)
meta_df.to_csv("rag_store/chunks.csv", index=False)

print("Saved FAISS index and chunk metadata in /rag_store/")

Saved FAISS index and chunk metadata in /rag_store/


In [13]:
# Reload anytime

import pandas as pd, faiss
from sentence_transformers import SentenceTransformer

# Load model and saved data
model = SentenceTransformer("all-MiniLM-L6-v2")

# Reload FAISS index
index = faiss.read_index("rag_store/vector.index")

# Reload metadata
meta = pd.read_csv("rag_store/chunks.csv")

print(f"Reloaded index with {index.ntotal} vectors and {len(meta)} metadata rows.")

Reloaded index with 1 vectors and 1 metadata rows.


In [14]:
# Now we will follow this pattern Test Reload → Query → Answer

def retrieve(query, top_k=5):
    q_emb = model.encode([query]).astype('float32')
    faiss.normalize_L2(q_emb)
    D, I = index.search(q_emb, top_k)
    results = []
    for dist, idx in zip(D[0], I[0]):
        results.append({
            "idx": int(idx),
            "filename": meta.iloc[idx]['doc_filename'],
            "text": meta.iloc[idx]['text'],
            "score": float(dist)
        })
    return results

query = "Explain gradient ascent for linear classifiers"
res = retrieve(query, top_k=4)
for i, r in enumerate(res):
    print(f"\nRank {i+1} | score {r['score']:.3f} | {r['filename']}")
    print(r['text'][:500])


Rank 1 | score -0.015 | sample.pdf
A:20INDIANINSTITUTEOFMANAGEMENTKOZHIKODE04ICMTS2020SMASS:TGTBTDSA1JP2SB3AB4AIAISSIIUEPSSSSxBK:STMRFC1IITWIBITWDxDTTT:TSSHSTSSSSxODC1APITMBSC22@2SITMBSC2706@3SITMBSC@4SITMBSC20@P04ICMTS2020ISBN:97893541974822S2LR21OMSCSIATDMBJDSLOMGLPTATTESAAxTCMAPRRBYJSANLPxxTOMSCSCOOTxNLPWTx22TxEVBRTSTAxMxxISPMISxxGV3RM31MTF311F311S3TJTTASAI1A;1I32DCDCTGP321TTTF312WxTTNLTNLTKF312T322SNWF313FxATWWCABWF313S433AHRFCARITRFAFTTATRFT:1F=1B:DZNGT:SPS2OTBBBExRFTARF7030ARFCAF314F314A4CTTCINTxACTx55R1BDMBJDSLOMGL2012WBVA

Rank 2 | score -340282346638528859811704183484516925440.000 | sample.pdf
A:20INDIANINSTITUTEOFMANAGEMENTKOZHIKODE04ICMTS2020SMASS:TGTBTDSA1JP2SB3AB4AIAISSIIUEPSSSSxBK:STMRFC1IITWIBITWDxDTTT:TSSHSTSSSSxODC1APITMBSC22@2SITMBSC2706@3SITMBSC@4SITMBSC20@P04ICMTS2020ISBN:97893541974822S2LR21OMSCSIATDMBJDSLOMGLPTATTESAAxTCMAPRRBYJSANLPxxTOMSCSCOOTxNLPWTx22TxEVBRTSTAxMxxISPMISxxGV3RM31MTF311F311S3TJTTASAI1A;1I32DCDCTGP321TTTF312WxTTNLTNLTKF312T322SNWF313FxATWWCABWF3