<a href="https://colab.research.google.com/github/SourabhHegde14/GenAi_PES2UG23CS929_Hands_On_2/blob/main/PES2UG23CS929_4_RAG_and_Vector_Stores.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---



Name: SOURABH S HEGDE
SRN: PES2UG23CS929
CLASS: CSE 6 K

# Unit 2: RAG, Vector Stores, and Indexing

## Introduction
LLMs have a knowledge cutoff and can hallucinate. **Retrieval Augmented Generation (RAG)** mitigates both problems by retrieving relevant data at query time and injecting it into the prompt.

In this notebook, we will master:
1.  **Embeddings:** Representing text as vectors.
2.  **Vector Stores:** Storing and searching vectors (FAISS).
3.  **Naïve RAG:** The standard Retrieval -> Augment -> Generate pipeline.
4.  **Indexing Challenges:** Deep dive into how vector databases search efficiently (Flat, IVF, HNSW, PQ).

---

## Part 4a: Embeddings & Vector Space

### 1. Introduction: Computers Don't Read English

If you ask a computer "Is a cat similar to a dog?", it doesn't know. To a computer, "cat" is just a string of bytes: `01100011...`.

To solve this, we use **Embeddings**.

### What is an Embedding?
An embedding is a translation from **Text** (words, sentences, or passages) to **Lists of Numbers (Vectors)**, such that texts with similar meanings end up with vectors that are close together.

### The Process (Flowchart)
```mermaid
graph LR
    A["Input Text ('Apple')"] -->|Tokenization| B["Tokens (101, 255)"]
    B -->|Embedding Model| C["Vector List ([0.1, -0.5, 0.9...])"]
    C -->|Store| D["Vector Database"]
```
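
The flowchart above can be sketched in a few lines of plain Python. This is a toy illustration only: the vocabulary, the 2-dimensional `embedding_table`, and the averaging step are all made-up assumptions, not how MiniLM works internally. A real model learns its table and passes the tokens through a neural network first.

```python
# Hypothetical 3-word vocabulary mapping each word to a token id.
vocab = {"apple": 0, "pie": 1, "phone": 2}

# Hypothetical embedding table: one 2-dimensional vector per token id.
embedding_table = [
    [1.0, 0.0],  # apple
    [0.5, 0.5],  # pie
    [0.0, 1.0],  # phone
]

def tokenize(text):
    """Text -> token ids (unknown words are simply skipped in this toy)."""
    return [vocab[word] for word in text.lower().split() if word in vocab]

def embed(text):
    """Token ids -> one vector, by averaging the token vectors."""
    rows = [embedding_table[token_id] for token_id in tokenize(text)]
    return [sum(column) / len(rows) for column in zip(*rows)]

print(tokenize("Apple pie"))  # [0, 1]
print(embed("Apple pie"))     # [0.75, 0.25]
```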

In [1]:
# Setup
%pip install python-dotenv --upgrade --quiet langchain langchain-huggingface sentence-transformers langchain-community

from dotenv import load_dotenv
load_dotenv()

import os
from langchain_huggingface import HuggingFaceEmbeddings

# Using a FREE, open-source model from Hugging Face
# 'all-MiniLM-L6-v2' is small, fast, and very good for English.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

## 2. Viewing a Vector

Let's see what the word "Apple" looks like to the machine.

### Conceptual Note: Dimensions
The vector below has **384 dimensions** (for MiniLM).
- Imagine a graph with X and Y axes (2 Dimensions). You can plot a point (x, y).
- Now imagine adding Z (3 Dimensions).
- Now imagine **384 axes**.

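Each axis can be thought of as a learned feature (e.g., "Is it a fruit?", "Is it red?", "Is it tech-related?"). This is a simplification, since individual dimensions are rarely that easy to interpret on their own, but the numbers aren't random; together they encode meaning.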

In [2]:
vector = embeddings.embed_query("Apple")

print(f"Dimensionality: {len(vector)}")
print(f"First 5 numbers: {vector[:5]}")

Dimensionality: 384
First 5 numbers: [-0.006138487718999386, 0.03101177327334881, 0.06479360908269882, 0.01094149798154831, 0.005267191678285599]


## 3. The Math: Cosine Similarity

How do we know if two vectors are close? We measure the **Angle** between them.

### Cosine Similarity Formula
$$ \text{Similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} $$

- **1.0**: Arrows point in the Exact Same Direction (Identical).
- **0.0**: Arrows are Perpendicular (Unrelated).
- **-1.0**: Arrows point in Opposite Directions (Opposite).
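
To make the three cases concrete, here is a small worked example of the formula on hand-picked 2-D vectors. The vectors are illustrative assumptions chosen so the angles are obvious; real embeddings have hundreds of dimensions and rarely hit these exact values.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

right = [1.0, 0.0]
same_direction = [3.0, 0.0]  # longer arrow, same direction
up = [0.0, 2.0]              # perpendicular to `right`
left = [-2.0, 0.0]           # opposite direction

print(cosine(right, same_direction))  # 1.0
print(cosine(right, up))              # 0.0
print(cosine(right, left))            # -1.0
```

Note that `same_direction` is three times longer than `right` yet still scores 1.0: cosine similarity ignores length and looks only at direction.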

**Experiment:** Let's compare "Python programming", "Java language", and "Pepperoni Pizza".

In [3]:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


vec_python = embeddings.embed_query("Python programming")
vec_java = embeddings.embed_query("Java language")
vec_pizza = embeddings.embed_query("Pepperoni Pizza")

print(f"Python vs Java: {cosine_similarity(vec_python, vec_java):.4f}")
print(f"Python vs Pizza: {cosine_similarity(vec_python, vec_pizza):.4f}")

Python vs Java: 0.3959
Python vs Pizza: 0.1886


### Analysis
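You should see that **Python & Java** score higher (0.3959 in the run above) than **Python & Pizza** (0.1886): two programming-language phrases are closer in meaning than a programming language and a food.
This mathematical distance between vectors is the foundation of semantic search and RAG systems.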

This is arguably the most important concept in modern AI.

---



# Unit 2 - Part 4b: Naive RAG Pipeline

## 1. Introduction: The Open-Book Test

RAG (Retrieval-Augmented Generation) is essentially an Open-Book Test architecture.
1.  **Retrieval:** Find the right page in the textbook.
2.  **Augmentation:** Place that page next to the question in the prompt.
3.  **Generation:** Write the answer using that page.

### The Pipeline (Flowchart)
```mermaid
graph TD
    User[User Question] --> Retriever[Retriever System]
    Retriever -->|Search Database| Docs[Relevant Documents]
    Docs --> Combiner[Prompt Template]
    User --> Combiner
    Combiner -->|Full Prompt w/ Context| LLM[Gemini Model]
    LLM --> Answer[Final Answer]
```

In [4]:
%pip install python-dotenv --upgrade --quiet \
    faiss-cpu langchain-huggingface sentence-transformers \
    langchain-community langchain-google-genai google-generativeai

# Restart runtime after this cell, then run the imports
from dotenv import load_dotenv
load_dotenv()

import getpass
import os
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_huggingface import HuggingFaceEmbeddings

if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google API Key: ")

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")

# Using the same free model as Part 4a
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

Enter your Google API Key: ··········


## 2. The "Knowledge Base" (Grounding)

LLMs hallucinate because they rely on "parametric memory" (what they learned during training).
RAG introduces "non-parametric memory" (external facts).

Let's define some facts the LLM definitely *does not* know.

In [5]:

from langchain_core.documents import Document


docs = [
    Document(page_content="The CSE department at PES University is located in the B-Tech Block."),
    Document(page_content="Students must wear their ID cards at all times while inside the campus."),
    Document(page_content="The campus library closes at 9:00 PM on weekdays."),
]

## 3. Indexing (Storing the knowledge)

We use **FAISS** (Facebook AI Similarity Search) to store the embeddings.
Think of FAISS as a super-fast librarian that organizes books by content, not title.

In [6]:
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever()

## 4. The RAG Chain

We use LCEL to stitch it together.

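**Step 1:** The `retriever` takes the question, converts it to a vector, and returns the closest documents (LangChain's default is the top 4, so with only three documents here it returns all of them, closest first).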
**Step 2:** `RunnablePassthrough` holds the question.
**Step 3:** The `prompt` combines them.
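
The "augment" part of Step 3 is plain string substitution. The sketch below shows it without LangChain; `format_docs` and `build_prompt` are hypothetical helper names introduced only for illustration, and the retrieved list is hard-coded rather than coming from a real retriever.

```python
# Same wording as the notebook's prompt template.
TEMPLATE = """Answer based ONLY on the context below:
{context}

Question: {question}
"""

def format_docs(docs):
    """Join the retrieved passages into one block of context text."""
    return "\n".join(f"- {text}" for text in docs)

def build_prompt(question, retrieved_docs):
    """Fill the template with the context and the user's question."""
    return TEMPLATE.format(context=format_docs(retrieved_docs), question=question)

# Stand-in for what a retriever would return for this question.
retrieved = ["The CSE department at PES University is located in the B-Tech Block."]

prompt_text = build_prompt("Where is the CSE department located?", retrieved)
print(prompt_text)
```

The printed text is what the LLM actually receives: the instruction, the retrieved facts, and the question, in that order.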

In [7]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """
Answer based ONLY on the context below:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
result = chain.invoke("Where is the CSE department located?")
print(result)

The CSE department is located in the B-Tech Block.


## 5. Analysis

The retrieval step is opaque here. In the next part (**4c**), we will look *inside* the retriever to understand how FAISS actually finds that document among millions of others.

---



# Unit 2 - Part 4c: Deep Dive into Indexing Algorithms

## 1. Introduction: The Scale Problem

Comparing 1 vector against 10 vectors is fast.
Comparing 1 vector against **100 Million** vectors is slow.

**FAISS (Facebook AI Similarity Search)** was built to solve this.

### The Trade-off Triangle
You can pick 2:
- **Speed** (Query time)
- **Accuracy** (Recall)
- **Memory** (RAM usage)

We will explore algorithms that optimize different corners of this triangle.
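
The memory corner is easy to quantify with back-of-the-envelope arithmetic. The sketch below assumes uncompressed `float32` storage (4 bytes per value) and 128-dimensional vectors; `flat_index_bytes` is a hypothetical helper, and the result ignores any index overhead.

```python
def flat_index_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for uncompressed vectors: count x dimensions x bytes per value."""
    return num_vectors * dim * bytes_per_value

small = flat_index_bytes(10_000, 128)
large = flat_index_bytes(100_000_000, 128)

print(f"{small / 1e6:.2f} MB")  # 5.12 MB
print(f"{large / 1e9:.1f} GB")  # 51.2 GB
```

Ten thousand vectors fit anywhere, but 100 million need tens of gigabytes before any search structure is added, which is why compression matters at scale.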

In [8]:
import faiss
import numpy as np

# Mock Data: 10,000 vectors of size 128
d = 128
nb = 10000
xb = np.random.random((nb, d)).astype('float32')

## 2. Flat Index (Brute Force)

**Concept:** Check every single item.

- **Algo:** `IndexFlatL2`
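- **Pros:** Exact results (100% recall), which makes it the gold standard for judging other indexes.
- **Cons:** Query time grows linearly with the number of vectors (O(N)), so it becomes impractical at the 100M-vector scale described above.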
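
Conceptually, a flat index is just a loop over every stored vector. The following pure-Python sketch shows that idea on random 4-D data; it is an illustration of brute-force search under those assumptions, not FAISS's optimized implementation. It uses squared L2 distance, which gives the same ranking as true L2 distance.

```python
import random

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_search(vectors, query, k):
    """Score every stored vector, then return the k closest as (distance, index)."""
    scored = [(squared_l2(vec, query), i) for i, vec in enumerate(vectors)]
    scored.sort()
    return scored[:k]

random.seed(0)
vectors = [[random.random() for _ in range(4)] for _ in range(1000)]
query = vectors[42]  # a stored vector should be its own nearest neighbour

top = flat_search(vectors, query, k=3)
print(top[0])  # (0.0, 42)
```

Every query touches all 1,000 vectors, so doubling the data doubles the work.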


In [9]:
index = faiss.IndexFlatL2(d)
index.add(xb)
print(f"Flat Index contains {index.ntotal} vectors")

Flat Index contains 10000 vectors


## 3. IVF (Inverted File Index)

**Concept:** Clustering / Partitioning.

Imagine looking for a book. Instead of checking every shelf, you go to the "Sci-Fi" section. Then you only search books *in that section*.

### How it works (Flowchart)
```mermaid
graph TD
    Data[All 1M Vectors] -->|Train| Clusters["1000 Cluster Centers (Centroids)"]
    Query[User Query] -->|Step 1| FindClosest[Find Closest Centroid]
    FindClosest -->|Step 2| Search[Search ONLY vectors in that Cluster]
```

**Analogy:** Voronoi Cells (Zip Codes). We only search the local zip code.
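
The two steps in the flowchart can be sketched with a handful of 2-D points. The centroids below are hand-picked assumptions; a real IVF index learns them with k-means during training. The names `inverted_lists` and `ivf_search` are hypothetical and exist only for this illustration.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(points, query):
    """Index of the point closest to the query."""
    return min(range(len(points)), key=lambda i: squared_l2(points[i], query))

# Hand-picked centroids standing in for the result of training.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = [[0.5, 0.2], [1.0, 1.5], [9.0, 9.5], [10.5, 10.0], [0.1, 0.9]]

# "Add": file each vector id under its nearest centroid (its "zip code").
inverted_lists = {cell: [] for cell in range(len(centroids))}
for vec_id, vec in enumerate(vectors):
    inverted_lists[nearest(centroids, vec)].append(vec_id)

def ivf_search(query):
    cell = nearest(centroids, query)   # Step 1: find the closest centroid
    candidates = inverted_lists[cell]  # Step 2: search only that cell
    best = min(candidates, key=lambda i: squared_l2(vectors[i], query))
    return cell, candidates, best

print(inverted_lists)          # {0: [0, 1, 4], 1: [2, 3]}
print(ivf_search([9.2, 9.4]))  # (1, [2, 3], 2)
```

Only two of the five vectors were compared against the query. The risk is that the true nearest neighbour sits just across a cell border; FAISS exposes an `nprobe` setting on IVF indexes to search several nearby cells and trade some speed for recall.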

In [10]:
nlist = 100 # How many 'zip codes' (clusters) we want
quantizer = faiss.IndexFlatL2(d) # Coarse quantizer: finds the nearest centroid for a vector
index_ivf = faiss.IndexIVFFlat(quantizer, d, nlist)

# We MUST train it first so it learns where the clusters are
index_ivf.train(xb)
index_ivf.add(xb)

## 4. HNSW (Hierarchical Navigable Small World)

**Concept:** Six Degrees of Separation.

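Any point can be reached from any other in a few hops if the right links exist. HNSW builds a multi-layer **Graph** to provide them.
- **Layer 0:** Contains every point, each connected to its nearest neighbors.
- **Upper layers:** Contain progressively fewer points joined by longer links, acting as "Express Highways" between distant regions.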

**Analogy:** Catching a flight.
You don't fly Local -> Local -> Local.
You fly Local -> **HUB** (Chicago) -> **HUB** (London) -> Local.

- **Pros:** Extremely fast retrieval.
- **Cons:** Heavier on RAM (needs to store the edges of the graph).
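
The "hub" idea can be sketched as a greedy walk over two hand-built layers of 1-D points. This is a heavy simplification under stated assumptions: the graphs are written by hand, the search keeps a single candidate, and `greedy_walk` and `search` are hypothetical names. Real HNSW assigns layers randomly and tracks a list of candidates during search.

```python
points = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]

# Upper layer: only three "hub" nodes, joined by long links.
layer1 = {0: [4], 4: [0, 8], 8: [4]}
# Layer 0: every node, linked to its immediate neighbours.
layer0 = {
    i: [j for j in (i - 1, i + 1) if 0 <= j < len(points)]
    for i in range(len(points))
}

def greedy_walk(graph, start, query):
    """Hop to the neighbour closest to the query until no hop gets closer."""
    current, hops = start, 0
    while True:
        best = min(graph[current], key=lambda n: abs(points[n] - query))
        if abs(points[best] - query) >= abs(points[current] - query):
            return current, hops
        current, hops = best, hops + 1

def search(query, entry=0):
    """Walk the upper layer first, then refine on layer 0."""
    node, hops_top = greedy_walk(layer1, entry, query)
    node, hops_bottom = greedy_walk(layer0, node, query)
    return node, hops_top + hops_bottom

print(search(72.0))                  # (7, 3)
print(greedy_walk(layer0, 0, 72.0))  # (7, 7)
```

Both calls find node 7 (value 70.0), but using the upper layer takes 3 hops instead of 7. The extra adjacency lists are the RAM cost mentioned above.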

In [11]:
M = 16 # Number of connections per node (The 'Hub' factor)
index_hnsw = faiss.IndexHNSWFlat(d, M)
index_hnsw.add(xb)

## 5. PQ (Product Quantization)

**Concept:** Compression (Lossy).

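Do we need full 32-bit float precision (`0.123456789`) for every value? Usually not; a rough approximation is enough to rank neighbors.
PQ breaks each vector into sub-vectors and replaces each one with the ID of its closest entry in a small learned codebook, so only those short codes are stored.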

**Analogy:** 4K Video vs 480p Video.
- 480p is blurry, but it's 10x smaller and faster to stream.
- Use PQ when you are RAM constrained (e.g., storing 1 Billion vectors).
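
Here is a minimal sketch of the encode and decode steps on a 4-D vector split into two sub-vectors. The codebooks are hand-written assumptions (a real PQ index learns them during training), and `pq_encode` and `pq_decode` are hypothetical names. The last lines work out the per-vector saving for 128 `float32` values compressed to 8 one-byte codes.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hand-written codebooks: one per sub-vector, 4 centroids each.
codebooks = [
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],  # for dimensions 0-1
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],  # for dimensions 2-3
]

def pq_encode(vec):
    """Replace each sub-vector with the id of its nearest codebook centroid."""
    sub_len = len(vec) // len(codebooks)
    codes = []
    for s, book in enumerate(codebooks):
        sub = vec[s * sub_len:(s + 1) * sub_len]
        codes.append(min(range(len(book)), key=lambda c: squared_l2(book[c], sub)))
    return codes

def pq_decode(codes):
    """Rebuild an approximate vector by looking the codes up again."""
    return [x for s, c in enumerate(codes) for x in codebooks[s][c]]

vec = [0.9, 0.1, 0.2, 0.8]
codes = pq_encode(vec)
print(codes)             # [2, 1]
print(pq_decode(codes))  # [1.0, 0.0, 0.0, 1.0]

# Per-vector storage: 128 float32 values vs 8 codes of 8 bits each.
raw_bytes = 128 * 4
pq_bytes = 8 * 8 // 8
print(raw_bytes, pq_bytes, raw_bytes // pq_bytes)  # 512 8 64
```

The decoded vector is close to the original but not equal to it, which is the "lossy" part. In exchange, each vector shrinks from 512 bytes to 8, not counting the one-time cost of storing the codebooks.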

In [12]:
m = 8 # Split each vector into 8 sub-vectors (d must be divisible by m)
index_pq = faiss.IndexPQ(d, m, 8) # 8 bits per sub-vector code, i.e. 256 centroids per codebook
index_pq.train(xb)
index_pq.add(xb)
print("PQ Compression complete. RAM usage minimized.")

PQ Compression complete. RAM usage minimized.
