# 📘 Day 3 RAG Colab Notebook
This notebook demonstrates a complete Retrieval-Augmented Generation (RAG) workflow:
- Load PDFs from a folder
- Chunk and embed with SentenceTransformers
- Store in ChromaDB
- Run semantic search
- Generate answers from the retrieved chunks with Gemini and OpenAI models

## RAG (Retrieval-Augmented Generation)

RAG is an AI framework that combines **information retrieval** with **text generation** to produce more accurate, up-to-date, and context-aware responses.

It enhances traditional large language models (LLMs) by dynamically fetching relevant data before generating answers.

<img src="https://raw.githubusercontent.com/MariaAise/llm_guide/refs/heads/main/shared_assets/visuals/images/RAG.png" alt="RAG Diagram"/>

## Why Use RAG?

✅ Overcomes LLM Limitations

LLMs have static knowledge (trained on past data). RAG supplies current data at query time.

Reduces hallucinations (made-up answers) by grounding responses in retrieved facts.

✅ Cost-Effective

Cheaper than fine-tuning an LLM on new data.

✅ Customizable

Works with private/internal documents (PDFs, databases, APIs).

## RAG vs. Fine-Tuning

| **Feature**     | **RAG**                          | **Fine-Tuning**                      |
|------------------|----------------------------------|--------------------------------------|
| **Data Updates** | Dynamic (real-time retrieval)    | Static (requires retraining)         |
| **Cost**         | Lower (no model retraining)      | Higher (GPU-intensive)               |
| **Use Case**     | Q&A, real-time data              | Domain-specific tasks                |


### Overcoming LLM Limitations with Retrieval-Augmented Generation (RAG)
Large language models (LLMs) face two critical constraints:

**Static Knowledge**: They can only recall information present in their training data, leaving them unaware of newer or domain-specific content.

**Context Window Limits**: Their ability to process long inputs is bounded by fixed token limits (e.g., 128K for GPT-4-turbo), restricting real-time analysis of large documents.

**Retrieval-Augmented Generation (RAG)** addresses both issues by dynamically integrating external data into the generation process.

A RAG pipeline operates in three phases:

- **Indexing (Offline Preparation)**

Documents (PDFs, databases, etc.) are split into chunks, embedded into vectors, and stored in a searchable database (e.g., Chroma, FAISS).

This enables efficient similarity searches at query time.

- **Retrieval (Real-Time Fetch)**

When a user submits a query (e.g., "What’s the pricing for Product X?"), the system:

Embeds the query into the same vector space.

Retrieves the most relevant document chunks from the database using similarity metrics (e.g., cosine similarity).

- **Generation (Context-Aware Answer Synthesis)**

The retrieved context is combined with the user’s query and fed to the LLM (e.g., Gemini, GPT-4).

The model generates a precise answer grounded in the provided data, bypassing its training limitations.
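
The three phases above can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions: the `embed` function counts words instead of calling a real embedding model, the three chunks are invented, and the final LLM call is left out, so the sketch stops at the prompt that would be sent to the model.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Phase 1: indexing (offline). Each chunk is stored next to its vector.
chunks = [
    "Product X costs 20 dollars per month.",   # invented example data
    "Product Y ships in two weeks.",
    "Our office is closed on public holidays.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, k=1):
    """Phase 2: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, context_chunks):
    """Phase 3 (input side): combine retrieved context with the question."""
    context = "\n\n".join(context_chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"

query = "What is the pricing for Product X?"
top_chunks = retrieve(query, k=1)
prompt = build_prompt(query, top_chunks)
print(prompt)
```

The rest of this notebook follows the same shape, with SentenceTransformers in place of `embed`, ChromaDB in place of the `index` list, and Gemini or OpenAI consuming the prompt.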

### Why RAG Matters

- **Dynamic Knowledge**: Integrates live data (e.g., weather APIs, internal wikis) or post-training updates (e.g., new research papers).

- **Cost Efficiency**: Cheaper than fine-tuning models on updated corpora.

- **Precision**: Reduces hallucinations by tethering responses to retrieved evidence.

---

## Vector Stores

**Vector stores** (or **vector databases**) are specialized databases designed to **store, index, and search high-dimensional vector embeddings**: numeric representations of data (text, images, etc.) generated by machine learning models such as OpenAI’s `text-embedding-3-large` or Google’s `textembedding-gecko`.

### ✅ Why They Matter

Vector stores enable **efficient similarity search**, making them essential for:

- 🔍 **Retrieval-Augmented Generation (RAG)**
- 🧠 **Semantic Search**
- 🎯 **Recommendation Systems**
- ⚠️ **Anomaly Detection**

### How Vector Stores Work

**Embedding Creation**:

A model (e.g., BERT, GPT) converts raw data (e.g., a sentence) into a vector (e.g., a 768-dimensional array).

Example:

In [None]:
# do not run
from sentence_transformers import SentenceTransformer

# Load a lightweight embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")  # Outputs 384-dimensional vectors

# Convert text to a vector
text = "What is RAG?"
embedding = model.encode(text)

print(embedding.shape)  # Output: (384,)
print(embedding[:5])    # e.g., [0.24, -0.56, 0.78, -0.12, 0.33]

---

## Indexing: Storing Vectors for Fast Search

### What Happens?

Vectors are stored in a **vector store** (database) using algorithms that optimize for **fast similarity searches**.

### Common Indexing Methods:

- **HNSW (Hierarchical Navigable Small World)**  
  Balances speed and accuracy. Ideal for high-performance vector search.

- **IVF (Inverted File Index)**  
  Groups similar vectors into clusters for faster lookup and scalability.
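
To make the IVF idea concrete, here is a toy sketch in plain Python. It is not how FAISS implements IVF: the centroids are hand-picked (a real index learns them, typically with k-means), the four vectors are invented, and the name `nprobe` is borrowed from FAISS only to label how many clusters get scanned.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hand-picked centroids (assumption); a real IVF index learns these from the data.
centroids = [(0.0, 0.0), (10.0, 10.0)]

# Invented 2-D vectors standing in for embeddings
vectors = {
    "v0": (0.5, 1.0),
    "v1": (1.0, 0.0),
    "v2": (9.0, 10.0),
    "v3": (11.0, 9.5),
}

# Build the inverted lists: centroid id -> ids of the vectors assigned to it
inverted_lists = {i: [] for i in range(len(centroids))}
for vec_id, vec in vectors.items():
    nearest = min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
    inverted_lists[nearest].append(vec_id)

def ivf_search(query, k=1, nprobe=1):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))[:nprobe]
    candidates = [vid for i in probe for vid in inverted_lists[i]]
    return sorted(candidates, key=lambda vid: dist(query, vectors[vid]))[:k]

print(inverted_lists)                # {0: ['v0', 'v1'], 1: ['v2', 'v3']}
print(ivf_search((9.5, 9.5), k=2))   # ['v2', 'v3']
```

With `nprobe=1` only half of the vectors are compared against the query. That is the source of the speed-up, and also why IVF can miss a true neighbor that sits in a cluster that was not probed.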

**Example: Indexing with FAISS (Facebook AI)**

In [None]:
# do not run - example

import faiss
import numpy as np

# Create dummy embeddings (1000 vectors of 384 dimensions)
embeddings = np.random.rand(1000, 384).astype('float32')

# Build an index
index = faiss.IndexHNSWFlat(384, 32)  # 384-dim vectors, HNSW with 32 neighbors
index.add(embeddings)  # Add vectors to the index

## Querying: Finding Similar Vectors

### What Happens?

A user’s query (e.g., *"Explain RAG"*) is converted into a **vector** using the same **embedding model** used for the stored data.

The **vector store** then searches for the **nearest vectors** in the indexed dataset using:

- **Cosine similarity**: Measures the cosine of the angle between two vectors, ignoring their lengths (a common choice for text embeddings).
- **Euclidean distance**: Measures straight-line distance between vectors.
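
A small worked example may help separate the two metrics. The vectors below are invented 2-D values, not real embeddings. The last lines check a standard identity: for unit-length vectors, squared Euclidean distance equals `2 - 2 * cosine`, so both metrics rank neighbors the same way once embeddings are normalized.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    """Scale v to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

a = [1.0, 0.0]
b = [3.0, 0.0]   # same direction as a, three times longer
c = [0.0, 1.0]   # perpendicular to a

print(cosine_similarity(a, b))    # 1.0 (direction matches, length ignored)
print(euclidean_distance(a, b))   # 2.0 (length difference counts)
print(cosine_similarity(a, c))    # 0.0

# For unit vectors: distance squared == 2 - 2 * cosine similarity
u, v = normalize([1.0, 2.0]), normalize([2.0, 1.0])
cos_uv = cosine_similarity(u, v)
squared = euclidean_distance(u, v) ** 2
print(abs(squared - (2 - 2 * cos_uv)) < 1e-9)
```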

---

### 🔍 Example: Querying a Vector Store


In [None]:
# do not run

# Embed the query
query = "Explain RAG"
query_embedding = model.encode(query)

# Search the FAISS index
k = 3  # Retrieve top 3 matches
distances, indices = index.search(np.array([query_embedding]), k)

print("Top matches:", indices)  # e.g., [[42 101 205]] (one row of vector IDs per query)

## Popular Vector Stores

| **Name**     | **Type**             | **Pros**                          | **Use Case**                        | **Link**                                                                 |
|--------------|----------------------|-----------------------------------|-------------------------------------|--------------------------------------------------------------------------|
| **Chroma**   | Open-source          | Lightweight, easy setup           | Prototyping, small-scale RAG        | [Chroma GitHub](https://github.com/chroma-core/chroma)                  |
| **FAISS**    | Library (Facebook)   | Blazing-fast, GPU-optimized       | Research, high-performance          | [FAISS GitHub](https://github.com/facebookresearch/faiss)               |
| **Pinecone** | Cloud-based          | Auto-scaling, managed service     | Production RAG systems              | [Pinecone.io](https://www.pinecone.io)                                   |
| **Weaviate** | Open-source / Cloud  | Supports multimodal (text+images) | Complex knowledge graphs            | [Weaviate.io](https://weaviate.io)                                       |
| **Milvus**   | Open-source          | High scalability, distributed     | Large enterprise deployments        | [Milvus.io](https://milvus.io)                                           |

----

**Minimal example**

In [None]:
# do not run

# Step 1: Generate embeddings
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["What is RAG?", "How do vector stores work?"]
embeddings = model.encode(texts)

# Step 2: Store in Chroma
import chromadb
client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(embeddings=embeddings.tolist(), documents=texts, ids=["1", "2"])

# Step 3: Query
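# Pass the query as a one-item batch so query_embeddings is a list of vectors
results = collection.query(query_embeddings=model.encode(["Explain RAG"]).tolist(), n_results=1)
print(results["documents"])  # e.g., [["What is RAG?"]] (one list of matches per query)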

## ⚙️ Setup Instructions
This notebook uses the following tools and APIs for document retrieval and LLM-based generation:

### 🔧 Libraries Required:
- `langchain` and `langchain-community` (document loaders, vector store, and embedding wrappers)
- `chromadb`
- `pypdf` (for PDF parsing)
- `sentence-transformers` (embedding model)
- `google-generativeai` (for Gemini API)
- `openai` (for OpenAI GPT models)

Install them using the following cells where necessary.

### 🔐 API Keys Required:
- **Google Gemini API key** (stored as a Colab secret named `GEMINI_API_KEY` and read with `userdata.get('GEMINI_API_KEY')`)
- **OpenAI API key** (pasted into the OpenAI setup cell)

You can obtain keys from:
- Google Gemini: https://makersuite.google.com/app/apikey
- OpenAI: https://platform.openai.com/account/api-keys

Add the Gemini key in Colab's Secrets panel and paste the OpenAI key in its designated code cell.

In [None]:
# 🟩 Install dependencies
!pip install -q langchain langchain-community chromadb pypdf sentence-transformers requests

In [None]:
import tempfile

import requests
from langchain_community.document_loaders import PyPDFLoader

pdf_urls = [
    "https://github.com/MariaAise/test/raw/main/lit/paper1.pdf",
    "https://github.com/MariaAise/test/raw/main/lit/paper2.pdf",
    "https://github.com/MariaAise/test/raw/main/lit/paper3.pdf"
]

all_docs = []

for url in pdf_urls:
    # Download the PDF and fail early on HTTP errors (e.g., 404)
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Save to a temporary file, because PyPDFLoader reads from a file path
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as temp_pdf:
        temp_pdf.write(response.content)
        temp_pdf_path = temp_pdf.name

    # Load the temporary file (one Document per page)
    loader = PyPDFLoader(temp_pdf_path)
    docs = loader.load()
    all_docs.extend(docs)

print(f"Loaded {len(all_docs)} pages from {len(pdf_urls)} PDFs.")

Loaded 59 pages from 3 PDFs.


In [None]:
# 🟧 Split documents into chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(all_docs)
print(f"Generated {len(chunks)} text chunks.")

Generated 653 text chunks.
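
To see what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window splitter in plain Python. It is only a sketch of the two parameters, not a reimplementation of `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs, lines, and spaces and therefore produces chunks of varying length. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; each window repeats the last
    chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.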


In [None]:
# 🟥 Embed chunks and store in ChromaDB
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# With persist_directory set, Chroma 0.4+ writes to disk automatically,
# so the deprecated db.persist() call is not needed.
db = Chroma.from_documents(chunks, embedding_model, persist_directory="./chroma_db")
print("Embeddings saved to ChromaDB.")

Embeddings saved to ChromaDB.



In [None]:
# 🔍 Run semantic search
query = "What are the findings about brand page satisfaction?"
results = db.similarity_search(query, k=3)

for i, res in enumerate(results):
    print(f"--- Result {i+1} ---\n{res.page_content[:500]}\n")

--- Result 1 ---
antecedents and moderators”, Journal of the Academy of
Marketing Science, Vol. 36 No. 4, pp. 578-596.
Dual impact of brand page satisfaction
Sanga Song and Hye-Young Kim
Journal of Product & Brand Management
Volume 31 · Number 7 · 2022 · 1033–1046
1042

--- Result 2 ---
antecedents and moderators”, Journal of the Academy of
Marketing Science, Vol. 36 No. 4, pp. 578-596.
Dual impact of brand page satisfaction
Sanga Song and Hye-Young Kim
Journal of Product & Brand Management
Volume 31 · Number 7 · 2022 · 1033–1046
1042

--- Result 3 ---
Previous studies have shown that both brand page
satisfaction and brand love are important factors when it comes
to consumers advocating the brand to other consumers
Dual impact of brand page satisfaction
Sanga Song and Hye-Young Kim
Journal of Product & Brand Management
Volume 31 · Number 7 · 2022 · 1033–1046
1036
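
Results 1 and 2 above are identical. One possible cause, which is an assumption and not verified in this notebook, is that the cell writing to `./chroma_db` ran more than once and added the same chunks again. Whatever the cause, duplicates waste prompt space, so a small filter before generation can help. The sketch below uses `SimpleNamespace` objects as stand-ins for the retrieved documents, and `dedupe_by_content` is a hypothetical helper name.

```python
from types import SimpleNamespace

def dedupe_by_content(docs):
    """Keep the first occurrence of each chunk, comparing whitespace-normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        key = " ".join(doc.page_content.split())  # collapse runs of whitespace
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# Stand-ins for retrieved documents (invented content)
retrieved = [
    SimpleNamespace(page_content="Dual impact of brand page satisfaction"),
    SimpleNamespace(page_content="Dual impact of  brand page satisfaction\n"),
    SimpleNamespace(page_content="Previous studies have shown ..."),
]
unique_docs = dedupe_by_content(retrieved)
print(len(unique_docs))  # 2
```

In this notebook the same function could be applied to `results` before building `retrieved_text`, since each result exposes `page_content`.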



## 🧠 Generate Answer with Gemini
Use a Gemini model to answer the query based on the retrieved chunks.

In [None]:
# ⚙️ Install Google Generative AI SDK if not already installed
!pip install -q google-generativeai

In [None]:
# 📡 Generate response from Gemini using retrieved content
from google.colab import userdata
import google.generativeai as genai

# Retrieve the API key from Secrets
api_key = userdata.get('GEMINI_API_KEY')  # Name must match what you set!

# Configure Gemini
genai.configure(api_key=api_key)
model = genai.GenerativeModel('gemini-2.0-flash')

retrieved_text = "\n\n".join([doc.page_content for doc in results])
prompt = f"Answer the following question based on the retrieved content:\n\n{retrieved_text}\n\nQuestion: {query}"

response = model.generate_content(prompt)
print(response.text)

Based on the provided content, here are the findings about brand page satisfaction:

*   Brand page satisfaction is an important factor when it comes to consumers advocating the brand to other consumers.
*   The article "Dual impact of brand page satisfaction" by Sanga Song and Hye-Young Kim, published in the Journal of Product & Brand Management (Volume 31, Number 7, 2022, pages 1033-1046) likely discusses this topic in detail.


## 📎 Gemini Output: Without vs With Citations
This section demonstrates how to:
- Generate a plain Gemini response (no source info)
- Generate a response **with inline source references** using numbered `[source_n]` markers appended to the retrieved chunks

In [None]:
# ➖ Gemini Response WITHOUT Source Info
plain_prompt = f"Answer the question based on the following content:\n\n{retrieved_text}\n\nQuestion: {query}"
response_plain = model.generate_content(plain_prompt)
print(response_plain.text)

In [None]:
# ➕ Gemini Response WITH Source References
# Append a [source_n] marker to each retrieved chunk, numbered by rank (doc.metadata is not used here)
retrieved_with_refs = []
for i, doc in enumerate(results):
    marker = f"[source_{i+1}]"
    content_with_marker = doc.page_content.strip() + f"\n\n{marker}"
    retrieved_with_refs.append(content_with_marker)

referenced_text = "\n\n".join(retrieved_with_refs)
citation_prompt = f"Answer the question using the content below. Include [source_n] in your answer to show where facts came from.\n\n{referenced_text}\n\nQuestion: {query}"
response_cited = model.generate_content(citation_prompt)
print(response_cited.text)

Brand page satisfaction is an important factor when it comes to consumers advocating for the brand to other consumers [source_3].
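
A `[source_n]` marker is only useful if it can be traced back to a document. The sketch below keeps a lookup from each marker to the chunk's metadata and then checks which markers an answer cites. The `Doc` class is a hypothetical stand-in for a LangChain `Document`, the file names and page numbers are invented, and the `source` and `page` keys are assumed to match what `PyPDFLoader` stores in `doc.metadata`.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def add_markers(docs):
    """Append [source_n] to each chunk and remember which metadata it refers to."""
    marked, sources = [], {}
    for i, doc in enumerate(docs, start=1):
        marker = f"[source_{i}]"
        marked.append(f"{doc.page_content.strip()}\n\n{marker}")
        sources[marker] = doc.metadata
    return "\n\n".join(marked), sources

def cited_sources(answer, sources):
    """Return metadata for every marker in the answer; unknown markers map to None."""
    found = dict.fromkeys(re.findall(r"\[source_\d+\]", answer))  # de-duplicate, keep order
    return {marker: sources.get(marker) for marker in found}

docs = [
    Doc("First chunk.", {"source": "paper1.pdf", "page": 3}),   # invented metadata
    Doc("Second chunk.", {"source": "paper2.pdf", "page": 9}),
]
context, sources = add_markers(docs)

# A made-up model answer: one valid citation and one marker that was never provided
answer = "Claim A [source_2]. Claim B [source_7]."
print(cited_sources(answer, sources))
# {'[source_2]': {'source': 'paper2.pdf', 'page': 9}, '[source_7]': None}
```

A `None` entry flags a marker the model invented, which is worth checking before showing citations to a reader.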



## 🤖 OpenAI Output: With vs Without Citations
Now let's demonstrate the same generation using OpenAI's GPT model instead of Gemini.

In [None]:
# 🟦 Install OpenAI SDK if not installed
!pip install -q openai

In [None]:
# 🔑 Setup OpenAI client (openai>=1.0 interface; openai.ChatCompletion was removed in 1.0)
from openai import OpenAI
client = OpenAI(api_key="your-openai-api-key-here")  # Replace with your OpenAI key

In [None]:
# ➖ OpenAI Response WITHOUT Source Info
openai_prompt = f"Answer the question based on the following content:\n\n{retrieved_text}\n\nQuestion: {query}"
completion_plain = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": openai_prompt}]
)
print(completion_plain.choices[0].message.content)

In [None]:
# ➕ OpenAI Response WITH Source References
referenced_text_openai = referenced_text  # reusing same `[source_n]` markers
openai_prompt_cited = f"Answer the question using the content below. Include [source_n] in your answer to show where facts came from.\n\n{referenced_text_openai}\n\nQuestion: {query}"
completion_cited = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": openai_prompt_cited}]
)
print(completion_cited.choices[0].message.content)