# CS5588 — Week 4: **Pure RAG with Gemini API** (No LangChain)

This notebook implements a minimal, production-style Retrieval-Augmented Generation (RAG) pipeline **using only the Gemini API** for both **embeddings** and **text generation**.

**What you'll do**
1. Install & setup (`google-generativeai`)
2. Log environment → `env_rag.json`
3. Load documents (PDF/TXT/MD)
4. Chunk documents (default: `size=500`, `overlap=100`)
5. Embed with **Gemini Embeddings** (`text-embedding-004`)
6. Build a tiny vector index (NumPy; FAISS optional)
7. Retrieve top-*k* chunks (cosine similarity)
8. Generate answers with **Gemini 1.5 Flash** grounded in retrieved context
9. Mini-experiment: chunk sensitivity (500/100 vs 300/50)
10. Save reproducibility config → `rag_gemini_config.json`

> **API key:** Set `GEMINI_API_KEY` in your environment before running:
> - Colab: `import os; os.environ["GEMINI_API_KEY"] = "YOUR_KEY_HERE"`
> - Local: export in your shell or use `.env` loading.
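
To keep the key out of saved cells, the environment lookup can be wrapped in a small helper that only prompts when the variable is missing. This is a minimal sketch: the helper name `load_gemini_key` and its `env` and `prompt_fn` parameters are illustrative assumptions, not part of the notebook or of the Gemini client.

```python
import os
from getpass import getpass

def load_gemini_key(env=None, prompt_fn=getpass):
    """Return the Gemini API key, prompting only if the env var is missing.

    Hypothetical helper for illustration; `env` and `prompt_fn` exist so the
    lookup can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        # Interactive fallback: the typed key is not echoed to the output cell.
        key = prompt_fn("Enter GEMINI_API_KEY: ").strip()
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set.")
    return key
```

In a notebook, `os.environ["GEMINI_API_KEY"] = load_gemini_key()` would then make the key available to the later cells without it appearing in the saved file.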


## 1) Install & Setup

In [1]:
# If running in Colab: install Google Generative AI client and helpers
import sys, subprocess

def pip_install(pkgs):
    print("Installing:", pkgs)
    subprocess.run([sys.executable, "-m", "pip", "install", "-q"] + pkgs, check=True)

try:
    pip_install(["google-generativeai>=0.7.2", "PyPDF2>=3.0.1", "numpy>=1.23.0"])
except Exception as e:
    print("Install warning:", e)

# Optional FAISS (fast vector search); the notebook falls back to pure NumPy if the install fails
try:
    pip_install(["faiss-cpu>=1.8.0"])
except Exception as e:
    print("FAISS install skipped (optional):", e)

print("✅ Setup cell finished.")

Installing: ['google-generativeai>=0.7.2', 'PyPDF2>=3.0.1', 'numpy>=1.23.0']
Installing: ['faiss-cpu>=1.8.0']
✅ Setup cell finished.


## 2) Log Environment → `env_rag.json`

In [2]:
import json, platform, datetime
from pathlib import Path

env = {
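    # utcnow() is deprecated in Python 3.12; use a timezone-aware timestamp instead
    "timestamp_utc": datetime.datetime.now(datetime.timezone.utc).isoformat().replace("+00:00", "Z"),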
    "python": platform.python_version(),
    "platform": platform.platform(),
}

try:
    import google.generativeai as genai
    env["google-generativeai"] = getattr(genai, "__version__", "unknown")
except Exception as e:
    env["google-generativeai"] = f"unavailable ({e})"

try:
    import numpy as np
    env["numpy"] = np.__version__
except Exception as e:
    env["numpy"] = f"unavailable ({e})"

try:
    import PyPDF2
    env["PyPDF2"] = PyPDF2.__version__
except Exception as e:
    env["PyPDF2"] = f"unavailable ({e})"

# Optional libs
try:
    import torch
    env["torch"] = torch.__version__
    env["cuda_available"] = bool(torch.cuda.is_available())
except Exception:
    env["torch"] = "N/A"

Path("runs").mkdir(exist_ok=True)
with open("env_rag.json", "w") as f:
    json.dump(env, f, indent=2)

print(json.dumps(env, indent=2))

{
  "timestamp_utc": "2025-09-18T20:03:04.471191Z",
  "python": "3.12.11",
  "platform": "Linux-6.1.123+-x86_64-with-glibc2.35",
  "google-generativeai": "0.8.5",
  "numpy": "2.0.2",
  "PyPDF2": "3.0.1",
  "torch": "2.8.0+cu126",
  "cuda_available": false
}


## 3) Load Documents (PDF/TXT/MD)

- **Colab**: a file picker will open.  
- **Local Jupyter**: place your files in `data/uploads/` and re-run.

We extract plain text from PDFs via `PyPDF2`, and read text/markdown files directly.


In [3]:
import os, glob, io
from pathlib import Path

DATA_DIR = Path("data/uploads")
DATA_DIR.mkdir(parents=True, exist_ok=True)

print("If on local Jupyter: place at least 3 PDFs/TXT/MD into:", DATA_DIR)

is_colab = False
try:
    from google.colab import files as colab_files  # type: ignore
    is_colab = True
except Exception:
    pass

if is_colab:
    print("Colab detected — use the chooser to upload files.")
    uploaded = colab_files.upload()
    for name, data in uploaded.items():
        with open(DATA_DIR / name, "wb") as f:
            f.write(data)
    print("Uploaded:", list(uploaded.keys()))
else:
    print("Found:", [p.name for p in DATA_DIR.glob('*')])

If on local Jupyter: place at least 3 PDFs/TXT/MD into: data/uploads
Colab detected — use the chooser to upload files.


Saving annotated-Project%20Title (1).pdf to annotated-Project%20Title (1).pdf
Saving mat-report_hurricane-irma_florida.pdf to mat-report_hurricane-irma_florida.pdf
Saving NeurIPS-2022-video-diffusion-models-Paper-Conference.pdf to NeurIPS-2022-video-diffusion-models-Paper-Conference.pdf
Uploaded: ['annotated-Project%20Title (1).pdf', 'mat-report_hurricane-irma_florida.pdf', 'NeurIPS-2022-video-diffusion-models-Paper-Conference.pdf']


In [4]:
# Simple loaders: PDF via PyPDF2; TXT/MD by reading text
from typing import List, Dict
import PyPDF2

def load_documents(data_dir: Path) -> List[Dict]:
    docs = []
    for p in sorted(data_dir.glob("*")):
        if p.suffix.lower() == ".pdf":
            try:
                text_pages = []
                with open(p, "rb") as f:
                    reader = PyPDF2.PdfReader(f)
                    for page in reader.pages:
                        text_pages.append(page.extract_text() or "")
                text = "\n".join(text_pages)
                docs.append({"source": str(p), "text": text})
            except Exception as e:
                print("PDF read error:", p, e)
        elif p.suffix.lower() in [".txt", ".md", ".markdown"]:
            try:
                text = p.read_text(encoding="utf-8", errors="ignore")
                docs.append({"source": str(p), "text": text})
            except Exception as e:
                print("Text read error:", p, e)
        else:
            print("Skipping unsupported file:", p)
    return docs

docs = load_documents(DATA_DIR)
print(f"Loaded {len(docs)} documents.")
if docs:
    for d in docs[:3]:
        print('-', d["source"], "chars:", len(d["text"]))

Loaded 3 documents.
- data/uploads/NeurIPS-2022-video-diffusion-models-Paper-Conference.pdf chars: 46512
- data/uploads/annotated-Project%20Title (1).pdf chars: 3616
- data/uploads/mat-report_hurricane-irma_florida.pdf chars: 349715


## 4) Chunk Documents (default: size=500, overlap=100) and Preview

In [5]:
from typing import Iterable, Dict, List

chunk_size = 500
chunk_overlap = 100

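def chunk_text(text: str, size: int, overlap: int) -> List[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    # Guard: with overlap >= size the window never advances (infinite loop).
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("Require size > 0 and 0 <= overlap < size.")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunk = text[start:end]
        if chunk.strip():  # skip whitespace-only windows
            chunks.append(chunk)
        if end == len(text):
            break
        start = end - overlap  # step forward by (size - overlap) characters
    return chunks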

chunks = []
for d in docs:
    for ch in chunk_text(d["text"], chunk_size, chunk_overlap):
        chunks.append({"source": d["source"], "content": ch})

print("Total chunks:", len(chunks))
if chunks:
    print("First chunk preview (first 400 chars):\n", chunks[0]["content"][:400])

Total chunks: 1001
First chunk preview (first 400 chars):
 Video Diffusion Models
Jonathan Ho
jonathanho@google.comTim Salimans
salimans@google.comAlexey Gritsenko
agritsenko@google.com
William Chan
williamchan@google.comMohammad Norouzi
mnorouzi@google.comDavid J. Fleet
davidfleet@google.com
Abstract
Generating temporally coherent high ﬁdelity video is an important milestone in
generative modeling research. We make progress towards this milestone by pr
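
The chunk total can be checked by hand. A sliding window of `size` characters that advances by `size - overlap` yields `ceil((n - size) / (size - overlap)) + 1` chunks for a document of `n > size` characters. The sketch below applies that formula to the three document lengths printed by the loader. The function name `expected_chunk_count` is an illustrative assumption, and the formula assumes no window is whitespace-only (such windows are skipped by `chunk_text`).

```python
import math

def expected_chunk_count(n_chars: int, size: int, overlap: int) -> int:
    """Count the windows produced by a sliding chunker with stride (size - overlap)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("Require size > 0 and 0 <= overlap < size.")
    if n_chars <= 0:
        return 0
    if n_chars <= size:
        return 1
    stride = size - overlap
    return math.ceil((n_chars - size) / stride) + 1

# Character counts reported by the document loader above.
doc_lengths = [46512, 3616, 349715]

print(sum(expected_chunk_count(n, 500, 100) for n in doc_lengths))  # 1001
print(sum(expected_chunk_count(n, 300, 50) for n in doc_lengths))   # 1600
```

Both totals agree with the counts the notebook prints: 1001 chunks at 500/100 and 1600 chunks at 300/50 in the chunk-sensitivity experiment.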


In [6]:
import json
cfg = {
    "chunk_size": chunk_size,
    "chunk_overlap": chunk_overlap,
    "retriever_k": 4,
    "embedding_model": "text-embedding-004",
    "generation_model": "gemini-1.5-flash"
}
with open("rag_gemini_config.json","w") as f:
    json.dump(cfg, f, indent=2)
print("Saved config to rag_gemini_config.json")

Saved config to rag_gemini_config.json


## 5) Embed Chunks with Gemini (`text-embedding-004`)

In [13]:
import os, time
import numpy as np
import google.generativeai as genai

# Read the API key from the GEMINI_API_KEY environment variable
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError("Please set GEMINI_API_KEY in your environment.")

genai.configure(api_key=api_key)

EMBED_MODEL = "text-embedding-004"

def embed_texts(texts):
    """Embed a list of strings; always returns a 2D float32 array of shape (n, d)."""
    # Truncate very long strings to avoid request-size issues.
    MAX_LEN = 8000
    texts = [t[:MAX_LEN] for t in texts]
    resp = genai.embed_content(model=EMBED_MODEL, content=texts)
    # For list input the client returns {"embedding": [[...], [...], ...]};
    # for a single string it returns {"embedding": [...]}.
    emb = np.asarray(resp["embedding"], dtype="float32")
    # atleast_2d turns a lone (d,) vector into (1, d) and leaves (n, d) untouched.
    return np.atleast_2d(emb)

# Build embeddings for all chunks
texts = [c["content"] for c in chunks]
if not texts:
    raise RuntimeError("No chunks to embed. Please add documents in data/uploads/.")

emb_matrix = embed_texts(texts)
# Remove the leading dimension if it exists
if emb_matrix.ndim == 3 and emb_matrix.shape[0] == 1:
    emb_matrix = emb_matrix[0]

emb_norms = np.linalg.norm(emb_matrix, axis=1, keepdims=True) + 1e-12
emb_matrix_unit = emb_matrix / emb_norms

print("Embeddings shape:", emb_matrix_unit.shape)

# Persist vectors and metadata for reuse
np.save("chunk_vectors.npy", emb_matrix_unit)
with open("chunk_meta.json","w") as f:
    json.dump(chunks, f, indent=2)

print("Saved chunk_vectors.npy and chunk_meta.json")

Embeddings shape: (1, 1001, 768)
Saved chunk_vectors.npy and chunk_meta.json
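
Row-wise L2 normalization is what lets a plain dot product act as cosine similarity, and it only works on a 2D `(n, d)` array. A 3D array such as the `(1, 1001, 768)` shape printed above would be normalized along the wrong axis. The sketch below makes that requirement explicit. The helper name `normalize_rows` and the toy vectors are illustrative assumptions, not values from the notebook.

```python
import numpy as np

def normalize_rows(matrix, eps: float = 1e-12) -> np.ndarray:
    """Scale every row of a 2D array to unit L2 length."""
    matrix = np.asarray(matrix, dtype="float32")
    if matrix.ndim != 2:
        raise ValueError(f"expected shape (n, d), got {matrix.shape}")
    return matrix / (np.linalg.norm(matrix, axis=1, keepdims=True) + eps)

# Toy vectors standing in for real embeddings (made-up values).
toy = np.array([[3.0, 4.0], [1.0, 0.0]])
unit = normalize_rows(toy)

# After normalization the dot product equals the cosine of the angle: 3/5.
cosine = float(unit[0] @ unit[1])
print(round(cosine, 3))  # 0.6
```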


## 6) Build Tiny Vector Index & 7) Retriever (cosine similarity)

In [17]:
import numpy as np, json

# Try FAISS, else fall back to pure NumPy search
try:
    import faiss
    use_faiss = True
    dim = emb_matrix_unit.shape[1]
    index = faiss.IndexFlatIP(dim)  # inner product on unit vectors = cosine similarity
    index.add(emb_matrix_unit.astype('float32'))
    print("FAISS index built with", index.ntotal, "vectors.")
except Exception as e:
    use_faiss = False
    print("FAISS unavailable, falling back to NumPy similarity search.", e)

# Load chunk metadata
with open("chunk_meta.json") as f:
    chunk_meta = json.load(f)

def retrieve(query: str, k: int = 4):
    """
    Retrieve top-k most relevant chunks for a query using Gemini embeddings.
    Works with FAISS if available, else falls back to NumPy similarity search.
    """
    q_emb = embed_texts([query])

    # Normalize query embedding shape
    if q_emb.ndim == 3 and q_emb.shape[0] == 1:
        q_emb = q_emb[0]
    if q_emb.ndim == 2 and q_emb.shape[0] == 1:
        q_emb = q_emb[0]

    q_emb = q_emb / (np.linalg.norm(q_emb) + 1e-12)

    if use_faiss:
        q_emb_2d = q_emb[None, :]  # ensure (1,d)
        D, I = index.search(q_emb_2d.astype("float32"), k)
        idxs = I[0].tolist()
        sims = D[0].tolist()
    else:
        sims = (emb_matrix_unit @ q_emb).ravel().tolist()  # flatten to 1D floats
        idxs = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        sims = [sims[i] for i in idxs]

    results = []
    for rank, (i, s) in enumerate(zip(idxs, sims), start=1):
        results.append({
            "rank": rank,
            "score": float(s),
            "source": chunk_meta[i]["source"],
            "content": chunk_meta[i]["content"]
        })
    return results


# Helper: return context string for generation
def build_context(results):
    return "\n\n".join(f"[Source: {r['source']}]\n{r['content']}" for r in results)

# -------------------------------
# Quick sanity check
# -------------------------------
sample_q = "Summarize the key datasets and models discussed in these materials."
hits = retrieve(sample_q, k=4)

print("Top Retrieved Chunks:")
for h in hits:
    print(f"[{h['rank']}] {h['score']:.3f} :: {h['source']} :: {h['content'][:120]}...")

# Build context string for Gemini generation
context_blob = build_context(hits)
print("\nContext ready for Gemini:\n", context_blob[:500], "...")


FAISS unavailable, falling back to NumPy similarity search. too many values to unpack (expected 2)
Top Retrieved Chunks:
[1] 0.484 :: data/uploads/NeurIPS-2022-video-diffusion-models-Paper-Conference.pdf :: arting point for further investigation on video diffusion models and investigation into
their societal implications, and...
[2] 0.462 :: data/uploads/annotated-Project%20Title (1).pdf :: drills, response strategies). The goal is to make disaster preparedness engaging, immersive, ...
[3] 0.449 :: data/uploads/NeurIPS-2022-video-diffusion-models-Paper-Conference.pdf :: ture hyperparameters,
training details, and compute resources are listed in Appendix A.
4.1 Unconditional video modeling...
[4] 0.449 :: data/uploads/mat-report_hurricane-irma_florida.pdf ::  Technical Publications and Guidance  ......................................................................... 5-9
5.7 ...

Context ready for Gemini:
 [Source: data/uploads/NeurIPS-2022-video-diffusion-mo
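
The NumPy fallback sorts every similarity score to find the top *k*. For larger indexes, a partial selection is a common alternative. The following is a sketch under the assumption that both the matrix rows and the query are already unit-length; the function name `top_k_cosine` and the toy vectors are hypothetical.

```python
import numpy as np

def top_k_cosine(unit_matrix: np.ndarray, unit_query: np.ndarray, k: int = 4):
    """Return (indices, scores) of the k rows most similar to the query."""
    sims = unit_matrix @ unit_query  # shape (n,): one cosine score per chunk
    k = min(k, sims.shape[0])
    if k <= 0:
        return [], []
    # argpartition selects the k largest scores without sorting all n of them;
    # only the selected k are then sorted into descending order.
    top = np.argpartition(-sims, k - 1)[:k]
    top = top[np.argsort(-sims[top])]
    return top.tolist(), sims[top].tolist()

# Toy unit vectors (made-up values) and a query aligned with the first row.
toy_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
toy_query = np.array([1.0, 0.0])
idxs, scores = top_k_cosine(toy_matrix, toy_query, k=2)
print(idxs)  # [0, 2]
```

With only about a thousand chunks, as in this notebook, the full sort is just as practical. The partial selection starts to matter as the index grows.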

## 8) Generation with Gemini 1.5 Flash (grounded by retrieved context)

In [18]:
import google.generativeai as genai

GEN_MODEL = "gemini-1.5-flash"
generator = genai.GenerativeModel(GEN_MODEL)

SYSTEM_INSTRUCTIONS = (
    "You answer ONLY using the provided context. "
    "If the answer is not clearly supported, say you don't know."
)

def answer_question(question: str, k: int = 4, max_ctx_chars: int = 8000):
    ctx_hits = retrieve(question, k=k)
    context_blob = "\n\n".join(
        f"[Source: {h['source']}]\n{h['content']}" for h in ctx_hits
    )[:max_ctx_chars]
    prompt = f"""{SYSTEM_INSTRUCTIONS}

Context:
{context_blob}

Question: {question}

Answer:
"""
    resp = generator.generate_content(prompt)
    return resp.text, ctx_hits

# Ask three domain-specific questions (edit to your project)
questions = [
    "What problems does this project aim to solve? List 3–5 key points.",
    "Which datasets or data sources are used or proposed?",
    "What methods, models, or evaluation metrics are mentioned?"
]

for q in questions:
    print("="*80)
    ans, used = answer_question(q, k=4)
    print("Q:", q)
    print("A:", ans)

Q: What problems does this project aim to solve? List 3–5 key points.
A: Florida faces recurring natural disasters (hurricanes, floods, wildfires).  Traditional educational materials lack interactivity and fail to capture the scale and urgency of these events.  The project aims to create engaging disaster preparedness education.  The project will produce subject-oriented disaster education modules that explain scientific causes, show social impacts, and teach civic preparedness.  The goal is to make disaster preparedness engaging.

Q: Which datasets or data sources are used or proposed?
A: The following datasets or data sources are mentioned: TensorFlow Datasets [1], NOAA Hurricane Database – https://www.nhc.noaa.gov/data/, FEMA Disaster Records – https://www.fema.gov/about/reports-and-data/openfema,  wind field maps, wind contour maps, and grids showing flood depths and extents produced by the FEMA Natural Hazard Risk Assessment Program (NHRAP), water surface elevation data compiled f
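
`answer_question` enforces `max_ctx_chars` by slicing the joined context string, which can cut the last chunk mid-sentence. One alternative is to add whole chunks in rank order until the budget is reached. This is a sketch of that design choice, not the notebook's method; the function name `pack_context` and the sample hits are hypothetical.

```python
def pack_context(hits, max_chars: int = 8000) -> str:
    """Join retrieved chunks in rank order, stopping before the budget is exceeded."""
    parts, used = [], 0
    for h in hits:
        block = f"[Source: {h['source']}]\n{h['content']}"
        # Each block after the first also costs 2 characters for the "\n\n" separator.
        cost = len(block) + (2 if parts else 0)
        if used + cost > max_chars:
            break
        parts.append(block)
        used += cost
    return "\n\n".join(parts)

# Made-up hits in the same dict shape that retrieve() returns.
sample_hits = [
    {"source": "a", "content": "x" * 10},
    {"source": "b", "content": "y" * 10},
]
print(len(pack_context(sample_hits, max_chars=30)))  # 22: only the first block fits
```

The trade-off is that a tight budget may drop a lower-ranked chunk entirely instead of including part of it.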

## 9) Mini-Experiment — Chunk Sensitivity (500/100 vs 300/50)

In [20]:
# ---------------------------------------------------
# Mini-Experiment: Chunk Sensitivity (500/100 vs 300/50)
# ---------------------------------------------------
import numpy as np, json, os

# Helper: build chunks with given size/overlap
def build_chunks(docs, size, overlap):
    out = []
    for d in docs:
        for ch in chunk_text(d["text"], size, overlap):
            out.append({"source": d["source"], "content": ch})
    return out

# Re-build smaller chunks
small_chunks = build_chunks(docs, 300, 50)
print("Smaller-chunk count:", len(small_chunks))

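# Embed small chunks; force shape (n, d) so the per-row normalization is correct
small_texts = [c["content"] for c in small_chunks]
small_matrix = embed_texts(small_texts)
small_matrix = small_matrix.reshape(-1, small_matrix.shape[-1])
small_matrix = small_matrix / (np.linalg.norm(small_matrix, axis=1, keepdims=True) + 1e-12)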

# Generalized retrieval function (works for both baseline & small)
def retrieve_matrix(matrix, chunks, query, k=4):
    q_emb = embed_texts([query])

    # Normalize query embedding shape → always (d,)
    if q_emb.ndim == 3 and q_emb.shape[0] == 1:
        q_emb = q_emb[0]
    if q_emb.ndim == 2 and q_emb.shape[0] == 1:
        q_emb = q_emb[0]

    q_emb = q_emb / (np.linalg.norm(q_emb) + 1e-12)

    sims = (matrix @ q_emb).ravel().tolist()  # flatten to list of floats
    idxs = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]

    return [
        {
            "rank": r + 1,
            "score": float(sims[i]),
            "source": chunks[i]["source"],
            "content": chunks[i]["content"],
        }
        for r, i in enumerate(idxs)
    ]

# Compare baseline vs smaller chunking
cmp_q = "Summarize project goals and methods."

base_hits = retrieve(cmp_q, k=4)  # baseline retriever you already defined earlier
small_hits = retrieve_matrix(small_matrix, small_chunks, cmp_q, k=4)

# Summarize results
def short(h):
    return [
        {
            "rank": x["rank"],
            "score": round(x["score"], 3),
            "source": os.path.basename(x["source"]),
            "preview": x["content"][:90]
        }
        for x in h
    ]

cmp = {
    "query": cmp_q,
    "baseline": {"chunk_size": 500, "overlap": 100, "top": short(base_hits)},
    "smaller": {"chunk_size": 300, "overlap": 50, "top": short(small_hits)},
}

print(json.dumps(cmp, indent=2))


Smaller-chunk count: 1600
{
  "query": "Summarize project goals and methods.",
  "baseline": {
    "chunk_size": 500,
    "overlap": 100,
    "top": [
      {
        "rank": 1,
        "score": 0.45,
        "source": "annotated-Project%20Title (1).pdf",
        "preview": "Project\n \nTitle\n \nAI-Driven\n \n3D\n \nVideo\n \nGeneration\n \nfor\n \nMulti-Subject\n \nDisaster\n \nE"
      },
      {
        "rank": 2,
        "score": 0.404,
        "source": "annotated-Project%20Title (1).pdf",
        "preview": "cts\n \n \nhttps://github.com/firelab/windninja\n \nhttps://github.com/huggingface/dif fusers\n "
      },
      {
        "rank": 3,
        "score": 0.393,
        "source": "NeurIPS-2022-video-diffusion-models-Paper-Conference.pdf",
        "preview": "ts? [N/A]\n(b) Did you include complete proofs of all theoretical results? [N/A]\n3. If you "
      },
      {
        "rank": 4,
        "score": 0.392,
        "source": "mat-report_hurricane-irma_florida.pdf",
        "p
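
Reading two top-*k* lists side by side gets harder as the number of settings grows. A single overlap number per comparison is one way to summarize them. The sketch below computes the Jaccard overlap of the source files returned by two retrievers. The function name `source_overlap` and the file names in the example are hypothetical, and source-level overlap is only one possible summary.

```python
def source_overlap(hits_a, hits_b) -> float:
    """Jaccard overlap between the sets of source files in two result lists."""
    a = {h["source"] for h in hits_a}
    b = {h["source"] for h in hits_b}
    if not a and not b:
        return 1.0  # two empty result lists are treated as identical
    return len(a & b) / len(a | b)

# Made-up result lists in the same dict shape that retrieve() returns.
hits_large = [{"source": "x.pdf"}, {"source": "y.pdf"}]
hits_small = [{"source": "y.pdf"}, {"source": "z.pdf"}]
print(round(source_overlap(hits_large, hits_small), 3))  # 0.333
```

In the experiment above this would be called as `source_overlap(base_hits, small_hits)`.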

## 10) Save Reproducibility Config → `rag_gemini_config.json`

In [21]:
import json

try:
    # Use a context manager so the file handle is closed after reading
    with open("rag_gemini_config.json") as f:
        cfg = json.load(f)
except Exception:
    cfg = {}

cfg.update({
    "embedding_model": "text-embedding-004",
    "generation_model": "gemini-1.5-flash",
    "retriever_k": 4,
    "mini_experiments": [
        {"name": "chunk_sensitivity", "settings": [{"size":500,"overlap":100},{"size":300,"overlap":50}]}
    ]
})

with open("rag_gemini_config.json", "w") as f:
    json.dump(cfg, f, indent=2)

print("Final rag_gemini_config.json:")
print(json.dumps(cfg, indent=2))

Final rag_gemini_config.json:
{
  "chunk_size": 500,
  "chunk_overlap": 100,
  "retriever_k": 4,
  "embedding_model": "text-embedding-004",
  "generation_model": "gemini-1.5-flash",
  "mini_experiments": [
    {
      "name": "chunk_sensitivity",
      "settings": [
        {
          "size": 500,
          "overlap": 100
        },
        {
          "size": 300,
          "overlap": 50
        }
      ]
    }
  ]
}
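
A saved config is only useful for reproducibility if a later run can trust it. The sketch below checks the keys shown above before a run reuses them. The function name `validate_config` and the specific checks are illustrative assumptions rather than part of the notebook.

```python
import json

REQUIRED_KEYS = {
    "chunk_size", "chunk_overlap", "retriever_k",
    "embedding_model", "generation_model",
}

def validate_config(cfg: dict) -> dict:
    """Raise if the config is missing keys or holds values the chunker cannot use."""
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise KeyError(f"config is missing keys: {sorted(missing)}")
    if not 0 <= cfg["chunk_overlap"] < cfg["chunk_size"]:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")
    if cfg["retriever_k"] < 1:
        raise ValueError("retriever_k must be at least 1")
    return cfg

# The same values the notebook writes to rag_gemini_config.json.
cfg_text = """{
  "chunk_size": 500, "chunk_overlap": 100, "retriever_k": 4,
  "embedding_model": "text-embedding-004",
  "generation_model": "gemini-1.5-flash"
}"""
checked = validate_config(json.loads(cfg_text))
print(checked["chunk_size"], checked["chunk_overlap"])  # 500 100
```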
