<a href="https://colab.research.google.com/github/Ag230602/ani/blob/main/Week4_1_RAG_Gemini_HandsOn_ipynb_(LangChain).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4 — RAG with LangChain + Chroma
Colab implementing a complete Retrieval-Augmented Generation (RAG) workflow:

1) Install & setup  
2) Load your project documents (PDF/Text/Markdown)  
3) Chunk the documents  
4) Build embeddings & Chroma vector DB  
5) Connect an LLM (Gemini or Hugging Face)  
6) Build RetrievalQA and ask domain-specific questions  
7) Mini-experiments (Embedding swap, Chunk sensitivity)   
8) Reproducibility log

>

## 1) Install & Setup

In [1]:
# If you're on Colab, these installs will run in the current runtime.
# If you're on local Jupyter, you can run them once in your environment.

import sys, subprocess

def pip_install(pkgs):
    print("Installing:", pkgs)
    subprocess.run([sys.executable, "-m", "pip", "install", "-q"] + pkgs, check=True)

# Core libs (LangChain 0.2+ split packages; pin to stable ranges)
pip_install([
    "langchain>=0.2.15",
    "langchain-community>=0.2.11",
    "langchain-text-splitters>=0.2.2",
    "langchain-huggingface>=0.1.0",
    "chromadb>=0.5.5",
    "sentence-transformers>=3.0.1",
    "transformers>=4.44.2",
    "accelerate>=0.34.0",
])

# Optional: Gemini (if you want to use Google Generative AI)
try:
    pip_install(["langchain-google-genai>=2.0.0", "google-generativeai>=0.7.2"])
except Exception as e:
    print("Could not install Gemini packages (optional):", e)

# Some platforms need this to enable SQLite for Chroma persistence
try:
    pip_install(["pysqlite3-binary"])
except Exception as e:
    print("Skipping pysqlite3-binary:", e)

print("✅ Installation complete.")

Installing: ['langchain>=0.2.15', 'langchain-community>=0.2.11', 'langchain-text-splitters>=0.2.2', 'langchain-huggingface>=0.1.0', 'chromadb>=0.5.5', 'sentence-transformers>=3.0.1', 'transformers>=4.44.2', 'accelerate>=0.34.0']
Installing: ['langchain-google-genai>=2.0.0', 'google-generativeai>=0.7.2']
Installing: ['pysqlite3-binary']
✅ Installation complete.


### Log Python, Torch, Transformers, SentenceTransformers, and Chroma versions (saved to `env_rag.json`)

In [2]:
import json, platform, datetime
from pathlib import Path

env = {
    "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    "python": platform.python_version(),
    "platform": platform.platform(),
}

# Optional libs
try:
    import torch
    env["torch"] = torch.__version__
    env["cuda_available"] = bool(torch.cuda.is_available())
    env["cuda_device"] = torch.cuda.get_device_name(0) if torch.cuda.is_available() else None
except Exception as e:
    env["torch"] = f"unavailable ({e})"

try:
    import transformers
    env["transformers"] = transformers.__version__
except Exception as e:
    env["transformers"] = f"unavailable ({e})"

try:
    import sentence_transformers
    env["sentence_transformers"] = sentence_transformers.__version__
except Exception as e:
    env["sentence_transformers"] = f"unavailable ({e})"

try:
    import chromadb
    env["chromadb"] = chromadb.__version__
except Exception as e:
    env["chromadb"] = f"unavailable ({e})"

# Save
Path("runs").mkdir(exist_ok=True)
with open("env_rag.json", "w") as f:
    json.dump(env, f, indent=2)

print("Saved env to env_rag.json:")
print(json.dumps(env, indent=2))

  "timestamp": datetime.datetime.utcnow().isoformat() + "Z",


Saved env to env_rag.json:
{
  "timestamp": "2025-09-18T17:21:57.917400Z",
  "python": "3.12.11",
  "platform": "Linux-6.1.123+-x86_64-with-glibc2.35",
  "torch": "2.8.0+cu126",
  "cuda_available": false,
  "cuda_device": null,
  "transformers": "4.56.1",
  "sentence_transformers": "5.1.0",
  "chromadb": "1.1.0"
}


## 2) Load  Project Documents


In [3]:
from google.colab import files
from pathlib import Path

DATA_DIR = Path("data/uploads")
DATA_DIR.mkdir(parents=True, exist_ok=True)

print("📂 Please select your PDFs/TXT/MD files to upload...")

uploaded = files.upload()

for name, data in uploaded.items():
    p = DATA_DIR / name
    with open(p, "wb") as f:
        f.write(data)
    print(f"✅ Uploaded and saved: {p}")

print("All files saved in:", str(DATA_DIR.resolve()))


📂 Please select your PDFs/TXT/MD files to upload...


Saving mat-report_hurricane-irma_florida.pdf to mat-report_hurricane-irma_florida.pdf
Saving NeurIPS-2022-video-diffusion-models-Paper-Conference.pdf to NeurIPS-2022-video-diffusion-models-Paper-Conference.pdf
Saving annotated-Project Title (1).pdf to annotated-Project Title (1).pdf
✅ Uploaded and saved: data/uploads/mat-report_hurricane-irma_florida.pdf
✅ Uploaded and saved: data/uploads/NeurIPS-2022-video-diffusion-models-Paper-Conference.pdf
✅ Uploaded and saved: data/uploads/annotated-Project Title (1).pdf
All files saved in: /content/data/uploads


In [4]:
!pip install pypdf


Collecting pypdf
  Downloading pypdf-6.0.0-py3-none-any.whl.metadata (7.1 kB)
Downloading pypdf-6.0.0-py3-none-any.whl (310 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/310.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m310.5/310.5 kB[0m [31m11.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pypdf
Successfully installed pypdf-6.0.0


In [5]:
# Load with LangChain loaders
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.schema import Document

docs = []

for path in sorted(DATA_DIR.glob("*")):
    p = str(path)
    if path.suffix.lower() == ".pdf":
        try:
            loader = PyPDFLoader(p)
            docs.extend(loader.load())
        except Exception as e:
            print(f"PDF load error for {p}:", e)
    elif path.suffix.lower() in [".txt", ".md", ".markdown"]:
        try:
            loader = TextLoader(p, encoding="utf-8")
            docs.extend(loader.load())
        except Exception as e:
            print(f"Text load error for {p}:", e)
    else:
        print("Skipping unsupported file type:", p)

print(f"Loaded {len(docs)} document chunks (pre-chunking).")
if len(docs) == 0:
    print("⚠️ No documents found. Please add at least three PDFs/TXT/MD files to data/uploads and re-run.")

Loaded 198 document chunks (pre-chunking).


## 3) Chunk the Documents (start with `chunk_size=500`, `chunk_overlap=100`)

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
import json

chunk_size = 500
chunk_overlap = 100

splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", " ", ""],
)

splits = splitter.split_documents(docs)
print(f"Total chunks: {len(splits)}")
print("\nFirst chunk preview:\n", splits[0].page_content[:400] if splits else "NO CHUNKS")

# Save run config so far
run_cfg = {
    "chunk_size": chunk_size,
    "chunk_overlap": chunk_overlap,
}
with open("rag_run_config.json", "w") as f:
    json.dump(run_cfg, f, indent=2)
print("Saved partial config -> rag_run_config.json")

Total chunks: 1071

First chunk preview:
 Video Diffusion Models
Jonathan Ho∗
jonathanho@google.com
Tim Salimans∗
salimans@google.com
Alexey Gritsenko
agritsenko@google.com
William Chan
williamchan@google.com
Mohammad Norouzi
mnorouzi@google.com
David J. Fleet
davidfleet@google.com
Abstract
Generating temporally coherent high ﬁdelity video is an important milestone in
generative modeling research. We make progress towards this milestone b
Saved partial config -> rag_run_config.json


## 4) Build Embeddings & Chroma Vector DB (default: `all-MiniLM-L6-v2`, retriever k=4)

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from pathlib import Path
import json

persist_dir = Path("chroma_db_minilm")
persist_dir.mkdir(exist_ok=True)

embed_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embedder = HuggingFaceEmbeddings(model_name=embed_model_name)

vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embedder,
    persist_directory=str(persist_dir),
)

retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

SAMPLE_QUERY = "Summarize the key contributions, datasets, and methods mentioned in these project materials."
results = retriever.get_relevant_documents(SAMPLE_QUERY)
print("Retriever sanity check (top-k sources):")
for i, d in enumerate(results, 1):
    print(f"{i}.", d.metadata.get("source"), "…", (d.page_content[:120] + "…"))

# Update config
cfg = json.load(open("rag_run_config.json"))
cfg.update({
    "embedding_models_tested": ["sentence-transformers/all-MiniLM-L6-v2"],
    "retriever_k": 4,
    "vectorstore_persist_dir": str(persist_dir),
})
with open("rag_run_config.json", "w") as f:
    json.dump(cfg, f, indent=2)
print("Updated rag_run_config.json")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Retriever sanity check (top-k sources):
1. data/uploads/fema_national-risk-index_technical-documentation.pdf … 2.5. Data and Methodologies 
Over the course of several years, with the help of hundreds of collaborators and contributo…
2. data/uploads/annotated-Project%20Title (1).pdf … ●  Adrija  Ghosh:  Model  development  &  fine-tuning  –  35%  
 ●  Jonghyun  Lee  ,Albert  Choi:  System  integration, …
3. data/uploads/annotated-Project%20Title (1).pdf … Expected  team  Contribution  Statement  
●  Jonghyun  Lee  ,Albert  Choi:  Dataset  curation  &  preprocessing  –  35% …
4. data/uploads/fema_national-risk-index_technical-documentation.pdf … expertise and/or data. 
Contributor Description Expertise / 
Source Data 
Argonne National Laboratory is a multidiscipli…
Updated rag_run_config.json


  results = retriever.get_relevant_documents(SAMPLE_QUERY)


## 5) Connect an LLM (Gemini or Hugging Face)

In [None]:
import os, warnings, json

LLM_CHOICE = None
llm = None

# First try Gemini if API key is available
try:
    from langchain_google_genai import ChatGoogleGenerativeAI
    gemini_key = os.getenv("GEMINI_API_KEY")
    if gemini_key:
        llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", api_key=gemini_key, temperature=0.2)
        LLM_CHOICE = "gemini-1.5-flash"
except Exception as e:
    warnings.warn(f"Gemini init failed: {e}")

# Fallback: Hugging Face tiny chat model pipeline
if llm is None:
    from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
    from langchain_huggingface import HuggingFacePipeline
    model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # small, chat-tuned model
    tok = AutoTokenizer.from_pretrained(model_name)
    mdl = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    gen = pipeline("text-generation", model=mdl, tokenizer=tok, max_new_tokens=512)
    llm = HuggingFacePipeline(pipeline=gen)
    LLM_CHOICE = model_name

print("Using LLM:", LLM_CHOICE)

# Record choice
cfg = json.load(open("rag_run_config.json"))
cfg.update({"llm_used": LLM_CHOICE})
with open("rag_run_config.json", "w") as f:
    json.dump(cfg, f, indent=2)

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/551 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/608 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Device set to use cpu


Using LLM: TinyLlama/TinyLlama-1.1B-Chat-v1.0


## 6) Build RetrievalQA (retriever + LLM) and ask domain-specific questions

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

def format_docs(docs):
    return "\n\n".join(f"[Source: {d.metadata.get('source')}] {d.page_content}" for d in docs)

prompt = PromptTemplate.from_template(
    "You are a helpful assistant that answers using ONLY the provided context.\n"
    "If the answer is not in the context, say you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

domain_questions = [
    "What problems does this project aim to solve? List 3–5 key points.",
    "Which datasets or data sources are used or proposed?",
    "What methods, models, or evaluation metrics are mentioned?",
]

for q in domain_questions:
    print("="*80)
    print("Q:", q)
    try:
        ans = rag_chain.invoke(q)
    except Exception as e:
        ans = f"[Error from LLM: {e}]"
    print("A:", ans[:1500])

Q: What problems does this project aim to solve? List 3–5 key points.
A: You are a helpful assistant that answers using ONLY the provided context.
If the answer is not in the context, say you don't know.

Context:
[Source: data/uploads/fema_national-risk-index_technical-documentation.pdf] 2.5. Data and Methodologies 
Over the course of several years, with the help of hundreds of collaborators and contributors, and

[Source: data/uploads/NeurIPS-2022-video-diffusion-models-Paper-Conference.pdf] contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes]
(c) Did you discuss any potential negative societal impacts of your work? [Yes]

[Source: data/uploads/fema_national-risk-index_technical-documentation.pdf] current challenges and look ahead to identify 
opportunities for change, and help stakeholders 
develop solutions and strategies to address concerns 
and remove roadblocks. 
Community

[Source: data/uploads/NeurIPS-2022-video-diffusion-models-Paper-Confere

## 7) Mini-Experiments
### A) Embedding Swap: MiniLM vs e5-small-v2
### B) Chunk Sensitivity: 500/100 vs 300/50

In [None]:
import json, os
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# --- A) Embedding Swap ---
e5_dir = "chroma_db_e5_small"
os.makedirs(e5_dir, exist_ok=True)
e5_embedder = HuggingFaceEmbeddings(model_name="intfloat/e5-small-v2")
e5_vs = Chroma.from_documents(splits, e5_embedder, persist_directory=e5_dir)
e5_ret = e5_vs.as_retriever(search_kwargs={"k": 4})

compare_questions = [
    "Summarize the project goals and expected outcomes.",
    "Name key algorithms, baselines, or architectures mentioned.",
    "Identify the most important risks or limitations discussed.",
]

def peek_sources(docs, n=3):
    out = []
    for d in docs[:n]:
        out.append({"source": d.metadata.get("source"), "preview": d.page_content[:160]})
    return out

cmp_results = {"embedding_swap": []}
for q in compare_questions:
    m = retriever.get_relevant_documents(q)
    e = e5_ret.get_relevant_documents(q)
    cmp_results["embedding_swap"].append({
        "question": q,
        "minilm_top": peek_sources(m, n=4),
        "e5_top": peek_sources(e, n=4),
    })

# --- B) Chunk Sensitivity ---
split_300_50 = RecursiveCharacterTextSplitter(
    chunk_size=300, chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""],
).split_documents(docs)

minilm_dir_small = "chroma_db_minilm_small_chunks"
os.makedirs(minilm_dir_small, exist_ok=True)
vs_small = Chroma.from_documents(split_300_50, embedder, persist_directory=minilm_dir_small)
ret_small = vs_small.as_retriever(search_kwargs={"k": 4})

cmp_results["chunk_sensitivity"] = {
    "baseline_chunks": {"chunk_size": 500, "chunk_overlap": 100, "count": len(splits)},
    "smaller_chunks": {"chunk_size": 300, "chunk_overlap": 50, "count": len(split_300_50)},
    "sample_query": compare_questions[0],
    "baseline_top": peek_sources(retriever.get_relevant_documents(compare_questions[0]), n=4),
    "smaller_top": peek_sources(ret_small.get_relevant_documents(compare_questions[0]), n=4),
}

with open("mini_experiment_results.json", "w") as f:
    json.dump(cmp_results, f, indent=2)

print("Saved mini_experiment_results.json")
print(json.dumps(cmp_results, indent=2)[:2000])

modules.json:   0%|          | 0.00/387 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/57.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/615 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

Saved mini_experiment_results.json
{
  "embedding_swap": [
    {
      "question": "Summarize the project goals and expected outcomes.",
      "minilm_top": [
        {
          "source": "data/uploads/annotated-Project%20Title (1).pdf",
          "preview": "Expected  team  Contribution  Statement  \n\u25cf  Jonghyun  Lee  ,Albert  Choi:  Dataset  curation  &  preprocessing  \u2013  35%  \n \u25cf  Adrija  Ghosh:  Model  development"
        },
        {
          "source": "data/uploads/fema_national-risk-index_technical-documentation.pdf",
          "preview": "Over the course of several years, with the help of hundreds of collaborators and contributors, and \nthrough unknown iterations of planning, design, and developm"
        },
        {
          "source": "data/uploads/annotated-Project%20Title (1).pdf",
          "preview": "Project  Title  \nAI-Driven  3D  Video  Generation  for  Multi-Subject  Disaster  Education:  Florida  Case  Study  \nProblem  Definition  &  Objectives

## 8) Reproducibility Log — Save Configs to `rag_run_config.json`

In [None]:
import json

# Append: both embedding models and chunk experiments if ran
try:
    cfg = json.load(open("rag_run_config.json"))
except Exception:
    cfg = {}

cfg.setdefault("embedding_models_tested", [])
if "sentence-transformers/all-MiniLM-L6-v2" not in cfg["embedding_models_tested"]:
    cfg["embedding_models_tested"].append("sentence-transformers/all-MiniLM-L6-v2")
if "intfloat/e5-small-v2" not in cfg["embedding_models_tested"]:
    cfg["embedding_models_tested"].append("intfloat/e5-small-v2")

cfg["chunk_experiments"] = [
    {"chunk_size": 500, "chunk_overlap": 100},
    {"chunk_size": 300, "chunk_overlap": 50},
]

cfg.setdefault("retriever_k", 4)
cfg.setdefault("notes", "Run on your project docs in data/uploads/. GEMINI_API_KEY optional for Gemini LLM.")

with open("rag_run_config.json", "w") as f:
    json.dump(cfg, f, indent=2)

print("Final rag_run_config.json:")
print(json.dumps(cfg, indent=2))

Final rag_run_config.json:
{
  "chunk_size": 200,
  "chunk_overlap": 100,
  "embedding_models_tested": [
    "sentence-transformers/all-MiniLM-L6-v2",
    "intfloat/e5-small-v2"
  ],
  "retriever_k": 4,
  "vectorstore_persist_dir": "chroma_db_minilm",
  "llm_used": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "chunk_experiments": [
    {
      "chunk_size": 500,
      "chunk_overlap": 100
    },
    {
      "chunk_size": 300,
      "chunk_overlap": 50
    }
  ],
  "notes": "Run on your project docs in data/uploads/. GEMINI_API_KEY optional for Gemini LLM."
}
