
## ✅ Step 1. Install and Load Required Python Libraries

In [None]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl (31.3 MB)
   31.3/31.3 MB 19.9 MB/s eta 0:00:00
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.11.0


In [None]:
import faiss
import numpy as np
from transformers import pipeline
from sentence_transformers import SentenceTransformer

## 📚 Step 2. Create a Small Knowledge Base (Documents)

In [None]:
documents = [
    "The Earth revolves around the Sun in about 365 days.",
    "Python is a popular programming language used for web development, data science, and AI.",
    "The capital of France is Paris, known for the Eiffel Tower.",
    "Photosynthesis is the process by which green plants make food using sunlight.",
    "Basketball is a sport played by two teams of five players on a rectangular court.",
    "Large Language Models like ChatGPT are trained on vast amounts of text data."
]

## 🔡 Step 3. Convert Documents to Embeddings

In [None]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')
doc_embeddings = embedder.encode(documents)

In [None]:
doc_embeddings

array([[ 0.02731097,  0.0774596 ,  0.04154115, ..., -0.00238924,
         0.00900925, -0.0208409 ],
       [-0.04744018, -0.01688633, -0.02255926, ...,  0.12869716,
         0.1589973 ,  0.0186499 ],
       [ 0.0672824 ,  0.06106832,  0.02706188, ...,  0.06351971,
         0.11277966,  0.04227098],
       [-0.05481903,  0.06031541, -0.06253127, ...,  0.05017699,
         0.11442049,  0.03344167],
       [ 0.06275784,  0.01622444,  0.03321232, ...,  0.07344805,
         0.06807995, -0.00025563],
       [-0.01688357, -0.08255044,  0.06435584, ...,  0.05519102,
         0.03942716, -0.04331229]], dtype=float32)
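
Each row of this array is the embedding of one document; `all-MiniLM-L6-v2` produces 384 values per text, so the array has shape `(6, 384)`. Texts with similar meaning should end up with nearby vectors. As a rough illustration of what "nearby" means, the sketch below computes the two usual closeness measures on made-up 3-dimensional vectors. The names `query`, `doc_a`, `doc_b` and their numbers are hypothetical, not real model output.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def squared_l2(u, v):
    """Squared Euclidean distance (0.0 means identical vectors)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Made-up toy vectors standing in for real 384-dimensional embeddings.
query = [1.0, 0.0, 1.0]
doc_a = [0.9, 0.1, 0.8]   # points in nearly the same direction as the query
doc_b = [0.0, 1.0, 0.0]   # orthogonal to the query

# A relevant document scores higher on similarity and lower on distance.
print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b))
print(squared_l2(query, doc_a) < squared_l2(query, doc_b))
```

The next step uses the distance view: a smaller L2 distance means a closer match.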

## 🧠 Step 4. Create FAISS Index for Retrieval

In [None]:
embedding_dim = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dim)
index.add(np.array(doc_embeddings))
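
`IndexFlatL2` stores the vectors unchanged and answers a query by comparing it against every stored vector, ranking by squared Euclidean distance. To make that idea concrete, here is a dependency-free sketch. `flat_l2_search` is a hypothetical helper written for this note, not part of FAISS, and the toy vectors are made up.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbour search by squared L2 distance.

    Returns (distances, indices) for the k closest vectors, nearest first.
    """
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first; ties go to the lower index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Made-up 2-dimensional vectors for illustration.
toy_vectors = [
    [0.0, 0.0],
    [1.0, 1.0],
    [5.0, 5.0],
]
distances, indices = flat_l2_search(toy_vectors, query=[0.9, 1.2], k=2)
print(indices)
```

Because every vector is compared, the result is exact, and the cost grows linearly with the number of documents. That is a reasonable trade-off for a knowledge base of six sentences.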

## 🔍 Step 5. Define a Retrieval + QA Function (Mini-RAG)

In [None]:
# Load an extractive QA model (DistilBERT distilled on SQuAD)
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

In [None]:
def rag_answer(question, top_k=2):
    # Embed the question
    question_embedding = embedder.encode([question])

    # Retrieve top_k similar documents
    distances, indices = index.search(np.array(question_embedding), top_k)
    retrieved_docs = [documents[i] for i in indices[0]]

    # Combine docs into a single context string
    context = " ".join(retrieved_docs)

    # Run QA
    result = qa_pipeline(question=question, context=context)

    # Output
    print(f"📌 Question: {question}")
    print(f"\n📚 Retrieved Context:\n{context}")
    print(f"\n🧠 Answer: {result['answer']}")
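
The `question-answering` pipeline returns a dict with the keys `answer`, `score`, `start`, and `end`, where `start` and `end` are character offsets of the answer inside the context. `rag_answer` prints only `answer`. One possible extension, sketched below, is to decline to answer when the score is low. `answer_or_abstain` and the `0.3` threshold are hypothetical choices, and `example_result` is a hand-written dict with a made-up score, not real model output.

```python
def answer_or_abstain(result, min_score=0.3):
    """Return the answer text, or None when the score is below min_score."""
    if result["score"] < min_score:
        return None
    return result["answer"]

context = "The capital of France is Paris, known for the Eiffel Tower."
start = context.index("Paris")

# Hand-written stand-in shaped like the pipeline's output dict.
example_result = {
    "answer": "Paris",
    "score": 0.95,  # made-up value for illustration
    "start": start,
    "end": start + len("Paris"),
}

# The answer is a span copied out of the context, not generated text.
span = context[example_result["start"]:example_result["end"]]
print(span == example_result["answer"])

print(answer_or_abstain(example_result))
print(answer_or_abstain({**example_result, "score": 0.05}))
```

A suitable threshold depends on the model and the data, so it would need to be tuned rather than copied from this sketch.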

In [None]:
rag_answer("What is photosynthesis?")

📌 Question: What is photosynthesis?

📚 Retrieved Context:
Photosynthesis is the process by which green plants make food using sunlight. The Earth revolves around the Sun in about 365 days.

🧠 Answer: the process by which green plants make food using sunlight


In [None]:
rag_answer("Where is Eiffel Tower located?")

📌 Question: Where is Eiffel Tower located?

📚 Retrieved Context:
The capital of France is Paris, known for the Eiffel Tower. The Earth revolves around the Sun in about 365 days.

🧠 Answer: Paris


In [None]:
rag_answer("What is Python used for?")

📌 Question: What is Python used for?

📚 Retrieved Context:
Python is a popular programming language used for web development, data science, and AI. Large Language Models like ChatGPT are trained on vast amounts of text data.

🧠 Answer: web development, data science, and AI
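
In all three runs the second retrieved document is unrelated to the question, because `index.search` always returns `top_k` neighbours no matter how far away they are. One possible refinement is to drop neighbours whose distance exceeds a cutoff before building the context. The sketch below assumes the `(distances, indices)` layout that `index.search` returns, with one row per query. `filter_by_distance`, the cutoff `1.0`, and the example distances are all hypothetical.

```python
def filter_by_distance(distances, indices, documents, max_distance):
    """Keep only documents whose distance to the query is at most max_distance.

    distances and indices are the first rows of the search result
    (one entry per retrieved neighbour). An index of -1 marks an empty slot.
    """
    kept = []
    for dist, idx in zip(distances, indices):
        if idx == -1:  # no neighbour was found for this slot
            continue
        if dist <= max_distance:
            kept.append(documents[idx])
    return kept

docs = [
    "The Earth revolves around the Sun in about 365 days.",
    "Photosynthesis is the process by which green plants make food using sunlight.",
]
# Made-up distances: document 1 is close to the query, document 0 is far.
kept = filter_by_distance([0.4, 1.7], [1, 0], docs, max_distance=1.0)
print(kept)
```

Inside `rag_answer`, this would be called with `distances[0]` and `indices[0]` in place of the list comprehension that builds `retrieved_docs`. A real cutoff would have to be chosen by inspecting distances on actual questions.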


## ✅ What We Did: A Simple RAG Pipeline

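- Embedded the documents into vectors with a sentence-embedding model
- Used FAISS to retrieve the documents closest to the question
- Passed the retrieved text and the question to an extractive question-answering model
- Got answers grounded in the retrieved text: the QA model returns a span copied from the context instead of generating free text that could be hallucinated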