# RAG (Retrieval-Augmented Generation) based QA system

In [None]:
!pip install rank_bm25



This notebook implements a Retrieval-Augmented Generation (RAG) based QA system.
We use:

- FAISS for document retrieval
- Hugging Face Transformers for question answering
- Sentence-Transformers for embedding documents



1. Tutorial: Implementing a basic RAG-based QA system using FAISS for retrieval and Hugging Face Transformers for generation.
2. Assignment Question: A task to modify or enhance the system within 30 minutes.




1. Install Dependencies

`faiss-cpu`: Fast nearest-neighbor search over vectors (exact or approximate), used for retrieval

`transformers`: Pretrained models for text generation

`datasets`: Loads large datasets such as Wikipedia

`sentence-transformers`: Converts text into vector embeddings

In [None]:
!pip install faiss-cpu transformers datasets sentence-transformers


2. Import Libraries



Why these libraries?

`faiss`: Efficient document retrieval

`sentence-transformers`: Converts text to embeddings

`transformers`: Loads Hugging Face models for answering questions

`datasets`: Loads Wikipedia snippets

In [None]:
import faiss
import numpy as np
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from sentence_transformers import SentenceTransformer
from datasets import load_dataset


Load and Embed the Dataset

1. Loads the first 1000 articles of Simple English Wikipedia and keeps the first 500
2. Converts each article into a numerical embedding using all-MiniLM-L6-v2
3. Adds the embeddings to a FAISS index so that documents similar to a query can be found by vector search

In [None]:
# Load sample dataset
dataset = load_dataset("wikipedia", "20220301.simple", split="train[:1000]")  # 1000 articles
docs = dataset["text"][:500]  # Taking 500 docs for efficiency

# Embed using Sentence Transformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(docs, convert_to_numpy=True)

# Build FAISS Index
index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(doc_embeddings)



Build a FAISS Index for Fast Retrieval

Why FAISS?

- FAISS is a fast vector search library
- The `IndexFlatL2` index built above compares the query with every stored vector (exact search) and ranks documents by L2 distance
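
To make "finding the closest documents" concrete, here is a minimal pure-Python sketch of what an exhaustive L2 search computes. It is an illustration only, not how FAISS is implemented; `l2_search` and the toy vectors are hypothetical. Like `IndexFlatL2`, it reports squared L2 distances.

```python
import heapq

def l2_search(vectors, query, k):
    """Return (squared_distances, indices) of the k vectors closest to query."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    top = heapq.nsmallest(k, scored)  # smallest distance first
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-dimensional "embeddings" (assumed data, for illustration)
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = l2_search(toy_vectors, [0.9, 0.1], k=2)
print(indices)  # → [1, 0]
```

FAISS does the same comparison with optimized native code, which is what makes it practical for much larger collections.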

Define the Retrieval-Augmented QA Pipeline

How retrieval works:
- Encodes the query into an embedding
- Searches for the top k most similar Wikipedia articles
- Returns those relevant documents

In [None]:
def retrieve_documents(query, k=3):
    query_embedding = embedder.encode([query])
    distances, indices = index.search(query_embedding, k)
    return [docs[i] for i in indices[0]]

# Load HuggingFace Model for Generation
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")


def generate_answer(question):
    retrieved_docs = retrieve_documents(question)
    context = " ".join(retrieved_docs)  # Combine retrieved documents
    input_text = f"Context: {context} Question: {question}"

    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
    output = model.generate(**inputs, max_length=100)
    return tokenizer.decode(output[0], skip_special_tokens=True)




1. `retrieve_documents(query, k=3)`

Retrieves the top k most relevant documents for a given query.

How it works:

- Converts the input query into an embedding using `embedder.encode([query])`.

- Uses `index.search(query_embedding, k)` to find the k closest documents based on embedding similarity.

- Returns the top k retrieved documents.
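
`index.search` returns two arrays with one row per query, which is why the function reads `indices[0]`. When an index holds fewer than k vectors, FAISS fills the missing slots with `-1`, and in Python `docs[-1]` would silently return the last document. A minimal sketch of a guard, using a hypothetical helper `pick_documents` and hand-written index rows that stand in for real FAISS output:

```python
def pick_documents(docs, index_row):
    """Map one row of result indices to documents, skipping -1 padding."""
    return [docs[i] for i in index_row if 0 <= i < len(docs)]

toy_docs = ["doc about cats", "doc about dogs"]
# Shape (1, k): one query, k=3 requested, but only two documents exist
toy_indices = [[1, 0, -1]]
print(pick_documents(toy_docs, toy_indices[0]))
# → ['doc about dogs', 'doc about cats']
```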

2. `generate_answer(question)`

Uses the retrieved documents to generate an answer.

How it works:

- Calls retrieve_documents(question) to fetch relevant context.

- Concatenates the retrieved documents into a single context string.

- Constructs an input text of the form `Context: <retrieved docs> Question: <question>`.

- Tokenizes the input using the Flan-T5 tokenizer (`tokenizer`), truncating it to 512 tokens.

- Feeds the tokenized input into the Flan-T5 model (`model`) to generate an answer of at most 100 tokens.

- Decodes the output into a readable string and returns it.
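
Because the tokenizer call uses `max_length=512` with `truncation=True` and the question comes last in the prompt, a long context can push the question past the cut-off. One possible guard, sketched here under the assumption that a rough word budget is an acceptable stand-in for a token budget, is to trim the context before building the prompt. `build_prompt` and `max_context_words` are hypothetical names, not part of the notebook.

```python
def build_prompt(question, retrieved_docs, max_context_words=300):
    """Build the prompt with a word-limited context so the question is kept."""
    words = " ".join(retrieved_docs).split()
    context = " ".join(words[:max_context_words])
    return f"Context: {context} Question: {question}"

# A deliberately long toy document (1200 words)
long_doc = "FAISS is a vector search library. " * 200
prompt = build_prompt("What is FAISS?", [long_doc])
print(len(prompt.split()))  # number of whitespace-separated words in the prompt
```

Word counts only approximate token counts, so counting with the model's tokenizer would be more precise.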

In summary, the pipeline:

- Retrieves relevant documents using FAISS
- Combines the text into a single input
- Feeds it to FLAN-T5, which reads the context and generates an answer

To debug retrieval, print the output of `retrieve_documents(question)` before generating an answer.


In [None]:
# Test the system
question = "Tell me a science fact"
print(generate_answer(question))


Conclusion:
This notebook demonstrates a basic RAG-based QA system using:

1. FAISS for fast document retrieval
2. Sentence Transformers for embeddings
3. FLAN-T5 for answer generation

# Assignment (30 min task)
Modify the system by improving retrieval or generation:

`Enhance Retrieval`

- Try BM25 instead of FAISS (hint: use the `rank_bm25` library).
- Experiment with different embeddings (for example `sentence-transformers/all-mpnet-base-v2`).

`Improve Answer Generation`

- Try a different generation model such as `facebook/bart-large-cnn`, which is trained for summarization.
- Fine-tune the model on a QA dataset.

Deliverable: Write a Colab cell showing the modification and compare outputs before/after.
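
One way to structure the before/after comparison is a small harness that runs each question through both systems. This is only a sketch: `compare_systems` is a hypothetical helper, and the two lambdas are stand-ins for `generate_answer` and whatever modified pipeline you build.

```python
def compare_systems(questions, systems):
    """Run every question through every system and collect the answers.

    systems maps a label to a callable that takes a question and returns a string.
    """
    rows = []
    for question in questions:
        row = {"question": question}
        for label, answer_fn in systems.items():
            row[label] = answer_fn(question)
        rows.append(row)
    return rows

# Stand-in callables (assumptions); replace with the real pipelines
baseline = lambda q: f"baseline answer to: {q}"
modified = lambda q: f"modified answer to: {q}"

results = compare_systems(
    ["Tell me a science fact"],
    {"before": baseline, "after": modified},
)
for row in results:
    for key, value in row.items():
        print(f"{key}: {value}")
```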

In [None]:
import numpy as np
from rank_bm25 import BM25Okapi
from transformers import (
    AutoTokenizer,
    pipeline
)
from datasets import load_dataset

# Load a subset of the Wikipedia dataset
data = load_dataset("wikipedia", "20220301.simple", split="train[:1000]")
documents = data["text"][:500]

# Tokenizing documents
tokenized_corpus = [doc.lower().split() for doc in documents]
bm25_model = BM25Okapi(tokenized_corpus)

def fetch_relevant_docs(query, top_n=3):
    query_tokens = query.lower().split()
    scores = bm25_model.get_scores(query_tokens)
    top_matches = np.argsort(scores)[-top_n:][::-1]
    return [documents[i] for i in top_matches]

# Load the summarization model
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
summarization_pipeline = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    tokenizer=tokenizer,
    max_length=100,
    min_length=30,
    do_sample=False
)

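def generate_detailed_response(prompt):
    retrieved_texts = fetch_relevant_docs(prompt)
    # Keep only the first 512 characters (not tokens) of the merged context
    context_snippet = " ".join(retrieved_texts)[:512]

    # bart-large-cnn is a summarization model, so the output is a summary of this text
    input_text = f"Based on the following context, answer the question.\nContext: {context_snippet}\nQuestion: {prompt}"
    # Arguments passed here override the max_length set when the pipeline was created
    response = summarization_pipeline(input_text, max_length=256, min_length=30, truncation=True)[0]['summary_text']
    return response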

def evaluate_qa_system(query):
    refined_answer = generate_detailed_response(query)

    return {
        "Query": query,
        "Generated Response": refined_answer
    }

sample_queries = [
    "What are the fundamental principles of quantum mechanics?",
    "Can you explain how machine learning algorithms work?"
]

for query in sample_queries:
    output = evaluate_qa_system(query)
    print(f"\nQuery: {output['Query']}")
    print(f"Generated Response: {output['Generated Response']}")
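
Note that `doc.lower().split()` splits on whitespace only, so punctuation stays attached to words: the query token `mechanics?` does not match `mechanics` in a document. A minimal sketch of a stricter tokenizer, assuming that ASCII letters and digits are enough for this corpus; `simple_tokenize` is a hypothetical helper.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

query = "What are the fundamental principles of quantum mechanics?"
whitespace_tokens = query.lower().split()
clean_tokens = simple_tokenize(query)

print(whitespace_tokens[-1])  # → mechanics?
print(clean_tokens[-1])       # → mechanics
```

To use it, the same function would have to be applied both to the documents passed to `BM25Okapi` and to the query passed to `get_scores`.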

In [None]:
!pip install rank_bm25 sentence-transformers datasets transformers

from datasets import load_dataset
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load dataset (Wikipedia subset)
dataset = load_dataset("wikipedia", "20220301.simple", split="train[:1000]")
docs = dataset["text"][:500]  # Taking 500 docs

# **BM25 Tokenization**
tokenized_docs = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized_docs)

# **Sentence Embeddings Model**
embedding_model = SentenceTransformer("all-mpnet-base-v2")
doc_embeddings = embedding_model.encode(docs)  # Convert documents into embeddings

# **BART Summarization Model**
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# **Search Query**
query = "What is artificial intelligence?"
query_tokens = query.lower().split()

# **BM25 Retrieval**
bm25_scores = bm25.get_scores(query_tokens)
top_bm25_idx = np.argsort(bm25_scores)[::-1][:5]  # Top 5 BM25 results

# **Embedding Retrieval (Semantic Search)**
query_embedding = embedding_model.encode(query)
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
top_embedding_idx = np.argsort(similarities)[::-1][:5]  # Top 5 Semantic results

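# **Hybrid Approach: Combine BM25 + Embeddings**
# Scale BM25 scores by their maximum; skip scaling when no document matches (avoids division by zero)
bm25_max = np.max(bm25_scores)
bm25_norm = bm25_scores / bm25_max if bm25_max > 0 else bm25_scores
hybrid_scores = bm25_norm + (similarities / np.max(similarities))
top_hybrid_idx = np.argsort(hybrid_scores)[::-1][:3]  # Top 3 Hybrid results

# **Generate Answer using BART**
retrieved_text = " ".join([docs[idx] for idx in top_hybrid_idx])  # Merge top documents
# truncation=True cuts the input to the model's maximum length; merged articles are often longer than that
generated_answer = summarizer(retrieved_text, max_length=100, min_length=50, do_sample=False, truncation=True)[0]['summary_text']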

# **Print Results**
print("\n🔹 **Retrieved Documents (BM25 + Embeddings Hybrid):**")
for idx in top_hybrid_idx:
    print(f" - {docs[idx][:200]}...")

print("\n🔹 **Generated Answer using BART:**")
print(generated_answer)




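The run recorded with this notebook failed with the following error:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

A likely cause is the summarizer input: three full articles joined together can exceed the 1024-token input limit of `facebook/bart-large-cnn`, and an out-of-range position index surfaces on the GPU as a device-side assert. Passing `truncation=True` to the `summarizer(...)` call, as `generate_detailed_response` does in the previous cell, keeps the input within the limit.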
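
Summing max-scaled scores, as the hybrid cell does, relies on the BM25 and cosine score distributions being comparable. An alternative that uses only ranks is reciprocal rank fusion (RRF). This is a different technique from the one in the cell above; the sketch below uses a hypothetical helper `rrf_fuse`, made-up rankings, and the commonly used constant 60 as an assumption.

```python
def rrf_fuse(rankings, k=60, top_n=3):
    """Fuse several ranked lists of document ids with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it contains
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]

bm25_ranking = [4, 1, 7, 2, 9]       # stand-in for list(top_bm25_idx)
embedding_ranking = [1, 3, 4, 8, 2]  # stand-in for list(top_embedding_idx)
fused = rrf_fuse([bm25_ranking, embedding_ranking])
print(fused)  # → [1, 4, 2]
```

Documents that appear near the top of both lists rise to the front, without any score normalization.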
