# RAG (Retrieval-Augmented Generation) based QA system

This notebook implements a Retrieval-Augmented Generation (RAG) based QA system.
We use:

- FAISS for document retrieval
- Hugging Face Transformers for question answering
- Sentence-Transformers for embedding documents



1. Tutorial: Implementing a basic RAG-based QA system using FAISS for retrieval and Hugging Face Transformers for generation.
2. Assignment Question: A task to modify or enhance the system within 30 minutes.




1. Install Dependencies

`faiss-cpu`: Fast nearest-neighbor vector search for retrieval (FAISS offers both exact and approximate indexes; the `IndexFlatL2` index used in this notebook performs exact search)

`transformers`: Pretrained models for text generation

`datasets`: Load large datasets like Wikipedia

`sentence-transformers`: Convert text into vector embeddings

In [1]:
!pip install faiss-cpu transformers datasets sentence-transformers
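
After installing, it can help to confirm that the packages are importable before running the heavier cells. The snippet below is a minimal sketch; the helper name `missing_modules` is made up for this illustration. Note that two import names differ from the pip package names (`faiss-cpu` is imported as `faiss`, `sentence-transformers` as `sentence_transformers`).

```python
import importlib.util

# Import names used later in the notebook (not the pip package names)
REQUIRED_MODULES = ["faiss", "transformers", "datasets", "sentence_transformers"]

def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# An empty list means every dependency is importable
print("Missing modules:", missing_modules(REQUIRED_MODULES))
```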



2. Import Libraries



Why these libraries?

`faiss`: Efficient document retrieval

`sentence-transformers`: Converts text to embeddings

`transformers`: Loads Hugging Face models for answering questions

`datasets`: Loads Wikipedia snippets

In [2]:
import faiss
import numpy as np
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from sentence_transformers import SentenceTransformer
from datasets import load_dataset


Load and Embed the Dataset

1. Loads the first 1000 articles of the Simple English Wikipedia dump and keeps the first 500
2. Converts each article into a numerical embedding using all-MiniLM-L6-v2
3. Stores the embeddings in a FAISS index so that similar texts can be found by vector distance
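
To make the idea of similarity between embeddings concrete, here is a small sketch in plain Python. The three-dimensional vectors are hand-made stand-ins (an assumption for illustration only); the real all-MiniLM-L6-v2 embeddings have 384 dimensions. The sketch uses cosine similarity, whereas the notebook's index uses L2 distance; for unit-length vectors the two produce the same ranking.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings: related topics point in similar directions
toy_embeddings = {
    "helium": [0.9, 0.1, 0.0],
    "hydrogen": [0.8, 0.2, 0.1],
    "football": [0.0, 0.1, 0.9],
}

query_vec = toy_embeddings["helium"]
ranked = sorted(
    toy_embeddings,
    key=lambda name: cosine_similarity(query_vec, toy_embeddings[name]),
    reverse=True,
)
print(ranked)  # most similar first
```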

In [3]:
# Load sample dataset
dataset = load_dataset("wikipedia", "20220301.simple", split="train[:1000]")  # 1000 articles
docs = dataset["text"][:500]  # Taking 500 docs for efficiency

# Embed using Sentence Transformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(docs, convert_to_numpy=True)

# Build FAISS Index
index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(doc_embeddings)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



The repository for wikipedia contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/wikipedia.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y



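Build a FAISS Index for Fast Retrieval

Why FAISS?

- FAISS is a library for fast similarity search over dense vectors
- `IndexFlatL2` compares the query with every stored vector and ranks documents by L2 (Euclidean) distance, so the closest documents come first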
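
As a rough mental model, a flat L2 index behaves like the exhaustive search below. This is a pure-Python sketch for intuition, not FAISS code; the function name `flat_l2_search` is invented here. Like `IndexFlatL2`, it reports squared L2 distances, smallest first.

```python
def flat_l2_search(vectors, query, k):
    """Compare `query` with every stored vector and return the k nearest.

    Returns (distances, indices), where distances are squared L2 distances
    sorted in ascending order.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((v - q) ** 2 for v, q in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # nearest first; ties are broken by lower index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-D "embeddings" (illustrative values only)
stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = flat_l2_search(stored, [2.0, 2.0], k=2)
print(distances, indices)
```

The cost grows linearly with the number of stored vectors, which is fine for the 500 documents used here; larger collections are where FAISS's approximate indexes become relevant.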

Define the Retrieval-Augmented QA Pipeline

How does retrieval work?
- Encodes the query into an embedding with the same model used for the documents
- Searches the index for the k nearest Wikipedia articles (k=3 by default)
- Returns the text of those documents

In [4]:
def retrieve_documents(query, k=3):
    """Return the k documents whose embeddings are nearest to the query."""
    # encode() takes a list and returns a 2-D array, the shape FAISS expects
    query_embedding = embedder.encode([query])
    distances, indices = index.search(query_embedding, k)
    return [docs[i] for i in indices[0]]

# Load HuggingFace Model for Generation
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")


def generate_answer(question):
    """Retrieve context for the question and let FLAN-T5 generate an answer."""
    retrieved_docs = retrieve_documents(question)
    context = " ".join(retrieved_docs)  # Combine retrieved documents
    input_text = f"Context: {context} Question: {question}"

    # Inputs longer than 512 tokens are cut from the end, so with long
    # articles the later documents and the question itself can be dropped
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
    output = model.generate(**inputs, max_length=100)
    return tokenizer.decode(output[0], skip_special_tokens=True)





Steps in `generate_answer`:

- Retrieve relevant documents using FAISS
- Optionally print the retrieved context (useful for debugging; the cell above does not print it)
- Combine the text into a single input
- Feed it to FLAN-T5 to generate an answer


FLAN-T5 reads the context and answers
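
One thing to watch: the prompt is truncated to 512 tokens, and the question sits at the end of the input string, so long articles can push it past the cutoff. A possible workaround is to trim the context before building the prompt. The sketch below uses whitespace-separated words as a rough proxy for tokens; the helper name `build_prompt` and the limits `max_words` and `words_per_doc` are assumptions chosen for illustration, not values from the notebook.

```python
def build_prompt(question, retrieved_docs, max_words=300, words_per_doc=100):
    """Trim each document, then the combined context, so the question still fits.

    Word counts only approximate token counts, so the limits should be
    chosen conservatively for a 512-token model input.
    """
    # Keep the start of every document so each one contributes some context
    trimmed = [" ".join(doc.split()[:words_per_doc]) for doc in retrieved_docs]
    # Cap the combined context as well
    context_words = " ".join(trimmed).split()[:max_words]
    context = " ".join(context_words)
    return f"Context: {context} Question: {question}"

# Toy documents standing in for long Wikipedia articles
long_docs = ["alpha " * 500, "beta " * 500]
prompt = build_prompt("what is helium", long_docs, max_words=150, words_per_doc=100)
print(len(prompt.split()), "words; ends with:", prompt[-24:])
```

In `generate_answer`, the returned string would take the place of `input_text`; the tokenizer's `truncation=True` can stay as a safety net.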


In [5]:
# Test the system
question = "Tell me a science fact"
print(generate_answer(question))


Helium is a noble gas, because it does not regularly mix with other chemicals and form new compounds. It has the lowest boiling point of all the elements. It is the second most common element in the universe, after hydrogen, and has no color or smell. However, helium has a red-orange glow when placed in an electric field.


Conclusion:
This notebook demonstrates a basic RAG-based QA system using:

1. FAISS for fast document retrieval
2. Sentence Transformers for embeddings
3. FLAN-T5 for answer generation

# Assignment (30 min task)
Modify the system by improving retrieval or generation:

Enhance Retrieval:

- Try BM25 instead of FAISS (hint: use the `rank_bm25` library).
- Experiment with different embeddings (e.g. `sentence-transformers/all-mpnet-base-v2`).

Improve Answer Generation:

- Try a different model such as `facebook/bart-large-cnn`, which is fine-tuned for summarization.
- Fine-tune the model on a QA dataset.

Deliverable: Write a Colab cell showing the modification and compare outputs before/after.
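
For the before/after comparison, one option is a small harness that runs the same questions through both versions of the system. This is a sketch under assumptions: `compare_answers` and the two stub functions are hypothetical names. In the notebook, the baseline and modified `generate_answer` functions would need distinct names (for example `generate_answer_faiss` and `generate_answer_bm25`), because redefining `generate_answer` in a later cell overwrites the earlier version.

```python
def compare_answers(questions, answer_before, answer_after):
    """Run each question through both systems and collect the results."""
    rows = []
    for question in questions:
        before = answer_before(question)
        after = answer_after(question)
        rows.append({
            "question": question,
            "before": before,
            "after": after,
            "changed": before != after,
        })
    return rows

# Stand-in answer functions so the sketch runs without loading any model
def baseline_stub(question):
    return f"baseline answer to: {question}"

def modified_stub(question):
    return f"modified answer to: {question}"

results = compare_answers(["what is helium", "what is hydrogen"], baseline_stub, modified_stub)
for row in results:
    print(row)
```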

In [6]:
!pip install rank_bm25  # BM25 implementation used for keyword-based retrieval

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


In [7]:
import nltk

# Download the required NLTK data package
nltk.download('punkt_tab')



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [8]:
from rank_bm25 import BM25Okapi
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from datasets import load_dataset
import nltk
import torch

# Load dataset (Google Colab compatible)
def load_data():
    dataset = load_dataset("wikipedia", "20220301.simple", split="train[:2000]")  # Load 2000 articles
    docs = dataset["text"][:1000]  # Keep 1000 docs (twice as many as the FAISS baseline)
    return docs

docs = load_data()

# Tokenize and index with BM25 (keyword-based, so no embeddings are computed)
def build_bm25_index(docs):
    tokenized_docs = [nltk.word_tokenize(doc.lower()) for doc in docs]
    return BM25Okapi(tokenized_docs), tokenized_docs

bm25, tokenized_docs = build_bm25_index(docs)

# Retrieve the k documents with the highest BM25 scores
def retrieve_documents(query, k=5):
    # The query must be tokenized the same way as the documents
    tokenized_query = nltk.word_tokenize(query.lower())
    doc_scores = bm25.get_scores(tokenized_query)
    top_indices = sorted(range(len(doc_scores)), key=lambda i: doc_scores[i], reverse=True)[:k]
    return [docs[i] for i in top_indices]

# Load HuggingFace Model for Generation
# (redundant if the tokenizer and model from the earlier cell are still in memory)
def load_model():
    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
    return tokenizer, model

tokenizer, model = load_model()

# Generate Answer
def generate_answer(question):
    retrieved_docs = retrieve_documents(question)
    if not retrieved_docs:  # Only happens when the corpus is empty
        return "Sorry, I couldn't find relevant information."

    context = " ".join(retrieved_docs)
    input_text = f"Context: {context} Question: {question}"

    # Inputs longer than 512 tokens are cut from the end
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        # Greedy decoding (temperature would only apply with do_sample=True)
        output = model.generate(**inputs, max_length=150)
    return tokenizer.decode(output[0], skip_special_tokens=True)
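
To see what a BM25 score measures, the following sketch computes a textbook Okapi BM25 score in plain Python. It is a simplified illustration, not the `rank_bm25` implementation: the IDF formula is one common smoothed variant, the function name `bm25_scores` is invented here, and `k1=1.5`, `b=0.75` are commonly used parameter values.

```python
import math
from collections import Counter

def bm25_scores(tokenized_docs, tokenized_query, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents each term appears
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenized_query:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher weight (smoothed, always positive)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1; b normalizes for document length
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

toy_docs = [
    "helium is a noble gas".split(),
    "football is a team sport".split(),
    "hydrogen is the lightest gas".split(),
]
scores = bm25_scores(toy_docs, "helium gas".split())
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```

Unlike the embedding approach, a document scores zero unless it shares at least one exact token with the query, which is why the query and the documents are tokenized and lowercased in the same way.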



In [9]:
# Test Case
question = "what is helium"
answer = generate_answer(question)
print("Question:", question)
print("Answer:", answer)



Question: what is helium
Answer: Helium is a noble gas, because it does not regularly mix with other chemicals and form new compounds. It has the lowest boiling point of all the elements. It is the second most common element in the universe, after hydrogen, and has no color or smell. However, helium has a red-orange glow when placed in an electric field.
