# RAG Pipeline Implementation for SQuAD v2 Dataset

This assignment demonstrates a simple RAG (Retrieval-Augmented Generation) system.

Components:
1. Data Loading: SQuAD v2 dataset loaded with the `datasets` library
2. Semantic Retrieval: Sentence-embedding retrieval with a FAISS index
3. Answer Generation: LLM-based answering over the retrieved context
4. Evaluation: Comparison of model answers with ground-truth answers on a validation sample

### About the Dataset - SQuAD V2

SQuAD (Stanford Question Answering Dataset) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage, or the question may be unanswerable.
[SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) combines the 100,000+ question-answer pairs of SQuAD1.1 with over 50,000 unanswerable questions.
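
To make the span format concrete, the sketch below shows how an `answers` entry maps back to its passage. The two records are hand-made toy examples in the SQuAD v2 layout (they are not rows from the dataset), and the helper names `is_answerable` and `extract_spans` are hypothetical.

```python
# Toy records in the SQuAD v2 layout; the values are made up for illustration.
context = "iPods are used to train new staff at the Royal and Western Infirmaries in Glasgow, Scotland."
answer = "Glasgow, Scotland"

record = {
    "context": context,
    "question": "Where is Royal and Western Infirmaries located?",
    # answer_start is the character offset of the answer inside the context.
    "answers": {"text": [answer], "answer_start": [context.find(answer)]},
}

# Unanswerable questions carry empty answer lists.
unanswerable = {
    "context": context,
    "question": "Who invented the iPod?",
    "answers": {"text": [], "answer_start": []},
}


def is_answerable(rec):
    """A record is answerable when it has at least one answer text."""
    return len(rec["answers"]["text"]) > 0


def extract_spans(rec):
    """Recover each answer by slicing the context at its character offset."""
    ctx = rec["context"]
    answers = rec["answers"]
    return [ctx[start:start + len(text)]
            for text, start in zip(answers["text"], answers["answer_start"])]


print(extract_spans(record))
print(is_answerable(unanswerable))
```

Checking `is_answerable` before scoring is one way to keep unanswerable questions from being compared against an empty reference.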





## **Install and Load libraries**

| Library            | Description                                  |
|--------------------|----------------------------------------------|
| `datasets`         | Provides access to various datasets, including SQuAD v2. |
| `sentence-transformers` | Used for generating sentence embeddings.     |
| `faiss-cpu`        | Provides efficient similarity search for vectors. |
| `transformers`     | Offers pre-trained models for various NLP tasks. |
| `accelerate`       | Helps in training models efficiently on different hardware setups. |

In [None]:
!pip install datasets sentence-transformers faiss-cpu transformers accelerate



In [None]:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util
import faiss
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from accelerate import Accelerator

## Load SQuAD v2 Dataset

In [None]:
dataset = load_dataset("squad_v2")

## Explore the Dataset

In [None]:
dataset
### The dataset contains 5 columns: id, title, context, question, and answers.
### Training dataset length = 130319 records
### Validation dataset length = 11873 records

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

In [None]:
### Print the training record at index 2000
dataset['train'][2000]

{'id': '56cd796762d2951400fa65ed',
 'title': 'IPod',
 'context': 'On October 21, 2008, Apple reported that only 14.21% of total revenue for fiscal quarter 4 of year 2008 came from iPods. At the September 9, 2009 keynote presentation at the Apple Event, Phil Schiller announced total cumulative sales of iPods exceeded 220 million. The continual decline of iPod sales since 2009 has not been a surprising trend for the Apple corporation, as Apple CFO Peter Oppenheimer explained in June 2009: "We expect our traditional MP3 players to decline over time as we cannibalize ourselves with the iPod Touch and the iPhone." Since 2009, the company\'s iPod sales have continually decreased every financial quarter and in 2013 a new model was not introduced onto the market.',
 'question': 'Who was Chief Financial Officer of Apple in July of 2009?',
 'answers': {'text': ['Peter Oppenheimer'], 'answer_start': [384]}}

In [None]:
### Print the training record at index 2005
dataset['train'][2005]

{'id': '56cd7a3f62d2951400fa6600',
 'title': 'IPod',
 'context': 'iPods have won several awards ranging from engineering excellence,[not in citation given] to most innovative audio product, to fourth best computer product of 2006. iPods often receive favorable reviews; scoring on looks, clean design, and ease of use. PC World says that iPod line has "altered the landscape for portable audio players". Several industries are modifying their products to work better with both the iPod line and the AAC audio format. Examples include CD copy-protection schemes, and mobile phones, such as phones from Sony Ericsson and Nokia, which play AAC files rather than WMA.',
 'question': 'What rank did iPod achieve among various computer products in 2006?',
 'answers': {'text': ['fourth'], 'answer_start': [127]}}

In [None]:
dataset['train'][2015]

{'id': '56cd7ab462d2951400fa660d',
 'title': 'IPod',
 'context': 'Besides earning a reputation as a respected entertainment device, the iPod has also been accepted as a business device. Government departments, major institutions and international organisations have turned to the iPod line as a delivery mechanism for business communication and training, such as the Royal and Western Infirmaries in Glasgow, Scotland, where iPods are used to train new staff.',
 'question': 'Where is Royal and Western Infirmaries located?',
 'answers': {'text': ['Glasgow, Scotland'], 'answer_start': [334]}}

## Prepare corpus

In [None]:
### Select the first 3000 training records and use their contexts as the corpus
train_docs = dataset['train'].select(range(3000))
corpus = [item['context'] for item in train_docs]
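
SQuAD pairs several questions with each passage, so the contexts of neighbouring records often repeat and `corpus` can hold the same passage more than once. A minimal sketch of order-preserving de-duplication follows, assuming exact string matches are enough; `unique_in_order` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def unique_in_order(passages):
    """Drop repeated passages while keeping first-seen order."""
    # dict keys are unique and remember insertion order.
    return list(dict.fromkeys(passages))


contexts = ["passage A", "passage A", "passage B", "passage A"]
print(unique_in_order(contexts))
```

Applying this to `corpus` would shrink the index below 3000 entries, so the counts and retrieval outputs recorded in this notebook would no longer match.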

## Semantic Retrieval: Embedding-based document retrieval

Documents are embedded with Sentence Transformers and stored in FAISS, a fast vector search library.

In [None]:
### Load embedding Model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

### Create embeddings in batches to avoid running out of memory
import numpy as np
batch_size = 50 # Adjust batch size based on available memory
doc_embeddings = []
for i in range(0, len(corpus), batch_size):
    batch = corpus[i:i + batch_size]
    batch_embeddings = embedder.encode(batch, convert_to_numpy=True, show_progress_bar=False)
    doc_embeddings.append(batch_embeddings)

### Stack the per-batch arrays into one (num_docs, dim) array
doc_embeddings = np.concatenate(doc_embeddings)

### Build FAISS index

In [None]:
### "flat" index means it stores all vectors directly without any compression or hierarchical structure,
### making it highly accurate but potentially slower for very large datasets compared to more advanced indexes.
### doc_embeddings.shape[1] supplies the dimensionality of the vectors.
index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(doc_embeddings)
### If the embeddings were torch tensors, convert them first: index.add(doc_embeddings.cpu().numpy())
print(f"Number of documents in corpus: {index.ntotal}")

Number of documents in corpus: 3000
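
Conceptually, a flat L2 index compares the query with every stored vector and keeps the closest ones. The pure-Python sketch below illustrates that exhaustive search on toy 2-D vectors; it is not how FAISS is implemented internally, and `flat_l2_search` is a hypothetical name. It reports squared distances, which is also what `IndexFlatL2` returns.

```python
def flat_l2_search(vectors, query, k=1):
    """Exhaustive search: score every stored vector and return the k closest.

    Returns (distances, indices), where distances are squared L2 distances.
    """
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-D "embeddings" for illustration only.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
distances, indices = flat_l2_search(vectors, [0.9, 0.1], k=2)
print(indices)
```

The cost grows linearly with the number of stored vectors, which is acceptable for 3000 passages but is the reason approximate indexes exist for much larger corpora.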


## Implement Retrieval Function

This function retrieves the most relevant passages for a user query.

In [None]:
def retriever(query, k=1):
  query_vec = embedder.encode([query], convert_to_numpy=True)
  distances, indices = index.search(query_vec, k)
  return [corpus[i] for i in indices[0]]
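
A possible variant returns the distance next to each passage, which helps when judging how close a match really is. FAISS fills missing neighbours with the index `-1` when fewer than `k` results exist, and in Python `corpus[-1]` would silently return the last passage, so the sketch skips those entries. `retrieve_with_scores` and `StubIndex` are hypothetical names; the stub only mimics the `(distances, indices)` return shape of `index.search` so the example runs without FAISS, and its padding distance is illustrative.

```python
def retrieve_with_scores(index, corpus, query_vec, k=3):
    """Return (passage, distance) pairs, skipping -1 padding entries."""
    distances, indices = index.search(query_vec, k)
    results = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx < 0:  # fewer than k neighbours were available
            continue
        results.append((corpus[int(idx)], float(dist)))
    return results


class StubIndex:
    """Stand-in for a FAISS index so the sketch runs on its own."""

    def search(self, query_vec, k):
        # One query row; pretend only two vectors are stored.
        found = [(0.12, 1), (0.87, 0)]
        padded = found[:k] + [(float("inf"), -1)] * max(0, k - len(found))
        return [[d for d, _ in padded]], [[i for _, i in padded]]


toy_corpus = ["passage zero", "passage one"]
hits = retrieve_with_scores(StubIndex(), toy_corpus, query_vec=[[0.0]], k=3)
print(hits)
```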

#### Test sample query

In [None]:
import textwrap
query = "In year 2008, How much Apple revenue came from iPods?"
results = retriever(query)
print(textwrap.fill(results[0], width=90))

On October 21, 2008, Apple reported that only 14.21% of total revenue for fiscal quarter 4
of year 2008 came from iPods. At the September 9, 2009 keynote presentation at the Apple
Event, Phil Schiller announced total cumulative sales of iPods exceeded 220 million. The
continual decline of iPod sales since 2009 has not been a surprising trend for the Apple
corporation, as Apple CFO Peter Oppenheimer explained in June 2009: "We expect our
traditional MP3 players to decline over time as we cannibalize ourselves with the iPod
Touch and the iPhone." Since 2009, the company's iPod sales have continually decreased
every financial quarter and in 2013 a new model was not introduced onto the market.


- The query output shows that the retriever finds the right passage for the question. However, it returns the whole passage rather than a direct answer, and most of those details are not needed in the response.



## Using OpenAI LLM for Answer Generation


In [None]:
!pip install openai



In [99]:
from openai import OpenAI
from google.colab import userdata

client = OpenAI(api_key=userdata.get("OPENAI_API_KEY"))
### Function to Answer the questions
def rag_answer_openai(query, top_k=1):

    retrieved_docs = retriever(query, k=top_k)
    context = "\n".join(retrieved_docs)

    ### Construct the combined RAG prompt.
    ### The retriever function fetches the context from the indexed corpus,
    ### and that context is inserted into the prompt.
    prompt = f"""
    You are a question-answering assistant.
    Answer the question based only on the context below.
    If the answer is not contained in the context, say "I don’t know".
    context: {context}
    Question: {query}
    Answer:
    """

    ### Generate the response with the selected OpenAI chat model
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        #model="gpt-3.5-turbo",
        #model="o4-mini",
        messages=[
          ### Send the full prompt so the grounding instructions reach the model
          {"role": "user", "content": prompt}
        ],
        temperature=0.5,
        max_tokens=200
    )

    ### Return the answer
    return response.choices[0].message.content.strip()
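
Prompt assembly can also be factored into a small helper, which makes it easy to inspect exactly what text would be sent to the model. The sketch below is one possible layout, not the notebook's own code: `build_rag_prompt` and `INSTRUCTIONS` are hypothetical names, and the wording of the instructions is copied from the prompt above. Joining plain string literals avoids the leading indentation that a triple-quoted string inside a function carries into the prompt.

```python
# Instruction text reused for every question (wording taken from the prompt above).
INSTRUCTIONS = (
    "You are a question-answering assistant.\n"
    "Answer the question based only on the context below.\n"
    'If the answer is not contained in the context, say "I don\'t know".'
)


def build_rag_prompt(context, query):
    """Combine the instructions, retrieved context, and question into one prompt."""
    return f"{INSTRUCTIONS}\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"


example_prompt = build_rag_prompt(
    "iPods are used to train new staff.",
    "What are iPods used for?",
)
print(example_prompt)
```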

- OpenAI model ran output tokens.
- Explored `gpt-oss-120b` from Hugging Face, but the model was too large to load in Colab.
- Then tried `microsoft/Phi-3-mini-4k-instruct`.


In [82]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import textwrap

# Load the gpt-oss-120b model and tokenizer
model_name = "openai/gpt-oss-120b"
#model_name = "microsoft/Phi-3-mini-4k-instruct" # Changed to a smaller model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

### Function to Answer the questions using the local model
def rag_answer_local(query, top_k=1, width=90):

    retrieved_docs = retriever(query, k=top_k)
    context = "\n".join(retrieved_docs)

    ### Construct the combined RAG prompt
    prompt = f"""
    You are a question-answering assistant.
    Answer the question based only on the context below.
    If the answer is not contained in the context, say "I don’t know".
    context: {context}
    Question: {query}
    Answer:
    """

    ### Generate response using the local model
    output = pipe(prompt, max_new_tokens=200, do_sample=False)[0]["generated_text"]
    answer = output.split("Answer:")[-1].strip()
    wrapped_answer = textwrap.fill(answer, width=width)
    return wrapped_answer

Fetching 15 files:   0%|          | 0/15 [00:00<?, ?it/s]

model-00011-of-00014.safetensors:   0%|          | 0.00/4.12G [00:00<?, ?B/s]

KeyboardInterrupt: 

In [100]:
questions = [
    "When did Phil Schiller announce that cumulative iPod sales had exceeded 220 million ?",
    "Was a new iPod model introduced in 2013?",
    "What kind of awards have iPods won?",
    "Does Jagdish Mane owns an iPod ?"
]

for q in questions:
    print(f"Q: {q}")
    print("A:", textwrap.fill(rag_answer_openai(q), width=90))
    #print("A:", textwrap.fill(rag_answer_local(q), width=90))
    print("-" * 80)


Q: When did Phil Schiller announce that cumulative iPod sales had exceeded 220 million ?
A: Phil Schiller announced that cumulative iPod sales had exceeded 220 million on September
9, 2009.
--------------------------------------------------------------------------------
Q: Was a new iPod model introduced in 2013?
A: No, a new iPod model was not introduced in 2013 according to the provided context. The new
model mentioned was announced in mid-2015 and released on July 15, 2015.
--------------------------------------------------------------------------------
Q: What kind of awards have iPods won?
A: iPods have won awards ranging from engineering excellence to most innovative audio
product, as well as being recognized as the fourth best computer product of 2006.
--------------------------------------------------------------------------------
Q: Does Jagdish Mane owns an iPod ?
A: The provided context does not contain any information about Jagdish Mane or whether he
owns an iPod. Therefore

- In the Q&A responses above, the answers to the first three questions are concise, grounded in the retrieved context, and easy to understand.
- The model also handled the out-of-context question "Does Jagdish Mane owns an iPod ?" appropriately, responding that the provided context contains no information about Jagdish Mane.

## Evaluation
Let's take a sample record from the validation set.

In [103]:
sample_dataset = dataset["validation"][2001]
sample_dataset

### Printing original dataset

{'id': '57114e8d50c2381900b54a5b',
 'title': 'Steam_engine',
 'context': 'The efficiency of a Rankine cycle is usually limited by the working fluid. Without the pressure reaching supercritical levels for the working fluid, the temperature range the cycle can operate over is quite small; in steam turbines, turbine entry temperatures are typically 565 °C (the creep limit of stainless steel) and condenser temperatures are around 30 °C. This gives a theoretical Carnot efficiency of about 63% compared with an actual efficiency of 42% for a modern coal-fired power station. This low turbine entry temperature (compared with a gas turbine) is why the Rankine cycle is often used as a bottoming cycle in combined-cycle gas turbine power stations.[citation needed]',
 'question': "What limits the Rankine cycle's efficiency?",
 'answers': {'text': ['working fluid', 'working fluid', 'the working fluid'],
  'answer_start': [60, 60, 56]}}

#### Compare True Answer and Model Answer

In [105]:
print("Question:", sample_dataset["question"])
print("-" * 90)
print("True Answer:", textwrap.fill(sample_dataset["answers"]["text"][0], width=90))
print("-" * 90)
print("Model Answer:", textwrap.fill(rag_answer_openai(sample_dataset["question"]), width=90))
#print("Model Answer:", textwrap.fill(rag_answer_local(sample_dataset["question"]), width=90))


Question: What limits the Rankine cycle's efficiency?
------------------------------------------------------------------------------------------
True Answer: working fluid
------------------------------------------------------------------------------------------
Model Answer: The Rankine cycle's efficiency is primarily limited by the maximum temperature and
pressure at which the cycle operates, as well as the properties of the working fluid.
Practical constraints such as material strength, heat exchanger effectiveness, and
thermodynamic irreversibilities (like friction, turbulence, and non-ideal fluid behavior)
also reduce efficiency. Additionally, the temperature difference between the heat source
and the condenser limits the amount of useful work that can be extracted, as dictated by
the second law of thermodynamics.


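- The corpus was built from the first 3000 training records, so this validation passage is not in the index and the retriever cannot return it.
- The model answer mentions the working fluid, which agrees with the ground truth, but it is far longer than the reference span and includes details (material strength, thermodynamic irreversibilities) that do not appear in the SQuAD passage. This suggests the model drew on its general knowledge rather than on the retrieved context.
- In the earlier test, the model correctly identified the out-of-context question and responded that the context did not contain the answer.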
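
A side-by-side comparison like the one above can be turned into numbers with SQuAD-style exact-match and token-level F1 scores. The sketch below follows the normalization used in the official SQuAD evaluation script (lowercasing, stripping punctuation and articles, collapsing whitespace), but it is a simplified illustration: the function names are hypothetical and the special handling of unanswerable questions is omitted.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truths):
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(t) for t in truths)


def token_f1(prediction, truth):
    """Harmonic mean of token precision and recall against one reference."""
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, truths):
    """Score against every reference answer and keep the best match."""
    return max(token_f1(prediction, t) for t in truths)


truths = ["working fluid", "working fluid", "the working fluid"]
print(exact_match("The working fluid.", truths))
print(round(best_f1("the properties of the working fluid", truths), 3))
```

A long free-form answer scores low on exact match even when it contains the reference span, so F1 is usually the more informative of the two for generated answers.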