**Overview**  
- This notebook fine-tunes a T5-base model with the Hugging Face Transformers library to build a generative question-answering (QA) system.
- There are two datasets: training data (20K rows) and test data (10K rows).  
- The **train dataset** contains three columns: context, question, and answer. The goal is to train a model that generates the answer to a given question from the provided context.  
- The **test dataset** contains the context and question columns, but its answer column holds only a placeholder (`?`). The fine-tuned model predicts these missing answers.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install torch torchvision datasets evaluate transformers accelerate sentence-transformers -U

In [None]:
import pandas as pd
import torch
from datasets import Dataset, load_dataset
from transformers import T5TokenizerFast, T5ForConditionalGeneration, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
import evaluate
import re
from sentence_transformers import SentenceTransformer, util

## Load the Training and Test Dataset

In [None]:
train_df = pd.read_parquet('train.parquet')
test_df = pd.read_parquet('test_without_label.parquet')

In [None]:
train_df.head(5)

Unnamed: 0,context,question,answer
9983,The world's first institution of technology or...,What year was the Banská Akadémia founded?,1735
43267,The standard specifies how speed ratings shoul...,What is another speed that can also be reporte...,SOS-based speed
81021,The most impressive and famous of Sumerian bui...,Where were the use of advanced materials and t...,Sumerian temples and palaces
49374,Ann Arbor has a council-manager form of govern...,Who is elected every even numbered year?,mayor
53414,"Shortly before his death, when he was already ...",What was the purpose of top secret ICBM commit...,decide on the feasibility of building an ICBM ...


In [None]:
test_df.head()

Unnamed: 0,context,question,answer
63695,Perhaps the most famous raid by Oeselian pirat...,What important figure was killed in the raid?,?
80051,"Following a peak in growth in 1979, the Liberi...",In 2011 Liberia's economy was considered what?,?
32271,A plethora of anti-aircraft gun systems of sma...,The combat batteries of an Army AAA battalion ...,?
52439,Avicenna's legacy in classical psychology is p...,What subject is seen throughout Avicenna's Boo...,?
33889,"The desire to explore, record and systematize ...",In what year was Charles Burney's A General Hi...,?


In [None]:
train_df.shape, test_df.shape

((20000, 3), (10000, 3))

In [None]:
# Split dataset: 80% train, 20% validation
train_df, val_df = train_test_split(train_df, test_size=0.2, random_state=42)
train_df = train_df.reset_index(drop=True)
val_df = val_df.reset_index(drop=True)

In [None]:
# Convert the DataFrames into the Hugging Face `datasets` format
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

## Load the Pretrained T5 Model and Tokenizer

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device)

## Tokenize the Dataset

In [None]:
# Preprocessing function
def preprocess_function(examples):
    inputs = [f"question: {q.strip()} context: {c.strip()}" for q, c in zip(examples["question"], examples["context"])]
    targets = [a.strip() for a in examples.get("answer", [""] * len(inputs))]

    tokenized_inputs = tokenizer(inputs, truncation=True, padding="max_length", max_length=384)
    tokenized_targets = tokenizer(targets, truncation=True, padding="max_length", max_length=64)

    tokenized_inputs["labels"] = tokenized_targets["input_ids"]
    return tokenized_inputs
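
The `labels` built above keep the tokenizer's padding ids, so padded positions count toward the training loss. A common alternative is to replace them with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below only illustrates that idea and is not part of the original pipeline: `mask_pad_labels` is a hypothetical helper, the sample ids are made up, and the pad id of `0` is T5's default (`tokenizer.pad_token_id`).

```python
def mask_pad_labels(label_batch, pad_token_id=0, ignore_index=-100):
    """Replace padding ids in each label sequence with ignore_index."""
    return [
        [ignore_index if tok == pad_token_id else tok for tok in seq]
        for seq in label_batch
    ]

# Made-up ids: 1 is T5's end-of-sequence id, 0 is its pad id.
labels = [[4711, 1, 0, 0], [27, 9, 1, 0]]
masked = mask_pad_labels(labels)
print(masked)  # [[4711, 1, -100, -100], [27, 9, 1, -100]]
```

Inside `preprocess_function`, such a helper would be applied to `tokenized_targets["input_ids"]` before assigning `labels`. Doing so changes the scale of the loss, so the loss values reported in this notebook would no longer be comparable.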

In [None]:
# Apply tokenization
train_dataset = train_dataset.map(preprocess_function, batched=True)
val_dataset = val_dataset.map(preprocess_function, batched=True)

## Fine-Tuning T5-BASE Model

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./t5-base-qa-train-data",
    evaluation_strategy="epoch",  # named `eval_strategy` in newer transformers releases
    save_strategy="epoch",
    save_total_limit=2,
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=6,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    fp16=True,
    gradient_accumulation_steps=2,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="none",
)
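
The step counts implied by these arguments can be worked out by hand. The sketch below assumes a single GPU (the notebook does not state the device count) and uses the 16,000-row training split created earlier; the variable names are illustrative only.

```python
import math

train_rows = 16_000      # 80% of the 20,000 labelled rows
per_device_batch = 16    # per_device_train_batch_size
grad_accum = 2           # gradient_accumulation_steps
num_devices = 1          # assumption: a single GPU
planned_epochs = 6       # num_train_epochs
warmup_ratio = 0.1

# Examples consumed per optimizer step
effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(train_rows / effective_batch)
planned_steps = steps_per_epoch * planned_epochs
warmup_steps = math.ceil(planned_steps * warmup_ratio)

print(effective_batch, steps_per_epoch, planned_steps, warmup_steps)  # 32 500 3000 300
```

Under these assumptions, three completed epochs correspond to 3 x 500 = 1500 optimizer steps, which matches the `global_step=1500` reported by `trainer.train()`.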



## Add Early Stopping and Set Up the Trainer

In [None]:
from transformers import EarlyStoppingCallback
# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)]
)


## Train and save the model

In [None]:
# Train model
trainer.train()

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,0.025,0.01646
2,0.0214,0.015524
3,0.0168,0.015653


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


TrainOutput(global_step=1500, training_loss=1.323224512577057, metrics={'train_runtime': 2310.3468, 'train_samples_per_second': 41.552, 'train_steps_per_second': 1.299, 'total_flos': 2.192248406016e+16, 'train_loss': 1.323224512577057, 'epoch': 3.0})

In [None]:
trainer.save_model("t5-base-qa-train-data-finetuned")

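Observations:
- Training loss decreases steadily across the three epochs (0.025 to 0.0214 to 0.0168), so the model is still fitting the training data.
- Validation loss improves from epoch 1 to epoch 2 (0.01646 to 0.015524) and then rises slightly in epoch 3 (0.015653), an early sign of overfitting.
- With `early_stopping_patience=1`, that single non-improving epoch stops training after 3 of the 6 planned epochs. Because `load_best_model_at_end=True`, the trainer reloads the best checkpoint (epoch 2, the lowest validation loss) before `trainer.save_model` runs, so the saved model is the best one rather than the last one.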
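
The stopping behaviour can be reproduced with a simplified model of patience-based early stopping. This is a sketch of the general idea, not the source of `EarlyStoppingCallback` (it ignores the callback's improvement threshold, for instance); `early_stop_epoch` is a hypothetical helper, and the losses are the validation losses from the table above.

```python
def early_stop_epoch(val_losses, patience=1):
    """Return (stop_epoch, best_epoch), both 1-indexed, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

val_losses = [0.01646, 0.015524, 0.015653]
stop_epoch, best_epoch = early_stop_epoch(val_losses, patience=1)
print(stop_epoch, best_epoch)  # 3 2
```

A larger patience would tolerate more non-improving epochs before stopping, at the cost of extra training time.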

## Evaluate model performance using trainer.evaluate()

In [None]:
results = trainer.evaluate()
print("Evaluation Results:", results)

Evaluation Results: {'eval_loss': 0.015524194575846195, 'eval_runtime': 66.1557, 'eval_samples_per_second': 60.463, 'eval_steps_per_second': 3.779, 'epoch': 3.0}


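- The evaluation loss is low (`eval_loss` ≈ 0.0155) and equals the epoch-2 validation loss, which confirms that the best checkpoint was reloaded. Note that the labels are padded to 64 tokens and the padding positions are not masked, so this loss is averaged over padding tokens as well. It is therefore not a direct measure of answer quality, which the Exact Match and F1 scores below assess.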

## Evaluate Model Performance on Validation Data

In [None]:
# Move model to correct device once before inference
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()  # Set model to evaluation mode

# Prediction function
def answer_question(question, context):
    input_text = f"question: {question} context: {context}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)

    # Move inputs to correct device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        output = model.generate(**inputs, max_length=64, num_beams=3)  # Beam search for better results

    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return answer

## Make predictions for answers on validation data

In [None]:
# Load the SQuAD metric, generate predictions, and format them for scoring
metric = evaluate.load("squad")
val_df["predicted_answer"] = val_df.apply(lambda row: answer_question(row["question"], row["context"]), axis=1)
formatted_predictions = [{"id": str(i), "prediction_text": row["predicted_answer"].strip().lower()} for i, row in val_df.iterrows()]
formatted_references = [{"id": str(i), "answers": {"text": [row["answer"].strip().lower()], "answer_start": [row["context"].find(row["answer"].strip())]}} for i, row in val_df.iterrows()]

## Calculate F1 Score and Exact Match

In [None]:
results = metric.compute(predictions=formatted_predictions, references=formatted_references)
print(f"Exact Match (EM): {results['exact_match']:.2f}%")
print(f"F1 Score: {results['f1']:.2f}%")

Exact Match (EM): 80.17%
F1 Score: 90.57%


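- The results (Exact Match: 80.17%, F1: 90.57%) indicate that the fine-tuned T5-base model performs well on this question-answering task.
- An Exact Match of 80.17% means that 80.17% of the model's predictions are identical to the reference answer after the metric's text normalization.
- F1 measures token overlap between prediction and reference. The F1 score of 90.57%, about 10 points above Exact Match, shows that many predictions that are not exact matches still share most of their tokens with the reference.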
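
To make the two metrics concrete, the sketch below reimplements them for a single prediction/reference pair. It follows the usual SQuAD-style normalization (lowercasing, removing punctuation and English articles, collapsing whitespace), but it is an approximation for illustration rather than the code behind `evaluate.load("squad")`; the function names and example strings are made up.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# An over-long prediction: not an exact match, but high token overlap.
prediction, reference = "foreign officials administering", "foreign officials"
print(exact_match(prediction, reference))         # 0.0
print(round(token_f1(prediction, reference), 2))  # 0.8
print(exact_match("The mayor.", "mayor"))         # 1.0
```

The first example shows why F1 can sit well above Exact Match: a prediction with one extra word scores 0 on Exact Match but 0.8 on F1.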

## Compute Similarity

In [None]:
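# Compute the cosine similarity between each prediction and its own reference answer
model_st = SentenceTransformer('all-MiniLM-L6-v2')
embeddings1 = model_st.encode(val_df['predicted_answer'].tolist(), convert_to_tensor=True)
embeddings2 = model_st.encode(val_df['answer'].tolist(), convert_to_tensor=True)
# Row-wise similarity: taking the max over the full prediction-by-reference matrix
# would instead score each prediction against its closest reference from any row
similarity_scores = torch.nn.functional.cosine_similarity(embeddings1, embeddings2, dim=1)
val_df['similarity'] = similarity_scores.cpu().numpy()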
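
A per-row similarity should compare each prediction only with the reference at the same index. Taking the maximum over the full prediction-by-reference matrix can instead pick up a match with some other row's reference. The toy vectors below are made up and stand in for sentence embeddings; they illustrate how far the two quantities can diverge.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Each prediction is orthogonal to its own reference
# but identical to the other row's reference.
preds = [[1.0, 0.0], [0.0, 1.0]]
refs = [[0.0, 1.0], [1.0, 0.0]]

row_wise = [cosine(p, r) for p, r in zip(preds, refs)]
max_over_all = [max(cosine(p, r) for r in refs) for p in preds]

print(row_wise)      # [0.0, 0.0]
print(max_over_all)  # [1.0, 1.0]
```

The row-wise values are the ones that describe how close each predicted answer is to its own ground truth.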

In [None]:
val_df.head()

Unnamed: 0,context,question,answer,predicted_answer,similarity
0,Hans Bielenstein writes that as far back as th...,Who believed that they were the true Han Weste...,foreign officials,"foreign officials administering the various ""D...",0.595965
1,"In 1838, there was a flurry of entrepreneurial...",For what reason was asphalt used in the floori...,damp proofing,damp proofing,1.0
2,The first sulfonamide and first commercially a...,What company developed Prontosil?,IG Farben,IG Farben,1.0
3,The 1910 election saw 42 Labour MPs elected to...,How many MP were elected in the 1910 election?,42,42,1.0
4,"Ye Zhiping, the principal of Sangzao Middle Sc...",How many students attended the school?,2323,2323,1.0


## Predict answers for test dataset

In [None]:
test_data = test_df.copy()

In [None]:
# Predict answers for the test set and save the results
test_data["answer"] = test_data.apply(lambda row: answer_question(row["question"], row["context"]), axis=1)
test_data.to_parquet("test_with_predicted_labels_t5.parquet", index=False)