
## Assignment 2 - Khashayar Vani - Gen AI

After fine‑tuning the models (training steps are commented out by default), we evaluate them on the validation subset. Different metrics are appropriate for each architecture:

* **GPT‑2** (decoder‑only): We generate summaries using greedy decoding and compute ROUGE metrics (ROUGE‑1, ROUGE‑2, ROUGE‑L). We also compute perplexity using the loss returned by the trainer.

* **BERT** (encoder‑only): BERT is not designed to generate full sequences; instead we use it for downstream tasks such as text classification. For a classification scenario, the evaluation metrics are typically confusion matrix and F1-score.

* **T5** (encoder‑decoder): We generate summaries using greedy decoding and compute ROUGE metrics.  Perplexity is computed similarly to GPT‑2 by exponentiating the validation loss.

The code below demonstrates evaluation routines for each model. Running these functions requires trained models; if you skipped training above, the evaluation will use the pre‑trained weights and therefore will not yield good summarization quality.


## Environment

In [1]:
!pip -q install transformers datasets evaluate accelerate sentencepiece sacremoses



[notice] A new release of pip is available: 24.3.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


## Config

In [2]:
from datasets import load_dataset
import evaluate
import numpy as np
import random
import torch
from transformers import (
    AutoTokenizer, AutoModelForQuestionAnswering, DataCollatorWithPadding,
    AutoModelForCausalLM, AutoModelForSeq2SeqLM,
    Trainer, TrainingArguments, default_data_collator
)

SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)

# ==== YOU CAN CHANGE THESE VALUES ====
# pick small models so training finishes quickly on Colab
MODEL_BERT = "bert-base-uncased"
MODEL_GPT2 = "distilgpt2"
MODEL_T5  = "t5-small"

# use a small subset first to iterate; increase later for your “final” run
MAX_TRAIN = 2000      # set to None for full
MAX_VAL   = 500

BATCH_SIZE = 8
EPOCHS = 2
LR = 3e-5
MAX_QUESTION_LEN = 64
MAX_CONTEXT_LEN  = 384
MAX_INPUT_LEN_T5 = 512
MAX_GEN_LEN = 64

device = "cuda" if torch.cuda.is_available() else "cpu"
device


  from .autonotebook import tqdm as notebook_tqdm


'cpu'

## Dataset: SQuAD v1.1
The reason I chose this Dataset is that aligns with 3 architectures, gives clear Em/ F1 evaluation, and showcases generative VS extractive behavior.

In [3]:
raw = load_dataset("squad")

def take_subset(ds, n):
    return ds.select(range(min(n, len(ds)))) if n else ds

train_raw = take_subset(raw["train"], MAX_TRAIN)
val_raw   = take_subset(raw["validation"], MAX_VAL)

train_raw, val_raw


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Generating train split: 100%|██████████| 87599/87599 [00:01<00:00, 44811.36 examples/s]
Generating validation split: 100%|██████████| 10570/10570 [00:00<00:00, 61324.90 examples/s]


(Dataset({
     features: ['id', 'title', 'context', 'question', 'answers'],
     num_rows: 2000
 }),
 Dataset({
     features: ['id', 'title', 'context', 'question', 'answers'],
     num_rows: 500
 }))

## Metrics (Exact Match & F1)

I am re-using the common SQuAD metric. (For GPT-2 / T5 and I will compare their generated string to gold answers directly; for BERT the Trainer will output start/end logits >> Then post-process to text then compute metrics.)

In [4]:
squad_metric = evaluate.load("squad")

def normalize_text(s):
    import re, string
    def remove_articles(text): return re.sub(r"\b(a|an|the)\b", " ", text)
    def white_space_fix(text): return " ".join(text.split())
    def remove_punc(text): return "".join(ch for ch in text if ch not in set(string.punctuation))
    def lower(text): return text.lower()
    return white_space_fix(remove_articles(remove_punc(lower(s))))

def f1_em(preds, refs):
    pred_dict = {str(i): p for i, p in enumerate(preds)}
    ref_dict  = {str(i): {"answers":{"text":[r],"answer_start":[0]}, "id":str(i)} for i, r in enumerate(refs)}
    # Using evaluate's squad metric requires structured inputs, so we convert:
    references = [{"id": k, "answers": {"text":[v["answers"]["text"][0]], "answer_start":[0]}} for k,v in ref_dict.items()]
    predictions = [{"id": k, "prediction_text": v} for k,v in pred_dict.items()]
    return squad_metric.compute(predictions=predictions, references=references)


Downloading builder script: 4.53kB [00:00, 6.91MB/s]
Downloading extra modules: 3.32kB [00:00, 3.29MB/s]


## BERT (Encoder-Only) — Extractive QA

Tokenization

In [5]:
bert_tok = AutoTokenizer.from_pretrained(MODEL_BERT, use_fast=True)

def preprocess_bert(batch):
    return bert_tok(
        batch["question"], batch["context"],
        truncation="only_second", max_length=MAX_CONTEXT_LEN, stride=128,
        return_overflowing_tokens=True, return_offsets_mapping=True, padding="max_length"
    )

tokenized_train_bert = train_raw.map(preprocess_bert, batched=True, remove_columns=train_raw.column_names)
tokenized_val_bert   = val_raw.map(preprocess_bert, batched=True, remove_columns=val_raw.column_names)


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Map: 100%|██████████| 2000/2000 [00:11<00:00, 170.83 examples/s]
Map: 100%|██████████| 500/500 [00:01<00:00, 345.94 examples/s]


Align start/end positions

In [None]:
def add_labels_bert(examples, raw_examples):
    # tie back overflows to original QA spans
    offset_mapping = examples["offset_mapping"]
    sample_map     = examples["overflow_to_sample_mapping"]
    start_positions, end_positions = [], []
    for i, offsets in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answers    = raw_examples["answers"][sample_idx]
        # default to CLS if no answer
        start_char = answers["answer_start"][0]
        end_char   = start_char + len(answers["text"][0])
        sequence_ids = examples.sequence_ids(i)

        # find context
        idx = 0
        while idx < len(sequence_ids) and sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while idx < len(sequence_ids) and sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # if answer outside context
        if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
            continue

        # otherwise set token start/end
        start_idx = context_start
        while start_idx <= context_end and offsets[start_idx][0] <= start_char:
            start_idx += 1
        start_positions.append(start_idx - 1)

        end_idx = context_end
        while end_idx >= context_start and offsets[end_idx][1] >= end_char:
            end_idx -= 1
        end_positions.append(end_idx + 1)

    examples["start_positions"] = start_positions
    examples["end_positions"] = end_positions
    examples.pop("offset_mapping")
    return examples

##(It gave me Error this line, Itried to fix it but it did not work) tokenized_train_bert = tokenized_train_bert.map(lambda e, i=0: add_labels_bert(e, train_raw), batched=False) 
##(It gave me Error this line, Itried to fix it but it did not work) tokenized_val_bert   = tokenized_val_bert.map(lambda e, i=0: add_labels_bert(e, val_raw),   batched=False)


Train

In [9]:
bert_model = AutoModelForQuestionAnswering.from_pretrained(MODEL_BERT).to(device)

##(Error) args_bert = TrainingArguments(
    output_dir="./bert-qa",
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LR,
    num_train_epochs=EPOCHS,
    evaluation_strategy="epoch",
    save_strategy="no",
    logging_steps=50,
    report_to="none"
)

data_collator_bert = DataCollatorWithPadding(bert_tok)

trainer_bert = Trainer(
    model=bert_model,
    args=args_bert,
    train_dataset=tokenized_train_bert,
    eval_dataset=tokenized_val_bert,
    tokenizer=bert_tok,
    data_collator=data_collator_bert
)

trainer_bert.train()


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


TypeError: TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'

Post-process predictions >> EM/F1

In [10]:
def predict_bert(samples, tokenized, model, tok, max_answer_len=30):
    model.eval()
    preds = []
    for i in range(len(tokenized)):
        inputs = {k: torch.tensor(tokenized[i][k]).unsqueeze(0).to(device) for k in ["input_ids","attention_mask"]}
        with torch.no_grad():
            out = model(**inputs)
        start_logits = out.start_logits[0].cpu().numpy()
        end_logits   = out.end_logits[0].cpu().numpy()

        start_idx = int(start_logits.argmax())
        end_idx   = int(end_logits.argmax())
        if end_idx < start_idx: end_idx = start_idx
        end_idx = min(end_idx, start_idx + max_answer_len)

        ans_ids = inputs["input_ids"][0][start_idx:end_idx+1].cpu().numpy()
        text = tok.decode(ans_ids, skip_special_tokens=True).strip()
        preds.append(text if text else "")
    return preds

bert_val_preds = predict_bert(val_raw, tokenized_val_bert, bert_model, bert_tok)
gold = [ans["text"][0] for ans in val_raw["answers"]]
bert_scores = f1_em(bert_val_preds, gold)
bert_scores


KeyboardInterrupt: 

## T5 (Encoder-Decoder) > Generative QA

Prepare "text-to-text" pairs

T5 likes explicit task prefixes:

In [11]:
t5_tok = AutoTokenizer.from_pretrained(MODEL_T5)

def preprocess_t5(batch):
    inputs = [f"question: {q}  context: {c}" for q,c in zip(batch["question"], batch["context"])]
    model_inputs = t5_tok(inputs, max_length=MAX_INPUT_LEN_T5, truncation=True)
    with t5_tok.as_target_tokenizer():
        labels = t5_tok([a["text"][0] for a in batch["answers"]], max_length=MAX_GEN_LEN, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_t5 = train_raw.map(preprocess_t5, batched=True, remove_columns=train_raw.column_names)
val_t5   = val_raw.map(preprocess_t5,   batched=True, remove_columns=val_raw.column_names)


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Map: 100%|██████████| 2000/2000 [00:07<00:00, 273.24 examples/s]
Map: 100%|██████████| 500/500 [00:01<00:00, 286.41 examples/s]


Train

In [12]:
t5_model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_T5).to(device)
args_t5 = TrainingArguments(
    output_dir="./t5-qa",
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=LR,
    num_train_epochs=EPOCHS,
    evaluation_strategy="epoch",
    save_strategy="no",
    predict_with_generate=True,
    logging_steps=50,
    report_to="none"
)

trainer_t5 = Trainer(
    model=t5_model,
    args=args_t5,
    train_dataset=train_t5,
    eval_dataset=val_t5,
    tokenizer=t5_tok,
    data_collator=default_data_collator
)

trainer_t5.train()


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


TypeError: TrainingArguments.__init__() got an unexpected keyword argument 'evaluation_strategy'

Generate & Score

In [13]:
def generate_t5(batch, model, tok):
    inputs = [f"question: {q}  context: {c}" for q,c in zip(batch["question"], batch["context"])]
    enc = tok(inputs, return_tensors="pt", padding=True, truncation=True, max_length=MAX_INPUT_LEN_T5).to(device)
    with torch.no_grad():
        gen = model.generate(**enc, max_new_tokens=MAX_GEN_LEN)
    return tok.batch_decode(gen, skip_special_tokens=True)

t5_val_preds = generate_t5(val_raw, t5_model, t5_tok)
t5_scores = f1_em(t5_val_preds, gold)
t5_scores


KeyboardInterrupt: 

## GPT-2 (Decoder-Only) — Generative QA

I convert QA into causal LM fine-tuning: predict the answer after the prompt:

In [None]:
"Question: {q}\nContext: {c}\nAnswer:"


The labels are only the answer tokens (mask the prompt with -100 so loss ignores it).

In [14]:
gpt2_tok = AutoTokenizer.from_pretrained(MODEL_GPT2)
if gpt2_tok.pad_token is None:
    gpt2_tok.pad_token = gpt2_tok.eos_token

def build_gpt2_example(q, c, a):
    prompt = f"Question: {q}\nContext: {c}\nAnswer:"
    inp = gpt2_tok(prompt, truncation=True, max_length=MAX_CONTEXT_LEN, add_special_tokens=False)
    ans = gpt2_tok(" " + a, truncation=True, max_length=MAX_GEN_LEN, add_special_tokens=False)

    input_ids = inp["input_ids"] + ans["input_ids"]
    attention_mask = [1]*len(input_ids)
    labels = [-100]*len(inp["input_ids"]) + ans["input_ids"]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

def preprocess_gpt2(batch):
    outs = [build_gpt2_example(q,c,a["text"][0]) for q,c,a in zip(batch["question"], batch["context"], batch["answers"])]
    maxlen = max(len(x["input_ids"]) for x in outs)
    for x in outs:
        pad = maxlen - len(x["input_ids"])
        x["input_ids"] += [gpt2_tok.pad_token_id]*pad
        x["attention_mask"] += [0]*pad
        x["labels"] += [-100]*pad
    return {k:[x[k] for x in outs] for k in ["input_ids","attention_mask","labels"]}

train_gpt2 = train_raw.map(preprocess_gpt2, batched=True, remove_columns=train_raw.column_names)
val_gpt2   = val_raw.map(preprocess_gpt2,   batched=True, remove_columns=val_raw.column_names)


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Map: 100%|██████████| 2000/2000 [00:18<00:00, 108.59 examples/s]
Map: 100%|██████████| 500/500 [00:02<00:00, 168.78 examples/s]


Train

In [15]:
gpt2_model = AutoModelForCausalLM.from_pretrained(MODEL_GPT2).to(device)

args_gpt2 = TrainingArguments(
    output_dir="./gpt2-qa",
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    learning_rate=5e-5,       # often a bit higher is fine for GPT-2
    num_train_epochs=EPOCHS,
    evaluation_strategy="epoch",
    save_strategy="no",
    logging_steps=50,
    report_to="none"
)

trainer_gpt2 = Trainer(
    model=gpt2_model,
    args=args_gpt2,
    train_dataset=train_gpt2,
    eval_dataset=val_gpt2,
    tokenizer=gpt2_tok,
    data_collator=default_data_collator
)

trainer_gpt2.train()


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


KeyboardInterrupt: 

Generate & Score

In [16]:
def generate_gpt2(batch, model, tok):
    prompts = [f"Question: {q}\nContext: {c}\nAnswer:" for q,c in zip(batch["question"], batch["context"])]
    enc = tok(prompts, return_tensors="pt", padding=True, truncation=True, max_length=MAX_CONTEXT_LEN).to(device)
    with torch.no_grad():
        gen = model.generate(**enc, max_new_tokens=MAX_GEN_LEN, do_sample=False)
    out_texts = tok.batch_decode(gen, skip_special_tokens=True)
    # strip prompt to keep just the answer
    answers = []
    for prompt, full in zip(prompts, out_texts):
        answers.append(full.split("Answer:")[-1].strip())
    return answers

gpt2_val_preds = generate_gpt2(val_raw, gpt2_model, gpt2_tok)
gpt2_scores = f1_em(gpt2_val_preds, gold)
gpt2_scores


NameError: name 'gpt2_model' is not defined

## Qualitative Samples (put 1–2 in the report)

In [17]:
def show_sample(i):
    q = val_raw[i]["question"]
    c = val_raw[i]["context"]
    print("Q:", q)
    print("GT:", val_raw[i]["answers"]["text"][0])

    print("\n[BERT]")
    print(bert_val_preds[i])

    print("\n[T5]")
    print(t5_val_preds[i])

    print("\n[GPT-2]")
    print(gpt2_val_preds[i])

show_sample(0)
show_sample(1)


Q: Which NFL team represented the AFC at Super Bowl 50?
GT: Denver Broncos

[BERT]


NameError: name 'bert_val_preds' is not defined