Rameesha - MSDS - 24F-8014

**Task 3: Encoder-Decoder Model (T5) — Text Summarization**

**Problem:**
Fine-tune T5 for summarizing long news articles into concise summaries.

---


**Dataset:**
https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail

---


**Objective:**
Fine-tune a pre-trained T5 model (such as 't5-small' or 't5-base') on the CNN/DailyMail dataset for abstractive summarization.

---

**Deliverables:**
* Preprocessing script for text and summaries
* Fine-tuning code for T5
* Evaluation using ROUGE metrics
* Example outputs comparing original vs summarized text

In [None]:
!pip install -q transformers datasets rouge_score pandas numpy torch

import os
import pandas as pd
import numpy as np
import torch
from datasets import Dataset
from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    DataCollatorForSeq2Seq
)
from rouge_score import rouge_scorer

os.environ["WANDB_DISABLED"] = "true"

**dataset**

In [None]:
train_data = pd.read_csv('/content/drive/MyDrive/ANLP/project_2/task_3/cnn_dailymail/train.csv')
val_data = pd.read_csv('/content/drive/MyDrive/ANLP/project_2/task_3/cnn_dailymail/validation.csv')
test_data = pd.read_csv('/content/drive/MyDrive/ANLP/project_2/task_3/cnn_dailymail/test.csv')

print(f"train: {len(train_data)} | val: {len(val_data)} | test: {len(test_data)}")

train: 287113 | val: 13368 | test: 11490


**data cleaning**

In [None]:
def clean_df(df):
    df = df.dropna(subset=['article', 'highlights'])
    df = df[df['article'].str.len() >= 100]
    df = df[df['highlights'].str.len() >= 10]
    return df.reset_index(drop=True)

train_data = clean_df(train_data)
val_data = clean_df(val_data)
test_data = clean_df(test_data)

TRAIN_SIZE, VAL_SIZE = 10000, 1000
train_data = train_data.head(TRAIN_SIZE)
val_data = val_data.head(VAL_SIZE)

print(f"after cleaning → train: {len(train_data)}, val: {len(val_data)}")

after cleaning → train: 10000, val: 1000


**load t5 model**

In [None]:

MODEL_NAME = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

print("model loading done")

model loading done


**tokenization**

In [None]:

MAX_INPUT_LEN = 512
MAX_TARGET_LEN = 128

def tokenize_data(batch):
    inputs = ["summarize: " + text for text in batch['article']]
    model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LEN, truncation=True)
    targets = tokenizer(
        text_target=batch['highlights'],
        max_length=MAX_TARGET_LEN,
        truncation=True
    )
    model_inputs["labels"] = targets["input_ids"]
    return model_inputs

hf_train = Dataset.from_pandas(train_data[['article', 'highlights']])
hf_val = Dataset.from_pandas(val_data[['article', 'highlights']])
hf_test = Dataset.from_pandas(test_data[['article', 'highlights']])

tokenized_train = hf_train.map(tokenize_data, batched=True, remove_columns=['article', 'highlights'])
tokenized_val = hf_val.map(tokenize_data, batched=True, remove_columns=['article', 'highlights'])
tokenized_test = hf_test.map(tokenize_data, batched=True, remove_columns=['article', 'highlights'])

print("tokenization done")

tokenization done


In [None]:
collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model, padding=True)
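
The collator pads each batch to its longest sequence at batch time, which is why the tokenization step above does not pad. A minimal pure-Python sketch of that idea follows. It is an illustration only, not the library's implementation: `pad_batch` and the toy token ids are hypothetical, and the real collator also prepares `decoder_input_ids` when a model is passed. The two constants reflect T5's pad token id (0) and the collator's default label padding value (-100), which the loss ignores.

```python
# Illustrative sketch of dynamic padding; pad_batch is a hypothetical helper,
# not part of transformers.
PAD_ID = 0            # T5's pad token id
LABEL_PAD_ID = -100   # label positions with this value are ignored by the loss

def pad_batch(features):
    """Pad inputs and labels to the longest sequence in this batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = len(f["input_ids"])
        n_lab = len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [PAD_ID] * (max_in - n_in))
        batch["attention_mask"].append([1] * n_in + [0] * (max_in - n_in))
        batch["labels"].append(f["labels"] + [LABEL_PAD_ID] * (max_lab - n_lab))
    return batch

# Toy token ids, made up for the example
toy = [
    {"input_ids": [21, 22, 23, 1], "labels": [31, 1]},
    {"input_ids": [24, 1], "labels": [32, 33, 34, 1]},
]
padded = pad_batch(toy)
print(padded["input_ids"])
print(padded["labels"])
```

Padding labels with -100 rather than the pad token is what keeps padded positions out of the training loss, and it is also why `compute_metrics` has to map -100 back to a real token id before decoding.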

**metric**

In [None]:
rouge = rouge_scorer.RougeScorer(['rouge1','rouge2','rougeL'], use_stemmer=True)

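def compute_metrics(eval_pred):
    preds, labels = eval_pred
    # -100 marks padded positions and is not a valid token id,
    # so swap it for the pad token before decoding
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)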

    r1, r2, rL = [], [], []
    for p, l in zip(decoded_preds, decoded_labels):
        score = rouge.score(l, p)
        r1.append(score['rouge1'].fmeasure)
        r2.append(score['rouge2'].fmeasure)
        rL.append(score['rougeL'].fmeasure)

    return {
        'rouge1': np.mean(r1) * 100,
        'rouge2': np.mean(r2) * 100,
        'rougeL': np.mean(rL) * 100
    }
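
To make the reported numbers easier to read, here is a rough sketch of what a ROUGE-1 F-measure computes: unigram overlap between reference and prediction, turned into precision, recall, and their harmonic mean. This is a simplified stand-in, assuming lowercased whitespace tokenization and no stemming, so it will not reproduce `rouge_score` values exactly. `rouge1_f` and the two sentences are made up for illustration.

```python
from collections import Counter

def rouge1_f(reference, prediction):
    """Simplified ROUGE-1 F-measure (whitespace tokens, no stemming)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

ref = "the cat sat on the mat"
pred = "the cat lay on a mat"
score = rouge1_f(ref, pred)
print(f"{score * 100:.2f}")  # 4 shared tokens out of 6 on each side
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.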


**training**

In [13]:

args = Seq2SeqTrainingArguments(
    output_dir="./t5_summarization_model",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),
    logging_steps=50,
    report_to="none",
    metric_for_best_model="rougeL"
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics
)

trainer.train()
print("training done")


| Epoch | Training Loss | Validation Loss | Rouge1 | Rouge2 | RougeL |
|---|---|---|---|---|---|
| 1 | 1.5117 | 1.572908 | 25.691808 | 12.386694 | 20.933471 |
| 2 | 1.5492 | 1.562252 | 25.547476 | 12.36827 | 20.951884 |
| 3 | 1.5653 | 1.564956 | 25.26018 | 11.969481 | 20.710637 |


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].


training done
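
With `load_best_model_at_end=True` and `metric_for_best_model="rougeL"`, the trainer reloads the checkpoint with the best value of that metric once training ends. The sketch below mimics that selection on the per-epoch numbers logged above. It assumes higher is better for ROUGE, and `epoch_logs` is a hand-copied stand-in for the trainer's internal log history, not an actual trainer attribute.

```python
# Per-epoch validation results copied from the training log above
epoch_logs = [
    {"epoch": 1, "eval_loss": 1.572908, "rougeL": 20.933471},
    {"epoch": 2, "eval_loss": 1.562252, "rougeL": 20.951884},
    {"epoch": 3, "eval_loss": 1.564956, "rougeL": 20.710637},
]

# Assumption: a higher metric value is better, as is the case for ROUGE
best = max(epoch_logs, key=lambda row: row["rougeL"])
print(f"best epoch: {best['epoch']} (rougeL = {best['rougeL']:.2f})")
```

In this run the epoch-2 checkpoint is selected, which is consistent with the scores printed by the evaluation cell that follows (ROUGE-L 20.95). The gap between epochs is small, so the choice is close to a tie.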


**model evaluation**

In [14]:
results = trainer.evaluate()
print(f"rouge-1: {results['eval_rouge1']:.2f}")
print(f"rouge-2: {results['eval_rouge2']:.2f}")
print(f"rouge-L: {results['eval_rougeL']:.2f}")

save_path = "/content/drive/MyDrive/ANLP/project_2/task_3/t5_finetuned_summarizer"
trainer.save_model(save_path)
tokenizer.save_pretrained(save_path)
print(f"model saved to {save_path}")

rouge-1: 25.55
rouge-2: 12.37
rouge-L: 20.95
model saved to /content/drive/MyDrive/ANLP/project_2/task_3/t5_finetuned_summarizer 


**Test**

In [17]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def generate_summary(text):
    input_text = "summarize: " + text
    inputs = tokenizer(input_text, return_tensors="pt", max_length=MAX_INPUT_LEN, truncation=True).to(device)
    with torch.no_grad():
        outputs = model.generate(
            inputs["input_ids"],
            max_length=MAX_TARGET_LEN,
            num_beams=4,
            early_stopping=True
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

for i in range(3):
    article = test_data.iloc[i]['article']
    ref = test_data.iloc[i]['highlights']
    gen = generate_summary(article)

    print(f"\ntest {i+1}")
    print(f"Article Preview: {article[:200]}...")
    print(f"\nReference Summary:\n{ref}\n")
    print(f"Generated Summary:\n{gen}")



test 1
Article Preview: Ever noticed how plane seats appear to be getting smaller and smaller? With increasing numbers of people taking to the skies, some experts are questioning if having such packed out planes is putting p...

Reference Summary:
Experts question if  packed out planes are putting passengers at risk .
U.S consumer advisory group says minimum space must be stipulated .
Safety tests conducted on planes with more leg room than airlines offer .

Generated Summary:
Tests conducted by the FAA use planes with a 31 inch pitch between each row of seats, a standard which on some airlines has decreased . Many economy seats on United Airlines have 30 inches of space, while some airlines offer as little as 28 inches .

test 2
Article Preview: A drunk teenage boy had to be rescued by security after jumping into a lions' enclosure at a zoo in western India. Rahul Kumar, 17, clambered over the enclosure fence at the Kamla Nehru Zoological Par...

Reference Summary:
Drunk teenage boy 