# 📚 NoteBook 10 FLAN-T5 Evaluation

# 🚀 PROJECT PLAN

MKEM Implementation – Transformer-Based Abstractive Text Summarization

# 🎯 Problem Statement Recap

# 🔍 Objective:
    
Build and compare transformer-based summarization models (T5, BART, Pegasus, BARTScore, ProphetNet, BigBird, LED, mTS, FLAN-T5, GPT-3.5 Turbo), then enhance them using MKEM (Multi-Knowledge-Enhanced Model) on curated English news datasets.

# 📌 Phase-1 Objective

✅ Implement the following summarization models and supporting notebooks:

PEGASUS (Google)---NoteBook(2)

BART (Facebook)---NoteBook(3)

T5 (Google)---NoteBook(1)

Final Comparison + MKEM---NoteBook(4)

NewsSum(Indian Newspaper)---NoteBook(5)

BARTScore---NoteBook(6)

ProphetNet---NoteBook(7)

BigBird-Pegasus---NoteBook(8)

LED(Longformer)---NoteBook(9)

allenai/PRIMERA ---NoteBook(10)

FLAN-T5---NoteBook(11)

GPT-3.5 Turbo---NoteBook(12)

# ✅ Evaluate on benchmark datasets:
    
CNN/DailyMail

NewsSum (Indian Newspaper)

# ✅ Evaluation Metrics:

ROUGE-1

ROUGE-2

ROUGE-L

BERTScore
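To make the ROUGE metrics listed above concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L F1 for a single prediction/reference pair. It is a simplification under stated assumptions: tokens are lowercased and split on whitespace, with no stemming. The function names are illustrative and not part of this notebook. The `evaluate` "rouge" metric used later applies its own tokenization and aggregation, so its numbers will generally differ from this sketch.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, pred_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(prediction, reference, n=1):
    """ROUGE-N F1: clipped n-gram overlap between prediction and reference."""
    pred = _ngrams(prediction.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # Counter intersection clips counts
    return _f1(overlap, sum(pred.values()), sum(ref.values()))


def _lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return _f1(_lcs_length(pred, ref), len(pred), len(ref))
```

ROUGE-1 and ROUGE-2 reward shared words and word pairs, while ROUGE-L rewards words that appear in the same order even when they are not adjacent.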

# 📊 Final Output (Per Model × Dataset):
    
You must submit structured results:

Dataset name

Model used

ROUGE-1, ROUGE-2, ROUGE-L, BERTScore

Inference Time

GPU used

Short analysis/observations
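One way to keep these fields consistent across notebooks is a small record type. The sketch below is an assumption, not part of the original project code: `SummarizationResult` and its field names are hypothetical. The column names in `to_row` mirror the score tables saved later in this notebook, and the observations column is an added assumption because the saved CSVs do not include it.

```python
from dataclasses import dataclass


@dataclass
class SummarizationResult:
    """One row of structured results for a (model, dataset) pair."""
    dataset: str
    model: str
    rouge1: float
    rouge2: float
    rougeL: float
    bertscore_f1: float
    inference_time_s: float
    gpu_used: str
    observations: str = ""  # short free-text analysis (hypothetical extra column)

    def to_row(self):
        """Return a dict keyed by the column names used in the score CSVs."""
        return {
            "Dataset": self.dataset,
            "Model": self.model,
            "ROUGE-1": self.rouge1,
            "ROUGE-2": self.rouge2,
            "ROUGE-L": self.rougeL,
            "BERTScore": self.bertscore_f1,
            "Inference Time (s)": self.inference_time_s,
            "GPU Used": self.gpu_used,
            "Observations": self.observations,
        }
```

A list of such rows can be passed directly to `pd.DataFrame(...)` to build a comparison table.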

# 1. 🚀 FLAN-T5 on CNN Dataset

**✏️ Step 1: Import Libraries**

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import pandas as pd
import time
import evaluate
import bert_score

# Set device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Using device:", device)



Using device: cpu


**✏️ Step 2: Load Model & Tokenizer**

In [2]:
model_name = "google/flan-t5-base"  # Faster than large

tokenizer_flan = AutoTokenizer.from_pretrained(model_name)
model_flan = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

print(f"✅ Loaded {model_name}")

✅ Loaded google/flan-t5-base


**✏️ Step 3: Load CNN Dataset**

In [3]:
import pandas as pd

# Load cleaned CNN dataset (from Notebook 1)
df_cnn = pd.read_csv("cnn_dailymail.csv")

# Drop rows with missing or empty articles/highlights
df_cnn = df_cnn.dropna(subset=["article", "highlights"])
df_cnn = df_cnn[df_cnn["article"].str.strip().astype(bool)]
df_cnn = df_cnn[df_cnn["highlights"].str.strip().astype(bool)]

print(f"✅ CNN Dataset Loaded. Shape: {df_cnn.shape}")
df_cnn.head()

✅ CNN Dataset Loaded. Shape: (5, 3)


| | article | highlights | id |
|---|---|---|---|
| 0 | LONDON, England (Reuters) -- Harry Potter star... | Harry Potter star Daniel Radcliffe gets £20M f... | 42c027e4ff9730fbb3de84c1af0d2c506e41c3e4 |
| 1 | Editor's note: In our Behind the Scenes series... | Mentally ill inmates in Miami are housed on th... | ee8871b15c50d0db17b0179a6d2beab35065f1e9 |
| 2 | MINNEAPOLIS, Minnesota (CNN) -- Drivers who we... | NEW: "I thought I was going to die," driver sa... | 06352019a19ae31e527f37f7571c6dd7f0c5da37 |
| 3 | WASHINGTON (CNN) -- Doctors removed five small... | Five small polyps found during procedure; "non... | 24521a2abb2e1f5e34e6824e0f9e56904a2b0e88 |
| 4 | (CNN) -- The National Football League has ind... | NEW: NFL chief, Atlanta Falcons owner critical... | 7fe70cc8b12fab2d0a258fababf7d9c6b5e1262a |


**✏️ Step 4: Define Summarization Function**

In [4]:
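def summarize_with_flan(text):
    """Summarize one article with the globally loaded FLAN-T5 model and tokenizer."""
    # Truncate to the first 512 tokens; the model never sees anything beyond that
    inputs = tokenizer_flan(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    summary_ids = model_flan.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # pass the mask explicitly alongside the ids
        max_length=150,
        min_length=40,
        length_penalty=2.0,
        num_beams=4
    )
    return tokenizer_flan.decode(summary_ids[0], skip_special_tokens=True)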
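FLAN-T5 is instruction-tuned, so inputs are often phrased as an explicit task, whereas the function above passes the raw article text. The sketch below shows one way to prepend an instruction before tokenization. `build_prompt` and its default instruction string are assumptions for illustration and not part of this notebook. Whether a prompt changes the scores reported here has not been tested.

```python
def build_prompt(article, instruction="summarize: "):
    """Prepend a task instruction to an article (hypothetical helper).

    Whitespace is collapsed so stray newlines in the CSV text do not
    end up in the prompt.
    """
    cleaned = " ".join(article.split())
    return f"{instruction}{cleaned}"
```

A call such as `summarize_with_flan(build_prompt(article))` would then feed the prompted text through the same generation settings.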

**✏️ Step 5: Generate Predictions**

In [5]:
import time

start_time = time.time()
flan_cnn_preds = [summarize_with_flan(article) for article in df_cnn["article"]]
flan_cnn_refs = df_cnn["highlights"].tolist()
inference_time = round(time.time() - start_time, 2)

print(f"⏱ Inference Time: {inference_time} seconds")

⏱ Inference Time: 139.93 seconds
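The timing pattern above is repeated for each dataset, so it can be wrapped in a small helper that also reports the average time per article. This is a sketch only: `timed_map` is a hypothetical name, and it uses `time.perf_counter` rather than `time.time`, which is a substitution on my part because it is the clock usually preferred for measuring durations.

```python
import time


def timed_map(fn, items):
    """Apply fn to each item and report total and per-item wall-clock time.

    Returns (results, total_seconds, seconds_per_item), with both times
    rounded to two decimals.
    """
    items = list(items)
    start = time.perf_counter()
    results = [fn(item) for item in items]
    elapsed = time.perf_counter() - start
    per_item = elapsed / len(items) if items else 0.0
    return results, round(elapsed, 2), round(per_item, 2)
```

For example, `preds, total_s, per_article_s = timed_map(summarize_with_flan, df_cnn["article"])` would replace the manual start/stop bookkeeping.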


**✏️ Step 6: Evaluate with ROUGE & BERTScore**

In [6]:
import gc
import torch

# After generating flan_cnn_preds and flan_cnn_refs
del model_flan
del tokenizer_flan
gc.collect()
torch.cuda.empty_cache()


In [8]:
import torch
import pandas as pd
import evaluate

# Load metrics
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# ROUGE
rouge_results = rouge.compute(predictions=flan_cnn_preds, references=flan_cnn_refs)

# BERTScore
bert_results = bertscore.compute(predictions=flan_cnn_preds, references=flan_cnn_refs, lang="en")
bert_f1 = round(sum(bert_results["f1"]) / len(bert_results["f1"]), 4)

# Create result dictionary
flan_cnn_scores = {
    "Dataset": ["CNN"],
    "Model": ["FLAN-T5"],
    "ROUGE-1": [rouge_results["rouge1"]],
    "ROUGE-2": [rouge_results["rouge2"]],
    "ROUGE-L": [rouge_results["rougeL"]],
    "BERTScore": [bert_f1],
    "Inference Time (s)": [inference_time],
    "GPU Used": [torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"]
}

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
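The cell above keeps only the mean F1, but the `bertscore` metric returns per-example `precision`, `recall`, and `f1` lists. The sketch below averages all three, which can help show whether a low F1 comes from missing content (recall) or extra content (precision). `mean_bertscore` is a hypothetical helper name, not part of this notebook.

```python
from statistics import mean


def mean_bertscore(bert_results, digits=4):
    """Average the per-example precision, recall, and F1 lists.

    bert_results is expected to be a dict with "precision", "recall",
    and "f1" keys, each mapping to a list of floats. Other keys are ignored.
    """
    return {
        key: round(mean(bert_results[key]), digits)
        for key in ("precision", "recall", "f1")
    }
```

Calling `mean_bertscore(bert_results)["f1"]` gives the same value as the `bert_f1` computed in the cell above.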


**💾 Save the Scores to .CSV Files**

**The saved files let us compare models across the different notebooks**

In [9]:
flan_cnn_scores_df = pd.DataFrame(flan_cnn_scores)

# Save CSV
flan_cnn_scores_df.to_csv("flan_cnn_scores.csv", index=False)
print("✅ FLAN-T5 CNN scores saved to flan_cnn_scores.csv")

flan_cnn_scores_df

# Display result in table format
#from tabulate import tabulate
#print(tabulate(flan_cnn_scores_df, headers='keys', tablefmt='psql', showindex=False))

✅ FLAN-T5 CNN scores saved to flan_cnn_scores.csv


| | Dataset | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | Inference Time (s) | GPU Used |
|---|---|---|---|---|---|---|---|---|
| 0 | CNN | FLAN-T5 | 0.220208 | 0.046759 | 0.142072 | 0.8456 | 139.93 | CPU |


# 2. 🚀 FLAN-T5 on NewsSum Dataset

**✏️ Step 1: Load NewsSum Dataset**

In [10]:
# Load NewsSum dataset
df_newsum = pd.read_csv("newsum_cleaned.csv")

# Prepare reference summaries
flan_newsum_refs = df_newsum["Summary"].tolist()

print(f"✅ NewsSum dataset loaded. Shape: {df_newsum.shape}")

✅ NewsSum dataset loaded. Shape: (1003, 4)


**✏️ Step 2: Generate Summaries with FLAN-T5**

In [11]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "google/flan-t5-base"  # or flan-t5-large if you want bigger model
tokenizer_flan = AutoTokenizer.from_pretrained(model_name)
model_flan = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

print(f"✅ Loaded {model_name}")

✅ Loaded google/flan-t5-base


In [12]:
import time

# ✅ Limit to 5 rows for quick test
df_newsum_small = df_newsum.iloc[:5].copy()
flan_newsum_refs = df_newsum_small["Summary"].tolist()

# Run summarization
start_time = time.time()
flan_newsum_preds = [summarize_with_flan(article) for article in df_newsum_small["Article"]]
inference_time = round(time.time() - start_time, 2)

print(f"⏱ Inference Time (NewsSum - FLAN-T5, 5 rows): {inference_time} seconds")

⏱ Inference Time (NewsSum - FLAN-T5, 5 rows): 213.29 seconds
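Only 5 of the 1003 NewsSum rows were summarized here. The worked arithmetic below gives a rough estimate of what a full run would cost on the same CPU. It assumes time scales linearly with the number of rows, which is only an approximation because article lengths vary. `estimate_full_run` is a hypothetical helper.

```python
def estimate_full_run(sample_seconds, sample_rows, total_rows):
    """Extrapolate total inference time from a timed sample.

    Assumes every row costs about the same as the average sample row.
    Returns (seconds_per_row, total_seconds, total_hours).
    """
    per_row = sample_seconds / sample_rows
    total_seconds = per_row * total_rows
    return per_row, total_seconds, total_seconds / 3600


# Values taken from this notebook: 213.29 s for 5 rows, 1003 rows in total
per_row, total_s, total_h = estimate_full_run(213.29, 5, 1003)
print(f"{per_row:.2f} s/row, about {total_h:.1f} hours for all rows")
# → 42.66 s/row, about 11.9 hours for all rows
```

At roughly 43 seconds per article, evaluating the full dataset on CPU would take most of a day, which is why the quick test is limited to 5 rows.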


**✏️ Step 3: Evaluate with ROUGE and BERTScore**

In [13]:
# ROUGE
rouge_results = rouge.compute(predictions=flan_newsum_preds, references=flan_newsum_refs)

# BERTScore
bert_results = bertscore.compute(predictions=flan_newsum_preds, references=flan_newsum_refs, lang="en")
bert_f1 = round(sum(bert_results["f1"]) / len(bert_results["f1"]), 4)



**💾 Step 4: Save Evaluation Scores to CSV**

In [14]:
flan_newsum_scores = {
    "Dataset": ["NewsSum"],
    "Model": ["FLAN-T5"],
    "ROUGE-1": [rouge_results["rouge1"]],
    "ROUGE-2": [rouge_results["rouge2"]],
    "ROUGE-L": [rouge_results["rougeL"]],
    "BERTScore": [bert_f1],
    "Inference Time (s)": [inference_time],
    "GPU Used": [torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"]
}

flan_newsum_scores_df = pd.DataFrame(flan_newsum_scores)

# Save file
flan_newsum_scores_df.to_csv("flan_newsum_scores.csv", index=False)
print("✅ FLAN-T5 NewsSum scores saved to flan_newsum_scores.csv")
flan_newsum_scores_df

✅ FLAN-T5 NewsSum scores saved to flan_newsum_scores.csv


| | Dataset | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | Inference Time (s) | GPU Used |
|---|---|---|---|---|---|---|---|---|
| 0 | NewsSum | FLAN-T5 | 0.38682 | 0.277672 | 0.318049 | 0.8765 | 213.29 | CPU |
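Since the per-dataset scores are saved as separate CSV files for later comparison, a small merge step can combine them into one table. The sketch below uses only the standard library `csv` module so it runs without the notebook's dependencies. `merge_score_files` and the output filename in the usage note are assumptions, not part of the original notebooks. In a notebook, `pd.concat` over the loaded DataFrames would be the natural equivalent.

```python
import csv


def merge_score_files(paths, out_path):
    """Concatenate score CSVs into one file and return the merged rows.

    Columns are the union of all input headers, in first-seen order.
    Values missing from a file are written as empty strings.
    All values are kept as strings, exactly as read from the inputs.
    """
    rows = []
    fieldnames = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)  # read all rows while the file is still open
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

For this notebook the call would look like `merge_score_files(["flan_cnn_scores.csv", "flan_newsum_scores.csv"], "flan_all_scores.csv")`, where the output name is illustrative.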
