# 📚 NoteBook 1 T5 Evaluation

# 🚀 PROJECT PLAN
MKEM Implementation – Transformer-Based Abstractive Text Summarization

# 🎯 Problem Statement Recap

# 🔍 Objective:
To build and compare transformer-based summarization models (T5, BART, Pegasus) and then enhance them using MKEM (Multi-Knowledge-Enhanced Model) on curated English news datasets.

# 📌 Phase-1 Objective

# ✅ Implement the following 3 summarization models:
    
1. PEGASUS (Google)---NoteBook(2)

2. BART (Facebook)---NoteBook(3)

3. T5 (Google)---NoteBook(1)

# ✅ Evaluate on 3 benchmark datasets:
    
1. CNN/DailyMail

2. XSum

3. MultiNews

# ✅ Evaluation Metrics:
    
ROUGE-1

ROUGE-2

ROUGE-L

BERTScore

# 📊 Final Output (Per Model × Dataset):
    
You must submit structured results:

Dataset name

Model used

ROUGE-1, ROUGE-2, ROUGE-L, BERTScore

Short analysis/observations

**📦 Step 1: Load Datasets (CNN/DailyMail, XSum, MultiNews)**

In [1]:
!pip install datasets




[notice] A new release of pip is available: 24.3.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip




**🔹 1. Setup**

In [2]:
!pip install datasets pandas




[notice] A new release of pip is available: 24.3.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [3]:
!pip install --upgrade evaluate datasets




[notice] A new release of pip is available: 24.3.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip




**🔹 2. Import Libraries**

In [4]:
from datasets import load_dataset
import pandas as pd




**🔹 3. Load Datasets with Verification Fix**

In [39]:
# ✅ Load datasets properly
from datasets import load_dataset

# CNN/DailyMail (long-form summaries)
cnn_dailymail = load_dataset("cnn_dailymail", "3.0.0")

# XSum (extreme summarization)
xsum = load_dataset("xsum")

# MultiNews (multi-document summarization)
multi_news = load_dataset("multi_news")

Using the latest cached version of the dataset since xsum couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at C:\Users\SAMIM IMTIAZ\.cache\huggingface\datasets\xsum\default\1.2.0\082863bf4754ee058a5b6f6525d0cb2b18eadb62c7b370b095d1364050a52b71 (last modified on Sun Aug  3 14:51:50 2025).
Using the latest cached version of the dataset since multi_news couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at C:\Users\SAMIM IMTIAZ\.cache\huggingface\datasets\multi_news\default\1.0.0\2f1f69a2bedc8ad1c5d8ae5148e4755ee7095f465c1c01ae8f85454342065a72 (last modified on Sun Aug  3 14:55:04 2025).


In [40]:
# ✅ Show original dataset sizes (test set only or all splits as needed)
print("📊 CNN/DailyMail Test Set:", cnn_dailymail['test'].shape)
print("📊 XSum Test Set:", xsum['test'].shape)
print("📊 MultiNews Test Set:", multi_news['test'].shape)

📊 CNN/DailyMail Test Set: (11490, 3)
📊 XSum Test Set: (11334, 3)
📊 MultiNews Test Set: (5622, 2)


**🔹 4. Preview Sample Records**

In [6]:
print("🔹 CNN/DailyMail Example:\n", cnn_dailymail["train"][0])
print("🔹 XSum Example:\n", xsum["train"][0])
print("🔹 MultiNews Example:\n", multi_news["train"][0])

🔹 CNN/DailyMail Example:
 {'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie

**🔹 5. Convert to Pandas DataFrame **

In [7]:
# Convert first 5 records from each dataset into a DataFrame
df_cnn = pd.DataFrame(cnn_dailymail["train"][:5])
df_xsum = pd.DataFrame(xsum["train"][:5])
df_multi = pd.DataFrame(multi_news["train"][:5])

In [8]:
df_cnn.head()

Unnamed: 0,article,highlights,id
0,"LONDON, England (Reuters) -- Harry Potter star...",Harry Potter star Daniel Radcliffe gets £20M f...,42c027e4ff9730fbb3de84c1af0d2c506e41c3e4
1,Editor's note: In our Behind the Scenes series...,Mentally ill inmates in Miami are housed on th...,ee8871b15c50d0db17b0179a6d2beab35065f1e9
2,"MINNEAPOLIS, Minnesota (CNN) -- Drivers who we...","NEW: ""I thought I was going to die,"" driver sa...",06352019a19ae31e527f37f7571c6dd7f0c5da37
3,WASHINGTON (CNN) -- Doctors removed five small...,"Five small polyps found during procedure; ""non...",24521a2abb2e1f5e34e6824e0f9e56904a2b0e88
4,(CNN) -- The National Football League has ind...,"NEW: NFL chief, Atlanta Falcons owner critical...",7fe70cc8b12fab2d0a258fababf7d9c6b5e1262a


In [9]:
df_xsum.head()

Unnamed: 0,document,summary,id
0,"The full cost of damage in Newton Stewart, one...",Clean-up operations are continuing across the ...,35232142
1,A fire alarm went off at the Holiday Inn in Ho...,Two tourist buses have been destroyed by fire ...,40143035
2,Ferrari appeared in a position to challenge un...,Lewis Hamilton stormed to pole position at the...,35951548
3,"John Edward Bates, formerly of Spalding, Linco...",A former Lincolnshire Police officer carried o...,36266422
4,Patients and staff were evacuated from Cerahpa...,An armed man who locked himself into a room at...,38826984


In [10]:
df_multi.head()

Unnamed: 0,document,summary
0,"National Archives \n \n Yes, it’s that time ag...",– The unemployment rate dropped to 8.2% last m...
1,LOS ANGELES (AP) — In her first interview sinc...,"– Shelly Sterling plans ""eventually"" to divorc..."
2,"GAITHERSBURG, Md. (AP) — A small, private jet ...",– A twin-engine Embraer jet that the FAA descr...
3,Tucker Carlson Exposes His Own Sexism on Twitt...,– Tucker Carlson is in deep doodoo with conser...
4,A man accused of removing another man's testic...,– What are the three most horrifying words in ...


# ✅ Implementation of the Models on the DataSets

# 🚀 T5 Implementation Plan

#  1. 🚀 T5 on CNN/DailyMail

**✅ Step 1: Install Required Transformers Library**

In [11]:
!pip install transformers evaluate




[notice] A new release of pip is available: 24.3.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip





In [12]:
!pip install --upgrade transformers




[notice] A new release of pip is available: 24.3.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [13]:
!pip install --upgrade evaluate




[notice] A new release of pip is available: 24.3.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [14]:
!pip install bert-score




[notice] A new release of pip is available: 24.3.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


**✅ Step 2: Import Required Modules**

In [15]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch



**✅ Step 3: Load Pretrained t5-base Model & Tokenizer**

In [16]:
!pip install sentencepiece

#📌 Why Need This
#T5 and PEGASUS both use SentencePiece tokenization.
#Without it, Hugging Face can’t load the tokenizer.




[notice] A new release of pip is available: 24.3.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip





In [17]:
# Load T5 model and tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


**✅ Step 4: Define Function to Generate Summaries**

In [18]:
def generate_summary(text, max_input_length=512, max_output_length=150):
    # Prefix for T5 summarization
    input_text = "summarize: " + text.strip().replace("\n", " ")
    
    inputs = tokenizer(
        input_text,
        max_length=max_input_length,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    ).to(device)
    
    summary_ids = model.generate(
        inputs["input_ids"],
        num_beams=4,
        max_length=max_output_length,
        early_stopping=True
    )
    
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary


**✅ Step 5: Generate/Test Summaries for First 5 Articles**

In [19]:
# Select first 5 articles
sample_articles = df_cnn["article"][:5]
generated_summaries = []

# Generate summaries one by one
for i, article in enumerate(sample_articles):
    print(f"\n📰 Original Article #{i+1} (shortened):\n", article[:500], "...\n")
    
    summary = generate_summary(article)
    generated_summaries.append(summary)
    
    print(f"✍️ Generated Summary #{i+1}:\n", summary)
    print("-" * 100)


📰 Original Article #1 (shortened):
 LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as s ...

✍️ Generated Summary #1:
 young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties . at 18, he will be able to gamble in a casino, buy a drink in a pub or see "Hostel: Part II" details of how he'll mark his landmark birthday are under wraps .
----------------------------------------------------------------------------------------------------

📰 Original Article #2 (shortened):
 Editor's note: In our Behind the 

**📊 Step 6: Evaluate with ROUGE (on 3 summaries)**

In [20]:
from evaluate import load


# Load ROUGE evaluator
rouge = load("rouge")

# Reference  (actual) summaries from CNN dataset
reference_summaries = df_cnn["highlights"][:5]

# Calculate ROUGE scores
scores = rouge.compute(predictions=generated_summaries, references=reference_summaries)
print("📊 ROUGE Evaluation on 5 samples:\n", scores)


📊 ROUGE Evaluation on 5 samples:
 {'rouge1': 0.3136340920448867, 'rouge2': 0.11841269841269841, 'rougeL': 0.23506277164448078, 'rougeLsum': 0.2833816425120773}


**✅ Let’s Format ROUGE Evaluation samples results in Table**

In [21]:
import pandas as pd

# Format the current ROUGE output from evaluate.load("rouge")
def format_rouge_output(scores):
    rows = []
    for metric, f1 in scores.items():
        rows.append({
            "Metric": metric.upper(),
            "Precision": "-",  # Not available in current version
            "Recall": "-",     # Not available in current version
            "F1-Score": round(f1, 4)
        })
    return pd.DataFrame(rows)

# Display
formatted_scores = format_rouge_output(scores)
print("📊 Formatted ROUGE Results:")
display(formatted_scores)

📊 Formatted ROUGE Results:


Unnamed: 0,Metric,Precision,Recall,F1-Score
0,ROUGE1,-,-,0.3136
1,ROUGE2,-,-,0.1184
2,ROUGEL,-,-,0.2351
3,ROUGELSUM,-,-,0.2834


**📚Step 7: CNN BERTScore**

In [22]:
# Define CNN predictions and references 
cnn_preds = generated_summaries  
cnn_refs = df_cnn["highlights"][:len(cnn_preds)]

In [23]:
from evaluate import load

# Load BERTScore evaluator
bertscore = load("bertscore")

# Provide  actual generated predictions and references
predictions = cnn_preds  # Your T5 generated summaries
references = cnn_refs    # Actual CNN highlights

# Compute BERTScore
bertscore_result = bertscore.compute(predictions=predictions, references=references, lang="en")

# Calculate average BERTScore F1
average_f1 = sum(bertscore_result['f1']) / len(bertscore_result['f1'])
print(f"✅ CNN BERTScore (F1 Avg): {average_f1:.4f}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  return forward_call(*args, **kwargs)


✅ CNN BERTScore (F1 Avg): 0.8595


#  2. 🚀 T5 on XSUM

**✅ Step 1: Review df_xsum Format**

In [24]:
df_xsum.head()

Unnamed: 0,document,summary,id
0,"The full cost of damage in Newton Stewart, one...",Clean-up operations are continuing across the ...,35232142
1,A fire alarm went off at the Holiday Inn in Ho...,Two tourist buses have been destroyed by fire ...,40143035
2,Ferrari appeared in a position to challenge un...,Lewis Hamilton stormed to pole position at the...,35951548
3,"John Edward Bates, formerly of Spalding, Linco...",A former Lincolnshire Police officer carried o...,36266422
4,Patients and staff were evacuated from Cerahpa...,An armed man who locked himself into a room at...,38826984


**✅ Step 2: Generate Summaries with T5**

In [25]:
# Select first 3 XSum articles
xsum_articles = df_xsum["document"][:3]
xsum_references = df_xsum["summary"][:3]
xsum_generated = []

for i, article in enumerate(xsum_articles):
    print(f"\n📄 XSum Article #{i+1} (shortened):\n", article[:500], "...\n")
    
    summary = generate_summary(article)
    xsum_generated.append(summary)
    
    print(f"✍️ Generated Summary #{i+1}:\n", summary)
    print("-" * 100)


📄 XSum Article #1 (shortened):
 The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.
Repair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.
Trains on the west coast mainline face disruption due to damage at the Lamington Viaduct.
Many businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.
First Minister Nicola Sturgeon visited the area to inspect the damage.
The water ...

✍️ Generated Summary #1:
 many roads in peeblesshire remain badly affected by standing water . first minister Nicola Sturgeon visited the area to inspect the damage . the waters breached a retaining wall, flooding many commercial properties . a flood alert remains in place across the Borders because of constant rain .
----------------------------------------------------------------------------------------------------

📄 XSum Article #2 (shortened):
 A fire 

In [26]:
from evaluate import load


# Load ROUGE evaluator
rouge = load("rouge")

# XSum references
reference_summaries = df_xsum["summary"][:3]

# XSum predictions — generated earlier in xsum_generated
scores = rouge.compute(predictions=xsum_generated, references=reference_summaries)
print("📊 ROUGE Evaluation on 3 XSum samples:\n", scores)

📊 ROUGE Evaluation on 3 XSum samples:
 {'rouge1': 0.13979392926761347, 'rouge2': 0.010928961748633878, 'rougeL': 0.08688387635756056, 'rougeLsum': 0.08688387635756056}


**✅ Let’s Format ROUGE Evaluation samples results in Table**

In [27]:
import pandas as pd

# Format the current ROUGE output from evaluate.load("rouge")
def format_rouge_output(scores):
    rows = []
    for metric, f1 in scores.items():
        rows.append({
            "Metric": metric.upper(),
            "Precision": "-",  # Not available in current version
            "Recall": "-",     # Not available in current version
            "F1-Score": round(f1, 4)
        })
    return pd.DataFrame(rows)

# Display
formatted_scores = format_rouge_output(scores)
print("📊 Formatted ROUGE Results:")
display(formatted_scores)

📊 Formatted ROUGE Results:


Unnamed: 0,Metric,Precision,Recall,F1-Score
0,ROUGE1,-,-,0.1398
1,ROUGE2,-,-,0.0109
2,ROUGEL,-,-,0.0869
3,ROUGELSUM,-,-,0.0869


**✅ Prepare XSum Evaluation Data**

In [28]:
xsum_preds = generated_summaries  # Replace with T5-generated outputs for XSum
xsum_refs = df_xsum["summary"][:len(xsum_preds)]  # Actual references

# 3. 🚀 T5 on MultiNews

**✅ Step 1: Inspect the Data**

In [29]:
df_multi.head()

Unnamed: 0,document,summary
0,"National Archives \n \n Yes, it’s that time ag...",– The unemployment rate dropped to 8.2% last m...
1,LOS ANGELES (AP) — In her first interview sinc...,"– Shelly Sterling plans ""eventually"" to divorc..."
2,"GAITHERSBURG, Md. (AP) — A small, private jet ...",– A twin-engine Embraer jet that the FAA descr...
3,Tucker Carlson Exposes His Own Sexism on Twitt...,– Tucker Carlson is in deep doodoo with conser...
4,A man accused of removing another man's testic...,– What are the three most horrifying words in ...


**✅ Step 2: Generate Summaries Using T5**

In [30]:
multi_articles = df_multi["document"][:3]
multi_generated = []

for i, article in enumerate(multi_articles):
    print(f"\n📰 MultiNews Article #{i+1} (truncated):\n", article[:500], "...\n")
    
    summary = generate_summary(article)
    multi_generated.append(summary)
    
    print(f"✍️ Generated Summary #{i+1}:\n", summary)
    print("-" * 80)


📰 MultiNews Article #1 (truncated):
 National Archives 
 
 Yes, it’s that time again, folks. It’s the first Friday of the month, when for one ever-so-brief moment the interests of Wall Street, Washington and Main Street are all aligned on one thing: Jobs. 
 
 A fresh update on the U.S. employment situation for January hits the wires at 8:30 a.m. New York time offering one of the most important snapshots on how the economy fared during the previous month. Expectations are for 203,000 new jobs to be created, according to economists p ...

✍️ Generated Summary #1:
 economists polled by Dow Jones Newswires expect 203,000 new jobs to be created . the unemployment rate is expected to hold steady at 8.3% . the economy has added 858,000 jobs since December .
--------------------------------------------------------------------------------

📰 MultiNews Article #2 (truncated):
 LOS ANGELES (AP) — In her first interview since the NBA banned her estranged husband, Shelly Sterling says she will fig

**✅ Step 3: ROUGE Evaluation**

In [31]:
from evaluate import load


# Load ROUGE evaluator
rouge = load("rouge")

reference_summaries = df_multi["summary"][:3]

scores = rouge.compute(predictions=multi_generated, references=reference_summaries)
print("📊 ROUGE Evaluation on 3 MultiNews samples:\n", scores)

📊 ROUGE Evaluation on 3 MultiNews samples:
 {'rouge1': 0.3602525630462772, 'rouge2': 0.21184298157982365, 'rougeL': 0.2478186706173735, 'rougeLsum': 0.24781867061737353}


**✅ Step 4: Results Tabular Format**

In [32]:
format_rouge_output(scores)

Unnamed: 0,Metric,Precision,Recall,F1-Score
0,ROUGE1,-,-,0.3603
1,ROUGE2,-,-,0.2118
2,ROUGEL,-,-,0.2478
3,ROUGELSUM,-,-,0.2478


**✅ Prepare Multinews Evaluation Data**

In [33]:
multi_preds = generated_summaries  # Replace with T5-generated outputs for XSum
multi_refs = df_xsum["summary"][:len(xsum_preds)]  # Actual references

# ✅ T5 with ROUGE & BERTScore

In [34]:
def evaluate_metrics(dataset_name, predictions, references):
    # ROUGE
    rouge_scores = rouge.compute(predictions=predictions, references=references, use_stemmer=True)

    # BERTScore
    bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")
    avg_bertscore = sum(bert_scores["f1"]) / len(bert_scores["f1"])

    # Store result
    summary_results.append({
        "Dataset": dataset_name,
        "ROUGE-1": round(rouge_scores["rouge1"], 4),
        "ROUGE-2": round(rouge_scores["rouge2"], 4),
        "ROUGE-L": round(rouge_scores["rougeL"], 4),
        "BERTScore": round(avg_bertscore, 4)
    })

In [35]:
# 🧹 Clear any previous results
summary_results = []

# ✅ Evaluate all datasets
evaluate_metrics("CNN DailyMail", cnn_preds, cnn_refs)
evaluate_metrics("XSum", xsum_preds, xsum_refs)
evaluate_metrics("MultiNews", multi_preds, multi_refs)

# 📊 Display final summary
df_summary = pd.DataFrame(summary_results)
print("✅ T5 Model ROUGE + BERTScore Summary")
print(df_summary.to_string(index=False))


  return forward_call(*args, **kwargs)


✅ T5 Model ROUGE + BERTScore Summary
      Dataset  ROUGE-1  ROUGE-2  ROUGE-L  BERTScore
CNN DailyMail   0.3177   0.1184   0.2351     0.8595
         XSum   0.0807   0.0000   0.0605     0.8196
    MultiNews   0.0807   0.0000   0.0605     0.8196


# ✅ DataSets CSV Files to Save

**So that we can use the dataset in different NoteBooks**

In [36]:
# Save T5 datasets to CSV 
df_cnn.to_csv("cnn_dailymail.csv", index=False)
df_xsum.to_csv("xsum.csv", index=False)
df_multi.to_csv("multi_news.csv", index=False)

In [38]:
# Display their shapes
print("📊 CNN/DailyMail Shape:", df_cnn.shape)
print("📊 XSum Shape:", df_xsum.shape)
print("📊 MultiNews Shape:", df_multi.shape)

📊 CNN/DailyMail Shape: (5, 3)
📊 XSum Shape: (5, 3)
📊 MultiNews Shape: (5, 2)


# 💾 Save the Scores to .CSV Files

**So that we can use to comapair models in different NoteBooks**

In [37]:
# Save to CSV
df_summary["Model"] = "T5"  
df_summary.to_csv("t5_scores.csv", index=False)