# 📚 NoteBook 8 LED Evaluation

# 🚀 PROJECT PLAN

MKEM Implementation – Transformer-Based Abstractive Text Summarization

# 🎯 Problem Statement Recap

# 🔍 Objective:
    
To build and compare transformer-based summarization models (T5, BART, Pegasus, BARTScore, ProphetNet, BigBird, LED, mTS, FLAN-T5, GPT-3.5 Turbo) and then enhance them using MKEM (Multi-Knowledge-Enhanced Model) on curated English news datasets.

# 📌 Phase-1 Objective

✅ Implement the following summarization models and supporting notebooks:

PEGASUS (Google)---NoteBook(2)

BART (Facebook)---NoteBook(3)

T5 (Google)---NoteBook(1)

Final Comparison + MKEM---NoteBook(4)

NewsSum(Indian Newspaper)---NoteBook(5)

BARTScore---NoteBook(6)

ProphetNet---NoteBook(7)

BigBird-Pegasus---NoteBook(8)

LED(Longformer)---NoteBook(9)

allenai/PRIMERA ---NoteBook(10)

FLAN-T5---NoteBook(11)

GPT-3.5 Turbo---NoteBook(12)

# ✅ Evaluate on the following benchmark datasets:
    
CNN/DailyMail

NewsSum (Indian Newspaper)

# ✅ Evaluation Metrics:
    
ROUGE-1

ROUGE-2

ROUGE-L

BERTScore
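
To make the ROUGE-1 figure concrete, here is a minimal sketch of unigram-overlap F1. It is a simplification: it assumes lowercase whitespace tokenization and no stemming, so it will not reproduce the exact numbers returned by `evaluate.load("rouge")` later in this notebook. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each token to its minimum count in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# 3 of 3 predicted tokens match (precision 1.0); 3 of 6 reference tokens are covered (recall 0.5).
print(round(rouge1_f1("the cat sat", "the cat sat on the mat"), 4))  # 0.6667
```

ROUGE-2 follows the same pattern with bigrams, and ROUGE-L replaces the overlap count with the longest common subsequence length.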

# 📊 Final Output (Per Model × Dataset):
    
You must submit structured results:

Dataset name

Model used

ROUGE-1, ROUGE-2, ROUGE-L, BERTScore

Inference Time

GPU used

Short analysis/observations
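
One possible way to keep each result in this structure is a small record type. This is only a sketch: the class name `EvalRecord` and the `Observations` column are hypothetical, while the other column names mirror the result dictionaries built later in this notebook. The example values are the CNN figures reported further down.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One row of structured results for a (model, dataset) pair."""
    dataset: str
    model: str
    rouge1: float
    rouge2: float
    rouge_l: float
    bertscore_f1: float
    inference_time_s: float
    gpu_used: str
    observations: str = ""

    def to_row(self) -> dict:
        # Column names follow the ones used when saving the evaluation CSVs.
        return {
            "Dataset": self.dataset,
            "Model": self.model,
            "ROUGE-1": self.rouge1,
            "ROUGE-2": self.rouge2,
            "ROUGE-L": self.rouge_l,
            "BERTScore": self.bertscore_f1,
            "Inference Time (s)": self.inference_time_s,
            "GPU Used": self.gpu_used,
            "Observations": self.observations,
        }


record = EvalRecord("CNN", "LED", 0.28072, 0.121688, 0.190097, 0.8515, 108.81, "CPU")
print(record.to_row())
```

A list of such rows can be passed directly to `pd.DataFrame(...)` to build the comparison table.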

# **1.🚀 LED on CNN Dataset**

**✏️Step 1: Install & Import Libraries**

In [1]:
!pip install transformers accelerate sentencepiece evaluate bert-score -q

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import pandas as pd
import time
import evaluate
import bert_score

# Device setup
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("✅ Using device:", device)



✅ Using device: cpu


**✏️ Step 2: Load Model & Tokenizer**

In [3]:
# ✏️ Step 2: Load Model & Tokenizer for LED
model_name = "allenai/led-base-16384"

tokenizer_led = AutoTokenizer.from_pretrained(model_name)
model_led = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

print("✅ LED model and tokenizer loaded.")


✅ LED model and tokenizer loaded.



**✏️ Step 3: Load CNN Dataset**

In [4]:
import pandas as pd

# Load cleaned CNN dataset (from Notebook 1)
df_cnn = pd.read_csv("cnn_dailymail.csv")

# Drop missing or empty articles/highlights
df_cnn = df_cnn.dropna(subset=["article", "highlights"])
df_cnn = df_cnn[df_cnn["article"].str.strip().astype(bool)]

# Optional: Limit to small sample for quick testing
df_cnn = df_cnn[:5]

print(f"✅ CNN Dataset loaded with {len(df_cnn)} rows.")
df_cnn.head()

✅ CNN Dataset loaded with 5 rows.


| | article | highlights | id |
|---|---|---|---|
| 0 | LONDON, England (Reuters) -- Harry Potter star... | Harry Potter star Daniel Radcliffe gets £20M f... | 42c027e4ff9730fbb3de84c1af0d2c506e41c3e4 |
| 1 | Editor's note: In our Behind the Scenes series... | Mentally ill inmates in Miami are housed on th... | ee8871b15c50d0db17b0179a6d2beab35065f1e9 |
| 2 | MINNEAPOLIS, Minnesota (CNN) -- Drivers who we... | NEW: "I thought I was going to die," driver sa... | 06352019a19ae31e527f37f7571c6dd7f0c5da37 |
| 3 | WASHINGTON (CNN) -- Doctors removed five small... | Five small polyps found during procedure; "non... | 24521a2abb2e1f5e34e6824e0f9e56904a2b0e88 |
| 4 | (CNN) -- The National Football League has ind... | NEW: NFL chief, Atlanta Falcons owner critical... | 7fe70cc8b12fab2d0a258fababf7d9c6b5e1262a |


**✏️ Step 4: Define Summarization Function**

In [5]:
def summarize_with_led(text):
    # Tokenize input
    inputs = tokenizer_led(
        text,
        return_tensors="pt",
        truncation=True,
        padding="longest",
        max_length=4096  # LED supports long input sequences
    ).to(device)
    
    # Generate summary
    summary_ids = model_led.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # pass the padding mask explicitly
        max_length=150,
        min_length=40,
        length_penalty=2.0,
        num_beams=4
    )
    
    # Decode and return
    return tokenizer_led.decode(summary_ids[0], skip_special_tokens=True)
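
The function above relies on LED's local attention only. The Hugging Face LED documentation advises putting global attention on the first `<s>` token for summarization, so a variant worth trying is sketched below. The helper name `build_global_attention_mask` is hypothetical, the token ids in the demo are dummy values, and using this mask would change the generated summaries, so the scores reported in this notebook would not be reproduced exactly.

```python
import torch


def build_global_attention_mask(input_ids: torch.Tensor) -> torch.Tensor:
    """Return a mask with global attention (1) on the first token of every sequence."""
    mask = torch.zeros_like(input_ids)
    mask[:, 0] = 1  # position 0 holds the <s> token in LED inputs
    return mask


# Dummy token ids standing in for a tokenized batch of two articles.
dummy_ids = torch.tensor([[0, 100, 200, 2], [0, 300, 2, 1]])
print(build_global_attention_mask(dummy_ids))

# Usage with the objects defined in this notebook (not executed here):
# summary_ids = model_led.generate(
#     inputs["input_ids"],
#     attention_mask=inputs["attention_mask"],
#     global_attention_mask=build_global_attention_mask(inputs["input_ids"]),
#     max_length=150, min_length=40, length_penalty=2.0, num_beams=4,
# )
```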

**✏️ Step 5: Generate Predictions**

In [6]:
import time

start_time = time.time()

# Generate summaries for CNN articles
led_cnn_preds = [summarize_with_led(article) for article in df_cnn["article"]]

# Get references (human-written summaries)
led_cnn_refs = df_cnn["highlights"].tolist()

# Measure inference time
inference_time = round(time.time() - start_time, 2)
print(f"✅ LED summarization completed in {inference_time} seconds")

Input ids are automatically padded from 565 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 888 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 919 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 531 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 1196 to 2048 to be a multiple of `config.attention_window`: 1024


✅ LED summarization completed in 108.81 seconds
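
The padding messages in the log follow a simple rule: each input is padded up to the next multiple of `config.attention_window`, which the log reports as 1024. A minimal sketch of that arithmetic, with a hypothetical helper name, reproduces the lengths shown above:

```python
import math


def padded_length(num_tokens: int, attention_window: int = 1024) -> int:
    """Round a token count up to the next multiple of the attention window."""
    return math.ceil(num_tokens / attention_window) * attention_window


# Token counts taken from the log messages above.
for n in (565, 888, 919, 531, 1196):
    print(f"{n} -> {padded_length(n)}")
```

Only the 1196-token article crosses the first window boundary, which is why it is padded to 2048 while the others are padded to 1024.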


**✏️ Step 6: Evaluate with ROUGE & BERTScore**

In [7]:
import evaluate
import torch

# Load metrics
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# ✅ Compute ROUGE
rouge_scores = rouge.compute(predictions=led_cnn_preds, references=led_cnn_refs)

# ✅ Compute BERTScore
bert_scores = bertscore.compute(predictions=led_cnn_preds, references=led_cnn_refs, lang="en")

# ✅ Prepare results dictionary
led_cnn_results = {
    "Dataset": ["CNN"],
    "Model": ["LED"],
    "ROUGE-1": [rouge_scores["rouge1"]],
    "ROUGE-2": [rouge_scores["rouge2"]],
    "ROUGE-L": [rouge_scores["rougeL"]],
    "BERTScore": [round(sum(bert_scores["f1"]) / len(bert_scores["f1"]), 4)],
    "Inference Time (s)": [inference_time],
    "GPU Used": [torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"]
}

# ✅ Save to CSV
import pandas as pd
led_cnn_df = pd.DataFrame(led_cnn_results)
led_cnn_df.to_csv("LED_CNN_Evaluation.csv", index=False)

print("✅ LED CNN evaluation saved to LED_CNN_Evaluation.csv")
led_cnn_df



✅ LED CNN evaluation saved to LED_CNN_Evaluation.csv


| | Dataset | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | Inference Time (s) | GPU Used |
|---|---|---|---|---|---|---|---|---|
| 0 | CNN | LED | 0.28072 | 0.121688 | 0.190097 | 0.8515 | 108.81 | CPU |


# 💾 Save the Scores to .CSV Files

Save the scores so that the models can be compared across notebooks.

In [8]:
led_cnn_df = pd.DataFrame(led_cnn_results)
led_cnn_df.to_csv("LED_CNN_Evaluation.csv", index=False)

print("✅ LED CNN evaluation saved to LED_CNN_Evaluation.csv")

✅ LED CNN evaluation saved to LED_CNN_Evaluation.csv


# 2.🚀 LED on NewsSum Dataset

**✏️ Step 1: Load NewsSum Dataset**

In [9]:
import pandas as pd

# Load cleaned NewsSum dataset
df_newsum = pd.read_csv("newsum_cleaned.csv")

# Drop any missing rows
df_newsum = df_newsum.dropna(subset=["Article", "Summary"])
df_newsum = df_newsum[df_newsum["Article"].str.strip().astype(bool)]

df_newsum.head()

| | Headline | Article | Category | Summary |
|---|---|---|---|---|
| 0 | Elephant death brings to fore man-animal confl... | The death of a pregnant elephant in the buffer... | Local News | Thousands of farmers in Kerala have either aba... |
| 1 | Cases filed after two ‘commit suicide’ in ... | Two suicides were reported from Vadodara and D... | Crime and Justice | In the first incident, a 30-year-old woman all... |
| 2 | Woman alleges father tied to MP hospital bed o... | A day after a woman alleged that her father ha... | Health and Wellness | The hospital denied the allegation, saying the... |
| 3 | Sena member, author, app designer – the many... | Assistant police inspector Sachin Vaze, who wa... | Defense | On Saturday, Vaze along with police constables... |
| 4 | Manager, owner of resort where Gujarat Congres... | The manager and owner of a resort in Rajkot, w... | Politics | The resort is reportedly owned by Indranil Raj... |


**✏️Step 2: Generate Summaries with LED**

In [11]:
import time

def batch_summarize_with_led(texts, batch_size=2):
    preds = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        inputs = tokenizer_led(
            batch, return_tensors="pt", truncation=True, padding="longest", max_length=4096
        ).to(device)
        # Pass the attention mask so padded positions in the shorter article are ignored
        summary_ids = model_led.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=150, min_length=40, length_penalty=2.0, num_beams=4,
        )
        preds.extend([tokenizer_led.decode(g, skip_special_tokens=True) for g in summary_ids])
    return preds

# ✅ Run in batches
start_time = time.time()
led_newsum_preds = batch_summarize_with_led(df_newsum["Article"].tolist(), batch_size=2)
led_newsum_refs = df_newsum["Summary"].tolist()
inference_time = round(time.time() - start_time, 2)

print(f"⏱ Inference Time: {inference_time} seconds for {len(led_newsum_preds)} summaries")

Input ids are automatically padded from 597 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 1230 to 2048 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 1761 to 2048 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 943 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 1279 to 2048 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 995 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 937 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 512 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 692 to 1024 to be a multiple of `config.attention_window`: 1024
Input ids are automatically padded from 1203 to 2048 to be a 

⏱ Inference Time: 13571.06 seconds for 1003 summaries
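
Because `padding="longest"` pads every batch to its longest article, pairing a short article with a long one wastes computation. One option, not used in the run above and not measured here, is to batch articles of similar length together and then restore the original order. The sketch below is an assumption-laden illustration: the helper name is hypothetical, character count is used as a rough proxy for token count, and a toy function stands in for the real summarizer.

```python
from typing import Callable, List


def summarize_sorted_by_length(
    texts: List[str],
    summarize_batch: Callable[[List[str]], List[str]],
    batch_size: int = 2,
) -> List[str]:
    """Batch texts of similar length together, returning results in input order."""
    # Indices of the texts, shortest first (character length as a proxy for tokens).
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    results = [""] * len(texts)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        outputs = summarize_batch([texts[i] for i in idx])
        # Write each summary back to the position of its source text.
        for i, out in zip(idx, outputs):
            results[i] = out
    return results


# Toy stand-in for the model call: "summarize" by keeping the first word.
def toy_summarizer(batch: List[str]) -> List[str]:
    return [text.split()[0] for text in batch]


articles = ["alpha " * 50, "beta " * 5, "gamma " * 30, "delta " * 2]
print(summarize_sorted_by_length(articles, toy_summarizer))
```

In this notebook, `summarize_batch` could wrap the tokenizer and `model_led.generate` call from `batch_summarize_with_led`.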


**✏️Step 3: Evaluate with ROUGE and BERTScore**

In [12]:
import evaluate
import torch

# Load metrics
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# ✅ Compute ROUGE
rouge_scores = rouge.compute(predictions=led_newsum_preds, references=led_newsum_refs)

# ✅ Compute BERTScore
bert_scores = bertscore.compute(predictions=led_newsum_preds, references=led_newsum_refs, lang="en")

# ✅ Prepare results dictionary
led_newsum_results = {
    "Dataset": ["NewsSum"],
    "Model": ["LED"],
    "ROUGE-1": [rouge_scores["rouge1"]],
    "ROUGE-2": [rouge_scores["rouge2"]],
    "ROUGE-L": [rouge_scores["rougeL"]],
    "BERTScore": [round(sum(bert_scores["f1"]) / len(bert_scores["f1"]), 4)],
    "Inference Time (s)": [inference_time],
    "GPU Used": [torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"]
}



**💾 Step 4: Save Evaluation Scores to CSV**

In [13]:
led_newsum_df = pd.DataFrame(led_newsum_results)
led_newsum_df.to_csv("LED_NewsSum_Evaluation.csv", index=False)

print("✅ LED NewsSum evaluation saved to LED_NewsSum_Evaluation.csv")
led_newsum_df

✅ LED NewsSum evaluation saved to LED_NewsSum_Evaluation.csv


| | Dataset | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | Inference Time (s) | GPU Used |
|---|---|---|---|---|---|---|---|---|
| 0 | NewsSum | LED | 0.330616 | 0.264168 | 0.299004 | 0.8744 | 13571.06 | CPU |
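
The two inference times are not directly comparable because the CNN run covered 5 articles and the NewsSum run covered 1003. Dividing by the number of summaries gives a per-summary figure. The sketch below uses only the totals printed in this notebook; the helper name is hypothetical.

```python
def seconds_per_summary(total_seconds: float, num_summaries: int) -> float:
    """Average wall-clock time spent per generated summary."""
    return total_seconds / num_summaries


# (total seconds, number of summaries) as reported by the two runs on CPU.
runs = {"CNN": (108.81, 5), "NewsSum": (13571.06, 1003)}
for name, (seconds, count) in runs.items():
    print(f"{name}: {seconds_per_summary(seconds, count):.2f} s per summary")
# CNN: 21.76 s per summary
# NewsSum: 13.53 s per summary
```

Note that the CNN run summarized one article per call while the NewsSum run used batches of two, so the per-summary figures reflect different batching as well as different article lengths.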


In [15]:
import pandas as pd

# Load individual score files
led_cnn_df = pd.read_csv("LED_CNN_Evaluation.csv")
led_newsum_df = pd.read_csv("LED_NewsSum_Evaluation.csv")

# Merge into one DataFrame and save the merged scores
led_all_scores = pd.concat([led_cnn_df, led_newsum_df], ignore_index=True)
led_all_scores.to_csv("led_all_scores.csv", index=False)

print("✅ led_cnn + led_newsum saved to led_all_scores.csv")
led_all_scores

✅ led_cnn + led_newsum saved to led_all_scores.csv


| | Unnamed: 0 | Dataset | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | Inference Time (s) | GPU Used |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | CNN | LED | 0.28072 | 0.121688 | 0.190097 | 0.8515 | 108.81 | CPU |
| 1 | NaN | NewsSum | LED | 0.330616 | 0.264168 | 0.299004 | 0.8744 | 13571.06 | CPU |
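
The merged table contains an `Unnamed: 0` column that is filled for the CNN row and empty for the NewsSum row. A likely cause, stated here as an assumption, is that the `LED_CNN_Evaluation.csv` file on disk was written with its index at some point, even though the cells above save with `index=False`. A minimal sketch for dropping such leftover index columns after loading, using small in-memory CSVs and a hypothetical helper name:

```python
import io

import pandas as pd


def drop_unnamed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Remove columns that pandas auto-named 'Unnamed: N' when reading a saved index."""
    keep = [col for col in df.columns if not str(col).startswith("Unnamed")]
    return df[keep]


# Stand-ins for the two score files: the first was saved with its index, the second without.
stale_csv = io.StringIO(",Dataset,Model\n0,CNN,LED\n")
clean_csv = io.StringIO("Dataset,Model\nNewsSum,LED\n")

merged = pd.concat([pd.read_csv(stale_csv), pd.read_csv(clean_csv)], ignore_index=True)
print(list(merged.columns))
print(list(drop_unnamed_columns(merged).columns))
```

Applying the helper to `led_all_scores` before saving would keep the comparison file limited to the metric columns.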
