# 📚 NoteBook 7 BigBird Evaluation

# 🚀 PROJECT PLAN

MKEM Implementation – Transformer-Based Abstractive Text Summarization

# 🎯 Problem Statement Recap

# 🔍 Objective:
    
To build and compare transformer-based summarization models (T5, BART, Pegasus, BARTScore, ProphetNet, BigBird, LED, mTS, FLAN-T5, GPT-3.5 Turbo) and then enhance them using MKEM (Multi-Knowledge-Enhanced Model) on curated English news datasets.

# 📌 Phase-1 Objective

✅ Implement the 3 core summarization models (PEGASUS, BART, T5). The remaining entries list the follow-up notebooks in the project:

PEGASUS (Google)---NoteBook(2)

BART (Facebook)---NoteBook(3)

T5 (Google)---NoteBook(1)

Final Comparison + MKEM---NoteBook(4)

NewsSum(Indian Newspaper)---NoteBook(5)

BARTScore---NoteBook(6)

ProphetNet---NoteBook(7)

BigBird-Pegasus---NoteBook(8)

LED(Longformer)---NoteBook(9)

mTS ---NoteBook(10)

FLAN-T5---NoteBook(11)

GPT-3.5 Turbo---NoteBook(12)

# ✅ Evaluate on 2 benchmark datasets:
    
CNN/DailyMail

NewsSum (Indian Newspaper)

# ✅ Evaluation Metrics:
    
ROUGE-1

ROUGE-2

ROUGE-L

BERTScore
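
To make the ROUGE-N numbers reported later easier to interpret, here is a minimal sketch of how an n-gram overlap F1 can be computed by hand. It is a simplification, assuming lowercased whitespace tokens and no stemming; the `evaluate` ROUGE metric used in this notebook applies its own tokenization, so its scores will not match this sketch exactly. The helper names `ngrams` and `rouge_n_f1` are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(prediction, reference, n=1):
    """Simplified ROUGE-N F1: lowercase, whitespace tokens, no stemming."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


pred = "the cat sat on the mat"
ref = "the cat lay on the mat"
print("ROUGE-1 F1:", round(rouge_n_f1(pred, ref, n=1), 4))
print("ROUGE-2 F1:", round(rouge_n_f1(pred, ref, n=2), 4))
```

A ROUGE-2 of exactly 0.0 therefore means that no two consecutive words of any prediction appear consecutively in its reference.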

# 📊 Final Output (Per Model × Dataset):
    
You must submit structured results:

Dataset name

Model used

ROUGE-1, ROUGE-2, ROUGE-L, BERTScore

Inference Time

GPU used

Short analysis/observations
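
One way to keep these fields consistent across notebooks is a small record type. The sketch below is only an illustration: the `EvalResult` class, its field names, and the "Observations" column are assumptions, while the other column names follow the CSV files written later in this notebook. The example values are copied from the CNN run reported further down.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One result row for a (model, dataset) pair. Field names are illustrative."""
    dataset: str
    model: str
    rouge1: float
    rouge2: float
    rouge_l: float
    bertscore_f1: float
    inference_time_s: float
    gpu: str
    observations: str = ""

    def to_row(self):
        """Map fields to the column names used in this notebook's CSV files."""
        return {
            "Dataset": self.dataset,
            "Model": self.model,
            "ROUGE-1": self.rouge1,
            "ROUGE-2": self.rouge2,
            "ROUGE-L": self.rouge_l,
            "BERTScore": self.bertscore_f1,
            "Inference Time (s)": self.inference_time_s,
            "GPU Used": self.gpu,
            "Observations": self.observations,  # hypothetical extra column
        }


row = EvalResult(
    dataset="CNN", model="BigBird-Pegasus",
    rouge1=0.071382, rouge2=0.0, rouge_l=0.056932,
    bertscore_f1=0.786, inference_time_s=718.42, gpu="CPU",
).to_row()
print(row)
```

A list of such rows can be passed directly to `pd.DataFrame(...)` to build the comparison table.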

# 1.🚀 BigBird on CNN Dataset

**✏️ Step 1: Install & Import Libraries**

In [1]:
!pip install transformers evaluate bert_score -q

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import pandas as pd
import time
import evaluate
import bert_score



**✏️ Step 2: Load Model & Tokenizer**

In [3]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Set device
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ✅ Use the BigBird-Pegasus checkpoint fine-tuned on arXiv papers (not on news data)
model_id = "google/bigbird-pegasus-large-arxiv"

tokenizer_bigbird = AutoTokenizer.from_pretrained(model_id)
model_bigbird = AutoModelForSeq2SeqLM.from_pretrained(model_id).to(device)

**✏️ Step 3: Load CNN Dataset**

In [4]:
df_cnn = pd.read_csv("cnn_dailymail.csv")

# Drop missing or empty values
df_cnn.dropna(subset=["article", "highlights"], inplace=True)
df_cnn = df_cnn[df_cnn["article"].str.strip().astype(bool)]

# Optional: Subset for testing
df_cnn = df_cnn[:5]

df_cnn.head()

|  | article | highlights | id |
|---|---|---|---|
| 0 | LONDON, England (Reuters) -- Harry Potter star... | Harry Potter star Daniel Radcliffe gets £20M f... | 42c027e4ff9730fbb3de84c1af0d2c506e41c3e4 |
| 1 | Editor's note: In our Behind the Scenes series... | Mentally ill inmates in Miami are housed on th... | ee8871b15c50d0db17b0179a6d2beab35065f1e9 |
| 2 | MINNEAPOLIS, Minnesota (CNN) -- Drivers who we... | NEW: "I thought I was going to die," driver sa... | 06352019a19ae31e527f37f7571c6dd7f0c5da37 |
| 3 | WASHINGTON (CNN) -- Doctors removed five small... | Five small polyps found during procedure; "non... | 24521a2abb2e1f5e34e6824e0f9e56904a2b0e88 |
| 4 | (CNN) -- The National Football League has ind... | NEW: NFL chief, Atlanta Falcons owner critical... | 7fe70cc8b12fab2d0a258fababf7d9c6b5e1262a |


**✏️ Step 4: Define Summarization Function**

In [5]:
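def summarize_with_bigbird(text):
    # Tokenize and truncate the input to 1024 tokens
    # (the BigBird-Pegasus checkpoint itself accepts up to 4096)
    inputs = tokenizer_bigbird(
        text,
        return_tensors="pt",
        truncation=True,
        padding="longest",
        max_length=1024
    ).to(device)

    # Generate the summary with beam search; pass the attention mask
    # explicitly so padding is ignored if inputs are ever batched
    summary_ids = model_bigbird.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=160,
        min_length=40,
        length_penalty=2.0,
        num_beams=4,
        early_stopping=True
    )

    # Decode the first (only) sequence and return it as plain text
    return tokenizer_bigbird.decode(summary_ids[0], skip_special_tokens=True)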

**✏️ Step 5: Generate Predictions**

In [6]:
start_time = time.time()

bigbird_preds = [summarize_with_bigbird(article) for article in df_cnn["article"]]
bigbird_refs = df_cnn["highlights"].tolist()

end_time = time.time()
inference_time = round(end_time - start_time, 2)
print("🕒 Inference Time:", inference_time, "seconds")

Attention type 'block_sparse' is not possible if sequence_length: 560 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...


🕒 Inference Time: 718.42 seconds
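
The warning above states the minimum sequence length needed for `block_sparse` attention as a sum of four terms. The sketch below only reproduces that arithmetic from the message text; the function names are hypothetical and this is not the library's internal check.

```python
def min_block_sparse_length(block_size, num_random_blocks):
    """Threshold from the warning: global + sliding + random + buffer tokens."""
    global_tokens = 2 * block_size
    sliding_tokens = 3 * block_size
    random_tokens = num_random_blocks * block_size
    buffer_tokens = num_random_blocks * block_size
    return global_tokens + sliding_tokens + random_tokens + buffer_tokens


def can_use_block_sparse(seq_len, block_size=64, num_random_blocks=3):
    """Per the warning, block_sparse is not possible when seq_len <= threshold."""
    return seq_len > min_block_sparse_length(block_size, num_random_blocks)


threshold = min_block_sparse_length(block_size=64, num_random_blocks=3)
print("threshold:", threshold)                      # 704, as in the warning
print("560 tokens:", can_use_block_sparse(560))     # False: falls back to full attention
```

With `block_size = 64` and `num_random_blocks = 3`, any input of 704 tokens or fewer is processed with `original_full` attention, which is what happened for the 560-token sequence logged above.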


**✏️ Step 6: Evaluate with ROUGE & BERTScore**

In [7]:
# ROUGE
rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=bigbird_preds, references=bigbird_refs)

# BERTScore
bertscore = evaluate.load("bertscore")
bert_scores = bertscore.compute(predictions=bigbird_preds, references=bigbird_refs, lang="en")

# Display results
print("📊 ROUGE Scores:\n", rouge_scores)
print("📊 BERTScore (F1 average):", round(sum(bert_scores["f1"]) / len(bert_scores["f1"]), 4))

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  return forward_call(*args, **kwargs)


📊 ROUGE Scores:
 {'rouge1': 0.0713815392469786, 'rouge2': 0.0, 'rougeL': 0.05693226960623641, 'rougeLsum': 0.0633391344260503}
📊 BERTScore (F1 average): 0.786


# 💾 Save the Scores to .CSV Files

**So that we can compare models across different notebooks**

In [8]:
# Create summary dictionary
bigbird_result = {
    "Dataset": ["CNN"],
    "Model": ["BigBird-Pegasus"],
    "ROUGE-1": [rouge_scores["rouge1"]],
    "ROUGE-2": [rouge_scores["rouge2"]],
    "ROUGE-L": [rouge_scores["rougeL"]],
    "BERTScore": [round(sum(bert_scores["f1"]) / len(bert_scores["f1"]), 4)],
    "Inference Time (s)": [inference_time],
    "GPU Used": [torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"]
}

df_bigbird_eval = pd.DataFrame(bigbird_result)
df_bigbird_eval.to_csv("BigBird_CNN_Evaluation.csv", index=False)
df_bigbird_eval

|  | Dataset | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | Inference Time (s) | GPU Used |
|---|---|---|---|---|---|---|---|---|
| 0 | CNN | BigBird-Pegasus | 0.071382 | 0.0 | 0.056932 | 0.786 | 718.42 | CPU |


# 2.🚀 BigBird on NewsSum Dataset

**✏️ Step 1: Load NewsSum Dataset**

In [9]:
import pandas as pd

# Load cleaned NewsSum dataset
df_newsum = pd.read_csv("newsum_cleaned.csv")

# Drop missing or empty articles/summaries
df_newsum = df_newsum.dropna(subset=["Article", "Summary"])
df_newsum = df_newsum[df_newsum["Article"].str.strip().astype(bool)]

# Optional: limit for quick test
df_newsum = df_newsum[:5]

df_newsum.head()

|  | Headline | Article | Category | Summary |
|---|---|---|---|---|
| 0 | Elephant death brings to fore man-animal confl... | The death of a pregnant elephant in the buffer... | Local News | Thousands of farmers in Kerala have either aba... |
| 1 | Cases filed after two ‘commit suicide’ in ... | Two suicides were reported from Vadodara and D... | Crime and Justice | In the first incident, a 30-year-old woman all... |
| 2 | Woman alleges father tied to MP hospital bed o... | A day after a woman alleged that her father ha... | Health and Wellness | The hospital denied the allegation, saying the... |
| 3 | Sena member, author, app designer – the many... | Assistant police inspector Sachin Vaze, who wa... | Defense | On Saturday, Vaze along with police constables... |
| 4 | Manager, owner of resort where Gujarat Congres... | The manager and owner of a resort in Rajkot, w... | Politics | The resort is reportedly owned by Indranil Raj... |


**✏️Step 2: Generate Summaries with BigBird**

In [10]:
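# Redefine the helper for NewsSum: inputs are truncated at 4096 tokens
# here (the CNN run above used 1024) and early_stopping is not set
def summarize_with_bigbird(text):
    inputs = tokenizer_bigbird(
        text, return_tensors="pt",
        truncation=True, padding="longest",
        max_length=4096
    ).to(device)

    summary_ids = model_bigbird.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=150, min_length=40,
        length_penalty=2.0, num_beams=4
    )

    return tokenizer_bigbird.decode(summary_ids[0], skip_special_tokens=True)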

start_time = time.time()

# Generate predictions
bigbird_newsum_preds = [summarize_with_bigbird(article) for article in df_newsum["Article"]]
bigbird_newsum_refs = df_newsum["Summary"].tolist()

end_time = time.time()
inference_time = round(end_time - start_time, 2)
print("🕒 Inference Time:", inference_time, "seconds")

🕒 Inference Time: 513.18 seconds
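
Both runs report only a total wall-clock time. If per-article timings are wanted as well, a small wrapper along the following lines could be used. This is a sketch under assumptions: `time_each` is a hypothetical helper, and `fake_summarize` is a stand-in for `summarize_with_bigbird` so the example runs without loading the model.

```python
import time


def time_each(fn, items):
    """Apply fn to every item, returning outputs and per-item durations in seconds."""
    outputs, durations = [], []
    for item in items:
        start = time.perf_counter()
        outputs.append(fn(item))
        durations.append(time.perf_counter() - start)
    return outputs, durations


def fake_summarize(text):
    """Stand-in for the real summarizer: returns the first 20 characters."""
    return text[:20]


preds, secs = time_each(fake_summarize, ["first article text", "second article text"])
print(f"total: {sum(secs):.4f}s, mean per article: {sum(secs) / len(secs):.4f}s")
```

In the notebook, the call would take the real summarization function and `df_newsum["Article"]` in place of the stand-ins.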


**✏️Step 3: Evaluate with ROUGE and BERTScore**

In [27]:
import evaluate
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# ROUGE
rouge_results = rouge.compute(predictions=bigbird_newsum_preds, references=bigbird_newsum_refs)

# BERTScore
bertscore_results = bertscore.compute(predictions=bigbird_newsum_preds, references=bigbird_newsum_refs, lang="en")

# Combine scores
bigbird_newsum_scores = {
    "Dataset": ["NewsSum"],   
    "Model": ["BigBird-Pegasus"],
    "ROUGE-1": [rouge_results["rouge1"]],
    "ROUGE-2": [rouge_results["rouge2"]],
    "ROUGE-L": [rouge_results["rougeL"]],
    "BERTScore": [round(sum(bertscore_results["f1"]) / len(bertscore_results["f1"]), 4)],
    "Inference Time (s)": [inference_time],
    "GPU Used": [torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"]
}

# Build the results table (it is written to CSV in Step 4)
bigbird_newsum_scores_df = pd.DataFrame(bigbird_newsum_scores)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  return forward_call(*args, **kwargs)


**💾 Step 4: Save Evaluation Scores to CSV**

In [25]:
bigbird_newsum_scores_df = pd.DataFrame(bigbird_newsum_scores)  # use the NewsSum dict
bigbird_newsum_scores_df.to_csv("bigbird_newsum_scores.csv", index=False)
print("✅ BigBird NewsSum scores saved to bigbird_newsum_scores.csv")

✅ BigBird NewsSum scores saved to bigbird_newsum_scores.csv


In [28]:
import pandas as pd

# Load individual score files
cnn_scores = pd.read_csv("BigBird_CNN_Evaluation.csv")
bigbird_newsum_scores_df = pd.read_csv("bigbird_newsum_scores.csv")

# Merge into one DataFrame
bigbird_all_scores = pd.concat([cnn_scores, bigbird_newsum_scores_df], ignore_index=True)

# Save merged scores
bigbird_all_scores.to_csv("bigbird_all_scores.csv", index=False)

print("✅ BigBird CNN + NewsSum scores saved to bigbird_all_scores.csv")
bigbird_all_scores


✅ BigBird CNN + NewsSum scores saved to bigbird_all_scores.csv


|  | Dataset | Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | Inference Time (s) | GPU Used |
|---|---|---|---|---|---|---|---|---|
| 0 | CNN | BigBird-Pegasus | 0.071382 | 0.0 | 0.056932 | 0.786 | 718.42 | CPU |
| 1 | NewsSum | BigBird-Pegasus | 0.10718 | 0.007547 | 0.066587 | 0.781 | 513.18 | CPU |
