Install Libraries:
pip install rouge_score==0.1.2
pip install evaluate
pip install -U accelerate --quiet
pip install datasets
pip install nltk
Download NLTK Resources:
nltk.download("punkt")
nltk.download("punkt_tab")

In [1]:
import nltk
nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package punkt to /Users/user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/user/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

Part II : Dataset Loading And Exploration

Dataset Loading: Load the CNN/DailyMail dataset (abisee/cnn_dailymail, version 1.0.0) from Hugging Face with load_dataset, then convert the train and test splits to pandas DataFrames.
Sampling: Take a smaller sample of the datasets (e.g., 100 samples from train, 50 from test) to reduce computational load.
Exploration: Display the first example from the training sample, showing the article (article) and its reference summary (highlights).
Data Inspection: Print the head of the sampled train and test DataFrames to understand the dataset structure.

In [2]:
from datasets import load_dataset
import pandas as pd

# Step 1: Load the dataset from Hugging Face
dataset = load_dataset("abisee/cnn_dailymail", "1.0.0")

# Step 2: Convert train and test splits to pandas DataFrames
train_df = dataset["train"].to_pandas()
test_df = dataset["test"].to_pandas()

# Step 3: Sample smaller subsets to reduce computational load
train_sample = train_df.sample(n=100, random_state=42)
test_sample = test_df.sample(n=50, random_state=42)

# Step 4: Display the first example from the training sample
first_example = train_sample.iloc[0]
print("📰 Article:\n", first_example["article"])
print("\n📝 Reference Summary:\n", first_example["highlights"])

# Step 5: Inspect the sampled DataFrames
print("\n📊 Train Sample Head:")
print(train_sample.head())

print("\n📊 Test Sample Head:")
print(test_sample.head())

📰 Article:
 Nasa has warned of an impending asteroid pass - and says it will be the closest until 2027. The asteroid, designated 2004 BL86, will safely pass about three times the distance of Earth to the moon on January 26. It will be the closest by any known space rock this large until asteroid 1999 AN10 flies past Earth in 2027. See the Asteroid's route below . At the time of its closest approach on January 26, the asteroid will be approximately 745,000 miles (1.2 million kilometers) from Earth. Due to its orbit around the sun, the asteroid is currently only visible by astronomers with large telescopes who are located in the southern hemisphere. But by Jan. 26, the space rock's changing position will make it visible to those in the northern hemisphere. From its reflected brightness, astronomers estimate that the asteroid is about a third of a mile (0.5 kilometers) in size. At the time of its closest approach on January 26, the asteroid will be approximately 745,000 miles (1.2 million

Part III : Summarization With T5

Function Implementation: Implement the summarize_with_t5 function:
Use T5ForConditionalGeneration and AutoTokenizer from transformers.
Handle CUDA/MPS availability for GPU acceleration.
Implement batch processing using the batch_generator function.
Tokenize input articles with a “summarize: ” prefix.
Generate summaries using model.generate().
Decode generated token IDs back to text.
Clear CUDA cache (torch.cuda.empty_cache()) and garbage collect (gc.collect()) after each batch and at the end of the function.
Summary Generation: Generate summaries for the training sample using t5-small.
Result Display: Display the generated summaries alongside the reference summaries in a pandas DataFrame.
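
The batching step can be checked in isolation before any model is loaded. The sketch below uses placeholder strings (an assumption, not the real articles) to show how 100 samples with a batch size of 8 split into 13 batches, with a shorter final batch.

```python
def batch_generator(data, batch_size):
    """Yield consecutive slices of `data` holding at most `batch_size` items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

# Hypothetical stand-ins for the 100 sampled articles.
articles = [f"article {i}" for i in range(100)]

batches = list(batch_generator(articles, batch_size=8))
batch_sizes = [len(batch) for batch in batches]

print(len(batches))     # 13 batches: 12 full ones plus a remainder
print(batch_sizes[-1])  # the last batch holds the remaining 4 articles
```

Because the last batch is smaller, any code that consumes the batches should not assume a fixed batch size.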

In [3]:
import torch
import gc
from transformers import T5ForConditionalGeneration, AutoTokenizer
import pandas as pd

def batch_generator(data, batch_size):
    """Yield batches of data"""
    for i in range(0, len(data), batch_size):
        yield data[i : i + batch_size]

def summarize_with_t5(texts, batch_size=8):
    """
    Summarizes a list of articles using T5-small model in batches.
    
    Args:
        texts (list of str): Input articles to summarize.
        batch_size (int): Number of samples to process per batch.
    
    Returns:
        List of generated summaries.
    """
    # Load model and tokenizer
    model_name = "t5-small"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    
    # Use GPU if available (CUDA or Apple MPS)
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():  # For Mac MPS
        device = torch.device("mps")
    else:
        device = torch.device("cpu")
    model.to(device)
    model.eval()
    
    summaries = []
    
    for batch in batch_generator(texts, batch_size):
        # Add "summarize: " prefix as T5 expects
        inputs = ["summarize: " + article for article in batch]
        
        # Tokenize inputs (limit max length for speed & memory)
        encoded = tokenizer(inputs, return_tensors="pt", padding=True, truncation=True, max_length=512)
        input_ids = encoded.input_ids.to(device)
        attention_mask = encoded.attention_mask.to(device)
        
        # Generate summaries
        with torch.no_grad():
            outputs = model.generate(input_ids=input_ids,
                                     attention_mask=attention_mask,
                                     max_length=150,      # max summary length
                                     num_beams=4,         # beam search for better summaries
                                     early_stopping=True)
        
        # Decode to text
        decoded_summaries = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in outputs]
        summaries.extend(decoded_summaries)
        
        # Clear cache & garbage collect
        if device.type == "cuda":
            torch.cuda.empty_cache()
        gc.collect()
    
    # Final cleanup
    if device.type == "cuda":
        torch.cuda.empty_cache()
    gc.collect()
    
    return summaries

#Summary Generation
texts_to_summarize = train_sample["article"].tolist()
reference_summaries = train_sample["highlights"].tolist()

generated_summaries = summarize_with_t5(texts_to_summarize, batch_size=8)

#Display results side-by-side
df_results = pd.DataFrame({
    "Article": texts_to_summarize,
    "Reference Summary": reference_summaries,
    "Generated Summary": generated_summaries
})

print(df_results.head())


                                             Article  \
0  Nasa has warned of an impending asteroid pass ...   
1  BAGHDAD, Iraq (CNN) -- Iraq's most powerful Su...   
2  By . David Kent . Andy Carroll has taken an un...   
3  Los Angeles (CNN) -- Los Angeles has long been...   
4  London (CNN) -- Few shows can claim such an au...   

                                   Reference Summary  \
0  2004 BL86 will pass about three times the dist...   
1  Iraqi Islamic Party calls Quran incident "blat...   
2  Carroll takes to Instagram to post selfie ahea...   
3  Pop stars from all over Europe are setting the...   
4  NEW: Young athletes light the Olympic cauldron...   

                                   Generated Summary  
0  the asteroid, designated 2004 BL86, will pass ...  
1  a sniper section leader used a Quran for targe...  
2  the striker is out for four months after teari...  
3  aspiring pop stars from the u.s. and abroad ar...  
4  a billion people around the world would be glu..

Part IV : Accuracy Evaluation

Accuracy Calculation: Calculate the accuracy of the t5-small summaries by comparing them to the reference summaries.
Result Interpretation: Print the calculated accuracy. Discuss why the accuracy is likely to be very low or zero, reinforcing the limitations of this metric.

In [4]:
# Calculate exact-match accuracy
correct = sum(gen.strip() == ref.strip() for gen, ref in zip(generated_summaries, reference_summaries))
total = len(reference_summaries)
accuracy = correct / total

print(f"Exact-match Accuracy: {accuracy:.4f}")


Exact-match Accuracy: 0.0000


Exact-match accuracy is expected to be close to 0 because summarization is judged on meaning, not on the generated words being exactly the same as the reference.
A generated summary is therefore very unlikely to match its reference string character for character. Better metrics include ROUGE or human evaluation.
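
To make this concrete, here is a minimal sketch with made-up sentence pairs (not taken from the dataset). Both predictions are close paraphrases of their references, yet exact-match accuracy is still zero.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after stripping whitespace."""
    pairs = list(zip(predictions, references))
    return sum(pred.strip() == ref.strip() for pred, ref in pairs) / len(pairs)

# Illustrative pairs: the first differs only in case and punctuation,
# the second swaps a single word.
references = ["The cat sat on the mat.", "Dogs are great pets."]
predictions = ["the cat sat on the mat", "Dogs make great pets."]

accuracy = exact_match_accuracy(predictions, references)
print(accuracy)  # 0.0: neither string is character-for-character identical
```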

Part V : ROUGE Metric Implementation

Metric Introduction: Introduce ROUGE (Recall-Oriented Understudy for Gisting Evaluation) as a standard metric for summarization.
Library Usage: Load the rouge evaluation metric using evaluate.load("rouge").
Preprocessing: Explain the need to format the input summaries with newlines between sentences, and the use of the nltk sentence tokenizer.
Function Definition: Create the compute_rouge_score function to calculate ROUGE scores, handling the necessary preprocessing.
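
The newline formatting matters for the rougeLsum variant, which treats each newline-separated line as one sentence. The sketch below shows the shape of that preprocessing using a naive regex splitter as a stand-in for nltk.sent_tokenize. The helper name split_sentences is hypothetical, and the regex is much cruder than the NLTK tokenizer (it will mis-split abbreviations such as "U.S.").

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def preprocess_summary(summary):
    """Put each sentence on its own line, the layout rougeLsum expects."""
    return "\n".join(split_sentences(summary))

formatted = preprocess_summary("The cat sat on the mat. The dog ran away!")
print(formatted)
# The cat sat on the mat.
# The dog ran away!
```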

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a standard automatic metric for evaluating summarization systems. It measures overlap between the generated summaries and reference summaries, focusing on:

ROUGE-1: Overlap of unigrams (single words)
ROUGE-2: Overlap of bigrams (pairs of words)
ROUGE-L: Longest common subsequence
It’s widely used because it correlates better with human judgment than exact-match accuracy.
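
The three variants can be sketched in a few lines of plain Python. This is a simplified illustration, assuming lowercase alphanumeric tokenization, no stemming and F1 as the reported value. It is not the implementation used by the evaluate library, and the function names are chosen here for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphanumeric runs (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def f1(matches, pred_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if matches == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = matches / pred_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)

def rouge_n(prediction, reference, n):
    """F1 over clipped n-gram matches (ROUGE-1 for n=1, ROUGE-2 for n=2)."""
    pred, ref = tokenize(prediction), tokenize(reference)
    pred_ngrams = Counter(zip(*(pred[i:] for i in range(n))))
    ref_ngrams = Counter(zip(*(ref[i:] for i in range(n))))
    matches = sum((pred_ngrams & ref_ngrams).values())
    return f1(matches, sum(pred_ngrams.values()), sum(ref_ngrams.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(prediction, reference):
    """F1 based on the longest common subsequence of tokens."""
    pred, ref = tokenize(prediction), tokenize(reference)
    return f1(lcs_length(pred, ref), len(pred), len(ref))

prediction = "The cat on mat."
reference = "The cat sat on the mat."
print(round(100 * rouge_n(prediction, reference, 1), 2))  # 80.0
print(round(100 * rouge_n(prediction, reference, 2), 2))  # 25.0
print(round(100 * rouge_l(prediction, reference), 2))     # 80.0
```

These values agree with the scores the notebook reports for the same sentence pair in the N-gram analysis of Part VI.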

In [5]:
import evaluate
rouge = evaluate.load("rouge")

def preprocess_summary(summary):
    sentences = nltk.sent_tokenize(summary)
    return "\n".join(sentences)

def compute_rouge_score(predictions, references):
    """
    Compute ROUGE scores between generated summaries and reference summaries.

    Args:
        predictions (list of str): Generated summaries.
        references (list of str): Reference summaries.

    Returns:
        dict: ROUGE scores.
    """
    # Preprocess summaries to have newlines between sentences
    preds = [preprocess_summary(p) for p in predictions]
    refs = [preprocess_summary(r) for r in references]
    
    # Compute ROUGE
    results = rouge.compute(predictions=preds, references=refs)
    
    # Return nicely rounded scores
    return {k: round(v * 100, 2) for k, v in results.items()}


Part VI : Understanding ROUGE Scores

Exact Match Test: Calculate ROUGE scores when the generated summaries are identical to the reference summaries.
Null Prediction Test: Calculate ROUGE scores when the generated summaries are empty.
Stemming Effect: Demonstrate the impact of stemming on ROUGE scores using simple examples.
N-gram Analysis: Explore how ROUGE-1 and ROUGE-2 scores change with varying degrees of overlap between generated and reference summaries.
Symmetry: Show that ROUGE scores are symmetric with respect to swapping predictions and references.

In [6]:
#1. Exact Match Test

exact_refs = [
    "The cat sat on the mat.",
    "Dogs are great pets.",
    "The quick brown fox jumps over the lazy dog."
]
exact_preds = exact_refs.copy()  # identical predictions

exact_match_scores = compute_rouge_score(exact_preds, exact_refs)
print("Exact Match ROUGE Scores:")
for metric, score in exact_match_scores.items():
    print(f"{metric}: {score}")

#Expected: ROUGE scores should be very high (close to 100), since predictions perfectly match references.

Exact Match ROUGE Scores:
rouge1: 100.0
rouge2: 100.0
rougeL: 100.0
rougeLsum: 100.0


In [7]:
#2. Null Prediction Test
null_preds = ["", "", ""]  # empty predictions

null_scores = compute_rouge_score(null_preds, exact_refs)
print("\nNull Prediction ROUGE Scores:")
for metric, score in null_scores.items():
    print(f"{metric}: {score}")

#Expected: ROUGE scores will be very low or zero, reflecting no overlap.


Null Prediction ROUGE Scores:
rouge1: 0.0
rouge2: 0.0
rougeL: 0.0
rougeLsum: 0.0


In [8]:
#3. Stemming Effect
refs_stem = ["The cat is running fast."]
preds_stem_1 = ["The cat is running fast."]  # exact match
preds_stem_2 = ["The cat runs fast."]        # different form of 'run'

scores_stem_1 = compute_rouge_score(preds_stem_1, refs_stem)
scores_stem_2 = compute_rouge_score(preds_stem_2, refs_stem)

print("\nStemming Effect ROUGE Scores:")
print("Exact match:")
for metric, score in scores_stem_1.items():
    print(f"{metric}: {score}")

print("\nDifferent verb form ('running' vs 'runs'):")
for metric, score in scores_stem_2.items():
    print(f"{metric}: {score}")

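#Note: compute_rouge_score does not pass use_stemmer=True to rouge.compute, so 'running' and 'runs' are treated as different tokens and the second case scores noticeably lower. With stemming enabled, the two forms would be expected to match.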


Stemming Effect ROUGE Scores:
Exact match:
rouge1: 100.0
rouge2: 100.0
rougeL: 100.0
rougeLsum: 100.0

Different verb form ('running' vs 'runs'):
rouge1: 66.67
rouge2: 28.57
rougeL: 66.67
rougeLsum: 66.67
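
The 66.67 reported for ROUGE-1 is what unstemmed unigram matching gives for this pair. The sketch below compares that with a stemmed variant. It uses a tiny hypothetical lookup table (TOY_STEMS) as a stand-in for a real stemmer, so it only illustrates the direction of the effect and is not the library's behaviour.

```python
import re
from collections import Counter

# Hypothetical lookup standing in for a real stemmer such as Porter's.
TOY_STEMS = {"running": "run", "runs": "run"}

def tokenize(text, use_stemmer=False):
    """Lowercase, keep alphanumeric runs, optionally map words to toy stems."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if use_stemmer:
        tokens = [TOY_STEMS.get(token, token) for token in tokens]
    return tokens

def rouge1_f1(prediction, reference, use_stemmer=False):
    """Unigram-overlap F1 between a prediction and a reference."""
    pred = Counter(tokenize(prediction, use_stemmer))
    ref = Counter(tokenize(reference, use_stemmer))
    matches = sum((pred & ref).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(pred.values())
    recall = matches / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "The cat is running fast."
prediction = "The cat runs fast."

plain = round(100 * rouge1_f1(prediction, reference), 2)
stemmed = round(100 * rouge1_f1(prediction, reference, use_stemmer=True), 2)
print(plain)    # 66.67
print(stemmed)  # 88.89
```

In the notebook itself, the equivalent switch is the use_stemmer=True argument that compute_rouge_per_row passes to rouge.compute.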


In [9]:
#4. N-gram Analysis
refs_ngrams = ["The cat sat on the mat."]
preds_full = ["The cat sat on the mat."]  # full overlap
preds_partial_1 = ["The cat sat on the."] # missing last word
preds_partial_2 = ["The cat on mat."]     # missing multiple words

print("\nN-gram Overlap ROUGE Scores:")

print("\nFull overlap:")
print(compute_rouge_score([preds_full[0]], [refs_ngrams[0]]))

print("\nPartial overlap 1 (missing one word):")
print(compute_rouge_score([preds_partial_1[0]], [refs_ngrams[0]]))

print("\nPartial overlap 2 (missing multiple words):")
print(compute_rouge_score([preds_partial_2[0]], [refs_ngrams[0]]))

#Expected: ROUGE-1 scores decrease with less overlap, ROUGE-2 scores decrease even more because bigram matches are fewer.


N-gram Overlap ROUGE Scores:

Full overlap:
{'rouge1': np.float64(100.0), 'rouge2': np.float64(100.0), 'rougeL': np.float64(100.0), 'rougeLsum': np.float64(100.0)}

Partial overlap 1 (missing one word):
{'rouge1': np.float64(90.91), 'rouge2': np.float64(88.89), 'rougeL': np.float64(90.91), 'rougeLsum': np.float64(90.91)}

Partial overlap 2 (missing multiple words):
{'rouge1': np.float64(80.0), 'rouge2': np.float64(25.0), 'rougeL': np.float64(80.0), 'rougeLsum': np.float64(80.0)}


In [10]:
#5. Symmetry Check
refs_sym = ["The cat sat on the mat."]
preds_sym = ["The cat sat on the mat."]

scores_ref_pred = compute_rouge_score(preds_sym, refs_sym)
scores_pred_ref = compute_rouge_score(refs_sym, preds_sym)

print("\nSymmetry Check:")
print("ROUGE(predictions, references):", scores_ref_pred)
print("ROUGE(references, predictions):", scores_pred_ref)

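#Expected: Scores are equal. Note that predictions and references are identical here, so this check is trivially satisfied; a pair of different sentences would be a stronger test.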


Symmetry Check:
ROUGE(predictions, references): {'rouge1': np.float64(100.0), 'rouge2': np.float64(100.0), 'rougeL': np.float64(100.0), 'rougeLsum': np.float64(100.0)}
ROUGE(references, predictions): {'rouge1': np.float64(100.0), 'rouge2': np.float64(100.0), 'rougeL': np.float64(100.0), 'rougeLsum': np.float64(100.0)}
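
A more informative symmetry check uses two different sentences. In the sketch below (a simplified unigram ROUGE-1 written for illustration, not the library implementation), swapping the arguments exchanges precision and recall, while the F1 score stays the same.

```python
import re
from collections import Counter

def rouge1_scores(prediction, reference):
    """Return (precision, recall, f1) for unigram overlap; inputs must be non-empty."""
    pred = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    matches = sum((pred & ref).values())
    precision = matches / sum(pred.values())
    recall = matches / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall) if matches else 0.0
    return precision, recall, f1

shorter = "The cat on mat."
longer = "The cat sat on the mat."

p_a, r_a, f1_a = rouge1_scores(shorter, longer)
p_b, r_b, f1_b = rouge1_scores(longer, shorter)

print(round(p_a, 4), round(r_a, 4), round(f1_a, 4))  # 1.0 0.6667 0.8
print(round(p_b, 4), round(r_b, 4), round(f1_b, 4))  # 0.6667 1.0 0.8
```

The symmetry therefore applies to the F-measure only. Precision and recall taken individually are not symmetric.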


Part VII : Comparing Small And Large Models

Model Selection: Choose t5-small, t5-base, and gpt2 models.
Summary Generation: Generate summaries for the training sample using each model.
ROUGE Calculation: Calculate ROUGE scores for each model’s summaries using compute_rouge_score.
Per-Row ROUGE: Create the compute_rouge_per_row function to calculate and store ROUGE scores for each individual article in a DataFrame.
Result Display: Display the per-row ROUGE scores for each model.
GPT2 Specifics: Implement the summarize_with_gpt2 function, handling the “TL;DR:” prompt and the token length limitations.

In [11]:
import torch
import pandas as pd
from transformers import (
    AutoTokenizer, 
    T5ForConditionalGeneration, 
    GPT2LMHeadModel, 
    GPT2Tokenizer
)
import nltk
import evaluate
import gc

nltk.download("punkt")
rouge = evaluate.load("rouge")

[nltk_data] Downloading package punkt to /Users/user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [12]:
import torch

def get_device():
    if torch.cuda.is_available():
        print("Using CUDA GPU")
        return torch.device("cuda")
    elif torch.backends.mps.is_available():
        try:
            # Optional test to confirm MPS usability
            _ = torch.tensor([1.0], device="mps")
            print("Using MPS GPU (macOS)")
            return torch.device("mps")
        except Exception as e:
            print(f"MPS available but not usable: {e}")
    print("Using CPU")
    return torch.device("cpu")


In [13]:
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from tqdm import tqdm

def summarize_with_t5_model(model_name, texts, batch_size=8, max_input_length=512, max_output_length=150):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    device = get_device()
    model.to(device)
    model.eval()

    summaries = []
    for i in tqdm(range(0, len(texts), batch_size), desc=f"Summarizing with {model_name}"):
        batch = texts[i:i+batch_size]
        inputs = ["summarize: " + text for text in batch]
        encoding = tokenizer(inputs, padding=True, truncation=True, max_length=max_input_length, return_tensors="pt").to(device)

        with torch.no_grad():
            output = model.generate(
                input_ids=encoding["input_ids"],
                attention_mask=encoding["attention_mask"],
                max_length=max_output_length,
                num_beams=2,  # Lowered for speed
                early_stopping=True
            )

        decoded = [tokenizer.decode(ids, skip_special_tokens=True) for ids in output]
        summaries.extend(decoded)

    return summaries



In [14]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
from tqdm import tqdm

def summarize_with_gpt2(texts, batch_size=4, max_input_length=512, max_output_length=50):
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    device = get_device()
    model.to(device)
    model.eval()

    summaries = []

    for i in tqdm(range(0, len(texts), batch_size), desc="Summarizing with GPT-2"):
        batch = texts[i:i+batch_size]
        prompts = ["TL;DR: " + text.strip() for text in batch]
        inputs = tokenizer(prompts, padding=True, truncation=True, max_length=max_input_length, return_tensors="pt").to(device)

        with torch.no_grad():
            output = model.generate(
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                max_new_tokens=max_output_length,
                num_beams=2,
                early_stopping=True,
                pad_token_id=tokenizer.eos_token_id
            )

        decoded = tokenizer.batch_decode(output, skip_special_tokens=True)
        summaries.extend([text.split("TL;DR:")[-1].strip() for text in decoded])

    return summaries
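
One detail of the function above is worth illustrating. With "TL;DR:" placed before the article, splitting the decoded text on the marker returns the article together with the continuation. A commonly used alternative is to append the marker after the article and keep only the text generated beyond the prompt. The sketch below shows both behaviours on plain strings, with no model involved. The helper names (build_tldr_prompt, extract_summary), the TLDR_MARKER constant and the example continuations are hypothetical.

```python
TLDR_MARKER = "\nTL;DR:"

def build_tldr_prompt(article):
    """Append the marker so the model continues after the article."""
    return article.strip() + TLDR_MARKER

def extract_summary(decoded_text, prompt):
    """Keep only the text generated beyond the prompt."""
    if decoded_text.startswith(prompt):
        return decoded_text[len(prompt):].strip()
    # Fallback: take whatever follows the last marker.
    return decoded_text.rsplit("TL;DR:", 1)[-1].strip()

article = "Nasa has warned of an impending asteroid pass."

# Marker as a suffix: only the (made-up) continuation is returned.
prompt = build_tldr_prompt(article)
decoded = prompt + " An asteroid will pass Earth safely."
summary = extract_summary(decoded, prompt)
print(summary)  # An asteroid will pass Earth safely.

# Marker as a prefix, as in summarize_with_gpt2: the article leaks through.
prefix_decoded = "TL;DR: " + article + " Some continuation."
leaked = prefix_decoded.split("TL;DR:")[-1].strip()
print(leaked)  # Nasa has warned of an impending asteroid pass. Some continuation.
```

If the suffix form were used with truncation, the article would need to be shortened before the marker is appended. Otherwise the marker itself could be cut off.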



In [15]:
from nltk.tokenize import sent_tokenize
from tqdm import tqdm
import evaluate
import pandas as pd

rouge = evaluate.load("rouge")

def preprocess_summary(summary):
    return "\n".join(sent_tokenize(summary.strip()))

def compute_rouge_per_row(predictions, references):
    rows = []
    for pred, ref in tqdm(zip(predictions, references), total=len(predictions), desc="Computing ROUGE"):
        score = rouge.compute(
            predictions=[preprocess_summary(pred)],
            references=[preprocess_summary(ref)],
            use_stemmer=True
        )
        rows.append({
            "ROUGE-1": round(score["rouge1"] * 100, 2),
            "ROUGE-2": round(score["rouge2"] * 100, 2),
            "ROUGE-L": round(score["rougeL"] * 100, 2),
        })
    return pd.DataFrame(rows)


In [16]:
texts = train_sample["article"].tolist()
references = train_sample["highlights"].tolist()

def evaluate_models(texts, references):
    summaries_small = summarize_with_t5_model("t5-small", texts)
    df_small = compute_rouge_per_row(summaries_small, references)
    df_small.columns = ["ROUGE-1 (t5-small)", "ROUGE-2 (t5-small)", "ROUGE-L (t5-small)"]

    summaries_base = summarize_with_t5_model("t5-base", texts)
    df_base = compute_rouge_per_row(summaries_base, references)
    df_base.columns = ["ROUGE-1 (t5-base)", "ROUGE-2 (t5-base)", "ROUGE-L (t5-base)"]

    summaries_gpt2 = summarize_with_gpt2(texts)
    df_gpt2 = compute_rouge_per_row(summaries_gpt2, references)
    df_gpt2.columns = ["ROUGE-1 (gpt2)", "ROUGE-2 (gpt2)", "ROUGE-L (gpt2)"]

    df_results = pd.concat([df_small, df_base, df_gpt2], axis=1)

    model_outputs = {
        "t5-small": summaries_small,
        "t5-base": summaries_base,
        "gpt2": summaries_gpt2
    }

    return df_results, model_outputs

df_results, model_outputs = evaluate_models(texts, references)
print(df_results.head())


Using MPS GPU (macOS)


Summarizing with t5-small: 100%|██████████| 13/13 [00:57<00:00,  4.42s/it]
Computing ROUGE: 100%|██████████| 100/100 [00:04<00:00, 22.36it/s]


Using MPS GPU (macOS)


Summarizing with t5-base: 100%|██████████| 13/13 [01:22<00:00,  6.35s/it]
Computing ROUGE: 100%|██████████| 100/100 [00:04<00:00, 23.76it/s]


Using MPS GPU (macOS)


Summarizing with GPT-2:   4%|▍         | 1/25 [00:06<02:31,  6.33s/it]
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.

   ROUGE-1 (t5-small)  ROUGE-2 (t5-small)  ROUGE-L (t5-small)  \
0               47.50               33.33               42.50   
1               26.97               13.79               20.22   
2               25.64                5.26               12.82   
3               34.92               22.58               26.98   
4               36.36                6.98               29.55   

   ROUGE-1 (t5-base)  ROUGE-2 (t5-base)  ROUGE-L (t5-base)  ROUGE-1 (gpt2)  \
0              47.37              35.14              44.74           16.44   
1              43.96              20.22              26.37           16.12   
2              28.99              14.93              23.19           11.55   
3              38.38              22.68              34.34           17.37   
4              33.77               8.00              23.38           11.43   

   ROUGE-2 (gpt2)  ROUGE-L (gpt2)  
0           14.93           16.44  
1            8.30           11.57  
2            6.00            9.1




Part VIII : Comparing All Models

Aggregation Function: Create the compare_models function to aggregate ROUGE scores for all models into a single DataFrame, showing average scores.
Summary Comparison Function: Create the compare_models_summaries function to display the generated summaries from all models side-by-side in a DataFrame.
Result Display: Display the aggregated ROUGE scores and the side-by-side summary comparisons.
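
The aggregation itself is a plain mean over per-row scores. As a small sketch, the snippet below averages the ROUGE-1 values from the five rows visible in the per-row table of Part VII. The model-wide averages are computed over all 100 rows, so these five-row means are only illustrative and will not equal them.

```python
from statistics import mean

# ROUGE-1 values copied from the first five rows of the per-row table.
per_row_rouge1 = {
    "t5-small": [47.50, 26.97, 25.64, 34.92, 36.36],
    "t5-base": [47.37, 43.96, 28.99, 38.38, 33.77],
    "gpt2": [16.44, 16.12, 11.55, 17.37, 11.43],
}

averages = {model: round(mean(scores), 2) for model, scores in per_row_rouge1.items()}
for model, average in averages.items():
    print(f"{model}: {average}")
# t5-small: 34.28
# t5-base: 38.49
# gpt2: 14.58
```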

In [17]:
def compare_models(reference_summaries, model_outputs: dict):
    """
    Compute average ROUGE scores for each model and return a DataFrame.
    
    Args:
        reference_summaries (list of str): Ground-truth summaries.
        model_outputs (dict): Keys are model names, values are lists of generated summaries.
    
    Returns:
        pd.DataFrame: Model-wise average ROUGE scores.
    """
    avg_scores = {}
    for model_name, preds in model_outputs.items():
        df_rouge = compute_rouge_per_row(preds, reference_summaries)
        avg = df_rouge.mean().round(2)
        avg_scores[model_name] = avg.values  # ROUGE-1, ROUGE-2, ROUGE-L

    return pd.DataFrame(avg_scores, index=["ROUGE-1", "ROUGE-2", "ROUGE-L"]).T


In [18]:
def compare_models_summaries(texts, reference_summaries, model_outputs: dict, n=5):
    """
    Show sample article, reference summary, and summaries from each model side-by-side.

    Args:
        texts (list of str): Input articles.
        reference_summaries (list of str): Ground-truth summaries.
        model_outputs (dict): Dict of model_name: list of summaries.
        n (int): Number of rows to display.
    
    Returns:
        pd.DataFrame: Comparison of summaries per article.
    """
    data = {
        "Article": texts[:n],
        "Reference": reference_summaries[:n]
    }
    for model_name, summaries in model_outputs.items():
        data[model_name] = summaries[:n]
    
    return pd.DataFrame(data)


In [19]:
# Inputs
texts = train_sample["article"].tolist()
references = train_sample["highlights"].tolist()

# ROUGE score comparison
df_avg_rouge = compare_models(references, model_outputs)
print("🔢 Average ROUGE Scores for All Models:")
print(df_avg_rouge)

# Side-by-side summary comparison
df_sample_summaries = compare_models_summaries(texts, references, model_outputs, n=5)
print("\n📝 Sample Summary Comparisons:")
print(df_sample_summaries)

Computing ROUGE: 100%|██████████| 100/100 [00:04<00:00, 24.02it/s]
Computing ROUGE: 100%|██████████| 100/100 [00:04<00:00, 23.85it/s]
Computing ROUGE: 100%|██████████| 100/100 [00:05<00:00, 19.57it/s]

🔢 Average ROUGE Scores for All Models:
          ROUGE-1  ROUGE-2  ROUGE-L
t5-small    37.45    17.17    27.30
t5-base     39.92    19.71    29.14
gpt2        16.97     9.04    11.80

📝 Sample Summary Comparisons:
                                             Article  \
0  Nasa has warned of an impending asteroid pass ...   
1  BAGHDAD, Iraq (CNN) -- Iraq's most powerful Su...   
2  By . David Kent . Andy Carroll has taken an un...   
3  Los Angeles (CNN) -- Los Angeles has long been...   
4  London (CNN) -- Few shows can claim such an au...   

                                           Reference  \
0  2004 BL86 will pass about three times the dist...   
1  Iraqi Islamic Party calls Quran incident "blat...   
2  Carroll takes to Instagram to post selfie ahea...   
3  Pop stars from all over Europe are setting the...   
4  NEW: Young athletes light the Olympic cauldron...   

                                            t5-small  \
0  2004 BL86 will pass about three times the dist...   





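T5 models (especially fine-tuned versions) are trained with summarization among their text-to-text tasks, which is why the "summarize: " prefix works, so they naturally produce summaries that align better with the reference highlights and yield higher ROUGE scores.
GPT-2, on the other hand, is a general-purpose language model trained only to predict the next token, not specifically to summarize. When prompted to summarize, it can generate plausible text, but that text generally won't match the references as closely as output from a model trained for summarization.
The observed ROUGE scores, with both T5 models scoring roughly twice as high as GPT-2, are therefore expected. The GPT-2 numbers are probably also depressed by the setup itself: the run log warns about right-padding with a decoder-only model, and because "TL;DR:" is placed before the article, the text kept after the marker includes the article as well as the continuation.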