## Exercises XP: W7_D3

#### What You’ll Learn

- **Practical LLM Evaluation:** Gain hands-on experience evaluating LLMs for summarization.  
- **Metric Deep Dive:** Understand the strengths and weaknesses of various evaluation metrics (accuracy, ROUGE).  
- **Model Comparison:** Learn to systematically compare different LLMs and model sizes.  
- **Hugging Face Proficiency:** Enhance your skills in using Hugging Face’s *transformers* and *evaluate* libraries.  
- **Customization:** Implement and analyze the effects of modifying evaluation metrics and model parameters.  
- **Data Handling:** Learn how to load, process, and sample text datasets using pandas.  
- **Text Preprocessing:** Understand the importance of text preprocessing for NLP tasks.  
- **Debugging and Analysis:** Develop skills in debugging and analyzing LLM outputs.  

---

#### What You Will Create

- **Evaluation Scripts:** Python scripts to calculate and compare summarization metrics.  
- **Comparative Reports:** DataFrames and visualizations summarizing the performance of different LLMs.  
- **Modified Evaluation Metrics:** Custom accuracy metrics tailored for summarization.  
- **Summarization Outputs:** Generated summaries from various LLMs for comparative analysis.  
- **Analytical Reports:** Documentation of findings, including discussions on metric behavior and model performance.  
- **Custom Functions:** Functions to load datasets, generate summaries, and compute ROUGE scores.  
- **Model Comparison Tables:** Tables comparing performance of different LLMs across various metrics.  

---

All of today’s exercises are part of a single, hands-on tutorial designed to teach you how to evaluate LLMs on summarization tasks. Together, you will:

- Measure accuracy on summary outputs  
- Compute ROUGE-N scores  
- Build a consistent framework for comparing different model sizes and architectures  

Each part builds on the last, giving you a cohesive workflow for assessing and contrasting summarization performance.

---

#### Learning Objectives

- **Metric Understanding:** Learn to compute ROUGE-N and understand its nuances.  
- **Intuition Building:** Develop an intuitive understanding of ROUGE-N and its application to summarization.  
- **Comparative Analysis:** Test and compare various LLMs and model sizes on a consistent dataset.  

#### Part II: Dataset Loading and Exploration

- **Dataset Loading:** Load the *train.csv* and *test.csv* datasets using pandas.  
- **Sampling:** Take a smaller sample of the datasets (e.g., 100 samples from train, 50 from test) to reduce computational load.  
- **Exploration:** Display the first example from the training sample, showing the article (*prompt_text*) and its reference summary (*prompt_title*).  
- **Data Inspection:** Print the sampled train and test DataFrames to understand the dataset structure.

---

#### Part III: Summarization with T5

- **Function Implementation:** Implement the *summarize_with_t5* function:
  - Use *T5ForConditionalGeneration* and *AutoTokenizer* from *transformers*.
  - Handle CUDA availability for GPU acceleration.
  - Implement batch processing using the *batch_generator* function.
  - Tokenize input articles with a “summarize: ” prefix.
  - Generate summaries using *model.generate()*.
  - Decode generated token IDs back to text.
  - Clear CUDA cache (*torch.cuda.empty_cache()*) and garbage collect (*gc.collect()*) after each batch and at the end of the function.
- **Summary Generation:** Generate summaries for the training sample using **t5-small**.
- **Result Display:** Display the generated summaries alongside the reference summaries in a pandas DataFrame.
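
The list above mentions a *batch_generator* helper, but the notebook never defines one; the implementation further down slices the list inline instead. A minimal sketch of what such a helper could look like, assuming it only needs to yield consecutive slices of a list (the signature is an assumption, not something the exercise specifies):

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batch_generator(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Five placeholder "articles" split into batches of two; the last batch is shorter.
articles = ["article A", "article B", "article C", "article D", "article E"]
batches = list(batch_generator(articles, batch_size=2))
print(batches)
```

Inside *summarize_with_t5*, the loop `for i in range(0, len(texts), batch_size)` could then be written as `for batch_texts in batch_generator(texts, batch_size)`.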

---

#### Part IV: Accuracy Evaluation

- **Accuracy Calculation:** Calculate the accuracy of the **t5-small** summaries by comparing them to the reference summaries.  
- **Result Interpretation:** Print the calculated accuracy. Discuss why the accuracy is likely to be very low or zero, reinforcing the limitations of this metric.

#### Part V: ROUGE Metric Implementation

- **Metric Introduction:** Introduce **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)** as a standard metric for summarization.  
- **Library Usage:** Load the ROUGE evaluation metric using *evaluate.load("rouge")*.  
- **Preprocessing:** Explain the need to format the input summaries with newlines between sentences, and the use of the NLTK sentence tokenizer.  
- **Function Definition:** Create the *compute_rouge_score* function to calculate ROUGE scores, handling the necessary preprocessing.

---

#### Part VI: Understanding ROUGE Scores

- **Exact Match Test:** Calculate ROUGE scores when the generated summaries are identical to the reference summaries.  
- **Null Prediction Test:** Calculate ROUGE scores when the generated summaries are empty.  
- **Stemming Effect:** Demonstrate the impact of stemming on ROUGE scores using simple examples.  
- **N-gram Analysis:** Explore how ROUGE-1 and ROUGE-2 scores change with varying degrees of overlap between generated and reference summaries.  
- **Symmetry:** Show the symmetry of ROUGE score with respect to predictions and references.

---

#### Part VII: Comparing Small and Large Models

- **Model Selection:** Choose **t5-small**, **t5-base**, and **gpt2** models.  
- **Summary Generation:** Generate summaries for the training sample using each model.  
- **ROUGE Calculation:** Calculate ROUGE scores for each model’s summaries using *compute_rouge_score*.  
- **Per-Row ROUGE:** Create the *compute_rouge_per_row* function to calculate and store ROUGE scores for each individual article in a DataFrame.  
- **Result Display:** Display the per-row ROUGE scores for each model.  
- **GPT2 Specifics:** Implement the *summarize_with_gpt2* function, handling the “TL;DR:” prompt, and the token length limitations.

---

#### Part VIII: Comparing All Models

- **Aggregation Function:** Create the *compare_models* function to aggregate ROUGE scores for all models into a single DataFrame, showing average scores.  
- **Summary Comparison Function:** Create the *compare_models_summaries* function to display the generated summaries from all models side-by-side in a DataFrame.  
- **Result Display:** Display the aggregated ROUGE scores and the side-by-side summary comparisons.
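
The notebook below builds the comparison table by hand from three score dictionaries. A hedged sketch of a *compare_models* function, assuming each model's aggregate scores arrive as a dict keyed by `rouge1`, `rouge2`, and `rougeL` (the argument layout and the placeholder numbers are illustrative assumptions, not results from this notebook):

```python
from typing import Dict, List, Sequence


def compare_models(
    scores_by_model: Dict[str, Dict[str, float]],
    metrics: Sequence[str] = ("rouge1", "rouge2", "rougeL"),
) -> List[dict]:
    """Build one table row per model from already-computed aggregate scores."""
    rows = []
    for model_name, scores in scores_by_model.items():
        row = {"Model": model_name}
        for metric in metrics:
            # float() also accepts numpy scalars such as np.float64
            row[metric] = round(float(scores[metric]), 4)
        rows.append(row)
    return rows


# Placeholder numbers for illustration only.
example_scores = {
    "model-a": {"rouge1": 0.5, "rouge2": 0.25, "rougeL": 0.4},
    "model-b": {"rouge1": 0.3, "rouge2": 0.1, "rougeL": 0.2},
}
table = compare_models(example_scores)
for row in table:
    print(row)
```

The returned list of row dicts can be passed directly to `pd.DataFrame(rows)` to obtain the DataFrame the exercise asks for.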

### Part I: Setup

#### Install Libraries

In [None]:
!pip install datasets
!pip install rouge_score==0.1.2
!pip install evaluate
!pip install -U accelerate --quiet
!pip install nltk
!pip install hf_xet

#### Download NLTK Resources

In [None]:
import nltk
nltk.download("punkt")
nltk.download("punkt_tab")

### Part II: Dataset Loading and Exploration

In [36]:
import os
import gc
import torch
import pandas as pd
from datasets import load_dataset
from transformers import T5ForConditionalGeneration, AutoTokenizer
from evaluate import load
from nltk.tokenize import sent_tokenize
import warnings
warnings.filterwarnings("ignore", message="Xet Storage is enabled")

#### Load CNN/DailyMail dataset

In [5]:
# Load the latest version (3.0.0)
dataset = load_dataset("abisee/cnn_dailymail", "3.0.0")

os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"
# Access splits
train_dataset = dataset["train"]
test_dataset = dataset["test"]

# Display first example
print(train_dataset[0])

{'article': 'LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won\'t cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don\'t plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don\'t think I\'ll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office char

#### Convert to pandas DataFrame with expected columns

In [None]:
train_df = pd.DataFrame({
    "prompt_text": train_dataset["article"],
    "prompt_title": train_dataset["highlights"]
})

test_df = pd.DataFrame({
    "prompt_text": test_dataset["article"],
    "prompt_title": test_dataset["highlights"]
})

# Show first rows
print(train_df.head())

                                         prompt_text  \
0  LONDON, England (Reuters) -- Harry Potter star...   
1  Editor's note: In our Behind the Scenes series...   
2  MINNEAPOLIS, Minnesota (CNN) -- Drivers who we...   
3  WASHINGTON (CNN) -- Doctors removed five small...   
4  (CNN)  -- The National Football League has ind...   

                                        prompt_title  
0  Harry Potter star Daniel Radcliffe gets £20M f...  
1  Mentally ill inmates in Miami are housed on th...  
2  NEW: "I thought I was going to die," driver sa...  
3  Five small polyps found during procedure; "non...  
4  NEW: NFL chief, Atlanta Falcons owner critical...  


#### Sampling train and test datasets

In [8]:
# Sample 100 rows from train, 50 rows from test
train_sample = train_df.sample(100, random_state=42).reset_index(drop=True)
test_sample = test_df.sample(50, random_state=42).reset_index(drop=True)

# Display first example from train sample
print("Article (prompt_text):\n", train_sample.loc[0, "prompt_text"])
print("\nReference Summary (prompt_title):\n", train_sample.loc[0, "prompt_title"])

# Display sampled DataFrames
print("\nTrain sample shape:", train_sample.shape)
print(train_sample.head())

print("\nTest sample shape:", test_sample.shape)
print(test_sample.head())

Article (prompt_text):
 Nasa has warned of an impending asteroid pass - and says it will be the closest until 2027. The asteroid, designated 2004 BL86, will safely pass about three times the distance of Earth to the moon on January 26. It will be the closest by any known space rock this large until asteroid 1999 AN10 flies past Earth in 2027. See the Asteroid's route below . At the time of its closest approach on January 26, the asteroid will be approximately 745,000 miles (1.2 million kilometers) from Earth. Due to its orbit around the sun, the asteroid is currently only visible by astronomers with large telescopes who are located in the southern hemisphere. But by Jan. 26, the space rock's changing position will make it visible to those in the northern hemisphere. From its reflected brightness, astronomers estimate that the asteroid is about a third of a mile (0.5 kilometers) in size. At the time of its closest approach on January 26, the asteroid will be approximately 745,000 miles 

### Part III: Summarization with T5

In [14]:
def summarize_with_t5(texts, model_name="t5-small", batch_size=4, max_length=100):
    """
    Generate summaries for a list of texts using a T5 model.
    """

    # Load model and tokenizer
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Use GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    summaries = []

    # Process in batches
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        # Add summarize prefix
        inputs = ["summarize: " + text for text in batch_texts]

        # Tokenize
        inputs_tokenized = tokenizer(
            inputs,
            max_length=512,
            truncation=True,
            padding="longest",
            return_tensors="pt"
        ).to(device)

        # Generate summaries
        outputs = model.generate(
            **inputs_tokenized,
            max_length=max_length,
            num_beams=4,
            early_stopping=True
        )

        # Decode summaries
        decoded_summaries = [tokenizer.decode(ids, skip_special_tokens=True) for ids in outputs]
        summaries.extend(decoded_summaries)

        # Clear GPU memory
        torch.cuda.empty_cache()
        gc.collect()

    return summaries

#### Generate summaries with T5-small

In [18]:
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

In [19]:
# Generate summaries for the training sample
generated_summaries = summarize_with_t5(train_sample["prompt_text"].tolist(), model_name="t5-small")

# Combine generated summaries and reference summaries into a DataFrame
results_df = train_sample.copy()
results_df["generated_summary"] = generated_summaries

# Display first rows
results_df[["prompt_text", "prompt_title", "generated_summary"]].head()

| | prompt_text | prompt_title | generated_summary |
|---|---|---|---|
| 0 | Nasa has warned of an impending asteroid pass ... | 2004 BL86 will pass about three times the dist... | the asteroid, designated 2004 BL86, will pass ... |
| 1 | BAGHDAD, Iraq (CNN) -- Iraq's most powerful Su... | Iraqi Islamic Party calls Quran incident "blat... | a sniper section leader used a Quran for targe... |
| 2 | By . David Kent . Andy Carroll has taken an un... | Carroll takes to Instagram to post selfie ahea... | the striker is out for four months after teari... |
| 3 | Los Angeles (CNN) -- Los Angeles has long been... | Pop stars from all over Europe are setting the... | aspiring pop stars from the u.s. and abroad ar... |
| 4 | London (CNN) -- Few shows can claim such an au... | NEW: Young athletes light the Olympic cauldron... | a billion people around the world would be glu... |


### Part IV: Accuracy Evaluation

In [20]:
def calculate_accuracy(reference_summaries, generated_summaries):
    """
    Calculate simple accuracy: percentage of exact matches between reference and generated summaries.
    """
    matches = sum(1 for ref, gen in zip(reference_summaries, generated_summaries) if ref.strip() == gen.strip())
    return matches / len(reference_summaries)

# Calculate accuracy
accuracy_score = calculate_accuracy(train_sample["prompt_title"], results_df["generated_summary"])
print("Accuracy Score:", accuracy_score)

Accuracy Score: 0.0


#### Accuracy Score Interpretation

The accuracy score is 0.0 because accuracy requires exact string matches between the generated summaries and the reference summaries.  

In summarization tasks, even if the generated summary is correct, it is unlikely to be identical to the human-written summary — small differences in phrasing, synonyms, or sentence structure result in an accuracy of 0.  

This demonstrates why accuracy is not a useful metric for summarization and why metrics like ROUGE are preferred, as they measure partial overlaps rather than exact matches.
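
To illustrate the point, here is a small sketch of a relaxed exact-match metric that lowercases text and strips punctuation before comparing. The `normalize` helper and the two example pairs are made up for illustration. Normalization rescues a pair that differs only in casing and punctuation, but a genuine paraphrase still counts as a miss:

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match_accuracy(references, predictions, normalized=False):
    """Fraction of pairs that are identical strings, optionally after normalizing."""
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    if not references:
        return 0.0
    prep = normalize if normalized else str.strip
    hits = sum(prep(r) == prep(p) for r, p in zip(references, predictions))
    return hits / len(references)


refs = ["The cat is on the mat.", "Asteroid will pass Earth safely."]
preds = ["the cat is on the mat", "The asteroid passes Earth at a safe distance."]

strict = exact_match_accuracy(refs, preds)                    # plain string equality
relaxed = exact_match_accuracy(refs, preds, normalized=True)  # after normalization
print(strict, relaxed)
```

Even the relaxed variant gives no credit to the second pair, although both sentences say the same thing, which is the gap that overlap-based metrics try to fill.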

### Part V: ROUGE Metric Implementation

In [22]:
# Load ROUGE metric
rouge_metric = load("rouge")

def compute_rouge_score(references, predictions):
    """
    Compute ROUGE-1, ROUGE-2, and ROUGE-L scores between reference and predicted summaries.
    Handles preprocessing with sentence tokenization and newline joining.
    """

    # Preprocess: join sentences with newline for ROUGE
    references_processed = ["\n".join(sent_tokenize(ref)) for ref in references]
    predictions_processed = ["\n".join(sent_tokenize(pred)) for pred in predictions]

    # Compute ROUGE scores
    scores = rouge_metric.compute(
        predictions=predictions_processed,
        references=references_processed,
        use_stemmer=True
    )

    return scores


#### Preprocessing for ROUGE

The ROUGE implementation treats newlines as sentence boundaries, but only for the summary-level variant ROUGE-Lsum (*rougeLsum*), which computes the longest common subsequence sentence by sentence.  
We use *nltk.sent_tokenize()* to segment each summary into sentences, then join them with *"\n"*.  
ROUGE-1, ROUGE-2, and ROUGE-L treat the newline as ordinary whitespace, so this preprocessing changes only the *rougeLsum* value.

#### Compute ROUGE scores for T5 summaries

In [23]:
# Calculate ROUGE between generated summaries and reference summaries
rouge_scores = compute_rouge_score(
    train_sample["prompt_title"].tolist(),
    results_df["generated_summary"].tolist()
)

# Display results
print("ROUGE-1:", rouge_scores["rouge1"])
print("ROUGE-2:", rouge_scores["rouge2"])
print("ROUGE-L:", rouge_scores["rougeL"])

ROUGE-1: 0.3651971923063798
ROUGE-2: 0.1707697237185407
ROUGE-L: 0.26830679191874074


#### ROUGE Score Interpretation

- **ROUGE-1 (0.37):** Unigram (single-word) overlap between generated and reference summaries.
- **ROUGE-2 (0.17):** Bigram (two-word sequence) overlap.
- **ROUGE-L (0.27):** Overlap based on the longest common subsequence.

Each value is an F-measure (the harmonic mean of n-gram precision and recall) aggregated over the 100 sampled articles, not a raw percentage of shared words. The scores come from an off-the-shelf T5-small checkpoint with inputs truncated to 512 tokens, so treat them as a rough indication rather than a benchmark result.  
ROUGE provides a more informative evaluation than accuracy, as it rewards partial matches and overlap of key information rather than exact string matches.
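
To make these numbers concrete, here is a minimal from-scratch sketch of ROUGE-N. It is not the implementation behind `evaluate.load("rouge")`: it skips stemming and bootstrap aggregation, and the function names are chosen for illustration. It only shows how clipped n-gram overlap turns into precision, recall, and F1:

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (no stemming in this sketch)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Count every contiguous n-gram in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, prediction, n=1):
    """Return precision, recall, and F1 of clipped n-gram overlap."""
    ref_counts = ngrams(tokenize(reference), n)
    pred_counts = ngrams(tokenize(prediction), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((ref_counts & pred_counts).values())
    ref_total = sum(ref_counts.values())
    pred_total = sum(pred_counts.values())
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


ref = "The quick brown fox jumps over the lazy dog."
pred = "The quick fox leaps over the dog."
r1 = rouge_n(ref, pred, n=1)
r2 = rouge_n(ref, pred, n=2)
print(round(r1["f1"], 4), round(r2["f1"], 4))  # 0.75 0.2857
```

For this sentence pair, 6 of 7 predicted unigrams and 2 of 6 predicted bigrams appear in the reference, which gives F1 values of 0.75 and about 0.29.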

### Part VI: Understanding ROUGE Scores

#### 1. Exact Match Test

In [24]:
ref = ["The cat is on the mat."]
pred = ["The cat is on the mat."]

scores_exact = compute_rouge_score(ref, pred)
print("Exact Match Test:", scores_exact)

Exact Match Test: {'rouge1': np.float64(1.0), 'rouge2': np.float64(1.0), 'rougeL': np.float64(1.0), 'rougeLsum': np.float64(1.0)}


#### 2. Null Prediction Test

In [25]:
ref = ["The cat is on the mat."]
pred = [""]

scores_null = compute_rouge_score(ref, pred)
print("Null Prediction Test:", scores_null)

Null Prediction Test: {'rouge1': np.float64(0.0), 'rouge2': np.float64(0.0), 'rougeL': np.float64(0.0), 'rougeLsum': np.float64(0.0)}


#### 3. Stemming Effect

In [26]:
ref = ["The cats are running quickly."]
pred = ["The cat runs quick."]

scores_stemming = compute_rouge_score(ref, pred)
print("Stemming Effect Test:", scores_stemming)

Stemming Effect Test: {'rouge1': np.float64(0.6666666666666665), 'rouge2': np.float64(0.28571428571428575), 'rougeL': np.float64(0.6666666666666665), 'rougeLsum': np.float64(0.6666666666666665)}
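
The cell above only reports the score with stemming enabled. To see the effect itself, the sketch below compares unigram F1 with and without stemming. `TOY_STEMS` is a hand-written lookup table standing in for a real stemmer; the stems it lists are an assumption about what a Porter-style stemmer produces for these four words, chosen to be consistent with the 0.667 reported above:

```python
import re
from collections import Counter

# Hand-written stand-in for a stemmer, covering only the words in this example.
TOY_STEMS = {"cats": "cat", "running": "run", "runs": "run", "quickly": "quickli"}


def tokens(text, use_stemmer=False):
    """Lowercase, split on non-alphanumerics, and optionally map words to stems."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [TOY_STEMS.get(w, w) for w in words] if use_stemmer else words


def unigram_f1(reference, prediction, use_stemmer=False):
    """F1 of clipped unigram overlap between reference and prediction."""
    ref = Counter(tokens(reference, use_stemmer))
    pred = Counter(tokens(prediction, use_stemmer))
    overlap = sum((ref & pred).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


ref = "The cats are running quickly."
pred = "The cat runs quick."
without = unigram_f1(ref, pred, use_stemmer=False)
with_stem = unigram_f1(ref, pred, use_stemmer=True)
print(round(without, 3), round(with_stem, 3))  # 0.222 0.667
```

Without stemming only "the" matches; with the toy stems, "cat" and "run" match as well, which triples the score.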


#### 4. N-gram Analysis

In [27]:
ref = ["The quick brown fox jumps over the lazy dog."]
pred = ["The quick fox leaps over the dog."]

scores_ngram = compute_rouge_score(ref, pred)
print("N-gram Analysis Test:", scores_ngram)

N-gram Analysis Test: {'rouge1': np.float64(0.75), 'rouge2': np.float64(0.28571428571428575), 'rougeL': np.float64(0.75), 'rougeLsum': np.float64(0.75)}


#### 5. Symmetry Test

In [28]:
ref = ["A summary about climate change and global warming."]
pred = ["Climate change and global warming summary."]

scores_symmetry_1 = compute_rouge_score(ref, pred)
scores_symmetry_2 = compute_rouge_score(pred, ref)

print("Symmetry Test (Ref→Pred):", scores_symmetry_1)
print("Symmetry Test (Pred→Ref):", scores_symmetry_2)

Symmetry Test (Ref→Pred): {'rouge1': np.float64(0.8571428571428571), 'rouge2': np.float64(0.6666666666666666), 'rougeL': np.float64(0.7142857142857143), 'rougeLsum': np.float64(0.7142857142857143)}
Symmetry Test (Pred→Ref): {'rouge1': np.float64(0.8571428571428571), 'rouge2': np.float64(0.6666666666666666), 'rougeL': np.float64(0.7142857142857143), 'rougeLsum': np.float64(0.7142857142857143)}


#### Understanding ROUGE Score Behavior

**1. Exact Match Test:**  
ROUGE-1, ROUGE-2, and ROUGE-L are 1.0 when the prediction exactly matches the reference.  
This confirms that ROUGE gives a perfect score for identical summaries.

**2. Null Prediction Test:**  
ROUGE scores are 0.0 when the prediction is empty.  
This indicates no overlap, which is expected.

**3. Stemming Effect Test:**  
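ROUGE-1 and ROUGE-L are about 0.67, and ROUGE-2 is about 0.29.  
With `use_stemmer=True`, related word forms are reduced to a common stem before matching, so "cats" matches "cat" and "running" matches "runs". The score of 0.67 is consistent with three of the four predicted tokens matching, which suggests that "quickly" and "quick" are not reduced to the same stem.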

**4. N-gram Analysis Test:**  
ROUGE-1 is higher (0.75) while ROUGE-2 is lower (0.29).  
This shows that unigram overlap is common, but bigram overlap is harder to achieve and provides a stricter measure.

**5. Symmetry Test:**  
The reported ROUGE scores are the same whether we compare Ref→Pred or Pred→Ref.  
This is because the reported value is an F-measure, which is symmetric. The underlying precision and recall are not symmetric: they swap when the two texts swap roles.
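
The symmetry result deserves a closer look. The sketch below computes unigram precision, recall, and F1 from scratch for the two sentences of the symmetry test, without stemming, which does not change the overlap for this pair. The helper name is illustrative. Swapping the arguments swaps precision and recall, while F1 stays the same:

```python
import re
from collections import Counter


def unigram_prf(reference, prediction):
    """Return (precision, recall, f1) of clipped unigram overlap."""
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    pred = Counter(re.findall(r"[a-z0-9]+", prediction.lower()))
    overlap = sum((ref & pred).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


a = "A summary about climate change and global warming."
b = "Climate change and global warming summary."

p1, r1, f1_ab = unigram_prf(a, b)  # a is the reference, b the prediction
p2, r2, f1_ba = unigram_prf(b, a)  # roles swapped

print(p1, r1, round(f1_ab, 4))  # 1.0 0.75 0.8571
print(p2, r2, round(f1_ba, 4))  # 0.75 1.0 0.8571
```

A metric that reported recall alone, as the name ROUGE suggests, would therefore not be symmetric; the symmetry seen in the test comes from using F1.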

### Part VII: Comparing Small and Large Models

#### Generate summaries with T5-base

In [33]:
import os
os.environ["HF_HUB_DISABLE_SYMLINKS_WARNING"] = "1"

In [37]:
# Generate summaries for the training sample using t5-base
generated_summaries_t5_base = summarize_with_t5(
    train_sample["prompt_text"].tolist(),
    model_name="t5-base"
)

# Store in a DataFrame for comparison
results_df_t5_base = train_sample.copy()
results_df_t5_base["generated_summary"] = generated_summaries_t5_base

# Preview
results_df_t5_base[["prompt_text", "prompt_title", "generated_summary"]].head()

| | prompt_text | prompt_title | generated_summary |
|---|---|---|---|
| 0 | Nasa has warned of an impending asteroid pass ... | 2004 BL86 will pass about three times the dist... | 2004 BL86 will pass about three times the dist... |
| 1 | BAGHDAD, Iraq (CNN) -- Iraq's most powerful Su... | Iraqi Islamic Party calls Quran incident "blat... | u.s. soldier's desecration of the holy book re... |
| 2 | By . David Kent . Andy Carroll has taken an un... | Carroll takes to Instagram to post selfie ahea... | england striker posts glum-looking selfie in h... |
| 3 | Los Angeles (CNN) -- Los Angeles has long been... | Pop stars from all over Europe are setting the... | singers from all over Europe and farther east ... |
| 4 | London (CNN) -- Few shows can claim such an au... | NEW: Young athletes light the Olympic cauldron... | "Isles of Wonder" features tributes to the Bri... |


#### Compute ROUGE scores for T5-base

In [38]:
rouge_scores_t5_base = compute_rouge_score(
    train_sample["prompt_title"].tolist(),
    results_df_t5_base["generated_summary"].tolist()
)

print("ROUGE-1:", rouge_scores_t5_base["rouge1"])
print("ROUGE-2:", rouge_scores_t5_base["rouge2"])
print("ROUGE-L:", rouge_scores_t5_base["rougeL"])

ROUGE-1: 0.4037968138277834
ROUGE-2: 0.20489791582622335
ROUGE-L: 0.2974758003233011


#### ROUGE Score Interpretation for T5-base

- **ROUGE-1 (0.40):** About 40% overlap in unigrams (single words) between the generated summaries and the reference summaries.  
- **ROUGE-2 (0.20):** About 20% overlap in bigrams (two-word sequences), showing that some short phrases are captured correctly.  
- **ROUGE-L (0.30):** About 30% overlap in the longest common subsequence, indicating that the generated summaries follow the structure of the reference summaries more closely than T5-small.

These results are higher than those of T5-small, which suggests that the larger T5-base model captures more relevant details and preserves more accurate phrasing.  
However, the scores are still moderate, meaning that there is room for improvement, possibly through fine-tuning or using even larger models.

#### Generate summaries with GPT-2

In [42]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def summarize_with_gpt2(texts, model_name="gpt2", max_length=100, batch_size=2):
    """
    Generate summaries using GPT-2 with TL;DR: prompting.
    Uses max_new_tokens to avoid input length conflicts.
    """
    # Load model and tokenizer
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    
    # GPT-2 has no pad token, set EOS token as pad
    tokenizer.pad_token = tokenizer.eos_token

    # Use GPU if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    summaries = []

    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        # Add TL;DR: prompt
        inputs = [text + "\nTL;DR:" for text in batch_texts]

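        # Tokenize inputs.
        # Caveat: truncation keeps only the first 512 tokens, so for articles
        # longer than that the trailing "TL;DR:" marker is cut off before generation.
        # GPT-2 is also decoder-only; transformers recommends left padding
        # (tokenizer.padding_side = "left") for batched generation, whereas the
        # default used here pads on the right.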
        inputs_tokenized = tokenizer(
            inputs,
            return_tensors="pt",
            truncation=True,
            padding="longest",
            max_length=512
        ).to(device)

        # Generate outputs (using max_new_tokens instead of max_length)
        outputs = model.generate(
            **inputs_tokenized,
            max_new_tokens=max_length,
            num_beams=4,
            no_repeat_ngram_size=2,
            early_stopping=True
        )

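        # Decode and post-process.
        # If "TL;DR:" was truncated away, split() finds no marker and the whole
        # decoded text (article prefix + continuation) is kept as the summary.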
        decoded_summaries = [
            tokenizer.decode(ids, skip_special_tokens=True).split("TL;DR:")[-1].strip()
            for ids in outputs
        ]
        summaries.extend(decoded_summaries)

        # Clear memory
        torch.cuda.empty_cache()
        gc.collect()

    return summaries

In [None]:
generated_summaries_gpt2 = summarize_with_gpt2(
    train_sample["prompt_text"].tolist(),
    model_name="gpt2"
)

#### Compute ROUGE for GPT-2 summaries

In [47]:
# --- Store GPT-2 summaries in a DataFrame ---

results_df_gpt2 = train_sample.copy()
results_df_gpt2["generated_summary"] = generated_summaries_gpt2

# Preview
results_df_gpt2[["prompt_text", "prompt_title", "generated_summary"]].head()

| | prompt_text | prompt_title | generated_summary |
|---|---|---|---|
| 0 | Nasa has warned of an impending asteroid pass ... | 2004 BL86 will pass about three times the dist... | Nasa has warned of an impending asteroid pass ... |
| 1 | BAGHDAD, Iraq (CNN) -- Iraq's most powerful Su... | Iraqi Islamic Party calls Quran incident "blat... | BAGHDAD, Iraq (CNN) -- Iraq's most powerful Su... |
| 2 | By . David Kent . Andy Carroll has taken an un... | Carroll takes to Instagram to post selfie ahea... | By . David Kent . Andy Carroll has taken an un... |
| 3 | Los Angeles (CNN) -- Los Angeles has long been... | Pop stars from all over Europe are setting the... | Los Angeles (CNN) -- Los Angeles has long been... |
| 4 | London (CNN) -- Few shows can claim such an au... | NEW: Young athletes light the Olympic cauldron... | London (CNN) -- Few shows can claim such an au... |


In [48]:
rouge_scores_gpt2 = compute_rouge_score(
    train_sample["prompt_title"].tolist(),
    results_df_gpt2["generated_summary"].tolist()
)

print("ROUGE-1:", rouge_scores_gpt2["rouge1"])
print("ROUGE-2:", rouge_scores_gpt2["rouge2"])
print("ROUGE-L:", rouge_scores_gpt2["rougeL"])

ROUGE-1: 0.16322296501504488
ROUGE-2: 0.0707278938022373
ROUGE-L: 0.10861939238521591


#### Per-row ROUGE score calculation

In [None]:
def compute_rouge_per_row(references, predictions):
    """
    Compute ROUGE scores per row and return as a DataFrame.
    """
    per_row_data = []

    for ref, pred in zip(references, predictions):
        score = compute_rouge_score([ref], [pred])
        per_row_data.append({
            "reference": ref,
            "prediction": pred,
            "ROUGE-1": score["rouge1"],
            "ROUGE-2": score["rouge2"],
            "ROUGE-L": score["rougeL"]
        })

    return pd.DataFrame(per_row_data)

#### Per-row ROUGE for each model

In [50]:
per_row_t5_small = compute_rouge_per_row(
    train_sample["prompt_title"].tolist(),
    results_df["generated_summary"].tolist()
)

# Per-row ROUGE for T5-base
per_row_t5_base = compute_rouge_per_row(
    train_sample["prompt_title"].tolist(),
    results_df_t5_base["generated_summary"].tolist()
)

# Per-row ROUGE for GPT-2
per_row_gpt2 = compute_rouge_per_row(
    train_sample["prompt_title"].tolist(),
    results_df_gpt2["generated_summary"].tolist()
)

# Display first few rows of one model
per_row_t5_small.head()

| | reference | prediction | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|---|
| 0 | 2004 BL86 will pass about three times the dist... | the asteroid, designated 2004 BL86, will pass ... | 0.481928 | 0.345679 | 0.409639 |
| 1 | Iraqi Islamic Party calls Quran incident "blat... | a sniper section leader used a Quran for targe... | 0.298851 | 0.141176 | 0.206897 |
| 2 | Carroll takes to Instagram to post selfie ahea... | the striker is out for four months after teari... | 0.25641 | 0.052632 | 0.128205 |
| 3 | Pop stars from all over Europe are setting the... | aspiring pop stars from the u.s. and abroad ar... | 0.29703 | 0.141414 | 0.237624 |
| 4 | NEW: Young athletes light the Olympic cauldron... | a billion people around the world would be glu... | 0.363636 | 0.069767 | 0.295455 |


#### ROUGE Score Interpretation for GPT-2

- **ROUGE-1 (0.16):** Low unigram overlap between GPT-2 outputs and the reference summaries.
- **ROUGE-2 (0.07):** Very little bigram overlap, showing that GPT-2 outputs rarely share short phrases with the references.
- **ROUGE-L (0.11):** Low longest-common-subsequence overlap, indicating that the outputs do not follow the structure of the reference summaries.

These scores are well below those of T5-small and T5-base, but they should be read with care. GPT-2 is a general-purpose language model that was not fine-tuned for summarization, and the preview above shows that its "summaries" begin with the article text itself. That is largely an artifact of *summarize_with_gpt2*: the prompt is truncated to 512 tokens, which removes the trailing "TL;DR:" marker from long articles, so the post-processing step finds nothing to split on and returns the article prefix followed by the model's continuation. The low ROUGE scores therefore reflect the pipeline as much as the model.
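
One detail of *summarize_with_gpt2* is worth isolating: the tokenizer truncates `article + "\nTL;DR:"` to 512 tokens, so for long articles the marker itself is cut off. The sketch below shows, on plain lists of integers, one way to budget the truncation so that the marker survives. The token IDs are stand-ins and `build_prompt_ids` is a hypothetical helper, not part of transformers:

```python
from typing import List


def build_prompt_ids(article_ids: List[int], marker_ids: List[int], max_length: int) -> List[int]:
    """Truncate the article so that article + marker fits within max_length tokens."""
    if len(marker_ids) >= max_length:
        raise ValueError("max_length leaves no room for article tokens")
    budget = max_length - len(marker_ids)
    return article_ids[:budget] + marker_ids


# Stand-in token IDs; a real pipeline would obtain these from a tokenizer.
article_ids = list(range(1000))   # a "long article" of 1000 tokens
marker_ids = [9001, 9002, 9003]   # hypothetical IDs for "\nTL;DR:"

# What the notebook effectively does: concatenate first, then cut on the right.
naive = (article_ids + marker_ids)[:512]
# Budgeted variant: cut the article first, then append the marker.
safe = build_prompt_ids(article_ids, marker_ids, max_length=512)

print(naive[-3:] == marker_ids)  # False: the marker was truncated away
print(safe[-3:] == marker_ids)   # True: the marker is the last thing the model sees
print(len(safe))                 # 512
```

With the marker preserved, decoding only the tokens generated after the prompt would also avoid relying on a string split to separate the summary from the article.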

### Part VIII: Comparing All Models

#### Create comparison table for average ROUGE scores

In [51]:
# Store the aggregate ROUGE scores in a dict
avg_scores = {
    "Model": ["T5-small", "T5-base", "GPT-2"],
    "ROUGE-1": [
        rouge_scores["rouge1"],         # T5-small
        rouge_scores_t5_base["rouge1"], # T5-base
        rouge_scores_gpt2["rouge1"]     # GPT-2
    ],
    "ROUGE-2": [
        rouge_scores["rouge2"],
        rouge_scores_t5_base["rouge2"],
        rouge_scores_gpt2["rouge2"]
    ],
    "ROUGE-L": [
        rouge_scores["rougeL"],
        rouge_scores_t5_base["rougeL"],
        rouge_scores_gpt2["rougeL"]
    ]
}

# Create the DataFrame
avg_scores_df = pd.DataFrame(avg_scores)
avg_scores_df

| | Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|---|
| 0 | T5-small | 0.365197 | 0.17077 | 0.268307 |
| 1 | T5-base | 0.403797 | 0.204898 | 0.297476 |
| 2 | GPT-2 | 0.163223 | 0.070728 | 0.108619 |


#### Side-by-side comparison of generated summaries

In [52]:
# Build a combined DataFrame
comparison_df = pd.DataFrame({
    "Article": train_sample["prompt_text"],
    "Reference Summary": train_sample["prompt_title"],
    "T5-small Summary": results_df["generated_summary"],
    "T5-base Summary": results_df_t5_base["generated_summary"],
    "GPT-2 Summary": results_df_gpt2["generated_summary"]
})

# Show the first 5 examples
comparison_df.head()

| | Article | Reference Summary | T5-small Summary | T5-base Summary | GPT-2 Summary |
|---|---|---|---|---|---|
| 0 | Nasa has warned of an impending asteroid pass ... | 2004 BL86 will pass about three times the dist... | the asteroid, designated 2004 BL86, will pass ... | 2004 BL86 will pass about three times the dist... | Nasa has warned of an impending asteroid pass ... |
| 1 | BAGHDAD, Iraq (CNN) -- Iraq's most powerful Su... | Iraqi Islamic Party calls Quran incident "blat... | a sniper section leader used a Quran for targe... | u.s. soldier's desecration of the holy book re... | BAGHDAD, Iraq (CNN) -- Iraq's most powerful Su... |
| 2 | By . David Kent . Andy Carroll has taken an un... | Carroll takes to Instagram to post selfie ahea... | the striker is out for four months after teari... | england striker posts glum-looking selfie in h... | By . David Kent . Andy Carroll has taken an un... |
| 3 | Los Angeles (CNN) -- Los Angeles has long been... | Pop stars from all over Europe are setting the... | aspiring pop stars from the u.s. and abroad ar... | singers from all over Europe and farther east ... | Los Angeles (CNN) -- Los Angeles has long been... |
| 4 | London (CNN) -- Few shows can claim such an au... | NEW: Young athletes light the Olympic cauldron... | a billion people around the world would be glu... | "Isles of Wonder" features tributes to the Bri... | London (CNN) -- Few shows can claim such an au... |


### Final Comparison of Models (T5-small vs T5-base vs GPT-2)

#### **1. Average ROUGE Scores**
- **T5-small:** ROUGE-1 = 0.37, ROUGE-2 = 0.17, ROUGE-L = 0.27  
- **T5-base:** ROUGE-1 = 0.40, ROUGE-2 = 0.20, ROUGE-L = 0.30  
- **GPT-2:** ROUGE-1 = 0.16, ROUGE-2 = 0.07, ROUGE-L = 0.11  

T5-base outperforms T5-small on all three metrics, achieving higher overlap with the reference summaries. GPT-2 scores far lower, as expected, because it is not specialized for summarization tasks.

---

#### **2. Qualitative Comparison (Side-by-Side Summaries)**
- **T5-small:** Produces reasonable summaries, but sometimes misses key details or uses more generic phrasing.
- **T5-base:** More accurate and closer to reference summaries, often capturing both the structure and important details.
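- **GPT-2:** In the rows shown, the output repeats the beginning of the article rather than giving a concise summary. GPT-2 is not fine-tuned for summarization, and in this notebook the "TL;DR:" marker is also lost whenever a long input is truncated to 512 tokens.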

---

#### **3. Key Observations**
- Model size and task-specific training matter: T5-base (larger and trained for summarization) scores highest on every ROUGE variant.
- ROUGE scores provide a useful quantitative comparison and agree with the qualitative reading of the generated summaries.
- GPT-2 with a plain "TL;DR:" prompt performs poorly here, as shown by low ROUGE scores and outputs that echo the input. Fine-tuning, or at least a prompt that survives truncation, would be needed for a fairer comparison.

---

#### **4. Conclusion**
For summarization tasks, models explicitly trained for summarization (like the T5 family) outperform general-purpose models like GPT-2 by a wide margin in this experiment. Among T5 variants, T5-base offers a noticeable improvement over T5-small, making it a better choice for tasks requiring higher ROUGE scores and better content coverage.