In [10]:
#!pip install ../Course_Tools/introdl

In [None]:
import os
from pathlib import Path

import numpy as np
import torch
import transformers
from datasets import load_dataset
from evaluate import load
from transformers import (
    AutoModelForSeq2SeqLM, AutoTokenizer,
    BartForConditionalGeneration, BartTokenizer,
    DataCollatorForSeq2Seq, EvalPrediction,
    Seq2SeqTrainer, Seq2SeqTrainingArguments,
    Trainer, TrainingArguments,
)

from introdl.utils import config_paths_keys, wrap_print_text, cleanup_torch

print = wrap_print_text(print, width = 100)
paths = config_paths_keys()
MODELS_PATH = paths['MODELS_PATH']
DATA_PATH = paths['DATA_PATH']

MODELS_PATH=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\models
DATA_PATH=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\data
TORCH_HOME=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\downloads
HF_HOME=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\downloads
HF_HUB_CACHE=C:\Users\bagge\My Drive\Python_Projects\DS776_Develop_Project\downloads
Successfully logged in to Hugging Face Hub.


# Section 2 - Metrics


### 📚 **Brief Introductions to the Four Metrics**

1. **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**
   - Measures **n-gram, subsequence, or skip-bigram overlap** between a candidate and reference text.
   - Commonly used for **extractive summarization** but also applied to **abstractive summarization**.
   - Variants: **ROUGE-N (e.g., ROUGE-1, ROUGE-2), ROUGE-L (Longest Common Subsequence), ROUGE-S (Skip-bigrams)**.

2. **BLEU (Bilingual Evaluation Understudy)**
   - Measures **n-gram overlap** between a candidate text and one or more reference texts.
   - Originally designed for **machine translation**, but adapted for **summarization**.
   - Uses a **brevity penalty** to avoid favoring overly short outputs.
   - Often reported with **1-gram to 4-gram precision scores**.

3. **BERTScore**
   - Measures **semantic similarity** between candidate and reference texts using **contextual embeddings** from models like BERT.
   - Matches tokens based on their **cosine similarity in embedding space**.
   - Effective for **abstractive summarization**, especially when paraphrasing is present.

4. **BARTScore**
   - Uses a **pretrained sequence-to-sequence model (e.g., BART)** to estimate the **likelihood of one text given another**, such as a summary given the source text.
   - Evaluates summaries by scoring in **several directions**: Faithfulness (`P(summary | source)`), Precision (`P(summary | reference)`), and Recall (`P(reference | summary)`).
   - Particularly useful for evaluating **fluency, coherence, and factual consistency**.
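
To make the n-gram and subsequence overlap behind ROUGE concrete, here is a minimal from-scratch sketch of ROUGE-N and ROUGE-L. It assumes lowercase whitespace tokenization and a single reference; the function names are made up for illustration, and library implementations (such as the `rouge` metric loaded later in this notebook) add stemming and other details that are left out here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, cand_total, ref_total):
    """Precision, recall, and F1 from an overlap count and the two totals."""
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rouge_n(candidate, reference, n=1):
    """ROUGE-N: clipped n-gram overlap between candidate and reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter intersection clips repeats
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L: overlap measured by the longest common subsequence."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return prf(lcs_length(cand, ref), len(cand), len(ref))


# Toy sentences, made up for illustration
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print("ROUGE-1 (P, R, F1):", rouge_n(candidate, reference, 1))
print("ROUGE-2 (P, R, F1):", rouge_n(candidate, reference, 2))
print("ROUGE-L (P, R, F1):", rouge_l(candidate, reference))
```

Changing a single word costs one unigram but two bigrams, which is why ROUGE-2 drops faster than ROUGE-1 under small wording changes.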

---

### 📊 **Comparison Table**

| **Metric**    | **Use-Cases**                        | **Strengths**                                                                                          | **Weaknesses**                                                                                    |
|---------------|--------------------------------------|--------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
| **ROUGE**     | Extractive summarization, Some Abstractive Summarization | - Easy to compute and interpret. <br> - Works well for extractive tasks. <br> - Multiple variants for different needs. | - Ignores paraphrasing. <br> - Surface-level comparison. <br> - Sensitive to minor wording changes. |
| **BLEU**      | Machine Translation, Summarization   | - Simple and fast to compute. <br> - Precision-oriented. <br> - Useful for extractive and some abstractive tasks. | - Penalizes paraphrasing. <br> - Limited to local n-gram matching. <br> - Ignores semantic similarity. |
| **BERTScore** | Abstractive Summarization, Paraphrasing | - Captures semantic similarity well. <br> - Robust to paraphrasing and rephrasing. <br> - Works well for abstractive summaries. | - Ignores coherence and sentence structure. <br> - Dependent on quality of pretrained embeddings. |
| **BARTScore** | Abstractive Summarization, Coherence Evaluation, Faithfulness Check | - Measures fluency, coherence, and factual consistency. <br> - Can evaluate coverage and faithfulness. <br> - Useful for abstractive summarization. | - Sensitive to the training domain. <br> - Can prioritize fluency over factual accuracy. |

---

### 🔑 **Summary**
- **ROUGE and BLEU** are foundational metrics that are simple to compute and interpret but have limitations in handling semantic similarity and paraphrasing.
- **BERTScore** introduces the idea of using embeddings to capture meaning, making it more robust for abstractive summarization.
- **BARTScore** leverages pretrained language models to evaluate coherence and coverage, providing deeper insight into the quality of generated summaries.
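
The embedding-matching idea behind BERTScore can be sketched with toy vectors. The two-dimensional "embeddings" below are hypothetical values chosen by hand, not outputs of BERT; the real metric uses contextual embeddings from a pretrained model and offers options such as IDF weighting that this sketch omits.

```python
import math

# Hypothetical 2-d "embeddings", hand-picked so that synonyms point the same way
TOY_EMBEDDINGS = {
    "doctor": (1.0, 0.0),
    "physician": (0.96, 0.28),
    "arrived": (0.0, 1.0),
    "came": (0.28, 0.96),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_score(candidate, reference, embeddings):
    """Greedy matching: each token is paired with its most similar counterpart."""
    cand = [embeddings[t] for t in candidate]
    ref = [embeddings[t] for t in reference]
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


reference = ["doctor", "arrived"]
candidate = ["physician", "came"]

exact_overlap = len(set(reference) & set(candidate))
print("Exact token overlap:", exact_overlap)
print("Embedding match (P, R, F1):", greedy_match_score(candidate, reference, TOY_EMBEDDINGS))
```

The paraphrase shares no tokens with the reference, so an n-gram metric would score it zero, while the embedding-based score stays high.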



For an **introductory NLP class**, it's best to focus on metrics that are:

1. **Simple to understand and compute.**
2. **Illustrative of important concepts.**
3. **Widely used in research and applications.**
4. **Relevant to both extractive and abstractive summarization.**

### 🔑 **Recommended Metrics**

| **Metric**    | **Why Include It?**                                    | **Focus Areas**                                   |
|---------------|-------------------------------------------------------|--------------------------------------------------|
| **ROUGE**     | - Most commonly used for summarization evaluation. <br> - Straightforward to compute and interpret. <br> - Shows how n-gram overlap works (ROUGE-N) and how ordering matters (ROUGE-L). | - Introduce n-gram overlap. <br> - Precision, recall, F1-score. <br> - Compare ROUGE-N vs. ROUGE-L. |
| **BLEU**      | - Easy to understand as a basic n-gram overlap metric. <br> - Commonly used for MT but applicable to summarization. | - Explain brevity penalty. <br> - Illustrate difference between extractive and abstractive summaries. |
| **BERTScore** | - Introduces semantic similarity beyond n-gram overlap. <br> - Demonstrates strengths of contextual embeddings. | - Compare embedding-based metrics with n-gram metrics. <br> - Show limitations with coherence and order. |
| **BARTScore** | - Demonstrates how pretrained language models can be used for evaluation. <br> - Illustrates sequence-to-sequence models. | - Discuss how bidirectional scoring works. <br> - Show difference between fluency, coverage, and faithfulness. |

### 📌 **Rationale for Choice**

1. **ROUGE and BLEU** are easy to understand and compute. They provide a clear starting point for introducing how evaluation metrics work, especially for **extractive summarization**.
2. **BERTScore** introduces the idea of **semantic similarity** using contextual embeddings. This bridges the gap to **abstractive summarization** and modern deep learning models.
3. **BARTScore** is an excellent introduction to using **pretrained language models** for evaluation, which is critical for students to understand **modern evaluation techniques**.

### 🚫 **Metrics to Defer for Later**
- **METEOR, MoverScore, SummaQA, BLEURT** are more advanced and require a deeper understanding of embeddings, paraphrasing, and QA systems.
- They're worth mentioning, but they’re not essential for an introductory lecture.

### 📖 **Suggested Flow for Teaching**
1. **Introduction to Evaluation Metrics:** Explain precision, recall, and F1-score.
2. **ROUGE and BLEU:** Compute these metrics for extractive vs. abstractive summaries.
3. **BERTScore:** Show how embeddings can capture meaning beyond surface-level overlap.
4. **BARTScore:** Demonstrate how modern models are used for evaluation.
5. **Comparison and Discussion:** Compare these metrics using a small summarization task.
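
For step 2 of the flow, a small sketch of BLEU's two ingredients, clipped n-gram precision and the brevity penalty, may help. It is a simplified single-reference, sentence-level version without smoothing, written for illustration; standard BLEU is computed at the corpus level, and the function names here are made up.

```python
import math
from collections import Counter


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: a candidate n-gram counts at most as often as in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    return sum((cand & ref).values()) / total if total else 0.0


def brevity_penalty(cand_len, ref_len):
    """Penalize candidates shorter than the reference; no bonus for longer ones."""
    if cand_len == 0:
        return 0.0
    return 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)


def bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of 1..max_n-gram precisions."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return brevity_penalty(len(candidate), len(reference)) * geo_mean


# Toy sentences, made up for illustration
reference = "the cat sat on the mat".split()
too_short = "the cat".split()

print("1-gram precision:", modified_precision(too_short, reference, 1))
print("Brevity penalty:", brevity_penalty(len(too_short), len(reference)))
print("BLEU (up to 2-grams):", bleu(too_short, reference, max_n=2))
```

The two-word candidate has perfect n-gram precision, yet the brevity penalty pulls its score far below 1, which is the behavior the penalty exists to enforce.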


# Section 3 

### Comparing Specialized Summarization Models (BART, T5, PEGASUS) vs. LLMs (GPT-4, LLaMA)

Summarization tasks can be tackled using **specialized encoder-decoder models** like **BART, T5, and PEGASUS**, or **general-purpose decoder-only models** like **GPT-4 and LLaMA**. Each approach has its own strengths and weaknesses.

---

## ✅ **Strengths of Specialized Summarization Models (BART, T5, PEGASUS)**
1. **Architectural Efficiency**
   - Encoder-decoder models process the entire input once with the encoder before generating the summary, making them *computationally efficient* for summarization.
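   - In contrast, decoder-only models treat the input as part of the sequence being generated, so every layer attends over the full prompt at each generation step. Key-value caching avoids recomputing the prompt, but the approach is still costly for long inputs.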

2. **Tailored Training Objectives**
   - These models are pre-trained specifically for text-to-text tasks.
     - **BART:** Trained as a denoising autoencoder, making it robust to noisy or incomplete input.
     - **T5:** Uses a “text-to-text” framework, making it versatile across various NLP tasks, including summarization.
     - **PEGASUS:** Pre-trained to generate summaries by masking entire sentences during training, directly optimizing for abstractive summarization.

3. **Alignment with Summarization Tasks**
   - Fine-tuning on summarization datasets (e.g., CNN/Daily Mail, XSum) leads to **high-quality summaries** that are concise and relevant.
   - Performance on benchmarks often surpasses general-purpose LLMs.

4. **Better Control over Output**
   - Easier to enforce structure, conciseness, or adherence to specific formatting requirements.
   - Less prone to **hallucinations** or verbose outputs compared to general-purpose LLMs.

5. **Domain-Specific Optimization**
   - Fine-tuning encoder-decoder models on specialized datasets (e.g., medical or legal texts) produces highly accurate summaries with relevant terminology and structure.
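
The sentence-masking objective described for PEGASUS under "Tailored Training Objectives" can be illustrated with a toy sketch. It assumes a much-simplified selection rule (mask the one sentence with the highest unigram F1 against the rest of the document) and a placeholder mask string; the document text is made up, and the real pre-training pipeline differs in many details.

```python
from collections import Counter

MASK_TOKEN = "<mask_1>"  # placeholder sentence-mask string for this sketch


def overlap_f1(sentence, others):
    """Unigram F1 between one sentence and the rest of the document."""
    a = Counter(sentence.lower().split())
    b = Counter(" ".join(others).lower().split())
    overlap = sum((a & b).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(a.values())
    recall = overlap / sum(b.values())
    return 2 * precision * recall / (precision + recall)


def gap_sentence_example(sentences):
    """Mask the sentence most similar to the rest; return (model input, target)."""
    scores = [
        overlap_f1(s, sentences[:i] + sentences[i + 1:])
        for i, s in enumerate(sentences)
    ]
    gap = scores.index(max(scores))
    masked = [MASK_TOKEN if i == gap else s for i, s in enumerate(sentences)]
    return " ".join(masked), sentences[gap]


# Made-up three-sentence document
document = [
    "The storm closed the coastal road on Monday.",
    "Crews worked overnight to clear the road.",
    "The road reopened after the storm passed.",
]
model_input, target = gap_sentence_example(document)
print("Input :", model_input)
print("Target:", target)
```

The model sees the document with one sentence hidden and is trained to generate that sentence, which resembles writing a one-sentence summary of the remaining text.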

---

## ❌ **Weaknesses of Specialized Summarization Models**
1. **Limited Generalization**
   - Models like BART, T5, and PEGASUS require fine-tuning for specific summarization tasks.
   - Struggle with novel domains or tasks without retraining.

2. **Less Effective at Zero-Shot Summarization**
   - General-purpose LLMs can perform reasonably well on summarization tasks without fine-tuning, which is challenging for encoder-decoder models.

3. **Inflexibility**
   - Encoder-decoder models are often designed for fixed inputs and outputs, making them less adaptable to creative or open-ended summarization tasks.

---

## ✅ **Strengths of LLMs (GPT-4, LLaMA) for Summarization**
1. **Generalization Across Tasks**
   - Capable of summarization **without fine-tuning** through prompt engineering (e.g., “Summarize the following text...”).
   - Strong performance across various domains with minimal adjustments.

2. **Few-Shot & Zero-Shot Learning**
   - Easily adaptable to new domains or styles through *in-context learning* (providing examples within the prompt).

3. **Versatility**
   - Handles a wide range of tasks beyond summarization, making them highly flexible for mixed-use applications.
   - Can switch between extractive, abstractive, or creative summarization depending on the prompt.

4. **Ease of Use**
   - No need for specialized training or fine-tuning, making them immediately usable for various summarization tasks.

---

## ❌ **Weaknesses of LLMs for Summarization**
1. **Inefficiency for Long Texts**
   - Decoder-only models attend over the entire input text at every generation step (and recompute it when no key-value cache is used), resulting in high computational costs for long documents.

2. **Prone to Hallucination**
   - Without fine-tuning or careful prompting, LLMs can generate irrelevant or incorrect information, particularly for factual summarization tasks.

3. **Less Structured Output**
   - Outputs may be verbose or off-topic unless the prompt is carefully designed to enforce structure and conciseness.

4. **Lack of Task-Specific Optimization**
   - General-purpose LLMs may underperform compared to fine-tuned encoder-decoder models on specific summarization datasets.

---

## 📊 **Summary: When to Use Each Approach**

| Aspect                  | Specialized Models (BART, T5, PEGASUS) | LLMs (GPT-4, LLaMA)                     |
|-------------------------|-----------------------------------------|----------------------------------------|
| Efficiency              | High (Pre-processed input)             | Low (Full input processed repeatedly) |
| Domain Adaptation       | Excellent (Fine-tuning)                | Moderate (Prompt engineering)         |
| Generalization          | Limited                                | High                                  |
| Output Control          | High (Structure, format)               | Moderate (Requires careful prompting)|
| Ease of Use             | Requires fine-tuning                   | Prompt-based, no fine-tuning needed   |
| Hallucination Risk      | Lower                                   | Higher                                |

---

## State-of-the-art Models

As of April 2025, specialized models for text summarization continue to evolve, building upon foundational architectures like BART, T5, and PEGASUS. Recent advancements have introduced models such as **Longformer Encoder-Decoder (LED)** and **HERA**, which are tailored to address specific challenges in summarizing lengthy documents.

**Longformer Encoder-Decoder (LED):**
LED is designed to handle long documents by extending the Transformer architecture's context window. This allows for efficient processing of extended text inputs, making it particularly effective for summarizing lengthy documents without the need to truncate content.

**HERA:**
HERA focuses on improving long document summarization by segmenting the text based on its semantic structure. It retrieves and reorders segments related to the same event, enhancing the coherence and relevance of the generated summaries. This approach has shown improvements in handling complex narratives within extensive documents.

While these specialized models offer targeted solutions for summarization tasks, large language models (LLMs) like GPT-4 and Google's Gemini have also demonstrated strong summarization capabilities. However, specialized models remain relevant, especially in scenarios requiring domain-specific knowledge or the ability to process longer texts effectively.

In summary, the state-of-the-art in specialized summarization models as of April 2025 includes architectures like LED and HERA, which build upon earlier models to address challenges associated with long document summarization. These models offer efficient and coherent summarization solutions, particularly for extensive and complex texts. 

# Fine-Tuning and Evaluating a BART Model for Summarization


---

### 📘 Section 1: Setup

In [2]:

# Load model and tokenizer
model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

---

### 📰 Section 2: Load and Preview Data

We're going to choose small subsets for training and validation to demonstrate how fine-tuning works and performs. In a production setting we'd use all of the training data, or at least as much as we can afford to train on.

In [3]:
# Load XSum dataset
dataset = load_dataset("xsum", cache_dir=DATA_PATH)
num_train = min(6000, len(dataset["train"]))
num_val = min(200, len(dataset["validation"]))
train_data = dataset["train"].select(range(num_train))
val_data = dataset["validation"].select(range(num_val))

# Preview 2 validation samples
for i in range(2):
    print(f"\n📄 Article {i+1}:\n{val_data[i]['document']}")
    print(f"📝 Summary:\n{val_data[i]['summary']}")


📄 Article 1:
The ex-Reading defender denied fraudulent trading charges relating to the Sodje Sports Foundation -
a charity to raise money for Nigerian sport.
Mr Sodje, 37, is jointly charged with elder brothers Efe, 44, Bright, 50 and Stephen, 42.
Appearing at the Old Bailey earlier, all four denied the offence.
The charge relates to offences which allegedly took place between 2008 and 2014.
Sam, from Kent, Efe and Bright, of Greater Manchester, and Stephen, from Bexley, are due to stand
trial in July.
They were all released on bail.
📝 Summary:
Former Premier League footballer Sam Sodje has appeared in court alongside three brothers accused of
charity fraud.

📄 Article 2:
Voges was forced to retire hurt on 86 after suffering the injury while batting during the County
Championship draw with Somerset on 4 June.
Middlesex hope to have the Australian back for their T20 Blast game against Hampshire at Lord's on 3
August.
The 37-year-old has scored 230 runs in four first-class games this se

---

### 🧹 Section 3: Preprocess

In [4]:

def preprocess_function(examples):
    model_inputs = tokenizer(
        examples['document'], max_length=512, truncation=True
    )
    labels = tokenizer(
        text_target=examples['summary'], max_length=64, truncation=True
    )
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

# Tokenize data
eval_subset_size = 200
tokenized_train = train_data.map(preprocess_function, batched=True)
tokenized_val = val_data.select(range(eval_subset_size)).map(preprocess_function, batched=True)

---

### 🔧 Section 4: Fine-Tune BART-Large-CNN

In [5]:
training_args = Seq2SeqTrainingArguments(
    output_dir=str(MODELS_PATH / "xsum_bart_large"),
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    fp16=True,
    disable_tqdm=False,
    predict_with_generate=True,
)


In [6]:

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
)

# Train
trainer.train()

Epoch,Training Loss,Validation Loss
1,1.9036,1.798864
2,1.2306,1.832299




TrainOutput(global_step=1500, training_loss=1.5514997965494792, metrics={'train_runtime': 303.1938, 'train_samples_per_second': 39.579, 'train_steps_per_second': 4.947, 'total_flos': 1.2950887990296576e+16, 'train_loss': 1.5514997965494792, 'epoch': 2.0})

The validation loss rises from 1.80 to 1.83 in the second epoch while the training loss keeps falling, which suggests overfitting. That's not surprising, since we're using a very small subset of the data for this demonstration. Let's look at the predicted summary for the base model and the fine-tuned model for the first article in the validation set.
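
As a quick sanity check on the training log, the sketch below reproduces the step count and picks the epoch with the lowest validation loss. It assumes a single device and no gradient accumulation; the variable names are illustrative, and the loss values are copied from the table above.

```python
import math

# Values taken from the training setup and log above
num_train_examples = 6000
per_device_train_batch_size = 8
num_train_epochs = 2

# Assumes one device and no gradient accumulation
steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(f"Steps per epoch: {steps_per_epoch}, total steps: {total_steps}")

eval_history = [
    {"epoch": 1, "train_loss": 1.9036, "eval_loss": 1.798864},
    {"epoch": 2, "train_loss": 1.2306, "eval_loss": 1.832299},
]

# The epoch with the lowest validation loss is the least overfit checkpoint
best = min(eval_history, key=lambda record: record["eval_loss"])
print(f"Lowest validation loss at epoch {best['epoch']}: {best['eval_loss']:.4f}")
```

With `save_strategy="epoch"`, a checkpoint is written at the end of each epoch, so the first-epoch checkpoint is a candidate worth comparing against the final one. `TrainingArguments` also offers `load_best_model_at_end` for selecting it automatically.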

**Note:** The line `fine_tuned_model.config.forced_bos_token_id = None` shouldn't be necessary, but `transformers` sets `forced_bos_token_id = 0` in the saved model, which causes text generation to work incorrectly. I'm opening an issue on GitHub for this.

In [7]:
# let's make a helper function to generate summaries

def generate_summary(text, model, tokenizer, device):
    # Tokenize the input text and convert it into tensors suitable for the model
    # `max_length=512` ensures the input is truncated if it exceeds 512 tokens
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True).to(device)
    
    # Generate the summary using the model
    # `num_beams=4` specifies the beam search size for better quality summaries
    # `max_length=64` limits the length of the generated summary
    # `early_stopping=True` stops generation when all beams reach the end token
    summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=64, early_stopping=True)
    
    # Decode the generated token IDs back into a human-readable string
    # `skip_special_tokens=True` removes special tokens like <s> and </s>
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
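
To show why `num_beams` matters, here is a toy beam search over a hand-written next-token table. The "model" is a hypothetical dictionary of log-probabilities, and the search is deliberately simplified (no length penalty and no batching), so it illustrates the idea and is not a description of how `generate` is implemented.

```python
import math

# Hypothetical next-token log-probabilities, keyed by the previous token
TOY_MODEL = {
    "<s>": {"the": math.log(0.6), "a": math.log(0.4)},
    "the": {"cat": math.log(0.55), "dog": math.log(0.45)},
    "a": {"cat": math.log(0.9), "dog": math.log(0.1)},
    "cat": {"</s>": 0.0},
    "dog": {"</s>": 0.0},
}


def beam_search(model, num_beams=2, max_length=4):
    """Keep the `num_beams` best partial sequences at every step."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens)
    finished = []
    for _ in range(max_length):
        candidates = [
            (score + logp, tokens + [token])
            for score, tokens in beams
            for token, logp in model[tokens[-1]].items()
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:num_beams]:
            # Sequences ending in the end token leave the active beam
            (finished if tokens[-1] == "</s>" else beams).append((score, tokens))
        if not beams:  # every surviving hypothesis has ended
            break
    return max(finished + beams, key=lambda c: c[0])


greedy_score, greedy_tokens = beam_search(TOY_MODEL, num_beams=1)
beam_score, beam_tokens = beam_search(TOY_MODEL, num_beams=2)
print("Greedy:", greedy_tokens, round(math.exp(greedy_score), 2))
print("Beam 2:", beam_tokens, round(math.exp(beam_score), 2))
```

Greedy decoding commits to the most likely first token and misses the sequence with the higher overall probability, which the wider beam recovers.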


The first article in the validation set is:

In [8]:
sample_article = val_data[0]['document']
print(f"📄 Sample article:\n{sample_article}\n")

📄 Sample article:
The ex-Reading defender denied fraudulent trading charges relating to the Sodje Sports Foundation -
a charity to raise money for Nigerian sport.
Mr Sodje, 37, is jointly charged with elder brothers Efe, 44, Bright, 50 and Stephen, 42.
Appearing at the Old Bailey earlier, all four denied the offence.
The charge relates to offences which allegedly took place between 2008 and 2014.
Sam, from Kent, Efe and Bright, of Greater Manchester, and Stephen, from Bexley, are due to stand
trial in July.
They were all released on bail.



Now let's load the base model, our fine-tuned model, and a model that has been fine-tuned on the complete XSum dataset. We'll generate a summary with each so we can compare them qualitatively.

In [9]:
device = "cuda" if torch.cuda.is_available() else "cpu"

# reload the base model
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn").to(device)
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

# Reload the fine-tuned model
checkpoint_path = MODELS_PATH / "xsum_bart_large" / "checkpoint-1500"
fine_tuned_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_path).to(device)
fine_tuned_model.config.forced_bos_token_id = None # Set to None to squash bug
# tokenizer is the same as base model

# Fully-fine-tuned model summary
full_ft_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-xsum").to(device)

reference_summary = val_data[0]['summary']
base_summary = generate_summary(sample_article, model, tokenizer, device)
fine_tuned_summary = generate_summary(sample_article, fine_tuned_model, tokenizer, device)
full_ft_summary = generate_summary(sample_article, full_ft_model, tokenizer, device)

print(f"📝 Reference Summary:\n{reference_summary}\n")
print(f"📝 Base Model Summary: \n{base_summary}\n" )
print(f"📝 Fine-Tuned Model Summary: \n{fine_tuned_summary}\n" )
print(f"📝 Fully Fine-Tuned Model Summary: \n{full_ft_summary}\n" )




📝 Reference Summary:
Former Premier League footballer Sam Sodje has appeared in court alongside three brothers accused of
charity fraud.

📝 Base Model Summary:
Sam Sodje, 37, is jointly charged with elder brothers Efe, 44, Bright, 50 and Stephen, 42. The
charge relates to offences which allegedly took place between 2008 and 2014. Sam, from Kent, Efe and
Bright, of Greater Manchester, and Stephen,. from Bexley,

📝 Fine-Tuned Model Summary:
Former England footballer Sam Sodje has appeared in court charged with embezzling more than £300,000
from a sports charity he set up in his home country of Nigeria, the capital city of Lagos, the Old
Bailey has heard for the first time.

📝 Fully Fine-Tuned Model Summary:
Former Premier League footballer Sam Sodje has appeared in court charged with fraud.



---

### 📏 Section 5: Define Evaluation Metrics (ROUGE and BERTScore)

In [None]:
# Load metrics
rouge = load("rouge")
bertscore = load("bertscore")

def compute_metrics(eval_pred):
    """
    Compute ROUGE and BERTScore metrics for model predictions vs. reference summaries.

    Parameters:
        eval_pred (EvalPrediction): Contains tokenized model predictions and reference label.

    Returns:
        dict: Dictionary with ROUGE F1 scores (rouge1, rouge2, rougeL, rougeLsum)
              and BERTScore precision, recall, and F1.
    """
    predictions, labels = eval_pred

    # Some models return a tuple (logits, ...)
    if isinstance(predictions, tuple):
        predictions = predictions[0]

    # Convert to numpy arrays
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)

    # If predictions are logits, take argmax
    if predictions.ndim == 3:
        predictions = np.argmax(predictions, axis=-1)

    # Convert to lists
    predictions = predictions.tolist()
    labels = labels.tolist()

    # Replace -100 with pad_token_id
    labels = [[(token if token != -100 else tokenizer.pad_token_id) for token in label] for label in labels]

    # Decode
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Strip whitespace
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [label.strip() for label in decoded_labels]

    # Compute ROUGE
    rouge_result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    rouge_f1 = {f"{key}_f1": value * 100 for key, value in rouge_result.items()}

    # Compute BERTScore
    bert_result = bertscore.compute(predictions=decoded_preds, references=decoded_labels, lang="en")
    bert_f1 = {
        "bertscore_precision": np.mean(bert_result["precision"]) * 100,
        "bertscore_recall": np.mean(bert_result["recall"]) * 100,
        "bertscore_f1": np.mean(bert_result["f1"]) * 100,
    }

    # Combine results
    return {**rouge_f1, **bert_f1}
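
Two steps in `compute_metrics` are easy to gloss over: taking the argmax over logits and replacing the `-100` label padding before decoding. The pure-Python sketch below walks through both on tiny hand-made lists; the pad token ID and the numbers are illustrative assumptions, not values read from the tokenizer.

```python
IGNORE_INDEX = -100  # label positions the loss function skips
PAD_TOKEN_ID = 1     # assumed pad ID for this sketch; use tokenizer.pad_token_id in practice


def argmax_tokens(logits):
    """Turn a [seq_len][vocab_size] list of scores into one token ID per position."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def restore_padding(label_ids, pad_token_id):
    """Swap the -100 placeholders for a real token ID so the tokenizer can decode."""
    return [tok if tok != IGNORE_INDEX else pad_token_id for tok in label_ids]


# Hand-made scores for a 3-position sequence over a 4-token vocabulary
logits = [
    [0.1, 0.2, 2.5, 0.0],
    [1.9, 0.3, 0.1, 0.2],
    [0.0, 0.1, 0.2, 3.0],
]
labels = [2, 0, 3, -100, -100]

predicted_ids = argmax_tokens(logits)
clean_labels = restore_padding(labels, PAD_TOKEN_ID)
print("Predicted IDs:", predicted_ids)
print("Labels ready to decode:", clean_labels)
```

The `-100` values never correspond to real vocabulary entries, so they have to be mapped to a valid ID before decoding; `skip_special_tokens=True` then drops the padding from the decoded text.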


---

### 🧪 Section 6: Evaluate Fine-Tuned Model

In [None]:
# Reload the fine-tuned model
device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint_path = MODELS_PATH / "xsum_bart_large" / "checkpoint-1500"
fine_tuned_model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_path).to(device)

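# Note: the plain `Trainer` ignores `predict_with_generate`, so `compute_metrics`
# receives teacher-forced logits (argmax-decoded) rather than generated summaries,
# and the resulting ROUGE scores are optimistic. Use `Seq2SeqTrainer` to score
# generated text instead.
trainer_with_metrics = Trainer(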
    model=fine_tuned_model,
    args=training_args,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

ft_results = trainer_with_metrics.evaluate()
print("\n📈 Fine-Tuned BART Results:")
print(ft_results)




📈 Fine-Tuned BART Results:
{'eval_loss': 1.8322992324829102, 'eval_model_preparation_time': 0.004, 'eval_rouge1_f1':
53.466338054514615, 'eval_rouge2_f1': 28.142027701746425, 'eval_rougeL_f1': 50.86069110365715,
'eval_rougeLsum_f1': 50.82767043771599, 'eval_runtime': 53.2635, 'eval_samples_per_second': 3.755,
'eval_steps_per_second': 0.469}



In [16]:
sample_input = "NASA has announced a new moon mission to be launched next year."
model.eval()
with torch.no_grad():
    inputs = tokenizer(sample_input, return_tensors="pt", truncation=True).to(model.device)
    output = model.generate(**inputs, max_length=64)  # `model` is the base model loaded earlier
    print(tokenizer.decode(output[0], skip_special_tokens=True))


NASA has announced a new moon mission to be launched next year. The mission will be the first to
orbit the moon's surface. The moon mission is expected to launch in 2018. The cost of the mission is
not yet known. The project will cost an estimated $1.5 billion.


In [10]:
from transformers import BartForConditionalGeneration, BartTokenizer
from datasets import load_dataset
from evaluate import load
import torch

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load fine-tuned model and tokenizer
checkpoint_path = MODELS_PATH / "xsum_bart_large" / "checkpoint-1125"
model = BartForConditionalGeneration.from_pretrained(checkpoint_path).to(device)
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

# Load dataset and evaluation subset
val_data = load_dataset("xsum", split="validation[:20]")  # test on 20 samples

# Load ROUGE
rouge = load("rouge")

# Generate and collect predictions
references = []
predictions = []

for example in val_data:
    article = example["document"]
    reference = example["summary"]
    references.append(reference.strip())

    # Tokenize input
    inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512).to(device)

    # Generate
    output_ids = model.generate(
        **inputs,
        max_length=64,
        num_beams=4,
        early_stopping=True
    )
    generated = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    predictions.append(generated.strip())

# Print a few examples
for i in range(3):
    print(f"\n--- Example {i+1} ---")
    print(f"📄 Article:\n{val_data[i]['document'][:300]}...")
    print(f"📝 Reference Summary:\n{references[i]}")
    print(f"🧠 Generated Summary:\n{predictions[i]}")

# Compute and print ROUGE scores
results = rouge.compute(predictions=predictions, references=references, use_stemmer=True)
print("\n📊 ROUGE Scores (F1):")
for k, v in results.items():
    print(f"{k}: {v*100:.2f}")



--- Example 1 ---
📄 Article:
The ex-Reading defender denied fraudulent trading charges relating to the Sodje Sports Foundation -
a charity to raise money for Nigerian sport.
Mr Sodje, 37, is jointly charged with elder brothers Efe, 44, Bright, 50 and Stephen, 42.
Appearing at the Old Bailey earlier, all four denied the offence....
📝 Reference Summary:
Former Premier League footballer Sam Sodje has appeared in court alongside three brothers accused of
charity fraud.
🧠 Generated Summary:


--- Example 2 ---
📄 Article:
Voges was forced to retire hurt on 86 after suffering the injury while batting during the County
Championship draw with Somerset on 4 June.
Middlesex hope to have the Australian back for their T20 Blast game against Hampshire at Lord's on 3
August.
The 37-year-old has scored 230 runs in four first-c...
📝 Reference Summary:
Middlesex batsman Adam Voges will be out until August after suffering a torn calf muscle in his
right leg.
🧠 Generated Summary:


--- Example 3 ---
📄 Ar

---

### 📊 Section 7: Evaluate Pretrained Model

In [15]:
base_model = BartForConditionalGeneration.from_pretrained(model_name).to("cuda" if torch.cuda.is_available() else "cpu")

base_trainer = Trainer(
    model=base_model,
    args=training_args,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

base_results = base_trainer.evaluate()
print("\n📊 Pretrained BART Results:")
print(base_results)


📊 Pretrained BART Results:
{'eval_loss': 8.350750923156738, 'eval_model_preparation_time': 0.003, 'eval_rouge1_f1':
42.87823639342292, 'eval_rouge2_f1': 16.809071657348294, 'eval_rougeL_f1': 39.68965887481013,
'eval_rougeLsum_f1': 39.617765500714405, 'eval_runtime': 10.041, 'eval_samples_per_second': 19.918,
'eval_steps_per_second': 2.49}
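
The two result dictionaries are easier to compare side by side. The sketch below copies the ROUGE values printed above (rounded to two decimals) and computes the difference for each metric. Both sets of scores come from the plain `Trainer`, which decodes teacher-forced predictions, so they are best read as a relative comparison and not as scores for generated summaries.

```python
# ROUGE F1 values copied from the evaluation outputs above, rounded to two decimals
fine_tuned = {"rouge1": 53.47, "rouge2": 28.14, "rougeL": 50.86}
pretrained = {"rouge1": 42.88, "rouge2": 16.81, "rougeL": 39.69}

# Improvement of the fine-tuned model over the pretrained one, per metric
deltas = {name: fine_tuned[name] - pretrained[name] for name in fine_tuned}

print(f"{'metric':<8}{'pretrained':>12}{'fine-tuned':>12}{'delta':>8}")
for name in fine_tuned:
    print(f"{name:<8}{pretrained[name]:>12.2f}{fine_tuned[name]:>12.2f}{deltas[name]:>+8.2f}")
```

Fine-tuning on 6,000 XSum examples raises each ROUGE score by roughly 10 to 11 points under this evaluation setup.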


---

### 👁️ Section 8: Qualitative Comparison

In [20]:
def generate_summary(model, article):
    inputs = tokenizer(article, return_tensors="pt", max_length=512, truncation=True).to(model.device)
    output_ids = model.generate(**inputs, max_length=64, min_length = 10, num_beams=4, early_stopping=True)
    print("Generated IDs:", output_ids) 
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print("\n🔍 Qualitative Comparison:\n")
for i in range(2):
    article = val_data[i]['document']
    reference = val_data[i]['summary']
    bart_sum = generate_summary(base_model, article)
    fine_tuned_sum = generate_summary(fine_tuned_model, article)

    print(f"\n📄 Article {i+1}:\n{article}")
    print(f"📝 Reference:\n{reference}")
    print(f"📚 BART Summary:\n{bart_sum}")
    print(f"🧠 Fine-Tuned Summary:\n{fine_tuned_sum}")
    print(f"Length of fine-tuned summary: {len(fine_tuned_sum.split())} words")


🔍 Qualitative Comparison:

Generated IDs: tensor([[    2,     0, 21169, 33270,  2359,     6,  2908,     6,    16, 13521,
          1340,    19, 15172,  5396,   381,  7068,     6,  3550,     6, 15463,
             6,   654,     8,  3259,     6,  3330,     4,    20,  1427, 16009,
             7,  9971,    61,  2346,   362,   317,   227,  2266,     8,   777,
             4,    20,  1931,    12, 43952,  5142,  2296, 15381,  1446,  1103,
             4,     2]], device='cuda:0')
Generated IDs: tensor([[2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]], device='cuda:0')

📄 Article 1:
The ex-Reading defender denied fraudulent trading charges relating to the Sodje Sports Foundation -
a charity to raise money for Nigerian sport.
Mr Sodje, 37, is jointly charged with elder brothers Efe, 44, Bright, 50 and Stephen, 42.
Appearing at the Old Bailey earlier, all four denied the offence.
The charge relates to offences which allegedly took place between 2008 and 2014.
Sam, from Kent, Efe and Bright, of G

---


# 📝 Exercise: Domain-Specific Summarization Using Fine-Tuned Models vs. General-Purpose LLMs

## Overview
In this exercise, you will fine-tune a summarization model, `facebook/bart-large`, on the **PubMed** dataset to perform domain-specific summarization. You will then compare its performance to a general-purpose LLM (`gpt-4`, or an open LLM such as `LLaMA-2`) on the same task.

---

## Learning Objectives
1. Understand the process of fine-tuning a pre-trained transformer model for domain-specific summarization.
2. Compare performance of fine-tuned models to general-purpose LLMs.
3. Use standard summarization metrics (e.g., ROUGE) to evaluate performance.
4. Explore trade-offs between specialized and general-purpose models for summarization.

---

## Part 1: Fine-Tuning `facebook/bart-large` on PubMed Dataset

### Setup
Install necessary libraries:
```bash
pip install transformers datasets accelerate evaluate rouge_score
```

### Load Dataset

In [None]:
from datasets import load_dataset

dataset = load_dataset("scientific_papers", "pubmed")

### Fine-Tuning Code
Fine-tune `facebook/bart-large` on a subset of the PubMed dataset:

- Train on a subset (2000 samples).
- Validate on a smaller subset (500 samples).
- Use `fp16=True` for faster mixed-precision training (this requires a CUDA GPU).
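
The steps above can be sketched roughly as follows. This is a minimal sketch, not a prescribed solution: it assumes the `dataset` object loaded earlier and the `article` / `abstract` columns of the PubMed configuration. The helper names (`make_preprocess_fn`, `fine_tune`), the sequence lengths, the batch size, the learning rate, and the output directory are illustrative assumptions that you should adapt to your hardware.

```python
def make_preprocess_fn(tokenizer, max_input_len=1024, max_target_len=256):
    """Return a function that tokenizes a batch of PubMed rows (hypothetical helper)."""
    def preprocess(batch):
        # Tokenize the full articles as encoder inputs
        model_inputs = tokenizer(batch["article"], max_length=max_input_len, truncation=True)
        # Tokenize the abstracts as decoder targets
        labels = tokenizer(text_target=batch["abstract"], max_length=max_target_len, truncation=True)
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs
    return preprocess


def fine_tune(dataset, output_dir="bart-pubmed", n_train=2000, n_val=500):
    """Fine-tune facebook/bart-large on a PubMed subset (assumed hyperparameters)."""
    # Heavy imports are kept local so the helper above can be used on its own
    import torch
    from transformers import (
        AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
        Seq2SeqTrainer, Seq2SeqTrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

    # Subsets: 2000 training samples and 500 validation samples
    train_data = dataset["train"].shuffle(seed=42).select(range(n_train))
    val_data = dataset["validation"].shuffle(seed=42).select(range(n_val))

    preprocess = make_preprocess_fn(tokenizer)
    tokenized_train = train_data.map(preprocess, batched=True, remove_columns=train_data.column_names)
    tokenized_val = val_data.map(preprocess, batched=True, remove_columns=val_data.column_names)

    args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        learning_rate=3e-5,
        fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
        predict_with_generate=True,      # predict() returns generated token IDs
        generation_max_length=max(64, 256),
        report_to="none",
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_val,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()
    return trainer, tokenizer, tokenized_val, val_data
```

Calling `trainer, tokenizer, tokenized_val, val_data = fine_tune(dataset)` would produce the names used in the evaluation step of Part 2.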


---

## Part 2: Evaluation with ROUGE Metrics

Add the following to your code after training:

In [None]:
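import numpy as np
from evaluate import load

# Loss and any metrics configured on the trainer
eval_results = trainer.evaluate()
print(eval_results)

# ROUGE is loaded through the evaluate library (datasets.load_metric is deprecated)
rouge = load("rouge")

# Assumes a Seq2SeqTrainer with predict_with_generate=True, so predictions are token IDs
outputs = trainer.predict(tokenized_val)

# Padding positions may be marked with -100, which the tokenizer cannot decode
pred_ids = np.where(outputs.predictions != -100, outputs.predictions, tokenizer.pad_token_id)
preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
refs = [example['abstract'] for example in val_data]

rouge_output = rouge.compute(predictions=preds, references=refs, use_stemmer=True)
print(rouge_output)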

---

## Part 3: Comparison with General-Purpose LLM

### Instructions
1. Use an LLM (e.g., GPT-4 via API or LLaMA-2 locally) to generate summaries for the same validation set.
2. Compare outputs using the same ROUGE metrics.
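
One way to organize these two steps is sketched below. The names `build_prompt`, `summarize_with_llm`, and `generate_fn` are hypothetical: `generate_fn` stands for any callable you write that sends a prompt to your chosen LLM and returns its text. The word-count truncation limit is also an assumption, included only because full papers can exceed an LLM's context window. A stand-in function replaces the real LLM so the sketch runs offline.

```python
PROMPT_TEMPLATE = "Summarize the following scientific paper: \n\n{article}"


def build_prompt(article, max_words=3000):
    """Fill the prompt template, truncating very long articles by word count."""
    words = article.split()
    if len(words) > max_words:
        article = " ".join(words[:max_words])
    return PROMPT_TEMPLATE.format(article=article)


def summarize_with_llm(articles, generate_fn, max_words=3000):
    """Return one summary per article.

    generate_fn is a hypothetical callable (prompt -> str) wrapping your LLM.
    """
    return [generate_fn(build_prompt(a, max_words)).strip() for a in articles]


def fake_llm(prompt):
    """Stand-in for a real LLM: returns the first sentence of the article."""
    body = prompt.split("\n\n", 1)[1]
    return body.split(".")[0] + "."


llm_preds = summarize_with_llm(["Aspirin reduces fever. It also thins blood."], fake_llm)
print(llm_preds)
```

The resulting list can then be scored with the same `rouge.compute(predictions=..., references=refs, use_stemmer=True)` call used for the fine-tuned model, so both systems are evaluated against identical references.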

### Prompt Example:
"Summarize the following scientific paper: \n\n{Article}"

---

## Part 4: Analysis & Discussion

Answer the following questions:
1. How do the ROUGE scores of the fine-tuned BART model compare to those of the general-purpose LLM?
2. Are there qualitative differences in the summaries (e.g., coherence, conciseness, relevance)?
3. What are the strengths and weaknesses of each approach?
4. In what scenarios would you prefer using a fine-tuned model over a general-purpose LLM?

---

## Extension (Optional)
- Fine-tune the model on the full PubMed dataset or another domain-specific dataset.
- Compare the performance of BART, PEGASUS, and T5 for this task.
- Explore different evaluation metrics (e.g., BERTScore).
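
If you try the multi-model comparison, note that the models do not all expect the same input format. The sketch below is only a starting point: the checkpoint choices in `CHECKPOINTS` and the helper name `prepare_input` are assumptions. It reflects the fact that the original T5 checkpoints were trained with a task prefix such as `summarize: `, while BART and PEGASUS take the raw text.

```python
# Assumed checkpoint choices; other sizes or variants would work similarly
CHECKPOINTS = {
    "bart": "facebook/bart-large",
    "pegasus": "google/pegasus-large",
    "t5": "t5-base",
}


def prepare_input(model_key, text):
    """Add the task prefix that T5 expects; leave BART and PEGASUS inputs unchanged."""
    if model_key == "t5":
        return "summarize: " + text
    return text


for key in CHECKPOINTS:
    print(key, "->", repr(prepare_input(key, "Example article text.")))
```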

---

## Submission
Submit a Jupyter notebook containing:
- Your fine-tuning code.
- ROUGE metric outputs for both models.
- Your analysis and discussion.

---

Good luck and have fun! 🎉

# Assignment: Comparing General and Fine-Tuned BART Models

## Objective
In this assignment, you will compare the performance of two pretrained BART models on the task of scientific text summarization:
- A general-domain BART model: `facebook/bart-large-cnn` (fine-tuned on CNN/DailyMail news summarization)
- A domain-specific BART model: `ccdv/lsg-bart-base-4096-pubmed` (fine-tuned on PubMed)

You will evaluate the models both qualitatively (by examining the generated summaries) and quantitatively (using ROUGE scores).

## Instructions

### Part 1: Setup
1. **Install the required packages**
```bash
pip install transformers datasets rouge_score
```

2. **Import necessary libraries**

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration, pipeline

### Part 2: Load Models and Tokenizers

In [None]:
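from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the general-domain BART model. The CNN/DailyMail summarization checkpoint is
# published as "facebook/bart-large-cnn"; plain "facebook/bart-large" is only pretrained.
general_tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
general_model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

general_summarizer = pipeline("summarization", model=general_model, tokenizer=general_tokenizer)

# Load the domain-specific model. LSG-BART ships custom long-input attention code with
# the checkpoint, so it is loaded through the Auto classes with trust_remote_code=True.
pubmed_tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-bart-base-4096-pubmed", trust_remote_code=True)
pubmed_model = AutoModelForSeq2SeqLM.from_pretrained("ccdv/lsg-bart-base-4096-pubmed", trust_remote_code=True)

domain_summarizer = pipeline("summarization", model=pubmed_model, tokenizer=pubmed_tokenizer)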

### Part 3: Provide Scientific Text for Summarization
Paste a sample scientific text below. It should be at least a few paragraphs long.

In [None]:
sample_text = """
Your scientific text goes here.
"""

### Part 4: Generate Summaries

In [None]:
# Generate summaries using both models.
# truncation=True clips inputs that exceed a model's maximum input length.
general_summary = general_summarizer(sample_text, max_length=150, min_length=40, do_sample=False, truncation=True)[0]['summary_text']
domain_summary = domain_summarizer(sample_text, max_length=150, min_length=40, do_sample=False, truncation=True)[0]['summary_text']

print("General Model Summary:\n", general_summary)
print("\nDomain-Finetuned Model Summary:\n", domain_summary)

### Part 5: Evaluate using ROUGE Metrics

In [None]:
from rouge_score import rouge_scorer

# Initialize the ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Provide a reference summary for comparison
reference_summary = """
Your reference summary goes here.
"""

# Compute ROUGE scores for each model
scores_general = scorer.score(reference_summary, general_summary)
scores_domain = scorer.score(reference_summary, domain_summary)

# Display results
print("\nGeneral Model ROUGE Scores:")
for metric, score in scores_general.items():
    print(f"{metric}: {score}")

print("\nDomain Model ROUGE Scores:")
for metric, score in scores_domain.items():
    print(f"{metric}: {score}")
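
The loop above prints each full `Score` tuple (precision, recall, and F-measure). For the quantitative comparison it can be easier to read the F1 values side by side. The sketch below shows one possible way to do that. The helper name `rouge_f1_table` is hypothetical, and the demo numbers are made-up placeholders that only mimic the shape of the `rouge_score` output; they are not real results.

```python
from collections import namedtuple


def rouge_f1_table(scores_by_model):
    """Format ROUGE F1 scores as a plain-text table with one column per model."""
    models = list(scores_by_model)
    metrics = list(next(iter(scores_by_model.values())))
    lines = ["metric".ljust(10) + "".join(m.rjust(12) for m in models)]
    for metric in metrics:
        row = metric.ljust(10)
        for m in models:
            row += f"{scores_by_model[m][metric].fmeasure:12.4f}"
        lines.append(row)
    return "\n".join(lines)


# Placeholder tuples with the same fields as rouge_score's Score objects
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])
demo_scores = {
    "general": {"rouge1": Score(0.5, 0.5, 0.5), "rougeL": Score(0.25, 0.25, 0.25)},
    "domain": {"rouge1": Score(0.75, 0.75, 0.75), "rougeL": Score(0.5, 0.5, 0.5)},
}
print(rouge_f1_table(demo_scores))
```

With the real results, the call would be `rouge_f1_table({"general": scores_general, "domain": scores_domain})`.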

### Part 6: Analysis
1. **Qualitative Comparison**
    - Compare the outputs of the general and domain-specific models. What differences do you notice? Which model produces more coherent, relevant, or precise summaries?

2. **Quantitative Comparison**
    - Compare the ROUGE scores from both models. Which model achieves higher scores? Does this align with your qualitative observations?

3. **Discussion**
    - Discuss why the domain-specific model might perform better on PubMed data. What trade-offs might exist between using a general-purpose model vs. a fine-tuned model?

### Submission
Save your notebook and submit it according to your instructor’s guidelines.