In [1]:
!nvidia-smi

Thu Nov 28 18:42:28 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P0              50W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Fine-Tuning Pegasus Model for Abstractive Text Summarization

This script fine-tunes a pre-trained Pegasus model for abstractive text summarization using the CNN/Daily Mail dataset. The process includes data loading, model preparation, training, evaluation, and saving the best model based on validation loss.

## Key Features
- **Model**: Uses the pre-trained `google/pegasus-large` checkpoint (about 569M trainable parameters).
- **Dataset**: Trains on the first 1,000 examples of the CNN/Daily Mail training split (900 for training, 100 for validation).
- **Epochs**: Fine-tunes for 10 epochs.
- **Metrics**: Evaluates with sentence-level BLEU and ROUGE-1/2/L F-measures, averaged over the validation examples.
- **Batch Processing**: Uses gradient accumulation (4 steps at batch size 4) for an effective batch size of 16.
- **Model Saving**: Saves the best performing model based on validation loss.
- **Comparison**: Compares the performance of the base model and the fine-tuned model.
- **Performance Improvement**: Reports the improvement in BLEU and ROUGE scores achieved through fine-tuning.

## Performance Metrics

### Base Model Performance
- **BLEU**: 0.0263
- **ROUGE-1**: 0.2991
- **ROUGE-2**: 0.0978
- **ROUGE-L**: 0.1923

### Fine-Tuned Model Performance
- **BLEU**: 0.0583
- **ROUGE-1**: 0.3345
- **ROUGE-2**: 0.1316
- **ROUGE-L**: 0.2448

### Performance Improvement
- **BLEU**: +0.0320
- **ROUGE-1**: +0.0354
- **ROUGE-2**: +0.0338
- **ROUGE-L**: +0.0525
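
The BLEU values above come from NLTK's `sentence_bleu` with its default settings (uniform weights over 1- to 4-grams, no smoothing). That score is built on a geometric mean of n-gram precisions, so a single n-gram order with no matches drives the whole score to 0, which is what the warnings in the training log report. The following is a minimal pure-Python sketch of that effect, not NLTK's implementation: it omits the brevity penalty, and the `epsilon` floor is an illustrative assumption loosely in the spirit of `SmoothingFunction().method1`.

```python
import math
from collections import Counter

def ngram_precision(reference, hypothesis, n):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    hyp_counts = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total

def geometric_mean_precision(reference, hypothesis, max_n=4, epsilon=0.0):
    """Geometric mean of the 1..max_n precisions; zero precisions are floored to epsilon."""
    precisions = [ngram_precision(reference, hypothesis, n) for n in range(1, max_n + 1)]
    precisions = [p if p > 0 else epsilon for p in precisions]
    if min(precisions) == 0:
        return 0.0  # one empty n-gram order zeroes the whole score
    return math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the cat sat on the mat".split()
hypothesis = "the cat is on the mat".split()

# No 4-gram of the hypothesis appears in the reference.
unsmoothed = geometric_mean_precision(reference, hypothesis)
smoothed = geometric_mean_precision(reference, hypothesis, epsilon=0.1)
print(unsmoothed, smoothed)
```

Because short or partially overlapping summaries often share no 4-gram with the reference, averaging unsmoothed sentence-level scores can understate BLEU; a smoothing function or corpus-level BLEU are the usual alternatives.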

In [4]:
!pip install datasets
!pip install rouge_score
!pip install nltk

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (

In [5]:
import random
import torch
from tqdm import tqdm
from torch.utils.data import DataLoader, Dataset
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from datasets import load_dataset
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
from transformers import get_linear_schedule_with_warmup
import os

# Set random seed for reproducibility
random_seed = 42
random.seed(random_seed)
torch.manual_seed(random_seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(random_seed)

In [6]:
# Load pre-trained model and tokenizer
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/3.09k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/260 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

In [7]:
# Count the number of trainable parameters in the model
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total Trainable Parameters: {total_params:,}")

Total Trainable Parameters: 568,699,904
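
The parameter count gives a rough lower bound on GPU memory for full fine-tuning. The sketch below is back-of-the-envelope arithmetic under stated assumptions: everything is held in fp32 (4 bytes per value), the optimizer is AdamW (which keeps two moment estimates per parameter), and activation memory, which grows with batch size and sequence length, is ignored.

```python
NUM_PARAMS = 568_699_904  # trainable parameters reported above
BYTES_PER_FP32 = 4        # assumption: weights, gradients, and optimizer state all in fp32

def gigabytes(num_bytes):
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9

weights_bytes = NUM_PARAMS * BYTES_PER_FP32
# weights + gradients + AdamW first-moment and second-moment estimates
training_state_bytes = 4 * weights_bytes

print(f"fp32 weights: {gigabytes(weights_bytes):.2f} GB")
print(f"weights + gradients + AdamW state: {gigabytes(training_state_bytes):.2f} GB")
```

The weights alone come to about 2.3 GB, in line with the size of the downloaded `pytorch_model.bin`, and the persistent training state to about 9.1 GB before any activations.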


In [8]:
# Constants
MAX_INPUT_LENGTH = 1024
MAX_TARGET_LENGTH = 128
BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
NUM_EPOCHS = 10
LEARNING_RATE = 5e-5
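
`BATCH_SIZE` and `GRADIENT_ACCUMULATION_STEPS` together give an effective batch size of 16. The toy sketch below illustrates why accumulating gradients of scaled micro-batch losses matches one large batch. It assumes a one-parameter model `y = w * x` with a mean-squared-error loss and equally sized micro-batches; the model and helper name are hypothetical and only serve the illustration.

```python
def mean_gradient(w, xs, ys):
    """Gradient with respect to w of the mean squared error of the toy model y = w * x."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4

w = 0.5
xs = [float(i) for i in range(1, 17)]  # 16 toy inputs
ys = [3.0 * x for x in xs]             # targets generated with a "true" weight of 3.0

# Gradient of one large batch of 16 examples.
full_batch_grad = mean_gradient(w, xs, ys)

# The same gradient accumulated from 4 micro-batches of 4,
# each scaled by 1 / GRADIENT_ACCUMULATION_STEPS (as the loss is in the training loop).
accumulated_grad = 0.0
for start in range(0, len(xs), BATCH_SIZE):
    micro_x = xs[start:start + BATCH_SIZE]
    micro_y = ys[start:start + BATCH_SIZE]
    accumulated_grad += mean_gradient(w, micro_x, micro_y) / GRADIENT_ACCUMULATION_STEPS

effective_batch_size = BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
print(effective_batch_size, full_batch_grad, accumulated_grad)
```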

In [9]:
# Load dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")

# Prepare data
full_train_data = dataset["train"].select(range(1000))  # first 1,000 training examples
train_size = int(0.9 * len(full_train_data))
val_size = len(full_train_data) - train_size

train_data = full_train_data.select(range(train_size))
val_data = full_train_data.select(range(train_size, len(full_train_data)))

README.md:   0%|          | 0.00/15.6k [00:00<?, ?B/s]

train-00000-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00001-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00002-of-00003.parquet:   0%|          | 0.00/259M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [10]:
print("Length of train_data:", len(train_data))
print("Length of val_data:", len(val_data))

Length of train_data: 900
Length of val_data: 100
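
With 900 training examples, the constants defined earlier imply the step counts below. The sketch assumes a loop that calls `optimizer.step()` and `scheduler.step()` once every `GRADIENT_ACCUMULATION_STEPS` batches; `linear_decay_lr` is a hypothetical stand-in for a zero-warmup linear schedule, written here only to show how the schedule horizon affects the final learning rate.

```python
import math

TRAIN_EXAMPLES = 900
BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
NUM_EPOCHS = 10
LEARNING_RATE = 5e-5

batches_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
optimizer_steps_per_epoch = batches_per_epoch // GRADIENT_ACCUMULATION_STEPS
# Batches whose gradients are still waiting for an optimizer step when the epoch ends.
leftover_batches = batches_per_epoch % GRADIENT_ACCUMULATION_STEPS
total_optimizer_steps = optimizer_steps_per_epoch * NUM_EPOCHS

def linear_decay_lr(step, total_steps, base_lr=LEARNING_RATE):
    """Learning rate after `step` scheduler steps of a zero-warmup linear decay."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

# Horizon counted in optimizer steps: the rate reaches zero at the end of training.
lr_if_horizon_counts_steps = linear_decay_lr(total_optimizer_steps, total_optimizer_steps)
# Horizon counted in batches: the rate only decays part of the way.
lr_if_horizon_counts_batches = linear_decay_lr(total_optimizer_steps, batches_per_epoch * NUM_EPOCHS)

print(batches_per_epoch, optimizer_steps_per_epoch, leftover_batches, total_optimizer_steps)
print(lr_if_horizon_counts_steps, lr_if_horizon_counts_batches)
```

The 225 batches per epoch match the training progress bars. Because 225 is not a multiple of 4, one batch per epoch leaves gradients that carry over into the first optimizer step of the next epoch.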


In [29]:
class SummarizationDataset(Dataset):
    def __init__(self, data, tokenizer, max_input_length=MAX_INPUT_LENGTH, max_target_length=MAX_TARGET_LENGTH):
        self.data = data
        self.tokenizer = tokenizer
        self.max_input_length = max_input_length
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        article = self.data[idx]["article"]
        summary = self.data[idx]["highlights"]

        inputs = self.tokenizer(article, max_length=self.max_input_length, truncation=True, padding="max_length", return_tensors="pt")
        targets = self.tokenizer(summary, max_length=self.max_target_length, truncation=True, padding="max_length", return_tensors="pt")

        return {
            "input_ids": inputs.input_ids.squeeze(),
            "attention_mask": inputs.attention_mask.squeeze(),
            "labels": targets.input_ids.squeeze()
        }

train_dataset = SummarizationDataset(train_data, tokenizer)
val_dataset = SummarizationDataset(val_data, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

# Set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def evaluate(model, data_loader):
    model.eval()
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    bleu_scores = []
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    with torch.no_grad():
        for batch in tqdm(data_loader, desc="Evaluating"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            generated_ids = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=MAX_TARGET_LENGTH)

            for i in range(len(input_ids)):
                reference = tokenizer.decode(labels[i], skip_special_tokens=True)
                generated_summary = tokenizer.decode(generated_ids[i], skip_special_tokens=True)

                bleu_score = sentence_bleu([reference.split()], generated_summary.split())
                bleu_scores.append(bleu_score)

                rouge_result = scorer.score(reference, generated_summary)
                for metric in rouge_scores:
                    rouge_scores[metric].append(rouge_result[metric].fmeasure)

    avg_bleu = sum(bleu_scores) / len(bleu_scores)
    avg_rouge = {metric: sum(scores) / len(scores) for metric, scores in rouge_scores.items()}

    return avg_bleu, avg_rouge

# Evaluate base model
print("Evaluating base model...")
base_bleu, base_rouge = evaluate(model, val_loader)
print("Base Model Performance:")
print(f"BLEU Score: {base_bleu}")
print(f"ROUGE Scores: {base_rouge}")

Evaluating base model...


Evaluating: 100%|██████████| 25/25 [01:22<00:00,  3.30s/it]

Base Model Performance:
BLEU Score: 0.026288549358155056
ROUGE Scores: {'rouge1': 0.2990982232795342, 'rouge2': 0.09778408816722141, 'rougeL': 0.19233328126143232}
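
`SummarizationDataset` pads every target to `MAX_TARGET_LENGTH` and passes the padded ids as `labels`, so the loss is also computed over padding positions. A common alternative is to replace padding ids in the labels with -100, the value that PyTorch's cross-entropy loss (and therefore the Hugging Face seq2seq loss) ignores by default. The sketch below shows the idea on plain Python lists; the helper names, the example ids, and the pad id of 0 are illustrative assumptions, and with tensors the same masking would be applied to `targets.input_ids`.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in label_ids]

def unmask_padding(label_ids, pad_token_id):
    """Restore padding ids before decoding, since IGNORE_INDEX is not a valid token id."""
    return [pad_token_id if token_id == IGNORE_INDEX else token_id for token_id in label_ids]

PAD_TOKEN_ID = 0                      # illustrative; read tokenizer.pad_token_id in practice
labels = [713, 117, 114, 1, 0, 0, 0]  # made-up token ids followed by padding

masked = mask_padding(labels, PAD_TOKEN_ID)
restored = unmask_padding(masked, PAD_TOKEN_ID)
print(masked)
print(restored)
```

Two consequences are worth keeping in mind: losses computed with masked labels are not comparable to the values logged in this notebook, and any code that decodes `labels` (as `evaluate` does for the reference summaries) would need to restore the padding ids first.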





In [35]:
# Fine-tuning
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
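# The scheduler is stepped once per optimizer step, so count optimizer steps rather than batches
num_training_steps = (len(train_loader) // GRADIENT_ACCUMULATION_STEPS) * NUM_EPOCHS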
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

best_val_loss = float('inf')
best_model_path = 'best_pegasus_model.pth'

for epoch in range(NUM_EPOCHS):
    model.train()
    total_train_loss = 0
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{NUM_EPOCHS} [Train]")

    for i, batch in enumerate(progress_bar):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_train_loss += loss.item()

        loss = loss / GRADIENT_ACCUMULATION_STEPS
        loss.backward()

        if (i + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

        progress_bar.set_postfix({"train_loss": loss.item() * GRADIENT_ACCUMULATION_STEPS})

    avg_train_loss = total_train_loss / len(train_loader)

    # Validation
    model.eval()
    total_val_loss = 0
    with torch.no_grad():
        for batch in tqdm(val_loader, desc=f"Epoch {epoch+1}/{NUM_EPOCHS} [Val]"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(val_loader)

    print(f"Epoch {epoch+1}/{NUM_EPOCHS}, Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

    # Evaluate every 2 epochs
    if (epoch + 1) % 2 == 0:
        print(f"Evaluating after epoch {epoch+1}...")
        current_bleu, current_rouge = evaluate(model, val_loader)
        print(f"Current BLEU Score: {current_bleu}")
        print(f"Current ROUGE Scores: {current_rouge}")

    # Save the model if it's the best so far based on validation loss
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        torch.save(model.state_dict(), best_model_path)
        print(f"New best model saved with validation loss: {best_val_loss:.4f}")

# Load the best model for final evaluation
if os.path.exists(best_model_path):
    model.load_state_dict(torch.load(best_model_path, map_location=device, weights_only=True))
    print(f"Loaded best model from {best_model_path}")
else:
    print("No saved model found. Using the model from the last epoch.")

# Evaluate fine-tuned model
print("Evaluating fine-tuned model...")
fine_tuned_bleu, fine_tuned_rouge = evaluate(model, val_loader)
print("Fine-tuned Model Performance:")
print(f"BLEU Score: {fine_tuned_bleu}")
print(f"ROUGE Scores: {fine_tuned_rouge}")

# Print performance improvement
print("Performance Improvement:")
print(f"BLEU: {fine_tuned_bleu - base_bleu}")
print(f"ROUGE-1: {fine_tuned_rouge['rouge1'] - base_rouge['rouge1']}")
print(f"ROUGE-2: {fine_tuned_rouge['rouge2'] - base_rouge['rouge2']}")
print(f"ROUGE-L: {fine_tuned_rouge['rougeL'] - base_rouge['rougeL']}")

Epoch 1/10 [Train]: 100%|██████████| 225/225 [02:20<00:00,  1.61it/s, train_loss=6.71]
Epoch 1/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.65it/s]


Epoch 1/10, Train Loss: 7.4636, Val Loss: 6.8137
New best model saved with validation loss: 6.8137


Epoch 2/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=6.69]
Epoch 2/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.65it/s]


Epoch 2/10, Train Loss: 6.7187, Val Loss: 6.3782
Evaluating after epoch 2...


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
Evaluating: 100%|██████████| 25/25 [01:21<00:00,  3.25s/it]


Current BLEU Score: 0.037923041143913705
Current ROUGE Scores: {'rouge1': 0.31771908402970306, 'rouge2': 0.11809687661522462, 'rougeL': 0.21862539604135886}
New best model saved with validation loss: 6.3782


Epoch 3/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=5.26]
Epoch 3/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.64it/s]


Epoch 3/10, Train Loss: 6.0713, Val Loss: 5.0543
New best model saved with validation loss: 5.0543


Epoch 4/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=1.78]
Epoch 4/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.65it/s]


Epoch 4/10, Train Loss: 3.2170, Val Loss: 1.0458
Evaluating after epoch 4...


Evaluating: 100%|██████████| 25/25 [01:07<00:00,  2.68s/it]


Current BLEU Score: 0.04283036950897102
Current ROUGE Scores: {'rouge1': 0.26864267289091737, 'rouge2': 0.10515347508776139, 'rougeL': 0.19706724386699154}
New best model saved with validation loss: 1.0458


Epoch 5/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=1.21]
Epoch 5/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.64it/s]


Epoch 5/10, Train Loss: 1.0254, Val Loss: 0.8590
New best model saved with validation loss: 0.8590


Epoch 6/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=0.744]
Epoch 6/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.66it/s]


Epoch 6/10, Train Loss: 0.8277, Val Loss: 0.8244
Evaluating after epoch 6...


Evaluating: 100%|██████████| 25/25 [01:00<00:00,  2.42s/it]


Current BLEU Score: 0.06612014396622451
Current ROUGE Scores: {'rouge1': 0.34877814790901546, 'rouge2': 0.14381547302377723, 'rougeL': 0.24555802349828285}
New best model saved with validation loss: 0.8244


Epoch 7/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=0.781]
Epoch 7/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.65it/s]


Epoch 7/10, Train Loss: 0.7702, Val Loss: 0.8168
New best model saved with validation loss: 0.8168


Epoch 8/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=0.709]
Epoch 8/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.66it/s]


Epoch 8/10, Train Loss: 0.7237, Val Loss: 0.8117
Evaluating after epoch 8...


Evaluating: 100%|██████████| 25/25 [00:59<00:00,  2.37s/it]


Current BLEU Score: 0.05939039734879392
Current ROUGE Scores: {'rouge1': 0.345182970975227, 'rouge2': 0.1394900157217645, 'rougeL': 0.2444813051679993}
New best model saved with validation loss: 0.8117


Epoch 9/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=0.626]
Epoch 9/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.65it/s]


Epoch 9/10, Train Loss: 0.6989, Val Loss: 0.8095
New best model saved with validation loss: 0.8095


Epoch 10/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=0.819]
Epoch 10/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.64it/s]


Epoch 10/10, Train Loss: 0.6728, Val Loss: 0.8053
Evaluating after epoch 10...


Evaluating: 100%|██████████| 25/25 [01:02<00:00,  2.51s/it]


Current BLEU Score: 0.05830824748912286
Current ROUGE Scores: {'rouge1': 0.33447035562023225, 'rouge2': 0.13160724821103553, 'rougeL': 0.2448122833529142}
New best model saved with validation loss: 0.8053




Loaded best model from best_pegasus_model.pth
Evaluating fine-tuned model...


Evaluating: 100%|██████████| 25/25 [01:02<00:00,  2.51s/it]

Fine-tuned Model Performance:
BLEU Score: 0.05830824748912286
ROUGE Scores: {'rouge1': 0.33447035562023225, 'rouge2': 0.13160724821103553, 'rougeL': 0.2448122833529142}
Performance Improvement:
BLEU: 0.032019698130967805
ROUGE-1: 0.035372132340698026
ROUGE-2: 0.033823160043814124
ROUGE-L: 0.052479002091481874
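
The checkpoint rule used above keeps the epoch with the lowest validation loss. The short sketch below replays that rule on the values logged in this run (rounded to four decimals) and compares it with selecting on ROUGE-1 instead; the dictionary names are hypothetical.

```python
# Validation loss logged after each epoch of the run above.
val_loss_by_epoch = {
    1: 6.8137, 2: 6.3782, 3: 5.0543, 4: 1.0458, 5: 0.8590,
    6: 0.8244, 7: 0.8168, 8: 0.8117, 9: 0.8095, 10: 0.8053,
}
# ROUGE-1 was only evaluated every 2 epochs.
rouge1_by_epoch = {2: 0.3177, 4: 0.2686, 6: 0.3488, 8: 0.3452, 10: 0.3345}

best_by_loss = min(val_loss_by_epoch, key=val_loss_by_epoch.get)
best_by_rouge1 = max(rouge1_by_epoch, key=rouge1_by_epoch.get)

print(f"Best epoch by validation loss: {best_by_loss}")
print(f"Best epoch by ROUGE-1: {best_by_rouge1}")
```

Validation loss improved at every epoch, so the saved checkpoint is the epoch-10 model, while ROUGE-1 peaked at epoch 6. With only 100 validation examples the gap may be within noise, but it suggests that selecting on the generation metric could pick a different checkpoint.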





In [14]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


def generate_summary(model, article, tokenizer, max_length=128):
    inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt").to(device)
    summary_ids = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], num_beams=4, max_length=max_length, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Generate summaries with base model
print("Base Model Summaries:")
for i in range(3):  # Generate 3 summaries
    article = val_data[i]["article"]
    reference = val_data[i]["highlights"]
    generated = generate_summary(model, article, tokenizer)
    print(f"\nArticle {i+1}:")
    print(f"Reference: {reference}")
    print(f"Generated: {generated}")
    print("-" * 50)

Base Model Summaries:

Article 1:
Reference: Photos of Taliban in the uniforms of dead French soldiers provokes outrage .
Magazine Paris Match features photos of Taliban and their commander .
10 French troops were killed and a further 21 injured in an ambush .
Generated: Joel Le Pahun, father of one of the killed soldiers, told the newspaper the pictures were "despicable." Green MP Daniel Cohn-Bendit called them "voyeurism." However, Paris Match editor Laurent Valdiguie defended the publication, saying it was "legitimate" given the importance of the story.
--------------------------------------------------

Article 2:
Reference: The Middle East's richest man is Prince Alwaleed Bin Talal Alsaud .
He ranks 19th in the world in the Forbes Rich List .
Seven other billionaires from the Middle East rank in the top 100 .
Generated: The Middle East's richest man: Prince Alwaleed Bin Talal Alsaud . The Middle East's richest man is Prince Alwaleed Bin Talal Alsaud, the 51 year old Saudi who has 

In [16]:
# Load the fine-tuned model
best_model_path = "/content/drive/MyDrive/NLP-Project/best_pegasus_model.pth"
fine_tuned_model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")
fine_tuned_model.load_state_dict(torch.load(best_model_path, map_location=device, weights_only=True))
fine_tuned_model.to(device)

print("\nFine-tuned Model Summaries:")
for i in range(3):  # Generate 3 summaries
    article = val_data[i]["article"]
    reference = val_data[i]["highlights"]
    generated = generate_summary(fine_tuned_model, article, tokenizer)
    print(f"\nArticle {i+1}:")
    print(f"Reference: {reference}")
    print(f"Generated: {generated}")
    print("-" * 50)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Fine-tuned Model Summaries:

Article 1:
Reference: Photos of Taliban in the uniforms of dead French soldiers provokes outrage .
Magazine Paris Match features photos of Taliban and their commander .
10 French troops were killed and a further 21 injured in an ambush .
Generated: Paris Match includes photos of Taliban fighters and their commander . The latest edition includes photos of the Taliban fighters and their commander, "Farouki," wearing French uniforms . Father of one of the 10 French soldiers says pictures are "despicable"
--------------------------------------------------

Article 2:
Reference: The Middle East's richest man is Prince Alwaleed Bin Talal Alsaud .
He ranks 19th in the world in the Forbes Rich List .
Seven other billionaires from the Middle East rank in the top 100 .
Generated: Prince Alwaleed Bin Talal Alsaud ranks 19th in the list and is considered to be the most active and successful investor in the Middle East . He took his investment vehicle, Kingdom Holding,

# Fine-Tuning Pegasus Model for Abstractive Text Summarization on 5,000 Articles

This modular version of the script fine-tunes the same pre-trained Pegasus model on the first 5,000 examples of the CNN/Daily Mail training split (4,500 for training, 500 for validation). It wraps data loading, model preparation, training, evaluation, and best-checkpoint saving (by validation loss) in functions, and adds timing and training-progress plots.



## Performance Metrics

### Base Model Performance
- **BLEU**: 0.027228885603706635
- **ROUGE-1**, **ROUGE-2**, **ROUGE-L**: not recorded

### Fine-Tuned Model Performance
- **BLEU Score**: 0.07084515511453165
- **ROUGE-1**: 0.3628623757565976
- **ROUGE-2**: 0.14441779133424118
- **ROUGE-L**: 0.25522549668218547
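
The ROUGE values reported here are F-measures. As a rough illustration of what ROUGE-1 measures, the sketch below computes a unigram-overlap F-measure in pure Python. It is not the `rouge_score` implementation: it uses lowercase whitespace tokenization and no stemming (the notebook enables `use_stemmer=True`), and the two sentences are shortened examples adapted from the summaries shown earlier.

```python
from collections import Counter

def rouge1_f(reference, candidate):
    """Unigram-overlap F-measure between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "He ranks 19th in the world in the Forbes Rich List"
candidate = "Prince Alwaleed ranks 19th in the list"

score = rouge1_f(reference, candidate)
print(f"ROUGE-1 F-measure: {score:.4f}")
```

Here 5 of the 7 candidate tokens match (precision 5/7) and 5 of the 11 reference tokens are covered (recall 5/11), which gives an F-measure of 5/9.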



In [17]:
# Install required libraries before importing them
!pip install transformers datasets rouge_score nltk tqdm matplotlib

import os
import random
import time

import matplotlib.pyplot as plt
import torch
from datasets import load_dataset
from google.colab import drive, files
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from transformers import PegasusForConditionalGeneration, PegasusTokenizer, get_linear_schedule_with_warmup

# Mount Google Drive (the best checkpoint is saved there)
drive.mount('/content/drive')

# Constants
MAX_INPUT_LENGTH = 1024
MAX_TARGET_LENGTH = 128
BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
NUM_EPOCHS = 10
LEARNING_RATE = 5e-5

def set_seed(seed=42):
    random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

def load_data(num_samples=1000):
    dataset = load_dataset("cnn_dailymail", "3.0.0")
    full_train_data = dataset["train"].select(range(num_samples))
    train_size = int(0.9 * len(full_train_data))
    train_data = full_train_data.select(range(train_size))
    val_data = full_train_data.select(range(train_size, len(full_train_data)))
    return train_data, val_data

class SummarizationDataset(Dataset):
    def __init__(self, data, tokenizer, max_input_length=MAX_INPUT_LENGTH, max_target_length=MAX_TARGET_LENGTH):
        self.data = data
        self.tokenizer = tokenizer
        self.max_input_length = max_input_length
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        article = self.data[idx]["article"]
        summary = self.data[idx]["highlights"]
        inputs = self.tokenizer(article, max_length=self.max_input_length, truncation=True, padding="max_length", return_tensors="pt")
        targets = self.tokenizer(summary, max_length=self.max_target_length, truncation=True, padding="max_length", return_tensors="pt")
        return {
            "input_ids": inputs.input_ids.squeeze(),
            "attention_mask": inputs.attention_mask.squeeze(),
            "labels": targets.input_ids.squeeze()
        }

def create_dataloaders(train_data, val_data, tokenizer):
    train_dataset = SummarizationDataset(train_data, tokenizer)
    val_dataset = SummarizationDataset(val_data, tokenizer)
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
    return train_loader, val_loader

def evaluate(model, data_loader, tokenizer, device):
    model.eval()
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    bleu_scores = []
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    with torch.no_grad():
        for batch in tqdm(data_loader, desc="Evaluating"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            generated_ids = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=MAX_TARGET_LENGTH)

            for i in range(len(input_ids)):
                reference = tokenizer.decode(labels[i], skip_special_tokens=True)
                generated_summary = tokenizer.decode(generated_ids[i], skip_special_tokens=True)

                bleu_score = sentence_bleu([reference.split()], generated_summary.split())
                bleu_scores.append(bleu_score)

                rouge_result = scorer.score(reference, generated_summary)
                for metric in rouge_scores:
                    rouge_scores[metric].append(rouge_result[metric].fmeasure)

    avg_bleu = sum(bleu_scores) / len(bleu_scores)
    avg_rouge = {metric: sum(scores) / len(scores) for metric, scores in rouge_scores.items()}

    return avg_bleu, avg_rouge

def train_model(model, train_loader, val_loader, tokenizer, device, num_epochs):
    optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
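    # The scheduler is stepped once per optimizer step, so count optimizer steps rather than batches
    num_training_steps = (len(train_loader) // GRADIENT_ACCUMULATION_STEPS) * num_epochs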
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

    best_val_loss = float('inf')
    best_model_path = '/content/drive/My Drive/NLP-Project/best_pegasus_model_modular_script_test.pth'

    train_losses = []
    val_losses = []
    bleu_scores = []
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    start_time = time.time()

    for epoch in range(num_epochs):
        model.train()
        total_train_loss = 0
        progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs} [Train]")

        for i, batch in enumerate(progress_bar):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_train_loss += loss.item()

            loss = loss / GRADIENT_ACCUMULATION_STEPS
            loss.backward()

            if (i + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()

            progress_bar.set_postfix({"train_loss": loss.item() * GRADIENT_ACCUMULATION_STEPS})

        avg_train_loss = total_train_loss / len(train_loader)
        train_losses.append(avg_train_loss)

        # Validation
        model.eval()
        total_val_loss = 0
        with torch.no_grad():
            for batch in tqdm(val_loader, desc=f"Epoch {epoch+1}/{num_epochs} [Val]"):
                input_ids = batch["input_ids"].to(device)
                attention_mask = batch["attention_mask"].to(device)
                labels = batch["labels"].to(device)

                outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
                loss = outputs.loss
                total_val_loss += loss.item()

        avg_val_loss = total_val_loss / len(val_loader)
        val_losses.append(avg_val_loss)

        print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

        # Evaluate every 2 epochs
        if (epoch + 1) % 2 == 0:
            print(f"Evaluating after epoch {epoch+1}...")
            current_bleu, current_rouge = evaluate(model, val_loader, tokenizer, device)
            bleu_scores.append(current_bleu)
            for metric in rouge_scores:
                rouge_scores[metric].append(current_rouge[metric])
            print(f"Current BLEU Score: {current_bleu}")
            print(f"Current ROUGE Scores: {current_rouge}")

        # Save the model if it's the best so far based on validation loss
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            torch.save(model.state_dict(), best_model_path)
            print(f"New best model saved with validation loss: {best_val_loss:.4f}")

    end_time = time.time()
    training_time = end_time - start_time
    print(f"Total training time: {training_time:.2f} seconds")

    return train_losses, val_losses, bleu_scores, rouge_scores, training_time

def plot_training_progress(train_losses, val_losses, bleu_scores, rouge_scores):
    plt.figure(figsize=(12, 8))
    plt.subplot(2, 2, 1)
    plt.plot(train_losses, label='Train Loss')
    plt.plot(val_losses, label='Validation Loss')
    plt.legend()
    plt.title('Training and Validation Loss')

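    # BLEU and ROUGE are evaluated every 2 epochs, so label the x-axis with those epochs
    eval_epochs = [2 * (k + 1) for k in range(len(bleu_scores))]

    plt.subplot(2, 2, 2)
    plt.plot(eval_epochs, bleu_scores)
    plt.xlabel('Epoch')
    plt.title('BLEU Score')

    plt.subplot(2, 2, 3)
    for metric, scores in rouge_scores.items():
        plt.plot(eval_epochs, scores, label=metric)
    plt.xlabel('Epoch')
    plt.legend()
    plt.title('ROUGE Scores')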

    plt.tight_layout()
    plt.savefig('training_progress.png')
    plt.close()

    # Download the plot
    files.download('training_progress.png')

def generate_summary(model, article, tokenizer, device, max_length=128):
    inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt").to(device)
    summary_ids = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], num_beams=4, max_length=max_length, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

def main():
    set_seed()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")
    tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
    model.to(device)

    train_data, val_data = load_data(num_samples=5000)
    train_loader, val_loader = create_dataloaders(train_data, val_data, tokenizer)

    print("Evaluating base model...")
    base_bleu, base_rouge = evaluate(model, val_loader, tokenizer, device)
    print("Base Model Performance:")
    print(f"BLEU Score: {base_bleu}")
    print(f"ROUGE Scores: {base_rouge}")

    train_losses, val_losses, bleu_scores, rouge_scores, training_time = train_model(model, train_loader, val_loader, tokenizer, device, NUM_EPOCHS)

    plot_training_progress(train_losses, val_losses, bleu_scores, rouge_scores)

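    # Note: `model` still holds the final-epoch weights here; the checkpoint saved to
    # best_model_path during training is not reloaded before this evaluation.
    print("Evaluating fine-tuned model...")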
    fine_tuned_bleu, fine_tuned_rouge = evaluate(model, val_loader, tokenizer, device)
    print("Fine-tuned Model Performance:")
    print(f"BLEU Score: {fine_tuned_bleu}")
    print(f"ROUGE Scores: {fine_tuned_rouge}")

    print("Performance Improvement:")
    print(f"BLEU: {fine_tuned_bleu - base_bleu}")
    print(f"ROUGE-1: {fine_tuned_rouge['rouge1'] - base_rouge['rouge1']}")
    print(f"ROUGE-2: {fine_tuned_rouge['rouge2'] - base_rouge['rouge2']}")
    print(f"ROUGE-L: {fine_tuned_rouge['rougeL'] - base_rouge['rougeL']}")

    print(f"Total training time: {training_time:.2f} seconds")

    # Generate example summaries
    print("\nGenerating example summaries...")
    for i in range(3):
        article = val_data[i]["article"]
        reference = val_data[i]["highlights"]
        generated = generate_summary(model, article, tokenizer, device)
        print(f"\nArticle {i+1}:")
        print(f"Reference: {reference}")
        print(f"Generated: {generated}")
        print("-" * 50)

if __name__ == "__main__":
    main()

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Evaluating base model...


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
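
The warnings above are emitted when a generated summary shares no n-grams of some order with its reference: unsmoothed BLEU is a geometric mean of the n-gram precisions, so a single zero precision forces the whole score to 0. The sketch below reproduces that behaviour with the standard library only and shows how a simple add-one smoothing keeps the score positive. The function names and the add-one scheme are illustrative assumptions; this is not the notebook's `evaluate` function and not a reimplementation of any particular `SmoothingFunction` method.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu_sketch(reference, hypothesis, max_n=4, smooth=False):
    """Sentence-level BLEU with uniform weights and a brevity penalty.

    With smooth=True, 1 is added to the numerator and denominator of every
    n-gram precision of order > 1 (illustrative add-one smoothing).
    """
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        # Clipped matches: a hypothesis n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        if smooth and n > 1:
            matches, total = matches + 1, total + 1
        if matches == 0 or total == 0:
            return 0.0  # one empty order zeroes the geometric mean
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize hypotheses that are not longer than the reference.
    if len(hypothesis) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(sum(log_precisions) / max_n)


# Toy sentences: they share unigrams and one bigram, but no trigram.
reference = "the structure would support 50,000 inhabitants".split()
hypothesis = "the city can accommodate 50,000 inhabitants".split()

unsmoothed = sentence_bleu_sketch(reference, hypothesis)
smoothed = sentence_bleu_sketch(reference, hypothesis, smooth=True)
print(f"unsmoothed: {unsmoothed:.4f}")
print(f"smoothed:   {smoothed:.4f}")
```

Because many sentence-level scores collapse to 0 without smoothing, the averaged BLEU values reported in this log are likely pulled down by such cases, which is worth keeping in mind when comparing them with the ROUGE scores.
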
Evaluating: 100%|██████████| 125/125 [06:40<00:00,  3.21s/it]


Base Model Performance:
BLEU Score: 0.027228885603706635
ROUGE Scores: {'rouge1': 0.29546262944841484, 'rouge2': 0.09988257187753315, 'rougeL': 0.1893963564402305}


Epoch 1/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=1.11]
Epoch 1/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.63it/s]


Epoch 1/10, Train Loss: 4.5232, Val Loss: 0.8539
New best model saved with validation loss: 0.8539


Epoch 2/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=0.837]
Epoch 2/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.60it/s]


Epoch 2/10, Train Loss: 0.8746, Val Loss: 0.7983
Evaluating after epoch 2...


Evaluating: 100%|██████████| 125/125 [04:53<00:00,  2.35s/it]


Current BLEU Score: 0.05723063253261528
Current ROUGE Scores: {'rouge1': 0.3343995619262688, 'rouge2': 0.12859040561585394, 'rougeL': 0.23005640722224163}
New best model saved with validation loss: 0.7983


Epoch 3/10 [Train]: 100%|██████████| 1125/1125 [11:41<00:00,  1.60it/s, train_loss=0.673]
Epoch 3/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.63it/s]


Epoch 3/10, Train Loss: 0.7910, Val Loss: 0.7814
New best model saved with validation loss: 0.7814


Epoch 4/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=0.673]
Epoch 4/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.62it/s]


Epoch 4/10, Train Loss: 0.7474, Val Loss: 0.7701
Evaluating after epoch 4...


Evaluating: 100%|██████████| 125/125 [04:36<00:00,  2.21s/it]


Current BLEU Score: 0.06638918428998888
Current ROUGE Scores: {'rouge1': 0.35204206325960075, 'rouge2': 0.14036960881993635, 'rougeL': 0.2451482071812417}
New best model saved with validation loss: 0.7701


Epoch 5/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=0.896]
Epoch 5/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.62it/s]


Epoch 5/10, Train Loss: 0.7112, Val Loss: 0.7644
New best model saved with validation loss: 0.7644


Epoch 6/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=0.601]
Epoch 6/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.62it/s]


Epoch 6/10, Train Loss: 0.6790, Val Loss: 0.7636
Evaluating after epoch 6...


Evaluating: 100%|██████████| 125/125 [04:47<00:00,  2.30s/it]


Current BLEU Score: 0.06690326051331127
Current ROUGE Scores: {'rouge1': 0.35003137436937154, 'rouge2': 0.13871897181210435, 'rougeL': 0.24619002649832475}
New best model saved with validation loss: 0.7636


Epoch 7/10 [Train]: 100%|██████████| 1125/1125 [11:41<00:00,  1.60it/s, train_loss=0.927]
Epoch 7/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.62it/s]


Epoch 7/10, Train Loss: 0.6501, Val Loss: 0.7610
New best model saved with validation loss: 0.7610


Epoch 8/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=0.492]
Epoch 8/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.63it/s]


Epoch 8/10, Train Loss: 0.6243, Val Loss: 0.7627
Evaluating after epoch 8...


Evaluating: 100%|██████████| 125/125 [04:38<00:00,  2.22s/it]


Current BLEU Score: 0.06885554488189281
Current ROUGE Scores: {'rouge1': 0.35780621059633066, 'rouge2': 0.14257286127574187, 'rougeL': 0.25020514690533696}


Epoch 9/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=0.692]
Epoch 9/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.61it/s]


Epoch 9/10, Train Loss: 0.5976, Val Loss: 0.7642


Epoch 10/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=0.647]
Epoch 10/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.62it/s]


Epoch 10/10, Train Loss: 0.5743, Val Loss: 0.7692
Evaluating after epoch 10...


Evaluating: 100%|██████████| 125/125 [04:33<00:00,  2.19s/it]


Current BLEU Score: 0.07084515511453165
Current ROUGE Scores: {'rouge1': 0.3628623757565976, 'rouge2': 0.14441779133424118, 'rougeL': 0.25522549668218547}
Total training time: 8732.52 seconds
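
In the epoch log above, validation loss reaches its minimum at epoch 7 (0.7610) and rises over the last three epochs while training loss keeps falling, which is the usual sign of overfitting. The sketch below, using only the logged values, locates the best epoch and shows where a patience-based early-stopping rule would have halted training. The helper names `best_epoch` and `early_stop_epoch` and the `patience` parameter are hypothetical and not part of the notebook.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1


def early_stop_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, i.e. after
    `patience` consecutive epochs without a new best validation loss.
    Returns the last epoch if that never happens."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


# Validation losses copied from the epoch log above.
val_losses = [0.8539, 0.7983, 0.7814, 0.7701, 0.7644,
              0.7636, 0.7610, 0.7627, 0.7642, 0.7692]

print(best_epoch(val_losses))                    # 7
print(early_stop_epoch(val_losses, patience=2))  # 9
```

With a patience of 2, training would have stopped after epoch 9, and the weights saved at epoch 7 would be the ones to restore.
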


Evaluating fine-tuned model...


Evaluating: 100%|██████████| 125/125 [04:33<00:00,  2.19s/it]


Fine-tuned Model Performance:
BLEU Score: 0.07084515511453165
ROUGE Scores: {'rouge1': 0.3628623757565976, 'rouge2': 0.14441779133424118, 'rougeL': 0.25522549668218547}
Performance Improvement:
BLEU: 0.04361626951082502
ROUGE-1: 0.06739974630818274
ROUGE-2: 0.04453521945670803
ROUGE-L: 0.06582914024195496
Total training time: 8732.52 seconds
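
The improvements printed above are absolute differences, which are hard to compare across metrics with very different baselines. As a small worked example, the sketch below recomputes them as relative gains from the base and fine-tuned scores in this log; the `relative_gain` helper is a hypothetical name introduced for illustration.

```python
# Scores copied from the "Base Model Performance" and
# "Fine-tuned Model Performance" output in this log.
base = {"bleu": 0.027228885603706635, "rouge1": 0.29546262944841484,
        "rouge2": 0.09988257187753315, "rougeL": 0.1893963564402305}
tuned = {"bleu": 0.07084515511453165, "rouge1": 0.3628623757565976,
         "rouge2": 0.14441779133424118, "rougeL": 0.25522549668218547}


def relative_gain(before, after):
    """Relative change of `after` over `before`, as a fraction."""
    if before == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (after - before) / before


gains = {name: relative_gain(base[name], tuned[name]) for name in base}
for name, gain in gains.items():
    print(f"{name}: {gain:+.1%}")
```

Seen this way, BLEU improves by roughly 160% over its very low baseline, while the ROUGE scores improve by roughly 23% to 45%.
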

Generating example summaries...

Article 1:
Reference: Belgian architect imagines climate refugees living on a futuristic Lilypad ecopolis .
The structure would support 50,000 inhabitants in a zero carbon environment .
The goal is to "create a harmonious coexistence of humans and nature"
Generated: The Lilypad is the creation of Belgian architect Vincent Callebaut . "It is" he says, "a true amphibian, half aquatic and half terrestrial city, able to accommodate 50,000 inhabitants"
--------------------------------------------------

Article 2:
Reference: Swedish entrepreneur to open a Jumbo Hostel at Arlanda airport in Sweden .
Decommissioned Boeing