In [None]:
!nvidia-smi

Thu Nov 28 18:42:28 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P0              50W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Fine-Tuning Pegasus Model for Abstractive Text Summarization

This script fine-tunes a pre-trained Pegasus model for abstractive text summarization using the CNN/Daily Mail dataset. The process includes data loading, model preparation, training, evaluation, and saving the best model based on validation loss.

## Key Features
- **Model**: Uses the 'google/pegasus-large' pre-trained model.
- **Dataset**: Trains on a subset of the CNN/Daily Mail dataset (1000 samples).
- **Epochs**: Fine-tunes for a specified number of epochs.
- **Metrics**: Evaluates performance using BLEU and ROUGE scores.
- **Batch Processing**: Implements gradient accumulation to handle larger batch sizes.
- **Model Saving**: Saves the best performing model based on validation loss.
- **Comparison**: Compares the performance of the base model and the fine-tuned model.
- **Performance Improvement**: Reports the improvement in BLEU and ROUGE scores achieved through fine-tuning.

## Performance Metrics

### Base Model Performance
- **BLEU**: 0.026
- **ROUGE-1**: 0.299
- **ROUGE-2**: 0.098
- **ROUGE-L**: 0.192

### Fine-Tuned Model Performance
- **BLEU Score**: 0.05830824748912286
- **ROUGE-1**: 0.33447035562023225
- **ROUGE-2**: 0.13160724821103553
- **ROUGE-L**: 0.2448122833529142

### Performance Improvement
- **BLEU**: +0.032019698130967805
- **ROUGE-1**: +0.035372132340698026
- **ROUGE-2**: +0.033823160043814124
- **ROUGE-L**: +0.052479002091481874

In [None]:
!pip install datasets
!pip install rouge_score
!pip install nltk

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m9.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m12.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading fsspec-2024.9.0-py3-none-any.whl (

In [None]:
import random
import torch
from tqdm import tqdm
from torch.utils.data import DataLoader, Dataset
from transformers import PegasusForConditionalGeneration, PegasusTokenizer
from datasets import load_dataset
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
from transformers import get_linear_schedule_with_warmup
import os

# Set random seed for reproducibility
random_seed = 42
random.seed(random_seed)
torch.manual_seed(random_seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(random_seed)

In [None]:
# Load pre-trained model and tokenizer
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/3.09k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/260 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

In [None]:
# Count the number of trainable parameters in the model
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total Trainable Parameters: {total_params:,}")

Total Trainable Parameters: 568,699,904


In [None]:
# Constants
MAX_INPUT_LENGTH = 1024
MAX_TARGET_LENGTH = 128
BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
NUM_EPOCHS = 10
LEARNING_RATE = 5e-5

In [None]:
# Load dataset
dataset = load_dataset("cnn_dailymail", "3.0.0")

# Prepare data
full_train_data = dataset["train"].select(range(1000))  # 1000
train_size = int(0.9 * len(full_train_data))
val_size = len(full_train_data) - train_size

train_data = full_train_data.select(range(train_size))
val_data = full_train_data.select(range(train_size, len(full_train_data)))

README.md:   0%|          | 0.00/15.6k [00:00<?, ?B/s]

train-00000-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00001-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00002-of-00003.parquet:   0%|          | 0.00/259M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

In [None]:
print("Length of train_data:", len(train_data))
print("Length of val_data:", len(val_data))

Length of train_data: 900
Length of val_data: 100


In [None]:
class SummarizationDataset(Dataset):
    def __init__(self, data, tokenizer, max_input_length=MAX_INPUT_LENGTH, max_target_length=MAX_TARGET_LENGTH):
        self.data = data
        self.tokenizer = tokenizer
        self.max_input_length = max_input_length
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        article = self.data[idx]["article"]
        summary = self.data[idx]["highlights"]

        inputs = self.tokenizer(article, max_length=self.max_input_length, truncation=True, padding="max_length", return_tensors="pt")
        targets = self.tokenizer(summary, max_length=self.max_target_length, truncation=True, padding="max_length", return_tensors="pt")

        return {
            "input_ids": inputs.input_ids.squeeze(),
            "attention_mask": inputs.attention_mask.squeeze(),
            "labels": targets.input_ids.squeeze()
        }

train_dataset = SummarizationDataset(train_data, tokenizer)
val_dataset = SummarizationDataset(val_data, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

# Set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def evaluate(model, data_loader):
    model.eval()
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    bleu_scores = []
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    with torch.no_grad():
        for batch in tqdm(data_loader, desc="Evaluating"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            generated_ids = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=MAX_TARGET_LENGTH)

            for i in range(len(input_ids)):
                reference = tokenizer.decode(labels[i], skip_special_tokens=True)
                generated_summary = tokenizer.decode(generated_ids[i], skip_special_tokens=True)

                bleu_score = sentence_bleu([reference.split()], generated_summary.split())
                bleu_scores.append(bleu_score)

                rouge_result = scorer.score(reference, generated_summary)
                for metric in rouge_scores:
                    rouge_scores[metric].append(rouge_result[metric].fmeasure)

    avg_bleu = sum(bleu_scores) / len(bleu_scores)
    avg_rouge = {metric: sum(scores) / len(scores) for metric, scores in rouge_scores.items()}

    return avg_bleu, avg_rouge

# Evaluate base model
print("Evaluating base model...")
base_bleu, base_rouge = evaluate(model, val_loader)
print("Base Model Performance:")
print(f"BLEU Score: {base_bleu}")
print(f"ROUGE Scores: {base_rouge}")

Evaluating base model...


Evaluating: 100%|██████████| 25/25 [01:22<00:00,  3.30s/it]

Base Model Performance:
BLEU Score: 0.026288549358155056
ROUGE Scores: {'rouge1': 0.2990982232795342, 'rouge2': 0.09778408816722141, 'rougeL': 0.19233328126143232}





In [None]:
# Fine-tuning
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
num_training_steps = len(train_loader) * NUM_EPOCHS
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

best_val_loss = float('inf')
best_model_path = 'best_pegasus_model.pth'

for epoch in range(NUM_EPOCHS):
    model.train()
    total_train_loss = 0
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{NUM_EPOCHS} [Train]")

    for i, batch in enumerate(progress_bar):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_train_loss += loss.item()

        loss = loss / GRADIENT_ACCUMULATION_STEPS
        loss.backward()

        if (i + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

        progress_bar.set_postfix({"train_loss": loss.item() * GRADIENT_ACCUMULATION_STEPS})

    avg_train_loss = total_train_loss / len(train_loader)

    # Validation
    model.eval()
    total_val_loss = 0
    with torch.no_grad():
        for batch in tqdm(val_loader, desc=f"Epoch {epoch+1}/{NUM_EPOCHS} [Val]"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_val_loss += loss.item()

    avg_val_loss = total_val_loss / len(val_loader)

    print(f"Epoch {epoch+1}/{NUM_EPOCHS}, Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

    # Evaluate every 2 epochs
    if (epoch + 1) % 2 == 0:
        print(f"Evaluating after epoch {epoch+1}...")
        current_bleu, current_rouge = evaluate(model, val_loader)
        print(f"Current BLEU Score: {current_bleu}")
        print(f"Current ROUGE Scores: {current_rouge}")

    # Save the model if it's the best so far based on validation loss
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        torch.save(model.state_dict(), best_model_path)
        print(f"New best model saved with validation loss: {best_val_loss:.4f}")

# Load the best model for final evaluation
if os.path.exists(best_model_path):
    model.load_state_dict(torch.load(best_model_path))
    print(f"Loaded best model from {best_model_path}")
else:
    print("No saved model found. Using the model from the last epoch.")

# Evaluate fine-tuned model
print("Evaluating fine-tuned model...")
fine_tuned_bleu, fine_tuned_rouge = evaluate(model, val_loader)
print("Fine-tuned Model Performance:")
print(f"BLEU Score: {fine_tuned_bleu}")
print(f"ROUGE Scores: {fine_tuned_rouge}")

# Print performance improvement
print("Performance Improvement:")
print(f"BLEU: {fine_tuned_bleu - base_bleu}")
print(f"ROUGE-1: {fine_tuned_rouge['rouge1'] - base_rouge['rouge1']}")
print(f"ROUGE-2: {fine_tuned_rouge['rouge2'] - base_rouge['rouge2']}")
print(f"ROUGE-L: {fine_tuned_rouge['rougeL'] - base_rouge['rougeL']}")

Epoch 1/10 [Train]: 100%|██████████| 225/225 [02:20<00:00,  1.61it/s, train_loss=6.71]
Epoch 1/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.65it/s]


Epoch 1/10, Train Loss: 7.4636, Val Loss: 6.8137
New best model saved with validation loss: 6.8137


Epoch 2/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=6.69]
Epoch 2/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.65it/s]


Epoch 2/10, Train Loss: 6.7187, Val Loss: 6.3782
Evaluating after epoch 2...


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
Evaluating: 100%|██████████| 25/25 [01:21<00:00,  3.25s/it]


Current BLEU Score: 0.037923041143913705
Current ROUGE Scores: {'rouge1': 0.31771908402970306, 'rouge2': 0.11809687661522462, 'rougeL': 0.21862539604135886}
New best model saved with validation loss: 6.3782


Epoch 3/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=5.26]
Epoch 3/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.64it/s]


Epoch 3/10, Train Loss: 6.0713, Val Loss: 5.0543
New best model saved with validation loss: 5.0543


Epoch 4/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=1.78]
Epoch 4/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.65it/s]


Epoch 4/10, Train Loss: 3.2170, Val Loss: 1.0458
Evaluating after epoch 4...


Evaluating: 100%|██████████| 25/25 [01:07<00:00,  2.68s/it]


Current BLEU Score: 0.04283036950897102
Current ROUGE Scores: {'rouge1': 0.26864267289091737, 'rouge2': 0.10515347508776139, 'rougeL': 0.19706724386699154}
New best model saved with validation loss: 1.0458


Epoch 5/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=1.21]
Epoch 5/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.64it/s]


Epoch 5/10, Train Loss: 1.0254, Val Loss: 0.8590
New best model saved with validation loss: 0.8590


Epoch 6/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=0.744]
Epoch 6/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.66it/s]


Epoch 6/10, Train Loss: 0.8277, Val Loss: 0.8244
Evaluating after epoch 6...


Evaluating: 100%|██████████| 25/25 [01:00<00:00,  2.42s/it]


Current BLEU Score: 0.06612014396622451
Current ROUGE Scores: {'rouge1': 0.34877814790901546, 'rouge2': 0.14381547302377723, 'rougeL': 0.24555802349828285}
New best model saved with validation loss: 0.8244


Epoch 7/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=0.781]
Epoch 7/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.65it/s]


Epoch 7/10, Train Loss: 0.7702, Val Loss: 0.8168
New best model saved with validation loss: 0.8168


Epoch 8/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=0.709]
Epoch 8/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.66it/s]


Epoch 8/10, Train Loss: 0.7237, Val Loss: 0.8117
Evaluating after epoch 8...


Evaluating: 100%|██████████| 25/25 [00:59<00:00,  2.37s/it]


Current BLEU Score: 0.05939039734879392
Current ROUGE Scores: {'rouge1': 0.345182970975227, 'rouge2': 0.1394900157217645, 'rougeL': 0.2444813051679993}
New best model saved with validation loss: 0.8117


Epoch 9/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=0.626]
Epoch 9/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.65it/s]


Epoch 9/10, Train Loss: 0.6989, Val Loss: 0.8095
New best model saved with validation loss: 0.8095


Epoch 10/10 [Train]: 100%|██████████| 225/225 [02:19<00:00,  1.61it/s, train_loss=0.819]
Epoch 10/10 [Val]: 100%|██████████| 25/25 [00:05<00:00,  4.64it/s]


Epoch 10/10, Train Loss: 0.6728, Val Loss: 0.8053
Evaluating after epoch 10...


Evaluating: 100%|██████████| 25/25 [01:02<00:00,  2.51s/it]


Current BLEU Score: 0.05830824748912286
Current ROUGE Scores: {'rouge1': 0.33447035562023225, 'rouge2': 0.13160724821103553, 'rougeL': 0.2448122833529142}
New best model saved with validation loss: 0.8053


  model.load_state_dict(torch.load(best_model_path))


Loaded best model from best_pegasus_model.pth
Evaluating fine-tuned model...


Evaluating: 100%|██████████| 25/25 [01:02<00:00,  2.51s/it]

Fine-tuned Model Performance:
BLEU Score: 0.05830824748912286
ROUGE Scores: {'rouge1': 0.33447035562023225, 'rouge2': 0.13160724821103553, 'rougeL': 0.2448122833529142}
Performance Improvement:
BLEU: 0.032019698130967805
ROUGE-1: 0.035372132340698026
ROUGE-2: 0.033823160043814124
ROUGE-L: 0.052479002091481874





In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


def generate_summary(model, article, tokenizer, max_length=128):
    inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt").to(device)
    summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=max_length, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Generate summaries with base model
print("Base Model Summaries:")
for i in range(3):  # Generate 3 summaries
    article = val_data[i]["article"]
    reference = val_data[i]["highlights"]
    generated = generate_summary(model, article, tokenizer)
    print(f"\nArticle {i+1}:")
    print(f"Reference: {reference}")
    print(f"Generated: {generated}")
    print("-" * 50)

Base Model Summaries:

Article 1:
Reference: Photos of Taliban in the uniforms of dead French soldiers provokes outrage .
Magazine Paris Match features photos of Taliban and their commander .
10 French troops were killed and a further 21 injured in an ambush .
Generated: Joel Le Pahun, father of one of the killed soldiers, told the newspaper the pictures were "despicable." Green MP Daniel Cohn-Bendit called them "voyeurism." However, Paris Match editor Laurent Valdiguie defended the publication, saying it was "legitimate" given the importance of the story.
--------------------------------------------------

Article 2:
Reference: The Middle East's richest man is Prince Alwaleed Bin Talal Alsaud .
He ranks 19th in the world in the Forbes Rich List .
Seven other billionaires from the Middle East rank in the top 100 .
Generated: The Middle East's richest man: Prince Alwaleed Bin Talal Alsaud . The Middle East's richest man is Prince Alwaleed Bin Talal Alsaud, the 51 year old Saudi who has 

In [None]:
# Load the fine-tuned model
best_model_path = "/content/drive/MyDrive/NLP-Project/best_pegasus_model.pth"
fine_tuned_model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")
fine_tuned_model.load_state_dict(torch.load(best_model_path))
fine_tuned_model.to(device)

print("\nFine-tuned Model Summaries:")
for i in range(3):  # Generate 3 summaries
    article = val_data[i]["article"]
    reference = val_data[i]["highlights"]
    generated = generate_summary(fine_tuned_model, article, tokenizer)
    print(f"\nArticle {i+1}:")
    print(f"Reference: {reference}")
    print(f"Generated: {generated}")
    print("-" * 50)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  fine_tuned_model.load_state_dict(torch.load(best_model_path))



Fine-tuned Model Summaries:

Article 1:
Reference: Photos of Taliban in the uniforms of dead French soldiers provokes outrage .
Magazine Paris Match features photos of Taliban and their commander .
10 French troops were killed and a further 21 injured in an ambush .
Generated: Paris Match includes photos of Taliban fighters and their commander . The latest edition includes photos of the Taliban fighters and their commander, "Farouki," wearing French uniforms . Father of one of the 10 French soldiers says pictures are "despicable"
--------------------------------------------------

Article 2:
Reference: The Middle East's richest man is Prince Alwaleed Bin Talal Alsaud .
He ranks 19th in the world in the Forbes Rich List .
Seven other billionaires from the Middle East rank in the top 100 .
Generated: Prince Alwaleed Bin Talal Alsaud ranks 19th in the list and is considered to be the most active and successful investor in the Middle East . He took his investment vehicle, Kingdom Holding,

# Fine-Tuning Pegasus Model for Abstractive Text Summarization on datasize of 5000 articles.

This script fine-tunes a pre-trained Pegasus model for abstractive text summarization using the CNN/Daily Mail dataset. The process includes data loading, model preparation, training, evaluation, and saving the best model based on validation loss.

BLEU Score: 0.07084515511453165
ROUGE Scores: {'rouge1': 0.3628623757565976, 'rouge2': 0.14441779133424118, 'rougeL': 0.25522549668218547}


## Performance Metrics

### Base Model Performance
- **BLEU**: 0.027228885603706635
- **ROUGE-1**: 0.027228885603706635
- **ROUGE-2**: 0.027228885603706635
- **ROUGE-L**: 0.027228885603706635

### Fine-Tuned Model Performance
- **BLEU Score**: 0.07084515511453165
- **ROUGE-1**: 0.3628623757565976
- **ROUGE-2**: 0.14441779133424118
- **ROUGE-L**: 0.25522549668218547



In [None]:
import random
import time
import os
import torch
from tqdm import tqdm
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, Dataset
from transformers import PegasusForConditionalGeneration, PegasusTokenizer, get_linear_schedule_with_warmup
from datasets import load_dataset
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
from google.colab import drive, files

# Mount Google Drive
drive.mount('/content/drive')

# Install required libraries
!pip install transformers datasets rouge_score nltk tqdm matplotlib

# Constants
MAX_INPUT_LENGTH = 1024
MAX_TARGET_LENGTH = 128
BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
NUM_EPOCHS = 10
LEARNING_RATE = 5e-5

def set_seed(seed=42):
    random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

def load_data(num_samples=1000):
    dataset = load_dataset("cnn_dailymail", "3.0.0")
    full_train_data = dataset["train"].select(range(num_samples))
    train_size = int(0.9 * len(full_train_data))
    train_data = full_train_data.select(range(train_size))
    val_data = full_train_data.select(range(train_size, len(full_train_data)))
    return train_data, val_data

class SummarizationDataset(Dataset):
    def __init__(self, data, tokenizer, max_input_length=MAX_INPUT_LENGTH, max_target_length=MAX_TARGET_LENGTH):
        self.data = data
        self.tokenizer = tokenizer
        self.max_input_length = max_input_length
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        article = self.data[idx]["article"]
        summary = self.data[idx]["highlights"]
        inputs = self.tokenizer(article, max_length=self.max_input_length, truncation=True, padding="max_length", return_tensors="pt")
        targets = self.tokenizer(summary, max_length=self.max_target_length, truncation=True, padding="max_length", return_tensors="pt")
        return {
            "input_ids": inputs.input_ids.squeeze(),
            "attention_mask": inputs.attention_mask.squeeze(),
            "labels": targets.input_ids.squeeze()
        }

def create_dataloaders(train_data, val_data, tokenizer):
    train_dataset = SummarizationDataset(train_data, tokenizer)
    val_dataset = SummarizationDataset(val_data, tokenizer)
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
    return train_loader, val_loader

def evaluate(model, data_loader, tokenizer, device):
    model.eval()
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    bleu_scores = []
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    with torch.no_grad():
        for batch in tqdm(data_loader, desc="Evaluating"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            generated_ids = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=MAX_TARGET_LENGTH)

            for i in range(len(input_ids)):
                reference = tokenizer.decode(labels[i], skip_special_tokens=True)
                generated_summary = tokenizer.decode(generated_ids[i], skip_special_tokens=True)

                bleu_score = sentence_bleu([reference.split()], generated_summary.split())
                bleu_scores.append(bleu_score)

                rouge_result = scorer.score(reference, generated_summary)
                for metric in rouge_scores:
                    rouge_scores[metric].append(rouge_result[metric].fmeasure)

    avg_bleu = sum(bleu_scores) / len(bleu_scores)
    avg_rouge = {metric: sum(scores) / len(scores) for metric, scores in rouge_scores.items()}

    return avg_bleu, avg_rouge

def train_model(model, train_loader, val_loader, tokenizer, device, num_epochs):
    optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
    num_training_steps = len(train_loader) * num_epochs
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

    best_val_loss = float('inf')
    best_model_path = '/content/drive/My Drive/NLP-Project/best_pegasus_model_modular_script_test.pth'

    train_losses = []
    val_losses = []
    bleu_scores = []
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    start_time = time.time()

    for epoch in range(num_epochs):
        model.train()
        total_train_loss = 0
        progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs} [Train]")

        for i, batch in enumerate(progress_bar):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_train_loss += loss.item()

            loss = loss / GRADIENT_ACCUMULATION_STEPS
            loss.backward()

            if (i + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()

            progress_bar.set_postfix({"train_loss": loss.item() * GRADIENT_ACCUMULATION_STEPS})

        avg_train_loss = total_train_loss / len(train_loader)
        train_losses.append(avg_train_loss)

        # Validation
        model.eval()
        total_val_loss = 0
        with torch.no_grad():
            for batch in tqdm(val_loader, desc=f"Epoch {epoch+1}/{num_epochs} [Val]"):
                input_ids = batch["input_ids"].to(device)
                attention_mask = batch["attention_mask"].to(device)
                labels = batch["labels"].to(device)

                outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
                loss = outputs.loss
                total_val_loss += loss.item()

        avg_val_loss = total_val_loss / len(val_loader)
        val_losses.append(avg_val_loss)

        print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {avg_train_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

        # Evaluate every 2 epochs
        if (epoch + 1) % 2 == 0:
            print(f"Evaluating after epoch {epoch+1}...")
            current_bleu, current_rouge = evaluate(model, val_loader, tokenizer, device)
            bleu_scores.append(current_bleu)
            for metric in rouge_scores:
                rouge_scores[metric].append(current_rouge[metric])
            print(f"Current BLEU Score: {current_bleu}")
            print(f"Current ROUGE Scores: {current_rouge}")

        # Save the model if it's the best so far based on validation loss
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            torch.save(model.state_dict(), best_model_path)
            print(f"New best model saved with validation loss: {best_val_loss:.4f}")

    end_time = time.time()
    training_time = end_time - start_time
    print(f"Total training time: {training_time:.2f} seconds")

    return train_losses, val_losses, bleu_scores, rouge_scores, training_time

def plot_training_progress(train_losses, val_losses, bleu_scores, rouge_scores):
    plt.figure(figsize=(12, 8))
    plt.subplot(2, 2, 1)
    plt.plot(train_losses, label='Train Loss')
    plt.plot(val_losses, label='Validation Loss')
    plt.legend()
    plt.title('Training and Validation Loss')

    plt.subplot(2, 2, 2)
    plt.plot(bleu_scores)
    plt.title('BLEU Score')

    plt.subplot(2, 2, 3)
    for metric, scores in rouge_scores.items():
        plt.plot(scores, label=metric)
    plt.legend()
    plt.title('ROUGE Scores')

    plt.tight_layout()
    plt.savefig('training_progress.png')
    plt.close()

    # Download the plot
    files.download('training_progress.png')

def generate_summary(model, article, tokenizer, device, max_length=128):
    inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt").to(device)
    summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=max_length, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

def main():
    set_seed()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")
    tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
    model.to(device)

    train_data, val_data = load_data(num_samples=20000)
    train_loader, val_loader = create_dataloaders(train_data, val_data, tokenizer)

    print("Evaluating base model...")
    base_bleu, base_rouge = evaluate(model, val_loader, tokenizer, device)
    print("Base Model Performance:")
    print(f"BLEU Score: {base_bleu}")
    print(f"ROUGE Scores: {base_rouge}")

    train_losses, val_losses, bleu_scores, rouge_scores, training_time = train_model(model, train_loader, val_loader, tokenizer, device, NUM_EPOCHS)

    plot_training_progress(train_losses, val_losses, bleu_scores, rouge_scores)

    print("Evaluating fine-tuned model...")
    fine_tuned_bleu, fine_tuned_rouge = evaluate(model, val_loader, tokenizer, device)
    print("Fine-tuned Model Performance:")
    print(f"BLEU Score: {fine_tuned_bleu}")
    print(f"ROUGE Scores: {fine_tuned_rouge}")

    print("Performance Improvement:")
    print(f"BLEU: {fine_tuned_bleu - base_bleu}")
    print(f"ROUGE-1: {fine_tuned_rouge['rouge1'] - base_rouge['rouge1']}")
    print(f"ROUGE-2: {fine_tuned_rouge['rouge2'] - base_rouge['rouge2']}")
    print(f"ROUGE-L: {fine_tuned_rouge['rougeL'] - base_rouge['rougeL']}")

    print(f"Total training time: {training_time:.2f} seconds")

    # Generate example summaries
    print("\nGenerating example summaries...")
    for i in range(3):
        article = val_data[i]["article"]
        reference = val_data[i]["highlights"]
        generated = generate_summary(model, article, tokenizer, device)
        print(f"\nArticle {i+1}:")
        print(f"Reference: {reference}")
        print(f"Generated: {generated}")
        print("-" * 50)

if __name__ == "__main__":
    main()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Evaluating base model...


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
Evaluating: 100%|██████████| 125/125 [06:40<00:00,  3.21s/it]


Base Model Performance:
BLEU Score: 0.027228885603706635
ROUGE Scores: {'rouge1': 0.29546262944841484, 'rouge2': 0.09988257187753315, 'rougeL': 0.1893963564402305}


Epoch 1/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=1.11]
Epoch 1/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.63it/s]


Epoch 1/10, Train Loss: 4.5232, Val Loss: 0.8539
New best model saved with validation loss: 0.8539


Epoch 2/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=0.837]
Epoch 2/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.60it/s]


Epoch 2/10, Train Loss: 0.8746, Val Loss: 0.7983
Evaluating after epoch 2...


Evaluating: 100%|██████████| 125/125 [04:53<00:00,  2.35s/it]


Current BLEU Score: 0.05723063253261528
Current ROUGE Scores: {'rouge1': 0.3343995619262688, 'rouge2': 0.12859040561585394, 'rougeL': 0.23005640722224163}
New best model saved with validation loss: 0.7983


Epoch 3/10 [Train]: 100%|██████████| 1125/1125 [11:41<00:00,  1.60it/s, train_loss=0.673]
Epoch 3/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.63it/s]


Epoch 3/10, Train Loss: 0.7910, Val Loss: 0.7814
New best model saved with validation loss: 0.7814


Epoch 4/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=0.673]
Epoch 4/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.62it/s]


Epoch 4/10, Train Loss: 0.7474, Val Loss: 0.7701
Evaluating after epoch 4...


Evaluating: 100%|██████████| 125/125 [04:36<00:00,  2.21s/it]


Current BLEU Score: 0.06638918428998888
Current ROUGE Scores: {'rouge1': 0.35204206325960075, 'rouge2': 0.14036960881993635, 'rougeL': 0.2451482071812417}
New best model saved with validation loss: 0.7701


Epoch 5/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=0.896]
Epoch 5/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.62it/s]


Epoch 5/10, Train Loss: 0.7112, Val Loss: 0.7644
New best model saved with validation loss: 0.7644


Epoch 6/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=0.601]
Epoch 6/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.62it/s]


Epoch 6/10, Train Loss: 0.6790, Val Loss: 0.7636
Evaluating after epoch 6...


Evaluating: 100%|██████████| 125/125 [04:47<00:00,  2.30s/it]


Current BLEU Score: 0.06690326051331127
Current ROUGE Scores: {'rouge1': 0.35003137436937154, 'rouge2': 0.13871897181210435, 'rougeL': 0.24619002649832475}
New best model saved with validation loss: 0.7636


Epoch 7/10 [Train]: 100%|██████████| 1125/1125 [11:41<00:00,  1.60it/s, train_loss=0.927]
Epoch 7/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.62it/s]


Epoch 7/10, Train Loss: 0.6501, Val Loss: 0.7610
New best model saved with validation loss: 0.7610


Epoch 8/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=0.492]
Epoch 8/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.63it/s]


Epoch 8/10, Train Loss: 0.6243, Val Loss: 0.7627
Evaluating after epoch 8...


Evaluating: 100%|██████████| 125/125 [04:38<00:00,  2.22s/it]


Current BLEU Score: 0.06885554488189281
Current ROUGE Scores: {'rouge1': 0.35780621059633066, 'rouge2': 0.14257286127574187, 'rougeL': 0.25020514690533696}


Epoch 9/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=0.692]
Epoch 9/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.61it/s]


Epoch 9/10, Train Loss: 0.5976, Val Loss: 0.7642


Epoch 10/10 [Train]: 100%|██████████| 1125/1125 [11:40<00:00,  1.61it/s, train_loss=0.647]
Epoch 10/10 [Val]: 100%|██████████| 125/125 [00:27<00:00,  4.62it/s]


Epoch 10/10, Train Loss: 0.5743, Val Loss: 0.7692
Evaluating after epoch 10...


Evaluating: 100%|██████████| 125/125 [04:33<00:00,  2.19s/it]


Current BLEU Score: 0.07084515511453165
Current ROUGE Scores: {'rouge1': 0.3628623757565976, 'rouge2': 0.14441779133424118, 'rougeL': 0.25522549668218547}
Total training time: 8732.52 seconds


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Evaluating fine-tuned model...


Evaluating: 100%|██████████| 125/125 [04:33<00:00,  2.19s/it]


Fine-tuned Model Performance:
BLEU Score: 0.07084515511453165
ROUGE Scores: {'rouge1': 0.3628623757565976, 'rouge2': 0.14441779133424118, 'rougeL': 0.25522549668218547}
Performance Improvement:
BLEU: 0.04361626951082502
ROUGE-1: 0.06739974630818274
ROUGE-2: 0.04453521945670803
ROUGE-L: 0.06582914024195496
Total training time: 8732.52 seconds

Generating example summaries...

Article 1:
Reference: Belgian architect imagines climate refugees living on a futuristic Lilypad ecopolis .
The structure would support 50,000 inhabitants in a zero carbon environment .
The goal is to "create a harmonious coexistence of humans and nature"
Generated: The Lilypad is the creation of Belgian architect Vincent Callebaut . "It is" he says, "a true amphibian, half aquatic and half terrestrial city, able to accommodate 50,000 inhabitants"
--------------------------------------------------

Article 2:
Reference: Swedish entrepreneur to open a Jumbo Hostel at Arlanda airport in Sweden .
Decommissioned Boeing

## Fine-Tuning Pegasus Model for Abstractive Text Summarization on datasize of 20,000 articles.

### Pegasus Model Training Results: CNN-Daily Mail Dataset

#### Training Overview
Model training was conducted over 10 epochs with the following specifications:
- **Base Model**: google/pegasus-large
- **Dataset**: CNN-Daily Mail
- **Batch Size**: 4
- **Initial Learning Rate**: 2e-5

---

#### Training Metrics Progression
| Epoch | Train Loss | Val Loss | Learning Rate | BLEU  | ROUGE-1 | ROUGE-2 | ROUGE-L |
|-------|------------|----------|---------------|-------|---------|---------|---------|
| 1     | 7.9441     | 6.6767   | 5.00e-06      | -     | -       | -       | -       |
| 2     | 6.2990     | 4.5098   | 9.99e-06      | 0.0308| 0.1535  | 0.0653  | 0.1063  |
| 3     | 2.1489     | 0.8050   | 1.50e-05      | -     | -       | -       | -       |
| 4     | 0.8219     | 0.7685   | 2.00e-05      | 0.0811| 0.3553  | 0.1535  | 0.2530  |
| 5     | 0.7729     | 0.7573   | 1.94e-05      | -     | -       | -       | -       |
| 6     | 0.7418     | 0.7504   | 1.89e-05      | 0.0827| 0.3587  | 0.1552  | 0.2544  |
| 7     | 0.7193     | 0.7476   | 1.83e-05      | -     | -       | -       | -       |
| 8     | 0.7001     | 0.7442   | 1.78e-05      | 0.0815| 0.3591  | 0.1542  | 0.2572  |
| 9     | 0.6832     | 0.7425   | 1.72e-05      | -     | -       | -       | -       |
| 10    | 0.6684     | 0.7425   | 1.67e-05      | 0.0843| 0.3626  | 0.1569  | 0.2585  |

---

#### Base Model Performance
- **BLEU**: 0.027228885603706635
- **ROUGE-1**: 0.027228885603706635
- **ROUGE-2**: 0.027228885603706635
- **ROUGE-L**: 0.027228885603706635

#### Final Model Performance
- **Best Validation Loss**: 0.7425 (Epoch 10)
- **Final BLEU**: 0.0843
- **Final ROUGE Scores**:
  - ROUGE-1: 0.3626
  - ROUGE-2: 0.1569
  - ROUGE-L: 0.2585
- **Total Training Time**: 17,830.88 seconds (~4.95 hours)

This training report highlights the progress and outcomes of fine-tuning Pegasus on the CNN-Daily Mail dataset. Further optimizations and improvements can build upon these results for better performance.

In [None]:
! pip install transformers datasets rouge_score nltk tqdm matplotlib
import random
import time
import os
import torch
from tqdm import tqdm
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, Dataset
from transformers import PegasusForConditionalGeneration, PegasusTokenizer, get_linear_schedule_with_warmup
from datasets import load_dataset
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
from google.colab import drive, files

# Mount Google Drive
drive.mount('/content/drive')

# Install required libraries
!pip install transformers datasets rouge_score nltk tqdm matplotlib

# Modified constants
MAX_INPUT_LENGTH = 1024
MAX_TARGET_LENGTH = 128
BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
NUM_EPOCHS = 10
LEARNING_RATE = 2e-5  # Reduced from 5e-5
WARMUP_RATIO = 0.1  # Added warmup
WEIGHT_DECAY = 0.01  # Added weight decay

def set_seed(seed=42):
    random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

def load_data(num_samples=1000):
    dataset = load_dataset("cnn_dailymail", "3.0.0")
    full_train_data = dataset["train"].select(range(num_samples))
    train_size = int(0.9 * len(full_train_data))
    train_data = full_train_data.select(range(train_size))
    val_data = full_train_data.select(range(train_size, len(full_train_data)))
    return train_data, val_data

class SummarizationDataset(Dataset):
    def __init__(self, data, tokenizer, max_input_length=MAX_INPUT_LENGTH, max_target_length=MAX_TARGET_LENGTH):
        self.data = data
        self.tokenizer = tokenizer
        self.max_input_length = max_input_length
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        article = self.data[idx]["article"]
        summary = self.data[idx]["highlights"]
        inputs = self.tokenizer(article, max_length=self.max_input_length, truncation=True, padding="max_length", return_tensors="pt")
        targets = self.tokenizer(summary, max_length=self.max_target_length, truncation=True, padding="max_length", return_tensors="pt")
        return {
            "input_ids": inputs.input_ids.squeeze(),
            "attention_mask": inputs.attention_mask.squeeze(),
            "labels": targets.input_ids.squeeze()
        }

def create_dataloaders(train_data, val_data, tokenizer):
    train_dataset = SummarizationDataset(train_data, tokenizer)
    val_dataset = SummarizationDataset(val_data, tokenizer)
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
    return train_loader, val_loader

def evaluate(model, data_loader, tokenizer, device):
    model.eval()
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    bleu_scores = []
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    with torch.no_grad():
        for batch in tqdm(data_loader, desc="Evaluating"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            generated_ids = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=MAX_TARGET_LENGTH)

            for i in range(len(input_ids)):
                reference = tokenizer.decode(labels[i], skip_special_tokens=True)
                generated_summary = tokenizer.decode(generated_ids[i], skip_special_tokens=True)

                bleu_score = sentence_bleu([reference.split()], generated_summary.split())
                bleu_scores.append(bleu_score)

                rouge_result = scorer.score(reference, generated_summary)
                for metric in rouge_scores:
                    rouge_scores[metric].append(rouge_result[metric].fmeasure)

    avg_bleu = sum(bleu_scores) / len(bleu_scores)
    avg_rouge = {metric: sum(scores) / len(scores) for metric, scores in rouge_scores.items()}

    return avg_bleu, avg_rouge

def train_model(model, train_loader, val_loader, tokenizer, device, num_epochs):
    # Modified optimizer with weight decay
    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
            "weight_decay": WEIGHT_DECAY,
        },
        {
            "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
            "weight_decay": 0.0,
        },
    ]
    optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=LEARNING_RATE)

    # Modified scheduler with warmup
    num_training_steps = len(train_loader) * num_epochs
    num_warmup_steps = int(num_training_steps * WARMUP_RATIO)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps
    )

    best_val_loss = float('inf')
    best_model_path = '/content/drive/My Drive/NLP-Project/best_pegasus_model_20000_data.pth'
    patience = 3  # Early stopping patience
    no_improve = 0

    train_losses = []
    val_losses = []
    bleu_scores = []
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    start_time = time.time()

    for epoch in range(num_epochs):
        model.train()
        total_train_loss = 0
        progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs} [Train]")

        # Training loop with gradient clipping
        for i, batch in enumerate(progress_bar):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_train_loss += loss.item()

            loss = loss / GRADIENT_ACCUMULATION_STEPS
            loss.backward()

            if (i + 1) % GRADIENT_ACCUMULATION_STEPS == 0:
                # Add gradient clipping
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()

            progress_bar.set_postfix({"train_loss": loss.item() * GRADIENT_ACCUMULATION_STEPS})

        avg_train_loss = total_train_loss / len(train_loader)
        train_losses.append(avg_train_loss)

        # Validation with early stopping
        model.eval()
        total_val_loss = 0
        with torch.no_grad():
            for batch in tqdm(val_loader, desc=f"Epoch {epoch+1}/{num_epochs} [Val]"):
                input_ids = batch["input_ids"].to(device)
                attention_mask = batch["attention_mask"].to(device)
                labels = batch["labels"].to(device)

                outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
                loss = outputs.loss
                total_val_loss += loss.item()

        avg_val_loss = total_val_loss / len(val_loader)
        val_losses.append(avg_val_loss)

        print(f"Epoch {epoch+1}/{num_epochs}")
        print(f"Average Train Loss: {avg_train_loss:.4f}")
        print(f"Average Val Loss: {avg_val_loss:.4f}")
        print(f"Learning Rate: {scheduler.get_last_lr()[0]:.2e}")


        # Evaluate every 2 epochs
        if (epoch + 1) % 2 == 0:
            print(f"Evaluating after epoch {epoch+1}...")
            current_bleu, current_rouge = evaluate(model, val_loader, tokenizer, device)
            bleu_scores.append(current_bleu)
            for metric in rouge_scores:
                rouge_scores[metric].append(current_rouge[metric])
            print(f"Current BLEU Score: {current_bleu}")
            print(f"Current ROUGE Scores: {current_rouge}")

        # Save best model and check for early stopping
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            torch.save(model.state_dict(), best_model_path)
            print(f"New best model saved with validation loss: {best_val_loss:.4f}")
            no_improve = 0
        else:
            no_improve += 1
            if no_improve >= patience:
                print(f"Early stopping triggered after {epoch+1} epochs")
                break




    end_time = time.time()
    training_time = end_time - start_time

    return train_losses, val_losses, bleu_scores, rouge_scores, training_time

def plot_training_progress(train_losses, val_losses, bleu_scores, rouge_scores):
    plt.figure(figsize=(12, 8))
    plt.subplot(2, 2, 1)
    plt.plot(train_losses, label='Train Loss')
    plt.plot(val_losses, label='Validation Loss')
    plt.legend()
    plt.title('Training and Validation Loss')

    plt.subplot(2, 2, 2)
    plt.plot(bleu_scores)
    plt.title('BLEU Score')

    plt.subplot(2, 2, 3)
    for metric, scores in rouge_scores.items():
        plt.plot(scores, label=metric)
    plt.legend()
    plt.title('ROUGE Scores')

    plt.tight_layout()
    plt.savefig('training_progress.png')
    plt.close()

    # Download the plot
    files.download('training_progress.png')

def generate_summary(model, article, tokenizer, device, max_length=128):
    inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt").to(device)
    summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=max_length, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

def main():
    set_seed()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")
    tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")
    model.to(device)

    train_data, val_data = load_data(num_samples=10000)
    train_loader, val_loader = create_dataloaders(train_data, val_data, tokenizer)

    print("Evaluating base model...")
    base_bleu, base_rouge = evaluate(model, val_loader, tokenizer, device)
    print("Base Model Performance:")
    print(f"BLEU Score: {base_bleu}")
    print(f"ROUGE Scores: {base_rouge}")

    train_losses, val_losses, bleu_scores, rouge_scores, training_time = train_model(model, train_loader, val_loader, tokenizer, device, NUM_EPOCHS)

    plot_training_progress(train_losses, val_losses, bleu_scores, rouge_scores)

    print("Evaluating fine-tuned model...")
    fine_tuned_bleu, fine_tuned_rouge = evaluate(model, val_loader, tokenizer, device)
    print("Fine-tuned Model Performance:")
    print(f"BLEU Score: {fine_tuned_bleu}")
    print(f"ROUGE Scores: {fine_tuned_rouge}")

    print("Performance Improvement:")
    print(f"BLEU: {fine_tuned_bleu - base_bleu}")
    print(f"ROUGE-1: {fine_tuned_rouge['rouge1'] - base_rouge['rouge1']}")
    print(f"ROUGE-2: {fine_tuned_rouge['rouge2'] - base_rouge['rouge2']}")
    print(f"ROUGE-L: {fine_tuned_rouge['rougeL'] - base_rouge['rougeL']}")

    print(f"Total training time: {training_time:.2f} seconds")

    # Generate example summaries
    print("\nGenerating example summaries...")
    for i in range(3):
        article = val_data[i]["article"]
        reference = val_data[i]["highlights"]
        generated = generate_summary(model, article, tokenizer, device)
        print(f"\nArticle {i+1}:")
        print(f"Reference: {reference}")
        print(f"Generated: {generated}")
        print("-" * 50)

if __name__ == "__main__":
    main()

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m31.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/3.09k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/260 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/15.6k [00:00<?, ?B/s]

train-00000-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00001-of-00003.parquet:   0%|          | 0.00/257M [00:00<?, ?B/s]

train-00002-of-00003.parquet:   0%|          | 0.00/259M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/34.7M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/30.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

Epoch 1/10 [Train]: 100%|██████████| 2250/2250 [23:25<00:00,  1.60it/s, train_loss=7.53]
Epoch 1/10 [Val]: 100%|██████████| 250/250 [00:53<00:00,  4.66it/s]


Epoch 1/10
Average Train Loss: 7.9441
Average Val Loss: 6.6767
Learning Rate: 5.00e-06
New best model saved with validation loss: 6.6767


Epoch 2/10 [Train]: 100%|██████████| 2250/2250 [23:22<00:00,  1.60it/s, train_loss=4.65]
Epoch 2/10 [Val]: 100%|██████████| 250/250 [00:53<00:00,  4.66it/s]


Epoch 2/10
Average Train Loss: 6.2990
Average Val Loss: 4.5098
Learning Rate: 9.99e-06
Evaluating after epoch 2...


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
Evaluating: 100%|██████████| 250/250 [13:15<00:00,  3.18s/it]


Current BLEU Score: 0.030783904729758504
Current ROUGE Scores: {'rouge1': 0.15350869834911562, 'rouge2': 0.0653266064916824, 'rougeL': 0.10633723076060746}
New best model saved with validation loss: 4.5098


Epoch 3/10 [Train]: 100%|██████████| 2250/2250 [23:23<00:00,  1.60it/s, train_loss=1.17]
Epoch 3/10 [Val]: 100%|██████████| 250/250 [00:53<00:00,  4.65it/s]


Epoch 3/10
Average Train Loss: 2.1489
Average Val Loss: 0.8050
Learning Rate: 1.50e-05
New best model saved with validation loss: 0.8050


Epoch 4/10 [Train]: 100%|██████████| 2250/2250 [23:23<00:00,  1.60it/s, train_loss=0.746]
Epoch 4/10 [Val]: 100%|██████████| 250/250 [00:53<00:00,  4.65it/s]


Epoch 4/10
Average Train Loss: 0.8219
Average Val Loss: 0.7685
Learning Rate: 2.00e-05
Evaluating after epoch 4...


Evaluating: 100%|██████████| 250/250 [10:13<00:00,  2.45s/it]


Current BLEU Score: 0.08111376435136992
Current ROUGE Scores: {'rouge1': 0.35530882111447015, 'rouge2': 0.15353207366318938, 'rougeL': 0.25295840546254866}
New best model saved with validation loss: 0.7685


Epoch 5/10 [Train]: 100%|██████████| 2250/2250 [23:23<00:00,  1.60it/s, train_loss=0.999]
Epoch 5/10 [Val]: 100%|██████████| 250/250 [00:53<00:00,  4.65it/s]


Epoch 5/10
Average Train Loss: 0.7729
Average Val Loss: 0.7573
Learning Rate: 1.94e-05
New best model saved with validation loss: 0.7573


Epoch 6/10 [Train]: 100%|██████████| 2250/2250 [23:23<00:00,  1.60it/s, train_loss=0.871]
Epoch 6/10 [Val]: 100%|██████████| 250/250 [00:53<00:00,  4.65it/s]


Epoch 6/10
Average Train Loss: 0.7418
Average Val Loss: 0.7504
Learning Rate: 1.89e-05
Evaluating after epoch 6...


Evaluating: 100%|██████████| 250/250 [10:12<00:00,  2.45s/it]


Current BLEU Score: 0.08272685832260049
Current ROUGE Scores: {'rouge1': 0.3586757104083939, 'rouge2': 0.1551616457890007, 'rougeL': 0.25441458398752514}
New best model saved with validation loss: 0.7504


Epoch 7/10 [Train]: 100%|██████████| 2250/2250 [23:23<00:00,  1.60it/s, train_loss=0.568]
Epoch 7/10 [Val]: 100%|██████████| 250/250 [00:53<00:00,  4.65it/s]


Epoch 7/10
Average Train Loss: 0.7193
Average Val Loss: 0.7476
Learning Rate: 1.83e-05
New best model saved with validation loss: 0.7476


Epoch 8/10 [Train]: 100%|██████████| 2250/2250 [23:23<00:00,  1.60it/s, train_loss=0.568]
Epoch 8/10 [Val]: 100%|██████████| 250/250 [00:53<00:00,  4.66it/s]


Epoch 8/10
Average Train Loss: 0.7001
Average Val Loss: 0.7442
Learning Rate: 1.78e-05
Evaluating after epoch 8...


Evaluating: 100%|██████████| 250/250 [09:41<00:00,  2.33s/it]


Current BLEU Score: 0.08152388329939204
Current ROUGE Scores: {'rouge1': 0.3590828801761069, 'rouge2': 0.1541846090382325, 'rougeL': 0.2571938868408105}
New best model saved with validation loss: 0.7442


Epoch 9/10 [Train]: 100%|██████████| 2250/2250 [23:22<00:00,  1.60it/s, train_loss=0.718]
Epoch 9/10 [Val]: 100%|██████████| 250/250 [00:53<00:00,  4.66it/s]


Epoch 9/10
Average Train Loss: 0.6832
Average Val Loss: 0.7425
Learning Rate: 1.72e-05
New best model saved with validation loss: 0.7425


Epoch 10/10 [Train]: 100%|██████████| 2250/2250 [23:23<00:00,  1.60it/s, train_loss=0.687]
Epoch 10/10 [Val]: 100%|██████████| 250/250 [00:53<00:00,  4.63it/s]


Epoch 10/10
Average Train Loss: 0.6684
Average Val Loss: 0.7425
Learning Rate: 1.67e-05
Evaluating after epoch 10...


Evaluating: 100%|██████████| 250/250 [09:54<00:00,  2.38s/it]


Current BLEU Score: 0.08428911517596681
Current ROUGE Scores: {'rouge1': 0.36259173481518503, 'rouge2': 0.15686449208025968, 'rougeL': 0.2584988971139973}
New best model saved with validation loss: 0.7425


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Evaluating fine-tuned model...


Evaluating: 100%|██████████| 250/250 [09:58<00:00,  2.40s/it]


Fine-tuned Model Performance:
BLEU Score: 0.08428911517596681
ROUGE Scores: {'rouge1': 0.36259173481518503, 'rouge2': 0.15686449208025968, 'rougeL': 0.2584988971139973}
Total training time: 17830.88 seconds

Generating example summaries...

Article 1:
Reference: Maj. Curtis Daniel Miller of Palacios, Texas, was shot down on March 29, 1972 .
Miller and his crew were flying over southern Laos when a missile struck their plane .
Miller will be buried with full military honors at the Dallas-Ft. Worth National Cemetery .
Generated: Air Force Maj. Curtis Daniel Miller of Palacios, Texas, was one of 14 men . Miller and his crew were flying over southern Laos when a missile struck their plane . Rescue teams had to call off the search after two days because of heavy fighting .
--------------------------------------------------

Article 2:
Reference: Group files motion for preliminary injunction against a Mississippi school district .
Case will be argued before a federal judge in Mississippi on 

In [6]:
# import os
# os.environ['CUDA_LAUNCH_BLOCKING'] = "1"


# Training a Scaled-Down Pegasus Model From Scratch

## Model Architecture & Dataset
- Created a smaller Pegasus model with:
  - 8 encoder and decoder layers (reduced from 16)
  - 512 dimensional embeddings (reduced from 1024)
  - 2048 FFN dimensions (reduced from 4096)
  - ~108.5M total parameters, with 108M trainable parameters

- Dataset: CNN/DailyMail
  - Training samples: 13,500
  - Validation samples: 1,500
  - Max input length: 512 tokens
  - Max output length: 128 tokens

## Training Configuration
- Batch size: 4 with gradient accumulation steps of 2
- Initial learning rate: 2e-5 with linear warmup
- ReduceLROnPlateau scheduler with factor 0.1
- Weight decay: 0.01
- Training epochs: 20
- Early stopping patience: 3

## Training Results
- Loss progression:
  - Initial training loss: 10.29 → Final training loss: 2.13
  - Initial validation loss: 8.31 → Final validation loss: 2.16
  - Consistent decrease in both training and validation loss

- ROUGE Scores progression:
  - ROUGE-1: 0.0 → 0.122 (12.2%)
  - ROUGE-2: 0.0 → 0.014 (1.4%)
  - ROUGE-L: 0.0 → 0.101 (10.1%)

## Key Observations
1. Model showed steady improvement in both training and validation loss
2. ROUGE scores improved gradually but remained relatively low
3. Training completed in ~5 hours (~17,975 seconds)
4. No significant overfitting observed (validation loss tracks training loss)
5. Memory usage remained stable (~4.6-4.7 GB GPU memory)

## Limitations & Future Work
1. Limited performance on ROUGE metrics suggests room for improvement
2. Small batch size due to memory constraints
3. Could benefit from:
   - Larger model capacity
   - More training data
   - Longer training time
   - Larger batch size with more computational resources

In [2]:
!pip install transformers datasets rouge_score nltk tqdm matplotlib

import os
import torch
from transformers import PegasusConfig, PegasusForConditionalGeneration, PegasusTokenizer, get_linear_schedule_with_warmup
from datasets import load_dataset
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm
import matplotlib.pyplot as plt
import time
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from google.colab import drive, files

# Mount Google Drive
drive.mount('/content/drive')

def verify_model_size(model):
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total_params, trainable_params

def cleanup():
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    plt.close('all')

def create_small_pegasus_config():
    '''Create a smaller Pegasus configuration suitable for training on 3000 samples.'''
    config = PegasusConfig(
        vocab_size=96103,  # Original vocab size for tokenizer compatibility
        encoder_layers=8,  # Reduced from 16
        decoder_layers=8,  # Reduced from 16
        encoder_attention_heads=16,  # Reduced from 16
        decoder_attention_heads=16,  # Reduced from 16
        encoder_ffn_dim=2048,  # Reduced from 4096
        decoder_ffn_dim=2048,  # Reduced from 4096
        d_model=512,  # Reduced from 1024
        max_position_embeddings=512,  # Reduced context length
        pad_token_id=0,
        eos_token_id=1,
        forced_eos_token_id=1,
        activation_function='gelu',
        dropout=0.2,  # Increased dropout for smaller dataset
        attention_dropout=0.2,
        activation_dropout=0.2,
        num_beams=4,
        encoder_layerdrop=0.1,  # Added layerdrop for regularization
        decoder_layerdrop=0.1,
        scale_embedding=True,
        use_cache=True,
        is_encoder_decoder=True
    )
    return config

TRAINING_PARAMS = {
    'MAX_INPUT_LENGTH': 512,
    'MAX_TARGET_LENGTH': 128,
    'BATCH_SIZE': 4,
    'GRADIENT_ACCUMULATION_STEPS': 2,
    'NUM_EPOCHS': 20,
    'LEARNING_RATE': 2e-5,
    'WARMUP_RATIO': 0.1,
    'WEIGHT_DECAY': 0.01,
    'EARLY_STOPPING_PATIENCE': 3
}

def set_seed(seed=42):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

# def load_data(num_samples=3000):
#     dataset = load_dataset('cnn_dailymail', '3.0.0')
#     full_train_data = dataset['train'].select(range(num_samples))
#     train_size = int(0.9 * len(full_train_data))
#     train_data = full_train_data.select(range(train_size))
#     val_data = full_train_data.select(range(train_size, len(full_train_data)))
#     return train_data, val_data

def load_data(num_samples=1000):
    # Load the dataset
    dataset = load_dataset('cnn_dailymail', '3.0.0')

    # Calculate how many samples we want for each split
    train_samples = int(0.9 * num_samples)  # 90% of samples for training
    val_samples = num_samples - train_samples  # 10% of samples for validation

    # Randomly select indices for training and validation
    train_indices = range(train_samples)
    val_indices = range(len(dataset['validation']))[:val_samples]

    # Select the samples from the respective splits
    train_data = dataset['train'].select(train_indices)
    val_data = dataset['validation'].select(val_indices)

    print(f"Training samples: {len(train_data)}")
    print(f"Validation samples: {len(val_data)}")

    return train_data, val_data



class SummarizationDataset(Dataset):
    def __init__(self, data, tokenizer, max_input_length, max_target_length):
        self.data = data
        self.tokenizer = tokenizer
        self.max_input_length = max_input_length
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        article = self.data[idx]['article']
        summary = self.data[idx]['highlights']
        inputs = self.tokenizer(article, max_length=self.max_input_length, truncation=True, padding='max_length', return_tensors='pt')
        targets = self.tokenizer(summary, max_length=self.max_target_length, truncation=True, padding='max_length', return_tensors='pt')
        return {
            'input_ids': inputs.input_ids.squeeze(),
            'attention_mask': inputs.attention_mask.squeeze(),
            'labels': targets.input_ids.squeeze()
        }

def create_dataloaders(train_data, val_data, tokenizer, batch_size):
    train_dataset = SummarizationDataset(train_data, tokenizer, TRAINING_PARAMS['MAX_INPUT_LENGTH'], TRAINING_PARAMS['MAX_TARGET_LENGTH'])
    val_dataset = SummarizationDataset(val_data, tokenizer, TRAINING_PARAMS['MAX_INPUT_LENGTH'], TRAINING_PARAMS['MAX_TARGET_LENGTH'])
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
    return train_loader, val_loader

def evaluate(model, data_loader, tokenizer, device):
    model.eval()
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    with torch.no_grad():
        for batch in tqdm(data_loader, desc="Evaluating"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            generated_ids = model.generate(input_ids=input_ids, attention_mask=attention_mask, max_length=TRAINING_PARAMS['MAX_TARGET_LENGTH'])

            for i in range(len(input_ids)):
                reference = tokenizer.decode(labels[i], skip_special_tokens=True)
                generated_summary = tokenizer.decode(generated_ids[i], skip_special_tokens=True)

                rouge_result = scorer.score(reference, generated_summary)
                for metric in rouge_scores:
                    rouge_scores[metric].append(rouge_result[metric].fmeasure)

    avg_rouge = {metric: sum(scores) / len(scores) for metric, scores in rouge_scores.items()}

    return avg_rouge



def train_model(model, train_loader, val_loader, tokenizer, device, num_epochs):
    try:
        no_decay = ['bias', 'LayerNorm.weight']
        optimizer_grouped_parameters = [
            {
                'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
                'weight_decay': TRAINING_PARAMS['WEIGHT_DECAY'],
            },
            {
                'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
                'weight_decay': 0.0,
            },
        ]
        optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=TRAINING_PARAMS['LEARNING_RATE'])
        num_training_steps = len(train_loader) * num_epochs
        num_warmup_steps = int(num_training_steps * TRAINING_PARAMS['WARMUP_RATIO'])
        scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)
        # Scheduler for reducing learning rate when validation loss stagnates
        scheduler_plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(
            optimizer,
            mode='min',
            factor=0.1,
            patience=4,
            verbose=True
        )
        best_val_loss = float('inf')
        best_model_path = "/content/drive/My Drive/NLP-Project/best_pegasus_scartch.pt"
        patience = TRAINING_PARAMS['EARLY_STOPPING_PATIENCE']
        no_improve = 0

        train_losses = []
        val_losses = []
        rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

        start_time = time.time()

        for epoch in range(num_epochs):
            # Initialize these at the start of each epoch
            total_train_loss = 0
            batch_count = 0
            avg_train_loss = float('inf')  # Default value

            try:
                # Clear GPU cache before each epoch
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
                    print(f"GPU Memory before epoch {epoch + 1}: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")

                model.train()
                progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs} [Train]")

                for i, batch in enumerate(progress_bar):
                    try:
                        input_ids = batch['input_ids'].to(device)
                        attention_mask = batch['attention_mask'].to(device)
                        labels = batch['labels'].to(device)

                        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
                        current_loss = outputs.loss / TRAINING_PARAMS['GRADIENT_ACCUMULATION_STEPS']
                        current_loss.backward()

                        if (i + 1) % TRAINING_PARAMS['GRADIENT_ACCUMULATION_STEPS'] == 0:
                            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                            optimizer.step()
                            scheduler.step()
                            optimizer.zero_grad()

                        # Update loss tracking
                        total_train_loss += current_loss.item() * TRAINING_PARAMS['GRADIENT_ACCUMULATION_STEPS']
                        batch_count += 1

                        # Update progress bar
                        if batch_count > 0:
                            avg_train_loss = total_train_loss / batch_count
                            progress_bar.set_postfix({'loss': avg_train_loss})

                        # Clear memory after optimization step
                        del outputs, current_loss
                        if torch.cuda.is_available():
                            torch.cuda.empty_cache()

                    except RuntimeError as e:
                        if "out of memory" in str(e):
                            if torch.cuda.is_available():
                                torch.cuda.empty_cache()
                            print(f"WARNING: out of memory in batch {i}. Skipping batch...")
                            print(f"GPU Memory: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
                            continue
                        else:
                            raise e

                if batch_count > 0:
                    avg_train_loss = total_train_loss / batch_count
                    train_losses.append(avg_train_loss)

                # Validation phase
                model.eval()
                total_val_loss = 0
                val_batch_count = 0

                with torch.no_grad():
                    for batch in tqdm(val_loader, desc=f'Epoch {epoch+1}/{num_epochs} [Val]'):
                        try:
                            input_ids = batch['input_ids'].to(device)
                            attention_mask = batch['attention_mask'].to(device)
                            labels = batch['labels'].to(device)

                            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
                            total_val_loss += outputs.loss.item()
                            val_batch_count += 1

                            # Clear memory after each validation batch
                            del outputs
                            if torch.cuda.is_available():
                                torch.cuda.empty_cache()

                        except RuntimeError as e:
                            if "out of memory" in str(e):
                                if torch.cuda.is_available():
                                    torch.cuda.empty_cache()
                                print("WARNING: out of memory in validation. Skipping batch...")
                                continue
                            else:
                                raise e

                if val_batch_count > 0:
                    avg_val_loss = total_val_loss / val_batch_count
                    val_losses.append(avg_val_loss)

                    print(f"Epoch {epoch+1}/{num_epochs}")
                    print(f"Average Train Loss: {avg_train_loss:.4f}")
                    print(f"Average Val Loss: {avg_val_loss:.4f}")
                    print(f"Learning Rate: {scheduler.get_last_lr()[0]:.2e}")


                    scheduler_plateau.step(avg_val_loss)
                    current_lr = optimizer.param_groups[0]['lr']
                    print(f"Updated learning rate after plateau adjustment: {current_lr}")

                    # # Evaluate ROUGE scores
                    # if (epoch + 1) % 2 == 0:
                    print(f"Evaluating after epoch {epoch+1}...")
                    current_rouge = evaluate(model, val_loader, tokenizer, device)
                    for metric in rouge_scores:
                        rouge_scores[metric].append(current_rouge[metric])

                    print(f"Current ROUGE Scores: {current_rouge}")

                    # Model saving with error handling
                    if avg_val_loss < best_val_loss:
                        best_val_loss = avg_val_loss
                        try:
                            torch.save(model.state_dict(), best_model_path)
                            print(f"New best model saved with validation loss: {best_val_loss:.4f}")
                        except Exception as e:
                            print(f"Error saving model: {e}")
                        no_improve = 0
                    else:
                        no_improve += 1
                        if no_improve >= patience:
                            print(f"Early stopping triggered after {epoch+1} epochs")
                            break

            except Exception as e:
                print(f"Error in epoch {epoch + 1}: {e}")
                # Attempt to save checkpoint even if epoch fails
                checkpoint_dict = {
                    'epoch': epoch,
                    'model_state_dict': model.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict(),
                    'scheduler_state_dict': scheduler.state_dict(),
                }

                if batch_count > 0:
                    checkpoint_dict['train_loss'] = avg_train_loss
                if 'avg_val_loss' in locals():
                    checkpoint_dict['val_loss'] = avg_val_loss

                torch.save(checkpoint_dict, f'emergency_checkpoint_epoch_{epoch}.pth')
                continue

        end_time = time.time()
        training_time = end_time - start_time

    except Exception as e:
        print(f"Critical training error: {e}")
        raise
    finally:
        # Clean up
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    return train_losses, val_losses, rouge_scores, training_time


def plot_training_progress(train_losses, val_losses, rouge_scores):
    plt.figure(figsize=(12, 8))
    plt.subplot(2, 2, 1)
    plt.plot(train_losses, label='Train Loss')
    plt.plot(val_losses, label='Validation Loss')
    plt.legend()
    plt.title('Training and Validation Loss')

    plt.subplot(2, 2, 3)
    for metric, scores in rouge_scores.items():
        plt.plot(scores, label=metric)
    plt.legend()
    plt.title('ROUGE Scores')

    plt.tight_layout()
    plt.savefig('training_progress.png')
    plt.close()
    files.download('training_progress.png')


# def generate_summary(model, article, tokenizer, device, max_length=128):
#     inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt").to(device)
#     summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=max_length, early_stopping=True)
#     return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

def generate_summary(model, article, tokenizer, device, max_length=None):
    """
    Generate a summary for the given article using the trained model.

    Args:
        model: The trained Pegasus model
        article: Input article text
        tokenizer: Pegasus tokenizer
        device: Device to run generation on
        max_length: Optional override for maximum length. If None, uses TRAINING_PARAMS value
    """
    # Use the same max length as training if not specified
    if max_length is None:
        max_length = TRAINING_PARAMS['MAX_TARGET_LENGTH']

    inputs = tokenizer(
        article,
        max_length=TRAINING_PARAMS['MAX_INPUT_LENGTH'],  # Use consistent input length
        truncation=True,
        return_tensors="pt"
    ).to(device)

    summary_ids = model.generate(
        inputs["input_ids"],
        num_beams=4,
        max_length=max_length,
        early_stopping=True,
        length_penalty=2.0,  # Added for better length control
        min_length=int(max_length/4),  # Added reasonable minimum length
        no_repeat_ngram_size=4  # Prevent repetition
    )

    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

def inspect_frozen_params(model):
    frozen_params = []
    trainable_params = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            frozen_params.append(name)
        else:
            trainable_params.append(name)

    print("\nFrozen parameters:")
    for name in frozen_params:
        print(f"- {name}")

    print("\nNumber of frozen parameters:", len(frozen_params))
    print("Number of trainable parameters:", len(trainable_params))

def main():
    try:
        set_seed()
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-large')
        config = create_small_pegasus_config()
        model = PegasusForConditionalGeneration(config).to(device)
        total_params,trainable_params = verify_model_size(model)
        print(f"Total parameters: {total_params:,}")
        print(f"Trainable parameters: {trainable_params:,}")
        inspect_frozen_params(model)


        train_data, val_data = load_data(num_samples=15000)
        train_loader, val_loader = create_dataloaders(train_data, val_data, tokenizer, TRAINING_PARAMS['BATCH_SIZE'])
        train_losses, val_losses, rouge_scores, training_time = train_model(model, train_loader, val_loader, tokenizer, device, TRAINING_PARAMS['NUM_EPOCHS'])
        plot_training_progress(train_losses, val_losses,rouge_scores)

        print("Evaluating fine-tuned model...")
        fine_tuned_rouge = evaluate(model, val_loader, tokenizer, device)
        print("Fine-tuned Model Performance:")
        print(f"ROUGE Scores: {fine_tuned_rouge}")
        print(f"Total training time: {training_time:.2f} seconds")

        print("\nGenerating example summaries...")
        for i in range(3):
            article = val_data[i]["article"]
            reference = val_data[i]["highlights"]
            generated = generate_summary(model, article, tokenizer, device)
            print(f"\nArticle {i+1}:")
            print(f"Reference: {reference}")
            print(f"Generated: {generated}")
            print("-" * 50)
    finally:
        # Add cleanup at the end
        cleanup()

if __name__ == '__main__':
    main()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Total parameters: 108,582,400
Trainable parameters: 108,058,112

Frozen parameters:
- model.encoder.embed_positions.weight
- model.decoder.embed_positions.weight

Number of frozen parameters: 2
Number of trainable parameters: 341
Training samples: 13500
Validation samples: 1500
GPU Memory before epoch 1: 3452.28 MB


Epoch 1/20 [Train]: 100%|██████████| 3375/3375 [07:23<00:00,  7.62it/s, loss=10.3]
Epoch 1/20 [Val]: 100%|██████████| 375/375 [00:16<00:00, 22.58it/s]


Epoch 1/20
Average Train Loss: 10.2948
Average Val Loss: 8.3167
Learning Rate: 5.00e-06
Updated learning rate after plateau adjustment: 4.998518518518519e-06
Evaluating after epoch 1...


Evaluating: 100%|██████████| 375/375 [10:27<00:00,  1.67s/it]


Current ROUGE Scores: {'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0}
New best model saved with validation loss: 8.3167
GPU Memory before epoch 2: 4689.84 MB


Epoch 2/20 [Train]: 100%|██████████| 3375/3375 [07:19<00:00,  7.67it/s, loss=5.85]
Epoch 2/20 [Val]: 100%|██████████| 375/375 [00:16<00:00, 22.51it/s]


Epoch 2/20
Average Train Loss: 5.8547
Average Val Loss: 3.0042
Learning Rate: 1.00e-05
Updated learning rate after plateau adjustment: 9.997037037037038e-06
Evaluating after epoch 2...


Evaluating: 100%|██████████| 375/375 [10:13<00:00,  1.63s/it]


Current ROUGE Scores: {'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0}
New best model saved with validation loss: 3.0042
GPU Memory before epoch 3: 4641.73 MB


Epoch 3/20 [Train]: 100%|██████████| 3375/3375 [07:20<00:00,  7.67it/s, loss=3.07]
Epoch 3/20 [Val]: 100%|██████████| 375/375 [00:16<00:00, 22.71it/s]


Epoch 3/20
Average Train Loss: 3.0708
Average Val Loss: 2.5049
Learning Rate: 1.50e-05
Updated learning rate after plateau adjustment: 1.4995555555555557e-05
Evaluating after epoch 3...


Evaluating: 100%|██████████| 375/375 [10:14<00:00,  1.64s/it]


Current ROUGE Scores: {'rouge1': 0.0050341852696208335, 'rouge2': 0.0, 'rougeL': 0.004935419837522068}
New best model saved with validation loss: 2.5049
GPU Memory before epoch 4: 4689.84 MB


Epoch 4/20 [Train]: 100%|██████████| 3375/3375 [07:19<00:00,  7.67it/s, loss=2.86]
Epoch 4/20 [Val]: 100%|██████████| 375/375 [00:16<00:00, 22.15it/s]


Epoch 4/20
Average Train Loss: 2.8635
Average Val Loss: 2.4195
Learning Rate: 2.00e-05
Updated learning rate after plateau adjustment: 1.9994074074074076e-05
Evaluating after epoch 4...


Evaluating: 100%|██████████| 375/375 [09:31<00:00,  1.52s/it]


Current ROUGE Scores: {'rouge1': 0.034429570476826, 'rouge2': 0.002414235516553599, 'rougeL': 0.032556181706631514}
New best model saved with validation loss: 2.4195
GPU Memory before epoch 5: 4673.81 MB


Epoch 5/20 [Train]: 100%|██████████| 3375/3375 [07:21<00:00,  7.65it/s, loss=2.75]
Epoch 5/20 [Val]: 100%|██████████| 375/375 [00:16<00:00, 22.38it/s]


Epoch 5/20
Average Train Loss: 2.7514
Average Val Loss: 2.3603
Learning Rate: 1.94e-05
Updated learning rate after plateau adjustment: 1.9445267489711937e-05
Evaluating after epoch 5...


Evaluating: 100%|██████████| 375/375 [06:49<00:00,  1.09s/it]


Current ROUGE Scores: {'rouge1': 0.044923699501044044, 'rouge2': 0.0029889508515651268, 'rougeL': 0.04289649221533504}
New best model saved with validation loss: 2.3603
GPU Memory before epoch 6: 4645.74 MB


Epoch 6/20 [Train]: 100%|██████████| 3375/3375 [07:22<00:00,  7.63it/s, loss=2.66]
Epoch 6/20 [Val]: 100%|██████████| 375/375 [00:17<00:00, 22.04it/s]


Epoch 6/20
Average Train Loss: 2.6639
Average Val Loss: 2.3122
Learning Rate: 1.89e-05
Updated learning rate after plateau adjustment: 1.8889876543209877e-05
Evaluating after epoch 6...


Evaluating: 100%|██████████| 375/375 [08:19<00:00,  1.33s/it]


Current ROUGE Scores: {'rouge1': 0.07595309390178907, 'rouge2': 0.006925472630014003, 'rougeL': 0.06603409531879864}
New best model saved with validation loss: 2.3122
GPU Memory before epoch 7: 4657.77 MB


Epoch 7/20 [Train]: 100%|██████████| 3375/3375 [07:23<00:00,  7.60it/s, loss=2.59]
Epoch 7/20 [Val]: 100%|██████████| 375/375 [00:17<00:00, 21.83it/s]


Epoch 7/20
Average Train Loss: 2.5931
Average Val Loss: 2.2778
Learning Rate: 1.83e-05
Updated learning rate after plateau adjustment: 1.833448559670782e-05
Evaluating after epoch 7...


Evaluating: 100%|██████████| 375/375 [07:32<00:00,  1.21s/it]


Current ROUGE Scores: {'rouge1': 0.10408555725492907, 'rouge2': 0.011041729259314655, 'rougeL': 0.08826307357589085}
New best model saved with validation loss: 2.2778
GPU Memory before epoch 8: 4673.81 MB


Epoch 8/20 [Train]: 100%|██████████| 3375/3375 [07:20<00:00,  7.66it/s, loss=2.53]
Epoch 8/20 [Val]: 100%|██████████| 375/375 [00:16<00:00, 22.22it/s]


Epoch 8/20
Average Train Loss: 2.5332
Average Val Loss: 2.2521
Learning Rate: 1.78e-05
Updated learning rate after plateau adjustment: 1.7779094650205764e-05
Evaluating after epoch 8...


Evaluating: 100%|██████████| 375/375 [06:26<00:00,  1.03s/it]


Current ROUGE Scores: {'rouge1': 0.11999836940842963, 'rouge2': 0.012992491141023872, 'rougeL': 0.10268735013904205}
New best model saved with validation loss: 2.2521
GPU Memory before epoch 9: 4649.75 MB


Epoch 9/20 [Train]: 100%|██████████| 3375/3375 [07:24<00:00,  7.59it/s, loss=2.48]
Epoch 9/20 [Val]: 100%|██████████| 375/375 [00:17<00:00, 21.68it/s]


Epoch 9/20
Average Train Loss: 2.4808
Average Val Loss: 2.2363
Learning Rate: 1.72e-05
Updated learning rate after plateau adjustment: 1.7223703703703704e-05
Evaluating after epoch 9...


Evaluating: 100%|██████████| 375/375 [06:54<00:00,  1.11s/it]


Current ROUGE Scores: {'rouge1': 0.11377748892844196, 'rouge2': 0.012412104309210229, 'rougeL': 0.09894268773391017}
New best model saved with validation loss: 2.2363
GPU Memory before epoch 10: 4661.78 MB


Epoch 10/20 [Train]: 100%|██████████| 3375/3375 [07:24<00:00,  7.59it/s, loss=2.43]
Epoch 10/20 [Val]: 100%|██████████| 375/375 [00:17<00:00, 21.75it/s]


Epoch 10/20
Average Train Loss: 2.4338
Average Val Loss: 2.2231
Learning Rate: 1.67e-05
Updated learning rate after plateau adjustment: 1.6668312757201648e-05
Evaluating after epoch 10...


Evaluating: 100%|██████████| 375/375 [06:25<00:00,  1.03s/it]


Current ROUGE Scores: {'rouge1': 0.10764093830326518, 'rouge2': 0.01196031163801933, 'rougeL': 0.09175299680542268}
New best model saved with validation loss: 2.2231
GPU Memory before epoch 11: 4661.78 MB


Epoch 11/20 [Train]: 100%|██████████| 3375/3375 [07:24<00:00,  7.59it/s, loss=2.39]
Epoch 11/20 [Val]: 100%|██████████| 375/375 [00:17<00:00, 21.94it/s]


Epoch 11/20
Average Train Loss: 2.3931
Average Val Loss: 2.2102
Learning Rate: 1.61e-05
Updated learning rate after plateau adjustment: 1.6112921810699588e-05
Evaluating after epoch 11...


Evaluating: 100%|██████████| 375/375 [06:04<00:00,  1.03it/s]


Current ROUGE Scores: {'rouge1': 0.10439878219466862, 'rouge2': 0.011672007204139185, 'rougeL': 0.0908611702788155}
New best model saved with validation loss: 2.2102
GPU Memory before epoch 12: 4677.82 MB


Epoch 12/20 [Train]: 100%|██████████| 3375/3375 [07:21<00:00,  7.64it/s, loss=2.35]
Epoch 12/20 [Val]: 100%|██████████| 375/375 [00:16<00:00, 22.55it/s]


Epoch 12/20
Average Train Loss: 2.3548
Average Val Loss: 2.1971
Learning Rate: 1.56e-05
Updated learning rate after plateau adjustment: 1.555753086419753e-05
Evaluating after epoch 12...


Evaluating: 100%|██████████| 375/375 [06:44<00:00,  1.08s/it]


Current ROUGE Scores: {'rouge1': 0.10877540710008865, 'rouge2': 0.012347329030034172, 'rougeL': 0.09245968262825015}
New best model saved with validation loss: 2.1971
GPU Memory before epoch 13: 4689.84 MB


Epoch 13/20 [Train]: 100%|██████████| 3375/3375 [07:23<00:00,  7.61it/s, loss=2.32]
Epoch 13/20 [Val]: 100%|██████████| 375/375 [00:16<00:00, 22.08it/s]


Epoch 13/20
Average Train Loss: 2.3185
Average Val Loss: 2.1915
Learning Rate: 1.50e-05
Updated learning rate after plateau adjustment: 1.5002139917695475e-05
Evaluating after epoch 13...


Evaluating: 100%|██████████| 375/375 [06:18<00:00,  1.01s/it]


Current ROUGE Scores: {'rouge1': 0.11154331645546203, 'rouge2': 0.013070492769470976, 'rougeL': 0.0954848306928511}
New best model saved with validation loss: 2.1915
GPU Memory before epoch 14: 4673.81 MB


Epoch 14/20 [Train]: 100%|██████████| 3375/3375 [07:21<00:00,  7.65it/s, loss=2.29]
Epoch 14/20 [Val]: 100%|██████████| 375/375 [00:16<00:00, 22.60it/s]


Epoch 14/20
Average Train Loss: 2.2860
Average Val Loss: 2.1820
Learning Rate: 1.44e-05
Updated learning rate after plateau adjustment: 1.4446748971193417e-05
Evaluating after epoch 14...


Evaluating: 100%|██████████| 375/375 [06:11<00:00,  1.01it/s]


Current ROUGE Scores: {'rouge1': 0.11232662637792919, 'rouge2': 0.012676381089667368, 'rougeL': 0.0966048287586607}
New best model saved with validation loss: 2.1820
GPU Memory before epoch 15: 4689.84 MB


Epoch 15/20 [Train]: 100%|██████████| 3375/3375 [07:22<00:00,  7.63it/s, loss=2.26]
Epoch 15/20 [Val]: 100%|██████████| 375/375 [00:16<00:00, 22.40it/s]


Epoch 15/20
Average Train Loss: 2.2569
Average Val Loss: 2.1810
Learning Rate: 1.39e-05
Updated learning rate after plateau adjustment: 1.389135802469136e-05
Evaluating after epoch 15...


Evaluating: 100%|██████████| 375/375 [06:41<00:00,  1.07s/it]


Current ROUGE Scores: {'rouge1': 0.10960783672956229, 'rouge2': 0.012597864274740864, 'rougeL': 0.09312678575021405}
New best model saved with validation loss: 2.1810
GPU Memory before epoch 16: 4689.84 MB


Epoch 16/20 [Train]: 100%|██████████| 3375/3375 [07:22<00:00,  7.63it/s, loss=2.23]
Epoch 16/20 [Val]: 100%|██████████| 375/375 [00:16<00:00, 22.08it/s]


Epoch 16/20
Average Train Loss: 2.2275
Average Val Loss: 2.1761
Learning Rate: 1.33e-05
Updated learning rate after plateau adjustment: 1.3335967078189302e-05
Evaluating after epoch 16...


Evaluating: 100%|██████████| 375/375 [06:18<00:00,  1.01s/it]


Current ROUGE Scores: {'rouge1': 0.11914727295222112, 'rouge2': 0.014235564241302998, 'rougeL': 0.09912414402428271}
New best model saved with validation loss: 2.1761
GPU Memory before epoch 17: 4657.77 MB


Epoch 17/20 [Train]: 100%|██████████| 3375/3375 [07:23<00:00,  7.62it/s, loss=2.2]
Epoch 17/20 [Val]: 100%|██████████| 375/375 [00:17<00:00, 22.06it/s]


Epoch 17/20
Average Train Loss: 2.2014
Average Val Loss: 2.1745
Learning Rate: 1.28e-05
Updated learning rate after plateau adjustment: 1.2780576131687244e-05
Evaluating after epoch 17...


Evaluating: 100%|██████████| 375/375 [06:03<00:00,  1.03it/s]


Current ROUGE Scores: {'rouge1': 0.12123387464322766, 'rouge2': 0.014395758343787975, 'rougeL': 0.10129097740868273}
New best model saved with validation loss: 2.1745
GPU Memory before epoch 18: 4677.82 MB


Epoch 18/20 [Train]: 100%|██████████| 3375/3375 [07:21<00:00,  7.65it/s, loss=2.18]
Epoch 18/20 [Val]: 100%|██████████| 375/375 [00:16<00:00, 22.07it/s]


Epoch 18/20
Average Train Loss: 2.1766
Average Val Loss: 2.1702
Learning Rate: 1.22e-05
Updated learning rate after plateau adjustment: 1.2225185185185187e-05
Evaluating after epoch 18...


Evaluating: 100%|██████████| 375/375 [06:18<00:00,  1.01s/it]


Current ROUGE Scores: {'rouge1': 0.11870735771062395, 'rouge2': 0.01363382795824403, 'rougeL': 0.09926578073472443}
New best model saved with validation loss: 2.1702
GPU Memory before epoch 19: 4661.78 MB


Epoch 19/20 [Train]: 100%|██████████| 3375/3375 [07:23<00:00,  7.60it/s, loss=2.15]
Epoch 19/20 [Val]: 100%|██████████| 375/375 [00:17<00:00, 21.77it/s]


Epoch 19/20
Average Train Loss: 2.1538
Average Val Loss: 2.1650
Learning Rate: 1.17e-05
Updated learning rate after plateau adjustment: 1.1669794238683127e-05
Evaluating after epoch 19...


Evaluating: 100%|██████████| 375/375 [06:15<00:00,  1.00s/it]


Current ROUGE Scores: {'rouge1': 0.12145279128431749, 'rouge2': 0.015150090584222994, 'rougeL': 0.10129657039268287}
New best model saved with validation loss: 2.1650
GPU Memory before epoch 20: 4677.82 MB


Epoch 20/20 [Train]: 100%|██████████| 3375/3375 [07:25<00:00,  7.57it/s, loss=2.13]
Epoch 20/20 [Val]: 100%|██████████| 375/375 [00:17<00:00, 21.80it/s]


Epoch 20/20
Average Train Loss: 2.1311
Average Val Loss: 2.1639
Learning Rate: 1.11e-05
Updated learning rate after plateau adjustment: 1.111440329218107e-05
Evaluating after epoch 20...


Evaluating: 100%|██████████| 375/375 [06:07<00:00,  1.02it/s]


Current ROUGE Scores: {'rouge1': 0.12260109995164857, 'rouge2': 0.014801708761302736, 'rougeL': 0.10146980858142782}
New best model saved with validation loss: 2.1639


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

Evaluating fine-tuned model...


Evaluating: 100%|██████████| 375/375 [06:09<00:00,  1.02it/s]


Fine-tuned Model Performance:
ROUGE Scores: {'rouge1': 0.12260109995164857, 'rouge2': 0.014801708761302736, 'rougeL': 0.10146980858142782}
Total training time: 17975.81 seconds

Generating example summaries...

Article 1:
Reference: Zully Broussard decided to give a kidney to a stranger .
A new computer program helped her donation spur transplants for six kidney patients .
Generated: CNN.com: "I't know what you can't know that we't be a good," she says . She says it's no longer than a few years ago, he says . He says she's more than 200 percent of people have been a year .
--------------------------------------------------

Article 2:
Reference: The 20th MLS season begins this weekend .
League has changed dramatically since its inception in 1996 .
Some question whether rules regarding salary caps and transfers need to change .
Generated: South Africa's World Cup team has been held in the World Cup . The United States has been made in the world's first time since 2008 . The World Cup is