<a href="https://www.kaggle.com/code/ahmedmostafadora/fine-tuning-pegasus-on-dialogue-summarization?scriptVersionId=212904884" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 1. Importing libraries

In [1]:
!pip install evaluate rouge-score
from datasets import load_dataset
import pandas as pd
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments, pipeline
import torch
from tqdm import tqdm
import evaluate
from kaggle_secrets import UserSecretsClient
from huggingface_hub import login
import os

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
   84.0/84.0 kB 3.5 MB/s eta 0:00:00
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=c6c741463838cc74653f5a2ef57f8eb5585f9002c40faf52748fd30419f204ae
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge-score
Installing collected packages: rouge-score, evaluate
Successfully installed evaluate-0.4.3 rouge-score-0.1.2


# 2. Preparing environment

In [2]:
user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("hf_token")

In [3]:

os.environ['HF_TOKEN']=secret_value_0

login(token=os.getenv('HF_TOKEN'))

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


In [4]:
model_ckpt = "google/pegasus-cnn_dailymail"
device = 'cuda' if torch.cuda.is_available() else "cpu"

# 3. Loading and inspecting the dataset

In [5]:
dataset = load_dataset("knkarthick/dialogsum")

README.md:   0%|          | 0.00/4.65k [00:00<?, ?B/s]

train.csv:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

validation.csv:   0%|          | 0.00/442k [00:00<?, ?B/s]

test.csv:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/12460 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [6]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

In [7]:
print(dataset['train']['dialogue'][0], dataset['train']['summary'][0], sep='\n\n')

#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
#Person2#: I found it would be a good idea to get a check-up.
#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.
#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?
#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.
#Person2#: Ok.
#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?
#Person2#: Yes.
#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.
#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.
#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.
#Person2#: Ok, thanks doctor.

Mr. Smith's getting a check-up, and Doctor Hawkins advises him to h
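The dialogues mark each speaker turn with a `#PersonN#:` prefix, as the printed example shows. A minimal sketch of splitting a dialogue into (speaker, utterance) pairs follows; `split_turns` and the regular expression are illustrative helpers, not part of the dataset or of any library, and they assume one turn per line.

```python
import re

# Hypothetical helper: matches lines such as "#Person1#: Hi, Mr. Smith."
TURN_PATTERN = re.compile(r"^(#Person\d+#):\s*(.*)$")

def split_turns(dialogue):
    """Return a list of (speaker, utterance) tuples, one per matching line."""
    turns = []
    for line in dialogue.splitlines():
        match = TURN_PATTERN.match(line.strip())
        if match:
            turns.append((match.group(1), match.group(2)))
    return turns

# First two turns of the training example printed above
sample_dialogue = (
    "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?\n"
    "#Person2#: I found it would be a good idea to get a check-up."
)
for speaker, utterance in split_turns(sample_dialogue):
    print(speaker, "->", utterance)
```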

# 4. Evaluating the model before fine-tuning

In [8]:
metric = evaluate.load("rouge")

def chunks(list_of_elements, batch_size):
    """Yield successive batch_size-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

def evaluate_summaries(model,
                       tokenizer,
                       dataset,
                       batch_size,
                       metric,
                       col_name='dialogue',
                       col_summary='summary'):
    """Generate summaries batch by batch and score them against the references."""
    article_batches = list(chunks(dataset[col_name], batch_size))
    summary_batches = list(chunks(dataset[col_summary], batch_size))

    for article_batch, summary_batch in tqdm(zip(article_batches, summary_batches), total=len(article_batches)):
        # Tokenize the dialogues, truncating and padding each one to 1024 tokens
        inputs = tokenizer(article_batch, max_length=1024,
                           truncation=True, return_tensors='pt', padding='max_length')
        # Beam-search generation, capped at 128 tokens
        summaries = model.generate(input_ids=inputs['input_ids'].to(device),
                                   attention_mask=inputs['attention_mask'].to(device),
                                   max_length=128, length_penalty=0.8, num_beams=5)
        # Decode the generated token ids back to text
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                              clean_up_tokenization_spaces=True)
                             for s in summaries]
        # Replace the '<n>' newline marker with a space
        decoded_summaries = [d.replace('<n>', " ") for d in decoded_summaries]
        # Accumulate predictions and references for the metric
        metric.add_batch(predictions=decoded_summaries, references=summary_batch)

    # Compute ROUGE over everything accumulated above
    score = metric.compute()
    return score

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]
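To make the batching concrete, here is a small standalone check of the `chunks` helper defined above (repeated here so the snippet runs on its own). The input list is made up for illustration. The last batch can be shorter than `batch_size`.

```python
def chunks(list_of_elements, batch_size):
    """Yield successive batch_size-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

# Ten items in batches of four: two full batches and one partial batch
batches = list(chunks(list(range(10)), 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With 100 examples and a batch size of 4, this yields the 25 batches that the evaluation progress bars report.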

In [9]:
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

score = evaluate_summaries(model, tokenizer, dataset['validation'][:100], 4, metric)

tokenizer_config.json:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

100%|██████████| 25/25 [01:05<00:00,  2.60s/it]


In [10]:
rouge_names = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
rouge_dict = dict((rn, score[rn]) for rn in rouge_names)

pd.DataFrame(rouge_dict, index=["PEGASUS"])

| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| PEGASUS | 0.262833 | 0.067448 | 0.201512 | 0.201632 |
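For intuition about what the `rouge1` column measures, the sketch below computes a simplified unigram-overlap F1 score. It is an illustration only: it assumes lowercased whitespace tokenization, whereas the `rouge` metric loaded above applies its own tokenization and aggregation, so the numbers will not match exactly. The function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1: F1 over overlapping unigrams (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each token counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Two of the four reference tokens are recovered: precision 1.0, recall 0.5
print(rouge1_f1("the cat", "the cat sat down"))
```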


# 5. Fine-tuning PEGASUS

In [11]:
def tokenize(example_batch):
    input_encodings = tokenizer(example_batch['dialogue'], max_length=1024, truncation=True)
    # text_target tokenizes the summaries as labels (replaces the deprecated as_target_tokenizer())
    target_encodings = tokenizer(text_target=example_batch['summary'], truncation=True, max_length=128)

    return {
        "input_ids": input_encodings['input_ids'], 
        "attention_mask": input_encodings['attention_mask'], 
        "labels": target_encodings['input_ids']
    }


train_dataset = dataset['train'].map(tokenize, batched=True) 
val_dataset = dataset['validation'].map(tokenize, batched=True) 
test_dataset = dataset['test'].map(tokenize, batched=True) 

columns = ['input_ids', 'labels', 'attention_mask']
train_dataset.set_format(type='torch', columns=columns)
val_dataset.set_format(type="torch", columns=columns)
test_dataset.set_format(type="torch", columns=columns)

Map:   0%|          | 0/12460 [00:00<?, ? examples/s]



Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [12]:
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
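`DataCollatorForSeq2Seq` pads each batch dynamically and, by default, pads the labels with `-100` (its `label_pad_token_id`) so that padded positions are ignored by the loss. The pure-Python sketch below imitates only that label-padding idea. `pad_labels` is a hypothetical helper and the token ids are made up.

```python
def pad_labels(batch_labels, label_pad_token_id=-100):
    """Right-pad every label sequence to the longest one in the batch."""
    max_len = max(len(labels) for labels in batch_labels)
    return [
        labels + [label_pad_token_id] * (max_len - len(labels))
        for labels in batch_labels
    ]

# Made-up token ids: a three-token summary and a one-token summary
padded = pad_labels([[11, 12, 13], [21]])
print(padded)  # [[11, 12, 13], [21, -100, -100]]
```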

In [13]:
from transformers import Trainer, TrainingArguments
training_arguments = TrainingArguments(
    output_dir='pegasus-dialogue',  
    num_train_epochs=1,
    warmup_steps=100,
    per_device_train_batch_size=1, 
    per_device_eval_batch_size=1,
    weight_decay=0.01, 
    logging_steps=50, 
    push_to_hub=True, 
    eval_strategy='steps', 
    eval_steps=500, 
    save_steps=1e6, 
    gradient_accumulation_steps=8,
    # predict_with_generate=True,
    fp16=True,
    report_to=[]
)

In [14]:
trainer = Trainer(model=model,
                  tokenizer=tokenizer,
                  args=training_arguments,
                  data_collator=seq2seq_data_collator,
                  train_dataset=train_dataset.select(range(4000)),
                  eval_dataset=val_dataset)


In [15]:
trainer.train()



| Step | Training Loss | Validation Loss |
|---|---|---|
| 500 | 1.2735 | 1.184964 |




TrainOutput(global_step=500, training_loss=1.5296255264282226, metrics={'train_runtime': 743.3371, 'train_samples_per_second': 5.381, 'train_steps_per_second': 0.673, 'total_flos': 2320143985360896.0, 'train_loss': 1.5296255264282226, 'epoch': 1.0})
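The `global_step=500` in the output follows from the training arguments: a per-device batch size of 1 with 8 gradient-accumulation steps gives an effective batch of 8, and 4000 training examples over one epoch then take 500 optimizer steps. The sketch below reproduces that arithmetic. It assumes a single device and an example count that divides evenly, since the remainder handling inside `Trainer` is not modelled here. `optimizer_steps` is a hypothetical helper.

```python
def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    num_devices=1, epochs=1):
    """Return (effective batch size, total optimizer steps), ignoring remainders."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = num_examples // effective_batch
    return effective_batch, steps_per_epoch * epochs

# Values taken from the TrainingArguments and the 4000-example training subset
effective_batch, total_steps = optimizer_steps(4000, 1, 8)
print(effective_batch, total_steps)  # 8 500
```

Because `eval_steps=500`, the single evaluation row in the log lands on the final step.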

# 6. Evaluating on the test set

In [16]:
score = evaluate_summaries(trainer.model, tokenizer, dataset['test'][:100], 4, metric)
rouge_names = ['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
rouge_dict = dict((rn, score[rn]) for rn in rouge_names)

pd.DataFrame(rouge_dict, index=["fine-tuned-PEGASUS"])

100%|██████████| 25/25 [01:08<00:00,  2.75s/it]


| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| fine-tuned-PEGASUS | 0.36757 | 0.129781 | 0.293908 | 0.294477 |
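As a rough comparison, the sketch below subtracts the scores reported before fine-tuning from the ones reported here. The numbers are copied from the two result tables in this notebook. The baseline was scored on the first 100 validation examples and the fine-tuned model on the first 100 test examples, so the differences are indicative rather than a like-for-like measurement.

```python
# Scores copied from the two result tables in this notebook
baseline = {"rouge1": 0.262833, "rouge2": 0.067448,
            "rougeL": 0.201512, "rougeLsum": 0.201632}
fine_tuned = {"rouge1": 0.36757, "rouge2": 0.129781,
              "rougeL": 0.293908, "rougeLsum": 0.294477}

# Absolute gain per metric, rounded to the precision of the tables
gains = {name: round(fine_tuned[name] - baseline[name], 6) for name in baseline}
for name, gain in gains.items():
    print(f"{name}: +{gain:.6f}")
```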


In [17]:
trainer.push_to_hub('pegasus-dialogsum-v2')

CommitInfo(commit_url='https://huggingface.co/Ahmed167/pegasus-dialogue/commit/da4ec61d260e8591db917c6101c543402acb8e1d', commit_message='pegasus-dialogsum-v2', commit_description='', oid='da4ec61d260e8591db917c6101c543402acb8e1d', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Ahmed167/pegasus-dialogue', endpoint='https://huggingface.co', repo_type='model', repo_id='Ahmed167/pegasus-dialogue'), pr_revision=None, pr_num=None)

## A sample from the training set

In [18]:
pipe = pipeline('summarization', model="Ahmed167/pegasus-dialogue")
sample = train_dataset.select(range(1))
dialogue = sample['dialogue']
reference = sample['summary']

print(f"Dialogue:\n{dialogue[0]}")
print("*"*20)
print(f"model summary:\n{pipe(dialogue, length_penalty=0.8, num_beams=8, max_length=128)[0]['summary_text']}")
print("*"*20)
print(f"reference:\n{reference[0]}")

config.json:   0%|          | 0.00/1.32k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/275 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/20.3k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/6.60M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.77k [00:00<?, ?B/s]

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Dialogue:
#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
#Person2#: I found it would be a good idea to get a check-up.
#Person1#: Yes, well, you haven't had one for 5 years. You should have one every year.
#Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor?
#Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good.
#Person2#: Ok.
#Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith?
#Person2#: Yes.
#Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit.
#Person2#: I've tried hundreds of times, but I just can't seem to kick the habit.
#Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave.
#Person2#: Ok, thanks doctor.
********************
model summary:
Mr. Smith has not had 