# Training a Summarization Model

Now let's see how we can use `HuggingFace` to train a summarization model on a new dataset. We'll use the SAMSum dataset.

In [1]:
from datasets import load_dataset
!pip install py7zr

dataset_samsum = load_dataset("gigaword")
split_lengths = [len(dataset_samsum[split]) for split in dataset_samsum]

print(f"Split lengths: {split_lengths}")
print(f"Features: {dataset_samsum['train'].column_names}")
print(f"\nDialogue:")
print(dataset_samsum["test"][0]["document"])
print("\nSummary")
print(dataset_samsum["test"][0]["summary"])

[0m

  0%|          | 0/3 [00:00<?, ?it/s]

Split lengths: [3803957, 189651, 1951]
Features: ['document', 'summary']

Dialogue:
japan 's nec corp. and UNK computer corp. of the united states said wednesday they had agreed to join forces in supercomputer sales .

Summary
nec UNK in computer sales tie-up


In [2]:
from transformers import pipeline

# Evaluate this using PEGASUS
pipe = pipeline("summarization", model="google/pegasus-cnn_dailymail", framework='pt')
pipe_out = pipe(dataset_samsum["test"][0]["document"])
print("Summary:")
print(pipe_out[0]["summary_text"].replace(" .<n>", ".\n"))

Your max_length is set to 128, but you input_length is only 33. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=16)


Summary:
Japan 's nec corp. and UNK computer corp.<n>of the united states said they had agreed to join forces in supercomputer sales .


# Evaluating the entire test set

We will need a way to compare the baseline PEGASUS model to the finetuned version. We'll create an evaluation loop for this.

In [3]:
from tqdm import tqdm
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def chunks(list_of_elements, batch_size):
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

def evaluate_summaries(dataset, metric, model, tokenizer,
                       batch_size=16, device=device,
                       column_text="article", column_summary="highlights"):
    article_batches = list(chunks(dataset[column_text], batch_size))
    target_batches = list(chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):

        inputs = tokenizer(article_batch, max_length=1024, truncation=True,
                        padding="max_length", return_tensors="pt")

        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                                   attention_mask=inputs["attention_mask"].to(device),
                                   length_penalty=0.8, num_beams=8, max_length=128)

        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                              clean_up_tokenization_spaces=True)
                             for s in summaries]

        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]
        
    return metric.compute(predictions=decoded_summaries, references=target_batch)

In [4]:
# Load the model directly
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_ckpt = "ainize/bart-base-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

In [5]:
!pip install evaluate
import evaluate

!pip install rouge_score
rouge_metric = evaluate.load("rouge")
score = evaluate_summaries(dataset_samsum["test"], rouge_metric, model,
                           tokenizer, column_text="document",
                           column_summary="summary", batch_size=8)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[0mhuggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[0m

100%|██████████| 244/244 [11:04<00:00,  2.72s/it]


In [6]:
import pandas as pd

pd.DataFrame(score, index=["bart"])

Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
bart,0.235341,0.075865,0.211554,0.211612


In order to fine tune this model, we need to be able to tokenize the data. We can also limit the lengths of each dialogue and summary to 1024 and 128, respectively.

In [7]:
def convert_examples_to_features(example_batch):
    input_encodings = tokenizer(example_batch["document"], truncation=True,
                                max_length=1024)

    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(example_batch["summary"], max_length=128,
                                     truncation=True)

    return {"input_ids": input_encodings["input_ids"],
            "attention_mask": input_encodings["attention_mask"],
            "labels": target_encodings["input_ids"]}

dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features,
                                       batched=True)

columns = ["input_ids", "labels", "attention_mask"]
dataset_samsum_pt.set_format(type="torch", columns=columns)

  0%|          | 0/3804 [00:00<?, ?ba/s]



  0%|          | 0/190 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

# Preparing a batch of data

When training `seq2seq` models, we need to apply "teacher forcing". The encoder will receive input tokens using the labels shifted by one as well as the encoder output. The prediction is then compared to the shifted labels to calculate the loss. To clarify, the decoder only sees the previous ground truth labels.

`HuggingFace` provides a `DataCollatorForSeq2Seq` class that handles this for us.

In [8]:
from transformers import DataCollatorForSeq2Seq

seq2seq_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [9]:
from transformers import TrainingArguments, Trainer

# Gradient accumulation saves memory by updating the model only every X batches
training_args = TrainingArguments(
    output_dir="bart-samsum", num_train_epochs=1, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10, push_to_hub=False,
    evaluation_strategy="steps", eval_steps=500, save_steps=1e6,
    gradient_accumulation_steps=16)

##### Decreasing the size of `train` and `validation` from `3.8M, 189k to 30k, 3k` respectively. Because, when running with full dataset, it is taking too much to train the model, so I've decided to decrease the dataset size which can be completed under 4 hours. Also, I kept the `test` set as the same what dataset provides by default as it has only `1k rows`, so we can use the same as the test set.

In [10]:
train_sample = dataset_samsum_pt['train'].shuffle().select(range(50000))
validation_sample = dataset_samsum_pt['validation'].shuffle().select(range(5000))

In [11]:
!pip install wandb --upgrade
trainer = Trainer(model=model, args=training_args,
                  tokenizer=tokenizer, data_collator=seq2seq_collator,
                  train_dataset=train_sample,
                  eval_dataset=validation_sample)

trainer.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[0m

[34m[1mwandb[0m: Currently logged in as: [33mteja-atech[0m ([33mspark5[0m). Use [1m`wandb login --relogin`[0m to force relogin


You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss
500,2.4256,2.183717
1000,2.1962,2.037916
1500,2.3502,1.976657


TrainOutput(global_step=1562, training_loss=2.4293487020094475, metrics={'train_runtime': 5844.4964, 'train_samples_per_second': 8.555, 'train_steps_per_second': 0.267, 'total_flos': 1413587043348480.0, 'train_loss': 2.4293487020094475, 'epoch': 1.0})

In [12]:
# Evaluate after finetuning
score_ft = evaluate_summaries(
    dataset_samsum["test"], rouge_metric, trainer.model, tokenizer,
    batch_size=2, column_text="document", column_summary="summary")
pd.DataFrame(score_ft, index=[f"bart_finetuned"])

100%|██████████| 976/976 [05:30<00:00,  2.95it/s]


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
bart_finetuned,0.421053,0.235294,0.421053,0.421053


##### Rouge score before and after training the bert-base-cnn. I've taken the dataset `Gigaword` from huggingface and limited number of rows to consider because it is such as big dataset which is taking lot of time to train.
We can see that there is increse in `ROUGE SCORE` before and after fine tuning the bart transformer.

In [13]:
sample_text = dataset_samsum["test"][0]["document"]
reference = dataset_samsum["test"][0]["summary"]

inputs = tokenizer(sample_text, max_length=1024, truncation=True,
                   padding="max_length", return_tensors="pt")

summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                           attention_mask=inputs["attention_mask"].to(
    device),
    length_penalty=0.8, num_beams=8, max_length=128)

decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                      clean_up_tokenization_spaces=True)
                     for s in summaries]

decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]


In [14]:
print(decoded_summaries)

['nepal UNK to join forces in supercomputer sales']
