<a href="https://colab.research.google.com/github/NLPiation/tutorial_notebooks/blob/main/summarization/hf_BART_train_breakdown.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sample code showing how to train a model with a seq2seq architecture

This code is supplementary material for the story published on the NLPiation Medium blog. Follow [the link](https://pub.towardsai.net/how-to-train-a-seq2seq-text-summarization-model-with-sample-code-ft-huggingface-pytorch-8ba97492f885) for a detailed explanation of the encoder-decoder architecture and the code.

> The purpose of this code is to show how the data flows and what happens under the hood when we train a summarization model. It is the second part of a series on the basics of the text summarization task.

## Download and Load the Libraries

In [None]:
!pip install transformers
!pip install datasets


In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

## Load the Model/Tokenizer

In [None]:
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')


In [None]:
# Put the model on GPU if available.
if torch.cuda.is_available():
  model = model.to("cuda")

## Load The Dataset

I used only 1% of the CNN/DailyMail dataset to train this model. Remove the `[0:1%]` part from the code below if you want to train/fine-tune the model on the full dataset.

In [None]:
import datasets
train_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="train[0:1%]")
validation_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="validation[0:1%]")


## Prepare The Dataset

We start by writing a function that tokenizes the data and converts it into a structure the model accepts. By default, the dataset has *id, article, and highlights* columns. The function replaces them with *input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, labels*, and set_format() then returns these columns as PyTorch tensors.

In [None]:
article_length=512
summary_length=64

def process_data_to_model_inputs(batch):
  # tokenize the inputs and labels
  inputs = tokenizer(batch["article"], padding="max_length", truncation=True, max_length=article_length)
  outputs = tokenizer(batch["highlights"], padding="max_length", truncation=True, max_length=summary_length)

  batch["input_ids"] = inputs.input_ids
  batch["attention_mask"] = inputs.attention_mask
  batch["decoder_input_ids"] = outputs.input_ids
  batch["decoder_attention_mask"] = outputs.attention_mask
  batch["labels"] = outputs.input_ids.copy()

  # We have to make sure that the PAD token is ignored for calculating the loss
  batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]]

  return batch
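
Note that the function above feeds the same token ids to the decoder and to the labels. A common alternative in teacher forcing is to shift the decoder inputs one position to the right, so that each position predicts a token it has not yet seen. The sketch below illustrates that shift on plain Python lists. The helper name `shift_right` and the token ids are illustrative assumptions, not part of the original notebook; the decoder start id of 2 and pad id of 1 are assumed here and should be read from the model config and tokenizer in practice.

```python
def shift_right(label_ids, pad_token_id, decoder_start_token_id):
    """Build decoder inputs by shifting the target ids one position right."""
    shifted = [decoder_start_token_id] + list(label_ids[:-1])
    # Labels use -100 for ignored positions; decoder inputs need a real pad id.
    return [pad_token_id if tok == -100 else tok for tok in shifted]

# Hypothetical ids: 0 = <s>, 2 = </s>, 1 = <pad>; -100 marks ignored label slots.
labels = [0, 713, 16, 2, -100, -100]
decoder_input_ids = shift_right(labels, pad_token_id=1, decoder_start_token_id=2)
print(decoder_input_ids)  # [2, 0, 713, 16, 2, 1]
```

With inputs built this way, the decoder at position t only sees the tokens before the label at position t.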

In [None]:
train_data = train_data.map(
    process_data_to_model_inputs, 
    batched=True,
    remove_columns=["article", "highlights", "id"]
)

train_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids",
                           "decoder_attention_mask", "labels"],
)


In [None]:
validation_data = validation_data.map(
    process_data_to_model_inputs,
    batched=True,
    remove_columns=["article", "highlights", "id"]
)

validation_data.set_format(
    type="torch", columns=["input_ids", "attention_mask", "decoder_input_ids",
                           "decoder_attention_mask", "labels"],
)


Print the tokenized dataset and one sample from it to see what they look like.

In [None]:
train_data

Dataset({
    features: ['attention_mask', 'decoder_attention_mask', 'decoder_input_ids', 'input_ids', 'labels'],
    num_rows: 2871
})

In [None]:
next( iter( train_data ) )

{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1

As you can see, the dataset is not batched yet. We can use PyTorch's DataLoader class to take care of batching the data. Consider using a larger batch_size if your hardware allows it.

In [None]:
from torch.utils.data import DataLoader

batch_size      = 4

train_data      = DataLoader(train_data, batch_size=batch_size)
validation_data = DataLoader(validation_data, batch_size=batch_size)
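
As a rough illustration of what this batching does, the sketch below slices a list of row indices into consecutive groups, the way a DataLoader with default settings (no shuffling, last partial batch kept) would. The helper `batches` is a hypothetical stand-in written for this example, not the DataLoader implementation.

```python
import math

def batches(rows, batch_size):
    """Yield consecutive slices of `rows`, like a DataLoader without shuffling."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

num_rows = 2871  # size of the 1% training split printed above
batch_size = 4
steps_per_epoch = math.ceil(num_rows / batch_size)
chunks = list(batches(list(range(num_rows)), batch_size))
print(steps_per_epoch, len(chunks), len(chunks[-1]))  # 718 718 3
```

The last batch is smaller than the others because 2871 is not a multiple of 4.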

Now each item we draw from the loader is a batch of 4 samples rather than a single sample.

In [None]:
next( iter( train_data ) )

{'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]]),
 'decoder_attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

## Loss Function

In [None]:
from torch.nn import CrossEntropyLoss

loss_fct = CrossEntropyLoss()
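
`CrossEntropyLoss` skips targets equal to its `ignore_index`, which defaults to -100; that is why the padded label positions were set to -100 earlier. The pure-Python sketch below reproduces that behavior on a toy example. The function `cross_entropy` is an illustrative re-implementation for this purpose, not the PyTorch code.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over the positions that are not ignored."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        log_sum_exp = math.log(sum(math.exp(x) for x in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)

logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [5.0, 0.0, 0.0]]
with_padding = cross_entropy(logits, [0, 1, -100])
without_row = cross_entropy(logits[:2], [0, 1])
print(with_padding == without_row)  # True
```

The third position is ignored entirely, so the loss equals the one computed on the first two positions alone.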

## Optimizer

In [None]:
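# transformers' own AdamW is deprecated in favor of the PyTorch implementation.
from torch.optim import AdamW

# weight_decay=0.0 matches the default of the deprecated transformers class.
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)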

In [None]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_data)
num_validation_steps = num_epochs * len(validation_data)

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)
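
To make the effect of the `"linear"` schedule concrete, here is a small sketch of the learning rate it produces at a given step: a linear warmup (unused here, since `num_warmup_steps=0`) followed by a linear decay to zero. The function `linear_lr` is written for illustration only; the step count assumes the 1% split of 2871 rows and a batch size of 4.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` scheduler steps under linear warmup/decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)

# num_epochs * batches per epoch: 3 * ceil(2871 / 4) = 3 * 718
total = 3 * 718
for step in (0, total // 2, total):
    print(step, linear_lr(step, 5e-5, total))
```

The rate starts at the optimizer's `lr`, is halved midway through training, and reaches zero at the final step.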

## Split the model

This step only shows the flow of data during training and is done for teaching purposes. We could instead call `model(**batch)` to get the loss value directly; you will see an example of this in the validation loop. For now, let's split the model following Figure 1 of the [blog post](https://pub.towardsai.net/how-to-train-a-seq2seq-text-summarization-model-with-sample-code-ft-huggingface-pytorch-8ba97492f885).

In [None]:
the_encoder = model.get_encoder()
the_decoder = model.get_decoder()
last_linear_layer = model.lm_head
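
It can help to track the tensor shapes as a batch passes through these three pieces. The sketch below is only shape bookkeeping with the batch size and sequence lengths used in this notebook. The hidden size (1024) and vocabulary size (50264) are assumptions for `facebook/bart-large-cnn`; in practice, read them from `model.config.d_model` and `model.config.vocab_size`.

```python
batch_size, article_length, summary_length = 4, 512, 64
hidden_size, vocab_size = 1024, 50264  # assumed values for facebook/bart-large-cnn

shapes = {
    "input_ids": (batch_size, article_length),
    "encoder_output": (batch_size, article_length, hidden_size),
    "decoder_input_ids": (batch_size, summary_length),
    "decoder_output": (batch_size, summary_length, hidden_size),
    "lm_head_output": (batch_size, summary_length, vocab_size),
}
# Flattening for CrossEntropyLoss: one row of vocabulary scores per target token.
shapes["logits_flat"] = (batch_size * summary_length, vocab_size)
shapes["labels_flat"] = (batch_size * summary_length,)

for name, shape in shapes.items():
    print(f"{name:18s} {shape}")
```

The two flattened shapes correspond to the `.view(-1, model.config.vocab_size)` and `.view(-1)` calls in the training loop.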

⚠️  Remember to comment out the code below if you want to actually train/fine-tune your model. I froze the whole model except the last decoder layer just to speed up training for the demonstration.

In [None]:
for name, param in model.named_parameters():
  if not name.startswith( "model.decoder.layers.11" ):
    param.requires_grad = False
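
To see which parameters this prefix rule keeps trainable, the sketch below applies the same `startswith` test to a handful of parameter names written in the style that `named_parameters()` returns for this model. The list is a small hand-picked sample for illustration, not the full output, and `is_trainable` is a hypothetical helper.

```python
def is_trainable(name, prefix="model.decoder.layers.11"):
    """Mirror of the rule above: only parameters under `prefix` keep gradients."""
    return name.startswith(prefix)

# A few parameter names in the style of named_parameters() for this model.
names = [
    "model.shared.weight",
    "model.encoder.layers.11.fc1.weight",
    "model.decoder.layers.10.fc2.bias",
    "model.decoder.layers.11.self_attn.q_proj.weight",
    "model.decoder.layers.11.final_layer_norm.bias",
]
trainable = [n for n in names if is_trainable(n)]
print(trainable)
```

Only the two names under the last decoder layer pass the test; the encoder's layer 11 does not, because the prefix includes `decoder`.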

## The Training + Validation Loop

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps + num_validation_steps))

for epoch in range(num_epochs):

    # The Training Loop for One Epoch
    model.train()
    training_loss = 0.0
    validation_loss = 0.0
    print("Training...")
    for batch in train_data:
      if torch.cuda.is_available():
        batch = {k: v.to('cuda') for k, v in batch.items()}

      encoder_output = the_encoder(input_ids = batch['input_ids'],
                                   attention_mask = batch['attention_mask'])
      
      decoder_output = the_decoder(input_ids=batch['decoder_input_ids'],
                                   attention_mask=batch['decoder_attention_mask'],
                                   encoder_hidden_states=encoder_output[0],
                                   encoder_attention_mask=batch['attention_mask'])

      decoder_output = decoder_output.last_hidden_state
      lm_head_output = last_linear_layer(decoder_output)

      loss = loss_fct(lm_head_output.view(-1, model.config.vocab_size),
                      batch['labels'].view(-1))
      training_loss += loss.item()

      loss.backward()
      optimizer.step()
      lr_scheduler.step()
      optimizer.zero_grad()
      progress_bar.update(1)
    
    # Evaluate the model on the validation set
    # after each training epoch.
    model.eval()
    print("Validating...")
    for batch in validation_data:
        if torch.cuda.is_available():
          batch = {k: v.to('cuda') for k, v in batch.items()}
        
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        validation_loss += loss.item()  # accumulate a Python float, not a tensor
        progress_bar.update(1)
    
    training_loss = training_loss / len( train_data )
    validation_loss = validation_loss / len( validation_data )
    print("Epoch {}:\tTraining Loss {:.2f}\t/\tValidation Loss {:.2f}".format(epoch+1, training_loss, validation_loss))



Training...
Validation...
Epoch 0:	Training Loss 2.86	/	Validation Loss 1.10
Training...
Validation...
Epoch 1:	Training Loss 0.94	/	Validation Loss 0.70
Training...
Validation...
Epoch 2:	Training Loss 0.68	/	Validation Loss 0.63


As you can see, we do one epoch of training and then put the model in evaluation mode for one pass over the validation dataset. The falling loss values make it apparent that the model is learning (lower is better).

Read the Medium post [here](https://pub.towardsai.net/how-to-train-a-seq2seq-text-summarization-model-with-sample-code-ft-huggingface-pytorch-8ba97492f885) if you still have questions.

Consider following me on Twitter ([@NLPiation](https://twitter.com/NLPiation)), where I mostly write about NLP; it is also a great place for discussions.