## 🤗 Finetune **Longformer Encoder-Decoder (LED)** for Abstract Generation 🤗


---
This notebook is based on the training script provided with the [LED](https://huggingface.co/transformers/model_doc/led.html) model from the [Huggingface Transformers](https://huggingface.co/transformers/) library. The original script can be found [here](https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing#scrollTo=6GRz0rksYb3h).


---
The *Longformer Encoder-Decoder (LED)* was recently added as an extension to [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.

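In this notebook we will finetune *LED* to generate the abstract of a research paper from its section texts. The original notebook finetunes on [Pubmed](https://huggingface.co/datasets/viewer/?dataset=scientific_papers), a long-range summarization dataset; here we use a small custom collection of papers loaded from local JSON files, which poses the same long-input challenge and is therefore also a good candidate for LED. LED will be finetuned with inputs of up to 16,384 tokens on a single GPU.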

We will leverage 🤗`Seq2SeqTrainer`, gradient checkpointing and as usual 🤗`datasets`.

Training this model requires a reasonably powerful GPU. The original notebook recommends a GPU with at least 15GB of VRAM. Fortunately, we have access to cloud computing resources, so we are able to run experiments with this model.

In [1]:
!nvidia-smi

Sun Nov 27 03:17:25 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    53W / 400W |      0MiB / 40960MiB |     27%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Install all of the packages needed for this project. We need to use the `-f https://download.pytorch.org/whl/torch_stable.html` flag to install the correct version of PyTorch for the GPU we are using.

In [2]:
!pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

[0m[31mERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'[0m[31m
[0m

## Dataset

Let's start by loading and preprocessing the dataset. NOTE: we will have to change this slightly when we switch to predicting the introduction instead of the abstract.

In [3]:
from datasets import load_dataset, load_metric

In [4]:
def lists_to_single_str(dataset):
    dataset['section_titles'] = '\n'.join(dataset['section_titles'])
    dataset['section_texts'] = '\n'.join(dataset['section_texts'])

    return dataset

# load the dataset from cnn_papers.json and nlp_papers.json (TODO: include ml_papers.json)
dataset = load_dataset('json', data_files=['cnn_papers.json', 'nlp_papers.json'], split='train')
dataset = dataset.map(lists_to_single_str) # convert the paper sections into a format that can be processed by the model

Using custom data configuration default-77fc4b40608da6cf
Found cached dataset json (/home/jupyter/.cache/huggingface/datasets/json/default-77fc4b40608da6cf/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab)
Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/json/default-77fc4b40608da6cf/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab/cache-1ddb6fba6c7fa090.arrow


Right now, the dataset should be split with an 80/20 train/test split. We may change this later to a train/val/test split.

In [5]:
seed = 42 # set the seed for reproducibility (and so that we can get the same test dataset when evaluating the model)

dataset = dataset.train_test_split(test_size=0.2, seed=seed)
print(len(dataset['train']))
print(len(dataset['test']))

Loading cached split indices for dataset at /home/jupyter/.cache/huggingface/datasets/json/default-77fc4b40608da6cf/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab/cache-9feec4ea58ef01d4.arrow and /home/jupyter/.cache/huggingface/datasets/json/default-77fc4b40608da6cf/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab/cache-188f380262a3d755.arrow


78
20
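
As a quick sanity check on the split sizes printed above, here is a minimal sketch of the arithmetic. It assumes the test fraction is rounded up and the remainder goes to the training set; the total of 98 papers is inferred from the printed sizes, and `split_sizes` is a hypothetical helper, not part of 🤗`datasets`.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test) for a fractional test split.

    Assumption: the test size is rounded up and everything else is used
    for training. The library's exact rounding rule may differ.
    """
    n_test = math.ceil(test_fraction * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# 98 papers with a 20% test split
n_train, n_test = split_sizes(98, 0.2)
print(n_train, n_test)
```

With only 20 test papers, the reported metrics will be noisy, which is one reason a separate validation split may be worth adding later.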


Let's take a quick look at one of the papers:

In [6]:
import random

# randrange excludes the upper bound, so the index is always valid
paper = dataset['train'][random.randrange(len(dataset['train']))]
print(paper['title'], '\n')
print(paper['section_titles'], '\n')
print(paper['abstract'])

Compilation, Analysis and Application of a Comprehensive Bangla Corpus KUMono 

Introduction
Background and Related Works
Scope and Objective of Proposed Work
Development of Monolingual Bangla Corpus KUMono
Statistical Analysis of KUMono Bangla Language Profile
Quality Assurance of KUMono Corpus
Article Classification Using the KUMono Corpus
Conclusion 

Research in Natural Language Processing (NLP) and computational linguistics highly depends on a good quality representative corpus of any specific language. Bangla is one of the most spoken languages in the world but Bangla NLP research is in its early stage of development due to the lack of quality public corpus. This article describes the detailed compilation methodology of a comprehensive monolingual Bangla corpus, KUMono ( K hulna U niversity Mono lingual corpus). The newly developed corpus consists of more than 350 million word tokens and more than one million unique tokens from 18 major text categories of online Bangla websites. 

## Tokenizing

Now, we tokenize the dataset using an `AutoTokenizer` from HuggingFace.

In [7]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")

Note that for the sake of this notebook, we finetune the "smaller" LED checkpoint ["allenai/led-base-16384"](https://huggingface.co/allenai/led-base-16384). Better performance can, however, be attained by finetuning ["allenai/led-large-16384"](https://huggingface.co/allenai/led-large-16384), at the cost of more GPU memory.

In [8]:
max_input_length = 16384
max_output_length = 512
batch_size = 2

Now, let's write the input data processing function that will be used to map each data sample to the correct model format.
Here `section_texts` is our input data and `abstract` is the target data. The data samples are thus tokenized up to the respective maximum lengths of 16384 and 512.

In addition to the usual `attention_mask`, LED can make use of an additional `global_attention_mask` defining which input tokens are attended globally and which are attended only locally, just as it's the case of [Longformer](https://huggingface.co/transformers/model_doc/longformer.html). For more information on Longformer's self-attention, please take a look at the corresponding [docs](https://huggingface.co/transformers/model_doc/longformer.html#longformer-self-attention). For summarization, we follow recommendations of the [paper](https://arxiv.org/abs/2004.05150) and use global attention only for the very first token. Finally, we make sure that no loss is computed on padded tokens by setting their index to `-100`.
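
To make the three ingredients described above concrete (padding or truncating to a fixed length, global attention on the first token only, and replacing padded label positions with `-100`), here is a minimal pure-Python sketch on toy token ids. The helper names and the pad id `PAD_ID = 1` are assumptions for illustration; in the notebook the real values come from the tokenizer.

```python
PAD_ID = 1  # assumed pad token id, for illustration only

def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Cut `ids` to `max_length`, then right-pad; also return the attention mask."""
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask

def first_token_global_mask(length):
    """Global attention on position 0 only, local attention everywhere else."""
    mask = [0] * length
    mask[0] = 1
    return mask

def mask_pad_labels(labels, pad_id=PAD_ID):
    """Replace pad positions with -100 so they are ignored by the loss."""
    return [-100 if token == pad_id else token for token in labels]

input_ids, attention_mask = pad_or_truncate([0, 713, 16, 2], max_length=6)
global_attention_mask = first_token_global_mask(len(input_ids))
labels, _ = pad_or_truncate([0, 42, 2], max_length=5)
labels = mask_pad_labels(labels)

print(input_ids)              # [0, 713, 16, 2, 1, 1]
print(attention_mask)         # [1, 1, 1, 1, 0, 0]
print(global_attention_mask)  # [1, 0, 0, 0, 0, 0]
print(labels)                 # [0, 42, 2, -100, -100]
```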

In [9]:
def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["section_texts"],
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
    )
    outputs = tokenizer(
        batch["abstract"],
        padding="max_length",
        truncation=True,
        max_length=max_output_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # create 0 global_attention_mask lists
    batch["global_attention_mask"] = len(batch["input_ids"]) * [
        [0 for _ in range(len(batch["input_ids"][0]))]
    ]

    # since above lists are references, the following line changes the 0 index for all samples
    batch["global_attention_mask"][0][0] = 1
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch
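
The comment about list references in the function above is easy to misread, so here is a small standalone sketch of that behavior. Multiplying a list that contains one inner list repeats a reference to the same inner list, which is why a single assignment sets the first position for every sample. The variable names are illustrative only.

```python
n_samples, seq_len = 3, 4

# List multiplication repeats a reference to ONE inner list.
shared = n_samples * [[0 for _ in range(seq_len)]]
shared[0][0] = 1  # visible through every row, since all rows are the same object
print(shared)     # [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]

# A more explicit alternative: build independent rows and set each one.
independent = [[0] * seq_len for _ in range(n_samples)]
for row in independent:
    row[0] = 1
print(independent == shared)  # True
```

Both versions produce the same mask values. The explicit version is safer if the rows are ever modified individually later on.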

Now that we have a function that tokenizes the data, we can apply it to our dataset.

In [10]:
dataset = dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["title", "section_texts", "abstract", "section_titles"], # remove the columns that we don't need anymore since we've already tokenized them
)

Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/json/default-77fc4b40608da6cf/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab/cache-f9d6ff017ebcc8c5.arrow
Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/json/default-77fc4b40608da6cf/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab/cache-fdea6a27757abb72.arrow


Finally, the datasets should be converted into the PyTorch format as follows.

In [11]:
dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)

## Model

Alright, we're almost ready to start training. Let's load the model via the `AutoModelForSeq2SeqLM` class.

In [12]:
from transformers import AutoModelForSeq2SeqLM

We've decided to stick to the smaller model `"allenai/led-base-16384"` for the sake of this notebook. In addition, we directly enable gradient checkpointing and disable the caching mechanism to save memory.

In [13]:
led = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384", gradient_checkpointing=True, use_cache=False)

During training, we want to evaluate the model on Rouge, the most common metric used in summarization, to make sure the model is indeed improving during training. For this, we set fitting generation parameters. We'll use beam search with a modest beam size of 4 to limit memory use. Also, we force the model to generate at least 100 tokens, but no more than 512. In addition, some other generation parameters are set that have been found helpful for generation. For more information on those parameters, please take a look at the [docs](https://huggingface.co/transformers/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin.generate).

In [14]:
# set generate hyperparameters
led.config.num_beams = 4
led.config.max_length = max_output_length
led.config.min_length = 100
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3
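
To illustrate what `no_repeat_ngram_size = 3` asks for, here is a simplified sketch of n-gram blocking: at each step, any token that would complete a 3-gram already present in the generated sequence is banned. The function `banned_next_tokens` is a hypothetical helper written for this explanation and is not the library's implementation.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would repeat an n-gram already in `generated`."""
    # Not enough tokens yet to complete any n-gram of this size.
    if len(generated) + 1 < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the next potential n-gram.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = generated[i:i + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned

# The sequence ends in (5, 6), and (5, 6, 7) has already occurred,
# so generating 7 next would repeat a 3-gram.
print(banned_next_tokens([5, 6, 7, 5, 6], ngram_size=3))  # {7}
```

In beam search, banned tokens would have their scores set so that they cannot be selected, which discourages the looping phrases that summarization models otherwise tend to produce.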

Next, we also have to define the function that will compute the `"rouge"` score during evaluation.

Let's load the `"rouge"` metric from 🤗datasets and define the `compute_metrics(...)` function.

In [15]:
rouge = load_metric("rouge")


The `compute_metrics` function expects the generation output, called `pred.predictions`, as well as the gold labels, called `pred.label_ids`.

Both are decoded back to text, after which the rouge score can be computed.

In [16]:
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
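
For intuition about what the ROUGE-2 numbers measure, here is a simplified pure-Python sketch that computes bigram-overlap precision, recall, and F-measure for a single prediction/reference pair. It assumes lowercased whitespace tokenization and skips the stemming and bootstrap aggregation that the real metric may apply, so its values will not match the `rouge` metric exactly. The function names are illustrative.

```python
from collections import Counter

def bigrams(text):
    """Count the bigrams of a lowercased, whitespace-tokenized string."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge2(prediction, reference):
    """Simplified ROUGE-2: clipped bigram overlap between two strings."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # per-bigram minimum of the counts
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure

p, r, f = rouge2(
    "the model reads long documents",
    "the model summarizes long documents",
)
print(p, r, f)  # 0.5 0.5 0.5
```

Precision penalizes generated bigrams that are absent from the reference, while recall penalizes reference bigrams the model failed to produce.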

## Training

Now, we're ready to start training. Let's import the `Seq2SeqTrainer` and `Seq2SeqTrainingArguments`.

In [17]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

In contrast to the usual `Trainer`, the `Seq2SeqTrainer` makes it possible to use the `generate()` function during evaluation. This should be enabled with `predict_with_generate=True`. Because our GPU RAM is limited, we make use of gradient accumulation by setting `gradient_accumulation_steps=4` to have an effective `batch_size` of 2 * 4 = 8.
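
The effective batch size, and the `Total optimization steps = 45` reported in the training log, can be reproduced with a little arithmetic. This is a sketch under the assumption of a single GPU and that the number of batches per epoch is floor-divided by the accumulation steps; `training_schedule` is a hypothetical helper, and other library versions may count the final partial accumulation differently.

```python
import math

def training_schedule(num_examples, per_device_batch_size,
                      accumulation_steps, num_epochs):
    """Return (effective_batch_size, total_optimizer_steps) for one device."""
    effective_batch_size = per_device_batch_size * accumulation_steps
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    # Assumption: leftover batches at the end of an epoch do not count
    # as an additional optimizer step.
    updates_per_epoch = max(batches_per_epoch // accumulation_steps, 1)
    return effective_batch_size, updates_per_epoch * num_epochs

effective, total_steps = training_schedule(
    num_examples=78, per_device_batch_size=2,
    accumulation_steps=4, num_epochs=5,
)
print(effective, total_steps)  # 8 45
```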

Other training arguments are described in the [docs](https://huggingface.co/transformers/main_classes/trainer.html?highlight=trainingarguments#transformers.TrainingArguments).

In [18]:
# enable fp16 mixed-precision training
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True,
    output_dir="./",
    gradient_accumulation_steps=4,
    num_train_epochs=5, # since our dataset is so small, we may want to train for more epochs
)

The training arguments, along with the model, tokenizer, datasets and the `compute_metrics` function, can then be passed to the `Seq2SeqTrainer`.

In [19]:
trainer = Seq2SeqTrainer(
    model=led,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
)

Using cuda_amp half precision backend


Now we can start training.

In [20]:
trainer.train()

***** Running training *****
  Num examples = 78
  Num Epochs = 5
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 4
  Total optimization steps = 45
You're using a LEDTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge2 Precision | Rouge2 Recall | Rouge2 Fmeasure |
|---|---|---|---|---|---|
| 0 | 3.2835 | 2.90826 | 0.1272 | 0.1058 | 0.1138 |
| 1 | 2.7275 | 2.813853 | 0.1329 | 0.1231 | 0.1253 |
| 2 | 2.4544 | 2.752409 | 0.1534 | 0.1274 | 0.1379 |
| 3 | 2.2689 | 2.784334 | 0.1661 | 0.1394 | 0.1491 |
| 4 | 2.0345 | 2.757815 | 0.1767 | 0.1287 | 0.1457 |


***** Running Evaluation *****
  Num examples = 20
  Batch size = 2


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=45, training_loss=2.553788842095269, metrics={'train_runtime': 1390.349, 'train_samples_per_second': 0.281, 'train_steps_per_second': 0.032, 'total_flos': 4147514626277376.0, 'train_loss': 2.553788842095269, 'epoch': 4.92})

## Evaluation

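Let's take a look at how well the original model compares to the fine-tuned model.

First, let's reload the test dataset. The tokenized copy used for training no longer contains the raw text columns, so we rebuild the split from the JSON files with the same seed to get the same test papers.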

In [21]:
from datasets import load_dataset

def lists_to_single_str(dataset):
    dataset['section_titles'] = '\n'.join(dataset['section_titles'])
    dataset['section_texts'] = '\n'.join(dataset['section_texts'])

    return dataset

# load the dataset from cnn_papers.json and nlp_papers.json (TODO: include ml_papers.json)
dataset = load_dataset('json', data_files=['cnn_papers.json', 'nlp_papers.json'], split='train')
dataset = dataset.map(lists_to_single_str) # convert the paper sections into a format that can be processed by the model
seed = 42 # set the seed for reproducibility

dataset = dataset.train_test_split(test_size=0.2, seed=seed)



And we can define a function that will let us evaluate the model on the test dataset.

In [22]:
import torch
import copy
from datasets import load_metric

def evaluate_model(model, tokenizer):
    model.eval()

    def generate_answer(batch):
      inputs_dict = tokenizer(batch["section_texts"], padding="max_length", max_length=max_input_length, return_tensors="pt", truncation=True)
      input_ids = inputs_dict.input_ids.to("cuda")
      attention_mask = inputs_dict.attention_mask.to("cuda")
      global_attention_mask = torch.zeros_like(attention_mask)
      # put global attention on <s> token
      global_attention_mask[:, 0] = 1

      predicted_abstract_ids = model.generate(input_ids, attention_mask=attention_mask, global_attention_mask=global_attention_mask)
      batch["predicted_abstract"] = tokenizer.batch_decode(predicted_abstract_ids, skip_special_tokens=True)

      return batch

    result = dataset['test'].map(generate_answer, batched=True, batch_size=4)

    # load rouge
    rouge = load_metric("rouge")

    print("Rouge Results:", rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"], rouge_types=["rouge2"])["rouge2"].mid)

    return result

Here are the results from the original LED-Base-16384 model.

In [None]:
from transformers import AutoTokenizer

original_tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")
original_model = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384").to('cuda')

# set generate hyperparameters to match those of the fine-tuned LED model
original_model.config.num_beams = 4
original_model.config.max_length = max_output_length
original_model.config.min_length = 100
original_model.config.length_penalty = 2.0
original_model.config.early_stopping = True
original_model.config.no_repeat_ngram_size = 3

untrained_result = evaluate_model(original_model, original_tokenizer)

loading configuration file config.json from cache at /home/jupyter/.cache/huggingface/hub/models--allenai--led-base-16384/snapshots/25756ed025a94fdf2bc4987af86a58fd999047ec/config.json
Model config LEDConfig {
  "_name_or_path": "allenai/led-base-16384",
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "architectures": [
    "LEDForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "attention_window": [
    1024,
    1024,
    1024,
    1024,
    1024,
    1024
  ],
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 768,
  "decoder_attention_heads": 12,
  "decoder_ffn_dim": 3072,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "encoder_attention_heads": 12,
  "encoder_ffn_dim": 3072,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "ini

  0%|          | 0/5 [00:00<?, ?ba/s]



Rouge Results: Score(precision=0.041941008797352505, recall=0.08265459349468335, fmeasure=0.0552338543896531)


And here are the results from the fine-tuned model.

In [None]:
result = evaluate_model(led, tokenizer)

  0%|          | 0/5 [00:00<?, ?ba/s]



Rouge Results: Score(precision=0.1737773041727364, recall=0.13004354034720275, fmeasure=0.14547487791886393)


Let's compare an example abstract from the test dataset to the abstracts predicted by the fine-tuned and untrained models.

Here is the original abstract:

In [None]:
index = random.randrange(len(result))  # randrange excludes the upper bound, avoiding an IndexError
dataset['test'][index]['abstract']

'Due to the huge variety of 5G services, Network slicing is promising mechanism for dividing the physical network resources in to multiple logical network slices according to the requirements of each user. Highly accurate and fast traffic classification algorithm is required to ensure better Quality of Service (QoS) and effective network slicing. Fine-grained resource allocation can be realized by Software Defined Networking (SDN) with centralized controlling of network resources. However, the relevant research activities have concentrated on the deep learning systems which consume enormous computation and storage requirements of SDN controller that results in limitations of speed and accuracy of traffic classification mechanism. To fill this gap, this paper proposes Intelligent SDN Multi Spike Neural System (IMSNS) by implementing Moderately Multi-Spike Return Neural Networks (MMSRNN) controller with time based coding achieving remarkable reduction on energy consumption and accurate t

Here is the abstract predicted by the untrained model:

In [None]:
untrained_result[index]['predicted_abstract']

'With The exponential growth of the communication devices especially with rise of 5G services, these devices require reliability, low latency, high bandwidth and better QoS to achieve high service satisfaction rates. So it became necessary to adopt Network slicing and resource allocation mechanism [1]. Network slicing refers to selecting appropriate slices for the specific traffic type to provide better-performing and cost-efficient services. Identifying the traffic application types is an essential function to configure network slicing that facilitates fine-grained management and resource utilization [2], [3]. An efficient and fast classification algorithm is required to realize application awareness because of the different network resource requirements of different applications. The ⌊⌉ is a round function. F1- score: it is a valuable score that balances the precision and recall values, the larger F1–score means that the network is more efficient. It is important to note that it is n

And lastly, here is the abstract predicted by the fine-tuned model:

In [None]:
result[index]['predicted_abstract']

'Network slicing and resource allocation mechanism (SDN) are the enabling technologies of Network slicing and used by our intelligent model to make smart decision without human intervention. However, these devices require reliability, low latency, high bandwidth and better QoS to achieve high service satisfaction rates. This paper implements time-based coding, Moderately Multi-Spike Recurrent Neural Network (MMSRNN) as classifier controller to realize accurate load balancing for efficient utilization of network slices and slice failure conditions. A new training algorithm is proposed to update the weights and threshold of the proposed model to provide better-performing and cost-efficient services.'