## 🤗 Finetune **Longformer Encoder-Decoder (LED)** for Abstract Generation 🤗


---
This notebook is based on the training script provided with the [LED](https://huggingface.co/transformers/model_doc/led.html) model from the [Huggingface Transformers](https://huggingface.co/transformers/) library. The original script can be found [here](https://colab.research.google.com/drive/12LjJazBl7Gam0XBPy_y0CTOJZeZ34c2v?usp=sharing#scrollTo=6GRz0rksYb3h).


---
The *Longformer Encoder-Decoder (LED)* was recently added as an extension to [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, Arman Cohan.

In this notebook we will finetune *LED* for Summarization on [Pubmed](https://huggingface.co/datasets/viewer/?dataset=scientific_papers). *Pubmed* is a long-range summarization dataset, which makes it a good candidate for LED. LED will be finetuned with an input length of up to 16,384 tokens (the `max_input_length` set below) on a single GPU.

We will leverage 🤗`Seq2SeqTrainer`, gradient checkpointing and as usual 🤗`datasets`.

Training this model takes a reasonably powerful GPU. The original notebook recommends a GPU with at least 15GB of VRAM. Fortunately, we have access to cloud computing resources, so we are able to run experiments with this model.

In [1]:
!nvidia-smi

Sun Nov 27 01:00:17 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0    53W / 400W |      0MiB / 40960MiB |     27%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Install all of the packages needed for this project. We need to use the `-f https://download.pytorch.org/whl/torch_stable.html` flag to install the correct version of PyTorch for the GPU we are using.

In [2]:
!pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

[0m[31mERROR: Could not open requirements file: [Errno 2] No such file or directory: 'requirements.txt'[0m[31m
[0m

## Dataset

Let's start by loading and preprocessing the dataset. NOTE: we will have to change this slightly when we switch to predicting the introduction instead of the abstract.

In [3]:
from datasets import load_dataset, load_metric

In [4]:
def lists_to_single_str(dataset):
    dataset['section_titles'] = '\n'.join(dataset['section_titles'])
    dataset['section_texts'] = '\n'.join(dataset['section_texts'])

    return dataset

# load the dataset from cnn_papers.json and nlp_papers.json (TODO: include ml_papers.json)
dataset = load_dataset('json', data_files=['cnn_papers.json', 'nlp_papers.json'], split='train')
dataset = dataset.map(lists_to_single_str) # convert the paper sections into a format that can be processed by the model

Using custom data configuration default-77fc4b40608da6cf
Found cached dataset json (/home/jupyter/.cache/huggingface/datasets/json/default-77fc4b40608da6cf/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab)
Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/json/default-77fc4b40608da6cf/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab/cache-1ddb6fba6c7fa090.arrow
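
To make the effect of `lists_to_single_str` concrete, here is a minimal sketch on a hand-written record. The record contents are invented for illustration; only the field names come from the cell above. The helper `join_sections` is a hypothetical, non-mutating variant of the same idea.

```python
def join_sections(record):
    """Return a copy of `record` with its section lists joined by newlines."""
    joined = dict(record)
    joined["section_titles"] = "\n".join(record["section_titles"])
    joined["section_texts"] = "\n".join(record["section_texts"])
    return joined


# Hypothetical record, shaped like one entry of the JSON files.
toy_record = {
    "title": "A toy paper",
    "abstract": "A short abstract.",
    "section_titles": ["Introduction", "Conclusion"],
    "section_texts": ["We introduce things.", "We conclude things."],
}

joined = join_sections(toy_record)
print(joined["section_titles"])
print(joined["section_texts"])
```

After the mapping, each paper carries one string per field, which is the shape the tokenizer expects for a single example.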


Right now, the dataset should be split with an 80/20 train/test split. We may change this later to a train/val/test split.

In [5]:
seed = 42 # set the seed for reproducibility (and so that we can get the same test dataset when evaluating the model)

dataset = dataset.train_test_split(test_size=0.2, seed=seed)
print(len(dataset['train']))
print(len(dataset['test']))

Loading cached split indices for dataset at /home/jupyter/.cache/huggingface/datasets/json/default-77fc4b40608da6cf/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab/cache-9feec4ea58ef01d4.arrow and /home/jupyter/.cache/huggingface/datasets/json/default-77fc4b40608da6cf/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab/cache-188f380262a3d755.arrow


78
20
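
The two sizes printed above add up to 98 papers. The sketch below shows, in plain Python, how a seeded 80/20 split arrives at those numbers. It assumes the test share is rounded up, which is consistent with the printed sizes, and it uses `random.Random` for shuffling, so it will not reproduce the exact permutation chosen by 🤗`datasets`. The function name `split_indices` is hypothetical.

```python
import math
import random


def split_indices(n_examples, test_size, seed):
    """Shuffle indices with a fixed seed and cut off a test share."""
    n_test = math.ceil(n_examples * test_size)  # assumption: test share rounds up
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # same seed -> same order every run
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_examples=98, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))

# Re-running with the same seed yields the identical test set.
_, test_idx_again = split_indices(n_examples=98, test_size=0.2, seed=42)
print(test_idx == test_idx_again)
```

Fixing the seed is what lets the evaluation cells later in the notebook rebuild the same test set.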


Let's take a quick look at one of the papers.

In [8]:
import random

# randrange excludes the upper bound, so the index is always valid
paper = dataset['train'][random.randrange(len(dataset['train']))]
print(paper['title'], '\n')
print(paper['section_titles'], '\n')
print(paper['abstract'])

Analytical Techniques for Developing Argumentative Writing in STEM: A Pilot Study 

Introduction
Background
Methodology
Discussion
Conclusion 

Contribution: Demonstrates how to use experiential learning (EL) to improve argumentative writing. Presents the design and development of a natural language processing (NLP) application for aiding instructors in providing feedback on student essays. Discusses how EL combined with automated support provides an analytical approach to improving written-communication skills. Background: High-quality, timely, feedback is an effective way to improve students’ writing. However, large class sizes and limited instructor backgrounds often make formative feedback impossible. Recent trends, including lowering entry requirements, have added to these challenges. Assistive technologies for implementing inclusive education provide viable solutions. Research Questions: 1) How and why can EL be used to develop argumentative writing skills in university STEM stud

## Tokenizing

Now, we tokenize it using an `AutoTokenizer` from HuggingFace.

In [9]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")

Note that for the sake of this notebook, we finetune the "smaller" LED checkpoint ["allenai/led-base-16384"](https://huggingface.co/allenai/led-base-16384). Better performance can however be attained by finetuning ["allenai/led-large-16384"](https://huggingface.co/allenai/led-large-16384), at the cost of higher GPU memory requirements.

In [10]:
max_input_length = 16384
max_output_length = 512
batch_size = 2

Now, let's write down the input data processing function that will be used to map each data sample to the correct model format.
Here `section_texts` is our input data and `abstract` is the target data. The data samples are thus tokenized up to the respective maximum lengths of 16384 and 512 (`max_input_length` and `max_output_length`).

In addition to the usual `attention_mask`, LED can make use of an additional `global_attention_mask` defining which input tokens are attended globally and which are attended only locally, just as is the case for [Longformer](https://huggingface.co/transformers/model_doc/longformer.html). For more information on Longformer's self-attention, please take a look at the corresponding [docs](https://huggingface.co/transformers/model_doc/longformer.html#longformer-self-attention). For summarization, we follow the recommendations of the [paper](https://arxiv.org/abs/2004.05150) and use global attention only for the very first token. Finally, we make sure that no loss is computed on padded tokens by setting their label to `-100`.

In [11]:
def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["section_texts"],
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
    )
    outputs = tokenizer(
        batch["abstract"],
        padding="max_length",
        truncation=True,
        max_length=max_output_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # global attention on the first token of every sample, local attention elsewhere
    batch["global_attention_mask"] = [
        [1] + [0] * (len(ids) - 1) for ids in batch["input_ids"]
    ]
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch
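
To see what these fields look like, here is a minimal sketch on hand-written token ids with tiny maximum lengths. It is not the tokenizer's implementation: `PAD_ID`, the token ids, and both helper functions are assumptions made for illustration, and special-token handling during truncation is ignored.

```python
PAD_ID = 1           # assumption: placeholder pad id for this toy example
IGNORE_INDEX = -100  # label value for positions that should not contribute to the loss


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Cut to `max_length`, then right-pad with `pad_id`."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


def build_features(input_ids, label_ids, max_in, max_out):
    """Build the four fields the training cell keeps for one example."""
    n_real = min(len(input_ids), max_in)
    labels = pad_or_truncate(label_ids, max_out)
    return {
        "input_ids": pad_or_truncate(input_ids, max_in),
        # 1 for real tokens, 0 for padding
        "attention_mask": [1] * n_real + [0] * (max_in - n_real),
        # global attention only on the very first token
        "global_attention_mask": [1] + [0] * (max_in - 1),
        # padded label positions are replaced so the loss skips them
        "labels": [IGNORE_INDEX if t == PAD_ID else t for t in labels],
    }


features = build_features([0, 11, 12, 13, 2], [0, 21, 2], max_in=8, max_out=5)
for name, values in features.items():
    print(f"{name:>22}: {values}")
```

In the notebook itself the same shapes apply, only with `max_input_length` and `max_output_length` in place of the toy lengths.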

Now that we have a function that tokenizes the data, we can apply it to our dataset.

In [12]:
dataset = dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["title", "section_texts", "abstract", "section_titles"], # remove the columns that we don't need anymore since we've already tokenized them
)

Loading cached processed dataset at /home/jupyter/.cache/huggingface/datasets/json/default-77fc4b40608da6cf/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab/cache-f9d6ff017ebcc8c5.arrow


  0%|          | 0/10 [00:00<?, ?ba/s]

Finally, the datasets should be converted into the PyTorch format as follows.

In [13]:
dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)

## Model

Alright, we're almost ready to start training. Let's load the model via the `AutoModelForSeq2SeqLM` class.

In [14]:
from transformers import AutoModelForSeq2SeqLM

We've decided to stick to the smaller model `"allenai/led-base-16384"` for the sake of this notebook. In addition, we directly enable gradient checkpointing and disable the caching mechanism to save memory.

In [15]:
led = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384", gradient_checkpointing=True, use_cache=False)

During training, we want to evaluate the model on Rouge, the most common metric used in summarization, to make sure the model is indeed improving. For this, we set suitable generation parameters. We'll use beam search with a small beam of just 2 to save memory. Also, we force the model to generate at least 100 tokens, but no more than 512. In addition, some other generation parameters are set that have been found helpful for generation. For more information on those parameters, please take a look at the [docs](https://huggingface.co/transformers/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin.generate).

In [16]:
# set generate hyperparameters
led.config.num_beams = 2
led.config.max_length = max_output_length
led.config.min_length = 100
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3
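
Of these settings, `no_repeat_ngram_size = 3` is perhaps the least self-explanatory: no sequence of three tokens may appear twice in the generated text. The sketch below illustrates the idea on plain word lists. It is a simplified illustration, not the library's implementation, and `banned_next_tokens` is a hypothetical helper.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = generated[start:start + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


tokens = ["the", "model", "is", "good", "and", "the", "model"]
print(banned_next_tokens(tokens, ngram_size=3))
```

Here the text so far ends in "the model", and "the model is" has already been produced, so "is" is ruled out as the next token. This kind of constraint targets the looping repetition that summarization models often fall into.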

Next, we also have to define the function that will compute the `"rouge"` score during evaluation.

Let's load the `"rouge"` metric from 🤗datasets and define the `compute_metrics(...)` function.

In [17]:
rouge = load_metric("rouge")



The compute metrics function expects the generation output, called `pred.predictions` as well as the gold label, called `pred.label_ids`.

Both are decoded back to text (after the `-100` label positions are replaced by the pad token), and the Rouge score is then computed on the decoded strings.

In [18]:
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
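
For intuition about what the three reported numbers mean, here is a simplified sketch of a ROUGE-2 style score for a single prediction and reference. It assumes lowercased whitespace tokenization with no stemming, so its values will generally differ from those of the `"rouge"` metric used above, which also aggregates over all examples. The function names are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Count the bigrams of a lowercased, whitespace-tokenized string."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Return (precision, recall, f-measure) over shared bigrams."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped count of shared bigrams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f = rouge2("the cat sat on the mat", "the cat lay on the mat")
print(f"precision={p:.2f} recall={r:.2f} fmeasure={f:.2f}")
```

Precision is the share of predicted bigrams that also occur in the reference, recall is the share of reference bigrams that the prediction recovers, and the F-measure is their harmonic mean.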

## Training

Now, we're ready to start training. Let's import the `Seq2SeqTrainer` and `Seq2SeqTrainingArguments`.

In [19]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

In contrast to the usual `Trainer`, the `Seq2SeqTrainer` makes it possible to use the `generate()` function during evaluation. This should be enabled with `predict_with_generate=True`. Because our GPU RAM is limited, we make use of gradient accumulation by setting `gradient_accumulation_steps=4` to have an effective `batch_size` of 2 * 4 = 8.

The other training arguments are described in the [docs](https://huggingface.co/transformers/main_classes/trainer.html?highlight=trainingarguments#transformers.TrainingArguments).
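
The arithmetic behind gradient accumulation can be sketched as follows, using the sizes from this notebook (78 training examples, a per-device batch size of 2, 4 accumulation steps, 5 epochs). The rounding rule, where a partial accumulation window at the end of an epoch is not counted, is an assumption; it was chosen because it reproduces the `Total optimization steps = 45` line in the training log. `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(n_examples, per_device_batch, accumulation_steps, epochs):
    """Estimate optimizer updates, assuming a partial accumulation window is not counted."""
    batches_per_epoch = math.ceil(n_examples / per_device_batch)
    updates_per_epoch = max(batches_per_epoch // accumulation_steps, 1)
    return {
        "effective_batch_size": per_device_batch * accumulation_steps,
        "batches_per_epoch": batches_per_epoch,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * epochs,
    }


plan = optimizer_steps(n_examples=78, per_device_batch=2, accumulation_steps=4, epochs=5)
for name, value in plan.items():
    print(f"{name}: {value}")
```

With only 9 updates per epoch, each epoch changes the weights relatively few times, which is one reason a small dataset may benefit from more epochs.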

In [20]:
# enable fp16 mixed-precision training
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True,
    output_dir="./",
    gradient_accumulation_steps=4,
    num_train_epochs=5, # since our dataset is so small, we may want to train for more epochs
)

The training arguments, along with the model, tokenizer, datasets and the `compute_metrics` function, can then be passed to the `Seq2SeqTrainer`.

In [21]:
trainer = Seq2SeqTrainer(
    model=led,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
)

Using cuda_amp half precision backend


Now we can start training.

In [22]:
trainer.train()

***** Running training *****
  Num examples = 78
  Num Epochs = 5
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 4
  Total optimization steps = 45
You're using a LEDTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Rouge2 Precision | Rouge2 Recall | Rouge2 Fmeasure |
|---|---|---|---|---|---|
| 0 | 3.2836 | 2.908238 | 0.1247 | 0.0976 | 0.1074 |
| 1 | 2.7275 | 2.81384 | 0.1502 | 0.1145 | 0.1271 |
| 2 | 2.4544 | 2.75234 | 0.1627 | 0.1286 | 0.1421 |
| 3 | 2.269 | 2.784289 | 0.1586 | 0.1276 | 0.1404 |
| 4 | 2.0346 | 2.757705 | 0.1715 | 0.1224 | 0.1407 |


***** Running Evaluation *****
  Num examples = 20
  Batch size = 2


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=45, training_loss=2.5538204193115233, metrics={'train_runtime': 948.4658, 'train_samples_per_second': 0.411, 'train_steps_per_second': 0.047, 'total_flos': 4147514626277376.0, 'train_loss': 2.5538204193115233, 'epoch': 4.92})

## Evaluation

Let's take a look at how well the original model compares to the fine-tuned model.

First, let's load the test dataset.

In [23]:
from datasets import load_dataset

def lists_to_single_str(dataset):
    dataset['section_titles'] = '\n'.join(dataset['section_titles'])
    dataset['section_texts'] = '\n'.join(dataset['section_texts'])

    return dataset

# load the dataset from cnn_papers.json and nlp_papers.json (TODO: include ml_papers.json)
dataset = load_dataset('json', data_files=['cnn_papers.json', 'nlp_papers.json'], split='train')
dataset = dataset.map(lists_to_single_str) # convert the paper sections into a format that can be processed by the model
seed = 42 # set the seed for reproducibility

dataset = dataset.train_test_split(test_size=0.2, seed=seed)



And we can define a function that will let us evaluate the model on the test dataset.

In [24]:
import torch
import copy
from datasets import load_metric

def evaluate_model(model, tokenizer):
    model.eval()

    def generate_answer(batch):
        inputs_dict = tokenizer(
            batch["section_texts"],
            padding="max_length",
            max_length=max_input_length,
            return_tensors="pt",
            truncation=True,
        )
        input_ids = inputs_dict.input_ids.to("cuda")
        attention_mask = inputs_dict.attention_mask.to("cuda")
        global_attention_mask = torch.zeros_like(attention_mask)
        # put global attention on <s> token
        global_attention_mask[:, 0] = 1

        predicted_abstract_ids = model.generate(
            input_ids,
            attention_mask=attention_mask,
            global_attention_mask=global_attention_mask,
        )
        batch["predicted_abstract"] = tokenizer.batch_decode(predicted_abstract_ids, skip_special_tokens=True)

        return batch

    result = dataset['test'].map(generate_answer, batched=True, batch_size=4)

    # load rouge
    rouge = load_metric("rouge")

    print("Rouge Results:", rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"], rouge_types=["rouge2"])["rouge2"].mid)

    return result

Here are the results from the original LED-Base-16384 model.

In [25]:
from transformers import AutoTokenizer

original_tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")
original_model = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384").to('cuda')

# set the generation hyperparameters to match those of the fine-tuned model
original_model.config.num_beams = 2
original_model.config.max_length = max_output_length
original_model.config.min_length = 100
original_model.config.length_penalty = 2.0
original_model.config.early_stopping = True
original_model.config.no_repeat_ngram_size = 3

untrained_result = evaluate_model(original_model, original_tokenizer)

loading configuration file config.json from cache at /home/jupyter/.cache/huggingface/hub/models--allenai--led-base-16384/snapshots/25756ed025a94fdf2bc4987af86a58fd999047ec/config.json
Model config LEDConfig {
  "_name_or_path": "allenai/led-base-16384",
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "architectures": [
    "LEDForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "attention_window": [
    1024,
    1024,
    1024,
    1024,
    1024,
    1024
  ],
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 768,
  "decoder_attention_heads": 12,
  "decoder_ffn_dim": 3072,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "encoder_attention_heads": 12,
  "encoder_ffn_dim": 3072,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "ini

  0%|          | 0/5 [00:00<?, ?ba/s]



Rouge Results: Score(precision=0.033648956888410866, recall=0.06818665980326807, fmeasure=0.04457399718828964)


And here are the results from the fine-tuned model.

In [26]:
result = evaluate_model(led, tokenizer)

  0%|          | 0/5 [00:00<?, ?ba/s]



Rouge Results: Score(precision=0.16880125113348232, recall=0.12340367724501047, fmeasure=0.14010747252353423)
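
To put the two sets of results side by side, the short sketch below computes the absolute and relative change for each Rouge-2 component. The values are the printed scores rounded to four decimals. With only 20 test examples from a single run, these ratios are indicative rather than precise.

```python
# Scores copied from the two evaluation outputs, rounded to four decimals.
baseline = {"precision": 0.0336, "recall": 0.0682, "fmeasure": 0.0446}
finetuned = {"precision": 0.1688, "recall": 0.1234, "fmeasure": 0.1401}

gains = {name: finetuned[name] - baseline[name] for name in baseline}
ratios = {name: finetuned[name] / baseline[name] for name in baseline}

for name in baseline:
    print(
        f"{name:>9}: {baseline[name]:.4f} -> {finetuned[name]:.4f} "
        f"(+{gains[name]:.4f}, x{ratios[name]:.1f})"
    )
```

The F-measure roughly triples after fine-tuning, driven mostly by the increase in precision.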


Let's compare an example abstract from the test dataset to the abstracts predicted by the fine-tuned and untrained models.

Here is the original abstract:

In [27]:
index = random.randrange(len(result))  # randrange excludes the upper bound, so the index is always valid
dataset['test'][index]['abstract']

'Increased renewable energy penetration in isolated power systems has a clear impact on the quality of system frequency. The flywheel energy storage system (FESS) is a mature technology with a fast frequency response, high power density, high round-trip efficiency, low maintenance, no depth of discharge effects, and resilience to withstand continuous charge-discharge cycling without lifetime degradation. These FESS properties allows to effectively address the frequency quality problem. This study analyzes the contribution of a FESS to reducing frequency deviations in an isolated system that combines a diesel plant, wind farm, and pump-storage hydropower plant based on the El Hierro power system. This study approaches this analysis by comparing six different FESS governor control schemes (GCSs). Of these six GCSs, the nonlinear proportional variant (NLP\nV\n) is a singular contribution based on the NLP scheme previously developed by the same researchers. Different governor’s parameter s

Here is the abstract predicted by the untrained model:

In [28]:
untrained_result[index]['predicted_abstract']

'. 4) and ii) comparison between the different types of energy systems. The FESS and the GCS are different types. The GCS is different from the other types of power systems. It is important to note that the FESS is different than the other type of power system. The difference between FESS, GCS, and GCS can be seen in Figure 1. The differences between the two type of energy system are different. The different types are different from each other. For example, in the Fess, G CS is different in frequency from the G CS. The change in frequency between the FSS and G CS can be shown in Figure 2. The switch between the GTS and the other kinds of energy sources can be explained in Figure 3. Figure 3 shows the change in Frequency between the power system and the different type of electricity sources. Figure 2 shows the switch between different types and the power systems can be described in Figure 4. Figure 4 shows the difference between the switch from GCS to GCS. Figure 5 shows the transition 

And lastly, here is the abstract predicted by the fine-tuned model:

In [29]:
result[index]['predicted_abstract']

'Increasing renewable energy (RE) penetration in isolated power systems is a challenge that has been constantly addressed in recent decades. In this study, a realistic dynamic model of a small isolated power system with a high penetration of renewables is proposed. This model is based on a multi-objective optimization of two parameters: the frequency deviation and the amplitude of the frequency response. The experimental results show that the FESS can be used in a wide operating range of power systems, and that it can be adapted to other power systems. The FESS has been developed using a nonlinear proportional (NLP) GCS, where the frequency deviations are proportional to the input of the generator and the rotor inertial expression is used to determine the frequency of the generated power. The model also includes a parametric parameter, which is used for determining the effective frequency response of a FESS plant under different GCS settings. The results are presented in Table 1. The p