## 🤗 Finetune **Longformer Encoder-Decoder (LED)** on 8K Tokens 🤗

The *Longformer Encoder-Decoder (LED)* was recently added as an extension to [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) by Iz Beltagy, Matthew E. Peters, and Arman Cohan.

In this notebook we will finetune *LED* for Summarization on [Pubmed](https://huggingface.co/datasets/viewer/?dataset=scientific_papers). *Pubmed* is a long-range summarization dataset, which makes it a good candidate for LED. LED will be finetuned up to an input length of 8K tokens on a single GPU.

We will leverage 🤗`Seq2SeqTrainer`, gradient checkpointing and as usual 🤗`datasets`.

First, let's try to get a GPU with at least 15GB RAM.

In [None]:
# crash colab to get more RAM
# !kill -9 -1

To check that we have enough GPU memory, we can run the following command.
If the randomly allocated GPU is too small, the cell above can be run
to crash the notebook in the hope of getting a better GPU.

In [1]:
!nvidia-smi

Sat Mar 25 16:17:24 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 528.33       Driver Version: 528.33       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0  On |                  N/A |
|  0%   34C    P5    23W / 320W |   1303MiB / 10240MiB |     54%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
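
The same check can be done programmatically. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH`; the helper names `parse_memory_mib` and `gpu_memory_mib` and the 15,000 MiB threshold (derived from the 15GB target above) are illustrative choices, not part of the original notebook.

```python
import shutil
import subprocess
from typing import List


def parse_memory_mib(csv_text: str) -> List[int]:
    """Parse one integer MiB value per line, skipping anything non-numeric."""
    stripped = (line.strip() for line in csv_text.splitlines())
    return [int(value) for value in stripped if value.isdigit()]


def gpu_memory_mib() -> List[int]:
    """Return total memory per GPU in MiB, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_memory_mib(result.stdout) if result.returncode == 0 else []


REQUIRED_MIB = 15_000  # assumption: roughly the 15GB target mentioned above

for index, total in enumerate(gpu_memory_mib()):
    status = "ok" if total >= REQUIRED_MIB else "too small"
    print(f"GPU {index}: {total} MiB ({status})")
```

A GPU reporting 10240 MiB, like the one in the output above, would be flagged as too small by this threshold.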

Next, we install 🤗Transformers, 🤗Datasets, and `rouge_score`.



In [2]:
import torch

torch.cuda.is_available()

True

In [3]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


Let's start by loading and preprocessing the dataset.



In [4]:
import datasets
import random
import pandas as pd

from datasets import load_dataset, load_metric
from functools import partial
from IPython.display import display, HTML
from transformers import AutoTokenizer
from transformers import AutoModelForSeq2SeqLM
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments



In [5]:
# Load the metric scoring object early
rouge = load_metric("rouge")



Next, we download the Pubmed train and validation splits ([click to see on 🤗Datasets Hub](https://huggingface.co/datasets/scientific_papers)). This can take a couple of minutes **☕**.

In [6]:
from typing import Tuple, Optional
def get_dataset(data: str, host: Optional[str] = None) -> Tuple:
  """Getting the training and validation data for our models"""

  train_dataset = load_dataset(data, host, split="train")
  val_dataset = load_dataset(data, host, split="validation")

  return (train_dataset, val_dataset)

It's always a good idea to take a look at some data samples. Let's do that here.

In [7]:
led_train, led_val = get_dataset(data="scientific_papers", host="pubmed")
print(led_train)

Found cached dataset multi_x_science_sum (C:/Users/milan/.cache/huggingface/datasets/multi_x_science_sum/default/1.1.0/2876ec0401f8f5c5acf7f4857dbc8d6229a390ab428321ab848f03f14b7f9729)
Found cached dataset multi_x_science_sum (C:/Users/milan/.cache/huggingface/datasets/multi_x_science_sum/default/1.1.0/2876ec0401f8f5c5acf7f4857dbc8d6229a390ab428321ab848f03f14b7f9729)


In [9]:
def show_random_elements(dataset, num_examples=4):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw `num_examples` distinct indices uniformly at random
    picks = random.sample(range(len(dataset)), num_examples)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [20]:
# Non-consecutive added token '<doc-sep>' found. Should have index 50266 but has index 50265 in saved vocabulary (Centrum).

def get_tokenizer(host_tokenizer: str):
  """return the tokenizer for LLM training"""

  return AutoTokenizer.from_pretrained(host_tokenizer)


led_tokenizer = get_tokenizer("allenai/led-base-16384")

AssertionError: Non-consecutive added token '<doc-sep>' found. Should have index 50266 but has index 50265 in saved vocabulary.

Note that for the sake of this notebook, we finetune the "smaller" LED checkpoint ["allenai/led-base-16384"](https://huggingface.co/allenai/led-base-16384). Better performance can, however, be attained by finetuning ["allenai/led-large-16384"](https://huggingface.co/allenai/led-large-16384) at the cost of higher GPU memory requirements.

Now, let's write down the input data processing function that will be used to map each data sample to the correct model format.
Here, `article` is our input data and `abstract` is the target data. The data samples are thus tokenized up to the respective maximum lengths of 8192 and 512.

In addition to the usual `attention_mask`, LED can make use of an additional `global_attention_mask` defining which input tokens are attended globally and which are attended only locally, just as is the case for [Longformer](https://huggingface.co/transformers/model_doc/longformer.html). For more information on Longformer's self-attention, please take a look at the corresponding [docs](https://huggingface.co/transformers/model_doc/longformer.html#longformer-self-attention). For summarization, we follow the recommendations of the [paper](https://arxiv.org/abs/2004.05150) and use global attention only for the very first token. Finally, we make sure that no loss is computed on padded tokens by setting their index to `-100`.

In [10]:
# Setting up input/output parameters
max_input_length = 8192
max_output_length = 512
batch_size = 2

def process_data_to_model_inputs(batch, model_tokenizer):
    # tokenize the inputs and labels
    inputs = model_tokenizer(
        batch["article"],
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
    )
    outputs = model_tokenizer(
        batch["abstract"],
        padding="max_length",
        truncation=True,
        max_length=max_output_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # global attention on the first token only; each row is built as its own
    # list, so no two samples share the same underlying object
    batch["global_attention_mask"] = [
        [1] + [0] * (len(ids) - 1) for ids in batch["input_ids"]
    ]

    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == model_tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch
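
To make the two masking steps concrete, here is a small pure-Python sketch on made-up token ids. The pad id `1` and the helper names `first_token_global_mask` and `mask_padding` are assumptions for illustration only; in the notebook the pad id comes from the tokenizer.

```python
from typing import List

PAD_ID = 1            # assumption: stand-in for the tokenizer's pad_token_id
IGNORE_INDEX = -100   # label value on which no loss is computed


def first_token_global_mask(input_ids: List[List[int]]) -> List[List[int]]:
    """1 for the first token of each sample (global attention), 0 elsewhere.

    Every row is a fresh list, so editing one row cannot affect another
    (multiplying a list of lists would repeat references to one inner list).
    """
    return [[1] + [0] * (len(ids) - 1) for ids in input_ids]


def mask_padding(labels: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Replace padding ids in the labels with IGNORE_INDEX."""
    return [[IGNORE_INDEX if tok == pad_id else tok for tok in row] for row in labels]


toy_inputs = [[0, 42, 7, 2], [0, 13, 2, PAD_ID]]
toy_labels = [[0, 99, 2, PAD_ID], [0, 2, PAD_ID, PAD_ID]]

print(first_token_global_mask(toy_inputs))  # [[1, 0, 0, 0], [1, 0, 0, 0]]
print(mask_padding(toy_labels))             # [[0, 99, 2, -100], [0, 2, -100, -100]]
```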

For the sake of this notebook, we will reduce the training and validation data
to a dummy dataset of sizes 200 and 40 respectively. For a full training run, omit the `train_range` and `validation_range` arguments.

Great, having defined the mapping function, let's preprocess the training and validation data.

In [11]:
def prep_and_convert_data(train: datasets.arrow_dataset.Dataset, validation: datasets.arrow_dataset.Dataset, train_range: Optional[int] = None, validation_range: Optional[int] = None) -> Tuple:
  """Processing the training and validation dataset to be trained"""

  processed_model_data = partial(process_data_to_model_inputs, model_tokenizer=led_tokenizer)

  if train_range and validation_range:
    train_dataset = train.select(range(train_range))
    val_dataset = validation.select(range(validation_range))
  else:
    train_dataset = train
    val_dataset = validation

  train_dataset = train_dataset.map(
      processed_model_data,
      batched=True,
      batch_size=batch_size,
      remove_columns=["article", "abstract", "section_names"],
  )
  val_dataset = val_dataset.map(
    processed_model_data,
    batched=True,
    batch_size=batch_size,
    remove_columns=["article", "abstract", "section_names"],
  )
  train_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
  )
  val_dataset.set_format(
      type="torch",
      columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
  )

  return (train_dataset, val_dataset)


train_dataset, val_dataset = prep_and_convert_data(train=led_train, validation=led_val, train_range=200, validation_range=40)

Loading cached processed dataset at C:\Users\milan\.cache\huggingface\datasets\scientific_papers\pubmed\1.1.1\306757013fb6f37089b6a75469e6638a553bd9f009484938d8f75a4c5e84206f\cache-efe70b533a7ae5c8.arrow
Loading cached processed dataset at C:\Users\milan\.cache\huggingface\datasets\scientific_papers\pubmed\1.1.1\306757013fb6f37089b6a75469e6638a553bd9f009484938d8f75a4c5e84206f\cache-64007b77b81ec3a8.arrow


We've decided to stick to the smaller model `"allenai/led-base-16384"` for the sake of this notebook. In addition, we directly enable gradient checkpointing and disable the caching mechanism to save memory.

In [12]:
print(val_dataset, val_dataset['labels'])

Dataset({
    features: ['input_ids', 'attention_mask', 'global_attention_mask', 'labels'],
    num_rows: 40
}) tensor([[    0,  3618,     8,  ...,  -100,  -100,  -100],
        [    0,  3618,   627,  ...,  -100,  -100,  -100],
        [    0,  1437, 50118,  ...,  -100,  -100,  -100],
        ...,
        [    0, 31892,   741,  ...,  -100,  -100,  -100],
        [    0, 14701,   876,  ...,  -100,  -100,  -100],
        [    0, 37788, 10395,  ...,  -100,  -100,  -100]])


In [13]:
def get_model(model_host: str):
  """Get either the LED or Centrum model"""

  return AutoModelForSeq2SeqLM.from_pretrained(model_host, gradient_checkpointing=True, use_cache=False)

led = get_model(model_host="allenai/led-base-16384")

During training, we want to evaluate the model on Rouge, the most common metric used in summarization, to make sure the model is indeed improving during training. For this, we set suitable generation parameters. We'll use beam search with a small beam of just 2 to save memory. Also, we force the model to generate at least 100 tokens, but no more than 512. In addition, some other generation parameters are set that have been found helpful for generation. For more information on those parameters, please take a look at the [docs](https://huggingface.co/transformers/main_classes/model.html?highlight=generate#transformers.generation_utils.GenerationMixin.generate).

In [14]:
# set generate hyperparameters
led.config.num_beams = 2
led.config.max_length = 512
led.config.min_length = 100
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3
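
As a rough illustration of what `no_repeat_ngram_size = 3` asks for, the sketch below computes which next tokens would complete a trigram that already occurs in the generated sequence. It is a simplified stand-alone illustration, not the library's implementation, and `banned_next_tokens` is a hypothetical helper name.

```python
from typing import List, Set


def banned_next_tokens(generated: List[int], ngram_size: int) -> Set[int]:
    """Tokens that would repeat an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) + 1 < ngram_size:
        return set()
    prefix_len = ngram_size - 1
    # The last (n - 1) tokens are the prefix of the n-gram about to be completed.
    current_prefix = tuple(generated[len(generated) - prefix_len:])
    banned: Set[int] = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = generated[start:start + ngram_size]
        if tuple(ngram[:prefix_len]) == current_prefix:
            banned.add(ngram[-1])
    return banned


# The sequence ends in (5, 6), and the trigram (5, 6, 7) was already generated,
# so emitting 7 next would repeat it.
print(banned_next_tokens([5, 6, 7, 8, 5, 6], ngram_size=3))  # {7}
```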

In [15]:
# Compute metrics for rouge
def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = led_tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = led_tokenizer.pad_token_id
    label_str = led_tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
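
For intuition about what the three reported numbers mean, ROUGE-2 compares the bigrams of a prediction with those of its reference. The sketch below is a simplified version under stated assumptions (lowercasing, whitespace tokenization, no stemming, a single prediction/reference pair); it is not expected to reproduce the scores of the `rouge` metric loaded above, and `rouge2_sketch` is a hypothetical name.

```python
from collections import Counter
from typing import Dict


def bigrams(text: str) -> Counter:
    """Count adjacent token pairs after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_sketch(prediction: str, reference: str) -> Dict[str, float]:
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped bigram matches
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total > 0 else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


scores = rouge2_sketch(
    prediction="the model summarizes long papers",
    reference="the model summarizes very long papers",
)
print(scores)
```

Here 3 of the 4 predicted bigrams appear in the reference (precision 0.75), and 3 of the 5 reference bigrams are recovered (recall 0.6).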

Now, we're ready to start training. Let's import the `Seq2SeqTrainer` and `Seq2SeqTrainingArguments`.

In contrast to the usual `Trainer`, the `Seq2SeqTrainer` makes it possible to use the `generate()` function during evaluation. This should be enabled with `predict_with_generate=True`. Because our GPU RAM is limited, we make use of gradient accumulation by setting `gradient_accumulation_steps=4` to have an effective `batch_size` of 2 * 4 = 8.

Other training arguments are described in the [docs](https://huggingface.co/transformers/main_classes/trainer.html?highlight=trainingarguments#transformers.TrainingArguments).

In [16]:
# enable fp16 mixed-precision training

training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True,
    output_dir="./",
    logging_steps=5,
    eval_steps=10,
    save_steps=10,
    save_total_limit=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
)
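
The number of optimizer steps follows from these settings. A minimal sketch of the arithmetic, assuming the 200-sample training subset selected earlier, a single GPU, and that micro-batches which do not fill a complete accumulation window are not counted as an extra step:

```python
import math

num_train_samples = 200            # size of the training subset used in this notebook
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 1

# Samples contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Forward/backward passes per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(num_train_samples / per_device_train_batch_size)
optimizer_steps = (batches_per_epoch // gradient_accumulation_steps) * num_train_epochs

print(effective_batch_size)  # 8
print(optimizer_steps)       # 25
```

This agrees with the total of 25 steps shown in the progress bar once training runs.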

The training arguments, along with the model, tokenizer, datasets and the `compute_metrics` function can then be passed to the `Seq2SeqTrainer`

In [17]:
trainer = Seq2SeqTrainer(
    model=led,
    tokenizer=led_tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

and we can start training. This will take roughly 35 minutes.

In [18]:
trainer.train()

 20%|██        | 5/25 [00:46<03:06,  9.32s/it]

{'loss': 3.3828, 'learning_rate': 4e-05, 'epoch': 0.2}




{'loss': 2.9623, 'learning_rate': 3e-05, 'epoch': 0.4}


INFO:absl:Using default tokenizer.
                                               
 40%|████      | 10/25 [19:46<02:37, 10.47s/it]

{'eval_loss': 2.699963331222534, 'eval_rouge2_precision': 0.0798, 'eval_rouge2_recall': 0.1698, 'eval_rouge2_fmeasure': 0.1029, 'eval_runtime': 1080.5846, 'eval_samples_per_second': 0.037, 'epoch': 0.4}


OutOfMemoryError: CUDA out of memory. Tried to allocate 770.00 MiB (GPU 0; 10.00 GiB total capacity; 5.15 GiB already allocated; 0 bytes free; 8.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
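
The error message itself points at one mitigation: the `max_split_size_mb` option of `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch, assuming the variable is set before PyTorch makes its first CUDA allocation (simplest: at the very top of the notebook, before `import torch`). The value 128 is an arbitrary starting point, not a setting from the original notebook, and this only addresses fragmentation; it does not reduce the memory the model needs.

```python
import os

# Assumption: 128 MiB is an illustrative value; tune it for your GPU.
# setdefault keeps any value that was already configured in the environment.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

If the error persists on a 10 GiB card, lowering `max_input_length` reduces the actual memory requirement, at the cost of truncating more of each article.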

This completes the fine-tuning tutorial for LED. This training script, with some small changes, was used to train the [`patrickvonplaten/led-large-16384-pubmed`](https://huggingface.co/patrickvonplaten/led-large-16384-pubmed) checkpoint on a single GPU for about 3 days. Evaluating `patrickvonplaten/led-large-16384-pubmed` on Pubmed's test data gives a Rouge-2 score of **19.33**, which is around 1 Rouge-2 point below SOTA performance on Pubmed.

In the Appendix below, the condensed training and evaluation scripts that were used locally to finetune `patrickvonplaten/led-large-16384-pubmed` are attached.