<a href="https://colab.research.google.com/github/vasudevgupta7/bigbird/blob/nq/bigbird-pegasus-evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 🤗 20 lines of code to reproduce SOTA on Arxiv with **Longformer Encoder-Decoder (LED)** 🤗

At the time of writing this notebook, the best-performing model on long-range summarization is the Longformer Encoder-Decoder (LED) model by [Beltagy et al. (2020)](https://arxiv.org/pdf/2004.05150.pdf).

This notebook shows how to reproduce the official results in 20-some lines of code with 🤗Datasets and 🤗Transformers.

First, let's try to get a GPU with at least 15GB RAM.

In [1]:
# crash colab to get more RAM
# !kill -9 -1

To check that we have enough GPU memory, we can run the following command.
If the randomly allocated GPU is too small, the cell above can be run
to crash the notebook in the hope of getting a better GPU.

In [2]:
!nvidia-smi

Fri Apr 16 17:55:18 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           On   | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P0    68W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:00:05.0 Off |                    0 |
| N/A   45C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
|       
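The same check can be scripted instead of read off the table. The sketch below is one possible way to do it, assuming a threshold of roughly 15 GB as mentioned above. `parse_gpu_memory`, `has_enough_memory`, and `MIN_MEMORY_MIB` are illustrative names that are not part of the original notebook, and the sample string imitates what `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits` prints.

```python
# Illustrative helpers (not part of the original notebook): decide whether
# any allocated GPU meets the memory target.
MIN_MEMORY_MIB = 15_000  # assumed threshold: roughly 15 GB, expressed in MiB


def parse_gpu_memory(csv_text):
    """Parse 'name, memory_total_mib' lines into (name, mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, memory = line.rsplit(",", 1)
        gpus.append((name.strip(), int(memory.strip())))
    return gpus


def has_enough_memory(csv_text, min_mib=MIN_MEMORY_MIB):
    """Return True if at least one GPU reports min_mib MiB or more."""
    return any(mib >= min_mib for _, mib in parse_gpu_memory(csv_text))


# Sample text shaped like the query output for the two K80s shown above.
sample = "Tesla K80, 11441\nTesla K80, 11441"
print(has_enough_memory(sample))
```

In a live session the text could come from `subprocess.run([...], capture_output=True, text=True).stdout` with the query command above; the parsing step stays the same.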

Next, we install 🤗Transformers, 🤗Datasets, and `rouge_score`.



In [3]:
%%capture
# !pip install datasets==1.2.1
# !pip install git+https://github.com/vasudevgupta7/transformers@add_bigbird_pegasus
# !pip install rouge_score

We will evaluate **LED** on the **_arxiv_** dataset using the **Rouge-2** metric. Let's 
import the two loading functions `load_dataset` and `load_metric`.

In [4]:
from datasets import load_dataset, load_metric

Let's download the arxiv dataset ([click to see on 🤗Datasets Hub](https://huggingface.co/datasets/scientific_papers)). This can take a couple of minutes **☕** .

In [5]:
test_dataset = load_dataset("scientific_papers", "arxiv", split="test")

Reusing dataset scientific_papers (/home/vasu/.cache/huggingface/datasets/scientific_papers/arxiv/1.1.1/043e40ed208b8a66ee9e8228c86874946c99d2fc6155a1daee685795851cfdfc)


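Next, we import the model and tokenizer classes. The cells below use the BigBirdPegasus classes from the 🤗Transformers fork referenced in the install cell.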

In [6]:
from transformers import BigBirdPegasusForConditionalGeneration, BigBirdPegasusTokenizer

The official checkpoint "allenai/led-large-16384-arxiv" ([click to see on 🤗Model Hub](https://huggingface.co/allenai/led-large-16384-arxiv)) has already been fine-tuned on arxiv. In this notebook, we are just interested in evaluating the model. To save memory, let's convert the model into fp16 using the `.half()` method.
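
To see why `.half()` saves memory, a back-of-the-envelope estimate is enough: fp32 stores each parameter in 4 bytes and fp16 in 2, so the weights take half the space. The sketch below assumes a hypothetical parameter count of 500 million purely for illustration (it is not the checkpoint's actual size) and ignores activations and the generation cache.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_memory_gib(num_params, dtype):
    """Approximate memory taken by the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


num_params = 500_000_000  # hypothetical count, for illustration only
for dtype in ("fp32", "fp16"):
    print(f"{dtype}: {weight_memory_gib(num_params, dtype):.2f} GiB")
```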

In [7]:
tokenizer = BigBirdPegasusTokenizer.from_pretrained("vasudevgupta/bigbird-pegasus-large-arxiv")
model = BigBirdPegasusForConditionalGeneration.from_pretrained("vasudevgupta/bigbird-pegasus-large-arxiv").to("cuda")

In [8]:
model.device

device(type='cuda', index=0)

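Now we can write the evaluation function.
First, we tokenize each *article*, padding or truncating it to a fixed length (`max_length=4096` in the cell below; LED itself supports inputs of up to 16384 tokens).
For LED, we would next make sure that the very first token of the encoder attends to all other tokens by activating its `global_attention_mask`; this step is commented out in the cell below.
We then call `model.generate` to produce the predicted *abstract* of the *article*. Beam search with `num_beams=4` is the intended decoding setting; the cell below does not pass `num_beams`, so the checkpoint's default generation configuration applies. Finally, the predicted *abstract* tokens are decoded and the resulting predicted *abstract* string is saved in the batch.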


In [18]:
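import torch  # only needed if the global attention mask below is re-enabled

def generate_answer(batch):
  # When .map() is called without batched=True, batch["article"] is a single string.
  # Pad or truncate the article to exactly 4096 tokens.
  inputs_dict = tokenizer(batch["article"], padding="max_length", max_length=4096, return_tensors="pt", truncation=True)
  input_ids = inputs_dict.input_ids.to("cuda")
  attention_mask = inputs_dict.attention_mask.to("cuda")

  # global_attention_mask = torch.zeros_like(attention_mask)
  # # put global attention on <s> token
  # global_attention_mask[:, 0] = 1

  # Generate the summary token ids (at most 512 tokens).
  predicted_abstract_ids = model.generate(input_ids, attention_mask=attention_mask, max_length=512)
  # batch_decode returns a list of strings, one per generated sequence.
  batch["predicted_abstract"] = tokenizer.batch_decode(predicted_abstract_ids, skip_special_tokens=True)
  return batch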
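
To make the tokenizer arguments and the (commented-out) global attention mask more concrete, here is a toy sketch on plain Python lists. It assumes a pad id of 0 and a tiny `max_length`; `pad_or_truncate` and `first_token_global_mask` are illustrative helpers, not 🤗Transformers functions.

```python
PAD_ID = 0  # assumed pad token id, for illustration only


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Mimic padding="max_length" with truncation=True for one sequence."""
    ids = token_ids[:max_length]
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, attention_mask + [0] * padding


def first_token_global_mask(attention_mask):
    """All zeros except position 0, mirroring global_attention_mask[:, 0] = 1."""
    mask = [0] * len(attention_mask)
    mask[0] = 1
    return mask


ids, attn = pad_or_truncate([5, 6, 7], max_length=5)
print(ids)   # [5, 6, 7, 0, 0]
print(attn)  # [1, 1, 1, 0, 0]
print(first_token_global_mask(attn))  # [1, 0, 0, 0, 0]
```

The attention mask marks real tokens versus padding, while the global mask marks which real tokens may attend to every position; the two are independent.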

  

Because of the very large input size of up to 16K tokens, it would take over 12h to evaluate the whole test dataset in this notebook. We therefore evaluate only on a subset, cutting the 6000+ sample dataset down with 🤗Datasets' convenient `.select()` functionality. The reported run used the first 600 examples; the cell below keeps only the first 10 for a quick check.

In [19]:
dataset_small = test_dataset.select(range(10))

In [20]:
dataset_small

Dataset({
    features: ['article', 'abstract', 'section_names'],
    num_rows: 10
})

Alright, let's map each sample to the predicted *abstract*. For the 600-sample subset, this takes ca. 90 minutes if you're given a fast GPU.
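
The timing figures quoted in this notebook can be cross-checked with simple arithmetic. The sketch below takes the ca. 90 minutes for 600 samples at face value and projects it onto 6000 samples, used here as a lower bound for the "6000+" test set; the function names are illustrative.

```python
def seconds_per_sample(total_minutes, num_samples):
    """Average wall-clock seconds spent per sample."""
    return total_minutes * 60 / num_samples


def projected_hours(per_sample_seconds, num_samples):
    """Projected runtime in hours at a constant per-sample cost."""
    return per_sample_seconds * num_samples / 3600


per_sample = seconds_per_sample(total_minutes=90, num_samples=600)
print(per_sample)                         # 9.0
print(projected_hours(per_sample, 6000))  # 15.0
```

At about 9 seconds per sample, the full test set lands at roughly 15 hours, consistent with the "over 12h" estimate.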

In [21]:
result_small = dataset_small.map(generate_answer)


In [22]:
result_small["predicted_abstract"]

[['a a a this this a a a a this a a a for                  .......... the.... the the the the the the the the the the the the the the the the the the the the the the the the. the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the for the the the the the the the the the this this this this for for for the the the for for for the the the the the the the the for the the the the the the for for for for for for for for for for the for for for  for for the the for for for for for for for for for for for for for for for for for for for for for for for for for the the the the the this this the the the this for for   for        is      is is       the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the 

The only thing left to do is to evaluate our predictions now by making use of the *rouge* metric. Let's load the metric.

In [15]:
rouge = load_metric("rouge")

Now, we can compute the rouge score on all predicted *abstracts*.

In [1]:
rouge.compute(predictions=result_small["predicted_abstract"], references=result_small["abstract"], rouge_types=["rouge2"])["rouge2"].mid
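
For intuition about what the metric measures, here is a deliberately simplified ROUGE-2 F1 in pure Python: it counts overlapping bigrams between a prediction and a reference. It assumes lowercased whitespace tokenization and a single pair of texts; the `rouge_score` package used above tokenizes differently and aggregates over all samples, so its numbers will not match this sketch exactly. `bigrams` and `rouge2_f1` are illustrative names.

```python
from collections import Counter


def bigrams(text):
    """Count adjacent word pairs after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(prediction, reference):
    """Simplified ROUGE-2 F1 for a single prediction/reference pair."""
    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Three of the five bigrams on each side match.
print(rouge2_f1("the cat sat on the mat", "the cat lay on the mat"))
```

The scores quoted in this notebook are on a 0-100 scale, whereas this function returns a fraction between 0 and 1.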

For our 600 samples, we get a *Rouge-2* score of **19.39** 🔥🔥🔥. The [official paper](https://arxiv.org/pdf/2004.05150.pdf) reports a new state-of-the-art score of **19.62** on the whole test dataset, which aligns very well with our observation here${}^1$. LED significantly outperforms [PEGASUS](https://huggingface.co/transformers/model_doc/pegasus.html) and also slightly outperforms [BigBird](https://arxiv.org/abs/2007.14062), despite *PEGASUS* and *BigBird* making use of a "summarization-specific" pre-training objective.

The arxiv dataset contains many documents longer than 14K tokens, which cannot be handled well by *PEGASUS* and *BigBird*, as those models are limited to 1024 and 4096 tokens, respectively.
This shows the importance of LED's capability to process very long documents.

Thanks to Iz Beltagy for open-sourcing the model checkpoints and for useful tips.



---

${}^1$ The checkpoint was also evaluated on the complete dataset with exactly the same hyperparameters as this notebook, yielding a score of **19.43**, which is close enough to the official results to confirm the effectiveness of LED.
