<a href="https://colab.research.google.com/github/vasudevgupta7/bigbird/blob/main/notebooks/bigbird_pegasus_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Evaluate 🤗's BigBirdPegasus on Pubmed**

In this notebook, we evaluate BigBird on the long-document summarization task of **[pubmed](https://huggingface.co/datasets/scientific_papers)**. BigBird was introduced in [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by *Manzil Zaheer et al.* It achieves strong performance on long-document summarization by using an efficient block sparse attention mechanism. Please refer to this [blog post](https://huggingface.co/blog/big-bird) for an in-depth explanation of BigBird's block sparse attention.

Let's see which GPU we got. We need at least ~12 GB of GPU memory to run this notebook.

In [None]:
!nvidia-smi

Thu May  6 11:31:46 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
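The same check can be scripted. The snippet below is a minimal sketch, assuming `nvidia-smi` is on the `PATH` and supports the `--query-gpu` CSV query; the helper names `parse_memory_mib` and `gpu_memory_mib` are made up for this example. It returns an empty list when no GPU tooling is found.

```python
import shutil
import subprocess

# "~12 GB" from the text, expressed in MiB (nvidia-smi reports MiB).
MIN_MEMORY_MIB = 12 * 1024


def parse_memory_mib(smi_output):
    """Turn the CSV query output (one number per GPU, per line) into ints."""
    return [int(line) for line in smi_output.splitlines() if line.strip()]


def gpu_memory_mib():
    """Return the total memory of each visible GPU in MiB, or [] if none is found."""
    if shutil.which("nvidia-smi") is None:
        return []
    completed = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
    )
    if completed.returncode != 0:
        return []
    return parse_memory_mib(completed.stdout)


memory = gpu_memory_mib()
print("GPUs with enough memory:", [m for m in memory if m >= MIN_MEMORY_MIB])
```

The Tesla T4 shown above reports 15109 MiB, which is above the 12288 MiB threshold used here.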

Let's first install `transformers`, `datasets`, `rouge_score` and `sentencepiece`.

In [None]:
%%capture
!pip3 install datasets
!pip3 install rouge_score
!pip3 install git+https://github.com/vasudevgupta7/transformers@add_bigbird_pegasus
!pip3 install sentencepiece

As mentioned above, we will evaluate **BigBirdPegasus** on the **_pubmed_** dataset using the **Rouge-2** metric. For this, let's import the two loading functions `load_dataset` and `load_metric`. Further, we import the `BigBirdPegasusForConditionalGeneration` model class and the `AutoTokenizer` tokenizer class.

In [None]:
from datasets import load_dataset, load_metric
import torch
from transformers import BigBirdPegasusForConditionalGeneration, AutoTokenizer
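To make the metric concrete, here is a simplified sketch of what Rouge-2 measures: the overlap of word bigrams between a prediction and a reference. It is not the implementation behind `load_metric("rouge")`; it assumes plain whitespace tokenization and no stemming, and the function names are made up for this example.

```python
from collections import Counter


def bigrams(text):
    """Count the word bigrams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(prediction, reference):
    """Return (precision, recall, fmeasure) of the bigram overlap."""
    pred, ref = bigrams(prediction), bigrams(reference)
    # Counter intersection keeps the minimum count per bigram (clipped matches).
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


p, r, f = rouge2("the cat sat on the mat", "the cat lay on the mat")
print(p, r, f)
```

In this toy pair, 3 of the 5 bigrams on each side match, so precision, recall and F-measure all come out at 0.6.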

Let's define some variables which will be useful later on.

In [None]:
DATASET_NAME = "pubmed"
DEVICE = "cuda"
CACHE_DIR = DATASET_NAME
MODEL_ID = f"google/bigbird-pegasus-large-{DATASET_NAME}"

### PubMed dataset split

|               |Training | Validation | Test |
|---------------|---------|------------|------|
| Total samples | 119924  | 6633       | 6658 |

Let's download the `pubmed` dataset ([click to see on 🤗Datasets Hub](https://huggingface.co/datasets/scientific_papers)) & load only the samples from test split (i.e. only 6658 samples). This can take a couple of minutes **☕** .

In [None]:
test_dataset = load_dataset("scientific_papers", DATASET_NAME, split="test", cache_dir=CACHE_DIR)
test_dataset

Reusing dataset scientific_papers (pubmed/scientific_papers/pubmed/1.1.1/306757013fb6f37089b6a75469e6638a553bd9f009484938d8f75a4c5e84206f)


Dataset({
    features: ['article', 'abstract', 'section_names'],
    num_rows: 6658
})

The official checkpoint `google/bigbird-pegasus-large-pubmed` ([click to see on 🤗Model Hub](https://huggingface.co/google/bigbird-pegasus-large-pubmed)) has already been fine-tuned on pubmed. In this notebook, we are just interested in evaluating the model.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = BigBirdPegasusForConditionalGeneration.from_pretrained(MODEL_ID).to(DEVICE)
rouge = load_metric("rouge")

`BigBirdPegasus` makes use of *block sparse attention*. Let's verify the `config`'s attention type and the `block_size`.

In [None]:
model.config.attention_type, model.config.block_size

('block_sparse', 64)
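A rough back-of-the-envelope sketch of why block sparse attention matters at this sequence length. The window, global and random block counts below are assumptions based on the description in the blog post linked above; check `model.config.num_random_blocks` for the actual value. The estimate also ignores that global blocks attend to every block, so treat the ratio as an order of magnitude only.

```python
SEQ_LEN = 4096          # maximum input length used in this notebook
BLOCK_SIZE = 64         # model.config.block_size, as printed above
NUM_WINDOW_BLOCKS = 3   # assumption: a block attends to itself and both neighbours
NUM_GLOBAL_BLOCKS = 2   # assumption: first and last block act as global blocks
NUM_RANDOM_BLOCKS = 3   # assumption: see model.config.num_random_blocks

num_blocks = SEQ_LEN // BLOCK_SIZE
blocks_per_query = NUM_WINDOW_BLOCKS + NUM_GLOBAL_BLOCKS + NUM_RANDOM_BLOCKS

# Number of query-key score entries computed per attention head.
full_scores = SEQ_LEN * SEQ_LEN
sparse_scores = num_blocks * blocks_per_query * BLOCK_SIZE * BLOCK_SIZE

print("blocks:", num_blocks)
print("full attention scores:", full_scores)
print("block sparse scores (approx.):", sparse_scores)
print("reduction factor:", full_scores / sparse_scores)
```

Under these assumptions each query block looks at 8 of the 64 key blocks, so roughly an eighth of the full attention matrix is computed.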

### PubMed Statistics

|                 | Median | 90%-ile |
|-----------------|--------|---------|
| Articles Length | 2715   | 6101    |
| Summary Length  | 212    | 318     |

The median article length is around **2715** (as shown in the table above), while `BigBirdPegasus` can handle sequences of up to **4096** tokens. Quite a few samples are still longer than 4096 (the 90th percentile is 6101), so we simply truncate them.
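To get a feeling for what truncation costs, the sketch below computes the median length, the share of articles that get truncated, and the number of tokens dropped. The lengths are hypothetical values chosen for illustration, not PubMed measurements.

```python
import statistics

MAX_LENGTH = 4096  # maximum input length used in this notebook

# Hypothetical token counts, not taken from PubMed.
article_lengths = [900, 1500, 2100, 2700, 3000, 3600, 4200, 5000, 6100, 7800]

median = statistics.median(article_lengths)
truncated = [n for n in article_lengths if n > MAX_LENGTH]
share_truncated = len(truncated) / len(article_lengths)
tokens_dropped = sum(n - MAX_LENGTH for n in truncated)

print("median length:", median)
print("share of truncated articles:", share_truncated)
print("tokens dropped by truncation:", tokens_dropped)
```

For the real dataset, a 90th percentile of 6101 means that at least 10% of the articles lose part of their text.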

Now we can write the evaluation function for BigBirdPegasus.
First, we tokenize each *article* up to a maximum length of 4096 tokens.
We will make use of beam search (with `num_beams=5` & `length_penalty=0.8`) to generate the predicted *abstract* of the *article*. Finally, the predicted *abstract* tokens are decoded and the resulting predicted *abstract* string is saved in the batch.

In [None]:
def generate_answer(batch):
  # Tokenize the article, padding or truncating it to exactly 4096 tokens.
  inputs_dict = tokenizer(batch["article"], padding="max_length", max_length=4096, return_tensors="pt", truncation=True)
  # Move all input tensors to the GPU.
  inputs_dict = {k: inputs_dict[k].to(DEVICE) for k in inputs_dict}
  # Generate the abstract with beam search (at most 256 tokens).
  predicted_abstract_ids = model.generate(**inputs_dict, max_length=256, num_beams=5, length_penalty=0.8)
  # Decode the generated token ids back into a string.
  batch["predicted_abstract"] = tokenizer.decode(predicted_abstract_ids[0], skip_special_tokens=True)
  print(batch["predicted_abstract"])
  return batch
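The effect of `length_penalty` can be illustrated without a model. The sketch below assumes a finished beam is scored as its summed log-probability divided by `length ** length_penalty`, which is my understanding of how beam search in `transformers` ranks hypotheses; the two hypotheses and their numbers are invented for illustration.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score of a finished hypothesis (higher is better)."""
    return sum_logprobs / (length ** length_penalty)


# Hypothetical candidates: (summed log-probability, length in tokens).
hypotheses = {"short": (-10.0, 10), "long": (-18.0, 20)}


def best(length_penalty):
    """Name of the hypothesis that wins under the given length penalty."""
    return max(hypotheses, key=lambda name: beam_score(*hypotheses[name], length_penalty))


for penalty in (1.0, 0.8):
    print(penalty, "->", best(penalty))
```

With these numbers the longer hypothesis wins at `length_penalty=1.0`, while the value of `0.8` used in this notebook tips the choice to the shorter one.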

Let's take 2 samples and verify the predictions to be sure everything works as expected 🙂.

In [None]:
dataset_small = test_dataset.select(range(2))
result_small = dataset_small.map(generate_answer)

rouge.compute(predictions=result_small["predicted_abstract"], references=result_small["abstract"])

although anxiety is the most prominent and prevalent mood disorder in patients with parkinson's disease ( pd ), few studies have investigated the relationship between anxiety and cognition in pd.<n> the aim of this study was to examine the influence of anxiety on cognition in pd by comparing pd patients with and without anxiety.<n> seventeen pd patients with anxiety ( pda+ ) and thirty - three pd patients without anxiety ( pda ) were included in this study.<n> self - reported anxiety was assessed using the hospital anxiety and depression scale ( hads ).<n> groups were matched for age, disease duration, hoehn and yahr ( h&y ) stages, disease severity, and depression.<n> performance on neuropsychological tests of attention ( digit span forward and backward, trail making test part b, logical memory test, and boston naming test ) and executive function ( verbal fluency and attentional set - shifting ) were compared between groups.<n> pd patients with anxiety demonstrated worse performance 

{'rouge1': AggregateScore(low=Score(precision=0.3005181347150259, recall=0.4581005586592179, fmeasure=0.4113475177304965), mid=Score(precision=0.381897485436609, recall=0.5548929759588225, fmeasure=0.43601083751693365), high=Score(precision=0.4632768361581921, recall=0.651685393258427, fmeasure=0.4606741573033708)),
 'rouge2': AggregateScore(low=Score(precision=0.140625, recall=0.1853932584269663, fmeasure=0.1864406779661017), mid=Score(precision=0.1640625, recall=0.24610572012257406, fmeasure=0.18964891041162227), high=Score(precision=0.1875, recall=0.3068181818181818, fmeasure=0.19285714285714284)),
 'rougeL': AggregateScore(low=Score(precision=0.17098445595854922, recall=0.24581005586592178, fmeasure=0.23404255319148937), mid=Score(precision=0.20978601329000907, recall=0.3082982863599272, fmeasure=0.2406167822137222), high=Score(precision=0.24858757062146894, recall=0.3707865168539326, fmeasure=0.24719101123595505)),
 'rougeLsum': AggregateScore(low=Score(precision=0.227979274611398
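Each entry of the result holds a `low`, `mid` and `high` score, each with `precision`, `recall` and `fmeasure`. The sketch below shows how a single headline number can be read off, assuming the `mid` F-measure is the one to report. The two namedtuples are stand-ins that mirror the field names visible in the output above, so the example runs without the metric installed.

```python
from collections import namedtuple

# Stand-ins mirroring the field names visible in the printed output above.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])
AggregateScore = namedtuple("AggregateScore", ["low", "mid", "high"])

# Values copied from the 'rouge2' entry of the 2-sample run above.
rouge2 = AggregateScore(
    low=Score(0.140625, 0.1853932584269663, 0.1864406779661017),
    mid=Score(0.1640625, 0.24610572012257406, 0.18964891041162227),
    high=Score(0.1875, 0.3068181818181818, 0.19285714285714284),
)

# Report the mid F-measure on a 0-100 scale.
rouge2_f = round(rouge2.mid.fmeasure * 100, 1)
print(rouge2_f)
```

With the real result, the equivalent access would be `result_dict["rouge2"].mid.fmeasure`, where `result_dict` is a hypothetical name for the return value of `rouge.compute(...)`.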

Because each input is ~4K tokens long, evaluating the whole test dataset would take a long time. For the sake of this notebook, we'll only evaluate the first 600 examples, so we cut the 6658 test samples down to 600 using 🤗Datasets' convenient `.select()` method.

In [None]:
test_dataset = test_dataset.select(range(600))

Alright, let's map each sample to the predicted *abstract*. This will take ~ 2 hours if you're given a fast GPU.

In [None]:
result = test_dataset.map(generate_answer)

although anxiety is the most prominent and prevalent mood disorder in patients with parkinson's disease ( pd ), few studies have investigated the relationship between anxiety and cognition in pd.<n> the aim of this study was to examine the influence of anxiety on cognition in pd by comparing pd patients with and without anxiety.<n> seventeen pd patients with anxiety ( pda+ ) and thirty - three pd patients without anxiety ( pda ) were included in this study.<n> self - reported anxiety was assessed using the hospital anxiety and depression scale ( hads ).<n> groups were matched for age, disease duration, hoehn and yahr ( h&y ) stages, disease severity, and depression.<n> performance on neuropsychological tests of attention ( digit span forward and backward, trail making test part b, logical memory test, and boston naming test ) and executive function ( verbal fluency and attentional set - shifting ) were compared between groups.<n> pd patients with anxiety demonstrated worse performance 
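As a rough estimate of what the full test split would cost, the arithmetic below extrapolates from the ~2 hours quoted for 600 samples. It assumes the time per sample stays constant, which is only approximately true since it depends on the GPU you are given.

```python
EVALUATED_SAMPLES = 600
EVALUATED_HOURS = 2.0   # "~ 2 hours" from the text
TEST_SET_SIZE = 6658    # size of the full PubMed test split

seconds_per_sample = EVALUATED_HOURS * 3600 / EVALUATED_SAMPLES
full_hours = seconds_per_sample * TEST_SET_SIZE / 3600

print("seconds per sample:", seconds_per_sample)
print("estimated hours for the full test split:", round(full_hours, 1))
```

Under this assumption the full test split would take roughly 22 hours.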

The only thing left to do is to evaluate our predictions with the *rouge* metric, which we now compute over all predicted *abstracts*.

In [None]:
rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"])

For our 600 samples, we get a *Rouge-2* score of **19.6** 🔥🔥🔥.

In case you want to evaluate [`google/bigbird-pegasus-large-arxiv`](https://huggingface.co/google/bigbird-pegasus-large-arxiv) on the `arxiv` dataset from [`scientific_papers`](https://huggingface.co/datasets/scientific_papers), you can just change `DATASET_NAME` to `arxiv` in the cell that defines the variables near the top of this notebook and re-run the cells that follow it.
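The switch works because `MODEL_ID` is derived from `DATASET_NAME`. A minimal sketch of that derivation, with `model_id_for` as a helper name made up for this example:

```python
def model_id_for(dataset_name):
    """Build the checkpoint name the same way the notebook's MODEL_ID does."""
    return f"google/bigbird-pegasus-large-{dataset_name}"


for DATASET_NAME in ("pubmed", "arxiv"):
    print(DATASET_NAME, "->", model_id_for(DATASET_NAME))
```

Since `CACHE_DIR` is also set to `DATASET_NAME`, the two datasets end up in separate cache directories.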