## **Evaluate 🤗's BigBirdPegasus on Pubmed**

In this notebook, we evaluate BigBird on the long-range summarization task of **[pubmed](https://huggingface.co/datasets/scientific_papers)**. BigBird was introduced in [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) by *Manzil Zaheer et al.* It has achieved outstanding performance on long document summarization using an efficient block sparse attention mechanism. Please refer to this [blog post](https://huggingface.co/blog/big-bird) for an in-depth explanation of BigBird's block sparse attention.

Let's see what GPU we got. We need at least ~12 GB GPU memory to be able to run this notebook.

In [None]:
!nvidia-smi

Mon Nov 29 22:34:04 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   70C    P8    35W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Let's first install `transformers`, `datasets`, `rouge_score` and `sentencepiece`.

In [None]:
%%capture
!pip3 install datasets
!pip3 install rouge_score
!pip3 install git+https://github.com/huggingface/transformers
!pip3 install sentencepiece

As mentioned above, we will evaluate **BigBirdPegasus** on the **_pubmed_** dataset using the **Rouge-2** metric. For this, let's import the two loading functions `load_dataset` and `load_metric`. Further, we import `BigBirdPegasusForConditionalGeneration` and `AutoTokenizer`.

In [None]:
from datasets import load_dataset, load_metric
import torch
from transformers import BigBirdPegasusForConditionalGeneration, AutoTokenizer

Let's define some variables which will be useful later on.

In [None]:
DATASET_NAME = "pubmed"
DEVICE = "cuda"
CACHE_DIR = DATASET_NAME
MODEL_ID = f"google/bigbird-pegasus-large-{DATASET_NAME}"

To begin with, let's take a look at the PubMed dataset ([click to see on 🤗Datasets Hub](https://huggingface.co/datasets/scientific_papers)).
PubMed consists of scientific papers in the field of medicine. The dataset splits each paper into the *article* and the *abstract*, where the article consists of the whole paper minus the abstract. Thus, the input to be summarized is the article and the gold label is the abstract.

The following table summarizes the size of the *train*, *validation*, and *test* split of the dataset.

|               |Training | Validation | Test |
|---------------|---------|------------|------|
| Total samples | 119924  | 6633       | 6658 |

In this notebook, we are only interested in evaluating *BigBird*. To do so, let's download the *test* split of the `pubmed` dataset. This can take a couple of minutes **☕** .

In [None]:
test_dataset = load_dataset("scientific_papers", DATASET_NAME, split="test", ignore_verifications=True, cache_dir=CACHE_DIR)
test_dataset

Downloading and preparing dataset scientific_papers/pubmed (download: 4.20 GiB, generated: 2.33 GiB, post-processed: Unknown size, total: 6.53 GiB) to pubmed/scientific_papers/pubmed/1.1.1/306757013fb6f37089b6a75469e6638a553bd9f009484938d8f75a4c5e84206f...


Dataset scientific_papers downloaded and prepared to pubmed/scientific_papers/pubmed/1.1.1/306757013fb6f37089b6a75469e6638a553bd9f009484938d8f75a4c5e84206f. Subsequent calls will reuse this data.


Dataset({
    features: ['article', 'abstract', 'section_names'],
    num_rows: 6658
})

The official checkpoint `google/bigbird-pegasus-large-pubmed` ([click to see on 🤗Model Hub](https://huggingface.co/google/bigbird-pegasus-large-pubmed)) has already been fine-tuned on pubmed, so we can simply load the weights and run the model in inference mode.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = BigBirdPegasusForConditionalGeneration.from_pretrained(MODEL_ID).to(DEVICE)
rouge = load_metric("rouge")

`BigBirdPegasus` makes use of *block sparse attention*. Let's verify the `config`'s attention type and the `block_size`.

In [None]:
model.config.attention_type, model.config.block_size

('block_sparse', 64)
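
To get a feel for what these two values mean at the 4096-token input length used in this notebook, here is a rough back-of-the-envelope sketch. The helper name `block_sparse_summary` is made up for illustration. The per-block budget (3 sliding-window blocks, 2 global blocks, and an assumed 3 random blocks) follows the simplified picture in the blog post linked above, not the exact implementation.

```python
def block_sparse_summary(seq_len, block_size, num_random_blocks=3):
    """Rough count of key blocks one query block attends to (illustrative only)."""
    if seq_len % block_size != 0:
        raise ValueError("seq_len is expected to be a multiple of block_size")
    num_blocks = seq_len // block_size
    # Simplified budget (assumption): 3 sliding-window + 2 global + random blocks.
    attended = min(num_blocks, 3 + 2 + num_random_blocks)
    return {
        "num_blocks": num_blocks,
        "attended_blocks": attended,
        "fraction_of_full": attended / num_blocks,
    }


summary = block_sparse_summary(seq_len=4096, block_size=64)
print(summary)
```

Under these assumptions, a typical query block looks at only a small, fixed number of key blocks, which is why the cost grows roughly linearly with the sequence length instead of quadratically.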

Next, we will take a look at the length distribution of the dataset. The following table shows the *median* and the 90% quantile of the article and abstract (summary) lengths.

|                 | Median | 90%-ile |
|-----------------|--------|---------|
| Articles Length | 2715   | 6101    |
| Summary Length  | 212    | 318     |

`BigBirdPegasus` can handle sequences up to a length of **4096**, which is significantly higher than the median input length of **2715**. However, many input samples are longer than **4096** and consequently need to be truncated.
The summaries have a median length of **212**, with 90% being shorter than **318**. Given this data, 256 seems to be a reasonable choice as the model's maximum generation length.
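
Statistics like the ones in the table can be computed with a small helper such as the one sketched below. The function name `length_summary`, the nearest-rank percentile definition, and the toy length values are assumptions made for illustration. This is not necessarily how the numbers in the table were produced.

```python
from statistics import median


def length_summary(lengths, max_input_length=4096):
    """Median, 90th percentile (nearest rank) and share of inputs above the limit."""
    ordered = sorted(lengths)
    n = len(ordered)
    rank = (9 * n + 9) // 10  # ceil(0.9 * n) using integer arithmetic
    too_long = sum(1 for length in ordered if length > max_input_length)
    return {
        "median": median(ordered),
        "p90": ordered[rank - 1],
        "truncated_fraction": too_long / n,
    }


# Toy token counts standing in for real tokenizer output.
toy_lengths = [1200, 2500, 2715, 3000, 3900, 4100, 5000, 6101, 7000, 800]
stats = length_summary(toy_lengths)
print(stats)
```

With the objects from this notebook, the lengths could be gathered with something like `[len(tokenizer(a)["input_ids"]) for a in test_dataset["article"]]`, which is slow on the full split.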

Now we can write the evaluation function for BigBirdPegasus.
First, we tokenize each *article* up to a maximum length of 4096 tokens.
We will make use of beam search (with `num_beams=5` & `length_penalty=0.8`) to generate the predicted *abstract* of the *article*. Finally, the predicted *abstract* tokens are decoded and the resulting predicted *abstract* string is saved in the batch.

In [None]:
def generate_answer(batch):
  inputs_dict = tokenizer(batch["article"], padding="max_length", max_length=4096, return_tensors="pt", truncation=True)
  inputs_dict = {k: inputs_dict[k].to(DEVICE) for k in inputs_dict}
  predicted_abstract_ids = model.generate(**inputs_dict, max_length=256, num_beams=5, length_penalty=0.8)
  batch["predicted_abstract"] = tokenizer.decode(predicted_abstract_ids[0], skip_special_tokens=True)
  print(batch["predicted_abstract"])
  return batch
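
The combination `padding="max_length"` and `truncation=True` means every article ends up as exactly 4096 token ids. The plain-Python stand-in below illustrates the idea on tiny lists. `truncate_and_pad` and the pad id of 0 are assumptions for illustration, and the real tokenizer additionally takes care of special tokens, which this sketch ignores.

```python
def truncate_and_pad(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]  # truncation: drop everything past the limit
    padding = [pad_id] * (max_length - len(kept))  # padding: fill up short inputs
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return kept + padding, attention_mask


short_ids, short_mask = truncate_and_pad([5, 6, 7], max_length=5)
long_ids, long_mask = truncate_and_pad(list(range(1, 9)), max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions of short articles.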

Let's take 100 samples and verify the predictions to be sure everything works as expected 🙂.

In [None]:
dataset_small = test_dataset.select(range(100))
result = dataset_small.map(generate_answer)

rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"])

although anxiety is the most prominent and prevalent mood disorder in patients with parkinson's disease ( pd ), few studies have investigated the relationship between anxiety and cognition in pd.<n> the aim of this study was to examine the influence of anxiety on cognition in pd by comparing pd patients with and without anxiety.<n> seventeen pd patients with anxiety ( pda+ ) and thirty - three pd patients without anxiety ( pda ) were included in this study.<n> self - reported anxiety was assessed using the hospital anxiety and depression scale ( hads ).<n> groups were matched for age, disease duration, hoehn and yahr ( h&y ) stages, disease severity, and depression.<n> performance on neuropsychological tests of attention ( digit span forward and backward, trail making test part b, logical memory test, and boston naming test ) and executive function ( verbal fluency and attentional set - shifting ) were compared between groups.<n> pd patients with anxiety demonstrated worse performance 

{'rouge1': AggregateScore(low=Score(precision=0.46624189146840234, recall=0.4033354692709586, fmeasure=0.4214648583947058), mid=Score(precision=0.49535913316123015, recall=0.4234048112383335, fmeasure=0.44075778929972265), high=Score(precision=0.525737767423606, recall=0.4431802949165373, fmeasure=0.4601420070326769)),
 'rouge2': AggregateScore(low=Score(precision=0.19472558115170413, recall=0.1649397687221042, fmeasure=0.17405139473785838), mid=Score(precision=0.2241457887029173, recall=0.18390598030573682, fmeasure=0.19601669394797938), high=Score(precision=0.2549070442620303, recall=0.20545935943758536, fmeasure=0.22192902661386935)),
 'rougeL': AggregateScore(low=Score(precision=0.2756096420897867, recall=0.23654613475454667, fmeasure=0.24651038482808327), mid=Score(precision=0.302477260375182, recall=0.25497152840988896, fmeasure=0.26741587520001553), high=Score(precision=0.3313681380961463, recall=0.27716432205308883, fmeasure=0.29054789265659625)),
 'rougeLsum': AggregateScore(l

In [None]:
output = []
for i in range(len(result["predicted_abstract"])):
  item = {"id":i,"abstract":result['abstract'][i],"predicted_abstract":result['predicted_abstract'][i]}
  output.append(item)

In [None]:
import json
with open("bigbird_pubmed.txt","w") as f:
  for i in range(len(output)):
    json_str = json.dumps(output[i])
    f.write(json_str+"\n")
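
The file written above holds one JSON object per line, so it can be read back line by line. Below is a minimal round-trip sketch on a throwaway file. The helper `read_json_lines` and the two demo records are made up for illustration.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Parse a file that stores one JSON object per line."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip empty lines
                records.append(json.loads(line))
    return records


# Made-up records with the same keys as the items saved above.
demo_records = [
    {"id": 0, "abstract": "reference text", "predicted_abstract": "predicted text"},
    {"id": 1, "abstract": "another reference", "predicted_abstract": "another prediction"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w") as f:
        for record in demo_records:
            f.write(json.dumps(record) + "\n")
    loaded = read_json_lines(demo_path)

print(loaded == demo_records)
```

Calling `read_json_lines("bigbird_pubmed.txt")` would load the saved predictions in the same way.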

### T5

In [None]:
train_dataset = load_dataset("scientific_papers", DATASET_NAME, split="train", ignore_verifications=True, cache_dir=CACHE_DIR)
train_dataset

Reusing dataset scientific_papers (pubmed/scientific_papers/pubmed/1.1.1/306757013fb6f37089b6a75469e6638a553bd9f009484938d8f75a4c5e84206f)


Dataset({
    features: ['article', 'abstract', 'section_names'],
    num_rows: 119924
})

In [None]:
# for T5
# Separate names keep the BigBirdPegasus `tokenizer` and `model` used by `generate_answer` intact.
from transformers import T5Tokenizer, T5ForConditionalGeneration
t5_tokenizer = T5Tokenizer.from_pretrained("t5-small")
t5_model = T5ForConditionalGeneration.from_pretrained("gayanin/t5-small-finetuned-pubmed").to(DEVICE)
rouge = load_metric("rouge")

In [None]:
def generate_answer_T5(batch):
  # T5 only sees the first 512 tokens of each article.
  inputs_dict = t5_tokenizer(batch["article"], padding="max_length", max_length=512, return_tensors="pt", truncation=True)
  inputs_dict = {k: inputs_dict[k].to(DEVICE) for k in inputs_dict}
  predicted_abstract_ids = t5_model.generate(**inputs_dict, max_length=350, num_beams=5, length_penalty=0, repetition_penalty=2.5, early_stopping=True)
  batch["predicted_abstract"] = t5_tokenizer.decode(predicted_abstract_ids[0], skip_special_tokens=True)
  print(batch["predicted_abstract"])
  return batch

In [None]:
dataset_small = train_dataset.select(range(10))
result = dataset_small.map(generate_answer_T5)

rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"])

In iran a national free food program ( nffp) is implemented in elementary schools of deprived areas to cover all poor students.
Anemia in patients with cancer - associated anemia.
olanzapine improves preexisting symptoms of tardive dystonia after longer exposure to antipsychotics.
A novel target - specific approach to control insect pests without affecting beneficial arthropods.
Recurrent cough syncope in the context of a left sided glomus jugulare tumor with intracranial extension into the posterior cranial fossa.
Is mir-210 a class of small rnas that do not code amino acid sequences?
Midwife - led primary delivery care for low - risk pregnant women during labor in japan.
The association of obesity with the genetic variant of the insulin receptor substrate.
lipid apheresis in familial hypercholesterolemia patients.
Agenesis of inferior vena cava ( ivc) as a cause of recurrent deep vein thrombosis in the right leg.


{'rouge1': AggregateScore(low=Score(precision=0.5721611721611721, recall=0.03508800113755108, fmeasure=0.06537488409627208), mid=Score(precision=0.7080586080586081, recall=0.06099762955789746, fmeasure=0.10892877876519458), high=Score(precision=0.8385302197802198, recall=0.09696386984893049, fmeasure=0.16826199122287322)),
 'rouge2': AggregateScore(low=Score(precision=0.1878495670995671, recall=0.008984546850832126, fmeasure=0.01696502637509636), mid=Score(precision=0.3262149125384419, recall=0.03246026314599923, fmeasure=0.05688013450506864), high=Score(precision=0.48699410148674854, recall=0.06625939777214211, fmeasure=0.11348766010163089)),
 'rougeL': AggregateScore(low=Score(precision=0.4662637362637363, recall=0.027700386160133206, fmeasure=0.0522055315948588), mid=Score(precision=0.5935714285714286, recall=0.05290927180418689, fmeasure=0.09373683064722493), high=Score(precision=0.7390521978021978, recall=0.0933670845324873, fmeasure=0.16028892675746803)),
 'rougeLsum': AggregateS

Because of the very large input size of ~4K tokens in this notebook, evaluating the whole test dataset would take a very long time. For the sake of this notebook, we'll only evaluate the first 600 examples. Therefore, we cut the 6000+ samples down to just 600 using 🤗Datasets' convenient `.select()` function.

In [None]:
test_dataset = test_dataset.select(range(600))
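
To put a rough number on this: the notebook estimates about 2 hours for 600 samples on a fast GPU. Assuming the runtime scales linearly with the number of samples (a simplification), the full test split of 6658 samples can be extrapolated as sketched below. The helper name `estimate_hours` is made up for illustration.

```python
def estimate_hours(num_samples, measured_samples=600, measured_hours=2.0):
    """Linear extrapolation of evaluation time from a measured subset."""
    return num_samples * measured_hours / measured_samples


full_test_hours = estimate_hours(6658)
print(f"{full_test_hours:.1f} hours for the full test split")
```

The actual time depends heavily on the GPU, so this is only an order-of-magnitude guide.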

Alright, now let's map each sample to the predicted *abstract*. This will take *ca.* 2 hours if you have been given a fast GPU.

In [None]:
result = test_dataset.map(generate_answer)

although anxiety is the most prominent and prevalent mood disorder in patients with parkinson's disease ( pd ), few studies have investigated the relationship between anxiety and cognition in pd.<n> the aim of this study was to examine the influence of anxiety on cognition in pd by comparing pd patients with and without anxiety.<n> seventeen pd patients with anxiety ( pda+ ) and thirty - three pd patients without anxiety ( pda ) were included in this study.<n> self - reported anxiety was assessed using the hospital anxiety and depression scale ( hads ).<n> groups were matched for age, disease duration, hoehn and yahr ( h&y ) stages, disease severity, and depression.<n> performance on neuropsychological tests of attention ( digit span forward and backward, trail making test part b, logical memory test, and boston naming test ) and executive function ( verbal fluency and attentional set - shifting ) were compared between groups.<n> pd patients with anxiety demonstrated worse performance 

Finally, we can evaluate the predictions using the *rouge* metric.

In [None]:
rouge.compute(predictions=result["predicted_abstract"], references=result["abstract"])

For our 600 samples, we get a *Rouge-2* score of **19.6** 🔥🔥🔥.
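
The reported number corresponds to the `mid` F-measure of the `rouge2` entry, scaled to 0-100. The sketch below shows one way to pull it out of the result. The namedtuples are stand-ins that mirror the printed structure so the example runs without `rouge_score` installed, the helper `f1_percent` is made up, and the values are copied from the 100-sample output earlier in this notebook.

```python
from collections import namedtuple

# Stand-ins mirroring the structure printed by `rouge.compute` (assumption).
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])
AggregateScore = namedtuple("AggregateScore", ["low", "mid", "high"])


def f1_percent(scores, key="rouge2"):
    """Central (mid) F-measure of one ROUGE variant on a 0-100 scale."""
    return round(scores[key].mid.fmeasure * 100, 1)


# Values copied from the 100-sample BigBirdPegasus output above.
example_scores = {
    "rouge2": AggregateScore(
        low=Score(0.19472558115170413, 0.1649397687221042, 0.17405139473785838),
        mid=Score(0.2241457887029173, 0.18390598030573682, 0.19601669394797938),
        high=Score(0.2549070442620303, 0.20545935943758536, 0.22192902661386935),
    )
}
print(f1_percent(example_scores))
```

On the real result, the same helper would be called as `f1_percent(rouge.compute(...))`.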

**Note**: As stated in the [official paper](https://arxiv.org/pdf/2007.14062.pdf), *BigBirdPegasus* achieves a new state-of-the-art of **20.65** Rouge-2 score on PubMed. The score obtained in this notebook might be slightly worse since a different `length_penalty` is used for generation and data pre-processing is kept as simple as possible (no "*newline*" removal and no space removal before special tokens).
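
The generated summaries above contain literal `<n>` markers. One possible reading of the "newline" removal mentioned in the note is sketched below. The helper `clean_summary` is hypothetical, and whether this step changes the Rouge scores has not been verified here.

```python
import re


def clean_summary(text):
    """Replace literal `<n>` markers with spaces and collapse repeated whitespace."""
    text = text.replace("<n>", " ")
    return re.sub(r"\s+", " ", text).strip()


# Fragment taken from the prediction printed earlier in this notebook.
raw = (
    "few studies have investigated the relationship between anxiety and cognition in pd.<n> "
    "the aim of this study was to examine the influence of anxiety on cognition in pd"
)
cleaned = clean_summary(raw)
print(cleaned)
```

If such a step is applied to the predictions, the references should probably receive the same treatment so that both sides are compared on equal terms.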

In case you want to evaluate [`google/bigbird-pegasus-large-arxiv`](https://huggingface.co/google/bigbird-pegasus-large-arxiv) on the `arxiv` dataset from [`scientific_papers`](https://huggingface.co/datasets/scientific_papers), you can just change `DATASET_NAME` to `arxiv` in the variables cell near the top of this notebook.