<a href="https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/How_to_evaluate_Longformer_on_TriviaQA_using_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluation of a model using 🤗nlp

*This notebook shows how `nlp` can be leveraged to evaluate **Longformer** on **TriviaQA**.*

- The [`nlp`](https://github.com/huggingface/nlp) library allows simple and intuitive access to NLP datasets and metrics.

- **Longformer** is a transformer-based model for long-range sequence modeling introduced by *Iz Beltagy, Matthew E. Peters, Arman Cohan* (see paper [here](https://arxiv.org/abs/2004.05150)). It is now available in Transformers (see the [docs](https://huggingface.co/transformers/model_doc/longformer.html)).

- **TriviaQA** is a reading comprehension dataset containing question-answer-evidence triplets (see paper [here](https://homes.cs.washington.edu/~eunsol/papers/acl17jcwz.pdf)).

We will evaluate a pretrained `LongformerForQuestionAnswering` model on the *validation* dataset of **TriviaQA**. Along the way, this notebook will show you how `nlp` can be used for effortless preprocessing of data and analysis of the results.

Alright! Let's start by installing the `nlp` library and loading *TriviaQA*. 

### Installs and Imports

In [0]:
# ATTENTION: rerunning this command removes the cached TriviaQA dataset completely
!rm -rf /root/.cache/

In [19]:
# install nlp
!pip uninstall -y -qq pyarrow
!pip uninstall -y -qq nlp
!pip install -qq git+https://github.com/huggingface/nlp.git
!pip install -qq git+https://github.com/huggingface/transformers.git

import nlp
import torch



### Data cleaning and preprocessing 

The total *TriviaQA* dataset has a size of 17 GB once processed.
Downloading and preprocessing the dataset will take around *15 minutes*. ☕
Afterwards the data is serialized in *parquet* format for quick reloading via *Arrow*.



In [20]:
validation_dataset = nlp.load_dataset("trivia_qa", "rc", split="validation[:1%]")



Downloading and preparing dataset trivia_qa/rc (download: 2.48 GiB, generated: 14.92 GiB, total: 17.40 GiB) to /root/.cache/huggingface/datasets/trivia_qa/rc/1.1.0...



Dataset trivia_qa downloaded and prepared to /root/.cache/huggingface/datasets/trivia_qa/rc/1.1.0. Subsequent calls will reuse this data.


First, let's get an overview of the dataset 🧐

In [21]:
validation_dataset

Dataset(schema: {'question': 'string', 'question_id': 'string', 'question_source': 'string', 'entity_pages': 'struct<doc_source: list<item: string>, filename: list<item: string>, title: list<item: string>, wiki_context: list<item: string>>', 'search_results': 'struct<description: list<item: string>, filename: list<item: string>, rank: list<item: int32>, title: list<item: string>, url: list<item: string>, search_context: list<item: string>>', 'answer': 'struct<aliases: list<item: string>, normalized_aliases: list<item: string>, matched_wiki_entity_name: string, normalized_matched_wiki_entity_name: string, normalized_value: string, type: string, value: string>'}, num_rows: 187)

1% of the validation data corresponds to 187 examples, which is a good snapshot of the actual dataset and enough to get familiar with it.

Let's check out the dataset's structure.

In [22]:
# check out schema
validation_dataset.schema

question: string not null
question_id: string not null
question_source: string not null
entity_pages: struct<doc_source: list<item: string>, filename: list<item: string>, title: list<item: string>, wiki_context: list<item: string>> not null
  child 0, doc_source: list<item: string>
      child 0, item: string
  child 1, filename: list<item: string>
      child 0, item: string
  child 2, title: list<item: string>
      child 0, item: string
  child 3, wiki_context: list<item: string>
      child 0, item: string
search_results: struct<description: list<item: string>, filename: list<item: string>, rank: list<item: int32>, title: list<item: string>, url: list<item: string>, search_context: list<item: string>> not null
  child 0, description: list<item: string>
      child 0, item: string
  child 1, filename: list<item: string>
      child 0, item: string
  child 2, rank: list<item: int32>
      child 0, item: int32
  child 3, title: list<item: string>
      child 0, item: string
  child 4,

Alright, quite a lot of entries here! For question answering, all we need is the *question*, the *context* and the *answer*.

The **question** is a single entry, so we keep it.

Because *Longformer* was trained on the Wikipedia part of *TriviaQA*, we will use `validation_dataset["entity_pages"]["wiki_context"]` as our **context**.

We can also see that there are multiple entries for the **answer**. In this use case, we count a model output as correct if it is one of the answer aliases in `validation_dataset["answer"]["aliases"]`. Lastly, we also keep `validation_dataset["answer"]["normalized_value"]`. All other columns can be disregarded.

We apply the `.map()` function to map the dataset into the format as defined above.

In [23]:
# define the mapping function
def format_dataset(example):
    # the context might consist of multiple contexts => we merge them here
    example["context"] = " ".join(("\n".join(example["entity_pages"]["wiki_context"])).split("\n"))
    example["targets"] = example["answer"]["aliases"]
    example["norm_target"] = example["answer"]["normalized_value"]
    return example

# map the dataset and throw out all unnecessary columns
validation_dataset = validation_dataset.map(format_dataset, remove_columns=["search_results", "question_source", "entity_pages", "answer", "question_id"])

187it [00:00, 734.26it/s]
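
To make the effect of the mapping function concrete, here is a minimal sketch in plain Python that applies the same string operations to a single hand-made record. The record and the helper name `format_example` are hypothetical and only mimic the schema shown above; no `nlp` call is involved.

```python
# Hypothetical toy record that mimics the TriviaQA "rc" schema; the values are made up.
example = {
    "question": "Who composed Cats?",
    "entity_pages": {
        "wiki_context": [
            "Andrew Lloyd Webber is an English composer.\nHe wrote Cats.",
            "Cats is a musical.",
        ]
    },
    "answer": {
        "aliases": ["Andrew Lloyd Webber", "Lloyd Webber"],
        "normalized_value": "andrew lloyd webber",
    },
}


def format_example(example):
    # join all Wikipedia contexts, then replace every newline with a space
    context = " ".join("\n".join(example["entity_pages"]["wiki_context"]).split("\n"))
    # keep only the four fields needed for evaluation
    return {
        "question": example["question"],
        "context": context,
        "targets": example["answer"]["aliases"],
        "norm_target": example["answer"]["normalized_value"],
    }


formatted = format_example(example)
print(formatted["context"])
# Andrew Lloyd Webber is an English composer. He wrote Cats. Cats is a musical.
```

The result is a flat record with one context string, a list of accepted answers and one normalized answer, which is the shape the rest of the notebook works with.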


Now, we can check out a first example of the dataset.

In [24]:
validation_dataset[0]

{'context': 'Andrew Lloyd Webber, Baron Lloyd-Webber   (born 22 March 1948) is an English composer and impresario of musical theatre.   Several of his musicals have run for more than a decade both in the West End and on Broadway. He has composed 13 musicals, a song cycle, a set of variations, two film scores, and a Latin Requiem Mass. Several of his songs have been widely recorded and were hits outside of their parent musicals, notably "The Music of the Night" from The Phantom of the Opera, "I Don\'t Know How to Love Him" from Jesus Christ Superstar, "Don\'t Cry for Me, Argentina" and "You Must Love Me" from Evita, "Any Dream Will Do" from Joseph and the Amazing Technicolor Dreamcoat and "Memory" from Cats.  He has received a number of awards, including a knighthood in 1992, followed by a peerage from Queen Elizabeth II for services to Music, seven Tonys, three Grammys (as well as the Grammy Legend Award), an Academy Award, fourteen Ivor Novello Awards, seven Olivier Awards, a Golden G

Great 🙂. That's exactly the structure we wanted! Some examples might have an empty context, so we will filter those out.
For this, we can use the convenient `.filter()` function of `nlp`.

In [25]:
validation_dataset = validation_dataset.filter(lambda x: len(x["context"]) > 0)
# check out how many samples are left
validation_dataset

100%|██████████| 1/1 [00:00<00:00, 21.73it/s]


Dataset(schema: {'question': 'string', 'context': 'string', 'targets': 'list<item: string>', 'norm_target': 'string'}, num_rows: 187)

Looks like all examples have a context in our case. Let's now think about evaluating *Longformer*.

On a typical 24 GB GPU, *Longformer* can process inputs of up to **4096** tokens. As a rule of thumb, a word piece averages about 4 characters, so it is a good idea to check for how many examples the *question* + *context* exceeds 4 * 4096 characters.
Again, we can apply the convenient `.map()` function.

---

**Note**: *Google Colab only offers GPUs with ~12 GB of RAM, so we will set the max length to only 1024, which will decrease the performance of Longformer. A conventional 24 GB GPU (TITAN RTX) can fit a sequence length of up to 4096. In this notebook we will therefore check how many examples exceed 4 * 1024 characters instead of 4 * 4096.*
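
The rule of thumb can be written down as a small calculation. The sketch below is plain Python; the helper names `char_budget` and `fits` are hypothetical, and the length pairs are the question and context lengths this notebook prints for Ids 0, 2, 8 and 11. It is only a character-based estimate, not a real token count.

```python
def char_budget(max_tokens, chars_per_token=4):
    """Rough character limit for a token limit (rule of thumb, not an exact count)."""
    return max_tokens * chars_per_token


# (question length, context length) in characters for a few example ids
lengths = {0: (69, 31886), 2: (46, 2458), 8: (47, 8346), 11: (51, 823)}


def fits(lengths, max_tokens):
    # ids whose question + context stay below the estimated character budget
    budget = char_budget(max_tokens)
    return [i for i, (q, c) in lengths.items() if q + c < budget]


print(char_budget(1024))    # 4096
print(char_budget(4096))    # 16384
print(fits(lengths, 1024))  # [2, 11]
print(fits(lengths, 4096))  # [2, 8, 11]
```

With the smaller budget used on Colab, an example such as Id 8 (8393 characters) no longer fits and will be truncated, whereas it would fit in the 4096-token setting.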

In [26]:

print("\n\nLength for each example")
print(30 * "=")

# length for each example
validation_dataset.map(lambda x, i: print(f"Id: {i} - Question Length: {len(x['question'])} - context Length: {len(x['context'])}"), with_indices=True)
print(30 * "=")

print("\n")
print("Num examples shorter than 4 * 1024 characters: ")
# keep only the examples shorter than 4 * 1024 characters
short_validation_dataset = validation_dataset.filter(lambda x: (len(x['question']) + len(x['context'])) < 4 * 1024)
short_validation_dataset


187it [00:00, 5231.31it/s]
100%|██████████| 1/1 [00:00<00:00, 55.87it/s]



Length for each example
Id: 0 - Question Length: 69 - context Length: 31886
Id: 0 - Question Length: 69 - context Length: 31886
Id: 0 - Question Length: 69 - context Length: 31886
Id: 1 - Question Length: 61 - context Length: 92734
Id: 2 - Question Length: 46 - context Length: 2458
Id: 3 - Question Length: 49 - context Length: 33175
Id: 4 - Question Length: 55 - context Length: 27460
Id: 5 - Question Length: 69 - context Length: 95689
Id: 6 - Question Length: 60 - context Length: 80213
Id: 7 - Question Length: 54 - context Length: 40965
Id: 8 - Question Length: 47 - context Length: 8346
Id: 9 - Question Length: 58 - context Length: 66514
Id: 10 - Question Length: 57 - context Length: 43083
Id: 11 - Question Length: 51 - context Length: 823
Id: 12 - Question Length: 48 - context Length: 14555
Id: 13 - Question Length: 55 - context Length: 137066
Id: 14 - Question Length: 64 - context Length: 91380
Id: 15 - Question Length: 79 - context Length: 115129
Id: 16 - Question Length: 100 - co




Dataset(schema: {'question': 'string', 'context': 'string', 'targets': 'list<item: string>', 'norm_target': 'string'}, num_rows: 16)

Interesting! We can see that only 16 examples have fewer than 4 * 1024 = 4096 characters...

Most examples seem to have a very long context which will have to be cut to Longformer's maximum length.

### Evaluation

It's time to evaluate *Longformer* on *TriviaQA* 🚀.

Let's write our evaluation function and import the pretrained `LongformerForQuestionAnswering` model. For more details on `LongformerForQuestionAnswering`, see [here](https://huggingface.co/transformers/model_doc/longformer.html?highlight=longformerforquestionanswering#transformers.LongformerForQuestionAnswering).

In [27]:
from transformers import LongformerTokenizerFast, LongformerForQuestionAnswering

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa")

# download the 1.7 GB pretrained model. It might take ~1min
model = LongformerForQuestionAnswering.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa")
model.to("cuda")

def evaluate(example):
    def get_answer(question, context):
        # encode question and context so that they are separated by a tokenizer.sep_token and cut at max_length
        encoding = tokenizer.encode_plus(question, context, return_tensors="pt", max_length=1024)
        input_ids = encoding["input_ids"].to("cuda")
        attention_mask = encoding["attention_mask"].to("cuda")

        # the forward method will automatically set global attention on question tokens
        # the scores for the possible start token and end token of the answer are retrieved
        start_scores, end_scores = model(input_ids=input_ids, attention_mask=attention_mask)

        # Let's take the most likely token using `argmax` and retrieve the answer
        all_tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
        answer_tokens = all_tokens[torch.argmax(start_scores): torch.argmax(end_scores)+1]
        answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens))[1:].replace('"', '')  # strip the leading space and remove unnecessary '"'
        
        return answer

    # save the model's output here
    example["output"] = get_answer(example["question"], example["context"])

    # save if it's a match or not
    example["match"] = (example["output"] in example["targets"])

    return example
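
The span selection inside `get_answer` can be illustrated without a model. The sketch below is plain Python with made-up tokens and scores; the helper names `argmax` and `extract_span` are hypothetical stand-ins for `torch.argmax` and the slicing above. It also shows one possible way an empty answer can arise: if the most likely start position lies after the most likely end position, the slice is empty.

```python
def argmax(scores):
    # index of the largest score (the first one wins on ties)
    return max(range(len(scores)), key=scores.__getitem__)


def extract_span(tokens, start_scores, end_scores):
    start = argmax(start_scores)
    end = argmax(end_scores)
    # slicing returns an empty list whenever start > end
    return tokens[start:end + 1]


# Made-up tokens and scores, not real tokenizer or model output.
tokens = ["<s>", "Who", "sang", "?", "</s>", "</s>", "Roy", "Orbison", "sang", "it", "</s>"]
start_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 1.0, 0.2, 0.0, 0.0]
end_scores = [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 6.0, 0.3, 0.0, 0.0]

span = extract_span(tokens, start_scores, end_scores)
print(span)  # ['Roy', 'Orbison']

# Swapping the two score lists puts the start (7) after the end (6).
inverted = extract_span(tokens, end_scores, start_scores)
print(inverted)  # []
```

Taking the two argmax values independently is simple, but it does not guarantee a valid span. A stricter variant would search only over pairs with start <= end.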






We are interested in the performance of the model on both short and long samples.
First, we evaluate the model on `short_validation_dataset`, which comprises only the 16 samples that are shorter than 4 * 1024 characters.

In [28]:
results_short = short_validation_dataset.map(evaluate)

16it [00:03,  4.19it/s]


Now let's check for how many questions we were correct!

In [29]:
print(f"\nNum Correct examples: {sum(results_short['match'])}/{len(results_short)}")
wrong_results = results_short.filter(lambda x: x['match'] is False)
print(f"\nWrong examples: ")
wrong_results.map(lambda x, i: print(f"{i} - Output: {x['output']} - Target: {x['norm_target']}"), with_indices=True)

100%|██████████| 1/1 [00:00<00:00, 949.80it/s]
5it [00:00, 6316.72it/s]


Num Correct examples: 11/16

Wrong examples: 
0 - Output:  - Target: mutiny on bounty
0 - Output:  - Target: mutiny on bounty
0 - Output:  - Target: mutiny on bounty
1 - Output: Roy Orbison - Target: donny osmond
2 - Output: Gary Lewis - Target: gary lewis and playboys
3 - Output: Collapsible baby buggy - Target: baby buggy
4 - Output:  - Target: basket ball





Dataset(schema: {'question': 'string', 'context': 'string', 'targets': 'list<item: string>', 'norm_target': 'string', 'output': 'string', 'match': 'bool'}, num_rows: 5)

11/16 - not bad 🔥. We can also see that two of the wrong outputs (numbers 2 and 3) are very close to the correct answer.
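
The match criterion used here is a strict string comparison against the aliases, so casing and punctuation matter. As a sketch of a more lenient check, the plain-Python example below compares normalized strings instead. The `normalize` function is a simplified assumption (lowercase, strip punctuation, drop articles), not the official TriviaQA evaluation script, and the alias list is hypothetical.

```python
import re
import string


def normalize(text):
    """Simplified normalization (an assumption, not the official TriviaQA script)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
    return " ".join(text.split())  # collapse whitespace


def exact_match(output, targets):
    # the criterion used in this notebook
    return output in targets


def normalized_match(output, targets):
    return normalize(output) in {normalize(t) for t in targets}


# Hypothetical alias list for illustration only.
targets = ["Gary Lewis and the Playboys", "Gary Lewis & the Playboys"]

print(normalize("Gary Lewis and the Playboys"))                  # gary lewis and playboys
print(exact_match("gary lewis and the playboys", targets))       # False
print(normalized_match("gary lewis and the playboys", targets))  # True
print(normalized_match("Gary Lewis", targets))                   # False
```

Normalization forgives differences in casing, punctuation and articles, but a partial answer such as "Gary Lewis" still counts as wrong under both checks.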

**Note**: *Longformer reached a new SOTA on TriviaQA - see Table 9 in [paper](https://arxiv.org/abs/2004.05150). In order to reproduce the exact results, please refer to the following [instructions](https://github.com/allenai/longformer/blob/master/scripts/cheatsheet.txt).*

Second, we evaluate `LongformerForQuestionAnswering` on all of the examples.

In [30]:
results = validation_dataset.map(evaluate)

187it [00:53,  3.50it/s]


In [31]:
print(f"Correct examples: {sum(results['match'])}/{len(results)}")

Correct examples: 75/187


Here we see a clear degradation: fewer than half of the samples are correct. 75 out of 187 is still a very good score though, especially given that we can only use 1024 tokens in this notebook.
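
The two results can be put side by side with a little arithmetic. The sketch below uses only the counts reported above. The figures for the long examples are derived under the assumption that the 16 short examples receive the same predictions in both runs.

```python
def accuracy(correct, total):
    return correct / total


short_correct, short_total = 11, 16
all_correct, all_total = 75, 187

# Assumption: the short examples are answered identically in both runs,
# so the remaining examples are the ones exceeding 4 * 1024 characters.
long_correct = all_correct - short_correct
long_total = all_total - short_total

print(f"short: {accuracy(short_correct, short_total):.2%}")  # short: 68.75%
print(f"all:   {accuracy(all_correct, all_total):.2%}")      # all:   40.11%
print(f"long:  {long_correct}/{long_total} = {accuracy(long_correct, long_total):.2%}")
# long:  64/171 = 37.43%
```

Under that assumption, accuracy drops from roughly 69% on the short examples to roughly 37% on the examples that have to be truncated.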

Now you should have all the tools necessary to preprocess your data and evaluate your model with 🤗nlp in no time!

🤗 🤗 **Finish** 🤗🤗

Thanks go out to Iz Beltagy for proofreading the notebook!