<a href="https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/How_to_evaluate_Longformer_on_TriviaQA_using_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Evaluation of a model using NLP

*This notebook shows how `nlp` can be leveraged to evaluate **Longformer** on **TriviaQA**.*

- The [`nlp`](https://github.com/huggingface/nlp) library allows simple and intuitive access to NLP datasets and metrics.

- **Longformer** is a transformer-based model for long-range sequence modeling introduced by *Iz Beltagy, Matthew E. Peters, Arman Cohan* (see paper [here](https://arxiv.org/abs/2004.05150)). It is available in 🤗 Transformers (see the [docs](https://huggingface.co/transformers/model_doc/longformer.html)).

- **TriviaQA** is a reading comprehension dataset containing question-answer-evidence triplets (see paper [here](https://homes.cs.washington.edu/~eunsol/papers/acl17jcwz.pdf)).

We will evaluate a pretrained `LongformerForQuestionAnswering` model on the *validation* dataset of **TriviaQA**. Along the way, this notebook will show you how `nlp` can be used for effortless preprocessing of data and analysis of the results.

Alright! Let's start by installing the `nlp` library and loading **TriviaQA**. 

In [7]:
# install nlp
# ATTENTION: Rerunning this cell removes the cached TriviaQA dataset completely
!rm -rf /root/.cache/huggingface/datasets/
!pip uninstall -y -qq pyarrow
!pip uninstall -y -qq nlp
!pip install -qq git+https://github.com/huggingface/nlp.git
!pip install -qq git+https://github.com/huggingface/transformers.git

# import nlp
import nlp

  Building wheel for nlp (setup.py) ... done
  Building wheel for transformers (setup.py) ... done


The total **TriviaQA** dataset has a size of 17 GB once processed.
Downloading and preprocessing the dataset will take around *15 minutes*.
# ☕
Afterwards the data is serialized in *parquet* format for quick reloading via *Arrow*.



In [0]:
validation_dataset = nlp.load_dataset("trivia_qa", "rc", split="validation[:1%]")

First, let's get an overview of the dataset 🧐

In [16]:
validation_dataset

Dataset(schema: {'question': 'string', 'question_id': 'string', 'question_source': 'string', 'entity_pages': 'struct<doc_source: list<item: string>, filename: list<item: string>, title: list<item: string>, wiki_context: list<item: string>>', 'search_results': 'struct<description: list<item: string>, filename: list<item: string>, rank: list<item: int32>, title: list<item: string>, url: list<item: string>, search_context: list<item: string>>', 'answer': 'struct<aliases: list<item: string>, normalized_aliases: list<item: string>, matched_wiki_entity_name: string, normalized_matched_wiki_entity_name: string, normalized_value: string, type: string, value: string>'}, num_rows: 187)

1% of the validation data corresponds to 187 examples, which gives us a good snapshot of the actual dataset and can be used to get familiar with it.

Let's check out the dataset's structure.

In [21]:
# check out schema
validation_dataset.schema

question: string not null
question_id: string not null
question_source: string not null
entity_pages: struct<doc_source: list<item: string>, filename: list<item: string>, title: list<item: string>, wiki_context: list<item: string>> not null
  child 0, doc_source: list<item: string>
      child 0, item: string
  child 1, filename: list<item: string>
      child 0, item: string
  child 2, title: list<item: string>
      child 0, item: string
  child 3, wiki_context: list<item: string>
      child 0, item: string
search_results: struct<description: list<item: string>, filename: list<item: string>, rank: list<item: int32>, title: list<item: string>, url: list<item: string>, search_context: list<item: string>> not null
  child 0, description: list<item: string>
      child 0, item: string
  child 1, filename: list<item: string>
      child 0, item: string
  child 2, rank: list<item: int32>
      child 0, item: int32
  child 3, title: list<item: string>
      child 0, item: string
  child 4,

Alright! For question answering we are interested in the *question*, the *evidence* and the *answer*. Because **Longformer** was trained on the Wikipedia part of **TriviaQA**, we will use `validation_dataset["entity_pages"]["wiki_context"]` as our evidence. 
We can also see that there are multiple answers. 

In this use case, we define a correct output of the model as one that exactly matches one of the answer aliases in `validation_dataset["answer"]["aliases"]`. Lastly, we also keep `validation_dataset["answer"]["normalized_value"]`. All other columns can be disregarded. We apply the `.map()` function to bring the dataset into the format defined above.

In [0]:
# define the mapping function
def format_dataset(example):
    # the evidence might consist of multiple contexts => we merge them here
    example["evidence"] = " ".join(("\n".join(example["entity_pages"]["wiki_context"])).split("\n"))
    example["targets"] = example["answer"]["aliases"]
    example["norm_target"] = example["answer"]["normalized_value"]
    return example

# map the dataset and throw out all unnecessary columns
validation_dataset = validation_dataset.map(format_dataset, remove_columns=["search_results", "question_source", "entity_pages", "answer", "question_id"])
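To make the effect of the mapping function concrete, here is a minimal sketch that applies it to a single hand-made record. The toy dictionary is hypothetical: it only mimics the fields used above and is not taken from TriviaQA.

```python
# Hypothetical toy record that mimics the TriviaQA fields used by format_dataset
toy_example = {
    "question": "Who composed the musical Cats?",
    "entity_pages": {"wiki_context": ["First page.\nSecond line.", "Another page."]},
    "answer": {
        "aliases": ["Andrew Lloyd Webber", "Lloyd Webber"],
        "normalized_value": "andrew lloyd webber",
    },
}


def format_dataset(example):
    # join all contexts, then replace every newline with a single space
    example["evidence"] = " ".join(("\n".join(example["entity_pages"]["wiki_context"])).split("\n"))
    example["targets"] = example["answer"]["aliases"]
    example["norm_target"] = example["answer"]["normalized_value"]
    return example


# work on a shallow copy so the toy record itself keeps its original keys
formatted = format_dataset(dict(toy_example))
print(formatted["evidence"])
print(formatted["targets"])
```

The two wiki contexts end up as one string without newlines, and the aliases become the list of accepted targets.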

Now, we can check out a first example of the dataset.

In [25]:
validation_dataset[0]

{'evidence': 'Andrew Lloyd Webber, Baron Lloyd-Webber   (born 22 March 1948) is an English composer and impresario of musical theatre.   Several of his musicals have run for more than a decade both in the West End and on Broadway. He has composed 13 musicals, a song cycle, a set of variations, two film scores, and a Latin Requiem Mass. Several of his songs have been widely recorded and were hits outside of their parent musicals, notably "The Music of the Night" from The Phantom of the Opera, "I Don\'t Know How to Love Him" from Jesus Christ Superstar, "Don\'t Cry for Me, Argentina" and "You Must Love Me" from Evita, "Any Dream Will Do" from Joseph and the Amazing Technicolor Dreamcoat and "Memory" from Cats.  He has received a number of awards, including a knighthood in 1992, followed by a peerage from Queen Elizabeth II for services to Music, seven Tonys, three Grammys (as well as the Grammy Legend Award), an Academy Award, fourteen Ivor Novello Awards, seven Olivier Awards, a Golden 

Great, that is the structure we wanted! Some examples might have empty evidence, so we will filter those examples out.
For this we can use the convenient `.filter()` function of `nlp`.

In [26]:
validation_dataset = validation_dataset.filter(lambda x: len(x["evidence"]) > 0)
# check out how many samples are left
validation_dataset

100%|██████████| 1/1 [00:00<00:00, 20.90it/s]


Dataset(schema: {'question': 'string', 'evidence': 'string', 'targets': 'list<item: string>', 'norm_target': 'string'}, num_rows: 187)

Looks like all examples have evidence in our case. Ok, let's think about the evaluation on **Longformer** now. 

On a typical 24 GB GPU, **Longformer** is able to process inputs of up to **4096** tokens. As a rule of thumb, one token corresponds to roughly 4 characters for Longformer's vocabulary.
Let's check how long *question* + *evidence* is in terms of characters for our examples and print each example's index together with its length. Again we can apply the convenient `.map()` function.

---

**Note**: *Google Colab only offers GPUs with ~12 GB of RAM, so we will set the max length to only 1024 tokens, which will obviously decrease the performance of Longformer. A conventional 24 GB GPU (TITAN RTX) can fit a sequence length of up to 4096 tokens. So here we will also check how many examples have fewer than 4 * 1024 characters, which corresponds to roughly 1024 tokens.*
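The rule of thumb can be turned into a quick back-of-the-envelope check. The sketch below is only an approximation under the 4-characters-per-token assumption stated above; the helper names `estimate_tokens` and `fits` are made up for illustration, and the character counts are illustrative. An exact count would require running the tokenizer.

```python
CHARS_PER_TOKEN = 4  # rule of thumb from the text, not an exact tokenizer count


def estimate_tokens(num_chars, chars_per_token=CHARS_PER_TOKEN):
    # ceiling division: a partially filled token still counts as one token
    return -(-num_chars // chars_per_token)


def fits(question_len, evidence_len, max_tokens):
    # True if the estimated token count does not exceed the budget
    return estimate_tokens(question_len + evidence_len) <= max_tokens


# illustrative character counts: one short and one long example
short_tokens = estimate_tokens(46 + 2458)
long_tokens = estimate_tokens(69 + 31886)

print(short_tokens, fits(46, 2458, max_tokens=1024))
print(long_tokens, fits(69, 31886, max_tokens=1024), fits(69, 31886, max_tokens=4096))
```

Under this estimate, an evidence of roughly 32,000 characters exceeds both the 1024-token budget used here and the 4096-token limit of the model, so it would be truncated either way.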

In [34]:

print("\n\nLength for each example")
print(30 * "=")

# length for each example
validation_dataset.map(lambda x, i: print(f"Id: {i} - Question Length: {len(x['question'])} - Evidence Length: {len(x['evidence'])}"), with_indices=True)
print(30 * "=")

print("\n")
print("Examples shorter than 4096 characters: ")
# keep only examples shorter than 4 * 1024 characters
short_validation_examples = validation_dataset.filter(lambda x: (len(x['question']) + len(x['evidence'])) < 4 * 1024)
short_validation_examples


187it [00:00, 4274.00it/s]



Length for each example
Id: 0 - Question Length: 69 - Evidence Length: 31886
Id: 0 - Question Length: 69 - Evidence Length: 31886
Id: 0 - Question Length: 69 - Evidence Length: 31886
Id: 1 - Question Length: 61 - Evidence Length: 92734
Id: 2 - Question Length: 46 - Evidence Length: 2458
Id: 3 - Question Length: 49 - Evidence Length: 33175
Id: 4 - Question Length: 55 - Evidence Length: 27460
Id: 5 - Question Length: 69 - Evidence Length: 95689
Id: 6 - Question Length: 60 - Evidence Length: 80213
Id: 7 - Question Length: 54 - Evidence Length: 40965
Id: 8 - Question Length: 47 - Evidence Length: 8346
Id: 9 - Question Length: 58 - Evidence Length: 66514
Id: 10 - Question Length: 57 - Evidence Length: 43083
Id: 11 - Question Length: 51 - Evidence Length: 823
Id: 12 - Question Length: 48 - Evidence Length: 14555
Id: 13 - Question Length: 55 - Evidence Length: 137066
Id: 14 - Question Length: 64 - Evidence Length: 91380
Id: 15 - Question Length: 79 - Evidence Length: 115129
Id: 16 - Questio




Dataset(schema: {'question': 'string', 'evidence': 'string', 'targets': 'list<item: string>', 'norm_target': 'string'}, num_rows: 16)

Interesting! We can see that only 16 examples have fewer than 4096 characters...

Most examples seem to have very long evidence, which will have to be cut to Longformer's maximum length.

Let's write our evaluation function and import a pretrained `LongformerForQuestionAnswering`. For more details on the `LongformerForQuestionAnswering` model, see [here](https://huggingface.co/transformers/model_doc/longformer.html?highlight=longformerforquestionanswering#transformers.LongformerForQuestionAnswering).

In [0]:
import torch
from transformers import LongformerTokenizer, LongformerForQuestionAnswering

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa")

# download the 1.7 GB pretrained model. It might take ~1min
model = LongformerForQuestionAnswering.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa")
model.to("cuda")

def evaluate(example):
    def get_answer(question, evidence):
        # encode question and evidence so that they are separated by tokenizer.sep_token and cut at max_length
        encoding = tokenizer.encode_plus(question, evidence, return_tensors="pt", max_length=1024)
        input_ids = encoding["input_ids"].to("cuda")
        attention_mask = encoding["attention_mask"].to("cuda")

        # the forward method will automatically set global attention on question tokens
        # the scores for the possible start token and end token of the answer are retrieved
        start_scores, end_scores = model(input_ids=input_ids, attention_mask=attention_mask)

        # Let's take the most likely token using `argmax` and retrieve the answer
        all_tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
        answer_tokens = all_tokens[torch.argmax(start_scores): torch.argmax(end_scores)+1]
        answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens))[1:].replace('"', '')  # strip the leading space and remove unnecessary '"'
        
        return answer

    # save the model's output here
    example["output"] = get_answer(example["question"], example["evidence"])

    # save if it's a match or not
    example["match"] = (example["output"] in example["targets"])

    return example
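The span selection inside `get_answer` can be illustrated without a model. The following is a minimal sketch in plain Python, with made-up tokens and scores (real Longformer tokens and logits look different). It picks the start and end positions independently via argmax, as the function above does, and shows that the slice is empty whenever the predicted end lies before the predicted start, which would yield an empty answer string.

```python
def extract_span(tokens, start_scores, end_scores):
    # argmax over the start scores and, independently, over the end scores
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    end = max(range(len(end_scores)), key=end_scores.__getitem__)
    # the slice is empty if end < start
    return tokens[start : end + 1]


# simplified, hypothetical tokens and scores
tokens = ["<s>", "Who", "sang", "it", "</s>", "Roy", "Orbison", "sang", "it", "</s>"]

good_start = [0.0, 0.1, 0.0, 0.0, 0.0, 0.9, 0.2, 0.0, 0.0, 0.0]
good_end = [0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.8, 0.0, 0.0, 0.0]
print(extract_span(tokens, good_start, good_end))

# here the most likely end position comes before the most likely start position
bad_start = [0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.9, 0.0, 0.0, 0.0]
bad_end = [0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.2, 0.0, 0.0, 0.0]
print(extract_span(tokens, bad_start, bad_end))
```

A more careful decoder would search for the best pair with `start <= end`; the evaluation function above keeps the simpler independent argmax.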


Let's evaluate two datasets:
- One that includes only examples with fewer than 4096 characters and 
- Another one that includes all examples

In [36]:
import torch
results_short = short_validation_examples.map(evaluate)

16it [00:03,  4.12it/s]


Now let's check for how many questions we were correct!

In [52]:
print(f"\nNum Correct examples: {sum(results_short['match'])}/{len(results_short)}")
wrong_results = results_short.filter(lambda x: x['match'] is False)
print(f"\nWrong examples: ")
wrong_results.map(lambda x, i: print(f"{i} - Output: {x['output']} - Target: {x['norm_target']}"), with_indices=True)

5it [00:00, 3460.08it/s]


Num Correct examples: 11/16

Wrong examples: 
0 - Output:  - Target: mutiny on bounty
0 - Output:  - Target: mutiny on bounty
0 - Output:  - Target: mutiny on bounty
1 - Output: Roy Orbison - Target: donny osmond
2 - Output: Gary Lewis - Target: gary lewis and playboys
3 - Output: Collapsible baby buggy - Target: baby buggy
4 - Output:  - Target: basket ball





Dataset(schema: {'question': 'string', 'evidence': 'string', 'targets': 'list<item: string>', 'norm_target': 'string', 'output': 'string', 'match': 'bool'}, num_rows: 5)

11/16 is not bad! We can also see that two of the wrong outputs are very close to the correct solution (numbers 2 and 3).
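Exact string matching against the aliases is strict. A common alternative in question answering evaluation is to normalize both the output and the targets before comparing. The sketch below is an assumption-laden illustration in the spirit of such normalization (lowercasing, removing punctuation and English articles); it is not the official TriviaQA evaluation script, and the alias lists are hypothetical.

```python
import re
import string


def normalize(text):
    # lowercase, drop punctuation, drop English articles, collapse whitespace
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def normalized_match(output, targets):
    # compare the normalized output against the set of normalized targets
    return normalize(output) in {normalize(t) for t in targets}


# hypothetical alias lists for illustration
print(normalized_match("the Baby Buggy.", ["Baby buggy"]))
print(normalized_match("Gary Lewis", ["Gary Lewis and the Playboys", "Gary Lewis & The Playboys"]))
```

Normalization forgives differences in casing, punctuation and articles, but a partial answer such as "Gary Lewis" still counts as wrong. Crediting such outputs would need a softer measure, for example token-level F1.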

Second, we evaluate `LongformerForQuestionAnswering` on all examples.

In [37]:
results = validation_dataset.map(evaluate)

187it [01:54,  1.63it/s]


In [43]:
print(f"Correct examples: {sum(results['match'])}/{len(results)}")

Correct examples: 75/187


Here, we see a clear degradation: fewer than half of the samples are correct. Still, 75 out of 187 is a good score, given that we can only use 1024 tokens in this notebook.
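To put the two counts side by side, the arithmetic can be written out as below. The score for the long examples alone is derived, not measured: it assumes that the 16 short examples are a subset of the 187 and that the model produced the same outputs for them in both runs.

```python
def accuracy(num_correct, num_total):
    # fraction of examples whose output matched one of the targets
    return num_correct / num_total


short_correct, short_total = 11, 16
all_correct, all_total = 75, 187

short_acc = accuracy(short_correct, short_total)
all_acc = accuracy(all_correct, all_total)

# derived under the assumption stated above: remove the short examples from the totals
long_acc = accuracy(all_correct - short_correct, all_total - short_total)

print(f"Short examples: {short_acc:.3f}")
print(f"All examples:   {all_acc:.3f}")
print(f"Long examples:  {long_acc:.3f}")
```

Under that assumption, the examples that had to be truncated score noticeably lower than the short ones, which is consistent with the degradation described above.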