## Question Answering
In this exercise, you will experiment with one of NLP’s exciting tasks - Question Answering!

You will first evaluate the performance of a pre-trained model on SQuAD, a leading question-answering dataset. Those with approved access to GPUs on AWS or another provider are encouraged to also fine-tune a base model on the SQuAD dataset.

We will use HuggingFace’s Transformers, a widely used package for NLP tasks with transformer models. Your code should roughly follow [this guide](https://huggingface.co/docs/transformers/tasks/question_answering) and [this notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb).

(**Important Note:** The guide includes a considerable amount of code to handle contexts longer than the maximum input sequence length. For simplicity, in your code, you should remove from the datasets all contexts longer than `max_length = 384`.)


This exercise uses large models. Although we only fine-tune existing models, fine-tuning can still take a long time, so you are not expected to make many runs.


Install the transformers and datasets libraries.

In [None]:
! pip install datasets transformers
!pip install transformers[torch]

Import required libraries.  
Make sure your version of Transformers is at least 4.11.0.

In [2]:
import transformers

print(transformers.__version__)

4.30.2


We will use the 🤗 [Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions load_dataset and load_metric.

For our example here, we'll use version 1.1 of Stanford's [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/).

Load the SQuAD v1.1 dataset.

In [None]:
from datasets import load_dataset, load_metric

datasets = load_dataset("squad")

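# Keep only examples whose context is at most 384 characters long.
# Note: this filters by character count, a conservative proxy for the
# 384-token max_length used by the tokenizer later on.
for split in ['train', 'validation']:
    datasets[split] = datasets[split].filter(lambda example: len(example['context']) <= 384)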


### Getting to know the dataset

The `datasets` object itself is a `DatasetDict`, which contains one key for each split (here, training and validation).

In [4]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 6409
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 653
    })
})

We can see that the training and validation sets both have a column for the context, the question, and the answers to those questions.

To access an actual element, you need to select a split first, then give an index:

In [5]:
datasets["train"][0]

{'id': '56be9bb83aeaaa14008c915c',
 'title': 'Beyoncé',
 'context': "On January 7, 2012, Beyoncé gave birth to her first child, a daughter, Blue Ivy Carter, at Lenox Hill Hospital in New York. Five months later, she performed for four nights at Revel Atlantic City's Ovation Hall to celebrate the resort's opening, her first performances since giving birth to Blue Ivy.",
 'question': 'When did Beyonce have her first child?',
 'answers': {'text': ['January 7, 2012'], 'answer_start': [3]}}
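
The `answer_start` field is a character offset into `context`, so slicing the context at that offset should reproduce the answer text. Here is a minimal sketch of that check, using the first sentence of the example above. The helper name `extract_answer` is ours for illustration and is not part of the Datasets library.

```python
# Hypothetical helper: recover each answer span from its character offset.
def extract_answer(example):
    context = example["context"]
    answers = example["answers"]
    spans = []
    for text, start in zip(answers["text"], answers["answer_start"]):
        # The answer occupies the characters [start, start + len(text)).
        spans.append(context[start:start + len(text)])
    return spans


# First sentence of the training example shown above.
example = {
    "context": "On January 7, 2012, Beyoncé gave birth to her first child, "
               "a daughter, Blue Ivy Carter, at Lenox Hill Hospital in New York.",
    "answers": {"text": ["January 7, 2012"], "answer_start": [3]},
}

print(extract_answer(example))
```

If the slice and the stored answer text ever disagree, the example is mislabeled or the context was modified after annotation.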

Now, answer these questions:  

What is the shortest context in the training dataset?


In [48]:
import numpy as np
shortest_context = np.inf
for example in datasets['train']:
    context_len = len(example['context'])
    if shortest_context > context_len:
        shortest_context = context_len
print(f'shortest context in training dataset is: {shortest_context} characters')

shortest context in training dataset is: 384 characters


What is the longest answer in the dataset?

In [7]:
longest_answer = 0
for dataset in ['train', 'validation']:
    for example in datasets[dataset]:
        # Only the first reference answer of each example is considered
        answer_len = len(example['answers']['text'][0])
        if longest_answer < answer_len:
            longest_answer = answer_len
print(f'longest answer in the dataset is: {longest_answer} characters')

longest answer in the dataset is: 181 characters


Is there a question that appears multiple times? What is the most common question?

In [9]:
questions_dict = {}
for dataset in ['train','validation']:
  for example in datasets[dataset]:
      key = example['question']
      questions_dict[key] = questions_dict.get(key, 0) + 1

max_key = max(questions_dict, key=lambda k: questions_dict[k])
max_value = questions_dict[max_key]
print(f"The most frequent question is : '{max_key}' which appeared {max_value} times")

The most frequent question is : 'Which Caribbean nation is in the top quartile of HDI (but missing IHDI)?' which appeared 6 times


### HuggingFace transformers’ tokenizers
As a preprocessing step, the HuggingFace code tokenizes input sequences using a Tokenizer. Read more about tokenizers here:
https://huggingface.co/docs/tokenizers/pipeline  
https://huggingface.co/transformers/v3.0.2/preprocessing.html

For this question, use the BERT tokenizer. The tokenizer sometimes breaks words into smaller chunks, so the number of tokens can be larger than the number of words.
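
To make the "smaller chunks" idea concrete, here is a minimal sketch of greedy longest-match-first sub-word splitting, the general scheme behind WordPiece. The vocabulary below is a made-up toy, and the function name `wordpiece` is ours. The real BERT tokenizer also normalizes and pre-tokenizes text, so treat this as an illustration rather than its actual implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration only).
toy_vocab = {"Bey", "##on", "##ce", "child"}
print(wordpiece("Beyonce", toy_vocab))
print(wordpiece("child", toy_vocab))
```

A word that is in the vocabulary stays whole, while a rarer word is split into several pieces, which is why the token count can exceed the word count.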

Using the first 1,000 context datapoints, print the 30 most common tokens produced by the tokenizer.


In [30]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from collections import Counter

tokenizer = AutoTokenizer.from_pretrained("batterydata/bert-base-cased-squad-v1")

# tokenizer = AutoTokenizer.from_pretrained("potsawee/t5-large-generation-squad-QuestionAnswer")
# model = AutoModelForSeq2SeqLM.from_pretrained("potsawee/t5-large-generation-squad-QuestionAnswer")

data = datasets['train'][:1000]

tokenized_data = tokenizer(
    data['question'],
    data['context'],
    max_length=384,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)

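# Count token occurrences. Note: this covers only the first example ([0]),
# including its [PAD] tokens, not all 1,000 contexts.
first_example_ids = tokenized_data["input_ids"][0].tolist()
counter = Counter(tokenizer.convert_ids_to_tokens(first_example_ids))
most_common_tokens = counter.most_common(30)
print(most_common_tokens)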

[('[PAD]', 304), (',', 7), ('her', 3), ('first', 3), ('to', 3), ('child', 2), ('[SEP]', 2), ('birth', 2), ('Blue', 2), ('Ivy', 2), ('at', 2), ('.', 2), ("'", 2), ('s', 2), ('[CLS]', 1), ('When', 1), ('did', 1), ('Bey', 1), ('##on', 1), ('##ce', 1), ('have', 1), ('?', 1), ('On', 1), ('January', 1), ('7', 1), ('2012', 1), ('Beyoncé', 1), ('gave', 1), ('a', 1), ('daughter', 1)]
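
The counts above come from a single padded example, so `[PAD]` dominates. One possible way to count over every context is sketched below. The function name `most_common_tokens` and the `skip` tuple are our own assumptions, and `str.split` is only a stand-in tokenizer so that the snippet runs by itself.

```python
from collections import Counter


def most_common_tokens(texts, tokenize, n=30, skip=("[PAD]", "[CLS]", "[SEP]")):
    """Count tokens across all texts, ignoring special tokens listed in `skip`."""
    counter = Counter()
    for text in texts:
        counter.update(tok for tok in tokenize(text) if tok not in skip)
    return counter.most_common(n)


# Stand-in data and tokenizer (assumptions for illustration).
# With the notebook's objects one would instead pass
# datasets['train'][:1000]['context'] and tokenizer.tokenize.
sample_contexts = ["the cat sat on the mat", "the dog sat"]
print(most_common_tokens(sample_contexts, str.split, n=2))
```

Tokenizing each context without padding keeps `[PAD]` out of the counts in the first place, and the `skip` tuple is a second line of defense.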


In [31]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0] # take the first char position from data
        end_char = answer["answer_start"][0] + len(answer["text"][0]) # add to start position the length of the answer
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs



tokenized_squad = datasets.map(preprocess_function, batched=True, remove_columns=datasets["train"].column_names)

Map:   0%|          | 0/6409 [00:00<?, ? examples/s]

Map:   0%|          | 0/653 [00:00<?, ? examples/s]
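
The core of `preprocess_function` is mapping a character span to token positions through the offset mapping. The sketch below isolates that step in simplified form. The function name `char_span_to_token_span` and the hand-written offsets are assumptions for illustration. It also assumes that only context tokens carry non-empty offsets, whereas the function above uses `sequence_ids` to skip the question tokens.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Return (first_token, last_token) covering characters [start_char, end_char)."""
    start_token = end_token = None
    for idx, (tok_start, tok_end) in enumerate(offsets):
        if tok_start == tok_end:
            continue  # special or padding tokens have empty offsets in this sketch
        if start_token is None and tok_end > start_char:
            start_token = idx  # first token that reaches past the answer start
        if tok_start < end_char:
            end_token = idx  # last token that begins before the answer end
    return start_token, end_token


# Hand-written offsets for "[CLS] On January 7 , 2012 [SEP]" over the
# context "On January 7, 2012" (an illustrative assumption, not tokenizer output).
offsets = [(0, 0), (0, 2), (3, 10), (11, 12), (12, 13), (14, 18), (0, 0)]
answer_start = 3
answer_end = answer_start + len("January 7, 2012")
print(char_span_to_token_span(offsets, answer_start, answer_end))
```

The returned pair plays the role of the `start_positions` and `end_positions` labels that the model is trained to predict.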

### Load a pretrained Question Answering model
In this section, you will use a model pretrained on the SQuAD dataset for question answering.  

Choose a model you'd like to use.  
You can see a list of available models here: https://huggingface.co/models?dataset=dataset:squad&sort=downloads


Load the model.

In [32]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()
model = AutoModelForQuestionAnswering.from_pretrained("batterydata/bert-base-cased-squad-v1")

### Pretrained Model Error Analysis
Here you will evaluate your model’s performance.

Write code to manually review a few errors of the model.


In [33]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",            # output directory
    num_train_epochs=3,               # total number of training epochs
    per_device_train_batch_size=16,   # batch size per device during training
    per_device_eval_batch_size=64,    # batch size for evaluation
    warmup_steps=500,                 # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                # strength of weight decay
    logging_dir="./logs",             # directory for storing logs
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=tokenized_squad['train'],         # training dataset
    eval_dataset=tokenized_squad['validation'],             # evaluation dataset
    data_collator=data_collator,
)

predictions = trainer.predict(tokenized_squad['validation'])


In [34]:
import numpy as np
from tqdm import tqdm

predicted_start_positions = np.argmax(predictions.predictions[0], axis=1)
predicted_end_positions = np.argmax(predictions.predictions[1], axis=1)

predicted_answers = []
for i in tqdm(range(len(predicted_start_positions))):
    start = predicted_start_positions[i]
    end = predicted_end_positions[i]
    tokens = tokenized_squad['validation'][i]['input_ids'][start:end+1]
    predicted_answers.append(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(tokens)))


100%|██████████| 653/653 [00:00<00:00, 1697.96it/s]


In [76]:

errors = 0
sample = range(55, len(predicted_answers), 20)
for i in sample:

    answer = predicted_answers[i]

    if datasets['validation'][i]['answers']['text'][0] != answer:
        print(f"Example #{i}")
        print(f"Contexts: {datasets['validation'][i]['context']}")
        print(f"Question: {datasets['validation'][i]['question']}")
        print(f"Predicted: {answer}")
        print(f"Actual: {datasets['validation'][i]['answers']['text'][0]}")
        errors += 1
        print('\n')
    if errors == 10:  # stop after 10 errors
        break


Example #75
Contexts: On May 21, 2013, NFL owners at their spring meetings in Boston voted and awarded the game to Levi's Stadium. The $1.2 billion stadium opened in 2014. It is the first Super Bowl held in the San Francisco Bay Area since Super Bowl XIX in 1985, and the first in California since Super Bowl XXXVII took place in San Diego in 2003.
Question: When was the last time California hosted a Super Bowl?
Predicted: 2003
Actual: 2003.


Example #115
Contexts: Peyton Manning became the first quarterback ever to lead two different teams to multiple Super Bowls. He is also the oldest quarterback ever to play in a Super Bowl at age 39. The past record was held by John Elway, who led the Broncos to victory in Super Bowl XXXIII at age 38 and is currently Denver's Executive Vice President of Football Operations and General Manager.
Question: Prior to Manning, who was the oldest quarterback to play in a Super Bowl?
Predicted: Peyton
Actual: John Elway


Example #135
Contexts: The Panthers

Do you see a pattern there? Can you form a hypothesis about the cases where the model fails?

> Many of these errors are actually acceptable answers that differ only slightly in phrasing or punctuation.
>
> For the ones that are not, it's hard to say what the issue is. A more robust way of comparing answers would probably give further insight.
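
One possible "more robust" comparison is to normalize both strings before comparing them and to add a token-overlap F1 score, in the spirit of the standard SQuAD evaluation. The sketch below is our own simplified version, not the official script, and the function names are assumptions.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return normalize_answer(prediction) == normalize_answer(truth)


def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


# The two reviewed examples above: one near-miss and one real error.
print(exact_match("2003", "2003."))
print(f1_score("Peyton", "John Elway"))
```

With this comparison, the "2003" versus "2003." case counts as a match, while the "Peyton" versus "John Elway" case remains a genuine error.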

Write code that runs inference and outputs the predicted answer for a context and a question typed by the user. We recommend using ipywidgets for interactivity:  
https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Basics.html

Use the award-winning GUI you’ve just created to try to manually poke holes in the model. Try to characterize the cases your model mishandles.


In [36]:
from torch.nn.functional import softmax
import torch

def ask_question(context, question):
    # Tokenize the question/context pair
    inputs = tokenizer.encode_plus(question, context, return_tensors='pt')

    # Drop 'token_type_ids' (note: BERT normally uses these segment ids to tell
    # the question from the context, so removing them may degrade answers)
    inputs.pop("token_type_ids", None)
    # Move the inputs to whichever device the model is on
    inputs = inputs.to(model.device)

    # Get the model's predictions without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the start and end scores from the model output
    start_scores = outputs.start_logits
    end_scores = outputs.end_logits

    # Apply softmax to convert the scores into probabilities
    start_probs = softmax(start_scores, dim=-1)
    end_probs = softmax(end_scores, dim=-1)

    # Find the tokens with the highest start and end probabilities.
    # The two are chosen independently, so the end may precede the start,
    # in which case the slice below is empty.
    answer_start = torch.argmax(start_probs)
    answer_end = torch.argmax(end_probs)

    # Get the string version of the predicted answer
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    answer = tokenizer.convert_tokens_to_string(tokens[answer_start:answer_end+1])

    return answer
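
Because the start and end positions are picked independently, the predicted end can fall before the predicted start, which yields an empty answer. A common remedy is to search for the best-scoring pair with `start <= end`. The sketch below shows the idea on plain Python lists. The name `best_span` and the default `max_answer_len=30` are assumptions, not values taken from this notebook.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed logits and start <= end."""
    best_score, best = float("-inf"), (0, 0)
    for start, start_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best_score:
                best_score, best = score, (start, end)
    return best


# Toy logits where independent argmax would give start=2, end=1 (an empty span).
start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [0.1, 6.0, 0.2, 4.0]
print(best_span(start_logits, end_logits))
```

In `ask_question`, one could pass `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()` to such a function instead of taking two separate argmaxes.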




In [92]:
keep_going=True
while(keep_going):
    print(f'Enter context: ', end='')
    context = input()
    print('Enter question: ', end='')
    question = input()
    answer = ask_question(context, question)
    print("Here's your answer, scum: ", end='')
    print(answer)

    print()
    print('Had enough? Y/N')
    keep_going = (input() == 'N')
    print()


Enter context: I am hella sad
Enter question: who is sad?
Here's your answer, scum: 

Had enough? Y/N
N
Enter context: mike is the president of the united states
Enter question: who is president?
Here's your answer, scum: [CLS]

Had enough? Y/N
Y


In [59]:
context = "The quick brown fox jumps over the lazy dog."
question = "Who jumps over the lazy dog?"
answer = ask_question(context, question)
print(answer)


context = "The answer to 1+1 is 2."
question = "What is the answer to 1+1?"
answer = ask_question(context, question)
print(answer)

context = "Numbers are making problems for this models predictions"
question = "What makes problems for this models predictions?"
answer = ask_question(context, question)
print(answer)


context = "The dude is trying to understand why this model doesn't work. he thinks it might be happening when the context is very short, so he created a relatively large context to test his hypothesis"
question = "who tries to understand why the model doesnt work?"
answer = ask_question(context, question)
print(answer)

brown fox
2
Numbers



In [47]:
len(context)

186

Next, evaluate your model’s performance for different lengths of input text and of answer length.

In [81]:
context = "Tom Hanks won the Best Actor Oscar for the movie Forrest Gump."
question = "Who won the Oscar?"
answer = ask_question(context, question)
print(f'Question:{question}')
print(f'Answer:{answer}')
print()

context = "Apollo 11 was the spaceflight that first landed humans on the Moon. Commander Neil Armstrong and lunar module pilot Buzz Aldrin formed the American crew that landed the Apollo Lunar Module Eagle on July 20, 1969."
question = "What is the Apollo 11 mission?"
answer = ask_question(context, question)
print(f'Question: {question}')
print(f'Answer: {answer}')
print()

context = "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects."
question = "Who created Python?"
answer = ask_question(context, question)
print(f'Question: {question}')
print(f'Answer: {answer}')
print()

context = "The Solar System is the gravitationally bound system of the Sun and the objects that orbit it, either directly or indirectly. Of the objects that orbit the Sun directly, the largest are the eight planets, with the remainder being smaller objects, the dwarf planets and small Solar System bodies. Of the objects that orbit the Sun indirectly—the moons—two are larger than the smallest planet, Mercury."
question = "What is the Solar System composed of?"
answer = ask_question(context, question)
print(f'Question: {question}')
print(f'Answer: {answer}')
print()


Question:Who won the Oscar?
Answer:Tom Hanks

Question: What is the Apollo 11 mission?
Answer: Apollo Lunar Module Eagle

Question: Who created Python?
Answer: Guido van Rossum

Question: What is the Solar System composed of?
Answer: smaller objects, the dwarf planets and small Solar System bodies. Of the objects that orbit the Sun indirectly — the moons
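
Hand-picked examples give only anecdotal evidence. To measure the effect of input length more systematically, one option is to bucket examples by context length and compute accuracy per bucket. The sketch below does this with made-up records. The function name `accuracy_by_length` and the bucket size are assumptions, and the comparison is a plain string match.

```python
from collections import defaultdict


def accuracy_by_length(records, bucket_size=100):
    """records: iterable of (context, prediction, truth) string triples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for context, prediction, truth in records:
        # Bucket key is the lower bound of the context-length range.
        bucket = (len(context) // bucket_size) * bucket_size
        totals[bucket] += 1
        correct[bucket] += int(prediction.strip() == truth.strip())
    return {bucket: correct[bucket] / totals[bucket] for bucket in sorted(totals)}


# Made-up records for illustration only.
records = [
    ("a" * 50, "x", "x"),
    ("b" * 60, "y", "z"),
    ("c" * 150, "w", "w"),
]
print(accuracy_by_length(records))
```

The same grouping works for answer length by bucketing on `len(truth)` instead of `len(context)`.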



BONUS: Can you think of other axes that would be interesting to use to evaluate your model?

### [Advanced] Fine-tune a Model
Here, you will fine-tune a base model on the SQuAD dataset and evaluate its performance.

Which metric do you find suitable? Why?


Fine-tune the model on the dataset.

Report your training and validation loss below.

## Recommended Resources
For an open discussion of Question Answering topics, you are strongly encouraged to watch this workshop: https://www.youtube.com/watch?v=Ihgk8kGLpIE

This screencast uses T5 on a different Q&A dataset: https://www.youtube.com/watch?v=_l2wJb3QPdk



That's it - good luck!