<img src="https://i.imgur.com/RFR6UZX.jpg" width="100%"/>


# 6- 🤗 Pre & post-processing
### [chaii - Hindi and Tamil Question Answering](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering) - A quick overview for QA noobs

Hi and welcome! This is the sixth kernel of the series `chaii - Hindi and Tamil Question Answering - A quick overview for QA noobs`.

---


**In this kernel, we will go over the 3 main functions of huggingface's [QA example notebook](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb):**
1. **`prepare_train_features`,**
2. **`prepare_validation_features`, and**
3. **`postprocess_qa_predictions`**

**These functions are used in all of the [top models](https://www.kaggle.com/julian3833/4-exploring-public-models-qa-for-qa-noobs/), so gaining an understanding of them is crucial.**


---

The full series consists of the following notebooks:
1. [The competition](https://www.kaggle.com/julian3833/1-the-competition-qa-for-qa-noobs)
2. [The dataset](https://www.kaggle.com/julian3833/2-the-dataset-qa-for-qa-noobs)
3. [The metric (Jaccard)](https://www.kaggle.com/julian3833/3-the-metric-jaccard-qa-for-qa-noobs)
4. [Exploring Public Models](https://www.kaggle.com/julian3833/4-exploring-public-models-qa-for-qa-noobs/)
5. [🥇 XLM-Roberta + Torch's extra data [LB: 0.749]](https://www.kaggle.com/julian3833/5-xlm-roberta-torch-s-extra-data-lb-0-749)
6. _[🤗 Pre & post processing](https://www.kaggle.com/julian3833/6-pre-post-processing-qa-for-qa-noobs/) (This notebook)_
7. [Public Models Revisited](https://www.kaggle.com/julian3833/7-public-models-revisited-qa-for-qa-noobs/)

This is an ongoing project, so expect more notebooks to be added to the series soon. Actually, we are currently working on the following ones:
* Reviewing `squad2`, `mlqa` and others
* About `xlm-roberta-large-squad2`
* Own improvements

---


# Introduction

If you have been checking the public work, there are 3 functions that look a little bit intimidating: `prepare_train_features`, `prepare_validation_features`, and `postprocess_qa_predictions`.

The idea of this notebook is to approach them slowly, in order to lose our fears and gain confidence with them. Let's go!

In [None]:
%env WANDB_DISABLED=True
import collections
from tqdm.auto import tqdm
import numpy as np
import pandas as pd
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

MODEL_NAME = '../input/xlm-roberta-squad2/deepset/xlm-roberta-large-squad2'


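max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting is needed.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# The original (unsimplified) functions below reference `pad_on_right`. As in the huggingface notebook,
# it is True when the tokenizer pads on the right (question first, context second).
pad_on_right = tokenizer.padding_side == "right"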

# prepare_train_features()

This function has a lot of code which makes it look overwhelming, but conceptually it is not. I have rewritten it with two auxiliary functions to make it more accessible.

It is doing two operations:
1. Tokenizing the input
2. Adding the start and end position of the answer to the tokenized data


The original code is hidden below. It has 77 lines of code. The version used in all the training notebooks is the same as the one in the tutorial notebook.

Of those 77 lines, 22 belong to part 1 (tokenizing) and 55 to part 2 (finding the start and end positions of the answer).

In [None]:
def prepare_train_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

In [None]:
def prepare_train_features(examples):
    # Tokenize
    tokenized_examples = tokenize(examples) 
    
    # Set the start and end position tokens
    tokenized_examples = set_start_and_end_positions(tokenized_examples, examples['answers']) 
    return tokenized_examples

## prepare_train_features() part 1: Tokenization

The tokenization part has some particularities, but it's still a basic usage of huggingface's tokenizer. You can get a good overview of all the relevant arguments by following [this tutorial](https://huggingface.co/transformers/preprocessing.html).

Also, [The tokenization pipeline](https://huggingface.co/docs/tokenizers/python/latest/pipeline.html#the-tokenization-pipeline) is an excellent reference for the Tokenizer.

To make it simpler, we will remove the `pad_on_right` clause for now. We will assume the model pads on its right, which means that the question goes first and the context goes second.

Let's see the code and explain it right below:

In [None]:
def tokenize(examples):
    examples["question"] = [q.lstrip() for q in examples["question"]]
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length"
    )
    return tokenized_examples

It is not uncommon in NLP tasks to receive two pieces of text as the input. This is why the tokenizer is prepared to accept two string parameters as its first two arguments.

It can also receive a batch (a list) of elements as each of those parameters, but let's start with a simple call with plain strings and build on top of that.

In [None]:
question = "What was the World War II?"
ctx = """World War II, often abbreviated as WW2, was a global war that lasted from 1939 to 1945. It involved the vast majority of the world's countries forming two opposing military alliances: the Allies and the Axis powers."""

The simplest returned value from a `tokenizer` call has the keys `input_ids` and `attention_mask`.

In [None]:
tokenized_example = tokenizer(question, ctx)
print(tokenized_example)

In [None]:
tokenized_example.keys()

In [None]:
# The attention mask for now is full of ones.
# It will come into play with padding and max_length in a moment.
print(tokenized_example['attention_mask'])

### input_ids
`input_ids` holds the tokenized question and context, concatenated and wrapped with the special tokens `<s>` and `</s>` (signaling "sequence start" and "sequence end"), already encoded as integers using the vocabulary indexes.

Note that this is the common way to pass a pair of sentences to a transformer model: the transformer always receives a single sequence. The fact that there are two pieces of text is encoded _within_ that sequence using the special separator tokens.
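
To make that layout concrete, here is a minimal sketch in plain Python, with no tokenizer involved. The helper `build_pair` and the word-level "tokens" are made up for illustration; the real tokenizer works on sub-word pieces, but for XLM-RoBERTa the special tokens are laid out the same way. The second returned list mimics what `sequence_ids()` reports later in this notebook: `0` for the question, `1` for the context and `None` for special tokens.

```python
def build_pair(question_tokens, context_tokens):
    """Toy version of how a (question, context) pair becomes a single sequence."""
    tokens = ["<s>"] + question_tokens + ["</s>", "</s>"] + context_tokens + ["</s>"]
    sequence_ids = (
        [None]                           # <s>
        + [0] * len(question_tokens)     # question
        + [None, None]                   # </s></s> separator
        + [1] * len(context_tokens)      # context
        + [None]                         # final </s>
    )
    return tokens, sequence_ids


tokens, sequence_ids = build_pair(["What", "?"], ["A", "war"])
for token, sequence_id in zip(tokens, sequence_ids):
    print(f"{token:6s} {sequence_id}")
```

The model never sees two separate inputs: it sees one list, and only the separator tokens tell it where the question ends and the context begins.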

In [None]:
print(tokenized_example['input_ids'])

In [None]:
# With decode we can go back from the token ids to the tokens
tokenizer.decode(tokenized_example['input_ids'])

### padding="max_length"
`padding="max_length"` will put dummy tokens (`1`) at the end of the input sequence till it reaches the max_length.

This will affect the `attention_mask` output as well, setting to `0` all the positions associated with the dummy tokens, signaling to the model that they are not relevant pieces of information.
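
As a rough sketch of what that means, the snippet below pads a list of ids by hand. The function `pad_to_max_length` is hypothetical (it is not part of huggingface) and the id `4865` is an arbitrary placeholder; only the pad id `1` comes from the text above.

```python
def pad_to_max_length(input_ids, max_length, pad_id=1):
    """Pad on the right with pad_id and build the matching attention mask."""
    n_pad = max_length - len(input_ids)
    if n_pad < 0:
        raise ValueError("input is longer than max_length; it needs truncation, not padding")
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        # 1 = real token, 0 = padding the model should ignore
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


padded = pad_to_max_length([0, 4865, 2], max_length=6)
print(padded["input_ids"])
print(padded["attention_mask"])
```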


In [None]:
tokenized = tokenizer(question, ctx, padding="max_length", max_length=100)
tokenized

In [None]:
# The token with id 1 is the special token <pad>
tokenizer.decode(tokenized['input_ids'])

### truncation, stride, return_overflowing_tokens

This set of arguments requires a more extensive explanation. Stick with me till the end, please:

* A model can only process sequences up to a certain amount of tokens, the maximum sequence length (`max_length`, which in this case is 384)
* A common practice is to truncate the sequences to that length, using `truncation=True`
* But for QA, since truncation can drop the answer, we cannot do just that. We need to do something smarter.
* If the concatenation of the question and the context is longer than the max_length, we will split the context into several pieces, creating multiple train samples from a single question-context pair. These samples are referred to as "features" in all the code.
* These train samples / "features" have the following form: (question, part 1 of the context), (question, part 2 of the context),... (question, part n of the context)
* Some of these train samples will have an answer; some won't.
* This is controlled by the parameters `truncation="only_second"` and `return_overflowing_tokens=True`
* An overlap between context splits is allowed to avoid the edge case when an answer is cut by the split. This is controlled by the `stride` parameter, which provides the size of this overlap.


The problem is explained in much more detail in the notebook [Question Answering Tutorial](https://www.kaggle.com/thedrcat/question-answering-tutorial) by thedrcat.
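
The splitting with overlap described in the list above can be sketched in a few lines of plain Python. This is a simplification under stated assumptions: `split_context` is a made-up helper, the "tokens" are just integers, and the real tokenizer also reserves room in every window for the question and the special tokens.

```python
def split_context(context_tokens, window, overlap):
    """Split a token list into windows of at most `window` tokens.

    Consecutive windows share `overlap` tokens, so an answer cut by one
    boundary has a chance of appearing whole in the next window.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(context_tokens[start:start + window])
        if start + window >= len(context_tokens):
            break
        start += step
    return chunks


toy_context = list(range(10))  # stand-in for 10 context tokens
chunks = split_context(toy_context, window=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each printed window starts with the last token of the previous one, which is the role `stride` plays in the tokenizer call.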


In [None]:
# Call with max_length=40 and stride=5
tokenized = tokenizer(question, ctx, truncation="only_second", stride=5, max_length=40, return_overflowing_tokens=True)
tokenized

In [None]:
# The previous call created 3 tokenized samples for our unique (question, context)
len(tokenized['input_ids'])

In [None]:
# See the length of each sample (40, 40, 21)
# Also note that the question is always present (we only want to split the context, that's why truncation is set to "only_second")
# In this case, we allowed an overlap of 5 tokens
for tokens in tokenized['input_ids']:
    print(len(tokens), tokenizer.decode(tokens))

Finally, note that the result has a new key, `overflow_to_sample_mapping`. This is just the mapping from the final samples to the original inputs. Since here there was only one question, all the samples were generated from it and we have `[0, 0, 0]` (3 samples, all generated from the first input, which has index 0).

This value will be used in the following step.
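
With more than one input the mapping becomes more interesting. The values below are made up (a batch of two inputs, the first split into 3 features and the second into 2), just to show how the mapping lets us group features by their original example:

```python
from collections import defaultdict

# Hypothetical mapping for a batch of 2 inputs: features 0-2 come from input 0, features 3-4 from input 1
overflow_to_sample_mapping = [0, 0, 0, 1, 1]

features_per_example = defaultdict(list)
for feature_index, example_index in enumerate(overflow_to_sample_mapping):
    features_per_example[example_index].append(feature_index)

print(dict(features_per_example))
```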

In [None]:
tokenized['overflow_to_sample_mapping']

### return_offsets_mapping

The last argument: `return_offsets_mapping=True`. This argument will add the key `offset_mapping` to the output, which will be used, along with `overflow_to_sample_mapping` in the next step.

This is a full call with all the arguments for the tokenizer by the way:

In [None]:
tokenized = tokenizer(question, ctx, 
                      truncation="only_second", 
                      stride=10, 
                      max_length=80, 
                      return_overflowing_tokens=True, 
                      return_offsets_mapping=True, 
                      padding="max_length")
tokenized

The offset mapping maps the tokens to their character start and end positions in the string they come from:

In [None]:
tokens = tokenized['input_ids'][0]
offsets = tokenized['offset_mapping'][0]

In [None]:
tokens[:10]

In [None]:
offsets[:10]

In [None]:
# The id of token at index 4
tokens[4]

In [None]:
# Which is this token
tokenizer.decode(tokens[4])

In [None]:
# Here, we get the same word using the offset mapping
start, end = offsets[4]
question[start:end]

This map will be required in the following part. 
For now, we have covered all the Tokenizer part of `prepare_train_features()`,... that was a lot!

By the way, this step is the same in `prepare_validation_features()`, so... that counts twice.

Nice work!

Grab a coffee, and let's jump into the second part:


## prepare_train_features() part 2: Adding start and end positions keys to the tokenized data

For this part, we will go over a high-level description first, and dig into the commented code after.

For each training sample, the model requires the keys `start_positions` and `end_positions`. These keys hold the index of the first and the last token of the answer within the full token input (that is, the question and the context concatenated). This applies when the answer is present in that piece of the context. If there is no answer in that piece, both values are set to the index of the CLS token (the special "classification" token, `<s>` for this tokenizer).

Originally, the SQuAD dataset has the keys `answer_start` and `text`, where `answer_start` signals the _character_ (and not the token) in which the answer starts in the context.
This function maps from `answer_start` and `text` to `start_positions` and `end_positions`, taking care of the fact that more than one feature (tokenized split sample) might exist for each original sample (due to the split potentially performed in the previous tokenization step).
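
Before looking at the real function, here is a stripped-down sketch of the same idea. It assumes a toy whitespace "tokenizer" built with `re`, and `char_span_to_token_span` is a made-up helper, not the huggingface code: the point is only to show how character offsets let us jump from a character span to a token span.

```python
import re


def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to (first_token, last_token).

    Returns None when the span is not fully covered by `offsets`, which is the
    case the real function labels with the CLS index.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    # Last token that starts at or before the first answer character
    first_token = max(i for i, (start, _) in enumerate(offsets) if start <= start_char)
    # First token that ends at or after the end of the answer
    last_token = min(i for i, (_, end) in enumerate(offsets) if end >= end_char)
    return first_token, last_token


text = "World War II, often abbreviated as WW2, was a global war that lasted from 1939 to 1945."
# Toy "tokenizer": one token per whitespace-separated word, with its (start, end) offsets
offsets = [m.span() for m in re.finditer(r"\S+", text)]
tokens = [text[start:end] for start, end in offsets]

answer_text = "a global war"
start_char = text.index(answer_text)
end_char = start_char + len(answer_text)

span = char_span_to_token_span(offsets, start_char, end_char)
print(span, tokens[span[0]:span[1] + 1])

# A split that stops before the answer gets no span at all
print(char_span_to_token_span(offsets[:5], start_char, end_char))
```

The real function does the same search with two moving pointers instead of `max` and `min`, and it has to skip the question tokens first.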

First, we will verify this behaviour running with only one example. After that, a commented version of the function will be presented in order to understand the underlying code.

In [None]:
def set_start_and_end_positions(tokenized_examples, example_answers):
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples.pop("offset_mapping")
    
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)
        sequence_ids = tokenized_examples.sequence_ids(i)
        sample_index = sample_mapping[i]
        answers = example_answers[sample_index]
        
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)
    return tokenized_examples


In [None]:
# We need the answer start and the answer text for this second part:
question, ctx[44:56]

In [None]:
answers = [{'answer_start': [44], 'text': ['a global war']}]

In [None]:
# This is one of the examples we went over previously, it's the one that generates 3 splits
tokenized_example = tokenizer(question, ctx, truncation="only_second", stride=10, max_length=40, return_overflowing_tokens=True, return_offsets_mapping=True, padding="max_length")
tokenized_example

In [None]:
final_result = set_start_and_end_positions(tokenized_example, answers)
final_result

In [None]:
# This means: splits 2 and 3 don't have the answer, split 1 has the answer and it starts in token 25
final_result['start_positions']

In [None]:
# This means: splits 2 and 3 don't have the answer, split 1 has the answer and it ends in token 27
final_result['end_positions']

In [None]:
# We can verify it easily:
tokenizer.decode(final_result['input_ids'][0][25:27+1])

Let's get into the code now. A caveat: the function is long, but what it does is not that complex to grasp conceptually, so understanding _the details_ is not very important, as long as you understand its general behaviour. 

It has 4 pointers in 2 different worlds, making it obscure. Don't get me wrong: I am not saying it is _bad code_ at all. I'm just saying it's a simple yet annoying task. 

That said, let's dig into it:

In [None]:
def set_start_and_end_positions(tokenized_examples, example_answers):
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples.pop("offset_mapping")
    
    # The goal of this function is to populate these two lists
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    # Go over all the final training samples (already split)
    for i, offsets in enumerate(offset_mapping):
        
        input_ids = tokenized_examples["input_ids"][i]
        
        # Get the index of the CLS token, will be used to fill samples with no answer
        cls_index = input_ids.index(tokenizer.cls_token_id)
        
        # Sequence ids is a mask: 0 for question tokens, 1 for context tokens, None for special tokens
        sequence_ids = tokenized_examples.sequence_ids(i)
        
        # Go back to the original set of samples (before the split)
        sample_index = sample_mapping[i]
        # And get the answer for it
        answers = example_answers[sample_index] 
        
        
        # If there is no answer set, use the index of the CLS special token
        # (I don't think this case applies in ChaII)
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start and end of the answer in the original full context 
            # (char position, 44 and 56 in the previous example)
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])
            
            # The two whiles below are a weird way to initialize two pointers 
            # to the context start and end
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # If the answer is not in this split of the context, return CLS special token
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Advance with the start token pointer till the start_char is reached
                while (token_start_index < len(offsets) 
                       and offsets[token_start_index][0] <= start_char):
                    token_start_index += 1
                # That is our start position
                tokenized_examples["start_positions"].append(token_start_index - 1)
                
                # Go back with the end token pointer till the end_char is reached
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                # That is our end position
                tokenized_examples["end_positions"].append(token_end_index + 1)
    return tokenized_examples

That was it for `prepare_train_features()`. To recap: it 1) tokenizes the input and then 2) adds the start and end positions:

```python
def prepare_train_features(examples):
    # Tokenize
    tokenized_examples = tokenize(examples)
    # Set the start and end position tokens
    tokenized_examples = set_start_and_end_positions(tokenized_examples, examples['answers']) 
    return tokenized_examples
```

Finally, below we will execute a full call to `prepare_train_features`. Note that the `max_length` is set to `384`, so this short input only gets _padding_ and no _truncation_ (splitting) happens. You should be able to identify all the components if you look with patience:

In [None]:
# I named it plural to signal that it's a batch (although it's a batch of length 1)
examples = {'question': [question], 'context': [ctx], 'answers': [{'answer_start': [44], 'text': ['a global war']}], 'id': ['ww2-q1']}
tokenized_examples = prepare_train_features(examples)
tokenized_examples

# Evaluation-related pre and post-processing


What comes next is, again, technically fiddly but conceptually simple enough.

Let's start by understanding what the Question Answering model returns, which will make the understanding of everything else much easier.


## The predictions of the QA head
For a given sample, the model should find a `start_position` and an `end_position` for the answer.
These positions are token indices. Concretely, the model works as a kind of double classifier.

For the `start_position`, it will output a probability distribution over all the possible tokens.
The ground truth, from this perspective, is a one-hot encoded vector with zeros in all positions and a one in the position of the token starting the answer.
This can be evaluated with cross entropy and, in fact, it is. The code below is the actual code of the RoBERTa model in huggingface
(you can check it yourself on GitHub [here](https://github.com/huggingface/transformers/blob/master/src/transformers/models/roberta/modeling_roberta.py#L1542)):

```python
loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
start_loss = loss_fct(start_logits, start_positions)
end_loss = loss_fct(end_logits, end_positions)
total_loss = (start_loss + end_loss) / 2
```

For the `end_position`, it will generate a probability distribution as well. And this probability distribution can be evaluated against the one-hot-encoded vector of the actual end position.

Each one of these problems is a common classification problem. Consider, for example, MNIST and digit recognition. For a sample showing a `3`, the actual answer would be encoded as `y_true = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]` (where there is a 1 in the fourth position, corresponding to 3 starting from zero), while a prediction would be a probability distribution over the 10 possible values: `y_proba = [0.01, 0.0, 0.05, 0.44, 0.15, 0.0, 0.05, 0.02, 0.25, 0.03]`

In the case of the `start_position`, instead of 10 possible digits, we have `n` possible tokens, and only one of them is the actual starting-position token (the one which will have the `1` in the `y_true`).
The model then generates a distribution over all the tokens, and with cross entropy we get a loss, that will help the model to learn in the subsequent iterations.

The same happens for the `end_position` and in the last line, the final reported loss is just the average of both losses. 

A little caveat: the softmax layer is not applied, so there is no probability distribution but logits, which can be translated to probabilities easily and provide a ranking or a score for the possible answers (the highest logit will map to the highest probability and so on).
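
To see the arithmetic without any framework, here is a small pure-Python sketch of that loss. The logits and positions are toy values made up for a 4-token input, and the helpers `softmax` and `cross_entropy` are written by hand only to mirror what the `CrossEntropyLoss` snippet above computes for a single sample.

```python
import math


def softmax(logits):
    """Turn logits into probabilities (subtracting the max for numerical stability)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the true position."""
    return -math.log(softmax(logits)[target])


# Toy example: 4 tokens, the answer starts at token 1 and ends at token 2
start_logits = [0.1, 2.0, -1.0, 0.3]
end_logits = [-0.5, 0.2, 3.0, 0.0]
start_position, end_position = 1, 2

start_loss = cross_entropy(start_logits, start_position)
end_loss = cross_entropy(end_logits, end_position)
total_loss = (start_loss + end_loss) / 2
print(f"start: {start_loss:.4f}  end: {end_loss:.4f}  total: {total_loss:.4f}")
```

Note that softmax keeps the ranking of the logits, which is why the later steps can work with raw logits as scores.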

All in all, the conclusion of this section is the following:

<h2 style="text-align: center; background-color:#C8FF33;padding:40px;border-radius: 30px;">
    Question Answering is modeled as a double classification problem.
</h2>


Now let's see this in action:

In [None]:
import torch
from transformers import AutoModelForQuestionAnswering
# In order to pass the example to a model we need to cast it to torch tensors
# This is typically handled automatically by the 🤗 Datasets, but here we are
# outside of it
tokenized_examples = {k: torch.tensor(v) for k, v in tokenized_examples.items()}

# Load the model
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)

In [None]:
# Since the model is already fine-tuned for English QA (SQuAD 2.0), we can just use it out of the box
predictions = model(**tokenized_examples)
predictions.keys()

In [None]:
predictions.loss

In [None]:
# The 1 is the batch size
# The 384 is the total tokens (we have padded to the max length of the model, which is 384)

predictions.start_logits.size()

In [None]:
# These are the logits for the first 5 tokens. 
# A higher logit means the model considers 
# it is more probable it is the starting token of the answer
predictions.start_logits[0][:5]

In [None]:
# The token 25 is the one with the highest probability
torch.argmax(predictions.start_logits[0])

In [None]:
# Which is correct!
tokenized_examples['start_positions']

In [None]:
# All the same can be done for end_logits
torch.argmax(predictions.end_logits[0])

In [None]:
# The most probable end position is not the actual one (27) but 34.
tokenized_examples['end_positions']

Let's see the 10 most probable tokens for the start and for the end positions:

In [None]:
# Turn logits into probabilities
start_probas = torch.nn.functional.softmax(predictions.start_logits[0], dim=0)
# Get the token ids
token_ids = tokenized_examples['input_ids'][0]

# Iterate through the tokens and their probabilities, in descending order of probability
for i, (proba, token) in enumerate(sorted(zip(start_probas, token_ids), key=lambda x: x[0], reverse=True), 1):
    # Print the first 10
    if i <= 10:  
        print(f"{i:2d}) {proba*100:5.2f}%  - {tokenizer.decode(token)}")

In [None]:
# Same for the end logits
end_probas = torch.nn.functional.softmax(predictions.end_logits[0], dim=0)
for i, (proba, token) in enumerate(sorted(zip(end_probas, token_ids), key=lambda x: x[0], reverse=True), 1):
    if i <= 10:  
        print(f"{i:2d}) {proba*100:5.2f}%  - {tokenizer.decode(token)}")

With this already on the table, we can discuss what the next two functions do.
They handle three different problems, each of them very simple in its nature:


**1. Map from two logits to a string**

First of all, an answer is built from a `start_position` and an `end_position`. The answer with the highest probability, which is what the model would return if we simply took the argmax of each set of logits, starts with the token `a` and ends with the token `1945`: `a global war that lasted from 1939 to 1945`.

We still need to implement this, though: all the way from the 2 logits to the string.

**2. Get a _valid_ answer**

On the other hand, the start and end positions obtained from the top logits might be invalid. For example, the end position might point to a token that comes before the one pointed to by the start position. Or they might point to the question section of the input and not to the context.

These are two consistency checks:
1. Start position is before end position
2. Both positions point to the context and not somewhere else


So a more thoughtful approach to obtaining the answer would be to get a set of the most probable ones and keep the best scored valid one.
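
A minimal sketch of that idea, under a few assumptions: `best_valid_span` is a made-up helper, the logits are toy values, `context_mask` marks which tokens belong to the context, and a candidate is scored as the sum of its start and end logits.

```python
def best_valid_span(start_logits, end_logits, context_mask, n_best=3):
    """Return (score, start, end) for the best valid span, or None if there is none."""

    def top_indexes(logits):
        return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:n_best]

    best = None
    for start in top_indexes(start_logits):
        for end in top_indexes(end_logits):
            # Check 2: both positions must point to the context
            if not (context_mask[start] and context_mask[end]):
                continue
            # Check 1: the start must not come after the end
            if end < start:
                continue
            score = start_logits[start] + end_logits[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best


# Toy input: tokens 0-2 are question/special tokens, tokens 3-5 are the context
context_mask = [False, False, False, True, True, True]
start_logits = [0.0, 0.1, 0.2, 5.0, 1.0, 0.3]
end_logits = [0.0, 0.1, 4.0, 0.2, 3.0, 0.5]

best = best_valid_span(start_logits, end_logits, context_mask)
print(best)
```

Here the raw argmax of the end logits is token 2, which sits in the question and before the best start token, so the naive answer is invalid and the next best end position gets picked instead.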


**3. Rollback context splitting**

Finally, we have to roll back the context splitting that we covered extensively in the previous section.
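
In its simplest form, rolling back means collecting the best candidate of every feature and keeping a single answer per original example. The sketch below uses made-up candidates and scores (only the id `ww2-q1` comes from the example above; `toy-q2` is hypothetical) and is not the actual post-processing code:

```python
# One best candidate per feature; several features can share the same example_id
candidates = [
    {"example_id": "ww2-q1", "feature": 0, "score": 8.0, "text": "a global war"},
    {"example_id": "ww2-q1", "feature": 1, "score": 2.5, "text": "the Allies"},
    {"example_id": "toy-q2", "feature": 2, "score": 1.0, "text": "1939"},
    {"example_id": "toy-q2", "feature": 3, "score": 4.2, "text": "1945"},
]

best_per_example = {}
for candidate in candidates:
    current = best_per_example.get(candidate["example_id"])
    if current is None or candidate["score"] > current["score"]:
        best_per_example[candidate["example_id"]] = candidate

answers = {example_id: c["text"] for example_id, c in best_per_example.items()}
print(answers)
```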


With these three tasks in mind, we can jump into the two remaining functions:



# prepare_validation_features()

`prepare_validation_features` does three things:
1. It tokenizes, exactly as in the train case
2. It adds an "example_id" key to each tokenized example, with a reference to the original sentence the sample belonged to, before the context splitting.
3. It modifies the key "offset_mapping" of each tokenized example, to make it easy to map from the logits to the string later on (in `postprocess_qa_predictions`)

The actual code is hidden in the cell below, and it has 45 lines of code.
Below, visible, you will find a modified version, with the same exact functionality, using two auxiliary functions. 

The first one is the `tokenize` function, exactly the same as the one used for `prepare_train_features`. The second one is new and defined below: `add_example_id_and_modify_offset_mapping`. It performs steps 2 and 3 of the enumeration above. I added my own comments to it.

In [None]:
def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

In [None]:
def prepare_validation_features(examples):
    tokenized_examples = tokenize(examples)
    tokenized_examples = add_example_id_and_modify_offset_mapping(tokenized_examples, examples["id"])
    return tokenized_examples

In [None]:
def add_example_id_and_modify_offset_mapping(tokenized_examples, example_ids):
    context_index = 1 
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        
        sample_index = sample_mapping[i]
        # Populate the "example_id" key with a reference
        # to the original sample (before the context split phase)
        # This is going to be used downstream
        tokenized_examples["example_id"].append(example_ids[sample_index])
        
        # The offset_mapping maps tokens to chars in the original piece of text
        # Here we are setting to None all the tokens that don't belong to the context
        # This is going to be used downstream
        sequence_ids = tokenized_examples.sequence_ids(i)
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]
    return tokenized_examples

Ok, let's check it with our simple WW2 example:

In [None]:
examples

In [None]:
# It's very similar to prepare_training_features
# with example_id (which would have been useful had the input been large enough)
# and a lot of None in the offset_mapping
tokenized_validation_examples = prepare_validation_features(examples)
tokenized_validation_examples

Let's use a larger example, where the `example_id` key will be actually useful:

In [None]:
large_ctx = """World War II or the Second World War, often abbreviated as WWII or WW2, was a global war that lasted from 1939 to 1945. It involved the vast majority of the world's countries—including all of the great powers—forming two opposing military alliances: the Allies and the Axis powers. In a total war directly involving more than 100 million personnel from more than 30 countries, the major participants threw their entire economic, industrial, and scientific capabilities behind the war effort, blurring the distinction between civilian and military resources. Aircraft played a major role in the conflict, enabling the strategic bombing of population centres and the only two uses of nuclear weapons in war to this day. World War II was by far the deadliest conflict in human history; it resulted in 70 to 85 million fatalities, a majority being civilians. Tens of millions of people died due to genocides (including the Holocaust), starvation, massacres, and disease. In the wake of the Axis defeat, Germany and Japan were occupied, and war crimes tribunals were conducted against German and Japanese leaders.
World War II is generally considered to have begun on 1 September 1939, when Nazi Germany, under Adolf Hitler, invaded Poland. The United Kingdom and France subsequently declared war on Germany on the 3rd of September. Under the Molotov–Ribbentrop Pact of August 1939, Germany and the Soviet Union had partitioned Poland and marked out their "spheres of influence" across Finland, Romania and the Baltic states. From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan (along with other countries later on). Following the onset of campaigns in North Africa and East Africa, and the fall of France in mid-1940, the war continued primarily between the European Axis powers and the British Empire, with war in the Balkans, the aerial Battle of Britain, the Blitz of the UK, and the Battle of the Atlantic. On 22 June 1941, Germany led the European Axis powers in an invasion of the Soviet Union, opening the Eastern Front, the largest land theatre of war in history and trapping the Axis powers, crucially the German Wehrmacht, in a war of attrition."""

In [None]:
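# https://stackoverflow.com/questions/2465921/how-to-copy-a-dictionary-and-only-edit-the-copy
# Copy the dictionary AND its lists: dict(examples) alone would share the lists,
# so the appends below would also modify the original `examples`
large_examples = {key: list(values) for key, values in examples.items()}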
large_examples

In [None]:
large_examples['question'].append("When did the WW2 begin?")
large_examples['context'].append(large_ctx)
large_examples['answers'].append({'answer_start': [large_ctx.find("1 September")], 'text': ['1 September 1939']})
large_examples['id'].append('ww2-q2')
large_examples

In [None]:
tokenized_validation_large_examples = prepare_validation_features(large_examples)
# 3 features in total: the second example generated 2 of them
len(tokenized_validation_large_examples['input_ids'])

In [None]:
# The second and third features reference ww2-q2 as their originating example
tokenized_validation_large_examples['example_id']

# postprocess_qa_predictions()

Ok, now that we have `prepare_validation_features()`, we can dig into the last function: `postprocess_qa_predictions()`

The actual code is hidden in the next code cell. It has ~84 lines.

A functionally equivalent, modularized version with auxiliary functions is visible below. Hopefully, it will help in the learning process.

In [None]:
def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map the positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions


N_BEST = 20
MAX_ANSWER_LENGTH = 30

def build_feature_per_example_map(examples, features):
    # Build a map from each example to its corresponding features.
    example_id_to_index = {k["id"]: i for i, k in enumerate(examples)}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)
    
    return features_per_example


def is_valid_answer(start_index, end_index, offset_mapping):

    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
    # to part of the input_ids that are not in the context.
    if (
        start_index >= len(offset_mapping)
        or end_index >= len(offset_mapping)
        or offset_mapping[start_index] is None
        or offset_mapping[end_index] is None
    ):
        return False
    # Don't consider answers with a length that is either < 0 or > MAX_ANSWER_LENGTH.
    if end_index < start_index or end_index - start_index + 1 > MAX_ANSWER_LENGTH:
        return False

    return True

def get_valid_answer(start_index, end_index, offset_mapping, start_logits, end_logits, context):
    start_char = offset_mapping[start_index][0]
    end_char = offset_mapping[end_index][1]
    answer = {
            "score": start_logits[start_index] + end_logits[end_index],
            "text": context[start_char: end_char]
        }
    return answer

def get_n_best_token_ids(logits):
    return np.argsort(logits)[-1 : -N_BEST - 1 : -1].tolist()


def get_best_answer(valid_answers):
    if len(valid_answers) > 0:
        best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        best_answer = best_answer["text"]
    else:
        best_answer = ""
    return best_answer


def get_valid_answers(example, features, logits, feature_indices):
    # Logits for all the features, not only these ones
    all_start_logits, all_end_logits = logits
    
    valid_answers = []
    for feature_index in feature_indices:
        # We grab the predictions of the model for this feature.
        start_logits = all_start_logits[feature_index]
        end_logits = all_end_logits[feature_index]
        start_indexes = get_n_best_token_ids(start_logits)
        end_indexes = get_n_best_token_ids(end_logits)
    
        offset_mapping = features[feature_index]["offset_mapping"]

        for start_index in start_indexes:
            for end_index in end_indexes:
                if is_valid_answer(start_index, end_index, offset_mapping):
                    answer = get_valid_answer(start_index, end_index, offset_mapping, start_logits, end_logits, example["context"])
                    valid_answers.append(answer)
    return valid_answers

def process_one_example(example, features, logits, feature_indices):
    valid_answers = get_valid_answers(example, features, logits, feature_indices)
    return get_best_answer(valid_answers)


The function receives 3 main arguments:
1. the original examples (what we are calling `examples`), 
2. the tokenized, split examples that `prepare_validation_features()` outputs (`tokenized_validation_examples`), as the argument `features` and 
3. the start and end logits returned by the model as predictions.

It first creates a map from the features to the original examples with the function `build_feature_per_example_map()`.

After that, it processes each original example, extracting all the answers from the features, keeping the valid ones, ranking them, and returning the best one.
I have wrapped all that code in `process_one_example()`, which we will cover in detail shortly.

We will unpack these two auxiliary functions below; since I want the high-level function to be runnable right away, I have defined them in the hidden cell above.

Let's see the high-level view of the function for now, and let's use it with our toy data:

In [None]:
def postprocess_qa_predictions(examples, features, logits):
    features_per_example = build_feature_per_example_map(examples, features)
    predictions = {}
    for example_index, example in enumerate(tqdm(examples)):
        # The features (final validation examples) that were originated by this example
        feature_indices = features_per_example[example_index]
        # Get the best answer for this example with process_one_example
        predictions[example["id"]] = process_one_example(example, features, logits, feature_indices)
    return predictions

The original code receives a Hugging Face Dataset as the first and the second input, and relies on that for some simple tasks.
In particular, it uses the "dual nature" of the Dataset (the fact that it offers both the interfaces of a list and of a dictionary).
Since our toy data is just a dictionary, I had to make a smallish change to a line of code and turn the dictionaries of lists into lists of dictionaries:


In [None]:
# https://stackoverflow.com/questions/5558418/list-of-dicts-to-from-dict-of-lists
examples_as_list = [dict(zip(examples,t)) for t in zip(*examples.values())]
tokenized_validation_examples_as_list = [dict(zip(tokenized_validation_examples,t)) for t in zip(*tokenized_validation_examples.values())]

In [None]:
# Get the start and end logits together in a tuple
start_logits = predictions.start_logits.detach().numpy()
end_logits = predictions.end_logits.detach().numpy()
logits = (start_logits, end_logits)

In [None]:
result = postprocess_qa_predictions(examples_as_list, tokenized_validation_examples_as_list, logits)

# The best result, the one we have previously checked
result

In [None]:
# Here, all the same but for large_examples, with the 2nd example generating 2 features
tokenized_large_train_examples = prepare_train_features(large_examples)
tokenized_large_train_examples = {k: torch.tensor(v) for k, v in tokenized_large_train_examples.items()}
large_preds = model(**tokenized_large_train_examples)

large_examples_as_list = [dict(zip(large_examples,t)) for t in zip(*large_examples.values())]
tokenized_validation_large_examples_as_list = [dict(zip(tokenized_validation_large_examples,t)) for t in zip(*tokenized_validation_large_examples.values())]

large_logits = (large_preds.start_logits.detach().numpy(), large_preds.end_logits.detach().numpy())
postprocess_qa_predictions(large_examples_as_list, tokenized_validation_large_examples_as_list, large_logits)

That's it, that's the interface. Let's get inside the code now: we have two functions left to cover, `build_feature_per_example_map` and `process_one_example`.

Let's start with the easy one, `build_feature_per_example_map()`, by checking its output for our known examples:

In [None]:
def build_feature_per_example_map(examples, features):
    example_id_to_index = {k["id"]: i for i, k in enumerate(examples)}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)    
    return features_per_example

In [None]:
# The example with index 0 has only one feature, the one in position 0
build_feature_per_example_map(examples_as_list, tokenized_validation_examples_as_list)

In [None]:
# Here, for large_examples, the example with index 1 (the second one) has 2 features associated with it, the 2nd and 3rd (with indexes 1 and 2)
build_feature_per_example_map(large_examples_as_list, tokenized_validation_large_examples_as_list)

FYI, I have slightly changed one line of the original code, from `example_id_to_index = {k: i for i, k in enumerate(examples["id"])}` to `example_id_to_index = {k["id"]: i for i, k in enumerate(examples)}`, because the original relies on the dual nature of the Dataset, which my list doesn't have.

`process_one_example()` is as follows:

In [None]:
def process_one_example(example, features, logits, feature_indices):
    # Get all the valid answers from the features
    valid_answers = get_valid_answers(example, features, logits, feature_indices)
    # Return the best one
    return get_best_answer(valid_answers)

In [None]:
def get_valid_answers(example, features, logits, feature_indices):
    # Logits for all the features, not only these ones
    all_start_logits, all_end_logits = logits
    
    # We will populate this list going through all the features for this example
    valid_answers = []
    
    for feature_index in feature_indices:
        # For each feature, get all the start and end logits
        start_logits = all_start_logits[feature_index]
        end_logits = all_end_logits[feature_index]
        
        # Get the indexes of the tokens with the highest ranked logits
        # We are getting the best ranked 20 starts and 20 ends
        start_indexes = get_n_best_token_ids(start_logits)
        end_indexes = get_n_best_token_ids(end_logits)
    
        offset_mapping = features[feature_index]["offset_mapping"]

        # Go through all the combinations of (start, end) looking for valid answers
        # These are 20x20 (400) with the current configuration
        for start_index in start_indexes:
            for end_index in end_indexes:
                # Check if (start_index, end_index) define a valid answer
                if is_valid_answer(start_index, end_index, offset_mapping):
                    # Extract the string defined by (start_index and end_index) if it is a valid one
                    answer = get_valid_answer(start_index, end_index, offset_mapping, start_logits, end_logits, example["context"])
                    valid_answers.append(answer)
    return valid_answers

Below we define the last four auxiliary functions with abundant comments:


```python
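def get_n_best_token_ids(logits, n_best=20): ...
def is_valid_answer(start_index, end_index, offset_mapping): ...
def get_valid_answer(start_index, end_index, offset_mapping, start_logits, end_logits, context): ...
def get_best_answer(valid_answers): ...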
```

In [None]:
MAX_ANSWER_LENGTH = 30

def get_n_best_token_ids(logits, n_best=20):
    # Sort a set of logits and get the indexes of the `n_best` largest ones
    return np.argsort(logits)[-1 : -n_best - 1 : -1].tolist()

def is_valid_answer(start_index, end_index, offset_mapping):
    # The end token is before the start token, doesn't make sense. Invalid.
    if end_index < start_index:
        return False
    
    # Length exceeds a certain predefined amount of tokens. Invalid.
    length = end_index - start_index + 1
    if length > MAX_ANSWER_LENGTH:
        return False
    
    # Indexes point to somewhere outside of the current tokens. Invalid.
    # Honestly, I am not sure this case can actually happen.
    if start_index >= len(offset_mapping) or end_index >= len(offset_mapping):
        return False
    
    # The start or end indexes point outside of the CONTEXT. Invalid!
    # The offset_mapping was prepared for this in prepare_validation_features()
    # Setting to None all the tokens that are part of the question or special tokens
    if offset_mapping[start_index] is None or offset_mapping[end_index] is None:
        return False

    # In any other case, the start and end indexes are pointing to a valid answer
    return True


def get_valid_answer(start_index, end_index, offset_mapping, start_logits, end_logits, context):
    # Map from token indexes to char indexes of the context
    start_char = offset_mapping[start_index][0]
    end_char = offset_mapping[end_index][1]
    
    answer = {
            # Extract the text
            "text": context[start_char: end_char],
            # Get the sum of the logits as a simple-enough way of scoring (and ranking)
            # different valid answers
            "score": start_logits[start_index] + end_logits[end_index],
            
        }
    return answer

def get_best_answer(valid_answers):
    if len(valid_answers) > 0:
        # Get the answer with the best score
        # The score was defined as the sum of the logits of the start and end indexes
        best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        best_answer = best_answer["text"]
    else:
        # If no valid answers were found, return the empty string
        best_answer = ""
    return best_answer

And that was it; we have gone through all the pre- and post-processing code relevant for the current competition.

Congratulations! 

# Wrapping up

To sum up, let's go through a high-level description of the functionality each of the three functions implements:

### prepare_train_features()
This function turns the original train examples into train features by tokenizing them. When a context is too long, it splits it and generates more than one "train feature" (tokenized train example).
It also converts the SQuAD-like target (`answer_start`, pointing to a char, and `answer_text`) into the model-friendly `start_position` and `end_position`, which point to tokens in the feature.
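The char-to-token conversion can be sketched as follows. This is a simplified illustration, not the actual implementation: it tokenizes on whitespace instead of using the real tokenizer, and the helper names `whitespace_offsets` and `char_span_to_token_span` are made up for this example. The real function also has to skip question tokens and handle answers that fall outside the current feature.

```python
def whitespace_offsets(text):
    # (start_char, end_char) for every whitespace-separated "token"
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        offsets.append((start, start + len(word)))
        pos = start + len(word)
    return offsets


def char_span_to_token_span(offsets, answer_start, answer_text):
    answer_end = answer_start + len(answer_text)
    # Token that contains the first char of the answer
    start_token = next(i for i, (s, e) in enumerate(offsets) if s <= answer_start < e)
    # Token that contains the last char of the answer
    end_token = next(i for i, (s, e) in enumerate(offsets) if s < answer_end <= e)
    return start_token, end_token


context = "WW2 began on 1 September 1939"
answer_text = "1 September 1939"
answer_start = context.find(answer_text)

offsets = whitespace_offsets(context)
start_position, end_position = char_span_to_token_span(offsets, answer_start, answer_text)
print(start_position, end_position)
```

Going the other way, `context[offsets[start_position][0]:offsets[end_position][1]]` recovers the answer string, which is the same trick `postprocess_qa_predictions()` relies on.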

### prepare_validation_features()
This function tokenizes the input and handles long contexts, mimicking the previous one. It adds a back-reference from features to their originating examples (`example_id`) and massages the `offset_mapping` to help `postprocess_qa_predictions()` down the line.

### postprocess_qa_predictions()
This function gets one unique answer for an original example, starting with the model predictions for all the generated features for that example. It performs various steps for this:
1. It gets a list of ranked valid answers from the predicted logits for each given feature performing various consistency checks of the pair (start, end). 
2. It translates from tokens to chars, extracting the final string answer.
3. It takes care of going back from the features to the original example.



# References
* [🤗 QA fine-tuning example reference](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb)
* [Question Answering Tutorial](https://www.kaggle.com/thedrcat/question-answering-tutorial) by thedrcat.
* [🤗 Preprocessing tutorial](https://huggingface.co/transformers/preprocessing.html)
* [The tokenization pipeline](https://huggingface.co/docs/tokenizers/python/latest/pipeline.html#the-tokenization-pipeline)
* [HuggingFace course](https://huggingface.co/course/chapter1)



## What's next?

Stay tuned, I'm working on the next 2 notebooks of the series: `Exploring Public Models Revisited` and `Reviewing squad2, mlqa and others`




## Remember to upvote the notebook if you found it useful! 🤗

