<a href="https://colab.research.google.com/github/Rami-RK/HugingFace_Transformers/blob/main/Bert_QA_Squad.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Bert for Extractive Q&A**

### **Learning Objectives:**
At the end of the experiment, you will be able to understand :
* processing & tokenizing of inputs/raw data
* converting logits into string output
* how to fine-tune a transformer for context-based question-answering
* train, evaluate, save the model as a apipeline and use for inference

### **Introduction**

This notebook aims to study the use of the BERT model for question-answering tasks by training a BERT model and fine-tuning it. The SQuAD (Stanford Question Answering Dataset) is going to be used which contains a question, and a passage of text containing the answer.BERT needs to highlight the "span" of text corresponding to the correct answer. In this assignment, we'll be using a Huggingface transformer.

**Highlights:**

* Data Set : SQuAD = Stanford Question Answering Data

* Task: Extractive question-answering:

* Input : Context + question

* Output : answer (substring of context)

* No need to generate text, no need for encoder-decoder architecture.
* Bert (encoder) is used to tackle this problem  

#### **BERT Input Format**

Let us understand about BERT input format. To feed a QA task into BERT, we pack both the question and the reference text into the input. Look into the below diagram shows the input format in a more comprehensive way.

<center>
<img src= "https://drive.google.com/uc?export=view&id=1dfgTaE_SABpr2blqwTjq9PTyhYabO8_m" width=700px/>
</center>

The two pieces of text are separated by the special `[SEP]` token. Further, the BERT also uses "Segment Embeddings" to differentiate the question from the reference text. These are simply two embeddings (for segments "A" and "B") that BERT learned, and which it adds to the token embeddings before feeding them into the input layer.

Now, let us understand about start and end token classifiers that helps the algorithm to mark the piece of text that we want to find.

#### **Start & End Token Classifiers**

BERT needs to highlight a "span" of text containing the answer--this is represented as simply predicting which token marks the start of the answer, and which token marks the end.


<center>
<img src= "http://www.mccormickml.com/assets/BERT/SQuAD/start_token_classification.png" width=600px/>
</center>

For every token in the text, we feed its final embedding into the start token classifier. The start token classifier only has a single set of weights (represented by the blue "start" rectangle in the above illustration) which it applies to every word.

After taking the dot product between the output embeddings and the 'start' weights, we apply the softmax activation to produce a probability distribution over all of the words. Whichever word has the highest probability of being the start token is the one that we pick.

We repeat this process for the end token--we have a separate weight vector this.

<center>
<img src= "http://www.mccormickml.com/assets/BERT/SQuAD/end_token_classification.png" width=500px/>
</center>



### Install huggingface transformers library

In [None]:
! pip install -U accelerate
! pip install -U transformers

Collecting accelerate
  Downloading accelerate-0.22.0-py3-none-any.whl (251 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/251.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━[0m [32m112.6/251.2 kB[0m [31m3.2 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m251.2/251.2 kB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: accelerate
Successfully installed accelerate-0.22.0
Collecting transformers
  Downloading transformers-4.32.1-py3-none-any.whl (7.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.5/7.5 MB[0m [31m16.9 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.8/268.8 kB[0m [31m28.2 MB/s[0m eta [36m0:00:00[0m
Collecting toke

### **Loading the data & Exploration**

In [None]:
!pip install transformers datasets

Collecting datasets
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m519.3/519.3 kB[0m [31m4.6 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m115.3/115.3 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
Collecting xxhash (from datasets)
  Downloading xxhash-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m9.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m7.8 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: xxhash, dill, multiprocess, datasets
Successfully installed datasets-2.14.

In [None]:
from datasets import load_dataset
raw_datasets=load_dataset('squad')
raw_datasets

Downloading builder script:   0%|          | 0.00/5.27k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.36k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/7.67k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/8.12M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

* **Example data**

Training sample 1

In [None]:
raw_datasets['train'][1]

{'id': '5733be284776f4190066117f',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'What is in front of the Notre Dame Main Building?',
 'answers': {'text': ['a copper statue of Christ'], 'answer_start': [188]}}

**Context**

In [None]:
raw_datasets['train'][1]['context']

'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'

**Question**

In [None]:
raw_datasets['train'][1]['question']

**Answers**

In [4]:
raw_datasets['train'][1]['answers']

Note above format of the answer :  inside a list i.e. there may be multiple answer for a question and position of the start of answer.

Also same answer can appear multiple times.

But in the dataset multiple answers may be there for validation set only.

* For train set, there is only one asnwer per sample, since the loss function is only built for one target per input, easy to work without extra effort for this part.

* **Code cell below is for cheking that train set has always 1 answer and not multiple answers, or no answers.**





In [None]:
raw_datasets['train'].filter(lambda x: len(x['answers']['text'])!=1)

Filter:   0%|          | 0/87599 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

#### **But validation set may have multiple answers.**
One example is given below:

In [None]:
raw_datasets['validation'][2]['answers']

{'text': ['Santa Clara, California',
  "Levi's Stadium",
  "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."],
 'answer_start': [403, 355, 355]}

In [None]:
raw_datasets['validation'][2]['context']

'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.'

In [None]:
raw_datasets['validation'][2]['question']

'Where did Super Bowl 50 take place?'

### **Understanding Tokenizer**
* Load tokenizer from transformer.
* This tokenizer can handle 2 input texts.
* Bert based Model checkpoint for pre-trined model.

In [None]:
from transformers import AutoTokenizer

In [None]:
model_checkpoint ='distilbert-base-cased' # 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

**Chekcing what happens after tokenization :**

In [None]:
context_sampl1='You are playing?'
question_sample1='What are you doing?'
inputs_sample1 = tokenizer(question_sample1,context_sampl1)
print(inputs_sample1)
tokenizer.decode(inputs_sample1['input_ids'])

{'input_ids': [101, 1327, 1132, 1128, 1833, 136, 102, 1192, 1132, 1773, 136, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


'[CLS] What are you doing? [SEP] You are playing? [SEP]'

Decoding the outputs -->  Turn the token IDs back into word,  Gives --> A big, long string containing both the question and the context, concatenated together.

* Starts with the special **CLS** token, followed by the first sentence followed by the **SEP** token followed by the second sentence, and finally, one more **SEP** token.

* The context may contain more than one sentence.

**Long contexts**

*  For QA the context may be very long but Bert can handle only limited number of tokens.
* We can't truncate context or answers.
* Soln: Split the context into multiple windows  --> context splitted like: Window1;  Window2;  Window3 ...
* The answer may get cut off if it's on the bounday -->Soln: Overlapping windows called stride.

#### **Checking for train sample**

In [None]:
context = raw_datasets['train'][1]['context']
question = raw_datasets['train'][1]['question']
inputs = tokenizer(question,context)
print(inputs)
print(len(inputs['input_ids']))


{'input_ids': [101, 1327, 1110, 1107, 1524, 1104, 1103, 10360, 8022, 4304, 4334, 136, 102, 22182, 1193, 117, 1103, 1278, 1144, 170, 2336, 1959, 119, 1335, 4184, 1103, 4304, 4334, 112, 188, 2284, 10945, 1110, 170, 5404, 5921, 1104, 1103, 6567, 2090, 119, 13301, 1107, 1524, 1104, 1103, 4304, 4334, 1105, 4749, 1122, 117, 1110, 170, 7335, 5921, 1104, 4028, 1114, 1739, 1146, 14089, 5591, 1114, 1103, 7051, 107, 159, 21462, 1566, 24930, 2508, 152, 1306, 3965, 107, 119, 5893, 1106, 1103, 4304, 4334, 1110, 1103, 19349, 1104, 1103, 11373, 4641, 119, 13301, 1481, 1103, 171, 17506, 9538, 1110, 1103, 144, 10595, 2430, 117, 170, 14789, 1282, 1104, 8070, 1105, 9284, 119, 1135, 1110, 170, 16498, 1104, 1103, 176, 10595, 2430, 1120, 10111, 20500, 117, 1699, 1187, 1103, 6567, 2090, 25153, 1193, 1691, 1106, 2216, 17666, 6397, 3786, 1573, 25422, 13149, 1107, 8109, 119, 1335, 1103, 1322, 1104, 1103, 1514, 2797, 113, 1105, 1107, 170, 2904, 1413, 1115, 8200, 1194, 124, 11739, 1105, 1103, 3487, 17917, 114, 117

In [None]:
tokenizer.decode(inputs['input_ids'])

'[CLS] What is in front of the Notre Dame Main Building? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ), is a simple, modern stone statue of Mary. [SEP]'

#### **Full Tokenizer Call** for long context splitting into multiple samples :


In [None]:
inputs = tokenizer(
    question,
    context,
    max_length=100,                  # Maximum length of entire input question + context + special tokens
    truncation='only_second',        # Context is the second input, truncate only this
    stride=50,                       # Overlap between context window when splitted into multiple windows
    return_overflowing_tokens=True,  # Referring to Overlapped tokens
    )

inputs.keys()


dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])

In [None]:
inputs['overflow_to_sample_mapping']

[0, 0, 0, 0]

In [None]:
for ids in inputs["input_ids"]:
  print(tokenizer.decode(ids))

[CLS] What is in front of the Notre Dame Main Building? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the G [SEP]
[CLS] What is in front of the Notre Dame Main Building? [SEP] facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernade [SEP]
[CLS] What is in front of the Notre Dame Main Building? [SEP] of the Sacred Heart. Immediately behind the basilica is the Grotto, 

See above : If we have one question in context pair, this might be converted into multiple input samples depending on how long the context is.

### **Multiple Sample input**

In [None]:
inputs=tokenizer(
    raw_datasets['train'][:3]['question'],
    raw_datasets['train'][:3]['context'],
    max_length=100,
    truncation='only_second',
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,    #
    )
inputs['overflow_to_sample_mapping']
# Integers output refers to : In raw input, the first sample is 0, the second sample is 1 ,etc.

[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

In [None]:
for ids in inputs["input_ids"]:
  print(tokenizer.decode(ids))

[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basi [SEP]
[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin [SEP]
[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Next to the Main Building is the B

**return_offsets_mapping**: Give us a list of list of tuples.

Each of these tuples corresponds to a token in the input sequence.Each tuple contains the start and end character positions of that token.

Question  **Where is Tommy**?

Context : **Tommy is at home.**

**[CLS]  where  is  Tommy  ?  [SEP]  Tommy  is  at  home  . [SEP]**

**(0,0)  (0,5) (6,8) ...   ....                ....           .(0,0), ...**

The first tuple is zero zero, which corresponds to the special **CLS** token because technically this doesn't take up any space.

The second tuple goes from 0 to 5 because the word **where** contains five letters.

The next tuple goes from 6 to 8 because the word is contains two letters and so forth.

In [None]:
inputs['offset_mapping']

[[(0, 0),
  (0, 2),
  (3, 7),
  (8, 11),
  (12, 15),
  (16, 22),
  (23, 27),
  (28, 37),
  (38, 44),
  (45, 47),
  (48, 52),
  (53, 55),
  (56, 59),
  (59, 63),
  (64, 70),
  (70, 71),
  (0, 0),
  (0, 13),
  (13, 15),
  (15, 16),
  (17, 20),
  (21, 27),
  (28, 31),
  (32, 33),
  (34, 42),
  (43, 52),
  (52, 53),
  (54, 56),
  (56, 58),
  (59, 62),
  (63, 67),
  (68, 76),
  (76, 77),
  (77, 78),
  (79, 83),
  (84, 88),
  (89, 91),
  (92, 93),
  (94, 100),
  (101, 107),
  (108, 110),
  (111, 114),
  (115, 121),
  (122, 126),
  (126, 127),
  (128, 139),
  (140, 142),
  (143, 148),
  (149, 151),
  (152, 155),
  (156, 160),
  (161, 169),
  (170, 173),
  (174, 180),
  (181, 183),
  (183, 184),
  (185, 187),
  (188, 189),
  (190, 196),
  (197, 203),
  (204, 206),
  (207, 213),
  (214, 218),
  (219, 223),
  (224, 226),
  (226, 229),
  (229, 232),
  (233, 237),
  (238, 241),
  (242, 248),
  (249, 250),
  (250, 251),
  (251, 254),
  (254, 256),
  (257, 259),
  (260, 262),
  (263, 264),
  (264, 2

#### **Recreate inputs for just a single context- question pair**

In [None]:
context = raw_datasets['train'][1]['context']
question = raw_datasets['train'][1]['question']
inputs = tokenizer(question,context)
print(inputs, '\n\n')
tokenizer.decode(inputs['input_ids'])

{'input_ids': [101, 1327, 1110, 1107, 1524, 1104, 1103, 10360, 8022, 4304, 4334, 136, 102, 22182, 1193, 117, 1103, 1278, 1144, 170, 2336, 1959, 119, 1335, 4184, 1103, 4304, 4334, 112, 188, 2284, 10945, 1110, 170, 5404, 5921, 1104, 1103, 6567, 2090, 119, 13301, 1107, 1524, 1104, 1103, 4304, 4334, 1105, 4749, 1122, 117, 1110, 170, 7335, 5921, 1104, 4028, 1114, 1739, 1146, 14089, 5591, 1114, 1103, 7051, 107, 159, 21462, 1566, 24930, 2508, 152, 1306, 3965, 107, 119, 5893, 1106, 1103, 4304, 4334, 1110, 1103, 19349, 1104, 1103, 11373, 4641, 119, 13301, 1481, 1103, 171, 17506, 9538, 1110, 1103, 144, 10595, 2430, 117, 170, 14789, 1282, 1104, 8070, 1105, 9284, 119, 1135, 1110, 170, 16498, 1104, 1103, 176, 10595, 2430, 1120, 10111, 20500, 117, 1699, 1187, 1103, 6567, 2090, 25153, 1193, 1691, 1106, 2216, 17666, 6397, 3786, 1573, 25422, 13149, 1107, 8109, 119, 1335, 1103, 1322, 1104, 1103, 1514, 2797, 113, 1105, 1107, 170, 2904, 1413, 1115, 8200, 1194, 124, 11739, 1105, 1103, 3487, 17917, 114, 117

'[CLS] What is in front of the Notre Dame Main Building? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive ( and in a direct line that connects through 3 statues and the Gold Dome ), is a simple, modern stone statue of Mary. [SEP]'

In [None]:
# Context is long, lets split into multiple samples
inputs=tokenizer(
    question,
    context,
    max_length=80,
    truncation='only_second',
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    )
inputs.keys()

dict_keys(['input_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

In [None]:
for ids in inputs["input_ids"]:
  print(tokenizer.decode(ids))

[CLS] What is in front of the Notre Dame Main Building? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to [SEP]
[CLS] What is in front of the Notre Dame Main Building? [SEP] s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basi [SEP]
[CLS] What is in front of the Notre Dame Main Building? [SEP] the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a M

In [None]:
inputs['offset_mapping']

[[(0, 0),
  (0, 4),
  (5, 7),
  (8, 10),
  (11, 16),
  (17, 19),
  (20, 23),
  (24, 29),
  (30, 34),
  (35, 39),
  (40, 48),
  (48, 49),
  (0, 0),
  (0, 13),
  (13, 15),
  (15, 16),
  (17, 20),
  (21, 27),
  (28, 31),
  (32, 33),
  (34, 42),
  (43, 52),
  (52, 53),
  (54, 56),
  (56, 58),
  (59, 62),
  (63, 67),
  (68, 76),
  (76, 77),
  (77, 78),
  (79, 83),
  (84, 88),
  (89, 91),
  (92, 93),
  (94, 100),
  (101, 107),
  (108, 110),
  (111, 114),
  (115, 121),
  (122, 126),
  (126, 127),
  (128, 139),
  (140, 142),
  (143, 148),
  (149, 151),
  (152, 155),
  (156, 160),
  (161, 169),
  (170, 173),
  (174, 180),
  (181, 183),
  (183, 184),
  (185, 187),
  (188, 189),
  (190, 196),
  (197, 203),
  (204, 206),
  (207, 213),
  (214, 218),
  (219, 223),
  (224, 226),
  (226, 229),
  (229, 232),
  (233, 237),
  (238, 241),
  (242, 248),
  (249, 250),
  (250, 251),
  (251, 254),
  (254, 256),
  (257, 259),
  (260, 262),
  (263, 264),
  (264, 265),
  (265, 268),
  (268, 269),
  (269, 270),
 

In [None]:
len(inputs['offset_mapping'])

8

In [None]:
len(inputs['offset_mapping'][0])

80

### **Aligning the target**

**Problems:**
* Splitted contexts into multiple windows
* Answer in the dataset comes with a start postion within context and sfter splitting  that position is no longer valid
* The position of the answer will change in each window of the context

* The asnwer is also the target for the neural network

**How can we recompute the targets for each context window?**

First lets see some helping concepts and function which will combinely perform the task.

**Sequence IDs**

This takes in an integer corresponding to the input sample.

As usual, we count from zero 1 to 2 and so forth, just like array and list indices.

Output :  
* None --> for special token in the tokenized input.
* zeros --> correspond to the tokens from the first sentence,
* ones --> which correspond to the tokens from the second sentence.

* Same as the token type IDs that we saw earlier, however,  token type IDs are not present in all types of models such as distilled Berts.Therefore, we can't depend on it. Instead, this sequence ID function can always be called no matter which model we use.

**Where is the context?**
* In example, result of cell below, the context is where 1s are.
* But we need to know actual positions i.e. the index of the first 1 and the index of the last 1.

In [None]:
print(inputs.sequence_ids(7))

[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, None]


In [None]:
type(inputs.sequence_ids(0))

list

#### Finding the start and end of the context (the first and last '1') :

In [None]:
sequence_ids = inputs.sequence_ids(0)
ctx_start = sequence_ids.index(1)
ctx_end = len(sequence_ids)- sequence_ids[::-1].index(1)-1 #
ctx_start,ctx_end # Start and end position of context

(13, 98)

#### **Finding the answer:**
We have found the context window but how to find start and end character postions of the answer? Lets see!

In [None]:
answer = raw_datasets['train'][1]['answers']
answer

{'text': ['a copper statue of Christ'], 'answer_start': [188]}

In [None]:
# For full context if the context is not splitted
ans_start_char =answer['answer_start'][0]
ans_end_char = ans_start_char + len(answer['text'][0])

##### For context windows

Help from --> offset mapping, context start , context end

In [None]:
offset = inputs['offset_mapping'][0]
offset

[(0, 0),
 (0, 4),
 (5, 7),
 (8, 10),
 (11, 16),
 (17, 19),
 (20, 23),
 (24, 29),
 (30, 34),
 (35, 39),
 (40, 48),
 (48, 49),
 (0, 0),
 (0, 13),
 (13, 15),
 (15, 16),
 (17, 20),
 (21, 27),
 (28, 31),
 (32, 33),
 (34, 42),
 (43, 52),
 (52, 53),
 (54, 56),
 (56, 58),
 (59, 62),
 (63, 67),
 (68, 76),
 (76, 77),
 (77, 78),
 (79, 83),
 (84, 88),
 (89, 91),
 (92, 93),
 (94, 100),
 (101, 107),
 (108, 110),
 (111, 114),
 (115, 121),
 (122, 126),
 (126, 127),
 (128, 139),
 (140, 142),
 (143, 148),
 (149, 151),
 (152, 155),
 (156, 160),
 (161, 169),
 (170, 173),
 (174, 180),
 (181, 183),
 (183, 184),
 (185, 187),
 (188, 189),
 (190, 196),
 (197, 203),
 (204, 206),
 (207, 213),
 (214, 218),
 (219, 223),
 (224, 226),
 (226, 229),
 (229, 232),
 (233, 237),
 (238, 241),
 (242, 248),
 (249, 250),
 (250, 251),
 (251, 254),
 (254, 256),
 (257, 259),
 (260, 262),
 (263, 264),
 (264, 265),
 (265, 268),
 (268, 269),
 (269, 270),
 (271, 275),
 (276, 278),
 (279, 282),
 (283, 287),
 (288, 296),
 (297, 299),


In [None]:
offset[13:]

In [None]:
offset = inputs['offset_mapping'][0]

start_idx=0
end_idx=0

if offset[ctx_start][0]> ans_start_char or offset[ctx_end][1]< ans_end_char:
  # condn for ans doesn't exist within context
  print("target is (0,0)")
else:
  i=ctx_start
  for start_end_char in offset[ctx_start:]: # Loop through every tuple in the offset mapping
    start,end=start_end_char
    if start == ans_start_char:
      start_idx=i
    if end==ans_end_char:
      end_idx=i
      break
    i +=1
start_idx, end_idx

(53, 57)

In [None]:
# check : This will give us the token IDs of the answer.
input_ids = inputs['input_ids'][0]
input_ids[start_idx:end_idx + 1]


[170, 7335, 5921, 1104, 4028]

In [None]:
# convert this back into English text --> call tokenize or decode.
tokenizer.decode(input_ids[start_idx:end_idx+1])

'a copper statue of Christ'

#### **Defing `find_answer_token_idx` function**

Defining funtion for all above operation:

In [None]:
def find_answer_token_idx(ctx_start,ctx_end,ans_start_char,ans_end_char,offset):
  start_idx=0
  end_idx=0

  if offset[ctx_start][0]> ans_start_char or offset[ctx_end][1]<ans_end_char:
    pass
  else:
    i=ctx_start
    for start_end_char in offset[ctx_start:]:
      if start == ans_start_char:
        start_idx=i
      if end==ans_end_char:
        end_idx=i
        break
      i +=1
  return start_idx,end_idx

### **Finding the answer in all context windows**

Try it on all context windows, sometimes answer may not appear!

In [None]:
start_idxs=[]
end_idxs=[]
for i, offset in enumerate(inputs['offset_mapping']):
  sequence_ids = inputs.sequence_ids(i)

  ctx_start = sequence_ids.index(1)
  ctx_end =len(sequence_ids) - sequence_ids[::-1].index(1)-1

  start_idx, end_idx = find_answer_token_idx(ctx_start,ctx_end,ans_start_char,ans_end_char,offset)

  start_idxs.append(start_idx)
  end_idxs.append(end_idx)
start_idxs, end_idxs

([0, 0, 0, 0], [13, 13, 0, 0])

#### Explanation of above result :

Notice how the answer appears at **0 to 13 for the first window** and at **0 to 13 for the second window**.The answer is positioned to such that it appears in both context windows. However, for the third and fourth context Windows, the answer does not appear so.The target is zero zero in these cases.

### **Final  Tokenizer Functions for tokenizing the data: train/valid**

Final application of the tokenizer to process the data

In [None]:
#some question have leading and/ or trailing whitespace
for q in raw_datasets['train']['question'][:1000]:
  if q.strip() !=q:
    print(q)

In what city and state did Beyonce  grow up? 
 The album, Dangerously in Love  achieved what spot on the Billboard Top 100 chart?
Which song did Beyonce sing at the first couple's inaugural ball? 
What event did Beyoncé perform at one month after Obama's inauguration? 
Where was the album released? 
What movie influenced Beyonce towards empowerment themes? 


### **Function for Train set**

In [None]:
max_length = 384
stride =128

def tokenize_fn_train(batch):
  questions =[q.strip() for q in batch['question']]
  # tokenizing with padding
  inputs=tokenizer(questions,
                   batch['context'],
                   max_length = max_length,
                   truncation='only_second',
                   stride=stride,
                   return_overflowing_tokens=True,
                   return_offsets_mapping=True,
                   padding="max_length",)
  # we don't need these later so remove them
  offset_mapping = inputs.pop("offset_mapping")
  orig_sample_idxs = inputs.pop("overflow_to_sample_mapping")
  answers = batch['answers'] # from raw  input, not from the tokenizer
  start_idxs, end_idxs =[],[]

  # same loop as above

  for i, offset in enumerate(offset_mapping):
    sample_idx= orig_sample_idxs[i]
    answer=answers[sample_idx]

    ans_start_char =answer['answer_start'][0]
    ans_end_char = ans_start_char + len(answer['text'][0])

    sequence_ids = inputs.sequence_ids(i)

    # find start + end of context (first 1 and last 1)

    ctx_start =sequence_ids.index(1)
    ctx_end = len(sequence_ids)-sequence_ids[::-1].index(1)-1

    start_idx, end_idx = find_answer_token_idx(ctx_start,ctx_end,ans_start_char,ans_end_char,offset)
    start_idxs.append(start_idx)
    end_idxs.append(end_idx)

    inputs['start_positions']=start_idxs
    inputs['end_positions']=end_idxs

  return inputs

#### Tokenizing the train-dataset

In [None]:
train_dataset =raw_datasets['train'].map(
    tokenize_fn_train,
    batched=True,
    remove_columns=raw_datasets['train'].column_names)
len(raw_datasets['train']), len(train_dataset)

Map:   0%|          | 0/87599 [00:00<?, ? examples/s]

(87599, 88729)

### **Function for validation set**
Tokenize the validation set differently:

**An example input: Notice the id** -->It's a string which contains alphanumeric characters.It uniquely identify which original sample the tokenizer output comes from. Since, the tokenizer can generate multiple samples per input when the context is split up.

In [None]:
# we will keep these id's for later
raw_datasets['validation'][0]

{'id': '56be4db0acb8001400a502ec',
 'title': 'Super_Bowl_50',
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],


* We won't need the targets since we will just compare with the original answer
* Overwrite offset_mapping with Nones in place of question

In [None]:
def tokenize_fn_validation(batch):
  questions=[q.strip() for q in batch['question']]
  inputs=tokenizer(questions,
                   batch['context'],
                   max_length = max_length,
                   truncation='only_second',
                   stride=stride,return_overflowing_tokens=True,
                   return_offsets_mapping=True,
                   padding="max_length",)
  # don't need these later and removing
  orig_sample_idxs = inputs.pop("overflow_to_sample_mapping")
  sample_ids=[]
  # rewrite offset mapping by replacing question tuples with None
  # this will be helpful later on when we compute metrics
  for i in range(len(inputs['input_ids'])):
    sample_idx=orig_sample_idxs[i]
    sample_ids.append(batch['id'][sample_idx])

    sequence_ids =inputs.sequence_ids(i)
    # zeros in the positions corresponding to question and ones corresponding to context
    offset = inputs['offset_mapping'][i]
    inputs['offset_mapping'][i]=[x if sequence_ids[j] == 1 else None for j, x in enumerate(offset)]
    # only context tuple remains and other becomes None
    inputs['sample_id']=sample_ids

    inputs['sample_id'] =sample_ids
  return inputs

#### **Tokenizing the validation-dataset**

In [None]:
validation_dataset = raw_datasets['validation'].map(
    tokenize_fn_validation,
    batched=True,
    remove_columns=raw_datasets['validation'].column_names)
len(raw_datasets['validation']), len(validation_dataset)

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

(10570, 10822)

### **Metrics Computation**

In [None]:
from datasets import load_metric
metric = load_metric("squad")

  metric = load_metric("squad")


Downloading builder script:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.11k [00:00<?, ?B/s]

In [None]:
## Create dummy data
predicted_answers= [{'id':'1','prediction_text':'CV Raman'},
 {'id':'2','prediction_text':'physicist'}]

true_answers= [{'id':'1','answers':{'text':['CV Raman'],'answer_start':[100]}},
 {'id':'2','answers':{'text':['physicist'],'answer_start':[100]}}]

In [None]:
metric.compute( predictions=predicted_answers, references=true_answers)

{'exact_match': 100.0, 'f1': 100.0}

* exact_match : total correct match (accuracy)
* f1:More involved

### **From Logits to Answer**

* To convert the model outputs back into a string answer so that call metric compute can be used.
* We need to read the answers!
* For understanding this process we are going to use a pre_trained question- answer model and convert its predictions(logits) into string answers

#### **Trying with subset of dataset**

In [None]:
small_validation_dataset = raw_datasets['validation'].select(range(100)) # subset of 100 sample
trained_checkpoint ='distilbert-base-cased-distilled-squad'# checkpoint for squad bert
tokenizer2=AutoTokenizer.from_pretrained(trained_checkpoint) # need different tokenizer as this is different model
old_tokenizer=tokenizer # save existing tokenizer
tokenizer=tokenizer2 # temporarily change tokenizer with tokenizer2

small_validation_processed= small_validation_dataset.map(
    tokenize_fn_validation,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names)
# change it back
tokenizer=old_tokenizer

Downloading (…)okenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/473 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

#### **Using pre-trained model and getting the output**

In [None]:
import torch
from transformers import AutoModelForQuestionAnswering # a model class specifically built for this task.
# for tensorflow :
# from transformers import TFAutoModelForQuestionAnswering

In [None]:
# the trained model doesn't use these columns
small_model_inputs = small_validation_processed.remove_columns(['sample_id','offset_mapping'])
small_model_inputs.set_format('torch')

# get gpu device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# move tensors to gpu device
small_model_inputs_gpu ={k:small_model_inputs[k].to(device) for k in small_model_inputs.column_names}

# download the model
trained_model = AutoModelForQuestionAnswering.from_pretrained(trained_checkpoint).to(device)

# get the model output
with torch.no_grad():
  outputs=trained_model(**small_model_inputs_gpu)

Downloading model.safetensors:   0%|          | 0.00/261M [00:00<?, ?B/s]

#### **Looking at the Model Outputs**

* Both start_logits and end_logits have shape N xT
* N--> number of samples; T--> number of time steps

Suppose that we have a batch of 'N' samples, each with a padded sequence length of T.Then we would expect both of these logits to be of size N by T.

For simplicity, suppose we only have one sample that is one question, one context, and one answer.So effectively the start logic will be a vector of size T and the end logics will also be a vector of size T. If we took the SoftMax over the time dimension, we would get the probabilities for each time step.

What  do we do with these probabilities?

In [None]:
outputs

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[ -2.2607,  -5.1783,  -5.2709,  ...,  -9.5243,  -9.5183,  -9.5288],
        [ -2.5961,  -5.5482,  -5.5313,  ...,  -9.9598,  -9.9533,  -9.9860],
        [ -3.7127,  -7.1848,  -8.5388,  ..., -11.6557, -11.6571, -11.6505],
        ...,
        [ -2.0260,  -4.4167,  -4.4980,  ...,  -8.1479,  -8.1530,  -8.1760],
        [ -4.1553,  -5.8304,  -7.1643,  ..., -10.5255, -10.5251, -10.4890],
        [ -3.2000,  -5.8162,  -6.7249,  ...,  -9.4935,  -9.5038,  -9.4871]]), end_logits=tensor([[ -0.7353,  -4.9236,  -5.1048,  ...,  -8.8734,  -8.8915,  -8.8550],
        [ -1.3056,  -5.3870,  -5.4945,  ...,  -9.4895,  -9.5039,  -9.4959],
        [ -2.7649,  -7.2201,  -9.0916,  ..., -11.3106, -11.3414, -11.2702],
        ...,
        [ -0.0768,  -4.8210,  -4.4374,  ...,  -8.0483,  -8.0502,  -7.9903],
        [ -2.7347,  -5.3650,  -7.2549,  ..., -10.0498, -10.0661,  -9.9886],
        [ -1.0991,  -4.2569,  -6.1267,  ...,  -8.6882,  -8.6889,  -8.627

#### Convert into Numpy arrays

In [None]:
start_logits = outputs.start_logits.cpu().numpy()
end_logits= outputs.end_logits.cpu().numpy()

In [None]:
small_validation_processed['sample_id'][:5]

# remembering that in processed data set, there is an attribute called sample ID.
# This is a unique string identifier corresponding to each question context pair.

['56be4db0acb8001400a502ec',
 '56be4db0acb8001400a502ed',
 '56be4db0acb8001400a502ee',
 '56be4db0acb8001400a502ef',
 '56be4db0acb8001400a502f0']

In [None]:
# Some sample ID may be repeated
len(validation_dataset['sample_id']), len(set(validation_dataset['sample_id']))

(10822, 10570)

#### **Mapping sample ID to Dataset Index**

* Create a dictionary that maps each sample ID to all the possible indices of where that sample resides in the process data set. For example, if the first sample gets split into four windows, then its sample ID would point to a list containing zero, one, two and three, which are the first four positions in the list containing all the processed samples.

* Loops through all the sample IDs and stores the corresponding indices as values.

In [None]:
# example : {'56be4db.....':[0,1,2,3],...]}
sample_id2idxs={}
for i, id_ in enumerate(small_validation_processed['sample_id']):
  if id_ not in sample_id2idxs:
    sample_id2idxs[id_] =[i]
  else:
    print("here")
    sample_id2idxs[id_].append(i)

In [None]:
start_logits.shape, end_logits.shape

((100, 384), (100, 384))

In [None]:
(-start_logits[0]).argsort()

array([ 46,  57,  47,  38,  39,  58,  50,  43,  45,  54,  56,  49,  13,
        42,  40,  35,  27,  31,  48,  41,  53,  44,  37,  59,  78,  15,
         0,  52,  24,  65,  81,  70,  18,  51,  55,  26,  69,  29,  28,
        75,  61,  64,  23,  36,  32,  11, 101,  62,  66,  34,  95,  30,
        63,  21,  19,  20,  17,  14,  22,  33,  68,  87, 171,  12,  76,
        71,  73,  92, 110,  84, 151,   1,  74,   2,   6,  16,  80,  79,
       105,  98,  10,  96, 136, 169, 106, 100,  93, 165,  67, 109,   8,
        90,   3, 115,  60,   5,  97,   7, 103, 102,  86,  72, 111,  89,
       108,   4,  88,  25, 132,  77, 123, 150, 124, 153,  83, 118,  82,
        85, 107, 114, 143, 164, 137, 130, 166, 159, 131,  91,   9, 144,
       139, 160,  94, 141, 128, 112, 134, 152, 170, 154, 117, 127, 104,
       140, 157, 155, 133, 145, 119, 162, 138, 135, 156, 167, 168, 126,
       148, 163, 161, 116,  99, 120, 142, 158, 125, 146, 113, 121, 147,
       149, 129, 122, 311, 312, 304, 309, 313, 310, 300, 307, 31

In [None]:
start_logits[0][(-start_logits[0]).argsort()]

array([10.694445  ,  9.803685  ,  4.4599743 ,  4.400488  ,  2.943783  ,
        2.7017365 ,  2.012652  ,  1.578078  ,  0.52237403,  0.02073932,
       -0.02802548, -0.04970503, -0.38572538, -0.69453716, -0.7979469 ,
       -0.8678012 , -0.87220144, -1.3516847 , -1.3703673 , -1.3878812 ,
       -1.5135032 , -1.7355448 , -1.8827012 , -1.8932868 , -1.9078932 ,
       -1.9304957 , -2.2607293 , -2.2983854 , -2.3069293 , -2.502737  ,
       -2.5100586 , -2.530837  , -2.5399904 , -2.6718087 , -2.7323527 ,
       -2.771015  , -2.7713625 , -2.9521286 , -3.0604622 , -3.1706011 ,
       -3.2045465 , -3.5693336 , -3.5798001 , -3.6668804 , -3.7250557 ,
       -3.7498548 , -3.7632139 , -3.9968119 , -4.011324  , -4.0687966 ,
       -4.0944815 , -4.1954722 , -4.238309  , -4.332359  , -4.352411  ,
       -4.3879614 , -4.388608  , -4.3966093 , -4.6790495 , -4.703027  ,
       -4.775753  , -4.777808  , -4.7882147 , -4.7882433 , -4.822122  ,
       -4.8725367 , -4.884931  , -4.8981423 , -5.072093  , -5.10

In [None]:
small_validation_processed['offset_mapping'][0]

#### **Converting logits to answer**

In [None]:
n_largest = 20 # maximum number of logits we want to consider
max_answer_length =30 # largest number of token for answer
predicted_answers =[]
# we are looping through the original (untokenized) dataset
# # because we need to grab the answer from the original string context
for sample in small_validation_dataset:
  sample_id = sample['id']
  context = sample['context']
  # update these as we loop through candidate answers
  best_score = float('-inf')
  best_answer = None
  # now loop through the *expanded* input samples(fixed size context windows)
  # from here we will pick the highest probability start/end combination
  for idx in sample_id2idxs[sample_id]:
    start_logit = start_logits[idx]
    end_logit = end_logits[idx]
    offsets = small_validation_processed[idx]['offset_mapping']

    start_indices =(-start_logit).argsort()
    end_indices =(-end_logit).argsort()

    for start_idx in start_indices[:n_largest]:
      for end_idx in end_indices[:n_largest]:
        #skip answers not contained in context window
        # recall: we set entries not pertaining to context to None earlier
        if offsets[start_idx] is None or offsets[end_idx] is None:
          continue
        # Skip answers where end<start
        if end_idx < start_idx:
          continue
        #skip answers that are too long
        if end_idx - start_idx +1 > max_answer_length:
          continue
        # score calculation
        score =start_logit[start_idx] + end_logit[end_idx]
        if score >best_score:
          best_score = score
          #find positions of start and end characters
          # recall: offsets contains tuples for each token:
          #(start_char,end_char)
          first_ch = offsets[start_idx][0]
          last_ch = offsets[end_idx][1]

          best_answer = context[first_ch:last_ch]
  predicted_answers.append({'id':sample_id,'prediction_text':best_answer})

In [None]:
predicted_answers

In [None]:
small_validation_dataset['answers'][0]

{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],
 'answer_start': [177, 177, 177]}

In [None]:
# testing it
true_answers=[{'id':x['id'],'answers':x['answers']} for x in small_validation_dataset]
metric.compute(predictions=predicted_answers, references=true_answers)

{'exact_match': 83.0, 'f1': 88.25000000000004}

#### **Defining full compute_metrices**

In [None]:
from tqdm import tqdm

In [None]:
def compute_metrics(start_logits, end_logits, processed_dataset, orig_dataset):
  sample_id2idxs={}
  for i, id_ in enumerate(processed_dataset['sample_id']):
    if id_ not in sample_id2idxs:
      sample_id2idxs[id_]=[i]
    else:
      sample_id2idxs[id_].append(i)
  predicted_answers=[]
  for sample in tqdm(orig_dataset):
    sample_id = sample['id']
    context =sample['context']
    #update these as we loop through candidate answers
    best_score = float('-inf')
    best_answer = None

    # now loop through the *expanded* input samples (fixed size context windows)
    #from here we will pick the highest probability start/end combination

    for idx in sample_id2idxs[sample_id]:
      start_logit=start_logits[idx]
      end_logit= end_logits[idx]
      offsets = processed_dataset[idx]['offset_mapping']

      start_indices =(-start_logit).argsort()
      end_indices = (-end_logit).argsort()

      for start_idx in start_indices[:n_largest]:
        for end_idx in end_indices[:n_largest]:
          #skip answers not contained in context window
          # recall: we set entries not pertaining to context to None earlier
          if offsets[start_idx] is None or offsets[end_idx] is None:
            continue
          # Skip answers where end<start
          if end_idx < start_idx:
            continue
          #skip answers that are too long
          if end_idx - start_idx +1 > max_answer_length:
            continue
          # score calculation
          score =start_logit[start_idx] + end_logit[end_idx]
          if score >best_score:
            best_score = score
            #find positions of start and end characters
            # recall: offsets contains tuples for each token:
            #(start_char,end_char)
            first_ch = offsets[start_idx][0]
            last_ch = offsets[end_idx][1]

            best_answer = context[first_ch:last_ch]
    predicted_answers.append({'id':sample_id,'prediction_text':best_answer})
  true_answers = [ {'id':x['id'],'answers':x['answers']} for x in orig_dataset]
  return metric.compute(predictions=predicted_answers, references=true_answers)


In [None]:
compute_metrics(start_logits, end_logits, small_validation_processed, small_validation_dataset)

### **Train and evaluate**

In [None]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

In [None]:
from transformers import TrainingArguments

In [None]:
args = TrainingArguments("finetuened-squad",
                         evaluation_strategy ="no",
                         save_strategy="epoch",
                         learning_rate = 2e-5,
                         num_train_epochs=3,
                         weight_decay = 0.01,
                         fp16=True,
                         )

In [None]:
from transformers import Trainer

In [None]:
trainer =Trainer(model=model,
                 args=args,
                 train_dataset=train_dataset,
                 eval_dataset=validation_dataset,
                 tokenizer=tokenizer)
trainer.train()

In [None]:
trainer_output = trainer.predict(validation_dataset)

In [None]:
type(trainer_output)

transformers.trainer_utils.PredictionOutput

In [None]:
trainer_output

PredictionOutput(predictions=(array([[ -5.5742188, -11.046875 , -11.046875 , ..., -11.828125 ,
        -11.84375  , -11.84375  ],
       [ -6.3671875, -11.078125 , -11.0703125, ..., -11.8203125,
        -11.8359375, -11.8359375],
       [ -8.609375 , -11.109375 , -11.1171875, ..., -11.6484375,
        -11.625    , -11.640625 ],
       ...,
       [ -6.75     , -11.421875 , -11.765625 , ..., -11.7734375,
        -11.7890625, -11.7734375],
       [ -6.2421875, -11.2109375, -11.4765625, ..., -11.71875  ,
        -11.6796875, -11.6953125],
       [ -5.5117188, -11.46875  , -11.6640625, ..., -11.7265625,
        -11.734375 , -11.71875  ]], dtype=float32), array([[ -4.7460938, -10.9453125, -10.84375  , ..., -11.3828125,
        -11.375    , -11.375    ],
       [ -5.59375  , -10.921875 , -10.8046875, ..., -11.390625 ,
        -11.375    , -11.3828125],
       [ -6.890625 , -10.0625   , -10.546875 , ..., -11.6328125,
        -11.671875 , -11.640625 ],
       ...,
       [ -5.4414062, -10.6015

In [None]:
predictions,_, _ = trainer_output

In [None]:
predictions

(array([[ -5.5742188, -11.046875 , -11.046875 , ..., -11.828125 ,
         -11.84375  , -11.84375  ],
        [ -6.3671875, -11.078125 , -11.0703125, ..., -11.8203125,
         -11.8359375, -11.8359375],
        [ -8.609375 , -11.109375 , -11.1171875, ..., -11.6484375,
         -11.625    , -11.640625 ],
        ...,
        [ -6.75     , -11.421875 , -11.765625 , ..., -11.7734375,
         -11.7890625, -11.7734375],
        [ -6.2421875, -11.2109375, -11.4765625, ..., -11.71875  ,
         -11.6796875, -11.6953125],
        [ -5.5117188, -11.46875  , -11.6640625, ..., -11.7265625,
         -11.734375 , -11.71875  ]], dtype=float32),
 array([[ -4.7460938, -10.9453125, -10.84375  , ..., -11.3828125,
         -11.375    , -11.375    ],
        [ -5.59375  , -10.921875 , -10.8046875, ..., -11.390625 ,
         -11.375    , -11.3828125],
        [ -6.890625 , -10.0625   , -10.546875 , ..., -11.6328125,
         -11.671875 , -11.640625 ],
        ...,
        [ -5.4414062, -10.6015625, -10.

In [None]:
start_logits, end_logits = predictions

In [None]:
compute_metrics(
    start_logits,
    end_logits,
    validation_dataset,
    raw_datasets['validation']
)

100%|██████████| 10570/10570 [00:17<00:00, 588.82it/s]


{'exact_match': 77.12393566698202, 'f1': 85.37269282340277}

### **Saving the model and Building pipeline**

In [None]:
trainer.save_model('my_saved_model')

In [None]:
from transformers import pipeline
qa = pipeline("question-answering", model= 'my_saved_model',device=0)

### **Using pipeline for inference**

In [5]:
context ='today I went to the store to purchase a carton of milk'
question ='what did I buy?'

In [None]:
qa(context=context, question=question)

{'score': 0.8841713070869446,
 'start': 38,
 'end': 54,
 'answer': 'a carton of milk'}

### Reference

1. [Data processing for Question Answering](https://www.youtube.com/watch?v=qgaM0weJHpA&list=PLo2EIpI_JMQtYmOWSszkfIi4sgz2NsySi&index=12)
2.[The Post processing step in Question Answering (PyTorch)](https://www.youtube.com/watch?v=BNy08iIWVJM&list=PLo2EIpI_JMQtYmOWSszkfIi4sgz2NsySi&index=12)
3. [The Post processing step in Question Answering (Tensorflow)](https://www.youtube.com/watch?v=VN67ZpN33Ss&list=PLo2EIpI_JMQtYmOWSszkfIi4sgz2NsySi&index=13)