# Problem 4 - Using BERT for question answering  **(10)**

In this question, we will use a pre-trained BERT model to extract the answer to a question from a given paragraph.

In [None]:
# Install the transformers library that will be used for BERT models.
!pip install transformers

Collecting transformers
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.19.1-py3-none-any.whl (311 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

## 5.1 **(1)**

We will use the BertForQuestionAnswering model and the BertTokenizer as our tokenizer.

In [None]:
import torch
from transformers import BertForQuestionAnswering
from transformers import BertTokenizer

# Load the pretrained 'bert-large-uncased-whole-word-masking-finetuned-squad' checkpoint into BertForQuestionAnswering
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



We define the question and the paragraph of text that the question is based on.


In [None]:
question = "What was BERT trained on?"

paragraph = "BERT stands for Bidirectional Encoder Representation of Transformer. I feel that its name itself is descriptive enough to get the gist. Still, to understand it better, it’s encoder part of the encoder-decoder transformer model, it’s also bidirectional in nature, which means that for any input it’s able to learn dependencies from both left and right of any word. It was trained on Wikipedia text and BooksCorpus and open-sourced back in 2018 by Google. You can find the official repository and paper at Github: BERT and BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. There are two models introduced in the paper. BERT base — 12 layers (transformer blocks), 110 million parameters. BERT Large — 24 layers, 340 million parameters. Later google also released Multi-lingual BERT to accelerate the research"

## 5.2 **(2)**

Use the tokenizer's encode_plus function. Pass the question as the text parameter and the paragraph as the text_pair parameter.

You can refer to: https://huggingface.co/docs/transformers/v4.19.0/en/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__

In [None]:
encoding = tokenizer.encode_plus(text=question , text_pair=paragraph , add_special_tokens=True)

## 5.3 **(2)**

The encoding is a dictionary with multiple keys. Your task is to identify which keys will be used for the inputs and which will be used for the segment embeddings.

In [None]:
print(encoding.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])


In [None]:
inputs = encoding['input_ids']  # Token IDs (the model looks these up to get the token embeddings)

sentence_embedding = encoding['token_type_ids']  # Segment IDs (used for the segment embeddings)


# Convert the input IDs back to their token strings
tokens = tokenizer.convert_ids_to_tokens(inputs)
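
To make the roles of the three keys concrete, here is a minimal sketch of how a question/paragraph pair is laid out for BERT. It works on already-split toy tokens rather than calling the real tokenizer, and `build_pair_layout` is a hypothetical helper written for this illustration, not part of the transformers API.

```python
# Hypothetical helper: mimics the layout of a BERT sentence-pair encoding
# on already-tokenized text. It is not a transformers function.
def build_pair_layout(question_tokens, paragraph_tokens):
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + paragraph_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question, and the first [SEP];
    # segment 1 covers the paragraph and the final [SEP].
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(paragraph_tokens) + 1)
    # With no padding, every position holds a real token, so the mask is all ones.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


tokens_demo, segments_demo, mask_demo = build_pair_layout(
    ["what", "was", "bert", "trained", "on", "?"],
    ["it", "was", "trained", "on", "wikipedia", "text"],
)
for tok, seg in zip(tokens_demo, segments_demo):
    print(f"{tok:10s} segment={seg}")
```

The real tokenizer additionally maps each token to an integer ID (`input_ids`); the segment IDs and mask follow the same pattern as in this toy layout.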

The model returns two scores (logits) for every input token: one for being the start of the answer span and one for being the end.

In [None]:
scores = model(input_ids=torch.tensor([inputs]), token_type_ids=torch.tensor([sentence_embedding]))
print(scores)

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-5.4397, -5.1747, -8.2072, -8.1577, -7.4659, -6.3724, -9.4946, -5.4397,
         -1.1487, -5.9532, -7.5790, -2.0921, -7.5579, -7.1692, -7.1343, -4.7308,
         -6.7858, -7.5462, -5.3032, -7.8723, -5.3115, -7.4651, -7.9509, -6.8902,
         -7.9474, -8.6392, -6.2064, -7.5566, -8.5925, -8.2583, -6.7350, -8.4192,
         -8.4794, -7.7465, -8.4259, -7.5299, -8.5404, -9.1108, -7.8087, -8.6896,
         -7.1670, -7.7759, -8.2495, -8.5528, -8.7607, -5.9098, -8.4287, -8.4879,
         -6.4561, -7.5364, -8.4136, -6.9562, -8.3993, -6.9945, -4.6695, -6.9653,
         -7.7577, -7.9943, -5.2502, -7.7105, -5.6726, -8.0013, -5.9587, -8.3135,
         -6.2524, -8.2741, -8.4364, -8.1030, -3.8497, -8.1456, -8.0798, -8.2123,
         -8.8444, -7.8705, -8.4785, -7.6194, -7.5587, -8.2160, -6.8581, -6.2672,
         -5.9076, -6.9228, -8.1743, -8.5033, -6.8941, -7.6130, -6.1818, -6.1809,
         -7.9184, -8.4508, -7.5274, -7.6268, -8.8922, -7

## 5.4 **(2)**

Now that we have a start score and an end score for every token, we can take the highest-scoring start index and end index and use them to predict the answer span.

In [None]:
# Use torch.argmax to get the indices of the tokens with the highest start and end scores (logits).
# Use scores.start_logits and scores.end_logits
start_index = int(torch.argmax(scores.start_logits))

end_index = int(torch.argmax(scores.end_logits))


if end_index >= start_index:
    get = " ".join(tokens[start_index:end_index+1])
else:
    # Define `get` in this branch too, so the later print(get) does not raise a NameError.
    get = ""
    print("I am unable to find the answer to this question. Can you please ask another question?")
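
Taking the two argmaxes independently can produce an end index that comes before the start index, which is the case the `else` branch handles. One common alternative, shown here only as a sketch and not as what this notebook does, is to search for the pair with `start <= end` that maximizes the sum of the two logits. The function name `best_span` and the `max_answer_len` limit are assumptions made for this example; it works on plain Python lists.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[start] + end_logits[end]
    subject to start <= end < start + max_answer_len."""
    best, best_score = None, float("-inf")
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Toy logits: the independent argmaxes are start=3 and end=1, an invalid span.
toy_start = [0.1, 2.0, 0.3, 2.5]
toy_end = [0.2, 3.0, 1.0, 0.5]
print(best_span(toy_start, toy_end))  # (1, 1)
```

With the model output above, the inputs could be obtained as `scores.start_logits[0].tolist()` and `scores.end_logits[0].tolist()`.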

## 5.5 **(1)**
Display the answer given by the model.

In [None]:
print(get)

wikipedia text and books ##corp ##us


## 5.6 **(2)**

Did you see any unusual tokens in the answer? What could be the reason for that?

**Answer**:

Yes. The tokens ##corp and ##us are unusual: they do not appear (with the ## prefix) in our input paragraph. They come from BERT's "WordPiece" tokenization method. BERT uses a subword tokenization approach that splits a word that is not in its vocabulary into smaller pieces that are, down to single characters if necessary. This lets the model handle rare or unseen words while keeping the vocabulary small.
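
The splitting described above can be sketched as a greedy longest-match-first search over a vocabulary. This is a simplified illustration, not the tokenizer's actual code: `wordpiece_tokenize` and `toy_vocab` are hypothetical, and the real BERT vocabulary has roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, assumed for illustration only.
toy_vocab = {"books", "book", "##s", "##corp", "##us", "wikipedia", "text"}
print(wordpiece_tokenize("bookscorpus", toy_vocab))  # ['books', '##corp', '##us']
```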

In BERT's "WordPiece" tokenization method, the prefix "##" marks a piece that continues the preceding token rather than starting a new word. This is exactly what we see in our output: ##corp and ##us are parts of a longer word. Looking at the paragraph, they most likely come from the word "BooksCorpus", which the uncased tokenizer lowercases and splits into "books", "##corp" and "##us". The pieces appear separately in the answer because the tokens were joined with spaces without merging the "##" continuations back into the preceding token.
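
To display a clean answer, the "##" pieces can be glued back onto the token before them. The helper below is a minimal sketch with a hypothetical name (`merge_wordpieces`); the transformers tokenizer also offers `convert_tokens_to_string`, which should give a similar result.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into text, gluing '##' pieces to the previous token."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # drop the '##' marker and append to the last word
        else:
            words.append(tok)
    return " ".join(words)


print(merge_wordpieces(["wikipedia", "text", "and", "books", "##corp", "##us"]))
# wikipedia text and bookscorpus
```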


