<a href="https://colab.research.google.com/github/HemantTiwariGitHub/CapstoneProject2021/blob/main/Question_Answering_with_SQuAD_2_0_20210102.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Question answering** comes in many forms. In this example, we’ll look at the particular type of extractive QA that involves answering a question about a passage by highlighting the segment of the passage that answers the question. This involves fine-tuning a model which predicts a start position and an end position in the passage. We will use the Stanford Question Answering Dataset (SQuAD) 2.0.
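
To make "predicts a start position and an end position" concrete, here is a minimal sketch of how a span might be chosen from two lists of per-token scores. The function name `best_span`, the toy scores, and the `max_answer_len` cap are illustrative assumptions, not part of this notebook or of any library.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) token indices maximizing start + end score, with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length cap
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

# Toy passage and made-up scores (hypothetical values for illustration)
tokens = ["the", "cat", "sat", "on", "the", "mat"]
start_scores = [0.1, 0.2, 0.1, 0.3, 2.5, 0.4]
end_scores = [0.0, 0.1, 0.2, 0.1, 0.3, 3.0]

start, end = best_span(start_scores, end_scores)
print(tokens[start:end + 1])
```

The highlighted segment is simply the slice of tokens between the two predicted positions, inclusive.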

We will start by downloading the data:

## **Note :**

Please write your code in the cells with the "**Your code here**" placeholder.

## **Download SQuAD 2.0 Data**

Note: This dataset can be explored on the Hugging Face hub (SQuAD V2), and can alternatively be loaded with the 🤗 Datasets library (formerly 🤗 NLP) using load_dataset("squad_v2").

In [2]:
!mkdir -p /content/squad
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -O /content/squad/train-v2.0.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O /content/squad/dev-v2.0.json

--2021-01-03 09:18:11--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.109.153, 185.199.108.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘/content/squad/train-v2.0.json’


2021-01-03 09:18:11 (151 MB/s) - ‘/content/squad/train-v2.0.json’ saved [42123633/42123633]

--2021-01-03 09:18:12--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.109.153, 185.199.108.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.109.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘/content/squad/dev-v2.0.json’


202

In [3]:
import json
from pathlib import Path

def loadJSONData(filename):
    with open(filename, encoding='utf-8') as jsonDataFile:
        data = json.load(jsonDataFile)
    return data

In [4]:
# The data contains multiple titles (articles).
# Every title has multiple paragraphs, and each paragraph stores its text under 'context'.
# Every paragraph has multiple questions, and every question has a list of answers, each with an answer start index.
# If a question is answerable, is_impossible is False.

def preprocessSQUAD(JSONData):
    contextList = []
    questionsList = []
    answersList = []

    titlesCount = len(JSONData['data'])
    BaseData = JSONData['data']
    print("Length Of JSON Data : ", titlesCount)
    for titleID in (range(titlesCount)):
      title = BaseData[titleID]['title']
      #print("Title : ", title);
      paragraphs = BaseData[titleID]['paragraphs']
      paragraphCount = len(paragraphs)

      for paraID in range(paragraphCount):
        context = paragraphs[paraID]['context']
        #print("Context : ",context);
        
        questions = paragraphs[paraID]['qas']
        questionCount = len(questions)
        
        for questionID in range(questionCount):
          
          # Skip unanswerable questions: they have no answer span to train on
          if questions[questionID]['is_impossible']:
            continue

          questionText = questions[questionID]['question']
          answers = questions[questionID]['answers']

          # The SQuAD 'answers' field is a list; in the dev set most questions have multiple answers
          for answer in answers:
            # Build the context, question and answer lists in parallel.
            contextList.append(context)
            questionsList.append(questionText)
            answersList.append(answer)


    print("Length of Context, Questions and Answers" , len (contextList), " , ", len(questionsList),  " , ", len(answersList) )    
    return contextList, questionsList, answersList


Each split is a structured JSON file with a number of questions and answers for each passage (or context). We’ll take this apart into parallel lists of contexts, questions, and answers (note that the contexts are repeated, since there are multiple questions per context and, in the dev set, multiple answers per question):

In [5]:
def read_squad(path):
    # Your code here
    dataInJSON = loadJSONData(path)
    return preprocessSQUAD(dataInJSON)


train_contexts, train_questions, train_answers = read_squad('/content/squad/train-v2.0.json')
print("Length of Context, Questions and Answers" , len (train_contexts), " , ", len(train_questions),  " , ", len(train_answers) ) 
val_contexts, val_questions, val_answers = read_squad('/content/squad/dev-v2.0.json')
print("Length of Context, Questions and Answers" , len (val_contexts), " , ", len(val_questions),  " , ", len(val_answers) ) 


Length Of JSON Data :  442
Length of Context, Questions and Answers 86821  ,  86821  ,  86821
Length of Context, Questions and Answers 86821  ,  86821  ,  86821
Length Of JSON Data :  35
Length of Context, Questions and Answers 20302  ,  20302  ,  20302
Length of Context, Questions and Answers 20302  ,  20302  ,  20302
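
To make the nested layout concrete, here is a small sketch that flattens a hand-written record shaped like the SQuAD 2.0 files. The `toy` record and the `flatten` helper are hypothetical illustrations; only the key names (`data`, `paragraphs`, `context`, `qas`, `is_impossible`, `answers`) come from the code above.

```python
# A hand-written record with the same nesting as the SQuAD 2.0 JSON files
toy = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "question": "What is the capital of France?",
                            "is_impossible": False,
                            "answers": [{"text": "Paris", "answer_start": 0}],
                        },
                        {
                            "question": "Who founded Paris?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ]
}

def flatten(squad):
    """Return parallel lists of contexts, questions and answers for answerable questions."""
    contexts, questions, answers = [], [], []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa["is_impossible"]:
                    continue  # unanswerable questions are skipped
                for answer in qa["answers"]:
                    contexts.append(paragraph["context"])
                    questions.append(qa["question"])
                    answers.append(answer)
    return contexts, questions, answers

contexts, questions, answers = flatten(toy)
print(len(contexts), questions[0], answers[0])
```

Only the answerable question survives, so each of the three lists has one entry.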


The contexts and questions are just strings. The answers are dicts containing the subsequence of the passage with the correct answer as well as an integer indicating the character at which the answer begins. In order to train a model on this data we need (1) the tokenized context/question pairs, and (2) integers indicating at which token positions the answer begins and ends.

First, let’s get the character position at which the answer ends in the passage (we are given the starting position). Sometimes SQuAD answers are off by one or two characters, so we will also adjust for that.

In [6]:
def add_end_idx(answers, contexts):
    offByOneCount = 0
    offByTwoCount = 0
    exactCount = 0
    for answer, context in zip(answers, contexts):
      # Your code here
     
      # extract Answers and Start Positions
      #print(answer)
      answerText = answer['text']
      answerStartIndex = answer['answer_start']
      
      # calculate the end positions
      answerEndIndex = answerStartIndex + len (answerText)
      #print("Answer : ",answerText)
      #print("AnswerStartIndex : ",answerStartIndex," AnswerEndIndex : ",answerEndIndex )  

      # Check if Answers are off by 1 or 2 and fix
      if context[answerStartIndex:answerEndIndex] == answerText:
        answer['answer_end'] = answerEndIndex
        exactCount = exactCount + 1

      # Answer is off by 1 char    
      elif context[answerStartIndex - 1:answerEndIndex - 1] == answerText:
        answer['answer_start'] = answerStartIndex - 1
        answer['answer_end'] = answerEndIndex - 1     
        offByOneCount = offByOneCount + 1

      elif context[answerStartIndex + 1:answerEndIndex + 1] == answerText:
        answer['answer_start'] = answerStartIndex + 1
        answer['answer_end'] = answerEndIndex + 1     
        offByOneCount = offByOneCount + 1

      # Answer is off by 2 chars
      elif context[answerStartIndex - 2:answerEndIndex - 2] == answerText:
        answer['answer_start'] = answerStartIndex - 2
        answer['answer_end'] = answerEndIndex - 2
        offByTwoCount = offByTwoCount + 1
      
      elif context[answerStartIndex + 2:answerEndIndex + 2] == answerText:
        answer['answer_start'] = answerStartIndex + 2
        answer['answer_end'] = answerEndIndex + 2
        offByTwoCount = offByTwoCount + 1

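      else:
        # No nearby match: keep the unshifted end index so later steps still find 'answer_end'
        answer['answer_end'] = answerEndIndex
        print("!!Answer is outside correctable range!!") 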

    print ("OffByOne : " , offByOneCount, " , OffByTwo : ", offByTwoCount, " exact : ", exactCount)
add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)

OffByOne :  0  , OffByTwo :  0  exact :  86821
OffByOne :  0  , OffByTwo :  0  exact :  20302
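
The alignment check can also be written as a loop over candidate shifts. The sketch below is an illustrative variant on a toy sentence, not the code used for the counts above; the helper name `align_answer` is an assumption. It also skips negative start indices, which would otherwise wrap around to the end of the string.

```python
def align_answer(context, text, start):
    """Return (start, end) with context[start:end] == text, trying shifts of up to two characters."""
    for shift in (0, -1, 1, -2, 2):
        s = start + shift
        if s < 0:
            continue  # a negative index would slice from the end of the string
        e = s + len(text)
        if context[s:e] == text:
            return s, e
    return None  # no match within two characters

context = "The Eiffel Tower is in Paris."
print(align_answer(context, "Paris", 23))  # annotation is exact
print(align_answer(context, "Paris", 24))  # annotation is one character too far right
print(align_answer(context, "Paris", 10))  # no match within two characters
```

The returned end index is exclusive, matching the `answer_end` convention used in `add_end_idx`.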


Now train_answers and val_answers include the character end positions and the corrected start positions. Next, let’s tokenize our context/question pairs. 🤗 Tokenizers can accept parallel lists of sequences and encode them together as sequence pairs.

In [7]:
!pip install transformers==4.0.1
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Your code here
train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)

# Your code here
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)



Next we need to convert our character start/end positions to token start/end positions. When using 🤗 Fast Tokenizers, we can use the **built-in char_to_token()** method.

In [8]:
def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    
    # Your code here
    for answerIndex in range(len(answers)):
      #print (answers[answerIndex])
      start_positions.append(encodings.char_to_token(answerIndex, answers[answerIndex]['answer_start']))
      end_positions.append(encodings.char_to_token(answerIndex, answers[answerIndex]['answer_end'] - 1))
      
      # if None, the answer passage has been truncated
      if start_positions[-1] is None:
        start_positions[-1] = tokenizer.model_max_length
      
      if end_positions[-1] is None:
        end_positions[-1] = tokenizer.model_max_length

    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
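
To illustrate the idea of mapping a character index to a token index, here is a simplified stand-in that uses a whitespace tokenizer. Both helpers (`whitespace_offsets`, `char_to_token_sketch`) are hypothetical and are not how the 🤗 tokenizer is implemented; they only show why the end position is looked up at `answer_end - 1`.

```python
import re

def whitespace_offsets(text):
    """Stand-in tokenizer: one token per whitespace-separated word, as (start, end) character offsets."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def char_to_token_sketch(offsets, char_index):
    """Return the index of the token covering char_index, or None if no token covers it."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None

context = "The Eiffel Tower is in Paris."
offsets = whitespace_offsets(context)
answer = {"text": "Eiffel Tower", "answer_start": 4, "answer_end": 16}

start_token = char_to_token_sketch(offsets, answer["answer_start"])
# answer_end is exclusive, so the last answer character sits at answer_end - 1
end_token = char_to_token_sketch(offsets, answer["answer_end"] - 1)
print(start_token, end_token)

# Looking up answer_end itself lands on the space after the answer
print(char_to_token_sketch(offsets, answer["answer_end"]))
```

In this toy case the exclusive end index falls on whitespace and maps to no token, which is the kind of miss the `- 1` avoids.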

Our data is ready. Let’s just put it in a PyTorch/TensorFlow dataset so that we can easily use it for training. In PyTorch, we define a custom Dataset class. In TensorFlow, we pass a tuple of (inputs_dict, labels_dict) to the from_tensor_slices method.

In [9]:
import tensorflow as tf

# Your code here
train_dataset = tf.data.Dataset.from_tensor_slices((
    {key: train_encodings[key] for key in ['input_ids', 'attention_mask']},
    {key: train_encodings[key] for key in ['start_positions', 'end_positions']}
))

# Your code here
val_dataset = tf.data.Dataset.from_tensor_slices((
    {key: val_encodings[key] for key in ['input_ids', 'attention_mask']},
    {key: val_encodings[key] for key in ['start_positions', 'end_positions']}
))
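
As a rough mental model of what slicing a tuple of (inputs_dict, labels_dict) produces, the pure-Python sketch below yields one (inputs, labels) pair per example. The helper `slice_examples` and the toy ids are hypothetical; the real dataset yields tensors rather than Python lists.

```python
def slice_examples(inputs, labels):
    """Yield one (inputs, labels) pair per example from dicts of equal-length lists."""
    n = len(next(iter(inputs.values())))
    for i in range(n):
        yield ({k: v[i] for k, v in inputs.items()},
               {k: v[i] for k, v in labels.items()})

# Two made-up examples, three tokens each
inputs = {
    "input_ids": [[101, 7, 102], [101, 9, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
labels = {"start_positions": [1, 1], "end_positions": [1, 1]}

examples = list(slice_examples(inputs, labels))
print(examples[0])
```

Each element keeps the dictionary structure, so the model inputs and the two label columns stay paired per example.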

Now we can use a DistilBert model with a QA head for training:

In [10]:
from transformers import TFDistilBertForQuestionAnswering

# Your code here
model = TFDistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForQuestionAnswering: ['vocab_projector', 'vocab_layer_norm', 'activation_13', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs', 'dropout_19']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The data and model are both ready to go. You can train the model with Trainer/TFTrainer exactly as in the sequence classification example from the 🤗 Transformers documentation. If using native PyTorch, replace labels with start_positions and end_positions in the training example. If using Keras’s fit, we need to make a minor modification to handle this example since it involves multiple model outputs.

In [None]:
# Keras will expect a tuple when dealing with labels

# Write your code here to replace labels with start_positions and end_positions in the training example
train_dataset = train_dataset.map(lambda x, y: (x, (y['start_positions'], y['end_positions'])))

# Keras will assign a separate loss for each output and add them together. So we'll just use the standard CE loss
# instead of using the built-in model.compute_loss, which expects a dict of outputs and averages the two terms.
# Note that this means the loss will be 2x of when using TFTrainer since we're adding instead of averaging them.

# Your code here
loss =  tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.distilbert.return_dict = False # if using 🤗 Transformers > 3.0.2, make sure outputs are tuples

# Your code here
optimizer =  tf.keras.optimizers.Adam(learning_rate=5e-5)

model.compile(optimizer=optimizer, loss=loss) # can also use any keras loss fn
# The batch size is set by .batch(16); Keras expects batch_size to be left unset for tf.data.Dataset inputs
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3)

Epoch 1/3
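
The factor of two mentioned in the training cell's comments can be checked by hand. The sketch below computes a sparse categorical cross-entropy from logits for one made-up example, once for the start position and once for the end position, and compares the summed and averaged totals. The helper name and the logits are assumptions for illustration.

```python
import math

def sparse_ce_from_logits(logits, target):
    """Cross-entropy for a single example: -log softmax(logits)[target]."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]

# Made-up logits over three token positions
start_logits, end_logits = [2.0, 0.5, 0.1], [0.2, 0.3, 1.5]
start_loss = sparse_ce_from_logits(start_logits, 0)
end_loss = sparse_ce_from_logits(end_logits, 2)

summed = start_loss + end_loss          # one loss per output, added together
averaged = (start_loss + end_loss) / 2  # the averaging convention
print(summed, averaged)
```

The summed value is exactly twice the averaged one, so loss curves from the two setups differ only by that constant factor.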
