<a href="https://colab.research.google.com/github/HemantTiwariGitHub/CapstoneProject2021/blob/main/Question_Answering_with_SQuAD_2_0_20210102.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Question answering** comes in many forms. In this example, we’ll look at the particular type of extractive QA that involves answering a question about a passage by highlighting the segment of the passage that answers the question. This involves fine-tuning a model which predicts a start position and an end position in the passage. We will use the Stanford Question Answering Dataset (SQuAD) 2.0.

We will start by downloading the data:

## **Note :**

Please write your code in the cells with the "**Your code here**" placeholder.

## **Download SQuAD 2.0 Data**

Note: This dataset can also be explored on the Hugging Face Hub (SQuAD V2), and can alternatively be downloaded with the 🤗 Datasets library (formerly the 🤗 NLP library) using load_dataset("squad_v2").

In [6]:
!mkdir -p /content/squad
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json -O /content/squad/train-v2.0.json
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json -O /content/squad/dev-v2.0.json

--2021-01-02 08:12:50--  https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 42123633 (40M) [application/json]
Saving to: ‘/content/squad/train-v2.0.json’


2021-01-02 08:12:51 (134 MB/s) - ‘/content/squad/train-v2.0.json’ saved [42123633/42123633]

--2021-01-02 08:12:51--  https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
Resolving rajpurkar.github.io (rajpurkar.github.io)... 185.199.108.153, 185.199.109.153, 185.199.111.153, ...
Connecting to rajpurkar.github.io (rajpurkar.github.io)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4370528 (4.2M) [application/json]
Saving to: ‘/content/squad/dev-v2.0.json’


2021-01-02 08:12:51 (23.2 MB/s) - ‘/content/squad/dev-v

In [32]:
!pip install tqdm
from tqdm import tqdm



In [33]:
import json
from pathlib import Path

def loadJSONData(filename):
    with open(filename) as jsonDataFile:
        data = json.load(jsonDataFile)
    return data

In [84]:
# The data contains multiple titles (articles).
# Every title has multiple paragraphs, and each paragraph stores its text under 'context'.
# Every paragraph has multiple questions, and every question has a list of answers, each with an answer start index.
# If a question can be answered from the context, is_impossible is False.

def preprocessSQUAD(JSONData):
    BaseData = JSONData['data']
    print("Length Of JSON Data : ", len(BaseData))
    # Only the first two titles are printed to keep the output short
    for article in BaseData[:2]:
        print("Title : ", article['title'])

        for paragraph in article['paragraphs']:
            print("Context : ", paragraph['context'])

            for question in paragraph['qas']:
                print("Question : ", question['question'])

                for answer in question['answers']:
                    answerText = answer['text']
                    answerStartIndex = answer['answer_start']
                    answerEndIndex = answerStartIndex + len(answerText)
                    print("Answer : ", answerText)
                    print("AnswerStartIndex : ", answerStartIndex, " AnswerEndIndex : ", answerEndIndex)

In [85]:
read_squad('/content/squad/dev-v2.0.json')

Length Of JSON Data :  35
Title :  Normans
Context :  The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.
Question :  In what country is Normandy located?
Answer :  France
AnswerStartIndex :  159  AnswerEndIndex :  165
Answer :  France
AnswerStartIndex :  159  AnswerEndIndex :  165
Answer :  France
AnswerSta

In [37]:
# NOTE: tokenize() and get_char_word_loc_mapping() are not defined in this notebook;
# they are assumed to be provided elsewhere before this function is called.
def preprocessSQUAD1(JSONData):
    num_exs = 0 # number of (context, question, answer) triples kept
    num_mappingprob, num_tokenprob, num_spanalignprob = 0, 0, 0
    examples = []
    print(len(JSONData['data']))

    for articles_id in tqdm(range(len(JSONData['data']))):

        article_paragraphs = JSONData['data'][articles_id]['paragraphs']
        for pid in range(len(article_paragraphs)):

            context = str(article_paragraphs[pid]['context'])
            context = context.replace("''", '" ')
            context = context.replace("``", '" ')

            context_tokens = tokenize(context) # list of strings (lowercase)
            context = context.lower()

            qas = article_paragraphs[pid]['qas'] # list of questions

            # charloc2wordloc maps the character location (int) of a context token
            # to a pair giving (word (string), word loc (int)) of that token
            charloc2wordloc = get_char_word_loc_mapping(context, context_tokens)

            if charloc2wordloc is None: # there was a problem
                num_mappingprob += len(qas)
                continue # skip this context example

            # for each question, process the question and answer
            for qn in qas:

                # read the question text and tokenize
                question = str(qn['question']) # string
                question_tokens = tokenize(question) # list of strings

                # SQuAD 2.0 includes unanswerable questions with an empty answer list; skip them
                if not qn['answers']:
                    continue

                # if there are several answers, just take the first
                ans_text = str(qn['answers'][0]['text']).lower() # get the answer text
                ans_start_charloc = qn['answers'][0]['answer_start'] # answer start loc (character count)
                ans_end_charloc = ans_start_charloc + len(ans_text) # answer end loc (character count) (exclusive)

                # Check that the provided character spans match the provided answer text
                if context[ans_start_charloc:ans_end_charloc] != ans_text:
                    # Sometimes this is misaligned. Under Python 2 this was mostly because "narrow builds"
                    # interpret certain Unicode characters to have length 2:
                    # https://stackoverflow.com/questions/29109944/python-returns-length-of-2-for-single-unicode-character-string
                    num_spanalignprob += 1
                    continue

                # get word locs for answer start and end (inclusive)
                ans_start_wordloc = charloc2wordloc[ans_start_charloc][1] # answer start word loc
                ans_end_wordloc = charloc2wordloc[ans_end_charloc-1][1] # answer end word loc
                assert ans_start_wordloc <= ans_end_wordloc

                # Check retrieved answer tokens match the provided answer text.
                # Sometimes they won't match, e.g. if the context contains the phrase "fifth-generation"
                # and the answer character span is around "generation",
                # but the tokenizer regards "fifth-generation" as a single token.
                # Then ans_tokens has "fifth-generation" but the ans_text is "generation", which doesn't match.
                ans_tokens = context_tokens[ans_start_wordloc:ans_end_wordloc+1]
                if "".join(ans_tokens) != "".join(ans_text.split()):
                    num_tokenprob += 1
                    continue # skip this question/answer pair

                examples.append((' '.join(context_tokens), ' '.join(question_tokens), ' '.join(ans_tokens), ' '.join([str(ans_start_wordloc), str(ans_end_wordloc)])))

                num_exs += 1

    print("Number of (context, question, answer) triples discarded due to char -> token mapping problems: ", num_mappingprob)
    print("Number of (context, question, answer) triples discarded because character-based answer span is unaligned with tokenization: ", num_tokenprob)
    print("Number of (context, question, answer) triples discarded due to character span alignment problems (usually Unicode problems): ", num_spanalignprob)
    print("Processed %i examples of total %i\n" % (num_exs, num_exs + num_mappingprob + num_tokenprob + num_spanalignprob))
    return examples

Each split is a structured JSON file with a number of questions and answers for each passage (or context). We’ll take this apart into parallel lists of contexts, questions, and answers (note that the contexts here are repeated since there are multiple questions per context):

In [45]:
def read_squad(path):
  dataInJSON = loadJSONData(path)
  preprocessSQUAD(dataInJSON)

  # Your code here
  
  #return contexts, questions, answers

#train_contexts, train_questions, train_answers = read_squad('/content/squad/train-v2.0.json')
#val_contexts, val_questions, val_answers = read_squad('/content/squad/dev-v2.0.json')
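
As a rough illustration of the flattening described above, here is a minimal sketch that works on an already-loaded dict rather than a file path. The name read_squad_sketch and the toy input are hypothetical and only mirror the SQuAD 2.0 layout; this is one possible approach, not the expected solution for the placeholder cell.

```python
def read_squad_sketch(squad_dict):
    """Flatten a SQuAD-style dict into parallel lists (hypothetical helper)."""
    contexts, questions, answers = [], [], []
    for article in squad_dict['data']:
        for paragraph in article['paragraphs']:
            context = paragraph['context']
            for qa in paragraph['qas']:
                # One entry per answer; questions with no answers add nothing
                for answer in qa['answers']:
                    contexts.append(context)
                    questions.append(qa['question'])
                    answers.append(dict(answer))  # copy so later edits don't touch the input
    return contexts, questions, answers

# Tiny hand-made input in the SQuAD 2.0 layout (not real dataset content)
toy_context = "The Normans gave their name to Normandy, a region in France."
toy_squad = {'data': [{'title': 'Normans', 'paragraphs': [{
    'context': toy_context,
    'qas': [
        {'question': 'In what country is Normandy located?', 'is_impossible': False,
         'answers': [{'text': 'France', 'answer_start': toy_context.index('France')}]},
        {'question': 'Who gave their name to Normandy?', 'is_impossible': False,
         'answers': [{'text': 'The Normans', 'answer_start': 0}]},
        {'question': 'What is the population of Normandy?', 'is_impossible': True,
         'answers': []},
    ]}]}]}

toy_contexts, toy_questions, toy_answers = read_squad_sketch(toy_squad)
print(len(toy_contexts), len(toy_questions), len(toy_answers))
```

Note that with this approach the unanswerable question is silently dropped, because its answer list is empty. Whether that is acceptable depends on how you want to treat SQuAD 2.0's unanswerable questions.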


The contexts and questions are just strings. The answers are dicts containing the subsequence of the passage with the correct answer as well as an integer indicating the character at which the answer begins. In order to train a model on this data we need (1) the tokenized context/question pairs, and (2) integers indicating at which token positions the answer begins and ends.

First, let’s get the character position at which the answer ends in the passage (we are given the starting position). Sometimes SQuAD answers are off by one or two characters, so we will also adjust for that.

In [None]:
def add_end_idx(answers, contexts):
    for answer, context in zip(answers, contexts):
      
      # Your code here
        ...

add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)
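
To make the off-by-one-or-two adjustment concrete, here is a minimal sketch under the assumption that each answer is a dict with 'text' and 'answer_start' keys. The function name add_end_idx_sketch and the 'answer_end' key are illustrative choices; it tries the given start position first, then shifts left by one and two characters.

```python
def add_end_idx_sketch(answers, contexts):
    """Add 'answer_end' (exclusive) and correct 'answer_start' if it is 1-2 chars late."""
    for answer, context in zip(answers, contexts):
        gold_text = answer['text']
        start_idx = answer['answer_start']
        for shift in (0, 1, 2):
            start = start_idx - shift
            end = start + len(gold_text)
            if start >= 0 and context[start:end] == gold_text:
                answer['answer_start'] = start
                answer['answer_end'] = end
                break
        # If nothing matches, the answer is left without 'answer_end'

demo_context = "a region in France."
true_start = demo_context.index("France")
demo_answers = [
    {'text': 'France', 'answer_start': true_start},      # correct label
    {'text': 'France', 'answer_start': true_start + 1},  # off by one
    {'text': 'France', 'answer_start': true_start + 2},  # off by two
]
add_end_idx_sketch(demo_answers, [demo_context] * 3)
print(demo_answers)
```

The end index is exclusive here, so context[answer_start:answer_end] reproduces the answer text.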

Now train_answers and val_answers include the character end positions and the corrected start positions. Next, let’s tokenize our context/question pairs. 🤗 Tokenizers can accept parallel lists of sequences and encode them together as sequence pairs.

In [None]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Your code here
train_encodings = ...

# Your code here
val_encodings = ...

Next we need to convert our character start/end positions to token start/end positions. When using 🤗 Fast Tokenizers, we can use the <b>built-in char_to_token()</b> method.

In [None]:
def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    
    # Your code here
    ...

    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
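
The idea behind a character-to-token lookup can be shown without any tokenizer library. The sketch below uses a toy whitespace "tokenizer" and two hypothetical helpers (toy_offsets, toy_char_to_token); it is not the 🤗 Tokenizers API, only an illustration of the mapping. Because the character end index is exclusive, the end token is looked up at answer_end - 1. A lookup can also fail (here, on whitespace), so the None case needs handling in real code.

```python
import re

def toy_offsets(text):
    """Return (start, end) character spans for whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def toy_char_to_token(offsets, char_idx):
    """Return the index of the token covering char_idx, or None if there is none."""
    for token_idx, (start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return token_idx
    return None

passage = "a region in France."
offsets = toy_offsets(passage)

answer_start = passage.index("France")
answer_end = answer_start + len("France")  # exclusive character end

start_token = toy_char_to_token(offsets, answer_start)
end_token = toy_char_to_token(offsets, answer_end - 1)  # last character of the answer
on_whitespace = toy_char_to_token(offsets, 1)           # a space maps to no token

print(start_token, end_token, on_whitespace)
```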

Our data is ready. Let’s just put it in a PyTorch/TensorFlow dataset so that we can easily use it for training. In PyTorch, we define a custom Dataset class. In TensorFlow, we pass a tuple of (inputs_dict, labels_dict) to the from_tensor_slices method.

In [None]:
import tensorflow as tf

# Your code here
train_dataset = ...

# Your code here
val_dataset = ...

Now we can use a DistilBert model with a QA head for training:

In [None]:
from transformers import TFDistilBertForQuestionAnswering

# Your code here
model = ...

The data and model are both ready to go. You can train the model with Trainer/TFTrainer exactly as in the sequence classification example above. If using native PyTorch, replace labels with start_positions and end_positions in the training example. If using Keras’s fit, we need to make a minor modification to handle this example since it involves multiple model outputs.

In [None]:
# Keras will expect a tuple when dealing with labels

# Write your code here to replace labels with start_positions and end_positions in the training example
train_dataset = train_dataset.map(...)

# Keras will assign a separate loss for each output and add them together. So we'll just use the standard CE loss
# instead of using the built-in model.compute_loss, which expects a dict of outputs and averages the two terms.
# Note that this means the loss will be 2x of when using TFTrainer since we're adding instead of averaging them.

# Your code here
loss = ...
model.distilbert.return_dict = False # if using 🤗 Transformers >3.02, make sure outputs are tuples

# Your code here
optimizer = ...

model.compile(optimizer=optimizer, loss=loss) # can also use any keras loss fn
# The dataset is already batched, so batch_size is not passed to fit
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3)
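
The comment about the loss being 2x can be checked with a little arithmetic. The sketch below computes a cross-entropy term for a start position and an end position using made-up logits, then compares summing the two terms (what the comments above describe for Keras) with averaging them. The cross_entropy helper and all numbers are illustrative assumptions, not values from the model.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# Made-up logits over three token positions
start_logits = [2.0, 0.5, -1.0]
end_logits = [0.1, 0.2, 3.0]

start_loss = cross_entropy(start_logits, target=0)
end_loss = cross_entropy(end_logits, target=2)

summed_loss = start_loss + end_loss          # one loss per output, added together
averaged_loss = (start_loss + end_loss) / 2  # the two terms averaged

print(summed_loss, averaged_loss)
```

The summed value is exactly twice the averaged one, so the two setups differ only by a constant factor in the reported loss.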