<a href="https://colab.research.google.com/github/Azizkhaled/NLP/blob/main/Question_Answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Open Domain Question Answering (ODQA)

## Dataset: SQuAD

SQuAD (Stanford Question Answering Dataset) is a hugely popular dataset of question-answer pairs based on Wikipedia articles, covering topics ranging from Beyoncé to physics.

### Download the data

In [1]:
url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/'
files = ['train-v2.0.json', 'dev-v2.0.json']

In [2]:
import os

squad_dir = './data'

# create the directory if needed; do not fail when the cell is re-run
os.makedirs(squad_dir, exist_ok=True)

In [3]:
import requests

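for file in files:
    # stream the response so the whole file is not held in memory at once
    res = requests.get(url + file, stream=True)
    res.raise_for_status()
    # write to file in chunks
    with open(os.path.join(squad_dir, file), 'wb') as f:
        for chunk in res.iter_content(chunk_size=8192):
            f.write(chunk)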

In [4]:
import json

with open(os.path.join(squad_dir, 'train-v2.0.json'), 'rb') as f:
    squad = json.load(f)

### Reorganize the Train data

In [5]:
# initialize list where we will place all of our data
new_squad = []

# we need to loop through groups -> paragraphs -> qa_pairs
for group in squad['data']:
    for paragraph in group['paragraphs']:
        # we pull out the context from here
        context = paragraph['context']
        for qa_pair in paragraph['qas']:
            # we pull out the question
            question = qa_pair['question']
            # now the logic to check if we have 'answers' or 'plausible_answers'
            if 'answers' in qa_pair.keys() and len(qa_pair['answers']) > 0:
                answer = qa_pair['answers'][0]['text']
            elif 'plausible_answers' in qa_pair.keys() and len(qa_pair['plausible_answers']) > 0:
                answer = qa_pair['plausible_answers'][0]['text']
            else:
                # this shouldn't happen, but just in case we just set answer = None
                answer = None
            # append dictionary sample to parsed squad
            new_squad.append({
                'question': question,
                'answer': answer,
                'context': context
            })
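
The same groups -> paragraphs -> qa_pairs traversal is repeated for the dev data further down, so it could be wrapped in one function. The sketch below is only an illustration: `flatten_squad`, its `keep_all_answers` flag, and the `toy` dictionary are hypothetical names and hand-made data, not part of the notebook or of SQuAD itself.

```python
def flatten_squad(raw, keep_all_answers=False):
    """Flatten SQuAD v2.0-style JSON into a list of question/answer/context dicts."""
    samples = []
    for group in raw['data']:
        for paragraph in group['paragraphs']:
            context = paragraph['context']
            for qa_pair in paragraph['qas']:
                # prefer 'answers', fall back to 'plausible_answers', else empty
                candidates = qa_pair.get('answers') or qa_pair.get('plausible_answers') or []
                texts = [c['text'] for c in candidates]
                if keep_all_answers:
                    answer = texts
                else:
                    answer = texts[0] if texts else None
                samples.append({
                    'question': qa_pair['question'],
                    'answer': answer,
                    'context': context
                })
    return samples


# tiny hand-made sample in the SQuAD v2.0 layout (not real data)
toy = {'data': [{'paragraphs': [{
    'context': 'Rollo was a Norse leader.',
    'qas': [
        {'question': 'Who was the Norse leader?',
         'answers': [{'text': 'Rollo', 'answer_start': 0}]},
        {'question': 'Who was the Frankish leader?',
         'answers': [],
         'plausible_answers': [{'text': 'Rollo', 'answer_start': 0}]},
    ],
}]}]}

flat = flatten_squad(toy)
print(flat[0]['answer'], flat[1]['answer'])
```

With `keep_all_answers=True` the function returns every answer text as a list, which is closer to what the dev cell needs.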

### Save the train data

In [6]:
with open(os.path.join(squad_dir, 'train.json'), 'w') as f:
    json.dump(new_squad, f)

### Same operation for dev data

In [7]:
with open(os.path.join(squad_dir, 'dev-v2.0.json'), 'rb') as f:
    squad_dev = json.load(f)

In [8]:
squad_dev['data'][0]['paragraphs'][0]

{'qas': [{'question': 'In what country is Normandy located?',
   'id': '56ddde6b9a695914005b9628',
   'answers': [{'text': 'France', 'answer_start': 159},
    {'text': 'France', 'answer_start': 159},
    {'text': 'France', 'answer_start': 159},
    {'text': 'France', 'answer_start': 159}],
   'is_impossible': False},
  {'question': 'When were the Normans in Normandy?',
   'id': '56ddde6b9a695914005b9629',
   'answers': [{'text': '10th and 11th centuries', 'answer_start': 94},
    {'text': 'in the 10th and 11th centuries', 'answer_start': 87},
    {'text': '10th and 11th centuries', 'answer_start': 94},
    {'text': '10th and 11th centuries', 'answer_start': 94}],
   'is_impossible': False},
  {'question': 'From which countries did the Norse originate?',
   'id': '56ddde6b9a695914005b962a',
   'answers': [{'text': 'Denmark, Iceland and Norway', 'answer_start': 256},
    {'text': 'Denmark, Iceland and Norway', 'answer_start': 256},
    {'text': 'Denmark, Iceland and Norway', 'answer_star

In [9]:
# initialize list where we will place all of our data
dev_squad = []

# we need to loop through groups -> paragraphs -> qa_pairs
for group in squad_dev['data']:
    for paragraph in group['paragraphs']:
        # we pull out the context from here
        context = paragraph['context']
        for qa_pair in paragraph['qas']:
            # we pull out the question
            question = qa_pair['question']
            # now the logic to check if we have 'answers' or 'plausible_answers'
            if 'answers' in qa_pair.keys() and len(qa_pair['answers']) > 0:
                # get all answers
                answer = [answer['text'] for answer in qa_pair['answers']]
            elif 'plausible_answers' in qa_pair.keys() and len(qa_pair['plausible_answers']) > 0:
                # get all plausible answers
                answer = [answer['text'] for answer in qa_pair['plausible_answers']]
            else:
                # this shouldn't happen, but just in case we use an empty list
                answer = []

            # append dictionary sample to parsed squad
            dev_squad.append({
                'question': question,
                'answers': list(set(answer)), #convert to set to remove duplicates
                'context': context
            })
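
One caveat with `list(set(answer))`: a set does not keep the original order, so `answers[0]` may not be the first annotator's answer and can differ between runs. If a stable order matters, one possible alternative (a sketch, not what the notebook uses) is to deduplicate through dictionary keys, which keep insertion order in Python 3.7+. The `answers` list below is copied from the dev sample shown earlier.

```python
answers = ['10th and 11th centuries',
           'in the 10th and 11th centuries',
           '10th and 11th centuries',
           '10th and 11th centuries']

# dict keys are unique and remember the order in which they were first seen
unique_answers = list(dict.fromkeys(answers))
print(unique_answers)
# ['10th and 11th centuries', 'in the 10th and 11th centuries']
```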

In [10]:
dev_squad

[{'question': 'In what country is Normandy located?',
  'answers': ['France'],
  'context': 'The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse ("Norman" comes from "Norseman") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.'},
 {'question': 'When were the Normans in Normandy?',
  'answers': ['in the 10th and 11th centuries', '10th and 11th centuries'],
  'context': 'The Normans (Norman

In [11]:
with open(os.path.join(squad_dir, 'dev.json'), 'w') as f:
    json.dump(dev_squad, f)

## Question Answering with BERT Transformer model

For our first QA model we will set up a simple question-answering pipeline using HuggingFace transformers and a pretrained BERT model. We will test it on the SQuAD dev data prepared above, which we load once the model is ready.

### Initialize the model

In [12]:
pip install transformers

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [13]:
from transformers import BertTokenizer, BertForQuestionAnswering

# we can get these models from hugging face
modelname = 'deepset/bert-base-cased-squad2'

tokenizer = BertTokenizer.from_pretrained(modelname)
model = BertForQuestionAnswering.from_pretrained(modelname)


Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Data pipeline

In [14]:
from transformers import pipeline

qa = pipeline('question-answering', model=model, tokenizer=tokenizer)

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'


In [15]:
with open('./data/dev.json', 'r') as f:
    squad = json.load(f)

## Make predictions

In [16]:
QA = []

for pair in squad[0:4]:
    # pass in our question and context to return an answer
    ans = qa({
        'question': pair['question'],
        'context': pair['context']
    })
    # append predicted answer and real to answers list
    QA.append({
        'Question': pair['question'],
        'predicted': ans['answer'],
        'true': pair['answers']
    })

In [17]:
QA

[{'Question': 'In what country is Normandy located?',
  'predicted': 'France.',
  'true': ['France']},
 {'Question': 'When were the Normans in Normandy?',
  'predicted': '10th and 11th centuries',
  'true': ['in the 10th and 11th centuries', '10th and 11th centuries']},
 {'Question': 'From which countries did the Norse originate?',
  'predicted': 'Denmark, Iceland and Norway',
  'true': ['Denmark, Iceland and Norway']},
 {'Question': 'Who was the Norse leader?',
  'predicted': 'Rollo,',
  'true': ['Rollo']}]

GOOD JOB!

## Evaluating the model

### Exact Match (EM)


In [18]:
em = []

for answer in QA:
  for true in answer['true']:
    if answer['predicted'] == true:
        em.append(1)
    else:
        em.append(0)

# then total up all values in em and divide by number of values
sum(em)/len(em)

0.4

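We can see that the exact match score is only 0.4. Two predictions differ from the reference only by trailing punctuation ('France.' vs 'France', 'Rollo,' vs 'Rollo'), and every reference answer is scored separately, so a prediction that matches one of two references still counts as one miss.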

### Understanding EM better

Let's lowercase both strings and filter out anything that's not a number, a letter, or a space before comparing.

In [19]:
import re

em = []

for answer in QA:
    for true in answer['true']:
        # lowercase and keep only digits, letters, and spaces
        pred = re.sub('[^0-9a-z ]', '', answer['predicted'].lower())
        true = re.sub('[^0-9a-z ]', '', true.lower())
        if pred == true:
            em.append(1)
        else:
            em.append(0)

# then total up all values in em and divide by number of values
sum(em)/len(em)

0.8

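After normalization the exact match score rises to 0.8. The one remaining miss comes from the second question, where the prediction matches only one of its two reference answers.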
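
The loops above score every (prediction, reference) pair. A common alternative is to score each question once and count it as correct if the prediction matches any of its references. The sketch below follows that convention; `normalize` and `exact_match` are hypothetical helper names, and `samples` simply copies the four predictions printed earlier.

```python
import re


def normalize(text):
    # lowercase and keep only digits, letters, and spaces
    return re.sub('[^0-9a-z ]', '', text.lower()).strip()


def exact_match(predicted, references):
    # 1 if the normalized prediction equals any normalized reference, else 0
    pred = normalize(predicted)
    return int(any(pred == normalize(ref) for ref in references))


samples = [
    {'predicted': 'France.', 'true': ['France']},
    {'predicted': '10th and 11th centuries',
     'true': ['in the 10th and 11th centuries', '10th and 11th centuries']},
    {'predicted': 'Denmark, Iceland and Norway',
     'true': ['Denmark, Iceland and Norway']},
    {'predicted': 'Rollo,', 'true': ['Rollo']},
]

em_per_question = [exact_match(s['predicted'], s['true']) for s in samples]
print(sum(em_per_question) / len(em_per_question))  # 1.0
```

Under this per-question convention all four predictions count as exact matches, because each one agrees with at least one reference after normalization.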

### ROUGE

ROUGE stands for **R**ecall-**O**riented **U**nderstudy for **G**isting **E**valuation. The name is deceptively complicated, because this is not a difficult metric to understand, and it's incredibly easy to implement.

ROUGE-N compares n-grams, where N is the number of consecutive words in each group.

The most common variants are ROUGE-1 (unigrams) and ROUGE-2 (bigrams).
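
To make the idea of word groups concrete, here is a minimal sketch that splits a sentence on whitespace and lists its unigrams and bigrams. The `ngrams` helper is a hypothetical name written for this illustration; it is not part of the `rouge` package.

```python
def ngrams(text, n):
    # split on whitespace and slide a window of n tokens over the result
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = 'in the 10th and 11th centuries'
print(ngrams(sentence, 1))  # six unigrams
print(ngrams(sentence, 2))  # five bigrams
```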


In [21]:
pip install rouge

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [24]:
from rouge import Rouge

In [23]:
true = '10th and 11th centuries'
predicted = 'in the 10th and 11th centuries'

In [25]:
rouge = Rouge()

In [26]:
rouge.get_scores(predicted, true)

[{'rouge-1': {'r': 1.0, 'p': 0.6666666666666666, 'f': 0.7999999952000001},
  'rouge-2': {'r': 1.0, 'p': 0.6, 'f': 0.7499999953125},
  'rouge-l': {'r': 1.0, 'p': 0.6666666666666666, 'f': 0.7999999952000001}}]

In [42]:
model_pred = [ans['predicted'] for ans in QA]

true = [ans['true'][0] for ans in QA]

In [43]:
rouge.get_scores(model_pred, true)

[{'rouge-1': {'r': 1.0, 'p': 1.0, 'f': 0.999999995},
  'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0},
  'rouge-l': {'r': 1.0, 'p': 1.0, 'f': 0.999999995}},
 {'rouge-1': {'r': 0.6666666666666666, 'p': 1.0, 'f': 0.7999999952000001},
  'rouge-2': {'r': 0.6, 'p': 1.0, 'f': 0.7499999953125},
  'rouge-l': {'r': 0.6666666666666666, 'p': 1.0, 'f': 0.7999999952000001}},
 {'rouge-1': {'r': 1.0, 'p': 1.0, 'f': 0.999999995},
  'rouge-2': {'r': 1.0, 'p': 1.0, 'f': 0.999999995},
  'rouge-l': {'r': 1.0, 'p': 1.0, 'f': 0.999999995}},
 {'rouge-1': {'r': 0.0, 'p': 0.0, 'f': 0.0},
  'rouge-2': {'r': 0.0, 'p': 0.0, 'f': 0.0},
  'rouge-l': {'r': 0.0, 'p': 0.0, 'f': 0.0}}]

In [44]:
rouge.get_scores(model_pred, true, avg=True)

{'rouge-1': {'r': 0.6666666666666666, 'p': 0.75, 'f': 0.6999999963000001},
 'rouge-2': {'r': 0.4, 'p': 0.5, 'f': 0.43749999757812497},
 'rouge-l': {'r': 0.6666666666666666, 'p': 0.75, 'f': 0.6999999963000001}}
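
The averaged ROUGE-1 values look like the plain mean of the four per-sample scores printed above. The sketch below checks this by hand on the copied ROUGE-1 numbers; it is an observation about this output, not a statement about how the `rouge` package is implemented.

```python
# ROUGE-1 scores copied from the per-sample output above
per_sample = [
    {'r': 1.0, 'p': 1.0, 'f': 0.999999995},
    {'r': 0.6666666666666666, 'p': 1.0, 'f': 0.7999999952000001},
    {'r': 1.0, 'p': 1.0, 'f': 0.999999995},
    {'r': 0.0, 'p': 0.0, 'f': 0.0},
]

# arithmetic mean of each metric over the samples
averaged = {
    key: sum(sample[key] for sample in per_sample) / len(per_sample)
    for key in ('r', 'p', 'f')
}
print(averaged)
```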

#### Apply ROUGE to 50 predictions

In [None]:
from tqdm import tqdm
reference = []
pred = []
for pair in tqdm(squad[0:50], leave=True):
    # pass in our question and context to return an answer
    ans = qa({
        'question': pair['question'],
        'context': pair['context']
    })
    # append predicted answer and real to answers list
    reference.append(pair['answers'][0])
    pred.append(ans['answer'])


#### Clean the references and predictions, keeping only numbers, letters, and spaces to avoid misevaluation

In [65]:
import re

clean = re.compile('(?i)[^0-9a-z ]')

# apply this to both lists
model_out = [clean.sub('', text) for text in pred]
reference = [clean.sub('', text) for text in reference]
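
A quick check of what this pattern does on a few strings may help. Note that the `(?i)` flag only makes the match case-insensitive: the text is not lowercased, and accented letters fall outside `a-z` and are removed. The `examples` list is made up for illustration.

```python
import re

clean = re.compile('(?i)[^0-9a-z ]')

examples = ['France.', 'Rollo,', 'Denmark, Iceland and Norway', 'Beyoncé']
cleaned = [clean.sub('', text) for text in examples]
print(cleaned)
# ['France', 'Rollo', 'Denmark Iceland and Norway', 'Beyonc']
```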

In [69]:
# recalculate individual scores
scores = rouge.get_scores(model_out, reference)

print(model_out[4], ' | ', reference[4], ' | ', scores[4]['rouge-1']['f'])
print(model_out[22], ' | ', reference[22], ' | ', scores[22]['rouge-1']['p'])

10th  |  the first half of the 10th century  |  0.2857142832653061
King Charles III of West Francia and the famed Viking ruler Rollo  |  King Charles III  |  0.25


In [67]:
rouge.get_scores(model_out, reference, avg=True)

{'rouge-1': {'r': 0.6185714285714284,
  'p': 0.5751515151515152,
  'f': 0.555797310616831},
 'rouge-2': {'r': 0.32871428571428574,
  'p': 0.3331601731601731,
  'f': 0.30692152121475363},
 'rouge-l': {'r': 0.6185714285714284,
  'p': 0.5751515151515152,
  'f': 0.555797310616831}}

r: Recall = (number of matching n-grams)/(number of n-grams in the reference)

p: Precision = (number of matching n-grams)/(number of n-grams in the prediction)

f: F1score = 2*(p * r)/(p+r)
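
As a sanity check, the three quantities can be computed by hand, with recall dividing by the number of reference n-grams and precision by the number of predicted n-grams. The sketch below uses whitespace tokens and clipped n-gram counts; `rouge_n` is a hypothetical helper and may differ in detail from the `rouge` package, but on the '10th and 11th centuries' example it gives approximately the same ROUGE-1 and ROUGE-2 values as the library output shown earlier.

```python
from collections import Counter


def rouge_n(predicted, reference, n=1):
    def ngram_counts(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred_counts = ngram_counts(predicted)
    ref_counts = ngram_counts(reference)
    # matching n-grams: counts clipped to the smaller of the two sides
    overlap = sum((pred_counts & ref_counts).values())
    p = overlap / max(sum(pred_counts.values()), 1)  # share of predicted n-grams that match
    r = overlap / max(sum(ref_counts.values()), 1)   # share of reference n-grams that match
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {'r': r, 'p': p, 'f': f}


hyp = 'in the 10th and 11th centuries'
ref = '10th and 11th centuries'
scores_1 = rouge_n(hyp, ref, n=1)
scores_2 = rouge_n(hyp, ref, n=2)
print(scores_1)  # r = 1.0, p is about 0.667, f is about 0.8
print(scores_2)  # r = 1.0, p = 0.6, f is about 0.75
```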