<a href="https://colab.research.google.com/github/Azizkhaled/NLP-with-Aziz/blob/main/Question_Answering_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Open Domain Question Answering (ODQA)

## Dataset: SQuAD

SQuAD (the Stanford Question Answering Dataset) is a widely used reading-comprehension dataset of question-answer pairs written by crowdworkers about Wikipedia passages, covering topics ranging from Beyoncé to physics.

### Download the data

In [5]:
url = 'https://rajpurkar.github.io/SQuAD-explorer/dataset/'
files = ['train-v2.0.json', 'dev-v2.0.json']

In [6]:
import os

squad_dir = './data/squad'

# create the directory and any missing parents; no error if it already exists
os.makedirs(squad_dir, exist_ok=True)

In [7]:
import requests

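for file in files:
    # stream the response so the whole file is not held in memory at once
    res = requests.get(url + file, stream=True)
    res.raise_for_status()  # stop early if the download failed
    # write to file in chunks
    with open(os.path.join(squad_dir, file), 'wb') as f:
        for chunk in res.iter_content(chunk_size=8192):
            f.write(chunk)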

In [8]:
import json

with open(os.path.join(squad_dir, 'train-v2.0.json'), 'rb') as f:
    squad = json.load(f)
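
To get a feel for the nesting before reorganizing it, a small helper can count articles, paragraphs, and questions. The sketch below runs on a tiny hand-made sample that mimics the SQuAD v2.0 layout. `toy_squad` and `summarize_squad` are illustrative names, not part of the dataset or any library, and the `is_impossible` flag is assumed to be present as in the v2.0 files.

```python
def summarize_squad(raw):
    """Count articles, paragraphs, questions and unanswerable questions."""
    counts = {'articles': 0, 'paragraphs': 0, 'questions': 0, 'unanswerable': 0}
    for group in raw['data']:
        counts['articles'] += 1
        for paragraph in group['paragraphs']:
            counts['paragraphs'] += 1
            for qa_pair in paragraph['qas']:
                counts['questions'] += 1
                # .get() keeps the helper usable if the flag is missing
                if qa_pair.get('is_impossible', False):
                    counts['unanswerable'] += 1
    return counts


# hand-made sample (hypothetical content) with the same nesting as SQuAD v2.0
toy_squad = {
    'data': [{
        'title': 'Toy article',
        'paragraphs': [{
            'context': 'Rollo arrived in Normandy in the 880s.',
            'qas': [
                {'question': 'When did Rollo arrive in Normandy?',
                 'answers': [{'text': '880s'}],
                 'is_impossible': False},
                {'question': 'Who ruled Paris?',
                 'answers': [],
                 'plausible_answers': [{'text': 'Rollo'}],
                 'is_impossible': True},
            ],
        }],
    }]
}

toy_counts = summarize_squad(toy_squad)
print(toy_counts)
```

Calling `summarize_squad(squad)` on the loaded training file should report the real counts in the same form.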

### Reorganize the Train data

In [30]:
# initialize list where we will place all of our data
new_squad = []

# we need to loop through groups -> paragraphs -> qa_pairs
for group in squad['data']:
    for paragraph in group['paragraphs']:
        # we pull out the context from here
        context = paragraph['context']
        for qa_pair in paragraph['qas']:
            # we pull out the question
            question = qa_pair['question']
            # now the logic to check if we have 'answers' or 'plausible_answers'
            if 'answers' in qa_pair.keys() and len(qa_pair['answers']) > 0:
                answer = qa_pair['answers'][0]['text']
            elif 'plausible_answers' in qa_pair.keys() and len(qa_pair['plausible_answers']) > 0:
                answer = qa_pair['plausible_answers'][0]['text']
            else:
                # this shouldn't happen, but just in case we just set answer = None
                answer = None
            # append dictionary sample to parsed squad
            new_squad.append({
                'question': question,
                'answer': answer,
                'context': context
            })

### Save the train data

In [32]:
with open(os.path.join(squad_dir, 'train.json'), 'w') as f:
    json.dump(new_squad, f)

### Same operation for dev data

In [33]:
with open(os.path.join(squad_dir, 'dev-v2.0.json'), 'rb') as f:
    squad_dev = json.load(f)

In [None]:
squad_dev['data'][0]['paragraphs'][0]

In [60]:
# initialize list where we will place all of our data
dev_squad = []

# we need to loop through groups -> paragraphs -> qa_pairs
for group in squad_dev['data']:
    for paragraph in group['paragraphs']:
        # we pull out the context from here
        context = paragraph['context']
        for qa_pair in paragraph['qas']:
            # we pull out the question
            question = qa_pair['question']
            # now the logic to check if we have 'answers' or 'plausible_answers'
            if 'answers' in qa_pair.keys() and len(qa_pair['answers']) > 0:
                # get all answers
                answer = [a['text'] for a in qa_pair['answers']]
            elif 'plausible_answers' in qa_pair.keys() and len(qa_pair['plausible_answers']) > 0:
                # get all plausible answers
                answer = [a['text'] for a in qa_pair['plausible_answers']]
            else:
                # this shouldn't happen, but just in case we set answer to an empty list
                answer = []

            # append dictionary sample to parsed squad
            dev_squad.append({
                'question': question,
                'answers': list(dict.fromkeys(answer)),  # remove duplicates, keeping first-seen order
                'context': context
            })

In [None]:
dev_squad

In [61]:
with open(os.path.join(squad_dir, 'dev.json'), 'w') as f:
    json.dump(dev_squad, f)
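
Since the train and dev loops differ only in how many answers they keep, the two could be folded into one helper. The following is a minimal sketch under that assumption; `flatten_squad`, `keep_all_answers`, and the toy sample are hypothetical names and data introduced here for illustration, not part of the notebook above.

```python
def flatten_squad(raw, keep_all_answers=False):
    """Flatten SQuAD-style JSON into a list of question/answer/context dicts."""
    samples = []
    for group in raw['data']:
        for paragraph in group['paragraphs']:
            context = paragraph['context']
            for qa_pair in paragraph['qas']:
                # prefer real answers, fall back to plausible ones, else nothing
                found = qa_pair.get('answers') or qa_pair.get('plausible_answers') or []
                texts = [a['text'] for a in found]
                sample = {'question': qa_pair['question'], 'context': context}
                if keep_all_answers:
                    # dev-style: every unique answer, in first-seen order
                    sample['answers'] = list(dict.fromkeys(texts))
                else:
                    # train-style: first answer only, or None if there is none
                    sample['answer'] = texts[0] if texts else None
                samples.append(sample)
    return samples


# hand-made sample (hypothetical content) with the same nesting as SQuAD v2.0
toy = {'data': [{'paragraphs': [{
    'context': 'Rollo arrived in Normandy in the 880s.',
    'qas': [
        {'question': 'When did Rollo arrive in Normandy?',
         'answers': [{'text': '880s'}, {'text': 'the 880s'}, {'text': '880s'}]},
        {'question': 'Who ruled Paris?',
         'answers': [],
         'plausible_answers': [{'text': 'Rollo'}]},
    ],
}]}]}

train_style = flatten_squad(toy)
dev_style = flatten_squad(toy, keep_all_answers=True)
print(train_style[0]['answer'], dev_style[0]['answers'])
```

With such a helper, the two loops above would reduce to `flatten_squad(squad)` and `flatten_squad(squad_dev, keep_all_answers=True)`.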

## Question Answering with a BERT Transformer model

For our first QA model we will set up a simple question-answering pipeline using Hugging Face Transformers and a pretrained BERT model, then test it on the SQuAD dev data prepared above.

### Initialize the model

In [63]:
%pip install transformers

Successfully installed huggingface-hub-0.16.4 safetensors-0.3.2 tokenizers-0.13.3 transformers-4.31.0


In [65]:
from transformers import BertTokenizer, BertForQuestionAnswering

# we can get these models from hugging face
modelname = 'deepset/bert-base-cased-squad2'

tokenizer = BertTokenizer.from_pretrained(modelname)
model = BertForQuestionAnswering.from_pretrained(modelname)

Some weights of the model checkpoint at deepset/bert-base-cased-squad2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Data pipeline

In [69]:
from transformers import pipeline

qa = pipeline('question-answering', model=model, tokenizer=tokenizer)

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
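
An extractive QA model of this kind produces a start score and an end score for every token, and the answer is the span whose combined score is highest. The sketch below illustrates that selection step on made-up numbers. It is a simplified illustration, not the pipeline's actual implementation, which also handles tokenization, long contexts, and mapping tokens back to characters. `best_span`, `max_answer_len`, the toy tokens, and the scores are all hypothetical.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return ((start, end), score) for the highest-scoring valid span."""
    best, best_score = (0, 0), float('-inf')
    for i, start_score in enumerate(start_scores):
        # the end may not precede the start, and the span length is capped
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start_score + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score


# toy context tokens with made-up per-token scores
tokens = ['Rollo', 'arrived', 'in', 'the', '880s', '.']
start_scores = [0.1, -1.0, -0.5, 1.2, 2.0, -3.0]
end_scores = [-0.5, -1.0, -2.0, -0.8, 2.5, 0.3]

(start, end), score = best_span(start_scores, end_scores)
answer = ' '.join(tokens[start:end + 1])
print(answer, score)
```

Requiring `end >= start` and capping the span length rules out spans that could never be valid answers, even when the individual start and end scores are high.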


In [67]:
with open('./data/squad/dev.json', 'r') as f:
    squad = json.load(f)

### Make predictions

In [82]:
QA = []

for pair in squad[25:30]:
    # pass in our question and context to return an answer
    ans = qa(question=pair['question'], context=pair['context'])
    # store the predicted answer next to the reference answers
    QA.append({
        'Question': pair['question'],
        'predicted': ans['answer'],
        'true': pair['answers']
    })

In [83]:
QA

[{'Question': 'What treaty was established in the 9th century?',
  'predicted': 'the treaty of Saint-Clair-sur-Epte',
  'true': ['treaty of Saint-Clair-sur-Epte']},
 {'Question': 'Who established a treaty with King Charles the third of France?',
  'predicted': 'Charles III of West Francia and the famed Viking ruler Rollo,',
  'true': ['Rollo']},
 {'Question': 'What did the French promises to protect Rollo and his men from?',
  'predicted': 'Viking incursions.',
  'true': ['further Viking incursions.']},
 {'Question': 'Who upon arriving gave the original viking settlers a common identity?',
  'predicted': '"Frankish".',
  'true': ['Rollo']},
 {'Question': 'When did Rollo begin to arrive in Normandy?',
  'predicted': 'the 880s,',
  'true': ['880s']}]
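
Several of these predictions differ from the references only by an article or trailing punctuation, so a plain string comparison would undercount them. A common remedy, in the style of the SQuAD evaluation script, is to normalize both strings (lowercase, strip punctuation and articles) and then report exact match (EM) and token-level F1. The sketch below applies that idea to the five predictions shown above. The function names are illustrative, and special handling of unanswerable questions is left out.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())


def exact_match(prediction, references):
    """1.0 if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(ref) for ref in references))


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, references):
    """Best F1 of the prediction against any of the references."""
    return max((token_f1(prediction, ref) for ref in references), default=0.0)


# (predicted, true) pairs copied from the output above
results = [
    ('the treaty of Saint-Clair-sur-Epte', ['treaty of Saint-Clair-sur-Epte']),
    ('Charles III of West Francia and the famed Viking ruler Rollo,', ['Rollo']),
    ('Viking incursions.', ['further Viking incursions.']),
    ('"Frankish".', ['Rollo']),
    ('the 880s,', ['880s']),
]

em = sum(exact_match(p, refs) for p, refs in results) / len(results)
f1 = sum(best_f1(p, refs) for p, refs in results) / len(results)
print(f'EM: {em:.2f}, F1: {f1:.2f}')  # EM: 0.40, F1: 0.60
```

F1 gives partial credit for overlapping tokens, which is why the third prediction still scores 0.8 even though it is not an exact match.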

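On this small sample the model does reasonably well: the first and fifth predictions match the reference apart from a leading article or trailing punctuation, the third drops only the word "further", the second contains the correct answer inside a much longer span, and the fourth is wrong.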