<a href="https://colab.research.google.com/github/KevinLolochum/BERT-MODELS/blob/main/BERT_Question_Answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BERT Extractive Question Answering in PyTorch**

Extractive question answering is the task of answering a question by selecting a span from a given text that contains the answer.

Install the transformers library

In [None]:
!pip install transformers

Import the required libraries

In [20]:
import numpy as np
import pandas as pd
import torch
import tensorflow_datasets as tfds
from transformers import BertForQuestionAnswering, BertTokenizer

***1. Downloading and Exploring the data***

Import the **S**tanford **Qu**estion**A**nswering **D**ataset (**SQuAD**)

You can explore the dataset [here](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/), or download it from tfds or Kaggle.
* SQuAD 1.1 contains over 100,000 question-answer pairs on 500+ articles.
* In the SQuAD dataset, a single sample consists of a paragraph and a set of questions.
* The goal is to find, for each question, a span of text in a paragraph that answers that question.
* Model performance is measured with exact match (EM) and F1 scores, where each prediction is compared against the ground-truth answers and the best-matching one counts.
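
To make the scoring idea above concrete, here is a minimal sketch of exact match and token-level F1. It follows the general approach of the official SQuAD evaluation (lowercase, strip punctuation and articles, then compare), but the function names are illustrative and this is not the official script.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truths):
    """1.0 if the normalized prediction equals any normalized reference, else 0.0."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(t) for t in truths))


def f1_score(prediction, truths):
    """Best token-overlap F1 between the prediction and any reference."""
    pred_tokens = normalize_answer(prediction).split()
    best = 0.0
    for truth in truths:
        truth_tokens = normalize_answer(truth).split()
        overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(truth_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# A prediction that drops the leading "in" is not an exact match but still overlaps.
print(exact_match("late 1990s", ["in the late 1990s"]))
print(f1_score("late 1990s", ["in the late 1990s"]))
```

Averaging these two scores over all questions gives dataset-level EM and F1.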

In [None]:
from google.colab import drive

drive.mount('/content/gdrive')

In [5]:
# Loading the CSV from Google Drive
df = pd.read_csv('/content/gdrive/My Drive/SQuAD.csv')

In [6]:
#Checking shape and null values
print(f'shape: {df.shape}:\n\n Null values: {df.isna().sum()}')

shape: (86821, 6):

 Null values: Unnamed: 0      0
context         0
question        0
id              0
answer_start    0
text            1
dtype: int64


In [7]:
df.head(2)

| | Unnamed: 0 | context | question | id | answer_start | text |
|---|---|---|---|---|---|---|
| 0 | 0 | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | When did Beyonce start becoming popular? | 56be85543aeaaa14008c9063 | 269 | in the late 1990s |
| 1 | 1 | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | What areas did Beyonce compete in when she was... | 56be85543aeaaa14008c9065 | 207 | singing and dancing |




* The data has 86,821 examples and 6 columns.
* The data looks clean: the only null value is a single missing entry in the text column.
* From the slice above, the dataset has id, context, question, answer_start and text (answer) columns, plus a leftover index column (Unnamed: 0).
* answer_start is the character offset in the context at which the correct answer begins.
* For simplicity, I will work with a smaller subset of the dataset.
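
As a quick illustration of how answer_start relates to the context, the sketch below slices an answer out of a context by character offset. The row is a made-up, shortened example in the same shape as the dataset, and `extract_answer` is a hypothetical helper, not part of any library.

```python
# Hypothetical row shaped like the dataset above; the context is shortened for illustration.
row = {
    "context": "She rose to fame in the late 1990s as lead singer of a girl group.",
    "answer_start": 17,
    "text": "in the late 1990s",
}


def extract_answer(context, answer_start, answer_text):
    """Slice the answer out of the context using its character offset."""
    answer_end = answer_start + len(answer_text)
    return context[answer_start:answer_end]


span = extract_answer(row["context"], row["answer_start"], row["text"])
print(span == row["text"])
```

The same check could be applied row by row to the real DataFrame, after dropping the one row whose text is null.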





In [8]:
# Keep the first five thousand examples and drop the leftover index column.

df =  df.iloc[:5000, 1:]
df.shape

(5000, 5)

***FYI***


* Here is a quick example using another powerful tool from transformers, the [pipeline API](https://huggingface.co/transformers/task_summary.html),  [source code](https://huggingface.co/transformers/_modules/transformers/pipelines.html#QuestionAnsweringPipeline.__call__).
* I mention it because the default question-answering pipeline loads a model that was fine-tuned on SQuAD and works well for this task.
* As you can see, it gets the first entry right, and its answer is more concise than the reference answer.



In [10]:
from transformers import pipeline
import warnings
warnings.filterwarnings('ignore')


nlp = pipeline("question-answering")

example = df.iloc[0]
result = nlp(question=example['question'], context= example['context'])
print(f"question: '{example['question']}', \n 'Model_Answer: '{result['answer']}', 'Actual_Answer: '{example['text']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}");

question: 'When did Beyonce start becoming popular?', 
 'Model_Answer: 'late 1990s', 'Actual_Answer: 'in the late 1990s', score: 0.4887, start: 276, end: 286


***3. Instantiating the model***



* I am loading a model that has already been fine-tuned on SQuAD.



In [26]:
# We will be using the BERT-large uncased (whole word masking) model that has been fine-tuned on SQuAD.
# The model has roughly 340M parameters.
MODEL = "bert-large-uncased-whole-word-masking-finetuned-squad"

Tokenizer = BertTokenizer.from_pretrained(MODEL)
Model = BertForQuestionAnswering.from_pretrained(MODEL, return_dict = True)


***4. Preparing inputs for our model***

**Inputs/parameters**. Here are the [explanations](https://huggingface.co/transformers/glossary.html#attention-mask) of what these parameters represent.

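*  input_ids - Indices of the input tokens in the tokenizer vocabulary, including the special classification ([CLS]) and separator ([SEP]) tokens.
*  attention_mask - Binary mask marking which tokens should be attended to, i.e. tokens that are not padding.
*  token_type_ids (also called segment ids) - Marks whether each token belongs to the first segment (the question) or the second segment (the context).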
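
The layout of these inputs can be sketched in plain Python. This is a toy illustration only: it uses pre-split word tokens instead of BERT's WordPiece tokenizer, and `build_pair_inputs` is a hypothetical helper, not a transformers function.

```python
def build_pair_inputs(question_tokens, context_tokens, max_length):
    """Lay out a question/context pair the way BERT expects, padded to max_length."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    # Every real token is attended to.
    attention_mask = [1] * len(tokens)

    pad = max_length - len(tokens)
    if pad < 0:
        raise ValueError("max_length is too small for this pair")
    # Padding positions get mask 0 so the model ignores them.
    tokens += ["[PAD]"] * pad
    token_type_ids += [0] * pad
    attention_mask += [0] * pad
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = build_pair_inputs(
    ["who", "sang"], ["beyonce", "sang"], max_length=9
)
for name, values in [("tokens", tokens), ("token_type_ids", token_type_ids), ("attention_mask", attention_mask)]:
    print(name, values)
```

In the real tokenizer the tokens are then mapped to integer ids, which become input_ids.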

**Outputs**
* start_logits - Scores (before softmax) for each token being the start of the answer span; a torch.FloatTensor of shape (batch_size, sequence_length).
* end_logits - Scores (before softmax) for each token being the end of the answer span; a torch.FloatTensor of shape (batch_size, sequence_length).
* Other return values are the loss (cross-entropy loss, returned only when the true start and end positions are supplied) and the hidden states and attention weights when requested.
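
Because the logits are raw scores, they need a softmax to become probabilities, and taking the start and end argmax independently can produce an end that comes before the start. The sketch below shows one common alternative: pick the (start, end) pair with the highest joint probability subject to start <= end. It uses plain Python lists standing in for one row of the logits tensors; `best_span` and `max_answer_len` are illustrative names, not part of transformers.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_prob * end_prob with start <= end."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    best = (0, 0, 0.0)
    for i, p_start in enumerate(start_probs):
        # Only consider ends at or after the start, within the length limit.
        for j in range(i, min(i + max_answer_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Made-up logits where the independent argmax would give start=2, end=1 (invalid).
start_logits = [0.1, 0.2, 4.0, 0.3, 0.1]
end_logits = [0.1, 3.5, 0.2, 3.0, 0.1]
start, end, score = best_span(start_logits, end_logits)
print(start, end)
```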






In [52]:
# Run the model on the first five examples and decode the predicted answer spans
examples = df.iloc[0:5]

for i in range(len(examples)):
  context = examples.iloc[i]['context']
  question = examples.iloc[i]['question']

  # Encode the question/context pair as [CLS] question [SEP] context [SEP]
  inputs = Tokenizer.encode_plus(question, context,
                               add_special_tokens=True,
                               return_tensors="pt",
                               padding = True)
  input_ids = inputs["input_ids"].tolist()[0]

  # Inference only, so gradients are not needed
  with torch.no_grad():
    Outputs = Model(**inputs)

  answer_start = torch.argmax(Outputs.start_logits)  # Token index of the most likely answer start
  answer_end = torch.argmax(Outputs.end_logits)  # Token index of the most likely answer end

  # Convert the predicted token span back to text
  answer = Tokenizer.convert_tokens_to_string(Tokenizer.convert_ids_to_tokens(input_ids[answer_start:(answer_end+1)]))

  print(f"Question: {question}")
  print(f"Model_Answer: {answer},\n'True_answer': {examples.iloc[i]['text']}")





Question: When did Beyonce start becoming popular?
Model_Answer: late 1990s,
'True_answer': in the late 1990s
Question: What areas did Beyonce compete in when she was growing up?
Model_Answer: singing and dancing,
'True_answer': singing and dancing
Question: When did Beyonce leave Destiny's Child and become a solo singer?
Model_Answer: 2003,
'True_answer': 2003
Question: In what city and state did Beyonce  grow up? 
Model_Answer: houston,
'True_answer': Houston, Texas
Question: In which decade did Beyonce become famous?
Model_Answer: 1990s,
'True_answer': late 1990s


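As you can see, the model's answers match or closely overlap the reference answers for all five questions. The differences are in span boundaries (for example "houston" versus "Houston, Texas") and in casing, since the model is uncased.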

Training and evaluation using the Hugging Face Trainer: [Trainer](https://huggingface.co/transformers/training.html#trainer),  [Source Code](https://huggingface.co/transformers/_modules/transformers/trainer.html#Trainer).

In [None]:
# Using the Hugging Face Trainer.
# Call trainer.train() to train and trainer.evaluate() to evaluate.
# Note: train_dataset and test_dataset are not defined in this notebook. They must be
# tokenized datasets that include start_positions and end_positions for each example.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total # of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

trainer = Trainer(
    model=Model,                         # the 🤗 Transformers model instantiated above
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset            # evaluation dataset
)
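
To train with the Trainer, each example needs token-level start and end positions, while the dataset stores a character offset. The sketch below shows one way this conversion could look, assuming per-token character offsets are available (fast tokenizers can return them with `return_offsets_mapping=True`). `char_span_to_token_span` is a hypothetical helper, and the offsets here are hand-written for a toy whitespace tokenization rather than produced by a real tokenizer.

```python
def char_span_to_token_span(offsets, answer_start, answer_text):
    """Map a character-level answer span to (start_token, end_token) indices.

    offsets holds one (char_start, char_end) pair per token, relative to the
    context and with an exclusive end. Tokens outside the context (question
    and special tokens) are given as None. Returns None if no token matches.
    """
    answer_end = answer_start + len(answer_text)
    start_token = None
    end_token = None
    for idx, span in enumerate(offsets):
        if span is None:
            continue  # skip question and special tokens
        tok_start, tok_end = span
        if start_token is None and tok_start <= answer_start < tok_end:
            start_token = idx
        if tok_start < answer_end <= tok_end:
            end_token = idx
    if start_token is None or end_token is None:
        return None
    return start_token, end_token


# Toy layout: [CLS] q1 q2 [SEP] She rose to fame in the late 1990s [SEP]
offsets = [None, None, None, None,
           (0, 3), (4, 8), (9, 11), (12, 16), (17, 19), (20, 23), (24, 28), (29, 34),
           None]
positions = char_span_to_token_span(offsets, 17, "in the late 1990s")
print(positions)
```

The resulting pair would be stored as start_positions and end_positions alongside the tokenized inputs; examples whose answer cannot be located need separate handling, for instance by skipping them.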