<a href="https://colab.research.google.com/github/KevinLolochum/BERT-MODELS/blob/main/BERT_Question_Answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**BERT Extractive Question Answering in PyTorch**

Extractive question answering is the task of answering a question by selecting the span of a given text that contains the answer.

Install the transformers library

In [None]:
!pip install transformers

Import the required libraries

In [3]:
import numpy as np
import pandas as pd
import torch
import tensorflow_datasets as tfds
from transformers import BertForQuestionAnswering, BertTokenizer

***1. Downloading and Exploring the data***

Import the **S**tanford **Qu**estion**A**nswering **D**ataset (**SQuAD**)

You can explore the dataset [here](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/) and download it from tfds or Kaggle.

SQuAD 1.1 contains over 100,000 question-answer pairs on 500+ articles. In SQuAD, a single sample consists of a paragraph and a set of questions. The goal is to find, for each question, a span of text in the paragraph that answers it. Model performance is measured by how closely predictions match any of the ground-truth answers, reported as exact match and token-level F1.
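To make the scoring idea concrete, here is a minimal sketch of exact match and token-level F1, taking the best score over several ground-truth answers. The normalization (lowercasing, dropping punctuation and English articles) is loosely modeled on the usual SQuAD evaluation convention and is an assumption here; the function names are illustrative, and the second ground-truth answer is a hypothetical alternative, not taken from the dataset.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_truths(metric, prediction, truths):
    """Score a prediction against every ground truth and keep the best."""
    return max(metric(prediction, t) for t in truths)


# "in the late 1990s" is the dataset answer; "late 1990s" is a hypothetical second annotation.
truths = ["in the late 1990s", "late 1990s"]
print(best_over_truths(exact_match, "late 1990s", truths))
print(best_over_truths(f1, "the late 1990s", truths))
```

Averaging these per-question scores over the whole evaluation set gives the two headline numbers.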

In [37]:
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [78]:
# Downloading from drive
df = pd.read_csv('/content/gdrive/My Drive/SQuAD.csv')

In [54]:
print(f'shape: {df.shape}:\n\n Null values: {df.isna().sum()}')

shape: (86821, 6):

 Null values: Unnamed: 0      0
context         0
question        0
id              0
answer_start    0
text            1
dtype: int64


In [49]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,context,question,id,answer_start,text
0,0,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,When did Beyonce start becoming popular?,56be85543aeaaa14008c9063,269,in the late 1990s
1,1,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,What areas did Beyonce compete in when she was...,56be85543aeaaa14008c9065,207,singing and dancing




* The data has 86,821 examples and 6 columns.
* The data looks fairly clean: the only null value is a single missing entry in the text column.
* From the slice above, each row of the Stanford dataset has an id, a context, a question, an answer_start and a text.
* answer_start is the character offset in the context at which the correct answer begins.
* text is the answer string itself.
* For simplicity, I will make the dataset smaller for model training.
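A quick way to see how answer_start and text relate is to slice the context with them. The sketch below uses a shortened stand-in sentence rather than the real dataset row, and the helper name is illustrative.

```python
def answer_char_span(context, answer_text, answer_start):
    """Return the (start, end) character span of the answer, end exclusive.

    Raises ValueError if the offset does not line up with the answer text.
    """
    answer_end = answer_start + len(answer_text)
    if context[answer_start:answer_end] != answer_text:
        raise ValueError("answer_start does not point at the answer text")
    return answer_start, answer_end


# Shortened stand-in for a SQuAD context (not the actual dataset row).
context = "Beyonce rose to fame in the late 1990s as lead singer of Destiny's Child."
answer_text = "in the late 1990s"

# answer_start counts characters from the beginning of the context, not words.
answer_start = context.index(answer_text)
start, end = answer_char_span(context, answer_text, answer_start)
print(start, end, repr(context[start:end]))
```

Running a check like this over every row is a cheap sanity test before tokenization.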





In [79]:
# Keep the first 10,000 examples and drop the leftover index column.

df =  df.iloc[:10000, 1:]
df.shape

(10000, 5)

***FYI***


* Here is a quick example using the [pipeline API](https://huggingface.co/transformers/task_summary.html) from transformers, a high-level wrapper around a pretrained model ([source code](https://huggingface.co/transformers/_modules/transformers/pipelines.html#QuestionAnsweringPipeline.__call__)).
* I mention it because the question-answering pipeline uses a model that was specifically fine-tuned on SQuAD and is very powerful for this task.
* As you can see, it gets the first entry right, and its answer ('late 1990s') is even more precise than the given answer ('in the late 1990s').



In [87]:
from transformers import pipeline
import warnings
warnings.filterwarnings('ignore')


nlp = pipeline("question-answering")

example = df.iloc[0]
result = nlp(question=example['question'], context= example['context'])
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}");

Answer: 'late 1990s', score: 0.4887, start: 276, end: 286


***3. Instantiating the model***

In [None]:
# Using BERT cased model
MODEL = 'bert-base-cased'

Tokenizer = BertTokenizer.from_pretrained(MODEL)
Model = BertForQuestionAnswering.from_pretrained(MODEL, return_dict = True)


***4. Preparing inputs for our model***

Inputs/parameters. Here are the [explanations](https://huggingface.co/transformers/glossary.html#attention-mask) of what these parameters represent.

*  input_ids - Indices of the input tokens in the tokenizer's vocabulary.
*  attention_mask - Marks which positions should be attended to: 1 for real tokens, 0 for padding.
*  special tokens - The classification token [CLS] and the separator token [SEP], which the tokenizer adds around the question and the context.
*  token_type_ids (segment ids) - Whether a token belongs to the first segment (the question, 0) or the second segment (the context, 1).
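The layout of these inputs can be sketched without any library. The helper below is a toy illustration, not the tokenizer's implementation: it works on already-split tokens instead of vocabulary ids, and its name and the max_len parameter are assumptions made for the example.

```python
def build_inputs(question_tokens, context_tokens, max_len):
    """Lay out a question/context pair as [CLS] question [SEP] context [SEP] plus padding."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    # Every real token is attended to.
    attention_mask = [1] * len(tokens)

    pad = max_len - len(tokens)
    if pad < 0:
        raise ValueError("pair is longer than max_len")
    # Padding positions get mask 0 so the model ignores them.
    tokens += ["[PAD]"] * pad
    token_type_ids += [0] * pad
    attention_mask += [0] * pad
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = build_inputs(
    "When did she become popular ?".split(),
    "She rose to fame in the late 1990s .".split(),
    max_len=20,
)
for row in zip(tokens, token_type_ids, attention_mask):
    print(row)
```

In the real tokenizer each token is additionally mapped to its vocabulary index, which is what input_ids holds.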



In [None]:
# Run the model on the single example selected above.
context = " ".join(str(example.context).split())
question = " ".join(str(example.question).split())

inputs = Tokenizer(question, context, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

text_tokens = Tokenizer.convert_ids_to_tokens(input_ids)

# With return_dict=True the model returns an output object, so read the logits by name.
with torch.no_grad():
  outputs = Model(**inputs)
answer_start_scores = outputs.start_logits
answer_end_scores = outputs.end_logits

# Most likely start token, and one past the most likely end token (for slicing).
answer_start = torch.argmax(answer_start_scores)
answer_end = torch.argmax(answer_end_scores) + 1

answer = Tokenizer.convert_tokens_to_string(Tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

# The QA head of 'bert-base-cased' is untrained until fine-tuning, so this answer is not meaningful yet.
print(f"Question: {question}")
print(f"Answer: {answer}")
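Taking the two argmax values independently can yield an end position that comes before the start, which gives an empty answer. A common remedy is to score every valid (start, end) pair jointly. The following is a minimal pure-Python sketch of that idea; the function name, the max_answer_len cap and the toy scores are assumptions for illustration, not part of the notebook's model code.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) maximizing start_score + end_score.

    Only spans with start <= end and at most max_answer_len tokens are
    considered. The returned end index is inclusive.
    """
    best = None
    for start, s_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# Toy scores: independent argmax would pick start=3 and end=1, an invalid span.
start_scores = [0.1, 0.2, 0.3, 5.0, 0.1]
end_scores = [0.1, 4.0, 0.2, 0.3, 3.5]
start, end, score = best_span(start_scores, end_scores)
print(start, end, score)
```

Because end is inclusive here, the answer tokens would be sliced as input_ids[start:end + 1].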

In [None]:
context = " ".join(str(example.context).split())
question = " ".join(str(example.question).split())

inputs = Tokenizer(question, context, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

text_tokens = Tokenizer.convert_ids_to_tokens(input_ids)

# With return_dict=True the model returns an output object, so read the logits by name.
outputs = Model(**inputs)

# Inspect the raw end-position scores (one logit per input token).
outputs.end_logits

In [None]:
# Outputs from the model (excerpt from the transformers source, kept as comments so the cell runs)
# return QuestionAnsweringModelOutput(
#     loss=total_loss,
#     start_logits=start_logits,
#     end_logits=end_logits,
#     hidden_states=outputs.hidden_states,
#     attentions=outputs.attentions,
# )
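The loss field above is only populated when the gold start and end token positions are passed to the model. To my understanding it is the average of two cross-entropy terms, one over the start logits and one over the end logits. Here is a small pure-Python sketch of that computation for a single example; the function names are illustrative and this is not the library's code.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target index under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def qa_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start-position and end-position cross-entropy losses."""
    start_loss = cross_entropy(start_logits, start_position)
    end_loss = cross_entropy(end_logits, end_position)
    return (start_loss + end_loss) / 2


# Toy logits over a 4-token input; the gold answer spans tokens 1..2.
print(qa_loss([0.0, 2.0, 0.5, -1.0], [0.0, 0.1, 3.0, 0.2], 1, 2))
```

The loss shrinks as the logits at the gold start and end positions grow relative to the others.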

Training and evaluation using the Hugging Face Trainer. [Trainer](https://huggingface.co/transformers/training.html#trainer),  [Source Code](https://huggingface.co/transformers/_modules/transformers/trainer.html#Trainer).

In [None]:
# Using the Hugging Face Trainer.
# Call trainer.train() to train and trainer.evaluate() to evaluate.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total # of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

# train_dataset and test_dataset are not defined in this notebook; they must be
# tokenized datasets that include start_positions and end_positions.
trainer = Trainer(
    model=Model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset            # evaluation dataset
)
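Before a train_dataset can be handed to the Trainer, the character-level answer_start in the data has to be converted into token-level start and end positions. The sketch below shows the idea with a toy whitespace tokenizer that records character offsets. Both helper names are hypothetical; a real setup would instead use the offsets reported by a fast tokenizer and shift the positions by the question and special tokens that precede the context.

```python
import re


def whitespace_offsets(text):
    """Toy tokenizer: (token, char_start, char_end) for each whitespace-separated token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, answer_start, answer_end):
    """Map the character span [answer_start, answer_end) to inclusive token indices."""
    token_start = token_end = None
    for i, (_, start, end) in enumerate(offsets):
        if start < answer_end and end > answer_start:  # token overlaps the answer
            if token_start is None:
                token_start = i
            token_end = i
    return token_start, token_end


# Shortened stand-in for a SQuAD context (not the actual dataset row).
context = "Beyonce rose to fame in the late 1990s as lead singer of Destiny's Child."
answer_text = "in the late 1990s"
answer_start = context.index(answer_text)
answer_end = answer_start + len(answer_text)

offsets = whitespace_offsets(context)
token_start, token_end = char_span_to_token_span(offsets, answer_start, answer_end)
print(token_start, token_end, [tok for tok, _, _ in offsets[token_start:token_end + 1]])
```

If the answer falls outside the tokenized window, both indices come back as None, and such examples need separate handling.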