# **Question Answering System**


## **INTRODUCTION**


---




*   Answering questions automatically and accurately remains a difficult problem in natural language processing.

*   Question Answering is a branch of Natural Language Understanding. It aims to build systems that, given a question in natural language, extract the relevant information from the provided data and present it as a natural-language answer.

*   QA systems let a user ask a question in natural language and get an immediate, brief response.
*   QA systems are now found in search engines and phone conversational interfaces, and they are fairly good at answering.
*   There are three types of QA tasks:
    *   Extractive QA: the answer is a span of text inside the provided context, so the task works like reading comprehension.
    *   Open Generative QA: the model generates the answer based on the provided context.
    *   Closed Generative QA: the model generates the answer without any context.









## **DATASET**


---



There are three question files, one for each year of students (S08, S09, and S10), as well as 690,000 words of cleaned text from Wikipedia that was used to generate the questions.

The "question_answer_pairs.txt" files contain both the questions and the answers. The columns in this file are as follows:

*   ArticleTitle: Name of the Wikipedia article from which the questions and answers originally came.

*   Question: Question that needs to be answered.

*   Answer: Answer to the question.

*   DifficultyFromQuestioner: Prescribed difficulty rating for the question, as given to the question writer.

*   DifficultyFromAnswerer: Difficulty rating assigned by the individual who evaluated and answered the question, which may differ from DifficultyFromQuestioner.

*   ArticleFile: Name of the file with the relevant article.





## Dataset Preparation


---

We prepare the dataset so that it can be used to test our models.

The steps we followed:

*   First, we mounted Google Drive in Colab and loaded the '.txt' file into a pandas DataFrame.

*   We removed the question-answer pairs whose answers are 'yes' or 'no'. Our model is an extractive QA model, so it can only return text that is already present in the context. Answering 'Yes' or 'No' would require the model to generate text, which is beyond its scope.

*   Then we extracted the viable questions and answers and stored them as lists that can be used for later evaluation.
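
The filtering steps above can also be sketched without pandas, using only the standard library. This is an illustration under stated assumptions, not the code used in this notebook: `is_yes_no` and `load_extractive_pairs` are hypothetical helpers, the sample rows are made up, and treating the string `NULL` as a missing answer is an assumption about the file. Unlike a strict comparison against 'yes' and 'no', the check here also catches variants such as `Yes.`.

```python
import csv
import io

# Made-up rows in the tab-separated layout described above (illustration only).
SAMPLE = (
    "ArticleTitle\tQuestion\tAnswer\tDifficultyFromQuestioner\tDifficultyFromAnswerer\tArticleFile\n"
    "Abraham_Lincoln\tWas Lincoln a lawyer?\tYes.\teasy\teasy\tdata/set3/a4\n"
    "Abraham_Lincoln\tHow long was Lincoln's formal education?\t18 months\teasy\tmedium\tdata/set3/a4\n"
    "Abraham_Lincoln\tHow long was Lincoln's formal education?\t18 months\teasy\teasy\tdata/set3/a4\n"
    "Abraham_Lincoln\tWhen did Lincoln die?\tNULL\thard\thard\tdata/set3/a4\n"
    "Abraham_Lincoln\tDid Lincoln like hunting?\tno\teasy\teasy\tdata/set3/a4\n"
)


def is_yes_no(answer):
    """True for yes/no answers, ignoring case and trailing punctuation."""
    return answer.strip().strip(".!").lower() in {"yes", "no"}


def load_extractive_pairs(handle):
    """Return (question, answer) pairs that an extractive model could answer."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    seen_questions = set()
    pairs = []
    for row in reader:
        question = (row.get("Question") or "").strip()
        answer = (row.get("Answer") or "").strip()
        # Skip missing answers (assumed to be empty or the literal "NULL").
        if not question or not answer or answer.upper() == "NULL":
            continue
        # Skip yes/no answers and questions that were already collected.
        if is_yes_no(answer) or question in seen_questions:
            continue
        seen_questions.add(question)
        pairs.append((question, answer))
    return pairs


pairs = load_extractive_pairs(io.StringIO(SAMPLE))
print(pairs)
```

Only the first "18 months" row survives: the yes/no rows, the duplicate question, and the row without an answer are all dropped.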


In [None]:
#Mount the Google Drive

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load the "question_answer_pairs.txt" file, which contains both the questions and the answers, into a DataFrame.

import pandas as pd

dataset = pd.read_csv("/content/drive/MyDrive/Question_Answer_Dataset_v1.1/S08/question_answer_pairs.txt", 
                      sep="\t",encoding= 'unicode_escape',on_bad_lines='skip')

# Remove rows with missing values (including missing answers) and drop duplicate questions.
dataset = dataset.dropna(axis=0)
dataset = dataset.drop_duplicates(subset='Question')
# Renumber the rows from 0; drop=True avoids keeping the old index as an extra column.
dataset.reset_index(drop=True, inplace=True)

dataset.tail(10)

In [None]:
# Keep only the questions that an extractive model can answer, then collect the
# questions for one article together with that article's text as the context.
# Both models are run on this list so that their accuracy can be compared.

# Drop yes/no answers: they are not spans of the article text.
dataset = dataset[~dataset['Answer'].isin(['no', 'yes', 'No', 'Yes'])]

# Select the questions that belong to a single article.
article_file = 'data/set3/a4'
df = dataset[dataset['ArticleFile'] == article_file].reset_index(drop=True)
display(df.head())
print(df.shape)
Questions_List = list(df['Question'])

# Read the full article text; it is passed to the models as the context.
c = "/content/drive/MyDrive/Question_Answer_Dataset_v1.1/S08/" + article_file + ".txt"
with open(c) as file:
  context = file.read()

In [None]:
#List of Questions extracted from dataframe

display(Questions_List)

## Question Answering with Hugging Face models

Fine-tuning our first Hugging Face model, **distilbert-base-uncased**, on the adversarial_qa dataset

## Adversarial QA:

---

* The source passages are from Wikipedia and are the same as those used in SQuAD v1.1.

* title: the title of the Wikipedia page from which the context is sourced

* context: the context/passage

* question: the question text

* id: a string identifier for each question

* answers: a list of all provided answers (one per question in our case, but multiple may exist in SQuAD). Each answer has an answer_start field, which is the character index where the answer span starts in the context, and a text field, which is the answer text.

* This dataset is used to provide a more demanding assessment of models that were trained on the SQuAD dataset.
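
To make the `answers` format concrete, here is a small sketch of one record in this layout and a check that `answer_start` really points at the answer text. The record is a made-up example that follows the field names listed above, and `answer_spans` is a hypothetical helper, not part of the datasets library.

```python
# A made-up record in the SQuAD-style layout described above (illustration only).
context = "Lincoln's formal education consisted of about 18 months of schooling."
record = {
    "id": "example-0",
    "title": "Abraham_Lincoln",
    "context": context,
    "question": "How long was Lincoln's formal education?",
    "answers": {
        "text": ["18 months"],
        # Character index of the first character of the answer in the context.
        "answer_start": [context.index("18 months")],
    },
}


def answer_spans(record):
    """Return (start_char, end_char, text) for each answer, validating the offsets."""
    spans = []
    answers = record["answers"]
    for text, start in zip(answers["text"], answers["answer_start"]):
        end = start + len(text)  # end is exclusive
        if record["context"][start:end] != text:
            raise ValueError(f"answer_start {start} does not match {text!r}")
        spans.append((start, end, text))
    return spans


for start, end, text in answer_spans(record):
    print(start, end, repr(record["context"][start:end]))
```

The end of the span is not stored in the dataset; it is derived as `answer_start + len(text)`, which is also how the preprocessing function later in this notebook computes it.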

In [None]:
%pip install transformers



In [None]:
%pip install datasets


Importing the required libraries.

In [None]:
from transformers import BertForQuestionAnswering

from transformers import AutoTokenizer

from transformers import Trainer, TrainingArguments


In [None]:
from datasets import load_dataset,load_metric

Loading the **'adversarial_qa' dataset** for fine-tuning.

In [None]:
dt = load_dataset("adversarial_qa","adversarialQA")

Downloading builder script:   0%|          | 0.00/2.90k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.04k [00:00<?, ?B/s]

Downloading and preparing dataset adversarial_qa/adversarialQA (download: 8.60 MiB, generated: 31.98 MiB, post-processed: Unknown size, total: 40.58 MiB) to /root/.cache/huggingface/datasets/adversarial_qa/adversarialQA/1.0.0/92356be07b087c5c6a543138757828b8d61ca34de8a87807d40bbc0e6c68f04b...


Downloading data:   0%|          | 0.00/9.02M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/30000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3000 [00:00<?, ? examples/s]

Dataset adversarial_qa downloaded and prepared to /root/.cache/huggingface/datasets/adversarial_qa/adversarialQA/1.0.0/92356be07b087c5c6a543138757828b8d61ca34de8a87807d40bbc0e6c68f04b. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
dt

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'metadata'],
        num_rows: 30000
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'metadata'],
        num_rows: 3000
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers', 'metadata'],
        num_rows: 3000
    })
})

Here we specify the checkpoint that we use both for tokenizing the dataset and for fine-tuning.

In [None]:
model_name = 'distilbert-base-uncased'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Now we pre-process the training examples. We tokenize each 'question' and 'context' pair and determine the start and end token positions of the 'answer' inside the tokenized context. This puts the dataset into the format the model expects, so that training does not fail or learn from misaligned labels.

BERT-style models accept at most 512 tokens per input. We set max_length to 384, so a context that does not fit is split into several overlapping features, and stride sets the overlap between consecutive features to 128 tokens. Every feature therefore stays within the limit expected by our model.
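
As a rough illustration of how max_length and stride interact, the sketch below computes which context tokens end up in each overlapping feature. It is a simplified model, not the tokenizer's actual implementation: `sliding_windows` is a hypothetical helper, and the question length of 14 tokens and the context length of 1000 tokens are assumed values chosen for the example.

```python
def sliding_windows(num_context_tokens, window_size, stride):
    """Return (start, end) context-token ranges; neighbours overlap by `stride` tokens."""
    if stride >= window_size:
        raise ValueError("stride must be smaller than the window size")
    windows = []
    start = 0
    while True:
        end = min(start + window_size, num_context_tokens)
        windows.append((start, end))
        if end == num_context_tokens:
            break
        # Move forward, keeping the last `stride` tokens of the previous window.
        start += window_size - stride
    return windows


max_length = 384
stride = 128
question_tokens = 14  # assumed question length for this example
special_tokens = 3    # [CLS] question [SEP] context [SEP]
context_budget = max_length - question_tokens - special_tokens

# A context of 1000 tokens (assumed) is split into overlapping features.
for start, end in sliding_windows(1000, context_budget, stride):
    print(start, end)
```

With these assumed values, each feature has room for 367 context tokens and a 1000-token context produces four features. This is why the number of features after preprocessing can be larger than the number of original examples.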

In [None]:
max_length = 384
stride = 128


def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
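
The core of the function above is the conversion from a character span to a token span by way of the offset mapping. The sketch below isolates that step on a toy example. It assumes a plain whitespace tokenizer instead of the real WordPiece tokenizer, and `whitespace_offsets` and `char_span_to_token_span` are hypothetical helpers written for this illustration.

```python
import re


def whitespace_offsets(text):
    """Offsets (start_char, end_char) of each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span to inclusive token indices, or (0, 0) if it is not covered."""
    # Same rule as above: if the answer is not fully inside this window, label (0, 0).
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return (0, 0)
    # Last token that starts at or before the first answer character.
    start_token = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    # First token that ends at or after the last answer character.
    end_token = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return (start_token, end_token)


text = "Lincoln was born in 1809 in Kentucky"
offsets = whitespace_offsets(text)
answer = "1809"
start_char = text.index(answer)
end_char = start_char + len(answer)

print(char_span_to_token_span(offsets, start_char, end_char))      # full window
print(char_span_to_token_span(offsets[:3], start_char, end_char))  # answer cut off
```

The second call simulates a feature whose window ends before the answer; it gets the label (0, 0), which in the real pipeline points at the [CLS] token.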

Here we select the examples used for training. We initially selected 6000 examples, but training exceeded the Google Colab session time-out and could not finish, so we reduced the training set to 1000 examples.

In [None]:
train_dataset = dt["train"].map(
    preprocess_training_examples,
    batched=True,
    remove_columns=dt["train"].column_names,
)
len(dt["train"]), len(train_dataset)

train_dataset = train_dataset.select(range(1000))

len(dt["train"]), len(train_dataset)


  0%|          | 0/30 [00:00<?, ?ba/s]

(30000, 1000)

Here we follow similar steps to pre-process the validation examples, which are passed to the Trainer as the evaluation dataset.

In [None]:
def preprocess_validation_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs

Here we select 300 validation examples to pass to the Trainer as the evaluation dataset.

In [None]:
validation_dataset = dt["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=dt["validation"].column_names,
)
len(dt["validation"]), len(validation_dataset)

validation_dataset = validation_dataset.select(range(300))

len(dt["validation"]), len(validation_dataset)


  0%|          | 0/3 [00:00<?, ?ba/s]

(3000, 300)

Here we use DataCollatorWithPadding to assemble the batches for fine-tuning the model.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

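Here we initialize the model with a question-answering head, so that the task is stated explicitly. (The same base model could also be used for other tasks such as text classification or sentiment analysis.) Note that 'distilbert-base-uncased' is a DistilBERT checkpoint, so loading it through BertForQuestionAnswering triggers the warnings shown below, and the checkpoint weights they list are not used. AutoModelForQuestionAnswering would select the matching architecture automatically.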

In [None]:
model = BertForQuestionAnswering.from_pretrained(model_name)

You are using a model of type distilbert to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.


Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing BertForQuestionAnswering: ['distilbert.transformer.layer.2.output_layer_norm.weight', 'distilbert.embeddings.LayerNorm.weight', 'distilbert.transformer.layer.4.sa_layer_norm.bias', 'distilbert.transformer.layer.1.output_layer_norm.weight', 'vocab_layer_norm.weight', 'distilbert.transformer.layer.4.attention.v_lin.weight', 'distilbert.transformer.layer.5.sa_layer_norm.bias', 'distilbert.transformer.layer.5.sa_layer_norm.weight', 'distilbert.transformer.layer.1.attention.k_lin.bias', 'distilbert.embeddings.word_embeddings.weight', 'vocab_transform.weight', 'distilbert.transformer.layer.5.attention.k_lin.weight', 'distilbert.transformer.layer.0.ffn.lin2.weight', 'distilbert.transformer.layer.3.sa_layer_norm.weight', 'distilbert.transformer.layer.0.attention.out_lin.weight', 'distilbert.transformer.layer.1.attention.q_lin.bias', 'distilbert.transformer.layer.2.sa_layer_norm.weight', 'distilbert.

Here we specify the training arguments for our model: the learning rate, the number of epochs, and the weight decay. We also set evaluation_strategy to "no", so no evaluation is run during training.

In [None]:
args = TrainingArguments(
    "bert-finetuned-squad",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
)

Here we give the Trainer the model, the tokenizer, the training dataset, the evaluation dataset, and the training arguments that it uses to train our model.

The '.train()' method starts the training process.

It took us about **1.5 hours** to train one epoch on 1000 training examples with 300 validation examples (the logged train_runtime is about 66 minutes).

In [None]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
)
trainer.train()

***** Running training *****
  Num examples = 1000
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 125


Step,Training Loss


Saving model checkpoint to bert-finetuned-squad/checkpoint-125
Configuration saved in bert-finetuned-squad/checkpoint-125/config.json
Model weights saved in bert-finetuned-squad/checkpoint-125/pytorch_model.bin
tokenizer config file saved in bert-finetuned-squad/checkpoint-125/tokenizer_config.json
Special tokens file saved in bert-finetuned-squad/checkpoint-125/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=125, training_loss=5.20370849609375, metrics={'train_runtime': 3983.2326, 'train_samples_per_second': 0.251, 'train_steps_per_second': 0.031, 'total_flos': 195972567552000.0, 'train_loss': 5.20370849609375, 'epoch': 1.0})
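
The figures in the training log above can be checked with a little arithmetic. The sketch below only uses numbers that appear in the log (1000 examples, batch size 8, one epoch, a train_runtime of 3983.2326 seconds); `optimization_steps` is a hypothetical helper and ignores gradient accumulation, which is 1 in this run.

```python
import math


def optimization_steps(num_examples, batch_size, epochs=1):
    """Number of optimizer steps when every batch triggers one update."""
    return math.ceil(num_examples / batch_size) * epochs


num_examples = 1000
batch_size = 8
train_runtime = 3983.2326  # seconds, taken from the TrainOutput above

steps = optimization_steps(num_examples, batch_size)
samples_per_second = num_examples / train_runtime
steps_per_second = steps / train_runtime

print("optimization steps:", steps)
print("samples per second:", round(samples_per_second, 3))
print("steps per second:", round(steps_per_second, 3))
print("runtime in minutes:", round(train_runtime / 60, 1))
```

At roughly a quarter of an example per second, the 6000 examples tried initially would take about six times as long, which is consistent with the Colab time-out mentioned earlier.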

Before sharing the model on the Hugging Face Hub, we have to save it locally.

In [None]:
model.save_pretrained("/content/607-project-adeversarial")
tokenizer.save_pretrained("/content/607-project-adeversarial")

Configuration saved in /content/607-project-adeversarial/config.json
Model weights saved in /content/607-project-adeversarial/pytorch_model.bin
tokenizer config file saved in /content/607-project-adeversarial/tokenizer_config.json
Special tokens file saved in /content/607-project-adeversarial/special_tokens_map.json


('/content/607-project-adeversarial/tokenizer_config.json',
 '/content/607-project-adeversarial/special_tokens_map.json',
 '/content/607-project-adeversarial/vocab.txt',
 '/content/607-project-adeversarial/added_tokens.json',
 '/content/607-project-adeversarial/tokenizer.json')

notebook_login() handles the Hugging Face login in case you are not using a terminal. We did not have terminal access without a Colab Pro membership, so we had to log in this way.

For this to work, you first have to sign up for a Hugging Face account.

After you have logged in to your Hugging Face account, you can create an access token. It is entered here to authenticate pushes to the Hugging Face Hub, which will store all your files.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


The model files are large, approximately 400 MB. To handle large files and push them to a Hugging Face repository, we use Git LFS (Large File Storage).

In [None]:
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
!sudo apt-get install git-lfs
!git lfs install

Detected operating system as Ubuntu/bionic.
Checking for curl...
Detected curl...
Checking for gpg...
Detected gpg...
Running apt-get update... done.
Installing apt-transport-https... done.
Installing /etc/apt/sources.list.d/github_git-lfs.list...done.
Importing packagecloud gpg key... done.
Running apt-get update... done.

The repository is setup! You can now install packages.
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 96 not upgraded.
Need to get 6,800 kB of archives.
After this operation, 15.3 MB of additional disk space will be used.
Get:1 https://packagecloud.io/github/git-lfs/ubuntu bionic/main amd64 git-lfs amd64 3.1.2 [6,800 kB]
Fetched 6,800 kB in 0s (15.8 MB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl

In [None]:
%cd /content/bert-finetuned-squad/checkpoint-125
!git clone https://{KrishnaAgarwal16}:{}@github.com/{KrishnaAgarwal16}/{607-Project}.git

/content/bert-finetuned-squad/checkpoint-125
Cloning into '{607-Project}'...
fatal: unable to access 'https://{KrishnaAgarwal16}:{}@github.com/{KrishnaAgarwal16}/{607-Project}.git/': The requested URL returned error: 400


After connecting to the Hugging Face Hub with the token login, we can push our model much as we would push to GitHub.

In [None]:
model.push_to_hub("607-project-adversarial")

Cloning https://huggingface.co/KrishnaAgarwal16/607-project-adversarial into local empty directory.
Configuration saved in 607-project-adversarial/config.json
Model weights saved in 607-project-adversarial/pytorch_model.bin


Upload file pytorch_model.bin:   0%|          | 32.0k/415M [00:00<?, ?B/s]

To https://huggingface.co/KrishnaAgarwal16/607-project-adversarial
   3200090..7504f83  main -> main



'https://huggingface.co/KrishnaAgarwal16/607-project-adversarial/commit/7504f833dcdbcbfc48f7cf6286601ed5b0d5f535'

Now that the model has been published to the repository, we can load it with 'BertForQuestionAnswering' by specifying the name under which it was saved on the Hub.

P.S. Only the model was pushed above, not the tokenizer, so we load the tokenizer separately from the base checkpoint in order to run our model.

In [None]:
model_name = "KrishnaAgarwal16/607-project-adversarial"
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = BertForQuestionAnswering.from_pretrained(model_name)

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.18.0",
  "vocab_size": 30522
}

loading file https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt from cache at /root/.cache/huggingface/transformers/0e1bbfda7f63a99bb52e3915dcf10

Downloading:   0%|          | 0.00/415M [00:00<?, ?B/s]

storing https://huggingface.co/KrishnaAgarwal16/607-project-adversarial/resolve/main/pytorch_model.bin in cache at /root/.cache/huggingface/transformers/9157f6dabd3c9c7378b8ac559fbffd05ce462ba551b541e4977a39b40e8b08e8.011ba628c43abf46c9bf1431f93f7cfeb411e9cb89b7a6967897e848393b7677
creating metadata file for /root/.cache/huggingface/transformers/9157f6dabd3c9c7378b8ac559fbffd05ce462ba551b541e4977a39b40e8b08e8.011ba628c43abf46c9bf1431f93f7cfeb411e9cb89b7a6967897e848393b7677
loading weights file https://huggingface.co/KrishnaAgarwal16/607-project-adversarial/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/9157f6dabd3c9c7378b8ac559fbffd05ce462ba551b541e4977a39b40e8b08e8.011ba628c43abf46c9bf1431f93f7cfeb411e9cb89b7a6967897e848393b7677
All model checkpoint weights were used when initializing BertForQuestionAnswering.

All the weights of BertForQuestionAnswering were initialized from the model checkpoint at KrishnaAgarwal16/607-project-adversarial.
If your 

The pipeline function from the transformers library handles all the steps involved in the question-answering task. We only have to specify the task, the model, and the tokenizer. (Without an explicit task, the model might be confused with 'text-classification' or other NLP tasks.)

In [None]:
from transformers import pipeline

In [None]:
model_name = "KrishnaAgarwal16/607-project-adversarial"
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = BertForQuestionAnswering.from_pretrained(model_name)

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/892 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/415M [00:00<?, ?B/s]

In [None]:
qa = pipeline('question-answering',model,tokenizer=tokenizer)

In [None]:
score = []
for i in range(len(Questions_List)):
  x = qa({
    'question':Questions_List[i],
    'context':context
      })
  print(x)
  score.append(x)






{'score': 0.001872930326499045, 'start': 82583, 'end': 82618, 'answer': 'presidential election, 1864\nUlysses'}
{'score': 0.00113432586658746, 'start': 300, 'end': 312, 'answer': 'presidential'}
{'score': 0.0018682500813156366, 'start': 82583, 'end': 82618, 'answer': 'presidential election, 1864\nUlysses'}
{'score': 0.0016789911314845085, 'start': 82583, 'end': 82595, 'answer': 'presidential'}
{'score': 0.001317953341640532, 'start': 69868, 'end': 69880, 'answer': 'Presidential'}
{'score': 0.001673764898441732, 'start': 82583, 'end': 82595, 'answer': 'presidential'}
{'score': 0.0011371343862265348, 'start': 300, 'end': 312, 'answer': 'presidential'}
{'score': 0.0014163665473461151, 'start': 60486, 'end': 60498, 'answer': 'Presidential'}
{'score': 0.0015923931496217847, 'start': 60486, 'end': 60498, 'answer': 'Presidential'}
{'score': 0.0011301416670903563, 'start': 300, 'end': 312, 'answer': 'presidential'}
{'score': 0.0016768764471635222, 'start': 82583, 'end': 82595, 'answer': 'presi

In [None]:
score

[{'answer': 'presidential election, 1864\nUlysses',
  'end': 82618,
  'score': 0.001872930326499045,
  'start': 82583},
 {'answer': 'presidential',
  'end': 312,
  'score': 0.00113432586658746,
  'start': 300},
 {'answer': 'presidential election, 1864\nUlysses',
  'end': 82618,
  'score': 0.0018682500813156366,
  'start': 82583},
 {'answer': 'presidential',
  'end': 82595,
  'score': 0.0016789911314845085,
  'start': 82583},
 {'answer': 'Presidential',
  'end': 69880,
  'score': 0.001317953341640532,
  'start': 69868},
 {'answer': 'presidential',
  'end': 82595,
  'score': 0.001673764898441732,
  'start': 82583},
 {'answer': 'presidential',
  'end': 312,
  'score': 0.0011371343862265348,
  'start': 300},
 {'answer': 'Presidential',
  'end': 60498,
  'score': 0.0014163665473461151,
  'start': 60486},
 {'answer': 'Presidential',
  'end': 60498,
  'score': 0.0015923931496217847,
  'start': 60486},
 {'answer': 'presidential',
  'end': 312,
  'score': 0.0011301416670903563,
  'start': 300},

Each result includes a score, which is the model's confidence in the returned span rather than its similarity to the correct answer. Below, we measure accuracy by checking whether each predicted answer exactly matches the answer in our dataset. As we can see, the scores of our fine-tuned model are very low, which can be attributed to the small number of training samples and to training for only one epoch.

In [None]:
count = 0
for i in range(len(score)):
  count = count + 1*(score[i]['answer'] == df['Answer'][i])

acc = count/len(score)
acc


0.0
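
Strict string equality, as used above, counts a prediction as wrong even when it differs from the reference only in case or punctuation. A common alternative for extractive QA, sketched here, is SQuAD-style answer normalization combined with exact match and token-level F1. The helper names are hypothetical and the example strings are made up; this is an illustration, not the official evaluation script.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """True if the answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(truth)


def f1_score(prediction, truth):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("18 months", "18 months."))   # differs only in punctuation
print(f1_score("about 18 months", "18 months"))  # partial credit for overlap
print(f1_score("presidential", "18 months"))     # no shared tokens
```

F1 gives partial credit when the predicted span overlaps the reference, so it separates near misses from completely wrong answers, which plain accuracy cannot do.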

Comparing the fine-tuned model with the base **'distilbert-base-uncased'** model from Hugging Face.

In [None]:
model_name = 'distilbert-base-uncased'
model = BertForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
qa = pipeline('question-answering',model,tokenizer=tokenizer)

In [None]:
context = ("Abraham Lincoln was born on February 12, 1809, to Thomas Lincoln and Nancy Hanks, two uneducated farmers. Lincoln was born in a one-room log cabin on the Sinking Spring Farm, in southeast Hardin County, Kentucky (now part of LaRue County). This area was at the time considered the frontier. The name Abraham was chosen to commemorate his grandfather, who was killed in an American Indian raid in 1786. Donald (1995) p 21  His elder sister, Sarah Lincoln, was born in 1807; a younger brother, Thomas Jr, died in infancy. It is sometimes debated whether Lincoln had Marfan syndrome, an autosomal dominant disorder of the connective tissue characterized by long limbs and great physical stature.  Marfan syndrome: Introduction Aug 1, 2006. Symbolic log cabin at Abraham Lincoln Birthplace National Historic Site. For some time, Thomas Lincoln was a respected and relatively affluent citizen of the Kentucky back country. He had purchased the Sinking Spring Farm in December of 1808 for $200 cash and assumption of a debt.  The farm site is now preserved as part of Abraham Lincoln Birthplace National Historic Site.  The family belonged to a Baptist church that had seceded from a larger church over the issue of slavery. Though Lincoln was exposed to his parents' anti-slavery sentiment from a very young age, he never joined their church, or any other church for that matter.  As a youth he had little use for religion. Life of Abraham Lincoln, Colonel Ward H. Lamon, 1872 - portions reprinted in  Chapter VIII: Abraham Lincoln, Deist, and Admirer of Thomas Paine, From the book Religious Beliefs of Our Presidents by Franklin Steiner (1936). Lincoln was just seven years old when, in 1816, the family was forced to make a new start in Perry County (now in Spencer County), Indiana. He later noted that this move was \"partly on account of slavery,\" and partly because of difficulties with land deeds in Kentucky: Unlike land in the Northwest Territory, Kentucky never had a proper U.S. 
survey, and farmers often had difficulties proving title to their property."
"Lincoln was only nine when his mother, then thirty-four years old, died of milk sickness. Soon afterwards, his father remarried  to Sarah Bush Johnston. Sarah Lincoln raised young Lincoln like one of her own children. Years later she compared Lincoln to her own son, saying \"Both were good boys, but I must say — both now being dead that Abe was the best boy I ever saw or ever expect to see.\" Lincoln was affectionate toward his stepmother, whom he would call \"Mother\" for the rest of his life, but he was distant from his father.  Donald, (1995) pp. 28, 152."
"In 1830, after more economic and land-title difficulties in Indiana, the family settled on public land  /ref> in Macon County, Illinois. Some scholars believe that it was his father's repeated land-title difficulties and ensuing financial hardships that led young Lincoln to study law.  The following winter was desolate and especially brutal, and the family considered moving back to Indiana. The following year, when his father relocated the family to a new homestead in Coles County, Illinois, twenty-two-year-old Lincoln struck out on his own, canoeing down the Sangamon River to the village of New Salem in Sangamon County. Later that year, hired by New Salem businessman Denton Offutt and accompanied by friends, he took goods from New Salem to New Orleans via flatboat on the Sangamon, Illinois and Mississippi rivers. While in New Orleans, he may have witnessed a slave auction, though as a frequent visitor to Kentucky, he would have had several earlier opportunities to witness  similar sales.  Donald, (1995) ch. 2."
"Lincoln's formal education consisted of about 18 months of schooling. Largely self-educated, he read every book he could get his hands on, once walking.  just to borrow one  While his favorite book was The Life of George Washington, Lincoln mastered the Bible, Shakespeare, and English and American history, and developed a plain writing style that puzzled audiences more used to grandiose rhetoric. He was also a talented local wrestler and skilled with an ax; some rails he had allegedly split in his youth were exhibited at the 1860 Republican National Convention, as the party celebrated the poor-boy-made-good theme. Lincoln avoided hunting and fishing because he did not like killing animals, even for food. Though he was unusually tall at  , 4 inches and strong, Lincoln spent so much time reading that some neighbors suspected he must be doing it to avoid strenuous manual labor."
)

In [None]:
question = ["why did Lincoln's mother die?"]

In [None]:
qa({
    'question':question[0],
    'context':context
})



{'answer': '. He was also a talented local',
 'end': 4074,
 'score': 0.000106231847894378,
 'start': 4044}

So, we can conclude that even though our fine-tuned model was trained on only 1,000 examples for 1 epoch, it still **performed 5 times** better than the base model, **'distilbert-base-uncased'**.

We now use the pre-trained Hugging Face model **deepset/bert-base-cased-squad2** to answer the list of questions.

The model **deepset/bert-base-cased-squad2** is provided on the Hugging Face repository by the **deepset** organisation, which fine-tuned the **bert-base-cased** model on the **squad2** dataset.

The **squad2** dataset is provided by Stanford. It includes all the questions in the **squad** dataset along with more than 50,000 additional unanswerable questions, which helps reduce bias and might make the model more robust for everyday use.
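To illustrate why the unanswerable questions matter, the sketch below shows one common way a reader trained on this kind of data can choose between an extracted span and "no answer": it compares the best span score with a score for the null option. The function name `choose_answer`, the candidate format and the threshold are illustrative assumptions, not the internals of **deepset/bert-base-cased-squad2**.

```python
from typing import List, Tuple

def choose_answer(candidates: List[Tuple[str, float]],
                  null_score: float,
                  threshold: float = 0.0) -> str:
    """Return the best span, or '' when the 'no answer' option wins.

    candidates: (answer_text, score) pairs for extracted spans (hypothetical format).
    null_score: score assigned to the 'no answer' option.
    threshold:  margin by which null_score must exceed the best span score.
    """
    if not candidates:
        return ""
    best_text, best_score = max(candidates, key=lambda c: c[1])
    if null_score - best_score > threshold:
        return ""  # treat the question as unanswerable
    return best_text

# Toy scores, made up for illustration.
print(repr(choose_answer([("18 months", 0.92), ("1832", 0.10)], null_score=0.05)))
print(repr(choose_answer([("seven", 0.02)], null_score=0.90)))
```

Raising the threshold makes the sketch less willing to abstain, which is the kind of trade-off one would tune on a validation set.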

In [None]:
model = BertForQuestionAnswering.from_pretrained('deepset/bert-base-cased-squad2')
tokenizer = AutoTokenizer.from_pretrained('deepset/bert-base-cased-squad2')



In [None]:
qa = pipeline('question-answering',model,tokenizer=tokenizer)

In [None]:
new_score = []
# Run the pipeline on every question against the same context and keep the raw outputs.
for q in Questions_List:
  x = qa({
    'question': q,
    'context': context
  })
  print(x)
  new_score.append(x)






{'score': 0.915874183177948, 'start': 7355, 'end': 7364, 'answer': '18 months'}
{'score': 0.9896573424339294, 'start': 8261, 'end': 8265, 'answer': '1832'}
{'score': 0.4672646224498749, 'start': 47540, 'end': 47609, 'answer': 'United States Note, the first paper currency in United States history'}
{'score': 0.9411458373069763, 'start': 57232, 'end': 57244, 'answer': 'Grace Bedell'}
{'score': 0.8433478474617004, 'start': 45061, 'end': 45065, 'answer': '1776'}
{'score': 0.9991657733917236, 'start': 48658, 'end': 48666, 'answer': 'Kentucky'}
{'score': 0.9725509285926819, 'start': 1225, 'end': 1229, 'answer': '1860'}
{'score': 0.9986327886581421, 'start': 44782, 'end': 44799, 'answer': 'John Wilkes Booth'}
{'score': 0.9876161217689514, 'start': 1728, 'end': 1744, 'answer': 'Ulysses S. Grant'}
{'score': 0.9568571448326111, 'start': 75686, 'end': 75693, 'answer': 'slavery'}
{'score': 0.7027047872543335, 'start': 5317, 'end': 5322, 'answer': 'seven'}
{'score': 0.889444887638092, 'start': 1247

Here, the **score** for most questions is high (the lowest one shown is about 0.47), and the answers are spans that can be found in the context. We could attribute the higher scores to the robustness of the pre-trained model, the features it might include and the number of epochs it has been trained on.
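The `start` and `end` values in each result are character offsets into the context, so a returned answer can be checked by slicing the context. The sketch below uses a short toy context and a hand-written result dictionary in the same shape as the output above; `span_matches` is a hypothetical helper and the values are not taken from the run.

```python
def span_matches(context: str, result: dict) -> bool:
    """Check that the reported offsets really point at the reported answer."""
    return context[result["start"]:result["end"]] == result["answer"]

# Toy context and result (illustrative values only).
toy_context = "Lincoln's formal education consisted of about 18 months of schooling."
start = toy_context.index("18 months")
toy_result = {
    "score": 0.9,
    "start": start,
    "end": start + len("18 months"),
    "answer": "18 months",
}

print(span_matches(toy_context, toy_result))
```

The same check could be applied to every entry of `new_score` with the full `context` to confirm that each answer is a literal span of the article.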

In [None]:
# Exact-match accuracy: a prediction counts only if it equals the reference answer string exactly.
count = 0
for i in range(len(new_score)):
  count += int(new_score[i]['answer'] == df['Answer'][i])

acc = count/len(new_score)
acc



0.42857142857142855

So, the exact-match accuracy of the pre-trained model is about 0.43, which is far better than our untrained model, which was unable to answer any question correctly.
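Because the accuracy above relies on exact string equality, a prediction that differs from the reference only in case, punctuation or a leading article is counted as wrong. A minimal sketch of a more forgiving comparison, modelled on the normalised exact match and token-level F1 commonly used for SQuAD-style evaluation, could look like this. The helper names are assumptions and this is not the official evaluation script.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, truth: str) -> bool:
    """True when both strings are identical after normalisation."""
    return normalize(prediction) == normalize(truth)

def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall after normalisation."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Kentucky", "kentucky."))
print(token_f1("Ulysses S. Grant", "Grant"))
```

Token-level F1 gives partial credit when the predicted span contains the reference answer plus extra words, which plain equality cannot do.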

Code output for a **Yes or No** type of question:

In [None]:
question = ["Did Lincoln's mother die of pneumonia?"]

In [None]:
qa({
    'question':question[0],
    'context':context
})



{'answer': 'and fishing because he did not like killing animals',
 'end': 4343,
 'score': 0.00010440014011692256,
 'start': 4292}

Here, we can see that the model does not give the **Yes** or **No** answer we are hoping for. As an extractive model it can only return a span of the context, and the span shown above has a very low score (about 0.0001). So, even though the model has been fine-tuned on the **squad2** dataset, it does not perform well on **boolean** questions.
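One possible way to work around this limitation is to detect boolean questions before they reach the extractive model, so that they can be handled separately. The sketch below is a simple heuristic based on the first word of the question; the word list and the helper name `is_boolean_question` are assumptions and will miss some phrasings.

```python
# Words that typically open a yes/no question (illustrative, not exhaustive).
AUXILIARY_STARTS = {
    "is", "are", "was", "were", "do", "does", "did",
    "has", "have", "had", "can", "could", "will", "would", "should",
}

def is_boolean_question(question: str) -> bool:
    """Guess whether a question expects a Yes/No answer from its first word."""
    words = question.strip().lower().split()
    return bool(words) and words[0] in AUXILIARY_STARTS

for q in ["Did Lincoln's mother die of pneumonia?", "why did Lincoln's mother die?"]:
    print(q, "->", is_boolean_question(q))
```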

## Conclusion:
----

We treated this project as a learning opportunity to deepen our understanding of NLP tasks and to explore the functionalities, models and datasets offered by the Hugging Face community. We have learnt a lot about pre-processing techniques, transformers and pre-trained models, and plan to build on this in our future projects.

To conclude, we would like to discuss some of the problems we faced, how we intend to solve them, and some related future work.

### Problems:

* It was a challenge to convert our chosen dataset into the predefined format that Hugging Face requires before it can be used to fine-tune an existing model from the Hugging Face repository.

* Training the model required considerable time and computation resources, and we were limited by the free tier of Google Colab.
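As a rough illustration of the conversion problem, the sketch below builds one SQuAD-style record (with `answers` holding `text` and `answer_start` lists) from a question, an answer and its context. It assumes the answer appears verbatim in the context, which is not true for every row of our dataset; the function name, the example id, the title and the sample question are hypothetical.

```python
from typing import Optional

def to_squad_example(example_id: str, title: str, question: str,
                     answer: str, context: str) -> Optional[dict]:
    """Build one SQuAD-style record, or None if the answer is not a span of the context."""
    start = context.find(answer)
    if start == -1:
        return None  # answer is not a literal span; needs manual annotation
    return {
        "id": example_id,
        "title": title,
        "context": context,
        "question": question,
        "answers": {"text": [answer], "answer_start": [start]},
    }

# Toy example (illustrative values only).
toy_context = "Lincoln's formal education consisted of about 18 months of schooling."
record = to_squad_example(
    "q1", "Abraham_Lincoln",
    "How long was Lincoln's formal education?",
    "18 months", toy_context,
)
print(record["answers"])
```

Rows for which the function returns `None` are the ones that need manual work, for example yes/no answers or paraphrased answers.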

### Future work:

* Since converting the dataset involves a lot of manual work, we wish to start small: we will convert at least one of the articles in our dataset into the prescribed format and use it to fine-tune the model.

* We are also looking into optimizing our model and making it more robust by changing the learning rate, increasing the number of epochs, increasing the number of training and validation examples, and more.

* We are also exploring cloud services and GPU options that could provide the additional computing resources needed to train the model better.

* Another idea would be to use **boolean** question-answering datasets to fine-tune our model, as this might make it better at answering **Yes or No** type of questions.
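The hyperparameter exploration mentioned above could start from a small grid of settings. The sketch below only enumerates the combinations; the specific values and key names are hypothetical placeholders, and each configuration would still need to be passed to an actual training run.

```python
from itertools import product

# Hypothetical candidate values, chosen only for illustration.
learning_rates = [2e-5, 3e-5, 5e-5]
epoch_counts = [1, 2, 3]
train_sizes = [1000, 5000]

# One dictionary per combination of settings to try.
configs = [
    {"learning_rate": lr, "num_train_epochs": ep, "train_examples": n}
    for lr, ep, n in product(learning_rates, epoch_counts, train_sizes)
]

print(len(configs))
print(configs[0])
```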



## References
---

* http://www.cs.cmu.edu/~nasmith/papers/smith+heilman+hwa.nsf08.pdf

* https://lilianweng.github.io/posts/2020-10-29-odqa/

* https://huggingface.co/distilbert-base-uncased

* https://www.sciencedirect.com/science/article/pii/S131915781830082X#b0130

* http://www.cs.cmu.edu/~ark/QA-data/

* http://nlpprogress.com/english/question_answering.html

* https://en.wikipedia.org/wiki/Question_answering

* https://huggingface.co/course/chapter7/7?fw=pt

* https://huggingface.co/docs/transformers/model_sharing

* https://huggingface.co/course/chapter7/7?fw=pt#preparing-the-data

* https://huggingface.co/docs/transformers/main_classes/pipelines

* https://huggingface.co/datasets/adversarial_qa

* https://asperbrothers.com/blog/question-answering-python/

* https://www.youtube.com/watch?v=scJsty_DR3o