### Purpose
### Evaluation of various BERT models for the extractive Q&A task
* Given two sequences, a question and a context, BERT extracts the answer to the question as a span of that context.

### Abstract
* We evaluate four BERT-family models on the extractive Q&A task.  The first three are popular fine-tuned models from Hugging Face.  The fourth is the "bert-large-uncased" pretrained model, which we fine-tune ourselves.  We test each model with three simple questions and compare the results.

### Team Members
* Sean Tran - 101449600
* Mohammed Mujtaba Rabbani  - 101387404

### Three simple questions to evaluate the models
1.  Trivia Question:  Where is the capital of Canada?
2.  Information seeking:  Where does everyone live?
3.  Seeking information from a long text:  When was the new province of Upper Canada created?
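
The three questions above can be kept in one place and run against any model through a small helper. This is a minimal sketch, not the notebook's own code: `QA_CASES`, `run_cases`, and `stub_qa` are hypothetical names, the long passage is shortened to an excerpt, and `stub_qa` is a stand-in that only lets the sketch run without downloading a model.

```python
from typing import Callable, Dict, List

# Shortened excerpt; the notebook itself uses the full Wikipedia passage.
LONG_TEXT_EXCERPT = (
    "The new province of Upper Canada was being created and needed a capital. "
    "In 1787, the British Lord Dorchester arranged for the Toronto Purchase."
)

# The three (question, context) pairs used throughout the notebook.
QA_CASES: List[Dict[str, str]] = [
    {"question": "Where is the capital of Canada?",
     "context": "The capital of Canada is Ottawa."},
    {"question": "Where does everyone live?",
     "context": "Everyone study at GBC, but they live out of town."},
    {"question": "When was the new province of Upper Canada created?",
     "context": LONG_TEXT_EXCERPT},
]


def run_cases(qa_fn: Callable[..., dict], cases: List[Dict[str, str]] = QA_CASES) -> List[dict]:
    """Call qa_fn(question=..., context=...) on every case and collect the results."""
    rows = []
    for case in cases:
        result = qa_fn(question=case["question"], context=case["context"])
        rows.append({"question": case["question"],
                     "answer": result["answer"],
                     "score": result["score"]})
    return rows


def stub_qa(question: str, context: str) -> dict:
    """Hypothetical stand-in for a model: returns the last word of the context."""
    last_word = context.rstrip(".").split()[-1]
    return {"answer": last_word, "score": 0.0}


rows = run_cases(stub_qa)
for row in rows:
    print(row["question"], "->", row["answer"])
```

With `transformers` installed, a Hugging Face question-answering pipeline can be passed in place of `stub_qa`, since it accepts the same `question=` and `context=` keyword arguments.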

### Long text used in Question 3
* Source: Wikipedia, https://en.wikipedia.org/wiki/Toronto
* "During the American Revolutionary War, an influx of British settlers arrived there as United Empire Loyalists fled for the British-controlled
lands north of Lake Ontario. The Crown granted them land to compensate for their losses in the Thirteen Colonies. The new province of
Upper Canada was being created and needed a capital. In 1787, the British Lord Dorchester arranged for the Toronto Purchase with the
Mississaugas of the New Credit First Nation, thereby securing more than a quarter of a million acres (1000 km2) of land in the Toronto area.
Dorchester intended the location to be named Toronto. The first 25 years after the Toronto purchase were quiet, although
"there were occasional independent fur traders" present in the area, with the usual complaints of debauchery and drunkenness."



### Model1 from Hugging Face: DistilBERT base cased distilled SQuAD
* https://huggingface.co/distilbert-base-cased-distilled-squad

In [None]:
!pip install transformers



In [None]:
from transformers import pipeline

In [None]:
question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')


#### Simple Trivia Question: Where is the capital of Canada?

In [None]:
context="The capital of Canada is Ottawa."
question="Where is the captial of Canada?"

question_answerer(question=question, context=context)

{'score': 0.9686177968978882, 'start': 25, 'end': 31, 'answer': 'Ottawa'}
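
The `start` and `end` values in the output are character offsets into the context, so the answer can be recovered by slicing the context string. The sketch below checks this against the two short examples in this notebook; `extract_span` is a hypothetical helper name, and the result dictionaries are copied from the recorded outputs.

```python
def extract_span(context: str, result: dict) -> str:
    """Slice the answer out of the context using the character offsets."""
    return context[result["start"]:result["end"]]


# Recorded output for the trivia question.
trivia_context = "The capital of Canada is Ottawa."
trivia_result = {"score": 0.9686177968978882, "start": 25, "end": 31, "answer": "Ottawa"}

# Recorded output for the information-seeking question.
live_context = "Everyone study at GBC, but they live out of town."
live_result = {"score": 0.6477463245391846, "start": 37, "end": 48, "answer": "out of town"}

for context, result in [(trivia_context, trivia_result), (live_context, live_result)]:
    span = extract_span(context, result)
    print(repr(span), span == result["answer"])
```

This is what makes the task extractive: the model does not generate text, it selects a span that already exists in the context.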

#### Simple Information Seeking Question: Where does everyone live?

In [None]:
question="Where does everyone live?"
context="Everyone study at GBC, but they live out of town."

question_answerer(question=question, context=context)

{'score': 0.6477463245391846, 'start': 37, 'end': 48, 'answer': 'out of town'}

#### Seeking information from a long text

In [None]:
# wikipedia https://en.wikipedia.org/wiki/Toronto

long_text = """
During the American Revolutionary War, an influx of British settlers arrived there as United Empire Loyalists fled for the British-controlled
lands north of Lake Ontario. The Crown granted them land to compensate for their losses in the Thirteen Colonies. The new province of
Upper Canada was being created and needed a capital. In 1787, the British Lord Dorchester arranged for the Toronto Purchase with the
Mississaugas of the New Credit First Nation, thereby securing more than a quarter of a million acres (1000 km2) of land in the Toronto area.
Dorchester intended the location to be named Toronto. The first 25 years after the Toronto purchase were quiet, although
"there were occasional independent fur traders" present in the area, with the usual complaints of debauchery and drunkenness.
"""
question = "When was the new province of Upper Canada created?"

question_answerer(question=question, context=long_text)

{'score': 0.9288270473480225, 'start': 333, 'end': 337, 'answer': '1787'}

### Model2 from Hugging Face: RoBERTa-base-SQuAD2
* https://huggingface.co/deepset/roberta-base-squad2
* Language model: roberta-base
* Training data: SQuAD 2.0
* Evaluation data: SQuAD 2.0
* Infrastructure: 4x Tesla v100

In [None]:
# !pip install transformers



In [None]:
from transformers import pipeline, BertForQuestionAnswering

In [None]:
model_name = "deepset/roberta-base-squad2"
qa = pipeline(model=model_name, tokenizer=model_name, revision="v1.0", task="question-answering")


#### Simple Trivia Question: Where is the capital of Canada?

In [None]:
context="The capital of Canada is Ottawa."
question="Where is the captial of Canada?"

sequence = question, context
qa(*sequence)

{'score': 0.8784342408180237, 'start': 25, 'end': 31, 'answer': 'Ottawa'}

#### Simple Information Seeking Question: Where does everyone live?

In [None]:
question="Where does everyone live?"
context="Everyone study at GBC, but they live out of town."

sequence = question, context
qa(*sequence)

{'score': 0.37237265706062317, 'start': 37, 'end': 48, 'answer': 'out of town'}

#### Seeking information from a long text

In [None]:
# wikipedia https://en.wikipedia.org/wiki/Toronto

long_text = """
During the American Revolutionary War, an influx of British settlers arrived there as United Empire Loyalists fled for the British-controlled
lands north of Lake Ontario. The Crown granted them land to compensate for their losses in the Thirteen Colonies. The new province of
Upper Canada was being created and needed a capital. In 1787, the British Lord Dorchester arranged for the Toronto Purchase with the
Mississaugas of the New Credit First Nation, thereby securing more than a quarter of a million acres (1000 km2) of land in the Toronto area.
Dorchester intended the location to be named Toronto. The first 25 years after the Toronto purchase were quiet, although
"there were occasional independent fur traders" present in the area, with the usual complaints of debauchery and drunkenness.
"""
question = "When was the new province of Upper Canada created?"

sequence = question, long_text
qa(*sequence)

{'score': 0.48114246129989624, 'start': 333, 'end': 337, 'answer': '1787'}

### Model3 from Hugging Face: BERT large model (uncased) whole word masking finetuned on SQuAD

In [None]:
squad_pipe = pipeline("question-answering", "bert-large-uncased-whole-word-masking-finetuned-squad")


#### Simple Trivia Question: Where is the capital of Canada?

In [None]:
context="The capital of Canada is Ottawa."
question="Where is the captial of Canada?"

sequence = question, context
squad_pipe(*sequence)

{'score': 0.6013673543930054, 'start': 25, 'end': 31, 'answer': 'Ottawa'}

#### Simple Information Seeking Question: Where does everyone live?

In [None]:
question="Where does everyone live?"
context="Everyone study at GBC, but they live out of town."

sequence = question, context
squad_pipe(*sequence)

{'score': 0.975112795829773, 'start': 37, 'end': 48, 'answer': 'out of town'}

#### Seeking information from a long text

In [None]:
# wikipedia https://en.wikipedia.org/wiki/Toronto

long_text = """
During the American Revolutionary War, an influx of British settlers arrived there as United Empire Loyalists fled for the British-controlled
lands north of Lake Ontario. The Crown granted them land to compensate for their losses in the Thirteen Colonies. The new province of
Upper Canada was being created and needed a capital. In 1787, the British Lord Dorchester arranged for the Toronto Purchase with the
Mississaugas of the New Credit First Nation, thereby securing more than a quarter of a million acres (1000 km2) of land in the Toronto area.
Dorchester intended the location to be named Toronto. The first 25 years after the Toronto purchase were quiet, although
"there were occasional independent fur traders" present in the area, with the usual complaints of debauchery and drunkenness.
"""
question = "When was the new province of Upper Canada created?"

sequence = question, long_text
squad_pipe(*sequence)

{'score': 0.7342033386230469, 'start': 333, 'end': 337, 'answer': '1787'}

### Model4: Fine-Tune the "bert-large-uncased" pretrained model

#### Dataset used: a variation of adversarial_qa, used in the O'Reilly video series
* https://huggingface.co/datasets/adversarial_qa

In [None]:
# !pip install transformers



In [None]:
from transformers import BertTokenizerFast, BertForQuestionAnswering, pipeline, \
                         DataCollatorWithPadding, TrainingArguments, Trainer, \
                         AutoModelForQuestionAnswering, AutoTokenizer
from datasets import Dataset
import pandas as pd

In [None]:
bert_tokenizer = BertTokenizerFast.from_pretrained('bert-large-uncased', return_token_type_ids=True)

qa_bert = BertForQuestionAnswering.from_pretrained('bert-large-uncased')


Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-large-uncased

In [None]:
# mount google drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
qa_df = pd.read_csv('/content/gdrive/MyDrive/Colab Notebooks/data/qa.csv')

qa_df.shape

(29989, 5)

In [None]:
qa_df.head()

|   | question | context | start_positions | end_positions | answer |
|---|---|---|---|---|---|
| 0 | What sare the benifts of the blood brain barrir? | Another approach to brain function is to exami... | 56 | 60 | isolated from the bloodstream |
| 1 | What is surrounded by cerebrospinal fluid? | Another approach to brain function is to exami... | 16 | 16 | brain |
| 2 | What does the skull protect? | Another approach to brain function is to exami... | 11 | 11 | brain |
| 3 | What has been injected into rats to produce pr... | Another approach to brain function is to exami... | 153 | 153 | chemicals |
| 4 | What can cause issues with how the brain works? | Another approach to brain function is to exami... | 93 | 94 | brain damage |


In [None]:
qa_df.iloc[0]

question            What sare the benifts of the blood brain barrir?
context            Another approach to brain function is to exami...
start_positions                                                   56
end_positions                                                     60
answer                                 isolated from the bloodstream
Name: 0, dtype: object

In [None]:
# only grab 4,000 examples to speed up training time
qa_dataset = Dataset.from_pandas(qa_df.sample(4000, random_state=42))

# Dataset has a built in train test split method
qa_dataset = qa_dataset.train_test_split(test_size=0.2)

In [None]:
# standard preprocessing here with truncation on to truncate longer text
def preprocess(data):
    return bert_tokenizer(data['question'], data['context'], truncation=True)

qa_dataset = qa_dataset.map(preprocess, batched=True)

Map:   0%|          | 0/3200 [00:00<?, ? examples/s]

Map:   0%|          | 0/800 [00:00<?, ? examples/s]

#### Freeze all pre-trained layers except the last two encoder layers and the QA head

In [None]:
# freeze all but the last 2 encoder layers in BERT to speed up training
for name, param in qa_bert.bert.named_parameters():
    if 'encoder.layer.22' in name:
        break
    param.requires_grad = False  # disable training in BERT
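
The loop above relies on parameter order: everything seen before the first name containing `encoder.layer.22` is frozen, and everything from that point on stays trainable. The sketch below reproduces that logic on plain strings so it runs without loading the model. `split_frozen` is a hypothetical helper, and the parameter names are a simplified, assumed subset (one per layer) rather than the full list returned by `named_parameters()`.

```python
from typing import List, Tuple


def split_frozen(param_names: List[str],
                 first_trainable: str = "encoder.layer.22") -> Tuple[List[str], List[str]]:
    """Mimic the freeze loop: names before the marker are frozen, the rest stay trainable."""
    frozen, trainable = [], []
    reached_marker = False
    for name in param_names:
        if first_trainable in name:
            reached_marker = True  # corresponds to the `break` in the loop above
        (trainable if reached_marker else frozen).append(name)
    return frozen, trainable


# Assumed, simplified names: embeddings first, then the 24 encoder layers of BERT-large.
names = ["embeddings.word_embeddings.weight"]
names += [f"encoder.layer.{i}.output.dense.weight" for i in range(24)]

frozen, trainable = split_frozen(names)
print(len(frozen), "frozen;", "trainable:", trainable)
```

Because the loop only walks `qa_bert.bert`, the span-prediction head (`qa_outputs`) is never touched and remains trainable along with encoder layers 22 and 23.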

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=bert_tokenizer)

In [None]:
# -p creates parent directories and does not fail if they already exist
!mkdir -p qa/results qa/logs

In [None]:
batch_size = 32
epochs = 2

training_args = TrainingArguments(
    output_dir='./qa/results',
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_dir='./qa/logs',
    save_strategy='epoch',
    logging_steps=10,
    evaluation_strategy='epoch',
    load_best_model_at_end=True
)

trainer = Trainer(
    model=qa_bert,
    args=training_args,
    train_dataset=qa_dataset['train'],
    eval_dataset=qa_dataset['test'],
    data_collator=data_collator
)

# Get initial metrics
trainer.evaluate()

{'eval_loss': 5.842594623565674,
 'eval_runtime': 12.4713,
 'eval_samples_per_second': 64.147,
 'eval_steps_per_second': 2.005}
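
One way to read the initial `eval_loss` of about 5.84: the question-answering loss averages the cross-entropy of the start and end positions, and a head that guesses uniformly over `n` token positions has cross-entropy `ln(n)`. The sketch below is only a back-of-the-envelope check under that uniform-guess assumption; it is not a claim about the actual sequence lengths in this dataset.

```python
import math


def uniform_cross_entropy(n_positions: int) -> float:
    """Cross-entropy (in nats) of a uniform guess over n_positions classes."""
    return math.log(n_positions)


def positions_for_loss(loss: float) -> float:
    """Number of equally likely positions that would produce the given loss."""
    return math.exp(loss)


for n in (128, 256, 512):
    print(n, round(uniform_cross_entropy(n), 3))

initial_eval_loss = 5.842594623565674  # value reported by trainer.evaluate() above
print(round(positions_for_loss(initial_eval_loss)))
```

Under this reading, the starting loss is in the range expected for near-random span prediction over a few hundred token positions, which is consistent with a freshly initialized QA head.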

In [None]:
trainer.train()   # This is a long process



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 4.4249 | 4.338307 |
| 2 | 4.0393 | 4.22591 |


TrainOutput(global_step=200, training_loss=4.465833511352539, metrics={'train_runtime': 149.8132, 'train_samples_per_second': 42.72, 'train_steps_per_second': 1.335, 'total_flos': 4216561921009152.0, 'train_loss': 4.465833511352539, 'epoch': 2.0})
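
The `global_step=200` in the output can be checked against the settings used above. The sketch assumes a single device and no gradient accumulation, and `expected_steps` is a hypothetical helper written for this check.

```python
import math
from typing import Tuple


def expected_steps(n_examples: int, test_size: float,
                   batch_size: int, epochs: int) -> Tuple[int, int, int]:
    """Return (train size, test size, total optimizer steps) for the given settings."""
    n_test = round(n_examples * test_size)
    n_train = n_examples - n_test
    steps_per_epoch = math.ceil(n_train / batch_size)
    return n_train, n_test, steps_per_epoch * epochs


# Values from the notebook: 4,000 sampled rows, test_size=0.2, batch_size=32, epochs=2.
n_train, n_test, total_steps = expected_steps(4000, 0.2, 32, 2)
print(n_train, n_test, total_steps)
```

The split sizes match the 3200 and 800 examples shown in the `Map` output, and 100 steps per epoch over 2 epochs gives the 200 steps reported by the trainer.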

#### Save the trained model for future use

In [None]:
trainer.save_model()

#### Evaluation:

In [None]:
pipe = pipeline("question-answering", './qa/results', tokenizer=bert_tokenizer)

#### Simple Trivia Question: Where is the capital of Canada?

In [None]:
context="The capital of Canada is Ottawa."
question="Where is the captial of Canada?"

pipe(question=question, context=context)

#### Simple Information Seeking Question: Where does everyone live?

In [None]:
context="Everyone study at GBC, but they live out of town."
question="Where does everyone live?"

pipe(question=question, context=context)

{'score': 0.6477463245391846, 'start': 37, 'end': 48, 'answer': 'out of town'}

#### Seeking information from a long text

In [None]:
# wikipedia https://en.wikipedia.org/wiki/Toronto

long_text = """
During the American Revolutionary War, an influx of British settlers arrived there as United Empire Loyalists fled for the British-controlled
lands north of Lake Ontario. The Crown granted them land to compensate for their losses in the Thirteen Colonies. The new province of
Upper Canada was being created and needed a capital. In 1787, the British Lord Dorchester arranged for the Toronto Purchase with the
Mississaugas of the New Credit First Nation, thereby securing more than a quarter of a million acres (1000 km2) of land in the Toronto area.
Dorchester intended the location to be named Toronto. The first 25 years after the Toronto purchase were quiet, although
"there were occasional independent fur traders" present in the area, with the usual complaints of debauchery and drunkenness.
"""
question = "When was the new province of Upper Canada created?"

# pass the long passage defined in this cell, not the short context from the previous cell
pipe(question=question, context=long_text)

### Comparing Results from Different Models
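
The scores recorded for the three Hugging Face models can be tabulated side by side. The sketch below copies the scores from the outputs above, rounded to four decimals; the short model labels are ours, and Model4 is left out because its outputs are not recorded for all three questions. All three models returned the same answers ('Ottawa', 'out of town', '1787'), so only the confidence scores differ, and scores from different models are not necessarily calibrated against each other.

```python
# Scores copied from the recorded outputs, rounded to 4 decimals.
scores = {
    "DistilBERT (distilled SQuAD)": {"Q1": 0.9686, "Q2": 0.6477, "Q3": 0.9288},
    "RoBERTa-base (SQuAD 2.0)":     {"Q1": 0.8784, "Q2": 0.3724, "Q3": 0.4811},
    "BERT-large WWM (SQuAD)":       {"Q1": 0.6014, "Q2": 0.9751, "Q3": 0.7342},
}
questions = ["Q1", "Q2", "Q3"]

# Model with the highest recorded score for each question.
best = {q: max(scores, key=lambda m: scores[m][q]) for q in questions}

# Print a fixed-width comparison table.
print(f"{'Model':<32}" + "".join(f"{q:>8}" for q in questions))
for model, row in scores.items():
    print(f"{model:<32}" + "".join(f"{row[q]:>8.4f}" for q in questions))

for q in questions:
    print(f"Highest score on {q}: {best[q]}")
```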

### Summary and Conclusion

### References