<p align="center">
 <img src="http://www.di.uoa.gr/themes/corporate_lite/logo_el.png" title="Department of Informatics and Telecommunications - University of Athens"/> </p>

---
<h1 align="center"> 
  Artificial Intelligence
</h1>
<h1 align="center" > 
  Deep Learning for Natural Language Processing
</h1>

---
<h2 align="center"> 
 <b>Konstantinos Nikoletos</b>
</h2>

<h3 align="center"> 
 <b>Winter 2020-2021</b>
</h3>


---
---

<h3 align="center"> 
 <b>Task Description</b>
</h3>

Build a BERT-based model which returns “an answer”, given a user question and a
passage which includes the answer of the question. For this question answering task, you
will use the SQuAD 2.0 dataset which has been discussed in the lecture “Textual Question
Answering”. You should start with the BERT-base pretrained model “bert-base-uncased”
and fine-tune it to have a question answering task as explained in the lecture on BERT.
Note that this has been done already by the BERT team and it is available publicly, but
we would like you to try to implement this by yourself. If you copy from the BERT code
for this task, your mark for this exercise will be 0.
For more information about question answering systems, you can read the post “How to
Build an Open-Domain Question Answering System?” and the survey “Recent Trends in
Deep Learning Based Open-Domain Textual Question Answering Systems”.


---
---

__Import__ of essential libraries


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd 
import sys # only needed to determine Python version number
import matplotlib # only needed to determine Matplotlib version 
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
import pprint
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext import data
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
import logging
from tqdm import tqdm, trange
import re
import string

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
!pip install colorama

Collecting colorama
  Downloading https://files.pythonhosted.org/packages/44/98/5b86278fbbf250d239ae0ecb724f8572af1c91f4a11edf4d36a206189440/colorama-0.4.4-py2.py3-none-any.whl
Installing collected packages: colorama
Successfully installed colorama-0.4.4


In [None]:
import colorama
from colorama import Fore

Selecting device (GPU - CUDA if available)

In [None]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
    torch.cuda.empty_cache()
else:
    print('No GPU available, training on CPU.')

Training on GPU.


In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/98/87/ef312eef26f5cecd8b17ae9654cdd8d1fae1eb6dbd87257d6d73c128a4d0/transformers-4.3.2-py3-none-any.whl (1.8MB)
     |████████████████████████████████| 1.8MB 10.7MB/s
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
     |████████████████████████████████| 890kB 34.6MB/s
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/fd/5b/44baae602e0a30bcc53fbdbc60bd940c15e143d252d658dfdefce736ece5/tokenizers-0.10.1-cp36-cp36m-manylinux2010_x86_64.whl (3.2MB)
     |████████████████████████████████| 3.2MB 27.9MB/s
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=0a07

In [None]:
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer, BertForQuestionAnswering

# Loading data
---

## SQuAD Dataset
The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every answerable question is a segment of text, or span, from the corresponding reading passage. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles; SQuAD 2.0, which is used here, adds over 50,000 unanswerable questions written to look similar to answerable ones.

## Problem
For each observation in the training set, we have a context, a question, and an answer text.
The goal is to find the answer text for any new question and context provided. When a question is answerable, its answer is always a contiguous span of the context; unanswerable questions are flagged with `is_impossible`. Conceptually, the task has two steps:
- Finding the sentence that contains the right answer
- Once the sentence is found, extracting the exact answer from it
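
To make the span formulation concrete, here is a minimal sketch of how a single question record locates its answer in the context. The record is hand-written for illustration (the id and the sentence are hypothetical); only the field names follow the SQuAD 2.0 JSON schema.

```python
# A hand-written record that mimics the SQuAD 2.0 JSON schema (illustrative only).
context = "Beyonce rose to fame in the late 1990s as lead singer of Destiny's Child."
qa = {
    "id": "example-0",  # hypothetical id
    "question": "When did Beyonce rise to fame?",
    "is_impossible": False,
    "answers": [{"text": "in the late 1990s", "answer_start": 21}],
}

def answer_char_span(qa):
    """Return the (start, end) character span of the first gold answer, or None."""
    if qa["is_impossible"] or not qa["answers"]:
        return None  # no gold span exists for an unanswerable question
    first = qa["answers"][0]
    start = first["answer_start"]
    return start, start + len(first["text"])  # end index is exclusive

span = answer_char_span(qa)
print(span, repr(context[span[0]:span[1]]))
```

The dataset stores only the start offset and the answer text, so the end offset has to be derived as `answer_start + len(text)`.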

In [None]:
# Opening data file
import io
import os
from google.colab import drive
from os import listdir
from os.path import isfile, join
import json
import sys

drive.mount('/content/drive/',force_remount=True)
%cd drive/My\ Drive/AI_4
!pwd

Mounted at /content/drive/
/content/drive/My Drive/AI_4
/content/drive/My Drive/AI_4


Loading the SQuAD 2.0 training and validation sets


In [None]:
train_path = r'train-v2.0.json'
# Context managers close the files as soon as the JSON has been parsed
with open(train_path, 'rb') as train_file:
    raw_train_data = json.load(train_file)

eval_path = r'dev-v2.0.json'
with open(eval_path, 'rb') as eval_file:
    raw_eval_data = json.load(eval_file)

Two tokenizers are available:

- BertWordPieceTokenizer
- BertTokenizer

Both produce the same encoding, but BertWordPieceTokenizer is faster than BertTokenizer because it is implemented in Rust.
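
Both tokenizers split words into WordPiece sub-tokens. The following is a minimal sketch of the greedy longest-match-first idea behind that split, assuming a tiny hypothetical vocabulary; the real tokenizers additionally lowercase, split punctuation, and use the full `bert-base-uncased` vocabulary. The helper name `wordpiece` is an illustration, not part of either library.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"play", "##ing", "##ed", "token", "##izer"}
print(wordpiece("playing", toy_vocab))
print(wordpiece("tokenizer", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```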

In [None]:
!mkdir "bert_base_uncased"

slow_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
slow_tokenizer.save_pretrained("bert_base_uncased/")

fast_tokenizer = BertWordPieceTokenizer("bert_base_uncased/vocab.txt", lowercase=True)
tokenizer = fast_tokenizer

mkdir: cannot create directory ‘bert_base_uncased’: File exists






Two dataframes for a better representation of the SQuAD training and validation sets

In [None]:
squadDf = pd.DataFrame()
val_squadDf = pd.DataFrame()

# Pre-process
---
Preparing the SQuAD 2.0 dataset. I am going to create and use a class with all the functions needed to read the JSON and process it.

Variables that I am going to read and store:

- questions (id and text)
- paragraphs that contain the answers
- start and end token of each answer
- answers as text

After reading the JSON file, I am going to:

1. Find the end position of each answer
2. Build the integer arrays that BERT expects
3. Encode the data
4. Create the attention mask
5. Pad every example to a fixed length

> Because BERT is a pretrained model that expects input data in a specific format, we will need:
1. A **special token, `[SEP]`,** to mark the end of a sentence, or the separation between two sentences
2. A **special token, `[CLS]`,** at the beginning of our text. This token is used for classification tasks, but BERT expects it no matter what your application is. 
3. Tokens that conform with the fixed vocabulary used in BERT
4. The **Token IDs** for the tokens, from BERT's tokenizer
5. **Mask IDs** to indicate which elements in the sequence are tokens and which are padding elements
6. **Segment IDs** used to distinguish different sentences
7. **Positional Embeddings** used to show token position within the sequence



All of this is implemented in one class, ```SQuAD_Dataset```, whose ```preprocess``` method does the work.
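
As a toy illustration of the input format described above, the sketch below packs a context and a question into padded token ids, segment ids, and an attention mask. It follows this notebook's order (context first, then question). The special-token ids are those of the `bert-base-uncased` vocabulary; the content ids and the helper `pack_pair` are hypothetical placeholders.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in the bert-base-uncased vocabulary

def pack_pair(context_ids, question_ids, max_len):
    """Build padded input ids, segment ids and attention mask for one example.

    Layout: [CLS] context [SEP] question [SEP], then padding.
    Returns None when the pair does not fit into max_len.
    """
    first = [CLS_ID] + context_ids + [SEP_ID]
    second = question_ids + [SEP_ID]
    input_ids = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)  # segment 0 vs segment 1
    attention_mask = [1] * len(input_ids)                  # 1 marks a real token
    pad = max_len - len(input_ids)
    if pad < 0:
        return None
    return (input_ids + [PAD_ID] * pad,
            token_type_ids + [0] * pad,
            attention_mask + [0] * pad)

# Placeholder content ids, not real vocabulary lookups
ids, segments, mask = pack_pair([7, 8, 9], [5, 6], max_len=10)
print(ids)       # [101, 7, 8, 9, 102, 5, 6, 102, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

Positional embeddings do not appear here because BERT adds them internally from each token's position in the sequence.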

In [None]:
class SQuAD_Dataset:
    '''
    Constructor: all fields in the json file
    '''
    def __init__(self, question, context, start_char_idx=None, answer_text=None, all_answers=None):
        self.question = question
        self.context = context
        self.start_char_idx = start_char_idx
        self.end_char_idx = -1
        self.answer_text = answer_text
        self.all_answers = all_answers
        self.skip = False
        self.start_token_idx = -1
        self.end_token_idx = -1
        

    '''
    Pre-process: tokenizing and encoding paragraphs and questions, as BERT encoding needs
    '''
    def preprocess(self):

        # Isolating context and questions from the given json
        context = " ".join(str(self.context).split())
        question = " ".join(str(self.question).split())

        # Text encoding
        tokenized_context = tokenizer.encode(context)
        tokenized_question = tokenizer.encode(question)

        # Marking start and end of answers in text
        if self.answer_text is not None:
            answer = " ".join(str(self.answer_text).split())

            # Character index just past the last character of the answer
            end_char_idx = self.start_char_idx + len(answer)
            
            # Skip the example if the answer runs past the end of the context
            if end_char_idx >= len(context):
                self.skip = True
                return
            
            self.end_char_idx = end_char_idx

            # Mark the characters of the context that belong to the answer
            is_char_in_ans = [0] * len(context)
            for idx in range(self.start_char_idx, end_char_idx):
                is_char_in_ans[idx] = 1

            # Collecting the tokens that overlap the answer characters
            ans_token_idx = []
            for idx, (start, end) in enumerate(tokenized_context.offsets):
                if sum(is_char_in_ans[start:end]) > 0:
                    ans_token_idx.append(idx)
            if len(ans_token_idx) == 0:
                self.skip = True
                return
            
            # Initializing class attributes
            self.start_token_idx = ans_token_idx[0]
            self.end_token_idx = ans_token_idx[-1]
        
        #  Converted ids from encoding
        input_ids = tokenized_context.ids + tokenized_question.ids[1:]

        token_type_ids = [0] * len(tokenized_context.ids) + [1] * len(tokenized_question.ids[1:])

        # Creating the mask
        attention_mask = [1] * len(input_ids)

        # Creating the Padding
        padding_length = max_seq_length - len(input_ids)

        if padding_length > 0:
            input_ids = input_ids + ([0] * padding_length)
            attention_mask = attention_mask + ([0] * padding_length)
            token_type_ids = token_type_ids + ([0] * padding_length)
        elif padding_length < 0:
            self.skip = True
            return

        # Initializing class variables 
        self.input_word_ids = input_ids
        self.input_type_ids = token_type_ids
        self.input_mask = attention_mask
        self.context_token_to_char = tokenized_context.offsets

# Splitting and preparing datasets for training and validation
---



In [None]:
'''
create_squad: Builds the SQuAD examples from the raw JSON and returns the preprocessed examples that will be used for training. It also fills a dataframe for visualizing the data.
'''
def create_squad(raw_data, desc, df):

    if len(desc)!=0:
      p_bar = tqdm(total=len(raw_data["data"]), desc=desc,position=0, leave=True,file=sys.stdout, bar_format="{l_bar}%s{bar}%s{r_bar}" % (Fore.BLUE, Fore.RESET))
    
    squad_examples = []

    # Dataframe variables initilization
    if isinstance(df, pd.DataFrame):
      titles = []
      ids = []
      contents = []
      answer_start = []
      answer_end = []
      answers = []
      imp = []
      allans = []
      questions = []
      skip = []
    
    # -------- Processing json data ---------
    for item in raw_data["data"]:
      title = item["title"]
      for para in item["paragraphs"]:            
        context = para["context"]
        for qa in para["qas"]:
          question = qa["question"]
          is_impossible = qa["is_impossible"]
          id = qa["id"]

          if ("answers" in qa) or ("plausible_answers" in qa):
            # Reset per question so that values never carry over from the previous one
            answer_text, all_answers, start_char_idx = None, None, None

            if len(qa.get("answers", [])):
              answer_text = qa["answers"][0]["text"]
              all_answers = [_["text"] for _ in qa["answers"]]
              start_char_idx = qa["answers"][0]["answer_start"]

            elif len(qa.get("plausible_answers", [])):
              # Unanswerable questions only provide plausible answers
              answer_text = qa["plausible_answers"][0]["text"]
              all_answers = [_["text"] for _ in qa["plausible_answers"]]
              start_char_idx = qa["plausible_answers"][0]["answer_start"]

            # Creating set
            squad = SQuAD_Dataset(question, context, start_char_idx, answer_text, all_answers)  

            if isinstance(df, pd.DataFrame):
              # Initializing variables for the dataframe
              questions.append(question)
              answers.append(answer_text)
              answer_start.append(start_char_idx)
              titles.append(title)
              imp.append(is_impossible)
              ids.append(id)
              contents.append(context)
              allans.append(all_answers)

          else:
            squad = SQuAD_Dataset(question, context)
          

          squad.preprocess()
          if isinstance(df, pd.DataFrame):
            answer_end.append(squad.end_char_idx)
            skip.append(squad.skip)
          
          squad_examples.append(squad)

      if len(desc)!=0:
        p_bar.update(1)

    if len(desc)!=0:   
      p_bar.close()
    
    if isinstance(df, pd.DataFrame):
      df['question_id'] = ids
      df['title'] = titles
      df['question'] = questions
      df['context'] = contents
      df['answer'] = answers
      df['all_answers'] = allans
      df['answer_start'] = answer_start
      df['answer_end'] = answer_end
      df['is_impossible'] = imp
      df['skip'] = skip


    return squad_examples

'''
Xy_split: Splitting the dataset to the previous masked encodings as the X and the starting and ending token as the true label y. 
          Returning a dictionary of [input_word_ids,input_mask,input_type_ids],[start_token_idx,end_token_idx] as BERT needs.
'''
def Xy_split(squad_examples):
    dataset_dict = {
        "input_word_ids": [],
        "input_type_ids": [],
        "input_mask": [],
        "start_token_idx": [],
        "end_token_idx": [],
    }

    for item in squad_examples:
        # Leave out the examples that were flagged as skipped during preprocessing
        if item.skip is False:
            for key in dataset_dict:
                dataset_dict[key].append(getattr(item, key))

    # dataset_dict is a dictionary that stores every format of the data BERT needs (word ids,types and masked data)
    for key in dataset_dict:
        dataset_dict[key] = np.array(dataset_dict[key])
    
    x = [dataset_dict["input_word_ids"], dataset_dict["input_mask"], dataset_dict["input_type_ids"]]
    y = [dataset_dict["start_token_idx"], dataset_dict["end_token_idx"]]

    return x, y


'''
Text normalization: lowercase, no punctuation, no articles, single spaces. This function is used
                    for comparing the predicted answer with the true answers in order to compute accuracy and hence evaluate the model
'''
def normalize_text(text):
    text = text.lower() # no capitals
    text = "".join(ch for ch in text if ch not in set(string.punctuation)) # no punctuation
    regex = re.compile(r"\b(a|an|the)\b", re.UNICODE) # no a, an, the in sentences
    text = re.sub(regex, " ", text) 
    text = " ".join(text.split())

    return text
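
The normalization above feeds an exact-match comparison between the predicted answer and the gold answers. A minimal self-contained sketch of that comparison follows; `normalize` repeats the same steps under a different name so the block runs on its own, and `exact_match` is a hypothetical helper, not a function defined elsewhere in this notebook.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    return normalize(prediction) in {normalize(g) for g in gold_answers}

print(exact_match("The late 1990s.", ["late 1990s", "in the late 1990s"]))  # True
print(exact_match("1990", ["late 1990s"]))                                  # False
```

Exact match is strict: a prediction that only overlaps a gold answer counts as wrong.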

Forming tensors from the initial set


In [None]:
max_seq_length = 256  # bigger seq lengths crashed CUDA

train_squad_examples = create_squad(raw_train_data, "Creating training data   ",squadDf)
X_train, y_train = Xy_split(train_squad_examples)

eval_squad_examples = create_squad(raw_eval_data,   "Creating validation data ",val_squadDf)
X_eval, y_eval = Xy_split(eval_squad_examples)

Creating training data   : 100%|██████████| 442/442 [01:26<00:00,  5.12it/s]
Creating validation data : 100%|██████████| 35/35 [00:07<00:00,  4.49it/s]


# Visualizing data
---

In [None]:
squadDf.head(10)

Unnamed: 0,question_id,title,question,context,answer,all_answers,answer_start,answer_end,is_impossible,skip
0,56be85543aeaaa14008c9063,Beyoncé,When did Beyonce start becoming popular?,"The term ""matter"" is used throughout physics i...",in the late 1990s,in the late 1990s,269,286,False,False
1,56be85543aeaaa14008c9065,Beyoncé,What areas did Beyonce compete in when she was...,"The term ""matter"" is used throughout physics i...",singing and dancing,singing and dancing,207,226,False,False
2,56be85543aeaaa14008c9066,Beyoncé,When did Beyonce leave Destiny's Child and bec...,"The term ""matter"" is used throughout physics i...",2003,2003,526,530,False,False
3,56bf6b0f3aeaaa14008c9601,Beyoncé,In what city and state did Beyonce grow up?,"The term ""matter"" is used throughout physics i...","Houston, Texas","Houston, Texas",166,180,False,False
4,56bf6b0f3aeaaa14008c9602,Beyoncé,In which decade did Beyonce become famous?,"The term ""matter"" is used throughout physics i...",late 1990s,late 1990s,276,286,False,False
5,56bf6b0f3aeaaa14008c9603,Beyoncé,In what R&B group was she the lead singer?,"The term ""matter"" is used throughout physics i...",Destiny's Child,Destiny's Child,320,335,False,False
6,56bf6b0f3aeaaa14008c9604,Beyoncé,What album made her a worldwide known artist?,"The term ""matter"" is used throughout physics i...",Dangerously in Love,Dangerously in Love,505,524,False,False
7,56bf6b0f3aeaaa14008c9605,Beyoncé,Who managed the Destiny's Child group?,"The term ""matter"" is used throughout physics i...",Mathew Knowles,Mathew Knowles,360,374,False,False
8,56d43c5f2ccc5a1400d830a9,Beyoncé,When did Beyoncé rise to fame?,"The term ""matter"" is used throughout physics i...",late 1990s,late 1990s,276,286,False,False
9,56d43c5f2ccc5a1400d830aa,Beyoncé,What role did Beyoncé have in Destiny's Child?,"The term ""matter"" is used throughout physics i...",lead singer,lead singer,290,301,False,False


In [None]:
val_squadDf.head(10)

Unnamed: 0,question_id,title,question,context,answer,all_answers,answer_start,answer_end,is_impossible,skip
0,56ddde6b9a695914005b9628,Normans,In what country is Normandy located?,"The pound-force has a metric counterpart, less...",France,France,159,165,False,False
1,56ddde6b9a695914005b9629,Normans,When were the Normans in Normandy?,"The pound-force has a metric counterpart, less...",10th and 11th centuries,10th and 11th centuries,94,117,False,False
2,56ddde6b9a695914005b962a,Normans,From which countries did the Norse originate?,"The pound-force has a metric counterpart, less...","Denmark, Iceland and Norway","Denmark, Iceland and Norway",256,283,False,False
3,56ddde6b9a695914005b962b,Normans,Who was the Norse leader?,"The pound-force has a metric counterpart, less...",Rollo,Rollo,308,313,False,False
4,56ddde6b9a695914005b962c,Normans,What century did the Normans first gain their ...,"The pound-force has a metric counterpart, less...",10th century,10th century,671,683,False,False
5,5ad39d53604f3c001a3fe8d1,Normans,Who gave their name to Normandy in the 1000's ...,"The pound-force has a metric counterpart, less...",10th centuryNormans,10th centuryNormans,675,694,True,False
6,5ad39d53604f3c001a3fe8d2,Normans,What is France a region of?,"The pound-force has a metric counterpart, less...",10th centuryNormansNormandy,10th centuryNormansNormandy,812,-1,True,True
7,5ad39d53604f3c001a3fe8d3,Normans,Who did King Charles III swear fealty to?,"The pound-force has a metric counterpart, less...",10th centuryNormansNormandyRollo,10th centuryNormansNormandyRollo,1120,-1,True,True
8,5ad39d53604f3c001a3fe8d4,Normans,When did the Frankish identity emerge?,"The pound-force has a metric counterpart, less...",10th centuryNormansNormandyRollo10th century,10th centuryNormansNormandyRollo10th century,1791,-1,True,True
9,56dddf4066d3e219004dad5f,Normans,Who was the duke in the battle of Hastings?,"The pound-force has a metric counterpart, less...",William the Conqueror,William the Conqueror,1022,1043,False,True


# Transforming dataset to tensors and creating shuffled batches
---



In [None]:
batch_size = 16

# Training set transformation
train_data = TensorDataset(torch.tensor(X_train[0], dtype=torch.int64),
                           torch.tensor(X_train[1], dtype=torch.float),
                           torch.tensor(X_train[2], dtype=torch.int64),
                           torch.tensor(y_train[0], dtype=torch.int64),
                           torch.tensor(y_train[1], dtype=torch.int64))
print(f"{len(train_data)} training points created.")
train_sampler = RandomSampler(train_data) # Randomizing data
train_data_loader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size,drop_last=True)

# Validation set transformation
eval_data = TensorDataset(torch.tensor(X_eval[0], dtype=torch.int64),
                          torch.tensor(X_eval[1], dtype=torch.float),
                          torch.tensor(X_eval[2], dtype=torch.int64),
                          torch.tensor(y_eval[0], dtype=torch.int64),
                          torch.tensor(y_eval[1], dtype=torch.int64))
print(f"{len(eval_data)} evaluation points created.")
eval_sampler = SequentialSampler(eval_data)
validation_data_loader = DataLoader(eval_data, sampler=eval_sampler, batch_size=batch_size,drop_last=True)

85014 training points created.
6164 evaluation points created.
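
With `drop_last=True`, the final incomplete batch of each pass is discarded. The plain-Python sketch below only illustrates that bookkeeping and is not how PyTorch implements it; `make_batches` and its parameters are hypothetical.

```python
import random

def make_batches(n_examples, batch_size, shuffle=True, drop_last=True, seed=0):
    """Return lists of example indices, one list per batch."""
    indices = list(range(n_examples))
    if shuffle:
        random.Random(seed).shuffle(indices)  # random order, as a random sampler would give
    batches = [indices[i:i + batch_size] for i in range(0, n_examples, batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the final incomplete batch
    return batches

batches = make_batches(85014, 16)
print(len(batches), sum(len(b) for b in batches))  # 5313 85008
```

For the 85014 training points and a batch size of 16, this gives 5313 full batches covering 85008 examples, so 6 examples are left out in every epoch.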


## Selecting the device again

In [None]:
if train_on_gpu:
  gpu = torch.device('cuda')
  print("CUDA")
else:
  gpu = torch.device('cpu')
  print("CPU")

CUDA


# Initializing the BERT model and tuning the hyperparameters
---


In [None]:
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

if train_on_gpu:
  model = model.cuda()





Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased a

## Initializing optimizer - Adam


In [None]:
# Optimizer parameters
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.weight']  # biases and LayerNorm weights are not decayed
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]
optimizer = torch.optim.Adam(lr=1e-5, betas=(0.9, 0.98), eps=1e-9, params=optimizer_grouped_parameters)
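
The grouping above is a substring match on parameter names. The sketch below shows which names land in which group, assuming the Hugging Face naming scheme for BERT parameters; the helper `split_for_weight_decay` is hypothetical and the parameter values are placeholders.

```python
def split_for_weight_decay(named_params, no_decay):
    """Split (name, param) pairs into decayed and non-decayed names by substring match."""
    decay, skip = [], []
    for name, _param in named_params:
        if any(nd in name for nd in no_decay):
            skip.append(name)
        else:
            decay.append(name)
    return decay, skip

# Names follow the Hugging Face BERT naming scheme; the values are placeholders.
named_params = [
    ("bert.encoder.layer.0.attention.self.query.weight", None),
    ("bert.encoder.layer.0.attention.self.query.bias", None),
    ("bert.encoder.layer.0.attention.output.LayerNorm.weight", None),
    ("bert.encoder.layer.0.attention.output.LayerNorm.bias", None),
    ("qa_outputs.weight", None),
    ("qa_outputs.bias", None),
]
decay, skip = split_for_weight_decay(named_params, no_decay=["bias", "LayerNorm.weight"])
print(decay)  # the two dense weight matrices
print(skip)   # biases and LayerNorm parameters
```

Printing the two lists for the real model is a quick way to check that the substrings actually match the parameter names in use.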

## Selecting number of epochs

In [None]:
epochs = 2

# Training and Validation
---

In [None]:
for epoch in range(1, epochs + 1):

    # ----------------------- TRAINING -----------------------
    
    training_pbar = tqdm(total=len(train_data), position=0, leave=True, file=sys.stdout, bar_format="{l_bar}%s{bar}%s{r_bar}" % (Fore.GREEN, Fore.RESET))
    
    model.train()

    tr_loss = 0
    training_steps = 0

    # Loop for every batch
    for step, batch in enumerate(train_data_loader):

      # Every batch consists of the tuple of tensors built earlier.
      # Each batch is moved to the GPU separately to limit memory consumption, so the model does not crash for lack of memory
      batch = tuple(t.to(gpu) for t in batch)

      # Isolating data from the tuple
      input_word_ids, input_mask, input_type_ids, start_token_idx, end_token_idx = batch
      
      # Reset the gradients accumulated in the previous step
      optimizer.zero_grad()

      # model train
      loss, _, _ = model(input_ids=input_word_ids, attention_mask=input_mask, token_type_ids=input_type_ids, start_positions=start_token_idx,end_positions=end_token_idx, return_dict=False)
      loss.backward()
      optimizer.step()

      # Summing loss
      tr_loss += loss.item()
      training_steps += 1

      # progress bar
      training_pbar.update(input_word_ids.size(0))
    
    training_pbar.close()

    # Average training loss
    training_loss = tr_loss / training_steps
    
    
    # Saving model (in each epoch) for future runs
    torch.save(model.state_dict(), "./weights_" + str(epoch) + ".pth")

    #  ----------------------- VALIDATION -----------------------
    
    validation_pbar = tqdm(total=len(eval_data),
                           position=0, leave=True,
                           file=sys.stdout, bar_format="{l_bar}%s{bar}%s{r_bar}" % (Fore.BLUE, Fore.RESET))
    
    model.eval()

    # Filter data
    eval_examples_no_skip = [_ for _ in eval_squad_examples if _.skip is False]
    currentIdx = 0
    count = 0
    val_loss = 0
    
    # Loop for every validation batch
    for batch in validation_data_loader:

      batch = tuple(t.to(gpu) for t in batch)
      input_word_ids, input_mask, input_type_ids, start_token_idx, end_token_idx = batch

      with torch.no_grad():

          # predicted answer
          start_logits, end_logits = model(input_ids=input_word_ids, attention_mask=input_mask, token_type_ids=input_type_ids, return_dict=False)

          # prediction answer and end, detached (for memory usage and for summing them for acc)
          pred_start, pred_end = start_logits.detach().cpu().numpy(), end_logits.detach().cpu().numpy()

      # Evaluation based on Accuracy
      for idx, (start, end) in enumerate(zip(pred_start, pred_end)):
        
        squad_eg = eval_examples_no_skip[currentIdx]
        currentIdx += 1
        offsets = squad_eg.context_token_to_char

        # Answer start and end
        start = np.argmax(start)
        end = np.argmax(end)
        if start >= len(offsets):
            continue

        # Checking offsets
        pred_char_start = offsets[start][0]
        if end < len(offsets):
            pred_char_end = offsets[end][1]
            pred_ans = squad_eg.context[pred_char_start:pred_char_end]
        else:
            pred_ans = squad_eg.context[pred_char_start:]

        # Normalizing text for comparison of predicted answer and true answer
        normalized_pred_ans = normalize_text(pred_ans)

        if squad_eg.all_answers is not None:
          # all true answers to a list 
          normalized_true_ans = [normalize_text(x) for x in squad_eg.all_answers ]

          # If predicted answer in true answers add +1 in Accuracy
          if normalized_pred_ans in normalized_true_ans:
              count += 1

      # Advance the progress bar once per batch, not once per example
      validation_pbar.update(input_word_ids.size(0))

    # Final accuracy and loss
    acc = count / len(y_eval[0])*100
    validation_pbar.close()
    print("Epoch ", str(epoch)," | ",f"Avg training loss: {training_loss:.4f}"," | ",f"Validation accuracy: {acc:.2f} %\n")

100%|█████████▉| 85008/85014 [1:57:54<00:00, 12.02it/s]
|          | 98544/? [03:00<00:00, 546.68it/s]
Epoch  1  |  Avg training loss: 1.9296  |  Validation accuracy: 72.29 %

100%|█████████▉| 85008/85014 [1:57:50<00:00, 12.02it/s]
|          | 98560/? [03:00<00:00, 546.98it/s]
Epoch  2  |  Avg training loss: 1.3891  |  Validation accuracy: 74.12 %
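
The validation step turns the start and end logits back into text through the token-to-character offsets. The self-contained sketch below mirrors that decoding on hand-made numbers; the logits, offsets and the helper `decode_span` are hypothetical, and plain lists stand in for NumPy arrays.

```python
def decode_span(start_logits, end_logits, offsets, context):
    """Pick the most likely start/end tokens and return the matching context text."""
    start = max(range(len(start_logits)), key=start_logits.__getitem__)  # argmax
    end = max(range(len(end_logits)), key=end_logits.__getitem__)        # argmax
    if start >= len(offsets):
        return None  # predicted start lies outside the context tokens
    char_start = offsets[start][0]
    char_end = offsets[end][1] if end < len(offsets) else len(context)
    return context[char_start:char_end]

context = "Apollo ran from 1961 to 1972"
# (start, end) character offsets per token; (0, 0) marks [CLS] and [SEP]
offsets = [(0, 0), (0, 6), (7, 10), (11, 15), (16, 20), (21, 23), (24, 28), (0, 0)]
start_logits = [0.1, 0.2, 0.1, 0.3, 4.0, 0.2, 0.5, 0.0]
end_logits = [0.1, 0.1, 0.2, 0.1, 0.6, 0.3, 3.5, 0.0]
print(decode_span(start_logits, end_logits, offsets, context))  # 1961 to 1972
```

Start and end are chosen independently, so an end token that precedes the start token produces an empty string.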



# Testing
---

## Evaluation function for test data

In [None]:
'''
evaluation: Same method as validation, but for testing purposes
'''
def evaluation(data):
  model.eval()

  # Keep only the examples that survived preprocessing, so that indices match x_test
  test_samples = [s for s in create_squad(data, "Creating test points", None) if not s.skip]
  x_test, _ = Xy_split(test_samples)

  # No gradients are needed at inference time
  with torch.no_grad():
    pred_start, pred_end = model(torch.tensor(x_test[0], dtype=torch.int64, device=gpu),
                                torch.tensor(x_test[1], dtype=torch.float, device=gpu),
                                torch.tensor(x_test[2], dtype=torch.int64, device=gpu), return_dict=False)

  pred_start, pred_end = pred_start.detach().cpu().numpy(), pred_end.detach().cpu().numpy()

  print("\nQuestions and answers:\n")
  for idx, (start, end) in enumerate(zip(pred_start, pred_end)):
      test_sample = test_samples[idx]
      offsets = test_sample.context_token_to_char
      # Most likely start and end tokens
      start = np.argmax(start)
      end = np.argmax(end)
      pred_ans = None
      if start >= len(offsets):
          continue
      # Map the token span back to characters of the context
      pred_char_start = offsets[start][0]
      if end < len(offsets):
          pred_ans = test_sample.context[pred_char_start:offsets[end][1]]
      else:
          pred_ans = test_sample.context[pred_char_start:]

      print("Question: " + test_sample.question)
      print("Answer:   " + pred_ans,"\n")


## Questions and paragraph about the Apollo program

In [None]:
'''
Information and questions about the Apollo program, taken from Wikipedia
'''
apollo_data = {"data":
    [
        {"title": "Project Apollo",
         "paragraphs": [
             {
                 "context": "The Apollo program, also known as Project Apollo, was the third United States human "
                            "spaceflight program carried out by the National Aeronautics and Space Administration ("
                            "NASA), which accomplished landing the first humans on the Moon from 1969 to 1972. First "
                            "conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to "
                            "follow the one-man Project Mercury which put the first Americans in space, Apollo was "
                            "later dedicated to President John F. Kennedy's national goal of landing a man on the "
                            "Moon and returning him safely to the Earth by the end of the 1960s, which he proposed in "
                            "a May 25, 1961, address to Congress. Project Mercury was followed by the two-man Project "
                            "Gemini. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972, "
                            "and was supported by the two man Gemini program which ran concurrently with it from 1962 "
                            "to 1966. Gemini missions developed some of the space travel techniques that were "
                            "necessary for the success of the Apollo missions. Apollo used Saturn family rockets as "
                            "in 1973-74, and the Apollo-Soyuz Test Project, a joint Earth orbit mission with the "
                            "Soviet Union in 1975.",
                 "qas": [
                     {"question": "What project put the first Americans into space?",
                      "id": "Q1",
                      "is_impossible": "False"
                      },
                     {"question": "What program was created to carry out these projects and missions?",
                      "id": "Q2",
                      "is_impossible": "False"
                      },
                     {"question": "What year did the first manned Apollo flight occur?",
                      "id": "Q3",
                      "is_impossible": "False"
                      },
                     {"question": "What President is credited with the original notion of putting Americans in space?",
                      "id": "Q4",
                      "is_impossible": "False"
                      },
                     {"question": "Who did the U.S. collaborate with on an Earth orbit mission in 1975?",
                      "id": "Q5",
                      "is_impossible": "False"
                      },
                     {"question": "How long did Project Apollo run?",
                      "id": "Q6",
                      "is_impossible": "False"
                      },
                     {"question": "What program helped develop space travel techniques that Project Apollo used?",
                      "id": "Q7",
                      "is_impossible": "False"
                      },
                     {"question": "What space station supported three manned missions in 1973-1974?",
                      "id": "Q8",
                      "is_impossible": "False"
                      }
                 ]
              }
            ]
         }
      ]
    }

In [None]:
evaluation(apollo_data)

Creating test points: 100%|██████████| 1/1 [00:00<00:00, 69.95it/s]

Questions and answers:

Question: What project put the first Americans into space?
Answer:   Project Mercury 

Question: What program was created to carry out these projects and missions?
Answer:    

Question: What year did the first manned Apollo flight occur?
Answer:   1968 

Question: What President is credited with the original notion of putting Americans in space?
Answer:   John F. Kennedy 

Question: Who did the U.S. collaborate with on an Earth orbit mission in 1975?
Answer:   Soviet Union 

Question: How long did Project Apollo run?
Answer:   1961 to 1972 

Question: What program helped develop space travel techniques that Project Apollo used?
Answer:   Gemini 

Question: What space station supported three manned missions in 1973-1974?
Answer:   Saturn family rockets 



## Questions and paragraphs about Beyonce

In [None]:
'''
Information and questions for Beyonce taken from Wikipedia
'''
beyonce_data = {"data":
    [
        {"title": "Beyonce",
         "paragraphs": [
             {
                  "context": "Beyoncé Giselle Knowles-Carter is an American singer, songwriter, actress, and record producer. "
                            "Born and raised in Houston, Texas, Beyoncé performed in various singing and dancing competitions as a child. "
                            "During Destiny's Child's hiatus, Beyoncé made her theatrical film debut with "
                            "a role in the US box-office number-one Austin Powers in Goldmember (2002) and began her solo music career. "
                            "including the number-one singles 'Crazy in Love' featuring rapper Jay-Z and 'Baby Boy' featuring singer-rapper Sean Paul. "
                            "Following the disbandment of Destiny's Child in 2006, "
                            "she released her second solo album, B'Day, which contained her first "
                            "US number-one solo single 'Irreplaceable', and 'Beautiful Liar', which topped the charts in most countries. "
                            "Her marriage to Jay-Z and her portrayal of Etta James in Cadillac Records (2008) influenced her third album, "
                            "I Am... Sasha Fierce (2008), which earned a record-setting six Grammy Awards in 2010.",
                 "qas": [
                     {"question": "Where was Beyoncé born?",
                      "id": "Q1",
                      "is_impossible": "False"
                      },
                     {"question": "When Beyonce rose to fame?",
                      "id": "Q2",
                      "is_impossible": "False"
                      },
                     {"question": "What year did Beyonce began her solo music career?",
                      "id": "Q3",
                      "is_impossible": "False"
                      },
                     {"question": "Whats the name of the album that published in 2003?",
                      "id": "Q4",
                      "is_impossible": "False"
                      },
                     {"question": "For which song did Beyonce won the Grammy Award?",
                      "id": "Q5",
                      "is_impossible": "False"
                      },
                     {"question": "In which movies did she act?",
                      "id": "Q6",
                      "is_impossible": "False"
                      },
                     {"question": "Who is Beyonces husband?",
                      "id": "Q7",
                      "is_impossible": "False"
                      },
                     {"question": "Which was her first US number-one solo single?",
                      "id": "Q8",
                      "is_impossible": "False"
                      },
                     {"question": "Was Beyonce a better magician than Harry Potter?",
                      "id": "Q9",
                      "is_impossible": "False"
                      }
                 ]
              }
            ]
         }
     ]
  }

In [None]:
evaluation(beyonce_data)

Creating test points: 100%|██████████| 1/1 [00:00<00:00, 81.65it/s]

Questions and answers:

Question: Where was Beyoncé born?
Answer:   Houston, Texas 

Question: When Beyonce rose to fame?
Answer:   2002 

Question: What year did Beyonce began her solo music career?
Answer:   2002 

Question: Whats the name of the album that published in 2003?
Answer:   B'Day 

Question: For which song did Beyonce won the Grammy Award?
Answer:   Sasha Fierce 

Question: In which movies did she act?
Answer:   Austin Powers in Goldmember 

Question: Who is Beyonces husband?
Answer:   Jay-Z 

Question: Which was her first US number-one solo single?
Answer:   Irreplaceable 

Question: Was Beyonce a better magician than Harry Potter?
Answer:   singer, songwriter, actress, and record producer 



In [None]:
#@title Enter your question about __Beyonce__ { run: "auto", vertical-output: true, display-mode: "form" }
Question = "Where Beyonce was born?" #@param {type:"string"}

def qa(q):
  model.eval()
  data = {"data":
    [
        {"title": "",
         "paragraphs": [
             {
                  "context": "Beyoncé Giselle Knowles-Carter is an American singer, songwriter, actress, and record producer. "
                            "Born and raised in Houston, Texas, Beyoncé performed in various singing and dancing competitions as a child. "
                            "During Destiny's Child's hiatus, Beyoncé made her theatrical film debut with "
                            "a role in the US box-office number-one Austin Powers in Goldmember (2002) and began her solo music career. "
                            "including the number-one singles 'Crazy in Love' featuring rapper Jay-Z and 'Baby Boy' featuring singer-rapper Sean Paul. "
                            "Following the disbandment of Destiny's Child in 2006, "
                            "she released her second solo album, B'Day, which contained her first "
                            "US number-one solo single 'Irreplaceable', and 'Beautiful Liar', which topped the charts in most countries. "
                            "Her marriage to Jay-Z and her portrayal of Etta James in Cadillac Records (2008) influenced her third album, "
                            "I Am... Sasha Fierce (2008), which earned a record-setting six Grammy Awards in 2010.",
                   "qas": [
                     {"question": q,
                      "id": "Q1",
                      "is_impossible": "False"
                      },
              ]
            }
          ]
        }
    ]
  }
  # Convert the raw dictionary into preprocessed samples, then into model inputs
  test_samples = create_squad(data, "", None)
  x_test, _ = Xy_split(test_samples)

  # Forward pass without gradient tracking; yields start and end scores per token
  with torch.no_grad():
    pred_start, pred_end = model(torch.tensor(x_test[0], dtype=torch.int64, device=gpu),
                                 torch.tensor(x_test[1], dtype=torch.float, device=gpu),
                                 torch.tensor(x_test[2], dtype=torch.int64, device=gpu), return_dict=False)

  pred_start, pred_end = pred_start.detach().cpu().numpy(), pred_end.detach().cpu().numpy()

  for idx, (start, end) in enumerate(zip(pred_start, pred_end)):
      test_sample = test_samples[idx]
      offsets = test_sample.context_token_to_char
      # Most likely start and end token positions
      start = np.argmax(start)
      end = np.argmax(end)
      pred_ans = None
      # Skip predictions whose start falls outside the context tokens
      if start >= len(offsets):
          continue
      pred_char_start = offsets[start][0]
      # Map token positions back to character offsets in the context
      if end < len(offsets):
          pred_ans = test_sample.context[pred_char_start:offsets[end][1]]
      else:
          pred_ans = test_sample.context[pred_char_start:]

      print(Fore.RED + "\nAnswer: " + Fore.GREEN + pred_ans, "\n")
qa(Question)

Answer: Houston, Texas
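
The `qa` function above takes the arg-max of the start and end scores independently, so the predicted end can fall before the predicted start, which produces an empty answer. A common alternative, sketched below in plain Python, scores every pair with start <= end and keeps the best one. This is not what the notebook does; the name `best_span` and the `max_len` cap are illustrative assumptions.

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return (start, end) maximizing start_scores[i] + end_scores[j]
    subject to i <= j < i + max_len."""
    best = (0, 0)
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        # Only consider ends at or after the start, within the length cap
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


# Independent arg-max would pick start=3 and end=1 here (an invalid span)
start_scores = [0.1, 0.2, 0.3, 2.0]
end_scores = [0.1, 1.5, 0.2, 1.0]
print(best_span(start_scores, end_scores))
```

The double loop is quadratic in the sequence length, which is acceptable for a single question but may need vectorizing for a full validation set.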



# Remarks
---

## __Summary__

In this notebook I have implemented:
- A parser for the SQuAD 2.0 dataset
- A transformation of the data into tensors
- A training procedure for fine-tuning BERT on SQuAD
- An evaluation on the validation set
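
As a reminder of the structure a parser has to walk, here is a minimal sketch that flattens a SQuAD-style dictionary (like `apollo_data` or `beyonce_data` above) into (context, question, id) triples. It is an illustration only: `flatten_squad` is a hypothetical helper, not the `create_squad` function used in this notebook, and it ignores the answer fields.

```python
def flatten_squad(squad_dict):
    """Yield (context, question, question_id) for every question in the dict."""
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                yield context, qa["question"], qa["id"]


# Tiny made-up example in the same layout as the dictionaries above
toy_data = {"data": [{"title": "Toy",
                      "paragraphs": [{"context": "Project Mercury put the first Americans in space.",
                                      "qas": [{"question": "What project put the first Americans into space?",
                                               "id": "Q1"}]}]}]}

for context, question, qid in flatten_squad(toy_data):
    print(qid, "|", question)
```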

## __Fine-tuning__

- Tested the model with different numbers of training epochs (1, 2, 3, 4)
- Batch sizes (larger batches caused repeated memory allocation problems, so I am using 16, which is undoubtedly small)
- Optimizer (Adam), tuning:
  - ```lr``` (float, optional) – learning rate (default: 1e-3)
  - ```betas``` (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
  - ```eps``` (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

I settled on the parameters used in this notebook because they gave the best results among those I tried. The model gave good answers: it answered most of my questions correctly for both Apollo and Beyonce.
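
To make the role of `lr`, `betas` and `eps` concrete, the sketch below performs Adam updates on a single scalar parameter in plain Python. It follows the standard Adam update rule; `adam_step` and the example values are illustrative and are not the settings used for training in this notebook.

```python
import math


def adam_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One Adam update for a scalar parameter. t is the 1-based step count."""
    beta1, beta2 = betas
    m = beta1 * m + (1 - beta1) * grad          # running average of the gradient
    v = beta2 * v + (1 - beta2) * grad * grad   # running average of its square
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)  # eps avoids division by zero
    return param, m, v


# Minimize f(x) = x^2 (gradient 2x) starting from x = 1.0
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    x, m, v = adam_step(x, 2 * x, m, v, t)
    print(t, x)
```

On the first step the update is close to `lr` in magnitude regardless of the gradient scale, which is why the learning rate has such a direct effect during fine-tuning.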

## __Accuracy__
The number of epochs made a real difference in validation accuracy. The model reaches approximately 74% accuracy, which is not bad considering the official SQuAD results.

*I did not understand how the F1 score should be measured: it combines recall and precision, and I did not find a good way to measure them in my implementation.
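
For reference, SQuAD-style F1 is usually computed per question on the tokens of the predicted and gold answer strings: precision is the share of predicted tokens that appear in the gold answer, and recall is the share of gold tokens that were predicted. The sketch below is a simplified version of that idea under stated assumptions: it only lowercases and splits on whitespace (the official evaluation script also strips punctuation and articles), and the function names are illustrative.

```python
from collections import Counter


def tokens(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, truth):
    return int(tokens(prediction) == tokens(truth))


def f1_score(prediction, truth):
    pred, gold = tokens(prediction), tokens(truth)
    # For unanswerable questions both strings are empty: count that as a match
    if not pred or not gold:
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(f1_score("Houston, Texas", "Houston, Texas"))   # identical strings
print(f1_score("the Soviet Union", "Soviet Union"))   # partial overlap
```

Averaging these per-question values over the validation set would give dataset-level exact-match and F1 scores.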

## __Time needed to execute__
At least 2 hours per epoch on Colab's CUDA GPU.




# References
---

[1] [Fine-tuning on SQuAD 1.1](https://github.com/dredwardhyde/bert-examples/blob/main/bert_squad_pytorch.py)