# Part 1: Fine-tune a BERT model for a QA task

## 1.1 Preparation
We start by bringing in the required components. We install the transformers library and import from it a tokenizer class, the DistilBERT model for question answering, and AdamW, an optimizer that implements Adam with decoupled weight decay.

In [None]:
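!pip install transformers

# Import libraries
# fastai's star import pulls in many common names; it comes first so that
# the explicit imports below take precedence.
from fastai.imports import *

import json
import os
import time
from pathlib import Path

import matplotlib.pyplot as plt
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DistilBertForQuestionAnswering, AdamW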


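Next we download our data, the validation (dev) set of SQuAD v1.1, and load a pretrained tokenizer. The tokenizer comes from the bert-base-uncased checkpoint, while the model loaded later is distilbert-base-uncased; as far as we know the two share the same WordPiece vocabulary, so the token ids are compatible.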

In [None]:
%%capture
!mkdir squad
!wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -O squad/dev-v1.1.json

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Next we flatten the data into parallel lists that the rest of the pipeline can consume. Because a question can have several reference answers, each answer yields its own context/question/answer triplet, for a total of 34,726 triplets. We use the first 33,000 as training data and the last 1,726 as validation data. We also print the first triplet to check that it is in the correct form.

In [None]:
# Give the path for train data
path = Path('squad/dev-v1.1.json')

# Open .json file
with open(path, 'rb') as f:
    squad_dict = json.load(f)

texts = []
queries = []
answers = []

# Search for each passage, its question and its answer
for group in squad_dict['data']:
    for passage in group['paragraphs']:
        context = passage['context']
        for qa in passage['qas']:
            question = qa['question']
            for answer in qa['answers']:
                # Store every passage, query and its answer to the lists
                texts.append(context)
                queries.append(question)
                answers.append(answer)

print('Total train set length:', len(texts))
train_texts, train_queries, train_answers = texts[:33000], queries[:33000], answers[:33000]
val_texts,   val_queries,   val_answers   = texts[33000:], queries[33000:], answers[33000:]
print()

print("Passage: ",train_texts[0])  
print("Query: ",train_queries[0])
print("Answer: ",train_answers[0])

# train_texts   = train_texts[:10]
# train_queries = train_queries[:10]
# train_answers = train_answers[:10] 

Total train set length: 34726

Passage:  Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
Query:  Which NFL team represented the AFC at Super Bowl 50?
Answer:  {'answer_start': 177, 'text': 'Denver Broncos'}
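
To make the nested layout concrete, here is a minimal sketch of the same flattening applied to a toy record. The record contents and the helper name `flatten_squad` are invented for illustration; only the key names follow the file loaded above.

```python
# Toy record in the SQuAD v1.1 layout (contents invented for illustration).
toy_squad = {
    "data": [
        {
            "title": "Toy article",
            "paragraphs": [
                {
                    "context": "Paris is the capital of France.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of France?",
                            "answers": [
                                {"answer_start": 0, "text": "Paris"},
                                {"answer_start": 0, "text": "Paris"},
                            ],
                        }
                    ],
                }
            ],
        }
    ]
}


def flatten_squad(squad_dict):
    """Return parallel lists with one entry per reference answer."""
    texts, queries, answers = [], [], []
    for group in squad_dict["data"]:
        for passage in group["paragraphs"]:
            for qa in passage["qas"]:
                for answer in qa["answers"]:
                    texts.append(passage["context"])
                    queries.append(qa["question"])
                    answers.append(answer)
    return texts, queries, answers


toy_texts, toy_queries, toy_answers = flatten_squad(toy_squad)
print(len(toy_texts))  # 2: one question with two reference answers
```

One question with two reference answers produces two triplets, which is why the triplet count is larger than the number of distinct questions.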


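The model is trained on the start and end positions of each answer, but SQuAD records only the start character and the answer text, so here we compute the end position for both the train and validation sets. A minor tweak is needed because the recorded start index is sometimes off by one or two characters: when the text at the recorded position does not match the answer, we try shifting the span back by one or two characters.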

In [None]:
def add_end_idx(answers, texts):
    """Add 'answer_end' to each answer, fixing 'answer_start' when it is off by 1 or 2 characters."""
    for answer, text in zip(answers, texts):
        real_answer = answer['text']
        start_idx = answer['answer_start']
        # End index (exclusive) if the recorded start is correct
        end_idx = start_idx + len(real_answer)

        if text[start_idx:end_idx] == real_answer:
            # The recorded start is correct
            answer['answer_end'] = end_idx
        elif text[start_idx-1:end_idx-1] == real_answer:
            # The recorded start is one character too far to the right
            answer['answer_start'] = start_idx - 1
            answer['answer_end'] = end_idx - 1
        elif text[start_idx-2:end_idx-2] == real_answer:
            # The recorded start is two characters too far to the right
            answer['answer_start'] = start_idx - 2
            answer['answer_end'] = end_idx - 2

# Find start and end positions for the train and validation sets
add_end_idx(train_answers, train_texts)
add_end_idx(val_answers, val_texts)
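
The alignment check can be tried on a toy sentence. This is a small sketch under our own assumptions: the sentence, the offsets, and the helper name `align_answer` are invented, and the helper reports failure explicitly instead of silently skipping an answer.

```python
def align_answer(answer, text):
    """Set answer['answer_end'], shifting the span left by up to 2 characters.

    Returns True if the answer text was found at one of the tried offsets.
    """
    gold = answer["text"]
    for shift in (0, 1, 2):
        start = answer["answer_start"] - shift
        end = start + len(gold)
        if start >= 0 and text[start:end] == gold:
            answer["answer_start"] = start
            answer["answer_end"] = end
            return True
    return False


toy_text = "The Denver Broncos won the game."
exact = {"answer_start": 4, "text": "Denver Broncos"}   # correct start
late = {"answer_start": 5, "text": "Denver Broncos"}    # one character too late
wrong = {"answer_start": 10, "text": "Denver Broncos"}  # cannot be repaired

print(align_answer(exact, toy_text), exact["answer_start"], exact["answer_end"])
print(align_answer(late, toy_text), late["answer_start"], late["answer_end"])
print(align_answer(wrong, toy_text), "answer_end" in wrong)
```

The third case shows that an answer matching none of the three offsets ends up without an `answer_end` key, which would only surface later as a `KeyError`.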

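Now we tokenize our data. Each passage is paired with its question, inputs longer than the tokenizer's maximum length are truncated, and shorter ones are padded.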

In [None]:
train_encodings = tokenizer(train_texts, train_queries, truncation=True, padding=True)
val_encodings   = tokenizer(val_texts,   val_queries,   truncation=True, padding=True)

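The next function converts the character-level start and end positions into token-level positions and adds them to the encodings. If an answer lies in a part of the passage that was truncated, its position is set to the tokenizer's maximum length.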

In [None]:
def add_token_positions(encodings, answers):
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end'] - 1))
        # if None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        if end_positions[-1] is None:
            end_positions[-1] = tokenizer.model_max_length
    encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings,     val_answers)
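
The idea behind the character-to-token lookup can be illustrated without the real tokenizer. The sketch below uses a whitespace splitter as a simplified stand-in (it is not how the library tokenizes), and the names `whitespace_offsets` and `toy_char_to_token` are hypothetical.

```python
import re


def whitespace_offsets(text):
    """Return (start, end) character offsets for each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def toy_char_to_token(offsets, char_idx):
    """Return the index of the token covering char_idx, or None if no token does."""
    for tok_idx, (start, end) in enumerate(offsets):
        if start <= char_idx < end:
            return tok_idx
    return None


toy_text = "The Denver Broncos won the game."
offsets = whitespace_offsets(toy_text)

answer_start, answer_end = 4, 18  # "Denver Broncos", end index is exclusive
start_tok = toy_char_to_token(offsets, answer_start)
end_tok = toy_char_to_token(offsets, answer_end - 1)
print(start_tok, end_tok)

# The exclusive end index points at the space after the answer, so it maps to no token.
print(toy_char_to_token(offsets, answer_end))

# Simulate truncation by keeping only the first two tokens: the end is now lost.
print(toy_char_to_token(offsets[:2], answer_end - 1))
```

This is why the function above looks up `answer_end - 1` rather than `answer_end`, and why a `None` result is treated as a truncated answer.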

Three last things are required to lay the groundwork for fine-tuning our model. First, we create a dataset class that wraps the encodings and returns them as tensors, and instantiate it for the train and validation sets. Second, we define a DataLoader for each set, which feeds the data in shuffled batches of size 8. Lastly, we select the device (CUDA or CPU) and print which one is in use.

In [None]:
# Create train and val dataset classes
class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)

# Define Dataloader
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=True)

# Define device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device is', device)

Device is cuda
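
As a quick sanity check on the batch size, the number of batches per epoch follows from the split sizes. This sketch assumes the loaders keep the last partial batch, which is the DataLoader default; the helper name `num_batches` is ours.

```python
import math


def num_batches(num_examples, batch_size):
    """Batches per epoch when the last, possibly smaller, batch is kept."""
    return math.ceil(num_examples / batch_size)


train_batches = num_batches(33000, 8)
val_batches = num_batches(1726, 8)
print(train_batches, val_batches)  # 4125 216
```

The training figure matches the `Batch ... / 4125` lines in the training log further down. The validation loader has far fewer than 1000 batches, so a progress message printed every 1000 batches never appears during evaluation.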


## 1.2 Fine-tuning the model
We load the model required for this task (DistilBertForQuestionAnswering) from the distilbert-base-uncased checkpoint. We also create an AdamW optimizer with a learning rate of 5e-5 and set the number of training epochs to 4.

In [11]:
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")

optim = AdamW(model.parameters(), lr=5e-5)

epochs = 4 

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this mode

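We are now ready to train the model. Each epoch makes one pass over the training set and then computes the loss on the validation set without updating the weights.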

In [12]:
whole_train_eval_time = time.time()

train_losses = []
val_losses = []
model.to(device)
print_every = 1000

for epoch in range(epochs):
  epoch_time = time.time()

  model.train() # Set model in train mode
    
  loss_of_epoch = 0

  print("TRAIN")

  for batch_idx,batch in enumerate(train_loader): 
    
    optim.zero_grad()

    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    start_positions = batch['start_positions'].to(device)
    end_positions = batch['end_positions'].to(device)
    
    outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
    loss = outputs[0]
    # do a backwards pass 
    loss.backward()
    # update the weights
    optim.step()
    # Find the total loss
    loss_of_epoch += loss.item()

    if (batch_idx+1) % print_every == 0:
      print("Batch {:} / {:}".format(batch_idx+1,len(train_loader)),"\nLoss:", round(loss.item(),1),"\n")

  loss_of_epoch /= len(train_loader)
  train_losses.append(loss_of_epoch)

  # Evaluation

  # Set model in evaluation mode
  model.eval()

  print("EVALUATE")

  loss_of_epoch = 0

  for batch_idx,batch in enumerate(val_loader):
    
    with torch.no_grad():

      input_ids = batch['input_ids'].to(device)
      attention_mask = batch['attention_mask'].to(device)
      start_positions = batch['start_positions'].to(device)
      end_positions = batch['end_positions'].to(device)
      
      outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
      loss = outputs[0]
      # Find the total loss
      loss_of_epoch += loss.item()

    if (batch_idx+1) % print_every == 0:
      print("Batch {:} / {:}".format(batch_idx+1,len(val_loader)),"\nLoss:", round(loss.item(),1),"\n")

  loss_of_epoch /= len(val_loader)
  val_losses.append(loss_of_epoch)

  # Print each epoch's time and train/val loss 
  print("\nEpoch ", epoch+1,
        "\nTraining Loss:", train_losses[-1],
        "\nValidation Loss:", val_losses[-1],
        "\nTime: ",(time.time() - epoch_time),
        "\n\n")

print("Total training and evaluation time: ", (time.time() - whole_train_eval_time))

TRAIN
Batch 1000 / 4125 
Loss: 1.4 

Batch 2000 / 4125 
Loss: 1.4 

Batch 3000 / 4125 
Loss: 2.9 

Batch 4000 / 4125 
Loss: 0.6 

EVALUATE

Epoch  1 
Training Loss: 1.5312047022617223 
Validation Loss: 1.8528673295621518 
Time:  3351.294318675995 


TRAIN
Batch 1000 / 4125 
Loss: 1.3 

Batch 2000 / 4125 
Loss: 0.7 

Batch 3000 / 4125 
Loss: 1.0 

Batch 4000 / 4125 
Loss: 0.6 

EVALUATE

Epoch  2 
Training Loss: 0.8931168509685632 
Validation Loss: 1.8838744224221617 
Time:  3348.6583893299103 


TRAIN
Batch 1000 / 4125 
Loss: 0.2 

Batch 2000 / 4125 
Loss: 0.7 

Batch 3000 / 4125 
Loss: 0.6 

Batch 4000 / 4125 
Loss: 0.7 

EVALUATE

Epoch  3 
Training Loss: 0.7490737383230166 
Validation Loss: 1.952481464655311 
Time:  3347.89563536644 


TRAIN
Batch 1000 / 4125 
Loss: 0.3 

Batch 2000 / 4125 
Loss: 1.4 

Batch 3000 / 4125 
Loss: 1.0 

Batch 4000 / 4125 
Loss: 0.3 

EVALUATE

Epoch  4 
Training Loss: 0.6651255462043213 
Validation Loss: 2.0054682562196695 
Time:  3332.7469170093536 
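
The loss read from `outputs[0]` in the loop above is, as far as we understand the transformers question-answering heads, the average of two cross-entropy terms, one over start positions and one over end positions, with out-of-range targets clamped to the sequence length and then ignored. The pure-Python sketch below illustrates that computation on toy logits. The function names are hypothetical and this is not the library's code.

```python
import math


def mean_cross_entropy(batch_logits, targets, ignore_index):
    """Average cross-entropy over examples whose target is not ignore_index.

    Assumes at least one target in the batch is not ignored.
    """
    losses = []
    for logits, target in zip(batch_logits, targets):
        if target == ignore_index:
            continue
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_z - logits[target])
    return sum(losses) / len(losses)


def qa_loss(start_logits, end_logits, start_positions, end_positions):
    """Average of the start and end cross-entropies for a batch."""
    seq_len = len(start_logits[0])
    # Positions beyond the sequence (e.g. truncated answers) become seq_len
    # and are then skipped through ignore_index.
    starts = [min(max(p, 0), seq_len) for p in start_positions]
    ends = [min(max(p, 0), seq_len) for p in end_positions]
    start_loss = mean_cross_entropy(start_logits, starts, ignore_index=seq_len)
    end_loss = mean_cross_entropy(end_logits, ends, ignore_index=seq_len)
    return (start_loss + end_loss) / 2


# Two toy examples with 4 tokens each and uniform logits.
toy_start_logits = [[0.0] * 4, [0.0] * 4]
toy_end_logits = [[0.0] * 4, [0.0] * 4]
# The second example mimics a truncated answer whose positions were set to 512.
toy_loss = qa_loss(toy_start_logits, toy_end_logits, [1, 512], [2, 512])
print(round(toy_loss, 3))  # 1.386, i.e. log(4) for uniform logits over 4 tokens
```

Under this reading, answers whose positions were set to the tokenizer's maximum length earlier do not contribute to the loss.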




In [None]:
fig,ax = plt.subplots(1,1,figsize=(15,10))

ax.set_title("Train and Validation Losses",size=20)
ax.set_ylabel('Loss', fontsize = 20) 
ax.set_xlabel('Epochs', fontsize = 25) 
_=ax.plot(train_losses)
_=ax.plot(val_losses)
_=ax.legend(('Train','Val'),loc='upper right')
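
The same trend can be read directly from the numbers. This is a minimal sketch using the per-epoch losses from the log above, rounded to four decimals; the variable names are ours.

```python
# Per-epoch losses copied from the training log above, rounded to 4 decimals.
logged_train = [1.5312, 0.8931, 0.7491, 0.6651]
logged_val = [1.8529, 1.8839, 1.9525, 2.0055]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(logged_val)), key=logged_val.__getitem__) + 1

# Gap between validation and training loss at each epoch.
gaps = [round(v - t, 4) for t, v in zip(logged_train, logged_val)]

print(best_epoch)  # 1
print(gaps)        # [0.3217, 0.9908, 1.2034, 1.3404]
```

The validation loss is lowest after the first epoch and the gap widens every epoch, which suggests the model starts overfitting early in this run.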

Lastly, we save the fine-tuned model and the tokenizer so that they can be loaded in the evaluation stage.

In [None]:
# Invoke google drive
from google.colab import drive
drive.mount('/content/drive')

# Save model and tokenizer
torch.save(model,"/content/drive/MyDrive/NLU/model220130/model.bin")
tokenizer.save_pretrained("/content/drive/MyDrive/NLU/model220130")
