# Finetuning a Masked Language BERT Model using Huggingface

Adapted from the work done by Chris McCormick and Nick Ryan

See the original guide as a blog post [here](http://mccormickml.com/2019/07/22/BERT-fine-tuning/) and as a Colab Notebook [here](https://colab.research.google.com/drive/1Y4o3jh3ZH70tl6mCd76vz_IxX23biCPP). 

# Check Data is present

This notebook runs on the output of `process_ontonotes.ipynb`. Be sure to run it in the same directory.

Link your Google Drive account so that models can be saved.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Define a function for saving the model and tokenizer after training.

In [0]:
import os

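# Saving best practices: if you use default names for the model, you can reload it using from_pretrained()
# Note: this function relies on the global `tokenizer` and on `torch`, both of which are set up in later cells.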
def save_model(processed_model, epoch, lr, eps):
  output_dir = './drive/My Drive/playground/model_save/debias/full/lr_{}_eps_{}/epoch_{}/'.format(lr, eps, epoch)

  # Create output directory if needed
  if not os.path.exists(output_dir):
      os.makedirs(output_dir)

  print("Saving model to %s" % output_dir)

  # Save a trained model, configuration and tokenizer using `save_pretrained()`.
  # They can then be reloaded using `from_pretrained()`
  model_to_save = processed_model.module if hasattr(processed_model, 'module') else processed_model  # Take care of distributed/parallel training
  model_to_save.save_pretrained(output_dir)
  tokenizer.save_pretrained(output_dir)

  # Good practice: save your training arguments together with the trained model
  torch.save([epoch, lr, eps], os.path.join(output_dir, 'training_args.bin'))

# 1. Setup

## 1.1. Using Colab GPU for Training


Check GPU is activated

In [0]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In order for torch to use the GPU, we need to identify and specify the GPU as the device. Later, in our training loop, we will load data onto the device. 

In [0]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():

    # Free any cached GPU memory, then tell PyTorch to use the GPU.
    torch.cuda.empty_cache()
    device = torch.device("cuda")


    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


## 1.2. Installing the Hugging Face Library


In [0]:
!pip install transformers



# 2. Ensure Ontonotes data has been loaded

## 2.2. Parse

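We'll use pandas to load the processed Ontonotes training set and look at a few of its properties and data points.

The cell below trains a data-augmented model, so it also loads the flipped data into `df_flipped`. To train on the original data only, comment out the `df_flipped` line and remove `df_flipped` from the `pd.concat` call.

In [0]:
import pandas as pd

df_orig = pd.read_csv('./drive/My Drive/playground/original_data.csv')
# Flipped copy of the data, used for the data-augmented model.
df_flipped = pd.read_csv('./drive/My Drive/playground/flipped_data.csv')
df = pd.concat([df_orig, df_flipped])
# `gender` is 1 when the masked pronoun is exactly he/his/him, and 0 otherwise.
df['gender'] = df['pronouns'].str.contains('^he$|^his$|^him$').astype(int)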
# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))
df.sample(10)

Number of training sentences: 42,634



Unnamed: 0,text,pronouns,gender
6874,[CLS] the confirmation hearings do count /. [S...,he,1
11977,[CLS] it 's not [SEP] [MASK] has [SEP],she,0
5506,[CLS] a bible in hand [SEP] taiwan 's united d...,he,1
16586,[CLS] on that last day many will call me lord ...,his,1
18188,[CLS] [MASK] would say `` E5502 the lord give ...,he,1
18274,[CLS] but E4812 said to E4306 `` the lord has ...,he,1
13225,[CLS] the stronger man will take away the weap...,his,1
23433,[CLS] sorting out the wreckage is expected to ...,he,1
21605,[CLS] prices were up across the board with mos...,he,1
23405,[CLS] offices of the city 's rent board were d...,he,1


Check that the augmented data has an equal number of mentions for each gender (as should be the case).

In [0]:
df.gender.value_counts()

1    21317
0    21317
Name: gender, dtype: int64
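The `^` and `$` anchors in the pattern used for the `gender` column matter because `she` contains `he` as a substring. Below is a minimal sketch with Python's `re` module, which pandas' `str.contains` relies on for regular-expression matching. The list of sample pronouns and the unanchored pattern are only illustrative assumptions.

```python
import re

# Same pattern as in the cell above: match the whole string, not a substring.
anchored = re.compile(r'^he$|^his$|^him$')
# Illustrative unanchored variant, to show what would go wrong.
unanchored = re.compile(r'he|his|him')

sample_pronouns = ['he', 'his', 'him', 'she', 'her', 'hers']

anchored_flags = [int(bool(anchored.search(p))) for p in sample_pronouns]
unanchored_flags = [int(bool(unanchored.search(p))) for p in sample_pronouns]

print(anchored_flags)    # [1, 1, 1, 0, 0, 0]
print(unanchored_flags)  # [1, 1, 1, 1, 1, 1]
```

Without the anchors every feminine pronoun in the sample would also be flagged as `1`, and the balance check above would be meaningless.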



Let's extract the sentences and labels of our training set as numpy ndarrays.

In [0]:
# Get the lists of sentences and their labels.
sentences = df.text.values
labels = df.pronouns.values

# 3. Tokenization & Input Formatting

In this section, we'll transform our dataset into the format that BERT can be trained on.

## 3.1. BERT Tokenizer


To feed our text to BERT, it must be split into tokens, and then these tokens must be mapped to their index in the tokenizer vocabulary.

The tokenization must be performed by the tokenizer included with BERT--the below cell will download this for us. We'll be using the "uncased" version here.
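To make the two steps (split into tokens, then look up each token's index) concrete, here is a simplified sketch of the greedy longest-match-first splitting that WordPiece tokenizers use. The names `wordpiece_tokenize` and `toy_vocab` are hypothetical, the vocabulary is a tiny made-up one, and the real tokenizer additionally handles lower-casing, punctuation and special tokens, which are left out here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shorten it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Tiny made-up vocabulary; the real one is downloaded with the tokenizer.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "the": 2, "nominee": 3, "perform": 4, "##s": 5}

tokens = []
for word in "the nominee performs".split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
token_ids = [toy_vocab[t] for t in tokens]

print(tokens)     # ['the', 'nominee', 'perform', '##s']
print(token_ids)  # [2, 3, 4, 5]
```

The split of `performs` into two pieces is a consequence of the toy vocabulary only; a different vocabulary can keep the same word as a single token.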


In [0]:
from transformers import BertTokenizer

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Loading BERT tokenizer...


Let's apply the tokenizer to one sentence just to see the output.


In [0]:
# Print the original sentence.
print(' Original: ', sentences[0])

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(sentences[0]))

# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0])))

 Original:  [CLS] the confirmation hearings do count /. [SEP] and it 's not over for me /. [SEP] i want to see how this nominee performs /. [SEP] i want to see what [MASK] has to say /. [SEP]
Tokenized:  ['[CLS]', 'the', 'confirmation', 'hearings', 'do', 'count', '/', '.', '[SEP]', 'and', 'it', "'", 's', 'not', 'over', 'for', 'me', '/', '.', '[SEP]', 'i', 'want', 'to', 'see', 'how', 'this', 'nominee', 'performs', '/', '.', '[SEP]', 'i', 'want', 'to', 'see', 'what', '[MASK]', 'has', 'to', 'say', '/', '.', '[SEP]']
Token IDs:  [101, 1996, 13964, 19153, 2079, 4175, 1013, 1012, 102, 1998, 2009, 1005, 1055, 2025, 2058, 2005, 2033, 1013, 1012, 102, 1045, 2215, 2000, 2156, 2129, 2023, 9773, 10438, 1013, 1012, 102, 1045, 2215, 2000, 2156, 2054, 103, 2038, 2000, 2360, 1013, 1012, 102]


We create a label mask that hides every token that is not one of the pronouns we want to predict, so that only the predictions at the `[MASK]` positions contribute to the training loss. Positions we wish to ignore are given the label value `-100`, which the loss function skips.

In [0]:
mask_id = tokenizer.convert_tokens_to_ids('[MASK]')
mask_id

103

In [0]:
%%time
masked_lm_labels = []
for sentence, label in zip(sentences, labels):
  sentence_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentence))
  label_id = tokenizer.convert_tokens_to_ids(label)
  # Keep the pronoun's ID at the [MASK] position and ignore every other position.
  masked_lm_labels.append([label_id if token_id == mask_id else -100 for token_id in sentence_ids])

CPU times: user 31.5 s, sys: 25.3 ms, total: 31.5 s
Wall time: 31.6 s


In [0]:
masked_lm_labels[0]

[-100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100,
 2002,
 -100,
 -100,
 -100,
 -100,
 -100,
 -100]
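The reason the `-100` entries drop out of the loss is that PyTorch's `CrossEntropyLoss` uses `ignore_index=-100` by default. The sketch below re-implements that behaviour in plain Python on a made-up three-word vocabulary. The function name `masked_cross_entropy` and the toy numbers are assumptions for illustration, not part of the library.

```python
import math

IGNORE_INDEX = -100

def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not `ignore_index`.

    logits: one list of per-class scores for each position.
    labels: one integer class label (or `ignore_index`) for each position.
    """
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position does not contribute to the loss
        # Negative log-softmax of the correct class, computed stably.
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_sum_exp - scores[label])
    if not losses:
        return float('nan')  # nothing to average over
    return sum(losses) / len(losses)

# Three positions over a toy 3-word vocabulary; only the middle one is labelled.
toy_logits = [[5.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 9.0]]
toy_labels = [-100, 1, -100]

loss = masked_cross_entropy(toy_logits, toy_labels)
only_middle = masked_cross_entropy([toy_logits[1]], [1])
print(loss == only_middle)  # True: ignored positions have no effect
```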

## 3.2. Required Formatting

`[CLS]` and `[SEP]` were added in preprocessing so we don't need to add these terms to our data.

### Sentence Length & Attention Mask



The sentences in our dataset obviously have varying lengths, so how does BERT handle this?

BERT has two constraints:
1. All sentences must be padded or truncated to a single, fixed length.
2. The maximum sentence length is 512 tokens.

Padding is done with a special `[PAD]` token, which is at index 0 in the BERT vocabulary. The below illustration demonstrates padding out to a "MAX_LEN" of 8 tokens.

<img src="http://www.mccormickml.com/assets/BERT/padding_and_mask.png" width="600">

The "Attention Mask" is simply an array of 1s and 0s indicating which tokens are padding and which aren't.
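As a minimal sketch of padding and masking, the hypothetical helper `pad_and_mask` below pads (or truncates) one list of token IDs at the end and builds the matching attention mask. The IDs are the ones shown earlier for `[CLS]`, `i`, `want`, `[MASK]` and `[SEP]`; the length of 8 mirrors the illustration.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Pad (or truncate) at the end to `max_len` and build the attention mask."""
    truncated = token_ids[:max_len]
    n_pad = max_len - len(truncated)
    padded = truncated + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(truncated) + [0] * n_pad
    return padded, attention_mask

ids, mask = pad_and_mask([101, 1045, 2215, 103, 102], max_len=8)
print(ids)   # [101, 1045, 2215, 103, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```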

### Sentences to IDs

The `tokenizer.encode` function combines multiple steps for us:
1. Split the sentence into tokens.
2. Map the tokens to their IDs.

In [0]:
# Get word count
word_count = [len(s.split()) for s in list(sentences)]
sum(word_count)

1909444

In [0]:
%%time
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []

# For every sentence...
for sent in list(sentences):
    # `encode` will:
    #   (1) Tokenize the sentence.
    #   (2) Map tokens to their IDs.
    # `[CLS]` and `[SEP]` are already present in the text, so we do not ask
    # `encode` to add them again.
    encoded_sent = tokenizer.encode(
                        sent,                      # Sentence to encode.
                        add_special_tokens = False, # Would add '[CLS]' and '[SEP]' if True.

                        # This function also supports truncation and conversion
                        # to pytorch tensors, but we still need to pad the
                        # sequences ourselves, so we don't use these features.
                        #max_length = 128,          # Truncate all sentences.
                        #return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.
    input_ids.append(encoded_sent)

# Print sentence 0, now as a list of IDs.
print('Original: ', sentences[0])
print('Token IDs:', input_ids[0])

Original:  [CLS] the confirmation hearings do count /. [SEP] and it 's not over for me /. [SEP] i want to see how this nominee performs /. [SEP] i want to see what [MASK] has to say /. [SEP]
Token IDs: [101, 1996, 13964, 19153, 2079, 4175, 1013, 1012, 102, 1998, 2009, 1005, 1055, 2025, 2058, 2005, 2033, 1013, 1012, 102, 1045, 2215, 2000, 2156, 2129, 2023, 9773, 10438, 1013, 1012, 102, 1045, 2215, 2000, 2156, 2054, 103, 2038, 2000, 2360, 1013, 1012, 102]
CPU times: user 31.4 s, sys: 49.7 ms, total: 31.4 s
Wall time: 31.4 s


## 3.3. Padding & Truncating

Pad and truncate our sequences so that they all have the same length, `MAX_LEN`.

First, what's the maximum sentence length in our dataset?

In [0]:
print('Max sentence length: ', max([len(sen) for sen in input_ids]))

Max sentence length:  157


Given that, let's choose MAX_LEN = 164 and apply the padding.

In [0]:
# We'll borrow the `pad_sequences` utility function to do this.
from keras.preprocessing.sequence import pad_sequences

# Set the maximum sequence length.
# I've chosen 164 somewhat arbitrarily. It's slightly larger than the
# maximum training sentence length of 157...
MAX_LEN = 164

print('\nPadding/truncating all sentences to %d values...' % MAX_LEN)

print('\nPadding token: "{:}", ID: {:}'.format(tokenizer.pad_token, tokenizer.pad_token_id))

# Pad our input tokens with value 0 (the `[PAD]` token ID).
# "post" indicates that we want to pad and truncate at the end of the sequence,
# as opposed to the beginning.
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", 
                          value=0, truncating="post", padding="post")
# Pad the labels with -100 so that padded positions are ignored by the loss,
# just like every other position that is not the masked pronoun.
masked_lm_labels = pad_sequences(masked_lm_labels, maxlen=MAX_LEN, dtype="long", 
                          value=-100, truncating="post", padding="post")
print('\nDone.')


Padding/truncating all sentences to 164 values...

Padding token: "[PAD]", ID: 0

Done.


## 3.4. Attention Masks

The attention mask simply makes it explicit which tokens are actual words versus which are padding. 

In the BERT vocabulary, ID 0 is reserved for the `[PAD]` token, so if a token ID is 0, then it's padding, and otherwise it's a real token.

In [0]:
# Create attention masks
attention_masks = []

# For each sentence...
for sent in input_ids:
    
    # Create the attention mask.
    #   - If a token ID is 0, then it's padding, set the mask to 0.
    #   - If a token ID is > 0, then it's a real token, set the mask to 1.
    att_mask = [int(token_id > 0) for token_id in sent]
    
    # Store the attention mask for this sentence.
    attention_masks.append(att_mask)

## 3.5. Training & Validation Split


Divide up our training set to use 80% for training and 20% for validation.

In [0]:
# Use train_test_split to split our data into train and validation sets for
# training
import numpy as np
from sklearn.model_selection import train_test_split

# Use 80% for training and 20% for validation.
train_inputs, validation_inputs, train_lm_labels, validation_lm_labels = train_test_split(input_ids, masked_lm_labels,
                                                            random_state=2018, test_size=0.2)
# Do the same for the masks.
train_masks, validation_masks, _, _ = train_test_split(attention_masks, masked_lm_labels,
                                             random_state=2018, test_size=0.2)

## 3.6. Converting to PyTorch Data Types

Our model expects PyTorch tensors rather than numpy.ndarrays, so convert all of our dataset variables.

In [0]:
# Convert all inputs and labels into torch tensors, the required datatype 
# for our model.
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)

train_lm_labels = torch.tensor(train_lm_labels)
validation_lm_labels = torch.tensor(validation_lm_labels)

train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

We'll also create an iterator for our dataset using the torch DataLoader class. It yields the data one batch at a time, so during training only the current batch has to be copied to the GPU rather than the entire dataset.

In [0]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# The DataLoader needs to know our batch size for training, so we specify it 
# here.
# For fine-tuning BERT on a specific task, the authors recommend a batch size of
# 16 or 32.

batch_size = 16

# Create the DataLoader for our training set.
train_data = TensorDataset(train_inputs, train_masks, train_lm_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create the DataLoader for our validation set.
validation_data = TensorDataset(validation_inputs, validation_masks, validation_lm_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
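As a quick sanity check on the split and the batch size, the arithmetic below works out how many sentences and batches each DataLoader should hold. It assumes that scikit-learn rounds the validation share up and gives the remainder to training, and that the DataLoader keeps the last, smaller batch (its default behaviour).

```python
import math

n_sentences = 42634   # number of training sentences reported above
test_size = 0.2
batch_size = 16

# Validation share is rounded up; training gets whatever is left.
n_validation = math.ceil(test_size * n_sentences)
n_train = n_sentences - n_validation

# The last batch may be smaller than `batch_size`, hence the ceiling.
train_batches = math.ceil(n_train / batch_size)
validation_batches = math.ceil(n_validation / batch_size)

print(n_train, n_validation)              # 34107 8527
print(train_batches, validation_batches)  # 2132 533
```

If `len(train_dataloader)` does not match the training figure, the data or the split differs from the run documented here.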

# 4. Train Our Masked Language Model

## 4.1. BertForMaskedLM

Load model and send to GPU.

In [0]:
from transformers import BertForMaskedLM, AdamW, BertConfig

# Load BertForMaskedLM, the pretrained BERT model with a masked language
# modeling head on top. Here we load the weights from a previously saved checkpoint.
model = BertForMaskedLM.from_pretrained(
    '/content/drive/My Drive/playground/model_save/debias/full/lr_2e-05_eps_1e-08/epoch_1/',
    # "bert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

# Tell pytorch to run this model on the GPU.
model.cuda()

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr

## 4.2. Optimizer & Learning Rate Scheduler

Now that we have our model loaded we need to grab the training hyperparameters from within the stored model.

For the purposes of fine-tuning, the authors recommend choosing from the following values:
- Batch size: 16, 32  (We chose 16 when creating our DataLoaders).
- Learning rate (Adam): 5e-5, 3e-5, 2e-5  (We cross validated over these parameters).
- Number of epochs: 2, 3, 4  (We used 8 and selected the epoch that performs best on the validation dataset).

The epsilon parameter `eps = 1e-8` is "a very small number to prevent any division by zero in the implementation" (from [here](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)).
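To show where epsilon enters, here is a sketch of one textbook Adam update for a single scalar parameter. This is the standard formulation, not necessarily line-for-line what the huggingface `AdamW` class does, and `adam_step` is a hypothetical helper name.

```python
import math

def adam_step(param, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update for a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    # `eps` keeps the denominator non-zero when `v_hat` is 0.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# With a zero gradient, v_hat is 0; without eps this would be a division by zero.
param, m, v = adam_step(param=1.0, grad=0.0, m=0.0, v=0.0, t=1)
print(param)  # 1.0
```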

You can find the creation of the AdamW optimizer in `run_glue.py` [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L109).

In [0]:
# Note: AdamW is a class from the huggingface library (as opposed to pytorch).
# I believe the 'W' stands for 'Weight Decay fix'.
lr = 2e-5
eps = 1e-8
optimizer = AdamW(model.parameters(),
                  lr = lr, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = eps # args.adam_epsilon  - default is 1e-8.
                )


In [0]:
from transformers import get_linear_schedule_with_warmup

# Number of training epochs (authors recommend between 2 and 4)
epochs = 8

# Total number of training steps is number of batches * number of epochs.
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
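The scheduler multiplies the base learning rate by a factor that rises linearly during warmup and then falls linearly to zero. The sketch below reproduces that factor in plain Python. `linear_schedule_factor` is a hypothetical name, and the 2,132 batches per epoch is the figure for this dataset with a batch size of 16.

```python
def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 2132 * 8  # batches per epoch * epochs

for step in (0, total_steps // 2, total_steps):
    print(step, base_lr * linear_schedule_factor(step, total_steps))
```

With no warmup the rate starts at `2e-5`, is halved at the midpoint, and reaches zero at the final step.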

## 4.3. Training Loop

Below is our training loop. There's a lot going on, but fundamentally for each pass in our loop we have a training phase and a validation phase. At each pass we need to:

Training loop:
- Unpack our data inputs and labels
- Load data onto the GPU for acceleration
- Clear out the gradients calculated in the previous pass. 
    - In pytorch the gradients accumulate by default (useful for things like RNNs) unless you explicitly clear them out.
- Forward pass (feed input data through the network)
- Backward pass (backpropagation)
- Tell the network to update parameters with optimizer.step()
- Track variables for monitoring progress

Evaluation loop:
- Unpack our data inputs and labels
- Load data onto the GPU for acceleration
- Forward pass (feed input data through the network)
- Compute loss on our validation data and track variables for monitoring progress

So please read carefully through the comments to get an understanding of what's happening. If you're unfamiliar with pytorch a quick look at some of their [beginner tutorials](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py) will help show you that training loops really involve only a few simple steps; the rest is usually just decoration and logging.  
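One step in the training loop that the list above does not spell out is gradient clipping, which the loop applies between the backward pass and the optimizer step. The sketch below shows the idea on a plain list of numbers: compute the norm over all gradients and, if it exceeds the limit, scale every gradient by the same factor. `clip_by_global_norm` is a hypothetical helper written for illustration; the small constant in the denominator is an assumption modelled on how `torch.nn.utils.clip_grad_norm_` avoids division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, tiny=1e-6):
    """Rescale `grads` so that their combined L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total_norm + tiny)
    if scale < 1.0:
        # Every gradient shrinks by the same factor, so the direction is kept.
        grads = [g * scale for g in grads]
    return grads, total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0])
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before)           # 5.0
print(round(norm_after, 3))  # 1.0

# Gradients that are already small are left untouched.
print(clip_by_global_norm([0.3, 0.4])[0])  # [0.3, 0.4]
```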

Define a helper function for calculating accuracy.

In [0]:
import numpy as np

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=2).flatten()
    labels_flat = labels.flatten()
    labels_flat_filtered = (labels_flat != 0) * (labels_flat != -100) * labels_flat

    return np.sum((pred_flat == labels_flat_filtered) * (labels_flat_filtered != 0)) / sum(labels_flat_filtered != 0)
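The helper above scores only the positions that carry a real label, i.e. those whose label is neither `0` nor `-100`. A plain-Python sketch of the same idea on already-flattened toy data may make that easier to see. `masked_accuracy` is a hypothetical name, `2002` is the ID of `he` seen earlier, and the other IDs are arbitrary values chosen for illustration.

```python
def masked_accuracy(predicted_ids, label_ids, ignored=(0, -100)):
    """Share of correct predictions among positions with a real label."""
    scored = [(p, l) for p, l in zip(predicted_ids, label_ids) if l not in ignored]
    correct = sum(1 for p, l in scored if p == l)
    return correct / len(scored)

# Flattened toy batch: only two positions carry a pronoun label.
predicted = [1996, 2002, 0, 2002, 1012, 0]
labels    = [-100, 2002, 0, -100, 2016, 0]

print(masked_accuracy(predicted, labels))  # 0.5
```

The prediction at the first labelled position is right and the one at the second is wrong, so the accuracy is one half no matter what the model predicts elsewhere.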

Helper function for formatting elapsed times.


In [0]:
import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))


We're ready to kick off the training!

In [0]:
import random
from tqdm import tqdm

# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128


# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Store the average loss after each epoch so we can plot them.
training_loss_values = []
eval_loss_values = []

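# For each remaining epoch (the model above was loaded from the `epoch_1`
# checkpoint, so we continue from epoch index 2)...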
for epoch_i in range(2, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_loss = 0

    # Put the model into training mode. Don't be misled--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate elapsed time in minutes.
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('Batch {:>5,} of {:>5,}.  Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)


        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()        

        # Perform a forward pass (evaluate the model on this training batch).
        # Because we provide `masked_lm_labels`, the first element of the
        # returned tuple is the loss.
        # The documentation for this `model` function is here: 
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForMaskedLM
        outputs = model(b_input_ids, 
                    # token_type_ids=b_segments,
                    attention_mask=b_input_mask, 
                    masked_lm_labels=b_labels)
        
        # The call to `model` always returns a tuple, so we need to pull the 
        # loss value out of the tuple.
        loss = outputs[0]

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over the training data.
    avg_train_loss = total_loss / len(train_dataloader)            
    
    # Store the loss value for plotting the learning curve.
    training_loss_values.append(avg_train_loss)
    save_model(model, epoch_i, lr, eps)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Add batch to GPU
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        # Telling the model not to compute or store gradients, saving memory and
        # speeding up validation
        with torch.no_grad():        

            # Forward pass. Because we provide the labels, the model returns
            # both the loss and the logit predictions.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            # The documentation for this `model` function is here: 
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForMaskedLM
            outputs = model(b_input_ids, 
                        # token_type_ids=b_segments,
                        attention_mask=b_input_mask,
                        masked_lm_labels=b_labels)
        
        # Get the validation loss
        loss = outputs[0]
        eval_loss += loss.item()

        # Get the "logits" output by the model. The "logits" are the output
        # values prior to applying an activation function like the softmax.
        logits = outputs[1]        

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        
        # Calculate the accuracy for this batch of test sentences.
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        
        # Accumulate the total accuracy.
        eval_accuracy += tmp_eval_accuracy

        # Track the number of batches
        nb_eval_steps += 1

    # Calculate the average loss over the validation data.
    avg_eval_loss = eval_loss / len(validation_dataloader)            
    eval_loss_values.append(avg_eval_loss)


    # Report the final accuracy for this validation run.
    print("  Average evaluation loss: {0:.2f}".format(avg_eval_loss))
    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))

print("")
print("Training complete!")


Training...
Batch    40 of 2,132.  Elapsed: 0:00:23.
Batch    80 of 2,132.  Elapsed: 0:00:46.
Batch   120 of 2,132.  Elapsed: 0:01:10.
Batch   160 of 2,132.  Elapsed: 0:01:34.
Batch   200 of 2,132.  Elapsed: 0:01:58.
Batch   240 of 2,132.  Elapsed: 0:02:22.
Batch   280 of 2,132.  Elapsed: 0:02:47.
Batch   320 of 2,132.  Elapsed: 0:03:12.
Batch   360 of 2,132.  Elapsed: 0:03:37.
Batch   400 of 2,132.  Elapsed: 0:04:03.
Batch   440 of 2,132.  Elapsed: 0:04:28.
Batch   480 of 2,132.  Elapsed: 0:04:54.
Batch   520 of 2,132.  Elapsed: 0:05:20.
Batch   560 of 2,132.  Elapsed: 0:05:45.
Batch   600 of 2,132.  Elapsed: 0:06:11.
Batch   640 of 2,132.  Elapsed: 0:06:37.
Batch   680 of 2,132.  Elapsed: 0:07:02.
Batch   720 of 2,132.  Elapsed: 0:07:28.
Batch   760 of 2,132.  Elapsed: 0:07:54.
Batch   800 of 2,132.  Elapsed: 0:08:19.
Batch   840 of 2,132.  Elapsed: 0:08:45.
Batch   880 of 2,132.  Elapsed: 0:09:11.
Batch   920 of 2,132.  Elapsed: 0:09:36.
Batch   960 of 2,132.  Elapsed: 0:10:02.
Bat

Let's take a look at our training and validation loss after each epoch:

In [0]:
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

# Use plot styling from seaborn.
sns.set(style='darkgrid')

# Increase the plot size and font size.
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (12,6)

# Plot the learning curves (blue: training, red: validation).
plt.plot(training_loss_values, 'b-o', label='Training')
plt.plot(eval_loss_values, 'r-o', label='Validation')


# Label the plot.
plt.title("Training & validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()

plt.show()

# Conclusion

This post demonstrates that with a pre-trained BERT model you can quickly and effectively create a high quality model with minimal effort and training time using the pytorch interface, regardless of the specific NLP task you are interested in.

# Appendix


## A1. Saving & Loading Fine-Tuned Model

This first cell (taken from `run_glue.py` [here](https://github.com/huggingface/transformers/blob/35ff345fc9df9e777b27903f11fa213e4052595b/examples/run_glue.py#L495)) writes the model and tokenizer out to disk.

In [0]:
import os

# Saving best practices: if you use default names for the model, you can reload it using from_pretrained()
def save_model(processed_model, epoch, lr, eps):
  output_dir = './drive/My Drive/UCL/NLP/model_save/bias/epoch_{}_lr_{}_eps_{}/'.format(epoch, lr, eps)

  # Create output directory if needed
  if not os.path.exists(output_dir):
      os.makedirs(output_dir)

  print("Saving model to %s" % output_dir)

  # Save a trained model, configuration and tokenizer using `save_pretrained()`.
  # They can then be reloaded using `from_pretrained()`
  model_to_save = processed_model.module if hasattr(processed_model, 'module') else processed_model  # Take care of distributed/parallel training
  model_to_save.save_pretrained(output_dir)
  tokenizer.save_pretrained(output_dir)

  # Good practice: save your training arguments together with the trained model
  torch.save([epoch, lr, eps], os.path.join(output_dir, 'training_args.bin'))

In [0]:
save_model(model, 2, 0.01, 0.005)

Let's check out the file sizes, out of curiosity.

In [0]:
!ls -l --block-size=K ./model_save/

The largest file is the model weights, at around 418 megabytes.

In [0]:
!ls -l --block-size=M ./model_save/pytorch_model.bin

To save your model across Colab Notebook sessions, download it to your local machine, or ideally copy it to your Google Drive.

In [0]:
# Mount Google Drive to this Notebook instance.
from google.colab import drive
drive.mount('/content/drive')

In [0]:
# Copy the model files to a directory in your Google Drive.
!cp -r ./model_save/ "./drive/Shared drives/ChrisMcCormick.AI/Blog Posts/BERT Fine-Tuning/"

The following functions will load the model back from disk.

In [0]:
# Load a trained model and vocabulary that you have fine-tuned
model = BertForMaskedLM.from_pretrained('./drive/My Drive/playground/model_save/bias/epoch_2_lr_0.01_eps_0.005/')
tokenizer = BertTokenizer.from_pretrained('./drive/My Drive/playground/model_save/bias/epoch_2_lr_0.01_eps_0.005/')

# Copy the model to the GPU.
model.to(device)

## A2. Weight Decay (not used)



The huggingface example includes the following code block for enabling weight decay, but the default decay rate is "0.0", so I moved this to the appendix.

This block essentially tells the optimizer to not apply weight decay to the bias terms (e.g., $ b $ in the equation $ y = Wx + b $ ). Weight decay is a form of regularization: at each update, the weights themselves are shrunk slightly toward zero, e.g. multiplied by 0.99.
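The two ideas involved, selecting parameters by name and shrinking a weight at each update, can be sketched in plain Python. The helper names `split_for_weight_decay` and `decay_weight` are hypothetical, the parameter names are illustrative ones in the style of BERT's, and the decay formula is the decoupled form (shrink the weight directly, independent of the gradient) rather than a copy of any library's code.

```python
def split_for_weight_decay(param_names, no_decay=('bias', 'LayerNorm.weight')):
    """Split names into those that get weight decay and those that skip it."""
    decay, skip = [], []
    for name in param_names:
        if any(nd in name for nd in no_decay):
            skip.append(name)
        else:
            decay.append(name)
    return decay, skip

def decay_weight(weight, lr, weight_decay):
    """Decoupled weight decay: pull the weight toward zero by lr * weight_decay."""
    return weight - lr * weight_decay * weight

names = [
    'bert.encoder.layer.0.attention.self.query.weight',
    'bert.encoder.layer.0.attention.self.query.bias',
    'bert.encoder.layer.0.attention.output.LayerNorm.weight',
]
decay, skip = split_for_weight_decay(names)
print(decay)  # only the query weight
print(skip)   # the bias and the LayerNorm weight

# Exaggerated values so the effect is visible: the weight shrinks by 1%.
print(round(decay_weight(1.0, lr=0.1, weight_decay=0.1), 6))  # 0.99
```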

In [0]:
# This code is taken from:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L102

# Don't apply weight decay to any parameters whose names include these tokens.
# (Here, the BERT doesn't have `gamma` or `beta` parameters, only `bias` terms)
no_decay = ['bias', 'LayerNorm.weight']

# Collect the (name, parameter) pairs of the model.
param_optimizer = list(model.named_parameters())

# Separate the `weight` parameters from the `bias` parameters. 
# - For the `weight` parameters, this specifies a 'weight_decay' of 0.1. 
# - For the `bias` parameters, the 'weight_decay' is 0.0. 
# ('weight_decay' is the key the optimizer reads from each parameter group.)
optimizer_grouped_parameters = [
    # Filter for all parameters whose names *don't* include 'bias' or 'LayerNorm.weight'.
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.1},
    
    # Filter for parameters which *do* include those.
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]

# Note - `optimizer_grouped_parameters` only includes the parameter values, not 
# the names.