<a href="https://colab.research.google.com/github/JYP97/DS2_Proj_Jobs_skills_analysis/blob/master/DS2_BERT_Fine_Tuning_Sentence_Classification_v4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT Fine-Tuning Tutorial with PyTorch

# 1. Setup

## 1.1. Using Colab GPU for Training



Google Colab offers free GPUs and TPUs! Since we'll be training a large neural network it's best to take advantage of this (in this case we'll attach a GPU), otherwise training will take a very long time.

A GPU can be added by going to the menu and selecting:

`Edit 🡒 Notebook Settings 🡒 Hardware accelerator 🡒 (GPU)`

Then run the following cell to confirm that the GPU is detected.

In [1]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In order for torch to use the GPU, we need to identify and specify the GPU as the device. Later, in our training loop, we will load data onto the device. 

In [2]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


## 1.2. Installing the Hugging Face Library



Next, let's install the [transformers](https://github.com/huggingface/transformers) package from Hugging Face which will give us a pytorch interface for working with BERT. (This library contains interfaces for other pretrained language models like OpenAI's GPT and GPT-2.) We've selected the pytorch interface because it strikes a nice balance between the high-level APIs (which are easy to use but don't provide insight into how things work) and tensorflow code (which contains lots of details but often sidetracks us into lessons about tensorflow, when the purpose here is BERT!).

At the moment, the Hugging Face library seems to be the most widely accepted and powerful pytorch interface for working with BERT. In addition to supporting a variety of different pre-trained transformer models, the library also includes pre-built modifications of these models suited to your specific task. For example, in this tutorial we will use `BertForSequenceClassification`.

The library also includes task-specific classes for token classification, question answering, next sentence prediciton, etc. Using these pre-built classes simplifies the process of modifying BERT for your purposes.


In [3]:
# !pip install transformers
!pip install transformers datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.8/5.8 MB[0m [31m81.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting datasets
  Downloading datasets-2.8.0-py3-none-any.whl (452 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m452.9/452.9 KB[0m [31m44.9 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m102.4 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m182.4/182.4 KB[0m [31m23.9 MB/s[0m eta [36m0:00:00[0m
Col

The code in this notebook is actually a simplified version of the [run_glue.py](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py) example script from huggingface.

`run_glue.py` is a helpful utility which allows you to pick which GLUE benchmark task you want to run on, and which pre-trained model you want to use (you can see the list of possible models [here](https://github.com/huggingface/transformers/blob/e6cff60b4cbc1158fbd6e4a1c3afda8dc224f566/examples/run_glue.py#L69)). It also supports using either the CPU, a single GPU, or multiple GPUs. It even supports using 16-bit precision if you want further speed up.

Unfortunately, all of this configurability comes at the cost of *readability*. In this Notebook, we've simplified the code greatly and added plenty of comments to make it clear what's going on. 

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
import time
import datetime

import pandas as pd
import numpy as np
from transformers import BertTokenizer
from torch.utils.data import TensorDataset, random_split
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import BertForSequenceClassification, AdamW, BertConfig
from transformers import get_linear_schedule_with_warmup
from sklearn.metrics import f1_score, accuracy_score

import random

# Set the seed value all over the place to make this reproducible.
seed_val = 66
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# 2. Loading Dataset


## 2.1 Loading and Parsing

We'll use pandas to parse the "in-domain" training set and look at a few of its properties and data points.

In [6]:
# Load the dataset into a pandas dataframe.
# df = pd.read_csv("./cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
df_valid_salary = pd.read_csv("/content/drive/MyDrive/DS2/valid_salary_dataset.csv")
df_valid_salary = df_valid_salary.drop(columns = ["Unnamed: 0",	"education", "description", "experience", "employment_type"])
# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df_valid_salary.shape[0]))

# Display 10 random rows from the data.
display(df_valid_salary)

Number of training sentences: 948



Unnamed: 0,salary,title,job category,skills
0,$17.23 - $22.00 / hour,Head Start Teacher,Managers,Emergency Handling
1,$19.00 - $26.00 / hour,Teacher of English for Online Groups,Professionals,"Vocabularies, Grammars, Teaching, Lesson Plann..."
2,"$106,250.00 - $125,000.00 / year",CRM / PHP Developer,Professionals,"PHP (Scripting Language), Debugging, Web Servi..."
3,"$85,000.00 - $120,000.00 / year",Licensed Nursing Home Administrator,Managers,"Emergency Handling, Training, Accounting, Heal..."
4,"$53,041.00 - $120,750.00 / year",Sales Leader,Managers,"Training, Recruitment, Direct Selling, Sales, ..."
...,...,...,...,...
943,"$110,000.00 - $120,000.00 / year",Microstrategy Programmer/Analyst,Professionals,"Analysis, TAFIM, Business Intelligence, Archit..."
944,"$50,000.00 - $55,000.00 / year",Entry Level Sales Representative: Complete Tra...,Service and sales workers,"Time Management, Attention To Detail, Customer..."
945,"$30,680.00 - $44,431.00 / year",Customer Service Representative (CRM),Clerical support workers,"Retailing, Hospitality, Sales, Merchandising, ..."
946,"$80,000.00/ year",Automotive Technicians / Master Level Technicians,Technicians and associate professionals,"Diagnostic Tools, Steering, Brakes, Suspension..."


In [7]:
# Load the dataset into a pandas dataframe.
df_invalid_salary = pd.read_csv("/content/drive/MyDrive/DS2/invalid_salary_dataset.csv")
df_invalid_salary = df_invalid_salary.drop(columns = ["Unnamed: 0",	"education", "description", "experience", "employment_type"])
# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df_invalid_salary.shape[0]))

# Display 10 random rows from the data.
display(df_invalid_salary)

Number of training sentences: 852



Unnamed: 0,salary,title,job category,skills
0,,Registered Nurse Emergency Department Full-Tim...,Professionals,"Evaluation Of Care, Phoronix Test Suite, Resea..."
1,,Treasury Manager,Managers,"Treasury, Finance, Financial Institution, Inve..."
2,,Mammography Tech-Cert,Technicians and associate professionals,"Anatomy, Mammography, American Registry Of Rad..."
3,,Warehouse Worker (Immediate Hire) - Earn up to...,Elementary occupations,"Smartphone, Mobile Devices"
4,,Inspector/Packer $16/hour,Elementary occupations,"Learning, Communication, Scheduling, Attention..."
...,...,...,...,...
847,,Short Stay Registered Nurse (RN),Technicians and associate professionals,"Blood Pressure, Heart Rate, Advanced Cardiovas..."
848,,Junior Data Center Technician,Technicians and associate professionals,"Installations (Computer Systems), Complex Prob..."
849,,Human Resources Recruiter ( REMOTE ),Professionals,"Recruitment, Complex Problem Solving, Leadersh..."
850,,Project Director REMOTE,Managers,"Economics, Computer Sciences, Business Adminis..."


In [8]:
df = pd.concat([df_valid_salary, df_invalid_salary], axis=0).reset_index(drop=True)
print('Number of training sentences: {:,}\n'.format(df.shape[0]))
display(df)

Number of training sentences: 1,800



Unnamed: 0,salary,title,job category,skills
0,$17.23 - $22.00 / hour,Head Start Teacher,Managers,Emergency Handling
1,$19.00 - $26.00 / hour,Teacher of English for Online Groups,Professionals,"Vocabularies, Grammars, Teaching, Lesson Plann..."
2,"$106,250.00 - $125,000.00 / year",CRM / PHP Developer,Professionals,"PHP (Scripting Language), Debugging, Web Servi..."
3,"$85,000.00 - $120,000.00 / year",Licensed Nursing Home Administrator,Managers,"Emergency Handling, Training, Accounting, Heal..."
4,"$53,041.00 - $120,750.00 / year",Sales Leader,Managers,"Training, Recruitment, Direct Selling, Sales, ..."
...,...,...,...,...
1795,,Short Stay Registered Nurse (RN),Technicians and associate professionals,"Blood Pressure, Heart Rate, Advanced Cardiovas..."
1796,,Junior Data Center Technician,Technicians and associate professionals,"Installations (Computer Systems), Complex Prob..."
1797,,Human Resources Recruiter ( REMOTE ),Professionals,"Recruitment, Complex Problem Solving, Leadersh..."
1798,,Project Director REMOTE,Managers,"Economics, Computer Sciences, Business Adminis..."


In [9]:
df.groupby(['job category']).count()

Unnamed: 0_level_0,salary,title,skills
job category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Armed forces occupations,3,3,3
Clerical support workers,115,178,178
Craft and related trades workers,45,64,64
Elementary occupations,48,150,150
Managers,136,242,242
Plant and machine operators and assemblers,81,105,105
Professionals,249,503,503
Service and sales workers,98,173,173
"Skilled agricultural, forestry and fishery workers",2,3,3
Technicians and associate professionals,171,379,379


In [10]:
# Remove Job Category "Armed forces occupations" and "Skilled agricultural, forestry and fishery workers", because they have too little data points
df.drop(index=df[df['job category'].isin(["Armed forces occupations", "Skilled agricultural, forestry and fishery workers"])].index.values, inplace=True)
display(df)

Unnamed: 0,salary,title,job category,skills
0,$17.23 - $22.00 / hour,Head Start Teacher,Managers,Emergency Handling
1,$19.00 - $26.00 / hour,Teacher of English for Online Groups,Professionals,"Vocabularies, Grammars, Teaching, Lesson Plann..."
2,"$106,250.00 - $125,000.00 / year",CRM / PHP Developer,Professionals,"PHP (Scripting Language), Debugging, Web Servi..."
3,"$85,000.00 - $120,000.00 / year",Licensed Nursing Home Administrator,Managers,"Emergency Handling, Training, Accounting, Heal..."
4,"$53,041.00 - $120,750.00 / year",Sales Leader,Managers,"Training, Recruitment, Direct Selling, Sales, ..."
...,...,...,...,...
1795,,Short Stay Registered Nurse (RN),Technicians and associate professionals,"Blood Pressure, Heart Rate, Advanced Cardiovas..."
1796,,Junior Data Center Technician,Technicians and associate professionals,"Installations (Computer Systems), Complex Prob..."
1797,,Human Resources Recruiter ( REMOTE ),Professionals,"Recruitment, Complex Problem Solving, Leadersh..."
1798,,Project Director REMOTE,Managers,"Economics, Computer Sciences, Business Adminis..."




Let's extract the sentences and labels of our training set as numpy ndarrays.

In [11]:
labels = [label for label in df['job category'].unique() if not pd.isnull(label)]

label2id = {# 'Armed forces occupations':0,
          'Managers':0,
          'Professionals':1,
          'Technicians and associate professionals':2,
          'Clerical support workers':3,
          'Service and sales workers':4,
          # 'Skilled agricultural, forestry and fishery workers':6,
          'Craft and related trades workers':5,
          'Plant and machine operators and assemblers':6,
          'Elementary occupations':7
          }

id2label = {idx:label for label, idx in label2id.items()}


labels

['Managers',
 'Professionals',
 'Service and sales workers',
 'Plant and machine operators and assemblers',
 'Craft and related trades workers',
 'Technicians and associate professionals',
 'Clerical support workers',
 'Elementary occupations']

In [12]:
df['labels'] = df['job category'].replace(label2id)

In [13]:
display(df)

Unnamed: 0,salary,title,job category,skills,labels
0,$17.23 - $22.00 / hour,Head Start Teacher,Managers,Emergency Handling,0
1,$19.00 - $26.00 / hour,Teacher of English for Online Groups,Professionals,"Vocabularies, Grammars, Teaching, Lesson Plann...",1
2,"$106,250.00 - $125,000.00 / year",CRM / PHP Developer,Professionals,"PHP (Scripting Language), Debugging, Web Servi...",1
3,"$85,000.00 - $120,000.00 / year",Licensed Nursing Home Administrator,Managers,"Emergency Handling, Training, Accounting, Heal...",0
4,"$53,041.00 - $120,750.00 / year",Sales Leader,Managers,"Training, Recruitment, Direct Selling, Sales, ...",0
...,...,...,...,...,...
1795,,Short Stay Registered Nurse (RN),Technicians and associate professionals,"Blood Pressure, Heart Rate, Advanced Cardiovas...",2
1796,,Junior Data Center Technician,Technicians and associate professionals,"Installations (Computer Systems), Complex Prob...",2
1797,,Human Resources Recruiter ( REMOTE ),Professionals,"Recruitment, Complex Problem Solving, Leadersh...",1
1798,,Project Director REMOTE,Managers,"Economics, Computer Sciences, Business Adminis...",0


In [14]:
df.groupby(['labels']).count()

Unnamed: 0_level_0,salary,title,job category,skills
labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,136,242,242,242
1,249,503,503,503
2,171,379,379,379
3,115,178,178,178
4,98,173,173,173
5,45,64,64,64
6,81,105,105,105
7,48,150,150,150


In [15]:
df.to_csv('/content/drive/MyDrive/DS2/clean_dataset_1794.csv') 

## 2.2 Train/Test-set Split

In [16]:
from sklearn.model_selection import train_test_split

df = pd.read_csv('/content/drive/MyDrive/DS2/clean_dataset_1794.csv')

# Divide the dataset by randomly selecting samples.
text_train, text_test, label_train, label_test = train_test_split(df['skills'], df['labels'], 
                                    test_size=0.2, 
                                    random_state=42, shuffle=True)


print('{:>5,} training samples'.format(len(text_train)))
print('{:>5,} test samples'.format(len(text_test)))

temp_dataset = pd.concat((text_train, label_train), axis=1).reset_index(drop=True)
test_dataset = pd.concat((text_test, label_test), axis=1).reset_index(drop=True)

temp_dataset.to_csv('/content/drive/MyDrive/DS2/train-val-dataset.csv')  
test_dataset.to_csv('/content/drive/MyDrive/DS2/test-dataset.csv')  

1,435 training samples
  359 test samples


# 3. Tokenization & Input Formatting

In this section, we'll transform our dataset into the format that BERT can be trained on.

## 3.1. BERT Tokenizer


To feed our text to BERT, it must be split into tokens, and then these tokens must be mapped to their index in the tokenizer vocabulary.

The tokenization must be performed by the tokenizer included with BERT--the below cell will download this for us. We'll be using the "uncased" version here.


In [17]:
train_df = pd.read_csv("/content/drive/MyDrive/DS2/train-val-dataset.csv", index_col=0)

# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(train_df.shape[0]))

# Display 10 random rows from the data.
display(train_df)

Number of training sentences: 1,435



Unnamed: 0,skills,labels
0,"Driving, Information Security, Guard, Enforcem...",2
1,"Coordinating, Sanitation, Stocks (Inventory), ...",4
2,"Scheduling, Reports, Analysis, Testing, Verifi...",2
3,"Scheduling, Upselling, Certified Society Of Pa...",2
4,"HVAC, Computer Literacy, Customer Service, Dis...",4
...,...,...
1430,"Straightforward, Creativity, Warehousing, Team...",7
1431,"Digital Marketing, Email Marketing, Marketing ...",0
1432,"Social Work, Assessments, Licensed Master Soci...",1
1433,"Emergency Nursing, Neonatal Resuscitation Prog...",2


In [18]:
train_df.groupby(['labels']).count()

Unnamed: 0_level_0,skills
labels,Unnamed: 1_level_1
0,185
1,400
2,319
3,141
4,130
5,50
6,89
7,121


In [19]:
# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Loading BERT tokenizer...


Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Let's apply the tokenizer to one sentence just to see the output.


In [20]:
# Print the original sentence.
print(' Original: ', train_df['skills'][1])

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(train_df['skills'][1]))

# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(train_df['skills'][1])))

 Original:  Coordinating, Sanitation, Stocks (Inventory), Storage (Warehousing), Outline Of Food Preparation, Emergency Handling, Merchandising, Retailing, Scheduling
Tokenized:  ['coordinating', ',', 'sanitation', ',', 'stocks', '(', 'inventory', ')', ',', 'storage', '(', 'ware', '##ho', '##using', ')', ',', 'outline', 'of', 'food', 'preparation', ',', 'emergency', 'handling', ',', 'mer', '##chan', '##dis', '##ing', ',', 'retail', '##ing', ',', 'scheduling']
Token IDs:  [19795, 1010, 18723, 1010, 15768, 1006, 12612, 1007, 1010, 5527, 1006, 16283, 6806, 18161, 1007, 1010, 12685, 1997, 2833, 7547, 1010, 5057, 8304, 1010, 21442, 14856, 10521, 2075, 1010, 7027, 2075, 1010, 19940]


## 3.2. Tokenize Dataset

The transformers library provides a helpful `encode` function which will handle most of the parsing and data prep steps for us.

Before we are ready to encode our text, though, we need to decide on a **maximum sentence length** for padding / truncating to.

The below cell will perform one tokenization pass of the dataset in order to measure the maximum sentence length.

In [21]:
max_len = 0

# For every sentence...
for skill in train_df['skills']:

    # Tokenize the text and add `[CLS]` and `[SEP]` tokens.
    input_ids = tokenizer.encode(skill, add_special_tokens=True)

    # Update the maximum sentence length.
    max_len = max(max_len, len(input_ids))

print('Max sentence length: ', max_len)

Max sentence length:  264


Now we're ready to perform the real tokenization.

The `tokenizer.encode_plus` function combines multiple steps for us:

1. Split the sentence into tokens.
2. Add the special `[CLS]` and `[SEP]` tokens.
3. Map the tokens to their IDs.
4. Pad or truncate all sentences to the same length.
5. Create the attention masks which explicitly differentiate real tokens from `[PAD]` tokens.

The first four features are in `tokenizer.encode`, but I'm using `tokenizer.encode_plus` to get the fifth item (attention masks). Documentation is [here](https://huggingface.co/transformers/main_classes/tokenizer.html?highlight=encode_plus#transformers.PreTrainedTokenizer.encode_plus).


In [22]:
# Tokenize all of the sentences and map the tokens to thier word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for skill in train_df['skills']:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        skill,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = 512,           # Pad & truncate all sentences.
                        pad_to_max_length = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(train_df.labels.values)
# labels = torch.tensor(ids).float()
# torch.unsqueeze(labels, 1)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [23]:
# Print sentence 0, now as a list of IDs.
print('Original: ', len(train_df['skills'][0]))
print('Token IDs:', len(input_ids[0]))
print('attention_masks:', len(attention_masks[0]))
print('labels:', labels[0])

Original:  105
Token IDs: 512
attention_masks: 512
labels: tensor(2)


## 3.3. Training & Validation Split


Divide up our training set to use 90% for training and 10% for validation.

In [24]:
# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, labels)

# Calculate the number of samples to include in each set.
train_size = int(0.80 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])


print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

1,148 training samples
  287 validation samples


We'll also create an iterator for our dataset using the torch DataLoader class. This helps save on memory during training because, unlike a for loop, with an iterator the entire dataset does not need to be loaded into memory.

In [25]:
# The DataLoader needs to know our batch size for training, so we specify it 
# here. For fine-tuning BERT on a specific task, the authors recommend a batch 
# size of 16 or 32.
batch_size = 16

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

# 4. Train Our Classification Model

Now that our input data is properly formatted, it's time to fine tune the BERT model. 

## 4.1. BertForSequenceClassification



We'll be using [BertForSequenceClassification](https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#bertforsequenceclassification). This is the normal BERT model with an added single linear layer on top for classification that we will use as a sentence classifier. As we feed input data, the entire pre-trained BERT model and the additional untrained classification layer is trained on our specific task. 


In [26]:
# Load BertForSequenceClassification, the pretrained BERT model with a single 
# linear classification layer on top. 
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = 8, # The number of output labels--2 for binary classification.
                    # You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

# Tell pytorch to run this model on the GPU.
model.cuda()

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

Just for curiosity's sake, we can browse all of the model's parameters by name here.

In the below cell, I've printed out the names and dimensions of the weights for:

1. The embedding layer.
2. The first of the twelve transformers.
3. The output layer.




In [27]:
# Get all of the model's parameters as a list of tuples.
params = list(model.named_parameters())

print('The BERT model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')

for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The BERT model has 201 different named parameters.

==== Embedding Layer ====

bert.embeddings.word_embeddings.weight                  (30522, 768)
bert.embeddings.position_embeddings.weight                (512, 768)
bert.embeddings.token_type_embeddings.weight                (2, 768)
bert.embeddings.LayerNorm.weight                              (768,)
bert.embeddings.LayerNorm.bias                                (768,)

==== First Transformer ====

bert.encoder.layer.0.attention.self.query.weight          (768, 768)
bert.encoder.layer.0.attention.self.query.bias                (768,)
bert.encoder.layer.0.attention.self.key.weight            (768, 768)
bert.encoder.layer.0.attention.self.key.bias                  (768,)
bert.encoder.layer.0.attention.self.value.weight          (768, 768)
bert.encoder.layer.0.attention.self.value.bias                (768,)
bert.encoder.layer.0.attention.output.dense.weight        (768, 768)
bert.encoder.layer.0.attention.output.dense.bias              (

## 4.2. Optimizer & Learning Rate Scheduler

Now that we have our model loaded we need to grab the training hyperparameters from within the stored model.

For the purposes of fine-tuning, the authors recommend choosing from the following values (from Appendix A.3 of the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf)):

>- **Batch size:** 16, 32  
- **Learning rate (Adam):** 5e-5, 3e-5, 2e-5  
- **Number of epochs:** 2, 3, 4 

We chose:
* Batch size: 32 (set when creating our DataLoaders)
* Learning rate: 2e-5
* Epochs: 4 (we'll see that this is probably too many...)

The epsilon parameter `eps = 1e-8` is "a very small number to prevent any division by zero in the implementation" (from [here](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)).

You can find the creation of the AdamW optimizer in `run_glue.py` [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L109).

In [28]:
# Note: AdamW is a class from the huggingface library (as opposed to pytorch) 
# I believe the 'W' stands for 'Weight Decay fix"
optimizer = AdamW(model.parameters(),
                  lr = 5e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                )




In [29]:
# Number of training epochs. The BERT authors recommend between 2 and 4. 
# We chose to run for 4, but we'll see later that this may be over-fitting the
# training data.
epochs = 4

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)

## 4.3. Training Loop

Below is our training loop. 

**Training:**
- Unpack our data inputs and labels
- Load data onto the GPU for acceleration
- Clear out the gradients calculated in the previous pass. 
    - In pytorch the gradients accumulate by default (useful for things like RNNs) unless you explicitly clear them out.
- Forward pass (feed input data through the network)
- Backward pass (backpropagation)
- Tell the network to update parameters with optimizer.step()
- Track variables for monitoring progress

**Evalution:**
- Unpack our data inputs and labels
- Load data onto the GPU for acceleration
- Forward pass (feed input data through the network)
- Compute loss on our validation data and track variables for monitoring progress

Define a helper function for calculating accuracy.

In [30]:
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

Helper function for formatting elapsed times as `hh:mm:ss`


In [31]:
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))


We're ready to kick off the training!

In [32]:
import torch
torch.cuda.empty_cache()

In [None]:


# We'll store a number of quantities such as training and validation loss, 
# validation accuracy, and timings.
training_stats = []
best_f1 = 0

# Measure the total training time for the whole run.
total_t0 = time.time()

# For each epoch...
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_train_loss = 0
    total_train_accuracy = 0
    predictions , true_labels = [], []

    # Put the model into training mode. Don't be mislead--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate elapsed time in minutes.
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()        

        # Perform a forward pass (evaluate the model on this training batch).
        # In PyTorch, calling `model` will in turn call the model's `forward` 
        # function and pass down the arguments. The `forward` function is 
        # documented here: 
        # https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification
        # The results are returned in a results object, documented here:
        # https://huggingface.co/transformers/main_classes/output.html#transformers.modeling_outputs.SequenceClassifierOutput
        # Specifically, we'll get the loss (because we provided labels) and the
        # "logits"--the model outputs prior to activation.
        result = model(b_input_ids, 
                       token_type_ids=None, 
                       attention_mask=b_input_mask, 
                       labels=b_labels,
                       return_dict=True)

        loss = result.loss
        logits = result.logits

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_train_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the accuracy for this batch of test sentences, and
        # accumulate it over all batches.
        total_train_accuracy += flat_accuracy(logits, label_ids)

        pred_flat = np.argmax(logits, axis=1).flatten()
        labels_flat = label_ids.flatten()

        # Store predictions and true labels
        predictions.extend(pred_flat)
        true_labels.extend(labels_flat)
       

    print("")
    avg_train_accuracy = total_train_accuracy / len(train_dataloader)
    # avg_train_accuracy = accuracy_score(y_pred=predictions, y_true=true_labels, normalize = True)
    print("  Accuracy: {0:.2f}".format(avg_train_accuracy))
    
    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)  
    print("  Average training loss: {0:.2f}".format(avg_train_loss))

    f1_train = f1_score(y_pred=predictions, y_true=true_labels, average='macro')
    print("  F1 score: {0:.2f}".format(f1_train)) 
    
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)
    print("  Training epoch took: {:}".format(training_time))
    
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0
    predictions , true_labels = [], []

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using 
        # the `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            result = model(b_input_ids, 
                           token_type_ids=None, 
                           attention_mask=b_input_mask,
                           labels=b_labels,
                           return_dict=True)

        # Get the loss and "logits" output by the model. The "logits" are the 
        # output values prior to applying an activation function like the 
        # softmax.
        loss = result.loss
        logits = result.logits
            
        # Accumulate the validation loss.
        total_eval_loss += loss.item()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the accuracy for this batch of test sentences, and
        # accumulate it over all batches.
        total_eval_accuracy += flat_accuracy(logits, label_ids)
        
        pred_flat = np.argmax(logits, axis=1).flatten()
        labels_flat = label_ids.flatten()

        # Store predictions and true labels
        predictions.extend(pred_flat)
        true_labels.extend(labels_flat)

    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    # avg_val_accuracy = accuracy_score(y_pred=predictions, y_true=true_labels, normalize = True)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(validation_dataloader)    
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))

    f1_val = f1_score(y_pred=predictions, y_true=true_labels, average='macro')
    print("  F1 score: {0:.2f}".format(f1_val))

    # Measure how long the validation run took.
    validation_time = format_time(time.time() - t0)
    print("  Validation took: {:}".format(validation_time))

    if(f1_val > best_f1):
        model.save_pretrained("/content/drive/MyDrive/DataScience/DS2/f1_models")
        print("  New best model saved!")
        best_f1 = f1_val

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Accur.': avg_train_accuracy,
            'Training Loss': avg_train_loss,
            'Training F1': f1_train,
            'Training Time': training_time,
            'Valid. Accur.': avg_val_accuracy,
            'Valid. Loss': avg_val_loss,
            'Valid. F1': f1_val,
            'Valid. Time': validation_time
        }
    )

print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


Training...
  Batch    40  of     72.    Elapsed: 0:00:55.

  Accuracy: 0.33
  Average training loss: 1.74
  F1 score: 0.18
  Training epoch took: 0:01:37

Running Validation...
  Accuracy: 0.39
  Validation Loss: 1.49
  F1 score: 0.27
  Validation took: 0:00:09
  New best model saved!

Training...
  Batch    40  of     72.    Elapsed: 0:00:54.

  Accuracy: 0.45
  Average training loss: 1.43
  F1 score: 0.38
  Training epoch took: 0:01:38

Running Validation...


Let's view the summary of the training process.

In [None]:
import pandas as pd

# Display floats with two decimal places.
pd.set_option('precision', 2)

# Create a DataFrame from our training statistics.
df_stats = pd.DataFrame(data=training_stats)

# Use the 'epoch' as the row index.
df_stats = df_stats.set_index('epoch')

# A hack to force the column headers to wrap.
#df = df.style.set_table_styles([dict(selector="th",props=[('max-width', '70px')])])

# Display the table.
df_stats

Notice that, while the the training loss is going down with each epoch, the validation loss is increasing! This suggests that we are training our model too long, and it's over-fitting on the training data. 

(For reference, we are using 7,695 training samples and 856 validation samples).

Validation Loss is a more precise measure than accuracy, because with accuracy we don't care about the exact output value, but just which side of a threshold it falls on. 

If we are predicting the correct answer, but with less confidence, then validation loss will catch this, while accuracy will not.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

# Use plot styling from seaborn.
sns.set(style='darkgrid')

# Increase the plot size and font size.
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (12,6)

# Plot the learning curve.
plt.plot(df_stats['Training Accur.'], 'b-o', label="Training")
plt.plot(df_stats['Valid. Accur.'], 'g-o', label="Validation")

# Label the plot.
plt.title("Training & Validation Accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.xticks([x+1 for x in range(epochs)])

plt.show()

In [None]:
# Use plot styling from seaborn.
sns.set(style='darkgrid')

# Increase the plot size and font size.
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (12,6)

# Plot the learning curve.
plt.plot(df_stats['Training F1'], 'b-o', label="Training")
plt.plot(df_stats['Valid. F1'], 'g-o', label="Validation")

# Label the plot.
plt.title("Training & Validation F1 Score")
plt.xlabel("Epoch")
plt.ylabel("F1 Score")
plt.legend()
plt.xticks([x+1 for x in range(epochs)])

plt.show()

# 5. Performance On Test Set

Now we'll load the holdout dataset and prepare inputs just as we did with the training set. Then we'll evaluate predictions using [Matthew's correlation coefficient](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html) because this is the metric used by the wider NLP community to evaluate performance on CoLA. With this metric, +1 is the best score, and -1 is the worst score. This way, we can see how well we perform against the state of the art models for this specific task.

### 5.1. Data Preparation



We'll need to apply all of the same steps that we did for the training data to prepare our test data set.

In [None]:
import pandas as pd

# Load the dataset into a pandas dataframe.
test_df = pd.read_csv("/content/drive/MyDrive/DS2/test-dataset.csv", index_col=0)

# Report the number of sentences.
print('Number of test sentences: {:,}\n'.format(test_df.shape[0]))

display(test_df)

Number of test sentences: 95



Unnamed: 0,skills,labels
0,"Scheduling, Management, Construction Managemen...",4
1,"Children'S Health Insurance Program, Chromatin...",2
2,"Outline Of Food Preparation, Cooking, Restaura...",5
3,"Retailing, Scheduling, Customer Service, Sales...",3
4,"Scheduling, Time Management, Communication, Re...",4
...,...,...
90,"Soldering, Assembling, Electronic Components, ...",8
91,"Extroverted, Self Motivation, Team-working, Pa...",4
92,"Accounting, Auditing, Systems Analysis, Human ...",4
93,"Instructions, Hardworking And Dedicated, Drug ...",8


In [None]:
test_df.groupby(['labels']).count()

Unnamed: 0_level_0,skills
labels,Unnamed: 1_level_1
0,1
1,14
2,24
3,15
4,11
5,10
7,4
8,12
9,4


In [None]:
# Create sentence and label lists
skills = test_df.skills.values
labels = test_df.labels.values

In [None]:
# Tokenize all of the sentences and map the tokens to thier word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for skill in skills:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        skill,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = 512,           # Pad & truncate all sentences.
                        pad_to_max_length = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

# Set the batch size.  
batch_size = 32  

# Create the DataLoader.
prediction_data = TensorDataset(input_ids, attention_masks, labels)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)



## 5.2. Evaluate on Test Set



With the test set prepared, we can apply our fine-tuned model to generate predictions on the test set.

In [None]:
# Load the saved parameters

model = BertForSequenceClassification.from_pretrained(
    "/content/drive/MyDrive/DataScience/DS2/f1_models", # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = 8, # The number of output labels: 2 for binary classification. 
)

model.cuda()

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [None]:
# Prediction on test set

print('Predicting labels for {:,} test sentences...'.format(len(input_ids)))

# Put model in evaluation mode
model.eval()

# metric = load_metric("glue", "mrpc")

# Tracking variables 
predictions , true_labels = [], []

total_test_accuracy = 0
total_test_loss = 0
nb_test_steps = 0

# Predict 
for batch in prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  
  # Telling the model not to compute or store gradients, saving memory and 
  # speeding up prediction
  with torch.no_grad():
      # Forward pass, calculate logit predictions.
      result = model(b_input_ids, 
                     token_type_ids=None, 
                     attention_mask=b_input_mask,
                     return_dict=True)

  logits = result.logits

  # Move logits and labels to CPU
  logits = logits.detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()

  pred_flat = np.argmax(logits, axis=1).flatten()
  labels_flat = label_ids.flatten()

  # Store predictions and true labels
  predictions.extend(pred_flat)
  true_labels.extend(labels_flat)

  total_test_accuracy += flat_accuracy(logits, label_ids)
    
print("")
# Report the final accuracy for this validation run.
avg_test_accuracy = total_test_accuracy / len(prediction_dataloader)
# avg_test_accuracy = accuracy_score(y_pred=predictions, y_true=true_labels, normalize = True)
print("  Accuracy: {0:.2f}".format(avg_test_accuracy))

# Calculate the average loss over all of the batches.
avg_test_loss = total_test_loss / len(prediction_dataloader)
print("  Test Loss: {0:.2f}".format(avg_test_loss))
  
print("  F1 score: {0:.2f}".format(f1_score(y_pred=predictions, y_true=true_labels, average='macro'))) 

# Measure how long the validation run took.
test_time = format_time(time.time() - t0)
print("  Test took: {:}".format(test_time))

print("")
print("Test complete!")

Predicting labels for 95 test sentences...

  Accuracy: 0.35
  Test Loss: 0.00
  F1 score: 0.20
  Test took: 0:12:27

Test complete!


# Conclusion

This post demonstrates that with a pre-trained BERT model you can quickly and effectively create a high quality model with minimal effort and training time using the pytorch interface, regardless of the specific NLP task you are interested in.

# Appendix


## A1. Saving & Loading Fine-Tuned Model

This first cell (taken from `run_glue.py` [here](https://github.com/huggingface/transformers/blob/35ff345fc9df9e777b27903f11fa213e4052595b/examples/run_glue.py#L495)) writes the model and tokenizer out to disk.

In [None]:
import os

# Saving best-practices: if you use defaults names for the model, you can reload it using from_pretrained()

output_dir = './model_save/'

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print("Saving model to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Good practice: save your training arguments together with the trained model
# torch.save(args, os.path.join(output_dir, 'training_args.bin'))


Let's check out the file sizes, out of curiosity.

In [None]:
!ls -l --block-size=K ./model_save/

The largest file is the model weights, at around 418 megabytes.

In [None]:
!ls -l --block-size=M ./model_save/pytorch_model.bin

To save your model across Colab Notebook sessions, download it to your local machine, or ideally copy it to your Google Drive.

In [None]:
# Mount Google Drive to this Notebook instance.
from google.colab import drive
    drive.mount('/content/drive')

In [None]:
# Copy the model files to a directory in your Google Drive.
!cp -r ./model_save/ "./drive/Shared drives/ChrisMcCormick.AI/Blog Posts/BERT Fine-Tuning/"

The following functions will load the model back from disk.

In [None]:
# Load a trained model and vocabulary that you have fine-tuned
model = model_class.from_pretrained(output_dir)
tokenizer = tokenizer_class.from_pretrained(output_dir)

# Copy the model to the GPU.
model.to(device)

## A.2. Weight Decay



The huggingface example includes the following code block for enabling weight decay, but the default decay rate is "0.0", so I moved this to the appendix.

This block essentially tells the optimizer to not apply weight decay to the bias terms (e.g., $ b $ in the equation $ y = Wx + b $ ). Weight decay is a form of regularization--after calculating the gradients, we multiply them by, e.g., 0.99.

In [None]:
# This code is taken from:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L102

# Don't apply weight decay to any parameters whose names include these tokens.
# (Here, the BERT doesn't have `gamma` or `beta` parameters, only `bias` terms)
no_decay = ['bias', 'LayerNorm.weight']

# Separate the `weight` parameters from the `bias` parameters. 
# - For the `weight` parameters, this specifies a 'weight_decay_rate' of 0.01. 
# - For the `bias` parameters, the 'weight_decay_rate' is 0.0. 
optimizer_grouped_parameters = [
    # Filter for all parameters which *don't* include 'bias', 'gamma', 'beta'.
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.1},
    
    # Filter for parameters which *do* include those.
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}
]

# Note - `optimizer_grouped_parameters` only includes the parameter values, not 
# the names.

# Revision History

**Version 4** - *Feb 2nd, 2020* - (current)
* Updated all calls to `model` (fine-tuning and evaluation) to use the [`SequenceClassifierOutput`](https://huggingface.co/transformers/main_classes/output.html#transformers.modeling_outputs.SequenceClassifierOutput) class.
* Moved illustration images to Google Drive--Colab appears to no longer support images at external URLs.

**Version 3** - *Mar 18th, 2020*
* Simplified the tokenization and input formatting (for both training and test) by leveraging the `tokenizer.encode_plus` function. 
`encode_plus` handles padding *and* creates the attention masks for us.
* Improved explanation of attention masks.
* Switched to using `torch.utils.data.random_split` for creating the training-validation split.
* Added a summary table of the training statistics (validation loss, time per epoch, etc.).
* Added validation loss to the learning curve plot, so we can see if we're overfitting. 
    * Thank you to [Stas Bekman](https://ca.linkedin.com/in/stasbekman) for contributing this!
* Displayed the per-batch MCC as a bar plot.

**Version 2** - *Dec 20th, 2019* - [link](https://colab.research.google.com/drive/1Y4o3jh3ZH70tl6mCd76vz_IxX23biCPP)
* huggingface renamed their library to `transformers`. 
* Updated the notebook to use the `transformers` library.

**Version 1** - *July 22nd, 2019*
* Initial version.

## Further Work

* It might make more sense to use the MCC score for “validation accuracy”, but I’ve left it out so as not to have to explain it earlier in the Notebook.
* Seeding -- I’m not convinced that setting the seed values at the beginning of the training loop is actually creating reproducible results…
* The MCC score seems to vary substantially across different runs. It would be interesting to run this example a number of times and show the variance.
