Reference for code: https://medium.com/@aniruddha.choudhury94/part-2-bert-fine-tuning-tutorial-with-pytorch-for-text-classification-on-the-corpus-of-linguistic-18057ce330e1

In [1]:
#https://github.com/pytorch/pytorch/issues/19406#issuecomment-581178550
#pip install torch==1.2.0+cpu torchvision==0.4.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

### Parsing Data

In [2]:
import pandas as pd

df = pd.read_csv('data/cola_public/raw/in_domain_train.tsv',
                delimiter='\t',
                header = None,
                names = ['sentence_source','label','label_notes','sentence'])
print('Number of tuples: {:,}'.format(df.shape[0]))
df.sample(10)

Number of tuples: 8,551


Unnamed: 0,sentence_source,label,label_notes,sentence
6363,d_98,1,,John kissed even the ugliest woman.
6733,m_02,1,,Nurse Rooke suspected that Mrs Clay planned to...
5631,c_13,1,,Zeke cooked and ate the chili.
7498,sks13,1,,We put a book on the table.
4728,ks08,0,*,Seoul was slept in by the businessman last night.
1786,r-67,1,,Willy is taller than Bill by as much as it is ...
6766,m_02,0,*,The person who never had he been so offended w...
982,bc01,0,*,The problem's perception is quite thorough.
6610,m_02,1,,With which club did you hit the winning putt?
8545,ad03,0,*,Anson thought that himself was going to the club.


In [3]:
sentences = df.sentence.values
labels = df.label.values

### Tokenizing

In [4]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case = True)
input_ids = []

for sent in sentences:
    encoded_sent = tokenizer.encode(sent,
                                    add_special_tokens=True,
                                    truncation=True,
                                    pad_to_max_length=True,
                                    max_length=64)
    input_ids.append(encoded_sent)

print('Original:', sentences[0])
print('Token IDs:', input_ids[0])



Original: Our friends won't buy this analysis, let alone the next one we propose.
Token IDs: [101, 2256, 2814, 2180, 1005, 1056, 4965, 2023, 4106, 1010, 2292, 2894, 1996, 2279, 2028, 2057, 16599, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [5]:
print('Max sentence length:', max([len(s) for s in input_ids]))

Max sentence length: 64


### Attention Masks

In [6]:
attention_masks = []

for sent in input_ids:
    att_mask = [int(token_id > 0) for token_id in sent]
    attention_masks.append(att_mask)

In [7]:
attention_masks[0]

[1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

### Training/Test Set Creation

In [8]:
from sklearn.model_selection import train_test_split

train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids,
                                                                                    labels,
                                                                                    random_state = 2018,
                                                                                    test_size = 0.1)
train_masks, validation_masks, _, _ = train_test_split(attention_masks,
                                                       labels,
                                                       random_state = 2018,
                                                       test_size = 0.1)

In [9]:
print(train_inputs[0])
print(train_masks[0])

[101, 2002, 2939, 1996, 3328, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


### Tensors

In [10]:
import torch

train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)

train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)

train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

In [11]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

batch_size = 32
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

### Readying BERT Model

In [12]:
# pip install tensorflow
from transformers import BertForSequenceClassification, AdamW, BertConfig

model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                     num_labels=2,
                                                     output_attentions=False,
                                                     output_hidden_states=False)

- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
params = list(model.named_parameters())
print('The BERT Model has {:} parameters'.format(len(params)))

print('=====Embedding Layer=====')
for p in params[0:5]:
    print('{:<55} {:>12}'.format(p[0], str(tuple(p[1].size()))))
print('=====First Transformer=====')
for p in params[5:21]:
    print('{:<55} {:>12}'.format(p[0], str(tuple(p[1].size()))))
print('=====Output Layer=====')
for p in params[-4:]:
    print('{:<55} {:>12}'.format(p[0], str(tuple(p[1].size()))))

The BERT Model has 201 parameters
=====Embedding Layer=====
bert.embeddings.word_embeddings.weight                  (30522, 768)
bert.embeddings.position_embeddings.weight                (512, 768)
bert.embeddings.token_type_embeddings.weight                (2, 768)
bert.embeddings.LayerNorm.weight                              (768,)
bert.embeddings.LayerNorm.bias                                (768,)
=====First Transformer=====
bert.encoder.layer.0.attention.self.query.weight          (768, 768)
bert.encoder.layer.0.attention.self.query.bias                (768,)
bert.encoder.layer.0.attention.self.key.weight            (768, 768)
bert.encoder.layer.0.attention.self.key.bias                  (768,)
bert.encoder.layer.0.attention.self.value.weight          (768, 768)
bert.encoder.layer.0.attention.self.value.bias                (768,)
bert.encoder.layer.0.attention.output.dense.weight        (768, 768)
bert.encoder.layer.0.attention.output.dense.bias              (768,)
bert.encoder.la

### Learning and Optimization

In [14]:
optimizer = AdamW(model.parameters(),
                  lr = 2e-5,
                  eps = 1e-8)

from transformers import get_linear_schedule_with_warmup

epochs = 4
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0,
                                            num_training_steps = total_steps)

### Training

In [15]:
# To test accuracy
import numpy as np

def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Formatting times
import time
import datetime

def format_time(elapsed):
    elapsed_rounded = int(round(elapsed))
    return str(datetime.timedelta(seconds=elapsed_rounded))

In [None]:
import random

seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)

loss_values = []

for epoch_i in range(0, epochs):
    t0 = time.time() # start time
    total_loss = 0
    model.train() # puts model into training mode
    
    for step, batch in enumerate(train_dataloader):
        if step % 40 == 0 and not step == 0: # show progress every 40 batches
            elapsed = format_time(time.time() - t0)
            print('Batch {:>5,} of {:>5,}. Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))
            b_input_ids = batch[0]
            b_input_mask = batch[1]
            b_labels = batch[2]
            
            model.zero_grad()
            
            outputs = model(b_input_ids,
                            token_type_ids = None,
                            attention_mask = b_input_mask,
                            labels = b_labels)
            
            loss = outputs[0]
            total_loss += loss.item()
            
            loss.backward()
            
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            
            optimizer.step()
            
            scheduler.step()
            
            avg_train_loss = total_loss / len(train_dataloader)
            loss_values.append(avg_train_loss)

    print('Average Training Loss: {0:.2f}'.format(avg_train_loss))
    print('Training Epoch Took: {:}'.format(format_time(time.time() - t0)))

Batch    40 of   241. Elapsed: 0:00:00.
Average Training Loss: 0.00
Training Epoch Took: 0:00:06
=====Validation=====
Batch    80 of   241. Elapsed: 0:00:06.
Average Training Loss: 0.01
Training Epoch Took: 0:00:13
=====Validation=====
Batch   120 of   241. Elapsed: 0:00:13.
Average Training Loss: 0.01
Training Epoch Took: 0:00:20
=====Validation=====
Batch   160 of   241. Elapsed: 0:00:20.
Average Training Loss: 0.01
Training Epoch Took: 0:00:27
=====Validation=====
Batch   200 of   241. Elapsed: 0:00:27.
Average Training Loss: 0.01
Training Epoch Took: 0:00:34
=====Validation=====
Batch   240 of   241. Elapsed: 0:00:34.
Average Training Loss: 0.02
Training Epoch Took: 0:00:39
=====Validation=====
Batch    40 of   241. Elapsed: 0:00:00.
Average Training Loss: 0.00
Training Epoch Took: 0:00:06
=====Validation=====
Batch    80 of   241. Elapsed: 0:00:06.
Average Training Loss: 0.01
Training Epoch Took: 0:00:13
=====Validation=====
Batch   120 of   241. Elapsed: 0:00:13.
Average Training