We often have large quantity of unlabelled dataset with only a small amount of labelled dataset.If we need to get accurate classification, we can use pretrained models trained on large corpus to get decent results.Generally, we use pretrained language models trained on large corpus to get embeddings and then mostly add a layer or two of neural networks on top to  fit to our task in hand. This works very well until the data on which language model was trained is similar to our data. If our data is different than data used for pretraining, results would not be that satifactory. Consider for example if we have mix of Hindi and English language data and we are using pretrained model trained on 
Wikipedia, it would lead to bad results. In that scenario we need to fine-tune our language model too. As shown by Jeremy Howard and Sebastian Ruder in [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146), finetuning the language model can lead to performance enhancement. We generally modify the last few layers of language models to adapt to our data. This has been done and explained by Fast.ai in [Finetuning FastAI language model](https://docs.fast.ai/text.html#Fine-tuning-a-language-model). They have done it extensively for ULMFit. We can follow the same approach with Bert and other models. With the revolution in NLP world, and with the arrival of beasts such as Bert, OpenAI-GPT, Elmo and so on we need a library which could help us keep up with this growing pace in NLP. Here comes in Hugging Face pytorch-transformers, a one stop for NLP. This is easy to use 
library to meet all your NLP requirements written in Pytorch. We will see how we can fine-tune Bert language model and then use that for SequenceClassification all using pytorch-transformers.

We will use https://www.kaggle.com/c/word2vec-nlp-tutorial/data this data for our purpose. 

This dataset contains 25000 training examples having movie reviews with labels and 25000 test examples without labels. We will finetune our language model on combined train and test data having 50000 reviews as whole.

### Finetuning language model
We will use pytorch-transfomers to finetune pretrained Bert language model. It is well written and documented. We will use [Finetune language model](https://github.com/huggingface/pytorch-transformers/tree/master/examples/lm_finetuning) for our purpose. Bert is a bidirectional transformer pre-trained using a combination of masked language modeling objective and next sentence prediction on a large corpus. In masked language modelling it masks or hides certain words during training and tries to predict them and simultaneously to get relevant distingushing factor it also tries to to predict whether two sentences are next to each other or not. For same task, mask modelling and nsp, Bert require training data to be in specific format. This format is made using   [pregenerate_training_data.py](https://github.com/huggingface/pytorch-transformers/blob/master/examples/lm_finetuning/pregenerate_training_data.py) . This scripts expect a single file as input, consisting of untokenized text, with one sentence per line, and one blank line between documents. The reason for the sentence splitting is that part of BERT's training involves a next sentence objective in which the model must predict whether two sequences of text are contiguous text from the same document or not. Running this script will create a new directory called training with training data in stored desired format. Do checkout for this directory for data format. 

### Making data in desired format for language model

**We will finetune our language model on complete train and test set**

In [0]:
import os
import numpy as np
import pandas as pd
import re
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
import torch
from keras.preprocessing.sequence import pad_sequences
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from tqdm import  tqdm_notebook

Using TensorFlow backend.


In [0]:
directory_path = '/content/drive/My Drive/Bert Training'## we will store our data in this drive

In [1]:
##uncomment below lines to import Google drive
# from google.colab import drive
# drive.mount('/content/drive')

In [0]:
train_df = pd.read_csv(os.path.join(directory_path,'labeledTrainData.tsv'),delimiter='\t')
test_df = pd.read_csv(os.path.join(directory_path,'testData.tsv'),delimiter='\t')

In [0]:
train_df.shape,test_df.shape

((25000, 3), (25000, 2))

In [0]:
train_df.head(n=2)

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."


In [0]:
lm_df = pd.concat([train_df[['review']],test_df[['review']]])

In [0]:
lm_df.review = lm_df.review.str.lower()

In [0]:
lm_df.question_text=lm_df.question_text.str.lower()

In [0]:
lm_df.head(n=2)

Unnamed: 0,review
0,with all this stuff going down at the moment w...
1,"\the classic war of the worlds\"" by timothy hi..."


In [0]:
tqdm.pandas()
changed_text=lm_df.review.apply(lambda x:x+"\n"+"\n")

In [0]:
open(os.path.join(directory_path,'data_lm.txt'), "w").write(''.join(changed_text))

65703846

**Now, data is ready in the desired format to make it suitable for pregenerate_training_data.py** . clone https://github.com/huggingface/pytorch-transformers in the same directory as this Jupyter notebook in google Drive

## Now, Finetuning Language model on the data

In [0]:
!pip install pytorch_transformers
from pytorch_transformers import *

Collecting pytorch_transformers
[?25l  Downloading https://files.pythonhosted.org/packages/a3/b7/d3d18008a67e0b968d1ab93ad444fc05699403fa662f634b2f2c318a508b/pytorch_transformers-1.2.0-py3-none-any.whl (176kB)
[K     |████████████████████████████████| 184kB 3.4MB/s 
[?25hCollecting sacremoses (from pytorch_transformers)
[?25l  Downloading https://files.pythonhosted.org/packages/df/24/0b86f494d3a5c7531f6d0c77d39fd8f9d42e651244505d3d737e31db9a4d/sacremoses-0.0.33.tar.gz (802kB)
[K     |████████████████████████████████| 808kB 37.8MB/s 
Collecting sentencepiece (from pytorch_transformers)
[?25l  Downloading https://files.pythonhosted.org/packages/14/3d/efb655a670b98f62ec32d66954e1109f403db4d937c50d779a75b9763a29/sentencepiece-0.1.83-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
[K     |████████████████████████████████| 1.0MB 35.8MB/s 
Collecting regex (from pytorch_transformers)
[?25l  Downloading https://files.pythonhosted.org/packages/6f/a6/99eeb5904ab763db87af4bd71d9b1dfdd97926812406

In [0]:
%cd /content/drive/My Drive/Bert Training/pytorch-transformers/examples/lm_finetuning

/content/drive/My Drive/Bert Training/pytorch-transformers/examples/lm_finetuning


In [0]:
import shutil

In [0]:
src = directory_path+"/data_lm.txt" ##copying newly created data to the same finetuned folder.
dst = os.getcwd()
shutil.copy(src, dst)

'/content/drive/My Drive/Bert Training/pytorch-transformers/examples/lm_finetuning/data_lm.txt'

In [0]:
os.listdir(os.getcwd())

['README.md',
 'finetune_on_pregenerated.py',
 'pregenerate_training_data.py',
 'simple_lm_finetuning.py',
 'data_lm.txt']

Now, run **!python3 pregenerate_training_data.py --train_corpus data_lm.txt --bert_model bert-base-uncased --do_lower_case --output_dir training/ --epochs_to_generate 2 --max_seq_len 256**  to get data in Bert Format

In [0]:
!python3 pregenerate_training_data.py --train_corpus data_lm.txt --bert_model bert-base-uncased --do_lower_case --output_dir training/ --epochs_to_generate 2 --max_seq_len 256

100% 231508/231508 [00:00<00:00, 1846622.27B/s]
Loading Dataset: 100000 lines [03:25, 487.38 lines/s]
Epoch:   0% 0/2 [00:00<?, ?it/s]
Document:   0% 0/50000 [00:00<?, ?it/s][A
Document:   0% 154/50000 [00:00<00:32, 1536.77it/s][A
Document:   1% 322/50000 [00:00<00:31, 1576.79it/s][A
Document:   1% 495/50000 [00:00<00:30, 1619.54it/s][A
Document:   1% 659/50000 [00:00<00:30, 1624.97it/s][A
Document:   2% 843/50000 [00:00<00:29, 1682.38it/s][A
Document:   2% 1012/50000 [00:00<00:29, 1684.64it/s][A
Document:   2% 1186/50000 [00:00<00:28, 1695.32it/s][A
Document:   3% 1362/50000 [00:00<00:28, 1710.02it/s][A
Document:   3% 1529/50000 [00:00<00:28, 1696.95it/s][A
Document:   3% 1708/50000 [00:01<00:28, 1718.23it/s][A
Document:   4% 1876/50000 [00:01<00:29, 1627.80it/s][A
Document:   4% 2056/50000 [00:01<00:28, 1675.13it/s][A
Document:   4% 2235/50000 [00:01<00:27, 1707.34it/s][A
Document:   5% 2406/50000 [00:01<00:28, 1685.20it/s][A
Document:   5% 2583/50000 [00:01<00:27, 170

In [0]:
os.listdir(os.getcwd())

['README.md',
 'finetune_on_pregenerated.py',
 'pregenerate_training_data.py',
 'simple_lm_finetuning.py',
 'data_lm.txt',
 'training']

You can now finetune your language model on the data created in desired format using the script finetune_on_pregenerated.py . I have used bert-base-uncased model for that purpose. I have used batch size of 16 due to memory limitation on Colab. If you use Paperspace or AWS, bacth size of 32 would work. We can do more experiments to get better results. 

In [0]:
!python3 finetune_on_pregenerated.py --pregenerated_data training/ --bert_model bert-base-uncased --do_lower_case --train_batch_size 16  --output_dir finetuned_lm/ --epochs 2

2019-09-06 10:08:37,535: device: cuda n_gpu: 1, distributed training: False, 16-bits training: False
2019-09-06 10:08:37,705: loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.cache/torch/pytorch_transformers/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
2019-09-06 10:08:37,885: https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json not found in cache or force_download set to True, downloading to /tmp/tmpzxnds003
100% 313/313 [00:00<00:00, 236543.63B/s]
2019-09-06 10:08:38,052: copying /tmp/tmpzxnds003 to cache at /root/.cache/torch/pytorch_transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.bf3b9ea126d8c0001ee8a1e8b92229871d06d36d8808208cc2449280da87785c
2019-09-06 10:08:38,053: creating metadata file for /root/.cache/torch/pytorch_transformers/4dad0251492946e18ac39290fcfe91b89d370fee250

At this point our **Language model has been created** . We will now use this finetuned language model and pretrained language model with simple neural network layers on top and would compare results. 

Firstly, we would use direct pretrained model than we will use finetuned language model for classififcation task. 

In [0]:
np.random.seed(56)

Our task is of sequence classification, we would use https://huggingface.co/pytorch-transformers/model_doc/bert.html#bertforsequenceclassification . This expects our sentences to be started  and terminated with **cls** and **sep** respectively. We will firstly reformat our reviews or text column in this format.

In [0]:
train_df.review = train_df.review.str.lower()
sentences = train_df.review.values

# We need to add special tokens at the beginning and end of each sentence for BERT to work properly
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
labels = train_df.sentiment.values

Then, we would use **BertTokenizer** to tokenize our text data in the Bert format. For a given token or word, tokenizer will keep the word as such if the token is found in Bert's vocabulary else it will find the small subword which will be in Bert's vocabulary. '##' in tokenized text means it is not independent word but subword of another word. 

In [0]:
%%time
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print ("Tokenize the first sentence:")
print (tokenized_texts[0])

100%|██████████| 231508/231508 [00:00<00:00, 2633232.26B/s]


Tokenize the first sentence:
['[CLS]', 'with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'm', '##j', 'i', "'", 've', 'started', 'listening', 'to', 'his', 'music', ',', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', ',', 'watched', 'the', 'wi', '##z', 'and', 'watched', 'moon', '##walker', 'again', '.', 'maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent', '.', 'moon', '##walker', 'is', 'part', 'biography', ',', 'part', 'feature', 'film', 'which', 'i', 'remember', 'going', 'to', 'see', 'at', 'the', 'cinema', 'when', 'it', 'was', 'originally', 'released', '.', 'some', 'of', 'it', 'has', 'subtle', 'messages', 'about', 'm', '##j', "'", 's', 'feeling', 'towards', 'the', 'press', 'and', 'also', 'the', 'obvious', 'message', 'of', 'drugs

In [0]:
len(tokenized_texts[0])

544

We will now convert tokens to word id's as per Bert's vocabulary. We allso need to pad the sequences. 

In [0]:
input_ids=[]
for i in tqdm_notebook(range(len(tokenized_texts))):
  input_ids.append(tokenizer.convert_tokens_to_ids(tokenized_texts[i]))

HBox(children=(IntProgress(value=0, max=25000), HTML(value='')))

Token indices sequence length is longer than the specified maximum sequence length for this model (544 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (520 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (553 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (549 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (546 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for thi

In [0]:
MAX_LEN = 256
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

For a given tokenized text, we also need to distinguish whether it is a part of token or padding. This is done using **attention masks**

In [0]:
#Create attention masks
attention_masks = []
# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
  seq_mask = [float(i>0) for i in seq]
  attention_masks.append(seq_mask)

We will use our training data to split between training and val data and will test our data on val data itself, because for test data we don't have labels.

Now, we need to do routine Pytorch training. We will first make dataloader and then just rest is simply Pytorch's standard routine. Pytorch's Blitz is sufficient to get one familiar with Pytorch.

In [0]:
# Use train_test_split to split our data into train and validation sets for training

train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, labels,random_state=56, test_size=0.2)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, input_ids,random_state=56, test_size=0.2)

In [0]:
#Convert all of our data into torch tensors, the required datatype for our model

train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

In [0]:
batch_size = 16

# Create an iterator of our data with torch DataLoader. This helps save on memory during training because, unlike a for loop, 
# with an iterator the entire dataset does not need to be loaded into memory

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)


In [0]:
## Define model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# device ='cpu'
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=2)
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm(torch.Size([768]), eps=1e-12, elementwise_aff

**Pytorch transformers** was previously called as  **pytorch-pretrained-bert**. There are some modifications in migrating from previous version to new updated one. This has been written on new version.

In [0]:
lr = 2e-5
max_grad_norm = 1.0
num_total_steps = 1000
num_warmup_steps = 100
warmup_proportion = float(num_warmup_steps) / float(num_total_steps)  # 0.1


### In PyTorch-Transformers, optimizer and schedules are splitted and instantiated like this:
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)  # To reproduce BertAdam specific behavior set correct_bias=False
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_total=num_total_steps)  # PyTorch scheduler

In [0]:
len(train_dataloader)

1250

In [0]:
total_step = len(train_dataloader)

# Store our loss and accuracy for plotting
train_loss_set = []


epochs = 2

# trange is a tqdm wrapper around the normal python range
for epoch in tqdm_notebook(range(epochs)):
  
  

    # Training
    # Set our model to training mode (as opposed to evaluation mode)
    model.train()

    # Tracking variables
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0

    # Train the data for one epoch
    for i, batch in enumerate(train_dataloader):
      # Add batch to GPU
      batch = tuple(t.to(device) for t in batch)
      # Unpack the inputs from our dataloader
      b_input_ids, b_input_mask, b_labels = batch
      # Forward pass
      outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
      loss = outputs[0]
      train_loss_set.append(loss.item())    
      # Backward pass
      loss.backward()
      # Update parameters and take a step using the computed gradient
      optimizer.step()
      scheduler.step()
      optimizer.zero_grad()
      if (i) % 50 == 0:
        print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}' 
                   .format(epoch+1, epochs, i+1, total_step, loss.item()))
      

HBox(children=(IntProgress(value=0, max=2), HTML(value='')))

Epoch [1/2], Step [1/1250], Loss: 0.7326
Epoch [1/2], Step [51/1250], Loss: 0.5696
Epoch [1/2], Step [101/1250], Loss: 0.5504
Epoch [1/2], Step [151/1250], Loss: 0.2354
Epoch [1/2], Step [201/1250], Loss: 0.2650
Epoch [1/2], Step [251/1250], Loss: 0.4311
Epoch [1/2], Step [301/1250], Loss: 0.3085
Epoch [1/2], Step [351/1250], Loss: 0.2568
Epoch [1/2], Step [401/1250], Loss: 0.1055
Epoch [1/2], Step [451/1250], Loss: 0.3037
Epoch [1/2], Step [501/1250], Loss: 0.3515
Epoch [1/2], Step [551/1250], Loss: 0.2150
Epoch [1/2], Step [601/1250], Loss: 0.2704
Epoch [1/2], Step [651/1250], Loss: 0.4486
Epoch [1/2], Step [701/1250], Loss: 0.0981
Epoch [1/2], Step [751/1250], Loss: 0.2061
Epoch [1/2], Step [801/1250], Loss: 0.1373
Epoch [1/2], Step [851/1250], Loss: 0.0930
Epoch [1/2], Step [901/1250], Loss: 0.2151
Epoch [1/2], Step [951/1250], Loss: 0.1000
Epoch [1/2], Step [1001/1250], Loss: 0.2520
Epoch [1/2], Step [1051/1250], Loss: 0.1063
Epoch [1/2], Step [1101/1250], Loss: 0.1929
Epoch [1/2]

In [0]:
torch.save(model.state_dict(), directory_path+'/model_without_language_model.ckpt')

In [0]:
# Test the model
with torch.no_grad():
    correct = 0
    total = 0
    for i, batch in enumerate(validation_dataloader):
      batch = tuple(t.to(device) for t in batch)
      # Unpack the inputs from our dataloader
      b_input_ids, b_input_mask, b_labels = batch
      # Forward pass
      outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
      # print (outputs)
      prediction = torch.argmax(outputs[0],dim=1)
      total += b_labels.size(0)
      correct+=(prediction==b_labels).sum().item()

Test Accuracy of the model on the 10000 test images: 90.76 %


In [0]:
print('Test Accuracy of the model on val data is: {} %'.format(100 * correct / total)) 

Test Accuracy of the model on val data is: 90.76 %


Now, we will use finetuned language model. Only  difference in code would be to change tokenizer and model path.

**Using finetuned language model**

In [0]:
directory_path+"/pytorch-transformers/examples/lm_finetuning/finetuned_lm")

['config.json',
 'pytorch_model.bin',
 'tokenizer_config.json',
 'special_tokens_map.json',
 'added_tokens.json',
 'vocab.txt']

In [0]:
model_path=directory_path+"/pytorch-transformers/examples/lm_finetuning/finetuned_lm"## model is stored at this directory. 

In [0]:
%%time
tokenizer = BertTokenizer.from_pretrained(model_path)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print ("Tokenize the first sentence:")
print (tokenized_texts[0])

Tokenize the first sentence:
['[CLS]', 'with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'm', '##j', 'i', "'", 've', 'started', 'listening', 'to', 'his', 'music', ',', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', ',', 'watched', 'the', 'wi', '##z', 'and', 'watched', 'moon', '##walker', 'again', '.', 'maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent', '.', 'moon', '##walker', 'is', 'part', 'biography', ',', 'part', 'feature', 'film', 'which', 'i', 'remember', 'going', 'to', 'see', 'at', 'the', 'cinema', 'when', 'it', 'was', 'originally', 'released', '.', 'some', 'of', 'it', 'has', 'subtle', 'messages', 'about', 'm', '##j', "'", 's', 'feeling', 'towards', 'the', 'press', 'and', 'also', 'the', 'obvious', 'message', 'of', 'drugs

In [0]:
input_ids=[]
for i in tqdm_notebook(range(len(tokenized_texts))):
  input_ids.append(tokenizer.convert_tokens_to_ids(tokenized_texts[i]))
# input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

HBox(children=(IntProgress(value=0, max=25000), HTML(value='')))

Token indices sequence length is longer than the specified maximum sequence length for this model (544 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (520 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (553 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (549 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (546 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for thi

In [0]:
MAX_LEN = 256
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

In [0]:
#Create attention masks
attention_masks = []

# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
  seq_mask = [float(i>0) for i in seq]
  attention_masks.append(seq_mask)

# Use train_test_split to split our data into train and validation sets for training

train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, labels,random_state=56, test_size=0.2)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, input_ids,random_state=56, test_size=0.2)

#Convert all of our data into torch tensors, the required datatype for our model

train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

batch_size = 16

# Create an iterator of our data with torch DataLoader. This helps save on memory during training because, unlike a for loop, 
# with an iterator the entire dataset does not need to be loaded into memory

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

In [0]:
## Define model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# device ='cpu'
model = BertForSequenceClassification.from_pretrained(model_path,num_labels=2)
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm(torch.Size([768]), eps=1e-12, elementwise_aff

In [0]:
lr = 2e-5
max_grad_norm = 1.0
num_total_steps = 1000
num_warmup_steps = 100
warmup_proportion = float(num_warmup_steps) / float(num_total_steps)  # 0.1


### In PyTorch-Transformers, optimizer and schedules are splitted and instantiated like this:
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)  # To reproduce BertAdam specific behavior set correct_bias=False
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=num_warmup_steps, t_total=num_total_steps)  # PyTorch scheduler


In [0]:
total_step = len(train_dataloader)

# Store our loss and accuracy for plotting
train_loss_set = []

# Number of training epochs (authors recommend between 2 and 4)
epochs = 2

# trange is a tqdm wrapper around the normal python range
for epoch in tqdm_notebook(range(epochs)):
  
  

    # Training
    # Set our model to training mode (as opposed to evaluation mode)
    model.train()

    # Tracking variables
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0

    # Train the data for one epoch
    for i, batch in enumerate(train_dataloader):
      # Add batch to GPU
      batch = tuple(t.to(device) for t in batch)
      # Unpack the inputs from our dataloader
      b_input_ids, b_input_mask, b_labels = batch
      # Forward pass
      outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
      loss = outputs[0]
      train_loss_set.append(loss.item())    
      # Backward pass
      loss.backward()
      # Update parameters and take a step using the computed gradient
      optimizer.step()
      scheduler.step()
      optimizer.zero_grad()
      if (i) % 50 == 0:
        print ('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}' 
                   .format(epoch+1, epochs, i+1, total_step, loss.item()))

torch.save(model.state_dict(), directory_path+'/model_with_language_model.ckpt')

HBox(children=(IntProgress(value=0, max=2), HTML(value='')))

Epoch [1/2], Step [1/1250], Loss: 0.8097
Epoch [1/2], Step [51/1250], Loss: 0.8031
Epoch [1/2], Step [101/1250], Loss: 0.6182
Epoch [1/2], Step [151/1250], Loss: 0.3168
Epoch [1/2], Step [201/1250], Loss: 0.1880
Epoch [1/2], Step [251/1250], Loss: 0.1677
Epoch [1/2], Step [301/1250], Loss: 0.2950
Epoch [1/2], Step [351/1250], Loss: 0.1150
Epoch [1/2], Step [401/1250], Loss: 0.0999
Epoch [1/2], Step [451/1250], Loss: 0.0788
Epoch [1/2], Step [501/1250], Loss: 0.0594
Epoch [1/2], Step [551/1250], Loss: 0.3012
Epoch [1/2], Step [601/1250], Loss: 0.1296
Epoch [1/2], Step [651/1250], Loss: 0.1388
Epoch [1/2], Step [701/1250], Loss: 0.5235
Epoch [1/2], Step [751/1250], Loss: 0.2360
Epoch [1/2], Step [801/1250], Loss: 0.3495
Epoch [1/2], Step [851/1250], Loss: 0.3014
Epoch [1/2], Step [901/1250], Loss: 0.2131
Epoch [1/2], Step [951/1250], Loss: 0.1677
Epoch [1/2], Step [1001/1250], Loss: 0.0486
Epoch [1/2], Step [1051/1250], Loss: 0.2029
Epoch [1/2], Step [1101/1250], Loss: 0.1197
Epoch [1/2]

In [0]:
torch.save(model.state_dict(), directory_path+'/model_with_language_model.ckpt')

In [0]:
# Test the model
with torch.no_grad():
    correct = 0
    total = 0
    for i, batch in enumerate(validation_dataloader):
      batch = tuple(t.to(device) for t in batch)
      # Unpack the inputs from our dataloader
      b_input_ids, b_input_mask, b_labels = batch
      # Forward pass
      outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
      # print (outputs)
      prediction = torch.argmax(outputs[0],dim=1)
      total += b_labels.size(0)
      correct+=(prediction==b_labels).sum().item()

In [0]:
print('Test Accuracy of the model on val data is: {} %'.format(100 * correct / total)) 

Test Accuracy of the model on val data is: 90.9 %


## Observations

We can see that with language model, score improved by  0.15 percent. See, that our data is basically reviews written in English language, this is pretty similar to tha task on which Bert was trained. You can see significant result, if your data is relatively different than data on which Bert was pretrained. For example if you have lot of data between customer and client in their native language then this approach could excel. 