#**Implementation of BERT on the IMDB dataset**
The following notebook illustrates the implementation of BERT on the IMDB dataset.
Upload the `IMDB Dataset.csv` file before running the notebook.

#**Initializing Variables and Setting up the Environment**

In [1]:
# Set environment seed
import os
os.environ['PYTHONHASHSEED']=str(1)

In [2]:
!pip install pytorch-nlp
!pip install pytorch-pretrained-bert
#!pip install sklearn

Collecting pytorch-nlp
Installing collected packages: pytorch-nlp
Successfully installed pytorch-nlp-0.5.0
Collecting pytorch-pretrained-bert

In [3]:
import sys
import numpy as np
import os
import random as rn
import pandas as pd
from tqdm import tqdm
import torch
from sklearn.model_selection import train_test_split
from pytorch_pretrained_bert import BertModel
from torch import nn
# from torchnlp.datasets import imdb_dataset      # --> We are using our own uploaded dataset.
from pytorch_pretrained_bert import BertTokenizer
from keras.preprocessing.sequence import pad_sequences
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.optim import Adam
from torch.nn.utils import clip_grad_norm_
from IPython.display import clear_output
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
%matplotlib inline

In [4]:
def reset_random_seeds():
    '''
    Sets all necessary seeds for reproducibility.
    '''
    os.environ['PYTHONHASHSEED']=str(1)
    rn.seed(1)                 # Python's built-in random module (imported as rn)
    np.random.seed(1)
    torch.manual_seed(1)
    torch.cuda.manual_seed(1)
    
reset_random_seeds()
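
As a quick illustration of what seeding buys, the sketch below uses only Python's standard `random` module (an assumption made for the example; the notebook itself seeds NumPy and PyTorch as well). The helper `draw` is hypothetical and simply shows that the same seed reproduces the same draws.

```python
import random

def draw(seed, n=5):
    """Return n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)   # private generator, global state is untouched
    return [rng.randint(0, 99) for _ in range(n)]

first = draw(1)
second = draw(1)
other = draw(2)

print(first == second)   # same seed, same sequence
print(first == other)    # a different seed will almost certainly differ
```

The same idea applies to the train/test split and to weight initialisation: fixing the seed makes a rerun of the notebook comparable to the previous one, although GPU kernels can still introduce small nondeterministic differences.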

#**Data Preprocessing**

In [5]:
#Reading the csv dataset
data = pd.read_csv('IMDB Dataset.csv')
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [6]:
# Splitting the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(data['review'], data['sentiment'], 
                                                    test_size=0.2, random_state=0, 
                                                    stratify=data['sentiment'])

x_test = x_test.to_list()
y_test = y_test.to_list()

In [7]:
# Create a new df using the train set.
data = {
    'review':x_train,
    'sentiment':y_train
}

data_split = pd.DataFrame(data)

In [8]:
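# Create a dictionary of training subsets (x_train, y_train) for 11 different split sizes.
# The "test" side of each split is kept as the subset, so test_size is the fraction retained.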
data_dict = {}
split_size = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
for split in split_size:
    _, x_train, _, y_train = train_test_split(data_split['review'], data_split['sentiment'], 
                                                    test_size=split, random_state=0, 
                                                    stratify=data_split['sentiment'])
    df = {
        'review':x_train,
        'sentiment':y_train
    }
    
    df = pd.DataFrame(df)
    data_dict[split] = df
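
To make the effect of `stratify` concrete, here is a minimal pure-Python sketch of stratified subsampling. It is a simplified stand-in, not scikit-learn's actual algorithm, and the helper `stratified_sample` and the toy label list are assumptions introduced for illustration.

```python
import random
from collections import Counter

def stratified_sample(labels, fraction, seed=0):
    """Return sorted indices covering `fraction` of each class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    chosen = []
    for indices in by_class.values():
        k = round(len(indices) * fraction)   # same share taken from every class
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Toy balanced labels standing in for data_split['sentiment'].
labels = ['positive', 'negative'] * 50
subset = stratified_sample(labels, 0.3)
counts = Counter(labels[i] for i in subset)
print(len(subset), counts)
```

Because every class contributes the same fraction, each subset in `data_dict` keeps the 50/50 class balance of the full training pool regardless of its size.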

In [9]:
# Specifying the split size (0.7 of the 40,000 training reviews gives 28,000 reviews)
data_dict.keys()
split = 0.7

In [10]:
#Obtaining the x_train and y_train for the respective split size 
x_train = data_dict[split]['review']
y_train = data_dict[split]['sentiment']
print(x_train)
print(y_train)
y_train = y_train.to_list()
x_train = x_train.to_list()

34919    This game was really great and quite a challen...
32093    Superb comic farce from Paul Mazursky, Richard...
23874    A brash, self-centered Army cadet arrives at W...
7297     This one and "Her Pilgrim Soul" are two of my ...
47940    David Tennant and Sarah Parish's brilliant act...
                               ...                        
21892    demonicus rocked, you guys need to understand ...
28814    Okay, now I am pretty sure that my summary got...
44744    "Laughter is a state of mind" says the tag, an...
9248     I had initially heard of TEARS OF KALI a while...
45490    I finally snagged a copy of Kannathil Muthamit...
Name: review, Length: 28000, dtype: object
34919    positive
32093    positive
23874    positive
7297     positive
47940    positive
           ...   
21892    positive
28814    negative
44744    negative
9248     positive
45490    positive
Name: sentiment, Length: 28000, dtype: object


In [11]:
#Printing the number of entries in train and test
len(x_train), len(y_train), len(x_test), len(y_test)

(28000, 28000, 10000, 10000)

#**Tokenising**


In [12]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

100%|██████████| 231508/231508 [00:00<00:00, 19362211.97B/s]


In [13]:
tokenizer.tokenize('Hi my name is Atul')

['hi', 'my', 'name', 'is', 'at', '##ul']

In [14]:
train_tokens = list(map(lambda t: ['[CLS]'] + tokenizer.tokenize(t)[:510] + ['[SEP]'], x_train))
test_tokens = list(map(lambda t: ['[CLS]'] + tokenizer.tokenize(t)[:510] + ['[SEP]'], x_test))

len(train_tokens), len(test_tokens)

(28000, 10000)
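
The slice `[:510]` leaves room for the two special tokens within BERT's 512-token limit. The sketch below shows that arithmetic with a plain whitespace split standing in for the WordPiece tokenizer; the helper name `add_special_tokens` is hypothetical.

```python
MAX_LEN = 512  # maximum sequence length accepted by bert-base-uncased

def add_special_tokens(tokens, max_len=MAX_LEN):
    """Truncate so that [CLS] + tokens + [SEP] fits in max_len positions."""
    body = tokens[:max_len - 2]          # 510 tokens when max_len is 512
    return ['[CLS]'] + body + ['[SEP]']

short = add_special_tokens('a wonderful little production'.split())
long = add_special_tokens(['word'] * 1000)

print(short)
print(len(long))
```

Reviews longer than 510 word pieces therefore lose their tail; only the beginning of each long review reaches the model.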

#**Vectorising and masking inputs**

In [15]:
# Converting tokens to ids, then padding/truncating every sequence to length 512: shape (no. of records, 512)
train_tokens_ids = pad_sequences(list(map(tokenizer.convert_tokens_to_ids, train_tokens)), maxlen=512, truncating="post", padding="post", dtype="int")
test_tokens_ids = pad_sequences(list(map(tokenizer.convert_tokens_to_ids, test_tokens)), maxlen=512, truncating="post", padding="post", dtype="int")

train_tokens_ids.shape, test_tokens_ids.shape

((28000, 512), (10000, 512))

In [16]:
# Attention masks: 1.0 for real tokens, 0.0 for padding (the [PAD] id is 0)
train_masks = [[float(i > 0) for i in ii] for ii in train_tokens_ids]
test_masks = [[float(i > 0) for i in ii] for ii in test_tokens_ids]
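
The padding and masking steps can be sketched without any library. The helpers `pad_post` and `attention_mask` below are hypothetical names that mimic post-padding and the `i > 0` mask rule used in this notebook; the middle token ids are illustrative only.

```python
PAD_ID = 0  # id of the [PAD] token in the bert-base-uncased vocabulary

def pad_post(ids, max_len, pad_id=PAD_ID):
    """Truncate at the end, then pad at the end, up to max_len."""
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

def attention_mask(padded):
    """1.0 where there is a real token, 0.0 where there is padding."""
    return [float(i > 0) for i in padded]

ids = [101, 7, 8, 102]           # [CLS], two illustrative ids, [SEP]
padded = pad_post(ids, max_len=6)
mask = attention_mask(padded)

print(padded)
print(mask)
```

The mask tells BERT's attention layers to ignore the padded positions, so the padding length does not influence the prediction.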

In [17]:
# converting the labels from positive and negative to true and false
train_y = np.array(y_train) == 'positive'
test_y = np.array(y_test) == 'positive'
print(train_y.shape, test_y.shape, np.mean(train_y), np.mean(test_y))
train_y[1:20]
test_y[1:20]

(28000,) (10000,) 0.5 0.5


array([ True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True, False, False,  True, False, False, False, False,
       False])

In [18]:
# Double-checking that the labels are 50% False and 50% True for both train and test
from collections import Counter
Counter(train_y)
Counter(test_y)

Counter({False: 5000, True: 5000})

#**BERT Model**
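BERT stands for Bidirectional Encoder Representations from Transformers. The key takeaway for this notebook is that BERT is based on the Transformer architecture. The classifier below takes the pooled output of a pretrained `bert-base-uncased` encoder and passes it through dropout, a single linear unit, and a sigmoid to produce the probability that a review is positive.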

In [20]:
class BertBinaryClassifier(nn.Module):
    def __init__(self, dropout=0.1):
        super(BertBinaryClassifier, self).__init__()

        self.bert = BertModel.from_pretrained('bert-base-uncased')

        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, 1)
        self.sigmoid = nn.Sigmoid()
    
    def forward(self, tokens, masks=None):
        _, pooled_output = self.bert(tokens, attention_mask=masks, output_all_encoded_layers=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        proba = self.sigmoid(linear_output)
        return proba
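
The classification head on top of BERT is just a dot product followed by a sigmoid. The following sketch reproduces that computation in plain Python, under the assumption of a made-up 3-dimensional vector in place of the 768-dimensional pooled output and made-up weights; dropout is omitted because it is inactive at evaluation time.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_head(pooled, weights, bias):
    """Equivalent of nn.Linear(d, 1) followed by nn.Sigmoid for one example."""
    z = sum(p * w for p, w in zip(pooled, weights)) + bias
    return sigmoid(z)

pooled = [0.5, -1.0, 2.0]     # toy stand-in for BERT's pooled output
weights = [0.2, 0.4, 0.1]     # illustrative weights, not trained values
bias = 0.1

proba = linear_head(pooled, weights, bias)
print(proba)
```

A score near zero maps to a probability near 0.5, which is why the untrained head in this notebook initially outputs values close to 0.5 for every review.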

In [21]:
# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [22]:
str(torch.cuda.memory_allocated(device)/1000000 ) + 'M'

'0.0M'

In [23]:
# Initialising the BERT model
bert_clf = BertBinaryClassifier()
bert_clf = bert_clf.to(device)     # move the model to the GPU when one is available

100%|██████████| 407873900/407873900 [00:32<00:00, 12518281.19B/s]


In [24]:
str(torch.cuda.memory_allocated(device)/1000000 ) + 'M'

'439.065088M'

In [25]:
x = torch.tensor(train_tokens_ids[:3]).to(device)
y, pooled = bert_clf.bert(x, output_all_encoded_layers=False)
x.shape, y.shape, pooled.shape

(torch.Size([3, 512]), torch.Size([3, 512, 768]), torch.Size([3, 768]))

In [26]:
y = bert_clf(x)
y.cpu().detach().numpy()        # copy the predictions to the CPU as a NumPy array (this does not free GPU memory)

array([[0.5572745],
       [0.663245 ],
       [0.5770542]], dtype=float32)

In [27]:
# Cross-checking CUDA GPU memory to ensure it is not overflowing.
str(torch.cuda.memory_allocated(device)/1000000 ) + 'M'

'6697.349632M'

In [28]:
y, x, pooled = None, None, None
torch.cuda.empty_cache()     # Clearing Cache space for fresh Model run
str(torch.cuda.memory_allocated(device)/1000000 ) + 'M'

'439.065088M'

In [29]:
# Setting hyper-parameters

BATCH_SIZE = 8
EPOCHS = 5

In [30]:
# Converting inputs, labels and masks to tensors for the train and test sets
train_tokens_tensor = torch.tensor(train_tokens_ids)
train_y_tensor = torch.tensor(train_y.reshape(-1, 1)).float()

test_tokens_tensor = torch.tensor(test_tokens_ids)
test_y_tensor = torch.tensor(test_y.reshape(-1, 1)).float()

train_masks_tensor = torch.tensor(train_masks)
test_masks_tensor = torch.tensor(test_masks)

str(torch.cuda.memory_allocated(device)/1000000 ) + 'M'

'439.065088M'

In [31]:
# Initialising dataloaders for the train and test tensors
train_dataset = TensorDataset(train_tokens_tensor, train_masks_tensor, train_y_tensor)
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=BATCH_SIZE)

test_dataset = TensorDataset(test_tokens_tensor, test_masks_tensor, test_y_tensor)
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=BATCH_SIZE)
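
To see how many steps an epoch takes, the sketch below reproduces sequential batching in plain Python. The generator `sequential_batches` is a hypothetical helper that mirrors how a sequential sampler with a batch size walks through a dataset; it is not PyTorch's implementation.

```python
import math

def sequential_batches(n_items, batch_size):
    """Yield lists of consecutive indices, the last one possibly shorter."""
    for start in range(0, n_items, batch_size):
        yield list(range(start, min(start + batch_size, n_items)))

# 28,000 training reviews with BATCH_SIZE = 8
n_batches = math.ceil(28000 / 8)
print(n_batches)

# Small example with a ragged final batch
batches = list(sequential_batches(20, 8))
print([len(b) for b in batches])
```

The training loader differs only in that `RandomSampler` shuffles the indices first; the number of batches per epoch is the same.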

In [32]:
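# Note: nn.Sigmoid has no learnable parameters, so param_optimizer is empty;
# the optimizer in the next cell uses bert_clf.parameters() instead.
param_optimizer = list(bert_clf.sigmoid.named_parameters()) 
optimizer_grouped_parameters = [{"params": [p for n, p in param_optimizer]}]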

In [33]:
#Defining Optimizer
optimizer = Adam(bert_clf.parameters(), lr=3e-6)

In [34]:
torch.cuda.empty_cache()   # Clearing Cache space for a fresh Model run

In [None]:
# Training the model
val_acc = []
val_loss = []
train_losses = []

# BCELoss expects probabilities, which the model's sigmoid already provides
loss_func = nn.BCELoss()

for epoch_num in range(EPOCHS):
    bert_clf.train()
    train_loss = 0
    for step_num, batch_data in enumerate(train_dataloader):
        token_ids, masks, labels = tuple(t.to(device) for t in batch_data)
        print(str(torch.cuda.memory_allocated(device)/1000000 ) + 'M')
        # Despite the name, these are sigmoid probabilities, not raw logits
        logits = bert_clf(token_ids, masks)

        batch_loss = loss_func(logits, labels)
        train_loss += batch_loss.item()

        # Reset gradients, backpropagate, clip the gradient norm, then update
        bert_clf.zero_grad()
        batch_loss.backward()
        clip_grad_norm_(parameters=bert_clf.parameters(), max_norm=1.0)
        optimizer.step()

        clear_output(wait=True)
        print('Epoch: ', epoch_num + 1)
        print("\r" + "{0}/{1} loss: {2} ".format(step_num, len(x_train) / BATCH_SIZE, train_loss / (step_num + 1)))
    # Mean training loss over the epoch
    train_losses.append(train_loss / (step_num + 1))

Epoch:  1
973/3500.0 loss: 0.36442436724977695 
1816.352256M
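
To help interpret the loss values printed during training, here is a minimal sketch of binary cross-entropy in plain Python. The function `bce` is a hypothetical helper; the `eps` clamp is a choice made for this sketch to avoid `log(0)` and is not how `nn.BCELoss` handles that edge case internally.

```python
import math

def bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)

labels = [1.0, 0.0]
confident_right = bce([0.9, 0.1], labels)
uninformed = bce([0.5, 0.5], labels)      # always predicting 0.5
confident_wrong = bce([0.1, 0.9], labels)

print(confident_right, uninformed, confident_wrong)
```

A model that always outputs 0.5 scores ln 2 ≈ 0.693, so the running loss of about 0.364 shown above already indicates predictions that are better than chance.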


In [None]:
# Save recorded train loss into a CSV file
pd.DataFrame(np.array(train_losses),
                   columns=['Loss']).to_csv('./train_loss.csv')

In [None]:
# Evaluate the BERT model on the test set
bert_clf.eval()
bert_predicted = []
all_logits = []
loss = []
loss_func = nn.BCELoss()
with torch.no_grad():
    for step_num, batch_data in enumerate(test_dataloader):

        token_ids, masks, labels = tuple(t.to(device) for t in batch_data)

        logits = bert_clf(token_ids, masks)
        loss.append(loss_func(logits, labels).item())   # store a Python float, not a GPU tensor
        numpy_logits = logits.cpu().detach().numpy()
        
        # Threshold the predicted probabilities at 0.5
        bert_predicted += list(numpy_logits[:, 0] > 0.5)
        all_logits += list(numpy_logits[:, 0])
print(sum(loss)/len(loss))   # mean test loss over batches

In [None]:
# Fraction of test reviews predicted positive. The test set is balanced, so
# >0.5 means that the model is biased towards positive sentiment and
# <0.5 means that the model is biased towards negative sentiment
np.mean(bert_predicted)

In [None]:
# Get the classification report and save it to a CSV file
print(classification_report(test_y, bert_predicted))
report = classification_report(test_y, bert_predicted,output_dict=True)
df = pd.DataFrame(data=report).transpose()
df.to_csv('./report.csv')
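
For readers who want to see what the report summarises, the sketch below computes accuracy, precision, recall and the positive-prediction rate by hand for the positive class. The function `binary_report` and the four toy examples are assumptions for illustration; they are not results from this notebook.

```python
def binary_report(y_true, probs, threshold=0.5):
    """Threshold probabilities and compute basic metrics for the positive class."""
    preds = [p > threshold for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t and p)
    fp = sum(1 for t, p in zip(y_true, preds) if not t and p)
    fn = sum(1 for t, p in zip(y_true, preds) if t and not p)
    tn = sum(1 for t, p in zip(y_true, preds) if not t and not p)
    return {
        'accuracy': (tp + tn) / len(y_true),
        'precision': tp / (tp + fp) if tp + fp else 0.0,
        'recall': tp / (tp + fn) if tp + fn else 0.0,
        'positive_rate': sum(preds) / len(preds),   # analogue of np.mean(bert_predicted)
    }

y_true = [True, True, False, False]   # toy labels
probs = [0.9, 0.4, 0.6, 0.2]          # toy predicted probabilities
report_sketch = binary_report(y_true, probs)
print(report_sketch)
```

`classification_report` additionally reports the same quantities for the negative class, the F1 score, and macro and weighted averages.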