Transformers for Sentiment Analysis

IMPORTING THE LIBRARIES

In [1]:
import torch

import random
import numpy as np

SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

The transformer has already been trained with a specific vocabulary, which means we need to use exactly the same vocabulary and tokenize our data in the same way the transformer did when it was initially trained.

The transformers library has tokenizers for each of the transformer models provided. In this case we are using the BERT model which ignores casing (i.e. will lower case every word). We get this by loading the pre-trained bert-base-uncased tokenizer.

In [2]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

To find the number of tokens in the vocabulary, we take the length of the tokenizer.vocab attribute.

In [3]:
len(tokenizer.vocab)

30522

Using the tokenizer is as simple as calling tokenizer.tokenize on a string. This will tokenize and lower case the data in a way that is consistent with the pre-trained transformer model.

In [4]:
tokens = tokenizer.tokenize('Hello WORLD how ARE yoU?')
print(tokens)

['hello', 'world', 'how', 'are', 'you', '?']
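
BERT's tokenizer is a WordPiece tokenizer, so a word that is missing from the vocabulary is split into smaller pieces that are present. The following is a minimal sketch of the greedy longest-match idea only, not the library's implementation. The wordpiece function and toy_vocab are hypothetical names introduced for illustration, and the real tokenizer also handles lower-casing and punctuation splitting.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lower-cased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a '##' prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, far smaller than BERT's 30,522 entries.
toy_vocab = {"hello", "world", "sc", "##oo", "##by", "doo"}

print(wordpiece("hello", toy_vocab))
print(wordpiece("scooby", toy_vocab))
```

With this toy vocabulary, a word that is present stays whole, while "scooby" is broken into a leading piece plus '##'-prefixed continuation pieces.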


Next we convert the tokens to their numerical indexes in the pre-trained vocabulary using tokenizer.convert_tokens_to_ids.

In [5]:
indexes = tokenizer.convert_tokens_to_ids(tokens)
print(indexes)

[7592, 2088, 2129, 2024, 2017, 1029]
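
Conceptually, this conversion is a dictionary lookup that falls back to the unknown token when a token is not in the vocabulary. The sketch below assumes a tiny slice of the vocabulary built from the indexes printed above, plus the [UNK] index of 100 that the tokenizer reports in this notebook. The tokens_to_ids helper is a hypothetical stand-in, not the library function.

```python
# Toy slice of the vocabulary, using indexes printed in this notebook.
toy_vocab = {"[UNK]": 100, "hello": 7592, "world": 2088, "how": 2129,
             "are": 2024, "you": 2017, "?": 1029}

def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its index, falling back to the unknown token."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]

ids = tokens_to_ids(["hello", "world", "how", "are", "you", "?"], toy_vocab)
print(ids)
print(tokens_to_ids(["qwxz"], toy_vocab))  # a token outside the toy vocabulary
```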


In [6]:
init_token = tokenizer.cls_token
eos_token = tokenizer.sep_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

print(init_token, eos_token, pad_token, unk_token)

[CLS] [SEP] [PAD] [UNK]


In [7]:
init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
eos_token_idx = tokenizer.convert_tokens_to_ids(eos_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

101 102 0 100


In [8]:
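#tokenizer.model_max_length is an alternative if max_model_input_sizes is unavailable in your transformers version
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']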
print(max_input_length)

512


The indexes of the special tokens can also be read directly from the tokenizer's attributes:

In [9]:
init_token_idx = tokenizer.cls_token_id
eos_token_idx = tokenizer.sep_token_id
pad_token_idx = tokenizer.pad_token_id
unk_token_idx = tokenizer.unk_token_id

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

101 102 0 100


As printed by max_input_length above, the maximum length of the input sequence is 512 tokens.

For the RNN and CNN architectures, we used the spaCy word tokenizer.
Here, however, we tokenize with our own function built on the BERT tokenizer. An edge case to consider: each sentence must be cut to at most 510 tokens, not 512, because two extra tokens are added to every sequence, the [CLS] token at the start and the [SEP] token at the end (passed to the Field as init_token and eos_token).

In [10]:
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence)
    #keep at most 510 tokens, leaving room for [CLS] and [SEP]
    tokens = tokens[:max_input_length-2]
    return tokens
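
To see why the cut is at max_input_length-2, the sketch below works on plain lists of placeholder ids rather than real reviews. cut_and_wrap is a hypothetical helper that combines the truncation with the later addition of the two special tokens. The values 512, 101 and 102 are the ones printed earlier in this notebook.

```python
max_input_length = 512          # value reported by the tokenizer above
cls_id, sep_id = 101, 102       # [CLS] and [SEP] indexes printed above

def cut_and_wrap(token_ids):
    """Truncate to leave room for the two special tokens, then add them."""
    body = token_ids[:max_input_length - 2]
    return [cls_id] + body + [sep_id]

long_review = list(range(1000, 1600))   # 600 placeholder token ids
short_review = [7592, 2088]

print(len(cut_and_wrap(long_review)))   # never exceeds max_input_length
print(cut_and_wrap(short_review))
```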

In [11]:
from torchtext import data

TEXT = data.Field(batch_first = True,
                  use_vocab = False,
                  tokenize = tokenize_and_cut,
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  init_token = init_token_idx,
                  eos_token = eos_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

LABEL = data.LabelField(dtype = torch.float)
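
The Field above belongs to torchtext's older (legacy) API. When a batch is built, it wraps every sequence in the init and eos tokens and pads shorter sequences with the pad token. The following is a rough pure-Python sketch of that behaviour, assuming right-padding to the longest sequence in the batch. pad_batch is a hypothetical helper, not a torchtext function.

```python
def pad_batch(batch, init_idx=101, eos_idx=102, pad_idx=0):
    """Wrap each sequence in [CLS]/[SEP] and right-pad to the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [[init_idx] + seq + [eos_idx] + [pad_idx] * (max_len - len(seq))
            for seq in batch]

batch = [[7592, 2088], [2129, 2024, 2017, 1029]]
for row in pad_batch(batch):
    print(row)
```

Every row ends up with the same length, which is what allows the batch to be stacked into a single [batch size, sent len] tensor.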

LOAD THE DATA AND CREATE TRAIN AND VALIDATION SPLITS

In [12]:
from torchtext import datasets

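#IMDB ships with a train set and a test set
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
#random.seed(SEED) returns None; it reseeds Python's global RNG as a side effect before the split
train_data, valid_data = train_data.split(random_state = random.seed(SEED))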

PRINTING THE NUMBER OF EXAMPLES IN EACH SET

In [13]:
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(valid_data)}")
print(f"Number of testing examples: {len(test_data)}")

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000
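
These counts are consistent with a 70/30 split of the 25,000 IMDB training reviews, which is the default split ratio of the legacy torchtext split method. The sketch below reproduces only the arithmetic with a seeded shuffle of indices. split_indices is a hypothetical helper, and it will not select the same examples as torchtext does.

```python
import random

def split_indices(n_examples, split_ratio=0.7, seed=1234):
    """Shuffle example indices and cut them into train and validation parts."""
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)
    n_train = int(round(split_ratio * n_examples))
    return indices[:n_train], indices[n_train:]

train_idx, valid_idx = split_indices(25000)
print(len(train_idx), len(valid_idx))
```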


PRINT A TRAINING EXAMPLE

In [14]:
print(vars(train_data.examples[6]))

{'text': [8040, 9541, 3762, 20160, 2003, 17319, 2028, 1997, 1996, 2087, 3722, 1010, 3144, 1998, 11419, 9476, 3494, 1999, 1996, 2088, 1012, 2061, 1010, 2054, 6433, 2043, 2017, 1005, 2310, 2042, 7249, 1998, 2589, 2673, 2007, 1996, 5675, 1029, 2017, 6942, 2009, 2039, 2157, 1029, 3308, 1012, 2017, 2644, 2537, 1998, 2292, 2009, 2717, 2005, 1037, 5476, 2030, 2061, 1998, 2059, 2448, 2009, 2153, 1010, 4363, 1996, 4563, 1997, 2049, 3112, 10109, 1012, 2008, 2003, 2000, 2360, 1010, 6293, 2007, 1996, 5675, 2005, 1996, 2087, 2112, 2021, 5587, 2115, 3327, 28126, 2000, 2009, 1012, 2023, 2000, 2033, 2003, 2339, 1000, 2054, 1005, 1055, 2047, 8040, 9541, 3762, 20160, 1000, 2499, 1010, 2027, 2215, 2067, 2000, 1996, 4438, 8040, 9541, 3762, 20160, 5675, 2029, 2018, 2069, 5147, 24501, 3126, 12172, 2094, 1037, 5476, 3041, 1999, 1000, 1037, 26781, 2315, 8040, 9541, 3762, 20160, 1000, 2021, 2005, 1996, 2087, 2112, 2018, 2025, 2042, 10410, 2144, 1996, 2434, 1000, 8040, 9541, 3762, 20160, 2073, 2024, 2017, 1000,

The example is stored as a list of vocabulary indexes, so to read it we need to convert the indexes back to tokens.
The tokenizer keeps the mapping from index to token, and convert_ids_to_tokens performs this lookup.

In [15]:
tokens = tokenizer.convert_ids_to_tokens(vars(train_data.examples[6])['text'])
print(tokens)

['sc', '##oo', '##by', 'doo', 'is', 'undoubtedly', 'one', 'of', 'the', 'most', 'simple', ',', 'successful', 'and', 'beloved', 'cartoon', 'characters', 'in', 'the', 'world', '.', 'so', ',', 'what', 'happens', 'when', 'you', "'", 've', 'been', 'everywhere', 'and', 'done', 'everything', 'with', 'the', 'formula', '?', 'you', 'switch', 'it', 'up', 'right', '?', 'wrong', '.', 'you', 'stop', 'production', 'and', 'let', 'it', 'rest', 'for', 'a', 'decade', 'or', 'so', 'and', 'then', 'run', 'it', 'again', ',', 'keeping', 'the', 'core', 'of', 'its', 'success', 'intact', '.', 'that', 'is', 'to', 'say', ',', 'stick', 'with', 'the', 'formula', 'for', 'the', 'most', 'part', 'but', 'add', 'your', 'particular', 'flavour', 'to', 'it', '.', 'this', 'to', 'me', 'is', 'why', '"', 'what', "'", 's', 'new', 'sc', '##oo', '##by', 'doo', '"', 'worked', ',', 'they', 'want', 'back', 'to', 'the', 'classic', 'sc', '##oo', '##by', 'doo', 'formula', 'which', 'had', 'only', 'successfully', 'res', '##ur', '##face', '##

BUILDING A VOCABULARY FOR LABELS

In [16]:
LABEL.build_vocab(train_data)

In [17]:
print(LABEL.vocab.stoi)

defaultdict(None, {'neg': 0, 'pos': 1})


Like the previous models, we want to create iterators and batches to iterate over our data.

In [18]:
BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)
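
BucketIterator tries to place examples of similar length in the same batch so that less padding is needed. The following is a simplified illustration of that idea on made-up review lengths, not torchtext's actual algorithm. bucket_batches and padding_needed are hypothetical helpers.

```python
def bucket_batches(lengths, batch_size):
    """Group example indices so each batch holds sequences of similar length."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def padding_needed(lengths, batches):
    """Count pad tokens when every batch is padded to its longest sequence."""
    return sum(max(lengths[i] for i in b) - lengths[i] for b in batches for i in b)

lengths = [12, 480, 15, 510, 14, 495]           # hypothetical review lengths
naive = [[0, 1, 2], [3, 4, 5]]                  # batches in original order
bucketed = bucket_batches(lengths, batch_size=3)

print(padding_needed(lengths, naive))
print(padding_needed(lengths, bucketed))
```

On this toy input, grouping by length needs far fewer pad tokens than batching in the original order, which matters here because every token, padding included, is passed through BERT.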

BUILD THE MODEL:
We load the pre-trained BERT model from the same bert-base-uncased checkpoint we used for the tokenizer, so the model and the tokenizer share one vocabulary.

In [19]:
from transformers import BertTokenizer, BertModel

bert = BertModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


DEFINING THE MODEL:

1. In the previous approaches we fed the input indexes to an embedding layer to get vector embeddings. Here the pre-trained transformer model produces the embeddings instead.
2. These embeddings are then fed into a GRU, whose final hidden state is passed through a linear layer to produce a prediction for the sentiment of the input sentence.

In [23]:
import torch.nn as nn

In [24]:
class BERTGRUSentiment(nn.Module):
    def __init__(self, bert, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        super().__init__()

        self.bert = bert

        embedding_dim = bert.config.to_dict()['hidden_size']

        self.rnn = nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers = n_layers,
                          bidirectional = bidirectional,
                          batch_first = True,
                          dropout = 0 if n_layers < 2 else dropout) 

        self.out = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):

        #text = [batch size, sent len]
        #BERT is used as a frozen feature extractor, so no gradients are tracked here
        with torch.no_grad():
            embedded = self.bert(text)[0]

        #embedded = [batch size, sent len, emb dim]
        _, hidden = self.rnn(embedded)

        #hidden = [n layers * n directions, batch size, hid dim]
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])

        #hidden = [batch size, hid dim]
        output = self.out(hidden)
        
        #output = [batch size, out dim]
        return output
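
The shape bookkeeping in the forward pass can be checked without downloading BERT. The sketch below is only an illustration: it assumes a random tensor as a stand-in for BERT's output, an embedding size of 768 (the hidden size of bert-base), and made-up values for the batch size and sentence length.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, sent_len, emb_dim, hid_dim = 4, 10, 768, 256

# Random values stand in for BERT's output of shape [batch size, sent len, emb dim].
embedded = torch.randn(batch_size, sent_len, emb_dim)

rnn = nn.GRU(emb_dim, hid_dim, num_layers=2, bidirectional=True, batch_first=True)
out = nn.Linear(hid_dim * 2, 1)

_, hidden = rnn(embedded)
# hidden = [n layers * n directions, batch size, hid dim]

# The last two slices are the top layer's forward and backward final states.
final = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
logits = out(final)

print(hidden.shape, final.shape, logits.shape)
```

Concatenating the two directions doubles the feature size, which is why the linear layer takes hidden_dim * 2 inputs when the GRU is bidirectional.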

INITIALIZING HYPERPARAMETERS

In [25]:
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25

model = BERTGRUSentiment(bert,
                         HIDDEN_DIM,
                         OUTPUT_DIM,
                         N_LAYERS,
                         BIDIRECTIONAL,
                         DROPOUT)

COUNTING THE NUMBER OF PARAMETERS IN OUR MODEL.

In [26]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 112,241,409 trainable parameters


In order to freeze parameters (not train them) we need to set their requires_grad attribute to False. To do this, we loop through all of the named_parameters in our model and, if they are part of the bert transformer model, set requires_grad = False.

In [27]:
for name, param in model.named_parameters():                
    if name.startswith('bert'):
        param.requires_grad = False
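
As a rough check of what remains trainable after freezing, the size of the GRU and the linear layer can be worked out by hand. The sketch below assumes PyTorch's GRU parameter layout (three gates, separate input and hidden weights, two bias vectors per direction) and a BERT hidden size of 768. gru_layer_params is a hypothetical helper.

```python
def gru_layer_params(input_dim, hidden_dim):
    """Parameters of one GRU direction: 3 gates, input and hidden weights, two bias vectors."""
    weights = 3 * hidden_dim * (input_dim + hidden_dim)
    biases = 2 * 3 * hidden_dim
    return weights + biases

emb_dim, hid_dim, out_dim = 768, 256, 1   # BERT hidden size, HIDDEN_DIM, OUTPUT_DIM

layer1 = 2 * gru_layer_params(emb_dim, hid_dim)       # two directions
layer2 = 2 * gru_layer_params(2 * hid_dim, hid_dim)   # input is both directions of layer 1
linear = (2 * hid_dim) * out_dim + out_dim            # weights plus bias

head_params = layer1 + layer2 + linear
bert_params = 112_241_409 - head_params               # total printed above minus the head

print(f"{head_params:,} trainable, {bert_params:,} frozen")
```

Under these assumptions, calling count_parameters(model) again after the freeze should report the smaller of the two numbers, since only the GRU and the linear layer still require gradients.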

TRAINING THE MODEL

In [28]:
import torch.optim as optim
optimizer = optim.Adam(model.parameters())

In [29]:
criterion = nn.BCEWithLogitsLoss()
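
BCEWithLogitsLoss applies a sigmoid to the raw model output and then computes the binary cross-entropy, which is why the model itself has no sigmoid layer. The following is a minimal single-example sketch of the formula in plain Python. bce_with_logits is a hypothetical helper, whereas the PyTorch loss uses a numerically stabler formulation and averages over the batch by default.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw score, for a single example."""
    prob = 1.0 / (1.0 + math.exp(-logit))          # sigmoid
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))

print(round(bce_with_logits(0.0, 1.0), 4))    # undecided model
print(round(bce_with_logits(3.0, 1.0), 4))    # confident and right
print(round(bce_with_logits(3.0, 0.0), 4))    # confident and wrong
```

The loss is small when a confident prediction matches the label and grows quickly when a confident prediction is wrong.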

In [30]:
model = model.to(device)
criterion = criterion.to(device)

CALCULATING THE ACCURACY

In [31]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

In [32]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [33]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [34]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

TRAINING THE TRANSFORMER

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
        
    end_time = time.time()
        
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
        
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut6-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

TESTING THE MODEL

In [None]:
model.load_state_dict(torch.load('tut6-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

INFERENCE

In [None]:
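def predict_sentiment(model, tokenizer, sentence):
    model.eval()
    #tokenize and cut to leave room for the [CLS] and [SEP] tokens
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length-2]
    indexed = [init_token_idx] + tokenizer.convert_tokens_to_ids(tokens) + [eos_token_idx]
    #add a batch dimension: [sent len] -> [1, sent len]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(0)
    #no gradients are needed at inference time
    with torch.no_grad():
        prediction = torch.sigmoid(model(tensor))
    #the sigmoid output is the probability of the positive class
    if prediction.item() >= 0.5:
        print("Positive")
    else:
        print("Negative")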

In [None]:
predict_sentiment(model, tokenizer, "This film is terrible")

In [None]:
predict_sentiment(model, tokenizer, "This film is great")