### Transformers for sentiment analysis.

In this notebook we are going to use **BERT**, a model built on the Transformer architecture. The Transformer was first introduced in [Attention Is All You Need](https://arxiv.org/abs/1706.03762), and BERT itself comes from [this paper](https://arxiv.org/abs/1810.04805).


**BERT** -> **B**idirectional **E**ncoder **R**epresentations from **T**ransformers.


Transformer models are considerably larger than anything covered in the previous notebooks. As such, we are going to use the [transformers library](https://github.com/huggingface/transformers) to get a pre-trained transformer and use it as our embedding layer. We will freeze (not train) the transformer and only train the remainder of the model, which learns from the representations produced by the transformer. In this case we will be using a multi-layer bidirectional GRU; however, any model can learn from these representations.






In [1]:
import torch
import numpy as np
import random

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True


The transformer has already been trained with a specific vocabulary, which means we need to train with the exact same vocabulary and also tokenize our data in the same way that the transformer did when it was initially trained.

Luckily, the transformers library has tokenizers for each of the transformer models provided. In this case we are using the ``BERT`` model that ignores casing. We get this by loading the pre-trained ``bert-base-uncased`` tokenizer.

### Installation

In [3]:
!pip install transformers


In [4]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')





### Checking the Vocabulary size in the `tokenizer`

In [5]:
tokenizer.vocab_size, len(tokenizer.vocab)

(30522, 30522)

### Using a tokenizer to tokenize a string

In [6]:
tokens = tokenizer.tokenize("This is what we call AI")
tokens

['this', 'is', 'what', 'we', 'call', 'ai']

### Numericalize Tokens
We can numericalize tokens with our vocabulary using ``tokenizer.convert_tokens_to_ids``.

In [7]:
tokens_to_ids = tokenizer.convert_tokens_to_ids(tokens)
tokens_to_ids

[2023, 2003, 2054, 2057, 2655, 9932]

> To convert back to the string representation we call `tokenizer.convert_ids_to_tokens`.

In [8]:
tokenizer.convert_ids_to_tokens(tokens_to_ids)

['this', 'is', 'what', 'we', 'call', 'ai']
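
Numericalizing is, roughly, a dictionary lookup with a fallback to the unknown token. The sketch below mimics that round trip with a toy vocabulary built only from the six ids printed above plus `[UNK]` (100). The names `toy_vocab`, `to_ids` and `to_tokens` are illustrative and not part of the transformers API.

```python
# Toy vocabulary: only the ids shown in the outputs above, plus [UNK].
toy_vocab = {"[UNK]": 100, "this": 2023, "is": 2003, "what": 2054,
             "we": 2057, "call": 2655, "ai": 9932}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

def to_ids(tokens):
    """Map each token to its id, falling back to the [UNK] id."""
    return [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens]

def to_tokens(ids):
    """Inverse lookup: map each id back to its token."""
    return [id_to_token[i] for i in ids]

ids = to_ids(["this", "is", "what", "we", "call", "ai"])
print(ids)
print(to_tokens(ids))
print(to_ids(["this", "is", "zzz"]))  # "zzz" is not in the toy vocabulary
```

The last call shows why the unknown-token index matters: any token missing from the vocabulary is mapped to 100.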

The transformer was also trained with special tokens to mark the beginning and end of the sentence ([more detail](https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel)), as well as a standard padding and unknown token. We can also get these from the tokenizer.

**Note:** the tokenizer does have a beginning of sequence and end of sequence attributes (``bos_token`` and ``eos_token``) but these are not set and should not be used for this transformer.

In [9]:
init_token = (tokenizer.cls_token, tokenizer.cls_token_id)
sep_token = (tokenizer.sep_token, tokenizer.sep_token_id)
pad_token = (tokenizer.pad_token, tokenizer.pad_token_id)
unk_token = (tokenizer.unk_token, tokenizer.unk_token_id)
bos_token = (tokenizer.bos_token, tokenizer.bos_token_id)
eos_token = (tokenizer.eos_token, tokenizer.eos_token_id)

init_token, sep_token, pad_token, unk_token, bos_token, eos_token

Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.


(('[CLS]', 101),
 ('[SEP]', 102),
 ('[PAD]', 0),
 ('[UNK]', 100),
 (None, None),
 (None, None))

Another thing we need to handle is that the model was trained on sequences with a defined maximum length - it does not know how to handle sequences longer than it has been trained on. We can get the maximum length of these input sizes by checking the ``max_model_input_sizes`` for the version of the transformer we want to use.

In [10]:
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']
max_input_length

512

Previously we used the ``spaCy`` tokenizer to tokenize our examples. However, we now need to define a function that we will pass to our ``TEXT`` field to handle all the tokenization for us. It will also cut the number of tokens down to a maximum length. 

**Note:** our maximum length is **2** less than the actual maximum length. This is because we need to add two tokens to each sequence, one at the **start** and one at the **end**.

In [11]:
def tokenize_and_cut(sent):
  return tokenizer.tokenize(sent)[:max_input_length-2]
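
As a quick check of the arithmetic, the sketch below uses placeholder integer ids instead of real tokens, together with the special-token ids printed earlier (101 for `[CLS]`, 102 for `[SEP]`). `wrap_and_cut` is a hypothetical helper for illustration only; in this notebook the special tokens are added later by the ``TEXT`` field.

```python
MAX_INPUT_LENGTH = 512      # value reported by the tokenizer above
CLS_ID, SEP_ID = 101, 102   # special-token ids printed earlier

def wrap_and_cut(ids, max_len=MAX_INPUT_LENGTH):
    """Trim to max_len - 2 ids, then add [CLS] and [SEP] around them."""
    return [CLS_ID] + ids[:max_len - 2] + [SEP_ID]

long_review = list(range(1000, 1600))   # 600 placeholder ids
short_review = [2023, 2003, 2054]

print(len(wrap_and_cut(long_review)))   # 510 kept + 2 special tokens
print(wrap_and_cut(short_review))       # short inputs are only wrapped
```

Cutting at `max_input_length - 2` is what keeps the wrapped sequence within the 512 positions the model supports.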


### Special token indices.

In [12]:
init_token_idx = tokenizer.cls_token_id
eos_token_idx = tokenizer.sep_token_id
pad_token_idx = tokenizer.pad_token_id
unk_token_idx = tokenizer.unk_token_id

init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx

(101, 102, 0, 100)


Now we define our **fields**. The transformer expects the batch dimension to be first, so we set ``batch_first = True``. 

As we already have the vocabulary for our text, provided by the transformer, we set ``use_vocab = False`` to tell torchtext that we'll be handling the vocabulary side of things. 

We pass our ``tokenize_and_cut`` function as the tokenizer. The ``preprocessing`` argument is a function that takes in the example after it has been tokenized; this is where we will convert the tokens to their indexes. 

Finally, we define the special tokens, making note that we are defining them to be their index value and not their string value, i.e. 100 instead of ``[UNK]``. This is because the sequences will already be converted into indexes.

In [13]:
from torchtext.legacy import data, datasets

In [14]:
TEXT = data.Field(
    use_vocab=False,
    batch_first = True,
    tokenize = tokenize_and_cut,
    preprocessing = tokenizer.convert_tokens_to_ids,
    init_token = init_token_idx,
    eos_token = eos_token_idx,
    pad_token = pad_token_idx,
    unk_token = unk_token_idx
)

LABEL = data.LabelField(dtype = torch.float)
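
For intuition, here is a rough pure-Python sketch of what the field's special-token settings amount to when a batch is assembled: each already-numericalized sequence is wrapped with the init and eos indices and then right-padded with the pad index. `pad_batch` is an illustrative helper under those assumptions, not torchtext's implementation.

```python
INIT_IDX, EOS_IDX, PAD_IDX = 101, 102, 0   # [CLS], [SEP], [PAD]

def pad_batch(sequences):
    """Wrap each id sequence with init/eos ids and right-pad to equal length."""
    wrapped = [[INIT_IDX] + seq + [EOS_IDX] for seq in sequences]
    longest = max(len(seq) for seq in wrapped)
    return [seq + [PAD_IDX] * (longest - len(seq)) for seq in wrapped]

# Two short, already-numericalized examples of different lengths.
batch = pad_batch([[2023, 2003, 2054], [2057, 2655]])
for row in batch:
    print(row)
```

Each row has the same length, with the batch dimension first, which is the layout the transformer expects.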

### Loading the data.
We are going to use the `IMDB` dataset of movie reviews.

In [15]:
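train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
# split() uses its default ratio here, so the larger part goes to valid_data.
# random.seed(SEED) returns None; the call only reseeds Python's global RNG.
valid_data, test_data = test_data.split(random_state = random.seed(SEED))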



In [16]:
print(f"TRAINING EXAMPLES: {len(train_data)}")
print(f"VALIDATION EXAMPLES: {len(valid_data)}")
print(f"TESTING EXAMPLES: {len(test_data)}")

TRAINING EXAMPLES: 25000
VALIDATION EXAMPLES: 17500
TESTING EXAMPLES: 7500
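
These sizes follow from `split` being called without a `split_ratio`. Assuming torchtext's legacy default of 0.7, 70% of the 25,000 original test reviews land in `valid_data` and the remaining 30% stay in `test_data`. The arithmetic, with illustrative variable names:

```python
ORIGINAL_TEST_SIZE = 25_000
SPLIT_RATIO = 0.7   # assumed default of Dataset.split in torchtext.legacy

n_valid = int(round(ORIGINAL_TEST_SIZE * SPLIT_RATIO))
n_test = ORIGINAL_TEST_SIZE - n_valid
print(n_valid, n_test)
```

Note that this leaves the validation set larger than the test set.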


### Checking a single example.

In [17]:
print(vars(train_data.examples[1]))

{'text': [2004, 2467, 1010, 6801, 5691, 2066, 2023, 2031, 3816, 4391, 1012, 2017, 2593, 2293, 2009, 2030, 2017, 5223, 2009, 1010, 1998, 2025, 3071, 2097, 2066, 2023, 3185, 1012, 2023, 3065, 1996, 7339, 1997, 1996, 15978, 1010, 2029, 2003, 2242, 1045, 7714, 2514, 2003, 2242, 2590, 2000, 5136, 1012, 2017, 2089, 5223, 2068, 1010, 2017, 2089, 4366, 2000, 3305, 2068, 1998, 2514, 2004, 2295, 2017, 2064, 14396, 1010, 2021, 7539, 2023, 3185, 2097, 2191, 2017, 2228, 2055, 2082, 5008, 2015, 2013, 1037, 2367, 7339, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 3185, 2003, 2915, 4498, 2478, 1037, 2192, 1011, 2218, 4950, 1010, 2242, 2008, 1045, 2228, 2573, 3243, 2092, 2004, 2009, 3084, 2009, 2062, 12689, 1012, 2009, 2003, 2409, 3294, 2013, 1996, 15978, 2391, 1997, 3193, 1010, 2013, 2037, 1000, 6416, 1000, 2000, 2155, 26256, 2015, 1010, 2035, 2877, 2039, 1996, 2502, 2154, 1000, 5717, 2154, 1000, 1999, 2029, 2027, 2024, 4041, 2006, 1037, 9288, 2012, 2037, 2082, 1012, 5717, 2154, 2515, 2

We can use ``convert_ids_to_tokens`` to transform these indexes back into readable tokens.

In [18]:
tokens = tokenizer.convert_ids_to_tokens(vars(train_data.examples[1])['text'])
print(tokens)

['as', 'always', ',', 'controversial', 'movies', 'like', 'this', 'have', 'mixed', 'reviews', '.', 'you', 'either', 'love', 'it', 'or', 'you', 'hate', 'it', ',', 'and', 'not', 'everyone', 'will', 'like', 'this', 'movie', '.', 'this', 'shows', 'the', 'perspective', 'of', 'the', 'killers', ',', 'which', 'is', 'something', 'i', 'personally', 'feel', 'is', 'something', 'important', 'to', 'consider', '.', 'you', 'may', 'hate', 'them', ',', 'you', 'may', 'claim', 'to', 'understand', 'them', 'and', 'feel', 'as', 'though', 'you', 'can', 'relate', ',', 'but', 'regardless', 'this', 'movie', 'will', 'make', 'you', 'think', 'about', 'school', 'shooting', '##s', 'from', 'a', 'different', 'perspective', '.', '<', 'br', '/', '>', '<', 'br', '/', '>', 'the', 'movie', 'is', 'shot', 'entirely', 'using', 'a', 'hand', '-', 'held', 'camera', ',', 'something', 'that', 'i', 'think', 'works', 'quite', 'well', 'as', 'it', 'makes', 'it', 'more', 'realistic', '.', 'it', 'is', 'told', 'completely', 'from', 'the', 

### Building labels vocabulary
Although we've handled the vocabulary for the text, we still need to build the vocabulary for the labels.

In [19]:
LABEL.build_vocab(train_data)
LABEL.vocab.stoi

defaultdict(None, {'neg': 0, 'pos': 1})

### Creating Iterators
As before, we are going to use the `BucketIterator`.

In [35]:
BATCH_SIZE = 64

train_iterator, test_iterator, validation_iterator = data.BucketIterator.splits(
    (train_data, test_data, valid_data),
    batch_size=BATCH_SIZE,
    device = device
)

### Building the Model.

1. First we will load the pre-trained model, **making sure to load the same model as we did for the tokenizer.**

In [36]:
from transformers import BertModel
bert = BertModel.from_pretrained('bert-base-uncased')
bert

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

2. Defining our actual model

Instead of using an embedding layer to get embeddings for our text, we'll be using the pre-trained transformer model. These embeddings will then be fed into a GRU to produce a prediction for the sentiment of the input sentence. We get the embedding dimension size (called the ``hidden_size``) from the transformer via its ``config`` attribute. The rest of the initialization is standard.

Within the forward pass, we wrap the transformer in a ``torch.no_grad()`` block to ensure no gradients are calculated over this part of the model. The transformer actually returns the embeddings for the whole sequence as well as a pooled output. The [documentation](https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel) states that the pooled output is "usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence", hence we will not be using it. The rest of the forward pass is the standard implementation of a recurrent model, where we take the hidden state over the final time-step and pass it through a linear layer to get our predictions.

In [37]:
from torch import nn

In [38]:
class BERTGRUSentiment(nn.Module):
    def __init__(self, bert, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        super().__init__()
        self.bert = bert
        embedding_dim = bert.config.to_dict()['hidden_size']
        self.rnn = nn.GRU(embedding_dim, hidden_dim,
                          num_layers = n_layers, bidirectional = bidirectional,
                          batch_first = True, dropout = 0 if n_layers < 2 else dropout
                          )
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        #text = [batch size, sent len]   
        with torch.no_grad():
            embedded = self.bert(text)[0]
        #embedded = [batch size, sent len, emb dim]
        _, hidden = self.rnn(embedded)
        #hidden = [n layers * n directions, batch size, hid dim]
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])     
        #hidden = [batch size, hid dim]
        output = self.fc(hidden)
        #output = [batch size, out dim]
        return output
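
The indexing `hidden[-2]` and `hidden[-1]` relies on how PyTorch orders the first dimension of a GRU's final hidden state: layer by layer, with the forward direction before the backward one within each layer. The sketch below only enumerates that layout with text labels (no tensors are involved); `hidden_layout` is an illustrative helper, not a PyTorch function.

```python
def hidden_layout(n_layers, bidirectional):
    """Label each slice along dim 0 of the GRU's final hidden state."""
    directions = ["forward", "backward"] if bidirectional else ["forward"]
    return [f"layer {layer} {direction}"
            for layer in range(n_layers)
            for direction in directions]

layout = hidden_layout(n_layers=2, bidirectional=True)
print(layout)
print(layout[-2], "|", layout[-1])  # the two slices concatenated in forward()
```

For the two-layer bidirectional setup used here, the last two slices are the forward and backward states of the top layer, which is why they are concatenated before the linear layer.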

### Model hyperparameters.

In [39]:
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25

model = BERTGRUSentiment(bert,
                         HIDDEN_DIM,
                         OUTPUT_DIM,
                         N_LAYERS,
                         BIDIRECTIONAL,
                         DROPOUT)

### Counting trainable parameters of the model.

In [40]:
def count_trainable_params(model):
  return sum([p.numel() for p in model.parameters() if p.requires_grad])

print(f'The model has {count_trainable_params(model):,} trainable parameters')

The model has 112,241,409 trainable parameters


### Freezing bert model parameters

In order to freeze parameters (not train them) we need to set their ``requires_grad`` attribute to ``False``. To do this, we simply loop through all of the ``named_parameters`` in our model and if they're a part of the bert transformer model, we set ``requires_grad = False``.

In [41]:
for name, param in model.named_parameters():
  if name.startswith('bert'):
    param.requires_grad = False

We can now see that our model has under 3M trainable parameters, making it almost comparable to the FastText model. However, the text still has to propagate through the transformer which causes training to take considerably longer.

In [42]:
print(f'The model has {count_trainable_params(model):,} trainable parameters')

The model has 2,759,169 trainable parameters
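
This figure can be reproduced by hand from the layer shapes. A GRU has three gates, so each direction of each layer holds an input-to-hidden matrix of shape `[3 * hidden, input]`, a hidden-to-hidden matrix of shape `[3 * hidden, hidden]` and two bias vectors of length `3 * hidden`. The helper name below is illustrative.

```python
EMB_DIM, HIDDEN_DIM, OUTPUT_DIM = 768, 256, 1
N_DIRECTIONS = 2   # bidirectional

def gru_direction_params(input_dim, hidden_dim):
    """Parameters of one direction of one GRU layer (3 gates)."""
    weight_ih = 3 * hidden_dim * input_dim
    weight_hh = 3 * hidden_dim * hidden_dim
    biases = 2 * 3 * hidden_dim
    return weight_ih + weight_hh + biases

# Layer 0 reads BERT's 768-dim output; layer 1 reads both directions of layer 0.
layer0 = N_DIRECTIONS * gru_direction_params(EMB_DIM, HIDDEN_DIM)
layer1 = N_DIRECTIONS * gru_direction_params(N_DIRECTIONS * HIDDEN_DIM, HIDDEN_DIM)
linear = (N_DIRECTIONS * HIDDEN_DIM) * OUTPUT_DIM + OUTPUT_DIM  # weights + bias

total = layer0 + layer1 + linear
print(f"{total:,}")
```

Subtracting this total from the earlier count of 112,241,409 leaves 109,482,240 parameters, which are the frozen BERT weights.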


We can double check the names of the trainable parameters, ensuring they make sense. As we can see, they are all the parameters of the GRU (``rnn``) and the linear layer (``fc``).

In [43]:
for name, param in model.named_parameters():                
    if param.requires_grad:
        print(name)

rnn.weight_ih_l0
rnn.weight_hh_l0
rnn.bias_ih_l0
rnn.bias_hh_l0
rnn.weight_ih_l0_reverse
rnn.weight_hh_l0_reverse
rnn.bias_ih_l0_reverse
rnn.bias_hh_l0_reverse
rnn.weight_ih_l1
rnn.weight_hh_l1
rnn.bias_ih_l1
rnn.bias_hh_l1
rnn.weight_ih_l1_reverse
rnn.weight_hh_l1_reverse
rnn.bias_ih_l1_reverse
rnn.bias_hh_l1_reverse
fc.weight
fc.bias


### Training the model.

In [44]:
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

### Pushing the model to the device.

In [45]:

model = model.to(device)
criterion = criterion.to(device)

### The accuracy function

In [46]:
def binary_accuracy(y_preds, y_true):
  #round predictions to the closest integer
  rounded_preds = torch.round(torch.sigmoid(y_preds))
  correct = (rounded_preds == y_true).float() #convert into float for division 
  acc = correct.sum() / len(correct)
  return acc
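
To make the steps concrete, here is a pure-Python mirror of the same computation on made-up numbers: apply the sigmoid to each raw output, threshold at 0.5, and compare with the labels. `binary_accuracy_py` and the sample values are illustrative and not part of the notebook.

```python
import math

def binary_accuracy_py(logits, labels):
    """Fraction of examples whose thresholded sigmoid output equals the label."""
    probs = [1 / (1 + math.exp(-z)) for z in logits]
    preds = [1.0 if p > 0.5 else 0.0 for p in probs]
    correct = [float(p == y) for p, y in zip(preds, labels)]
    return sum(correct) / len(correct)

logits = [2.0, -1.0, 0.5, -3.0]   # made-up raw model outputs
labels = [1.0, 0.0, 0.0, 0.0]     # made-up targets (1 = pos, 0 = neg)
print(binary_accuracy_py(logits, labels))
```

The third example has a positive logit but a negative label, so three of the four predictions are correct.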

### The training and evaluation functions.

In [47]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text = batch.text
        predictions = model(text).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            text = batch.text
            predictions = model(text).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We'll also create a function to tell us how long an epoch takes to compare training times between models.

In [48]:
import time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
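
A short usage example, repeating the function so the snippet runs on its own. The timestamps are made up; in the training loop they come from `time.time()`.

```python
def epoch_time(start_time, end_time):
    """Split an elapsed time in seconds into whole minutes and seconds."""
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# 1640.7 seconds is 27 whole minutes plus 20.7 seconds, truncated to 20.
print(epoch_time(0.0, 1640.7))
```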

### Training and evaluation loop.

In [51]:
torch.cuda.empty_cache()

In [53]:
N_EPOCHS = 5
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, validation_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-model.pt')
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 27m 20s
	Train Loss: 0.238 | Train Acc: 90.48%
	 Val. Loss: 0.193 |  Val. Acc: 92.13%
Epoch: 02 | Epoch Time: 27m 25s
	Train Loss: 0.209 | Train Acc: 91.97%
	 Val. Loss: 0.207 |  Val. Acc: 91.50%
Epoch: 03 | Epoch Time: 27m 27s
	Train Loss: 0.178 | Train Acc: 93.02%
	 Val. Loss: 0.193 |  Val. Acc: 92.38%
Epoch: 04 | Epoch Time: 27m 28s
	Train Loss: 0.155 | Train Acc: 94.12%
	 Val. Loss: 0.203 |  Val. Acc: 92.74%
Epoch: 05 | Epoch Time: 27m 28s
	Train Loss: 0.129 | Train Acc: 95.22%
	 Val. Loss: 0.198 |  Val. Acc: 92.40%


### Evaluating the best model.

In [50]:
model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')


### Inference
We'll then use the model to test the sentiment of some sequences. We tokenize the input sequence, trim it down to the maximum length, add the special tokens to either side, convert it to a tensor, add a fake batch dimension and then pass it through our model.

In [None]:
def predict_sentiment(model, tokenizer, sentence):
    model.eval()
    tokens = tokenizer.tokenize(sentence)[:max_input_length-2]
    indexed = [init_token_idx] + tokenizer.convert_tokens_to_ids(tokens) + [eos_token_idx]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(0)
    prediction = torch.sigmoid(model(tensor))
    return prediction.item()
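
The value returned by `predict_sentiment` is a sigmoid output between 0 and 1. Because the label vocabulary built earlier maps `neg` to 0 and `pos` to 1, values near 1 indicate a positive review. Below is a small sketch of turning that number into a label; `label_from_probability` and the 0.5 threshold are assumptions for illustration and not part of the notebook.

```python
LABEL_STOI = {"neg": 0, "pos": 1}   # mapping printed by LABEL.vocab.stoi above
ITOS = {idx: name for name, idx in LABEL_STOI.items()}

def label_from_probability(prob, threshold=0.5):
    """Turn a sigmoid output into 'pos' or 'neg' (threshold is an assumption)."""
    return ITOS[int(prob >= threshold)]

print(label_from_probability(0.93))
print(label_from_probability(0.08))
```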

In [None]:
while True:
  review = input("Enter a review\nexit to quit:\n")
  # Accept both words, since the prompt says "exit" but the check used "quit".
  if review.strip().lower() in ("exit", "quit"):
    break
  print(predict_sentiment(model, tokenizer, review))


### Credits.

[bentrevett](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/6%20-%20Transformers%20for%20Sentiment%20Analysis.ipynb)