### IMDB Faster Sentiment Analyisis

In this notebook we are going to use our own datset, load it using the `torchtext` predict positive or negative sentiments.

We will be using the data that we created from the previous notebook to do sentiment classification on the movie reviews. This data was created and stored on my google drive @ [garidziracrispen@gmail.com](https://drive.google.com/drive/folders/1M5NseG3ni7_UcLYlCyTrhaqdrs0Pzf7d)


The data was stored as clean text without `html` tags.
### Imports

In [1]:
import torch
from torchtext.legacy import data
import numpy as np
import matplotlib.pyplot as plt
import en_core_web_sm
import random

### Device Config

In [2]:
SEED = 42
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.backends.cudnn.deteministic = True

### Mounting the google drive

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Generating Bigrams
Accoding to the FastText paper we have to generate bigrams for each sentence.

In [4]:
def generate_bigrams(x):
  n_grams = set(zip(*[x[i:] for i in range(2)]))
  for n_gram in n_grams:
      x.append(' '.join(n_gram))
  return x
generate_bigrams(['This', 'film', 'is', 'terrible'])

['This', 'film', 'is', 'terrible', 'film is', 'is terrible', 'This film']

In [6]:
TEXT = data.Field(tokenize = 'spacy',
                  tokenizer_language = 'en_core_web_sm',
                  preprocessing=generate_bigrams
                  )
LABEL = data.LabelField( dtype = torch.float)

In [7]:
fields = {
     "review": ('review', TEXT),
    "sentiment": ("sentiment", LABEL)
}

### Creating dataset using the `TabularDataset`

In [8]:
train_data, test_data, validation_data = data.TabularDataset.splits(
    path="/content/drive/MyDrive/NLP Data/IMDB/",
    train="train.json",
    test="test.json",
    validation = "validation.json",
    format="json",
    fields=fields
)

In [9]:
print(vars(train_data[0]))

{'review': ['I', 'really', 'liked', 'this', 'Summerslam', 'due', 'to', 'the', 'look', 'of', 'the', 'arena', ',', 'the', 'curtains', 'and', 'just', 'the', 'look', 'overall', 'was', 'interesting', 'to', 'me', 'for', 'some', 'reason', '.', 'Anyways', ',', 'this', 'could', 'have', 'been', 'one', 'of', 'the', 'best', 'Summerslam', "'s", 'ever', 'if', 'the', 'WWF', 'did', "n't", 'have', 'Lex', 'Luger', 'in', 'the', 'main', 'event', 'against', 'Yokozuna', ',', 'now', 'for', 'it', "'s", 'time', 'it', 'was', 'ok', 'to', 'have', 'a', 'huge', 'fat', 'man', 'vs', 'a', 'strong', 'man', 'but', 'I', "'m", 'glad', 'times', 'have', 'changed', '.', 'It', 'was', 'a', 'terrible', 'main', 'event', 'just', 'like', 'every', 'match', 'Luger', 'is', 'in', 'is', 'terrible', '.', 'Other', 'matches', 'on', 'the', 'card', 'were', 'Razor', 'Ramon', 'vs', 'Ted', 'Dibiase', ',', 'Steiner', 'Brothers', 'vs', 'Heavenly', 'Bodies', ',', 'Shawn', 'Michaels', 'vs', 'Curt', 'Hening', ',', 'this', 'was', 'the', 'event', 'wh

In [10]:
from collections import Counter
counter = Counter()
for i in train_data:
  counter[i.sentiment] += 1
counter

Counter({0: 18780, 1: 18720})

### Checking the data sizes for each sample.

In [11]:

from prettytable import PrettyTable
def tabulate(column_names, data):
  table = PrettyTable(column_names)
  for row in data:
    table.add_row(row)
  print(table)
column_names = ["SUBSET", "EXAMPLE(s)"]
data_sets = [
        ["training", len(train_data)],
        ['validation', len(validation_data)],
        ['test', len(test_data)]
]
tabulate(column_names, data_sets)


+------------+------------+
|   SUBSET   | EXAMPLE(s) |
+------------+------------+
|  training  |   37500    |
| validation |    5000    |
|    test    |    7500    |
+------------+------------+


### Loading pretrained word Embedings.

In [12]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(
    train_data,
    max_size = MAX_VOCAB_SIZE,
    vectors = "glove.6B.100d",
    unk_init = torch.Tensor.normal_
)

LABEL.build_vocab(train_data)

.vector_cache/glove.6B.zip: 862MB [02:40, 5.37MB/s]                           
100%|█████████▉| 398239/400000 [00:17<00:00, 23097.75it/s]

### Creating Iterators - `Bucket Iterator`.

In [13]:
BATCH_SIZE = 64

train_iterator, validation_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, validation_data, test_data),
    device = device,
    batch_size = BATCH_SIZE,
    sort_key = lambda x: len(x.review),
)

### Creating a Model
In this notebook we are going to use a `bidirectional` LSTM.

In [14]:
from torch import nn
from torch.nn import functional as F

In [15]:
class IMDBFastText(nn.Module):
  def __init__(self, vocab_size, embedding_dim, output_dim, pad_idx):
    super(IMDBFastText, self).__init__()

    self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
    self.fc = nn.Linear(embedding_dim, output_dim)

  def forward(self, text):
    #text = [sent len, batch size]
    embedded = self.embedding(text)
    #embedded = [sent len, batch size, emb dim]
    embedded = embedded.permute(1, 0, 2)
    #embedded = [batch size, sent len, emb dim]
    pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1) 
    #pooled = [batch size, embedding_dim]
    return self.fc(pooled)

### Creating a model instance

In [18]:

INPUT_DIM = len(TEXT.vocab) # # 25002
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] # 0
imdb_model = IMDBFastText(INPUT_DIM, 
            EMBEDDING_DIM, 
            OUTPUT_DIM, 
            PAD_IDX)
imdb_model

IMDBFastText(
  (embedding): Embedding(25002, 100, padding_idx=1)
  (fc): Linear(in_features=100, out_features=1, bias=True)
)

### Counting number of trainable parameters in the model.

In [19]:
def count_trainable_params(model):
  return sum([i.numel() for i in model.parameters() if i.requires_grad])

print(f'The model has {count_trainable_params(imdb_model):,} trainable parameters')

The model has 2,500,301 trainable parameters


#### Loading pretrained embeddings

In [20]:
pretrained_embeddings = TEXT.vocab.vectors

In [21]:
imdb_model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.1233,  0.3499,  0.6173],
        [ 0.7262,  0.0912, -0.3891,  ...,  0.0821,  0.4440, -0.7240],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-1.1343,  1.3227,  1.2282,  ...,  0.8923,  0.5572, -1.2730],
        [-1.6437,  0.6558,  0.3809,  ..., -0.6630, -1.0708,  0.5194],
        [ 0.4784,  0.6412, -0.1034,  ...,  0.6080,  0.2487, -2.1468]])

### Zeroing the <pad> and <unk> tokens

In [22]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token] or TEXT.vocab.stoi["<unk>"]
imdb_model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
imdb_model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
print(imdb_model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [-1.1343,  1.3227,  1.2282,  ...,  0.8923,  0.5572, -1.2730],
        [-1.6437,  0.6558,  0.3809,  ..., -0.6630, -1.0708,  0.5194],
        [ 0.4784,  0.6412, -0.1034,  ...,  0.6080,  0.2487, -2.1468]])


### Trainning the Model.

In [23]:
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(imdb_model.parameters())

### Criterion and amazon_model to the Device.

In [24]:
imdb_model = imdb_model.to(device)
criterion = criterion.to(device)

### The Binary Accuracy Function.

In [25]:
def binary_accuracy(y_preds, y_true):
  #round predictions to the closest integer
  rounded_preds = torch.round(torch.sigmoid(y_preds))
  correct = (rounded_preds == y_true).float() #convert into float for division 
  acc = correct.sum() / len(correct)
  return acc

### Trainning and evaluation Functions.

In [26]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text = batch.review
        predictions = model(text).squeeze(1)
        loss = criterion(predictions, batch.sentiment)
        acc = binary_accuracy(predictions, batch.sentiment)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            text = batch.review
            predictions = model(text).squeeze(1)
            loss = criterion(predictions, batch.sentiment)
            acc = binary_accuracy(predictions, batch.sentiment)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We'll also create a function to tell us how long an epoch takes to compare training times between models.

In [27]:
import time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

### Trainning Loop

In [28]:
N_EPOCHS = 10
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(imdb_model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(imdb_model, validation_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(imdb_model.state_dict(), 'best-model.pt')
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 11s
	Train Loss: 0.663 | Train Acc: 67.87%
	 Val. Loss: 0.481 |  Val. Acc: 77.45%
Epoch: 02 | Epoch Time: 2m 9s
	Train Loss: 0.523 | Train Acc: 81.96%
	 Val. Loss: 0.372 |  Val. Acc: 85.32%
Epoch: 03 | Epoch Time: 2m 8s
	Train Loss: 0.397 | Train Acc: 87.47%
	 Val. Loss: 0.364 |  Val. Acc: 88.01%
Epoch: 04 | Epoch Time: 2m 10s
	Train Loss: 0.327 | Train Acc: 89.40%
	 Val. Loss: 0.388 |  Val. Acc: 88.96%
Epoch: 05 | Epoch Time: 2m 9s
	Train Loss: 0.282 | Train Acc: 90.64%
	 Val. Loss: 0.413 |  Val. Acc: 89.58%
Epoch: 06 | Epoch Time: 2m 8s
	Train Loss: 0.252 | Train Acc: 91.57%
	 Val. Loss: 0.435 |  Val. Acc: 90.03%
Epoch: 07 | Epoch Time: 2m 8s
	Train Loss: 0.230 | Train Acc: 92.35%
	 Val. Loss: 0.456 |  Val. Acc: 90.35%
Epoch: 08 | Epoch Time: 2m 8s
	Train Loss: 0.209 | Train Acc: 93.00%
	 Val. Loss: 0.478 |  Val. Acc: 90.45%
Epoch: 09 | Epoch Time: 2m 10s
	Train Loss: 0.194 | Train Acc: 93.55%
	 Val. Loss: 0.497 |  Val. Acc: 90.70%
Epoch: 10 | Epoch Time: 2

### Evaluating the best model.

In [29]:
imdb_model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(imdb_model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.398 | Test Acc: 86.71%


### Model Inference - Making predictions

In [32]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
sentiments = ["NEG", "POS"]
def predict_sentiment(model, sent):
  model.eval()
  tokenized = [tok.text for tok in nlp.tokenizer(sent)]
  indexed = [TEXT.vocab.stoi[t] for t in tokenized]
  length = [len(indexed)]
  tensor = torch.LongTensor(indexed).to(device)
  tensor = tensor.unsqueeze(1)
  prediction = torch.sigmoid(model(tensor))

  predicted_class = round(prediction.item())
  confidence = prediction.item() if prediction.item() >= .5 else 1 - prediction.item()
  print(f'PREDICTED CLASS:\t{predicted_class}\nCONFIDENCE:\t\t{confidence * 100:.2f}%\nSENTIMENT:\t\t{sentiments[predicted_class]}')

### Negative review

In [34]:
predict_sentiment(imdb_model, "This movie is terrible")

PREDICTED CLASS:	0
CONFIDENCE:		100.00%
SENTIMENT:		NEG


### Positive review

In [35]:
predict_sentiment(imdb_model, "Best movie of all time")

PREDICTED CLASS:	1
CONFIDENCE:		100.00%
SENTIMENT:		POS
