### IMDB - Sentiment Analysis

In this notebook we take our own dataset, load it with `torchtext`, and train a model to predict whether a review's sentiment is positive or negative.
### Imports

In [1]:
import torch
from torchtext.legacy import data
import numpy as np
import matplotlib.pyplot as plt
import en_core_web_sm
import random

### Device Config

In [2]:
SEED = 42
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.backends.cudnn.deterministic = True

### Data Prep.

We have an IMDB `csv` file whose data we want to convert from `csv` to `json`. Each line of the resulting file is one JSON object, for example:

```json
{"review": "This movie was borderline in crude humor....I utterly can not believe that these people can get away with this. Johnny Knoxville didn't cross the line...he was stomping all over it! This was better than the first...ALL THA WAY! The thing I found about the 1st movie was that the shenanigans were somewhat as if it was on the t.v. show. NOT THIS TIME!!! they completely made a 180 degree flip...the whole cast is so outstanding in what they do and not were the stunts crazy...but the music basically fit every situation...GOOD WORK!!!! When you go see this be sure to use the bathroom before going to the theater, maintain a strong stomach and rememba to not let your beverage spray out your nose....", "sentiment": 1}
```
Here we are interested in two things:

```
"review" -> the text of the review
"sentiment" -> 0 for negative and 1 for positive
```
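
A minimal sketch of how one line in this format can be parsed with Python's standard `json` module. The sample record is shortened and invented for illustration; only the `review` and `sentiment` keys come from the format described above.

```python
import json

# One line of the file: a single JSON object (the sample text is illustrative).
line = '{"review": "GOOD WORK!!!! A strong sequel.", "sentiment": 1}'

record = json.loads(line)
review = record["review"]        # the review text (str)
sentiment = record["sentiment"]  # 0 = negative, 1 = positive (int)

label_name = "positive" if sentiment == 1 else "negative"
print(f"{label_name}: {review}")
```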



### Read the csv file using pandas.

In [3]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/NLP Data/IMDB/IMDB Dataset.csv')
df.head()

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


### Extract the sentiments and reviews.

In [5]:
reviews = df.review.values
# Encode the labels as integers: 1 for "positive", 0 for "negative"
sentiments = [1 if sentiment == "positive" else 0 for sentiment in df.sentiment.values]

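### A helper function to clean the text.

In [6]:
import re
def remove_html_tags(sent):
  # Replace tags of the form <br /> (tag name, one whitespace character, optional slash) with a space
  a = re.sub(r'<[a-zA-Z]+\s/?>', ' ', sent)
  # Collapse runs of whitespace into a single space
  b = re.sub(r'\s+', ' ', a)
  return b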
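
A short usage sketch of the helper on an invented sample sentence. It also shows what the pattern does not match: tags without whitespace before the closing bracket (such as `<i>`) and closing tags (such as `</i>`) are left in place. The function `remove_html_tags_permissive` is a hypothetical variant, not part of this notebook, included only for comparison.

```python
import re

def remove_html_tags(sent):
    # Same two substitutions as the notebook's helper.
    a = re.sub(r'<[a-zA-Z]+\s/?>', ' ', sent)
    return re.sub(r'\s+', ' ', a)

def remove_html_tags_permissive(sent):
    # Hypothetical variant: strips anything between '<' and '>'.
    a = re.sub(r'<[^>]+>', ' ', sent)
    return re.sub(r'\s+', ' ', a)

sample = "A wonderful little production. <br /><br />The filming is <i>great</i>."

cleaned = remove_html_tags(sample)
permissive = remove_html_tags_permissive(sample)

print(cleaned)     # the <br /> tags are gone, <i>...</i> remains
print(permissive)  # all tags are gone
```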

### Clean the review text.

In [7]:
reviews  = list(map(remove_html_tags, reviews))

In [8]:
test_size = int(0.1 * len(reviews))
valid_size = int(.15 * len(reviews))
train_size = int((1 - 0.1 - 0.15) * len(reviews) )
test_size, train_size, valid_size

(5000, 37500, 7500)
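
Each of the three sizes is truncated with `int(...)` independently, so for a dataset whose length is not a round number they may not add up to the total. A minimal sketch of one way to guarantee that they do, assuming integer percentages; `split_sizes` is a hypothetical helper, not part of this notebook.

```python
def split_sizes(n, test_pct=10, valid_pct=15):
    """Return (train, valid, test) sizes that always sum to n."""
    test_size = n * test_pct // 100
    valid_size = n * valid_pct // 100
    train_size = n - test_size - valid_size  # the remainder goes to training
    return train_size, valid_size, test_size

print(split_sizes(50_000))  # (37500, 7500, 5000)
print(split_sizes(1_003))   # (753, 150, 100)
```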

### Shuffle the data.

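Each review must stay paired with its label, so we first stack the reviews and sentiments into a single two-column array and then shuffle its rows with NumPy. Because `np.column_stack` produces one array with a common dtype, the integer labels are stored as strings, which is why they are converted back with `int(...)` later on.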

In [9]:
data = np.column_stack([reviews, sentiments])

In [10]:
np.random.shuffle(data)
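
The same pair-then-shuffle idea can be sketched with only the standard library. This is an alternative shown for illustration, not the approach used in this notebook; the toy reviews are invented. Shuffling the pairs rather than the two lists separately is what keeps each label attached to its review.

```python
import random

toy_reviews = ["great film", "terrible plot", "loved it", "boring"]
toy_sentiments = [1, 0, 1, 0]

pairs = list(zip(toy_reviews, toy_sentiments))  # keep each review with its label
random.seed(42)
random.shuffle(pairs)                           # shuffles whole (review, label) pairs

shuffled_reviews, shuffled_sentiments = zip(*pairs)

# Every review still carries its original label after shuffling.
original = dict(zip(toy_reviews, toy_sentiments))
still_aligned = all(original[r] == s for r, s in pairs)
print(still_aligned)  # True
```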

### Slicing the shuffled data into training, validation, and test sets.

In [11]:
train_reviews_sentiments = data[:train_size]
validation_reviews_sentiments = data[train_size:train_size + valid_size]
test_reviews_sentiments = data[train_size + valid_size:]

### Checking the length of the training, validation, and test sets

In [12]:
len(train_reviews_sentiments), len(validation_reviews_sentiments), len(test_reviews_sentiments)

(37500, 7500, 5000)

### Processing data into Python dictionaries

We will use these lists of dictionaries to create the JSON files.

In [13]:
import json
train_reviews_sentiments_list_dict = []
for r, s in train_reviews_sentiments:
  train_reviews_sentiments_list_dict.append({
      "review": r,
      "sentiment": int(s)
  })

test_reviews_sentiments_list_dict = []
for r, s in test_reviews_sentiments:
  test_reviews_sentiments_list_dict.append({
      "review": r,
      "sentiment": int(s)
  })

validation_reviews_sentiments_list_dict = []
for r, s in validation_reviews_sentiments:
  validation_reviews_sentiments_list_dict.append({
      "review": r,
      "sentiment": int(s)
  })

### Saving the preprocessed data to JSON files

In [14]:
import os
base_path="/content/drive/MyDrive/NLP Data/IMDB"
test_path = 'test.json'
train_path = 'train.json'
valid_path = 'validation.json'

In [15]:
file_object = open(os.path.join(base_path, train_path), 'w')
for line in train_reviews_sentiments_list_dict:
  file_object.write(json.dumps(line))
  file_object.write('\n')
file_object.close()
print("train.json created")

file_object = open(os.path.join(base_path, test_path), 'w')
for line in test_reviews_sentiments_list_dict:
  file_object.write(json.dumps(line))
  file_object.write('\n')
file_object.close()
print("test.json created")

file_object = open(os.path.join(base_path, valid_path), 'w')
for line in validation_reviews_sentiments_list_dict:
  file_object.write(json.dumps(line))
  file_object.write('\n')
file_object.close()
print("validation.json created")

train.json created
test.json created
validation.json created
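
The three write loops follow the same pattern, so they could be folded into one function. A minimal sketch, assuming a hypothetical helper named `write_json_lines` and a temporary directory in place of the Drive path; a `with` block closes the file even if a write fails.

```python
import json
import os
import tempfile

def write_json_lines(records, path):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_json_lines(path):
    """Read the file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

records = [{"review": "loved it", "sentiment": 1},
           {"review": "boring", "sentiment": 0}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.json")
    write_json_lines(records, path)
    round_trip = read_json_lines(path)

print(round_trip == records)  # True
```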


### Preparing the `Fields`

In [16]:
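# Re-import torchtext's `data` module: the name was rebound to a NumPy array in the shuffling step.
from torchtext.legacy import data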

In [22]:
TEXT = data.Field(tokenize = 'spacy',
                  tokenizer_language = 'en_core_web_sm',
                  include_lengths=True
                  )
LABEL = data.LabelField(dtype = torch.float)

### Mapping the JSON keys to the fields

In [25]:
# JSON key -> (attribute name on each example, Field used to process it)
fields = {
    "review": ("review", TEXT),
    "sentiment": ("sentiment", LABEL)
}

### Creating a dataset

In [27]:
train_data, test_data, validation_data = data.TabularDataset.splits(
    path="/content/drive/MyDrive/NLP Data/IMDB/",
    train="train.json",
    test="test.json",
    validation = "validation.json",
    format="json",
    fields=fields
)

In [28]:
print(vars(train_data[0]))

{'review': ['I', 'really', 'liked', 'this', 'Summerslam', 'due', 'to', 'the', 'look', 'of', 'the', 'arena', ',', 'the', 'curtains', 'and', 'just', 'the', 'look', 'overall', 'was', 'interesting', 'to', 'me', 'for', 'some', 'reason', '.', 'Anyways', ',', 'this', 'could', 'have', 'been', 'one', 'of', 'the', 'best', 'Summerslam', "'s", 'ever', 'if', 'the', 'WWF', 'did', "n't", 'have', 'Lex', 'Luger', 'in', 'the', 'main', 'event', 'against', 'Yokozuna', ',', 'now', 'for', 'it', "'s", 'time', 'it', 'was', 'ok', 'to', 'have', 'a', 'huge', 'fat', 'man', 'vs', 'a', 'strong', 'man', 'but', 'I', "'m", 'glad', 'times', 'have', 'changed', '.', 'It', 'was', 'a', 'terrible', 'main', 'event', 'just', 'like', 'every', 'match', 'Luger', 'is', 'in', 'is', 'terrible', '.', 'Other', 'matches', 'on', 'the', 'card', 'were', 'Razor', 'Ramon', 'vs', 'Ted', 'Dibiase', ',', 'Steiner', 'Brothers', 'vs', 'Heavenly', 'Bodies', ',', 'Shawn', 'Michaels', 'vs', 'Curt', 'Hening', ',', 'this', 'was', 'the', 'event', 'wh

In [30]:
from collections import Counter
counter = Counter()
for i in train_data:
  counter[i.sentiment] += 1
counter

Counter({0: 18780, 1: 18720})

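#### Checking how many examples we have in each set.

`TabularDataset.splits` returns its datasets in the order (train, validation, test), regardless of the order of the keyword arguments. With the unpacking used above, `validation_data` therefore holds the 5,000 examples from `test.json` and `test_data` holds the 7,500 examples from `validation.json`, which is what the table below shows.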

In [32]:
from prettytable import PrettyTable
def tabulate(column_names, data):
  table = PrettyTable(column_names)
  for row in data:
    table.add_row(row)
  print(table)
column_names = ["SUBSET", "EXAMPLE(s)"]
data_sets = [
        ["training", len(train_data)],
        ['validation', len(validation_data)],
        ['test', len(test_data)]
]
tabulate(column_names, data_sets)

+------------+------------+
|   SUBSET   | EXAMPLE(s) |
+------------+------------+
|  training  |   37500    |
| validation |    5000    |
|    test    |    7500    |
+------------+------------+


### Loading pretrained word embeddings.

In [33]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(
    train_data,
    max_size = MAX_VOCAB_SIZE,
    vectors = "glove.6B.100d",
    unk_init = torch.Tensor.normal_
)

LABEL.build_vocab(train_data)
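
Capping the vocabulary at `MAX_VOCAB_SIZE` keeps only the most frequent tokens; two special tokens, `<unk>` and `<pad>`, are added on top, which is why the vocabulary ends up with 25,002 entries. The sketch below is a simplified stand-in written with `collections.Counter` to illustrate the idea; it is not torchtext's implementation, and `build_vocab` and `lookup` are hypothetical names.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size):
    """Keep the max_size most frequent tokens, plus <unk> and <pad>."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    itos = ["<unk>", "<pad>"] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

texts = [["this", "movie", "is", "good"],
         ["this", "movie", "is", "bad"],
         ["this", "is", "fine"]]
itos, stoi = build_vocab(texts, max_size=3)

def lookup(token):
    # Tokens outside the vocabulary fall back to the <unk> index.
    return stoi.get(token, stoi["<unk>"])

print(len(itos))       # 5: three words plus the two special tokens
print(lookup("good"))  # 0: "good" was cut by the size cap, so it maps to <unk>
```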

### Creating Iterators - `Bucket Iterator`.

In [34]:
BATCH_SIZE = 64

train_iterator, validation_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, validation_data, test_data),
    device = device,
    batch_size = BATCH_SIZE,
    sort_key = lambda x: len(x.review),
)
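
A bucket iterator groups examples of similar length so that each batch needs as little padding as possible. The sketch below only illustrates why that helps, using invented review lengths; it is not torchtext's batching algorithm, and `pad_count` and `make_batches` are hypothetical helpers.

```python
def pad_count(batches):
    """Total padding tokens needed to pad every batch to its longest example."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)

def make_batches(lengths, batch_size):
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]

lengths = [5, 120, 7, 118, 6, 121, 8, 119]  # toy review lengths in tokens

unsorted_batches = make_batches(lengths, batch_size=4)
bucketed_batches = make_batches(sorted(lengths), batch_size=4)

print(pad_count(unsorted_batches))  # 460
print(pad_count(bucketed_batches))  # 12
```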

### Creating a Model
In this notebook we are going to use a `bidirectional` LSTM.

In [35]:
from torch import nn
from torch.nn import functional as F

In [38]:
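class IMDBLSTMRNN(nn.Module):
  def __init__(self, vocab_size, embedding_size, hidden_size, output_size,
               num_layers, bidirectional, dropout, pad_idx):
    super(IMDBLSTMRNN, self).__init__()

    self.embedding = nn.Embedding(vocab_size, embedding_dim=embedding_size, padding_idx=pad_idx)
    self.lstm = nn.LSTM(embedding_size, hidden_size=hidden_size,
                        bidirectional=bidirectional, num_layers=num_layers,
                        dropout=dropout)
    # hidden_size * 2: the forward and backward final states are concatenated,
    # so this layer assumes bidirectional=True.
    self.fc = nn.Linear(hidden_size * 2, out_features=output_size)
    self.dropout = nn.Dropout(dropout)

  def forward(self, text, text_lengths):
    # text: [seq_len, batch_size]; text_lengths: [batch_size]
    embedded = self.dropout(self.embedding(text))
    # Pack the padded batch so the LSTM skips the padding positions.
    packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'), enforce_sorted=False)
    # h_n and c_n are the final hidden and cell states: [num_layers * 2, batch_size, hidden_size]
    packed_output, (h_n, c_n) = self.lstm(packed_embedded)
    # Unpacked for completeness; only the final hidden states are used below.
    output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
    # h_n[-2] and h_n[-1]: the last layer's forward and backward final hidden states
    hidden = self.dropout(torch.cat((h_n[-2,:,:], h_n[-1,:,:]), dim = 1))
    return self.fc(hidden)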
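
The two slices concatenated at the end of `forward` (indices `-2` and `-1` of the LSTM's hidden-state tensor) are the last layer's forward and backward final states. A small standalone sketch with toy dimensions, assuming an unpadded random batch, that checks this against the per-step `output` tensor:

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2, bidirectional=True)
x = torch.randn(5, 3, 8)  # [seq_len, batch_size, features]

output, (h_n, c_n) = lstm(x)
# h_n:    [num_layers * 2, batch_size, hidden_size]  -> [4, 3, 16]
# output: [seq_len, batch_size, 2 * hidden_size]     -> [5, 3, 32]

forward_final = h_n[-2]   # last layer, forward direction
backward_final = h_n[-1]  # last layer, backward direction
sentence_vector = torch.cat((forward_final, backward_final), dim=1)  # [3, 32]

# The forward direction finishes at the last time step,
# the backward direction finishes at the first one.
print(torch.allclose(forward_final, output[-1, :, :16]))
print(torch.allclose(backward_final, output[0, :, 16:]))
```

This is why the linear layer takes `hidden_size * 2` input features.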
    

### Creating a model instance

In [39]:

INPUT_DIM = len(TEXT.vocab) # 25002: 25,000 words plus <unk> and <pad>
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] # 1
imdb_model = IMDBLSTMRNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)
imdb_model

IMDBLSTMRNN(
  (embedding): Embedding(25002, 100, padding_idx=1)
  (lstm): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### Counting number of trainable parameters in the model.

In [40]:
def count_trainable_params(model):
  return sum([i.numel() for i in model.parameters() if i.requires_grad])

print(f'The model has {count_trainable_params(imdb_model):,} trainable parameters')

The model has 4,810,857 trainable parameters
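
The reported total can be reproduced by hand from the layer sizes in the model printout. A sketch of the arithmetic, using the standard LSTM layout of four gates per direction, each with input weights, recurrent weights, and two bias vectors; `lstm_direction_params` is a hypothetical helper name.

```python
def lstm_direction_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights and two bias vectors
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

vocab_size, emb, hid = 25_002, 100, 256

embedding = vocab_size * emb                      # 2,500,200
layer1 = 2 * lstm_direction_params(emb, hid)      # 733,184 (two directions)
layer2 = 2 * lstm_direction_params(2 * hid, hid)  # 1,576,960 (input is 2 * hid wide)
fc = 2 * hid * 1 + 1                              # 513 (weights plus bias)

total = embedding + layer1 + layer2 + fc
print(f"{total:,}")  # 4,810,857
```

More than half of the parameters sit in the embedding table.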


#### Loading pretrained embeddings

In [41]:
pretrained_embeddings = TEXT.vocab.vectors

In [42]:
imdb_model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.1233,  0.3499,  0.6173],
        [ 0.7262,  0.0912, -0.3891,  ...,  0.0821,  0.4440, -0.7240],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.1098,  0.5145,  0.3449,  ..., -0.5092,  0.6672,  0.3905],
        [-0.5405,  0.3392,  0.0200,  ...,  0.1090, -1.0385,  0.4093],
        [ 0.8393,  0.4943, -0.8756,  ...,  0.6801,  0.2226,  1.1534]])

### Zeroing the `<pad>` and `<unk>` tokens

In [44]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
imdb_model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
imdb_model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
print(imdb_model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.1098,  0.5145,  0.3449,  ..., -0.5092,  0.6672,  0.3905],
        [-0.5405,  0.3392,  0.0200,  ...,  0.1090, -1.0385,  0.4093],
        [ 0.8393,  0.4943, -0.8756,  ...,  0.6801,  0.2226,  1.1534]])


### Training the Model.

In [45]:
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(imdb_model.parameters())

### Moving the model and criterion to the device.

In [46]:
imdb_model = imdb_model.to(device)
criterion = criterion.to(device)

### The Binary Accuracy Function.

In [47]:
def binary_accuracy(y_preds, y_true):
  #round predictions to the closest integer
  rounded_preds = torch.round(torch.sigmoid(y_preds))
  correct = (rounded_preds == y_true).float() #convert into float for division 
  acc = correct.sum() / len(correct)
  return acc
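
The same computation can be traced by hand in plain Python: the model outputs raw logits, the sigmoid turns them into probabilities, and rounding gives the predicted class. The logits and labels below are invented for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_accuracy(logits, labels):
    # Probability >= 0.5 rounds to class 1, otherwise class 0.
    preds = [round(sigmoid(z)) for z in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

logits = [2.0, -1.0, 0.3, -0.2]  # positive logit -> probability above 0.5
labels = [1, 0, 0, 0]

acc = binary_accuracy(logits, labels)
print(acc)  # 0.75: the third example is predicted 1 but labelled 0
```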

### Training and evaluation functions.

In [48]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text, text_lengths = batch.review
        predictions = model(text, text_lengths).squeeze(1)
        loss = criterion(predictions, batch.sentiment)
        acc = binary_accuracy(predictions, batch.sentiment)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.review
            predictions = model(text, text_lengths).squeeze(1)
            loss = criterion(predictions, batch.sentiment)
            acc = binary_accuracy(predictions, batch.sentiment)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
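
Both functions average the per-batch means, so every batch counts equally. When the last batch is smaller than the others, this differs slightly from the mean over individual examples. A sketch with invented numbers, exaggerated to make the gap visible:

```python
# (batch_size, batch_accuracy) pairs; the values are invented for illustration.
batches = [(64, 0.90), (64, 0.90), (22, 0.50)]

# What train() and evaluate() report: each batch has equal weight.
mean_of_batch_means = sum(acc for _, acc in batches) / len(batches)

# Mean over individual examples: each batch is weighted by its size.
total_examples = sum(size for size, _ in batches)
per_example_mean = sum(size * acc for size, acc in batches) / total_examples

print(f"{mean_of_batch_means:.4f}")  # 0.7667
print(f"{per_example_mean:.4f}")     # 0.8413
```

With many full batches and a single short one, the difference is usually small.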

We'll also create a function to tell us how long an epoch takes to compare training times between models.

In [49]:
import time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
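
A quick usage sketch of the function with made-up timestamps, showing how an elapsed time in seconds is split into whole minutes and remaining whole seconds:

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

mins, secs = epoch_time(1000.0, 1157.4)  # 157.4 seconds elapsed
print(f"Epoch Time: {mins}m {secs}s")    # Epoch Time: 2m 37s
```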

### Training Loop

In [51]:
N_EPOCHS = 10
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(imdb_model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(imdb_model, validation_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(imdb_model.state_dict(), 'best-model.pt')
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 2m 37s
	Train Loss: 0.680 | Train Acc: 56.30%
	 Val. Loss: 0.633 |  Val. Acc: 64.48%
Epoch: 02 | Epoch Time: 2m 39s
	Train Loss: 0.559 | Train Acc: 70.85%
	 Val. Loss: 0.484 |  Val. Acc: 77.31%
Epoch: 03 | Epoch Time: 2m 39s
	Train Loss: 0.326 | Train Acc: 86.28%
	 Val. Loss: 0.254 |  Val. Acc: 89.85%
Epoch: 04 | Epoch Time: 2m 39s
	Train Loss: 0.266 | Train Acc: 89.15%
	 Val. Loss: 0.271 |  Val. Acc: 88.69%
Epoch: 05 | Epoch Time: 2m 38s
	Train Loss: 0.222 | Train Acc: 91.31%
	 Val. Loss: 0.252 |  Val. Acc: 89.22%
Epoch: 06 | Epoch Time: 2m 39s
	Train Loss: 0.195 | Train Acc: 92.47%
	 Val. Loss: 0.224 |  Val. Acc: 91.36%
Epoch: 07 | Epoch Time: 2m 39s
	Train Loss: 0.178 | Train Acc: 93.15%
	 Val. Loss: 0.233 |  Val. Acc: 91.40%
Epoch: 08 | Epoch Time: 2m 39s
	Train Loss: 0.158 | Train Acc: 94.05%
	 Val. Loss: 0.224 |  Val. Acc: 91.93%
Epoch: 09 | Epoch Time: 2m 39s
	Train Loss: 0.140 | Train Acc: 94.83%
	 Val. Loss: 0.230 |  Val. Acc: 91.71%
Epoch: 10 | Epoch T
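
The checkpointing rule in the loop saves the model only when the validation loss strictly improves. The sketch below replays that rule on the nine validation losses printed above. The logged values are rounded to three decimals, so ties such as epochs 6 and 8 cannot be resolved from the log, and epoch 10 is not included.

```python
# Validation losses for epochs 1-9 as printed in the log (rounded to 3 decimals).
logged_valid_losses = [0.633, 0.484, 0.254, 0.271, 0.252, 0.224, 0.233, 0.224, 0.230]

best_valid_loss = float("inf")
saved_epochs = []
for epoch, valid_loss in enumerate(logged_valid_losses, start=1):
    if valid_loss < best_valid_loss:  # strict improvement only
        best_valid_loss = valid_loss
        saved_epochs.append(epoch)    # stands in for torch.save(...)

print(saved_epochs)     # [1, 2, 3, 5, 6]
print(best_valid_loss)  # 0.224
```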

### Evaluating the best model.

In [52]:
imdb_model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(imdb_model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.243 | Test Acc: 90.75%


### Model Inference - Making predictions

In [57]:
import en_core_web_sm
nlp = en_core_web_sm.load()
sentiments = ["NEG", "POS"]
def predict_sentiment(model, sent):
  model.eval()
  # Tokenize with the same spaCy model that the TEXT field used.
  tokenized = [tok.text for tok in nlp.tokenizer(sent)]
  # Map tokens to vocabulary indices; out-of-vocabulary tokens map to <unk>.
  indexed = [TEXT.vocab.stoi[t] for t in tokenized]
  length = [len(indexed)]
  tensor = torch.LongTensor(indexed).to(device)
  tensor = tensor.unsqueeze(1)  # [seq_len] -> [seq_len, 1]: a batch of one
  length_tensor = torch.LongTensor(length)
  with torch.no_grad():  # no gradients are needed for inference
    prediction = torch.sigmoid(model(tensor, length_tensor))

  predicted_class = round(prediction.item())
  # Report the probability of whichever class was predicted.
  confidence = prediction.item() if prediction.item() >= .5 else 1 - prediction.item()
  print(f'PREDICTED CLASS:\t{predicted_class}\nCONFIDENCE:\t\t{confidence * 100:.2f}%\nSENTIMENT:\t\t{sentiments[predicted_class]}')
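
The sigmoid output is the probability of the positive class, so the confidence reported for a negative prediction is one minus that probability. A plain-Python sketch of this post-processing step; `class_and_confidence` is a hypothetical helper and the logits are invented.

```python
import math

def class_and_confidence(logit):
    """Mirror the post-processing applied to the model's output logit."""
    probability = 1.0 / (1.0 + math.exp(-logit))  # sigmoid: P(positive)
    predicted_class = round(probability)
    confidence = probability if probability >= 0.5 else 1 - probability
    return predicted_class, confidence

neg_class, neg_conf = class_and_confidence(-3.0)
pos_class, pos_conf = class_and_confidence(2.0)

print(neg_class, f"{neg_conf * 100:.2f}%")  # 0 95.26%
print(pos_class, f"{pos_conf * 100:.2f}%")  # 1 88.08%
```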

### Negative review

In [58]:
predict_sentiment(imdb_model, "This movie is boring")

PREDICTED CLASS:	0
CONFIDENCE:		99.93%
SENTIMENT:		NEG


In [62]:
predict_sentiment(imdb_model, "This movie is bad.")

PREDICTED CLASS:	0
CONFIDENCE:		99.58%
SENTIMENT:		NEG


### Positive review

In [59]:
predict_sentiment(imdb_model, "This movie is the best.")

PREDICTED CLASS:	1
CONFIDENCE:		98.83%
SENTIMENT:		POS


In [63]:
predict_sentiment(imdb_model, "This movie is good.")

PREDICTED CLASS:	1
CONFIDENCE:		93.05%
SENTIMENT:		POS
