### Amazon Data - Sentiment Analysis

In this notebook we use our own dataset, load it with `torchtext`, and predict whether a review expresses positive or negative sentiment. We have two files:

```
Books_small_10000.json -> training
Books_small.json -> testing
```

### Imports

In [1]:
import torch
from torchtext.legacy import data
import numpy as np
import matplotlib.pyplot as plt
import en_core_web_sm
import random

### Device Config

In [2]:
SEED = 42
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.backends.cudnn.deterministic = True

### Data Prep.

Each record in the JSON files looks like this:
```json
{"reviewerID": "A1E5ZR1Z4OQJG", "asin": "1495329321", "reviewerName": "Pure Jonel \"Pure Jonel\"", "helpful": [0, 0], "reviewText": "Da Silva takes the divine by storm with this unique new novel.  She develops a world unlike any others while keeping it firmly in the real world.  This is a very well written and entertaining novel.  I was quite impressed and intrigued by the way that this solid storyline was developed, bringing the readers right into the world of the story.  I was engaged throughout and definitely enjoyed my time spent reading it.I loved the character development in this novel.  Da Silva creates a cast of high school students who actually act like high school students.  I really appreciated the fact that none of them were thrown into situations far beyond their years, nor did they deal with events as if they had decades of life experience under their belts.  It was very refreshing and added to the realism and impact of the novel.  The friendships between the characters in this novel were also truly touching.Overall, this novel was fantastic.  I can&#8217;t wait to read more and to find out what happens next in the series.  I&#8217;d definitely recommend this debut novel by Da Silva to those who want a little YA fun with a completely unique & shocking storyline.Please note that I received a complimentary copy of this work in exchange for an honest review.", "overall": 4.0, "summary": "An amazing first novel", "unixReviewTime": 1396137600, "reviewTime": "03 30, 2014"}

```
Here we are interested in two fields:

"reviewText" -> the text of the review

"overall" -> the star rating, which we turn into a sentiment label by treating a rating greater than 2 (that is, 3 or higher) as a positive review.




### Preparing the `Fields`

In [3]:
get_sentiment = lambda x: 1 if x >=3 else 0
# dtype = torch.float
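
As a minimal sketch of the labelling rule in `get_sentiment`, the snippet below parses one JSON line with the standard library and derives the binary label. The record is a shortened, hypothetical stand-in for a line of the dataset, and `to_label` is an illustrative name that does not appear in the notebook.

```python
import json

# Hypothetical, shortened record in the same shape as a line of the JSON files.
line = '{"reviewText": "A very well written and entertaining novel.", "overall": 4.0}'

def to_label(overall):
    """Map a star rating to 1 (positive) if it is 3 or higher, else 0."""
    return 1 if overall >= 3 else 0

record = json.loads(line)
review, label = record["reviewText"], to_label(record["overall"])
print(label)  # 1
```

Ratings of 1 and 2 map to 0, while 3, 4 and 5 map to 1, so neutral 3-star reviews count as positive under this rule.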

In [4]:
TEXT = data.Field(tokenize = 'spacy',
                  tokenizer_language = 'en_core_web_sm',
                  include_lengths=True
                  )
LABEL = data.LabelField(preprocessing=get_sentiment, dtype = torch.float)

In [5]:
fields = {
    'reviewText': ('review', TEXT),
    'overall': ('sentiment', LABEL)
}

In [6]:
train_data, test_data = data.TabularDataset.splits(
    path="/content/",
    train="Books_small_10000.json",
    test="Books_small.json",
    format="json",
    fields=fields
)

In [7]:
print(vars(train_data[0]))

{'review': ['I', 'bought', 'both', 'boxed', 'sets', ',', 'books', '1', '-', '5', '.', ' ', 'Really', 'a', 'great', 'series', '!', ' ', 'Start', 'book', '1', 'three', 'weeks', 'ago', 'and', 'just', 'finished', 'book', '5', '.', ' ', 'Sloane', 'Monroe', 'is', 'a', 'great', 'character', 'and', 'being', 'able', 'to', 'follow', 'her', 'through', 'both', 'private', 'life', 'and', 'her', 'PI', 'life', 'gets', 'a', 'reader', 'very', 'involved', '!', ' ', 'Although', 'clues', 'may', 'be', 'right', 'in', 'front', 'of', 'the', 'reader', ',', 'there', 'are', 'twists', 'and', 'turns', 'that', 'keep', 'one', 'guessing', 'until', 'the', 'last', 'page', '!', ' ', 'These', 'are', 'books', 'you', 'wo', "n't", 'be', 'disappointed', 'with', '.'], 'sentiment': 1}


In [8]:
from collections import Counter
counter = Counter()

for i in train_data:
  counter[i.sentiment] += 1

counter

Counter({0: 644, 1: 9356})

We have `644` negative and `9356` positive reviews in the training data, so the classes are heavily imbalanced.
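
With classes this skewed, accuracy on its own can be misleading. The sketch below uses the counts from the `Counter` output above to compute what a classifier would score on the 10,000 training examples if it always predicted the majority class. The class balance of the validation and test sets is not shown in this notebook, so treat the figure as a rough reference point only.

```python
# Class counts copied from the Counter output above.
counts = {0: 644, 1: 9356}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_acc = counts[majority_label] / total

print(f"negative share: {counts[0] / total:.2%}")                # 6.44%
print(f"majority-class baseline accuracy: {baseline_acc:.2%}")   # 93.56%
```

Any accuracy reported later is worth comparing against this baseline before concluding that the model has learned to separate the two classes.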

### Checking the number of examples in each split.

In [9]:
print(f"TRAINING EXAMPLES: \t {len(train_data)}\nTEST EXAMPLES: \t {len(test_data)}\nTOTAL EXAMPLES: \t {len(train_data) + len(test_data)}")

TRAINING EXAMPLES: 	 10000
TEST EXAMPLES: 	 1000
TOTAL EXAMPLES: 	 11000


### Creating validation data from the train_data.

In [10]:
# random.seed() returns None, so reseed and pass the generator state instead.
random.seed(SEED)
train_data, validation_data = train_data.split(random_state=random.getstate())

In [11]:
print(f"TRAINING EXAMPLES: \t {len(train_data)}\nVALIDATION EXAMPLES: \t {len(validation_data)}\nTEST EXAMPLES: \t {len(test_data)}\nTOTAL EXAMPLES: \t {len(train_data) + len(test_data) + len(validation_data)}")

TRAINING EXAMPLES: 	 7000
VALIDATION EXAMPLES: 	 3000
TEST EXAMPLES: 	 1000
TOTAL EXAMPLES: 	 11000
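
The 7,000 / 3,000 sizes are consistent with a 70/30 split, which is the default `split_ratio` of the legacy `split` method as far as I know. The following is a plain-Python sketch of a seeded shuffle-and-cut split. It is not torchtext's implementation, and `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n, ratio=0.7, seed=42):
    """Shuffle range(n) with a seeded generator and cut it at `ratio`."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    cut = int(round(ratio * n))
    return indices[:cut], indices[cut:]

train_idx, valid_idx = split_indices(10_000)
print(len(train_idx), len(valid_idx))  # 7000 3000
```

Because the generator is seeded, calling `split_indices` again with the same arguments returns the same partition, which is the property the `random_state` argument is meant to provide.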


### Loading pretrained word embeddings.

In [12]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(
    train_data,
    max_size = MAX_VOCAB_SIZE,
    vectors = "glove.6B.100d",
    unk_init = torch.Tensor.normal_
)

LABEL.build_vocab(train_data)

### Creating Iterators - `Bucket Iterator`.

In [13]:
BATCH_SIZE = 64

train_iterator, validation_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, validation_data, test_data),
    device = device,
    batch_size = BATCH_SIZE,
    sort_key = lambda x: len(x.review),
)
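
`BucketIterator` batches examples of similar length together (using `sort_key`) so that less padding is needed. The sketch below illustrates the idea in plain Python by comparing the padding required when batches are cut from unsorted versus length-sorted data. The lengths are randomly generated stand-ins, and the sort-then-chunk scheme is a simplification rather than torchtext's actual algorithm.

```python
import random

random.seed(0)
# Hypothetical review lengths in tokens; real lengths come from the dataset.
lengths = [random.randint(5, 200) for _ in range(640)]
BATCH_SIZE = 64

def chunk(seq, size):
    """Cut a list into consecutive batches of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def padding_needed(batches):
    """Total pad tokens when every batch is padded to its longest sequence."""
    return sum(max(b) * len(b) - sum(b) for b in batches)

naive = padding_needed(chunk(lengths, BATCH_SIZE))
bucketed = padding_needed(chunk(sorted(lengths), BATCH_SIZE))
print(f"pad tokens without bucketing: {naive}")
print(f"pad tokens with bucketing:    {bucketed}")
```

Fewer pad tokens means less wasted computation in the LSTM, which is the reason for sorting by `len(x.review)`.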

### Creating a Model
In this notebook we are going to use a `bidirectional` LSTM.

In [14]:
from torch import nn
from torch.nn import functional as F

In [15]:
class AmazonLSTMRNN(nn.Module):
  def __init__(self, vocab_size, embedding_size, hidden_size, output_size, num_layers
               , bidirectional, dropout, pad_idx):
    super(AmazonLSTMRNN, self).__init__()

    self.embedding = nn.Embedding(vocab_size, embedding_dim=embedding_size, padding_idx=pad_idx)
    self.lstm = nn.LSTM(embedding_size, hidden_size=hidden_size, 
                        bidirectional=bidirectional, num_layers=num_layers,
                        dropout=dropout)
    self.fc = nn.Linear(hidden_size * 2, out_features=output_size)
    self.dropout = nn.Dropout(dropout)

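  def forward(self, text, text_lengths):
    # text: [seq_len, batch_size]; text_lengths: [batch_size]
    embedded = self.dropout(self.embedding(text))
    # Pack the batch so the LSTM skips padded positions; lengths must be on the CPU.
    packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'), enforce_sorted=False)
    # h_0 and c_0 hold the final hidden and cell states, despite their names.
    packed_output, (h_0, c_0) = self.lstm(packed_embedded)
    output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
    # Concatenate the last layer's final forward (-2) and backward (-1) hidden states.
    h_0 = self.dropout(torch.cat((h_0[-2,:,:], h_0[-1,:,:]), dim = 1))
    return self.fc(h_0)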
    

### Creating a model instance

In [16]:

INPUT_DIM = len(TEXT.vocab) # 25002
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] # 1
amazon_model = AmazonLSTMRNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)
amazon_model

AmazonLSTMRNN(
  (embedding): Embedding(25002, 100, padding_idx=1)
  (lstm): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### Counting number of trainable parameters in the model.

In [17]:
def count_trainable_params(model):
  return sum([i.numel() for i in model.parameters() if i.requires_grad])

print(f'The model has {count_trainable_params(amazon_model):,} trainable parameters')

The model has 4,810,857 trainable parameters
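
The total can be checked by hand from the layer sizes printed above. The sketch below assumes the standard PyTorch LSTM parameter layout (four gates, each with input weights, recurrent weights and two bias vectors per direction) and is a worked calculation, not part of the original notebook.

```python
VOCAB, EMB, HID, OUT, LAYERS, DIRECTIONS = 25_002, 100, 256, 1, 2, 2

def lstm_layer_params(input_size, hidden_size):
    """Parameters for one direction of one LSTM layer: 4 gates, each with
    input weights, recurrent weights, and two bias vectors (b_ih, b_hh)."""
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

embedding = VOCAB * EMB
lstm = 0
for layer in range(LAYERS):
    # Layers after the first receive the concatenated outputs of both directions.
    input_size = EMB if layer == 0 else HID * DIRECTIONS
    lstm += DIRECTIONS * lstm_layer_params(input_size, HID)
fc = HID * DIRECTIONS * OUT + OUT  # weights plus bias

total = embedding + lstm + fc
print(f"{total:,}")  # 4,810,857
```

More than half of the parameters (2,500,200) sit in the embedding table, which is why initializing it from pretrained vectors matters.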


#### Loading pretrained embeddings

In [18]:
pretrained_embeddings = TEXT.vocab.vectors

In [19]:
amazon_model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.1233,  0.3499,  0.6173],
        [ 0.7262,  0.0912, -0.3891,  ...,  0.0821,  0.4440, -0.7240],
        [-0.3398,  0.2094,  0.4635,  ..., -0.2339,  0.4730, -0.0288],
        ...,
        [ 0.6949,  0.2950, -1.4695,  ..., -1.3336,  0.5506,  0.4094],
        [ 1.0059, -0.3491, -0.8520,  ..., -0.8817,  1.3140, -0.1069],
        [ 0.9223,  0.0553,  0.0212,  ..., -0.7226, -1.5909,  0.7309]])

### Zeroing the `<pad>` and `<unk>` tokens

In [20]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
amazon_model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
amazon_model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
print(amazon_model.embedding.weight.data)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.3398,  0.2094,  0.4635,  ..., -0.2339,  0.4730, -0.0288],
        ...,
        [ 0.6949,  0.2950, -1.4695,  ..., -1.3336,  0.5506,  0.4094],
        [ 1.0059, -0.3491, -0.8520,  ..., -0.8817,  1.3140, -0.1069],
        [ 0.9223,  0.0553,  0.0212,  ..., -0.7226, -1.5909,  0.7309]])


### Training the Model.

In [21]:
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(amazon_model.parameters())

### Moving the criterion and amazon_model to the device.

In [22]:
amazon_model = amazon_model.to(device)
criterion = criterion.to(device)

### The Binary Accuracy Function.

In [23]:
def binary_accuracy(y_preds, y_true):
  #round predictions to the closest integer
  rounded_preds = torch.round(torch.sigmoid(y_preds))
  correct = (rounded_preds == y_true).float() #convert into float for division 
  acc = correct.sum() / len(correct)
  return acc
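
To make the arithmetic concrete, here is a plain-Python sketch of the same computation on a hypothetical batch of four logits. `binary_accuracy_py` is an illustrative stand-in for the tensor version above. Rounding the sigmoid at 0.5 is equivalent to checking whether the logit is positive.

```python
import math

def sigmoid(x):
    """Logistic function, mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_accuracy_py(logits, targets):
    """Fraction of examples whose rounded probability equals the target."""
    preds = [round(sigmoid(z)) for z in logits]
    correct = [float(p == t) for p, t in zip(preds, targets)]
    return sum(correct) / len(correct)

# Hypothetical logits and labels for a batch of four reviews.
logits = [2.0, -1.0, 0.3, -0.2]
targets = [1, 0, 0, 0]
print(binary_accuracy_py(logits, targets))  # 0.75
```

The third example has a positive logit but a target of 0, so three of the four predictions are correct.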

### Training and evaluation functions.

In [24]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text, text_lengths = batch.review
        predictions = model(text, text_lengths).squeeze(1)
        loss = criterion(predictions, batch.sentiment)
        acc = binary_accuracy(predictions, batch.sentiment)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.review
            predictions = model(text, text_lengths).squeeze(1)
            loss = criterion(predictions, batch.sentiment)
            acc = binary_accuracy(predictions, batch.sentiment)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We'll also create a function to tell us how long an epoch takes to compare training times between models.

In [25]:
import time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

### Training Loop

In [26]:
N_EPOCHS = 10
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(amazon_model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(amazon_model, validation_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(amazon_model.state_dict(), 'best-model.pt')
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 22s
	Train Loss: 0.263 | Train Acc: 92.57%
	 Val. Loss: 0.222 |  Val. Acc: 94.08%
Epoch: 02 | Epoch Time: 0m 21s
	Train Loss: 0.240 | Train Acc: 93.38%
	 Val. Loss: 0.214 |  Val. Acc: 94.08%
Epoch: 03 | Epoch Time: 0m 22s
	Train Loss: 0.211 | Train Acc: 93.35%
	 Val. Loss: 0.206 |  Val. Acc: 93.55%
Epoch: 04 | Epoch Time: 0m 21s
	Train Loss: 0.184 | Train Acc: 93.74%
	 Val. Loss: 0.212 |  Val. Acc: 94.01%
Epoch: 05 | Epoch Time: 0m 22s
	Train Loss: 0.171 | Train Acc: 93.70%
	 Val. Loss: 0.203 |  Val. Acc: 91.57%
Epoch: 06 | Epoch Time: 0m 22s
	Train Loss: 0.143 | Train Acc: 94.74%
	 Val. Loss: 0.216 |  Val. Acc: 94.21%
Epoch: 07 | Epoch Time: 0m 22s
	Train Loss: 0.132 | Train Acc: 95.00%
	 Val. Loss: 0.201 |  Val. Acc: 91.61%
Epoch: 08 | Epoch Time: 0m 22s
	Train Loss: 0.105 | Train Acc: 95.81%
	 Val. Loss: 0.199 |  Val. Acc: 92.90%
Epoch: 09 | Epoch Time: 0m 22s
	Train Loss: 0.093 | Train Acc: 96.41%
	 Val. Loss: 0.196 |  Val. Acc: 93.03%
Epoch: 10 | Epoch T

### Evaluating the best model.

In [27]:
amazon_model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(amazon_model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.221 | Test Acc: 92.70%


### Model Inference - Making predictions

In [28]:
import en_core_web_sm
nlp = en_core_web_sm.load()
# NOTE: this mapping assumes class index 0 is negative; check LABEL.vocab.stoi to confirm.
sentiments = ["NEG", "POS"]
def predict_sentiment(model, sent):
  model.eval()
  # Tokenize with spaCy and map each token to its vocabulary index.
  tokenized = [tok.text for tok in nlp.tokenizer(sent)]
  indexed = [TEXT.vocab.stoi[t] for t in tokenized]
  length = [len(indexed)]
  # Shape [seq_len, 1]: a batch holding a single review.
  tensor = torch.LongTensor(indexed).to(device)
  tensor = tensor.unsqueeze(1)
  length_tensor = torch.LongTensor(length)
  # No gradients are needed at inference time.
  with torch.no_grad():
    prediction = torch.sigmoid(model(tensor, length_tensor))

  predicted_class = round(prediction.item())
  confidence = prediction.item() if prediction.item() >= .5 else 1 - prediction.item()
  print(f'PREDICTED CLASS:\t{predicted_class}\nCONFIDENCE:\t\t{confidence * 100:.2f}%\nSENTIMENT:\t\t{sentiments[predicted_class]}')
  return prediction.item()

### Negative review

In [29]:
predict_sentiment(amazon_model, "This book is terrible")

PREDICTED CLASS:	0
CONFIDENCE:		58.68%
SENTIMENT:		NEG


0.41323035955429077

### Positive review

In [30]:
predict_sentiment(amazon_model, "Da Silva takes the divine by storm with this unique new novel.  She develops a world unlike any others while keeping it firmly in the real world.  This is a very well written and entertaining novel.")

PREDICTED CLASS:	0
CONFIDENCE:		99.97%
SENTIMENT:		NEG


0.0002923606662079692
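
The review above is clearly positive, yet it is reported as `NEG` with very high confidence. One possible explanation, which is an assumption and was not verified against this run, is that `LabelField` builds a vocabulary ordered by label frequency, so the majority label `1` is assigned class index `0`. The sketch below mimics that ordering rule in plain Python with the counts seen earlier. It is an illustration, not torchtext code, and `sentiment_names` is a hypothetical mapping.

```python
from collections import Counter

# Label counts taken from the Counter output earlier in the notebook.
label_counts = Counter({0: 644, 1: 9356})

# Assumed ordering rule: most frequent label first, ties broken by label value.
ordered = sorted(label_counts.items(), key=lambda kv: (-kv[1], kv[0]))
itos = [label for label, _ in ordered]
stoi = {label: index for index, label in enumerate(itos)}
print(stoi)  # {1: 0, 0: 1}

# Map a predicted class index back to the original label before naming it.
sentiment_names = {0: "NEG", 1: "POS"}
predicted_index = 0  # class index printed for the positive review above
print(sentiment_names[itos[predicted_index]])  # POS
```

If `LABEL.vocab.stoi` in the notebook does show `{1: 0, 0: 1}`, then indexing `sentiments` directly with the predicted class inverts the names, and looking the class up through `LABEL.vocab.itos` first would give the intended sentiment.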