### Sentiment analysis.

In this notebook we perform a sentiment classification task: deciding whether a review is positive or negative, based on Amazon book review data. We use the file `Books_small_10000.json`, which is located in the `files` folder.

### `Books_small_10000.json` structure.

```json
{"reviewerID": "A1F2H80A1ZNN1N", "asin": "B00GDM3NQC", "reviewerName": "Connie Correll", "helpful": [0, 0], "reviewText": "I bought both boxed sets, books 1-5.  Really a great series!  Start book 1 three weeks ago and just finished book 5.  Sloane Monroe is a great character and being able to follow her through both private life and her PI life gets a reader very involved!  Although clues may be right in front of the reader, there are twists and turns that keep one guessing until the last page!  These are books you won't be disappointed with.", "overall": 5.0, "summary": "Can't stop reading!", "unixReviewTime": 1390435200, "reviewTime": "01 23, 2014"}
```

We treat any review whose `"overall"` rating is less than 3 as negative and every other review as positive. In the previous notebook we got reasonable accuracy by using pretrained word embeddings, so we do the same here.

### Imports

In [25]:
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
import os
from nltk.tokenize import word_tokenize
import json, re
from collections import Counter

### Data preparation.

In [12]:
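class Review:
    """A cleaned review text paired with its binary sentiment label."""
    def __init__(self, review, sentiment):
        self.review = review
        self.sentiment = sentiment

    @staticmethod
    def differentiate(sentiment):
        # Ratings below 3 are negative (0); everything else is positive (1).
        return 0 if sentiment < 3 else 1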

In [13]:
path = 'files/Books_small_10000.json'

### Creating a preprocessing function that removes digits and repeated whitespace from a review

In [17]:
def clean_text(sent):
    a = re.sub(r'\d', ' ', sent)
    b = re.sub(r'\s+', ' ', a)
    return b
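
As a quick check of what the two substitutions do, the sketch below applies the same function to a short string modelled on the sample review. The input string is chosen only for illustration. Note that punctuation is kept, so removing the digits from `1-5.` leaves a stray `- .` behind.

```python
import re

def clean_text(sent):
    # Replace each digit with a space, then collapse whitespace runs into one space.
    a = re.sub(r'\d', ' ', sent)
    b = re.sub(r'\s+', ' ', a)
    return b

sample = "books 1-5.  Really a great series!"
cleaned = clean_text(sample)
print(cleaned)  # books - . Really a great series!
```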

In [18]:
reviews = []
with open(path, 'r') as reader:
    for line in reader:
        json_data = json.loads(line)
        reviews.append(Review(
            clean_text(json_data["reviewText"]),
            Review.differentiate(json_data["overall"])
        ))

In [20]:
reviews[0].review

"I bought both boxed sets, books - . Really a great series! Start book three weeks ago and just finished book . Sloane Monroe is a great character and being able to follow her through both private life and her PI life gets a reader very involved! Although clues may be right in front of the reader, there are twists and turns that keep one guessing until the last page! These are books you won't be disappointed with."

In [22]:
reviews_text = [i.review for i in reviews]
reviews_labels = [i.sentiment for i in reviews]

In [26]:
Counter(reviews_labels)

Counter({1: 9356, 0: 644})
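
The counts above show a strong class imbalance, which matters when reading any accuracy figure later on. The sketch below works out, from those two counts alone, the accuracy of a classifier that always predicts the majority class, together with one common weighting heuristic (`total / (n_classes * count)`). The variable names are illustrative and nothing here is used by the rest of the notebook.

```python
from collections import Counter

# Label counts reported above.
label_counts = Counter({1: 9356, 0: 644})
total = sum(label_counts.values())

# Accuracy of a classifier that always predicts the most frequent label.
majority_baseline = max(label_counts.values()) / total

# "Balanced" weights: rarer classes get proportionally larger weights.
class_weights = {
    label: total / (len(label_counts) * count)
    for label, count in label_counts.items()
}

print(majority_baseline)  # 0.9356
print(class_weights)
```

A dictionary of this shape could be passed to `model.fit` through its `class_weight` argument if the imbalance turns out to hurt the negative class. That is a possible extension, not something this notebook does.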

### Vocabulary size, aka the number of unique words.

In [27]:
counter = Counter()
for sent in reviews_text:
    words = word_tokenize(sent)
    for word in words:
        counter[word] += 1

In [30]:
counter.most_common(3)

[('.', 49831), ('the', 44135), (',', 38643)]

In [29]:
vocabulary_size = len(counter)
vocabulary_size

47300

> We have `~47k` unique tokens (words and punctuation marks) in our data.

### Creating word vectors.

In [31]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [32]:
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(reviews_text)

In [36]:
word_indices = tokenizer.word_index
word_indices_reversed = dict([(v, k) for (k, v) in word_indices.items()])
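
To make the mapping concrete, here is a minimal pure-Python sketch of how a frequency-ranked word index of this kind can be built and reversed. It is an illustration, not the Keras implementation: the toy corpus and the `build_word_index` helper are hypothetical, and tokenization is simplified to lowercase whitespace splitting. Like the Keras `Tokenizer`, it starts numbering at 1 so that 0 stays free for padding.

```python
from collections import Counter

def build_word_index(texts):
    # Count lowercase, whitespace-separated tokens across all texts.
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() orders by descending count; ties keep first-seen order.
    return {
        word: rank
        for rank, (word, _) in enumerate(counts.most_common(), start=1)
    }

toy_texts = ["great book", "great series great character"]
toy_index = build_word_index(toy_texts)
toy_index_reversed = {index: word for word, index in toy_index.items()}

print(toy_index)  # {'great': 1, 'book': 2, 'series': 3, 'character': 4}
```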

### A function that converts sequences back to sentences.

In [49]:
def sequence_to_text(seq):
    return " ".join([word_indices_reversed[i] for i in seq])

### A function that converts sentences to sequences.
We are going to use this function during inference.

In [38]:
def sent_to_sequence(sent):
    words = word_tokenize(str(sent).lower())
    sequences = []
    for word in words:
        # Words missing from the vocabulary map to 0, the index also used for padding.
        sequences.append(word_indices.get(word, 0))
    return sequences
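
The lookup-with-fallback step can be isolated in a few lines. The sketch below assumes a tiny hypothetical `toy_word_indices` dictionary and uses `str.split` in place of `word_tokenize`, so that it runs without NLTK. It makes one thing visible: any token that is not in the dictionary becomes 0, the same value used for padding.

```python
# Hypothetical vocabulary, for illustration only.
toy_word_indices = {"this": 1, "book": 2, "is": 3, "great": 4}

def toy_sent_to_sequence(sent, word_indices):
    # Lowercase, split on whitespace, and fall back to 0 for unknown words.
    words = str(sent).lower().split()
    return [word_indices.get(word, 0) for word in words]

seq = toy_sent_to_sequence("This book is GREAT fun", toy_word_indices)
print(seq)  # [1, 2, 3, 4, 0]
```

The Keras `Tokenizer` lowercases text and strips most punctuation, while `word_tokenize` keeps punctuation as separate tokens and splits contractions such as `won't`. Calling `tokenizer.texts_to_sequences([sent])[0]` at inference time may therefore match the training-time preprocessing more closely. Treat this as a suggestion to verify, not a required change.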

### Loading pretrained weights `glove.6B`.
We are going to use these weights in our `embedding` layer, which is the first layer in the network.

In [42]:
embeddings_dictionary = dict()
with open(r"C:\Users\crisp\Downloads\glove.6B\glove.6B.100d.txt", encoding='utf8') as glove_file:
    for line in glove_file:
        records = line.split()
        word  = records[0]
        vectors = np.asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vectors

> Creating an `embedding` matrix that suits our data.

In [43]:
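# Row i holds the GloVe vector of the word with index i; words without a
# pretrained vector (and row 0, reserved for padding) stay all zeros.
embedding_matrix = np.zeros((vocabulary_size, 100))
for word, index in tokenizer.word_index.items():
    vector = embeddings_dictionary.get(word)
    # Skip indices that would fall outside the matrix.
    if vector is not None and index < vocabulary_size:
        embedding_matrix[index] = vector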
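
The two steps above (parsing the GloVe text format, then copying vectors into rows chosen by the word index) can be tried on a tiny scale. The sketch below is dependency-free and uses plain lists instead of NumPy arrays. The three "GloVe" lines, the 3-dimensional vectors and the `toy_word_index` are all made up for illustration.

```python
import io

# Made-up lines in the GloVe text format: a word followed by its vector components.
fake_glove = io.StringIO(
    "book 0.1 0.2 0.3\n"
    "great 0.4 0.5 0.6\n"
    "the 0.7 0.8 0.9\n"
)
dim = 3

embeddings = {}
for line in fake_glove:
    records = line.split()
    embeddings[records[0]] = [float(x) for x in records[1:]]

# Hypothetical word index; "sloane" has no pretrained vector.
toy_word_index = {"great": 1, "book": 2, "sloane": 3}
num_rows = len(toy_word_index) + 1  # +1 because row 0 is reserved for padding

matrix = [[0.0] * dim for _ in range(num_rows)]
for word, index in toy_word_index.items():
    vector = embeddings.get(word)
    if vector is not None:
        matrix[index] = vector

for row in matrix:
    print(row)
```

Rows 0 and 3 stay at zero: the first because no word uses index 0, the second because the word is missing from the pretrained vocabulary.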

### Creating sequences from our data.

In [45]:
sequence_tokens = tokenizer.texts_to_sequences(reviews_text)

In [47]:
print(sequence_tokens[0])

[5, 477, 155, 6886, 1401, 57, 54, 4, 60, 55, 224, 11, 271, 958, 545, 2, 45, 346, 11, 11377, 6887, 7, 4, 60, 126, 2, 128, 290, 3, 378, 17, 113, 155, 1796, 77, 2, 17, 6307, 77, 325, 4, 209, 49, 580, 322, 1926, 233, 28, 181, 9, 998, 6, 1, 209, 48, 26, 397, 2, 372, 12, 179, 29, 1037, 230, 1, 229, 227, 106, 26, 57, 18, 357, 28, 431, 15]


In [50]:
sequence_to_text(sequence_tokens[0])

"i bought both boxed sets books really a great series start book three weeks ago and just finished book sloane monroe is a great character and being able to follow her through both private life and her pi life gets a reader very involved although clues may be right in front of the reader there are twists and turns that keep one guessing until the last page these are books you won't be disappointed with"

### Padding sequences.

In [51]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [52]:
max_words = 100
sequences_padded = pad_sequences(sequence_tokens, maxlen=max_words, padding="post", truncating="post")
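
With `padding="post"` and `truncating="post"`, short sequences get zeros appended and long sequences lose their tail. A minimal pure-Python sketch of that behaviour for a single sequence follows. The `pad_post` helper is hypothetical and only mirrors these two settings, not the full `pad_sequences` API.

```python
def pad_post(seq, maxlen, value=0):
    # truncating="post": keep the first maxlen items and drop the rest.
    truncated = list(seq[:maxlen])
    # padding="post": append the fill value until the length is maxlen.
    return truncated + [value] * (maxlen - len(truncated))

short = pad_post([5, 477, 155], 5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], 5)
print(short)  # [5, 477, 155, 0, 0]
print(long)   # [1, 2, 3, 4, 5]
```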

### Creating a model.

### Model architecture

```
                [ Embedding Layer]
                        |
                        |
[ LSTM ] <---- [Bidirectional Layer] ----> [GRU] (forward_layer)
 (backward_layer)       |
                        |
                 [ Flatten Layer]
                        |
                        |
                 [Dense Layer 1]
                        |
                        |    
                 [Dense Layer 2]
                        |
                        |
                 [Dense Layer 3] (output [binary class])
```

In [87]:
forward_layer = keras.layers.GRU(64, return_sequences=True, dropout=.5 )
backward_layer = keras.layers.LSTM(64, activation='relu', return_sequences=True,
                       go_backwards=True, dropout=.5)

# forward_layer = keras.layers.GRU(64, return_sequences=True, )
# backward_layer = keras.layers.LSTM(64, activation='relu', return_sequences=True,
#                        go_backwards=True)

model = keras.Sequential([
    keras.layers.Embedding(
        vocabulary_size, 100, input_length=max_words, weights=[embedding_matrix], trainable=False
    ),
    keras.layers.Bidirectional(
        forward_layer,
        backward_layer = backward_layer
    ),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(.3),
    keras.layers.Dense(512, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
    
], name="amazon_sentiment_classifier")

model.summary()

Model: "amazon_sentiment_classifier"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 100, 100)          4730000   
_________________________________________________________________
bidirectional_2 (Bidirection (None, 100, 128)          74112     
_________________________________________________________________
flatten_4 (Flatten)          (None, 12800)             0         
_________________________________________________________________
dense_8 (Dense)              (None, 64)                819264    
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_9 (Dense)              (None, 512)               33280     
_________________________________________________________________
dense_10 (Dense)             (None, 1) 
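
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer shapes. The sketch below uses the standard formulas for embedding, dense, LSTM and GRU layers. The helper names are hypothetical, and the GRU formula assumes the TensorFlow 2 default `reset_after=True`, which gives each gate two bias vectors.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def dense_params(inputs, units):
    # One weight per input-unit pair plus one bias per unit.
    return inputs * units + units

def lstm_params(inputs, units):
    # 4 gates, each with input weights, recurrent weights and one bias vector.
    return 4 * (units * (inputs + units) + units)

def gru_params(inputs, units):
    # 3 gates, each with two bias vectors (assumes reset_after=True).
    return 3 * (units * (inputs + units) + 2 * units)

embedding = embedding_params(47300, 100)
bidirectional = gru_params(100, 64) + lstm_params(100, 64)
dense_1 = dense_params(100 * 128, 64)  # Flatten: 100 time steps x 128 features
dense_2 = dense_params(64, 512)
dense_3 = dense_params(512, 1)

print(embedding, bidirectional, dense_1, dense_2, dense_3)
# 4730000 74112 819264 33280 513
```

The first four values match the summary. The last is what the formula gives for the output layer, whose count is cut off in the output above.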

In [88]:
early_stoping = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    min_delta=0,
    patience=3,
    verbose=1,
    mode='auto',
    baseline=None,
    restore_best_weights=False,
)
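
With `patience=3` and `min_delta=0`, training stops once the validation loss has failed to improve for three consecutive epochs. The sketch below imitates that rule in plain Python. It is not the Keras source, the `stopping_epoch` helper is hypothetical, and the loss values are made up.

```python
def stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

made_up_losses = [0.30, 0.25, 0.26, 0.27, 0.28, 0.24]
print(stopping_epoch(made_up_losses))  # 5
```

Because `restore_best_weights=False`, the model keeps the weights from the last epoch that ran, not those from the epoch with the lowest validation loss.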

In [89]:
X_train, X_test, y_train, y_test = train_test_split(np.array(sequences_padded), np.array(reviews_labels).astype('float32'), random_state=42, test_size = .05)
X_train.shape, y_train.shape, X_test.shape

((9500, 100), (9500,), (500, 100))

### Training the model.

In [90]:
model.compile(
    loss = keras.losses.BinaryCrossentropy(from_logits=False),
    metrics = ['accuracy'],
    optimizer = keras.optimizers.Adam()
)
history = model.fit(
    X_train, y_train,
    epochs = 10,
    verbose = 1,
    validation_split = .2,
    shuffle=True,
    batch_size= 32,
    validation_batch_size = 16,
    callbacks = [early_stoping]
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 00005: early stopping
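
It helps to keep track of how many reviews each stage actually sees. `train_test_split` first holds out 5% for testing, and `validation_split=.2` then sets aside the last 20% of the remaining rows (taken before shuffling) for validation. The arithmetic, written with integers to avoid rounding surprises:

```python
import math

n_total = 10000
n_test = n_total * 5 // 100        # test_size = .05
n_train_full = n_total - n_test    # rows passed to model.fit
n_val = n_train_full * 20 // 100   # validation_split = .2
n_train = n_train_full - n_val     # rows used for gradient updates

steps_per_epoch = math.ceil(n_train / 32)  # batch_size = 32
val_steps = math.ceil(n_val / 16)          # validation_batch_size = 16

print(n_test, n_train_full, n_val, n_train)  # 500 9500 1900 7600
print(steps_per_epoch, val_steps)            # 238 119
```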


### Evaluating the model.

In [91]:
model.evaluate(X_test, y_test, verbose=1, batch_size=128)



[0.27966317534446716, 0.9240000247955322]
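
The two numbers returned are the loss and the accuracy. Because the labels are heavily skewed towards the positive class, accuracy alone says little about how well negative reviews are recognised. A minimal sketch of per-class recall in pure Python follows. The labels and predictions are made up for illustration and are not results from this model.

```python
def per_class_recall(y_true, y_pred):
    # For each label: the fraction of its examples that were predicted correctly.
    recall = {}
    for label in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == label]
        hits = sum(1 for i in indices if y_pred[i] == label)
        recall[label] = hits / len(indices)
    return recall

made_up_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
made_up_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(made_up_true, made_up_pred)) / len(made_up_true)
recall = per_class_recall(made_up_true, made_up_pred)
print(accuracy)  # 0.9
print(recall)    # {0: 0.5, 1: 1.0}
```

In this toy case the accuracy is 0.9 even though only half of the negative examples are found.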

### Inference.

In [106]:
def predict(sent):
    class_names = ["NEGATIVE", "POSITIVE"]
    # Convert the sentence to word indices, then pad it to the training length.
    tokens = sent_to_sequence(sent)
    padded_tokens = pad_sequences([tokens], maxlen=max_words, padding="post", truncating="post")
    # Round the sigmoid output to 0 or 1 and drop the batch dimension.
    prediction = tf.squeeze(tf.round(model(padded_tokens)).numpy().astype('int32'))
    print(f'Predicted Class:\t {prediction}\nPredicted Category:\t{class_names[prediction]}')
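
The rounding step in `predict` amounts to thresholding the sigmoid output at 0.5. The sketch below isolates that step without the model. `probability_to_label` and its `threshold` parameter are hypothetical names, and the probabilities are made up.

```python
CLASS_NAMES = ["NEGATIVE", "POSITIVE"]

def probability_to_label(probability, threshold=0.5):
    # Strict ">" so that exactly 0.5 maps to class 0, as round-half-to-even does.
    index = int(probability > threshold)
    return index, CLASS_NAMES[index]

positive = probability_to_label(0.93)
negative = probability_to_label(0.12)
print(positive)  # (1, 'POSITIVE')
print(negative)  # (0, 'NEGATIVE')
```

Keeping the threshold explicit leaves room to move the cut-off, for example to catch more negative reviews in a skewed dataset.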

### Negative review.

In [107]:
predict("This book is very bad i dont like this kind of content.")

Predicted Class:	 0
Predicted Category:	NEGATIVE


### Positive review.

In [108]:
predict("This book is one of the amaizing books ever.")

Predicted Class:	 1
Predicted Category:	POSITIVE
