### Category and Sentiment Classification.

In this notebook we extend the previous two notebooks and build a neural network (`NN`) that predicts both the category a review belongs to and its sentiment (positive or negative). The model takes a single input and produces two outputs, which we build with the flexible Keras Functional API.

### Imports

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
import os
from nltk.tokenize import word_tokenize
import json, re, random
from collections import Counter

### Data preparation.

We have 5 categories:

    * Books
    * Clothing
    * Electronics
    * Grocery
    * Patio
 
Based on the review text, we want to predict which category the review belongs to.

Each review also has an `overall` rating of up to 5. We treat a rating of 3 or more as positive (1) and anything lower as negative (0), which is the rule implemented in `Review.get_sentiment`.

### File structure

```
files
    category
        Books_small.json
        Clothing_small.json
        Electronics_small.json
        Grocery_small.json
        Patio_small.json
```
Each line of a JSON file is one review object with the following structure:

```json
{
     "reviewerID": "..", 
     "asin": "..", 
     "reviewerName": "..",
     "helpful": [..],
     "reviewText": "...",
     "overall": .., 
     "summary": "..", 
     "unixReviewTime": ..., 
     "reviewTime": ".."
 }
```
### Labels.

```
[positive = 1, negative = 0]

[BOOKS = 0, CLOTHING=1, ELECTRONICS=2, GROCERY=3, PATIO=4]
```
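The category ids above match the alphabetical order of the file names, which is the order you get by enumerating a sorted directory listing. A minimal sketch of that mapping, assuming the five file names shown in the file structure (the `category_ids` name is only illustrative):

```python
# File names as listed in the file structure, deliberately out of order.
file_names = [
    "Patio_small.json", "Books_small.json", "Grocery_small.json",
    "Clothing_small.json", "Electronics_small.json",
]

# Sorting makes the id of each category independent of the listing order.
category_ids = {
    name.split("_")[0].upper(): idx
    for idx, name in enumerate(sorted(file_names))
}
print(category_ids)
```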

In [2]:
class Review:
    def __init__(self, review, sentiment, category):
        self.category = category
        self.review = review
        self.sentiment = sentiment

    @staticmethod
    def get_sentiment(sentiment):
        # Ratings below 3 are negative (0); 3 and above are positive (1).
        return 0 if sentiment < 3 else 1

### Creating a preprocessing function that removes digits and repeated whitespace from a review

In [3]:
def clean_text(sent):
    a = re.sub(r'\d', ' ', sent)
    b = re.sub(r'\s+', ' ', a)
    return b
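A quick check of what `clean_text` does to a string. The sample inputs are made up. The second one suggests why the review printed further down contains `can&# ;t`: the digits inside an HTML entity (presumably something like `&#39;`) are replaced as well.

```python
import re

def clean_text(sent):
    a = re.sub(r'\d', ' ', sent)   # every digit becomes a space
    b = re.sub(r'\s+', ' ', a)     # runs of whitespace collapse to one space
    return b

print(repr(clean_text("Bought 2 of these  in 2019")))
print(repr(clean_text("I can&#39;t wait")))
```

Note that a trailing space can remain when the text ends in digits, because the function does not strip the result.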

In [4]:
base_url = "./files/category/"
test_text_reviews = []

In [6]:
data = []
# Sort the file names so that category ids are deterministic (Books=0 ... Patio=4).
for cat, category in enumerate(sorted(os.listdir(base_url))):
    with open(os.path.join(base_url, category), 'r') as f:
        seed = random.randrange(0, 500)
        for index, line in enumerate(f):
            category_json = json.loads(line)
            data.append(Review(
                clean_text(category_json["reviewText"]),
                Review.get_sentiment(category_json["overall"]),
                cat
            ))
            if index == seed:
                test_text_reviews.append(category_json["reviewText"]) 

In [7]:
data[0].review, data[0].category, data[0].sentiment

('Da Silva takes the divine by storm with this unique new novel. She develops a world unlike any others while keeping it firmly in the real world. This is a very well written and entertaining novel. I was quite impressed and intrigued by the way that this solid storyline was developed, bringing the readers right into the world of the story. I was engaged throughout and definitely enjoyed my time spent reading it.I loved the character development in this novel. Da Silva creates a cast of high school students who actually act like high school students. I really appreciated the fact that none of them were thrown into situations far beyond their years, nor did they deal with events as if they had decades of life experience under their belts. It was very refreshing and added to the realism and impact of the novel. The friendships between the characters in this novel were also truly touching.Overall, this novel was fantastic. I can&# ;t wait to read more and to find out what happens next in 

In [8]:
len(data), len(test_text_reviews)

(5000, 5)

In [9]:
reviews_text = [i.review for i in data]
reviews_category_labels = [i.category for i in data]
reviews_sentiment_labels = [i.sentiment for i in data]

In [10]:
Counter(reviews_category_labels), Counter(reviews_sentiment_labels)

(Counter({0: 1000, 1: 1000, 2: 1000, 3: 1000, 4: 1000}),
 Counter({1: 4592, 0: 408}))

### Vocabulary size `aka` number of unique words.

In [11]:
counter = Counter()
for sent in reviews_text:
    words = word_tokenize(sent)
    for word in words:
        counter[word] += 1

In [12]:
counter.most_common(3)

[('.', 24101), ('the', 19797), (',', 16566)]

In [13]:
vocabulary_size = len(counter)
vocabulary_size

27770

> We have `~28k` unique words in our data.

### Now, Creating word vectors.

In [14]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [15]:
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(reviews_text)

In [16]:
word_indices = tokenizer.word_index
word_indices_reversed = dict([(v, k) for (k, v) in word_indices.items()])

### A function that converts `sequences to sents`.

In [17]:
def sequence_to_text(seq):
    return " ".join([word_indices_reversed[i] for i in seq])

### A function that converts `sents to sequences`.
We are going to use this function during inference.

In [18]:
def sent_to_sequence(sent):
    words = word_tokenize(str(sent).lower())
    # Words missing from the tokenizer vocabulary map to 0, the padding index.
    return [word_indices.get(word, 0) for word in words]
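To illustrate the two conversions, here is a small self-contained sketch. It assumes a hypothetical four-word vocabulary (`toy_word_index`) and uses `str.split` in place of `word_tokenize`, so it only mirrors the idea of the functions above:

```python
# Hypothetical toy vocabulary; the real one comes from tokenizer.word_index.
toy_word_index = {"the": 1, "book": 2, "was": 3, "great": 4}
toy_index_word = {v: k for k, v in toy_word_index.items()}

def toy_sent_to_sequence(sent):
    # str.split stands in for word_tokenize to keep the sketch dependency-free.
    return [toy_word_index.get(w, 0) for w in sent.lower().split()]

seq = toy_sent_to_sequence("The book was astonishing")
print(seq)  # [1, 2, 3, 0]

# Index 0 has no entry in the reverse mapping, so it is skipped when decoding.
print(" ".join(toy_index_word[i] for i in seq if i != 0))  # the book was
```

Because unknown words share index 0 with padding, a sequence produced this way cannot be passed straight to a reverse lookup that has no key for 0.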

### Loading pretrained weights `glove.6B.`
We are going to use these weights in our `embedding` layer, which is the first layer in the network.

In [19]:
embeddings_dictionary = dict()
with open(r"C:\Users\crisp\Downloads\glove.6B\glove.6B.100d.txt", encoding='utf8') as glove_file:
    for line in glove_file:
        records = line.split()
        word  = records[0]
        vectors = np.asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vectors

> Creating an `embedding` matrix that suits our data.

In [20]:
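embedding_matrix = np.zeros((vocabulary_size, 100))
for word, index in tokenizer.word_index.items():
    # Skip indices outside the matrix; words without a GloVe vector keep a zero row.
    if index >= vocabulary_size:
        continue
    vector = embeddings_dictionary.get(word)
    if vector is not None:
        embedding_matrix[index] = vector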
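The same construction on toy data makes it easier to see what ends up in the matrix. This is a sketch with made-up 2-dimensional vectors. `toy_word_index` and `toy_glove` are hypothetical stand-ins for `tokenizer.word_index` and `embeddings_dictionary`:

```python
import numpy as np

# Hypothetical stand-ins for tokenizer.word_index and the GloVe dictionary.
toy_word_index = {"good": 1, "book": 2, "zzyzx": 3}
toy_glove = {
    "good": np.array([0.1, 0.2], dtype="float32"),
    "book": np.array([0.3, 0.4], dtype="float32"),
}

dim = 2
# Row 0 is reserved for padding, hence the +1.
toy_matrix = np.zeros((len(toy_word_index) + 1, dim))
found = 0
for word, index in toy_word_index.items():
    vector = toy_glove.get(word)
    if vector is not None:
        toy_matrix[index] = vector
        found += 1

coverage = found / len(toy_word_index)
print(toy_matrix)
print(f"coverage: {coverage:.2f}")
```

Tracking the coverage shows how many words keep an all-zero row, which matters when the embedding layer is frozen.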

### Creating sequences from our data.

In [21]:
sequence_tokens = tokenizer.texts_to_sequences(reviews_text)

In [22]:
print(sequence_tokens[0])

[5840, 6775, 374, 1, 6776, 74, 2636, 13, 9, 1136, 151, 444, 78, 5155, 4, 500, 981, 88, 357, 130, 724, 6, 3355, 10, 1, 363, 500, 9, 8, 4, 27, 47, 431, 3, 2033, 444, 2, 20, 216, 775, 3, 2637, 74, 1, 90, 12, 9, 476, 1256, 20, 1704, 3591, 1, 927, 141, 91, 1, 500, 7, 1, 100, 2, 20, 3592, 1198, 3, 230, 355, 14, 68, 1061, 226, 6, 2, 237, 1, 566, 2118, 10, 9, 444, 5840, 6775, 3593, 4, 2034, 7, 218, 904, 3139, 122, 241, 2795, 26, 218, 904, 3139, 2, 57, 2514, 1, 411, 12, 1315, 7, 35, 83, 2388, 91, 2119, 196, 1257, 109, 162, 999, 99, 24, 460, 13, 1890, 22, 30, 24, 46, 3594, 7, 203, 501, 351, 109, 4208, 6, 20, 27, 2291, 3, 545, 5, 1, 11053, 3, 2638, 7, 1, 444, 1, 3873, 318, 1, 214, 10, 9, 444, 83, 72, 795, 2958, 292, 9, 444, 20, 776, 2, 36, 238, 428, 5, 80, 39, 3, 5, 143, 43, 58, 951, 251, 10, 1, 194, 2, 384, 230, 124, 9, 5156, 444, 74, 5840, 6775, 5, 197, 122, 114, 4, 70, 2959, 487, 13, 4, 609, 1136, 3595, 1256, 894, 610, 12, 2, 368, 4, 4209, 1164, 7, 9, 98, 10, 1081, 11, 51, 599, 211]


In [23]:
sequence_to_text(sequence_tokens[0])

'da silva takes the divine by storm with this unique new novel she develops a world unlike any others while keeping it firmly in the real world this is a very well written and entertaining novel i was quite impressed and intrigued by the way that this solid storyline was developed bringing the readers right into the world of the story i was engaged throughout and definitely enjoyed my time spent reading it i loved the character development in this novel da silva creates a cast of high school students who actually act like high school students i really appreciated the fact that none of them were thrown into situations far beyond their years nor did they deal with events as if they had decades of life experience under their belts it was very refreshing and added to the realism and impact of the novel the friendships between the characters in this novel were also truly touching overall this novel was fantastic i can t wait to read more and to find out what happens next in the series i d d

### Padding sequences.

In [24]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [25]:
max_words = 100
sequences_padded = pad_sequences(sequence_tokens, maxlen=max_words, padding="post", truncating="post")
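With `padding="post"` and `truncating="post"`, short sequences get zeros appended and long ones lose their tail. The sketch below re-implements just that behaviour in plain Python for illustration. `pad_post` is a hypothetical helper, not the Keras function:

```python
def pad_post(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                    # truncating="post": drop the tail
        seq = seq + [value] * (maxlen - len(seq))   # padding="post": fill at the end
        padded.append(seq)
    return padded

print(pad_post([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[5, 6, 0, 0], [1, 2, 3, 4]]
```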

### Creating a model.

### Model `Architecture` Functional API.

```
                    [ Input ]
                        |
                        |
                [ Embedding Layer]
                        |
                        |
[ LSTM ] <---- [Bidirectional Layer] ----> [GRU] (forward_layer)
 (backward_layer)       |
                        |
                 [ Flatten Layer]
                        |
                        |
                 [Dense Layer 1]
                        |
                        |    
                 [Dense Layer 2]
                        |
                        |
            |--------------------------|
            |                          |
      [Dense Layer 3]            [Dense Layer 4]
        (pos, neg)       (book, electronic, clothing, grocery, patio)            
      
```

In [26]:
forward_layer = keras.layers.GRU(64, return_sequences=True, dropout=.5 )
backward_layer = keras.layers.LSTM(64, activation='relu', return_sequences=True,
                       go_backwards=True, dropout=.5)

input_layer = keras.layers.Input(shape=(100, ), name="input_layer")

embedding_layer = keras.layers.Embedding(
                        vocabulary_size, 100, 
                        input_length=max_words, 
                        weights=[embedding_matrix], 
                        trainable=False, name="embedding_layer"
                    )(input_layer)

bidirectional_layer = keras.layers.Bidirectional(
                                forward_layer,
                                backward_layer = backward_layer,
                                name ="bidirectional_layer"
                            )(embedding_layer)

flatten_layer = keras.layers.Flatten(name="flatten_layer")(bidirectional_layer)
fc_1 = keras.layers.Dense(64, activation='relu', name="fc_1")(flatten_layer)
dropout_layer = keras.layers.Dropout(.3, name="dropout")(fc_1)
fc_2 = keras.layers.Dense(512, activation='relu', name='fc_2')(dropout_layer)

# Output layers

sentiment_output = keras.layers.Dense(1, activation='sigmoid', name='sentiment_output')(fc_2)
category_output = keras.layers.Dense(5, activation='softmax', name='category_output')(fc_2)

model = keras.Model(inputs=input_layer, outputs=[sentiment_output, category_output], name="amazon_sentiment_category_classifier")

model.summary()

Model: "amazon_sentiment_category_classifier"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_layer (InputLayer)        [(None, 100)]        0                                            
__________________________________________________________________________________________________
embedding_layer (Embedding)     (None, 100, 100)     2777000     input_layer[0][0]                
__________________________________________________________________________________________________
bidirectional_layer (Bidirectio (None, 100, 128)     74112       embedding_layer[0][0]            
__________________________________________________________________________________________________
flatten_layer (Flatten)         (None, 12800)        0           bidirectional_layer[0][0]        
_______________________________________________________________
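The parameter counts in the summary can be reproduced by hand. The arithmetic below is a sketch that assumes the default TensorFlow 2 GRU variant (`reset_after=True`), which keeps two bias vectors per gate. Under that assumption the numbers match the summary:

```python
vocabulary_size, embed_dim, units, timesteps = 27770, 100, 64, 100

embedding_params = vocabulary_size * embed_dim
# LSTM: 4 gates, each with input weights, recurrent weights and one bias.
lstm_params = 4 * (units * (embed_dim + units) + units)
# GRU: 3 gates; assumes reset_after=True, i.e. two bias vectors per gate.
gru_params = 3 * (units * (embed_dim + units) + 2 * units)
bidirectional_params = lstm_params + gru_params
# Forward and backward outputs are concatenated, then flattened over time.
flatten_width = timesteps * 2 * units

print(embedding_params, bidirectional_params, flatten_width)
# 2777000 74112 12800
```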

### Label processing.
* `reviews_category_labels` - should be one hot vectors

* `reviews_sentiment_labels` - should be just a single number

In [27]:
reviews_category_labels_one_hot = tf.one_hot(reviews_category_labels, depth=5).numpy().astype('float32')
reviews_sentiment_labels_binary = np.array(reviews_sentiment_labels).astype('float32')
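For reference, one-hot encoding can also be written with NumPy alone. This is an equivalent sketch on a few made-up labels, not a replacement for the `tf.one_hot` call above:

```python
import numpy as np

labels = [0, 3, 4, 1]
# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(5, dtype="float32")[labels]
print(one_hot)
```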

### Splitting and shuffling Labels.

In [28]:
(
    review_text_train, review_text_test,
    reviews_sentiment_labels_train, reviews_sentiment_labels_test,
    reviews_category_labels_train, reviews_category_labels_test,
) = train_test_split(
    sequences_padded, reviews_sentiment_labels_binary, reviews_category_labels_one_hot,
    random_state=42,
    test_size=.05,
)

review_text_train.shape, review_text_test.shape, reviews_sentiment_labels_train.shape, reviews_sentiment_labels_test.shape,  reviews_category_labels_train.shape, reviews_category_labels_test.shape

((4750, 100), (250, 100), (4750,), (250,), (4750, 5), (250, 5))
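The shapes follow directly from the 5% test size. The short calculation below also estimates how many negative reviews a random (non-stratified) split leaves in the test set on average, using the class counts printed earlier. The actual number for `random_state=42` is not shown in the notebook and may differ:

```python
total, negatives, test_size = 5000, 408, 0.05

n_test = round(total * test_size)
n_train = total - n_test
# Expected number of negative reviews in a random, non-stratified test split.
expected_test_negatives = n_test * negatives / total

print(n_train, n_test, round(expected_test_negatives, 1))
# 4750 250 20.4
```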

### Training the Model.

In [29]:
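# Note: this callback is defined here but is not passed to `model.fit`, so it does not affect the run.
early_stoping = keras.callbacks.EarlyStopping(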
    monitor='val_sentiment_output_loss',
    min_delta=0,
    patience=2,
    verbose=1,
    mode='auto',
    baseline=None,
    restore_best_weights=False,
)
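The `patience=2` setting means training stops once the monitored value has failed to improve for two consecutive epochs. A simplified sketch of that rule follows. It is not Keras's actual implementation, and the function name and loss values are made up:

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

print(early_stopping_epoch([0.50, 0.40, 0.42, 0.41, 0.39]))  # 4
```

In this made-up run the loss would have improved again at epoch 5, which is the usual trade-off of a small patience value.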

In [30]:
model.compile(
    loss = {
        "sentiment_output" : keras.losses.BinaryCrossentropy(from_logits=False),
        "category_output" : keras.losses.CategoricalCrossentropy(from_logits=False)
    },
    metrics = ['accuracy'],
    optimizer = keras.optimizers.Adam()
)
history = model.fit(
    review_text_train, 
    y = [reviews_sentiment_labels_train, reviews_category_labels_train],
    epochs = 10,
    verbose = 1,
    validation_split = .2,
    shuffle=True,
    batch_size= 32,
    validation_batch_size = 16,
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Evaluating the model.

In [31]:
model.evaluate( review_text_test, 
                y = [reviews_sentiment_labels_test, reviews_category_labels_test],
                verbose=1, batch_size=128
              )



[0.955642819404602,
 0.3524135649204254,
 0.6032292246818542,
 0.9319999814033508,
 0.8960000276565552]
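The five values appear to be, in order, the total loss, the sentiment and category losses, and the sentiment and category accuracies. That ordering is an assumption based on the order of the model outputs and on the first value being the sum of the next two. The names below are illustrative rather than read from the model:

```python
results = [0.955642819404602, 0.3524135649204254, 0.6032292246818542,
           0.9319999814033508, 0.8960000276565552]
# Assumed ordering; these names are illustrative, not read from the model.
names = ["loss", "sentiment_loss", "category_loss",
         "sentiment_accuracy", "category_accuracy"]
report = dict(zip(names, results))
for name, value in report.items():
    print(f"{name:>20}: {value:.4f}")

# The total loss equals the sum of the two per-output losses.
loss_gap = abs(report["loss"] - (report["sentiment_loss"] + report["category_loss"]))
print(loss_gap < 1e-6)
```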

### Inference.

In [32]:
def predict(sent):
    sentiments = ["NEGATIVE", "POSITIVE"]
    categories =["BOOKS", "CLOTHES", "ELECTRONICS", "GROCERY", "PATIO"]
    tokens = sent_to_sequence(sent)
    padded_tokens = pad_sequences([tokens], maxlen=max_words, padding="post", truncating="post")
    sentiment_prediction, category_prediction = model(padded_tokens)
    
    sentiment_prediction = tf.squeeze(tf.round(sentiment_prediction)).numpy().astype('int32')
    category_prediction = tf.argmax(category_prediction, axis=1).numpy()[0]
    
    
    print(f'Predicted Classes:\t [{sentiment_prediction}, {category_prediction}]\nPredicted Labels:\t[{sentiments[sentiment_prediction]}, {categories[category_prediction]}]')
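The decoding step inside `predict` can be tried without a trained model. The sketch below uses made-up probabilities in the shapes the two output layers produce, `(1, 1)` and `(1, 5)`, and NumPy instead of TensorFlow ops:

```python
import numpy as np

sentiments = ["NEGATIVE", "POSITIVE"]
categories = ["BOOKS", "CLOTHES", "ELECTRONICS", "GROCERY", "PATIO"]

# Made-up model outputs for one review.
sentiment_prob = np.array([[0.83]])
category_probs = np.array([[0.05, 0.10, 0.05, 0.10, 0.70]])

sentiment_id = int(sentiment_prob[0, 0] > 0.5)            # threshold the sigmoid output
category_id = int(np.argmax(category_probs, axis=1)[0])   # most probable category
print(sentiments[sentiment_id], categories[category_id])  # POSITIVE PATIO
```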

In [33]:
predict(test_text_reviews[0])

Predicted Classes:	 [1, 1]
Predicted Labels:	[POSITIVE, CLOTHES]


In [34]:
predict(test_text_reviews[1])

Predicted Classes:	 [1, 1]
Predicted Labels:	[POSITIVE, CLOTHES]


In [35]:
predict(test_text_reviews[2])

Predicted Classes:	 [1, 2]
Predicted Labels:	[POSITIVE, ELECTRONICS]


In [36]:
predict(test_text_reviews[3])

Predicted Classes:	 [1, 3]
Predicted Labels:	[POSITIVE, GROCERY]


In [37]:
predict(test_text_reviews[4])

Predicted Classes:	 [1, 4]
Predicted Labels:	[POSITIVE, PATIO]


### A negative Patio Review.

In [38]:
predict(["There has been relatively significant traffic of moles under the soil in my yard. I have tried a wide variety of repellants, traps, and poisons for moles, but not all of them have worked, but this one did not work well. I am not sure if the product was too diluted or it simply did not work. If this product is good, then why is it that the manufacturer discontinued this product?"])

Predicted Classes:	 [1, 4]
Predicted Labels:	[POSITIVE, PATIO]


## The model does not perform well on sentiment classification, most likely because only 408 of the 5,000 reviews are negative.
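To put the imbalance in numbers, the sketch below computes the accuracy of always predicting "positive" and a set of "balanced" class weights from the label counts printed earlier. The reported sentiment accuracy of about 0.932 is only slightly above that baseline. Weights like these could be turned into per-sample weights during training, though whether that helps on this dataset has not been tested here:

```python
counts = {1: 4592, 0: 408}
total = sum(counts.values())

# Accuracy of always predicting the majority class (positive).
majority_baseline = max(counts.values()) / total

# "Balanced" weights: total / (n_classes * class_count).
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"majority baseline: {majority_baseline:.4f}")
print({label: round(w, 3) for label, w in class_weights.items()})
```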