# Sentiment Analysis on Product Reviews using LSTM 
<hr>

### Steps
<ol type="1">
    <li>Load the dataset </li>
    <li>Clean Dataset</li>
    <li>Encode Sentiments</li>
    <li>Split Dataset</li>
    <li>Tokenize and Pad/Truncate Reviews</li>
    <li>Build Architecture/Model</li>
    <li>Train and Test</li>
</ol>

<hr>
<i>Importing all the libraries needed</i>

In [1]:
!pip install tensorflow



In [90]:
import pandas as pd    # to load dataset
import numpy as np     # for mathematic equation
from nltk.corpus import stopwords   # to get collection of stopwords
from sklearn.model_selection import train_test_split # for splitting dataset
  # load saved model
import re

In [91]:
from tensorflow.keras.preprocessing.text import Tokenizer  # to encode text to int
from tensorflow.keras.preprocessing.sequence import pad_sequences   # to do padding or truncating
from tensorflow.keras.models import Sequential     # the model
from tensorflow.keras.layers import Embedding, LSTM, Dense # layers of the architecture
from tensorflow.keras.callbacks import ModelCheckpoint   # save model
from tensorflow.keras.models import load_model 

<hr>
<i>Preview dataset</i>

In [92]:
dataa = pd.read_csv('product.csv')

print(dataa)

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


<hr>
<b>Stop Word</b> is a commonly used words in a sentence, usually a search engine is programmed to ignore this words (i.e. "the", "a", "an", "of", etc.)

<i>Declaring the english stop words</i>

In [93]:
english_stops = set(stopwords.words('english'))

<hr>

### Load and Clean Dataset

In the original dataset, the reviews are still dirty. There are still html tags, numbers, uppercase, and punctuations. This will not be good for training, so in <b>load_dataset()</b> function, beside loading the dataset using <b>pandas</b>, I also pre-process the reviews by removing html tags, non alphabet (punctuations and numbers), stop words, and lower case all of the reviews.

### Encode Sentiments
In the same function, I also encode the sentiments into integers (0 and 1). Where 0 is for negative sentiments and 1 is for positive sentiments.

In [94]:
def load_dataset():
    df = pd.read_csv('product.csv')
    x_data = df['review'].astype(str)
    y_data = df['sentiment'].map({'positive': 1, 'negative': 0})

    # PRE-PROCESS REVIEW
    x_data = x_data.apply(lambda review: re.sub(r'<.*?>', '', review))  # remove html tags
    x_data = x_data.apply(lambda review: re.sub(r'[^A-Za-z]', ' ', review))  # remove non-alphabet characters
    x_data = x_data.apply(lambda review: [w.lower() for w in review.split() if w not in english_stops])  # remove stopwords

    return x_data, y_data

x_data, y_data = load_dataset()


<hr>

### Split Dataset
In this work, I decided to split the data into 80% of Training and 20% of Testing set using <b>train_test_split</b> method from Scikit-Learn. By using this method, it automatically shuffles the dataset. We need to shuffle the data because in the original dataset, the reviews and sentiments are in order, where they list positive reviews first and then negative reviews. By shuffling the data, it will be distributed equally in the model, so it will be more accurate for predictions.

In [95]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2)

print('Train Set')
#print(x_train, '\n')
#print(x_test, '\n')
print('Test Set')
print(y_train, '\n')
print(y_test)

Train Set
Test Set
49141    0
4210     0
25193    1
12835    0
18907    0
        ..
28276    1
6668     0
8817     0
12172    0
20543    1
Name: sentiment, Length: 40000, dtype: int64 

6033     1
8427     0
20671    1
32833    1
34126    1
        ..
28923    1
35153    0
12571    1
15455    1
13363    0
Name: sentiment, Length: 10000, dtype: int64


<hr>
<i>Function for getting the maximum review length, by calculating the mean of all the reviews length (using <b>numpy.mean</b>)</i>

In [96]:
def get_max_length():
    review_length = []
    for review in x_train:
        review_length.append(len(review))

    return int(np.ceil(np.mean(review_length)))

<hr>

### Tokenize and Pad/Truncate Reviews
A Neural Network only accepts numeric data, so we need to encode the reviews. I use <b>tensorflow.keras.preprocessing.text.Tokenizer</b> to encode the reviews into integers, where each unique word is automatically indexed (using <b>fit_on_texts</b> method) based on <b>x_train</b>. <br>
<b>x_train</b> and <b>x_test</b> is converted into integers using <b>texts_to_sequences</b> method.

Each reviews has a different length, so we need to add padding (by adding 0) or truncating the words to the same length (in this case, it is the mean of all reviews length) using <b>tensorflow.keras.preprocessing.sequence.pad_sequences</b>.


<b>post</b>, pad or truncate the words in the back of a sentence<br>
<b>pre</b>, pad or truncate the words in front of a sentence

In [97]:
# ENCODE REVIEW
token = Tokenizer(lower=False)
token.fit_on_texts(x_data)

x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)

max_length = get_max_length()

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(token.word_index) + 1
  # add 1 because of 0 padding

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum review length: ', max_length)

Encoded X Train
 [[   1 1775  806 ...  553  594    2]
 [ 171  119 3946 ...    0    0    0]
 [ 331  565 2134 ...   54  144  586]
 ...
 [6641 6878 1097 ...    0    0    0]
 [1071 2389   13 ... 2348    6 1951]
 [ 282  462  232 ...  710 5231   95]] 

Encoded X Test
 [[1484    3  877 ...    0    0    0]
 [   1 1147  250 ... 3835  213  193]
 [   1 3761 2105 ...    0    0    0]
 ...
 [ 572  226  367 ...    0    0    0]
 [   2   64   43 ... 3711   23 6068]
 [ 518    8 7202 ...    0    0    0]] 

Maximum review length:  130


<hr>

### Build Architecture/Model
<b>Embedding Layer</b>: in simple terms, it creates word vectors of each word in the <i>word_index</i> and group words that are related or have similar meaning by analyzing other words around them.

<b>LSTM Layer</b>: to make a decision to keep or throw away data by considering the current input, previous output, and previous memory. There are some important components in LSTM.
<ul>
    <li><b>Forget Gate</b>, decides information is to be kept or thrown away</li>
    <li><b>Input Gate</b>, updates cell state by passing previous output and current input into sigmoid activation function</li>
    <li><b>Cell State</b>, calculate new cell state, it is multiplied by forget vector (drop value if multiplied by a near 0), add it with the output from input gate to update the cell state value.</li>
    <li><b>Ouput Gate</b>, decides the next hidden state and used for predictions</li>
</ul>

<b>Dense Layer</b>: compute the input with the weight matrix and bias (optional), and using an activation function. I use <b>Sigmoid</b> activation function for this work because the output is only 0 or 1.

The optimizer is <b>Adam</b> and the loss function is <b>Binary Crossentropy</b> because again the output is only 0 and 1, which is a binary number.

In [98]:
# ARCHITECTURE
EMBED_DIM = 64
LSTM_OUT = 128

model = Sequential()
model.add(Embedding(total_words, EMBED_DIM, input_length=max_length))
model.add(LSTM(LSTM_OUT, dropout=0.2, recurrent_dropout=0.2))  # add dropout
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

print(model.summary())

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, 130, 64)           6489472   
                                                                 
 lstm_5 (LSTM)               (None, 128)               98816     
                                                                 
 dense_8 (Dense)             (None, 1)                 129       
                                                                 
Total params: 6588417 (25.13 MB)
Trainable params: 6588417 (25.13 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None


<hr>

### Training


In [99]:
checkpoint = ModelCheckpoint(
    'models/LSTM3.h5',
    monitor='val_loss',
    save_best_only=True,
    verbose=1
)

In [100]:
model.fit(x_train, y_train, validation_data=(x_test, y_test), batch_size=256, epochs=5, callbacks=[checkpoint])


Epoch 1/5
Epoch 1: val_loss improved from inf to 0.30806, saving model to models/LSTM3.h5
Epoch 2/5


  saving_api.save_model(


Epoch 2: val_loss improved from 0.30806 to 0.29591, saving model to models/LSTM3.h5
Epoch 3/5
Epoch 3: val_loss improved from 0.29591 to 0.29284, saving model to models/LSTM3.h5
Epoch 4/5
Epoch 4: val_loss did not improve from 0.29284
Epoch 5/5
Epoch 5: val_loss did not improve from 0.29284


<keras.src.callbacks.History at 0x7f07672a17b0>

In [101]:
# Evaluate the model on the test data
loss, accuracy = model.evaluate(x_test, y_test)

# Print accuracy
print(f"Final Model Accuracy: {accuracy * 100:.2f}%")


Final Model Accuracy: 86.39%


---

### Load Saved Model

Load saved model and use it to predict a movie review statement's sentiment (positive or negative).

In [125]:
loaded_model = load_model('models/LSTM3.h5')

Receives a review as an input to be predicted

In [141]:
review = str(input('Product Review: '))

The input must be pre processed before it is passed to the model to be predicted

In [142]:
# Pre-process input
import re
import nltk
from nltk.corpus import stopwords

english_stops = set(stopwords.words('english'))
regex = re.compile(r'[^a-zA-Z\s]')
review = regex.sub('', review)
print('Cleaned: ', review)

words = review.split(' ')
filtered = [w for w in words if w not in english_stops]
filtered = ' '.join(filtered)
filtered = [filtered.lower()]

print('Filtered: ', filtered)

Cleaned:  Kind of drawn in by the erotic scenes only to realize this was one of the most amateurish and unbelievable bits of film Ive ever seen Sort of like a high school film project What was Rosanna Arquette thinking And what was with all those stock characters in that bizarre supposed Midwest town Pretty hard to get involved with this one No lessons to be learned from it no brilliant insights just stilted and quite ridiculous but lots of skin if that intrigues you videotaped nonsenseWhat was with the bisexual relationship out of nowhere after all the heterosexual encounters And what was with that absurd dance with everybody playing their stereotyped roles Give this one a pass its like a million other miles of bad wasted film money that could have been spent on starving children or Aids in Africa
Filtered:  ['kind drawn erotic scenes realize one amateurish unbelievable bits film ive ever seen sort like high school film project what rosanna arquette thinking and stock characters bizar

Once again, we need to tokenize and encode the words. I use the tokenizer which was previously declared because we want to encode the words based on words that are known by the model.

In [143]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenize_words = tokenizer.texts_to_sequences(filtered)
tokenize_words = pad_sequences(tokenize_words, maxlen=max_length, padding='post', truncating='post')
print(tokenize_words)

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


This is the result of the prediction which shows the **confidence score** of the review statement.

In [144]:
result = loaded_model.predict(tokenize_words)
print(result)

[[0.50170016]]


In [145]:
if result >= 0.8:
    print('*****')
elif result>0.65:
    print('****')
elif result>0.5:
    print('***')
elif result>0.3:
    print('**')
else:
    print('*')

***
