# Sentiment Analysis 

### Steps
<ol type="1">
    <li>Load the dataset</li>
    <li>Clean and encode Dataset</li>
    <li>Split Dataset 80:20</li>
    <li>Tokenize and Pad/Truncate Reviews</li>
    <li>Build LSTM Model</li>
    <li>Train and Test</li>
</ol>

<hr>
<i>Import all the libraries needed</i>

In [30]:
import pandas as pd    # to load dataset
import numpy as np     # for numerical operations
from nltk.corpus import stopwords   # to get collection of stopwords
from sklearn.model_selection import train_test_split       # for splitting dataset
from tensorflow.keras.preprocessing.text import Tokenizer  # to encode text to int
from tensorflow.keras.preprocessing.sequence import pad_sequences   # to do padding or truncating
from tensorflow.keras.models import Sequential     # the model
from tensorflow.keras.layers import Embedding, LSTM, Dense # layers of the architecture
from tensorflow.keras.callbacks import ModelCheckpoint   # save model
from tensorflow.keras.models import load_model   # load saved model
import re

<hr>
<i>Show the dataset we are using</i>

In [25]:
data = pd.read_csv('IMDB Dataset.csv')

print(data)

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


<hr>
<b>Stop words</b> are commonly used words (e.g. "the", "a", "an", "of") that carry little meaning on their own. Search engines are usually programmed to ignore them.


In [26]:
english_stops = set(stopwords.words('english'))

<hr>

### Load and Clean Dataset

In the original dataset, the reviews are still raw: they contain HTML tags, numbers, uppercase letters, and punctuation. In this step we remove all of that and encode the sentiments as integers, where 0 is a negative sentiment and 1 is a positive sentiment.

In [27]:
def load_dataset():
    df = pd.read_csv('IMDB Dataset.csv')
    x_data = df['review']       # Reviews/Input
    y_data = df['sentiment']    # Sentiment/Output

    # PRE-PROCESS REVIEW
    x_data = x_data.replace({'<.*?>': ''}, regex = True)          # remove HTML tags
    x_data = x_data.replace({'[^A-Za-z]': ' '}, regex = True)     # replace non-letters with a space
    # NOTE: stop words are filtered before lowercasing, so capitalized ones ('I', 'The', 'A') survive
    x_data = x_data.apply(lambda review: [w for w in review.split() if w not in english_stops])  # remove stop words
    x_data = x_data.apply(lambda review: [w.lower() for w in review])   # lower case
    
    # ENCODE SENTIMENT -> 0 & 1
    y_data = y_data.replace('positive', 1)
    y_data = y_data.replace('negative', 0)

    return x_data, y_data

x_data, y_data = load_dataset()

print('Reviews')
print(x_data, '\n')
print('Sentiment')
print(y_data)

Reviews
0        [one, reviewers, mentioned, watching, oz, epis...
1        [a, wonderful, little, production, the, filmin...
2        [i, thought, wonderful, way, spend, time, hot,...
3        [basically, family, little, boy, jake, thinks,...
4        [petter, mattei, love, time, money, visually, ...
                               ...                        
49995    [i, thought, movie, right, good, job, it, crea...
49996    [bad, plot, bad, dialogue, bad, acting, idioti...
49997    [i, catholic, taught, parochial, elementary, s...
49998    [i, going, disagree, previous, comment, side, ...
49999    [no, one, expects, star, trek, movies, high, a...
Name: review, Length: 50000, dtype: object 

Sentiment
0        1
1        1
2        1
3        0
4        1
        ..
49995    1
49996    0
49997    0
49998    0
49999    0
Name: sentiment, Length: 50000, dtype: int64
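
The order of the cleaning steps matters, which can be checked on a single string. The sketch below is a pure-Python illustration, not the notebook's pandas pipeline. The function `clean_review`, its `lowercase_first` flag, and the tiny `STOPS` set are hypothetical stand-ins (the notebook uses NLTK's English stop-word list).

```python
import re

# Hypothetical, tiny stop-word set for illustration only.
STOPS = {"the", "a", "an", "of", "is", "this", "i"}

def clean_review(text, lowercase_first=True):
    """Strip HTML tags and non-letters, then lowercase and drop stop words."""
    text = re.sub(r"<.*?>", "", text)        # remove HTML tags
    text = re.sub(r"[^A-Za-z]", " ", text)   # replace non-letters with a space
    words = text.split()
    if lowercase_first:
        words = [w.lower() for w in words]
        return [w for w in words if w not in STOPS]
    # Same order as load_dataset(): filter first, lowercase afterwards.
    words = [w for w in words if w not in STOPS]
    return [w.lower() for w in words]

sample = "The plot is great.<br /><br />I loved this movie 10/10!"
print(clean_review(sample, lowercase_first=True))   # ['plot', 'great', 'loved', 'movie']
print(clean_review(sample, lowercase_first=False))  # ['the', 'plot', 'great', 'i', 'loved', 'movie']
```

With filtering first, "The" and "I" do not match the lowercase stop-word list and stay in the review, which is why tokens such as `the` and `i` appear in the cleaned output above.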


<hr>

### Split Dataset
In this work, I split the data into an 80% training set and a 20% test set using the <b>train_test_split</b> function from Scikit-Learn, which by default shuffles the dataset before splitting.

In [31]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.2)

print('Reviews (train, then test)')
print(x_train, '\n')
print(x_test, '\n')
print('Sentiment (train, then test)')
print(y_train, '\n')
print(y_test)

Reviews (train, then test)
8929     [i, remember, film, fondly, seeing, theatre, i...
200      [interesting, short, television, movie, descri...
21346    [ed, wood, eclipsed, becomes, orson, welles, t...
38239    [in, rapid, economic, development, china, resu...
49685    [i, always, loved, old, movies, one, top, ten,...
                               ...                        
48028    [from, start, know, movie, end, it, full, clic...
23085    [i, saw, peter, watkin, culloden, the, war, ga...
47004    [wonderland, fascinating, film, chronicling, x...
11490    [this, movie, travels, farther, gunshots, kiss...
2247     [that, answer, the, question, what, single, re...
Name: review, Length: 40000, dtype: object 

5424     [this, movie, quite, possibly, one, horrible, ...
41503    [s, i, c, k, really, stands, so, incredibly, c...
47855    [it, where, poppa, the, groove, tube, putney, ...
9922     [i, know, loved, movie, years, old, now, watch...
26229    [this, picture, hit, movie, screens, june, th,...
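
To make the shuffled 80:20 split concrete, here is a minimal pure-Python sketch of the idea. It is not Scikit-Learn's implementation; `shuffled_split` and its `seed` argument are hypothetical names. In the notebook itself, `train_test_split` accepts a `random_state` argument for the same purpose (a reproducible shuffle) and a `stratify` argument to keep the class ratio equal in both sets.

```python
import random

def shuffled_split(x, y, test_size=0.2, seed=0):
    """Shuffle paired samples, then use the first `test_size` fraction as the test set."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)      # seeded, so the split is reproducible
    n_test = int(round(len(x) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return pick(x, train_idx), pick(x, test_idx), pick(y, train_idx), pick(y, test_idx)

# Toy data: review i carries label i % 2, so alignment is easy to check.
reviews = [["review", str(i)] for i in range(10)]
labels = [i % 2 for i in range(10)]

x_tr, x_te, y_tr, y_te = shuffled_split(reviews, labels)
print(len(x_tr), len(x_te))  # 8 2
```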
 

<hr>
<i>Function for getting the maximum review length used for padding (the mean length of the training reviews, rounded up)</i>

In [7]:
def get_max_length():
    review_length = []
    for review in x_train:
        review_length.append(len(review))

    return int(np.ceil(np.mean(review_length)))
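
Using the mean as the target length means every review longer than the mean loses its tail. The sketch below works through the arithmetic on three made-up reviews; `mean_length` is a hypothetical helper that mirrors `get_max_length()` without NumPy.

```python
import math
from statistics import mean

def mean_length(reviews):
    """Mean number of tokens per review, rounded up to an integer."""
    return math.ceil(mean(len(r) for r in reviews))

# Made-up tokenized reviews with 3, 5 and 12 tokens.
reviews = [["a"] * 3, ["b"] * 5, ["c"] * 12]

target = mean_length(reviews)                       # ceil(20 / 3) = 7
truncated = sum(len(r) > target for r in reviews)   # reviews that will be cut
print(target, truncated)  # 7 1
```

A higher cutoff (for example a percentile of the length distribution) would truncate fewer reviews at the cost of more padding. That is a design choice, not something the notebook evaluates.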

<hr>

### Tokenize and Pad/Truncate Reviews
A neural network only accepts numeric data, so we need to encode the reviews. <b>Tokenizer</b> encodes the reviews as integers, automatically assigning an index to each unique word.

Each review has a different length, so we pad the short ones (with 0) and truncate the long ones to a common length (here, the mean length of all reviews). Both operations take a mode:

<b>post</b>: pad or truncate at the end of a review<br>
<b>pre</b>: pad or truncate at the beginning of a review

In [32]:
# ENCODE REVIEW
token = Tokenizer(lower=False)    # no need to lowercase here; load_dataset() already did
token.fit_on_texts(x_train)
x_train = token.texts_to_sequences(x_train)
x_test = token.texts_to_sequences(x_test)

max_length = get_max_length()

x_train = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post')
x_test = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post')

total_words = len(token.word_index) + 1   # add 1 because of 0 padding

print('Encoded X Train\n', x_train, '\n')
print('Encoded X Test\n', x_test, '\n')
print('Maximum review length: ', max_length)

Encoded X Train
 [[    1   289     4 ...     0     0     0]
 [  129   239   588 ...     0     0     0]
 [ 1209  1544 18168 ...     0     0     0]
 ...
 [ 6821  1345     4 ...     0     0     0]
 [    8     3  3428 ...     0     0     0]
 [  143  1415     2 ...     0     0     0]] 

Encoded X Test
 [[    8     3    93 ...     0     0     0]
 [  614     1   859 ...     0     0     0]
 [    7  1072 18822 ...     0     0     0]
 ...
 [ 1916     1     6 ...     0     0     0]
 [  486     1   194 ...     0     0     0]
 [    1   113    23 ...     0     0     0]] 

Maximum review length:  130
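
The encoding and padding steps can be sketched in plain Python to show what the integers mean. This is an illustration under stated assumptions, not the Keras implementation: `build_word_index`, `encode`, and `pad_or_truncate` are hypothetical helpers. They assume that more frequent words get smaller indices, that index 0 is reserved for padding, and that words unseen during fitting are dropped.

```python
from collections import Counter

def build_word_index(tokenized_reviews):
    """Index words by descending frequency, starting at 1 (0 is the padding value)."""
    counts = Counter(w for review in tokenized_reviews for w in review)
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def encode(review, word_index):
    """Map words to indices; words not seen when building the index are dropped."""
    return [word_index[w] for w in review if w in word_index]

def pad_or_truncate(seq, maxlen, mode="post"):
    """'post' works on the end of the sequence, 'pre' on the beginning."""
    if mode == "post":
        seq = seq[:maxlen]
        return seq + [0] * (maxlen - len(seq))
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

train = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
word_index = build_word_index(train)

encoded = encode(["good", "movie", "unseen"], word_index)
print(encoded)                               # [1, 2]
print(pad_or_truncate(encoded, 4, "post"))   # [1, 2, 0, 0]
print(pad_or_truncate(encoded, 4, "pre"))    # [0, 0, 1, 2]
print(pad_or_truncate([1, 1, 4, 2, 3], 3, "post"))  # [1, 1, 4]
print(pad_or_truncate([1, 1, 4, 2, 3], 3, "pre"))   # [4, 2, 3]
```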


<hr>

### Build Architecture/Model
<b>Embedding Layer</b>: in simple terms, it maps each word in the <i>word_index</i> to a dense vector that is learned during training, so that words used in similar contexts tend to end up with similar vectors.

<b>LSTM Layer</b>: decides which information to keep or throw away by considering the current input, the previous output (hidden state), and the previous memory (cell state). Its main components are:
<ul>
    <li><b>Forget Gate</b>: decides which information in the cell state is kept and which is thrown away</li>
    <li><b>Input Gate</b>: decides which new information is written to the cell state by passing the previous output and the current input through a sigmoid activation function</li>
    <li><b>Cell State</b>: the new cell state is the previous cell state multiplied by the forget vector (values multiplied by a number near 0 are dropped), plus the candidate values scaled by the input gate</li>
    <li><b>Output Gate</b>: decides the next hidden state, which is used for predictions</li>
</ul>
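
The gates listed above can be sketched as a single time step of a scalar LSTM cell. This follows the standard textbook formulation with one number per weight. The Keras layer uses weight matrices and learned values, so `lstm_step` and the parameter dictionary layout are illustrative assumptions only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell.

    p maps a gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        w, u, b = p[gate]
        return w * x + u * h_prev + b

    f = sigmoid(pre("forget"))         # forget gate: how much old memory to keep
    i = sigmoid(pre("input"))          # input gate: how much new information to write
    g = math.tanh(pre("candidate"))    # candidate values for the cell state
    o = sigmoid(pre("output"))         # output gate: how much memory to expose
    c = f * c_prev + i * g             # new cell state
    h = o * math.tanh(c)               # new hidden state
    return h, c

# All-zero parameters: every gate is 0.5 and the candidate is 0,
# so the cell state is simply halved.
zero = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=zero)
print(c)  # 1.0

# A strongly negative forget bias pushes the forget gate toward 0 and erases the memory.
forgetful = dict(zero, forget=(0.0, 0.0, -20.0))
h2, c2 = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=forgetful)
```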

<b>Dense Layer</b>: multiplies its input by a weight matrix, adds an (optional) bias, and applies an activation function. I use the <b>Sigmoid</b> activation function for this work because it squashes the output into the range 0 to 1, which matches the binary labels.

The optimizer is <b>Adam</b> and the loss function is <b>Binary Crossentropy</b>, again because the label is binary (0 or 1).
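
As a worked example of the loss, binary cross-entropy for one sample with label y and predicted probability p is -(y·log(p) + (1 - y)·log(1 - p)). The sketch below evaluates it for a confident prediction that is right and one that is wrong. The function name and the `eps` clipping constant are assumptions for illustration, not the Keras internals.

```python
import math

def binary_crossentropy(y_true, p, eps=1e-7):
    """Cross-entropy for a single sample; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

print(round(binary_crossentropy(1, 0.9), 4))  # 0.1054 (confident and correct)
print(round(binary_crossentropy(0, 0.9), 4))  # 2.3026 (confident and wrong)
```

The loss grows sharply as a confident prediction moves away from the true label, which is what pushes the sigmoid output toward 0 or 1 during training.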

In [33]:
# ARCHITECTURE
EMBED_DIM = 32
LSTM_OUT = 64

model = Sequential()
model.add(Embedding(total_words, EMBED_DIM, input_length = max_length))
model.add(LSTM(LSTM_OUT))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 130, 32)           2954208   
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                24832     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
Total params: 2,979,105
Trainable params: 2,979,105
Non-trainable params: 0
_________________________________________________________________
None
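
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the architecture. The helper names below are hypothetical. The vocabulary size is not printed in the notebook, so it is inferred here from the embedding row of the summary (2,954,208 parameters / 32 dimensions).

```python
def embedding_params(vocab_size, embed_dim):
    # one vector of length embed_dim per word index
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 weight sets (forget, input, candidate, output), each with
    # input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # weights plus one bias per unit
    return (input_dim + 1) * units

EMBED_DIM, LSTM_OUT = 32, 64
vocab_size = 2954208 // EMBED_DIM   # inferred from the summary, not printed in the notebook

total = (embedding_params(vocab_size, EMBED_DIM)
         + lstm_params(EMBED_DIM, LSTM_OUT)
         + dense_params(LSTM_OUT, 1))
print(lstm_params(EMBED_DIM, LSTM_OUT), dense_params(LSTM_OUT, 1), total)  # 24832 65 2979105
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the model size.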


<hr>

### Training
For training, we only need to fit the model on our <b>x_train</b> (input) and <b>y_train</b> (output/label) data. Training uses mini-batches with a <b>batch_size</b> of <i>32</i> and runs for <i>100</i> <b>epochs</b>.

I also added a callback called **checkpoint** that saves the model locally after each epoch in which the training accuracy improved on the best value so far.

In [34]:
checkpoint = ModelCheckpoint(
    'models/LSTM.h5',
    monitor='accuracy',
    save_best_only=True,
    verbose=1
)

In [None]:
model.fit(x_train, y_train, batch_size = 32, epochs = 100, callbacks=[checkpoint])

Epoch 1/100
Epoch 00001: accuracy improved from -inf to 0.58830, saving model to models/LSTM.h5
Epoch 2/100
Epoch 00002: accuracy improved from 0.58830 to 0.60965, saving model to models/LSTM.h5
Epoch 3/100
Epoch 00003: accuracy improved from 0.60965 to 0.74553, saving model to models/LSTM.h5
Epoch 4/100
Epoch 00004: accuracy improved from 0.74553 to 0.80785, saving model to models/LSTM.h5
Epoch 5/100
Epoch 00005: accuracy improved from 0.80785 to 0.83627, saving model to models/LSTM.h5
Epoch 6/100
Epoch 00006: accuracy improved from 0.83627 to 0.84522, saving model to models/LSTM.h5
Epoch 7/100
Epoch 00007: accuracy did not improve from 0.84522
Epoch 8/100
Epoch 00008: accuracy improved from 0.84522 to 0.85607, saving model to models/LSTM.h5
Epoch 9/100
Epoch 00009: accuracy improved from 0.85607 to 0.88898, saving model to models/LSTM.h5
Epoch 10/100
Epoch 00010: accuracy did not improve from 0.88898
Epoch 11/100
Epoch 00011: accuracy improved from 0.88898 to 0.92287, saving model to
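
The "save only when improved" behaviour seen in the log can be sketched in a few lines. `epochs_that_save` is a hypothetical helper, not the Keras callback. The first six and the eighth accuracy values come from the log above, while the seventh is an assumed value, since the log only says that epoch did not improve.

```python
def epochs_that_save(history, best=float("-inf")):
    """Return the 1-based epochs at which a best-only checkpoint would write a file."""
    saved = []
    for epoch, value in enumerate(history, start=1):
        if value > best:      # strictly better than the best value so far
            best = value
            saved.append(epoch)
    return saved

# Epoch 7's value (0.84000) is hypothetical; the others are taken from the log.
accuracy = [0.58830, 0.60965, 0.74553, 0.80785, 0.83627, 0.84522, 0.84000, 0.85607]
print(epochs_that_save(accuracy))  # [1, 2, 3, 4, 5, 6, 8]
```

Because the monitored quantity here is training accuracy, the saved model is the one that fits the training data best. Keras can also hold out part of the training data with the `validation_split` argument of `fit` and monitor `val_accuracy` instead, which tracks performance on data the model was not trained on.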

<hr>

### Testing
To evaluate the model, we predict the sentiment of our <b>x_test</b> data and compare the predictions with the <b>y_test</b> (expected output) data.

In [10]:
# predict_classes() is deprecated; threshold the sigmoid output at 0.5 instead
y_pred = (model.predict(x_test, batch_size = 32) > 0.5).astype('int32').flatten()

# count the predictions that match the expected labels
true = 0
for i, y in enumerate(y_test):
    if y == y_pred[i]:
        true += 1

print('Correct Prediction: {}'.format(true))
print('Wrong Prediction: {}'.format(len(y_pred) - true))
print('Accuracy: {}'.format(true/len(y_pred)*100))

Correct Prediction: 5031
Wrong Prediction: 4969
Accuracy: 50.31
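
Accuracy alone does not show which class the errors fall into. The sketch below thresholds a list of probabilities and counts the four outcomes of a binary classifier. The data is made up for illustration and is not taken from the model above. `to_labels` and `confusion_counts` are hypothetical helpers.

```python
def to_labels(probs, cutoff=0.5):
    """Turn sigmoid outputs into 0/1 labels."""
    return [1 if p > cutoff else 0 for p in probs]

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            counts["tp"] += 1
        elif t == 0 and p == 0:
            counts["tn"] += 1
        elif t == 0 and p == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts

# Made-up labels and probabilities.
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]

y_pred = to_labels(probs)
counts = confusion_counts(y_true, y_pred)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
print(y_pred)   # [1, 0, 0, 1, 1, 0]
print(counts)   # {'tp': 2, 'tn': 2, 'fp': 1, 'fn': 1}
```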


---

### Load Saved Model

Load the saved model and use it to predict a statement's sentiment (positive or negative).

In [12]:
loaded_model = load_model('models/LSTM.h5')

Read a review from the user as the input to be classified

In [18]:
review = input('Movie Review: ')   # input() already returns a string

Movie Review: good movie a must watch 


Process the input string

In [19]:
# Pre-process input
regex = re.compile(r'[^a-zA-Z\s]')
review = regex.sub('', review)
print('Cleaned: ', review)

words = review.split(' ')
filtered = [w for w in words if w not in english_stops]
filtered = ' '.join(filtered)
filtered = [filtered.lower()]

print('Filtered: ', filtered)

Cleaned:  good movie a must watch 
Filtered:  ['good movie must watch ']


Tokenize and pad the input, reusing the tokenizer fitted on the training data

In [20]:
tokenize_words = token.texts_to_sequences(filtered)
tokenize_words = pad_sequences(tokenize_words, maxlen=max_length, padding='post', truncating='post')
print(tokenize_words)

[[  9   3 114  33   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0]]


Calculate the output

In [21]:
result = loaded_model.predict(tokenize_words)
print(result)

[[0.9996678]]


If the confidence score is close to 0, the statement is **negative**; if it is close to 1, the statement is **positive**. (The threshold here is 0.5.)

In [22]:
if result >= 0.5:
    print('positive')
else:
    print('negative')

positive
