# Sentiment Analysis Part 2

Predicting Sentiment with a Recurrent Neural Network

<b> Name:      </b> Bharathvaj Devarajan<br>
<b> Student No.</b> 16212388<br>
<b> Module     </b> CA684 - Machine Learning

<b>Aim</b> The second part of this project uses deep learning for sentiment analysis: it predicts the sentiment of a review from its summary text.

<b> Import Libraries</b>

In [1]:
# Pandas, numpy and other helpful libraries
import pandas as pd
import matplotlib.pyplot as plt
import string
import sqlite3
import numpy as np
# scikit-learn modules for preprocessing, evaluation, count vectorization and train/test split
from sklearn import preprocessing
from sklearn import metrics
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
# TFLearn library for training a recurrent neural network for sentiment analysis
import tflearn
from tflearn.data_utils import to_categorical, pad_sequences
%matplotlib inline



The first step is to establish a database connection and read the data into a pandas DataFrame.

In [2]:
con = sqlite3.connect('/root/amazon_db.sqlite')

Scores range from 1 to 5. A score of 3 represents neutral sentiment, so those reviews are omitted from this analysis.

In [3]:
amazon_df = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3
""", con)

In [4]:
amazon_df = amazon_df.dropna()

In [5]:
amazon_df.head(1)

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |


<b>Text Preprocessing</b>

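The review scores are converted to sentiment labels with a simple transformation: a score above 3 is labeled positive and anything else negative. Because reviews with a score of 3 were already excluded, negative here means a score of 1 or 2.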

In [6]:
def analyze(x):
    if x >3:
        return "Positive"
    return "Negative"

amazon_df["Sentiment"] = amazon_df["Score"].map(analyze)

Extract the sentiment labels

In [7]:
Sentiment = amazon_df["Sentiment"]

Extract the summary text of the product reviews

In [8]:
Summary = amazon_df["Summary"]

The model is trained and evaluated on separate datasets, so the data is split before any cleaning is applied: 80% for training and 20% for testing.

In [9]:
np.random.seed(10)
X_train,X_test,y_train,y_test = train_test_split(Summary,Sentiment,test_size = 0.2,random_state = 123)

In [10]:
X_train[0]

u'Good Quality Dog Food'

# Sentiment Analysis

The vectors are generated with scikit-learn's CountVectorizer. We use unigrams for our predictions and set the token regex to match only word characters.

In [12]:
vect = CountVectorizer(ngram_range=(1,1), token_pattern=r'\b\w{1,}\b')

Fit the vectorizer on the training data and store the resulting vocabulary for the predictions

In [13]:
vect.fit(X_train)
vocab = vect.vocabulary_
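
To see what such a vocabulary looks like, here is a small sketch that mimics the tokenization with Python's re module instead of calling scikit-learn. The toy summaries and the build_vocab helper are hypothetical; the sketch assumes that, like CountVectorizer, terms are lowercased and numbered in sorted order.

```python
import re

# Same token pattern as the one passed to CountVectorizer above
TOKEN_PATTERN = re.compile(r'\b\w{1,}\b')

def build_vocab(docs):
    """Map each distinct lowercased token to an integer id, in sorted order."""
    tokens = set()
    for doc in docs:
        tokens.update(TOKEN_PATTERN.findall(doc.lower()))
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}

toy_summaries = ["Good Quality Dog Food", "Not good at all"]  # hypothetical data
toy_vocab = build_vocab(toy_summaries)
print(toy_vocab)
```

Each word appears once in the mapping regardless of how often it occurs, and the id only reflects alphabetical position, not meaning.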

This function converts each summary to a list of word IDs: it splits the text on whitespace, lowercases and strips each word, and keeps only the words that are in the vocabulary. <br>
This is essential because the padding and embedding steps only work with integer data.

In [14]:
def word_to_id(X):
    return X.apply( lambda x: [vocab[w] for w in [w.lower().strip() for w in x.split()] if w in vocab] )

Applying the function to train and test data

In [15]:
X_train_id = word_to_id(X_train)
X_test_id  = word_to_id(X_test)

In [16]:
X_train_id.head()

376694                                              [20912]
471359                                              [23951]
177455                                        [17223, 9521]
435725    [4133, 28991, 4106, 9578, 11202, 10998, 26272,...
76178                                                [5478]
Name: Summary, dtype: object
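
As a rough illustration of the lookup, the sketch below applies the same logic to plain strings with a hypothetical four-word vocabulary (toy_vocab and words_to_ids are made-up names). It also shows a side effect of splitting on whitespace: a word with punctuation attached is not a vocabulary key and is silently dropped.

```python
toy_vocab = {"dog": 0, "food": 1, "good": 2, "great": 3}  # hypothetical vocabulary

def words_to_ids(text, vocab):
    """Lowercase whitespace-separated tokens and keep those found in vocab."""
    tokens = [w.lower().strip() for w in text.split()]
    return [vocab[w] for w in tokens if w in vocab]

print(words_to_ids("Good dog food", toy_vocab))  # [2, 0, 1]
print(words_to_ids("Great food!", toy_vocab))    # [3], because "food!" is not a key
```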

<b>Create the Padded Sequences</b> <br>
Here the inputs are converted to a 2D array of uniform length (set by the maxlen argument). A sequence shorter than maxlen is padded with the given value (0 in this case), and a longer one is truncated.

In [17]:
X_train_padseqs = pad_sequences(X_train_id,maxlen=20, value=0)
X_test_padseqs  = pad_sequences(X_test_id,maxlen=20, value=0)

In [18]:
X_train_padseqs.shape

(420651, 20)
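
The effect of padding can be sketched in plain Python. This is not the TFLearn implementation: pad_to_length is a hypothetical helper, and it assumes that padding and truncation both happen at the end of the sequence, so check the padding and truncating arguments of pad_sequences if the position matters.

```python
def pad_to_length(seqs, maxlen, value=0):
    """Truncate each sequence to maxlen and right-pad shorter ones with value."""
    padded = []
    for seq in seqs:
        clipped = list(seq)[:maxlen]
        padded.append(clipped + [value] * (maxlen - len(clipped)))
    return padded

toy_ids = [[5], [7, 2, 9], [1, 2, 3, 4, 5, 6]]  # hypothetical word-id lists
print(pad_to_length(toy_ids, maxlen=4))
# [[5, 0, 0, 0], [7, 2, 9, 0], [1, 2, 3, 4]]
```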

Function to map the output labels to integers (Positive = 1, Negative = 0) and one-hot encode them

In [19]:
def convert(x):
    x = x.map(lambda x: 1 if x =='Positive' else 0)
    x = to_categorical(x,nb_classes=2)
    return x

Apply the transformation

In [20]:
y_train_nn = convert(y_train)
y_test_nn =  convert(y_test)

In [21]:
y_train_nn

array([[ 1.,  0.],
       [ 0.,  1.],
       [ 0.,  1.],
       ..., 
       [ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.]])
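
For reference, one-hot encoding of a two-class label can be sketched without TFLearn as follows. The one_hot helper and the toy labels are hypothetical; column 0 stands for Negative and column 1 for Positive, as in the convert function above.

```python
def one_hot(labels, nb_classes):
    """Turn integer class labels into rows with a single 1.0 at the label index."""
    rows = []
    for label in labels:
        row = [0.0] * nb_classes
        row[label] = 1.0
        rows.append(row)
    return rows

toy_sentiments = ["Negative", "Positive", "Positive"]  # hypothetical labels
toy_labels = [1 if s == "Positive" else 0 for s in toy_sentiments]
print(one_hot(toy_labels, nb_classes=2))
# [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```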

The vector size and the vocabulary size are calculated because they are needed to build the model

In [22]:
vector_size = X_train_padseqs.shape[1]
vocab_size = len(vocab)

Creating the Neural Network

<b>Input Layer</b><br>
The first element of the shape is the batch size, which is set to None so that any batch size is accepted; the second element is the length of each input vector

In [23]:
net = tflearn.input_data([None,vector_size]) 

<b>Embedding Layer</b><br>
The second layer is the embedding layer. It takes the input layer, the vocabulary size (input_dim, the number of rows in the embedding matrix) and the embedding size (output_dim, the number of columns)

In [24]:
net = tflearn.embedding(net, input_dim=vocab_size, output_dim=128)
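
Conceptually, an embedding layer is a lookup table: each word id selects one row of a weight matrix. The NumPy sketch below illustrates this with made-up sizes and random weights; it is not how TFLearn stores or trains the matrix.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_vocab_size, toy_embed_dim = 10, 4   # hypothetical; the notebook uses len(vocab) and 128
embedding_matrix = rng.normal(size=(toy_vocab_size, toy_embed_dim))

# Two padded sequences of length 4 (hypothetical word ids)
toy_batch = np.array([[3, 7, 0, 0],
                      [1, 1, 9, 2]])

# Integer indexing picks one embedding row per word id
embedded = embedding_matrix[toy_batch]
print(embedded.shape)   # (2, 4, 4): batch, sequence length, embedding size
```

With the notebook's settings the same lookup would turn a batch of shape (batch, 20) into (batch, 20, 128).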

<b>LSTM Layer</b><br>
The third layer is the LSTM (Long Short-Term Memory) layer.<br>
The core of the model is an LSTM cell that processes one word at a time. Its memory state is initialized with a vector of zeros and is updated after reading each word; the output after the last word is passed to the next layer to classify the sentiment. Note that the dropout argument in TFLearn is documented as a keep probability, so 0.8 keeps 80% of the units rather than dropping them.


In [25]:
net = tflearn.lstm(net,128,dropout=0.8)
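
To make the word-by-word state update concrete, here is a minimal NumPy sketch of the standard LSTM equations. It is an illustration only, not TFLearn's implementation: the function name, the gate ordering inside the weight matrices, the toy sizes and the random weights are all assumptions, and dropout is left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_last_state(inputs, W, U, b):
    """Run one sequence through an LSTM cell and return the final hidden state.

    inputs: (timesteps, input_dim), W: (input_dim, 4 * units),
    U: (units, 4 * units), b: (4 * units,)
    """
    units = U.shape[0]
    h = np.zeros(units)   # hidden state starts as a vector of zeros
    c = np.zeros(units)   # memory (cell) state starts as a vector of zeros
    for x_t in inputs:    # one word embedding at a time
        z = np.dot(x_t, W) + np.dot(h, U) + b
        i = sigmoid(z[:units])               # input gate
        f = sigmoid(z[units:2 * units])      # forget gate
        o = sigmoid(z[2 * units:3 * units])  # output gate
        g = np.tanh(z[3 * units:])           # candidate memory values
        c = f * c + i * g                    # update the memory state
        h = o * np.tanh(c)                   # new hidden state
    return h

rng = np.random.RandomState(0)
timesteps, input_dim, units = 20, 8, 5   # toy sizes; the notebook uses 20, 128 and 128
W = rng.normal(scale=0.1, size=(input_dim, 4 * units))
U = rng.normal(scale=0.1, size=(units, 4 * units))
b = np.zeros(4 * units)
toy_sequence = rng.normal(size=(timesteps, input_dim))

final_state = lstm_last_state(toy_sequence, W, U, b)
print(final_state.shape)   # (5,)
```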

<b> Fully Connected </b><br>
The fourth layer is a fully connected layer: every output of the LSTM is connected to each of its units.
We set the number of units to the number of classes in the target variable (2) and use the softmax activation so that the outputs can be read as class probabilities

In [26]:
net = tflearn.fully_connected(net, 2, activation='softmax')
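
The softmax activation turns the two raw scores of this layer into probabilities that sum to 1. A minimal sketch in plain Python, with hypothetical scores:

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    top = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.5, 2.0])   # hypothetical scores for [Negative, Positive]
print(probs)
```

The larger score always gets the larger probability, so the predicted class is unchanged by the softmax; only the scale of the outputs changes.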

<b>Regression Layer</b><br>
The fifth layer is the regression layer, where we specify the optimizer, the learning rate and the loss function

In [27]:
net = tflearn.regression(net, 
                         optimizer='adam',  
                         learning_rate=0.0001,
                         loss='categorical_crossentropy')
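
Categorical cross-entropy compares the one-hot targets with the predicted probabilities and penalizes low probability on the true class. The sketch below computes it by hand on two hypothetical examples; the helper name and the eps clipping are illustrative choices, not TFLearn internals.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-10):
    """Mean over examples of -sum(true * log(pred))."""
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row))
    return total / len(y_true)

y_true = [[0.0, 1.0], [1.0, 0.0]]   # one Positive and one Negative example
y_pred = [[0.1, 0.9], [0.8, 0.2]]   # hypothetical model outputs
loss = categorical_crossentropy(y_true, y_pred)
print(round(loss, 3))   # 0.164
```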

<b>Initializing the model</b> 

In [28]:
model = tflearn.DNN(net, tensorboard_verbose=0)

<b> Training the model </b><br>
I have set the number of epochs to only 2 because of limited compute power. Training for more epochs might improve the results, but that was not tested here

In [29]:
model.fit(X_train_padseqs,y_train_nn, validation_set=(X_test_padseqs,y_test_nn), show_metric=True, batch_size=32, n_epoch=2)

Training Step: 26291  | total loss: 0.16187 | time: 1467.102s
| Adam | epoch: 002 | loss: 0.16187 - acc: 0.9441 -- iter: 420640/420651
Training Step: 26292  | total loss: 0.16346 | time: 1513.539s
| Adam | epoch: 002 | loss: 0.16346 - acc: 0.9435 | val_loss: 0.21143 - val_acc: 0.9192 -- iter: 420651/420651
--
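
The step counter in the log follows from the training set size and the batch size. A quick check of the arithmetic, using the numbers shown above:

```python
n_train = 420651    # rows in X_train_padseqs
batch_size = 32
n_epoch = 2

# Ceiling division: the last batch of each epoch holds the leftover rows
steps_per_epoch = (n_train + batch_size - 1) // batch_size
leftover = n_train - (steps_per_epoch - 1) * batch_size
total_steps = steps_per_epoch * n_epoch

print(steps_per_epoch, leftover, total_steps)   # 13146 11 26292
```

This matches the final "Training Step: 26292" and explains why the previous step stops at iter 420640: the last batch contains only 11 rows.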


Save the model to avoid retraining

In [30]:
model.save('/root/model.tfl')

INFO:tensorflow:/root/model.tfl is not in all_model_checkpoint_paths. Manually adding it.


Load the model

In [31]:
model.load('/root/model.tfl')

<b>Model Evaluation</b><br>
The model predicts a probability for each of the two classes. The argmax function returns the index of the class with the higher probability, which is taken as the predicted class

In [45]:
pred_classes = [np.argmax(i) for i in model.predict(X_test_padseqs)]
true_classes = [np.argmax(i) for i in y_test_nn]

print('Model Accuracy: %0.3f\n'% metrics.accuracy_score(true_classes, pred_classes))

Model Accuracy: 0.919
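
The same evaluation can be traced by hand on a few rows. The probabilities and targets below are hypothetical, and the small argmax helper stands in for np.argmax.

```python
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]   # hypothetical predictions
toy_truth = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]   # one-hot targets

def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=lambda i: row[i])

pred = [argmax(r) for r in toy_probs]   # [0, 1, 0, 1]
true = [argmax(r) for r in toy_truth]   # [0, 1, 1, 1]

accuracy = sum(p == t for p, t in zip(pred, true)) / float(len(true))
print(accuracy)   # 0.75
```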



<b>Label Encoding</b><br> A label encoder is fitted so that the inverse transform can turn class indices back into labels and show the actual predictions

In [36]:
y_labels = list(y_train.value_counts().index)

In [37]:
y_labels

['Positive', 'Negative']

In [38]:
le = preprocessing.LabelEncoder()
le.fit(y_labels)

LabelEncoder()
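
The inverse transform gives the right labels only because LabelEncoder stores its classes in sorted order, which happens to match the Negative = 0, Positive = 1 mapping used in convert. A plain-Python sketch of that mapping (the encode and decode dictionaries are hypothetical stand-ins for the encoder):

```python
labels = ['Positive', 'Negative']   # order returned by value_counts()
classes = sorted(set(labels))       # sorted order, as LabelEncoder stores its classes

encode = {name: idx for idx, name in enumerate(classes)}
decode = {idx: name for name, idx in encode.items()}

print(encode)      # {'Negative': 0, 'Positive': 1}
print(decode[1])   # Positive
```

If the class names sorted differently from the manual 0/1 mapping, the decoded labels would be swapped.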

In [43]:
for i in range(30):
    pred_class = np.argmax(model.predict([X_test_padseqs[i]]))
    true_class = np.argmax(y_test_nn[i])
    
    print(X_test.values[i])
    # inverse_transform expects an array-like, so wrap the index in a list
    print('Predicted class:',le.inverse_transform([pred_class])[0])
    print('Actual class:', le.inverse_transform([true_class])[0])
    print('')

Dark chocolate, granola -- nice combination
('Predicted class:', 'Positive')
('Actual class:', 'Positive')

easy chili
('Predicted class:', 'Positive')
('Actual class:', 'Positive')

Best Decaf available in T-disc so far
('Predicted class:', 'Positive')
('Actual class:', 'Positive')

crunchy good
('Predicted class:', 'Positive')
('Actual class:', 'Positive')

Best "Pod" coffee
('Predicted class:', 'Positive')
('Actual class:', 'Positive')

The Best Snack I've Ever Had
('Predicted class:', 'Positive')
('Actual class:', 'Positive')

cinnamon discs sugar free candy
('Predicted class:', 'Positive')
('Actual class:', 'Positive')

Cat Treats
('Predicted class:', 'Positive')
('Actual class:', 'Positive')

Great if you're watching sugar or carbs!
('Predicted class:', 'Positive')
('Actual class:', 'Positive')

disappointed
('Predicted class:', 'Negative')
('Actual class:', 'Negative')

great cookie
('Predicted class:', 'Positive')
('Actual class:', 'Positive')

Very not awesome.
('Predicted cla