# Sentiment Analysis - CNN vs LSTM



This is part of a tutorial series on classifying the sentiment of IMDB movie reviews using machine learning and deep learning techniques. In the last part (<a href="https://www.kaggle.com/oumaimahourrane/sentiment-analysis-doc2vec-vs-word2vec">link</a>) of this series, I showed how to get word embeddings and classify the sentiment of our corpus using Word2vec and Doc2vec. In this part, I use a one-layer convolutional neural network and compare it with an LSTM at the end of this tutorial.


## Convolutional Neural Network

In the previous kernel (<a href="https://www.kaggle.com/oumaimahourrane/sentiment-analysis-doc2vec-vs-word2vec">link</a>), we aggregated the word vectors of each text using TF-IDF weighting to get one vector representation per text, in order to feed a simple neural network. This is not the case for a CNN, where we have to feed the word vectors to the model as a sequence.
A neural network also expects all inputs to have the same dimension, but different sentences have different lengths. We handle this below with sequence padding.
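
To make the padding idea concrete, here is a minimal pure-Python sketch. The helper `pad_left` is hypothetical (it is not a Keras function); it only mimics the default behaviour of Keras `pad_sequences`, which pads with zeros at the front and, for sequences that are too long, drops tokens from the front.

```python
def pad_left(seq, maxlen, value=0):
    """Pad or truncate `seq` at the front so the result has exactly `maxlen` items."""
    if len(seq) >= maxlen:
        # Too long: keep only the last `maxlen` tokens
        return list(seq[len(seq) - maxlen:])
    # Too short: fill the front with the padding value
    return [value] * (maxlen - len(seq)) + list(seq)


short_review = [12, 7, 45]                 # toy token ids, not real data
long_review = [3, 9, 27, 81, 243, 729]

print(pad_left(short_review, 5))  # [0, 0, 12, 7, 45]
print(pad_left(long_review, 5))   # [9, 27, 81, 243, 729]
```

Both toy reviews end up with length 5, which is the property the network needs.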

Let's begin by loading our cleaned data (see the data pre-processing <a href="https://www.kaggle.com/oumaimahourrane/sentiment-analysis-ml-models-comparison">in this post</a>) and splitting it into training and validation sets.


In [1]:
import os
import sys
import pandas as pd

csv = '~/clean_data.csv'
data = pd.read_csv(csv,index_col=0, encoding='latin-1')
data.head()

| Unnamed: 0 | SentimentText | Sentiment |
|---|---|---|
| 0 | first think another disney movie might good it... | 1 |
| 1 | put aside dr house repeat missed desperate hou... | 0 |
| 2 | big fan stephen king s work film made even gre... | 1 |
| 3 | watched horrid thing tv needless say one movie... | 0 |
| 4 | truly enjoyed film acting terrific plot jeff c... | 1 |


In [2]:
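# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection provides the same function
from sklearn.model_selection import train_test_split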

SEED = 2000

x_train, x_validation, y_train, y_validation = train_test_split(data.SentimentText, data.Sentiment, test_size=.2, random_state=SEED)



Next, we use the Keras Tokenizer to split each text into words and build a vocabulary. Then, to get a sequential (integer) representation of each row, we use the texts_to_sequences method.

In [3]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=100000)
tokenizer.fit_on_texts(x_train)
sequences = tokenizer.texts_to_sequences(x_train)

Using TensorFlow backend.
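
As a rough illustration of what the tokenizer step above produces, the sketch below rebuilds the idea in pure Python. The helpers `fit_word_index` and `texts_to_ids` are hypothetical names, not Keras functions, and the tie-breaking between equally frequent words is an assumption. The sketch assumes the behaviour of the Keras Tokenizer as I understand it: index 1 goes to the most frequent word, index 0 is reserved for padding, unseen words are dropped, and only indices below `num_words` are kept.

```python
from collections import Counter


def fit_word_index(texts):
    """Map each word to a rank: 1 = most frequent, 0 is left free for padding."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by descending count; ties keep first-seen order
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(ranked, start=1)}


def texts_to_ids(texts, word_index, num_words=None):
    """Replace each known word by its index; unknown words are dropped."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        if num_words is not None:
            # keep only the most frequent words (index below num_words)
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences


toy_train = ["good movie good acting", "bad movie"]   # toy data, not the IMDB corpus
word_index = fit_word_index(toy_train)
print(word_index)                                     # {'good': 1, 'movie': 2, 'acting': 3, 'bad': 4}
print(texts_to_ids(["good bad film"], word_index))    # [[1, 4]]
```

The word "film" never appeared in the toy training texts, so it is dropped. This is why the tokenizer is fitted on the training set only and then reused for the validation set.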


Now we can check the maximum length of the rows in our corpus, which we need for padding.

In [4]:
length = []
for x in x_train:
    length.append(len(x.split()))
max(length)

974

The longest review has 974 words, so we can set the maximum length to 1000.

In [5]:
x_train_seq = pad_sequences(sequences, maxlen=1000)
x_train_seq[:5]

array([[   0,    0,    0, ..., 1265,   16,    6],
       [   0,    0,    0, ..., 1468,  213,  237],
       [   0,    0,    0, ...,   58,  100,    9],
       [   0,    0,    0, ...,   43,  664,  141],
       [   0,    0,    0, ...,    5,    1,  169]], dtype=int32)

After checking, we can see that all the rows have been transformed to have the same length of 1000.
We do the same for the validation set.

In [6]:
sequences_val = tokenizer.texts_to_sequences(x_validation)
x_val_seq = pad_sequences(sequences_val, maxlen=1000)

Next, we define a CNN that takes as input an embedding layer with a vocabulary of 100000 words (max features), 128 dimensions per word, and an input length of 1000. We then add a 1D convolutional layer with 100 filters of kernel size 2, followed by a global max pooling layer, which extracts the maximum value from each filter. The output of the pooling layer is a one-dimensional vector whose length equals the number of filters; it is passed to a dense layer of 256 units and a final sigmoid output unit.
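
Before building the model, it can help to work out the shapes and parameter counts by hand. The sketch below is plain arithmetic using the standard formulas for embedding, convolution and dense layers; the variable names are mine, and the numbers are the ones used in the model code that follows.

```python
vocab_size = 100000     # max features of the embedding layer
embedding_dim = 128     # dimensions per word
seq_len = 1000          # padded review length
filters = 100           # number of convolution filters
kernel_size = 2         # each filter looks at 2 consecutive words
dense_units = 256

# 'valid' padding with stride 1: one output per window position
conv_steps = seq_len - kernel_size + 1

embedding_params = vocab_size * embedding_dim
conv_params = kernel_size * embedding_dim * filters + filters   # weights + biases
dense_params = filters * dense_units + dense_units              # pooling output has `filters` values
output_params = dense_units * 1 + 1

print(conv_steps)        # 999
print(embedding_params)  # 12800000
print(conv_params)       # 25700
print(dense_params)      # 25856
print(output_params)     # 257
print(embedding_params + conv_params + dense_params + output_params)  # 12851813
```

Almost all of the parameters sit in the embedding layer; the convolution itself is tiny by comparison.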

In [7]:
from keras.models import Sequential
# Embedding is importable from keras.layers directly; keras.layers.embeddings is not available in recent Keras versions
from keras.layers import Dense, Conv1D, GlobalMaxPooling1D, Embedding
from time import time

acc = []
times = []

model_cnn = Sequential()

fp = []
tp = []

e = Embedding(100000, 128, input_length=1000)
model_cnn.add(e)
model_cnn.add(Conv1D(filters=100, kernel_size=2, padding='valid', activation='relu', strides=1))
model_cnn.add(GlobalMaxPooling1D())
model_cnn.add(Dense(256, activation='relu'))
model_cnn.add(Dense(1, activation='sigmoid'))
model_cnn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

t0 = time()
model_cnn.fit(x_train_seq, y_train, validation_data=(x_val_seq, y_validation), epochs=5, batch_size=32, verbose=2)
score,accu = model_cnn.evaluate(x_val_seq, y_validation, verbose = 2, batch_size = 32)
tv_time = time()-t0

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
 - 410s - loss: 0.3808 - acc: 0.8210 - val_loss: 0.2782 - val_acc: 0.8828
Epoch 2/5
 - 400s - loss: 0.1269 - acc: 0.9555 - val_loss: 0.3018 - val_acc: 0.8860
Epoch 3/5
 - 411s - loss: 0.0158 - acc: 0.9972 - val_loss: 0.4509 - val_acc: 0.8618
Epoch 4/5
 - 507s - loss: 0.0012 - acc: 1.0000 - val_loss: 0.4082 - val_acc: 0.8852
Epoch 5/5
 - 494s - loss: 2.0990e-04 - acc: 1.0000 - val_loss: 0.4202 - val_acc: 0.8872


In [8]:
from sklearn.metrics import roc_curve

y_pred= model_cnn.predict(x_val_seq).ravel()
fpr, tpr, _ = roc_curve(y_validation, y_pred)

acc.append(accu*100)
times.append(tv_time*0.0166667)
fp.append(fpr)
tp.append(tpr)

print("score: %.2f" % (score))
print("acc: %.2f" % (accu))

score: 0.42
acc: 0.89


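After about 37 minutes of training (five epochs of 400 to 500 seconds each, according to the log above), we get a validation accuracy of 0.89, which seems better than all the models I've run in previous kernels. Note that the training accuracy reaches 1.0 while the validation loss rises after the first epoch, which is a sign of overfitting.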
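
The epoch log above can be summarised with a few lines of Python. This is only a sketch: the three lists are copied by hand from the training output, and the variable names are mine.

```python
# Values copied from the CNN training log above
epoch_seconds = [410, 400, 411, 507, 494]
val_loss = [0.2782, 0.3018, 0.4509, 0.4082, 0.4202]
val_acc = [0.8828, 0.8860, 0.8618, 0.8852, 0.8872]

# Time spent in fit() alone, converted from seconds to minutes
total_minutes = sum(epoch_seconds) / 60

# Epochs are numbered from 1 in the Keras log
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_acc_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1

print("training time: %.1f min" % total_minutes)          # training time: 37.0 min
print("lowest val_loss at epoch %d" % best_loss_epoch)    # lowest val_loss at epoch 1
print("highest val_acc at epoch %d" % best_acc_epoch)     # highest val_acc at epoch 5
```

The validation loss is lowest after the first epoch, while the validation accuracy barely moves afterwards, so most of the later epochs add training time without a clear gain. Keras provides an `EarlyStopping` callback that could be used to stop training in such a situation; I have not used it in this kernel.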

## Long Short Term Memory

Let's now try another model, an LSTM, and compare it with the previous CNN model.
We use a single LSTM layer preceded by an embedding layer with 100000 max features and 128 dimensions for each word in a sequence, with spatial dropout in between, followed by a dense layer with a sigmoid activation.


In [9]:
#LSTM

from keras.layers import SpatialDropout1D, LSTM, Dropout, Dense, GRU, Bidirectional

model_lstm = Sequential()

model_lstm.add(Embedding(100000, 128))
model_lstm.add(SpatialDropout1D(0.4))
model_lstm.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model_lstm.add(Dense(1,activation='sigmoid'))
model_lstm.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model_lstm.summary())

t0 = time()
model_lstm.fit(x_train_seq, y_train, epochs = 7, batch_size=32, verbose = 2)
score,accu = model_lstm.evaluate(x_val_seq, y_validation, verbose = 2, batch_size = 32)
tv_time = time()-t0
y_pred= model_lstm.predict(x_val_seq).ravel()
fpr, tpr, _ = roc_curve(y_validation, y_pred)

acc.append(accu*100)
times.append(tv_time*0.0166667)
fp.append(fpr)
tp.append(tpr)

print("score: %.2f" % (score))
print("acc: %.2f" % (accu))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 128)         12800000  
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, None, 128)         0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 129       
Total params: 12,931,713
Trainable params: 12,931,713
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/7
 - 4216s - loss: 0.4390 - acc: 0.8015
Epoch 2/7
 - 3466s - loss: 0.2426 - acc: 0.9093
Epoch 3/7
 - 3154s - loss: 0.1391 - acc: 0.9519
Epoch 4/7
 - 3254s - loss: 0.0963 - acc: 0.9684
Epoch 5/7
 - 2862s - loss: 0.0641 - acc: 0.9792
Epoch 6/7
 - 2933
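
The parameter counts in the model summary above can be reproduced by hand. The sketch below uses the standard LSTM formula (four gates, each with input weights, recurrent weights and a bias); `lstm_params` is a hypothetical helper name, not a Keras function.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of one LSTM layer: 4 gates x (input + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)


embedding = 100000 * 128          # vocabulary size x embedding dimension
lstm = lstm_params(128, 128)      # 128-dim word vectors into 128 LSTM units
dense = 128 * 1 + 1               # one sigmoid unit on top of the LSTM output

print(embedding)                  # 12800000
print(lstm)                       # 131584
print(dense)                      # 129
print(embedding + lstm + dense)   # 12931713
```

These values match the `Param #` column and the total of 12,931,713 printed by `model_lstm.summary()`. The spatial dropout layer has no trainable parameters.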

## GRU

In [10]:
model_gru = Sequential()
model_gru.add(Embedding(100000, 128))
model_gru.add(GRU(units=16, name = "gru_1",return_sequences=True))
model_gru.add(GRU(units=8, name = "gru_2" ,return_sequences=True))
model_gru.add(GRU(units=4, name= "gru_3"))

model_gru.add(Dense(1,activation='sigmoid'))
model_gru.compile(loss = 'binary_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model_gru.summary())
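t0 = time()  # time this model too, otherwise tv_time still holds the LSTM timing
model_gru.fit(x_train_seq, y_train, epochs = 7, batch_size=32, verbose = 2)
score,accu = model_gru.evaluate(x_val_seq, y_validation, verbose = 2, batch_size = 32)
tv_time = time()-t0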



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, None, 128)         12800000  
_________________________________________________________________
gru_1 (GRU)                  (None, None, 16)          6960      
_________________________________________________________________
gru_2 (GRU)                  (None, None, 8)           600       
_________________________________________________________________
gru_3 (GRU)                  (None, 4)                 156       
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 5         
Total params: 12,807,721
Trainable params: 12,807,721
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/7
 - 1071s - loss: 0.4143 - acc: 0.8174
Epoch 2/7
 - 1076s - loss: 0.1798 - acc: 0.9398
Epoch 3/7
 - 1048s - loss: 0.

In [11]:
y_pred= model_gru.predict(x_val_seq).ravel()
fpr, tpr, _ = roc_curve(y_validation, y_pred)

acc.append(accu*100)
times.append(tv_time*0.0166667)
fp.append(fpr)
tp.append(tpr)

print("score: %.2f" % (score))
print("acc: %.2f" % (accu))

score: 0.68
acc: 0.84
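
The GRU summary above can be checked the same way as any other layer count. A GRU has three gates instead of four; the sketch below assumes the classic formulation (one bias vector per gate), which is what the printed counts correspond to. The helper `gru_params` and its `reset_after` flag are my own naming for illustration: as far as I know, newer Keras versions default to a variant with two bias vectors per gate, which gives slightly larger numbers.

```python
def gru_params(input_dim, units, reset_after=False):
    """Trainable parameters of one GRU layer: 3 gates x (input + recurrent weights + bias)."""
    bias = 2 * units if reset_after else units
    return 3 * (units * (input_dim + units) + bias)


embedding = 100000 * 128
gru_1 = gru_params(128, 16)   # word vectors -> 16 units
gru_2 = gru_params(16, 8)     # 16 units -> 8 units
gru_3 = gru_params(8, 4)      # 8 units -> 4 units
dense = 4 * 1 + 1

print(gru_1, gru_2, gru_3, dense)                  # 6960 600 156 5
print(embedding + gru_1 + gru_2 + gru_3 + dense)   # 12807721
```

The three stacked GRU layers together hold fewer than 8,000 parameters, far fewer than the single 128-unit LSTM layer.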


## CNN+LSTM

In [12]:
from keras.layers import MaxPooling1D, Activation
# Embedding
max_features = 100000
maxlen = 1000
embedding_size = 128

# Convolution
kernel_size = 5
filters = 64
pool_size = 4

# LSTM
lstm_output_size = 70

# Training (same values as passed to model.fit below)
batch_size = 32
epochs = 5

model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen))
model.add(Dropout(0.25))
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(MaxPooling1D(pool_size=pool_size))
model.add(LSTM(lstm_output_size))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

t0 = time()
model.fit(x_train_seq, y_train, validation_data=(x_val_seq, y_validation), epochs=5, batch_size=32, verbose=2)
score,accu = model.evaluate(x_val_seq, y_validation, verbose = 2, batch_size = 32)
tv_time = time()-t0


Train on 20000 samples, validate on 5000 samples
Epoch 1/5
 - 541s - loss: 0.3563 - acc: 0.8373 - val_loss: 0.2938 - val_acc: 0.8828
Epoch 2/5
 - 535s - loss: 0.1272 - acc: 0.9555 - val_loss: 0.3167 - val_acc: 0.8794
Epoch 3/5
 - 550s - loss: 0.0415 - acc: 0.9877 - val_loss: 0.4833 - val_acc: 0.8752
Epoch 4/5
 - 557s - loss: 0.0189 - acc: 0.9945 - val_loss: 0.5479 - val_acc: 0.8778
Epoch 5/5
 - 549s - loss: 0.0118 - acc: 0.9963 - val_loss: 0.5809 - val_acc: 0.8694


In [13]:
y_pred= model.predict(x_val_seq).ravel()
fpr, tpr, _ = roc_curve(y_validation, y_pred)

acc.append(accu*100)
times.append(tv_time*0.0166667)
fp.append(fpr)
tp.append(tpr)
print("score: %.2f" % (score))
print("acc: %.2f" % (accu))

score: 0.58
acc: 0.87
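
One way to see why the combined model trains much faster than the plain LSTM is to follow the sequence length through the layers. The sketch below is plain arithmetic with the hyperparameters from the cell above; it assumes 'valid' padding with stride 1 for the convolution and a pooling stride equal to the pool size, which I believe is the Keras default for `MaxPooling1D`.

```python
maxlen = 1000        # padded review length
kernel_size = 5
filters = 64
pool_size = 4

# Conv1D with 'valid' padding and stride 1
conv_steps = maxlen - kernel_size + 1
# MaxPooling1D with non-overlapping windows
pooled_steps = conv_steps // pool_size

print(conv_steps)     # 996
print(pooled_steps)   # 249
print(filters)        # 64 features per step reach the LSTM
```

The LSTM in the combined model therefore processes 249 time steps of 64 features each, instead of the 1000 time steps of 128 features seen by the plain LSTM model.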


In [14]:
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

import numpy as np


names = ["CNN","LSTM", "GRU", "Combined"]

trace1 = go.Bar(
    x=names,
    y=acc,
    name='Accuracy (%)'
)
trace2 = go.Bar(
    x=names,
    y=times,
    name='train and test time (min)'
)

data = [trace1, trace2]
layout = go.Layout(
    barmode='group'
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='grouped-bar')

In [15]:
data = []

print (len(fp))
for i in range(0, len(fp)):
    trace = go.Scatter(x=fp[i], y=tp[i],
                        mode='lines', 
                        name='ROC curve {}'.format(names[i]))
    data.append(trace)
    
layout = go.Layout(title='Receiver operating characteristic ',
                   xaxis=dict(title='False Positive Rate'),
                   yaxis=dict(title='True Positive Rate'))

fig = go.Figure(data=data, layout=layout)
    
py.iplot(fig)

4
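
ROC curves are easier to compare when each one is reduced to a single number, the area under the curve (AUC). The sketch below computes it with the trapezoid rule in pure Python; `trapezoid_auc` is a hypothetical helper and the toy points are made up for illustration, they are not results from the models above.

```python
def trapezoid_auc(fpr, tpr):
    """Area under a ROC curve given as points sorted by increasing false positive rate."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * mean_height
    return area


# Toy curve, not taken from the notebook
toy_fpr = [0.0, 0.0, 0.5, 1.0]
toy_tpr = [0.0, 0.5, 1.0, 1.0]

print(trapezoid_auc(toy_fpr, toy_tpr))            # 0.875
print(trapezoid_auc([0.0, 1.0], [0.0, 1.0]))      # 0.5, a random classifier
```

In this kernel, calling `trapezoid_auc(fp[i], tp[i])` for each model would give one score per curve; scikit-learn offers the same computation as `sklearn.metrics.auc(fpr, tpr)`.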


After training and validation, the recurrent models give lower accuracy than the CNN (0.84 for the GRU and 0.87 for the CNN+LSTM, against 0.89 for the CNN), and they were also slower to train.
This could be improved by tuning the hyperparameters. Training can also be made faster by combining the recurrent layer with another model, such as the CNN itself.

Thanks for reading :)