Sentiment Analysis with LSTM and Keras by 20SW013, 20SW079, 20SW135

As an improvement to my previous [Kernel][1], here I am trying to achieve better results with a Recurrent Neural Network. <br/>
You may want to [check out](https://www.kaggle.com/ngyptr/multi-class-classification-with-lstm) my latest kernel on an LSTM multi-class classification problem.

  [1]: https://www.kaggle.com/ngyptr/d/crowdflower/first-gop-debate-twitter-sentiment/python-nltk-sentiment-analysis

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
import re

Using TensorFlow backend.


Load the data and keep only the necessary columns: `sentiment` and `text`.

In [None]:
data = pd.read_csv('/kaggle/input/first-gop-debate-twitter-sentiment/Sentiment.csv')

In [None]:
data = data[['sentiment', 'text']]

In [None]:
data.head()

Unnamed: 0,sentiment,text
0,Neutral,RT @NancyLeeGrahn: How did everyone feel about...
1,Positive,RT @ScottWalker: Didn't catch the full #GOPdeb...
2,Neutral,RT @TJMShow: No mention of Tamir Rice and the ...
3,Positive,RT @RobGeorge: That Carly Fiorina is trending ...
4,Positive,RT @DanScavino: #GOPDebate w/ @realDonaldTrump...


Next, I drop the 'Neutral' tweets, since my goal is only to differentiate positive from negative tweets. After that, I lowercase the text and strip every character that is not a letter, a digit, or whitespace. Then I set the maximum number of features (the vocabulary size) to 2000 and use the Keras Tokenizer to convert each tweet into a sequence of word indices, padded to a common length so the network can take them as input.

In [None]:
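data = data[data.sentiment != "Neutral"].copy()
data['text'] = data['text'].apply(lambda x: x.lower())
# Keep only letters, digits and whitespace (raw string; the range is A-Z, not A-z)
data['text'] = data['text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))

# Count rows per class (DataFrame.size would return rows * columns instead)
print(len(data[data['sentiment'] == 'Positive'])) # Positive
print(len(data[data['sentiment'] == 'Negative'])) # Negative

# Remove the retweet marker 'rt' as a standalone word.
# Rows yielded by iterrows() are copies, so the replacement is done column-wise.
data['text'] = data['text'].str.replace(r'\brt\b', ' ', regex=True)

max_fatures = 2000
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(data['text'].values)
X = tokenizer.texts_to_sequences(data['text'].values)
X = pad_sequences(X)

2236
8493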
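
To make the vectorization step above more concrete, here is a minimal pure-Python sketch of what the Tokenizer and `pad_sequences` calls do conceptually. It is a simplified approximation, not the Keras implementation: the helper names and the toy tweets are made up for illustration, and the sketch assumes the Keras defaults of ranking words by frequency, reserving index 0 for padding, keeping only words whose index is below `num_words`, and padding on the left.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words whose index is >= num_words."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_left(sequences, maxlen=None):
    """Left-pad with zeros (and left-truncate) so every sequence has the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    return [[0] * (maxlen - len(seq)) + seq[-maxlen:] for seq in sequences]

# Hypothetical toy tweets, already lowercased and cleaned
toy_tweets = ["great debate tonight", "bad debate", "great great answer"]
word_index = fit_word_index(toy_tweets)
sequences = texts_to_sequences(toy_tweets, word_index, num_words=4)
padded = pad_left(sequences)
print(padded)  # [[1, 2, 3], [0, 0, 2], [0, 1, 1]]
```

With `num_words=4` only the three most frequent words survive, so "bad" and "answer" are silently dropped. The same happens in the notebook to every word outside the 1999 most frequent ones.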


Next, I compose the LSTM network. Note that **embed_dim**, **lstm_out**, **batch_size**, and the dropout rates are hyperparameters. Their values here were chosen by intuition and should be tuned in order to achieve good results. Please also note that I am using softmax as the activation function of the output layer. The reason is that the network is trained with categorical crossentropy, which expects a probability distribution over the classes, and softmax produces exactly that.
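
As a rough illustration of why softmax and categorical crossentropy fit together, the sketch below computes both by hand for a single example. It is plain Python, not Keras code, and the logit values are invented for the example. The class order (index 0 = negative, index 1 = positive) is an assumption that matches an alphabetical one-hot encoding.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot_target, probs))

# Hypothetical output scores of the last Dense layer for one tweet
probs = softmax([2.0, 0.5])
loss_if_negative = categorical_crossentropy([1, 0], probs)  # true class = negative
loss_if_positive = categorical_crossentropy([0, 1], probs)  # true class = positive
print(probs, loss_if_negative, loss_if_positive)
```

The loss is small when the true class receives most of the probability mass and grows when it does not, which is the signal the optimizer uses during training.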

In [None]:
embed_dim = 128
lstm_out = 196

model = Sequential()
model.add(Embedding(max_fatures, embed_dim,input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 28, 128)           256000    
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 28, 128)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 196)               254800    
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 394       
=================================================================
Total params: 511,194
Trainable params: 511,194
Non-trainable params: 0
_________________________________________________________________
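
The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias), and a dense layer. The function names are only illustrative.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

counts = {
    "embedding": embedding_params(2000, 128),
    "lstm": lstm_params(128, 196),
    "dense": dense_params(196, 2),
}
total = sum(counts.values())
print(counts, total)  # {'embedding': 256000, 'lstm': 254800, 'dense': 394} 511194
```

The dropout layers add no parameters, so the three numbers add up to the total of 511,194 reported by Keras.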


Next, I one-hot encode the labels and split the data into training and test sets.

In [None]:
Y = pd.get_dummies(data['sentiment']).values
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(7188, 28) (7188, 2)
(3541, 28) (3541, 2)
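
The following sketch mimics the two steps above in plain Python: a one-hot encoding with the classes in alphabetical order, and the arithmetic behind the split sizes. It assumes that the test share is rounded up to a whole number of samples, which is consistent with the shapes printed above. The helper names are hypothetical.

```python
import math

def one_hot(labels):
    """One column per class, classes in sorted (alphabetical) order."""
    classes = sorted(set(labels))
    encoded = [[1 if label == cls else 0 for cls in classes] for label in labels]
    return classes, encoded

def split_sizes(n_samples, test_size):
    """Assumed rule: the test share is rounded up, the rest goes to training."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

classes, encoded = one_hot(["Positive", "Negative", "Negative"])
print(classes)   # ['Negative', 'Positive']
print(encoded)   # [[0, 1], [1, 0], [1, 0]]

n_train, n_test = split_sizes(7188 + 3541, 0.33)
print(n_train, n_test)  # 7188 3541
```

The column order matters later on: index 0 of a prediction corresponds to 'Negative' and index 1 to 'Positive'.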


Here we train the network. The cell below is configured for 40 epochs with a batch size of 32. Training on Kaggle is slow, so lower `epochs` if you only want a quick run.

In [None]:
batch_size = 32
model.fit(X_train, Y_train, epochs = 40, batch_size=batch_size, verbose = 2)

Epoch 1/40
 - 12s - loss: 0.4431 - acc: 0.8127
Epoch 2/40
 - 11s - loss: 0.3257 - acc: 0.8635
Epoch 3/40
 - 11s - loss: 0.2832 - acc: 0.8836
Epoch 4/40
 - 12s - loss: 0.2493 - acc: 0.8961
Epoch 5/40
 - 11s - loss: 0.2248 - acc: 0.9090
Epoch 6/40
 - 11s - loss: 0.1998 - acc: 0.9164
Epoch 7/40
 - 11s - loss: 0.1877 - acc: 0.9245
Epoch 8/40
 - 11s - loss: 0.1665 - acc: 0.9345
Epoch 9/40
 - 11s - loss: 0.1534 - acc: 0.9375
Epoch 10/40
 - 11s - loss: 0.1421 - acc: 0.9439
Epoch 11/40
 - 11s - loss: 0.1319 - acc: 0.9452
Epoch 12/40
 - 11s - loss: 0.1247 - acc: 0.9482
Epoch 13/40
 - 12s - loss: 0.1136 - acc: 0.9546
Epoch 14/40
 - 11s - loss: 0.1155 - acc: 0.9521
Epoch 15/40
 - 11s - loss: 0.1057 - acc: 0.9558
Epoch 16/40
 - 11s - loss: 0.1070 - acc: 0.9562
Epoch 17/40
 - 11s - loss: 0.1041 - acc: 0.9583
Epoch 18/40
 - 11s - loss: 0.0978 - acc: 0.9581
Epoch 19/40
 - 11s - loss: 0.0931 - acc: 0.9623
Epoch 20/40
 - 11s - loss: 0.0917 - acc: 0.9637
Epoch 21/40
 - 11s - loss: 0.0952 - acc: 0.9627
...

<keras.callbacks.History at 0x7dffe479d0f0>
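
To get a feeling for how much work one training run is, the number of weight updates can be estimated from the training set size, the batch size, and the number of epochs. This is a back-of-the-envelope sketch that assumes the last, smaller batch of each epoch still counts as one update.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of batches (and therefore weight updates) in one epoch."""
    return math.ceil(n_samples / batch_size)

steps = updates_per_epoch(7188, 32)   # 7188 training samples, batch_size = 32
total_updates = steps * 40            # epochs = 40
print(steps, total_updates)           # 225 9000
```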

Next, I hold out the last 1500 test samples as a validation set and measure the score (the loss) and the accuracy on the remaining test samples.

In [None]:
validation_size = 1500

X_validate = X_test[-validation_size:]
Y_validate = Y_test[-validation_size:]
X_test = X_test[:-validation_size]
Y_test = Y_test[:-validation_size]
score,acc = model.evaluate(X_test, Y_test, verbose = 2, batch_size = batch_size)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

score: 1.10
acc: 0.81
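
The slicing used above can be tried on plain lists. In this sketch the test rows are replaced by a range of integers as a stand-in, and `split_holdout` is a hypothetical helper, not part of the notebook.

```python
def split_holdout(samples, holdout_size):
    """Return (remaining, holdout), where holdout is the last `holdout_size` items.

    holdout_size must be greater than 0, because samples[:-0] would be empty.
    """
    return samples[:-holdout_size], samples[-holdout_size:]

test_rows = list(range(3541))  # stand-in for the 3541 test samples
remaining, validation = split_holdout(test_rows, 1500)
print(len(remaining), len(validation))  # 2041 1500
```

The two parts do not overlap, so the per-class check that follows runs on samples that were not part of the score and accuracy above.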


Finally, I measure the number of correct guesses per class on the validation set. The network finds negative tweets quite well, but it is much worse at recognizing positive ones. My educated guess is that this is because the positive training set is dramatically smaller than the negative one.

In [None]:
# Columns of Y come from pd.get_dummies in alphabetical order: 0 = Negative, 1 = Positive
pos_cnt, neg_cnt, pos_correct, neg_correct = 0, 0, 0, 0
for x in range(len(X_validate)):
    # Predict one sample at a time; result holds the two class probabilities
    result = model.predict(X_validate[x].reshape(1, X_validate.shape[1]), batch_size=1, verbose=2)[0]
    true_class = np.argmax(Y_validate[x])

    if np.argmax(result) == true_class:
        if true_class == 0:
            neg_correct += 1
        else:
            pos_correct += 1

    if true_class == 0:
        neg_cnt += 1
    else:
        pos_cnt += 1

print("pos_acc", pos_correct/pos_cnt*100, "%")
print("neg_acc", neg_correct/neg_cnt*100, "%")

pos_acc 63.10679611650486 %
neg_acc 86.0621326616289 %
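
The per-class check above can be written as a small, framework-free function. The sketch below uses invented toy labels, not the notebook's data, to show how a high overall accuracy can hide weak results on the smaller class.

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Share of correctly predicted samples, computed separately for each true class."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {cls: correct[cls] / total[cls] for cls in total}

# Hypothetical, unbalanced toy labels: 8 negative, 2 positive
y_true = ["neg"] * 8 + ["pos"] * 2
y_pred = ["neg"] * 8 + ["neg", "pos"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
by_class = per_class_accuracy(y_true, y_pred)
print(overall)   # 0.9
print(by_class)  # {'neg': 1.0, 'pos': 0.5}
```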


As requested by the crowd, I extended the kernel with a prediction example and also updated the API calls to Keras 2.0. Please note that the network performs poorly. That is because the training data is very unbalanced (roughly four negative tweets for every positive one). To achieve reliable predictions, you should get more data, use another dataset, use a pre-trained model, or weight the classes.
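
One of the options mentioned above, weighting the classes, can be sketched as follows. The "balanced" heuristic used here (total samples divided by the number of classes times the class count) is one common choice, not necessarily the best one for this dataset. The counts are assumptions for illustration: 2,236 positive and 8,493 negative tweets, which add up to the 10,729 samples in the train/test split. A dictionary like this could then be passed to `model.fit` through its `class_weight` argument.

```python
def balanced_class_weights(class_counts):
    """Weight each class inversely to its frequency so all classes contribute equally."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {cls: n_samples / (n_classes * count) for cls, count in class_counts.items()}

# Assumed class counts; keys are the class indices (0 = negative, 1 = positive)
class_counts = {0: 8493, 1: 2236}
weights = balanced_class_weights(class_counts)
print(weights)  # roughly {0: 0.63, 1: 2.40}
```

With these weights a misclassified positive tweet costs almost four times as much as a misclassified negative one, which pushes the network to pay more attention to the smaller class.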

I created this kernel when I knew much less about LSTMs and ML. It is a really basic, beginner-level kernel, yet it has had a huge audience in the past year. I received a lot of private questions and requests regarding this notebook, and I tried my best to help and answer them. In the future I am not planning to answer custom questions or to support or enhance this kernel in any way. Thank you, my folks :)

In [None]:
twt = ['']
# vectorize the tweet with the pre-fitted tokenizer instance
twt = tokenizer.texts_to_sequences(twt)
# pad the tweet to the same length as the training sequences fed to the Embedding layer
twt = pad_sequences(twt, maxlen=X.shape[1], dtype='int32', value=0)
print(twt)
sentiment = model.predict(twt,batch_size=1,verbose = 2)[0]
if(np.argmax(sentiment) == 0):
    print("negative")
elif (np.argmax(sentiment) == 1):
    print("positive")

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
positive
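
A new tweet should go through the same cleaning as the training data before it reaches the tokenizer, and the predicted probabilities have to be mapped back to a label. The sketch below shows both steps in plain Python. `clean_tweet` and `label_from_probs` are hypothetical helpers, the example tweet and probabilities are invented, and the label order assumes index 0 = negative and index 1 = positive.

```python
import re

LABELS = ["negative", "positive"]  # assumed order of the one-hot columns

def clean_tweet(text):
    """Lowercase the text and keep only letters, digits and whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

def label_from_probs(probs):
    """Return the label with the highest predicted probability."""
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best_index]

cleaned = clean_tweet("RT @User: Great DEBATE!!")
print(cleaned)                        # rt user great debate
print(label_from_probs([0.2, 0.8]))   # positive
```

Note that an empty tweet, as in the cell above, is encoded as all zeros, so the prediction for it reflects only what the network outputs for pure padding.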
