### **Today's Topic: "Use of LSTM in Sentiment Analysis"**


**Sentiment Analysis:** the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.

This notebook uses an LSTM to classify tweets as positive or negative.

**Steps:** 
1. Import the Modules and Load the Dataset 
2. Encode Sentiments (Drop Neutral)
3. Tokenize (Pad/Fit the texts)
4. Build & Compose the LSTM Model/Network
5. Split Dataset (Train & Test)
6. Train the Model
7. Predict

1. Import the Modules and Load the Dataset

In [None]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
from sklearn.utils import resample
from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix,classification_report
import re

Load the dataset, keeping only the necessary columns.

In [None]:
url = 'https://raw.githubusercontent.com/AmritkumarT/DATASETS/main/Sentiment.csv'
data = pd.read_csv(url)

# Keep only the necessary columns
data = data[['text','sentiment']]

Data preview

In [None]:
data.head()

Unnamed: 0,text,sentiment
0,RT @NancyLeeGrahn: How did everyone feel about...,Neutral
1,RT @ScottWalker: Didn't catch the full #GOPdeb...,Positive
2,RT @TJMShow: No mention of Tamir Rice and the ...,Neutral
3,RT @RobGeorge: That Carly Fiorina is trending ...,Positive
4,RT @DanScavino: #GOPDebate w/ @realDonaldTrump...,Positive


2. Encode Sentiments

I drop the 'Neutral' tweets because the goal is only to distinguish positive from negative tweets. The text is also lowercased and stripped of special characters.

In [None]:
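data = data[data.sentiment != "Neutral"].copy()  # copy to avoid SettingWithCopyWarning
data['text'] = data['text'].str.lower()
# Remove special characters. Note: the range A-z (rather than A-Z) also matches [ \ ] ^ _ `,
# which is why underscores survive in the output below.
data['text'] = data['text'].apply(lambda x: re.sub(r'[^a-zA-z0-9\s]', '', x))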

data.head()

Unnamed: 0,text,sentiment
1,rt scottwalker didnt catch the full gopdebate ...,Positive
3,rt robgeorge that carly fiorina is trending h...,Positive
4,rt danscavino gopdebate w realdonaldtrump deli...,Positive
5,rt gregabbott_tx tedcruz on my first day i wil...,Positive
6,rt warriorwoman91 i liked her and was happy wh...,Negative
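
To see what the cleaning step does to a single tweet, here is a small sketch on a made-up string (the sample tweet is an assumption, not a row from the dataset). It also shows that the character class `A-z` lets underscores through, whereas `A-Z` would remove them.

```python
import re

# Hypothetical sample tweet, for illustration only
sample = "RT @Greg_Abbott: It's #GOPDebate!"

lowered = sample.lower()

# Character class as used in the notebook: A-z also covers [ \ ] ^ _ `
kept_underscore = re.sub(r'[^a-zA-z0-9\s]', '', lowered)

# Stricter variant: only letters, digits and whitespace survive
strict = re.sub(r'[^a-zA-Z0-9\s]', '', lowered)

print(kept_underscore)  # rt greg_abbott its gopdebate
print(strict)           # rt gregabbott its gopdebate
```

Whether the stricter class is preferable depends on whether handles such as `gregabbott_tx` should stay intact.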


In [None]:
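# Note: .size counts cells (rows x columns); with two columns this is twice the number of rows.
print(data[ data['sentiment'] == 'Positive'].size)
print(data[ data['sentiment'] == 'Negative'].size)

# Strip the retweet marker. A vectorized replace is used because writing to rows yielded
# by iterrows() is not guaranteed to modify the DataFrame.
# Note that this removes every 'rt' substring, including those inside words.
data['text'] = data['text'].str.replace('rt', '', regex=False)
data.head()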

4472
16986


Unnamed: 0,text,sentiment
1,scottwalker didnt catch the full gopdebate la...,Positive
3,robgeorge that carly fiorina is trending hou...,Positive
4,danscavino gopdebate w realdonaldtrump delive...,Positive
5,gregabbott_tx tedcruz on my first day i will ...,Positive
6,warriorwoman91 i liked her and was happy when...,Negative
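
The two numbers printed above come from `.size`, which counts cells rather than rows. A short arithmetic sketch, assuming the DataFrame has exactly the two columns `text` and `sentiment` at this point, recovers the number of tweets per class:

```python
n_columns = 2  # 'text' and 'sentiment'

positive_cells, negative_cells = 4472, 16986  # values printed by .size

positive_rows = positive_cells // n_columns
negative_rows = negative_cells // n_columns
total_rows = positive_rows + negative_rows
positive_share = positive_rows / total_rows

print(positive_rows, negative_rows, total_rows)  # 2236 8493 10729
print(round(positive_share, 2))                  # 0.21
```

The total of 10729 rows agrees with the train and test shapes printed later (8583 + 2146), and only about a fifth of the tweets are positive.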


3. Tokenization (Pad/Fit Texts)

In [None]:
    
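# Vocabulary limit: Keras keeps only the most frequent words (indices below num_words)
max_fatures = 2000
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(data['text'].values)
# Replace each tweet by its list of word indices
X = tokenizer.texts_to_sequences(data['text'].values)
# Left-pad with zeros so every row has the length of the longest tweet
X = pad_sequences(X)
X[:2]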

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
         359,  120,    1,  692,    2,   39,   58,  234,   37,  207,    6,
         172, 1745,   12, 1308, 1394,  733],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          16,  281,  249,    5,  809,  102,  170,   26,  134,    6,    1,
         171,   12,    2,  231,  713,   17]], dtype=int32)
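
To make the output above easier to read, here is a simplified plain-Python sketch of what the three steps do. It is not the Keras implementation; the helper names are hypothetical, and it only mirrors the default behaviour as I understand it: words are ranked by frequency starting at index 1, only indices below `num_words` are kept, and zeros are padded at the front. Keras's own lowercasing and punctuation filtering are left out.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping words whose index is >= num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

def pad_pre(sequences):
    """Left-pad every sequence with zeros to the length of the longest one."""
    maxlen = max(len(seq) for seq in sequences)
    return [[0] * (maxlen - len(seq)) + seq for seq in sequences]

# Hypothetical toy corpus
texts = ["the debate was great", "the debate was not great at all"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=6)
padded = pad_pre(sequences)
print(padded)  # [[0, 1, 2, 3, 4], [1, 2, 3, 5, 4]]
```

In the toy corpus the rare words "at" and "all" fall outside the vocabulary limit and are dropped, which is also what happens to rare words in the tweets when the limit is 2000.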

4. Build the LSTM Model

Next, I compose the LSTM network.
The output layer uses softmax because the network is trained with categorical cross-entropy on one-hot labels: softmax turns the two output scores into class probabilities that sum to 1, which is what this loss expects.
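
As a rough illustration of how softmax and categorical cross-entropy fit together, here is a worked example in plain Python. The logits are made-up numbers, and the functions are a minimal sketch rather than what Keras runs internally.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]  # subtract max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

logits = [2.0, 0.0]  # hypothetical raw outputs of the final Dense layer
probs = softmax(logits)
loss = categorical_crossentropy([1, 0], probs)  # true class is index 0

print([round(p, 4) for p in probs])  # [0.8808, 0.1192]
print(round(loss, 4))                # 0.1269
```

The more probability the model puts on the true class, the closer the loss gets to zero.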

In [None]:
embed_dim = 128
lstm_out = 196

model = Sequential()
model.add(Embedding(max_fatures, embed_dim,input_length = X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 28, 128)           256000    
                                                                 
 spatial_dropout1d_2 (Spatia  (None, 28, 128)          0         
 lDropout1D)                                                     
                                                                 
 lstm_2 (LSTM)               (None, 196)               254800    
                                                                 
 dense_2 (Dense)             (None, 2)                 394       
                                                                 
Total params: 511,194
Trainable params: 511,194
Non-trainable params: 0
_________________________________________________________________
None
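
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout with four gates, each having input weights, recurrent weights and a bias; the variable names are mine.

```python
vocab_size = 2000   # max_fatures
embed_dim = 128
lstm_units = 196    # lstm_out
n_classes = 2

# One embed_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Weight matrix plus one bias per class
dense_params = lstm_units * n_classes + n_classes

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)  # 256000 254800 394 511194
```

Roughly half of the parameters sit in the embedding table and half in the LSTM.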


5. Split Dataset: Here I one-hot encode the labels and create the train and test sets (80/20 split).

In [None]:
Y = pd.get_dummies(data['sentiment']).values
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(8583, 28) (8583, 2)
(2146, 28) (2146, 2)
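
The following plain-Python sketch shows the two things this step does. It assumes that the one-hot columns are ordered alphabetically (so column 0 is Negative and column 1 is Positive, as in the final report) and that the test size is rounded up, which matches the shapes printed above. The label list is a made-up sample.

```python
import math

labels = ["Positive", "Negative", "Positive", "Negative"]  # hypothetical sample

# Alphabetical class order: index 0 = Negative, index 1 = Positive
classes = sorted(set(labels))
one_hot = [[int(label == c) for c in classes] for label in labels]
print(classes)  # ['Negative', 'Positive']
print(one_hot)  # [[0, 1], [1, 0], [0, 1], [1, 0]]

# Split sizes for the full dataset with test_size = 0.20
n_samples = 8583 + 2146
n_test = math.ceil(0.20 * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 8583 2146
```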


6. Train the Model:

Here we train the network. Ideally we would run many more epochs, but training on Kaggle takes too long, so it is 2 for now.

In [None]:
batch_size = 128
model.fit(X_train, Y_train, epochs = 2, batch_size=batch_size, verbose = 1)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f5500b92fd0>

7. Prediction: Predicting on the test set and measuring precision, recall and accuracy.

In [None]:
# model.predict returns one row of class probabilities per sample;
# argmax over the last axis picks the predicted class index (0 = Negative, 1 = Positive).
Y_pred = np.argmax(model.predict(X_test), axis=-1)
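
The `axis=-1` argument matters here. The small sketch below, using made-up probabilities, contrasts the row-wise argmax with the call without an axis, which returns a single index into the flattened array.

```python
import numpy as np

# Hypothetical softmax outputs for three tweets: columns are [Negative, Positive]
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])

per_row = np.argmax(probs, axis=-1)  # one class index per tweet
flat = np.argmax(probs)              # single index into the flattened array

print(per_row)  # [0 1 0]
print(flat)     # 0
```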

In [None]:
df_test = pd.DataFrame({'true': Y_test.tolist(), 'pred':Y_pred})
df_test['true'] = df_test['true'].apply(lambda x: np.argmax(x))
print("confusion matrix",confusion_matrix(df_test.true, df_test.pred))
print(classification_report(df_test.true, df_test.pred))

confusion matrix [[1647   66]
 [ 254  179]]
              precision    recall  f1-score   support

           0       0.87      0.96      0.91      1713
           1       0.73      0.41      0.53       433

    accuracy                           0.85      2146
   macro avg       0.80      0.69      0.72      2146
weighted avg       0.84      0.85      0.83      2146



Finally, measuring the number of correct guesses: the network finds negative tweets (**class 0**) very well (**recall 0.96**), but it is much weaker at recognizing positive tweets (**class 1**, **recall 0.41**). My educated guess is that this happens because the positive training set is dramatically smaller than the negative one; the test set alone contains 1713 negative but only 433 positive tweets.
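
The figures in the classification report can be recomputed directly from the confusion matrix. This is a minimal sketch in plain Python, assuming rows are true classes and columns are predicted classes; the helper functions are mine.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions
cm = [[1647, 66],
      [254, 179]]

def recall(cm, k):
    """Share of true class-k samples that were predicted as k."""
    return cm[k][k] / sum(cm[k])

def precision(cm, k):
    """Share of class-k predictions that were actually class k."""
    return cm[k][k] / sum(row[k] for row in cm)

accuracy = (cm[0][0] + cm[1][1]) / sum(sum(row) for row in cm)

for k in (0, 1):
    print(k, round(precision(cm, k), 2), round(recall(cm, k), 2))
print(round(accuracy, 2))
```

For class 1 this gives 179 / 433, a recall of about 0.41: more than half of the positive tweets are labelled negative.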