## Step 0 : Importing necessary libraries 

In [1]:
import numpy as np
import pandas as pd
import os 
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE 

from keras.optimizers import *
from keras.callbacks import *
from keras.models import Model, Sequential
from keras.layers import Dense, Embedding, Input, LSTM, Dropout, Bidirectional
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras.backend as K

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import *
from sklearn.metrics import *

Using TensorFlow backend.


## Step 1:- Reading the dataset 
<div class="alert alert-block alert-success">
Here we read the dataset from its source, a comma-separated values (CSV) file. <br>
    We extract only the two columns we need: <b>text</b>, which holds the tweets, and <b>sentiment</b>, which holds the label. <br>
    We also keep only the positive and negative sentiments, because neutral tweets carry no polarity for a binary classifier to learn.
</div>

In [2]:
file_path = r"D:\kaggle_trials\first-gop-debate-twitter-sentiment\Sentiment.csv"  # raw string, so single backslashes suffice
df        = pd.read_csv(file_path)
df        = df[['sentiment','text']]
df        = df[df['sentiment']!='Neutral']
print('The extracted dataset looks like this')
df.head(7)

The extracted dataset looks like this


| | sentiment | text |
|---|---|---|
| 1 | Positive | RT @ScottWalker: Didn't catch the full #GOPdeb... |
| 3 | Positive | RT @RobGeorge: That Carly Fiorina is trending ... |
| 4 | Positive | RT @DanScavino: #GOPDebate w/ @realDonaldTrump... |
| 5 | Positive | RT @GregAbbott_TX: @TedCruz: "On my first day ... |
| 6 | Negative | RT @warriorwoman91: I liked her and was happy ... |
| 8 | Negative | Deer in the headlights RT @lizzwinstead: Ben C... |
| 9 | Negative | RT @NancyOsborne180: Last night's debate prove... |


In [3]:
print('The number of positive sentiments are :- ',len(df[df['sentiment']=='Positive']))
print('The number of negative sentiments are :- ',len(df[df['sentiment']=='Negative']))

The number of positive sentiments are :-  2236
The number of negative sentiments are :-  8493


In [4]:
print('We clearly have an imbalanced dataset with more negative sentiments than positive ones. '
      'We will use SMOTE later to get balanced data for the train-test split.')

We clearly have an imbalanced dataset with more negative sentiments than positive ones. We will use SMOTE later to get balanced data for the train-test split.


## Step 2: Feature extraction 
<div class="alert alert-block alert-success">
<b>Tokenization:</b> We vectorize each entry of the text column, converting it to a sequence of integers. Because the tokenizer is created with <code>filters=''</code>, no characters are stripped, so punctuation stays attached to the tokens. We select a maximum vocabulary size of 4000.<br>
<b>Padding: </b> Since the sentences have unequal lengths, we pad the vector representation of the shorter ones with zeros at the end until every row has the same length. <br>
    <b> Role of vocabulary size :</b> The tokenizer assigns an integer to every word, ranked by frequency, and the maximum vocabulary size limits how many of those words are kept. Words whose integer is not below the maximum vocabulary size are dropped from the sequence, so every integer in the vector representation is below the maximum vocabulary size.
</div>
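The interaction between the vocabulary limit and padding can be sketched in plain Python. This is only an illustration of the idea, not the Keras implementation: the helper names `fit_word_index` and `texts_to_padded` are hypothetical, the two example sentences are made up, and ties in word frequency are broken alphabetically here as an assumption.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; ties are broken alphabetically (an assumption).
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def texts_to_padded(texts, word_index, num_words):
    """Keep only indices below num_words, then right-pad with zeros."""
    seqs = [[word_index[w] for w in t.lower().split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]
    width = max(len(s) for s in seqs)
    return [s + [0] * (width - len(s)) for s in seqs]

texts = ["the debate was great",
         "the debate was long and the crowd was loud"]
word_index = fit_word_index(texts)
padded = texts_to_padded(texts, word_index, num_words=4)
print(padded)  # [[1, 3, 2, 0, 0], [1, 3, 2, 1, 2]]
```

Only the three most frequent words survive with `num_words=4`; the rarer words vanish from the sequences instead of becoming zeros, and zeros appear only as padding at the end of the shorter row.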

In [5]:
MAX_VOCAB_SIZE = 4000
tokenizer      = Tokenizer(num_words=MAX_VOCAB_SIZE, filters='')
tokenizer.fit_on_texts(df['text'].values)
X              = tokenizer.texts_to_sequences(df['text'].values)
X              = pad_sequences(X,padding = 'post')

In [6]:
print('The shape of X matrix after preprocessing becomes:- ',X.shape)
print('The fourth tweet was                             :- ',df.iloc[3]['text'])
print('\n')
print('This tweet is tokenized (an integer value is assigned to each word of the statement) to a vector '
      'which is the fourth row of the matrix X.')
print('\n')
print('The fourth row of X matrix is given by           :- ',X[3])
print('\n')
print('Note how the zeros are added towards the end as a result of padding')
print('\n')

The shape of X matrix after preprocessing becomes:-  (10729, 29)
The fourth tweet was                             :-  RT @GregAbbott_TX: @TedCruz: "On my first day I will rescind every illegal executive action taken by Barack Obama." #GOPDebate @FoxNews


This tweet is tokenized (an integer value is assigned to each word of the statement) to a vector which is the fourth row of the matrix X.


The fourth row of X matrix is given by           :-  [   2 1521 3089   47  256  507   10   73 3759  284  469 1319 1832 1127
   56 2605    3   76    0    0    0    0    0    0    0    0    0    0
    0]


Note how the zeros are added towards the end as a result of padding




In [7]:
print('We can see what integer value corresponds to which word in a sentence. This is attained by the following expression')
word2idx   =  tokenizer.word_index
print(word2idx)

We can see what integer value corresponds to which word in a sentence. This is attained by the following expression


<div class="alert alert-block alert-success">
  In the statement - <b>  RT @GregAbbott_TX: @TedCruz: "On my first day I will rescind every illegal executive action taken by Barack Obama." #GOPDebate @FoxNews </b>, <br>
    the token <b> Obama." </b> does not appear in the vector at all, because the integer assigned to it is 8255, which is greater than 4000; such out-of-vocabulary tokens are dropped rather than mapped to zero. Check below
</div>

In [8]:
word2idx['obama."']

8255

## Step 3 : Loading pretrained vectors for word embeddings.
We will use the pretrained word vectors for creating our embedding matrix. The pretrained word embeddings are downloaded from https://nlp.stanford.edu/projects/glove/


In [9]:
path = os.path.join(os.getcwd(), 'glove.6B', 'glove.6B.100d.txt')
print('Loading word vectors...')
word2vec = {}
with open(path,encoding='utf8') as f:
    for line in f:
        values = line.split()
        word   = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        word2vec[word] = vec
print('Found %s word vectors.' % len(word2vec))


Loading word vectors...
Found 400000 word vectors.


In [10]:
print('Let us see the vector representation of a particular word from the example quoted above.')
print('The vector representation of the word "rescind" is',word2vec['rescind'])
print('The length of the vector representation corresponding to "rescind" is :-',len(word2vec['rescind']))

Let us see the vector representation of a particular word from the example quoted above.
The vector representation of the word "rescind" is [ 1.0736    -0.77327   -0.20854    0.054075  -0.63376    0.064913
 -0.61847    1.0656     0.50622    0.59997   -0.54504    0.031256
 -0.1484    -0.40969   -0.23884    0.36642    0.036637  -0.16631
 -0.32595    0.049568  -0.43365   -0.49295   -0.14352    0.044648
  0.30151    0.54691   -0.49973   -0.52283    0.37499   -0.50527
  0.61562    0.21215   -0.49117    0.065761  -0.83696    0.86366
  0.4291    -0.73684   -1.369      0.36995   -0.23235    0.39077
  0.46441   -0.45154   -0.21207   -0.070333  -0.087881   0.13571
 -0.20956   -0.58017    0.0058648  0.40705    0.093736   0.42991
 -0.69914    0.25706    1.1666    -0.36646    0.53836   -0.46289
  0.23091   -0.37328    0.28271   -0.21854    0.43297   -0.3562
 -0.51997    1.0802    -0.89746   -0.43188   -0.1046    -0.31163
 -0.41757   -0.59158   -0.2912    -0.056102  -0.53707    0.12951
 -0.22111    

## Step 4 :- Creating the embedding matrix.
<div class="alert alert-block alert-success">
<b>Step 1:- </b> Create a zero matrix of size (MAX_VOCAB_SIZE, 100), i.e. 4000 rows and 100 columns in our case. Each row represents one word of the vocabulary, and the 100 columns of that row hold the word's pretrained vector downloaded above. <br>
<b> Step 2:- </b> If a word is not present in the pretrained word embeddings, its row is left as zeros.
</div>
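The two steps above can be sketched on a toy vocabulary. The word indices and 3-dimensional vectors below are invented for illustration, and `build_embedding_matrix` is a hypothetical helper that uses plain lists in place of a NumPy array.

```python
def build_embedding_matrix(word_index, vectors, num_words, dim):
    """Row i holds the pretrained vector of the word with index i, or zeros."""
    matrix = [[0.0] * dim for _ in range(num_words)]
    for word, i in word_index.items():
        # Words ranked at or beyond num_words are outside the vocabulary.
        if i < num_words and word in vectors:
            matrix[i] = list(vectors[word])
    return matrix

# Hypothetical tokenizer output and pretrained vectors.
toy_index = {"the": 1, "debate": 2, "gopdebate": 3, "rescind": 4}
toy_vectors = {
    "the": [0.1, 0.2, 0.3],
    "debate": [0.4, 0.5, 0.6],
    "rescind": [0.7, 0.8, 0.9],
}

matrix = build_embedding_matrix(toy_index, toy_vectors, num_words=4, dim=3)
for row in matrix:
    print(row)
```

Row 0 stays zero because index 0 is reserved for padding, row 3 stays zero because "gopdebate" has no pretrained vector, and "rescind" is skipped because its index is not below `num_words`.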

In [11]:
embedding_matrix = np.zeros((MAX_VOCAB_SIZE, 100))
for word, i in word2idx.items():
    if i < MAX_VOCAB_SIZE:
        embedding_vector = word2vec.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector


In [12]:
embedding_matrix.shape

(4000, 100)

## Step 5 :- Creating the model to perform the classification task 
<div class="alert alert-block alert-success">
<b>Bidirectional LSTM (2 layers)+Dense Layer:</b> We use 2 bidirectional LSTM layers followed by one dense layer. The dropout rate of 0.2 was chosen arbitrarily (and it works well in this case). <br>
    <b>  Loss: </b> We chose categorical cross entropy. With only two classes, binary cross entropy would serve the same purpose. <br>
    <b> Optimizer: </b> We chose the Adam optimizer with a learning rate of 0.001. This can be tuned later.
</div>
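Why the two losses serve the same purpose for two classes can be checked numerically. The sketch below is plain Python rather than Keras code; the logits are hypothetical, and it assumes the binary model outputs the same positive-class probability as the softmax does.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Loss for a one-hot target against class probabilities."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def binary_crossentropy(label, p_positive):
    """Loss for a 0/1 target against the positive-class probability."""
    return -(label * math.log(p_positive)
             + (1 - label) * math.log(1 - p_positive))

probs = softmax([0.3, 1.2])  # hypothetical logits for [Negative, Positive]
cce = categorical_crossentropy([0, 1], probs)
bce = binary_crossentropy(1, probs[1])
print(round(cce, 6), round(bce, 6))
```

Both values reduce to the negative log of the probability assigned to the true class, which is why either loss works for a two-class problem under that assumption.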

In [13]:

model = Sequential()
model.add(Embedding(MAX_VOCAB_SIZE, 100,input_length = X.shape[1],weights=[embedding_matrix],trainable=False))
model.add(Bidirectional(LSTM(220,return_sequences=True,activation='tanh')))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(200,return_sequences=False,activation='tanh')))
model.add(Dropout(0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer=Adam(lr = 0.001),metrics = ['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 29, 100)           400000    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 29, 440)           564960    
_________________________________________________________________
dropout_1 (Dropout)          (None, 29, 440)           0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, 400)               1025600   
_________________________________________________________________
dropout_2 (Dropout)          (None, 400)               0         
_________________________________________________________________
dense_1 (Dense)      
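The parameter counts shown in the summary for the embedding and the two bidirectional layers can be reproduced by hand. This is a small sketch assuming the standard LSTM layout of four gates, each with input weights, recurrent weights and a bias; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    """A bidirectional wrapper holds one forward and one backward LSTM."""
    return 2 * lstm_params(input_dim, units)

embedding = 4000 * 100                        # vocabulary size x vector length
first = bidirectional_lstm_params(100, 220)   # input: 100-d embeddings
second = bidirectional_lstm_params(440, 200)  # input: 2 x 220 concatenated outputs
print(embedding, first, second)  # 400000 564960 1025600
```

The second layer receives 440 features because the forward and backward outputs of the first layer are concatenated.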

## Step 6 : Callback Creation
We create the callbacks that are used while training the model. Our priority is reducing the categorical cross entropy loss, so every callback monitors the training loss (monitor='loss').

In [14]:
reduce_lr  = ReduceLROnPlateau(monitor='loss', factor=0.02,verbose=1,
                              patience=5, min_lr=0.0001)
es         = EarlyStopping(monitor='loss', patience=5, verbose=1, mode='auto', baseline=None, 
                          restore_best_weights=True)
filepath   = os.getcwd()+'\\chkpts\\'+"weights-improvement-{epoch:02d}-{loss:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='auto')
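
The effect of the chosen `factor`, `patience` and `min_lr` values can be traced with a simplified simulation. This is a sketch only: `reduce_on_plateau` is a hypothetical helper, the loss values are invented, and the real Keras callback additionally has `min_delta` and `cooldown` settings that are ignored here.

```python
def reduce_on_plateau(losses, lr, factor, patience, min_lr):
    """Return the learning rate in effect after each epoch (simplified)."""
    best = float("inf")
    wait = 0
    history = []
    for loss in losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                # Shrink the rate, but never go below the floor.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Hypothetical training losses that stall after the second epoch.
losses = [0.50, 0.40, 0.41, 0.42, 0.41, 0.43, 0.40]
history = reduce_on_plateau(losses, lr=0.001, factor=0.02,
                            patience=5, min_lr=0.0001)
print(history)
```

With a starting rate of 0.001 and a factor of 0.02, the first reduction would give 0.00002, so the rate is clamped to the `min_lr` floor of 0.0001 right away and cannot be reduced further.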

## Step 7 : Train-test split 
<div class="alert alert-block alert-success">
    We perform the train test split of the data in following steps:- <br>
<b>Step 1</b> One Hot encode the Labels. <br>
    <b>Step 2</b> Apply SMOTE to get a balanced dataset and one hot encode the labels again. <br>
    <b>Step 3</b> Perform train and test split <br>
</div>
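The core idea behind SMOTE is to create a synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step; it is not the imbalanced-learn implementation, `synthesize` is a hypothetical helper, the two feature vectors are made up, and the neighbour is supplied directly instead of being found by a nearest-neighbour search.

```python
import random

def synthesize(sample, neighbor, rng):
    """Create one synthetic point on the segment between two minority samples."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(sample, neighbor)]

rng = random.Random(42)        # fixed seed for repeatability
minority_a = [2.0, 10.0, 0.0]  # hypothetical minority-class rows
minority_b = [4.0, 6.0, 0.0]
new_point = synthesize(minority_a, minority_b, rng)
print(new_point)
```

Every coordinate of the new point lies between the corresponding coordinates of the two originals, so repeating this step grows the minority class until it matches the majority class in size.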

In [15]:
# To proceed without handling the imbalanced data, just uncomment the code below and move ahead
# Y = pd.get_dummies(df['sentiment']).values
# X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.4, random_state = 19)
# print(X_train.shape,Y_train.shape)
# print(X_test.shape,Y_test.shape)

In [16]:
Y            = pd.get_dummies(df['sentiment']).values
sm           = SMOTE(random_state=42)
X_res, Y_res = sm.fit_resample(X, Y)
Y_after      = []
for i in range(len(Y_res)):
    Y_after.append([[0,1] if Y_res[i][0]==1 else [1,0]][0])
Y_after      = np.array(Y_after)

In [17]:
print('Positive values before Oversampling is ', sum(Y == [[0,1]])[0])
print('Negative values before Oversampling is ', sum(Y == [[1,0]])[0])
print('\n')
print('Positive values after Oversampling is ', sum(Y_after == [[0,1]])[0])
print('Negative values after Oversampling is ', sum(Y_after == [[1,0]])[0])
print('\n')

Positive values before Oversampling is  2236
Negative values before Oversampling is  8493


Positive values after Oversampling is  8493
Negative values after Oversampling is  8493




In [18]:
X_train, X_test, Y_train, Y_test = train_test_split(X_res,Y_after, test_size = 0.33, random_state = 42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
print('Positive values after Oversampling in test data is ', sum(Y_test == [[0,1]])[0])
print('Negative values after Oversampling in test data is ', sum(Y_test == [[1,0]])[0])
print('\n')
print('Positive values after Oversampling in train data is ', sum(Y_train == [[0,1]])[0])
print('Negative values after Oversampling in train data is ', sum(Y_train == [[1,0]])[0])
print('\n')
print('We can clearly see that the negative and positive sentiments are balanced in both the datasets.')

(11380, 29) (11380, 2)
(5606, 29) (5606, 2)
Positive values after Oversampling in test data is  2793
Negative values after Oversampling in test data is  2813


Positive values after Oversampling in train data is  5700
Negative values after Oversampling in train data is  5680


We can clearly see that the negative and positive sentiments are balanced in both the datasets.


## Step 8 : Fitting the model

In [19]:
epochs   = 25
batch_sz = 24
model.fit(X_train, Y_train, 
          epochs = epochs, 
          batch_size = batch_sz,
         callbacks        = [reduce_lr,es,checkpoint])

Epoch 1/25

Epoch 00001: loss improved from inf to 0.53712, saving model to C:\Users\Batfleck\APB_DL_EXERCISES\Bidirectional LSTM\chkpts\weights-improvement-01-0.54.hdf5
Epoch 2/25

Epoch 00002: loss improved from 0.53712 to 0.43776, saving model to C:\Users\Batfleck\APB_DL_EXERCISES\Bidirectional LSTM\chkpts\weights-improvement-02-0.44.hdf5
Epoch 3/25

Epoch 00003: loss improved from 0.43776 to 0.39497, saving model to C:\Users\Batfleck\APB_DL_EXERCISES\Bidirectional LSTM\chkpts\weights-improvement-03-0.39.hdf5
Epoch 4/25

Epoch 00004: loss improved from 0.39497 to 0.36111, saving model to C:\Users\Batfleck\APB_DL_EXERCISES\Bidirectional LSTM\chkpts\weights-improvement-04-0.36.hdf5
Epoch 5/25

Epoch 00005: loss improved from 0.36111 to 0.32690, saving model to C:\Users\Batfleck\APB_DL_EXERCISES\Bidirectional LSTM\chkpts\weights-improvement-05-0.33.hdf5
Epoch 6/25

Epoch 00006: loss improved from 0.32690 to 0.29067, saving model to C:\Use

<keras.callbacks.History at 0x1628c360e10>

## Step 9 : Validating on test set 
We will now evaluate the loss and accuracy on the test set 

In [20]:
score,acc = model.evaluate(X_test, Y_test, batch_size = batch_sz)
print("score: %.2f" % (score))
print("acc: %.2f" % (acc))

score: 0.77
acc: 0.83


## Step 10 : Model performance 
We will compute the precision, recall and confusion matrix for the predicted and target values
<div class="alert alert-block alert-success">
<b>Precision = tp/(tp+fp):- </b>   The fraction of examples predicted as positive (tp+fp) that are actually positive. <br>
<b>Recall    = tp/(tp+fn):- </b>  The fraction of actually positive examples (tp+fn) that are correctly predicted as positive.
    

</div>
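A quick worked example of the two formulas, using hypothetical counts that are not taken from this model:

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn)

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
tp, fp, fn = 8, 2, 4
p = precision(tp, fp)     # 8 / 10
r = recall(tp, fn)        # 8 / 12
f1 = 2 * p * r / (p + r)  # harmonic mean of the two
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```

The f1-score reported by scikit-learn is this harmonic mean, so it is high only when precision and recall are both high.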

In [21]:
Y_pred = model.predict_classes(X_test,batch_size = batch_sz)
print('Predicted values attained!')

Predicted values attained!


In [22]:
print('We one hot encoded the actual values, so we now convert them back to binary labels')
Y_actual = []
for i in range(len(Y_test)):
    if Y_test[i][0] ==1:
        Y_actual.append(0)
    else:
        Y_actual.append(1)

We one hot encoded the actual values, so we now convert them back to binary labels


In [23]:
print('The classification report is given below:-')
print('\n')
# scikit-learn expects the true labels first, then the predictions
print(classification_report(Y_actual,Y_pred))

The classification report is given below:-


              precision    recall  f1-score   support

           0       0.82      0.84      0.83      2813
           1       0.84      0.82      0.83      2793

    accuracy                           0.83      5606
   macro avg       0.83      0.83      0.83      5606
weighted avg       0.83      0.83      0.83      5606



In [24]:
print('The required confusion matrix is given as:- \n')
# with the true labels first, rows are actual classes and columns are predicted classes
confusion_matrix(Y_actual,Y_pred)

The required confusion matrix is given as:- 



array([[2366,  447],
       [ 508, 2285]], dtype=int64)