# Recurrent Neural Networks for sentiment classification

In this notebook we extend our analysis from the *tweets_final.ipynb* notebook and use Keras' implementation of recurrent neural networks to improve our classification.

We will use pretrained GloVe embeddings of both words and emojis. Emoji embeddings are taken from https://github.com/bradleypallen/keras-emoji-embeddings. The main idea behind emoji embeddings is described in the paper https://arxiv.org/pdf/1609.08359.pdf. However, the embeddings we use here were obtained in a modified fashion, as explained in that repository.

Let's import the utility functions kept in a separate script.

In [106]:
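# The star import also has to provide pd, os and the Keras classes used below,
# since this is the only import in the notebook.
from rnn_utils import *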

Let's read the raw data, extract the raw tweets and map their sentiment labels to integers.

In [107]:
raw_data = pd.read_csv('tweets.csv')
raw_tweets = raw_data['text']
raw_sentiment = raw_data['airline_sentiment']
y = raw_sentiment.map({'negative': 0, 'positive': 1, 'neutral' : 2}).values
X = raw_tweets.values
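
The networks below end in a 3-unit softmax layer trained with categorical cross-entropy, so the integer labels have to be one-hot encoded at some point. The source of `prepare_flags_for_keras` is not shown here; the following is only a minimal sketch of what such a step could look like, with `one_hot` and the toy data being hypothetical names introduced for illustration.

```python
# Hypothetical stand-in for the label preparation step: integer ids -> one-hot rows.
LABEL_TO_ID = {'negative': 0, 'positive': 1, 'neutral': 2}


def one_hot(label_ids, num_classes):
    """Return one row per label, with a single 1 at the label's index."""
    rows = []
    for label_id in label_ids:
        row = [0] * num_classes
        row[label_id] = 1
        rows.append(row)
    return rows


# Toy data, not taken from tweets.csv.
sentiments = ['negative', 'neutral', 'positive']
label_ids = [LABEL_TO_ID[s] for s in sentiments]
encoded = one_hot(label_ids, num_classes=len(LABEL_TO_ID))
print(encoded)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```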

In [108]:
# DEFINE GLOBAL VARS

EMBEDDING_DIM = 300
BATCH_SIZE = 32
NUM_WORDS = 5000

### Word embeddings only (no emojis)

We begin by training a number of neural networks using word embeddings only (i.e. removing emojis altogether).
First, let's do some preprocessing using our imported utility functions. We convert the tweets to a format accepted by Keras and prepare our embedding matrix.

In [132]:
X_preprocessed = preprocess(X)
tokenizer = Tokenizer(num_words=NUM_WORDS)
X_preprocessed = prepare_text_for_keras(X_preprocessed,tokenizer)
y_preprocessed = prepare_flags_for_keras(y)

X_train, X_val, X_test, y_train, y_val, y_test = split_table(X_preprocessed,y_preprocessed)

embeddings_index = read_embedding('glove/glove.6B.300d.txt')
embedding_matrix = create_embedding_matrix(tokenizer, EMBEDDING_DIM, embeddings_index)
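
The body of `create_embedding_matrix` lives in `rnn_utils` and is not reproduced here. As a rough illustration of the idea, the sketch below builds a matrix with one row per token index and leaves unknown words (and the padding index 0) as zero vectors. The function name `build_embedding_matrix`, its signature and the toy vocabulary are assumptions made for this example, not the notebook's actual helper.

```python
import numpy as np


def build_embedding_matrix(word_index, embeddings_index, num_words, dim):
    """Hypothetical sketch: row i holds the pretrained vector of the word with index i.

    Row 0 (padding) and words missing from embeddings_index stay all-zero.
    """
    matrix = np.zeros((num_words + 1, dim))
    for word, idx in word_index.items():
        if idx > num_words:
            continue  # outside the vocabulary cap
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix


# Toy data: 2-dimensional vectors instead of 300-dimensional GloVe ones.
toy_word_index = {'flight': 1, 'late': 2, 'zzzz': 3, 'rare': 4}
toy_embeddings = {
    'flight': np.array([0.1, 0.2]),
    'late': np.array([0.3, 0.4]),
    'rare': np.array([0.5, 0.6]),
}
toy_matrix = build_embedding_matrix(toy_word_index, toy_embeddings, num_words=3, dim=2)
print(toy_matrix.shape)  # (4, 2)
```

With `NUM_WORDS = 5000` and `EMBEDDING_DIM = 300`, a matrix built this way would have 5001 rows, which matches the `NUM_WORDS + 1` input dimension passed to the `Embedding` layers below.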

We now train three neural networks:
* simple RNN
* simple LSTM
* conv1D + LSTM + dense hidden layer with dropout

In [94]:
# 1 Simple RNN

early_stopping = EarlyStopping(patience=5, monitor='val_loss')
take_best_model = ModelCheckpoint("weights.h5py", save_best_only=True)

model1 = Sequential()

model1.add(Embedding(NUM_WORDS + 1,EMBEDDING_DIM,
                    weights=[embedding_matrix],
                    input_length=X_train.shape[1],
                    trainable=True))

model1.add(SimpleRNN(100))

model1.add(Dense(3, activation='softmax'))

model1.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

model1.summary()

model1.fit(X_train,y_train, epochs=100, 
          batch_size=BATCH_SIZE, 
          callbacks = [early_stopping, take_best_model], 
          validation_split = 0.25)

model1.load_weights("weights.h5py")
os.remove("weights.h5py")

model1.evaluate(X_test, y_test)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_11 (Embedding)     (None, 35, 300)           1500300   
_________________________________________________________________
simple_rnn_4 (SimpleRNN)     (None, 100)               40100     
_________________________________________________________________
dense_16 (Dense)             (None, 3)                 303       
Total params: 1,540,703.0
Trainable params: 1,540,703
Non-trainable params: 0.0
_________________________________________________________________
Train on 8784 samples, validate on 2928 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100


[0.48126643228400601, 0.82786885278472488]
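
The parameter counts in the summary above can be checked by hand. The short sketch below uses the standard formulas for embedding, simple recurrent, LSTM and dense layers; the helper names are made up for this example. The LSTM formula is the simple RNN one times four (one weight set per gate), which gives the 160,400 parameters reported for the LSTM layer later in the notebook.

```python
def embedding_params(vocab_rows, dim):
    # One dim-sized vector per vocabulary row.
    return vocab_rows * dim


def simple_rnn_params(input_dim, units):
    # Input kernel + recurrent kernel + bias.
    return units * (input_dim + units + 1)


def lstm_params(input_dim, units):
    # Four gates, each with the same weight shapes as a simple RNN cell.
    return 4 * simple_rnn_params(input_dim, units)


def dense_params(input_dim, units):
    # Weight matrix + bias.
    return units * (input_dim + 1)


print(embedding_params(5001, 300))  # 1500300
print(simple_rnn_params(300, 100))  # 40100
print(lstm_params(300, 100))        # 160400
print(dense_params(100, 3))         # 303
```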

**Simple RNN on the test set:**

In [95]:
show_classification_report(model1, X_test, y_test)


             precision    recall  f1-score   support

          0       0.85      0.95      0.90       894
          1       0.83      0.67      0.74       248
          2       0.75      0.62      0.68       322

avg / total       0.82      0.83      0.82      1464
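
To make the report easier to read, the sketch below recomputes the f1 column and the averaged f1 from the precision, recall and support values printed above. It assumes that `show_classification_report` prints a scikit-learn style report in which the `avg / total` row is weighted by support; that helper's source is not shown, so treat this as an interpretation rather than a description of its internals.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, f1, support) copied from the simple RNN report above.
rows = {
    0: (0.85, 0.95, 0.90, 894),
    1: (0.83, 0.67, 0.74, 248),
    2: (0.75, 0.62, 0.68, 322),
}

for label, (precision, recall, reported_f1, support) in rows.items():
    print(label, round(f1(precision, recall), 2), reported_f1)

total_support = sum(row[3] for row in rows.values())
weighted_f1 = sum(row[2] * row[3] for row in rows.values()) / total_support
print(total_support, round(weighted_f1, 2))  # 1464 0.82
```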



In [96]:
# 2 LSTM

early_stopping = EarlyStopping(patience=5, monitor='val_loss')
take_best_model = ModelCheckpoint("weights.h5py", save_best_only=True)

model2 = Sequential()

model2.add(Embedding(NUM_WORDS + 1,EMBEDDING_DIM,
                    weights=[embedding_matrix],
                    input_length=X_train.shape[1],
                    trainable=True))

model2.add(LSTM(100))

model2.add(Dense(3, activation='softmax'))

model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

model2.summary()

model2.fit(X_train,y_train, epochs=100, 
          batch_size=BATCH_SIZE, 
          callbacks = [early_stopping, take_best_model], 
          validation_split = 0.25)

model2.load_weights("weights.h5py")
os.remove("weights.h5py")

model2.evaluate(X_test, y_test)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     (None, 35, 300)           1500300   
_________________________________________________________________
lstm_7 (LSTM)                (None, 100)               160400    
_________________________________________________________________
dense_17 (Dense)             (None, 3)                 303       
Total params: 1,661,003.0
Trainable params: 1,661,003
Non-trainable params: 0.0
_________________________________________________________________
Train on 8784 samples, validate on 2928 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100

[0.41533763796253936, 0.8456284156262549]

**LSTM on the test set:**

In [97]:
show_classification_report(model2, X_test, y_test)


             precision    recall  f1-score   support

          0       0.89      0.91      0.90       894
          1       0.84      0.75      0.79       248
          2       0.72      0.73      0.72       322

avg / total       0.84      0.85      0.84      1464



In [110]:
# 3 conv1d + lstm + dropout

early_stopping = EarlyStopping(patience=5, monitor='val_loss')
take_best_model = ModelCheckpoint("weights.h5py", save_best_only=True)

model3 = Sequential()

model3.add(Embedding(NUM_WORDS + 1,EMBEDDING_DIM,
                    weights=[embedding_matrix],
                    input_length=X_train.shape[1],
                    trainable=True))

model3.add(Conv1D(filters=50,kernel_size=6, activation='relu'))
model3.add(MaxPooling1D(pool_size=3))
model3.add(Dropout(0.5))
model3.add(LSTM(100))
model3.add(Dropout(0.5))
model3.add(Dense(100, activation = 'relu'))
model3.add(Dropout(0.5))
model3.add(Dense(3, activation='softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

# Overrides the patience=5 callback defined above: stop after 2 epochs without improvement.
early_stopping = EarlyStopping(patience=2, monitor='val_loss')

model3.summary()
model3.fit(X_train,y_train, epochs=100, 
          batch_size=BATCH_SIZE, 
          callbacks = [early_stopping, take_best_model], 
          validation_split = 0.25)

model3.load_weights("weights.h5py")
os.remove("weights.h5py")

model3.evaluate(X_test, y_test)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_15 (Embedding)     (None, 35, 300)           1500300   
_________________________________________________________________
conv1d_8 (Conv1D)            (None, 30, 50)            90050     
_________________________________________________________________
max_pooling1d_6 (MaxPooling1 (None, 10, 50)            0         
_________________________________________________________________
dropout_17 (Dropout)         (None, 10, 50)            0         
_________________________________________________________________
lstm_9 (LSTM)                (None, 100)               60400     
_________________________________________________________________
dropout_18 (Dropout)         (None, 100)               0         
_________________________________________________________________
dense_21 (Dense)             (None, 100)               10100     
__________

[0.41013590967068908, 0.84562841497483798]
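
The output shapes and parameter counts in the (truncated) summary above follow from the layer settings. The sketch below reproduces them, assuming the Keras defaults of `'valid'` padding and stride 1 for `Conv1D`, and non-overlapping windows for `MaxPooling1D`; the helper names are hypothetical.

```python
def conv1d_output_length(input_length, kernel_size):
    # 'valid' padding with stride 1: the kernel must fit entirely inside the input.
    return input_length - kernel_size + 1


def max_pool_output_length(input_length, pool_size):
    # Non-overlapping windows; leftover time steps are dropped.
    return input_length // pool_size


def conv1d_params(input_channels, kernel_size, filters):
    # One kernel of size (kernel_size, input_channels) per filter, plus one bias each.
    return kernel_size * input_channels * filters + filters


def lstm_params(input_dim, units):
    # Four gates, each with input kernel, recurrent kernel and bias.
    return 4 * units * (input_dim + units + 1)


# 35 is the padded tweet length here; 75 is the length once emoji slots are appended later on.
for steps in (35, 75):
    after_conv = conv1d_output_length(steps, kernel_size=6)
    after_pool = max_pool_output_length(after_conv, pool_size=3)
    print(steps, after_conv, after_pool)

print(conv1d_params(300, 6, 50))  # 90050
print(lstm_params(50, 100))       # 60400
```

Note that the LSTM here has far fewer parameters than in model 2 (60,400 versus 160,400) because it reads 50 convolutional features per step instead of 300 embedding dimensions.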

**Conv1d + LSTM on the test set:**

In [111]:
show_classification_report(model3, X_test, y_test)


             precision    recall  f1-score   support

          0       0.89      0.92      0.90       894
          1       0.75      0.82      0.78       248
          2       0.78      0.67      0.72       322

avg / total       0.84      0.85      0.84      1464



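With respect to f1 score, the best-performing networks on the test set were models 2 and 3. Since we have just used the test set to choose between models, its scores are likely somewhat optimistic. To get a better estimate of their true performance, let's report their scores on the held-out validation set (`X_val`, `y_val`), which was never passed to `fit`.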

In [112]:
show_classification_report(model2, X_val, y_val)



             precision    recall  f1-score   support

          0       0.90      0.91      0.90       919
          1       0.82      0.73      0.77       217
          2       0.69      0.73      0.71       328

avg / total       0.84      0.84      0.84      1464



In [113]:
show_classification_report(model3, X_val, y_val)



             precision    recall  f1-score   support

          0       0.89      0.92      0.90       919
          1       0.75      0.82      0.78       217
          2       0.76      0.66      0.71       328

avg / total       0.84      0.84      0.84      1464



Their performance is very similar. The choice between the two depends on the business case and should be made by comparing their precision and recall on the individual classes.

### Word and emoji embeddings

We now train the same networks, but this time we use embeddings for both words and emojis.
The preprocessing goes as follows:
* words are tokenized and the corresponding embedding matrix is computed
* emojis are extracted from the tweets, and their corresponding embedding matrix is computed
* the emojis' token indices are shifted by the number of words in the word embedding matrix
* each tweet's emojis are appended to the end of that tweet
* the embedding matrices for words and emojis are concatenated

Put simply, we collect all emojis from a given tweet and put them at the end of the tweet. Our embedding matrix and the index encoding of the tweets reflect that.
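
The shifting and appending steps listed above can be sketched in a few lines. This is not the code of `shift_emoji_indices` or `append_emojis` (both live in `rnn_utils` and are not shown); the function names, the fixed number of emoji slots and the choice to pad after the emojis are all assumptions made for illustration.

```python
def shift_ids(emoji_ids, offset):
    """Move emoji ids past the word id range so the two vocabularies do not collide."""
    return [emoji_id + offset for emoji_id in emoji_ids]


def append_padded(word_row, emoji_ids, emoji_slots):
    """Append emoji ids to an already padded row of word ids.

    The emoji part is truncated or zero-padded to exactly emoji_slots entries,
    so every output row has the same length.
    """
    kept = emoji_ids[:emoji_slots]
    padding = [0] * (emoji_slots - len(kept))
    return list(word_row) + kept + padding


# Toy setup: 10 word ids instead of NUM_WORDS, 3 emoji slots per tweet.
toy_num_words = 10
word_row = [0, 0, 4, 7]  # padded word ids of one tweet
emoji_ids = [1, 2]       # ids assigned by a separate emoji tokenizer
shifted = shift_ids(emoji_ids, toy_num_words)
combined_row = append_padded(word_row, shifted, emoji_slots=3)
print(combined_row)  # [0, 0, 4, 7, 11, 12, 0]
```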

We realize this is not optimal, but it was a quick solution we could implement in the time available. Although the meaning of an emoji might change with its position in the sentence, most emojis should be largely insensitive to position, so putting them all at the end of a tweet could still improve our networks' performance. Let's find out!

First, we extract the emojis and build the corresponding embedding matrix.

In [114]:
X_emojis = GraphicsEmojisExtractor().fit_transform(X)

emoji_tokenizer = Tokenizer()
emoji_tokenizer.fit_on_texts(X_emojis)
X_emojis = emoji_tokenizer.texts_to_sequences(X_emojis)
X_emojis = [shift_emoji_indices(x,NUM_WORDS) for x in X_emojis]
emoji_embedding_index = read_embedding('emoji2vec.txt')
emoji_embedding_matrix = create_embedding_matrix(emoji_tokenizer, EMBEDDING_DIM, emoji_embedding_index)

Append emojis to tweets and do the train, test, validation split.

In [115]:
X_with_emojis = append_emojis(X_emojis,X_preprocessed)
X_train, X_val, X_test, y_train, y_val, y_test = split_table(X_with_emojis,y_preprocessed)
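
The support counts in the reports (1464 test and 1464 validation tweets) together with the 8784 + 2928 training samples in the Keras logs imply an 80/10/10 split of 14640 tweets. The internals of `split_table` are not shown, so the following is only a sketch of one way to produce such a split; `split_three_way`, its parameters and the seed are hypothetical.

```python
import random


def split_three_way(rows, labels, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle indices once, then cut them into test, validation and train parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    n_val = int(len(indices) * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]

    def pick(sequence, chosen):
        return [sequence[i] for i in chosen]

    return (pick(rows, train_idx), pick(rows, val_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, val_idx), pick(labels, test_idx))


# Toy data with the same number of rows as the tweet data set.
toy_rows = list(range(14640))
toy_labels = [row % 3 for row in toy_rows]
tr_x, val_x, te_x, tr_y, val_y, te_y = split_three_way(toy_rows, toy_labels)
print(len(tr_x), len(val_x), len(te_x))  # 11712 1464 1464
```

Keras then carves a further 25% off the training part through `validation_split = 0.25`, which gives the 8784 / 2928 figures in the training logs.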

In [116]:
weigths = concat_weights(embedding_matrix, emoji_embedding_matrix)
weigths.shape

(5119, 300)
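
The shape above is consistent with stacking the two matrices on top of each other. A minimal sketch, assuming `concat_weights` does a plain vertical concatenation: the word matrix has 5001 rows (1,500,300 parameters divided by 300), so the emoji matrix would contribute the remaining 118. The 118 is inferred from the printed shape and is not stated anywhere in the notebook.

```python
import numpy as np

# Placeholder matrices with the inferred shapes; real values would come from GloVe and emoji2vec.
word_matrix = np.zeros((5001, 300))
emoji_matrix = np.ones((118, 300))  # row count is an assumption derived from the printed shape

combined = np.vstack([word_matrix, emoji_matrix])
print(combined.shape)  # (5119, 300)

# Rows past the word block hold emoji vectors, so shifted emoji ids must point into this range.
print(combined[5001:].shape)  # (118, 300)
```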

In [117]:
# 1 Simple RNN
early_stopping = EarlyStopping(patience=5, monitor='val_loss')
take_best_model = ModelCheckpoint("weights.h5py", save_best_only=True)

model4 = Sequential()

model4.add(Embedding(weigths.shape[0],EMBEDDING_DIM,
                    weights=[weigths],
                    input_length=X_train.shape[1],
                    trainable=True))

model4.add(SimpleRNN(100))

model4.add(Dense(3, activation='softmax'))

model4.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

model4.summary()

model4.fit(X_train,y_train, epochs=100, 
          batch_size=BATCH_SIZE, 
          callbacks = [early_stopping, take_best_model], 
          validation_split = 0.25)

model4.load_weights("weights.h5py")
os.remove("weights.h5py")

model4.evaluate(X_test, y_test)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_16 (Embedding)     (None, 75, 300)           1535700   
_________________________________________________________________
simple_rnn_6 (SimpleRNN)     (None, 100)               40100     
_________________________________________________________________
dense_23 (Dense)             (None, 3)                 303       
Total params: 1,576,103.0
Trainable params: 1,576,103
Non-trainable params: 0.0
_________________________________________________________________
Train on 8784 samples, validate on 2928 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100

[0.49409439459524518, 0.81693989103609099]

In [118]:
show_classification_report(model4, X_test, y_test)



             precision    recall  f1-score   support

          0       0.86      0.94      0.90       894
          1       0.78      0.60      0.68       248
          2       0.70      0.63      0.67       322

avg / total       0.81      0.82      0.81      1464



In [119]:
#2 LSTM
early_stopping = EarlyStopping(patience=5, monitor='val_loss')
take_best_model = ModelCheckpoint("weights.h5py", save_best_only=True)

model5 = Sequential()

model5.add(Embedding(weigths.shape[0],EMBEDDING_DIM,
                    weights=[weigths],
                    input_length=X_train.shape[1],
                    trainable=True))

model5.add(LSTM(100))

model5.add(Dense(3, activation='softmax'))

model5.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

model5.summary()

model5.fit(X_train,y_train, epochs=100, 
          batch_size=BATCH_SIZE, 
          callbacks = [early_stopping, take_best_model], 
          validation_split = 0.25)

model5.load_weights("weights.h5py")
os.remove("weights.h5py")

model5.evaluate(X_test, y_test)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_17 (Embedding)     (None, 75, 300)           1535700   
_________________________________________________________________
lstm_10 (LSTM)               (None, 100)               160400    
_________________________________________________________________
dense_24 (Dense)             (None, 3)                 303       
Total params: 1,696,403.0
Trainable params: 1,696,403
Non-trainable params: 0.0
_________________________________________________________________
Train on 8784 samples, validate on 2928 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100


[0.39329060716707198, 0.85450819639560305]

In [120]:
show_classification_report(model5, X_test, y_test)


             precision    recall  f1-score   support

          0       0.89      0.93      0.91       894
          1       0.82      0.79      0.80       248
          2       0.78      0.69      0.73       322

avg / total       0.85      0.85      0.85      1464



In [122]:
# conv1d + lstm + dropout

early_stopping = EarlyStopping(patience=5, monitor='val_loss')
take_best_model = ModelCheckpoint("weights.h5py", save_best_only=True)

model6 = Sequential()

model6.add(Embedding(weigths.shape[0],EMBEDDING_DIM,
                    weights=[weigths],
                    input_length=X_train.shape[1],
                    trainable=True))

model6.add(Conv1D(filters=50,kernel_size=6, activation='relu'))
model6.add(MaxPooling1D(pool_size=3))
model6.add(Dropout(0.5))
model6.add(LSTM(100))
model6.add(Dropout(0.5))
model6.add(Dense(100, activation = 'relu'))
model6.add(Dropout(0.5))
model6.add(Dense(3, activation='softmax'))
model6.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

# Overrides the patience=5 callback defined above: stop after 2 epochs without improvement.
early_stopping = EarlyStopping(patience=2, monitor='val_loss')

model6.summary() 
model6.fit(X_train,y_train, epochs=100, 
          batch_size=BATCH_SIZE, 
          callbacks = [early_stopping, take_best_model], 
          validation_split = 0.25)

model6.load_weights("weights.h5py")
os.remove("weights.h5py")

model6.evaluate(X_test, y_test)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_19 (Embedding)     (None, 75, 300)           1535700   
_________________________________________________________________
conv1d_10 (Conv1D)           (None, 70, 50)            90050     
_________________________________________________________________
max_pooling1d_8 (MaxPooling1 (None, 23, 50)            0         
_________________________________________________________________
dropout_23 (Dropout)         (None, 23, 50)            0         
_________________________________________________________________
lstm_12 (LSTM)               (None, 100)               60400     
_________________________________________________________________
dropout_24 (Dropout)         (None, 100)               0         
_________________________________________________________________
dense_27 (Dense)             (None, 100)               10100     
__________

[0.45761204677852779, 0.82308743201969747]

In [123]:
show_classification_report(model6, X_test, y_test)



             precision    recall  f1-score   support

          0       0.88      0.92      0.90       894
          1       0.78      0.68      0.73       248
          2       0.69      0.66      0.68       322

avg / total       0.82      0.82      0.82      1464



With respect to f1 score, the best-performing network on the test set was model 5. To get a better estimate of its true performance, let's report its scores on the validation set.

In [124]:
show_classification_report(model5, X_val, y_val)



             precision    recall  f1-score   support

          0       0.90      0.92      0.91       919
          1       0.79      0.79      0.79       217
          2       0.73      0.69      0.71       328

avg / total       0.84      0.85      0.84      1464



Adding emojis did not yield any clear improvement to our neural nets. To better understand why, let's see how many tweets in total had emojis in them.

In [131]:
counter = 0
for i in X_emojis:
    if i:
        counter += 1
print('Number of tweets with emojis: {}'.format(counter))
print('Percentage of tweets with emojis: {}'.format(counter/len(X_emojis)))

Number of tweets with emojis: 493
Percentage of tweets with emojis: 0.03367486338797814
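
The printed value is a fraction rather than a percentage: 493 out of 14640 tweets is roughly 3.4%. A slightly more compact way to compute and format the same quantity is sketched below on toy data; `emoji_share` is a hypothetical helper, not part of `rnn_utils`.

```python
def emoji_share(emoji_sequences):
    """Return (number of tweets with at least one emoji, their share of all tweets)."""
    with_emojis = sum(1 for sequence in emoji_sequences if sequence)
    return with_emojis, with_emojis / len(emoji_sequences)


# Toy data: each inner list holds the (shifted) emoji ids of one tweet.
toy_sequences = [[], [5001], [], [5002, 5003]]
count, share = emoji_share(toy_sequences)
print('{} tweets, {:.1%}'.format(count, share))  # 2 tweets, 50.0%

# The figures reported above, formatted as a percentage.
print('{:.1%}'.format(493 / 14640))  # 3.4%
```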


## Discussion:

By using recurrent neural networks, we were able to significantly improve on the scores of our previous approach (note, though, that the scores are not directly comparable because of the different train, test, validation splits). However, adding emojis did not help. There are two likely reasons for that. Firstly, appending emojis to the end of a tweet is suboptimal. More importantly, only about 3% of all tweets had emojis in them, which is not nearly enough training signal for our networks to pick up.

The following improvements to our approach should be explored:
* checking different network architectures: adding dropout, dense layers, bidirectional networks, multiple LSTM layers...
* hyperparameter tuning (hidden layer dimensions, filter sizes in conv1D, etc.)
* incorporating emoji embeddings but keeping their relative position in a tweet
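
As an illustration of the last point, the sketch below encodes a tokenized tweet in its original order, mapping words and emojis into one shared id space instead of moving the emojis to the end. Everything here (the function name, the toy vocabularies, skipping unknown tokens) is an assumption for the sake of the example and was not tried in this notebook.

```python
def encode_in_order(tokens, word_index, emoji_index, num_words):
    """Map tokens to ids while preserving their order.

    Word ids stay in 1..num_words; emoji ids are shifted past that range.
    Tokens found in neither vocabulary are skipped.
    """
    ids = []
    for token in tokens:
        if token in emoji_index:
            ids.append(num_words + emoji_index[token])
        elif token in word_index and word_index[token] <= num_words:
            ids.append(word_index[token])
    return ids


# Toy vocabularies; the emojis are written as escapes (grinning face, airplane).
toy_word_index = {'great': 1, 'flight': 2, 'thanks': 3}
toy_emoji_index = {'\U0001F600': 1, '\u2708': 2}
tokens = ['great', '\U0001F600', 'flight', '\u2708', 'thanks', 'unknownword']

encoded_tweet = encode_in_order(tokens, toy_word_index, toy_emoji_index, num_words=3)
print(encoded_tweet)  # [1, 4, 2, 5, 3]
```

The same concatenated embedding matrix could in principle be reused with such an encoding, since only the order of ids within a tweet changes.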

The goal of the project was not to max out the score, so we have not explored the above possibilities here. If time allows, we will return to them.