In [5]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
import random
import re


for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


# Grab the data

train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')

## Step 1: Background and Problem description

For Module 4 of MSDS 5511 (Deep Learning & AI), we have been asked to use recurrent neural networks (RNNs) to build a tweet classifier. A Kaggle competition will be used to demonstrate knowledge of RNNs as applied to natural language processing (NLP).

The dataset contains tweets, each with a binary indicator of whether or not the tweet relates to a natural disaster. It will be used to train a model.

## Step 2: EDA

Exploratory data analysis will be performed on the data.



In [6]:
print(train.head(5))

# We will only concern ourselves with the text

train = train.drop(['id','keyword','location'], axis=1)
test = test.drop(['id','keyword','location'], axis=1)

# Counts of binary target data
print(train.target.value_counts())


We'd prefer to have balanced data. This will involve sampling an equal number of 0s and 1s in the training set. Let's see what the training set looks like first.

In [7]:
NumTweets = 20

randomTweetIndex =  random.sample(list(train.index),NumTweets)

for i in randomTweetIndex:
    print(train.text[i])


First, we can remove all duplicate tweets.

In [8]:
print('Before duplicates are removed: ',len(train))
train = train.drop_duplicates(subset='text', keep="first")
print('After Duplicates are removed: ',len(train))


A couple of things that immediately stand out are "@" mentions and 'http://' links.

Let's drop both of these from the training and test text.
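As a quick illustration of what this cleaning does, here is a minimal sketch that applies the same two patterns to a made-up tweet. The tweet text is hypothetical, and the final whitespace collapse is an extra step added only to make the printed result easier to read; it is not part of the notebook's pipeline.

```python
import re

# Hypothetical example tweet, not taken from the dataset
tweet = "@user1 Flood warning issued for the coast http://t.co/abc123 stay safe"

# Same patterns as the notebook: links first, then @ mentions
no_links = re.sub(r'https?://\S+', '', tweet)
no_mentions = re.sub(r'@\S+', '', no_links)

# Assumption: collapse leftover whitespace for readability only
cleaned = ' '.join(no_mentions.split())
print(cleaned)
```

Note that `@\S+` removes everything from the "@" up to the next whitespace, so punctuation attached to a mention is removed along with it.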

In [9]:
# Note: functions below are derived from examples found at: https://docs.python.org/3/library/re.html


# Remove http(s) links and @ mentions
LINK_RE = re.compile(r'https?://\S+')
MENTION_RE = re.compile(r'@\S+')

def drop_links(line):
    return LINK_RE.sub('', line)

def drop_mentions(line):
    return MENTION_RE.sub('', line)

def dropJunk_data(data):
    # Note: this overwrites the 'text' column of the DataFrame that is passed in
    data['text'] = data['text'].apply(drop_links)
    data['text'] = data['text'].apply(drop_mentions)
    return data

In [10]:
wiped_train = dropJunk_data(train)
wiped_test = dropJunk_data(test)

print('Training length :',len(wiped_train))
print(wiped_train['target'].value_counts())

The data is imbalanced. Downsample the 0s so that there are as many 0s as 1s.
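The idea behind downsampling can be sketched with the standard library alone. The toy labels below are hypothetical, and the fixed seed is an assumption added so the sample is reproducible.

```python
import random
from collections import Counter

# Hypothetical toy labels: six 0s and three 1s
labels = [0, 0, 1, 0, 1, 0, 0, 1, 0]

rng = random.Random(42)  # fixed seed (assumption) so the sample is reproducible
zero_idx = [i for i, y in enumerate(labels) if y == 0]
one_idx = [i for i, y in enumerate(labels) if y == 1]

# Keep every 1 and an equally sized random subset of the 0s
kept = rng.sample(zero_idx, len(one_idx)) + one_idx
rng.shuffle(kept)

balanced = [labels[i] for i in kept]
print(Counter(balanced))
```

The trade-off is that the discarded 0s are never seen by the model, so the balanced set is smaller than the original.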

In [11]:
# Downsample 0s, concat 1s and zeros, shuffle

zeros = wiped_train[wiped_train['target']==0]
ones = wiped_train[wiped_train['target']==1]

zeros_sample = zeros.sample(n=len(ones), random_state=42)  # fixed seed so the downsample is reproducible

wiped_train = pd.concat([zeros_sample,ones]).sample(frac=1, random_state=42).reset_index(drop=True)
wiped_train.head()
wiped_train['target'].value_counts()

In [12]:
# remove stop words
# NOTE: I have looked at examples found at: https://www.geeksforgeeks.org/removing-stop-words-nltk-python/

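from nltk.corpus import stopwords

# Build the stop-word set once; calling stopwords.words() for every word is slow
STOP_WORDS = set(stopwords.words('english'))

def drop_stopwords(sentence):
    sen = sentence.split()
    wrd = [word for word in sen if word not in STOP_WORDS]
    
    return ' '.join(wrd)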

def processed_wo_stopwords(data):
    data['text'] = data['text'].apply(lambda x : drop_stopwords(x))   
    return data

train = processed_wo_stopwords(wiped_train)
test = processed_wo_stopwords(wiped_test)


In [13]:
# Samples
print(train.head(10))

Vectorize data:

I will create a corpus that contains both the train and test data by concatenating the two. I will then fit the tokenizer on that corpus.
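To make the vectorization step concrete, here is a simplified pure-Python sketch of what tokenizing and padding produce. It is not the Keras implementation: the mini-corpus is hypothetical, and the helper names `to_sequence` and `pad_post` are invented for illustration. Like the Keras tokenizer, it gives lower indices to more frequent words and reserves 0 for padding.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the cleaned tweets
corpus = ["forest fire near town", "fire crews leave town", "sunny day"]

# Rank words by frequency; index 0 is reserved for padding
counts = Counter(word for text in corpus for word in text.split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def to_sequence(text):
    # Words missing from the vocabulary are skipped in this sketch
    return [word_index[w] for w in text.split() if w in word_index]

def pad_post(seq, max_len):
    # Append zeros on the right until the sequence reaches max_len
    return seq + [0] * (max_len - len(seq))

max_len = max(len(text.split()) for text in corpus)
padded = [pad_post(to_sequence(t), max_len) for t in corpus]
print(padded)
```

Every row ends up with the same length, which is what lets the sequences be stacked into a single array for the embedding layer.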

In [14]:
# Note: This was a helpful notebook for understanding vectorization: 
#  https://www.kaggle.com/code/mattbast/rnn-and-nlp-detect-a-disaster-in-tweets/notebook#Encode-sentences

import tensorflow as tf
# Use the public tensorflow.keras path; tensorflow.python.* is a private module path
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Concat
corpus = pd.concat([train['text'],test['text']])
    
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)

maxLen = max(len(c.split()) for c in corpus)
print('maximum length: ', maxLen)

In [15]:
trainFeatures = train.iloc[:,0]
trainLabels = train.iloc[:,1]
testFeatures = test.iloc[:,0]


trainToken = tokenizer.texts_to_sequences(trainFeatures)
testToken = tokenizer.texts_to_sequences(testFeatures)

# Pad each sequence with trailing zeros up to maxLen
trainPadded = pad_sequences(trainToken, maxlen=maxLen, padding='post')
testPadded = pad_sequences(testToken, maxlen=maxLen, padding='post')

# Make an array of train labels for training the model
trainLabels = np.array(trainLabels)

## Step 3: Model Architecture

Let's build a model.

An LSTM is the way to go here, first because it is part of the assignment and second because we are working with sequence data. Keras has good LSTM support. A bidirectional LSTM reads each sequence both front-to-back and back-to-front to find patterns. For my first model I will use one bidirectional LSTM layer with 128 hidden units. I first need to embed the tokens, and then compress the LSTM output with a series of fully connected dense layers with tanh activations. Since this is a binary classification problem, I will use a single-neuron output layer with sigmoid activation. The loss function will be binary cross-entropy.
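To get a feel for the size of this architecture, the trainable parameter counts can be worked out by hand. The sketch below uses the embedding size (144) and hidden units (128) set in the model cell, together with the standard LSTM and dense parameter formulas. The embedding layer is left out because its size depends on the vocabulary; the helper names are illustrative.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output
    return input_dim * units + units

embedding_dim = 144
hidden_units = 128

bi_lstm = 2 * lstm_params(embedding_dim, hidden_units)  # forward + backward LSTM
lstm_out = 2 * hidden_units  # the two directions are concatenated by default

dense_total = 0
prev = lstm_out
for size in [256, 128, 64, 1]:
    dense_total += dense_params(prev, size)
    prev = size

print(bi_lstm, dense_total)
```

These numbers can be compared against the output of `model_v1.summary()` as a sanity check.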

In [22]:
import keras
from keras.layers import LSTM
from keras.models import Sequential
from keras.layers import Dense, Embedding, Bidirectional, Dropout, BatchNormalization
from keras import optimizers


EmbeddedWords = len(tokenizer.word_index)+1
units = 144  # embedding dimension (just guessing)
hidden_units = 128

model_v1 = Sequential()
model_v1.add(Embedding(EmbeddedWords, units, input_length = maxLen))
model_v1.add(Bidirectional(LSTM(hidden_units)))
model_v1.add(Dense(256, activation='tanh'))
model_v1.add(Dense(128, activation='tanh'))
model_v1.add(Dense(64, activation='tanh'))
model_v1.add(Dense(1, activation='sigmoid'))

model_v1.summary()



In [18]:

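model_v1.compile( loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.00001),
    # Explicit metric names keep the history keys stable ('precision', 'recall')
    metrics=['accuracy', tf.keras.metrics.Precision(name='precision'), tf.keras.metrics.Recall(name='recall')])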


# Early stopping is triggered by the validation loss (the split held out from the training data)
cb = [tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, verbose=1)]


history_v1 = model_v1.fit(trainPadded, trainLabels, epochs=100, validation_split=0.25, callbacks = cb)


In [20]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1,4, figsize=(16, 8))

axs[0].set_title('Loss')
axs[0].plot(history_v1.history['loss'], label='train')
axs[0].plot(history_v1.history['val_loss'], label='val')
axs[0].legend()

axs[1].set_title('Accuracy')
axs[1].plot(history_v1.history['accuracy'], label='train')
axs[1].plot(history_v1.history['val_accuracy'], label='val')
axs[1].legend()

axs[2].set_title('Precision')
axs[2].plot(history_v1.history['precision'], label='train')
axs[2].plot(history_v1.history['val_precision'], label='val')
axs[2].legend()

axs[3].set_title('Recall')
axs[3].plot(history_v1.history['recall'], label='train')
axs[3].plot(history_v1.history['val_recall'], label='val')
axs[3].legend()


Analysis:

Looking at loss vs. epoch, the model is clearly overfitting. I will tighten the patience to 3 and change the early-stopping criterion to validation accuracy, because it seems like there is room for improvement on that metric.

I am going to experiment with dropout and batch normalization as levers for improving validation loss.
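The effect of the patience setting can be illustrated with a simplified sketch of patience-based stopping. This is not Keras's exact implementation, the loss values are made up, and `stopping_epoch` is a hypothetical helper.

```python
def stopping_epoch(val_losses, patience):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best value: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: improvement stalls after epoch 1
val_losses = [0.70, 0.65, 0.66, 0.67, 0.68, 0.60]

print(stopping_epoch(val_losses, patience=3))
print(stopping_epoch(val_losses, patience=10))
```

With a small patience the run ends before the late improvement at the last epoch is ever seen, so a tighter patience saves time but risks stopping early.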


In [29]:

model_v2 = Sequential()
model_v2.add(Embedding(EmbeddedWords, units, input_length = maxLen))
model_v2.add(Bidirectional(LSTM(hidden_units)))
model_v2.add(Dropout(0.25))
model_v2.add(BatchNormalization())
model_v2.add(Dense(128, activation='tanh'))
model_v2.add(Dropout(0.25))
model_v2.add(BatchNormalization())
model_v2.add(Dense(64, activation='tanh'))
model_v2.add(Dropout(0.25))
model_v2.add(BatchNormalization())
model_v2.add(Dense(1, activation='sigmoid'))

model_v2.summary()

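model_v2.compile( loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.00001),
    # Explicit metric names keep the history keys stable ('precision', 'recall')
    metrics=['accuracy', tf.keras.metrics.Precision(name='precision'), tf.keras.metrics.Recall(name='recall')])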

# Early stopping is triggered by the validation accuracy (the split held out from the training data)
cb = [tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3, verbose=1)]
history_v2 = model_v2.fit(trainPadded, trainLabels,epochs=50,validation_split=0.2, callbacks = cb)


In [30]:

import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 4, figsize=(16, 8))

axs[0].set_title('Loss')
axs[0].plot(history_v2.history['loss'], label='train')
axs[0].plot(history_v2.history['val_loss'], label='val')
axs[0].legend()

axs[1].set_title('Accuracy')
axs[1].plot(history_v2.history['accuracy'], label='train')
axs[1].plot(history_v2.history['val_accuracy'], label='val')
axs[1].legend()

axs[2].set_title('Precision')
axs[2].plot(history_v2.history['precision'], label='train')
axs[2].plot(history_v2.history['val_precision'], label='val')
axs[2].legend()

axs[3].set_title('Recall')
axs[3].plot(history_v2.history['recall'], label='train')
axs[3].plot(history_v2.history['val_recall'], label='val')
axs[3].legend()



Dropout and batch normalization really helped the model generalize. The loss curve still indicates overfitting, but the accuracy of the model keeps improving after the loss has stopped improving.
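Accuracy and loss can move in different directions because accuracy only checks which side of 0.5 a prediction falls on, while binary cross-entropy also penalizes how confident a wrong prediction is. A small sketch with made-up probabilities shows the effect; the numbers are hypothetical and not taken from the training runs.

```python
import math

def bce(labels, probs):
    # Mean binary cross-entropy
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs))
    return -total / len(labels)

def accuracy(labels, probs, threshold=0.5):
    # Fraction of predictions on the correct side of the threshold
    hits = sum((p >= threshold) == bool(y) for y, p in zip(labels, probs))
    return hits / len(labels)

labels = [1, 1, 1, 0, 0]
early = [0.6, 0.6, 0.4, 0.4, 0.6]    # hesitant predictions, two mistakes
late = [0.9, 0.9, 0.55, 0.1, 0.999]  # one mistake, but a very confident one

print(accuracy(labels, early), round(bce(labels, early), 3))
print(accuracy(labels, late), round(bce(labels, late), 3))
```

In this toy case the later predictions are more accurate yet have a higher loss, because a single very confident mistake dominates the average.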

I will now swap the tanh function with sigmoid for the dense layers to see how this impacts the model.


In [31]:
model_v3 = Sequential()
model_v3.add(Embedding(EmbeddedWords, units, input_length = maxLen))
model_v3.add(Bidirectional(LSTM(hidden_units)))
model_v3.add(Dropout(0.25))
model_v3.add(BatchNormalization())
model_v3.add(Dense(128, activation='sigmoid'))
model_v3.add(Dropout(0.25))
model_v3.add(BatchNormalization())
model_v3.add(Dense(64, activation='sigmoid'))
model_v3.add(Dropout(0.25))
model_v3.add(BatchNormalization())
model_v3.add(Dense(1, activation='sigmoid'))

model_v3.summary()

#%%

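model_v3.compile( loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.00001),
    # Explicit metric names keep the history keys stable ('precision', 'recall')
    metrics=['accuracy', tf.keras.metrics.Precision(name='precision'), tf.keras.metrics.Recall(name='recall')])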

#%%

history_v3 = model_v3.fit(trainPadded, trainLabels,epochs=50,validation_split=0.2, callbacks = cb)


In [32]:

import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 4, figsize=(20, 5))

axs[0].set_title('Loss')
axs[0].plot(history_v3.history['loss'], label='train')
axs[0].plot(history_v3.history['val_loss'], label='val')
axs[0].legend()

axs[1].set_title('Accuracy')
axs[1].plot(history_v3.history['accuracy'], label='train')
axs[1].plot(history_v3.history['val_accuracy'], label='val')
axs[1].legend()

axs[2].set_title('Precision')
axs[2].plot(history_v3.history['precision'], label='train')
axs[2].plot(history_v3.history['val_precision'], label='val')
axs[2].legend()

axs[3].set_title('Recall')
axs[3].plot(history_v3.history['recall'], label='train')
axs[3].plot(history_v3.history['val_recall'], label='val')
axs[3].legend()

Switching the dense activations from tanh to sigmoid showed a marginal improvement in accuracy on the validation set.

For my last experiment I will add one more Bidirectional LSTM layer to see if this improves performance.


In [34]:
model_v4 = Sequential()
model_v4.add(Embedding(EmbeddedWords, units, input_length = maxLen))
model_v4.add(Bidirectional(LSTM(hidden_units,return_sequences=True)))
model_v4.add(Dropout(0.25))
model_v4.add(Bidirectional(LSTM(hidden_units)))
model_v4.add(Dropout(0.25))
model_v4.add(BatchNormalization())
model_v4.add(Dense(128, activation='sigmoid'))
model_v4.add(Dropout(0.25))
model_v4.add(BatchNormalization())
model_v4.add(Dense(64, activation='sigmoid'))
model_v4.add(Dropout(0.25))
model_v4.add(BatchNormalization())
model_v4.add(Dense(1, activation='sigmoid'))

model_v4.summary()

#%%

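model_v4.compile( loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.00001),
    # Explicit metric names keep the history keys stable ('precision', 'recall')
    metrics=['accuracy', tf.keras.metrics.Precision(name='precision'), tf.keras.metrics.Recall(name='recall')])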

#%%

history_v4 = model_v4.fit(trainPadded, trainLabels,epochs=50,validation_split=0.2, callbacks = cb)


In [37]:

fig, axs = plt.subplots(1, 4, figsize=(20, 5))

axs[0].set_title('Loss')
axs[0].plot(history_v4.history['loss'], label='train')
axs[0].plot(history_v4.history['val_loss'], label='val')
axs[0].legend()

axs[1].set_title('Accuracy')
axs[1].plot(history_v4.history['accuracy'], label='train')
axs[1].plot(history_v4.history['val_accuracy'], label='val')
axs[1].legend()

axs[2].set_title('Precision')
axs[2].plot(history_v4.history['precision'], label='train')
axs[2].plot(history_v4.history['val_precision'], label='val')
axs[2].legend()

axs[3].set_title('Recall')
axs[3].plot(history_v4.history['recall'], label='train')
axs[3].plot(history_v4.history['val_recall'], label='val')
axs[3].legend()


The additional LSTM layer does not improve on the performance of model_v3, so model_v3 will be the model to use going forward.


## Step 4: Results and Analysis

Multiple models were used to classify tweets based on their text. A single Bidirectional LSTM followed by fully connected dense layers, each with a sigmoid activation, dropout, and batch normalization, converged to the best accuracy. The best accuracy on the held-out validation data for model_v3 was 75.16%. Changing the activations on the dense layers gave at most a marginal gain, and adding an extra LSTM layer did not improve the model.
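For reference, the three metrics tracked during training can be computed by hand from thresholded predictions. The sketch below uses hypothetical labels and predictions, not the notebook's results, and `binary_metrics` is an illustrative helper.

```python
def binary_metrics(y_true, y_pred):
    # Count the four confusion-matrix cells
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical labels and predictions (already thresholded at 0.5)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

acc, prec, rec = binary_metrics(y_true, y_pred)
print(acc, prec, rec)
```

Precision answers "of the tweets flagged as disasters, how many were real?", while recall answers "of the real disaster tweets, how many were flagged?".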



In [39]:

preds = model_v3.predict(testPadded)

len(preds)

# Threshold the sigmoid outputs at 0.5 to get 0/1 class labels
predictions = (preds >= 0.5).astype(int).ravel()
        
submission = pd.read_csv("../input/nlp-getting-started/sample_submission.csv")
submission

submission['target']=predictions
submission


submission.to_csv("submission.csv", index=False)




## Conclusion

I have only looked at LSTM layers and fully connected dense layers. I have seen other architectures, such as CNNs, used in conjunction with LSTM layers, but I did not study them for this problem. Different vectorization methods may yield better results. More input features, such as location and time, might also improve model performance. Based on this notebook, putting more LSTM layers into the model does not appear to improve performance.
