Python Group
Lab Assignment Seven: RNNs
Wali Chaudhary, Bryce Shurts, & Alex Wright

# Business Understanding

The dataset is sourced from data.world on Kaggle, and is called "Emotion Detection from Text". It was distributed by data.world under a public license according to the source. The dataset is a collection of tweets annotated with emotions attached to each sample.

There exist 3 columns, "tweet_id", "sentiment", and "content". Tweet_id represents the identification number of the tweet for querying with the Twitter API, sentiment is the classification of emotion associated with the content, and content is the raw text from the tweet.

The dataset contains 13 different emotions, with 40000 records. The dataset is imbalanced, as there are a different number of records per emotion, so this must be addressed in preprocessing the data.


# Preparation


#### Splitting the data
We will be using Stratified K-Fold Cross Validation to split our data into training and testing sets. We chose this method based on the structure of our dataset.

The dataset contains imbalanced classes, so we want to ensure that the distribution of the training & testing sets are representative of the overall distribution of classes in the dataset and is beneficial to reducing bias. Features like the sentiment feature are imbalanced with each of the 13 values associated with it having a different distribution across the dataset. Stratified K-Fold Cross Validation ensures this proportionality of distribution, unlike a normal train test split which randomly splits the data into two sets based on a predefined ratio. This can result in a skewed representation of classes in the training & testing datasets not representative of the distribution of the entire dataset.

Stratified K-Fold Cross Validation also gives us the advantage of being able to compare different models as each model is trained differently per fold. This will help us in hyper parameter tuning.


#### Tokenization methods, Vocabulary, and content length

We use the Keras Tokenizer class to tokenize the text data in our dataset. We convert each tweet's text into a sequence of integers, where each integer represents a unique word in the vocabulary. The Tokenizer also takes care of lowercasing, removing punctuation, and handling out-of-vocabulary words.

We defined our vocabulary to use all the unique words in "content", this was because we didn't want to lose any data as we thought this would also generalize our model better overall. Although this will increase the complexity of our model, and one could also argue that mispellings and grammatical errors can lead to an incorrect classification; we believe that grammatical errors and slang are common ways of emotional expression online. For example, if a user tweets GAAAAAHHHH, that probably means they're very upset versus someone who tweets GAH which may indicate surprise.

We decided to keep the length of each sequence to the longest individual sequence by using padding, as this'll help us capture the maximum amount of input for sentiment analysis.


#### Evaluation Metrics

For evaluation, we decided to rely on the F1 score, precision, recall, and accuracy. This is because we are performing a classification task on sequences to predict sentiment. Precision is important because it helps us guage the our models True positive predictiveness, and recall helps us identify the number of positive instances correctly identified out of the total amount of positive cases. In our specific case, recall is important because it'll tell us the ratio of how much our classifier was able to identify correct sentiment, like happiness, out the all of the happiness records. Precision will tell us if our evaluation of sentiment was correct.

The F1 score balances both of these metrics together, and provides a balanced measure of our models performance. In sentiment analysis, false positives and false negatives are not good because both types of errors can have significant consequences, especially in a system which relies on the classification to provide recommendations or critical information. A high F1-score indicates that our model is making accurate positive predictions while minimizing false positives and false negatives. It is also useful in our case as well since our dataset is imbalanced, because it reduces bias to the majority class by giving a balanced result of the recall and precision to account for both false positives and false negatives.

In [12]:
# Handle all imports for notebook

import pandas as pd
import numpy as np
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
import tensorflow as tf
from tensorflow import keras
from PIL import Image
from os import listdir
from os.path import isfile, join
from skimage.transform import resize
from sklearn import preprocessing
from keras.models import Sequential
from keras.utils import plot_model
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import CuDNNLSTM
from keras.layers import Bidirectional
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import roc_curve, auc, accuracy_score, confusion_matrix, classification_report
from keras.preprocessing.text import Tokenizer
# from keras_preprocessing.sequence import pad_sequences
from keras.utils import pad_sequences
from sklearn.preprocessing import OneHotEncoder

In [13]:
df = pd.read_csv("tweet_emotions.csv")

In [14]:
df.dropna(axis=0, inplace=True)
df.drop("tweet_id", inplace=True, axis=1)

df.groupby(df["sentiment"]).describe()

Unnamed: 0_level_0,content,content,content,content
Unnamed: 0_level_1,count,unique,top,freq
sentiment,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
anger,110,110,fuckin'm transtelecom,1
boredom,179,179,i'm so tired,1
empty,827,827,@tiffanylue i know i was listenin to bad habi...,1
enthusiasm,759,759,wants to hang out with friends SOON!,1
fun,1776,1776,"Wondering why I'm awake at 7am,writing a new s...",1
happiness,5209,5194,FREE UNLIMITED RINGTONES!!! - http://tinyurl.c...,4
hate,1323,1323,It is so annoying when she starts typing on he...,1
love,3842,3801,I just received a mothers day card from my lov...,13
neutral,8638,8617,FREE UNLIMITED RINGTONES!!! - http://tinyurl.c...,4
relief,1526,1524,http://snipurl.com/hq0n1 Just printed my mom a...,2


This shows us that the sentiment feature is very imbalanced, with each of its 13 values having a different distribution across the dataset. The highest value, happiness with 5209 in count is significantly higher than the lowest count, anger, which has only 110 records.

Dropped the "twitter_id" column as that was not relevant to the task at hand for classifying sentiment.

In [15]:
# Define the Stratified Shuffle Split object
n_splits = 3
test_size = 0.2
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=42)

X, y = df.drop("sentiment", inplace=False, axis=1), df["sentiment"]

In [16]:
#tokenize the text, use entire vocabulary
tokenizer = Tokenizer()
y_np = y.to_numpy().reshape(-1, 1)

# save as sequences with integers replacing words
tokenizer.fit_on_texts(X["content"])
sequences = tokenizer.texts_to_sequences(X["content"])
final_seqs = pad_sequences(sequences,maxlen=300)
num_vocab = len(tokenizer.word_index)+1

encoder = OneHotEncoder(handle_unknown='ignore')
target = encoder.fit_transform(y_np).toarray()

In [17]:
# Load in the word embeddings
word_vectors = {}
for line in open("glove.840B.300d.txt"):
    value = line.split(' ')
    word_vectors[value[0]] = np.array(value[1:],dtype = 'float32')

# Apply embeddings to dataset
embedding_matrix = np.zeros((num_vocab, 300))
for word, index in tokenizer.word_index.items():
    embedding = word_vectors.get(word)
    if embedding is not None:
        embedding_matrix[index] = embedding

FileNotFoundError: [Errno 2] No such file or directory: 'glove.840B.300d.txt'

In [None]:
# LSTM RNN
model_bi_lstm = Sequential()
model_bi_lstm.add(Embedding(num_vocab, 300, weights=[embedding_matrix], input_length=300, trainable=False))
model_bi_lstm.add(Bidirectional(CuDNNLSTM(75)))
model_bi_lstm.add(Dense(32, activation="relu"))
model_bi_lstm.add(Dense(13, activation="sigmoid"))
model_bi_lstm.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])


# Default padding is to the longest sequence
for train_index, test_index in sss.split(sequences, target):
    X_train, X_test = final_seqs[train_index], final_seqs[test_index]
    y_train, y_test = target[train_index], target[test_index]

    hist = model_bi_lstm.fit(X_train, y_train, epochs=100, batch_size=256, validation_split=0.2)

In [18]:
# Load in the word embeddings
word_vectors = {}
for line in open("numberbatch-en-17.04b.txt"):
    value = line.split(' ')
    word_vectors[value[0]] = np.array(value[1:],dtype = 'float32')

# Apply embeddings to dataset
embedding_matrix = np.zeros((num_vocab, 300))
for word, index in tokenizer.word_index.items():
    embedding = word_vectors.get(word)
    if embedding is not None:
        embedding_matrix[index] = embedding

In [11]:
# LSTM RNN
model_bi_lstm = Sequential()
model_bi_lstm.add(Embedding(num_vocab, 300, weights=[embedding_matrix], input_length=300, trainable=False))
model_bi_lstm.add(Bidirectional(CuDNNLSTM(75)))
model_bi_lstm.add(Dense(32, activation="relu"))
model_bi_lstm.add(Dense(13, activation="sigmoid"))
model_bi_lstm.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])


# Default padding is to the longest sequence
for train_index, test_index in sss.split(sequences, target):
    X_train, X_test = final_seqs[train_index], final_seqs[test_index]
    y_train, y_test = target[train_index], target[test_index]

    hist = model_bi_lstm.fit(X_train, y_train, epochs=100, batch_size=256, validation_split=0.2)

Epoch 1/100


InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNV2' used by {{node sequential/bidirectional/forward_cu_dnnlstm/CudnnRNNV2}} with these attrs: [seed2=0, is_training=true, seed=0, dropout=0, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm"]
Registered devices: [CPU]
Registered kernels:
  <no registered kernels>

	 [[sequential/bidirectional/forward_cu_dnnlstm/CudnnRNNV2]] [Op:__inference_train_function_3089]