Python Group
Lab Assignment Seven: RNNs
Wali Chaudhary, Bryce Shurts, & Alex Wright

# Business Understanding

The dataset, titled "Emotion Detection from Text", is hosted on Kaggle and originates from data.world, which distributed it under a public license according to the source. It is a collection of tweets, each annotated with an emotion.

There are three columns: "tweet_id", "sentiment", and "content". "tweet_id" is the tweet's identification number, which can be used to query the Twitter API; "sentiment" is the emotion label assigned to the tweet; and "content" is the tweet's raw text.

The dataset contains 40,000 records labeled with 13 different emotions. It is imbalanced, since the number of records varies widely from one emotion to another, so this must be addressed when preprocessing the data.
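
One common way to account for imbalance of this kind is to weight each class inversely to its frequency. The sketch below is only an illustration under stated assumptions: the helper `balanced_class_weights` and the toy labels are hypothetical, and this is not necessarily the approach taken later in this notebook. It uses the same heuristic as scikit-learn's "balanced" class weights, `n_samples / (n_classes * count)`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: weight} with weight = n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy labels (hypothetical, not drawn from the dataset)
toy_labels = ["neutral"] * 6 + ["happiness"] * 3 + ["anger"] * 1
weights = balanced_class_weights(toy_labels)

# Rare classes receive larger weights than frequent ones
for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label}: {weight:.3f}")
```

Weights like these could then be supplied to a loss function so that mistakes on rare emotions count for more during training.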


# Preparation


#### Splitting the data
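We will use Stratified K-Fold Cross Validation to split our data into training and testing sets. We chose this method because of the class structure of our dataset. In the code, the split is implemented with scikit-learn's StratifiedShuffleSplit, which draws repeated stratified random splits rather than disjoint folds.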

The dataset contains imbalanced classes, so we want the class distribution in both the training and testing sets to be representative of the overall distribution in the dataset, which helps reduce bias. The sentiment label is imbalanced: each of its 13 values occurs with a different frequency. Stratified K-Fold Cross Validation preserves these class proportions in every split, unlike a plain train/test split, which divides the data at random according to a predefined ratio. A random split can leave the training and testing sets with a skewed class distribution that does not reflect the dataset as a whole.

Stratified K-Fold Cross Validation also lets us compare models more reliably, because each model is trained and evaluated on several different splits rather than a single one. This will help us with hyperparameter tuning.
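
To make the stratification idea described in this section concrete, here is a minimal pure-Python sketch. The helper `stratified_split` and the toy labels are hypothetical and exist only for illustration; this is not scikit-learn's implementation. It shuffles the indices of each class separately and moves the same fraction of every class into the test set.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)

    # Group sample indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class (at least one sample)
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy, heavily imbalanced labels (hypothetical, not drawn from the dataset)
toy_labels = ["neutral"] * 80 + ["happiness"] * 15 + ["anger"] * 5
train_idx, test_idx = stratified_split(toy_labels)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(test_counts)
```

With a purely random split, a class as small as the toy "anger" class could easily be missing from the test set altogether; sampling per class rules that out.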


#### Tokenization methods, Vocabulary, and content length

We use the Keras Tokenizer class to tokenize the text data in our dataset. We convert each tweet's text into a sequence of integers, where each integer represents a unique word in the vocabulary. By default, the Tokenizer also lowercases the text and strips punctuation. Because we do not set an `oov_token`, words that are not in the fitted vocabulary are dropped when text is converted to sequences.
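
The following pure-Python sketch illustrates the general idea of turning text into integer sequences. It is a simplified stand-in, not the Keras implementation: the helpers `split_words`, `build_word_index`, and `to_sequences` are hypothetical, and the punctuation set (Python's `string.punctuation`) is an assumption that differs slightly from the Keras default filter, which keeps the apostrophe.

```python
import string
from collections import Counter

# Replace every punctuation character with a space (assumed filter set)
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def split_words(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


def build_word_index(texts):
    """Map each word to an integer rank; index 0 is left free for padding."""
    counts = Counter(word for text in texts for word in split_words(text))
    # most_common sorts by descending frequency; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index):
    """Replace words with their indices, silently dropping unknown words."""
    return [[word_index[w] for w in split_words(t) if w in word_index] for t in texts]


# Toy tweets (hypothetical, not drawn from the dataset)
toy_tweets = ["So tired today!", "Tired, so so tired..."]
word_index = build_word_index(toy_tweets)
sequences = to_sequences(toy_tweets + ["Unseen word today"], word_index)
print(word_index)
print(sequences)
```

The third input shows the out-of-vocabulary behavior: words that were never seen while building the index simply disappear from the resulting sequence.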

We defined our vocabulary to include every unique word in "content". We did not want to lose any data, and we thought that keeping the full vocabulary would help the model generalize better overall.

In [97]:
# Handle all imports for notebook

# Standard library
import os
import pprint
from os import listdir
from os.path import isfile, join

# Filter out TensorFlow INFO logs; must be set before importing tensorflow
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

# Data handling, images, and plotting
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from numpy import expand_dims
from pandas import DataFrame
from PIL import Image
from skimage.transform import resize

# scikit-learn
from sklearn import metrics as mt
from sklearn import preprocessing
from sklearn.metrics import roc_curve, auc, accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit, train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# TensorFlow / Keras (the tensorflow.keras imports come last, as in the original order)
import tensorflow as tf
from tensorflow import keras
from keras.layers import Dense, Dropout, Activation, Flatten, Input, Embedding, Reshape, concatenate
from keras.layers import Conv2D, MaxPooling2D
from keras.models import Model, Sequential
from keras.utils import plot_model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Activation, Flatten, Dense, Dropout, GlobalAveragePooling2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [98]:
df = pd.read_csv("tweet_emotions.csv")

In [99]:
df.dropna(axis=0, inplace=True)
df.drop("tweet_id", inplace=True, axis=1)

df.groupby(df["sentiment"]).describe()

| sentiment | content count | content unique | content top | content freq |
|---|---|---|---|---|
| anger | 110 | 110 | fuckin'm transtelecom | 1 |
| boredom | 179 | 179 | i'm so tired | 1 |
| empty | 827 | 827 | @tiffanylue i know i was listenin to bad habi... | 1 |
| enthusiasm | 759 | 759 | wants to hang out with friends SOON! | 1 |
| fun | 1776 | 1776 | Wondering why I'm awake at 7am,writing a new s... | 1 |
| happiness | 5209 | 5194 | FREE UNLIMITED RINGTONES!!! - http://tinyurl.c... | 4 |
| hate | 1323 | 1323 | It is so annoying when she starts typing on he... | 1 |
| love | 3842 | 3801 | I just received a mothers day card from my lov... | 13 |
| neutral | 8638 | 8617 | FREE UNLIMITED RINGTONES!!! - http://tinyurl.c... | 4 |
| relief | 1526 | 1524 | http://snipurl.com/hq0n1 Just printed my mom a... | 2 |


This shows that the sentiment feature is very imbalanced: each of its 13 values occurs with a different frequency. Among the classes shown, the largest, neutral with 8638 records, is far larger than the smallest, anger, which has only 110 records.

In [100]:
# Define the Stratified Shuffle Split object
n_splits = 3
test_size = 0.2
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=42)

X, y = df.drop("sentiment", inplace=False, axis=1), df["sentiment"]

In [101]:
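# Tokenize the text, using the entire vocabulary
tokenizer = Tokenizer(num_words=None)
y_np = y.to_numpy().reshape(-1, 1)

# Save as sequences with integers replacing words
tokenizer.fit_on_texts(X.content)
sequences = tokenizer.texts_to_sequences(X.content)
sequences = np.array(sequences, dtype=object)

# Pad every tweet to the length of the longest tweet in the whole dataset.
# Padding the training and testing sets separately could give them different
# widths, because the default padding length is the longest sequence in each call.
padded = pad_sequences(sequences)

# One-hot encode the sentiment labels
encoder = OneHotEncoder(handle_unknown='ignore')
target = encoder.fit_transform(y_np).toarray()

# Each iteration overwrites the previous split, so only the last of the
# n_splits stratified splits is kept in these variables
for train_index, test_index in sss.split(padded, target):
    X_train, X_test = padded[train_index], padded[test_index]
    y_train, y_test = target[train_index], target[test_index]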
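
Padding works on whatever batch it is given, so two batches padded separately can end up with different widths. The pure-Python sketch below shows why a shared length matters. The helper `pad_left` is hypothetical; it only mimics left-padding with zeros and keeping the last tokens of an over-long sequence, and it is not the Keras function.

```python
def pad_left(sequences, maxlen=None, value=0):
    """Left-pad each sequence with `value` up to `maxlen` tokens."""
    if maxlen is None:
        # Default to the longest sequence in this particular batch
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # Keep only the last `maxlen` tokens of an over-long sequence
            seq = seq[len(seq) - maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


# Toy integer sequences (hypothetical, not drawn from the dataset)
train_seqs = [[1, 2, 3], [4]]
test_seqs = [[5, 6]]

# Padded separately, the two sets get different widths (3 and 2)
train_separate = pad_left(train_seqs)
test_separate = pad_left(test_seqs)

# Padded with one shared length, both sets have the same width
shared_len = max(len(seq) for seq in train_seqs + test_seqs)
train_shared = pad_left(train_seqs, maxlen=shared_len)
test_shared = pad_left(test_seqs, maxlen=shared_len)
print(train_shared, test_shared)
```

A model with a fixed input length would reject the mismatched version, so choosing one length for both sets avoids a shape error at evaluation time.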