# 1. Introduction
This notebook builds and trains an RNN model using Keras and TensorFlow. The project was originally started on Kaggle but was moved to a Jupyter Notebook in order to use hardware acceleration. The aim of the model is to classify each tweet as one of eight emotions: {'joy', 'surprise', 'sadness', 'anger', 'trust', 'fear', 'disgust', 'anticipation'}. Many of the steps taken were only practical thanks to the GPU-accelerated performance of the Jupyter Notebook, which let me run more epochs and try more versions of the model.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 2. Importing Files
Because the JSON file is nested, the original DataFrame did not expose all of the information needed. A second DataFrame was therefore created with json_normalize to obtain the tweet_id and the text, which were the important fields. The tweet_id was needed to combine this DataFrame with the ones that contain the train/test identification and the emotion labels, so that every row shares the tweet text through its tweet_id. Afterwards, separate DataFrames were created to keep the training and testing data apart, along with the emotion labels.
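The flatten-then-join idea can be sketched on toy data. The records, IDs, and labels below are hypothetical stand-ins for the real files, and the sketch joins with `merge` on `tweet_id` rather than the `concat` on a shared index used in the cells that follow; both approaches line rows up by `tweet_id`.

```python
import pandas as pd

# Hypothetical records shaped like the nested tweet JSON (one object per line).
records = [
    {"_source": {"tweet": {"tweet_id": "0x1", "text": "so happy today"}}},
    {"_source": {"tweet": {"tweet_id": "0x2", "text": "what is going on"}}},
    {"_source": {"tweet": {"tweet_id": "0x3", "text": "this is unfair"}}},
]
raw = pd.DataFrame(records)

# Flatten the nested dicts; columns come out as "tweet.tweet_id", "tweet.text".
flat = pd.json_normalize(raw["_source"].tolist())
# Strip the "tweet." prefix (regex=True is needed for a pattern in pandas 2.0+).
flat.columns = flat.columns.str.replace(r"^tweet\.", "", regex=True)

# Hypothetical identification and emotion tables keyed by tweet_id.
ident = pd.DataFrame({"tweet_id": ["0x1", "0x2", "0x3"],
                      "identification": ["train", "test", "train"]})
emotion = pd.DataFrame({"tweet_id": ["0x1", "0x3"],
                        "emotion": ["joy", "anger"]})

# Left joins keep every tweet; test tweets simply get a missing emotion.
merged = (flat.merge(ident, on="tweet_id", how="left")
              .merge(emotion, on="tweet_id", how="left"))

train = merged[merged["identification"] == "train"].reset_index(drop=True)
test = merged[merged["identification"] == "test"].reset_index(drop=True)
print(train)
print(test)
```

The left joins make it explicit that test tweets have no emotion label, which is why only the training rows can supply labels later on.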

In [None]:
#Creating Twitter DF From Json Files
twitterDM = "Twitter/tweets_DM.json"
twitter_df = pd.read_json(twitterDM, lines=True)
twitter_df.head()

In [None]:
#Getting the Tweet Info From the _source Column of the Original DF
twitterDM_df = pd.json_normalize(twitter_df._source, record_prefix=None)
#Strip the leading "tweet." prefix; regex=True is required for a pattern in pandas 2.0+
twitterDM_df.columns = twitterDM_df.columns.str.replace(r'^tweet\.', '', regex=True)
twitterDM_df

In [None]:
#Creating Data_Identification DF
twitterID = "Twitter/data_identification.csv"
twitid_df = pd.read_csv(twitterID)
twitid_df

In [None]:
#Creating Emotion DF
emotion = "Twitter/emotion.csv"
twitemo_df = pd.read_csv(emotion)
twitemo_df

In [None]:
#Combining DataFrames
dfs = [twitterDM_df,twitid_df,twitemo_df]
twitcom_df = pd.concat([x.set_index('tweet_id') for x in dfs], axis=1).reset_index()
twitcom_df

In [None]:
#Extracting the Training and testing Data
twitTrain_df = twitcom_df.loc[twitcom_df['identification'] == 'train']
twitTest_df = twitcom_df.loc[twitcom_df['identification'] == 'test']


### Indexes had to be reset for other uses after everything was separated.

In [None]:
#Resetting and Testing Train Dataframe
twitTrain_df.reset_index(drop=True, inplace=True)
twitTrain_df

In [None]:
#Resetting and Testing Test Dataframe
twitTest_df.reset_index(drop=True, inplace=True)
twitTest_df

In [None]:
#Extracting and Testing Training Data
tweets_train = twitTrain_df['text']
tweets_test = twitTest_df['text']
emolabel_train = twitTrain_df['emotion']
tweets_train[50], emolabel_train[50]

# 3. Preprocessing
Originally, no preprocessing was done, because I thought I would still be able to reach an accuracy of 0.7 or above without it. That assumption proved to be true. Even so, I decided to add preprocessing to see how far I could take this. The preprocessing step mostly removes usernames, one-letter words, unnecessary stop words, and extra spaces, lowercases the text for uniformity, and lemmatizes the words. I wanted to do more, but I think I did enough (I got tired). I also tested whether removing stop words performed better or worse; you can tell which option I chose from its inclusion or exclusion in the code immediately below. I plan to use emoji detection as well.

In [None]:
################################# PREVIOUS VERSION####################################
# import nltk
# from nltk.corpus import stopwords
# from nltk.stem import WordNetLemmatizer
# from nltk.tokenize import word_tokenize
# nltk.download('wordnet')
# nltk.download('omw-1.4')
# import re
# import emoji 

# def preprocess_text(Tweet):
#         #Removes Username
#         Tweet = re.sub('@[^\s]+','',Tweet)
        
#         #Demojize Emojis
#         Tweet = emoji.demojize(Tweet)
#         Tweet = Tweet.replace(":"," ")
#         Tweet = ' '.join(Tweet.split())
        
#         #Removes LH
#         Tweet = re.sub('<[^\s]+','',Tweet)
        
#         #Removes Hashtag symbol
#         Tweet = re.sub('#','',Tweet)
        
#         # Remove punctuation and numbers
#         Tweet = re.sub('[^a-zA-Z]', ' ', Tweet)
               
#         # Remove single characters
#         Tweet = re.sub(r"\s+[a-zA-Z]\s+", ' ', Tweet)

#         # remove multiple spaces
#         Tweet = re.sub(r'\s+', ' ', Tweet)
#         Tweet = Tweet.lower()
        
#         # Convert Text sentence to Tokens
#         Tweet = word_tokenize(Tweet)
       
#         #Remove unnecessary stopwords #####Disabled#####
#         stop_words = stopwords.words('english')
#         filtered_text = []
#         for t in Tweet:
#             #if t not in stop_words:
#             filtered_text.append(t)

#         # Word lemmatization
#         wordnet_lemmatizer = WordNetLemmatizer()
#         processed_text1 = []
#         for t in filtered_text:
#             word1 = wordnet_lemmatizer.lemmatize(t, pos="n")
#             word2 = wordnet_lemmatizer.lemmatize(word1, pos="v")
#             word3 = wordnet_lemmatizer.lemmatize(word2, pos=("a"))
#             processed_text1.append(word3)

#         result = ""
#         for word in processed_text1:
#             result = result + word + " "
#         result = result.rstrip() 
        
#         return result

# 3.1 Optimized Pre-Processing
After giving up on pre-processing beyond what Keras does, I decided to try it again. The following method came from an accidental eureka moment: while looking for other tokenizers, I found the NLTK TweetTokenizer, which meshed well with the rest of the code and with the Keras module. This version performs better than the previous one. It does everything the previous one did, but keeps the emojis themselves instead of converting them to text. It may still have issues with some very rare symbols.
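To make the effect of these cleaning steps concrete, here is a rough standard-library approximation. It is only a sketch: the function name `rough_clean` and the sample tweet are hypothetical, and plain regular expressions stand in for `TweetTokenizer` and `clean`, so edge cases (punctuation splitting, currency symbols) will differ from the real cell below.

```python
import re

def rough_clean(tweet: str) -> str:
    """Approximate the notebook's cleaning steps with regexes only."""
    tweet = re.sub(r"@\w+", "", tweet)               # drop @handles (like strip_handles=True)
    tweet = re.sub(r"<[^\s]+", "", tweet)            # drop placeholder tokens such as <LH>
    tweet = tweet.replace("#", "")                   # keep the hashtag word, drop the '#'
    tweet = re.sub(r"[0-9]+", "", tweet)             # drop digits
    tweet = re.sub(r"(.)\1{2,}", r"\1\1\1", tweet)   # cap character runs at 3 (like reduce_len=True)
    return " ".join(tweet.lower().split())           # lowercase and collapse whitespace

sample = "@user I am sooooo #happy today <LH> 100 😀"
cleaned = rough_clean(sample)
print(cleaned)
```

Note that the emoji survives the cleaning, which is the main difference from the earlier version that converted emojis to text.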

In [None]:
################################# UPDATED VERSION WITH BETTER RESULTS###########################

import re
from nltk.tokenize import TweetTokenizer
from cleantext import clean

def preprocess_text(Tweet):
    #Removes placeholder tokens such as <LH>
    Tweet = re.sub(r'<[^\s]+', '', Tweet)
    
    #Removes Hashtag symbol (the hashtag word itself is kept)
    Tweet = re.sub('#', '', Tweet)
    
    #Removes Numbers
    Tweet = re.sub(r'[0-9]+', '', Tweet)
        
    #Removes Currency symbols
    Tweet = clean(Tweet, no_currency_symbols=True, replace_with_currency_symbol="")
    
    #Tokenizer Using Tweet Tokenizer: strips @handles, shortens repeated characters, lowercases
    tknzr = TweetTokenizer(strip_handles=True, reduce_len=True, preserve_case=False)
    Tweet_t = tknzr.tokenize(Tweet)
    
    #Rejoin the tokens into a single space-separated string
    return " ".join(Tweet_t)

In [None]:
tweets_train_pre = []
for c in tweets_train:
    tweets_train_pre.append(preprocess_text(c))

In [None]:
tweets_test_pre = []
for c in tweets_test:
    tweets_test_pre.append(preprocess_text(c))

In [None]:
print(tweets_train[50])
print(tweets_train_pre[50])
print(emolabel_train[50])

In [None]:
print(tweets_test[50])
print(tweets_test_pre[50])

# 4. Tokenization
As mentioned before, certain decisions were driven by how fast things could run. That started here, with the use of TensorFlow and Keras, and with attempts to write faster code that instead made things take longer. I went with the Keras Tokenizer because it let me cap the number of words used and because it works well with padding. This solution performed well, as it still provided a relatively decent prediction accuracy overall.
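The idea behind a word-index tokenizer with a vocabulary cap can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the helper names `fit_vocab` and `to_sequence` and the toy texts are hypothetical, and details such as Keras's punctuation filtering are left out. Words are ranked by frequency starting at index 1, and only words whose index is below `num_words` are kept.

```python
from collections import Counter

def fit_vocab(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    # most_common() sorts by count (descending); ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, vocab, num_words):
    """Map words to indices, dropping unknown words and words beyond the cap."""
    return [vocab[w] for w in text.split() if w in vocab and vocab[w] < num_words]

texts = ["so happy today", "so sad today", "so tired"]
vocab = fit_vocab(texts)

print(vocab)
print(to_sequence("so happy today", vocab, num_words=4))
print(to_sequence("so sad tired", vocab, num_words=4))
```

With a small cap, rare words fall out of the sequences entirely, which is the trade-off behind tuning `num_words` by trial and error.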

In [None]:
#Initialization of Tokenization of Tweets using Keras Tokenizer
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words = 50000)#num_words chosen by trial and error.

tokenizer.fit_on_texts(tweets_train_pre) 

print(tokenizer.texts_to_sequences([tweets_train_pre[50]]))

### Because the model works best with fixed-length input, the data must be padded before it can be used.

In [None]:
#First we need to see how long the sequences are in order to pick a padding length.
import matplotlib.pyplot as plt

#Measured on the preprocessed tweets, since those are what get tokenized and padded
lengths = [len(t.split()) for t in tweets_train_pre]
plt.figure(figsize=(20, 10))
plt.hist(lengths, bins=len(set(lengths)))
plt.show()

### The histogram shows the length distribution of the tweets, so the padding length can be chosen to cover most of them.
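As a minimal sketch of what post-padding and post-truncation do, and of how a candidate length can be checked against the length distribution, the pure-Python helpers below mimic the behaviour of `pad_sequences(..., padding='post', truncating='post')`. The names `pad_post` and `coverage` and the toy sequences are hypothetical.

```python
def coverage(lengths, maxlen):
    """Fraction of sequences that fit within maxlen without truncation."""
    return sum(1 for n in lengths if n <= maxlen) / len(lengths)

def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to its first maxlen items, then pad the end with `value`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                              # truncating='post'
        padded.append(seq + [value] * (maxlen - len(seq)))    # padding='post'
    return padded

token_ids = [[5, 2], [1, 2, 3, 4, 5, 6], [7, 8, 9]]
lengths = [len(seq) for seq in token_ids]

print(coverage(lengths, maxlen=4))
print(pad_post(token_ids, maxlen=4))
```

A length that covers most sequences keeps truncation rare while avoiding long runs of padding zeros.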

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = tokenizer.texts_to_sequences(tweets_train_pre)
tweets_train_pad = pad_sequences(sequences, truncating='post', maxlen=30, padding='post')
tweets_train_pad[50]

### Labels were set, and functions were created to convert the emotions to numeric values for training and to convert those values back to strings.


In [None]:
classes = sorted(set(emolabel_train))  #sorted so the label-to-index mapping is the same on every run
print(classes)

In [None]:
classes_to_index = dict((c, i) for i, c in enumerate(classes))
index_to_classes = dict((v, k) for k, v in classes_to_index.items())

In [None]:
classes_to_index

In [None]:
index_to_classes

In [None]:
#Function used to convert string values to numeric values to be used in training along with the padded data.
names_to_ids = lambda labels: np.array([classes_to_index.get(x) for x in labels])
ids_to_names = lambda labels: np.array([index_to_classes.get(x) for x in labels])

In [None]:
train_emolabels = names_to_ids(emolabel_train)
print (tweets_train_pre[0])
print (emolabel_train[0])
print(train_emolabels[0])

# 5. Model Creation
#### Genuinely just relied on Google for the formatting and trial and error.
The model is an RNN built from Keras layers. It uses LSTM layers to handle longer text and to mitigate a core problem of plain RNNs: over long sequences the gradients can vanish or explode, which corrupts the weight updates.
That being said, even without pre-processing it still performed relatively well. The original training used batches of 64, to save time, and 25 epochs. A batch size of 64, larger than the usual 32, may have been faster but could also have affected accuracy. For the second round, with the preprocessed data, these were changed to a batch size of 32 and 10 epochs. The model uses a Keras Embedding layer followed by two bidirectional LSTM layers to capture the meaning of each word in the context of the other words in the tweet. The final Dense layer has one unit per emotion (8) and uses softmax to produce the predicted probabilities.
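How an 8-unit softmax output turns into a single predicted emotion can be sketched with NumPy. The raw scores below are made-up values for one tweet, not real model output, and the alphabetical label order is an assumption made for the example.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Assumed label order for illustration (alphabetical).
labels = sorted(["joy", "surprise", "sadness", "anger",
                 "trust", "fear", "disgust", "anticipation"])

# Hypothetical raw scores for one tweet, one score per emotion.
logits = np.array([[0.2, 0.1, -1.0, 0.0, 2.5, 0.3, -0.5, 1.0]])

probs = softmax(logits)
pred_index = probs.argmax(axis=-1)          # index of the most likely class
pred_name = [labels[i] for i in pred_index]

print(probs.round(3))
print(pred_name)
```

Taking `argmax` over the last axis is the same step the prediction cell applies to the output of `model4.predict`.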

In [None]:
import tensorflow as tf

model4 = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(50000, 16, input_length=30),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20)),
    tf.keras.layers.Dense(8, activation='softmax')
])

model4.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

model4.summary()

# 6. Model Training
Having no validation data sure made things interesting.
Original training = 18/25 epochs with batches of 64.
Current = 10 epochs with batches of 32.
There was an issue where the predict function did not produce the desired results, and I thought the model was wrong. It turned out I was using an outdated TensorFlow model function, which taught me the importance of reading a module's documentation and expected formats. The working model produces a single class index per tweet, which is then converted back to an emotion name with the previously defined ids_to_names function. The best set of instructions and parameters I found is what is in this document now.

The various versions were compared after 1 epoch each, and the current model, the No Pre-Processing Model, consistently placed best. In order of performance, the others were the model with stop words and demojize included, and last the Pre-Processing Model that removed most things.

That being said, while this first model was indeed the best, working with the other models let me narrow down the num_words, sequence length, and epochs to more or less the best performance I could get.
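The training cell uses an `EarlyStopping` callback that monitors training accuracy with `patience=2`. The patience rule can be sketched in plain Python as follows. This is a simplified approximation, not the Keras implementation; the function name `early_stop_epoch` and the accuracy values are hypothetical.

```python
def early_stop_epoch(accuracies, patience=2):
    """Return the 1-based epoch at which training would stop.

    Training stops once the monitored value has failed to improve on its
    best-so-far for `patience` consecutive epochs; otherwise it runs to the end.
    """
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(accuracies, start=1):
        if acc > best:
            best = acc
            wait = 0          # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch  # no improvement for `patience` epochs
    return len(accuracies)

# Hypothetical per-epoch training accuracies.
history = [0.50, 0.60, 0.65, 0.64, 0.65, 0.70]
print(early_stop_epoch(history, patience=2))
```

Because the monitored value here is training accuracy rather than a validation metric, this rule only detects that training has stalled, not that the model has started to overfit.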


In [None]:
# training!
epochs = 10
batch_size = 32

history4 = model4.fit(tweets_train_pad, train_emolabels, 
                    epochs=epochs, 
                    batch_size=batch_size, 
                    callbacks = [tf.keras.callbacks.EarlyStopping(monitor='accuracy', patience=2)]
                    #validation_data = (X_test, y_test))
                   )
print('training finish')

In [None]:
#Padding for Test Data
sequences_test = tokenizer.texts_to_sequences(tweets_test_pre)
tweets_test_pad = pad_sequences(sequences_test, truncating='post', maxlen=30, padding='post')
tweets_test_pad[50]

In [None]:
#Predicting
pred = np.argmax(model4.predict(tweets_test_pad), axis=-1)


In [None]:
pred[50]

In [None]:
pred.shape, tweets_test_pad.shape

### Converting Prediction index to name and comparing test results.

In [None]:
pred_result = ids_to_names(pred)
pred_result[:5]

In [None]:
print(tweets_test[50])
print(pred_result[50])

In [None]:
my_submission = pd.DataFrame({'id': twitTest_df.tweet_id, 'emotion': pred_result})
my_submission.to_csv('submissionOPP10Epo50Kl30.csv', index=False)

### Optimized Model outperforms other Pre-Processing Models but not the Main Model