# Paul Adams
# DS7337 Natural Language Processing
# Final Exam
# 9 August 2020

1. [Build a model then predict the sentiment based on a sequence of characters and a second model based on a sequence of bi-grams](#question1)
    1. [Sequence of Unigram Characters Model - Start](#question1a)
        1. [Sequence of Unigram Characters Model - Model Results](#question1ab)
    2. [Sequence of Bi-Gram Characters Model](#question1b)
        1. [Sequence of Bi-Gram Characters Model - Model Results](#question1bb)
2. [What is the vector you learned for the following emoji?](#question1ca)
    1. [Unigram Model Output for Emoji Embedding](#question1cb)
    2. [The vector learned for the emoji: 😂](#question1cc)
        1. [Loading the saved model](#question1cd)
        2. [Load the saved word list](#question1ce)
        3. [The learned vector for 😂](#question1cf)
3. [What is the most similar character for the emoji 😂?](#question1da)
4. [Build a Universal Sentence Encoder (USE) Model and GRU](#question1e)
    1. [Universal Sentence Encoder Models](#question1ea)
        1. [USE Model Results](#question1eb)
    2. [Gated Recurrent Unit Models](#question1ec)
        1. [GRU Model Results](#question1ed)

In [None]:
import pandas as pd
import numpy as np
import os
import re
import tensorflow as tf
# fix random seed for reproducibility
np.random.seed(7)

# display 400 characters of column width
pd.options.display.max_colwidth = 400

In [None]:
print("Tensorflow version: {}".format(tf.__version__))

# **PART I: Sequence of Characters (Unigram) Model**  <a class="anchor" id="question1"></a>

Given the GOP twitter dataset (a dataset of tweets from 2012 with 3 sentiments—see the attached file)

# 1. Build a model then predict the sentiment (column “Sentiment”) of the tweet based on a sequence of characters and a second model based on a sequence of bi-grams (2-letter sequences).

I chose to use Gated Recurrent Unit (GRU) - as opposed to a Recurrent Neural Net (RNN) - because of the high volume of incomprehensible words such as hashtags (GOPdebate is one of the top words used in the whole dataset) and misspellings in the tweets. My opinion is that these words contribute greatly to the lack of confidence indicated in the tweet sentiment confidence provided for the pre-labeled sentiment classes.

Additionally, there are words that appear to have deep cultural meaning, such as the previously mentioned hashtags. I left these words in because the GRU can consider wider ranges of the sentence sequences simultaneously, thus capturing more of the context in which these words are repeatedly framed. For example, GOPdebate; while this word doesn't necessarily mean anything in terms of a word you would find in a traditional, academically accepted lexicon, there is underlying meaning which GRU is able to extract without resulting in a significant issue of vanishing gradients during back-propogation (an issue an RNN would likely suffer from). Therefore, by capturing meaning from these contexts, GRU is able to keep meaningful weights and biases for nodes that would otherwise be minimized from those vanishing gradients and thus result in the model's failure to continue learning. For this reason, I selected GRU as my model of choice.

In [None]:
df = pd.read_csv('./Sentiment.csv')

### Important to note is that the provided confidence of the pre-labeling of Tweet sentiments is roughly 76%. Therefore, the benchmark of perfect model accuracy should approximate this value. Please see below:

In [None]:
print("The average confidence of all pre-scored sentiments: {}%".format(100*round(df['sentiment_confidence'].mean(),4)))

In [None]:
len(df)

In [None]:
df_senti = df[['sentiment','text']]

### DATA-CLEANING-START

In [None]:
def remove_specials(doc):
    """ removes all but alphanumeric, newline escape characters, and replaces hyphens with spaces for hyphenated words to not become one, but two"""
    doc = re.sub('-', ' ', doc)
    doc = re.sub('_', ' ', doc)
    pattern = r"[^a-zA-z0-9\s]+"
    doc = re.sub(pattern, '', doc)
    return doc

Note: This is a three-target-class predictive analysis

In [None]:
df_senti.head()

In [None]:
df_senti['sentiment'].replace("Negative", 0, inplace=True)
df_senti['sentiment'].replace("Neutral", 1, inplace=True)
df_senti['sentiment'].replace("Positive", 2, inplace=True)

In [None]:
# df_senti['text'] = df_senti['text'].str.replace('http://', '', case=False)
# df_senti['text'] = df_senti['text'].str.replace('https://', '', case=False)
# df_senti['text'] = df_senti['text'].str.replace('http\S+|www.\S+', '', case=False)
# df_senti['text'] = df_senti['text'].str.replace('https\S+|www.\S+', '', case=False)

In [None]:
for i in np.arange(0, len(df_senti)):
    if df_senti['text'][i].strip()[0:2] == "RT":
        df_senti['text'][i] = df_senti['text'][i][2:]
    
    df_senti['text'][i] = remove_specials(df_senti['text'][i])

In [None]:
for i in np.arange(0,len(df_senti)):
    df_senti['text'][i] = ' '.join(word for word in df_senti['text'][i].split(' ') if not word.startswith('http'))

In [None]:
len(df_senti)

In [None]:
df_senti['text'] = df_senti['text'].str.strip().str.lower()

In [None]:
df_senti['text'].str.strip().isna().sum()

In [None]:
for i in np.arange(len(df_senti['text'])):
    df_senti['text'][i] = df_senti['text'][i].replace('\\n', ' ')
    
for i in np.arange(len(df_senti['text'])):
    df_senti['text'][i] = df_senti['text'][i].replace("\n", "")

In [None]:
df_senti.iloc[0,1].replace(' ', '')

In [None]:
#args=pd.DataFrame()
args=[]

for j in range(len(df_senti)):
    args.append(list([df_senti.iloc[j,1].replace(' ', '')[i:i+1] for i in range(len(df_senti.iloc[j,1].replace(' ', '')))]))

In [None]:
args_frame = pd.Series(args).replace('', '')
args_frame

In [None]:
for i in np.arange(len(args_frame)):
    args_frame[i] = ' '.join(args_frame[i])

In [None]:
df_senti2 = pd.concat([df_senti['sentiment'],args_frame], axis=1)
df_senti2.columns = ['sentiment','text']

In [None]:
df_senti2.head()

### DATA-CLEANING-EXIT

Function for adding the tweet words into a dictionary

In [None]:
# put words into a dictionary for downstream use
import collections
def build_dataset(words):
    count = collections.Counter(words).most_common() #.most_common(100) to use the 100 most common words; .most_common() means zero is the most common
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return dictionary, reverse_dictionary

## Encoding and decoding for the tweets, based on the word list.

In [None]:
word_list = []

for i in df_senti2['text']:
    word_list = word_list + i.split()

In [None]:
def encode_decode(input):
    enc, dec = build_dataset(word_list)
    return enc, dec

In [None]:
enc, dec = encode_decode(word_list)

## Preparing padding and unknown tokens for each sequence (sentence). Tokens will be added downstream with pad_sequences().

Start by shifting all sequences by two places to insert in a pad token and an unknown token in index locations 0 and 1. These will be specified within the keras pre-processing pad_sequences function downstream once the desired maximum sequence length is determined based on sentence length distribution analysis. These tokens will be added to both the encoded and decoded.

Including an unknown token in the validation data is helpful for ensuring the model can avoid bias based on the words it knows in the event it encounters a new word it doesn't know. The patterns around the unknown word could match to the unknown marker so in the event there truly is a new word or new context for a word used, it will provide inference based on the pattern of the context rather than force it to match meaning to a word it does know, but does not apply for the context. This helps the model provide meaning to the context of what is being said rather than the words used, such as in the example "I'm feeling blue because something happened" vs. "the sky is blue because something happened."

In [None]:
for i in enc:
    enc[i] = enc[i]+2 # shift everything by two so you can put in a pad and an unknown in index locations 0 and 1

                                                                    ###      ###
                                                                    # Encoding #
                                                                    ###      ###
enc['pad'] = 0

# start is useful for more complex architectures to invoke the LSTM to perform certain tasks, like decoding or recognizing the start of a sentence, for example.
#enc['<start>'] = 1

enc['<unk>'] = 1

                                                                    ###      ###
                                                                    # Decoding #
                                                                    ###      ###

# pad and include an unknown for the decoded values as well
dec[-2]='<pad>'
#dec[-1]='<start>'
dec[-1]='<unk>' # this is useful to indicate the LSTM should start decoding, or that this is the start of a sentence, or etc.

In [None]:
import numpy as np
n=int(np.floor(df_senti2.shape[0]*0.75)) # 75% for training
train = df_senti2[0:n]
test = df_senti2[n:]

In [None]:
train.head()

In [None]:
test.tail()

## Summarizing train/test split balance

In [None]:
train['sentiment'].value_counts()

In [None]:
test['sentiment'].value_counts()

Optional training data balancing approach (consider the trade-off between overfitting and bias before committing to downsampling classes):

In [None]:
# If you want to balance the training data:

# df1 = train[train['sentiment'] == 0]
# df1 = df1[:1701]

# df2 = train[train['sentiment'] == 1]
# df2 = df2[:1701]

# df3 = train[train['sentiment'] == 2]

# train = pd.concat([df1, df2, df3], axis=0)

In [None]:
df_senti2['y'] = 0
df_senti2.loc[df_senti2['sentiment']==1,'y'] = 1
df_senti2.loc[df_senti2['sentiment']==2,'y'] = 2

In [None]:
df_senti2.tail()

In [None]:
test.tail()

In [None]:
test.shape[0]

# Creating test examples for encoding and decoding sentences

In [None]:
x_train=[]
y_train=[]

for i in range(train.shape[0]):
    tmp = [enc[j] for j in train.iloc[i,1].split()] # enc[j]: the j (list expression) is encoding the number for the word in the encoded matrix (i,j) 
    x_train.append(tmp) # append the newly replaced word
    if train.iloc[i,0]==0: # re-encode y in the below
        y=0
    elif train.iloc[i,0]==1: # re-encode y in the below
        y=1
    else:
        y=2
    y_train.append(y) # append the newly encoded y here
    
x_test=[] # repeat for the test data the steps performed above for training data
y_test=[]
for i in range(test.shape[0]):
    tmp = [enc[j] for j in test.iloc[i,1].split()]
    x_test.append(tmp)
    if test.iloc[i,0]==0: # re-encode y in the below
        y=0
    elif test.iloc[i,0]==1: # re-encode y in the below
        y=1
    else:
        y=2
    y_test.append(y)

In [None]:
len(y_train)

In [None]:
np.unique(y_train)

In [None]:
np.unique(y_test)

In [None]:
len(x_train)

In [None]:
len(x_test)

In [None]:
df_senti2.iloc[0,1]

# Deciding the maximum vector sequence length
Here, I visualize the distribution of word counts in each tweet, then use the 90th percentile. If results aren't satisfactory, I can increase the percentile - to 95%, for example - but the 90th percentile is healthy for preventing overfitting.

In [None]:
import matplotlib.pyplot as plt

lengths=[]

for i in x_train:
    lengths.append(len(i))

%matplotlib inline
plt.hist(lengths,bins=25)

In [None]:
print("Sentence with least unigrams (characters) has {} characters.".format(min(lengths)))

In [None]:
print("90th percentile of all tweet character count volumes: {}".format(int(np.percentile(lengths, 90))))

In [None]:
print("Two standard-deviation range: {}\n".format([np.asarray(lengths).mean() - 2 * np.asarray(lengths).std(), np.asarray(lengths).mean() + 2 * np.asarray(lengths).std()]))

### Normalizing sentence word length by adding the padding and unknown tokens to the sequences (prepared above)

In [None]:
from tensorflow.keras.preprocessing import sequence
# Because most of the reviews in the histogram above are length 23 or less, setting max_length to 23 words:
max_length = 110
x_train = sequence.pad_sequences(x_train, maxlen = max_length)
x_test = sequence.pad_sequences(x_test, maxlen = max_length)

In [None]:
# Load Neural Nets
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.layers import GRU
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
x_train.shape

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist([len(i) for i in x_train])
plt.show()
### Note that after padding, all sentences are the same length (same number of parameters)

# Sequence of Characters Model Output <a class="anchor" id="question1ab"></a>

In [None]:
from keras.utils import plot_model
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

embedding_vector_length = 80
model1 = Sequential()
model1.add(Embedding(len(dec), embedding_vector_length, input_length=max_length))
model1.add(GRU(100, unroll=True, dropout=0.2)) # unroll makes this run faster; units between 100-300
model1.add(Dense(3, activation='softmax')) # 3 for the three classes
model1.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy']) # rmsprop did better than adam
print(model1.summary())

plot_model(model1, to_file='model_plot4a.png', show_shapes=True, show_layer_names=True)

In [None]:
model_GRU = model1.fit(x_train
                       , np.array(y_train)
                       , validation_data=(x_test, np.array(y_test))
                       , callbacks=[callback] # EarlyStopping
                       , epochs=20
                       , batch_size=32)

In [None]:
val_loss = model_GRU.history['val_loss']
min_loss_loc = np.where(val_loss==np.min(val_loss))[0][0]

fig, ax = plt.subplots(ncols=2, figsize = (15,7))
ax[0].plot(model_GRU.history['val_accuracy'], label = 'val_accuracy')
ax[0].plot(model_GRU.history['accuracy'], label = 'accuracy')
ax[0].vlines(min_loss_loc, *ax[0].get_ylim(), label = 'min_val_loss')
ax[0].set_title('Gated Recurrent Unit Accuracy Curves')
ax[0].legend();

ax[1].plot(model_GRU.history['val_loss'], label = 'val_loss')
ax[1].plot(model_GRU.history['loss'], label = 'loss')
ax[1].vlines(min_loss_loc, *ax[1].get_ylim(), label = 'min_val_loss')
ax[1].set_title('Gated Recurrent Unit Loss Curves')
ax[1].legend();

In [None]:
model_GRU = model1.fit(x_train
                       , np.array(y_train)
                       , validation_data=(x_test, np.array(y_test))
                       , callbacks=[callback]
                       , epochs=20
                       , batch_size=16)

In [None]:
val_loss = model_GRU.history['val_loss']
min_loss_loc = np.where(val_loss==np.min(val_loss))[0][0]

fig, ax = plt.subplots(ncols=2, figsize = (15,7))
ax[0].plot(model_GRU.history['val_accuracy'], label = 'val_accuracy')
ax[0].plot(model_GRU.history['accuracy'], label = 'accuracy')
ax[0].vlines(min_loss_loc, *ax[0].get_ylim(), label = 'min_val_loss')
ax[0].set_title('Gated Recurrent Unit Accuracy Curves')
ax[0].legend();

ax[1].plot(model_GRU.history['val_loss'], label = 'val_loss')
ax[1].plot(model_GRU.history['loss'], label = 'loss')
ax[1].vlines(min_loss_loc, *ax[1].get_ylim(), label = 'min_val_loss')
ax[1].set_title('Gated Recurrent Unit Loss Curves')
ax[1].legend();

In [None]:
model_GRU = model1.fit(x_train
                       , np.array(y_train)
                       , validation_data=(x_test, np.array(y_test))
                       , callbacks=[callback]
                       , epochs=20
                       , batch_size=64)

In [None]:
val_loss = model_GRU.history['val_loss']
min_loss_loc = np.where(val_loss==np.min(val_loss))[0][0]

fig, ax = plt.subplots(ncols=2, figsize = (15,7))
ax[0].plot(model_GRU.history['val_accuracy'], label = 'val_accuracy')
ax[0].plot(model_GRU.history['accuracy'], label = 'accuracy')
ax[0].vlines(min_loss_loc, *ax[0].get_ylim(), label = 'min_val_loss')
ax[0].set_title('Gated Recurrent Unit Accuracy Curves')
ax[0].legend();

ax[1].plot(model_GRU.history['val_loss'], label = 'val_loss')
ax[1].plot(model_GRU.history['loss'], label = 'loss')
ax[1].vlines(min_loss_loc, *ax[1].get_ylim(), label = 'min_val_loss')
ax[1].set_title('Gated Recurrent Unit Loss Curves')
ax[1].legend();

# **PART I: Sequence of Bi-gram (2-letter sequences) Model** <a class="anchor" id="question1b"></a>

In [None]:
df_senti.iloc[0,1].replace(' ', '')

In [None]:
#args=pd.DataFrame()
args=[]

for j in range(len(df_senti)):
    args.append(list([df_senti.iloc[j,1].replace(' ', '')[i:i+2] for i in range(len(df_senti.iloc[j,1].replace(' ', '')))]))

Stripping quotes:

In [None]:
args_frame = pd.Series(args).replace('', '')
args_frame

### Stripping commas:

In [None]:
for i in np.arange(len(args_frame)):
    args_frame[i] = ' '.join(args_frame[i])

Checking the first index (and comparing to above):

In [None]:
args_frame[0]

Checking the last index (and comparing to above):

In [None]:
args_frame[-1:]

In [None]:
for i in np.arange(len(args_frame)):
    args_frame[i] = args_frame[i].replace('\\n', ' ')

In [None]:
for i in np.arange(len(args_frame)):
    args_frame[i] = args_frame[i].replace("\n", "")

Confirming new length matches original length to ensure no data was inadvertantly truncated before concatenating back to form a character bi-grammed dataset:

In [None]:
len(args_frame)

In [None]:
len(df_senti['sentiment'])

In [None]:
df_senti2 = pd.concat([df_senti['sentiment'],args_frame], axis=1)

In [None]:
len(df_senti2)

In [None]:
df_senti2.columns = ['sentiment','text']

In [None]:
df_senti2.head(12)

In [None]:
word_list = []

for i in df_senti2['text']:
    word_list = word_list + i.split()

In [None]:
enc, dec = encode_decode(word_list)

In [None]:
df_senti2['y'] = 0
df_senti2.loc[df_senti2['sentiment']==1,'y'] = 1
df_senti2.loc[df_senti2['sentiment']==2,'y'] = 2

In [None]:
df_senti2.head()

In [None]:
import numpy as np
n=int(np.floor(df_senti2.shape[0]*0.75)) # 75% for training
train = df_senti2[0:n]
test = df_senti2[n:]

In [None]:
train['sentiment'].value_counts()

In [None]:
test['sentiment'].value_counts()

In [None]:
train.head(12)

In [None]:
train.shape[0]

In [None]:
test.shape[0]

In [None]:
x_train=[]
y_train=[]

for i in range(train.shape[0]):
    tmp = [enc[j] for j in train.iloc[i,1].split()] # enc[j]: the j (list expression) is encoding the number for the word in the encoded matrix (i,j) 
    x_train.append(tmp) # append the newly replaced word
    if train.iloc[i,0]==0: # re-encode y in the below
        y=0
    elif train.iloc[i,0]==1: # re-encode y in the below
        y=1
    else:
        y=2
    y_train.append(y) # append the newly encoded y here
    
x_test=[] # repeat for the test data the steps performed above for training data
y_test=[]
for i in range(test.shape[0]):
    tmp = [enc[j] for j in test.iloc[i,1].split()]
    x_test.append(tmp)
    if test.iloc[i,0]==0: # re-encode y in the below
        y=0
    elif test.iloc[i,0]==1: # re-encode y in the below
        y=1
    else:
        y=2
    y_test.append(y)

In [None]:
import matplotlib.pyplot as plt

lengths=[]

for i in x_train:
    lengths.append(len(i))

%matplotlib inline
plt.hist(lengths,bins=25)

In [None]:
min(lengths)

In [None]:
max(lengths)

Here, because of the volume of bigrams, I've chosen to use the 95th percentile whereas with the unigram approach, I chose to use the 90th percentile. There's less risk of overfitting here.

In [None]:
int(np.percentile(lengths, 95))

In [None]:
print([np.asarray(lengths).mean() - 2 * np.asarray(lengths).std(), np.asarray(lengths).mean() + 2 * np.asarray(lengths).std()])

In [None]:
from tensorflow.keras.preprocessing import sequence
# Because most of the reviews in the histogram above are length 23 or less, setting max_length to 23 words:
max_length = 112
x_train = sequence.pad_sequences(x_train, maxlen = max_length)
x_test = sequence.pad_sequences(x_test, maxlen = max_length)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist([len(i) for i in x_train])
plt.show()
### Note that after padding, all sentences are the same length (same number of parameters)

In [None]:
len(x_train)

# Sequence of Bi-Grams Model Output <a class="anchor" id="question1bb"></a>

In [None]:
embedding_vector_length = 128
model1 = Sequential()
model1.add(Embedding(len(dec), embedding_vector_length, input_length=max_length))
model1.add(GRU(100, unroll=True, dropout=0.2)) # unroll makes this run faster
model1.add(Dense(3, activation='softmax'))
#model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model1.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
print(model1.summary())

plot_model(model1, to_file='model_plot4a.png', show_shapes=True, show_layer_names=True)

In [None]:
model_GRU = model1.fit(x_train
                       , np.array(y_train)
                       , validation_data=(x_test, np.array(y_test))
                       , callbacks=[callback]
                       , epochs=20
                       , batch_size=32)

In [None]:
val_loss = model_GRU.history['val_loss']
min_loss_loc = np.where(val_loss==np.min(val_loss))[0][0]

fig, ax = plt.subplots(ncols=2, figsize = (15,7))
ax[0].plot(model_GRU.history['val_accuracy'], label = 'val_accuracy')
ax[0].plot(model_GRU.history['accuracy'], label = 'accuracy')
ax[0].vlines(min_loss_loc, *ax[0].get_ylim(), label = 'min_val_loss')
ax[0].set_title('Gated Recurrent Unit Accuracy Curves')
ax[0].legend();

ax[1].plot(model_GRU.history['val_loss'], label = 'val_loss')
ax[1].plot(model_GRU.history['loss'], label = 'loss')
ax[1].vlines(min_loss_loc, *ax[1].get_ylim(), label = 'min_val_loss')
ax[1].set_title('Gated Recurrent Unit Loss Curves')
ax[1].legend();

Relative to the batch size of 32, batch size 16 seems to overfit a little less. Validation metrics are much less outperformed by training with batch size 16 than with batch size 32. Batch size 64 performs the worst, as expected, of the three (16, 32, and 64).

In [None]:
model_GRU = model1.fit(x_train
                       , np.array(y_train)
                       , validation_data=(x_test, np.array(y_test))
                       , callbacks=[callback]
                       , epochs=20
                       , batch_size=16)

In [None]:
val_loss = model_GRU.history['val_loss']
min_loss_loc = np.where(val_loss==np.min(val_loss))[0][0]

fig, ax = plt.subplots(ncols=2, figsize = (15,7))
ax[0].plot(model_GRU.history['val_accuracy'], label = 'val_accuracy')
ax[0].plot(model_GRU.history['accuracy'], label = 'accuracy')
ax[0].vlines(min_loss_loc, *ax[0].get_ylim(), label = 'min_val_loss')
ax[0].set_title('Gated Recurrent Unit Accuracy Curves')
ax[0].legend();

ax[1].plot(model_GRU.history['val_loss'], label = 'val_loss')
ax[1].plot(model_GRU.history['loss'], label = 'loss')
ax[1].vlines(min_loss_loc, *ax[1].get_ylim(), label = 'min_val_loss')
ax[1].set_title('Gated Recurrent Unit Loss Curves')
ax[1].legend();

In [None]:
model_GRU = model1.fit(x_train
                       , np.array(y_train)
                       , validation_data=(x_test, np.array(y_test))
                       , callbacks=[callback]
                       , epochs=20
                       , batch_size=64)

In [None]:
val_loss = model_GRU.history['val_loss']
min_loss_loc = np.where(val_loss==np.min(val_loss))[0][0]

fig, ax = plt.subplots(ncols=2, figsize = (15,7))
ax[0].plot(model_GRU.history['val_accuracy'], label = 'val_accuracy')
ax[0].plot(model_GRU.history['accuracy'], label = 'accuracy')
ax[0].vlines(min_loss_loc, *ax[0].get_ylim(), label = 'min_val_loss')
ax[0].set_title('Gated Recurrent Unit Accuracy Curves')
ax[0].legend();

ax[1].plot(model_GRU.history['val_loss'], label = 'val_loss')
ax[1].plot(model_GRU.history['loss'], label = 'loss')
ax[1].vlines(min_loss_loc, *ax[1].get_ylim(), label = 'min_val_loss')
ax[1].set_title('Gated Recurrent Unit Loss Curves')
ax[1].legend();

# 2. What is the vector you learned for the following emoji. <a class="anchor" id="question1ca"></a>

For parts 2 and 3, I used the findings and pre-trained vectors (in the binary file) from the *Learning Emoji Representations from their Description* research paper and project. I used the Gensim library to load the pre-trained vectors and guage cosine similarity to identify the closest matching emojis.

Paper: https://arxiv.org/pdf/1609.08359.pdf
</br>
GitHub (pre-trained vectors): https://github.com/uclnlp/emoji2vec

In [None]:
import pandas as pd
import numpy as np
import os
import re
import tensorflow as tf
# fix random seed for reproducibility
np.random.seed(7)

# display 400 characters of column width
pd.options.display.max_colwidth = 400

### This cleaning process is the same as for the unigram modeling approach

In [None]:
df = pd.read_csv('./Sentiment.csv')
df_senti = df[['sentiment','text']]


df_senti['sentiment'].replace("Negative", 0, inplace=True)
df_senti['sentiment'].replace("Neutral", 1, inplace=True)
df_senti['sentiment'].replace("Positive", 2, inplace=True)

In [None]:
df_senti['text'] = df_senti['text'].str.strip('[]')

In [None]:
df_senti['text']

In [None]:
for i in np.arange(0, len(df_senti)):
    if df_senti['text'][i].strip()[0:2] == "RT":
        df_senti['text'][i] = df_senti['text'][i][2:]

for i in np.arange(0,len(df_senti)):
    df_senti['text'][i] = ' '.join(word for word in df_senti['text'][i].split(' ') if not word.startswith('http'))


for i in np.arange(len(df_senti['text'])):
    df_senti['text'][i] = df_senti['text'][i].replace('\\n', ' ')
    
for i in np.arange(len(df_senti['text'])):
    df_senti['text'][i] = df_senti['text'][i].replace("\n", "")

In [None]:
df_senti['text']

In [None]:
df_senti['text'].head()

In [None]:
df_senti.iloc[0,1].replace(' ', '')

In [None]:
#args=pd.DataFrame()
args=[]

for j in range(len(df_senti)):
    args.append(list([df_senti.iloc[j,1].replace(' ', '')[i:i+1] for i in range(len(df_senti.iloc[j,1].replace(' ', '')))]))

In [None]:
args_frame = pd.Series(args).replace('', '')
args_frame

In [None]:
for i in np.arange(len(args_frame)):
    args_frame[i] = ' '.join(args_frame[i])

In [None]:
df_senti2 = pd.concat([df_senti['sentiment'],args_frame], axis=1)
df_senti2.columns = ['sentiment','text']

In [None]:
df_senti2.head()

In [None]:
# put words into a dictionary for downstream use
import collections
def build_dataset(words):
    count = collections.Counter(words).most_common() #.most_common(100) to use the 100 most common words; .most_common() means zero is the most common
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return dictionary, reverse_dictionary

In [None]:
word_list = []

for i in df_senti2['text']:
    word_list = word_list + i.split()

In [None]:
def encode_decode(input):
    enc, dec = build_dataset(word_list)
    return enc, dec

In [None]:
enc, dec = encode_decode(word_list)

In [None]:
list(list(enc.items())[85])

In [None]:
for i in enc:
    enc[i] = enc[i]+2 # shift everything by two so you can put in a pad and an unknown in index locations 0 and 1

                                                                    ###      ###
                                                                    # Encoding #
                                                                    ###      ###
enc['pad'] = 0

# start is useful for more complex architectures to invoke the LSTM to perform certain tasks, like decoding or recognizing the start of a sentence, for example.
#enc['<start>'] = 1

enc['<unk>'] = 1

                                                                    ###      ###
                                                                    # Decoding #
                                                                    ###      ###

# pad and include an unknown for the decoded values as well
dec[-2]='<pad>'
#dec[-1]='<start>'
dec[-1]='<unk>' # this is useful to indicate the LSTM should start decoding, or that this is the start of a sentence, or etc.

In [None]:
import numpy as np
n=int(np.floor(df_senti2.shape[0]*0.75)) # 75% for training
train = df_senti2[0:n]
test = df_senti2[n:]

In [None]:
train.head()

In [None]:
train['sentiment'].value_counts()

In [None]:
test['sentiment'].value_counts()

In [None]:
df_senti2['y'] = 0
df_senti2.loc[df_senti2['sentiment']==1,'y'] = 1
df_senti2.loc[df_senti2['sentiment']==2,'y'] = 2

In [None]:
x_train=[]
y_train=[]

for i in range(train.shape[0]):
    tmp = [enc[j] for j in train.iloc[i,1].split()] # enc[j]: the j (list expression) is encoding the number for the word in the encoded matrix (i,j) 
    x_train.append(tmp) # append the newly replaced word
    if train.iloc[i,0]==0: # re-encode y in the below
        y=0
    elif train.iloc[i,0]==1: # re-encode y in the below
        y=1
    else:
        y=2
    y_train.append(y) # append the newly encoded y here
    
x_test=[] # repeat for the test data the steps performed above for training data
y_test=[]
for i in range(test.shape[0]):
    tmp = [enc[j] for j in test.iloc[i,1].split()]
    x_test.append(tmp)
    if test.iloc[i,0]==0: # re-encode y in the below
        y=0
    elif test.iloc[i,0]==1: # re-encode y in the below
        y=1
    else:
        y=2
    y_test.append(y)

In [None]:
len(y_train)

In [None]:
np.unique(y_train)

In [None]:
np.unique(y_test)

In [None]:
len(x_train)

In [None]:
len(x_test)

In [None]:
df_senti2.iloc[0,1]

In [None]:
import matplotlib.pyplot as plt

lengths=[]

for i in x_train:
    lengths.append(len(i))

%matplotlib inline
plt.hist(lengths,bins=25)

In [None]:
print("90th percentile of all tweet character count volumes: {}".format(int(np.percentile(lengths, 90))))

In [None]:
from tensorflow.keras.preprocessing import sequence
# Because most of the reviews in the histogram above are length 23 or less, setting max_length to 23 words:
max_length = 118
x_train = sequence.pad_sequences(x_train, maxlen = max_length)
x_test = sequence.pad_sequences(x_test, maxlen = max_length)

In [None]:
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.layers import GRU
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
print("{} rows of {} sequences".format(x_train.shape[0], x_train.shape[1]))

# **Unigram Model Output for Emoji Embedding** <a class="anchor" id="question1cb"></a>

In [None]:
from keras.utils import plot_model
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

embedding_vector_length = 80
model_emoji = Sequential()
model_emoji.add(Embedding(len(dec), embedding_vector_length, input_length=max_length))
model_emoji.add(GRU(100, unroll=True, dropout=0.2)) # unroll makes this run faster; units between 100-300
model_emoji.add(Dense(3, activation='softmax')) # 3 for the three classes
model_emoji.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy']) # rmsprop did better than adam
print(model_emoji.summary())

plot_model(model_emoji, to_file='model_plot4a.png', show_shapes=True, show_layer_names=True)

In [None]:
model_emojis = model_emoji.fit(x_train
                       , np.array(y_train)
                       , validation_data=(x_test, np.array(y_test))
                       , callbacks=[callback]
                       , epochs=20
                       , batch_size=32)

In [None]:
val_loss = model_emojis.history['val_loss']
min_loss_loc = np.where(val_loss==np.min(val_loss))[0][0]

fig, ax = plt.subplots(ncols=2, figsize = (15,7))
ax[0].plot(model_emojis.history['val_accuracy'], label = 'val_accuracy')
ax[0].plot(model_emojis.history['accuracy'], label = 'accuracy')
ax[0].vlines(min_loss_loc, *ax[0].get_ylim(), label = 'min_val_loss')
ax[0].set_title('GRU for Emojis Accuracy Curves')
ax[0].legend();

ax[1].plot(model_emojis.history['val_loss'], label = 'val_loss')
ax[1].plot(model_emojis.history['loss'], label = 'loss')
ax[1].vlines(min_loss_loc, *ax[1].get_ylim(), label = 'min_val_loss')
ax[1].set_title('GRU for Emojis Loss Curves')
ax[1].legend();

# The vector learned for the emoji: 😂  <a class="anchor" id="question1cc"></a>

Saving the Twitter model trained (includes emojis):

In [None]:
model_emoji.save('./emoji_model.h5')

# Loading the Saved Model <a class="anchor" id="question1cd"></a>

In [None]:
from tensorflow import keras
model_emoji = keras.models.load_model('./emoji_model.h5')

In [None]:
embeddings = model_emoji.get_weights()[0]

Pickle the word list

In [None]:
import pickle

with open("./word_list.txt", "wb") as wl:
    pickle.dump(word_list, wl)

# Load the word list from pickle (this is needed for the cosine similarity operation) <a class="anchor" id="question1ce"></a>

In [13]:
import pickle

with open("./word_list.txt", "rb") as wl:
    word_list = pickle.load(wl)

#### Visualize all characters in the dictionary to visually inspect (and make sure target emoji is present).
#### Note: Pull a fresh set of dictionaries without the pad and unknown tokens:

In [14]:
# put words into a dictionary for downstream use
import collections
def build_dataset(words):
    count = collections.Counter(words).most_common()
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return dictionary, reverse_dictionary

def encode_decode(input):
    enc, dec = build_dataset(word_list)
    return enc, dec

In [15]:
enc, dec = encode_decode(word_list)

In [16]:
enc.items()

dict_items([('e', 0), ('t', 1), ('a', 2), ('o', 3), ('i', 4), ('n', 5), ('s', 6), ('r', 7), ('l', 8), ('h', 9), ('d', 10), ('u', 11), ('b', 12), ('m', 13), ('c', 14), ('#', 15), ('g', 16), ('p', 17), ('G', 18), ('y', 19), ('D', 20), ('O', 21), ('P', 22), ('w', 23), ('f', 24), ('@', 25), ('.', 26), ('k', 27), (':', 28), ('T', 29), ('v', 30), ('S', 31), ('I', 32), ('R', 33), ('W', 34), ('C', 35), ("'", 36), (',', 37), ('A', 38), ('F', 39), ('B', 40), ('N', 41), ('M', 42), ('"', 43), ('x', 44), ('H', 45), ('?', 46), ('L', 47), ('E', 48), ('…', 49), ('!', 50), ('j', 51), ('J', 52), ('🇺', 53), ('🇸', 54), ('z', 55), ('-', 56), ('1', 57), ('K', 58), ('0', 59), ('U', 60), ('2', 61), (';', 62), ('Y', 63), ('&', 64), ('q', 65), ('6', 66), ('/', 67), ('V', 68), ('_', 69), ('3', 70), ('X', 71), ('4', 72), ('5', 73), ('7', 74), ('9', 75), ('8', 76), (')', 77), ('’', 78), ('(', 79), ('Z', 80), ('Q', 81), ('“', 82), ('*', 83), ('”', 84), ('😂', 85), ('%', 86), ('$', 87), ('=', 88), ('[', 89), (']', 90

### Note: This needed to be run twice:

In [17]:
enc = {value:key for key, value in enc.items()}

In [18]:
enc = {value:key for key, value in enc.items()}

In [19]:
# if you have access to the embedding layer explicitly
embeddings = model_emoji.get_weights()[0]

# `word_to_index` is a mapping (i.e. dict) from words to their index, e.g. `😂`: 85
words_embeddings = {w:embeddings[idx] for w, idx in enc.items()}

# This is the learned vector for 😂:  <a class="anchor" id="question1cf"></a>

In [20]:
# now you can use it like this for example
print(words_embeddings['😂'])

[ 0.3025325   0.5842908   0.42319667 -0.31405398  0.3698214   0.15363605
 -0.4177241  -0.22574772  0.00251621  0.00190915 -0.42551884 -0.0626784
  0.35473254  0.38380325  0.1223547   0.21410556 -0.23546855  0.16376063
 -0.26741436 -0.3407969   0.1284697  -0.58715177  0.42320794  0.41040668
 -0.26324856  0.21925806  0.02229878 -0.00487047  0.08724135  0.2913094
  0.01472738  0.16298081  0.5784312  -0.27969462 -0.16617207  0.22300777
  0.35038143  0.38679498 -0.29851472 -0.13641584 -0.18616225 -0.27854842
 -0.12230528 -0.25580463 -0.01578699 -0.15387978 -0.38997325 -0.53054965
  0.50691783  0.45597878  0.01528363 -0.30013585  0.1732997   0.34517664
 -0.13388382 -0.44196326 -0.55923593  0.31625775 -0.30021486 -0.16417497
 -0.39239502  0.43360528  0.187316   -0.4065224   0.33533433 -0.5363497
  0.318125    0.35078213 -0.08221983 -0.02760422 -0.27928853  0.2756848
 -0.38379908 -0.18375534 -0.5366697   0.08700037  0.29265705  0.43113294
  0.01999613  0.29827833]


In [164]:
from spacy.vocab import Vocab

# Adding the vectors into spaCy vocab
vocab = Vocab()
for word, vector in words_embeddings.items():
    vocab.set_vector(word, vector)

In [None]:
# words_embeddings.values()

In [None]:
words_embeddings

In [165]:
gee = list(map(list, words_embeddings.values()))

In [None]:
gee

In [55]:
# np.array(words_embeddings['😂'].tolist()).shape

(80,)

In [None]:
# .reshape(-1,1)

In [62]:
# np.array(words_embeddings['😂'].tolist()).shape

(80,)

In [90]:
distances[min_index]

array([0., 2., 2., ..., 2., 2., 2.])

In [98]:
min_distance

array([0., 2., 2., ..., 2., 2., 2.])

In [166]:
import numpy as np
from scipy.spatial import distance

distances = distance.cdist(np.array(words_embeddings['😂'].tolist()).reshape(-1,1)
                           , np.array(gee).reshape(-1,1)
                           , "cosine"
                          )#[0]
min_index = np.argmin(distances)
min_distance = distances[min_index]
max_similarity = 1 - min_distance

In [None]:
# words_embeddings.keys()

In [167]:
p = np.array([words_embeddings["😂"]])

In [168]:
ids = [x for x in words_embeddings.keys()]

In [None]:
# ids

In [169]:
vectors = [words_embeddings[x] for x in ids]
vectors = np.array(vectors)

In [None]:
# vectors

In [170]:
# most_similar_char

12274633225176665583

In [171]:
most_similar = distance.cdist(p, vectors).argmin()
most_similar_char = ids[most_similar]
output_char = words_embeddings[most_similar_char]

In [177]:
output_char

array([ 0.3025325 ,  0.5842908 ,  0.42319667, -0.31405398,  0.3698214 ,
        0.15363605, -0.4177241 , -0.22574772,  0.00251621,  0.00190915,
       -0.42551884, -0.0626784 ,  0.35473254,  0.38380325,  0.1223547 ,
        0.21410556, -0.23546855,  0.16376063, -0.26741436, -0.3407969 ,
        0.1284697 , -0.58715177,  0.42320794,  0.41040668, -0.26324856,
        0.21925806,  0.02229878, -0.00487047,  0.08724135,  0.2913094 ,
        0.01472738,  0.16298081,  0.5784312 , -0.27969462, -0.16617207,
        0.22300777,  0.35038143,  0.38679498, -0.29851472, -0.13641584,
       -0.18616225, -0.27854842, -0.12230528, -0.25580463, -0.01578699,
       -0.15387978, -0.38997325, -0.53054965,  0.50691783,  0.45597878,
        0.01528363, -0.30013585,  0.1732997 ,  0.34517664, -0.13388382,
       -0.44196326, -0.55923593,  0.31625775, -0.30021486, -0.16417497,
       -0.39239502,  0.43360528,  0.187316  , -0.4065224 ,  0.33533433,
       -0.5363497 ,  0.318125  ,  0.35078213, -0.08221983, -0.02

In [None]:
words_embeddings

In [173]:
[k for k, v in words_embeddings.items() if v == output_char]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In [174]:
dec[output_char]

TypeError: unhashable type: 'numpy.ndarray'

In [None]:
print(words_embeddings.values())

In [158]:
# import numpy as np
# from scipy.spatial import distance
# import spacy

# input_word = "😂"
# p = np.array([vocab[input_word].vector])

In [159]:
# ids = [x for x in vocab.vectors.keys()]
# vectors = [vocab.vectors[x] for x in ids]
# vectors = np.array(vectors)

In [160]:
# most_similar = distance.cdist(p, vectors).argmin()
# most_similar_char = ids[most_similar]
# output_char = vocab[most_similar_char].text

In [143]:
# vectors

array([[ 0.0308127 , -0.10562409, -0.14084205, ...,  0.14341864,
        -0.08657023,  0.09953693],
       [ 0.04126022,  0.04071863,  0.02803428, ...,  0.04862085,
         0.01575268, -0.01946538],
       [-0.00228359, -0.02515168, -0.04487442, ..., -0.08339384,
        -0.01983955, -0.15999721],
       ...,
       [ 0.01612668,  0.04308012, -0.00096083, ..., -0.00385525,
         0.02414829, -0.02430508],
       [ 0.01106132, -0.00174845, -0.04906845, ..., -0.0057042 ,
        -0.019363  ,  0.02408675],
       [ 0.03025493, -0.04712554, -0.02480022, ..., -0.00637128,
        -0.034205  , -0.01382823]], dtype=float32)

In [162]:
# output_char

'😂'

# 3. What is the most similar character for the above emoji  <a class="anchor" id="question1da"></a>
## Populating a dataframe with the closest characters to the provided emoji (😂), by cosine distance.
## Scale: Values closest to 0 are closest to the emoji (😂). The most similar value is the double-quote (“)

In [176]:
import scipy
import pandas as pd

dist_df = []

for k in dec.keys():
    dist_df.append(dec[k] + "~~~" + str(scipy.spatial.distance.cosine(words_embeddings['😂'], words_embeddings[dec[k]])))

distance_df = pd.DataFrame([sub.split("~~~") for sub in dist_df])

distance_df.columns = ['symbol','distance']

distance_df = distance_df.sort_values(by='distance').reset_index(drop=True)

distance_df.head()

Unnamed: 0,symbol,distance
0,😂,0.0
1,“,0.0574437975883483
2,5,0.1289249062538147
3,😴,0.1517524123191833
4,*,0.1804565191268921


In [None]:
print("The most similar emoji: [{}]\n\nCosine similarity (not distance) to this emoji: {}".format(distance_df.iloc[1,0], round(1-float(distance_df.iloc[1,1]), 4)))

In [None]:
from gensim.models import Word2Vec, KeyedVectors
from gensim.test.utils import common_texts, get_tmpfile

In [None]:
filename = './emoji2vec.bin'
e2v = KeyedVectors.load_word2vec_format(filename, binary=True)

In [None]:
happy_vector = e2v['😂']

In [None]:
happy_vector

In [None]:
result = e2v.most_similar("😂", topn=10)

In [None]:
result

In [None]:
print("The most similar emoji: {}\n\nCosine similarity to this emoji: {}%".format(result[0][0], (100*round(result[0][1], 4))))

# **PART II**<a class="anchor" id="question1e"></a>
# 4. Build a Universal Sentence Encoder (USE) Model and an RNN (RNN, LSTM, GRU, etc.) model for the following data set. Accuracy must be above 50%. Compare the results of the two in terms of time and accuracy.
http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip (first column is the polarity 0:negative, 4:positive).

## <u>Comparative Performance Analysis:</u> **Universal Sentence Encoder** versus **Gated Recurrent Unit**

### Overall, the results were very similar for the Universal Sentence Encoder and the Gated Recurrent Unit. However, the USE appeared to be less prone to overfitting; model and training accuracy had less divergence as batch size was increased for the USE models compared to the GRU models. Conversely, as batch size was increased, training loss decreased much more compared to validation loss for the USE models compared to the GRU models.

In [None]:
summary_data = {
    'Model':['Universal Sentence Encoder (batch size 128)','Universal Sentence Encoder (batch size 512)','Universal Sentence Encoder (batch size 2048)'
             ,'Gated Recurrent Unit (batch size 512)','Gated Recurrent Unit (batch size 2048)'], 
    'Mean Accuracy (Training)':['0.8055','0.8016','0.8175', '0.8714','0.9442'], 
    'Mean Accuracy (Validation)':['0.8018','0.7998', '0.8037','0.7956','0.7838'],
    'Mean Accuracy Ratio (Training/Validation)':['1.0046', '1.0023', '1.0172','1.0953','1.2046'],
    'Mean Loss (Training)':['0.4181','0.4255', '0.3982', '0.2997','0.1424'], 
    'Mean Loss (Validation)':['0.4258','0.4278', '0.4216', '0.4629','0.5859'],
    'Mean Loss Ratio (Training/Validation)':['0.9819','0.9946','0.9445','0.6474','0.2430']}

pd.DataFrame(summary_data)

# **PART II: Universal Sentence Encoder Models**<a class="anchor" id="question1ea"></a>

In [None]:
from tensorflow.keras.layers import Input, Dense, Dropout, Flatten, Lambda
import matplotlib.pyplot as plt
import tensorflow_hub as hub
import tensorflow as tf
import pandas as pd
import numpy as np
import os
import re

In [None]:
print("Tensorflow Version:\n{}\n\nTensor-Hub Version:\n{}".format(tf.__version__, hub.__version__))

In [None]:
!export CUDA_VISIBLE_DEVICES=0

In [None]:
physical_devices = tf.config.experimental.list_physical_devices('GPU')
for physical_device in physical_devices:
    tf.config.experimental.set_memory_growth(physical_device, True)

In [None]:
use_test = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4", trainable=False)

In [None]:
data = pd.read_csv('./training.1600000.processed.noemoticon.csv', encoding = "ISO-8859-1", header=None)

In [None]:
data = data[[0,5]]

This dataset needs to be shuffled. The original dataset is split - the first half is all class 0 and the second half is all class 4:

In [None]:
# data.head()

In [None]:
# data.tail()

Shuffling the data:

In [None]:
data = data.sample(frac=1.0, random_state=42).reset_index(drop=True)

In [None]:
data.head(7)

In [None]:
data.tail()

In [None]:
# Bash to shuffle
#!shuf training.1600000.processed.noemoticon.csv > training.1600000.processed.noemoticon_shuffled.csv
# data = pd.read_csv('~/Desktop/DS7337 Natural Language Processing/Final_Exam/PART2_Data/training.1600000.processed.noemoticon_shuffled.csv', encoding = "ISO-8859-1", header=None)

In [None]:
data.columns = ['polarity','text']

In [None]:
data.head(2)

In [None]:
len(data)

In [None]:
data['polarity'].unique()

In [None]:
data.polarity[data.polarity == 4] = 1
data.polarity[data.polarity == 0] = 0

In [None]:
data['polarity'].unique()

Data is balanced on a perfect 50/50 split and has been randomly shuffled so I'll use a 50/50 test/validation split (downstream) since this gives a little more to the validation than a 75/25 (or similar) split:

In [None]:
print("Balance of data classes: {}%".format(100*(round(data[data.polarity == 1].shape[0]/data.shape[0], 4))))

### Cleaning the data

Because this data appears to be from Twitter or some other platform that uses hashtags ("#") and mentions ("@"). Also, I'll leave only spaces and alphanumeric values

In [None]:
data['text'] = data['text'].str.strip().str.lower()
data.text = data.text.str.replace(r"[^a-zA-Z0-9 ]", "")

In [None]:
data.shape

In [None]:
data.head(2)

For retweets, I'm removing the "RT" values

In [None]:
for i in np.arange(0, len(data)):
    if data['text'][i].strip()[0:2] == "RT":
        data['text'][i] = data['text'][i][2:]

In [None]:
data.shape

In [None]:
n=int(np.floor(data.shape[0]*0.5)) # 50% for training
train_df = data[0:n]
test_df = data[n:]

In [None]:
train_df['polarity'].unique()

In [None]:
test_df['polarity'].unique()

In [None]:
import matplotlib.pyplot as plt

lengths=[]

for i in data['text']:
    lengths.append(len(i))

%matplotlib inline
plt.hist(lengths,bins=25)

In [None]:
int(np.percentile(lengths, 95))

Two standard deviations cover 95 percent of the data in a normal distribution. Because this distribution is not perfectly normal, there is a slight difference (129 vs 139) when considering both tails (139) versus only the right tail (129). Nonetheless, this difference is relatively negligible.

However, because USE converts everything to 51 and the maximum original length is 179, I won't limit the sequence length.

In [None]:
print([np.asarray(lengths).mean() - 2 * np.asarray(lengths).std(), np.asarray(lengths).mean() + 2 * np.asarray(lengths).std()])

In [None]:
max(lengths)

In [None]:
X_train = train_df['text'].tolist()
X_train = [' '.join(t.split()[0:179]) for t in X_train]
X_train = np.array(X_train, dtype=object)[:, np.newaxis]
y_train = train_df['polarity']

In [None]:
X_test = test_df['text'].tolist()
X_test = [' '.join(t.split()[0:179]) for t in X_test]
X_test = np.array(X_test, dtype=object)[:, np.newaxis]
y_test = test_df['polarity']

In [None]:
labels = pd.concat([y_train, y_test], axis=0)
labels.shape

In [None]:
print("Train Data Length: {}\nValidation Data Length:{}".format(X_train.shape[0], X_test.shape[0]))

In [None]:
def UniversalEmbedding(x):
    return use_test(tf.squeeze(tf.cast(x, tf.string)))

# Universal Sentence Encoder Model Results <a class="anchor" id="question1eb"></a>

In [None]:
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
embed_size = 512 # since USE produces 512 vector lengths regardless of input size

input_text = Input(shape=(), dtype=tf.string) # leaving this as floating point

embedding = hub.KerasLayer(UniversalEmbedding, output_shape=(embed_size,))(input_text)

x = Dense(256, activation='relu')(embedding)
output = Dense(1,activation='sigmoid',name='output')(x) # sigmoid for 2-classes

model = tf.keras.Model(inputs=input_text, outputs=[output])

model.compile(loss='binary_crossentropy',
                  optimizer='rmsprop',
                  callbacks=[callback],
                  metrics=['accuracy'])
model.summary()
tf.keras.utils.plot_model(model, show_shapes=True, dpi=100)

### Batch size 128 yeilds mean runtime of 123.6 seconds, mean training accuracy 0.8055, and mean validation accuracy of 0.8018:

In [None]:
hist = model.fit(X_train, np.array(y_train),
                 batch_size = 128,
                 epochs = 5,
                 callbacks = [callback],
                 validation_data=(X_test, np.array(y_test)))

In [None]:
val_loss = hist.history['val_loss']
min_loss_loc = np.where(val_loss==np.min(val_loss))[0][0]

fig, ax = plt.subplots(ncols=2, figsize = (15,7))
ax[0].plot(hist.history['val_accuracy'], label = 'val_accuracy')
ax[0].plot(hist.history['accuracy'], label = 'accuracy')
ax[0].vlines(min_loss_loc, *ax[0].get_ylim(), label = 'min_val_loss')
ax[0].set_title('Universal Sentence Encoder Accuracy Curves')
ax[0].legend();

ax[1].plot(hist.history['val_loss'], label = 'val_loss')
ax[1].plot(hist.history['loss'], label = 'loss')
ax[1].vlines(min_loss_loc, *ax[1].get_ylim(), label = 'min_val_loss')
ax[1].set_title('Universal Sentence Encoder Loss Curves')
ax[1].legend();

### Batch size 512 yeilds mean runtime of 70.8 seconds, mean training accuracy 0.8016, and mean validation accuracy of 0.7998:

In [None]:
hist = model.fit(X_train, np.array(y_train),
                 batch_size = 512,
                 epochs = 5,
                 callbacks = [callback],
                 validation_data=(X_test, np.array(y_test)))

In [None]:
val_loss = hist.history['val_loss']
min_loss_loc = np.where(val_loss==np.min(val_loss))[0][0]

fig, ax = plt.subplots(ncols=2, figsize = (15,7))
ax[0].plot(hist.history['val_accuracy'], label = 'val_accuracy')
ax[0].plot(hist.history['accuracy'], label = 'accuracy')
ax[0].vlines(min_loss_loc, *ax[0].get_ylim(), label = 'min_val_loss')
ax[0].set_title('Universal Sentence Encoder Accuracy Curves')
ax[0].legend();

ax[1].plot(hist.history['val_loss'], label = 'val_loss')
ax[1].plot(hist.history['loss'], label = 'loss')
ax[1].vlines(min_loss_loc, *ax[1].get_ylim(), label = 'min_val_loss')
ax[1].set_title('Universal Sentence Encoder Loss Curves')
ax[1].legend();

### Batch size 2048 yeilds mean runtime of 51.2 seconds, mean training accuracy 0.8175, and mean validation accuracy of 0.8037:

In [None]:
hist = model.fit(X_train, np.array(y_train),
                 batch_size = 2048,
                 epochs = 5,
                 callbacks = [callback],
                 validation_data=(X_test, np.array(y_test)))

In [None]:
val_loss = hist.history['val_loss']
min_loss_loc = np.where(val_loss==np.min(val_loss))[0][0]

fig, ax = plt.subplots(ncols=2, figsize = (15,7))
ax[0].plot(hist.history['val_accuracy'], label = 'val_accuracy')
ax[0].plot(hist.history['accuracy'], label = 'accuracy')
ax[0].vlines(min_loss_loc, *ax[0].get_ylim(), label = 'min_val_loss')
ax[0].set_title('Universal Sentence Encoder Accuracy Curves')
ax[0].legend();

ax[1].plot(hist.history['val_loss'], label = 'val_loss')
ax[1].plot(hist.history['loss'], label = 'loss')
ax[1].vlines(min_loss_loc, *ax[1].get_ylim(), label = 'min_val_loss')
ax[1].set_title('Universal Sentence Encoder Loss Curves')
ax[1].legend();

# **PART II: Gated Recurrent Unit Models** <a class="anchor" id="question1ec"></a>

In [None]:
import matplotlib.pyplot as plt
import tensorflow_hub as hub
import tensorflow as tf
import pandas as pd
import numpy as np
import os
import re

In [None]:
!export CUDA_VISIBLE_DEVICES=0

In [None]:
physical_devices = tf.config.experimental.list_physical_devices('GPU')
for physical_device in physical_devices:
    tf.config.experimental.set_memory_growth(physical_device, True)

In [None]:
data = pd.read_csv('./training.1600000.processed.noemoticon.csv', encoding = "ISO-8859-1", header=None)

In [None]:
data = data.sample(frac=1.0, random_state=42).reset_index(drop=True)

In [None]:
data = data[[0,5]]

In [None]:
data.columns = ['polarity','text']

In [None]:
data.polarity[data.polarity == 4] = 1
data.polarity[data.polarity == 0] = 0

In [None]:
data['polarity'].unique()

In [None]:
print("Balance of data classes: {}%".format(100*(round(data[data.polarity == 1].shape[0]/data.shape[0], 4))))

In [None]:
data['text'] = data['text'].str.strip().str.lower()
data.text = data.text.str.replace(r"[^a-zA-Z0-9 ]", "")

In [None]:
for i in np.arange(0, len(data)):
    if data['text'][i].strip()[0:2] == "RT":
        data['text'][i] = data['text'][i][2:]

In [None]:
n=int(np.floor(data.shape[0]*0.5)) # 50% for training
train_df = data[0:n]
test_df = data[n:]

In [None]:
train_df['polarity'].unique()

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.head(2)

In [None]:
# put words into a dictionary for downstream use
import collections

def build_dataset(words):
    count = collections.Counter(words).most_common() #.most_common(100) to use the 100 most common words; .most_common() means zero is the most common
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return dictionary, reverse_dictionary

In [None]:
word_list=[]
flat_list = []

for i in np.arange(len(data)):
    word_list.append(data['text'][i].split())

In [None]:
flat_list = []
for sublist in word_list:
    for item in sublist:
        flat_list.append(item)

In [None]:
word_list = flat_list.copy()

In [None]:
len(word_list)

In [None]:
words_unique = set(word_list)

In [None]:
len(words_unique)

In [None]:
def encode_decode(input):
    enc, dec = build_dataset(word_list)
    return enc, dec

In [None]:
enc, dec = encode_decode(word_list)

In [None]:
for i in enc:
    enc[i] = enc[i]+2 # shift everything by two so you can put in a pad and an unknown in index locations 0 and 1

                                                                    ###      ###
                                                                    # Encoding #
                                                                    ###      ###
enc['pad'] = 0

# start is useful for more complex architectures to invoke the LSTM to perform certain tasks, like decoding or recognizing the start of a sentence, for example.
#enc['<start>'] = 1

enc['<unk>'] = 1

                                                                    ###      ###
                                                                    # Decoding #
                                                                    ###      ###

# pad and include an unknown for the decoded values as well
dec[-2]='<pad>'
#dec[-1]='<start>'
dec[-1]='<unk>' # this is useful to indicate the LSTM should start decoding, or that this is the start of a sentence, or etc.

In [None]:
data['y'] = 0
data.loc[data['polarity']==1,'y'] = 1

In [None]:
print("Balance of data classes: {}%".format(100*(round(data[data.polarity == 1].shape[0]/data.shape[0], 4))))

In [None]:
import numpy as np
n=int(np.floor(data.shape[0]*0.50)) # 50% for training since target classes are balanced
train = data[0:n]
test = data[n:]

In [None]:
len(train)

In [None]:
len(test)

In [None]:
train['y'].value_counts()

In [None]:
test['y'].value_counts()

In [None]:
x_train=[]
y_train=[]

for i in range(train.shape[0]):
    tmp = [enc[j] for j in train.iloc[i,1].split()] # enc[j]: the j (list expression) is encoding the number for the word in the encoded matrix (i,j) 
    x_train.append(tmp) # append the newly replaced word
    if train.iloc[i,0]==0: # re-encode y in the below
        y=0
    else:
        y=1
    y_train.append(y) # append the newly encoded y here
    
x_test=[] # repeat for the test data the steps performed above for training data
y_test=[]
for i in range(test.shape[0]):
    tmp = [enc[j] for j in test.iloc[i,1].split()]
    x_test.append(tmp)
    if test.iloc[i,0]==0: # re-encode y in the below
        y=0
    else:
        y=1
    y_test.append(y)

In [None]:
len(x_train)

In [None]:
len(y_train)

In [None]:
len(x_test)

In [None]:
len(y_test)

In [None]:
np.unique(y_train)

In [None]:
np.unique(y_test)

In [None]:
# The first time this runs, i-2 will remove the decoded pad and unknown variables used to prevent over-fitting
[dec[i-2] for i in x_train[1]]

In [None]:
data.iloc[1,1]

In [None]:
import matplotlib.pyplot as plt

lengths=[]

for i in data['text']:
    lengths.append(len(i))

%matplotlib inline
plt.hist(lengths,bins=25)


In [None]:
int(np.percentile(lengths, 95))

In [None]:
from tensorflow.keras.preprocessing import sequence

max_length = 129
x_train = sequence.pad_sequences(x_train, maxlen = max_length)
x_test = sequence.pad_sequences(x_test, maxlen = max_length)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist([len(i) for i in x_train])
plt.show()
### Note that after padding, all sentences are the same length (same number of parameters)

In [None]:
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.layers import GRU
from tensorflow.keras.callbacks import TensorBoard, EarlyStopping, ModelCheckpoint
from tensorflow.keras.preprocessing.text import Tokenizer

There are 131,585 total (and trainable) parameters in the Universal Sentence Encoder model. There are 67,25,541 total (and trainable) parameters in this Gated Recurrent Unit model.

# Gated Recurrent Unit Model Results <a class="anchor" id="question1ed"></a>

In [None]:
from keras.utils import plot_model
max_length = 129
embedding_vector_length = 180

model1 = Sequential()
model1.add(Embedding(len(dec), embedding_vector_length, input_length=max_length)) # create the embedding vectors
model1.add(GRU(300, unroll=True, dropout=0.2)) # unroll makes this run faster; units between 100-300
model1.add(Dense(1, activation='sigmoid')) # 1 for the two classes
model1.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy']) # binary crossentropy for the two classes + sigmoid
print(model1.summary())

plot_model(model1, to_file='model_plot4a.png', show_shapes=True, show_layer_names=True)

In [None]:
np.unique(y_train)

In [None]:
np.unique(y_test)

In [None]:
np.array(y_train).shape

In [None]:
np.array(y_test).shape

In [None]:
x_train.shape

In [None]:
x_test.shape

(sanity check to make sure input variable datasets are rectangular and there are no ragged edges)

In [None]:
def rectangular(List):
    n = List
    for i in n:
        if len(i) != len(n[0]):
            return False
    return True

In [None]:
print("The data is rectangular (True or False):\n{}".format(rectangular(x_test)))

### Batch size 512 yeilds mean runtime of 1098.8 seconds, mean training accuracy 0.8714, and mean validation accuracy of 0.7956:

In [None]:
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

model_GRU = model1.fit(x_train
                       , np.array(y_train)
                       , validation_data=(x_test, np.array(y_test))
                       , callbacks = [callback]
                       , epochs=5
                       , batch_size=512)

In [None]:
val_loss = model_GRU.history['val_loss']
min_loss_loc = np.where(val_loss==np.min(val_loss))[0][0]

fig, ax = plt.subplots(ncols=2, figsize = (15,7))
ax[0].plot(model_GRU.history['val_accuracy'], label = 'val_accuracy')
ax[0].plot(model_GRU.history['accuracy'], label = 'accuracy')
ax[0].vlines(min_loss_loc, *ax[0].get_ylim(), label = 'min_val_loss')
ax[0].set_title('GRU Accuracy Curves')
ax[0].legend();

ax[1].plot(model_GRU.history['val_loss'], label = 'val_loss')
ax[1].plot(model_GRU.history['loss'], label = 'loss')
ax[1].vlines(min_loss_loc, *ax[1].get_ylim(), label = 'min_val_loss')
ax[1].set_title('GRU Loss Curves')
ax[1].legend();

### Batch size 2048 yeilds mean runtime of 336.6 seconds, mean training accuracy 0.9442, and mean validation accuracy of 0.7838:

In [None]:
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)

model_GRU = model1.fit(x_train
                       , np.array(y_train)
                       , validation_data=(x_test, np.array(y_test))
                       , callbacks = [callback]
                       , epochs=5
                       , batch_size=2048)

In [None]:
val_loss = model_GRU.history['val_loss']
min_loss_loc = np.where(val_loss==np.min(val_loss))[0][0]

fig, ax = plt.subplots(ncols=2, figsize = (15,7))
ax[0].plot(model_GRU.history['val_accuracy'], label = 'val_accuracy')
ax[0].plot(model_GRU.history['accuracy'], label = 'accuracy')
ax[0].vlines(min_loss_loc, *ax[0].get_ylim(), label = 'min_val_loss')
ax[0].set_title('GRU Accuracy Curves')
ax[0].legend();

ax[1].plot(model_GRU.history['val_loss'], label = 'val_loss')
ax[1].plot(model_GRU.history['loss'], label = 'loss')
ax[1].vlines(min_loss_loc, *ax[1].get_ylim(), label = 'min_val_loss')
ax[1].set_title('GRU Loss Curves')
ax[1].legend();