# Sentiment Analyzer Overview
This notebook is used to generate the trained model for the sentiment analyzer in the DiscordNLP bot. 
Note: The data sets to train this model **are not** included in the repository, so you will need to obtain them at https://www.kaggle.com/kazanova/sentiment140. The script `PrepareData.py` was used to remove unwanted columns and strip unwanted text. 

## Setup
The following blocks handle library imports, and importing the data set as a Pandas dataframe. A vocabulary tokenizer will be generated and saved if one does not already exist. Once tokenized, the data will be padded to a fixed length and then split into training and validation data sets for the model.

Note: The tokenizer is saved as a pickle, funniest shit I've seen. 

Second Note: Never load a pickle from a source you don't trust. They are serialized Python objects and can run all sorts of mean, nasty things on your computer.

In [1]:
import numpy as np
import pandas
import time
import re
import math
import pickle
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_datasets as tfds

In [2]:
cols = ["sentiment", "text"]
train_data = pandas.read_csv(
    "../data/training_data_long.csv",
    header=None,
    names=cols,
    engine="python",
    encoding="latin1"
)
data_clean = train_data["text"].tolist()

In [3]:
try:
    with open("../models/tokenizer.pickle", "rb") as f:
        tokenizer = pickle.load(f)
except:
    tokenizer = tfds.features.text.SubwordTextEncoder.build_from_corpus(
        data_clean, target_vocab_size=2**16
    )
    with open("../models/tokenizer.pickle", "wb") as f:
        pickle.dump(tokenizer, f, protocol=pickle.HIGHEST_PROTOCOL)

In [4]:
data_input = [tokenizer.encode(sentence) for sentence in data_clean]

# pad sentences with 0's to match the longest sentence in the data set 
max_sentence_length = max([len(sentence) for sentence in data_input])
data_input = tf.keras.preprocessing.sequence.pad_sequences(
    data_input, value=0, padding="post", maxlen=max_sentence_length
)

In [5]:
data_labels = train_data["sentiment"].to_numpy()
test_idx = np.random.randint(0, math.floor(len(data_clean)/2), max(math.floor(len(data_clean)/200), 100))
test_idx = np.concatenate((test_idx, test_idx+math.floor(len(data_clean)/2)))
test_inputs = data_input[test_idx]
test_labels = data_labels[test_idx]
train_inputs = np.delete(data_input, test_idx, axis=0)
train_labels = np.delete(data_labels, test_idx)

## Neural Net Definition
Next, several constants are declared to make tweaking the model easy. The model is defined sequentially with Keras, however, the convolution+maxpool layers are all in parallel with eachother. To do this with a sequential model, they are instantiated seperately, and then concatenated together to form one layer. This "custom" layer can then be appended to the sequential model like any other layer.

In [6]:
VOCAB_SIZE = tokenizer.vocab_size
EMB_DIM = 200
NB_FILTERS = 100
FFN_UNITS = 256
NB_CLASSES = len(set(train_data["sentiment"]))
DROPOUT_RATE = 0.2
BATCH_SIZE = 32
NB_EPOCHS = 5

In [7]:
input_layer = layers.Input(shape=max_sentence_length)
embedding = layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMB_DIM)(input_layer)
bigram = layers.Conv1D(filters=NB_FILTERS, 
                         kernel_size=2, 
                         padding="valid", 
                         activation="relu")(embedding)
conv1 = layers.GlobalMaxPool1D()(bigram)

trigram = layers.Conv1D(filters=NB_FILTERS, 
                         kernel_size=3, 
                         padding="valid", 
                         activation="relu")(embedding)
conv2 = layers.GlobalMaxPool1D()(trigram)

quadgram = layers.Conv1D(filters=NB_FILTERS, 
                         kernel_size=4, 
                         padding="valid", 
                         activation="relu")(embedding)
conv3 = layers.GlobalMaxPool1D()(quadgram)

concat = layers.Concatenate()([conv1, conv2, conv3])
parallel_model = tf.keras.Model(input_layer, concat)

Dcnn = tf.keras.models.Sequential()
Dcnn.add(parallel_model)
Dcnn.add(layers.Dense(units=NB_CLASSES, activation="softmax"))
Dcnn.add(layers.Dropout(rate=DROPOUT_RATE))
if NB_CLASSES == 2:
    Dcnn.add(layers.Dense(units=1, activation="sigmoid"))
else:
    Dcnn.add(layers.Dense(units=NB_CLASSES, activation="softmax"))


In [8]:
if NB_CLASSES == 2:
    Dcnn.compile(loss="binary_crossentropy",
                 optimizer="adam",
                 metrics=["accuracy"])
else:
    Dcnn.compile(loss="sparse_categorical_crossentropy",
                 optimizer="adam",
                 metrics=["sparse_categorical_accuracy"])

In [9]:
checkpoint_path = "../chkpts"
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1)
ckpt = tf.train.Checkpoint(Dcnn=Dcnn)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=10)


## Training and Validation
Now that the model has been declared, we can shove some data through it. I don't have a working GPU, so this will likely go faster for you.

After the model has been trained, its accuracy and loss are evaluated with the validation data set we made earlier. 

I also added some simple tests by hand just to make sure we get the outputs we expect.

In [10]:
Dcnn.fit(train_inputs,
         train_labels,
         batch_size=BATCH_SIZE,
         epochs=NB_EPOCHS,
         callbacks=[checkpoint_callback])

Epoch 1/5
Epoch 00001: saving model to ../chkpts
Epoch 2/5
Epoch 00002: saving model to ../chkpts
Epoch 3/5
Epoch 00003: saving model to ../chkpts
Epoch 4/5
Epoch 00004: saving model to ../chkpts
Epoch 5/5
Epoch 00005: saving model to ../chkpts


<tensorflow.python.keras.callbacks.History at 0x7fcbc68e17b8>

In [11]:
ckpt_manager.save()
Dcnn.save("../models/sentiment_model.h5")

In [16]:
results = Dcnn.evaluate(test_inputs, test_labels, batch_size=BATCH_SIZE)
print(results)
text = "You are so funny"
tokenized_input = np.array([tokenizer.encode("You are so funny")])
tokenized_input = tf.keras.preprocessing.sequence.pad_sequences(
                  tokenized_input, value=0, padding="post", maxlen=73
                  )
print(Dcnn(tokenized_input, training=False).numpy())


[0.4453016519546509, 0.8216875195503235]
[[0.8870019]]
