# Sentiment Analysis for TrackMatcher

This notebook trains a model that predicts the emotion of a track from its lyrics, using TensorFlow 2.0 to build the deep learning pipeline. The trained model will later be used in a Python data science project.

First of all, we need to import the libraries:

In [8]:
import json
import csv
import os
import numpy as np
import tensorflow as tf
# from google.colab import files
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import load_model
from sklearn.model_selection import train_test_split

The next step is to load the model parameters, which are stored in a JSON file named `parameters.json`:

In [9]:
with open('parameters.json', 'r') as json_file:
    pm = json.load(json_file)    
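
The later cells read the keys `num_words`, `oov_tok`, `max_length`, `padding_type`, `trunc_type`, `embedding_dim` and `num_epochs` from `pm`. As a rough sketch of what such a file could contain, the snippet below builds a dictionary with those keys and round-trips it through JSON. Only the key names come from this notebook; apart from `num_epochs`, which matches the training log, every value is an assumption chosen for illustration.

```python
import json

# Hypothetical values: only the key names are taken from this notebook.
example_parameters = {
    "num_words": 10000,      # assumed vocabulary cap for the Tokenizer
    "oov_tok": "<OOV>",      # assumed placeholder for out-of-vocabulary words
    "max_length": 100,       # assumed sequence length after padding
    "padding_type": "post",  # assumed; Keras accepts "pre" or "post"
    "trunc_type": "post",    # assumed; Keras accepts "pre" or "post"
    "embedding_dim": 64,     # assumed embedding size
    "num_epochs": 20,        # consistent with the "Epoch 1/20" training log
}

# Round-trip through JSON text, mirroring what json.load does with the file
serialized = json.dumps(example_parameters, indent=4)
loaded = json.loads(serialized)
print(loaded["max_length"])
```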

The first step in training a model is loading the data.

In [10]:
lyrics = []  # Lyrics fragments are stored here
emotions = []  # Emotion labels are stored here

# Every emotion class the model can predict
labels = ["Anger", "Joy", "Sad", "Disgusting", "Fear", "Suprise"]

data_path = os.path.join('data', 'lyrics_dataset.csv')

def loading_data(lyrics_list, emotions_list, path):
    with open(path, 'r', encoding='utf-8') as file:
        dataset = csv.reader(file)
        next(dataset)  # We skip the header of the file

        for track in dataset:
            lyrics_list.append(track[0].capitalize())
            emotions_list.append(track[3].capitalize())


# Replace each emotion label with its integer index in `labels`
def numerize_emotions(emotions_dataset):
    for i in range(len(emotions_dataset)):
        for j in range(len(labels)):
            if emotions_dataset[i] == labels[j]:
                emotions_dataset[i] = j


loading_data(lyrics, emotions, data_path)
numerize_emotions(emotions)
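
Note that `numerize_emotions` scans `labels` for every row and leaves the original string in place when no label matches, which would only surface later as an error in `to_categorical`. A possible alternative, sketched here under the assumption that an unknown label should fail immediately, uses a dictionary lookup. The names `label_to_index` and `encode_emotions` are hypothetical and not part of the notebook.

```python
labels = ["Anger", "Joy", "Sad", "Disgusting", "Fear", "Suprise"]

# Map each label to its position in the list
label_to_index = {label: index for index, label in enumerate(labels)}

def encode_emotions(emotions_dataset):
    """Return a new list of integer class indices; raise on unknown labels."""
    encoded = []
    for emotion in emotions_dataset:
        if emotion not in label_to_index:
            raise ValueError(f"Unknown emotion label: {emotion!r}")
        encoded.append(label_to_index[emotion])
    return encoded

print(encode_emotions(["Joy", "Fear", "Anger"]))  # [1, 4, 0]
```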

Since we have enough data, we split it into training and testing sets. To get a reliable estimate of the model's accuracy, we are going to use K-fold cross-validation with 5 folds, 10 folds, and "leave one out" (training on the sample size minus 1), as recommended in *Big Data: New Tricks for Econometrics, Hal R. Varian, 2013.*

In [None]:
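def cross_validation(test_size):
    # train_test_split returns, in this order:
    # training inputs, testing inputs, training targets, testing targets
    return train_test_split(lyrics, emotions, test_size=test_size)


# test_size=0.2 is an assumed value (the original call passed no argument);
# it corresponds to holding out one fold of a 5-fold split
training_lyrics, testing_lyrics, training_emotions, testing_emotions = cross_validation(0.2)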
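
The cell above performs a single split, so the model is evaluated on one held-out portion only. K-fold cross-validation instead rotates the held-out portion so that every sample is used for testing exactly once. The following is a minimal pure-Python sketch of how the fold indices can be generated; `k_fold_indices` is a hypothetical helper, and it does not shuffle, so the data is assumed to be shuffled beforehand. In practice, `sklearn.model_selection.KFold` offers the same idea with built-in shuffling.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train, test in folds:
    print(len(train), test)
```

Setting `n_folds` equal to `n_samples` gives the "leave one out" scheme: each fold tests on a single sample and trains on all the others.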

Once the data is loaded, we need to convert the lyrics fragments into numbers, because a neural network can only work with numeric arrays. This process is known as "tokenization". Fortunately, `tensorflow` provides a `Tokenizer` class with all the methods needed to tokenize our data. The `numpy` library then helps us convert the resulting sequences into arrays.

In [4]:
# Tokenization process
tokenizer = Tokenizer(num_words=pm["num_words"], oov_token=pm["oov_tok"])
tokenizer.fit_on_texts(training_lyrics)  # Words are fitted on the tokenizer
word_index = tokenizer.word_index  # Words are represented by index

vocab_size = len(word_index) + 1  # The size of the vocabulary

training_lyrics = tokenizer.texts_to_sequences(training_lyrics)
training_lyrics = pad_sequences(
    training_lyrics,
    maxlen=pm["max_length"],
    padding=pm["padding_type"],
    truncating=pm["trunc_type"]
)

testing_lyrics = tokenizer.texts_to_sequences(testing_lyrics)
testing_lyrics = pad_sequences(
    testing_lyrics,
    maxlen=pm["max_length"],
    padding=pm["padding_type"],
    truncating=pm["trunc_type"]
)

training_emotions = to_categorical(training_emotions, len(labels))
testing_emotions = to_categorical(testing_emotions, len(labels))


# Convert the tokenized data into NumPy arrays
training_lyrics = np.array(training_lyrics)
training_emotions = np.array(training_emotions)
testing_lyrics = np.array(testing_lyrics)
testing_emotions = np.array(testing_emotions)
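
To make the tokenization and padding steps more concrete, here is a simplified pure-Python sketch of the idea. It is not the Keras implementation: it only lowercases and splits on whitespace (the real `Tokenizer` also filters punctuation), and it always pads and truncates at the end of the sequence. The helper names `build_word_index` and `texts_to_padded` are hypothetical.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Index words by descending frequency; index 1 is reserved for OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def texts_to_padded(texts, word_index, max_length, oov_token="<OOV>"):
    """Convert texts to fixed-length integer lists, padding with 0 at the end."""
    padded = []
    for text in texts:
        sequence = [word_index.get(word, word_index[oov_token])
                    for word in text.lower().split()]
        sequence = sequence[:max_length]                # truncate at the end
        sequence += [0] * (max_length - len(sequence))  # pad at the end
        padded.append(sequence)
    return padded

corpus = ["love me do", "love love me"]
index = build_word_index(corpus)
print(texts_to_padded(["love you", "me do love me"], index, max_length=3))
# [[2, 1, 0], [3, 4, 2]]
```

The word "you" never appears in the fitting corpus, so it is mapped to the out-of-vocabulary index 1. This is also why the tokenizer is fitted on the training lyrics only and then reused for the testing lyrics.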

Now that tokenization is done, let's create the model. The `tensorflow` library provides a `Sequential` class, which builds a neural network (NN) as an ordered stack of layers. The layers must be in the correct order, so we start with an `Embedding` layer, which "vectorizes" the input by mapping each word index to a dense vector, so that similar words can end up with similar representations. Plenty of introductory material on how neural networks work is available online.

Next, the output of this layer is passed to the `LSTM` layers, which process the sequence recurrently, one step at a time. Finally, the `Dense` layer with a softmax activation turns the result into one probability per emotion, which defines the prediction.

In [10]:
# Generation of the Recurrent Neural Network (RNN)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size,
                              input_length=pm["max_length"],
                              output_dim=pm["embedding_dim"]),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.LSTM(pm["embedding_dim"],
                         return_sequences=True,
                         dropout=0.3,
                         recurrent_dropout=0.2),
    tf.keras.layers.LSTM(pm["embedding_dim"],
                         dropout=0.3,
                         recurrent_dropout=0.2),
    tf.keras.layers.Dense(len(labels), activation="softmax")
])

model.compile(
    loss="categorical_crossentropy",
    optimizer="adam",
    metrics=['accuracy']
)
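
As an illustration of what the final softmax layer and the `categorical_crossentropy` loss compute, here is a small pure-Python sketch. The input scores are made up, and the functions are simplified stand-ins for the Keras versions, not their actual implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]  # shift for stability
    total = sum(exps)
    return [value / total for value in exps]

def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))

logits = [0.5, 2.0, 0.1, -1.0, 0.0, 0.3]  # made-up scores for the six emotions
target = [0, 1, 0, 0, 0, 0]               # one-hot vector for class index 1
probabilities = softmax(logits)
print(round(sum(probabilities), 6))
print(categorical_crossentropy(target, probabilities))
```

For reference, a model that assigns the same probability to all six classes has a loss of ln 6 ≈ 1.79, which is a useful baseline when reading the training log.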

The last step of the training process is ... training the model.

In [13]:
model.fit(
    training_lyrics,
    training_emotions,
    epochs=pm["num_epochs"],
    batch_size=32,
    verbose=2
)
results = model.evaluate(testing_lyrics, testing_emotions)
print(results)

Epoch 1/20
158/158 - 127s - loss: 1.5894 - accuracy: 0.3075
Epoch 2/20
158/158 - 127s - loss: 1.5410 - accuracy: 0.3529
Epoch 3/20
158/158 - 128s - loss: 1.4087 - accuracy: 0.4624
Epoch 4/20
158/158 - 127s - loss: 1.2761 - accuracy: 0.5202
Epoch 5/20
158/158 - 127s - loss: 1.1599 - accuracy: 0.5677
Epoch 6/20
158/158 - 127s - loss: 1.0608 - accuracy: 0.6107
Epoch 7/20
158/158 - 127s - loss: 0.9354 - accuracy: 0.6570
Epoch 8/20
158/158 - 128s - loss: 0.8378 - accuracy: 0.6962
Epoch 9/20
158/158 - 129s - loss: 0.7515 - accuracy: 0.7339
Epoch 10/20
158/158 - 128s - loss: 0.6781 - accuracy: 0.7564
Epoch 11/20
158/158 - 127s - loss: 0.7656 - accuracy: 0.7297
Epoch 12/20
158/158 - 127s - loss: 0.6391 - accuracy: 0.7717
Epoch 13/20
158/158 - 129s - loss: 0.5628 - accuracy: 0.8083
Epoch 14/20
158/158 - 129s - loss: 0.5061 - accuracy: 0.8356
Epoch 15/20
158/158 - 128s - loss: 0.4537 - accuracy: 0.8529
Epoch 16/20
158/158 - 129s - loss: 0.4011 - accuracy: 0.8723
Epoch 17/20
158/158 - 130s - loss

We have reached **95.54%** accuracy!

The following code saves the fitted model to the local machine:

In [14]:
model.save('model.h5')

Now that training and testing are done, we can predict emotions for new lyrics.

In [12]:
model = load_model('model.h5')

phrase = ["'Cause you're amazing just the way you are"]  # Lyrics fragment to classify
sequence = tokenizer.texts_to_sequences(phrase)
padded = pad_sequences(sequence,
                       maxlen=pm["max_length"],
                       padding=pm["padding_type"],
                       truncating=pm["trunc_type"])
results = model.predict(padded)
for i in range(len(labels)):
    print(labels[i] + ":", results[0][i])

Anger: 0.0004190314
Joy: 0.99541414
Sad: 0.00065330206
Disgusting: 0.00026164335
Fear: 0.00024047188
Suprise: 0.0030113417
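
To turn these probabilities into a single predicted emotion, one option is to pick the label with the highest score. The sketch below reuses the values printed above; with the real `results` array, `np.argmax(results[0])` would give the same index.

```python
labels = ["Anger", "Joy", "Sad", "Disgusting", "Fear", "Suprise"]

# Probabilities copied from the output printed above
probabilities = [0.0004190314, 0.99541414, 0.00065330206,
                 0.00026164335, 0.00024047188, 0.0030113417]

# Index of the largest probability
best_index = max(range(len(labels)), key=lambda i: probabilities[i])
print(labels[best_index], probabilities[best_index])
```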
