# Neural Network

In this notebook, we define and train a neural network on both the numerical data and the lyrics data from our data set.

For those who have not worked with Neural Networks before (co-created with ChatGPT): 

A neural network is a model type inspired by the structure and function of the human brain. It consists of interconnected neurons organized in layers that process and transmit information. Neural networks learn patterns and relationships in data through a process called training, which adjusts the connections (or weights) between neurons based on input data and desired outputs. They are commonly used because of their ability to model complex, nonlinear relationships.
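
To make the idea of layers and weights concrete, here is a minimal sketch of a forward pass through a tiny network in plain Python. The weights, the layer sizes, and the choice of sigmoid and softmax activations are assumptions made for illustration only; training (the adjustment of the weights) is not shown.

```python
import math


def sigmoid(x):
    """Squash a value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Turn raw output scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One layer: weights[j] holds the incoming weights of neuron j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input -> one hidden layer (sigmoid) -> output layer (softmax)."""
    hidden = [sigmoid(z) for z in dense(inputs, hidden_w, hidden_b)]
    return softmax(dense(hidden, out_w, out_b))


# Hypothetical weights for a 2-2-2 network (not taken from our model)
probs = forward(
    inputs=[0.5, -1.0],
    hidden_w=[[0.1, 0.2], [-0.3, 0.4]],
    hidden_b=[0.0, 0.1],
    out_w=[[1.0, -1.0], [-1.0, 1.0]],
    out_b=[0.0, 0.0],
)
print(probs)  # two class probabilities
```

Training would repeat this forward pass on many examples and nudge the weights so that the predicted probabilities move closer to the desired outputs.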

## 1. Setting up the spark session

In [1]:
import findspark
findspark.init("/usr/local/spark/")
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext

spark = SparkSession.builder \
   .master("local[8]") \
   .appName("NNClassifier") \
   .config("spark.executor.memory", "1gb") \
   .config("spark.sql.random.seed", "1234") \
   .getOrCreate()
sc = spark.sparkContext

## 2. Loading the data

We are loading the data from the final.csv file, which was created by our preprocessing script.

In [2]:
from pyspark.sql.functions import when
from pyspark.sql.utils import AnalysisException 

try: 
    # Because our lyrics contain multiple lines of text, we need to set the quote and escape options to make sure they are loaded correctly from the CSV
    file = spark.read.format("csv").option("header", "true").option("multiline", "true").option("quote", "\"").option("escape", "\"").load("../Final Preprocessing/final.csv")
    file.show(1)
except AnalysisException: 
    print("Please check the Filename and Filepath")

+---+--------------------+--------------------+--------------------+--------------------+----------+-----------+--------+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+------+--------------+-----------+--------------------+---------+
|_c0|          track_name|              artist|            track_id|          album_name|popularity|duration_ms|explicit|danceability|energy|key|loudness|mode|speechiness|acousticness|instrumentalness|liveness|valence| tempo|time_signature|track_genre|              lyrics|billboard|
+---+--------------------+--------------------+--------------------+--------------------+----------+-----------+--------+------------+------+---+--------+----+-----------+------------+----------------+--------+-------+------+--------------+-----------+--------------------+---------+
|  0|"""Martha: """"M'...|Friedrich von Flo...|1NzZWhNIP9DIX4yy0...|The World's Best ...|        23|     204706|   False|       0.222| 0.195|  5| -1

In [3]:
# Check the length of the loaded file
file.count()

61762

## 3. Adapting the Numerical columns

To work with our numerical columns, we need to cast them to doubles, because every column is loaded from the CSV as a string. Finally, we make sure that we have no missing values in our numerical columns.

In [4]:
from pyspark.sql.functions import col

# Selecting all the numerical columns we want to use in our model
columns_with_numbers = ["popularity", "duration_ms", "danceability", "energy", "key", "loudness", "mode", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "time_signature", "billboard"]

# Cast all numerical columns to double
for column_name in columns_with_numbers:
    file = file.withColumn(column_name, col(column_name).cast("double"))
    
# Remove possible NA values from our numerical columns
file = file.na.drop(subset=columns_with_numbers)

# Check that our columns have the correct datatype
file.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- track_name: string (nullable = true)
 |-- artist: string (nullable = true)
 |-- track_id: string (nullable = true)
 |-- album_name: string (nullable = true)
 |-- popularity: double (nullable = true)
 |-- duration_ms: double (nullable = true)
 |-- explicit: string (nullable = true)
 |-- danceability: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- key: double (nullable = true)
 |-- loudness: double (nullable = true)
 |-- mode: double (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- time_signature: double (nullable = true)
 |-- track_genre: string (nullable = true)
 |-- lyrics: string (nullable = true)
 |-- billboard: double (nullable = true)
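
The cast-then-drop step above can be pictured in plain Python. This is only a sketch: `to_double` and the sample rows are hypothetical, and it assumes that a string that cannot be parsed becomes a missing value, which is how Spark's cast behaves when ANSI mode is off.

```python
def to_double(text):
    """Parse a string as a float; return None when it cannot be parsed."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


# Hypothetical rows as they would look right after reading a CSV (all strings)
rows = [
    {"popularity": "23", "tempo": "120.5"},
    {"popularity": "n/a", "tempo": "98.0"},   # unparseable value
    {"popularity": "57", "tempo": None},      # missing value
]
numeric_columns = ["popularity", "tempo"]

# Cast every numeric column, then keep only rows without missing values
casted = [{name: to_double(row[name]) for name in numeric_columns} for row in rows]
complete = [
    row for row in casted
    if all(row[name] is not None for name in numeric_columns)
]
print(complete)
```

Only the first row survives, because the other two each contain a value that is missing after the cast.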



In [5]:
# Getting an understanding of how our data is distributed
file.select('billboard').groupBy('billboard').count().show()

+---------+-----+
|billboard|count|
+---------+-----+
|      0.0|52839|
|      1.0| 8923|
+---------+-----+

## 4. Feature Transformation for Numerical column

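First, we combine our numerical columns into a single feature column called numeric_vector. We then apply a StandardScaler to this column so that all values are on a comparable scale. With its default settings, the scaler divides each feature by its standard deviation and does not subtract the mean.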

In [6]:
from pyspark.ml.feature import VectorAssembler, StandardScaler

# Create initial Feature Vector
# Note: "danceability" and "acousticness" are each listed twice below, so both appear twice in the vector (16 entries in total)
assembler = VectorAssembler(inputCols=["popularity", "duration_ms", "danceability", "danceability", "energy", "key", "loudness", "mode", "speechiness","acousticness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "time_signature"], outputCol="numeric_vector")
file = assembler.transform(file)

# Define Scaler
scaler = StandardScaler(inputCol="numeric_vector", outputCol="scaled_features")

# Perform feature transformation on the data 
scaler_model = scaler.fit(file)
file = scaler_model.transform(file)
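
What the scaling step does to a single feature can be sketched in plain Python. The function name and the sample values are hypothetical. The sketch assumes the scaler's default settings (divide by the sample standard deviation, no centering) and returns zeros for a constant column, which is our reading of the documented behaviour rather than a copy of Spark's implementation.

```python
import statistics


def scale_to_unit_std(column):
    """Divide every value by the column's sample standard deviation."""
    std = statistics.stdev(column)  # sample standard deviation (n - 1)
    if std == 0:
        return [0.0 for _ in column]  # a constant column carries no information
    return [value / std for value in column]


# Hypothetical values on very different scales
duration_ms = [180000.0, 200000.0, 240000.0]
danceability = [0.2, 0.5, 0.8]

scaled_duration = scale_to_unit_std(duration_ms)
scaled_dance = scale_to_unit_std(danceability)

print(scaled_duration)
print(scaled_dance)
```

After scaling, both columns have a standard deviation of 1, so a feature measured in milliseconds no longer dominates a feature measured between 0 and 1.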

## 5. Feature Transformation for Text

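We also need to transform the lyrics into features. For this, we first tokenize the text, then apply a hashing transformation (HashingTF) that maps each word to one of 20 buckets, and finally reweight the resulting counts with an inverse document frequency (IDF). The results are written to the text_features column.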

In [7]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

# Tokenizing
tokenizer = Tokenizer(inputCol="lyrics", outputCol="words")
file = tokenizer.transform(file)

# Hashing
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(file)

# IDF
idf = IDF(inputCol="rawFeatures", outputCol="text_features")
idfModel = idf.fit(featurizedData)

# Writing feature column to data 
file = idfModel.transform(featurizedData)
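
The three text steps above can be sketched in plain Python to show what ends up in the feature vector. This is an illustration under assumptions, not Spark's implementation: `zlib.crc32` stands in for the hash function Spark uses, the sample lyrics are made up, and the IDF weight is taken as log((documents + 1) / (documents containing the bucket + 1)).

```python
import math
import zlib

NUM_FEATURES = 20  # same number of buckets as numFeatures above


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashed_term_counts(words, num_features=NUM_FEATURES):
    """Count words per bucket; the bucket is a hash of the word."""
    counts = [0.0] * num_features
    for word in words:
        # crc32 is a stand-in hash chosen for this sketch (an assumption)
        bucket = zlib.crc32(word.encode("utf-8")) % num_features
        counts[bucket] += 1.0
    return counts


def idf_weights(count_vectors):
    """Buckets that occur in fewer documents receive a larger weight."""
    n_docs = len(count_vectors)
    weights = []
    for bucket in range(len(count_vectors[0])):
        doc_freq = sum(1 for vec in count_vectors if vec[bucket] > 0)
        weights.append(math.log((n_docs + 1) / (doc_freq + 1)))
    return weights


# Made-up lyrics, not taken from our data set
lyrics = ["Love me do", "love love me", "dancing in the dark"]
tf = [hashed_term_counts(tokenize(text)) for text in lyrics]
idf = idf_weights(tf)
tfidf = [[count * weight for count, weight in zip(vec, idf)] for vec in tf]

print(tfidf[0])
```

With only 20 buckets, many different words share a bucket, so the text features are a coarse summary of the lyrics rather than a per-word representation.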

## 6. Combine features

In the previous steps, we created the feature columns for the numerical data and the lyrics. However, our neural network takes only a single feature column as input. Because of that, we combine the two feature columns (numerical and lyrics) into one using a VectorAssembler. The resulting column, features, is then used as the input for our neural network.

In [8]:
from pyspark.ml.feature import VectorAssembler

# Define Vector Assembler
final_assembler = VectorAssembler(inputCols=["scaled_features", "text_features"], outputCol="features")

# Apply to data
file = final_assembler.transform(file)
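
Conceptually, the assembler concatenates the two vectors in the order given. A minimal sketch with placeholder values (the helper `assemble` is hypothetical) shows where the input dimension of the network comes from:

```python
def assemble(*vectors):
    """Concatenate several feature vectors into one, in the order given."""
    combined = []
    for vector in vectors:
        combined.extend(vector)
    return combined


# Placeholder values; only the lengths matter here
scaled_features = [0.0] * 16  # 16 entries listed in the numeric assembler
text_features = [0.0] * 20    # numFeatures=20 in HashingTF

features = assemble(scaled_features, text_features)
print(len(features))
```

The combined vector has 16 + 20 = 36 entries, which is the size of the input layer reported later in the notebook.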

## 7. Create evenly split dataset

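As already described in the NLP model, we agreed to use an evenly split data set for model training, so that the strong class imbalance in the data (52,839 songs with label 0 versus 8,923 songs with label 1) does not distort our performance measures. In the next step, we create this balanced data set. Because sample() keeps each row with a given probability rather than drawing an exact number of rows, the two classes end up close to, but not exactly, the same size.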

In [9]:
label_counts = file.groupBy("billboard").count().collect()

min_count = min(row['count'] for row in label_counts)

# Create balanced DataFrame by sampling
balanced_df = None

for row in label_counts:
    label = row['billboard']
    count = row['count']
    fraction = min_count / count  # Calculate the fraction of the data to sample
    
    # Sample the data
    sampled_df = file.filter(col("billboard") == label).sample(False, fraction, seed=1)
    
    # Append sampled data to the balanced DataFrame
    if balanced_df is None:
        balanced_df = sampled_df.limit(min_count)  # limit caps the class at min_count rows; sample() alone is only approximate
    else:
        balanced_df = balanced_df.union(sampled_df.limit(min_count))

In [10]:
balanced_df.select('billboard').groupBy('billboard').count().show()

file = balanced_df

+---------+-----+
|billboard|count|
+---------+-----+
|      0.0| 8825|
|      1.0| 8923|
+---------+-----+
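
The small difference between the two class counts above comes from the way fraction-based sampling works: each row is kept with a given probability, so the result size varies around the target. The plain-Python sketch below contrasts this with drawing an exact number of rows. The function names are hypothetical, and the sketch does not reproduce Spark's random number generator, so its counts will differ from the ones above.

```python
import random


def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]


def exact_downsample(items, target_size, seed):
    """Draw exactly `target_size` distinct items."""
    rng = random.Random(seed)
    return rng.sample(items, target_size)


majority = list(range(52839))  # stand-in for the rows with label 0
minority_size = 8923           # number of rows with label 1
fraction = minority_size / len(majority)

approx = bernoulli_sample(majority, fraction, seed=1)
exact = exact_downsample(majority, minority_size, seed=1)

print(len(approx))  # close to 8923, but usually not equal to it
print(len(exact))   # exactly 8923
```

For our purposes the approximate split is close enough, but an exact draw is an option if perfectly equal class sizes are required.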



## 8. Build model

### 8.1. Setting up the layers

The first step in our model creation process is defining the underlying architecture (the number of layers and neurons). For the input layer, we get the feature dimension of the combined feature columns we previously created. 

For the hidden layers, we follow the heuristic outlined in this article: https://medium.com/geekculture/introduction-to-neural-network-2f8b8221fbd3. We take roughly two thirds of the input size plus two neurons as the total hidden size and split it across two hidden layers. We also experimented with other hidden layer structures but could not find any significant differences. From our perspective, this makes sense, as our current neuron structure should already be able to capture rather complex relationships in the input data.

For the output layer, we used two neurons, since we wanted to perform a binary classification. 

In [11]:
# Get the input dimension for the Neural Network
feature_dimension = file.schema["features"].metadata["ml_attr"]["num_attrs"]

# Calculate hidden layers, source: https://medium.com/geekculture/introduction-to-neural-network-2f8b8221fbd3

hidden_neurons = feature_dimension * (2/3) + 2  # Approximation from source above
hidden_neurons_1 = hidden_neurons // 2 # Split between layer 1 and 2
hidden_neurons_2 = round(hidden_neurons - hidden_neurons_1 -1 ,0) # -1 to make sure we don't go above input size

# Output of size 2 -> Two output classes because we want to classify if the song is in the billboard 100 or not
layers = [feature_dimension, hidden_neurons_1, hidden_neurons_2, 2] # This is our final layer structure 

print(f"Input Layer: {feature_dimension}")
print(f"Hidden Layer 1: {hidden_neurons_1}")
print(f"Hidden Layer 2: {hidden_neurons_2}")
print(f"Output Layer: 2")

Input Layer: 36
Hidden Layer 1: 13.0
Hidden Layer 2: 12.0
Output Layer: 2
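
The same calculation can be wrapped in a small function that returns whole numbers instead of floats such as 13.0. This is a sketch with a hypothetical function name; treating the `+ 2` term as the output size is our assumption about the heuristic.

```python
def layer_sizes(input_size, output_size=2):
    """Return [input, hidden 1, hidden 2, output] as integers."""
    # Total hidden neurons: two thirds of the input size plus the output size
    hidden_total = input_size * (2 / 3) + output_size
    first = int(hidden_total // 2)                   # first hidden layer
    second = int(round(hidden_total - first - 1))    # second hidden layer, one neuron fewer
    return [input_size, first, second, output_size]


layers_example = layer_sizes(36)
print(layers_example)
```

For an input size of 36 this gives the same structure as above: 36, 13, 12, and 2 neurons.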


### 8.2. Split Train and Test data

For the model training and model evaluation process, we create a training and a test data set with an 80/20 ratio. We also set a seed for this split to make the model performance comparable with the Logistic Regression and the NLP models.

In [12]:
train_data, test_data = file.randomSplit([0.8, 0.2], seed=1234)

### 8.3. Grid search method to find the best hyperparameters for the model

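We want to achieve the best possible model. For this, we need to find the hyperparameters that fit the data best. Since we don't want to change the model parameters manually and re-run the model each time, we create a grid with the parameters we think have the biggest impact. It is important not to test too many parameters or values: the number of combinations is the product of the number of values per parameter, so it grows multiplicatively with every parameter added (here 4 × 4 × 3 = 48 combinations), and every combination is trained once per fold. Because of that, we focused only on blockSize, stepSize, and tol (tolerance). Note that, according to the Spark documentation, stepSize applies only to the "gd" solver, so with the default "l-bfgs" solver used here the stepSize values should not change the result.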

The model performance was evaluated using a 2-fold Cross Validation.

In [13]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


mlp = MultilayerPerceptronClassifier(featuresCol='features', labelCol='billboard')

paramGrid = ParamGridBuilder() \
    .addGrid(mlp.maxIter, [100]) \
    .addGrid(mlp.layers, [layers]) \
    .addGrid(mlp.blockSize, [2,3,4,6]) \
    .addGrid(mlp.stepSize, [0.00001,0.00005, 0.000001, 0.0000001]) \
    .addGrid(mlp.tol, [1e-06, 1e-08, 1e-10]) \
    .build()

evaluator = MulticlassClassificationEvaluator(labelCol = 'billboard', metricName="accuracy")

crossval = CrossValidator(estimator=mlp,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=2)  # Use 2 folds

In [14]:
# Perform the grid search and train the model
cvModel = crossval.fit(train_data)

# Fetch best model
bestModel = cvModel.bestModel
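
The cost of this grid search can be counted directly from the parameter lists. The sketch below rebuilds the grid as a plain dictionary (the dictionary itself is an illustration, not a Spark object) and assumes one additional fit at the end, when the best combination is retrained on the full training data.

```python
from itertools import product

# The same values as in the parameter grid above
grid = {
    "maxIter": [100],
    "layers": [[36, 13, 12, 2]],
    "blockSize": [2, 3, 4, 6],
    "stepSize": [0.00001, 0.00005, 0.000001, 0.0000001],
    "tol": [1e-06, 1e-08, 1e-10],
}

# Every combination of one value per parameter
combinations = list(product(*grid.values()))
num_folds = 2

# One fit per combination and fold, plus the final refit of the best one
fits = len(combinations) * num_folds + 1

print(len(combinations))
print(fits)
```

Adding a single further parameter with four candidate values would multiply the number of combinations by four, which is why we kept the grid small.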

## 9. Evaluation

To test our model's performance, we use two measures: the area under the ROC curve and the area under the precision-recall (PR) curve. For the Neural Network, we calculate both on the test data using the best model from our cross-validated grid search.

In [15]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Transform the test data using the best model
predictions = bestModel.transform(test_data)

# Initialize the evaluator for the ROC and PR measures
evaluator = BinaryClassificationEvaluator(labelCol='billboard')

# Evaluate the model
areaUnderROC = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})
areaUnderPR = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})

print("Test set AUC (ROC) = " + str(areaUnderROC))
print("Test set AUC (PR) = " + str(areaUnderPR))

Test set AUC (ROC) = 0.8259662715916801
Test set AUC (PR) = 0.8065419144102637
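
One way to read the ROC value above: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes this pairwise definition on made-up labels and scores. It is meant as an interpretation aid and is not how Spark's evaluator computes the value.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, not taken from our predictions
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]

auc = roc_auc(labels, scores)
print(auc)  # 8 of the 9 pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so our test result of about 0.83 means that most positive and negative pairs are ordered correctly.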


### Confusion matrix

To further investigate our model's performance, and to make it easier to benchmark the Neural Network against the other models (NLP and Logistic Regression), we also create a confusion matrix from the calculated predictions. **Note:** All of this is based on training and testing on the balanced data. Our final comparison and results will also be based on this balanced data across all of our models.

In [17]:
# Confusion Matrix
confusion_matrix = predictions.groupBy("billboard", "prediction").count()
confusion_matrix.show()

+---------+----------+-----+
|billboard|prediction|count|
+---------+----------+-----+
|      1.0|       1.0| 1472|
|      0.0|       1.0|  552|
|      1.0|       0.0|  326|
|      0.0|       0.0| 1188|
+---------+----------+-----+



With the help of the confusion matrix, we can now calculate the accuracy, precision, recall, and F1 score of this NN model.

In [19]:
# Assigning the needed values from the confusion matrix - https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec

# True Positive
TP = confusion_matrix.filter((col("billboard") == 1) & (col("prediction") == 1)).collect()[0]["count"]

# False Positive
FP = confusion_matrix.filter((col("billboard") == 0) & (col("prediction") == 1)).collect()[0]["count"]

# True Negative
TN = confusion_matrix.filter((col("billboard") == 0) & (col("prediction") == 0)).collect()[0]["count"]

# False Negative
FN = confusion_matrix.filter((col("billboard") == 1) & (col("prediction") == 0)).collect()[0]["count"]


# Calculating the metrics
Accuracy = (TP + TN)/(TP + TN + FP + FN) 
Precision = TP/(TP + FP) 
Recall = TP/(TP + FN) 
F1 = 2 * (Precision * Recall) / (Precision + Recall)


print(f"Accuracy: {Accuracy}")
print(f"Precision: {Precision}")
print(f"Recall: {Recall}")
print(f"F1 Score: {F1}")

Accuracy: 0.7518371961560204
Precision: 0.7272727272727273
Recall: 0.8186874304783093
F1 Score: 0.7702773417059132
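
As a cross-check, the same formulas can be applied directly to the four counts from the confusion matrix, without Spark. The function name is hypothetical, and the sketch assumes that none of the denominators is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)   # share of predicted hits that are real hits
    recall = tp / (tp + fn)      # share of real hits that were found
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts taken from the confusion matrix above
metrics = classification_metrics(tp=1472, fp=552, tn=1188, fn=326)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The values agree with the ones printed above, which confirms that the four counts were read from the confusion matrix correctly.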


## 10. Describe results

Compared to the other models (Logistic Regression and NLP), this Neural Network model gives us the best numbers on almost every measure, with higher **ROC-AUC**, **PR-AUC**, and **Accuracy**; its **F1** score is comparable with that of the NLP model. This suggests that the Neural Network has the stronger predictive capability, which may be because the Logistic Regression and NLP models each use either the numerical or the lyrics data, while the Neural Network uses both when making predictions.

## 11. Saving the model

To avoid having to rerun the time-intensive training process each time we start the server, we save the fully trained model for re-use. 

In [23]:
from py4j.protocol import Py4JJavaError

try:
    bestModel.save("../Final Code/trained_models/NN_model")
except Py4JJavaError:
    print("Error saving the model; make sure the model isn't already saved in the defined folder")