## NLP Model Training

This notebook contains the model training for the filtered data, where only songs in English are included.

### 1. Spark

First, we need Spark again.

In [1]:
import findspark
findspark.init("/usr/local/spark/")

from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.functions import when, col
import collections

spark = SparkSession.builder \
   .master("local") \
   .appName("NLP2") \
   .config("spark.executor.memory", "1gb") \
   .config("spark.sql.random.seed", "1234") \
   .getOrCreate()
      
sc = spark.sparkContext


sqlContext = SQLContext(sc)

### 2. Read file, drop some columns and recast some variable types

We load the data and, as we can see, all columns are strings. We delete the first column, as it is just another id variable and we don't need two of them. Then we change the datatype of id and label to integers (needed for model training afterwards).

In [2]:
from pyspark.sql.utils import AnalysisException 

try: 
    data = spark.read.format("csv").option("header", "true").option("escapeQuotes", "false").load("../Final Preprocessing/NLP_processed_onlyEnglish.csv")
    data.show(1)
except AnalysisException: 
    print("Please check the Filename and Filepath")

+---+-----+--------------------+-----+
|_c0|   id|                text|label|
+---+-----+--------------------+-----+
|  0|44054|let lover marry f...|  1.0|
+---+-----+--------------------+-----+
only showing top 1 row



In [3]:
data.select('label').groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|  1.0| 8752|
|  0.0|26598|
+-----+-----+



In [4]:
data.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- id: string (nullable = true)
 |-- text: string (nullable = true)
 |-- label: string (nullable = true)



In [5]:
data = data.drop("_c0")
data = data.withColumn("id", data["id"].cast("integer"))
data = data.withColumn("label", data["label"].cast("integer"))

In [6]:
data.printSchema()

root
 |-- id: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- label: integer (nullable = true)



### 3. Creating a dataset with 50% Billboard songs and 50% non-Billboard songs

We split the data so that we have a dataset that contains roughly 50% Billboard and 50% non-Billboard songs. A seed is set in the Spark session config at the beginning, and the sampling and splitting calls below also pass explicit seeds ("seed=1" and "seed=1234"), so the result is reproducible. We did this split because we feared that the model might look deceptively good on the full data simply by favouring the majority class, since the number of Billboard songs is small compared to the number of non-Billboard songs. After balancing (cell 11) we also drop the NAs that were introduced at some unknown point again (theory for that at the top).
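
To make the concern about class imbalance concrete, here is a small worked check using the label counts printed in cell 3. It is only a sketch: `majority_baseline_accuracy` is a hypothetical helper that is not part of the notebook, and it shows the accuracy a model would reach on the unbalanced data by always predicting the majority class.

```python
# Label counts taken from the output of cell 3 above.
label_counts = {0: 26598, 1: 8752}

def majority_baseline_accuracy(counts):
    """Accuracy of a model that always predicts the most frequent label."""
    total = sum(counts.values())
    return max(counts.values()) / total

baseline = majority_baseline_accuracy(label_counts)
print(f"Always predicting 'not Billboard' is right {baseline:.1%} of the time")
```

On a balanced dataset the same baseline drops to about 50%, which makes the accuracy of a trained model easier to interpret.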

In [7]:
label_counts = data.groupBy("label").count().collect()

min_count = min(row['count'] for row in label_counts)

# Create balanced DataFrame by sampling
balanced_df = None

for row in label_counts:
    label = row['label']
    count = row['count']
    fraction = min_count / count  # Calculate the fraction of the data to sample
    
    # Sample the data
    sampled_df = data.filter(col("label") == label).sample(False, fraction, seed=1)
    
    # Append sampled data to the balanced DataFrame
    if balanced_df is None:
        balanced_df = sampled_df.limit(min_count)  # limit only caps the size at min_count; sample() itself is approximate
    else:
        balanced_df = balanced_df.union(sampled_df.limit(min_count))


In [8]:
balanced_df = balanced_df.drop("_c0")
balanced_df = balanced_df.withColumn("id", balanced_df["id"].cast("string"))
balanced_df = balanced_df.withColumn("label", balanced_df["label"].cast("long"))

In [9]:
balanced_df.select('label').groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    0| 8736|
|    1| 8752|
+-----+-----+
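
The class counts above are close but not identical. A likely reason is that `sample` keeps each row independently with the given probability, so the sample size only matches the target on average, and `limit` can cap a sample but cannot top it up. The following pure-Python sketch illustrates the difference between that kind of sampling and an exact downsample. The helper names and the stand-in data are assumptions made for illustration, not code from the notebook.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (size varies)."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

def exact_downsample(rows, n, seed):
    """Draw exactly `n` rows without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, n)

majority = list(range(26598))   # stand-in for the non-Billboard songs
target = 8752                   # size of the minority class

approx = bernoulli_sample(majority, target / len(majority), seed=1)
exact = exact_downsample(majority, target, seed=1)

# The first number is typically close to, but not exactly, 8752
print(len(approx), len(exact))
```

In Spark, one option for an exact count would be to order the majority class by `rand(seed)` and then apply `limit(min_count)`, at the cost of a full sort.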



In [10]:
# balanced_df.printSchema()

In [11]:
balanced_df = balanced_df.na.drop()
balanced_training_df_0, balanced_test_df_0 = balanced_df.randomSplit([0.8, 0.2], seed=1234)

In [12]:
balanced_training_df_0.show(n=3)

+-----+--------------------+-----+
|   id|                text|label|
+-----+--------------------+-----+
|    1|baby fight comin ...|    1|
|10001|summer rain tap w...|    1|
|10002|plate pile high d...|    1|
+-----+--------------------+-----+
only showing top 3 rows



### 4. Initiate pipeline

We initiate the pipeline: we import the necessary libraries and set up the functions that we'll use later. These include the tokenizer, HashingTF, a function to calculate IDF and a Naive Bayes classifier. What all of these roughly do is explained in steps 5-7 below. The Naive Bayes classifier doesn't have a separate section; it's essentially a way of classifying something using probabilities. We use it to predict whether a song was in the charts or not (it predicts the "label" column).

IMPORTANT: Step 4 sets up our functions. Steps 5-7 do not need to be executed, but they serve as a visual help to see what is happening. For the model training, the functions specified in step 4 are called during the cross-validation (step 9) as part of the pipeline created in step 8. The tokenizer takes the "text" column of the training data as input, HashingTF takes the tokenizer's output column as input and IDF takes the output of HashingTF as input. So each stage uses the output of the stage before.

In [13]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.tuning import CrossValidator

# tokenizer
tokenizer = Tokenizer(inputCol="text", outputCol="words")

# Term Frequency (TF)
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")

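# Inverse Document Frequency (IDF)
idf = IDF(minDocFreq=3, inputCol="features", outputCol="idf")

# Naive Bayes classifier.
# Note: with default settings it reads the "features" column (the raw term
# frequencies from HashingTF), not the "idf" column; pass featuresCol="idf"
# to train on the TF-IDF weights instead.
nb = NaiveBayes()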

### 5. Tokenize

We tokenize the input, meaning we convert all words to lowercase and then split them by whitespace. This gives us the words of the text and puts them in another column so that we can work with them.

In [14]:
tokenized_balanced = tokenizer.transform(balanced_training_df_0)

In [15]:
tokenized_balanced.show(n=2)

+-----+--------------------+-----+--------------------+
|   id|                text|label|               words|
+-----+--------------------+-----+--------------------+
|    1|baby fight comin ...|    1|[baby, fight, com...|
|10001|summer rain tap w...|    1|[summer, rain, ta...|
+-----+--------------------+-----+--------------------+
only showing top 2 rows
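
As a rough illustration of what the tokenizer does, the following pure-Python sketch lowercases a string and splits it on whitespace. `simple_tokenize` is a hypothetical stand-in and not Spark's implementation, and edge cases such as repeated spaces may be handled differently by Spark. The input string is made up.

```python
def simple_tokenize(text):
    """Lowercase the text and split it on runs of whitespace."""
    return text.lower().split()

words = simple_tokenize("Hello Darkness my OLD friend")
print(words)  # ['hello', 'darkness', 'my', 'old', 'friend']
```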



### 6. Hash Term Frequency

Term frequency measures how often a term occurs in a document (each row is a document in our case). Spark's "HashingTF" stores the raw count per term rather than dividing by the document length. It takes sets of terms (in our case a bag of words) and converts them into fixed-length feature vectors by combining term frequency counts with the hashing trick for dimensionality reduction: each term is hashed to one of "numFeatures" vector positions, so different terms can collide when that number is small. We create these feature vectors because they are needed for the next step.
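
The hashing trick can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not Spark's implementation: `hashing_tf` is a hypothetical helper, and `zlib.crc32` is used as a stand-in hash (Spark uses MurmurHash3), so the bucket indices will differ from Spark's output.

```python
import zlib
from collections import Counter

def hashing_tf(words, num_features):
    """Map each word to a bucket via a hash and count hits per bucket."""
    counts = Counter()
    for word in words:
        # crc32 is a stand-in hash chosen for this sketch; Spark uses MurmurHash3
        bucket = zlib.crc32(word.encode("utf-8")) % num_features
        counts[bucket] += 1
    # Sparse representation: {bucket index: raw term count}
    return dict(counts)

words = ["baby", "fight", "baby", "love"]
vector = hashing_tf(words, num_features=10)
print(vector)
```

With a small number of buckets, different words can land in the same position, which is why "numFeatures" is one of the parameters tuned in step 8.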

In [16]:
hashed_balanced = hashingTF.transform(tokenized_balanced)
hashed_balanced.select("words","features","label").show(n=1,truncate=0)

(output truncated: a single, very wide row with the columns words, features and label)

### 7. Calculate IDF

Now we calculate the inverse document frequency, or IDF for short. The IDF of a term decreases with the proportion of documents (i.e. rows, i.e. songs) in the whole file that contain the term. Terms that appear in only a small number of documents are therefore weighted as more important than words that are commonly used. This helps the model, as it needs some kind of numerical value for each term. With "minDocFreq=3", terms that occur in fewer than three documents get a weight of zero.
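
As a worked example, the sketch below computes IDF weights with the smoothed formula log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term. To our understanding this matches the formula documented for Spark's IDF, but `idf_weights` is a hypothetical helper and the document frequencies are made-up numbers.

```python
import math

def idf_weights(doc_freq, num_docs, min_doc_freq=0):
    """Smoothed IDF per term: log((num_docs + 1) / (df + 1)).

    Terms that occur in fewer than `min_doc_freq` documents get weight 0.
    """
    weights = {}
    for term, df in doc_freq.items():
        if df < min_doc_freq:
            weights[term] = 0.0
        else:
            weights[term] = math.log((num_docs + 1) / (df + 1))
    return weights

# Toy document frequencies for a corpus of 1000 songs (made-up numbers)
doc_freq = {"love": 999, "rhinestone": 9, "zyzzyva": 1}
weights = idf_weights(doc_freq, num_docs=1000, min_doc_freq=3)
print(weights)
```

The very common term gets a weight close to zero, the rarer term gets a much larger weight, and the term below the minimum document frequency is zeroed out.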

In [17]:
model_balanced = idf.fit(hashed_balanced)
tfidf_balanced = model_balanced.transform(hashed_balanced)

In [18]:
tfidf_balanced.show(n=1,truncate=0)

(output truncated: a single, very wide row with the columns id, text, label, words, features and idf)

### 8. Create pipeline and parameter grid

We create a simple pipeline, which consists of a sequence of stages. Each stage is either an estimator or a transformer. We call the pipeline in the next step, and all stages are then executed in their specified order, which helps us automate the process of finding good parameters. If a stage is an estimator, a model is fit on the input; that model (which is a transformer) is then used to transform the data into the input for the next stage. If a stage is a transformer, it transforms the current input directly to produce the input for the next stage. So overall, calling the pipeline fits several models and transforms our data a few times (in the ways specified in the steps above). The stages are tokenizing, computing term frequencies, calculating IDF and fitting the Naive Bayes classifier.
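
The estimator/transformer mechanics can be illustrated with a toy pipeline in plain Python. All class and function names here are hypothetical and only mimic the idea; they are not Spark's classes.

```python
class Lowercase:
    """Transformer: has no state, only transforms its input."""
    def transform(self, docs):
        return [d.lower() for d in docs]

class VocabularyEstimator:
    """Estimator: learns something from the data and returns a model."""
    def fit(self, docs):
        vocab = sorted({w for d in docs for w in d.split()})
        return VocabularyModel(vocab)

class VocabularyModel:
    """Fitted model, which is itself a transformer."""
    def __init__(self, vocab):
        self.vocab = vocab
    def transform(self, docs):
        return [[d.split().count(w) for w in self.vocab] for d in docs]

def fit_pipeline(stages, docs):
    """Fit estimators in order, passing each stage's output to the next."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(docs)   # estimator -> model (a transformer)
        docs = stage.transform(docs)  # output becomes the next stage's input
        fitted.append(stage)
    return fitted, docs

fitted, features = fit_pipeline(
    [Lowercase(), VocabularyEstimator()], ["Baby baby", "Love baby"]
)
print(features)  # [[2, 0], [1, 1]]
```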

The parameter grid lists the values to try for each parameter, which lets us search for the best value during the next step. Concretely, we test several values for "numFeatures" (a parameter of HashingTF, see step 6) and two values for "smoothing" to find the combination that gives the best performance during cross-validation. The "smoothing" parameter of the Naive Bayes classifier adds a small constant to the counts so that a term never seen with a class during training does not get probability zero. We do this because zero probabilities can lead to errors in the model. Note that the grid also contains 0.0, which switches smoothing off.
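
To see why smoothing matters, consider how a word probability is commonly estimated in a multinomial Naive Bayes model with additive smoothing: (count + alpha) / (total + alpha * vocabulary size). The sketch below uses that textbook form with made-up counts; `word_probability` is a hypothetical helper and not Spark's internal code.

```python
def word_probability(word_count, total_words, vocab_size, smoothing):
    """Estimate P(word | class) with additive (Laplace) smoothing."""
    return (word_count + smoothing) / (total_words + smoothing * vocab_size)

# A word that never occurred in chart songs during training (made-up counts)
unsmoothed = word_probability(0, total_words=5000, vocab_size=1000, smoothing=0.0)
smoothed = word_probability(0, total_words=5000, vocab_size=1000, smoothing=1.0)

print(unsmoothed)  # 0.0, so the log-probability is undefined
print(smoothed)    # 1 / 6000, small but non-zero
```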

In [19]:
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, nb])

paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 500, 1000]) \
    .addGrid(nb.smoothing, [0.0, 1.0]).build()
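
The size of the search follows directly from the grid. The sketch below enumerates the combinations in plain Python rather than with Spark, and assumes the two folds used in the cross-validation step.

```python
from itertools import product

num_features_values = [10, 100, 500, 1000]
smoothing_values = [0.0, 1.0]
num_folds = 2  # value used in the cross-validation step

# Every combination of the two parameter lists is one candidate model
grid = list(product(num_features_values, smoothing_values))
fits_during_cv = len(grid) * num_folds

print(len(grid))        # 8 candidate parameter combinations
print(fits_during_cv)   # 16 pipeline fits, plus one final refit of the best candidate
```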

### 9. Run cross validation

With our pipeline and our parameter grid ready, we start the cross-validation. K-fold cross-validation splits the dataset into K non-overlapping, randomly partitioned folds. Each fold is used once as the test set while the remaining K-1 folds are used for training. With e.g. 10 folds, 9 folds train the model and the remaining fold tests it, and this is repeated so that each fold is used for testing once. More folds mean that each model is trained on a larger share of the data, which reduces the risk that important examples end up only in the test fold, as can happen with e.g. 2 folds. However, the runtime also increases with more folds.

We tested different numbers of folds and found that the quality of the predictions did not increase by much, so we chose to stick with two folds for improved runtime. After creating the cross-validator, we train our final model using the training split. We do this and the following steps twice, once for the full data and once for the balanced data (see step 3); the cells below show the run on the balanced data.
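
The fold logic can be sketched in plain Python as follows. `k_fold_indices` is a hypothetical helper written for illustration; Spark's CrossValidator assigns rows to folds in its own way, so fold sizes there are only approximately equal.

```python
import random

def k_fold_indices(n_rows, k, seed):
    """Yield (train_idx, test_idx) pairs; each row is in exactly one test fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_rows=10, k=2, seed=1234))
for train_idx, test_idx in splits:
    print(sorted(train_idx), sorted(test_idx))
```

With two folds, each model is trained on only half of the training split, which is the price paid for the shorter runtime.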

In [20]:
# K-fold cross validation 
cv = CrossValidator(estimator=pipeline, 
                    estimatorParamMaps=paramGrid, 
                    evaluator=MulticlassClassificationEvaluator(),  # default metric: f1
                    numFolds=2)

# Run cross-validation, and choose the best set of parameters.
cvModel_balanced = cv.fit(balanced_training_df_0)
bestModel = cvModel_balanced.bestModel

### 10. Create predictions

We take the best model from the last step and apply it to the test split. Now our predictions have been made and we can evaluate them. For that and to showcase the result, we once again select only the variables of interest (text, label and prediction).

In [21]:
result_balanced = bestModel.transform(balanced_test_df_0)

# Keep only the columns of interest
prediction_df_balanced = result_balanced.select("text", "label", "prediction")

In [22]:
prediction_df_balanced.show(n=2,truncate=0)

+-------------------------------------------------------------------------------------------------------------------+-----+----------+
|text                                                                                                               |label|prediction|
+-------------------------------------------------------------------------------------------------------------------+-----+----------+
|lord old tune fiddle guitar take rhinestone suit new shiny car way year need change somebody tell come son get make|1    |1.0       |
|talk sleep anything believe word day anything heard talk night right call though hear name speak name              |1    |1.0       |
+-------------------------------------------------------------------------------------------------------------------+-----+----------+
only showing top 2 rows



### 11. Result

In this step we only compare the results. We focus on AUC (section c) and additionally report accuracy, precision, recall and F1 there, as we are equally interested in false negatives and false positives and overall want to know how many labels we predicted correctly (which is what accuracy measures directly). We are also interested in the actual proportions of chart songs and non-chart songs (section a) and the proportions of positive and negative predictions (section b). If the proportions in a) and b) vary greatly, this would indicate that the model may be overly optimistic or pessimistic.

#### a)

The next cell shows how many songs of the test data were in the charts:

Balanced data: roughly 50% (1720 of 3484). The split is not exact because the sampling in step 3 and the random train/test split only hit their target fractions approximately, but it is close enough.

In [23]:
prediction_df_balanced.select('label').groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    0| 1764|
|    1| 1720|
+-----+-----+



#### b)

Now let's see how the model predicts:

Balanced data: the same holds here, with roughly 50% (1732 of 3484) predicted as chart songs, which is very close to the actual share.

In [24]:
prediction_df_balanced.select('prediction').groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|       0.0| 1752|
|       1.0| 1732|
+----------+-----+



The English-only model is far closer to the actual proportions than the model trained on all songs.

#### c)

In [26]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Transform the test data using the best model
predictions = bestModel.transform(balanced_test_df_0)

# Initialize the evaluator 
evaluator = BinaryClassificationEvaluator(labelCol='label')

# Evaluate the model
areaUnderROC = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})
areaUnderPR = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderPR"})

print("Test set AUC (ROC) = " + str(areaUnderROC))
print("Test set AUC (PR) = " + str(areaUnderPR))

Test set AUC (ROC) = 0.4211017837367505
Test set AUC (PR) = 0.43024811479057956
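
AUC does not count correct labels; it measures how well a score ranks chart songs above non-chart songs. The sketch below computes it in plain Python on made-up scores (`roc_auc` is a hypothetical helper, not the evaluator's implementation). `BinaryClassificationEvaluator` reads the `rawPrediction` column by default, whereas accuracy uses the hard `prediction` column. That difference is one possible reason why the AUC can fall below 0.5 while accuracy is around 0.64, and it may be worth checking which score column gives a sensible ranking.

```python
def roc_auc(labels, scores):
    """Probability that a random positive gets a higher score than a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

labels = [1, 1, 0, 0]
good_scores = [0.9, 0.8, 0.3, 0.1]   # positives ranked above negatives
bad_scores = [0.1, 0.3, 0.8, 0.9]    # ranking is reversed

print(roc_auc(labels, good_scores))  # 1.0
print(roc_auc(labels, bad_scores))   # 0.0
```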


In [27]:
# Confusion Matrix
confusion_matrix = predictions.groupBy("label", "prediction").count()
confusion_matrix.show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    1|       0.0|  625|
|    0|       1.0|  637|
|    0|       0.0| 1127|
|    1|       1.0| 1095|
+-----+----------+-----+



In [28]:
# Assigning the needed values from the confusion matrix - https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec

# True Positive
TP = confusion_matrix.filter((col("label") == 1) & (col("prediction") == 1)).collect()[0]["count"]

# False Positive
FP = confusion_matrix.filter((col("label") == 0) & (col("prediction") == 1)).collect()[0]["count"]

# True Negative
TN = confusion_matrix.filter((col("label") == 0) & (col("prediction") == 0)).collect()[0]["count"]

# False Negative
FN = confusion_matrix.filter((col("label") == 1) & (col("prediction") == 0)).collect()[0]["count"]


# Calculating the metrics
Accuracy = (TP + TN)/(TP + TN + FP + FN) 
Precision = TP/(TP + FP) 
Recall = TP/(TP + FN) 
F1 = 2 * (Precision * Recall) / (Precision + Recall)


print(f"Accuracy: {Accuracy}")
print(f"Precision: {Precision}")
print(f"Recall: {Recall}")
print(f"F1 Score: {F1}")

Accuracy: 0.6377726750861079
Precision: 0.6322170900692841
Recall: 0.6366279069767442
F1 Score: 0.6344148319814601
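
The metrics can be double-checked by hand from the confusion matrix. The sketch below copies the four counts from the output above into plain Python and builds a lookup once, so a missing cell would default to 0 instead of raising an `IndexError` as `.collect()[0]` would. The variable names are chosen for this illustration; in the notebook the rows could come from a single `confusion_matrix.collect()`.

```python
# (label, prediction, count) rows copied from the confusion matrix above
rows = [(1, 0.0, 625), (0, 1.0, 637), (0, 0.0, 1127), (1, 1.0, 1095)]

# Build a lookup once; missing cells default to 0 instead of raising an error
counts = {(int(label), int(pred)): n for label, pred, n in rows}
TP = counts.get((1, 1), 0)
FP = counts.get((0, 1), 0)
TN = counts.get((0, 0), 0)
FN = counts.get((1, 0), 0)

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.6378 0.6322 0.6366 0.6344
```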


### 12. Discussion of results of all NLP models

- The accuracy on the unfiltered balanced data is around 0.71
- The accuracy on the English-only balanced data is around 0.64

This is roughly what we expected. When training with unfiltered data, the model performs better because it can simply assign all non-English songs a prediction of "0", and with high likelihood this is correct, as usually only English songs are in the Billboard Hot 100. When the data consists of only English songs, this trick no longer works.
Arguably, it may also have something to do with the size of the dataset, as the unfiltered data is bigger. However, we do not think this plays a role here.

Our claim is supported by the fact that, on the unfiltered data, the model tends to be either very optimistic or very pessimistic. For the full data prediction, it classifies almost all songs as non-chart songs, and for the balanced data prediction, it classifies most songs as chart songs.
In contrast, the model trained on only English songs predicts the values in a ratio that is much closer to the actual ratio of "chart songs" and "non-chart songs" in the test split. (Compare the first 4 cells of the "result" section in both notebooks.)