# **Overview:**
In this section, we will try processing the two categorical columns of our dataset, which are **"title"** and **"artist"**. More specifically, we will try converting these two columns to numerical representation to be able to "feed" them to a Classification model.

This conversion will be performed by employing **One-Hot-Encoded vectors**. One-hot encoding is a technique used in machine learning to convert categorical variables into a binary representation, where each unique category becomes a separate binary feature, with a value of 1 indicating the presence of that category and 0 indicating its absence.

Goal of this analysis is to evaluate the new models' performance and compare it with that of the baseline.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, StratifiedKFold, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.naive_bayes import GaussianNB

import warnings
warnings.filterwarnings("ignore")

Columns of **"title"** and **"artist"** will be examined, to check if they are valuable for model training. If they are, we will perform **One-Hot-Encoding** on them.

In [2]:
train = pd.read_csv("CS98XClassificationTrain.csv")
train = train.drop(['Id'],axis=1)
train = train.dropna()
train = train.reset_index()
train = train.drop(["index"], axis = 1)
train

Unnamed: 0,title,artist,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop,top genre
0,My Happiness,Connie Francis,1996,107,31,45,-8,13,28,150,75,3,44,adult standards
1,How Deep Is Your Love,Bee Gees,1979,105,36,63,-9,13,67,245,11,3,77,adult standards
2,Woman in Love,Barbra Streisand,1980,170,28,47,-16,13,33,232,25,3,67,adult standards
3,Goodbye Yellow Brick Road - Remastered 2014,Elton John,1973,121,47,56,-8,15,40,193,45,3,63,glam rock
4,Grenade,Bruno Mars,2010,110,56,71,-7,12,23,223,15,6,74,pop
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
433,But Not For Me,Ella Fitzgerald,1959,80,22,18,-17,10,16,214,92,4,45,adult standards
434,Surf City,Jan & Dean,2010,148,81,53,-13,23,96,147,50,3,50,brill building pop
435,Dilemma,Nelly,2002,168,55,73,-8,20,61,289,23,14,77,dance pop
436,It's Gonna Be Me,*NSYNC,2000,165,87,64,-5,6,88,191,5,8,62,boy band


We will now examine how many unique different elements does each of these columns possess.

In [3]:
unique_title = len(train['title'].unique())
unique_artist = len(train['artist'].unique())
print("The number of unique song titles in the training set are :", unique_title)
print("The number of unique song artists in the training set are :", unique_artist)

The number of unique song titles in the training set are : 436
The number of unique song artists in the training set are : 331


Given the large number of unique **"titles"** and unique **"artists"**, one-hot encoding these variables might lead to a very high-dimensional dataset, which could pose challenges for some algorithms and may lead to the curse of dimensionality.

However, since unique **"artists"** are less that unique **"titles"**, we will experiment with **One-Hot-Encoding** only on the **"artist"** column.

In [4]:
# Here, we will process the train/test datasets to both have 11 numerical variables + the "artist" column:
# (We will also load the validation set)
train = train.drop(['title'], axis = 1)

test = pd.read_csv("CS98XClassificationTest.csv")
test = test.drop(['Id','title'], axis = 1)
test = test.dropna()
test = test.reset_index()
test = test.drop(["index"], axis = 1)

test2 = pd.read_csv('CS98XRegressionTest.csv')
test2 = test2.dropna()
validation_test = test2["top genre"]

To use our Classification model effectively, both train and test sets need to have the same dimensions (number of columns). As such, we will firstly concatenate both our train/test sets vertically. After sucessfully creating One-Hot-Encoded vectors, we will split them again into their original size and drop the original **"artist"** column.

In [5]:
# Firstly, we concatenate both train and test sets to include all 331 unique artists
train_no_genre = train.drop(["top genre"], axis = 1)
concatenated_train_test = pd.concat([train_no_genre, test], axis=0)

# Convert categorical data to one-hot encoded vectors
train_categorical = pd.get_dummies(concatenated_train_test['artist'])
concatenated_train_test = concatenated_train_test.drop(['artist'],axis=1)

# Concatenate the two DataFrames vertically
concatenated_df = pd.concat([concatenated_train_test, train_categorical], axis=1)

# Split the DataFrame vertically at row 438
train_final = concatenated_df.iloc[:438]
test_final = concatenated_df.iloc[438:]
train_final = pd.concat([train_final, train.loc[:, 'top genre']], axis=1)

In [6]:
Y = train_final.loc[:, 'top genre']
X = train_final.drop(['top genre'], axis=1)

Scaling is generally not necessary for one-hot encoded variables. However, in our case, the combination of one-hot encoded vectors with scaled data produced significantly higher accuracies compared to training with unscaled data.

The increase in accuracy ranged from **1%** (**Naive Bayes model**) to **16%** (**Logistic Regression model** + **Multilayer Perceptron model**).

In [7]:
scaler = StandardScaler()
scaled_X = scaler.fit_transform(X[['live', 'dur', 'spch', 'year', 'bpm', 'nrgy', 'dnce', 'dB', 'val', 'acous', 'pop']])
scaled_test = scaler.fit_transform(test_final[['live', 'dur', 'spch', 'year', 'bpm', 'nrgy', 'dnce', 'dB', 'val', 'acous', 'pop']])
X[['live', 'dur', 'spch', 'year', 'bpm', 'nrgy', 'dnce', 'dB', 'val', 'acous', 'pop']] = scaled_X
test_final[['live', 'dur', 'spch', 'year', 'bpm', 'nrgy', 'dnce', 'dB', 'val', 'acous', 'pop']] = scaled_test

In [8]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 15)

In [9]:
tree_clf = DecisionTreeClassifier()
log_clf = LogisticRegression()
ovr_clf = OneVsRestClassifier(log_clf)
rnd_clf = RandomForestClassifier()
kn_clf = KNeighborsClassifier()
svm_clf = SVC()
mlp_clf = MLPClassifier()
gnb_clf = GaussianNB()

In [10]:
for clf in (log_clf, rnd_clf, svm_clf, kn_clf, mlp_clf, tree_clf, ovr_clf, gnb_clf):
    clf.fit(X_train, Y_train)
    ypred = clf.predict(X_test)
    print(clf.__class__.__name__, ":", accuracy_score(Y_test, ypred))

LogisticRegression : 0.3522727272727273
RandomForestClassifier : 0.38636363636363635
SVC : 0.375
KNeighborsClassifier : 0.26136363636363635
MLPClassifier : 0.38636363636363635
DecisionTreeClassifier : 0.18181818181818182
OneVsRestClassifier : 0.38636363636363635
GaussianNB : 0.25


In [11]:
final_predictions1 = log_clf.predict(test_final)
final_predictions2 = rnd_clf.predict(test_final)
final_predictions3 = svm_clf.predict(test_final)
final_predictions4 = kn_clf.predict(test_final)
final_predictions5 = mlp_clf.predict(test_final)
final_predictions6 = tree_clf.predict(test_final)
final_predictions7 = ovr_clf.predict(test_final)
final_predictions8 = gnb_clf.predict(test_final)

In [12]:
accuracy1 = accuracy_score(validation_test, final_predictions1)
accuracy2 = accuracy_score(validation_test, final_predictions2)
accuracy3 = accuracy_score(validation_test, final_predictions3)
accuracy4 = accuracy_score(validation_test, final_predictions4)
accuracy5 = accuracy_score(validation_test, final_predictions5)
accuracy6 = accuracy_score(validation_test, final_predictions6)
accuracy7 = accuracy_score(validation_test, final_predictions7)
accuracy8 = accuracy_score(validation_test, final_predictions8)

print("Accuracy of the Logistic Regression model is:", accuracy1)
print("Accuracy of the Random Forest model is:", accuracy2)
print("Accuracy of the Support Vector model is:", accuracy3)
print("Accuracy of the K-Nearest Neighbors model is:", accuracy4)
print("Accuracy of the Multilayer Perceptron model is:", accuracy5)
print("Accuracy of the Decision Tree model is:", accuracy6)
print("Accuracy of the OneVSRest model is:", accuracy7)
print("Accuracy of the Gaussian Naive Bayes model is:", accuracy8)

Accuracy of the Logistic Regression model is: 0.4690265486725664
Accuracy of the Random Forest model is: 0.46017699115044247
Accuracy of the Support Vector model is: 0.35398230088495575
Accuracy of the K-Nearest Neighbors model is: 0.26548672566371684
Accuracy of the Multilayer Perceptron model is: 0.5221238938053098
Accuracy of the Decision Tree model is: 0.3008849557522124
Accuracy of the OneVSRest model is: 0.46017699115044247
Accuracy of the Gaussian Naive Bayes model is: 0.35398230088495575


One-Hot-Encoding produced the highest results so far, with the **Multilayer Perceptron model** producing **0.5221** on unknown data. Honorable mentions are the **Logistic Regression model** with **0.469** accuracy, the **OneVSRest model** with **0.46** accuracy and the **Random Forest model** with **0.46** accuracy respectively.

However, having higher accuracy on the test set rather than on the train set can be a cause of concern. Concatenating train and test set, in the beginning, indicates **Data Leakage** where information from the test set is incorporated into the training process. Furthermore, having small training/testing sets, can create random variations in the data which can lead to fluctuations in performance metrics. In such cases, it's possible for the test set accuracy to be higher by chance.