## Spotify Classification

<h2> Obtaining Data </h2>

In [1]:
from pandas import read_csv
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
import numpy as np

In [2]:
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
# To disable SettingWithCopyWarning 
pd.options.mode.chained_assignment = None  # default='warn'

In [3]:
df_2000 = pd.read_csv("Spotify-2000.csv")
df_top10s = pd.read_csv("top10s.csv", engine='python') # the engine needs to be changed otherwise UTF-8 error occurs
df_2000.head()

Unnamed: 0,Index,Title,Artist,Top Genre,Year,Beats Per Minute (BPM),Energy,Danceability,Loudness (dB),Liveness,Valence,Length (Duration),Acousticness,Speechiness,Popularity
0,1,Sunrise,Norah Jones,adult standards,2004,157,30,53,-14,11,68,201,94,3,71
1,2,Black Night,Deep Purple,album rock,2000,135,79,50,-11,17,81,207,17,7,39
2,3,Clint Eastwood,Gorillaz,alternative hip hop,2001,168,69,66,-9,7,52,341,2,17,69
3,4,The Pretender,Foo Fighters,alternative metal,2007,173,96,43,-4,3,37,269,0,4,76
4,5,Waitin' On A Sunny Day,Bruce Springsteen,classic rock,2002,106,82,58,-5,10,87,256,1,3,59


In [4]:
df_top10s.head()

Unnamed: 0.1,Unnamed: 0,title,artist,top genre,year,bpm,nrgy,dnce,dB,live,val,dur,acous,spch,pop
0,1,"Hey, Soul Sister",Train,neo mellow,2010,97,89,67,-4,8,80,217,19,4,83
1,2,Love The Way You Lie,Eminem,detroit hip hop,2010,87,93,75,-5,52,64,263,24,23,82
2,3,TiK ToK,Kesha,dance pop,2010,120,84,76,-3,29,71,200,10,14,80
3,4,Bad Romance,Lady Gaga,dance pop,2010,119,92,70,-4,8,71,295,0,4,79
4,5,Just the Way You Are,Bruno Mars,pop,2010,109,84,64,-5,9,43,221,2,4,78


In [5]:
df_2000.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1994 entries, 0 to 1993
Data columns (total 15 columns):
Index                     1994 non-null int64
Title                     1994 non-null object
Artist                    1994 non-null object
Top Genre                 1994 non-null object
Year                      1994 non-null int64
Beats Per Minute (BPM)    1994 non-null int64
Energy                    1994 non-null int64
Danceability              1994 non-null int64
Loudness (dB)             1994 non-null int64
Liveness                  1994 non-null int64
Valence                   1994 non-null int64
Length (Duration)         1994 non-null object
Acousticness              1994 non-null int64
Speechiness               1994 non-null int64
Popularity                1994 non-null int64
dtypes: int64(11), object(4)
memory usage: 233.8+ KB


In [6]:
df_top10s.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 603 entries, 0 to 602
Data columns (total 15 columns):
Unnamed: 0    603 non-null int64
title         603 non-null object
artist        603 non-null object
top genre     603 non-null object
year          603 non-null int64
bpm           603 non-null int64
nrgy          603 non-null int64
dnce          603 non-null int64
dB            603 non-null int64
live          603 non-null int64
val           603 non-null int64
dur           603 non-null int64
acous         603 non-null int64
spch          603 non-null int64
pop           603 non-null int64
dtypes: int64(12), object(3)
memory usage: 70.8+ KB


In [7]:
len(df_2000["Top Genre"].unique()), len(df_top10s["top genre"].unique())

(149, 50)

In [8]:
df_2000["Top Genre"].value_counts(), df_top10s["top genre"].value_counts()

(album rock              413
 adult standards         123
 dutch pop                88
 alternative rock         86
 dance pop                83
                        ... 
 danish pop                1
 la pop                    1
 australian americana      1
 laboratorio               1
 gangster rap              1
 Name: Top Genre, Length: 149, dtype: int64, dance pop                    327
 pop                           60
 canadian pop                  34
 boy band                      15
 barbadian pop                 15
 electropop                    13
 british soul                  11
 big room                      10
 canadian contemporary r&b      9
 neo mellow                     9
 art pop                        8
 australian dance               6
 hip pop                        6
 complextro                     6
 australian pop                 5
 edm                            5
 atl hip hop                    5
 hip hop                        4
 permanent wave          

From a quick look at the data, a few things stand out:<br>
1. Both datasets use the same metrics under different column names, so the columns have to be renamed. Fortunately, both datasets keep the same column order.
2. Our target label is the top genre. However, there are so many genres that a model trained on them as they are would be inaccurate and inefficient, especially since many genres contain only a single song.
3. The two datasets do not have the same number of genres (149 and 50).
4. We drop the artist, title, index/Unnamed: 0, and year columns, since I don't expect them to have a strong connection with the genre. (Arguably, the artist is connected to the genre, but I think there is too much variance for it to be a consistent pattern.)
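Renaming by position only works while the column order really matches. As a minimal sketch of a name-based alternative, the snippet below renames a one-row stand-in for `df_top10s`. The `COLUMN_MAP` dictionary is my own reading of the two headers shown above, not something either dataset documents.

```python
import pandas as pd

# Assumed mapping from the top10s abbreviations to the Spotify-2000 column names
COLUMN_MAP = {
    "top genre": "Top Genre",
    "bpm": "Beats Per Minute (BPM)",
    "nrgy": "Energy",
    "dnce": "Danceability",
    "dB": "Loudness (dB)",
    "live": "Liveness",
    "val": "Valence",
    "dur": "Length (Duration)",
    "acous": "Acousticness",
    "spch": "Speechiness",
    "pop": "Popularity",
}

# One-row stand-in for df_top10s, using the first row printed above
top10s_sample = pd.DataFrame(
    [["neo mellow", 97, 89, 67, -4, 8, 80, 217, 19, 4, 83]],
    columns=list(COLUMN_MAP),
)

# rename() matches on column names, so the column order no longer matters
renamed = top10s_sample.rename(columns=COLUMN_MAP)
print(renamed.columns.tolist())
```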

<h2> Data Preparation </h2>

Dropping all unnecessary columns

In [9]:
df_top10s.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 603 entries, 0 to 602
Data columns (total 15 columns):
Unnamed: 0    603 non-null int64
title         603 non-null object
artist        603 non-null object
top genre     603 non-null object
year          603 non-null int64
bpm           603 non-null int64
nrgy          603 non-null int64
dnce          603 non-null int64
dB            603 non-null int64
live          603 non-null int64
val           603 non-null int64
dur           603 non-null int64
acous         603 non-null int64
spch          603 non-null int64
pop           603 non-null int64
dtypes: int64(12), object(3)
memory usage: 70.8+ KB


In [10]:
df_2000.drop(columns = ['Index', 'Title', 'Artist', 'Year'], inplace = True)
df_top10s.drop(columns = ['Unnamed: 0', 'title', 'artist', 'year'], inplace = True)

Now both datasets have the same number of columns in the same order, so we can join them together after renaming.

In [11]:
df_top10s.columns = df_2000.columns # give both frames the same column names
df = pd.concat([df_2000, df_top10s], ignore_index = True) # DataFrame.append was removed in pandas 2.0

The combined dataframe contains several numbers stored as strings with a comma inside them (the Length (Duration) column of the first dataset has dtype object). We have to remove these commas and then convert the values to floats.

In [12]:
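attributes = df.columns[1:]
for attribute in attributes:
    # only non-numeric columns can contain numbers stored as comma-separated strings
    if not pd.api.types.is_numeric_dtype(df[attribute]):
        # cast everything to str first, since the column mixes strings and integers
        df[attribute] = df[attribute].astype(str).str.replace(',', '', regex = False).astype(float)
# check the resulting data types with df.dtypes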
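As a quick sanity check of the cleaning step, here is a small sketch on a toy column that mixes comma-formatted strings with plain integers. The values are made up for illustration and `pd.to_numeric` is just one way to do the conversion.

```python
import pandas as pd

# Toy column mixing comma-formatted strings and plain integers (made-up values)
duration = pd.Series(["1,412", 201, "365", 207], dtype=object)

# Cast to str, strip the commas, then let pandas parse the numbers
cleaned = pd.to_numeric(duration.astype(str).str.replace(",", "", regex=False))

print(cleaned.dtype)
print(cleaned.tolist())
```

If any value could not be parsed, `pd.to_numeric` would raise an error, which makes leftover bad values easy to spot.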

<h3> Splitting Genre </h3>

Now that we have obtained our full dataset, we need a way to split the top genres. Two methods will be explored: <br>
**Method 1** Every genre is folded into a broader category (e.g. celtic pop and indie pop both become pop). The main assumption is that subgenres of the same broad genre differ very little. This is multiclass classification. <br>
**Method 2** Each genre is split on spaces into separate labels, and multilabel classification is used. <br>
**NOTE**: Method 2 has been moved to the bottom, after Method 1 is completed, so that the code for the two methods does not interfere.
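To make Method 2 concrete, here is a minimal sketch of how the split genres could be turned into a multilabel target. It uses scikit-learn's `MultiLabelBinarizer` on three toy genres; this is one possible encoding and not necessarily the one used further down.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy genres standing in for the real "Top Genre" column
toy_genres = pd.Series(["celtic pop", "indie pop", "alternative hip hop"])

# Every word of a genre becomes its own label
word_lists = toy_genres.str.split()

# One indicator column per distinct word, one row per song
mlb = MultiLabelBinarizer()
multilabel_target = mlb.fit_transform(word_lists)

print(mlb.classes_.tolist())
print(multilabel_target)
```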

In [13]:
# first extracting the genre columns
# getting rid of white spaces and turning it all into lower cases
genre = (df["Top Genre"].str.strip()).str.lower()

<h4> Method 1 </h4>

In [14]:
# function to strip the first word from every multi-word genre
def genre_splitter(genre):
    # split off the first word only (n must be passed by keyword in pandas 2.0 and later)
    parts = genre.str.split(" ", n = 1)
    # the last element is the remainder for multi-word genres and the genre itself otherwise
    return parts.str[-1]

In [15]:
# loop until the genre cannot be split any further
genre_m1 = genre.copy()
while genre_m1.str.split(" ", n = 1).str.len().max() > 1:
    genre_m1 = genre_splitter(genre_m1)
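Since the loop strips one leading word per pass, the end result is simply the last word of each genre. A shorter way to get the same thing, shown here as a sketch on toy values, is to split once from the right:

```python
import pandas as pd

# Toy genres; three of them appear in the data shown above
toy = pd.Series(["adult standards", "alternative hip hop", "pop", "album rock"])

# Split once from the right and keep the final piece, i.e. the last word
last_word = toy.str.rsplit(" ", n=1).str[-1]

print(last_word.tolist())
```

Note what this does to "alternative hip hop": it becomes "hop", which is why that label shows up in the results below.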

In [16]:
len(genre_m1.unique())

73

This reduced the number of target labels to 73, less than half of the 149 genres in the first dataset alone. Let's take a quick look at our new shortened labels.

In [17]:
genre_m1.value_counts()

rock             857
pop              802
standards        123
metal             93
indie             78
                ... 
j-core             1
rock-and-roll      1
levenslied         1
laboratorio        1
ambient            1
Name: Top Genre, Length: 73, dtype: int64

Note that many genres contain very few songs, some only a single one. These genres would hurt our model, since it has almost no data to learn them from, so their rows will be removed from the dataframe. Putting all of these instances into a broader "Other" category is another potential solution, but I decided against it because these genres probably have little in common.
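For comparison, this is roughly what the rejected "Other" option would look like. It is a sketch on toy labels with a hypothetical threshold of 3, not what the notebook goes on to do.

```python
import pandas as pd

# Toy labels: two common genres and two that occur once
toy = pd.Series(["rock"] * 4 + ["pop"] * 3 + ["j-core", "ambient"])

counts = toy.value_counts()
MIN_COUNT = 3  # hypothetical threshold for this toy example

# Keep a label if its genre is frequent enough, otherwise replace it with "other"
bucketed = toy.where(toy.map(counts) >= MIN_COUNT, "other")

print(bucketed.value_counts().to_dict())
```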

In [18]:
counts = genre_m1.value_counts()

# genres with fewer than 20 songs are collected in the to_remove list
# (the threshold of 20 was chosen arbitrarily)
to_remove = counts[counts < 20].index.tolist()
len(to_remove)

58

Now replacing our original genre column with the updated version

In [19]:
df['Top Genre'] = genre_m1
df

Unnamed: 0,Top Genre,Beats Per Minute (BPM),Energy,Danceability,Loudness (dB),Liveness,Valence,Length (Duration),Acousticness,Speechiness,Popularity
0,standards,157,30,53,-14,11,68,201,94,3,71
1,rock,135,79,50,-11,17,81,207,17,7,39
2,hop,168,69,66,-9,7,52,341,2,17,69
3,metal,173,96,43,-4,3,37,269,0,4,76
4,rock,106,82,58,-5,10,87,256,1,3,59
...,...,...,...,...,...,...,...,...,...,...,...
2592,pop,104,66,61,-7,20,16,176,1,3,75
2593,pop,95,79,75,-6,7,61,206,21,12,75
2594,pop,136,76,53,-5,9,65,260,7,34,70
2595,pop,114,79,60,-6,42,24,217,1,7,69


In [20]:
df.set_index(["Top Genre"], drop = False, inplace = True)
# drop every row whose genre is in to_remove
df.drop(index = to_remove, inplace = True)

In [21]:
df["Top Genre"].value_counts()

rock         857
pop          802
standards    123
metal         93
indie         78
soul          56
cabaret       51
hop           43
wave          42
invasion      36
europop       27
mellow        26
dance         22
band          21
folk          20
Name: Top Genre, dtype: int64

<h2> Model Creation </h2>

Since we are dealing with a classification problem, several models will be tried:
1. Random Forest
2. Naive Bayes
3. Stochastic Gradient Descent Classifier
4. Logistic Regression

<h3> Method 1 </h3>

First, the training and test sets are prepared and standardized using StandardScaler.

In [22]:
train_set, test_set = train_test_split(df, test_size = 0.2, random_state = 42)
# training set
X_train = train_set.values[:,1:]
y_train = train_set.values[:,0]

# test set
X_test = test_set.values[:,1:]
y_test = test_set.values[:,0]
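Because the genres are so imbalanced, it may be worth keeping the class proportions equal in both splits. The sketch below shows the `stratify` argument of `train_test_split` on made-up data; the notebook itself uses a plain random split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 80 "rock" songs and 20 "folk" songs with 3 random features
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array(["rock"] * 80 + ["folk"] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both the train and the test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(y_te), (y_te == "folk").sum())
```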

In [23]:
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler().fit(X_train)

# Standard Scaler
X_train_ST = standard_scaler.transform(X_train)
X_test_ST = standard_scaler.transform(X_test)
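What the scaler does is subtract the training mean and divide by the training standard deviation, column by column. A small sketch with toy numbers (the first rows of BPM and Energy shown earlier) confirms that the test set is scaled with the training statistics and not its own:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train and test sets with two features (BPM, Energy)
X_train_toy = np.array([[157.0, 30.0], [135.0, 79.0], [168.0, 69.0], [173.0, 96.0]])
X_test_toy = np.array([[106.0, 82.0]])

scaler = StandardScaler().fit(X_train_toy)

# Manual version: statistics come from the training data only (population std, ddof=0)
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)
manual_test = (X_test_toy - train_mean) / train_std

print(np.allclose(scaler.transform(X_test_toy), manual_test))
```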

The labels are also converted to a one-hot encoded form, which is used for the Random Forest model below.

In [24]:
# obtaining all unique classes
unique = np.unique(y_train)

In [25]:
from sklearn.preprocessing import label_binarize
from sklearn.preprocessing import LabelEncoder
# 1 hot encoding
y_test_1hot = label_binarize(y_test, classes = unique)
y_train_1hot = label_binarize(y_train, classes = unique)

# labelling
y_test_label = LabelEncoder()
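To see the difference between the two encodings imported above, here is a sketch on four toy labels. `label_binarize` produces one indicator column per class, while `LabelEncoder` produces a single integer code per song.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, label_binarize

y_toy = np.array(["pop", "rock", "pop", "metal"])
classes = np.unique(y_toy)  # sorted: metal, pop, rock

# One-hot: one column per class, a single 1 in each row
one_hot = label_binarize(y_toy, classes=classes)

# Integer codes: the position of each label in the sorted class list
encoder = LabelEncoder().fit(y_toy)
codes = encoder.transform(y_toy)
restored = encoder.inverse_transform(codes)

print(one_hot)
print(codes, restored)
```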

Now creating the instances of the models

In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier

models = []
models += [['Naive Bayes', GaussianNB()]]
models += [['SGD', OneVsOneClassifier(SGDClassifier())]]
models += [['Logistic', LogisticRegression(multi_class = 'ovr')]]
rand_forest = RandomForestClassifier(random_state = 42, min_samples_split = 5)

Now training the models using k-fold cross-validation

In [27]:
result_ST =[]
kfold = StratifiedKFold(n_splits = 10, random_state = 1, shuffle = True)

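# Random Forest is scored separately on the one hot encoded labels (it would also accept the raw string labels).
# With a 2D indicator target it is treated as a multilabel problem, so accuracy here means every label column must match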
RF_cross_val_score = cross_val_score(rand_forest, X_train_ST, y_train_1hot, cv = 10, scoring = 'accuracy')
print('%s: %f (%f)' % ('Random Forest', RF_cross_val_score.mean(), RF_cross_val_score.std()))

for name, model in models:
    cv_score = cross_val_score(model, X_train_ST, y_train, cv = kfold, scoring = 'accuracy')
    result_ST.append(cv_score)
    print('%s: %f (%f)' % (name,cv_score.mean(), cv_score.std()))

Random Forest: 0.375630 (0.042596)
Naive Bayes: 0.406146 (0.039497)
SGD: 0.509915 (0.023463)
Logistic: 0.562196 (0.018899)
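The Random Forest score above is not directly comparable with the others, since it was computed on one-hot labels. As a sketch of a like-for-like setup, the snippet below scores a forest on plain string labels with the same kind of stratified folds. The data here is synthetic (`make_classification`), so the numbers it prints say nothing about the Spotify data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the scaled features and genre labels
X_toy, y_int = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)
y_str = np.array(["metal", "pop", "rock"])[y_int]  # string labels, like y_train

forest = RandomForestClassifier(random_state=42, min_samples_split=5)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# String labels are accepted directly, so the forest can share the folds used for the other models
scores = cross_val_score(forest, X_toy, y_str, cv=folds, scoring="accuracy")
print("%f (%f)" % (scores.mean(), scores.std()))
```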


All of our models have low accuracy, most likely because of the limited data available. Let's move forward anyway and select a model. Logistic Regression appears to have the highest accuracy (56%), but let's also calculate the precision and recall of each model. Micro averaging is used because of the imbalance in class examples (the code below applies it to Random Forest as well, on its one-hot labels).

In [28]:
from sklearn.metrics import precision_score, recall_score

result_precision_recall = []

# same reasoning as before for Random Forest
y_temp_randforest = cross_val_predict(rand_forest, X_train_ST, y_train_1hot, cv = 10)
result_precision_recall += [['Random Forest', precision_score(y_train_1hot, y_temp_randforest, average = "micro"), 
                            recall_score(y_train_1hot, y_temp_randforest, average = "micro")]]

print('%s | Precision: %f, Recall: %f' % ('Random Forest', precision_score(y_train_1hot, y_temp_randforest, average = "micro"), 
                           recall_score(y_train_1hot, y_temp_randforest, average = "micro")))

for name, model in models:
    y_pred = cross_val_predict(model, X_train_ST, y_train, cv = kfold)
    precision = precision_score(y_train, y_pred, average = "micro")
    recall = recall_score(y_train, y_pred, average = "micro")
    # storing the precision and recall values
    result_precision_recall += [[name , precision, recall]]
    print('%s | Precision: %f, Recall: %f' % (name, precision, recall))


Random Forest | Precision: 0.685204, Recall: 0.375612
Naive Bayes | Precision: 0.406097, Recall: 0.406097
SGD | Precision: 0.494829, Recall: 0.494829
Logistic | Precision: 0.562330, Recall: 0.562330
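Notice that precision and recall are identical for Naive Bayes, SGD and Logistic. That is expected: when every song has exactly one true label and one predicted label, each mistake counts once as a false positive and once as a false negative, so micro-averaged precision, recall and accuracy all coincide. Random Forest differs because it was scored on one-hot labels, where a row can receive no predicted label at all. A small pure-Python sketch on toy labels shows the single-label case:

```python
# Toy single-label predictions
y_true = ["pop", "rock", "pop", "metal", "rock", "pop"]
y_pred = ["pop", "pop", "pop", "rock", "rock", "metal"]
labels = sorted(set(y_true) | set(y_pred))

# Pool true positives, false positives and false negatives over all classes
tp = fp = fn = 0
for c in labels:
    for t, p in zip(y_true, y_pred):
        tp += (p == c and t == c)
        fp += (p == c and t != c)
        fn += (p != c and t == c)

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(micro_precision, micro_recall, accuracy)
```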


From the given precision and recall scores, we can now find the respective f1 score and use the highest score to select our model.

In [29]:
from sklearn.metrics import f1_score

for name, precision, recall in result_precision_recall:
    print("%s: %f" % (name, 2 * (precision * recall) / (precision + recall)))

Random Forest: 0.485232
Naive Bayes: 0.406097
SGD: 0.494829
Logistic: 0.562330
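As a quick check of the arithmetic, the f1 score is the harmonic mean of precision and recall. Plugging in the Random Forest and Logistic values printed earlier reproduces the scores above; `f1_from` is just a helper name I made up for this sketch.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from the precision/recall output above
rf_f1 = f1_from(0.685204, 0.375612)
logistic_f1 = f1_from(0.562330, 0.562330)

print("Random Forest: %f" % rf_f1)
print("Logistic: %f" % logistic_f1)
```

When precision and recall are equal, the harmonic mean is that same value, which is why three of the four f1 scores simply repeat the earlier numbers.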


Given that Logistic Regression has the highest f1 score, we will move forward with that model.

<h2> Evaluation </h2>

<h3> Method 1 </h3>

Our chosen model was logistic regression, so let's evaluate our trained model on the test data set now

In [30]:
# training the model
model_method1 = LogisticRegression(multi_class = 'ovr').fit(X_train_ST, y_train)

# getting predictions
predictions_method1 = model_method1.predict(X_test_ST)

Now evaluating our accuracy and f1 score

In [31]:
from sklearn.metrics import confusion_matrix
print(f1_score(y_test, predictions_method1, labels = unique, average = 'micro' ))

0.55


This low f1 score most likely reflects the limited amount of data available, compounded by the strong class imbalance (rock and pop account for most of the songs).
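A single micro-averaged number hides which genres the model actually gets wrong. One way to dig further, sketched here on toy labels, is a confusion matrix plus a per-class report. On the real data the same calls would take `y_test`, `predictions_method1` and `unique` in place of the toy lists.

```python
from sklearn.metrics import classification_report, confusion_matrix

labels = ["metal", "pop", "rock"]
y_true_toy = ["pop", "rock", "pop", "metal", "rock"]
y_pred_toy = ["pop", "pop", "pop", "rock", "rock"]

# Rows are the true genres, columns the predicted genres, both in the order of `labels`
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)
print(cm)

# Per-class precision, recall and f1; zero_division=0 silences warnings for never-predicted classes
print(classification_report(y_true_toy, y_pred_toy, labels=labels, zero_division=0))
```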