# Table of Contents
1. [Importing Libraries](#importing-libraries)
2. [Data Description](#data-description)
    * [Multiple Genres](#multiple-genres)
3. [Data Preprocessing](#data-cleaning)
4. [Data Modelling](#data-modelling)

## Importing Libraries <a class="anchor" id="importing-libraries"></a>

Here, we import all the necessary libraries for our work.

In [47]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

## Data Description <a class="anchor" id="data-description"></a>

First, the data is loaded and basic information about the data is displayed.

In [48]:
tracks = pd.read_csv('csvs/dataset.csv', index_col=0)

tracks.head()

| | track_id | artists | album_name | track_name | popularity | duration_ms | explicit | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature | track_genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5SuOikwiRyPMVoIQDJUgSV | Gen Hoshino | Comedy | Comedy | 73 | 230666 | False | 0.676 | 0.461 | 1 | -6.746 | 0 | 0.143 | 0.0322 | 1e-06 | 0.358 | 0.715 | 87.917 | 4 | acoustic |
| 1 | 4qPNDBW1i3p13qLCt0Ki3A | Ben Woodward | Ghost (Acoustic) | Ghost - Acoustic | 55 | 149610 | False | 0.42 | 0.166 | 1 | -17.235 | 1 | 0.0763 | 0.924 | 6e-06 | 0.101 | 0.267 | 77.489 | 4 | acoustic |
| 2 | 1iJBSr7s7jYXzM8EGcbK5b | Ingrid Michaelson;ZAYN | To Begin Again | To Begin Again | 57 | 210826 | False | 0.438 | 0.359 | 0 | -9.734 | 1 | 0.0557 | 0.21 | 0.0 | 0.117 | 0.12 | 76.332 | 4 | acoustic |
| 3 | 6lfxq3CG4xtTiEg7opyCyx | Kina Grannis | Crazy Rich Asians (Original Motion Picture Sou... | Can't Help Falling In Love | 71 | 201933 | False | 0.266 | 0.0596 | 0 | -18.515 | 1 | 0.0363 | 0.905 | 7.1e-05 | 0.132 | 0.143 | 181.74 | 3 | acoustic |
| 4 | 5vjLSffimiIP26QG5WcN2K | Chord Overstreet | Hold On | Hold On | 82 | 198853 | False | 0.618 | 0.443 | 2 | -9.681 | 1 | 0.0526 | 0.469 | 0.0 | 0.0829 | 0.167 | 119.949 | 4 | acoustic |


Our goal is to predict the genre of a song, so we first check how many genres the dataset contains.

In [49]:
print('Number of genres: {}'.format(tracks.track_genre.nunique()))

# Count the tracks in each genre
tracks.track_genre.value_counts()

Number of genres: 114


track_genre
acoustic             1000
punk-rock            1000
progressive-house    1000
power-pop            1000
pop                  1000
                     ... 
folk                 1000
emo                  1000
electronic           1000
electro              1000
world-music          1000
Name: count, Length: 114, dtype: int64

### Multiple Genres <a class="anchor" id="multiple-genres"></a>
 
We discover that some songs are listed under multiple genres. To improve our modelling, we will use only tracks that have a single genre.

In [50]:
# Sort by popularity first so that, when duplicates are dropped, the most popular copy is kept
# Drop rows where track_name, duration_ms, artists and track_genre are all the same
tracks.sort_values(by=['popularity'], ascending=False, inplace=True)
tracks.drop_duplicates(subset=['track_name', 'duration_ms', 'artists', 'track_genre'], inplace=True)

# If track_name, duration_ms and artists are the same but the genre differs,
# join the genres into one comma-separated string
tracks = tracks.groupby(['track_name', 'duration_ms', 'artists'], as_index=False).agg({
    'track_genre': lambda x: ','.join(x),
    'album_name': 'first',
    'track_id': 'first',
    'popularity': 'max',
    'explicit': 'first',
    'danceability': 'first',
    'energy': 'first',
    'loudness': 'first',
    'speechiness': 'first',
    'acousticness': 'first',
    'instrumentalness': 'first',
    'liveness': 'first',
    'valence': 'first',
    'tempo': 'first',
    'key': 'first',
    'mode': 'first',
})

# Remove all tracks with more than one genre (their joined genre string contains a comma)
tracks = tracks[~tracks['track_genre'].str.contains(',')]
tracks.track_genre.value_counts()

track_genre
study          996
tango          991
comedy         987
grindcore      984
honky-tonk     974
              ... 
house          118
indie           94
alternative     87
reggae          78
reggaeton       66
Name: count, Length: 112, dtype: int64
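
The aggregate-then-filter logic above can be illustrated on a toy frame. This is a minimal sketch rather than the notebook's data: the three rows, their values and the variable names `toy`, `merged` and `single_genre` are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): "Song A" is listed under two genres, "Song B" under one
toy = pd.DataFrame({
    'track_name':  ['Song A', 'Song A', 'Song B'],
    'duration_ms': [200000, 200000, 180000],
    'artists':     ['X', 'X', 'Y'],
    'track_genre': ['pop', 'dance', 'rock'],
    'popularity':  [50, 60, 40],
})

# Collapse rows that describe the same track and join their genres with commas
merged = toy.groupby(['track_name', 'duration_ms', 'artists'], as_index=False).agg({
    'track_genre': lambda g: ','.join(g),
    'popularity': 'max',
})

# Keep only tracks whose joined genre string contains no comma
single_genre = merged[~merged['track_genre'].str.contains(',')]
print(single_genre)
```

In this sketch "Song A" ends up with the genre string `pop,dance` and is filtered out, so only "Song B" remains.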

Additionally, any genre with fewer than 500 tracks does not provide enough training and test samples, so it is removed from the dataset.

In [51]:
# Remove all genres with fewer than 500 tracks, keeping all columns
tracks = tracks.groupby('track_genre').filter(lambda x: len(x) >= 500)
tracks.track_genre.value_counts()

track_genre
study         996
tango         991
comedy        987
grindcore     984
honky-tonk    974
             ... 
emo           533
german        532
country       522
psych-rock    521
groove        510
Name: count, Length: 78, dtype: int64
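
As a hedged illustration of how `groupby(...).filter(...)` behaves, the sketch below applies the same idea to a toy frame. The data, the `min_tracks` threshold of 2 (standing in for 500) and the choice of comparison operator are assumptions made for the example. The second formulation, based on `value_counts` and `isin`, is an alternative that is not used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical): three 'jazz' rows and one 'polka' row
toy = pd.DataFrame({'track_genre': ['jazz', 'jazz', 'jazz', 'polka'],
                    'tempo': [120.0, 90.0, 100.0, 140.0]})
min_tracks = 2  # hypothetical threshold standing in for 500

# filter() keeps every row of each group for which the function returns True
kept = toy.groupby('track_genre').filter(lambda g: len(g) >= min_tracks)

# Alternative: count rows per genre, then keep the rows of the large genres
counts = toy['track_genre'].value_counts()
kept_alt = toy[toy['track_genre'].isin(counts[counts >= min_tracks].index)]

print(kept['track_genre'].unique())
```

Both formulations return the original rows (all columns, original order) of the genres that pass the threshold.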

## Data Preprocessing <a class="anchor" id="data-cleaning"></a>

We start with basic data cleaning: removing null data and dropping the columns that our EDA showed to be unnecessary.

In [52]:
# Drop the rows where track_name is null
tracks.dropna(subset=['track_name'], inplace=True)

To make our modelling easier, we limit our selection to a few hand-picked genres. Although the top 10 genres present an interesting opportunity, a cursory glance at the data shows that they are not very distinct from one another. Hence, we select a few genres that are more clearly distinct.

In [53]:
genre_popularity = tracks.groupby('track_genre')['popularity'].mean()
genre_popularity.sort_values(ascending=False)

# What is the difference between pop-film, k-pop, pop? 
# And what is the difference between sad and emo?

track_genre
k-pop             59.071334
pop-film          56.744038
chill             56.228319
sad               52.147114
piano             50.342262
                    ...    
grindcore         14.520325
chicago-house     12.174381
detroit-techno    11.007487
romance            3.562077
iranian            2.245868
Name: popularity, Length: 78, dtype: float64

We choose the following genres for our modelling and remove the rest from the dataset.
- Country
- Chill
- K-Pop
- Club
- Rock-n-Roll
- Classical
- Sleep
- Electronic
- Ambient
- Opera

In [54]:
# Retain only the genres listed above
selected_genres = ['country', 'chill', 'k-pop', 'club', 'rock-n-roll',
                   'classical', 'sleep', 'electronic', 'ambient', 'opera']
tracks = tracks[tracks['track_genre'].isin(selected_genres)]

We will also remove the track ID from our dataset, as the ID is randomly generated data. Additionally, the track name, artist name and album name will be removed. These three categories are too diverse and would be hard to generalize from, even if they provide very useful information.

We will also drop the track key, as it would add too many dimensions for our model to handle.

In [55]:
# Drop the track_id column
tracks.drop('track_id', axis=1, inplace=True)

# Drop the track_name, artists, album_name columns
tracks.drop(['track_name', 'artists', 'album_name'], axis=1, inplace=True)

# Drop the key column
tracks.drop('key', axis=1, inplace=True)

Next, we discretize loudness and tempo into 10 bins each. The exact values of these columns are not important, but their rough bins will better inform the model. The duration_ms column is log-transformed instead of binned. All three columns are then scaled to the range 0 to 1.

We will also normalise the popularity column, as it is on a different scale from the rest of the data.

In [56]:
# Discretize the loudness column into 10 bins, normalised within 0 and 1
tracks['loudness'] = pd.cut(tracks['loudness'], 10, labels=False)
tracks['loudness'] = MinMaxScaler().fit_transform(tracks[['loudness']])

# Discretize the tempo column into 10 bins, normalised within 0 and 1
tracks['tempo'] = pd.cut(tracks['tempo'], 10, labels=False)
tracks['tempo'] = MinMaxScaler().fit_transform(tracks[['tempo']])

# Log-transform the duration_ms column, then normalise it to between 0 and 1
tracks['duration_ms'] = np.log(tracks['duration_ms'])
tracks['duration_ms'] = MinMaxScaler().fit_transform(tracks[['duration_ms']])

# Normalise the popularity column through MinMaxScaler
tracks['popularity'] = MinMaxScaler().fit_transform(tracks[['popularity']])

# Describe the dataset
tracks.describe()

| | duration_ms | popularity | danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | tempo | mode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7143.0 | 7143.0 | 7143.0 | 7143.0 | 7143.0 | 7143.0 | 7143.0 | 7143.0 | 7143.0 | 7143.0 | 7143.0 | 7143.0 |
| mean | 0.43832 | 0.42165 | 0.477782 | 0.477257 | 0.701837 | 0.065801 | 0.515192 | 0.308702 | 0.209258 | 0.359005 | 0.525378 | 0.676046 |
| std | 0.08162 | 0.224826 | 0.219627 | 0.297521 | 0.180076 | 0.071441 | 0.381924 | 0.401518 | 0.195412 | 0.274434 | 0.174052 | 0.468015 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.402446 | 0.269663 | 0.302 | 0.22 | 0.555556 | 0.03465 | 0.1055 | 1e-06 | 0.0984 | 0.1085 | 0.444444 | 0.0 |
| 50% | 0.443685 | 0.438202 | 0.504 | 0.481 | 0.777778 | 0.0433 | 0.557 | 0.00309 | 0.123 | 0.312 | 0.555556 | 1.0 |
| 75% | 0.480905 | 0.595506 | 0.65 | 0.734 | 0.777778 | 0.0623 | 0.914 | 0.813 | 0.248 | 0.5735 | 0.666667 | 1.0 |
| max | 1.0 | 1.0 | 0.958 | 1.0 | 1.0 | 0.889 | 0.996 | 1.0 | 1.0 | 0.988 | 1.0 | 1.0 |
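
The loudness and tempo quartiles in the summary (for example 0.555556 and 0.777778) are multiples of 1/9, which follows from binning into 10 equal-width bins and then min-max scaling the bin index. The sketch below reproduces that effect on made-up loudness values. The data and the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical loudness values in dB, evenly spread so that every bin is occupied
loudness = pd.Series(np.linspace(-30.0, 0.0, 31))

# pd.cut with an integer splits the value range into equal-width bins;
# labels=False returns the integer bin index (0 to 9) instead of an interval
bins = pd.cut(loudness, 10, labels=False)

# Min-max scaling maps bin index k to k / 9
scaled = MinMaxScaler().fit_transform(bins.to_numpy().reshape(-1, 1)).ravel()

print(sorted(set(np.round(scaled, 6))))
```

After this transformation a column can take only ten distinct values, so the model sees a coarse ordinal feature rather than the raw measurement.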


Next, we make sure every genre has exactly the same number of samples by downsampling each one to the size of the smallest genre (522 tracks).

In [57]:
# Randomly downsample each genre to the size of the smallest genre so that all classes are balanced
min_count = tracks.track_genre.value_counts().min()
tracks = tracks.groupby('track_genre').apply(lambda x: x.sample(min_count, random_state=42).reset_index(drop=True))
tracks.track_genre.value_counts()

track_genre
ambient        522
chill          522
classical      522
club           522
country        522
electronic     522
k-pop          522
opera          522
rock-n-roll    522
sleep          522
Name: count, dtype: int64
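
For reference, the same kind of class balancing can be sketched with the built-in `DataFrameGroupBy.sample` method (available in pandas 1.1 and later). This is an alternative formulation on toy data, not the call used in the cell above, and it will not necessarily draw the same rows. The frame `toy` and the name `n_min` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): 5 'ambient' rows and 3 'opera' rows
toy = pd.DataFrame({'track_genre': ['ambient'] * 5 + ['opera'] * 3,
                    'energy': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]})

# Size of the smallest class
n_min = int(toy['track_genre'].value_counts().min())

# Draw n_min rows from every genre; random_state makes the draw repeatable
balanced = toy.groupby('track_genre').sample(n=n_min, random_state=42)

print(balanced['track_genre'].value_counts())
```

The result keeps the original columns and a flat index, which avoids the extra index level that `groupby(...).apply(...)` adds.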

## Data Modelling <a class="anchor" id="data-modelling"></a>

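We split the data into training and test sets (80/20) and fit a random forest classifier with 100 trees. We then report the accuracy, the confusion matrix and the per-genre classification report on the test set.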
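
The split in the cell that follows is not stratified, so the number of test samples per genre can differ even though the classes are balanced overall. As a hedged aside, the sketch below shows how the `stratify` argument of `train_test_split` keeps class proportions equal. It uses synthetic data (10 classes of 50 samples each, an assumption for illustration) rather than the notebook's `tracks` frame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (hypothetical): 10 classes with 50 samples each
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = np.repeat(np.arange(10), 50)

# stratify=y keeps the class proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101, stratify=y
)

# Each class contributes the same number of test samples
print(np.bincount(y_test))
```

Whether stratification changes the reported scores for this dataset is not something the sketch establishes; it only shows the mechanics of the argument.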

In [58]:
# Train test split
X = tracks.drop('track_genre', axis=1)
y = tracks['track_genre']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

# Random Forest Classifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)

print('Random Forest Classifier')
print('Accuracy: {}'.format(accuracy_score(y_test, rfc_pred)))
print('Confusion Matrix: \n{}'.format(confusion_matrix(y_test, rfc_pred)))
print('Classification Report: \n{}'.format(classification_report(y_test, rfc_pred)))


Random Forest Classifier
Accuracy: 0.7873563218390804
Confusion Matrix: 
[[ 77   4   2   3   0   0   0   9   0   9]
 [  7  87   0   3   1   4   3   0   1   0]
 [  2   0  74   1   0   4   0   1   7   1]
 [  0   3   0  80   1  15   2   3   1   1]
 [  0   6   0   0  79   7   9   0   1   1]
 [  1   8   1  17   2  60   6   0   7   0]
 [  1  11   0   2   3   4  92   1   0   0]
 [  8   3   6   0   3   0   0  89   2   2]
 [  1   1   1   1   2   3   2   0  83   0]
 [  1   0   9   0   0   0   0   1   0 101]]
Classification Report: 
              precision    recall  f1-score   support

     ambient       0.79      0.74      0.76       104
       chill       0.71      0.82      0.76       106
   classical       0.80      0.82      0.81        90
        club       0.75      0.75      0.75       106
     country       0.87      0.77      0.81       103
  electronic       0.62      0.59      0.60       102
       k-pop       0.81      0.81      0.81       114
       opera       0.86      0.79      