# mco3_Avellaneda_Fadrigo_Sibal_Tan

# Outline

1. Overview of Project
2. Description and crediting of Dataset
3. Data Preprocessing
4. Definition of Classification Task
    - Explain what will be classified
    - Give features and labels for ML task
    - Rationale and relevance of dataset and classification
5. Implementation and Evaluation of Classification Models
    - List and describe models to be used
    - ML code proper, with appropriate explanations

## Overview
### PLACEHOLDER TEXT:
>This project aims to implement and compare the performance of two ML models on a classification task of the group's choice. An applicable dataset was chosen and used with both models. A deeper-than-surface-level understanding of each model in the context of the dataset is required.

## Dataset
This [dataset](https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs), entitled "30000 Spotify Songs", was uploaded by [Joakim Arvidsson](https://www.kaggle.com/joebeachcapital) on [Kaggle](https://www.kaggle.com/). The dataset was also downloaded locally and can be found in "/Spotify Dataset".

The dataset contains about 30,000 songs of varying genres, collected from the Spotify API using the spotifyr package (details are listed in the readme in "/Spotify Dataset"). It contains information such as the unique track ID, release date, and genre.

The dataset has 23 columns, as follows:
1.  track_id: Song unique ID
2.  track_name: Song name
3.  track_artist: Song artist
4.  track_popularity: Song popularity (0 to 100), where a higher value means a more popular song
5.  track_album_id: Album unique ID
6.  track_album_name: Song album name
7.  track_album_release_date: Date when the album was released
8.  playlist_name: Name of playlist
9.  playlist_id: Playlist ID
10. playlist_genre: Playlist genre
11. playlist_subgenre: Playlist subgenre
12. danceability: Describes how suitable a track is for dancing based on factors like tempo, rhythm stability, etc. 0.0 denotes least danceable and 1.0 denotes most danceable.
13. energy: Measured from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. 
14. key: The estimated overall key of the track.
15. loudness: The overall loudness of a track in decibels (dB). Values typically range from -60 to 0 dB.
16. mode: Indicates the modality (major or minor) of a track. 1 represents major and 0 represents minor.
17. speechiness: Detects the presence of spoken words in a track. Values close to 1.0 likely indicate a talk show or podcast. Values above 0.66 indicate tracks that are probably made entirely of spoken words. Values from 0.33 to 0.66 indicate tracks that may contain both music and speech. Values below 0.33 represent music or other non-speech tracks.
18. acousticness: Confidence measure from 0.0 to 1.0 whether the track is acoustic. A higher value means a higher confidence. 
19. instrumentalness: Predicts whether a track contains no vocals. Values above 0.5 are intended to represent instrumental tracks, with values approaching 1.0 denoting a higher degree of confidence.
20. liveness: Detects the presence of an audience in the recording. Values above 0.8 indicate that the track was likely recorded live.
21. valence: Measured from 0.0 to 1.0 describing the musical positiveness conveyed by a track. 
22. tempo: The overall estimated tempo of a track in beats per minute (BPM). 
23. duration_ms: Duration of song in milliseconds
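
The speechiness thresholds described above can be read as a simple lookup. The sketch below is only an illustration of those ranges; the function name `describe_speechiness` and its category strings are hypothetical and are not part of the dataset or of the Spotify API.

```python
def describe_speechiness(value: float) -> str:
    """Map a speechiness score to the rough category given in the column notes."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("speechiness is expected to lie in [0.0, 1.0]")
    if value > 0.66:
        return "mostly spoken words"
    if value >= 0.33:
        return "mix of music and speech"
    return "music or other non-speech audio"


# Example scores chosen for illustration only.
for score in (0.0583, 0.45, 0.918):
    print(score, "->", describe_speechiness(score))
```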


## Preprocessing

### Importing libraries of preprocessing and data visualization

In [68]:
import argparse
import json
import csv
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
# keep plt inline in ntbk instead of new window
%matplotlib inline
from tqdm import tqdm

from sklearn import preprocessing

plt.rcParams['figure.figsize'] = (6.0, 6.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# autoreload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%reload_ext autoreload
%autoreload 2

### Load the dataset for visualization and preprocessing

In [69]:
df = pd.read_csv('../MCO3-MachineLearning/Spotify_Dataset/spotify_songs.csv')

In [70]:
# show general info and check for NULL values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32833 entries, 0 to 32832
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_id                  32833 non-null  object 
 1   track_name                32828 non-null  object 
 2   track_artist              32828 non-null  object 
 3   track_popularity          32833 non-null  int64  
 4   track_album_id            32833 non-null  object 
 5   track_album_name          32828 non-null  object 
 6   track_album_release_date  32833 non-null  object 
 7   playlist_name             32833 non-null  object 
 8   playlist_id               32833 non-null  object 
 9   playlist_genre            32833 non-null  object 
 10  playlist_subgenre         32833 non-null  object 
 11  danceability              32833 non-null  float64
 12  energy                    32833 non-null  float64
 13  key                       32833 non-null  int64  
 14  loudne

In [71]:
df.head()

| | track_id | track_name | track_artist | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | ... | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6f807x0ima9a1j3VPbc7VN | I Don't Care (with Justin Bieber) - Loud Luxur... | Ed Sheeran | 66 | 2oCs0DGTsRO98Gh5ZSl2Cx | I Don't Care (with Justin Bieber) [Loud Luxury... | 2019-06-14 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | ... | 6 | -2.634 | 1 | 0.0583 | 0.102 | 0.0 | 0.0653 | 0.518 | 122.036 | 194754 |
| 1 | 0r7CVbZTWZgbTCYdfa2P31 | Memories - Dillon Francis Remix | Maroon 5 | 67 | 63rPSO264uRjW1X5E6cWv6 | Memories (Dillon Francis Remix) | 2019-12-13 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | ... | 11 | -4.969 | 1 | 0.0373 | 0.0724 | 0.00421 | 0.357 | 0.693 | 99.972 | 162600 |
| 2 | 1z1Hg7Vb0AhHDiEmnDE79l | All the Time - Don Diablo Remix | Zara Larsson | 70 | 1HoSmj2eLcsrR0vE9gThr4 | All the Time (Don Diablo Remix) | 2019-07-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | ... | 1 | -3.432 | 0 | 0.0742 | 0.0794 | 2.3e-05 | 0.11 | 0.613 | 124.008 | 176616 |
| 3 | 75FpbthrwQmzHlBJLuGdC7 | Call You Mine - Keanu Silva Remix | The Chainsmokers | 60 | 1nqYsOef1yKKuGOVchbsk6 | Call You Mine - The Remixes | 2019-07-19 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | ... | 7 | -3.778 | 1 | 0.102 | 0.0287 | 9e-06 | 0.204 | 0.277 | 121.956 | 169093 |
| 4 | 1e8PAfcKUYoKkxPhrHqw4x | Someone You Loved - Future Humans Remix | Lewis Capaldi | 69 | 7m7vv9wlQ4i0LFuJiE2zsQ | Someone You Loved (Future Humans Remix) | 2019-03-05 | Pop Remix | 37i9dQZF1DXcZDD7cfEKhW | pop | ... | 1 | -4.672 | 1 | 0.0359 | 0.0803 | 0.0 | 0.0833 | 0.725 | 123.976 | 189052 |


In [72]:
df.describe()

| | track_popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 32833.0 | 32833.0 | 32833.0 | 32833.0 | 32833.0 | 32833.0 | 32833.0 | 32833.0 | 32833.0 | 32833.0 | 32833.0 | 32833.0 | 32833.0 |
| mean | 42.477081 | 0.65485 | 0.698619 | 5.374471 | -6.719499 | 0.565711 | 0.107068 | 0.175334 | 0.084747 | 0.190176 | 0.510561 | 120.881132 | 225799.811622 |
| std | 24.984074 | 0.145085 | 0.18091 | 3.611657 | 2.988436 | 0.495671 | 0.101314 | 0.219633 | 0.22423 | 0.154317 | 0.233146 | 26.903624 | 59834.006182 |
| min | 0.0 | 0.0 | 0.000175 | 0.0 | -46.448 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4000.0 |
| 25% | 24.0 | 0.563 | 0.581 | 2.0 | -8.171 | 0.0 | 0.041 | 0.0151 | 0.0 | 0.0927 | 0.331 | 99.96 | 187819.0 |
| 50% | 45.0 | 0.672 | 0.721 | 6.0 | -6.166 | 1.0 | 0.0625 | 0.0804 | 1.6e-05 | 0.127 | 0.512 | 121.984 | 216000.0 |
| 75% | 62.0 | 0.761 | 0.84 | 9.0 | -4.645 | 1.0 | 0.132 | 0.255 | 0.00483 | 0.248 | 0.693 | 133.918 | 253585.0 |
| max | 100.0 | 0.983 | 1.0 | 11.0 | 1.275 | 1.0 | 0.918 | 0.994 | 0.994 | 0.996 | 0.991 | 239.44 | 517810.0 |


### Check and remove rows if there are null cells in the dataset

Rows with NaN values are removed or replaced before training because many ML models cannot handle missing values. Cleaning them out also avoids errors during training and can improve the accuracy and efficiency of the model.

In [73]:
nan_rows = df.isna().any(axis=1)
np.nonzero(nan_rows)

(array([ 8151,  9282,  9283, 19568, 19811], dtype=int64),)

In [74]:
df.iloc[8151]

track_id                    69gRFGOWY9OMpFJgFol1u0
track_name                                     NaN
track_artist                                   NaN
track_popularity                                 0
track_album_id              717UG2du6utFe7CdmpuUe3
track_album_name                               NaN
track_album_release_date                2012-01-05
playlist_name                              HIP&HOP
playlist_id                 5DyJsJZOpMJh34WvUrQzMV
playlist_genre                                 rap
playlist_subgenre                 southern hip hop
danceability                                 0.714
energy                                       0.821
key                                              6
loudness                                    -7.635
mode                                             1
speechiness                                  0.176
acousticness                                 0.041
instrumentalness                               0.0
liveness                       

Rows 8151, 9282, 9283, 19568 and 19811 contain NaN values.

With pandas, removing the rows that contain NaN values is as simple as calling the [dropna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) function.

In [75]:
df = df.dropna()

In [92]:
nan_rows = df.isna().any(axis=1)
np.nonzero(nan_rows)

(array([], dtype=int64),)

The result is now empty, indicating that every row containing a NaN value has been removed from the dataframe.
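
One detail worth keeping in mind is that `dropna()` keeps the original row labels, so the index has gaps after the rows are removed. The following is a minimal sketch on a made-up dataframe (`toy`, `cleaned`, and `renumbered` are illustrative names, not variables from this notebook) showing the gap and one optional way to renumber the rows with `reset_index(drop=True)`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the song table; the values are made up for illustration.
toy = pd.DataFrame({
    "track_name": ["a", np.nan, "c", "d"],
    "energy": [0.5, 0.8, 0.7, 0.9],
})

cleaned = toy.dropna()
print(cleaned.index.tolist())     # original labels are kept, so label 1 is missing

# Optional: renumber the rows so that labels and positions agree again.
renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())
```

Whether to renumber is a design choice: keeping the original labels makes it easy to trace a row back to the raw CSV, while a fresh index avoids surprises when label-based and position-based access are mixed.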

### Obtaining Input and Output data
- Input Data: The features that the model will use to make predictions
- Output Data: The target variable that the model aims to predict

The output variable chosen for this model is "playlist_genre".

In [93]:
output = df['playlist_genre']
output = output.to_frame()
output.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32828 entries, 0 to 32832
Data columns (total 1 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   playlist_genre  32828 non-null  object
dtypes: object(1)
memory usage: 512.9+ KB


Because "playlist_genre" is non-numeric, the researchers will map each unique genre to an integer and replace the values in the dataframe.

In [95]:
genres = output['playlist_genre'].unique()
genres

array(['pop', 'rap', 'rock', 'latin', 'r&b', 'edm'], dtype=object)

In [96]:
mapped_genres = {genre : i for i, genre in enumerate(genres)}
mapped_genres

{'pop': 0, 'rap': 1, 'rock': 2, 'latin': 3, 'r&b': 4, 'edm': 5}

In [97]:
output = output.replace({'playlist_genre':mapped_genres})
output.head()

| | playlist_genre |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


In [109]:
output = output.squeeze()
output.shape

(32828,)
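
Since the model will predict integers, it can help to keep an inverse mapping for turning predictions back into genre names. The sketch below uses a small made-up series and `Series.map` as an alternative to `replace`; the names `genre_column`, `genre_to_id`, and `id_to_genre` are hypothetical and only mirror the idea of `mapped_genres`.

```python
import pandas as pd

# Small made-up sample of genre labels, for illustration only.
genre_column = pd.Series(["pop", "rap", "rock", "pop", "edm"], name="playlist_genre")

# One integer per unique genre, in order of first appearance.
genre_to_id = {genre: i for i, genre in enumerate(genre_column.unique())}
# Inverse lookup, useful for reading model predictions later.
id_to_genre = {i: genre for genre, i in genre_to_id.items()}

encoded = genre_column.map(genre_to_id)
decoded = encoded.map(id_to_genre)

print(encoded.tolist())
print(decoded.tolist())
```

Note that the integers are arbitrary identifiers: their order carries no meaning, so they are suitable as class labels but should not be treated as a numeric feature.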

The input data used to predict a song's genre are the quantified audio qualities listed in the dataset.

In [110]:
features = ['danceability',
            'energy',
            'key',
            'loudness',
            'mode',
            'speechiness',
            'acousticness',
            'instrumentalness',
            'liveness',
            'valence',
            'tempo',
            'duration_ms']

In [111]:
df_features = df[features]
df_features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32828 entries, 0 to 32832
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   danceability      32828 non-null  float64
 1   energy            32828 non-null  float64
 2   key               32828 non-null  int64  
 3   loudness          32828 non-null  float64
 4   mode              32828 non-null  int64  
 5   speechiness       32828 non-null  float64
 6   acousticness      32828 non-null  float64
 7   instrumentalness  32828 non-null  float64
 8   liveness          32828 non-null  float64
 9   valence           32828 non-null  float64
 10  tempo             32828 non-null  float64
 11  duration_ms       32828 non-null  int64  
dtypes: float64(9), int64(3)
memory usage: 3.3 MB


In [112]:
df_features.shape

(32828, 12)

### Value Normalization

Normalizing feature values is important in machine learning for several reasons:
1. Scale Consistency: The features in the dataset may be on very different scales (here, `duration_ms` is in the hundreds of thousands while `danceability` lies between 0 and 1). Many ML algorithms perform better when the features are on a similar scale.
2. Convergence Speed: Gradient-based optimization algorithms, which are commonly used to train machine learning models, converge faster when features are normalized.
3. Improves Model Performance: Distance-based and gradient-based models in particular tend to perform better on normalized data, because no single large-valued feature dominates the others.
4. Avoiding Numerical Instabilities: Very large or very small feature values can cause numerical instability during training if the features are not normalized.
5. Regularization: Regularization techniques, which are often used to prevent overfitting, assume that the features are on a similar scale.
6. Interpretability: Normalizing the features makes the coefficients of a model easier to compare with one another.
7. Handling Different Measurement Units: When features use different measurement units (such as dB, BPM, and milliseconds), normalizing makes them unitless and directly comparable for the model.
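
This notebook uses min-max scaling, which maps each column to the range [0, 1] with the formula `(x - min) / (max - min)`. As a minimal sketch on made-up numbers (the array `toy` and its values are assumptions for illustration), the hand-computed result can be compared with scikit-learn's `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values on very different scales (think loudness in dB, duration in ms).
toy = np.array([
    [-10.0, 180000.0],
    [-5.0, 240000.0],
    [0.0, 300000.0],
])

# Min-max scaling by hand, column by column: (x - min) / (max - min).
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

# The same transformation using scikit-learn.
scaled = MinMaxScaler().fit_transform(toy)

print(manual)
print(np.allclose(manual, scaled))
```

After scaling, both columns span the same range even though the raw values differed by several orders of magnitude.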

In [113]:
scaler = preprocessing.MinMaxScaler()
x = scaler.fit_transform(df_features)
df_normalized = pd.DataFrame(x, columns=features)
print(df_normalized)

       danceability    energy       key  loudness  mode  speechiness  \
0          0.760936  0.915985  0.545455  0.918090   1.0     0.063508   
1          0.738555  0.814968  1.000000  0.869162   1.0     0.040632   
2          0.686673  0.930988  0.090909  0.901368   0.0     0.080828   
3          0.730417  0.929988  0.636364  0.894118   1.0     0.111111   
4          0.661241  0.832971  0.090909  0.875385   1.0     0.039107   
...             ...       ...       ...       ...   ...          ...   
32823      0.435402  0.921986  0.181818  0.935272   1.0     0.101961   
32824      0.531027  0.785963  0.000000  0.879785   1.0     0.045752   
32825      0.538149  0.820969  0.545455  0.870628   0.0     0.052397   
32826      0.636826  0.887980  0.181818  0.902856   1.0     0.118736   
32827      0.613428  0.883980  0.454545  0.877501   0.0     0.041939   

       acousticness  instrumentalness  liveness   valence     tempo  \
0          0.102616          0.000000  0.065562  0.522704  0.509

Input data has been normalized and is ready to be fed to the model.
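
A common refinement, which this notebook does not apply, is to split the data first and fit the scaler on the training rows only, so that the minimum and maximum of the test rows do not influence the transformation. The sketch below shows the pattern on synthetic data; `X`, `y`, the split ratio, and the random seeds are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # synthetic stand-in for the feature table
y = rng.integers(0, 6, size=100)     # synthetic stand-in for six genre ids

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_scaled.shape, X_test_scaled.shape)
```

With this pattern the scaled test values may fall slightly outside [0, 1], which is expected because the test rows were not used to determine the range.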

## Definition of Classification Task
- Spotify: output = genre (`playlist_genre`); input = every other quantifiable quality provided by the dataset.

TODO:
- explain why we chose to classify genre
- explain the features and how they will help
- give rationale for choosing genre and why it's relevant to us

## Machine Learning Model Implementation and Evaluation

Candidate models (all available in scikit-learn):
- Logistic Regression
- SVC/NuSVC
- SGDClassifier - "Stochastic Gradient Descent"
- Decision Tree Classifier (have Kaggle code for it)
- MLP Classifier - "Multi-Layer Perceptron" (have Kaggle reference)
- Random Forest Classifier
- Gradient Boosting Classifier
- Ada Boost Classifier
- KNeighbors Classifier
- Gaussian Process Classifier

### Importing ML model libraries

In [56]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from mlxtend.plotting import plot_confusion_matrix

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.neural_network import MLPClassifier
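
As a rough sketch of how two candidates could be trained and compared, the example below fits Logistic Regression and a Random Forest on synthetic data with 12 features and 6 classes. The two models were picked arbitrarily for illustration, and the data, hyperparameters, and split settings are assumptions rather than the project's final choices or results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 12 features and 6 classes, mirroring the shapes above.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=8, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Two candidates chosen only for illustration.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

Keeping the candidates in a dictionary makes it straightforward to add more models from the list above and evaluate them on the same split.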