### On Music Genre Classification
Genre classification is a task that has lately been dominated by deep learning models, convolutional neural networks in particular.
Assigning a genre to a song and its performer is an inherently subjective and often nebulous affair, and even carefully designed audio-based models often fail to reach high accuracy. Especially when drawing boundaries between wide-reaching genres such as rock and pop, accuracy depends on correct labeling as much as on good model design.

*Library Dependencies*

In [None]:
import pandas as pd
import numpy as np
import os

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

from xgboost import XGBClassifier

import eli5
from eli5.sklearn import PermutationImportance



# Genre Classification with Extreme Gradient Boosting (XGBoost)

In this notebook, a music genre classifier will be built using the features extracted from 30-second audio samples, stored in the features dataframe.

Shorter samples are often preferred because they reduce storage requirements, and an audio model is generally considered superior if it can extract the same information from a smaller sample. Moreover, when features are values averaged over a time span, a brief excerpt makes it more likely that these mean values are representative of the content of the entire sample.

However, the ultimate goal here is to find similar-sounding songs, and a very brief segment of a song can often be quite uncharacteristic of the whole. Therefore, the 30-second samples will be used. Any changes within a sample over time will, it is hoped, be adequately captured by the variance features.
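
To illustrate how a mean and a variance feature summarize a clip, the following is a minimal sketch that computes a frame-wise zero crossing rate and reduces it to two numbers. The helper names, the frame and hop lengths, and the synthetic two-tone signal are assumptions made for this example; they are not the extraction code that produced the dataframe.

```python
import numpy as np

def frame_signal(signal, frame_length=2048, hop_length=512):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)
    return np.stack([signal[s:s + frame_length] for s in starts])

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose sign differs."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

# Synthetic stand-in for an audio clip (assumption):
# 2 s of a 220 Hz tone followed by 2 s of a 1760 Hz tone.
sr = 22050
t = np.arange(2 * sr) / sr
clip = np.concatenate([np.sin(2 * np.pi * 220 * t),
                       np.sin(2 * np.pi * 1760 * t)])

# Frame-wise feature, then the two summary statistics.
zcr = zero_crossing_rate(frame_signal(clip))
zcr_mean, zcr_var = zcr.mean(), zcr.var()

# A clip that never changes yields a far smaller variance.
steady_clip = np.sin(2 * np.pi * 220 * np.arange(4 * sr) / sr)
steady = zero_crossing_rate(frame_signal(steady_clip))

print(f"changing clip: mean={zcr_mean:.4f}, var={zcr_var:.6f}")
print(f"steady clip:   mean={steady.mean():.4f}, var={steady.var():.6f}")
```

The mean alone cannot distinguish a clip that changes halfway from one that sits at an intermediate value throughout; the variance is what records that a change took place.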

In [None]:
# Read the GTZAN feature dataframe:
df = pd.read_csv('../dataframes/feature_dataframe.csv')
# Discard the first two columns (filename and duration):
df = df.iloc[:, 2:]

df.head()

Unnamed: 0,harmonic_mean,harmonic_var,percussive_mean,percussive_var,chroma_stft_mean,chroma_stft_var,spectral_centroid_mean,spectral_centroid_var,zero_crossing_rate_mean,zero_crossing_rate_var,mfcc1_mean,mfcc1_var,mfcc2_mean,mfcc2_var,mfcc3_mean,mfcc3_var,mfcc4_mean,mfcc4_var,mfcc5_mean,mfcc5_var,mfcc6_mean,mfcc6_var,mfcc7_mean,mfcc7_var,mfcc8_mean,mfcc8_var,mfcc9_mean,mfcc9_var,mfcc10_mean,mfcc10_var,mfcc11_mean,mfcc11_var,mfcc12_mean,mfcc12_var,mfcc13_mean,mfcc13_var,mfcc14_mean,mfcc14_var,mfcc15_mean,mfcc15_var,mfcc16_mean,mfcc16_var,mfcc17_mean,mfcc17_var,mfcc18_mean,mfcc18_var,mfcc19_mean,mfcc19_var,mfcc20_mean,mfcc20_var,genre
0,-4.529724e-05,0.008172,8e-06,0.005698,0.350088,0.088757,1784.16585,129774.064525,0.083045,0.000767,-113.570648,2564.20752,121.571793,295.913818,-19.168142,235.574432,42.366421,151.106873,-6.364664,167.934799,18.623499,89.18084,-13.704891,67.660492,15.34315,68.932579,-12.27411,82.204201,10.976572,63.386311,-8.326573,61.773094,8.803792,51.244125,-3.6723,41.217415,5.747995,40.554478,-5.162882,49.775421,0.75274,52.42091,-1.690215,36.524071,-0.408979,41.597103,-2.303523,55.062923,1.221291,46.936035,blues
1,0.0001395807,0.005099,-0.000178,0.003063,0.340914,0.09498,1530.176679,375850.073649,0.05604,0.001448,-207.501694,7764.555176,123.991264,560.259949,8.955127,572.810913,35.877647,264.506104,2.90732,279.932922,21.510466,156.477097,-8.560436,200.849182,23.370686,142.555954,-10.099661,166.108521,11.900497,104.358612,-5.555639,105.17363,5.376327,96.197212,-2.23176,64.914291,4.22014,73.152534,-6.012148,52.422142,0.927998,55.356403,-0.731125,60.314529,0.295073,48.120598,-0.283518,51.10619,0.531217,45.786282,blues
2,2.105576e-06,0.016342,-1.9e-05,0.007458,0.363637,0.085275,1552.811865,156467.643368,0.076291,0.001007,-90.722595,3319.044922,140.446304,508.765045,-29.093889,411.781219,31.684334,144.090317,-13.984504,155.493759,25.764742,74.548401,-13.664875,106.981827,11.639934,106.574875,-11.783643,65.447945,9.71876,67.908859,-13.133803,57.781425,5.791199,64.480209,-8.907628,60.385151,-1.077,57.711136,-9.229274,36.580986,2.45169,40.598766,-7.729093,47.639427,-1.816407,52.382141,-3.43972,46.63966,-2.231258,30.573025,blues
3,4.583644e-07,0.019054,-1.4e-05,0.002712,0.404785,0.093999,1070.106615,184355.942417,0.033309,0.000423,-199.544205,5507.51709,150.090897,456.505402,5.662678,257.161163,26.859079,158.267303,1.771399,268.034393,14.234031,126.794128,-4.832006,155.912079,9.286494,81.273743,-0.759186,92.11409,8.137607,71.314079,-3.200653,110.236687,6.079319,48.251999,-2.480174,56.7994,-1.079305,62.289902,-2.870789,51.651592,0.780874,44.427753,-3.319597,50.206673,0.636965,37.31913,-0.619121,37.259739,-3.407448,31.949339,blues
4,-1.756129e-05,0.004814,-1e-05,0.003094,0.308526,0.087841,1835.004266,343399.939274,0.101461,0.001954,-160.337708,5195.291992,126.219635,853.784729,-35.587811,333.792938,22.148071,193.4561,-32.4786,336.276825,10.852294,134.831573,-23.352329,93.257095,0.498434,124.672127,-11.793437,130.073349,1.207256,99.675575,-13.088418,80.254066,-2.813867,86.430626,-6.933385,89.555443,-7.552725,70.943336,-9.164666,75.793404,-4.520576,86.099236,-5.454034,75.269707,-0.916874,53.613918,-4.404827,62.910812,-11.703234,55.19516,blues


Features are separated from the labels (Genres):

In [None]:
# Create Feature and Label dataframes:

y = df['genre'] 
X = df.loc[:, df.columns != 'genre'] 

# Normalize features:
cols = X.columns
min_max_scaler = MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(X)

X = pd.DataFrame(np_scaled, columns = cols)
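
Recent XGBoost releases expect class labels to be consecutive integers starting at 0, so depending on the installed version the string genre names in `y` may need to be encoded before fitting. A minimal sketch with scikit-learn's `LabelEncoder`, using a hypothetical handful of labels in place of `df['genre']`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical genre labels standing in for df['genre'] (assumption).
genres = np.array(['blues', 'rock', 'pop', 'blues', 'pop'])

label_encoder = LabelEncoder()
# Classes are sorted alphabetically: blues -> 0, pop -> 1, rock -> 2.
y_encoded = label_encoder.fit_transform(genres)
print(y_encoded)  # [0 2 1 0 1]

# Integer predictions can be mapped back to genre names afterwards.
decoded = label_encoder.inverse_transform(y_encoded)
print(decoded)
```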

The data is split into training and test sets:

In [None]:
# Train test split:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
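
Because `MinMaxScaler` was fitted on the full dataset before the split, the test rows influence the scaling. For tree-based models the effect is likely small, but a stricter variant fits the scaler on the training rows only. The sketch below uses synthetic data and hypothetical variable names (`X_raw`, `X_tr`, `X_te`) rather than the notebook's dataframe:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the unscaled feature matrix and labels (assumption).
rng = np.random.default_rng(42)
X_raw = rng.normal(loc=100.0, scale=25.0, size=(200, 4))
y_raw = rng.integers(0, 3, size=200)

# Split first, so the test rows stay unseen during scaling.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# Column minima and maxima are learned from the training rows only.
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)  # lies exactly in [0, 1]
X_te_scaled = scaler.transform(X_te)  # may fall slightly outside [0, 1]

print(X_tr_scaled.min(), X_tr_scaled.max())
print(X_te_scaled.min(), X_te_scaled.max())
```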

Helper functions for cross validation and final assessment of the model:

In [None]:
# Helper Functions: 

# Print cross-validation results
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

# Fit and make predictions
def model_assess(model, title = "Default"):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print('Accuracy', title, ':', round(accuracy_score(y_test, preds), 5), '\n')

The classifier to be used is XGBoost (Extreme Gradient Boosting). 5-fold cross-validation is performed over a grid of hyperparameters to find the best-performing combination. (This process might take a while, depending on the size of the grid.)
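
As a rough guide to the cost, the number of models trained is the number of parameter combinations times the number of folds. The short sketch below counts them for the grid used in this notebook with scikit-learn's `ParameterGrid`; the variable names `n_candidates` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the one passed to GridSearchCV in this notebook.
parameters = {
    'n_estimators': [600, 800, 1000],
    'max_depth': [1, 3, 6],
    'learning_rate': [0.01, 0.05, 1],
}
n_folds = 5

n_candidates = len(ParameterGrid(parameters))  # 3 * 3 * 3 combinations
n_fits = n_candidates * n_folds                # one fit per combination per fold
print(n_candidates, n_fits)  # 27 135
```

With its default `refit=True`, `GridSearchCV` then trains the best combination once more on the whole training set, for 136 fits in total.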

In [None]:
# Perform Cross Validation on XGB

# Practical to ignore the following warnings:
import warnings
warnings.filterwarnings('ignore', category=Warning)
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

# Define XGB parameters to test:
xgb = XGBClassifier(use_label_encoder=False, eval_metric = 'mlogloss')
parameters = {
    'n_estimators': [600, 800, 1000],
    'max_depth': [1, 3, 6],
    'learning_rate': [0.01, 0.05, 1]
}

# 5-fold Cross Validation
cv = GridSearchCV(xgb, parameters, cv=5)
cv.fit(X_train, y_train.ravel())

print_results(cv)

BEST PARAMS: {'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 800}

0.577 (+/-0.076) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 600}
0.586 (+/-0.062) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 800}
0.597 (+/-0.039) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 1000}
0.669 (+/-0.041) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 600}
0.676 (+/-0.058) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 800}
0.677 (+/-0.059) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 1000}
0.667 (+/-0.053) for {'learning_rate': 0.01, 'max_depth': 6, 'n_estimators': 600}
0.667 (+/-0.06) for {'learning_rate': 0.01, 'max_depth': 6, 'n_estimators': 800}
0.67 (+/-0.063) for {'learning_rate': 0.01, 'max_depth': 6, 'n_estimators': 1000}
0.643 (+/-0.065) for {'learning_rate': 0.05, 'max_depth': 1, 'n_estimators': 600}
0.651 (+/-0.07) for {'learning_rate': 0.05, 'max_depth': 1, 'n_estimators': 800}
0.646 (+/-0.074) for {'

The best performing model is assessed on the test dataset:

In [None]:
# Use best performing XGB:
xgb = XGBClassifier(n_estimators=800, learning_rate=0.05, max_depth=3, eval_metric = 'mlogloss')
model_assess(xgb, "of the XGBoost classifier")

Accuracy of the XGBoost classifier : 0.725 



Even though they are averaged over a time span of 30 seconds, the extracted features still achieve over 70% accuracy (72.5% on the test set) on a task such as genre classification. It is therefore reasonable to assume that such features retain enough coherent information to be used in a recommendation engine.
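
A single accuracy figure does not show which genres are confused with one another, which matters for neighbouring genres such as rock and pop. A minimal sketch of a per-genre breakdown with scikit-learn's `confusion_matrix` follows; the label arrays are hypothetical stand-ins for `y_test` and the model's predictions, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted genres (assumption, not notebook output).
labels = ['blues', 'pop', 'rock']
y_true = np.array(['blues', 'blues', 'pop', 'pop', 'pop', 'rock', 'rock', 'rock'])
y_pred = np.array(['blues', 'rock', 'pop', 'pop', 'rock', 'rock', 'pop', 'rock'])

# Rows are true genres, columns are predicted genres, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Recall per genre: correctly predicted samples divided by all samples of that genre.
per_genre_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
for genre, recall in zip(labels, per_genre_recall):
    print(f"{genre}: {recall:.2f}")
```

Off-diagonal entries show where the errors go: in this toy data every mistake involves the rock column or row.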

The trained model can be saved:

In [None]:
import pickle

# save
with open('XGB_Music_Genre.pkl','wb') as f:
    pickle.dump(xgb,f)

# load
# with open('XGB_Music_Genre.pkl', 'rb') as f:
#     xgb2 = pickle.load(f)

The importance of each feature is calculated:

In [None]:
# Estimate feature importance using eli5
perm = PermutationImportance(estimator=xgb, random_state=1)
perm.fit(X_test, y_test)

eli5.show_weights(estimator=perm, feature_names = X_test.columns.tolist())

Weight,Feature
0.0893  ± 0.0129,percussive_var
0.0280  ± 0.0390,chroma_stft_mean
0.0273  ± 0.0078,mfcc5_var
0.0253  ± 0.0248,mfcc4_mean
0.0247  ± 0.0191,spectral_centroid_var
0.0167  ± 0.0119,zero_crossing_rate_mean
0.0160  ± 0.0065,mfcc13_mean
0.0160  ± 0.0221,percussive_mean
0.0160  ± 0.0186,mfcc6_mean
0.0147  ± 0.0233,mfcc17_mean


The feature importance table shows that percussive variance is the best indicator of a song's genre, and likely of its high-level identity, by a clear margin (0.0893, against 0.0280 for the next feature). Chroma STFT mean and the means and variances of several other features follow, although several of these weights have uncertainties comparable to or larger than the weights themselves. This suggests that a good number of the extracted features might contain information that correlates with high-level characteristics of the sound.
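
The weights above are drops in accuracy after a feature's values are shuffled. The idea can be sketched by hand as follows. This is an illustrative reimplementation on synthetic data with a logistic regression model, not eli5's code, and the function name `permutation_importance_sketch` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def permutation_importance_sketch(model, X, y, n_repeats=5, seed=1):
    """Mean and std of the accuracy drop when each column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between column j and the labels.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, r] = baseline - accuracy_score(y, model.predict(X_perm))
    return drops.mean(axis=1), drops.std(axis=1)

# Synthetic data (assumption): only column 0 determines the label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)
mean_drop, std_drop = permutation_importance_sketch(model, X_demo, y_demo)

for j, (m, s) in enumerate(zip(mean_drop, std_drop)):
    print(f"feature {j}: {m:.4f} +/- {s:.4f}")
```

A feature the model relies on produces a large drop, while a feature it ignores produces a drop near zero, possibly slightly negative by chance.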