In [1]:
%reload_ext nb_black


# Genre Classification

Supervised learning to predict genre classification based on spectral analysis.

[DATA](https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification) Taken from Kaggle

There are two sets. One holds the original 30 second clips and all their features. The other holds those same 30 second clips broken down into 10 separate 3 second clips, each analyzed on its own.

[Original Files](https://web.archive.org/web/20200812034358/http://marsyas.info/downloads/datasets.html) The link goes through the Internet Archive Wayback Machine because the original website is no longer available.

### Feature explanation

The documentation on the Kaggle website is quite sparse. This is the best explanation I could come up with while looking around at other Music Information Retrieval (MIR) resources.

The data set's documentation is not clear on this, but I am assuming that each mean and variance summarizes an array of values computed within a clip. That is, each clip is further broken down into short frames, the feature is computed for each frame, and the mean and variance are then taken over the whole clip (the three second interval for the short clips).

length - Length of the clip. The values look like a count of audio samples rather than milliseconds. Essentially the same for every entry, so not useful.

chroma_stft - Chromagram: the energy of the frequency spectrum folded into the 12 pitch classes (C, C#, ..., B). I believe this corresponds to the images generated from the original data set.

rms - Root-mean-square energy of the signal in each frame. Roughly, how loud the clip is.

spectral_centroid - The 'center' of the frequency spectrum. Higher values imply 'brighter' sounding songs

spectral_bandwidth - Used in relation with the centroid. How widely the spectrum is spread around the centroid.

rolloff - A certain percentage of the total energy in the signal comes from frequencies below this value

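zero_crossing_rate - Audio files are an array of values from -1 to 1. This is the rate at which the signal crosses the zero line. The units are not documented. If the features were computed with librosa, this would be the fraction of samples in a frame at which the sign changes.

harmony - Not sure how this was calculated. My first guess was the inverse of dissonance, which would be higher when spectral peaks are in line with each other. Given that it sits next to perceptr, it is more likely the harmonic component from a harmonic/percussive separation of the signal (for example `librosa.effects.hpss`). For now I'll interpret this as: the farther away from zero, the more dissonant or noisy a song is.

perceptr - Difficult to know what this means in context. The name could be short for perceptron, or each value could be mapped to -1 or 1 depending on which side of zero it is on before taking the overall mean. Paired with harmony, though, it is more likely the percussive component of the same harmonic/percussive separation.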

tempo - perceived speed of the song. BPM or toe taps per minute.

mfcc***x*** - Mel Frequency Cepstral Coefficients. Essentially a dimensionality reduction of the frequency spectrum over time. Uses the Mel scale to transform frequency into a scale that more accurately represents the way humans hear sound. Commonly used for speech recognition and deconvolution (echo/noise reduction). The higher the ***x***, the finer the spectral detail the coefficient describes: low-order coefficients capture the broad shape of the spectrum, while higher-order ones capture faster variation across the Mel bands.
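
To make the "mean and variance over frames" assumption concrete, here is a minimal NumPy-only sketch. It is not how the Kaggle features were produced (that pipeline is undocumented). The frame length, hop length, sample rate, and the function names `frame_signal` and `summarize_clip` are all my own assumptions for illustration.

```python
import numpy as np


def frame_signal(y, frame_length=2048, hop_length=512):
    """Slice a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)[:, None]
    return y[starts + np.arange(frame_length)[None, :]]


def summarize_clip(y, sr=22050):
    """Compute per-frame features, then reduce each to a mean and a variance."""
    frames = frame_signal(y)

    # Root-mean-square energy of each frame (a loudness proxy)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))

    # Fraction of adjacent sample pairs whose sign differs
    signs = np.signbit(frames).astype(int)
    zcr = np.mean(np.abs(np.diff(signs, axis=1)), axis=1)

    # Magnitude-weighted mean frequency of each frame's spectrum
    window = np.hanning(frames.shape[1])
    magnitude = np.abs(np.fft.rfft(frames * window, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    centroid = (magnitude * freqs).sum(axis=1) / np.maximum(magnitude.sum(axis=1), 1e-12)

    per_frame = {"rms": rms, "zero_crossing_rate": zcr, "spectral_centroid": centroid}
    summary = {}
    for name, values in per_frame.items():
        summary[f"{name}_mean"] = float(values.mean())
        summary[f"{name}_var"] = float(values.var())
    return summary


# Three seconds of a 440 Hz sine wave stand in for a real audio clip
sr = 22050
t = np.arange(3 * sr) / sr
features = summarize_clip(np.sin(2 * np.pi * 440 * t), sr=sr)
print(features)
```

For a pure tone the centroid mean should sit close to the tone's frequency and the variances should be tiny, which is a quick sanity check on the framing logic.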

In [2]:
import warnings

import pandas as pd
import numpy as np
import pickle

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
)

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


In [3]:
def print_vif(x):
    """Utility for checking multicollinearity assumption
    
    :param x: input features to check using VIF. This is assumed to be a pandas.DataFrame
    :return: None. The VIFs are printed as a pandas Series.
    """
    # Silence numpy FutureWarning about .ptp
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        x = sm.add_constant(x)

    vifs = []
    for i in range(x.shape[1]):
        vif = variance_inflation_factor(x.values, i)
        vifs.append(vif)

    print("VIF results\n-------------------------------")
    print(pd.Series(vifs, index=x.columns))
    print("-------------------------------\n")


In [4]:
# To use all
# df_long = pd.read_csv("../data/features_30_sec.csv")
# df_short = pd.read_csv("../data/features_3_sec.csv")
# df = pd.concat((df_long, df_short))

# To use just one
# df = pd.read_csv("../data/features_30_sec.csv")
df = pd.read_csv("../data/features_3_sec.csv")

df["genre"] = df["filename"].str.split(".").str[0]

# "blues.00000.0.wav" -> "blues.00000"
# and
# "blues.00000.wav" -> "blues.00000"
# logic: split on period, take first 2 elements, and put back together
df["songname"] = df["filename"].str.split(".").str[:2].str.join(".")


In [5]:
# generated by tuning vif and then checking model coefficients while tuning
keep_cols = [
    "chroma_stft_mean",
    "chroma_stft_var",
    "rms_var",
    "zero_crossing_rate_mean",
    "zero_crossing_rate_var",
    "harmony_mean",
    "harmony_var",
    "perceptr_mean",
    "tempo",
    "mfcc1_mean",
    "mfcc2_mean",
    "mfcc2_var",
    "mfcc3_mean",
    "mfcc3_var",
    "mfcc4_mean",
    "mfcc4_var",
    "mfcc5_var",
    "mfcc6_mean",
    "mfcc6_var",
    "mfcc7_mean",
    "mfcc8_mean",
    "mfcc8_var",
    "mfcc9_mean",
    "mfcc9_var",
    "mfcc10_var",
    "mfcc12_mean",
    "mfcc12_var",
    "mfcc13_mean",
    "mfcc15_mean",
    "mfcc15_var",
    "mfcc16_mean",
    "mfcc16_var",
    "mfcc17_mean",
    "mfcc18_mean",
    "mfcc19_mean",
    "mfcc19_var",
]


In [6]:
X = df[keep_cols]
y = df["genre"]


In [7]:
# Log all the variance features because of their distributions
X_logged = X.copy()
for c in X_logged:
    if c.endswith("_var"):
        X_logged[c] = np.log(X_logged[c])


In [8]:
# Double check for multicollinearity
print_vif(X_logged)

VIF results
-------------------------------
const                      1687.442676
chroma_stft_mean              3.821127
chroma_stft_var               2.578163
rms_var                       4.034234
zero_crossing_rate_mean       5.687919
zero_crossing_rate_var        4.774067
harmony_mean                  1.478989
harmony_var                   5.263573
perceptr_mean                 1.575745
tempo                         1.009630
mfcc1_mean                    8.873844
mfcc2_mean                    6.108025
mfcc2_var                     2.658020
mfcc3_mean                    2.402190
mfcc3_var                     2.499945
mfcc4_mean                    2.261148
mfcc4_var                     2.863863
mfcc5_var                     2.733756
mfcc6_mean                    3.361297
mfcc6_var                     2.972196
mfcc7_mean                    2.951355
mfcc8_mean                    3.618786
mfcc8_var                     2.520901
mfcc9_mean                    2.682485
mfcc9_var           


## Notes on Multicollinearity

I think it's important to point out here that I ended up dropping many of the metrics that are calculated over the entire frequency spectrum. All the information they hold is distributed over the Mel Frequency Cepstral Coefficients, making many of them redundant. Much of what's left over isn't exactly raw frequency data. Below is just one small example.

In [9]:
print_vif(df[["mfcc1_mean", "mfcc2_mean", "mfcc3_mean", "rolloff_mean"]])

VIF results
-------------------------------
const           270.051095
mfcc1_mean        2.373079
mfcc2_mean        6.901655
mfcc3_mean        1.370123
rolloff_mean      9.191105
dtype: float64
-------------------------------
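
For readers who have not met VIF before: the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. A value near 1 means the others explain almost none of it, while a large value means it is close to a linear combination of them. The sketch below computes this by hand with NumPy on synthetic data. The function name `vif_by_hand` and the toy columns are my own, and I have not checked it digit for digit against statsmodels.

```python
import numpy as np


def vif_by_hand(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        # Remaining columns plus an intercept act as the predictors
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coefs, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coefs
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)  # nearly a linear combination of a and b
d = rng.normal(size=500)  # unrelated to the others

vifs = vif_by_hand(np.column_stack([a, b, c, d]))
print(dict(zip("abcd", vifs.round(2))))
```

Columns `a`, `b`, and `c` inflate each other, while `d` stays near 1. That is the same pattern as `rolloff_mean` against the low-order MFCCs above, only exaggerated.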




In [10]:
# og: "blues.00000.0.wav"
# songname: "blues.00000"
# genre: "blues"
song_genre = df[["songname", "genre"]].drop_duplicates()

# stratification was done in another notebook.
# Some extra steps were taken.
# Stratify on the 30 second clips and then project down to the 3 second clips

with open("../data/train_songs.p", "rb") as f:
    train_songs = pickle.load(f)
with open("../data/test_songs.p", "rb") as f:
    test_songs = pickle.load(f)

train_idxs = df[df["songname"].isin(train_songs)].index
test_idxs = df[df["songname"].isin(test_songs)].index

X_train = X_logged.loc[train_idxs, :]
X_test = X_logged.loc[test_idxs, :]
y_train = y[train_idxs]
y_test = y[test_idxs]

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(7990, 36) (7990,)
(2000, 36) (2000,)



In [11]:
# Prove no overlap of songs between train/test
set(train_songs).intersection(set(test_songs))

set()
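
The split itself lives in another notebook, so the following is only a guess at its shape: stratify on one row per song, then project the song-level split down to the 3 second clips. The toy data frame, the 20% test size, and the names `clips`, `train_song_names`, and `test_song_names` are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for features_3_sec.csv: 2 genres x 10 songs x 10 clips (hypothetical data)
rows = [
    {"filename": f"{genre}.{song:05d}.{clip}.wav"}
    for genre in ("blues", "rock")
    for song in range(10)
    for clip in range(10)
]
clips = pd.DataFrame(rows)
clips["genre"] = clips["filename"].str.split(".").str[0]
clips["songname"] = clips["filename"].str.split(".").str[:2].str.join(".")

# One row per song, so the split is decided at the song level
songs = clips[["songname", "genre"]].drop_duplicates()
train_song_names, test_song_names = train_test_split(
    songs["songname"], test_size=0.2, stratify=songs["genre"], random_state=42
)

# Project the song-level split down to the clips
train_clips = clips[clips["songname"].isin(train_song_names)]
test_clips = clips[clips["songname"].isin(test_song_names)]

print(len(train_clips), len(test_clips))
print(set(train_song_names) & set(test_song_names))
```

Splitting on songs rather than clips is what keeps sibling clips from one recording out of both sides of the split.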


## Model 1: KNN

In [12]:
num_cols = list(X.columns)

preprocessing = ColumnTransformer(
    [
        # Scale numeric columns (not needed for all models but can't hurt)
        ("scaler", StandardScaler(), num_cols)
    ],
    remainder="passthrough",
)


pipeline_knn = Pipeline(
    [
        ("preprocessing", preprocessing),
        #         ("pca", PCA()),
        # Choose your model and put it here
        ("knn", KNeighborsClassifier(weights="uniform", n_neighbors=100)),
    ]
)


pipeline_knn.fit(X_train, y=y_train)


print(pipeline_knn.score(X_train, y_train))
print(pipeline_knn.score(X_test, y_test))

0.7321652065081352
0.628



## KNN Analysis:

The structure of the data makes it far too easy for this model to cheat. Every 3 second clip has roughly 9 sibling clips from the same song that are similar in nearly every feature, so a training clip's nearest neighbors tend to be its own siblings. Test songs have no siblings in the training set, so performance on test data can't improve without allowing an extremely high level of overfitting. Weighting neighbors by distance makes this even worse.

This problem extends to any type of model that uses decision trees. Both Gradient Boosting and Random Forest Classification were tested with very similar results.

## Model 2: SVC rbf kernel

In [13]:
num_cols = list(X.columns)

preprocessing = ColumnTransformer(
    [
        # Scale numeric columns (not needed for all models but can't hurt)
        ("scaler", StandardScaler(), num_cols)
    ],
    remainder="passthrough",
)


pipeline_svc = Pipeline(
    [
        ("preprocessing", preprocessing),
        #         ("pca", PCA()),
        # Choose your model and put it here
        ("svc", SVC(kernel="rbf", C=10)),
    ]
)

pipeline_svc.fit(X_train, y_train)

print(pipeline_svc.score(X_train, y_train))
print(pipeline_svc.score(X_test, y_test))

0.9987484355444305
0.7165



## SVC rbf Analysis:

Same as before: it is impossible to reach a higher test accuracy than the other models without allowing overfitting. Here the training accuracy is 0.999 against 0.717 on the test set. There are also fewer parameters available to address the overfitting. A linear SVM somewhat addresses this, but at that point I think it makes more sense to use Logistic Regression, both to reduce model complexity and to increase explanatory power.

## Final Model: ElasticNet Logistic Regression

In [14]:
num_cols = list(X.columns)

preprocessing = ColumnTransformer(
    [
        # Scale numeric columns (not needed for all models but can't hurt)
        ("scaler", StandardScaler(), num_cols)
    ],
    remainder="passthrough",
)


pipeline = Pipeline(
    [
        ("preprocessing", preprocessing),
        #         ("pca", PCA()),
        # Choose your model and put it here
        (
            "log",
            LogisticRegression(
                max_iter=800, penalty="elasticnet", solver="saga", C=0.1, l1_ratio=0.5
            ),
        ),
    ]
)

pipeline.fit(X_train, y_train)

print(pipeline.score(X_train, y_train))
print(pipeline.score(X_test, y_test))

0.7058823529411765
0.65



## ElasticNet Classification

The hyperparameters eventually used came from a large 5-fold grid search over 180 different combinations, with precision as the scoring metric. Having both LASSO (L1) and Ridge (L2) penalties available was helpful, especially when deciding which features to eventually throw out.

The scores are not great, admittedly, but when we use the model on the 3 second subsections and then vote on the genre of the original 30 second clip, there is significant improvement. More on that later. First, I'd like to point out something fairly interesting in the model coefficients.
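
The grid search itself is not shown in this notebook, so here is a small sketch of what such a search could look like. Everything specific is an assumption: the synthetic data, the 3 x 3 grid (9 combinations, not the 180 actually used), and `precision_macro` as the multiclass flavor of precision.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the clip features (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

demo_pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("log", LogisticRegression(penalty="elasticnet", solver="saga", max_iter=800)),
    ]
)

# Illustrative grid only: l1_ratio=0 is pure Ridge, l1_ratio=1 is pure LASSO
param_grid = {
    "log__C": [0.01, 0.1, 1.0],
    "log__l1_ratio": [0.0, 0.5, 1.0],
}

search = GridSearchCV(demo_pipeline, param_grid, cv=5, scoring="precision_macro")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide convergence and undefined-metric warnings
    search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, plain 5-fold splitting would scatter clips from one song across folds. Passing the song names as `groups` together with a `GroupKFold` splitter is one way to avoid that, though I have not rerun the search that way.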

In [15]:
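# coef_ has one row per class; row 0 belongs to the first entry of
# pipeline.classes_, so this table describes that class only
coef_df = pd.DataFrame(
    {"feat": X_train.columns, "coef": pipeline.named_steps["log"].coef_[0]}
)
coef_df["abs_coef"] = np.abs(coef_df["coef"])
coef_df.sort_values("abs_coef", ascending=False)[0:15]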

                       feat       coef  abs_coef
9                mfcc1_mean  -1.441049  1.441049
10               mfcc2_mean   1.137417  1.137417
2                   rms_var   1.039603  1.039603
6               harmony_var   1.028323  1.028323
14               mfcc4_mean   0.913629  0.913629
3   zero_crossing_rate_mean   0.832223  0.832223
17               mfcc6_mean   0.827441  0.827441
11                mfcc2_var  -0.635519  0.635519
12               mfcc3_mean   0.508702  0.508702
19               mfcc7_mean  -0.503557  0.503557



This is the top 15 of the 36 features I ended up using, ranked by absolute coefficient. Much of this may be meaningless to you (it was to me at first), but I'd like to draw attention to 'mfcc18_mean'. Keeping the cepstral transform in mind, a high-order coefficient like this one describes very fine ripples across the Mel spectrum rather than its broad shape. My reading is that the model is taking into account 'overtones' of the sound, which are mostly expressed in things like distortion and noise. It could be helping to separate genres that typically have distortion, like rock and metal, from genres that generally don't, like classical and jazz.
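
One caveat when reading a coefficient table like this: for a multiclass Logistic Regression, `coef_` holds one row per class, so `coef_[0]` only describes the first class in `classes_`. A sketch of looking at all classes at once is below. The synthetic data and the feature names `feat_0` to `feat_5` are placeholders, not the GTZAN features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class problem (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, n_classes=3, random_state=0
)
feature_names = [f"feat_{i}" for i in range(X_demo.shape[1])]  # placeholder names

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# coef_ has shape (n_classes, n_features); transpose to one row per feature
coef_table = pd.DataFrame(model.coef_, index=model.classes_, columns=feature_names).T

# Rank features by their largest absolute coefficient across all classes
coef_table["max_abs_coef"] = coef_table.abs().max(axis=1)

print(coef_table.sort_values("max_abs_coef", ascending=False))
```

With one column per genre, it becomes possible to see which genres a feature actually pushes toward or away from, rather than only its effect on the first class.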

In [16]:
df["predictions"] = pipeline.predict(X_logged[keep_cols])


In [22]:
long = pd.read_csv("../data/features_30_sec.csv")


In [23]:
long["vote_pred"] = "none"

for i in range(long["filename"].size):
    curr_file = long["filename"][i]
    # Drop only the trailing ".wav"; str.strip(".wav") strips characters, not a suffix
    file_stripped = curr_file.rsplit(".", 1)[0]
    # Exact match on the song name instead of a regex substring search
    sub_selection = df["songname"] == file_stripped
    prediction = (
        df[sub_selection]["predictions"]
        .value_counts()
        .sort_values(ascending=False)
        .index[0]
    )
    # .loc avoids chained assignment and the SettingWithCopyWarning it triggers
    long.loc[i, "vote_pred"] = prediction



In [25]:
pipeline.classes_


array(['blues', 'classical', 'country', 'disco', 'hiphop', 'jazz',
       'metal', 'pop', 'reggae', 'rock'], dtype=object)


In [27]:
long.columns

Index(['filename', 'length', 'chroma_stft_mean', 'chroma_stft_var', 'rms_mean',
       'rms_var', 'spectral_centroid_mean', 'spectral_centroid_var',
       'spectral_bandwidth_mean', 'spectral_bandwidth_var', 'rolloff_mean',
       'rolloff_var', 'zero_crossing_rate_mean', 'zero_crossing_rate_var',
       'harmony_mean', 'harmony_var', 'perceptr_mean', 'perceptr_var', 'tempo',
       'mfcc1_mean', 'mfcc1_var', 'mfcc2_mean', 'mfcc2_var', 'mfcc3_mean',
       'mfcc3_var', 'mfcc4_mean', 'mfcc4_var', 'mfcc5_mean', 'mfcc5_var',
       'mfcc6_mean', 'mfcc6_var', 'mfcc7_mean', 'mfcc7_var', 'mfcc8_mean',
       'mfcc8_var', 'mfcc9_mean', 'mfcc9_var', 'mfcc10_mean', 'mfcc10_var',
       'mfcc11_mean', 'mfcc11_var', 'mfcc12_mean', 'mfcc12_var', 'mfcc13_mean',
       'mfcc13_var', 'mfcc14_mean', 'mfcc14_var', 'mfcc15_mean', 'mfcc15_var',
       'mfcc16_mean', 'mfcc16_var', 'mfcc17_mean', 'mfcc17_var', 'mfcc18_mean',
       'mfcc18_var', 'mfcc19_mean', 'mfcc19_var', 'mfcc20_mean', 'mfcc20_var',
  


In [24]:
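# Assumption: the per-genre columns were meant to hold predicted probabilities
# for each 3 second clip. Without them on df, df[c] raises KeyError.
proba = pipeline.predict_proba(X_logged[keep_cols])
for j, c in enumerate(pipeline.classes_):
    df[c] = proba[:, j]
    long[c] = 0.0

long["avg_vote"] = "none"

for i in range(long["filename"].size):
    curr_file = long["filename"][i]
    # Drop only the trailing ".wav"
    file_stripped = curr_file.rsplit(".", 1)[0]
    sub_selection = df["songname"] == file_stripped
    # Mean probability per genre across the song's clips
    avg_proba = df.loc[sub_selection, list(pipeline.classes_)].mean()
    for c in pipeline.classes_:
        long.loc[i, c] = avg_proba[c]
    long.loc[i, "avg_vote"] = avg_proba.idxmax()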


In [19]:
long["songname"] = long["filename"].str.split(".").str[:2].str.join(".")

train_idxs = long[long["songname"].isin(train_songs)].index
test_idxs = long[long["songname"].isin(test_songs)].index

long_train = long.loc[train_idxs, :]
long_test = long.loc[test_idxs, :]


In [20]:
# original 30 second clips
print(confusion_matrix(long_test["label"], long_test["vote_pred"]))
print(classification_report(long_test["label"], long_test["vote_pred"]))

[[13  0  0  0  0  3  1  0  2  1]
 [ 0 20  0  0  0  0  0  0  0  0]
 [ 2  0 13  0  1  1  1  0  0  2]
 [ 0  0  1 14  2  0  0  1  0  2]
 [ 0  0  0  1 10  0  1  4  4  0]
 [ 0  3  1  0  0 16  0  0  0  0]
 [ 0  0  0  1  1  0 18  0  0  0]
 [ 0  0  1  0  0  0  0 18  0  1]
 [ 0  0  2  1  2  0  1  0 13  1]
 [ 2  0  1  6  0  2  2  1  1  5]]
              precision    recall  f1-score   support

       blues       0.76      0.65      0.70        20
   classical       0.87      1.00      0.93        20
     country       0.68      0.65      0.67        20
       disco       0.61      0.70      0.65        20
      hiphop       0.62      0.50      0.56        20
        jazz       0.73      0.80      0.76        20
       metal       0.75      0.90      0.82        20
         pop       0.75      0.90      0.82        20
      reggae       0.65      0.65      0.65        20
        rock       0.42      0.25      0.31        20

    accuracy                           0.70       200
   macro avg       


This is the result of the 'voting' I mentioned earlier. Aggregating the clip-level predictions raises accuracy from 0.65 on the individual 3 second test clips to 0.70 on the 30 second test songs. I think this is a good case for the potential of this model and makes it worth pursuing 'denser' data. What happens if we don't take the mean and variance of each measurement but actually use all data available? What happens when we have more than 30 seconds for each song? I think this model suggests it might be worth investing time into seeing where it goes.
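
For reference, the voting step can also be written without an explicit loop. The sketch below shows a hard vote (most common clip label) and a soft vote (highest mean probability) using a pandas `groupby`. The tiny data frame and the `p_` column names are made up for illustration and are not output from the model above.

```python
import pandas as pd

# Hypothetical clip-level predictions for two songs (three clips each)
clip_preds = pd.DataFrame(
    {
        "songname": ["rock.00001"] * 3 + ["jazz.00002"] * 3,
        "predictions": ["rock", "disco", "rock", "jazz", "jazz", "classical"],
        "p_rock": [0.5, 0.3, 0.6, 0.1, 0.1, 0.1],
        "p_disco": [0.3, 0.6, 0.2, 0.1, 0.1, 0.1],
        "p_jazz": [0.1, 0.05, 0.1, 0.5, 0.6, 0.3],
        "p_classical": [0.1, 0.05, 0.1, 0.3, 0.2, 0.5],
    }
)

# Hard vote: the most common clip-level label wins
hard_vote = clip_preds.groupby("songname")["predictions"].agg(
    lambda labels: labels.value_counts().index[0]
)

# Soft vote: average the class probabilities per song, then take the largest
proba_cols = ["p_rock", "p_disco", "p_jazz", "p_classical"]
mean_proba = clip_preds.groupby("songname")[proba_cols].mean()
soft_vote = mean_proba.idxmax(axis=1).str.replace("p_", "", regex=False)

print(hard_vote.to_dict())
print(soft_vote.to_dict())
```

The two schemes agree on this toy data. On real clips they can differ when a song has a few confident clips and many uncertain ones, which may be worth comparing.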