In [1]:
import pandas as pd
import numpy as np
import utils
import librosa
import librosa.display
import soundfile as sf
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Executive Summary

### Goal
The goal of this notebook is to identify the components of an audio signal that shape how we perceive it. To do this, I attempt to predict the human construct of genre from various audio features. The data come from the [Free Music Archive (FMA)](https://github.com/mdeff/fma) GitHub repository.

### Data
The audio columns are statistics of audio features extracted using the [librosa library](http://man.hubwiz.com/docset/LibROSA.docset/Contents/Resources/Documents/index.html). The feature statistics that best accounted for the variability in the data were the mel frequency cepstral coefficient (MFCC) statistics. The mel frequency scale is discussed further below.

### Metrics
Predicting genre with logistic regression using these statistics was 52.2% accurate on the test set, a large improvement over the dummy baseline of 12.5%.
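A baseline of 12.5% is what one would expect from always guessing a single genre when the labels cover eight equally represented classes (1/8 = 0.125). The class count is inferred from that number and is not stated in this notebook. A minimal sketch of the arithmetic, using hypothetical balanced labels rather than the FMA data:

```python
import numpy as np

# Hypothetical labels: 8 genres with 100 tracks each (not the notebook's data).
y = np.repeat(np.arange(8), 100)

# A majority-class "dummy" model always predicts the most common label.
labels, counts = np.unique(y, return_counts=True)
majority_label = labels[np.argmax(counts)]

# Its accuracy is simply the share of tracks carrying that label.
baseline_accuracy = np.mean(y == majority_label)
print(baseline_accuracy)
```

Any model that does not beat this number has learned nothing beyond the class frequencies.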

### Limitations
Genre is subjective. It is an attempt to put words to a feeling, and two people can listen to the same song and place it in different genres. To reduce this problem I used the broadest genre definitions provided. Even so, any model I build will inherit the perception of genre of whoever labeled this data.

# Mel Frequency Cepstral Coefficients

The mel frequency cepstral coefficients (MFCCs) are an abstraction of a spectrogram that attempts to capture the components of an audio signal that account for our perception of that signal. For instance, in speech we have the concepts of vowels and consonants. These features describe the construction of a word and our understanding of that word. The MFCCs cannot be compared directly to vowels and consonants, but they abstract audio into components that describe what humans hear.

We start with the raw waveform of the audio clip.


![waveform](../images/waveform.png)

The first step in extracting MFCCs is to perform the short-time Fourier transform on small time windows of the audio signal. This looks at short sections of the signal and extracts the frequency components that make up each section. Below is the spectrogram of a 30 second clip of the song colorful lights, which is track 065488 in the FMA.
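The windowing idea can be sketched with NumPy alone. This is an illustrative sketch, not the code that produced the figure: the helper `stft_magnitude`, the 440 Hz test tone, and the parameter values (`n_fft=2048`, `hop_length=512`, a 22050 Hz sample rate) are all assumptions. In practice `librosa.stft` would normally be used instead.

```python
import numpy as np

def stft_magnitude(signal, n_fft=2048, hop_length=512):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    # Slice the signal into overlapping frames and taper each one.
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real-valued signal.
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (n_fft // 2 + 1, n_frames)

# Hypothetical test signal: one second of a 440 Hz sine wave.
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)

spec = stft_magnitude(tone)
freqs = np.fft.rfftfreq(2048, d=1.0 / sr)

# The strongest frequency bin should sit next to 440 Hz.
peak_hz = freqs[spec.mean(axis=1).argmax()]
print(spec.shape, peak_hz)
```

Each column of `spec` describes one time window and each row one frequency bin, which is the layout a spectrogram image displays.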


![spectrogram](../images/stft.png)



![mel filter bins](../images/mel_filter_bins.png)


![mel spectrogram](../images/mel_spec.png)


![mfccs.png](../images/mfccs.png)
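The remaining steps pictured above (mel filter bins, mel spectrogram, MFCCs) can be sketched as follows. This is a simplified illustration under stated assumptions, not librosa's implementation: it uses the common formula mel = 2595 * log10(1 + f / 700), whereas librosa's default mel scale differs, and the helper names, the 40 mel bands, the 20 coefficients, and the random input spectrogram are all hypothetical.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)

def mel_filter_bank(sr, n_fft, n_mels):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # n_mels + 2 edge points: each filter spans (left, center, right).
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_ii(x, n_out):
    """Unnormalized DCT-II along axis 0, keeping the first n_out coefficients."""
    n = x.shape[0]
    k = np.arange(n_out)[:, None]
    i = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * i + 1) / (2 * n)) @ x

def mfcc_from_power(power_spec, sr, n_fft, n_mels=40, n_mfcc=20):
    bank = mel_filter_bank(sr, n_fft, n_mels)
    mel_spec = bank @ power_spec          # pool FFT bins into mel bands
    log_mel = np.log(mel_spec + 1e-10)    # compress the dynamic range
    return dct_ii(log_mel, n_mfcc)        # decorrelate the mel bands

# Hypothetical power spectrogram: 1025 frequency bins by 40 frames.
rng = np.random.default_rng(0)
power = rng.random((1025, 40))

bank = mel_filter_bank(sr=22050, n_fft=2048, n_mels=40)
mfccs = mfcc_from_power(power, sr=22050, n_fft=2048)
print(bank.shape, mfccs.shape)
```

The filters are narrow at low frequencies and wide at high frequencies, which mirrors the way pitch resolution in human hearing falls off as frequency rises.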

## MFCC Analysis

In [2]:
df = pd.read_csv('../data/features_with_genres.csv', index_col=0)
mfcc_cols = [col for col in df.columns if 'mfcc' in col]

In [5]:
X = df[mfcc_cols]
y = df['track_genre_top']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                   random_state=42)

In [6]:
ss = StandardScaler()
ss.fit(X_train)
X_train_sc = ss.transform(X_train)
X_test_sc = ss.transform(X_test)

In [8]:
lr = LogisticRegression(C = .00599484, max_iter = 1000)
lr.fit(X_train_sc, y_train)
lr.score(X_test_sc, y_test), lr.score(X_train_sc, y_train)

(0.5222611305652827, 0.538115095913261)

**Interpretation:** Using only the MFCC statistics, accuracy is close to what the full feature set achieves. Logistic regression with all of the features scored 57.7% on the test set and 66.6% on the training set. The gap between training and test accuracy also shrinks from about 9 points to under 2, so overfitting is greatly reduced.

In [9]:
pca = PCA()
pca.fit(X_train_sc)

PCA()

In [10]:
len(X_train.columns)

140

In [11]:
sum([round(val, 5) for val in pca.explained_variance_ratio_][:77])

0.97095
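Rather than trying cutoffs by hand, the number of components needed to reach a target share of variance can be read off the cumulative sum. The sketch below runs on hypothetical random data and computes the ratios from an SVD so that it needs only NumPy. With a fitted scikit-learn `PCA`, `pca.explained_variance_ratio_` would play the role of `explained_variance_ratio`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 samples, 30 correlated features (not the MFCC table).
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(500, 30))

# Standardize, then derive each component's share of variance from the SVD.
X_sc = (X - X.mean(axis=0)) / X.std(axis=0)
singular_values = np.linalg.svd(X_sc, compute_uv=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Smallest number of components whose cumulative ratio reaches the target.
target = 0.97
cumulative = np.cumsum(explained_variance_ratio)
n_components = int(np.searchsorted(cumulative, target) + 1)
print(n_components, cumulative[n_components - 1])
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, for example `PCA(n_components=0.97)`, which makes a comparable selection internally.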

In [12]:
pca = PCA(77)
pca.fit(X_train_sc)

PCA(n_components=77)

In [13]:
X_train_pca = pd.DataFrame(pca.transform(X_train_sc))
X_test_pca = pd.DataFrame(pca.transform(X_test_sc))

In [14]:
lr = LogisticRegression(C = .00599484, max_iter = 1000)
cross_val_score(lr, X_train_sc, y_train)

array([0.50708924, 0.52126772, 0.49958299, 0.49791493, 0.49958299])
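In the cell above, the scaler was fit on the whole training set before cross-validation, so each validation fold contributed to the scaling statistics. The effect is probably small here, but a pipeline avoids it by refitting the scaler inside every fold. A minimal sketch on hypothetical stand-in data (`X_demo` and `y_demo` are not the notebook's variables):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 400 samples, 10 features, 4 classes.
y_demo = np.repeat(np.arange(4), 100)
X_demo = rng.normal(size=(400, 10)) + y_demo[:, None]

# The scaler is refit on the training portion of every fold,
# so the held-out fold never influences the scaling statistics.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.00599484, max_iter=1000),
)
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(scores.mean())
```

In the notebook the same pipeline would be passed the unscaled `X_train` in place of `X_train_sc`.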

In [15]:
lr = LogisticRegression(C = .00599484, max_iter = 1000)
lr.fit(X_train_pca, y_train)
lr.score(X_test_pca, y_test), lr.score(X_train_pca, y_train)

(0.511255627813907, 0.5282735613010843)

In [16]:
gbm = GradientBoostingClassifier(min_samples_split = 10, subsample = 1,
                                 max_features = 'sqrt', n_iter_no_change = 5,
                                 n_estimators = 500, learning_rate = .0001,
                                 tol = .0001, verbose=1)
gbm.fit(X_train, y_train)
gbm.score(X_test, y_test), gbm.score(X_train, y_train)

      Iter       Train Loss   Remaining Time 
         1           2.0794            1.10m
         2           2.0793            1.09m
         3           2.0792            1.08m
         4           2.0791            1.08m
         5           2.0790            1.08m
         6           2.0789            1.07m
         7           2.0788            1.07m
         8           2.0787            1.07m
         9           2.0787            1.07m
        10           2.0786            1.07m
        20           2.0777            1.05m
        30           2.0768            1.03m
        40           2.0760            1.01m
        50           2.0751           59.41s
        60           2.0742           57.95s
        70           2.0734           56.53s
        80           2.0725           55.24s
        90           2.0716           53.93s
       100           2.0708           52.57s
       200           2.0622           39.06s
       300           2.0538           26.46s
       40

(0.4547273636818409, 0.5082568807339449)