# Feature selection

In [91]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.feature_selection import VarianceThreshold, SelectPercentile, chi2, SelectFromModel
from sklearn.svm import LinearSVC

Load data and print all features / columns that are present.

In [3]:
df_samples_all = pd.read_csv('s3://stormpetrels/samples/labels/samples_all.csv')
df_samples_all.columns.values

array(['freq_IQR', 'freq_Q25', 'freq_Q75', 'freq_mean', 'freq_median',
       'freq_mode', 'freq_peak', 'offset', 'onset', 'pitch_IQR',
       'pitch_Q25', 'pitch_Q75', 'pitch_max', 'pitch_mean',
       'pitch_median', 'pitch_min', 'yaafe_Chroma.0', 'yaafe_Chroma.1',
       'yaafe_Chroma.10', 'yaafe_Chroma.11', 'yaafe_Chroma.2',
       'yaafe_Chroma.3', 'yaafe_Chroma.4', 'yaafe_Chroma.5',
       'yaafe_Chroma.6', 'yaafe_Chroma.7', 'yaafe_Chroma.8',
       'yaafe_Chroma.9', 'yaafe_LPC', 'yaafe_LSF.0', 'yaafe_LSF.1',
       'yaafe_LSF.2', 'yaafe_LSF.3', 'yaafe_LSF.4', 'yaafe_LSF.5',
       'yaafe_LSF.6', 'yaafe_LSF.7', 'yaafe_LSF.8', 'yaafe_LSF.9',
       'yaafe_MFCC.0', 'yaafe_MFCC.1', 'yaafe_MFCC.10', 'yaafe_MFCC.11',
       'yaafe_MFCC.12', 'yaafe_MFCC.2', 'yaafe_MFCC.3', 'yaafe_MFCC.4',
       'yaafe_MFCC.5', 'yaafe_MFCC.6', 'yaafe_MFCC.7', 'yaafe_MFCC.8',
       'yaafe_MFCC.9', 'yaafe_OBSI.0', 'yaafe_OBSI.1', 'yaafe_OBSI.2',
       'yaafe_OBSI.3', 'yaafe_OBSI.4', 'yaafe_OBSI.5', 'ya

Using the plots in [plot_features.ipynb](plot_features.ipynb), we preselected features that look promising for modelling. The graphs are linked below:

* [Frequency statistics](https://plot.ly/~tracewsl/311?share_key=604K67TKBBwoLRPqZMrelQ)
* [Pitch statistics](https://plot.ly/~tracewsl/313?share_key=WoXXOcXzpNk1dGHRPJiya5)
* [Chroma](https://plot.ly/~tracewsl/315?share_key=BIa9dCSsJ74rBJuQFUJN3H)
* [Linear Predictor Coefficients](https://plot.ly/~tracewsl/331?share_key=QbevAWkqRJqPozUBubNac1)
* [Line Spectral Frequency](https://plot.ly/~tracewsl/317?share_key=08PLdeSz12ozUBZcGWpmUg)
* [Mel-frequencies cepstrum coefficients](https://plot.ly/~tracewsl/319?share_key=5SdA05qgOjP8NA1nK954gW)
* [Octave band signal intensity](https://plot.ly/~tracewsl/333?share_key=fu233IqjwRZ1qqjcbrdygh)
* [Crest factors](https://plot.ly/~tracewsl/321?share_key=lApd3QLTyp9s3XA36uHNbi)
* [Flatness](https://plot.ly/~tracewsl/323?share_key=Z8wX3wdrksuFeAcQwk0Egi)
* [Flux](https://plot.ly/~tracewsl/325?share_key=yYUfHOjwzXF1CkOhghjELA)
* [Rolloff](https://plot.ly/~tracewsl/327?share_key=iYdp06DseJNwqnWCnUZBbv)
* [Variation](https://plot.ly/~tracewsl/329?share_key=Xk58B0O0iBBy4o0SvbJQKJ)

In [5]:
nice_features = ['freq_Q25', 'freq_Q75', 'freq_mean', 'freq_median',
       'freq_mode', 'freq_peak', 'offset', 'onset', 'pitch_IQR',
       'pitch_Q25', 'pitch_Q75', 'pitch_max', 'pitch_mean',
       'pitch_median', 'pitch_min', 'yaafe_LPC', 'yaafe_LSF.0', 'yaafe_LSF.1',
       'yaafe_LSF.2', 'yaafe_LSF.3', 'yaafe_LSF.4', 'yaafe_LSF.5',
       'yaafe_LSF.6', 'yaafe_LSF.7', 'yaafe_LSF.8', 'yaafe_LSF.9',
       'yaafe_MFCC.0', 'yaafe_MFCC.1', 'yaafe_MFCC.10', 'yaafe_MFCC.11',
       'yaafe_MFCC.12', 'yaafe_MFCC.2', 'yaafe_MFCC.3', 'yaafe_MFCC.4',
       'yaafe_MFCC.5', 'yaafe_MFCC.6', 'yaafe_MFCC.7', 'yaafe_MFCC.8',
       'yaafe_MFCC.9', 'yaafe_OBSI.0', 'yaafe_OBSI.1', 'yaafe_OBSI.2',
       'yaafe_OBSI.3', 'yaafe_OBSI.4', 'yaafe_OBSI.5', 'yaafe_OBSI.6',
       'yaafe_OBSI.7', 'yaafe_OBSI.8', 'yaafe_SpectralFlatness',
       'yaafe_SpectralFlux', 'yaafe_SpectralRolloff',
       'yaafe_SpectralVariation']

The selection is somewhat arbitrary, but it lets us incorporate our knowledge of what the various audio features contribute. We can also use data-driven strategies such as a variance threshold, which removes every feature whose variance does not exceed a given threshold.

In [77]:
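def df_selector(selector, data, y=None):
    """Fit `selector` on `data` and return only the selected columns."""
    # Supervised selectors (e.g. SelectPercentile) need the target y;
    # unsupervised ones (e.g. VarianceThreshold) are fitted on the data alone.
    if y is not None:
        selector.fit(data, y)
    else:
        selector.fit(data)
    # get_support(indices=True) returns the positions of the kept columns.
    return data[data.columns[selector.get_support(indices=True)]]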

In [73]:
df = df_samples_all.drop(columns=['petrel', 'filename', 'onset', 'offset']) # drop non-features

In [78]:
for threshold in np.linspace(0.1, 0.9, 9):
    selector = VarianceThreshold(threshold)
    df_variance_threshold = df_selector(selector, df)
    print(f'Threshold {threshold:.1f} Features left: {df_variance_threshold.shape[1]}')

Threshold 0.1 Features left: 48
Threshold 0.2 Features left: 43
Threshold 0.3 Features left: 38
Threshold 0.4 Features left: 38
Threshold 0.5 Features left: 34
Threshold 0.6 Features left: 33
Threshold 0.7 Features left: 32
Threshold 0.8 Features left: 31
Threshold 0.9 Features left: 30
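
Note that the loop above applies the threshold to unscaled features, and variance depends on each feature's units, so a single cutoff treats a frequency in Hz very differently from a ratio in [0, 1]. The following is a minimal sketch on made-up data (the `toy` frame and its column names are hypothetical, not taken from `samples_all.csv`). It shows that `VarianceThreshold` keeps the columns whose population variance exceeds the threshold, and how min-max scaling changes those variances.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one feature in Hz, one ratio in [0, 1].
toy = pd.DataFrame({
    'freq_hz': [1000.0, 2000.0, 3000.0, 4000.0],
    'flatness': [0.1, 0.9, 0.2, 0.8],
})

selector = VarianceThreshold(0.2).fit(toy)
raw_variances = toy.var(ddof=0)  # population variance (ddof=0)
kept = list(toy.columns[selector.get_support()])
print(raw_variances)
print('kept:', kept)

# After min-max scaling every feature lies in [0, 1],
# so its variance can be at most 0.25.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)
scaled_variances = scaled.var(ddof=0)
print(scaled_variances)
```

On the raw scale only `freq_hz` survives the 0.2 cutoff. Once both columns are scaled, their variances are of comparable magnitude, so the choice of threshold has to be made with the feature scale in mind.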


In [75]:
selector = VarianceThreshold(0.2)
df_variance_threshold = df_selector(selector, df)
df_variance_threshold.columns.values

array(['freq_IQR', 'freq_Q25', 'freq_Q75', 'freq_mean', 'freq_median',
       'freq_mode', 'freq_peak', 'pitch_IQR', 'pitch_Q25', 'pitch_Q75',
       'pitch_max', 'pitch_mean', 'pitch_median', 'pitch_min',
       'yaafe_Chroma.0', 'yaafe_Chroma.1', 'yaafe_Chroma.10',
       'yaafe_Chroma.11', 'yaafe_Chroma.2', 'yaafe_Chroma.3',
       'yaafe_Chroma.4', 'yaafe_Chroma.5', 'yaafe_Chroma.6',
       'yaafe_Chroma.7', 'yaafe_Chroma.8', 'yaafe_Chroma.9',
       'yaafe_MFCC.0', 'yaafe_MFCC.1', 'yaafe_OBSI.0', 'yaafe_OBSI.1',
       'yaafe_OBSI.2', 'yaafe_OBSI.3', 'yaafe_OBSI.4', 'yaafe_OBSI.5',
       'yaafe_OBSI.6', 'yaafe_OBSI.7', 'yaafe_OBSI.8',
       'yaafe_SpectralCrestFactorPerBand.10',
       'yaafe_SpectralCrestFactorPerBand.11',
       'yaafe_SpectralCrestFactorPerBand.14',
       'yaafe_SpectralCrestFactorPerBand.15', 'yaafe_SpectralFlux',
       'yaafe_SpectralRolloff'], dtype=object)

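The advantage of variance thresholding is that it is simple and does not introduce any bias by looking at the target variable. Univariate feature selection, by contrast, selects the best features based on univariate statistical tests against the target. Here we use the chi-squared test, which requires non-negative inputs, so the features are first scaled to [0, 1] with `MinMaxScaler`.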

In [83]:
df = df_samples_all.drop(columns=['filename', 'onset', 'offset']) # drop non-features
y = df.pop('petrel')
scaler = MinMaxScaler()
df[df.columns] = scaler.fit_transform(df)
selector = SelectPercentile(chi2, percentile=20)
df_percentile = df_selector(selector, df, y)
df_percentile.columns.values

array(['freq_Q25', 'freq_Q75', 'freq_mean', 'freq_median', 'freq_mode',
       'freq_peak', 'pitch_IQR', 'pitch_Q25', 'pitch_Q75', 'pitch_max',
       'pitch_mean', 'pitch_median', 'pitch_min', 'yaafe_LPC',
       'yaafe_LSF.0', 'yaafe_LSF.1',
       'yaafe_SpectralCrestFactorPerBand.14'], dtype=object)
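
To make the selection rule above more concrete, here is a minimal sketch on synthetic data (generated with `make_classification` as a stand-in; it is an assumption for illustration and not the petrel samples). It shows why the scaling step is needed and that `SelectPercentile` keeps the features with the highest chi-squared scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real samples: 10 features, binary target.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# chi2 rejects negative values, so the raw (signed) features fail.
try:
    chi2(X, y)
    raised = False
except ValueError:
    raised = True

X_scaled = MinMaxScaler().fit_transform(X)  # maps every feature to [0, 1]
selector = SelectPercentile(chi2, percentile=20).fit(X_scaled, y)

# Rank features by their chi-squared score, highest first.
ranking = np.argsort(selector.scores_)[::-1]
selected = np.flatnonzero(selector.get_support())
print('scores:', np.round(selector.scores_, 2))
print('selected feature indices:', selected)
```

With 10 features and `percentile=20`, the two top-scoring features are kept. Keep in mind that the chi-squared statistic is designed for non-negative quantities such as counts or frequencies, so applying it to min-max scaled continuous features is a heuristic.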

Instead of using a statistical test, we can use `SelectFromModel`, a meta-transformer that works with any estimator that exposes a `coef_` or `feature_importances_` attribute after fitting. Features whose corresponding `coef_` or `feature_importances_` values fall below the given threshold are considered unimportant and removed. Here an L1-penalized `LinearSVC` drives many coefficients to exactly zero, so only the features with non-zero weights survive.

In [106]:
df = df_samples_all.drop(columns=['filename', 'onset', 'offset']) # drop non-features
y = df.pop('petrel')
scaler = StandardScaler()
df[df.columns] = scaler.fit_transform(df)
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(df, y)
model = SelectFromModel(lsvc, prefit=True)
feature_idx = model.get_support()
feature_name = df.columns[feature_idx]
feature_name

Index(['freq_IQR', 'pitch_Q75', 'pitch_max', 'pitch_mean', 'pitch_median',
       'pitch_min', 'yaafe_MFCC.11', 'yaafe_MFCC.3', 'yaafe_MFCC.5',
       'yaafe_OBSI.3', 'yaafe_SpectralCrestFactorPerBand.12',
       'yaafe_SpectralCrestFactorPerBand.13',
       'yaafe_SpectralCrestFactorPerBand.14',
       'yaafe_SpectralCrestFactorPerBand.9', 'yaafe_SpectralVariation'],
      dtype='object')
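
The link between the L1 penalty and the selected features can be checked directly. The following is a minimal sketch on synthetic data (again `make_classification` as an assumed stand-in, with a binary target and an illustrative `C=0.05`, neither taken from the notebook). It compares the fitted weights with the mask that `SelectFromModel` returns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real samples: 20 features, few informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)

# The L1 penalty pushes the weights of weak features to exactly zero.
lsvc = LinearSVC(C=0.05, penalty='l1', dual=False).fit(X, y)
coef = np.abs(lsvc.coef_).ravel()  # binary problem: one weight per feature

# Explicit cutoff so the selection rule is visible in the code.
model = SelectFromModel(lsvc, prefit=True, threshold=1e-5)
mask = model.get_support()
print('non-zero weights:', int((coef > 0).sum()), 'of', coef.size)
print('selected feature indices:', np.flatnonzero(mask))
```

A smaller `C` means stronger regularization and therefore fewer surviving features. For a multi-class target, `coef_` has one row per class, so the per-feature weights have to be aggregated across classes before they are compared with the threshold.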