## Idea

I want to use two out-of-the-box features from librosa: `chroma_stft` and `mfcc`. For each feature, I will fit a simple logistic regression model, then use the outputs of both models as new features to train another model on top of them.

Several friends had commented in the past that stacked/ensemble models tend to work better than a single model, so I want to try the workflow here.

![stacked](../../docs/images/chang-1.4-stacked-model.png)

I also got the idea from Ch. 50, page 3 of Andrew Ng's book Machine Learning Yearning, which he is sending out by email.

The results are evaluated on a 75%/25% train-test split using the default metric (mean accuracy).

In [None]:
import helpers
import pandas as pd
import pickle
import numpy as np
import math
import librosa.feature as lf
from sklearn import model_selection
from sklearn import linear_model

## Loading data for later use

**Can skip reading this whole section** - basically load and process it using Carlos's function to either read the whole audio or only the first second.

I always get a memory error when reading the Applause files if I try to parse the whole audio. It may have to do with the 32-bit Python installation I have on Windows 10.

In [None]:
TRAIN_CSV = 'data/external/train.csv'
TRAIN_FILES = 'data/external/audio_train'

In [None]:
tags = list(pd.read_csv(TRAIN_CSV).label.unique())

Load the files for each label separately with `duration=15` and pickle the result, one file per label. The 1 second versions are produced further down.

In [None]:
for tag_num, tag_name in enumerate(tags):
    train_files = helpers.find_paths_with_tags(csv_path=TRAIN_CSV, files_path=TRAIN_FILES, filters=[tag_name])
    wav_data = helpers.load_wav_files(train_files, duration=15)
    with open('data/raw/train/wav-data-{}.pkl'.format(tag_name), 'wb') as f:
        pickle.dump(wav_data, f)
    print('Processed {}-{}'.format(tag_num, tag_name))

In [None]:
for tag_num, tag_name in enumerate(tags[37:], start=37):  # resume at tag 37; start=37 keeps the printed index aligned with `tags`
    train_files = helpers.find_paths_with_tags(csv_path=TRAIN_CSV, files_path=TRAIN_FILES, filters=[tag_name])
    wav_data = helpers.load_wav_files(train_files, duration=15)
    with open('data/raw/train/wav-data-{}.pkl'.format(tag_name), 'wb') as f:
        pickle.dump(wav_data, f)
    print('Processed {}-{}'.format(tag_num, tag_name))

#### 1 second versions

In [None]:
for tag_num, tag_name in enumerate(tags):
    train_files = helpers.find_paths_with_tags(csv_path=TRAIN_CSV, files_path=TRAIN_FILES, filters=[tag_name])
    wav_data = helpers.load_wav_files(train_files, duration=1)
    with open('data/raw/train-1-sec/wav-data-{}.pkl'.format(tag_name), 'wb') as f:
        pickle.dump(wav_data, f)
    print('Processed {}-{}'.format(tag_num, tag_name))

## Process features

Here I processed the files to extract their features.

If a clip is shorter than 1 second, pad it to 1 second by repeating it before processing. Anything beyond the first second is cut off.

In [None]:
def pad_audio(sound: np.ndarray, sample_rate=22050):
    padded_sound = np.tile(sound, math.ceil(sample_rate / sound.shape[0]))
    return padded_sound[:sample_rate]
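
To make the padding behaviour concrete, here is a small check on toy arrays. It uses a tiny `sample_rate` so the result is easy to read. The function is repeated so the snippet runs on its own, and the input values are made up for illustration.

```python
import math
import numpy as np

def pad_audio(sound: np.ndarray, sample_rate=22050):
    # Repeat the clip enough times to cover sample_rate samples, then trim.
    padded_sound = np.tile(sound, math.ceil(sample_rate / sound.shape[0]))
    return padded_sound[:sample_rate]

# A clip shorter than the target length is repeated, then cut to length.
short_clip = np.array([1, 2, 3])
padded_short = pad_audio(short_clip, sample_rate=7)

# A clip longer than the target length is simply truncated.
long_clip = np.arange(10)
trimmed_long = pad_audio(long_clip, sample_rate=7)

print(padded_short)
print(trimmed_long)
```

Note that the padding repeats the signal rather than adding silence, so a short sound is looped until it fills the second.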

#### chroma_stft

For each label, read in the 1 second version, process it with `librosa.feature.chroma_stft`, then save it back to disk.

In [None]:
%%time
feature_name = 'chroma_stft'
for tag_num, tag_name in enumerate(tags):
    with open('data/raw/train-1-sec/wav-data-{}.pkl'.format(tag_name), 'rb') as f:
        wav_data = pickle.load(f)
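    # Pass the signal as the keyword argument y; recent librosa releases require keyword arguments here
    wav_features = {sample.name: lf.chroma_stft(y=pad_audio(sample.wav[0])).flatten() for sample in wav_data}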
    df_features = (
        pd.DataFrame.from_dict(wav_features, orient='index')
        .reset_index().rename({'index': 'name'}, axis=1)
    )
    df_features.columns = ['name'] + [feature_name + '_' + str(column_name) for column_name in list(df_features.columns)[1:]]
    df_features.to_pickle('data/interim/{}-1-sec/{}.pkl'.format(feature_name, tag_name))

#### mfcc

For each label, read in the 1 second version, process it with `librosa.feature.mfcc`, then save it back to disk.

In [None]:
%%time
feature_name = 'mfcc'
for tag_num, tag_name in enumerate(tags):
    with open('data/raw/train-1-sec/wav-data-{}.pkl'.format(tag_name), 'rb') as f:
        wav_data = pickle.load(f)
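    # Pass the signal as the keyword argument y; recent librosa releases require keyword arguments here
    wav_features = {sample.name: lf.mfcc(y=pad_audio(sample.wav[0])).flatten() for sample in wav_data}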
    df_features = (
        pd.DataFrame.from_dict(wav_features, orient='index')
        .reset_index().rename({'index': 'name'}, axis=1)
    )
    df_features.columns = ['name'] + [feature_name + '_' + str(column_name) for column_name in list(df_features.columns)[1:]]
    df_features.to_pickle('data/interim/{}-1-sec/{}.pkl'.format(feature_name, tag_name))
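
As a rough sanity check on the size of the flattened feature vectors, the arithmetic below estimates how many columns each feature table should have. It assumes librosa's defaults, which the cells above do not override: `hop_length=512`, centered frames, 12 chroma bins and 20 MFCC coefficients. If any of those differ in your librosa version, the numbers will differ too.

```python
SAMPLE_RATE = 22050   # samples in a 1 second clip after pad_audio
HOP_LENGTH = 512      # assumed librosa default
N_CHROMA = 12         # assumed default number of chroma bins
N_MFCC = 20           # assumed default number of MFCC coefficients

def n_frames(n_samples, hop_length=HOP_LENGTH):
    # With centered framing, the number of frames is 1 + n_samples // hop_length.
    return 1 + n_samples // hop_length

frames = n_frames(SAMPLE_RATE)
chroma_dim = N_CHROMA * frames   # length of the flattened chroma_stft matrix
mfcc_dim = N_MFCC * frames       # length of the flattened mfcc matrix

print(frames, chroma_dim, mfcc_dim)
```

Under these assumptions the mfcc model sees noticeably more input columns than the chroma_stft model, which may partly explain why it is slower to train.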

## Make two models

One for each feature

#### Label the files 

I convert the string labels to integer indices before passing them to scikit-learn. (scikit-learn classifiers also accept string labels directly, so this step is a convenience rather than a requirement.)

EDIT: Only after I did it I found out about [sklearn.preprocessing.LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html), so I will use that instead in the future.
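
For reference, a minimal sketch of what the `LabelEncoder` route could look like. The label strings are illustrative only and are not read from `train.csv`.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels; in the notebook these would come from the label column.
example_labels = ['Saxophone', 'Applause', 'Flute', 'Applause']

encoder = LabelEncoder()
encoded = encoder.fit_transform(example_labels)   # strings -> integer codes
decoded = encoder.inverse_transform(encoded)      # integer codes -> strings

print(list(encoder.classes_))
print(encoded.tolist())
```

`LabelEncoder` assigns codes in sorted order of the class names, so the indices would generally differ from the order-of-appearance index built in the next cell.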

In [None]:
label_reference = pd.Series(tags).rename('label').to_frame().reset_index().rename({'index': 'label_index'}, axis=1)

In [None]:
file_labels = (
    pd.read_csv(TRAIN_CSV)
    .drop(['manually_verified'], axis=1)
    .rename({'fname': 'name'}, axis=1)
    .merge(label_reference, on=['label'])
)

#### Split the data into training and test set

For both models, I need to split it into a training set and a test set so I can score later.

### chroma_stft

##### train-test split

In [None]:
# load all features from processed files
train_chroma_stft = pd.concat([pd.read_pickle('data/interim/{}-1-sec/{}.pkl'.format('chroma_stft', name)) for name in tags])

df_train_chroma_stft = (
    train_chroma_stft
    .merge(file_labels, on=['name'], how='inner')
)

X_chroma_train, X_chroma_test, y_chroma_train, y_chroma_test = model_selection.train_test_split(
    df_train_chroma_stft.drop(['name', 'label', 'label_index'], axis=1).values,
    df_train_chroma_stft['label_index'].values,
    test_size=0.25, 
    random_state=707,
    stratify=df_train_chroma_stft['label_index'].values
)

Use logistic regression on the multiclass problem. Class weights are not balanced yet.

##### train model

In [None]:
chroma_model = linear_model.LogisticRegression(random_state=123)

In [None]:
%%time
chroma_model.fit(X_chroma_train, y_chroma_train)

Pickle the trained model

In [None]:
with open('data/interim/{}-1-sec/{}.pkl'.format('chroma_stft', 'logreg-model'), 'wb') as f:
    pickle.dump(chroma_model, f)

##### scoring

Uses the mean accuracy, i.e. the fraction of test samples that are classified correctly. See [this link](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score) on the `.score` method for multiclass classification.

In [None]:
chroma_model.score(X_chroma_test, y_chroma_test)
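
The `.score` method of a scikit-learn classifier reports plain accuracy over all test samples, which is not the same as averaging the accuracy of each class when classes are imbalanced. A small pure-Python illustration with made-up labels (the helper names are hypothetical, not part of scikit-learn):

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    # Fraction of samples whose predicted label matches the true label.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def mean_per_class_accuracy(y_true, y_pred):
    # Accuracy computed within each true class, then averaged over classes.
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Class 0 has four samples (all correct), class 1 has two (one correct).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

overall = accuracy(y_true, y_pred)
per_class = mean_per_class_accuracy(y_true, y_pred)
print(overall, per_class)
```

With a stratified split the class proportions in the test set follow the full data set, so the two numbers can still diverge if some labels are much rarer than others.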

### mfcc

##### train-test split

In [None]:
# load all features from processed files
train_mfcc = pd.concat([pd.read_pickle('data/interim/{}-1-sec/{}.pkl'.format('mfcc', name)) for name in tags])

df_train_mfcc = (
    train_mfcc
    .merge(file_labels, on=['name'], how='inner')
)

X_mfcc_train, X_mfcc_test, y_mfcc_train, y_mfcc_test = model_selection.train_test_split(
    df_train_mfcc.drop(['name', 'label', 'label_index'], axis=1).values,
    df_train_mfcc['label_index'].values,
    test_size=0.25, 
    random_state=707,
    stratify=df_train_mfcc['label_index'].values
)

Use logistic regression on the multiclass problem. I changed the tolerance from 0.0001 to 0.001 because the former was taking forever (> 30 minutes) to train.

In [None]:
mfcc_model = linear_model.LogisticRegression(tol=0.001, random_state=123) 

In [None]:
%%time
mfcc_model.fit(X_mfcc_train, y_mfcc_train)

Pickle the trained model

In [None]:
with open('data/interim/{}-1-sec/{}.pkl'.format('mfcc', 'logreg-model'), 'wb') as f:
    pickle.dump(mfcc_model, f)

##### scoring

See [This link](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score) on the `.score` method for multiclass classification.

In [None]:
mfcc_model.score(X_mfcc_test, y_mfcc_test)

### Stack the two models up

Turn the output from model 1 and 2 into probability predictions.

I used log(probability) instead of raw class probability for no particular reason. 

##### read the trained models

In [None]:
with open('data/interim/{}-1-sec/{}.pkl'.format('chroma_stft', 'logreg-model'), 'rb') as f:
    chroma_model = pickle.load(f)

with open('data/interim/{}-1-sec/{}.pkl'.format('mfcc', 'logreg-model'), 'rb') as f:
    mfcc_model = pickle.load(f)

Note that y_chroma_train and y_mfcc_train are the same, because both splits use the same `random_state` and stratify on the same labels. I really only need to split once in the preprocessing step.

In [None]:
np.array_equal(y_chroma_train, y_mfcc_train)
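
A sketch of how a single split could look: `train_test_split` accepts several arrays at once and splits them with the same row indices. The arrays here are random stand-ins, not the real features, and the sketch assumes both feature tables list the files in the same row order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n_samples = 40
X_a = rng.rand(n_samples, 3)        # stand-in for the chroma_stft features
X_b = rng.rand(n_samples, 5)        # stand-in for the mfcc features
y = np.repeat([0, 1, 2, 3], 10)     # four balanced toy classes

# One call splits all three arrays consistently.
X_a_train, X_a_test, X_b_train, X_b_test, y_train, y_test = train_test_split(
    X_a, X_b, y, test_size=0.25, random_state=707, stratify=y
)

print(X_a_train.shape, X_b_train.shape, y_train.shape)
```

This leaves a single `y_train` and `y_test`, so there is no need to check afterwards that the two label arrays agree.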

#### Use predictions from both models as new features

Because there are 41 classes, each model gives me 41 features. Taking both will give me 41 + 41 = 82 features to train the stacked model.

#### Model 1

In [None]:
X_feature1_train = chroma_model.predict_log_proba(X_chroma_train)
X_feature1_test = chroma_model.predict_log_proba(X_chroma_test)

#### Model 2

In [None]:
X_feature2_train = mfcc_model.predict_log_proba(X_mfcc_train)
X_feature2_test = mfcc_model.predict_log_proba(X_mfcc_test)

In [None]:
display(X_feature1_train.shape)
display(X_feature2_train.shape)

#### Combine the outputs from model 1 and 2

Concatenate the two feature matrices column-wise (`axis=1`), so each row keeps its sample and gains the columns from both models.

In [None]:
X_train = np.concatenate([X_feature1_train, X_feature2_train], axis=1)
X_test = np.concatenate([X_feature1_test, X_feature2_test], axis=1)
display(X_train.shape)
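
A toy check of the expected shape, using dummy arrays in place of the real model outputs (the sample count of 5 is arbitrary):

```python
import numpy as np

n_samples, n_classes = 5, 41
feature1 = np.zeros((n_samples, n_classes))   # stand-in for model 1 output
feature2 = np.ones((n_samples, n_classes))    # stand-in for model 2 output

# axis=1 joins columns: rows stay aligned per sample, columns add up to 41 + 41.
stacked = np.concatenate([feature1, feature2], axis=1)
print(stacked.shape)
```

The first 41 columns come from model 1 and the last 41 from model 2, which only makes sense if both inputs list the samples in the same order.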

##### train stacked model

In [None]:
stacked_model = linear_model.LogisticRegression(random_state=123)

In [None]:
%%time
stacked_model.fit(X_train, y_mfcc_train)

In [None]:
stacked_model.score(X_test, y_mfcc_test)
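
One caveat with the workflow above: the stacked model is trained on log-probabilities that the base models produced for the same rows they were fit on, so those features are likely more confident than what the base models produce on unseen data. A common variant builds the training features from out-of-fold predictions instead. The sketch below shows that variant on synthetic data. The data set, the way the columns are divided into two "feature sets", and `max_iter=1000` are all assumptions made for illustration, and it is not what this notebook ran.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic 3-class data; the two column halves play the role of two feature sets.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_a, X_b = X[:, :10], X[:, 10:]
X_a_tr, X_a_te, X_b_tr, X_b_te, y_tr, y_te = train_test_split(
    X_a, X_b, y, test_size=0.25, random_state=707, stratify=y)

model_a = LogisticRegression(max_iter=1000)
model_b = LogisticRegression(max_iter=1000)

# Out-of-fold log-probabilities: each training row is predicted by a model
# that did not see that row during fitting.
oof_a = cross_val_predict(model_a, X_a_tr, y_tr, cv=5, method='predict_log_proba')
oof_b = cross_val_predict(model_b, X_b_tr, y_tr, cv=5, method='predict_log_proba')
meta_train = np.concatenate([oof_a, oof_b], axis=1)

# Refit the base models on the full training split to build the test features.
model_a.fit(X_a_tr, y_tr)
model_b.fit(X_b_tr, y_tr)
meta_test = np.concatenate(
    [model_a.predict_log_proba(X_a_te), model_b.predict_log_proba(X_b_te)], axis=1)

meta_model = LogisticRegression(max_iter=1000).fit(meta_train, y_tr)
stacked_accuracy = meta_model.score(meta_test, y_te)
print(meta_train.shape, meta_test.shape, stacked_accuracy)
```

Whether this changes the accuracy on the audio data is something I have not measured. It mainly makes the training features for the stacked model look more like the test features.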

### Model performance with mean accuracy

* only chroma_stft features: 16.6%
* only mfcc features: 28.0%
* Stacked logistic regression model using output from chroma_stft and mfcc: 30.4%