# MCT4052 Workshop 6g: Parameter Estimation with Grid Search

*Author: Stefano Fasciani, stefano.fasciani@imv.uio.no, Department of Musicology, University of Oslo.*

Most ML models expose a variety of parameters that can be tuned to improve performance. The [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) object systematically explores a large space of parameter combinations for supervised machine learning (i.e. classification and regression tasks, where we have a well-defined performance metric).

Grid search works best when combined with a pipeline and with repeated k-fold cross validation, which gives a more reliable assessment of the performance of the overall ML system. Mind that grid search can be very time consuming, especially when combined with repeated k-fold. Therefore, it is advisable to perform multiple small searches that progressively focus on the parameter subspace most likely to provide the best performance. After obtaining the first result, you can narrow the search down to a smaller, more finely sampled parameter range.
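The progressive narrowing described above can be sketched as a two-stage search. This is a minimal illustration and not part of the workshop code: the synthetic data from `make_classification`, the RBF-only SVM, and the specific value ranges are all assumptions chosen to keep the example small and fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: used here only for illustration)
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Stage 1: coarse search spanning several orders of magnitude
coarse_grid = {'C': [0.01, 1, 100], 'gamma': [1e-4, 1e-2, 1]}
coarse = GridSearchCV(SVC(kernel='rbf'), coarse_grid, scoring='f1_macro', cv=5)
coarse.fit(X, y)
best_C = coarse.best_params_['C']
best_gamma = coarse.best_params_['gamma']

# Stage 2: finer search, one decade on each side of the coarse optimum
# (the middle value of each range is the coarse optimum itself)
fine_grid = {
    'C': list(best_C * np.logspace(-1, 1, 5)),
    'gamma': list(best_gamma * np.logspace(-1, 1, 5)),
}
fine = GridSearchCV(SVC(kernel='rbf'), fine_grid, scoring='f1_macro', cv=5)
fine.fit(X, y)

print('coarse search:', coarse.best_params_, coarse.best_score_)
print('fine search:  ', fine.best_params_, fine.best_score_)
```

Two searches of 9 and 25 combinations cover a range that a single equally fine grid would need far more combinations to sample.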

In this notebook, we first apply grid search to a classifier only, then we repeat the process on an entire pipeline. Generally, most tunable parameters belong to the classifier (or regressor).

The image below illustrates how cross validation and grid search are used to find the best parameters, which can later be used to deploy an ML-based system for real-world applications.

<img src="https://scikit-learn.org/stable/_images/grid_search_workflow.png" alt="drawing" style="width:600px;"/>
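The workflow in the image (hold out a test set, search parameters with cross validation on the training set only, then evaluate the retrained best model once on the test set) might look as follows in code. This is a hedged sketch: the synthetic data, the 25% test split and the small SVM grid are assumptions for illustration, and the cells in this notebook run the search on the full dataset instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: used here only for illustration)
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Hold out a test set that the grid search never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Parameter search with 5-fold cross validation on the training set only
search = GridSearchCV(SVC(),
                      {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']},
                      scoring='f1_macro',
                      cv=5)
search.fit(X_train, y_train)

# With the default refit=True, the best parameter set is retrained on the
# whole training set, and predict() uses that retrained model
y_pred = search.predict(X_test)
test_f1 = f1_score(y_test, y_pred, average='macro')

print('best parameters:', search.best_params_)
print('cross-validation f1_macro:', search.best_score_)
print('held-out test f1_macro:', test_f1)
```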

In [1]:
import numpy as np
import pandas as pd
import librosa
import sklearn
import os

In [2]:
sr = 22050

def extract_features(filename, sr):
    signal, dummy = librosa.load(filename, sr=sr, mono=True)
    output = np.mean(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20), axis=1)
    return output


filenames = os.listdir('./data/examples2')
features = np.zeros((len(filenames),20))
labels = np.zeros((len(filenames))) 
classes = ['kick','snare','cymbal','clap']

for i in range(len(filenames)):
    features[i,:] = extract_features('./data/examples2/'+filenames[i], sr=sr)
    if filenames[i].find('kick') != -1:
        labels[i] = 0
    elif filenames[i].find('snare') != -1:
        labels[i] = 1
    elif filenames[i].find('cymbal') != -1:
        labels[i] = 2
    elif filenames[i].find('clap') != -1:
        labels[i] = 3

print('Done!')

Done!


### 1. Grid search with cross validation and repeated stratified k-fold on classifier

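Mind that this code will train the SVM classifier 18000 times: the grid contains 4 kernels x 3 gamma values x 3 C values = 36 parameter combinations, and each combination is evaluated on 5 folds x 100 repeats = 500 splits. It will take some time, but not too much given the small size of the dataset.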

In [3]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import SVC

#creating classifier without parameters
svm = SVC()

#creating the repeated stratified k-folds
#this is not a must, we can do grid search with a simple k-fold
#cross validation by setting cv= to a number in the GridSearchCV constructor
rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=100)

#defining the parameters range to explore
grid_param = {
    'kernel': ['rbf', 'poly', 'sigmoid', 'linear'],
    'gamma': [1e-2, 1e-3, 1e-4],
    'C': [0.1, 1, 10]
}

gd_sr = GridSearchCV(estimator=svm,
                     param_grid=grid_param,
                     scoring='f1_macro', #this can be changed to accuracy, f1_micro, etc. or to another classification metric
                     cv=rkf, #if you do not want repeated k-fold, you can set cv=5 to test on just 5 different splits
                     n_jobs=-1) #-1 uses all available CPU cores

gd_sr.fit(features, labels) #performing the search

print('best set of parameters', gd_sr.best_params_)
print('associated best score',gd_sr.best_score_)

best set of parameters {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
associated best score 0.8774587164344382
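The count of 18000 fits quoted for this search can be checked with a few lines of plain Python. The helper names `count_combinations` and `count_cv_fits` are hypothetical and not part of scikit-learn. The sketch also shows a second grid, written as a list of dictionaries (a form `GridSearchCV` accepts for `param_grid`), which avoids repeating the linear kernel for every `gamma` value, since the linear kernel ignores `gamma`.

```python
from math import prod


def count_combinations(param_grid):
    """Number of parameter combinations in a grid (a dict or a list of dicts)."""
    grids = [param_grid] if isinstance(param_grid, dict) else param_grid
    return sum(prod(len(values) for values in grid.values()) for grid in grids)


def count_cv_fits(param_grid, n_splits, n_repeats=1):
    """Number of cross-validation fits (excludes the single final refit)."""
    return count_combinations(param_grid) * n_splits * n_repeats


# The grid used in the cell above
full_grid = {
    'kernel': ['rbf', 'poly', 'sigmoid', 'linear'],
    'gamma': [1e-2, 1e-3, 1e-4],
    'C': [0.1, 1, 10],
}

# Alternative: gamma is searched only for the kernels that use it
split_grid = [
    {'kernel': ['rbf', 'poly', 'sigmoid'],
     'gamma': [1e-2, 1e-3, 1e-4],
     'C': [0.1, 1, 10]},
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
]

print(count_cv_fits(full_grid, n_splits=5, n_repeats=100))   # 18000
print(count_cv_fits(split_grid, n_splits=5, n_repeats=100))  # 15000
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best combination on the whole dataset after the search.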


### 2. Grid search with cross validation and repeated stratified k-fold on a complete pipeline

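Mind that this code will train the pipeline (Scaler-LDA-ANN) classifier 800 times: the grid contains 2 x 2 x 2 = 8 parameter combinations, and each combination is evaluated on 5 folds x 20 repeats = 100 splits. It will take some time, but not too much given the small size of the dataset.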

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold


#creating pipeline
#note that here we do not initialize the parameters we want to tune/change
#the parameters that we decide to initialize here will be fixed across the grid search
#we also need to keep track of the names (between the quotes)
#that we selected for the different components of the pipeline
#the names are needed when creating the grid of parameters

pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('dim_red', LinearDiscriminantAnalysis()),
        ('classifier', MLPClassifier(max_iter=10000))
        ])

#n_components = 10
#hidden_layer_sizes=(20,5), max_iter=10000, activation='relu'


#creating the repeated stratified k-folds
#this is not a must, we can do grid search with a simple k-fold
#cross validation by setting cv= to a number in the GridSearchCV constructor
rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=20)


#defining the parameters range to explore
#the name of each parameter is preceded by the name of the component
#in the pipeline, followed by two underscores
#if you have trouble identifying the correct name, print all parameters and their
#names with the following commented line
#print(pipe.get_params().keys())
grid_param = {
    'dim_red__n_components': [3, 2],
    'classifier__hidden_layer_sizes': [(20,5), (6,5,4)],
    'classifier__activation': ['tanh', 'relu']
}

gd_sr = GridSearchCV(estimator=pipe,
                     param_grid=grid_param,
                     scoring='f1_macro', #this can be changed to accuracy, f1_micro, etc. or to another classification metric
                     cv=rkf, #if you do not want repeated k-fold, you can set cv=5 to test on just 5 different splits
                     n_jobs=-1) #-1 uses all available CPU cores

gd_sr.fit(features, labels) #performing the search

print('best set of parameters', gd_sr.best_params_)
print('associated best score',gd_sr.best_score_)

best set of parameters {'classifier__activation': 'tanh', 'classifier__hidden_layer_sizes': (20, 5), 'dim_red__n_components': 3}
associated best score 0.876205202696714
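Beyond the single best score, the fitted search object stores the score of every combination in its `cv_results_` attribute, which helps decide where to narrow the next search. The sketch below is an illustration under assumptions: it uses synthetic data, plain 3-fold cross validation and a reduced grid so that it runs quickly, and it also lists the `component__parameter` names obtained from `pipe.get_params()`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: used here only for illustration)
X, y = make_classification(n_samples=120, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('dim_red', LinearDiscriminantAnalysis()),
    ('classifier', MLPClassifier(max_iter=2000, random_state=0)),
])

# Names usable as grid keys follow the pattern <component>__<parameter>
tunable = sorted(name for name in pipe.get_params() if '__' in name)
print(tunable)

# Reduced grid and plain 3-fold CV (assumptions, to keep the example fast)
grid_param = {
    'dim_red__n_components': [3, 2],
    'classifier__activation': ['tanh', 'relu'],
}
search = GridSearchCV(pipe, grid_param, scoring='f1_macro', cv=3)
search.fit(X, y)

# One row per parameter combination, sorted from best to worst
results = pd.DataFrame(search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))
```

Combinations whose mean scores differ by less than their standard deviation are hard to tell apart, which is one reason to use repeated k-fold in the actual search.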


### 3. Follow-up activity

Use grid search to optimize an ML application you previously developed using your own database. Aim to improve the performance you previously obtained. It is recommended to use pipelines. When doing this, estimate how many times your grid search + CV will train and test the ML model (i.e. the pipeline).