# MCT4052 Workshop 6f: Repeated K-Fold Cross Validation

*Author: Stefano Fasciani, stefano.fasciani@imv.uio.no, Department of Musicology, University of Oslo.*

In this notebook we use [RepeatedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedKFold.html) and [RepeatedStratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html) to further improve the estimate of model performance obtained via cross validation. Repeated K-Fold runs the whole k-fold procedure several times, each time with a different random partition of the data, and collects the scores from every repetition. Averaging over the repetitions reduces the likelihood that the estimate is driven by one particularly favorable (or unfavorable) split, at the cost of a proportionally larger number of model fits.
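
As a minimal sketch of what the splitter does, the snippet below runs `RepeatedKFold` on ten toy samples (the toy data, `n_repeats=3`, and `random_state=0` are illustrative assumptions, not values used later in this notebook). It shows that one train/test pair is produced per fold and per repetition, and that the test folds of a single repetition cover every sample exactly once.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Ten toy samples; only the number of rows matters to the splitter.
X = np.arange(10).reshape(-1, 1)

n_splits, n_repeats = 5, 3
rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
splits = list(rkf.split(X))

# One (train, test) pair per fold and per repetition: 5 * 3 = 15.
n_total_splits = len(splits)
print('total splits:', n_total_splits)

# Within one repetition the test folds together cover every sample once.
first_repeat_test = np.sort(np.concatenate([test for _, test in splits[:n_splits]]))
print('samples tested in first repetition:', first_repeat_test)

# Show the test folds of each repetition to compare the random partitions.
for r in range(n_repeats):
    folds = [test.tolist() for _, test in splits[r * n_splits:(r + 1) * n_splits]]
    print('repetition', r, folds)
```

Each repetition reshuffles the data before splitting, so the printed partitions will generally differ from one repetition to the next.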

Stratification is the process of arranging the data so that each fold is a good representation of the whole set. For example, in a binary classification problem where each class comprises 50% of the data, it is best to arrange the data such that in every fold each class comprises around half the instances. The repeated stratified k-fold takes care of this. Note that stratification only makes sense for, and can only be applied to, classification problems, while the repeated k-fold object works with both classification and regression problems.
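
The effect of stratification can be sketched on toy labels. The example below assumes a balanced binary problem whose labels happen to be sorted by class (a hypothetical worst case), and compares the share of class-1 samples in each test fold for the plain, unshuffled `KFold` and for `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy binary labels, 50% per class, sorted by class (illustrative assumption).
y = np.array([0] * 10 + [1] * 10)
X = np.zeros((len(y), 1))  # feature values are irrelevant for splitting

def class1_share_per_fold(splitter):
    """Return the fraction of class-1 samples in each test fold."""
    return [float(np.mean(y[test])) for _, test in splitter.split(X, y)]

plain = class1_share_per_fold(KFold(n_splits=5))
stratified = class1_share_per_fold(StratifiedKFold(n_splits=5))

print('KFold:          ', plain)
print('StratifiedKFold:', stratified)
```

With the plain splitter some test folds contain a single class only, while every stratified fold keeps the 50/50 proportion of the full set.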

In this notebook we demonstrate how to apply repeated k-fold cross validation to a regression task, and how to apply stratified repeated k-fold to a classification task.


In [1]:
import numpy as np
import pandas as pd
import librosa
import sklearn
import os

### 1. Regression task with repeated k-fold cross validation

In [2]:
#loading files and computing features
sr = 22050

def extract_features_target(filename, sr):
    
    #recent librosa versions require keyword arguments for sr and for the input signal (y)
    signal, dummy = librosa.load(filename, sr=sr, mono=True)
    output = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=25)
    output = output.flatten()
    
    #preparing the output array
    target = np.zeros((1,2))
    target[0,0] = np.mean(librosa.feature.rms(y=signal))
    target[0,1] = np.mean(librosa.feature.spectral_flatness(y=signal))
    
    return output, target

filenames = os.listdir('./data/examples3')
#4325 = 25 mel bands x 173 frames; this requires all files to have the same length
features = np.zeros((len(filenames),4325))
target = np.zeros((len(filenames),2))

for i in range(len(filenames)):
    features[i,:], target[i,:] = extract_features_target('./data/examples3/'+filenames[i], sr)

print('Done!')

Done!
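
The feature vector length of 4325 used above follows from the mel spectrogram shape. The arithmetic below is a sketch under two assumptions that the notebook does not state: that every clip lasts 4 seconds, and that the spectrogram uses librosa's default `hop_length=512` with centered framing. Under those assumptions the flattened spectrogram has exactly 4325 values.

```python
# Assumptions (not stated in the notebook): 4-second clips and the
# librosa defaults hop_length=512, center=True for the mel spectrogram.
sr = 22050
duration_s = 4        # hypothetical clip duration
hop_length = 512      # librosa default
n_mels = 25           # as in extract_features_target

n_samples = sr * duration_s
n_frames = 1 + n_samples // hop_length   # frame count with centered framing
n_features = n_mels * n_frames           # length of the flattened spectrogram

print(n_frames, n_features)  # 173 4325
```

If the files in your own database have a different duration, the hardcoded 4325 has to be changed accordingly, or the signals padded or trimmed to a common length.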


In [3]:
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPRegressor
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold


#creating pipeline
pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('dim_red', PCA(n_components = 5)),
        ('regressor', MLPRegressor(hidden_layer_sizes=(12,8,4), max_iter=2000, activation='tanh'))
        ])

#creating the repeated k-fold; pass random_state=<int> to get repeatable results
#with n_splits=5 we partition the data into 5 folds of 20% each; in turn, 4 are used for training and 1 for testing
#n_repeats indicates how many times the k-fold is repeated, each time with a different random partition
rkf = RepeatedKFold(n_splits=5, n_repeats=10)

#running the cross validation with the pipeline, features, target, k-fold object, and scoring metrics
scores = sklearn.model_selection.cross_validate(pipe, features, target, cv=rkf, scoring=('r2', 'neg_mean_squared_error'),return_train_score=True)

print(scores,'\n')

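#scikit-learn returns the negated MSE (so that higher is better); we flip the sign to report the MSE
print('MSE mean and variance', np.mean(-scores['test_neg_mean_squared_error']),np.var(scores['test_neg_mean_squared_error']),'\n')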
print('R2 mean and variance', np.mean(scores['test_r2']),np.var(scores['test_r2']),'\n')


{'fit_time': array([0.12645388, 0.1211071 , 0.11878514, 0.10641885, 0.12671471,
       0.08146286, 0.11984587, 0.09941983, 0.07951522, 0.18826294,
       0.12632489, 0.10452104, 0.11791205, 0.18685389, 0.13120508,
       0.09590983, 0.16439605, 0.12469101, 0.11248994, 0.13349414,
       0.12301397, 0.14514089, 0.06087685, 0.13959289, 0.16463423,
       0.199543  , 0.13410878, 0.13277793, 0.09713197, 0.12380385,
       0.17587876, 0.11027408, 0.16299987, 0.1875391 , 0.11559296,
       0.06928205, 0.1168251 , 0.09419298, 0.06376886, 0.15411496,
       0.1269381 , 0.14302468, 0.15886617, 0.08781505, 0.14927483,
       0.13979316, 0.11556721, 0.13158011, 0.09422326, 0.08842397]), 'score_time': array([0.00260997, 0.0020988 , 0.0022409 , 0.00231791, 0.00227928,
       0.00189209, 0.00235796, 0.00205803, 0.00188375, 0.00220108,
       0.00269198, 0.002213  , 0.00227094, 0.00201511, 0.00203991,
       0.00228286, 0.00190282, 0.002491  , 0.00233221, 0.00223088,
       0.00226688, 0.00193095, 0.
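
The score arrays returned by `cross_validate` contain one entry per fold and per repetition (5 x 10 = 50 here), ordered repetition by repetition. A possible way to summarize them is sketched below. Synthetic data from `make_regression` and a `Ridge` model are stand-ins chosen only to keep the example fast and self-contained; they are assumptions, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_validate

# Synthetic stand-in data and a simple stand-in model (both assumptions).
X, y = make_regression(n_samples=60, n_features=8, noise=5.0, random_state=0)

n_splits, n_repeats = 5, 10
rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=0)
scores = cross_validate(Ridge(), X, y, cv=rkf,
                        scoring=('r2', 'neg_mean_squared_error'))

# The scorer returns the negated MSE; flip the sign to recover the MSE.
mse = -scores['test_neg_mean_squared_error']

# Scores are ordered repetition by repetition, so each row of the reshaped
# array holds the 5 fold scores of one complete k-fold run.
mse_per_repeat = mse.reshape(n_repeats, n_splits).mean(axis=1)

print('MSE per repetition:', np.round(mse_per_repeat, 2))
print('overall MSE mean:', mse.mean())
print('variance across repetition means:', mse_per_repeat.var())
```

The spread of the per-repetition means gives an idea of how much a single k-fold estimate depends on the particular random partition.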

### 3. Classification task with stratified repeated k-fold cross validation

In [4]:
#loading files and extracting features
metadata = pd.read_csv('./data/examples4/meta.csv')
classes = list(metadata.label.unique())
print('There are',len(classes),'different classes:',classes)

sr = 22050

def extract_features(filename, sr):
    #recent librosa versions require keyword arguments for sr and for the input signal (y)
    signal, dummy = librosa.load(filename, sr=sr, mono=True)
    output = np.mean(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20), axis=1)
    return output

print('number of files in database',len(metadata.index))
features = np.zeros((len(metadata.index),20))
labels = np.zeros((len(metadata.index)))

for i, row in metadata.iterrows():
    features[i,:] = extract_features('./data/examples4/'+row['filename'], sr)
    labels[i] = (classes.index(row['label']))

print('Done!')

There are 5 different classes: ['cello', 'guitar', 'clarinet', 'flute', 'harmonica']
number of files in database 60
Done!


In [5]:
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold

#creating pipeline
pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('dim_red', PCA(n_components = 10)),
        ('classifier', MLPClassifier(hidden_layer_sizes=(20,5), max_iter=10000, activation='relu'))
        ])

#creating the repeated stratified k-fold; pass random_state=<int> to get repeatable results
#with n_splits=5 we partition the data into 5 folds of 20% each; in turn, 4 are used for training and 1 for testing
#n_repeats indicates how many times the k-fold is repeated, each time with a different random partition
rkf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10)

#running the cross validation with the pipeline, features, labels, k-fold object, and scoring metrics
scores = sklearn.model_selection.cross_validate(pipe, features, labels, cv=rkf, scoring=('f1_macro', 'accuracy'),return_train_score=True)

print(scores,'\n')
print('Accuracy mean and variance', np.mean(scores['test_accuracy']),np.var(scores['test_accuracy']),'\n')
print('F1 macro mean and variance', np.mean(scores['test_f1_macro']),np.var(scores['test_f1_macro']),'\n')


{'fit_time': array([0.29833293, 0.27563024, 0.38118505, 0.42341018, 0.35588098,
       0.32108092, 0.3315239 , 0.32617021, 0.27452707, 0.32143998,
       0.39589214, 0.52560091, 0.67980599, 0.30052376, 0.31691718,
       0.32405996, 0.72586393, 0.27688193, 0.33958507, 0.30991411,
       0.25682187, 0.26307392, 0.50258517, 0.35768199, 0.29305291,
       0.26679111, 0.27996802, 0.37450218, 0.272686  , 0.32807922,
       0.29201889, 0.31204128, 0.32975888, 0.32366419, 0.47866702,
       0.27372217, 0.28692913, 0.283777  , 0.39601612, 0.31957984,
       0.33203983, 0.30442905, 0.36724305, 0.74394274, 0.45512581,
       0.37756014, 0.35779095, 0.33095288, 0.30791903, 0.31149197]), 'score_time': array([0.00137997, 0.00119972, 0.00134492, 0.00121808, 0.00119805,
       0.00124216, 0.00134325, 0.00119185, 0.00145197, 0.0012362 ,
       0.00120783, 0.00237298, 0.00124907, 0.00124002, 0.00126195,
       0.00120425, 0.00124598, 0.00186229, 0.00126219, 0.00155973,
       0.00121999, 0.00124288, 0.
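
The comments in the cell above mention `random_state` without setting it. The sketch below illustrates its effect on `RepeatedStratifiedKFold` using toy labels: 5 classes with 12 samples each, which is an assumed balanced stand-in for the 60 files (the actual class balance of the database is not shown here). The helper `list_test_folds` and the seed value 42 are hypothetical.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy labels: 5 classes x 12 samples (assumed balanced stand-in for the 60 files).
y = np.repeat(np.arange(5), 12)
X = np.zeros((len(y), 1))  # feature values are irrelevant for splitting

def list_test_folds(seed):
    """Return the test indices of every split for a given random_state."""
    rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=seed)
    return [test.tolist() for _, test in rskf.split(X, y)]

# The same seed reproduces exactly the same sequence of partitions.
folds = list_test_folds(42)
same_seed_matches = folds == list_test_folds(42)
print('same partitions with same seed:', same_seed_matches)

# Per-class sample counts in each test fold: stratification keeps them even.
counts = [np.bincount(y[test], minlength=5).tolist() for test in folds]
for c in counts:
    print(c)
```

Since 12 samples per class cannot be divided evenly into 5 folds, each test fold holds 2 or 3 samples of every class. Fixing `random_state` makes the partitions repeatable, although the scores can still vary between runs unless the model's own random initialization is fixed as well.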

### 4. Follow up activity

Apply repeated k-fold and repeated stratified k-fold cross validation to the regression and classification ML applications you previously developed, using both of your databases. It is recommended to use pipelines.