# MCT4052 Workshop 6d: Pipelines of Transforms

*Author: Stefano Fasciani, stefano.fasciani@imv.uio.no, Department of Musicology, University of Oslo.*

In this example we introduce scikit-learn transformation [pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html), which, besides being a useful tool for writing more compact and less error-prone ML programs, are also essential for cross-validation and extensive grid search (introduced in the following notebooks).

Most models/objects available in scikit-learn (related to ML and associated utilities) provide the methods .fit(x) and .transform(x) or .predict(x). Many also provide the method .fit_transform(x) (usually not used in these notebooks). In most ML examples analyzed so far we performed a sequence of transformations using the .transform(x)/.predict(x) methods of objects previously trained with the method .fit(x). When training, we always fit every object on the same split of the dataset (i.e. the training split). For example, a common sequence we used is scaler->dimensionality_reduction->classifier. Sequences can include more or fewer transformation stages (not necessarily three of them).
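
The manual sequence described above can be sketched as follows. This is only an illustration: the data is synthetic (generated with make_classification as a stand-in for the audio features), and the names X_train, X_test, scaler, pca and mlp are hypothetical, not taken from the previous notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

# synthetic stand-in for the audio features (assumption: 60 samples, 20 features, 3 classes)
X, y = make_classification(n_samples=60, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=17)

scaler = StandardScaler()
pca = PCA(n_components=10)
mlp = MLPClassifier(hidden_layer_sizes=(5, 5), max_iter=10000, random_state=0)

# every stage is fitted on the training split only,
# and its output becomes the input of the next stage
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
pca.fit(X_train_scaled)
X_train_reduced = pca.transform(X_train_scaled)
mlp.fit(X_train_reduced, y_train)

# at inference time the fitted stages are applied in the same order
X_test_reduced = pca.transform(scaler.transform(X_test))
y_pred = mlp.predict(X_test_reduced)
print(y_pred.shape)
```

Each stage needs its own fit and transform calls, and the test data has to go through exactly the same stages in the same order, which is where mistakes tend to creep in.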

A pipeline creates one macro object that embeds all the transformations, and that can be trained (i.e. fit) and used for inference (i.e. transform and/or predict) in a single line of code. All stages except the last one must be transformers (i.e. provide .transform(x)), while the last stage can be any estimator. In this example we use the pipeline in a supervised machine learning task, but it can also be used for unsupervised tasks, as well as for any chain/sequence of scikit-learn models.
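
As a minimal sketch of the unsupervised case, the pipeline below chains a scaler, a PCA and a K-means clustering stage. The data is synthetic (generated with make_blobs), and the choice of K-means, the number of clusters and the name cluster_pipe are assumptions made only for this illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# synthetic, unlabeled data (assumption: 90 samples, 8 features, 3 groups)
X, _ = make_blobs(n_samples=90, n_features=8, centers=3, random_state=0)

cluster_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('dim_red', PCA(n_components=2)),
    ('clustering', KMeans(n_clusters=3, n_init=10, random_state=0)),
])

# no labels are passed: all stages are fitted on X only
cluster_pipe.fit(X)

# predict() scales, reduces and then assigns each sample to a cluster
cluster_ids = cluster_pipe.predict(X)
print(cluster_ids[:10])
```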


In [1]:
import numpy as np
import pandas as pd
import librosa, librosa.display
import sklearn.metrics  #explicitly loads the metrics submodule used later as sklearn.metrics
import os

In [2]:
#loading files and extracting features
metadata = pd.read_csv('./data/examples4/meta.csv')
classes = list(metadata.label.unique())
print('There are',len(classes),'different classes:',classes)

sr = 22050

def extract_features(filename, sr):
    #keyword arguments are required by recent versions of librosa
    signal, dummy = librosa.load(filename, sr=sr, mono=True)
    output = np.mean(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20), axis=1)
    return output

print('number of files in database',len(metadata.index))
features = np.zeros((len(metadata.index),20))
labels = np.zeros((len(metadata.index)))

for i, row in metadata.iterrows():
    features[i,:] = extract_features('./data/examples4/'+row['filename'], sr)
    labels[i] = (classes.index(row['label']))

print('Done!')

There are 5 different classes: ['cello', 'guitar', 'clarinet', 'flute', 'harmonica']
number of files in database 60
Done!


In [3]:
from sklearn.model_selection import train_test_split

#splitting the dataset into training and testing parts
feat_train, feat_test, lab_train, lab_test = train_test_split(features, labels, test_size=0.2, random_state=17)

### 1. Creating the pipeline and initializing the parameters of the inner objects

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('dim_red', PCA(n_components = 10)),
        ('classifier', MLPClassifier(hidden_layer_sizes=(5,5), max_iter=10000, activation='relu'))
        ])
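
The parameters of the inner objects can also be inspected or changed after the pipeline has been created. The sketch below rebuilds the same pipeline and uses the step names as prefixes (step name, two underscores, parameter name). The new values (5 components, one hidden layer of 8 units) are arbitrary and chosen only for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('dim_red', PCA(n_components = 10)),
        ('classifier', MLPClassifier(hidden_layer_sizes=(5,5), max_iter=10000, activation='relu'))
        ])

# each inner object is reachable through the name given in the pipeline
print(pipe.named_steps['dim_red'])

# parameters of inner objects are addressed as <step name>__<parameter name>
pipe.set_params(dim_red__n_components=5, classifier__hidden_layer_sizes=(8,))

# get_params() lists the parameters of all stages using the same naming scheme
print(pipe.get_params()['dim_red__n_components'])
print(pipe.get_params()['classifier__hidden_layer_sizes'])
```

The same naming scheme is what a grid search uses to refer to the parameters of the individual stages.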

### 2. Training and using the pipeline

In [5]:
#training the pipeline
pipe.fit(feat_train, lab_train)

#applying the trained pipeline
lab_predict = pipe.predict(feat_test)


#print the number of misclassified samples and the accuracy (using scikit-learn metric tools)
print('Number of mislabeled samples %d out of %d' % ((lab_test != lab_predict).sum(),lab_test.size))
print('Accuracy:',sklearn.metrics.accuracy_score(lab_test, lab_predict))

Number of mislabeled samples 4 out of 12
Accuracy: 0.6666666666666666
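
A trained pipeline offers a few more conveniences, sketched below on synthetic data (generated with make_classification, not the audio dataset used above; the random_state values are arbitrary). The method .score(x, y) returns the accuracy directly for a classifier, a full per-class report can be printed with classification_report, and slicing the pipeline gives access to the output of the intermediate stages.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report

# synthetic stand-in for the audio features (assumption: 60 samples, 20 features, 3 classes)
X, y = make_classification(n_samples=60, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=17)

pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('dim_red', PCA(n_components=10)),
        ('classifier', MLPClassifier(hidden_layer_sizes=(5,5), max_iter=10000, random_state=0))
        ])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

# score() runs predict() internally and returns the mean accuracy
acc_from_score = pipe.score(X_test, y_test)
acc_from_metric = accuracy_score(y_test, y_pred)
print(acc_from_score, acc_from_metric)

# per-class precision, recall and F1 (zero_division=0 avoids warnings for empty classes)
print(classification_report(y_test, y_pred, zero_division=0))

# pipe[:-1] is a sub-pipeline with all stages but the classifier (already fitted)
reduced_test = pipe[:-1].transform(X_test)
print(reduced_test.shape)
```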


### 3. Follow up activity

Use a transformation pipeline with one of the ML applications you previously worked on, using your own dataset. The transformation pipeline must include at least 3 components. Verify that the results are the same as when not using the transformation pipeline (use the same random state in all components, including those with random initialization).
