# MCT4052 Workshop 6h: Adding Noise to Improve Robustness

*Author: Stefano Fasciani, stefano.fasciani@imv.uio.no, Department of Musicology, University of Oslo.*

When working with a small dataset, overfitting is likely. Adding noise to the training data can help the ML model generalize better, which reduces overfitting and can improve the accuracy on the test set.

This is something we have to do manually, so it is essential to have every aspect of the training process under control. In this example we double the size of the training set by adding Gaussian noise (zero mean, arbitrary variance) to the existing examples. First we do this on a single split, then we repeat it inside a repeated stratified k-fold, where the splits also have to be handled manually. The amount of noise is a parameter we have to tune. If the features are standardized (unit variance), a noise variance of 0.1 corresponds to 10% of the variance of each feature. Mind that np.random.normal() takes the standard deviation (the square root of the variance) as its second argument.
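
The relation between the noise level and the variance of the features can be made explicit with a small helper. The following is a minimal sketch on toy data, not the code used later in this notebook: the function name extend_with_noise and its noise_fraction parameter are hypothetical, and the noise is scaled per feature so that its variance is a chosen fraction of each feature's variance, whether or not the features have been standardized.

```python
import numpy as np

def extend_with_noise(features, labels, noise_fraction=0.1, rng=None):
    """Return the original examples followed by one noisy copy of each.

    noise_fraction (hypothetical parameter) is the noise variance expressed
    as a fraction of the variance of each feature.
    """
    rng = np.random.default_rng() if rng is None else rng
    # per-feature standard deviation, measured on the given examples
    feature_std = features.std(axis=0)
    # rng.normal takes a standard deviation, hence the square root
    noise_std = np.sqrt(noise_fraction) * feature_std
    noise = rng.normal(0.0, noise_std, size=features.shape)
    features_ext = np.concatenate([features, features + noise], axis=0)
    labels_ext = np.concatenate([labels, labels], axis=0)
    return features_ext, labels_ext

# toy data: 500 examples, 3 features with very different scales
rng = np.random.default_rng(0)
X = rng.normal(0.0, [1.0, 10.0, 100.0], size=(500, 3))
y = rng.integers(0, 2, size=500)

X_ext, y_ext = extend_with_noise(X, y, noise_fraction=0.1, rng=rng)

# ratio between the variance of the added noise and the variance of the features
added_noise = X_ext[500:] - X
ratio = added_noise.var(axis=0) / X.var(axis=0)
print(X_ext.shape, y_ext.shape)
print(ratio)  # each entry should be close to 0.1
```

With standardized features feature_std is 1 for every column, and the helper reduces to adding noise with a single standard deviation, as done in the cells of this notebook.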

This is fairly simple for a classification problem: we create new examples by adding a bit of noise to the features and keeping the same label. For a regression problem it is a little more complex, because we also need to create new target values (we can use a previously trained model, as in this [tutorial](https://machinelearningmastery.com/test-time-augmentation-with-scikit-learn/)).
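
For the regression case, one possible reading of "use a previously trained model" is sketched below. This is an assumption about the workflow, not the method of the linked tutorial: a first model is trained on the original examples, its predictions on the noisy copies are used as approximate targets, and a final model is trained on the extended set. The toy data, the Ridge regressor and the 0.15 noise level are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# toy regression data standing in for a real dataset
X, y = make_regression(n_samples=80, n_features=5, noise=5.0, random_state=0)

# 1) train a first model on the original examples only
teacher = Ridge(alpha=1.0).fit(X, y)

# 2) create noisy copies of the features (0.15 is a standard deviation)
X_noisy = X + rng.normal(0.0, 0.15, size=X.shape)

# 3) the noisy copies have no ground-truth target, so the predictions
#    of the first model are used as approximate targets
y_noisy = teacher.predict(X_noisy)

# 4) extended training set: original examples followed by the noisy copies
X_ext = np.concatenate([X, X_noisy], axis=0)
y_ext = np.concatenate([y, y_noisy], axis=0)

# 5) train the final model on the extended set
model = Ridge(alpha=1.0).fit(X_ext, y_ext)
print(X_ext.shape, y_ext.shape)
```

In practice the final model would typically be the more flexible one (the one prone to overfitting), while the first model only needs to provide plausible targets near the original examples.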

Generally it would be better to add noise directly inside the model (e.g. inside the ANN), but with scikit-learn this is not possible. To achieve a better or equivalent result we should perform the training semi-manually, using partial_fit() and changing the noise added to the features at each training iteration.
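
The semi-manual training described above could look like the following minimal sketch. It assumes toy data from make_classification, an arbitrary number of passes and an arbitrary noise level, and it scales the features before adding noise, which is a choice made here and not something prescribed by this notebook. Each call to partial_fit() performs a single pass over the data it receives.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# toy classification data standing in for real features
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# scaling first, so that the noise level is relative to unit-variance features
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

mlp = MLPClassifier(hidden_layer_sizes=(12, 4), activation='tanh',
                    batch_size=32, random_state=0)
all_classes = np.unique(y_train)

n_epochs = 300      # arbitrary number of passes over the training set
noise_std = 0.15    # arbitrary noise level (standard deviation), to be tuned

for epoch in range(n_epochs):
    # fresh noise at every pass: the network never sees exactly the same input twice
    X_noisy = X_train_s + rng.normal(0.0, noise_std, size=X_train_s.shape)
    # the classes argument is required on the first call to partial_fit
    mlp.partial_fit(X_noisy, y_train, classes=all_classes)

accuracy = mlp.score(X_test_s, y_test)
print('Test accuracy:', accuracy)
```

Unlike the doubled training set, the size of the data stays the same here, but the noise is different at every pass.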

Another approach is to add noise directly to the raw data before computing the features, which usually helps to improve the robustness of the ML-based system against noise.
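
For audio, adding noise to the raw data could be sketched as follows. The helper add_white_noise and its snr_db parameter are hypothetical names, and the sine wave is a stand-in for a signal returned by librosa.load(); the noise level is expressed as a signal-to-noise ratio in dB, which is one common convention among several.

```python
import numpy as np

def add_white_noise(signal, snr_db, rng=None):
    """Return a copy of signal with white gaussian noise at the given SNR (dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(signal ** 2)
    # SNR (dB) = 10 * log10(signal_power / noise_power)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# toy signal: one second of a 440 Hz sine at a 22050 Hz sampling rate
sr = 22050
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440.0 * t)

noisy = add_white_noise(clean, snr_db=20.0, rng=np.random.default_rng(0))

# the measured SNR should be close to the requested 20 dB
measured_snr_db = 10.0 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(measured_snr_db)
```

The noisy signal would then go through the same feature extraction as the clean one, and only for the files in the training set.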

For more details on this you can read the following posts:

* [Train Neural Networks With Noise to Reduce Overfitting](https://machinelearningmastery.com/train-neural-networks-with-noise-to-reduce-overfitting/)

* [How to Improve Deep Learning Model Robustness by Adding Noise](https://machinelearningmastery.com/how-to-improve-deep-learning-model-robustness-by-adding-noise/)

* [How to use Noise to your advantage?](https://towardsdatascience.com/how-to-use-noise-to-your-advantage-5301071d9dc3)

* [Test-Time Augmentation For Tabular Data With Scikit-Learn](https://machinelearningmastery.com/test-time-augmentation-with-scikit-learn/)

In [1]:
import numpy as np
import pandas as pd
import librosa
import sklearn.metrics #explicit import of the submodule used later for accuracy_score
import os

In [2]:
#loading files and extracting features
metadata = pd.read_csv('./data/examples4/meta.csv')
classes = list(metadata.label.unique())
print('There are',len(classes),'different classes:',classes)

sr = 22050

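def extract_features(filename, sr):
    #loading the file as mono at the requested sampling rate
    signal, _ = librosa.load(filename, sr=sr, mono=True)
    #averaging each of the 20 MFCCs over time (one vector of 20 values per file)
    output = np.mean(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20), axis=1)
    return output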

print('number of files in database',len(metadata.index))
features = np.zeros((len(metadata.index),20))
labels = np.zeros((len(metadata.index)))

for i, row in metadata.iterrows():
    features[i,:] = extract_features('./data/examples4/'+row['filename'], sr=sr)
    labels[i] = (classes.index(row['label']))

print('Done!')

There are 5 different classes: ['cello', 'guitar', 'clarinet', 'flute', 'harmonica']
number of files in database 60
Done!


In [3]:
from sklearn.pipeline import Pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


#creating pipeline
pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('dim_red', PCA(n_components = 14)),
        ('classifier', MLPClassifier(hidden_layer_sizes=(12,4), max_iter=10000, activation='tanh'))
        ])

### 1. Extending a single split and using it to train the pipeline once

Here we also train the same pipeline on the non-extended dataset to show the difference.
Mind that although we fixed the random split by setting random_state to a fixed integer, the results (accuracy) will change at every execution. In particular, the weights of the ANN are initialized randomly, so retraining the ANN on the same (non-extended) dataset may produce different results. At times the random initialization of the weights is more favorable to the task, at other times less favorable. Moreover, when extending the training set we add noise, which is random by nature, so this too will at times be favorable to the classification task and at times adverse (as we will see in the next section, on average over a long run it should be favorable).
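
The two sources of randomness mentioned above (weight initialization and noise) can each be tied to a seed when a repeatable result is needed. The sketch below is an illustration on toy data with a simplified pipeline (no PCA); the function run_once is a hypothetical helper, not part of this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# toy data standing in for the audio features
X, y = make_classification(n_samples=60, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=10)

def run_once(seed):
    """Train on the noise-extended set with all random sources tied to seed."""
    rng = np.random.default_rng(seed)  # controls the noise
    X_ext = np.concatenate(
        [X_train, X_train + rng.normal(0.0, 0.15, size=X_train.shape)], axis=0)
    y_ext = np.concatenate([y_train, y_train], axis=0)
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', MLPClassifier(hidden_layer_sizes=(12, 4), max_iter=2000,
                                     activation='tanh',
                                     random_state=seed)),  # controls the weights
    ])
    pipe.fit(X_ext, y_ext)
    return pipe.score(X_test, y_test)

# same seed, same result; different seeds show the run-to-run spread
same_a, same_b = run_once(0), run_once(0)
spread = [run_once(seed) for seed in range(5)]
print(same_a == same_b)
print(spread)
```

Fixing the seeds makes a single run repeatable, but it does not make that run representative, which is why the accuracy is averaged over many splits later on.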


In [4]:
#splitting in training and testing set
from sklearn.model_selection import train_test_split
feat_train, feat_test, lab_train, lab_test = train_test_split(features, labels, test_size=0.2, random_state=10)

#the extended feature set includes the original features
#plus a copy with added gaussian noise (appended after the original array)
#mind that the second argument of np.random.normal is the standard deviation
feat_train_ext = np.append(feat_train,feat_train+np.random.normal(0,0.15,(feat_train.shape)),axis=0)

#as a consequence we also have to extend (doubling) the array of labels
lab_train_ext = np.append(lab_train,lab_train,axis=0)

#displaying the size of the original and extended arrays
print(feat_train.shape)
print(feat_train_ext.shape)
print(lab_train.shape)
print(lab_train_ext.shape)

#training the pipe and checking the accuracy 
pipe.fit(feat_train, lab_train)
lab_predict = pipe.predict(feat_test)
print('Accuracy without extended set:',sklearn.metrics.accuracy_score(lab_test, lab_predict))

#training the pipe and checking the accuracy on extended set
pipe.fit(feat_train_ext, lab_train_ext)
lab_predict = pipe.predict(feat_test)
print('Accuracy with extended set:',sklearn.metrics.accuracy_score(lab_test, lab_predict))

(48, 20)
(96, 20)
(48,)
(96,)
Accuracy without extended set: 0.5
Accuracy with extended set: 0.4166666666666667


In [5]:
from sklearn.model_selection import RepeatedStratifiedKFold

#creating the repeated stratified k-fold object
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10)

#empty list to store the accuracy of each iteration
accuracy = []

#iterating over the splits in the repeated stratified k-fold object (given our features and labels)
for train_index, test_index in rskf.split(features, labels):
    #creating the feature splits manually from the k-fold indexes
    feat_train, feat_test = features[train_index], features[test_index]
    #creating the label splits manually from the k-fold indexes
    lab_train, lab_test = labels[train_index], labels[test_index]
    #extending the training features by adding noise (the test features are left untouched)
    feat_train_ext = np.append(feat_train,feat_train+np.random.normal(0,0.15,(feat_train.shape)),axis=0)
    #extending labels
    lab_train_ext = np.append(lab_train,lab_train,axis=0)
    #training
    pipe.fit(feat_train_ext, lab_train_ext)
    #inference on the test set
    lab_predict = pipe.predict(feat_test)
    #computing accuracy and adding it to the list (the mean and variance are computed later)
    accuracy.append(sklearn.metrics.accuracy_score(lab_test, lab_predict))

print('Accuracy mean and variance', np.mean(accuracy),np.var(accuracy),'\n')

Accuracy mean and variance 0.7633333333333333 0.011211111111111111 
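
To judge whether the noise actually helps, the extended training set can be compared with the plain one on exactly the same folds. The following is a minimal sketch on toy data with a simplified pipeline (no PCA) and fewer repeats than above, so the numbers it prints say nothing about the audio dataset of this notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# toy data standing in for the audio features
X, y = make_classification(n_samples=60, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

base_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', MLPClassifier(hidden_layer_sizes=(12, 4), max_iter=2000,
                                 activation='tanh', random_state=0)),
])

rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
acc_plain, acc_ext = [], []

for train_index, test_index in rskf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # baseline: original training fold only
    plain = clone(base_pipe).fit(X_train, y_train)
    acc_plain.append(accuracy_score(y_test, plain.predict(X_test)))

    # extended: noise is added to the training fold only, never to the test fold
    X_ext = np.concatenate(
        [X_train, X_train + rng.normal(0.0, 0.15, size=X_train.shape)], axis=0)
    y_ext = np.concatenate([y_train, y_train], axis=0)
    ext = clone(base_pipe).fit(X_ext, y_ext)
    acc_ext.append(accuracy_score(y_test, ext.predict(X_test)))

print('Plain    mean and variance:', np.mean(acc_plain), np.var(acc_plain))
print('Extended mean and variance:', np.mean(acc_ext), np.var(acc_ext))
```

Because both variants are evaluated on identical test folds, the difference between the two means is less affected by how easy or hard a particular split happens to be.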



### 2. Follow up activity

Take a classifier you have previously trained for which you suspect that overfitting was an issue (perhaps try plotting the decision boundaries). Try adding noise manually to see whether this improves the performance.