# <span style="color:green"> Environmental Sound Classification </span>
## <span style="color:green">Notebook 2: Features Classification </span>

---
[Mattia Pujatti](mattia.pujatti.1@studenti.unipd.it), ID 1232236, Master's degree in Physics of Data

---

This notebook was developed as the final project for the Human Data Analytics course, held by professors [Michele Rossi](rossi@dei.unipd.it) and [Francesca Meneghello](meneghello@dei.unipd.it), during the academic year 2019/2020 at the University of Padua.

### Table of Contents

1. #### [Introduction](#Introduction-to-Notebook-2) 
2. #### [Datasets Comparison](#Datasets-Comparison)
3. #### [Clips Classification](#Clips-Classification)
4. #### [Machine Learning Models](#Machine-Learning-Models)

## Project Presentation

*The main purpose of this notebook is to provide an efficient way, based on machine learning techniques, to classify environmental sound clips belonging to one of the few publicly available datasets on the internet. <br>
Several approaches have been tested over the years, but only a few of them were able to match or even exceed the human classification accuracy, which was estimated at around 81.30%. <br>
The analysis is organized as follows: since the very first approaches focused mainly on the audio features that can be extracted from raw audio files, we will provide a way to collect and organize all those "vectors of features" and use them to distinguish among the different classes. Then, different classification architectures and techniques will be implemented and compared with each other, also to show how they react to different data manipulations (overfitting, numerical stability, ...). <br>
In the end, it will be shown that all those feature classifiers, without exception, underperform when compared with Convolutional Neural Networks applied directly to audio signals and their spectrograms (so without any kind of feature extraction), and how this newer approach opened up a large number of opportunities in terms of high-accuracy models for sound classification.*

### Summary of Notebook 1

In the first [notebook](https://github.com/MattiaPujatti/Environmental-Sound-Classification/blob/master/Analysis_of_Sound_Features.ipynb) we presented the __ESC-50 dataset__, a labeled collection of 2000 audio recordings divided into 50 classes and collected by [Karol Piczak](https://github.com/karolpiczak/ESC-50). We loaded the dataset and, thanks to several functions provided by the _librosa_ library, we extracted 55 different _features per frame_ from each clip, with the purpose of using their distributions to build the feature vectors given as input to a machine learning classifier. We examined the clips via their time signal, their spectrogram and their Mel-scaled spectrogram. Moreover, we looked at the feature distributions of specific clips that showed an irregular behavior, and for them we proposed a statistical way to approximate and describe such features. We also examined the problem of short signals and the resulting impact of long periods of silence inside the clips, and to deal with it we developed an additional preprocessing step that appears to work. In the end, we provided a data augmentation procedure to help us deal with such a scarce amount of data.

## Introduction to Notebook 2

*Explanation*

---

Our main tool for building and training machine learning models to classify environmental sounds will be __keras__, a high-level API of the __tensorflow__ platform. <br>
Just a quick technical remark: we will import all keras functions and objects via _tf.keras_, and not just via _keras_, because of some incompatibility between the two modules.

In [8]:
# Requirements
import os
from tqdm.notebook import tqdm
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import time
import warnings
sns.set_theme()
warnings.filterwarnings('ignore')

In [2]:
from tqdm.keras import TqdmCallback
import tensorflow as tf
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.pipeline import Pipeline
from sklearn.utils.multiclass import type_of_target

Using TensorFlow backend.


## Datasets Comparison

In the previous notebook we created several _csv_ files containing all the feature vectors extracted from the audio files of the original dataset. As recalled in the introduction, for each clip we derived a total of 55 different features per frame, and to summarize them we decided to take the mean and/or the standard deviation of the corresponding distributions. We also removed some silent parts before the calculation to obtain more "uniform" and consistent results and, in the end, we augmented the data in order to have more sounds available. <br>
So we ended up with the following files:
* `features.csv`: uses mean and standard deviation to represent the feature distributions, computed after silence removal with a window of 0.5 seconds.
* `features_nosilenceremoval.csv`: this file also uses both mean and std to represent the feature distributions but, as the name suggests, no silence removal was applied, mainly to show the actual improvement that this procedure brings.
* `reduced_features.csv`: unlike the first two, this file reports only the mean of the features; silence removal was applied.
* `augmented_features.csv`: this file was generated from the augmented data, built by applying 4 different augmentation procedures to every clip in the original dataset, resulting in a collection of clips five times bigger.

The following table summarizes all the data available for classification.

| Filename | Features Length | Number of Clips | Silence Removal | Augmented |
| :--- | :--- | :--- | :--- | :--- |
| features.csv | 110 | 2000 | Yes | No |
| features_nosilenceremoval.csv | 110 | 2000 | No | No |
| reduced_features.csv | 55 | 2000 | Yes | No |
| augmented_features.csv | 110 | 10000 | Yes | Yes |
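
The relation between the 55 per-frame features and the 110 (or 55) values per clip reported in the table can be sketched as follows. This is only an illustration on random numbers: the number of frames, the variable names and the ordering of the statistics inside the vector (all means first, then all standard deviations) are assumptions, not necessarily the layout used in the first notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame feature matrix of a single clip: 216 frames x 55 features
n_frames, n_features = 216, 55
frames = rng.normal(size=(n_frames, n_features))

# "Full" representation: mean and standard deviation of every feature -> 110 values
full_vector = np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# "Reduced" representation: only the mean of every feature -> 55 values
reduced_vector = frames.mean(axis=0)

print(full_vector.shape, reduced_vector.shape)
```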

In [3]:
# Let's load the datasets
features = pd.read_csv("features.csv", index_col=0)
features_nosilenceremoval = pd.read_csv("features_nosilenceremoval.csv", index_col=0)
reduced_features = pd.read_csv("reduced_features.csv", index_col=0)
augmented_features = pd.read_csv("augmented_features.csv", index_col=0)
augmented_features

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,101,102,103,104,105,106,107,108,109,label
0,10849.601157,918.057984,18598.941982,945.065637,6370.751663,123.212212,0.485326,0.066063,0.046367,0.034819,...,0.510865,0.462348,0.497908,0.530278,0.440784,0.504250,0.503728,0.486035,0.488766,dog
1,9990.634558,559.101005,18132.805589,516.790446,6439.145801,148.285112,0.391947,0.068640,0.072870,0.031348,...,0.613910,0.542764,0.569944,0.649563,0.671917,0.506207,0.693061,0.575784,0.549581,chirping_birds
2,4779.126541,328.861758,9031.055367,488.682171,4592.885754,144.489410,0.200383,0.028605,0.271595,0.046680,...,0.551372,0.492177,0.536389,0.522492,0.562115,0.558984,0.570263,0.589155,0.527654,vacuum_cleaner
3,4480.194905,229.432479,8562.670729,323.455861,4134.897304,115.367570,0.202144,0.021189,0.275287,0.066021,...,0.574609,0.515944,0.493575,0.476780,0.479676,0.449910,0.554374,0.568633,0.512262,vacuum_cleaner
4,10159.868973,672.106116,18427.875421,329.407384,6725.294146,206.106674,0.390444,0.086304,0.015039,0.004850,...,0.475644,0.571642,0.568560,0.469428,0.505913,0.501398,0.493625,0.519823,0.513859,thunderstorm
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,1918.757528,303.783879,3106.227002,644.665288,2170.010452,434.008532,0.053308,0.018209,0.085123,0.087223,...,1.589473,0.908361,1.263996,0.991909,0.835533,0.968926,0.802479,0.763298,0.794067,hen
9996,5685.663242,143.534391,11292.940994,253.468461,4396.309142,68.666402,0.227157,0.017203,0.119567,0.003821,...,0.536032,0.487366,0.492283,0.451245,0.463138,0.476303,0.451795,0.511070,0.498505,vacuum_cleaner
9997,1931.486868,1329.795683,4316.632831,3188.653985,2824.456677,1655.568125,0.025522,0.025662,0.065660,0.094834,...,1.224146,0.850813,0.775175,0.799110,0.885374,0.834324,0.756124,0.759324,0.636704,footsteps
9998,3503.472061,940.171052,7393.582670,3105.111474,3994.320503,921.695082,0.075558,0.021528,0.061051,0.040678,...,0.767034,0.722414,0.723565,0.600105,0.629163,0.528401,0.590463,0.632595,0.592112,sheep


## Clips Classification

Following the same approach used in the first notebook, we will define a class that handles all the steps needed to build and train a machine learning model on our datasets. <br>
The class is meant to work with a dataset formatted like the ones shown above, in which each row corresponds to a feature vector and the last column contains the categorical label of the clip. In practice, the features must be standardized and the labels must be encoded in a format suitable for the machine learning classifiers. <br>
Moreover, the class provides several functions to automate the training process, the hyperparameter tuning and the final validation of the model. But there is a problem with 3 out of 4 of our datasets: they contain just 2000 clips (40 per class), which is hardly enough to properly train a 50-class classifier! Splitting the clips into train, validation and test sets is not really an option, because we would not have enough data to obtain meaningful statistics, so we will rely on a technique (also suggested in the _sklearn_ documentation) called __nested cross-validation__. A non-nested approach uses the same cross-validation procedure and data both to tune the hyperparameters and to evaluate the model, but this is likely to lead to an optimistically biased estimate of the model performance (because of information leakage). Nested cross-validation (nested CV) instead nests the hyperparameter tuning inside the model evaluation, using two different KFold (or stratified KFold) splits: in the inner loop, a model is fit on each training set for every candidate set of hyperparameters, and the candidate with the best score on the validation sets is selected; in the outer loop, the generalization error is estimated by averaging the test set scores over several dataset splits. <br>
Under this procedure, the hyperparameter search has no opportunity to overfit the whole dataset, since it is only exposed to the subset of the data provided by the outer cross-validation procedure. This reduces, if not eliminates, the risk of the search procedure overfitting the original dataset and should provide a less biased estimate of a tuned model's performance. Obviously, this comes at an additional cost, since the number of intermediate steps increases dramatically: if _n_ candidate sets of hyperparameters are evaluated over _k_ folds, a traditional non-nested CV fits `n*k` models, while the nested CV fits `k*n*k` models, because the whole search is repeated for each of the _k_ folds of the outer loop.
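
As a minimal sketch of the procedure just described, the following cell compares a non-nested and a nested estimate with _sklearn_. The synthetic dataset, the KNeighbors classifier and the grid of values are assumptions chosen only to keep the example small and fast; they are not the data or the settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic dataset standing in for the feature vectors (assumption)
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

param_grid = {'n_neighbors': [1, 3, 5, 7]}   # n = 4 candidate settings
k = 5                                        # number of folds in both loops

inner_cv = KFold(n_splits=k, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=k, shuffle=True, random_state=1)

# Inner loop: hyperparameter search
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=inner_cv)

# Non-nested estimate: the score used to select the hyperparameters is also the one reported
non_nested_score = search.fit(X, y).best_score_

# Nested estimate: the outer loop repeats the whole search on each training split
nested_scores = cross_val_score(search, X, y, cv=outer_cv)

n = len(param_grid['n_neighbors'])
print('Non-nested accuracy:', round(non_nested_score, 3))
print('Nested accuracy:    ', round(nested_scores.mean(), 3))
print('Cross-validation fits: non-nested =', n * k, ', nested =', k * n * k)
```

The fit counts printed at the end ignore the final refit of the best candidate that `GridSearchCV` performs after each search.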

This is the internal structure of the class `ClipsClassifier`:

* `__init__`: the constructor of the class, which takes a pandas dataframe as input and splits features and labels;
* `Setup_Classifier`: function used to modify the default values chosen for some parameters/methods implemented by the class, like:

    * the number of components to keep after the Principal Component Analysis (default = 0.99, i.e. as many components as needed to explain 99% of the variance);
    * the number of folds to use in the cross validation (default = 5);
    * the verbosity of the messages printed;
    * the encoder method for the labels (LabelEncoder or OneHotEncoder, default = label).

* `_Create_Pipeline`: in order to simplify the features setup and training procedure we rely on a sklearn Pipeline containing a standardization function (StandardScaler), a dimensionality reduction step (PCA) and the model that we want to fit;
* `Run_Nested_Cross_Validation`: this function implements all the steps needed to run a nested cross validation of our model over the initial dataset, fitting the pipeline and performing a GridSearch over the dictionary of hyperparameters provided; it then estimates the actual performance via `cross_val_score` over the outer folds and, if requested, also computes the confusion matrix from the predictions returned by `cross_val_predict`;
* `Run_NonNested_Cross_Validation`: like _Run_Nested_Cross_Validation_, but with an initial split of the data into a training and a test set: the GridSearch runs on the training set and the final accuracy is computed on the test set.
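
To make the difference between the two label encoders selectable in `Setup_Classifier` concrete, here is a small sketch on a handful of made-up labels. The `.toarray()` call converts the default sparse output of `OneHotEncoder` to a dense array, a choice made here only so that the example does not depend on the name of the dense-output argument, which differs across scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# A few made-up labels, only for illustration
labels = np.array(['dog', 'hen', 'sheep', 'dog'])

# Integer encoding: one integer per class, assigned following the sorted class names
int_labels = LabelEncoder().fit_transform(labels)

# One-hot encoding: the encoder expects a 2D column of labels and returns one column per class
onehot_labels = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()

print(int_labels)
print(onehot_labels)
```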

In [4]:
class ClipsClassifier():
    """The purpose of this class is to collect all the necessary steps and functions to construct a classification
    model for our clips. 
    In particular, all the necessary steps to prepare the input dataset for the training process will 
    be implemented:
    * standardization
    * PCA
    * Label Encoding
    * train test split or K-fold splitting
    Then a grid search can be run in order to test several combination of hyperparameters without constructing 
    directly a validation set. In the end, the performances will be shown in term of accuracy also over the
    different macro-categories to finally quantify the quality of the model constructed.
    """
    
    def __init__(self, dataset):
        """Initialize some global parameters.
        Dataset is a pandas dataframe with several "features" columns and one "label" column, 
        that contains the data that we want to fit."""
        
        self.X = dataset.drop(['label'], axis=1)
        self.Y = dataset[['label']].to_numpy()     
        self.Setup_Classifier()
        
        self.setup_completed = False
        self.pipeline_fitted = False
        
        self.clf = None
        self.nested_scores = []
        self.final_accuracy = 0
        
        self.confusion_matrix = None
        
        
    def Setup_Classifier(self, pca_components=0.99, n_folds=5, n_jobs=-1, verbose=2, encoder_method='label'):
        """Change the value of some parameters/methods used during data pre-processing and training step."""
        
        self.n_jobs = n_jobs
        self.verbose = verbose
        self.pca_components = pca_components
        self.n_folds = n_folds
        
        # Select the encoder for the labels
        if encoder_method == 'onehot':
            # OneHotEncoder expects a 2D column of labels
            self.label_encoder = OneHotEncoder(sparse=False)
            self.Y = self.Y.reshape(-1,1)
        elif encoder_method == 'label':
            # LabelEncoder expects a 1D array of labels
            self.label_encoder = LabelEncoder()
            self.Y = self.Y.ravel()
        else:
            # Fail loudly: returning None here would break chained calls
            raise ValueError('Invalid value of the encoder. Available: onehot, label')
        
        return self
    
    
    def _Create_Pipeline(self, model):
        """Construct a sklearn Pipeline that contains operations of standardization and
        dimensionality reduction."""
        
        return Pipeline([('standardization', StandardScaler()),
                         ('pca', PCA(n_components=self.pca_components, svd_solver='full')),
                         ('classifier', model)])
            
    
    def Run_Nested_Cross_Validation(self, model, parameters={}, compute_confusion_matrix=False):
        """To estimate the performances of a model with small amount of data, we will exploit 
        the "outer" K-fold splitting defined before, in order to compute the effective generalized 
        accuracy as the average of the validation values obtained among various folds. 
        Because of the stochastic nature of the approach, it may be better to repeat several times 
        the run to check if the results are compatible between themselves."""
        
        # Avoid calling cross validation more than once
        if self.pipeline_fitted:
            print('Cross validation already completed!')
            return
        
        # Define two KFold splitting
        inner_cv = KFold(n_splits=self.n_folds, shuffle=True, random_state=42)
        outer_cv = KFold(n_splits=self.n_folds, shuffle=True, random_state=42)
        
        # Standardize and eventually apply pca on the dataset
        pipeline = self._Create_Pipeline(model)
        
        # Encode the labels
        labels = self.label_encoder.fit_transform(self.Y)

        # Run the inner CV
        self.clf = GridSearchCV(estimator=pipeline, param_grid=parameters, n_jobs=self.n_jobs,
                                verbose=self.verbose, cv=inner_cv).fit(self.X, labels)
        self.pipeline_fitted = True
        
        # Nested CV cross validation
        self.nested_scores = cross_val_score(self.clf, X=self.X, y=labels, n_jobs=self.n_jobs, 
                                             verbose=self.verbose, cv=outer_cv)
                
        if self.verbose > 1:
            print("Optimal set of hyperparameters: ")
            print(self.clf.best_params_)
                    
        # Validate the best model found over the outer CV
        self.final_accuracy = np.mean(self.nested_scores)
        if self.verbose > 0:
            print("Average final accuracy estimated: {}%".format(round(self.final_accuracy*100, 2)))  
            
        if compute_confusion_matrix:
            
            # Compute the predictions over the outer CV
            predictions = cross_val_predict(self.clf, self.X, labels, cv=outer_cv)
            
            # Confusion matrix does not support one-hot encoding
            if type_of_target(predictions) == 'multilabel-indicator': predictions = np.argmax(predictions, axis=1)
            if type_of_target(labels)      == 'multilabel-indicator': labels = np.argmax(labels, axis=1)
                        
            self.confusion_matrix = confusion_matrix(labels, predictions)
            
        return        
    
    
    def Run_NonNested_Cross_Validation(self, model, parameters={}, test_size=0.2, compute_confusion_matrix=False):
        """When you have enough data to construct a training and a test set, a nested CV would be
        very inefficient, because now you are able to properly set up a set of completely unseen data.
        """
        
        # Avoid calling cross validation more than once
        if self.pipeline_fitted:
            print('Cross validation already completed!')
            return
        
        # Split the dataset into training and test data
        X_train, X_test, Y_train, Y_test = train_test_split(
            self.X, self.Y, test_size=test_size, shuffle=True, random_state=42)
        
        # Standardize and eventually apply pca on the dataset
        pipeline = self._Create_Pipeline(model)
        
        # Encode the labels
        train_labels = self.label_encoder.fit_transform(Y_train)
        test_labels = self.label_encoder.transform(Y_test)
        
        
        # Non_nested parameter search and scoring
        self.clf = GridSearchCV(estimator=pipeline, param_grid=parameters, n_jobs=self.n_jobs, 
                           verbose=self.verbose, cv=self.n_folds).fit(X_train, train_labels)
        
        self.pipeline_fitted = True
        
        # Return the optimal set of hyperparameters tuned
        if self.verbose > 1:
            print("Optimal set of hyperparameters: ")
            print(self.clf.best_params_)
        
        # Validate the best model found over the test set
        self.final_accuracy = self.clf.score(X_test, test_labels)
        if self.verbose > 0:
            print("Average final accuracy estimated: {}%".format(round(self.final_accuracy*100, 2)))  
            
        if compute_confusion_matrix:
            
            # Compute the predictions over the test set
            predictions = self.clf.predict(X_test)

            # Confusion matrix does not support one-hot encoding
            if type_of_target(predictions) == 'multilabel-indicator': predictions = np.argmax(predictions, axis=1)
            if type_of_target(test_labels) == 'multilabel-indicator': test_labels = np.argmax(test_labels, axis=1)

            self.confusion_matrix = confusion_matrix(test_labels, predictions)
            
        return

Now that we have a class that handles all the information needed to train a model, it's time to figure out how we can effectively model our datasets and how different "canonical" classifiers work on them. In particular, in the next section we will train and study 4 different models provided by _sklearn_:
* a __Random Forest__
* a __Multi-Layer Perceptron__
* a __KNeighbors Classifier__
* a __Support Vector Machine__

## Machine Learning Models

First of all, let's define, as dictionaries, some reasonable sets of values for the hyperparameters of our classifiers, which we will then properly tune. Remember that we are not performing the GridSearch directly on the models but on the pipelines, so the keys of the dictionaries must be prefixed with the name of the pipeline step they refer to (here *classifier*), followed by a double underscore.
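
The naming convention can be checked directly on a pipeline with the same three steps built by `_Create_Pipeline`. This is only a sketch: the SVC model and the values assigned are arbitrary choices for the example.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same step names as in the pipeline created by the ClipsClassifier class
pipeline = Pipeline([('standardization', StandardScaler()),
                     ('pca', PCA(n_components=0.99, svd_solver='full')),
                     ('classifier', SVC())])

# Parameters of a step are addressed as '<step_name>__<parameter_name>'
pipeline.set_params(classifier__C=0.5, classifier__kernel='linear')

# List the keys that belong to the 'classifier' step and can appear in a parameter grid
classifier_keys = sorted(key for key in pipeline.get_params()
                         if key.startswith('classifier__'))
print(classifier_keys)
```

The same convention would apply to the other steps, for example `pca__n_components`, if one wanted to tune them in the same search.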

In [5]:
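# GridSearch for a Random Forest
# max_samples only applies to bootstrapped forests (recent scikit-learn versions
# reject it when bootstrap=False), so the grid is split into two sub-grids
params_RF = [{'classifier__n_estimators': [500, 1000],
              'classifier__bootstrap': [True],
              'classifier__max_samples': [0.5, None],
              'classifier__max_features': ['sqrt']},
             {'classifier__n_estimators': [500, 1000],
              'classifier__bootstrap': [False],
              'classifier__max_features': ['sqrt']}]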

# GridSearch for a Multi-Layer Perceptron
params_MLP = {'classifier__hidden_layer_sizes':[128, 256, 512],
              'classifier__activation':['logistic', 'relu'],
              'classifier__solver':['sgd', 'adam'],
              'classifier__learning_rate_init':[0.01, 0.001]}

# GridSearch for a KNeighbors Classifier
params_KNC = {'classifier__n_neighbors':[2,5,8,10],
              'classifier__weights':['uniform', 'distance'],
              'classifier__algorithm':['auto', 'ball_tree', 'kd_tree', 'brute'],
              'classifier__leaf_size':[10, 30, 50, 100]}

# GridSearch for a Support Vector Machine
params_SVM = {'classifier__C':[0.1, 0.5, 1],
              'classifier__kernel':['linear', 'poly', 'rbf', 'sigmoid']}

In [6]:
# Grids restricted to the best hyperparameters found by the search above, to speed up the following runs
# Random Forest
params_RF = {'classifier__bootstrap': [False], 'classifier__max_features': ['sqrt'], 
             'classifier__max_samples': [None], 'classifier__n_estimators': [1000]}

# GridSearch for a Multi-Layer Perceptron
params_MLP =  {'classifier__activation': ['relu'], 'classifier__hidden_layer_sizes': [512], 
               'classifier__learning_rate_init': [0.01], 'classifier__solver': ['adam']}

# GridSearch for a KNeighbors Classifier
params_KNC = {'classifier__algorithm': ['auto'], 'classifier__leaf_size': [10], 
              'classifier__n_neighbors': [2], 'classifier__weights': ['distance']}

# GridSearch for a Support Vector Machine
params_SVM = {'classifier__C': [0.1], 'classifier__kernel': ['linear']}

In [None]:
rf_cc = ClipsClassifier(dataset = features)
rf_cc.Run_Nested_Cross_Validation(model = RandomForestClassifier(), parameters = params_RF)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:   58.7s finished


In [None]:
mlp_cc = ClipsClassifier(dataset = features)
mlp_cc.Run_Nested_Cross_Validation(model = MLPClassifier(), parameters = params_MLP)

In [None]:
knc_cc = ClipsClassifier(dataset = features)
knc_cc.Run_Nested_Cross_Validation(model = KNeighborsClassifier(), parameters = params_KNC)

In [None]:
svm_cc = ClipsClassifier(dataset = features)
svm_cc.Run_Nested_Cross_Validation(model = SVC(), parameters = params_SVM)

In [None]:
features_accuracies_df = pd.DataFrame()
features_accuracies_df['RF']  =  rf_cc.nested_scores
features_accuracies_df['MLP'] = mlp_cc.nested_scores
features_accuracies_df['KNC'] = knc_cc.nested_scores
features_accuracies_df['SVM'] = svm_cc.nested_scores
print('Scores of the models over the cross validation folds:')
display(features_accuracies_df)

fig, ax = plt.subplots(1, 1, figsize=(8,5))
sns.violinplot(data=features_accuracies_df, ax=ax)
ax.set_ylabel('Accuracy')
ax.set_title('Comparison of different models over the "features.csv" dataset');

So from this first analysis over the dataset _features.csv_ we obtain quite good results: an average accuracy above 50% with just 2000 clips, 50 different classes and classifiers that are not even complicated is a nice starting point for our work. Moreover, the top performers found after several runs of the previous analysis are the Multi-Layer Perceptron and the Support Vector Machine, with a final accuracy that exceeds 60%. Random Forests reach intermediate results but, as we will also see later on, they are by far the models that require the longest training time, so they are probably the least suited classifiers for this kind of task. Finally we have the KNeighbors Classifier, with an accuracy just above the 50% threshold but with a more reasonable training time. It is worth keeping the KNC in mind, since it will show surprising performance later on in the analysis. <br>
For a deeper analysis one could, for example, inspect the confusion matrices of the 4 classifiers, to check whether they behave similarly on the different classes or whether one of them performs better than the others on a particular kind of sound. Moreover, one could also study the performance over the larger macro-categories, simply re-mapping the labels stored in the last column of the dataframe, to check how well such models can distinguish at least among those macro areas. This notebook does not go through it for now, since it would make the analysis too long and tedious: it is probably better to reserve such deeper checks for the classifier that turns out to be the best in the end.
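
The label re-mapping mentioned above could look like the following sketch. Both the mapping (an illustrative subset with shortened category names, not the full ESC-50 table) and the true/predicted labels are made up for the example; no result of this notebook is reproduced here.

```python
import pandas as pd

# Illustrative subset of a class -> macro-category mapping (assumption)
macro_map = {'dog': 'animals', 'hen': 'animals', 'sheep': 'animals',
             'thunderstorm': 'natural', 'chirping_birds': 'natural',
             'vacuum_cleaner': 'domestic', 'footsteps': 'human'}

# Toy true labels and predictions, invented for the example
results = pd.DataFrame({
    'true': ['dog', 'hen', 'sheep', 'thunderstorm', 'vacuum_cleaner', 'footsteps'],
    'pred': ['hen', 'hen', 'dog', 'chirping_birds', 'vacuum_cleaner', 'dog'],
})

# Accuracy on the original classes
fine_accuracy = (results['true'] == results['pred']).mean()

# Accuracy after re-mapping both columns to the macro-categories
macro_accuracy = (results['true'].map(macro_map) == results['pred'].map(macro_map)).mean()

print(fine_accuracy, macro_accuracy)
```

In this toy case most mistakes stay inside the correct macro-category, which is exactly the kind of behavior such a check is meant to reveal.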

One could also compare, on average, the scores calculated with the nested cross validation with the ones obtained with a non-nested approach. Since the non-nested estimate reuses the same data to select the hyperparameters and to score the model, it is expected to be optimistically biased, so a small systematic gap between the two estimates would indicate that the nested procedure is effectively removing the information leakage between the train and validation sets. <br>
But, given the methods that we have applied, are we sure that the statistics obtained with such a small amount of data are credible? Later on we will see that they are, but for the moment we will treat what we obtained as a qualitative result.

In the first notebook we left some open questions or, better, some statements that have not been proved yet. We will now go through them one by one, with the purpose of providing at least a qualitative answer on whether one technique/procedure is better than another, also justifying many of the choices made previously in the analysis.

### Impact of the dimensionality reduction 

Working with large feature vectors can be very memory and time demanding, especially when training a complex classifier with a large number of parameters. For this reason one usually applies a dimensionality reduction step before passing the data to the algorithm, in our case a Principal Component Analysis, with the purpose of reducing the dimension of each input vector while keeping the "information" it provides as untouched as possible. But how does this step influence the performance of our classifiers?

Before showing it, let's have a look at the following plot, which allows us to determine the number of principal components to keep without losing too much of that "information". Basically, during the PCA you are projecting your data onto a lower-dimensional vector space, where each principal component is an eigenvector of the covariance matrix of the dataset and the corresponding eigenvalue is the variance of the data along that direction: reducing the size of the problem means keeping only the first K components, i.e. the ones with the largest variance. The plot shows the cumulative explained variance as a function of the number of components kept: the closer the value is to 1, the more information is kept.
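
As a self-contained illustration of the quantity being plotted, the sketch below computes the cumulative explained variance on synthetic correlated data (an assumption, not our features) and derives the number of components needed to reach a given threshold. Note that `np.argmax` returns a zero-based position, so 1 is added to turn it into a count of components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic correlated data: 500 samples, 30 observed features driven by 5 latent factors
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 30))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 30))

# Fit a full PCA on the standardized data and accumulate the explained variance ratios
pca = PCA().fit(StandardScaler().fit_transform(X))
cev = np.cumsum(pca.explained_variance_ratio_)

# First position at which the threshold is reached, converted to a number of components
n_components_90 = int(np.argmax(cev >= 0.90)) + 1
n_components_99 = int(np.argmax(cev >= 0.99)) + 1

print(n_components_90, n_components_99)
```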

In [None]:
# Let's retrieve the PCA fitted objects from one of our models: since the PCA is computed before the training
# step, the results will be the same for each classifier
PCA_fitted = svm_cc.clf.best_estimator_['pca']

cev = np.cumsum(PCA_fitted.explained_variance_ratio_)
plt.plot(cev, color='blue', lw=3, label='cev')

# Let's plot some typical thresholds
plt.axvline(np.argmax(cev>0.99), ls='--', c='black', lw=1, label='cev = 0.99 ({} pc)'.format(np.argmax(cev>0.99)))
plt.axvline(np.argmax(cev>0.90), ls='--', c='red', lw=1, label='cev = 0.90 ({} pc)'.format(np.argmax(cev>0.90)))

plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('Study on the number of principal components')
plt.legend()

As the plot shows, you can potentially keep 90% of the information stored in your features with just 48 values! <br>
The accuracy plot in the previous section was computed keeping, by default, 99% of the explained variance, i.e. 84 principal components, and the results are still quite good. What would happen, instead, if we further reduced the number of components kept?

In [None]:
def test_models_principal_component(features, n_components, pRF={}, pMLP={}, pKNC={}, pSVM={}):
    accuracies = {}
    training_times = {}
    
    start = time.time()
    rf = ClipsClassifier(dataset=features).Setup_Classifier(pca_components=n_components, verbose=0)
    rf.Run_Nested_Cross_Validation(model=RandomForestClassifier(), parameters=pRF)
    accuracies['RF'] = [np.mean(rf.nested_scores)]
    training_times['RF'] = [time.time() - start]
    
    start = time.time()
    mlp = ClipsClassifier(dataset=features).Setup_Classifier(pca_components=n_components, verbose=0)
    mlp.Run_Nested_Cross_Validation(model=MLPClassifier(), parameters=pMLP)
    accuracies['MLP'] = [np.mean(mlp.nested_scores)]
    training_times['MLP'] = [time.time() - start]
    
    start = time.time()
    knc = ClipsClassifier(dataset=features).Setup_Classifier(pca_components=n_components, verbose=0)
    knc.Run_Nested_Cross_Validation(model=KNeighborsClassifier(), parameters=pKNC)
    accuracies['KNC'] = [np.mean(knc.nested_scores)]
    training_times['KNC'] = [time.time() - start]
    
    start = time.time()
    svm = ClipsClassifier(dataset=features).Setup_Classifier(pca_components=n_components, verbose=0)
    svm.Run_Nested_Cross_Validation(model=SVC(), parameters=pSVM)
    accuracies['SVM'] = [np.mean(svm.nested_scores)]
    training_times['SVM'] = [time.time() - start]
    
    return accuracies, training_times

In [None]:
# Take the best set of hyperparameters according to the previous analysis
pRF  = {'classifier__bootstrap': [False], 'classifier__max_features': ['sqrt'], 'classifier__max_samples': [None], 'classifier__n_estimators': [1000]}
pMLP = {'classifier__activation': ['relu'], 'classifier__hidden_layer_sizes': [512], 'classifier__learning_rate_init': [0.01], 'classifier__solver': ['adam']}
pKNC = {'classifier__algorithm': ['auto'], 'classifier__leaf_size': [10], 'classifier__n_neighbors': [2], 'classifier__weights': ['distance']}
pSVM = {'classifier__C': [0.1], 'classifier__kernel': ['linear']}

acc_vs_comp = pd.DataFrame()
time_vs_comp = pd.DataFrame()
component_list = [1, 5, 10, 20, 35, 50, 75, 100, 110]

for n_components in tqdm(component_list):
    
    acc, times = test_models_principal_component(features, n_components, pRF, pMLP, pKNC, pSVM)
    acc_vs_comp = pd.concat([acc_vs_comp, pd.DataFrame(acc)])
    time_vs_comp = pd.concat([time_vs_comp, pd.DataFrame(times)])
    
acc_vs_comp['Components'] = component_list
time_vs_comp['Components'] = component_list

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(16,5))

sns.lineplot(x='Components', y='RF',  data=acc_vs_comp, ax=axs[0], label='RF')
sns.lineplot(x='Components', y='MLP', data=acc_vs_comp, ax=axs[0], label='MLP')
sns.lineplot(x='Components', y='KNC', data=acc_vs_comp, ax=axs[0], label='KNC')
sns.lineplot(x='Components', y='SVM', data=acc_vs_comp, ax=axs[0], label='SVM')
axs[0].set_title('Performances versus PCA')
axs[0].set_xlabel('Principal Components kept')
axs[0].set_ylabel('Mean Accuracy of the models')
axs[0].legend()

sns.lineplot(x='Components', y='RF',  data=time_vs_comp, ax=axs[1], label='RF')
sns.lineplot(x='Components', y='MLP', data=time_vs_comp, ax=axs[1], label='MLP')
sns.lineplot(x='Components', y='KNC', data=time_vs_comp, ax=axs[1], label='KNC')
sns.lineplot(x='Components', y='SVM', data=time_vs_comp, ax=axs[1], label='SVM')
axs[1].set_title('Training times versus PCA')
axs[1].set_xlabel('Principal Components kept')
axs[1].set_ylabel('Training times of the models')
axs[1].legend();

The results look exactly as expected.
* The right plot shows the computation times: the larger the input vectors, the longer the classifiers take to train on them. With just 2000 clips the absolute difference is not very relevant, but the scaling behavior of the lines is evident.
* The left plot shows the mean accuracies obtained by our models: after 40-50 principal components the lines flatten out, meaning that no significant further improvement is possible. This is consistent with the explained variance ratio shown in the plot above. The reason can probably be traced back to the first notebook and to the high number of features computed to represent each clip: taking 55 different quantities was likely excessive, and about half of them would have been sufficient. At the training step, however, once the features dataframe has been built, this no longer matters: the number of principal components can simply be reduced through the corresponding parameter of the classifier class, lowering the dimensionality of the models' input.
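
The link between the flattening accuracy curves and the explained variance can be sketched as follows. This is only an illustration on synthetic data (not the notebook's `features` table): the array `X`, the 99% threshold and the variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a features table: 200 samples, 20 columns,
# but only 5 independent directions carry signal (the rest is tiny noise).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

# Standardize, then fit a PCA keeping all the components
X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)

# Cumulative fraction of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches the threshold
threshold = 0.99  # assumed value, to be tuned
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep, cumulative[:n_keep].round(4))
```

Components beyond `n_keep` add almost no variance, which is the regime where the accuracy curves above are expected to become flat.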

### Importance of the statistical estimators of the features distributions

One of the datasets created previously is called *reduced_features.csv*. The main difference with respect to the data used in the previous section is that, to summarize the distributions of the *features per frame* identified and collected in the first notebook, we used only their *mean*, ignoring the *standard deviation*. In this way the resulting "vectors of features" have size (55,) rather than (110,), leading to a much smaller dataset; but will this affect the final performance of the classifiers?
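
As a reminder of what the two representations look like, here is a minimal sketch of how a per-frame feature matrix could be collapsed into a single vector. The function `summarize`, the random input and the ordering (means first, then standard deviations) are assumptions made for illustration; the actual extraction lives in the first notebook.

```python
import numpy as np

# Hypothetical per-frame feature matrix for one clip:
# rows are frames, columns are the 55 features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(216, 55))

def summarize(frames, use_std=True):
    """Collapse a (n_frames, n_features) matrix into a single vector."""
    mean = frames.mean(axis=0)
    if not use_std:
        return mean
    # Append the per-feature standard deviation to the per-feature mean
    return np.concatenate([mean, frames.std(axis=0)])

reduced_vector = summarize(frames, use_std=False)  # like reduced_features.csv
full_vector = summarize(frames, use_std=True)      # like features.csv
print(reduced_vector.shape, full_vector.shape)
```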

In [None]:
red_feat_rf = ClipsClassifier(dataset=reduced_features).Setup_Classifier(verbose=0)
red_feat_rf.Run_Nested_Cross_Validation(model=RandomForestClassifier(), parameters=params_RF)

red_feat_mlp = ClipsClassifier(dataset=reduced_features).Setup_Classifier(verbose=0)
red_feat_mlp.Run_Nested_Cross_Validation(model=MLPClassifier(), parameters=params_MLP)

red_feat_knc = ClipsClassifier(dataset=reduced_features).Setup_Classifier(verbose=0)
red_feat_knc.Run_Nested_Cross_Validation(model=KNeighborsClassifier(), parameters=params_KNC)

red_feat_svm = ClipsClassifier(dataset=reduced_features).Setup_Classifier(verbose=0)
red_feat_svm.Run_Nested_Cross_Validation(model=SVC(), parameters=params_SVM)

In [None]:
red_features_accuracies_df = pd.DataFrame()
red_features_accuracies_df['red_RF']  = red_feat_rf.nested_scores
red_features_accuracies_df['red_MLP'] = red_feat_mlp.nested_scores
red_features_accuracies_df['red_KNC'] = red_feat_knc.nested_scores
red_features_accuracies_df['red_SVM'] = red_feat_svm.nested_scores

print('Scores of the models over the cross validation folds:')
display(red_features_accuracies_df)

fig, ax = plt.subplots(1, 1, figsize=(16,5))
sns.violinplot(data=pd.concat([red_features_accuracies_df, features_accuracies_df]), ax=ax)
ax.axvline(3.5, ls='--', c='black')
ax.set_ylabel('Accuracy')
ax.set_title('Comparison of different models for different statistical estimators', fontsize=18)
ax.set_xticklabels(['RF', 'MLP', 'KNC', 'SVM', 'RF', 'MLP', 'KNC', 'SVM'])

ax.text(1, 0.6, 'Models trained over the dataset \n "reduced_features.csv"', ha="center", va="center", size=15, 
        bbox=dict(boxstyle="round", fc="w", ec="0.5", alpha=0.9))
ax.text(6, 0.25, 'Models trained over the dataset \n "features.csv"', ha="center", va="center", size=15, 
        bbox=dict(boxstyle="round", fc="w", ec="0.5", alpha=0.9));

The improvement is obvious: using both the mean and the standard deviation as statistical estimators of the features distributions almost doubles the accuracy obtained with **all** the models. The explanation was already given in the first notebook by looking directly at the relevant plots: most of the time these distributions are quite asymmetric and non-Gaussian, so keeping just the average is probably an excessive simplification.

### Importance of the silence removal step

Another version of the feature dataset is _features_nosilenceremoval.csv_. This dataset has been created by computing the mean and the standard deviation of the features distributions, but without the preprocessing step in which we removed fixed-size windows (0.5 seconds) of continuous silence. During the analysis we observed that in short sounds the impact of this silent part was strong: looking at the distributions of many features, the bin corresponding to the value 0 was much higher than it should have been. So we justified this additional preprocessing on theoretical grounds, but does it change anything from the point of view of a classifier?
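
To make the preprocessing step concrete, the following is a rough sketch of how fixed-size silent windows could be dropped from a signal. It is not the code used in the first notebook: the function name `remove_silent_windows`, the amplitude threshold and the use of non-overlapping windows are all assumptions.

```python
import numpy as np

def remove_silent_windows(signal, sr, window_s=0.5, threshold=1e-4):
    """Drop every non-overlapping window whose samples are all (nearly) zero."""
    win = int(window_s * sr)
    kept = []
    for start in range(0, len(signal), win):
        chunk = signal[start:start + win]
        # A window counts as silent if its peak absolute amplitude
        # stays below the (assumed) threshold
        if np.max(np.abs(chunk)) >= threshold:
            kept.append(chunk)
    return np.concatenate(kept) if kept else signal[:0]

sr = 1000                                        # toy sampling rate
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 50 * t)                # 1 s of sound
clip = np.concatenate([tone, np.zeros(2 * sr)])  # followed by 2 s of silence

trimmed = remove_silent_windows(clip, sr)
print(len(clip), len(trimmed))
```

In this toy clip the four silent half-second windows are removed and only the tone survives, so the per-frame statistics are no longer dominated by zeros.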

In [None]:
silent_feat_rf = ClipsClassifier(dataset=features_nosilenceremoval).Setup_Classifier(verbose=0)
silent_feat_rf.Run_Nested_Cross_Validation(model=RandomForestClassifier(), parameters=params_RF)

silent_feat_mlp = ClipsClassifier(dataset=features_nosilenceremoval).Setup_Classifier(verbose=0)
silent_feat_mlp.Run_Nested_Cross_Validation(model=MLPClassifier(), parameters=params_MLP)

silent_feat_knc = ClipsClassifier(dataset=features_nosilenceremoval).Setup_Classifier(verbose=0)
silent_feat_knc.Run_Nested_Cross_Validation(model=KNeighborsClassifier(), parameters=params_KNC)

silent_feat_svm = ClipsClassifier(dataset=features_nosilenceremoval).Setup_Classifier(verbose=0)
silent_feat_svm.Run_Nested_Cross_Validation(model=SVC(), parameters=params_SVM)

In [None]:
silent_features_accuracies_df = pd.DataFrame()
silent_features_accuracies_df['silent_RF']  = silent_feat_rf.nested_scores
silent_features_accuracies_df['silent_MLP'] = silent_feat_mlp.nested_scores
silent_features_accuracies_df['silent_KNC'] = silent_feat_knc.nested_scores
silent_features_accuracies_df['silent_SVM'] = silent_feat_svm.nested_scores

print('Scores of the models over the cross validation folds:')
display(silent_features_accuracies_df)

fig, ax = plt.subplots(1, 1, figsize=(16,5))
sns.violinplot(data=pd.concat([silent_features_accuracies_df, features_accuracies_df]), ax=ax)
ax.axvline(3.5, ls='--', c='black')
ax.set_ylabel('Accuracy')
ax.set_title('Comparison of different models without and with silence removal', fontsize=18)
ax.set_xticklabels(['RF', 'MLP', 'KNC', 'SVM', 'RF', 'MLP', 'KNC', 'SVM'])

ax.text(1, 0.6, 'Models trained over the dataset \n "features_nosilenceremoval.csv"', ha="center", va="center", size=15, 
        bbox=dict(boxstyle="round", fc="w", ec="0.5", alpha=0.9))
ax.text(6, 0.33, 'Models trained over the dataset \n "features.csv"', ha="center", va="center", size=15, 
        bbox=dict(boxstyle="round", fc="w", ec="0.5", alpha=0.9));

The "features" dataset appears to perform slightly better; let's see by how much.

In [None]:
improvements = pd.DataFrame(columns=['RF', 'MLP', 'KNC', 'SVM'])
improvements.loc[0] = (np.array(features_accuracies_df.mean()) - np.array(silent_features_accuracies_df.mean()))
ax = sns.barplot(data=improvements)
ax.set_ylabel('Accuracy')
ax.set_title('Accuracy improvements after silence removal');

And this is not bad at all: a +3% to +5% improvement in accuracy is a nice gain, at the expense of a small additional preprocessing step.

### Overfitting

The main problem encountered so far is the size of the dataset: with just 2000 clips it is difficult to obtain statistically significant results. To avoid creating both a test and a validation set, which would further reduce the data available for training, we relied on nested cross validation, as explained before. But how can we be sure that this is enough to eliminate the possibility of overfitting? <br>
To show this we will proceed with an additional analysis using a neural network, this time defining a validation set. This will not be done as before, by instantiating the class _ClipsClassifier_ and tuning a set of hyperparameters, because we are not interested in the actual final accuracy of the model, but in the behavior of the loss as the training proceeds.
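
For reference, the nested cross validation idea can be sketched with plain scikit-learn as below. This is a generic illustration on synthetic data, not the internals of _ClipsClassifier_: the dataset, the grid and the number of folds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic classification problem standing in for the clips dataset
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()), ('classifier', SVC())])
param_grid = {'classifier__C': [0.1, 1.0],
              'classifier__kernel': ['linear', 'rbf']}

# Inner loop: hyperparameter selection; outer loop: performance estimate
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(pipe, param_grid, cv=inner_cv)
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())
```

Each outer fold is scored by a model whose hyperparameters were chosen without ever seeing that fold, which is what makes the estimate less optimistic than a plain grid search score.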

In [None]:
# Define a function that constructs a Feed Forward Network with a specified setup

def create_NN(optimizer='adamax', dropout_prob=0.1, lr=0.01):
    
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Dense(256))
    model.add(tf.keras.layers.Dropout(dropout_prob))
    model.add(tf.keras.layers.BatchNormalization(momentum=0.99, epsilon=0.001))
    model.add(tf.keras.layers.PReLU())
    model.add(tf.keras.layers.Dense(128))
    model.add(tf.keras.layers.Dropout(dropout_prob))
    model.add(tf.keras.layers.BatchNormalization(momentum=0.99, epsilon=0.001))
    model.add(tf.keras.layers.PReLU())
    model.add(tf.keras.layers.Dense(64))
    model.add(tf.keras.layers.Dropout(dropout_prob))
    model.add(tf.keras.layers.BatchNormalization(momentum=0.99, epsilon=0.001))
    model.add(tf.keras.layers.PReLU())
    model.add(tf.keras.layers.Dense(50))
    model.add(tf.keras.layers.Softmax())

    if optimizer == 'adam':
        optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    elif optimizer == 'adamax':
        optimizer = tf.keras.optimizers.Adamax(learning_rate=lr)
        
    # The last layer is a Softmax, so the model outputs probabilities, not logits
    loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
    
    model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
    
    return model

In [None]:
# Shuffle the dataset
df_nn = features.copy()
df_nn = df_nn.sample(frac=1).reset_index(drop=True)

# Since we won't rely on the pipeline we need to preprocess the data directly here
X = df_nn.drop(['label'], axis=1)
Y = df_nn[['label']].to_numpy()     

# Standardize and use one-hot-encoding
X = StandardScaler().fit_transform(X)
Y = OneHotEncoder(sparse=False).fit_transform(Y)

# Take some reasonable parameters for the network
model = create_NN(optimizer='adamax', dropout_prob=0.1, lr=0.01)

# Use the 20% of data as validation set
history = model.fit(X, Y, epochs=100, batch_size=128, verbose=0, callbacks=[TqdmCallback(verbose=1)], validation_split=0.2)

In [None]:
model.summary()

The reason for not proceeding as with the other models is the possibility offered by _keras_ of obtaining a __history__ object.

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16,5))
history_df = pd.DataFrame(history.history).copy().reset_index()

sns.lineplot(x='index', y='loss', data=history_df, color='blue', ax=ax[0], label='Training Loss')
sns.lineplot(x='index', y='val_loss', data=history_df, color='red', ax=ax[0], label='Validation Loss')
ax[0].set_title('Analysis of the loss for the FFNN over the "features" dataset')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss (Categorical Cross-Entropy)')
ax[0].legend()

sns.lineplot(x='index', y='accuracy', data=history_df, color='blue', ax=ax[1], label='Training Accuracy')
sns.lineplot(x='index', y='val_accuracy', data=history_df, color='red', ax=ax[1], label='Validation Accuracy')
ax[1].set_title('Analysis of the accuracy for the FFNN over the "features" dataset')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Accuracy')
ax[1].legend();

This is clearly overfitting. <br>
Unlike the previous cases, this time we are considering a validation set made entirely of _unseen_ data that do not influence the training procedure, so the final accuracy value is more credible, at least from a statistical point of view. The plot clearly shows that, already after a few epochs, the training accuracy (and likewise the training loss) converges towards "perfect" classification, while the generalization performance, which in our case is measured on the validation set (400 clips), converges to an accuracy of around 60%. Compared with the other models, this "handmade" neural network gives results similar to those of the Support Vector Machine, and better than those of the other three. <br>
All of this means that the results obtained via nested cross validation are presumably correct, and that it is a sound and practical approach when dealing with such a small amount of data. But this is still not enough: even with nested cross validation, overfitting has been reduced but not completely eliminated.
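
The gap visible in the plots can also be quantified directly from a Keras-style history dictionary. The sketch below uses made-up numbers purely for illustration (they are not the values obtained in this notebook), and the helper names `generalization_gap` and `best_epoch` are hypothetical.

```python
import numpy as np

def generalization_gap(history, last_n=10):
    """Average train-minus-validation accuracy over the last `last_n` epochs."""
    train = np.asarray(history['accuracy'][-last_n:])
    val = np.asarray(history['val_accuracy'][-last_n:])
    return float(np.mean(train - val))

def best_epoch(history):
    """Epoch (0-based) with the lowest validation loss."""
    return int(np.argmin(history['val_loss']))

# Illustrative numbers only, NOT the values obtained in this notebook
toy_history = {
    'accuracy':     [0.30, 0.60, 0.85, 0.95, 0.99],
    'val_accuracy': [0.28, 0.50, 0.58, 0.60, 0.60],
    'val_loss':     [3.00, 2.00, 1.60, 1.70, 1.90],
}

gap = generalization_gap(toy_history, last_n=2)
epoch = best_epoch(toy_history)
print(round(gap, 2), epoch)
```

With a real run, `history.history` could be passed in place of `toy_history`; a large positive gap together with a validation loss that starts rising again after `best_epoch` is the usual signature of overfitting.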

### Data Augmentation

The final answer to our problem lies in the last dataset that we have loaded, _augmented_features.csv_. This dataset was created in the first notebook by simply iterating over the clips and, for each of them, creating 4 new "augmented" versions with different approaches (noise addition, time shifting, pitch shifting and time stretching). In this way, instead of 2000 clips we end up with 10000 clips, a much more reasonable number for building reliable statistics.
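
Two of the four augmentations are simple enough to sketch with NumPy alone. The functions below are illustrative assumptions (names, noise level and shift amount are not taken from the first notebook); pitch shifting and time stretching usually rely on an audio library such as librosa (`librosa.effects.pitch_shift`, `librosa.effects.time_stretch`) and are not shown here.

```python
import numpy as np

def add_noise(signal, noise_level=0.005, rng=None):
    """Add white Gaussian noise scaled by `noise_level` (assumed value)."""
    rng = np.random.default_rng() if rng is None else rng
    return signal + noise_level * rng.normal(size=signal.shape)

def time_shift(signal, shift):
    """Circularly shift the signal by `shift` samples."""
    return np.roll(signal, shift)

sr = 8000                            # toy sampling rate
t = np.arange(sr) / sr
clip = np.sin(2 * np.pi * 440 * t)   # 1 s synthetic tone

rng = np.random.default_rng(0)
shift = sr // 10                     # shift by 0.1 s
noisy = add_noise(clip, rng=rng)
shifted = time_shift(clip, shift)
print(noisy.shape, shifted.shape)
```

Each augmented signal has the same length as the original, so the same feature extraction can be applied to it unchanged.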

Let's now repeat the previous analysis for the new dataset: this time we won't need nested cross validation, since we have enough data to build a proper test set on which to evaluate the model. <br>
_Note: using nested CV we would still get slightly better accuracies, but this time the improvement would come from an increase in overfitting rather than a reduction of it. In fact, by building a test set we get a good estimate of the generalization error, more reliable than the one obtained with nested CV, simply because that technique is useful when we have to rely on the same data for both grid search and cross validation_.
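
The hold-out scheme described above can be sketched as follows. This is a generic scikit-learn illustration on synthetic data, not the implementation of `Run_NonNested_Cross_Validation`; the 20% test fraction matches the 2000 test samples out of 10000 mentioned later, while the grid and the dataset are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the augmented dataset
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

# Keep 20% of the data aside as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('classifier', KNeighborsClassifier())])
param_grid = {'classifier__n_neighbors': [2, 5, 10],
              'classifier__weights': ['uniform', 'distance']}

# Grid search with (non-nested) cross validation on the training part only
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)

# Final estimate of the generalization accuracy on the held-out test set
final_accuracy = search.score(X_test, y_test)
print(search.best_params_, final_accuracy)
```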

In [None]:
augmented_rf = ClipsClassifier(dataset=augmented_features)
augmented_rf.Run_NonNested_Cross_Validation(model=RandomForestClassifier(), parameters=params_RF, compute_confusion_matrix=True)

augmented_mlp = ClipsClassifier(dataset=augmented_features)
augmented_mlp.Run_NonNested_Cross_Validation(model=MLPClassifier(), parameters=params_MLP, compute_confusion_matrix=True)

augmented_knc = ClipsClassifier(dataset=augmented_features)
augmented_knc.Run_NonNested_Cross_Validation(model=KNeighborsClassifier(), parameters=params_KNC,compute_confusion_matrix=True)

augmented_svm = ClipsClassifier(dataset=augmented_features)
augmented_svm.Run_NonNested_Cross_Validation(model=SVC(), parameters=params_SVM, compute_confusion_matrix=True)

Before comparing the results, let's also run, just this once, a grid search over a neural network, varying the number of epochs, the optimizer or the learning rate.

In [None]:
NN = KerasClassifier(build_fn=create_NN, verbose=0)

# Full grid of hyperparameters for the neural network
params_NN = {'classifier__epochs':[100], 'classifier__batch_size':[64, 128], 
             'classifier__optimizer':['adam', 'adamax'], 'classifier__lr':[0.01, 0.001], 
             'classifier__dropout_prob':[0.1, 0.5]}

# Single combination actually used below: it overrides the grid above,
# so only this set of hyperparameters is evaluated
params_NN = {'classifier__batch_size': [128], 'classifier__dropout_prob': [0.1], 'classifier__epochs': [100], 
             'classifier__lr': [0.01], 'classifier__optimizer': ['adamax']}

aug_nn_cc = ClipsClassifier(augmented_features).Setup_Classifier(encoder_method='onehot')
aug_nn_cc.Run_NonNested_Cross_Validation(model=NN, parameters=params_NN, compute_confusion_matrix=True)

_Interesting note: apparently, just changing the optimizer from __adam__ to __adamax__ leads to an accuracy improvement of up to 20%! This aspect deserves to be studied in more depth._

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8,6))

augmented_accuracies = pd.DataFrame(columns=['RF', 'MLP', 'KNC', 'SVM', 'NN'])
augmented_accuracies.loc[0] = [augmented_rf.final_accuracy, augmented_mlp.final_accuracy, 
                               augmented_knc.final_accuracy, augmented_svm.final_accuracy, aug_nn_cc.final_accuracy]

sns.barplot(data=augmented_accuracies, ax=ax)

for p in ax.patches:
    ax.annotate(format(p.get_height(), '.3f'), (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', xytext = (0, 9), textcoords = 'offset points')

ax.set_ylim(0,1)
ax.set_ylabel('Accuracy')
ax.set_title('Models accuracy over the augmented dataset');

### Again the overfitting

In [None]:
# Shuffle the dataset
aug_df_nn = augmented_features.copy()
aug_df_nn = aug_df_nn.sample(frac=1).reset_index(drop=True)

# Since we won't rely on the pipeline we need to preprocess the data directly here
X = aug_df_nn.drop(['label'], axis=1)
Y = aug_df_nn[['label']].to_numpy()     

# Standardize and use one-hot-encoding
X = StandardScaler().fit_transform(X)
Y = OneHotEncoder(sparse=False).fit_transform(Y)

# Take some reasonable parameters for the network
aug_model = create_NN(optimizer='adamax', dropout_prob=0.1, lr=0.01)

# Use the 20% of data as validation set
aug_history = aug_model.fit(X, Y, epochs=200, batch_size=128, verbose=0, callbacks=[TqdmCallback(verbose=1)], validation_split=0.2)

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16,5))
history_aug_df = pd.DataFrame(aug_history.history).copy().reset_index()

sns.lineplot(x='index', y='loss', data=history_aug_df, color='blue', ax=ax[0], label='Training Loss')
sns.lineplot(x='index', y='val_loss', data=history_aug_df, color='red', ax=ax[0], label='Validation Loss')
ax[0].set_title('Analysis of the loss for the FFNN over the "augmented_features" dataset')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('Loss (Categorical Cross-Entropy)')
ax[0].legend()

sns.lineplot(x='index', y='accuracy', data=history_aug_df, color='blue', ax=ax[1], label='Training Accuracy')
sns.lineplot(x='index', y='val_accuracy', data=history_aug_df, color='red', ax=ax[1], label='Validation Accuracy')
ax[1].set_title('Analysis of the accuracy for the FFNN over the "augmented_features" dataset')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Accuracy')
ax[1].legend();

The results improved with respect to before: this time the validation curve follows the training curve more closely, with the final accuracy remaining stable around 80% (it was 60% before). From this point of view we can be quite satisfied with the data augmentation procedure: it provided enough data to properly build an _unseen_ test set without relying on nested cross validation, and it also improved the actual performance of our models, reaching very good results (remember that the estimated human accuracy is 81.3%!).

## Best case deeper Analysis

In this second-to-last section we take one of the models trained on the augmented dataset with high accuracy and examine its performance in more depth and at a higher level. Our choice fell on the neural network.

The first thing worth examining is the confusion matrix of the model over the test set, which was computed in the previous step. The following plot shows the true labels of the test set (2000 samples) versus the predictions of the network, for each of the possible classes. Notice that, since the labels were encoded before being fed to the classifiers, they are sorted in alphabetical order, and so they are no longer grouped in macro-categories.
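
Both points (row-wise normalization and alphabetical ordering of the encoded classes) can be seen on a tiny example. The labels below are invented for illustration, and `LabelEncoder` is used here for brevity as an assumption; the notebook's classifier relies on its own `label_encoder`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder

# Invented labels, listed in a non-alphabetical order on purpose
y_true = ['dog', 'rain', 'dog', 'cat', 'rain', 'cat', 'dog', 'rain']
y_pred = ['dog', 'rain', 'cat', 'cat', 'rain', 'dog', 'dog', 'dog']

# The encoder stores the classes in sorted (alphabetical) order
encoder = LabelEncoder().fit(y_true)
classes = list(encoder.classes_)
print(classes)

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred, labels=classes)

# Normalize row by row, so that each row sums to 1
cm_norm = cm / cm.sum(axis=1, keepdims=True)
print(cm_norm)
```

After normalization the diagonal entry of each row is the fraction of clips of that class that were classified correctly.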

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(22,20))

# Normalize the confusion matrix row by row
row_sums = aug_nn_cc.confusion_matrix.sum(axis=1)
norm_human = (aug_nn_cc.confusion_matrix/row_sums[:, np.newaxis])

sns.heatmap(norm_human, cmap=sns.color_palette("light:b", as_cmap=True), ax=ax, cbar=False)

ax.set_title('Neural Network Confusion Matrix for the ESC-50 Dataset', fontsize=25, y=1.01)
ax.set_xticklabels(aug_nn_cc.label_encoder.categories_[0], rotation=90)
ax.set_yticklabels(aug_nn_cc.label_encoder.categories_[0], rotation=0)
ax.set_xlabel('Predicted Labels', fontsize=18)
ax.set_ylabel('True Labels', fontsize=18)

for x in range(50):
    for y in range(50):
        if norm_human[y, x] != 0:
            if x==y: c = 'white'
            else: c = 'grey'
            ax.text(x+0.5, y+0.5, "{:.0%}".format(norm_human[y, x]), fontsize=9, ha='center', color=c)

Note: 4 points are missing, which does not happen with the other classifiers.

But is there a way to check how our model behaves from the point of view of the different macro-categories? <br>
Yes: we just need to take the previous confusion matrix and sum, for each row, all the entries in the columns belonging to the same macro-category, and then add together the rows of each macro-category. Since, as said, the classes are sorted in alphabetical order, we need a specific function to group them again.
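
The grouping idea is easier to follow on a toy case first. The sketch below collapses a 4-class confusion matrix into 2 groups; the helper `group_confusion_matrix`, the matrix and the group assignment are all invented for illustration.

```python
import numpy as np

def group_confusion_matrix(cm, groups):
    """Sum the blocks of `cm` whose rows and columns belong to each pair of groups."""
    n = len(groups)
    grouped = np.zeros((n, n))
    for i, rows in enumerate(groups):
        for j, cols in enumerate(groups):
            # np.ix_ selects the sub-matrix with the given rows and columns
            grouped[i, j] = cm[np.ix_(rows, cols)].sum()
    return grouped

# Toy confusion matrix over 4 classes (rows: true, columns: predicted)
cm = np.array([[5, 1, 0, 0],
               [2, 4, 0, 0],
               [0, 1, 3, 2],
               [1, 0, 0, 5]])

# Classes 0 and 3 form the first group, classes 1 and 2 the second
groups = [[0, 3], [1, 2]]
grouped = group_confusion_matrix(cm, groups)
print(grouped)
```

The total number of samples is preserved by the grouping, which is a quick sanity check also applicable to the 50-class case.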

In [None]:
def compute_macro_conf_mat(confusion_matrix):
    """Given the confusion matrix for the 50 classes, this function return the corresponding mapped confusion 
    matrix of the macro-categories."""
    
    macro_confusion_matrix = np.zeros((5,5))
    
    # Map micro into macro categories
    animals  = [5, 13, 16, 18, 25, 29, 30, 34, 37, 39]
    natural  = [7, 14, 15, 35, 36, 38, 43, 44, 48, 49]
    humans   = [1, 2,   9, 12, 17, 21, 24, 32, 41, 42]
    domestic = [3, 10, 11, 19, 20, 26, 31, 33, 46, 47]
    urban    = [0, 4,   6,  8, 22, 23, 27, 28, 40, 45]
    
    macro_categories = [animals, natural, humans, domestic, urban]
    
    for i in range(5):
        for j in range(5):
            for cat in macro_categories[j]:
                macro_confusion_matrix[i,j] += sum(confusion_matrix[macro_categories[i], cat])

    return macro_confusion_matrix
        
macro_confusion_matrix = compute_macro_conf_mat(aug_nn_cc.confusion_matrix)
macro_confusion_matrix

In [None]:
macro_categories = ['Animal', 'Natural', 'Human', 'Domestic', 'Urban']

# Normalize the confusion matrix row by row
row_sums = macro_confusion_matrix.sum(axis=1)
norm_conf_macro = macro_confusion_matrix/row_sums[:, np.newaxis]

fig, ax = plt.subplots(1, 1, figsize=(8,6))
sns.heatmap(norm_conf_macro, cmap=sns.color_palette("light:b", as_cmap=True), annot=True, fmt='.2f', ax=ax)
ax.set_title('Neural Network Confusion Matrix over the macro-categories', fontsize=18, y=1.05)
ax.set_xticklabels(macro_categories, rotation=45)
ax.set_yticklabels(macro_categories, rotation=0)

ax.set_xlabel('Predicted Labels', fontsize=15)
ax.set_ylabel('True Labels', fontsize=15);

### Example of functionality

Let's take a random sample from the dataset and see how the model behaves.

In [None]:
aug_sample = augmented_features.sample(1, random_state=42)
display(aug_sample)
sample_feat = aug_sample.drop('label', axis=1)
sample_label = aug_sample['label'][aug_sample.index[0]]
sample_label_onehotencoded = aug_nn_cc.label_encoder.transform([[sample_label]])
print('True Label: ', sample_label, '(' + str(np.argmax(sample_label_onehotencoded, axis=1)[0]) + ')')

In [None]:
aug_nn_cc.clf.predict_proba(sample_feat)[:,19]

## Comprehensive recap of all the results

### Further possibilities?