# <span style="color:green"> Environmental Sound Classification </span>
## <span style="color:green">Notebook 3: Classification via Convolutional Neural Networks </span>

---
[Mattia Pujatti](mattia.pujatti.1@studenti.unipd.it), ID 1232236, Master's degree in Physics of Data

---

This notebook was produced as the final project for the course of Human Data Analytics, held by professors [Michele Rossi](rossi@dei.unipd.it) and [Francesca Meneghello](meneghello@dei.unipd.it), during the academic year 2019/2020 at the University of Padua.

### Table of Contents

1. #### [Introduction](#Introduction-to-Notebook-3) 
2. #### [Input Pipeline with TFRecords](#TFRecords)
3. #### [Classification with CNN models](#Convolutional-Neural-Networks)
    * #### [Effects of data augmentation](#with-data-augmentation)

## Project Presentation

*The main purpose of this notebook is to provide an efficient way, using machine learning techniques, to classify environmental sound clips belonging to one of the few publicly available datasets on the internet. <br>
Several approaches have been tested over the years, but only a few of them have been able to match or even exceed human classification accuracy, which was estimated at around 81.30%. <br>
The analysis is organized as follows: since the very first approaches were mainly focused on examining audio features extracted from raw audio files, we will provide a way to collect and organize those feature vectors and use them to distinguish among the different classes. Then, different classification architectures and techniques will be implemented and compared with each other, also to show how they react to different data manipulations (overfitting, numerical stability, ...). <br>
In the end, it will be shown that all those feature-based classifiers, without exception, underperform when compared with Convolutional Neural Networks applied directly to audio signals and their spectrograms (that is, without any feature extraction), and how this new approach opened up a large number of opportunities in terms of high-accuracy models for sound classification.*

### Summary of Notebook 1 and 2

## Introduction to Notebook 3

In this third and last notebook we will examine the possibility of applying Convolutional Neural Networks to the task of environmental sound classification, by converting all the sound clips provided in the ESC-50 dataset into images of time signals and/or spectrograms. <br>
We will first go through a particular way of handling data, as _TFRecord_ files, and we will construct an input pipeline for our images via tensorflow, in order to provide a training and validation procedure that is as customizable as possible. Then, we will check the performance of several CNNs, whose structure was inspired by previous works reported in the table [here](https://github.com/karolpiczak/ESC-50), and we will compare them, again using cross-validation. Moreover, we will also examine the augmented dataset built in the first notebook, verifying that augmenting the audio clips directly leads to better results than applying traditional image augmentation techniques to the corresponding spectrograms. <br>
In the end...

In [7]:
# Requirements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
import seaborn as sns
%matplotlib inline

import zipfile
import time
from glob import glob
import os
from sys import stdout
from PIL import Image
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split, StratifiedKFold

from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

import tensorflow as tf

import warnings
warnings.filterwarnings('ignore')
sns.set_theme()

To simplify the drafting of this third notebook, all the useful functions and classes defined in the previous notebooks have been collected into two _.py_ files, which can be imported and used as libraries.

In [3]:
# Import all the functionalities provided in the first notebook
import Notebook1 as nb1
from importlib import reload  
reload(nb1)

<module 'Notebook1' from '/home/mattia/Desktop/PhysicsofData/Human Data Analytics/Environmental-Sound-Classification/Notebook1.py'>

## TFRecords

As anticipated, handling the whole dataset of images can become computationally demanding, especially in the augmented case, both for the memory of the notebook and for the performance of the tensorflow input pipeline that we are going to construct. For this reason, using a binary file format to store our data can significantly improve performance, and consequently reduce the training time of our model. Binary data, in fact, takes up less space on disk, especially when compressed, and can be read much more efficiently. To this end, _tensorflow_ provides its own binary storage format, the [_TFRecord_](https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564) format, which comes with many preprocessing functionalities and makes it possible to load from disk and process only the data required at a given time, allowing large datasets to be handled efficiently.

For this reason, the first step of this notebook will be defining a set of functions allowing for the conversion of the image datasets constructed in the first notebook into TFRecord files. 
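
As a rough illustration of what such a conversion involves, the sketch below serializes one record and parses it back entirely in memory. The tiny 2x3 RGBA array and the label value are placeholders, not data from the notebook; the feature names `image` and `label` mirror the ones used in the cells that follow.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for a rendered plot: a tiny RGBA image and a class label.
image = np.arange(2 * 3 * 4, dtype=np.uint8).reshape(2, 3, 4)
label = 7

# Serialize: raw image bytes plus an integer label, wrapped in a tf.train.Example.
example = tf.train.Example(features=tf.train.Features(feature={
    'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image.tobytes()])),
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
}))
serialized = example.SerializeToString()

# Parse: the same feature description must be supplied when reading back.
feature_spec = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
}
parsed = tf.io.parse_single_example(serialized, feature_spec)

# The byte string carries no shape information, so the shape is restored by hand.
decoded = tf.reshape(tf.io.decode_raw(parsed['image'], out_type=tf.uint8), image.shape)

roundtrip_ok = bool(np.array_equal(decoded.numpy(), image)) and int(parsed['label']) == label
```

Since the raw bytes do not store the image shape, the reader has to know it in advance, which is why a fixed image shape must be passed to whatever code decodes the records.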

In [4]:
# The following functions can be used to convert a value to a type compatible with tf.train.Example.

def _bytes_feature(value):
  """Returns a bytes_list from a string / byte."""
  if isinstance(value, type(tf.constant(0))):
    value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
  return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
  """Returns a float_list from a float / double."""
  return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
  """Returns an int64_list from a bool / enum / int / uint."""
  return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

In [5]:
def Generate_TFRecord_Images(dataset, plot, out_name='data'):
    """Given a dataset of 'Clips' objects, this function constructs a TFRecord file by converting
    the plot (time signal, spectrogram or melspectrogram) generated from each clip."""
    
    if plot not in ['time_signals', 'spectrograms', 'melspectrograms']:
        print('Invalid value in the choice of the plot to generate.')
        return
    
    # Initialize a TFRecord file
    outputfile = os.getcwd() + '/' + out_name + '_' + plot + '.tfrecords'
    with tf.io.TFRecordWriter(outputfile) as tf_writer:
        
        # Loop over the clips
        with tqdm(total=len(dataset), desc='Looping over the dataset:', unit='clips', leave=False) as pbar:
            for _, row in dataset.iterrows():
                # Generate and reformat the desired plot
                fig, ax = plt.subplots(1,1,figsize=(6,3), dpi=50)
                if   plot == 'time_signals'   : row['clip'].DisplayWave(ax=ax)
                elif plot == 'spectrograms'   : row['clip'].DisplaySpectrogram(ax=ax, cbar=False)
                elif plot == 'melspectrograms': row['clip'].DisplayMelSpectrogram(ax=ax, cbar=False)
                ax.set_title('')
                ax.set_axis_off()

                canvas = FigureCanvas(fig)
                fig.tight_layout(pad=0)
                fig.canvas.draw()

                width, height = map(int, fig.get_size_inches() * fig.get_dpi())
                # np.frombuffer replaces the deprecated binary mode of np.fromstring
                image = np.frombuffer(canvas.tostring_argb(), dtype='uint8').reshape(height, width, 4)
                image = np.roll(a=image, shift=-1, axis=2)
                fig.clear()
                plt.close()
            
                # Create a dictionary with the features with the proper types
                # (ndarray.tobytes replaces the deprecated ndarray.tostring)
                features = {'label' : _int64_feature(row['clip'].target), 
                            'image' : _bytes_feature(image.tobytes())}
            
                # Create an example protocol buffer
                tf_example = tf.train.Example(features=tf.train.Features(feature=features))
                # Write the example into the TFRecords file
                tf_writer.write(tf_example.SerializeToString())
                pbar.update(1)
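
The `np.roll` call in the function above reorders the channels from ARGB, as returned by the canvas, to RGBA. A minimal sketch on a single made-up pixel shows the effect:

```python
import numpy as np

# One hypothetical pixel in ARGB order: alpha=255, red=10, green=20, blue=30.
argb = np.array([[[255, 10, 20, 30]]], dtype=np.uint8)

# Shifting the channel axis by -1 moves alpha from the first to the last position.
rgba = np.roll(argb, shift=-1, axis=2)
```

After the shift the pixel reads `[10, 20, 30, 255]`, i.e. red, green, blue, alpha.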

Since, as repeated several times in the second notebook, the main problem of the ESC-50 dataset is the limited number of clips available, we decided to apply cross-validation in order to obtain reliable statistics for the final accuracy of our models. To simplify the approach for TFRecords, we create the (stratified) K-Folds directly here, generating _k_ binary files that will be recombined into training and validation sets in a subsequent loop.
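
To make the effect of stratification concrete, here is a small sketch on placeholder labels that mimic the ESC-50 layout (50 classes, 40 clips each). The arrays `labels` and `clip_ids` are assumptions used only for illustration; no audio is loaded.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder labels mimicking the ESC-50 layout: 50 classes, 40 clips each.
labels = np.repeat(np.arange(50), 40)
clip_ids = np.arange(labels.size)  # hypothetical stand-in for the 'clip' column

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
per_class_counts = []
for _, fold_idx in skf.split(clip_ids, labels):
    # fold_idx holds the positional indices of the clips assigned to this fold
    fold_sizes.append(len(fold_idx))
    per_class_counts.append(np.bincount(labels[fold_idx], minlength=50))
```

Each of the five folds ends up with 400 clips and exactly 8 clips per class, so every TFRecord file written from a fold has the same class balance as the full dataset.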

In [6]:
def TFRecord_StratifiedKFold(dataset, plot, n_folds=5, dir_name='CV_tfrecords', out_name='data'):
    """In case a KFold splitting is needed, this function generate several TFRecord files to be used
    in a cross validation loop."""
    
    nb1.initialize_folder(dir_name)
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
    
    for i, (_, fold_idx) in tqdm(enumerate(skf.split(dataset['clip'], dataset['label'])), total=n_folds, desc='Creating folds', unit='folds'):
        # skf.split returns positional indices, so select the rows with iloc
        Generate_TFRecord_Images(dataset.iloc[fold_idx], plot, out_name=dir_name + '/' + out_name + '_fold_' + str(i)) 

In [7]:
esc50_dir = os.getcwd() + '/ESC-50-master/ESC-50-master/audio/'
#audio_data = nb1.Collect_Clips(esc50_dir, nb1.get_label_map(), nb1.macro_categories_map, leave_tqdm=False)

#TFRecord_StratifiedKFold(audio_data, plot='time_signals', n_folds=5, dir_name='timesig_CV_tfrecords')
#TFRecord_StratifiedKFold(audio_data, plot='spectrograms', n_folds=5, dir_name='spec_CV_tfrecords')
#TFRecord_StratifiedKFold(audio_data, plot='melspectrograms', n_folds=5, dir_name='melspec_CV_tfrecords')

We can do the same for the augmented dataset, for which we don't need a K-Fold splitting.

In [8]:
#augmented_data = nb1.get_augmented_data(os.getcwd() + '/augmented_clips/')
#Generate_TFRecord_Images(augmented_data, 'time_signals',    out_name='augmented_data')
#Generate_TFRecord_Images(augmented_data, 'spectrograms',    out_name='augmented_data')
#Generate_TFRecord_Images(augmented_data, 'melspectrograms', out_name='augmented_data')

## Convolutional Neural Networks

Now that we have collected and organized all our data into _TFRecord_ files, it is time to select the machine learning classifier that we will use to distinguish different sound categories based only on their spectrograms. <br>
Our choice fell on _Convolutional Neural Networks_: several papers have shown that using CNNs for spectrogram classification brings excellent results, to the point that the high accuracies obtained on the ESC-50 dataset and summarized [here](https://github.com/karolpiczak/ESC-50) have all been achieved thanks to those architectures. So, apparently, the ability to learn local structures from a two-dimensional input outperforms our previous approaches, which were mainly based on extracting feature vectors directly from the raw audio files. <br> 
So, even though we reached very good classification accuracy by exploiting data augmentation, below we will see that there are even better approaches, which yield better performance directly on the non-augmented data, at the expense of some speed and of datasets that are more difficult to handle.

Following the same strategy used also in the previous notebooks, we will define a whole class to handle all the necessary functions to study and fit such spectrogram datasets. 

This is the internal structure of the class `ImageClipsClassifier`:

* `__init__`: constructor of the class; it takes a machine learning model as input and allows some training
    and setup parameters to be fixed.
* `_Read_TFRecord`: reads and decodes, according to a predefined format, a sample of the TFRecord file.
* `_Augment_Dataset`: maps four image augmentation steps (brightness, contrast, noise addition, left-right flip) onto the input tf.data.Dataset, after repeating it five times.
* `_Preprocess_Dataset`: given the path of one (or more) TFRecord files, it loads their content into a dataset, creating an input pipeline of tensors that also applies some preprocessing and optional augmentation steps.
* `__train_step` and `__test_step`: functions to handle a customizable training loop for our model.
* `_Train_and_Evaluate`: given two batched datasets, it runs a training loop storing, for each epoch, the loss and the accuracy for both the training and validation sets.
* `Run_Cross_Validation` and `Run_Cross_Validation_from_images`: with several calls to `_Train_and_Evaluate` it is possible to implement a cross validation, particularly useful when the dataset is small; two functions are defined, one for when you have a folder of TFRecord files and one for when you have the images directly.
* `TrainTest_Model`: when the dataset is large enough no cross validation is needed, so it is more convenient to simply split it into training and validation sets. 
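
The order of the pipeline stages can be tried out on a miniature in-memory dataset before touching any TFRecord file. The sketch below is only an approximation of `_Preprocess_Dataset`: the random 4x8 RGBA images, the 3 classes and the batch size of 4 are made-up values, and the TFRecord parsing step is replaced by `from_tensor_slices`.

```python
import numpy as np
import tensorflow as tf

# Hypothetical miniature dataset: 10 random RGBA "images" of 4x8 pixels, 3 classes.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, 4, 8, 4), dtype=np.uint8)
labels = tf.one_hot(rng.integers(0, 3, size=10), depth=3)

ds = tf.data.Dataset.from_tensor_slices((images, labels))
ds = ds.cache()                                      # cache() returns a new dataset
ds = ds.shuffle(10, reshuffle_each_iteration=True)   # buffer as large as the dataset
# Rescale uint8 pixels from [0, 255] to [-1, 1]
ds = ds.map(lambda img, lab: (tf.cast(img, tf.float32) / 255.0 * 2.0 - 1.0, lab))
ds = ds.batch(4).prefetch(tf.data.AUTOTUNE)

batch_shapes = [tuple(x.shape) for x, _ in ds]
lowest = min(float(tf.reduce_min(x)) for x, _ in ds)
highest = max(float(tf.reduce_max(x)) for x, _ in ds)
```

With 10 samples and a batch size of 4, the dataset yields two full batches and one final batch of 2 samples, and all pixel values fall within [-1, 1].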

In [38]:
class ImageClipsClassifier():
    """The main purpose of this class is to collect all the necessary steps and functions to construct and
    train a CNN to our image datasets."""
    
    
    def __init__(self, model, n_classes=50, epochs=15, img_shape=(300, 600, 4), batch_size=16, verbose=1):
        """Initialize some global parameters, together with loss and accuracy functions."""
        
        self.model = model
        self.n_classes = n_classes
        self.epochs = epochs
        self.img_shape = img_shape
        self.batch_size = batch_size
        self.verbose = verbose

        self.history = pd.DataFrame(index=np.arange(self.epochs), columns=['train_loss', 'val_loss','train_metric', 'val_metric'])

        # Instantiate an optimizer.
        self.optimizer = tf.keras.optimizers.Adamax(learning_rate=0.0005, beta_1=0.9, beta_2=0.999)
        # Instantiate a loss function.
        self.loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

        # Prepare the metrics.
        self.train_acc_metric = tf.keras.metrics.CategoricalAccuracy()
        self.val_acc_metric = tf.keras.metrics.CategoricalAccuracy()
        

    def _Read_TFRecord(self, example):
        """Parse the input tf.train.Example proto using a proper dictionary."""
    
        # Create a dictionary describing the features.
        tfrecord_format = { 'image': tf.io.FixedLenFeature([], tf.string),
                            'label': tf.io.FixedLenFeature([], tf.int64) }
    
        example = tf.io.parse_single_example(example, tfrecord_format)
    
        # Decode the image
        image = tf.io.decode_raw(example['image'], out_type=tf.uint8)
        image = tf.reshape(image, shape=self.img_shape)

        # Read the label
        label = tf.cast(tf.one_hot(example['label'], depth=self.n_classes), tf.uint8)
    
        return image, label
    
    def _Augment_Dataset(self, ds):
        """Given a dataset of (image, label) pairs, it generates several new datasets
        according to some image augmentation techniques, concatenating them to the original
        object. In particular, the augmentation will be focused on:
        * brightness
        * contrast
        * noise addition
        * left-right flip."""
        
        ds = ds.repeat(5)
        # Adjust the brightness
        ds = ds.map(lambda image, label: (tf.image.random_brightness(image, 0.1), label))
        # Adjust the contrast
        ds = ds.map(lambda image, label: (tf.image.random_contrast(image, 0.8, 1.2), label))
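        # Add random integer noise in {0, 1}, drawn independently for each image
        # (for integer dtypes maxval is exclusive, so maxval=1 would always yield zeros)
        def add_noise(image, label):
            noise = tf.random.uniform(shape=self.img_shape, minval=0, maxval=2, dtype=tf.int32)
            # Clip before casting back, to avoid uint8 wrap-around at 255
            noisy = tf.clip_by_value(tf.cast(image, tf.int32) + noise, 0, 255)
            return tf.cast(noisy, tf.uint8), label
        ds = ds.map(add_noise)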
        # Left right flip
        ds = ds.map(lambda image, label: (tf.image.random_flip_left_right(image), label))

        return ds
    
    
    def _Preprocess_Dataset(self, files, augmentation=False):
        """Given the path of a TFRecord file, the function read the file and apply some
        preprocessing steps over the data stored. In particular:
        * cache 
        * augmentation (optional)
        * shuffle
        * image normalization
        * batch
        * prefetch
        """
        
        # Load the dataset
        ds = tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
        # cache() returns a new dataset, so the result must be reassigned
        ds = ds.cache()
        
        # Read and convert the data stored
        ds = ds.map(self._Read_TFRecord)

        # Augment data (optional)
        if augmentation: ds = self._Augment_Dataset(ds)
    
        # Shuffle the dataset
        full_len = ds.reduce(np.int64(0), lambda x, _: x + 1).numpy()
        ds = ds.shuffle(full_len, reshuffle_each_iteration=True)
        # If the order is irrelevant, this will speed up the computation
        opt = tf.data.Options()
        opt.experimental_deterministic = False
        ds = ds.with_options(opt)
        
        # Rescale the images in [-1,1]
        ds = ds.map(lambda image, label: ((1/255.)*tf.cast(image, tf.float32)*2.0-1.0, label))
    
        # Batch and prefetch
        ds = ds.batch(self.batch_size)
        ds = ds.prefetch(tf.data.AUTOTUNE)
    
        return ds
        
        
    @tf.function
    def __train_step(self, x, y): 
        with tf.GradientTape() as tape:
            # Forward Pass.
            logits = self.model(x, training=True)
            # Compute the loss for the minibatch.
            loss_value = self.loss_fn(y, logits)
        # Use the gradient tape to automatically retrieve the gradients of the trainable variables with respect to the loss.
        grads = tape.gradient(loss_value, self.model.trainable_weights)
        # Run one step of gradient descent.
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_weights))
        # Update training metric.
        self.train_acc_metric.update_state(y, logits)
        return loss_value
    
    
    @tf.function
    def __test_step(self, x, y):
        val_logits = self.model(x, training=False)
        # Compute the loss for the minibatch.
        loss_value = self.loss_fn(y, val_logits)
        # Update val metrics
        self.val_acc_metric.update_state(y, val_logits)   
        return loss_value

    
    def _Train_and_Evaluate(self, training_set, validation_set, lengths):
        """Given two batched tensorflow datasets, this function trains the model over the batches of 
        the training set, then evaluates it over the validation set, computing the loss and accuracy
        and filling a history dataframe with those values for each epoch."""
        
        with tqdm(total=self.epochs, desc='Training epochs:', unit='epochs', leave=False) as epochs_pbar:
          #for epoch in tqdm_epochs:
          for epoch in range(self.epochs):
            
            train_batch_errors = []
            val_batch_errors = []

            # Iterate over the batches of the dataset.
            with tqdm(total=lengths['train'], desc='Training over the minibatches:', unit='batches', leave=False) as train_pbar:
                for step, (x_batch_train, y_batch_train) in enumerate(training_set):
                    # Manually detect the end of the loop
                    if step >= lengths['train']: 
                        train_pbar.close()
                        break
                    # Run training step
                    loss_train_value = self.__train_step(x_batch_train, y_batch_train)
                    train_pbar.update(1)
                    train_batch_errors.append(loss_train_value)
                
            # Display metrics at the end of each epoch.
            train_loss = sum(train_batch_errors)
            train_acc = self.train_acc_metric.result()

            # Reset training metrics at the end of each epoch
            self.train_acc_metric.reset_states()

            # Run a validation loop at the end of each epoch.
            with tqdm(total=lengths['val'], desc='Evaluating over the validation set:', unit='batches', leave=False) as val_bar:
                for step, (x_batch_val, y_batch_val) in enumerate(validation_set):
                    # Manually detect the end of the loop 
                    if step >= lengths['val']: 
                        val_bar.close()
                        break
                    loss_val_value = self.__test_step(x_batch_val, y_batch_val)
                    val_bar.update(1)
                    val_batch_errors.append(loss_val_value)
                
            val_loss = sum(val_batch_errors)
            val_acc = self.val_acc_metric.result()
            self.val_acc_metric.reset_states()

            self.history.loc[epoch] = [train_loss.numpy()/lengths['train'], val_loss.numpy()/lengths['val'], train_acc.numpy(), val_acc.numpy()]
            epochs_pbar.update(1)
            if self.verbose >= 1:
                stdout.write('\rTraining Loss: {:.3f}   Validation Loss: {:.3f}     Training Accuracy:  {:.3f}   Validation Accuracy:   {:.3f}\n'.format(
                    train_loss.numpy()/lengths['train'], val_loss.numpy()/lengths['val'], train_acc.numpy(), val_acc.numpy()))
                stdout.flush()
                
    
    
    def Run_Cross_Validation(self, path):
        """Given a path with n TFRecord files representing the folds of our dataset, we run a cross validation
        algorithm over such folds."""
        
        # Loop over the TFRecords in the folder, keeping just one of them as validation set, and load
        # the remaining one as training sets
        tfrecords = glob(path + '*.tfrecords')
        
        # Save initial setup to reset the weights later
        self.model.save_weights('dummy_model.h5')
        
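        # All the columns filled in the loop below must exist in the results dataframe
        cross_results = pd.DataFrame(0, index=np.arange(self.epochs*len(tfrecords)), columns=['fold', 'train_loss', 'val_loss', 'train_metric', 'val_metric'])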
        
        for fold, val_file in tqdm(enumerate(tfrecords), desc='Running Cross Validation:', total=len(tfrecords), unit='folds'):

            train_ds = self._Preprocess_Dataset([x for x in tfrecords if x != val_file], augmentation=False)
            val_ds = self._Preprocess_Dataset(val_file, augmentation=False)

            lengths = {'train' : train_ds.reduce(np.int64(0), lambda x, _: x + 1).numpy(),
                       'val'   :   val_ds.reduce(np.int64(0), lambda x, _: x + 1).numpy()}
            self._Train_and_Evaluate(train_ds, val_ds, lengths)

            cross_results.loc[fold*self.epochs:(fold+1)*self.epochs-1, 'fold'] = [fold+1]*self.epochs
            cross_results.loc[fold*self.epochs:(fold+1)*self.epochs-1, ['train_loss', 'val_loss', 'train_metric', 'val_metric']] = self.history.values

            # Reset the weights of the model at each iteration of the cross validation
            self.model.load_weights('dummy_model.h5')
            
        # Remove the dummy file created
        os.remove('dummy_model.h5')

        return cross_results


    def TrainTest_Model(self, tfrecord, test_frac=0.2, augment=False):
        """Given a TFRecord file, the function loads and preprocesses the dataset inside it, and splits it into training and test sets.
        Then, it runs the training + evaluation procedure over the datasets obtained."""
        
        full_dataset = self._Preprocess_Dataset(tfrecord, augmentation=augment)
        full_len = full_dataset.reduce(np.int64(0), lambda x, _: x + 1).numpy()
        train_size = int(full_len*(1-test_frac))

        train_ds = full_dataset.take(train_size)
        val_ds = full_dataset.skip(train_size)

        lengths = {'train' : train_ds.reduce(np.int64(0), lambda x, _: x + 1).numpy(),
                   'val'   :   val_ds.reduce(np.int64(0), lambda x, _: x + 1).numpy()}


        self._Train_and_Evaluate(train_ds, val_ds, lengths)

        return self.history
                
                
  
    def Run_Cross_Validation_from_images(self, path, n_folds=5):
        """In case you have a set of folders with the images, instead of the TFRecord files, you can
        load those images as tensors as well, and then run a cross validation over them."""
        
        # Read data folder
        dataset = pd.DataFrame()
        dataset['filename'] = glob(path + '*/*.png')
        dataset['class'] = dataset['filename'].str.split('/').str[-2]
        
        # Instantiate an object ImageDataGenerator, for scaling the images in [-1,1]
        imgdatagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, preprocessing_function=lambda img: img*2.0-255.)

        # Create the folds grouping some minibatches
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42).split(dataset['filename'], dataset['class'])

        # Save initial setup to reset the weights later
        self.model.save_weights('dummy_model.h5')

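        # All the columns filled in the loop below must exist in the results dataframe
        cross_results = pd.DataFrame(0, index=np.arange(self.epochs*n_folds), columns=['fold', 'train_loss', 'val_loss', 'train_metric', 'val_metric'])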

        for fold, (train_index, test_index) in tqdm(enumerate(skf), desc='Running Cross Validation:', total=n_folds, unit='folds'):
            
            # Create an input pipeline for the tensors
            train_datagen = imgdatagen.flow_from_dataframe(dataframe = dataset.loc[train_index,:], target_size = self.img_shape[0:2], 
                                                     color_mode = 'rgba', class_mode = 'categorical', batch_size = self.batch_size, shuffle = True)
      
            test_datagen = imgdatagen.flow_from_dataframe(dataframe = dataset.loc[test_index,:], target_size = self.img_shape[0:2], 
                                                     color_mode = 'rgba', class_mode = 'categorical', batch_size = self.batch_size, shuffle = True)
        
            lengths = {'train' : len(train_datagen), 'val' : len(test_datagen)}
            self._Train_and_Evaluate(train_datagen, test_datagen, lengths)

            cross_results.loc[fold*self.epochs:(fold+1)*self.epochs-1, 'fold'] = [fold+1]*self.epochs
            cross_results.loc[fold*self.epochs:(fold+1)*self.epochs-1, ['train_loss', 'val_loss', 'train_metric', 'val_metric']] = self.history.values

            # Reset the weights of the model at each iteration of the cross validation
            self.model.load_weights('dummy_model.h5')
            
        # Remove the dummy file created
        os.remove('dummy_model.h5')

        return cross_results  

Then, it is all about selecting a proper model to fit our data: as can be seen from the constructor of the class, we have already made several choices aimed at obtaining the best performance, leaving the specific definition of the architecture outside the class. In particular:
* labels are one-hot encoded, which is particularly convenient with neural networks, since the last layer can have exactly 50 neurons, one for each possible class of our spectrograms;
* `CategoricalCrossentropy` is the loss function and `CategoricalAccuracy` the main metric, the natural choices for a single-label, multi-class problem with one-hot targets; the loss is computed with `from_logits=True`, so the last layer of the model has no softmax activation;
* the optimizer is `Adamax`, which apparently outperforms `Adam` for almost every architecture studied, with a fairly small learning rate of 0.0005, which turned out to be a good tradeoff between convergence speed and final accuracy.
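
To see what a cross-entropy computed from logits looks like on one-hot targets, here is a NumPy sketch. It is a plain re-implementation for illustration, not the Keras code itself, and the two target classes are arbitrary.

```python
import numpy as np

def softmax_cross_entropy(one_hot, logits):
    """Mean categorical cross-entropy computed from raw (unnormalized) logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(one_hot * log_probs).sum(axis=1).mean())

n_classes = 50
labels = np.array([3, 17])                  # hypothetical integer targets
one_hot = np.eye(n_classes)[labels]         # shape (2, 50), a single 1 per row

uniform_logits = np.zeros((2, n_classes))   # a network with no preference at all
loss_uniform = softmax_cross_entropy(one_hot, uniform_logits)

confident_logits = 20.0 * one_hot           # large score on the correct class only
loss_confident = softmax_cross_entropy(one_hot, confident_logits)
```

With 50 classes, a model that assigns the same score to every class has a loss of ln(50), roughly 3.91, which is a useful reference value when reading the training curves: losses near that level mean the network is still at chance.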

The architecture used is the following, in which we can distinguish 3 convolutional blocks, characterized by an increasing number of filters and a progressively smaller kernel size, in order to ease the learning of features that become smaller and smaller as they pass through the network. At the end of each block a MaxPooling layer reduces the size of the feature maps: particularly relevant is the first pooling layer, whose strides are set to (2,4), so that the feature maps go from a rectangular to a square shape. Then, there is a fourth, fully-connected block that ends with a layer of exactly 50 neurons, corresponding to the 50 possible classes of our images. As activation function we selected `PReLU` for all the layers, since it helps the network avoid vanishing gradients without increasing too much the time needed to process each epoch. <br>
This architecture was inspired by several convolutional models and canonical structures that have been used for this kind of problem, in particular by [AlexNet](https://www.jeremyjordan.me/convnet-architectures/), even though the latter was designed for the ImageNet dataset.
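
The way the spatial size shrinks through the network can be checked by hand. The helper functions below are illustrative names, not part of the notebook; they apply the standard output-size formulas for `padding='same'` convolutions and for pooling with the default `padding='valid'`.

```python
import math

def conv_same(size, stride=1):
    """Output length of a convolution with padding='same'."""
    return math.ceil(size / stride)

def pool_valid(size, pool=3, stride=2):
    """Output length of a pooling layer with the default padding='valid'."""
    return (size - pool) // stride + 1

h, w = 300, 600
h, w = conv_same(h, 4), conv_same(w, 4)                   # Conv2D 11x11, strides=4
after_conv1 = (h, w)
h, w = pool_valid(h, stride=2), pool_valid(w, stride=4)   # MaxPool2D(3, strides=(2,4))
after_pool1 = (h, w)
h, w = pool_valid(h), pool_valid(w)                       # second block (convs keep the size)
after_pool2 = (h, w)
h, w = pool_valid(h), pool_valid(w)                       # third block (convs keep the size)
after_pool3 = (h, w)
flattened = h * w * 256                                   # input size of the first Dense layer
```

The first pooling layer turns the 75x150 feature maps into 37x37 ones, which is the rectangular-to-square step described above; by the same formulas the last pooling layer should output 8x8 maps, giving 16384 features after flattening.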

In [9]:
model = Sequential([
    layers.InputLayer(input_shape=(300, 600, 4)),

    layers.Conv2D(filters=64, kernel_size=11, strides=4, padding='same'),
    layers.PReLU(),
    layers.MaxPool2D(3, strides=(2,4)),

    layers.Conv2D(filters=128, kernel_size=5, padding='same'),
    layers.PReLU(),
    layers.MaxPool2D(3, strides=2),
    
    layers.Conv2D(filters=256, kernel_size=3, padding='same'),
    layers.PReLU(),
    layers.Conv2D(filters=256, kernel_size=3, padding='same'),
    layers.PReLU(),
    layers.Conv2D(filters=256, kernel_size=3, padding='same'),
    layers.PReLU(),
    layers.MaxPool2D(3, strides=2),

    layers.Dropout(0.5),
    layers.Flatten(),
    layers.Dense(256),
    layers.PReLU(),
    layers.Dropout(0.5),
    layers.Dense(256),
    layers.PReLU(),
    layers.Dropout(0.5),
    layers.Dense(50),
])

In [10]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 75, 150, 64)       31040     
_________________________________________________________________
p_re_lu (PReLU)              (None, 75, 150, 64)       720000    
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 37, 37, 64)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 37, 37, 128)       204928    
_________________________________________________________________
p_re_lu_1 (PReLU)            (None, 37, 37, 128)       175232    
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 18, 18, 128)       0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 18, 18, 256)       2

Let's start training our model on the dataset of _spectrograms_. As anticipated, we will use a cross-validation approach with 5 folds, to compensate at least partially for the lack of data, so that the CNN is trained on 1600 samples and validated on 400.
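
The split sizes, and the number of minibatches per epoch that follows from them, come from simple arithmetic; a quick check, assuming the 2000 clips of ESC-50 and the batch size of 16 used in the next cell:

```python
import math

n_clips, n_folds, batch_size = 2000, 5, 16

val_size = n_clips // n_folds        # one fold held out for validation
train_size = n_clips - val_size      # the remaining four folds
train_batches = math.ceil(train_size / batch_size)
val_batches = math.ceil(val_size / batch_size)
```

This gives 100 training batches and 25 validation batches per epoch for each fold.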

In [37]:
ICC_spectrograms = ImageClipsClassifier(model, batch_size=16, epochs=100, verbose=0)
spectrograms_cv_scores = ICC_spectrograms.Run_Cross_Validation(os.getcwd() + '/spec_CV_tfrecords/')

Running Cross Validation::   0%|          | 0/5 [00:00<?, ?folds/s]

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Index'
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module 'gast' has no attribute 'Index'


Training epochs::   0%|          | 0/100 [00:00<?, ?epochs/s]

Training over the minibatches::   0%|          | 0/100 [00:00<?, ?batches/s]

KeyboardInterrupt: 

Here, we report the trend of the accuracy and the loss for both sets as the training proceeds. We decided to represent this behavior using boxplots for every epoch, to summarize the values of the accuracy/loss across the different folds. The main purpose of this choice was to keep the statistics of the results under control, in particular to check that the folds were created properly and did not behave too differently from one another.

In [11]:
spectrograms_cv_analysis = spectrograms_cv_scores.copy()
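# Number the epochs within each fold (1, 2, ..., n_epochs), whatever the number of epochs used
spectrograms_cv_analysis['epoch'] = spectrograms_cv_analysis.groupby('fold').cumcount() + 1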

fig, ax = plt.subplots(1, 2, figsize=(24,8))

sns.boxplot(x='epoch', y='train_metric', data=spectrograms_cv_analysis.reset_index(), color='blue', ax=ax[0])
sns.boxplot(x='epoch', y='val_metric', data=spectrograms_cv_analysis.reset_index(), color='red', ax=ax[0])

ax[0].set_xticks([i for i in range(0, 101, 10)])
ax[0].set_xticklabels([i for i in range(0,101,10)])
ax[0].set_xlabel('Epochs', fontsize=15)
ax[0].set_ylabel('Accuracy', fontsize=15)
ax[0].set_title('Training and Validation Accuracy for the spectrograms dataset', fontsize=20, y=1.03)
ax[0].legend(handles=[mpatches.Patch(color='navy', label='training set'), mpatches.Patch(color='firebrick', label='validation set')], fontsize=15)

sns.boxplot(x='epoch', y='train_loss', data=spectrograms_cv_analysis.reset_index(), color='blue', ax=ax[1])
sns.boxplot(x='epoch', y='val_loss', data=spectrograms_cv_analysis.reset_index(), color='red', ax=ax[1])

ax[1].set_xticks([i for i in range(0, 101, 10)])
ax[1].set_xticklabels([i for i in range(0,101,10)])
ax[1].set_xlabel('Epochs', fontsize=15)
ax[1].set_ylabel('Categorical Cross-Entropy', fontsize=15)
ax[1].set_title('Training and Validation Loss for the spectrograms dataset', fontsize=20, y=1.03)
ax[1].legend(handles=[mpatches.Patch(color='navy', label='training set'), mpatches.Patch(color='firebrick', label='validation set')], fontsize=15);

NameError: name 'spectrograms_cv_scores' is not defined


### with data augmentation
