# Classifying Urban sounds using Deep Learning

## 4 Model Refinement 

### Load Preprocessed data 

In [1]:
# retrieve the preprocessed data from previous notebook

%store -r x_train 
%store -r x_test 
%store -r y_train 
%store -r y_test 
%store -r yy 
%store -r le

#### Model refinement

In our initial attempt, we achieved the following classification accuracy scores: 

* Training data Accuracy:  92.3% 
* Testing data Accuracy:  87% 

We will now see if we can improve upon that score using a Convolutional Neural Network (CNN). 

#### Feature Extraction refinement 

In the previous feature extraction stage, the MFCC vectors varied in size between audio files, depending on each sample's duration. 

However, CNNs require a fixed size for all inputs. To overcome this, we will zero-pad the output vectors so that they are all the same size. 

In [2]:
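import numpy as np
import librosa

max_pad_len = 174  # number of MFCC frames every file is padded to

def extract_features(file_name):
    """Return a (40, max_pad_len) MFCC matrix for file_name, or None on failure."""
    try:
        audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast') 
        mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
        # Trim clips longer than max_pad_len frames so pad_width is never negative
        mfccs = mfccs[:, :max_pad_len]
        pad_width = max_pad_len - mfccs.shape[1]
        # Zero-pad the time axis on the right
        mfccs = np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')
        
    except Exception as e:
        print("Error encountered while parsing file: ", file_name, e)
        return None 
     
    return mfccs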
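
As a quick shape check, the sketch below applies the same kind of zero-padding to synthetic arrays standing in for MFCC matrices. The helper name `pad_or_truncate` and the random inputs are illustrative assumptions, not part of the original notebook. The truncation branch covers clips that yield more than `max_pad_len` frames, where a negative pad width would make `np.pad` raise an error.

```python
import numpy as np

max_pad_len = 174  # same target width as the notebook


def pad_or_truncate(mfccs, target_len=max_pad_len):
    """Zero-pad (or cut) the time axis so every matrix is (n_mfcc, target_len)."""
    n_frames = mfccs.shape[1]
    if n_frames >= target_len:
        return mfccs[:, :target_len]
    pad_width = target_len - n_frames
    return np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')


short_clip = np.random.rand(40, 60)   # stand-in for a short recording
long_clip = np.random.rand(40, 200)   # stand-in for a longer recording

padded = pad_or_truncate(short_clip)
trimmed = pad_or_truncate(long_clip)
print(padded.shape, trimmed.shape)
```

Both results have the same shape, so they can be stacked into a single array for the CNN.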

In [3]:
# Load various imports 
import pandas as pd
import os
import librosa

# Set the path to the full UrbanSound dataset 
fulldatasetpath = 'P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/UrbanSound8K/audio/'

metadata = pd.read_csv('P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/UrbanSound8K/metadata/UrbanSound8K3.csv')

features = []

# Iterate through each sound file and extract the features 
for index, row in metadata.iterrows():
    
    file_name = os.path.join(os.path.abspath(fulldatasetpath),'fold'+str(row["fold"])+'/',str(row["slice_file_name"]))
    
    class_label = row["class"]
    data = extract_features(file_name)
    
    features.append([data, class_label])

# Convert into a pandas DataFrame 
featuresdf = pd.DataFrame(features, columns=['feature','class_label'])

print('Finished feature extraction from ', len(featuresdf), ' files') 

Finished feature extraction from  368  files


In [4]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

# Convert features and corresponding classification labels into numpy arrays
X = np.array(featuresdf.feature.tolist())
y = np.array(featuresdf.class_label.tolist())

# Encode the classification labels
le = LabelEncoder()
yy = to_categorical(le.fit_transform(y)) 

# split the dataset 
from sklearn.model_selection import train_test_split 

x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.2, random_state = 42)

Using TensorFlow backend.
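
To illustrate what the label encoding step produces, here is a NumPy-only sketch of the same idea on toy labels. It is not how `LabelEncoder` or `to_categorical` are implemented; the label values are made up, and `np.unique` is used only because it also sorts the class names before assigning integer indices.

```python
import numpy as np

labels = np.array(["siren", "dog_bark", "siren", "car_horn"])  # toy labels

# np.unique sorts the class names and returns each label's integer index,
# mirroring the sorted mapping that LabelEncoder produces
classes, encoded = np.unique(labels, return_inverse=True)

# Selecting rows of the identity matrix gives one row per sample
# with a single 1 in the column of its class
one_hot = np.eye(len(classes))[encoded]

print(classes)
print(encoded)
print(one_hot)
```

Each row of `one_hot` has the same width as the number of classes, which is why `yy.shape[1]` can later be used as the size of the output layer.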


### Convolutional Neural Network (CNN) model architecture 


We will modify our model to be a Convolutional Neural Network (CNN), again using Keras with a TensorFlow backend. 

Again we will use a `sequential` model, starting with a simple model architecture, consisting of four `Conv2D` convolution layers, with our final output layer being a `dense` layer. 

The convolution layers are designed for feature detection. Each one works by sliding a filter window over the input, multiplying the filter element-wise with the values under the window, summing the products and storing the result in a feature map. This operation is known as a convolution. 
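
A minimal NumPy sketch of this sliding-window operation on a single channel follows. The function name `conv2d_valid` and the toy input are illustrative assumptions. Keras uses an optimised implementation and, like most deep learning libraries, does not flip the kernel (strictly speaking a cross-correlation).

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1) and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map


image = np.arange(12, dtype=float).reshape(3, 4)   # toy 3x4 input
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # toy 2x2 filter
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)
print(feature_map)
```

With a 2x2 filter and no padding, each spatial dimension of the output is one smaller than the input.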


The `filters` parameter specifies the number of filters (output feature maps) in each layer. The layers increase in size from 16 to 32, 64 and 128 filters, while the `kernel_size` parameter specifies the size of the kernel window, which in this case is 2, resulting in a 2x2 filter matrix. 

The first layer will receive an input of shape (40, 174, 1), where 40 is the number of MFCCs, 174 is the number of frames after padding, and 1 is the single channel, signifying that the audio is mono. 

The activation function we will be using for our convolutional layers is `ReLU` which is the same as our previous model. We will use a smaller `Dropout` value of 20% on our convolutional layers. 

Each convolutional layer has an associated pooling layer of `MaxPooling2D` type, and the final convolutional layer is additionally followed by a `GlobalAveragePooling2D` layer. The pooling layers reduce the dimensionality of the model (by reducing the parameters and subsequent computation requirements), which serves to shorten the training time and reduce overfitting. Max pooling takes the maximum value in each window, while global average pooling takes the average of each feature map, which is suitable for feeding into our `dense` output layer.  
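
The two pooling types can be sketched in NumPy as below. The function names and toy arrays are illustrative assumptions, not Keras internals.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window (odd edges are dropped)."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))


def global_average_pool(feature_maps):
    """Average each channel of a (height, width, channels) array down to one number."""
    return feature_maps.mean(axis=(0, 1))


fmap = np.array([[1.0, 2.0, 5.0, 6.0, 0.0],
                 [3.0, 4.0, 7.0, 8.0, 0.0],
                 [9.0, 1.0, 2.0, 3.0, 0.0]])
pooled = max_pool_2x2(fmap)
print(pooled)

# Two constant channels make the per-channel average easy to read off
stack = np.stack([np.full((3, 4), 2.0), np.full((3, 4), 5.0)], axis=-1)
averaged = global_average_pool(stack)
print(averaged)
```

Global average pooling leaves one value per filter, so the 128 feature maps of the last convolutional layer become a vector of length 128 for the `dense` layer.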

Our output layer will have 10 nodes (`num_labels`), which matches the number of possible classifications. The activation for our output layer is `softmax`. Softmax makes the output sum up to 1 so the output can be interpreted as probabilities. The model will then make its prediction based on which option has the highest probability.
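
As a small illustration of softmax, the sketch below converts made-up raw scores for three classes into probabilities and picks the most likely one. The function and the scores are illustrative assumptions, not values from the model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]   # made-up scores for three classes
probs = softmax(logits)
predicted_index = probs.index(max(probs))
print(probs, predicted_index)
```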

In [5]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.optimizers import Adam
from keras.utils import np_utils
from sklearn import metrics 

num_rows = 40
num_columns = 174
num_channels = 1

x_train = x_train.reshape(x_train.shape[0], num_rows, num_columns, num_channels)
x_test = x_test.reshape(x_test.shape[0], num_rows, num_columns, num_channels)

num_labels = yy.shape[1]
filter_size = 2

# Construct model 
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, input_shape=(num_rows, num_columns, num_channels), activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=32, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=64, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=128, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))
model.add(GlobalAveragePooling2D())

model.add(Dense(num_labels, activation='softmax')) 
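
The output shape and parameter count of each layer can be worked out by hand, which is a useful cross-check against `model.summary()`. The sketch below assumes the Keras defaults used above (no padding, stride 1 for convolutions, non-overlapping 2x2 pooling); the helper names are hypothetical.

```python
def conv_out(size, kernel=2):
    """Output length of an unpadded convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Output length of non-overlapping pooling (any remainder is dropped)."""
    return size // pool


def conv_params(in_channels, filters, kernel=2):
    """kernel*kernel*in_channels weights plus one bias, per filter."""
    return (kernel * kernel * in_channels + 1) * filters


rows, cols, channels = 40, 174, 1
trace = []
for filters in (16, 32, 64, 128):
    params = conv_params(channels, filters)
    rows, cols = conv_out(rows), conv_out(cols)
    trace.append(("conv", rows, cols, filters, params))
    rows, cols = pool_out(rows), pool_out(cols)
    trace.append(("pool", rows, cols, filters, 0))
    channels = filters

# Global average pooling leaves `channels` values; the dense layer maps them to 10 classes
dense_params = (channels + 1) * 10

for entry in trace:
    print(entry)
print("dense params:", dense_params)
```

The first entries, (39, 173, 16) with 80 parameters and (18, 85, 32) with 2080 parameters, agree with the summary printed by the notebook.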

### Compiling the model 

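For compiling our model, we will use the same three parameters as the previous model: the loss function (`categorical_crossentropy`), the metric (`accuracy`) and the optimizer (`adam`). 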

In [6]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 

In [7]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=1)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 39, 173, 16)       80        
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 19, 86, 16)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 19, 86, 16)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 18, 85, 32)        2080      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 9, 42, 32)         0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 9, 42, 32)         0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 8, 41, 64)        

### Training 

Here we will train the model. As training a CNN can take a significant amount of time, we will start with a low number of epochs and a low batch size. If the output shows that the model is converging, we will increase both numbers.  
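
For a sense of scale, the number of weight updates follows directly from the batch size and epoch count. The sketch below uses the 294 training samples reported in the training log and the two configurations that appear in the training cell; the helper name `updates` is hypothetical.

```python
import math

n_train = 294  # training samples reported in the training log


def updates(n_samples, batch_size, epochs):
    """Total gradient updates: the last, partial batch of each epoch still counts as one step."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


small_run = updates(n_train, batch_size=128, epochs=12)
large_run = updates(n_train, batch_size=256, epochs=72)
print(small_run, large_run)
```

With so few samples, a batch size of 256 means only two updates per epoch, which is one reason a larger number of epochs is needed before the validation loss settles.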

In [9]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

#num_epochs = 12
#num_batch_size = 128

num_epochs = 72
num_batch_size = 256

checkpointer = ModelCheckpoint(filepath= 'P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_cnn.hdf5', 
                               verbose=1, save_best_only=True)
start = datetime.now()

model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(x_test, y_test), callbacks=[checkpointer], verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Train on 294 samples, validate on 74 samples
Epoch 1/72

Epoch 00001: val_loss improved from inf to 3.10662, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_cnn.hdf5
Epoch 2/72

Epoch 00002: val_loss did not improve from 3.10662
Epoch 3/72

Epoch 00003: val_loss improved from 3.10662 to 2.67576, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_cnn.hdf5
Epoch 4/72

Epoch 00004: val_loss improved from 2.67576 to 2.28524, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_cnn.hdf5
Epoch 5/72

Epoch 00005: val_loss improved from 2.28524 to 1.92232, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_cnn.hdf5
Epoch 6/72

Epoch 00006: val_loss improved from 1.92232 to 1.73130, saving model to P:/Documentos/ICAI/Clase


Epoch 00028: val_loss improved from 0.73317 to 0.70481, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_cnn.hdf5
Epoch 29/72

Epoch 00029: val_loss improved from 0.70481 to 0.67989, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_cnn.hdf5
Epoch 30/72

Epoch 00030: val_loss improved from 0.67989 to 0.65536, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_cnn.hdf5
Epoch 31/72

Epoch 00031: val_loss improved from 0.65536 to 0.63347, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_cnn.hdf5
Epoch 32/72

Epoch 00032: val_loss improved from 0.63347 to 0.60850, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_cnn.hdf5
Epoch 33/72

Epoch 000

Epoch 54/72

Epoch 00054: val_loss improved from 0.31493 to 0.30938, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_cnn.hdf5
Epoch 55/72

Epoch 00055: val_loss improved from 0.30938 to 0.30471, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_cnn.hdf5
Epoch 56/72

Epoch 00056: val_loss improved from 0.30471 to 0.29626, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_cnn.hdf5
Epoch 57/72

Epoch 00057: val_loss improved from 0.29626 to 0.28458, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_cnn.hdf5
Epoch 58/72

Epoch 00058: val_loss improved from 0.28458 to 0.27841, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_cnn.hdf5
Epoch 59/7

### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [10]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  0.942176878452301
Testing Accuracy:  0.9324324131011963


The Training and Testing accuracy scores are both high and an increase on our initial model. Training accuracy has increased by ~2 percentage points (92.3% to 94.2%) and Testing accuracy has increased by ~6 percentage points (87% to 93.2%). 

The difference between the Training and Test scores has also narrowed (~1 percentage point compared to ~5 previously), so the model shows no sign of overfitting on this split. 
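
The comparison with the initial model can be checked with a few lines of arithmetic. The figures are the ones quoted in this notebook (the CNN scores are rounded to two decimals); the variable names are illustrative.

```python
initial = {"train": 92.3, "test": 87.0}   # scores quoted for the initial model
cnn = {"train": 94.22, "test": 93.24}     # CNN scores from the cell above, in percent

train_gain = cnn["train"] - initial["train"]
test_gain = cnn["test"] - initial["test"]
initial_gap = initial["train"] - initial["test"]
cnn_gap = cnn["train"] - cnn["test"]

print("Training gain: %.1f points" % train_gain)
print("Testing gain: %.1f points" % test_gain)
print("Train/test gap: %.1f points before, %.1f points now" % (initial_gap, cnn_gap))
```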

### Predictions  

Here we will modify our previous method for testing the model's predictions on a specified audio .wav file. 

In [11]:
def print_prediction(file_name):
    prediction_feature = extract_features(file_name) 
    prediction_feature = prediction_feature.reshape(1, num_rows, num_columns, num_channels)

    # model.predict returns the softmax probabilities, one row per input.
    # It replaces predict_classes / predict_proba, which recent Keras releases no longer provide.
    predicted_proba = model.predict(prediction_feature)[0]

    # The predicted class is the one with the highest probability
    predicted_class = le.inverse_transform([np.argmax(predicted_proba)]) 
    print("The predicted class is:", predicted_class[0], '\n') 

    # Print the probability assigned to every class
    for i in range(len(predicted_proba)): 
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f') )

### Validation 

#### Test with sample data 

As before we will verify the predictions using a subsection of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly. 

In [13]:
ruta = 'P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/'

In [14]:
# Class: Air Conditioner

filename = ruta + 'UrbanSound8K/audio/fold5/100852-0-0-0.wav' 
print_prediction(filename) 

The predicted class is: air_conditioner 

air_conditioner 		 :  0.98223972320556640625000000000000
car_horn 		 :  0.00001567459548823535442352294922
children_playing 		 :  0.00104265159461647272109985351562
dog_bark 		 :  0.00026860044454224407672882080078
drilling 		 :  0.01383300684392452239990234375000
engine_idling 		 :  0.00139033806044608354568481445312
gun_shot 		 :  0.00007466073293471708893775939941
jackhammer 		 :  0.00102835637517273426055908203125
siren 		 :  0.00007832941628294065594673156738
street_music 		 :  0.00002857365507225040346384048462


In [15]:
# Class: Drilling

filename = ruta + 'UrbanSound8K/audio/fold3/103199-4-0-0.wav'
print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.00003229550202377140522003173828
car_horn 		 :  0.00028092006687074899673461914062
children_playing 		 :  0.01506318524479866027832031250000
dog_bark 		 :  0.00349804782308638095855712890625
drilling 		 :  0.98062390089035034179687500000000
engine_idling 		 :  0.00000000937647115506479167379439
gun_shot 		 :  0.00002519226109143346548080444336
jackhammer 		 :  0.00000629935084361932240426540375
siren 		 :  0.00036320960498414933681488037109
street_music 		 :  0.00010688097245292738080024719238


In [16]:
# Class: Street music 

filename = ruta + 'UrbanSound8K/audio/fold7/101848-9-0-0.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00002097810283885337412357330322
car_horn 		 :  0.00769456196576356887817382812500
children_playing 		 :  0.00344579387456178665161132812500
dog_bark 		 :  0.00024308456340804696083068847656
drilling 		 :  0.00010798596485983580350875854492
engine_idling 		 :  0.00731206079944968223571777343750
gun_shot 		 :  0.00038480089278891682624816894531
jackhammer 		 :  0.00155758636537939310073852539062
siren 		 :  0.14077284932136535644531250000000
street_music 		 :  0.83846026659011840820312500000000


In [17]:
# Class: Car Horn 

filename = ruta + 'UrbanSound8K/audio/fold10/100648-1-0-0.wav'
print_prediction(filename) 

The predicted class is: children_playing 

air_conditioner 		 :  0.02111194841563701629638671875000
car_horn 		 :  0.16520559787750244140625000000000
children_playing 		 :  0.17696383595466613769531250000000
dog_bark 		 :  0.16060802340507507324218750000000
drilling 		 :  0.10536871850490570068359375000000
engine_idling 		 :  0.02429792843759059906005859375000
gun_shot 		 :  0.04094675555825233459472656250000
jackhammer 		 :  0.03399929031729698181152343750000
siren 		 :  0.13256213068962097167968750000000
street_music 		 :  0.13893574476242065429687500000000


#### Observations 

We can see that the model performs well, classifying three of the four samples correctly. 

Interestingly, car horn was again incorrectly classified, this time as children playing, though the per-class confidence shows it was a close decision between children playing at ~17.7% confidence, car horn at ~16.5% and dog bark at ~16.1%.  
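
How close a decision was can be quantified by ranking the class probabilities and looking at the margin between the top two. The sketch below uses the probabilities printed for the car horn clip above, rounded to four decimals; the variable names are illustrative.

```python
# Probabilities for the car horn clip, rounded from the printed output
car_horn_probs = {
    "air_conditioner": 0.0211,
    "car_horn": 0.1652,
    "children_playing": 0.1770,
    "dog_bark": 0.1606,
    "drilling": 0.1054,
    "engine_idling": 0.0243,
    "gun_shot": 0.0409,
    "jackhammer": 0.0340,
    "siren": 0.1326,
    "street_music": 0.1389,
}

# Sort classes from most to least likely
ranked = sorted(car_horn_probs.items(), key=lambda item: item[1], reverse=True)
(top_label, top_p), (second_label, second_p) = ranked[0], ranked[1]
margin = top_p - second_p

print("Top:", top_label, top_p)
print("Runner-up:", second_label, second_p)
print("Margin: %.4f" % margin)
```

A margin this small, spread over several classes, suggests the model is uncertain about this clip rather than confidently wrong.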

### Other audio

Again we will further validate our model using a sample of various copyright-free sounds that were not part of either our test or training data. 

In [18]:
filename = ruta + 'Evaluation audio/dog_bark_1.wav'
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.02156988903880119323730468750000
car_horn 		 :  0.02986512705683708190917968750000
children_playing 		 :  0.11962341517210006713867187500000
dog_bark 		 :  0.59253251552581787109375000000000
drilling 		 :  0.04573479667305946350097656250000
engine_idling 		 :  0.00381804117932915687561035156250
gun_shot 		 :  0.06662539392709732055664062500000
jackhammer 		 :  0.00698739103972911834716796875000
siren 		 :  0.03585373982787132263183593750000
street_music 		 :  0.07738970220088958740234375000000


In [20]:
filename = ruta + 'Evaluation audio/drilling_1.wav'

print_prediction(filename) 

The predicted class is: air_conditioner 

air_conditioner 		 :  0.79560840129852294921875000000000
car_horn 		 :  0.00033518570126034319400787353516
children_playing 		 :  0.02056717872619628906250000000000
dog_bark 		 :  0.00510859349742531776428222656250
drilling 		 :  0.01514675375074148178100585937500
engine_idling 		 :  0.14320786297321319580078125000000
gun_shot 		 :  0.00267864577472209930419921875000
jackhammer 		 :  0.01289335545152425765991210937500
siren 		 :  0.00148829480167478322982788085938
street_music 		 :  0.00296570919454097747802734375000


In [21]:
filename = ruta + 'Evaluation audio/gun_shot_1.wav'

print_prediction(filename) 


The predicted class is: street_music 

air_conditioner 		 :  0.01558687072247266769409179687500
car_horn 		 :  0.00148262560833245515823364257812
children_playing 		 :  0.03005218878388404846191406250000
dog_bark 		 :  0.21456053853034973144531250000000
drilling 		 :  0.00009535325079923495650291442871
engine_idling 		 :  0.31208619475364685058593750000000
gun_shot 		 :  0.05465753003954887390136718750000
jackhammer 		 :  0.00021866557653993368148803710938
siren 		 :  0.00922765210270881652832031250000
street_music 		 :  0.36203235387802124023437500000000


#### Observations 

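The final model performs well on the held-out test split, but the results on new audio data are mixed: of the three additional recordings, only the dog bark was classified correctly, while drilling was predicted as air conditioner (~80% confidence) and the gun shot as street music (~36% confidence). Generalisation to recordings from outside the dataset therefore still has room for improvement. 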