# Classifying Urban sounds using Deep Learning

## 3 Model Training and Evaluation 

### Load Preprocessed data 

In [57]:
ruta = 'P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/'

In [58]:
# retrieve the preprocessed data from previous notebook

%store -r x_train 
%store -r x_test 
%store -r y_train 
%store -r y_test 
%store -r yy 
%store -r le

### Initial model architecture - MLP

We will start by constructing a Multilayer Perceptron (MLP) neural network using Keras with a TensorFlow backend. 

We use a `Sequential` model so that we can build the network layer by layer. 

We will begin with a simple architecture consisting of three layers: an input layer, a hidden layer and an output layer. All three are `Dense` (fully connected) layers, the standard layer type for this kind of network. 

The first layer receives the input. Each sample is a vector of 40 MFCC values (one per column), so the input shape is `(40,)`. 

The first two layers each have 256 nodes and use the `ReLU` (Rectified Linear Unit) activation function, which works well in many neural networks.

We will also apply a `Dropout` rate of 50% to our first two layers. During training this randomly excludes nodes from each update cycle, which tends to produce a network that generalises better and is less likely to overfit the training data.
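
The effect of a 50% dropout rate can be sketched in plain NumPy. This is only an illustration of the "inverted dropout" scheme, in which the units that are kept are scaled by 1 / (1 - rate); the function name `dropout` and the toy input are assumptions made for the example and are not part of the notebook.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a random fraction `rate` of units, rescale the rest."""
    keep_prob = 1.0 - rate
    # Boolean mask: True where a unit is kept for this update
    mask = rng.random(activations.shape) < keep_prob
    # Rescale kept units so the expected activation is unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((4, 256))            # toy batch: 4 samples, 256 hidden units
dropped = dropout(hidden, rate=0.5, rng=rng)
kept_fraction = (dropped != 0).mean()
print("fraction of units kept:", kept_fraction)
```

Roughly half of the units are zeroed on each call and the survivors are doubled. At prediction time dropout is switched off, so the full network is used.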

Our output layer will have 10 nodes (`num_labels`), which matches the number of possible classes. The activation for our output layer is `softmax`. Softmax makes the outputs sum to 1, so they can be interpreted as probabilities. The model's prediction is the class with the highest probability.
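
As a small worked example of the softmax step, the sketch below converts three made-up raw scores into probabilities and picks the most likely class. The `softmax` helper and the values in `logits` are illustrative assumptions, not outputs of the model.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])      # hypothetical raw outputs for 3 classes
probs = softmax(logits)
predicted_index = int(np.argmax(probs)) # class with the highest probability

print(probs, probs.sum(), predicted_index)
```

The largest raw score always receives the largest probability, so taking `argmax` of the softmax output gives the predicted class.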

In [60]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

# Number of output classes, taken from the one-hot encoded labels
num_labels = yy.shape[1]

# Construct model 
model = Sequential()

model.add(Dense(256, input_shape=(40,)))  # 40 MFCC features per sample
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(num_labels))
model.add(Activation('softmax'))

### Compiling the model 

For compiling our model, we will use the following three parameters: 

* Loss function - we will use `categorical_crossentropy`, the standard choice for multi-class classification with one-hot encoded labels. A lower loss indicates that the model is performing better.

* Metrics - we will use the `accuracy` metric which will allow us to view the accuracy score on the validation data when we train the model. 

* Optimizer - here we will use `adam` which is a generally good optimizer for many use cases.
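
To make the loss function concrete, here is a minimal NumPy sketch of categorical cross-entropy for one-hot labels. The helper name, the clipping constant `eps` and the example predictions are assumptions chosen for illustration.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    per_sample = -np.sum(y_true * np.log(y_pred), axis=-1)
    return float(per_sample.mean())

y_true = np.array([[0, 1, 0]])                 # the true class is index 1
confident = np.array([[0.05, 0.90, 0.05]])     # hypothetical good prediction
unsure = np.array([[0.40, 0.30, 0.30]])        # hypothetical poor prediction

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

The loss is simply the negative log of the probability assigned to the true class, so the confident, correct prediction scores much lower than the unsure one.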


In [61]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 

In [62]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=0)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_31 (Dense)             (None, 256)               10496     
_________________________________________________________________
activation_31 (Activation)   (None, 256)               0         
_________________________________________________________________
dropout_21 (Dropout)         (None, 256)               0         
_________________________________________________________________
dense_32 (Dense)             (None, 256)               65792     
_________________________________________________________________
activation_32 (Activation)   (None, 256)               0         
_________________________________________________________________
dropout_22 (Dropout)         (None, 256)               0         
_________________________________________________________________
dense_33 (Dense)             (None, 10)              
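
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

first_hidden = dense_params(40, 256)    # 40 MFCC inputs -> 256 units
second_hidden = dense_params(256, 256)  # 256 units -> 256 units

print(first_hidden, second_hidden)
```

These match the `Param #` values reported for the first two dense layers (10496 and 65792). Activation and dropout layers add no trainable parameters.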

### Training 

Here we will train the model. 

We will start with 100 epochs, where one epoch is a full pass of the model through the training data. The model usually improves with each epoch until it reaches a plateau. 

We will also start with a small batch size (32), as a large batch size can reduce the model's ability to generalise. 
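
As a rough sketch of what these settings mean in practice, the arithmetic below works out how many weight updates the training run performs. The training-set size of 319 is taken from the training log for this notebook; the variable names are illustrative.

```python
import math

n_train = 319      # training samples reported in the training log
batch_size = 32
epochs = 100

steps_per_epoch = math.ceil(n_train / batch_size)             # batches per epoch
last_batch = n_train - (steps_per_epoch - 1) * batch_size      # size of the final, partial batch
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, last_batch, total_updates)
```

With these numbers each epoch consists of 10 batches (nine of 32 samples and one of 31), giving 1000 weight updates over the whole run.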

In [64]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

num_epochs = 100
num_batch_size = 32

checkpointer = ModelCheckpoint(filepath= ruta + 'saved_models/weights.best.basic_mlp.hdf5', 
                               verbose=1, save_best_only=True)
start = datetime.now()

model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(x_test, y_test), callbacks=[checkpointer], verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Train on 319 samples, validate on 80 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 4.29201, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_mlp.hdf5
Epoch 2/100

Epoch 00002: val_loss improved from 4.29201 to 3.06752, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_mlp.hdf5
Epoch 3/100

Epoch 00003: val_loss improved from 3.06752 to 1.65758, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_mlp.hdf5
Epoch 4/100

Epoch 00004: val_loss improved from 1.65758 to 1.02270, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_mlp.hdf5
Epoch 5/100

Epoch 00005: val_loss improved from 1.02270 to 0.81236, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_mode


Epoch 00028: val_loss improved from 0.11049 to 0.10072, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_mlp.hdf5
Epoch 29/100

Epoch 00029: val_loss improved from 0.10072 to 0.09094, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_mlp.hdf5
Epoch 30/100

Epoch 00030: val_loss improved from 0.09094 to 0.08588, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_mlp.hdf5
Epoch 31/100

Epoch 00031: val_loss improved from 0.08588 to 0.08174, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_mlp.hdf5
Epoch 32/100

Epoch 00032: val_loss did not improve from 0.08174
Epoch 33/100

Epoch 00033: val_loss did not improve from 0.08174
Epoch 34/100

Epoch 00034: val_loss did not improve from 0.08174
Epoch 35/100

Epoch 00


Epoch 00062: val_loss did not improve from 0.04105
Epoch 63/100

Epoch 00063: val_loss did not improve from 0.04105
Epoch 64/100

Epoch 00064: val_loss improved from 0.04105 to 0.03407, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_mlp.hdf5
Epoch 65/100

Epoch 00065: val_loss improved from 0.03407 to 0.03029, saving model to P:/Documentos/ICAI/Clases/Analisis de Datos No Estructurados/Audio/data/saved_models/weights.best.basic_mlp.hdf5
Epoch 66/100

Epoch 00066: val_loss did not improve from 0.03029
Epoch 67/100

Epoch 00067: val_loss did not improve from 0.03029
Epoch 68/100

Epoch 00068: val_loss did not improve from 0.03029
Epoch 69/100

Epoch 00069: val_loss did not improve from 0.03029
Epoch 70/100

Epoch 00070: val_loss did not improve from 0.03029
Epoch 71/100

Epoch 00071: val_loss did not improve from 0.03029
Epoch 72/100

Epoch 00072: val_loss improved from 0.03029 to 0.03028, saving model to P:/Docume


Epoch 00098: val_loss did not improve from 0.01155
Epoch 99/100

Epoch 00099: val_loss did not improve from 0.01155
Epoch 100/100

Epoch 00100: val_loss did not improve from 0.01155
Training completed in time:  0:00:08.057318
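
The "improved" / "did not improve" messages in the log follow from `save_best_only=True`: the weights are written only when the monitored `val_loss` is lower than the best value seen so far. The sketch below mimics that bookkeeping in plain Python. The first five losses are copied from the log above; the last two are hypothetical values added to show a non-improving epoch, and `epochs_that_save` is an illustrative helper, not a Keras function.

```python
def epochs_that_save(val_losses):
    """Return the epochs at which a checkpoint would be written, and the best loss."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:          # strict improvement triggers a save
            best = loss
            saved.append(epoch)
    return saved, best

# Epochs 1-5 from the log, then two hypothetical values (0.90, 0.70)
val_losses = [4.29201, 3.06752, 1.65758, 1.02270, 0.81236, 0.90, 0.70]
saved_epochs, best_loss = epochs_that_save(val_losses)
print(saved_epochs, best_loss)
```

Note that after `fit` returns, the model in memory holds the weights from the final epoch. If the best checkpoint is wanted instead, it would have to be restored from the saved file, for example with `model.load_weights`.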


### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [65]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  1.0
Testing Accuracy:  0.987500011920929


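The initial training and testing accuracy scores are both high: 100% on the training set and 98.75% on the test set (79 of the 80 test samples). The gap between the two is small (about 1.25 percentage points), which suggests that the model has not overfitted badly. Keep in mind that the test set contains only 80 samples and was also passed as `validation_data` during training. 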
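
A quick check of what the reported accuracies mean in absolute terms, using the scores printed above and the test-set size from the training log (variable names are illustrative):

```python
n_test = 80                          # validation/test samples from the training log
train_accuracy = 1.0
test_accuracy = 0.987500011920929

correct = round(test_accuracy * n_test)               # correctly classified test samples
errors = n_test - correct
gap_points = (train_accuracy - test_accuracy) * 100   # gap in percentage points

print(correct, errors, round(gap_points, 2))
```

So the test score corresponds to a single misclassified sample out of 80, and the train/test gap is about 1.25 percentage points.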

### Predictions  

Here we will build a function that lets us test the model's predictions on a specified `.wav` audio file. 

In [66]:
import librosa 
import numpy as np 

def extract_feature(file_name):
    """Return the mean MFCC vector of an audio file with shape (1, 40), or None on failure."""
    try:
        audio_data, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
        mfccs = librosa.feature.mfcc(y=audio_data, sr=sample_rate, n_mfcc=40)
        # Average each coefficient over time to get one fixed-length vector
        mfccsscaled = np.mean(mfccs.T, axis=0)

    except Exception as e:
        print("Error encountered while parsing file: ", file_name, "-", e)
        return None

    return np.array([mfccsscaled])
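
The averaging step is what turns a clip of any length into a fixed-size input for the MLP. The sketch below uses a random array as a stand-in for real MFCCs (the frame count of 173 is an arbitrary assumption) to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for librosa's output: (n_mfcc, n_frames)
mfccs = rng.normal(size=(40, 173))

# Mean over time frames: one value per MFCC coefficient
features = np.mean(mfccs.T, axis=0)

# Wrap in an outer array so the model sees a batch of one sample
batch = np.array([features])

print(features.shape, batch.shape)
```

Taking the mean of the transposed array over axis 0 is the same as `mfccs.mean(axis=1)`. The resulting `(1, 40)` array matches the model's `input_shape=(40,)`.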


In [67]:
def print_prediction(file_name):
    prediction_feature = extract_feature(file_name)
    if prediction_feature is None:
        return

    # Class probabilities for the single sample, shape (num_labels,)
    predicted_proba = model.predict(prediction_feature)[0]

    # The predicted class is the one with the highest probability.
    # (predict_classes / predict_proba are not available in newer Keras releases.)
    predicted_index = np.argmax(predicted_proba)
    predicted_class = le.inverse_transform([predicted_index])
    print("The predicted class is:", predicted_class[0], '\n')

    for i in range(len(predicted_proba)):
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f'))
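
The mapping from a probability vector back to a class name can be tried without the trained model. In the sketch below, `class_names` is a plain list standing in for the fitted label encoder `le` (its order is copied from the prediction outputs in this notebook), and `proba` reuses the values printed for the drilling sample, rounded.

```python
import numpy as np

# Class order as printed by the notebook (alphabetical)
class_names = ["air_conditioner", "car_horn", "children_playing", "dog_bark",
               "drilling", "engine_idling", "gun_shot", "jackhammer",
               "siren", "street_music"]

# Probabilities for the drilling sample, rounded from the output in this notebook
proba = np.array([7.04e-08, 7.57e-08, 5.38e-05, 6.72e-06, 0.99986,
                  4.48e-10, 2.40e-10, 1.18e-08, 7.89e-05, 1.00e-09])

predicted_index = int(np.argmax(proba))
predicted_class = class_names[predicted_index]
print("The predicted class is:", predicted_class)
```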

### Validation 

#### Test with sample data 

Initial sanity check to verify the predictions using a subset of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly. 

In [78]:
# Class: Air Conditioner

filename = ruta + 'UrbanSound8K/audio/fold5/100852-0-0-0.wav' 
print_prediction(filename) 

The predicted class is: air_conditioner 

air_conditioner 		 :  1.00000000000000000000000000000000
car_horn 		 :  0.00000000000009086878656212188377
children_playing 		 :  0.00000001508326796795245172688738
dog_bark 		 :  0.00000000000013569890653861160779
drilling 		 :  0.00000000108132713894093512863037
engine_idling 		 :  0.00000000123514509731137422932079
gun_shot 		 :  0.00000000000093657340997227445101
jackhammer 		 :  0.00000000000185656893504637654502
siren 		 :  0.00000000000013435095863514184833
street_music 		 :  0.00000000000079323900963393367824


In [79]:
# Class: Drilling

filename = ruta + 'UrbanSound8K/audio/fold3/103199-4-0-0.wav'
print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.00000007037179727831244235858321
car_horn 		 :  0.00000007567871307401219382882118
children_playing 		 :  0.00005377611159929074347019195557
dog_bark 		 :  0.00000672001760904095135629177094
drilling 		 :  0.99986040592193603515625000000000
engine_idling 		 :  0.00000000044791009790046132366115
gun_shot 		 :  0.00000000023955243344531140792242
jackhammer 		 :  0.00000001177512132244373788125813
siren 		 :  0.00007890544657129794359207153320
street_music 		 :  0.00000000100394725865982081813854


In [80]:
# Class: Street music 

filename = ruta + 'UrbanSound8K/audio/fold7/101848-9-0-0.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00000055683381106064189225435257
car_horn 		 :  0.00000155163161252858117222785950
children_playing 		 :  0.00000202188221010146662592887878
dog_bark 		 :  0.00000000537815614265468866506126
drilling 		 :  0.00000019118265015549695817753673
engine_idling 		 :  0.00005027380393585190176963806152
gun_shot 		 :  0.00000283062513517506886273622513
jackhammer 		 :  0.00000000114690057451127813692437
siren 		 :  0.00000742041584089747630059719086
street_music 		 :  0.99993515014648437500000000000000


In [83]:
# Class: Car Horn 

filename = ruta + 'UrbanSound8K/audio/fold10/100648-1-0-0.wav'
print_prediction(filename) 

The predicted class is: children_playing 

air_conditioner 		 :  0.00000025663487690508191008120775
car_horn 		 :  0.43873438239097595214843750000000
children_playing 		 :  0.55922496318817138671875000000000
dog_bark 		 :  0.00008896659710444509983062744141
drilling 		 :  0.00000773040392232360318303108215
engine_idling 		 :  0.00000085428951024368871003389359
gun_shot 		 :  0.00000043670550553542852867394686
jackhammer 		 :  0.00001462799991713836789131164551
siren 		 :  0.00040928577072918415069580078125
street_music 		 :  0.00151850527618080377578735351562


#### Observations 

From this brief sanity check the model seems to predict well. One error was observed: a car horn was incorrectly classified as children playing. 

The per-class probabilities show that this was a low-confidence prediction, with about 56% for `children_playing` against about 44% for `car_horn`. Note that the confusion here is with children playing, not with dog bark, which we observed earlier to be similar to car horn in spectral shape. 
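
One way to flag such borderline cases is to look at the margin between the two most probable classes. The sketch below does this for the car horn sample, using the probabilities printed above (rounded); the 0.5 threshold mentioned afterwards is an arbitrary assumption, not a recommendation from the notebook.

```python
import numpy as np

class_names = ["air_conditioner", "car_horn", "children_playing", "dog_bark",
               "drilling", "engine_idling", "gun_shot", "jackhammer",
               "siren", "street_music"]

# Probabilities for the car horn sample, rounded from the output above
proba = np.array([2.57e-07, 0.438734, 0.559225, 8.90e-05, 7.73e-06,
                  8.54e-07, 4.37e-07, 1.46e-05, 4.09e-04, 1.52e-03])

order = np.argsort(proba)[::-1]        # class indices, most probable first
top1, top2 = order[:2]
margin = proba[top1] - proba[top2]     # gap between the best and second-best class

print(class_names[top1], class_names[top2], round(float(margin), 4))
```

Here the margin is only about 0.12, whereas the correctly classified samples above have margins close to 1. A cut-off such as 0.5 on the margin would mark this prediction as uncertain.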

### Other audio

Here we will use a sample of various copyright-free sounds that were not part of either our test or training data to further validate our model. 

In [84]:
filename = ruta + 'Evaluation audio/dog_bark_1.wav'
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00000001960061624117770406883210
car_horn 		 :  0.00067121715983375906944274902344
children_playing 		 :  0.17579813301563262939453125000000
dog_bark 		 :  0.76811760663986206054687500000000
drilling 		 :  0.00000009689290436654118821024895
engine_idling 		 :  0.00000097858367098524468019604683
gun_shot 		 :  0.05533875524997711181640625000000
jackhammer 		 :  0.00000000011452459813821036505033
siren 		 :  0.00006211299478309229016304016113
street_music 		 :  0.00001106958188756834715604782104


In [85]:
filename = ruta + 'Evaluation audio/drilling_1.wav'

print_prediction(filename) 

The predicted class is: jackhammer 

air_conditioner 		 :  0.03365809842944145202636718750000
car_horn 		 :  0.00000000093388741184696755226469
children_playing 		 :  0.00001408237221767194569110870361
dog_bark 		 :  0.00000000089894192090156366248266
drilling 		 :  0.00000000010202207828546860923780
engine_idling 		 :  0.00016141032392624765634536743164
gun_shot 		 :  0.00000001839103447309753391891718
jackhammer 		 :  0.96616637706756591796875000000000
siren 		 :  0.00000000007309666760768607218779
street_music 		 :  0.00000000317920001435822996427305


In [86]:
filename = ruta + 'Evaluation audio/gun_shot_1.wav'

print_prediction(filename) 

# sample data weighted towards gun shot - peak in the dog barking sample is similar in shape to the gun shot sample

The predicted class is: street_music 

air_conditioner 		 :  0.00000088978362100533558987081051
car_horn 		 :  0.00000044146099753561429679393768
children_playing 		 :  0.00001168697144748875871300697327
dog_bark 		 :  0.00000195436700778373051434755325
drilling 		 :  0.00000020943843992426991462707520
engine_idling 		 :  0.01178112905472517013549804687500
gun_shot 		 :  0.00601853430271148681640625000000
jackhammer 		 :  0.00000000009501017333990446900316
siren 		 :  0.00002499081165296956896781921387
street_music 		 :  0.98216015100479125976562500000000


In [87]:
filename = ruta + 'Evaluation audio/siren_1.wav'

print_prediction(filename) 

The predicted class is: engine_idling 

air_conditioner 		 :  0.00000978398566076066344976425171
car_horn 		 :  0.00000008693190522990335011854768
children_playing 		 :  0.00000985066617431584745645523071
dog_bark 		 :  0.00000000261273269686057574290317
drilling 		 :  0.00000044785193153984437230974436
engine_idling 		 :  0.97659450769424438476562500000000
gun_shot 		 :  0.00000320370577355788554996252060
jackhammer 		 :  0.00000021124824911566975060850382
siren 		 :  0.00003039501643797848373651504517
street_music 		 :  0.02335144393146038055419921875000


#### Observations 

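The initial model performs well on the held-out test data, but it has not generalised well to new audio: of the four external samples above, only the dog bark was classified correctly. The drilling, gun shot and siren samples were misclassified as jackhammer, street music and engine idling respectively, each with a probability above 96%. 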
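
The results on the external clips can be tallied in a few lines. The expected labels below are inferred from the file names (an assumption), and the predicted labels are copied from the outputs above.

```python
# (expected label from the file name, label predicted by the model)
results = [
    ("dog_bark", "dog_bark"),
    ("drilling", "jackhammer"),
    ("gun_shot", "street_music"),
    ("siren", "engine_idling"),
]

correct = sum(expected == predicted for expected, predicted in results)
accuracy = correct / len(results)
print(f"{correct} of {len(results)} correct ({accuracy:.0%})")
```

Four clips are far too few for a reliable estimate, but the tally gives a useful first impression of how the model behaves on audio from outside the dataset.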

### *In the next notebook we will refine our model*