# Classifying Urban sounds using Deep Learning

## 3 Model Training and Evaluation 

### Load Preprocessed data 

In [20]:
# retrieve the preprocessed data from previous notebook

%store -r x_train 
%store -r x_test 
%store -r y_train 
%store -r y_test 
%store -r yy 
%store -r le

### Initial model architecture - MLP

We will start by constructing a Multilayer Perceptron (MLP) neural network using Keras with a TensorFlow backend. 

We start with a `Sequential` model so that we can build the network layer by layer. 

We will begin with a simple model architecture consisting of three layers: an input layer, a hidden layer and an output layer. All three are `Dense` (fully connected) layers, a standard layer type in which every node is connected to every node in the previous layer. 

The first layer will receive the input shape. As each sample contains 40 MFCCs (or columns), each sample has a shape of (1x40), so we will start with an input shape of 40. 

The first two layers will have 256 nodes each. The activation function we will use for these two layers is `ReLU` (Rectified Linear Unit), which outputs max(0, x). It is cheap to compute and is a widely used default for hidden layers.

We will also apply a `Dropout` rate of 50% to our first two layers. This randomly excludes nodes from each update cycle, which in turn results in a network that generalises better and is less likely to overfit the training data.
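
As an illustration of what a 50% dropout rate does during training, here is a minimal NumPy sketch of "inverted" dropout, where the surviving activations are scaled up by 1 / (1 - rate). The function name `dropout` and the all-ones input are assumptions made for this example; it is not the Keras implementation itself.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero a random subset of activations and rescale the survivors (training-time behaviour)."""
    keep_prob = 1.0 - rate
    # Boolean mask: True where a node is kept for this update
    mask = rng.random(activations.shape) < keep_prob
    # Scale the kept values so the expected activation is unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((4, 256))          # hypothetical activations of a 256-node layer
dropped = dropout(hidden, rate=0.5, rng=rng)

print("fraction of nodes zeroed:", np.mean(dropped == 0.0))
print("values after dropout:", np.unique(dropped))
```

At prediction time no nodes are dropped, so the layer simply passes its input through.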

Our output layer will have 10 nodes (num_labels), which matches the number of possible classifications. The activation for our output layer is `softmax`. Softmax makes the outputs sum to 1, so they can be interpreted as probabilities. The model then makes its prediction based on which class has the highest probability.
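
To make the softmax step concrete, the following sketch applies it to a small, made-up vector of raw output scores. The `softmax` helper and the `logits` values are assumptions for illustration only.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical raw scores for three classes
probs = softmax(logits)

print("probabilities:", probs)
print("sum:", probs.sum())
print("predicted class index:", int(np.argmax(probs)))
```

The predicted class is the index of the largest probability, which is also the index of the largest raw score.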

In [21]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.optimizers import Adam
from keras.utils import np_utils
from sklearn import metrics 

num_labels = yy.shape[1]
filter_size = 2

# Construct model 
model = Sequential()

model.add(Dense(256, input_shape=(40,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(num_labels))
model.add(Activation('softmax'))

### Compiling the model 

For compiling our model, we will use the following three parameters: 

* Loss function - we will use `categorical_crossentropy`. This is the most common choice for multi-class classification with one-hot encoded labels. A lower score indicates that the model is performing better.

* Metrics - we will use the `accuracy` metric, which will allow us to view the accuracy score on the validation data as we train the model. 

* Optimizer - here we will use `adam`, which is a generally good optimizer for many use cases.
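
As a rough sketch of what the loss measures, categorical cross-entropy for a single sample is the negative log of the probability assigned to the true class. The helper below and the example vectors are assumptions for illustration; Keras computes the same quantity averaged over a batch.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot label and a predicted probability vector."""
    # Clip to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.sum(y_true * np.log(y_pred)))

y_true = np.array([0.0, 1.0, 0.0])         # one-hot label: class 1 is correct
confident = np.array([0.05, 0.90, 0.05])   # hypothetical confident, correct prediction
unsure = np.array([0.30, 0.40, 0.30])      # hypothetical hesitant prediction

print("confident loss:", categorical_crossentropy(y_true, confident))
print("unsure loss:   ", categorical_crossentropy(y_true, unsure))
```

The confident prediction gives the lower loss, which is why a lower score indicates a better model.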


In [22]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 

In [23]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=0)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 256)               10496     
_________________________________________________________________
activation_3 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 256)               65792     
_________________________________________________________________
activation_4 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 10)               
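
The `Param #` column for a dense layer can be checked by hand: each node has one weight per input plus one bias. A small sketch, where the helper name `dense_params` is an assumption for this example:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

print(dense_params(40, 256))    # first Dense layer  -> 10496
print(dense_params(256, 256))   # second Dense layer -> 65792
```

Both values match the summary above. Activation and dropout layers have no trainable parameters, which is why they show 0.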

### Training 

Here we will train the model. 

We will start with 100 epochs, where one epoch is a full pass of the model through the training data. The model will typically improve with each pass until the validation loss stops decreasing. 

We will also start with a low batch size, as having a large batch size can reduce the generalisation ability of the model. 
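
To see how batch size relates to the number of weight updates, the sketch below counts updates per epoch for two batch sizes. The training set size `n_train` is a hypothetical figure chosen for illustration, not the size of our data.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of mini-batches (and therefore weight updates) in one epoch."""
    return math.ceil(n_samples / batch_size)

n_train = 7000  # hypothetical number of training samples

for batch_size in (32, 256):
    print("batch size", batch_size, "->", updates_per_epoch(n_train, batch_size), "updates per epoch")
```

A smaller batch size gives more, noisier updates per epoch.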

In [24]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

num_epochs = 100
num_batch_size = 32

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.basic_mlp.hdf5', 
                               verbose=1, save_best_only=True)
start = datetime.now()

model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(x_test, y_test), callbacks=[checkpointer], verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Epoch 1/100
Epoch 00001: val_loss improved from inf to 2.19327, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 2/100
Epoch 00002: val_loss improved from 2.19327 to 2.07203, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 3/100
Epoch 00003: val_loss improved from 2.07203 to 1.89202, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 4/100
Epoch 00004: val_loss improved from 1.89202 to 1.75286, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 5/100
Epoch 00005: val_loss improved from 1.75286 to 1.55467, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 6/100
Epoch 00006: val_loss improved from 1.55467 to 1.47453, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 7/100
Epoch 00007: val_loss improved from 1.47453 to 1.36728, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 8/100
Epoch 00008: val_loss improved from 1.36728 to 1.30677, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoc

Epoch 25/100
Epoch 00025: val_loss improved from 0.67916 to 0.65048, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 26/100
Epoch 00026: val_loss improved from 0.65048 to 0.62368, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 27/100
Epoch 00027: val_loss did not improve from 0.62368
Epoch 28/100
Epoch 00028: val_loss improved from 0.62368 to 0.59600, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 29/100
Epoch 00029: val_loss improved from 0.59600 to 0.58960, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 30/100
Epoch 00030: val_loss improved from 0.58960 to 0.57147, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 31/100
Epoch 00031: val_loss did not improve from 0.57147
Epoch 32/100
Epoch 00032: val_loss did not improve from 0.57147
Epoch 33/100
Epoch 00033: val_loss did not improve from 0.57147
Epoch 34/100
Epoch 00034: val_loss did not improve from 0.57147
Epoch 35/100
Epoch 00035: val_loss improved from 0

Epoch 51/100
Epoch 00051: val_loss improved from 0.49485 to 0.49365, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 52/100
Epoch 00052: val_loss improved from 0.49365 to 0.48929, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 53/100
Epoch 00053: val_loss did not improve from 0.48929
Epoch 54/100
Epoch 00054: val_loss improved from 0.48929 to 0.48213, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 55/100
Epoch 00055: val_loss improved from 0.48213 to 0.47592, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 56/100
Epoch 00056: val_loss improved from 0.47592 to 0.47552, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 57/100
Epoch 00057: val_loss improved from 0.47552 to 0.46154, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 58/100
Epoch 00058: val_loss did not improve from 0.46154
Epoch 59/100
Epoch 00059: val_loss did not improve from 0.46154
Epoch 60/100
Epoch 00060: val_loss did not improve f

Epoch 00078: val_loss did not improve from 0.45128
Epoch 79/100
Epoch 00079: val_loss did not improve from 0.45128
Epoch 80/100
Epoch 00080: val_loss improved from 0.45128 to 0.43473, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 81/100
Epoch 00081: val_loss improved from 0.43473 to 0.42638, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 82/100
Epoch 00082: val_loss did not improve from 0.42638
Epoch 83/100
Epoch 00083: val_loss did not improve from 0.42638
Epoch 84/100
Epoch 00084: val_loss did not improve from 0.42638
Epoch 85/100
Epoch 00085: val_loss did not improve from 0.42638
Epoch 86/100
Epoch 00086: val_loss did not improve from 0.42638
Epoch 87/100
Epoch 00087: val_loss did not improve from 0.42638
Epoch 88/100
Epoch 00088: val_loss did not improve from 0.42638
Epoch 89/100
Epoch 00089: val_loss did not improve from 0.42638
Epoch 90/100
Epoch 00090: val_loss did not improve from 0.42638
Epoch 91/100
Epoch 00091: val_loss did not improve from 0

### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [25]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  0.9388689994812012
Testing Accuracy:  0.8763594627380371


The initial Training and Testing accuracy scores are quite high. As there is not a great difference between the Training and Test scores (about 6 percentage points), this suggests that the model has not suffered badly from overfitting. 
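
The gap between the two scores can be computed directly from the values printed above. This is a small arithmetic check, not part of the original notebook:

```python
train_acc = 0.9388689994812012   # training accuracy reported above
test_acc = 0.8763594627380371    # testing accuracy reported above

# Difference expressed in percentage points
gap_points = (train_acc - test_acc) * 100
print(f"Train/test gap: {gap_points:.2f} percentage points")
```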

### Predictions  

Here we will build a method which will allow us to test the model's predictions on a specified audio .wav file. 

In [26]:
import librosa 
import numpy as np 

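def extract_feature(file_name):
    """Return a (1, 40) array of time-averaged MFCCs, or None if the file cannot be parsed."""
    try:
        audio_data, sample_rate = librosa.load(file_name, res_type='kaiser_fast') 
        # mfccs has shape (n_mfcc, n_frames)
        mfccs = librosa.feature.mfcc(y=audio_data, sr=sample_rate, n_mfcc=40)
        # Average each coefficient over time to get one 40-value vector per file
        mfccsscaled = np.mean(mfccs.T, axis=0)
        
    except Exception as e:
        print("Error encountered while parsing file: ", file_name, "-", e)
        return None

    return np.array([mfccsscaled])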
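
The shape handling in `extract_feature` can be checked without any audio. In the sketch below a random matrix stands in for the MFCC output, and the frame count `n_frames` is a hypothetical value; only the averaging and reshaping steps mirror the function above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mfcc, n_frames = 40, 173                    # n_frames is a hypothetical clip length
mfccs = rng.normal(size=(n_mfcc, n_frames))   # stand-in for librosa.feature.mfcc output

# Average each coefficient over all frames: (40, n_frames) -> (40,)
mean_over_time = np.mean(mfccs.T, axis=0)

# Wrap in an outer array so the model sees a batch of one sample: (1, 40)
features = np.array([mean_over_time])

print(mean_over_time.shape, features.shape)
```

The resulting (1, 40) array matches the `input_shape=(40,)` of the first layer.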


In [27]:
def print_prediction(file_name):
    prediction_feature = extract_feature(file_name) 
    if prediction_feature is None:
        return

    # model.predict returns one row of class probabilities per input sample;
    # predict_classes and predict_proba are not available in recent Keras releases
    predicted_proba = model.predict(prediction_feature)[0]

    # The predicted class is the index with the highest probability
    predicted_vector = np.array([np.argmax(predicted_proba)])
    predicted_class = le.inverse_transform(predicted_vector) 
    print("The predicted class is:", predicted_class[0], '\n') 

    for i in range(len(predicted_proba)): 
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f') )

### Validation 

#### Test with sample data 

As an initial sanity check, we verify the predictions using a subset of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly. 

In [28]:
# Class: Air Conditioner

filename = '../UrbanSound Dataset sample/audio/100852-0-0-0.wav' 
print_prediction(filename) 

The predicted class is: air_conditioner 

air_conditioner 		 :  0.98503690958023071289062500000000
car_horn 		 :  0.00063253677217289805412292480469
children_playing 		 :  0.00178799626883119344711303710938
dog_bark 		 :  0.00026157105457969009876251220703
drilling 		 :  0.00077542569488286972045898437500
engine_idling 		 :  0.00359599874354898929595947265625
gun_shot 		 :  0.00011406919657019898295402526855
jackhammer 		 :  0.00136287300847470760345458984375
siren 		 :  0.00001566314676892943680286407471
street_music 		 :  0.00641680276021361351013183593750


In [29]:
# Class: Drilling

filename = '../UrbanSound Dataset sample/audio/103199-4-0-0.wav'
print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.00000000002315023867049958994357
car_horn 		 :  0.00000504127001477172598242759705
children_playing 		 :  0.00006995756848482415080070495605
dog_bark 		 :  0.00001750043702486436814069747925
drilling 		 :  0.96054077148437500000000000000000
engine_idling 		 :  0.00000000065668515070171906700125
gun_shot 		 :  0.00000000758778373466384437051602
jackhammer 		 :  0.00000005083144571926823118701577
siren 		 :  0.00000000782372033825140533735976
street_music 		 :  0.03936669602990150451660156250000


In [30]:
# Class: Street music 

filename = '../UrbanSound Dataset sample/audio/101848-9-0-0.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00117893354035913944244384765625
car_horn 		 :  0.00514532066881656646728515625000
children_playing 		 :  0.00808414444327354431152343750000
dog_bark 		 :  0.01273029763251543045043945312500
drilling 		 :  0.01198443397879600524902343750000
engine_idling 		 :  0.00040718558011576533317565917969
gun_shot 		 :  0.00052774092182517051696777343750
jackhammer 		 :  0.00694827968254685401916503906250
siren 		 :  0.00027546731871552765369415283203
street_music 		 :  0.95271813869476318359375000000000


In [31]:
# Class: Car Horn 

filename = '../UrbanSound Dataset sample/audio/100648-1-0-0.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00546680530533194541931152343750
car_horn 		 :  0.04527015239000320434570312500000
children_playing 		 :  0.00671882368624210357666015625000
dog_bark 		 :  0.16746664047241210937500000000000
drilling 		 :  0.17773860692977905273437500000000
engine_idling 		 :  0.00290776253677904605865478515625
gun_shot 		 :  0.00937572214752435684204101562500
jackhammer 		 :  0.02390000969171524047851562500000
siren 		 :  0.00073731801239773631095886230469
street_music 		 :  0.56041806936264038085937500000000


#### Observations 

From this brief sanity check the model seems to predict well. One error was observed, whereby a car horn was incorrectly classified as street music. 

We can see from the per-class confidence that this was a fairly low score (56%), with drilling (18%) and dog bark (17%) as the next most likely classes. The dog bark score is in line with our earlier observation that a dog bark and a car horn are similar in spectral shape. 
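
To read the per-class output for the car horn sample more easily, the probabilities can be ranked. The values below are copied from the output above and rounded to four decimal places; the ranking code is an illustrative sketch.

```python
# Probabilities printed for 100648-1-0-0.wav, rounded to four decimal places
probs = {
    "air_conditioner": 0.0055,
    "car_horn": 0.0453,
    "children_playing": 0.0067,
    "dog_bark": 0.1675,
    "drilling": 0.1777,
    "engine_idling": 0.0029,
    "gun_shot": 0.0094,
    "jackhammer": 0.0239,
    "siren": 0.0007,
    "street_music": 0.5604,
}

# Sort classes from most to least likely
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

for name, p in ranked[:3]:
    print(f"{name:16s} {p:.1%}")

# Position of the true class in the ranking (1 = top prediction)
true_rank = [name for name, _ in ranked].index("car_horn") + 1
print("rank of car_horn:", true_rank)
```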

### Other audio

Here we will use a sample of various copyright-free sounds that were not part of either our test or training data to further validate our model. 

In [32]:
filename = '../Evaluation audio/dog_bark_1.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00407737307250499725341796875000
car_horn 		 :  0.01589436642825603485107421875000
children_playing 		 :  0.02757290937006473541259765625000
dog_bark 		 :  0.43261671066284179687500000000000
drilling 		 :  0.05380324274301528930664062500000
engine_idling 		 :  0.00108108692802488803863525390625
gun_shot 		 :  0.00602692132815718650817871093750
jackhammer 		 :  0.00002495371154509484767913818359
siren 		 :  0.00835195370018482208251953125000
street_music 		 :  0.45055046677589416503906250000000


In [33]:
filename = '../Evaluation audio/drilling_1.wav'

print_prediction(filename) 

The predicted class is: jackhammer 

air_conditioner 		 :  0.06842748075723648071289062500000
car_horn 		 :  0.00000041934748651328845880925655
children_playing 		 :  0.00067422993015497922897338867188
dog_bark 		 :  0.00011627443745965138077735900879
drilling 		 :  0.11063024401664733886718750000000
engine_idling 		 :  0.00008019795495783910155296325684
gun_shot 		 :  0.00015479714784305542707443237305
jackhammer 		 :  0.81987762451171875000000000000000
siren 		 :  0.00000333510274685977492481470108
street_music 		 :  0.00003550315159372985363006591797


In [34]:
filename = '../Evaluation audio/gun_shot_1.wav'

print_prediction(filename) 

# sample data weighted towards gun shot - peak in the dog barking sample is similar in shape to the gun shot sample

The predicted class is: street_music 

air_conditioner 		 :  0.12769186496734619140625000000000
car_horn 		 :  0.00309055973775684833526611328125
children_playing 		 :  0.00840576458722352981567382812500
dog_bark 		 :  0.19505158066749572753906250000000
drilling 		 :  0.00419114856049418449401855468750
engine_idling 		 :  0.00659308070316910743713378906250
gun_shot 		 :  0.00974449887871742248535156250000
jackhammer 		 :  0.00004562287358567118644714355469
siren 		 :  0.00568220065906643867492675781250
street_music 		 :  0.63950371742248535156250000000000


In [35]:
filename = '../Evaluation audio/siren_1.wav'

print_prediction(filename) 

The predicted class is: siren 

air_conditioner 		 :  0.00000667103313389816321432590485
car_horn 		 :  0.00022991636069491505622863769531
children_playing 		 :  0.00570126995444297790527343750000
dog_bark 		 :  0.05127241089940071105957031250000
drilling 		 :  0.00003865715189022012054920196533
engine_idling 		 :  0.04880295693874359130859375000000
gun_shot 		 :  0.00522171519696712493896484375000
jackhammer 		 :  0.00035019431379623711109161376953
siren 		 :  0.88807439804077148437500000000000
street_music 		 :  0.00030179941677488386631011962891


#### Observations 

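The performance of our initial model on the test set is satisfactory (about 88% accuracy). On the new audio data the results are weaker: judging by the file names, only the siren was classified correctly out of the four files tried, so the model does not yet generalise reliably to audio from outside the dataset. 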
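
To put a number on the spot checks with the new audio files, the predictions above can be tallied. The expected labels are an assumption taken from the file names; the predicted labels are copied from the outputs.

```python
# (expected label assumed from the file name, predicted label from the output above)
results = [
    ("dog_bark", "street_music"),
    ("drilling", "jackhammer"),
    ("gun_shot", "street_music"),
    ("siren", "siren"),
]

correct = sum(expected == predicted for expected, predicted in results)
print(f"{correct} of {len(results)} new audio files classified correctly")
```

Four files is far too small a sample to estimate accuracy, but it is a useful qualitative check.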

### *In the next notebook we will refine our model*