# Classifying Urban sounds using Deep Learning

## 3 Model Training and Evaluation 

### Load Preprocessed data 

In [1]:
# retrieve the preprocessed data from previous notebook

%store -r x_train 
%store -r x_test 
%store -r y_train 
%store -r y_test 
%store -r yy 
%store -r le

### Initial model architecture - MLP

We will start by constructing a Multilayer Perceptron (MLP) neural network using Keras with a TensorFlow backend.

We start with a `Sequential` model so that we can build the network layer by layer.

We will begin with a simple model architecture consisting of three layers: an input layer, a hidden layer and an output layer. All three will be `Dense` layers, the standard fully connected layer type used in many neural networks.

The first layer receives the input. As each sample contains 40 MFCCs (or columns), each sample has a shape of 1x40, so we start with an input shape of 40.

The first two layers will have 256 nodes each. The activation function for these two layers is `ReLU`, the rectified linear activation, which is widely used in neural networks and tends to work well in practice.
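
As a minimal illustration (not part of the original notebook), ReLU passes positive values through unchanged and clamps negative values to zero. The helper name `relu` is hypothetical and only used for this sketch:

```python
def relu(values):
    """Apply ReLU element-wise: max(0, v) for each value."""
    return [max(0.0, v) for v in values]

# Example pre-activation values (made up for illustration)
pre_activations = [-2.0, -0.5, 0.0, 1.5, 3.0]
activations = relu(pre_activations)
print(activations)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```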

We will also apply a `Dropout` value of 50% on our first two layers. This will randomly exclude nodes from each update cycle which in turn results in a network that is capable of better generalisation and is less likely to overfit the training data.
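
The effect of dropout during training can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Keras implementation: the `dropout` helper and its arguments are hypothetical, and it uses the common "inverted dropout" formulation in which surviving values are scaled by 1 / (1 - rate) so the expected activation stays the same.

```python
import random

def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)       # fixed seed so the sketch is repeatable
activations = [1.0] * 10     # made-up activations from a hidden layer
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)               # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

At prediction time dropout is switched off, so all nodes contribute.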

Our output layer will have 10 nodes (num_labels), which matches the number of possible classifications. The activation for our output layer is `softmax`. Softmax makes the outputs sum to 1 so that they can be interpreted as probabilities. The model then makes its prediction based on which class has the highest probability.
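
To make the softmax step concrete, here is a small sketch in plain Python. The `softmax` helper and the three example scores are assumptions for illustration only; the real model produces 10 scores, one per class.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest score avoids overflow and does not change the result
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]     # made-up raw outputs for three classes
probs = softmax(logits)

# The prediction is the index of the highest probability
predicted_index = max(range(len(probs)), key=probs.__getitem__)
print(probs, sum(probs), predicted_index)
```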

In [2]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

# Number of output classes, taken from the one-hot encoded labels
num_labels = yy.shape[1]

# Construct model 
model = Sequential()

model.add(Dense(256, input_shape=(40,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(num_labels))
model.add(Activation('softmax'))

### Compiling the model 

For compiling our model, we will use the following three parameters: 

* Loss function - we will use `categorical_crossentropy`, a standard choice for multi-class classification with one-hot encoded labels. A lower score indicates that the model is performing better.

* Metrics - we will use the `accuracy` metric which will allow us to view the accuracy score on the validation data when we train the model. 

* Optimizer - here we will use `adam` which is a generally good optimizer for many use cases.
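
To show why a lower loss means a better prediction, the sketch below computes categorical cross-entropy for a single sample in plain Python. The helper name, the `eps` guard against `log(0)` and the example probabilities are all assumptions made for illustration.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one sample: -sum(y_true * log(y_pred))."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0.0, 1.0, 0.0]         # one-hot label: the second class is correct
confident = [0.05, 0.90, 0.05]   # most probability on the correct class
unsure = [0.40, 0.30, 0.30]      # probability spread across classes

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)  # the confident prediction has the lower loss
```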


In [3]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 

In [4]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=0)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 256)               10496     
_________________________________________________________________
activation (Activation)      (None, 256)               0         
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               65792     
_________________________________________________________________
activation_1 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                2
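
The `Param #` column can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name used only for this check. The last row of the summary is cut off in the output shown here, so the output-layer figure is simply what the formula gives for 256 inputs and 10 units.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

first_hidden = dense_params(40, 256)     # matches 10496 in the summary
second_hidden = dense_params(256, 256)   # matches 65792 in the summary
output_layer = dense_params(256, 10)     # from the formula: 2570

print(first_hidden, second_hidden, output_layer)  # 10496 65792 2570
```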

### Training 

Here we will train the model. 

We will start with 100 epochs, where one epoch is a full pass of the model through the training data. The model typically improves with each epoch until the validation loss stops decreasing.

We will also start with a low batch size, as having a large batch size can reduce the generalisation ability of the model. 

In [5]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

num_epochs = 100
num_batch_size = 32

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.basic_mlp.hdf5', 
                               verbose=1, save_best_only=True)
start = datetime.now()

model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(x_test, y_test), callbacks=[checkpointer], verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Epoch 1/100

Epoch 00001: val_loss improved from inf to 2.16960, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 2/100

Epoch 00002: val_loss improved from 2.16960 to 2.06056, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 3/100

Epoch 00003: val_loss improved from 2.06056 to 1.81939, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 4/100

Epoch 00004: val_loss improved from 1.81939 to 1.68668, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 5/100

Epoch 00005: val_loss improved from 1.68668 to 1.60236, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 6/100

Epoch 00006: val_loss improved from 1.60236 to 1.44620, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 7/100

Epoch 00007: val_loss improved from 1.44620 to 1.30049, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 8/100

Epoch 00008: val_loss improved from 1.30049 to 1.23906, saving model to saved_models\weights.best.basic_mlp.h


Epoch 00034: val_loss improved from 0.57329 to 0.55781, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 35/100

Epoch 00035: val_loss improved from 0.55781 to 0.55471, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 36/100

Epoch 00036: val_loss improved from 0.55471 to 0.53465, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 37/100

Epoch 00037: val_loss did not improve from 0.53465
Epoch 38/100

Epoch 00038: val_loss did not improve from 0.53465
Epoch 39/100

Epoch 00039: val_loss did not improve from 0.53465
Epoch 40/100

Epoch 00040: val_loss improved from 0.53465 to 0.52716, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 41/100

Epoch 00041: val_loss improved from 0.52716 to 0.51337, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 42/100

Epoch 00042: val_loss did not improve from 0.51337
Epoch 43/100

Epoch 00043: val_loss improved from 0.51337 to 0.49891, saving model to saved_models\weights.best.basic_


Epoch 00072: val_loss did not improve from 0.45512
Epoch 73/100

Epoch 00073: val_loss did not improve from 0.45512
Epoch 74/100

Epoch 00074: val_loss did not improve from 0.45512
Epoch 75/100

Epoch 00075: val_loss improved from 0.45512 to 0.43899, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 76/100

Epoch 00076: val_loss did not improve from 0.43899
Epoch 77/100

Epoch 00077: val_loss did not improve from 0.43899
Epoch 78/100

Epoch 00078: val_loss did not improve from 0.43899
Epoch 79/100

Epoch 00079: val_loss did not improve from 0.43899
Epoch 80/100

Epoch 00080: val_loss did not improve from 0.43899
Epoch 81/100

Epoch 00081: val_loss did not improve from 0.43899
Epoch 82/100

Epoch 00082: val_loss did not improve from 0.43899
Epoch 83/100

Epoch 00083: val_loss did not improve from 0.43899
Epoch 84/100

Epoch 00084: val_loss did not improve from 0.43899
Epoch 85/100

Epoch 00085: val_loss improved from 0.43899 to 0.42358, saving model to saved_models\weights
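
The "improved" and "did not improve" messages come from `ModelCheckpoint` with `save_best_only=True`, which only writes the weights when the monitored `val_loss` beats the best value seen so far. The logic can be sketched in plain Python. The `track_best` helper is hypothetical, and the non-improving loss values are made up, because the log does not print them.

```python
def track_best(val_losses):
    """Record an epoch as 'saved' only when val_loss beats the best so far."""
    best = float("inf")
    saved_epochs = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            saved_epochs.append(epoch)
    return best, saved_epochs

# Improving values are taken from the log; 0.54 and 0.55 are illustrative stand-ins
val_losses = [0.53465, 0.54, 0.55, 0.54, 0.52716, 0.51337]
best, saved_epochs = track_best(val_losses)
print(best, saved_epochs)  # 0.51337 [1, 5, 6]
```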

### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [6]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  0.9338582754135132
Testing Accuracy:  0.878649115562439


The initial training and testing accuracy scores are quite high (about 93% and 88% respectively). As the difference between them is small (~5%), this suggests that the model has not suffered from significant overfitting.
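
The size of the gap can be read directly off the two scores printed above. A quick check, using those values as literals:

```python
train_acc = 0.9338582754135132   # training accuracy reported above
test_acc = 0.878649115562439     # testing accuracy reported above

gap_points = (train_acc - test_acc) * 100
print(f"Train/test gap: {gap_points:.1f} percentage points")  # Train/test gap: 5.5 percentage points
```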

### Predictions  

Here we will build a method which will allow us to test the model's predictions on a specified audio .wav file.

In [7]:
import librosa 
import numpy as np 

def extract_feature(file_name):
    """Return the mean MFCC vector for an audio file, shaped (1, 40)."""
    try:
        audio_data, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
        mfccs = librosa.feature.mfcc(y=audio_data, sr=sample_rate, n_mfcc=40)
        # Average each coefficient over time to get a fixed-length vector
        mfccsscaled = np.mean(mfccs.T, axis=0)

    except Exception as e:
        print("Error encountered while parsing file: ", file_name)
        return None

    return np.array([mfccsscaled])
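
The shape handling in `extract_feature` can be checked without an audio file. In this sketch a random array stands in for the output of `librosa.feature.mfcc`, which has one row per coefficient and one column per time frame; the frame count of 173 is an arbitrary assumption, since it depends on the clip length.

```python
import numpy as np

rng = np.random.default_rng(0)
n_mfcc, n_frames = 40, 173                     # n_frames is arbitrary here
mfccs = rng.normal(size=(n_mfcc, n_frames))    # stand-in for the MFCC matrix

mfccs_scaled = np.mean(mfccs.T, axis=0)        # average over time frames
features = np.array([mfccs_scaled])            # add a batch dimension for the model

print(mfccs_scaled.shape, features.shape)      # (40,) (1, 40)
```

Averaging over time is what gives every clip the same 40-value input regardless of its duration.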


In [8]:
def print_prediction(file_name):
    prediction_feature = extract_feature(file_name)

    # model.predict returns the softmax probabilities; predict_classes and
    # predict_proba are no longer available in recent Keras releases
    predicted_proba_vector = model.predict(prediction_feature)
    predicted_vector = np.argmax(predicted_proba_vector, axis=1)
    predicted_class = le.inverse_transform(predicted_vector)
    print("The predicted class is:", predicted_class[0], '\n')

    # Print the probability assigned to every class
    predicted_proba = predicted_proba_vector[0]
    for i in range(len(predicted_proba)):
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f'))

### Validation 

#### Test with sample data 

As an initial sanity check, we verify the predictions using a subset of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly.

In [9]:
# Class: Air Conditioner

filename = (r"C:\Users\Simriti Koul\Desktop\CAPSTONE\UrbanSound8K.tar\UrbanSound8K\train\fold5\100852-0-0-0.wav") 
print_prediction(filename) 



The predicted class is: air_conditioner 

air_conditioner 		 :  0.99765741825103759765625000000000
car_horn 		 :  0.00051001412793993949890136718750
children_playing 		 :  0.00010772443783935159444808959961
dog_bark 		 :  0.00004354374686954542994499206543
drilling 		 :  0.00042141648009419441223144531250
engine_idling 		 :  0.00103361869696527719497680664062
gun_shot 		 :  0.00001595761568751186132431030273
jackhammer 		 :  0.00006289265729719772934913635254
siren 		 :  0.00000077757891858709626831114292
street_music 		 :  0.00014671032840851694345474243164




In [10]:
# Class: Drilling

filename = (r"C:\Users\Simriti Koul\Desktop\CAPSTONE\UrbanSound8K.tar\UrbanSound8K\train\fold3\103199-4-0-0.wav")
print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.00000453069560535368509590625763
car_horn 		 :  0.00013925565872341394424438476562
children_playing 		 :  0.00147388770710676908493041992188
dog_bark 		 :  0.00015168463869486004114151000977
drilling 		 :  0.93388199806213378906250000000000
engine_idling 		 :  0.00000031090453944671025965362787
gun_shot 		 :  0.00001169474489870481193065643311
jackhammer 		 :  0.00000526221401742077432572841644
siren 		 :  0.00000013799305520478810649365187
street_music 		 :  0.06433135271072387695312500000000


In [11]:
# Class: Street music 

filename = (r"C:\Users\Simriti Koul\Desktop\CAPSTONE\UrbanSound8K.tar\UrbanSound8K\train\fold7\101848-9-0-0.wav")
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00354291405528783798217773437500
car_horn 		 :  0.00334771955385804176330566406250
children_playing 		 :  0.02169414423406124114990234375000
dog_bark 		 :  0.18936577439308166503906250000000
drilling 		 :  0.00625380314886569976806640625000
engine_idling 		 :  0.00078949128510430455207824707031
gun_shot 		 :  0.00902641285210847854614257812500
jackhammer 		 :  0.00607445463538169860839843750000
siren 		 :  0.00041220203274860978126525878906
street_music 		 :  0.75949299335479736328125000000000


In [12]:
# Class: Car Horn 

filename = (r"C:\Users\Simriti Koul\Desktop\CAPSTONE\UrbanSound8K.tar\UrbanSound8K\train\fold10\100648-1-0-0.wav")
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00472804810851812362670898437500
car_horn 		 :  0.05722814798355102539062500000000
children_playing 		 :  0.10663680732250213623046875000000
dog_bark 		 :  0.19712772965431213378906250000000
drilling 		 :  0.10537705570459365844726562500000
engine_idling 		 :  0.00423351302742958068847656250000
gun_shot 		 :  0.04030582308769226074218750000000
jackhammer 		 :  0.13848775625228881835937500000000
siren 		 :  0.00154224026482552289962768554688
street_music 		 :  0.34433290362358093261718750000000


#### Observations 

From this brief sanity check the model seems to predict well. One error was observed, whereby a car horn was incorrectly classified as street music.

We can see from the per-class confidence that this was quite a low score (about 34%), with dog bark the next most likely class (about 20%). The latter follows our earlier observation that a dog bark and a car horn are similar in spectral shape.
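
One way to quantify how uncertain a prediction is: rank the per-class probabilities and compare the top two. The sketch below uses the car horn output above, rounded to six decimal places. Reading the `margin` as a confidence measure is our own assumption, not something the model reports.

```python
# Per-class probabilities for the car horn sample, rounded from the output above
probabilities = {
    "air_conditioner": 0.004728,
    "car_horn": 0.057228,
    "children_playing": 0.106637,
    "dog_bark": 0.197128,
    "drilling": 0.105377,
    "engine_idling": 0.004234,
    "gun_shot": 0.040306,
    "jackhammer": 0.138488,
    "siren": 0.001542,
    "street_music": 0.344333,
}

# Sort classes from most to least likely and take the top two
ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
(top_label, top_p), (second_label, second_p) = ranked[:2]
margin = top_p - second_p

print(top_label, second_label, round(margin, 4))  # street_music dog_bark 0.1472
```

A small margin, as here, indicates that the model did not strongly favour any single class.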

### Other audio

Here we will use a sample of various copyright-free sounds that were not part of either our test or training data to further validate our model.

In [13]:
filename = (r"C:\Users\Simriti Koul\Desktop\CAPSTONE\Evaluation audio\dog_bark_1.wav")
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00030687192338518798351287841797
car_horn 		 :  0.00032225772156380116939544677734
children_playing 		 :  0.00474609946832060813903808593750
dog_bark 		 :  0.77336806058883666992187500000000
drilling 		 :  0.00888785626739263534545898437500
engine_idling 		 :  0.00018908230413217097520828247070
gun_shot 		 :  0.01179265696555376052856445312500
jackhammer 		 :  0.00010120595106855034828186035156
siren 		 :  0.00019697181414812803268432617188
street_music 		 :  0.20008899271488189697265625000000


In [14]:
filename = (r"C:\Users\Simriti Koul\Desktop\CAPSTONE\Evaluation audio\drilling_1.wav")

print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.34108763933181762695312500000000
car_horn 		 :  0.00678393011912703514099121093750
children_playing 		 :  0.03632827475666999816894531250000
dog_bark 		 :  0.00861750636249780654907226562500
drilling 		 :  0.42608013749122619628906250000000
engine_idling 		 :  0.01366939768195152282714843750000
gun_shot 		 :  0.00453296350315213203430175781250
jackhammer 		 :  0.14550997316837310791015625000000
siren 		 :  0.00192271184641867876052856445312
street_music 		 :  0.01546743605285882949829101562500


In [15]:
filename = (r"C:\Users\Simriti Koul\Desktop\CAPSTONE\Evaluation audio\gun_shot_1.wav")

print_prediction(filename) 

# sample data weighted towards gun shot - peak in the dog barking sample is similar in shape to the gun shot sample

The predicted class is: dog_bark 

air_conditioner 		 :  0.06776662170886993408203125000000
car_horn 		 :  0.00015831571363378316164016723633
children_playing 		 :  0.00096621987177059054374694824219
dog_bark 		 :  0.55382043123245239257812500000000
drilling 		 :  0.00146359112113714218139648437500
engine_idling 		 :  0.00372890196740627288818359375000
gun_shot 		 :  0.00162448163609951734542846679688
jackhammer 		 :  0.00007820993778295814990997314453
siren 		 :  0.00039035922964103519916534423828
street_music 		 :  0.37000289559364318847656250000000


In [16]:
filename = (r"C:\Users\Simriti Koul\Desktop\CAPSTONE\Evaluation audio\siren_1.wav")

print_prediction(filename) 

The predicted class is: siren 

air_conditioner 		 :  0.00000010728005861437850398942828
car_horn 		 :  0.00000962881131272297352552413940
children_playing 		 :  0.00002930800656031351536512374878
dog_bark 		 :  0.01859312690794467926025390625000
drilling 		 :  0.00000646794342173961922526359558
engine_idling 		 :  0.02171797119081020355224609375000
gun_shot 		 :  0.00006894153921166434884071350098
jackhammer 		 :  0.00000164585208040080033242702484
siren 		 :  0.95715039968490600585937500000000
street_music 		 :  0.00242242426611483097076416015625


#### Observations 

The performance of our initial model is satisfactory and it appears to generalise reasonably well to new audio data: three of the four new samples were classified correctly, although the gun shot sample was misclassified as a dog bark.

### *In the next notebook we will refine our model*