# Classifying Urban sounds using Deep Learning

## 3 Model Training and Evaluation 

### Load Preprocessed data 

In [1]:
# retrieve the preprocessed data from previous notebook

%store -r x_train 
%store -r x_test 
%store -r y_train 
%store -r y_test 
%store -r yy 
%store -r le

### Initial model architecture - MLP

We will start by constructing a Multilayer Perceptron (MLP) neural network using Keras with a TensorFlow backend. 

We will use a `Sequential` model so we can build the model layer by layer. 

We will begin with a simple model architecture consisting of three layers: an input layer, a hidden layer and an output layer. All three layers will be of the `Dense` layer type, a fully connected layer that is the standard building block for this kind of network. 

The first layer will receive the input shape. As each sample contains 40 MFCCs (or columns), it has a shape of (1x40), so we will start with an input shape of 40. 

The first two layers will have 256 nodes each. The activation function we will be using for these two layers is `ReLU` (rectified linear unit), which outputs `max(0, x)`. It is widely used in neural networks because it is cheap to compute and tends to train well in practice.

We will also apply a `Dropout` value of 50% on our first two layers. This will randomly exclude nodes from each update cycle which in turn results in a network that is capable of better generalisation and is less likely to overfit the training data.
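To make the effect of dropout concrete, here is a minimal plain-Python sketch of "inverted" dropout as applied during training: each value is zeroed with probability `rate` and the surviving values are scaled up by `1 / (1 - rate)`. The activations and the seed are hypothetical, and this is an illustration of the idea rather than the Keras implementation itself.

```python
import random

def dropout(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [0.5, 1.2, 0.0, 2.0, 0.7, 1.5]  # hypothetical layer outputs
dropped = dropout(activations, rate=0.5, rng=rng)

print(dropped)
```

With a rate of 0.5, every surviving value is doubled, so the expected sum of the layer output stays roughly the same. At prediction time dropout is switched off and all nodes are used.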

Our output layer will have `num_labels` nodes, one for each possible class. The activation for our output layer is `softmax`. Softmax makes the outputs sum to 1 so that they can be interpreted as probabilities. The model will then make its prediction based on which class has the highest probability.
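As a small illustration of how softmax turns raw scores into probabilities, the following sketch applies it to three hypothetical scores (they are not taken from the trained model) and picks the most probable class.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable
    # and does not change the result.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
predicted_index = probs.index(max(probs))

print(probs)
print("sum:", sum(probs), "predicted index:", predicted_index)
```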

In [2]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.optimizers import Adam
from keras.utils import np_utils
from sklearn import metrics 

num_labels = yy.shape[1]
filter_size = 2

# Construct model 
model = Sequential()

model.add(Dense(256, input_shape=(40,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(num_labels))
model.add(Activation('softmax'))

### Compiling the model 

For compiling our model, we will use the following three parameters: 

* Loss function - we will use `categorical_crossentropy`. This is a common choice for multi-class classification with one-hot encoded labels. A lower loss indicates that the model is performing better.

* Metrics - we will use the `accuracy` metric which will allow us to view the accuracy score on the validation data when we train the model. 

* Optimizer - here we will use `adam` which is a generally good optimizer for many use cases.
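
To show why a lower cross-entropy means a better model, the sketch below computes the loss for a single sample by hand. The label and both predictions are hypothetical, and the `eps` clipping value is an assumption made only to avoid `log(0)`.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0, 1, 0]               # one-hot label: the sample belongs to class 1
confident = [0.05, 0.90, 0.05]   # hypothetical good prediction
unsure = [0.40, 0.30, 0.30]      # hypothetical poor prediction

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)

print(loss_confident, loss_unsure)
```

Only the probability assigned to the true class contributes to the loss, so the confident, correct prediction scores lower than the unsure one.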


In [3]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 

In [4]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=0)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 256)               10496     
_________________________________________________________________
activation (Activation)      (None, 256)               0         
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               65792     
_________________________________________________________________
activation_1 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 11)                2
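
The parameter counts for the two hidden layers in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` is ours, introduced only for this check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

first_hidden = dense_params(40, 256)    # 40 MFCC inputs -> 256 nodes
second_hidden = dense_params(256, 256)  # 256 nodes -> 256 nodes

print(first_hidden, second_hidden)
```

These match the `Param #` values reported for `dense` and `dense_1`. The activation and dropout layers have no trainable parameters, which is why they show 0.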

### Training 

Here we will train the model. 

We will start with 100 epochs, which is the number of times the model will cycle through the training data. The model will typically improve with each epoch until the validation loss stops decreasing. 

We will also start with a low batch size, as having a large batch size can reduce the generalisation ability of the model. 
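
The training cell uses a `ModelCheckpoint` callback with `save_best_only=True`, which writes the weights to disk only when the monitored validation loss improves. A minimal sketch of that bookkeeping, using hypothetical loss values rather than the ones from this run, looks like this:

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs at which a save-best-only checkpoint is written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # validation loss improved, so save
            best = loss
            saved.append(epoch)
    return saved, best

# Hypothetical validation losses for five epochs
val_losses = [2.2, 2.0, 2.1, 1.8, 1.9]
saved_epochs, best_loss = epochs_that_save(val_losses)

print(saved_epochs, best_loss)
```

Epochs 3 and 5 do not beat the best loss seen so far, so nothing is saved for them. This corresponds to the "did not improve" lines in the training log.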

In [5]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

num_epochs = 100
num_batch_size = 32

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.basic_mlp.hdf5', 
                               verbose=1, save_best_only=True)
start = datetime.now()

model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(x_test, y_test), callbacks=[checkpointer], verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Epoch 1/100

Epoch 00001: val_loss improved from inf to 2.22269, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 2/100

Epoch 00002: val_loss improved from 2.22269 to 2.01319, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 3/100

Epoch 00003: val_loss improved from 2.01319 to 1.93386, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 4/100

Epoch 00004: val_loss improved from 1.93386 to 1.75590, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 5/100

Epoch 00005: val_loss improved from 1.75590 to 1.64573, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 6/100

Epoch 00006: val_loss improved from 1.64573 to 1.50761, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 7/100

Epoch 00007: val_loss improved from 1.50761 to 1.37554, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 8/100

Epoch 00008: val_loss improved from 1.37554 to 1.30504, saving model to saved_models\weights.best.basic_mlp.h


Epoch 00034: val_loss did not improve from 0.60597
Epoch 35/100

Epoch 00035: val_loss improved from 0.60597 to 0.60185, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 36/100

Epoch 00036: val_loss improved from 0.60185 to 0.58743, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 37/100

Epoch 00037: val_loss improved from 0.58743 to 0.57795, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 38/100

Epoch 00038: val_loss improved from 0.57795 to 0.55665, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 39/100

Epoch 00039: val_loss improved from 0.55665 to 0.53862, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 40/100

Epoch 00040: val_loss did not improve from 0.53862
Epoch 41/100

Epoch 00041: val_loss did not improve from 0.53862
Epoch 42/100

Epoch 00042: val_loss did not improve from 0.53862
Epoch 43/100

Epoch 00043: val_loss did not improve from 0.53862
Epoch 44/100

Epoch 00044: val_loss did not improve f


Epoch 00072: val_loss did not improve from 0.45345
Epoch 73/100

Epoch 00073: val_loss did not improve from 0.45345
Epoch 74/100

Epoch 00074: val_loss did not improve from 0.45345
Epoch 75/100

Epoch 00075: val_loss did not improve from 0.45345
Epoch 76/100

Epoch 00076: val_loss did not improve from 0.45345
Epoch 77/100

Epoch 00077: val_loss did not improve from 0.45345
Epoch 78/100

Epoch 00078: val_loss did not improve from 0.45345
Epoch 79/100

Epoch 00079: val_loss improved from 0.45345 to 0.43167, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 80/100

Epoch 00080: val_loss did not improve from 0.43167
Epoch 81/100

Epoch 00081: val_loss did not improve from 0.43167
Epoch 82/100

Epoch 00082: val_loss did not improve from 0.43167
Epoch 83/100

Epoch 00083: val_loss did not improve from 0.43167
Epoch 84/100

Epoch 00084: val_loss improved from 0.43167 to 0.43155, saving model to saved_models\weights.best.basic_mlp.hdf5
Epoch 85/100

Epoch 00085: val_loss improved

### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [6]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  0.9305971264839172
Testing Accuracy:  0.8672364950180054


The initial training and testing accuracy scores are quite high. The gap between the training score (93.1%) and the test score (86.7%) is about 6 percentage points, which suggests that the model has not overfitted badly. 
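
The size of the gap can be computed directly from the two scores printed above. This is plain arithmetic on the reported values, not an additional evaluation.

```python
train_acc = 0.9305971264839172  # training accuracy reported above
test_acc = 0.8672364950180054   # testing accuracy reported above

# Difference expressed in percentage points
gap_points = (train_acc - test_acc) * 100

print(f"Train/test gap: {gap_points:.2f} percentage points")
```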

### Predictions  

Here we will build a method which will allow us to test the model's predictions on a specified audio .wav file. 

In [7]:
import librosa 
import numpy as np 

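def extract_feature(file_name):
    """Return the mean MFCC vector of an audio file with shape (1, 40), or None on failure."""
    try:
        # Load the audio and compute 40 MFCCs per frame
        audio_data, sample_rate = librosa.load(file_name, res_type='kaiser_fast') 
        mfccs = librosa.feature.mfcc(y=audio_data, sr=sample_rate, n_mfcc=40)
        # Average each coefficient over time to get one 40-value vector
        mfccsscaled = np.mean(mfccs.T, axis=0)
        
    except Exception as e:
        print("Error encountered while parsing file: ", file_name, e)
        return None

    return np.array([mfccsscaled])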
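
The shape handling in `extract_feature` can be checked without any audio file. In the sketch below a random array stands in for the output of `librosa.feature.mfcc`, and the frame count of 173 is a hypothetical value, since the real number of frames depends on the clip length.

```python
import numpy as np

n_mfcc, n_frames = 40, 173  # hypothetical: 40 coefficients over 173 time frames
rng = np.random.default_rng(0)
mfccs = rng.normal(size=(n_mfcc, n_frames))  # stand-in for the MFCC matrix

mfccs_scaled = np.mean(mfccs.T, axis=0)  # average over time -> shape (40,)
features = np.array([mfccs_scaled])      # add a batch dimension -> shape (1, 40)

print(mfccs_scaled.shape, features.shape)
```

Averaging over the time axis gives every clip the same 40-value representation regardless of its duration, and the extra leading dimension matches the `(batch, 40)` input the model expects.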


In [8]:
def print_prediction(file_name):
    prediction_feature = extract_feature(file_name) 
    if prediction_feature is None:
        return

    # predict() returns one probability per class. It is used here instead of
    # predict_classes()/predict_proba(), which newer Keras releases removed.
    predicted_proba_vector = model.predict(prediction_feature) 
    predicted_vector = np.argmax(predicted_proba_vector, axis=1)
    predicted_class = le.inverse_transform(predicted_vector) 
    print("The predicted class is:", predicted_class[0], '\n') 

    # Print the probability assigned to every class
    predicted_proba = predicted_proba_vector[0]
    for i in range(len(predicted_proba)): 
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f') )

### Validation 

#### Test with sample data 

As an initial sanity check, we verify the predictions using a subset of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly. 

In [9]:
# Class: Air Conditioner

filename = '../UrbanSound Dataset sample/audio/100852-0-0-0.wav' 
print_prediction(filename) 

The predicted class is: air_conditioner 

air_conditioner 		 :  0.99992167949676513671875000000000
car_horn 		 :  0.00000011361459684167130035348237
children_playing 		 :  0.00002181141280743759125471115112
dog_bark 		 :  0.00000011583942693960125325247645
drilling 		 :  0.00001428144241799600422382354736
engine_idling 		 :  0.00002238951310573611408472061157
glass_breaking 		 :  0.00000000000000001593990606151730
gun_shot 		 :  0.00000009452040927726557129062712
jackhammer 		 :  0.00000680893981552799232304096222
siren 		 :  0.00000001328410181855588234611787
street_music 		 :  0.00001268768573936540633440017700




In [10]:
# Class: Drilling

filename = '../UrbanSound Dataset sample/audio/103199-4-0-0.wav'
print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.00000000057680454856878782265994
car_horn 		 :  0.00000257601095654536038637161255
children_playing 		 :  0.00003667188866529613733291625977
dog_bark 		 :  0.00014999869745224714279174804688
drilling 		 :  0.91963499784469604492187500000000
engine_idling 		 :  0.00000000083946655182742802026041
glass_breaking 		 :  0.00000000000005380912661517682494
gun_shot 		 :  0.00000523378093930659815669059753
jackhammer 		 :  0.00000159687692757870536297559738
siren 		 :  0.00000000484430495717447229253594
street_music 		 :  0.08016889542341232299804687500000


In [11]:
# Class: Street music 

filename = '../UrbanSound Dataset sample/audio/101848-9-0-0.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00386408553458750247955322265625
car_horn 		 :  0.00136877305340021848678588867188
children_playing 		 :  0.04986609518527984619140625000000
dog_bark 		 :  0.00349211902357637882232666015625
drilling 		 :  0.00110241328366100788116455078125
engine_idling 		 :  0.00106547353789210319519042968750
glass_breaking 		 :  0.00000000000003984033697564115517
gun_shot 		 :  0.00004347573121776804327964782715
jackhammer 		 :  0.00237104855477809906005859375000
siren 		 :  0.00021826046577189117670059204102
street_music 		 :  0.93660825490951538085937500000000


In [12]:
# Class: Car Horn 

filename = '../UrbanSound Dataset sample/audio/100648-1-0-0.wav'
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00956585630774497985839843750000
car_horn 		 :  0.06483205407857894897460937500000
children_playing 		 :  0.11096300184726715087890625000000
dog_bark 		 :  0.35184580087661743164062500000000
drilling 		 :  0.12993347644805908203125000000000
engine_idling 		 :  0.01048373430967330932617187500000
glass_breaking 		 :  0.01307544298470020294189453125000
gun_shot 		 :  0.05226016417145729064941406250000
jackhammer 		 :  0.03807688504457473754882812500000
siren 		 :  0.01632021740078926086425781250000
street_music 		 :  0.20264334976673126220703125000000


#### Observations 

From this brief sanity check the model seems to predict well. One error was observed, whereby a car horn was incorrectly classified as a dog bark. 

We can see from the per-class confidence that this was quite a low score (about 35%). This also follows our earlier observation that a dog bark and a car horn are similar in spectral shape. 
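
One way to make such low-confidence cases easier to spot is to rank the classes and compare the top probability against a cut-off. The sketch below uses the car horn probabilities printed above, rounded to four decimals. The helper `top_k` and the 0.5 threshold are assumptions made for illustration.

```python
# Probabilities from the car horn prediction above, rounded to four decimals
probabilities = {
    "air_conditioner": 0.0096, "car_horn": 0.0648, "children_playing": 0.1110,
    "dog_bark": 0.3518, "drilling": 0.1299, "engine_idling": 0.0105,
    "glass_breaking": 0.0131, "gun_shot": 0.0523, "jackhammer": 0.0381,
    "siren": 0.0163, "street_music": 0.2026,
}

def top_k(probs, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]

ranked = top_k(probabilities)
best_label, best_prob = ranked[0]
threshold = 0.5  # hypothetical cut-off for a "confident" prediction
is_low_confidence = best_prob < threshold

print(ranked)
print("Low confidence:", is_low_confidence)
```

Here the probability mass is spread across several classes, so the prediction would be flagged for review rather than trusted outright.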

### Other audio

Here we will use a sample of various copyright-free sounds that were not part of either our test or training data to further validate our model. 

In [13]:
filename = '../Evaluation audio/dog_bark_1.wav'
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00003251403541071340441703796387
car_horn 		 :  0.01037766505032777786254882812500
children_playing 		 :  0.00129639077931642532348632812500
dog_bark 		 :  0.54404258728027343750000000000000
drilling 		 :  0.00237517897039651870727539062500
engine_idling 		 :  0.00002496482375136110931634902954
glass_breaking 		 :  0.00000000166194058515145570709137
gun_shot 		 :  0.01835866458714008331298828125000
jackhammer 		 :  0.00000079512329875797149725258350
siren 		 :  0.00026770590920932590961456298828
street_music 		 :  0.42322346568107604980468750000000


In [14]:
filename = '../Evaluation audio/drilling_1.wav'

print_prediction(filename) 

The predicted class is: air_conditioner 

air_conditioner 		 :  0.83771055936813354492187500000000
car_horn 		 :  0.00000222103835767484270036220551
children_playing 		 :  0.00037872296525165438652038574219
dog_bark 		 :  0.00022419940796680748462677001953
drilling 		 :  0.01977697014808654785156250000000
engine_idling 		 :  0.00065056153107434511184692382812
glass_breaking 		 :  0.00000001530618298772878915769979
gun_shot 		 :  0.00057724950602278113365173339844
jackhammer 		 :  0.14055275917053222656250000000000
siren 		 :  0.00001812992923078127205371856689
street_music 		 :  0.00010872689017560333013534545898


In [15]:
filename = '../Evaluation audio/gun_shot_1.wav'

print_prediction(filename) 

# sample data weighted towards gun shot - peak in the dog barking sample is similar in shape to the gun shot sample

The predicted class is: street_music 

air_conditioner 		 :  0.08496322482824325561523437500000
car_horn 		 :  0.00068769109202548861503601074219
children_playing 		 :  0.00172164384275674819946289062500
dog_bark 		 :  0.33616262674331665039062500000000
drilling 		 :  0.00136100663803517818450927734375
engine_idling 		 :  0.02243471145629882812500000000000
glass_breaking 		 :  0.00000002199923976320405927253887
gun_shot 		 :  0.01535068452358245849609375000000
jackhammer 		 :  0.00000508102175444946624338626862
siren 		 :  0.00337660964578390121459960937500
street_music 		 :  0.53393661975860595703125000000000


In [16]:
filename = '../Evaluation audio/siren_1.wav'

print_prediction(filename) 

The predicted class is: siren 

air_conditioner 		 :  0.01246712636202573776245117187500
car_horn 		 :  0.00060824892716482281684875488281
children_playing 		 :  0.06705556064844131469726562500000
dog_bark 		 :  0.12617360055446624755859375000000
drilling 		 :  0.00051219062879681587219238281250
engine_idling 		 :  0.35182273387908935546875000000000
glass_breaking 		 :  0.00000006753698045258715865202248
gun_shot 		 :  0.03474744036793708801269531250000
jackhammer 		 :  0.00031340445275418460369110107422
siren 		 :  0.37929305434226989746093750000000
street_music 		 :  0.02700663730502128601074218750000


In [17]:
filename = '../Evaluation audio/glass_break_1.wav'

print_prediction(filename) 

The predicted class is: glass_breaking 

air_conditioner 		 :  0.00000000773925190600266432738863
car_horn 		 :  0.00007980560621945187449455261230
children_playing 		 :  0.00307534891180694103240966796875
dog_bark 		 :  0.44493219256401062011718750000000
drilling 		 :  0.02916192077100276947021484375000
engine_idling 		 :  0.00000000532864996571902338473592
glass_breaking 		 :  0.51823097467422485351562500000000
gun_shot 		 :  0.00438902992755174636840820312500
jackhammer 		 :  0.00000005234744548943126574158669
siren 		 :  0.00012287043500691652297973632812
street_music 		 :  0.00000787032058724435046315193176


#### Observations 

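The performance of our initial model is satisfactory. When tested against new audio data it classified three of the five clips correctly (dog bark, siren and glass breaking), although the drilling and gun shot clips were misclassified and several of the correct predictions had low confidence. 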
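
The results on the new audio clips can be tallied as follows. The expected labels are inferred from the file names, which is an assumption. The predicted labels are copied from the outputs above.

```python
# (expected label inferred from file name, predicted label from the output above)
results = [
    ("dog_bark", "dog_bark"),
    ("drilling", "air_conditioner"),
    ("gun_shot", "street_music"),
    ("siren", "siren"),
    ("glass_breaking", "glass_breaking"),
]

correct = sum(expected == predicted for expected, predicted in results)
accuracy = correct / len(results)

print(f"{correct} of {len(results)} correct ({accuracy:.0%})")
```

Five clips is far too small a sample to estimate accuracy reliably, so this is only a rough indication of how the model behaves on unseen recordings.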

### *In the next notebook we will refine our model*