# Classifying Urban sounds using Deep Learning

## 3 Model Training and Evaluation 

### Load Preprocessed data 

In [1]:
# retrieve the preprocessed data from previous notebook

%store -r x_train 
%store -r x_test 
%store -r y_train 
%store -r y_test 
%store -r yy 
%store -r le

### Initial model architecture - MLP

We will start by constructing a Multilayer Perceptron (MLP) neural network using Keras with a TensorFlow backend. 

We use a `Sequential` model so that we can build the network layer by layer. 

We will begin with a simple model architecture consisting of three layers: an input layer, a hidden layer and an output layer. All three will be `Dense` (fully connected) layers, in which every node is connected to every node of the previous layer. 

The first layer will receive the input shape. As each sample contains 40 MFCCs (or columns), it has a shape of (1x40), so we will start with an input shape of 40. 

The first two layers will have 256 nodes each. The activation function for these two layers is `ReLU` (Rectified Linear Unit), which outputs zero for negative inputs and passes positive inputs through unchanged. It is cheap to compute and is a common default for hidden layers.
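
As a rough illustration of what the first layer computes, here is a minimal NumPy sketch of one dense layer followed by ReLU. The weights are random placeholders and the names `W`, `b` and `x` are made up for this example; it is not how Keras initialises or stores its parameters, only a way to see the shapes and the parameter count.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_units = 40, 256          # 40 MFCCs in, 256 nodes out

# Hypothetical, randomly initialised parameters (illustration only)
W = rng.normal(scale=0.05, size=(n_inputs, n_units))
b = np.zeros(n_units)

x = rng.normal(size=(1, n_inputs))   # stand-in for one sample of 40 averaged MFCCs

z = x @ W + b                        # affine part of the dense layer
a = np.maximum(z, 0.0)               # ReLU: negative values become 0

n_params = W.size + b.size           # 40*256 weights + 256 biases
print(a.shape, n_params)             # (1, 256) 10496
```

The count of 10496 agrees with the `Param #` reported for `dense_1` in the model summary further down.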

We will also apply a `Dropout` rate of 50% to our first two layers. During training this randomly excludes half of the nodes from each update cycle, which in turn results in a network that generalises better and is less likely to overfit the training data.
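
To make the idea concrete, the following is a small NumPy sketch of "inverted" dropout, the variant in which the surviving activations are scaled up by `1 / (1 - rate)` so their expected value is unchanged. The function name and the all-ones input are assumptions chosen for illustration; in the notebook the Keras `Dropout` layer does this for us, and only while training.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a random fraction `rate` of units, rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob   # True where a unit is kept
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
a = np.ones((1, 256))                 # stand-in activations from a 256-node layer
dropped = dropout(a, rate=0.5, rng=rng)

# Roughly half of the units are zeroed; the kept ones are scaled from 1.0 to 2.0
print(np.count_nonzero(dropped == 0), "of", a.size, "units zeroed")
```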

Our output layer will have 10 nodes (`num_labels`), which matches the number of possible classifications. The activation for our output layer is `softmax`. Softmax makes the outputs sum to 1 so that they can be interpreted as probabilities. The model then makes its prediction based on which class has the highest probability.
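
A short NumPy sketch of softmax may help here. The three input scores are arbitrary example values (the real model has ten outputs), and subtracting the maximum is a common numerical-stability trick rather than something specific to this notebook.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # avoids overflow in exp for large scores
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])      # example raw outputs for three classes
probs = softmax(logits)

predicted_index = int(np.argmax(probs))  # the class with the highest probability
print(probs, probs.sum(), predicted_index)
```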

In [2]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

# Number of output classes, taken from the one-hot encoded labels
num_labels = yy.shape[1]

# Construct model 
model = Sequential()

model.add(Dense(256, input_shape=(40,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))

model.add(Dense(num_labels))
model.add(Activation('softmax'))

Using TensorFlow backend.

### Compiling the model 

For compiling our model, we will use the following three parameters: 

* Loss function - we will use `categorical_crossentropy`, a standard choice for multi-class classification with one-hot encoded labels. A lower value indicates that the model is performing better.

* Metrics - we will use the `accuracy` metric, which lets us view the accuracy score on the validation data while the model trains. 

* Optimizer - here we will use `adam`, which adapts the learning rate for each parameter and generally works well with its default settings.
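
To show why a lower loss means a better model, here is a minimal NumPy sketch of categorical cross-entropy for a single sample. The label and the two prediction vectors are invented example values, and the clipping constant `eps` is an assumption used to avoid `log(0)`; Keras computes the same quantity internally, averaged over a batch.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)          # guard against log(0)
    return float(-np.sum(y_true * np.log(y_pred)))

y_true = np.array([0.0, 0.0, 1.0])              # one-hot label: the third class is correct

confident = np.array([0.05, 0.05, 0.90])        # most probability on the right class
unsure = np.array([0.40, 0.30, 0.30])           # probability spread across classes

loss_confident = categorical_crossentropy(y_true, confident)   # about 0.105
loss_unsure = categorical_crossentropy(y_true, unsure)         # about 1.204
print(loss_confident, loss_unsure)
```

Only the probability assigned to the true class contributes, so the loss shrinks towards 0 as that probability approaches 1.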


In [3]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 

In [4]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=0)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 256)               10496     
_________________________________________________________________
activation_1 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 256)               65792     
_________________________________________________________________
activation_2 (Activation)    (None, 256)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)               

### Training 

Here we will train the model. 

We will start with 100 epochs, where an epoch is one full pass of the model through the training data. The validation loss typically improves from epoch to epoch until it plateaus, after which further training brings little benefit. 
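
The training cell in this section passes `save_best_only=True` to `ModelCheckpoint`, so weights are written only when the validation loss improves on the best value seen so far. The logic can be sketched in plain Python as follows; the `val_losses` list is hypothetical and is not taken from the training log.

```python
# Hypothetical validation losses for six epochs (illustration only)
val_losses = [0.60, 0.55, 0.57, 0.56, 0.52, 0.53]

best = float("inf")
saved_epochs = []

for epoch, val_loss in enumerate(val_losses, start=1):
    if val_loss < best:
        # This is the point at which the checkpoint file would be overwritten
        best = val_loss
        saved_epochs.append(epoch)

print(saved_epochs, best)   # [1, 2, 5] 0.52
```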

We will also start with a small batch size (32), as a very large batch size can reduce the generalisation ability of the model. 
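
For a sense of scale, the number of weight updates implied by these settings can be worked out directly. The sample count of 6985 is the one reported in the training log of this notebook; the variable names are only for this example.

```python
import math

n_train = 6985        # training samples, as reported in the training log
batch_size = 32
epochs = 100

# The final batch of each epoch is smaller, hence the ceiling
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)   # 219 21900
```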

In [5]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

num_epochs = 100
num_batch_size = 32

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.basic_mlp.hdf5', 
                               verbose=1, save_best_only=True)
start = datetime.now()

model.fit(x_train, y_train,
          batch_size=num_batch_size,
          epochs=num_epochs,
          validation_data=(x_test, y_test),
          callbacks=[checkpointer],
          verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Train on 6985 samples, validate on 1747 samples
Epoch 1/100

Epoch 00001: val_loss improved from inf to 5.83231, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 2/100

Epoch 00002: val_loss improved from 5.83231 to 1.99126, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 3/100

Epoch 00003: val_loss improved from 1.99126 to 1.74013, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 4/100

Epoch 00004: val_loss improved from 1.74013 to 1.53282, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 5/100

Epoch 00005: val_loss improved from 1.53282 to 1.41505, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 6/100

Epoch 00006: val_loss improved from 1.41505 to 1.30108, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 7/100

Epoch 00007: val_loss improved from 1.30108 to 1.20931, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 8/100

Epoch 00008: 


Epoch 00032: val_loss improved from 0.57006 to 0.55217, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 33/100

Epoch 00033: val_loss improved from 0.55217 to 0.54220, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 34/100

Epoch 00034: val_loss improved from 0.54220 to 0.53647, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 35/100

Epoch 00035: val_loss did not improve from 0.53647
Epoch 36/100

Epoch 00036: val_loss did not improve from 0.53647
Epoch 37/100

Epoch 00037: val_loss did not improve from 0.53647
Epoch 38/100

Epoch 00038: val_loss did not improve from 0.53647
Epoch 39/100

Epoch 00039: val_loss improved from 0.53647 to 0.52617, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 40/100

Epoch 00040: val_loss improved from 0.52617 to 0.50408, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 41/100

Epoch 00041: val_loss did not improve from 0.50408
Epoch 42/100

Epoch 00042: val_loss did not improve f


Epoch 00070: val_loss did not improve from 0.42831
Epoch 71/100

Epoch 00071: val_loss did not improve from 0.42831
Epoch 72/100

Epoch 00072: val_loss did not improve from 0.42831
Epoch 73/100

Epoch 00073: val_loss did not improve from 0.42831
Epoch 74/100

Epoch 00074: val_loss did not improve from 0.42831
Epoch 75/100

Epoch 00075: val_loss did not improve from 0.42831
Epoch 76/100

Epoch 00076: val_loss improved from 0.42831 to 0.42508, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 77/100

Epoch 00077: val_loss did not improve from 0.42508
Epoch 78/100

Epoch 00078: val_loss did not improve from 0.42508
Epoch 79/100

Epoch 00079: val_loss did not improve from 0.42508
Epoch 80/100

Epoch 00080: val_loss improved from 0.42508 to 0.41355, saving model to saved_models/weights.best.basic_mlp.hdf5
Epoch 81/100

Epoch 00081: val_loss did not improve from 0.41355
Epoch 82/100

Epoch 00082: val_loss did not improve from 0.41355
Epoch 83/100

Epoch 00083: val_loss did not 

### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [6]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  0.9315676689147949
Testing Accuracy:  0.880366325378418


The initial training and testing accuracy scores are reasonably high (about 93.2% and 88.0% respectively). The gap between them is modest (about 5 percentage points), which suggests that the model has not overfitted badly. 
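
The size of that gap can be checked from the two scores printed above. This is simple arithmetic on the reported values; the variable names are made up for the example.

```python
train_acc = 0.9315676689147949   # training accuracy printed above
test_acc = 0.880366325378418     # testing accuracy printed above

# Difference expressed in percentage points
gap_pp = (train_acc - test_acc) * 100
print(round(gap_pp, 2))          # 5.12
```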

### Predictions  

Here we will build a function that lets us test the model's predictions on a specified audio `.wav` file. 
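
The feature extraction in the next cell averages each MFCC over time, so a clip of any length becomes a single row of 40 values. The shape handling can be sketched with NumPy alone; the random array stands in for the output of `librosa.feature.mfcc`, and the frame count of 173 is an arbitrary example value.

```python
import numpy as np

rng = np.random.default_rng(0)

n_mfcc, n_frames = 40, 173                      # frame count is a made-up example
mfccs = rng.normal(size=(n_mfcc, n_frames))     # stand-in for an MFCC matrix (coefficients x frames)

mfccs_scaled = np.mean(mfccs.T, axis=0)         # average each coefficient over all frames
features = np.array([mfccs_scaled])             # add a batch dimension for the model

print(mfccs_scaled.shape, features.shape)       # (40,) (1, 40)
```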

In [7]:
import librosa 
import numpy as np 

def extract_feature(file_name):
    """Return the 40 MFCCs averaged over time as a (1, 40) array, or None on failure."""
    try:
        audio_data, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
        mfccs = librosa.feature.mfcc(y=audio_data, sr=sample_rate, n_mfcc=40)
        # Average each coefficient over all frames
        mfccsscaled = np.mean(mfccs.T, axis=0)

    except Exception as e:
        print("Error encountered while parsing file: ", file_name, e)
        return None

    return np.array([mfccsscaled])


In [8]:
def print_prediction(file_name):
    prediction_feature = extract_feature(file_name)
    if prediction_feature is None:
        return

    # A single forward pass gives the softmax probability for every class.
    # (predict_classes / predict_proba are not available in recent Keras releases.)
    predicted_proba = model.predict(prediction_feature)[0]

    predicted_class = le.inverse_transform([np.argmax(predicted_proba)])
    print("The predicted class is:", predicted_class[0], '\n')

    for i in range(len(predicted_proba)):
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f'))

### Validation 

#### Test with sample data 

As an initial sanity check, we verify the predictions using a subset of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly. 

In [9]:
# Class: Air Conditioner

filename = '../UrbanSound Dataset sample/audio/100852-0-0-0.wav' 
print_prediction(filename) 

The predicted class is: air_conditioner 

air_conditioner 		 :  0.99999380111694335937500000000000
car_horn 		 :  0.00000000555373436128547837142833
children_playing 		 :  0.00000163127674568386282771825790
dog_bark 		 :  0.00000015679631815146422013640404
drilling 		 :  0.00000291759624815313145518302917
engine_idling 		 :  0.00000081592742162683862261474133
gun_shot 		 :  0.00000001164550766930005920585245
jackhammer 		 :  0.00000063732443322805920615792274
siren 		 :  0.00000000005898734972697994294322
street_music 		 :  0.00000004880940096541053208056837


In [10]:
# Class: Drilling

filename = '../UrbanSound Dataset sample/audio/103199-4-0-0.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00000003026374173487056395970285
car_horn 		 :  0.00000595925257584895007312297821
children_playing 		 :  0.00005171071097720414400100708008
dog_bark 		 :  0.00003775984077947214245796203613
drilling 		 :  0.41230171918869018554687500000000
engine_idling 		 :  0.00000001411426975295171359903179
gun_shot 		 :  0.00000008006793450476834550499916
jackhammer 		 :  0.00000058555855275699286721646786
siren 		 :  0.00000000163676161513137685687980
street_music 		 :  0.58760219812393188476562500000000


In [11]:
# Class: Street music 

filename = '../UrbanSound Dataset sample/audio/101848-9-0-0.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.01390491519123315811157226562500
car_horn 		 :  0.00386997428722679615020751953125
children_playing 		 :  0.06208047270774841308593750000000
dog_bark 		 :  0.02224843762814998626708984375000
drilling 		 :  0.00544879632070660591125488281250
engine_idling 		 :  0.00204028235748410224914550781250
gun_shot 		 :  0.00171441223938018083572387695312
jackhammer 		 :  0.03946266323328018188476562500000
siren 		 :  0.00202519563026726245880126953125
street_music 		 :  0.84720492362976074218750000000000


In [12]:
# Class: Car Horn 

filename = '../UrbanSound Dataset sample/audio/100648-1-0-0.wav'
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00177519978024065494537353515625
car_horn 		 :  0.28378158807754516601562500000000
children_playing 		 :  0.01201169472187757492065429687500
dog_bark 		 :  0.34475034475326538085937500000000
drilling 		 :  0.15373921394348144531250000000000
engine_idling 		 :  0.00199882802553474903106689453125
gun_shot 		 :  0.00488347373902797698974609375000
jackhammer 		 :  0.00751307141035795211791992187500
siren 		 :  0.00164022482931613922119140625000
street_music 		 :  0.18790635466575622558593750000000


#### Observations 

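From this brief sanity check the model classified two of the four samples correctly, both with high confidence. Two errors were observed: the drilling sample was classified as street music (59% against 41% for drilling), and the car horn sample was incorrectly classified as a dog bark. In both cases the true class was the runner-up. 

We can see from the per-class confidence that the car horn prediction had quite a low score (34% for dog bark against 28% for car horn). This also follows our earlier observation that a dog bark and a car horn are similar in spectral shape. 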
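
One way to quantify how close such a decision was is to rank the per-class probabilities and look at the margin between the top two. The sketch below uses the probabilities printed for the car horn sample above, rounded to five decimal places; the variable names are only for this example.

```python
# Probabilities printed for the car horn sample, rounded to 5 decimal places
probs = {
    "air_conditioner": 0.00178,
    "car_horn": 0.28378,
    "children_playing": 0.01201,
    "dog_bark": 0.34475,
    "drilling": 0.15374,
    "engine_idling": 0.00200,
    "gun_shot": 0.00488,
    "jackhammer": 0.00751,
    "siren": 0.00164,
    "street_music": 0.18791,
}

# Sort classes from most to least probable and take the top two
ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
(top_label, top_p), (second_label, second_p) = ranked[:2]

margin = top_p - second_p
print(top_label, second_label, round(margin, 5))   # dog_bark car_horn 0.06097
```

A small margin like this one indicates that the model was undecided between the two classes rather than confidently wrong.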

### Other audio

Here we will use a sample of various copyright-free sounds that were not part of either our test or training data to further validate our model. 

In [13]:
filename = '../Evaluation audio/dog_bark_1.wav'
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00005259284080239012837409973145
car_horn 		 :  0.01414114609360694885253906250000
children_playing 		 :  0.00767272943630814552307128906250
dog_bark 		 :  0.86162441968917846679687500000000
drilling 		 :  0.01549693755805492401123046875000
engine_idling 		 :  0.00000679979439155431464314460754
gun_shot 		 :  0.01462879497557878494262695312500
jackhammer 		 :  0.00002205451892223209142684936523
siren 		 :  0.00021641349303536117076873779297
street_music 		 :  0.08613799512386322021484375000000


In [14]:
filename = '../Evaluation audio/drilling_1.wav'

print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.04326583445072174072265625000000
car_horn 		 :  0.00000276631953965988941490650177
children_playing 		 :  0.00069270801031962037086486816406
dog_bark 		 :  0.00009103608317673206329345703125
drilling 		 :  0.95385897159576416015625000000000
engine_idling 		 :  0.00055959104793146252632141113281
gun_shot 		 :  0.00000484342490381095558404922485
jackhammer 		 :  0.00138583348598331212997436523438
siren 		 :  0.00000095427276392001658678054810
street_music 		 :  0.00013732921797782182693481445312


In [15]:
filename = '../Evaluation audio/gun_shot_1.wav'

print_prediction(filename) 

# sample data weighted towards gun shot - peak in the dog barking sample is similar in shape to the gun shot sample

The predicted class is: dog_bark 

air_conditioner 		 :  0.00274201296269893646240234375000
car_horn 		 :  0.00001035439800034509971737861633
children_playing 		 :  0.00000819436263554962351918220520
dog_bark 		 :  0.49960061907768249511718750000000
drilling 		 :  0.00007722984446445479989051818848
engine_idling 		 :  0.00005253454582998529076576232910
gun_shot 		 :  0.00035251950612291693687438964844
jackhammer 		 :  0.00000070260449547276948578655720
siren 		 :  0.00005389723082771524786949157715
street_music 		 :  0.49710193276405334472656250000000


In [16]:
filename = '../Evaluation audio/siren_1.wav'

print_prediction(filename) 

The predicted class is: siren 

air_conditioner 		 :  0.00001048117519530933350324630737
car_horn 		 :  0.00015899920254014432430267333984
children_playing 		 :  0.00137591850943863391876220703125
dog_bark 		 :  0.01713156886398792266845703125000
drilling 		 :  0.00000629096939519513398408889771
engine_idling 		 :  0.02563379704952239990234375000000
gun_shot 		 :  0.00038412530557252466678619384766
jackhammer 		 :  0.00000720441039447905495762825012
siren 		 :  0.95451909303665161132812500000000
street_music 		 :  0.00077248347224667668342590332031


#### Observations 

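The performance of our initial model is satisfactory. It has generalised reasonably well, correctly classifying three of the four new audio samples. The exception was the gun shot sample, whose prediction was split almost evenly between dog bark and street music. 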

### *In the next notebook we will refine our model*