# Classifying Urban sounds using Deep Learning

## 4 Model Refinement 

### Load Preprocessed data 

In [5]:
# retrieve the preprocessed data from previous notebook

%store -r x_train 
%store -r x_test 
%store -r y_train 
%store -r y_test 
%store -r yy 
%store -r le

#### Model refinement

In our initial attempt, we achieved the following classification accuracy scores: 

* Training data Accuracy:  92.3% 
* Testing data Accuracy:  87% 

We will now see if we can improve upon that score using a Convolutional Neural Network (CNN). 

#### Feature Extraction refinement 

In the previous feature extraction stage, the MFCC vectors varied in size between audio files (depending on each sample's duration). 

However, CNNs require a fixed size for all inputs. To overcome this, we will zero-pad the output vectors so that they are all the same size. 

In [7]:
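import numpy as np
import librosa

max_pad_len = 174  # number of frames every MFCC matrix is padded to

def extract_features(file_name):
    """Return a (40, max_pad_len) zero-padded MFCC matrix, or None on failure."""
    try:
        audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
        mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
        # Zero-pad the time axis (columns) on the right up to max_pad_len frames
        pad_width = max_pad_len - mfccs.shape[1]
        mfccs = np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')

    except Exception as e:
        # Report the file and the underlying error instead of discarding it
        print("Error encountered while parsing file: ", file_name, e)
        return None

    return mfccs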
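
The padding step can be tried in isolation on dummy arrays. The sketch below is illustrative only: `pad_or_truncate` is a hypothetical helper that is not part of the notebook, and trimming clips longer than 174 frames is an assumption added here. In `extract_features`, a negative `pad_width` is rejected by `np.pad`, so such a file would fall into the `except` branch and return `None`.

```python
import numpy as np

def pad_or_truncate(mfccs, max_pad_len=174):
    """Zero-pad (or trim) the time axis of an MFCC matrix to max_pad_len frames."""
    n_frames = mfccs.shape[1]
    if n_frames >= max_pad_len:
        # Assumption for this sketch: longer clips are trimmed rather than rejected
        return mfccs[:, :max_pad_len]
    pad_width = max_pad_len - n_frames
    # Pad only on the right of the time axis; the 40 MFCC rows are left untouched
    return np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')

short_clip = np.ones((40, 100))  # stand-in for the MFCCs of a short clip
long_clip = np.ones((40, 200))   # stand-in for a clip longer than 174 frames

print(pad_or_truncate(short_clip).shape)  # (40, 174)
print(pad_or_truncate(long_clip).shape)   # (40, 174)
```

Padding on the right keeps the original frames aligned at the start of every input, so the zeros only ever appear as trailing silence.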

In [9]:
# Load various imports 
import pandas as pd
import os
import librosa

# Set the path to the full UrbanSound dataset 
fulldatasetpath = 'C:/Users/user/Udacity-ML-Capstone/UrbanSound8K/audio/'

metadata = pd.read_csv('../UrbanSound Dataset sample/metadata/UrbanSound8K.csv')

features = []

# Iterate through each sound file and extract the features 
for index, row in metadata.iterrows():
    
    file_name = os.path.join(os.path.abspath(fulldatasetpath),'fold'+str(row["fold"])+'/',str(row["slice_file_name"]))
    
    class_label = row["class_name"]
    data = extract_features(file_name)
    
    features.append([data, class_label])

# Convert into a pandas DataFrame 
featuresdf = pd.DataFrame(features, columns=['feature','class_label'])

print('Finished feature extraction from ', len(featuresdf), ' files') 



Finished feature extraction from  8732  files


In [11]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

# Convert features and corresponding classification labels into numpy arrays
X = np.array(featuresdf.feature.tolist())
y = np.array(featuresdf.class_label.tolist())

# Encode the classification labels
le = LabelEncoder()
yy = to_categorical(le.fit_transform(y)) 

# split the dataset 
from sklearn.model_selection import train_test_split 

x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.2, random_state = 42)

### Convolutional Neural Network (CNN) model architecture 


We will modify our model to be a Convolutional Neural Network (CNN), again using Keras with a TensorFlow backend. 

Again we will use a `Sequential` model, starting with a simple architecture of four `Conv2D` convolution layers followed by a final `Dense` output layer. 

The convolution layers are designed for feature detection. Each one works by sliding a filter window over the input, multiplying the filter element-wise with the values under the window, summing the products and storing the result in a feature map. This operation is known as a convolution. 
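
As a rough illustration of the sliding-window operation, here is a minimal NumPy sketch for a single channel and a single filter. The `conv2d_valid` function, the toy input and the kernel values are all made up for this example; Keras' `Conv2D` additionally handles multiple channels, many filters, a bias term and the activation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            # Element-wise product of window and kernel, summed to one value
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.arange(12, dtype=float).reshape(3, 4)  # toy 3x4 input
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                  # toy 2x2 filter

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # (2, 3)
print(feature_map)        # every entry is -5.0 for this input
```

With a 2x2 window and no padding, each spatial dimension shrinks by one, which is why a 3x4 input yields a 2x3 feature map.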


The `filters` parameter specifies the number of filters (output feature maps) in each layer. The layers increase in size from 16 to 32, 64 and 128 filters, while the `kernel_size` parameter specifies the size of the kernel window, which in this case is 2, resulting in a 2x2 filter matrix. 
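
The effect of these two parameters on model size can be checked by hand. The sketch below uses a hypothetical helper, `conv2d_params`, which counts one 2x2 weight window per input channel plus one bias for every filter. The first three results agree with the `Param #` column of the model summary shown later in this notebook; the fourth layer is cut off in that output, so its value here is computed rather than confirmed.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights per filter (kernel_size x kernel_size x in_channels) plus one bias each."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

in_channels = 1  # the MFCC input has a single channel
param_counts = []
for filters in (16, 32, 64, 128):
    param_counts.append(conv2d_params(2, in_channels, filters))
    in_channels = filters  # the next layer sees one channel per filter

print(param_counts)  # [80, 2080, 8256, 32896]
```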

The first layer receives an input of shape (40, 174, 1), where 40 is the number of MFCCs, 174 is the number of frames after padding, and 1 is the number of channels, signifying that the audio is mono. 
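
To see how this input shape changes as it passes through the network, the spatial dimensions can be traced by hand. This is a sketch with a hypothetical `trace_shapes` helper, assuming the Keras defaults used in this notebook: no padding and stride 1 for `Conv2D`, and a pooling stride equal to the pool size for `MaxPooling2D`.

```python
def trace_shapes(rows, cols, n_blocks=4, kernel_size=2, pool_size=2):
    """Follow (rows, cols) through n_blocks of convolution then max pooling."""
    shapes = []
    for _ in range(n_blocks):
        # A 'valid' convolution shrinks each dimension by kernel_size - 1
        rows, cols = rows - kernel_size + 1, cols - kernel_size + 1
        shapes.append(('conv', rows, cols))
        # Pooling keeps one value per window and drops any leftover row or column
        rows, cols = rows // pool_size, cols // pool_size
        shapes.append(('pool', rows, cols))
    return shapes

shapes = trace_shapes(40, 174)
for layer, rows, cols in shapes:
    print(layer, rows, cols)
# conv 39 173, pool 19 86, conv 18 85, pool 9 42,
# conv 8 41, pool 4 20, conv 3 19, pool 1 9
```

The first five shapes match the rows visible in the model summary output further down.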

The activation function we will be using for our convolutional layers is `ReLU` which is the same as our previous model. We will use a smaller `Dropout` value of 20% on our convolutional layers. 

Each convolutional layer has an associated pooling layer of `MaxPooling2D` type, and the final convolutional block is additionally followed by a `GlobalAveragePooling2D` layer. The pooling layers reduce the dimensionality of the model (by reducing the parameters and subsequent computation requirements), which serves to shorten the training time and reduce overfitting. Max pooling takes the maximum value in each window, while global average pooling takes the average of each feature map, which is suitable for feeding into our `Dense` output layer.  
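
Both pooling types are simple enough to reproduce in NumPy. The following is a sketch on made-up values, with hypothetical helper names, covering a single 2D feature map for max pooling and a small stack of feature maps for global average pooling.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    h2, w2 = feature_map.shape[0] // 2, feature_map.shape[1] // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]  # drop a leftover odd row or column
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))

def global_average_pool(feature_maps):
    """Average each channel of a (rows, cols, channels) array to a single value."""
    return feature_maps.mean(axis=(0, 1))

fm = np.array([[1., 2., 5., 6.],
               [3., 4., 7., 8.],
               [0., 1., 2., 3.],
               [1., 0., 3., 9.]])

pooled = max_pool_2x2(fm)
print(pooled)  # [[4. 8.] [1. 9.]]

stacked = np.stack([fm, 2 * fm], axis=-1)  # two toy channels, shape (4, 4, 2)
averaged = global_average_pool(stacked)
print(averaged)  # [3.4375 6.875 ]
```

Global average pooling turns each feature map into one number, so the 128 feature maps of the last convolutional block become a 128-element vector for the output layer.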

Our output layer will have 10 nodes (`num_labels`), which matches the number of possible classifications. The activation for our output layer is `softmax`. Softmax makes the outputs sum to 1 so that they can be interpreted as probabilities. The model then makes its prediction based on which class has the highest probability.
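
For reference, softmax itself is only a few lines of NumPy. This sketch uses three made-up raw scores rather than real model outputs; subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # avoids overflow in exp for large scores
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])  # made-up raw scores for three classes
probs = softmax(logits)
predicted_index = int(np.argmax(probs))  # class with the highest probability

print(probs.sum())
print(predicted_index)  # 0
```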

In [12]:
import numpy as np
from keras.models import Sequential
# Only the layers used below are imported; the unused imports (including
# keras.utils.np_utils, which newer Keras releases no longer provide) are dropped
from keras.layers import Dense, Dropout
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D

num_rows = 40
num_columns = 174
num_channels = 1

x_train = x_train.reshape(x_train.shape[0], num_rows, num_columns, num_channels)
x_test = x_test.reshape(x_test.shape[0], num_rows, num_columns, num_channels)

num_labels = yy.shape[1]
filter_size = 2

# Construct model 
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, input_shape=(num_rows, num_columns, num_channels), activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=32, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=64, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=128, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))
model.add(GlobalAveragePooling2D())

model.add(Dense(num_labels, activation='softmax')) 

### Compiling the model 

For compiling our model, we will use the same three parameters as the previous model: the `categorical_crossentropy` loss function, the `accuracy` metric and the `adam` optimizer. 

In [13]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 

In [45]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=1)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_11 (Conv2D)           (None, 39, 173, 16)       80        
_________________________________________________________________
max_pooling2d_11 (MaxPooling (None, 19, 86, 16)        0         
_________________________________________________________________
dropout_17 (Dropout)         (None, 19, 86, 16)        0         
_________________________________________________________________
conv2d_12 (Conv2D)           (None, 18, 85, 32)        2080      
_________________________________________________________________
max_pooling2d_12 (MaxPooling (None, 9, 42, 32)         0         
_________________________________________________________________
dropout_18 (Dropout)         (None, 9, 42, 32)         0         
_________________________________________________________________
conv2d_13 (Conv2D)           (None, 8, 41, 64)         8256      
__________

### Training 

Here we will train the model. As training a CNN can take a significant amount of time, we will start with a low number of epochs and a low batch size. If we can see from the output that the model is converging, we will increase both numbers.  

In [14]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

#num_epochs = 12
#num_batch_size = 128

num_epochs = 72
num_batch_size = 256

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.basic_cnn.hdf5', 
                               verbose=1, save_best_only=True)
start = datetime.now()

model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(x_test, y_test), callbacks=[checkpointer], verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Epoch 1/72
Epoch 00001: val_loss improved from inf to 2.16166, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 2/72
Epoch 00002: val_loss improved from 2.16166 to 1.97656, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 3/72
Epoch 00003: val_loss improved from 1.97656 to 1.80319, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 4/72
Epoch 00004: val_loss improved from 1.80319 to 1.65072, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 5/72
Epoch 00005: val_loss improved from 1.65072 to 1.50073, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 6/72
Epoch 00006: val_loss improved from 1.50073 to 1.43515, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 7/72
Epoch 00007: val_loss improved from 1.43515 to 1.36146, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 8/72
Epoch 00008: val_loss improved from 1.36146 to 1.30493, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 9/72
E

Epoch 26/72
Epoch 00026: val_loss improved from 0.84165 to 0.83587, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 27/72
Epoch 00027: val_loss improved from 0.83587 to 0.80694, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 28/72
Epoch 00028: val_loss improved from 0.80694 to 0.79338, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 29/72
Epoch 00029: val_loss improved from 0.79338 to 0.76853, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 30/72
Epoch 00030: val_loss improved from 0.76853 to 0.74553, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 31/72
Epoch 00031: val_loss improved from 0.74553 to 0.72673, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 32/72
Epoch 00032: val_loss improved from 0.72673 to 0.68795, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 33/72
Epoch 00033: val_loss did not improve from 0.68795
Epoch 34/72
Epoch 00034: val_loss improved from 0.68795 to 0.6

Epoch 52/72
Epoch 00052: val_loss improved from 0.51960 to 0.49570, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 53/72
Epoch 00053: val_loss improved from 0.49570 to 0.46987, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 54/72
Epoch 00054: val_loss did not improve from 0.46987
Epoch 55/72
Epoch 00055: val_loss did not improve from 0.46987
Epoch 56/72
Epoch 00056: val_loss did not improve from 0.46987
Epoch 57/72
Epoch 00057: val_loss improved from 0.46987 to 0.46461, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 58/72
Epoch 00058: val_loss did not improve from 0.46461
Epoch 59/72
Epoch 00059: val_loss improved from 0.46461 to 0.44319, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 60/72
Epoch 00060: val_loss did not improve from 0.44319
Epoch 61/72
Epoch 00061: val_loss improved from 0.44319 to 0.44155, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 62/72
Epoch 00062: val_loss improved from 0.44155 to 0

### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [15]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  0.9198281764984131
Testing Accuracy:  0.8683457374572754


In the run shown above, Training accuracy is about 92.0% and Testing accuracy about 86.8%. Both scores are high, and they are roughly level with our initial model (92.3% and 87%) rather than a clear increase. 

The difference between the Training and Test scores is about 5%, much the same as previously, so the model does not appear to have suffered from significant overfitting. 

### Predictions  

Here we will modify our previous method for testing the model's predictions on a specified .wav audio file. 

In [16]:
def print_prediction(file_name):
    prediction_feature = extract_features(file_name) 
    prediction_feature = prediction_feature.reshape(1, num_rows, num_columns, num_channels)

    # predict_classes and predict_proba are deprecated (see the warnings in the
    # first output below); model.predict returns the per-class probabilities directly
    predicted_proba_vector = model.predict(prediction_feature) 
    predicted_vector = np.argmax(predicted_proba_vector, axis=-1)
    predicted_class = le.inverse_transform(predicted_vector) 
    print("The predicted class is:", predicted_class[0], '\n') 

    # Print the probability assigned to every class
    predicted_proba = predicted_proba_vector[0]
    for i in range(len(predicted_proba)): 
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f') )

### Validation 

#### Test with sample data 

As before, we will verify the predictions using a subset of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly. 

In [17]:
# Class: Air Conditioner

filename = '../UrbanSound Dataset sample/audio/100852-0-0-0.wav' 
print_prediction(filename) 

Instructions for updating:
Please use instead:* `np.argmax(model.predict(x), axis=-1)`,   if your model does multi-class classification   (e.g. if it uses a `softmax` last-layer activation).* `(model.predict(x) > 0.5).astype("int32")`,   if your model does binary classification   (e.g. if it uses a `sigmoid` last-layer activation).
The predicted class is: air_conditioner 

Instructions for updating:
Please use `model.predict()` instead.
air_conditioner 		 :  0.91008162498474121093750000000000
car_horn 		 :  0.00050033943261951208114624023438
children_playing 		 :  0.03129843249917030334472656250000
dog_bark 		 :  0.00170055136550217866897583007812
drilling 		 :  0.02245077863335609436035156250000
engine_idling 		 :  0.00291876145638525485992431640625
gun_shot 		 :  0.00025273195933550596237182617188
jackhammer 		 :  0.02114298008382320404052734375000
siren 		 :  0.00871215853840112686157226562500
street_music 		 :  0.00094158545834943652153015136719


In [19]:
# Class: Drilling

filename = '../UrbanSound Dataset sample/audio/103199-4-0-0.wav'
print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.00001428662653779610991477966309
car_horn 		 :  0.00001807990884117316454648971558
children_playing 		 :  0.00000809219181974185630679130554
dog_bark 		 :  0.00000011522907072958332719281316
drilling 		 :  0.99887949228286743164062500000000
engine_idling 		 :  0.00001036807134369155392050743103
gun_shot 		 :  0.00001112346490117488428950309753
jackhammer 		 :  0.00051710568368434906005859375000
siren 		 :  0.00000012721483244604314677417278
street_music 		 :  0.00054121948778629302978515625000


In [20]:
# Class: Street music 

filename = '../UrbanSound Dataset sample/audio/101848-9-0-0.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00128569686785340309143066406250
car_horn 		 :  0.00452070496976375579833984375000
children_playing 		 :  0.09430360049009323120117187500000
dog_bark 		 :  0.03333890065550804138183593750000
drilling 		 :  0.00023773724387865513563156127930
engine_idling 		 :  0.00010984209075104445219039916992
gun_shot 		 :  0.00000000280732526114491065527545
jackhammer 		 :  0.00001063408763002371415495872498
siren 		 :  0.01030355598777532577514648437500
street_music 		 :  0.85588932037353515625000000000000


In [21]:
# Class: Car Horn 

filename = '../UrbanSound Dataset sample/audio/100648-1-0-0.wav'
print_prediction(filename) 

The predicted class is: jackhammer 

air_conditioner 		 :  0.00465527782216668128967285156250
car_horn 		 :  0.12551957368850708007812500000000
children_playing 		 :  0.01239559240639209747314453125000
dog_bark 		 :  0.14426797628402709960937500000000
drilling 		 :  0.20020681619644165039062500000000
engine_idling 		 :  0.02069007605314254760742187500000
gun_shot 		 :  0.12677641212940216064453125000000
jackhammer 		 :  0.32507652044296264648437500000000
siren 		 :  0.03476354479789733886718750000000
street_music 		 :  0.00564818549901247024536132812500


#### Observations 

We can see that the model performs well, classifying three of the four sample files correctly. 

Interestingly, car horn was again incorrectly classified, this time as jackhammer. The per-class confidence shows that the decision was spread across several classes: jackhammer at about 33% confidence, drilling at about 20% and car horn at only about 13%.  
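
The per-class probabilities printed for the car horn clip can be ranked to make this kind of comparison easier to read. The sketch below copies those probabilities (rounded to four decimal places) into a dictionary; `rank_classes` is a hypothetical helper and not part of the notebook.

```python
def rank_classes(probabilities):
    """Return (label, probability) pairs sorted from most to least likely."""
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)

# Probabilities printed above for the car horn clip, rounded to 4 decimal places
car_horn_clip = {
    "air_conditioner": 0.0047, "car_horn": 0.1255, "children_playing": 0.0124,
    "dog_bark": 0.1443, "drilling": 0.2002, "engine_idling": 0.0207,
    "gun_shot": 0.1268, "jackhammer": 0.3251, "siren": 0.0348,
    "street_music": 0.0056,
}

ranking = rank_classes(car_horn_clip)
top_three = ranking[:3]
true_label_rank = [label for label, _ in ranking].index("car_horn") + 1

print(top_three)        # jackhammer, drilling, dog_bark
print(true_label_rank)  # 5
```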

### Other audio

Again, we will further validate our model using a sample of various copyright-free sounds that were not part of either our test or training data. 

In [22]:
filename = '../Evaluation audio/dog_bark_1.wav'
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00143127748742699623107910156250
car_horn 		 :  0.04909551888704299926757812500000
children_playing 		 :  0.00542677706107497215270996093750
dog_bark 		 :  0.80641847848892211914062500000000
drilling 		 :  0.10159351676702499389648437500000
engine_idling 		 :  0.00020524892897810786962509155273
gun_shot 		 :  0.02920273877680301666259765625000
jackhammer 		 :  0.00046065766946412622928619384766
siren 		 :  0.00367915839888155460357666015625
street_music 		 :  0.00248655420728027820587158203125


In [23]:
filename = '../Evaluation audio/drilling_1.wav'

print_prediction(filename) 

The predicted class is: jackhammer 

air_conditioner 		 :  0.03098107129335403442382812500000
car_horn 		 :  0.00001872309258033055812120437622
children_playing 		 :  0.00028658681549131870269775390625
dog_bark 		 :  0.00064256438054144382476806640625
drilling 		 :  0.00351485470309853553771972656250
engine_idling 		 :  0.00356060685589909553527832031250
gun_shot 		 :  0.00000060925168554604169912636280
jackhammer 		 :  0.96077275276184082031250000000000
siren 		 :  0.00006042141831130720674991607666
street_music 		 :  0.00016177661018446087837219238281


In [24]:
filename = '../Evaluation audio/gun_shot_1.wav'

print_prediction(filename) 

The predicted class is: gun_shot 

air_conditioner 		 :  0.00060294655850157141685485839844
car_horn 		 :  0.00108347751665860414505004882812
children_playing 		 :  0.00439422763884067535400390625000
dog_bark 		 :  0.06342021375894546508789062500000
drilling 		 :  0.18518704175949096679687500000000
engine_idling 		 :  0.01274545863270759582519531250000
gun_shot 		 :  0.72738718986511230468750000000000
jackhammer 		 :  0.00005645949568133801221847534180
siren 		 :  0.00274454662576317787170410156250
street_music 		 :  0.00237840879708528518676757812500


#### Observations 

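The final model performs well and appears to generalise to new audio data: two of the three evaluation clips (dog bark and gun shot) were classified correctly, while the drilling clip was classified as jackhammer with about 96% confidence. 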