# Classifying Urban sounds using Deep Learning

## 4 Model Refinement 

### Load Preprocessed data 

#### Model refinement

In our initial attempt, we achieved the following classification accuracy scores: 

* Training data Accuracy:  92.3% 
* Testing data Accuracy:  87% 

We will now see if we can improve upon that score using a Convolutional Neural Network (CNN). 

#### Feature Extraction refinement 

In the previous feature extraction stage, the MFCC vectors varied in size between audio files (depending on each sample's duration). 

However, CNNs require a fixed size for all inputs. To overcome this we will zero pad the output vectors to make them all the same size. 
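The padding step can be tried in isolation on dummy arrays. This is only a sketch: the helper name `pad_or_truncate` and the decision to cut clips that exceed the target length are assumptions made for illustration, whereas the notebook's own function only pads.

```python
import numpy as np

def pad_or_truncate(mfccs, max_pad_len=174):
    """Return an array of shape (n_mfcc, max_pad_len)."""
    n_frames = mfccs.shape[1]
    if n_frames >= max_pad_len:
        # Assumption: clips longer than the target are cut rather than rejected
        return mfccs[:, :max_pad_len]
    pad_width = max_pad_len - n_frames
    # Pad only the time axis, on the right, with zeros
    return np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')

short_clip = np.ones((40, 100))   # stand-in for the MFCCs of a short clip
long_clip = np.ones((40, 200))    # stand-in for the MFCCs of a longer clip

padded = pad_or_truncate(short_clip)
trimmed = pad_or_truncate(long_clip)
print(padded.shape, trimmed.shape)
```

Both results share the same shape, which is what the CNN input layer needs.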

In [None]:
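import numpy as np
import librosa

max_pad_len = 174

def extract_features(file_name):
    """Return a (40, max_pad_len) array of zero-padded MFCCs, or None on failure."""
    try:
        audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast') 
        mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
        # Zero pad the time axis on the right up to max_pad_len frames
        pad_width = max_pad_len - mfccs.shape[1]
        mfccs = np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')
        
    except Exception as e:
        # Also reached when a clip has more than max_pad_len frames (negative pad_width)
        print("Error encountered while parsing file: ", file_name, e)
        return None 
     
    return mfccs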

In [None]:
# Load various imports 
import pandas as pd
import os
import librosa

# Set the path to the full UrbanSound dataset 
fulldatasetpath = r"C:\Users\dell\Desktop\imagine cup\xfree\Urban sound classification\Udacity-ML-Capstone-master\UrbanSound8K\audio"

metadata = pd.read_csv('../UrbanSound Dataset sample/metadata/UrbanSound8K.csv')

features = []

# Iterate through each sound file and extract the features 
for index, row in metadata.iterrows():
    
    file_name = os.path.join(os.path.abspath(fulldatasetpath),'fold'+str(row["fold"])+'/',str(row["slice_file_name"]))
    
    class_label = row["class_name"]
    data = extract_features(file_name)
    features.append([data, class_label])


# Convert into a pandas DataFrame 
featuresdf = pd.DataFrame(features, columns=['feature','class_label'])

print('Finished feature extraction from ', len(featuresdf), ' files') 

Finished feature extraction from  8732  files


In [None]:
#print(featuresdf)

In [None]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

# Convert features and corresponding classification labels into numpy arrays
X = np.array(featuresdf.feature.tolist())
y = np.array(featuresdf.class_label.tolist())

# Encode the classification labels
le = LabelEncoder()
yy = to_categorical(le.fit_transform(y)) 

# split the dataset 
from sklearn.model_selection import train_test_split 

x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.2, random_state = 42)

Using TensorFlow backend.


### Convolutional Neural Network (CNN) model architecture 


We will modify our model to be a Convolutional Neural Network (CNN), again using Keras and a TensorFlow backend. 

Again we will use a `sequential` model, starting with a simple model architecture, consisting of four `Conv2D` convolution layers, with our final output layer being a `dense` layer. 

The convolution layers are designed for feature detection. Each one works by sliding a filter window over the input, multiplying the window element-wise by the filter weights, summing the result and storing it in a feature map. This operation is known as a convolution. 
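The sliding-window operation can be illustrated with a few lines of NumPy. This is a minimal sketch for a single channel and a single filter, not how Keras implements `Conv2D` internally; the function name `conv2d_valid` and the example values are made up for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            # Multiply the window element-wise by the filter, then sum
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.arange(12, dtype=float).reshape(3, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])
feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

With a 2x2 filter, a 3x4 input yields a 2x3 feature map, because each axis shrinks by one.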


The `filters` parameter specifies the number of filters (output channels) in each layer. The layers increase in size from 16 to 32, 64 and 128 filters, while the `kernel_size` parameter specifies the size of the kernel window, which in this case is 2, giving a 2x2 filter matrix. 

The first layer receives an input of shape (40, 174, 1), where 40 is the number of MFCCs, 174 is the number of frames after padding, and 1 is the single channel, since the audio is mono. 
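As a rough check on these numbers, the output shape and parameter count of each convolution layer can be worked out by hand. The sketch below assumes the Keras defaults (no padding and stride 1 for `Conv2D`, stride equal to the pool size for `MaxPooling2D`); the helper names are hypothetical.

```python
def conv_output(size, kernel=2):
    # An unpadded convolution with stride 1 shrinks each axis by kernel - 1
    return size - kernel + 1

def pool_output(size, pool=2):
    # Pooling with stride equal to the pool size; leftover rows/columns are dropped
    return size // pool

def conv_params(in_channels, filters, kernel=2):
    # One kernel x kernel x in_channels weight block plus one bias per filter
    return (kernel * kernel * in_channels + 1) * filters

rows, cols, channels = 40, 174, 1
layers = []
for filters in (16, 32, 64, 128):
    params = conv_params(channels, filters)
    rows, cols = conv_output(rows), conv_output(cols)
    layers.append(((rows, cols, filters), params))
    rows, cols = pool_output(rows), pool_output(cols)
    channels = filters

for shape, params in layers:
    print(shape, params)
```

The first three rows agree with the `model.summary()` output shown later in this notebook (80, 2080 and 8256 parameters).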

The activation function we will be using for our convolutional layers is `ReLU` which is the same as our previous model. We will use a smaller `Dropout` value of 20% on our convolutional layers. 

Each convolutional layer is followed by a `MaxPooling2D` pooling layer, and the final block is additionally followed by a `GlobalAveragePooling2D` layer. The pooling layers reduce the dimensionality of the model (by reducing the parameters and subsequent computation requirements), which shortens the training time and helps reduce overfitting. Max pooling takes the maximum value in each window, while global average pooling takes the average of each feature map, which is suitable for feeding into our `dense` output layer.  
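The two pooling types can be compared on a small array. This is an illustrative NumPy sketch with made-up values and hypothetical helper names, not the Keras implementation.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2; odd trailing rows/columns are dropped."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    # Split into 2x2 blocks, then take the maximum of each block
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

def global_average_pool(feature_maps):
    """Average each channel of an (h, w, channels) array down to one value."""
    return feature_maps.mean(axis=(0, 1))

fmap = np.array([[1.0, 3.0, 2.0, 0.0],
                 [4.0, 2.0, 1.0, 5.0],
                 [0.0, 1.0, 7.0, 2.0],
                 [2.0, 6.0, 3.0, 4.0]])
pooled = max_pool_2x2(fmap)
print(pooled)

# Two channels: the map itself and a doubled copy
stack = np.stack([fmap, 2 * fmap], axis=-1)
channel_means = global_average_pool(stack)
print(channel_means)
```

Max pooling halves each spatial axis, while global average pooling collapses every feature map to a single number, leaving one value per filter for the output layer.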

Our output layer will have 10 nodes (`num_labels`), matching the number of possible classifications. The activation for our output layer is `softmax`. Softmax makes the outputs sum to 1 so that they can be interpreted as probabilities. The model then makes its prediction based on which option has the highest probability.
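A small numeric example may help show what softmax does. The scores below are made up and the function is a plain NumPy sketch rather than the Keras activation itself.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum first for numerical stability; the result is unchanged
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0])  # made-up raw scores for four classes
probs = softmax(logits)
predicted_index = int(np.argmax(probs))
print(probs, probs.sum(), predicted_index)
```

The largest raw score always receives the largest probability, so taking the arg max of the softmax output gives the predicted class.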

In [None]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
from sklearn import metrics 

num_rows = 40
num_columns = 174
num_channels = 1

x_train = x_train.reshape(x_train.shape[0], num_rows, num_columns, num_channels)
x_test = x_test.reshape(x_test.shape[0], num_rows, num_columns, num_channels)

num_labels = yy.shape[1]
filter_size = 2

# Construct model 
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, input_shape=(num_rows, num_columns, num_channels), activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=32, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=64, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=128, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))
model.add(GlobalAveragePooling2D())

model.add(Dense(num_labels, activation='softmax')) 








### Compiling the model 

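For compiling our model, we will use the same three parameters as the previous model: the loss function (`categorical_crossentropy`), the metric (`accuracy`) and the optimizer (`adam`). 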

In [None]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 





In [None]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=1)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 39, 173, 16)       80        
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 19, 86, 16)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 19, 86, 16)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 18, 85, 32)        2080      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 9, 42, 32)         0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 9, 42, 32)         0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 8, 41, 64)         8256      
__________

### Training 

Here we will train the model. As training a CNN can take a significant amount of time, we will start with a low number of epochs and a low batch size. If the output shows that the model is converging, we will increase both numbers.  

In [None]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

#num_epochs = 1
#num_batch_size = 128

num_epochs = 72
num_batch_size = 256

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.basic_cnn.xml',verbose=1, save_best_only=True)
start = datetime.now()

model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(x_test, y_test), callbacks=[checkpointer], verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Train on 6985 samples, validate on 1747 samples
Epoch 1/72

Epoch 00001: val_loss improved from inf to 2.02562, saving model to saved_models/weights.best.basic_cnn.xml
Epoch 2/72

Epoch 00002: val_loss improved from 2.02562 to 1.80637, saving model to saved_models/weights.best.basic_cnn.xml
Epoch 3/72

Epoch 00003: val_loss improved from 1.80637 to 1.68384, saving model to saved_models/weights.best.basic_cnn.xml
Epoch 4/72

Epoch 00004: val_loss improved from 1.68384 to 1.53285, saving model to saved_models/weights.best.basic_cnn.xml
Epoch 5/72

Epoch 00005: val_loss improved from 1.53285 to 1.41057, saving model to saved_models/weights.best.basic_cnn.xml
Epoch 6/72

Epoch 00006: val_loss improved from 1.41057 to 1.35759, saving model to saved_models/weights.best.basic_cnn.xml
Epoch 7/72

Epoch 00007: val_loss improved from 1.35759 to 1.31697, saving model to saved_models/weights.best.basic_cn


Epoch 00033: val_loss did not improve from 0.53955
Epoch 34/72

Epoch 00034: val_loss did not improve from 0.53955
Epoch 35/72

Epoch 00035: val_loss improved from 0.53955 to 0.52952, saving model to saved_models/weights.best.basic_cnn.xml
Epoch 36/72

Epoch 00036: val_loss improved from 0.52952 to 0.51665, saving model to saved_models/weights.best.basic_cnn.xml
Epoch 37/72

Epoch 00037: val_loss did not improve from 0.51665
Epoch 38/72

Epoch 00038: val_loss did not improve from 0.51665
Epoch 39/72

Epoch 00039: val_loss improved from 0.51665 to 0.49274, saving model to saved_models/weights.best.basic_cnn.xml
Epoch 40/72

Epoch 00040: val_loss improved from 0.49274 to 0.47386, saving model to saved_models/weights.best.basic_cnn.xml
Epoch 41/72

Epoch 00041: val_loss did not improve from 0.47386
Epoch 42/72

Epoch 00042: val_loss did not improve from 0.47386
Epoch 43/72

Epoch 00043: val_loss did not improve from 0.47386
Epoch 44/72

Epoch 00044: val_loss did not improve from 0.47386


### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [None]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])


Training Accuracy:  0.9590551181102362
Testing Accuracy:  0.8912421294328616



In [None]:
def out(file_name):
    # Extract padded MFCC features for one audio file and predict its class index
    prediction_feature = extract_features(file_name) 
    prediction_feature = prediction_feature.reshape(1, num_rows, num_columns, num_channels)

    predicted_vector = model.predict_classes(prediction_feature)
    return predicted_vector

In [None]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

# x_test already holds extracted features, so predict on it directly
y_pred = model.predict_classes(x_test)
# y_test is one-hot encoded; convert it back to integer class indices
y_true = np.argmax(y_test, axis=1)

confusion_matrix(y_true, y_pred)


The Training and Testing accuracy scores are both high and an improvement on our initial model. Training accuracy has increased by ~3.6 percentage points (92.3% to 95.9%) and Testing accuracy by ~2 percentage points (87% to 89.1%). 

The gap between the Training and Test scores has widened slightly (~7 points compared to ~5 points previously), though it remains small enough to suggest that the model is not overfitting badly. 
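For reference, the comparison can be recomputed from the accuracy figures printed in this notebook. The values below are those outputs rounded to one decimal place.

```python
# Accuracy figures reported in this notebook (percent, rounded)
initial = {"train": 92.3, "test": 87.0}
cnn = {"train": 95.9, "test": 89.1}

train_gain = cnn["train"] - initial["train"]
test_gain = cnn["test"] - initial["test"]
initial_gap = initial["train"] - initial["test"]
cnn_gap = cnn["train"] - cnn["test"]

print(f"Training gain: {train_gain:.1f} points")
print(f"Testing gain:  {test_gain:.1f} points")
print(f"Train/test gap: {initial_gap:.1f} (initial) vs {cnn_gap:.1f} (CNN)")
```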

### Predictions  

Here we will modify our previous method for testing the model's predictions on a specified audio .wav file. 

In [None]:
def print_prediction(file_name):
    prediction_feature = extract_features(file_name) 
    prediction_feature = prediction_feature.reshape(1, num_rows, num_columns, num_channels)

    predicted_vector = model.predict_classes(prediction_feature)
    predicted_class = le.inverse_transform(predicted_vector) 
    print("The predicted class is:", predicted_class[0], '\n') 

    predicted_proba_vector = model.predict_proba(prediction_feature) 
    predicted_proba = predicted_proba_vector[0]
    for i in range(len(predicted_proba)): 
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f') )

### Validation 

#### Test with sample data 

As before we will verify the predictions using a subsection of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly. 

In [None]:
# Class: Air Conditioner

filename = '../UrbanSound Dataset sample/audio/100852-0-0-0.wav' 
print_prediction(filename) 

The predicted class is: air_conditioner 

air_conditioner 		 :  0.99687838554382324218750000000000
car_horn 		 :  0.00000051920119403803255409002304
children_playing 		 :  0.00065934192389249801635742187500
dog_bark 		 :  0.00006011105142533779144287109375
drilling 		 :  0.00134915194939821958541870117188
engine_idling 		 :  0.00001786953180271666496992111206
gun_shot 		 :  0.00003663754978333599865436553955
jackhammer 		 :  0.00095585657982155680656433105469
siren 		 :  0.00001384064671583473682403564453
street_music 		 :  0.00002834899351000785827636718750


In [None]:
# Class: Drilling

filename = '../UrbanSound Dataset sample/audio/103199-4-0-0.wav'
print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.00000013431598233637487282976508
car_horn 		 :  0.00000025490251687187992502003908
children_playing 		 :  0.00001615423752809874713420867920
dog_bark 		 :  0.00000021904463665123330429196358
drilling 		 :  0.99468487501144409179687500000000
engine_idling 		 :  0.00000016201553876271646004170179
gun_shot 		 :  0.00000000003519966262910401155750
jackhammer 		 :  0.00004046132744406349956989288330
siren 		 :  0.00000000724200077684145071543753
street_music 		 :  0.00525784213095903396606445312500


In [None]:
# Class: Street music 

filename = '../UrbanSound Dataset sample/audio/101848-9-0-0.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00110940611921250820159912109375
car_horn 		 :  0.00011776933388318866491317749023
children_playing 		 :  0.01611172035336494445800781250000
dog_bark 		 :  0.00072734046261757612228393554688
drilling 		 :  0.00000371879946214903611689805984
engine_idling 		 :  0.00000895845732884481549263000488
gun_shot 		 :  0.00000000263216537454979970789282
jackhammer 		 :  0.00000083919400140075595118105412
siren 		 :  0.00168100092560052871704101562500
street_music 		 :  0.98023921251296997070312500000000


In [None]:
# Class: Car Horn 

filename = '../UrbanSound Dataset sample/audio/100648-1-0-0.wav'
print_prediction(filename) 

The predicted class is: car_horn 

air_conditioner 		 :  0.00447100168094038963317871093750
car_horn 		 :  0.28345683217048645019531250000000
children_playing 		 :  0.00685080140829086303710937500000
dog_bark 		 :  0.15641531348228454589843750000000
drilling 		 :  0.08936390280723571777343750000000
engine_idling 		 :  0.01047550234943628311157226562500
gun_shot 		 :  0.25447627902030944824218750000000
jackhammer 		 :  0.17366543412208557128906250000000
siren 		 :  0.01640361361205577850341796875000
street_music 		 :  0.00442126486450433731079101562500


#### Observations 

We can see that the model performs well: all four samples are classified correctly, three of them with more than 98% confidence. 

Interestingly, car horn was classified correctly this time, though the per-class confidence shows it was a close decision: car horn at ~28% against gun shot at ~25%, with jackhammer (~17%) and dog bark (~16%) not far behind.  
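To see how close a decision was, the per-class probabilities can be ranked. The sketch below uses the car horn probabilities printed above, rounded to four decimal places. The helper `top_predictions` is hypothetical, and the class list is written out by hand in alphabetical order, which is the order `LabelEncoder` assigns.

```python
import numpy as np

# Class names in alphabetical order, matching the LabelEncoder indices
class_names = ["air_conditioner", "car_horn", "children_playing", "dog_bark",
               "drilling", "engine_idling", "gun_shot", "jackhammer",
               "siren", "street_music"]

def top_predictions(probabilities, k=3):
    """Return the k most probable (class name, probability) pairs."""
    order = np.argsort(probabilities)[::-1][:k]
    return [(class_names[i], float(probabilities[i])) for i in order]

# Rounded from the car horn sample output in this notebook
car_horn_probs = np.array([0.0045, 0.2835, 0.0069, 0.1564, 0.0894,
                           0.0105, 0.2545, 0.1737, 0.0164, 0.0044])
top3 = top_predictions(car_horn_probs)
for name, prob in top3:
    print(f"{name:<18}{prob:.4f}")
```

A small margin between the first and second entries is a sign that the prediction should be treated with caution.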

### Other audio

Again we will further validate our model using a sample of various copyright-free sounds that were not part of either our test or training data. 

In [None]:
filename = '../Evaluation audio/dog_bark_1.wav'
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00692389300093054771423339843750
car_horn 		 :  0.01195707358419895172119140625000
children_playing 		 :  0.03886408731341361999511718750000
dog_bark 		 :  0.78876549005508422851562500000000
drilling 		 :  0.06209024786949157714843750000000
engine_idling 		 :  0.00038470054278150200843811035156
gun_shot 		 :  0.08008616417646408081054687500000
jackhammer 		 :  0.00062373012769967317581176757812
siren 		 :  0.00402355147525668144226074218750
street_music 		 :  0.00628101080656051635742187500000


In [None]:
filename = '../Evaluation audio/drilling_1.wav'

print_prediction(filename) 

The predicted class is: jackhammer 

air_conditioner 		 :  0.44938567280769348144531250000000
car_horn 		 :  0.00038593693170696496963500976562
children_playing 		 :  0.00039813644252717494964599609375
dog_bark 		 :  0.00285975052975118160247802734375
drilling 		 :  0.01796808838844299316406250000000
engine_idling 		 :  0.00871684495359659194946289062500
gun_shot 		 :  0.00003318917515571229159832000732
jackhammer 		 :  0.52008956670761108398437500000000
siren 		 :  0.00014330238627735525369644165039
street_music 		 :  0.00001956611959030851721763610840


In [None]:
filename = '../Evaluation audio/gun_shot_1.wav'

print_prediction(filename) 

The predicted class is: gun_shot 

air_conditioner 		 :  0.00000004153686106178611225914210
car_horn 		 :  0.00000016333915198174508986994624
children_playing 		 :  0.00000901593011803925037384033203
dog_bark 		 :  0.00041548791341483592987060546875
drilling 		 :  0.00001343048097623977810144424438
engine_idling 		 :  0.00000009433045278228746610693634
gun_shot 		 :  0.99955993890762329101562500000000
jackhammer 		 :  0.00000000109624220812065686914138
siren 		 :  0.00000020992132476749247871339321
street_music 		 :  0.00000149148161199263995513319969


#### Observations 

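The final model classified two of the three new recordings correctly, with high confidence for gun shot (~99.96%) and moderate confidence for dog bark (~79%). The drilling sample was misclassified as jackhammer (~52%), with air conditioner (~45%) close behind and drilling itself below 2%. On this small sample the model appears to generalise reasonably well to new audio data, although the drilling result shows it can still confuse classes. 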