# Classifying Urban sounds using Deep Learning

## 4 Model Refinement 

### Load Preprocessed data 

In [1]:
# retrieve the preprocessed data from previous notebook

%store -r x_train 
%store -r x_test 
%store -r y_train 
%store -r y_test 
%store -r yy 
%store -r le

#### Model refinement

In our initial attempt, we achieved the following classification accuracy: 

* Training data Accuracy:  92.3% 
* Testing data Accuracy:  87% 

We will now see if we can improve upon that score using a Convolutional Neural Network (CNN). 

#### Feature Extraction refinement 

In the previous feature extraction stage, the MFCC vectors varied in size between audio files, depending on each sample's duration. 

However, CNNs require a fixed size for all inputs. To overcome this we will zero pad the output vectors to make them all the same size. 
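
As a minimal sketch of the padding step, the snippet below pads two small stand-in matrices (random values, not real MFCCs) along the time axis so they can be stacked into one array. The helper name `pad_to_width` and the toy clip lengths are assumptions made for illustration only.

```python
import numpy as np

def pad_to_width(matrix, width):
    """Zero-pad a 2-D array on the right of axis 1 up to `width` columns."""
    pad = width - matrix.shape[1]
    if pad < 0:
        raise ValueError("matrix is wider than the target width")
    return np.pad(matrix, pad_width=((0, 0), (0, pad)), mode='constant')

rng = np.random.default_rng(0)
short_clip = rng.normal(size=(40, 60))    # stand-in for a short clip (60 frames)
long_clip = rng.normal(size=(40, 150))    # stand-in for a longer clip (150 frames)

# After padding, both matrices share one shape and can be stacked
padded = [pad_to_width(m, 174) for m in (short_clip, long_clip)]
batch = np.stack(padded)
print(batch.shape)
```

The original values are left untouched and only trailing zeros are appended, so short clips end with a block of silence-like frames.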

In [2]:
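import numpy as np
import librosa

# Number of frames every MFCC matrix is padded to
max_pad_len = 174

def extract_features(file_name):
    """Return a (40, max_pad_len) zero-padded MFCC matrix, or None on failure."""
    try:
        audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast') 
        mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
        # Zero-pad along the time axis; np.pad raises if the clip is longer
        # than max_pad_len frames, which is reported below as a parse error
        pad_width = max_pad_len - mfccs.shape[1]
        mfccs = np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')
        
    except Exception as e:
        print("Error encountered while parsing file: ", file_name, e)
        return None 
     
    return mfccs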

In [5]:
# Load various imports 
import pandas as pd
import os
import librosa

# Set the path to the full UrbanSound dataset 
fulldatasetpath = r"C:\Users\Simriti Koul\Desktop\CAPSTONE\UrbanSound8K.tar\UrbanSound8K"

metadata = pd.read_csv(os.path.join(fulldatasetpath, "metadata", "UrbanSound8K.csv"))

features = []

# Iterate through each sound file and extract the features 
for index, row in metadata.iterrows():
    
    # Audio files live under train\fold<N>\<slice_file_name>
    file_name = os.path.join(fulldatasetpath, "train", "fold" + str(row["fold"]), str(row["slice_file_name"]))
    
    class_label = row["class"]
    data = extract_features(file_name)
    
    features.append([data, class_label])

# Convert into a pandas DataFrame 
featuresdf = pd.DataFrame(features, columns=['feature','class_label'])

print('Finished feature extraction from ', len(featuresdf), ' files') 


Finished feature extraction from  8732  files


In [7]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Convert features and corresponding classification labels into numpy arrays
X = np.array(featuresdf.feature.tolist())
y = np.array(featuresdf.class_label.tolist())

# Encode the classification labels
le = LabelEncoder()
yy = to_categorical(le.fit_transform(y)) 

# split the dataset 
from sklearn.model_selection import train_test_split 

x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.2, random_state = 42)
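
The label-encoding step above can be illustrated with NumPy alone. This is a sketch on four toy labels, not the notebook's data: `np.unique` stands in for `LabelEncoder` (both assign integer ids to the sorted class names), and indexing an identity matrix stands in for `to_categorical`.

```python
import numpy as np

labels = np.array(['siren', 'dog_bark', 'siren', 'car_horn'])  # toy labels

# Sorted unique class names, plus one integer id per sample
classes, encoded = np.unique(labels, return_inverse=True)

# One-hot encode: row i of the identity matrix represents class i
one_hot = np.eye(len(classes))[encoded]

print(classes)
print(encoded)
print(one_hot)
```

Each row of `one_hot` has a single 1 in the column of its class, which is the target format expected by a `softmax` output trained with `categorical_crossentropy`.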

### Convolutional Neural Network (CNN) model architecture 


We will now modify our model to be a Convolutional Neural Network (CNN), again using Keras with a TensorFlow backend. 

Again we will use a `Sequential` model, starting with a simple architecture consisting of four `Conv2D` convolution layers, with a `Dense` layer as the final output layer. 

The convolution layers are designed for feature detection. Each one works by sliding a filter window over the input, multiplying the filter weights element-wise with the values under the window, summing the products, and storing the result in a feature map. This operation is known as a convolution. 
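
The sliding-window operation can be sketched in a few lines of NumPy. This is an illustrative single-filter version with no padding and a stride of 1; the function name `conv2d_valid` and the toy input are assumptions, and a real `Conv2D` layer additionally adds a bias, applies the activation, and learns many filters at once.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            # Element-wise product of window and filter, summed to one value
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.arange(12, dtype=float).reshape(3, 4)   # toy 3x4 input
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # toy 2x2 filter

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)
print(feature_map)
```

A 2x2 filter over a 3x4 input gives a 2x3 feature map, the same "one less in each dimension" effect visible in the model's layer shapes.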


The `filters` parameter specifies the number of filters (output feature maps) in each layer. The layers increase in size from 16 to 32, 64, and 128 filters, while the `kernel_size` parameter specifies the size of the kernel window, which in this case is 2, resulting in a 2x2 filter matrix. 

The first layer receives an input shape of (40, 174, 1), where 40 is the number of MFCCs, 174 is the number of frames after padding, and 1 is the number of channels, since the audio is mono. 

The activation function we will be using for our convolutional layers is `ReLU` which is the same as our previous model. We will use a smaller `Dropout` value of 20% on our convolutional layers. 

Each convolutional layer has an associated `MaxPooling2D` pooling layer, and the final convolutional block is additionally followed by a `GlobalAveragePooling2D` layer. The pooling layers reduce the dimensionality of the model (by reducing the parameters and subsequent computation requirements), which serves to shorten the training time and reduce overfitting. Max pooling takes the maximum value in each window, while global average pooling takes the average of each feature map, which is suitable for feeding into our `Dense` output layer.  
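
The two pooling types can be compared on a toy feature map. The sketch below assumes a single channel, non-overlapping 2x2 windows, and that leftover rows or columns are dropped; `max_pool_2x2` is a hypothetical helper written for this example, not a Keras function.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling; odd trailing rows/columns are dropped."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]
    # Group the map into 2x2 blocks and keep the largest value of each block
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))

feature_map = np.array([[1., 3., 2., 0., 9.],
                        [4., 2., 1., 5., 9.],
                        [0., 1., 7., 8., 9.],
                        [6., 2., 3., 4., 9.]])

pooled = max_pool_2x2(feature_map)   # 4x5 map -> 2x2 map
global_avg = pooled.mean()           # global average pooling: one value per map

print(pooled)
print(global_avg)
```

Max pooling halves each spatial dimension, while global average pooling collapses a whole feature map to a single number, so a stack of 128 feature maps becomes a vector of 128 values.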

Our output layer will have 10 nodes (`num_labels`), which matches the number of possible classifications. The activation for our output layer is `softmax`. Softmax makes the outputs sum to 1 so they can be interpreted as probabilities. The model then makes its prediction based on which class has the highest probability.
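
As a small worked example of the output step, the sketch below applies softmax to three hypothetical raw scores (the model itself has ten) and picks the most probable class. The score values are made up for illustration.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into positive values that sum to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])      # hypothetical raw scores for 3 classes
probs = softmax(logits)
predicted = int(np.argmax(probs))       # index of the most probable class

print(probs, probs.sum())
print(predicted)
```

The ordering of the scores is preserved, so the class with the highest raw score is also the class with the highest probability.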

In [8]:
import numpy as np
# Only the layers used below are imported; np_utils is unused here and is
# not available in recent Keras releases
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D

num_rows = 40
num_columns = 174
num_channels = 1

x_train = x_train.reshape(x_train.shape[0], num_rows, num_columns, num_channels)
x_test = x_test.reshape(x_test.shape[0], num_rows, num_columns, num_channels)

num_labels = yy.shape[1]
filter_size = 2

# Construct model 
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, input_shape=(num_rows, num_columns, num_channels), activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=32, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=64, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=128, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))
model.add(GlobalAveragePooling2D())

model.add(Dense(num_labels, activation='softmax')) 

### Compiling the model 

For compiling our model, we will use the same three parameters as the previous model: the `categorical_crossentropy` loss function, the `accuracy` metric, and the `adam` optimizer. 

In [9]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 

In [10]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=1)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 39, 173, 16)       80        
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 19, 86, 16)        0         
_________________________________________________________________
dropout (Dropout)            (None, 19, 86, 16)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 18, 85, 32)        2080      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 9, 42, 32)         0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 9, 42, 32)         0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 8, 41, 64)         8
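
The output shapes and parameter counts in the summary can be reproduced by hand. The sketch below assumes what the model code specifies: 2x2 kernels with no padding and a stride of 1, 2x2 pooling that drops leftover rows and columns, and a 10-node output layer. The helper names are hypothetical.

```python
def conv_valid(shape, filters, kernel=2):
    """Output shape and parameter count of an unpadded, stride-1 convolution."""
    h, w, c = shape
    params = (kernel * kernel * c + 1) * filters   # weights plus one bias per filter
    return (h - kernel + 1, w - kernel + 1, filters), params

def pool(shape, size=2):
    """Output shape of non-overlapping pooling (remainders are dropped)."""
    h, w, c = shape
    return (h // size, w // size, c)

shape = (40, 174, 1)
conv_rows = []
total = 0
for filters in (16, 32, 64, 128):
    shape, params = conv_valid(shape, filters)
    conv_rows.append((shape, params))
    total += params
    print("conv ", shape, params)
    shape = pool(shape)
    print("pool ", shape)

# Global average pooling leaves one value per channel, which feeds the Dense layer
dense_params = shape[2] * 10 + 10
total += dense_params
print("dense", (10,), dense_params)
print("total", total)
```

The first two convolution rows come out as (39, 173, 16) with 80 parameters and (18, 85, 32) with 2080 parameters, matching the summary above.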

### Training 

Here we will train the model. As training a CNN can take a significant amount of time, we will start with a low number of epochs and a low batch size. If we can see from the output that the model is converging, we will increase both numbers.  

In [11]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

#num_epochs = 12
#num_batch_size = 128

num_epochs = 72
num_batch_size = 256

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.basic_cnn.hdf5', 
                               verbose=1, save_best_only=True)
start = datetime.now()

model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(x_test, y_test), callbacks=[checkpointer], verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Epoch 1/72

Epoch 00001: val_loss improved from inf to 2.19753, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 2/72

Epoch 00002: val_loss improved from 2.19753 to 2.02024, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 3/72

Epoch 00003: val_loss improved from 2.02024 to 1.82358, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 4/72

Epoch 00004: val_loss improved from 1.82358 to 1.69347, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 5/72

Epoch 00005: val_loss improved from 1.69347 to 1.60424, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 6/72

Epoch 00006: val_loss improved from 1.60424 to 1.51072, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 7/72

Epoch 00007: val_loss improved from 1.51072 to 1.45596, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 8/72

Epoch 00008: val_loss improved from 1.45596 to 1.36484, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoc


Epoch 00034: val_loss improved from 0.64531 to 0.64141, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 35/72

Epoch 00035: val_loss did not improve from 0.64141
Epoch 36/72

Epoch 00036: val_loss did not improve from 0.64141
Epoch 37/72

Epoch 00037: val_loss improved from 0.64141 to 0.61411, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 38/72

Epoch 00038: val_loss improved from 0.61411 to 0.60938, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 39/72

Epoch 00039: val_loss did not improve from 0.60938
Epoch 40/72

Epoch 00040: val_loss improved from 0.60938 to 0.59567, saving model to saved_models\weights.best.basic_cnn.hdf5
Epoch 41/72

Epoch 00041: val_loss did not improve from 0.59567
Epoch 42/72

Epoch 00042: val_loss did not improve from 0.59567
Epoch 43/72

Epoch 00043: val_loss did not improve from 0.59567
Epoch 44/72

Epoch 00044: val_loss improved from 0.59567 to 0.58406, saving model to saved_models\weights.best.basic_cnn.hdf


Epoch 00072: val_loss did not improve from 0.38803
Training completed in time:  0:42:50.245160
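
The `val_loss improved` and `did not improve` lines in the log come from `ModelCheckpoint` with `save_best_only=True`, which by default monitors the validation loss. The rule it applies can be sketched as follows; the loss values here are hypothetical and chosen only to show both outcomes, and `best_only_epochs` is not a Keras function.

```python
def best_only_epochs(val_losses):
    """Return the 1-based epochs at which a 'save best only' rule would save."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:          # strictly better than anything seen so far
            best = loss
            saved.append(epoch)
    return saved

# Hypothetical validation losses, for illustration only
val_losses = [2.20, 2.02, 1.82, 1.90, 1.85, 1.69]
saved_epochs = best_only_epochs(val_losses)
print(saved_epochs)
```

Because the file is overwritten only on improvement, the saved weights correspond to the epoch with the lowest validation loss, not necessarily the final epoch.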


### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [12]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  0.9341446161270142
Testing Accuracy:  0.8712077736854553


The training and testing accuracy scores are both high. Compared with our initial model, training accuracy has increased by ~1 percentage point (92.3% to 93.4%), while testing accuracy is essentially unchanged (87% to 87.1%). 

There is a marginal increase in the difference between the training and test scores (~6% compared to ~5% previously), though the difference remains fairly low, so the model does not appear to have suffered significantly from overfitting. 

### Predictions  

Here we will modify our previous method for testing the model's predictions on a specified audio .wav file. 

In [13]:
def print_prediction(file_name):
    prediction_feature = extract_features(file_name) 
    prediction_feature = prediction_feature.reshape(1, num_rows, num_columns, num_channels)

    # model.predict returns the softmax probabilities, one row per input;
    # predict_classes/predict_proba are not available in recent Keras releases
    predicted_proba = model.predict(prediction_feature, verbose=0)[0]

    # The predicted class is the index with the highest probability
    predicted_vector = np.array([np.argmax(predicted_proba)])
    predicted_class = le.inverse_transform(predicted_vector) 
    print("The predicted class is:", predicted_class[0], '\n') 

    for i in range(len(predicted_proba)): 
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f') )

### Validation 

#### Test with sample data 

As before, we will verify the predictions using a subset of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly. 

In [14]:
# Class: Air Conditioner

filename = (r"C:\Users\Simriti Koul\Desktop\CAPSTONE\UrbanSound8K.tar\UrbanSound8K\train\fold5\100852-0-0-0.wav") 
print_prediction(filename) 



The predicted class is: air_conditioner 

air_conditioner 		 :  0.98654663562774658203125000000000
car_horn 		 :  0.00000365796358892112039029598236
children_playing 		 :  0.00373674952425062656402587890625
dog_bark 		 :  0.00031396126723848283290863037109
drilling 		 :  0.00213159387931227684020996093750
engine_idling 		 :  0.00406803796067833900451660156250
gun_shot 		 :  0.00000933662704483140259981155396
jackhammer 		 :  0.00316123594529926776885986328125
siren 		 :  0.00000129648049096431350335478783
street_music 		 :  0.00002752682303253095597028732300




In [15]:
# Class: Drilling

filename = (r"C:\Users\Simriti Koul\Desktop\CAPSTONE\UrbanSound8K.tar\UrbanSound8K\train\fold3\103199-4-0-0.wav")
print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.00000014084022836868825834244490
car_horn 		 :  0.02140823565423488616943359375000
children_playing 		 :  0.00003284804915892891585826873779
dog_bark 		 :  0.00000577101127419155091047286987
drilling 		 :  0.88445901870727539062500000000000
engine_idling 		 :  0.00000012620598965895624132826924
gun_shot 		 :  0.00000104415573787264293059706688
jackhammer 		 :  0.00000748574166209436953067779541
siren 		 :  0.00000002659186293385573662817478
street_music 		 :  0.09408541023731231689453125000000


In [16]:
# Class: Street music 

filename = (r"C:\Users\Simriti Koul\Desktop\CAPSTONE\UrbanSound8K.tar\UrbanSound8K\train\fold7\101848-9-0-0.wav")
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00064984644996002316474914550781
car_horn 		 :  0.00292213051579892635345458984375
children_playing 		 :  0.02916814200580120086669921875000
dog_bark 		 :  0.00079743505921214818954467773438
drilling 		 :  0.00000521766332894912920892238617
engine_idling 		 :  0.00000295795189231284894049167633
gun_shot 		 :  0.00000000020001218736798165309665
jackhammer 		 :  0.00000089540787939768051728606224
siren 		 :  0.01239700336009263992309570312500
street_music 		 :  0.95405632257461547851562500000000


In [17]:
# Class: Car Horn 

filename = (r"C:\Users\Simriti Koul\Desktop\CAPSTONE\UrbanSound8K.tar\UrbanSound8K\train\fold10\100648-1-0-0.wav")
print_prediction(filename) 

The predicted class is: car_horn 

air_conditioner 		 :  0.00219908985309302806854248046875
car_horn 		 :  0.30767381191253662109375000000000
children_playing 		 :  0.00892277713865041732788085937500
dog_bark 		 :  0.09442199766635894775390625000000
drilling 		 :  0.25490027666091918945312500000000
engine_idling 		 :  0.01859729923307895660400390625000
gun_shot 		 :  0.07564551383256912231445312500000
jackhammer 		 :  0.20189863443374633789062500000000
siren 		 :  0.03126519173383712768554687500000
street_music 		 :  0.00447552790865302085876464843750


#### Observations 

We can see that the model performs well on these samples: all four are classified correctly. 

Interestingly, the car horn sample was classified correctly this time, but only narrowly: the per-class confidence shows car horn at ~31%, ahead of drilling at ~25% and jackhammer at ~20%.  

### Other audio

Again, we will further validate our model using a sample of various copyright-free sounds that were not part of either our test or training data. 

In [18]:
filename = (r"C:\Users\Simriti Koul\Desktop\CAPSTONE\Evaluation audio\dog_bark_1.wav")
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00169448740780353546142578125000
car_horn 		 :  0.02364376187324523925781250000000
children_playing 		 :  0.00123100646305829286575317382812
dog_bark 		 :  0.88138276338577270507812500000000
drilling 		 :  0.04747003316879272460937500000000
engine_idling 		 :  0.00059739599237218499183654785156
gun_shot 		 :  0.03793414682149887084960937500000
jackhammer 		 :  0.00192393758334219455718994140625
siren 		 :  0.00134987547062337398529052734375
street_music 		 :  0.00277261622250080108642578125000


In [19]:
filename = (r"C:\Users\Simriti Koul\Desktop\CAPSTONE\Evaluation audio\drilling_1.wav")

print_prediction(filename) 

The predicted class is: jackhammer 

air_conditioner 		 :  0.01153741963207721710205078125000
car_horn 		 :  0.00212912564165890216827392578125
children_playing 		 :  0.00002556872641434893012046813965
dog_bark 		 :  0.00049320107791572809219360351562
drilling 		 :  0.08964597433805465698242187500000
engine_idling 		 :  0.00117059098556637763977050781250
gun_shot 		 :  0.00000086370192775575560517609119
jackhammer 		 :  0.89416062831878662109375000000000
siren 		 :  0.00076230237027630209922790527344
street_music 		 :  0.00007435309089487418532371520996


In [20]:
filename = (r"C:\Users\Simriti Koul\Desktop\CAPSTONE\Evaluation audio\gun_shot_1.wav")

print_prediction(filename) 

The predicted class is: gun_shot 

air_conditioner 		 :  0.00051879772217944264411926269531
car_horn 		 :  0.00005635288107441738247871398926
children_playing 		 :  0.00134165945928543806076049804688
dog_bark 		 :  0.15568615496158599853515625000000
drilling 		 :  0.01065753120929002761840820312500
engine_idling 		 :  0.00056926830438897013664245605469
gun_shot 		 :  0.83017569780349731445312500000000
jackhammer 		 :  0.00000720739217285881750285625458
siren 		 :  0.00025452961563132703304290771484
street_music 		 :  0.00073273148154839873313903808594


#### Observations 

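The final model classifies two of the three new samples correctly (dog bark and gun shot), each with over 80% confidence, but misclassifies the drilling sample as jackhammer with ~89% confidence. It therefore appears to generalise reasonably well to new audio data, although this small sample shows it can still confuse some classes. 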