# Classifying Urban sounds using Deep Learning

## 4 Model Refinement 

### Load Preprocessed data 

In [1]:
# retrieve the preprocessed data from previous notebook

%store -r x_train 
%store -r x_test 
%store -r y_train 
%store -r y_test 
%store -r yy 
%store -r le

#### Model refinement

In our initial attempt, we achieved the following classification accuracy scores: 

* Training data Accuracy:  93.1% 
* Testing data Accuracy:  88% 

We will now see if we can improve upon that score using a Convolutional Neural Network (CNN). 

#### Feature Extraction refinement 

In the previous feature extraction stage, the MFCC vectors varied in size between audio files (depending on each sample's duration). 

However, CNNs require a fixed size for all inputs. To overcome this, we will zero-pad the output vectors along the time axis so that they are all the same size. 
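The padding step can be illustrated on a toy example before applying it to real audio. The sketch below is only an illustration: `pad_to_width` is a hypothetical helper (not part of the notebook) and the small arrays of ones stand in for real MFCC matrices.

```python
import numpy as np

def pad_to_width(matrix, target_width):
    """Right-pad a 2D array with zeros along the time axis (axis 1)."""
    pad_width = target_width - matrix.shape[1]
    if pad_width < 0:
        raise ValueError("input is wider than target_width")
    # ((0, 0), (0, pad_width)): no padding on the rows, zeros appended to the columns
    return np.pad(matrix, pad_width=((0, 0), (0, pad_width)), mode='constant')

# Toy stand-ins for MFCC matrices: 2 coefficients each, with 3 and 5 frames
short_clip = np.ones((2, 3))
long_clip = np.ones((2, 5))

padded = [pad_to_width(m, target_width=5) for m in (short_clip, long_clip)]
print([p.shape for p in padded])  # both arrays now share the same shape
print(padded[0])                  # the last two columns are zeros
```

Padding on the right keeps the original frames at the start of every matrix, so the time axis stays aligned across samples.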

In [2]:
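import numpy as np
import librosa

max_pad_len = 174

def extract_features(file_name):
    """Return a (40, max_pad_len) MFCC matrix for an audio file, or None on failure."""
    try:
        audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast') 
        mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
        # Truncate clips with more than max_pad_len frames so pad_width is never negative
        mfccs = mfccs[:, :max_pad_len]
        # Zero pad the time axis (axis 1) up to max_pad_len frames
        pad_width = max_pad_len - mfccs.shape[1]
        mfccs = np.pad(mfccs, pad_width=((0, 0), (0, pad_width)), mode='constant')
        
    except Exception as e:
        print("Error encountered while parsing file: ", file_name, "-", e)
        return None 
     
    return mfccs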

In [4]:
# Load various imports 
import pandas as pd
import os
import librosa
from tqdm import tqdm

# Set the path to the full UrbanSound dataset 
fulldatasetpath = '../Dataset/UrbanSound8K/audio/'

metadata = pd.read_csv('../UrbanSound Dataset sample/metadata/UrbanSound8K.csv')

features = []

# Iterate through each sound file and extract the features 
for index, row in tqdm(metadata.iterrows()):
    
    file_name = os.path.join(os.path.abspath(fulldatasetpath),'fold'+str(row["fold"])+'/',str(row["slice_file_name"]))
    
    class_label = row["class_name"]
    data = extract_features(file_name)
    
    features.append([data, class_label])

# Convert into a pandas DataFrame 
featuresdf = pd.DataFrame(features, columns=['feature','class_label'])

print('Finished feature extraction from ', len(featuresdf), ' files') 

8732it [05:52, 24.79it/s]

Finished feature extraction from  8732  files





In [5]:
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

# Convert features and corresponding classification labels into numpy arrays
X = np.array(featuresdf.feature.tolist())
y = np.array(featuresdf.class_label.tolist())

# Encode the classification labels
le = LabelEncoder()
yy = to_categorical(le.fit_transform(y)) 

# split the dataset 
from sklearn.model_selection import train_test_split 

x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.2, random_state = 42)

Using TensorFlow backend.


### Convolutional Neural Network (CNN) model architecture 


We will modify our model to be a Convolutional Neural Network (CNN), again using Keras with a TensorFlow backend. 

Again we will use a `Sequential` model, starting with a simple model architecture consisting of four `Conv2D` convolution layers, with our final output layer being a `Dense` layer. 

The convolution layers are designed for feature detection. Each one works by sliding a filter window over the input, multiplying the filter element-wise with the values under the window, summing the products and storing the result in a feature map. This operation is known as a convolution. 
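The sliding-window operation can be sketched in plain NumPy. This is a minimal illustration for a single channel and a single filter, not how Keras implements `Conv2D` internally; `conv2d_valid` is a hypothetical helper and the kernel values are arbitrary. As in most deep learning libraries, the kernel is applied without flipping.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding and return the feature map."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            # Element-wise product of the window and the kernel, then a sum
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

image = np.arange(12, dtype=float).reshape(3, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])  # arbitrary 2x2 filter for illustration

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)  # a 2x2 kernel shrinks each dimension by one
print(feature_map)
```

With a 2x2 kernel and no padding, an input of 40 x 174 values produces a 39 x 173 feature map.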


The `filters` parameter specifies the number of filters (output feature maps) in each convolutional layer. The layers increase in size from 16 to 32, 64 and 128 filters, while the `kernel_size` parameter specifies the size of the kernel window, which in this case is 2, resulting in a 2x2 filter matrix. 

The first layer will receive an input shape of (40, 174, 1), where 40 is the number of MFCCs, 174 is the number of frames after padding, and 1 is the number of channels, signifying that the audio is mono. 
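The effect of these settings on the layer sizes can be worked out by hand. The sketch below assumes unpadded ("valid") convolutions with stride 1 and 2x2 max pooling with stride 2, which are the Keras defaults used in this notebook; `conv_output` and `pool_output` are hypothetical helpers written for this illustration.

```python
def conv_output(height, width, in_channels, filters, kernel_size=2):
    """Output shape and trainable parameter count of an unpadded convolution."""
    out_h = height - kernel_size + 1
    out_w = width - kernel_size + 1
    # Each filter has kernel_size * kernel_size * in_channels weights plus one bias
    params = (kernel_size * kernel_size * in_channels + 1) * filters
    return (out_h, out_w, filters), params

def pool_output(height, width, channels, pool_size=2):
    """Output shape of max pooling with stride equal to pool_size."""
    return (height // pool_size, width // pool_size, channels)

shape = (40, 174, 1)  # MFCCs x frames x channels
layer_report = []
for filters in (16, 32, 64, 128):
    shape, params = conv_output(*shape, filters=filters)
    layer_report.append((shape, params))
    print("conv:", shape, "params:", params)
    shape = pool_output(*shape)
    print("pool:", shape)
```

The first two convolution layers come out at 80 and 2,080 parameters, which can be compared against the `model.summary()` output.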

The activation function we will be using for our convolutional layers is `ReLU` which is the same as our previous model. We will use a smaller `Dropout` value of 20% on our convolutional layers. 

Each convolutional layer has an associated pooling layer of `MaxPooling2D` type, with the final convolutional layer also followed by a `GlobalAveragePooling2D` layer. The pooling layers reduce the dimensionality of the model (by reducing the parameters and subsequent computation requirements), which serves to shorten the training time and reduce overfitting. Max pooling takes the maximum value in each window, while global average pooling takes the average of each feature map, which is suitable for feeding into our `Dense` output layer.  
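Both pooling types can be sketched with NumPy on a small feature map. This is an illustration only, with made-up values; `max_pool_2x2` and `global_average_pool` are hypothetical helpers rather than the Keras layers themselves.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the maximum of every non-overlapping 2x2 window."""
    h, w = feature_map.shape
    h, w = h - h % 2, w - w % 2  # drop a trailing odd row or column
    blocks = feature_map[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

def global_average_pool(feature_maps):
    """Average each channel: (height, width, channels) -> (channels,)."""
    return feature_maps.mean(axis=(0, 1))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 7, 2],
               [2, 6, 3, 4]], dtype=float)

pooled = max_pool_2x2(fm)
print(pooled)  # 4x4 input reduced to 2x2

stacked = np.stack([fm, 2 * fm], axis=-1)  # two toy channels
channel_means = global_average_pool(stacked)
print(channel_means)  # one value per channel
```

Global average pooling collapses each feature map to a single number, so the final layer receives one value per filter regardless of the spatial size.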

Our output layer will have 10 nodes (`num_labels`), which matches the number of possible classifications. The activation for our output layer is `softmax`. Softmax makes the outputs sum to 1 so that they can be interpreted as probabilities. The model will then make its prediction based on which option has the highest probability.
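The softmax step can be sketched in a few lines of NumPy. The input scores below are made up for illustration and use three classes rather than ten; `softmax` here is a hand-written helper, not the Keras activation.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores for three classes
probs = softmax(logits)
predicted_index = int(np.argmax(probs))

print(probs, probs.sum())
print("Predicted class index:", predicted_index)
```

The predicted class is simply the index of the largest probability, which is then mapped back to a label with the label encoder.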

In [6]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, Conv2D, MaxPooling2D, GlobalAveragePooling2D
from keras.optimizers import Adam
from keras.utils import np_utils
from sklearn import metrics 

num_rows = 40
num_columns = 174
num_channels = 1

x_train = x_train.reshape(x_train.shape[0], num_rows, num_columns, num_channels)
x_test = x_test.reshape(x_test.shape[0], num_rows, num_columns, num_channels)

num_labels = yy.shape[1]
filter_size = 2

# Construct model 
model = Sequential()
model.add(Conv2D(filters=16, kernel_size=2, input_shape=(num_rows, num_columns, num_channels), activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=32, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=64, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))

model.add(Conv2D(filters=128, kernel_size=2, activation='relu'))
model.add(MaxPooling2D(pool_size=2))
model.add(Dropout(0.2))
model.add(GlobalAveragePooling2D())

model.add(Dense(num_labels, activation='softmax')) 



### Compiling the model 

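For compiling our model, we will use the same three parameters as the previous model: the loss function (`categorical_crossentropy`), the metric (`accuracy`) and the optimizer (`adam`). 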

In [7]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam') 

In [8]:
# Display model architecture summary 
model.summary()

# Calculate pre-training accuracy 
score = model.evaluate(x_test, y_test, verbose=1)
accuracy = 100*score[1]

print("Pre-training accuracy: %.4f%%" % accuracy)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 39, 173, 16)       80        
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 19, 86, 16)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 19, 86, 16)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 18, 85, 32)        2080      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 9, 42, 32)         0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 9, 42, 32)         0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 8, 41, 64)        

### Training 

Here we will train the model. As training a CNN can take a significant amount of time, we will start with a low number of epochs and a low batch size. If we can see from the output that the model is converging, we will increase both numbers.  

In [9]:
from keras.callbacks import ModelCheckpoint 
from datetime import datetime 

#num_epochs = 12
#num_batch_size = 128

num_epochs = 72
num_batch_size = 256

checkpointer = ModelCheckpoint(filepath='saved_models/weights.best.basic_cnn.hdf5', 
                               verbose=1, save_best_only=True)
start = datetime.now()

model.fit(x_train, y_train, batch_size=num_batch_size, epochs=num_epochs, validation_data=(x_test, y_test), callbacks=[checkpointer], verbose=1)


duration = datetime.now() - start
print("Training completed in time: ", duration)

Train on 6985 samples, validate on 1747 samples
Epoch 1/72

Epoch 00001: val_loss improved from inf to 2.16176, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 2/72

Epoch 00002: val_loss improved from 2.16176 to 1.89133, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 3/72

Epoch 00003: val_loss improved from 1.89133 to 1.64556, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 4/72

Epoch 00004: val_loss improved from 1.64556 to 1.47985, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 5/72

Epoch 00005: val_loss improved from 1.47985 to 1.36199, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 6/72

Epoch 00006: val_loss improved from 1.36199 to 1.26733, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 7/72

Epoch 00007: val_loss improved from 1.26733 to 1.25000, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 8/72

Epoch 00008: val_loss


Epoch 00033: val_loss improved from 0.53844 to 0.52855, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 34/72

Epoch 00034: val_loss improved from 0.52855 to 0.50166, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 35/72

Epoch 00035: val_loss did not improve from 0.50166
Epoch 36/72

Epoch 00036: val_loss improved from 0.50166 to 0.49561, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 37/72

Epoch 00037: val_loss did not improve from 0.49561
Epoch 38/72

Epoch 00038: val_loss improved from 0.49561 to 0.48647, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 39/72

Epoch 00039: val_loss improved from 0.48647 to 0.44824, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 40/72

Epoch 00040: val_loss improved from 0.44824 to 0.43901, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 41/72

Epoch 00041: val_loss did not improve from 0.43901
Epoch 42/72

Epoch 00042: val_loss did not improve from 0.43901



Epoch 00069: val_loss improved from 0.33249 to 0.32230, saving model to saved_models/weights.best.basic_cnn.hdf5
Epoch 70/72

Epoch 00070: val_loss did not improve from 0.32230
Epoch 71/72

Epoch 00071: val_loss did not improve from 0.32230
Epoch 72/72

Epoch 00072: val_loss improved from 0.32230 to 0.31517, saving model to saved_models/weights.best.basic_cnn.hdf5
Training completed in time:  0:19:36.063460


### Test the model 

Here we will review the accuracy of the model on both the training and test data sets. 

In [10]:
# Evaluating the model on the training and testing set
score = model.evaluate(x_train, y_train, verbose=0)
print("Training Accuracy: ", score[1])

score = model.evaluate(x_test, y_test, verbose=0)
print("Testing Accuracy: ", score[1])

Training Accuracy:  0.9537580609321594
Testing Accuracy:  0.8986834287643433


The Training and Testing accuracy scores are both high and an improvement on our initial model. Training accuracy has increased by ~2 percentage points (93.1% to 95.4%) and Testing accuracy by ~2 percentage points (88% to 89.9%). 

There is a marginal increase in the gap between the Training and Testing scores (~5.5 percentage points compared to ~5 previously), though the gap remains small, so the model does not appear to be overfitting significantly. 
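The gap between the two scores can be computed directly from the figures reported in this notebook. The snippet below is just that arithmetic; `generalisation_gap` is a hypothetical helper name.

```python
# Accuracy figures (in %) taken from this notebook
initial_model = {"train": 93.1, "test": 88.0}
cnn_model = {"train": 100 * 0.9537580609321594, "test": 100 * 0.8986834287643433}

def generalisation_gap(scores):
    """Difference between training and testing accuracy, in percentage points."""
    return scores["train"] - scores["test"]

initial_gap = generalisation_gap(initial_model)
cnn_gap = generalisation_gap(cnn_model)

print("Initial model gap: %.1f points" % initial_gap)
print("CNN model gap:     %.1f points" % cnn_gap)
```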

### Predictions  

Here we will modify our previous method for testing the model's predictions on a specified audio .wav file. 

In [11]:
def print_prediction(file_name):
    prediction_feature = extract_features(file_name) 
    prediction_feature = prediction_feature.reshape(1, num_rows, num_columns, num_channels)

    # model.predict returns the softmax probabilities for each class; it is used here
    # because predict_classes and predict_proba are not available in newer Keras releases
    predicted_proba = model.predict(prediction_feature)[0]
    predicted_class = le.inverse_transform([np.argmax(predicted_proba)])
    print("The predicted class is:", predicted_class[0], '\n') 

    for i in range(len(predicted_proba)): 
        category = le.inverse_transform(np.array([i]))
        print(category[0], "\t\t : ", format(predicted_proba[i], '.32f') )

### Validation 

#### Test with sample data 

As before, we will verify the predictions using a subset of the sample audio files we explored in the first notebook. We expect the bulk of these to be classified correctly. 

In [12]:
# Class: Air Conditioner

filename = '../UrbanSound Dataset sample/audio/100852-0-0-0.wav' 
print_prediction(filename) 

The predicted class is: air_conditioner 

air_conditioner 		 :  0.99963462352752685546875000000000
car_horn 		 :  0.00000066457164393796119838953018
children_playing 		 :  0.00007972981984494253993034362793
dog_bark 		 :  0.00000393079153582220897078514099
drilling 		 :  0.00009960658644558861851692199707
engine_idling 		 :  0.00002108380795107223093509674072
gun_shot 		 :  0.00013130753359291702508926391602
jackhammer 		 :  0.00001495476590207545086741447449
siren 		 :  0.00000015865006730564346071332693
street_music 		 :  0.00001384513780067209154367446899


In [13]:
# Class: Drilling

filename = '../UrbanSound Dataset sample/audio/103199-4-0-0.wav'
print_prediction(filename) 

The predicted class is: drilling 

air_conditioner 		 :  0.00004755285772262141108512878418
car_horn 		 :  0.00135083636268973350524902343750
children_playing 		 :  0.00000901891871762927621603012085
dog_bark 		 :  0.00000376006028091069310903549194
drilling 		 :  0.98178678750991821289062500000000
engine_idling 		 :  0.00000529857197761884890496730804
gun_shot 		 :  0.00000059720520084738382138311863
jackhammer 		 :  0.00032964342972263693809509277344
siren 		 :  0.00000052914543857696116901934147
street_music 		 :  0.01646599918603897094726562500000


In [14]:
# Class: Street music 

filename = '../UrbanSound Dataset sample/audio/101848-9-0-0.wav'
print_prediction(filename) 

The predicted class is: street_music 

air_conditioner 		 :  0.00019873173732776194810867309570
car_horn 		 :  0.00022817279386799782514572143555
children_playing 		 :  0.01703303307294845581054687500000
dog_bark 		 :  0.00011764083319576457142829895020
drilling 		 :  0.00000011890290352312149479985237
engine_idling 		 :  0.00000590954186918679624795913696
gun_shot 		 :  0.00000000905062158551572792930529
jackhammer 		 :  0.00000007629972742506652139127254
siren 		 :  0.00071084155933931469917297363281
street_music 		 :  0.98170542716979980468750000000000


In [15]:
# Class: Car Horn 

filename = '../UrbanSound Dataset sample/audio/100648-1-0-0.wav'
print_prediction(filename) 

The predicted class is: car_horn 

air_conditioner 		 :  0.00087277742568403482437133789062
car_horn 		 :  0.26948341727256774902343750000000
children_playing 		 :  0.00407882407307624816894531250000
dog_bark 		 :  0.15276922285556793212890625000000
drilling 		 :  0.18205101788043975830078125000000
engine_idling 		 :  0.03219686076045036315917968750000
gun_shot 		 :  0.08857216686010360717773437500000
jackhammer 		 :  0.24637129902839660644531250000000
siren 		 :  0.02285685390233993530273437500000
street_music 		 :  0.00074759544804692268371582031250


#### Observations 

We can see that the model performs well: all four samples are assigned the expected class. 

Interestingly, car horn, which was previously misclassified, is classified correctly this time, but only narrowly. The per class confidence shows it was a close decision between car horn at ~27%, jackhammer at ~25% and drilling at ~18%.  
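One way to see how close the decision was is to rank the probabilities printed for the car horn sample. The values below are copied (rounded to five decimal places) from the output above; `top_k` is a hypothetical helper added for this illustration.

```python
# Probabilities for 100648-1-0-0.wav, rounded from the output above
car_horn_probs = {
    "air_conditioner": 0.00087,
    "car_horn": 0.26948,
    "children_playing": 0.00408,
    "dog_bark": 0.15277,
    "drilling": 0.18205,
    "engine_idling": 0.03220,
    "gun_shot": 0.08857,
    "jackhammer": 0.24637,
    "siren": 0.02286,
    "street_music": 0.00075,
}

def top_k(probabilities, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)[:k]

top3 = top_k(car_horn_probs)
margin = top3[0][1] - top3[1][1]

for label, prob in top3:
    print("%-12s %.1f%%" % (label, 100 * prob))
print("Margin between the top two classes: %.1f points" % (100 * margin))
```

A small margin between the top two classes is a useful signal that a prediction should be treated with caution.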

### Other audio

Again, we will further validate our model using a sample of various copyright-free sounds that were not part of either our test or training data. 

In [16]:
filename = '../Evaluation audio/dog_bark_1.wav'
print_prediction(filename) 

The predicted class is: dog_bark 

air_conditioner 		 :  0.00026926782447844743728637695312
car_horn 		 :  0.01605159230530261993408203125000
children_playing 		 :  0.00112424034159630537033081054688
dog_bark 		 :  0.93480676412582397460937500000000
drilling 		 :  0.01309823524206876754760742187500
engine_idling 		 :  0.00021353414922486990690231323242
gun_shot 		 :  0.03244085237383842468261718750000
jackhammer 		 :  0.00002938325633294880390167236328
siren 		 :  0.00169011438265442848205566406250
street_music 		 :  0.00027596080326475203037261962891


In [17]:
filename = '../Evaluation audio/drilling_1.wav'

print_prediction(filename) 

The predicted class is: jackhammer 

air_conditioner 		 :  0.00238400720991194248199462890625
car_horn 		 :  0.00000001842655095174450252670795
children_playing 		 :  0.00000596138988839811645448207855
dog_bark 		 :  0.00000344101340488123241811990738
drilling 		 :  0.00001866246384452097117900848389
engine_idling 		 :  0.00002697419040487147867679595947
gun_shot 		 :  0.00001019490991893690079450607300
jackhammer 		 :  0.99754613637924194335937500000000
siren 		 :  0.00000363874869435676373541355133
street_music 		 :  0.00000101370062566275009885430336


In [18]:
filename = '../Evaluation audio/gun_shot_1.wav'

print_prediction(filename) 

The predicted class is: gun_shot 

air_conditioner 		 :  0.00002443616358505096286535263062
car_horn 		 :  0.00000623834739599260501563549042
children_playing 		 :  0.00159835012163966894149780273438
dog_bark 		 :  0.00703781982883810997009277343750
drilling 		 :  0.09889528900384902954101562500000
engine_idling 		 :  0.00002655973185028415173292160034
gun_shot 		 :  0.89151883125305175781250000000000
jackhammer 		 :  0.00000121413609122100751847028732
siren 		 :  0.00016437431622762233018875122070
street_music 		 :  0.00072692235698923468589782714844


#### Observations 

The final model performs well overall and appears to generalise to new audio data: the dog bark and gun shot samples were classified correctly, while the drilling sample was misclassified as jackhammer. 