
# Introduction to Audio Exploratory Data Analysis
## Part 3: Removing Duplicates and Training a Very Simple Model

<b> By Daniel Gladman, December 2022 </b>

In this final part of the exploratory data analysis, I will remove the duplicated data that we identified in the previous part and will do some basic analysis.

I will then briefly demonstrate some preprocessing and feature extraction, and train a very simple neural network on this data.

This is not intended to be the final model, nor the final data for this project. You will need to explore this in greater detail on your own.

But this is here to give you a rough idea of what can be done with audio data and how a simple model can be trained.

In [1]:
# Import all the necessary libraries
from datetime import datetime
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

import librosa
import numpy as np
import os 
import pandas as pd
import tensorflow as tf


First, let's load the metadata and duplicate CSV files. We will then use the duplicates dataframe as a filter for the metadata, keeping every row whose filename does not appear in the duplicates' filename column. This will be the final dataframe used for the model.

In [2]:
metadata = pd.read_csv("./Data/metadata.csv")
duplicates = pd.read_csv("./Data/duplicates.csv")
duplicates.drop("Unnamed: 0", axis=1, inplace=True)
df = metadata[~metadata.filename.isin(duplicates['filename'])]
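
To see what the `~isin` filter does, here is a minimal sketch on toy data. The filenames and both small dataframes are made up for illustration; only the filtering expression mirrors the cell above.

```python
import pandas as pd

# Hypothetical toy metadata; the real metadata.csv has more rows and columns.
toy_metadata = pd.DataFrame({
    "filename": ["a.wav", "b.wav", "c.wav", "d.wav"],
    "class": ["cat", "cat", "dog", "dog"],
})
# Hypothetical list of files flagged as duplicates.
toy_duplicates = pd.DataFrame({"filename": ["b.wav", "d.wav"]})

# isin() marks rows whose filename appears in the duplicates list; ~ inverts the mask.
is_duplicate = toy_metadata["filename"].isin(toy_duplicates["filename"])
toy_filtered = toy_metadata[~is_duplicate]

print(toy_filtered["filename"].tolist())
```

Only the rows whose filenames are absent from the duplicates list survive the filter.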

Next, feature extraction. For the project we will likely be using mel-spectrograms, which I discussed in a previous tutorial, as the feature to train the model.

However, I wanted to explore a closely related feature: Mel-Frequency Cepstral Coefficients, or MFCCs.

This is an interesting topic that is too complicated to explain fully here, but simply put, MFCCs attempt to model elements of human speech, including phoneme expression, the shape and length of the vocal tract, and glottal pulses, which are the variations in voice quality caused by the vocal folds.

Because MFCCs were designed around human speech, I wouldn't expect them to work well with animals. However, thinking on it further, while animals do not possess the anatomy to produce phonemes and language, there may be enough similarities between our respective biologies for a human-oriented feature to generalize to animal sounds. That's what I want to find out in this little tutorial. I don't expect this model to perform well for numerous reasons, but let's kick on. We might be surprised.

In [3]:
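def get_mfcc(filename):
    # Load the audio (librosa resamples to 22050 Hz by default)
    a, sr = librosa.load(filename, res_type="kaiser_best")
    # 40 coefficients per frame, shape (40, n_frames)
    mfccs_features = librosa.feature.mfcc(y=a, sr=sr, n_mfcc=40)
    # Average over time to get one 40-value vector per file
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features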

def feature_extractor(path, df):
    extracted_features = []

    for idx, row in df.iterrows():
        class_label = row['class']
        # Files are stored as <path>/<class>/<filename>
        filename = os.path.join(os.path.abspath(path), row['class'], row['filename'])
        try:
            data = get_mfcc(filename)
        except ValueError:
            # Mark unreadable files as NaN so buildDataset can drop them
            data = np.nan
        extracted_features.append([data, class_label])
    return extracted_features

def buildDataset(path, df):
    features = feature_extractor(path, df)
    final_df = pd.DataFrame(features, columns=["mfcc", "label"])
    final_df.dropna(inplace=True)
    return final_df


The above functions will allow us to extract all the features we need from the filtered dataframe.
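
To make the averaging step in `get_mfcc` concrete, here is a small sketch using random numbers in place of real MFCCs. The frame count of 130 is an arbitrary assumption; real clips will produce different lengths.

```python
import numpy as np

# Hypothetical MFCC matrix: 40 coefficients by 130 frames of random values,
# standing in for the output of librosa.feature.mfcc.
rng = np.random.default_rng(0)
fake_mfccs = rng.normal(size=(40, 130))

# Same pooling as get_mfcc: transpose to (frames, coefficients), then average over frames.
pooled = np.mean(fake_mfccs.T, axis=0)

# Averaging over axis 1 of the untransposed matrix gives the same vector.
pooled_alt = fake_mfccs.mean(axis=1)

print(pooled.shape)
```

Whatever the clip length, the result is a single 40-value vector, which is why the model can use a fixed input size. The trade-off is that all timing information is discarded.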

In [4]:
path = './Data/'
f_df = buildDataset(path, df)



Next, let's quickly check the composition of files in the dataframe. Do we have uneven classes? Do we have sufficient samples?

In [5]:
f_df.groupby("label").count()

| label | mfcc (count) |
|---|---|
| bird | 191 |
| cat | 193 |
| chicken | 29 |
| cow | 69 |
| dog | 188 |
| donkey | 19 |
| frog | 35 |
| lion | 38 |
| monkey | 22 |
| sheep | 34 |


The classes are clearly uneven, and I would argue we do not have sufficient samples. There are ten distinct classes in the data. Cats, birds and dogs have nearly 200 samples each, which is not ideal but OK. Cows have 69 samples, while the rest have under 40 each. This doesn't bode well for a train-test split, because the smallest classes will contribute only a handful of test samples. We would definitely need more data if we wanted to build a worthwhile model.
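
To put rough numbers on the imbalance, the sketch below takes the counts from the output above and works out each class's share of the data, roughly how many samples a 30% test split would leave per class, and the accuracy of a model that always guesses the largest class. The 30% figure is an assumption that matches the split used later in this notebook.

```python
# Counts copied from the groupby output above.
counts = {
    "bird": 191, "cat": 193, "chicken": 29, "cow": 69, "dog": 188,
    "donkey": 19, "frog": 35, "lion": 38, "monkey": 22, "sheep": 34,
}
total = sum(counts.values())
test_fraction = 0.3  # assumed split size

# List the classes from smallest to largest.
for label, n in sorted(counts.items(), key=lambda kv: kv[1]):
    share = n / total
    expected_test = n * test_fraction
    print(f"{label:8s} {n:4d} samples  {share:6.1%} of data  ~{expected_test:4.1f} in test")

# Accuracy of a model that always predicts the largest class.
majority_baseline = max(counts.values()) / total
print(f"majority-class baseline: {majority_baseline:.1%}")
```

The baseline is a useful yardstick: any accuracy we report later should be read against it rather than against zero.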

In any case, let's proceed. I will use Keras to build the model. I'm not going to go into too much detail on the model itself, as it is simply a demo on this sandbox dataset.

In [6]:
# extract features and classes
X=np.array(f_df['mfcc'].tolist())
y = np.array(f_df['label'].tolist())

# encode the classes as labels so that we can check the answer after the fact.
labelencoder = LabelEncoder()
y = tf.keras.utils.to_categorical(labelencoder.fit_transform(y))

#Split the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=890)
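
For intuition about the encoding step, here is a NumPy-only sketch of what the label encoding and one-hot conversion produce. It does not call `LabelEncoder` or `to_categorical`; it reproduces the same idea by hand on a few made-up labels, relying on the fact that `LabelEncoder` assigns integer codes in sorted class order.

```python
import numpy as np

# A few labels in arbitrary order, standing in for f_df['label'].
labels = np.array(["dog", "cat", "bird", "cat", "sheep"])

# np.unique returns the sorted class names and, for each label, its index in that list.
classes, integer_codes = np.unique(labels, return_inverse=True)

# One-hot encode by picking rows of an identity matrix.
one_hot = np.eye(len(classes))[integer_codes]

# Decoding reverses the process: argmax gives the index, the index gives the class name.
decoded = classes[one_hot.argmax(axis=1)]

print(classes, integer_codes)
print(one_hot)
```

The decoding line is the same idea as the `np.argmax` followed by `inverse_transform` used on the predictions later on.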

In [7]:
# Build the model
n_labels = y.shape[1]

model = tf.keras.models.Sequential()

#Layer 1
model.add(tf.keras.layers.Dense(100, input_shape=(40,))) # input size matches n_mfcc
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.Dropout(0.5))

#Layer 2
model.add(tf.keras.layers.Dense(200))
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.Dropout(0.5))

#Layer 3
model.add(tf.keras.layers.Dense(100))
model.add(tf.keras.layers.Activation('relu'))
model.add(tf.keras.layers.Dropout(0.5))

#Layer 4
model.add(tf.keras.layers.Dense(n_labels))
model.add(tf.keras.layers.Activation('softmax'))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 100)               4100      
                                                                 
 activation (Activation)     (None, 100)               0         
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense_1 (Dense)             (None, 200)               20200     
                                                                 
 activation_1 (Activation)   (None, 200)               0         
                                                                 
 dropout_1 (Dropout)         (None, 200)               0         
                                                                 
 dense_2 (Dense)             (None, 100)               2
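
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per unit. The sketch below applies that to the layer sizes in the model definition. The final size of 10 is an assumption based on the ten classes in the data.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Layer sizes taken from the model definition; 10 outputs assumes ten classes.
layer_sizes = [40, 100, 200, 100, 10]

# Pair each layer's input size with its output size.
param_counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(param_counts)
print(sum(param_counts))
```

The first two values agree with the `dense` and `dense_1` rows in the summary. Activation and dropout layers add no parameters.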

In [8]:
# Compile the model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')

In [17]:
# Train the model
n_epochs = 100
n_batch_size = 32

checkpointer = tf.keras.callbacks.ModelCheckpoint(filepath='./Data/audio_classifcation.hdf5', verbose=1, save_best_only=True)

start = datetime.now()
model.fit(X_train, y_train, batch_size=n_batch_size, epochs=n_epochs, validation_data=(X_test, y_test), callbacks=[checkpointer])
duration = datetime.now() - start
print("Training time = ", duration)

Epoch 1/100
 1/18 [>.............................] - ETA: 0s - loss: 0.5481 - accuracy: 0.7812
Epoch 1: val_loss improved from inf to 0.97783, saving model to ./Data\audio_classifcation.hdf5
Epoch 2/100
 1/18 [>.............................] - ETA: 0s - loss: 0.6193 - accuracy: 0.8125
Epoch 2: val_loss did not improve from 0.97783
Epoch 3/100
 1/18 [>.............................] - ETA: 0s - loss: 0.8348 - accuracy: 0.7500
Epoch 3: val_loss improved from 0.97783 to 0.92498, saving model to ./Data\audio_classifcation.hdf5
Epoch 4/100
 1/18 [>.............................] - ETA: 0s - loss: 0.8169 - accuracy: 0.8438
Epoch 4: val_loss improved from 0.92498 to 0.91638, saving model to ./Data\audio_classifcation.hdf5
Epoch 5/100
 1/18 [>.............................] - ETA: 0s - loss: 0.3521 - accuracy: 0.8438
Epoch 5: val_loss did not improve from 0.91638
Epoch 6/100
 1/18 [>.............................] - ETA: 0s - loss: 0.2503 - accuracy: 0.9062
Epoch 6: val_loss did not improve from 0

In [16]:
test_accuracy=model.evaluate(X_test, y_test, verbose=0)
print(f'The test accuracy was: {round(test_accuracy[1]*100,4)}%')

The test accuracy was: 75.2033%
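
A single accuracy figure can hide poor performance on the small classes. As a rough sketch, the function below builds a confusion matrix and per-class recall with plain NumPy. The labels here are made up to show the effect; they are not results from this model.

```python
import numpy as np

def per_class_recall(y_true, y_pred, n_classes):
    """Return the confusion matrix and the recall (row-wise accuracy) for each class."""
    confusion = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        confusion[t, p] += 1
    row_totals = confusion.sum(axis=1)
    # Classes with no true samples get NaN rather than a division error.
    recall = np.divide(confusion.diagonal(), row_totals,
                       out=np.full(n_classes, np.nan), where=row_totals > 0)
    return confusion, recall

# Hypothetical integer labels for a 3-class problem with one rare class (class 2).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 0, 0, 1])

confusion, recall = per_class_recall(y_true, y_pred, n_classes=3)
accuracy = np.trace(confusion) / confusion.sum()

print(confusion)
print(recall, accuracy)
```

In this toy case the overall accuracy looks respectable while the rare class is never predicted correctly. With the notebook's variables, the inputs would be `np.argmax(y_test, axis=1)` and `np.argmax(model.predict(X_test), axis=1)`. The `metrics` module imported at the top offers `confusion_matrix` and `classification_report` for the same purpose.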


You can go back and rerun the training block to see if the accuracy improves further over a few more generations (by a generation I mean one full run of the training cell). Because the dataset is small, it won't take long.

When I ran it on my system, the first generation was about 58% accurate, the second about 68%, the third about 75%, and the fourth did not improve much from there.

Not bad for such a small dataset with 10 classes, right? Well, let's test it on something it hasn't seen.

Because I don't have any additional animal noises that the model hasn't seen, I cannot really test it on proper animal sounds. In hindsight, I probably should have randomly extracted one or two audio files from each class and held them separately from the model.

Instead, because it amuses me so, I will attempt to make some animal noises myself and see how well the model classifies them.

In [18]:
test_file = "./human_cat.wav"
pred_mfcc = get_mfcc(test_file)
pred_mfcc = pred_mfcc.reshape(1,-1)

prediction = model.predict(pred_mfcc)

result = np.argmax(prediction, axis=1)
result = labelencoder.inverse_transform(result)

result



array(['cat'], dtype='<U7')

That's good, it seems like I can make a cat noise.

In [19]:
test_file = "./human_donkey.wav"
pred_mfcc = get_mfcc(test_file)
pred_mfcc = pred_mfcc.reshape(1,-1)

prediction = model.predict(pred_mfcc)

result = np.argmax(prediction, axis=1)
result = labelencoder.inverse_transform(result)

result



array(['cat'], dtype='<U7')

Hmmm, my donkey seems more like a cat to this model.

In [20]:
test_file = "./human_cow.wav"
pred_mfcc = get_mfcc(test_file)
pred_mfcc = pred_mfcc.reshape(1,-1)

prediction = model.predict(pred_mfcc)

result = np.argmax(prediction, axis=1)
result = labelencoder.inverse_transform(result)

result



array(['cat'], dtype='<U7')

And my cow seems more like a cat to this model.
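
Looking only at the top prediction hides how confident the model was. A small helper like the one below (hypothetical, not part of the notebook) lists the most probable classes. The probabilities are made up for illustration; a real vector would come from `model.predict(pred_mfcc)[0]`.

```python
import numpy as np

def top_k(probabilities, class_names, k=3):
    """Return the k most probable (class, probability) pairs, highest first."""
    order = np.argsort(probabilities)[::-1][:k]
    return [(class_names[i], float(probabilities[i])) for i in order]

# Alphabetical class order, as LabelEncoder would store it in classes_.
class_names = ["bird", "cat", "chicken", "cow", "dog",
               "donkey", "frog", "lion", "monkey", "sheep"]

# Made-up softmax output for one clip.
fake_probs = np.array([0.05, 0.40, 0.01, 0.30, 0.10, 0.08, 0.01, 0.02, 0.01, 0.02])

ranked = top_k(fake_probs, class_names)
print(ranked)
```

If 'cat' wins narrowly with 'cow' close behind, that tells a different story from 'cat' winning with near certainty.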

Now, depending on how many generations you train the model for, you will find that the results change. Right now, four generations isn't enough for the model to really learn. In fact, it is probably naively selecting 'cat' because there is such a large imbalance in the samples among classes.
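
One common way to soften the effect of class imbalance is to weight each class inversely to its frequency. The sketch below uses the "balanced" heuristic, weight = n_samples / (n_classes * class_count), on the counts reported earlier. This is an optional idea rather than something this notebook does, and in practice the counts should come from the training split only.

```python
import numpy as np

# Alphabetical order, matching the integer codes LabelEncoder assigns.
class_names = ["bird", "cat", "chicken", "cow", "dog",
               "donkey", "frog", "lion", "monkey", "sheep"]
# Counts copied from the groupby output earlier (whole dataset, used here for illustration).
counts = np.array([191, 193, 29, 69, 188, 19, 35, 38, 22, 34])

# Balanced heuristic: rare classes get large weights, common classes small ones.
weights = counts.sum() / (len(counts) * counts)

# Dictionary mapping class index to weight.
class_weight = {i: float(w) for i, w in enumerate(weights)}

for name, w in zip(class_names, weights):
    print(f"{name:8s} {w:.2f}")
```

Keras accepts a dictionary of this form through the `class_weight` argument of `model.fit`. Whether it helps on a dataset this small is something you would need to test.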

Again, the point of this was not to show you an excellent model. Moreover, there isn't really any expectation that a model trained on animal noises could correctly classify a human mimicking these animals. The point here was to illustrate how important good-quality data is to training a model.

Now, if you wish, you can rewrite this code to extract some animal sounds from the data, exclude them from train-test, and see if the model performs any better.
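
As a starting point, here is one possible way to hold out a couple of files per class before the train-test split. It uses a toy dataframe with made-up filenames; with the real data you would apply the same two lines to the filtered metadata dataframe. It assumes pandas 1.1 or later, where `groupby(...).sample` is available.

```python
import pandas as pd

# Hypothetical stand-in for the filtered metadata (filename and class columns).
animals = ["cat", "dog", "cow"]
toy_df = pd.DataFrame({
    "filename": [f"{c}_{i}.wav" for c in animals for i in range(5)],
    "class": [c for c in animals for _ in range(5)],
})

# Draw two files per class at random; random_state makes the draw repeatable.
holdout = toy_df.groupby("class").sample(n=2, random_state=0)

# Everything not drawn stays available for the train-test split.
remaining = toy_df.drop(holdout.index)

print(len(holdout), len(remaining))
```

The held-out files never reach `train_test_split`, so predictions on them give a cleaner picture of how the model handles unseen animal sounds.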

Or, you can go out, find some high-quality data and contribute to the project goals.