Ideas adopted from:

CNN and RNN model : https://www.worldscientific.com/doi/pdf/10.1142/S2196888822500300

Limitations: The audio used in the research differs in nature from the audio data given to us. We therefore consider the paper's CNN and RNN architecture unsuitable here: it is too complex and computationally heavy, and it might not improve our model's accuracy.

How to overcome: We take a trial-and-error approach when designing our CNN and RNN models, starting from basic layers and gradually adding layers to reach a higher accuracy.

Feature extraction for audio files: https://www.kaggle.com/code/gopidurgaprasad/mfcc-feature-extraction-from-audio

Limitations: The article explains which features to extract for audio classification, but only shows how to extract MFCCs from a single audio file. In our implementation, we must not only extract MFCCs but also ensure that the number of frames is the same across all audio files.

How to overcome: Compute a separate hop length for each audio file from its length, so that every file yields the same number of frames.
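
The hop-length calculation can be sketched as follows. This is a minimal illustration, assuming librosa's default centered framing, under which a signal of n samples yields 1 + n // hop_length frames. The helper names `hop_length_for` and `frame_count` are ours and exist only for this illustration.

```python
def hop_length_for(num_samples, desired_num_frames):
    """Hop length that spreads desired_num_frames frames over the whole clip."""
    return num_samples // (desired_num_frames - 1)


def frame_count(num_samples, hop_length):
    """Frames produced by centered framing (assumed: librosa's default, center=True)."""
    return 1 + num_samples // hop_length


# clips of different lengths (in samples) all map to 8 frames
for num_samples in (3000, 8000, 12345):
    hop = hop_length_for(num_samples, 8)
    print(num_samples, hop, frame_count(num_samples, hop))
```

The frame count is exact whenever the clip has at least desired_num_frames * (desired_num_frames - 1) samples; much shorter clips can produce extra frames.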

Code referenced and the respective sources:

Wrapping using KerasClassifier: https://www.analyticsvidhya.com/blog/2021/05/tuning-the-hyperparameters-and-layers-of-neural-network-deep-learning/

Hyperparameter tuning using GridSearchCV: https://www.analyticsvidhya.com/blog/2021/06/tune-hyperparameters-with-gridsearchcv/

Extracting MFCCs and Mel-spectrogram: https://towardsdatascience.com/learning-from-audio-the-mel-scale-mel-spectrograms-and-mel-frequency-cepstral-coefficients-f5752b6324a8


We create many functions that serve different purposes. This improves code readability and also allows users to run a specific part of the code independently.

To run a model, scroll to the end of the file and uncomment the line that you want to run.

Users can experiment with different parameter values; the instructions are documented in the README PDF file.

The parameters that can be changed are: batch size, num_of_frames, cv (in GridSearchCV) and epochs.


In [46]:
import os
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display as dsp
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.optimizers.legacy import Adam
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV 
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from scikeras.wrappers import KerasClassifier


Functions to extract MFCCs/mel spectrograms

In [47]:
def extract_mfccs(audio, sampling_rate, num_mfccs, desired_num_frames):
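    # librosa centers frames by default, so it returns 1 + audio.size // hop_length frames;
    # scaling hop_length with the audio length gives every file desired_num_frames frames
    hop_length = int(audio.size/(desired_num_frames - 1))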

    mfccs = librosa.feature.mfcc(y=audio, sr=sampling_rate, n_mfcc=num_mfccs, n_fft=512, hop_length=hop_length)

    num_rows = mfccs.shape[0]
    num_frames = desired_num_frames

    return mfccs, num_rows, num_frames


def extract_specs(audio, sampling_rate, desired_num_frames):
    # same hop_length calculation as in extract_mfccs: it scales with the audio length
    # so that every file yields desired_num_frames frames
    hop_length = int(audio.size/(desired_num_frames - 1))

    specs = librosa.feature.melspectrogram(y=audio, sr=sampling_rate, n_fft=512, hop_length=hop_length)

    num_rows = specs.shape[0]
    num_frames = desired_num_frames

    return specs, num_rows, num_frames
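
For very short clips, the computed hop length can yield more frames than requested. One possible safeguard, which is not part of our pipeline, is to pad or truncate each feature matrix along the time axis. The sketch below assumes features shaped (num_rows, num_frames); `fix_num_frames` is a hypothetical helper.

```python
import numpy as np


def fix_num_frames(features, desired_num_frames):
    """Zero-pad or truncate a (num_rows, num_frames) array along the frame axis."""
    num_frames = features.shape[1]
    if num_frames >= desired_num_frames:
        return features[:, :desired_num_frames]
    padding = desired_num_frames - num_frames
    return np.pad(features, ((0, 0), (0, padding)), mode="constant")


too_long = np.ones((13, 11))   # e.g. 13 MFCCs over 11 frames
too_short = np.ones((13, 5))   # e.g. 13 MFCCs over 5 frames
print(fix_num_frames(too_long, 8).shape)
print(fix_num_frames(too_short, 8).shape)
```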


Load the audio files and extract features (MFCCs/mel spectrograms) from them.

Returns all_data, all_labels, the number of rows and the number of frames.

In [48]:
def load_data_mfccs(data_dir, sampling_rate, desired_num_frames, num_mfccs):
    all_mfccs = []
    all_labels = []

    for file in os.listdir(data_dir):
        if not file.endswith(".wav"):  # skip files that are not .wav audio
            continue

        # os.path.join works whether or not data_dir ends with a slash
        audio, sr = librosa.load(os.path.join(data_dir, file), sr=sampling_rate)

        '''
        Uncomment to display the audio in waveform

        plt.figure()
        plt.title('Waveform')
        dsp.waveshow(audio, axis='s')
        plt.show()
        '''

        mfccs, num_rows, num_frames = extract_mfccs(audio, sampling_rate, num_mfccs, desired_num_frames)

        '''
        Uncomment to display the mfccs

        plt.figure()
        plt.title('MFCCs')
        dsp.specshow(mfccs, x_axis='s')   # rows are MFCC coefficients, not mel frequencies
        plt.colorbar()                    # MFCC values are not in dB
        plt.show()
        '''

        label = int(file[0])

        all_mfccs.append(mfccs)
        all_labels.append(label)

    all_mfccs = np.array(all_mfccs)
    all_mfccs = all_mfccs.transpose(0, 2, 1)  # (files, rows, frames) -> (files, frames, rows), the layout the CNN and RNN expect

    all_labels = np.array(all_labels)

    # num_rows and num_frames are needed to define the input shapes of the CNN and RNN
    return all_mfccs, all_labels, num_rows, num_frames



# *dummy_args is ignored; it only lets this function accept the same arguments as
# load_data_mfccs, so that the main_* functions below can call either loader
def load_data_specs(data_dir, sampling_rate, desired_num_frames, *dummy_args):
    all_specs = []
    all_labels = []

    for file in os.listdir(data_dir):
        if not file.endswith(".wav"):
            continue

        audio, sr = librosa.load(os.path.join(data_dir, file), sr=sampling_rate)

        '''
        Uncomment to display the audio in waveform

        plt.figure()
        plt.title('Waveform')
        dsp.waveshow(audio, axis='s')
        plt.show()
        '''

        specs, num_rows, num_frames = extract_specs(audio, sampling_rate, desired_num_frames)


        '''
        Uncomment to display the mel spectrogram

        specs_db = librosa.power_to_db(specs, ref=np.max)  # convert to dB for display only; specs itself stays unchanged
        plt.figure()
        plt.title('Mel Spectrogram')
        dsp.specshow(specs_db, x_axis='s', y_axis='mel')
        plt.colorbar(format='%+2.0f dB')
        plt.show()
        '''

        label = int(file[0])

        all_specs.append(specs)
        all_labels.append(label)

    all_specs = np.array(all_specs)
    all_specs = all_specs.transpose(0, 2, 1)
    all_labels = np.array(all_labels)

    return all_specs, all_labels, num_rows, num_frames
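
The stacking and transposing at the end of both loaders can be illustrated on dummy data. The file names below are made up for illustration; the only assumption carried over from the code is that the first character of each file name is the digit label.

```python
import numpy as np

file_names = ["0_a.wav", "7_b.wav", "notes.txt", "3_c.wav"]  # hypothetical names
num_rows, num_frames = 13, 8

features, labels = [], []
for name in file_names:
    if not name.endswith(".wav"):   # non-audio files are skipped
        continue
    features.append(np.zeros((num_rows, num_frames)))  # stand-in for one MFCC matrix
    labels.append(int(name[0]))                        # label = first character

all_data = np.array(features)            # shape: (files, rows, frames)
all_data = all_data.transpose(0, 2, 1)   # shape: (files, frames, rows)
all_labels = np.array(labels)

print(all_data.shape, all_labels)
```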



A function that builds the **CNN** model (required for KerasClassifier).

In [49]:
def get_cnn_model(num_frames, num_rows, filters, kernel_size, pool_size, dense_units, dropout_rate, learning_rate):
    model = keras.Sequential()

    # each sample is a (num_frames, num_rows) feature map with a single channel
    model.add(layers.Input(shape=(num_frames, num_rows, 1)))

    # three convolution layers followed by one pooling layer
    model.add(layers.Conv2D(filters, kernel_size, activation='relu', padding='same'))
    model.add(layers.Conv2D(filters, kernel_size, activation='relu', padding='same'))
    model.add(layers.Conv2D(filters, kernel_size, activation='relu'))
    model.add(layers.MaxPooling2D(pool_size))
    model.add(layers.Flatten())
    model.add(layers.Dense(dense_units, activation='relu'))
    model.add(layers.Dropout(dropout_rate))

    # one output unit per class (labels 0-9)
    model.add(layers.Dense(10, activation='softmax'))

    adam = Adam(learning_rate=learning_rate)
    model.compile(optimizer=adam, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    return model
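
To see how the feature map shrinks through the CNN, the layer arithmetic can be traced by hand. The sketch below is only a shape calculation, not the model itself. It assumes the defaults used above: 3x3 kernels with stride 1, 'valid' padding on the third convolution, and 2x2 non-overlapping pooling. All function names are ours.

```python
def conv_out(size, kernel, padding):
    """Output length of a stride-1 convolution along one axis."""
    return size if padding == "same" else size - kernel + 1


def pool_out(size, pool):
    """Output length of non-overlapping max pooling ('valid' padding)."""
    return size // pool


def cnn_flatten_size(num_frames, num_rows, filters, kernel=3, pool=2):
    """Number of values reaching the first Dense layer."""
    h, w = num_frames, num_rows
    for padding in ("same", "same", "valid"):   # the three Conv2D layers
        h, w = conv_out(h, kernel, padding), conv_out(w, kernel, padding)
    h, w = pool_out(h, pool), pool_out(w, pool)
    return h * w * filters


print(cnn_flatten_size(8, 13, 32))    # MFCC input: 8 frames x 13 coefficients
print(cnn_flatten_size(8, 128, 32))   # mel spectrogram input, assuming librosa's default of 128 mel bands
```

With these settings, num_of_frames needs to be at least 4; otherwise nothing is left after the pooling layer.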

A function that builds the **RNN** model (required for KerasClassifier).

In [50]:
def get_rnn_model(num_frames, num_rows, units, dropout_rate, learning_rate):
    model = keras.Sequential()
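    # the LSTM reads num_frames time steps, each with num_rows features
    model.add(layers.LSTM(units, input_shape=(num_frames, num_rows)))

    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(10, activation='softmax'))  # one output unit per class (labels 0-9)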

    adam = Adam(learning_rate=learning_rate)
    model.compile(optimizer=adam, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    return model
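
As a rough guide to how the RNN's size changes with the tuned `units` value, the number of trainable weights can be counted by hand. This sketch assumes a standard LSTM layer (four gates, each with input weights, recurrent weights and a bias) and ordinary Dense layers; the function names are ours.

```python
def lstm_param_count(input_dim, units):
    """Weights in a standard LSTM layer: 4 gates x (input + recurrent + bias)."""
    return 4 * units * (input_dim + units + 1)


def dense_param_count(inputs, outputs):
    """Weights in a Dense layer: one weight per input plus a bias, per output."""
    return (inputs + 1) * outputs


def rnn_param_count(num_rows, units):
    """Total for LSTM(units) -> Dense(64) -> Dense(10); Dropout has no weights."""
    return (lstm_param_count(num_rows, units)
            + dense_param_count(units, 64)
            + dense_param_count(64, 10))


print(rnn_param_count(13, 64))    # MFCC input, units=64
print(rnn_param_count(13, 128))   # MFCC input, units=128
print(rnn_param_count(128, 64))   # mel spectrogram input, assuming 128 mel bands
```

The count does not depend on num_of_frames, because the LSTM reuses the same weights at every time step.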

Main function to run **CNN** (Cross Validation and Hyperparameter Tuning have been included)

In [51]:
def main_cnn(load_data_function):
    # higher-order function: load_data_function is either load_data_mfccs or load_data_specs
    # arguments: data directory, sampling_rate = 8000 Hz, num_of_frames = 8, num_of_mfccs = 13 (ignored by load_data_specs)
    all_data, all_labels, num_rows, num_frames = load_data_function("data/", 8000, 8, 13)

    train_data, test_data, train_labels, test_labels = train_test_split(all_data, all_labels, train_size = 0.8)

    param_grid = {
        'filters': [32, 64],
        'dense_units': [64, 128],
        'dropout_rate': [0.3, 0.5],
        'learning_rate': [0.001, 0.005]
    }

    # wrap the Keras model so that scikit-learn's GridSearchCV can use it
    model = KerasClassifier(model=get_cnn_model, num_frames=num_frames, num_rows=num_rows, filters=32, kernel_size=(3,3),
                            pool_size=(2,2), dense_units=64, dropout_rate=0.1, learning_rate=0.001, epochs=10)

    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')

    grid_result = grid_search.fit(train_data, train_labels, batch_size=32)
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

    best_cnn_model = grid_result.best_estimator_
    test_accuracy = best_cnn_model.score(test_data, test_labels) # evaluate the best model on the test data
    print("Test accuracy: ", test_accuracy)

    prediction = best_cnn_model.predict(test_data)              # get classification report
    report = classification_report(test_labels, prediction)
    print(report)

    cm = confusion_matrix(test_labels, prediction)              # get confusion matrix
    cm_display = ConfusionMatrixDisplay(confusion_matrix=cm)

    cm_display.plot()
    plt.show()

    training_history = best_cnn_model.history_                  # display best model's training history
    plt.plot(training_history['accuracy'])
    plt.title('Model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.show()
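
Before changing cv or epochs, it helps to know how many training runs the grid search performs. The sketch below counts them for the CNN grid, assuming GridSearchCV's default behaviour of refitting the best combination once on the full training set; `count_fits` is a hypothetical helper and works for the other grids as well.

```python
from itertools import product


def count_fits(param_grid, cv):
    """Return (number of parameter combinations, total number of model fits)."""
    combinations = list(product(*param_grid.values()))
    # every combination is trained once per fold, plus one final refit of the best one
    return len(combinations), len(combinations) * cv + 1


cnn_grid = {
    'filters': [32, 64],
    'dense_units': [64, 128],
    'dropout_rate': [0.3, 0.5],
    'learning_rate': [0.001, 0.005],
}

num_combinations, num_fits = count_fits(cnn_grid, cv=5)
print(num_combinations, num_fits)
```

Each of these fits trains for the full number of epochs, so the running time grows roughly in proportion to both cv and epochs.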

Main function to run **RNN** (Cross Validation and Hyperparameter Tuning have been included)

In [52]:
def main_rnn(load_data_function):
    all_data, all_labels, num_rows, num_frames = load_data_function("data/", 8000, 10, 13)

    train_data, test_data, train_labels, test_labels = train_test_split(all_data, all_labels, train_size = 0.8)

    param_grid = {
        'units': [64, 128],
        'dropout_rate': [0.3, 0.5],
        'learning_rate': [0.001, 0.005]
    }

    model = KerasClassifier(model=get_rnn_model, num_rows=num_rows, num_frames=num_frames, units=64, dropout_rate=0.3, learning_rate=0.001, epochs=10)
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')

    grid_result = grid_search.fit(train_data, train_labels, batch_size=32)
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

    best_rnn_model = grid_result.best_estimator_
    test_accuracy = best_rnn_model.score(test_data, test_labels)
    print("Test accuracy: ", test_accuracy)

    prediction = best_rnn_model.predict(test_data)
    report = classification_report(test_labels, prediction)
    print(report)

    cm = confusion_matrix(test_labels, prediction)
    cm_display = ConfusionMatrixDisplay(confusion_matrix=cm)

    cm_display.plot()
    plt.show()

    training_history = best_rnn_model.history_
    plt.plot(training_history['accuracy'])
    plt.title('Model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.show()

Main function to run **KNN** (Cross validation and Hyperparameter Tuning have been included)

In [53]:
def main_knn(load_data_function):
    all_data, all_labels, num_rows, num_frames = load_data_function("data/", 8000, 8, 13)
    all_flattened_data = []

    for data in all_data:                           # flatten all 2-D arrays into 1-D arrays
        flattened_data = data.flatten()
        all_flattened_data.append(flattened_data)

    all_flattened_data = np.array(all_flattened_data)

    train_data, test_data, train_labels, test_labels = train_test_split(all_flattened_data, all_labels, train_size = 0.8, random_state=42)

    sqrt_n = int(pow(train_labels.shape[0], 0.5))   # square root of the number of training samples

    param_grid = {
        'n_neighbors' : [5, 15, 25, 35, sqrt_n],
        'weights' : ['uniform','distance'],
        'metric' : ['minkowski','euclidean', 'manhattan']   # note: 'minkowski' with the default p=2 is the same as 'euclidean'
    }

    model = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='minkowski')
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')

    grid_result = grid_search.fit(train_data, train_labels)
    print("Best score: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

    best_knn_model = grid_result.best_estimator_
    test_accuracy = best_knn_model.score(test_data, test_labels)
    print("Test accuracy: ", test_accuracy)

    prediction = best_knn_model.predict(test_data)
    report = classification_report(test_labels, prediction)
    print(report)

    cm = confusion_matrix(test_labels, prediction)
    cm_display = ConfusionMatrixDisplay(confusion_matrix=cm)

    cm_display.plot()
    plt.show()
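
The n_neighbors candidates above can contain duplicates (when the square root happens to equal one of the fixed values) or values larger than the data available in a cross-validation fold. A possible way to clean the list, shown here only as a sketch with a hypothetical helper `knn_candidates`, is:

```python
import math


def knn_candidates(n_train, fixed=(5, 15, 25, 35), cv=5):
    """Sorted, de-duplicated n_neighbors values that fit inside one CV training fold."""
    sqrt_n = math.isqrt(n_train)                          # integer square root of the training-set size
    fold_train_size = n_train - math.ceil(n_train / cv)   # approximate size of the smallest training fold
    candidates = set(fixed) | {sqrt_n}
    return sorted(k for k in candidates if 1 <= k <= fold_train_size)


print(knn_candidates(2400))   # large training set: all candidates are kept
print(knn_candidates(25))     # tiny training set: duplicates and oversized values are dropped
```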


There are 6 combinations: any of main_cnn/main_rnn/main_knn with either load_data_mfccs or load_data_specs.


In [54]:
## Uncomment any of the function calls below to run it

## main_cnn(load_data_mfccs)
## main_rnn(load_data_mfccs)
## main_knn(load_data_mfccs)

## main_cnn(load_data_specs)
## main_rnn(load_data_specs)
## main_knn(load_data_specs)