# Audio classification using convolutional neural networks


Audio classification can be performed by converting audio streams into [spectrograms](https://en.wikipedia.org/wiki/Spectrogram), which are visual representations of how a signal's frequency spectrum varies over time, and classifying the spectrograms with [convolutional neural networks](https://en.wikipedia.org/wiki/Convolutional_neural_network) (CNNs). The spectrograms used here are generated from WAV files containing sounds such as coughing, laughing, talking, and tapping. We shall use TensorFlow Keras to build a CNN that can identify these sounds.
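
To make the idea of a spectrogram concrete, here is a minimal sketch that computes a magnitude spectrogram of a synthetic tone using only NumPy. The function name `stft_magnitude`, the Hann window, and the frame sizes are illustrative assumptions; the notebook itself relies on `src.preprocess` for feature extraction.

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop_length=128):
    """Return a (n_fft // 2 + 1, n_frames) magnitude spectrogram (hypothetical helper)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    # Cut the signal into overlapping, windowed frames
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real-valued signal
    return np.abs(np.fft.rfft(frames, axis=1)).T

sr = 16000
t = np.arange(sr) / sr                 # one second of audio
tone = np.sin(2 * np.pi * 1000 * t)    # 1 kHz sine wave
spec = stft_magnitude(tone)

# Frequency of the strongest bin, averaged over time
peak_bin = spec.mean(axis=1).argmax()
peak_hz = peak_bin * sr / 512
print(spec.shape, peak_hz)
```

Each column of `spec` is the spectrum of one short frame, so the array can be treated like a single-channel image with frequency on one axis and time on the other.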

In [None]:
import os
import sys

# Add the parent directory to the import path so the `src` package can be found
sys.path.append("..")

In [None]:
base_dir_train = "../data/original_audio/train"
base_dir_test  = "../data/original_audio/test"

# List the input directories, skipping macOS .DS_Store entries
train_input_dirs = [os.path.join(base_dir_train, name)
                    for name in os.listdir(base_dir_train) if "DS_Store" not in name]
test_input_dirs  = [os.path.join(base_dir_test, name)
                    for name in os.listdir(base_dir_test) if "DS_Store" not in name]

## Preprocessing the Audio Data

### Importing Required Libraries and Setting Parameters

First, we set the mel-spectrogram and MFCC (Mel-frequency cepstral coefficient) parameters, define the number of categories, and create a mapping from category names to numeric labels.
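
The hop length and the number of coefficients determine the size of the feature array the CNN receives. The sketch below works out the expected shapes for a 2-second chunk at 16 kHz with the hop lengths used in this notebook. It assumes librosa-style centered framing (librosa's default), where the frame count is `1 + n_samples // hop_length`; `expected_frames` is a hypothetical helper, and the actual shapes depend on how `src.preprocess` is implemented.

```python
def expected_frames(n_samples, hop_length):
    # Centered framing (assumed): one frame per hop, plus one
    return 1 + n_samples // hop_length

sr = 16000
chunk_len = 2
n_samples = sr * chunk_len  # 32000 samples per chunk

# (coefficients, frames) for each parameter set
mfcc_shape = (128, expected_frames(n_samples, 256))  # n_mfcc=128, hop_length=256
mels_shape = (224, expected_frames(n_samples, 128))  # n_mels=224, hop_length=128
print(mfcc_shape, mels_shape)
```

Halving the hop length roughly doubles the number of time frames, which is why the mel parameters produce a wider array than the MFCC parameters.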

### Preprocessing and Preparing the Data

Next, we preprocess the training and test data. The `preprocess` function (imported from `src.preprocess`) extracts features from the audio files. We then convert the features and labels into NumPy arrays, mapping each categorical label to its numeric value.

### Normalizing the Data and Encoding Labels

Finally, we normalize the feature data and encode the labels. Features are scaled with min-max normalization using the minimum and maximum of the training set, so training values lie between 0 and 1 and test values are scaled identically. This typically helps the model converge faster during training. The labels are then one-hot encoded using `to_categorical` from Keras, which converts the numeric labels into a binary class matrix, the format expected for multi-class classification.
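
As a small illustration of these two steps, the sketch below applies min-max normalization and one-hot encoding to toy arrays with plain NumPy. The values are made up for the example, and `np.eye` stands in for `to_categorical`, producing the same layout.

```python
import numpy as np

# Toy feature values (made up for illustration)
X_train = np.array([[-40.0, 0.0], [10.0, 60.0]])
X_test = np.array([[-50.0, 35.0]])

# Statistics come from the training data only
lo, hi = X_train.min(), X_train.max()
X_train_norm = (X_train - lo) / (hi - lo)
X_test_norm = (X_test - lo) / (hi - lo)   # may fall slightly outside [0, 1]

# One-hot encode numeric labels: row i has a 1 in column labels[i]
labels = np.array([0, 1, 1, 0])
one_hot = np.eye(2)[labels]

print(X_train_norm)
print(X_test_norm)
print(one_hot)
```

Reusing the training minimum and maximum for the test set keeps both sets on the same scale, at the cost of test values occasionally landing just outside the 0 to 1 range.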

After these preprocessing steps, our data is ready to be fed into the CNN model for training and evaluation.



In [None]:
import numpy as np
from src.preprocess import preprocess
from tensorflow.keras.utils import to_categorical

In [None]:
chunk_len = 2
verbose   = 0
sr        = 16000
mels_params  = {"hop_length":128, "n_fft":2048, "n_mels":224}
mfcc_params  = {"hop_length":256, "n_fft":2048, "n_mfcc":128}
params       = mfcc_params
n_categories = 2

if n_categories == 4:
    categories = ["tapping", "talking", "laughing", "cough"]
    map_cat    = { "tapping":0, "talking":1, "laughing":2, "cough":3}
elif n_categories == 2:
    map_cat    = {"laughing":0, "talking":0, "tapping":0, "cough":1}
    categories = ["other", "cough"]

In [None]:
(train_features, train_labels) = preprocess(train_input_dirs, params)
(test_features,  test_labels)  = preprocess(test_input_dirs , params)

X_train = np.array(train_features)
y_train = np.array([map_cat[label] for label in train_labels])

X_test = np.array(test_features)
y_test = np.array([map_cat[label] for label in test_labels])

X_train_norm = (X_train - X_train.min()) / (X_train.max() - X_train.min())
X_test_norm  = (X_test  - X_train.min()) / (X_train.max() - X_train.min())

y_train_encoded = to_categorical(y_train)
y_test_encoded = to_categorical(y_test)

## Build and train a CNN

State-of-the-art image classification is typically performed with convolutional neural networks that use [convolution layers](https://machinelearningmastery.com/convolutional-layers-for-deep-learning-neural-networks/) to extract features from images and [pooling layers](https://machinelearningmastery.com/pooling-layers-for-convolutional-neural-networks/) to downsize images so that features can be detected at various resolutions. The next task is to build a CNN containing a series of convolution and pooling layers for feature extraction, a pair of fully connected layers for classification, and an output layer that produces a probability for each class (`softmax` for the four-class setup, `sigmoid` for the two-class setup selected above), and to train it with the extracted features and labels. Start by defining the CNN.
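
To see how convolution and pooling layers shrink the input, the sketch below traces the spatial size through three convolution and pooling stages. The layer stack (3x3 unpadded convolutions followed by 2x2 max pooling) and the 128 x 126 input size are assumptions chosen for illustration; they are not taken from `create_model`, whose architecture is defined in `src.model`.

```python
def conv_out(size, kernel=3, stride=1):
    # Unpadded ('valid') convolution
    return (size - kernel) // stride + 1

def pool_out(size, pool=2):
    # Non-overlapping max pooling drops any remainder
    return size // pool

h, w = 128, 126          # assumed input height and width
shapes = [(h, w)]
for _ in range(3):       # three hypothetical conv + pool stages
    h, w = conv_out(h), conv_out(w)
    h, w = pool_out(h), pool_out(w)
    shapes.append((h, w))

print(shapes)
```

Each stage roughly halves both dimensions, so later convolution layers see a coarser view of the input and can respond to larger patterns.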

In [None]:
from tensorflow.keras.losses import binary_crossentropy

In [None]:
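# Pick the output activation and loss to match the number of categories
if n_categories == 2:
    activation = "sigmoid"
    loss = "binary_crossentropy"
else:
    activation = "softmax"
    loss = "categorical_crossentropy"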
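
The two activation and loss pairings behave differently on the same outputs. The NumPy sketch below compares them on a made-up pair of logits; the helper functions are illustrative re-implementations, not the Keras ones.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, 0.5])   # made-up raw outputs for two classes
p_soft = softmax(logits)        # classes compete; values sum to 1
p_sig = sigmoid(logits)         # each unit is scored independently

target = np.array([1.0, 0.0])   # one-hot label for class 0
categorical_ce = -np.sum(target * np.log(p_soft))
binary_ce = -np.mean(target * np.log(p_sig) + (1 - target) * np.log(1 - p_sig))

print(p_soft, p_sig)
print(categorical_ce, binary_ce)
```

With `softmax` the outputs form a single distribution over classes, whereas `sigmoid` outputs need not sum to 1, which is why each activation is paired with its own cross-entropy loss.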

In [None]:
from src.model import create_model
from tensorflow.keras.optimizers.legacy import Adam

model = create_model(X_train_norm.shape[1:], n_categories, activation=activation)
# Use the loss selected above so that it matches the output activation
model.compile(optimizer=Adam(), loss=loss, metrics=['accuracy'])

In [None]:
hist = model.fit(X_train_norm, y_train_encoded, 
                 validation_data=(X_test_norm, y_test_encoded), 
                 batch_size=16, 
                 epochs=10)

In [None]:
# Save the model in SavedModel format
model.save("../model/mfcc_cnn_model")

# Optionally, you can also save the model weights separately
model.save_weights("../model/mfcc_cnn_weights.h5")

In [None]:
# Recreate the model architecture with the same output activation used for training
loaded_model = create_model(X_train_norm.shape[1:], n_categories, activation=activation)

# Compile the model (use the same configuration as when you trained it)
loaded_model.compile(optimizer=Adam(), loss=loss, metrics=['accuracy'])

# Load the weights
loaded_model.load_weights("../model/mfcc_cnn_weights.h5")

## Plot the training and validation accuracy

In [None]:
from src.model_utils import plot_train_val_metrics
plot_train_val_metrics(hist)

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

y_predicted = model.predict(X_test_norm)
mat = confusion_matrix(y_test_encoded.argmax(axis=1), 
                       y_predicted.argmax(axis=1), 
                       normalize="true")

sns.heatmap(mat, square=True, annot=True, cbar=False, 
            cmap='Blues', xticklabels=categories, yticklabels=categories)

plt.xlabel('Predicted label')
plt.ylabel('Actual label')
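
The `normalize="true"` argument divides each row of the confusion matrix by the number of samples in that actual class, so the diagonal shows per-class recall. A minimal NumPy sketch with made-up labels:

```python
import numpy as np

# Made-up labels: four "other" (0) samples and two "cough" (1) samples
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])

n_classes = 2
counts = np.zeros((n_classes, n_classes))
for actual, predicted in zip(y_true, y_pred):
    counts[actual, predicted] += 1   # rows: actual, columns: predicted

# Divide each row by its total so that every row sums to 1
row_normalized = counts / counts.sum(axis=1, keepdims=True)
print(row_normalized)
```

Row normalization makes classes of different sizes comparable: here 75% of the "other" samples and 50% of the "cough" samples are classified correctly, even though the raw counts differ.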

## Test with unrelated WAV files

The `../samples` directory contains WAV files that the CNN was neither trained nor tested with. The WAV files bear no relation to the samples used for training and testing; they were extracted from a YouTube video documenting Brazil's efforts to curb illegal logging. Let's use the model trained above to classify the sounds in these files. Start by loading one of the sample WAV files, splitting it into overlapping chunks, and computing MFCC features for each chunk.
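
The cell that follows slides a fixed-length window over the recording and classifies each window separately. The sketch below shows only the window arithmetic, using a 2-second window, a 0.5-second hop, and a 16 kHz sample rate; `window_bounds` is a hypothetical helper and the 5-second clip length is made up.

```python
import numpy as np

def window_bounds(n_samples, sr=16000, chunk_len=2, hop_s=0.5):
    """Return (start, end) sample indices for each window (hypothetical helper)."""
    chunk = int(chunk_len * sr)
    hop = int(hop_s * sr)
    # np.arange excludes the stop value, so a window starting exactly at
    # n_samples - chunk is not produced
    starts = np.arange(0, n_samples - chunk, hop, dtype=int)
    return [(int(s), int(s) + chunk) for s in starts]

bounds = window_bounds(5 * 16000)   # a 5-second clip
seconds = [(a / 16000, b / 16000) for a, b in bounds]
print(seconds)
```

Because consecutive windows overlap by 1.5 seconds, a short event such as a single cough appears in several windows, which gives the model more than one chance to detect it.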

In [None]:
import librosa
from src.utils import create_mfcc

# Collect the WAV files in a stable order (os.listdir order is arbitrary)
wav_samples = sorted(f for f in os.listdir("../samples/") if f.lower().endswith(".wav"))

In [None]:
audio_file = wav_samples[1]
print(audio_file)
y, sr = librosa.load(f"../samples/{audio_file}", sr=16000)

# Reuse the training-set statistics so inputs are scaled like the training data
x_min, x_max = X_train.min(), X_train.max()

# Slide a chunk_len-second window over the audio in 0.5-second steps
starts = np.arange(0, len(y) - chunk_len * sr, 0.5 * sr, dtype=int)
for x1 in starts:
    x2 = int(x1 + chunk_len * sr)
    audio_signal = y[x1:x2]
    x = create_mfcc(audio_signal, sr, **params)
    x = np.expand_dims(x, axis=-1)  # add the channel dimension
    x = np.expand_dims(x, axis=0)   # add the batch dimension
    x = (x - x_min) / (x_max - x_min)
    predictions = model.predict(x)

    # Report the score for every category in this window
    for i, label in enumerate(categories):
        print(f'Seconds {x1/sr:0.2f}:{x2/sr:.2f} => {label}: {predictions[0][i]}')