# Predicting from Sound
> A Neural Networks project by Aleksander Nikolajev, Kayahan Kaya and Severin Brunner

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]
- sticky_rank: 1

## Introduction
The objective of this project is to find ways of obtaining information from sounds using neural networks. Examples of such information include the source of a sound as well as the distance of the source to the microphone.
To approach the detection of the sound source, we are constructing a neural network based on a submission {% fn 1 %} to the [Freesound General-Purpose Audio Tagging Challenge](https://www.kaggle.com/c/freesound-audio-tagging/overview), a competition with the goal of classifying audio files from a wide range of real-world environments.

{{ 'This is footnote' | fndetail: 1}}

## The Dataset

As a dataset for testing and training, we are using the one provided for the 2018 Kaggle Freesound competition, which is downloadable [here](https://www.kaggle.com/c/freesound-audio-tagging/data).
It contains sounds from 41 different categories, such as trumpet or fireworks, with 9473 training examples and 1600 test examples. However, the samples are not distributed uniformly over the categories, so some categories have more data than others. The share of manually verified samples also varies from category to category. Both factors might make training more challenging.

![](kaggle_dataset_distribution.png "The dataset. As can be seen, only some of the samples have been manually verified (blue). The number of samples and the share of manually verified samples vary between the categories.")

#hide
## Audio theory: Reading audio files, PCM, MFCC, Spectrograms

## Creating the network

We are using Keras as the deep learning library to construct our network.

Librosa is a library with several functions to extract features from audio data, which we will be using to preprocess the network input.

#hide 
TODO: Explain the different approaches to network input, architectures etc.

In [None]:
import librosa
import numpy as np
import scipy
from keras import losses, models, optimizers
from keras.activations import relu, softmax
from keras.callbacks import (EarlyStopping, LearningRateScheduler,
                             ModelCheckpoint, TensorBoard, ReduceLROnPlateau)
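# The 2D layers, BatchNormalization, Activation and Flatten are needed by the model below.
from keras.layers import (Activation, BatchNormalization, Convolution1D,
                          Convolution2D, Dense, Dropout, Flatten,
                          GlobalAveragePooling1D, GlobalMaxPool1D, Input,
                          MaxPool1D, MaxPool2D, concatenate)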
from keras.utils import Sequence, to_categorical

Another crucial preprocessing step is normalization, for which we will be using this function:

In [10]:
def audio_norm(data):
    max_data = np.max(data)
    min_data = np.min(data)
    data = (data-min_data)/(max_data-min_data+1e-6)
    return data-0.5
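
As a quick sanity check, the sketch below applies the same min-max scaling to a toy signal. The input values are invented for illustration. The result spans roughly -0.5 to 0.5, and the small constant in the denominator keeps a constant (for example silent) clip from causing a division by zero.

```python
import numpy as np

def audio_norm(data):
    """Min-max scale a signal to roughly [-0.5, 0.5]."""
    max_data = np.max(data)
    min_data = np.min(data)
    # The small constant avoids division by zero for constant clips.
    data = (data - min_data) / (max_data - min_data + 1e-6)
    return data - 0.5

# Toy signal, invented for illustration only.
toy = np.array([-2.0, 0.0, 1.0, 6.0])
normed = audio_norm(toy)
silent = audio_norm(np.zeros(4))
print(normed.min(), normed.max(), silent)
```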

At this point, we must decide in which format to provide the network input. The dataset samples are given as raw audio files, which could be used directly as input. However, this has several disadvantages, one of them being high memory usage: loading the entire dataset into memory might exceed the available capacity. A so-called data generator can provide relief. Instead of naively loading all samples at once, it loads them on demand during training and testing. The code for such a data generator can be found at https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly.
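
To make the idea concrete, here is a minimal sketch of such a generator. It is not the code from the linked post: the class `BatchGenerator` and the loader `fake_load` are hypothetical names, and the class only mirrors the `__len__`/`__getitem__` interface of `keras.utils.Sequence` without inheriting from it, so the example runs without Keras or audio files.

```python
import numpy as np

class BatchGenerator:
    """Returns (X, y) batches on demand instead of holding every clip in memory.

    Hypothetical sketch that mirrors the __len__/__getitem__ interface of
    keras.utils.Sequence but does not inherit from it.
    """

    def __init__(self, ids, labels, load_fn, batch_size=64):
        self.ids = list(ids)
        self.labels = np.asarray(labels)
        self.load_fn = load_fn          # function that loads one sample by id
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch (the last one may be smaller).
        return int(np.ceil(len(self.ids) / self.batch_size))

    def __getitem__(self, idx):
        start = idx * self.batch_size
        stop = start + self.batch_size
        # Only the samples of this batch are loaded.
        X = np.stack([self.load_fn(i) for i in self.ids[start:stop]])
        y = self.labels[start:stop]
        return X, y

def fake_load(sample_id, length=16):
    """Stand-in loader (assumption): returns a deterministic dummy waveform."""
    return np.full(length, float(sample_id))

gen = BatchGenerator(ids=range(10), labels=np.arange(10) % 3,
                     load_fn=fake_load, batch_size=4)
X0, y0 = gen[0]
print(len(gen), X0.shape, y0)
```

In a real setup, the loader would read and preprocess one audio file, so memory usage is bounded by the batch size rather than by the size of the dataset.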

Still, computations on large data are time-consuming, especially when the data is loaded into memory on the fly, as is the case with a data generator. A better idea might be to extract features from the raw audio files and use these as input. This essentially compresses the input, which makes a data generator unnecessary and greatly accelerates training.

Therefore, as a preprocessing step, we extract so-called MFCC (mel-frequency cepstral coefficient) features from our input data using the Librosa library. The MFCC data is computed for each sample in turn and stored in a variable X. This process is quite time-consuming. Once it is done, we normalize the generated input data.
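
To get a feeling for how much this compresses the input, the sketch below computes the size of the MFCC matrix for a single clip. The sampling rate of 44100 Hz, the clip length of two seconds, and the 40 coefficients are taken from the plotting cell in this post and are only assumed to match the training settings. The hop length of 512 samples is Librosa's default.

```python
# Assumed settings (taken from the plotting cell; not confirmed for training).
sampling_rate = 44100      # samples per second
audio_duration = 2         # seconds per clip
n_mfcc = 40                # coefficients per frame
hop_length = 512           # Librosa's default hop between frames

audio_length = sampling_rate * audio_duration
# With Librosa's default centered framing there is one frame per hop, plus one.
n_frames = 1 + audio_length // hop_length

raw_values = audio_length
mfcc_values = n_mfcc * n_frames
print((n_mfcc, n_frames), raw_values, mfcc_values, raw_values / mfcc_values)
```

Under these assumptions, each clip shrinks from 88200 raw samples to a 40 x 173 matrix with 6920 values, which is more than ten times smaller.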

In [None]:
#hide
# This is what the MFCCs look like for a single file:
import matplotlib.pyplot as plt

SAMPLE_RATE = 44100
fname = 'drive/MyDrive/datasets/FSDKaggle2018.audio_train/' + '00044347.wav'   # Hi-hat
wav, _ = librosa.load(fname, sr=SAMPLE_RATE)
wav = wav[:2 * SAMPLE_RATE]   # keep the first two seconds
# Pass the signal as keyword argument y; recent Librosa versions require it.
mfcc = librosa.feature.mfcc(y=wav, sr=SAMPLE_RATE, n_mfcc=40)

plt.imshow(mfcc, cmap='hot', interpolation='nearest');
# TODO: include image

In [None]:
def prepare_data(df, data_dir):
    # Relies on the notebook-level settings n_mfcc, audio_length and sampling_rate.
    # Librosa's default hop length is 512 samples, hence the number of frames below.
    dim = (n_mfcc, 1 + int(np.floor(audio_length / 512)), 1)

    X = np.empty(shape=(df.shape[0], dim[0], dim[1], 1))
    input_length = audio_length
    for i, fname in enumerate(df.index):
        # Load the audio file
        file_path = data_dir + fname
        data, _ = librosa.load(file_path, sr=sampling_rate, res_type="kaiser_fast")

        # Random crop if the clip is too long, random zero-padding if it is too short
        if len(data) > input_length:
            max_offset = len(data) - input_length
            offset = np.random.randint(max_offset)
            data = data[offset:(input_length + offset)]
        else:
            if input_length > len(data):
                max_offset = input_length - len(data)
                offset = np.random.randint(max_offset)
            else:
                offset = 0
            data = np.pad(data, (offset, input_length - len(data) - offset), "constant")

        # Extract the MFCC features (signal passed as keyword y) and add a channel axis
        data = librosa.feature.mfcc(y=data, sr=sampling_rate, n_mfcc=n_mfcc)
        data = np.expand_dims(data, axis=-1)
        # Store the features of this file in row i
        X[i,] = data
    return X
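
The random crop and padding step in `prepare_data` can be tried in isolation. The helper below is a hypothetical variant written for illustration: `crop_or_pad` is not part of the notebook, it uses NumPy's `default_rng` instead of `np.random.randint`, and unlike the original it also allows the last possible offset.

```python
import numpy as np

def crop_or_pad(data, input_length, rng):
    """Return exactly input_length samples via a random crop or random zero-padding."""
    if len(data) > input_length:
        offset = rng.integers(len(data) - input_length + 1)
        return data[offset:offset + input_length]
    missing = input_length - len(data)
    offset = rng.integers(missing + 1)     # always 0 when nothing is missing
    return np.pad(data, (offset, missing - offset), "constant")

rng = np.random.default_rng(0)             # seeded only to make the demo repeatable
long_clip = np.arange(10, dtype=float)
short_clip = np.ones(3)

cropped = crop_or_pad(long_clip, 4, rng)
padded = crop_or_pad(short_clip, 6, rng)
print(cropped, padded)
```

Whatever the length of the input clip, the result always has `input_length` samples, which is what allows all MFCC matrices to share one shape.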

In [None]:
X_train = prepare_data(train, 'drive/MyDrive/datasets/FSDKaggle2018.audio_train/')
X_test = prepare_data(test, 'drive/MyDrive/datasets/FSDKaggle2018.audio_test/')
# to_categorical (imported from keras.utils) converts integer class indices into one-hot vectors
y_train = to_categorical(train.label_idx, num_classes=n_classes)

# normalize the data using the mean and standard deviation of the training set
mean = np.mean(X_train, axis=0)
std = np.std(X_train, axis=0)
X_train = (X_train - mean)/std
X_test = (X_test - mean)/std
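
The point of this normalization is that the test data is scaled with the statistics of the training set, so no information from the test set leaks into preprocessing. The sketch below shows the same pattern on random toy data. The `toy_` names and the small constant `eps`, which guards against a zero standard deviation, are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy features: 100 training and 20 test samples with 5 values each.
toy_train = rng.normal(loc=3.0, scale=2.0, size=(100, 5))
toy_test = rng.normal(loc=3.0, scale=2.0, size=(20, 5))

mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)
eps = 1e-8   # assumption: small constant guarding against zero variance

toy_train_n = (toy_train - mean) / (std + eps)
toy_test_n = (toy_test - mean) / (std + eps)    # same statistics as for training

print(toy_train_n.mean(axis=0).round(6), toy_train_n.std(axis=0).round(6))
```

After scaling, every feature of the training set has approximately zero mean and unit standard deviation, while the test set is only close to that.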

Now we build our network model.
The model starts with a 2D convolutional layer, followed by batch normalization, a ReLU activation, and a max-pooling layer. The batch normalization layer is thus placed before the activation function.
This block is repeated three more times. The model then ends with a fully connected layer of size 64, again with batch normalization and ReLU, and a softmax layer that produces a probability distribution over the 41 classes.
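
The effect of the four blocks on the feature map size can be worked out by hand. The sketch below assumes an input of 40 MFCCs by 173 frames (two seconds at 44100 Hz with a hop length of 512), 32 filters per convolution, and 2x2 max pooling, which is the Keras default. The helper `pooled_size` is a hypothetical name used only here.

```python
def pooled_size(size, n_blocks, pool=2):
    """Size along one axis after n_blocks of 'same' convolution plus max pooling."""
    for _ in range(n_blocks):
        # A 'same' convolution keeps the size; pooling halves it, rounding down.
        size = size // pool
    return size

# Assumed input: 40 MFCCs x 173 frames, 32 filters in the last convolution.
n_mfcc, n_frames, n_filters = 40, 173, 32

h = pooled_size(n_mfcc, 4)
w = pooled_size(n_frames, 4)
flat = h * w * n_filters
print(h, w, flat)   # prints: 2 10 640
```

Under these assumptions, the flattened vector that feeds the 64-unit dense layer has 640 entries.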

We use categorical cross-entropy as the loss function and train the network with the Adam optimizer.
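
For intuition, the softmax output and the cross-entropy loss can be reproduced in a few lines of NumPy. This is a simplified sketch, not the Keras implementation: the function names are chosen for illustration, the example uses three classes instead of 41, and the numbers are invented.

```python
import numpy as np

def softmax_np(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def crossentropy_np(y_true, y_prob, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

# Toy example with 3 classes; the numbers are invented.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])

probs = softmax_np(logits)
loss = crossentropy_np(y_true, probs)
print(probs.sum(axis=1), loss)
```

The loss only looks at the probability assigned to the correct class, so the confident first prediction contributes little, while the uniform second prediction contributes log(3).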

In [12]:
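def getModel():
    # Input shape (n_mfcc, frames, 1), computed as in prepare_data;
    # the variable dim is local to that function and not visible here.
    inp = Input(shape=(n_mfcc, 1 + int(np.floor(audio_length / 512)), 1))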
    x = Convolution2D(32, (4,10), padding="same")(inp)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = MaxPool2D()(x)
    
    x = Convolution2D(32, (4,10), padding="same")(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = MaxPool2D()(x)
    
    x = Convolution2D(32, (4,10), padding="same")(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = MaxPool2D()(x)
    
    x = Convolution2D(32, (4,10), padding="same")(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = MaxPool2D()(x)

    x = Flatten()(x)
    x = Dense(64)(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    out = Dense(n_classes, activation=softmax)(x)

    model = models.Model(inputs=inp, outputs=out)
    opt = optimizers.Adam(learning_rate)

    model.compile(optimizer=opt, loss=losses.categorical_crossentropy, metrics=['acc'])
    return model

To prepare the output, we convert the raw labels to integer indices and set the file name as the index of the train and test data:

In [None]:
# create a dictionary that maps each label to an integer index
LABELS = list(train.label.unique())
label_idx = {label: i for i, label in enumerate(LABELS)}

# use the file name as the index of train and test
train.set_index("fname", inplace=True)
test.set_index("fname", inplace=True)

# add a column with the integer class index of each training sample
train["label_idx"] = train.label.apply(lambda x: label_idx[x])
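
The following sketch walks through the same mapping on a toy label list and then shows a NumPy equivalent of the one-hot encoding that `to_categorical` produces. The `toy_` names and the label values are chosen for illustration, and `np.eye` stands in for the Keras function so that the example runs without Keras.

```python
import numpy as np

# Toy label column standing in for train.label.
toy_labels = ["Trumpet", "Fireworks", "Trumpet", "Hi-hat"]

# Unique labels in order of first appearance, as pandas' unique() returns them.
toy_LABELS = list(dict.fromkeys(toy_labels))
toy_label_idx = {label: i for i, label in enumerate(toy_LABELS)}
idx = np.array([toy_label_idx[label] for label in toy_labels])

# One row per sample, with a single 1 in the column of its class.
one_hot = np.eye(len(toy_LABELS))[idx]
print(toy_LABELS, idx, one_hot, sep="\n")
```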


Finally, we train the network using k-fold cross-validation with two folds (scikit-learn's `KFold`):

In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=2)
for i, (train_index, test_index) in enumerate(kf.split(X_train)):
  X_t, X_te = X_train[train_index], X_train[test_index]
  y_t, y_te = y_train[train_index], y_train[test_index]
  print("#"*50)
  print("Fold: ", i)
  model = getModel()
  checkpoint = ModelCheckpoint('best_%d.h5'%i, monitor='val_loss', verbose=1, save_best_only=True)
  early = EarlyStopping(monitor="val_loss", mode="min", patience=5)
  tb = TensorBoard(log_dir='./logs/' + PREDICTION_FOLDER + '/fold_%i'%i, write_graph=True)
  callbacks_list = [checkpoint, early, tb]
  history = model.fit(X_t, y_t, validation_data=(X_te, y_te), callbacks=callbacks_list, 
                        batch_size=64, epochs=30)
  model.load_weights('best_%d.h5'%i)
  predictions = model.predict(X_train, batch_size=64, verbose=1)
  np.save(PREDICTION_FOLDER + "/train_predictions_%d.npy"%i, predictions)

  predictions = model.predict(X_test, batch_size=64, verbose=1)
  np.save(PREDICTION_FOLDER + "/test_predictions_%d.npy"%i, predictions)

  top_3 = np.array(LABELS)[np.argsort(-predictions, axis=1)[:, :3]]
  predicted_labels = [' '.join(list(x)) for x in top_3]
  test['label'] = predicted_labels
  test[['label']].to_csv(PREDICTION_FOLDER + "/predictions_%d.csv"%i)
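
The last lines of the loop turn the predicted probabilities into the three most likely labels per file. The sketch below repeats that step on a toy prediction matrix with four classes. The label subset and the probabilities are invented for illustration.

```python
import numpy as np

toy_LABELS = ["Trumpet", "Fireworks", "Hi-hat", "Cello"]   # toy subset, not all 41 classes
toy_predictions = np.array([[0.10, 0.60, 0.05, 0.25],
                            [0.40, 0.10, 0.30, 0.20]])

# Negating the scores makes argsort return indices from highest to lowest.
top_3_idx = np.argsort(-toy_predictions, axis=1)[:, :3]
top_3 = np.array(toy_LABELS)[top_3_idx]
predicted_labels = [" ".join(row) for row in top_3]
print(predicted_labels)   # ['Fireworks Cello Trumpet', 'Trumpet Hi-hat Cello']
```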

#hide
TODO: StratifiedKFold ?

## Results and Conclusion
Training for 30 epochs results in a promising training accuracy of around 90%. However, the network currently overfits strongly: the accuracy on the validation set barely reaches 40%.
Further investigation is needed to find ways of reducing this gap.

Future work includes exploring other ways to obtain information from sounds, for example predicting the distance from a sound source given a sound sample.

## References

[1] https://www.kaggle.com/fizzbuzz/beginner-s-guide-to-audio-data/data