# Predicting from Sound
> A Neural Networks project by Aleksander Nikolajev, Kayahan Kaya and Severin Brunner

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]
- sticky_rank: 1

## Introduction
Gaining information from sounds is a fundamental human ability: we can detect and identify objects by hearing alone, and we can estimate their direction and distance.
Writing software with the same abilities is a difficult task due to the enormous complexity of audio signals. Applying machine learning, in particular neural networks, is the most promising approach to meet this challenge. 
In this project, we are researching common methods to deploy neural networks for prediction from sound and are creating our own neural network that is able to extract certain information from audio samples. In particular, we are trying to predict the source of a sound as well as the distance between source and microphone.


## Audio theory
In this section, essential theoretical elements of audio analysis are introduced, which we later utilize for our network.

### Pulse-code modulation (PCM)
In order to store an analog audio signal in memory, it has to be digitized by applying sampling and quantization. Sampling refers to measuring the signal values at specific timesteps, which transforms the original time-continuous signal into a time-discrete one. Quantization means mapping the continuous signal values to a finite set of discrete values, e.g. the 65,536 levels that can be represented with 16 bits.

![](https://upload.wikimedia.org/wikipedia/commons/b/bf/Pcm.svg "Sampling and quantization of an analog signal (red) with 4-bit PCM, resulting in a time-discrete and value-discrete signal (blue).")

PCM is a format for storing uncompressed audio signals. It simply contains an array of values that have been produced by sampling and quantizing an analog signal. It has two basic properties: the sampling rate (how many samples per second were taken) and the bit depth (the number of bits per sample value), which determines the resolution. A typical sampling rate is 44.1 kHz (used for CDs, for example), and 16 bits is a common choice for the bit depth.
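The two steps can be illustrated with a minimal NumPy sketch. The function name `pcm_encode` and its parameters are hypothetical and chosen for illustration only; the sketch assumes the analog signal is available as a function of time with values in the range [-1, 1].

```python
import numpy as np

def pcm_encode(signal_fn, duration_s, sample_rate=44100, bit_depth=16):
    """Sample a time-continuous signal and quantize it to signed integers.

    Illustrative helper (not part of any library); supports bit depths up to 16.
    """
    # Sampling: evaluate the signal at evenly spaced time steps.
    n_samples = int(round(duration_s * sample_rate))
    t = np.arange(n_samples) / sample_rate
    samples = np.clip(signal_fn(t), -1.0, 1.0)

    # Quantization: map [-1, 1] onto the integer levels of the given bit depth.
    max_level = 2 ** (bit_depth - 1) - 1
    return np.round(samples * max_level).astype(np.int16)

def sine_440(t):
    """A 440 Hz sine tone as a stand-in for an analog signal."""
    return np.sin(2 * np.pi * 440 * t)

pcm = pcm_encode(sine_440, duration_s=0.01)                      # 16-bit, 44.1 kHz
pcm_4bit = pcm_encode(sine_440, duration_s=0.01, bit_depth=4)    # coarse, as in the figure
```

With a bit depth of 4, only the integer levels from -7 to 7 are available, which makes the quantization steps clearly visible.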



### Spectrograms

A spectrogram is a visualization of the frequency spectrum of a signal over time. The frequency spectrum represents the signal strength of the various frequencies present in the signal. It can be calculated by splitting the signal into short, overlapping frames and applying a Fourier transform to each frame.
The spectrogram is depicted as a heat map, which means the intensity at a specific frequency and time is expressed through the color.
![](clarinette_spectogram.png "Spectrogram of a recording of a clarinet playing a note. The lowest line is at the fundamental frequency of the note, the lines above it are the harmonics. The clarinet starts playing at 0.4 seconds.")
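As a rough sketch of how such a spectrogram can be computed, the following NumPy code applies a short-time Fourier transform to a synthetic tone. The function name `spectrogram`, the frame length and the hop size are assumptions made for this example, not the settings used for the figure above.

```python
import numpy as np

def spectrogram(signal, sample_rate, frame_len=1024, hop=256):
    """Magnitude spectrogram in dB (illustrative helper, not a library function)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop

    # Cut the signal into overlapping frames and apply the window to each one.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])

    # Fourier transform of every frame; keep only the magnitude.
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    spec_db = 20 * np.log10(magnitudes + 1e-10)  # small offset avoids log(0)

    freqs = np.fft.rfftfreq(frame_len, d=1 / sample_rate)
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate
    # Transpose so that rows are frequencies and columns are time frames.
    return spec_db.T, freqs, times

# Synthetic test signal: one second of a 1 kHz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

spec, freqs, times = spectrogram(tone, sr)
peak_hz = freqs[spec.mean(axis=1).argmax()]  # frequency with the highest average level
```

The resulting array can be drawn as a heat map, for example with `matplotlib.pyplot.pcolormesh(times, freqs, spec)`.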

### MFCC (Mel-Frequency Cepstral Coefficients)
For audio analysis, it often makes sense to extract certain features from the raw audio signal, like the signal energy or the spectrogram. MFCCs are one such feature: they represent the frequency spectrum compactly with few values (e.g. 40) on the mel scale, which approximates human pitch perception more closely than a linear frequency scale. This has proven useful for applications like speech or song recognition.
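To give an impression of the mel scale behind the MFCCs, the sketch below converts between hertz and mel with one commonly used formula (the HTK-style variant; other variants exist). The function names and the choice of 40 bands up to 8 kHz are assumptions for illustration.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mel (HTK-style formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# 40 band centers spaced evenly on the mel scale between 0 Hz and 8 kHz.
# In hertz, the bands are narrow at low frequencies and wide at high ones.
centers_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 40))
```

The spacing shows why the representation is compact: high frequencies, where human hearing resolves pitch less finely, are summarized by fewer and wider bands.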

## The Dataset

In order to train and test the network, a dataset is needed. For the task of identifying objects, there are many datasets available, for example the one from the [2018 Kaggle Freesound competition](https://www.kaggle.com/c/freesound-audio-tagging/data), which contains sounds from 41 different categories such as trumpet or fireworks. We used this dataset for initially creating and testing our sound identification network.

However, for estimating the distance of the sound source, datasets are scarce. Therefore, we decided to create our own.


### Creating the dataset

The [FSD50K dataset](https://zenodo.org/record/4060432) contains sound events from 200 classes. In order to create our dataset for distance prediction, we played the samples from this dataset on a speaker and recorded them with a microphone placed at certain distances from the speaker, i.e. 1 meter, 2 meters and ?? TODO
This process introduced some background noise into the samples, which we attempted to reduce by means of preprocessing.

![](dataset_recording_draft.png "Draft of the recording process. A PC connected to a speaker plays the samples, while a laptop records it with a microphone from a certain distance d. The PC signals the laptop when it starts and stops playing over a socket connection, so the laptop can start and stop recording its samples accordingly.") 


### Preprocessing
Preprocessing the raw audio material before using it for training is an essential step to improve the accuracy of our network. Common approaches include normalizing the samples as well as reducing background noise. 

#### Normalization
One way to normalize audio signals as part of data preparation is to set the RMS (root mean square) of all audio signals to a fixed value {% fn 2 %}. The RMS of a signal is its effective value; its square is the average power of the signal. The intention behind this kind of normalization is to make the network more robust against differences in loudness, enabling the network to better distinguish a loud, remote signal from a quiet, close signal.

We set the RMS of each audio signal to one by calculating the RMS over the original signal and then dividing the signal values by the RMS. Employing this normalization technique slightly improved the accuracy of the network.
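The normalization described above can be sketched in a few lines of NumPy. The function name `rms_normalize` and the handling of silent clips are assumptions added for this example.

```python
import numpy as np

def rms_normalize(signal, target_rms=1.0, eps=1e-12):
    """Scale a signal so that its RMS equals target_rms (illustrative helper)."""
    # Convert to float first so that integer PCM data does not overflow when squared.
    signal = np.asarray(signal, dtype=np.float64)
    rms = np.sqrt(np.mean(signal ** 2))
    if rms < eps:
        # Assumption: a silent clip is returned unchanged to avoid dividing by zero.
        return signal
    return signal * (target_rms / rms)

# The same tone at two loudness levels ends up identical after normalization.
t = np.arange(8000) / 8000
quiet = 0.1 * np.sin(2 * np.pi * 440 * t)
loud = 0.8 * np.sin(2 * np.pi * 440 * t)

quiet_norm = rms_normalize(quiet)
loud_norm = rms_normalize(loud)
```

After this step, the overall loudness of a recording no longer carries information, so the network has to rely on other cues.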

#### Removing background noise


#hide
## The Dataset

As a dataset for testing and training, we are using the one provided for the 2018 Kaggle Freesound competition, which is downloadable [here](https://www.kaggle.com/c/freesound-audio-tagging/data).
It contains sounds from 41 different categories such as trumpet or fireworks, with 9473 training examples and 1600 test examples. However, the samples are not distributed uniformly over the categories, meaning there is more data for some categories than for others. Also, the number of manually verified samples varies from category to category. This might make the training more challenging.

![](kaggle_dataset_distribution.png "The dataset. As can be seen, only some of the samples have been manually verified (blue). The number of samples and the fraction of manually verified samples vary between the categories.")

## Creating the network


### Network input

A major design decision is the format in which the input audio signal is fed to the network.
The simplest way would be to employ the raw PCM data. However, using features extracted from these data as input instead can provide several advantages. First, the performance will likely be better if meaningful features are chosen. Second, training will be faster if the features represent the raw data more compactly, as the amount of data that the network has to process is smaller. This is especially the case for MFCCs, as these provide a very compact representation of the frequency spectrum. Another commonly used input format is a spectrogram created from the raw audio signal, which is then fed to the network as an image. This makes it possible to apply prevalent image processing techniques, like convolutional layers, in the neural network.



### Python implementation

We are using Keras as the deep learning library to construct our network.

For audio processing, Librosa is a suitable library. It provides several functions to extract features from audio data, e.g. for creating spectrograms, calculating the MFCCs and performing a Fourier transform.

#hide

Now, we build our network model. 
The model starts with a convolutional layer followed by ReLU activation and a max-pooling layer. Batch normalization is applied by inserting the corresponding layer before the activation function.
This structure is repeated three more times, then the model ends with a fully connected layer of size 64 with batch normalization. The final output is given by a softmax layer, which produces a probability distribution over the 41 classes.

We use cross-entropy as the loss function and train the network with the Adam optimizer.
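The architecture described above could be sketched in Keras roughly as follows. The input shape, the number of filters, the kernel and pooling sizes, the `Flatten` layer and the ReLU after the fully connected layer are assumptions, since the text does not specify them; `build_model` is a hypothetical helper name.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(40, 173, 1), n_classes=41):
    """Four conv blocks, a dense layer of size 64 and a softmax output.

    input_shape is an assumed (coefficients, frames, channels) layout.
    """
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for _ in range(4):
        # Filter count and kernel size are assumptions.
        x = layers.Conv2D(32, (3, 3), padding="same")(x)
        x = layers.BatchNormalization()(x)   # batch norm before the activation
        x = layers.Activation("relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)

    x = layers.Flatten()(x)
    x = layers.Dense(64)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)         # assumed activation
    outputs = layers.Dense(n_classes, activation="softmax")(x)

    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",     # expects one-hot encoded labels
        metrics=["accuracy"],
    )
    return model

model = build_model()
# Sanity check on a dummy input: the output is a distribution over 41 classes.
probs = model.predict(np.zeros((1, 40, 173, 1), dtype="float32"), verbose=0)
```

If the labels are stored as integer class indices instead of one-hot vectors, `sparse_categorical_crossentropy` would be the matching loss.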

## Results


## Conclusion and future work

### Source code
[include link here]()

## References

{{ 'https://www.kaggle.com/fizzbuzz/beginner-s-guide-to-audio-data/' | fndetail: 1}}

{{ 'https://arxiv.org/pdf/2003.04210.pdf' | fndetail: 2}}
