<img src="https://raw.githubusercontent.com/pytorch/pytorch/master/docs/source/_static/img/pytorch-logo-dark.png" width="600px"/>

# **PyTorch** (2/2): Audio Classification

In this tutorial, we will train a neural network for audio classification using transfer learning. 

Some of the code in this example is based on [this](https://github.com/hasithsura/Environmental-Sound-Classification).

In [None]:
%matplotlib widget

from helpers.all import *

In [None]:
# select device to run on
use_cuda = torch.cuda.is_available()
device = torch.device('cuda' if use_cuda else 'cpu')
print(device)

# Preparing the Dataset

We will use the [ESC-50](https://github.com/karolpiczak/ESC-50) (Dataset for Environmental Sound Classification) in our experiment. If you cloned this repository with the `--recurse-submodules` flag, the dataset is already downloaded and should be present at:

In [None]:
print(DATASET_ROOT)

The dataset includes a `.csv` file that contains information about the samples in the dataset (e.g. which category it belongs to, etc.). We use `pandas` to read the `.csv` into a `DataFrame`.

In [None]:
df = pd.read_csv(DATASET_ROOT/'meta/esc50.csv')

Let's take a look at the first few lines...

In [None]:
df.head()

Fortunately, the dataset documentation states:

> The dataset has been prearranged into 5 folds for comparable cross-validation, making sure that fragments from the same original source file are contained in a single fold.

This means that we can just pick one of the folds to use as our validation set. We create a new column `is_valid` that is true for fold 5.

In [None]:
df['is_valid'] = df['fold'] == 5

In [None]:
df
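
To see why a fold-based split is convenient, here is a small dependency-free sketch. The rows below are made up and only mimic the layout of `esc50.csv`; the point is that holding out one whole fold never places clips from the same source recording on both sides of the split.

```python
# Toy metadata rows: made-up values that only mimic the layout of esc50.csv.
rows = [
    {"filename": "a-1.wav", "fold": 1, "src_file": "a"},
    {"filename": "a-2.wav", "fold": 1, "src_file": "a"},
    {"filename": "b-1.wav", "fold": 5, "src_file": "b"},
    {"filename": "b-2.wav", "fold": 5, "src_file": "b"},
    {"filename": "c-1.wav", "fold": 3, "src_file": "c"},
]

VALID_FOLD = 5

# Same rule as the `is_valid` column above: fold 5 is held out for validation.
for row in rows:
    row["is_valid"] = row["fold"] == VALID_FOLD

train_sources = {r["src_file"] for r in rows if not r["is_valid"]}
valid_sources = {r["src_file"] for r in rows if r["is_valid"]}

# No source recording should appear on both sides of the split.
leaked = train_sources & valid_sources
print(f"train={sorted(train_sources)} valid={sorted(valid_sources)} leaked={sorted(leaked)}")
```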

# Training Phase

Instead of training a model entirely from scratch, we will use the concept of [_transfer learning_](https://en.wikipedia.org/wiki/Transfer_learning). This means we start with a pretrained model from a different but related problem and _fine tune_ it to our specific problem.

Many of the breakthrough achievements in deep learning come from the domain of computer vision (e.g. image classification). By transforming audio into images (spectrograms), we can use those very same networks to perform audio classification.

## Building the Input Transformation Pipeline

A fast.ai [`DataBlock`](https://docs.fast.ai/data.block.html#DataBlock) is a convenient way of organizing our data into a form that can be used during the training phase. 

> By itself, a DataBlock is just a blue print on how to assemble your data. It does not do anything until you pass it a source. You can choose to then convert that source into a Datasets or a DataLoaders by using the DataBlock.datasets or DataBlock.dataloaders method.

If you want to know more about DataBlocks, have a look at this [tutorial](https://docs.fast.ai/tutorial.datablock.html).

In [None]:
audio2spec = AudioToSpec.from_cfg(AudioConfig.BasicMelSpectrogram(sample_rate=16000, n_fft=2048, hop_length=512, n_mels=128))
normalize = AudioNormalize()

In [None]:
block = DataBlock(
    blocks=(AudioBlock, CategoryBlock),
    splitter = ColSplitter(),
    get_x = ColReader('filename', pref=DATASET_ROOT/'audio'),
    get_y = ColReader('category'),
    item_tfms=[normalize],
    batch_tfms=[audio2spec],
)

In [None]:
dls = block.dataloaders(df, num_workers=0) # on windows, num_workers needs to be set to 0
# dls.show_batch(figsize=(7,7))

## Create/Prepare Model

Now, we create a `learner` object by passing it our DataLoaders object and using a pretrained ResNet18 as our architecture.

If you run this for the first time, it will download the pretrained ResNet18 parameters.

In [None]:
learn = cnn_learner(dls, models.resnet18, metrics=[accuracy], pretrained=True)

$$
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}
$$
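
As a quick illustration of the formula, accuracy can be computed by hand in plain Python. This is a minimal sketch with hypothetical labels, not the metric implementation used by the learner:

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty sequence")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels, chosen only to illustrate the formula.
predicted = ["dog", "siren", "rain", "dog"]
actual = ["dog", "siren", "wind", "cat"]
print(accuracy(predicted, actual))  # 2 correct out of 4 -> 0.5
```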

The ResNet models work on RGB images, i.e. the input layer expects an image with three channels. 

In [None]:
learn.model[0][0].in_channels

Our spectrograms only have a single channel, so we need to modify the first layer of the model slightly to take our spectrograms as input.

In [None]:
make_xresnet_grayscale(learn.model)

If CUDA is available, we now transfer our model to the GPU. (If you do not have CUDA available, this cell will do nothing).

In [None]:
learn.model.to(device);

## Training the Model

... and finally we call `fine_tune` to train the model for 5 epochs.

In [None]:
learn.fine_tune(5)

## Interpreting the Training Results

After 5 epochs of training, you should see an accuracy of ~65 %. This may not seem very impressive at first but keep in mind that we just trained the model for a couple of minutes. Also, we are working with 50 output classes! 

If you have a look at the [original experiment](https://github.com/hasithsura/Environmental-Sound-Classification), you can see that with some further training it is easy to achieve an accuracy of over 85 %.

To get a better understanding of how well our model is doing, we can create a `ClassificationInterpretation` object from our learner and plot the confusion matrix.

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(10, 10))

Or we can take a look at which pairs of categories the model has the most trouble with:

In [None]:
interp.most_confused(min_val=3)
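
The idea behind such a "most confused" listing can be sketched in a few lines of plain Python: count every (actual, predicted) pair that differs and keep the pairs that occur at least `min_val` times. The function and the labels below are illustrative assumptions and do not reproduce fast.ai's implementation.

```python
from collections import Counter

def most_confused(actual, predicted, min_val=1):
    """Return (actual, predicted, count) for misclassified pairs, most frequent first."""
    pairs = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return [(a, p, n) for (a, p), n in pairs.most_common() if n >= min_val]

# Hypothetical validation results.
actual = ["rain", "rain", "rain", "wind", "wind", "dog"]
predicted = ["wind", "wind", "rain", "rain", "wind", "dog"]
print(most_confused(actual, predicted, min_val=2))
```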

# Saving the Trained Model

For running inference in our C++ application, we need to do two things:

## Export the Model as TorchScript

[TorchScript](https://pytorch.org/tutorials/beginner/Intro_to_TorchScript_tutorial.html#) is a different representation of a PyTorch model that can be loaded and run in C++. One way of transforming our trained PyTorch model into a TorchScript module is called _tracing_.

Tracing a model will invoke it with some _example input_ and record all the operations that occur during its execution. These recorded operations will then be saved into a static representation of our model (called a _graph_).

In our case, all we need to do is call `torch.jit.trace()`, passing our model and some dummy input as arguments. As dummy input, we will use a random tensor that has the same shape as our spectrograms during training (`[1, 128, 157]`, i.e. `[n_channels, n_mels, n_frames]`), with an additional leading batch dimension of size 1:

In [None]:
# move model back to CPU memory if we were training on GPU
if use_cuda:
    learn.model.to('cpu')
    
# create some dummy input 
dummy_input = torch.randn([1, 1, 128, 157])
    
# process the trace
traced_script_module = torch.jit.trace(learn.model, dummy_input)
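
The 157 time frames in the dummy input follow from the spectrogram settings used earlier. The arithmetic below is a sketch that assumes centered STFT framing, where the number of frames is `1 + n_samples // hop_length`; other framing conventions would give a slightly different count.

```python
SAMPLE_RATE = 16_000   # Hz, as configured for the spectrogram above
HOP_LENGTH = 512       # samples between successive frames
N_MELS = 128           # mel bands
CLIP_SECONDS = 5       # clip length in seconds

n_samples = SAMPLE_RATE * CLIP_SECONDS

# Assumption: centered STFT framing, which yields 1 + n_samples // hop_length frames.
n_frames = 1 + n_samples // HOP_LENGTH

# Batch dimension first, then channel, mel bands, and time frames.
dummy_shape = [1, 1, N_MELS, n_frames]
print(dummy_shape)  # [1, 1, 128, 157]
```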

In [None]:
# create output dir if it doesn't exist
if not MODEL_DIR.exists():
    MODEL_DIR.mkdir()

In [None]:
torch.jit.save(traced_script_module, str(MODEL_DIR/'esc-model.pt'))

## Save the Vocabulary

Our model does not directly output a string with the predicted category. Rather, it outputs a vector containing an unnormalized score for each of the categories in our vocabulary. 

In [None]:
learn.model(torch.randn(1, 1, 128, 157))

These are ordered in the same way as our vocabulary, so converting the model output to the string of the predicted category is straightforward: we just need to find the index of the item with the highest score (using `argmax()`) and use this as an index into our vocabulary vector.
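
In plain Python, this lookup can be sketched as follows. The three-class vocabulary and the scores are hypothetical stand-ins for `dls.vocab` and the model output:

```python
# Hypothetical three-class vocabulary and unnormalized scores for one input.
vocab = ["airplane", "dog", "siren"]
scores = [-1.2, 3.4, 0.7]

# argmax: position of the largest score.
best_index = max(range(len(scores)), key=scores.__getitem__)

print(vocab[best_index])  # dog
```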

To be able to perform this mapping directly in our C++ application, we write our vocabulary into a header file that will automatically be placed in the C++ source directory:

In [None]:
dls.vocab

In [None]:
write_vocab_cpp_header(dls.vocab)
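
To give an idea of what such a generated header might look like, here is a minimal sketch. The function `render_vocab_header`, the variable name `ESC_VOCAB`, and the header layout are all assumptions made for illustration; the actual output of `write_vocab_cpp_header` may differ.

```python
def render_vocab_header(vocab, var_name="ESC_VOCAB"):
    """Render a list of category names as a C++ header defining a vector of strings.

    The variable name and layout are illustrative assumptions, not the output
    of the notebook's own `write_vocab_cpp_header` helper.
    """
    # Assumes category names contain no quotes or backslashes that need escaping.
    entries = ",\n".join(f'    "{name}"' for name in vocab)
    return (
        "#pragma once\n"
        "#include <string>\n"
        "#include <vector>\n\n"
        f"static const std::vector<std::string> {var_name} = {{\n"
        f"{entries}\n"
        "};\n"
    )

header = render_vocab_header(["airplane", "dog", "siren"])
print(header)
```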

For good measure, we will also export the `vocab` list as a [pickle](https://docs.python.org/3/library/pickle.html) file, so we can easily reuse it in some other Python script if we need to.

In [None]:
with open(TRAIN_DIR/'esc-model-vocab.pkl', 'wb') as f:
    pickle.dump(dls.vocab, f)
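
Reading the vocabulary back works the same way in reverse. The sketch below uses a plain list and a temporary file (both stand-ins chosen for illustration) to show the round trip:

```python
import pickle
import tempfile
from pathlib import Path

vocab = ["airplane", "dog", "siren"]  # stand-in for dls.vocab

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = Path(tmp_dir) / "vocab.pkl"  # hypothetical file name

    with open(vocab_path, "wb") as f:
        pickle.dump(vocab, f)

    # Only unpickle files you created yourself: loading can execute arbitrary code.
    with open(vocab_path, "rb") as f:
        restored = pickle.load(f)

print(restored == vocab)  # True
```

Note that `dls.vocab` is likely a fast.ai object rather than a plain list, so a script that loads the pickle would presumably need fast.ai installed as well; converting the vocabulary with `list(...)` before dumping would avoid that dependency.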

# Inference Phase

## Setting up a Test Dataset

The `DataBlock` we used earlier automatically takes care of splitting the dataset into training and validation sets (via `ColSplitter`, which reads our `is_valid` column). For testing purposes, however, we cannot use samples from either of these. This leaves us with two options:

1. splitting off some items of the dataset for testing before creating the `DataLoader` for training or
2. using new samples, e.g. from a different dataset.

Since we are already done with training, we will go for the latter option. I have manually selected some sound examples from [freesound.org](https://freesound.org/):

* [an airplane](https://freesound.org/people/AurelioSons/sounds/207457/),
* [a guy sneezing](https://freesound.org/people/InspectorJ/sounds/368804/),
* [a toilet flush](https://freesound.org/people/InspectorJ/sounds/404329/),
* [a dog bark](https://freesound.org/people/Juan_Merie_Venter/sounds/327666/),
* [some sheep](https://freesound.org/people/zachrau/sounds/362283/),
* [a child laughing](https://freesound.org/people/Teumova/sounds/439667/),
* [some mouse clicking](https://freesound.org/people/Masgame/sounds/347544/),
* [a siren](https://freesound.org/people/Kingrock2009/sounds/544376/),
* [a coffee machine](https://freesound.org/people/Acekat13/sounds/515685/),
* [a harp](https://freesound.org/people/pryght%20one/sounds/27130/) and
* [a synth sound](https://freesound.org/people/Erokia/sounds/550708/).

Except for the last three examples, all samples belong to categories known to our model.

To begin with, we need to manually label our test data, i.e. we create a dictionary that maps filenames to their category.

In [None]:
test_files_to_label = {
    '404329__inspectorj__toilet-flush-european-distant-lid-up.wav': 'toilet_flush',
    '327666__juan-merie-venter__dog-bark.wav': 'dog',
    '368804__inspectorj__sneeze-single-b.wav': 'sneezing',
    '347544__masgame__mouse-click-sounds.wav': 'mouse_click',
    '207457__aureliosons__avion-de-elices.wav': 'airplane',
    '27130__pryght-one__harp.wav': 'harp (unknown category)',
    '439667__teumova__child-laughing.wav': 'laughing',
    '515685__acekat13__adriana-lopez-coffee-machine.wav': 'coffee_machine (unknown category)',
    '544376__kingrock2009__siren-1.wav': 'siren',
    '362283__zachrau__sheep-bleating.wav': 'sheep',
    '550708__erokia__msfxp9-14-synth-loop-100-bpm.wav': 'synth_loop (unknown category)'
}

Our model expects spectrograms, not the raw waveforms. During training, this transformation was automatically handled by the `DataLoader` class. 

We don't need all the functionality offered by the `DataLoader`; we just want to transform all test samples. For this, we can use the `Pipeline` class. As the name suggests, this is just a pipeline of `Transform`s.

Because we are not dealing with a homogeneous dataset as during training, we need to add a couple of transforms to make sure that our test samples have the same properties as those from the training dataset:
* `Resample`: resample to 16 kHz
* `DownmixMono`: downmix stereo signals to mono
* `ResizeSignal`: resize to exactly 5 s in length (padding or clipping if needed)
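
To make the effect of two of these transforms concrete, here is a simplified sketch on plain Python lists. The helper functions are hypothetical and only illustrate the idea; the library transforms operate on audio tensors and may behave differently in detail (for example in how they crop). Resampling is left out because it requires proper filtering.

```python
def downmix_mono(channels):
    """Average equally long channels sample by sample into one mono channel."""
    return [sum(samples) / len(channels) for samples in zip(*channels)]

def resize_signal(signal, n_samples):
    """Zero-pad at the end or truncate so the signal has exactly n_samples."""
    if len(signal) >= n_samples:
        return signal[:n_samples]
    return signal + [0.0] * (n_samples - len(signal))

# Hypothetical stereo clip with four samples per channel.
stereo = [[1.0, 0.0, -1.0, 0.5], [0.0, 0.0, 1.0, 0.5]]
mono = downmix_mono(stereo)
print(mono)                    # [0.5, 0.0, 0.0, 0.5]
print(resize_signal(mono, 6))  # [0.5, 0.0, 0.0, 0.5, 0.0, 0.0]
print(resize_signal(mono, 2))  # [0.5, 0.0]
```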

Let's define our transform pipeline, using the `normalize` and `audio2spec` transforms we defined earlier:

In [None]:
resize_to_5s = ResizeSignal(5000, pad_mode=AudioPadType.Zeros_After)
resample_to_16khz = Resample(16000)
downmix = DownmixMono()

transforms = Pipeline([AudioTensor.create, resize_to_5s, downmix, resample_to_16khz, normalize, audio2spec])

We build a new dictionary mapping the label of each test sample to the output of the `transforms` pipeline (i.e. our spectrograms):

In [None]:
label_to_spectrogram = {label: transforms(TEST_DIR/'data'/filename) for filename, label in test_files_to_label.items()}

We save the spectrograms into a TorchScript module to be able to load it from the C++ application.

To have access to the labels from within the C++ application, we register each of our spectrograms as a _named buffer_, using the corresponding label as name.

In [None]:
# save test samples as named buffers in a module
container = torch.nn.Module()
for l, s in label_to_spectrogram.items():
    container.register_buffer(l, torch.tensor(s))
    
# save to torch script module
torch.jit.save(torch.jit.script(container), str(TEST_DIR/'inputs.pt'))

## Building the C++ application

For demo purposes, we will be running the inference part in C++.

In IPython (which is running in the backend kernel of this notebook), we can run any command-line command by prefixing it with a `!`. 

This way we can just build and call the C++ application from right here within the notebook! We can also reuse any of our currently defined variables by wrapping them in curly braces, e.g. `{variable}`.
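
Outside of a notebook, roughly the same thing can be done with the standard `subprocess` module. The sketch below is only an approximation of what the `!` syntax does; the `message` variable is a made-up example of a Python value passed to a command:

```python
import subprocess
import sys

message = "hello from a subprocess"  # plays the role of a notebook variable

# Roughly what `!python -c "print('{message}')"` would do in a notebook cell.
result = subprocess.run(
    [sys.executable, "-c", f"print({message!r})"],
    capture_output=True,
    text=True,
    check=True,  # raise CalledProcessError on a non-zero exit code
)
print(result.stdout.strip())
```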

### Checking for `cmake`

Make sure you have `cmake` installed and its executable can be found. If everything is set up correctly, you should see the version output from `cmake` when executing the following cell.

In [None]:
!cmake --version

### Generating Build System

After making sure that our build directory exists, we call `cmake` to generate our build system.

In [None]:
# create build directory if it does not exist
if not CPP_BUILD_DIR.exists():
    CPP_BUILD_DIR.mkdir()

In [None]:
CPP_SOURCE_DIR

In [None]:
!cmake -S {CPP_SOURCE_DIR} -B {CPP_BUILD_DIR} -DCMAKE_PREFIX_PATH={torch.utils.cmake_prefix_path}

# if you're running cmake < 3.14, comment out the previous command and run the following instead
# !pushd {CPP_BUILD_DIR} && cmake .. -DCMAKE_PREFIX_PATH={torch.utils.cmake_prefix_path} && popd

### Building the Application

Next, we can trigger the actual build.

**WARNING**: note that you cannot just switch the build config to `Debug`, as the `libtorch` libraries for `Debug`/`Release` are not ABI-compatible. 


In [None]:
!cmake --build {CPP_BUILD_DIR} --config Release

## Running Inference

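After building the application successfully, we can call it to run inference using our trained model. Calling it without arguments first, and then with the paths to the model, the inputs and the outputs: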

In [None]:
esc_executable = CPP_BUILD_DIR/'esc-app'

if platform.system() == 'Windows':
    esc_executable = ((esc_executable.parent/'Release')/esc_executable.name).with_suffix('.exe')

In [None]:
!{esc_executable}

In [None]:
model_path = MODEL_DIR/'esc-model.pt'
inputs_path = TEST_DIR/'inputs.pt'
outputs_path = TEST_DIR/'outputs.pt'

In [None]:
!{esc_executable} {model_path} {inputs_path} {outputs_path}

# Using `ipywidgets` to interactively explore our results

In [None]:
# the outputs file is a TorchScript archive, so we load it with torch.jit.load
outputs = next(torch.jit.load(str(outputs_path)).parameters())

In [None]:
outputs[0]

As the output contains unnormalized scores we use `softmax` to convert them into probabilities.

In [None]:
outputs_prob = torch.softmax(outputs.detach(), dim=1)
outputs_prob[0]

In [None]:
outputs_prob[0].sum()
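
For reference, softmax itself is a short computation. The following plain-Python sketch (with hypothetical scores) shows why the resulting values are positive and sum to 1:

```python
import math

def softmax(scores):
    """Map unnormalized scores to probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in exp() without changing the result.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores
print(probs)
print(sum(probs))
```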

Using `ipywidgets`, we can create GUI widgets to interactively explore our results. At first, we define a dropdown menu that contains the labels from our test samples:

In [None]:
test_labels = list(test_files_to_label.values())

In [None]:
dropdown = widgets.Dropdown(options = test_labels, description='Test sample: ')
dropdown

Depending on our selection in this dropdown, we want to print the output probabilities from our model. We can do this by using the `Output` widget. It acts as a context manager, capturing all output that is produced during the context.
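
If capturing output through a context manager is unfamiliar, the standard library offers a close analogy. This sketch uses `contextlib.redirect_stdout` and is not how `ipywidgets` implements its widget; it only illustrates the pattern:

```python
import io
from contextlib import redirect_stdout

buffer = io.StringIO()

# Everything printed inside the `with` block goes into the buffer,
# not to the regular standard output.
with redirect_stdout(buffer):
    print("captured line")

print(repr(buffer.getvalue()))
```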

In [None]:
output = widgets.Output()
output

Right now, however, this is just an empty placeholder. 

We need to create an event handler that calls `print_top_results` with the active index from the dropdown menu, and register it as an observer of the `index` trait on the dropdown menu object.

In [None]:
def on_dropdown(_):
    output.clear_output()
    with output:
        print_top_results(dropdown.index, outputs_prob, test_labels, dls.vocab)
    
dropdown.observe(on_dropdown, names=['index'])

**Select some items from the Dropdown menu!**

We can also combine these widgets into Layouts:

In [None]:
output_plot = widgets.Output()
def on_dropdown(_):
    output_plot.clear_output()
    with output_plot:
        fig, _ = plot_top_results(outputs_prob[dropdown.index], dls.vocab, show_max=10, figsize=(5,3))
        fig.canvas.toolbar_visible = False
        fig.tight_layout()
dropdown.observe(on_dropdown, names=['index'])

widgets.VBox([dropdown, output_plot])

# End-to-End Example

Using our trained model and `ipywidgets`, we can even build a small application right here from inside the notebook (inspired by an example from the [fast.ai course](https://course.fast.ai/videos/?lesson=3)).

Select 

Note: selecting an audio file longer than 5 s and classifying it multiple times might yield different results due to a random cropping in the `ResizeSignal` transform.

In [None]:
# the model was saved with torch.jit.save, so we load it with torch.jit.load
model = torch.jit.load(str(MODEL_DIR/'esc-model.pt'))

In [None]:
btn_upload = widgets.FileUpload(accept='.wav')
btn_classify = widgets.Button(description='Classify', layout=widgets.Layout(width='300px'))
audio_player = widgets.Audio(autoplay=False, loop=False, layout=widgets.Layout(width='300px'))
label_upload = widgets.Label(value='No file selected')
label_class = widgets.Label(layout=widgets.Layout(width='300px'))
output = widgets.Output()

def on_click_upload(_):
    audio_player.value = btn_upload.data[-1]
    label_upload.value = btn_upload.metadata[-1]['name']
    btn_classify.description = 'Classify'
    
def classify(_):
    tmp = tempfile.NamedTemporaryFile(delete=False)
    tmp.file.write(btn_upload.data[-1]); tmp.close()

    # load tensor from tmp file and apply transforms
    spec = torch.tensor(transforms(tmp.name))
    probs = model(spec.unsqueeze(0)).softmax(dim=1).squeeze().detach()
    result_idx = probs.argmax()
    output.clear_output()
    with output:
        fig, _ = plot_top_results(probs, dls.vocab, show_max=10, figsize=(5, 3))
        fig.canvas.header_visible = False
        fig.canvas.toolbar_visible = False
        fig.tight_layout()

    tmp.close(); os.unlink(tmp.name)

def reset_button(_):
    btn_upload.value.clear()
    btn_upload._counter = 1

btn_upload.observe(on_click_upload, names=['data'])
btn_classify.on_click(classify)
btn_upload.observe(reset_button, names=['value'])

widgets.VBox([widgets.HBox([btn_upload, label_upload]), audio_player, btn_classify, output])

# Congratulations!

You have successfully 

* trained a neural network to classify audio examples into 50 categories
* built a C++ application to load and run the trained model
* created small interactive applications to analyze the results