# QUT EcoAcoustics Recogniser Workshop 2022

## Overview of the Recogniser Building Workflow


### Dataset preparation 

Annotating audio

Balancing dataset 

Spectrogram segment generation 

### Training

Training/validation split 

Running training 

Error analysis 

### Inference
Running inference 

Collating results 

Inspecting and verifying results 

Labelling data for training 

### Setting up your environment

We are going to be using Python for everything, including inspecting our dataset, training the recogniser, and visualising spectrograms and results. 

If you are already reading this in your browser, you have Python and Jupyter installed. In Python (like many other languages) some core functions are available in the language itself, and some are part of packages that need to be imported before they can be used. Some of those packages are already available in the Python distribution, and some must be installed before they can be imported. 

In the cell below, we install all the packages we will need. 


In [None]:
import sys

!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install librosa
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install torch torchvision torchaudio


Now that we have installed the required packages, we will import them so that we can use them in our scripts.

In [None]:
import os
from os import listdir

import importlib

import math
import numpy as np
import pandas as pd
import librosa
import matplotlib.pyplot as plt
from scipy import signal
from PIL import Image as PILImage  # aliased so it is not shadowed by IPython.display.Image below

import configparser

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from IPython.display import Image

import utils
import Spectrogram 

## 1. Dataset Preparation


The recogniser is a binary classifier, meaning it assigns one of two labels (positive or negative) to segments of audio.  In a basic Convolutional Neural Network (CNN), these segments are fixed-size spectrogram images. In the training phase, these fixed-size segments are fed into the CNN along with their labels.  Therefore we need to have a collection of fixed-length labelled segments. 

Because animal vocalisations are not of a fixed duration, we need to deal with that somehow. There are several approaches to this: we can either resize the spectrogram image to the predetermined fixed size, distorting the image, or we can crop out a section of audio, which may discard some of the vocalisation or may add some padding audio around the vocalisation. For the recogniser we are developing today we have chosen the second method, cropping. 

The labelled data you have provided for this workshop is in one of two formats:
- Pre-segmented
- Annotated

The annotated files have sections of audio delimited with start and end offsets, indicating where to crop from. For the pre-segmented files, we assume we can crop from anywhere. 

In the recogniser scripts used in this workshop, this is implemented as follows:
1. A preprocessing script reads all the annotation files for your audio segments and creates a single list of all annotations.
    - For pre-segmented files that don't have an associated annotation file, the entire file is treated as one annotation.
2. One at a time, each annotation is selected for training.
3. A random fixed-length segment is cropped out of the audio matching the current annotation and fed into the CNN.


Short annotations (shorter than the fixed-length segment we are cropping) are cropped so that the entire annotation fits in the crop, with extra audio included outside the annotation to achieve the fixed duration. This is illustrated below. The blue area is the annotation and the green area shows the possible time interval we can crop from. 


<img alt='short annotation' src='notebook_images/short_annotation.png' width=600 />

Longer annotations (more than the length of the fixed-length segment) are cropped so that the cropped segment fits within the annotation. This is shown in the figure below. 

<img alt='long annotation' src='notebook_images/long_annotation.png' width=600 />
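
The cropping rule for short and long annotations can be sketched in a few lines. This is only an illustration of the idea, not the code used by the workshop scripts: the function names `crop_start_range` and `random_crop` are hypothetical, and the sketch assumes the audio file is at least as long as the crop.

```python
import random

def crop_start_range(ann_start, ann_end, crop_len, file_duration):
    """Return (earliest, latest) allowed start time, in seconds, for a crop."""
    if ann_end - ann_start <= crop_len:
        # Short annotation: the crop must contain the whole annotation.
        earliest = ann_end - crop_len
        latest = ann_start
    else:
        # Long annotation: the crop must lie inside the annotation.
        earliest = ann_start
        latest = ann_end - crop_len
    # Keep the crop inside the audio file.
    earliest = max(0.0, earliest)
    latest = min(latest, file_duration - crop_len)
    return earliest, max(earliest, latest)

def random_crop(ann_start, ann_end, crop_len, file_duration):
    """Pick a random fixed-length crop (start, end) for one annotation."""
    earliest, latest = crop_start_range(ann_start, ann_end, crop_len, file_duration)
    start = random.uniform(earliest, latest)
    return start, start + crop_len

# A 1 second call and a 6 second call, both cropped to 2.5 seconds.
short_range = crop_start_range(10.0, 11.0, 2.5, 60.0)
long_range = crop_start_range(10.0, 16.0, 2.5, 60.0)
print("short annotation start range:", short_range)
print("long annotation start range:", long_range)
```

Because a new random crop is drawn each time an annotation is selected, the CNN sees the same call at slightly different time offsets across epochs.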


#### Creating annotations

The process described above requires a text file that contains the annotation information, i.e. the start and end offset of each vocalisation. Several software tools can generate this file. The one we have recommended for this workshop is Raven (or Raven Lite), but A2O, Ecosounds or Audacity can perform a similar task. The resulting file is tabular data, where each row is an annotation, and two of the columns respectively contain the start time (relative to the start of the file) and the end time or duration. 

These two tutorial videos show how to annotate audio files in Raven or Raven Lite:
https://ravensoundsoftware.com/video-tutorials/english/02-selections-and-measurements/
https://ravensoundsoftware.com/video-tutorials/english/03-saving-selection-tables/
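
As a rough sketch of what such a file contains, the snippet below reads a small made-up selection table with pandas. It assumes the tab-separated layout and the default column names (`Begin Time (s)`, `End Time (s)`) that Raven normally writes; check the header row of your own files, since the columns depend on the measurements you chose to export. The workshop scripts use their own loader (`utils.load_Raven_annotations`), so this is for illustration only.

```python
import io
import pandas as pd

# A tiny, made-up selection table in the tab-separated layout Raven exports.
raven_text = (
    "Selection\tView\tChannel\tBegin Time (s)\tEnd Time (s)\tLow Freq (Hz)\tHigh Freq (Hz)\n"
    "1\tSpectrogram 1\t1\t3.20\t4.10\t900.0\t7500.0\n"
    "2\tSpectrogram 1\t1\t12.75\t13.90\t850.0\t7800.0\n"
)

# With a real file, pass its path instead of the StringIO object.
table = pd.read_csv(io.StringIO(raven_text), sep="\t")

# One (start, end) pair per annotation, in seconds from the start of the file.
annotations = list(zip(table["Begin Time (s)"], table["End Time (s)"]))

for start, end in annotations:
    print(f"start={start:.2f}s  end={end:.2f}s  duration={end - start:.2f}s")
```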




### Spectrogram Generation
 
The CNN used is an image classifier, so the sound is first converted to an image in the form of a spectrogram.  The spectrogram is created by taking short (possibly overlapping) slices of the audio signal and calculating the spectrum of each slice with a discrete Fourier transform. Each spectrum slice then forms a column of the spectrogram, and these columns, arranged side by side, form a two-dimensional grid. 
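
The slicing procedure can be sketched with NumPy alone. This is a minimal illustration on a synthetic tone, not the workshop's `Spectrogram` class; the window size, the 50% overlap and the Hann window are assumptions chosen for the example.

```python
import numpy as np

sample_rate = 22050
t = np.arange(sample_rate) / sample_rate       # one second of audio
samples = np.sin(2 * np.pi * 2000 * t)         # synthetic 2 kHz tone

win_size = 512                                 # samples per slice
hop = win_size // 2                            # 50% overlap between slices
window = np.hanning(win_size)                  # taper to reduce edge effects

columns = []
for start in range(0, len(samples) - win_size + 1, hop):
    frame = samples[start:start + win_size] * window
    spectrum = np.abs(np.fft.rfft(frame)) ** 2  # power spectrum of one slice
    columns.append(spectrum)

# Rows are frequency bins, columns are time slices.
spec = np.array(columns).T
freqs = np.fft.rfftfreq(win_size, d=1 / sample_rate)
peak_hz = freqs[spec.mean(axis=1).argmax()]

print("spectrogram shape (freq bins, time slices):", spec.shape)
print("strongest frequency bin (Hz):", peak_hz)
```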

Each row of the grid represents a frequency band of equal width. However, in nature, more information is usually encoded within a given frequency range at low frequencies than within the same range at higher frequencies. Therefore, most ecoacoustic CNNs use a mel-scale spectrogram, which has a higher frequency resolution at low frequencies and a lower frequency resolution at higher frequencies. 
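
To see how the mel scale spaces its bands, the sketch below uses one common (HTK-style) conversion formula. Libraries differ in the exact variant they use, so treat the numbers as illustrative; the number of bands and the 8000 Hz maximum are arbitrary choices for the example.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

n_bands = 8
max_freq = 8000.0

# Band edges are evenly spaced in mels, then converted back to Hz.
edges_mel = np.linspace(0.0, float(hz_to_mel(max_freq)), n_bands + 1)
edges_hz = mel_to_hz(edges_mel)
band_widths = np.diff(edges_hz)

for low, high in zip(edges_hz[:-1], edges_hz[1:]):
    print(f"{low:8.1f} Hz to {high:8.1f} Hz  (width {high - low:7.1f} Hz)")
```

The printed bands get wider as frequency increases, which is the lower resolution at high frequencies described above.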

There are several parameters you can modify when generating these mel frequency spectrograms. You can edit these in the configuration file `config.ini`


The most important of these is the `time-win` (time window). This sets how many seconds of audio are fed into the CNN. However, note that whatever value is chosen for this, the resulting spectrogram will be reshaped into a size of 128x256 pixels. Setting a very large value for the time window will result in loss of time resolution when this reshaping is done. 

Another important parameter is the window size, which is the number of audio samples in each slice of the audio. This determines both the time resolution (a shorter window pins each resulting column to a more precise point in time) and the frequency resolution (a longer window gives more information from which to determine the spectrum; frequencies whose period is longer than the window cannot be resolved).
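
The trade-off can be made concrete with a little arithmetic. The helper `window_resolution` is a hypothetical name used only here, and the 22050 Hz sample rate is an assumption for the example.

```python
def window_resolution(win_size, sample_rate):
    """Return (window duration in seconds, frequency bin spacing in Hz)."""
    time_res_s = win_size / sample_rate
    freq_res_hz = sample_rate / win_size
    return time_res_s, freq_res_hz

sample_rate = 22050
for win_size in (256, 512, 1024, 2048):
    time_res_s, freq_res_hz = window_resolution(win_size, sample_rate)
    print(f"window {win_size:5d} samples: "
          f"{1000 * time_res_s:6.1f} ms per slice, "
          f"{freq_res_hz:6.1f} Hz per frequency bin")
```

Doubling the window size halves the spacing between frequency bins but doubles the duration that each column covers, so one kind of resolution is always traded for the other.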

You can also trim the spectrogram to include only a given range of frequencies. You may, for example, remove low or high frequencies well outside the range of your target species. This can make the task easier for the model, since confusing sounds are not included, and it reduces the computational load, since the input spectrogram is smaller. However, it is important to remember that frequencies outside the range of the target call type can still help with discrimination between a positive and a negative example.
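
Trimming amounts to keeping only the rows of the spectrogram whose centre frequency falls inside the chosen band. The sketch below shows this on a random placeholder spectrogram with linearly spaced frequency bins; `min_freq` and `max_freq` are example values, not settings taken from `config.ini`.

```python
import numpy as np

sample_rate = 22050
win_size = 512

# Centre frequency of each spectrogram row, in Hz.
freqs = np.fft.rfftfreq(win_size, d=1 / sample_rate)

# Placeholder spectrogram: one row per frequency bin, 100 time slices.
spec = np.random.default_rng(0).random((len(freqs), 100))

min_freq, max_freq = 500.0, 8000.0
keep = (freqs >= min_freq) & (freqs <= max_freq)
trimmed = spec[keep, :]

print("rows before trimming:", spec.shape[0])
print("rows after trimming: ", trimmed.shape[0])
```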


It is important that whatever spectrogram parameters are used during training are also used when the network is deployed. 




### Dataset Preparation - Practical Steps

We need to:
- 1.1 Initialise configuration parameters.
- 1.2 Prepare the training data based on how the training data is laid out on the file system.
- 1.3 Check that the information about the training data is correct.


#### Processing step 1.1 - Initialise the configuration parameters

In [None]:
import utils
importlib.reload(utils)

spec_params = utils.read_spec_params( "config.ini" )
utils.print_params(spec_params)

train_params = utils.read_train_params( "config.ini" )
utils.print_params(train_params)

#### Processing step 1.2 - Preparing the Data

This is where we actually run the code that parses the file system and creates a CSV file recording all the information required for creating samples (image patches) from the audio data. This step also creates the spectrograms, and the path to each spectrogram is recorded in the CSV file.


In [None]:
import RavenBinaryDataset
importlib.reload(RavenBinaryDataset)

wav_path_pos = "./sample_data/whipbird/pos"
wav_path_neg = "./sample_data/whipbird/neg"
spec_image_dir_path = "./training_output/preparation/specs"

RavenBinaryDataset.prepare_data( wav_path_pos, wav_path_neg, spec_image_dir_path, spec_params, train_params['dataCSV'])


#### Processing step 1.3 - Peruse the results

Look at the CSV file to see what information is contained in it. Also look at the directory on the file system where all of the spectrograms are stored. Open some of the spectrogram images.

### Balancing dataset 

The number of examples from each class that the CNN trains on influences the predictions that it makes. If 90% of the examples are positive and 10% are negative, then even a terrible model with no capacity to learn anything about the call itself will eventually optimise its weights to always guess positive and achieve 90% accuracy. Therefore, we ideally want to provide a similar number of training examples from each class.  In practice, it's usually a lot easier to source negative examples. Furthermore, there is usually more variety amongst the negative examples that we would like to include. 

We can still include more negative examples than positive examples, and then 'oversample' the positive examples - feeding the CNN positive examples multiple times for each negative example - to ensure that the CNN is fed the same number of positive and negative examples. However, we recommend that this is not taken to an extreme: include at most 2 to 3 times as many negative examples as positive examples. 
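
Oversampling can be sketched as repeating the smaller class until both classes contribute the same number of examples per epoch. This is a simplified illustration with hypothetical names (`balanced_epoch`, the `pos_*`/`neg_*` items), not the sampling code used by the workshop scripts.

```python
import random

def balanced_epoch(positives, negatives, seed=0):
    """Build one epoch with equal numbers of positive and negative examples."""
    rng = random.Random(seed)
    n = max(len(positives), len(negatives))

    def fill(items):
        # Repeat the whole list as often as it fits, then top up at random.
        repeats, remainder = divmod(n, len(items))
        return items * repeats + rng.sample(items, remainder)

    epoch = [(x, 1) for x in fill(positives)] + [(x, 0) for x in fill(negatives)]
    rng.shuffle(epoch)
    return epoch

positives = [f"pos_{i}" for i in range(4)]
negatives = [f"neg_{i}" for i in range(10)]

epoch = balanced_epoch(positives, negatives)
n_pos = sum(label for _, label in epoch)
print("examples in epoch:", len(epoch))
print("positive examples:", n_pos, " negative examples:", len(epoch) - n_pos)
```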

## 2. Training


### Training and Validation split 

When creating a Machine Learning model, meaning a model that learns from examples, we generally split our data into three sets:
1. Training
2. Validation
3. Test

The training set contains the examples that are actually used by the algorithm to update the internal parameters.  A simplified overview of training is: 
1. A prediction (probability of it being positive) is made on a training example.
2. The prediction is compared with the true label to get an error
3. This error is used to update the weights in a way that would make that prediction better
4. Repeat many times with many examples

We can look at these predictions as training progresses to see whether it is improving on the training set, but this only tells us how well it is memorising that particular set of examples. What we want to know is how well it works on examples it has never seen before, i.e. if it can recognise the class of audio event (the call type) it is being trained on. 

Therefore, as training progresses, we use the model to make predictions on the **validation set**. This tells us how well the model does at distinguishing the class of examples it has never "seen", which is what we want to know. 

During the course of developing a recogniser, many modifications will be made to what are called "hyperparameters", e.g. the spectrogram configuration, the learning rate (more on these later), etc., to try to improve the validation set accuracy. This introduces a chance of "overfitting" to the validation set. Just as the CNN algorithm updates its internal parameters based on the training set, we are updating the hyperparameters based on the validation set. 

How do we know that the particular values chosen for the hyperparameters are not tuned specifically to our validation set? This is the role of the third set, the **test set**. It is a set of data whose only role is to check the accuracy at the end of creating the model, and its accuracy result should not be used to modify any hyperparameters. 

For this workshop, we won't have the opportunity to spend long on training and retraining with modified hyperparameters, so we don't keep a separate validation set: the dataset is divided (randomly) into only two parts, *training* and *test*.  We will also use the test set as the validation set, that is, to report accuracy during the training process. Since we are not going to be tuning the hyperparameters much, this validation accuracy should be close to the true test accuracy. To be thorough, we would need a third data split which we never look at until the end. 

What proportion of examples should go into training and into testing? The more we put into training, the better our network will learn. The more we put into testing, the more confident we can be of the reported accuracy on the test set. It is common to put around 10-20% of the examples in the test set. Normally, the bigger the total number of examples, the smaller the proportion dedicated to testing. 


#### Training / Testing cross contamination

Ideally, the difference between the training examples and the testing examples should be as great as the difference between the training examples and the audio that will be encountered during deployment (that is when the recogniser is run on unlabelled data collected from the field to gather ecological information). 

Therefore, if for example you will deploy your recogniser on recordings taken from a location other than where your training recordings were made, then your test set should also come from a different location from your training set. 

The pipeline used in this workshop randomly divides your recordings into training and testing. That means that, if you have provided files with multiple annotations, all of the annotations in a given file will be used either for training or for testing. This is what we want, since consecutive vocalisations are likely to be more similar to each other than they are to vocalisations encountered during deployment. If we train on a vocalisation and then test on a consecutive vocalisation from the same individual, this might be an easier task for the recogniser than the kinds of vocalisations it encounters during deployment, and therefore the reported accuracy will be optimistic. 

However, because the individual recordings are allocated randomly to training or testing, the same problem will occur if these recordings were cut from the same original longer recording, close together in time. This is why we recommend sourcing your recordings from a wide variety of locations and times of day. 
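
Splitting by recording rather than by annotation can be sketched as follows. The function `split_by_recording` and the `(recording, start, end)` tuple layout are hypothetical and only illustrate the idea; the workshop pipeline has its own implementation.

```python
import random

def split_by_recording(annotations, test_fraction=0.2, seed=0):
    """Split annotations so that each recording is entirely in train or test.

    annotations: list of (recording_name, start_seconds, end_seconds)
    """
    recordings = sorted({rec for rec, _, _ in annotations})
    rng = random.Random(seed)
    rng.shuffle(recordings)

    n_test = max(1, round(len(recordings) * test_fraction))
    test_recordings = set(recordings[:n_test])

    train = [a for a in annotations if a[0] not in test_recordings]
    test = [a for a in annotations if a[0] in test_recordings]
    return train, test

# 30 made-up annotations spread over 10 recordings.
annotations = [(f"rec_{i % 10}.wav", float(i), float(i) + 1.0) for i in range(30)]
train, test = split_by_recording(annotations)

print("training annotations:", len(train))
print("testing annotations: ", len(test))
```

Grouping by file does not protect against two files that were cut from the same original long recording; to handle that case you would group by the original recording instead.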

### Basic explanation of a CNN

(you don't need to know this to build the recogniser, but you might find it interesting)

A function is something that maps an input to an output. An example of a function is

`f(x1, x2, x3) = a * x1 + b * x2 + c * x3`

Here, we have three input variables: `x1`, `x2` and `x3`, and three parameters `a`, `b` and `c`. The output is a single number. 

By adjusting the parameters `a`, `b` and `c`, we can change how the function behaves. If we had some examples where we knew those three independent variables and the corresponding dependent variable, we could use an algorithm to find the best values for the parameters `a`, `b` and `c`, so that for those examples, the output of the function is as close as possible to the dependent variable.  Because this is a linear function mapping to any real number, this is called linear regression. 
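
As an illustration, the sketch below generates synthetic examples from known values of `a`, `b` and `c` and then recovers them with NumPy's least-squares solver. The data and the "true" parameter values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 examples, each with three input variables (x1, x2, x3).
X = rng.normal(size=(50, 3))

# The parameters (a, b, c) used to generate the data, plus a little noise.
true_params = np.array([2.0, -1.0, 0.5])
y = X @ true_params + 0.01 * rng.normal(size=50)

# Find the parameters that minimise the squared error on these examples.
params, *_ = np.linalg.lstsq(X, y, rcond=None)

print("true parameters:     ", true_params)
print("estimated parameters:", np.round(params, 3))
```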


We could also combine a few of these functions with different parameters so that we map the three input variables to a list of, say, two output variables. 


We could then chain these together so that the output values from the first set of linear functions is used as the inputs for the next set, and so on. By putting a non-linear transformation between these layers, we have constructed a big non-linear function. This is a **deep neural network**. At each layer, the number of parameters of the function is the number of input variables multiplied by the number of output variables. 
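
A tiny network of this kind can be written directly in NumPy. The layer sizes are arbitrary, the weights are random rather than trained, ReLU is used as one common choice of non-linear transformation, and bias terms are left out to match the simplified description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# 3 inputs -> 4 hidden values -> 2 hidden values -> 1 output.
layer_sizes = [3, 4, 2, 1]
weights = [
    rng.normal(size=(n_in, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

def forward(x, weights):
    """Run the input through each linear layer, with ReLU in between."""
    for i, w in enumerate(weights):
        x = x @ w                      # a set of linear functions
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)     # non-linear transformation (ReLU)
    return x

x = np.array([0.2, -1.0, 0.7])
output = forward(x, weights)

# Parameters per layer = number of inputs * number of outputs.
n_params = sum(w.size for w in weights)
print("output:", output)
print("total parameters:", n_params)
```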

A CNN is a special type of deep neural network. In our case, the input variables are the pixels of the spectrogram, and the output is a number which represents the probability of the spectrogram being the positive class. Inside the CNN are many parameters. 


Unlike the above example, the number of parameters for each output of each layer is **not** the same as the number of input variables. This is because in a CNN, parameters are shared in a special way.  This is necessary because there are so many input variables (tens of thousands of pixels). 
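
The effect of parameter sharing can be illustrated by comparing parameter counts for a 128x256 input. The 3x3 filter size is an assumption for this example and is not necessarily what the workshop network uses; the loop is a deliberately simple way to show the same nine weights being reused at every position.

```python
import numpy as np

height, width = 128, 256
n_pixels = height * width

# Fully connected: every output needs one parameter per input pixel.
dense_params_per_output = n_pixels

# Convolutional: one small filter is slid across the whole image.
kernel_h, kernel_w = 3, 3
conv_params_per_filter = kernel_h * kernel_w

rng = np.random.default_rng(0)
image = rng.random((height, width))
kernel = rng.normal(size=(kernel_h, kernel_w))

# Apply the same 9 weights at every position of the image.
out = np.zeros((height - kernel_h + 1, width - kernel_w + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = (image[i:i + kernel_h, j:j + kernel_w] * kernel).sum()

print("parameters per output, fully connected:", dense_params_per_output)
print("parameters per filter, convolutional:  ", conv_params_per_filter)
print("output feature map shape:", out.shape)
```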


We mentioned above that for the very first function we could use an algorithm to find the optimal values of the parameters to fit our training examples. There is an algorithm called **gradient descent** that can be used to achieve this, and this algorithm is how neural networks learn. 

1. For each labelled example it sees, it runs it through its big function to produce a number between 0 and 1 representing the probability that the example is a positive example (i.e. it contains your target call). 
  
2. It then compares the calculated probability with the desired probability (i.e. 1 if it was a positive example and 0 if it was a negative example), using something called a "loss function" - the output of the loss function for a given example's prediction is called its "loss". 
  
3. It then does some calculus to figure out which direction to nudge each parameter so that the loss for that example would be smaller. In a neural network it does this step by step from the last layer to the first layer using a process called "back propagation". 

Actually, the training works in small batches of say 4 or 8 examples. The predictions for all examples in the batch are made, their loss is calculated, and the parameters are updated so that the average loss across the batch is reduced. 
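
The loop of predict, compute loss and nudge parameters can be sketched for the simplest possible case: a single linear layer followed by a sigmoid, trained on synthetic data with binary cross-entropy as the loss function. With only one layer there is nothing to back-propagate through, so the gradient is written out by hand. The learning rate, batch size and data are assumptions for the example, and this is not the workshop's training code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: 200 examples with 2 input variables each.
n = 200
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # 1 = positive, 0 = negative

w = np.zeros(2)
b = 0.0
learning_rate = 0.5
batch_size = 8

def predict(X, w, b):
    """Probability of the positive class for each example."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def bce_loss(p, y):
    """Average binary cross-entropy loss."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

loss_before = bce_loss(predict(X, w, b), y)

for epoch in range(20):
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        p = predict(X[idx], w, b)
        error = p - y[idx]                   # gradient of the loss w.r.t. the score
        w -= learning_rate * X[idx].T @ error / len(idx)
        b -= learning_rate * error.mean()

loss_after = bce_loss(predict(X, w, b), y)
print("loss before training:", round(loss_before, 3))
print("loss after training: ", round(loss_after, 3))
```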


### Training loss, validation loss, validation accuracy

When training, each example will be fed into the CNN many times. Each round of feeding all the examples in is called an **epoch**. After we have fed all the examples in once, we have done one epoch of training. After each epoch we can calculate the average loss across all the examples. Training loss is the average loss across all the training examples. Validation loss is the average loss across the validation examples. 

With each epoch, the training loss should go down, because the training algorithm's job is to reduce this loss. If it doesn't go down then something is quite wrong with the configuration of the network. At some point it will stop improving. 

What we are more interested in is the validation loss. This will usually be higher than the training loss, because these examples were never seen during training. As long as the validation loss is still decreasing, the network is still learning in a way that generalises to examples it has never seen before. At some point the validation loss will stop improving, usually before the training loss does. The validation loss might even increase after a while. This is a sign that the network is overfitting to the training set, meaning that it is memorising the individual training examples while reducing its capacity to generalise its understanding of the target call type. 

 The probabilities and loss can be useful for some analysis, but really at the end of the day, the examples have been labelled as binary positive/negative and so what we are interested in is the binary predictions. Generally anything with a predicted probability over 50% is a 'positive' prediction and anything less is a 'negative' prediction.  The accuracy is the proportion of examples that were correctly labelled. 
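
Turning probabilities into binary predictions and an accuracy figure takes only a couple of lines. The probabilities and labels below are made up for the example.

```python
import numpy as np

# Predicted probability of the positive class, and the true labels.
probs = np.array([0.92, 0.40, 0.65, 0.10, 0.55, 0.30])
labels = np.array([1, 1, 1, 0, 0, 0])

# Anything over 50% counts as a positive prediction.
preds = (probs > 0.5).astype(int)

# Accuracy is the proportion of predictions that match the labels.
accuracy = float((preds == labels).mean())

print("predictions:", preds)
print("accuracy:", round(accuracy, 3))
```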


### Train the Recogniser

There are two main libraries available in Python for working with CNNs and other types of artificial neural network: TensorFlow and PyTorch. In this workshop we are using PyTorch.

#### Processing step 2.1 - Training the neural network

In [None]:
import utils
importlib.reload(utils)

train_params = utils.read_train_params( "config.ini" )
utils.print_params(train_params)


In [None]:
import TrainTest
import RavenBinaryDataset
import NeuralNets
importlib.reload(TrainTest)
importlib.reload(RavenBinaryDataset)
importlib.reload(NeuralNets)   

In [None]:
TrainTest.train(train_params, spec_params)

### Error Analysis

The first step after training is to check the examples that the network misclassified. Often we find that some of the training or validation examples are labelled incorrectly. Mislabelled examples are likely to be "misclassified" by the network (i.e. classified correctly, as belonging to a class that does not match the label). We can then correct the labels of these examples and retrain. 

We might also notice that many of the misclassified examples have something in common that indicates an inadequacy in the training/testing dataset. For example, we might find that many insect tracks are classified as positive. This might mean that your positive examples often have insect tracks in the background, while your negative examples don't. You could then go back to your original recordings and extract some segments containing the same types of insect tracks to include in your negative examples. 

You might also find that very faint positive examples are incorrectly classified as negative. You might be able to improve this false-negative rate by adding more very faint calls to your dataset. However, this is tricky, because it might actually increase the number of false positives. If finding all these very faint calls once the system is deployed is important, then you might just have to add plenty of faint examples and a corresponding number of negative examples that can be confused with faint positive examples. However, it might be that these faint calls are not important for answering your ecological question. In this case it might be better to simply exclude them completely from the training/testing set. 
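
A simple way to start an error analysis is to list the false positives and false negatives, most confident mistakes first, and then open the corresponding spectrograms. The sketch below assumes you have a list of `(file name, label, predicted probability)` tuples; that layout, the file names and the function `misclassified` are hypothetical.

```python
def misclassified(results, threshold=0.5):
    """Return (false_positives, false_negatives), most confident mistakes first."""
    false_pos, false_neg = [], []
    for name, label, prob in results:
        pred = 1 if prob > threshold else 0
        if pred == 1 and label == 0:
            false_pos.append((name, prob))
        elif pred == 0 and label == 1:
            false_neg.append((name, prob))
    false_pos.sort(key=lambda item: -item[1])   # highest probability first
    false_neg.sort(key=lambda item: item[1])    # lowest probability first
    return false_pos, false_neg

# Made-up results: (spectrogram file, true label, predicted probability).
results = [
    ("seg_001.png", 1, 0.97),
    ("seg_002.png", 1, 0.12),
    ("seg_003.png", 0, 0.88),
    ("seg_004.png", 0, 0.03),
    ("seg_005.png", 0, 0.61),
    ("seg_006.png", 1, 0.45),
]

false_pos, false_neg = misclassified(results)
print("false positives:", false_pos)
print("false negatives:", false_neg)
```

Confident mistakes are the ones most likely to be labelling errors, so they are a good place to start checking.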


## 3. Inference

So far we have trained the network. This took a set of fixed-size spectrogram images and fed them forward to make predictions, then performed backpropagation to update the weights. 

Inference refers to making predictions on unlabelled data. This is what happens when we deploy our recogniser to do its job. 

Unlike our train/test set, in deployment we will *normally* be recording longer continuous files, with the intention of locating the target call within these longer files. Therefore, our inference pipeline has a few differences from our training pipeline:

- we need to segment the audio recordings
- we don't need to calculate loss or do any backpropagation (we can't with no labels)
- we need to assemble the individual predictions in a way that gives us information in the context of the original unsegmented audio file. 
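
One possible way to assemble per-segment predictions into events on the timeline of the original file is to merge positive segments that touch or overlap. This is a sketch of that idea only; the function `merge_detections` is hypothetical and is not how the workshop's `Inference` module necessarily collates its results.

```python
def merge_detections(starts, probs, segment_len, threshold=0.5):
    """Merge touching or overlapping positive segments into (start, end) events."""
    events = []
    for start, prob in zip(starts, probs):
        if prob <= threshold:
            continue                       # negative segment, nothing to record
        end = start + segment_len
        if events and start <= events[-1][1]:
            # Continues the previous event, so extend it.
            events[-1][1] = max(events[-1][1], end)
        else:
            events.append([start, end])
    return [tuple(event) for event in events]

# Made-up predictions for five consecutive 2.5 second segments.
starts = [0.0, 2.5, 5.0, 7.5, 10.0]
probs = [0.1, 0.8, 0.9, 0.2, 0.7]

events = merge_detections(starts, probs, segment_len=2.5)
print("detected events (start, end) in seconds:", events)
```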


### Segment overlap (temporal precision vs computational work) 

Our particular type of neural network classifies fixed-length segments of audio. It does not tell us whereabouts within the segment the target call occurred. One approach to running this classifier as a call recogniser on a longer file is to split the longer file into non-overlapping consecutive fixed-length segments, and then classify each of these. We will then know where the target call has been recognised with a time resolution equal to the duration of the fixed-length segment. 

Alternatively, we could overlap these segments. For example, we can take a 2.5 second segment every 1 second. This approach makes it less likely that a target call will be missed, since it will appear in several of the segments with varying time-offsets. With non-overlapping segments, many of the target calls will be cut in half across two segments, whereas if we overlap the segments, it means that there is likely to be at least one of the segments that contains a complete enough section of the call to produce a positive classification. 

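The downside of overlapping is that the amount of computation that needs to be performed increases. For example, taking a 2.5 second segment every 1 second means classifying about 2.5 times as many segments as the non-overlapping approach. 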
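
The two segmentation strategies can be compared by listing the segment start times for a one-minute file. The function `segment_starts` is a hypothetical helper for this illustration; the 2.5 second segment and 1 second step are the example values used above.

```python
def segment_starts(file_duration, segment_len, hop):
    """Start times (seconds) of all full-length segments that fit in the file."""
    if file_duration < segment_len:
        return []
    n_segments = int((file_duration - segment_len) // hop) + 1
    return [i * hop for i in range(n_segments)]

file_duration = 60.0
segment_len = 2.5

non_overlapping = segment_starts(file_duration, segment_len, hop=segment_len)
overlapping = segment_starts(file_duration, segment_len, hop=1.0)

print("non-overlapping segments:", len(non_overlapping))
print("overlapping segments:    ", len(overlapping))
print("first overlapping starts:", overlapping[:4])
```

The number of segments, and therefore the number of CNN forward passes, grows roughly in proportion to the segment length divided by the step.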






#### Processing step 3.1 - Doing inference (i.e. recognising) 

In [None]:
import utils
import Inference
importlib.reload(utils)
importlib.reload(Inference)

infer_params = utils.read_infer_params( "config.ini" )
utils.print_params(infer_params)

In [None]:
Inference.do_inference( infer_params, spec_params)

### Run Inference

### Verification of Results

## 4. Retraining

### Run Dataset preparation with the enhanced dataset 

### Run training again 

### Things to check

- Short audio files 
- Long Audio files
- Audio files with multiple channels



## Auxiliary Code

Below are various useful code snippets.

#### Read config file parameters

In [None]:
importlib.reload(utils)

# read config with a particular config section, e.g. "infer"
inference_params = utils.read_config( "config.ini", "infer")

utils.print_params(inference_params)

#### Load audio data

In [None]:
file_source = "./sample_data/test/XC201977 - Eastern Whipbird - Psophodes olivaceus.wav"
#file_source = ""

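if not os.path.exists(file_source):
    print("file doesn't exist")
else:
    # librosa resamples to 22050 Hz and mixes down to mono by default;
    # pass sr=None (commented line below) to keep the file's native sample rate
    samples, sample_rate = librosa.load(file_source)
    #samples, sample_rate = librosa.load(file_source, sr=None)
    duration = len(samples) / sample_rate  # duration in seconds

    print("Audio file loaded.")
    print("Number of samples : " + str(len(samples)))
    print("Sample rate : " + str(sample_rate))
    print("Duration : " + str(int(duration)))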

#### Generate spectrogram from loaded data

In [None]:
importlib.reload(Spectrogram)

fftWinSize = 512
fftOverlap = 0.5
maxFreq = 8000
time_win = 2.5
spec = Spectrogram.Spectrogram(samples, sample_rate, duration)
spec.make_spec( fftWinSize, fftOverlap, maxFreq, time_win)


#### Get image patch from spectrogram and save image patch

In [None]:
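os.makedirs("tmp", exist_ok=True)  # make sure the output directory exists
img = spec.get_image_patch( 3.0, 5.0)
img.save( "tmp/test.png")
# display the saved patch from the same path it was written to
Image(filename="tmp/test.png")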

#### Read Raven File

In [None]:
import utils
importlib.reload(utils)

anns = utils.load_Raven_annotations("./data/XC201977 - Eastern Whipbird - Psophodes olivaceus.Table.1.selections.txt")
for ann in anns:
    print(ann.start)

#### Classify image patch

In [None]:
import NeuralNets

net = NeuralNets.CNN_4_Layers( 512, 2, 12, 24, 32, 48)