# QUT EcoAcoustics Recogniser Workshop 2022

## Overview of the Recogniser Building Workflow


### Dataset preparation 

- Annotating audio
- Spectrogram segment generation 
- Balancing dataset 

### Training

- Training/validation split 
- Running training 
- Error analysis

### Inference

- Running inference 
- Inspecting and verifying results
- Labelling data for training

### Setting up your environment

We are going to use Python for everything, including inspecting our dataset, training the recogniser, and visualising spectrograms and results.

If you are already reading this in your browser, you have Python and Jupyter installed. In Python (as in many other languages), some core functions are built into the language, while others belong to packages that must be imported before they can be used. Some of those packages ship with the Python distribution; others must be installed before they can be imported.

In the cell below, we install all the packages we will need. 


In [None]:
import sys

!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install librosa
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install torch torchvision torchaudio


Before we move on, we need to make sure we are in the correct working directory. By default, the working directory is the directory this notebook is in, which is the `src` directory. We want the working directory to be its parent, which is the root directory of the repo.

In [None]:
import os
if os.path.basename(os.path.normpath(os.getcwd())) == 'src':
    os.chdir('../')
print(os.getcwd())


Now that we have installed the required packages, we will import them so that we can use them in our scripts.

In [None]:

from os import listdir

import importlib

import math
import numpy as np
import pandas as pd
import librosa
import matplotlib.pyplot as plt
from scipy import signal
from PIL import Image as PILImage  # aliased: IPython.display.Image (imported below) would otherwise shadow it

import configparser

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from IPython.display import Image

import utils
import config
import Spectrogram 


## 1. Dataset Preparation


### Dataset labelling

The recogniser is a binary classifier, meaning it assigns one of two labels (positive or negative) to segments of audio. In a basic Convolutional Neural Network (CNN), these segments are fixed-size spectrogram images. In the training phase, these fixed-size segments are fed into the CNN along with their labels. Therefore, we need a collection of fixed-length labelled segments.

On the page [Dataset Preparation](https://openecoacoustics.org/resources/lessons/make-your-own-recognizer/theory/#1-dataset-preparation) you can read how to prepare the labelled training data in a way that is compatible with the CNN scripts in this lesson. 

### Spectrogram Generation
 
The CNN used is an image classifier. The audio signal is converted to an image in the form of a spectrogram.  

[Dataset Preparation](https://openecoacoustics.org/resources/lessons/make-your-own-recognizer/theory/#1-dataset-preparation) explains more about the spectrogram. 

You can edit the parameters used to create the spectrogram in the configuration file `config.ini`.


The most important of these is the `time-win` (time window). This sets how many seconds of audio are fed into the CNN. However, note that whatever value is chosen for this, the resulting spectrogram will be reshaped into a size of 128x256 pixels. Setting a very large value for the time window will result in loss of time resolution when this reshaping is done. 
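To make the loss of time resolution concrete, here is a minimal sketch of the arithmetic. It assumes that the 256-pixel side of the 128x256 image is the time axis, which is not stated above; `seconds_per_pixel` is a hypothetical helper, not part of the workshop scripts.

```python
def seconds_per_pixel(time_win_s, width_px=256):
    """Duration of audio represented by one pixel column after resizing.

    Assumption: the time axis of the resized spectrogram is width_px pixels wide.
    """
    if time_win_s <= 0 or width_px <= 0:
        raise ValueError("time window and width must be positive")
    return time_win_s / width_px


# Compare a short, a medium and a very long time window.
for time_win in (2.5, 10.0, 60.0):
    ms = 1000 * seconds_per_pixel(time_win)
    print(f"time-win = {time_win:5.1f} s -> {ms:6.1f} ms of audio per pixel column")
```

The longer the time window, the more audio is squeezed into each pixel column, so short calls become harder to distinguish.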

You can read more on the page [Spectrogram Generation](https://openecoacoustics.org/resources/lessons/make-your-own-recognizer/theory/#1-spectrogram-generation) 

### Dataset Preparation - Practical Steps

We need to:
- 1.1 Initialise configuration parameters.
- 1.2 Prepare the training data based on how the training data is laid out on the file system.
- 1.3 Check that the information about the training data is correct.


#### Processing step 1.1 - Initialise the configuration parameters

In [None]:
import utils
importlib.reload(utils)

print(os.path.exists("data/sample_data/noisypitta/config.ini"))

spec_params = config.read_spec_params( "data/sample_data/noisypitta/config.ini" )
config.print_params(spec_params)

train_params = config.read_train_params( "data/sample_data/noisypitta/config.ini" )
config.print_params(train_params)


#### Processing step 1.2 - Preparing the Data

This is where we run the code that parses the file system and creates a CSV file recording all the information required for creating samples (image patches) from the audio data. This step also creates the spectrograms and records the path to each one in the CSV file.


In [None]:
import RavenBinaryDataset
importlib.reload(RavenBinaryDataset)

wav_path_pos = "data/sample_data/noisypitta/pos"
wav_path_neg = "data/sample_data/noisypitta/neg"
spec_image_dir_path = "training_output/preparation/noisypitta/specs"

RavenBinaryDataset.prepare_data( wav_path_pos, wav_path_neg, spec_image_dir_path, spec_params, train_params['dataCSV'])


#### Processing step 1.3 - Peruse the results

Look at the CSV file to see what information is contained in it. Also look at the directory on the file system where all of the spectrograms are stored. Open some of the spectrogram images.
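If you prefer to inspect the CSV from within the notebook, a small sketch like the following may help. It only uses the Python standard library and makes no assumption about the column names; `summarise_csv` is a hypothetical helper, not part of the workshop scripts.

```python
import csv


def summarise_csv(csv_path, n_preview=5):
    """Return the column names, the row count and the first few rows of a CSV file."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        columns = reader.fieldnames or []
    return columns, len(rows), rows[:n_preview]


# Example usage in this notebook (uncomment once step 1.2 has been run):
# columns, n_rows, preview = summarise_csv(train_params['dataCSV'])
# print(columns, n_rows)
# for row in preview:
#     print(row)
```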

[Balancing dataset](https://openecoacoustics.org/resources/lessons/make-your-own-recognizer/theory/#balancing-dataset) explains more about how to balance the dataset.


## 2. Training


### Training and Validation split 

When creating a Machine Learning model, meaning a model that learns from examples, we generally split our data into three sets:
1. Training
2. Validation
3. Test

The training set contains the examples that the algorithm actually uses to update its internal parameters (the weights). A simplified overview of training is:
1. A prediction (probability of it being positive) is made on a training example.
2. The prediction is compared with the true label to get an error
3. This error is used to update the weights in a way that would make that prediction better
4. Repeat many times with many examples
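
The four steps above can be sketched with a tiny classifier that has only two parameters. This is an illustration of the idea, not the workshop's CNN: it assumes a single input feature `x`, a sigmoid output and plain gradient steps, and all names are hypothetical.

```python
import math


def sigmoid(z):
    """Squash any number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def train_tiny_classifier(examples, epochs=200, lr=0.5):
    """Fit p = sigmoid(w * x + b) to (x, label) pairs, one example at a time."""
    w, b = 0.0, 0.0
    for _ in range(epochs):                # 4. repeat many times
        for x, label in examples:
            p = sigmoid(w * x + b)         # 1. predict the probability of "positive"
            error = p - label              # 2. compare with the true label
            w -= lr * error * x            # 3. nudge the weights to reduce the error
            b -= lr * error
    return w, b


examples = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = train_tiny_classifier(examples)
print("P(positive | x=2)  =", sigmoid(w * 2.0 + b))
print("P(positive | x=-2) =", sigmoid(w * -2.0 + b))
```

A CNN follows the same loop, only with far more parameters and a spectrogram image as input instead of a single number.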

We can look at these predictions as training progresses to see whether it is improving on the training set, but this only tells us how well it is memorising that particular set of examples. What we want to know is how well it works on examples it has never seen before, i.e. if it can recognise the class of audio event (the call type) it is being trained on. 

Therefore, as training progresses, we use the model to make predictions on the **validation set**. This tells us how well the model does at distinguishing the class of examples it has never "seen", which is what we want to know. 

While developing a model we often adjust hyperparameters (settings that we choose rather than the network learns, such as the number of epochs) to improve the validation accuracy. How do we know that the values we settle on are not tuned specifically to our validation set? This is the role of the third set, the **test set**. Its only role is to check the accuracy at the end of creating the model, and its accuracy result should not be used to modify any hyperparameters.

For this workshop, we won't have the opportunity to spend long on training and retraining with modified hyperparameters, so we don't keep a separate validation set: the dataset is divided (randomly) into only two parts, *training* and *test*. We also use the test set as the validation set, that is, to report accuracy during the training process. Since we are not going to tune the hyperparameters much during training, this validation accuracy should be close to the test accuracy. To be thorough, we would need a third data split that we never look at until the end.
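
A random two-way split can be sketched as follows. This is only an illustration: the workshop's `TrainTest` script does its own split, and the function name, the 20% test fraction and the fixed seed here are assumptions.

```python
import random


def random_split(items, test_fraction=0.2, seed=0):
    """Shuffle the items and return (training items, test items)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


clips = [f"clip_{i:03d}.png" for i in range(10)]
train, test = random_split(clips, test_fraction=0.2)
print(len(train), "training examples,", len(test), "test examples")
```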



### Basic explanation of a CNN

The CNN is just a very big mathematical function, with a large number of parameters, that takes the inputs and produces an output. But for understanding how the parameters are learned, it’s not much different from the tiny function in step 1.
To get an intuition for the calculations a CNN does to go from a spectrogram to a prediction, we will start with something familiar, a simple linear function, and modify it in small steps until we get to the CNN.

You don't need to know this to build the recogniser, but you might find it interesting.

[Read more about the CNN here](https://openecoacoustics.org/resources/lessons/make-your-own-recognizer/theory/#basic-explanation-of-a-cnn).


### Training loss, validation loss, validation accuracy

When training, each example is fed into the CNN many times. Each pass through all of the examples is called an **epoch**: once every example has been fed in once, we have done one epoch of training. After each epoch we can calculate the average loss across the examples. Training loss is the average loss across all the training examples. Validation loss is the average loss across the validation examples.
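
As a rough sketch of what "average loss" means, the code below averages a per-example loss over a set of predictions. It assumes binary cross-entropy as the loss, which is a common choice for binary classifiers but is not confirmed to be the one used by the workshop scripts.

```python
import math


def binary_cross_entropy(p, label, eps=1e-7):
    """Loss for one example: small when the predicted probability matches the label."""
    p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def mean_loss(predictions, labels):
    """Average loss over a set of examples (e.g. the training or validation set)."""
    losses = [binary_cross_entropy(p, y) for p, y in zip(predictions, labels)]
    return sum(losses) / len(losses)


labels = [1, 0, 1, 0]
print("unsure model:   ", mean_loss([0.5, 0.5, 0.5, 0.5], labels))
print("confident model:", mean_loss([0.9, 0.1, 0.8, 0.2], labels))
```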

The validation loss is the most interesting part for us. It will usually be higher than the training loss, because these examples were never seen during training. As long as the validation loss is still decreasing, the network is still learning in a way that generalises to examples it has never seen before. At some point the validation loss will stop improving, usually before the training loss does. The validation loss might even increase after a while. This is a sign that the network is overfitting to the training set: it is memorising the individual training examples while reducing its capacity to generalise its understanding of the target call type.
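
One way to act on this is to find the epoch with the lowest validation loss and stop once it has not improved for a few epochs. The sketch below assumes you have a list of validation losses, one per epoch; the function name and the `patience` parameter are hypothetical and not part of the workshop scripts.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch), both counted from 0.

    Training "stops" once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    if not val_losses:
        raise ValueError("need at least one validation loss")
    best = 0
    for epoch, loss in enumerate(val_losses):
        if loss < val_losses[best]:
            best = epoch                     # new lowest validation loss
        elif epoch - best >= patience:
            return best, epoch               # no improvement for `patience` epochs
    return best, len(val_losses) - 1


val_losses = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.50, 0.52]
best, stop = early_stopping_epoch(val_losses)
print(f"lowest validation loss at epoch {best}, stop at epoch {stop}")
```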



### Train the Recogniser

There are two main libraries available in Python for working with CNNs and other types of artificial neural network: TensorFlow and PyTorch. In this workshop we are using PyTorch.

#### Processing step 2.1 - Training the neural network

In [None]:
import utils
importlib.reload(utils)

train_params = config.read_train_params( "data/sample_data/noisypitta/config.ini" )
config.print_params(train_params)


In [None]:
import TrainTest
import RavenBinaryDataset
import NeuralNets_3FCL as NeuralNets
importlib.reload(TrainTest)
importlib.reload(RavenBinaryDataset)
importlib.reload(NeuralNets)


In [None]:
TrainTest.train(train_params, spec_params)

### Error Analysis

The first step after training is to check the examples that the network misclassified. Often we find that some examples in the training or validation set are labelled incorrectly. Mislabelled examples are likely to be "misclassified" by the network (i.e. classified correctly, as belonging to a class that does not match the label). We can then correct the labels of these examples and retrain.
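
The idea of checking misclassified examples can be sketched as below. This is not how the `ErrorAnalysis` module works internally (that is not shown here); it assumes you have, for each example, a name, a true label and a predicted probability, and it lists the most confidently wrong predictions first, since those are good candidates for mislabelled examples.

```python
def misclassified(examples, threshold=0.5):
    """examples: iterable of (name, true_label, predicted_probability).

    Returns the wrongly classified examples, most confident mistakes first.
    """
    errors = []
    for name, label, prob in examples:
        predicted = 1 if prob >= threshold else 0
        if predicted != label:
            # How sure the network was about its (wrong) answer.
            confidence = prob if predicted == 1 else 1.0 - prob
            errors.append((name, label, prob, confidence))
    return sorted(errors, key=lambda e: e[3], reverse=True)


results = [
    ("a.png", 1, 0.95),   # correct
    ("b.png", 0, 0.90),   # confident false positive: check the label
    ("c.png", 1, 0.40),   # marginal false negative
    ("d.png", 0, 0.10),   # correct
]
for name, label, prob, confidence in misclassified(results):
    print(f"{name}: label={label}, predicted probability={prob:.2f}")
```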

We might also notice that many of the misclassified examples have something in common that indicates an inadequacy in the training/testing dataset. For example, we might find that many segments containing insect tracks are classified as positive. This might mean that your positive examples often have insect tracks in the background, while your negative examples don't. You could then go back to your original recordings and extract some segments of the same types of insect tracks to include in your negative examples.

You might also find that very faint positive examples are incorrectly classified as negative. You might be able to improve this false-negative rate by adding more very faint calls to your dataset. However, this is tricky, because it might actually increase the number of false positives. If finding all these very faint calls once the system is deployed is important, then you might just have to add plenty of faint examples and a corresponding number of negative examples that can be confused with faint positive examples. However, it might be that these faint calls are not important for answering your ecological question. In this case it might be better to simply exclude them completely from the training/testing set.


In [None]:
import ErrorAnalysis
importlib.reload(ErrorAnalysis)
from IPython import display

error_image_path = 'training_output/noisypitta/errors.png'

ErrorAnalysis.do_analysis(train_params, spec_params, error_image_path)

display.Image(error_image_path)

## 3. Inference

So far we have trained the network. This took a set of fixed-size spectrogram images and fed them forward to make predictions, then performed backpropagation to update the weights.

Inference refers to making predictions on unlabelled data. This is what happens when we deploy our recogniser to do its job. 

Unlike our train/test set, in deployment we will *normally* be recording longer continuous files, with the intention of locating the target call within these longer files. Therefore, our inference pipeline has a few differences from our training pipeline:

- we need to segment the audio recordings
- we don't need to calculate loss or do any backpropagation (we can't with no labels)
- we need to assemble the individual predictions in a way that gives us information in the context of the original unsegmented audio file. 


### Segment overlap (temporal precision vs computational work) 

Our particular type of neural network classifies fixed-length segments of audio. It does not tell us whereabouts within the segment the target call occurred. One approach to running this classifier as a call recogniser on a longer file is to split the longer file into non-overlapping consecutive fixed-length segments, and then classify each of these. We will then know where the target call has been recognised with a time resolution equal to the duration of the fixed-length segment.

Alternatively, we could overlap these segments. For example, we can take a 2.5 second segment every 1 second. This approach makes it less likely that a target call will be missed, since it will appear in several of the segments with varying time-offsets. With non-overlapping segments, many of the target calls will be cut in half across two segments, whereas if we overlap the segments, it means that there is likely to be at least one of the segments that contains a complete enough section of the call to produce a positive classification. 

The downside of overlapping is the increase in the amount of computation that needs to be performed, since more segments must be classified for the same duration of audio.

You can adjust the overlap in the `config.ini` file.
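
To get a feel for the trade-off, the sketch below counts how many segments need to be classified for a one-minute file, with and without overlap. The function and its `hop_s` parameter (time between segment starts) are hypothetical; the actual parameter names in `config.ini` may differ.

```python
def segment_starts(file_duration_s, window_s, hop_s):
    """Start times (in seconds) of every full-length segment in a file."""
    if window_s <= 0 or hop_s <= 0:
        raise ValueError("window and hop must be positive")
    starts = []
    i = 0
    # Small tolerance so rounding error does not drop the final segment.
    while i * hop_s + window_s <= file_duration_s + 1e-9:
        starts.append(i * hop_s)
        i += 1
    return starts


no_overlap = segment_starts(60.0, window_s=2.5, hop_s=2.5)
overlap = segment_starts(60.0, window_s=2.5, hop_s=1.0)
print(len(no_overlap), "segments without overlap")
print(len(overlap), "segments with a 2.5 second segment every 1 second")
```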



#### Processing step 3.1 - Doing inference (i.e. recognising)

In [None]:
import importlib
import utils
import Inference
importlib.reload(utils)
importlib.reload(Inference)

infer_params = config.read_infer_params( "data/sample_data/noisypitta/config.ini" )
config.print_params(infer_params)

First find a file you wish to run the recogniser over. There is a file containing noisy pitta here: https://connectqutedu.sharepoint.com/:f:/s/QUTEcoacousticsAnon/EttZnlYpt5FKjtB7JX8IeeABpo4Bhu5xRUradlp-oxaw4w?e=dwcCw8. Save this file to a folder on your computer and then update the function parameter below to point to that folder.


In [None]:
Inference.do_inference( infer_params, spec_params, "inference_audio/noisypitta")

### Verification of Results

Our inference script generates a Raven annotation file.

To check through the results, you can open one of your inference wav files in Raven, and then open the annotation file.

A few things to note:
- The temporal "resolution" of the annotations will be determined by the segment size and the overlap.
- Overlapping or contiguous positively identified patches will be merged into a single annotation.
- There is no frequency information in the annotation, as the recogniser is not designed to detect frequency bounds.
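
The merging of overlapping positive patches can be sketched as follows. This is an illustration of the idea rather than the code in the `Inference` module, which is not shown here; it assumes each positive patch is given as a `(start, end)` pair in seconds.

```python
def merge_positive_segments(segments):
    """Merge (start_s, end_s) positive segments that overlap or touch."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous annotation: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


# Three overlapping 2.5 second patches and one isolated patch.
positives = [(3.0, 5.5), (4.0, 6.5), (5.0, 7.5), (20.0, 22.5)]
print(merge_positive_segments(positives))
```

Here the three overlapping patches become one annotation from 3.0 to 7.5 seconds, and the isolated patch stays as it is.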

Look through some of your files to see how the recogniser did. If there are false positives (there probably will be lots), we can take these and put them in our negative set to reduce the false positive rate. 

In Raven:

1. Go to `file > selection table`.
2. Select the tab of the selection table, right click, and rename the tab to 'neg'.
3. Scroll through the file looking for false positive annotations. When you see one, draw a new annotation over the top (while the new 'neg' selection table is active).
  - This annotation will be saved as a short audio clip, so make sure the annotation includes some padding around the call. For a 2.5 second CNN input size, around 4 seconds total duration for the annotation works.
4. Repeat this until you have a collection of several false positives in your neg selection table.
5. Go to `file > save all selections in current table as`.
  - This will save a folder of short wav files. Rename this folder "neg_iteration_01".
  
You might also notice some false negatives (i.e. calls of the target species that the recogniser did not find). In this case you can follow a similar process to the one above, using a different new selection table.
  

## 4. Retraining

After adding examples to your negative folder (and maybe the positive folder too), scroll up to the top of this notebook and repeat all of the steps from sections 1 (dataset preparation) and 2 (training).

See if your accuracy improved. 

You can then repeat section 3 (inference) to see if you can improve further.

### Proceed with caution

This iterative process of adding examples based on inference output works well, but it can bias the dataset.

If your recogniser has poor recall (meaning it misses lots of examples, i.e. it produces many false negatives), then adding the false positives to your negative set won't help.
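
As a reminder of how recall differs from precision, here is a small sketch using the standard definitions. The counts are made up for illustration and the function is not part of the workshop scripts.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision: how many detections were real. Recall: how many real calls were found."""
    detections = true_pos + false_pos
    actual_calls = true_pos + false_neg
    precision = true_pos / detections if detections else 0.0
    recall = true_pos / actual_calls if actual_calls else 0.0
    return precision, recall


# Illustrative counts: 40 calls found, 10 false alarms, 60 calls missed.
precision, recall = precision_recall(true_pos=40, false_pos=10, false_neg=60)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Adding false positives to the negative set targets precision; the missed calls that lower recall never show up in the inference output in the first place.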

You can add in the positive detections to your positive set, but the particular examples that it finds are those which it's already good at finding. If there are certain variations that it's good at finding and certain ones that it's bad at finding, it won't improve at those that it has difficulty with. 

Remember that the training and testing split is done randomly. If you add many new examples in from the same inference file, or the same location, especially if they are near each other within that file, your test/validation set will start to resemble your training set a bit too much to give reliable accuracy.  Remember that ideally we want to test the model on data that is at least as different from your training data as the deployment data. 

If you want to be thorough about this, you can modify the workflow so that you manage the training/validation split manually in a principled way, and keep a holdout test set, drawn from a different set of recordings not used elsewhere, for the end of the process. That is beyond the scope of this workshop.
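
One possible way to manage the split in a more principled way is to split by recording rather than by clip, so that clips from the same recording never end up on both sides. The sketch below is a suggestion only: it assumes each clip comes with an identifier of the recording it was cut from, and none of these names exist in the workshop scripts.

```python
import random


def split_by_recording(clips, test_fraction=0.2, seed=0):
    """clips: list of (clip_name, recording_id) pairs.

    Whole recordings are assigned to either the training or the test side.
    """
    recordings = sorted({rec for _, rec in clips})
    random.Random(seed).shuffle(recordings)
    n_test = max(1, int(round(len(recordings) * test_fraction)))
    test_recordings = set(recordings[:n_test])
    train = [c for c in clips if c[1] not in test_recordings]
    test = [c for c in clips if c[1] in test_recordings]
    return train, test


# Five recordings with four clips each (hypothetical names).
clips = [(f"rec{r}_clip{i}.wav", f"rec{r}") for r in range(5) for i in range(4)]
train, test = split_by_recording(clips)
print(len(train), "training clips,", len(test), "test clips")
```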

