Guidelines adapted from the Google Data Prep and Feature Engineering in ML course.

https://developers.google.com/machine-learning/data-prep

Size and quality
- The size and quality of the dataset matter more than which model and training protocol are used. Garbage in, garbage out.
- As a rule of thumb, the model should train on at least one order of magnitude more examples than trainable parameters. 

Reliability
- How common are label errors? If the dataset is labeled by humans, then how often do mistakes occur?
- Missing values
- Duplicated examples
- Mislabeled examples
- Bad features (e.g. low-quality encoding)
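
Some of the checks above can be automated. The following is a minimal sketch using only the Python standard library. The record layout (`clip_id`, `aru_id`, `label`) and the function name `audit_examples` are assumptions made for illustration, not part of the project code. It only covers missing values and duplicated examples; label errors still need manual spot checks.

```python
from collections import Counter

def audit_examples(rows, required=("clip_id", "aru_id", "label")):
    """Count missing values and find duplicated clip ids in a list of dict records."""
    # A value counts as missing if the key is absent, None, or an empty string.
    missing = Counter(
        field
        for row in rows
        for field in required
        if row.get(field) in (None, "")
    )
    # A clip id that appears more than once indicates a duplicated example.
    id_counts = Counter(
        row["clip_id"] for row in rows if row.get("clip_id") not in (None, "")
    )
    duplicates = sorted(cid for cid, n in id_counts.items() if n > 1)
    return {"missing": dict(missing), "duplicates": duplicates}

# Hypothetical records, for illustration only.
rows = [
    {"clip_id": "a1", "aru_id": "ARU-01", "label": "song"},
    {"clip_id": "a1", "aru_id": "ARU-01", "label": "song"},  # duplicated example
    {"clip_id": "b2", "aru_id": "ARU-02", "label": None},    # missing label
]
report = audit_examples(rows)
```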

During training, use only the features that you'll have available in serving, and make sure your training set is representative of your serving traffic.


The Golden Rule: Do unto training as you would do unto prediction. That is, the more closely your training task matches your prediction task, the better your ML system will perform. 

### Steps to constructing a dataset
1. Collect the raw data.
   Currently the raw data for the osfl song clips is downloaded, and the raw data for the no_song dataset is on AWS cloud servers.
2. Identify feature and label sources.
   Features will be from the shapes in the spectrogram images. Labels are human-labelled tags.
3. Select a sampling strategy.
4. Split the data.



If data is restricted and a sample needs to be taken, ensure the sample is spread out both temporally, to reduce the effects of seasonal variation, and spatially.
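
One possible way to spread a limited sample over time and space is to group the recordings and draw from each group in turn. The sketch below assumes each record carries hypothetical `location` and `month` fields; the function name `spread_sample` and the choice of month as the time unit are illustrative assumptions, not the project's actual sampling strategy.

```python
import random
from collections import defaultdict

def spread_sample(records, n, seed=0):
    """Pick up to n records, cycling over (location, month) groups so no group dominates."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["location"], rec["month"])].append(rec)
    # Shuffle within each group so the choice inside a group is random.
    for members in groups.values():
        rng.shuffle(members)
    sample = []
    # Take one record from each group in turn until n are collected
    # or every group is exhausted.
    while len(sample) < n and any(groups.values()):
        for key in sorted(groups):
            if groups[key] and len(sample) < n:
                sample.append(groups[key].pop())
    return sample

# Hypothetical metadata: one location/month pair dominates the pool.
records = (
    [{"clip": f"a{i}", "location": "site_A", "month": 6} for i in range(10)]
    + [{"clip": "b0", "location": "site_B", "month": 6}]
    + [{"clip": "c0", "location": "site_A", "month": 7}]
)
sample = spread_sample(records, n=3)
```

With a plain random draw of three records, all three would most likely come from the dominant group; here each of the three groups contributes one record.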

_A note on unbalanced data:_

Normally when there's a class imbalance in a dataset, the abundant class is downsampled, then examples from this class are upweighted in proportion to the factor by which the class was downsampled. For example, if there are 100 examples of case A for every 1 example of case B, then we might downsample case A by a factor of 10, so that the ratio of A to B is 10:1 instead of 100:1 in each mini-batch.

To keep the model calibrated, the loss for each remaining example of A is then weighted 10 times as heavily as it would have been without downsampling. This keeps the model's outputs calibrated in the sense that the outputs can still be treated as probabilities.
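
The downsample-then-upweight idea described above can be sketched as follows. This is an illustration only: the function name `downsample_and_weight` and the `(features, label)` tuple layout are assumptions, and the notes below go on to choose a balanced dataset instead of this approach.

```python
import random

def downsample_and_weight(examples, majority_label, factor, seed=0):
    """Keep 1/factor of the majority class and give each kept example weight `factor`."""
    rng = random.Random(seed)
    majority = [ex for ex in examples if ex[1] == majority_label]
    minority = [ex for ex in examples if ex[1] != majority_label]
    # Randomly choose which majority examples survive the downsampling.
    rng.shuffle(majority)
    kept = majority[: len(majority) // factor]
    # Each kept majority example stands in for `factor` original examples.
    weighted = [(features, label, float(factor)) for features, label in kept]
    # Minority examples are all kept with weight 1.
    weighted += [(features, label, 1.0) for features, label in minority]
    return weighted

# 100:1 imbalance, as in the example above (features are placeholders).
examples = [(i, "A") for i in range(1000)] + [(i, "B") for i in range(10)]
weighted = downsample_and_weight(examples, majority_label="A", factor=10)
```

After this step the ratio of A to B is 10:1, while the summed weight of class A still equals its original count of 1000. The third element of each tuple would typically be passed to the training loop as a per-example weight on the loss.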

However, we're constructing an artificial dataset here. We can choose for the classes to be equally weighted: the 'song' clips can be exactly as numerous as the 'no-song' clips. 

From the perspective of the model in deployment, however, the ratio of song to no-song will be very different. Firstly, there is generally more silence than birdsong in the real environment. Another consideration is that if a signal detection algorithm is used, it will filter out a lot of the silence and change the class balance again, depending on the features of the algorithm.

In addition, we don't need the actual probability that we detected a bird - only numbers proportional to the probability - since we can pick a threshold for the recognizer. 

With all this considered, I've decided the simplest approach is to create an equally balanced dataset of song / no-song clips then train a model on these, and come back to the issue of class imbalance if it arises later. 



Train test split
- This should be smarter than a simple random shuffle of all the data, because a model can learn the specific background noises present at a location, then use this information, rather than the shape / sound of a bird's song, to make a prediction. 

- It would be better to draw the validation set from completely separate ARUs.
- Alternatives: split by day, split by project, split by location.
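
A grouped split of this kind could look like the sketch below, which assigns whole ARUs to either the training or the validation set. The `aru_id` field and the function name `split_by_aru` are assumptions for illustration. Note that `val_fraction` here is a fraction of ARUs, not of clips, so the resulting clip counts depend on how many clips each ARU contributed.

```python
import random

def split_by_aru(clips, val_fraction=0.2, seed=0):
    """Assign whole ARUs to train or validation so that no ARU appears in both."""
    rng = random.Random(seed)
    aru_ids = sorted({clip["aru_id"] for clip in clips})
    rng.shuffle(aru_ids)
    # Hold out at least one ARU for validation.
    n_val = max(1, round(len(aru_ids) * val_fraction))
    val_arus = set(aru_ids[:n_val])
    train = [c for c in clips if c["aru_id"] not in val_arus]
    val = [c for c in clips if c["aru_id"] in val_arus]
    return train, val

# Hypothetical clips: 5 ARUs with 4 clips each.
clips = [
    {"clip_id": f"{aru}-{i}", "aru_id": f"ARU-{aru:02d}"}
    for aru in range(5)
    for i in range(4)
]
train, val = split_by_aru(clips, val_fraction=0.2)
```

The same function would cover the alternatives by grouping on a different key, such as a recording date, project, or location field.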


Once the data split has been decided, download the relevant audio files and cut the audio segments from the downloaded files.
If necessary, throw the full-length files away afterwards to save disk space, or do the cutting in the cloud and download only the clips.

Throughout the data-building phase, keep plotting scatter plots and bar plots to make sure the distribution of the data matches expectations and is balanced.



### Is the model just learning the acoustic signature of the ARUs which recorded an olive-sided flycatcher?

It is important to make sure that the datasets contain positive and negative examples of recordings from the same ARUs to stop the model from learning a prediction such as 'if the background noise is like this, then predict olive-sided flycatcher'. This could have been happening with the dummy datasets if the osfl clips tended to be clustered around certain ARUs and habitats, and the other vocalizations were from a more diverse spread of locations.
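
A quick way to check for this is to list the ARUs that contribute only one class. The sketch below assumes each clip record has hypothetical `aru_id` and `label` fields, and that the two labels are spelled `song` and `no_song`; the function name `arus_missing_a_class` is likewise made up for illustration.

```python
from collections import defaultdict

def arus_missing_a_class(clips, classes=("song", "no_song")):
    """Return the ARUs that lack at least one class, mapped to the classes they lack."""
    seen = defaultdict(set)
    for clip in clips:
        seen[clip["aru_id"]].add(clip["label"])
    wanted = set(classes)
    return {
        aru: sorted(wanted - labels)
        for aru, labels in seen.items()
        if wanted - labels
    }

# Hypothetical clips: ARU-02 only ever contributes positive examples.
clips = [
    {"aru_id": "ARU-01", "label": "song"},
    {"aru_id": "ARU-01", "label": "no_song"},
    {"aru_id": "ARU-02", "label": "song"},
]
unbalanced = arus_missing_a_class(clips)
```

Any ARU returned here is a candidate for either collecting examples of the missing class or excluding that ARU from the dataset.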

