# Use HawkEars predictions to find clear, focal samples of OSFL calls.

The aim is to preserve the relationship between the sound power level of the input audio and the output score of the model, so that louder (and hence closer) bird calls produce a higher score and more distant calls produce a lower score.

This is a requirement of downstream statistical applications of the model, such as estimating bird density from the output scores.

The initial dataset contains only a random sample of the first call heard by a human listener within a certain time interval. This provides a range of signal-to-noise ratios that is representative of what is expected in the field. However, the first call heard is not always the clearest call. The HawkEars model was therefore run to find the clearest calls, which can then be mixed into the training dataset.

## Some considerations

- The dataset was split 80/20 into training and testing data. The 20% test data is left untouched, and the remaining 80% has been further split into a validation and training set.

- The dataset is the set of recordings in which an Olive-sided Flycatcher call was detected by a human listener at some point. This means that sounds from other habitats are not present in the dataset. The effect of this should be tested by adding sounds from other habitats to the training set.

- It is the training set which needs augmenting with the clearest calls.

- The clearest calls are found by taking the top 1% of the scores from the HawkEars model. This is a somewhat arbitrary choice, but it is a good starting point.
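As a rough illustration of the top 1% selection described above, the sketch below keeps the rows at or above the 99th percentile of score. The column names `clip_id` and `score`, the helper `top_fraction`, and the toy data are assumptions made for this example; they are not the HawkEars output format.

```python
import pandas as pd


def top_fraction(scores: pd.DataFrame, score_col: str = "score",
                 fraction: float = 0.01) -> pd.DataFrame:
    """Return the rows whose score is at or above the (1 - fraction) quantile."""
    cutoff = scores[score_col].quantile(1 - fraction)
    selected = scores[scores[score_col] >= cutoff]
    return selected.sort_values(score_col, ascending=False)


# Toy stand-in for model detections: 200 clips with evenly spaced scores.
demo = pd.DataFrame({
    "clip_id": range(200),
    "score": [i / 199 for i in range(200)],
})
clearest = top_fraction(demo)
print(clearest)
```

Changing `fraction` makes it easy to test how sensitive the downstream results are to this somewhat arbitrary cutoff.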


## The process
To get the HawkEars predictions, the following steps were taken:
1. Output the URLs of all the recordings in the training set to a file.
2. Open a Google Colab instance and download all the recordings in the training set.
3. Run the HawkEars model on the recordings and save the predictions in a folder.
4. Copy this folder into the data/processed folder of this repo.

- The HawkEars model was used in its 'out of the box' state as of 7th Feb 2024. The threshold was 0.7.
- The run took 74 minutes on a paid Colab instance with a V100 GPU, billed at 5.36 compute units per hour.
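Step 1 of the list above could be done along the following lines. This is a minimal sketch, not the code that was actually run: the helper `write_url_list` and the output file name `train_urls.txt` are hypothetical, and the only thing taken from the dataset is the `recording_url` column name. The download and HawkEars steps are left out because they depend on the Colab setup.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_url_list(df: pd.DataFrame, out_file: Path,
                   url_col: str = "recording_url") -> int:
    """Write one unique, non-missing URL per line and return how many were written."""
    urls = df[url_col].dropna().drop_duplicates()
    out_file.write_text("\n".join(urls) + "\n")
    return len(urls)


# Toy frame standing in for the training split (the URLs are placeholders).
demo = pd.DataFrame({
    "recording_url": [
        "https://example.org/a.flac",
        "https://example.org/b.mp3",
        "https://example.org/a.flac",  # duplicate, written only once
        None,                          # missing URL, skipped
    ]
})

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "train_urls.txt"  # hypothetical file name
    n_written = write_url_list(demo, out_file)
    written_lines = out_file.read_text().splitlines()

print(n_written, written_lines)
```

A plain text file with one URL per line is easy to feed to a download loop in the Colab instance.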



In [1]:
# imports
import glob
from pathlib import Path
import pandas as pd

# make the repo root importable so that src.utils can be found
import sys
BASE_PATH = Path.cwd().parent.parent
sys.path.append(str(BASE_PATH))
from src.utils import display_all, keep_cols
import src.utils as utils
import sklearn
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, average_precision_score, roc_auc_score, precision_recall_curve

import matplotlib.pyplot as plt
import torch

In [2]:
data_path = Path.cwd().parent.parent / 'data'

# Look at the dataset which will be used to generate the focal recordings
- It consists of the 80% of the full dataset that remains after the test split, with a further 20% of that portion removed for use as a validation set during training.
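The two-stage split described here can be pictured with the sketch below. It uses simple random sampling on a toy frame; how the real split was made (for example, whether recordings were grouped by location) is not stated in this notebook, so treat the method, the seed, and the variable names as assumptions.

```python
import pandas as pd

# Toy frame standing in for the full dataset, indexed by recording id.
full = pd.DataFrame({"value": range(100)},
                    index=pd.RangeIndex(1000, 1100, name="recording_id"))

# Stage 1: hold out 20% as the untouched test set.
test = full.sample(frac=0.2, random_state=0)
train_and_valid = full.drop(test.index)

# Stage 2: remove a further 20% of the remainder as a validation set.
valid = train_and_valid.sample(frac=0.2, random_state=0)
train = train_and_valid.drop(valid.index)

print(len(train), len(valid), len(test))
```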

In [4]:
train_split = pd.read_pickle(data_path / "processed" / "train_set" / 'train_split.pkl')
train_split.head()

| recording_id | recording_url | task_method | project | detection_time | tag_duration | latitude | longitude | file_type | media_url | individual_order | location_id | is_valid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4396 | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | 1SPM | Boreal Wetland Community Monitoring | [27.28, 95.9] | [0.83, 1.18] | 57.292989 | -111.412116 | mp3 | https://portal.wildtrax.ca/home/aru-tasks/reco... | 1.0 | 355 | False |
| 4427 | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | 1SPM | Boreal Wetland Community Monitoring | [106.56, 122.66] | [1.0, 0.84] | 57.302163 | -111.376885 | mp3 | https://portal.wildtrax.ca/home/aru-tasks/reco... | 1.0 | 359 | False |
| 4429 | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | 1SPM | Boreal Wetland Community Monitoring | [31.11, 74.7, 139.78] | [1.38, 2.19, 1.29] | 57.302163 | -111.376885 | mp3 | https://portal.wildtrax.ca/home/aru-tasks/reco... | 1.0 | 359 | False |
| 4446 | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | 1SPM | Boreal Wetland Community Monitoring | [13.63, 74.88, 126.6] | [1.05, 0.89, 0.8] | 57.482905 | -111.378761 | mp3 | https://portal.wildtrax.ca/home/aru-tasks/reco... | 1.0 | 362 | False |
| 4452 | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | 1SPM | Boreal Wetland Community Monitoring | [11.31, 169.2] | [0.96, 1.18] | 57.482905 | -111.378761 | mp3 | https://portal.wildtrax.ca/home/aru-tasks/reco... | 1.0 | 362 | False |


In [6]:
len(train_split)

2371

Make sure that it corresponds to the downloaded recordings I have on disk.

In [7]:
recordings_on_disk = glob.glob(str(data_path / "raw" / "recordings" / "OSFL") + "/*.*")
# file names look like "recording-<id>.<ext>"; keep the <id> part (as a string)
recording_ids = [Path(file).stem.split("-")[1] for file in recordings_on_disk]
len(recording_ids), len(train_split) / len(recording_ids)

(2897, 0.8184328615809459)

The recordings on disk cover both the training and the validation sets. Separate the validation recordings from the training recordings.
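One way to do this separation is sketched below. It assumes file names of the form `recording-<id>.<ext>`, as seen on disk, and an integer recording id index like that of `train_split`; the helper `split_by_membership` and the demo paths are made up for illustration.

```python
from pathlib import Path


def split_by_membership(paths, train_ids):
    """Split file paths into (in training set, not in training set) by recording id."""
    train_ids = {int(i) for i in train_ids}
    train_paths, other_paths = [], []
    for p in paths:
        # "recording-4396.mp3" -> 4396; cast to int to match the integer index
        rec_id = int(Path(p).stem.split("-")[1])
        if rec_id in train_ids:
            train_paths.append(p)
        else:
            other_paths.append(p)
    return train_paths, other_paths


# Hypothetical paths; in the notebook these would be recordings_on_disk
# and train_split.index.values.
demo_paths = [
    "/data/recording-4396.mp3",
    "/data/recording-555101.flac",
    "/data/recording-4427.flac",
]
train_paths, valid_paths = split_by_membership(demo_paths, [4396, 4427])
print(train_paths, valid_paths)
```

Casting the id to `int` matters because ids parsed from file names are strings, while the DataFrame index holds integers, so a direct membership test would never match.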

In [16]:
# training df is indexed by recording id
train_split.index.values

array([  4396,   4427,   4429, ..., 826374, 826375, 829015])

In [17]:
recordings_on_disk.remove(...

['/Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/raw/recordings/OSFL/recording-555101.flac',
 '/Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/raw/recordings/OSFL/recording-622593.flac',
 '/Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/raw/recordings/OSFL/recording-48524.mp3',
 '/Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/raw/recordings/OSFL/recording-554710.flac',
 '/Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/raw/recordings/OSFL/recording-100689.flac',
 '/Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/raw/recordings/OSFL/recording-566231.flac',
 '/Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/raw/recordings/OSFL/recording-50033.flac',
 '/Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/raw/recordings/OSFL/recording-296498.mp3',
 '/Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/raw/recordings/OSFL/recording-556668.flac',
 '/Users/mikeg/code/machine_learning/osfl_cnn_reco

In [56]:
df_full = pd.read_csv(data_path / "raw" / "TrainingData_BU&Public_CWS_with_rec_links.csv")

