# Use HawkEars predictions to find clear, focal samples of OSFL calls.

This is done to try to preserve the relationship between the sound power level in the input audio and the output score of the model, so that louder (and hence closer) bird calls produce a higher score and more distant calls produce a lower score.

This is a requirement of downstream statistical applications of the model, such as estimating bird density from the output scores.

The initial dataset contains only a random sample of the first call heard by a human listener within a certain time interval. This provides a range of signal-to-noise ratios that is representative of what is expected in the field. However, the first call heard is not always the clearest call. Therefore the HawkEars model was run to find the clearest calls, which can then be mixed into the training dataset.

## Some considerations

- The dataset was split 80/20 into training and testing data. The 20% test data is left untouched, and the remaining 80% has been further split into a validation and training set.

- The dataset is the set of recordings in which an Olive-sided Flycatcher (OSFL) call was at some point detected by a human listener. This means that sounds from other habitats are not present in the dataset. The effects of this should be tested by adding sounds from other habitats to the training set.

- It is the training set which needs augmenting with the clearest calls.

- The clearest calls are found by taking the top 1% of the scores from the HawkEars model. This is a somewhat arbitrary choice, but it is a good starting point.
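
A minimal sketch of how a "top fraction" cutoff could be derived from the scores themselves, rather than picked by hand. The scores below are made-up stand-ins for HawkEars OSFL scores, and `top_fraction` is a hypothetical helper, not part of HawkEars or this repo. Note that the filtering cell later in this notebook uses a fixed cutoff of 0.97 instead.

```python
import pandas as pd

# Made-up scores standing in for HawkEars OSFL detection scores.
scores = pd.Series([0.71, 0.75, 0.80, 0.85, 0.90, 0.93, 0.95, 0.97, 0.99, 1.00])


def top_fraction(scores: pd.Series, fraction: float = 0.01) -> pd.Series:
    """Return the scores at or above the (1 - fraction) quantile."""
    cutoff = scores.quantile(1 - fraction)
    return scores[scores >= cutoff]


# A large fraction is used here only because the demo series is tiny.
top = top_fraction(scores, fraction=0.2)
```

Because HawkEars scores are rounded to two decimals and many detections tie at the top, the number of rows returned can be noticeably larger than the nominal fraction.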


## The process
To get the HawkEars predictions, the following steps were taken:
1. Output the URLs of all the recordings in the training set to a file.
2. Open a Google Colab instance and download all the recordings in the training set.
3. Run the HawkEars model on the recordings and save the predictions in a folder.
4. Finally, drag this folder into the data/processed folder of this repo.

- The HawkEars model was used in its 'out of the box' state as of 7th Feb 2024, with a score threshold of 0.7.
- This took 74 minutes to run on a paid Colab instance with a V100 GPU, at a cost of 5.36 compute units per hour.
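
Step 1 above could look something like the following sketch. It assumes the URLs live in a `recording_url` column, as in the `train_split` table shown further down; the `write_urls` helper, the output file name and the example URLs are all hypothetical.

```python
from pathlib import Path
import tempfile

import pandas as pd


def write_urls(df: pd.DataFrame, out_file: Path, url_col: str = "recording_url") -> int:
    """Write one unique URL per line and return how many were written."""
    urls = df[url_col].dropna().drop_duplicates()
    out_file.write_text("\n".join(urls) + "\n")
    return len(urls)


# Made-up stand-in for train_split, including one repeated URL.
demo = pd.DataFrame(
    {
        "recording_url": [
            "https://example.org/a.mp3",
            "https://example.org/b.flac",
            "https://example.org/a.mp3",
        ]
    },
    index=pd.Index([1, 2, 3], name="recording_id"),
)

out_file = Path(tempfile.mkdtemp()) / "train_urls.txt"
n_written = write_urls(demo, out_file)
```

A plain one-URL-per-line file like this can be passed to a downloader on the Colab side, for example `wget -i train_urls.txt`.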



In [1]:
# imports
import glob
from pathlib import Path
import pandas as pd
from pandas.errors import EmptyDataError

# for utils (Path is already imported above)
import sys
BASE_PATH = Path.cwd().parent.parent
sys.path.append(str(BASE_PATH))
from src.utils import display_all, keep_cols
import src.utils as utils
import sklearn
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, average_precision_score, roc_auc_score, precision_recall_curve

import matplotlib.pyplot as plt
import torch

In [2]:
data_path = Path.cwd().parent.parent / 'data'

# Look at the dataset which will be used to generate the focal recordings
- It consists of the 80% training portion of the full dataset, with a further 20% of that portion removed for use as a validation set during training.

In [3]:
train_split = pd.read_pickle(data_path / "processed" / "train_set" / 'train_split.pkl')
train_split.head()

recording_id,recording_url,task_method,project,detection_time,tag_duration,latitude,longitude,file_type,media_url,individual_order,location_id,is_valid
4396,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,1SPM,Boreal Wetland Community Monitoring,"[27.28, 95.9]","[0.83, 1.18]",57.292989,-111.412116,mp3,https://portal.wildtrax.ca/home/aru-tasks/reco...,1.0,355,False
4427,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,1SPM,Boreal Wetland Community Monitoring,"[106.56, 122.66]","[1.0, 0.84]",57.302163,-111.376885,mp3,https://portal.wildtrax.ca/home/aru-tasks/reco...,1.0,359,False
4429,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,1SPM,Boreal Wetland Community Monitoring,"[31.11, 74.7, 139.78]","[1.38, 2.19, 1.29]",57.302163,-111.376885,mp3,https://portal.wildtrax.ca/home/aru-tasks/reco...,1.0,359,False
4446,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,1SPM,Boreal Wetland Community Monitoring,"[13.63, 74.88, 126.6]","[1.05, 0.89, 0.8]",57.482905,-111.378761,mp3,https://portal.wildtrax.ca/home/aru-tasks/reco...,1.0,362,False
4452,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,1SPM,Boreal Wetland Community Monitoring,"[11.31, 169.2]","[0.96, 1.18]",57.482905,-111.378761,mp3,https://portal.wildtrax.ca/home/aru-tasks/reco...,1.0,362,False


In [5]:
len(train_split.index.values)

2371

Make sure that the training set split corresponds to the downloaded recordings I have on disk.

In [26]:
recordings_on_disk = glob.glob(str(data_path / "raw" / "recordings" / "OSFL") + "/*.*")
recording_ids = [file.split("/")[-1].split(".")[0].split("-")[1] for file in recordings_on_disk]
len(recording_ids), len(train_split) / len(recording_ids), len(train_split)

(2897, 0.8184328615809459, 2371)

There are 2897 recordings on disk, while the training split (after the validation set was removed) contains 2371 recording IDs.

Filter the recordings on disk to only include the ones in the training set.

In [27]:
def get_recording_id(file):
    return int(file.split("/")[-1].split(".")[0].split("-")[1])

train_recs = [file for file in recordings_on_disk if get_recording_id(file) in train_split.index.values]

len(train_recs)

2308

After filtering the recordings on disk to include only the ones in the training set, there are only 2308 recordings instead of the expected 2371. It is not clear why this is the case.
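
One way to investigate the gap is to list exactly which training IDs have no file on disk. The sketch below uses made-up paths and IDs; `recording_id_from_path` and `missing_ids` are hypothetical helpers that assume the `recording-<id>.<ext>` naming seen above. If every file on disk has a unique ID, the gap is 2371 - 2308 = 63 recordings.

```python
from pathlib import Path


def recording_id_from_path(path: str) -> int:
    """Extract 555135 from '.../recording-555135.flac'."""
    return int(Path(path).stem.split("-")[1])


def missing_ids(expected_ids, files_on_disk):
    """Return the expected IDs that have no matching file, sorted."""
    on_disk = {recording_id_from_path(f) for f in files_on_disk}
    return sorted(set(expected_ids) - on_disk)


# Made-up example: three expected IDs, two files present.
demo_files = ["/tmp/OSFL/recording-4396.mp3", "/tmp/OSFL/recording-4427.flac"]
demo_expected = [4396, 4427, 4429]
missing = missing_ids(demo_expected, demo_files)
```

In the notebook this would be called as `missing_ids(train_split.index.values, recordings_on_disk)`, and the resulting IDs could be checked against the download logs.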

# We want to augment the training set with high quality recordings. 
HawkEars was run on the training set and any scores over 0.7 were saved. 

Keep only the OSFL detections from these recordings (dropping the other species) and count how many detections there are.

In [28]:
hawkears_output_files = glob.glob(str(data_path/'processed'/'hawkears_predictions/*.txt'))
len(hawkears_output_files)

2371

In [29]:
def get_recording_id(hawkears_out):
    return int(hawkears_out.split("/")[-1].split(".")[0].split("_")[0])


In [33]:
dfs = []
for file in hawkears_output_files:
    print(f"processing {file}")
    recording_id = get_recording_id(file)
    try:
        df = pd.read_csv(file, sep="\t", header=None)
        df.columns = ["start", "end", "label;score"]
        df["recording_id"] = recording_id
    except EmptyDataError:
        print("(file was empty)")
        continue
    dfs.append(df)

processing /Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/processed/hawkears_predictions/292337_HawkEars.txt
processing /Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/processed/hawkears_predictions/293662_HawkEars.txt
processing /Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/processed/hawkears_predictions/555135_HawkEars.txt
processing /Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/processed/hawkears_predictions/556684_HawkEars.txt
processing /Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/processed/hawkears_predictions/815670_HawkEars.txt
processing /Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/processed/hawkears_predictions/556082_HawkEars.txt
processing /Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/processed/hawkears_predictions/294585_HawkEars.txt
processing /Users/mikeg/code/machine_learning/osfl_cnn_recognizer/data/processed/hawkears_predictions/291905_HawkEars.txt
processing /Users/mikeg/
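
For reference, the loop above treats each HawkEars output file as tab-separated rows of start time, end time and a combined `label;score` field. The sketch below parses a made-up sample in that layout from an in-memory buffer; the sample rows, the `SWTH` entry and the `parse_labels` helper are illustrative assumptions, not actual HawkEars output.

```python
import io

import pandas as pd

# Made-up rows in the layout the loop above expects: start<TAB>end<TAB>label;score
sample = "0.00\t60.00\tOSFL;1.00\n63.00\t67.50\tOSFL;0.98\n70.50\t73.50\tSWTH;0.81\n"


def parse_labels(buffer, recording_id: int) -> pd.DataFrame:
    """Parse one label file into start, end, label, score and recording_id columns."""
    df = pd.read_csv(buffer, sep="\t", header=None, names=["start", "end", "label;score"])
    df[["label", "score"]] = df["label;score"].str.split(";", expand=True)
    df["score"] = df["score"].astype(float)  # scores arrive as strings after the split
    df["recording_id"] = recording_id
    return df.drop(columns="label;score")


parsed = parse_labels(io.StringIO(sample), recording_id=555135)
```

Casting `score` to float at parse time avoids having to call `.astype(float)` every time the scores are compared against a threshold later on.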

### Add the full file path to the recordings df since this is used by opensoundscape for finding files

In [64]:
recordings_df = pd.DataFrame(recordings_on_disk, columns=["full_path"])
recordings_df['file_id'] = recordings_df['full_path'].apply(lambda x: int(x.split("/")[-1].split(".")[0].split("-")[1]))
recordings_df['file_extension'] = recordings_df['full_path'].apply(lambda x: x.split("/")[-1].split(".")[1])


Drop the recordings which are not in the training set, since this is a potential source of data leakage from validation into training set. 

In [67]:
# drop the recordings_df rows which are not in the train_split 
recordings_df = recordings_df[recordings_df['file_id'].isin(train_split.index.values)]
# make a dictionary from file id to file extension
file_id_to_ext_dict = recordings_df.set_index('file_id')['file_extension'].to_dict()


# Make a new dataframe from the hawkears output
- contains the relative path to each recording file
- contains the start, end and species tag for each detection
- therefore we can easily convert these into an opensoundscape AudioFileDataset. 

In [75]:
result_df = pd.concat(dfs, ignore_index=True)
result_df[['label', 'score']] = result_df['label;score'].str.split(';', expand=True)
del(result_df['label;score'])

result_df['file_extension'] = result_df['recording_id'].map(lambda x: file_id_to_ext_dict.get(x, 'Missing'))

result_df.drop(result_df[result_df['file_extension'] == 'Missing'].index, inplace=True)


In [88]:
# concat and parse the hawkears output files
result_df = pd.concat(dfs, ignore_index=True)
result_df[['label', 'score']] = result_df['label;score'].str.split(';', expand=True)
del(result_df['label;score'])

# explicitly drop entries which were not in both the train split and the downloaded recordings.
result_df.drop(result_df[~result_df['recording_id'].isin(file_id_to_ext_dict.keys())].index, inplace=True)
result_df['file_extension'] = result_df['recording_id'].map(lambda x: file_id_to_ext_dict[x])
len(result_df)


69832

In [89]:
start_of_path = "../../data/raw/recordings/OSFL/recording-"
result_df['full_path'] = start_of_path + result_df['recording_id'].astype(str) + "." + result_df['file_extension']
# drop all the rows where the label is not "OSFL"
result_df.drop(result_df[result_df['label'] != 'OSFL'].index, inplace=True)
# drop all the rows where the score is less than 0.97
top_osfls = result_df.drop(result_df[result_df['score'].astype(float) < 0.97].index)
top_osfls.head()

,start,end,recording_id,label,score,file_extension,full_path
41,0.0,60.0,555135,OSFL,1.0,flac,../../data/raw/recordings/OSFL/recording-55513...
89,63.0,67.5,815670,OSFL,0.98,flac,../../data/raw/recordings/OSFL/recording-81567...
327,0.0,12.0,555132,OSFL,0.99,flac,../../data/raw/recordings/OSFL/recording-55513...
328,13.5,27.0,555132,OSFL,0.99,flac,../../data/raw/recordings/OSFL/recording-55513...
329,45.0,58.5,555132,OSFL,0.99,flac,../../data/raw/recordings/OSFL/recording-55513...


In [90]:
len(top_osfls)

1426

In [92]:
top_osfls.set_index(['full_path', 'start', 'end'])

full_path,start,end,recording_id,label,score,file_extension
../../data/raw/recordings/OSFL/recording-555135.flac,0.0,60.0,555135,OSFL,1.00,flac
../../data/raw/recordings/OSFL/recording-815670.flac,63.0,67.5,815670,OSFL,0.98,flac
../../data/raw/recordings/OSFL/recording-555132.flac,0.0,12.0,555132,OSFL,0.99,flac
../../data/raw/recordings/OSFL/recording-555132.flac,13.5,27.0,555132,OSFL,0.99,flac
../../data/raw/recordings/OSFL/recording-555132.flac,45.0,58.5,555132,OSFL,0.99,flac
...,...,...,...,...,...,...
../../data/raw/recordings/OSFL/recording-293120.mp3,78.0,99.0,293120,OSFL,0.98,mp3
../../data/raw/recordings/OSFL/recording-293120.mp3,100.5,108.0,293120,OSFL,0.97,mp3
../../data/raw/recordings/OSFL/recording-293120.mp3,111.0,135.0,293120,OSFL,0.98,mp3
../../data/raw/recordings/OSFL/recording-293120.mp3,136.5,174.0,293120,OSFL,0.98,mp3
