# 2. Generate labels for the data from a cleaned CSV file

The cleaned CSV file is indexed by individual bird vocalization times. We need to convert it into a dataframe indexed by 3-second audio clips, each with a present/absent label.

In this notebook we will:
- Download the recordings used to train the model.
- Make a dataframe indexed by overlapping 3-second windows of audio. The windows overlap by 50%, and the window length is chosen so that a target species vocalization fits entirely within a window with some time to spare.
- Generate target present and absent tags for each window from its overlap with the human-labelled clip start and end times.
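The windowing and labelling steps above can be sketched as follows. This is a minimal illustration, not the logic inside the `build` module (which is not shown in this notebook): the function names, the `start_time`/`end_time` index names, and the `min_overlap` threshold of 0.5 seconds are all assumptions made for the example.

```python
import pandas as pd


def make_windows(recording_duration, window_length=3.0, overlap=0.5):
    """Return (start, end) times of overlapping windows that fit in a recording."""
    step = window_length * (1 - overlap)  # 1.5 s step for 3 s windows at 50% overlap
    windows = []
    start = 0.0
    while start + window_length <= recording_duration:
        windows.append((start, start + window_length))
        start += step
    return windows


def label_windows(windows, tags, min_overlap=0.5):
    """Label a window present (1.0) if it overlaps any tag by at least min_overlap seconds.

    `tags` is a list of (start, end) times of human-labelled vocalizations.
    The min_overlap threshold is a hypothetical choice for this sketch.
    """
    rows = []
    for w_start, w_end in windows:
        present = any(
            min(w_end, t_end) - max(w_start, t_start) >= min_overlap
            for t_start, t_end in tags
        )
        rows.append(
            {"start_time": w_start, "end_time": w_end, "target_present": float(present)}
        )
    return pd.DataFrame(rows).set_index(["start_time", "end_time"])


# A 9-second recording with one tagged vocalization from 4.0 s to 5.5 s.
windows = make_windows(recording_duration=9.0)
labels = label_windows(windows, tags=[(4.0, 5.5)])
print(labels)
```

Because the windows overlap, a single vocalization usually produces several present windows, which is one reason the number of present clips can exceed the number of human tags.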

In [1]:
from pathlib import Path
import sys
import pandas as pd

BASE_PATH = Path.cwd().parent.parent
data_path = BASE_PATH / "data" 
sys.path.append(str(BASE_PATH / "src"))
sys.path.append(str(BASE_PATH / "src" / "data"))

In [2]:
import build



Load the processed data. This is a cleaned version of the WildTrax CSV data with additional columns for recording_url, latitude and longitude.

In [3]:
processed_df = pd.read_pickle(data_path / 'interim' / 'cleaned_metadata.pkl')
processed_df.head()

| Unnamed: 0 | organization | project | project_id | location | location_id | recording_date_time | recording_id | task_method | task_id | aru_task_status | ... | spectrogram_url | clip_url | sensorId | tasks | status | recording_url | latitude | longitude | location_buffer_m | file_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1623 | BU | Alberta Archetypes | 1501 | P-E0-1-10 | 308678 | 2022-06-05 06:51:00 | 416962 | no_restrictions | 596169 | Transcribed | ... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | ARU | 357 | Active | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | 52.64404 | -115.14051 |  | flac |
| 1752 | BU | Amplitude Quality Testing 2020 | 293 | AM-403-SE2 | 36043 | 2017-06-15 04:46:00 | 92051 | no_restrictions | 87956 | Transcribed | ... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | ARU | 174 | Published - Private | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | 54.607774 | -110.681271 |  | flac |
| 1758 | BU | Amplitude Quality Testing 2020 | 293 | AM-403-SE2 | 36043 | 2017-06-15 04:46:00 | 92051 | no_restrictions | 87898 | Transcribed | ... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | ARU | 174 | Published - Private | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | 54.607774 | -110.681271 |  | flac |
| 1761 | BU | Amplitude Quality Testing 2020 | 293 | AM-403-SE2 | 36043 | 2017-06-15 04:46:00 | 92051 | no_restrictions | 87840 | Transcribed | ... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | ARU | 174 | Published - Private | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | 54.607774 | -110.681271 |  | flac |
| 1764 | BU | Amplitude Quality Testing 2020 | 293 | AM-403-SE2 | 36043 | 2017-06-15 04:46:00 | 92051 | no_restrictions | 87927 | Transcribed | ... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | ARU | 174 | Published - Private | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | 54.607774 | -110.681271 |  | flac |


If you have an existing test set, make sure it doesn't end up in the training data. Otherwise the model will be tested on audio it already encountered during training, a form of data leakage that produces overoptimistic performance estimates.

In [4]:
# Load the test data if you have it. 
existing_test_set = pd.read_csv(data_path / 'raw' / "SingleSpecies_all.csv", low_memory=False)
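One way to keep held-out audio out of training is to drop every candidate recording whose location appears in the test set. The sketch below shows the idea on toy data. It assumes both tables carry a `location_id` column (true of `processed_df` above, but an assumption for `SingleSpecies_all.csv`), and it is not the logic inside `build.new_labelled_df`, which is not shown in this notebook.

```python
import pandas as pd

# Toy stand-ins for processed_df and existing_test_set (hypothetical values).
candidates = pd.DataFrame(
    {"recording_id": [1, 2, 3, 4], "location_id": [10, 10, 20, 30]}
)
held_out = pd.DataFrame({"location_id": [20]})

# Keep only recordings whose location is absent from the held-out test set.
is_held_out = candidates["location_id"].isin(held_out["location_id"])
train_candidates = candidates[~is_held_out]

# Sanity check: the two sets share no locations.
shared = set(train_candidates["location_id"]) & set(held_out["location_id"])
assert not shared, f"Leaked locations: {shared}"

print(train_candidates)
```

Excluding whole locations is stricter than excluding individual recordings: it also prevents the model from being tested at a site whose background soundscape it has already heard.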

# Train/test split
### Download audio files and make a train+validation/test split of the database. 

This downloads recordings that contain the target species. There may be far too many to download them all, so aim for 2000-4000 present clips in total in the training set. The call below passes `download_n=0`, and its log reports "downloading 0 clips".

The dataframes created here will only reference audio which has been downloaded. 

In [5]:
train_and_valid_df, test_df = build.new_labelled_df(
    processed_df,
    target_species="OSFL",
    download_n=0,
    existing_test_set=existing_test_set,
    seed=42,
)

485 not downloaded
downloading 0 clips
skipped 0 previously downloaded files
dropped 77 locations from training set

--------------------------------------------------
train set
clips per task method = 
 task_method
1SPT               35450
1SPM               10828
no_restrictions     2660
Name: count, dtype: int64
total clips = 48938

clips generated from each tagging method:
                 target_present  target_absent
task_method                                   
1SPM                     1519.0         9309.0
1SPT                     2158.0        33292.0
no_restrictions           313.0         2347.0
total present clips =  3990
total absent clips =  44948
total available human labelled tags = 48938

--------------------------------------------------
valid set
clips per task method = 
 task_method
1SPT               7655
1SPM               2899
no_restrictions    1053
Name: count, dtype: int64
total clips = 11607

clips generated from each tagging method:
                 target_

# Save the test split somewhere out of the way
Don't look at it until after model training and hyperparameter tuning is complete. This is the data the model will be evaluated on after training. 

In [6]:
test_set_dir = data_path / 'interim' / 'test_set'
# Create the folder (and any missing parents) if it doesn't exist yet.
test_set_dir.mkdir(parents=True, exist_ok=True)
test_df.to_pickle(test_set_dir / 'test_set.pkl')

# Save the training and validation sets in a different folder
This is the data the model will be trained on and validated against during training.

In [7]:
train_and_valid_set_dir = data_path / 'interim' / 'train_and_valid_set'
# Create the folder (and any missing parents) if it doesn't exist yet.
train_and_valid_set_dir.mkdir(parents=True, exist_ok=True)
train_and_valid_df.to_pickle(train_and_valid_set_dir / 'train_and_valid_set.pkl')

# Split the train and valid set by location in the same way as the train/test split was made


In [8]:
# Make the folders
train_set_dir = data_path / 'interim' / 'train_set'
valid_set_dir = data_path / 'interim' / 'valid_set'
train_set_dir.mkdir(parents=True, exist_ok=True)
valid_set_dir.mkdir(parents=True, exist_ok=True)

# Split the train and valid set
train_df, valid_df = build.make_train_valid_split(train_and_valid_df, seed=42, pct_train=0.8)
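
Splitting by location means that every clip from a given site lands on the same side of the split. The following is a rough sketch of that idea, not the implementation of `build.make_train_valid_split`: the function `split_by_location` and its `group_col` parameter are hypothetical, and the toy dataframe stands in for `train_and_valid_df`.

```python
import numpy as np
import pandas as pd


def split_by_location(df, pct_train=0.8, seed=42, group_col="location_id"):
    """Assign whole locations to train or valid so no location appears in both."""
    rng = np.random.default_rng(seed)
    # Shuffle the unique locations reproducibly, then cut the list at pct_train.
    locations = rng.permutation(df[group_col].unique())
    n_train = int(round(len(locations) * pct_train))
    in_train = df[group_col].isin(locations[:n_train])
    return df[in_train], df[~in_train]


# Toy data: 10 locations with 3 clips each (hypothetical values).
toy = pd.DataFrame({"location_id": np.repeat(np.arange(10), 3), "target_present": 0.0})
train_toy, valid_toy = split_by_location(toy)

# No location is shared between the two sets, and no clip is lost.
assert not set(train_toy["location_id"]) & set(valid_toy["location_id"])
assert len(train_toy) + len(valid_toy) == len(toy)
```

Note that `pct_train` here applies to locations rather than clips, so when locations contribute different numbers of clips the clip-level ratio only approximates 80/20.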



In [9]:
build.report_counts(train_df, header="Clips in training dataset")
build.report_counts(valid_df, header="Clips in validation dataset")


--------------------------------------------------
Clips in training dataset
clips per task method = 
 task_method
1SPT               29010
1SPM                8755
no_restrictions     1915
Name: count, dtype: int64
total clips = 39680

clips generated from each tagging method:
                 target_present  target_absent
task_method                                   
1SPM                     1239.0         7516.0
1SPT                     1813.0        27197.0
no_restrictions           251.0         1664.0
total present clips =  3303
total absent clips =  36377
total available human labelled tags = 39680

--------------------------------------------------
Clips in validation dataset
clips per task method = 
 task_method
1SPT               6440
1SPM               2073
no_restrictions     745
Name: count, dtype: int64
total clips = 9258

clips generated from each tagging method:
                 target_present  target_absent
task_method                                   
1SPM         

# Save the train and valid set into different folders

In [10]:
# Save the train and valid sets
train_df.to_pickle(data_path / 'interim' / 'train_set' / 'train_set.pkl')
valid_df.to_pickle(data_path / 'interim' / 'valid_set' / 'valid_set.pkl')