# This notebook aims to identify sections of audio that contain the target vocalization and sections that do not.

__Proxies for target vocalization:__
- Audio from a section of a recording containing the target vocalization, taken from within the time window of the tag.
- Audio from a recording with tagging method 'no restrictions' AND taken from within the time window of a target species tag.

__Proxies for NOT target vocalization:__
- Audio from a recording with tagging method '1SPM' AND there is no target species tag in the recording.
- Audio from a recording with tagging method '1SPM' AND there is a target species tag in the recording AND the sample is taken from before the start of the target species tag.
- Audio from a recording with tagging method 'no restrictions' AND taken from in between tags of the target species.
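
The proxy rules above can be sketched as a single labelling function. This is a minimal illustration, not the notebook's actual implementation: the function name `label_sample`, the `(start, duration)` tag format, and the `"unknown"` fallback are all assumptions, and "in between tags" is read here as "after the first target tag ends and before the last one starts, without overlapping any tag".

```python
def label_sample(task_method, sample_start, sample_end, target_tags):
    """Label one audio sample using the proxy rules (hypothetical helper).

    target_tags: list of (start, duration) pairs, in seconds, for the
    target species in this recording. Empty if the species was not tagged.
    """
    windows = sorted((s, s + d) for s, d in target_tags)

    # Target proxy: the sample lies entirely inside one tag window.
    if any(ws <= sample_start and sample_end <= we for ws, we in windows):
        return "target"

    if task_method == "1SPM":
        # No target tag at all, or the sample ends before the first tag starts.
        if not windows or sample_end <= windows[0][0]:
            return "not_target"
    elif task_method == "no restrictions" and windows:
        overlaps = any(sample_start < we and ws < sample_end for ws, we in windows)
        first_end = min(we for _, we in windows)
        last_start = max(ws for ws, _ in windows)
        # Assumed reading of "in between tags": strictly between first and last tag.
        if not overlaps and sample_start >= first_end and sample_end <= last_start:
            return "not_target"

    # Anything the rules do not cover is left unlabelled.
    return "unknown"


tags = [(27.28, 0.83), (95.9, 1.18)]
print(label_sample("no restrictions", 27.3, 28.0, tags))
print(label_sample("no restrictions", 40.0, 43.0, tags))
print(label_sample("1SPM", 5.0, 8.0, [(27.28, 0.83)]))
```

Returning `"unknown"` rather than guessing keeps samples that fall outside the listed proxies out of both classes.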

In [2]:
import sys
from pathlib import Path
import pandas as pd
BASE_PATH = Path.cwd().parents[1]
sys.path.append(str(BASE_PATH / "src" / "data"))  # for clean_csv and train_test_split
sys.path.append(str(BASE_PATH / "src"))  # for utils
from utils import *

In [3]:
%load_ext autoreload
%autoreload 2

In [27]:
df = pd.read_pickle(BASE_PATH / "data"/"processed" / "train_set" / "train_set.pkl")
df_lite = df[keep_cols]
osfls = df_lite.loc[df.species_code == 'OSFL']

### What's the distribution of the different tagging methods?

In [11]:
df.task_method.value_counts(dropna=False)

task_method
1SPT                        201904
1SPM                        178249
NaN                          50291
1SPM Audio/Visual hybrid      2530
Name: count, dtype: int64

## How many recording files are there in the training set?

In [28]:
unique_recordings = df.recording_id.unique()
recordings_containing_target_species = osfls.recording_id.unique()
print(f"{unique_recordings.shape} unique recordings, {recordings_containing_target_species.shape} recordings with the target species.")

(54416,) unique recordings, (2967,) recordings with the target species.


### Look at how to group df by recording and keep the other info
- Every tag entry that belongs to the same recording carries the same recording ID and URL.
- It would be nice to have the database indexed by recording ID, with the species tags, clip start/stop times, etc. stored as a list per recording. That way we could see the timestamps of all the clips for one recording.

- After grouping by recording, we can pass a dictionary to `agg` so that different columns are aggregated differently. This way we end up with a list of all the target species start times and durations per recording.

In [80]:
target_species = 'OSFL'
filtered_df = df.loc[df.species_code == target_species]
grouped = filtered_df.groupby('recording_id').agg({'recording_url': 'first', 'detection_time': list, 'tag_duration': list})
grouped

recording_id,recording_url,detection_time,tag_duration
4396,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,"[27.28, 95.9]","[0.83, 1.18]"
4399,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,"[32.63, 82.51]","[1.33, 1.11]"
4427,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,"[106.56, 122.66]","[1.0, 0.84]"
4429,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,"[31.11, 74.7, 139.78]","[1.38, 2.19, 1.29]"
4446,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,"[13.63, 74.88, 126.6]","[1.05, 0.89, 0.8]"
...,...,...,...
826329,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,[15.06],[0.54]
826352,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,[6.7],[0.95]
826374,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,"[4.11, 16.04]","[0.96, 0.88]"
826375,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,"[2.16, 48.0]","[0.66, 0.83]"
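
With start times and durations stored as lists per recording, the time window of each tag can be derived directly. The sketch below is only an illustration: `tag_windows` is a hypothetical helper, and it assumes `detection_time` is the tag start in seconds and `tag_duration` its length in seconds.

```python
def tag_windows(detection_times, tag_durations, ndigits=2):
    """Turn parallel lists of start times and durations into (start, end) pairs.

    Hypothetical helper: assumes both lists are in seconds and that the
    i-th duration belongs to the i-th detection time.
    """
    if len(detection_times) != len(tag_durations):
        raise ValueError("detection_times and tag_durations must have equal length")
    return [
        (round(start, ndigits), round(start + duration, ndigits))
        for start, duration in zip(detection_times, tag_durations)
    ]


# Values taken from recording 4396 in the table above.
print(tag_windows([27.28, 95.9], [0.83, 1.18]))
```

Applied to the grouped frame, something like `grouped.apply(lambda row: tag_windows(row.detection_time, row.tag_duration), axis=1)` would give one list of windows per recording.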



# First let's build a dataset by downloading only the recordings which contain the target species

In [81]:
import download_recordings

In [88]:
df.columns[1]

'organization'

### Choose where to save the downloaded recordings. 


In [119]:
audio_save_path = BASE_PATH / "data" / "raw" / "recordings" / "OSFL"
audio_save_path.mkdir(parents=True, exist_ok=True)


In [None]:
df.shape

(432974, 70)

In [138]:
download_recordings.from_url(df, 'recording_url', audio_save_path, target = 'OSFL', n=10)

downloading 10 clips
skipped 10 previously downloaded files
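
The output above suggests that files already on disk are skipped. The implementation of `download_recordings.from_url` is not shown in this notebook, so the following is only a sketch of how skip-if-present downloading could work. `download_missing` and its `fetch` parameter are hypothetical names, and deriving the file name from the last segment of the URL path is an assumption.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_missing(urls, save_dir, fetch=urlretrieve):
    """Download each URL into save_dir unless the file is already there.

    Hypothetical helper, not the project's download_recordings module.
    `fetch(url, target_path)` does the actual transfer and can be swapped
    out, e.g. for testing without network access.
    """
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    downloaded, skipped = [], []
    # dict.fromkeys removes duplicate URLs while keeping their order,
    # since many tag rows share one recording URL.
    for url in dict.fromkeys(urls):
        target = save_dir / Path(urlparse(url).path).name
        if target.exists():
            skipped.append(target)
            continue
        fetch(url, target)
        downloaded.append(target)
    return downloaded, skipped
```

A call such as `download_missing(osfls['recording_url'].head(10), audio_save_path)` would then fetch only the recordings not yet saved.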


In [135]:
osfls['recording_url'].iloc[0]

'https://wildtrax-aru.s3.us-west-2.amazonaws.com/d587fac1-ae78-4729-b816-5e6eaf5a5c9e/255412.mp3'