# This notebook aims to identify sections of audio that contain the target vocalization and sections that do not.

__Proxies for target vocalization:__
- Audio from a section of a recording containing the target vocalization, taken from within the timestamp of the tag.
- Audio from a recording with tagging method 'no restrictions' AND taken from within a tag timestamp of the target species.

__Proxies for NOT target vocalization:__
- Audio from a recording with tagging method '1SPM' AND there is no target species tag in the recording.
- Audio from a recording with tagging method '1SPM' AND there is a target species tag in the recording AND the sample is taken from before the start of the target species tag.
- Audio from a recording with tagging method 'no restrictions' AND taken from between tags of the target species.
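
The proxy rules above can be sketched as a small labelling function. This is only an illustration under stated assumptions, not code from this project: `label_sample` and `overlaps` are hypothetical names, tags are assumed to be `(start, end)` pairs in seconds, 'no restrictions' is represented by `None`, and "within the tag timestamp" is interpreted as any overlap between the sample and a tag.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def label_sample(task_method, sample, target_tags):
    """Label one audio sample as 'target', 'not_target', or None (no proxy applies).

    task_method: '1SPM', or None for the 'no restrictions' method (assumed encoding).
    sample:      (start, end) of the sample in seconds.
    target_tags: list of (start, end) tags of the target species in the recording.
    """
    s_start, s_end = sample
    tags = sorted(target_tags)

    # Positive proxy: the sample overlaps a target species tag.
    if any(overlaps(s_start, s_end, t0, t1) for t0, t1 in tags):
        return "target"

    if task_method == "1SPM":
        # Negative proxies: no target tag at all, or the sample ends
        # before the first target tag starts.
        if not tags or s_end <= tags[0][0]:
            return "not_target"
        return None

    if task_method is None:
        # Negative proxy: the sample lies after the first tag has ended
        # and before the last tag starts, i.e. between tags.
        if tags and s_start >= tags[0][1] and s_end <= tags[-1][0]:
            return "not_target"
        return None

    # Other tagging methods are not covered by the rules listed above.
    return None


# Example: a 3 s sample in a '1SPM' recording whose first target tag starts at 27.28 s.
print(label_sample("1SPM", (0.0, 3.0), [(27.28, 28.11)]))
```

Returning `None` for samples that match no rule keeps ambiguous audio out of both classes rather than guessing a label.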

In [1]:
import sys
from pathlib import Path
import pandas as pd
BASE_PATH = Path.cwd().parents[1]
sys.path.append(str(BASE_PATH / "src" / "data"))  # for clean_csv and train_test_split
sys.path.append(str(BASE_PATH / "src"))  # for utils
from utils import *

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
df = pd.read_pickle(BASE_PATH / "data"/"processed" / "train_set" / "train_set.pkl")
df_lite = df[keep_cols]
osfls = df_lite.loc[df.species_code == 'OSFL']

### What's the distribution of the different tagging methods?

In [4]:
df.task_method.value_counts(dropna=False)

task_method
1SPT                        201904
1SPM                        178249
NaN                          50291
1SPM Audio/Visual hybrid      2530
Name: count, dtype: int64

#### 1SPT = 1 sample per task <br> 1SPM = 1 sample per minute
- If the tagging method is 1SPT then the time interval is the duration of the recording.
- In each case, the audio before the onset of the target species vocalization can be treated as audio which does not contain the target audio. 
- In the 1SPM recordings, there are additional sources of negative (non-target) audio, found between the start of each minute and the onset of the target vocalization within that minute.
- Initially the two tagging methods will be treated the same, since audio for the null class isn't scarce.
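
As a rough sketch of the points above, the negative intervals implied by each tagging method could be derived from the target onset times of one recording. The function name `negative_intervals` is hypothetical, and the sketch assumes that onsets are in seconds from the start of the recording and that minutes are counted from 0 s.

```python
def negative_intervals(task_method, onsets):
    """Return (start, end) intervals assumed to be free of the target vocalization.

    task_method: '1SPT' or '1SPM'.
    onsets:      target species onset times (seconds) for one recording.
    """
    onsets = sorted(onsets)
    if not onsets:
        # Without a tag the whole recording is a candidate negative,
        # but its duration is not known here, so nothing is returned.
        return []

    if task_method == "1SPT":
        # Only the audio before the first onset is known to be negative.
        first = onsets[0]
        return [(0.0, first)] if first > 0 else []

    if task_method == "1SPM":
        # For each minute that contains a tag, the audio between the start
        # of that minute and the earliest onset in it is negative.
        first_by_minute = {}
        for t in onsets:
            first_by_minute.setdefault(int(t // 60), t)
        return [
            (minute * 60.0, t)
            for minute, t in sorted(first_by_minute.items())
            if t > minute * 60.0
        ]

    raise ValueError(f"unsupported task_method: {task_method!r}")


# Onsets taken from one of the recordings shown later in this notebook.
print(negative_intervals("1SPM", [31.11, 74.7, 139.78]))
# [(0.0, 31.11), (60.0, 74.7), (120.0, 139.78)]
```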

### What about the 'None' method?
- The 'None' method (rows of the dataframe with no string value for the tagging method, shown as NaN above) covers tasks without a restriction on the number of tags. These can be used as a source of both positive and negative class recordings.
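
For these unrestricted recordings, one way to find negative audio is to take the gaps between consecutive target tags. The sketch below is an assumption about how that could be done, not the project's implementation: `gaps_between_tags` is a hypothetical helper, and each tag is assumed to span `detection_time` to `detection_time + tag_duration` in seconds.

```python
def gaps_between_tags(detection_times, tag_durations):
    """Return the (start, end) gaps between consecutive target tags.

    detection_times: tag start times in seconds.
    tag_durations:   tag lengths in seconds, aligned with detection_times.
    """
    tags = sorted(zip(detection_times, tag_durations))
    gaps = []
    prev_end = None
    for start, duration in tags:
        # A gap exists only if this tag starts after every earlier tag has ended.
        if prev_end is not None and start > prev_end:
            gaps.append((prev_end, start))
        end = start + duration
        prev_end = end if prev_end is None else max(prev_end, end)
    return gaps


# Two tags, 10-12 s and 50-51 s, leave one gap between them.
print(gaps_between_tags([10.0, 50.0], [2.0, 1.0]))
```

Audio before the first tag and after the last tag is deliberately left out, since only the audio between tags is listed as a negative proxy for this method.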



## How many recording files are there in the training set?

In [5]:
unique_recordings = df.recording_id.unique()
recordings_containing_target_species = osfls.recording_id.unique()
print(f"{unique_recordings.shape} unique recordings, {recordings_containing_target_species.shape} recordings with the target species present.")

(54416,) unique recordings, (2967,) recordings with the target species present.


### Look at how to group df by recording and keep the other info
- The recording ID and URL are repeated on every tag entry that belongs to the same recording.
- It would be useful to have the database indexed by recording ID, with the species tag, clip start/stop times, etc. stored as a list per recording. That way we could see the timestamps of all the clips for one recording.

- After grouping by recording, we can pass a dictionary to the `agg` function so that different columns are aggregated differently. This way we can end up with a list of all the target species start times and tag durations per recording.

In [6]:
target_species = 'OSFL'
filtered_df = df.loc[df.species_code == target_species]
# One row per recording: keep the first URL and file type, collect tag times into lists
grouped = filtered_df.groupby('recording_id').agg({
    'recording_url': 'first',
    'detection_time': list,
    'tag_duration': list,
    'file_type': 'first',
})
grouped

recording_id,recording_url,detection_time,tag_duration,file_type
4396,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,"[27.28, 95.9]","[0.83, 1.18]",mp3
4399,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,"[32.63, 82.51]","[1.33, 1.11]",mp3
4427,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,"[106.56, 122.66]","[1.0, 0.84]",mp3
4429,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,"[31.11, 74.7, 139.78]","[1.38, 2.19, 1.29]",mp3
4446,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,"[13.63, 74.88, 126.6]","[1.05, 0.89, 0.8]",mp3
...,...,...,...,...
826329,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,[15.06],[0.54],flac
826352,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,[6.7],[0.95],flac
826374,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,"[4.11, 16.04]","[0.96, 0.88]",flac
826375,https://wildtrax-aru.s3.us-west-2.amazonaws.co...,"[2.16, 48.0]","[0.66, 0.83]",flac


In [19]:
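# Build a file name such as '4396.mp3' from the recording ID (the index) and the
# file_type column; the '.' separator is assumed from the recording URLs.
grouped['filename'] = grouped.index.astype(str) + '.' + grouped['file_type']
grouped.loc[grouped.index < 4427]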

# First let's build a dataset by downloading only the recordings which contain the target species
### This dataset will also be accompanied by a dataframe with the file path as its index. 

In [9]:
import download_recordings
audio_save_path = BASE_PATH / "data" / "raw" / "recordings" / "OSFL"
audio_save_path.mkdir(parents=True, exist_ok=True)
# Download recordings for the target species (n=10 here)
download_recordings.from_url(df, 'recording_url', audio_save_path, target='OSFL', n=10)
osfls['recording_url'].iloc[0]

downloading 10 clips
skipped 10 previously downloaded files


'https://wildtrax-aru.s3.us-west-2.amazonaws.com/d587fac1-ae78-4729-b816-5e6eaf5a5c9e/255412.mp3'

In [None]:
import opensoundscape 