# Explore newly updated raw data
Initial data exploration showed that the original raw database, TrainingData_BU&Public_CWS.csv, contained links to clips only, not to the longer recordings. A new file, TrainingData_BU&Public_CWS_with_rec_links.csv, has been added to data/raw and includes these links to the longer recordings.

This notebook explores the new file to check that each unique recording ID corresponds to a unique URL and, where it does not, to document which recordings are missing a recording URL.

In [1]:
import pandas as pd
from pathlib import Path
data_path = Path('../../data')

In [2]:
!ls $data_path/raw/

BU_Location_Codes.zip
BU_locations_20230920.csv
SpeciesRawDownload
TrainingData_BU&Public_CWS.csv
TrainingData_BU&Public_CWS_with_rec_links.csv
recordings
renamed


In [3]:
old_meta = pd.read_csv(data_path/'raw/TrainingData_BU&Public_CWS.csv')
meta = pd.read_csv(data_path/'raw/TrainingData_BU&Public_CWS_with_rec_links.csv')


In [4]:
old_meta.shape, meta.shape

((1152840, 64), (1209426, 65))

The new database contains 56,586 more rows (clips) than the old database. There is also one extra column, as expected.
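As a quick sanity check on those numbers, the difference can be computed directly from the two shape tuples printed above. This is only a sketch of the arithmetic; the tuples are copied from the cell output rather than read from the CSV files.

```python
# Shapes copied from the output of `old_meta.shape, meta.shape`
old_shape = (1152840, 64)
new_shape = (1209426, 65)

# Row and column differences between the new and old files
extra_rows = new_shape[0] - old_shape[0]
extra_cols = new_shape[1] - old_shape[1]

print(extra_rows, extra_cols)  # 56586 1
```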

In [5]:
set(meta.columns.to_list()) ^ set(old_meta.columns.to_list())

{'media_url',
 'observer_id',
 'observer_user_id',
 'recording_url',
 'spectrogram_url',
 'tag_spectrogram_url',
 'url',
 'verifier_id',
 'verifier_user_id'}

In [18]:
meta.columns

Index(['organization', 'project', 'project_id', 'location', 'location_id',
       'recording_date_time', 'recording_id', 'task_method', 'task_id',
       'aru_task_status', 'species_code', 'species_common_name',
       'species_scientific_name', 'species_class', 'detection_time',
       'task_duration', 'tag_duration', 'min_tag_freq', 'max_tag_freq',
       'tag_id', 'individual_order', 'vocalization', 'abundance', 'tag_rating',
       'tag_is_verified', 'clip_channel_used', 'observer', 'observer_id',
       'verifier_id', 'left_full_freq_tag_rms_peak_dbfs',
       'left_full_freq_tag_rms_trough_dbfs', 'left_full_freq_tag_pk_count',
       'left_full_freq_tag_dc_offset', 'left_full_freq_tag_min_level',
       'left_full_freq_tag_max_level', 'left_full_freq_tag_peak_level_dbfs',
       'left_freq_filter_tag_rms_peak_dbfs',
       'left_freq_filter_tag_rms_trough_dbfs', 'left_freq_filter_tag_pk_count',
       'left_freq_filter_tag_dc_offset', 'left_freq_filter_tag_min_level',
       'lef

The symmetric difference of the two sets of column names (the columns that appear in only one of the two files) shows that several columns appear to be named differently between the two datasets, for example `observer_id` and `observer_user_id`.

These differences are large enough that the processed data should be regenerated from the new CSV file rather than appending the missing column to the existing processed metadata file.
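The `^` operator returns one combined set, so it does not say which file each column belongs to. A minimal sketch of splitting the result by side is shown below. The helper `column_differences` and the toy frames are hypothetical, and the toy column names are chosen only for illustration: the output above does not show which side each real column falls on.

```python
import pandas as pd

def column_differences(old: pd.DataFrame, new: pd.DataFrame) -> dict:
    """Return the columns found only in `old` and only in `new` (hypothetical helper)."""
    old_cols, new_cols = set(old.columns), set(new.columns)
    return {
        "only_in_old": sorted(old_cols - new_cols),
        "only_in_new": sorted(new_cols - old_cols),
    }

# Toy frames standing in for old_meta and meta (illustrative column names only)
old_toy = pd.DataFrame(columns=["recording_id", "observer_user_id", "verifier_user_id"])
new_toy = pd.DataFrame(columns=["recording_id", "observer_id", "verifier_id", "recording_url"])

diff = column_differences(old_toy, new_toy)
print(diff["only_in_old"])
print(diff["only_in_new"])
```

In the notebook the same call would be `column_differences(old_meta, meta)`.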

In [6]:
meta = pd.read_csv(data_path/'raw/TrainingData_BU&Public_CWS_with_rec_links.csv')


In [7]:
meta.clip_url.iloc[0]

'https://wildtrax-aru.s3.us-west-2.amazonaws.com/ac531827-1c53-409c-ab4f-fd449d162760/fb1e6315496e4ac9bd03d64fadd90afd.flac'

In [8]:
meta.recording_url.value_counts()[:20]

https://wildtrax-aru.s3.us-west-2.amazonaws.com/860f0d98-4a9a-4348-9a78-498908fe6549/211689.mp3    308
https://wildtrax-aru.s3.us-west-2.amazonaws.com/a82260a8-3dcf-49d9-999f-85864166a46b/211704.mp3    303
https://wildtrax-aru.s3.us-west-2.amazonaws.com/6a0191b2-61e0-4d5e-9805-401776dbbc86/72961.mp3     286
https://wildtrax-aru.s3.us-west-2.amazonaws.com/1a960759-3cb6-446e-b491-376d0d4b6b9e/211803.mp3    282
https://wildtrax-aru.s3.us-west-2.amazonaws.com/253cded9-393b-446b-be74-02a56ee8b6e4/99174.mp3     275
https://wildtrax-aru.s3.us-west-2.amazonaws.com/16db80e8-3181-45cb-ad0f-0de28a8a958f/211686.mp3    267
https://wildtrax-aru.s3.us-west-2.amazonaws.com/1dfe4340-0e00-4d49-b500-afd315b5e994/211746.mp3    259
https://wildtrax-aru.s3.us-west-2.amazonaws.com/57c37585-d8d3-4067-878e-d07f2760fd48/211632.mp3    255
https://wildtrax-aru.s3.us-west-2.amazonaws.com/6505e27a-0bfa-412b-9d06-7d7354bb5dff/211697.mp3    251
https://wildtrax-aru.s3.us-west-2.amazonaws.com/56dcc8a2-3c28-49b1-9c7a-c

In [9]:
meta.recording_id.value_counts()

104940.0    395
104939.0    371
104942.0    314
211689.0    308
211704.0    303
           ... 
453220.0      1
453219.0      1
609940.0      1
453218.0      1
211246.0      1
Name: recording_id, Length: 99181, dtype: int64

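The demo samples (project Wildtrax Demo 2020) are among the clips that have no recording URL; the per-project counts of missing URLs appear in the output of In [11].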

In [10]:
meta.columns

Index(['organization', 'project', 'project_id', 'location', 'location_id',
       'recording_date_time', 'recording_id', 'task_method', 'task_id',
       'aru_task_status', 'species_code', 'species_common_name',
       'species_scientific_name', 'species_class', 'detection_time',
       'task_duration', 'tag_duration', 'min_tag_freq', 'max_tag_freq',
       'tag_id', 'individual_order', 'vocalization', 'abundance', 'tag_rating',
       'tag_is_verified', 'clip_channel_used', 'observer', 'observer_id',
       'verifier_id', 'left_full_freq_tag_rms_peak_dbfs',
       'left_full_freq_tag_rms_trough_dbfs', 'left_full_freq_tag_pk_count',
       'left_full_freq_tag_dc_offset', 'left_full_freq_tag_min_level',
       'left_full_freq_tag_max_level', 'left_full_freq_tag_peak_level_dbfs',
       'left_freq_filter_tag_rms_peak_dbfs',
       'left_freq_filter_tag_rms_trough_dbfs', 'left_freq_filter_tag_pk_count',
       'left_freq_filter_tag_dc_offset', 'left_freq_filter_tag_min_level',
       'lef

In [11]:
meta.loc[meta.recording_url.isna()].project.value_counts()

General Community from Understory Protection                                  25893
Natural Disturbance Long-term Monitoring Program 2016                          5263
Wildtrax Demo 2020                                                             1370
SWTH, TEWA & WTSP - Edge use by roads                                           291
Boreal Wetland Community Monitoring                                             241
Tharindu-LFWidth-BU 2021                                                        235
Old Growth Forest Monitoring                                                    177
Big Grids                                                                       114
Elk Island Localization 2020                                                     92
BATS & LATS                                                                      86
Hart(s)-SWTH-NorthVsSouthSongRate-BU 2021                                        58
General-Community-CallingLakeFragmentationStudy-2014                        

In [12]:
meta.project.value_counts()

Boreal Wetland Community Monitoring                                           339149
Big Grids                                                                     156492
CWS-Ontario Boreal Shield-Lowlands Transition 2022                             57945
CWS-Ontario Birds of James Bay Lowlands 2021                                   56029
Old Growth Forest Monitoring                                                   55283
                                                                               ...  
Low-frequency species detections - BU - Limited amplitude for coyotes 2021        11
Tibbitt to Contwoyto Winter Road CWS Northern Region 2016                          8
Workshop #3 - Tagging fundamentals 2021                                            3
Examples of species vocalizations 2022                                             1
Community tagging: EMCLA Pilot Program 2012                                        1
Name: project, Length: 134, dtype: int64
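The two tables above (clips with a missing URL per project, and total clips per project) can be combined into a share of missing URLs per project. The following is a sketch on toy data, not the notebook's own method; the column names `url_missing`, `n_missing`, `n_clips` and `share_missing` are made up for this example.

```python
import pandas as pd

# Toy stand-in for `meta` with only the two columns needed here
toy = pd.DataFrame({
    "project": ["Demo", "Demo", "Grid", "Grid", "Grid", "Grid"],
    "recording_url": [None, None, "a.mp3", None, "b.mp3", "c.mp3"],
})

missing_share = (
    toy.assign(url_missing=toy.recording_url.isna())  # True where the URL is absent
       .groupby("project")
       .url_missing.agg(["sum", "count", "mean"])     # missing clips, all clips, fraction
       .rename(columns={"sum": "n_missing", "count": "n_clips", "mean": "share_missing"})
       .sort_values("share_missing", ascending=False)
)
print(missing_share)
```

Sorting by the share rather than the raw count separates projects that are missing every URL from large projects that are missing only a few.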

# Check for a one-to-one mapping between recording ID and recording URL

Each check below outputs True if the mapping holds, or False if some recording ID has more than one recording URL (or some recording URL has more than one recording ID).

In [13]:
# One to one between recording ID and recording URL?
meta.groupby('recording_id').recording_url.nunique().max()==1 

True

In [14]:
# One to one between recording URL and recording ID?
meta.groupby('recording_url').recording_id.nunique().max()==1 

True

In [15]:
meta.groupby('recording_url').recording_id.nunique().min()==1

True

Good. The recording IDs never point to more than one URL, and each recording URL has exactly one recording ID. 
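If one of these checks had returned False, the next step would be to find the offending keys. A minimal sketch, assuming a hypothetical helper `mapping_violations` and toy data, is given below. Note that pandas `groupby` drops rows whose key is missing by default and `nunique` ignores missing values, so rows with a missing ID or URL are not counted by this check.

```python
import pandas as pd

def mapping_violations(df: pd.DataFrame, key: str, value: str) -> pd.Series:
    """Return the keys that map to more than one distinct value (hypothetical helper)."""
    counts = df.groupby(key)[value].nunique()
    return counts[counts > 1]

# Toy data: recording 3 deliberately points at two different URLs
toy = pd.DataFrame({
    "recording_id": [1, 1, 2, 3, 3],
    "recording_url": ["a.mp3", "a.mp3", "b.mp3", "c.mp3", "d.mp3"],
})

many_urls = mapping_violations(toy, "recording_id", "recording_url")
many_ids = mapping_violations(toy, "recording_url", "recording_id")

print(many_urls)  # recording IDs with more than one URL
print(many_ids)   # recording URLs with more than one ID (empty here)
```

On the real data both results would be empty, which is consistent with the True outputs above.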

In [16]:
meta.groupby('recording_id').recording_url.nunique().min()

0

However, some of the recording IDs don't have a URL. 

In [17]:
len(meta.loc[meta.recording_url.isna()].recording_id.unique())

1485

There are 1485 recordings which don't have a URL.
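One caveat on this count: `unique()` treats NaN as a value, so if any rows lack a `recording_id`, the figure includes one placeholder entry. Whether that applies here is not shown above. The sketch below, on toy data, compares the two counting methods and also lists the IDs that have no URL on any of their rows, which is a slightly stricter definition of a recording with a missing URL.

```python
import pandas as pd

# Toy stand-in for `meta`; one row has neither a recording ID nor a URL
toy = pd.DataFrame({
    "project": ["A", "A", "B", "B", "C"],
    "recording_id": [10.0, 10.0, 11.0, None, 12.0],
    "recording_url": [None, None, "x.mp3", None, None],
})

missing = toy.loc[toy.recording_url.isna()]

# unique() keeps NaN as a distinct value; nunique() drops it
n_with_nan = len(missing.recording_id.unique())
n_without_nan = missing.recording_id.nunique()

# Recording IDs for which no row at all carries a URL
url_counts = toy.groupby("recording_id").recording_url.nunique()
ids_without_any_url = url_counts[url_counts == 0].index.tolist()

# Number of affected recordings per project, for documentation
per_project = missing.groupby("project").recording_id.nunique()

print(n_with_nan, n_without_nan)
print(ids_without_any_url)
print(per_project)
```

Writing `ids_without_any_url` to a file would give a persistent record of which recordings cannot be downloaded.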