# 1. Process the raw input CSV file. 

The input for this project is a .csv file downloaded from WildTrax.
It contains the information needed to build the training dataset:
- start and end times of bird calls
- locations of the ARUs used to collect the audio
- URL of the recording on which the tags were made
- the date, file type and other metadata associated with the audio

Download the csv file from WildTrax and place it in the `data/raw/` folder.

A list of URLs linking to the source audio needs to be appended to the dataframe as a new column named __recording_url__.
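
One way to attach the links is a left merge on a shared key. The sketch below assumes the URLs arrive as a separate table keyed by `recording_id`; the toy values, the example.com URLs and the `attach_recording_urls` helper are illustrative and not part of the project code.

```python
import pandas as pd

def attach_recording_urls(tags: pd.DataFrame, urls: pd.DataFrame) -> pd.DataFrame:
    """Append a recording_url column by matching on recording_id (assumed key)."""
    # validate="many_to_one" raises if any recording_id maps to more than one URL
    return tags.merge(
        urls[["recording_id", "recording_url"]],
        on="recording_id",
        how="left",
        validate="many_to_one",
    )

# Toy data standing in for the WildTrax export and the list of URLs
tags = pd.DataFrame({"recording_id": [1, 1, 2], "task_id": [10, 11, 12]})
urls = pd.DataFrame({
    "recording_id": [1, 2],
    "recording_url": ["https://example.com/1.flac", "https://example.com/2.flac"],
})

with_urls = attach_recording_urls(tags, urls)
print(with_urls)
```

A left merge keeps every tag row, so tags whose recording has no URL end up with a missing `recording_url` rather than disappearing silently.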

The latitude and longitude columns should be present too.
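
A quick check before processing can confirm that these columns exist. This is a minimal sketch; the `missing_columns` helper and the toy dataframe are hypothetical, and only the three column names come from the text above.

```python
import pandas as pd

REQUIRED_COLUMNS = ("recording_url", "latitude", "longitude")

def missing_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]

# Toy frame that is deliberately missing the longitude column
example = pd.DataFrame({
    "recording_url": ["https://example.com/a.flac"],
    "latitude": [52.6],
})
print(missing_columns(example))
```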

The .csv file is cleaned by the function __process_raw_csv__, which this notebook imports from `src/data/clean_csv.py`.

This can be done by running this notebook or by running the python file. 

Once this notebook is open in a Jupyter notebook server, press __ctrl+enter__ to run the selected cell (or __shift+enter__ to run it and move to the next), working from top to bottom.

In [1]:
# imports
import pandas as pd
from pathlib import Path
import sys

# Set the paths
BASE_PATH = Path.cwd().parent.parent  # two levels above the current working directory
data_path = BASE_PATH / 'data'
sys.path.append(str(BASE_PATH))
sys.path.append(str(BASE_PATH / "src"))
sys.path.append(str(BASE_PATH / "src" / "data")) 

from src.data.clean_csv import process_raw_csv

# Load the raw data


In [None]:
raw_data = pd.read_csv(data_path/ 'raw' / 'TrainingData_BU&Public_CWS_with_rec_links.csv', low_memory=False)
raw_data.head()


To avoid mixed data types in the dataframe, a type dictionary is provided. If columns are added or removed in the future, update this dictionary to reflect the expected data types. The type dictionary can be found in `src/data/preset_types.py`.
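
As an illustration of how such a dictionary is typically used, pandas accepts a column-to-type mapping through the `dtype` argument of `read_csv`. The contents of `preset_types.py` are not shown here, so the `example_types` dictionary and the inline CSV text below are hypothetical stand-ins.

```python
import io
import pandas as pd

# Hypothetical stand-in for the dictionary in src/data/preset_types.py
example_types = {
    "project_id": "int64",
    "location": "string",
    "latitude": "float64",
}

# Small inline CSV so the example runs without any files
csv_text = (
    "project_id,location,latitude\n"
    "1501,P-E0-1-10,52.64404\n"
    "293,AM-403-SE2,54.607774\n"
)

typed = pd.read_csv(io.StringIO(csv_text), dtype=example_types)
print(typed.dtypes)
```

Integer types cannot hold missing values in a plain NumPy-backed column, which is why a placeholder such as -1 is needed before a column with gaps can be read as int.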

# Process the csv
`process_raw_csv` does the following:
- Loads the raw csv file.
- Drops the last entry, since it contains only NaN values.
- Replaces empty `verifier_id` fields with -1 so the column can be imported into a pandas dataframe as an int type.
- Converts every column in the DataFrame to the type specified in `preset_types.py`.
- Drops 'too many to tag' abundance tags.
- Specifies a value for the 'no restrictions' tagging method, since it is stored as 'na' by default.
- Drops non-song vocalizations.
- Drops recordings not labeled in WildTrax.
- Removes clips that have no clip link.
- Removes clips that belong to a recording with a missing recording_url.
- Removes clips from projects that might contaminate the dataset with duplicated or synthetic recordings.
- Removes duplicated clips from the database.
- Adds a column storing the file type derived from the clip URL.
- Exports the cleaned version of the database.
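
A few of the steps above can be sketched on a toy dataframe. This is not the project's implementation: the `clean_sketch` function is hypothetical, the data is made up, and taking the file type from the text after the last "." in `clip_url` is an assumption about how that step works.

```python
import pandas as pd

def clean_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative subset of the cleaning steps, not the project's own code."""
    # Drop rows that are entirely NaN (such as the trailing empty entry)
    df = df.dropna(how="all")
    # Use -1 as a placeholder so verifier_id can be stored as an integer
    df = df.assign(verifier_id=df["verifier_id"].fillna(-1).astype(int))
    # Keep only rows that have both a clip link and a recording link
    df = df.dropna(subset=["clip_url", "recording_url"])
    # Keep one row per clip
    df = df.drop_duplicates(subset="clip_url")
    # Assumption: the file type is the text after the last "." in the clip URL
    return df.assign(file_type=df["clip_url"].str.rsplit(".", n=1).str[-1])

raw = pd.DataFrame({
    "verifier_id": [7.0, None, 7.0, 3.0, None],
    "clip_url": [
        "https://example.com/a.flac",
        "https://example.com/b.mp3",
        "https://example.com/a.flac",
        None,
        None,
    ],
    "recording_url": [
        "https://example.com/r1.flac",
        "https://example.com/r2.mp3",
        "https://example.com/r1.flac",
        "https://example.com/r3.flac",
        None,
    ],
})

cleaned = clean_sketch(raw)
print(cleaned)
```

In this toy input the last row is all NaN, the fourth row has no clip link and the third row duplicates the first, so two rows remain after cleaning.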


In [3]:
process_raw_csv(data_path / 'raw' / 'TrainingData_BU&Public_CWS_with_rec_links.csv')

Processing raw csv file...
Done processing raw csv file. Outputted to data/interim/cleaned_metadata.pkl


# Check that it worked
Load the processed data, which was saved as a pickle file, and take a look at the first few rows.

In [4]:
processed_csv = pd.read_pickle(data_path / 'interim' / 'cleaned_metadata.pkl')
processed_csv.head()

| Unnamed: 0 | organization | project | project_id | location | location_id | recording_date_time | recording_id | task_method | task_id | aru_task_status | ... | spectrogram_url | clip_url | sensorId | tasks | status | recording_url | latitude | longitude | location_buffer_m | file_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1623 | BU | Alberta Archetypes | 1501 | P-E0-1-10 | 308678 | 2022-06-05 06:51:00 | 416962 | no_restrictions | 596169 | Transcribed | ... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | ARU | 357 | Active | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | 52.64404 | -115.14051 |  | flac |
| 1752 | BU | Amplitude Quality Testing 2020 | 293 | AM-403-SE2 | 36043 | 2017-06-15 04:46:00 | 92051 | no_restrictions | 87956 | Transcribed | ... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | ARU | 174 | Published - Private | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | 54.607774 | -110.681271 |  | flac |
| 1758 | BU | Amplitude Quality Testing 2020 | 293 | AM-403-SE2 | 36043 | 2017-06-15 04:46:00 | 92051 | no_restrictions | 87898 | Transcribed | ... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | ARU | 174 | Published - Private | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | 54.607774 | -110.681271 |  | flac |
| 1761 | BU | Amplitude Quality Testing 2020 | 293 | AM-403-SE2 | 36043 | 2017-06-15 04:46:00 | 92051 | no_restrictions | 87840 | Transcribed | ... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | ARU | 174 | Published - Private | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | 54.607774 | -110.681271 |  | flac |
| 1764 | BU | Amplitude Quality Testing 2020 | 293 | AM-403-SE2 | 36043 | 2017-06-15 04:46:00 | 92051 | no_restrictions | 87927 | Transcribed | ... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | ARU | 174 | Published - Private | https://wildtrax-aru.s3.us-west-2.amazonaws.co... | 54.607774 | -110.681271 |  | flac |
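
Beyond inspecting the first rows by eye, a few counts can confirm that the cleaning steps had the intended effect. The sketch below is a hypothetical helper, not part of the project; it runs on a toy frame with made-up URLs, and the same function could be applied to the loaded dataframe, since `clip_url` and `recording_url` both appear in the output above.

```python
import pandas as pd

def summarize_problems(df: pd.DataFrame) -> dict:
    """Count rows that violate expectations implied by the cleaning steps."""
    return {
        "missing_recording_url": int(df["recording_url"].isna().sum()),
        "missing_clip_url": int(df["clip_url"].isna().sum()),
        "duplicated_clips": int(df["clip_url"].duplicated().sum()),
    }

# Toy frame standing in for the processed dataframe
sample = pd.DataFrame({
    "recording_url": ["https://example.com/r1.flac", "https://example.com/r1.flac"],
    "clip_url": ["https://example.com/c1.flac", "https://example.com/c2.flac"],
})

problems = summarize_problems(sample)
print(problems)
```

If every count is zero, the link and duplicate filters behaved as described; any non-zero count points to the step worth investigating.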
