# Creating a training database: import and load Pamguard CSV files


(CLICKLEARN DSTI Project)

The objective of this notebook is to complement the Ketos *Creating Database (Extended)* notebook:
https://docs.meridian.cs.dal.ca/ketos/tutorials/create_database/index.html

*Chapter 2. Loading the annotations* of the original *Creating Database (Extended)* notebook describes how to import CSV files that already meet the Ketos requirements, but it provides no method to build the annotation table from scratch or to import and convert other CSV files. 

Sections *1. Imports* and *2. Building annotation files* of the present work complement this chapter 2. This notebook provides methods to load a CSV file exported by **Pamguard software** and convert it into a Ketos annotation table, handling the required datetime and timedelta operations. Pamguard is open-source software that aims to provide free and easy-to-use tools for cetacean passive acoustic monitoring (PAM):
https://www.pamguard.org/

Two types of CSV file exported by Pamguard can be used with these methods:

* **Events annotation CSV files**: giving the start and end of time ranges (several seconds long) in which click streams have been identified.
* **Clicks annotation CSV files**: giving the start time of identified clicks. A click lasts only a few milliseconds; its duration has to be set as a parameter (see below).

The converted annotation tables can then be used directly by the Ketos library (see Chapter 3 and the following ones). No major changes were made to Chapter 3 and the following chapters.
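
As a minimal sketch of the target format (the two rows below are copied from outputs shown later in this notebook and are only illustrative), a non-standardized annotation table is a plain pandas DataFrame with one row per annotation and the columns `filename`, `start`, `end` and `label`, where `start` and `end` are expressed in seconds from the beginning of the audio file:

```python
import pandas as pd

# Illustrative rows only: one annotation per row, times in seconds
# from the beginning of the audio file named in `filename`.
example_annot_df = pd.DataFrame({
    'filename': ['192_20180601_010000_458.wav', '192_20180601_010000_458.wav'],
    'start': [25968.249, 26699.235],
    'end': [26012.0, 26726.0],
    'label': ['DLD', 'DLD'],
})

# Basic sanity check: no annotation may end before it starts
assert (example_annot_df['end'] >= example_annot_df['start']).all()
print(example_annot_df)
```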


## Contents:

[1. Imports](#section1)  
[2. Building annotation files](#section2)  



<a id="section1"></a>
## 1. Imports

### Importing the packages
(CLICKLEARN DSTI)

We will use several modules within ketos, as well as the pandas package.



In [1]:
import os
import numpy as np
import pandas as pd
from ketos.data_handling import selection_table as sl
import ketos.data_handling.database_interface as dbi
from ketos.data_handling.parsing import load_audio_representation
from ketos.audio.spectrogram import MagSpectrogram

from datetime import datetime
from typing import Union
import math
import random

# This last package is made by ClickLearn DSTI team to extract information from a Pamguard CSV export file 
# and build a Ketos annotation table (Pandas dataframe)
import ketos_annotation_table as kat

### Import functions

(CLICKLEARN DSTI Project)

<span style = "color : red;">
TODO:
    
* Move following methods in Ketos_annotation_table.py when finished, then delete this chapter
</span>


In [47]:
def pamguard_name_to_datetime(filename:str):
    """Extract a datetime value from a PAMGUARD sound file name.

        Args:
            filename : str
                PAMGUARD .wav or audio_file name
        Returns:
            datetime
                Formatted datetime value (like 2019-02-24 13:15:00)

        Examples
            >>> pamguard_name_to_datetime('Click_Detector_Sperm_whale_click_detector_Clicks_20180601_010000.pgdf')
            datetime.datetime(2018, 6, 1, 1, 0)
            >>> pamguard_name_to_datetime('192_20180705_123056_853.wav')
            datetime.datetime(2018, 7, 5, 12, 30, 56)
    """
    # split filename from its suffix (.wav or .pgdf)
    filename, suffix = os.path.splitext(filename)
    # remove milliseconds from filename (occurs on Pamguard .wav filenames)
    if ".wav" in suffix:
        filename = filename[:-4]
    # returns the date part of the remaining filename as a datetime
    return datetime.strptime(filename[-15:], '%Y%m%d_%H%M%S')

In [48]:
def strip_dataframe(df):
    """Removes any white space before and after value of every cells of a dataframe.

        Args:
            df: pd.DataFrame
        Returns:
            df: pd.DataFrame
    """
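    for column in df:
        # test the first value by position, so that an empty dataframe or a non-default index does not raise an error
        if len(df) > 0 and isinstance(df[column].iloc[0], str):
            df[column] = df[column].str.strip()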
    return df

In [49]:
# NO LONGER USED (keep it to detect unmatched labels ?)

def update_annotation_df_labels(annotation_df, labels_to_modify:dict, valid_labels: set):
    """Replaces a set of labels with valid labels, in a column named 'label' of a Pandas annotation dataframe
    
        Args:
            annotation_df: pd.DataFrame
                Annotation table imported from Pamguard
            labels_to_modify: dict
                dictionary mapping each label to replace (key) to its valid label (value)
            valid_labels: set
                set of strings corresponding to every valid label for any annotation

        Returns:
            annotation_df: pd.DataFrame

    """
    # check validity of new labels
    unmatched_labels = set(labels_to_modify.values()) - set(valid_labels)
    if len(unmatched_labels) != 0:
        print('Following labels do not match the valid label list: ', unmatched_labels)
    else:
        # replace old labels by new ones
        for old_label, new_label in labels_to_modify.items():
            annotation_df.loc[annotation_df.label == old_label,'label'] = new_label
    return annotation_df

In [50]:
def subsample_dataframe(df_to_subsample, train_proportion:float):
    """Subsample a Pandas DataFrame into 2 others, based on its indexes.
        
        Args:
            df_to_subsample: pd.DataFrame
            train_proportion: float
                proportion (]0;1[) of indexes to subsample in the dataset, in order to build the train dataset
    
        Returns: tuple with 2 dataframes:
            [0] train subsample 
            [1] test subsamples
    """
    length = len(df_to_subsample)
    nb_train_rows = math.ceil(train_proportion*length)
    # draw the train row positions at random (random.sample() requires a sequence, not a set)
    train_indexes = random.sample(range(length), nb_train_rows)
    # the remaining row positions build the test subsample
    test_indexes = sorted(set(range(length)) - set(train_indexes))
    return df_to_subsample.iloc[train_indexes], df_to_subsample.iloc[test_indexes]

In [69]:
def pamguard_annotations_csv_to_df(annotation_csv_path:str, annotation_type:str = 'events', events_subsampling_option:int = 2, click_duration:float = 2):
    """Convert a Pamguard CSV annotation file into a Ketos annotation table (non standardized). 
        This method contains several identical "if" statements, that were not gathered for more readability.

        Args:
            annotation_csv_path: str
                complete path of the CSV file. CSV separator must be ','
            annotation_type: str
                2 values are possible: 'events' (default value, for Events annotation CSV files) or 'clicks' (for Clicks annotation CSV files)
            events_subsampling_option: int
                Subsampling is optional but highly recommended to avoid unsure detections. 
                Several subsamplings of annotations_csv_df are possible, 
                from the less to the most selective option:
                - 0: no subsampling
                - 1: subsample excluding all commented detections in 'comment' column 
                    (comments always refer to unsure detections), 
                    then including 'definite' ('DLD') and 'probable' ('DLP') dolphin click detections as well
                - 2 (default value): subsample excluding all commented detections in 'comment' column 
                    (comments always refer to unsure detections), 
                    then including only 'definite' dolphin click detections ('DLD')
                N.B.: Clicks annotation CSV files are only based on (uncommented) definite dolphin click 
                detections ('DLD'), so it is unnecessary to subsample them regarding to any comment or label. 
            click_duration: float
               click duration in milliseconds
               
        Returns:
            annotations_df: pd.DataFrame
                Ketos annotation table (non standardized)
    """
    # import CSV in a dataframe
    annotations_df = pd.read_csv(annotation_csv_path, sep=",")

    # Remove all whitespaces from string typed series of the dataframe 
    annotations_df = strip_dataframe(annotations_df)
    
    # SUBSAMPLING (SPECIFIC TO EVENT ANNOTATION DF, with option 2 > 1 > 0, check docstring):
    if annotation_type == 'events':
        if events_subsampling_option == 2:
            annotations_df = annotations_df.loc[((annotations_df.comment.isna()) 
                                                 & (annotations_df.eventType == 'DLD')),]
        if events_subsampling_option == 1:
            annotations_df = annotations_df.loc[(annotations_df.comment.isna()),]    
    
    # DATETIME AND LABELS REWORKING:
    if annotation_type == 'events':
        
        # updating pd.Series type from string to datetime
        annotations_df.EventStart = pd.to_datetime(annotations_df.EventStart, format = '%Y-%m-%d %H:%M:%S.%f')
        
        # To complete EventStart series, EventStartMilliseconds must be added to EventStart values as a pd.Timedelta
        EventStartMilliseconds_events = pd.Series([pd.Timedelta(val, unit = 'milliseconds') for val in annotations_df.EventStartMilliseconds])
        # resetting the index aligns annotations_df with EventStartMilliseconds_events for arithmetic operations
        annotations_df = annotations_df.reset_index()
        annotations_df.EventEnd = pd.to_datetime(annotations_df.EventEnd, format = '%Y-%m-%d %H:%M:%S.%f')

        # updating all datetime values (beginning & end of events) into seconds from the beginning of the recording
        # for this, convert first file name into pandas Timedelta format
        annotations_df.EventStart = (annotations_df.EventStart 
                                     + EventStartMilliseconds_events
                                     - annotations_df.WAVFile.apply(pamguard_name_to_datetime)).apply(pd.Timedelta.total_seconds)
        annotations_df.EventEnd = (annotations_df.EventEnd 
                                   - annotations_df.WAVFile.apply(pamguard_name_to_datetime)).apply(pd.Timedelta.total_seconds)
        
    if annotation_type == 'clicks':
        
        # updating pd.Series type from string to datetime
        annotations_df.Clickstart = pd.to_datetime(annotations_df.Clickstart, format = '%Y-%m-%d %H:%M:%S.%f')
        
        # To complete Clickstart series, UTCMilliseconds must be added to Clickstart values as a pd.Timedelta
        UTCMilliseconds_clicks = pd.Series([pd.Timedelta(val, unit = 'milliseconds') for val in annotations_df.UTCMilliseconds])
        
        # Converts click_duration into Timedelta format
        click_duration = pd.Timedelta(click_duration, unit = 'milliseconds')
        
        # Compute Clickend series thanks to click duration then add it in the dataframe
        annotations_df['Clickend'] = pd.Series(annotations_df.Clickstart + click_duration)
        
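        # add a label for each click detection
        # (NB: `label` is not an argument of this method, it is read from the global variable set in the settings cell of section 2)
        annotations_df['label'] = label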

        # updating all datetime values (beginning & end of events) into seconds from the beginning of the recording
        # for this, convert first file name into pandas Timedelta format
        annotations_df.Clickstart = (annotations_df.Clickstart 
                                     + UTCMilliseconds_clicks
                                     - annotations_df.BinaryFile.apply(pamguard_name_to_datetime)).apply(pd.Timedelta.total_seconds)
        annotations_df.Clickend = (annotations_df.Clickend 
                                           + UTCMilliseconds_clicks 
                                           - annotations_df.BinaryFile.apply(pamguard_name_to_datetime)).apply(pd.Timedelta.total_seconds)

    # COLUMNS REWORKING:
    if annotation_type == 'events':
        # select only columns required in Ketos annotation table
        annotations_df = annotations_df.loc[:,('Id', 'EventStart', 'EventEnd', 'eventType', 'WAVFile', 'comment')]
        # update column names to fit Ketos annotation table format
        annotations_df = annotations_df.rename(
            columns = {'EventStart':'start', 'EventEnd':'end', 'eventType':'label', 'WAVFile':'filename'})
    
    if annotation_type == 'clicks':
        # select only columns required in Ketos annotation table
        annotations_df = annotations_df.loc[:,('Id', 'Clickstart', 'Clickend', 'label', 'BinaryFile')]
        # update column names to fit Ketos annotation table format
        annotations_df = annotations_df.rename(
            columns = {'Clickstart':'start', 'Clickend':'end', 'BinaryFile':'filename'})

    # Check start and end values of events annotation df
    if annotation_type == 'events':
        check_pamguard_annotation_df(annotations_df)
        

    return annotations_df

In [52]:
def check_pamguard_annotation_df(df):
    """Check annotations from a CSV file extracted from Pamguard software. Uses check_annotation_time() in a loop
        and prints each error found.
    
        Args:
            df: Pandas DataFrame
                Pamguard annotation DataFrame converted by the pamguard_annotations_csv_to_df() method.
        
        Returns: 
            errors: list
                List of error messages, one for each incorrect annotation
    """
    errors = []
    for index, row in df.iterrows():
        error = check_annotation_time(row.start, row.end, row.Id)
        if error:
            print(error)
            errors.append(error)
    return errors

In [53]:
def check_annotation_time(start:float, end:float, index = 0):
    """Check time validity of 1 annotation (or 1 row of an annotation table): end value must not be prior to start value.
        
        Args:
            start: float
                Start time for the annotation, in seconds from the beginning of the file
            end: float
                End time for the annotation, in seconds from the beginning of the file
            index: int
                Row index (DataFrames)
        
        Returns: 
            str or None
               String containing index and details about the incorrect annotation, or None if the annotation is valid
    
    """
    if start > end: 
        return f'Id {index}: end value ({end}) is prior to start value ({start})'

##### Hand input methods

In [54]:
def check_hand_input(start: Union[int, float, str], end: Union[int, float, str], label: str, valid_labels: set):
    """Check validity of an annotation added by hand. This checking is used before appending it in the annotation 
        dataframe with append_hand_input_to_annotations() method.     
        Each annotation must pass following tests before being considered as correct :
        - start and end types must be integer, float or string
        - label must be included in kat.labels list
        
        Args:
            start: int, float, str
                Start time for the annotation, in seconds from the beginning of the file
            end: int, float, str
                End time for the annotation, in seconds from the beginning of the file
            label: str
                Label of the annotation
            valid_labels: set
                Set of every valid labels (strings characters) for any annotation 
        
        Returns: 
            errors: list
                List of errors (string characters) describing each incorrect annotation.
    """
    #initialize error list to be returned
    errors = []

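    #check validity of start & end values 
    try:
        start_value = float(start)
    except (TypeError, ValueError):
        start_value = None
        errors += ['start value cannot be converted to a number']
    try:
        end_value = float(end)
    except (TypeError, ValueError):
        end_value = None
        errors += ['end value cannot be converted to a number']
    # compare the numeric values (comparing raw strings would give wrong results)
    if start_value is not None and end_value is not None and start_value > end_value: 
        errors += ['end value is prior to start value']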
    
    #test if label is included in kat.labels list
    if label not in valid_labels:
        errors += ['label is not included in valid labels list']
        
    return errors

In [55]:
def check_overlap_errors(annotation_to_add, annotation_df):
    """Check if a hand-made annotation start or end value overlaps one or several annotations of a Pandas DataFrame.
        
        Args:
            annotation_to_add: Pandas DataFrame
                Single-row dataframe containing the annotation to add in the annotation table
            annotation_df: Pandas DataFrame
                Ketos annotation table
        
        Returns: 
            errors: list
                List of errors (string characters) describing each incorrect annotation.
    """
    errors = []
    for index, row in annotation_df.iterrows():
        if (row.start <= annotation_to_add.iloc[0]['start'] <= row.end):
            errors += [f'start value overlap at index {index}']
        if (row.start <= annotation_to_add.iloc[0]['end'] <= row.end):
            errors += [f'end value overlap at index {index}']
    return errors

In [56]:
def append_hand_input_to_annotations(annotation_dataframe, filename:str, start:float, end:float, label:str, valid_labels:set):
    """Appends a custom annotation to a Ketos annotation table. If any error occurs, print the error description. 
        
        Args:
            annotation_dataframe: Pandas DataFrame
                Ketos annotation table
            filename: str
                Name of the audio file
            start:float
                Start time for the annotation, in seconds from the beginning of the file
            end: float
                End time for the annotation, in seconds from the beginning of the file
            label: str
                Label of the annotation
            valid_labels: set
                Set of every valid labels (strings characters) for any annotation 
        
        Returns: 
            annotation_dataframe: Pandas DataFrame
                Updated Ketos annotation table with the new annotation 
    """
    errors = check_hand_input(start, end, label, valid_labels)
    
    if not errors:
        Id = annotation_dataframe.iloc[-1]['Id'] + 1
        annotation_temp = pd.DataFrame([[Id , round(float(start),4), round(float(end),4), label, filename]], 
                                       columns = ['Id', 'start', 'end', 'label', 'filename']) 
        errors = check_overlap_errors(annotation_temp, annotation_dataframe)
        if not errors:
            #append temp to annotation DF (pd.concat() replaces DataFrame.append(), removed in pandas 2.0)
            annotation_dataframe = pd.concat([annotation_dataframe, annotation_temp], ignore_index = True)
    if errors:
        print('Following errors occurred:')
        for er in errors:
            print(er)
    return annotation_dataframe

<a id="section2"></a>
## 2. Building annotation files


This section describes methods to create or modify an annotation table. New methods are added to the original documentation to import Pamguard CSV files and to add annotations by hand.

### Setting parameters to build the Annotation table 
(CLICKLEARN DSTI Project)


This part contains all required settings to build annotation dataframes:
* Audio files path (where all audio files are stored)
* Annotation file path, where: 
    + CSV files are stored
    + Annotation tables **will be** stored
* Click duration in milliseconds (for Click annotation CSV files)
* Set of valid labels to be used (all labels from CSV file that are not matching this set will be stored in an error log file)
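
The last setting mentions labels that do not match the set of valid labels. As a minimal sketch of how such labels could be detected, the helper below (`find_unmatched_labels`, a hypothetical name that is part of neither Ketos nor `ketos_annotation_table`) returns every label found in a `label` column that is absent from a given set. Writing the result to an error log file is left out.

```python
import pandas as pd

def find_unmatched_labels(annotation_df: pd.DataFrame, valid_labels: set) -> set:
    """Hypothetical helper: return the labels found in the 'label' column
    that are not part of valid_labels."""
    return set(annotation_df['label'].dropna().unique()) - set(valid_labels)

# Illustrative table containing one unexpected label ('XYZ')
demo_df = pd.DataFrame({'label': ['DLD', 'DLP', 'XYZ', 'DLD']})
unmatched = find_unmatched_labels(demo_df, {'DLD', 'DLP'})
print(unmatched)
```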

In [80]:
# PATH SETTINGS
# Annotation CSV file folder path (MAC)
annotations_folder_path ='/Users/benoitmialet/Ketos/tutorials_create_database_database_creation_tutorial'
# Annotation CSV file folder path (W10)
# annotations_folder_path = (r'D:\SYSTEL\Ketos\tutorials_create_database_database_creation_tutorial')
# Annotation CSV file name
csv_events_name = 'UBISEA_acoustic_detections_samples_events_completed.csv' # WAVFile column was manually added in this file 
csv_clicks_name = 'UBISEA_acoustic_detections_samples_clicks.csv'

# PATH BUILDING
csv_folder_path = annotations_folder_path
csv_events_path = os.path.join(csv_folder_path,csv_events_name)
csv_clicks_path = os.path.join(csv_folder_path,csv_clicks_name)

# CLICK ANNOTATION SETTINGS 
# set click duration (milliseconds) for the dataframe
clickDuration = pd.Timedelta(2.5, unit = 'milliseconds')
# set label corresponding to a click detection
label = 'DLD'

# LABELS SETTINGS
# valid labels for annotation dataframes, must be a list (or list of lists)
# each element of the list corresponds to 1 label, ex: ([1,2,[3,3,3],...]):
ketos_signal_labels = [['dolphin_click', 'DLP', 'DLD']]
ketos_backgr_labels = [] # CHANGE TO ['other']


# DELETE #########################################
# valid_labels = {'dolphin_click','other'}
# # annotation dataframe labels to replace by valid labels (must be included in valid labels)
# labels_to_modify = {'DLP':'dolphin_click','DLD':'dolphin_click'}

### 2-Option 1: Importing a Pamguard CSV file
(CLICKLEARN DSTI Project)

CSV files exported by Pamguard software can be converted into a Ketos annotation table.
Export must be done with **default options and variable names**. 2 CSV file types can be imported with the 'pamguard_annotations_csv_to_df' method: 
* **"events"**: in these files, annotations correspond to groups of clicks lasting several seconds.
    + Between EventStart and EventEnd datetime values, a variable number of clicks (nClicks) are included.
    + use the method with parameters annotation_type = 'events' and events_subsampling_option (refer to the method's docstring)
    + In the CSV files, columns **in bold** are required and must have **exactly the same headers** as follows: 
![Pamguard event csv](capture_PAMGUARD_event_csv.png)


* **"clicks"**: in these files, annotations correspond to the beginning of clicks ('Clickstart', in datetime format). 'Clickend' values are automatically computed by setting a click duration ('click_duration', in milliseconds) among the parameters. 
    Clickend is simply Clickstart + click_duration. In the CSV files, columns **in bold** are required and must have **exactly the same headers** as follows:
![Pamguard click csv](capture_PAMGUARD_click_csv.png)
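
The click timing arithmetic described above can be sketched on a single click. The values are illustrative (they are not read from an actual Pamguard export) and were chosen to be consistent with the first row of the standardized clicks table shown further down:

```python
import pandas as pd

# Illustrative values (not read from an actual Pamguard export)
file_start = pd.Timestamp('2018-06-01 01:00:00')   # date part of the binary file name
click_start = pd.Timestamp('2018-06-01 01:14:38')  # 'Clickstart' value, whole seconds
utc_ms = pd.Timedelta(928, unit='ms')              # 'UTCMilliseconds' value
click_duration = pd.Timedelta(2.5, unit='ms')      # click duration set by the user

# Offsets in seconds from the beginning of the recording
start_s = (click_start + utc_ms - file_start).total_seconds()
end_s = (click_start + utc_ms + click_duration - file_start).total_seconds()
print(round(start_s, 4), round(end_s, 4))
```

Working with `pd.Timestamp` and `pd.Timedelta` objects until the very last step avoids manual unit conversions; only `total_seconds()` turns the result into the plain float that Ketos expects.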


#### Convert CSV annotation files into Ketos annotation tables
The pamguard_annotations_csv_to_df() method relies on a set of other methods to import and convert a Pamguard CSV file into a Ketos annotation table. The returned dataframes are not yet standardized and thus can still be modified or processed.
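
For events files, the same conversion can be sketched in a vectorized way on a small dataframe. This is an alternative illustration, not the code used by the method: the two rows are made up to mimic the required columns, and the regular expression used to find the date in the .wav file name is an assumption about the naming scheme.

```python
import pandas as pd

# Illustrative rows mimicking the required columns of an events export
events = pd.DataFrame({
    'EventStart': ['2018-06-01 08:12:48', '2018-06-01 08:24:59'],
    'EventStartMilliseconds': [249, 235],
    'EventEnd': ['2018-06-01 08:13:32', '2018-06-01 08:25:26'],
    'WAVFile': ['192_20180601_010000_458.wav'] * 2,
})

# Recording start, parsed from the date part of the .wav file name
# (the regular expression is an assumption about the naming scheme)
file_start = pd.to_datetime(
    events['WAVFile'].str.extract(r'(\d{8}_\d{6})')[0], format='%Y%m%d_%H%M%S')

# Absolute start (with milliseconds) and end of each event
start = (pd.to_datetime(events['EventStart'])
         + pd.to_timedelta(events['EventStartMilliseconds'], unit='ms'))
end = pd.to_datetime(events['EventEnd'])

# Seconds elapsed since the beginning of the recording
events['start'] = (start - file_start).dt.total_seconds()
events['end'] = (end - file_start).dt.total_seconds()
print(events[['start', 'end']])
```

Because every intermediate Series is derived from `events` itself, they all share the same index and no index reset is needed before the arithmetic.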

In [71]:
events_annot_df = pamguard_annotations_csv_to_df(csv_events_path, 'events', events_subsampling_option = 2)
clicks_annot_df = pamguard_annotations_csv_to_df(csv_clicks_path, 'clicks', click_duration = 2.5)

In [72]:
events_annot_df.head(5)

Unnamed: 0,Id,start,end,label,filename,comment
0,465,25968.249,26012.0,DLD,192_20180601_010000_458.wav,
1,466,26699.235,26726.0,DLD,192_20180601_010000_458.wav,
2,469,66225.582,66262.0,DLD,192_20180601_010000_458.wav,
3,473,67330.013,67353.0,DLD,192_20180601_010000_458.wav,
4,474,67522.346,67577.0,DLD,192_20180601_010000_458.wav,



#### Standardize Ketos annotation tables
ketos.data_handling.selection_table.standardize() is the original method to standardize an annotation table for Ketos once it is built. ketos.data_handling.selection_table.is_standardized() checks whether this standardization is complete and, if it is not, reports what is expected.

(ClickLearn DSTI Project:) If the *signal_labels* and *backgr_labels* arguments are used, skip section 6, as background noise will correspond to label 0 and will not need random generation.
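
As we understand the Ketos documentation of `standardize()`, signal labels are mapped to 1, 2, 3, ... in the order given, labels grouped in a nested list share the same integer, and background labels are mapped to 0. The sketch below reproduces this rule in plain Python for illustration only: `build_label_mapping` is a hypothetical helper and does not call Ketos.

```python
def build_label_mapping(signal_labels, backgr_labels=()):
    """Hypothetical helper mimicking the documented mapping rule:
    background labels -> 0, signal labels -> 1, 2, 3, ...
    (labels grouped in a nested list share the same integer)."""
    mapping = {label: 0 for label in backgr_labels}
    for value, item in enumerate(signal_labels, start=1):
        group = item if isinstance(item, (list, tuple)) else [item]
        for label in group:
            mapping[label] = value
    return mapping

mapping = build_label_mapping([['dolphin_click', 'DLP', 'DLD']], backgr_labels=['other'])
print(mapping)
```

With the settings used in this notebook, 'dolphin_click', 'DLP' and 'DLD' all end up in class 1, which matches the `label` column of the standardized tables.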

In [81]:
# standardize all annotation tables
# std_events_annot_df = sl.standardize(table = events_annot_df, signal_labels = valid_labels, trim_table=False)
# std_clicks_annot_df = sl.standardize(table = clicks_annot_df, signal_labels = valid_labels, trim_table=False)

std_events_annot_df = sl.standardize(
    table = events_annot_df, 
    signal_labels = ketos_signal_labels,
    backgr_labels= ketos_backgr_labels,
    trim_table=False
)

std_clicks_annot_df = sl.standardize(
    table = clicks_annot_df, 
    signal_labels = ketos_signal_labels,
    backgr_labels= ketos_backgr_labels,
    trim_table=False
)

print('check if both events and clicks files are in standardized format:\n', 
     sl.is_standardized(std_events_annot_df), '\n',
     sl.is_standardized(std_clicks_annot_df))

check if both events and clicks files are in standardized format:
 True 
 True


In [82]:
std_events_annot_df.head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,Id,start,end,label,comment
filename,annot_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
192_20180601_010000_458.wav,0,465,25968.249,26012.0,1,
192_20180601_010000_458.wav,1,466,26699.235,26726.0,1,
192_20180601_010000_458.wav,2,469,66225.582,66262.0,1,
192_20180601_010000_458.wav,3,473,67330.013,67353.0,1,
192_20180601_010000_458.wav,4,474,67522.346,67577.0,1,


In [83]:
std_clicks_annot_df.head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,Id,start,end,label
filename,annot_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Click_Detector_Sperm_whale_click_detector_Clicks_20180601_010000.pgdf,0,177015,878.928,878.9305,1
Click_Detector_Sperm_whale_click_detector_Clicks_20180601_010000.pgdf,1,177016,878.192,878.1945,1
Click_Detector_Sperm_whale_click_detector_Clicks_20180601_010000.pgdf,2,177017,878.366,878.3685,1
Click_Detector_Sperm_whale_click_detector_Clicks_20180601_010000.pgdf,3,177018,879.532,879.5345,1
Click_Detector_Sperm_whale_click_detector_Clicks_20180601_010000.pgdf,4,177019,879.708,879.7105,1


#### Splitting into 2 random subsamples (annotation_train / annotation_test)
(CLICKLEARN DSTI Project)

The subsample_dataframe() method splits the annotation table into 2 subsamples:
* Train annotation table (Ketos will use it to create the training dataset)
* Test annotation table (Ketos will use it to create the testing dataset)
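
For comparison, the same kind of split can be sketched with pandas alone. This is an alternative illustration rather than the method used in this notebook: the rows are drawn once for the train subsample, and every remaining row goes to the test subsample, so the two cannot share a row. It assumes that the index of the table is unique.

```python
import pandas as pd

# Illustrative table with 10 annotations
demo_df = pd.DataFrame({'start': range(10), 'end': range(1, 11)})

# Draw the train rows once, then take every remaining row as the test subsample
train_df = demo_df.sample(frac=0.7, random_state=42)  # random_state makes the split reproducible
test_df = demo_df.drop(train_df.index)                # requires a unique index

print(len(train_df), len(test_df))
```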

In [71]:
help(subsample_dataframe)

Help on function subsample_dataframe in module __main__:

subsample_dataframe(df_to_subsample, train_proportion: float)
    Subsample a Pandas DataFrame into 2 others, based on its indexes.
    
    Args:
        df_to_subsample: pd.DataFrame
        train_proportion: float
            proportion (]0;1[) of indexes to subsample in the dataset, in order to build the train dataset
    
    Returns: tuple with 2 dataframes:
        - [0] train subsample 
        - [1] test subsamples



In [84]:
# call subsample_dataframe() only once per table: two separate calls would draw two different random splits,
# so train and test subsamples could share rows
std_annot_train_events, std_annot_test_events = subsample_dataframe(std_events_annot_df, 0.7)

std_annot_train_clicks, std_annot_test_clicks = subsample_dataframe(std_clicks_annot_df, 0.7)

In [85]:
std_annot_train_events.head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,Id,start,end,label,comment
filename,annot_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
192_20180601_010000_458.wav,24,506,74929.634,74962.0,1,
192_20180601_010000_458.wav,2,469,66225.582,66262.0,1,
192_20180601_010000_458.wav,10,484,72639.575,72650.0,1,
192_20180603_000000_458.wav,0,522,1378.201,1390.0,1,
192_20180601_010000_458.wav,20,498,73456.496,73473.0,1,


In [86]:
std_annot_train_clicks.head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,Id,start,end,label
filename,annot_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Click_Detector_Sperm_whale_click_detector_Clicks_20180601_210000.pgdf,1546,180517,2460.158,2460.1605,1
Click_Detector_Sperm_whale_click_detector_Clicks_20180601_220000.pgdf,13,180926,849.593,849.5955,1
Click_Detector_Sperm_whale_click_detector_Clicks_20180603_010000.pgdf,432,182294,1315.686,1315.6885,1
Click_Detector_Sperm_whale_click_detector_Clicks_20180601_210000.pgdf,908,179800,1326.285,1326.2875,1
Click_Detector_Sperm_whale_click_detector_Clicks_20180601_210000.pgdf,551,179404,924.92,924.9225,1


**The train and test files are now ready to be used in step *5. Creating segments of uniform length* and following ones from the original Notebook.**





### 2-Option 2: adding annotations by hand (not likely to be used)
(CLICKLEARN DSTI Project)

This part makes it possible to add custom annotations by hand to a Ketos annotation table.

If used on a **standardized** Ketos annotation table, numerical labels must be used.
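
Before an annotation is appended, check_overlap_errors() (section 1) tests whether its start or its end falls inside an existing annotation. That test does not flag a new annotation that fully encloses an existing one. As a minimal sketch, a general overlap test between two time ranges can be written as follows (`intervals_overlap` is a hypothetical helper, not part of the notebook's methods):

```python
def intervals_overlap(start_a: float, end_a: float, start_b: float, end_b: float) -> bool:
    """Return True if the closed intervals [start_a, end_a] and [start_b, end_b]
    share at least one point."""
    return start_a <= end_b and start_b <= end_a

print(intervals_overlap(10.0, 20.0, 12.0, 15.0))  # new annotation inside an existing one
print(intervals_overlap(10.0, 20.0, 5.0, 25.0))   # new annotation enclosing an existing one
print(intervals_overlap(10.0, 20.0, 21.0, 22.0))  # disjoint annotations
```

Such a test only makes sense for two annotations of the same audio file, since start and end values are relative to the beginning of each file.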

##### Enter each annotation below, then run the next cell to append the annotation to the dataframe:

*    **start**: start time for the annotation, in seconds from the beginning of the file
*    **end**: end time for the annotation, in seconds from the beginning of the file
*    **label**: label for the annotation

In [213]:
# type annotation details here

annotations_df = events_annot_df
filename = '192_20180601_010000_458.wav'
start = 527.55  #in seconds
end = 528.32    #in seconds
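label = 'other' #annotation label
# set of valid labels checked by append_hand_input_to_annotations() (no longer defined in the settings cell)
valid_labels = {'dolphin_click', 'other'}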

In [214]:
# append the annotation to an annotation dataframe

annotations_df = append_hand_input_to_annotations(annotations_df, filename, start, end, label, valid_labels)
annotations_df.tail(5)

Unnamed: 0,Id,start,end,label,filename,comment
31,521,10157.91,10180.0,dolphin_click,192_20180602_000000_458.wav,
32,522,1378.201,1390.0,dolphin_click,192_20180603_000000_458.wav,
33,527,5258.439,5269.0,dolphin_click,192_20180603_000000_458.wav,
34,528,5660.552,5672.0,dolphin_click,192_20180603_000000_458.wav,
35,529,527.55,528.32,other,192_20180601_010000_458.wav,


### 2-Option 3: Importing a CSV file already formatted for Ketos
(Text from original Notebook)

Our annotations are saved in two `.csv` files (with values separated by `,`): `annotations_train.csv` and `annotations_test.csv`, which we will use to create the training and test datasets respectively. These files can also be found within the `.zip` file at the top of the page. 

In [3]:
annot_train_path = os.path.join(annotations_folder_path, "annotations_train.csv")
annot_train = pd.read_csv(annot_train_path, sep = ",")
annot_test_path = os.path.join(annotations_folder_path, "annotations_test.csv")
annot_test = pd.read_csv(annot_test_path, sep = ",")

Let's inspect our annotations

In [4]:
annot_train

Unnamed: 0.1,Unnamed: 0,start,end,label,sound_file,datetime
0,2957,188.8115,190.5858,upcall,NOPP6_EST_20090328_000000.wav,2009-03-28 00:00:00
1,2958,235.7556,237.1603,upcall,NOPP6_EST_20090328_000000.wav,2009-03-28 00:00:00
2,2959,398.6924,400.1710,upcall,NOPP6_EST_20090328_000000.wav,2009-03-28 00:00:00
3,2960,438.9091,440.3138,upcall,NOPP6_EST_20090328_000000.wav,2009-03-28 00:00:00
4,2961,451.0518,452.2716,upcall,NOPP6_EST_20090328_000000.wav,2009-03-28 00:00:00
...,...,...,...,...,...,...
995,3952,52.0791,53.6686,upcall,NOPP6_EST_20090329_031500.wav,2009-03-29 03:15:00
996,3953,76.1057,77.2146,upcall,NOPP6_EST_20090329_031500.wav,2009-03-29 03:15:00
997,3954,99.9104,101.3520,upcall,NOPP6_EST_20090329_031500.wav,2009-03-29 03:15:00
998,3955,120.9983,121.9224,upcall,NOPP6_EST_20090329_031500.wav,2009-03-29 03:15:00


The **annot_train** dataframe contains 1000 rows and the **annot_test** 500.
The columns indicate:

**start:** start time for the annotation, in seconds from the beginning of the file  
**end:** end time for the annotation, in seconds from the beginning of the file   
**label:** label for the annotation (in our case, all annotated signals are 'upcalls', but the original DCLDE2013 dataset also had 'gunshots')  
**sound_file:** name of the audio file  
**datetime:** a timestamp for the beginning of the file (UTC)  

---
**Starting from here, the following text is taken from the original Notebook, with no major changes from the ClickLearn DSTI project team. 
Only a few lines of code specific to the project's needs were added.**

<a id=section3></a>

### 3. Putting the annotations in the Ketos format (to skip with Pamguard CSV importation)


Let's check if our annotations follow the Ketos standard.

If that's the case, the function ```sl.is_standardized``` will return ```True```. 


In [180]:
sl.is_standardized(annot_train)

Setting the *verbose* argument to ```False``` suppresses the example shown above:

In [126]:
sl.is_standardized(annot_test, verbose=False)

Neither of our annotation tables is in the format ketos expects, but we can use the ```sl.standardize``` function to convert them to that format.

The *annot_id* column is created automatically by the ```sl.standardize``` function. From the remaining required columns indicated in the example above, we already have *start*, *end* and *label*. Our *sound_file* column needs to be renamed to *filename*, so we will need to provide a dictionary to specify that. 

We have one extra column, *datetime*, that we don't really need to keep, so we'll set ```trim_table=True```, which will discard any columns that are not required by the standardized format.

If we wanted to keep the datetime (or any other columns), we would just set ```trim_table=False```. One situation in which you might want to do that is if you need this information to split a dataset into train/test or train/validation/test, because then you can sort all your annotations by time and make sure the training set does not overlap with the validation/test sets. But in our case, the annotations are already split.

In [5]:
map_to_ketos_annot_std = {'sound_file': 'filename'} 
std_annot_train = sl.standardize(table=annot_train, signal_labels=["upcall"], mapper=map_to_ketos_annot_std, trim_table=True)
std_annot_test = sl.standardize(table=annot_test, signal_labels=["upcall"], mapper=map_to_ketos_annot_std, trim_table=True)
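
As an aside on the time-based split mentioned above (not needed here, since the annotations are already split), a minimal pandas sketch could look like the following. The file names, offsets and the 3/5 ratio are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical annotations spread over four files; only `datetime` and `start` matter here.
annot = pd.DataFrame({
    "sound_file": ["b.wav", "a.wav", "d.wav", "c.wav", "a.wav"],
    "start": [5.0, 12.0, 3.0, 40.0, 80.0],
    "datetime": pd.to_datetime([
        "2009-03-28 00:15:00", "2009-03-28 00:00:00", "2009-03-28 00:45:00",
        "2009-03-28 00:30:00", "2009-03-28 00:00:00",
    ]),
})

# Sort chronologically: by file timestamp, then by offset inside the file.
annot = annot.sort_values(["datetime", "start"]).reset_index(drop=True)

# Keep the earliest 3/5 of the rows for training and the rest for testing (arbitrary ratio).
n_train = (len(annot) * 3) // 5
train, test = annot.iloc[:n_train], annot.iloc[n_train:]

# The latest training annotation must not come after the earliest test annotation.
assert train["datetime"].max() <= test["datetime"].min()
```

With a row-based cut like this one, a single file can end up in both sets if the cut falls inside it, so cutting on a file timestamp may be preferable in practice.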


Let's have a look at our standardized tables

In [6]:
std_annot_train

Unnamed: 0_level_0,Unnamed: 1_level_0,start,end,label
filename,annot_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
NOPP6_EST_20090328_000000.wav,0,188.8115,190.5858,1
NOPP6_EST_20090328_000000.wav,1,235.7556,237.1603,1
NOPP6_EST_20090328_000000.wav,2,398.6924,400.1710,1
NOPP6_EST_20090328_000000.wav,3,438.9091,440.3138,1
NOPP6_EST_20090328_000000.wav,4,451.0518,452.2716,1
...,...,...,...,...
NOPP6_EST_20090329_031500.wav,1,52.0791,53.6686,1
NOPP6_EST_20090329_031500.wav,2,76.1057,77.2146,1
NOPP6_EST_20090329_031500.wav,3,99.9104,101.3520,1
NOPP6_EST_20090329_031500.wav,4,120.9983,121.9224,1


Notice that the 'label' column now encodes 'upcall' as ones (1), as the ketos format uses integers to represent labels.
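
To make the effect of the standardization more concrete, the sketch below reproduces the visible changes (renamed column, integer labels, per-file *annot_id*, trimmed columns) with plain pandas on three rows taken from the table above. It only illustrates the resulting layout; it is not how ```sl.standardize``` is implemented, and the ids are computed within this three-row subset only.

```python
import pandas as pd

# Three rows taken from the annot_train table shown earlier.
raw = pd.DataFrame({
    "start": [188.8115, 235.7556, 52.0791],
    "end": [190.5858, 237.1603, 53.6686],
    "label": ["upcall", "upcall", "upcall"],
    "sound_file": ["NOPP6_EST_20090328_000000.wav",
                   "NOPP6_EST_20090328_000000.wav",
                   "NOPP6_EST_20090329_031500.wav"],
    "datetime": ["2009-03-28 00:00:00", "2009-03-28 00:00:00", "2009-03-29 03:15:00"],
})

# 1. Rename columns according to the mapper.
std = raw.rename(columns={"sound_file": "filename"})
# 2. Encode the signal label as an integer ('upcall' -> 1).
std["label"] = std["label"].map({"upcall": 1})
# 3. Number the annotations within each file.
std["annot_id"] = std.groupby("filename").cumcount()
# 4. Keep only the required columns (the effect of trim_table=True) and index the table.
std = std[["filename", "annot_id", "start", "end", "label"]].set_index(["filename", "annot_id"])

print(std)
```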

<a id=section4></a>

### 4. Creating segments of uniform length

If you look back at our ```std_annot_train``` and ```std_annot_test``` tables, you'll notice that the annotations have a variety of lengths, since they mark the beginning and end of an upcall and upcalls have variable durations. For our purposes, we want each signal in the database to be represented as a spectrogram, all of the same length. Each spectrogram will be labelled as containing an upcall or not. 

The ```sl.select``` function in ketos can help us do just that: for each annotated upcall, it will select a portion of the recording surrounding it. It takes a standardized annotation table as input and lets you specify the length of the output segments. We'll use 3 seconds, as it is enough to encompass most upcalls.

Our standardized tables only contain annotated upcalls. Later we will also want some examples of segments that only contain background noise, but for now we'll just create the uniform upcall segments, which we'll call 'positives'


(CLICKLEARN DSTI Project addition:)
* **length** (seconds): uniform length of a segment
* **step** (seconds): shift between two consecutive segments (segments do not overlap when step == length)
* **min_overlap** (proportion): minimum overlap required between a segment and the annotation 
    (1 means a segment may not cross the start or the end of the annotation)

In [43]:
#Clicklearn
positives_train_events = sl.select(annotations=std_annot_train_events, length=0.5, step=0.5, min_overlap=1, center=False, discard_long=False, keep_id=False)
positives_test_events = sl.select(annotations=std_annot_test_events, length=0.5, step=0.5, min_overlap=1, center=False, discard_long=False, keep_id=False)

positives_train_clicks = sl.select(annotations=std_annot_train_clicks, length=0.0025, step=0.0025, min_overlap=1, center=False, discard_long=False, keep_id=False)
positives_test_clicks = sl.select(annotations=std_annot_test_clicks, length=0.0025, step=0.0025, min_overlap=1, center=False, discard_long=False, keep_id=False)
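
To visualise what these parameters mean for an event annotation, the following pure-Python sketch tiles one interval into fixed-length segments. The function `tile_annotation` and the 2.2 s event are hypothetical; the sketch mirrors the parameter descriptions above and is not the algorithm used inside ```sl.select```.

```python
def tile_annotation(start, end, length, step):
    """Split [start, end] into windows of `length` seconds, shifted by `step`.

    Only windows lying entirely inside the annotation are kept, which is
    how min_overlap=1 is described above.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    windows = []
    k = 0
    # Small tolerance guards against floating-point rounding at the boundary.
    while start + k * step + length <= end + 1e-9:
        w_start = start + k * step
        windows.append((w_start, w_start + length))
        k += 1
    return windows


# Hypothetical 2.2 s event tiled into 0.5 s segments with step == length.
event_windows = tile_annotation(10.0, 12.2, length=0.5, step=0.5)
print(event_windows)
```

In this example the last 0.2 s of the event is not covered by any segment, because a further 0.5 s window would cross the end of the annotation.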

In [7]:
#Ketos documentation
positives_train = sl.select(annotations=std_annot_train, length=3.0)
positives_test = sl.select(annotations=std_annot_test, length=3.0, step=0.0, center=False)


Have a look at the results and notice how each entry is now 3.0 seconds long.

In [90]:
(positives_train['end']-positives_train['start']).values.mean()

3.0

<a id=section5></a>

### 5. Augmenting the data

Data augmentation is a set of techniques used in machine learning to increase the data available to train models. There are many different techniques that can be used. The ```sl.select``` function we just used offers a simple way to augment the data while you are creating the uniform selections. It creates segments that are longer than the annotated signals and then shifts the start and end of those segments, resulting in multiple segments with the same annotated signal (our upcalls) positioned at different times. This is a very safe technique, as it does not alter the original signal, but it can already help to increase the amount of data available. It also helps to present a larger variety of contexts in which the upcall can appear.  

We'll augment the training portion of our annotations by using two additional arguments. The ```step``` argument specifies how much the signal will be shifted (in seconds). Smaller values will produce more augmented selections, but each will be more similar to the previous selection. The ```min_overlap``` argument specifies the fraction of the augmented signal that needs to overlap the original annotation in order for it to be included in the augmented selections table. A value of 1.0 means 100%, that is, the new annotation will only be included if the entire upcall falls within the established interval. Lower values will result in segments that only contain part of the original upcall. We'll set this value to 0.5, meaning that some of our augmented segments might have as little as half of the original call.

In [30]:
positives_train = sl.select(annotations=std_annot_train, length=3.0, step=0.5, min_overlap=0.5, center=False)

In [31]:
positives_train

Unnamed: 0_level_0,Unnamed: 1_level_0,label,start,end
filename,sel_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
NOPP6_EST_20090328_000000.wav,0,1,187.476816,190.476816
NOPP6_EST_20090328_000000.wav,1,1,187.976816,190.976816
NOPP6_EST_20090328_000000.wav,2,1,188.476816,191.476816
NOPP6_EST_20090328_000000.wav,3,1,188.976816,191.976816
NOPP6_EST_20090328_000000.wav,4,1,234.296197,237.296197
...,...,...,...,...
NOPP6_EST_20090329_031500.wav,12,1,103.808134,106.808134
NOPP6_EST_20090329_031500.wav,13,1,104.308134,107.308134
NOPP6_EST_20090329_031500.wav,14,1,119.876635,122.876635
NOPP6_EST_20090329_031500.wav,15,1,120.376635,123.376635


Notice that our ```positives_train``` table now has almost 3 times as many rows as before.
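
The shifting idea can be sketched in a few lines of pure Python. The function `shifted_windows` is hypothetical and assumes that the overlap is measured relative to the annotation duration (following the "half of the original call" reading above); ```sl.select``` may differ in details, so the numbers are not expected to match the table above.

```python
def shifted_windows(a_start, a_end, length, step, min_overlap):
    """Return (start, end) windows of fixed length around one annotation.

    Windows are shifted by multiples of `step` around the centred window and
    kept only if they cover at least `min_overlap` of the annotation duration.
    """
    centre_start = (a_start + a_end) / 2 - length / 2
    duration = a_end - a_start
    # Enough shifts in both directions to move the window fully past the annotation.
    max_shifts = int((length + duration) / step) + 1
    windows = []
    for k in range(-max_shifts, max_shifts + 1):
        w_start = centre_start + k * step
        w_end = w_start + length
        overlap = min(w_end, a_end) - max(w_start, a_start)
        if overlap >= min_overlap * duration:
            windows.append((w_start, w_end))
    return windows


# Hypothetical 2 s annotation, 3 s windows shifted in 0.5 s steps.
half = shifted_windows(10.0, 12.0, length=3.0, step=0.5, min_overlap=0.5)
full = shifted_windows(10.0, 12.0, length=3.0, step=0.5, min_overlap=1.0)
print(len(half), len(full))
```

Lowering `min_overlap` admits more shifted windows per annotation, which is where the extra rows come from.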

<a id=section6></a> 

### 6. Including background noise

Now that we have the positive instances that we need to create our database, we need to include some examples of the negative class, or instances without upcalls.

The ```sl.create_rndm_backgr_selections``` function is ideal for our situation. It takes a standardized ketos table describing all sections of the recordings that contain annotations and takes samples from the non-annotated portions of the files, assuming everything that is not annotated can be used as a 'background' category.

**Note**:
You might find yourself in a different scenario. For example, your annotations might already include a 'background' class, or you might have annotated different classes of sounds and only want to use a few of them. In any case, ketos provides a variety of other functions that are helpful in different scenarios. Have a look at the documentation for more details, especially the ```selection_table``` module.



In [92]:
# sl.create_rndm_backgr_selections(std_annot_test, len(std_annot_test), 12, annotations=None, no_overlap=False, trim_table=False)

The ```sl.create_rndm_backgr_selections``` function also needs the duration of each file, which we can generate using the ```sl.file_duration_table``` function.

In [93]:
file_durations_train = sl.file_duration_table('data/train')
file_durations_test = sl.file_duration_table('data/test') 

In [94]:
file_durations_train

Unnamed: 0,filename,duration
0,NOPP6_EST_20090328_000000.wav,900.0
1,NOPP6_EST_20090328_001500.wav,900.0
2,NOPP6_EST_20090328_003000.wav,900.0
3,NOPP6_EST_20090328_004500.wav,900.0
4,NOPP6_EST_20090328_010000.wav,900.0
...,...,...
79,NOPP6_EST_20090329_021500.wav,900.0
80,NOPP6_EST_20090329_023000.wav,900.0
81,NOPP6_EST_20090329_024500.wav,900.0
82,NOPP6_EST_20090329_030000.wav,900.0


Now that we have the file durations, we can generate our table of negative segments. We'll specify the same length (3.0 seconds). The ```num``` argument specifies the number of background segments we would like to generate. Let's make this number equal to the number of positive examples in each dataset (```len(positives_train)``` and ```len(positives_test)```).

In [95]:
negatives_train=sl.create_rndm_backgr_selections(annotations=std_annot_train, files=file_durations_train, length=3.0, num=len(positives_train), trim_table=True)
negatives_train

Unnamed: 0_level_0,Unnamed: 1_level_0,start,end,label
filename,sel_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
NOPP6_EST_20090328_000000.wav,0,10.038680,13.038680,0
NOPP6_EST_20090328_000000.wav,1,15.184157,18.184157,0
NOPP6_EST_20090328_000000.wav,2,34.412736,37.412736,0
NOPP6_EST_20090328_000000.wav,3,54.765640,57.765640,0
NOPP6_EST_20090328_000000.wav,4,64.034017,67.034017,0
...,...,...,...,...
NOPP6_EST_20090329_031500.wav,34,773.620888,776.620888,0
NOPP6_EST_20090329_031500.wav,35,797.263613,800.263613,0
NOPP6_EST_20090329_031500.wav,36,817.536221,820.536221,0
NOPP6_EST_20090329_031500.wav,37,865.566218,868.566218,0


In [96]:
negatives_test=sl.create_rndm_backgr_selections(annotations=std_annot_test, files=file_durations_test, length=3.0, num=len(positives_test), trim_table=True)
negatives_test

Unnamed: 0_level_0,Unnamed: 1_level_0,start,end,label
filename,sel_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
NOPP6_EST_20090329_084500.wav,0,18.297599,21.297599,0
NOPP6_EST_20090329_084500.wav,1,53.077267,56.077267,0
NOPP6_EST_20090329_084500.wav,2,57.503250,60.503250,0
NOPP6_EST_20090329_084500.wav,3,185.283499,188.283499,0
NOPP6_EST_20090329_084500.wav,4,228.091559,231.091559,0
...,...,...,...,...
NOPP6_EST_20090329_130000.wav,27,711.265965,714.265965,0
NOPP6_EST_20090329_130000.wav,28,734.553712,737.553712,0
NOPP6_EST_20090329_130000.wav,29,793.627781,796.627781,0
NOPP6_EST_20090329_130000.wav,30,846.031280,849.031280,0
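
The principle behind background sampling can be illustrated with a small rejection-sampling sketch for a single file. The function `sample_background` is hypothetical and uses only the standard library; it is one simple way to draw windows outside the annotated intervals, not the method used by ```sl.create_rndm_backgr_selections```.

```python
import random


def sample_background(annotations, file_duration, length, num, seed=0, max_tries=10_000):
    """Draw `num` random windows of `length` seconds that avoid all annotations.

    annotations: list of (start, end) tuples for one file, in seconds.
    """
    rng = random.Random(seed)
    windows = []
    tries = 0
    while len(windows) < num and tries < max_tries:
        tries += 1
        w_start = rng.uniform(0.0, file_duration - length)
        w_end = w_start + length
        # Reject the window if it overlaps any annotated interval.
        if any(w_start < a_end and a_start < w_end for a_start, a_end in annotations):
            continue
        windows.append((w_start, w_end))
    return sorted(windows)


# First three annotations of NOPP6_EST_20090328_000000.wav (900 s long).
annotations = [(188.8115, 190.5858), (235.7556, 237.1603), (398.6924, 400.1710)]
background = sample_background(annotations, file_duration=900.0, length=3.0, num=5)
print(background)
```

Note that this sketch lets background windows overlap one another; it only guarantees that they stay clear of the annotated intervals.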


There we have it! Now we'll just put the ```positives_train``` and ```negatives_train``` together and do the same to the test tables.

In [97]:
# DataFrame.append was removed in pandas 2.0; pd.concat works across versions
selections_train = pd.concat([positives_train, negatives_train], sort=False)
selections_test = pd.concat([positives_test, negatives_test], sort=False)

In [98]:
selections_train

Unnamed: 0_level_0,Unnamed: 1_level_0,label,start,end
filename,sel_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
NOPP6_EST_20090328_000000.wav,0,1,187.476816,190.476816
NOPP6_EST_20090328_000000.wav,1,1,187.976816,190.976816
NOPP6_EST_20090328_000000.wav,2,1,188.476816,191.476816
NOPP6_EST_20090328_000000.wav,3,1,188.976816,191.976816
NOPP6_EST_20090328_000000.wav,4,1,234.296197,237.296197
...,...,...,...,...
NOPP6_EST_20090329_031500.wav,34,0,773.620888,776.620888
NOPP6_EST_20090329_031500.wav,35,0,797.263613,800.263613
NOPP6_EST_20090329_031500.wav,36,0,817.536221,820.536221
NOPP6_EST_20090329_031500.wav,37,0,865.566218,868.566218


In [99]:
selections_test

Unnamed: 0_level_0,Unnamed: 1_level_0,label,start,end
filename,sel_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
NOPP6_EST_20090329_084500.wav,0,1,890.291522,893.291522
NOPP6_EST_20090329_090000.wav,0,1,52.403343,55.403343
NOPP6_EST_20090329_090000.wav,1,1,41.795390,44.795390
NOPP6_EST_20090329_090000.wav,2,1,96.902297,99.902297
NOPP6_EST_20090329_090000.wav,3,1,114.921926,117.921926
...,...,...,...,...
NOPP6_EST_20090329_130000.wav,27,0,711.265965,714.265965
NOPP6_EST_20090329_130000.wav,28,0,734.553712,737.553712
NOPP6_EST_20090329_130000.wav,29,0,793.627781,796.627781
NOPP6_EST_20090329_130000.wav,30,0,846.031280,849.031280


At this point, we have defined *which* audio segments we want in our database: a little over 5500 in the training dataset, 50% with upcalls and 50% without, and 1000 for the test set, maintaining the same ratio.
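
A quick way to confirm the class ratio of a combined table is to count the labels. The sketch below does so on a tiny hand-made table with the same layout as the selection tables; the file names and times are hypothetical.

```python
import pandas as pd

# Tiny hypothetical selection tables with the same layout as above.
index_pos = pd.MultiIndex.from_tuples(
    [("a.wav", 0), ("a.wav", 1)], names=["filename", "sel_id"])
positives = pd.DataFrame(
    {"label": [1, 1], "start": [10.0, 10.5], "end": [13.0, 13.5]}, index=index_pos)

index_neg = pd.MultiIndex.from_tuples(
    [("a.wav", 0), ("b.wav", 0)], names=["filename", "sel_id"])
negatives = pd.DataFrame(
    {"start": [100.0, 20.0], "end": [103.0, 23.0], "label": [0, 0]}, index=index_neg)

# Stack both tables; the column order of the first table is kept.
selections = pd.concat([positives, negatives], sort=False)

# Absolute and relative number of segments per class.
counts = selections["label"].value_counts()
fractions = selections["label"].value_counts(normalize=True)
print(counts)
print(fractions)
```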

Now we need to decide *how* these segments will be represented.

<a id=section7></a>

### 7. Choosing the spectrogram settings

As mentioned earlier, we'll represent the segments as spectrograms.
In the .zip file where you found the data, there's also a spectrogram configuration file (```spec_config.json```) which contains the settings we want to use.

This configuration file is simply a text file in the ```.json``` format, so you could make a copy of it, change a few parameters and save several settings to use later or to share with someone else.


In [100]:
from ketos.data_handling.parsing import load_audio_representation
spec_cfg = load_audio_representation('spec_config.json', name="spectrogram")

In [101]:
spec_cfg

{'type': 'MagSpectrogram',
 'rate': 1000,
 'window': 0.256,
 'step': 0.032,
 'freq_min': 0,
 'freq_max': 500,
 'window_func': 'hamming'}

The result is a Python dictionary. We could change some values, such as the step size:

In [None]:
#spec_cfg['step'] = 0.064

But we will stick to the original here.
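
If you do want to keep track of alternative settings, one option is to modify a copy of the dictionary and write it to a JSON file. The sketch below uses only the standard library; the output file name is hypothetical, and the file written here simply records the parsed dictionary shown above, so it may not have the same layout as the original ```spec_config.json```.

```python
import json
import os
import tempfile

# The parsed settings shown above.
spec_cfg = {'type': 'MagSpectrogram', 'rate': 1000, 'window': 0.256, 'step': 0.032,
            'freq_min': 0, 'freq_max': 500, 'window_func': 'hamming'}

# Work on a copy so the original settings stay untouched.
variant = dict(spec_cfg)
variant['step'] = 0.064

# Hypothetical file name; a temporary folder keeps the example free of side effects.
out_path = os.path.join(tempfile.mkdtemp(), 'spec_config_step064.json')
with open(out_path, 'w') as f:
    json.dump(variant, f, indent=2)

# Read the file back to confirm the settings were stored as expected.
with open(out_path) as f:
    reloaded = json.load(f)
print(reloaded)
```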

<a id=section8></a>

### 8. Creating the database

Now we have to compute the spectrograms following the settings above for each selection in our selection tables and then save them in a database.

All of this can be done with the ```dbi.create_database``` function in Ketos.

We will start with the training dataset. We need to indicate the name for the database we want to create, where the audio files are, a name for the dataset, the selections table and, finally the audio representation. As specified in our ``spec_cfg``, this is a Magnitude spectrogram, but ketos can also create databases with Power, Mel and CQT spectrograms, as well as time-domain data (waveforms).


In [102]:
dbi.create_database(output_file='database.h5', 
                    data_dir='data/train',
                    dataset_name='train',
                    selections=selections_train,
                    audio_repres=spec_cfg
                   )                             

100%|██████████| 5504/5504 [01:17<00:00, 70.77it/s]


5504 items saved to database.h5


And we do the same thing for the test set. Note that, by specifying the same database name, we are telling ketos that we want to add the test set to the existing database.

In [103]:
dbi.create_database(output_file='database.h5', 
                    data_dir='data/test',
                    dataset_name='test',
                    selections=selections_test,
                    audio_repres=spec_cfg
                   )                            

100%|██████████| 1000/1000 [00:15<00:00, 65.97it/s]

1000 items saved to database.h5





Now we have our database with spectrograms representing audio segments with and without the North Atlantic Right Whale upcall. The data is divided into 'train' and 'test'. 



In [104]:
db = dbi.open_file("database.h5", 'r')

In [105]:
db

File(filename=database.h5, title='', mode='r', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/test (Group) ''
/test/data (Table(1000,), fletcher32, shuffle, zlib(1)) ''
  description := {
  "data": Float32Col(shape=(94, 129), dflt=0.0, pos=0),
  "filename": StringCol(itemsize=100, shape=(), dflt=b'', pos=1),
  "id": UInt32Col(shape=(), dflt=0, pos=2),
  "label": UInt8Col(shape=(), dflt=0, pos=3),
  "offset": Float64Col(shape=(), dflt=0.0, pos=4)}
  byteorder := 'little'
  chunkshape := (5,)
/train (Group) ''
/train/data (Table(5504,), fletcher32, shuffle, zlib(1)) ''
  description := {
  "data": Float32Col(shape=(94, 129), dflt=0.0, pos=0),
  "filename": StringCol(itemsize=100, shape=(), dflt=b'', pos=1),
  "id": UInt32Col(shape=(), dflt=0, pos=2),
  "label": UInt8Col(shape=(), dflt=0, pos=3),
  "offset": Float64Col(shape=(), dflt=0.0, pos=4)}
  byteorder := 'little'
  chunkshape := (5,)

Here we can see the data divided into 'train' and 'test'. These are called 'groups' in HDF5 terms. Within each of them there's a dataset called 'data', which contains the spectrograms and respective labels.
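
The shape ```(94, 129)``` of each stored spectrogram can be related to the settings in ```spec_cfg``` with a little arithmetic. The rounding conventions used below (rounding the number of time bins up, and counting the non-negative frequencies of a 256-sample window) are assumptions that happen to match the shape shown above; ketos may compute these values differently internally.

```python
import math

duration = 3.0   # segment length in seconds
rate = 1000      # sampling rate in Hz, from spec_cfg['rate']
window = 0.256   # window length in seconds, from spec_cfg['window']
step = 0.032     # step size in seconds, from spec_cfg['step']

# Number of time bins: 3.0 / 0.032 = 93.75, rounded up (assumed convention).
n_time_bins = math.ceil(duration / step)

# Samples per window and resulting number of frequency bins (0 Hz up to the Nyquist frequency).
window_samples = round(window * rate)
n_freq_bins = window_samples // 2 + 1

print((n_time_bins, n_freq_bins))
```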

In [106]:
db.close() #close the database connection

You will likely not need to directly interact with the database. In a following tutorial, we will use Ketos to build a deep neural network and train it to recognize upcalls. Ketos handles the database interactions, so we won't really have to go into the details of it, but if you would like to learn more about how to get data from this database, take a look at the [database_interface](https://docs.meridian.cs.dal.ca/ketos/modules/data_handling/database_interface.html) module in ketos and the [pyTables](https://www.pytables.org/index.html) documentation.