# 01 - Dataset Creation


## Step 1: Initialization and Slicing

Here we create the dataset from a set of audio source files.
These files are resampled and sliced into chunks according to the parameters provided in the dataset manifest (sample rate, duration, overlap).
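
As a rough illustration of how duration and overlap translate into chunk boundaries, here is a minimal sketch. It is not the toolbox's implementation: the function name `slice_bounds`, the choice to express overlap in seconds, and the choice to drop a trailing partial chunk are all assumptions made for this example.

```python
def slice_bounds(n_samples, sample_rate, duration, overlap):
    """Return (start, end) sample indices of fixed-length chunks.

    Hypothetical helper, not part of the toolbox.
    Assumptions: `duration` and `overlap` are in seconds, and a
    trailing chunk shorter than `duration` is dropped.
    """
    chunk = int(round(duration * sample_rate))
    hop = int(round((duration - overlap) * sample_rate))
    if chunk <= 0 or hop <= 0:
        raise ValueError("duration must be positive and larger than overlap")
    # Each chunk starts `hop` samples after the previous one.
    return [(s, s + chunk) for s in range(0, n_samples - chunk + 1, hop)]


# 5 seconds of audio at a toy rate of 10 Hz, 2 s chunks with 1 s overlap
bounds = slice_bounds(n_samples=50, sample_rate=10, duration=2, overlap=1)
print(bounds)
```

With a 1 s overlap each chunk shares half of its content with its neighbour, so the number of chunks grows roughly by a factor of two compared with non-overlapping slicing.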

A SQLite database is also created for the dataset, to persist any useful information. 

*Note: When multiprocessing, progress is not tracked in the Jupyter notebook, so you have to watch the console.*

In [1]:
import warnings                            # This block prevents display of harmless warnings, but should be
warnings.filterwarnings('ignore')          # commented out till the final version, to avoid missing "real" warnings 

import kilroy_was_here                     # Mandatory. Allows access to shared Python code in the upper 'codelib' directory
from lib.audiodataset import AudioDataset  # Class for audio dataset handling
from lib.jupytools import iprint           # Timestamped (to the ms) print with CPU and RAM consumption information

# Path where to find initial annotated dataset (audio and lab files)
SOURCE_PATH = 'D:/datasets/sounds/Nolasco'

# Dataset name is the master key for dataset addressing
DATASET_NAME = 'SMALL1005'

# Initialize Dataset Object. 
try:
    # By providing a source path, we implicitly indicate that we want to CREATE the dataset.
    # Run with a pool of 4 processes
    ds = AudioDataset(DATASET_NAME, SOURCE_PATH, nprocs=4)
except FileExistsError:
    # To allow rerun, we catch the exception in case the dataset was already created.
    # Ideally, you should create the dataset once for all in a dedicated notebook,
    # and then retrieve it from other notebooks when needed
    # Here, by not providing a source path, we implicitly express the intent of RETRIEVING
    # an existing dataset rather than CREATING a new one
    iprint("Retrieving existing dataset")
    ds = AudioDataset(DATASET_NAME)
    iprint("Dataset retrieved")
    
# The following line provides some information about the newly created (or retrieved) AudioDataset object
ds.info()



[2020-08-05/21:10:12.499|15.5%|82.0%|0.25GB] The dataset directory (D:\Jupyter\ShowBees\datasets\SMALL1005) already exists.
[2020-08-05/21:10:12.501|00.0%|82.0%|0.25GB] If you really intent to CREATE this dataset, please erase this directory first
[2020-08-05/21:10:12.501|00.0%|82.0%|0.25GB] ### ABORTING! ###
[2020-08-05/21:10:12.501|00.0%|82.0%|0.25GB] Retrieving existing dataset
[2020-08-05/21:10:12.501|00.0%|82.0%|0.25GB] Dataset retrieved
[2020-08-05/21:10:12.501|00.0%|82.0%|0.25GB] ------------------------------------------------------
[2020-08-05/21:10:12.501|00.0%|82.0%|0.25GB] DATASET PATH          : D:\Jupyter\ShowBees\datasets\SMALL1005
[2020-08-05/21:10:12.501|00.0%|82.0%|0.25GB] DATASET DB PATH       : D:\Jupyter\ShowBees\datasets\SMALL1005\database.db
[2020-08-05/21:10:12.501|00.0%|82.0%|0.25GB] DATASET SAMPLES PATH  : D:\Jupyter\ShowBees\datasets\SMALL1005\samples
[2020-08-05/21:10:12.501|00.0%|82.0%|0.25GB] NB SOURCE AUDIO FILES : 4
[2020-08-05/21:10:12.501|00.0%|82.0%|0

## Step 2: Add Labels

Here we add **labels** to our dataset samples. Labels can be set using various functions called ***Labelizers***, which define the source of the label and the way it will be inserted into the database. Labelizers in turn make use of **transformers**, which (as you may have guessed) transform the source into an acceptable format.

Both labelizers and transformers are either built into the toolbox or developed by the user (the toolbox provides utility functions for their development).

Labels have a name, a numeric value, and an optional strength.

- *Note1: By design, labels do not have a string value, since machine learning frameworks usually expect numeric values.*
- *Note2: Currently, labels are aimed only at binary classifiers, so their value is either 0 or 1 (no multi-class).*


We are adding two labels, using two different labelizers:

- the "queen" label, using the builtin FromFileName labelizer, associated with the builtin StringMapper transformer
- the "nobee" label, using the builtin FromAnnotation labelizer, without transformation
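
To make the first-match behaviour of a pattern-to-value transformer concrete, here is a minimal sketch. It is an assumption-laden stand-in, not the toolbox's `StringMapper`: the function name `map_string` is hypothetical, and the use of `re.search` (match anywhere in the string) is a guess about how matching is done.

```python
import re


def map_string(rules, text):
    """Return the value of the first rule whose pattern matches `text`.

    Hypothetical helper illustrating first-match semantics.
    Assumption: patterns are matched anywhere in the text (re.search).
    """
    for pattern, value in rules:
        if re.search(pattern, text):
            return value
    return None  # no rule matched


rules = [
    (r"(?i)active", 1),
    (r"(?i)missing queen", 0),
    (r"NO_QueenBee", 0),  # stricter pattern first
    (r"QueenBee", 1),     # looser pattern last
]

queen = map_string(rules, "Hive1_12_06_2018_QueenBee_H1_audio___15_00_00.wav")
no_queen = map_string(rules, "Hive1_31_05_2018_NO_QueenBee_H1_audio___15_00_00.wav")
print(queen, no_queen)
```

Order matters here: `QueenBee` is a substring of `NO_QueenBee`, so if the looser rule came first, files without a queen would be labeled 1.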


*Note: The `setLabel` method uses an "insert or update" based on the (sample_id, label_name) unique index. Consecutive invocations for a given label with the same parameters won't affect the already labeled samples. Conversely, if different parameters are used, existing samples will be updated.*
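
The "insert or update" behaviour described in the note can be sketched with an in-memory SQLite database. The table layout, column names, and the `set_label` helper below are hypothetical and chosen only for illustration; the actual schema used by the toolbox is not shown in this notebook. The `ON CONFLICT ... DO UPDATE` clause requires SQLite 3.24 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (sample, label) pair.
conn.execute(
    "CREATE TABLE labels ("
    "sample_id INTEGER, label_name TEXT, value INTEGER, strength REAL)"
)
conn.execute("CREATE UNIQUE INDEX idx_labels ON labels (sample_id, label_name)")


def set_label(conn, sample_id, label_name, value, strength=None):
    """Insert a label, or update it if (sample_id, label_name) already exists."""
    conn.execute(
        """
        INSERT INTO labels (sample_id, label_name, value, strength)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (sample_id, label_name)
        DO UPDATE SET value = excluded.value, strength = excluded.strength
        """,
        (sample_id, label_name, value, strength),
    )


set_label(conn, 1, "queen", 1)
set_label(conn, 1, "queen", 1)  # same parameters: the row is unchanged
set_label(conn, 1, "queen", 0)  # different value: the existing row is updated
rows = conn.execute("SELECT sample_id, label_name, value FROM labels").fetchall()
print(rows)
```

Because the unique index covers both columns, the same sample can still carry several labels with different names.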

In [2]:
from lib import labelizers
from lib import transformers

# The "queen" label value is deduced from the source file name, using a StringMapper transformer.
# This transformer iterates over a list of 2-tuples (regular expression, target value) and returns
# the target value as soon as a match is found. Thus, you must order your list from strictest to loosest.
trsfrm_queen = transformers.StringMapper(
        [('(?i)active', 1), 
         ('(?i)missing queen', 0),
         ('NO_QueenBee', 0),
         ('QueenBee', 1)     
        ])

# The transformer is then used over the source filenames, using the FromFileName labelizer
# This labelizer does not provide label strength.

n = ds.setLabel('queen', labelizers.FromFileName(trsfrm_queen))
iprint(n, "samples where processed for 'queen' label")


# The "nobee" label value comes from annotation files (.lab files using the same base name as the audio
# source file they annotate), using the FromAnnotation labelizer, with no transformation.
# This labelizer takes 3 arguments:
# - a mandatory source path, pointing to the directory where the .lab files reside
# - an optional unitary threshold (uth), used to disregard any "label" event with a duration under this threshold
# - an optional global threshold (gth), used to assign the target label only if its strength is above this threshold
# In other words:
# - The label strength over a sample is computed by summing the durations of "label" events (if > uth) and dividing
#   this sum by the sample duration
# - The label value for a sample will be 1 if the label strength is above the global threshold, else 0
 
n = ds.setLabel('nobee', labelizers.FromAnnotation(SOURCE_PATH, uth=0, gth=0))
iprint(n, "samples where processed for 'nobee' label")

[2020-08-05/21:10:19.062|07.5%|81.2%|0.26GB] 4744 samples where processed for 'queen' label
[2020-08-05/21:10:19.062|00.0%|81.2%|0.26GB] [1] Hive1_12_06_2018_QueenBee_H1_audio___15_00_00.wav
[2020-08-05/21:10:19.305|17.4%|81.2%|0.26GB] [2] Hive1_31_05_2018_NO_QueenBee_H1_audio___15_00_00.wav
[2020-08-05/21:10:19.604|14.7%|81.2%|0.26GB] [3] Hive3_15_07_2017_NO_QueenBee_H3_audio___06_10_00.wav
[2020-08-05/21:10:19.832|12.1%|81.2%|0.26GB] [4] Hive3_20_07_2017_QueenBee_H3_audio___06_10_00.wav
[2020-08-05/21:10:20.096|16.8%|81.2%|0.26GB] 4744 samples where processed for 'nobee' label
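
The strength and threshold rules described in the cell comments can be sketched as follows. This is an illustration under stated assumptions, not the code of `FromAnnotation`: the function names are hypothetical, events are assumed to be `(start, end)` pairs in seconds, and the unitary threshold is assumed to apply to the part of each event that falls inside the sample.

```python
def label_strength(events, sample_start, sample_end, uth=0.0):
    """Fraction of the sample covered by "label" events longer than `uth`.

    Hypothetical helper. Assumptions: `events` is a list of
    (start, end) times in seconds, and `uth` is compared with the
    portion of each event that lies inside the sample window.
    """
    total = 0.0
    for ev_start, ev_end in events:
        # Duration of the event clipped to the sample window
        overlap = min(ev_end, sample_end) - max(ev_start, sample_start)
        if overlap > 0 and overlap > uth:
            total += overlap
    return total / (sample_end - sample_start)


def label_value(strength, gth=0.0):
    """Binary label: 1 if the strength is strictly above `gth`, else 0."""
    return 1 if strength > gth else 0


events = [(0.5, 1.0), (1.8, 3.0)]      # two annotated events
s_all = label_strength(events, 0.0, 2.0)           # both events counted
s_long = label_strength(events, 0.0, 2.0, uth=0.3)  # short overlap ignored
print(s_all, s_long, label_value(s_all), label_value(s_long, gth=0.3))
```

With `uth=0` and `gth=0`, as used in the cell above, any sample that overlaps an annotated event at all receives the value 1.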


## Step 3: Add Attributes

Here we add **attributes** to our dataset samples. Attributes can be set using various functions called ***Attributors***, either built into the toolbox or developed by the user (the toolbox provides utility functions for developing Attributors).

Attributes can be used to "tag" samples for future subset extraction. They have a name and a value, which is always stored as a string (note the difference with labels).

Here we tag each sample with the hive it belongs to. As the hive is encoded in the first 5 characters of the source file name, we use a FromFileName attributor, with a StringMatcher transformer.

In [3]:
from lib import attributors

# The StringMatcher transformer behaves differently from the StringMapper. It uses a regexp
# capture group to retrieve the part of a string matching a specific pattern. This can be used
# for either complex or very basic matching. Here we just ask for the first five chars,
# provided they are valid identifier characters (A-Z, a-z, 0-9 and underscore).
# A raw string is used so that "\w" is not treated as an (invalid) string escape sequence.
ds.setAttr('hive', attributors.FromFileName(transformers.StringMatcher(r"^(\w{5})")))

4744
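
For readers unfamiliar with capture groups, the extraction performed by the pattern above can be reproduced with the standard `re` module. The `string_matcher` factory below is a hypothetical stand-in written for this example; how the toolbox's `StringMatcher` handles a non-matching name is not documented here, so returning `None` is an assumption.

```python
import re


def string_matcher(pattern):
    """Build a function returning the first capture group of `pattern`.

    Hypothetical stand-in for illustration only.
    Assumption: a non-matching input yields None.
    """
    regex = re.compile(pattern)

    def transform(text):
        match = regex.search(text)
        return match.group(1) if match else None

    return transform


hive_of = string_matcher(r"^(\w{5})")
hive = hive_of("Hive1_12_06_2018_QueenBee_H1_audio___15_00_00.wav")
print(hive)
```

Since the captured value is text, it fits the rule that attribute values are stored as strings.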

## Step 4: Next Steps

We can also add **features** and **augmentations** to our dataset samples. Features (resp. augmentations) can be set using various functions called ***Featurizers*** (resp. ***Augmentators***). Both are either built into the toolbox or developed by the user (the toolbox provides utility functions for developing Featurizers and Augmentators).