# 01 - Dataset Creation


## Step 1: Initialization and Slicing

Here we create the dataset from a set of audio source files.
These files are resampled and sliced into chunks, according to the parameters provided in the dataset manifest (sample rate, duration, overlap).
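
To make the slicing parameters concrete, here is a minimal sketch of how chunk boundaries could be derived from a duration and an overlap. It is not the toolbox's implementation: the `chunk_bounds` function is hypothetical, the overlap is assumed to be expressed in seconds, and a trailing partial chunk is assumed to be dropped.

```python
def chunk_bounds(total_duration, duration, overlap):
    """Return (start, end) times in seconds for fixed-length, overlapping chunks.

    Hypothetical helper: only full-length chunks are kept, so a trailing
    partial chunk is dropped (the toolbox may handle the tail differently).
    """
    if duration <= 0 or not 0 <= overlap < duration:
        raise ValueError("need duration > 0 and 0 <= overlap < duration")
    hop = duration - overlap  # time between two consecutive chunk starts
    bounds = []
    index = 0
    # Multiply instead of accumulating to limit floating point drift.
    while index * hop + duration <= total_duration:
        start = index * hop
        bounds.append((start, start + duration))
        index += 1
    return bounds


# A 3 second file cut into 1 second chunks with 0.5 second overlap
bounds = chunk_bounds(total_duration=3.0, duration=1.0, overlap=0.5)
print(bounds)
```

With these values the chunks start every 0.5 s, which matches the `start_t` and `end_t` pattern visible in the samples table at the end of this notebook.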

A SQLite database is also created for the dataset, to persist any useful information. 

*Note: When multiprocessing is used, progress is not displayed in the Jupyter notebook, so you have to look at the console.*

In [1]:
import warnings                            # This block prevents display of harmless warnings, but should be
warnings.filterwarnings('ignore')          # commented out till the final version, to avoid missing "real" warnings 

import kilroy_was_here                     # Mandatory. Allow access to shared python code from repository root
from lib.audiodataset import AudioDataset  # Class for audio dataset handling
from lib.jupytools import iprint           # timestamped (to the ms) print with CPU and RAM consumption information

# Path where to find initial annotated dataset (audio and lab files)
SOURCE_PATH = 'D:/datasets/sounds/Nolasco'

# Dataset name is the master key for dataset addressing.
DATASET_NAME = 'SMALL1005'

# Initialize Dataset Object. 
try:
    # By providing a source path, we implicitly indicate that we want to CREATE the dataset.
    # Run with a pool of 4 processes
    ds = AudioDataset(DATASET_NAME, SOURCE_PATH, nprocs=4)
    
except FileExistsError:
    # To allow rerun, we catch the exception in case the dataset was already created.
    # Ideally, you should create the dataset once for all in a dedicated notebook,
    # and then retrieve it from other notebooks when needed
    # Here, by not providing a source path, we implicitly express the intent of RETRIEVING
    # an existing dataset rather than CREATING a new one
    iprint("Retrieving existing dataset")
    ds = AudioDataset(DATASET_NAME)
    iprint("Dataset retrieved")
    
# The following line provides some information about the newly created (or retrieved) AudioDataset object    
ds.info()



[2020-08-07/21:40:56.002|12.8%|78.0%|0.25GB] The dataset directory (D:\Jupyter\ShowBees\datasets\SMALL1005) already exists.
[2020-08-07/21:40:56.002|00.0%|78.0%|0.25GB] If you really intent to CREATE this dataset, please erase this directory first
[2020-08-07/21:40:56.002|00.0%|78.0%|0.25GB] ### ABORTING! ###
[2020-08-07/21:40:56.002|00.0%|78.0%|0.25GB] Retrieving existing dataset
[2020-08-07/21:40:56.003|00.0%|78.0%|0.25GB] Dataset retrieved
[2020-08-07/21:40:56.004|20.0%|78.0%|0.25GB] ------------------------------------------------------
[2020-08-07/21:40:56.004|00.0%|78.0%|0.25GB] DATASET PATH          : D:\Jupyter\ShowBees\datasets\SMALL1005
[2020-08-07/21:40:56.004|00.0%|78.0%|0.25GB] DATASET DB PATH       : D:\Jupyter\ShowBees\datasets\SMALL1005\database.db
[2020-08-07/21:40:56.004|00.0%|78.0%|0.25GB] DATASET SAMPLES PATH  : D:\Jupyter\ShowBees\datasets\SMALL1005\samples
[2020-08-07/21:40:56.004|00.0%|78.0%|0.25GB] NB SOURCE AUDIO FILES : 4
[2020-08-07/21:40:56.004|00.0%|78.0%|0

## Step 2: Add Labels

Here we add **labels** to our dataset samples. Labels are set using functions called ***Labelizers***, which define the source of the label and the way it will be inserted into the database. Labelizers in turn make use of **Transformers**, which (as the name suggests) transform the source into an acceptable format.

Both labelizers and transformers are either built into the toolbox or developed by the user (the toolbox provides utility functions for their development).

Labels have a name and a float numeric value.

- *Note1: By design, labels do not have a string value, as machine learning frameworks usually expect numeric values.*
- *Note2: Currently, labels are aimed only at binary classifiers, so their value is usually either 0 or 1 (in some cases the label value belongs to [0,1], which reflects the confidence associated with the label).*


We are adding two labels, using two different labelizers:

- the "queen" label, using the builtin FromFileName labelizer, associated with the builtin StringMapper transformer
- the "nobee" label, using the builtin FromAnnotation labelizer, without transformation


**First, we use the listLabels method to show that no labels were defined**

In [3]:
ds.listLabels()

[]

**Next, we add the labels**

*Label addition just extends the database tables to store the labels; they have no value yet*

In [4]:
ds.addLabel("queen")
ds.addLabel("nobee")

# Check that labels were created
ds.listLabels()

['nobee', 'queen']
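
As an illustration of what "extending the database tables" could look like, here is a minimal sketch using the standard `sqlite3` module. It is an assumption, not the toolbox's actual code: the `samples` table layout, the `add_label` function and the choice of one `REAL` column per label are all hypothetical.

```python
import sqlite3

# In-memory database standing in for the dataset database (hypothetical schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (name TEXT PRIMARY KEY, start_t REAL, end_t REAL)")
conn.execute("INSERT INTO samples VALUES ('00-000000', 0.0, 1.0)")


def add_label(conn, label):
    """Add one float column for the label; existing rows get NULL."""
    # Column names cannot be bound as SQL parameters, so validate the name first.
    if not label.isidentifier():
        raise ValueError(f"invalid label name: {label!r}")
    conn.execute(f'ALTER TABLE samples ADD COLUMN "{label}" REAL')


add_label(conn, "queen")
row = conn.execute("SELECT name, queen FROM samples").fetchone()
print(row)  # ('00-000000', None)
```

The new column is empty (NULL) until a labelizer fills it, which mirrors the remark that freshly added labels have no value yet.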

In [10]:
from lib import labelizers
from lib import transformers

# The "queen" label value is deduced from the source file name, using a StringMapper transformer.
# This transformer iterates over a list of 2-tuples (regular expression, target value) and returns
# the target value as soon as a match is found. Thus, you must order your list from strictest to loosest.
trsfrm_queen = transformers.StringMapper(
        [('(?i)active', 1), 
         ('(?i)missing queen', 0),
         ('NO_QueenBee', 0),
         ('QueenBee', 1)     
        ])

# The transformer is then used over the source filenames, using the FromFileName labelizer
# This labelizer does not provide label strength.

n = ds.setLabel('queen', labelizers.FromFileName(trsfrm_queen))
iprint(n, "samples where processed for 'queen' label")

# The "nobee" label value comes from annotation files (.lab files using the same base name as the audio
# source file they annotate), using the FromAnnotation labelizer, with no transformation.
# This labelizer takes 2 arguments:
# - a mandatory source path, pointing to the directory where the .lab files reside
# - an optional threshold, used to disregard any "label" event with a duration under this threshold
# The label strength over a sample is computed by summing the duration of "label" events (if > th) and dividing
#   this sum by the sample duration
 
n = ds.setLabel('nobee', labelizers.FromAnnotation(SOURCE_PATH, th=0))
iprint(n, "samples where processed for 'nobee' label")

[2020-08-07/21:14:06.038|06.5%|76.6%|0.26GB] 4744 samples where processed for 'queen' label
[2020-08-07/21:14:07.034|12.9%|76.6%|0.26GB] 4744 samples where processed for 'nobee' label
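
The label strength rule described in the comments above can be sketched as follows. This is only an illustration under stated assumptions: the `label_strength` function and the event tuples are hypothetical, the threshold is applied to the full event duration, each event is clipped to the sample window, and events are assumed not to overlap each other.

```python
def label_strength(events, sample_start, sample_end, label, th=0.0):
    """Fraction of the sample [sample_start, sample_end] covered by `label` events.

    events: iterable of (start, end, name) tuples, in seconds.
    Events whose duration is not strictly above `th` are ignored.
    """
    covered = 0.0
    for start, end, name in events:
        if name != label or (end - start) <= th:
            continue
        # Keep only the part of the event that falls inside the sample
        overlap = min(end, sample_end) - max(start, sample_start)
        if overlap > 0:
            covered += overlap
    return covered / (sample_end - sample_start)


# Hypothetical annotation events: (start, end, name)
events = [(0.0, 0.25, "nobee"), (0.5, 2.0, "bee"), (2.0, 2.5, "nobee")]

s1 = label_strength(events, 0.0, 1.0, "nobee")          # a quarter of the sample
s2 = label_strength(events, 1.5, 2.5, "nobee")          # half of the sample
s3 = label_strength(events, 0.0, 1.0, "nobee", th=0.3)  # short event ignored
print(s1, s2, s3)
```

A strength of 0 or 1 corresponds to the usual binary case, while intermediate values arise when an annotated event covers only part of a sample.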


## Step 3: Add Attributes

Here we add **attributes** to our dataset samples. Attributes are set using functions called ***Attributors***, either built into the toolbox or developed by the user (the toolbox provides utility functions for Attributor development).

Attributes can be used to "tag" samples for future subset extraction. They have a name and a value, always stored as a string (note the difference with labels).

Here we tag each sample with the hive it belongs to. As the hive is encoded in the first 5 characters of the source file name, we use a FromFileName attributor, with a StringMatcher transformer.

In [11]:
from lib import attributors

# The StringMatcher transformer behaves differently from the StringMapper. It uses a regexp
# capture group to retrieve the part of a string matching a specific pattern. This can be used
# for either complex or very basic matching. Here we just ask for the first five chars,
# provided they are valid identifier characters (A-Z, a-z, 0-9 and underscore)
ds.addAttribute('hive')
# A raw string keeps "\w" from being read as an (invalid) string escape sequence
ds.setAttribute('hive', attributors.FromFileName(transformers.StringMatcher(r"^(\w{5})")))

None


4744
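
For readers unfamiliar with capture groups, the extraction performed for the "hive" attribute can be reproduced with the standard `re` module. This is a minimal sketch, not the StringMatcher source: the `match_group` function and the file names are hypothetical.

```python
import re


def match_group(pattern, text):
    """Return the first capture group of pattern found in text, or None if no match."""
    m = re.search(pattern, text)
    return m.group(1) if m else None


# Hypothetical file names, for illustration only
hive = match_group(r"^(\w{5})", "Hive1_12_06_2018_QueenBee.wav")
no_hive = match_group(r"^(\w{5})", "Hi-ve.wav")
print(hive, no_hive)  # Hive1 None
```

The second name yields no value because the hyphen is not a word character, so the first five characters do not all match `\w`.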

## Step 4: Perform some requests

In [18]:
# this dumps all columns from the samples table into a pandas dataframe
ds.dumpDataFrame()

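| | name | file_id | start_t | end_t | queen | nobee | hive |
|---|---|---|---|---|---|---|---|
| 0 | 00-000000 | 1 | 0.0 | 1.0 | 1.0 | 0.0 | Hive1 |
| 1 | 00-000001 | 1 | 0.5 | 1.5 | 1.0 | 0.0 | Hive1 |
| 2 | 00-000002 | 1 | 1.0 | 2.0 | 1.0 | 0.0 | Hive1 |
| 3 | 00-000003 | 1 | 1.5 | 2.5 | 1.0 | 0.0 | Hive1 |
| 4 | 00-000004 | 1 | 2.0 | 3.0 | 1.0 | 0.0 | Hive1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4739 | 03-001178 | 4 | 589.0 | 590.0 | 1.0 | 0.0 | Hive3 |
| 4740 | 03-001179 | 4 | 589.5 | 590.5 | 1.0 | 0.0 | Hive3 |
| 4741 | 03-001180 | 4 | 590.0 | 591.0 | 1.0 | 0.0 | Hive3 |
| 4742 | 03-001181 | 4 | 590.5 | 591.5 | 1.0 | 0.0 | Hive3 |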


In [20]:
# but you can also be more specific
ds.queryDataFrame("select name, hive, nobee, queen from samples where nobee = 0")

| | name | hive | nobee | queen |
|---|---|---|---|---|
| 0 | 00-000000 | Hive1 | 0.0 | 1.0 |
| 1 | 00-000001 | Hive1 | 0.0 | 1.0 |
| 2 | 00-000002 | Hive1 | 0.0 | 1.0 |
| 3 | 00-000003 | Hive1 | 0.0 | 1.0 |
| 4 | 00-000004 | Hive1 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... |
| 3020 | 03-001178 | Hive3 | 0.0 | 1.0 |
| 3021 | 03-001179 | Hive3 | 0.0 | 1.0 |
| 3022 | 03-001180 | Hive3 | 0.0 | 1.0 |
| 3023 | 03-001181 | Hive3 | 0.0 | 1.0 |
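
Since requests are plain SQL over the `samples` table, aggregations are possible too. The sketch below runs such a query against a small in-memory SQLite table with the standard `sqlite3` module, so it works without the toolbox. The three rows are made up for illustration and are not taken from the dataset. Passing the same SQL string to `ds.queryDataFrame` should work the same way, assuming it forwards the query unchanged.

```python
import sqlite3

# In-memory stand-in for the samples table (made-up rows, illustration only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (name TEXT, hive TEXT, nobee REAL, queen REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?, ?)",
    [
        ("00-000000", "Hive1", 0.0, 1.0),
        ("00-000001", "Hive1", 1.0, 1.0),
        ("03-000000", "Hive3", 0.0, 0.0),
    ],
)

# Per hive: number of samples containing bees, and the share labelled "queen"
sql = (
    "SELECT hive, COUNT(*), AVG(queen) FROM samples "
    "WHERE nobee = 0 GROUP BY hive ORDER BY hive"
)
rows = conn.execute(sql).fetchall()
print(rows)  # [('Hive1', 1, 1.0), ('Hive3', 1, 0.0)]
```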
