# 02 - Datasets (Creation)

<hr style="border:1px solid gray"></hr>

## Step 1: Initialization and Slicing

Here we create the dataset from a set of audio source files.
These files are resampled and sliced into chunks, according to the parameters provided in the dataset manifest (sample rate, duration, overlap).
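
To make the effect of the duration and overlap parameters concrete, here is a minimal sketch of how chunk boundaries could be derived from them. It is not the toolbox's implementation: the function name `chunk_bounds` is hypothetical, and it assumes that a trailing chunk shorter than the requested duration is simply dropped.

```python
def chunk_bounds(total_s, duration_s, overlap_s=0.0):
    """Return (start, end) times in seconds for fixed-length chunks.

    Consecutive chunks start (duration_s - overlap_s) seconds apart.
    Assumption: a trailing chunk shorter than duration_s is dropped.
    """
    if duration_s <= 0 or not 0 <= overlap_s < duration_s:
        raise ValueError("need duration_s > 0 and 0 <= overlap_s < duration_s")
    step = duration_s - overlap_s
    bounds = []
    i = 0
    # Keep adding chunks while a full-length chunk still fits in the recording
    while i * step + duration_s <= total_s:
        start = i * step
        bounds.append((start, start + duration_s))
        i += 1
    return bounds


# A hypothetical 200 s recording cut into 60 s chunks
print(chunk_bounds(200, 60))      # no overlap
print(chunk_bounds(200, 60, 30))  # 30 s overlap between consecutive chunks
```

With no overlap the chunks tile the recording back to back; a non-zero overlap shortens the step between chunk starts and therefore yields more chunks for the same recording.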

A SQLite database is also created for the dataset, to persist any useful information. 

*Note: In the current version, progress is not displayed in the Jupyter notebook when multiprocessing is used, so you have to watch the console.*

In [1]:
import warnings                            # This block prevents display of harmless warnings, but should be
warnings.filterwarnings('ignore')          # commented out till the final version, to avoid missing "real" warnings 

import kilroy_was_here                     # Mandatory. Allows access to shared python code from repository root
from audace.audiodataset import AudioDataset  # Class for audio dataset handling
from audace.jupytools import iprint           # timestamped (to the ms) print with CPU and RAM consumption information

# Path to the initial annotated dataset (audio and lab files)
SOURCE_PATH = '/Users/jpg/Documents/Nolasco'

# Dataset name is the master key for dataset addressing.
# Here we use a tiny dataset with only 2 audio files
# that will be sliced into 60s chunks
DATASET_NAME = 'TUTO'

# Initialize Dataset Object. 
try:
    # By providing a source path, we implicitly indicate that we want to CREATE the dataset.
    # Run with a pool of 2 processes
    ds = AudioDataset(DATASET_NAME, SOURCE_PATH, nprocs=2)
    
except FileExistsError:
    # To allow rerun, we catch the exception in case the dataset was already created.
    # Ideally, you should create the dataset once and for all in a dedicated notebook,
    # and then retrieve it from other notebooks when needed
    # Here, by not providing a source path, we implicitly express the intent of RETRIEVING
    # an existing dataset rather than CREATING a new one
    iprint("Retrieving existing dataset")
    ds = AudioDataset(DATASET_NAME)
    iprint("Dataset retrieved")
    
# The following line provides some information about the newly created (or retrieved) AudioDataset object    
ds.info()



[2020-09-03/12:42:34.895|11.4%|72.9%|0.28GB] >>>>> Starting Dataset TUTO build
[2020-09-03/12:42:34.914|00.0%|72.9%|0.28GB] Starting to process 2 audio files.
[2020-09-03/12:43:01.621|25.3%|72.4%|0.28GB] Creating Database
[2020-09-03/12:43:01.637|33.3%|72.4%|0.28GB] Database created
[2020-09-03/12:43:01.637|00.0%|72.4%|0.28GB] Please wait, computing checksum...
[2020-09-03/12:43:01.721|20.0%|72.5%|0.28GB]   Computed checksum d02ebf42437ed11fa55c3d35cc5502ec
[2020-09-03/12:43:01.721|00.0%|72.5%|0.28GB]   Expected checksum d02ebf42437ed11fa55c3d35cc5502ec
[2020-09-03/12:43:01.721|00.0%|72.5%|0.28GB] >>>>> Dataset TUTO successfully created.
[2020-09-03/12:43:01.721|00.0%|72.5%|0.28GB] ------------------------------------------------------
[2020-09-03/12:43:01.721|00.0%|72.5%|0.28GB] DATASET NAME          : TUTO
[2020-09-03/12:43:01.721|00.0%|72.5%|0.28GB] DATASET PATH          : D:\Jupyter\ShowBees\datasets\TUTO
[2020-09-03/12:43:01.721|00.0%|72.5%|0.28GB] DATASET DB PATH       : D:\Jupyt

<hr style="border:1px solid gray"></hr>

## Step 2: Add Labels

Here we add **labels** to our dataset samples. Labels can be set using various functions called ***Providers*** which basically define the source of the label. Providers then make use of **Transformers**, which (as the name suggests) transform the source into an acceptable format.

Both providers and transformers are either built into the toolbox or developed by the user (the Audace framework provides utility functions for their development).

Labels have a name and a float numeric value.

- *Note1: By design, labels do not have a string value, as usually machine learning frameworks expect numerals.* 
- *Note2: Currently, labels are aimed only at binary classifiers, so their value is usually either 0 or 1 (There are some cases where the label value belongs to [0,1] which reflects the confidence associated with the label).*


We are adding two labels, using two different labelizers:

- the "queen" label, using the builtin FromFileName provider, associated with the builtin StringMapper transformer
- the "nobee" label, using the builtin FromAnnotation provider, without transformation
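
The transformer used for the "queen" label in the cell further down tries a list of (regular expression, value) pairs in order and returns the value of the first match. The sketch below illustrates that first-match behaviour and why rule order matters. It is an illustration only: the function `first_match_value` is hypothetical, the use of `re.search` and the `None` default are assumptions, and the file names are made up.

```python
import re


def first_match_value(name, rules, default=None):
    """Return the value of the first (pattern, value) rule that matches name."""
    for pattern, value in rules:
        # Assumption: a rule matches if the pattern is found anywhere in the name
        if re.search(pattern, name):
            return value
    return default


# Same rules as in the notebook cell, ordered from stricter to looser
rules = [
    (r"(?i)active", 1),
    (r"(?i)missing queen", 0),
    (r"NO_QueenBee", 0),
    (r"QueenBee", 1),
]

# Hypothetical file names, for illustration only
print(first_match_value("Hive3_NO_QueenBee_example.wav", rules))
print(first_match_value("Hive1_QueenBee_example.wav", rules))

# With the last two rules swapped, the looser 'QueenBee' pattern wins first
swapped = rules[:2] + [rules[3], rules[2]]
print(first_match_value("Hive3_NO_QueenBee_example.wav", swapped))
```

Because `QueenBee` is a substring of `NO_QueenBee`, placing the looser pattern first would label queenless recordings as having a queen, which is why the stricter pattern has to come earlier in the list.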


**First, we use the listLabels method to show that no labels were defined**

In [3]:
ds.listLabels()

[]

**Next, we add the labels**

*Label addition just extends the database tables to store the labels. The labels have no value yet.*

In [4]:
ds.addLabel("queen")
ds.addLabel("nobee")

# Check that labels were created
ds.listLabels()

['nobee', 'queen']
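
As a rough picture of what "extending the database tables" could mean, the sketch below adds one column per label to a `samples` table with the standard `sqlite3` module. This is an assumption based on the `queen` and `nobee` columns visible in the table dump later in this notebook; the helper `add_label_column` is hypothetical and the toolbox's actual schema changes may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples (name TEXT, file_id INTEGER, start_t REAL, end_t REAL)"
)
conn.execute("INSERT INTO samples VALUES ('00-000000', 1, 0.0, 60.0)")


def add_label_column(conn, label):
    """Add a REAL column named after the label (hypothetical helper)."""
    # Identifiers cannot be bound as SQL parameters, so validate before formatting
    if not label.isidentifier():
        raise ValueError(f"invalid label name: {label!r}")
    conn.execute(f'ALTER TABLE samples ADD COLUMN "{label}" REAL')


for label in ("queen", "nobee"):
    add_label_column(conn, label)

# Column names after the change, and the label values of the existing row
columns = [row[1] for row in conn.execute("PRAGMA table_info(samples)")]
row = conn.execute("SELECT queen, nobee FROM samples").fetchone()
print(columns)
print(row)
```

Existing rows get NULL (`None` in Python) in the new columns, which matches the idea that a freshly added label has no value until it is set.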

In [5]:
from audace import providers
from audace import transformers

# The "queen" label value is deduced from the source file name, using a StringMapper transformer.
# This transformer iterates over a list of 2-tuples (regular expression, target value) and returns
# the target value as soon as a match is found. Thus, you must order your list from stricter to looser
trsfrm_queen = transformers.StringMapper(
        [('(?i)active', 1), 
         ('(?i)missing queen', 0),
         ('NO_QueenBee', 0),
         ('QueenBee', 1)     
        ])

# The transformer is then applied to the source file names, using the FromFileName provider.
# This labelizer does not provide label strength.

n = ds.setLabel('queen', providers.FromFileName(trsfrm_queen))
iprint(n, "samples were processed for 'queen' label")

# The "nobee" label value comes from annotation files (.lab files using the same base name as the audio
# source file they annotate), using the FromAnnotation labelizer, with no transformation.
# This labelizer takes 2 arguments:
# - a mandatory source path, pointing to the directory where the .lab files reside
# - an optional threshold, used to disregard any "label" event with a duration under this threshold
# The label strength over a sample is computed by summing the duration of "label" events (if > th) and dividing
#   this sum by the sample duration

n = ds.setLabel('nobee', providers.FromAnnotation(SOURCE_PATH, th=0))
iprint(n, "samples were processed for 'nobee' label")

[2020-09-03/12:48:19.373|06.5%|72.2%|0.28GB] 18 samples were processed for 'queen' label


HBox(children=(FloatProgress(value=0.0, description='Annotating nobee', max=2.0, style=ProgressStyle(descripti…


[2020-09-03/12:48:19.403|35.7%|72.2%|0.28GB] 18 samples were processed for 'nobee' label
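
The label strength rule described in the cell comments (sum the durations of the label's events above the threshold, then divide by the sample duration) can be sketched as follows. This is not the toolbox's code: the function name `label_strength` is hypothetical, the events are made up, and it assumes that events do not overlap each other, that each event is clipped to the sample window, and that the threshold is compared with the clipped duration.

```python
def label_strength(events, start_t, end_t, th=0.0):
    """Fraction of the [start_t, end_t] window covered by labelled events.

    events: list of (event_start, event_end) times in seconds for one label.
    Assumptions: events do not overlap each other, each event is clipped
    to the window, and th is compared with the clipped duration.
    """
    covered = 0.0
    for ev_start, ev_end in events:
        # Duration of the part of the event that falls inside the window
        overlap = min(ev_end, end_t) - max(ev_start, start_t)
        if overlap > 0 and overlap > th:
            covered += overlap
    return covered / (end_t - start_t)


# Hypothetical "nobee" events, in seconds from the start of the source file
events = [(10.0, 25.0), (50.0, 70.0), (100.0, 100.5)]

print(label_strength(events, 0.0, 60.0))            # first 60 s chunk
print(label_strength(events, 60.0, 120.0))          # second chunk, th = 0
print(label_strength(events, 60.0, 120.0, th=1.0))  # events of 1 s or less ignored
```

A value close to 0 means the chunk is almost free of the annotated event, and a value close to 1 means the event covers nearly the whole chunk, which is how fractional values such as those in the `nobee` column can arise.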


<hr style="border:1px solid gray"></hr>

## Step 3: Add Attributes

Here we add **attributes** to our dataset samples. Just like Labels, Attributes make use of providers and transformers.

Attributes can be used to "tag" samples for future subset extraction. They have a name and a value, which is always stored as a string (note the difference with labels).

Here we tag each sample with the hive it belongs to. As the hive is encoded in the first 5 characters of the source file name, we use a FromFileName provider with a StringMatcher transformer.

In [6]:
# The StringMatcher transformer behaves differently from the StringMapper. It uses a regexp
# capture group to retrieve the part of a string matching a specific pattern. This can be used
# either for complex or very basic matching. Here we just ask for the first five chars,
# provided they are valid identifier characters (A-Z, a-z, 0-9 and underscore)
ds.addAttribute('hive')
# Raw string, so that \w reaches the regex engine without being read as a string escape
ds.setAttribute('hive', providers.FromFileName(transformers.StringMatcher(r"^(\w{5})")))
ds.listAttributes()

['hive']
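
The capture-group idea behind the hive attribute can be tried in isolation with the standard `re` module. The sketch below is only an illustration: the function `capture_first_group` is hypothetical, the file names are made up, and returning `None` when nothing matches is an assumption about the desired behaviour.

```python
import re


def capture_first_group(name, pattern, default=None):
    """Return the first capture group of pattern in name, or default if no match."""
    match = re.search(pattern, name)
    return match.group(1) if match else default


# Same pattern as in the notebook cell: five word characters at the start
HIVE_PATTERN = r"^(\w{5})"

# Hypothetical file names, for illustration only
print(capture_first_group("Hive1_example_recording.wav", HIVE_PATTERN))
print(capture_first_group("ab-cd.wav", HIVE_PATTERN))  # '-' is not a word character
```

Note that in Python 3, `\w` applied to a `str` also matches non-ASCII letters and digits; passing `flags=re.ASCII` to the regex call restricts it to A-Z, a-z, 0-9 and underscore.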

<hr style="border:1px solid gray"></hr>

## Step 4: Perform some requests

**You can dump the full db as a pandas dataframe**

In [7]:
# This dumps all columns from the samples table into a pandas dataframe
ds.dumpDataFrame()

| | name | file_id | start_t | end_t | queen | nobee | hive |
|---|---|---|---|---|---|---|---|
| 0 | 00-000000 | 1 | 0.0 | 60.0 | 1.0 | 0.084 | Hive1 |
| 1 | 00-000001 | 1 | 60.0 | 120.0 | 1.0 | 0.001 | Hive1 |
| 2 | 00-000002 | 1 | 120.0 | 180.0 | 1.0 | 0.583 | Hive1 |
| 3 | 00-000003 | 1 | 180.0 | 240.0 | 1.0 | 0.125 | Hive1 |
| 4 | 00-000004 | 1 | 240.0 | 300.0 | 1.0 | 0.499 | Hive1 |
| 5 | 00-000005 | 1 | 300.0 | 360.0 | 1.0 | 0.459 | Hive1 |
| 6 | 00-000006 | 1 | 360.0 | 420.0 | 1.0 | 0.581 | Hive1 |
| 7 | 00-000007 | 1 | 420.0 | 480.0 | 1.0 | 0.126 | Hive1 |
| 8 | 00-000008 | 1 | 480.0 | 540.0 | 1.0 | 0.541 | Hive1 |
| 9 | 01-000000 | 2 | 0.0 | 60.0 | 0.0 | 0.064 | Hive3 |


**Or be more specific**

Here we select only some columns, for chunks without external perturbation (nobee = 0)

In [8]:
# Select specific columns and filter rows with an SQL query
ds.queryDataFrame("select name, hive, nobee, queen from samples where nobee = 0")

| | name | hive | nobee | queen |
|---|---|---|---|---|
| 0 | 01-000004 | Hive3 | 0.0 | 0.0 |


**Or use sqlite builtin functions and the full power of the SQL language**

Here we select only chunks recorded on Hive1 that were perturbed by external noise, ordered by descending perturbation ratio. At the same time, we binarize this perturbation ratio into a boolean using the sqlite builtin function iif.

In [9]:
sql = """
select
    rowid,
    name,
    file_id,
    hive,
    nobee,
    iif(nobee < 0.1, 0, 1) as b_nobee -- using sqlite builtin function 
from samples
where hive = 'Hive1'
and nobee != 0
ORDER BY nobee DESC
"""
ds.queryDataFrame(sql)

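| | rowid | name | file_id | hive | nobee | b_nobee |
|---|---|---|---|---|---|---|
| 0 | 3 | 00-000002 | 1 | Hive1 | 0.583 | 1 |
| 1 | 7 | 00-000006 | 1 | Hive1 | 0.581 | 1 |
| 2 | 9 | 00-000008 | 1 | Hive1 | 0.541 | 1 |
| 3 | 5 | 00-000004 | 1 | Hive1 | 0.499 | 1 |
| 4 | 6 | 00-000005 | 1 | Hive1 | 0.459 | 1 |
| 5 | 8 | 00-000007 | 1 | Hive1 | 0.126 | 1 |
| 6 | 4 | 00-000003 | 1 | Hive1 | 0.125 | 1 |
| 7 | 1 | 00-000000 | 1 | Hive1 | 0.084 | 0 |
| 8 | 2 | 00-000001 | 1 | Hive1 | 0.001 | 0 |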
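
The `iif` function was added in SQLite 3.32.0, so on an older SQLite library the query above fails. A portable alternative is a standard `CASE` expression. The sketch below shows it with the standard `sqlite3` module on a small in-memory table; the table is a stand-in built for illustration (three rows copied from the dump above), not the dataset's real database, and the threshold is passed as a bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (name TEXT, hive TEXT, nobee REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [
        ("00-000000", "Hive1", 0.084),
        ("00-000002", "Hive1", 0.583),
        ("01-000000", "Hive3", 0.064),
    ],
)

# CASE WHEN ... THEN ... ELSE ... END is equivalent to iif(condition, a, b)
sql = """
SELECT name,
       nobee,
       CASE WHEN nobee < ? THEN 0 ELSE 1 END AS b_nobee
FROM samples
WHERE hive = ?
  AND nobee != 0
ORDER BY nobee DESC
"""
rows = conn.execute(sql, (0.1, "Hive1")).fetchall()
for row in rows:
    print(row)
```

Binding the threshold and hive name as parameters also makes it easy to rerun the same query with different values without editing the SQL text.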
