# 02 - Datasets (Creation)

## Step 1: Initialization and Slicing

Here we create the dataset from a set of audio source files.
These files are resampled and sliced into chunks according to the parameters provided in the dataset manifest (sample rate, duration, overlap).
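
To make the slicing parameters concrete, here is a minimal sketch of how chunk boundaries can be derived from a duration and an overlap. It is not the Audace implementation: the function name `slice_bounds` is hypothetical, the overlap is assumed to be expressed in seconds, and a trailing chunk shorter than the requested duration is assumed to be dropped.

```python
def slice_bounds(total_duration, chunk_duration, overlap):
    """Return (start_t, end_t) pairs, in seconds, for fixed-length chunks.

    Assumptions (not taken from the Audace code): `overlap` is expressed
    in seconds, and a trailing chunk shorter than `chunk_duration` is dropped.
    """
    if chunk_duration <= 0 or not 0 <= overlap < chunk_duration:
        raise ValueError("need chunk_duration > 0 and 0 <= overlap < chunk_duration")
    step = chunk_duration - overlap  # distance between two consecutive chunk starts
    bounds = []
    i = 0
    # The small epsilon guards against floating-point error on the last chunk.
    while i * step + chunk_duration <= total_duration + 1e-9:
        start = i * step
        bounds.append((start, start + chunk_duration))
        i += 1
    return bounds


# 1 s chunks with 0.5 s overlap over a 3 s file
print(slice_bounds(3.0, 1.0, 0.5)[:3])  # [(0.0, 1.0), (0.5, 1.5), (1.0, 2.0)]
```

With a 1 s duration and a 0.5 s overlap, the boundaries follow the same pattern as the `start_t` and `end_t` columns dumped in Step 4.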

A SQLite database is also created for the dataset, to persist any useful information. 

*Note: When multiprocessing is used, progress is not displayed in the Jupyter notebook, so you have to look at the console.*

In [1]:
import warnings                            # This block prevents display of harmless warnings, but should be
warnings.filterwarnings('ignore')          # commented out till the final version, to avoid missing "real" warnings 

import kilroy_was_here                     # Mandatory. Allows access to shared python code from repository root
from audace.audiodataset import AudioDataset  # Class for audio dataset handling
from audace.jupytools import iprint           # timestamped (to the ms) print with CPU and RAM consumption information

# Path where to find initial annotated dataset (audio and lab files)
SOURCE_PATH ='D:/datasets/sounds/Nolasco'
#SOURCE_PATH ='/Users/jpg/Documents/Nolasco'

# Dataset name is the master key for dataset addressing.
DATASET_NAME = 'SMALL1005'

# Initialize Dataset Object. 
try:
    # By providing a source path, we implicitly indicate that we want to CREATE the dataset.
    # Run with a pool of 4 processes
    ds = AudioDataset(DATASET_NAME, SOURCE_PATH, nprocs=4)
    
except FileExistsError:
    # To allow rerun, we catch the exception in case the dataset was already created.
    # Ideally, you should create the dataset once for all in a dedicated notebook,
    # and then retrieve it from other notebooks when needed
    # Here, by not providing a source path, we implicitly express the intent of RETRIEVING
    # an existing dataset rather than CREATING a new one
    iprint("Retrieving existing dataset")
    ds = AudioDataset(DATASET_NAME)
    iprint("Dataset retrieved")
    
# The following line provides some information about the newly created (or retrieved) AudioDataset object    
ds.info()



[2020-08-10/11:14:41.679|23.1%|70.4%|0.26GB] >>>>> Starting Dataset SMALL1005 build
[2020-08-10/11:14:41.706|09.1%|70.4%|0.26GB] Starting to process 4 audio files.
[2020-08-10/11:15:26.124|50.2%|69.1%|0.26GB] Creating Database
[2020-08-10/11:15:26.145|40.0%|69.1%|0.26GB] Database created
[2020-08-10/11:15:26.148|00.0%|69.1%|0.26GB] Please wait, computing checksum...
[2020-08-10/11:15:27.144|18.3%|70.9%|0.26GB]   Computed checksum 6671ad6663eb2019bd3af30170705bb3
[2020-08-10/11:15:27.145|00.0%|70.9%|0.26GB]   Expected checksum 6671ad6663eb2019bd3af30170705bb3
[2020-08-10/11:15:27.145|00.0%|70.9%|0.26GB] Checksum OK!
[2020-08-10/11:15:27.145|00.0%|70.9%|0.26GB] >>>>> Dataset SMALL1005 successfully created.
[2020-08-10/11:15:27.145|00.0%|70.9%|0.26GB] ------------------------------------------------------
[2020-08-10/11:15:27.145|00.0%|70.9%|0.26GB] DATASET PATH          : D:\Jupyter\ShowBees\datasets\SMALL1005
[2020-08-10/11:15:27.146|00.0%|70.9%|0.26GB] DATASET DB PATH       : D:\Jupyte

## Step 2: Add Labels

Here we add **labels** to our dataset samples. Labels are set using functions called **Providers**, which define the source of the label and the way it will be inserted into the database. Providers in turn make use of **Transformers**, which (as the name suggests) transform the source into an acceptable format.

Both providers and transformers are either built into the toolbox or developed by the user (the Audace framework provides utility functions for their development).
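
The division of work between a provider and a transformer can be sketched as follows. This is only an illustration of the idea, not the Audace API: `FirstMatchMapper` and `label_from_file_names` are hypothetical names, and the file names are made up. The transformer returns the value of the first regular expression that matches, so the order of the rules matters.

```python
import re


class FirstMatchMapper:
    """Hypothetical transformer: map a string to the value of the first matching rule."""

    def __init__(self, rules):
        # Each rule is a (regular expression, target value) pair, compiled once.
        self.rules = [(re.compile(pattern), value) for pattern, value in rules]

    def __call__(self, text):
        for regex, value in self.rules:
            if regex.search(text):
                return float(value)
        return None  # no rule matched


def label_from_file_names(file_names, transformer):
    """Hypothetical provider: apply the transformer to each source file name."""
    return {name: transformer(name) for name in file_names}


# The stricter pattern must come first, since "QueenBee" also matches "NO_QueenBee".
mapper = FirstMatchMapper([("NO_QueenBee", 0), ("QueenBee", 1)])
labels = label_from_file_names(["a_NO_QueenBee.wav", "b_QueenBee.wav", "c.wav"], mapper)
print(labels)  # {'a_NO_QueenBee.wav': 0.0, 'b_QueenBee.wav': 1.0, 'c.wav': None}
```

Swapping the two rules would map `a_NO_QueenBee.wav` to 1.0, which is why rule lists of this kind are ordered from strictest to loosest.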

Labels have a name and a float numeric value.

- *Note1: By design, labels do not have a string value, as machine learning frameworks usually expect numeric values.*
- *Note2: Currently, labels are aimed only at binary classifiers, so their value is usually either 0 or 1 (in some cases the label value lies in [0,1], reflecting the confidence associated with the label).*


We are adding two labels, using two different labelizers:

- the "queen" label, using the builtin FromFileName provider, associated with the builtin StringMapper transformer
- the "nobee" label, using the builtin FromAnnotation provider, without transformation


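**First, we use the listLabels method to list the labels currently defined**

*On a freshly created dataset this list is empty. The output below already shows both labels, which suggests the cell was rerun on a dataset where they had already been added.*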

In [2]:
ds.listLabels()

['nobee', 'queen']

**Next, we add the labels**

*Adding a label only extends the database tables to store it; the label has no value yet.*

In [3]:
ds.addLabel("queen")
ds.addLabel("nobee")

# Check that labels were created
ds.listLabels()

['nobee', 'queen']

In [4]:
from audace import providers
from audace import transformers

# The "queen" label value is deduced from the source file name, using a StringMapper transformer.
# This transformer iterates over a list of 2-tuples (regular expression, target value) and returns
# the target value as soon as a match is found. Thus, you must order your list from strictest to loosest
trsfrm_queen = transformers.StringMapper(
        [('(?i)active', 1), 
         ('(?i)missing queen', 0),
         ('NO_QueenBee', 0),
         ('QueenBee', 1)     
        ])

# The transformer is then used over the source filenames, using the FromFileName provider
# This labelizer does not provide label strength.

n = ds.setLabel('queen', providers.FromFileName(trsfrm_queen))
iprint(n, "samples where processed for 'queen' label")

# The "nobee" label value comes from annotation files (.lab files using the same base name as the audio
# source file they annotate), using the FromAnnotation labelizer, with no transformation.
# This labelizer takes 2 arguments:
# - a mandatory source path, pointing to the directory where the .lab files reside
# - an optional threshold, allowing to disregard any "label" event with a duration under this threshold
# The label strength over a sample is computed by summing the duration of "label" events (if > th) and dividing
#   this sum by the sample duration
 
n = ds.setLabel('nobee', providers.FromAnnotation(SOURCE_PATH, th=0))
iprint(n, "samples where processed for 'nobee' label")

[2020-08-10/07:53:35.047|09.6%|70.6%|0.26GB] 4744 samples where processed for 'queen' label


[2020-08-10/07:53:36.210|21.5%|70.5%|0.26GB] 4744 samples where processed for 'nobee' label
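
The label strength described in the comments of cell [4] can be sketched as below. This is an illustration under stated assumptions, not the FromAnnotation code: the function name `label_strength` is hypothetical, events are assumed not to overlap each other, and the threshold is assumed to apply to the part of an event that falls inside the sample.

```python
def label_strength(sample_start, sample_end, events, th=0.0):
    """Fraction of the sample [sample_start, sample_end] covered by label events.

    `events` is a list of (start, end) times in seconds for a single label.
    Assumptions: events do not overlap each other, and `th` is compared with
    the portion of each event that lies inside the sample.
    """
    duration = sample_end - sample_start
    if duration <= 0:
        raise ValueError("sample_end must be greater than sample_start")
    covered = 0.0
    for ev_start, ev_end in events:
        # Length of the intersection between the event and the sample
        inside = min(ev_end, sample_end) - max(ev_start, sample_start)
        # Ignore events outside the sample or not longer than the threshold
        if inside > max(th, 0.0):
            covered += inside
    return covered / duration


print(label_strength(0.0, 1.0, [(0.5, 3.0)]))                        # 0.5
print(label_strength(0.0, 1.0, [(0.25, 0.5), (0.75, 2.0)], th=0.3))  # 0.0
```

In the second call, each event covers only 0.25 s of the sample, which is under the 0.3 s threshold, so both are disregarded.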


## Step 3: Add Attributes

Here we add **attributes** to our dataset samples. Just like labels, attributes make use of providers and transformers.

Attributes can be used to "tag" samples for future subset extraction. They have a name and a value, always stored as a string (note the difference with labels).

Here we tag each sample with the hive it belongs to. As the hive is encoded in the first 5 characters of the source file name, we use a FromFileName attributor, with a StringMatcher transformer.

In [5]:
# The StringMatcher transformer behaves differently from the StringMapper. It uses a regexp
# capture group to retrieve the part of a string matching a specific pattern. This can be used
# either for complex or very basic matching. Here we just ask for the first five chars,
# provided they belong to characters valid for identifiers (A-Z, a-z, 0-9 and underscore)
ds.addAttribute('hive')
# Raw string, so that \w reaches the regexp engine without being treated as a string escape
ds.setAttribute('hive', providers.FromFileName(transformers.StringMatcher(r"^(\w{5})")))
ds.listAttributes()

['hive']
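
The capture-group idea behind the `hive` attribute can be tried on its own with the standard `re` module. This is a minimal sketch, not the StringMatcher implementation: `capture_first_group` is a hypothetical helper, and the file names are made up for illustration.

```python
import re


def capture_first_group(pattern, text):
    """Return the first capture group of `pattern` found in `text`, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Hypothetical file names, for illustration only
print(capture_first_group(r"^(\w{5})", "Hive1_example_recording.wav"))  # Hive1
print(capture_first_group(r"^(\w{5})", "ab-cd.wav"))                    # None
```

The second call returns `None` because the hyphen is not a word character, so the name does not start with five consecutive `\w` characters.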

## Step 4: Perform some requests

In [6]:
# This dumps all columns from the samples table into a pandas dataframe
ds.dumpDataFrame()

| | name | file_id | start_t | end_t | queen | nobee | hive | MFCC |
|---|---|---|---|---|---|---|---|---|
| 0 | 00-000000 | 1 | 0.0 | 1.0 | 1.0 | 0.0 | Hive1 | [[-408.87164, -408.7036, -409.22083, -409.4723... |
| 1 | 00-000001 | 1 | 0.5 | 1.5 | 1.0 | 0.0 | Hive1 | [[-412.63693, -409.90994, -411.59604, -406.177... |
| 2 | 00-000002 | 1 | 1.0 | 2.0 | 1.0 | 0.0 | Hive1 | [[-423.80463, -417.4408, -409.61368, -403.3458... |
| 3 | 00-000003 | 1 | 1.5 | 2.5 | 1.0 | 0.0 | Hive1 | [[-412.18826, -414.3556, -417.38345, -416.6880... |
| 4 | 00-000004 | 1 | 2.0 | 3.0 | 1.0 | 0.0 | Hive1 | [[-402.08826, -405.79462, -413.061, -406.19598... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4739 | 03-001178 | 4 | 589.0 | 590.0 | 1.0 | 0.0 | Hive3 | [[-456.44434, -455.50305, -456.2868, -448.7767... |
| 4740 | 03-001179 | 4 | 589.5 | 590.5 | 1.0 | 0.0 | Hive3 | [[-438.58572, -431.85278, -440.3269, -449.5843... |
| 4741 | 03-001180 | 4 | 590.0 | 591.0 | 1.0 | 0.0 | Hive3 | [[-444.9086, -443.4558, -448.03052, -452.29404... |
| 4742 | 03-001181 | 4 | 590.5 | 591.5 | 1.0 | 0.0 | Hive3 | [[-455.91925, -447.42047, -445.4355, -446.2871... |


In [7]:
# but you can also be more specific
ds.queryDataFrame("select name, hive, nobee, queen from samples where nobee = 0")

| | name | hive | nobee | queen |
|---|---|---|---|---|
| 0 | 00-000000 | Hive1 | 0.0 | 1.0 |
| 1 | 00-000001 | Hive1 | 0.0 | 1.0 |
| 2 | 00-000002 | Hive1 | 0.0 | 1.0 |
| 3 | 00-000003 | Hive1 | 0.0 | 1.0 |
| 4 | 00-000004 | Hive1 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... |
| 3020 | 03-001178 | Hive3 | 0.0 | 1.0 |
| 3021 | 03-001179 | Hive3 | 0.0 | 1.0 |
| 3022 | 03-001180 | Hive3 | 0.0 | 1.0 |
| 3023 | 03-001181 | Hive3 | 0.0 | 1.0 |


In [8]:
sql = """
select
    rowid,
    name,
    file_id,
    hive,
    nobee,
    iif(nobee < 0.1, 0, 1) as b_nobee, -- using the sqlite builtin iif function (SQLite 3.32.0 or later)
    queen
from samples
where hive = 'Hive1'
and nobee != 0
"""
ds.queryDataFrame(sql)

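| | rowid | name | file_id | hive | nobee | b_nobee | queen |
|---|---|---|---|---|---|---|---|
| 0 | 79 | 00-000078 | 1 | Hive1 | 0.03 | 0 | 1.0 |
| 1 | 80 | 00-000079 | 1 | Hive1 | 0.53 | 1 | 1.0 |
| 2 | 81 | 00-000080 | 1 | Hive1 | 1.00 | 1 | 1.0 |
| 3 | 82 | 00-000081 | 1 | Hive1 | 1.00 | 1 | 1.0 |
| 4 | 83 | 00-000082 | 1 | Hive1 | 1.00 | 1 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 765 | 2324 | 01-001128 | 2 | Hive1 | 0.26 | 1 | 0.0 |
| 766 | 2325 | 01-001129 | 2 | Hive1 | 0.76 | 1 | 0.0 |
| 767 | 2326 | 01-001130 | 2 | Hive1 | 1.00 | 1 | 0.0 |
| 768 | 2327 | 01-001131 | 2 | Hive1 | 0.84 | 1 | 0.0 |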
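
The thresholding query above can be reproduced outside Audace with the standard `sqlite3` module. The sketch below is an assumption-laden illustration: the `samples` table is reduced to four columns, filled with a few values copied from the outputs above, and the query uses a `case` expression, which gives the same result as `iif` and also works on SQLite versions older than 3.32.0.

```python
import sqlite3

# In-memory database with a simplified, assumed version of the samples table
conn = sqlite3.connect(":memory:")
conn.execute("create table samples (name text, hive text, nobee real, queen real)")
conn.executemany(
    "insert into samples values (?, ?, ?, ?)",
    [
        ("00-000000", "Hive1", 0.00, 1.0),
        ("00-000078", "Hive1", 0.03, 1.0),
        ("00-000079", "Hive1", 0.53, 1.0),
        ("00-000080", "Hive1", 1.00, 1.0),
    ],
)

sql = """
select
    name,
    nobee,
    case when nobee < 0.1 then 0 else 1 end as b_nobee -- same result as iif(nobee < 0.1, 0, 1)
from samples
where hive = ?
and nobee != 0
order by name
"""
rows = conn.execute(sql, ("Hive1",)).fetchall()
conn.close()

for row in rows:
    print(row)
```

Passing the hive name as a bound parameter (`?`) rather than formatting it into the SQL string is a general `sqlite3` habit; whether `queryDataFrame` accepts parameters is not shown in this notebook.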
