# Dataset Creation
You should run this notebook only if you have not previously created the dataset.

## Step 1: Initialization

Here we create the dataset from a set of audio source files.
These files are resampled and sliced into chunks, according to the parameters provided in the dataset manifest (sample rate, duration, overlap).
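
To make the slicing parameters concrete, here is a minimal sketch of how chunk boundaries could be derived from a sample rate, a duration and an overlap. It is not the toolbox's implementation: the function name `chunk_bounds` is hypothetical, the overlap is assumed to be a fraction of the chunk length (with a 1.0 s duration, reading it as seconds gives the same hop), and a trailing partial chunk is assumed to be dropped.

```python
def chunk_bounds(n_samples, sample_rate, duration, overlap):
    """Return (start, end) sample indices of fixed-length, overlapping chunks.

    Hypothetical helper, not part of the toolbox.
    duration: chunk length in seconds.
    overlap:  fraction of a chunk shared with the next one (assumption).
    """
    size = int(round(duration * sample_rate))   # chunk length in samples
    hop = int(round(size * (1.0 - overlap)))    # distance between chunk starts
    if size <= 0 or hop <= 0:
        raise ValueError("duration must be > 0 and overlap must be < 1")
    # Only full-length chunks are kept; a shorter tail is dropped (assumption).
    return [(start, start + size)
            for start in range(0, n_samples - size + 1, hop)]


# A 3-second signal at 22050 Hz, 1.0 s chunks, 0.5 overlap
bounds = chunk_bounds(n_samples=3 * 22050, sample_rate=22050,
                      duration=1.0, overlap=0.5)
print(len(bounds))   # 5
print(bounds[:2])    # [(0, 22050), (11025, 33075)]
```

With a 0.5 overlap, each chunk starts half a chunk after the previous one, so a file yields roughly twice as many chunks as it has seconds.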

A SQLite database is also created for the dataset, to persist any useful information. 
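
As an illustration of what persisting chunk information in SQLite can look like, the sketch below uses Python's standard `sqlite3` module with an in-memory database. The table name, column names and sample names are assumptions made for this example; the actual schema of the dataset's `database.db` is not described in this notebook.

```python
import sqlite3

# In-memory database, so the example leaves no file behind
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per audio chunk
conn.execute(
    "CREATE TABLE samples ("
    " name TEXT PRIMARY KEY,"      # chunk identifier
    " file_name TEXT NOT NULL,"    # source audio file
    " start_t REAL NOT NULL,"      # chunk start, in seconds
    " end_t REAL NOT NULL)"        # chunk end, in seconds
)

rows = [
    ("example-000000", "example.wav", 0.0, 1.0),
    ("example-000001", "example.wav", 0.5, 1.5),
    ("example-000002", "example.wav", 1.0, 2.0),
]
conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
print(count)  # 3
conn.close()
```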

*Note: When multiprocessing, progress is not tracked in the Jupyter notebook, so you have to look at the console.*

In [1]:
import warnings                            # This block prevents display of harmless warnings, but should be
warnings.filterwarnings('ignore')          # commented out till the final version, to avoid missing "real" warnings 

import kilroy_was_here                     # Mandatory. Allows access to shared python code in the upper 'codelib' directory
from lib.audiodataset import AudioDataset  # Class for audio dataset handling

# Path where to find initial annotated dataset (audio and lab files)
SOURCE_PATH = 'D:/datasets/sounds/Nolasco'

# Dataset name is the master key for dataset addressing
DATASET_NAME = 'SMALL1005'

# Initialize Dataset Object. By providing a source path, you implicitly
# indicate that you want to CREATE the dataset
# Run with a pool of 4 processes
ds = AudioDataset(DATASET_NAME, SOURCE_PATH, nprocs=4)



[2020-08-05/12:06:00.084|14.1%|71.6%|0.25GB] >>>>> Starting Dataset SMALL1005 build
[2020-08-05/12:06:00.084|00.0%|71.6%|0.25GB] Creating database tables
[2020-08-05/12:06:00.131|04.2%|71.6%|0.25GB] Ready to process 4 audio files.
[2020-08-05/12:06:38.078|49.6%|64.7%|0.26GB] Creating Database
[2020-08-05/12:06:38.105|62.5%|64.7%|0.26GB] Database created
[2020-08-05/12:06:38.105|00.0%|64.7%|0.26GB] Please wait, computing checksum...
[2020-08-05/12:06:39.774|28.1%|67.0%|0.26GB]   Computed checksum 628e97cd138dc7586883e3b33e41cb2b
[2020-08-05/12:06:39.774|00.0%|67.0%|0.26GB]   Expected checksum 628e97cd138dc7586883e3b33e41cb2b
[2020-08-05/12:06:39.774|00.0%|67.0%|0.26GB] Checksum OK!
[2020-08-05/12:06:39.774|00.0%|67.0%|0.26GB] >>>>> Dataset SMALL1005 successfully created.
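
The build log above ends with a checksum comparison. The 32 hexadecimal digits have the length of an MD5 digest, but the algorithm the toolbox uses and the data it hashes are not stated here. The sketch below only illustrates the general technique of comparing a computed digest against an expected one, using the standard `hashlib` module; `file_md5` is a hypothetical helper and the temporary file stands in for real data.

```python
import hashlib
import os
import tempfile


def file_md5(path, block_size=65536):
    """Return the MD5 hex digest of a file, read block by block (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Stand-in data: a small temporary file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    tmp_path = tmp.name

computed = file_md5(tmp_path)
os.remove(tmp_path)

# In practice the expected value would come from a trusted reference
expected = hashlib.md5(b"hello").hexdigest()
print("Checksum OK!" if computed == expected else "Checksum mismatch")
```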



**The following line provides some information about the newly created AudioDataset object**


In [3]:
ds.info()

[2020-08-04/10:04:03.889|07.3%|64.2%|0.25GB] ------------------------------------------------------
[2020-08-04/10:04:03.889|00.0%|64.2%|0.25GB] DATASET PATH          : D:\Jupyter\ShowBees\datasets\SMALL1005
[2020-08-04/10:04:03.889|00.0%|64.2%|0.25GB] DATASET DB PATH       : D:\Jupyter\ShowBees\datasets\SMALL1005\database.db
[2020-08-04/10:04:03.889|00.0%|64.2%|0.25GB] DATASET SAMPLES PATH  : D:\Jupyter\ShowBees\datasets\SMALL1005\samples
[2020-08-04/10:04:03.889|00.0%|64.2%|0.25GB] NB SOURCE AUDIO FILES : 4
[2020-08-04/10:04:03.894|00.0%|64.2%|0.25GB] SAMPLE RATE           : 22050
[2020-08-04/10:04:03.894|00.0%|64.2%|0.25GB] DURATION              : 1.0
[2020-08-04/10:04:03.896|00.0%|64.2%|0.25GB] OVERLAP               : 0.5
[2020-08-04/10:04:03.896|00.0%|64.2%|0.25GB] NB AUDIO CHUNKS       : 4744
[2020-08-04/10:04:03.896|00.0%|64.2%|0.25GB] ------------------------------------------------------


## Step 2: Add Labels

Here we add **labels** to our dataset samples. Labels can be set using various functions called ***Labelizers***, either built into the toolbox or developed by the user (the toolbox provides utility functions for Labelizer development).

In [None]:
from lib.labelizers import AnnotationFile
ds.setLabels()
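
To give an idea of what a Labelizer might compute, here is a self-contained sketch that derives a label strength for a chunk from time-stamped annotations. It does not use or reproduce the toolbox's `AnnotationFile` class: the functions `parse_segments` and `label_strength` are hypothetical, the annotation format (`start end label` per line) is an assumption, and segments carrying the same label are assumed not to overlap each other.

```python
def parse_segments(text):
    """Parse 'start end label' lines; lines that do not match are skipped."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        try:
            segments.append((float(parts[0]), float(parts[1]), parts[2]))
        except ValueError:
            continue  # non-numeric timestamps: ignore the line
    return segments


def label_strength(segments, label, start_t, end_t):
    """Fraction of the chunk [start_t, end_t) covered by segments carrying `label`."""
    covered = sum(
        max(0.0, min(seg_end, end_t) - max(seg_start, start_t))
        for seg_start, seg_end, seg_label in segments
        if seg_label == label
    )
    return covered / (end_t - start_t)


# Hypothetical annotation content
annotations = "0.0 2.5 bee\n2.5 4.0 nobee\n"
segments = parse_segments(annotations)

# Chunk spanning 2.0 s to 3.0 s: half of it falls in the 'nobee' segment
strength = label_strength(segments, "nobee", 2.0, 3.0)
print(strength)  # 0.5
```

A threshold on this fraction could then turn the strength into a binary label for the chunk.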

## Step 3: Add Attributes

Here we add **attributes** to our dataset samples. Attributes can be set using various functions called ***Attributors***, either built into the toolbox or developed by the user (the toolbox provides utility functions for Attributor development).

In [None]:
from lib.Attributors 

## Step 4: Next Steps

We can also add **features** and **augmentations** to our dataset samples. Features (resp. augmentations) can be set using various functions called ***Featurizers*** (resp. ***Augmentators***). Both are either built into the toolbox or developed by the user (the toolbox provides utility functions for Featurizer and Augmentator development).