<a href="https://colab.research.google.com/github/ShawnHymel/ei-faucet-dataset/blob/master/ei_sound_classifier_dataset_curation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sound Classifier Dataset Curation

Use this tool to download the Edge Impulse faucet dataset, combine it with your own sound samples, mix in some background noise, and upload the curated dataset to Edge Impulse. From there, you can train a neural network to classify sound events and upload it to a microcontroller to perform real-time sound event classification.

 1. Upload samples of your own sounds (optional)
 2. Adjust parameters in the Settings cell (you will need an [Edge Impulse](https://www.edgeimpulse.com/) account)
 3. Run the rest of the cells! ('shift' + 'enter' on each cell)

### Upload your own sound samples
If you'd like to upload your own custom sound dataset, follow these instructions:

In the file browser on the left pane, create a directory structure for your audio samples. All samples for a given sound should be in a directory named after that sound.

The audio samples should be in `.wav` format, mono, and 1 second long. Bit rate and bit depth should not matter. Samples shorter than 1 second are padded with zeros, and samples longer than 1 second are truncated to 1 second. The exact name of each `.wav` file does not matter, because each file is read, mixed with background noise, and saved to a separate file with an auto-generated name. The directory name does matter: it is used as the class name during neural network training.
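
The padding and truncation rule can be sketched in a few lines of Python. This is only an illustration, not the code the curation script runs: the helper name `fit_to_length` is hypothetical, the waveform is treated as a plain list of samples, and padding at the end of the clip is an assumption.

```python
def fit_to_length(samples, sample_rate=16000, seconds=1.0):
    """Return a copy of `samples` that is exactly `seconds` long.

    Hypothetical helper for illustration only. Short clips are padded
    with zeros at the end (assumed position); long clips are cut off
    at the target length.
    """
    target = int(sample_rate * seconds)
    fitted = list(samples[:target])               # truncate if too long
    fitted.extend([0] * (target - len(fitted)))   # zero-pad if too short
    return fitted


short_clip = [0.5] * 12000   # 0.75 s at 16 kHz
long_clip = [0.5] * 20000    # 1.25 s at 16 kHz
print(len(fit_to_length(short_clip)), len(fit_to_length(long_clip)))
```

Both clips come out at 16000 samples, which is 1 second at the 16 kHz sample rate used later in the Settings cell.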

Right-click on each directory and upload all of your samples. Your directory structure should look like the following:

```
/
|- content
|--- custom_sounds
|----- sound_1
|------- 000.wav
|------- 001.wav
|------- ...
|----- sound_2
|------- 000.wav
|------- 001.wav
|------- ...
|----- ...
```
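
Before running the remaining cells, it may help to confirm that the upload matches this layout. The sketch below counts the `.wav` files in each class directory; `count_samples` is a hypothetical helper, and `/content/custom_sounds` is simply the path from the tree above.

```python
from pathlib import Path


def count_samples(dataset_dir):
    """Map each class directory name to its number of .wav files.

    Hypothetical helper: it only looks one level deep, matching the
    layout shown above (one subdirectory per sound class).
    """
    root = Path(dataset_dir)
    if not root.is_dir():
        return {}
    return {
        class_dir.name: sum(1 for _ in class_dir.glob("*.wav"))
        for class_dir in sorted(root.iterdir())
        if class_dir.is_dir()
    }


for label, count in count_samples("/content/custom_sounds").items():
    print(f"{label}: {count} samples")
```

A class that reports zero samples usually means the files were uploaded one level too high or too low in the tree.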

In [1]:
### Update Node.js to the latest stable version
!npm cache clean -f
!npm install -g n
!n stable

[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35musing --force[0m I sure hope you know what you are doing.
[K[?25h/tools/node/bin/n -> /tools/node/lib/node_modules/n/bin/n
+ n@7.0.0
added 1 package from 4 contributors in 0.274s

[33m[39m
[33m   ╭─────────────────────────────────────────────────────────────────╮[39m
   [33m│[39m                                                                 [33m│[39m
   [33m│[39m      New [32mpatch[39m version of npm available! [31m6.14.8[39m → [32m6.14.10[39m       [33m│[39m
   [33m│[39m   [33mChangelog:[39m [36mhttps://github.com/npm/cli/releases/tag/v6.14.10[39m   [33m│[39m
   [33m│[39m                Run [32mnpm install -g npm[39m to update!                [33m│[39m
   [33m│[39m                                                                 [33m│[39m
[33m   ╰─────────────────────────────────────────────────────────────────╯[39m
[33m[39m
  [36minstalling[0m : [2mnode-v14.15.4[0m
  [36m     mkdir[0m : [

In [2]:
### Install required packages and tools
!python -m pip install soundfile
!npm install -g --unsafe-perm edge-impulse-cli

Collecting soundfile
  Downloading https://files.pythonhosted.org/packages/eb/f2/3cbbbf3b96fb9fa91582c438b574cff3f45b29c772f94c400e2c99ef5db9/SoundFile-0.10.3.post1-py2.py3-none-any.whl
Installing collected packages: soundfile
Successfully installed soundfile-0.10.3.post1
[K[?25h[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35mdeprecated[0m request-promise@4.2.6: request-promise has been deprecated because it extends the now deprecated request package, see https://github.com/request/request/issues/3142
[0m[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35mdeprecated[0m request@2.88.2: request has been deprecated, see https://github.com/request/request/issues/3142
[0m[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35mdeprecated[0m @zeit/dockerignore@0.0.5: "@zeit/dockerignore" is no longer maintained
[K[?25h[37;40mnpm[0m [0m[30;43mWARN[0m [0m[35mdeprecated[0m har-validator@5.1.5: this library is no longer supported
[K[?25h/usr/local/bin/edge-impulse-blocks -> /usr/local/lib/node_m

In [3]:
### Imports
from os.path import join

In [4]:
### Settings (You probably do not need to change these)
BASE_DIR = "/content"
OUT_DIR = "sounds_curated"
EI_DATASET_FILENAME = "faucet_dataset_v01.zip"
EI_DATASET_URL = "https://github.com/ShawnHymel/ei-faucet-dataset/raw/master/" + EI_DATASET_FILENAME
EI_DATASET_DIR = "faucet_dataset"
CURATION_SCRIPT = "dataset-curation.py"
CURATION_SCRIPT_URL = "https://raw.githubusercontent.com/ShawnHymel/ei-keyword-spotting/master/" + CURATION_SCRIPT
UTILS_SCRIPT_URL = "https://raw.githubusercontent.com/ShawnHymel/ei-keyword-spotting/master/utils.py"
NUM_SAMPLES = 1500    # Target number of samples to mix and send to Edge Impulse
WORD_VOL = 1.0        # Relative volume of word in output sample
BG_VOL = 0.1          # Relative volume of noise in output sample
SAMPLE_TIME = 1.0     # Time (seconds) of output sample
SAMPLE_RATE = 16000   # Sample rate (Hz) of output sample
BIT_DEPTH = "PCM_16"  # Options: [PCM_16, PCM_24, PCM_32, PCM_U8, FLOAT, DOUBLE]
BG_DIR = "noise"
TEST_RATIO = 0.2      # 20% reserved for test set, rest is for training
EI_INGEST_TEST_URL = "https://ingestion.edgeimpulse.com/api/test/data"
EI_INGEST_TRAIN_URL = "https://ingestion.edgeimpulse.com/api/training/data"
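
To give a feel for what `WORD_VOL` and `BG_VOL` control, here is a minimal sketch of mixing a sound sample with background noise as a weighted sum. It is an assumption about the general idea, not the curation script's actual implementation; the `mix` function, the equal-length requirement, and the clipping to [-1.0, 1.0] are all illustrative choices.

```python
WORD_VOL = 1.0   # same values as the Settings cell
BG_VOL = 0.1


def mix(word, background, word_vol=WORD_VOL, bg_vol=BG_VOL):
    """Weighted sum of two equal-length waveforms, clipped to [-1.0, 1.0].

    Assumed behaviour for illustration; the real script may scale or
    normalise differently.
    """
    if len(word) != len(background):
        raise ValueError("waveforms must have the same length")
    return [
        max(-1.0, min(1.0, word_vol * w + bg_vol * b))
        for w, b in zip(word, background)
    ]


print(mix([0.5, -0.5, 1.0], [1.0, 1.0, 1.0]))
```

With these settings the background contributes one tenth of its amplitude, so the target sound stays dominant in every output sample.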

In [5]:
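### Download Edge Impulse faucet dataset
# Use %cd rather than !cd: each ! command runs in its own subshell,
# so a directory change made with !cd does not persist
%cd {BASE_DIR}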
!wget {EI_DATASET_URL}
!echo "Extracting..."
!unzip -q {EI_DATASET_FILENAME}

--2021-01-07 01:33:16--  https://github.com/ShawnHymel/ei-faucet-dataset/raw/master/faucet_dataset_v01.zip
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ShawnHymel/ei-faucet-dataset/master/faucet_dataset_v01.zip [following]
--2021-01-07 01:33:16--  https://raw.githubusercontent.com/ShawnHymel/ei-faucet-dataset/master/faucet_dataset_v01.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22657140 (22M) [application/zip]
Saving to: ‘faucet_dataset_v01.zip’


2021-01-07 01:33:17 (75.1 MB/s) - ‘faucet_dataset_v01.zip’ saved [22657140/22657140]

Extracting...


In [6]:
### Pull out background noise directory
# %cd persists across commands; !cd would only affect its own subshell
%cd {BASE_DIR}
!mv "{join(EI_DATASET_DIR, BG_DIR)}" "{BG_DIR}"

In [7]:
### User Settings (do change these)

# Location of your custom keyword samples (e.g. "/content/custom_sounds")
# Leave blank ("") for no custom keywords.
CUSTOM_DATASET_PATH = "/content/custom_sounds"

# Edge Impulse > your_project > Dashboard > Keys
# Paste your own project API key here and keep it private
EI_API_KEY = "<your-api-key>"

# Comma-separated labels. Each must match the name of a directory that contains samples.
# Recommended: use 2 or 3 labels for the microcontroller demo
TARGETS = "fan, faucet"
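
Because each target must match a directory name, a quick check before curation can catch typos early. The sketch below is a hypothetical helper, not part of the curation script: it assumes that whitespace around each label is ignored and that a label counts as present if any of the dataset directories has a subdirectory with that name. The two paths are built from the settings above.

```python
from pathlib import Path


def parse_targets(targets):
    """Split a comma-separated label string and trim surrounding spaces."""
    return [t.strip() for t in targets.split(",") if t.strip()]


def missing_targets(targets, dataset_dirs):
    """Return the labels that have no directory in any dataset.

    Hypothetical check; `dataset_dirs` is a list of dataset root paths.
    Empty entries (for example a blank custom dataset path) are skipped.
    """
    return [
        label
        for label in parse_targets(targets)
        if not any((Path(d) / label).is_dir() for d in dataset_dirs if d)
    ]


print(parse_targets("fan, faucet"))
print(missing_targets("fan, faucet",
                      ["/content/faucet_dataset", "/content/custom_sounds"]))
```

An empty list from `missing_targets` means every label was found in at least one dataset directory.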

In [8]:
### Download curation and utils scripts
!wget {CURATION_SCRIPT_URL}
!wget {UTILS_SCRIPT_URL}

--2021-01-07 01:43:30--  https://raw.githubusercontent.com/ShawnHymel/ei-keyword-spotting/master/dataset-curation.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17199 (17K) [text/plain]
Saving to: ‘dataset-curation.py’


2021-01-07 01:43:30 (50.9 MB/s) - ‘dataset-curation.py’ saved [17199/17199]

--2021-01-07 01:43:30--  https://raw.githubusercontent.com/ShawnHymel/ei-keyword-spotting/master/utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3966 (3.9K) [text/plain]
Saving to: ‘utils.py’


2021-01-07 01:43:30 (63.0 MB/s) - ‘utils.

In [20]:
### Perform curation and mixing of samples with background noise
# %cd persists across commands; !cd would only affect its own subshell
%cd {BASE_DIR}
!python "{CURATION_SCRIPT}" \
  -t "{TARGETS}" \
  -n {NUM_SAMPLES} \
  -w {WORD_VOL} \
  -g {BG_VOL} \
  -s {SAMPLE_TIME} \
  -r {SAMPLE_RATE} \
  -e {BIT_DEPTH} \
  -b "{BG_DIR}" \
  -o "{OUT_DIR}" \
  "{EI_DATASET_DIR}" \
  "{CUSTOM_DATASET_PATH}"

-----------------------------------------------------------------------
Keyword Dataset Curation Tool
v0.1
-----------------------------------------------------------------------
Gathering random background noise snippets (1500 files)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/audioread/ffdec.py", line 94, in popen_multiple
    return subprocess.Popen(cmd, *args, **kwargs)
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1318, in _execute_child
    part = os.read(errpipe_read, 50000)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dataset-curation.py", line 344, in <module>
    sample_rate=sample_rate)
  File "dataset-curation.py", line 126, in mix_audio
    bg_waveform, fs = librosa.load(bg_path, sr=fs)
  File "/usr/local/lib/python3.6/dist-packages/librosa/core/

In [19]:
!python {CURATION_SCRIPT}

usage: dataset-curation.py [-h] -t TARGETS [-n NUM_SAMPLES] [-w WORD_VOL]
                           [-g BG_VOL] [-s SAMPLE_TIME] [-r SAMPLE_RATE]
                           [-e {PCM_16,PCM_24,PCM_32,PCM_U8,FLOAT,DOUBLE}] -b
                           BG_DIR -o OUT_DIR
                           d [d ...]
dataset-curation.py: error: the following arguments are required: -t/--targets, -b/--bg_dir, -o/--out_dir, d
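
The usage message above is enough to sketch how such a command line could be declared with `argparse`. This is a reconstruction for illustration, not the script's source: only `-t/--targets`, `-b/--bg_dir`, `-o/--out_dir`, and the positional `d` are named in the output, so the other long option names, the argument types, and the absence of defaults are assumptions.

```python
import argparse


def build_parser():
    """Build a parser consistent with the usage message shown above.

    Long names other than --targets, --bg_dir and --out_dir are guesses
    derived from the metavars; types are assumed; no defaults are set.
    """
    p = argparse.ArgumentParser(prog="dataset-curation.py")
    p.add_argument("-t", "--targets", required=True)
    p.add_argument("-n", "--num_samples", type=int)      # assumed name/type
    p.add_argument("-w", "--word_vol", type=float)       # assumed name/type
    p.add_argument("-g", "--bg_vol", type=float)         # assumed name/type
    p.add_argument("-s", "--sample_time", type=float)    # assumed name/type
    p.add_argument("-r", "--sample_rate", type=int)      # assumed name/type
    p.add_argument("-e", "--bit_depth",                  # assumed name
                   choices=["PCM_16", "PCM_24", "PCM_32",
                            "PCM_U8", "FLOAT", "DOUBLE"])
    p.add_argument("-b", "--bg_dir", required=True)
    p.add_argument("-o", "--out_dir", required=True)
    p.add_argument("d", nargs="+")   # one or more dataset directories
    return p


args = build_parser().parse_args([
    "-t", "fan, faucet", "-n", "1500", "-b", "noise", "-o", "sounds_curated",
    "faucet_dataset", "/content/custom_sounds",
])
print(args.targets, args.num_samples, args.bg_dir, args.out_dir, args.d)
```

Running the script with no arguments fails for the same reason as in the cell above: the three required options and at least one dataset directory are missing.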
