# Keyword Spotting Dataset Curation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ShawnHymel/ei-keyword-spotting/blob/master/ei-audio-dataset-curation.ipynb)

Use this tool to download the Google Speech Commands Dataset, combine it with your own keywords, mix in some background noise, and upload the curated dataset to Edge Impulse. From there, you can train a neural network to classify spoken words and upload it to a microcontroller to perform real-time keyword spotting.

 1. Upload samples of your own keyword (optional)
 2. Adjust parameters in the Settings cell (you will need an [Edge Impulse](https://www.edgeimpulse.com/) account)
 3. Run the rest of the cells! ('shift' + 'enter' on each cell)



### Upload your own keyword samples
You are welcome to use my [custom keyword dataset](https://github.com/ShawnHymel/custom-speech-commands-dataset), but note that it's limited and that I can't promise it will work well. If you want to use it, uncomment the `###Download custom dataset` cell below. You may also add your own recorded keywords to the extracted folder (`/content/custom_keywords`) to augment what's already there.

If you'd rather upload your own custom keyword dataset, follow these instructions:

In the file browser on the left pane, create a directory for each of your keywords. All samples for a keyword should be in a directory with that keyword's name.

The audio samples should be `.wav` format, mono, and 1 second long. Bit rate and bit depth should not matter. Samples shorter than 1 second will be padded with 0s, and samples longer than 1 second will be truncated to 1 second. The exact name of each `.wav` file does not matter, as each file is read, mixed with background noise, and saved to a separate file with an auto-generated name. The directory name does matter: it determines the name of the class during neural network training.
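
The padding and truncation rule can be sketched as follows. This is a minimal illustration, not the curation script's own code: the function name `fit_to_length` is hypothetical, samples are plain Python lists of floats to keep the sketch dependency-free, and placing the zero padding at the end of the clip is an assumption.

```python
def fit_to_length(samples, sample_rate=16000, sample_time=1.0):
    """Zero-pad or truncate a mono clip to exactly sample_time seconds.

    Assumption: padding is appended at the end of the clip.
    """
    target_len = int(sample_rate * sample_time)
    if len(samples) >= target_len:
        # Too long (or exact): keep only the first target_len samples
        return list(samples[:target_len])
    # Too short: append zeros until the clip reaches the target length
    return list(samples) + [0.0] * (target_len - len(samples))


short_clip = fit_to_length([0.5] * 12000)   # 0.75 s at 16 kHz
long_clip = fit_to_length([0.5] * 20000)    # 1.25 s at 16 kHz
print(len(short_clip), len(long_clip))      # 16000 16000
```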

Right-click on each keyword directory and upload all of your samples. Your directory structure should look like the following:

```
/
|- content
|--- custom_keywords
|----- keyword_1
|------- 000.wav
|------- 001.wav
|------- ...
|----- keyword_2
|------- 000.wav
|------- 001.wav
|------- ...
|----- ...
```
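
To confirm that the uploaded files match this layout, something like the following sketch could be run in a cell. The helper name `count_samples` is hypothetical; it only counts `.wav` files in each keyword directory and does not check their audio format.

```python
from pathlib import Path


def count_samples(dataset_dir):
    """Return {keyword: number of .wav files} for each subdirectory."""
    counts = {}
    for keyword_dir in sorted(Path(dataset_dir).iterdir()):
        if keyword_dir.is_dir():
            counts[keyword_dir.name] = len(list(keyword_dir.glob("*.wav")))
    return counts


# Path taken from the layout above; skipped if the directory does not exist
custom_dir = Path("/content/custom_keywords")
if custom_dir.is_dir():
    for keyword, count in count_samples(custom_dir).items():
        print(f"{keyword}: {count} samples")
```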




In [None]:
# This notebook formats the original Google Speech Commands dataset so that it can be sent to Edge Impulse

### Update Node.js to the latest stable version
!npm cache clean -f
!npm install -g n
!n stable

# Based on the tutorial by Shawn Hymel: https://github.com/ShawnHymel/ei-keyword-spotting/blob/master/dataset-curation.py

npm WARN using --force I sure hope you know what you are doing.
/tools/node/bin/n -> /tools/node/lib/node_modules/n/bin/n
+ n@7.3.1
added 1 package from 2 contributors in 0.331s


   ╭────────────────────────────────────────────────────────────────╮
   │                                                                │
   │      New major version of npm available! 6.14.8 → 7.20.5       │
   │   Changelog: https://github.com/npm/cli/releases/tag/v7.20.5   │
   │               Run npm install -g npm to update!                │
   │                                                                │
   ╰────────────────────────────────────────────────────────────────╯

  installing : node-v14.17.4
       mkdir : /usr/

In [None]:
### Install required packages and tools
!python -m pip install soundfile
!npm install -g --unsafe-perm edge-impulse-cli

In [None]:
### Settings (You probably do not need to change these)
BASE_DIR = "/content"
OUT_DIR = "keywords_curated"
GOOGLE_DATASET_FILENAME = "mini_speech_commands.zip"
GOOGLE_DATASET_URL = "http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip"
GOOGLE_DATASET_DIR = "google_speech_commands"
CUSTOM_KEYWORDS_FILENAME = "main.zip"
CUSTOM_KEYWORDS_URL = "https://github.com/ShawnHymel/custom-speech-commands-dataset/archive/" + CUSTOM_KEYWORDS_FILENAME
CUSTOM_KEYWORDS_DIR = "custom_keywords" # Unused
CUSTOM_KEYWORDS_REPO_NAME = "custom-speech-commands-dataset-main"
CURATION_SCRIPT = "dataset-curation.py"
CURATION_SCRIPT_URL = "https://raw.githubusercontent.com/ShawnHymel/ei-keyword-spotting/master/" + CURATION_SCRIPT
UTILS_SCRIPT_URL = "https://raw.githubusercontent.com/ShawnHymel/ei-keyword-spotting/master/utils.py"
NUM_SAMPLES = 1700    # Target number of samples per class to mix and send to Edge Impulse
WORD_VOL = 1.0        # Relative volume of word in output sample
BG_VOL = 0.1          # Relative volume of noise in output sample
SAMPLE_TIME = 1.0     # Time (seconds) of output sample
SAMPLE_RATE = 16000   # Sample rate (Hz) of output sample
BIT_DEPTH = "PCM_16"  # Options: [PCM_16, PCM_24, PCM_32, PCM_U8, FLOAT, DOUBLE]
BG_DIR = "_background_noise_"  # Name of the background noise directory
TEST_RATIO = 0.2      # 20% reserved for test set, rest is for training
EI_INGEST_TEST_URL = "https://ingestion.edgeimpulse.com/api/test/data"
EI_INGEST_TRAIN_URL = "https://ingestion.edgeimpulse.com/api/training/data"

# Only needed for the background noise directory
GOOGLE_DATASET_2_URL = "http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz"
GOOGLE_DATASET_2_DIR = "original_dataset_ShawnHymel"
GOOGLE_DATASET_2_FILENAME = "speech_commands_v0.02.tar.gz"

# This was the directory where the "split_directory" file was kept on my version
SPLIT_DIRECTORY_BASH_SCRIPT = "/content/gdrive/MyDrive/assets/speech_commands/split_directory.sh" 
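
As a rough illustration of what `WORD_VOL` and `BG_VOL` control, the sketch below mixes a keyword clip with a noise clip at relative volumes. It is an assumption about the general technique, not the actual logic in `dataset-curation.py`; the function name `mix` and the clipping to the range [-1.0, 1.0] are choices made for this example.

```python
def mix(word, noise, word_vol=1.0, bg_vol=0.1):
    """Scale two equal-length clips and add them sample by sample.

    Assumption: samples are floats and the result is clipped to [-1.0, 1.0].
    """
    if len(word) != len(noise):
        raise ValueError("word and noise clips must have the same length")
    mixed = [word_vol * w + bg_vol * n for w, n in zip(word, noise)]
    # Clip so that loud word + noise combinations stay in range
    return [max(-1.0, min(1.0, s)) for s in mixed]


word_clip = [0.5, -0.5, 1.0]
noise_clip = [1.0, 1.0, 1.0]
print(mix(word_clip, noise_clip, word_vol=1.0, bg_vol=0.1))
```

With `BG_VOL = 0.1`, the noise contributes one tenth of its original amplitude, so the keyword remains dominant in the mixed sample.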

In [None]:
### Download Google Speech Commands Dataset
%cd {BASE_DIR}
!wget {GOOGLE_DATASET_URL}
!mkdir {GOOGLE_DATASET_DIR}
!echo "Extracting..."
!unzip {GOOGLE_DATASET_FILENAME} -d {GOOGLE_DATASET_DIR}

Streaming output truncated to the last 5000 lines.
  inflating: google_speech_commands/mini_speech_commands/up/1ecfb537_nohash_1.wav  
  inflating: google_speech_commands/__MACOSX/mini_speech_commands/up/._1ecfb537_nohash_1.wav  
  inflating: google_speech_commands/mini_speech_commands/up/c137814b_nohash_0.wav  
  inflating: google_speech_commands/__MACOSX/mini_speech_commands/up/._c137814b_nohash_0.wav  
  inflating: google_speech_commands/mini_speech_commands/up/135c6841_nohash_2.wav  
  inflating: google_speech_commands/__MACOSX/mini_speech_commands/up/._135c6841_nohash_2.wav  
  inflating: google_speech_commands/mini_speech_commands/up/3eb8764c_nohash_0.wav  
  inflating: google_speech_commands/__MACOSX/mini_speech_commands/up/._3eb8764c_nohash_0.wav  
  inflating: google_speech_commands/mini_speech_commands/up/caf9fceb_nohash_0.wav  
  inflating: google_speech_commands/__MACOSX/mini_speech_commands/up/._caf9fceb_nohash_0.wav  
  inflating: google_speech_commands/mini

In [None]:
# Download Google Speech Commands (original TAR file, needed because code doesn't work without background noise)
%cd {BASE_DIR}
!wget {GOOGLE_DATASET_2_URL}
!mkdir {GOOGLE_DATASET_2_DIR}
!echo "Extracting..."
!tar xfz {GOOGLE_DATASET_2_FILENAME} -C {GOOGLE_DATASET_2_DIR}

/content
--2021-08-07 04:36:45--  http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 74.125.142.128, 2607:f8b0:400e:c08::80
Connecting to download.tensorflow.org (download.tensorflow.org)|74.125.142.128|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2428923189 (2.3G) [application/gzip]
Saving to: ‘speech_commands_v0.02.tar.gz’


2021-08-07 04:37:13 (82.4 MB/s) - ‘speech_commands_v0.02.tar.gz’ saved [2428923189/2428923189]

Extracting...


In [None]:
# Moving the background noise directory to the main speech commands folder
!mv {BASE_DIR}/{GOOGLE_DATASET_2_DIR}/{BG_DIR} {BASE_DIR}/{GOOGLE_DATASET_DIR}

In [None]:
# Copying the rest of the non-standard words to the main speech commands folder (where they will become the unknown class)

# This directory is currently too big to be sent to Edge Impulse

!mkdir {BASE_DIR}/{GOOGLE_DATASET_DIR}/unknown
!cp {BASE_DIR}/{GOOGLE_DATASET_2_DIR}/on/* {BASE_DIR}/{GOOGLE_DATASET_DIR}/unknown
!cp {BASE_DIR}/{GOOGLE_DATASET_2_DIR}/one/* {BASE_DIR}/{GOOGLE_DATASET_DIR}/unknown
!cp {BASE_DIR}/{GOOGLE_DATASET_2_DIR}/seven/* {BASE_DIR}/{GOOGLE_DATASET_DIR}/unknown
!cp {BASE_DIR}/{GOOGLE_DATASET_2_DIR}/sheila/* {BASE_DIR}/{GOOGLE_DATASET_DIR}/unknown
!cp {BASE_DIR}/{GOOGLE_DATASET_2_DIR}/six/* {BASE_DIR}/{GOOGLE_DATASET_DIR}/unknown
!cp {BASE_DIR}/{GOOGLE_DATASET_2_DIR}/three/* {BASE_DIR}/{GOOGLE_DATASET_DIR}/unknown
!cp {BASE_DIR}/{GOOGLE_DATASET_2_DIR}/tree/* {BASE_DIR}/{GOOGLE_DATASET_DIR}/unknown
!cp {BASE_DIR}/{GOOGLE_DATASET_2_DIR}/two/* {BASE_DIR}/{GOOGLE_DATASET_DIR}/unknown
!cp {BASE_DIR}/{GOOGLE_DATASET_2_DIR}/visual/* {BASE_DIR}/{GOOGLE_DATASET_DIR}/unknown
!cp {BASE_DIR}/{GOOGLE_DATASET_2_DIR}/wow/* {BASE_DIR}/{GOOGLE_DATASET_DIR}/unknown
!cp {BASE_DIR}/{GOOGLE_DATASET_2_DIR}/zero/* {BASE_DIR}/{GOOGLE_DATASET_DIR}/unknown
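
The same copy step can be sketched in Python. One caveat with copying into a single flat folder: if two word folders contain files with the same name, the later copy overwrites the earlier one. The hypothetical helper below avoids that by prefixing each file with its word; whether names actually collide in this dataset is an assumption worth checking against the file count.

```python
import shutil
from pathlib import Path


def copy_into_unknown(src_root, dst_dir, words):
    """Copy every .wav in src_root/<word>/ to dst_dir as <word>_<name>.

    Returns the number of files copied.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for word in words:
        for wav in sorted((Path(src_root) / word).glob("*.wav")):
            # Prefix with the word so same-named files do not overwrite each other
            shutil.copy(wav, dst / f"{word}_{wav.name}")
            copied += 1
    return copied


# Example call using the variables from the Settings cell (not run here):
# copy_into_unknown(f"{BASE_DIR}/{GOOGLE_DATASET_2_DIR}",
#                   f"{BASE_DIR}/{GOOGLE_DATASET_DIR}/unknown",
#                   ["on", "one", "seven", "sheila", "six", "three",
#                    "tree", "two", "visual", "wow", "zero"])
```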

In [None]:
# Formatting the ZIP file to match the original TAR file output (getting rid of the random MACOSX directory)
%cd {GOOGLE_DATASET_DIR}
!ls
!sudo rm -r __MACOSX

/content/google_speech_commands
_background_noise_  __MACOSX  mini_speech_commands  unknown


In [None]:
!mv {BASE_DIR}/{GOOGLE_DATASET_DIR}/mini_speech_commands/down {BASE_DIR}/{GOOGLE_DATASET_DIR}
!mv {BASE_DIR}/{GOOGLE_DATASET_DIR}/mini_speech_commands/go {BASE_DIR}/{GOOGLE_DATASET_DIR}
!mv {BASE_DIR}/{GOOGLE_DATASET_DIR}/mini_speech_commands/left {BASE_DIR}/{GOOGLE_DATASET_DIR}
!mv {BASE_DIR}/{GOOGLE_DATASET_DIR}/mini_speech_commands/no {BASE_DIR}/{GOOGLE_DATASET_DIR}
!mv {BASE_DIR}/{GOOGLE_DATASET_DIR}/mini_speech_commands/right {BASE_DIR}/{GOOGLE_DATASET_DIR}
!mv {BASE_DIR}/{GOOGLE_DATASET_DIR}/mini_speech_commands/stop {BASE_DIR}/{GOOGLE_DATASET_DIR}
!mv {BASE_DIR}/{GOOGLE_DATASET_DIR}/mini_speech_commands/up {BASE_DIR}/{GOOGLE_DATASET_DIR}
!mv {BASE_DIR}/{GOOGLE_DATASET_DIR}/mini_speech_commands/yes {BASE_DIR}/{GOOGLE_DATASET_DIR}
!mv {BASE_DIR}/{GOOGLE_DATASET_DIR}/mini_speech_commands/README.md {BASE_DIR}/{GOOGLE_DATASET_DIR}

!ls
!rm -r {BASE_DIR}/{GOOGLE_DATASET_DIR}/mini_speech_commands

_background_noise_  go	  mini_speech_commands	README.md  stop     up
down		    left  no			right	   unknown  yes


In [None]:
### Pull out background noise directory
%cd {BASE_DIR}
!ls
!mv "{BASE_DIR}/{GOOGLE_DATASET_DIR}/{BG_DIR}" "{BASE_DIR}/{BG_DIR}"

/content
google_speech_commands	     sample_data
mini_speech_commands.zip     speech_commands_v0.02.tar.gz
original_dataset_ShawnHymel


In [None]:
# Check the number of files in "unknown"
%cd "{BASE_DIR}/{GOOGLE_DATASET_DIR}/unknown"
!ls | wc -l

/content/google_speech_commands/unknown
7464


In [None]:
### (Optional) Download custom dataset: uncomment the code in this cell if you want to use my custom dataset

## Download, extract, and move dataset to separate directory
# %cd {BASE_DIR}
# !wget {CUSTOM_KEYWORDS_URL}
# !echo "Extracting..."
# !unzip -q {CUSTOM_KEYWORDS_FILENAME}
# !mv "{CUSTOM_KEYWORDS_REPO_NAME}/{CUSTOM_KEYWORDS_DIR}" "{CUSTOM_KEYWORDS_DIR}"

In [None]:
### User Settings (do change these)

# Location of your custom keyword samples (e.g. "/content/custom_keywords").
# Leave blank ("") for no custom keywords. Set to the CUSTOM_KEYWORDS_DIR
# variable to use samples from my custom-speech-commands-dataset repo.
CUSTOM_DATASET_PATH = ""

# Edge Impulse > your_project > Dashboard > Keys
EI_API_KEY = " " # Write your key here

# Comma separated words. Must match directory names (that contain samples).
TARGETS = "up, yes, no, right, down, stop, left, go, unknown"
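
Since every target must match a directory name, a quick check before running the curation step can catch typos early. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that whitespace around the commas is ignored.

```python
from pathlib import Path


def parse_targets(targets):
    """Split a comma-separated string into a list of trimmed, non-empty words."""
    return [t.strip() for t in targets.split(",") if t.strip()]


def missing_targets(targets, dataset_dirs):
    """Return the targets that have no matching subdirectory in any dataset dir."""
    available = set()
    for dataset_dir in dataset_dirs:
        # Skip blank paths such as CUSTOM_DATASET_PATH = ""
        if dataset_dir and Path(dataset_dir).is_dir():
            available.update(
                p.name for p in Path(dataset_dir).iterdir() if p.is_dir()
            )
    return [t for t in parse_targets(targets) if t not in available]


print(parse_targets("up, yes, no"))  # ['up', 'yes', 'no']
```

Calling `missing_targets(TARGETS, [f"{BASE_DIR}/{GOOGLE_DATASET_DIR}", CUSTOM_DATASET_PATH])` should return an empty list when every target has a matching directory.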

In [None]:
### Download curation and utils scripts
%cd {BASE_DIR}
!wget {CURATION_SCRIPT_URL}
!wget {UTILS_SCRIPT_URL}

/content
--2021-08-07 04:40:21--  https://raw.githubusercontent.com/ShawnHymel/ei-keyword-spotting/master/dataset-curation.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17427 (17K) [text/plain]
Saving to: ‘dataset-curation.py’


2021-08-07 04:40:21 (59.7 MB/s) - ‘dataset-curation.py’ saved [17427/17427]

--2021-08-07 04:40:21--  https://raw.githubusercontent.com/ShawnHymel/ei-keyword-spotting/master/utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3966 (3.9K) [text/plain]
Saving to: ‘utils.py’


2021-08-07 04:40:21 (3

In [None]:
### Perform curation and mixing of samples with background noise
%cd {BASE_DIR}
!ls
!python {CURATION_SCRIPT} \
  -t "{TARGETS}" \
  -n {NUM_SAMPLES} \
  -w {WORD_VOL} \
  -g {BG_VOL} \
  -s {SAMPLE_TIME} \
  -r {SAMPLE_RATE} \
  -e {BIT_DEPTH} \
  -b "{BASE_DIR}/{BG_DIR}" \
  -o "{OUT_DIR}" \
  {BASE_DIR}/{GOOGLE_DATASET_DIR} \
  "{CUSTOM_DATASET_PATH}"

/content
_background_noise_	  original_dataset_ShawnHymel
dataset-curation.py	  sample_data
google_speech_commands	  speech_commands_v0.02.tar.gz
mini_speech_commands.zip  utils.py
-----------------------------------------------------------------------
Keyword Dataset Curation Tool
v0.1
-----------------------------------------------------------------------
No directory named ''. Ignoring.
Gathering random background noise snippets (1700 files)
Progress: |██████████████████████████████████████████████████| 100.0% Complete
Mixing: up (1700 files)
Progress: |██████████████████████████████████████████████████| 100.0% Complete
Mixing: yes (1700 files)
Progress: |██████████████████████████████████████████████████| 100.0% Complete
Mixing: no (1700 files)
Progress: |██████████████████████████████████████████████████| 100.0% Complete
Mixing: right (1700 files)
Progress: |██████████████████████████████████████████████████| 100.0% Complete
Mixing: down (1700 files)
Progress: |███████████████████

In [None]:
# Check the number of files in "unknown"
%cd "{BASE_DIR}/{OUT_DIR}/unknown"
!ls | wc -l

/content/keywords_curated/unknown
1700


In [None]:
%cd "{BASE_DIR}/{OUT_DIR}/no"
!ls | wc -l

/content/keywords_curated/no
1700


In [None]:
#!cp -r {BASE_DIR}/{GOOGLE_DATASET_DIR}/unknown {BASE_DIR}/{OUT_DIR}
#%cd {BASE_DIR}/{OUT_DIR}
#!ls

/content/keywords_curated
down  go  left	no  _noise  right  stop  unknown  _unknown  up	yes


In [None]:
!mv {BASE_DIR}/{OUT_DIR}/_unknown {BASE_DIR}/{GOOGLE_DATASET_DIR}/sample_data
%cd {BASE_DIR}/{OUT_DIR}
!ls

mv: cannot stat '/content/keywords_curated/_unknown': No such file or directory
/content/keywords_curated
down  go  left	no  _noise  right  stop  unknown  up  yes


In [None]:
### Use CLI tool to send curated dataset to Edge Impulse

%cd {BASE_DIR}

# Imports
import os
import random

# Seed with system time
random.seed()

# Go through each category in our curated dataset
for dir in os.listdir(OUT_DIR):
  
  # Create list of files for one category
  paths = []
  for filename in os.listdir(os.path.join(OUT_DIR, dir)):
    paths.append(os.path.join(OUT_DIR, dir, filename))

  # Shuffle and divide into test and training sets
  random.shuffle(paths)
  num_test_samples = int(TEST_RATIO * len(paths))
  test_paths = paths[:num_test_samples]
  train_paths = paths[num_test_samples:]

  # Create arguments list (as a string) for CLI call
  test_paths = ['"' + s + '"' for s in test_paths]
  test_paths = ' '.join(test_paths)
  train_paths = ['"' + s + '"' for s in train_paths]
  train_paths = ' '.join(train_paths)
  
  # Send test files to Edge Impulse
  !edge-impulse-uploader \
    --category testing \
    --label {dir} \
    --api-key {EI_API_KEY} \
    --silent \
    {test_paths}

  # Send training files to Edge Impulse
  !edge-impulse-uploader \
    --category training \
    --label {dir} \
    --api-key {EI_API_KEY} \
    --silent \
    {train_paths}

Streaming output truncated to the last 5000 lines.
[101/340] Uploading keywords_curated/no/no.1472.wav OK (280 ms)
[102/340] Uploading keywords_curated/no/no.1424.wav OK (265 ms)
[103/340] Uploading keywords_curated/no/no.0900.wav OK (369 ms)
[104/340] Uploading keywords_curated/no/no.0200.wav OK (355 ms)
[105/340] Uploading keywords_curated/no/no.0391.wav OK (303 ms)
[106/340] Uploading keywords_curated/no/no.1395.wav OK (320 ms)
[107/340] Uploading keywords_curated/no/no.0789.wav OK (289 ms)
[108/340] Uploading keywords_curated/no/no.0782.wav OK (262 ms)
[109/340] Uploading keywords_curated/no/no.0384.wav OK (329 ms)
[110/340] Uploading keywords_curated/no/no.1426.wav OK (314 ms)
[111/340] Uploading keywords_curated/no/no.1664.wav OK (442 ms)
[112/340] Uploading keywords_curated/no/no.1343.wav OK (279 ms)
[113/340] Uploading keywords_curated/no/no.0445.wav OK (312 ms)
[114/340] Uploading keywords_curated/no/no.1528.wav OK (266 ms)
[115/340] Uploading keywords_curated/no
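
The shuffle-and-split step in the cell above can be isolated into a small function, which makes it easier to test and, with a fixed seed, to reproduce. This is a sketch of the same logic; the function name `split_paths` and the `seed` parameter are additions for illustration and are not part of the original notebook.

```python
import random


def split_paths(paths, test_ratio=0.2, seed=None):
    """Shuffle paths and split them into (test, train) lists.

    seed=None seeds from system entropy, like random.seed() with no argument.
    """
    rng = random.Random(seed)
    shuffled = list(paths)
    rng.shuffle(shuffled)
    num_test = int(test_ratio * len(shuffled))
    return shuffled[:num_test], shuffled[num_test:]


# Hypothetical file names for illustration
all_paths = [f"sample.{i:04d}.wav" for i in range(1700)]
test_paths, train_paths = split_paths(all_paths, test_ratio=0.2, seed=0)
print(len(test_paths), len(train_paths))  # 340 1360
```

With 1700 samples per class and `TEST_RATIO = 0.2`, the test set holds 340 files, which matches the `[.../340]` counters in the upload log above.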