# Keyword Spotting Dataset Curation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ShawnHymel/precise-hey-jorvon/blob/main/keyword-spotting-dataset-curation.ipynb)

Use this tool to download the Google Speech Commands Dataset, combine it with your own keywords, and mix in some background noise.

 1. Upload samples of your own keyword (optional)
 2. Adjust parameters in the Settings cell
 3. Run the rest of the cells! ('shift' + 'enter' on each cell)

*Author:* Shawn Hymel

*Date:* March 11, 2022

*License:* [0BSD](https://opensource.org/licenses/0BSD)

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

### Upload your own keyword samples
You are welcome to use my [custom keyword dataset](https://github.com/ShawnHymel/custom-speech-commands-dataset), but note that it's limited and that I can't promise it will work well. If you want to use it, uncomment the code in the `### (Optional) Download custom dataset` cell below. You may also add your own recorded keywords to the extracted folder (`/content/custom_keywords`) to augment what's already there.

If you'd rather upload your own custom keyword dataset, follow these instructions:

In the file browser on the left pane, create a directory for each of your keywords. All samples for a keyword should be in a directory with that keyword's name.

The audio samples should be in `.wav` format, mono, and 1 second long. Bit rate and bit depth should not matter. Samples shorter than 1 second will be padded with zeros, and samples longer than 1 second will be truncated to 1 second. The exact name of each `.wav` file does not matter, as the files will be read, mixed with background noise, and saved to separate files with auto-generated names. The directory name does matter, as it is used to determine the name of the class during neural network training.
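
The padding and truncation described above can be sketched in a few lines of Python. This is only an illustration, not the curation script's actual code: the helper name `fit_to_length` is hypothetical, and it assumes the zeros are appended to the end of the clip.

```python
def fit_to_length(samples, sample_rate=16000, sample_time=1.0):
    """Return a copy of `samples` that is exactly sample_rate * sample_time long.

    Hypothetical helper for illustration; assumes zero padding goes at the end.
    """
    target_len = int(sample_rate * sample_time)
    # Drop anything past the target length
    fitted = list(samples[:target_len])
    # Fill any remaining space with zeros (silence)
    fitted.extend([0] * (target_len - len(fitted)))
    return fitted


short_clip = [1] * 12000  # 0.75 s at 16 kHz
long_clip = [1] * 20000   # 1.25 s at 16 kHz
print(len(fit_to_length(short_clip)), len(fit_to_length(long_clip)))
```

Both clips come out at 16000 samples, which matches the default `SAMPLE_RATE` and `SAMPLE_TIME` settings used later in this notebook.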

Right-click on each keyword directory and upload all of your samples. Your directory structure should look like the following:

```
/
|- content
|--- custom_keywords
|----- keyword_1
|------- 000.wav
|------- 001.wav
|------- ...
|----- keyword_2
|------- 000.wav
|------- 001.wav
|------- ...
|----- ...
```
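
If you want to double-check the layout before running the rest of the notebook, a short script along these lines may help. It is a minimal sketch: the function name `count_keyword_samples` is hypothetical, and it assumes your samples live in `/content/custom_keywords` as shown above.

```python
from pathlib import Path


def count_keyword_samples(dataset_dir):
    """Map each keyword directory name to the number of .wav files it holds."""
    counts = {}
    for keyword_dir in sorted(Path(dataset_dir).iterdir()):
        # Skip stray files and hidden directories such as .ipynb_checkpoints
        if not keyword_dir.is_dir() or keyword_dir.name.startswith("."):
            continue
        counts[keyword_dir.name] = len(list(keyword_dir.glob("*.wav")))
    return counts


# Assumed location of the uploaded samples (matches the tree above)
dataset_dir = Path("/content/custom_keywords")
if dataset_dir.is_dir():
    for keyword, num_wavs in count_keyword_samples(dataset_dir).items():
        print(f"{keyword}: {num_wavs} samples")
```

Each printed name should match one of the words you plan to list in `TARGETS`.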

In [None]:
### Settings (You probably do not need to change these)
BASE_DIR = "/content"
TEMP_DIR = "temp_dir"
OUT_DIR = "keywords_curated"
GOOGLE_DATASET_FILENAME = "speech_commands_v0.02.tar.gz"
GOOGLE_DATASET_URL = "http://download.tensorflow.org/data/" + GOOGLE_DATASET_FILENAME
GOOGLE_DATASET_DIR = "google_speech_commands"
CUSTOM_KEYWORDS_FILENAME = "main.zip"
CUSTOM_KEYWORDS_URL = "https://github.com/ShawnHymel/custom-speech-commands-dataset/archive/" + CUSTOM_KEYWORDS_FILENAME
CUSTOM_KEYWORDS_DIR = "custom_keywords"
CUSTOM_KEYWORDS_REPO_NAME = "custom-speech-commands-dataset-main"
CURATION_SCRIPT = "dataset-curation.py"
CURATION_SCRIPT_URL = "https://raw.githubusercontent.com/ShawnHymel/ei-keyword-spotting/master/" + CURATION_SCRIPT
UTILS_SCRIPT_URL = "https://raw.githubusercontent.com/ShawnHymel/ei-keyword-spotting/master/utils.py"
NUM_SAMPLES = 1500    # Target number of output samples per class
WORD_VOL = 1.0        # Relative volume of word in output sample
BG_VOL = 0.1          # Relative volume of noise in output sample
SAMPLE_TIME = 1.0     # Time (seconds) of output sample
SAMPLE_RATE = 16000   # Sample rate (Hz) of output sample
BIT_DEPTH = "PCM_16"  # Options: [PCM_16, PCM_24, PCM_32, PCM_U8, FLOAT, DOUBLE]
BG_DIR = "_background_noise_"
TEST_RATIO = 0.2      # 20% reserved for test set, rest is for training
EI_INGEST_TEST_URL = "https://ingestion.edgeimpulse.com/api/test/data"
EI_INGEST_TRAIN_URL = "https://ingestion.edgeimpulse.com/api/training/data"
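
To get a feel for what `WORD_VOL` and `BG_VOL` control, here is a rough sketch of mixing a keyword clip with background noise. It assumes the two settings act as simple gain factors on waveforms normalized to the range -1.0 to 1.0; the function `mix_audio` and the clipping step are illustrative assumptions, not the curation script's actual implementation.

```python
WORD_VOL = 1.0  # Relative volume of word in output sample
BG_VOL = 0.1    # Relative volume of noise in output sample


def mix_audio(word, noise, word_vol=WORD_VOL, bg_vol=BG_VOL):
    """Scale and sum two equal-length waveforms (hypothetical helper)."""
    if len(word) != len(noise):
        raise ValueError("word and noise must have the same length")
    mixed = []
    for w, n in zip(word, noise):
        sample = word_vol * w + bg_vol * n
        # Clip to the valid range so loud mixes do not overflow (assumption)
        mixed.append(max(-1.0, min(1.0, sample)))
    return mixed


word = [0.5, -0.5, 0.9]
noise = [1.0, 1.0, 1.0]
print(mix_audio(word, noise))
```

With the defaults, the noise contributes one tenth of its amplitude to each output sample, so the keyword stays dominant.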

In [None]:
### Download Google Speech Commands Dataset
%cd {BASE_DIR}
!wget {GOOGLE_DATASET_URL}
!mkdir -p {GOOGLE_DATASET_DIR}
!echo "Extracting..."
!tar xfz {GOOGLE_DATASET_FILENAME} -C {GOOGLE_DATASET_DIR}

In [None]:
### Pull out background noise directory
%cd {BASE_DIR}
!mv "{GOOGLE_DATASET_DIR}/{BG_DIR}" "{BG_DIR}"

In [None]:
### (Optional) Download custom dataset: uncomment the code in this cell if you want to use my custom dataset

## Download, extract, and move dataset to separate directory
# %cd {BASE_DIR}
# !wget {CUSTOM_KEYWORDS_URL}
# !echo "Extracting..."
# !unzip -q {CUSTOM_KEYWORDS_FILENAME}
# !mv "{CUSTOM_KEYWORDS_REPO_NAME}/{CUSTOM_KEYWORDS_DIR}" "{CUSTOM_KEYWORDS_DIR}"

In [None]:
### User Settings (do change these)

# Location of your custom keyword samples (e.g. "/content/custom_keywords")
# Leave blank ("") for no custom keywords. Set it to the CUSTOM_KEYWORDS_DIR
# variable to use samples from my custom-speech-commands-dataset repo.
CUSTOM_DATASET_PATH = "/content/custom_keywords"

# Comma separated words. Must match directory names (that contain samples).
TARGETS = "hey_jorvon"

In [None]:
### Download curation and utils scripts
!wget {CURATION_SCRIPT_URL}
!wget {UTILS_SCRIPT_URL}

In [None]:
### Perform curation and mixing of samples with background noise
%cd {BASE_DIR}
!python {CURATION_SCRIPT} \
  -t "{TARGETS}" \
  -n {NUM_SAMPLES} \
  -w {WORD_VOL} \
  -g {BG_VOL} \
  -s {SAMPLE_TIME} \
  -r {SAMPLE_RATE} \
  -e {BIT_DEPTH} \
  -b "{BG_DIR}" \
  -o "{TEMP_DIR}" \
  "{GOOGLE_DATASET_DIR}" \
  "{CUSTOM_DATASET_PATH}"

In [None]:
### Split and move samples to train and test folders
%cd {BASE_DIR}

# Imports
import os
import random
import shutil

# Seed with system time
random.seed()

# Remove output directory if it exists (start from scratch)
shutil.rmtree(OUT_DIR, ignore_errors=True)

# Go through each category in our curated dataset
for dir in os.listdir(TEMP_DIR):

  # Ignore notebook checkpoint
  if dir == ".ipynb_checkpoints":
    continue

  # Create output directories
  os.makedirs(os.path.join(OUT_DIR, "train", dir))
  os.makedirs(os.path.join(OUT_DIR, "test", dir))
  
  # Create list of files for one category
  paths = []
  for filename in os.listdir(os.path.join(TEMP_DIR, dir)):
    paths.append(os.path.join(TEMP_DIR, dir, filename))

  # Shuffle and divide into test and training sets
  random.shuffle(paths)
  num_test_samples = int(TEST_RATIO * len(paths))
  test_paths = paths[:num_test_samples]
  train_paths = paths[num_test_samples:]

  # Copy files
  for file in train_paths:
    out_path = os.path.join(OUT_DIR, "train", dir, os.path.basename(file))
    shutil.copy(file, out_path)
  for file in test_paths:
    out_path = os.path.join(OUT_DIR, "test", dir, os.path.basename(file))
    shutil.copy(file, out_path)
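
The shuffle-and-slice logic in the cell above can also be written as a small standalone function, which makes it easy to try out on a handful of file names. This is a sketch only: `split_paths` and its optional `seed` parameter are hypothetical additions for repeatable splits, whereas the cell above seeds from the system time.

```python
import random


def split_paths(paths, test_ratio=0.2, seed=None):
    """Shuffle `paths` and return (train_paths, test_paths).

    Hypothetical helper; pass a seed to get the same split on every run.
    """
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    # Same arithmetic as the cell above: round the test count down
    num_test = int(test_ratio * len(shuffled))
    return shuffled[num_test:], shuffled[:num_test]


paths = [f"{i:03d}.wav" for i in range(10)]
train_paths, test_paths = split_paths(paths, test_ratio=0.2, seed=42)
print(len(train_paths), "train,", len(test_paths), "test")
```

Because the test count is rounded down, a category with fewer than five samples ends up with an empty test set at the default `TEST_RATIO` of 0.2.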

In [None]:
### Zip dataset for easy download
%cd {BASE_DIR}
!zip -r -q "{OUT_DIR}.zip" "{OUT_DIR}"