# **Keyword Spotting Dataset Curation**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]()

***We will use Colab to download the Google Speech Commands Dataset, mix in some background noise, and upload the curated dataset to Edge Impulse. From there, we can train a neural network to classify spoken words and deploy it to a microcontroller to perform real-time keyword spotting.***


***Note:*** ***Adjust parameters in the Settings cell (you will need an [Edge Impulse](https://www.edgeimpulse.com/) account)***




***P.S. The TensorFlow Lite model is already included in the main project, so running this Colab is optional. You don't need to retrain the model or repeat these steps.***

### ***Step 1 : Update Node.js to a pinned stable version (16.18.1)***

In [None]:
!npm cache clean -f
!npm install -g n
!n 16.18.1

### ***Step 2 : Install required packages and tools***

In [None]:
!python -m pip install soundfile
!npm install -g --unsafe-perm edge-impulse-cli

### ***Step 3: Settings***

In [None]:
### Settings (the defaults work as-is; change them only if needed)
BASE_DIR = "/content"
OUT_DIR = "keywords_curated"
GOOGLE_DATASET_FILENAME = "speech_commands_v0.02.tar.gz"
GOOGLE_DATASET_URL = "http://download.tensorflow.org/data/" + GOOGLE_DATASET_FILENAME
GOOGLE_DATASET_DIR = "google_speech_commands"
CURATION_SCRIPT = "dataset-curation.py"
CURATION_SCRIPT_URL = "https://raw.githubusercontent.com/Ashish0-0/NeuroKey-Embedded/main/" + CURATION_SCRIPT
UTILS_SCRIPT_URL = "https://raw.githubusercontent.com/Ashish0-0/NeuroKey-Embedded/main/utils.py"
NUM_SAMPLES = 1500    # Target number of samples to mix and send to Edge Impulse
WORD_VOL = 1.0        # Relative volume of word in output sample
BG_VOL = 0.1          # Relative volume of noise in output sample
SAMPLE_TIME = 1.0     # Time (seconds) of output sample
SAMPLE_RATE = 16000   # Sample rate (Hz) of output sample
BIT_DEPTH = "PCM_16"  # Options: [PCM_16, PCM_24, PCM_32, PCM_U8, FLOAT, DOUBLE]
BG_DIR = "_background_noise_"
TEST_RATIO = 0.2      # 20% reserved for test set, rest is for training
EI_INGEST_TEST_URL = "https://ingestion.edgeimpulse.com/api/test/data"
EI_INGEST_TRAIN_URL = "https://ingestion.edgeimpulse.com/api/training/data"
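
As a quick sanity check on these settings, the sketch below derives a few quantities from them: the number of audio frames per clip, the approximate size of one clip, and the train/test split sizes. It assumes mono audio, 2 bytes per frame for `PCM_16`, and that one category ends up with `NUM_SAMPLES` clips; none of these assumptions are confirmed by the curation script shown here.

```python
# Values copied from the Settings cell
NUM_SAMPLES = 1500
SAMPLE_TIME = 1.0
SAMPLE_RATE = 16000
TEST_RATIO = 0.2

# Number of audio frames in each output clip
frames_per_clip = int(SAMPLE_TIME * SAMPLE_RATE)

# Approximate raw audio size per clip (assumption: mono, 2 bytes per PCM_16 frame)
bytes_per_clip = frames_per_clip * 2

# Split sizes for one category (assumption: the category holds NUM_SAMPLES clips)
num_test = int(TEST_RATIO * NUM_SAMPLES)
num_train = NUM_SAMPLES - num_test

print(frames_per_clip, bytes_per_clip, num_test, num_train)
```

The split arithmetic mirrors the `int(TEST_RATIO * len(paths))` expression used in Step 8.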

### ***Step 4 : Download Google Speech Commands Dataset***

In [None]:
%cd {BASE_DIR}
!wget {GOOGLE_DATASET_URL}
!mkdir -p {GOOGLE_DATASET_DIR}
!echo "Extracting..."
!tar xfz {GOOGLE_DATASET_FILENAME} -C {GOOGLE_DATASET_DIR}

In [None]:
### Pull out background noise directory
%cd {BASE_DIR}
!mv "{GOOGLE_DATASET_DIR}/{BG_DIR}" "{BG_DIR}"
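
The extract-and-move steps above can also be sketched in plain Python, which makes them safe to re-run. The helper names `extract_archive` and `pull_out_dir` are hypothetical and not part of the notebook or the curation script; the sketch assumes the archive has already been downloaded.

```python
import os
import shutil
import tarfile


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir, creating the directory if needed."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)
    # Return the top-level entries so the caller can inspect what was extracted
    return sorted(os.listdir(dest_dir))


def pull_out_dir(dataset_dir, name, dest_parent="."):
    """Move dataset_dir/name to dest_parent/name unless it was already moved."""
    src = os.path.join(dataset_dir, name)
    dst = os.path.join(dest_parent, name)
    if os.path.isdir(src) and not os.path.exists(dst):
        shutil.move(src, dst)
    return dst
```

With the notebook's settings, usage would look like `extract_archive(GOOGLE_DATASET_FILENAME, GOOGLE_DATASET_DIR)` followed by `pull_out_dir(GOOGLE_DATASET_DIR, BG_DIR)`.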

### ***Step 5 : Add the API key***

In [None]:
# Note: You must set EI_API_KEY below to your own Edge Impulse API key

# Edge Impulse > your_project > Dashboard > Keys
EI_API_KEY = "ei_e544..." # Replace with your API key

# Recommended: use 2 keywords for the microcontroller demo, because the target does not have space for more.
# Supporting more keywords requires a microcontroller with more storage.
TARGETS = "yes, no"
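
Before running the curation step, it can help to confirm that every target word has a matching folder in the extracted dataset. The sketch below splits the comma-separated `TARGETS` string and reports missing folders. How the curation script itself parses `-t` is not shown in this notebook, so `parse_targets` and `missing_targets` are hypothetical helpers that only illustrate the idea.

```python
import os


def parse_targets(targets):
    """Split a comma-separated string into a list of trimmed, non-empty words."""
    return [word.strip() for word in targets.split(",") if word.strip()]


def missing_targets(targets, dataset_dir):
    """Return the target words that have no matching folder in dataset_dir."""
    return [
        word
        for word in parse_targets(targets)
        if not os.path.isdir(os.path.join(dataset_dir, word))
    ]


print(parse_targets("yes, no"))
```

An empty list from `missing_targets(TARGETS, GOOGLE_DATASET_DIR)` would mean each keyword has a folder to draw samples from.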

### ***Step 6 : Download curation and utils scripts***

In [None]:
!wget {CURATION_SCRIPT_URL}
!wget {UTILS_SCRIPT_URL}

### ***Step 7 : Perform curation and mixing of samples with background noise***

In [None]:
%cd {BASE_DIR}
!python {CURATION_SCRIPT} \
  -t "{TARGETS}" \
  -n {NUM_SAMPLES} \
  -w {WORD_VOL} \
  -g {BG_VOL} \
  -s {SAMPLE_TIME} \
  -r {SAMPLE_RATE} \
  -e {BIT_DEPTH} \
  -b "{BG_DIR}" \
  -o "{OUT_DIR}" \
  "{GOOGLE_DATASET_DIR}"

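To make the roles of `WORD_VOL` and `BG_VOL` concrete, here is a minimal sketch of mixing a keyword clip with a random snippet of background noise. It is not the code in `dataset-curation.py`, whose internals are not shown here. It assumes waveforms are lists of floats in the range [-1, 1], and the function names are hypothetical.

```python
import random


def random_snippet(noise, length, rng):
    """Pick a contiguous run of `length` frames from a longer noise recording."""
    if len(noise) < length:
        raise ValueError("noise clip is shorter than the requested snippet")
    start = rng.randint(0, len(noise) - length)
    return noise[start:start + length]


def mix_audio(word, noise, word_vol=1.0, bg_vol=0.1):
    """Scale and sum two equal-length waveforms, then clip to [-1, 1]."""
    if len(word) != len(noise):
        raise ValueError("word and noise must have the same length")
    mixed = [word_vol * w + bg_vol * n for w, n in zip(word, noise)]
    return [max(-1.0, min(1.0, sample)) for sample in mixed]


# Tiny demonstration with made-up values (real clips would hold 16000 frames)
rng = random.Random(0)
word = [0.5, -0.5, 0.25, 0.0]
noise = [0.2, -0.2, 0.4, -0.4, 0.1, 0.3, -0.1, 0.0]
print(mix_audio(word, random_snippet(noise, len(word), rng)))
```

With `BG_VOL = 0.1`, the noise contributes a tenth of its amplitude, so the keyword stays dominant in the mixed sample.
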
### ***Step 8 : Use CLI tool to send curated dataset to Edge Impulse***

In [None]:
%cd {BASE_DIR}

# Imports
import os
import random

# Seed with system time
random.seed()

# Go through each category in our curated dataset
for dir in os.listdir(OUT_DIR):

  # Create list of files for one category
  paths = []
  for filename in os.listdir(os.path.join(OUT_DIR, dir)):
    paths.append(os.path.join(OUT_DIR, dir, filename))

  # Shuffle and divide into test and training sets
  random.shuffle(paths)
  num_test_samples = int(TEST_RATIO * len(paths))
  test_paths = paths[:num_test_samples]
  train_paths = paths[num_test_samples:]

  # Create arguments list (as a string) for CLI call
  test_paths = ['"' + s + '"' for s in test_paths]
  test_paths = ' '.join(test_paths)
  train_paths = ['"' + s + '"' for s in train_paths]
  train_paths = ' '.join(train_paths)

  # Send test files to Edge Impulse
  !edge-impulse-uploader \
    --category testing \
    --label {dir} \
    --api-key {EI_API_KEY} \
    --silent \
    {test_paths}

  # Send training files to Edge Impulse
  !edge-impulse-uploader \
    --category training \
    --label {dir} \
    --api-key {EI_API_KEY} \
    --silent \
    {train_paths}
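
The shuffle-and-split logic above can be factored into small functions, which also makes it possible to reproduce the same split across runs. The sketch below uses a fixed seed (the notebook itself seeds from the system, so this is a deliberate variation) and `shlex.quote` in place of manual double-quoting. The names `split_paths` and `quote_args` are hypothetical.

```python
import random
import shlex


def split_paths(paths, test_ratio=0.2, seed=42):
    """Shuffle paths deterministically and split them into (test, train) lists."""
    rng = random.Random(seed)
    shuffled = sorted(paths)  # sort first so the result does not depend on listing order
    rng.shuffle(shuffled)
    num_test = int(test_ratio * len(shuffled))
    return shuffled[:num_test], shuffled[num_test:]


def quote_args(paths):
    """Join paths into one shell-safe argument string."""
    return " ".join(shlex.quote(p) for p in paths)


# Example with made-up file names
example = ["yes/clip {}.wav".format(i) for i in range(10)]
test_paths, train_paths = split_paths(example)
print(len(test_paths), len(train_paths))
print(quote_args(test_paths))
```

The quoted strings could then be passed to `edge-impulse-uploader` in the same way as `{test_paths}` and `{train_paths}` above.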