# Project: Deploy a Keyword Spotting Model on a Custom Dataset

This notebook is based on HarvardX's [4-6-8-CustomDatasetKWSModel.ipynb](https://colab.research.google.com/github/tinyMLx/colabs/blob/master/4-6-8-CustomDatasetKWSModel.ipynb). This time we train our own custom keyword spotting model on a custom dataset!

In [None]:
%%bash
rm -rf tensorflow log  v2.4.1.zip logs models train dataset extract_loudest_section
apt-get update -qq && apt-get install -y --no-install-recommends wget unzip
wget https://github.com/tensorflow/tensorflow/archive/v2.4.1.zip
unzip v2.4.1.zip &> log
mv tensorflow-2.4.1/ tensorflow/
rm -rf v2.4.1.zip log
python3 -m pip install --upgrade --quiet --no-cache-dir pip ffmpeg-python

## **Import Packages**
Import standard packages as well as the additional speech processing modules from the downloaded TensorFlow GitHub repository.

In [None]:
import tensorflow as tf
import sys
# We add this path so we can import the speech processing modules.
sys.path.append("./tensorflow/tensorflow/examples/speech_commands/")
import input_data
import models
import numpy as np
import glob
import os
import re
import shutil

# **Create Custom Dataset**
### Import the Dataset Basis
We will build our dataset upon Pete Warden's dataset (a base set of "other words" and "background noise").

In [None]:
# Set Data Set Path for Python
DATASET_DIR =  'dataset/'

In [None]:
%%bash
rm -rf speech_commands_v0.02.tar.gz*
wget https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.02.tar.gz
mkdir dataset
tar -xf speech_commands_v0.02.tar.gz -C 'dataset'
rm -rf speech_commands_v0.02.tar.gz

## Import Recorded Audio Files
Now upload all of the custom audio files (the `*.ogg` files) that you previously recorded with my [Open Speech Recording Tool Start Script](https://github.com/KlausPuchner/TinyML/blob/main/02_projects/keyword_spotting/customwords/start-open-speech-recording-app.sh). Select multiple files in the upload dialog to upload them all at once!

In [None]:
from ipywidgets import FileUpload
upload = FileUpload(multiple=True)
upload

In [None]:
for name, file_info in upload.value.items():
    with open(name, 'wb') as fp:
        fp.write(file_info['content'])

# Convert .ogg Files to .wav
Now we convert them into correctly trimmed WAV files and store them in the appropriate folders in `DATASET_DIR`.
We will use Pete Warden's extract_loudest_section tool, which is documented here: https://github.com/petewarden/extract_loudest_section

In [None]:
%%bash
apt-get update -qqq && apt-get install -y --no-install-recommends -qqq git ffmpeg zip
rm -rf wavs
mkdir wavs
find *.ogg -print0 | xargs -0 basename -s .ogg | xargs -I {} ffmpeg -i {}.ogg -ar 16000 wavs/{}.wav
rm -rf *.ogg

# then use pete's tool to only extract 1 second clips from them for use with the KWS pipeline
mkdir trimmed_wavs
git clone https://github.com/petewarden/extract_loudest_section.git
make -C extract_loudest_section/
/tmp/extract_loudest_section/gen/bin/extract_loudest_section 'wavs/*.wav' trimmed_wavs/
rm -rf wavs

In [None]:
# Store them in the appropriate folders
data_index = {}
os.chdir('trimmed_wavs')
search_path = os.path.join('*.wav')
for wav_path in glob.glob(search_path):
    matches = re.search(r'([^/_]+)_([^/_]+)\.wav', wav_path)
    if not matches:
        raise Exception('File name not in a recognized form:"%s"' % wav_path)
    word = matches.group(1).lower()
    instance = matches.group(2).lower()
    if word not in data_index:
        data_index[word] = {}
    if instance in data_index[word]:
        raise Exception('Audio instance already seen:"%s"' % wav_path)
    data_index[word][instance] = wav_path

output_dir = os.path.join('..', 'dataset')
# Create the dataset folder if it does not exist yet.
os.makedirs(output_dir, exist_ok=True)
for word in data_index:
  word_dir = os.path.join(output_dir, word)
  try:
      os.mkdir(word_dir)
      print('Created dir: ' + word_dir)
  except:
      print('Storing in existing dir: ' + word_dir)
  for instance in data_index[word]:
    wav_path = data_index[word][instance]
    output_path = os.path.join(word_dir, instance + '.wav')
    shutil.copyfile(wav_path, output_path)
os.chdir('..')
!rm -r -f trimmed_wavs
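
The cell above expects every recording to be named `<word>_<instance>.wav`. As a minimal sketch of how that naming convention is parsed, the helper below applies the same regular expression to two made-up file names; `split_recording_name` is a hypothetical name used only for illustration.

```python
import re

# Same pattern as in the cell above, written as a raw string.
PATTERN = re.compile(r'([^/_]+)_([^/_]+)\.wav')

def split_recording_name(wav_path):
    """Return (word, instance) for a name like 'Activate_1a2b3c.wav', else None."""
    match = PATTERN.search(wav_path)
    if not match:
        return None
    # Lower-case both parts, as the cell above does.
    return match.group(1).lower(), match.group(2).lower()

print(split_recording_name('Activate_1a2b3c.wav'))  # ('activate', '1a2b3c')
print(split_recording_name('snapshot.wav'))         # None (no underscore separator)
```

A file without the underscore separator is the case that triggers the "File name not in a recognized form" exception in the cell above.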

# (Optional) Download your Dataset
To zip and download your dataset run the code cell below.

In [None]:
!zip -r 3customworddataset.zip dataset
from IPython.display import display, FileLink

local_file = FileLink('./3customworddataset.zip', result_html_prefix="Click here to download: ")
display(local_file)

In [None]:
!rm -rf 3customworddataset.zip

# **Train your Model**

Next, select the keywords and the model settings to train with!

WANTED_WORDS = A comma-delimited string of the words you want to train for (e.g., "yes,no").

Since we collected the words "activate", "deactivate" and "snapshot" we make sure to input them!

In [None]:
WANTED_WORDS = "activate,deactivate,snapshot"

The number of training steps and the learning rates can be specified as comma-separated strings to define the amount/rate at each stage. For example, TRAINING_STEPS="12000,3000" and LEARNING_RATE="0.001,0.0001" will run 12,000 training steps with a learning rate of 0.001, followed by 3,000 final steps with a learning rate of 0.0001. These are good defaults to start from when you choose your own values, as the course staff has had good results with them in the past!

In [None]:
TRAINING_STEPS = "12000,3000"
LEARNING_RATE = "0.001,0.0001"
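
To make the staged schedule concrete, here is a minimal sketch that pairs each stage with its learning rate and looks up the rate for a given step. The helper names `parse_schedule` and `rate_at_step` are hypothetical, and the step counting (1-based, boundaries inclusive) is an assumption about how the training script walks through the stages.

```python
def parse_schedule(training_steps, learning_rate):
    """Pair each stage's step count with its learning rate."""
    steps = [int(s) for s in training_steps.split(',')]
    rates = [float(r) for r in learning_rate.split(',')]
    if len(steps) != len(rates):
        raise ValueError('Need exactly one learning rate per training stage')
    return list(zip(steps, rates))

def rate_at_step(schedule, step):
    """Learning rate used at a 1-based global training step (assumed convention)."""
    boundary = 0
    for stage_steps, rate in schedule:
        boundary += stage_steps
        if step <= boundary:
            return rate
    raise ValueError('step is beyond the end of the schedule')

schedule = parse_schedule("12000,3000", "0.001,0.0001")
print(schedule)                       # [(12000, 0.001), (3000, 0.0001)]
print(rate_at_step(schedule, 12000))  # 0.001
print(rate_at_step(schedule, 12001))  # 0.0001
```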

Leave the MODEL_ARCHITECTURE as tiny_conv the first time. If you would like to do this again and explore additional models, some options are: single_fc, conv, low_latency_conv, low_latency_svdf, tiny_embedding_conv. **Remember that if you switch the model type, you may need to update the C++ code to include the tflite::AllOpsResolver to make sure you have all of the necessary ops!**

In [None]:
# Model architecture to train; the training and freezing cells below read it.
MODEL_ARCHITECTURE = 'tiny_conv'

# Calculate the total number of steps, which is used to identify the checkpoint
# file name.
TOTAL_STEPS = str(sum(map(int, TRAINING_STEPS.split(","))))

# Print the configuration to confirm it
print("Training these words: %s" % WANTED_WORDS)
print("Training steps in each stage: %s" % TRAINING_STEPS)
print("Learning rate in each stage: %s" % LEARNING_RATE)
print("Total number of training steps: %s" % TOTAL_STEPS)

In [None]:
# Calculate the percentage of 'silence' and 'unknown' training samples required
# to ensure that we have equal number of samples for each label.
number_of_labels = WANTED_WORDS.count(',') + 1
number_of_total_labels = number_of_labels + 2 # for 'silence' and 'unknown' label
equal_percentage_of_training_samples = int(100.0/(number_of_total_labels))
SILENT_PERCENTAGE = equal_percentage_of_training_samples
UNKNOWN_PERCENTAGE = equal_percentage_of_training_samples

# Constants used during training only
VERBOSITY = 'DEBUG'
EVAL_STEP_INTERVAL = '1000'
SAVE_STEP_INTERVAL = '1000'

# Constants for training directories and filepaths
LOGS_DIR = 'logs/'
TRAIN_DIR = 'train/' # for training checkpoints and other files.

# Constants for inference directories and filepaths
import os
MODELS_DIR = 'models'
if not os.path.exists(MODELS_DIR):
  os.mkdir(MODELS_DIR)
MODEL_TF = os.path.join(MODELS_DIR, 'KWS_custom.pb')
MODEL_TFLITE = os.path.join(MODELS_DIR, 'KWS_custom.tflite')
FLOAT_MODEL_TFLITE = os.path.join(MODELS_DIR, 'KWS_custom_float.tflite')
MODEL_TFLITE_MICRO = os.path.join(MODELS_DIR, 'KWS_custom.cc')
SAVED_MODEL = os.path.join(MODELS_DIR, 'KWS_custom_saved_model')
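
As a worked example of the percentage calculation at the top of the cell above: three keywords plus 'silence' and 'unknown' give five labels, so each label gets an equal 20% share. The function name `equal_share_percentage` is hypothetical and only wraps the same arithmetic.

```python
def equal_share_percentage(wanted_words):
    """Percentage assigned to 'silence' and to 'unknown' so all labels are balanced."""
    keyword_labels = wanted_words.count(',') + 1
    total_labels = keyword_labels + 2  # add 'silence' and 'unknown'
    return int(100.0 / total_labels)   # int() truncates, e.g. 16.67 -> 16

print(equal_share_percentage("activate,deactivate,snapshot"))  # 20
print(equal_share_percentage("yes,no"))                        # 25
print(equal_share_percentage("a,b,c,d"))                       # 16
```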

In [None]:
# Constants which are shared during training and inference
PREPROCESS = 'micro'
WINDOW_STRIDE = 20

# Constants for Quantization
QUANT_INPUT_MIN = 0.0
QUANT_INPUT_MAX = 26.0
QUANT_INPUT_RANGE = QUANT_INPUT_MAX - QUANT_INPUT_MIN

# Constants for audio process during Quantization and Evaluation
SAMPLE_RATE = 16000
CLIP_DURATION_MS = 1000
WINDOW_SIZE_MS = 30.0
FEATURE_BIN_COUNT = 40
BACKGROUND_FREQUENCY = 0.8
BACKGROUND_VOLUME_RANGE = 0.1
TIME_SHIFT_MS = 100.0

# Use the custom local dataset and set the test/val/train split
DATA_URL = ''
VALIDATION_PERCENTAGE = 10
TESTING_PERCENTAGE = 10
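
The two percentages above decide how many recordings end up in the validation and test sets. As far as I recall, the speech_commands example assigns each file to a set by hashing its file name, so that a file always lands in the same set across runs. The sketch below is a reconstruction under that assumption, not the library's actual code; the constant and the function name are illustrative.

```python
import hashlib
import os
import re

# Assumed upper bound on files per class, used only to scale the hash.
MAX_NUM_WAVS_PER_CLASS = 2**27 - 1

def which_set(filename, validation_percentage, testing_percentage):
    """Deterministically map a file name to 'training', 'validation' or 'testing'."""
    base_name = os.path.basename(filename)
    # Ignore everything after '_nohash_' so related recordings stay together.
    hash_name = re.sub(r'_nohash_.*$', '', base_name)
    digest = hashlib.sha1(hash_name.encode('utf-8')).hexdigest()
    # Scale the hash to a value between 0 and 100.
    percentage_hash = ((int(digest, 16) % (MAX_NUM_WAVS_PER_CLASS + 1)) *
                       (100.0 / MAX_NUM_WAVS_PER_CLASS))
    if percentage_hash < validation_percentage:
        return 'validation'
    if percentage_hash < validation_percentage + testing_percentage:
        return 'testing'
    return 'training'

# Count how 1000 made-up file names are distributed with a 10/10/80 split.
counts = {'training': 0, 'validation': 0, 'testing': 0}
for i in range(1000):
    counts[which_set('dataset/activate/%04d.wav' % i, 10, 10)] += 1
print(counts)
```

With only a few dozen recordings per word, the actual shares can deviate noticeably from 10%, which is one reason the test set may contain fewer samples than expected.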

In [None]:
%load_ext tensorboard
logs_base_dir='./logs/'
os.makedirs(logs_base_dir, exist_ok=True)
%tensorboard --logdir {logs_base_dir} --host 0.0.0.0 --port 6006

## **Launch Training**
More information on the training script can be found in the source code for the script [here](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/speech_commands/train.py). In short, it sets up the optimizer and preprocessor based on all of the flags we pass in!

Finally, be aware that with VERBOSITY = 'DEBUG' set above, the training cell will print A LOT of information. Specifically, you will get the accuracy and loss at each step as well as a confusion matrix every 1000 steps. We hope that is helpful in case TensorBoard fails to work. If you would like fewer printouts, you can change the setting to WARN or FATAL. You will find it in the "Train your Model" section.

In [None]:
!python tensorflow/tensorflow/examples/speech_commands/train.py \
  --data_dir={DATASET_DIR} \
  --data_url={DATA_URL} \
  --wanted_words={WANTED_WORDS} \
  --silence_percentage={SILENT_PERCENTAGE} \
  --unknown_percentage={UNKNOWN_PERCENTAGE} \
  --preprocess={PREPROCESS} \
  --window_stride={WINDOW_STRIDE} \
  --model_architecture={MODEL_ARCHITECTURE} \
  --how_many_training_steps={TRAINING_STEPS} \
  --learning_rate={LEARNING_RATE} \
  --train_dir={TRAIN_DIR} \
  --summaries_dir={LOGS_DIR} \
  --verbosity={VERBOSITY} \
  --eval_step_interval={EVAL_STEP_INTERVAL} \
  --save_step_interval={SAVE_STEP_INTERVAL}

## **Generating your Model**
Just like with the pre-trained model we will now take the final checkpoint and convert it into a quantized TensorFlow Lite model.

### **Generate a TensorFlow Model for Inference**
Combine relevant training results (graph, weights, etc) into a single file for inference. This process is known as freezing a model and the resulting model is known as a frozen model/graph, as it cannot be further re-trained after this process.

In [None]:
!rm -rf {SAVED_MODEL}
!python tensorflow/tensorflow/examples/speech_commands/freeze.py \
--wanted_words=$WANTED_WORDS \
--window_stride_ms=$WINDOW_STRIDE \
--preprocess=$PREPROCESS \
--model_architecture=$MODEL_ARCHITECTURE \
--start_checkpoint=$TRAIN_DIR$MODEL_ARCHITECTURE'.ckpt-'{TOTAL_STEPS} \
--save_format=saved_model \
--output_file={SAVED_MODEL}
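
The `--start_checkpoint` argument above is assembled from three of the constants defined earlier. A minimal sketch of the resulting path, using the default values from this notebook; `final_checkpoint` is a hypothetical helper name.

```python
import os

def final_checkpoint(train_dir, model_architecture, training_steps):
    """Build the checkpoint prefix that the freeze step is pointed at."""
    total_steps = sum(int(s) for s in training_steps.split(','))
    return os.path.join(train_dir, '%s.ckpt-%d' % (model_architecture, total_steps))

print(final_checkpoint('train/', 'tiny_conv', '12000,3000'))
# train/tiny_conv.ckpt-15000
```

If you change TRAINING_STEPS or MODEL_ARCHITECTURE, re-run the configuration cells first so that this path matches the checkpoint that training actually wrote.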

### **Generate a TensorFlow Lite Model**
Convert the frozen graph into a TensorFlow Lite model, which is fully quantized for use with embedded devices.

The following cell will also print the model size, which will be under 20 kilobytes.

We use samples from the custom dataset as a representative dataset for more thoughtful post-training quantization.

In [None]:
model_settings = models.prepare_model_settings(
    len(input_data.prepare_words_list(WANTED_WORDS.split(','))),
    SAMPLE_RATE, CLIP_DURATION_MS, WINDOW_SIZE_MS,
    WINDOW_STRIDE, FEATURE_BIN_COUNT, PREPROCESS)
audio_processor = input_data.AudioProcessor(
    DATA_URL, DATASET_DIR,
    SILENT_PERCENTAGE, UNKNOWN_PERCENTAGE,
    WANTED_WORDS.split(','), VALIDATION_PERCENTAGE,
    TESTING_PERCENTAGE, model_settings, LOGS_DIR)


Note: if the below cell fails it might be because you do not have enough data to have 100 recordings in the representative dataset! If this happens you will see an error that says something like ValueError: cannot reshape array of size 0 into shape (1,1960). To help you fix this we have added a print(i) into the loop. As such, all you have to do is change the REP_DATA_SIZE variable to be equal to the last integer value printed out by the loop and then re-run the cell!

In [None]:
REP_DATA_SIZE = 100
# tf.compat.v1.Session is available in both TensorFlow 1.15 and 2.x.
with tf.compat.v1.Session() as sess:
  float_converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL)
  float_tflite_model = float_converter.convert()
  float_tflite_model_size = open(FLOAT_MODEL_TFLITE, "wb").write(float_tflite_model)
  print("Float model is %d bytes" % float_tflite_model_size)

  converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL)
  converter.optimizations = [tf.lite.Optimize.DEFAULT]
  converter.inference_input_type = tf.int8
  converter.inference_output_type = tf.int8
  def representative_dataset_gen():
    for i in range(REP_DATA_SIZE):
      data, _ = audio_processor.get_data(1, i*1, model_settings,
                                         BACKGROUND_FREQUENCY, 
                                         BACKGROUND_VOLUME_RANGE,
                                         TIME_SHIFT_MS,
                                         'testing',
                                         sess)
      flattened_data = np.array(data.flatten(), dtype=np.float32).reshape(1, 1960)
      print(i)
      yield [flattened_data]
  converter.representative_dataset = representative_dataset_gen
  tflite_model = converter.convert()
  tflite_model_size = open(MODEL_TFLITE, "wb").write(tflite_model)
  print("Quantized model is %d bytes" % tflite_model_size)
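
The hard-coded `1960` in the reshape above follows from the audio constants: a 1000 ms clip with 30 ms windows and a 20 ms stride yields 49 frames, each with 40 feature bins. The sketch below reproduces that arithmetic; the function name is hypothetical, and it assumes the 'micro' preprocessing path, where each frame contributes exactly FEATURE_BIN_COUNT values.

```python
def fingerprint_size(sample_rate, clip_duration_ms, window_size_ms,
                     window_stride_ms, feature_bin_count):
    """Number of input features per clip: frames times bins per frame."""
    desired_samples = int(sample_rate * clip_duration_ms / 1000)
    window_size_samples = int(sample_rate * window_size_ms / 1000)
    window_stride_samples = int(sample_rate * window_stride_ms / 1000)
    length_minus_window = desired_samples - window_size_samples
    if length_minus_window < 0:
        frames = 0  # clip is shorter than a single window
    else:
        frames = 1 + int(length_minus_window / window_stride_samples)
    return frames * feature_bin_count

# Values of SAMPLE_RATE, CLIP_DURATION_MS, WINDOW_SIZE_MS, WINDOW_STRIDE,
# and FEATURE_BIN_COUNT as set in this notebook.
print(fingerprint_size(16000, 1000, 30.0, 20, 40))  # 1960
```

If you change any of these constants, the reshape target changes with them; the `model_settings` dictionary should carry the same number, so reading it from there is likely more robust than hard-coding it.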


### Testing the accuracy after Quantization

Verify that the model we've exported is still accurate, using the TF Lite Python API and our test set.

In [None]:
# Helper function to run inference
def run_tflite_inference_testSet(tflite_model_path, model_type="Float"):
  #
  # Load test data
  #
  np.random.seed(0) # set random seed for reproducible test results.
  with tf.compat.v1.Session() as sess:
    test_data, test_labels = audio_processor.get_data(
        -1, 0, model_settings, BACKGROUND_FREQUENCY, BACKGROUND_VOLUME_RANGE,
        TIME_SHIFT_MS, 'testing', sess)
  test_data = np.expand_dims(test_data, axis=1).astype(np.float32)

  #
  # Initialize the interpreter
  #
  interpreter = tf.lite.Interpreter(tflite_model_path)
  interpreter.allocate_tensors()
  input_details = interpreter.get_input_details()[0]
  output_details = interpreter.get_output_details()[0]
  
  #
  # For quantized models, manually quantize the input data from float to integer
  #
  if model_type == "Quantized":
    input_scale, input_zero_point = input_details["quantization"]
    test_data = test_data / input_scale + input_zero_point
    test_data = test_data.astype(input_details["dtype"])

  #
  # Evaluate the predictions
  #
  correct_predictions = 0
  for i in range(len(test_data)):
    interpreter.set_tensor(input_details["index"], test_data[i])
    interpreter.invoke()
    output = interpreter.get_tensor(output_details["index"])[0]
    top_prediction = output.argmax()
    correct_predictions += (top_prediction == test_labels[i])

  print('%s model accuracy is %f%% (Number of test samples=%d)' % (
      model_type, (correct_predictions * 100) / len(test_data), len(test_data)))
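
The manual quantization step in the helper maps each float feature to an integer via `x / scale + zero_point`. Here is a minimal sketch of that mapping and its inverse. The scale and zero point below are hypothetical values chosen to map the range [0, 26] onto int8; the real ones come from `input_details["quantization"]`. Unlike the cell above, this sketch also rounds and clips before casting.

```python
import numpy as np

def quantize(x, scale, zero_point, dtype=np.int8):
    """Map float values to integers: q = round(x / scale + zero_point), clipped."""
    q = np.round(x / scale + zero_point)
    info = np.iinfo(dtype)
    return np.clip(q, info.min, info.max).astype(dtype)

def dequantize(q, scale, zero_point):
    """Approximate inverse: x is roughly (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale

# Hypothetical parameters for illustration only.
scale = 26.0 / 255.0
zero_point = -128

x = np.array([0.0, 6.5, 13.0, 26.0], dtype=np.float32)
q = quantize(x, scale, zero_point)
print(q)                                 # integers between -128 and 127
print(dequantize(q, scale, zero_point))  # close to x, within about scale / 2
```

The round trip is lossy by at most about half a quantization step, which is why the quantized model's accuracy can differ slightly from the float model's.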

In [None]:
# Compute float model accuracy
run_tflite_inference_testSet(FLOAT_MODEL_TFLITE)

# Compute quantized model accuracy
run_tflite_inference_testSet(MODEL_TFLITE, model_type='Quantized')

### Generate a TensorFlow Lite for Microcontrollers Model
To convert the TensorFlow Lite quantized model into a C source file that can be loaded by TensorFlow Lite for Microcontrollers on Arduino we simply need to use the xxd tool to convert the .tflite file into a .cc file.

In [None]:
!apt-get update -qqq && apt-get -qqq install xxd

In [None]:
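# MODEL_TFLITE still points to the quantized model written by the converter cell above;
# only the output path of the C source file is changed here.
MODEL_TFLITE_MICRO = './models/model.cc'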
!xxd -i {MODEL_TFLITE} > {MODEL_TFLITE_MICRO}
REPLACE_TEXT = MODEL_TFLITE.replace('/', '_').replace('.', '_')
!sed -i 's/'{REPLACE_TEXT}'/g_model/g' {MODEL_TFLITE_MICRO}
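
For readers without `xxd`, the conversion can be approximated in plain Python: read the model's bytes and print them as a C array. This is a minimal sketch, not a byte-for-byte reproduction of the `xxd -i` output; `to_c_array` is a hypothetical helper, and the variable name `g_model` follows the `sed` replacement above.

```python
def to_c_array(data, var_name='g_model', bytes_per_line=12):
    """Render raw bytes as a C unsigned char array plus a length variable."""
    lines = ['unsigned char %s[] = {' % var_name]
    for start in range(0, len(data), bytes_per_line):
        chunk = data[start:start + bytes_per_line]
        lines.append('  ' + ', '.join('0x%02x' % b for b in chunk) + ',')
    lines.append('};')
    lines.append('unsigned int %s_len = %d;' % (var_name, len(data)))
    return '\n'.join(lines)

# Four placeholder bytes stand in for the contents of a .tflite file.
print(to_c_array(b'TFL3'))
```

To use it on the real model, read the file with `open(MODEL_TFLITE, 'rb').read()` and write the returned string to the `.cc` file.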

The generated TensorFlow Lite for Microcontrollers model can now be used in the Arduino IDE. There are two options to do this:

1. Copy the screen output directly from the Jupyter Notebook into the **micro_features_model.cpp** file (in the Arduino IDE)
2. Download the **model.cc** file for later use to copy its content into the **micro_features_model.cpp** file (in the Arduino IDE)

### Option 1: Copy Output directly

In [None]:
!cat {MODEL_TFLITE_MICRO}

### **Option 2: Download Model File**

In [None]:
from IPython.display import FileLink, display
local_file = FileLink('./models/model.cc', result_html_prefix="Click here to download: ")
display(local_file)