# Building a Better Visual Wake Words Dataset

## Introduction
This notebook trains and evaluates models that detect whether an object of a chosen class is present in an image. A custom dataset is built from images in the COCO dataset. The notebook is based on the TensorFlow Lite Micro walkthrough on building a person detector, with a few changes to the underlying code to fix bugs and make it easier to run. The original directions are available here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/micro/examples/person_detection/training_a_model.md

More information on running the generated model can be found here:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/examples/person_detection

## Building the dataset
In order to train a person detector model, we need a large collection of images that are labeled depending on whether or not they have people in them. The ImageNet one-thousand class data that's widely used for training image classifiers doesn't include labels for people, but luckily the COCO dataset does. You can download this data without manually registering, and Slim provides a convenient script to grab it automatically.

This is a large download, about 40GB, so it will take a while and you'll need to make sure you have at least 100GB free on your drive to allow space for unpacking and further processing. The argument to the script is the path that the data will be downloaded to. If you change this, you'll also need to update the commands below that use it.

The dataset is designed to be used for training models for localization, so the images aren't labeled with the "contains a person", "doesn't contain a person" categories that we want to train for. Instead, each image comes with a list of bounding boxes for all of the objects it contains. "Person" is one of these object categories, so to get to the classification labels we want, we have to look for images with bounding boxes for people. To make sure that they aren't too tiny to be recognizable, we also need to exclude very small bounding boxes. Slim contains a script to convert the bounding boxes into labels.

Don't be surprised if this takes up to twenty minutes to complete. When it's done, you'll have a set of TFRecords in coco/processed holding the labeled image information. This data was created by Aakanksha Chowdhery and is known as the Visual Wake Words dataset. It's designed to be useful for benchmarking and testing embedded computer vision, since it represents a very common task that we need to accomplish with tight resource constraints. We're hoping to see it drive even better models for this and similar tasks.

## Model Configuration
The following parameters are used to configure the model that will be generated. A MobileNet V1 model architecture will be used.

- **DATA_DIR** This is the directory where the COCO Dataset is downloaded and stored and the training/eval data TFRecords are created. The default for the Docker container is: /tf/dataset
- **TRAINING_DIR** This is the directory where model checkpoints are stored during training, and TFlite models are stored after conversion. The default for the Docker container is: /tf/training
- **TRAINING_NAME** This is the name of the directory used in the Training Dir for the current model.
- **CLASS_OF_INTEREST** This is the class of object from the COCO dataset that a detector will be built for.
- **IMAGE_SIZE** This is the size of the image that will be used. The MobileNet V1 Architecture can support the following image sizes: 96, 128, 160, 192, 224
- **USE_GRAYSCALE** Whether grayscale or color images should be used. It can be: True or False
- **SMALL_OBJECT_AREA_THRESHOLD** This is the minimum fraction of the overall image area that an object's bounding box must cover for the object to count. The default is 0.005, or 0.5%.

In [None]:
%env DATA_DIR=/tf/dataset
%env TRAINING_DIR=/tf/training
%env TRAINING_NAME=vww_128_color_bicycle_005
%env CLASS_OF_INTEREST=bicycle
%env IMAGE_SIZE=128
%env USE_GRAYSCALE=False
%env SMALL_OBJECT_AREA_THRESHOLD=0.005

### Building the dataset

In order to train a detector model for the class of interest, we need a large collection of images
that are labeled depending on whether or not they have that class in them. The
ImageNet one-thousand class data that's widely used for training image
classifiers doesn't include labels for people, but luckily the
[COCO dataset](http://cocodataset.org/#home) does.

This is a large download, about 40GB, so it will take a while and you'll need
to make sure you have at least 100GB free on your drive to allow space for
unpacking and further processing. 

The dataset is designed to be used for training models for localization, so the
images aren't labeled with the "contains an object", "doesn't contain an object"
categories that we want to train for. Instead each image comes with a list of
bounding boxes for all of the objects it contains. To make sure that objects aren't
too tiny to be recognizable we also need to exclude very small bounding boxes.
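
The labeling rule described above can be sketched in a few lines of Python. This is only an illustration, not the code Slim runs: the `label_image` helper and the simplified `category` key are hypothetical, and the sketch assumes the threshold is compared against the bounding-box area as a fraction of the image area. The one detail taken from COCO itself is the `[x, y, width, height]` box layout.

```python
def label_image(annotations, image_width, image_height,
                class_of_interest, area_threshold=0.005):
    """Return 1 if a large enough box of the class is present, else 0."""
    image_area = image_width * image_height
    for annotation in annotations:
        # COCO stores boxes as [x, y, width, height] in pixels.
        _, _, box_width, box_height = annotation["bbox"]
        fraction = (box_width * box_height) / image_area
        if annotation["category"] == class_of_interest and fraction > area_threshold:
            return 1
    return 0


# Hypothetical annotations for a 640x480 image.
annotations = [
    {"category": "bicycle", "bbox": [10, 20, 30, 40]},    # about 0.39% of the image
    {"category": "person", "bbox": [100, 50, 200, 300]},  # about 19.5% of the image
]
print(label_image(annotations, 640, 480, "bicycle"))  # 0, the box is below the threshold
print(label_image(annotations, 640, 480, "person"))   # 1
```

Raising the threshold produces fewer positive images but makes each positive easier to recognize at low resolution.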

Don't be surprised if this takes up to twenty minutes to complete. When it's
done, you'll have a set of TFRecords in `coco/processed` holding the labeled
image information. This data was created by Aakanksha Chowdhery and is known as
the [Visual Wake Words dataset](https://arxiv.org/abs/1906.05721). It's designed
to be useful for benchmarking and testing embedded computer vision, since it
represents a very common task that we need to accomplish with tight resource
constraints. We're hoping to see it drive even better models for this and
similar tasks.

In [None]:
!python /tf/models/research/slim/download_and_convert_data.py \
    --dataset_name=visualwakewords \
    --dataset_dir="${DATA_DIR}" \
    --foreground_class_of_interest="${CLASS_OF_INTEREST}" \
    --small_object_area_threshold=${SMALL_OBJECT_AREA_THRESHOLD}

## Train Model

One of the nice things about using tf.slim to handle the training is that the parameters you commonly need to modify are available as command line arguments, so we can just call the standard train_image_classifier.py script to train our model.

This will take a couple of days on a single-GPU V100 instance to complete all one million steps, but you should be able to get a fairly accurate model after a few hours if you want to experiment early.

- The checkpoints and summaries will be saved in the folder given in the --train_dir argument, so that's where you'll have to look for the results.

- The --dataset_dir parameter should match the one where you saved the TFRecords from the Visual Wake Words build script.

- The architecture we'll be using is defined by the --model_name argument. The 'mobilenet_v1' prefix tells the script to use the first version of MobileNet. We did experiment with later versions, but these used more RAM for their intermediate activation buffers, so for now we stuck with the original. The '025' is the depth multiplier to use, which mostly affects the number of weight parameters. This low setting ensures the model fits within 250KB of Flash.

- --preprocessing_name controls how input images are modified before they're fed into the model. The 'mobilenet_v1' version shrinks the width and height of the images to the size given in --train_image_size (set from IMAGE_SIZE in this notebook; smaller sizes reduce the compute requirements). It also scales the pixel values from 0 to 255 integers into -1.0 to +1.0 floating point numbers (though we'll be quantizing those after training).

- The --use_grayscale flag configures whether images are converted to grayscale during preprocessing.

- The --learning_rate, --label_smoothing, --learning_rate_decay_factor, --num_epochs_per_decay, --moving_average_decay and --batch_size are all parameters that control how weights are updated during the training process. Training deep networks is still a bit of a dark art, so we found these exact values through experimentation for this particular model. You can try tweaking them to speed up training or gain a small boost in accuracy, but we can't give much guidance for how to make those changes, and it's easy to get combinations where the training accuracy never converges.

- The --max_number_of_steps defines how long the training should continue. There's no good way to figure out this threshold in advance; you have to experiment to see when the accuracy of the model stops improving and cut it off there. In our case we default to a million steps, since with this particular model we know that's a good point to stop.
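
To get a feel for how --learning_rate, --learning_rate_decay_factor, --num_epochs_per_decay and --batch_size interact, here is a minimal sketch of a staircase exponential decay schedule. It assumes the training script applies this kind of schedule (check train_image_classifier.py for the exact rule), and the `samples_per_epoch` value is a made-up placeholder rather than the real size of the dataset.

```python
def decayed_learning_rate(step, base_lr=0.045, decay_factor=0.98,
                          epochs_per_decay=2.5, batch_size=96,
                          samples_per_epoch=100_000):
    """Staircase exponential decay; samples_per_epoch is a hypothetical value."""
    steps_per_epoch = samples_per_epoch / batch_size
    decay_steps = int(steps_per_epoch * epochs_per_decay)
    # The rate drops by decay_factor once every decay_steps training steps.
    return base_lr * decay_factor ** (step // decay_steps)


for step in (0, 10_000, 100_000, 1_000_000):
    print(step, decayed_learning_rate(step))
```

A larger batch size or a smaller --num_epochs_per_decay shortens the interval between drops, so the learning rate shrinks faster for the same number of steps.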


In [None]:
! python /tf/models/research/slim/train_image_classifier.py \
    --train_dir="${TRAINING_DIR}/${TRAINING_NAME}" \
    --dataset_name=visualwakewords \
    --dataset_split_name=train \
    --dataset_dir="${DATA_DIR}"  \
    --model_name=mobilenet_v1_025 \
    --preprocessing_name=mobilenet_v1 \
    --train_image_size=${IMAGE_SIZE} \
    --use_grayscale=${USE_GRAYSCALE} \
    --save_summaries_secs=300 \
    --learning_rate=0.045 \
    --label_smoothing=0.1 \
    --learning_rate_decay_factor=0.98 \
    --num_epochs_per_decay=2.5 \
    --moving_average_decay=0.9999 \
    --batch_size=96 \
    --max_number_of_steps=1000000 

Don't worry about the line duplication in the training output; this is just a side effect of the way TensorFlow log printing interacts with Python. Each line has two key bits of information about the training process. The global step is a count of how far through the training we are. Since we've set the limit as a million steps, in this case we're nearly five percent complete. The steps per second estimate is also useful, since you can use it to estimate a rough duration for the whole training process. In this case, we're completing about four steps a second, so a million steps will take about 70 hours, or three days.

The other crucial piece of information is the loss. This is a measure of how close the partially-trained model's predictions are to the correct values, and lower values are better. The loss is noisy and will bounce around a lot over short time periods, but it should decrease on average during training if the model is learning. If things are working well, you should see a noticeable drop if you wait an hour or so and check back. This kind of variation is a lot easier to see in a graph, which is one of the main reasons to try TensorBoard.
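
The duration estimate is simple arithmetic on the global step and the steps-per-second figure from the log. A small sketch, where `remaining_hours` is a hypothetical helper name:

```python
def remaining_hours(current_step, total_steps, steps_per_second):
    """Estimate the hours left in training from the logged step rate."""
    return (total_steps - current_step) / steps_per_second / 3600.0


# A full run of one million steps at about four steps per second.
print(round(remaining_hours(0, 1_000_000, 4.0), 1))  # 69.4
```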

## Evaluate Model
The loss function correlates with how well your model is training, but it isn't a direct, understandable metric. What we really care about is how often our model detects the object correctly, but to calculate this we need to run a separate script.

The important number here is the accuracy. It shows the proportion of the images that were classified correctly, which is 72% in this case, after converting to a percentage. If you follow the example script, you should expect a fully-trained model to achieve an accuracy of around 84% after one million steps, and show a loss of around 0.4.

### Checkpoint number to use
Enter the step number of the checkpoint you wish to use. Each checkpoint is stored in three separate files that share a common prefix ending in the step number. For example, if you have a checkpoint file called 'model.ckpt-5179.data-00000-of-00001', the prefix is 'model.ckpt-5179' and the value to enter is 5179.
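
If you are not sure which checkpoints exist, the step numbers can be read off the file names in the training directory. The following is a minimal sketch that assumes the `model.ckpt-<step>.<suffix>` naming shown above; `checkpoint_steps` is a hypothetical helper, not part of Slim.

```python
import re


def checkpoint_steps(filenames):
    """Return the sorted, de-duplicated step numbers found in checkpoint file names."""
    pattern = re.compile(r"^model\.ckpt-(\d+)\.")
    matches = (pattern.match(name) for name in filenames)
    return sorted({int(m.group(1)) for m in matches if m})


# Example listing of a training directory.
files = [
    "checkpoint",
    "graph.pbtxt",
    "model.ckpt-5179.data-00000-of-00001",
    "model.ckpt-5179.index",
    "model.ckpt-5179.meta",
    "model.ckpt-1000000.index",
]
print(checkpoint_steps(files))  # [5179, 1000000]
```

In the notebook you could pass it `os.listdir(os.path.join(training_dir, training_name))` and use the last element as CHECKPOINT_NUM.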

In [None]:
%env CHECKPOINT_NUM=1000000

In [None]:
! python /tf/models/research/slim/eval_image_classifier.py \
    --alsologtostderr \
    --checkpoint_path="${TRAINING_DIR}/${TRAINING_NAME}/model.ckpt-${CHECKPOINT_NUM}" \
    --dataset_dir="${DATA_DIR}" \
    --dataset_name=visualwakewords \
    --dataset_split_name=val \
    --model_name=mobilenet_v1_025 \
    --preprocessing_name=mobilenet_v1 \
    --use_grayscale=${USE_GRAYSCALE} \
    --eval_image_size=${IMAGE_SIZE}

## Export the Graph

When the model has trained to an accuracy you're happy with, you'll need to convert the results from the TensorFlow training environment into a form you can run on an embedded device. As we've seen in previous chapters, this can be a complex process, and tf.slim adds a few of its own wrinkles too.

### Exporting to a GraphDef protobuf file

Slim generates the architecture from the model_name every time one of its scripts is run, so for a model to be used outside of Slim it needs to be saved in a common format. We're going to use the GraphDef protobuf serialization format, since that's understood by both Slim and the rest of TensorFlow.

If this succeeds, you should have a new '${TRAINING_NAME}_graph.pb' file in your ${TRAINING_DIR}/${TRAINING_NAME} folder. This contains the layout of the operations in the model, but doesn't yet have any of the weight data.

In [None]:
! python /tf/models/research/slim/export_inference_graph.py \
    --alsologtostderr \
    --dataset_name=visualwakewords \
    --model_name=mobilenet_v1_025 \
    --image_size=${IMAGE_SIZE} \
    --use_grayscale=${USE_GRAYSCALE} \
    --output_file="${TRAINING_DIR}/${TRAINING_NAME}/${TRAINING_NAME}_graph.pb"

## Freeze the weights

The process of storing the trained weights together with the operation graph is known as freezing. This converts all of the variables in the graph to constants, after loading their values from a checkpoint file. The command below uses a checkpoint from the millionth training step, but you can supply any valid checkpoint path. The graph freezing script is stored inside the main tensorflow repository.

In [None]:
! python /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/tools/freeze_graph.py \
--input_graph="${TRAINING_DIR}/${TRAINING_NAME}/${TRAINING_NAME}_graph.pb" \
--input_checkpoint="${TRAINING_DIR}/${TRAINING_NAME}/model.ckpt-${CHECKPOINT_NUM}" \
--input_binary=true --output_graph="${TRAINING_DIR}/${TRAINING_NAME}/${TRAINING_NAME}_frozen.pb" \
--output_node_names=MobilenetV1/Predictions/Reshape_1

In [None]:
# Alternative: run freeze_graph.py from a clone of the TensorFlow r1.15 branch
! git clone -b r1.15 https://github.com/tensorflow/tensorflow
! python /tf/notebooks/tensorflow/tensorflow/python/tools/freeze_graph.py \
--input_graph="${TRAINING_DIR}/${TRAINING_NAME}/${TRAINING_NAME}_graph.pb" \
--input_checkpoint="${TRAINING_DIR}/${TRAINING_NAME}/model.ckpt-${CHECKPOINT_NUM}" \
--input_binary=true --output_graph="${TRAINING_DIR}/${TRAINING_NAME}/${TRAINING_NAME}_frozen.pb" \
--output_node_names=MobilenetV1/Predictions/Reshape_1

## Quantizing and converting to TensorFlow Lite

Quantization is a tricky and involved process, and it's still very much an active area of research, so taking the float graph that we've trained so far and converting it down to eight bit takes quite a bit of code. The majority of the code is preparing example images to feed into the trained network, so that the ranges of the activation layers in typical use can be measured. This is done differently if color or grayscale images are used. We rely on the TFLiteConverter class to handle the quantization and conversion into the TensorFlow Lite flatbuffer file that we need for the inference engine. 

TFLite models are generated with both UInt8 and Int8 inputs and outputs. The [person_detection](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/examples/person_detection) example uses UInt8 while the [person_detection_experimental](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro/examples/person_detection_experimental) uses Int8. 
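
When feeding images to the two variants, the practical difference is the integer range of the input. The sketch below assumes the simple offset of 128 that the BYO-Data Testing cell in this notebook uses to turn UInt8 pixels into Int8 inputs; the helper names are hypothetical.

```python
import numpy as np


def uint8_to_int8(pixels):
    """Shift 0..255 pixel values into the -128..127 range."""
    # Widen to int16 first so the subtraction cannot wrap around.
    return (pixels.astype(np.int16) - 128).astype(np.int8)


def int8_to_uint8(values):
    """Inverse shift, back to 0..255."""
    return (values.astype(np.int16) + 128).astype(np.uint8)


pixels = np.array([0, 127, 128, 255], dtype=np.uint8)
print(uint8_to_int8(pixels))                 # [-128   -1    0  127]
print(int8_to_uint8(uint8_to_int8(pixels)))  # [  0 127 128 255]
```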

Updated based on:
- https://stackoverflow.com/questions/58775848/tflite-cannot-set-tensor-dimension-mismatch-on-model-conversion
- https://github.com/tensorflow/tensorflow/issues/34720
- https://github.com/tensorflow/tensorflow/issues/34720#issuecomment-567652143

In [None]:
import tensorflow as tf
import io
import PIL.Image  # importing the submodule explicitly makes PIL.Image.open available
import numpy as np
import os

image_size = int(os.environ.get('IMAGE_SIZE'))
use_grayscale = os.environ.get('USE_GRAYSCALE')
training_name = os.environ.get('TRAINING_NAME')
training_dir = os.environ.get('TRAINING_DIR')

# generates a subset of the dataset, with grayscale images
def representative_grayscale_dataset_gen():
  image_size = int(os.environ.get('IMAGE_SIZE'))
  data_dir = os.environ.get('DATA_DIR')
  record_iterator = tf.python_io.tf_record_iterator(path=data_dir+"/val.record-00000-of-00010")

  count = 0
  for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)
    image_stream = io.BytesIO(example.features.feature['image/encoded'].bytes_list.value[0])
    image = PIL.Image.open(image_stream)
    image = image.resize((image_size, image_size))
    image = image.convert('L')
    array = np.array(image)
    array = np.expand_dims(array, axis=2)
    array = np.expand_dims(array, axis=0)
    array = ((array / 127.5) - 1.0).astype(np.float32)
    yield([array])
    count += 1
    if count > 300:
        break
        
        
# generates a subset of the dataset, with color images
def representative_color_dataset_gen():
  image_size = int(os.environ.get('IMAGE_SIZE'))
  data_dir = os.environ.get('DATA_DIR')
  record_iterator = tf.python_io.tf_record_iterator(path=data_dir+"/val.record-00000-of-00010")

  count = 0
  for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)
    image_stream = io.BytesIO(example.features.feature['image/encoded'].bytes_list.value[0])
    image = PIL.Image.open(image_stream)
    image = image.resize((image_size, image_size))
    image = image.convert('RGB') 
    array = np.array(image)
    array = np.expand_dims(array, axis=0)
    array = ((array / 127.5) - 1.0).astype(np.float32)
    yield([array])
    count += 1
    if count > 300:
        break
        
converter = tf.lite.TFLiteConverter.from_frozen_graph(training_dir+"/" + training_name + "/"  + training_name + "_frozen.pb",['input'], ['MobilenetV1/Predictions/Reshape_1'])
converter.optimizations = [tf.lite.Optimize.DEFAULT]
if use_grayscale == "True":
    converter.representative_dataset = representative_grayscale_dataset_gen
    print("Using grayscale")
else:
    converter.representative_dataset = representative_color_dataset_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS]
converter.inference_input_type = tf.uint8
converter.inference_output_type =  tf.uint8
tflite_quant_model = converter.convert()
with open(training_dir + "/" + training_name + "/" + training_name + "_quantized-uint8.tflite", "wb") as f:
    f.write(tflite_quant_model)


converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8  
converter.inference_output_type = tf.int8  
tflite_quant_model = converter.convert()
with open(training_dir + "/" + training_name + "/" + training_name + "_quantized-int8.tflite", "wb") as f:
    f.write(tflite_quant_model)




### Converting into a C source file

The converter writes out a file, but most embedded devices don't have a file
system. To access the serialized data from our program, we have to compile it
into the executable and store it in Flash. The easiest way to do that is to
convert the file into a C data array.
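
To show what this conversion produces, here is a rough Python equivalent of `xxd -i`. It is a sketch for illustration only: `to_c_array` and the `model_data` array name are hypothetical, and the exact whitespace of real `xxd` output may differ.

```python
def to_c_array(data, name="model_data", per_line=12):
    """Render bytes as a C array definition plus a length variable."""
    rows = []
    for offset in range(0, len(data), per_line):
        chunk = data[offset:offset + per_line]
        rows.append("  " + ", ".join("0x{:02x}".format(b) for b in chunk))
    body = ",\n".join(rows)
    return ("unsigned char {name}[] = {{\n{body}\n}};\n"
            "unsigned int {name}_len = {length};\n").format(
                name=name, body=body, length=len(data))


# Three example bytes standing in for the contents of a .tflite file.
print(to_c_array(b"\x00\x01\xff", "m"))
```

The generated source can then be compiled into the firmware, so the model bytes live in Flash alongside the program.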

In [None]:
# Save the file as a C source file
!xxd -i "${TRAINING_DIR}/${TRAINING_NAME}/${TRAINING_NAME}_quantized-uint8.tflite" > "${TRAINING_DIR}/${TRAINING_NAME}/${TRAINING_NAME}_quantized_flat-uint8.cc"
!xxd -i "${TRAINING_DIR}/${TRAINING_NAME}/${TRAINING_NAME}_quantized-int8.tflite" > "${TRAINING_DIR}/${TRAINING_NAME}/${TRAINING_NAME}_quantized_flat-int8.cc"

You can now replace the existing person_detect_model_data.cc file with the
version you've trained, and be able to run your own model on embedded devices.

### Convert the model to work with TensorFlow.js
The same trained model can also be used in browser-based web apps, thanks to TensorFlow.js. This step converts the frozen model into a form that can be loaded in JavaScript.


In [None]:
!tensorflowjs_converter \
    --input_format=tf_frozen_model \
    --output_node_names='MobilenetV1/Predictions/Reshape_1' \
    "${TRAINING_DIR}/${TRAINING_NAME}/${TRAINING_NAME}_frozen.pb" \
    "${TRAINING_DIR}/${TRAINING_NAME}/${TRAINING_NAME}_web_model"
!tensorflowjs_converter \
    --input_format=tf_frozen_model \
    --quantize_uint8 \
    --output_node_names='MobilenetV1/Predictions/Reshape_1' \
    "${TRAINING_DIR}/${TRAINING_NAME}/${TRAINING_NAME}_frozen.pb" \
    "${TRAINING_DIR}/${TRAINING_NAME}/${TRAINING_NAME}_web_model_uint8"

### BYO-Data Testing
It is helpful to test against the types of images your TinyML device will be seeing in the real world. This script lets you bring in an arbitrary set of images, run the model against them and then visualize the results. The images should be saved as JPEG using the `.jpg` extension. While this script will resize an image, it may not use the best method possible. If you are using very large images, such as photos from a phone, it might be best to resize them first.

In [None]:
import json
import math
import os

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from PIL import Image, ImageFile

# Let Pillow load image files that are truncated
ImageFile.LOAD_TRUNCATED_IMAGES = True

# Directory containing the .jpg images to test
directory = "dataset/bicycle_capture_resize/test3"

# https://note.nkmk.me/en/python-pillow-square-circle-thumbnail/

# Center crops an image to get the desired width and height
def crop_center(pil_img, crop_width, crop_height):
    img_width, img_height = pil_img.size
    return pil_img.crop(((img_width - crop_width) // 2,
                         (img_height - crop_height) // 2,
                         (img_width + crop_width) // 2,
                         (img_height + crop_height) // 2))

# Makes a square image
def crop_max_square(pil_img):
    return crop_center(pil_img, min(pil_img.size), min(pil_img.size))

image_size = int(os.environ.get('IMAGE_SIZE'))
use_grayscale = os.environ.get('USE_GRAYSCALE')
training_name = os.environ.get('TRAINING_NAME')
training_dir = os.environ.get('TRAINING_DIR')
if(use_grayscale == "True"):
    print("Using grayscale")

# The Int8 version of the TFLite model will be used
tflite_interpreter = tf.lite.Interpreter(model_path=training_dir +"/" + training_name + "/" + training_name + "_quantized-int8.tflite")
tflite_interpreter.allocate_tensors()

input_details = tflite_interpreter.get_input_details()
output_details = tflite_interpreter.get_output_details()

# It looks like it is impossible to batch inference with the TF1.0 TFLite interpreter: https://github.com/tensorflow/tensorflow/issues/38158
# so inference is run on the images one by one instead of as a batch

files = os.listdir(directory)
# The original UInt8 images are collected here for plotting later on
if (use_grayscale == "True"):
    raw_image_batch = np.empty((0, image_size, image_size), np.uint8)
else:
    raw_image_batch = np.empty((0, image_size, image_size, 3), np.uint8)
    
results = []
images_with_object = 0
for f in files:
    if f.endswith(".jpg"):
        if (use_grayscale == "True"):
            im = Image.open(directory+"/"+f).convert('L')
        else:
            # force three channels so grayscale or CMYK JPEGs match the expected shape
            im = Image.open(directory+"/"+f).convert('RGB')

        # the input images are square cropped and resized to match what the model expects
        im_thumb = crop_max_square(im).resize((image_size, image_size), Image.LANCZOS)
        input_data = np.array(im_thumb, dtype=np.uint8)
        input_data = np.expand_dims(input_data, axis=0)
        
        # The UInt8 version of the image is needed for plotting later on
        raw_image_batch = np.append(raw_image_batch, input_data, axis=0)
        
        # The input image data needs to be converted from UInt8 to Int8 for use with the model.
        # Widen to int16 first so the subtraction does not rely on uint8 wraparound.
        input_data = (input_data.astype(np.int16) - 128).astype(np.int8)


        # If it is a grayscale image we need to fill out the last axis of the array because it is expecting that shape
        if (use_grayscale == "True"):
            input_data = np.expand_dims(input_data, axis=3)

        tflite_interpreter.set_tensor(input_details[0]['index'], input_data)
        tflite_interpreter.invoke()
        predictions = tflite_interpreter.get_tensor(output_details[0]['index']).flatten()

        # parse the results for this image
        if (predictions[1] > predictions[0]):
            has_object = True
            images_with_object += 1
        else:
            has_object = False

        results.append({"prediction":has_object})
        

# how many images are there
batch_size = np.size(raw_image_batch,axis=0)

# the outputted plot is going to have 3 columns, so figure out how many rows there should be
rows = math.ceil(batch_size/3)
plt.figure(figsize=(6, rows*2 ))



print("{} total images, {} images were detected with object".format(batch_size, images_with_object))

# plot out each image, along with whether the object was detected in it
for i in range(batch_size):
  ax = plt.subplot( rows,3, i+1)
  if (use_grayscale == "True"):
    plt.imshow(raw_image_batch[i].astype("uint8"),cmap='gray')
  else:
    plt.imshow(raw_image_batch[i].astype("uint8"))

  plt.title(results[i]["prediction"])
  plt.axis("off")