# Post Training Quantization

## Overview

[TensorFlow Lite](https://www.tensorflow.org/lite/) now supports
converting weights to 8-bit precision as part of model conversion from
TensorFlow GraphDefs to TFLite's FlatBuffer format. Weight quantization
achieves a 4x reduction in model size. In addition, TFLite supports
on-the-fly quantization and dequantization of activations to allow for:

1.  Using quantized kernels for faster implementation when available.

2.  Mixing of floating-point kernels with quantized kernels for different parts
    of the graph.

Note that the activations are always stored in floating point. For ops that
support quantized kernels, the activations are quantized to 8 bits of precision
dynamically prior to processing and are dequantized to float precision after
processing. Depending on the model being converted, this can give a speedup over
pure floating-point computation.
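
As a rough illustration of the two ideas above (8-bit weights, and activations quantized on the fly and dequantized afterwards), the NumPy sketch below quantizes a weight matrix once, then quantizes the activations at call time, multiplies in integers, and scales the result back to float. It assumes a symmetric, per-tensor int8 scheme. The helpers `quantize_symmetric` and `hybrid_matmul` are hypothetical names introduced here, not TFLite APIs, and the real TFLite kernels may use a different scheme.

```python
import numpy as np

def quantize_symmetric(x):
    """Map a float array to int8 using one scale for the whole tensor (assumed scheme)."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def hybrid_matmul(activations, q_weights, w_scale):
    """Quantize float activations at call time, multiply in int32, return floats."""
    q_act, a_scale = quantize_symmetric(activations)
    acc = q_act.astype(np.int32) @ q_weights.astype(np.int32)
    # Dequantize: the product of the two scales maps the integer result back to float.
    return acc.astype(np.float32) * (a_scale * w_scale)

rng = np.random.RandomState(0)
weights = rng.randn(64, 10).astype(np.float32)      # stored as float32 before conversion
activations = rng.randn(1, 64).astype(np.float32)   # always kept in float between ops

q_weights, w_scale = quantize_symmetric(weights)    # done once, at conversion time
float_out = activations @ weights
hybrid_out = hybrid_matmul(activations, q_weights, w_scale)

size_ratio = weights.nbytes / q_weights.nbytes
max_abs_error = float(np.max(np.abs(float_out - hybrid_out)))
print("weight size ratio:", size_ratio)
print("max abs error vs. float matmul:", max_abs_error)
```

The size ratio comes directly from storing one byte per weight instead of four. The error term is the quantity that the accuracy checks later in this tutorial are meant to keep an eye on.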

In contrast to
[quantization aware training](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/quantize),
in this method the weights are quantized post-training and the activations are
quantized dynamically at inference.
Therefore, the model weights are not retrained to compensate for
quantization-induced errors. It is important to check the accuracy of the quantized model to
ensure that the degradation is acceptable.

In this tutorial, we train an MNIST model from scratch, check its accuracy in
TensorFlow, and then convert the saved model into a TensorFlow Lite FlatBuffer
with weight quantization. Finally, we check the
accuracy of the converted model and compare it to the original saved model. We
run the training script `mnist.py` from the
[TensorFlow official MNIST tutorial](https://github.com/tensorflow/models/tree/master/official/mnist).


## Building an MNIST model

### Setup

In [1]:
! pip uninstall -y tensorflow
! pip install -U tf-nightly

Cannot uninstall requirement tensorflow, not installed
You are using pip version 9.0.3, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting tf-nightly
  Downloading https://files.pythonhosted.org/packages/a2/a5/162858bb8ee3bbd7320a13242f6d3be0ecd70154a52c3e1c1078615bd045/tf_nightly-1.12.0.dev20180922-cp36-cp36m-manylinux1_x86_64.whl (63.7MB)
    100% |████████████████████████████████| 63.7MB 21kB/s 
Requirement already up-to-date: six>=1.10.0 in /opt/conda/lib/python3.6/site-packages (from tf-nightly)
Collecting tb-nightly<1.12.0a0,>=1.11.0a0 (from tf-nightly)
  Downloading https://files.pythonhosted.org/packages/84/93/fe1018e4449c7f809dd08a730821cd4b79e2ce6188f22ee84f42e20eead3/tb_nightly-1.11.0a20180922-py3-none-any.whl (3.0MB)
    100% |████████████████████████████████| 3.0MB 436kB/s 
Collecting keras-preprocessing>=1.0.3 (from tf-nightly)
  Downloading https://files.pythonhosted

In [2]:
import tensorflow as tf
tf.enable_eager_execution()

In [3]:
! git clone --depth 1 https://github.com/tensorflow/models

Cloning into 'models'...
remote: Enumerating objects: 2915, done.
remote: Counting objects: 100% (2915/2915), done.
remote: Compressing objects: 100% (2530/2530), done.
remote: Total 2915 (delta 500), reused 1743 (delta 313), pack-reused 0
Receiving objects: 100% (2915/2915), 376.37 MiB | 54.09 MiB/s, done.
Resolving deltas: 100% (500/500), done.
Checking connectivity... done.


In [4]:
import sys
import os

if sys.version_info.major >= 3:
    import pathlib
else:
    import pathlib2 as pathlib

# Add `models` to the python path.
models_path = os.path.join(os.getcwd(), "models")
sys.path.append(models_path)

### Train and export the model

In [5]:
saved_models_root = "/tmp/mnist_saved_model"

In [6]:
# The above path addition is not visible to subprocesses, add the path for the subprocess as well.
# Note: channels_last is required here or the conversion may fail. 
!PYTHONPATH={models_path} python models/official/mnist/mnist.py --train_epochs=1 --export_dir {saved_models_root} --data_format=channels_last

2018-09-22 17:22:42.896671: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
I0922 17:22:46.298367 140509021062912 tf_logging.py:115] Initializing RunConfig with distribution strategies.
I0922 17:22:46.298596 140509021062912 tf_logging.py:115] Not using Distribute Coordinator.
I0922 17:22:46.298949 140509021062912 tf_logging.py:115] Using config: {'_model_dir': '/tmp/mnist_model', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.contrib.distribute.python.one_device_strategy.OneDeviceStrategy object at 0x7fca3a0eae10>, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec'

For this example, we trained the model for only a single epoch, so it reaches only ~96% accuracy.



### Convert to a TFLite model

The exported saved model directory is named with a timestamp. Select the most recent one:

In [7]:
saved_model_dir = str(sorted(pathlib.Path(saved_models_root).glob("*"))[-1])
saved_model_dir

'/tmp/mnist_saved_model/1537637031'
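
The selection above relies on `sorted` ordering the directory names lexicographically, which matches chronological order as long as every timestamp has the same number of digits. The sketch below illustrates this with a temporary directory. Only `1537637031` comes from the run above; the other two directory names are made up for the example, and the numeric `max` variant is an alternative suggested here, not something the tutorial uses. It requires Python 3 for `tempfile.TemporaryDirectory`.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Create three export directories; two of the names are hypothetical.
    for name in ["1537637031", "1537630000", "1537635555"]:
        (pathlib.Path(root) / name).mkdir()

    # Same expression as in the tutorial: lexicographic sort, take the last entry.
    latest_name = sorted(pathlib.Path(root).glob("*"))[-1].name

    # Alternative that compares the names as integers, so it does not depend
    # on all timestamps having the same width.
    latest_numeric_name = max(pathlib.Path(root).iterdir(),
                              key=lambda p: int(p.name)).name

print(latest_name)
print(latest_numeric_name)
```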

Using the Python `TocoConverter`, the saved model can be converted into a TFLite model.

First load the model using the `TocoConverter`:

In [8]:
import tensorflow as tf
tf.enable_eager_execution()
converter = tf.contrib.lite.TocoConverter.from_saved_model(saved_model_dir)
tflite_model = converter.convert()

INFO:tensorflow:Restoring parameters from /tmp/mnist_saved_model/1537637031/variables/variables
INFO:tensorflow:The given SavedModel MetaGraphDef contains SignatureDefs with the following keys: {'classify', 'serving_default'}
INFO:tensorflow:input tensors info: 
INFO:tensorflow:Tensor's key in saved_model's tensor_map: image
INFO:tensorflow: tensor name: Placeholder:0, shape: (-1, 28, 28), type: DT_FLOAT
INFO:tensorflow:output tensors info: 
INFO:tensorflow:Tensor's key in saved_model's tensor_map: classes
INFO:tensorflow: tensor name: ArgMax:0, shape: (-1), type: DT_INT64
INFO:tensorflow:Tensor's key in saved_model's tensor_map: probabilities
INFO:tensorflow: tensor name: Softmax:0, shape: (-1, 10), type: DT_FLOAT
INFO:tensorflow:Restoring parameters from /tmp/mnist_saved_model/1537637031/variables/variables
INFO:tensorflow:Froze 8 variables.
INFO:tensorflow:Converted 8 variables to const ops.


Write it out to a tflite file:

In [9]:
tflite_models_dir = pathlib.Path("/tmp/mnist_tflite_models/")
tflite_models_dir.mkdir(exist_ok=True, parents=True)

In [10]:
tflite_model_file = tflite_models_dir/"mnist_model.tflite"
tflite_model_file.write_bytes(tflite_model)

13101280

To quantize the model on export, set the `post_training_quantize` flag:

In [11]:
# Note: If you don't have a recent tf-nightly installed, the
# "post_training_quantize" line will have no effect.
tf.logging.set_verbosity(tf.logging.INFO)
converter.post_training_quantize = True
tflite_quant_model = converter.convert()
tflite_model_quant_file = tflite_models_dir/"mnist_model_quant.tflite"
tflite_model_quant_file.write_bytes(tflite_quant_model)

3283208

Note how the resulting file, with `post_training_quantize` set, is approximately `1/4` the size.

In [12]:
!ls -lh {tflite_models_dir}

total 16M
-rw-r--r-- 1 root root  13M Sep 22 17:24 mnist_model.tflite
-rw-r--r-- 1 root root 3.2M Sep 22 17:24 mnist_model_quant.tflite
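
To make the "approximately 1/4" claim concrete, the byte counts returned by the two `write_bytes` calls above can be compared directly. This is plain arithmetic on the numbers printed in this run; the variable names are chosen here for illustration.

```python
float_bytes = 13101280   # returned by write_bytes for mnist_model.tflite
quant_bytes = 3283208    # returned by write_bytes for mnist_model_quant.tflite

ratio = float_bytes / float(quant_bytes)
percent_of_float = 100.0 * quant_bytes / float_bytes

print("compression ratio: %.2fx" % ratio)
print("quantized file is %.1f%% of the float file" % percent_of_float)
```

The ratio comes out slightly below 4x, presumably because not everything in the file is weight data that shrinks from 32 bits to 8 bits.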


## Run the TFLite models

We can run the TensorFlow Lite model using the Python TensorFlow Lite
Interpreter.

### Load the test data

First, load the MNIST test data to feed to the interpreter:

In [13]:
import numpy as np
mnist_train, mnist_test = tf.keras.datasets.mnist.load_data()
images, labels = tf.to_float(mnist_test[0])/255.0, mnist_test[1]

# Note: If you change the batch size, then use 
# `tf.contrib.lite.Interpreter.resize_tensor_input` to also change it for
# the interpreter.
mnist_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(1)

### Load the model into an interpreter

In [14]:
interpreter = tf.contrib.lite.Interpreter(model_path=str(tflite_model_file))
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

In [15]:
tf.logging.set_verbosity(tf.logging.DEBUG)
interpreter_quant = tf.contrib.lite.Interpreter(model_path=str(tflite_model_quant_file))

In [16]:
interpreter_quant.allocate_tensors()
input_index = interpreter_quant.get_input_details()[0]["index"]
output_index = interpreter_quant.get_output_details()[0]["index"]


### Test the model on one image

In [17]:
for img, label in mnist_ds.take(1):
  break

interpreter.set_tensor(input_index, img)
interpreter.invoke()
predictions = interpreter.get_tensor(output_index)

In [18]:
import matplotlib.pylab as plt

plt.imshow(img[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true= str(label[0].numpy()),
                              predict=str(predictions[0,0])))
plt.grid(False)

### Evaluate the models

In [19]:
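def eval_model(interpreter, mnist_ds):
  # Look up the tensor indices on the interpreter being evaluated, rather than
  # relying on the global `input_index`/`output_index` set in earlier cells.
  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]
  total_seen = 0
  num_correct = 0

  for img, label in mnist_ds:
    total_seen += 1
    interpreter.set_tensor(input_index, img)
    interpreter.invoke()
    predictions = interpreter.get_tensor(output_index)
    # With a batch size of 1, this compares a single prediction to its label.
    if predictions == label.numpy():
      num_correct += 1

    if total_seen % 500 == 0:
      print("Accuracy after %i images: %f" %
            (total_seen, float(num_correct) / float(total_seen)))

  return float(num_correct) / float(total_seen)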
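
The evaluation loop only needs three interpreter calls: `set_tensor`, `invoke`, and `get_tensor`. The sketch below exercises the same loop pattern against a stand-in object so that the counting logic can be checked without a converted model. `FakeInterpreter`, `accuracy`, and the toy data are all hypothetical and introduced only for this illustration; they are not part of TensorFlow Lite.

```python
import numpy as np

class FakeInterpreter(object):
    """Hypothetical stand-in exposing the three calls the evaluation loop uses."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn
        self._input = None
        self._output = None

    def set_tensor(self, index, value):
        self._input = np.asarray(value)

    def invoke(self):
        self._output = self._predict_fn(self._input)

    def get_tensor(self, index):
        return self._output

def accuracy(interpreter, examples, input_index=0, output_index=0):
    """Feed one example at a time and return the fraction predicted correctly."""
    num_correct = 0
    total_seen = 0
    for img, label in examples:
        interpreter.set_tensor(input_index, img)
        interpreter.invoke()
        prediction = interpreter.get_tensor(output_index)
        num_correct += int(np.all(prediction == label))
        total_seen += 1
    return float(num_correct) / float(total_seen)

# Toy "model": predicts class 1 when the mean pixel value exceeds 0.5, else 0.
toy = FakeInterpreter(lambda x: np.array([int(x.mean() > 0.5)]))
examples = [
    (np.full((1, 2, 2), 0.9, dtype=np.float32), np.array([1])),
    (np.full((1, 2, 2), 0.1, dtype=np.float32), np.array([0])),
    (np.full((1, 2, 2), 0.8, dtype=np.float32), np.array([0])),  # deliberately mislabeled
    (np.full((1, 2, 2), 0.2, dtype=np.float32), np.array([0])),
]
toy_accuracy = accuracy(toy, examples)
print(toy_accuracy)
```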

In [20]:
print(eval_model(interpreter, mnist_ds))

Accuracy after 500 images: 0.978000
Accuracy after 1000 images: 0.968000
Accuracy after 1500 images: 0.962000
Accuracy after 2000 images: 0.958500
Accuracy after 2500 images: 0.956000
Accuracy after 3000 images: 0.960000
Accuracy after 3500 images: 0.960857
Accuracy after 4000 images: 0.957500
Accuracy after 4500 images: 0.956889
Accuracy after 5000 images: 0.956000
Accuracy after 5500 images: 0.959455
Accuracy after 6000 images: 0.960000
Accuracy after 6500 images: 0.961538
Accuracy after 7000 images: 0.962429
Accuracy after 7500 images: 0.964267
Accuracy after 8000 images: 0.965750
Accuracy after 8500 images: 0.966824
Accuracy after 9000 images: 0.968556
Accuracy after 9500 images: 0.969579
Accuracy after 10000 images: 0.968900
0.9689


We can repeat the evaluation on the weight-quantized model to obtain:


In [21]:
print(eval_model(interpreter_quant, mnist_ds))


Accuracy after 500 images: 0.978000
Accuracy after 1000 images: 0.968000
Accuracy after 1500 images: 0.962000
Accuracy after 2000 images: 0.958500
Accuracy after 2500 images: 0.956000
Accuracy after 3000 images: 0.960000
Accuracy after 3500 images: 0.960857
Accuracy after 4000 images: 0.957500
Accuracy after 4500 images: 0.956889
Accuracy after 5000 images: 0.956000
Accuracy after 5500 images: 0.959455
Accuracy after 6000 images: 0.960000
Accuracy after 6500 images: 0.961692
Accuracy after 7000 images: 0.962571
Accuracy after 7500 images: 0.964400
Accuracy after 8000 images: 0.965875
Accuracy after 8500 images: 0.966941
Accuracy after 9000 images: 0.968667
Accuracy after 9500 images: 0.969684
Accuracy after 10000 images: 0.969000
0.969



In this example, we have compressed the model with essentially no change in accuracy.
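
The two final accuracies printed above (0.9689 for the float model and 0.969 for the quantized model) can be compared explicitly. The sketch below converts the difference into a number of test images and checks it against a tolerance; `max_drop` is a hypothetical threshold chosen for illustration, and what counts as an acceptable drop depends on the application.

```python
float_accuracy = 0.9689   # final value printed for the float model
quant_accuracy = 0.969    # final value printed for the quantized model
num_images = 10000        # number of test images evaluated above

delta = quant_accuracy - float_accuracy
delta_images = int(round(delta * num_images))

max_drop = 0.005          # hypothetical tolerance for accuracy degradation
acceptable = delta >= -max_drop

print("accuracy change: %+.4f (%+d images)" % (delta, delta_images))
print("within tolerance:", acceptable)
```

In this run the quantized model classifies one more test image correctly than the float model, a difference small enough that it should not be read as an improvement.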



## Optimizing an existing model

We now consider another example. ResNets with pre-activation layers (ResNet-v2) are widely used for vision applications.
A pre-trained frozen graph for resnet-v2-101 is available at the
[TensorFlow Lite model repository](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/lite/g3doc/models.md).

We can convert the frozen graph to a TFLite flatbuffer with quantization by:


In [None]:
archive_path = tf.keras.utils.get_file("resnet_v2_101.tgz", "https://storage.googleapis.com/download.tensorflow.org/models/tflite_11_05_08/resnet_v2_101.tgz", extract=True)
archive_path = pathlib.Path(archive_path)
archive_dir = str(archive_path.parent)

The `info.txt` file lists the input and output names. You can also find them using TensorBoard to visually inspect the graph.

In [None]:
! cat {archive_dir}/resnet_v2_101_299_info.txt

In [None]:
graph_def_file = pathlib.Path(archive_path).parent/"resnet_v2_101_299_frozen.pb"
input_arrays = ["input"] 
output_arrays = ["output"]
converter = tf.contrib.lite.TocoConverter.from_frozen_graph(
  str(graph_def_file), input_arrays, output_arrays, input_shapes={"input":[1,299,299,3]})
converter.post_training_quantize = True
resnet_tflite_file = graph_def_file.parent/"resnet_v2_101_quantized.tflite"
resnet_tflite_file.write_bytes(converter.convert())


In [None]:

!ls -lh {archive_dir}/*.tflite


The model size is reduced from 171 MB to 43 MB.
The accuracy of this model on ImageNet can be evaluated using the scripts provided for [TFLite accuracy measurement](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/lite/tools/accuracy/ilsvrc).

The optimized model's top-1 accuracy is 76.8%, the same as the floating-point model.