# Post Training Quantization

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/lite/tutorials/post_training_quant.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/tutorials/post_training_quant.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

## Overview

[TensorFlow Lite](https://www.tensorflow.org/lite/) now supports
converting weights to 8-bit precision as part of model conversion from
TensorFlow GraphDefs to TFLite's FlatBuffer format. Weight quantization
achieves a 4x reduction in model size. In addition, TFLite supports on-the-fly
quantization and dequantization of activations to allow for:

1.  Using quantized kernels for faster implementation when available.

2.  Mixing of floating-point kernels with quantized kernels for different parts
    of the graph.
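
To make the 4x figure concrete, here is a minimal pure-Python sketch of symmetric 8-bit weight quantization. It is an illustration under stated assumptions, not TFLite's actual implementation: the helper names `quantize_weights` and `dequantize_weights` and the sample values are hypothetical, and the real converter may choose scales differently.

```python
from array import array


def quantize_weights(weights):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = array("b", (int(round(w / scale)) for w in weights))
    return quantized, scale


def dequantize_weights(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


# Hypothetical float32 weights (array type "f" stores 4 bytes per value).
weights = array("f", [0.52, -1.27, 0.003, 0.91, -0.44])
quantized, scale = quantize_weights(weights)
restored = dequantize_weights(quantized, scale)

float_bytes = weights.itemsize * len(weights)     # 4 bytes per float32
int8_bytes = quantized.itemsize * len(quantized)  # 1 byte per int8
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("size ratio:", float_bytes / int8_bytes)
print("max round-trip error:", max_error)
print("rounding bound (scale / 2):", scale / 2)
```

Each weight shrinks from four bytes to one, and the round-trip error of any single weight stays within half a quantization step.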

Note that the activations are always stored in floating point. For ops that
support quantized kernels, the activations are quantized to 8 bits of precision
dynamically prior to processing and are dequantized to float precision after
processing. Depending on the model being converted, this can give a speedup over
pure floating-point computation.
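
The quantize, compute, dequantize cycle described above can be sketched for a single dot product. This is a simplified illustration, assuming symmetric per-tensor scales; the function names and sample values are hypothetical and real TFLite kernels are considerably more elaborate.

```python
def quantize(values):
    """Symmetric 8-bit quantization with a scale derived from the data itself."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [int(round(v / scale)) for v in values], scale


def dynamic_quantized_dot(activations, q_weights, weight_scale):
    """Dot product in the style of a dynamically quantized kernel."""
    # 1. Activations arrive as floats and are quantized on the fly.
    q_acts, act_scale = quantize(activations)
    # 2. The multiply-accumulate runs entirely on integers.
    acc = sum(a * w for a, w in zip(q_acts, q_weights))
    # 3. The integer accumulator is rescaled back to floating point.
    return acc * act_scale * weight_scale


# Hypothetical values for illustration only.
weights = [0.5, -0.25, 0.75, 0.1]
activations = [1.2, 0.4, 0.9, 2.0]

# Weights are quantized once, at conversion time.
q_weights, weight_scale = quantize(weights)

float_result = sum(a * w for a, w in zip(activations, weights))
quant_result = dynamic_quantized_dot(activations, q_weights, weight_scale)
print("float:", float_result)
print("quantized:", quant_result)
```

The extra quantize and dequantize steps are overhead, which is why a speedup depends on how much of the model's time is spent inside ops that have quantized kernels.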

In contrast to
[quantization-aware training](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/quantize),
this method quantizes the weights after training and quantizes the activations
dynamically at inference time.
The model weights are therefore not retrained to compensate for
quantization-induced errors, so it is important to check the accuracy of the
quantized model to ensure that the degradation is acceptable.
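
One way to make that check explicit is to compare the two accuracies against a tolerance. The sketch below is only an illustration: the helper names, the labels, the predictions, and the tolerance values are all hypothetical and not taken from the tutorial's results.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match their labels."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


def degradation_is_acceptable(float_acc, quant_acc, max_drop=0.01):
    """True if quantization costs at most `max_drop` in absolute accuracy."""
    return (float_acc - quant_acc) <= max_drop


# Hypothetical labels and model outputs, for illustration only.
labels = [7, 2, 1, 0, 4, 1, 4, 9]
float_predictions = [7, 2, 1, 0, 4, 1, 4, 4]  # one mistake
quant_predictions = [7, 2, 1, 0, 4, 1, 9, 4]  # two mistakes

float_acc = accuracy(float_predictions, labels)
quant_acc = accuracy(quant_predictions, labels)
print("float accuracy:", float_acc)
print("quantized accuracy:", quant_acc)
print("acceptable at 1% tolerance:", degradation_is_acceptable(float_acc, quant_acc))
```

The right tolerance depends on the application, so `max_drop` is left as a parameter.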

`In this tutorial, we train an MNIST model from scratch, check its accuracy in
tensorflow and then convert the saved model into a Tensorflow Lite flatbuffer
with weight quantization`. We finally check the
accuracy of the converted model and compare it to the original saved model. We
run the training script mnist.py from
[Tensorflow official mnist tutorial](https://github.com/tensorflow/models/tree/master/official/mnist).


## Building an MNIST model

### Setup

In [1]:
# Note: Nightly Builds are now CUDA 10 as of 16-DEC-2018
# But currently, I use CUDA 8. So I use "pip install tf-nightly-gpu==1.13.0.dev20181210"
# https://github.com/tensorflow/tensorflow/issues/22706
# ! pip uninstall -y tensorflow
# ! pip install -U tf-nightly-gpu

In [2]:
import tensorflow as tf
import time
import os

tf.enable_eager_execution()

# Calling tf.enable_eager_execution() makes operations return concrete values
# instead of symbolic tensors, so the code can run without a session.

In [3]:
# Download the official models provided by TensorFlow.
# ! git clone --depth 1 https://github.com/tensorflow/models

In [4]:
import sys
import os

if sys.version_info.major >= 3:
    import pathlib
else:
    import pathlib2 as pathlib

# Add `models` to the python path.
models_path = os.path.join(os.getcwd(), "models")
sys.path.append(models_path)
print("models path:", models_path)

models path: /home/ryanyao/work/my_ml_study/TensorFlow_Study_1112/quantization/models
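
Appending to `sys.path` only affects the running interpreter, which matters once a training script is launched as a separate process. The sketch below illustrates this with the standard library only. It uses a temporary directory as a hypothetical stand-in for the `models` checkout.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for the `models` directory.
extra_path = os.path.realpath(tempfile.mkdtemp())
sys.path.append(extra_path)

# A child interpreter builds its own sys.path, so the append above is not inherited.
check = "import sys; print({!r} in sys.path)".format(extra_path)
without_env = subprocess.run([sys.executable, "-c", check],
                             stdout=subprocess.PIPE, universal_newlines=True)

# Passing PYTHONPATH in the child's environment makes the entry visible there.
env = dict(os.environ, PYTHONPATH=extra_path)
with_env = subprocess.run([sys.executable, "-c", check], env=env,
                          stdout=subprocess.PIPE, universal_newlines=True)

print("visible without PYTHONPATH:", without_env.stdout.strip())
print("visible with PYTHONPATH:", with_env.stdout.strip())
```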


### Train and export the model

In [5]:
saved_models_root = "/tmp/mnist_saved_model"

In [6]:
# The above path addition is not visible to subprocesses, add the path for the subprocess as well.
# Note: channels_last is required here or the conversion may fail. 
!PYTHONPATH={models_path} python models/official/mnist/mnist.py --train_epochs=1 --export_dir {saved_models_root} --data_format=channels_last

2018-12-19 19:12:43.357482: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-12-19 19:12:43.656666: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-19 19:12:43.667056: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55991ace10d0 executing computations on platform CUDA. Devices:
2018-12-19 19:12:43.667076: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1060 3GB, Compute Capability 6.1
2018-12-19 19:12:43.808153: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3392130000 Hz
2018-12-19 19:12:43.808332: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55991ad99ae0 executing computations on platform Host. Devices:
2018-12-19 19:12:43.808354: I

2018-12-19 19:12:53.896360: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.9.0 locally
I1219 19:13:07.367123 140627962443584 basic_session_run_hooks.py:249] cross_entropy = 0.061259326, learning_rate = 1e-04, train_accuracy = 0.97
I1219 19:13:07.367995 140627962443584 basic_session_run_hooks.py:249] loss = 0.061259326, step = 1800
I1219 19:13:08.168468 140627962443584 basic_session_run_hooks.py:680] global_step/sec: 124.625
I1219 19:13:08.168969 140627962443584 basic_session_run_hooks.py:247] cross_entropy = 0.0578548, learning_rate = 1e-04, train_accuracy = 0.98 (0.802 sec)
I1219 19:13:08.169080 140627962443584 basic_session_run_hooks.py:247] loss = 0.0578548, step = 1900 (0.801 sec)
I1219 19:13:08.805134 140627962443584 basic_session_run_hooks.py:680] global_step/sec: 157.067
I1219 19:13:08.805649 140627962443584 basic_session_run_hooks.py:247] cross_entropy = 0.04504158, learning_rate = 1e-04, train_accuracy = 0.98333335 (0.637 sec)
I12

For this example, we train the model for only a single epoch, so it reaches only ~96% accuracy.



### Convert to a TFLite model

The SavedModel export directory is named with a timestamp. Select the most recent one:

In [7]:
saved_model_dir = str(sorted(pathlib.Path(saved_models_root).glob("*"))[-1])
saved_model_dir

'/tmp/mnist_saved_model/1545217992'
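
Sorting the directory names as strings works here because all timestamps have the same number of digits. A slightly more defensive sketch compares them numerically. The temporary directory and the first and third timestamps below are made up for illustration.

```python
import pathlib
import tempfile

# Build a throwaway export root with hypothetical timestamped subdirectories.
root = pathlib.Path(tempfile.mkdtemp())
for stamp in ("1545217000", "1545217992", "1545217500"):
    (root / stamp).mkdir()

# Keep only all-digit directory names and pick the numerically largest one.
export_dirs = [p for p in root.glob("*") if p.is_dir() and p.name.isdigit()]
latest = max(export_dirs, key=lambda p: int(p.name))
print(latest.name)
```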

Using the Python `TFLiteConverter`, the saved model can be converted into a TFLite model.

First load the model using the `TFLiteConverter`:

In [8]:
import tensorflow as tf
tf.enable_eager_execution()
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
tflite_model = converter.convert()

Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.loader.load or tf.compat.v1.saved_model.load. There will be a new function for importing SavedModels in Tensorflow 2.0.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from /tmp/mnist_saved_model/1545217992/variables/variables
INFO:tensorflow:The given SavedModel MetaGraphDef contains SignatureDefs with the following keys: {'serving_default', 'classify'}
INFO:tensorflow:input tensors info: 
INFO:tensorflow:Tensor's key in saved_model's tensor_map: image
INFO:tensorflow: tensor name: Placeholder:0, shape: (-1, 28, 28), type: DT_FLOAT
INFO:tensorflow:output tensors info: 
INFO:tensorflow:Tensor's key in saved_model's tensor_map: probabilities
INFO:tensorflow: tensor name: Softmax:0, shape: (-1, 10), type: DT_FLOAT
INFO:tensorflow:Tensor's key in saved_model's tensor_map: classes
INFO:te

Write it out to a tflite file:

In [9]:
tflite_models_dir = pathlib.Path("/tmp/mnist_tflite_models/")
tflite_models_dir.mkdir(exist_ok=True, parents=True)

In [10]:
tflite_model_file = tflite_models_dir/"mnist_model.tflite"
tflite_model_file.write_bytes(tflite_model)

13101280

To quantize the model on export, set the `post_training_quantize` flag:

In [11]:
# Note: If you don't have a recent tf-nightly installed, the
# "post_training_quantize" line will have no effect.
tf.logging.set_verbosity(tf.logging.INFO)
converter.post_training_quantize = True
tflite_quant_model = converter.convert()
tflite_model_quant_file = tflite_models_dir/"mnist_model_quant.tflite"
tflite_model_quant_file.write_bytes(tflite_quant_model)

3283208

Note how the resulting file, with `post_training_quantize` set, is approximately `1/4` the size.
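
The ratio can be checked directly from the byte counts returned by the two `write_bytes` calls above:

```python
float_model_bytes = 13101280  # returned when writing mnist_model.tflite
quant_model_bytes = 3283208   # returned when writing mnist_model_quant.tflite

ratio = float_model_bytes / quant_model_bytes
saved_mib = (float_model_bytes - quant_model_bytes) / 2**20
print("compression ratio: %.2fx" % ratio)
print("space saved: %.1f MiB" % saved_mib)
```

The ratio comes out slightly below 4, presumably because some of the file's contents (for example graph structure and any tensors left in float) are not reduced by weight quantization.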

In [12]:
!ls -lh {tflite_models_dir}

total 16M
-rw-rw-r-- 1 ryanyao ryanyao 3.2M Dec 19 19:13 mnist_model_quant.tflite
-rw-rw-r-- 1 ryanyao ryanyao  13M Dec 19 19:13 mnist_model.tflite
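
Later TensorFlow releases deprecate the `post_training_quantize` attribute in favor of `converter.optimizations`. The sketch below shows the equivalent conversion, assuming such a release is installed. The function name is hypothetical, and TensorFlow is imported inside the function so that defining it does not require TensorFlow to be present.

```python
def convert_with_weight_quantization(saved_model_dir):
    """Convert a SavedModel to TFLite with post-training weight quantization."""
    # Imported lazily; assumes a TensorFlow release that provides tf.lite.Optimize.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    # Optimize.DEFAULT requests the default post-training optimizations,
    # which include quantizing the weights to 8 bits.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    return converter.convert()
```

The returned bytes can be written out with `write_bytes` exactly as in the cells above.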


## Run the TFLite models

We can run the TensorFlow Lite model using the Python TensorFlow Lite
Interpreter.

### Load the test data

First, let's load the MNIST test data to feed to it:

In [13]:
import numpy as np

# Return: Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.
mnist_train, mnist_test = tf.keras.datasets.mnist.load_data()
images, labels = tf.to_float(mnist_test[0])/255.0, mnist_test[1]

# Note: If you change the batch size, then use 
# `tf.lite.Interpreter.resize_tensor_input` to also change it for
# the interpreter.
mnist_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(1)

Instructions for updating:
Use tf.cast instead.


### Load the model into an interpreter

In [14]:
interpreter = tf.lite.Interpreter(model_path=str(tflite_model_file))
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

In [15]:
tf.logging.set_verbosity(tf.logging.DEBUG)
interpreter_quant = tf.lite.Interpreter(model_path=str(tflite_model_quant_file))

In [16]:
interpreter_quant.allocate_tensors()
input_index = interpreter_quant.get_input_details()[0]["index"]
output_index = interpreter_quant.get_output_details()[0]["index"]


### Test the model on one image

In [17]:
for img, label in mnist_ds.take(1):
  break

interpreter.set_tensor(input_index, img)
interpreter.invoke()
predictions = interpreter.get_tensor(output_index)

Instructions for updating:
Colocations handled automatically by placer.


In [18]:
import matplotlib.pylab as plt

plt.imshow(img[0])
template = "True:{true}, predicted:{predict}"
_ = plt.title(template.format(true= str(label[0].numpy()),
                              predict=str(predictions[0,0])))
plt.grid(False)

### Evaluate the models

In [19]:
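def eval_model(interpreter, mnist_ds):
  # Look up the tensor indices on the interpreter being evaluated instead of
  # relying on the globals set in earlier cells.
  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]
  total_seen = 0
  num_correct = 0

  for img, label in mnist_ds:
    total_seen += 1
    interpreter.set_tensor(input_index, img)
    interpreter.invoke()
    predictions = interpreter.get_tensor(output_index)
    # With a batch size of 1 this compares a single prediction to its label.
    if predictions == label.numpy():
      num_correct += 1

    if total_seen % 500 == 0:
      print("Accuracy after %i images: %f" %
            (total_seen, float(num_correct) / float(total_seen)))

  return float(num_correct) / float(total_seen)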

In [20]:
start_time = time.time()
print(eval_model(interpreter, mnist_ds))
duration = time.time() - start_time
print('%.1f sec' % (duration))

Accuracy after 500 images: 0.992000
Accuracy after 1000 images: 0.985000
Accuracy after 1500 images: 0.982000
Accuracy after 2000 images: 0.982500
Accuracy after 2500 images: 0.981200
Accuracy after 3000 images: 0.982000
Accuracy after 3500 images: 0.983143
Accuracy after 4000 images: 0.983000
Accuracy after 4500 images: 0.982667
Accuracy after 5000 images: 0.982400
Accuracy after 5500 images: 0.984000
Accuracy after 6000 images: 0.984000
Accuracy after 6500 images: 0.985231
Accuracy after 7000 images: 0.985143
Accuracy after 7500 images: 0.986000
Accuracy after 8000 images: 0.986875
Accuracy after 8500 images: 0.987294
Accuracy after 9000 images: 0.988000
Accuracy after 9500 images: 0.988421
Accuracy after 10000 images: 0.987400
0.9874
21.2 sec


We can repeat the evaluation on the weight-quantized model to obtain:


In [21]:
start_time = time.time()
print(eval_model(interpreter_quant, mnist_ds))
duration = time.time() - start_time
print('%.1f sec' % (duration))

Accuracy after 500 images: 0.992000
Accuracy after 1000 images: 0.985000
Accuracy after 1500 images: 0.982000
Accuracy after 2000 images: 0.982500
Accuracy after 2500 images: 0.981200
Accuracy after 3000 images: 0.982000
Accuracy after 3500 images: 0.983143
Accuracy after 4000 images: 0.983000
Accuracy after 4500 images: 0.982667
Accuracy after 5000 images: 0.982400
Accuracy after 5500 images: 0.984000
Accuracy after 6000 images: 0.984000
Accuracy after 6500 images: 0.985231
Accuracy after 7000 images: 0.985143
Accuracy after 7500 images: 0.986000
Accuracy after 8000 images: 0.986875
Accuracy after 8500 images: 0.987294
Accuracy after 9000 images: 0.988000
Accuracy after 9500 images: 0.988421
Accuracy after 10000 images: 0.987400
0.9874
53.8 sec


In this example, we have compressed the model with no difference in accuracy.  
However, the quantized model currently takes longer to run than the unquantized one (unquantized: 21.2 sec, quantized: 53.8 sec). See the following discussion:  
* [Slow quantized graph #2807](https://github.com/tensorflow/tensorflow/issues/2807)  
  * Quantized ops currently only work on the CPU, because most GPUs don't support eight-bit matrix multiplications natively.  
  * If I quantize the graph and run it on iOS (CPU), I too get about 3 times worse performance than running the unquantized version.
  * The quantization is aimed at mobile performance, so most of the optimizations are for ARM not x86. We're hoping to get good quantization on Intel eventually, but we don't have anyone actively working on it yet.
  * We are focusing our eight-bit efforts on TF Lite, so we aren't expecting TensorFlow's quantized performance to improve in cases where it's not currently fast. Close the issue.
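
To put the two wall-clock times on a per-image basis, the arithmetic is:

```python
num_images = 10000
float_seconds = 21.2  # wall time printed for the float model above
quant_seconds = 53.8  # wall time printed for the quantized model above

float_ms = 1000 * float_seconds / num_images
quant_ms = 1000 * quant_seconds / num_images
slowdown = quant_seconds / float_seconds
print("float: %.2f ms/image" % float_ms)
print("quantized: %.2f ms/image" % quant_ms)
print("slowdown: %.2fx" % slowdown)
```

These timings also include the cost of iterating over the dataset in Python, so they are only a rough proxy for the latency of the interpreter itself.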



## Optimizing an existing model

We now consider another example. ResNets with pre-activation layers (ResNet-v2) are widely used for vision applications.
A pre-trained frozen graph for resnet-v2-101 is available at the
[TensorFlow Lite model repository](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/models.md).

We can convert the frozen graph to a TFLite flatbuffer with quantization by:


In [22]:
archive_path = tf.keras.utils.get_file("resnet_v2_101.tgz", "https://storage.googleapis.com/download.tensorflow.org/models/tflite_11_05_08/resnet_v2_101.tgz", extract=True)
archive_path = pathlib.Path(archive_path)
archive_dir = str(archive_path.parent)

The `info.txt` file lists the input and output names. You can also find them using TensorBoard to visually inspect the graph.

In [23]:
! cat {archive_dir}/resnet_v2_101_299_info.txt

Model: resnet_v2_101
Input: input
Output: output


In [24]:
graph_def_file = pathlib.Path(archive_path).parent/"resnet_v2_101_299_frozen.pb"
input_arrays = ["input"] 
output_arrays = ["output"]
converter = tf.lite.TFLiteConverter.from_frozen_graph(
  str(graph_def_file), input_arrays, output_arrays, input_shapes={"input":[1,299,299,3]})
converter.post_training_quantize = True
resnet_tflite_file = graph_def_file.parent/"resnet_v2_101_quantized.tflite"
resnet_tflite_file.write_bytes(converter.convert())


44997240

In [25]:

!ls -lh {archive_dir}/*.tflite

-rw-r--r-- 1 ryanyao ryanyao 171M Sep  6 04:56 /home/ryanyao/.keras/datasets/resnet_v2_101_299.tflite
-rw-rw-r-- 1 ryanyao ryanyao  43M Dec 19 19:15 /home/ryanyao/.keras/datasets/resnet_v2_101_quantized.tflite



The model size is reduced from 171 MB to 43 MB.
The accuracy of this model on ImageNet can be evaluated using the scripts provided for [TFLite accuracy measurement](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/accuracy/ilsvrc).

The optimized model's top-1 accuracy is 76.8%, the same as the floating-point model.