# Introduction to TensorFlow Lite

TensorFlow Lite is a set of tools for running TensorFlow models on edge devices. It has two main components:

- TensorFlow Lite Converter: This converts TensorFlow models into a compact, efficient format for use on memory-constrained devices. It can reduce the model's size and make it run faster on edge devices.

- TensorFlow Lite Interpreter: This runs TensorFlow Lite models efficiently on the target device.

We will convert our previously saved models using the TensorFlow Lite Converter's Python API. We will also apply quantization to a model, one of the best-known optimizations (and one that is required to run our model on a Coral device, for example).

You can find more information in the official [TensorFlow Lite documentation](https://www.tensorflow.org/api_docs/python/tf/lite).

In [None]:
import os
import time

import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
from IPython.display import display  # to display the results table
from tqdm.notebook import tqdm 

## About quantization

By default, the model parameters are stored as 32-bit floating-point numbers. Quantization is a technique that reduces the precision of these parameters to 8-bit integers. This shrinks the parameters to roughly a quarter of their original size and usually speeds up inference as well, because integer arithmetic is cheaper than floating-point arithmetic on most edge hardware.
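The mapping between floats and 8-bit integers can be sketched in a few lines. The snippet below is a minimal illustration in plain Python, assuming the affine scheme `real = scale * (q - zero_point)` that the dequantization code later in this notebook also uses. The helper names `quantize` and `dequantize`, the sample values, and the parameter count are made up for the example and are not taken from `model_4`.

```python
def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers in [qmin, qmax]: q = round(x / scale) + zero_point."""
    return [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point):
    """Recover approximate floats: x ~ scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in qvalues]


# Illustrative parameter values (hypothetical, not taken from model_4).
weights = [-0.5, 0.0, 0.3, 1.0]

# Spread the observed range [-0.5, 1.0] over the 256 levels of a uint8.
scale = (max(weights) - min(weights)) / 255
zero_point = round(-min(weights) / scale)  # the integer that represents 0.0

q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)

print(q)
print(restored)

# Storage needed by a hypothetical model with 100,000 parameters.
n_params = 100_000
print(n_params * 4, "bytes as float32 vs", n_params * 1, "bytes as uint8")
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the precision traded for the smaller storage.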

The quantization process requires a representative dataset that covers the full range of input values. The converter uses it to calibrate the dynamic range of the quantization levels.

<center><img src="assets/quantization.png" alt="quantization" width="600"/></center>
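To see why the representative dataset has to cover the full input range, here is a simplified sketch of min/max calibration in plain Python. It assumes a quantized integer `q` stands for the real value `scale * (q - zero_point)`. The functions `calibrate` and `quantize_roundtrip` and the sample lists are hypothetical names and values chosen for illustration, and the converter's actual calibration logic is more involved than this.

```python
def calibrate(samples, qmin=0, qmax=255):
    """Derive (scale, zero_point) from the min/max of representative samples."""
    lo = min(0.0, min(samples))  # keep 0.0 inside the representable range
    hi = max(0.0, max(samples))
    if hi == lo:
        raise ValueError("samples must span a non-zero range")
    scale = (hi - lo) / (qmax - qmin)
    zero_point = qmin + round(-lo / scale)
    return scale, zero_point


def quantize_roundtrip(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize x, clamp it to the integer range, and map it back to a float."""
    q = min(qmax, max(qmin, round(x / scale) + zero_point))
    return scale * (q - zero_point)


full_range = [0.0, 0.2, 0.7, 1.0]  # covers every value the model will see
too_narrow = [0.0, 0.1, 0.3, 0.5]  # misses the brightest pixels

for name, samples in [("full", full_range), ("narrow", too_narrow)]:
    scale, zero_point = calibrate(samples)
    # A pixel value of 1.0 survives only if the calibration data reached it.
    print(name, quantize_roundtrip(1.0, scale, zero_point))
```

With the narrow calibration set, the input `1.0` is clamped to the largest representable value (about `0.5`), so information is lost before the model even runs.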

To use our model with TensorFlow Lite, we need to convert it first. To do this, we will use the **TensorFlow Lite Converter's Python API**. This API writes the model as a **FlatBuffer**, a space-efficient serialization format. It can also apply several optimizations to the model to reduce its size, its inference time, or both.

These optimizations can reduce the accuracy of the model, but the loss is usually very small, so applying them is generally preferred. To measure the effect of quantization, we will deploy two models: one without quantization and one with it.

In [None]:
# We need the MNIST dataset again to perform the quantization optimization.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train / 255.0
x_test = x_test / 255.0
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)

In [None]:
# Define a generator function that provides our training data's X values as a
# representative dataset; the converter will be told to use it later
def representative_dataset_generator():
    for value in x_train:
        yield [value.reshape(1, 28, 28, 1).astype(np.float32)]
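Calibration usually does not need the whole training set; a few hundred samples are often enough, and a smaller set makes conversion faster. The sketch below shows one way to cap the number of samples with `itertools.islice`. The factory name `make_representative_dataset` and the default `limit=200` are assumptions made for this example, not part of the TensorFlow API.

```python
from itertools import islice


def make_representative_dataset(samples, limit=200):
    """Return a zero-argument generator function that yields at most `limit` samples.

    Each sample is wrapped in a list, one entry per model input, which is the
    shape of data the generator defined above also produces.
    """
    def generator():
        for sample in islice(samples, limit):
            yield [sample]

    return generator


# Stand-in data for illustration; real samples would be image arrays.
demo = make_representative_dataset(range(1000), limit=3)
print(list(demo()))
```

In this notebook, each sample would still need the reshape to `(1, 28, 28, 1)` and the cast to `float32` shown in the generator above before being yielded.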

In [None]:
# Load the target baseline model
model_4 = keras.models.load_model("Models/model_4.h5")

## Model conversion 1

In [None]:
# Convert the model to the TensorFlow Lite format without quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model_4)
# Convert the model
tflite_model = converter.convert()
# Save the model to disk
with open("Models/converted_model_4.tflite", "wb") as f:
    f.write(tflite_model)

## Model conversion 2

In [None]:
converter = tf.lite.TFLiteConverter.from_keras_model(model_4)
# This enables quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# This sets the representative dataset for quantization
converter.representative_dataset = representative_dataset_generator
# This ensures that if any ops can't be quantized, the converter throws an error
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# For full integer quantization, though supported types defaults to int8 only, we explicitly declare it for clarity.
converter.target_spec.supported_types = [tf.int8]
# These set the input and output tensors to uint8 (added in r2.3)
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
# Convert the model
tflite_quant_model = converter.convert()

# Save the model to disk
with open("Models/converted_quant_model_4.tflite", "wb") as f:
    f.write(tflite_quant_model)

## Models comparison

In [None]:
def compute_accuracy(predictions:np.ndarray, ground_truth:np.ndarray):

    assert len(predictions) == len(ground_truth)
    
    acc = 0

    for i in range(len(predictions)):
        if np.argmax(predictions[i]) == ground_truth[i]:
            acc += 1

    return acc/len(predictions)

### Baseline model evaluation

In [None]:
elapsed = 0
predictions_norm = []

for x_value in tqdm(x_test):

    x_value = x_value.reshape(1, 28, 28, 1)
    tic = time.perf_counter()
    prediction = model_4.predict(x_value)
    elapsed += time.perf_counter() - tic
    predictions_norm.append(prediction)

tf_time = elapsed/len(x_test)
tf_acc = compute_accuracy(predictions=predictions_norm, ground_truth=y_test)
tf_size = os.path.getsize("Models/model_4.h5")

print("TF elapsed time/inference(s): ", tf_time)
print("TF acc: ", tf_acc)
print("TF model size (bytes): ", tf_size)

### Converted model evaluation

In [None]:
# Instantiate an interpreter
tflite_model = tf.lite.Interpreter("Models/converted_model_4.tflite")
# Allocate memory for the model's tensors
tflite_model.allocate_tensors()
# Get the indexes of the input and output tensors
tflite_model_input_index = tflite_model.get_input_details()[0]['index']
tflite_model_output_index = tflite_model.get_output_details()[0]['index']
# Create arrays to store the results
predictions_tflite = []
elapsed = 0

for x_value in tqdm(x_test):

    x_value = x_value.reshape(1, 28, 28, 1).astype(np.float32)
    # Create a tensor wrapping the current x value
    x_value_tensor = tf.convert_to_tensor(x_value, dtype = np.float32)
    # Write the value to the input tensor
    tflite_model.set_tensor(tflite_model_input_index, x_value_tensor)
    # Run inference
    tic = time.perf_counter()
    tflite_model.invoke()
    elapsed += time.perf_counter() - tic
    # Read the prediction and store it
    predictions_tflite.append(tflite_model.get_tensor(tflite_model_output_index)[0])

tflite_time = elapsed/len(x_test)
tflite_acc = compute_accuracy(predictions=predictions_tflite, ground_truth=y_test)
tflite_size = os.path.getsize("Models/converted_model_4.tflite")

print("TFLite elapsed time/inference(s): ", tflite_time)
print("TFLite acc: ", tflite_acc)
print("TFLite model size (bytes): ", tflite_size)

### Converted and quantized model evaluation

In [None]:
def set_input_tensor(interpreter, input):
  input_details = interpreter.get_input_details()[0]
  tensor_index = input_details['index']
  input_tensor = interpreter.tensor(tensor_index)()[0]
  # The quantized model expects uint8 inputs, so we quantize our input data.
  # NOTE: This step is necessary only because we rescaled the MNIST images to
  # float [0, 1] when loading them. Raw uint8 pixel values in [0, 255] could
  # typically be written directly instead:
  #   input_tensor[:, :] = input
  scale, zero_point = input_details['quantization']
  input_tensor[:, :] = np.uint8(input / scale + zero_point)

def classify_image(interpreter, input):
  set_input_tensor(interpreter, input)

  tic = time.perf_counter()
  interpreter.invoke()
  elapsed = time.perf_counter() - tic

  # Read the output tensor
  output_details = interpreter.get_output_details()[0]
  output = interpreter.get_tensor(output_details['index'])
  # Outputs from the TFLite model are uint8, so we dequantize the results.
  # Cast to float first so that subtracting the zero point cannot wrap around.
  scale, zero_point = output_details['quantization']
  output = scale * (output.astype(np.float32) - zero_point)

  return elapsed, output

# Instantiate an interpreter
quant_model = tf.lite.Interpreter("Models/converted_quant_model_4.tflite")
# Allocate memory for the model's tensors
quant_model.allocate_tensors()

# Create arrays to store the results
predictions_quant = []
elapsed = 0

for x_value in tqdm(x_test):

    pred_time, pred = classify_image(quant_model, x_value.reshape(1,28,28,1))
    predictions_quant.append(pred)
    elapsed += pred_time

tflite_quant_time = elapsed/len(x_test)
tflite_quant_acc = compute_accuracy(predictions=predictions_quant, ground_truth=y_test)
tflite_quant_size = os.path.getsize("Models/converted_quant_model_4.tflite")

print("TFLite quant elapsed time/inference(s): ", tflite_quant_time)
print("TFLite quant acc: ", tflite_quant_acc)
print("TFLite quant model size (bytes): ", tflite_quant_size)

## Final results

In [None]:
results = {
    "name" : ["regular TF", "TF lite", "TF lite quant"],
    "time/inference(s)" : [tf_time, tflite_time, tflite_quant_time],
    "size(bytes)" : [tf_size, tflite_size, tflite_quant_size],
    "acc" : [tf_acc, tflite_acc, tflite_quant_acc]   
}

results_df = pd.DataFrame(data=results)

In [None]:
display(results_df)
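The table is easier to read once each model is expressed relative to the baseline. The helper below is a small sketch that works on a dictionary laid out like `results` above; the function name `ratios_vs_baseline` and the output keys `speedup` and `size_reduction` are assumptions introduced for this example.

```python
def ratios_vs_baseline(results, baseline_index=0):
    """Return each model's speedup and size reduction relative to one baseline row."""
    times = results["time/inference(s)"]
    sizes = results["size(bytes)"]
    return {
        "name": list(results["name"]),
        # Values above 1.0 mean the model is faster than the baseline.
        "speedup": [times[baseline_index] / t for t in times],
        # Values above 1.0 mean the model is smaller than the baseline.
        "size_reduction": [sizes[baseline_index] / s for s in sizes],
    }
```

Calling `pd.DataFrame(ratios_vs_baseline(results))` in this notebook would show how many times faster and smaller each converted model is compared with the regular TensorFlow model.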