# Convert the Keras h5 model to a quantized TFLite model

This Jupyter notebook demonstrates how to convert a pretrained Keras model into a quantized TensorFlow Lite (TFLite) model, reducing its size and latency on mobile devices. Quantization stores and computes with lower-precision numbers (for example, 8-bit integers instead of 32-bit floats), which shrinks the model, usually with only a small drop in accuracy.
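To make the idea of lower-precision formats concrete, here is a minimal sketch of affine 8-bit quantization applied to a small float array. It is a generic illustration, not the exact scheme TFLite applies internally; the helper names `quantize_int8` and `dequantize` and the sample values are assumptions made for this example.

```python
import numpy as np

def quantize_int8(x):
    """Affine-quantize a float array to int8; returns (q, scale, zero_point)."""
    x = np.asarray(x, dtype=np.float32)
    # Include 0.0 in the range so that zero is exactly representable.
    lo = min(float(x.min()), 0.0)
    hi = max(float(x.max()), 0.0)
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any non-zero scale works
    zero_point = int(round(-128 - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

# Hypothetical weights, chosen only for illustration.
weights = np.array([-1.0, -0.25, 0.0, 0.5, 1.5], dtype=np.float32)
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

print("bytes as float32:", weights.nbytes, "bytes as int8:", q.nbytes)
print("max round-trip error:", float(np.abs(restored - weights).max()))
```

Each value occupies one byte instead of four, and the round-trip error stays within about half a quantization step (`scale / 2`).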

In [1]:
import tensorflow as tf
import os

### Load Keras Model

This cell loads the previously trained Keras model from its h5 file. The loaded model uses full-precision floating-point weights and is the starting point for the conversion.

In [2]:
# Load the previously trained and saved Keras model
model = tf.keras.models.load_model('/models/best_model.h5')

### Create TFLite Converter

A TFLiteConverter is created from the in-memory Keras model. With no further settings, converting at this point would produce a floating-point TFLite model.

In [3]:
# Create a TFLite converter object from the Keras model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
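The size comparison at the end of this notebook reads /models/model.tflite, a float TFLite model that none of the cells shown here writes. One way to produce it, sketched here under assumptions, is to run the conversion before any optimizations are set and save the result. The helper name `convert_and_save` is hypothetical; it only relies on the converter having a `convert()` method that returns bytes, as `tf.lite.TFLiteConverter` does.

```python
import os

def convert_and_save(converter, path):
    """Run converter.convert() and write the resulting bytes to path.

    `converter` is assumed to be a tf.lite.TFLiteConverter (or any object
    whose convert() returns the serialized model as bytes).
    """
    tflite_bytes = converter.convert()
    # Create the target directory if needed ("." when path has no directory).
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        f.write(tflite_bytes)
    return len(tflite_bytes)

# Intended use in this notebook, before converter.optimizations is set:
# convert_and_save(converter, '/models/model.tflite')
```

If the optimizations have already been set on the converter, creating a fresh converter from the Keras model for the float export avoids accidentally saving a quantized model under the float file name.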

### Set Optimization

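This cell enables the TFLite default optimization set, tf.lite.Optimize.DEFAULT. For post-training conversion, this flag turns on quantization of the model weights; when no representative dataset is supplied, the converter applies dynamic range quantization.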

In [4]:
# Set the optimization to use the default optimizations
converter.optimizations = [tf.lite.Optimize.DEFAULT]
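Setting `Optimize.DEFAULT` alone does not give the converter any sample inputs to calibrate activation ranges. If full integer quantization is the goal, the converter typically also needs a representative dataset and an int8 target spec. The sketch below shows one way to set that up; the helper names, the calibration array, and its 32x32x3 shape are hypothetical stand-ins, since the notebook does not show the model's input shape or training data.

```python
import numpy as np

def make_representative_dataset(samples, num_calibration=100):
    """Return a generator function yielding one-sample float32 batches."""
    def representative_dataset():
        for sample in samples[:num_calibration]:
            # Each yielded item is a list with one array per model input,
            # with a leading batch dimension of 1.
            yield [np.expand_dims(sample, axis=0).astype(np.float32)]
    return representative_dataset

def configure_full_integer(converter, representative_dataset):
    """Configure a tf.lite.TFLiteConverter for full integer quantization."""
    import tensorflow as tf  # imported lazily; only needed at conversion time
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    # Restrict the converter to int8 builtin ops and int8 input/output.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter

# Hypothetical calibration data: random arrays standing in for real inputs.
calibration_samples = np.random.rand(8, 32, 32, 3)
rep_ds = make_representative_dataset(calibration_samples, num_calibration=4)

# Intended use with the converter from the cell above:
# converter = configure_full_integer(converter, rep_ds)
```

In practice the calibration samples should come from real, preprocessed inputs (a few hundred is a common choice), because random data yields activation ranges that do not match the deployed model's inputs.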

### Convert Model to Quantized TFLite

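This cell converts the model to a quantized TFLite model. Because only Optimize.DEFAULT is set and no representative dataset is provided, the converter applies dynamic range quantization: weights are stored as 8-bit integers instead of 32-bit floats, while activations remain floating point. Full integer quantization, in which activations are also 8-bit, additionally requires a representative dataset for calibration.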

In [5]:
# Convert the model to a quantized TFLite model
tflite_quant_model = converter.convert()

INFO:tensorflow:Assets written to: /var/folders/pg/k661lbqs4sl4jplc4nftcthr0000gn/T/tmp3nkqhgm_/assets
2024-02-08 16:27:36.977904: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:378] Ignored output_format.
2024-02-08 16:27:36.977919: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:381] Ignored drop_control_dependency.
2024-02-08 16:27:36.978525: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: /var/folders/pg/k661lbqs4sl4jplc4nftcthr0000gn/T/tmp3nkqhgm_
2024-02-08 16:27:36.979743: I tensorflow/cc/saved_model/reader.cc:51] Reading meta graph with tags { serve }
2024-02-08 16:27:36.979749: I tensorflow/cc/saved_model/reader.cc:146] Reading SavedModel debug info (if present) from: /var/folders/pg/k661lbqs4sl4jplc4nftcthr0000gn/T/tmp3nkqhgm_
2024-02-08 16:27:36.981767: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-02-08 16:27:36.982731: I tensorflow/cc/saved_model/load

### Save Quantized Model

The quantized TFLite model is saved to file for later use in applications.

In [6]:
# Save the quantized model to file 
with open('/models/quant_model.tflite', 'wb') as f:
    f.write(tflite_quant_model)
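Before shipping the saved file, it can be useful to confirm that it loads and to check which data types its inputs and outputs expect. The sketch below assumes the `tf.lite.Interpreter` API available in the TensorFlow version used for this notebook; the helper names `summarize_tensor_details` and `inspect_tflite` are hypothetical.

```python
import numpy as np

def summarize_tensor_details(details):
    """Format the dicts returned by get_input_details()/get_output_details()."""
    lines = []
    for d in details:
        dtype_name = np.dtype(d["dtype"]).name
        shape = tuple(int(n) for n in d["shape"])
        lines.append(f"{d['name']}: shape={shape}, dtype={dtype_name}")
    return lines

def inspect_tflite(path):
    """Load a .tflite file and describe its input and output tensors."""
    import tensorflow as tf  # imported lazily; only needed when inspecting
    interpreter = tf.lite.Interpreter(model_path=path)
    interpreter.allocate_tensors()
    return (summarize_tensor_details(interpreter.get_input_details()),
            summarize_tensor_details(interpreter.get_output_details()))

# Intended use with the file saved above:
# inputs, outputs = inspect_tflite('/models/quant_model.tflite')
# print(inputs, outputs)
```

A model quantized with dynamic range quantization normally still reports float32 inputs and outputs; int8 inputs and outputs appear only when the converter was configured for them.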

### Compare Model Sizes

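Finally, the model sizes are printed to show the compression achieved by quantization. In the run recorded below, the quantized model (about 3.54) is almost 4x smaller than the floating-point TFLite model (about 14.14), which is consistent with storing weights in 8 bits instead of 32. Sizes are computed as bytes divided by 2**20. The comparison expects the float TFLite model to already exist at /models/model.tflite.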

In [7]:
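# Status message. With Optimize.DEFAULT and no representative dataset,
# the conversion above performs dynamic range quantization.
print("Full Integer quantization model saved!")

# Print model sizes for comparison (bytes / 2**20).
# '/models/model.tflite' is the unquantized float TFLite model; it is not
# written by the cells above and must already exist on disk.
print("Initial model in Mb:", os.path.getsize('/models/best_model.h5') / float(2**20))
print("Float model in Mb:", os.path.getsize('/models/model.tflite') / float(2**20))
print("Quantized model in Mb:", os.path.getsize('/models/quant_model.tflite') / float(2**20))

# Print compression ratio between float and quantized model
print("Compression ratio:", os.path.getsize('/models/model.tflite')/os.path.getsize('/models/quant_model.tflite'))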

Full Integer quantization model saved!
Initial model in Mb: 42.47442626953125
Float model in Mb: 14.140071868896484
Quantized model in Mb: 3.5412826538085938
Compression ratio: 3.99292382201942
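The size comparison can also be wrapped in small helpers that report a missing file instead of raising an error. This is a minimal sketch; the function names `size_mib`, `compression_ratio`, and `report_sizes` are assumptions for illustration, and the paths in the usage comment are the ones used in this notebook.

```python
import os

def size_mib(path):
    """Return the file size in MiB (bytes / 2**20)."""
    return os.path.getsize(path) / 2**20

def compression_ratio(original_path, compressed_path):
    """Return how many times smaller the compressed file is."""
    return os.path.getsize(original_path) / os.path.getsize(compressed_path)

def report_sizes(paths):
    """Print one line per labeled path, flagging files that do not exist."""
    for label, path in paths.items():
        if os.path.exists(path):
            print(f"{label}: {size_mib(path):.2f} MiB")
        else:
            print(f"{label}: missing ({path})")

# Intended use with the files from this notebook:
# report_sizes({
#     "Initial model": '/models/best_model.h5',
#     "Float model": '/models/model.tflite',
#     "Quantized model": '/models/quant_model.tflite',
# })
# print(compression_ratio('/models/model.tflite', '/models/quant_model.tflite'))
```

Dividing the two sizes printed above, 14.140071868896484 by 3.5412826538085938, gives the recorded ratio of roughly 3.99.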
