# Edge AI - Lecture 1 (Hands-on)
# TAIA - Advanced Topics on Artificial Intelligence
# Tiago Filipe Sousa Gonçalves
# tiago.f.goncalves@inesctec.pt | tiagofs@fe.up.pt


# Contents


## 1.   Model Quantisation in TensorFlow & Keras
## 2.   Model Pruning in TensorFlow & Keras
## 3.   What if we want to jointly use pruning and quantisation?
## 4.   More tutorials, exercises and readings

# Model Quantisation in TensorFlow & Keras
Theory:
https://www.tensorflow.org/model_optimization/guide/quantization/training?hl=en

Code and Exercises adapted from:
https://www.tensorflow.org/model_optimization/guide/quantization/training_example?hl=en

## Setup
Let's start with the setup of our development environment:

In [None]:
# Install libraries
!pip install -q tensorflow
!pip install -q tensorflow-model-optimization

In [None]:
# Imports
import tempfile
import os
import numpy as np

# TensorFlow and Keras Imports
import tensorflow as tf
from tensorflow import keras

# We need to import the optimisation library for TensorFlow & Keras
import tensorflow_model_optimization as tfmot

In [None]:
# Helper Function: Evaluate a TensorFlow Lite model on the test dataset (we will need this later)
def evaluate_model(interpreter):
  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]

  # Run predictions on every image in the "test" dataset.
  prediction_digits = []
  for i, test_image in enumerate(test_images):
    
    # Status prints (uncomment if you need)
    # if i % 1000 == 0:
      # print(f'Evaluated on {i} results so far.')
    
    # Pre-processing: add batch dimension and convert to float32 to match with the model's input data format.
    test_image = np.expand_dims(test_image, axis=0).astype(np.float32)
    interpreter.set_tensor(input_index, test_image)

    # Run inference.
    interpreter.invoke()

    # Post-processing: remove batch dimension and find the digit with highest
    # probability.
    output = interpreter.tensor(output_index)
    digit = np.argmax(output()[0])
    prediction_digits.append(digit)


  # Status prints (uncomment if you need)
  # print('\n')
  
  # Compare prediction results with ground truth labels to calculate accuracy
  prediction_digits = np.array(prediction_digits)
  accuracy = (prediction_digits == test_labels).mean()
  
  
  return accuracy

## MNIST
We will use the MNIST example to understand the quantisation pipeline in TensorFlow & Keras.

We will implement a small neural network (**without quantisation**) and train it on the MNIST dataset:

In [None]:
# Load MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()


# Normalize the input images so that each pixel value is between 0 and 1
train_images = train_images / 255.0
test_images = test_images / 255.0


# Define the model architecture
model = keras.Sequential([
  keras.layers.InputLayer(input_shape=(28, 28)),
  keras.layers.Reshape(target_shape=(28, 28, 1)),
  keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation='relu'),
  keras.layers.MaxPooling2D(pool_size=(2, 2)),
  keras.layers.Flatten(),
  keras.layers.Dense(10)
])


# Train the digit classification model
# We have to compile the model first (TensorFlow & Keras typical procedure)
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])

# Let's print the model summary (the structure of our model)
model.summary()

# Fit the model to data
model.fit(train_images, train_labels, epochs=1, validation_split=0.1)

Let's now build the same small neural network, **but now with quantisation**:

In [None]:
# We create an object that will allow us to apply quantisation to our previous model
quantize_model = tfmot.quantization.keras.quantize_model

# quantize_model wraps the layers so that training simulates quantisation (quantisation-aware training)
quant_model = quantize_model(model)

# We must recompile this model (TensorFlow & Keras typical procedure)
quant_model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])

# Let's print the model summary (the structure of our model)
# Please notice that all layers are now prefixed by "quant"
quant_model.summary()

# Fit the quantised model to data
quant_model.fit(train_images, train_labels, epochs=1, validation_split=0.1)

Let's compare the accuracy of both models and check if there is any difference:

In [None]:
# Evaluate the accuracy of the model without quantisation
_, baseline_model_accuracy = model.evaluate(test_images, test_labels, verbose=0)

# Evaluate the accuracy of the model with quantisation
_, quant_model_accuracy = quant_model.evaluate(test_images, test_labels, verbose=0)

# Print these values in the Terminal
print('Model with no quantisation Test Accuracy:', baseline_model_accuracy)
print('Model with quantisation Test Accuracy:', quant_model_accuracy)

Let's go a little bit further and convert the quantisation-aware model into a TensorFlow Lite model, which is where the weights actually get stored with reduced precision:

In [None]:
# We need to create an object that allows us to convert the quantised model from TensorFlow & Keras
converter = tf.lite.TFLiteConverter.from_keras_model(quant_model)

# We use the "default" optimisation settings
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Let's now convert our model from TensorFlow & Keras to TensorFlow Lite
quantized_tflite_model = converter.convert()
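
To build some intuition for what the converter does to the weights, here is a minimal sketch of affine 8-bit quantisation in plain Python. It is a generic illustration under simplifying assumptions (one scale and one zero-point for the whole list); TensorFlow Lite's actual scheme differs in its details, and the function names `quantize_int8` and `dequantize_int8` are hypothetical helpers, not part of any library.

```python
def quantize_int8(values):
    """Map floats to int8 codes with an affine (scale + zero-point) scheme."""
    # The represented range must include 0 so that 0.0 maps to an exact code.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_int8(codes, scale, zero_point):
    """Recover approximate floats from the int8 codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-0.51, -0.1, 0.0, 0.27, 0.76]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", codes)
print("restored  :", restored)
print("max error :", max_error, "(scale is", scale, ")")
```

Each weight now needs one byte instead of four, at the cost of a rounding error that stays within about one quantisation step (`scale`).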

Let's check if accuracy persists for the sake of sanity:

In [None]:
# We define an interpreter (it is an object that enables us to perform inference)
interpreter = tf.lite.Interpreter(model_content=quantized_tflite_model)
interpreter.allocate_tensors()

# Let's use our helper function (remember it?) to evaluate accuracy on this object
test_accuracy = evaluate_model(interpreter)

# Print accuracy values in the Terminal
print('Quantised TensorFlow Lite model Test Accuracy:', test_accuracy)
print('Quantised TensorFlow model Test Accuracy:', quant_model_accuracy)

Let's create a float TensorFlow Lite model and check whether the quantised model is indeed smaller than the model with no quantisation:

In [None]:
# Create float TensorFlow Lite model (from the model with no quantisation)
float_converter = tf.lite.TFLiteConverter.from_keras_model(model)
float_tflite_model = float_converter.convert()


# Measure sizes of models
# Create temporary files with the contents of both the float and quantised models
# Float File
_, float_file = tempfile.mkstemp('.tflite')

# Quantised File
_, quant_file = tempfile.mkstemp('.tflite')


# Write these files to disk
# Float File
with open(float_file, 'wb') as f:
  f.write(float_tflite_model)

# Quantised File
with open(quant_file, 'wb') as f:
  f.write(quantized_tflite_model)


# Print the sizes of these models in the Terminal
print("Float model in MB:", os.path.getsize(float_file) / float(2**20))
print("Quantised model in MB:", os.path.getsize(quant_file) / float(2**20))
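
As a sanity check on the sizes printed above, we can estimate them by hand from the architecture. The sketch below simply counts parameters for the layers defined earlier and assumes 4 bytes per float32 weight and 1 byte per 8-bit weight; the real files also contain the graph structure and metadata, and not every tensor is necessarily stored in 8 bits, so the measured ratio is usually a bit below 4x.

```python
# Conv2D: 12 filters of size 3x3 over 1 input channel, plus 12 biases
conv_params = 3 * 3 * 1 * 12 + 12

# A 3x3 convolution without padding turns 28x28 into 26x26; 2x2 pooling halves it to 13x13
pooled_side = (28 - 3 + 1) // 2
dense_inputs = pooled_side * pooled_side * 12

# Dense: one weight per (input, output) pair, plus 10 biases
dense_params = dense_inputs * 10 + 10

total_params = conv_params + dense_params

float_mb = total_params * 4 / 2**20  # 4 bytes per float32 parameter
int8_mb = total_params * 1 / 2**20   # 1 byte per 8-bit parameter

print("Total parameters:", total_params)
print("Estimated float32 weights in MB:", round(float_mb, 4))
print("Estimated 8-bit weights in MB:", round(int8_mb, 4))
```

The parameter total should agree with the "Total params" line printed by `model.summary()`.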

## Fashion-MNIST
Let's see if we can apply our knowledge to a more practical example, using the Fashion-MNIST dataset.

We will implement a small neural network (**without quantisation**) and train it on the Fashion-MNIST dataset:

In [None]:
# Load Fashion-MNIST dataset
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()


# Normalize the input images so that each pixel value is between 0 and 1
train_images = train_images / 255.0
test_images = test_images / 255.0


# Define the model architecture
model = keras.Sequential([
  keras.layers.InputLayer(input_shape=(28, 28)),
  keras.layers.Reshape(target_shape=(28, 28, 1)),
  keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation='relu'),
  keras.layers.MaxPooling2D(pool_size=(2, 2)),
  keras.layers.Flatten(),
  keras.layers.Dense(10)
])


# Train the clothing classification model
# We have to compile the model first (TensorFlow & Keras typical procedure)
# Your code here

# Let's print the model summary (the structure of our model)
# Your code here

# Fit the model to data
# Your code here

Let's now build the same small neural network, **but now with quantisation**:

In [None]:
# We create an object that will allow us to apply quantisation to our previous model
quantize_model = # Your code here

# quantize_model wraps the layers so that training simulates quantisation (quantisation-aware training)
quant_model = # Your code here

# We must recompile this model (TensorFlow & Keras typical procedure)
# Your code here

# Let's print the model summary (the structure of our model)
# Please notice that all layers are now prefixed by "quant"
# Your code here

# Fit the quantised model to data
# Your code here

Let's compare the accuracy of both models and check if there is any difference:

In [None]:
# Evaluate the accuracy of the model without quantisation
_, baseline_model_accuracy = model.evaluate(test_images, test_labels, verbose=0)

# Evaluate the accuracy of the model with quantisation
_, quant_model_accuracy = quant_model.evaluate(test_images, test_labels, verbose=0)

# Print these values in the Terminal
print('Model with no quantisation Test Accuracy:', baseline_model_accuracy)
print('Model with quantisation Test Accuracy:', quant_model_accuracy)

Let's go a little bit further and convert the quantisation-aware model into a TensorFlow Lite model, which is where the weights actually get stored with reduced precision:

In [None]:
# We need to create an object that allows us to convert the quantised model from TensorFlow & Keras
converter = # Your code here

# We use the "default" optimisation settings
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Let's now convert our model from TensorFlow & Keras to TensorFlow Lite
quantized_tflite_model = # Your code here

Let's check if accuracy persists for the sake of sanity:

In [None]:
# We define an interpreter (it is an object that enables us to perform inference)
interpreter = # Your code here
interpreter.allocate_tensors()

# Let's use our helper function (remember it?) to evaluate accuracy on this object
test_accuracy = evaluate_model(interpreter)

# Print accuracy values in the Terminal
print('Quantised TensorFlow Lite model Test Accuracy:', test_accuracy)
print('Quantised TensorFlow model Test Accuracy:', quant_model_accuracy)

Let's create a float TensorFlow Lite model and check whether the quantised model is indeed smaller than the model with no quantisation (results should be similar to the MNIST example since we did not change the architecture of the model):

In [None]:
# Create float TensorFlow Lite model (from the model with no quantisation)
float_converter = tf.lite.TFLiteConverter.from_keras_model(model)
float_tflite_model = float_converter.convert()


# Measure sizes of models
# Create temporary files with the contents of both the float and quantised models
# Float File
_, float_file = tempfile.mkstemp('.tflite')

# Quantised File
_, quant_file = tempfile.mkstemp('.tflite')


# Write these files to disk
# Float File
with open(float_file, 'wb') as f:
  f.write(float_tflite_model)

# Quantised File
with open(quant_file, 'wb') as f:
  f.write(quantized_tflite_model)


# Print the sizes of these models in the Terminal
print("Float model in MB:", os.path.getsize(float_file) / float(2**20))
print("Quantised model in MB:", os.path.getsize(quant_file) / float(2**20))

# Model Pruning in TensorFlow & Keras

Theory:
https://www.tensorflow.org/model_optimization/guide/pruning?hl=en

Code and Exercises adapted from:
https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras?hl=en

## Setup
Let's start with the setup of our development environment:

In [None]:
# Install libraries
!pip install -q tensorflow-model-optimization

In [None]:
# Imports
import tempfile
import os
import numpy as np

# TensorFlow & Keras Imports
import tensorflow as tf
from tensorflow import keras

# TensorFlow & Keras Optimisation Libraries
import tensorflow_model_optimization as tfmot

%load_ext tensorboard

In [None]:
# Helper Function: Compress a model file (ZIP archive with DEFLATE, the algorithm gzip also uses) and measure the compressed size
def get_gzipped_model_size(file):
  # Returns size of gzipped model, in bytes
  import os
  import zipfile

  _, zipped_file = tempfile.mkstemp('.zip')
  with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f:
    f.write(file)


  return os.path.getsize(zipped_file)

In [None]:
# Helper Function: Evaluate the TensorFlow Lite model on the test dataset
def evaluate_model(interpreter):
  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]

  # Run predictions on every image in the "test" dataset.
  prediction_digits = []
  for i, test_image in enumerate(test_images):
    # if i % 1000 == 0:
      # print('Evaluated on {n} results so far.'.format(n=i))
    
    # Pre-processing: add batch dimension and convert to float32 to match with the model's input data format
    test_image = np.expand_dims(test_image, axis=0).astype(np.float32)
    interpreter.set_tensor(input_index, test_image)

    # Run inference.
    interpreter.invoke()

    # Post-processing: remove batch dimension and find the digit with highest
    # probability.
    output = interpreter.tensor(output_index)
    digit = np.argmax(output()[0])
    prediction_digits.append(digit)

  # print('\n')
  
  # Compare prediction results with ground truth labels to calculate accuracy.
  prediction_digits = np.array(prediction_digits)
  accuracy = (prediction_digits == test_labels).mean()
  
  
  return accuracy

## MNIST
We will use the MNIST example to understand the pruning pipeline in TensorFlow & Keras.

We start by training a model without pruning, on the MNIST dataset:

In [None]:
# Load MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()


# Normalize the input image so that each pixel value is between 0 and 1.
train_images = train_images / 255.0
test_images = test_images / 255.0


# Define the model architecture.
model = keras.Sequential([
  keras.layers.InputLayer(input_shape=(28, 28)),
  keras.layers.Reshape(target_shape=(28, 28, 1)),
  keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation='relu'),
  keras.layers.MaxPooling2D(pool_size=(2, 2)),
  keras.layers.Flatten(),
  keras.layers.Dense(10)
])


# Train the digit classification model
# We have to compile the model first
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])

# Let's print the model summary
model.summary()

# Let's fit the model to our data
model.fit(train_images, train_labels, epochs=5, validation_split=0.1)

In [None]:
# Let's check the model accuracy
_, baseline_model_accuracy = model.evaluate(test_images, test_labels, verbose=0)

# Print the value in the Terminal
print('Baseline Model Test Accuracy:', baseline_model_accuracy)


# Save the model for later usage
_, keras_file = tempfile.mkstemp('.h5')
tf.keras.models.save_model(model, keras_file, include_optimizer=False)
print('Saved baseline model to:', keras_file)

Let's apply a **pruning strategy** and retrain the model:

In [None]:
# Create an object that enables us to apply pruning
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude


# Compute end step to finish pruning after 5 epochs
batch_size = 128
epochs = 5
validation_split = 0.1

# Number of training images
num_images = train_images.shape[0] * (1 - validation_split)
end_step = np.ceil(num_images / batch_size).astype(np.int32) * epochs

# Define model for pruning
# In this example, you start the model with 50% sparsity (50% zeros in weights) and end with 80% sparsity
# We define the pruning parameters with a dictionary object
pruning_params = {
      'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.50,
                                                               final_sparsity=0.80,
                                                               begin_step=0,
                                                               end_step=end_step)
}


# Apply pruning to our baseline model
model_for_pruning = prune_low_magnitude(model, **pruning_params)


# As you already know, we have to recompile the model again
model_for_pruning.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])

# Let's print the summary of this model
model_for_pruning.summary()


# Callbacks are TensorFlow & Keras functions that we want to call during the training process
# We create a logdir for one of our callbacks
logdir = tempfile.mkdtemp()

# tfmot.sparsity.keras.UpdatePruningStep is required during training
# tfmot.sparsity.keras.PruningSummaries provides logs for tracking progress and debugging.
callbacks = [tfmot.sparsity.keras.UpdatePruningStep(), tfmot.sparsity.keras.PruningSummaries(log_dir=logdir)]

# Fit this model to our data
model_for_pruning.fit(train_images, train_labels, batch_size=batch_size, epochs=epochs, validation_split=validation_split, callbacks=callbacks)
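
To see what the schedule above asks for at each training step, here is a small sketch of a polynomial sparsity ramp in plain Python. It assumes the exponent is 3 (which, as far as I know, is the default `power` of `PolynomialDecay`) and ignores the fact that the library only updates the pruning masks every few steps; `polynomial_sparsity` is a hypothetical helper written for this illustration.

```python
import math


def polynomial_sparsity(step, initial_sparsity, final_sparsity,
                        begin_step, end_step, power=3):
    """Target sparsity at a given step for a polynomial ramp (assumed power=3)."""
    step = min(max(step, begin_step), end_step)  # clamp to the pruning window
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power


# Same numbers as in the cell above
num_images = 54000  # 60,000 MNIST training images minus the 10% validation split
batch_size = 128
epochs = 5
steps_per_epoch = math.ceil(num_images / batch_size)
end_step = steps_per_epoch * epochs

print("Steps per epoch:", steps_per_epoch, "| end_step:", end_step)
for epoch in range(epochs + 1):
    step = epoch * steps_per_epoch
    target = polynomial_sparsity(step, 0.50, 0.80, 0, end_step)
    print(f"after epoch {epoch}: target sparsity = {target:.3f}")
```

With this shape, most of the pruning happens early in training and the ramp flattens as it approaches the final sparsity, which leaves the network some time to recover.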

In [None]:
# Let's evaluate the accuracy of this model
_, model_for_pruning_accuracy = model_for_pruning.evaluate(test_images, test_labels, verbose=0)


# Print the accuracy values on the Terminal
print('Baseline Model Test Accuracy:', baseline_model_accuracy) 
print('Pruned Model Test Accuracy:', model_for_pruning_accuracy)


# Let's check the evolution of the sparsity of our models
%tensorboard --logdir={logdir}
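
The idea behind magnitude-based pruning can be shown on a toy weight list. This is only a sketch of the principle (set the smallest-magnitude weights to zero until a target sparsity is reached); it is not how `tfmot` implements it internally, and `prune_low_magnitude_sketch` is a hypothetical name chosen for this example.

```python
def prune_low_magnitude_sketch(weights, sparsity):
    """Zero out the given fraction of weights, starting with the smallest magnitudes."""
    n_prune = int(round(len(weights) * sparsity))
    # Indices of the weights, ordered from smallest to largest magnitude
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_prune = set(order[:n_prune])
    return [0.0 if i in to_prune else w for i, w in enumerate(weights)]


weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2, -0.4, 0.08, 0.6, -0.15]
pruned = prune_low_magnitude_sketch(weights, sparsity=0.8)

print("Original:", weights)
print("Pruned  :", pruned)
print("Fraction of zeros:", pruned.count(0.0) / len(pruned))
```

Only the two largest-magnitude weights survive here; during real training the remaining weights keep being updated so that the model can compensate for the ones that were removed.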

Let's use TensorFlow Lite to reduce the size of our model:

In [None]:
# According to the TensorFlow Model Optimization documentation
# Both tfmot.sparsity.keras.strip_pruning and applying a standard compression algorithm (e.g. via gzip) are necessary to see the compression benefits of pruning
# strip_pruning is necessary since it removes every tf.Variable that pruning only needs during training, which would otherwise add to model size during inference
# Applying a standard compression algorithm is necessary since the serialized weight matrices are the same size as they were before pruning 
# However, pruning makes most of the weights zeros, which is added redundancy that algorithms can utilize to further compress the model


# Create an object to export the model
model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

# Create a temporary file to save the contents of this model
_, pruned_keras_file = tempfile.mkstemp('.h5')

# Save this model to disk
tf.keras.models.save_model(model_for_export, pruned_keras_file, include_optimizer=False)
print('Saved pruned Keras model to:', pruned_keras_file)

In [None]:
# Similarly to the previous tutorial on quantisation, we have to create TensorFlow Lite objects to convert our model
converter = tf.lite.TFLiteConverter.from_keras_model(model_for_export)
pruned_tflite_model = converter.convert()


# Create a temporary file to save the contents of this model
_, pruned_tflite_file = tempfile.mkstemp('.tflite')


# Save this model to disk
with open(pruned_tflite_file, 'wb') as f:
  f.write(pruned_tflite_model)

print('Saved pruned TFLite model to:', pruned_tflite_file)

In [None]:
# Let's use one of our helper functions to evaluate the size of our models
print("Size of gzipped baseline Keras model: %.2f bytes" % (get_gzipped_model_size(keras_file)))
print("Size of gzipped pruned Keras model: %.2f bytes" % (get_gzipped_model_size(pruned_keras_file)))
print("Size of gzipped pruned TFlite model: %.2f bytes" % (get_gzipped_model_size(pruned_tflite_file)))
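
Why does pruning only pay off after compression? The sketch below uses random numbers as a stand-in for trained weights (an assumption made purely for illustration), zeroes the 80% with the smallest magnitude, and compresses both versions with `zlib`, which uses the same DEFLATE algorithm as gzip.

```python
import random
import struct
import zlib

random.seed(0)

# Stand-in for a trained weight tensor (random values, for illustration only)
dense = [random.gauss(0.0, 1.0) for _ in range(20000)]

# Zero out the 80% of weights with the smallest magnitude
threshold = sorted(abs(w) for w in dense)[int(0.8 * len(dense))]
pruned = [w if abs(w) >= threshold else 0.0 for w in dense]


def raw_and_compressed_size(weights):
    """Serialise the weights as float32 and return (raw bytes, compressed bytes)."""
    raw = struct.pack(f"{len(weights)}f", *weights)
    return len(raw), len(zlib.compress(raw))


dense_raw, dense_zip = raw_and_compressed_size(dense)
pruned_raw, pruned_zip = raw_and_compressed_size(pruned)

print("Dense  - raw bytes:", dense_raw, "| compressed bytes:", dense_zip)
print("Pruned - raw bytes:", pruned_raw, "| compressed bytes:", pruned_zip)
```

The raw sizes are identical because a zero still occupies four bytes; the gain only appears once the long runs of zeros are compressed.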

## CIFAR-10
Let's now replicate this strategy using an RGB image dataset:

In [None]:
# Load CIFAR-10 dataset
cifar10 = tf.keras.datasets.cifar10
(train_images, train_labels), (test_images, test_labels) = cifar10.load_data()


# Normalize the input image so that each pixel value is between 0 and 1.
train_images = # Your code here
test_images = # Your code here


# Define the model architecture.
model = keras.Sequential([
  keras.layers.InputLayer(input_shape=(32, 32, 3)),
  keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation='relu'),
  keras.layers.MaxPooling2D(pool_size=(2, 2)),
  keras.layers.Flatten(),
  keras.layers.Dense(10)
])


# Train the image classification model
# We have to compile the model first
# Your code here

# Let's print the model summary
# Your code here

# Let's fit the model to our data
# Your code here

In [None]:
# Let's check the model accuracy
_, baseline_model_accuracy = # Your code here

# Print the value in the Terminal
print('Baseline Model Test Accuracy:', baseline_model_accuracy)


# Save the model for later usage
# Create a temporary file to save this model contents
_, keras_file = tempfile.mkstemp('.h5')

# Save model
# Your code here (hint: use the function tf.keras.models.save_model) 

print('Saved baseline model to:', keras_file)

Let's apply a **pruning strategy** and retrain the model:

In [None]:
# Create an object that enables us to apply pruning
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude


# Compute end step to finish pruning after 5 epochs
batch_size = 128
epochs = 5
validation_split = 0.1

# Number of training images
num_images = train_images.shape[0] * (1 - validation_split)
end_step = np.ceil(num_images / batch_size).astype(np.int32) * epochs

# Define model for pruning
# In this example, you start the model with 50% sparsity (50% zeros in weights) and end with 80% sparsity
# We define the pruning parameters with a dictionary object
pruning_params = {
      'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.50,
                                                               final_sparsity=0.80,
                                                               begin_step=0,
                                                               end_step=end_step)
}


# Apply pruning to our baseline model
model_for_pruning = # Your code here (hint: use the prune_low_magnitude function)


# As you already know, we have to recompile the model again
model_for_pruning.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])

# Let's print the summary of this model
model_for_pruning.summary()


# Callbacks are TensorFlow & Keras functions that we want to call during the training process
# We create a logdir for one of our callbacks
logdir = tempfile.mkdtemp()

# tfmot.sparsity.keras.UpdatePruningStep is required during training
# tfmot.sparsity.keras.PruningSummaries provides logs for tracking progress and debugging.
callbacks = # Your code here (check the previous example)

# Fit this model to our data
# Your code here (check the previous example)

In [None]:
# Let's evaluate the accuracy of this model
_, model_for_pruning_accuracy = # Your code here (check the previous example)


# Print the accuracy values on the Terminal
print('Baseline Model Test Accuracy:', baseline_model_accuracy) 
print('Pruned Model Test Accuracy:', model_for_pruning_accuracy)


# Let's check the evolution of the sparsity of our models
%tensorboard --logdir={logdir}

Let's use TensorFlow Lite to reduce the size of our model:

In [None]:
# According to the TensorFlow Model Optimization documentation
# Both tfmot.sparsity.keras.strip_pruning and applying a standard compression algorithm (e.g. via gzip) are necessary to see the compression benefits of pruning
# strip_pruning is necessary since it removes every tf.Variable that pruning only needs during training, which would otherwise add to model size during inference
# Applying a standard compression algorithm is necessary since the serialized weight matrices are the same size as they were before pruning 
# However, pruning makes most of the weights zeros, which is added redundancy that algorithms can utilize to further compress the model


# Create an object to export the model
model_for_export = # Your code here (hint: use the tfmot.sparsity.keras.strip_pruning function)

# Create a temporary file to save the contents of this model
_, pruned_keras_file = tempfile.mkstemp('.h5')

# Save this model to disk
# Your code here (hint: use the tf.keras.models.save_model function)
print('Saved pruned Keras model to:', pruned_keras_file)

In [None]:
# Similarly to the previous tutorial on quantisation, we have to create TensorFlow Lite objects to convert our model
converter = # Your code here (hint: use the tf.lite.TFLiteConverter.from_keras_model function)
pruned_tflite_model = converter.convert()


# Create a temporary file to save the contents of this model
_, pruned_tflite_file = tempfile.mkstemp('.tflite')


# Save this model to disk
with open(pruned_tflite_file, 'wb') as f:
  f.write(pruned_tflite_model)

print('Saved pruned TFLite model to:', pruned_tflite_file)

In [None]:
# Let's use one of our helper functions to evaluate the size of our models
print("Size of gzipped baseline Keras model: %.2f bytes" % (get_gzipped_model_size(keras_file)))
print("Size of gzipped pruned Keras model: %.2f bytes" % (get_gzipped_model_size(pruned_keras_file)))
print("Size of gzipped pruned TFlite model: %.2f bytes" % (get_gzipped_model_size(pruned_tflite_file)))

# What if we want to jointly use pruning and quantisation?
Theory: https://www.tensorflow.org/model_optimization/guide/pruning?hl=en

Code and Exercises adapted from: https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras?hl=en

## Setup
Let's start with the setup of our development environment:

In [None]:
# Install libraries
!pip install -q tensorflow-model-optimization

In [None]:
# Imports
import tempfile
import os
import numpy as np

# TensorFlow & Keras Imports
import tensorflow as tf
from tensorflow import keras

# TensorFlow & Keras Optimisation Libraries
import tensorflow_model_optimization as tfmot

%load_ext tensorboard

In [None]:
# Helper Function: Compress a model file (ZIP archive with DEFLATE, the algorithm gzip also uses) and measure the compressed size
def get_gzipped_model_size(file):
  # Returns size of gzipped model, in bytes
  import os
  import zipfile

  _, zipped_file = tempfile.mkstemp('.zip')
  with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f:
    f.write(file)


  return os.path.getsize(zipped_file)

In [None]:
# Helper Function: Evaluate the TensorFlow Lite model on the test dataset
def evaluate_model(interpreter):
  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]

  # Run predictions on every image in the "test" dataset.
  prediction_digits = []
  for i, test_image in enumerate(test_images):
    # if i % 1000 == 0:
      # print('Evaluated on {n} results so far.'.format(n=i))
    
    # Pre-processing: add batch dimension and convert to float32 to match with the model's input data format
    test_image = np.expand_dims(test_image, axis=0).astype(np.float32)
    interpreter.set_tensor(input_index, test_image)

    # Run inference.
    interpreter.invoke()

    # Post-processing: remove batch dimension and find the digit with highest
    # probability.
    output = interpreter.tensor(output_index)
    digit = np.argmax(output()[0])
    prediction_digits.append(digit)

  # print('\n')
  
  # Compare prediction results with ground truth labels to calculate accuracy.
  prediction_digits = np.array(prediction_digits)
  accuracy = (prediction_digits == test_labels).mean()
  
  
  return accuracy

## MNIST
We will use the MNIST example again: first we repeat the pruning pipeline in TensorFlow & Keras, then we add quantisation on top of it.

We start by training a model without pruning, on the MNIST dataset:

In [None]:
# Load MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()


# Normalize the input image so that each pixel value is between 0 and 1.
train_images = train_images / 255.0
test_images = test_images / 255.0


# Define the model architecture.
model = keras.Sequential([
  keras.layers.InputLayer(input_shape=(28, 28)),
  keras.layers.Reshape(target_shape=(28, 28, 1)),
  keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation='relu'),
  keras.layers.MaxPooling2D(pool_size=(2, 2)),
  keras.layers.Flatten(),
  keras.layers.Dense(10)
])


# Train the digit classification model
# We have to compile the model first
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])

# Let's print the model summary
model.summary()

# Let's fit the model to our data
model.fit(train_images, train_labels, epochs=5, validation_split=0.1)

In [None]:
# Let's check the model accuracy
_, baseline_model_accuracy = model.evaluate(test_images, test_labels, verbose=0)

# Print the value in the Terminal
print('Baseline Model Test Accuracy:', baseline_model_accuracy)


# Save the model for later usage
_, keras_file = tempfile.mkstemp('.h5')
tf.keras.models.save_model(model, keras_file, include_optimizer=False)
print('Saved baseline model to:', keras_file)

Let's apply a **pruning strategy** and retrain the model:

In [None]:
# Create an object that enables us to apply pruning
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude


# Compute end step to finish pruning after 5 epochs
batch_size = 128
epochs = 5
validation_split = 0.1

# Number of training images
num_images = train_images.shape[0] * (1 - validation_split)
end_step = np.ceil(num_images / batch_size).astype(np.int32) * epochs

# Define model for pruning
# In this example, you start the model with 50% sparsity (50% zeros in weights) and end with 80% sparsity
# We define the pruning parameters with a dictionary object
pruning_params = {
      'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.50,
                                                               final_sparsity=0.80,
                                                               begin_step=0,
                                                               end_step=end_step)
}


# Apply pruning to our baseline model
model_for_pruning = prune_low_magnitude(model, **pruning_params)


# As you already know, we have to recompile the model again
model_for_pruning.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy'])

# Let's print the summary of this model
model_for_pruning.summary()


# Callbacks are TensorFlow & Keras functions that we want to call during the training process
# We create a logdir for one of our callbacks
logdir = tempfile.mkdtemp()

# tfmot.sparsity.keras.UpdatePruningStep is required during training
# tfmot.sparsity.keras.PruningSummaries provides logs for tracking progress and debugging.
callbacks = [tfmot.sparsity.keras.UpdatePruningStep(), tfmot.sparsity.keras.PruningSummaries(log_dir=logdir)]

# Fit this model to our data
model_for_pruning.fit(train_images, train_labels, batch_size=batch_size, epochs=epochs, validation_split=validation_split, callbacks=callbacks)

In [None]:
# Let's evaluate the accuracy of this model
_, model_for_pruning_accuracy = model_for_pruning.evaluate(test_images, test_labels, verbose=0)


# Print the accuracy values on the Terminal
print('Baseline Model Test Accuracy:', baseline_model_accuracy) 
print('Pruned Model Test Accuracy:', model_for_pruning_accuracy)


# Let's check the evolution of the sparsity of our models
%tensorboard --logdir={logdir}

Let's use TensorFlow Lite to reduce the size of our model:

In [None]:
# According to the TensorFlow Model Optimization documentation
# Both tfmot.sparsity.keras.strip_pruning and applying a standard compression algorithm (e.g. via gzip) are necessary to see the compression benefits of pruning
# strip_pruning is necessary since it removes every tf.Variable that pruning only needs during training, which would otherwise add to model size during inference
# Applying a standard compression algorithm is necessary since the serialized weight matrices are the same size as they were before pruning 
# However, pruning makes most of the weights zeros, which is added redundancy that algorithms can utilize to further compress the model


# Create an object to export the model
model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

# Create a temporary file to save the contents of this model
_, pruned_keras_file = tempfile.mkstemp('.h5')

# Save this model to disk
tf.keras.models.save_model(model_for_export, pruned_keras_file, include_optimizer=False)
print('Saved pruned Keras model to:', pruned_keras_file)
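
Besides TensorBoard, a quick way to confirm that pruning took effect is to count the zero-valued weights directly. The sketch below uses only NumPy and a toy array as a stand-in for a pruned kernel; the helper name `fraction_of_zeros` is an assumption made for this example and is not part of `tfmot`.

```python
import numpy as np


def fraction_of_zeros(weights):
    """Return the fraction of entries in `weights` that are exactly zero."""
    weights = np.asarray(weights)
    return float(np.sum(weights == 0)) / weights.size


# Toy stand-in for a pruned kernel: 3 of its 4 entries are zero
toy_kernel = np.array([[0.0, 0.7], [0.0, 0.0]], dtype=np.float32)

print(f"Toy kernel sparsity: {fraction_of_zeros(toy_kernel):.2f}")
```

In this notebook, the same helper could be applied to each array returned by `layer.get_weights()` for the layers of `model_for_export`. Bias vectors are normally not pruned, so only the kernels are expected to show a sparsity close to the target.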

In [None]:
# Similarly to the previous tutorial on quantisation, we have to create TensorFlow Lite objects to convert our model
converter = tf.lite.TFLiteConverter.from_keras_model(model_for_export)
pruned_tflite_model = converter.convert()


# Create a temporary file to save the contents of this model
_, pruned_tflite_file = tempfile.mkstemp('.tflite')


# Save this model to disk
with open(pruned_tflite_file, 'wb') as f:
  f.write(pruned_tflite_model)

print('Saved pruned TFLite model to:', pruned_tflite_file)

In [None]:
# Let's use one of our helper functions to evaluate the size of our models
print("Size of gzipped baseline Keras model: %.2f bytes" % (get_gzipped_model_size(keras_file)))
print("Size of gzipped pruned Keras model: %.2f bytes" % (get_gzipped_model_size(pruned_keras_file)))
print("Size of gzipped pruned TFLite model: %.2f bytes" % (get_gzipped_model_size(pruned_tflite_file)))
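
To see why the sizes are compared after gzip compression, the following minimal sketch may help. It does not use the models of this notebook: it builds a random weight matrix, zeroes out the 80% of entries with the smallest magnitude (a crude stand-in for magnitude pruning), and compares raw and gzipped sizes. The helper name `gzipped_num_bytes`, the matrix shape and the sparsity level are all assumptions chosen for illustration.

```python
import gzip

import numpy as np


def gzipped_num_bytes(array):
    """Size in bytes of the gzip-compressed raw buffer of `array`."""
    return len(gzip.compress(array.tobytes()))


rng = np.random.default_rng(seed=0)
dense_weights = rng.normal(size=(256, 256)).astype(np.float32)

# Crude magnitude pruning: zero out the 80% of weights with the smallest magnitude
threshold = np.quantile(np.abs(dense_weights), 0.80)
pruned_weights = np.where(np.abs(dense_weights) < threshold, 0.0, dense_weights).astype(np.float32)

# The raw buffers have exactly the same size: zeros still take 4 bytes each
print("Raw bytes (dense): ", dense_weights.nbytes)
print("Raw bytes (pruned):", pruned_weights.nbytes)

# Only after compression does the redundancy introduced by the zeros pay off
print("Gzipped bytes (dense): ", gzipped_num_bytes(dense_weights))
print("Gzipped bytes (pruned):", gzipped_num_bytes(pruned_weights))
```

The raw sizes are identical, while the gzipped pruned matrix should be clearly smaller than the gzipped dense one, which mirrors the behaviour of the saved model files.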

Let's now check whether applying post-training quantisation after pruning reduces the size of our model even further:

In [None]:
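# Let's, once again, create a TensorFlow Lite object to convert our model
converter = tf.lite.TFLiteConverter.from_keras_model(model_for_export)
# tf.lite.Optimize.DEFAULT enables post-training quantisation
# (dynamic range quantisation of the weights when no representative dataset is provided)
converter.optimizations = [tf.lite.Optimize.DEFAULT]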
quantized_and_pruned_tflite_model = converter.convert()


# We create a temporary file to save the contents of this model
_, quantized_and_pruned_tflite_file = tempfile.mkstemp('.tflite')


# Save the model to disk
with open(quantized_and_pruned_tflite_file, 'wb') as f:
  f.write(quantized_and_pruned_tflite_model)

print('Saved quantized and pruned TFLite model to:', quantized_and_pruned_tflite_file)

print("Size of gzipped baseline Keras model: %.2f bytes" % (get_gzipped_model_size(keras_file)))
print("Size of gzipped pruned and quantized TFLite model: %.2f bytes" % (get_gzipped_model_size(quantized_and_pruned_tflite_file)))
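
Pruning and quantisation combine well because quantising a zero weight yields a zero again, so the sparsity survives while each remaining weight shrinks from 32 to 8 bits. The sketch below illustrates this with a simplified symmetric per-tensor int8 scheme written in NumPy. It is an assumption made for teaching purposes and not the exact algorithm used by the TensorFlow Lite converter; the function names are hypothetical.

```python
import numpy as np


def quantize_symmetric_int8(weights):
    """Map float weights to int8 using a single scale per tensor (simplified sketch)."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float32 weights from the int8 values."""
    return quantized.astype(np.float32) * scale


rng = np.random.default_rng(seed=0)
weights = rng.normal(size=(64, 64)).astype(np.float32)

# Crude magnitude pruning to 80% sparsity, as a stand-in for the pruned model
threshold = np.quantile(np.abs(weights), 0.80)
pruned = np.where(np.abs(weights) < threshold, 0.0, weights).astype(np.float32)

quantized, scale = quantize_symmetric_int8(pruned)
restored = dequantize(quantized, scale)

print("Bytes before quantisation:", pruned.nbytes)
print("Bytes after quantisation: ", quantized.nbytes)
print("Zeros preserved:", bool(np.all(quantized[pruned == 0] == 0)))
print("Max absolute rounding error:", float(np.max(np.abs(restored - pruned))))
```

The rounding error of each weight is bounded by half of the scale, which is why the accuracy of the quantised model usually stays close to that of the float model.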

As a sanity check, let's verify that the accuracy of this compressed model is similar to the accuracy of the pruned TensorFlow model:

In [None]:
# We create an object to perform inference (remember the Quantisation tutorial?)
interpreter = tf.lite.Interpreter(model_content=quantized_and_pruned_tflite_model)
interpreter.allocate_tensors()

# We use the helper function to evaluate this model
test_accuracy = evaluate_model(interpreter)

print('Pruned and quantized TFLite test accuracy:', test_accuracy)
print('Pruned TF test accuracy:', model_for_pruning_accuracy)

## Challenge: can you pick one (or more) dataset(s) and build the entire pipeline yourself?

In [None]:
# Start your code here

# More tutorials, exercises and readings


1.   [Model Quantization Methods In TensorFlow Lite By Bhavika Kanani](https://studymachinelearning.com/model-quantization-methods-in-tensorflow-lite/)

2.   [TensorFlow model optimization guide](https://www.tensorflow.org/model_optimization/guide?hl=en)

3.   [Speeding Up Deep Learning Inference Using TensorFlow, ONNX, and NVIDIA TensorRT](https://developer.nvidia.com/blog/speeding-up-deep-learning-inference-using-tensorflow-onnx-and-tensorrt/)

4.   [PyTorch: Pruning Tutorial by Michela Paganini](https://pytorch.org/tutorials/intermediate/pruning_tutorial.html#extending-torch-nn-utils-prune-with-custom-pruning-functions)

5.   [PyTorch: Dynamic Quantization on an LSTM Word Language Model by James Reed](https://pytorch.org/tutorials/advanced/dynamic_quantization_tutorial.html)

6.   [PyTorch: Dynamic Quantization on BERT by Jianyu Huang](https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html)

7.   [PyTorch: Quantized Transfer Learning for Computer Vision Tutorial by Zafar Takhirov](https://pytorch.org/tutorials/intermediate/quantized_transfer_learning_tutorial.html)

8.   [PyTorch: Static Quantization with Eager Mode in PyTorch](https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html)

9.   [PyTorch: Exporting a Model from PyTorch to ONNX and Running it using ONNX Runtime](https://pytorch.org/tutorials/advanced/super_resolution_with_onnxruntime.html)

10.  [Brevitas: a PyTorch research library for quantization-aware training (QAT)](https://github.com/Xilinx/brevitas)