# **Tutorial Pruning | Quantization**

Developed by:
*   Miguel Santos - M12960
*   Rui Ferreira - M11911 

University of Beira Interior (UBI)

This tutorial is based on:

*   https://github.com/christianversloot/machine-learning-articles/blob/main/tensorflow-model-optimization-an-introduction-to-pruning.md#loading-and-configuring-pruning

* https://www.tensorflow.org/model_optimization/guide/pruning/comprehensive_guide.md


**Upgrade TensorFlow**

In [None]:
!pip install --upgrade tensorflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


**Installing the TensorFlow Model Optimization toolkit**

The TensorFlow Model Optimization toolkit is a collection of tools for optimizing TensorFlow models, which reduces the complexity of optimizing machine learning inference. We will use it for pruning.

In [None]:
!git clone https://github.com/google/qkeras.git 
import sys 
sys.path.append('qkeras') 
!pip install git+https://github.com/keras-team/keras-tuner.git 

!pip install tensorflow_model_optimization 
import tensorflow_model_optimization as tfmot


**Import the remaining necessary libraries.**

In [None]:
import tensorflow
from tensorflow import keras
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras import models
from tensorflow.keras.layers import Dense, Dropout, Flatten, InputLayer, Reshape, Conv2D, MaxPooling2D
import tempfile
import numpy as np


**Model configuration**

The next cell defines global variables for the code.

In [None]:
img_width, img_height = 28, 28
batch_size = 64
no_epochs = 10
no_classes = 10
validation_split = 0.2
verbosity = 1

**Load MNIST dataset**

The dataset used in this tutorial is MNIST. It contains 60,000 training images and 10,000 test images of handwritten digits (0-9).

MNIST images are grayscale, meaning they have only one color channel.

Each image is img_width x img_height (28 x 28) pixels, so the input to the ConvNet has shape (28, 28, 1), where the trailing 1 is the single grayscale channel.

In [None]:
# Load MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


**Data Normalization**

These lines normalize the input data by dividing each pixel value by 255, which scales the pixel values from the range [0, 255] to the range [0, 1].

The same scaling is applied to the training set and to the test set.

In [None]:
# Parse numbers as floats
train_images = train_images.astype('float32')
test_images = test_images.astype('float32')

# Normalize the input images so that each pixel value is between 0 and 1.
train_images = train_images / 255.0
test_images = test_images / 255.0

**Create and Train the Model**

In [None]:
model = Sequential([
    InputLayer(input_shape=(28, 28)),
    # The added dimension is the channel axis: 1 here, since the images are grayscale.
    Reshape(target_shape=(28, 28, 1)),
    Conv2D(32, kernel_size=(3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Conv2D(64, kernel_size=(3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Flatten(),
    Dense(no_classes, activation='softmax')
])


# Compile the model
# The final Dense layer already applies softmax, so the loss receives probabilities, not logits.
model.compile(loss=tensorflow.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              optimizer=tensorflow.keras.optimizers.Adam(),
              metrics=['accuracy'])

# Fit data to model
model.fit(train_images, train_labels,
          batch_size=batch_size,
          epochs=no_epochs,
          verbose=verbosity,
          validation_split=validation_split)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f3ff537cfd0>

Note: The Flatten layer converts a multidimensional input tensor into a one-dimensional tensor that can be passed to the fully connected (Dense) layers for classification.
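To make the effect of Flatten concrete, the following sketch works out the tensor sizes and parameter counts for the architecture above by hand. It assumes the Keras defaults used in this model ('valid' padding, stride 1 for convolutions, non-overlapping 2x2 pooling); the helper names `conv_out` and `pool_out` are illustrative and not part of Keras.

```python
def conv_out(size, kernel):
    # Output width/height of a 'valid' convolution with stride 1.
    return size - kernel + 1

def pool_out(size, pool):
    # Output width/height of non-overlapping pooling (incomplete windows are dropped).
    return size // pool

size = 28
size = pool_out(conv_out(size, 3), 2)  # 28 -> 26 -> 13
size = pool_out(conv_out(size, 3), 2)  # 13 -> 11 -> 5

# Flatten turns the (5, 5, 64) feature map into a single vector.
flat_features = size * size * 64

# Parameters = kernel weights + one bias per filter / output unit.
conv1_params = 3 * 3 * 1 * 32 + 32
conv2_params = 3 * 3 * 32 * 64 + 64
dense_params = flat_features * 10 + 10
total_params = conv1_params + conv2_params + dense_params

print("Features after Flatten:", flat_features)
print("Total trainable parameters:", total_params)
```

At 4 bytes per float32 weight, this parameter count is in line with the size of the saved baseline file reported later in the tutorial.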

**Evaluate Model**

In [None]:
# Generate generalization metrics
score = model.evaluate(test_images, test_labels, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')

Test loss: 0.030351150780916214 / Test accuracy: 0.9897000193595886


**Save Model**

In [None]:
_, keras_file = tempfile.mkstemp('.h5')
models.save_model(model, keras_file, include_optimizer=False)
print(f'Baseline model saved: {keras_file}')

Baseline model saved: /tmp/tmp7vyo3vso.h5


**Loading and Configuring PRUNING**

The next cells add pruning functionality to our code.

First, we load **prune_low_magnitude**, which wraps a model's layers so that their lowest-magnitude weights (the ones that contribute least to the output) can be progressively set to zero during training with little effect on the overall performance of the network. This cell only takes a reference to the function; we call it later.

In [None]:
# Load functionality for adding pruning wrappers
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude
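The idea behind magnitude-based pruning can be illustrated without TensorFlow. The sketch below is a simplified stand-in, not the tfmot implementation: the function `prune_by_magnitude` is hypothetical and operates on a flat Python list, whereas the library works per layer with masks that are updated during training.

```python
def prune_by_magnitude(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude share set to zero."""
    n_prune = int(round(sparsity * len(weights)))
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]

weights = [0.5, -0.1, 0.03, -0.8, 0.2]
pruned = prune_by_magnitude(weights, sparsity=0.4)
print(pruned)
```

With a target sparsity of 40% on five weights, the two values closest to zero are removed and the three largest (in absolute value) are kept unchanged.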

In this cell, we apply pruning to a Keras model using the TensorFlow Model Optimization (tfmot) library.


*   **pruning_epochs**: the number of epochs for which we want to apply pruning to the model.

*   **num_images**: the number of images actually used for training, that is, the training set minus the validation split.

*   **end_step**: the training step (batch) at which the pruning schedule ends.

*   **pruning_params**: the configuration of the pruning operation. We define a pruning schedule using **PolynomialDecay**, which means that the sparsity of the model increases as training progresses and stops increasing at the specified end_step. Initially, we set the model to be 40% sparse, and it gets progressively sparser until it reaches 70%.

*   **pruning_schedule**: specifies how the pruning rate changes over time during training. In this case, what changes is the sparsity level.

*   **model_for_pruning**: thanks to **prune_low_magnitude**, the prunable model is generated from our initial model and pruning_params. It has the same architecture as the original model, but its weights will be pruned during training according to the specified pruning schedule.
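As a worked example of the end_step computation, the arithmetic can be checked with plain Python. The values below are the ones used in this tutorial (60,000 MNIST training images, a validation split of 0.2, a batch size of 64 and 5 pruning epochs); `steps_per_epoch` is an illustrative intermediate name.

```python
import math

num_train_images = 60000
validation_split = 0.2
batch_size = 64
pruning_epochs = 5

# Images left for training once the validation split is held out.
images_used = num_train_images * (1 - validation_split)

# One step is one batch; the last, partially filled batch still counts.
steps_per_epoch = math.ceil(images_used / batch_size)

# Pruning ends after the last batch of the last pruning epoch.
end_step = steps_per_epoch * pruning_epochs

print("Steps per epoch:", steps_per_epoch)
print("end_step:", end_step)
```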


In [None]:
# Finish pruning after 5 epochs
pruning_epochs = 5
num_images = train_images.shape[0] * (1 - validation_split)
end_step = np.ceil(num_images / batch_size).astype(np.int32) * pruning_epochs

# Define pruning configuration
pruning_params = {
      'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.40,
                                                               final_sparsity=0.70,
                                                               begin_step=0,
                                                               end_step=end_step)
}
model_for_pruning = prune_low_magnitude(model, **pruning_params)

# Recompile the model
model_for_pruning.compile(loss=tensorflow.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              optimizer=tensorflow.keras.optimizers.Adam(),
              metrics=['accuracy'])


#model_for_pruning.summary()
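To get a feel for how the target sparsity evolves under a polynomial decay schedule, here is a minimal sketch in plain Python. It assumes the schedule interpolates from the initial to the final sparsity with exponent 3, which is our reading of the library's documented default `power`; the function `polynomial_sparsity` is a hypothetical helper, and the real schedule additionally updates the pruning masks only every few steps.

```python
def polynomial_sparsity(step, initial=0.40, final=0.70,
                        begin_step=0, end_step=3750, power=3):
    """Target sparsity at a given training step (assumed polynomial form)."""
    progress = (step - begin_step) / (end_step - begin_step)
    progress = min(1.0, max(0.0, progress))  # clamp before begin / after end
    return final + (initial - final) * (1.0 - progress) ** power

# Target sparsity at the start, after 1 epoch, halfway, and at the end.
for step in (0, 750, 1875, 3750):
    print(step, round(polynomial_sparsity(step), 4))
```

Under this assumption the sparsity rises quickly in the first epochs and flattens out as it approaches 70%, so most of the pruning happens while the network still has many training steps left to recover.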

**Starting the Pruning Process**

After configuring pruning, the wrapped model has to be recompiled (done in the previous cell, because wrapping the layers adds pruning functionality), and we can then start the pruning process. The UpdatePruningStep callback is required here: it advances the pruning step after every training batch, so that the schedule is applied and the pruned weights stay at zero as training continues.

By finishing pruning after 5 epochs with the configuration defined above, we obtain a model in which up to 70% of the prunable weights are zero. The number of parameters in the architecture does not change, but the zeros compress very well, which yields a smaller file.

In [None]:
# Model callbacks
callbacks = [
  tfmot.sparsity.keras.UpdatePruningStep()
]

# Fitting data
model_for_pruning.fit(train_images, train_labels,
                      batch_size=batch_size,
                      epochs=pruning_epochs,
                      verbose=verbosity,
                      callbacks=callbacks,
                      validation_split=validation_split)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f3ff4cf4b20>

**Once pruning finishes, we must measure its effectiveness. We can do so in two ways:**

*   By measuring how much performance has changed, compared to before pruning;
*   By measuring how much model size has changed, compared to before pruning.

For this example, there is minimal loss in test accuracy after pruning, compared to the baseline.

In [None]:
# Generate generalization metrics
score_pruned = model_for_pruning.evaluate(test_images, test_labels, verbose=0) #Evaluate model

print(f'Pruned CNN - Test loss: {score_pruned[0]} / Test accuracy: {score_pruned[1]}')
print(f'Regular CNN - Test loss: {score[0]} / Test accuracy: {score[1]}')

Pruned CNN - Test loss: 0.027494464069604874 / Test accuracy: 0.9908000230789185
Regular CNN - Test loss: 0.030351150780916214 / Test accuracy: 0.9897000193595886


**Note:**
The pruned model even performs slightly better than the regular one. This is likely because we trained the initial model for only 10 epochs and then continued training during pruning. It is quite possible that the model had not yet converged and kept moving towards convergence during the pruning process. Often, performance deteriorates a bit, but it should do so only slightly.

When a Keras model is pruned, a share of its weights is set to zero, which makes the model compress better and can reduce processing time on runtimes that exploit sparsity. However, the model returned by prune_low_magnitude still contains the pruning wrappers and their extra variables, which are only needed during training. Before exporting, we therefore call **strip_pruning()**, which removes these wrappers and returns a model with the original architecture whose pruned weights remain zero.

In [None]:
# Export the model
model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)  # Remove the pruning wrappers before export.

_, pruned_keras_file = tempfile.mkstemp('.h5')
models.save_model(model_for_export, pruned_keras_file, include_optimizer=False)
print(f'Pruned model saved: {pruned_keras_file}')



Pruned model saved: /tmp/tmpcofv3h83.h5


**Measuring the size of your pruned model**

The cells below let us verify how much the size of the model is reduced by pruning.

We define a helper function that compresses a model file into a zip archive (DEFLATE, the same algorithm gzip uses) and measures the compressed size.

In [None]:
# Measuring the size of your pruned model
# (source: https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras#fine-tune_pre-trained_model_with_pruning)

def get_gzipped_model_size(file):
  # Returns size of gzipped model, in bytes.
  import os
  import zipfile

  _, zipped_file = tempfile.mkstemp('.zip')
  with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f:
    f.write(file)

  return os.path.getsize(zipped_file)


In [None]:
print("Size of gzipped baseline Keras model: %.2f bytes" % (get_gzipped_model_size(keras_file)))
print("Size of gzipped pruned Keras model: %.2f bytes" % (get_gzipped_model_size(pruned_keras_file)))

Size of gzipped baseline Keras model: 132445.00 bytes
Size of gzipped pruned Keras model: 57389.00 bytes


In [None]:
print(f"Ratio:{get_gzipped_model_size(keras_file)/get_gzipped_model_size(pruned_keras_file)}")

Ratio:2.307846451410549
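The reason the pruned file shrinks only after compression is that the zeroed weights are still stored, but long runs of identical bytes compress very well. The sketch below illustrates this with synthetic data and the standard library only; the weights are random numbers, not the weights of the model above, and `compressed_size` is an illustrative helper.

```python
import random
import struct
import zlib

random.seed(0)
n = 10_000
dense = [random.gauss(0.0, 0.05) for _ in range(n)]

# Zero out the 70% of weights with the smallest magnitude.
threshold = sorted(abs(w) for w in dense)[n * 7 // 10]
sparse = [w if abs(w) >= threshold else 0.0 for w in dense]

def compressed_size(values):
    # Pack as float32 (like a saved model) and compress with DEFLATE.
    raw = struct.pack(f"{len(values)}f", *values)
    return len(zlib.compress(raw))

dense_size = compressed_size(dense)
sparse_size = compressed_size(sparse)
print("Dense:", dense_size, "bytes / Sparse:", sparse_size, "bytes")
print("Ratio:", dense_size / sparse_size)
```

Both lists occupy the same number of bytes before compression; only the compressed sizes differ.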


## **Combining Pruning with Quantization for compound optimization**

Quantization is another process that can be used to reduce the complexity and size of the model. It lowers the precision of the numbers used to represent the model (the weights, and optionally the activations), for example from 32-bit floats to 8-bit integers, in order to make it smaller.



*   **tensorflow.lite.Optimize.DEFAULT**: The default optimization strategy, which enables post-training quantization. Used on its own, without a representative dataset, it stores the weights with 8-bit precision.

*   **TFLiteConverter**: Convert a Keras model into a TensorFlow Lite model, which is a format optimized for deployment on mobile and embedded devices.
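The core idea of 8-bit weight quantization can be sketched in a few lines of plain Python. This is a simplified illustration under our own assumptions (symmetric, per-tensor quantization with a single scale factor), not the exact scheme the TFLite converter applies; `quantize_int8` and `dequantize` are hypothetical helpers.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid dividing by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -0.1, 0.03, -0.8, 0.2]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("Quantized:", quantized)
print("Largest rounding error:", max_error)
```

Each weight now needs one byte instead of four, at the cost of a rounding error of at most half a quantization step. Weights that pruning set to zero stay exactly zero after quantization, which is why the two techniques combine well.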



In [None]:
# convert Keras model into a TensorFlow Lite model 
converter = tensorflow.lite.TFLiteConverter.from_keras_model(model_for_export)

converter.optimizations = [tensorflow.lite.Optimize.DEFAULT]

#converts the Keras model to a quantized and pruned TensorFlow Lite model using the settings specified above.
quantized_and_pruned_tflite_model = converter.convert()

_, quantized_and_pruned_tflite_file = tempfile.mkstemp('.tflite')

with open(quantized_and_pruned_tflite_file, 'wb') as f:
  f.write(quantized_and_pruned_tflite_model)

print('Saved quantized and pruned TFLite model to:', quantized_and_pruned_tflite_file)

print("Size of gzipped baseline Keras model: %.2f bytes" % (get_gzipped_model_size(keras_file)))
print("Size of gzipped pruned and quantized TFlite model: %.2f bytes" % (get_gzipped_model_size(quantized_and_pruned_tflite_file)))
print("")



Saved quantized and pruned TFLite model to: /tmp/tmpgo3z3bmk.tflite
Size of gzipped baseline Keras model: 132445.00 bytes
Size of gzipped pruned and quantized TFlite model: 19292.00 bytes



In [None]:
print(f"Ratio:{get_gzipped_model_size(keras_file)/get_gzipped_model_size(quantized_and_pruned_tflite_file)}")

Ratio:6.865280945469625
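The same measurements can also be read as a percentage reduction in size. The short calculation below uses the compressed sizes printed in the outputs above; `reduction_percent` is an illustrative helper.

```python
baseline_bytes = 132445          # compressed baseline Keras model
pruned_bytes = 57389             # compressed pruned Keras model
pruned_quantized_bytes = 19292   # compressed pruned and quantized TFLite model

def reduction_percent(before, after):
    # Share of the original size that was saved, in percent.
    return 100.0 * (1.0 - after / before)

pruning_saving = reduction_percent(baseline_bytes, pruned_bytes)
combined_saving = reduction_percent(baseline_bytes, pruned_quantized_bytes)

print(f"Pruning only: {pruning_saving:.2f}% smaller")
print(f"Pruning + quantization: {combined_saving:.2f}% smaller")
```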


**Accuracy from TF to TFLite**

Define a helper function to evaluate the TF Lite model on the test dataset.

In [None]:
def evaluate_model(interpreter):
  input_index = interpreter.get_input_details()[0]["index"]
  output_index = interpreter.get_output_details()[0]["index"]

  # Run predictions on every image in the "test" dataset.
  prediction_digits = []
  for i, test_image in enumerate(test_images):
    if i % 1000 == 0:
      print('Evaluated on {n} results so far.'.format(n=i))
    # Pre-processing: add batch dimension and convert to float32 to match with
    # the model's input data format.
    test_image = np.expand_dims(test_image, axis=0).astype(np.float32)
    interpreter.set_tensor(input_index, test_image)

    # Run inference.
    interpreter.invoke()

    # Post-processing: remove batch dimension and find the digit with highest
    # probability.
    output = interpreter.tensor(output_index)
    digit = np.argmax(output()[0])
    prediction_digits.append(digit)

  print('\n')
  # Compare prediction results with ground truth labels to calculate accuracy.
  prediction_digits = np.array(prediction_digits)
  accuracy = (prediction_digits == test_labels).mean()
  return accuracy
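The post-processing step in the helper above (pick the class with the highest probability, then compare against the labels) can be sketched on toy data without an interpreter. The probabilities and labels below are made up for illustration, and `argmax` and `accuracy` are hypothetical pure-Python counterparts of the NumPy calls used in the helper.

```python
def argmax(values):
    # Index of the largest value (the first one in case of ties).
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(probabilities, labels):
    # Fraction of samples whose most probable class matches the label.
    predictions = [argmax(row) for row in probabilities]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Four samples, three classes (toy values).
probabilities = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
]
labels = [1, 0, 2, 2]

acc = accuracy(probabilities, labels)
print("Accuracy:", acc)
```

Three of the four toy predictions match their labels; the last sample is predicted as class 1 while its label is 2.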

**Evaluate Accuracy of TF Lite model**

In [None]:
interpreter = tensorflow.lite.Interpreter(model_content=quantized_and_pruned_tflite_model)
interpreter.allocate_tensors()

test_accuracy = evaluate_model(interpreter)

print('Pruned and quantized TFLite test_accuracy:', test_accuracy)
print('Pruned TF test accuracy:', score_pruned[1])

Evaluated on 0 results so far.
Evaluated on 1000 results so far.
Evaluated on 2000 results so far.
Evaluated on 3000 results so far.
Evaluated on 4000 results so far.
Evaluated on 5000 results so far.
Evaluated on 6000 results so far.
Evaluated on 7000 results so far.
Evaluated on 8000 results so far.
Evaluated on 9000 results so far.


Pruned and quantized TFLite test_accuracy: 0.9909
Pruned TF test accuracy: 0.9908000230789185
