# Installing Required Packages

### Note: Resolving `keras.src` Namespace Issue
When using TensorFlow and TensorFlow Model Optimization in Colab, you may encounter a `keras.src` namespace issue, causing incompatibility with `tensorflow_model_optimization.quantization.keras`. To resolve this:

1. Set the `KERAS_BACKEND` environment variable to `tensorflow` before importing TensorFlow.
2. Use mutually compatible versions of TensorFlow and TensorFlow Model Optimization. This notebook pins `tensorflow==2.12`, which depends on Keras 2.12 rather than Keras 3.
3. Clone the model using `tensorflow.keras.models.clone_model()` to ensure it aligns with the `tensorflow.keras` namespace.
4. After reinstalling the TensorFlow-related packages, always restart the runtime so that no previously imported modules linger and cause conflicts.

This ensures that all operations use the correct `tensorflow.keras` implementation, avoiding compatibility issues.
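
Before importing TensorFlow, it can help to confirm which Keras generation is installed. The following is a minimal sketch using only the standard library. The helper names `installed_version` and `is_keras3` are hypothetical, and the check assumes the `keras.src` issue is tied to a Keras 3.x installation.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_keras3(version_string):
    """True when the major version is 3 or higher (assumed to be the problematic case)."""
    return int(version_string.split(".")[0]) >= 3


for name in ("keras", "tensorflow", "tensorflow-model-optimization"):
    print(f"{name}: {installed_version(name)}")

keras_version = installed_version("keras")
if keras_version is not None and is_keras3(keras_version):
    print("Keras 3 detected: reinstall the pinned packages and restart the runtime.")
```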


In [1]:
!pip uninstall -y keras tensorflow tensorflow-model-optimization
!pip install tensorflow==2.12 tensorflow-model-optimization

Found existing installation: keras 3.5.0
Uninstalling keras-3.5.0:
  Successfully uninstalled keras-3.5.0
Found existing installation: tensorflow 2.17.1
Uninstalling tensorflow-2.17.1:
  Successfully uninstalled tensorflow-2.17.1
Collecting tensorflow==2.12
  Downloading tensorflow-2.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting tensorflow-model-optimization
  Downloading tensorflow_model_optimization-0.8.0-py2.py3-none-any.whl.metadata (904 bytes)
Collecting gast<=0.4.0,>=0.2.1 (from tensorflow==2.12)
  Downloading gast-0.4.0-py3-none-any.whl.metadata (1.1 kB)
Collecting keras<2.13,>=2.12.0 (from tensorflow==2.12)
  Downloading keras-2.12.0-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting numpy<1.24,>=1.22 (from tensorflow==2.12)
  Downloading numpy-1.23.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Collecting tensorboard<2.13,>=2.12 (from tensorflow==2.12)
  Downloading tensorboard-2.12.3-py3-none-an

In [1]:
# Import required libraries

import os
os.environ["KERAS_BACKEND"] = "tensorflow"

import tensorflow as tf
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Part I: Model Pruning with Sparsity
This part of the notebook demonstrates how pruning can reduce the size of a model by removing insignificant weights. However, pruning alone does not shrink the saved model; two additional steps are needed:
1. Remove the pruning mask to finalize the pruned model.
2. Use TensorFlow Lite's **experimental sparsity-aware optimization** when converting to TFLite format.

## Objectives:
1. Compare model size before and after removing the pruning mask.
2. Show the impact of experimental sparsity optimization on reducing the TFLite model size.


## Dataset Preparation
We use the MNIST dataset, which contains grayscale images of handwritten digits (0-9).
1. Normalize the pixel values to the range [0, 1].
2. Reshape the data for input into the CNN model.
3. One-hot encode the labels for classification.

In [2]:
# Import required libraries
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load and preprocess the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

print(f"Training data shape: {x_train.shape}, Labels shape: {y_train.shape}")
print(f"Test data shape: {x_test.shape}, Labels shape: {y_test.shape}")


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
Training data shape: (60000, 28, 28, 1), Labels shape: (60000, 10)
Test data shape: (10000, 28, 28, 1), Labels shape: (10000, 10)


## Training a Simple CNN
We build a Convolutional Neural Network (CNN) with the following layers:
1. **Convolutional Layer**: Extracts features from the input images.
2. **MaxPooling Layer**: Reduces spatial dimensions, lowering computational requirements.
3. **Flatten Layer**: Converts the 2D feature maps into a 1D vector.
4. **Dense Layers**: Fully connected layers for classification.

The model is compiled using the Adam optimizer and trained for 2 epochs.
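
The architecture above fixes the number of trainable parameters, which in turn gives a rough estimate of the float32 model size. The sketch below works through that arithmetic, assuming the layer sizes used in this notebook's model (32 filters of 3x3 with valid padding, 2x2 pooling, then dense layers of 128 and 10 units). The helper names are illustrative.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels):
    """Kernel weights plus one bias per output channel."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


def dense_params(in_units, out_units):
    """Weight matrix plus one bias per output unit."""
    return in_units * out_units + out_units


conv = conv2d_params(3, 3, 1, 32)

# Valid 3x3 convolution: 28 -> 26; 2x2 max pooling: 26 -> 13
flattened_units = 13 * 13 * 32

dense_hidden = dense_params(flattened_units, 128)
dense_output = dense_params(128, 10)

total_params = conv + dense_hidden + dense_output
estimated_kb = total_params * 4 / 1024  # 4 bytes per float32 weight

print(f"Total parameters: {total_params}")
print(f"Estimated float32 size: {estimated_kb:.2f} KB")
```

The estimate of roughly 2711 KB is close to the 2713.34 KB reported for the baseline TFLite file later in the notebook; the small remainder is presumably graph structure and metadata.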


In [3]:
# Build a simple CNN model
def create_cnn_model():
    inputs = Input(shape=(28, 28, 1))
    x = Conv2D(32, (3, 3), activation='relu')(inputs)
    x = MaxPooling2D((2, 2))(x)
    x = Flatten()(x)
    x = Dense(128, activation='relu')(x)
    outputs = Dense(10, activation='softmax')(x)
    return Model(inputs, outputs)

# Compile and train the model
model = create_cnn_model()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, y_train, epochs=2, batch_size=32, validation_data=(x_test, y_test))

# Evaluate the trained model
baseline_accuracy = model.evaluate(x_test, y_test, verbose=0)[1]
print(f"Baseline Model Accuracy: {baseline_accuracy:.4f}")


Epoch 1/2
Epoch 2/2
Baseline Model Accuracy: 0.9841


## Pruning the Model
We apply pruning using TensorFlow Model Optimization. This process sparsifies the model by removing insignificant weights while maintaining comparable accuracy.
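
As a rough illustration of what "removing insignificant weights" means, the sketch below zeroes the smallest-magnitude entries of a weight array until a target sparsity is reached. It is a simplified NumPy stand-in, not the TensorFlow Model Optimization implementation. The function name `magnitude_prune` and the sample weights are made up for the example.

```python
import numpy as np


def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of entries with the smallest absolute value."""
    flat = weights.ravel()
    num_to_prune = int(round(sparsity * flat.size))
    mask = np.ones(flat.size, dtype=weights.dtype)
    # Indices of the smallest-magnitude weights
    prune_idx = np.argsort(np.abs(flat))[:num_to_prune]
    mask[prune_idx] = 0
    return (flat * mask).reshape(weights.shape), mask.reshape(weights.shape)


weights = np.array(
    [0.5, -0.1, 0.03, -0.8, 0.2, -0.02, 0.9, 0.01, -0.4, 0.05], dtype=np.float32
)
pruned, mask = magnitude_prune(weights, sparsity=0.5)

print(pruned)
print("sparsity:", float(np.mean(pruned == 0)))
```

The `mask` array plays the same role as the pruning mask that the library attaches to each pruned layer during training.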


In [4]:
from tensorflow_model_optimization.sparsity.keras import (
    prune_low_magnitude,
    PolynomialDecay,
    UpdatePruningStep
)

# Apply pruning to the model
def apply_pruning(model):
    pruning_params = {
        'pruning_schedule': PolynomialDecay(
            initial_sparsity=0.5,
            final_sparsity=0.9,
            begin_step=0,
            end_step=2000
        )
    }
    pruned_model = prune_low_magnitude(model, **pruning_params)
    return pruned_model

# Compile and train the pruned model
pruned_model = apply_pruning(create_cnn_model())
pruned_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Include the UpdatePruningStep callback
callbacks = [UpdatePruningStep()]

# Train the pruned model
pruned_model.fit(
    x_train, y_train,
    epochs=2,
    batch_size=32,
    validation_data=(x_test, y_test),
    callbacks=callbacks  # Add the required callback
)

# Evaluate the pruned model
pruned_accuracy = pruned_model.evaluate(x_test, y_test, verbose=0)[1]
print(f"Pruned Model Accuracy: {pruned_accuracy:.4f}")


Epoch 1/2
Epoch 2/2
Pruned Model Accuracy: 0.9695
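
The `PolynomialDecay` schedule configured above ramps the target sparsity from 0.5 to 0.9 over the first 2000 training steps. The sketch below reproduces the shape of that ramp in plain Python. The exponent of 3 is an assumption about the library default, and the sketch ignores how often the library actually updates the mask, so treat it as an approximation rather than the library's exact behavior.

```python
import math


def polynomial_decay_sparsity(step, initial_sparsity=0.5, final_sparsity=0.9,
                              begin_step=0, end_step=2000, power=3):
    """Target sparsity at a training step (power=3 is an assumed default)."""
    step = min(max(step, begin_step), end_step)
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1 - progress) ** power


steps_per_epoch = math.ceil(60000 / 32)  # batches per epoch for MNIST at batch_size=32
total_steps = 2 * steps_per_epoch        # two epochs of training

for step in (0, 500, 1000, 2000, total_steps):
    print(step, round(polynomial_decay_sparsity(step), 4))
```

Because `end_step=2000` falls before the end of training, the remaining steps run at the final sparsity, which gives the model some time to recover accuracy.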


## Saving the Pruned Model Without Removing the Pruning Mask
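Pruning introduces a pruning mask to track sparse connections. If we convert the model without removing the mask, the mask is stored alongside the weights, so the file does not shrink. In this run it is about twice the size of the baseline (5427.45 KB versus 2713.34 KB).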
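
A back-of-the-envelope estimate shows why keeping the mask can inflate the file. The sketch assumes each pruned kernel carries a float32 mask of the same shape and that biases are not pruned. Both are assumptions about how the pruning wrapper stores its state, and the layer sizes are those of the CNN defined earlier.

```python
# Kernel and bias sizes for the CNN defined earlier: Conv2D(32) -> Dense(128) -> Dense(10)
kernel_params = 3 * 3 * 1 * 32 + (13 * 13 * 32) * 128 + 128 * 10
bias_params = 32 + 128 + 10

BYTES_PER_FLOAT32 = 4
weights_only_kb = (kernel_params + bias_params) * BYTES_PER_FLOAT32 / 1024

# Assumption: one float32 mask entry per kernel weight
with_mask_kb = (2 * kernel_params + bias_params) * BYTES_PER_FLOAT32 / 1024

print(f"weights only:    {weights_only_kb:.2f} KB")
print(f"weights + masks: {with_mask_kb:.2f} KB")
```

Under these assumptions the estimate is about 5421 KB, which is close to the size printed for `pruned_model_with_mask.tflite` in the comparison cell later in the notebook.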


In [5]:
# Save the pruned model without removing the pruning mask
converter = tf.lite.TFLiteConverter.from_keras_model(pruned_model)
tflite_model_with_mask = converter.convert()

# Save the TFLite model
with open("pruned_model_with_mask.tflite", "wb") as f:
    f.write(tflite_model_with_mask)

print("Saved pruned model with pruning mask.")




Saved pruned model with pruning mask.


## Removing the Pruning Mask and Using Sparsity Optimization
To reduce the model size, we must:
1. Strip the pruning mask using `strip_pruning`.
2. Use TensorFlow Lite's experimental sparsity-aware optimization.
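
To see why a sparsity-aware format helps once the mask is gone, the sketch below stores only the non-zero entries of a small matrix in a compressed sparse row (CSR) layout. This is a generic illustration in plain Python. TensorFlow Lite uses its own sparse tensor encoding, which is not reproduced here, and `to_csr` is a hypothetical helper.

```python
def to_csr(dense):
    """Encode a list-of-lists matrix as (values, column indices, row pointers)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        # Each row pointer marks where the next row starts in `values`
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
    [0, 0, 0, 0],
]
values, col_idx, row_ptr = to_csr(dense)

dense_entries = sum(len(row) for row in dense)
sparse_entries = len(values) + len(col_idx) + len(row_ptr)
print(values, col_idx, row_ptr)
print(f"dense entries: {dense_entries}, sparse entries: {sparse_entries}")
```

The saving grows with sparsity: at low sparsity the index overhead can outweigh the values that are dropped, which is one reason a high final sparsity such as 0.9 matters here.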


In [6]:
from tensorflow_model_optimization.sparsity.keras import strip_pruning

# Strip the pruning mask
stripped_model = strip_pruning(pruned_model)

# Convert the stripped model with experimental sparsity optimization
converter = tf.lite.TFLiteConverter.from_keras_model(stripped_model)
converter.optimizations = [tf.lite.Optimize.EXPERIMENTAL_SPARSITY]
tflite_model_sparse = converter.convert()

# Save the optimized sparse TFLite model
with open("pruned_model_sparse.tflite", "wb") as f:
    f.write(tflite_model_sparse)

print("Saved pruned model with experimental sparsity optimization.")




Saved pruned model with experimental sparsity optimization.


## Comparing Model Sizes and Accuracy
We compare the sizes of:
1. The baseline model.
2. The pruned model without removing the pruning mask.
3. The pruned model with experimental sparsity optimization.


In [7]:
import os

# Compare model sizes
model_files = [
    "pruned_model_with_mask.tflite",
    "pruned_model_sparse.tflite"
]

print("\nModel Sizes (KB):")
for file in model_files:
    print(f"{file}: {os.path.getsize(file) / 1024:.2f} KB")

# Evaluate accuracy for sparse TFLite model
def evaluate_tflite_model(tflite_model_path):
    interpreter = tf.lite.Interpreter(model_path=tflite_model_path)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    correct_predictions = 0
    for i in range(len(x_test)):
        input_data = x_test[i:i+1].astype("float32")
        interpreter.set_tensor(input_details[0]['index'], input_data)
        interpreter.invoke()
        output_data = interpreter.get_tensor(output_details[0]['index'])
        if tf.argmax(output_data, axis=1) == tf.argmax(y_test[i:i+1], axis=1):
            correct_predictions += 1

    return correct_predictions / len(x_test)

# Print accuracies
print("\nModel Accuracies:")
print(f"Pruned Model with Mask Accuracy: {pruned_accuracy:.4f}")
print(f"Pruned Sparse Model Accuracy: {evaluate_tflite_model('pruned_model_sparse.tflite'):.4f}")



Model Sizes (KB):
pruned_model_with_mask.tflite: 5427.45 KB
pruned_model_sparse.tflite: 411.48 KB

Model Accuracies:
Pruned Model with Mask Accuracy: 0.9695
Pruned Sparse Model Accuracy: 0.9695


In [8]:
# Convert the baseline (unpruned) model to TensorFlow Lite format
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model_baseline = converter.convert()

# Save the baseline TFLite model
with open("baseline_model.tflite", "wb") as f:
    f.write(tflite_model_baseline)

# Print the size of the baseline model
import os
baseline_model_size = os.path.getsize("baseline_model.tflite") / 1024  # Size in KB
print(f"Baseline Model Size: {baseline_model_size:.2f} KB")



Baseline Model Size: 2713.34 KB
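
The sizes printed so far can be put on a common scale relative to the baseline. The snippet below only does that arithmetic, with the three sizes copied from the cell outputs above.

```python
# Sizes in KB, copied from the cell outputs above
sizes_kb = {
    "baseline": 2713.34,
    "pruned_with_mask": 5427.45,
    "pruned_sparse": 411.48,
}

for name, size in sizes_kb.items():
    ratio = size / sizes_kb["baseline"]
    print(f"{name}: {size:.2f} KB ({ratio:.2f}x baseline)")

compression = sizes_kb["baseline"] / sizes_kb["pruned_sparse"]
print(f"Sparse model is {compression:.1f}x smaller than the baseline")
```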


# Part I Summary
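- Saving the pruned model without removing the pruning mask does not reduce the size; at 5427.45 KB it is roughly twice the 2713.34 KB baseline.
- Stripping the pruning mask and then using experimental sparsity-aware optimization reduces the model to 411.48 KB, about 6.6 times smaller than the baseline, with no change in the pruned model's accuracy (0.9695, compared with 0.9841 for the unpruned baseline).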
- Pruning combined with sparsity is highly effective for compressing models for deployment on resource-constrained devices.
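
One way to check that pruning actually reached its target is to measure the fraction of zero weights in each kernel. The helper below is a sketch that works on any list of NumPy arrays, for example the list returned by `stripped_model.get_weights()`. The name `weight_sparsity` and the synthetic arrays are illustrative only.

```python
import numpy as np


def weight_sparsity(arrays):
    """Return the fraction of exactly-zero entries for each array."""
    return [float(np.mean(a == 0)) for a in arrays]


# Synthetic stand-ins for layer weights: a 90%-sparse kernel and a dense bias
rng = np.random.default_rng(0)
kernel = rng.normal(size=(10, 10))
kernel.flat[:90] = 0.0  # zero the first 90 of 100 entries
bias = np.ones(10)

sparsities = weight_sparsity([kernel, bias])
print(sparsities)
```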


# Part II: Model Pruning + Quantization

### Overview
In this section, we combine **model pruning** and **quantization** to optimize a neural network for deployment on resource-constrained devices.

- **Pruning** removes insignificant connections (weights) in the model, introducing sparsity, which can reduce model size and computational requirements.
- **Quantization** reduces the precision of weights and activations, further compressing the model and enabling efficient inference.

### Goals
1. Apply pruning to the model to introduce sparsity.
2. Explore the impact of:
   - Saving the pruned model **without removing the pruning mask**.
   - Removing the pruning mask and enabling sparsity-aware optimizations.
3. Quantize the pruned model using **Full Integer Quantization** and measure its impact on:
   - Model size.
   - Accuracy on the test set.

### Key Points
- **Pruning Masks:**
  Pruning introduces masks to track sparse connections in the model. These masks must be removed before final deployment to reduce size.
  
- **Sparsity-Aware Optimization:**
  TensorFlow Lite’s `EXPERIMENTAL_SPARSITY` optimization leverages the sparse structure of pruned models to significantly reduce storage and computation.

- **Quantization:**
  Integer quantization further reduces the model's size and enables inference on hardware that supports integer arithmetic, such as microcontrollers.
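
The idea behind integer quantization can be sketched with the usual affine mapping between real values and 8-bit integers. The code below is a minimal NumPy illustration with a hand-picked scale and zero point. The function names are hypothetical, and the converter's own per-tensor and per-channel handling is more involved.

```python
import numpy as np


def quantize(x, scale, zero_point, dtype=np.int8):
    """Map real values to integers: q = round(x / scale) + zero_point, clipped to the dtype range."""
    info = np.iinfo(dtype)
    q = np.round(x / scale) + zero_point
    return np.clip(q, info.min, info.max).astype(dtype)


def dequantize(q, scale, zero_point):
    """Recover approximate real values: x ~ scale * (q - zero_point)."""
    return scale * (q.astype(np.float32) - zero_point)


scale, zero_point = 1.0 / 255.0, -128  # maps [0, 1] onto the int8 range
x = np.array([0.0, 0.25, 0.75, 1.0], dtype=np.float32)

q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)

print(q)
print(x_hat)
print("max error:", float(np.max(np.abs(x - x_hat))))
```

Each value is stored in one byte instead of four, at the cost of a rounding error of at most about half of `scale`.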

### Steps Demonstrated
1. Prune the model and save it **with the pruning mask**.
2. Remove the pruning mask and apply **sparsity-aware optimization**.
3. Quantize the pruned models (both with and without the mask) to **Full Integer Quantization**.
4. Compare the model sizes and test accuracies across:
   - Pruned Model with Mask
   - Pruned Model with Sparsity Optimization
   - Fully Integer Quantized Models
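
Full integer quantization needs a representative dataset so that the converter can observe value ranges and choose quantization parameters. The sketch below shows one common asymmetric min/max scheme for deriving a scale and zero point from sample values. It illustrates the idea under stated assumptions and is not the converter's actual calibration code; `calibrate_int8` is a hypothetical name.

```python
def calibrate_int8(samples, qmin=-128, qmax=127):
    """Derive (scale, zero_point) from observed values using a min/max range."""
    lo = min(min(samples), 0.0)  # the range must contain 0 so that 0.0 maps exactly
    hi = max(max(samples), 0.0)
    if hi == lo:
        return 1.0, 0            # degenerate case: all samples are zero
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


# Normalized pixel values lie in [0, 1]
scale, zero_point = calibrate_int8([0.0, 0.1, 0.5, 1.0])
print(scale, zero_point)
```

A representative set that misses part of the real input range would produce a scale that clips those values, which is why the samples should resemble the data seen at inference time.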

In [9]:
# Full Integer Quantization for Pruned Model with Mask

def representative_data_gen():
    for input_value in x_test[:100]:  # Use a subset of the test set
        # Yield a list holding one float32 batch of shape (1, 28, 28, 1) for the model's single input
        yield [input_value.reshape(1, 28, 28, 1).astype("float32")]


converter = tf.lite.TFLiteConverter.from_keras_model(pruned_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen  # Use the same representative dataset generator
converter.target_spec.supported_types = [tf.int8]
tflite_model_with_mask_int8 = converter.convert()

# Save the quantized model
with open("pruned_model_with_mask_int8.tflite", "wb") as f:
    f.write(tflite_model_with_mask_int8)
print("Saved Full Integer Quantized Model (with Mask).")

# Full Integer Quantization for Stripped Model
converter = tf.lite.TFLiteConverter.from_keras_model(stripped_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT, tf.lite.Optimize.EXPERIMENTAL_SPARSITY]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_types = [tf.int8]
tflite_model_sparse_int8 = converter.convert()

# Save the quantized model
with open("pruned_model_sparse_int8.tflite", "wb") as f:
    f.write(tflite_model_sparse_int8)
print("Saved Full Integer Quantized Model (Stripped).")




Saved Full Integer Quantized Model (with Mask).




Saved Full Integer Quantized Model (Stripped).


In [10]:
# Compare model sizes after full integer quantization
quantized_model_files = [
    "pruned_model_with_mask_int8.tflite",
    "pruned_model_sparse_int8.tflite"
]

print("\nModel Sizes After Full Integer Quantization (KB):")
for file in quantized_model_files:
    print(f"{file}: {os.path.getsize(file) / 1024:.2f} KB")



Model Sizes After Full Integer Quantization (KB):
pruned_model_with_mask_int8.tflite: 3397.02 KB
pruned_model_sparse_int8.tflite: 209.93 KB


In [16]:
import numpy as np

def evaluate_tflite_model(tflite_model_path):
    # Load the TFLite model
    interpreter = tf.lite.Interpreter(model_path=tflite_model_path)
    interpreter.allocate_tensors()

    # Get input and output tensor details
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Extract input type and scale/zero point for quantization
    input_type = input_details[0]['dtype']
    input_scale, input_zero_point = input_details[0]['quantization']

    correct_predictions = 0

    for i in range(len(x_test)):
        # Prepare the input data based on quantization parameters
        input_data = x_test[i:i+1].astype("float32")  # Original test input (float32)

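        # Quantize if necessary: round and clip to the integer range before casting
        if input_type == np.uint8 or input_type == np.int8:
            info = np.iinfo(input_type)
            input_data = np.round(input_data / input_scale + input_zero_point)
            input_data = np.clip(input_data, info.min, info.max).astype(input_type)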

        interpreter.set_tensor(input_details[0]['index'], input_data)
        interpreter.invoke()

        # Get the output data
        output_data = interpreter.get_tensor(output_details[0]['index'])

        # Dequantize the output if necessary (cast to float first to avoid integer overflow)
        output_scale, output_zero_point = output_details[0]['quantization']
        if output_scale != 0:  # Quantized output
            output_data = output_scale * (output_data.astype(np.float32) - output_zero_point)

        # Compare predictions
        if np.argmax(output_data) == np.argmax(y_test[i]):
            correct_predictions += 1

    return correct_predictions / len(x_test)


# Evaluate accuracy for quantized TFLite models
#accuracy_with_mask = evaluate_tflite_model("pruned_model_with_mask_int8.tflite")
accuracy_sparse = evaluate_tflite_model("pruned_model_sparse_int8.tflite")

#print(f"Pruned Model with Mask Accuracy: {accuracy_with_mask:.4f}")
print(f"Pruned Sparse Model Accuracy: {accuracy_sparse:.4f}")


Pruned Sparse Model Accuracy: 0.9691


# Part II Summary
- **Model Size Comparison**:
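  - Full Integer Quantization on the pruned model with the mask still produces a 3397.02 KB file, which is larger than the unquantized 2713.34 KB baseline.
  - Full Integer Quantization on the stripped sparse model produces a 209.93 KB file, a much smaller size due to the removal of the pruning mask and sparsity-aware optimization.

- **Accuracy Comparison**:
  - The stripped sparse int8 model reaches 0.9691 test accuracy, nearly identical to the 0.9695 of the pruned model before quantization, so stripping the mask and quantizing costs almost no accuracy. The quantized model with the mask is not evaluated in this run, because its evaluation lines are commented out.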

This demonstrates the importance of properly finalizing a pruned model for deployment.

