<a href="https://colab.research.google.com/github/LRManamperi/Machine-Learning/blob/main/tinyML/Model_Pruning_(with_Quantization)_Student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing Required Packages

### Note: Resolving `keras.src` Namespace Issue
When using TensorFlow and TensorFlow Model Optimization in Colab, you may encounter a `keras.src` namespace issue, causing incompatibility with `tensorflow_model_optimization.quantization.keras`. To resolve this:

1. Set the `KERAS_BACKEND` environment variable to `tensorflow` before importing TensorFlow.
2. Use mutually compatible versions of TensorFlow and TensorFlow Model Optimization; the install cell below pins `tensorflow==2.12`.
3. Clone the model using `tensorflow.keras.models.clone_model()` so that it lives in the `tensorflow.keras` namespace.
4. After reinstalling TensorFlow-related packages, restart the runtime so that previously imported versions do not linger.

This ensures that all operations use the correct `tensorflow.keras` implementation, avoiding compatibility issues.


In [None]:
!pip uninstall -y keras tensorflow tensorflow-model-optimization
!pip install tensorflow==2.12 tensorflow-model-optimization

Found existing installation: keras 2.12.0
Uninstalling keras-2.12.0:
  Successfully uninstalled keras-2.12.0
Found existing installation: tensorflow 2.12.0
Uninstalling tensorflow-2.12.0:
  Successfully uninstalled tensorflow-2.12.0
Found existing installation: tensorflow-model-optimization 0.8.0
Uninstalling tensorflow-model-optimization-0.8.0:
  Successfully uninstalled tensorflow-model-optimization-0.8.0
Collecting tensorflow==2.12
  Using cached tensorflow-2.12.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting tensorflow-model-optimization
  Using cached tensorflow_model_optimization-0.8.0-py2.py3-none-any.whl.metadata (904 bytes)
Collecting keras<2.13,>=2.12.0 (from tensorflow==2.12)
  Using cached keras-2.12.0-py2.py3-none-any.whl.metadata (1.4 kB)
Using cached tensorflow-2.12.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (586.0 MB)
Using cached tensorflow_model_optimization-0.8.0-py2.py3-none-any.whl (242 kB)
Using cached ke

In [None]:
# Import required libraries

import os
os.environ["KERAS_BACKEND"] = "tensorflow"

import tensorflow as tf
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Part I: Model Pruning with Sparsity
This part of the notebook demonstrates how pruning can reduce the size of a model by zeroing out insignificant weights. Pruning alone, however, does not shrink the saved model; two additional steps are needed:
1. Remove the pruning mask to finalize the pruned model.
2. Use TensorFlow Lite's **experimental sparsity-aware optimization** when converting to TFLite format.

## Objectives:
1. Compare model size before and after removing the pruning mask.
2. Show the impact of experimental sparsity optimization on reducing the TFLite model size.


## Dataset Preparation
We use the MNIST dataset, which contains grayscale images of handwritten digits (0-9).
1. Normalize the pixel values to the range [0, 1].
2. Reshape the data for input into the CNN model.
3. One-hot encode the labels for classification.

In [None]:
# Import required libraries
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load and preprocess the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
x_train = x_train.reshape(-1, 28, 28, 1)
x_test = x_test.reshape(-1, 28, 28, 1)
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

print(f"Training data shape: {x_train.shape}, Labels shape: {y_train.shape}")
print(f"Test data shape: {x_test.shape}, Labels shape: {y_test.shape}")


Training data shape: (60000, 28, 28, 1), Labels shape: (60000, 10)
Test data shape: (10000, 28, 28, 1), Labels shape: (10000, 10)


## Training a Simple CNN
We build a Convolutional Neural Network (CNN) with the following layers:
1. **Convolutional Layer**: Extracts features from the input images.
2. **MaxPooling Layer**: Reduces spatial dimensions, lowering computational requirements.
3. **Flatten Layer**: Converts the 2D feature maps into a 1D vector.
4. **Dense Layers**: Fully connected layers for classification.

The model is compiled using the Adam optimizer and trained for 2 epochs.
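
As a rough sanity check on this architecture, the parameter count can be worked out by hand for the layer sizes used in the cell below (32 filters of size 3x3, a 2x2 pool, then 128 and 10 dense units). The sketch assumes the Keras defaults of `'valid'` padding and stride 1 for `Conv2D`; the helper names `conv2d_params` and `dense_params` are illustrative and not part of any library.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(in_features, units):
    """Weights plus one bias per unit."""
    return in_features * units + units


# Conv2D(32, (3, 3)) on a 28x28x1 input with 'valid' padding -> 26x26x32
conv = conv2d_params(3, 3, 1, 32)
conv_out_side = 28 - 3 + 1

# MaxPooling2D((2, 2)) halves each spatial dimension -> 13x13x32
pooled_side = conv_out_side // 2
flattened = pooled_side * pooled_side * 32

dense1 = dense_params(flattened, 128)
dense2 = dense_params(128, 10)

total_params = conv + dense1 + dense2
float32_kb = total_params * 4 / 1024  # 4 bytes per float32 weight

print(f"Total parameters: {total_params:,}")
print(f"Approximate float32 weight storage: {float32_kb:.1f} KB")
```

Almost all of the parameters sit in the first dense layer, which is why pruning and sparse storage have such a large effect on this model's size.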


In [None]:
# Build a simple CNN model
def create_cnn_model():
    inputs = Input(shape=(28, 28, 1))
    x = Conv2D(32, (3, 3), activation='relu')(inputs)
    x = MaxPooling2D((2, 2))(x)
    x = Flatten()(x)
    x = Dense(128, activation='relu')(x)
    outputs = Dense(10, activation='softmax')(x)
    return Model(inputs, outputs)

# Compile and train the model
model = create_cnn_model()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, y_train, epochs=2, batch_size=32, validation_data=(x_test, y_test))

# Evaluate the trained model
baseline_accuracy = model.evaluate(x_test, y_test, verbose=0)[1]
print(f"Baseline Model Accuracy: {baseline_accuracy:.4f}")


Epoch 1/2
 557/1875 ━━━━━━━━━━━━━━━━━━━━ 26s 20ms/step - accuracy: 0.8407 - loss: 0.5367

KeyboardInterrupt: 

## Pruning the Model
We apply pruning using TensorFlow Model Optimization. During training, this process sparsifies the model by setting its lowest-magnitude weights to zero while aiming to keep accuracy comparable to the unpruned model.
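
To make the idea of dropping insignificant weights concrete, here is a minimal one-shot sketch in plain Python. It is not how `prune_low_magnitude` is implemented (the library applies masks per layer, gradually, during training); the function name `prune_by_magnitude` and the sample weights are made up for illustration.

```python
def prune_by_magnitude(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to 0.0."""
    num_to_prune = int(round(sparsity * len(weights)))
    # Indices sorted from smallest to largest absolute value
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:num_to_prune])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.1, -0.9, 0.04, 0.5]
pruned = prune_by_magnitude(weights, sparsity=0.5)
zero_fraction = sum(1 for w in pruned if w == 0.0) / len(pruned)

print(pruned)
print(f"Fraction of zeros: {zero_fraction:.2f}")
```

The large weights survive unchanged, which is the intuition behind why accuracy can stay close to the baseline.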

### 🔍 Explanation of `pruning_params`

The `pruning_params` dictionary configures how pruning is applied using a `PolynomialDecay` schedule:

- **`initial_sparsity`**:  
  The starting proportion of zero weights when pruning begins.  
  **Example**: `0.5` means 50% of the weights will be pruned at `begin_step`.

- **`final_sparsity`**:  
  The target proportion of zero weights to reach by `end_step`.  
  **Example**: `0.9` means 90% of the weights will be pruned by the end of the schedule.

- **`begin_step`**:  
  The training step at which pruning begins.  
  Typically set to `0` to start pruning from the beginning of training.

- **`end_step`**:  
  The training step at which pruning ends.  
  After this step, the sparsity level remains fixed at `final_sparsity`.
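
The four parameters above can be read as inputs to a single curve. The sketch below reproduces the shape of a polynomial decay schedule in plain Python, assuming an exponent of 3 (believed to be the library default for `PolynomialDecay`, but treat it as an assumption). The function `polynomial_decay_sparsity` is a hypothetical helper, and the clamping outside `[begin_step, end_step]` is a simplification.

```python
def polynomial_decay_sparsity(step, initial_sparsity, final_sparsity,
                              begin_step, end_step, power=3):
    """Target sparsity at a given training step (sketch of the schedule's formula)."""
    # Fraction of the schedule completed, clamped to [0, 1]
    progress = (step - begin_step) / (end_step - begin_step)
    progress = min(1.0, max(0.0, progress))
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power


for step in (0, 500, 1000, 2000, 3750):
    s = polynomial_decay_sparsity(step, 0.5, 0.9, begin_step=0, end_step=2000)
    print(f"step {step:>4}: target sparsity = {s:.3f}")
```

With this shape, sparsity rises quickly at first and flattens as it approaches `final_sparsity`, so most of the pruning happens early while the network still has many training steps left to recover.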

---

### 🧮 How to Estimate `begin_step` and `end_step`

Each training **step** corresponds to one batch update. Use the following formulas to calculate the total number of training steps:

$$ \text{steps_per_epoch} = \left\lceil \frac{\text{num_train_samples}}{\text{batch_size}} \right\rceil $$

$$ \text{total_steps} = \text{steps_per_epoch} \times \text{num_epochs} $$


**Example**:  
If `num_train_samples = 60000`, `batch_size = 32`, and `num_epochs = 2`:

$$ \text{steps_per_epoch} = \left\lceil \frac{60000}{32} \right\rceil = 1875 $$

$$ \text{total_steps} = 1875 \times 2 = 3750 $$

You could then set:

- `begin_step = 0`
- `end_step = 2000`  _(pruning is applied during the first ~half of training)_
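
The same arithmetic can be checked in a few lines of Python. The variable names mirror the formulas above, and the value of `end_step` is simply the illustrative choice from the list, not a recommended setting.

```python
import math

num_train_samples = 60000
batch_size = 32
num_epochs = 2

# One step per batch; the last batch of an epoch may be partial, hence ceil
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Illustrative choice: finish ramping up sparsity a little after the halfway point
end_step = 2000

print(f"steps_per_epoch = {steps_per_epoch}, total_steps = {total_steps}")
print(f"end_step covers {end_step / total_steps:.0%} of training")
```

If you change the batch size or the number of epochs, recompute `total_steps` first and then pick `end_step` so that some training steps remain after the final sparsity is reached.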

In [None]:
from tensorflow_model_optimization.sparsity.keras import (
    prune_low_magnitude,
    PolynomialDecay,
    UpdatePruningStep
)

# Apply pruning to the model using a polynomial decay schedule
def apply_pruning(model):
    pruning_params = {
        'pruning_schedule': PolynomialDecay(
            initial_sparsity=0.5,   # Start pruning with 50% of weights set to zero
            final_sparsity=0.9,     # Gradually increase sparsity to 90% by end_step
            begin_step=0,           # Start pruning from the first training step
            end_step=2000           # End pruning at step 2000
        )
    }
    pruned_model = prune_low_magnitude(model, **pruning_params)
    return pruned_model

# Compile and train the pruned model
pruned_model = apply_pruning(create_cnn_model())
pruned_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Add the pruning update callback (required)
callbacks = [UpdatePruningStep()]

# Train the pruned model
pruned_model.fit(
    x_train, y_train,
    epochs=2,
    batch_size=32,
    validation_data=(x_test, y_test),
    callbacks=callbacks  # Important: ensures pruning step is updated during training
)

# Evaluate the pruned model
pruned_accuracy = pruned_model.evaluate(x_test, y_test, verbose=0)[1]
print(f"Pruned Model Accuracy: {pruned_accuracy:.4f}")

ModuleNotFoundError: No module named 'tensorflow_model_optimization'

## Saving the Pruned Model Without Removing the Pruning Mask
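Pruning wraps each prunable layer with a mask that tracks which connections have been zeroed out. If we convert the model without removing these masks, the file does not get smaller; the extra wrapper variables can even make it larger than the unpruned baseline.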


In [None]:
# Save the pruned model without removing the pruning mask
converter = tf.lite.TFLiteConverter.from_keras_model(pruned_model)
tflite_model_with_mask = converter.convert()

# Save the TFLite model
with open("pruned_model_with_mask.tflite", "wb") as f:
    f.write(tflite_model_with_mask)

print("Saved pruned model with pruning mask.")




Saved pruned model with pruning mask.


## Removing the Pruning Mask and Using Sparsity Optimization
To reduce the model size, we must:
1. Strip the pruning mask using `strip_pruning`.
2. Use TensorFlow Lite's experimental sparsity-aware optimization.
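
Why does a sparsity-aware format help? A dense file stores every weight, zeros included, whereas a sparse encoding only needs the non-zero entries plus their positions. The toy comparison below illustrates this with the standard library only; the (index, value) layout is an assumption made for illustration and is not TensorFlow Lite's actual sparse tensor format.

```python
import random
import struct

random.seed(0)
num_weights = 10_000
sparsity = 0.9

# Build a weight vector in which roughly 90% of the entries are exactly zero
weights = [0.0 if random.random() < sparsity else random.uniform(-1, 1)
           for _ in range(num_weights)]

# Dense storage: every weight takes 4 bytes (float32), zeros included
dense_bytes = len(struct.pack(f"<{num_weights}f", *weights))

# Toy sparse storage: one (uint32 index, float32 value) pair per non-zero weight
nonzero = [(i, w) for i, w in enumerate(weights) if w != 0.0]
sparse_bytes = sum(len(struct.pack("<If", i, w)) for i, w in nonzero)

print(f"Non-zero weights: {len(nonzero)} of {num_weights}")
print(f"Dense: {dense_bytes} bytes, toy sparse: {sparse_bytes} bytes")
```

Even with the overhead of storing an index next to each value, the sparse layout is much smaller at high sparsity. At low sparsity the index overhead would outweigh the savings.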


In [None]:
from tensorflow_model_optimization.sparsity.keras import strip_pruning

# Strip the pruning mask
stripped_model = strip_pruning(pruned_model)

# Convert the stripped model with experimental sparsity optimization
converter = tf.lite.TFLiteConverter.from_keras_model(stripped_model)
converter.optimizations = [tf.lite.Optimize.EXPERIMENTAL_SPARSITY]
tflite_model_sparse = converter.convert()

# Save the optimized sparse TFLite model
with open("pruned_model_sparse.tflite", "wb") as f:
    f.write(tflite_model_sparse)

print("Saved pruned model with experimental sparsity optimization.")




Saved pruned model with experimental sparsity optimization.


## Comparing Model Sizes and Accuracy
We compare the sizes of:
1. The baseline model.
2. The pruned model without removing the pruning mask.
3. The pruned model with experimental sparsity optimization.


In [None]:
import os

# Compare model sizes
model_files = [
    "pruned_model_with_mask.tflite",
    "pruned_model_sparse.tflite"
]

print("\nModel Sizes (KB):")
for file in model_files:
    print(f"{file}: {os.path.getsize(file) / 1024:.2f} KB")

# Evaluate accuracy for sparse TFLite model
def evaluate_tflite_model(tflite_model_path):
    interpreter = tf.lite.Interpreter(model_path=tflite_model_path)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    correct_predictions = 0
    for i in range(len(x_test)):
        input_data = x_test[i:i+1].astype("float32")
        interpreter.set_tensor(input_details[0]['index'], input_data)
        interpreter.invoke()
        output_data = interpreter.get_tensor(output_details[0]['index'])
        # Both arrays are NumPy arrays, so compare class indices directly
        if output_data.argmax(axis=1)[0] == y_test[i].argmax():
            correct_predictions += 1

    return correct_predictions / len(x_test)

# Print accuracies
print("\nModel Accuracies:")
print(f"Pruned Model with Mask Accuracy: {pruned_accuracy:.4f}")
print(f"Pruned Sparse Model Accuracy: {evaluate_tflite_model('pruned_model_sparse.tflite'):.4f}")



Model Sizes (KB):
pruned_model_with_mask.tflite: 5427.45 KB
pruned_model_sparse.tflite: 411.48 KB

Model Accuracies:
Pruned Model with Mask Accuracy: 0.9740
Pruned Sparse Model Accuracy: 0.9740


In [None]:
# Convert the baseline (unpruned) model to TensorFlow Lite format
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model_baseline = converter.convert()

# Save the baseline TFLite model
with open("baseline_model.tflite", "wb") as f:
    f.write(tflite_model_baseline)

# Print the size of the baseline model
import os
baseline_model_size = os.path.getsize("baseline_model.tflite") / 1024  # Size in KB
print(f"Baseline Model Size: {baseline_model_size:.2f} KB")



Baseline Model Size: 2713.34 KB


# Part I Summary
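- Saving the pruned model without removing the pruning mask does not reduce the size; in the run above, the file was about twice as large as the baseline (5427.45 KB vs. 2713.34 KB).
- Using experimental sparsity-aware optimization after stripping the pruning mask significantly reduces the model size (411.48 KB in the run above) while maintaining accuracy.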
- Pruning combined with sparsity is highly effective for compressing models for deployment on resource-constrained devices.
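
The sizes printed earlier in Part I can be turned into ratios with a short script. The numbers below are copied from the cell outputs in this notebook; they will differ from run to run, so treat them as one sample rather than a benchmark.

```python
# Sizes in KB as printed by the cells above
sizes_kb = {
    "baseline": 2713.34,
    "pruned, mask kept": 5427.45,
    "pruned, stripped + sparse": 411.48,
}

baseline = sizes_kb["baseline"]
for name, size in sizes_kb.items():
    print(f"{name:<28} {size:>8.2f} KB  ({size / baseline:.2f}x baseline)")

compression = baseline / sizes_kb["pruned, stripped + sparse"]
print(f"Compression vs. baseline: {compression:.1f}x")
```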


# Part II: Model Pruning + Quantization

### Overview
In this section, we combine **model pruning** and **quantization** to optimize a neural network for deployment on resource-constrained devices.

- **Pruning** removes insignificant connections (weights) in the model, introducing sparsity, which can reduce model size and computational requirements.
- **Quantization** reduces the precision of weights and activations, further compressing the model and enabling efficient inference.

### Goals
1. Apply pruning to the model to introduce sparsity.
2. Explore the impact of:
   - Saving the pruned model **without removing the pruning mask**.
   - **Removing the pruning mask** and enabling **sparsity-aware optimizations**.
3. Quantize the pruned model using **Float16 Quantization** and measure its impact on:
   - Model size.
   - Accuracy on the test set.

### Key Points
- **Pruning Masks:**  
  Pruning introduces masks to track sparse connections in the model. These masks must be removed before final deployment to reduce size.
  
- **Sparsity-Aware Optimization:**  
  TensorFlow Lite’s `EXPERIMENTAL_SPARSITY` optimization leverages the sparse structure of pruned models to significantly reduce storage and computation.

- **Quantization:**  
  Float16 quantization reduces model size by storing float32 weights as float16, roughly halving their footprint, while activations stay in float32. Accuracy usually stays close to that of the original model, and the format is particularly effective on hardware that supports float16 computation (e.g., GPUs, NPUs).
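
The storage effect of float16 can be seen without TensorFlow at all. The sketch below packs a handful of made-up weights in both precisions using Python's `struct` module (the `e` format code is IEEE 754 half precision). It only illustrates the size and rounding trade-off; the actual conversion in this lab is done by the TFLite converter.

```python
import struct

weights = [0.1234567, -0.9876543, 0.0004321, 0.5, -0.25]

as_float32 = struct.pack(f"<{len(weights)}f", *weights)
as_float16 = struct.pack(f"<{len(weights)}e", *weights)  # 'e' = IEEE 754 half precision

# Read the half-precision values back to see how much rounding was introduced
restored = struct.unpack(f"<{len(weights)}e", as_float16)
max_abs_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"float32: {len(as_float32)} bytes, float16: {len(as_float16)} bytes")
print(f"Largest rounding error: {max_abs_error:.6f}")
```

The byte count halves, and for weights of this magnitude the rounding error is small, which is consistent with float16 models typically staying close to float32 accuracy.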

### Steps Demonstrated
1. Prune the model and save it **with the pruning mask**.
2. Remove the pruning mask and apply **sparsity-aware optimization**.
3. Quantize the pruned models (both with and without the mask) using **Float16 Quantization**.
4. Compare the model sizes and test accuracies across:
   - Pruned Model with Mask  
   - Pruned Model with Sparsity Optimization  
   - Float16 Quantized Models


---

### 🧪 Lab 2 Submission Reminder

**Please complete the code below and take a screenshot of it as part of your Lab 2 submission.**

⬇️⬇️⬇️⬇️⬇️⬇️⬇️⬇️⬇️⬇️⬇️⬇️⬇️⬇️⬇️⬇️⬇️⬇️⬇️⬇️

---


In [None]:
# Float16 Quantization for Pruned Model with Mask

converter = #<--- Enter Your Code Here --->#
converter.optimizations = #<--- Enter Your Code Here --->#
converter.target_spec.supported_types = #<--- Enter Your Code Here --->#
tflite_model_with_mask_float16 = #<--- Enter Your Code Here --->#

# Save the quantized model
with open("pruned_model_with_mask_float16.tflite", "wb") as f:
    f.write(tflite_model_with_mask_float16)
print("Saved Float16 Quantized Model (with Mask).")

# Float16 Quantization for Stripped Model

converter = #<--- Enter Your Code Here --->#
# Make sure to apply appropriate optimizations to account for both pruning and quantization
converter.optimizations = #<--- Enter Your Code Here --->#
converter.target_spec.supported_types = #<--- Enter Your Code Here --->#
tflite_model_sparse_float16 = #<--- Enter Your Code Here --->#

# Save the quantized model
with open("pruned_model_sparse_float16.tflite", "wb") as f:
    f.write(tflite_model_sparse_float16)
print("Saved Float16 Quantized Model (Stripped).")


SyntaxError: invalid syntax (ipython-input-2775117757.py, line 3)

In [None]:
import os

import numpy as np

# Compare model sizes after Float16 quantization
quantized_model_files = [
    "pruned_model_with_mask_float16.tflite",
    "pruned_model_sparse_float16.tflite"
]

print("\nModel Sizes After Float16 Quantization (KB):")
for file in quantized_model_files:
    print(f"{file}: {os.path.getsize(file) / 1024:.2f} KB")

# Evaluate accuracy of a TFLite model on the test set
def evaluate_tflite_model(tflite_model_path):
    # Load the TFLite model and allocate tensors
    interpreter = tf.lite.Interpreter(model_path=tflite_model_path)
    interpreter.allocate_tensors()

    # Get input and output tensor details
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    input_dtype = input_details[0]['dtype']
    input_scale, input_zero_point = input_details[0]['quantization']

    correct_predictions = 0

    for i in range(len(x_test)):
        # Prepare one input sample (shape: [1, 28, 28, 1])
        input_data = x_test[i:i+1].astype("float32")

        # Quantize the input if model expects int8
        if input_dtype == np.int8:
            input_data = input_data / input_scale + input_zero_point
            input_data = np.round(input_data).astype(np.int8)
        elif input_dtype == np.float16:
            input_data = input_data.astype(np.float16)

        # Set the tensor and invoke the interpreter
        interpreter.set_tensor(input_details[0]['index'], input_data)
        interpreter.invoke()

        # Get prediction and compare to true label
        output_data = interpreter.get_tensor(output_details[0]['index'])
        predicted_label = np.argmax(output_data, axis=1)
        true_label = np.argmax(y_test[i:i+1], axis=1)

        if predicted_label == true_label:
            correct_predictions += 1

    # Return classification accuracy
    return correct_predictions / len(x_test)


# Evaluate accuracy for quantized TFLite models
accuracy_with_mask = evaluate_tflite_model("pruned_model_with_mask_float16.tflite")
accuracy_sparse = evaluate_tflite_model("pruned_model_sparse_float16.tflite")

print(f"Pruned Model with Mask + Float16 Quantization Accuracy: {accuracy_with_mask:.4f}")
print(f"Pruned Sparse Model + Float16 Quantization Accuracy: {accuracy_sparse:.4f}")


Model Sizes After Full Integer Quantization (KB):


NameError: name 'os' is not defined

---
⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️⬆️

### 📸 Lab 2 Submission Reminder

**Please take a screenshot of the above result and include it as part of your Lab 2 submission.**

---

# Part II Summary
- **Model Size Comparison**:
  - Float16 quantization of the pruned model with the mask does not achieve optimal compression.
  - Float16 quantization of the stripped sparse model results in a much smaller file, because the pruning mask has been removed and sparsity-aware optimization is applied.

- **Accuracy Comparison**:
  - Both models achieve similar accuracy on the test set, showing that stripping the pruning mask affects only the model size, not its predictions.

This demonstrates the importance of properly finalizing a pruned model for deployment.

