Deep Compression: Pruning & Quantization

Deep compression techniques shrink deep neural networks with little or no loss in accuracy.

This makes models faster, more memory-efficient, and easier to deploy on edge devices such as mobile phones and IoT devices.



Why is Compression Needed?

Deep learning models often have millions of parameters, which demand substantial compute and memory.

Problems with Large Models:

Slow inference time.

High memory usage.

Difficult to deploy on edge devices.

Solution: Use pruning & quantization to compress models while keeping accuracy high.

Pruning (Removing Unnecessary Weights)

Pruning removes low-importance connections from a neural network, reducing the number of parameters.

 Types of Pruning:

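Weight Pruning: Sets individual weights with small magnitudes to zero (unstructured sparsity).

Neuron Pruning: Removes entire neurons that contribute little to the output.

Structured Pruning: Removes whole groups of weights at once, such as entire filters in CNNs.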
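
To make the difference between these types concrete, here is a minimal, framework-free sketch. The function names `prune_weights` and `prune_neurons` and the toy matrix are illustrative assumptions, not part of any library. The matrix is laid out as (inputs, neurons), so each column holds one neuron's incoming weights. Removing a CNN filter follows the same idea as removing a column here.

```python
def prune_weights(matrix, sparsity):
    """Weight pruning: zero the smallest-magnitude fraction of individual weights."""
    magnitudes = sorted(abs(w) for row in matrix for w in row)
    k = int(len(magnitudes) * sparsity)      # number of weights to zero
    if k == 0:
        return [row[:] for row in matrix]
    threshold = magnitudes[k - 1]            # largest magnitude that is removed
    # Note: ties at the threshold are all zeroed, so slightly more than k may go.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]


def prune_neurons(matrix, keep):
    """Neuron pruning: keep only the `keep` columns with the largest L2 norm."""
    n_cols = len(matrix[0])
    norms = [sum(row[j] ** 2 for row in matrix) ** 0.5 for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: norms[j], reverse=True)
    kept = sorted(ranked[:keep])             # preserve original column order
    return [[row[j] for j in kept] for row in matrix]


weights = [
    [0.9, -0.05, 0.4],
    [-0.7, 0.02, 0.1],
    [0.3, -0.01, -0.6],
]

sparse = prune_weights(weights, sparsity=0.5)   # same shape, some zeros
smaller = prune_neurons(weights, keep=2)        # fewer columns
print(sparse)
print(smaller)
```

Weight pruning keeps the matrix shape and only introduces zeros, so it saves space only when the zeros are stored or compressed efficiently. Neuron pruning produces a genuinely smaller matrix.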

Step 1: Train a Simple Model

We first train a fully connected neural network on MNIST.

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np

# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Normalize data
x_train, x_test = x_train / 255.0, x_test / 255.0

# Define a simple model
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=5)
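
As a rough size baseline for this model, the parameter count can be worked out by hand. The sketch below assumes the two Dense layers defined above (784 to 128, then 128 to 10) and counts one bias per output unit. `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# (inputs, outputs) of the two Dense layers; Flatten has no parameters
layer_shapes = [(28 * 28, 128), (128, 10)]

total_params = sum(dense_params(n_in, n_out) for n_in, n_out in layer_shapes)
float32_bytes = total_params * 4   # 4 bytes per 32-bit float
int8_bytes = total_params * 1      # 1 byte per 8-bit integer

print(f"parameters: {total_params}")
print(f"float32 weights: {float32_bytes / 1024:.1f} KiB")
print(f"int8 weights: {int8_bytes / 1024:.1f} KiB")
```

This gives 101,770 parameters, or about 398 KiB as 32-bit floats and about 99 KiB as 8-bit integers, before any file-format overhead.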


Step 2: Apply Weight Pruning


We use the TensorFlow Model Optimization Toolkit (tensorflow_model_optimization) to prune low-magnitude weights while the model continues training.

In [None]:
import tensorflow_model_optimization as tfmot

# Define the pruning schedule: ramp sparsity from 20% to 80% between steps 2000 and 4000
pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.2, final_sparsity=0.8, begin_step=2000, end_step=4000
)

# Wrap the model so that low-magnitude weights are masked during training
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, pruning_schedule=pruning_schedule)

# Compile pruned model
pruned_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train pruned model; the UpdatePruningStep callback is required to advance the pruning schedule
pruned_model.fit(x_train, y_train, epochs=5, callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
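
To see what the schedule asks for at each training step, the target sparsity can be sketched in plain Python. This assumes the toolkit's polynomial decay follows the usual formula with a default exponent of 3; treat both the formula and the exponent as assumptions to check against the toolkit documentation. `target_sparsity` is a hypothetical helper, and it only models the target inside the pruning window.

```python
import math


def target_sparsity(step, initial_sparsity=0.2, final_sparsity=0.8,
                    begin_step=2000, end_step=4000, power=3):
    """Target sparsity at a training step (assumed polynomial decay formula)."""
    progress = (step - begin_step) / (end_step - begin_step)
    progress = min(1.0, max(0.0, progress))   # clamp to the pruning window
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power


# With 60,000 training images and Keras' default batch size of 32:
steps_per_epoch = math.ceil(60000 / 32)
total_steps = steps_per_epoch * 5             # epochs=5, as in the cell above

for step in (2000, 2500, 3000, 3500, 4000):
    print(step, round(target_sparsity(step), 4))
print("steps per epoch:", steps_per_epoch, "total steps:", total_steps)
```

Under these assumptions the pruning window (steps 2000 to 4000) falls inside the second and third epochs, so the remaining epochs fine-tune the model at the final 80% sparsity.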


Step 3: Convert Pruned Model to Normal Model

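After pruning, we strip the pruning wrappers, along with the extra variables they add for training, so that only the sparse weights are saved. The zeroed weights are still stored as values, so the file shrinks noticeably only once it is compressed (for example with gzip).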

In [None]:
# Strip pruning
pruned_model = tfmot.sparsity.keras.strip_pruning(pruned_model)

# Save model
pruned_model.save("pruned_model.h5")
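
The point about compression can be illustrated without TensorFlow. The sketch below builds a synthetic list of 10,000 weights (an assumption standing in for a real layer), zeroes the 80% with the smallest magnitude, and compares the zlib-compressed size of both versions. `compressed_size` is a hypothetical helper.

```python
import random
import struct
import zlib


def compressed_size(values):
    """Size in bytes of the values packed as 32-bit floats and zlib-compressed."""
    raw = struct.pack(f"{len(values)}f", *values)
    return len(zlib.compress(raw, 9))


random.seed(0)
dense = [random.gauss(0.0, 0.1) for _ in range(10_000)]

# Zero the 80% of weights with the smallest magnitude
cutoff = sorted(abs(w) for w in dense)[int(0.8 * len(dense)) - 1]
sparse = [0.0 if abs(w) <= cutoff else w for w in dense]

raw_bytes = 4 * len(dense)            # both versions occupy this much uncompressed
dense_bytes = compressed_size(dense)
sparse_bytes = compressed_size(sparse)

print(f"uncompressed: {raw_bytes} bytes (same for both)")
print(f"dense, compressed: {dense_bytes} bytes")
print(f"sparse, compressed: {sparse_bytes} bytes")
```

Both lists take the same space uncompressed. Only after compression do the long runs of zeros in the pruned version pay off.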


Quantization (Reducing Precision of Weights)

Quantization reduces the precision of model weights, typically from 32-bit floats to 8-bit integers.

This cuts weight storage by roughly 4x and can speed up inference on hardware with efficient integer arithmetic.

 Types of Quantization:

Post-Training Quantization (Quantize an already trained model during conversion)

Quantization-Aware Training (Simulate quantization during training so the model learns to tolerate it)
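
The float-to-integer mapping itself can be sketched in a few lines. This is a generic affine (scale and zero-point) scheme in plain Python, given as an illustration. The scheme a particular converter applies may differ, for example by quantizing weights symmetrically per channel. All function names here are hypothetical.

```python
def quantization_params(values, num_bits=8):
    """Choose a scale and zero point that map the value range onto signed integers."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo = min(min(values), 0.0)               # the range must contain 0.0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)     # integer that represents 0.0
    return scale, zero_point, qmin, qmax


def quantize(values, scale, zero_point, qmin, qmax):
    """Map floats to integers, clamping to the representable range."""
    return [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [scale * (q - zero_point) for q in q_values]


weights = [-0.5, -0.1, 0.0, 0.3, 1.5]
scale, zero_point, qmin, qmax = quantization_params(weights)
q_weights = quantize(weights, scale, zero_point, qmin, qmax)
restored = dequantize(q_weights, scale, zero_point)

print("scale:", scale, "zero point:", zero_point)
print("quantized:", q_weights)
print("max error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight is now stored in one byte instead of four, and in this example the rounding error per weight stays within half of `scale`.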


Step 1: Apply Post-Training Quantization

In [None]:
import tensorflow as tf
from tensorflow import keras

# Load the pruned model saved in the previous section
model = keras.models.load_model("pruned_model.h5")

# Convert model to TensorFlow Lite format
converter = tf.lite.TFLiteConverter.from_keras_model(model)

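# Enable default optimizations; with no representative dataset this applies
# dynamic-range quantization (weights stored as 8-bit integers)
converter.optimizations = [tf.lite.Optimize.DEFAULT]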

# Convert model
tflite_model = converter.convert()

# Save compressed model
with open("quantized_model.tflite", "wb") as f:
    f.write(tflite_model)
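
A quick way to check the effect is to compare the sizes of the saved files. The sketch below assumes both files from this notebook are in the working directory and skips any that are missing. `size_report` is a hypothetical helper.

```python
import os


def size_report(paths):
    """Return {path: size in KiB} for every path that exists on disk."""
    return {p: os.path.getsize(p) / 1024 for p in paths if os.path.exists(p)}


# File names as saved earlier in this notebook
report = size_report(["pruned_model.h5", "quantized_model.tflite"])
for path, kib in report.items():
    print(f"{path}: {kib:.1f} KiB")
```

The exact numbers depend on the TensorFlow version and file format overhead, so accuracy should be measured on the test set alongside the size.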
