# Embedded ML - Lab 1.2: Model Compression

In this lab you are asked to create a compressed verion of an ANNs model. You are not allowed to use ML libraries such as SciKit-learn, PyTorch or TensorFlow, but you are allowed to use standard libraries such as math, numpy and matplotlib if needed. You are given some code but you are expected to write some more and be able to explain and modify everything. This lab is essential for you to grasp the details of some of the most important techniques for compressing or making ML models more efficient: quantization and pruning.

### Learning outcomes


* Explain the basic concepts of compression in ANNs
* Apply range tuning and centering when doing quantization
* Calculate and analyze the impact of quantization and pruning on memory and computing

### Naive quantization
Quantization means reducing the precission of model parameters and mainly targets weights, since they represent the most volumne of memory and processing in ANNs.

Take the code from the last part of Lab 1.1 (MNIST model) and add methods to export and import weights to and from a binary file, making sure both processes work with your code in such a way that you don't have to train every time you want to run inference, but insted, the wieghts are loaded into the model when needed. Investigate which serialization/desarialization options exist in Python and choose one that you understand.

Then, create two additional inference methods: FP16 and INT8. The FP16 method should treat all computations in the network involving the weights, as 16-bit floating-point. The INT8 method should work with 8-bit integers instead. In both cases, use the native datatype conversion methods. Investigate the NumPy methods available to enforce the desired datatypes.

Run the two quantized models and compare them with the baseline in terms of model size, accuracy and latency.

In [5]:
# COPY HERE YOUR MNIST MODEL CODE
import pickle

def save_weights_binary(weights):
  # store all weights in a binary file using pickle
  with open('weights.pkl', 'wb') as f:
    pickle.dump(weights, f)

def load_weights_binary(weights_file):
  # load all weights from a binary file using pickle
  with open(weights_file, 'rb') as f:
    weights = pickle.load(f)
  return weights

In [None]:
import numpy as np
import time
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# === NeuralNetwork class ===
class NeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        # Initialize weights and biases
        self.weights_input_hidden = np.random.randn(self.input_size, self.hidden_size)* 0.01
        self.bias_input_hidden = np.zeros((1, self.hidden_size))
        self.weights_hidden_output = np.random.randn(self.hidden_size, self.output_size)* 0.01
        self.bias_hidden_output = np.zeros((1, self.output_size))

    def my_dot(self, A, B):
        # dot product
        # complete here the missing code...

        # Check if A is 1D and reshape it to 2D for consistency
        if A.ndim == 1:
            A = A.reshape(1, -1)

        # Initialize the result matrix with zeros
        result = np.zeros((A.shape[0], B.shape[1]))
        for i in range(A.shape[0]):
            for j in range(B.shape[1]):
                for k in range(A.shape[1]):
                    result[i, j] += A[i, k] * B[k, j]
        return result
    

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, x):
        return x * (1 - x)

    def forward(self, x):
        # Forward propagation through the network...

        # Step 1: dot product between the input and the weights
        # that connect with the hidden layer.
        # complete here the missing code...

        self.hidden_input = self.my_dot(x, self.weights_input_hidden) + self.bias_input_hidden
        self.hidden_output = self.sigmoid(self.hidden_input)
        

        # Step 2: dot product between the activations (outputs) of the
        # hidden layer and the weights that connect with the output layer.
        # complete here the missing code...

        self.final_input =  self.my_dot(self.hidden_output, self.weights_hidden_output) + self.bias_hidden_output
        self.output = self.sigmoid(self.final_input)
        
        return self.output

    def backward(self, x, y, output, learning_rate):
        # Backpropagation and weight updates
        self.error = y - output
        d_output = self.error * self.sigmoid_derivative(output)

        self.hidden_error = d_output.dot(self.weights_hidden_output.T)
        d_hidden = self.hidden_error * self.sigmoid_derivative(self.hidden_output)

        self.weights_hidden_output += self.hidden_output.T.dot(d_output) * learning_rate
        self.bias_hidden_output += np.sum(d_output, axis=0, keepdims=True) * learning_rate
        self.weights_input_hidden += x.T.dot(d_hidden) * learning_rate
        self.bias_input_hidden += np.sum(d_hidden, axis=0, keepdims=True) * learning_rate

    def train(self, x, y, epochs, learning_rate):
        error = 0
        for epoch in range(epochs):
            output = self.forward(x)
            self.backward(x, y, output, learning_rate)
            if epoch % 10 == 0:
                error = np.mean(np.square(y - output))
                print(f'Epoch {epoch}: Loss = {error:.4f}')
        print(f'Epoch {epochs}: Loss = {error:.4f} ')


# === Load and preprocess MNIST dataset ===
print("Loading MNIST dataset...")
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X_data = mnist.data / 255.0
y_data = mnist.target.astype(np.int32)

# Reduce dataset size for faster training
X_data = X_data[:10]
y_data = y_data[:10]

# One-hot encode labels
encoder = OneHotEncoder(sparse_output=False, categories='auto')
y_data_encoded = encoder.fit_transform(y_data.reshape(-1, 1))

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data_encoded, test_size=0.1, random_state=42)

# === Run experiments with different hidden layer sizes ===
h_size = 10
results = []

# Define parameters

 #Number of features (28*28=784)
input_size = X_train.shape[1]
# Number of classes (based on one-hot encoding)
output_size = 10
# Number of epochs
epochs = 200
# Learning rate
learning_rate = 0.05


print(f"\nTraining model with {h_size} hidden neurons...")
nn = NeuralNetwork(input_size=input_size, hidden_size=h_size, output_size=output_size)

start_train = time.time()
nn.train(X_train, y_train, epochs=epochs, learning_rate=learning_rate)
end_train = time.time()

# Time one forward pass
start_forward = time.time()
output = nn.forward(X_test[:1])
end_forward = time.time()

# Predict and evaluate accuracy
predictions = np.argmax(nn.forward(X_test), axis=1)
print("Predictions: ", predictions)

labels = np.argmax(y_test, axis=1)
print("Labels: ", labels)
accuracy = np.mean(predictions == labels) * 100

# Count parameters and forward operations
param_count = (2 * input_size * h_size) + (2 * h_size * output_size) + (h_size + output_size)
forward_ops = 2*input_size*h_size + 3*h_size + 2

results.append({
    "Hidden Neurons": h_size,
    "Parameters": param_count,
    "Forward Ops": forward_ops,
    "Training Time (s)": round(end_train - start_train, 2),
    "Forward Time (ms)": round((end_forward - start_forward) * 1000, 2),
    "Accuracy (%)": round(accuracy, 2)
})



Loading MNIST dataset...

Training model with 10 hidden neurons...


ValueError: operands could not be broadcast together with shapes (9,7) (9,10) 

In [6]:
#Save weights like binary file one file (weights.pkl) with all weights and bias(input, hidden, output)
save_weights_binary({
    "weights_input_hidden": nn.weights_input_hidden,
    "bias_input_hidden": nn.bias_input_hidden,
    "weights_hidden_output": nn.weights_hidden_output,
    "bias_hidden_output": nn.bias_hidden_output
})

#Load weights from binary file
weights = load_weights_binary('weights.pkl')
print("Weights loaded from binary file.")
print("Weights and biases:")
print("weights_input_hidden:", weights["weights_input_hidden"])
print("bias_input_hidden:", weights["bias_input_hidden"])
print("weights_hidden_output:", weights["weights_hidden_output"])
print("bias_hidden_output:", weights["bias_hidden_output"])

Weights loaded from binary file.
Weights and biases:
weights_input_hidden: [[-2.63844632e-02]
 [ 1.16349034e-02]
 [ 1.15859558e-02]
 [-1.07652234e-02]
 [ 2.01325221e-02]
 [-1.12839139e-03]
 [ 1.11884251e-02]
 [-6.39881631e-03]
 [ 8.27274213e-03]
 [ 3.14626283e-03]
 [ 5.65563864e-03]
 [ 2.39364628e-03]
 [ 2.08765575e-02]
 [-3.71191495e-03]
 [ 8.81994268e-03]
 [ 2.14848382e-02]
 [-6.65136127e-03]
 [ 1.34496627e-03]
 [-1.01032192e-02]
 [ 1.36689124e-03]
 [-1.14263403e-02]
 [-9.09418261e-03]
 [-2.57980472e-03]
 [ 1.29155093e-02]
 [ 3.97969660e-03]
 [ 4.96065852e-03]
 [-1.89374979e-02]
 [-1.54475399e-02]
 [ 5.04541680e-03]
 [-7.67275375e-04]
 [ 1.53384865e-02]
 [-1.52935548e-03]
 [-6.92564544e-03]
 [ 3.63325166e-03]
 [ 1.02930808e-02]
 [-1.92630572e-03]
 [ 1.03864598e-02]
 [-1.32548560e-02]
 [-1.09060095e-02]
 [-2.07971775e-02]
 [-1.30623166e-02]
 [-1.14790191e-02]
 [ 9.84533905e-03]
 [ 1.00037441e-02]
 [-3.69918796e-03]
 [-5.18424124e-03]
 [ 1.14560507e-03]
 [ 1.85934853e-02]
 [ 3.13047163

### Range tuning and centering
For quantization to be effective, you should smartly choose the range of numbers you will code with the fewer bits available after quantization. To do so, you should evaluate the dynamic ranges of the variables to be quantized and map the values using that as the full range.

Make a histogram plot of the model weights in order to verify their range. Then write a function to quantize the weights stored in the exported binary file to INT8 and store the resulting weights in another file. Finally, run again the INT8 quantized inference with the newly computed weights and compare with the previous versions using the same metrics.

In [None]:
def plot_histogram(weights):
  # plot a histogram of all model weights
  pass

def quantize_INT8(weights):
  # quantize to INT8 the model weights
  weights_int8 = None
  return weights_int8

### Pruning
Besides reducing precision for the network weights, we can also decide to eliminate network connections that do not contribute significantly to the model. This can be achieved by simply removing the connections whose weights are closest to zero.

In this part of the lab you are asked to generate three pruned versions of the original model by setting to zero some of the weights:


*   Set to zero the smallest 10% of weights
*   Set to zero the smallest 30% of weights
*   Set to zero the smallest 50% of weights

Report the accuracy for each model against the estimated memory savings.



In [None]:
def prune_model(weights, percentage):
  # set to zero the smallest weights, according to the given percentage
  weights_pruned = None
  pass
  return weights_pruned

## Analysis

Discuss the following questions based on the lab experiments and the theory studied:


*   What are the advantages an disadvantages of storing model weights in different formats?
*   How much reduction in model memory requirements can be achieved by each of the versions obtained?
*   What are the posible computational advantages of the obtained models and how do they depend on the hardware?

