<a href="https://colab.research.google.com/github/JuanZapa7a/semiotics/blob/main/mnist.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quantization Aware Training (QAT) using Larq for binarized quantization with the MNIST dataset

[Larq](https://larq.dev/) is a library designed to build and train binarised neural networks (BNNs) using TensorFlow and Keras. We are interested in performing hardware-aware training (considering noise, quantization, etc.) for deep models using Larq. We can achieve this by taking advantage of Larq's specific functionalities for binarisation and compact model training.

Here is an outline of what we will cover in this notebook:
 1. Installation of Larq and necessary dependencies.
 2. Data preparation (MNIST).
 3. Creation of a base model.
 4. Training and evaluation of the base model.
 5. Creation of a binarized model with QAT.
 6. Training and evaluation of the binarized model.
 7. Compare Models.

This notebook uses [Larq](https://larq.dev/) and the [Keras Sequential API](https://www.tensorflow.org/guide/keras).

The API of Larq is built on top of `tf.keras` and is designed to provide an easy to use, composable way to design and train BNNs (1 bit) and other types of Quantized Neural Networks (QNNs).

It provides tools specifically designed to aid in BNN development, such as specialized optimizers, training metrics, and profiling tools.

Note that efficient inference using a trained BNN requires an optimized inference engine; the Larq project provides these for several platforms in [Larq Compute Engine](https://docs.larq.dev/compute-engine).

To create a **Quantized Neural Network (QNN)**, Larq introduces two main components: **[quantized layers](https://docs.larq.dev/larq/api/layers/)** and **[quantizers](https://docs.larq.dev/larq/api/quantizers/)**.

1. **Quantizers**: A quantizer defines two critical aspects:
   - **Transformation of full-precision input to quantized output**: This involves converting high-precision values (usually 32-bit floating-point) to a lower-precision format (e.g., binary or integer). This reduces memory usage and computational load, which is helpful for efficiency.
   - **Pseudo-gradient method for backpropagation**: Since quantization can create non-differentiable points, Larq uses an approximate or "pseudo" gradient method for the backward pass during training. This allows the model to still update weights even if the quantized values don't support traditional gradient computation.

2. **Quantized Layers**: These layers use quantizers to handle activations and weights with reduced precision. Each quantized layer requires:
   - **input_quantizer**: Defines how to quantize the incoming activations for the layer. This allows the model to operate on low-precision activations instead of full-precision ones.
   - **kernel_quantizer**: Defines how to quantize the layer’s weights (often referred to as kernels in neural network layers).

  If both `input_quantizer` and `kernel_quantizer` are set to `None`, then the layer behaves as a regular, full-precision layer, similar to standard TensorFlow/Keras layers.

3. **Integration with Models**: These quantized layers can be added to a Keras model just like other layers. Alternatively, you can use them with a custom training loop if you need more control over the training process.

Larq's QNN approach leverages quantizers to efficiently reduce precision while maintaining trainability through pseudo-gradients, which can then be integrated seamlessly into standard Keras workflows.

## 1. Installation of Larq and necessary dependencies

In [None]:
!pip -q install tensorflow==2.10.0
!pip -q install larq==0.13.1

import tensorflow as tf
import larq as lq

## 2. Data preparation (MNIST)

Download and prepare the MNIST dataset.

By default, each MNIST image has a shape of (28, 28), which is 2D.
However, neural networks (especially convolutional networks) typically expect 3D inputs: (height, width, channels).
Adding a channel dimension (with value 1 for grayscale) changes each image shape from (28, 28) to (28, 28, 1), which is required for most neural network layers to interpret the images correctly.
The overall shapes for the dataset become (60000, 28, 28, 1) for training and (10000, 28, 28, 1) for testing.

The MNIST dataset’s pixel values originally range from 0 to 255.
Dividing by 127.5 and then subtracting 1 maps the values to a -1 to 1 range, which can help certain models converge more quickly and maintain numerical stability. (Centering pixel values around zero often benefits neural networks as it reduces bias and helps gradient-based methods perform better.)
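
As a quick sanity check of the scaling described above, the following sketch applies the same `x / 127.5 - 1` mapping to a tiny array and adds the channel dimension. The array `pixels` is a hypothetical stand-in for image data, not part of MNIST.

```python
import numpy as np

# Hypothetical 2x2 "image" containing the extreme uint8 values and two others.
pixels = np.array([[0, 255], [51, 204]], dtype=np.uint8)

# Add a trailing channel axis: (2, 2) -> (2, 2, 1).
pixels = pixels.reshape((2, 2, 1))

# Map [0, 255] to [-1, 1]; the division promotes uint8 to floating point.
scaled = pixels / 127.5 - 1

print(scaled.shape)                 # (2, 2, 1)
print(scaled.min(), scaled.max())   # -1.0 1.0
```

The smallest pixel value maps to -1 and the largest to 1, so the inputs are centered around zero before they reach the first layer.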

In [None]:
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()

train_images = train_images.reshape((60000, 28, 28, 1))  # (60000, 28, 28) -> (60000, 28, 28, 1)
test_images = test_images.reshape((10000, 28, 28, 1))    # (10000, 28, 28) -> (10000, 28, 28, 1)

# For binarized models, it is standard to normalize images to a range between -1 and 1.
train_images = train_images / 127.5 - 1
test_images = test_images / 127.5 - 1

print(train_images.shape, train_labels.shape)  # Should be (60000, 28, 28, 1), (60000,)
print(test_images.shape, test_labels.shape)    # Should be (10000, 28, 28, 1), (10000,)

## 3. Creation of a base model.

To train the same model without binarization and quantization (i.e., using full precision for both weights and activations), we simply need to remove the QuantConv2D and QuantDense layers, replacing them with standard convolutional (Conv2D) and dense (Dense) layers from TensorFlow's Keras API.

In [None]:
import tensorflow as tf

# Define the standard (non-binarized) model architecture
model_normal = tf.keras.models.Sequential()

# First layer: Standard convolutional layer (no quantization)
model_normal.add(tf.keras.layers.Conv2D(
    32, (3, 3),                        # 32 filters of size 3x3
    activation="relu",                 # Use ReLU activation instead of binary
    use_bias=False,                    # No bias for simplicity
    input_shape=(28, 28, 1)            # Input shape for MNIST
))
model_normal.add(tf.keras.layers.MaxPooling2D((2, 2)))    # Max pooling for downsampling
model_normal.add(tf.keras.layers.BatchNormalization(scale=False)) # Batch normalization

# Second standard convolutional layer
model_normal.add(tf.keras.layers.Conv2D(
    64, (3, 3),
    activation="relu",
    use_bias=False,
))
model_normal.add(tf.keras.layers.MaxPooling2D((2, 2)))     # Max pooling
model_normal.add(tf.keras.layers.BatchNormalization(scale=False))

# Third standard convolutional layer
model_normal.add(tf.keras.layers.Conv2D(
    64, (3, 3),
    activation="relu",
    use_bias=False,
))
model_normal.add(tf.keras.layers.BatchNormalization(scale=False))
model_normal.add(tf.keras.layers.Flatten())                # Flatten the output for the dense layers

# Fully connected (dense) layer with 64 units
model_normal.add(tf.keras.layers.Dense(64, activation="relu", use_bias=False))
model_normal.add(tf.keras.layers.BatchNormalization(scale=False))

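# Output layer for classification (10 classes for MNIST).
# Softmax is applied after batch normalization, mirroring the binarized model,
# so that the outputs remain valid class probabilities.
model_normal.add(tf.keras.layers.Dense(10, use_bias=False))
model_normal.add(tf.keras.layers.BatchNormalization(scale=False))
model_normal.add(tf.keras.layers.Activation("softmax"))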


In [None]:
model_normal.summary()

## 4. Training and evaluation of the normal model

In [None]:
# Compile the model for normal training
model_normal.compile(
    optimizer='adam',                       # Adam optimizer
    loss="sparse_categorical_crossentropy", # Cross-entropy loss for classification
    metrics=["accuracy"]                    # Accuracy as the evaluation metric
)

# Train the model
history_normal = model_normal.fit(
    train_images, train_labels,
    epochs=6,                        # Number of epochs
    batch_size=64,                    # Batch size
    validation_data=(test_images, test_labels)  # Use test data for validation
)

# Evaluate the model on the test dataset
test_loss_normal, test_accuracy_normal = model_normal.evaluate(test_images, test_labels)

print(f"Test Accuracy (Normal): {test_accuracy_normal * 100:.2f}%")
print(f"Test Loss (Normal): {test_loss_normal:.4f}")

In [None]:
import matplotlib.pyplot as plt

# Plot training & validation accuracy values
plt.figure(figsize=(12, 5))

# Accuracy plot
plt.subplot(1, 2, 1)
plt.plot(history_normal.history['accuracy'], label='Train Accuracy')
plt.plot(history_normal.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')

# Loss plot
plt.subplot(1, 2, 2)
plt.plot(history_normal.history['loss'], label='Train Loss')
plt.plot(history_normal.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(loc='upper right')

plt.tight_layout()
plt.show()


## 5. Creation of a binarized model

The following will create a simple binarized CNN.

The quantization function
$$
q(x) = \begin{cases}
    -1 & x < 0 \\\
    1 & x \geq 0
\end{cases}
$$
is used in the forward pass to binarize the activations and the latent full precision weights. The gradient of this function is zero almost everywhere which prevents the model from learning.

To be able to train the model the gradient is instead estimated using the Straight-Through Estimator (STE)
(the binarization is essentially replaced by a clipped identity on the backward pass):
$$
\frac{\partial q(x)}{\partial x} = \begin{cases}
    1 & \left|x\right| \leq 1 \\\
    0 & \left|x\right| > 1
\end{cases}
$$

In Larq this can be done by using `input_quantizer="ste_sign"` and `kernel_quantizer="ste_sign"`.
Additionally, the latent full-precision weights are clipped to the range [-1, 1] using `kernel_constraint="weight_clip"`.
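
To make the forward and backward behaviour concrete, here is a minimal NumPy sketch of the two formulas above together with the weight clipping. The function names are hypothetical and this is an illustrative re-implementation, not Larq's internal code.

```python
import numpy as np

def ste_sign_forward(x):
    """Forward pass: binarize to -1 or +1 (zero maps to +1)."""
    return np.where(x >= 0, 1.0, -1.0)

def ste_sign_backward(x, upstream_grad):
    """Backward pass: let the gradient through where |x| <= 1, block it elsewhere."""
    return upstream_grad * (np.abs(x) <= 1)

def weight_clip(w, clip_value=1.0):
    """Keep latent full-precision weights inside [-clip_value, clip_value]."""
    return np.clip(w, -clip_value, clip_value)

latent = np.array([-1.7, -0.3, 0.0, 0.4, 2.5])

print(ste_sign_forward(latent))                          # values: -1, -1, 1, 1, 1
print(ste_sign_backward(latent, np.ones_like(latent)))   # values: 0, 1, 1, 1, 0
print(weight_clip(latent))                               # values: -1, -0.3, 0, 0.4, 1
```

Clipping keeps the latent weights inside the interval where the estimated gradient is non-zero, so they can continue to be updated during training.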

In [None]:
import larq as lq
import tensorflow as tf

# Define default quantization options for all layers except the first layer
kwargs = dict(
    input_quantizer="ste_sign",      # Quantizes activations using Sign-STE (binary quantization to -1 or +1)
    kernel_quantizer="ste_sign",     # Quantizes weights to binary values (-1 or +1) with Sign-STE
    kernel_constraint="weight_clip"  # Clips weights within a set range (typically -1 to +1) to stabilize training
)

# Initialize a Sequential model
model = tf.keras.models.Sequential()

# First layer: Quantized convolutional layer (only quantizing weights, not inputs)
model.add(lq.layers.QuantConv2D(
    32, (3, 3),                      # 32 filters of size 3x3
    kernel_quantizer="ste_sign",     # Quantize weights to -1 or +1
    kernel_constraint="weight_clip", # Restrict weights to a range for stability
    use_bias=False,                  # Disable bias for simplicity
    input_shape=(28, 28, 1)          # Input shape for MNIST (28x28 grayscale images)
))
model.add(tf.keras.layers.MaxPooling2D((2, 2)))         # Downsample with max pooling
model.add(tf.keras.layers.BatchNormalization(scale=False)) # Batch normalization to stabilize activations

# Second quantized convolutional layer (quantizes both weights and activations)
model.add(lq.layers.QuantConv2D(64, (3, 3), use_bias=False, **kwargs))
model.add(tf.keras.layers.MaxPooling2D((2, 2)))         # Downsample again
model.add(tf.keras.layers.BatchNormalization(scale=False))

# Third quantized convolutional layer (also quantizes both weights and activations)
model.add(lq.layers.QuantConv2D(64, (3, 3), use_bias=False, **kwargs))
model.add(tf.keras.layers.BatchNormalization(scale=False))
model.add(tf.keras.layers.Flatten())                    # Flatten the output for fully connected layers

# Quantized dense (fully connected) layer with 64 units
model.add(lq.layers.QuantDense(64, use_bias=False, **kwargs))
model.add(tf.keras.layers.BatchNormalization(scale=False))

# Final quantized dense layer for output (10 classes for MNIST)
model.add(lq.layers.QuantDense(10, use_bias=False, **kwargs))
model.add(tf.keras.layers.BatchNormalization(scale=False))
model.add(tf.keras.layers.Activation("softmax"))        # Softmax activation for classification probabilities


Almost all parameters in the network are binarized, so they take the value -1 or 1. This would make the network extremely fast if it were deployed on custom BNN hardware.

Here is the complete architecture of our model:

In [None]:
lq.models.summary(model)

## 6. Training and evaluation of the binarized model with QAT



In [None]:
# Compile the model with an optimizer and loss function
model.compile(
    optimizer='adam',                      # Adam optimizer is often effective for training binarized networks
    loss="sparse_categorical_crossentropy",# Cross-entropy loss for classification
    metrics=["accuracy"]                   # Accuracy as the evaluation metric
)

# Train the model where QAT takes place as the model learns to optimize quantized weights and activations.
history = model.fit(
    train_images, train_labels,
    epochs=6,                           # Number of training epochs
    batch_size=64,                      # Batch size for training
    validation_data=(test_images, test_labels) # Evaluate on test set after each epoch
)

# Evaluate the model on the test dataset
test_loss, test_accuracy = model.evaluate(test_images, test_labels)

print(f"Test Accuracy: {test_accuracy * 100:.2f}%")
print(f"Test Loss: {test_loss:.4f}")

In [None]:
import matplotlib.pyplot as plt

# Plot training & validation accuracy values for both models
plt.figure(figsize=(12, 5))

# Accuracy plot
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Train Accuracy (Binarized)')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy (Binarized)')
plt.plot(history_normal.history['accuracy'], label='Train Accuracy (Normal)')
plt.plot(history_normal.history['val_accuracy'], label='Validation Accuracy (Normal)')
plt.title('Model Accuracy Comparison')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')

# Loss plot
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Train Loss (Binarized)')
plt.plot(history.history['val_loss'], label='Validation Loss (Binarized)')
plt.plot(history_normal.history['loss'], label='Train Loss (Normal)')
plt.plot(history_normal.history['val_loss'], label='Validation Loss (Normal)')
plt.title('Model Loss Comparison')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(loc='upper right')

plt.tight_layout()
plt.show()


The plots above visualize the full training history:

1. **Training and Validation Accuracy**: Shows how well the model performs on the training and validation sets in terms of classification accuracy.
2. **Training and Validation Loss**: Displays the loss values over epochs, providing insight into how well the model is learning and whether it might be overfitting.

## 7. Compare models

### **A. Size**

We can evaluate how much memory each model takes up, specifically looking at the number of parameters (weights) in each model. The size of the model can be directly related to the total number of parameters, since each parameter is a floating-point value and requires memory for storage.

### Step 1: Get the Number of Parameters

We can get the total number of parameters in both models by using the `summary()` method or by accessing the `count_params()` method for each layer.

In [None]:
# Print summary of both models to compare parameters
print("\nBinarized Model Summary:")
model.summary()

print("Normal Model Summary:")
model_normal.summary()



### Step 2: Compare Model Size in Terms of Memory Usage

Another way to compare model sizes is to look at the memory footprint, which depends on the number of parameters and their precision.

#### Model Size Calculation in Bytes

- **Normal Model**: Each parameter is a 32-bit floating-point value (4 bytes).
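- **Binarized Model**: Each binarized weight can be represented in a single bit, so eight weights fit in one byte when bit-packed (a 32x reduction relative to 32-bit floats). The estimate below deliberately uses a simpler, conservative figure of 1 byte per parameter, which gives a 4x reduction. It also treats every parameter as binarized, although the batch-normalization parameters stay in full precision.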

Let’s calculate the model size using the following formula:

- **Normal Model Size (in bytes)**:
  $$
  \text{Model Size} = \text{Total Parameters} \times 4 \text{ bytes}
  $$

- **Binarized Model Size (in bytes)**:
  $$
  \text{Model Size} = \text{Total Parameters} \times 1 \text{ byte (because each parameter is binary)}
  $$


The cell below prints the estimated sizes of the two models in megabytes (MB), allowing you to easily compare the memory footprint.



In [None]:

# Calculate the model size in bytes
def calculate_model_size(model, is_binarized=False):
    total_params = model.count_params()

    if is_binarized:
        # Binarized model parameters are stored in 1 byte per parameter
        return total_params  # 1 byte per parameter for binarized model
    else:
        # Normal model parameters are stored in 4 bytes per parameter (32-bit floating point)
        return total_params * 4  # 4 bytes per parameter for normal model

# Get the sizes of both models
normal_model_size = calculate_model_size(model_normal, is_binarized=False)
binarized_model_size = calculate_model_size(model, is_binarized=True)

# Print the model sizes in MB
print(f"Normal Model Size: {normal_model_size / (1024**2):.2f} MB")
print(f"Binarized Model Size: {binarized_model_size / (1024**2):.2f} MB")
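
The calculation above counts one byte per binarized parameter. As a complementary sketch, the snippet below estimates the size when binary weights are bit-packed (eight per byte) and the remaining parameters stay in 32-bit floats. The helper `estimate_size_bytes` and the parameter counts are hypothetical and chosen only to illustrate the arithmetic.

```python
import math

BYTES_PER_FLOAT32 = 4

def estimate_size_bytes(num_binary_params, num_float_params):
    """Estimate storage when binary weights are bit-packed (8 per byte)."""
    binary_bytes = math.ceil(num_binary_params / 8)
    float_bytes = num_float_params * BYTES_PER_FLOAT32
    return binary_bytes + float_bytes

# Hypothetical counts: 1,000,000 binarized weights plus 1,000 float32
# parameters (for example batch-normalization statistics).
packed = estimate_size_bytes(1_000_000, 1_000)
full = estimate_size_bytes(0, 1_001_000)

print(packed)                     # 129000
print(full)                       # 4004000
print(round(full / packed, 1))    # 31.0
```

Under these assumptions the ratio approaches 32x as the share of full-precision parameters shrinks.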


### Step 3: Save and Compare the File Sizes

We can also save the models to disk in `.h5` or `.tf` format and compare their file sizes directly.

This approach gives you the actual file sizes on disk. Note that Keras saves the latent full-precision weights of the binarized model as 32-bit floats, so the two files are expected to be similar in size; the storage savings only materialize once the weights are bit-packed for deployment, for example with Larq Compute Engine.



In [None]:
# Save both models to disk
model.save("binarized_model.h5")
model_normal.save("normal_model.h5")

# Compare file sizes
import os

binarized_model_size = os.path.getsize("binarized_model.h5") / (1024**2)
normal_model_size = os.path.getsize("normal_model.h5") / (1024**2)  # Size in MB

print(f"Binarized Model File Size: {binarized_model_size:.2f} MB")
print(f"Normal Model File Size: {normal_model_size:.2f} MB")

### **B. Performance**

To compare the performance of both models (the binarized model and the normal model), we can evaluate the following aspects:

#### 1. **Inference Speed (Latency)**
Binarized models can be faster at inference because binary operations (like XOR and bit counting) are more efficient than floating-point operations, provided the model runs on an inference engine that exploits them.
   - **Latency**: The time it takes to produce a prediction once the input data is passed to the model.
   - **How to measure**: We can measure the time taken to make predictions on a batch of test data.

#### 2. **Throughput (FPS - Frames Per Second)**
The binarized model should also have higher throughput, meaning it can process more images per second.
   - **Throughput**: This measures how many predictions the model can make per second. A higher throughput means better performance.
   - **How to measure**: We can calculate how many images the model processes per second (images per second = batch size / time taken for prediction).

#### 3. **Model Accuracy (Evaluation Metrics)**
The normal model will typically have higher accuracy, but the binarized model’s performance can still be quite good for tasks like MNIST classification.
   - **Accuracy**: Although not a direct "performance" measure in terms of speed, comparing the accuracy of the models on the same test set is essential. The normal model usually provides higher accuracy than the binarized model.
   - **How to measure**: Use the test accuracy and loss from the evaluation of each model.

#### 4. **Memory Usage during Inference**
The binarized model will consume much less memory since the parameters are stored as binary values (1 bit per parameter), whereas the normal model uses 32-bit floating-point values (4 bytes per parameter).
   - **Memory usage**: The normal model will consume more memory during inference due to larger parameter sizes. The binarized model requires less memory because it uses binary representations for weights and activations.
   - **How to measure**: You can use system tools like `psutil` to monitor the memory usage while running inference.


### Step 4: Measure Inference Time (Latency)

To measure inference time, use the `time` module to track how long it takes to perform inference on a test batch.

This gives a rough comparison of the two models' latency in Keras. Any speed advantage of the binarized model depends on an optimized inference engine, so it may not show up in this measurement.

In [None]:
import time

# Function to measure inference time
def measure_inference_time(model, test_images):
    start_time = time.perf_counter()  # Monotonic, high-resolution timer
    model.predict(test_images)
    end_time = time.perf_counter()

    return end_time - start_time  # Time taken in seconds

# Choose a batch of 100 test samples for inference time
batch_size = 100
test_batch = test_images[:batch_size]

# Measure inference time for both models
inference_time_normal = measure_inference_time(model_normal, test_batch)
inference_time_binarized = measure_inference_time(model, test_batch)

print(f"Inference Time (Normal Model): {inference_time_normal:.4f} seconds")
print(f"Inference Time (Binarized Model): {inference_time_binarized:.4f} seconds")
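
A single timed call can be noisy, and the first call to `predict` typically includes one-time setup overhead. A slightly more robust approach, sketched below, discards a warm-up run and reports the median over several repeats. The helper `time_callable` and the stand-in workload `dummy_predict` are hypothetical, so the sketch runs without a trained model.

```python
import statistics
import time

def time_callable(fn, *args, warmup=1, repeats=5):
    """Return the median wall-clock time of fn(*args) over several runs."""
    for _ in range(warmup):
        fn(*args)  # Warm-up runs are not timed.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workload: sums each row of a 100 x 784 batch.
def dummy_predict(batch):
    return [sum(row) for row in batch]

batch = [[1.0] * 784 for _ in range(100)]
median_s = time_callable(dummy_predict, batch)
print(f"Median time: {median_s:.6f} s")
```

With the models from this notebook, the same helper could be called as `time_callable(model.predict, test_batch)`.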

### Step 5: Measure Throughput (FPS)

Throughput is the number of images processed per second, computed here as the batch size divided by the time taken for one prediction call.

In [None]:
# Function to measure throughput (images per second)
def measure_throughput(model, test_images, batch_size):
    start_time = time.time()
    model.predict(test_images[:batch_size])  # Make prediction on a batch
    end_time = time.time()

    time_taken = end_time - start_time
    throughput = batch_size / time_taken  # Images per second
    return throughput

# Measure throughput for both models
throughput_normal = measure_throughput(model_normal, test_images, batch_size)
throughput_binarized = measure_throughput(model, test_images, batch_size)

print(f"Throughput (Normal Model): {throughput_normal:.2f} images per second")
print(f"Throughput (Binarized Model): {throughput_binarized:.2f} images per second")


### Step 6: Compare Accuracy

After training both models, evaluate their performance on the test data:


In [None]:
# Evaluate the accuracy of both models
test_loss_normal, test_accuracy_normal = model_normal.evaluate(test_images, test_labels)
test_loss_binarized, test_accuracy_binarized = model.evaluate(test_images, test_labels)

print(f"Test Accuracy (Normal Model): {test_accuracy_normal * 100:.2f}%")
print(f"Test Accuracy (Binarized Model): {test_accuracy_binarized * 100:.2f}%")

### Step 7: Memory Usage (Optional)

You can monitor the memory usage of each model during inference using a package like `psutil` (on most systems) or use platform-specific tools like `nvidia-smi` for GPU memory usage:

In [None]:
import psutil
import os

# Function to get current memory usage
def get_memory_usage():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / (1024 ** 2)  # Memory in MB

# Measure memory usage during inference for both models
memory_usage_normal_start = get_memory_usage()
model_normal.predict(test_batch)  # Inference on the normal model
memory_usage_normal_end = get_memory_usage()

memory_usage_binarized_start = get_memory_usage()
model.predict(test_batch)  # Inference on the binarized model
memory_usage_binarized_end = get_memory_usage()

print(f"Memory Usage (Normal Model): {memory_usage_normal_end - memory_usage_normal_start:.2f} MB")
print(f"Memory Usage (Binarized Model): {memory_usage_binarized_end - memory_usage_binarized_start:.2f} MB")

### **C. Speed**

In terms of **speed**, the primary difference between a **normal model** (using full precision floating-point values) and a **binarized model** (using binary values for weights and activations) comes down to the computational efficiency of operations. We can expect:

1. **Reduced Computational Complexity**:
   - Binarized models use binary values for weights and activations. Binary operations like **XOR** (exclusive or) and **bitwise operations** are computationally cheaper than floating-point operations like multiplication and addition.
   - In a normal model, for each operation (like a convolution or matrix multiplication), floating-point multiplications and additions are performed, which take more time.
   - In a binarized model, the binary values can be processed with efficient bitwise operations, which are typically **faster** than floating-point operations.

2. **Optimized Memory Access**:
   - Binary weights can be packed into smaller data structures (e.g., bit-packed arrays), which allows for faster memory access, fewer cache misses, and reduced bandwidth usage.
   - Since binary weights are stored in 1 bit per weight, the data being fetched is much smaller compared to a normal model, which stores 32-bit floating-point numbers for each weight. This can reduce the time it takes to load the model's weights into memory and process them.

3. **Fewer Operations**:
   - Because binarized models have only two possible states (1 or -1 for weights), convolution and fully-connected operations involve simpler calculations. For example:
     - **Dot product** in a normal model involves multiplications and additions for each weight. In a binarized model, the same operation can be reduced to **XOR** and simple summations.
     - The complexity of these operations reduces significantly when you binarize the network, resulting in a faster forward pass during inference.

4. **Potential for Hardware Acceleration**:
   - Binarized models can be optimized further on specialized hardware, such as **ASICs** or **FPGAs**, which can accelerate binary operations much faster than standard CPUs or GPUs.
   - Some hardware is designed to take advantage of binary operations, making them much faster than regular floating-point operations.
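
The dot-product reduction mentioned above can be illustrated in plain Python. For two vectors with entries in {-1, +1}, the dot product equals the number of matching positions minus the number of mismatching ones, and the mismatches can be counted with an XOR followed by a bit count. The helper names below are hypothetical; real inference engines perform the same idea on packed machine words.

```python
def pack_bits(values):
    """Encode a list of +1/-1 values as an int: bit i is 1 when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors from their packed bits."""
    mismatches = bin(a_bits ^ b_bits).count("1")  # Bit count of the XOR
    return n - 2 * mismatches

a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]

reference = sum(x * y for x, y in zip(a, b))           # Multiply-accumulate
fast = binary_dot(pack_bits(a), pack_bits(b), len(a))  # XOR + bit count
print(reference, fast)  # 2 2
```

Both computations agree, but the second one needs no multiplications.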



### Step 8: Measure Speed (Inference Time)

To compare the **speed** (in terms of inference time) between the normal and binarized models, you can follow this process:

**Use Time Tracking**:
   1. We can measure the inference time by tracking how long it takes for each model to make predictions on a batch of test data.
   2. **Binarized models** should take less time compared to normal models, especially for larger models or datasets where the computational savings become more noticeable.



In [None]:
import time

# Function to measure inference time
def measure_inference_time(model, test_images, batch_size=100):
    start_time = time.time()
    model.predict(test_images[:batch_size])  # Make prediction on a batch
    end_time = time.time()
    return end_time - start_time  # Time taken in seconds

# Measure inference time for both models
batch_size = 100
test_batch = test_images[:batch_size]  # Select a batch of 100 samples for comparison

# Measure inference time for normal model
inference_time_normal = measure_inference_time(model_normal, test_batch)
# Measure inference time for binarized model
inference_time_binarized = measure_inference_time(model, test_batch)

print(f"Inference Time for Normal Model: {inference_time_normal:.4f} seconds")
print(f"Inference Time for Binarized Model: {inference_time_binarized:.4f} seconds")

## 8. **Conclusion: Comparison of Model Size, Performance, and Speed**

When comparing a **binarized model** and a **normal model** in terms of **size**, **performance**, and **speed**, the results typically show clear differences that highlight the strengths and trade-offs of each approach. Here's a breakdown of the expected conclusions and values for each factor:

---

### **1. Model Size**
- **Binarized Model**:
  - **Smaller Size**: A binarized model uses **binary weights** (1 bit per weight), so the overall model size is drastically reduced. For example, if a normal model has 1 million parameters with 32-bit floating-point weights, the binarized model will only use **1/32 of the storage** for the weights, resulting in a model that is **32 times smaller**.
  - **Expected Size**: For example, if the normal model is 50MB, the binarized model would be approximately **1.5MB**.
  
- **Normal Model**:
  - **Larger Size**: The normal model uses **32-bit floating-point weights** for its parameters, which results in a larger model size.
  - **Expected Size**: A normal model with 1 million parameters takes about **4MB** (1,000,000 × 4 bytes) at 32-bit precision.

**Summary for Model Size**:
- **Binarized Model**: Much smaller in size due to reduced precision (binary weights).
- **Normal Model**: Larger due to full-precision (32-bit) weights.

---

### **2. Performance (Accuracy)**
- **Binarized Model**:
  - **Lower Accuracy**: While the binarized model is faster and smaller, it typically has lower **accuracy** compared to the normal model because reducing the precision of the weights and activations introduces approximation errors.
  - **Expected Accuracy**: The binarized model may achieve **90-95% accuracy** on a dataset like MNIST, but it will generally be less precise than the normal model.

- **Normal Model**:
  - **Higher Accuracy**: The full-precision normal model typically delivers **higher accuracy**, as floating-point operations allow more precise calculations.
  - **Expected Accuracy**: The normal model should achieve **98-99% accuracy** on datasets like MNIST.

**Summary for Performance (Accuracy)**:
- **Binarized Model**: Achieves slightly lower accuracy but remains competitive.
- **Normal Model**: Achieves higher accuracy, but at the cost of increased model size and computational requirements.

---

### **3. Speed (Inference Time and Throughput)**

- **Binarized Model**:
  - **Faster Inference**: The binarized model should have **much faster inference** time because binary operations (like XOR and bitwise summation) are computationally cheaper than floating-point operations. This results in faster processing of input data.
  - **Higher Throughput**: The binarized model can process more images per second (higher FPS) because it performs less complex computations, making it ideal for real-time applications or devices with limited resources.
  - **Expected Inference Time**: As an illustrative figure (not a measurement from this notebook), a binarized model on an optimized engine might process 100 images in **0.05 seconds**, compared to the normal model’s **0.12 seconds** for the same batch size.
  
- **Normal Model**:
  - **Slower Inference**: The normal model takes more time per inference due to the heavier computational load of floating-point operations (multiplications and additions).
  - **Lower Throughput**: Because it requires more time per image, the throughput (images per second) will be lower.
  - **Expected Inference Time**: As an illustrative figure (not a measurement from this notebook), the normal model might take **0.12 seconds** to process a batch of 100 images, compared to the binarized model’s **0.05 seconds**.
  
**Summary for Speed**:
- **Binarized Model**: Significantly faster due to simpler binary operations, resulting in reduced inference time and higher throughput.
- **Normal Model**: Slower due to more complex floating-point operations.

---

### **4. Memory Usage (During Inference)**
- **Binarized Model**:
  - **Lower Memory Usage**: The reduced size of the model (1 bit per weight) means that less memory is required to store the weights and activations. This is especially useful in **resource-constrained environments** such as mobile devices or edge computing.
  - **Expected Memory Usage**: A binarized model might use **a fraction of the memory** compared to a normal model (e.g., 1MB vs. 50MB).

- **Normal Model**:
  - **Higher Memory Usage**: The normal model will require **more memory** because each weight is stored as a 32-bit floating-point number.
  - **Expected Memory Usage**: A normal model might use **50MB** or more for storing the weights, depending on the number of parameters.

**Summary for Memory Usage**:
- **Binarized Model**: Much lower memory usage due to binary weights.
- **Normal Model**: Higher memory usage due to 32-bit floating-point weights.

---

### **Expected Comparison Summary**

| **Aspect**             | **Binarized Model**                  | **Normal Model**                      |
|------------------------|--------------------------------------|---------------------------------------|
| **Model Size**         | Much smaller (32x smaller)           | Larger (e.g., 50MB or more)           |
| **Accuracy**           | Lower (90-95% for MNIST)             | Higher (98-99% for MNIST)             |
| **Inference Speed**    | Much faster (e.g., 0.05s for 100 images) | Slower (e.g., 0.12s for 100 images)   |
| **Throughput**         | Higher (more images per second)      | Lower (fewer images per second)       |
| **Memory Usage**       | Lower (1MB vs 50MB)                  | Higher (50MB or more)                 |

---

### **Final Conclusion**:

- **Binarized models** offer significant **advantages in terms of size**, **speed**, and **memory usage**. They are ideal for deployment in **resource-constrained environments** (like mobile devices, embedded systems, or real-time applications). However, they come with the **trade-off of lower accuracy**, which is typically acceptable for many applications like image classification tasks (e.g., MNIST).
  
- **Normal models**, while offering **higher accuracy**, are **slower**, **larger**, and **require more memory**. They are suitable for environments where **accuracy is paramount** and computational resources are less constrained (e.g., cloud or high-performance systems).

When choosing between these two approaches, the decision should be based on the trade-offs acceptable for the given application and deployment constraints.