#### About

> Model compression and quantization

Model compression and quantization are techniques used to reduce the size or complexity of machine learning models, making them more efficient in terms of storage, memory usage, and computation.

1. Model Compression: Model compression techniques aim to reduce the size of a model by reducing its parameter count or memory footprint while maintaining its performance or accuracy. Some common methods of model compression include:

- Pruning: Pruning involves removing redundant or unnecessary weights or neurons from the model. This can be done during training (e.g. by setting small weights to zero) or after training (e.g. by removing neurons with negligible activation values).

- Quantization: Quantization reduces the precision of model weights and/or activations from floating-point numbers (e.g., 32-bit) to a lower-bit representation (e.g., 8-bit or less). This shrinks the memory footprint of the model and can speed up inference, particularly on hardware with limited support for high-precision arithmetic.

- Knowledge distillation: Knowledge distillation involves training a smaller “student” model to mimic the predictions of a larger “teacher” model. The student learns to approximate the teacher's outputs, so it can often serve as a more compact representation of the knowledge in the original model.
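
To make the pruning idea above concrete, here is a minimal sketch of magnitude-based pruning after training. It uses plain Python lists to stay dependency-free; the function name `magnitude_prune` and the `sparsity` parameter are illustrative assumptions, not part of any library API, and real frameworks typically apply the same idea per layer on tensors.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude weights (hypothetical helper).

    weights:  flat list of floats
    sparsity: fraction of weights to set to zero, between 0 and 1
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.002]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # the three smallest-magnitude weights are now 0.0
```

The zeroed weights only save memory or compute if the model is then stored or executed in a sparse-aware format, which this sketch does not cover.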

2. Model Quantization: Model quantization techniques represent model weights and/or activations in a low-bit format (e.g., 8-bit or smaller) rather than as floating-point numbers (e.g., 32-bit). This reduces the memory footprint of the model and can speed up inference, particularly on hardware with limited support for high-precision arithmetic. Some common methods of model quantization include:

- Post-training quantization: Post-training quantization converts the weights and/or activations of an already trained model to lower precision, without retraining. This can be done using techniques such as uniform quantization, where values are mapped to a fixed set of evenly spaced levels, or non-uniform quantization, where the levels are adapted to the distribution of the data.

- Quantization-aware training: Quantization-aware training involves training a model so that it performs well once its weights and/or activations are quantized. This may include techniques such as quantization-aware backpropagation, which takes quantization effects into account during gradient computation and weight updates, or the use of specialized quantization-aware optimization algorithms.
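
The uniform quantization mentioned above can be sketched as an affine mapping from a floating-point range onto unsigned 8-bit integers. This is a minimal illustration under stated assumptions: the helper names (`quantize_params`, `quantize`, `dequantize`) are hypothetical, the range is taken from the min and max of the values themselves, and production converters may choose ranges and rounding differently.

```python
def quantize_params(values, num_bits=8):
    """Derive a scale and zero point that map the value range onto integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:  # all values are zero; any positive scale works
        scale = 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax


def quantize(values, scale, zero_point, qmin, qmax):
    """Map floats to integers, clamping to the representable range."""
    return [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [scale * (q - zero_point) for q in q_values]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, zero_point, qmin, qmax = quantize_params(weights)
q = quantize(weights, scale, zero_point, qmin, qmax)
restored = dequantize(q, scale, zero_point)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_err)
```

The round trip is lossy: each restored value differs from the original by at most about one quantization step (`scale`), which is the source of the accuracy trade-off discussed in this section.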


Model compression and quantization techniques are often used where model size, memory consumption, or computational efficiency is critical, for example when deploying to resource-constrained targets such as mobile devices, embedded systems, or edge devices with limited computing power or memory. However, these techniques can trade some accuracy for reduced size or complexity, and whether that trade-off is acceptable depends on the specific use case and application requirements.



In [21]:
import numpy as np  # used later when running inference with the TFLite interpreter
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.datasets import mnist

In [27]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()


In [28]:
# Normalize pixel values to be between 0 and 1
x_train = x_train / 255.0
x_test = x_test / 255.0

# Flatten the input images
x_train = x_train.reshape(-1, 784)
x_test = x_test.reshape(-1, 784)


In [29]:
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(784,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))

In [30]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])


In [31]:
model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f947027f4c0>

In [32]:
# Evaluate the model
_, accuracy = model.evaluate(x_test, y_test, batch_size=32)
print('Accuracy:', accuracy)

Accuracy: 0.9747999906539917


In [33]:
# Convert the model to TensorFlow Lite with float16 quantization
# (weights are stored as 16-bit floats, not 8-bit integers)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model_quantized = converter.convert()



INFO:tensorflow:Assets written to: /tmp/tmp8ngs7tut/assets
2023-04-22 06:06:44.167954: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:364] Ignored output_format.
2023-04-22 06:06:44.168012: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:367] Ignored drop_control_dependency.
2023-04-22 06:06:44.168312: I tensorflow/cc/saved_model/reader.cc:45] Reading SavedModel from: /tmp/tmp8ngs7tut
2023-04-22 06:06:44.169497: I tensorflow/cc/saved_model/reader.cc:89] Reading meta graph with tags { serve }
2023-04-22 06:06:44.169576: I tensorflow/cc/saved_model/reader.cc:130] Reading SavedModel debug info (if present) from: /tmp/tmp8ngs7tut
2023-04-22 06:06:44.176735: I tensorflow/cc/saved_model/loader.cc:231] Restoring SavedModel bundle.
2023-04-22 06:06:44.234895: I tensorflow/cc/saved_model/loader.cc:215] Running initialization op on SavedModel bundle at path: /tmp/tmp8ngs7tut
2023-04-22 06:06:44.250530: I tensorflow/cc/saved_model/loader.cc:314] SavedModel
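
The converter settings above request `tf.float16`, which is a different scheme from 8-bit integer quantization: values are rounded to half precision rather than mapped to integer levels. As a rough, framework-independent illustration, the sketch below uses Python's `struct` module to round a few example values to IEEE 754 half precision. The helper `to_float16_and_back` and the sample weights are assumptions made for illustration and are not taken from the trained model.

```python
import struct


def to_float16_and_back(x):
    """Round a Python float to IEEE 754 half precision, then widen it again."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


weights = [0.1, -1.5, 3.14159, 0.0001]  # illustrative values only
for w in weights:
    w16 = to_float16_and_back(w)
    print(f"{w!r} -> {w16!r} (abs error {abs(w - w16):.2e})")

# Storage cost per value: 4 bytes as float32 versus 2 bytes as float16
bytes_fp32 = struct.calcsize('<f') * len(weights)
bytes_fp16 = struct.calcsize('<e') * len(weights)
print(bytes_fp32, bytes_fp16)
```

Because each weight takes half the bytes, the weight storage shrinks by roughly a factor of two, while the rounding error per value stays small relative to its magnitude.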

In [35]:
# Save the quantized model to a file
with open('model_quantized.tflite', 'wb') as f:
    f.write(tflite_model_quantized)

In [36]:
# Load the quantized model
interpreter = tf.lite.Interpreter(model_path='model_quantized.tflite')
interpreter.allocate_tensors()

# Get the input and output details of the model
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare the test data
test_data = np.array(x_test, dtype=np.float32)

# Run inference on the quantized model
predictions = []
for i in range(len(test_data)):
    interpreter.set_tensor(input_details[0]['index'], [test_data[i]])
    interpreter.invoke()
    output = interpreter.get_tensor(output_details[0]['index'])
    predicted_label = np.argmax(output)
    predictions.append(predicted_label)

# Calculate accuracy
correct_predictions = np.sum(np.array(predictions) == y_test)
accuracy = correct_predictions / len(y_test)

print('Accuracy:', accuracy)

Accuracy: 0.9748
