# Week 2. Quantization & Pruning

---

## Mobile, IoT, and Similar Use Cases
---
- Trends in adoption of smart devices
- Demand is moving ML capability from the cloud to the device
- Cost-effectiveness
- Compliance with privacy regulations


### Online ML inference
---
1. To generate real-time predictions you can:
   - Host the model on a server
   - Embed the model in the device
2. Is it faster on a server, or on-device?
3. Mobile processing limitations?

<img src = "https://i.gyazo.com/2cf8354b9ad0d99dfd7e03ded86534f2.png" width = "500px">
<img src = "https://i.gyazo.com/f5fd34b88e8b5d4c19c31ba56ecbf68c.png" width = "500px">

### Model deployment

<img src = "https://i.gyazo.com/8033e6966b9ef0818f0513ca2dbe30d1.png" width = "500px">

## Benefits and Process of Quantization
---

- Quantization involves transforming a model into an equivalent representation that uses parameters and computations at a lower precision. This improves the model's execution performance and efficiency, but it can often result in lower model accuracy.
- Quantization, in essence, reduces the number of bits needed to represent information. An image is a useful analogy: as you reduce the number of pixels beyond a certain point, depending on the image, it gets harder to recognize what the image shows.
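
To make the idea of "fewer bits, less precision" concrete, here is a minimal sketch of affine quantization in plain Python. The function names (`quantize`, `dequantize`, `max_error`) and the sample weights are made up for illustration; real toolkits use more elaborate schemes, so treat this only as a picture of the principle.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers in [0, 2**num_bits - 1] (affine scheme)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range so that 0.0 is always exactly representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 if all values are 0
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [scale * (qi - zero_point) for qi in q]


def max_error(values, num_bits):
    """Largest absolute round-trip error at the given bit width."""
    q, scale, zero_point = quantize(values, num_bits)
    restored = dequantize(q, scale, zero_point)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical weights, chosen only for illustration.
weights = [-1.2, -0.4, 0.0, 0.3, 0.75, 2.0]
for bits in (8, 4, 2):
    print(bits, "bits -> max abs error", round(max_error(weights, bits), 4))
```

Running it shows the round-trip error growing as the bit width shrinks, which is the numeric counterpart of the image becoming harder to recognize.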

### Why quantize neural networks?
- Neural networks have many parameters and take up space
- Shrinking model file size
- Reduce computational resources
- Make models run faster and use less power by using low-precision arithmetic
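
As a rough back-of-the-envelope check on the file-size point, the sketch below estimates storage for the parameters alone at different precisions. The parameter count is a hypothetical figure, and the estimate ignores file-format metadata and any tensors left unquantized.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_VALUE = {"float32": 4, "float16": 2, "int8": 1}


def model_size_mb(num_parameters, dtype):
    """Approximate parameter storage in megabytes (10**6 bytes)."""
    return num_parameters * BYTES_PER_VALUE[dtype] / 1e6


num_parameters = 5_000_000  # hypothetical model size, for illustration only
for dtype in BYTES_PER_VALUE:
    print(f"{dtype}: {model_size_mb(num_parameters, dtype):.1f} MB")
```

Under these assumptions, going from 32-bit floats to 8-bit integers cuts parameter storage by a factor of four.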

### Benefits of quantization
- Faster compute
- Lower memory bandwidth requirements
- Lower power consumption
- Integer operations supported across CPU/DSP/NPUs

### The quantization process
<img src = "https://i.gyazo.com/16553809c1015c2cb41ea2d4782e68c0.png" width = "500px">

### What parts of the model are affected?
- Static values (parameters)
- Dynamic values (activations)
- Computation (transformations)

### Trade-offs
- Optimization impacts model accuracy, and the effect is difficult to predict ahead of time
- In rare cases, models may actually gain some accuracy
- Undefined effects on ML interpretability


### Choose the best model for the task
- Trade-off between model accuracy and model complexity

## Post Training Quantization

---

- Post-training quantization is a conversion technique that can reduce model size while also improving CPU and hardware accelerator latency with little degradation in model accuracy. 
- You can quantize an already trained TensorFlow model when you convert it to TensorFlow Lite format using the TensorFlow Lite converter.
- Post-training quantization converts, or more precisely quantizes, the weights from floating-point numbers to integers in an efficient way. This can give up to three times lower latency without a major hit on accuracy. With the default optimization strategy, the converter does its best to apply post-training quantization, optimizing the model for both size and latency.

<img src = "https://i.gyazo.com/2d37ff842c9d1c49584527d0de3b1ccf.png" width = "500px">


- Dynamic range quantization can reduce model size and/or latency, but it has a limitation: inference must still be done with floating-point numbers.
- That is not always ideal, since some hardware accelerators, for example Edge TPUs, only support integer operations. For those cases the optimization toolkit also supports post-training integer quantization.
- Integer quantization takes an already trained floating-point model and fully quantizes it to use only 8-bit signed integers, which lets fixed-point hardware accelerators run the model.
- When targeting greater CPU improvements or fixed-point accelerators, this is often the better option. Post-training integer quantization works by gathering calibration data: it runs inference on a small set of inputs to determine the scaling parameters needed to convert the model to an integer-quantized model.
- Post-training quantization can result in a loss of accuracy, particularly for smaller networks, but the loss is often fairly negligible. On the plus side, it speeds up the heaviest computations by running them at lower precision while keeping the most sensitive computations at higher precision, which typically results in little or no final loss of accuracy.
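
The calibration step described above can be pictured with a small plain-Python sketch. It is not the TensorFlow Lite converter's actual implementation: the helper names (`calibrate`, `int8_params`, `to_int8`) and the activation values are hypothetical, and a simple min/max range is assumed as the calibration statistic.

```python
def calibrate(representative_batches):
    """Track the smallest and largest activation seen across all batches."""
    lo, hi = 0.0, 0.0  # start at 0 so that 0.0 stays inside the range
    for batch in representative_batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    return lo, hi


def int8_params(lo, hi):
    """Derive scale and zero point for signed 8-bit values in [-128, 127]."""
    qmin, qmax = -128, 127
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an empty range
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def to_int8(values, scale, zero_point):
    """Quantize floats to int8, clipping anything outside the calibrated range."""
    return [min(127, max(-128, round(v / scale) + zero_point)) for v in values]


# Hypothetical activations recorded while running a few sample inputs.
batches = [[0.1, 0.9, 2.5], [0.0, 1.7, 3.1], [0.4, 0.2, 5.9]]
lo, hi = calibrate(batches)
scale, zero_point = int8_params(lo, hi)
print(to_int8([0.0, 3.0, 5.9, 10.0], scale, zero_point))
```

The last input, 10.0, lies outside the range seen during calibration and is clipped to the largest int8 value. This is one reason the calibration inputs should resemble the data the model will see in production.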

### Model accuracy

- Small accuracy loss incurred (mostly for smaller networks).

<img src = "https://i.gyazo.com/8216f7ab4f8b739b2dac77e1e4da870d.png" width = "500px">

## Quantization Aware Training
---

- Inserts fake quantization (FQ) nodes in the forward pass
- Rewrites the graph to emulate quantized inference
- Reduces the loss of accuracy due to quantization
- Resulting model contains all data to be quantized according to spec
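
A fake quantization node can be sketched as "quantize, then immediately dequantize": the value stays a float, but it is snapped to the grid that real integer inference would use. The code below is a simplified illustration with hypothetical names (`fake_quantize`, `dense_forward`) and a fixed assumed range of [-1, 1]. It covers only the forward pass and leaves out how gradients are handled during training.

```python
def fake_quantize(x, lo, hi, num_bits=8):
    """Quantize then immediately dequantize, so the result is still a float."""
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels
    clipped = min(hi, max(lo, x))          # values outside [lo, hi] saturate
    step = round((clipped - lo) / scale)   # nearest integer grid position
    return lo + step * scale


def dense_forward(inputs, weights, lo=-1.0, hi=1.0):
    """Forward pass of one neuron whose weights pass through fake quantization."""
    fq_weights = [fake_quantize(w, lo, hi) for w in weights]
    return sum(i * w for i, w in zip(inputs, fq_weights))


# Hypothetical weights and inputs, for illustration only.
weights = [0.1234, -0.5678, 0.9]
inputs = [1.0, 2.0, 3.0]
print("float output:     ", sum(i * w for i, w in zip(inputs, weights)))
print("fake-quant output:", dense_forward(inputs, weights))
```

Because the rounding error is already present in the forward pass, training can adapt the weights to it, which is how the accuracy loss from later quantization is reduced.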

## Pruning

---

<img src = "https://i.gyazo.com/6af44724ac31c70cbe173b5cb83ff837.png" width = "500px">

- Pruning aims to reduce the number of parameters and operations involved in generating a prediction by removing network connections. 
- With pruning, you can lower the overall parameter count in the network. Networks generally look like the one on the left.
- Here every neuron in a layer has a connection to every neuron in the layer before it, which means we have to multiply a lot of floats together.
- Ideally, we'd only connect each neuron to a few others and save on doing some of the multiplications, if we can find a way to do that without too much loss of accuracy.
- Restricting the search space **can also act as a regularizer**.
- Better storage and/or transmission
- Gain speedups in CPU and some ML accelerators
- Can be used in tandem with quantization to get additional benefits
- Unlock performance improvements
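
One common way to choose which connections to remove is magnitude-based pruning, which zeroes the weights with the smallest absolute values. The notes above do not name a specific criterion, so this choice is an assumption, and the function name `prune_by_magnitude` and the weights are hypothetical. A minimal sketch:

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    num_to_prune = int(len(weights) * sparsity)
    # Indices sorted from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:num_to_prune])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]


# Hypothetical weights of one layer, flattened into a list.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.4, -0.1]
pruned = prune_by_magnitude(weights, sparsity=0.5)
print(pruned)
print("non-zero weights:", sum(w != 0.0 for w in pruned), "of", len(pruned))
```

Zeroed weights compress well for storage and transmission. Whether they also translate into faster inference depends on whether the runtime or accelerator can skip the zeros.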