# Automatic heterogeneous quantization of DNN for low-latency inference on the edge for particle detectors

<span style="color: red;"> What does "low-latency inference" mean? </span>

#### **There are two competing facts:**

* The demand for more accurate solutions is pushing DL research towards more complex algorithms (_larger model size_).
* Edge devices demand efficient inference (_reduced model size_, latency and energy consumption).

#### **There is a possible solution for this contradiction:**

* _Quantization_ limits the model size. It consists of using fewer bits to represent weights and biases.
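To make "using fewer bits" concrete, here is a minimal sketch of signed fixed-point rounding in plain Python. The function name `quantize`, its `integer_bits` parameter, and the sample weights are illustrative assumptions, not taken from the paper or from any library.

```python
def quantize(value, bits, integer_bits=0):
    """Round `value` to a signed fixed-point grid with `bits` total bits.

    `integer_bits` counts the bits left of the binary point, excluding the
    sign bit. Both parameter names are illustrative assumptions.
    """
    frac_bits = bits - integer_bits - 1      # one bit is reserved for the sign
    step = 2.0 ** -frac_bits                 # spacing between representable values
    lo = -(2.0 ** integer_bits)              # most negative representable value
    hi = 2.0 ** integer_bits - step          # most positive representable value
    rounded = round(value / step) * step     # snap to the nearest grid point
    return min(max(rounded, lo), hi)         # saturate instead of overflowing


# Made-up example weights, quantized at decreasing bit widths.
weights = [0.7431, -0.1289, 0.0312, -0.9876]
for bits in (8, 4, 2):
    print(bits, [quantize(w, bits) for w in weights])
```

With 4 bits, for instance, 0.7431 is stored as 0.75, and with 2 bits it becomes 0.5, which illustrates how precision is traded for size.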

#### **But this possible solution has a problem:**

* _Quantization_ usually results in a decline in model performance (accuracy).

#### **For this reason, we want to find (indeed, we have already found):**

* A method for designing optimally heterogeneously quantized versions of DNN models for minimum-energy, high-accuracy, nanosecond inference and fully automated deployment on chip.

#### **And how can we minimize the energy consumption and model size while maintaining high accuracy?**

* To do so, we can use an automatic per-layer, per-parameter-type quantization procedure that samples from a wide range of quantizers.
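As a rough illustration of what "per-layer and per-parameter type" means for model size, the sketch below counts the bits needed to store a small model under two configurations. The layer shapes, the bit widths, and the helper `model_bits` are all hypothetical and chosen only for the example. They are not results from the paper.

```python
# Hypothetical three-layer model: (number of weights, number of biases) per layer.
layers = {
    "dense_1": (16 * 64, 64),
    "dense_2": (64 * 32, 32),
    "output": (32 * 5, 5),
}

# Illustrative heterogeneous configuration: one bit width per layer
# and per parameter type (kernel vs. bias). Values are made up.
heterogeneous = {
    "dense_1": {"kernel": 4, "bias": 4},
    "dense_2": {"kernel": 2, "bias": 4},
    "output": {"kernel": 6, "bias": 8},
}

# Baseline: every parameter stored as a 32-bit float.
float32 = {name: {"kernel": 32, "bias": 32} for name in layers}


def model_bits(layers, config):
    """Total number of bits needed to store all weights and biases."""
    total = 0
    for name, (n_weights, n_biases) in layers.items():
        total += n_weights * config[name]["kernel"]
        total += n_biases * config[name]["bias"]
    return total


print("float32 bits:      ", model_bits(layers, float32))
print("heterogeneous bits:", model_bits(layers, heterogeneous))
```

For this made-up model the heterogeneous configuration needs 9576 bits instead of 106656, roughly an 11-fold reduction. An automatic procedure would try many such configurations and keep those that preserve accuracy.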

_That is crucial for the event selection in proton-proton collisions at CERN-LHC_

At the LHC, resources are limited and a latency of O(1) μs is required. Nanosecond inference and reduced resource consumption are achieved when the model is implemented on an FPGA.

#### **What is all this about?**

There are two main ideas:

* Real-time inference of DNNs on custom hardware has become increasingly relevant.
* Typical acceptable latency for real-time inference is O(1) ms. There are other applications that require sub-μs inference _(why?)_.

<span style="color: red;"> Key idea: Why do we want low latency? </span>

In HEP:

* HEP sits at the extreme end of the inference spectrum, in both low latency and limited area.
* In particular, processing proton-proton collision data at CERN-LHC requires these conditions.

#### **What is going on in the LHC?**

* In its particle detectors, tens of terabytes of data per second are produced from collisions occurring every 25 ns.
* This data is reduced by _the trigger_ (a real-time event-filtering system).
* The trigger decides whether each discrete collision event is kept for analysis or discarded.

#### **What about the trigger?**

* Data is buffered while the processing occurs, with a maximum latency of O(1) μs to make the trigger decision.
* High selection accuracy in the trigger is crucial to keep only the most interesting events while keeping the output bandwidth low. The trigger reduces the event rate from 40 MHz to 100 kHz.
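The rates quoted above follow from simple arithmetic, sketched below. The variable names are illustrative, and taking the latency budget as exactly 1 μs is an assumption made only to get a concrete number out of "O(1) μs".

```python
bunch_spacing_s = 25e-9      # one collision every 25 ns
output_rate_hz = 100e3       # event rate after the trigger (100 kHz)
latency_budget_s = 1e-6      # ASSUMPTION: exactly 1 us stands in for "O(1) us"

collision_rate_hz = 1.0 / bunch_spacing_s               # about 40 MHz
rejection_factor = collision_rate_hz / output_rate_hz   # about 400
kept_fraction = output_rate_hz / collision_rate_hz      # about 0.25 %

# Collisions that arrive (and must be buffered) while one decision is pending.
events_in_flight = latency_budget_s / bunch_spacing_s   # about 40

print(f"collision rate: {collision_rate_hz / 1e6:.0f} MHz")
print(f"trigger keeps about 1 event in {rejection_factor:.0f}")
print(f"kept fraction: {kept_fraction:.2%}")
print(f"events buffered during the decision: {events_in_flight:.0f}")
```

In other words, a 25 ns spacing corresponds to the 40 MHz input rate, and going down to 100 kHz means keeping roughly one event in 400.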

_The LHC will be upgraded to the HL-LHC, increasing the collision rate by a factor of 5-7. This will result in a total amount of accumulated data one order of magnitude higher than what the current LHC can deliver._

_With this extreme increase, ML solutions are being explored as fast approximations of the algorithms currently in use, to minimize the latency and maximize the precision of tasks._

#### **What about the implementation?**

* Real-time inference hardware in detectors has limited computational capacity due to size constraints. 
* Incorporating resource-intensive models without a loss in performance poses a great challenge.
* Compact network design, weight and filter pruning, and quantization are among the techniques for developing efficient inference.

_Quantization-aware training solutions have been suggested_

#### **What does quantization-aware training consist of?**

* A fixed numerical representation is adopted for the whole model. The model training is performed enforcing this constraint during weight optimization.
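One common way to enforce the constraint during weight optimization is to use the quantized weight in the forward pass while applying the gradient to a full-precision copy (often called a straight-through estimator). The toy sketch below assumes that approach. It is not necessarily how the paper's library implements training, and the data, learning rate, and `quantize` helper are invented for illustration.

```python
def quantize(value, bits=3):
    """Signed fixed-point rounding in [-1, 1) with `bits` total bits (illustrative)."""
    step = 2.0 ** -(bits - 1)
    return min(max(round(value / step) * step, -1.0), 1.0 - step)


# Toy regression data for y = 0.6 * x. The value 0.6 is not representable
# with 3 bits (the grid spacing is 0.25), so training must settle nearby.
xs = [-1.0, -0.5, 0.5, 1.0]
ys = [0.6 * x for x in xs]

w = 0.0      # latent full-precision weight that the optimizer updates
lr = 0.05    # learning rate (arbitrary choice for this sketch)

for _ in range(200):
    q = quantize(w)  # the forward pass sees only the quantized weight
    # Gradient of the mean squared error with respect to the quantized weight.
    grad = sum(2 * (q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Straight-through step: apply that gradient to the latent weight,
    # treating the rounding operation as if it were the identity.
    w -= lr * grad

print("latent weight:   ", w)
print("quantized weight:", quantize(w))
```

The latent weight drifts towards the boundary between the two grid points closest to 0.6 (0.5 and 0.75), so the deployed quantized weight ends up on one of them.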

_Some layers may be more accommodating for aggressive quantization, whereas others may require more expensive arithmetic._

_Per-layer heterogeneous quantization is the optimal way to achieve higher accuracy at low resource cost. It might require further specialization of the hardware resources._

#### **What do we want to develop?**

* A novel workflow for finding the optimal heterogeneous quantization per layer and per parameter type for a given model.
* Deploying that workflow on FPGA hardware.

We also expect to:

* Implement a range of quantization methods in a common library, providing a broad base from which optimal quantizations can easily be sampled.
* Develop a novel method for finding the optimal heterogeneous quantization for a given model, resulting in minimum-area or minimum-power DNNs while maintaining high accuracy.

#### **Indeed, we have _QKeras & AutoQKeras_ libraries:**

* These libraries replace Keras layers, transforming Keras models into their equivalent deep heterogeneously quantized versions, which are trained in a quantization-aware manner.
* Using AutoQKeras, a user can trade off accuracy against model size reduction.

#### **Why is it important for HEP on edge applications?**

* It can classify events in the proton-proton collision trigger at CERN-LHC, where resources are limited and a maximum latency of O(1) μs is imposed.
* Inference within 60 ns and a 50-fold reduction in model resource consumption can be achieved through heterogeneous quantization, while maintaining similar accuracy (within 3% of the floating-point model accuracy).
* It shows that the original floating-point model accuracy can be maintained for homogeneously quantized DNNs down to a bit width of 6, while reducing resource consumption by up to 75%, through quantization-aware training with QKeras.

_Another ML goal is deploying ultra-low-latency, low-area DNNs on chip. That is crucial for the deployment of ML models on FPGAs in particle detectors and in other fields with extreme inference and low-power requirements._