# Notes on Exporting Models to ONNX Format

## 1. Before Quantization

### 1.1 ONNX-Compatible Modification Before Exporting

The model architecture should be modified to be ONNX-compatible before export. The relevant change:

```python
x = F.avg_pool1d(x, x.shape[-1]) 
```

->

```python
x = F.adaptive_avg_pool1d(x, 1)
```
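- `F.avg_pool1d(x, x.shape[-1])` derives its kernel size from the input shape, which PyTorch evaluates at runtime. `torch.onnx.export()` traces the model with an example input, and the ONNX `AveragePool` operator takes its kernel size as a fixed attribute, so a kernel size that depends on the input shape cannot be exported faithfully.
- `adaptive_avg_pool1d` with an output size of 1 is supported by the exporter (it typically maps to a global average pool) and always produces an output of fixed length 1, regardless of the input length.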
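
As a quick sanity check that the replacement does not change the model's output, the sketch below compares both calls on a random tensor. The tensor shape `(2, 4, 16)` is an arbitrary assumption for illustration, not the shape used by the actual model.

```python
import torch
import torch.nn.functional as F

# Hypothetical input: (batch, channels, length). Any length works.
x = torch.randn(2, 4, 16)

# Original formulation: kernel size taken from the input length.
dynamic = F.avg_pool1d(x, x.shape[-1])

# ONNX-friendly formulation: output length fixed to 1.
adaptive = F.adaptive_avg_pool1d(x, 1)

# Both reduce to a mean over the last dimension.
reference = x.mean(dim=-1, keepdim=True)

print(dynamic.shape, adaptive.shape)
print(torch.allclose(dynamic, adaptive, atol=1e-6))
```

In eager mode the two calls agree; the difference only matters at export time, where the adaptive version does not depend on a concrete input length.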

### 1.2 Preprocess by `onnxruntime.quantization.preprocess` Before Quantization

```bash
python -m onnxruntime.quantization.preprocess \
    --input models/cnn_fp32.onnx \
    --output models/cnn_fp32_infer.onnx
```
Pre-processing transforms a float32 model to prepare it for quantization and to improve quantization quality. It consists of three optional steps:

- Symbolic shape inference. This is best suited for transformer models.
- ONNX shape inference.
- Model optimization. This step uses the ONNX Runtime native library to rewrite the computation graph, for example by merging computation nodes and eliminating redundancies, to improve runtime efficiency.

In our case, judging from the computational graph, preprocessing:
- Infers the tensor shape at each step of the graph (after preprocessing, the shape is shown next to each arrow);
- Fuses the `MatMul` and `Add` operators into a single [`Gemm`](https://onnx.ai/onnx/operators/onnx__Gemm.html) operator for matrix multiplication.




## 2. During Quantization

I wrapped `onnxruntime.quantization.quantize_static` in a script, `./src/onnx_static_quantize.py`, following [the example](https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/image_classification/cpu/run.py) from the ONNX Runtime inference examples.
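
For orientation, a wrapper of this kind might look roughly like the sketch below. It is not the contents of `./src/onnx_static_quantize.py`: the helper names `make_calibration_batches`, `quantize_model`, and `ListDataReader` are hypothetical, the random calibration data stands in for real samples, and the input shape is an assumption. The input name `"input"` and the QDQ format are guesses based on the node names in the quantized graph.

```python
import numpy as np


def make_calibration_batches(num_batches, input_shape, seed=0):
    """Return float32 arrays standing in for real calibration samples (assumption)."""
    rng = np.random.default_rng(seed)
    return [
        rng.standard_normal(input_shape).astype(np.float32)
        for _ in range(num_batches)
    ]


def quantize_model(fp32_path, int8_path, batches, input_name="input"):
    """Statically quantize an ONNX model, calibrating on the given batches."""
    # Imported lazily so the helper above works without onnxruntime installed.
    from onnxruntime.quantization import (
        CalibrationDataReader,
        QuantFormat,
        QuantType,
        quantize_static,
    )

    class ListDataReader(CalibrationDataReader):
        """Feeds the calibration batches to the quantizer one at a time."""

        def __init__(self):
            self._feeds = iter({input_name: batch} for batch in batches)

        def get_next(self):
            # Returning None tells the calibrator that the data is exhausted.
            return next(self._feeds, None)

    quantize_static(
        fp32_path,
        int8_path,
        ListDataReader(),
        quant_format=QuantFormat.QDQ,        # assumption: QDQ node pairs
        activation_type=QuantType.QInt8,     # assumption: signed int8 activations
        weight_type=QuantType.QInt8,
    )


# Hypothetical usage, with an assumed input shape of (1, 1, 128):
# batches = make_calibration_batches(32, (1, 1, 128))
# quantize_model("models/cnn_fp32_infer.onnx", "models/cnn_int8.onnx", batches)
```

In practice the calibration batches should come from representative training or validation data, since the quantization ranges are derived from the activations observed during calibration.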





## 3. Evaluation After Quantization

Run `./onnx_static_quantize.sh` to execute the full workflow: static quantization in ONNX Runtime followed by evaluation, starting from the trained fp32 model checkpoint. The output should be close to this:

|Metrics|FP32 model|QInt8 model|
| ---- | ---- | ---- |
| Model Size | 106.51 KB | 44.09 KB |
| Accuracy | 83.14% | 78.41% |
| Average Inference Time | 9.62 ms | 2.85 ms |
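
To put these numbers in perspective, the sketch below derives the compression ratio, speedup, and accuracy drop from the table values. It also includes a small, generic timing helper, `average_latency_ms`, which is a hypothetical illustration of how an average inference time could be measured and is not necessarily how the script measures it.

```python
import time


def average_latency_ms(fn, runs=100, warmup=10):
    """Average wall-clock time of fn() in milliseconds (hypothetical helper)."""
    for _ in range(warmup):
        fn()  # warm-up calls are excluded from the measurement
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000.0 / runs


# Values copied from the table above.
fp32 = {"size_kb": 106.51, "accuracy": 83.14, "latency_ms": 9.62}
int8 = {"size_kb": 44.09, "accuracy": 78.41, "latency_ms": 2.85}

size_ratio = fp32["size_kb"] / int8["size_kb"]
speedup = fp32["latency_ms"] / int8["latency_ms"]
accuracy_drop = fp32["accuracy"] - int8["accuracy"]

print(f"Size reduction: {size_ratio:.2f}x")
print(f"Speedup:        {speedup:.2f}x")
print(f"Accuracy drop:  {accuracy_drop:.2f} percentage points")
```

For these values the quantized model is roughly 2.4 times smaller and 3.4 times faster, at a cost of about 4.7 percentage points of accuracy.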

## 4. Check the Computational Graph of the Quantized Model

```python
import onnx

model = onnx.load("../models/cnn_int8.onnx")
graph = model.graph

# Count the nodes and list their names.
print(len(graph.node))
node_map = {node.name: node for node in graph.node}
print(node_map.keys())

# Count the initializers and list their names.
initializer_map = {init.name: init for init in graph.initializer}
print(len(initializer_map))
print(initializer_map.keys())
# for init in initializer_map.values():
#     if "55" in init.name:
#         print(init.name)
```

49
dict_keys(['fc1.bias_DequantizeLinear', 'input_QuantizeLinear', 'onnx::Conv_54_DequantizeLinear', 'onnx::Conv_55_DequantizeLinear', 'onnx::Conv_57_DequantizeLinear', 'onnx::Conv_58_DequantizeLinear', 'onnx::Conv_60_DequantizeLinear', 'onnx::Conv_61_DequantizeLinear', 'onnx::Conv_63_DequantizeLinear', 'onnx::Conv_64_DequantizeLinear', 'onnx::MatMul_65_DequantizeLinear', 'input_DequantizeLinear', '/block1/block/block.0/Conv', '/block1/block/block.2/Relu_output_0_QuantizeLinear', '/block1/block/block.2/Relu_output_0_DequantizeLinear', '/block1/block/block.3/MaxPool', '/block1/block/block.3/MaxPool_output_0_QuantizeLinear', '/block1/block/block.3/MaxPool_output_0_DequantizeLinear', '/block2/block/block.0/Conv', '/block2/block/block.2/Relu_output_0_QuantizeLinear', '/block2/block/block.2/Relu_output_0_DequantizeLinear', '/block2/block/block.3/MaxPool', '/block2/block/block.3/MaxPool_output_0_QuantizeLinear', '/block2/block/block.3/MaxPool_output_0_DequantizeLinear', '/block3/block/block.