# Optimize ONNX Model Throughput with the OLive Python Package

This notebook demonstrates how to use the OLive Python package to optimize the inference throughput of an ONNX model.

## 0. Prerequisites

1. Install the OLive package with the command `pip install onnxruntime_olive==0.5.0 --extra-index-url https://olivewheels.azureedge.net/oaas`


2. Install ONNX Runtime with

    `pip install --extra-index-url https://olivewheels.azureedge.net/oaas onnxruntime_openvino_dnnl==1.11.0` for CPU

    or

    `pip install --extra-index-url https://olivewheels.azureedge.net/oaas onnxruntime_gpu_tensorrt==1.11.0` for GPU


3. Install mlperf_loadgen with 

    `pip install --extra-index-url https://olivewheels.azureedge.net/oaas mlperf_loadgen`
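
Before moving on, it can help to confirm which of these packages are actually present in the active environment. The sketch below uses only the standard library (`importlib.metadata`). The distribution names are copied from the `pip` commands above and are assumed to match the installed metadata; the helper `installed_versions` is a hypothetical name introduced here, not part of OLive.

```python
from importlib import metadata

# Distribution names taken from the pip commands above (assumed to match the
# installed metadata). Only one of the two ONNX Runtime builds is expected.
PACKAGES = [
    "onnxruntime_olive",
    "onnxruntime_openvino_dnnl",
    "onnxruntime_gpu_tensorrt",
    "mlperf_loadgen",
]


def installed_versions(names):
    """Map each distribution name to its version, or None if it is missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

A `not installed` line for one of the two ONNX Runtime builds is expected, since only the CPU or the GPU package is installed.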

## 1. Optimize ONNX Model

Configuration options for OLive model optimization include:

| Configuration  | Detail  | Example |
|:--|:--|:--|
| **model_path** | path to the model to optimize | "PytorchBertSquad.onnx" |
| **openmp_enabled** | whether the onnxruntime package is built with OpenMP | False |
| **result_path** | result directory for OLive optimization | "olive_opt_result" |
| **throughput_tuning_enabled** | whether to tune the model for optimal throughput | True |
| **max_latency_percentile** | percentile at which the maximum latency is evaluated during throughput tuning | 0.95 |
| **max_latency_ms** | maximum latency at that percentile, in milliseconds | 50 |
| **dynamic_batching_size** | maximum batch size for dynamic batching | 1 |
| **threads_num** | number of threads for throughput optimization | 4 |
| **min_duration_sec** | minimum duration of each run, in seconds | 10 |
| **inputs_spec** | dict of input names and shapes | {"attention_mask": [1, 7], "input_ids": [1, 7], "token_type_ids": [1, 7]} |
| **output_names** | output names for onnxruntime session inference | "scores" |
| **providers_list** | providers used for perftuning | ["cpu", "dnnl"] |
| **trt_fp16_enabled** | whether to enable FP16 mode for TensorRT | True |
| **quantization_enabled** | whether to enable quantization | True |
| **transformer_enabled** | whether to enable transformer optimization | True |
| **transformer_args** | onnxruntime transformer optimizer args | "--model_type bert" |
| **sample_input_data_path** | path to sample_input_data.npz | "sample_input_data.npz" |
| **concurrency_num** | tuning process concurrency number | 2 |
| **kmp_affinity** | binds OpenMP threads to physical processing units | ["respect,none"] |
| **omp_max_active_levels** | maximum number of nested active parallel regions | ["1"] |
| **inter_thread_num_list** | list of inter-op thread counts for perftuning | [1,2,4] |
| **intra_thread_num_list** | list of intra-op thread counts for perftuning | [1,2,4] |
| **execution_mode_list** | list of execution modes for perftuning | ["parallel", "sequential"] |
| **ort_opt_level_list** | onnxruntime optimization level | ["all"] |
| **omp_wait_policy_list** | list of OpenMP wait policy for perftuning | ["active"] |
| **warmup_num** | number of warm-up runs before latency measurement | 20 |
| **test_num** | number of measured runs for latency measurement | 200 |
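
The list-valued options in the table (`providers_list`, `inter_thread_num_list`, `intra_thread_num_list`, `execution_mode_list`, `ort_opt_level_list`) each name several candidate values. As a rough illustration of how quickly the number of candidates grows, the sketch below enumerates the full Cartesian product of the example values from the table. This is only an upper-bound picture under the assumption that every combination is tried; the way OLive actually walks or prunes this space is not shown in this notebook, and `enumerate_grid` is a hypothetical helper.

```python
from itertools import product

# Example values copied from the "Example" column of the table above.
search_space = {
    "providers_list": ["cpu", "dnnl"],
    "inter_thread_num_list": [1, 2, 4],
    "intra_thread_num_list": [1, 2, 4],
    "execution_mode_list": ["parallel", "sequential"],
    "ort_opt_level_list": ["all"],
}


def enumerate_grid(space):
    """Return every combination of the list-valued options as a list of dicts."""
    keys = list(space)
    combos = product(*(space[key] for key in keys))
    return [dict(zip(keys, values)) for values in combos]


grid = enumerate_grid(search_space)
print(len(grid))  # 2 * 3 * 3 * 2 * 1 = 36 combinations
print(grid[0])
```

Shrinking each list to a single value, as the configuration cell later in this notebook does, reduces the grid to one combination and keeps the run short.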

In [1]:
from olive.optimization_config import OptimizationConfig
from olive.optimize import optimize
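
The latency constraint from the table pairs `max_latency_percentile` with `max_latency_ms`: a configuration is acceptable when the latency at the given percentile stays at or below the limit. The sketch below shows one way to read that rule, using a nearest-rank percentile over hypothetical latency samples. The percentile method, the sample values, and the helper names are assumptions for illustration; the notebook does not show how OLive or `mlperf_loadgen` compute this internally.

```python
import math


def latency_percentile(latencies_ms, percentile):
    """Nearest-rank percentile of a list of latency samples (milliseconds)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest-rank: the smallest sample whose rank covers the requested fraction.
    rank = max(1, math.ceil(percentile * len(ordered)))
    return ordered[rank - 1]


def meets_latency_target(latencies_ms, max_latency_percentile, max_latency_ms):
    """True when the latency at the given percentile is within the limit."""
    return latency_percentile(latencies_ms, max_latency_percentile) <= max_latency_ms


# Hypothetical measurements with one slow outlier.
samples = [12.0, 15.0, 14.0, 13.0, 80.0, 16.0, 12.5, 13.5, 14.5, 15.5]
print(latency_percentile(samples, 0.95))
print(meets_latency_target(samples, 0.95, 50))
```

With only ten samples, the 95th percentile lands on the slowest run, so a single outlier decides the outcome; longer runs (see `min_duration_sec`) make the percentile less sensitive to one slow request.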

In [4]:
opt_config = OptimizationConfig(
    model_path="./craft.onnx",
    sample_input_data_path="./input.npz",
    result_path="v2",
    # inputs_spec={"input": [1, 3, -1, -1]},  # not needed when sample_input_data_path is set

    dynamic_batching_size=1,

    throughput_tuning_enabled=False,  # crashes when set to True
    threads_num=1,
    max_latency_percentile=0.95,
    min_duration_sec=50,

    openmp_enabled=False,
    max_latency_ms=1000,

    providers_list=[
        "cpu",
        # "dnnl",
    ],
    inter_thread_num_list=[1],
    intra_thread_num_list=[1],
    execution_mode_list=["sequential"],
    ort_opt_level_list=["all"],
    quantization_enabled=True,

    # Uses 6 of 12 cores with concurrency_num=1, 10-12 cores with concurrency_num=2.
    concurrency_num=1,

    warmup_num=4,
    test_num=4,
)

result = optimize(opt_config)

2022-12-27 03:28:25,947 - olive.optimization_config - INFO - Checking the model file...
2022-12-27 03:28:26,402 - olive.optimization_config - INFO - Providers will be tested for optimization: ['CPUExecutionProvider']
2022-12-27 03:29:33,718 - olive.optimization.optimize_quantization - INFO - ONNX model quantization started
2022-12-27 03:29:34,948 - root - INFO - Quantization parameters for tensor:"input" not specified
2022-12-27 03:29:34,950 - root - INFO - Quantization parameters for tensor:"157" not specified
2022-12-27 03:29:34,955 - root - INFO - Quantization parameters for tensor:"161" not specified
2022-12-27 03:29:34,965 - root - INFO - Quantization parameters for tensor:"164" not specified
2022-12-27 03:29:34,982 - root - INFO - Quantization parameters for tensor:"168" not specified
2022-12-27 03:29:35,017 - root - INFO - Quantization parameters for tensor:"171" not specified
2022-12-27 03:29:35,085 - root - INFO - Quantization parameters for tensor:"174" not specified
2022-12-