# Optimize ONNX Model Latency with the OLive Python Package

This notebook demonstrates how to use the OLive Python package to optimize the inference latency of an ONNX model.

## 0. Prerequisites

0. Download the OLive Python package [here](https://olivewheels.blob.core.windows.net/repo/onnxruntime_olive-0.2.0-py3-none-any.whl)


1. Install OLive with `pip install onnxruntime_olive-0.2.0-py3-none-any.whl`


2. Install ONNX Runtime with

    `pip install --extra-index-url https://olivewheels.azureedge.net/test onnxruntime_openvino_dnnl==1.9.0` for CPU

    or

    `pip install --extra-index-url https://olivewheels.azureedge.net/test onnxruntime_gpu_tensorrt==1.9.0` for GPU


3. Download the example model

In [None]:
import os
import urllib.request

# Directory that holds the example model
optimization_example = 'optimization_example'
os.makedirs(optimization_example, exist_ok=True)

# Download the example BERT question-answering model into that directory
model_url = "https://olivemodels.blob.core.windows.net/models/optimization/TFBertForQuestionAnswering.onnx"
model_file = os.path.join(optimization_example, "TFBertForQuestionAnswering.onnx")
model_response = urllib.request.urlretrieve(model_url, model_file)
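
Before moving on, it can help to confirm that the prerequisites above are in place. The following is a minimal sketch, not part of OLive: `check_prerequisites` is a hypothetical helper that uses only the standard library to test whether the `onnxruntime` and `olive` modules are importable and whether the downloaded model file exists and is non-empty.

```python
import importlib.util
import os


def check_prerequisites(model_file, packages=("onnxruntime", "olive")):
    """Return a dict mapping each requirement to True (present) or False (missing).

    Hypothetical helper for this notebook; it is not part of the OLive API.
    """
    status = {}
    for name in packages:
        # find_spec returns None when a top-level module cannot be located
        status[name] = importlib.util.find_spec(name) is not None
    # The model counts as present only if the file exists and is non-empty
    status[model_file] = os.path.isfile(model_file) and os.path.getsize(model_file) > 0
    return status


model_file = os.path.join("optimization_example", "TFBertForQuestionAnswering.onnx")
for requirement, ok in check_prerequisites(model_file).items():
    print(f"{requirement}: {'OK' if ok else 'MISSING'}")
```

If any entry prints `MISSING`, rerun the corresponding install or download step before continuing.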

## 1. Optimize ONNX Model

Configuration options for OLive model optimization include:

| Configuration  | Detail  | Example |
|:--|:--|:--|
| **model_path** | model path for optimization | "PytorchBertSquad.onnx" |
| **openmp_enabled** | whether the onnxruntime package is built with OpenMP | False |
| **result_path** | result directory for OLive optimization | "olive_opt_result" |
| **inputs_spec** | dict of input names and shapes | {"attention_mask": [1, 7], "input_ids": [1, 7], "token_type_ids": [1, 7]} |
| **output_names** | output names for onnxruntime session inference | "scores" |
| **providers_list** | providers used for perftuning | ["cpu", "dnnl"] |
| **trt_fp16_enabled** | whether to enable fp16 mode for TensorRT | True |
| **quantization_enabled** | whether to enable quantization | True |
| **transformer_enabled** | whether to enable transformer optimization | True |
| **transformer_args** | onnxruntime transformer optimizer args | "--model_type bert" |
| **sample_input_data_path** | path to sample_input_data.npz | "sample_input_data.npz" |
| **concurrency_num** | tuning process concurrency number | 2 |
| **kmp_affinity** | bind OpenMP threads to physical processing units | ["respect,none"] |
| **omp_max_active_levels** | maximum number of nested active parallel regions | ["1"] |
| **inter_thread_num_list** | list of inter thread number for perftuning | [1,2,4] |
| **intra_thread_num_list** | list of intra thread number for perftuning | [1,2,4] |
| **execution_mode_list** | list of execution mode for perftuning | ["parallel", "sequential"] |
| **ort_opt_level_list** | onnxruntime optimization level | ["all"] |
| **omp_wait_policy_list** | list of OpenMP wait policy for perftuning | ["active"] |
| **warmup_num** | number of warmup runs before latency measurement | 20 |
| **test_num** | number of repeated test runs for latency measurement | 200 |

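The table lists `sample_input_data_path` as a path to an `.npz` file but does not show how to build one. Below is a minimal sketch using NumPy's `savez`. It assumes, without confirmation from the OLive documentation, that the archive keys should match the model's input names and that the inputs are `int32` tensors; `make_sample_inputs` is a hypothetical helper, and the value ranges are placeholders.

```python
import numpy as np

# Same input names and shapes as the inputs_spec example in the table above
inputs_spec = {"attention_mask": [1, 7], "input_ids": [1, 7], "token_type_ids": [1, 7]}


def make_sample_inputs(spec, dtype=np.int32, seed=0):
    """Build one array per input name (hypothetical helper, not an OLive API).

    Assumption: int32 inputs; adjust dtype if the model expects int64.
    """
    rng = np.random.default_rng(seed)
    samples = {}
    for name, shape in spec.items():
        if name == "attention_mask":
            # Attend to every token
            samples[name] = np.ones(shape, dtype=dtype)
        elif name == "token_type_ids":
            # Treat the whole sequence as a single segment
            samples[name] = np.zeros(shape, dtype=dtype)
        else:
            # Placeholder token ids in an arbitrary small range
            samples[name] = rng.integers(0, 1000, size=shape, dtype=dtype)
    return samples


samples = make_sample_inputs(inputs_spec)
# Each keyword argument becomes one named array inside the archive
np.savez("sample_input_data.npz", **samples)
```

The resulting file could then be passed as `sample_input_data_path="sample_input_data.npz"`, provided the key and dtype assumptions above hold for the model being tuned.
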
In [None]:
from olive.optimization_config import OptimizationConfig
from olive.optimize import optimize

In [None]:
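# Tune over the CPU and DNNL providers, several inter thread counts,
# and both execution modes
opt_config = OptimizationConfig(
    model_path = os.path.join(optimization_example, "TFBertForQuestionAnswering.onnx"),
    result_path = "olive_opt_latency_result",
    inputs_spec = {"attention_mask": [1, 7], "input_ids": [1, 7], "token_type_ids": [1, 7]},
    providers_list = ["cpu", "dnnl"],
    inter_thread_num_list = [1,2,4],
    execution_mode_list = ["parallel", "sequential"],
    warmup_num = 10,  # warmup runs before latency measurement
    test_num = 40)    # repeated test runs for latency measurement

# Run the optimization; result_path is the result directory
result = optimize(opt_config)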
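
To get a feel for how much work this configuration asks for, the sketch below enumerates the combinations of the three lists set explicitly above. It is a rough estimate only: it assumes each combination is measured once with `warmup_num + test_num` runs, and OLive's actual search may prune combinations or also vary parameters left at their defaults.

```python
from itertools import product

# Values copied from the OptimizationConfig call above
providers_list = ["cpu", "dnnl"]
inter_thread_num_list = [1, 2, 4]
execution_mode_list = ["parallel", "sequential"]
warmup_num = 10
test_num = 40

# Cartesian product of the explicitly configured tuning dimensions
combinations = list(product(providers_list, execution_mode_list, inter_thread_num_list))

# Assumption: every combination gets warmup_num + test_num inference runs
runs_per_combination = warmup_num + test_num
total_runs = len(combinations) * runs_per_combination

print(f"{len(combinations)} combinations, about {total_runs} inference runs")
```

Adding values to any list multiplies the number of combinations, so the tuning time grows quickly as more dimensions are searched.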