# OLIVE: Model optimization toolkit for the ONNX Runtime

OLIVE (ONNX LIVE) is a cutting-edge model optimization toolkit with an accompanying CLI that helps you ship models for the ONNX Runtime that meet your quality and performance targets.

<img src="./images/olive-flow.png" alt="Olive Flow" width="500"/>

The input to OLIVE is typically a PyTorch or Hugging Face model, and the output is an optimized ONNX model that is executed on a device (the deployment target) running the ONNX Runtime. OLIVE optimizes the model for the deployment target's AI accelerator (NPU, GPU or CPU) provided by a hardware vendor such as Qualcomm, AMD, NVIDIA or Intel.

OLIVE executes a *workflow*, which is an ordered sequence of individual model optimization tasks called *passes*. Example passes include model compression, graph capture, quantization and graph optimization. Each pass has a set of parameters that can be tuned to achieve the best metrics, such as accuracy and latency, as measured by the respective evaluator. OLIVE employs a search strategy that uses a search algorithm to auto-tune each pass one by one, or a set of passes together.
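
To make the workflow idea concrete, here is a minimal conceptual sketch in plain Python. It does not use the OLIVE API: the two passes, their parameters, the metric numbers and the exhaustive search are all hypothetical stand-ins, chosen only to show how an ordered list of tunable passes can be searched against an accuracy constraint.

```python
from itertools import product

# Hypothetical passes: each one transforms a toy "model" (a dict of metrics).
def quantize(model, bits):
    accuracy_drop = {8: 0.01, 4: 0.04}[bits]  # made-up numbers
    return {"accuracy": model["accuracy"] - accuracy_drop,
            "latency_ms": model["latency_ms"] * bits / 16}

def optimize_graph(model, level):
    speedup = {0: 1.0, 1: 0.9, 2: 0.8}[level]  # made-up numbers
    return {"accuracy": model["accuracy"],
            "latency_ms": model["latency_ms"] * speedup}

# The workflow is an ordered list of (pass, parameter name, candidate values).
workflow = [(quantize, "bits", [8, 4]), (optimize_graph, "level", [0, 1, 2])]

def run_workflow(model, params):
    """Apply every pass in order with the chosen parameter values."""
    for pass_fn, name, _ in workflow:
        model = pass_fn(model, params[name])
    return model

def search(model, min_accuracy):
    """Exhaustive search: lowest latency among runs that meet the accuracy floor."""
    best = None
    names = [name for _, name, _ in workflow]
    for values in product(*(candidates for _, _, candidates in workflow)):
        params = dict(zip(names, values))
        result = run_workflow(model, params)
        if result["accuracy"] < min_accuracy:
            continue  # violates the quality constraint
        if best is None or result["latency_ms"] < best[1]["latency_ms"]:
            best = (params, result)
    return best

baseline = {"accuracy": 0.90, "latency_ms": 100.0}
best_params, best_metrics = search(baseline, min_accuracy=0.88)
print(best_params, best_metrics)
```

In this toy setup the 4-bit candidates are rejected because they fall below the accuracy floor, so the search settles on 8-bit quantization with the most aggressive graph optimization level.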

## ➕ Benefits of OLIVE

- Reduce the frustration and time spent on trial-and-error manual experimentation with different techniques for graph optimization, compression and quantization. Define your quality and performance constraints and let OLIVE automatically find the best model for you.
- 40+ built-in model optimization components covering cutting edge techniques in quantization, compression, graph optimization and finetuning.
- Easy-to-use CLI for common model optimization tasks, for example `olive quantize`, `olive auto-opt` and `olive finetune`.
- Model packaging and deployment built-in.
- Supports Multi LoRA serving.
- Construct workflows using YAML/JSON to orchestrate model optimization and deployment tasks.
- Hugging Face and Azure AI Integration.
- Built-in caching mechanism to save costs.

## The data

In this example, you're going to fine-tune the Phi-3.5-mini model so that it specializes in answering travel-related questions. The code below displays the first few records of the dataset, which is in JSON Lines format.

In [8]:
from datasets import load_dataset

dataset = load_dataset("json", data_files="data/data_sample_travel.jsonl")
dataset["train"].to_pandas().head()

| Unnamed: 0 | prompt | response |
|---|---|---|
| 0 | What's a must-see in Paris? | Oh la la! You simply must twirl around the Eif... |
| 1 | Best way to get around Tokyo? | Hop on a bullet train for speed, explore the c... |
| 2 | What's the best museum in New York? | The Met is a must-visit, but don't overlook the M... |
| 3 | What should I pack for a trip to Australia? | Don't forget sunscreen and a hat for those sun... |
| 4 | Best place to eat in Bangkok? | For street food heaven, try the night markets – y... |
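
For readers unfamiliar with JSON Lines, the sketch below parses the format with the standard library only. It assumes each line is a standalone JSON object with `prompt` and `response` fields, as in the output above. The two records are hypothetical: the prompts are taken from the table, and the responses are placeholders because the real ones are truncated.

```python
import json

# Two hypothetical records in JSON Lines form: one JSON object per line.
jsonl_text = (
    '{"prompt": "What\'s a must-see in Paris?", "response": "<placeholder answer>"}\n'
    '{"prompt": "Best way to get around Tokyo?", "response": "<placeholder answer>"}\n'
)

# Parse every non-empty line independently.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

for record in records:
    print(record["prompt"], "->", record["response"])
```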


## 🗜️ Quantize the model

Before training the model, we first quantize it using a technique called [Activation-aware Weight Quantization (AWQ)](https://arxiv.org/abs/2306.00978). Below is the abstract of the AWQ paper.

*Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hardware resource pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. AWQ finds that not all weights in an LLM are equally important. Protecting only 1% salient weights can greatly reduce quantization error. To identify salient weight channels, we should refer to the activation distribution, not weights. To avoid the hardware-inefficient mix-precision quantization, we mathematically derive that scaling up the salient channels can reduce the quantization error. AWQ employs an equivalent transformation to scale the salient weight channels to protect them. The scale is determined by collecting the activation statistics offline. AWQ does not rely on any backpropagation or reconstruction, so it generalizes to different domains and modalities without overfitting the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks (coding and math). Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement TinyChat, an efficient and flexible inference framework tailored for 4-bit on-device LLM/VLMs. With kernel fusion and platform-aware weight packing, TinyChat offers more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs.*
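
The abstract's central observation, that scaling up salient weights before quantization reduces their quantization error, can be illustrated with a toy experiment. This is a simplified sketch and not the AWQ algorithm itself. It assumes a symmetric 4-bit grid whose step size is fixed by a larger weight elsewhere in the group, a hand-picked scale of 2, and randomly drawn "salient" weights. Real AWQ chooses per-channel scales from activation statistics.

```python
import random

def quantize_int4(w, step):
    """Round-to-nearest symmetric 4-bit quantization with a fixed step size."""
    q = max(-8, min(7, round(w / step)))
    return q * step

def mean_abs_error(weights, step, scale):
    """Quantize w * scale, then undo the scale (as x / scale would at inference)."""
    errors = [abs(quantize_int4(w * scale, step) / scale - w) for w in weights]
    return sum(errors) / len(errors)

random.seed(0)
group_max = 1.0       # assumed largest weight in the group; it fixes the step size
step = group_max / 7  # step of the 4-bit symmetric grid
# Hypothetical salient weights, small enough that scaling by 2 stays within the grid.
salient = [random.uniform(-0.5, 0.5) for _ in range(10_000)]

err_plain = mean_abs_error(salient, step, scale=1.0)
err_scaled = mean_abs_error(salient, step, scale=2.0)
print(f"mean |error| without scaling: {err_plain:.4f}")
print(f"mean |error| with scale 2:    {err_scaled:.4f}")
```

Because the step size is unchanged while the weight is doubled, the rounding error is divided by the scale once the scaling is undone, so the average error on the scaled weights comes out lower.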

We find that quantizing the model *before* fine-tuning greatly improves the accuracy of the model.

> **📝 It takes around 10 minutes for the quantization to complete.**

In [12]:
%%bash

olive quantize \
    --model_name_or_path microsoft/Phi-3.5-mini-instruct \
    --algorithm awq \
    --output_path models/phi/awq \
    --log_level 1

Loading HuggingFace model from microsoft/Phi-3.5-mini-instruct
[2024-10-16 13:26:53,855] [INFO] [run.py:138:run_engine] Running workflow default_workflow
[2024-10-16 13:26:53,869] [INFO] [cache.py:137:__init__] Using cache directory: /home/azureuser/code/Ignite_FineTuning_workshop/lab/workshop-instructions/lab5-optimize-model/.olive-cache/default_workflow
[2024-10-16 13:26:53,871] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: cpu-cpu
[2024-10-16 13:26:53,871] [INFO] [engine.py:255:run] Running Olive on accelerator: cpu-cpu
[2024-10-16 13:26:53,871] [INFO] [engine.py:897:_create_system] Creating target system ...
[2024-10-16 13:26:53,872] [INFO] [engine.py:900:_create_system] Target system created in 0.000424 seconds
[2024-10-16 13:26:53,872] [INFO] [engine.py:909:_create_system] Creating host system ...
[2024-10-16 13:26:53,872] [INFO] [engine.py:912:_create_system] Host system created in 0.000042 seconds
[2024-10-16 13:26:53,939] [INFO]

Fetching 19 files: 100%|██████████| 19/19 [00:00<00:00, 200230.59it/s]
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  2.36it/s]
Repo card metadata block was not found. Setting CardData to empty.
AWQ: 100%|██████████| 32/32 [10:55<00:00, 20.48s/it]
Note that `shard_checkpoint` is deprecated and will be removed in v4.44. We recommend you using split_torch_state_dict_into_shards from huggingface_hub library


[2024-10-16 13:37:56,133] [INFO] [engine.py:790:_run_pass] Pass awq:AutoAWQQuantizer finished in 662.193807 seconds
[2024-10-16 13:37:56,134] [INFO] [cache.py:192:load_model] Loading model 13e1ac3a from cache.
[2024-10-16 13:37:58,022] [INFO] [engine.py:435:run_no_search] Saved output model to /home/azureuser/code/Ignite_FineTuning_workshop/lab/workshop-instructions/lab5-optimize-model/models/phi/awq/olive-cli-tmp-wlo5c24n/output_model
[2024-10-16 13:37:58,022] [INFO] [engine.py:347:run_accelerator] Save footprint to /home/azureuser/code/Ignite_FineTuning_workshop/lab/workshop-instructions/lab5-optimize-model/models/phi/awq/olive-cli-tmp-wlo5c24n/footprints.json.
[2024-10-16 13:37:58,023] [INFO] [engine.py:274:run] Run history for cpu-cpu:
[2024-10-16 13:37:58,023] [INFO] [engine.py:528:dump_run_history] Please install tabulate for better run history output
Command succeeded. Output model saved to /home/azureuser/code/Ignite_FineTuning_workshop/lab/workshop-instructions/lab5-optimize-m

## 👟 Train the model

Next, run the `olive finetune` command to fine-tune the quantized model using LoRA.

🧠 Olive supports the following models out-of-the-box: Phi, Llama, Mistral, Gemma, Qwen, Falcon and [many others](https://huggingface.co/docs/optimum/en/exporters/onnx/overview).

☕ Fine-tuning can take around 5-10 minutes to complete. At the end of the process you will have a PEFT adapter.

⚙️ For more information on available options, read the [Olive Finetune documentation](https://microsoft.github.io/Olive/features/cli.html#finetune).

In [13]:
%%bash

olive finetune \
    --method lora \
    --model_name_or_path models/phi/awq \
    --trust_remote_code \
    --data_files "data/data_sample_travel.jsonl" \
    --data_name "json" \
    --text_template "<|user|>\n{prompt}<|end|>\n<|assistant|>\n{response}<|end|>" \
    --max_steps 15 \
    --output_path ./models/phi/ft \
    --log_level 1

Loaded previous command output of type hfmodel from models/phi/awq
[2024-10-16 13:43:46,076] [INFO] [run.py:138:run_engine] Running workflow default_workflow
[2024-10-16 13:43:46,089] [INFO] [cache.py:137:__init__] Using cache directory: /home/azureuser/code/Ignite_FineTuning_workshop/lab/workshop-instructions/lab5-optimize-model/.olive-cache/default_workflow
[2024-10-16 13:43:46,092] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: gpu-cuda
[2024-10-16 13:43:46,100] [INFO] [engine.py:255:run] Running Olive on accelerator: gpu-cuda
[2024-10-16 13:43:46,100] [INFO] [engine.py:897:_create_system] Creating target system ...
[2024-10-16 13:43:46,100] [INFO] [engine.py:900:_create_system] Target system created in 0.000344 seconds
[2024-10-16 13:43:46,100] [INFO] [engine.py:909:_create_system] Creating host system ...
[2024-10-16 13:43:46,101] [INFO] [engine.py:912:_create_system] Host system created in 0.000137 seconds
[2024-10-16 13:43:47,482] 

We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ.
max_steps is given, it will override any value given in num_train_epochs


[2024-10-16 13:43:51,042] [INFO] [lora.py:546:train_and_save_new_model] Running fine-tuning


  0%|          | 0/15 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
You are not running the flash-attention implementation, expect numerical differences.


{'train_runtime': 106.4467, 'train_samples_per_second': 1.127, 'train_steps_per_second': 0.141, 'train_loss': 1.3030354817708334, 'epoch': 5.0}


100%|██████████| 15/15 [01:46<00:00,  7.10s/it]


[2024-10-16 13:45:38,380] [INFO] [engine.py:790:_run_pass] Pass f:LoRA finished in 110.898584 seconds
[2024-10-16 13:45:38,381] [INFO] [cache.py:192:load_model] Loading model 48230c00 from cache.
[2024-10-16 13:45:40,548] [INFO] [engine.py:435:run_no_search] Saved output model to /home/azureuser/code/Ignite_FineTuning_workshop/lab/workshop-instructions/lab5-optimize-model/models/phi/ft/olive-cli-tmp-8b4j0mwd/output_model
[2024-10-16 13:45:40,549] [INFO] [engine.py:347:run_accelerator] Save footprint to /home/azureuser/code/Ignite_FineTuning_workshop/lab/workshop-instructions/lab5-optimize-model/models/phi/ft/olive-cli-tmp-8b4j0mwd/footprints.json.
[2024-10-16 13:45:40,549] [INFO] [engine.py:274:run] Run history for gpu-cuda:
[2024-10-16 13:45:40,550] [INFO] [engine.py:528:dump_run_history] Please install tabulate for better run history output
Command succeeded. Output model saved to /home/azureuser/code/Ignite_FineTuning_workshop/lab/workshop-instructions/lab5-optimize-model/models/phi
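
The `--text_template` option above describes how each dataset record is turned into one training string. The sketch below shows the idea with Python's `str.format`. It assumes the `{prompt}` and `{response}` placeholders are filled from the matching dataset columns. How Olive performs the substitution internally is not shown in this notebook, and the record's response is a placeholder.

```python
# Same template as passed to `olive finetune` above.
text_template = "<|user|>\n{prompt}<|end|>\n<|assistant|>\n{response}<|end|>"

# Hypothetical record: the prompt comes from the dataset preview, the response is a placeholder.
record = {"prompt": "Best place to eat in Bangkok?", "response": "<placeholder answer>"}

# Fill the placeholders with the record's fields.
training_text = text_template.format(**record)
print(training_text)
```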

📂 The output is located in the `models/phi/ft` folder. Below is a listing of its `adapter` subfolder. Notice that OLIVE produces only the PEFT adapter, not the base model.

In [14]:
%ls -lah models/phi/ft/adapter

total 145M
drwxrwxr-x 2 azureuser azureuser 4.0K Oct 16 13:45 ./
drwxrwxr-x 4 azureuser azureuser 4.0K Oct 16 13:45 ../
-rw-rw-r-- 1 azureuser azureuser 5.1K Oct 16 13:45 README.md
-rw-rw-r-- 1 azureuser azureuser  744 Oct 16 13:45 adapter_config.json
-rw-rw-r-- 1 azureuser azureuser 145M Oct 16 13:45 adapter_model.safetensors
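
The adapter is much smaller than the base model because LoRA trains a low-rank update `B·A` alongside each adapted weight matrix instead of the matrix itself. The arithmetic below is a rough sketch. The matrix size and rank are illustrative assumptions, not values read from `adapter_config.json`.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

d_in = d_out = 3072  # illustrative square projection size (assumption)
rank = 16            # illustrative LoRA rank (assumption)

full_params = d_in * d_out
lora_params = lora_param_count(d_in, d_out, rank)

print(f"full matrix:  {full_params:,} parameters")
print(f"LoRA factors: {lora_params:,} parameters")
print(f"ratio:        {lora_params / full_params:.2%}")
```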


## 🔌 Generate Adapters for ONNX Runtime

Next, you need to convert the Hugging Face PEFT adapter into a format for the ONNX Runtime. This command will:

1. Convert the base model into ONNX format.
2. Optimize the base model for the ONNX Runtime (e.g. graph optimization).
3. Convert the adapter into an optimized format for the ONNX Runtime.
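
Keeping the adapter separate from the base model is what makes multi-LoRA serving possible: the base weights are shared and the adapter is supplied as an extra input at inference time. The sketch below is purely conceptual. The tiny matrices and adapter names are hypothetical, and it does not reflect how Olive or the ONNX Runtime implement this internally.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def forward(base_weight, x, adapter=None):
    """Compute y = W x + B (A x); the adapter is an optional extra input."""
    y = matvec(base_weight, x)
    if adapter is not None:
        low_rank = matvec(adapter["B"], matvec(adapter["A"], x))
        y = [yi + di for yi, di in zip(y, low_rank)]
    return y

base_weight = [[1.0, 0.0], [0.0, 1.0]]  # shared 2x2 base matrix
adapters = {  # hypothetical rank-1 adapters for two different tasks
    "travel": {"A": [[1.0, 1.0]], "B": [[0.5], [0.0]]},
    "other": {"A": [[1.0, -1.0]], "B": [[0.0], [2.0]]},
}

x = [2.0, 1.0]
print(forward(base_weight, x))                      # base model only
print(forward(base_weight, x, adapters["travel"]))  # same base, "travel" adapter
print(forward(base_weight, x, adapters["other"]))   # same base, "other" adapter
```

The base matrix is never modified, so switching tasks only means passing a different pair of small matrices.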


In [15]:
%%bash

olive generate-adapter \
    --model_name_or_path models/phi/ft \
    --use_ort_genai \
    --output_path models/phi/ft-onnx \
    --log_level 1

Loaded previous command output of type hfmodel from models/phi/ft
[2024-10-16 13:46:10,172] [INFO] [run.py:138:run_engine] Running workflow default_workflow
[2024-10-16 13:46:10,185] [INFO] [cache.py:137:__init__] Using cache directory: /home/azureuser/code/Ignite_FineTuning_workshop/lab/workshop-instructions/lab5-optimize-model/.olive-cache/default_workflow
[2024-10-16 13:46:10,189] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: gpu-cuda
[2024-10-16 13:46:10,192] [INFO] [engine.py:255:run] Running Olive on accelerator: gpu-cuda
[2024-10-16 13:46:10,192] [INFO] [engine.py:897:_create_system] Creating target system ...
[2024-10-16 13:46:10,192] [INFO] [engine.py:900:_create_system] Target system created in 0.000049 seconds
[2024-10-16 13:46:10,192] [INFO] [engine.py:909:_create_system] Creating host system ...
[2024-10-16 13:46:10,192] [INFO] [engine.py:912:_create_system] Host system created in 0.000040 seconds
[2024-10-16 13:46:11,605] [

You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
We detected that you are passing `past_key_values` as a tuple of tuples. This is deprecated and will be removed in v4.47. Please convert your cache or use an appropriate `Cache` class (https://huggingface.co/docs/transformers/kv_cache#legacy-cache-format)
  or len(self.key_cache[layer_idx]) == 0  # the layer has no cache
  if sequence_length != 1:
You are not running the flash-attention implementation, expect numerical differences.
  seq_len = seq_len or torch.max(position_ids) + 1
  if seq_len > self.original_max_position_embeddings:
  ext_fac

[2024-10-16 13:46:43,800] [INFO] [engine.py:790:_run_pass] Pass c:OnnxConversion finished in 32.195229 seconds
[2024-10-16 13:46:43,802] [INFO] [engine.py:718:_run_pass] Running pass o:OrtTransformersOptimization {}
[2024-10-16 13:47:25,679] [INFO] [engine.py:790:_run_pass] Pass o:OrtTransformersOptimization finished in 41.876406 seconds
[2024-10-16 13:47:25,681] [INFO] [engine.py:718:_run_pass] Running pass e:ExtractAdapters {}
[2024-10-16 13:47:31,369] [INFO] [engine.py:790:_run_pass] Pass e:ExtractAdapters finished in 5.688837 seconds
[2024-10-16 13:47:31,376] [INFO] [engine.py:718:_run_pass] Running pass m:ModelBuilder {}




GroupQueryAttention (GQA) is used in this model.
Saving GenAI config in /home/azureuser/code/Ignite_FineTuning_workshop/lab/workshop-instructions/lab5-optimize-model/.olive-cache/default_workflow/runs/04ec9341/models
Saving processing files in /home/azureuser/code/Ignite_FineTuning_workshop/lab/workshop-instructions/lab5-optimize-model/.olive-cache/default_workflow/runs/04ec9341/models for GenAI
[2024-10-16 13:47:31,592] [INFO] [engine.py:790:_run_pass] Pass m:ModelBuilder finished in 0.216728 seconds
[2024-10-16 13:47:31,600] [INFO] [cache.py:192:load_model] Loading model 04ec9341 from cache.
[2024-10-16 13:47:33,578] [INFO] [engine.py:435:run_no_search] Saved output model to /home/azureuser/code/Ignite_FineTuning_workshop/lab/workshop-instructions/lab5-optimize-model/models/phi/ft-onnx/olive-cli-tmp-9z23lgcb/output_model
[2024-10-16 13:47:33,581] [INFO] [engine.py:347:run_accelerator] Save footprint to /home/azureuser/code/Ignite_FineTuning_workshop/lab/workshop-instructions/lab5-optimize-model/models/phi/ft-onnx/olive-cli-tmp-9z23lgcb/footprints.json.
[2024-10-16 13:47:33,585] [INFO] [engine.py:274:run] Run history for gpu-cuda:
[2024-10-16 13:47:33,585] [INFO] [engine.py:528:dump_run_history] P

## 🧪 Quick test

The code below creates a test app that consumes the model in a simple console chat interface. You will be prompted to enter an input. Here are some phrases to try:

- "Cricket is a great game"
- "I was taken aback by the size of the whale"
- "there was concern about the dark lighting on the street"

🧑‍💻 Below we show the Python API for the ONNX Runtime. However, other language bindings are available in [Java, C#, C++](https://github.com/microsoft/onnxruntime-genai/tree/main/examples).

🚪 To exit the chat interface, enter `exit` or press `Ctrl+C`.


In [12]:
import onnxruntime_genai as og
import numpy as np
from olive.common.utils import load_weights
import os

model_folder = "models/phi/ft-onnx/model"

# Load the base model and tokenizer
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Load the LoRA adapter weights
weights_file = os.path.join(model_folder, "adapter_weights.onnx_adapter")

adapters = {
    "travel": {
        "weights": weights_file,
        "template": "<|user|>\n{input}</s>\n<|assistant|>"
    }
}

adapters_weights = {
    key: load_weights(value["weights"]) for key, value in adapters.items()
}

# Set the max length to something sensible by default,
# since otherwise it will be set to the entire context length
search_options = {}
search_options['max_length'] = 200
search_options['past_present_share_buffer'] = False

chat_template = "<|user|>\n{input}</s>\n<|assistant|>"

text = input("Input: ")

# Keep asking for input phrases
while text != "exit":
  if not text:
    # a bare `exit` does nothing here, so ask again and restart the loop
    print("Error, input cannot be empty")
    text = input("Input: ")
    continue

  # generate prompt (prompt template + input)
  prompt = f'{chat_template.format(input=text)}'

  # encode the prompt using the tokenizer
  input_tokens = tokenizer.encode(prompt)

  # the adapter weights are added to the model at inference time. This means you
  # can select different adapters for different tasks i.e. multi-LoRA.

  params = og.GeneratorParams(model)
  # use the dictionary defined above (`adapters_weights`)
  for k, v in adapters_weights.items():
    params.set_model_input(k, v)
  params.set_search_options(**search_options)
  params.input_ids = input_tokens
  generator = og.Generator(model, params)

  print("Output: ", end='', flush=True)
  # stream the output
  try:
    while not generator.is_done():
      generator.compute_logits()
      generator.generate_next_token()

      new_token = generator.get_next_tokens()[0]
      print(tokenizer_stream.decode(new_token), end='', flush=True)
  except KeyboardInterrupt:
    print("  --control+c pressed, aborting generation--")

  print()
  text = input("Input: ")

# delete the objects to free up resources.
# `generator` only exists if at least one prompt was processed.
if "generator" in globals():
  del generator
del model
del tokenizer
del tokenizer_stream

Enter phrase: cricket is an amazing game!
Output: 
joy
Input: exit


## Publish to Hugging Face

🤗 You'll need to get a token from https://huggingface.co/settings/tokens.

In [2]:
%%bash

# update these parameters
TOKEN="" # get a token from https://huggingface.co/settings/tokens
REPO_ID="" # for example username/repo-name
MODEL_PATH="models/phi/ft-onnx" # no need to change

huggingface-cli upload --token "$TOKEN" "$REPO_ID" "$MODEL_PATH"

