# OLIVE: Model optimization toolkit for the ONNX Runtime

OLIVE (ONNX LIVE) is a model optimization toolkit with an accompanying CLI that enables you to ship models for the ONNX Runtime with quality and performance.

The input to OLIVE is typically a PyTorch or Hugging Face model, and the output is an optimized ONNX model that is executed on a device (the deployment target) running the ONNX Runtime. OLIVE optimizes the model for the deployment target's AI accelerator (NPU, GPU, CPU) provided by a hardware vendor such as Qualcomm, AMD, Nvidia or Intel.

OLIVE executes a workflow, which is an ordered sequence of individual model optimization tasks called passes. Example passes include model compression, graph capture, quantization, and graph optimization. Each pass has a set of parameters that can be tuned to achieve the best metrics, such as accuracy and latency, which are evaluated by the respective evaluator. OLIVE employs a search strategy that uses a search algorithm to auto-tune each pass individually or a set of passes together.
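
To make the idea of passes, an evaluator and a search strategy more concrete, here is a minimal toy sketch. It is not OLIVE's implementation or API: the `quantize` and `optimize_graph` functions, the dictionary "model" and the made-up accuracy and latency numbers are all hypothetical stand-ins, and the search is a plain exhaustive loop rather than any algorithm OLIVE actually uses.

```python
from itertools import product

# Hypothetical stand-ins for passes (NOT Olive APIs): each takes a "model",
# here just a dict, and returns a modified copy.
def quantize(model, bits):
    return {**model, "bits": bits}

def optimize_graph(model, level):
    return {**model, "opt_level": level}

# Made-up accuracy per bit width, purely for illustration.
TOY_ACCURACY = {4: 0.90, 8: 0.97, 16: 0.99}

def evaluate(model):
    """Toy evaluator: returns (accuracy, latency) from invented formulas."""
    accuracy = TOY_ACCURACY[model["bits"]]
    latency = model["bits"] / (1 + model["opt_level"])
    return accuracy, latency

def search(base_model, search_space, min_accuracy):
    """Try every parameter combination and keep the lowest-latency
    candidate whose accuracy meets the floor."""
    best = None
    for bits, level in product(search_space["bits"], search_space["opt_level"]):
        # The workflow: run the passes in order on the base model.
        candidate = optimize_graph(quantize(base_model, bits), level)
        accuracy, latency = evaluate(candidate)
        if accuracy >= min_accuracy and (best is None or latency < best[1]):
            best = (candidate, latency, accuracy)
    return best

best_model, best_latency, best_accuracy = search(
    {"name": "toy"},
    {"bits": [4, 8, 16], "opt_level": [0, 1, 2]},
    min_accuracy=0.95,
)
print(best_model, best_latency, best_accuracy)
```

In this toy setup the 4-bit candidates are rejected for missing the accuracy floor, so the search settles on the fastest configuration among the rest. A real search strategy would typically sample the space rather than enumerate it.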

## The data

In this example, you'll be fine-tuning a language model to classify phrases into one of joy/surprise/fear/sadness categories. The dataset, which is available on Hugging Face, is shown below.

In [18]:
from datasets import load_dataset

dataset = load_dataset("json", data_files="data/data_sample_travel.jsonl")
dataset["train"].to_pandas().head()

|   | prompt | response |
|---|--------|----------|
| 0 | What's a must-see in Paris? | Oh la la! You simply must twirl around the Eif... |
| 1 | Best way to get around Tokyo? | Hop on a bullet train for speed, explore the c... |
| 2 | What's the best museum in New York? | The Met is a must-visit, but don't overlook the M... |
| 3 | What should I pack for a trip to Australia? | Don't forget sunscreen and a hat for those sun... |
| 4 | Best place to eat in Bangkok? | For street food heaven, try the night markets – y... |
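
Before fine-tuning, it can be worth checking that every record in the JSON Lines file has the two fields shown above. The sketch below uses only the standard library and assumes one JSON object per line with string `prompt` and `response` fields. The helper name `load_prompt_response_pairs` is made up for this illustration.

```python
import json

def load_prompt_response_pairs(path):
    """Read a JSON Lines file and return its records, checking that each
    one has non-empty "prompt" and "response" string fields."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            for field in ("prompt", "response"):
                value = record.get(field)
                if not isinstance(value, str) or not value.strip():
                    raise ValueError(
                        f"line {line_number}: missing or empty {field!r}"
                    )
            records.append(record)
    return records
```

For this notebook you would call it as `load_prompt_response_pairs("data/data_sample_travel.jsonl")`. A `ValueError` points at the first line that would produce a malformed training example.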


## 👟 Train the model

Next, run the `olive finetune` command. This single command will not only fine-tune the model but also optimize it to run with quality and performance on the [ONNX Runtime](https://onnxruntime.ai/).

🧠 Olive supports the following models out-of-the-box: Phi, Llama, Mistral, Gemma, Qwen, Falcon and [many others](https://huggingface.co/docs/optimum/en/exporters/onnx/overview).

☕ Fine-tuning can take around 5-10 minutes to complete. At the end of the process you will have a PEFT adapter.

⚙️ For more information on available options, read the [Olive Finetune documentation](https://microsoft.github.io/Olive/features/cli.html#finetune).

In [None]:
%%bash

# Execute the finetune command - this will also run optimization and adapter extraction.
olive finetune \
    --method qlora \
    --model_name_or_path microsoft/Phi-3.5-mini-instruct \
    --trust_remote_code \
    --data_files "data/data_sample_travel.jsonl" \
    --data_name "json" \
    --text_template "<|user|>\n{prompt}<|end|>\n<|assistant|>\n{response}<|end|>" \
    --max_steps 15 \
    --output_path ./models/ft \
    --log_level 1
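
To see what a single training example looks like once the `--text_template` is filled in, the sketch below applies the same template string to one record with Python's `str.format`. It assumes the `{prompt}` and `{response}` placeholders are substituted from the dataset columns of the same name. The sample response text is invented, because the real responses are truncated in the table above.

```python
# The template passed to --text_template, written as a Python string literal.
text_template = "<|user|>\n{prompt}<|end|>\n<|assistant|>\n{response}<|end|>"

def render_example(record, template=text_template):
    """Fill the template placeholders from a record's columns."""
    return template.format(prompt=record["prompt"], response=record["response"])

# Hypothetical record: the prompt is from the dataset, the response is made up.
sample = {
    "prompt": "Best place to eat in Bangkok?",
    "response": "Try the night markets.",
}
print(render_example(sample))
```

The rendered text places the user turn and the assistant turn between the special tokens named in the template, which is the format the fine-tuned model learns to continue.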

📂 The output is located in a folder named `models/ft/adapter`. The folder listing below shows that OLIVE produces only the PEFT adapter (not the base model).

In [4]:
%ls -lah models/ft/adapter

total 385M
drwxrwxrwx 2 root root    0 Oct 11 14:35 ./
drwxrwxrwx 2 root root    0 Oct 11 14:35 ../
-rwxrwxrwx 1 root root 5.0K Oct 11 14:35 README.md*
-rwxrwxrwx 1 root root  698 Oct 11 14:35 adapter_config.json*
-rwxrwxrwx 1 root root 385M Oct 11 14:35 adapter_model.safetensors*
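
If you want to check how the adapter was configured, `adapter_config.json` is a plain JSON file. The sketch below reads a few fields that PEFT LoRA configurations commonly contain (`base_model_name_or_path`, `peft_type`, `r`, `lora_alpha`, `target_modules`). Which keys are present in your file is an assumption, so missing ones are returned as `None`. The helper name `summarize_adapter` is made up for this illustration.

```python
import json
from pathlib import Path

def summarize_adapter(adapter_dir):
    """Return selected fields from a PEFT adapter_config.json.
    Keys that are absent from the file are reported as None."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    keys = ("base_model_name_or_path", "peft_type", "r", "lora_alpha", "target_modules")
    return {key: config.get(key) for key in keys}
```

In this notebook you would call `summarize_adapter("models/ft/adapter")` after the fine-tuning cell has finished.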


## 🔌 Generate Adapters for ONNX Runtime

Next, you need to convert the Hugging Face PEFT adapter into a format for the ONNX Runtime. This command will:

1. Convert the base model into ONNX format.
2. Optimize the base model for the ONNX Runtime (e.g. graph optimization).
3. Convert the adapter into an optimized format for the ONNX Runtime.
4. Quantize the model.

> ☕ **NOTE: It can take around 15-20 minutes for the optimization of the base model to complete.**

In [6]:
%%bash

olive generate-adapter \
    --model_name_or_path microsoft/Phi-3.5-mini-instruct \
    --adapter_path models/ft/adapter \
    --use_ort_genai \
    --output_path models/onnx \
    --adapter_format onnx_adapter \
    --log_level 1

Loading HuggingFace model from microsoft/Phi-3.5-mini-instruct
[2024-10-11 16:15:42,249] [INFO] [run.py:138:run_engine] Running workflow default_workflow
[2024-10-11 16:15:42,280] [INFO] [cache.py:137:__init__] Using cache directory: /mnt/batch/tasks/shared/LS_root/mounts/clusters/samkemp1/code/Ignite_FineTuning_workshop/lab/Workshop Instructions/Lab5_Optimize_Model/.olive-cache/default_workflow
[2024-10-11 16:15:42,379] [INFO] [accelerator_creator.py:224:create_accelerators] Running workflow on accelerator specs: gpu-cuda
[2024-10-11 16:15:42,404] [INFO] [engine.py:255:run] Running Olive on accelerator: gpu-cuda
[2024-10-11 16:15:42,404] [INFO] [engine.py:897:_create_system] Creating target system ...
[2024-10-11 16:15:42,404] [INFO] [engine.py:900:_create_system] Target system created in 0.000090 seconds
[2024-10-11 16:15:42,404] [INFO] [engine.py:909:_create_system] Creating host system ...
[2024-10-11 16:15:42,404] [INFO] [engine.py:912:_create_system] Host system created in 0.0000

## 🧪 Quick test

The code below creates a test app that consumes the model in a simple console chat interface. You will be prompted to enter an input. Here are some phrases to try:

- "Cricket is a great game"
- "I was taken aback by the size of the whale"
- "there was concern about the dark lighting on the street"

🧑‍💻 Below we show the Python API for the ONNX Runtime. However, other language bindings are available in [Java, C#, C++](https://github.com/microsoft/onnxruntime-genai/tree/main/examples).

🚪 To exit the chat interface, enter `exit` or press `Ctrl+C`.


In [12]:
import onnxruntime_genai as og
import numpy as np
import os

model_folder = "optimized-model"

# Load the base model and tokenizer
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Load the LoRA adapter weights
weights_file = os.path.join(model_folder, "adapter_weights.npz")
adapter_weights = np.load(weights_file)

# Set the max length to something sensible by default,
# since otherwise it will be set to the entire context length
search_options = {}
search_options['max_length'] = 200
search_options['past_present_share_buffer'] = False

chat_template = "<|user|>\n{input}</s>\n<|assistant|>"

text = input("Enter phrase: ")

# Keep asking for input phrases
while text != "exit":
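  if not text:
    # a bare `exit` does nothing here; ask again and restart the loop
    print("Error, input cannot be empty")
    text = input("Enter phrase: ")
    continue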

  # generate prompt (prompt template + input)
  prompt = chat_template.format(input=text)

  # encode the prompt using the tokenizer
  input_tokens = tokenizer.encode(prompt)

  # the adapter weights are added to the model at inference time. This means you
  # can select different adapters for different tasks i.e. multi-LoRA.

  params = og.GeneratorParams(model)
  for key in adapter_weights.keys():
      params.set_model_input(key, adapter_weights[key])
  params.set_search_options(**search_options)
  params.input_ids = input_tokens
  generator = og.Generator(model, params)

  print("Output: ", end='', flush=True)
  # stream the output
  try:
    while not generator.is_done():
      generator.compute_logits()
      generator.generate_next_token()

      new_token = generator.get_next_tokens()[0]
      print(tokenizer_stream.decode(new_token), end='', flush=True)
  except KeyboardInterrupt:
      print("  --control+c pressed, aborting generation--")

  print()
  text = input("Input: ")

# delete the objects to free up resources.
# `generator` only exists if at least one phrase was processed.
if "generator" in globals():
  del generator
del model
del tokenizer
del tokenizer_stream

Enter phrase: cricket is an amazing game!
Output: 
joy
Input: exit
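
Because the adapter weights are passed in as model inputs at inference time, several adapters can share one base model. The sketch below shows one way to keep multiple adapters in memory, keyed by task name, using only NumPy. It assumes each adapter is stored as an `.npz` file, like `adapter_weights.npz` above. The helper name `load_adapters`, the task names and any additional file names are hypothetical.

```python
import numpy as np

def load_adapters(adapter_files):
    """Load each .npz adapter file into memory.

    adapter_files maps a task name to a file path. The result maps each
    task name to a dict of {model input name: weight array}.
    """
    adapters = {}
    for task, path in adapter_files.items():
        with np.load(path) as archive:
            # Read every array so nothing depends on the open file handle.
            adapters[task] = {name: archive[name] for name in archive.files}
    return adapters
```

With a mapping such as `{"classify": "optimized-model/adapter_weights.npz"}`, you could pick `adapters[task]` for each request and pass its items to `params.set_model_input`, as the chat loop above does for a single adapter.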


## Publish to Hugging Face

🤗 You'll need to get a token from https://huggingface.co/settings/tokens.

In [2]:
%%bash

# update these parameters
TOKEN="" # get a token from https://huggingface.co/settings/tokens
REPO_ID="" # for example username/repo-name
MODEL_PATH="optimized-model" # no need to change

huggingface-cli upload --token "$TOKEN" "$REPO_ID" "$MODEL_PATH"
