## Installing ONNXRuntime-GenAI

Choosing the right onnxruntime-genai package matters, because the package suffix indicates which Execution Provider is bundled with the underlying ONNXRuntime framework. Here, we install the `onnxruntime-genai-directml` package so that we can run models through the DirectML Execution Provider.

**Note: The DirectML Execution Provider is built on DirectX, and is therefore only available on Windows systems.**

In [None]:
!pip install onnx==1.16.1
!pip install transformers torch numpy
!pip install onnxruntime-genai-directml
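
To confirm which build ended up in the environment, the package suffix can be read back from the installed distribution metadata. The sketch below uses only the standard library. The helper names are hypothetical, and treating the unsuffixed `onnxruntime-genai` package as the CPU build is an assumption rather than something stated above.

```python
from importlib import metadata

BASE = "onnxruntime-genai"


def provider_suffix(package_name):
    """Return the suffix after 'onnxruntime-genai' (e.g. 'directml').

    Assumption: a package with no suffix is the CPU build.
    """
    name = package_name.lower().replace("_", "-")
    if name == BASE:
        return "cpu"
    if not name.startswith(BASE + "-"):
        raise ValueError(f"{package_name!r} is not an onnxruntime-genai package")
    return name[len(BASE) + 1:]


def installed_genai_builds():
    """Map each installed onnxruntime-genai distribution to (suffix, version)."""
    builds = {}
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name") or ""
        except Exception:
            # Skip distributions whose metadata cannot be read.
            continue
        normalized = name.lower().replace("_", "-")
        if normalized == BASE or normalized.startswith(BASE + "-"):
            builds[name] = (provider_suffix(name), dist.version)
    return builds


if __name__ == "__main__":
    print(installed_genai_builds() or "no onnxruntime-genai package found")
```

If more than one build shows up in the result, it may be worth uninstalling the extras so that it is unambiguous which Execution Provider is in use.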

## Getting a compatible ONNX model

Because ONNXRuntime-GenAI is specialized for generative ONNX models, it supports only models in this class, not models that produce their result in a single inference pass, such as classifiers. There are a couple of ways to obtain an ONNX model that can be used with ONNXRuntime-GenAI. Here, we use the model builder, a tool included in the ORT-GenAI package, to get [Microsoft's Phi-3.5 model](https://huggingface.co/microsoft/Phi-3.5-mini-instruct/tree/main). In the command below, `-m` names the Hugging Face model, `-e dml` selects the DirectML Execution Provider, `-p int4` sets the precision, and `-o` sets the output directory.

In [2]:
!python -m onnxruntime_genai.models.builder -m microsoft/Phi-3.5-mini-instruct -e dml -p int4 -o phi-3-dml

Valid precision + execution provider combinations are: FP32 CPU, FP32 CUDA, FP16 CUDA, FP16 DML, INT4 CPU, INT4 CUDA, INT4 DML
Extra options: {}
GroupQueryAttention (GQA) is used in this model.
Reading embedding layer
Reading decoder layer 0
Reading decoder layer 1
Reading decoder layer 2
Reading decoder layer 3
Reading decoder layer 4
Reading decoder layer 5
Reading decoder layer 6
Reading decoder layer 7
Reading decoder layer 8
Reading decoder layer 9
Reading decoder layer 10
Reading decoder layer 11
Reading decoder layer 12
Reading decoder layer 13
Reading decoder layer 14
Reading decoder layer 15
Reading decoder layer 16
Reading decoder layer 17
Reading decoder layer 18
Reading decoder layer 19
Reading decoder layer 20
Reading decoder layer 21
Reading decoder layer 22
Reading decoder layer 23
Reading decoder layer 24
Reading decoder layer 25
Reading decoder layer 26
Reading decoder layer 27
Reading decoder layer 28
Reading decoder layer 29
Reading decoder layer 30
Reading decoder l

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-mini-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-mini-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]
Downloading shards:  50%|#####     | 1/2 [00:47<00:47, 47.16s/it]
Downloading shards: 100%|##########| 2/2 

After this command completes, the `phi-3-dml` directory contains a few important things:
  - The core ONNX model used for each inference pass
  - The appropriate tokenizer for the model
  - An ORT-GenAI config file. The specification is described [here](https://onnxruntime.ai/docs/genai/reference/config), but the model builder populates it entirely for us.
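
To see how the three pieces listed above map onto actual files, the output directory can be grouped by role. This is a minimal sketch: the helper name is hypothetical, and the matching rules (ONNX weights ending in `.onnx` or `.onnx.data`, tokenizer files containing `token` in their name) are assumptions about the builder's output, so check them against your own directory listing.

```python
from pathlib import Path


def summarize_model_dir(model_dir):
    """Group the files in a model-builder output directory by role.

    Assumptions: ONNX weights end in '.onnx' or '.onnx.data', tokenizer
    files contain 'token' in their name, and the config file is named
    'genai_config.json'.
    """
    groups = {"model": [], "tokenizer": [], "config": [], "other": []}
    for path in sorted(Path(model_dir).iterdir()):
        if not path.is_file():
            continue
        name = path.name
        if name == "genai_config.json":
            groups["config"].append(name)
        elif name.endswith((".onnx", ".onnx.data")):
            groups["model"].append(name)
        elif "token" in name.lower():
            groups["tokenizer"].append(name)
        else:
            groups["other"].append(name)
    return groups


if __name__ == "__main__":
    model_dir = Path("phi-3-dml")
    if model_dir.is_dir():
        for role, names in summarize_model_dir(model_dir).items():
            print(f"{role}: {names}")
    else:
        print(f"{model_dir} not found; run the model builder first")
```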

## Inferencing the model using ORT-GenAI

The cell below runs a simple inference with the model we generated above on your local DirectML-supported device. If you look in the config file generated at `phi-3-dml/genai_config.json`, you will find the `dml` Execution Provider under `model.decoder.session_options.provider_options`. ORT-GenAI uses this Execution Provider under the hood when we instantiate a `Model` from this config file.
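
Rather than opening the file by hand, the configured provider can be read programmatically. The sketch below walks the key path named above. The helper name is hypothetical, and the shape of `provider_options` (a list of single-key objects such as `[{"dml": {}}]`) is an assumption, so adjust it if your generated file looks different.

```python
import json
from pathlib import Path


def configured_providers(config):
    """Return the Execution Provider names listed under
    model.decoder.session_options.provider_options.

    Assumption: provider_options is a list of single-key objects,
    for example [{"dml": {}}].
    """
    session_options = (
        config.get("model", {}).get("decoder", {}).get("session_options", {})
    )
    providers = []
    for entry in session_options.get("provider_options", []):
        providers.extend(entry.keys())
    return providers


if __name__ == "__main__":
    config_path = Path("phi-3-dml") / "genai_config.json"
    if config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        print(configured_providers(config))
    else:
        print(f"{config_path} not found; run the model builder first")
```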

For prompt construction, we follow the format given in the [Phi-3.5 model card](https://huggingface.co/microsoft/Phi-3.5-mini-instruct#input-formats).
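
As an illustration of that format, the small helper below assembles a single-turn prompt from a system message and a user message using the `<|system|>`, `<|user|>`, `<|assistant|>`, and `<|end|>` tags. The function name is hypothetical and the sketch handles only one turn; treat the model card as the authority on the exact format.

```python
def build_phi3_prompt(user_message, system_message=""):
    """Assemble a single-turn chat prompt.

    Each turn is '<|role|>\\n' + text + '<|end|>\\n', and the prompt ends
    with an open assistant turn for the model to complete.
    """
    parts = []
    if system_message:
        parts.append(f"<|system|>\n{system_message}<|end|>\n")
    parts.append(f"<|user|>\n{user_message}<|end|>\n")
    parts.append("<|assistant|>\n")
    return "".join(parts)


if __name__ == "__main__":
    print(build_phi3_prompt(
        "Can you tell me a recipe that uses strawberries, milk, and whipped cream?",
        system_message="You are a helpful cooking assistant.",
    ))
```

Building the string this way avoids the stray spaces that backslash line continuations can leave around each `\n`.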

In [5]:
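import onnxruntime_genai as ort_genai

# Directory produced by the model builder above
model_path = "phi-3-dml"

# Load the model (per genai_config.json), its tokenizer, and default generation parameters
model = ort_genai.Model(model_path)
tokenizer = ort_genai.Tokenizer(model)
params = ort_genai.GeneratorParams(model)

# Cap the sequence length at 1000 tokens
search_options = {
    "max_length": 1000
}

params.set_search_options(**search_options)

# Chat prompt: a system turn, a user turn, then an open assistant turn.
# Note that the spaces around each \n become part of the prompt string.
prompt = "<|system|> \n \
You are a helpful cooking assistant.<|end|> \n \
<|user|> \n \
Can you tell me a recipe that uses strawberries, milk, and whipped cream?<|end|> \n \
<|assistant|> \n"

# Tokenize the prompt and hand the token IDs to the generation parameters
params.input_ids = tokenizer.encode(prompt)

# Run generation to completion
output_ids = model.generate(params)

# Decode the full sequence (prompt included) back to text
print(tokenizer.decode(output_ids))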

 
 You are a helpful cooking assistant.
 
 Can you tell me a recipe that uses strawberries, milk, and whipped cream?
 
 Certainly! Here's a simple and delightful recipe for a Strawberry Milkshake with Whipped Cream Topping that you can enjoy:

**Strawberry Milkshake with Whipped Cream Topping**

**Ingredients:**

* Fresh strawberries (about 1 cup, hulled and sliced)
* 2 cups of vanilla ice cream (or any flavor of your choice)
* 1/2 cup of milk (or more, to desired consistency)
* Whipped cream (for topping)
* Optional: A drizzle of honey or maple syrup for added sweetness
* Fresh mint leaves (for garnish)

**Instructions:**

1. **Prepare the Strawberries:**
   - Wash and hull about 1 cup of fresh strawberries. Slice them into halves or quarters, depending on your preference.

2. **Make the Milkshake:**
   - Place the sliced strawberries, vanilla ice cream, and milk into a blender or food processor.
   - Blend until the mixture is smooth and creamy. You may add more milk if the consisten

## Wrapping Up

Installing ORT-GenAI, building a model, and running the inference loop are all straightforward, and we've taken the simplest path through each, using DirectML to execute on your local GPU or other DML-supported device. ORT-GenAI offers a lot of flexibility that we haven't demonstrated, including:
  - Vocabulary masking, which will restrict the tokens that the LLM can produce
  - Various search algorithms, which can produce multiple candidates for output sequences
  - Streaming output tokens as they're generated
  - And many more!
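
To give a feel for the streaming item in the list above, here is a hedged sketch of a token-by-token loop. It is written against duck-typed objects rather than importing ORT-GenAI, because the generator API has changed between releases. The method names (`is_done`, `compute_logits`, `generate_next_token`, `get_next_tokens`, and a stream object with `decode`) are assumptions based on the Python streaming sample linked below, and the function name `stream_text` is hypothetical.

```python
from typing import Any, Iterator


def stream_text(generator: Any, tokenizer_stream: Any) -> Iterator[str]:
    """Yield decoded text pieces one token at a time.

    Assumptions (not confirmed by this notebook): `generator` behaves like
    onnxruntime_genai.Generator and `tokenizer_stream` like the object
    returned by Tokenizer.create_stream().
    """
    while not generator.is_done():
        # Some releases expose an explicit logits step before sampling;
        # call it only when the method is present.
        if hasattr(generator, "compute_logits"):
            generator.compute_logits()
        generator.generate_next_token()
        # Batch size 1: take the single newest token ID.
        new_token = generator.get_next_tokens()[0]
        yield tokenizer_stream.decode(new_token)


# Hypothetical usage with the objects from the inference cell:
#   generator = ort_genai.Generator(model, params)
#   tokenizer_stream = tokenizer.create_stream()
#   for piece in stream_text(generator, tokenizer_stream):
#       print(piece, end="", flush=True)
```

Yielding pieces instead of printing inside the loop keeps the function reusable, for example to forward tokens to a UI. How the prompt tokens are supplied to the generator depends on the installed release, so check the linked sample for your version.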



For additional resources to take you further into ORT-GenAI, see:

  - [Deep Dive into ONNXRuntime-GenAI](https://www.youtube.com/watch?v=S_qufVKPwMM)
  - [Microsoft's Model Builder documentation](https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/README.md)
  - [Python sample for streaming tokens](https://github.com/microsoft/onnxruntime-genai/blob/main/README.md#sample-code-for-phi-3-in-python)
  - [DirectML Introduction](https://learn.microsoft.com/en-us/windows/ai/directml/dml)