## ONNX (Open Neural Network Exchange)

* ONNX is a widely adopted, standardized format for representing deep learning models.
* It allows models to be exchanged between frameworks, so a model built in one framework can be run or shared in another.

* For a successful conversion to ONNX, check the following four aspects:
1) Model training
2) Input and output names
3) Handling dynamic axes: marking tensor dimensions such as batch size or sequence length as variable.
4) Conversion evaluation


Framework not specified. Using pt to export the model.
Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with: 
```
pip install accelerate
```

## ONNX Runtime (ORT): dedicated accelerator

In [2]:
import sys
!{sys.executable} -m pip install -q onnx onnxruntime optimum diffusers tf-keras accelerate

In [11]:
import torch
import onnx

In [12]:
# Defining PyTorch model
class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc = torch.nn.Linear(10, 10)

    def forward(self, x):
        x = self.fc(x)
        return x

# Creating an instance
model = MyModel()

In [13]:
# Defining input example
example_input = torch.randn(1, 10)

# Exporting to ONNX format
torch.onnx.export(model, example_input, "linear.onnx")

In [14]:
import onnx
onnx_model = onnx.load("linear.onnx")
onnx.checker.check_model(onnx_model)

In [15]:
import numpy as np
import onnxruntime

# Reference output from the original PyTorch model; it is compared with the
# ONNX Runtime output in the next cell to confirm that the two are equivalent.
original_output = model(example_input)

onnx_model = onnx.load("linear.onnx")
onnx.checker.check_model(onnx_model)

# Annotate the graph with inferred tensor shapes and display it.
rep = onnx.shape_inference.infer_shapes(onnx_model)
rep

ir_version: 8
opset_import {
  version: 17
}
producer_name: "pytorch"
producer_version: "2.2.1"
graph {
  node {
    input: "onnx::Gemm_0"
    input: "fc.weight"
    input: "fc.bias"
    output: "3"
    name: "/fc/Gemm"
    op_type: "Gemm"
    attribute {
      name: "alpha"
      type: FLOAT
      f: 1
    }
    attribute {
      name: "beta"
      type: FLOAT
      f: 1
    }
    attribute {
      name: "transB"
      type: INT
      i: 1
    }
  }
  name: "main_graph"
  initializer {
    dims: 10
    dims: 10
    data_type: 1
    name: "fc.weight"
    raw_data: "\214N\216\276\226\377\372\273..."
    (output truncated)
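
The `Gemm` node in the graph above carries the attributes `alpha = 1`, `beta = 1`, and `transB = 1`. A minimal NumPy sketch of what that node computes, assuming the standard ONNX definition `Y = alpha * A @ B' + beta * C` (the `transA` attribute is left out for brevity, and the random arrays are stand-ins for the real input and weights):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 10)).astype(np.float32)        # stands in for "onnx::Gemm_0"
weight = rng.standard_normal((10, 10)).astype(np.float32)  # stands in for "fc.weight", shape (out, in)
bias = rng.standard_normal(10).astype(np.float32)          # stands in for "fc.bias"

alpha, beta, trans_b = 1.0, 1.0, 1  # attribute values shown in the graph


def gemm(a, b, c, alpha=1.0, beta=1.0, trans_b=0):
    """Y = alpha * A @ B' + beta * C, where B' is B transposed if trans_b is set."""
    b_eff = b.T if trans_b else b
    return alpha * (a @ b_eff) + beta * c


y = gemm(x, weight, bias, alpha, beta, trans_b)

# With these attribute values, Gemm reduces to the usual linear layer x @ W^T + b.
linear = x @ weight.T + bias
print(np.allclose(y, linear))
```

This is why a single `torch.nn.Linear` layer exports as one `Gemm` node with the weight and bias stored as initializers.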

In [16]:
ort_session = onnxruntime.InferenceSession(onnx_model.SerializeToString())
ort_inputs = {ort_session.get_inputs()[0].name: example_input.numpy()} # send sample input as dictionary
ort_outs = ort_session.run(None, ort_inputs)
np.testing.assert_allclose(original_output.detach().numpy(), ort_outs[0], rtol=1e-03, atol=1e-05)
print("Original Output:", original_output)
print("Onnx model Output:", ort_outs[0])

Original Output: tensor([[ 0.1367, -0.1534, -0.6783,  0.0909, -0.0157, -0.8290, -0.1983,  0.6777,
          0.4129, -0.6922]], grad_fn=<AddmmBackward0>)
Onnx model Output: [[ 0.13670191 -0.15336278 -0.6782581   0.09087509 -0.01571308 -0.8289522
  -0.1983419   0.67769706  0.41286647 -0.6922476 ]]
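
The comparison above passes because `np.testing.assert_allclose` accepts any element whose deviation satisfies `|actual - desired| <= atol + rtol * |desired|`. A small sketch of that tolerance band, reusing three values from the printed ONNX output (the `close` and `far` arrays are constructed purely for illustration):

```python
import numpy as np

desired = np.array([0.13670191, -0.15336278, -0.6782581])
rtol, atol = 1e-3, 1e-5

# Allowed absolute deviation per element.
tolerance = atol + rtol * np.abs(desired)

close = desired + 0.5 * tolerance  # inside the band: passes
far = desired + 2.0 * tolerance    # outside the band: fails

np.testing.assert_allclose(close, desired, rtol=rtol, atol=atol)

try:
    np.testing.assert_allclose(far, desired, rtol=rtol, atol=atol)
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True

print(mismatch_detected)
```

Small floating-point differences between PyTorch and ONNX Runtime are expected, so an exact equality check would be too strict here.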


## Quickstart Examples for PyTorch, TensorFlow, and scikit-learn

In [20]:
# Export the model using torch.onnx.export

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = model.to(device)  # the model and its example input must be on the same device
torch.onnx.export(model,                                # model being run
                  torch.randn(1, 10, 10).to(device),    # model input (or a tuple for multiple inputs)
                  "fashion_model.onnx",                 # where to save the model (can be a file or file-like object)
                  input_names = ['input'],              # the model's input names
                  output_names = ['output'])            # the model's output names


In [21]:
import onnx
onnx_model = onnx.load("fashion_model.onnx")
onnx.checker.check_model(onnx_model)

In [None]:
# Create inference session using ort.InferenceSession

import onnxruntime as ort
import numpy as np
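# `test_data` and `classes` are not defined in this notebook. They are assumed to be
# a labelled test dataset and its list of class names, matching a classifier that
# was exported to "fashion_model.onnx".
x, y = test_data[0][0], test_data[0][1]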
ort_sess = ort.InferenceSession('fashion_model.onnx')
outputs = ort_sess.run(None, {'input': x.numpy()})

# Print Result
predicted, actual = classes[outputs[0][0].argmax(0)], classes[y]
print(f'Predicted: "{predicted}", Actual: "{actual}"')
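
The lookup in the last two lines can be tried without a dataset. A minimal sketch, assuming the ten Fashion-MNIST label names and using made-up logits in place of `outputs[0]`:

```python
import numpy as np

# Fashion-MNIST label names (an assumption; `classes` is not defined in this notebook).
classes = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
           "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Made-up logits standing in for `outputs[0]`, with shape (batch, num_classes).
logits = np.array(
    [[-1.2, 0.3, 0.1, -0.5, 0.0, 2.1, -0.7, 4.8, 0.9, 1.5]], dtype=np.float32
)

# Index of the largest score for the first (and only) sample in the batch.
predicted_index = int(logits[0].argmax(0))
predicted = classes[predicted_index]
print(f'Predicted: "{predicted}"')
```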


In [None]:
# Another sample: exporting a model that takes two inputs.
# `model`, `text`, and `offsets` are not defined in this notebook; they are assumed
# to be a text classifier and its tokenized example input.

# Export the model
torch.onnx.export(model,                     # model being run
                (text, offsets),           # model input (or a tuple for multiple inputs)
                "ag_news_model.onnx",      # where to save the model (can be a file or file-like object)
                export_params=True,        # store the trained parameter weights inside the model file
                opset_version=10,          # the ONNX opset version to export the model to
                do_constant_folding=True,  # whether to execute constant folding for optimization
                input_names = ['input', 'offsets'],   # the model's input names
                output_names = ['output'], # the model's output names
                dynamic_axes={'input' : {0 : 'batch_size'},    # variable length axes
                              'output' : {0 : 'batch_size'}})


## ONNX Runtime in Hugging Face Optimum

In [None]:
# Requires diffusers alongside optimum; accelerate reduces CPU memory usage while loading.
from optimum.onnxruntime import ORTStableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"
pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id, export=True)
prompt = "sailing ship in storm by Leonardo da Vinci"
image = pipeline(prompt).images[0]
pipeline.save_pretrained("./onnx-stable-diffusion-v1-5")

The `optimum.onnxruntime.ORTModelForXXX` model classes are API-compatible with Hugging Face Transformers models, so you can replace an `AutoModelForXXX` class with the corresponding `ORTModelForXXX` class from `optimum.onnxruntime`. In the listing that follows, lines starting with `-` are removed and lines starting with `+` are added.

from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import ORTModelForQuestionAnswering

-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2") # PyTorch checkpoint
+model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2") # ONNX checkpoint
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question, context)