[Training] IR version incompatibility in artifact generation for on-device training #20726

tomaz-suller · 2024-05-19T10:26:16Z

Describe the issue

Trying to execute the example notebook provided in on_device_training/desktop/python/mnist.ipynb results in an error about IR version incompatibility, stating the optimiser only supports version <=9 while the generated artifacts use version 10.

To reproduce

Install on-device training dependencies for offline stage as instructed here
Install additional dependencies to execute the notebook
```
ipykernel
ipywidgets
torch
torchvision
matplotlib
netron
evaluate
```
(I initially added them to requirements.txt, then installed them one by one after each ImportError to rule that out as the cause)
Execute notebook until the first cell of section "3 - Initialize Module and Optimizer"; no errors should be raised

Execute first cell of the section

```python
# Create checkpoint state.
state = CheckpointState.load_checkpoint("data/checkpoint")

# Create module.
model = Module("data/training_model.onnx", state, "data/eval_model.onnx")

# Create optimizer.
optimizer = Optimizer("data/optimizer_model.onnx", model)
```

which should raise the following error

```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[18], line 8
      5 model = Module("data/training_model.onnx", state, "data/eval_model.onnx")
      7 # Create optimizer.
----> 8 optimizer = Optimizer("data/optimizer_model.onnx", model)

File venv/lib/python3.12/site-packages/onnxruntime/training/api/optimizer.py:24, in Optimizer.__init__(self, optimizer_uri, module)
     23 def __init__(self, optimizer_uri: str | os.PathLike, module: Module):
---> 24     self._optimizer = C.Optimizer(
     25         os.fspath(optimizer_uri), module._state._state, module._device, module._session_options
     26     )

RuntimeError: /onnxruntime_src/orttraining/orttraining/training_api/optimizer.cc:273 void onnxruntime::training::api::Optimizer::Initialize(const onnxruntime::training::api::ModelIdentifiers&, const std::vector<std::shared_ptr<onnxruntime::IExecutionProvider> >&, gsl::span<OrtCustomOpDomain* const>) [ONNXRuntimeError] : 1 : FAIL : Load model from data/optimizer_model.onnx failed:/onnxruntime_src/onnxruntime/core/graph/model.cc:179 onnxruntime::Model::Model(onnx::ModelProto&&, const onnxruntime::PathString&, const onnxruntime::IOnnxRuntimeOpSchemaRegistryList*, const onnxruntime::logging::Logger&, const onnxruntime::ModelOptions&) Unsupported model IR version: 10, max supported IR version: 9
```

Urgency

I need to develop on top of this for a project due next month.

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.17.3

PyTorch Version

2.3.0+cu121

Execution Provider

ROCm

Execution Provider Library Version

ROCm 6.0.2


tomaz-suller · 2024-05-19T10:31:59Z

I suspect an incompatibility with system packages or with the versions of other Python packages could be to blame, since I'm running EndeavourOS (rolling release, Arch-based) with Python 3.12.3.

I tried downgrading onnx to 1.14.1, but I got a build error from absl complaining that my compiler didn't support C++14 (which is odd, since it should, but I gave up at that point).

tomaz-suller · 2024-05-19T10:44:01Z

I just checked, and I get the same error in Google Colab when following the same steps, this time running on CPU with Python 3.10.12.

carzh · 2024-05-19T23:23:23Z

@tomaz-suller what version of ONNX are you using? If you haven't already, could you try with onnx==1.15.0? Also, what version of onnxruntime-training are you using?

tomaz-suller · 2024-05-20T04:16:57Z

It does work with onnx==1.15.0 in Colab. I'm using onnxruntime-training-cpu==1.17.3

Edit: locally, I still get the ABSL build error about C++14 that I mentioned when trying to downgrade, but at that point the issue is no longer with ONNX.
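Since the fix came down to which onnx version sat next to the training package, it can help to print the installed versions before regenerating the artifacts. This is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names in the usage block are assumptions that should be adjusted to whatever was actually installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    result: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    # Assumed distribution names; edit to match your environment.
    names = ["onnx", "onnxruntime-training", "onnxruntime-training-cpu"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver if ver is not None else 'not installed'}")
```

In the thread, the combination reported to work in Colab was onnx 1.15.0 with the 1.17.3 training package.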

tomaz-suller added the "training" label (issues related to ONNX Runtime training; typically submitted using template) on May 19, 2024

github-actions bot added the "ep:ROCm" label (questions/issues related to ROCm execution provider) on May 19, 2024
