[Training] IR version incompatibility in artifact generation for on-device training #20726

Open
tomaz-suller opened this issue May 19, 2024 · 4 comments
Labels
ep:ROCm questions/issues related to ROCm execution provider training issues related to ONNX Runtime training; typically submitted using template

Comments

@tomaz-suller

Describe the issue

Executing the example notebook provided in on_device_training/desktop/python/mnist.ipynb fails with an IR version incompatibility error: the runtime reports a maximum supported IR version of 9, while the generated optimizer model uses IR version 10.

To reproduce

  1. Install on-device training dependencies for offline stage as instructed here
  2. Install additional dependencies to execute the notebook
    ipykernel
    ipywidgets
    torch
    torchvision
    matplotlib
    netron
    evaluate
    
    (I initially added them to requirements.txt, then installed them one by one after each ImportError to rule that out as the cause)
  3. Execute notebook until the first cell of section "3 - Initialize Module and Optimizer"; no errors should be raised
  4. Execute first cell of the section
    # Create checkpoint state.
    state = CheckpointState.load_checkpoint("data/checkpoint")
    
    # Create module.
    model = Module("data/training_model.onnx", state, "data/eval_model.onnx")
    
    # Create optimizer.
    optimizer = Optimizer("data/optimizer_model.onnx", model)
    which should raise the following error
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[18], line 8
      5 model = Module("data/training_model.onnx", state, "data/eval_model.onnx")
      7 # Create optimizer.
----> 8 optimizer = Optimizer("data/optimizer_model.onnx", model)

File venv/lib/python3.12/site-packages/onnxruntime/training/api/optimizer.py:24, in Optimizer.__init__(self, optimizer_uri, module)
     23 def __init__(self, optimizer_uri: str | os.PathLike, module: Module):
---> 24     self._optimizer = C.Optimizer(
     25         os.fspath(optimizer_uri), module._state._state, module._device, module._session_options
     26     )

RuntimeError: /onnxruntime_src/orttraining/orttraining/training_api/optimizer.cc:273 void onnxruntime::training::api::Optimizer::Initialize(const onnxruntime::training::api::ModelIdentifiers&, const std::vector<std::shared_ptr<onnxruntime::IExecutionProvider> >&, gsl::span<OrtCustomOpDomain* const>) [ONNXRuntimeError] : 1 : FAIL : Load model from data/optimizer_model.onnx failed:/onnxruntime_src/onnxruntime/core/graph/model.cc:179 onnxruntime::Model::Model(onnx::ModelProto&&, const onnxruntime::PathString&, const onnxruntime::IOnnxRuntimeOpSchemaRegistryList*, const onnxruntime::logging::Logger&, const onnxruntime::ModelOptions&) Unsupported model IR version: 10, max supported IR version: 9
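
To confirm which of the generated files carries the newer IR version, the field can be read directly from the serialized model. The sketch below is one possible check and is not part of the notebook. It assumes the artifacts are ordinary protobuf-serialized ONNX `ModelProto` files, in which `ir_version` is field number 1; the helper name `read_ir_version` is hypothetical. It uses only the standard library, so it runs regardless of which `onnx` version is installed.

```python
from __future__ import annotations

from pathlib import Path


def _read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode one protobuf varint starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def read_ir_version(data: bytes) -> int | None:
    """Return ModelProto.ir_version (field 1) from serialized bytes, or None."""
    pos = 0
    while pos < len(data):
        tag, pos = _read_varint(data, pos)
        field, wire_type = tag >> 3, tag & 0x7
        if wire_type == 0:  # varint
            value, pos = _read_varint(data, pos)
            if field == 1:
                return value
        elif wire_type == 1:  # fixed 64-bit, skip
            pos += 8
        elif wire_type == 2:  # length-delimited, skip the payload
            length, pos = _read_varint(data, pos)
            pos += length
        elif wire_type == 5:  # fixed 32-bit, skip
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
    return None


if __name__ == "__main__":
    # File names taken from the notebook cell above.
    for name in ("training_model.onnx", "eval_model.onnx", "optimizer_model.onnx"):
        path = Path("data") / name
        if path.exists():
            print(name, read_ir_version(path.read_bytes()))
```

Since the traceback shows the `Module` call on line 5 succeeding and only the `Optimizer` call failing, comparing the three values may help narrow the problem to the optimizer artifact.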

Urgency

I need to develop on top of this for a project due next month.

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.17.3

PyTorch Version

2.3.0+cu121

Execution Provider

ROCm

Execution Provider Library Version

ROCm 6.0.2

@tomaz-suller tomaz-suller added the training issues related to ONNX Runtime training; typically submitted using template label May 19, 2024
@github-actions github-actions bot added the ep:ROCm questions/issues related to ROCm execution provider label May 19, 2024
@tomaz-suller
Author

I suspect an incompatibility with the versions of system packages or other Python packages could be to blame, since I'm running EndeavourOS (rolling release, Arch-based) with Python 3.12.3.

I tried downgrading onnx to 1.14.1, but I got a build error from absl complaining that my compiler didn't support C++14 (which is odd, since it should; I gave up at that point).

@tomaz-suller
Author

tomaz-suller commented May 19, 2024

I just checked, and I get the same error in Google Colab following the same steps, this time running on CPU with Python 3.10.12.

@carzh
Contributor

carzh commented May 19, 2024

@tomaz-suller what version of ONNX are you using? If you haven't already, could you try with onnx==1.15.0? Also, what version of onnxruntime-training are you using?
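
One way to collect the versions asked about here is to query the installed distributions from inside the same environment or notebook kernel. This is a minimal sketch using only the standard library; the function name `installed_versions` is hypothetical, and the list of distribution names is an assumption about which packages matter for this issue.

```python
from __future__ import annotations

from importlib import metadata

# Distribution names assumed relevant to this thread; adjust as needed.
PACKAGES = (
    "onnx",
    "onnxruntime",
    "onnxruntime-training",
    "onnxruntime-training-cpu",
    "torch",
)


def installed_versions(names=PACKAGES) -> dict[str, str | None]:
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it in the notebook kernel rather than a separate shell avoids reporting versions from a different environment.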

@tomaz-suller
Author

tomaz-suller commented May 20, 2024

It does work with onnx==1.15.0 in Colab. I'm using onnxruntime-training-cpu==1.17.3.

Edit: locally, I still get the ABSL build error about C++14 that I mentioned when trying to downgrade, but that is no longer an ONNX issue.
