
Conversation

@MagnusBuehler (Contributor) commented Sep 5, 2025

Motivation and Context

Issue #416 reported a crash when a model saved on a CUDA device was later loaded on a CPU-only machine, because tensors were serialized as CUDA tensors. This is fixed by ensuring all tensor-like elements of the predictor are saved as CPU tensors, allowing safe loading on any device. The desired device can be specified (or automatically inferred) at loading time. This adds a negligible one-time overhead (~10 ms, measured over 10 runs). Corresponding tests now cover all combinations of fitting, saving, and loading devices.
Additionally, I added uv.lock to the .gitignore file and renamed a function in src/tabpfn/inference.py from save_state_expect_model_weights to save_state_except_model_weights, since the new name better reflects what the function does.
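A minimal sketch of the core idea, assuming a joblib-serialized predictor state; `_tensors_to_cpu` and `save_predictor_state` are illustrative names, not the actual TabPFN internals:

```python
import joblib
import torch


def _tensors_to_cpu(obj):
    """Recursively move any tensors inside a state object to CPU."""
    if isinstance(obj, torch.Tensor):
        return obj.cpu()
    if isinstance(obj, dict):
        return {k: _tensors_to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_tensors_to_cpu(v) for v in obj)
    return obj


def save_predictor_state(state, path):
    # CPU tensors deserialize on any machine, whereas serialized CUDA
    # tensors make joblib.load() fail on CPU-only hosts.
    joblib.dump(_tensors_to_cpu(state), path)
```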


Public API Changes

  • No Public API changes
  • Yes, Public API changes (Details below)

How Has This Been Tested?

In a CUDA environment and locally on CPU. Corresponding pytest cases were also added.

Checklist

  • The changes have been tested locally.
  • Documentation has been updated (if the public API or usage changes).
  • An entry has been added to CHANGELOG.md (if relevant for users).
  • The code follows the project's style guidelines.
  • I have considered the impact of these changes on the public API.

MagnusBuehler and others added 3 commits August 29, 2025 11:14
… CUDA device and loaded on a CPU device.

The fix consists of always moving all tensors to CPU before saving.
Reason: once a CUDA tensor is saved and later loaded on a CPU-only machine, joblib.load() throws an error. One could monkey-patch torch.load(..., map_location="cpu") and restore it afterwards, but moving all tensors to CPU once during saving is cleaner, has minimal one-time performance cost (~10 ms), and is a robust fix without potential side effects.
Also extended the test_save_load_happy_path test with loading_device and saving_device parameters.
… on a CUDA device and later loading on a CPU device. Also added tests for different CUDA/CPU saving-loading combinations.
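An illustrative sketch of such a device-combination test, using torch.save/torch.load directly rather than the project's actual save/load helpers; the test name and state layout are hypothetical:

```python
import pytest
import torch

# Only parametrize over CUDA when it is actually available.
DEVICES = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])


@pytest.mark.parametrize("saving_device", DEVICES)
@pytest.mark.parametrize("loading_device", DEVICES)
def test_save_load_across_devices(tmp_path, saving_device, loading_device):
    # Stand-in for predictor state: one tensor created on the saving device.
    state = {"weights": torch.randn(4, 4, device=saving_device)}

    # The fix: move every tensor to CPU before serialization.
    path = tmp_path / "state.pt"
    torch.save({k: v.cpu() for k, v in state.items()}, path)

    # Loading now succeeds on any machine; map to the requested device after.
    loaded = torch.load(path, map_location=loading_device)
    assert loaded["weights"].device.type == loading_device
```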
@gemini-code-assist bot left a comment

Code Review

This pull request effectively resolves the issue of saving and loading models across different devices (CPU/CUDA) by ensuring all tensor-like objects are moved to the CPU before serialization. The renaming of save_state_expect_model_weights to save_state_except_model_weights is a good clarification. The addition of comprehensive tests for various device combinations is a great improvement. I have a few suggestions to enhance the robustness of the implementation and the tests.

@noahho (Collaborator) left a comment

LGTM, thanks for the work. Please just address the one comment left.

Co-authored-by: noahho <noah@priorlabs.ai>
@noahho noahho merged commit 542270b into PriorLabs:main Sep 16, 2025
18 of 19 checks passed
oscarkey pushed a commit that referenced this pull request Nov 12, 2025
…ressor/Classifier) model on different devices (cuda / cpu) (#156)

* Record copied public PR 486

* Resolved issue saving & loading the predictor (Regressor/Classifier) model on different devices (cuda / cpu) (#486)

Co-authored-by: noahho <noah@priorlabs.ai>

---------

Co-authored-by: mirror-bot <mirror-bot@users.noreply.github.com>
Co-authored-by: MagnusBuehler <111045718+MagnusBuehler@users.noreply.github.com>
Co-authored-by: noahho <noah@priorlabs.ai>
