Resolved issue saving & loading the predictor (Regressor/Classifier) model on different devices (cuda / cpu) #486
Conversation
… cuda device and loaded on a cpu device. The fix consists of always moving the model to CPU before saving. Reason: once a "cuda tensor" is saved and later loaded on a CPU-only machine, joblib.load() throws an error. One could monkey-patch torch.load(..., map_location="cpu") and restore it afterwards, but moving all tensors to CPU once during saving is cleaner, has minimal one-time performance cost (~10 ms), and is a robust fix without potential side effects. Also extended the "test_save_load_happy_path" test with `loading_device` and `saving_device` parameters.
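For illustration, a minimal sketch of the approach described above (the function names `save_predictor`/`load_predictor` and the attribute scan are hypothetical, not necessarily TabPFN's actual code):

```python
import joblib
import torch


def save_predictor(predictor, path: str) -> None:
    # Move every tensor/module attribute to CPU so the serialized
    # file can be loaded on CPU-only machines.
    for name, value in vars(predictor).items():
        if isinstance(value, (torch.Tensor, torch.nn.Module)):
            setattr(predictor, name, value.to("cpu"))
    joblib.dump(predictor, path)


def load_predictor(path: str, device: str = "cpu"):
    # Safe on any machine: the file contains only CPU tensors.
    predictor = joblib.load(path)
    for name, value in vars(predictor).items():
        if isinstance(value, (torch.Tensor, torch.nn.Module)):
            setattr(predictor, name, value.to(device))
    return predictor
```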
… on a cuda device and later loading on a cpu device. Also added tests for the different cuda/cpu saving-loading combinations.
Code Review
This pull request effectively resolves the issue of saving and loading models across different devices (CPU/CUDA) by ensuring all tensor-like objects are moved to the CPU before serialization. The renaming of save_state_expect_model_weights to save_state_except_model_weights is a good clarification. The addition of comprehensive tests for various device combinations is a great improvement. I have a few suggestions to enhance the robustness of the implementation and the tests.
…loading.py checked for tensors and nn.modules
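A hedged sketch of what such a check might look like; the recursive handling of containers is an assumption for illustration, not necessarily the PR's exact code:

```python
import torch


def to_device(obj, device: str):
    """Recursively move tensors and nn.Modules to `device`;
    other values pass through unchanged."""
    if isinstance(obj, (torch.Tensor, torch.nn.Module)):
        return obj.to(device)
    if isinstance(obj, dict):
        return {k: to_device(v, device) for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_device(v, device) for v in obj]
    return obj
```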
noahho left a comment:
LGTM, thanks for the work. Please just address the one comment I left.
Co-authored-by: noahho <noah@priorlabs.ai>
…ressor/Classifier) model on different devices (cuda / cpu) (#156)

* Record copied public PR 486
* Resolved issue saving & loading the predictor (Regressor/Classifier) model on different devices (cuda / cpu) (#486)

Co-authored-by: mirror-bot <mirror-bot@users.noreply.github.com>
Co-authored-by: MagnusBuehler <111045718+MagnusBuehler@users.noreply.github.com>
Co-authored-by: noahho <noah@priorlabs.ai>
Motivation and Context
Issue #416 reported a crash when a model saved on a CUDA device was later loaded on a CPU-only machine, due to tensors being serialized as CUDA tensors. This has been fixed by ensuring all tensor-like elements of the predictor are saved as CPU tensors, allowing safe loading on any device. The desired device can be specified (or automatically inferred) at loading time. This adds a negligible one-time overhead (~10 ms, measured over 10 runs). Corresponding tests now cover all combinations of fitting, saving, and loading devices.
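To make the failure mode concrete, here is roughly what went wrong before the fix (a sketch; the error text is paraphrased from PyTorch's actual message):

```python
import joblib
import torch

# On a GPU machine:
t = torch.randn(3, device="cuda")
joblib.dump({"weights": t}, "model.joblib")

# On a CPU-only machine, unpickling raises (paraphrased):
#   RuntimeError: Attempting to deserialize object on a CUDA device
#   but torch.cuda.is_available() is False.
joblib.load("model.joblib")

# torch.load(path, map_location="cpu") would avoid this, but joblib pickles
# the tensors directly, so saving CPU tensors in the first place is the
# robust fix.
```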
Further, I added uv.lock to the .gitignore file and renamed a function in "src/tabpfn/inference.py" from save_state_expect_model_weights to save_state_except_model_weights, as it better matches what the function does.
Public API Changes
How Has This Been Tested?
In a CUDA environment and locally on CPU. Also added corresponding pytests.
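The device-combination tests could look roughly like this (the `saving_device`/`loading_device` parameter names come from the commit message; the rest is a sketch):

```python
import pytest
import torch

DEVICES = ["cpu", "cuda"]


@pytest.mark.parametrize("saving_device", DEVICES)
@pytest.mark.parametrize("loading_device", DEVICES)
def test_save_load_happy_path(tmp_path, saving_device, loading_device):
    if "cuda" in (saving_device, loading_device) and not torch.cuda.is_available():
        pytest.skip("CUDA not available")
    # Fit the predictor on `saving_device`, save it under tmp_path,
    # reload it on `loading_device`, and assert the reloaded model's
    # predictions match the pre-save predictions.
```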
Checklist
CHANGELOG.md (if relevant for users).