fix: add --simulated_cpu_devices_count to to_huggingface.py to prevent OOM #3192
Merged
copybara-service[bot] merged 1 commit into AI-Hypercomputer:main on Feb 19, 2026
Contributor
Author
python3 -u -m maxtext.checkpoint_conversion.to_huggingface \
maxtext/configs/base.yml \
model_name=gemma3-4b \
hf_access_token=${HF_AUTH_TOKEN} \
load_parameters_path=${LOAD_PATH} \
base_output_directory=gs://mymodel-training/checkpoints/gemma3-4b-pt-hf \
per_device_batch_size=1 \
run_name=export \
scan_layers=true \
hardware=cpu \
skip_jax_distributed_system=True \
checkpoint_storage_concurrent_gb=16 \
--simulated_cpu_devices_count=1
shuningjin
reviewed
Feb 19, 2026
Collaborator
shuningjin
left a comment
Thank you for identifying the issue, proposing the fix, and performing tests! I don't see the benefit of simulated_cpu_devices_count > 1, so perhaps we can remove this entirely.
Contributor
Author
Done ✅
shuningjin
approved these changes
Feb 19, 2026
Collaborator
shuningjin
left a comment
Thank you! Could you squash into one commit before merge?
hengtaoguo
approved these changes
Feb 19, 2026
ae0bfbe to 389e383
RissyRan
approved these changes
Feb 19, 2026
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Description
Fix OOM in `to_huggingface.py` caused by 16× weight replication during Orbax checkpoint restore. Add `--simulated_cpu_devices_count` flag (default `16`).
Problem
`to_huggingface.py` hardcodes `xla_force_host_platform_device_count=16`, creating 16 simulated CPU devices. `load_orbax_checkpoint()` then builds a mesh with all devices and restores every parameter with `PartitionSpec()`, so each device holds a full copy of the weights.

For gemma3-4b in float32 (~14.5 GiB), this results in 16 × 14.5 GiB, roughly 231 GiB of memory usage just for the checkpoint load, far exceeding typical CPU node RAM.
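The arithmetic above can be reproduced with a short sketch. The helper name `replicated_restore_gib` is hypothetical and not part of MaxText; it only assumes that a fully replicated restore keeps one complete copy of the parameters per simulated device.

```python
def replicated_restore_gib(param_gib: float, device_count: int) -> float:
    """Estimate host memory (GiB) when every device holds a full parameter copy.

    Hypothetical helper for illustration; it ignores any extra buffers the
    checkpoint library may allocate during restore.
    """
    return param_gib * device_count


if __name__ == "__main__":
    # ~14.5 GiB is the float32 size quoted for gemma3-4b in this PR.
    for devices in (16, 1):
        estimate = replicated_restore_gib(14.5, devices)
        print(f"{devices:>2} simulated device(s): ~{estimate:.1f} GiB")
```

With the rounded 14.5 GiB figure the 16-device estimate comes out at 232 GiB, in line with the ~231 GiB quoted above, while a single device needs only the one 14.5 GiB copy.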
Fix
- Move the device-count setup from `main()` to the `__main__` block (before JAX initialization), following the same pattern as `to_maxtext.py`.
- Add a `--simulated_cpu_devices_count` argparse flag (default `16`) that is pre-parsed before `absl.app.run()`, matching the existing flag in `to_maxtext.py`.

This preserves backward compatibility: users who explicitly need one device can pass `--simulated_cpu_devices_count=1`.
Tests
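The pre-parse behavior described under Fix can be exercised in isolation. The following is a minimal sketch, not the code merged in this PR: the function names `preparse_device_count` and `set_host_device_count` are hypothetical, and it assumes the flag is consumed with `argparse` and translated into the XLA flag `--xla_force_host_platform_device_count` before JAX is imported.

```python
import argparse
import os
import sys


def preparse_device_count(argv, default=16):
    """Pull --simulated_cpu_devices_count out of argv; return (count, remaining args).

    Hypothetical helper: unknown arguments (config path, key=value overrides)
    are passed through untouched so a later parser can still handle them.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--simulated_cpu_devices_count", type=int, default=default)
    known, remaining = parser.parse_known_args(argv)
    return known.simulated_cpu_devices_count, remaining


def set_host_device_count(count, env=None):
    """Append the XLA host-device-count flag to XLA_FLAGS in the given environment."""
    env = os.environ if env is None else env
    flag = f"--xla_force_host_platform_device_count={count}"
    existing = env.get("XLA_FLAGS", "")
    env["XLA_FLAGS"] = f"{existing} {flag}".strip()
    return env["XLA_FLAGS"]


if __name__ == "__main__":
    # This must run before JAX initializes its backends for the flag to apply.
    count, remaining = preparse_device_count(sys.argv[1:])
    print(set_host_device_count(count))
    print(remaining)
```

Passing `--simulated_cpu_devices_count=1` yields a single simulated device, while omitting the flag keeps the default of 16, which is the backward-compatible behavior the description refers to.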
Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.