Fix on-the-fly model conversion in Pathways by forwarding storage flags #3966
Merged
Codecov Report: ✅ All modified and coverable lines are covered by tests.
bvandermoon approved these changes on May 21, 2026
bvandermoon reviewed on May 21, 2026
shralex approved these changes on May 22, 2026
Pass checkpoint_storage_use_ocdbt and checkpoint_storage_use_zarr3 to to_maxtext subprocess in from_pretrained to ensure compatibility with Pathways parent process.
aaa8a08 to 1582436
Description
This PR ensures that on-the-fly model checkpoint conversion (Hugging Face to MaxText) during a Pathways run correctly respects the storage compatibility flags configured by the parent process. It explicitly forwards the checkpoint_storage_use_ocdbt and checkpoint_storage_use_zarr3 configurations to the spawned to_maxtext subprocess.
Context & Problem Solved
According to the official Hugging Face to MaxText Conversion Guide, when manually converting checkpoints for a Pathways run, users are instructed to set the Pathways compatibility flags to False (abstracted via USE_PATHWAYS=1, which disables the ocdbt and zarr3 storage formats).
However, MaxText also supports automatic, on-the-fly checkpoint conversion during training startup (enabled by default in RL configs via convert_checkpoint_if_possible: True in post_train/rl.yml).
How it is broken today:
When running a Pathways RL job with on-the-fly conversion enabled, from_pretrained in model_creation_utils.py spawns the to_maxtext.py subprocess but fails to forward the parent process's checkpoint_storage_use_ocdbt and checkpoint_storage_use_zarr3 configurations.
As a result, the subprocess defaults to saving the converted checkpoint with ocdbt=True and zarr3=True (the McJAX defaults from base.yml). Because Pathways does not support these formats, the parent training process immediately fails with a restore error when attempting to load the newly converted parameters.
Solution
This PR programmatically forwards checkpoint_storage_use_ocdbt and checkpoint_storage_use_zarr3 from the parent configuration to the to_maxtext subprocess command line, ensuring that automatically converted checkpoints are instantly compatible with the running Pathways environment.
FIXES: b/497915531
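The forwarding described in the Solution can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the helper names (storage_flag_args, build_conversion_command), the key=value argument format, the bare to_maxtext.py script path, and the model_name=example-model argument are all hypothetical. Only the two flag names come from the description.

```python
import sys
from types import SimpleNamespace

# The two flag names are taken from the PR description.
FORWARDED_STORAGE_FLAGS = (
    "checkpoint_storage_use_ocdbt",
    "checkpoint_storage_use_zarr3",
)


def storage_flag_args(config) -> list[str]:
    """Render the parent's storage flags as key=value overrides (assumed format)."""
    return [f"{name}={getattr(config, name)}" for name in FORWARDED_STORAGE_FLAGS]


def build_conversion_command(config, base_args: list[str]) -> list[str]:
    """Append the parent's storage flags to the subprocess command line.

    Hypothetical: the real script/module path and argument order may differ.
    """
    return [sys.executable, "to_maxtext.py", *base_args, *storage_flag_args(config)]


# Stand-in for a Pathways parent config, where both formats are disabled.
parent_config = SimpleNamespace(
    checkpoint_storage_use_ocdbt=False,
    checkpoint_storage_use_zarr3=False,
)

cmd = build_conversion_command(parent_config, ["model_name=example-model"])
print(cmd[2:])
```

Because the values are read from the parent configuration rather than hard-coded, the same code path would leave the defaults untouched for a non-Pathways run where both flags are True.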
Tests
Verified the changes by running the MaxText unit tests inside the TPU VM environment.
Ran unit tests forcing CPU execution and simulating 16 devices (to satisfy test sharding divisibility requirements):
Results: All relevant checkpointing and model setup tests passed successfully.
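The exact test command is not preserved in this description. As a hedged sketch only, one common way to force JAX onto CPU and simulate 16 host devices is to set the JAX_PLATFORMS and XLA_FLAGS environment variables before the test process starts. The helper name cpu_test_env is hypothetical and is not part of MaxText.

```python
import os


def cpu_test_env(num_devices: int = 16, base_env=None) -> dict:
    """Return an environment that forces CPU and simulates N host devices."""
    env = dict(os.environ if base_env is None else base_env)
    # Restrict JAX to the CPU backend.
    env["JAX_PLATFORMS"] = "cpu"
    # Ask XLA to expose several virtual CPU devices, keeping any existing flags.
    device_flag = f"--xla_force_host_platform_device_count={num_devices}"
    env["XLA_FLAGS"] = f"{env.get('XLA_FLAGS', '')} {device_flag}".strip()
    return env


env = cpu_test_env(base_env={})
print(env["JAX_PLATFORMS"], env["XLA_FLAGS"])
```

The resulting dictionary could be passed as the env argument of subprocess.run when launching a test runner. The variables must be set before JAX is first imported in that process for them to take effect.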
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.