Fix on-the-fly model conversion in Pathways by forwarding storage flags #3966

Merged: copybara-service[bot] merged 1 commit into main from darisoy-fix-pathways-conversion on May 22, 2026

@darisoy darisoy commented May 21, 2026

Description

This PR ensures that on-the-fly model checkpoint conversion (Hugging Face to MaxText) during a Pathways run correctly respects the storage compatibility flags configured by the parent process. It explicitly forwards the checkpoint_storage_use_ocdbt and checkpoint_storage_use_zarr3 configurations to the spawned to_maxtext subprocess.

Context & Problem Solved

According to the official Hugging Face to MaxText Conversion Guide, users who manually convert checkpoints for a Pathways run must set the Pathways compatibility flags to False. The guide abstracts this via USE_PATHWAYS=1, which disables the ocdbt and zarr3 storage formats.

However, MaxText also supports automatic, on-the-fly checkpoint conversion during training startup (enabled by default in RL configs via convert_checkpoint_if_possible: True in post_train/rl.yml).

How it is broken today:
When running a Pathways RL job with on-the-fly conversion enabled, from_pretrained in model_creation_utils.py spawns the to_maxtext.py subprocess but fails to forward the parent process's checkpoint_storage_use_ocdbt and checkpoint_storage_use_zarr3 configurations.

As a result, the subprocess defaults to saving the converted checkpoint with ocdbt=True and zarr3=True (the McJAX defaults from base.yml). Because Pathways does not support these formats, the parent training process immediately fails with a restore error when attempting to load the newly converted parameters.
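
The mismatch described above can be illustrated with a minimal sketch. It assumes the subprocess resolves its settings by applying key=value overrides on top of defaults; `BASE_DEFAULTS` and `resolve_flags` are hypothetical names used only for illustration and are not MaxText's actual configuration code.

```python
# Defaults as described in this PR (the McJAX defaults from base.yml).
# BASE_DEFAULTS and resolve_flags are hypothetical, for illustration only.
BASE_DEFAULTS = {
    "checkpoint_storage_use_ocdbt": True,
    "checkpoint_storage_use_zarr3": True,
}


def resolve_flags(argv):
    """Apply key=value overrides from argv on top of the defaults."""
    resolved = dict(BASE_DEFAULTS)
    for arg in argv:
        key, sep, value = arg.partition("=")
        if sep and key in resolved:
            resolved[key] = value.lower() == "true"
    return resolved


# Parent process configured for Pathways: both storage formats disabled.
parent = {
    "checkpoint_storage_use_ocdbt": False,
    "checkpoint_storage_use_zarr3": False,
}

# Without forwarding, the subprocess receives no overrides and keeps the defaults.
child_without_forwarding = resolve_flags([])
mismatched = [k for k in parent if parent[k] != child_without_forwarding[k]]
print(mismatched)
```

Both flags end up in `mismatched`, which is the condition under which the parent would try to restore a checkpoint written in a format it does not support.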

Solution

This PR forwards checkpoint_storage_use_ocdbt and checkpoint_storage_use_zarr3 from the parent configuration to the to_maxtext subprocess command line, so that automatically converted checkpoints are written in a format the running Pathways environment can restore.
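
A rough sketch of the forwarding idea follows. It is not the code from this PR: `build_conversion_cmd`, the `SimpleNamespace` config, and the base command are hypothetical stand-ins, and it assumes the subprocess accepts key=value overrides on its command line.

```python
import sys
from types import SimpleNamespace

# Flags named in this PR that the subprocess should inherit from the parent.
STORAGE_FLAGS = ("checkpoint_storage_use_ocdbt", "checkpoint_storage_use_zarr3")


def build_conversion_cmd(config, base_cmd):
    """Return a copy of base_cmd with key=value overrides for the storage flags."""
    cmd = list(base_cmd)
    for name in STORAGE_FLAGS:
        # Read the value from the parent config and forward it verbatim.
        cmd.append(f"{name}={getattr(config, name)}")
    return cmd


# Hypothetical parent config for a Pathways run: both formats disabled.
parent_config = SimpleNamespace(
    checkpoint_storage_use_ocdbt=False,
    checkpoint_storage_use_zarr3=False,
)

# Hypothetical base command; the real invocation and its other arguments
# are not shown in this PR description.
base_cmd = [sys.executable, "to_maxtext.py"]

cmd = build_conversion_cmd(parent_config, base_cmd)
print(cmd[2:])
```

Reading the values from the parent config, instead of hard-coding False, keeps the same code path correct for non-Pathways runs, where the parent may leave both formats enabled.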

FIXES: b/497915531

Tests

Verified the changes by running the MaxText unit tests inside the TPU VM environment, forcing CPU execution and simulating 16 devices (to satisfy test sharding divisibility requirements):

JAX_PLATFORMS=cpu XLA_FLAGS=--xla_force_host_platform_device_count=16 pytest tests/unit/maxtext_utils_test.py

Results: All relevant checkpointing and model setup tests passed successfully.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov Bot commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Comment thread src/maxtext/utils/model_creation_utils.py
Pass checkpoint_storage_use_ocdbt and checkpoint_storage_use_zarr3 to to_maxtext subprocess in from_pretrained to ensure compatibility with Pathways parent process.
@darisoy darisoy force-pushed the darisoy-fix-pathways-conversion branch from aaa8a08 to 1582436 Compare May 22, 2026 02:18
@copybara-service copybara-service Bot merged commit 7249840 into main May 22, 2026
51 of 53 checks passed
@copybara-service copybara-service Bot deleted the darisoy-fix-pathways-conversion branch May 22, 2026 03:42