[DECOUPLED-MODE] Adding Decoupling Logic by gulsumgudukbay · Pull Request #2865 · AI-Hypercomputer/maxtext

gulsumgudukbay · 2025-12-21T06:49:51Z

Description

This PR is the second part of the decoupling support. It adds the decoupling logic itself, along with the test modifications needed to run the tests in decoupled mode.

Details:

Update decoupled_base_test.yml
Add decoupling logic to src/MaxText/decode.py, src/MaxText/elastic_train.py, src/MaxText/experimental/rl/grpo_trainer.py, src/MaxText/gcp_workload_monitor.py, src/MaxText/max_utils.py, src/MaxText/maxengine.py, src/MaxText/maxengine_config.py, src/MaxText/maxengine_server.py, src/MaxText/metric_logger.py, src/MaxText/prefill_packing.py, src/MaxText/profiler.py, src/MaxText/sft/hooks.py, src/MaxText/sft/sft_trainer.py, src/MaxText/train.py, src/MaxText/utils/gcs_utils.py, src/MaxText/utils/goodput_utils.py, src/MaxText/vertex_tensorboard.py
Update src/MaxText/gcloud_stub.py to add IS_STUB variables, and add google_cloud_mldiagnostics stub
Update tests to support decoupled mode (add markers, update file paths, make them use decoupled_base_test.yml config file).

Tests

Added two tests covering decoupled mode and gcloud_stub.

tests/unit/gcloud_stub_test.py:

All stub function tests with proper variable usage
GCS utils integration tests (6 tests):
  _gcs_guard behavior in different modes
  write_config_raw_keys_for_gcs skipping logic
Maxengine config integration tests (4 tests):
  create_exp_maxengine device handling in decoupled/non-decoupled mode
  get_server_config raising in decoupled mode
  create_maxengine ignoring devices
Maxengine server integration tests (4 tests):
  Server startup guard in decoupled mode
  Prefix caching config creation logic
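
As a rough illustration of what "guard behavior in different modes" could look like as a test, here is a minimal sketch. The `gcs_guard` function below is a hypothetical stand-in with assumed semantics (proceed with a real client, skip when stubbed in decoupled mode, raise otherwise). It is not the `_gcs_guard` from gcs_utils.py, whose actual signature and behavior may differ.

```python
import unittest


def gcs_guard(operation, *, decoupled, is_stub):
  """Hypothetical guard: returns True when the GCS call may proceed."""
  if not is_stub:
    return True  # a real client is available
  if decoupled:
    return False  # decoupled mode: skip the operation quietly
  raise RuntimeError(f"{operation} needs google-cloud-storage, which is not installed")


class GcsGuardTest(unittest.TestCase):

  def test_real_client_proceeds(self):
    self.assertTrue(gcs_guard("upload", decoupled=False, is_stub=False))

  def test_stub_in_decoupled_mode_skips(self):
    self.assertFalse(gcs_guard("upload", decoupled=True, is_stub=True))

  def test_stub_outside_decoupled_mode_raises(self):
    with self.assertRaises(RuntimeError):
      gcs_guard("upload", decoupled=False, is_stub=True)
```

The class can be collected by either `python -m unittest` or pytest when saved in a test module.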

All unit tests pass in decoupled mode.
UT results:
== 306 passed, 170 skipped, 25 deselected, 6588 warnings in 975.16s (0:16:15) ==

Train test:
python -m MaxText.train MaxText/configs/base.yml run_name=test hardware=gpu steps=5 model_name=llama2-7b attention=cudnn_flash_te enable_checkpointing=False ici_expert_parallelism=1 ici_fsdp_parallelism=-1 ici_data_parallelism=1 remat_policy=minimal scan_layers=True dataset_type=synthetic logits_dot_in_fp32=False dtype=bfloat16 weight_dtype=bfloat16 per_device_batch_size=1 max_target_length=2048 shardy=False

The run completes successfully.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-01-05T16:55:43Z

Codecov Report

❌ Patch coverage is 50.68493% with 108 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/utils/max_utils.py	14.70%	25 Missing and 4 partials ⚠️
src/maxtext/common/gcloud_stub.py	75.00%	15 Missing ⚠️
src/maxtext/utils/gcs_utils.py	37.50%	14 Missing and 1 partial ⚠️
src/MaxText/maxengine_server.py	0.00%	12 Missing ⚠️
src/maxtext/common/goodput.py	18.18%	7 Missing and 2 partials ⚠️
src/maxtext/common/gcp_workload_monitor.py	30.00%	7 Missing ⚠️
src/MaxText/maxengine.py	77.77%	2 Missing and 2 partials ⚠️
src/MaxText/maxengine_config.py	66.66%	4 Missing ⚠️
src/maxtext/common/metric_logger.py	50.00%	3 Missing and 1 partial ⚠️
src/maxtext/common/vertex_tensorboard.py	25.00%	3 Missing ⚠️
... and 3 more


add decoupling logic config patch to tests
add correct ICI parallelism to tests for decoupled mode
fixing more UTs
adding decoupling logic, biggest change
add tensorboardX stub
fixing train_tests
removing tunix from decoupling logic
renaming datasets to local_datasets to avoid confusion with HF datasets library
make jax_remove_size_one_mesh_axis_from_type param setting in try block, todo: remove this after updating jax.
Configure ICI data parallelism for decoupled mode
revert legacy rl trainer
Update grpo_trainer.py
adding conditional imports because pytest collect always imports before marks are used
centralize decoupled dataset paths and base_output_directory
skip packed attention if not on cuda sm90+
parameterize test_env_smoke tests
undo sft test decoupling changes as it is marked as external training
move path logic to setup method and use if logic in train_smoke_test.py
fix ref to dummy summary writer
add yield to GOODPUT_STUB
Add pytest marker for train_compile tests. These are actually requiring libtpu and should be TPU tests.
fixed path for GrainArrayRecordBestFitPackingTest
update local output
adding refactoring for test_utils import
moved local_datasets, refactored
add missing import from flop_calculation test
adding gcloud_stub_test.py
fix gcloud_stub test, add cpu_only marker
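
The "add yield to GOODPUT_STUB" commit touches a common pitfall: a stub that is used in a `with` statement via `contextlib.contextmanager` has to yield exactly once, otherwise the body of the `with` block never gets control. A minimal sketch, in which `noop_goodput_recorder` is a hypothetical name and not the stub from this PR:

```python
import contextlib


@contextlib.contextmanager
def noop_goodput_recorder(*args, **kwargs):
  """Hypothetical no-op replacement for a goodput-recording context manager."""
  # The single yield hands control to the body of the `with` block.
  yield


steps_run = []
with noop_goodput_recorder("step", 3):
  # The wrapped work still executes; only the recording is skipped.
  steps_run.append(3)
```

Accepting `*args, **kwargs` lets the stub be dropped in wherever the real recorder is called, regardless of the arguments passed.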