
Change mosaic_group_id in case of GPUs/CPUs #3041

Merged

copybara-service[bot] merged 1 commit into AI-Hypercomputer:main from Steboss:main on Feb 2, 2026

Conversation

@Steboss
Contributor

@Steboss Steboss commented Jan 29, 2026

Description

On multi-process GPU/CPU runs, random.randint() generates a different mosaic_fusion_group ID in each process. This causes the HLO fingerprints to diverge across processes. The XLA autotuner then deadlocks because each process waits for autotune results keyed by fingerprints that the other processes do not have.
The fix uses a deterministic group ID, '0', for GPU/CPU, and keeps it random for TPUs.

The parent code originated from this PR
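The shape of the fix can be sketched as follows. This is an illustrative sketch, not the actual MaxText/JAX code: the function name, signature, and id range are hypothetical, and only the branching logic (deterministic id on GPU/CPU, random id on TPU) reflects the change described above.

```python
import random


def choose_mosaic_group_id(platform: str) -> int:
    """Return a mosaic fusion-group id for the given backend platform.

    On GPU/CPU, every process must produce the same id so that the HLO
    fingerprint (and thus the XLA autotuner cache key) is identical
    across processes. On TPU a per-process random id remains acceptable.
    """
    if platform in ("gpu", "cpu"):
        return 0  # deterministic: identical on every process
    return random.randint(0, 2**31 - 1)  # TPU: randomness is fine
```

With this, two GPU processes compiling the same module always agree on the group id, so their HLO fingerprints match and autotune results can be shared.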

Tests

To test the changes, I ran the llama4-17b-16e model with the following configuration:

export VARIANT="llama4-17b-16e"
export TRAIN_STEPS=31
export BATCH_SIZE=1
export SEQ_LENGTH=8192
export REPEAT_LAYER=true
export REMAT_POLICY="full"
export SPARSE_MATMUL=true
export MEM_FRACTION=0.94

# 4 nodes
export ICI_FSDP=1
export DCN_FSDP=4 
export ICI_EP=8
export DCN_EP=1
export DCN_DP=1

export VOCAB_PATH=gs://t5-data/vocabs/cc_all.32000.100extra/sentencepiece.model
export XLA_PYTHON_CLIENT_MEM_FRACTION=${MEM_FRACTION}
export CUDA_DEVICE_MAX_CONNECTIONS=16
export NVTE_FUSED_ATTN=1

RUN_SETTINGS="/opt/maxtext/src/MaxText/configs/base.yml \
    run_name=logdir \
    use_iota_embed=false \
    scan_layers=${REPEAT_LAYER} \
    steps=${TRAIN_STEPS} \
    per_device_batch_size=${BATCH_SIZE} \
    model_name=${VARIANT} \
    remat_policy=${REMAT_POLICY} \
    enable_checkpointing=false \
    logits_dot_in_fp32=false \
    base_output_directory=/maxtext/local_train \
    dataset_type=synthetic \
    attention=cudnn_flash_te \
    max_segments_per_seq=1 \
    max_target_length=${SEQ_LENGTH} \
    sparse_matmul=${SPARSE_MATMUL} \
    megablox=false \
    enable_goodput_recording=false \
    monitor_goodput=false \
    hardware=gpu_multiprocess \
    dcn_fsdp_parallelism=${DCN_FSDP} \
    ici_fsdp_parallelism=${ICI_FSDP} \
    ici_expert_parallelism=${ICI_EP} \
    dcn_expert_parallelism=${DCN_EP} \
    ici_data_parallelism=1 \
    dcn_data_parallelism=${DCN_DP}"

python3 -m MaxText.train $RUN_SETTINGS

Without the mosaic modification, the run above ended in a deadlock: the last lines of the log show each shard waiting for autotune results from other shards, e.g.

I0128 10:33:11.701275 1841390 autotuner.cc:421] Picked best config: {Cublas_fission : goo.gle/debugonly   { algorithm: -1 } duration: 5.94128ms, scratch_bytes: 33554432}
I0128 10:33:11.741443 1841390 autotuner.cc:245] Storing results for autotune_results_0dcef21b52dde45e1dbbaca84549a421_31
I0128 10:33:11.741950 1841390 autotuner.cc:247] Shard 31 stored results at autotune_results_0dcef21b52dde45e1dbbaca84549a421_31
I0128 10:33:11.761985 1841390 autotuner.cc:261] Shard 31: waiting for results from shard 0 / 32 at autotune_results_0dcef21b52dde45e1dbbaca84549a421_0
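The log above shows shard 31 blocking on a key derived from the module fingerprint (the 32-hex-character hash in the store path). The following toy illustration, which is not the real XLA code, shows why per-process random ids break this exchange: if the fingerprint hashes the module text, any per-process difference in that text produces a different key, so no process ever finds the results it is waiting for. The hash choice and HLO snippet here are assumptions for demonstration only.

```python
import hashlib
import random


def hlo_fingerprint(module_text: str) -> str:
    """Toy stand-in for a module fingerprint: hash of the HLO text."""
    return hashlib.md5(module_text.encode()).hexdigest()


# Two processes embed different random group ids in otherwise identical HLO.
hlo_a = "fusion.1 { mosaic_fusion_group=%d }" % random.Random(0).randint(0, 1 << 31)
hlo_b = "fusion.1 { mosaic_fusion_group=%d }" % random.Random(1).randint(0, 1 << 31)
assert hlo_fingerprint(hlo_a) != hlo_fingerprint(hlo_b)  # keys diverge -> deadlock

# With a fixed group id, every process computes the same key and the
# autotune results stored by one process are visible to the others.
hlo_fixed = "fusion.1 { mosaic_fusion_group=0 }"
assert hlo_fingerprint(hlo_fixed) == hlo_fingerprint(hlo_fixed)
```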

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@google-cla

google-cla Bot commented Jan 29, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@codecov

codecov Bot commented Jan 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@Steboss Steboss closed this Feb 2, 2026
@Steboss Steboss reopened this Feb 2, 2026
@copybara-service copybara-service Bot merged commit 0a39e70 into AI-Hypercomputer:main Feb 2, 2026
48 of 65 checks passed