
Change mosaic_group_id in case of GPUs/CPUs #3041

Merged

copybara-service[bot] merged 1 commit into AI-Hypercomputer:main from Steboss:main on Feb 2, 2026

Conversation

@Steboss
Contributor

@Steboss Steboss commented Jan 29, 2026

Description

On multi-process GPU/CPU runs, random.randint() generates a different mosaic_fusion_group ID in each process. This causes the HLO fingerprints to diverge across processes. The XLA autotuner then deadlocks because each process waits for autotune results keyed by fingerprints that the other processes do not have.
The fix uses a deterministic group ID, '0', for GPU/CPU, and keeps it random for TPUs.

The parent code originated from this PR
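The shape of the fix can be sketched as follows. This is an illustrative sketch, not the actual MaxText/JAX code: the function name, signature, and id range are hypothetical, and only the branching logic (deterministic id on GPU/CPU, random id on TPU) reflects the change described above.

```python
import random


def choose_mosaic_group_id(platform: str) -> int:
    """Return a mosaic fusion-group id for the given backend platform.

    On GPU/CPU, every process must produce the same id so that the HLO
    fingerprint (and thus the XLA autotuner cache key) is identical
    across processes. On TPU a per-process random id remains acceptable.
    """
    if platform in ("gpu", "cpu"):
        return 0  # deterministic: identical on every process
    return random.randint(0, 2**31 - 1)  # TPU: randomness is fine
```

With this, two GPU processes compiling the same module always agree on the group id, so their HLO fingerprints match and autotune results can be shared.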

Tests

To test the changes, I ran the llama4-17b-16e model with the following configuration:

export VARIANT="llama4-17b-16e"
export TRAIN_STEPS=31
export BATCH_SIZE=1
export SEQ_LENGTH=8192
export REPEAT_LAYER=true
export REMAT_POLICY="full"
export SPARSE_MATMUL=true
export MEM_FRACTION=0.94

# 4 nodes
export ICI_FSDP=1
export DCN_FSDP=4 
export ICI_EP=8
export DCN_EP=1
export DCN_DP=1

export VOCAB_PATH=gs://t5-data/vocabs/cc_all.32000.100extra/sentencepiece.model
export XLA_PYTHON_CLIENT_MEM_FRACTION=${MEM_FRACTION}
export CUDA_DEVICE_MAX_CONNECTIONS=16
export NVTE_FUSED_ATTN=1

RUN_SETTINGS="/opt/maxtext/src/MaxText/configs/base.yml \
    run_name=logdir \
    use_iota_embed=false \
    scan_layers=${REPEAT_LAYER} \
    steps=${TRAIN_STEPS} \
    per_device_batch_size=${BATCH_SIZE} \
    model_name=${VARIANT} \
    remat_policy=${REMAT_POLICY} \
    enable_checkpointing=false \
    logits_dot_in_fp32=false \
    base_output_directory=/maxtext/local_train \
    dataset_type=synthetic \
    attention=cudnn_flash_te \
    max_segments_per_seq=1 \
    max_target_length=${SEQ_LENGTH} \
    sparse_matmul=${SPARSE_MATMUL} \
    megablox=false \
    enable_goodput_recording=false \
    monitor_goodput=false \
    hardware=gpu_multiprocess \
    dcn_fsdp_parallelism=${DCN_FSDP} \
    ici_fsdp_parallelism=${ICI_FSDP} \
    ici_expert_parallelism=${ICI_EP} \
    dcn_expert_parallelism=${DCN_EP} \
    ici_data_parallelism=1 \
    dcn_data_parallelism=${DCN_DP}"

python3 -m MaxText.train $RUN_SETTINGS

Without the mosaic modification, the run above ended in a deadlock: the last lines of the log show each shard waiting for autotune results from other shards, e.g.

I0128 10:33:11.701275 1841390 autotuner.cc:421] Picked best config: {Cublas_fission : goo.gle/debugonly   { algorithm: -1 } duration: 5.94128ms, scratch_bytes: 33554432}
I0128 10:33:11.741443 1841390 autotuner.cc:245] Storing results for autotune_results_0dcef21b52dde45e1dbbaca84549a421_31
I0128 10:33:11.741950 1841390 autotuner.cc:247] Shard 31 stored results at autotune_results_0dcef21b52dde45e1dbbaca84549a421_31
I0128 10:33:11.761985 1841390 autotuner.cc:261] Shard 31: waiting for results from shard 0 / 32 at autotune_results_0dcef21b52dde45e1dbbaca84549a421_0
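The log above shows shard 31 blocking on a key derived from the module fingerprint (the 32-hex-character hash in the store path). The following toy illustration, which is not the real XLA code, shows why per-process random ids break this exchange: if the fingerprint hashes the module text, any per-process difference in that text produces a different key, so no process ever finds the results it is waiting for. The hash choice and HLO snippet here are assumptions for demonstration only.

```python
import hashlib
import random


def hlo_fingerprint(module_text: str) -> str:
    """Toy stand-in for a module fingerprint: hash of the HLO text."""
    return hashlib.md5(module_text.encode()).hexdigest()


# Two processes embed different random group ids in otherwise identical HLO.
hlo_a = "fusion.1 { mosaic_fusion_group=%d }" % random.Random(0).randint(0, 1 << 31)
hlo_b = "fusion.1 { mosaic_fusion_group=%d }" % random.Random(1).randint(0, 1 << 31)
assert hlo_fingerprint(hlo_a) != hlo_fingerprint(hlo_b)  # keys diverge -> deadlock

# With a fixed group id, every process computes the same key and the
# autotune results stored by one process are visible to the others.
hlo_fixed = "fusion.1 { mosaic_fusion_group=0 }"
assert hlo_fingerprint(hlo_fixed) == hlo_fingerprint(hlo_fixed)
```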

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@google-cla

google-cla Bot commented Jan 29, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@codecov

codecov Bot commented Jan 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@Steboss Steboss closed this Feb 2, 2026
@Steboss Steboss reopened this Feb 2, 2026
@copybara-service copybara-service Bot merged commit 0a39e70 into AI-Hypercomputer:main Feb 2, 2026
48 of 65 checks passed