Move TE cross entropy guard to training args #5162

Merged: yaoyu-33 merged 2 commits into NVIDIA:main from yaoyu-33:yuya/allow-te-ce-config on Jun 5, 2026

Conversation

yaoyu-33 (Contributor) commented Jun 4, 2026

Summary

Follow-up to #5115. This keeps the Transformer Engine (TE) cross entropy fusion safety guard in validate_args but removes it from ModelParallelConfig.__post_init__, so programmatic config construction is no longer blocked.

  • Allow ModelParallelConfig(cross_entropy_loss_fusion=True, cross_entropy_fusion_impl='te') to be constructed.
  • Keep the training CLI / args validation assertion for the unsafe combination.
  • Update unit coverage to reflect that split: core config can represent the setting, training args reject it.
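
The split described in the bullets above can be sketched roughly as follows. This is an illustration only, not the Megatron-LM code: SketchParallelConfig and sketch_validate_args are hypothetical stand-ins for ModelParallelConfig and validate_args, and the default values and the assertion message are assumptions.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class SketchParallelConfig:
    """Hypothetical stand-in for ModelParallelConfig; only two fields are modeled."""

    cross_entropy_loss_fusion: bool = False  # assumed default
    cross_entropy_fusion_impl: str = "native"  # assumed default

    def __post_init__(self):
        # No TE guard at this level: the config object is allowed to
        # represent the fusion + 'te' combination.
        pass


def sketch_validate_args(args):
    """Hypothetical stand-in for the training-side argument validation."""
    # The guard lives here, so the training entry point still rejects
    # the unsafe combination.
    assert not (
        args.cross_entropy_loss_fusion and args.cross_entropy_fusion_impl == "te"
    ), "TE cross entropy fusion is rejected by training args validation"
    return args


# Programmatic construction succeeds.
config = SketchParallelConfig(
    cross_entropy_loss_fusion=True, cross_entropy_fusion_impl="te"
)

# The same setting passed through training args validation is rejected.
unsafe_args = SimpleNamespace(
    cross_entropy_loss_fusion=True, cross_entropy_fusion_impl="te"
)
try:
    sketch_validate_args(unsafe_args)
    rejected = False
except AssertionError:
    rejected = True
```

Keeping the check out of __post_init__ lets downstream code build and inspect the config, while the training path remains the single place that enforces the restriction.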

Test Plan

  • UV_CACHE_DIR=/home/yuya/Projects/Megatron-LM/.uv-cache uv run isort --check-only tests/unit_tests/test_model_parallel_config.py
  • UV_CACHE_DIR=/home/yuya/Projects/Megatron-LM/.uv-cache uv run black --check tests/unit_tests/test_model_parallel_config.py megatron/core/model_parallel_config.py
  • PYTHONPYCACHEPREFIX=/home/yuya/Projects/Megatron-LM/.pycache-check /home/yuya/mypython/bin/python -m py_compile megatron/core/model_parallel_config.py tests/unit_tests/test_model_parallel_config.py

Local focused pytest could not complete on this workstation. A direct run is blocked by the local nvidia-resiliency-ext dev version assertion (0.6.0.dev69 compares below the required 0.6.0). Masking NVRx as unavailable gets past that, but the local CUDA/PyTorch driver mismatch then segfaults during import.
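
For context on the version assertion: under PEP 440 ordering, a developmental release such as 0.6.0.dev69 sorts before the final release 0.6.0, so a ">= 0.6.0" check fails for it. The sketch below shows that ordering with a deliberately simplified, hypothetical key function (sketch_version_key) that only handles "X.Y.Z" and "X.Y.Z.devN" strings. It is not the check used by nvidia-resiliency-ext or Megatron-LM; real code would normally rely on a full PEP 440 implementation.

```python
import re


def sketch_version_key(version):
    """Simplified ordering key for 'X.Y.Z' and 'X.Y.Z.devN' strings only."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:\.dev(\d+))?", version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    dev = match.group(2)
    # A dev release sorts before the final release with the same number,
    # so it gets rank 0 and the final release gets rank 1.
    if dev is not None:
        return (release, 0, int(dev))
    return (release, 1, 0)


installed = "0.6.0.dev69"
required = "0.6.0"

# False: the dev build does not satisfy a ">= 0.6.0" requirement.
meets_requirement = sketch_version_key(installed) >= sketch_version_key(required)
```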

copy-pr-bot Bot commented Jun 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
yaoyu-33 force-pushed the yuya/allow-te-ce-config branch from 0446db6 to bf436d6 (June 4, 2026 16:48)
yaoyu-33 marked this pull request as ready for review (June 4, 2026 16:53)
yaoyu-33 requested review from a team as code owners (June 4, 2026 16:53)
svcnvidia-nemo-ci added the "Final Review" and "complexity: low" labels (Jun 4, 2026)
svcnvidia-nemo-ci added the "Approved" label and removed the "Final Review" label (Jun 4, 2026)
cuichenx added the "Final Review" and "Run MBridge tests" labels and removed the "Approved" label (Jun 4, 2026)
yaoyu-33 added the "Approved" label and removed the "Final Review" label (Jun 4, 2026)
yaoyu-33 enabled auto-merge (June 4, 2026 17:42)

yaoyu-33 commented Jun 4, 2026

/ok to test bf436d6

ko3n1g added the "core_r0.18.0" label (Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.) (Jun 4, 2026)
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

ko3n1g commented Jun 5, 2026

/ok to test 288795f

svcnvidia-nemo-ci commented

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/27016083461

Merged via the queue into NVIDIA:main with commit b574499 Jun 5, 2026
170 of 173 checks passed
yaoyu-33 deleted the yuya/allow-te-ce-config branch (June 5, 2026 13:40)

Labels

  • Approved: All necessary approvals have been made
  • complexity: low
  • core_r0.18.0: Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.
  • Run MBridge tests: Attach this for testing this PR against MBridge main
  • Run tests

5 participants