Build Triton from source and bump Primus-Turbo #694
Conversation
There was a problem hiding this comment.
Pull request overview
Builds Triton from a pinned source commit during Docker image build and updates workflow-pinned Primus-Turbo commit references.
Changes:
- Add
TRITON_COMMITbuild-arg and build/install Triton fromtriton-lang/triton(release/3.7.x@ pinned commit) in the Dockerfile. - Wire
TRITON_COMMITthrough.github/workflows/ci.yamlso both PyTorch and JAX image builds receive it. - Bump
PRIMUS_TURBO_COMMITin CI and benchmark workflows.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| .github/workflows/docker/Dockerfile | Adds Triton build-from-source step parameterized by TRITON_COMMIT. |
| .github/workflows/ci.yaml | Passes TRITON_COMMIT into Docker builds; bumps PRIMUS_TURBO_COMMIT. |
| .github/workflows/benchmark.yaml | Bumps PRIMUS_TURBO_COMMIT used by benchmark workflow. |
| PRIMUS_TURBO_COMMIT: ecb7e5cad1f5f86e5719f4afbe5e644b702d0aa9 | ||
| PRIMUS_TURBO_AITER_COMMIT: 857f4d15775a29af153a2c68a2f8e8a8d696c986 |
There was a problem hiding this comment.
PR description mentions bumping PRIMUS_TURBO_COMMIT, but this change also updates PRIMUS_TURBO_AITER_COMMIT. Please either update the PR description to include the AITER bump (and why) or revert this env change if it’s unintended.
| submodules: "recursive" | ||
| path: Primus-Turbo | ||
| ref: ${{ env.PRIMUS_TURBO_COMMIT }} | ||
| - name: Init Primus-Turbo submodules (full depth) |
There was a problem hiding this comment.
The step name says “(full depth)”, but actions/checkout here uses the default shallow clone and git submodule update is not given any depth options. Either rename the step to avoid implying full history, or set fetch-depth: 0 (and, if needed, configure submodule depth explicitly) to match the intent.
| - name: Init Primus-Turbo submodules (full depth) | |
| - name: Init Primus-Turbo submodules |
| - name: Init Primus-Turbo submodules (full clone) | ||
| working-directory: Primus-Turbo | ||
| run: | | ||
| rm -rf 3rdparty/composable_kernel .git/modules/3rdparty/composable_kernel | ||
| git submodule sync --recursive | ||
| git submodule update --init --recursive |
There was a problem hiding this comment.
This job now runs git submodule commands in the checked-out Primus-Turbo directory, but safe.directory is only configured later after moving the repo to /tmp. On runners where the workspace ownership differs (common on self-hosted/containerized runners), git submodule ... can fail with "dubious ownership". Configure safe.directory (or run the submodule init/update) after moving to /tmp/Primus-Turbo where safe.directory is already set.
Build Triton from triton-lang/triton release/3.7.x branch in the Docker image
Bump PRIMUS_TURBO_COMMIT to ecb7e5c in ci.yaml and benchmark.yaml,
and pass TRITON_COMMIT=88b227e to the main and jax docker builds so
the Dockerfile's `git checkout ${TRITON_COMMIT}` step has a value.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Primus-Turbo ecb7e5c pins AITER_COMMIT to 857f4d1 in setup.py and calls aiter._flash_attn_forward() with an `out=` kwarg only the new aiter accepts. Bumping the CI pin matches Turbo's expectation.
actions/checkout@v4 fetches submodules with --depth=1, which fails when Primus-Turbo's pinned composable_kernel SHA is not at the tip of CK's default branch. Drop submodules: recursive from the action and run `git submodule update --init --recursive` ourselves.
The self-hosted JAX runner caches Primus-Turbo across runs, and a prior --depth=1 submodule fetch leaves a shallow CK clone in .git/modules/3rdparty/composable_kernel/. The new Turbo pin (78ae3835) lives on a non-default branch, so the cached shallow clone cannot resolve it and `git submodule update` reports "Unable to find current revision". Remove the cached submodule and its modules-dir before re-init so git performs a full fetch.
Bump Primus-Turbo to current main (ef5b58e) and set PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 in run-unittest-torch to work around the gfx942 sbhd backward guard added in #275 that rejects bf16+sbhd when is_v3_atomic_fp32=False. Drop the override once the upstream loosen_atomic_fp16_constraint fix lands in Primus-Turbo main.
822a1d7 to
23744d2
Compare
23744d2 to
85a2a24
Compare
Stock torchtitan qwen3 Attention transposes q/k/v from BSHD to BHSD
before calling self.inner_attention (PyTorch SDPA contract). When
PrimusTubroConverter swaps inner_attention for TurboAttention /
flash_attn_func, those callees expect q/k/v with logical shape
[B, S, H, D] (format encoded as strides — see
primus_turbo.pytorch.ops.attention.attention_utils._infer_qkv_format).
The BHSD-shape input mismatches the BSHD-shape contract: aiter
mis-interprets the seq/heads axes (computing attention along the wrong
axis) and the registered fake / real op shapes diverge under
torch.compile, surfacing as an inductor stride assertion on
attention_aiter_forward_impl for qwen3 (test_qwen3_0_6B fails with
"expected size 16==16, stride 128==524288 at dim=1").
Mirror the LLaMA3 / LLaMA4 / DeepSeek-V3 Primus overrides: keep q/k/v
in BSHD (drop the BHSD transpose) and let TurboAttention handle GQA
natively (no repeat_kv needed). Patch qwen3.model.model.Attention via
the existing torchtitan.primus_turbo.turbo_attention setup hook.
Verified locally:
- tests/trainer/test_torchtitan_trainer.py 12/12 pass (incl.
test_qwen3_0_6B / test_qwen3_1_7B / test_qwen3_32B)
- tests/trainer/test_megatron_trainer.py 17/17 pass
85a2a24 to
4289676
Compare
Summary
release/3.7.x@88b227e) inside the docker imageTRITON_COMMITbuild-arg throughci.yamlso both the main and JAX docker builds receive itPRIMUS_TURBO_COMMITtoecb7e5cinci.yamlandbenchmark.yaml