Build Triton from source and bump Primus-Turbo #694

Merged
wenxie-amd merged 8 commits into main from dev/kyle_update_triton
May 7, 2026

@kyle-256

Summary

  • Compile Triton from source (release/3.7.x @ 88b227e) inside the docker image
  • Wire a new TRITON_COMMIT build-arg through ci.yaml so both the main and JAX docker builds receive it
  • Bump PRIMUS_TURBO_COMMIT to ecb7e5c in ci.yaml and benchmark.yaml

Copilot AI review requested due to automatic review settings April 28, 2026 05:37

Copilot AI left a comment

Pull request overview

Builds Triton from a pinned source commit during Docker image build and updates workflow-pinned Primus-Turbo commit references.

Changes:

  • Add TRITON_COMMIT build-arg and build/install Triton from triton-lang/triton (release/3.7.x @ pinned commit) in the Dockerfile.
  • Wire TRITON_COMMIT through .github/workflows/ci.yaml so both PyTorch and JAX image builds receive it.
  • Bump PRIMUS_TURBO_COMMIT in CI and benchmark workflows.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| .github/workflows/docker/Dockerfile | Adds Triton build-from-source step parameterized by TRITON_COMMIT. |
| .github/workflows/ci.yaml | Passes TRITON_COMMIT into Docker builds; bumps PRIMUS_TURBO_COMMIT. |
| .github/workflows/benchmark.yaml | Bumps PRIMUS_TURBO_COMMIT used by benchmark workflow. |

Comment thread .github/workflows/docker/Dockerfile
Comment thread .github/workflows/docker/Dockerfile
wenxie-amd previously approved these changes Apr 28, 2026
Copilot AI review requested due to automatic review settings April 28, 2026 07:35

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread .github/workflows/ci.yaml Outdated
Comment on lines +17 to +18
PRIMUS_TURBO_COMMIT: ecb7e5cad1f5f86e5719f4afbe5e644b702d0aa9
PRIMUS_TURBO_AITER_COMMIT: 857f4d15775a29af153a2c68a2f8e8a8d696c986

Copilot AI Apr 28, 2026

PR description mentions bumping PRIMUS_TURBO_COMMIT, but this change also updates PRIMUS_TURBO_AITER_COMMIT. Please either update the PR description to include the AITER bump (and why) or revert this env change if it’s unintended.

Comment thread .github/workflows/ci.yaml Outdated
    submodules: "recursive"
    path: Primus-Turbo
    ref: ${{ env.PRIMUS_TURBO_COMMIT }}
- name: Init Primus-Turbo submodules (full depth)

Copilot AI Apr 28, 2026

The step name says “(full depth)”, but actions/checkout here uses the default shallow clone and git submodule update is not given any depth options. Either rename the step to avoid implying full history, or set fetch-depth: 0 (and, if needed, configure submodule depth explicitly) to match the intent.

Suggested change
-  - name: Init Primus-Turbo submodules (full depth)
+  - name: Init Primus-Turbo submodules

Copilot AI review requested due to automatic review settings April 30, 2026 00:57

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread .github/workflows/docker/Dockerfile
Comment thread .github/workflows/ci.yaml
Comment on lines +341 to +346
- name: Init Primus-Turbo submodules (full clone)
  working-directory: Primus-Turbo
  run: |
    rm -rf 3rdparty/composable_kernel .git/modules/3rdparty/composable_kernel
    git submodule sync --recursive
    git submodule update --init --recursive

Copilot AI Apr 30, 2026

This job now runs git submodule commands in the checked-out Primus-Turbo directory, but safe.directory is only configured later after moving the repo to /tmp. On runners where the workspace ownership differs (common on self-hosted/containerized runners), git submodule ... can fail with "dubious ownership". Configure safe.directory (or run the submodule init/update) after moving to /tmp/Primus-Turbo where safe.directory is already set.

kyle-256 and others added 6 commits May 6, 2026 05:44
Build Triton from triton-lang/triton release/3.7.x branch in the Docker image

Bump PRIMUS_TURBO_COMMIT to ecb7e5c in ci.yaml and benchmark.yaml,
and pass TRITON_COMMIT=88b227e to the main and jax docker builds so
the Dockerfile's `git checkout ${TRITON_COMMIT}` step has a value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Primus-Turbo ecb7e5c pins AITER_COMMIT to 857f4d1 in setup.py and
calls aiter._flash_attn_forward() with an `out=` kwarg only the
new aiter accepts. Bumping the CI pin matches Turbo's expectation.
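As a rough illustration of the pin alignment described in this commit message, the abbreviated SHAs quoted in the commit messages can be checked against the full SHAs quoted from ci.yaml earlier in the review. This is only a sketch: `matches_pin` is a hypothetical helper, not part of the Primus CI, and the pins are the values that were visible at review time.

```python
def matches_pin(short_sha, full_sha):
    """Return True if an abbreviated commit SHA is a prefix of a full 40-character pin."""
    short_sha, full_sha = short_sha.lower(), full_sha.lower()
    return len(full_sha) == 40 and len(short_sha) >= 7 and full_sha.startswith(short_sha)


# Full SHAs as quoted from the ci.yaml env block in the review comment.
ci_pins = {
    "PRIMUS_TURBO_COMMIT": "ecb7e5cad1f5f86e5719f4afbe5e644b702d0aa9",
    "PRIMUS_TURBO_AITER_COMMIT": "857f4d15775a29af153a2c68a2f8e8a8d696c986",
}

# Abbreviated SHAs as quoted in the commit messages.
expected = {
    "PRIMUS_TURBO_COMMIT": "ecb7e5c",
    "PRIMUS_TURBO_AITER_COMMIT": "857f4d1",
}

for name, short in expected.items():
    print(name, matches_pin(short, ci_pins[name]))
```

A check like this only confirms that the abbreviation and the pin agree textually; it does not confirm that the commit exists upstream.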

actions/checkout@v4 fetches submodules with --depth=1, which fails
when Primus-Turbo's pinned composable_kernel SHA is not at the tip
of CK's default branch. Drop submodules: recursive from the action
and run `git submodule update --init --recursive` ourselves.

The self-hosted JAX runner caches Primus-Turbo across runs, and a prior
--depth=1 submodule fetch leaves a shallow CK clone in
.git/modules/3rdparty/composable_kernel/. The new Turbo pin
(78ae3835) lives on a non-default branch, so the cached shallow
clone cannot resolve it and `git submodule update` reports
"Unable to find current revision".

Remove the cached submodule and its modules-dir before re-init so
git performs a full fetch.
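The cleanup-then-reinit sequence from the two commit messages above can be sketched in Python as follows. The git commands mirror the workflow step quoted in the review; the `reinit_submodule` helper and its `run` parameter are hypothetical names introduced here for illustration (the actual CI step is a shell `run:` block).

```python
import shutil
import subprocess
from pathlib import Path


def reinit_submodule(repo, submodule, run=subprocess.run):
    """Drop a cached submodule checkout and its git dir, then re-fetch it in full."""
    repo = Path(repo)

    # Remove both the working tree and the cached clone under .git/modules,
    # so a stale shallow clone cannot be reused by `git submodule update`.
    for stale in (repo / submodule, repo / ".git" / "modules" / submodule):
        shutil.rmtree(stale, ignore_errors=True)

    commands = [
        ["git", "submodule", "sync", "--recursive"],
        ["git", "submodule", "update", "--init", "--recursive"],
    ]
    for cmd in commands:
        # check=True makes a failing git command raise instead of passing silently.
        run(cmd, cwd=repo, check=True)
    return commands
```

A call such as `reinit_submodule("Primus-Turbo", "3rdparty/composable_kernel")` would correspond to the step in the workflow; passing a stub as `run` allows the directory cleanup to be exercised without invoking git.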

Bump Primus-Turbo to current main (ef5b58e) and set
PRIMUS_TURBO_ATTN_V3_ATOMIC_FP32=1 in run-unittest-torch to work around
the gfx942 sbhd backward guard added in #275 that rejects bf16+sbhd
when is_v3_atomic_fp32=False. Drop the override once the upstream
loosen_atomic_fp16_constraint fix lands in Primus-Turbo main.
Copilot AI review requested due to automatic review settings May 6, 2026 05:46
kyle-256 force-pushed the dev/kyle_update_triton branch from 822a1d7 to 23744d2 on May 6, 2026 05:46

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread .github/workflows/docker/Dockerfile
Comment thread .github/workflows/docker/Dockerfile
kyle-256 force-pushed the dev/kyle_update_triton branch from 23744d2 to 85a2a24 on May 6, 2026 11:28
Stock torchtitan qwen3 Attention transposes q/k/v from BSHD to BHSD
before calling self.inner_attention (PyTorch SDPA contract). When
PrimusTubroConverter swaps inner_attention for TurboAttention /
flash_attn_func, those callees expect q/k/v with logical shape
[B, S, H, D] (format encoded as strides — see
primus_turbo.pytorch.ops.attention.attention_utils._infer_qkv_format).
The BHSD-shape input mismatches the BSHD-shape contract: aiter
mis-interprets the seq/heads axes (computing attention along the wrong
axis) and the registered fake / real op shapes diverge under
torch.compile, surfacing as an inductor stride assertion on
attention_aiter_forward_impl for qwen3 (test_qwen3_0_6B fails with
"expected size 16==16, stride 128==524288 at dim=1").

Mirror the LLaMA3 / LLaMA4 / DeepSeek-V3 Primus overrides: keep q/k/v
in BSHD (drop the BHSD transpose) and let TurboAttention handle GQA
natively (no repeat_kv needed). Patch qwen3.model.model.Attention via
the existing torchtitan.primus_turbo.turbo_attention setup hook.

Verified locally:
  - tests/trainer/test_torchtitan_trainer.py 12/12 pass (incl.
    test_qwen3_0_6B / test_qwen3_1_7B / test_qwen3_32B)
  - tests/trainer/test_megatron_trainer.py 17/17 pass
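The stride numbers in the assertion quoted above are consistent with a BSHD tensor that was transposed into a BHSD view being compared against a contiguous BHSD tensor. The pure-Python sketch below reproduces that arithmetic without torch. The sizes S=4096 and D=128 are assumptions chosen to match the quoted message (H=16 comes from the message itself), and the sketch does not establish which side of the assertion is the fake op and which is the real one.

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) strides, in elements, for the given shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def transpose(shape, strides, dim0, dim1):
    """Swap two dimensions of a view without moving data, as a strided transpose does."""
    shape, strides = list(shape), list(strides)
    shape[dim0], shape[dim1] = shape[dim1], shape[dim0]
    strides[dim0], strides[dim1] = strides[dim1], strides[dim0]
    return tuple(shape), tuple(strides)


# Assumed sizes: batch, sequence length, heads, head dim.
B, S, H, D = 1, 4096, 16, 128

bshd_shape = (B, S, H, D)
bshd_strides = contiguous_strides(bshd_shape)

# BSHD -> BHSD by swapping dims 1 and 2: same memory, permuted strides.
view_shape, view_strides = transpose(bshd_shape, bshd_strides, 1, 2)

# A freshly allocated, contiguous BHSD tensor of the same logical shape.
dense_strides = contiguous_strides(view_shape)

print("transposed view:", view_shape, view_strides)
print("contiguous BHSD:", view_shape, dense_strides)
print("dim=1 size:", view_shape[1], "strides:", view_strides[1], "vs", dense_strides[1])
```

Under these assumed sizes both layouts have size 16 at dim=1, but the transposed view has stride 128 there while the contiguous BHSD tensor has stride 524288, the same pair of values reported in the inductor assertion.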
Copilot AI review requested due to automatic review settings May 7, 2026 03:41
kyle-256 force-pushed the dev/kyle_update_triton branch from 85a2a24 to 4289676 on May 7, 2026 03:41

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 7 changed files in this pull request and generated 4 comments.

Comment thread .github/workflows/ci.yaml
Comment thread .github/workflows/ci.yaml
Comment thread primus/backends/torchtitan/patches/turbo/attention_patches.py
Comment thread .github/workflows/benchmark.yaml
wenxie-amd merged commit 9a52f9a into main on May 7, 2026
6 checks passed
3 participants