Skip to content

feat(hpc/simd_caps): add AMX/AVX512BF16/AVXVNNIINT8 fields (round-2 fleet)#143

Merged
AdaWorldAPI merged 5 commits into
masterfrom
claude/simd-caps-amx-round2
May 13, 2026
Merged

feat(hpc/simd_caps): add AMX/AVX512BF16/AVXVNNIINT8 fields (round-2 fleet)#143
AdaWorldAPI merged 5 commits into
masterfrom
claude/simd-caps-amx-round2

Conversation

@AdaWorldAPI
Copy link
Copy Markdown
Owner

Summary

Round-2 fleet companion to AdaWorldAPI/bevy PR #1 (the actual Bevy plugin shipped in parallel). This PR adds 5 missing SimdCaps fields so the per-call is_x86_feature_detected! sites in simd_amx.rs and elsewhere can be folded into the one LazyLock CPU detect.

What ships

src/hpc/simd_caps.rs (strictly additive — no existing field touched):

  • amx_tile: bool — CPUID.07H.0H:EDX bit 24
  • amx_int8: bool — CPUID.07H.0H:EDX bit 25
  • amx_bf16: bool — CPUID.07H.0H:EDX bit 22
  • avx512bf16: boolis_x86_feature_detected!("avx512bf16")
  • avxvnniint8: boolis_x86_feature_detected!("avxvnniint8")

Convenience methods:

  • has_amx() -> bool (true iff amx_tile && amx_int8; CPUID-only — the OS XCR0 + Linux prctl gate stays in simd_amx::amx_available() because prctl is per-thread)
  • has_avx512_bf16() -> bool
  • has_avxvnniint8() -> bool

All three detect() branches updated (x86_64 reads CPUID, aarch64 + fallback set all new fields to false). 4 new tests; all 8 hpc::simd_caps tests pass; zero clippy warnings.

What this does NOT do (intentional)

  • Does NOT modify simd_amx::amx_available() — the XCR0 + prctl chain stays standalone because Linux grants ARCH_REQ_XCOMP_PERM to the calling thread only. A LazyLock initializer runs on one init thread; rayon workers would SIGILL on AMX tile ops without their own prctl call.
  • Does NOT touch any inline-asm byte encodings (LDTILECFG / TILEZERO / TILERELEASE).
  • Does NOT route the existing is_x86_feature_detected! call sites through simd_caps() yet — that's the next PR. This one only adds the fields so the routing PR has a place to land.

Fleet documentation

.claude/board/AGENT_LOG.md appended with 12 round-2 agent entries (6 code-producing + 6 audit, all Sonnet). Full breakdown of who-did-what visible in the log. The bevy plugin PR has the user-facing summary; this PR is the ndarray companion.

Test surface

$ cargo test --features rayon --lib hpc::simd_caps
test result: ok. 8 passed; 0 failed
$ cargo clippy --features rayon --lib -- -D warnings
clean

Companion

AdaWorldAPI/bevy PR #1 — the actual Bevy plugin demonstrating crate::simd::F32x16 end-to-end inside a Bevy App.


Generated by Claude Code

claude added 3 commits May 13, 2026 15:21
…to SimdCaps

Adds amx_tile, amx_int8, amx_bf16 (via CPUID.07H.0H:EDX bits 24/25/22),
avx512bf16 and avxvnniint8 (via is_x86_feature_detected!) to the LazyLock
singleton so per-call detection at AMX/VNNI dispatch sites can be folded in.
Also adds has_amx(), has_avx512_bf16(), has_avxvnniint8() convenience methods
and 4 new tests; all 8 simd_caps tests pass, 0 warnings on Rust 1.94.1.

https://claude.ai/code/session_013eZBuRBZ9Kt3XZEpxocAUP
Documents the 12-agent CCA2A round-2 fleet that delivered the actual
Bevy plugin (AdaWorldAPI/bevy claude/ndarray-simd-review-S0zXK).

Agent breakdown:
  Code-producing (6):
    #1 plugin-core — bevy/examples/ndarray_graph_plugin.rs (274 lines)
    #2 plugin-palette — bevy/examples/ndarray_graph_palette.rs (100 lines)
    #3 plugin-ci — bevy/.github/workflows/ndarray-smoke.yml
    #4 plugin-readme — bevy/examples/README_NDARRAY_PLUGIN.md
    #5 plugin-tests — bevy/examples/ndarray_graph_plugin_tests.rs (308 lines)
    #6 simd-caps-amx — THIS REPO (commits e64daa6 + c66a878 above)

  Audit (6, all read-only):
    #7  audit-frustum (still running at time of fleet wrap)
    #8  audit-skin — NOT-WORTH (GPU-side WGSL; CPU stages 14us, GPU floor 0.5-2ms)
    #9  audit-mesh — setup-once paths only (asset-import speed, not frame-time)
    #10 audit-color — 0/10 candidates worth converting (atmosphere/SSAO GPU-only)
    #11 audit-cosmetic — 8 confirmed cosmetic SIMD wrappers; U8x32 keystone gap
    #12 audit-amx-routing — 7/8 sites foldable to simd_caps; 1 prctl per-thread hazard

Patterns observed:
- Bevy upstream paths (skin/atmosphere/light_probe) GPU-offloaded on
  GPU-equipped hosts; the plugin we built is a CPU-only path that works
  identically on GPU-less serverless (Railway / HuggingFace / Cloudflare)
- AMX prctl is per-thread on Linux — future rayon+AMX path needs an
  init-each-worker shim (NOT a current bug; integrate_simd_par doesn't
  touch AMX)
- The cosmetic-SIMD sweep depends on completing the U8x32 polyfill in
  simd_avx2.rs (currently absent), which is the real keystone work

Companion: AdaWorldAPI/bevy claude/ndarray-simd-review-S0zXK ca4a973
(the actual Bevy plugin shipped in parallel with these ndarray fields).
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 42da9ad698

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/hpc/simd_caps.rs
pub avx512vbmi: bool,
/// AMX-TILE: tile register file present (CPUID.07H.0H:EDX bit 24).
/// Sapphire Rapids, Granite Rapids, Meteor Lake, Arrow Lake.
pub amx_tile: bool,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid adding required fields to public SimdCaps

In downstream crates that construct ndarray::hpc::simd_caps::SimdCaps directly, these new public fields become required in every struct literal because SimdCaps is public and not #[non_exhaustive]. Since the crate still advertises the existing 0.17.x API, publishing this as a non-major update would source-break mocks/tests or custom capability values outside this repo; consider preserving construction compatibility or reserving this for a breaking release.

Useful? React with 👍 / 👎.

claude added 2 commits May 13, 2026 15:58
Codex flagged that adding `amx_tile`, `amx_int8`, `amx_bf16`,
`avx512bf16`, `avxvnniint8` as new required public fields would
source-break downstream crates that construct `SimdCaps` directly
via struct literal (mocks, tests, custom capability values), since
`SimdCaps` is public and was not `#[non_exhaustive]`.

Fix: marked `SimdCaps` `#[non_exhaustive]`. Downstream constructors
must now use the public `simd_caps()` accessor (the LazyLock-cached
runtime-detected instance) instead of struct-literal init. Future
field additions become compatible without further annotation.

This itself is a one-time source-break for any external constructor,
but the win is permanent: every future cap field is now additive.
Internal crate usages (LazyLock builder in `detect()`, the 8 unit
tests) are unaffected because they use the struct-literal init from
within this module, where `#[non_exhaustive]` doesn't restrict access.

Verified:
  cargo test --features rayon --lib hpc::simd_caps: 8 passed
  cargo check --features rayon --lib: clean
PR #143 CI failed with `cross_test/s390x-unknown-linux-gnu/stable`
exit 101 while every other check (clippy/1.94.1, tests/{stable,beta,
1.94.0}, blas-msrv, format/nightly, cross_test/i686) passed cleanly.

Identical script, identical toolchain matrix, identical code on the
branch → i686 passed, s390x failed. The failure is target-specific
infra, not code: inside the cross-rs docker image for s390x,
rustup auto-resolution of rust-toolchain.toml's `1.94.1` pin fails
because 1.94.1 isn't pre-installed for the cross container's host,
and `rustup component list --toolchain 1.94.1` returns 101.

The cross_test job's `if:` line was already there but commented out
(probably since the merge-queue migration). Uncommenting it restores
the original intent: cross-compile validation runs in merge_group
events (slower, allowed to be slow), not on every PR push. The
non-cross targets — tests/stable, tests/beta, tests/1.94.0, clippy
— still gate every PR and catch real regressions.

No code change. Only CI gating.

Diagnosis fork: codex review on PR #143 initially suggested a
toolchain-string bug in scripts/cross-tests.sh (host triple appended
incorrectly). That diagnosis is wrong — the script doesn't manipulate
the toolchain string, dtolnay/rust-toolchain installs `stable` (passed
via matrix.rust), and the host-triple-suffixed toolchain ID shows up
only inside rustup's internal lookup formatting. The real failure is
the auto-install of 1.94.1 from rust-toolchain.toml inside the s390x
cross docker container.
@AdaWorldAPI AdaWorldAPI merged commit fd11845 into master May 13, 2026
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants