You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
ndarray ships multiple ISA paths through src/simd*.rs and the HPC kernels
under src/hpc/ (e.g. amx_matmul.rs, bf16_tile_gemm.rs, the AVX-512
modules gated in PR #134, AVX2 fallbacks, NEON paths, and the scalar
polyfill). The runtime contract is: one binary, all ISAs, dispatched
at run time by LazyLock in src/simd.rs after CPUID probing.
Today's CI (.github/workflows/ci.yaml @ c779c5b) only exercises a small
slice of that surface:
cross_test for i686-unknown-linux-gnu and s390x-unknown-linux-gnu
(cross but no SIMD coverage)
nostd for thumbv6m-none-eabi
No NEON lane, no AMX lane, no AVX-512 execution lane, no polyfill-only
build, no cross-ISA bit-exactness check, no link-time hygiene check.
Consequence: hardware-specific bugs land silently. PR #134 is the canonical
recent example — AVX-512 raw intrinsics SIGILL'd on ubuntu-latest because
no lane actually executed AVX-512 code, while a Cortex-A53 (Pi Zero 2 W)
build that accidentally pulls in any x86 AMX symbol would only be caught
post-deploy.
What
Extend .github/workflows/ci.yaml with a simd_matrix job that fans out
across ISA lanes, then a cross_isa_parity job that drives identical
inputs through every available lane and asserts bit-exact (or
within-tolerance, per D1's documented band) outputs, then a link_hygiene job that builds for aarch64-unknown-linux-gnu with all
x86 features off and verifies the resulting artifact contains no x86
AMX/AVX symbols (and the symmetric check for x86 builds).
Concrete matrix entries:
Lane
Runner
How
AVX2
ubuntu-latest
stock; native execution
NEON
ubuntu-24.04-arm
GitHub-hosted aarch64 (free for public repos as of 2025); native execution
AMX (SPR)
ubuntu-latest + QEMU
qemu-system-x86_64 -cpu sapphirerapids (qemu >= 7.x exposes AMX bits); user-mode qemu-x86_64 -cpu Sapphire-Rapids for cargo test --target=x86_64-unknown-linux-gnu
AVX-512
ubuntu-latest-large (paid larger runner exposes Skylake-X / Ice Lake server class) OR self-hosted; alternatively qemu-x86_64 -cpu Skylake-Server-AVX512 user-mode
confirm runner availability before commit
Polyfill
any runner
cargo test --no-default-features --features portable-atomic-critical-section — forces scalar path
Architecture
Runtime dispatch invariant.src/simd.rs exposes Tier via LazyLock populated from CPUID. All callers (e.g. simd_avx2.rs, simd_avx512.rs, AMX byte-call shim in src/hpc/amx_matmul.rs, src/hpc/bf16_tile_gemm.rs) MUST be reached through that dispatch — never
via static target_feature on a public function. CI lanes therefore
DO NOT compile with -C target-cpu=... or feature-flag-gated AMX as a
hard dep. They run the same binary, and the per-lane runner's CPUID
selects which path executes.
Feature-flag hygiene. AMX is implemented as a byte-call hack because
stable Rust does not expose _tile_dpbf16ps / _tile_dpbusd. The shim
must compile away to nothing on non-x86 targets:
cargo build --target=aarch64-unknown-linux-gnu --no-default-features --features std must succeed.
nm -D (or nm for static archives) on the resulting artifact must
produce zero matches for _tile_*, _mm512_*, _mm256_*, or any __intel_* symbol. CI greps for these and fails on any hit.
The symmetric check on x86 builds: an aarch64-targeted NEON intrinsic
symbol (e.g. vld1q_*, vmull_*) leaking into an x86_64 --no-default-features artifact must also fail CI.
Cross-ISA bit-exactness. A single numeric-tests harness fixture
(crates/numeric-tests/) generates deterministic input tensors (seeded StdRng), runs the public ndarray ops (dot, matmul, bf16_gemm,
softmax, the convolution kernels touched in src/hpc/), serializes the
output bit-pattern (f32::to_bits / f64::to_bits per element), and
each CI lane uploads its output as an artifact. A final parity job
diffs all artifacts. Within-tolerance bands (ULP / relative-error)
are owned by D1's reference baseline; this job consumes the band.
QEMU AMX.qemu-user-static with qemu-x86_64 -cpu Sapphire-Rapids
runs unprivileged — no kvm, no nested virt. AMX_TILE / AMX_INT8 /
AMX_BF16 CPUID bits are exposed; _tile_loadconfig syscalls succeed.
Tile register state is preserved across user-mode emulation. Expect
~50-100x slower than native, so AMX lane runs only the AMX-specific
test groups (#[cfg(test)] mod amx_* in src/hpc/amx_matmul.rs and bf16_tile_gemm.rs), not the full cargo test --lib.
Acceptance criteria
New simd_matrix job in .github/workflows/ci.yaml with the five
lanes above; each lane builds and runs its scoped tests.
ubuntu-24.04-arm lane runs cargo test --lib -p ndarray and the
NEON-tagged subset green.
AMX lane installs qemu-user-static, sets CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER="qemu-x86_64 -cpu Sapphire-Rapids", and runs the AMX test modules green.
AVX-512 lane decision recorded in the workflow comments: either ubuntu-latest-large, or QEMU -cpu Skylake-Server-AVX512, or
self-hosted; one of these must run the modules unblocked by PR fix(simd): gate simd_avx512 tests behind target_feature = avx512f #134
(bf16_tests, f16_tests, u8x64_rasterizer_tests, tier3_tests, int_simd_tests).
Polyfill lane runs with --no-default-features --features portable-atomic-critical-section and exercises simd_ops tests on
the scalar fallback.
cross_isa_parity job consumes per-lane output artifacts and diffs
against D1's reference baseline within D1's tolerance band; a
single failure fails the job with a precise lane × element ×
observed-vs-expected diff.
link_hygiene job:
- aarch64 build → nm shows no _tile_*, no _mm[0-9]*_*, no __intel_*.
- x86_64 polyfill build → nm shows no vld1q_*, no vmull_*,
no vqadd_* aarch64 NEON symbols.
conclusion job's needs: list is updated to include the new jobs
so a regression on any lane gates merge.
Picking the numerical tolerance band — that's D1's deliverable; this
issue consumes it.
Adding new SIMD kernels or fixing existing kernel correctness bugs —
this issue is pure CI scaffolding around the existing surface.
macOS / Windows runners — Linux-only for first pass; cross-OS parity
is a separate follow-up.
Benchmark-perf gates — correctness only; perf regressions are tracked
elsewhere.
BLAS-backend matrices (openblas, intel-mkl, native) — orthogonal;
the existing native-backend and blas-msrv jobs already cover that
axis.
Dependencies
D1 (reference baseline + tolerance band) — this issue consumes the
reference outputs and the documented ULP/relative-error tolerance D1
produces. CI lane work can begin in parallel: the matrix scaffolding,
QEMU plumbing, nm-hygiene check, and ubuntu-24.04-arm lane all
land independently of the parity assertion. The cross_isa_parity
job is the only step that requires D1 to have published the baseline
artifact format.
Touches .github/workflows/ci.yaml only; no source changes required
for the CI scaffolding itself (the nm-hygiene script lives under scripts/ per the existing pattern: scripts/all-tests.sh, scripts/cross-tests.sh, scripts/blas-integ-tests.sh, scripts/miri-tests.sh).
Why
ndarray ships multiple ISA paths through
src/simd*.rsand the HPC kernelsunder
src/hpc/(e.g.amx_matmul.rs,bf16_tile_gemm.rs, the AVX-512modules gated in PR #134, AVX2 fallbacks, NEON paths, and the scalar
polyfill). The runtime contract is: one binary, all ISAs, dispatched
at run time by
LazyLockinsrc/simd.rsafter CPUID probing.Today's CI (
.github/workflows/ci.yaml@ c779c5b) only exercises a smallslice of that surface:
ubuntu-latest(x86-64, has AVX2; AVX-512 absent — see PR fix(simd): gate simd_avx512 tests behind target_feature = avx512f #134 SIGILL story)cross_testfori686-unknown-linux-gnuands390x-unknown-linux-gnu(cross but no SIMD coverage)
nostdforthumbv6m-none-eabibuild, no cross-ISA bit-exactness check, no link-time hygiene check.
Consequence: hardware-specific bugs land silently. PR #134 is the canonical
recent example — AVX-512 raw intrinsics SIGILL'd on
ubuntu-latestbecauseno lane actually executed AVX-512 code, while a Cortex-A53 (Pi Zero 2 W)
build that accidentally pulls in any x86 AMX symbol would only be caught
post-deploy.
What
Extend
.github/workflows/ci.yamlwith asimd_matrixjob that fans outacross ISA lanes, then a
cross_isa_parityjob that drives identicalinputs through every available lane and asserts bit-exact (or
within-tolerance, per D1's documented band) outputs, then a
link_hygienejob that builds foraarch64-unknown-linux-gnuwith allx86 features off and verifies the resulting artifact contains no x86
AMX/AVX symbols (and the symmetric check for x86 builds).
Concrete matrix entries:
ubuntu-latestubuntu-24.04-armubuntu-latest+ QEMUqemu-system-x86_64 -cpu sapphirerapids(qemu >= 7.x exposes AMX bits); user-modeqemu-x86_64 -cpu Sapphire-Rapidsforcargo test --target=x86_64-unknown-linux-gnuubuntu-latest-large(paid larger runner exposes Skylake-X / Ice Lake server class) OR self-hosted; alternativelyqemu-x86_64 -cpu Skylake-Server-AVX512user-modecargo test --no-default-features --features portable-atomic-critical-section— forces scalar pathArchitecture
Runtime dispatch invariant.
src/simd.rsexposesTierviaLazyLockpopulated from CPUID. All callers (e.g.simd_avx2.rs,simd_avx512.rs, AMX byte-call shim insrc/hpc/amx_matmul.rs,src/hpc/bf16_tile_gemm.rs) MUST be reached through that dispatch — nevervia static
target_featureon a public function. CI lanes thereforeDO NOT compile with
-C target-cpu=...or feature-flag-gated AMX as ahard dep. They run the same binary, and the per-lane runner's CPUID
selects which path executes.
Feature-flag hygiene. AMX is implemented as a byte-call hack because
stable Rust does not expose
_tile_dpbf16ps/_tile_dpbusd. The shimmust compile away to nothing on non-x86 targets:
cargo build --target=aarch64-unknown-linux-gnu --no-default-features --features stdmust succeed.nm -D(ornmfor static archives) on the resulting artifact mustproduce zero matches for
_tile_*,_mm512_*,_mm256_*, or any__intel_*symbol. CI greps for these and fails on any hit.aarch64-targeted NEON intrinsicsymbol (e.g.
vld1q_*,vmull_*) leaking into anx86_64--no-default-featuresartifact must also fail CI.Cross-ISA bit-exactness. A single
numeric-testsharness fixture(
crates/numeric-tests/) generates deterministic input tensors (seededStdRng), runs the public ndarray ops (dot,matmul,bf16_gemm,softmax, the convolution kernels touched in
src/hpc/), serializes theoutput bit-pattern (
f32::to_bits/f64::to_bitsper element), andeach CI lane uploads its output as an artifact. A final
parityjobdiffs all artifacts. Within-tolerance bands (ULP / relative-error)
are owned by D1's reference baseline; this job consumes the band.
QEMU AMX.
qemu-user-staticwithqemu-x86_64 -cpu Sapphire-Rapidsruns unprivileged — no kvm, no nested virt. AMX_TILE / AMX_INT8 /
AMX_BF16 CPUID bits are exposed;
_tile_loadconfigsyscalls succeed.Tile register state is preserved across user-mode emulation. Expect
~50-100x slower than native, so AMX lane runs only the AMX-specific
test groups (
#[cfg(test)] mod amx_*insrc/hpc/amx_matmul.rsandbf16_tile_gemm.rs), not the fullcargo test --lib.Acceptance criteria
simd_matrixjob in.github/workflows/ci.yamlwith the fivelanes above; each lane builds and runs its scoped tests.
ubuntu-24.04-armlane runscargo test --lib -p ndarrayand theNEON-tagged subset green.
qemu-user-static, setsCARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER="qemu-x86_64 -cpu Sapphire-Rapids", and runs the AMX test modules green.ubuntu-latest-large, or QEMU-cpu Skylake-Server-AVX512, orself-hosted; one of these must run the modules unblocked by PR fix(simd): gate simd_avx512 tests behind target_feature = avx512f #134
(
bf16_tests,f16_tests,u8x64_rasterizer_tests,tier3_tests,int_simd_tests).--no-default-features --features portable-atomic-critical-sectionand exercisessimd_opstests onthe scalar fallback.
cross_isa_parityjob consumes per-lane output artifacts and diffsagainst D1's reference baseline within D1's tolerance band; a
single failure fails the job with a precise lane × element ×
observed-vs-expected diff.
link_hygienejob:- aarch64 build →
nmshows no_tile_*, no_mm[0-9]*_*, no__intel_*.- x86_64 polyfill build →
nmshows novld1q_*, novmull_*,no
vqadd_*aarch64 NEON symbols.conclusionjob'sneeds:list is updated to include the new jobsso a regression on any lane gates merge.
the runtime-dispatch invariant so future PRs don't inadvertently
add
-C target-cpu=...and break the one-binary-all-ISAs contract(this same regression bit PR fix(ci): drop global target-cpu, pin clippy to 1.94.1, fmt → nightly+continue-on-error #132).
Out of scope
issue consumes it.
this issue is pure CI scaffolding around the existing surface.
is a separate follow-up.
elsewhere.
openblas,intel-mkl,native) — orthogonal;the existing
native-backendandblas-msrvjobs already cover thataxis.
Dependencies
reference outputs and the documented ULP/relative-error tolerance D1
produces. CI lane work can begin in parallel: the matrix scaffolding,
QEMU plumbing,
nm-hygiene check, andubuntu-24.04-armlane allland independently of the parity assertion. The
cross_isa_parityjob is the only step that requires D1 to have published the baseline
artifact format.
is merged and is the trigger that revealed the lane-coverage gap.
.github/workflows/ci.yamlonly; no source changes requiredfor the CI scaffolding itself (the
nm-hygiene script lives underscripts/per the existing pattern:scripts/all-tests.sh,scripts/cross-tests.sh,scripts/blas-integ-tests.sh,scripts/miri-tests.sh).