Skip to content

CI: matrix expansion — AMX/AVX-512/AVX2/NEON/polyfill lanes + cross-ISA bit-exactness #138

@AdaWorldAPI

Description

@AdaWorldAPI

Why

ndarray ships multiple ISA paths through src/simd*.rs and the HPC kernels
under src/hpc/ (e.g. amx_matmul.rs, bf16_tile_gemm.rs, the AVX-512
modules gated in PR #134, AVX2 fallbacks, NEON paths, and the scalar
polyfill). The runtime contract is: one binary, all ISAs, dispatched
at run time by LazyLock in src/simd.rs after CPUID probing.

Today's CI (.github/workflows/ci.yaml @ c779c5b) only exercises a small
slice of that surface:

  • ubuntu-latest (x86-64, has AVX2; AVX-512 absent — see PR fix(simd): gate simd_avx512 tests behind target_feature = avx512f #134 SIGILL story)
  • cross_test for i686-unknown-linux-gnu and s390x-unknown-linux-gnu
    (cross but no SIMD coverage)
  • nostd for thumbv6m-none-eabi
  • No NEON lane, no AMX lane, no AVX-512 execution lane, no polyfill-only
    build, no cross-ISA bit-exactness check, no link-time hygiene check.

Consequence: hardware-specific bugs land silently. PR #134 is the canonical
recent example — AVX-512 raw intrinsics SIGILL'd on ubuntu-latest because
no lane actually executed AVX-512 code, while a Cortex-A53 (Pi Zero 2 W)
build that accidentally pulls in any x86 AMX symbol would only be caught
post-deploy.

What

Extend .github/workflows/ci.yaml with a simd_matrix job that fans out
across ISA lanes, then a cross_isa_parity job that drives identical
inputs through every available lane and asserts bit-exact (or
within-tolerance, per D1's documented band) outputs, then a
link_hygiene job that builds for aarch64-unknown-linux-gnu with all
x86 features off and verifies the resulting artifact contains no x86
AMX/AVX symbols (and the symmetric check for x86 builds).

Concrete matrix entries:

Lane Runner How
AVX2 ubuntu-latest stock; native execution
NEON ubuntu-24.04-arm GitHub-hosted aarch64 (free for public repos as of 2025); native execution
AMX (SPR) ubuntu-latest + QEMU qemu-system-x86_64 -cpu sapphirerapids (qemu >= 7.x exposes AMX bits); user-mode qemu-x86_64 -cpu Sapphire-Rapids for cargo test --target=x86_64-unknown-linux-gnu
AVX-512 ubuntu-latest-large (paid larger runner exposes Skylake-X / Ice Lake server class) OR self-hosted; alternatively qemu-x86_64 -cpu Skylake-Server-AVX512 user-mode confirm runner availability before commit
Polyfill any runner cargo test --no-default-features --features portable-atomic-critical-section — forces scalar path

Architecture

Runtime dispatch invariant. src/simd.rs exposes Tier via
LazyLock populated from CPUID. All callers (e.g. simd_avx2.rs,
simd_avx512.rs, AMX byte-call shim in src/hpc/amx_matmul.rs,
src/hpc/bf16_tile_gemm.rs) MUST be reached through that dispatch — never
via static target_feature on a public function. CI lanes therefore
DO NOT compile with -C target-cpu=... or feature-flag-gated AMX as a
hard dep. They run the same binary, and the per-lane runner's CPUID
selects which path executes.

Feature-flag hygiene. AMX is implemented as a byte-call hack because
stable Rust does not expose _tile_dpbf16ps / _tile_dpbusd. The shim
must compile away to nothing on non-x86 targets:

  • cargo build --target=aarch64-unknown-linux-gnu --no-default-features --features std must succeed.
  • nm -D (or nm for static archives) on the resulting artifact must
    produce zero matches for _tile_*, _mm512_*, _mm256_*, or any
    __intel_* symbol. CI greps for these and fails on any hit.
  • The symmetric check on x86 builds: an aarch64-targeted NEON intrinsic
    symbol (e.g. vld1q_*, vmull_*) leaking into an x86_64
    --no-default-features artifact must also fail CI.

Cross-ISA bit-exactness. A single numeric-tests harness fixture
(crates/numeric-tests/) generates deterministic input tensors (seeded
StdRng), runs the public ndarray ops (dot, matmul, bf16_gemm,
softmax, the convolution kernels touched in src/hpc/), serializes the
output bit-pattern (f32::to_bits / f64::to_bits per element), and
each CI lane uploads its output as an artifact. A final parity job
diffs all artifacts. Within-tolerance bands (ULP / relative-error)
are owned by D1's reference baseline; this job consumes the band.

QEMU AMX. qemu-user-static with qemu-x86_64 -cpu Sapphire-Rapids
runs unprivileged — no kvm, no nested virt. AMX_TILE / AMX_INT8 /
AMX_BF16 CPUID bits are exposed; _tile_loadconfig syscalls succeed.
Tile register state is preserved across user-mode emulation. Expect
~50-100x slower than native, so AMX lane runs only the AMX-specific
test groups (#[cfg(test)] mod amx_* in src/hpc/amx_matmul.rs and
bf16_tile_gemm.rs), not the full cargo test --lib.

Acceptance criteria

  • New simd_matrix job in .github/workflows/ci.yaml with the five
    lanes above; each lane builds and runs its scoped tests.
  • ubuntu-24.04-arm lane runs cargo test --lib -p ndarray and the
    NEON-tagged subset green.
  • AMX lane installs qemu-user-static, sets
    CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER="qemu-x86_64 -cpu Sapphire-Rapids", and runs the AMX test modules green.
  • AVX-512 lane decision recorded in the workflow comments: either
    ubuntu-latest-large, or QEMU -cpu Skylake-Server-AVX512, or
    self-hosted; one of these must run the modules unblocked by PR fix(simd): gate simd_avx512 tests behind target_feature = avx512f #134
    (bf16_tests, f16_tests, u8x64_rasterizer_tests, tier3_tests,
    int_simd_tests).
  • Polyfill lane runs with --no-default-features --features portable-atomic-critical-section and exercises simd_ops tests on
    the scalar fallback.
  • cross_isa_parity job consumes per-lane output artifacts and diffs
    against D1's reference baseline within D1's tolerance band; a
    single failure fails the job with a precise lane × element ×
    observed-vs-expected diff.
  • link_hygiene job:
    - aarch64 build → nm shows no _tile_*, no _mm[0-9]*_*, no
    __intel_*.
    - x86_64 polyfill build → nm shows no vld1q_*, no vmull_*,
    no vqadd_* aarch64 NEON symbols.
  • conclusion job's needs: list is updated to include the new jobs
    so a regression on any lane gates merge.
  • Documentation comment block at the top of the new job(s) explains
    the runtime-dispatch invariant so future PRs don't inadvertently
    add -C target-cpu=... and break the one-binary-all-ISAs contract
    (this same regression bit PR fix(ci): drop global target-cpu, pin clippy to 1.94.1, fmt → nightly+continue-on-error #132).

Out of scope

  • Picking the numerical tolerance band — that's D1's deliverable; this
    issue consumes it.
  • Adding new SIMD kernels or fixing existing kernel correctness bugs —
    this issue is pure CI scaffolding around the existing surface.
  • macOS / Windows runners — Linux-only for first pass; cross-OS parity
    is a separate follow-up.
  • Benchmark-perf gates — correctness only; perf regressions are tracked
    elsewhere.
  • BLAS-backend matrices (openblas, intel-mkl, native) — orthogonal;
    the existing native-backend and blas-msrv jobs already cover that
    axis.

Dependencies

  • D1 (reference baseline + tolerance band) — this issue consumes the
    reference outputs and the documented ULP/relative-error tolerance D1
    produces. CI lane work can begin in parallel: the matrix scaffolding,
    QEMU plumbing, nm-hygiene check, and ubuntu-24.04-arm lane all
    land independently of the parity assertion. The cross_isa_parity
    job is the only step that requires D1 to have published the baseline
    artifact format.
  • No PR currently in-flight blocks this; PR fix(simd): gate simd_avx512 tests behind target_feature = avx512f #134 (AVX-512 test gating)
    is merged and is the trigger that revealed the lane-coverage gap.
  • Touches .github/workflows/ci.yaml only; no source changes required
    for the CI scaffolding itself (the nm-hygiene script lives under
    scripts/ per the existing pattern: scripts/all-tests.sh,
    scripts/cross-tests.sh, scripts/blas-integ-tests.sh,
    scripts/miri-tests.sh).

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions