CI: matrix expansion — AMX/AVX-512/AVX2/NEON/polyfill lanes + cross-ISA bit-exactness

## Why

ndarray ships multiple ISA paths through `src/simd*.rs` and the HPC kernels
under `src/hpc/` (e.g. `amx_matmul.rs`, `bf16_tile_gemm.rs`, the AVX-512
modules gated in PR #134, AVX2 fallbacks, NEON paths, and the scalar
polyfill). The runtime contract is: **one binary, all ISAs**, dispatched
at run time by `LazyLock` in `src/simd.rs` after CPUID probing.

Today's CI (`.github/workflows/ci.yaml` @ c779c5b) only exercises a small
slice of that surface:

- `ubuntu-latest` (x86-64, has AVX2; AVX-512 absent — see PR #134 SIGILL story)
- `cross_test` for `i686-unknown-linux-gnu` and `s390x-unknown-linux-gnu`
  (cross but no SIMD coverage)
- `nostd` for `thumbv6m-none-eabi`
- No NEON lane, no AMX lane, no AVX-512 execution lane, no polyfill-only
  build, no cross-ISA bit-exactness check, no link-time hygiene check.

Consequence: hardware-specific bugs land silently. PR #134 is the canonical
recent example — AVX-512 raw intrinsics SIGILL'd on `ubuntu-latest` because
no lane actually executed AVX-512 code, while a Cortex-A53 (Pi Zero 2 W)
build that accidentally pulls in any x86 AMX symbol would only be caught
post-deploy.

## What

Extend `.github/workflows/ci.yaml` with a `simd_matrix` job that fans out
across ISA lanes, then a `cross_isa_parity` job that drives identical
inputs through every available lane and asserts bit-exact (or
within-tolerance, per D1's documented band) outputs, then a
`link_hygiene` job that builds for `aarch64-unknown-linux-gnu` with all
x86 features off and verifies the resulting artifact contains no x86
AMX/AVX symbols (and the symmetric check for x86 builds).

Concrete matrix entries:

| Lane | Runner | How |
|------|--------|-----|
| AVX2 | `ubuntu-latest` | stock; native execution |
| NEON | `ubuntu-24.04-arm` | GitHub-hosted aarch64 (free for public repos as of 2025); native execution |
| AMX (SPR) | `ubuntu-latest` + QEMU | `qemu-system-x86_64 -cpu sapphirerapids` (qemu >= 7.x exposes AMX bits); user-mode `qemu-x86_64 -cpu Sapphire-Rapids` for `cargo test --target=x86_64-unknown-linux-gnu` |
| AVX-512 | `ubuntu-latest-large` (paid larger runner exposes Skylake-X / Ice Lake server class) OR self-hosted; alternatively `qemu-x86_64 -cpu Skylake-Server-AVX512` user-mode | confirm runner availability before commit |
| Polyfill | any runner | `cargo test --no-default-features --features portable-atomic-critical-section` — forces scalar path |

## Architecture

**Runtime dispatch invariant.** `src/simd.rs` exposes `Tier` via
`LazyLock` populated from CPUID. All callers (e.g. `simd_avx2.rs`,
`simd_avx512.rs`, AMX byte-call shim in `src/hpc/amx_matmul.rs`,
`src/hpc/bf16_tile_gemm.rs`) MUST be reached through that dispatch — never
via static `target_feature` on a public function. CI lanes therefore
DO NOT compile with `-C target-cpu=...` or feature-flag-gated AMX as a
hard dep. They run the same binary, and the per-lane runner's CPUID
selects which path executes.

**Feature-flag hygiene.** AMX is implemented as a byte-call hack because
stable Rust does not expose `_tile_dpbf16ps` / `_tile_dpbusd`. The shim
must compile away to nothing on non-x86 targets:

- `cargo build --target=aarch64-unknown-linux-gnu --no-default-features
  --features std` must succeed.
- `nm -D` (or `nm` for static archives) on the resulting artifact must
  produce zero matches for `_tile_*`, `_mm512_*`, `_mm256_*`, or any
  `__intel_*` symbol. CI greps for these and fails on any hit.
- The symmetric check on x86 builds: an `aarch64`-targeted NEON intrinsic
  symbol (e.g. `vld1q_*`, `vmull_*`) leaking into an `x86_64`
  `--no-default-features` artifact must also fail CI.

**Cross-ISA bit-exactness.** A single `numeric-tests` harness fixture
(`crates/numeric-tests/`) generates deterministic input tensors (seeded
`StdRng`), runs the public ndarray ops (`dot`, `matmul`, `bf16_gemm`,
softmax, the convolution kernels touched in `src/hpc/`), serializes the
output bit-pattern (`f32::to_bits` / `f64::to_bits` per element), and
each CI lane uploads its output as an artifact. A final `parity` job
diffs all artifacts. Within-tolerance bands (ULP / relative-error)
are owned by D1's reference baseline; this job consumes the band.

**QEMU AMX.** `qemu-user-static` with `qemu-x86_64 -cpu Sapphire-Rapids`
runs unprivileged — no kvm, no nested virt. AMX_TILE / AMX_INT8 /
AMX_BF16 CPUID bits are exposed; `_tile_loadconfig` syscalls succeed.
Tile register state is preserved across user-mode emulation. Expect
~50-100x slower than native, so AMX lane runs only the AMX-specific
test groups (`#[cfg(test)] mod amx_*` in `src/hpc/amx_matmul.rs` and
`bf16_tile_gemm.rs`), not the full `cargo test --lib`.

## Acceptance criteria

- [ ] New `simd_matrix` job in `.github/workflows/ci.yaml` with the five
      lanes above; each lane builds and runs its scoped tests.
- [ ] `ubuntu-24.04-arm` lane runs `cargo test --lib -p ndarray` and the
      NEON-tagged subset green.
- [ ] AMX lane installs `qemu-user-static`, sets
      `CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUNNER="qemu-x86_64 -cpu
      Sapphire-Rapids"`, and runs the AMX test modules green.
- [ ] AVX-512 lane decision recorded in the workflow comments: either
      `ubuntu-latest-large`, or QEMU `-cpu Skylake-Server-AVX512`, or
      self-hosted; one of these must run the modules unblocked by PR #134
      (`bf16_tests`, `f16_tests`, `u8x64_rasterizer_tests`, `tier3_tests`,
      `int_simd_tests`).
- [ ] Polyfill lane runs with `--no-default-features --features
      portable-atomic-critical-section` and exercises `simd_ops` tests on
      the scalar fallback.
- [ ] `cross_isa_parity` job consumes per-lane output artifacts and diffs
      against D1's reference baseline within D1's tolerance band; a
      single failure fails the job with a precise lane × element ×
      observed-vs-expected diff.
- [ ] `link_hygiene` job:
      - aarch64 build → `nm` shows no `_tile_*`, no `_mm[0-9]*_*`, no
        `__intel_*`.
      - x86_64 polyfill build → `nm` shows no `vld1q_*`, no `vmull_*`,
        no `vqadd_*` aarch64 NEON symbols.
- [ ] `conclusion` job's `needs:` list is updated to include the new jobs
      so a regression on any lane gates merge.
- [ ] Documentation comment block at the top of the new job(s) explains
      the runtime-dispatch invariant so future PRs don't inadvertently
      add `-C target-cpu=...` and break the one-binary-all-ISAs contract
      (this same regression bit PR #132).

## Out of scope

- Picking the numerical tolerance band — that's D1's deliverable; this
  issue consumes it.
- Adding new SIMD kernels or fixing existing kernel correctness bugs —
  this issue is pure CI scaffolding around the existing surface.
- macOS / Windows runners — Linux-only for first pass; cross-OS parity
  is a separate follow-up.
- Benchmark-perf gates — correctness only; perf regressions are tracked
  elsewhere.
- BLAS-backend matrices (`openblas`, `intel-mkl`, `native`) — orthogonal;
  the existing `native-backend` and `blas-msrv` jobs already cover that
  axis.

## Dependencies

- **D1 (reference baseline + tolerance band)** — this issue consumes the
  reference outputs and the documented ULP/relative-error tolerance D1
  produces. CI lane work can begin in parallel: the matrix scaffolding,
  QEMU plumbing, `nm`-hygiene check, and `ubuntu-24.04-arm` lane all
  land independently of the parity assertion. The `cross_isa_parity`
  job is the only step that requires D1 to have published the baseline
  artifact format.
- No PR currently in-flight blocks this; PR #134 (AVX-512 test gating)
  is merged and is the trigger that revealed the lane-coverage gap.
- Touches `.github/workflows/ci.yaml` only; no source changes required
  for the CI scaffolding itself (the `nm`-hygiene script lives under
  `scripts/` per the existing pattern: `scripts/all-tests.sh`,
  `scripts/cross-tests.sh`, `scripts/blas-integ-tests.sh`,
  `scripts/miri-tests.sh`).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI: matrix expansion — AMX/AVX-512/AVX2/NEON/polyfill lanes + cross-ISA bit-exactness #138

Why

What

Architecture

Acceptance criteria

Out of scope

Dependencies

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Lane	Runner	How
AVX2	`ubuntu-latest`	stock; native execution
NEON	`ubuntu-24.04-arm`	GitHub-hosted aarch64 (free for public repos as of 2025); native execution
AMX (SPR)	`ubuntu-latest` + QEMU	`qemu-system-x86_64 -cpu sapphirerapids` (qemu >= 7.x exposes AMX bits); user-mode `qemu-x86_64 -cpu Sapphire-Rapids` for `cargo test --target=x86_64-unknown-linux-gnu`
AVX-512	`ubuntu-latest-large` (paid larger runner exposes Skylake-X / Ice Lake server class) OR self-hosted; alternatively `qemu-x86_64 -cpu Skylake-Server-AVX512` user-mode	confirm runner availability before commit
Polyfill	any runner	`cargo test --no-default-features --features portable-atomic-critical-section` — forces scalar path

CI: matrix expansion — AMX/AVX-512/AVX2/NEON/polyfill lanes + cross-ISA bit-exactness #138

Description

Why

What

Architecture

Acceptance criteria

Out of scope

Dependencies

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions