Ship 8b-2b: Yuva420p family u8 RGBA SIMD across all 5 backends #36

Merged
al8n merged 3 commits into main from feat/ship8b-2b-yuva420p-family-u8-simd on Apr 27, 2026
Collaborator

@al8n commented Apr 27, 2026

Summary

  • Adds u8 RGBA SIMD across NEON / SSE4.1 / AVX2 / AVX-512 / wasm simd128 for the YUVA 4:2:0 family — 8-bit Yuva420p plus high-bit Yuva420p9 / Yuva420p10 / Yuva420p16
  • Wires the 4 u8 RGBA dispatchers in `src/row/mod.rs` that landed as scalar-only stubs in PR #35 (Ship 8b-2a: Yuva420p family scalar prep), replacing the `let _ = use_simd` lines with the standard `cfg_select!` per-arch route block
  • u16 RGBA SIMD for this family is deferred to Ship 8b-2c
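
The dispatcher wiring summarized above can be pictured with a minimal sketch. It is not the code in `src/row/mod.rs`: it uses plain `#[cfg]` items rather than `cfg_select!`, and `Route`, `alpha_row_scalar`, `alpha_row_simd`, `simd_available`, and `alpha_to_rgba_row` are hypothetical names invented for illustration. Only the behaviour stated in the summary is modelled: `use_simd = false` forces the scalar path, and the SIMD path is taken only when the CPU feature is present.

```rust
/// Which path the dispatcher took (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Route {
    Scalar,
    Simd,
}

/// Stand-in scalar kernel: writes each alpha sample into byte 3 of its RGBA pixel.
fn alpha_row_scalar(alpha: &[u8], rgba: &mut [u8]) {
    for (px, &a) in rgba.chunks_exact_mut(4).zip(alpha) {
        px[3] = a;
    }
}

/// Stand-in SIMD kernel. A real backend would use arch intrinsics inside a
/// `#[target_feature]` function; this sketch reuses the scalar body.
fn alpha_row_simd(alpha: &[u8], rgba: &mut [u8]) {
    alpha_row_scalar(alpha, rgba);
}

// One `simd_available` definition per target, selected at compile time.
#[cfg(target_arch = "x86_64")]
fn simd_available() -> bool {
    std::arch::is_x86_feature_detected!("sse4.1")
}

#[cfg(target_arch = "aarch64")]
fn simd_available() -> bool {
    std::arch::is_aarch64_feature_detected!("neon")
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
fn simd_available() -> bool {
    false
}

/// Dispatcher: `use_simd = false` always forces the scalar path.
pub fn alpha_to_rgba_row(alpha: &[u8], rgba: &mut [u8], use_simd: bool) -> Route {
    if use_simd && simd_available() {
        alpha_row_simd(alpha, rgba);
        Route::Simd
    } else {
        alpha_row_scalar(alpha, rgba);
        Route::Scalar
    }
}
```

Whichever route is taken, both paths are expected to produce identical bytes, which is the property the per-backend equivalence tests check.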

Changes

  • 5 SIMD backends: each gains a third const generic, `ALPHA_SRC: bool`, added to the existing `<BITS, ALPHA>` (or `<ALPHA>` for 8-bit / 16-bit) templates across 3 kernel families:

    • 8-bit: `yuv_420_to_rgb_or_rgba_row<ALPHA, ALPHA_SRC>`
    • high-bit BITS-generic: `yuv_420p_n_to_rgb_or_rgba_row<BITS, ALPHA, ALPHA_SRC>`
    • 16-bit: `yuv_420p16_to_rgb_or_rgba_row<ALPHA, ALPHA_SRC>`

    When `ALPHA_SRC = true`, the kernel reads the source alpha plane, masks it with `bits_mask::<BITS>()` (high-bit only), and depth-converts it to u8: a `>> (BITS - 8)` variable shift for the BITS-generic kernels and a literal `>> 8` for 16-bit. 8-bit Yuva420p alpha is already u8, so it loads directly via the wide load intrinsic. Existing no-alpha / opaque-alpha wrappers stay backward-compatible by passing `ALPHA_SRC = false, None`.

  • 4 u8 RGBA dispatchers wired in `src/row/mod.rs` (`yuva420p_to_rgba_row`, `yuva420p9_to_rgba_row`, `yuva420p10_to_rgba_row`, `yuva420p16_to_rgba_row`) — replace the prior `let _ = use_simd` stubs with the standard `cfg_select!` per-arch route block, mirroring the existing Yuva444p10 dispatchers' patterns. `use_simd = false` still forces scalar.

  • Per-backend RGBA equivalence tests: 31 new `#[test]` functions across the 5 backend test modules (7 NEON, 6 each on SSE4.1 / AVX2 / AVX-512 / wasm simd128). Each new x86 test early-returns when `is_x86_feature_detected!` reports the required feature as missing, so the suite stays clean under sanitizer / Miri / non-feature-flagged CI runners. Pseudo-random alpha flushes out lane-order corruption that a solid-alpha buffer would mask.

  • Compile-time `const { assert!(!ALPHA_SRC || ALPHA) }` retained on every shared template: source alpha requires RGBA output, since a 3 bpp RGB store has no channel to hold it.
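
As a rough scalar model of the alpha path described in this list, the sketch below handles one pixel at a time. It is an illustration under assumptions, not the SIMD kernel: `store_pixel` is a hypothetical name, the mask is computed inline rather than through the crate's `bits_mask::<BITS>()` helper (whose signature is not shown here), and the RGB values are passed in already converted.

```rust
/// Scalar model of the per-pixel store (hypothetical helper).
/// `BITS` is the source depth, `ALPHA` selects a 4-byte RGBA store, and
/// `ALPHA_SRC` selects source-derived alpha instead of opaque 0xFF.
pub fn store_pixel<const BITS: u32, const ALPHA: bool, const ALPHA_SRC: bool>(
    rgb: [u8; 3],
    src_alpha: Option<u16>,
    out: &mut [u8],
) {
    // Source alpha needs an RGBA store: a 3 bpp RGB store has no channel for it.
    const { assert!(!ALPHA_SRC || ALPHA) };
    // Assumption of this sketch: depths from 8 to 16 bits.
    const { assert!(BITS >= 8 && BITS <= 16) };

    out[..3].copy_from_slice(&rgb);
    if ALPHA {
        out[3] = if ALPHA_SRC {
            let a = src_alpha.expect("ALPHA_SRC = true requires an alpha sample");
            // Mask off stray bits above the declared depth, then shift down to 8 bits.
            let mask = ((1u32 << BITS) - 1) as u16;
            ((a & mask) >> (BITS - 8)) as u8
        } else {
            0xFF // opaque alpha when no source plane is used
        };
    }
}
```

Instantiating `store_pixel::<10, false, true>` fails to compile because of the first `const` assertion, which is the guarantee the retained compile-time check provides. With `BITS = 16` the shift reduces to the literal `>> 8` mentioned above.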

Test plan

  • `cargo check --lib --tests` (aarch64) — clean
  • `cargo test --lib` (aarch64) — 624 passed (+7 NEON)
  • `RUSTFLAGS=-Dwarnings cargo clippy --lib --tests` (aarch64) — clean
  • `cargo check --target x86_64-unknown-freebsd --lib --tests` — clean
  • `RUSTFLAGS=-Dwarnings cargo clippy --target x86_64-unknown-freebsd --lib --tests` — clean
  • `cargo check --target wasm32-unknown-unknown --lib --tests` — clean

Follow-up

  • Ship 8b-2c: u16 RGBA SIMD for the same Yuva420p family (extends `yuv_420p_n_to_rgb_or_rgba_u16_row` and `yuv_420p16_to_rgb_or_rgba_u16_row` with the third `ALPHA_SRC` const generic). The `*_to_rgba_u16_row` dispatchers in `src/row/mod.rs` remain scalar-only until then.

🤖 Generated with Claude Code

Adds u8 RGBA SIMD across NEON / SSE4.1 / AVX2 / AVX-512 / wasm
simd128 for the YUVA 4:2:0 family — 8-bit Yuva420p plus high-bit
Yuva420p9 / Yuva420p10 / Yuva420p16 — and wires them into the 4
u8 RGBA dispatchers in src/row/mod.rs that landed as scalar-only
stubs in PR #35 (Ship 8b-2a). The u16 RGBA SIMD work is deferred
to Ship 8b-2c.

## Changes

- **5 SIMD backends**: each gains a third const generic,
  `ALPHA_SRC: bool`, added to the existing `<BITS, ALPHA>` (or
  `<ALPHA>` for 8-bit / 16-bit) templates across 3 kernel families:
  - 8-bit: `yuv_420_to_rgb_or_rgba_row<ALPHA, ALPHA_SRC>`
  - high-bit BITS-generic: `yuv_420p_n_to_rgb_or_rgba_row<BITS, ALPHA, ALPHA_SRC>`
  - 16-bit: `yuv_420p16_to_rgb_or_rgba_row<ALPHA, ALPHA_SRC>`

  When `ALPHA_SRC = true`, the kernel reads the source alpha
  plane, masks it with `bits_mask::<BITS>()` (high-bit only), and
  depth-converts it to u8: a `>> (BITS - 8)` variable shift for
  the BITS-generic kernels and a literal `>> 8` for 16-bit. 8-bit
  Yuva420p alpha is already u8, so it loads directly via the wide
  load intrinsic. Existing no-alpha / opaque-alpha wrappers stay
  backward-compatible by passing `ALPHA_SRC = false, None`.

- **4 u8 RGBA dispatchers wired** in `src/row/mod.rs`
  (`yuva420p_to_rgba_row`, `yuva420p9_to_rgba_row`,
  `yuva420p10_to_rgba_row`, `yuva420p16_to_rgba_row`) — replace
  the prior `let _ = use_simd` stubs with the standard
  `cfg_select!` per-arch route block, mirroring the existing
  Yuva444p10 dispatchers' patterns. `use_simd = false` still
  forces scalar.

- **Per-backend RGBA equivalence tests** — 31 new `#[test]`
  functions across the 5 backend test modules (7 NEON, 6 each on
  SSE4.1 / AVX2 / AVX-512 / wasm simd128). Each new x86 test
  early-returns when `is_x86_feature_detected!` reports the
  required feature as missing, so the suite stays clean under
  sanitizer / Miri / non-feature-flagged CI runners.
  Pseudo-random alpha is used to flush out lane-order corruption
  that a solid-alpha buffer would mask.

- Compile-time `const { assert!(!ALPHA_SRC || ALPHA) }` retained
  on every shared template: source alpha requires RGBA output,
  since a 3 bpp RGB store has no channel to hold it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 23:06
uqio and others added 2 commits April 28, 2026 11:07
`16 - BITS` is already `u32` (BITS is `const BITS: u32`), so the
trailing `as u32` was a no-op. Clippy's `unnecessary_cast`
(`u32` → `u32`) flagged all 4 occurrences in
`wasm_simd128.rs` (lines 1904 / 2082 / 4075 / 4253) as errors
under `RUSTFLAGS=-Dwarnings`. These predate this branch, but
were exposed once `clippy --target wasm32-unknown-unknown
--lib --tests` ran clean otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
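
The lint fix in this commit can be illustrated with a small sketch. The functions `shift_to_16` and `widen` are hypothetical and only show the typing: with `const BITS: u32`, the expression `16 - BITS` already has type `u32`, so a trailing `as u32` adds nothing and is what clippy's `unnecessary_cast` reports. How the shift amount is actually used in `wasm_simd128.rs` is not shown in this thread, so the widening below is an assumption.

```rust
/// Shift amount that scales a `BITS`-deep sample up to 16 bits.
/// `BITS` is `u32`, so `16 - BITS` is `u32` with no cast needed;
/// `(16 - BITS) as u32` would be a `u32` -> `u32` no-op cast.
pub const fn shift_to_16<const BITS: u32>() -> u32 {
    16 - BITS
}

/// Hypothetical use of the shift: left-align a sample in a 16-bit lane.
pub fn widen<const BITS: u32>(sample: u16) -> u16 {
    sample << shift_to_16::<BITS>()
}
```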

Copilot AI left a comment

Pull request overview

Adds SIMD-accelerated u8 RGBA conversion for the YUVA 4:2:0 family (Yuva420p / Yuva420p9 / Yuva420p10 / Yuva420p16) across all supported SIMD backends, wiring the previously scalar-only dispatchers and adding backend equivalence tests.

Changes:

  • Wire `yuva420p*_to_rgba_row` u8 dispatchers in `src/row/mod.rs` to per-arch SIMD wrappers via `cfg_select!` (with scalar fallback when `use_simd = false` or SIMD isn’t available).
  • Extend each SIMD backend’s shared 4:2:0 kernels with a third const generic `ALPHA_SRC` and add `*_with_alpha_src_row` wrappers that read/depth-convert the source alpha plane.
  • Add per-backend SIMD-vs-scalar equivalence tests for YUVA 4:2:0 u8 RGBA paths (including varying alpha seeds).
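
To show why the equivalence tests vary the alpha data, here is a self-contained sketch. None of these helpers come from the PR: `pseudo_random_bytes` is a small linear congruential generator chosen for illustration, `alpha_scalar` is a trivial reference, and `alpha_lane_swapped` is a deliberately broken stand-in for a kernel with a lane-order bug. The feature-detection early return used by the x86 tests is not modelled.

```rust
/// Deterministic pseudo-random bytes from a 32-bit LCG (hypothetical helper).
pub fn pseudo_random_bytes(seed: u32, len: usize) -> Vec<u8> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
            (state >> 24) as u8 // keep the top byte, the best-mixed bits
        })
        .collect()
}

/// Reference behaviour: alpha samples pass through in order.
pub fn alpha_scalar(alpha: &[u8]) -> Vec<u8> {
    alpha.to_vec()
}

/// Deliberately broken "kernel": reverses every group of 4 lanes,
/// imitating a shuffle or interleave mistake in a SIMD store.
pub fn alpha_lane_swapped(alpha: &[u8]) -> Vec<u8> {
    let mut out = alpha.to_vec();
    for chunk in out.chunks_exact_mut(4) {
        chunk.reverse();
    }
    out
}

/// True when the candidate output matches the scalar reference.
pub fn matches_reference(alpha: &[u8], candidate: fn(&[u8]) -> Vec<u8>) -> bool {
    candidate(alpha) == alpha_scalar(alpha)
}
```

With a solid buffer such as `[0x80; 16]`, `matches_reference` accepts the lane-swapped kernel because every lane holds the same value. With `pseudo_random_bytes(0, 16)` as input the comparison fails, so the bug is caught.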

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `src/row/mod.rs` | Replaces scalar-only stubs for YUVA 4:2:0 u8 RGBA dispatchers with per-arch SIMD routing + scalar fallback; keeps u16 RGBA dispatchers scalar. |
| `src/row/arch/neon.rs` | Adds `ALPHA_SRC` support and YUVA420 u8 RGBA-with-alpha-source wrappers to NEON kernels. |
| `src/row/arch/neon/tests.rs` | Adds NEON SIMD-vs-scalar equivalence tests for YUVA 4:2:0 u8 RGBA with source alpha. |
| `src/row/arch/x86_sse41.rs` | Adds `ALPHA_SRC` support and YUVA420 u8 RGBA-with-alpha-source wrappers to SSE4.1 kernels. |
| `src/row/arch/x86_sse41/tests.rs` | Adds SSE4.1 SIMD-vs-scalar equivalence tests for YUVA 4:2:0 u8 RGBA with source alpha. |
| `src/row/arch/x86_avx2.rs` | Adds `ALPHA_SRC` support and YUVA420 u8 RGBA-with-alpha-source wrappers to AVX2 kernels. |
| `src/row/arch/x86_avx2/tests.rs` | Adds AVX2 SIMD-vs-scalar equivalence tests for YUVA 4:2:0 u8 RGBA with source alpha. |
| `src/row/arch/x86_avx512.rs` | Adds `ALPHA_SRC` support and YUVA420 u8 RGBA-with-alpha-source wrappers to AVX-512BW kernels. |
| `src/row/arch/x86_avx512/tests.rs` | Adds AVX-512 SIMD-vs-scalar equivalence tests for YUVA 4:2:0 u8 RGBA with source alpha. |
| `src/row/arch/wasm_simd128.rs` | Adds `ALPHA_SRC` support and YUVA420 u8 RGBA-with-alpha-source wrappers to wasm simd128 kernels. |
| `src/row/arch/wasm_simd128/tests.rs` | Adds wasm simd128 SIMD-vs-scalar equivalence tests for YUVA 4:2:0 u8 RGBA with source alpha. |


@al8n al8n requested a review from Copilot April 27, 2026 23:18
@al8n al8n merged commit e0392c7 into main Apr 27, 2026
45 checks passed
@al8n al8n deleted the feat/ship8b-2b-yuva420p-family-u8-simd branch April 27, 2026 23:18
Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.

