Ship 8b-2b: Yuva420p family u8 RGBA SIMD across all 5 backends #36
Merged
Adds u8 RGBA SIMD across NEON / SSE4.1 / AVX2 / AVX-512 / wasm simd128 for the YUVA 4:2:0 family (8-bit Yuva420p plus high-bit Yuva420p9 / Yuva420p10 / Yuva420p16) and wires the new kernels into the 4 u8 RGBA dispatchers in `src/row/mod.rs` that landed as scalar-only stubs in PR #35 (Ship 8b-2a). The u16 RGBA SIMD work is deferred to Ship 8b-2c.

## Changes

- **5 SIMD backends**: each gains a third const generic `ALPHA_SRC: bool`, added to the existing `<BITS, ALPHA>` (or `<ALPHA>` for 8-bit / 16-bit) templates across 3 kernel families:
  - 8-bit: `yuv_420_to_rgb_or_rgba_row<ALPHA, ALPHA_SRC>`
  - high-bit BITS-generic: `yuv_420p_n_to_rgb_or_rgba_row<BITS, ALPHA, ALPHA_SRC>`
  - 16-bit: `yuv_420p16_to_rgb_or_rgba_row<ALPHA, ALPHA_SRC>`

  When `ALPHA_SRC = true` the kernel reads the source alpha plane, masks it with `bits_mask::<BITS>()` (high-bit only), and depth-converts it to u8 (a variable `>> (BITS - 8)` shift for the BITS-generic kernels, a literal `>> 8` for 16-bit). 8-bit Yuva420p alpha is already u8, so it is loaded directly via the wide load intrinsic. Existing no-alpha / opaque-alpha wrappers stay backward-compatible by passing `ALPHA_SRC = false, None`.
- **4 u8 RGBA dispatchers wired** in `src/row/mod.rs` (`yuva420p_to_rgba_row`, `yuva420p9_to_rgba_row`, `yuva420p10_to_rgba_row`, `yuva420p16_to_rgba_row`): the prior `let _ = use_simd` stubs are replaced with the standard `cfg_select!` per-arch route block, mirroring the pattern of the existing Yuva444p10 dispatchers. `use_simd = false` still forces scalar.
- **Per-backend RGBA equivalence tests**: 31 new `#[test]` functions across the 5 backend test modules (7 on NEON, 6 each on SSE4.1 / AVX2 / AVX-512 / wasm simd128). Each new x86 test early-returns when `is_x86_feature_detected!` reports the required feature as unavailable, so the suite stays clean under sanitizer / Miri / non-feature-flagged CI runners. Pseudo-random alpha is used to flush out lane-order corruption that a solid-alpha buffer would mask.
- Compile-time `const { assert!(!ALPHA_SRC || ALPHA) }` retained on every shared template: source alpha requires RGBA output, because a 3 bpp RGB store has no channel to put it in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
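The alpha path described in the list above can be modelled in scalar Rust. This is only a sketch under stated assumptions, not the crate's SIMD kernel: `alpha_to_u8` is a hypothetical name, and this `bits_mask` is an assumed reimplementation (low `BITS` bits set) that may differ from the real helper.

```rust
/// Assumed behaviour of the mask helper: the low `BITS` bits set,
/// e.g. 0x03FF for BITS = 10. Not copied from the crate.
const fn bits_mask<const BITS: u32>() -> u16 {
    ((1u32 << BITS) - 1) as u16
}

/// Hypothetical scalar model of one pixel's alpha conversion.
/// `ALPHA` selects RGBA output; `ALPHA_SRC` selects a source alpha plane.
fn alpha_to_u8<const BITS: u32, const ALPHA: bool, const ALPHA_SRC: bool>(
    a_src: Option<&[u16]>,
    x: usize,
) -> u8 {
    // Source alpha without an alpha channel in the output is rejected at compile time.
    const { assert!(!ALPHA_SRC || ALPHA) };
    // This sketch only covers depths that fit in a u16 sample.
    const { assert!(BITS >= 8 && BITS <= 16) };

    if ALPHA_SRC {
        let a = a_src.expect("ALPHA_SRC = true needs an alpha plane")[x];
        // Mask off stray high bits, then shift the sample down to 8 bits.
        ((a & bits_mask::<BITS>()) >> (BITS - 8)) as u8
    } else {
        // Opaque alpha for the existing no-source-alpha wrappers.
        0xFF
    }
}
```

For `BITS = 16` the shift `BITS - 8` folds to the literal `>> 8` mentioned above, and instantiating `alpha_to_u8::<10, false, true>` fails to compile because of the first assertion.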
`16 - BITS` is already `u32` (BITS is `const BITS: u32`), so the trailing `as u32` was a no-op. Clippy's `unnecessary_cast` (`u32` → `u32`) flagged all 4 occurrences in `wasm_simd128.rs` (lines 1904 / 2082 / 4075 / 4253) as errors under `RUSTFLAGS=-Dwarnings`. These predate this branch, but were exposed once `clippy --target wasm32-unknown-unknown --lib --tests` ran clean otherwise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
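A minimal illustration of why the cast was redundant. The function name `headroom` is hypothetical and is not taken from `wasm_simd128.rs`; only the `16 - BITS` expression and the `const BITS: u32` parameter come from the commit message.

```rust
/// Hypothetical helper: a value derived from a const-generic bit depth.
const fn headroom<const BITS: u32>() -> u32 {
    // `BITS` is a `u32`, so the literal `16` is inferred as `u32` and the
    // whole expression is already `u32`. Writing `(16 - BITS) as u32`
    // produces the same value but trips `clippy::unnecessary_cast`.
    16 - BITS
}
```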
Pull request overview
Adds SIMD-accelerated u8 RGBA conversion for the YUVA 4:2:0 family (Yuva420p / Yuva420p9 / Yuva420p10 / Yuva420p16) across all supported SIMD backends, wiring the previously scalar-only dispatchers and adding backend equivalence tests.
Changes:
- Wire `yuva420p*_to_rgba_row` u8 dispatchers in `src/row/mod.rs` to per-arch SIMD wrappers via `cfg_select!` (with scalar fallback when `use_simd = false` or SIMD isn't available).
- Extend each SIMD backend's shared 4:2:0 kernels with a third const generic `ALPHA_SRC` and add `*_with_alpha_src_row` wrappers that read/depth-convert the source alpha plane.
- Add per-backend SIMD-vs-scalar equivalence tests for YUVA 4:2:0 u8 RGBA paths (including varying alpha seeds).
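To see why the equivalence tests vary the alpha data, consider a toy model. Everything here is illustrative and assumed: the function names are hypothetical, the generator is a simple 8-bit LCG rather than whatever the test suite uses, and `lane_swapped_alpha` is a deliberately broken stand-in for a SIMD kernel with a lane-order bug.

```rust
/// Illustrative 8-bit LCG (multiplier 5, increment 1). Consecutive outputs
/// always differ, which is the property that matters for this demonstration.
fn pseudo_random_alpha(seed: u8, len: usize) -> Vec<u8> {
    let mut s = seed;
    (0..len)
        .map(|_| {
            s = s.wrapping_mul(5).wrapping_add(1);
            s
        })
        .collect()
}

/// Reference path: write each alpha sample into the A slot of its RGBA pixel.
fn scalar_alpha(a: &[u8], rgba: &mut [u8]) {
    for (px, &av) in rgba.chunks_exact_mut(4).zip(a) {
        px[3] = av;
    }
}

/// Deliberately buggy path: swaps each adjacent pair of alpha lanes.
/// Expects an even number of pixels.
fn lane_swapped_alpha(a: &[u8], rgba: &mut [u8]) {
    for (i, px) in rgba.chunks_exact_mut(4).enumerate() {
        px[3] = a[i ^ 1];
    }
}
```

With a solid alpha buffer the two paths produce identical output, so the bug goes unnoticed; with the varying buffer the outputs differ and a SIMD-vs-scalar comparison fails as intended.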
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/row/mod.rs | Replaces scalar-only stubs for YUVA 4:2:0 u8 RGBA dispatchers with per-arch SIMD routing + scalar fallback; keeps u16 RGBA dispatchers scalar. |
| src/row/arch/neon.rs | Adds ALPHA_SRC support and YUVA420 u8 RGBA-with-alpha-source wrappers to NEON kernels. |
| src/row/arch/neon/tests.rs | Adds NEON SIMD-vs-scalar equivalence tests for YUVA 4:2:0 u8 RGBA with source alpha. |
| src/row/arch/x86_sse41.rs | Adds ALPHA_SRC support and YUVA420 u8 RGBA-with-alpha-source wrappers to SSE4.1 kernels. |
| src/row/arch/x86_sse41/tests.rs | Adds SSE4.1 SIMD-vs-scalar equivalence tests for YUVA 4:2:0 u8 RGBA with source alpha. |
| src/row/arch/x86_avx2.rs | Adds ALPHA_SRC support and YUVA420 u8 RGBA-with-alpha-source wrappers to AVX2 kernels. |
| src/row/arch/x86_avx2/tests.rs | Adds AVX2 SIMD-vs-scalar equivalence tests for YUVA 4:2:0 u8 RGBA with source alpha. |
| src/row/arch/x86_avx512.rs | Adds ALPHA_SRC support and YUVA420 u8 RGBA-with-alpha-source wrappers to AVX-512BW kernels. |
| src/row/arch/x86_avx512/tests.rs | Adds AVX-512 SIMD-vs-scalar equivalence tests for YUVA 4:2:0 u8 RGBA with source alpha. |
| src/row/arch/wasm_simd128.rs | Adds ALPHA_SRC support and YUVA420 u8 RGBA-with-alpha-source wrappers to wasm simd128 kernels. |
| src/row/arch/wasm_simd128/tests.rs | Adds wasm simd128 SIMD-vs-scalar equivalence tests for YUVA 4:2:0 u8 RGBA with source alpha. |
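The routing in `src/row/mod.rs` (per-arch SIMD with a scalar fallback, and `use_simd = false` forcing scalar) might look roughly like the sketch below. It is an assumption-laden stand-in: all function names are hypothetical, only one x86 backend is shown, the "AVX2" function simply calls the scalar code, and plain `#[cfg]` plus runtime detection is used in place of the `cfg_select!` block the PR describes.

```rust
/// Scalar reference path: write alpha into the A slot of each RGBA pixel.
fn alpha_row_scalar(a: &[u8], rgba: &mut [u8]) {
    for (px, &av) in rgba.chunks_exact_mut(4).zip(a) {
        px[3] = av;
    }
}

/// Hypothetical stand-in for an AVX2 kernel. A real kernel would use
/// intrinsics; this one reuses the scalar body so the sketch stays short.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn alpha_row_avx2(a: &[u8], rgba: &mut [u8]) {
    alpha_row_scalar(a, rgba);
}

/// Dispatcher sketch. Returns the name of the path taken so that the
/// routing decision is observable in a test.
pub fn alpha_row(a: &[u8], rgba: &mut [u8], use_simd: bool) -> &'static str {
    if use_simd {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                // SAFETY: AVX2 support was confirmed at runtime just above.
                unsafe { alpha_row_avx2(a, rgba) };
                return "avx2";
            }
        }
    }
    // Reached when SIMD is disabled, unsupported, or not compiled in.
    alpha_row_scalar(a, rgba);
    "scalar"
}
```

Whichever path runs, the output bytes are expected to match, which is the property the per-backend equivalence tests in the table above check.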