feat(row): Ship 8 — high-bit 4:2:0 RGBA scalar (SIMD lands in 5a/5b)#24
feat(row): Ship 8 — high-bit 4:2:0 RGBA scalar (SIMD lands in 5a/5b)#24
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
Adds high-bit-depth 4:2:0 RGBA support (prep work for “Ship 8” tranche) by introducing scalar RGBA kernels and wiring new public row dispatchers (currently scalar-only).
Changes:
- Add scalar RGB→RGBA expansion for
u16rows and new RGBA variants for high-bit-depth planar/semi-planar YUV 4:2:0 conversions (u8 RGBA + native-depth u16 RGBA). - Refactor several existing scalar RGB kernels into shared RGB/RGBA kernels via a
const ALPHA: boolmonomorphization. - Add public RGBA row dispatchers for Yuv420p{9,10,12,14,16} and P01{0,2,6} plus a new
rgba_row_elems()sizing helper.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
src/row/scalar.rs |
Introduces/extends scalar kernels to support RGBA outputs (u8 and u16) and adds a u16 RGB→RGBA fan-out helper; refactors into shared RGB/RGBA kernels. |
src/row/mod.rs |
Adds public RGBA dispatchers for high-bit 4:2:0 formats and a u16 RGBA row sizing helper. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
| } | ||
|
|
||
| // ---- High-bit-depth YUV 4:2:0 → RGB (BITS ∈ {10, 12, 14}) ------------- |
There was a problem hiding this comment.
The section header comment says BITS ∈ {10, 12, 14}, but this module now supports BITS = 9 as well (via the compile-time guard and new 9-bit RGBA/RGB dispatchers). Please update the header to include 9-bit so documentation matches the actual supported set.
| // ---- High-bit-depth YUV 4:2:0 → RGB (BITS ∈ {10, 12, 14}) ------------- | |
| // ---- High-bit-depth YUV 4:2:0 → RGB (BITS ∈ {9, 10, 12, 14}) ---------- |
| matrix: ColorMatrix, | ||
| full_range: bool, | ||
| ) { | ||
| yuv_420p16_to_rgb_or_rgba_row::<false>(y, u_half, v_half, rgb_out, width, matrix, full_range); |
There was a problem hiding this comment.
This call is written as a long single line and will be reformatted by cargo fmt (CI runs cargo fmt -- --check). Please run rustfmt or format this call to match the rest of the file’s wrapping style.
| yuv_420p16_to_rgb_or_rgba_row::<false>(y, u_half, v_half, rgb_out, width, matrix, full_range); | |
| yuv_420p16_to_rgb_or_rgba_row::<false>( | |
| y, | |
| u_half, | |
| v_half, | |
| rgb_out, | |
| width, | |
| matrix, | |
| full_range, | |
| ); |
| y, u_half, v_half, rgba_out, width, matrix, full_range, | ||
| ); | ||
| } | ||
|
|
There was a problem hiding this comment.
New RGBA conversion entry points are introduced here, but there are no accompanying scalar tests validating (a) alpha is set to the expected opaque value (0xFF for u8 output / (1<<BITS)-1 for u16 output) and (b) the RGB bytes match the existing *_to_rgb_row outputs for the same inputs. Since src/row/scalar/tests.rs already covers the RGB variants, please add analogous tests for these RGBA wrappers/shared kernels.
| #[cfg(test)] | |
| mod tests { | |
| use super::*; | |
| fn assert_rgba_matches_rgb_and_has_opaque_alpha(rgb: &[u8], rgba: &[u8], width: usize) { | |
| for x in 0..width { | |
| assert_eq!(&rgba[4 * x..4 * x + 3], &rgb[3 * x..3 * x + 3]); | |
| assert_eq!(rgba[4 * x + 3], 0xFF); | |
| } | |
| } | |
| #[test] | |
| fn yuv_420_to_rgba_row_matches_rgb_and_sets_opaque_alpha() { | |
| let width = 4; | |
| let y = [0u8, 64, 128, 255]; | |
| let u_half = [128u8, 128]; | |
| let v_half = [128u8, 128]; | |
| let mut rgb = [0u8; 12]; | |
| let mut rgba = [0u8; 16]; | |
| yuv_420_to_rgb_row(&y, &u_half, &v_half, &mut rgb, width, ColorMatrix::Bt601, true); | |
| yuv_420_to_rgba_row( | |
| &y, | |
| &u_half, | |
| &v_half, | |
| &mut rgba, | |
| width, | |
| ColorMatrix::Bt601, | |
| true, | |
| ); | |
| assert_rgba_matches_rgb_and_has_opaque_alpha(&rgb, &rgba, width); | |
| } | |
| #[test] | |
| fn yuv_420p10_to_rgba_row_matches_rgb_and_sets_opaque_alpha() { | |
| let width = 4; | |
| let y = [0u16, 256, 768, 1023]; | |
| let u_half = [512u16, 512]; | |
| let v_half = [512u16, 512]; | |
| let mut rgb = [0u8; 12]; | |
| let mut rgba = [0u8; 16]; | |
| yuv_420p_n_to_rgb_row::<10>( | |
| &y, | |
| &u_half, | |
| &v_half, | |
| &mut rgb, | |
| width, | |
| ColorMatrix::Bt601, | |
| true, | |
| ); | |
| yuv_420p_n_to_rgba_row::<10>( | |
| &y, | |
| &u_half, | |
| &v_half, | |
| &mut rgba, | |
| width, | |
| ColorMatrix::Bt601, | |
| true, | |
| ); | |
| assert_rgba_matches_rgb_and_has_opaque_alpha(&rgb, &rgba, width); | |
| } | |
| } |
| /// `use_simd = false` forces scalar. SIMD per-arch routes land in the | ||
| /// follow-up Ship 8 Tranche 5a PR — for now this dispatcher always | ||
| /// runs the scalar reference regardless of `use_simd`. |
There was a problem hiding this comment.
The docs currently say "use_simd = false forces scalar" but the implementation ignores use_simd entirely (let _ = use_simd; and always scalar). To avoid misleading callers, please document this explicitly in the rustdoc in the same style used elsewhere (e.g. "use_simd is currently a no-op"), until the SIMD branches land.
| assert!(rgba_out.len() >= rgba_min, "rgba_out row too short"); | ||
|
|
||
| let _ = use_simd; // SIMD per-arch routes land in Ship 8 Tranche 5a. | ||
| scalar::yuv_420p_n_to_rgba_row::<9>(y, u_half, v_half, rgba_out, width, matrix, full_range); |
There was a problem hiding this comment.
This call (and several nearby ones) is long enough that cargo fmt will re-wrap it; CI enforces cargo fmt -- --check. Please run rustfmt before merging so formatting matches the project standard.
| scalar::yuv_420p_n_to_rgba_row::<9>(y, u_half, v_half, rgba_out, width, matrix, full_range); | |
| scalar::yuv_420p_n_to_rgba_row::<9>( | |
| y, | |
| u_half, | |
| v_half, | |
| rgba_out, | |
| width, | |
| matrix, | |
| full_range, | |
| ); |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
## Summary Adds u8 RGBA SIMD across all 5 backends for high-bit 4:2:0 YUV (`yuv420p9/10/12/14/16`, `p010/p012/p016`) and wires them into the 8 high-bit u8 RGBA dispatchers in \`src/row/mod.rs\`. Builds on the scalar prep + dispatcher signatures landed in PR #24. The companion u16 RGBA SIMD work is deferred to Tranche 5b. ## Changes - **5 SIMD backends** — NEON / SSE4.1 / AVX2 / AVX-512 / wasm simd128 — each gain a const-generic \`*_to_rgb_or_rgba_row<BITS, ALPHA>\` template across 4 kernel families: - planar BITS-generic: \`yuv_420p_n_to_rgb_or_rgba_row<BITS={9,10,12,14}, ALPHA>\` - semi-planar BITS-generic: \`p_n_to_rgb_or_rgba_row<BITS={10,12}, ALPHA>\` (P016 has its own family) - 16-bit planar: \`yuv_420p16_to_rgb_or_rgba_row<ALPHA>\` - 16-bit semi-planar: \`p16_to_rgb_or_rgba_row<ALPHA>\` Existing RGB and new RGBA wrappers are thin shims over the shared template. Only the store (\`vst3q_u8\` vs \`vst4q_u8\`, \`write_rgb_*\` vs \`write_rgba_*\`) and the scalar tail dispatch branch on \`ALPHA\`; per-pixel math is unchanged. - **8 high-bit u8 RGBA dispatchers** wired in \`src/row/mod.rs\` (\`yuv420p9/10/12/14/16_to_rgba_row\`, \`p010/p012/p016_to_rgba_row\`) — replace the prior \`let _ = use_simd\` stubs with the standard \`cfg_select!\` per-arch route block, mirroring the existing RGB dispatchers. \`use_simd = false\` still forces scalar. - **Per-backend RGBA equivalence tests** — ~30 new \`#[test]\` functions across the 5 backend test modules. Each new x86 test gates on \`is_x86_feature_detected!\` so the suite stays clean under sanitizer/Miri/non-feature-flagged CI runners. - Compile-time \`const { assert!(BITS == ...) }\` retained on every shared template (was already a Codex-flagged hardening from prior tranches).
Ship 8 Tranche 5 — scalar foundation. Adds RGBA output (both 8-bit and native-depth
u16) for all 8 high-bit-depth 4:2:0 source formats:Yuv420p9 / 10 / 12 / 14 / 16andP010 / P012 / P016. Scalar paths are fully wired and shippable today — the SIMD per-arch routes land in the follow-up Tranche 5a (u8 RGBA) and 5b (u16 RGBA) PRs.This is split out of Tranche 5 (which would have been ~6–8k LOC end-to-end) so the foundational scalar work, the public API surface, and the Strategy A kernel-design pattern can land independently of the per-backend SIMD work.
Scope
Yuv420pNv12,Nv21Yuv422p,Nv16Yuv444pNv24,Nv42+ Strategy A retrofitYuv440pYuv420p9/10/12/14/16+P010/P012/P016(u8 + u16 RGBA)Yuv422p9/10/12/14/16,Yuv440p10/12,P210/P212/P216Yuv444p9/10/12/14/16,P410/P412/P416Usage:
```rust
use colconv::{row, ColorMatrix};
// 10-bit YUV 4:2:0 → 8-bit packed RGBA (Strategy A scalar; SIMD branch
// fills in via Tranche 5a).
row::yuv420p10_to_rgba_row(
y_row, u_half, v_half,
&mut rgba_out,
width,
ColorMatrix::Bt2020Ncl,
/full_range=/ false,
/use_simd=/ true, // accepted today; routed only when 5a lands
);
// P010 (HEVC HDR HW decode) → native-depth u16 packed RGBA (alpha = 0x3FF).
row::p010_to_rgba_u16_row(
y_row, uv_half,
&mut rgba_u16_out,
width,
ColorMatrix::Bt2020Ncl, false, true,
);
```
What's in this PR
Public API — 16 new dispatcher functions in
src/row/mod.rsEach format gets both a u8 RGBA dispatcher and a native-depth u16 RGBA dispatcher, paralleling the existing RGB ones:
yuv420p9_to_rgba_rowyuv420p9_to_rgba_u16_rowyuv420p10_to_rgba_rowyuv420p10_to_rgba_u16_rowp010_to_rgba_rowp010_to_rgba_u16_rowyuv420p12_to_rgba_rowyuv420p12_to_rgba_u16_rowyuv420p14_to_rgba_rowyuv420p14_to_rgba_u16_rowp012_to_rgba_rowp012_to_rgba_u16_rowyuv420p16_to_rgba_rowyuv420p16_to_rgba_u16_rowp016_to_rgba_rowp016_to_rgba_u16_rowThe
use_simd: boolparameter is held on every signature so the follow-up SIMD PRs (5a, 5b) can fill in per-arch branches without breaking callers. Today it's a no-op (let _ = use_simd;with a comment) — every dispatcher always runs the scalar reference. This is functionally correct (scalar matches the eventual SIMD output bit-for-bit) but slower than the eventual SIMD path.Plus
rgba_row_elems(width)helper — parallel to the existingrgba_row_bytes/rgb_row_elems, sizes&mut [u16]RGBA buffers (width × 4u16elements).Kernel work — 8 new const-ALPHA scalar templates + 13 RGBA wrappers + u16 expand helper
Mirrors the
<const ALPHA: bool>template pattern established by PRs #16–22 for the 8-bit RGBA paths. Existing*_to_rgb_*_row<...>functions are now thin::<false>wrappers — zero behavior change.Yuv420pfamily)P010family)yuv_420p_n_to_rgb_or_rgba_row<9, ALPHA>(+ u16 sibling)<10, ALPHA>p_n_to_rgb_or_rgba_row<10, ALPHA>(+ u16)<12, ALPHA><12, ALPHA><14, ALPHA>yuv_420p16_to_rgb_or_rgba_row<ALPHA>(+ u16; non-generic, i64 chroma for u16 path)p16_to_rgb_or_rgba_row<ALPHA>(+ u16, i64 chroma)Compile-time BITS guards: the Pn shared kernels now use
const { assert!(BITS == 10 \|\| BITS == 12) }instead ofdebug_assert!— this is a hard fix for a pre-existing release-only corruption trap that the new RGBA paths would have widened. If a future dispatcher accidentally instantiatedp_n_to_rgb_or_rgba_*_row::<16>it would silently route the i32-chroma path at 16-bit input and overflow before clamp; the compile-time assertion now fails monomorphization for any BITS outside {10, 12}, eliminating that bug class. Routing for P016 stays unambiguous — its dedicatedp16_to_rgb_or_rgba_*_row<ALPHA>kernel uses i64 chroma multiply, and the comments on the Pn wrappers explicitly call out "P016 has its own kernel family — never routed here."`expand_rgb_u16_to_rgba_u16_row` — u16 analogue of the existing
expand_rgb_to_rgba_rowhelper. Strategy A on the u16 path (run RGB once, fan-out to RGBA via memory-bound copy + alpha pad). Alpha element is(1 << BITS) - 1, resolved at compile time per format. Marked#[allow(dead_code)]for now — the consumer (MixedSinkerwith_rgba_u16Strategy A) lands in 5b alongside the rest of the high-bit sinker integration.What's deferred
let _ = use_simd;stubs in the new u8 RGBA dispatchers. Removes theuse_simdno-op annotation. ~3k LOC.write_rgba_u16_*SIMD store helpers (parallel to existingwrite_rgb_u16_*insrc/row/arch/x86_common.rsand per-backend equivalents) + 8MixedSinker<F>::with_rgba_u16blocks across the high-bit sinker module + Strategy A wiring on the u16 path (consumesexpand_rgb_u16_to_rgba_u16_row). ~3–4k LOC.ALPHA = falsehalf of every test already covers the new shared kernel body).Yuv420p10until 5b (which is when MixedSinkerwith_rgba_u16forYuv420p10lands).Resolved Codex review findings during this branch
debug_assert!(BITS == 10 \|\| BITS == 12)on the Pn kernels: a release-only corruption trap if a future dispatcher misroutes P016 through the Pn family. Upgraded toconst { assert!(...) }(compile-time monomorphization failure) on both the u8 and u16 Pn shared kernels. Comments on the Pn wrappers corrected to call out P016's separate kernel family explicitly.PixelSinkcontract #2 — initial scalar-only branch had no public dispatchers, so the new RGBA paths were unreachable. Added all 16 dispatchers in this PR; the new RGBA scalar wrappers now have callers and the#[allow(dead_code)]annotations were removed (kept only onexpand_rgb_u16_to_rgba_u16_rowwhose consumer is in 5b).Verification
cargo test --lib— 479 passed; 0 failed (unchanged — pure foundation; tests land with SIMD in 5a/5b)cargo test --doc— 1 passedRUSTFLAGS=-Dwarnings cargo clippy --lib --tests— clean (matches CI)RUSTFLAGS=-Dwarnings cargo clippy --lib --no-default-features— cleanRUSTFLAGS=-Dwarnings cargo clippy --lib --no-default-features --features alloc— cleancargo check --lib --target wasm32-unknown-unknown— cleancargo check --lib --target x86_64-unknown-linux-gnu— cleanTest plan
test,test-sde-avx512,cross,coverage,clippy,build,miri-*jobs.yuv420p10_to_rgb_rowetc.) still produce identical output — the_or_rgba::<false>refactor is a pure restructuring with no behavior change.🤖 Generated with Claude Code