feat(row): Ship 8 — high-bit 4:2:0 RGBA scalar (SIMD lands in 5a/5b)#24

Merged: uqio merged 5 commits into main from feat/ship8-rgba-high-bit-420 on Apr 26, 2026

@uqio uqio commented Apr 26, 2026

Ship 8 Tranche 5 — scalar foundation. Adds RGBA output (both 8-bit and native-depth u16) for all 8 high-bit-depth 4:2:0 source formats: Yuv420p9 / 10 / 12 / 14 / 16 and P010 / P012 / P016. Scalar paths are fully wired and shippable today — the SIMD per-arch routes land in the follow-up Tranche 5a (u8 RGBA) and 5b (u16 RGBA) PRs.

This is split out of Tranche 5 (which would have been ~6–8k LOC end-to-end) so the foundational scalar work, the public API surface, and the Strategy A kernel-design pattern can land independently of the per-backend SIMD work.

Scope

| # | Tranche | Formats | Status |
|---|---|---|---|
| 1 | 4:2:0 planar | Yuv420p | ✅ shipped (PR #16) |
| 2 | 4:2:0 semi-planar | Nv12, Nv21 | ✅ shipped (PR #17) |
| 3 | 4:2:2 planar + semi-planar | Yuv422p, Nv16 | ✅ shipped (PR #18) |
| 4a | 4:4:4 planar | Yuv444p | ✅ shipped (PR #19) |
| 4b | 4:4:4 semi-planar | Nv24, Nv42 + Strategy A retrofit | ✅ shipped (PR #20) |
| 4c | 4:4:0 planar | Yuv440p | ✅ shipped (PR #22) |
| 5-prep | High-bit 4:2:0 RGBA, scalar | Yuv420p9/10/12/14/16 + P010/P012/P016 (u8 + u16 RGBA) | this PR |
| 5a | High-bit 4:2:0, u8 RGBA SIMD | same formats | next |
| 5b | High-bit 4:2:0, u16 RGBA SIMD + sinker integration | same formats | after 5a |
| 6 | High-bit-depth 4:2:2 | Yuv422p9/10/12/14/16, Yuv440p10/12, P210/P212/P216 | |
| 7 | High-bit-depth 4:4:4 | Yuv444p9/10/12/14/16, P410/P412/P416 | |

Usage:

```rust
use colconv::{row, ColorMatrix};

// 10-bit YUV 4:2:0 → 8-bit packed RGBA (Strategy A scalar; SIMD branch
// fills in via Tranche 5a).
row::yuv420p10_to_rgba_row(
    y_row, u_half, v_half,
    &mut rgba_out,
    width,
    ColorMatrix::Bt2020Ncl,
    /* full_range = */ false,
    /* use_simd = */ true, // accepted today; routed only when 5a lands
);

// P010 (HEVC HDR HW decode) → native-depth u16 packed RGBA (alpha = 0x3FF).
row::p010_to_rgba_u16_row(
    y_row, uv_half,
    &mut rgba_u16_out,
    width,
    ColorMatrix::Bt2020Ncl,
    /* full_range = */ false,
    /* use_simd = */ true,
);
```

What's in this PR

Public API — 16 new dispatcher functions in src/row/mod.rs

Each format gets both a u8 RGBA dispatcher and a native-depth u16 RGBA dispatcher, paralleling the existing RGB ones:

| Format | Bit depth | u8 RGBA | u16 RGBA |
|---|---|---|---|
| Yuv420p9 | 9 | `yuv420p9_to_rgba_row` | `yuv420p9_to_rgba_u16_row` |
| Yuv420p10 | 10 | `yuv420p10_to_rgba_row` | `yuv420p10_to_rgba_u16_row` |
| P010 | 10 | `p010_to_rgba_row` | `p010_to_rgba_u16_row` |
| Yuv420p12 | 12 | `yuv420p12_to_rgba_row` | `yuv420p12_to_rgba_u16_row` |
| Yuv420p14 | 14 | `yuv420p14_to_rgba_row` | `yuv420p14_to_rgba_u16_row` |
| P012 | 12 | `p012_to_rgba_row` | `p012_to_rgba_u16_row` |
| Yuv420p16 | 16 | `yuv420p16_to_rgba_row` | `yuv420p16_to_rgba_u16_row` |
| P016 | 16 | `p016_to_rgba_row` | `p016_to_rgba_u16_row` |

The use_simd: bool parameter is held on every signature so the follow-up SIMD PRs (5a, 5b) can fill in per-arch branches without breaking callers. Today it's a no-op (let _ = use_simd; with a comment) — every dispatcher always runs the scalar reference. This is functionally correct (scalar matches the eventual SIMD output bit-for-bit) but slower than the eventual SIMD path.
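A minimal sketch of the no-op `use_simd` pattern described above, under stated assumptions: `gray10_to_rgba_row` and `scalar_gray10_to_rgba_row` are hypothetical names, and the toy gray conversion only stands in for the real YUV→RGBA math, which is not reproduced here.

```rust
/// Toy scalar "reference": maps 10-bit luma to 8-bit gray RGBA.
/// Hypothetical stand-in for the real scalar kernel.
fn scalar_gray10_to_rgba_row(y: &[u16], rgba_out: &mut [u8], width: usize) {
    for (px, &luma) in rgba_out.chunks_exact_mut(4).zip(y).take(width) {
        let g = (luma.min(1023) >> 2) as u8; // 10-bit → 8-bit
        px.copy_from_slice(&[g, g, g, 0xFF]); // opaque alpha
    }
}

/// Hypothetical dispatcher: `use_simd` stays on the signature but is ignored.
pub fn gray10_to_rgba_row(y: &[u16], rgba_out: &mut [u8], width: usize, use_simd: bool) {
    assert!(y.len() >= width, "y row too short");
    assert!(rgba_out.len() >= width * 4, "rgba_out row too short");

    let _ = use_simd; // no-op until a SIMD route exists
    scalar_gray10_to_rgba_row(y, rgba_out, width);
}
```

Because the flag is discarded, `use_simd = true` and `use_simd = false` produce identical rows in this sketch, which matches the behaviour the PR describes for the new dispatchers until 5a lands.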

Plus rgba_row_elems(width) helper — parallel to the existing rgba_row_bytes / rgb_row_elems, sizes &mut [u16] RGBA buffers (width × 4 u16 elements).
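For illustration, the sizing helper could look roughly like the following. This is an assumed shape inferred from the description (width × 4 u16 elements), not the crate's actual source; the `_sketch` suffix and `alloc_rgba_u16_row` are hypothetical names added for the example.

```rust
/// Assumed shape of the helper: number of u16 elements in one RGBA row.
pub const fn rgba_row_elems_sketch(width: usize) -> usize {
    width * 4
}

/// Hypothetical convenience: allocate a zeroed u16 RGBA row of the right size.
pub fn alloc_rgba_u16_row(width: usize) -> Vec<u16> {
    vec![0u16; rgba_row_elems_sketch(width)]
}
```

The count is in elements, not bytes: a 1920-pixel row needs 7680 `u16` values, which occupy twice that many bytes.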

Kernel work — 8 new const-ALPHA scalar templates + 13 RGBA wrappers + u16 expand helper

Mirrors the <const ALPHA: bool> template pattern established by PRs #16–22 for the 8-bit RGBA paths. Existing *_to_rgb_*_row<...> functions are now thin ::<false> wrappers — zero behavior change.

| BITS | Planar (Yuv420p family) | Semi-planar (P010 family) |
|---|---|---|
| 9 | `yuv_420p_n_to_rgb_or_rgba_row<9, ALPHA>` (+ u16 sibling) | |
| 10 | `<10, ALPHA>` | `p_n_to_rgb_or_rgba_row<10, ALPHA>` (+ u16) |
| 12 | `<12, ALPHA>` | `<12, ALPHA>` |
| 14 | `<14, ALPHA>` | |
| 16 | `yuv_420p16_to_rgb_or_rgba_row<ALPHA>` (+ u16; non-generic, i64 chroma for u16 path) | `p16_to_rgb_or_rgba_row<ALPHA>` (+ u16, i64 chroma) |
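The `<const ALPHA: bool>` template pattern can be sketched as follows. This is an illustrative reduction, not the crate's kernels: the `gray_*` names are hypothetical and the per-pixel math is replaced by a plain gray copy so that only the RGB/RGBA split is visible.

```rust
/// Shared kernel body: ALPHA selects 3 or 4 output bytes per pixel.
/// The `if ALPHA` branches are resolved per instantiation.
fn gray_to_rgb_or_rgba_row<const ALPHA: bool>(y: &[u8], out: &mut [u8], width: usize) {
    let bpp = if ALPHA { 4 } else { 3 };
    assert!(y.len() >= width, "y row too short");
    assert!(out.len() >= width * bpp, "output row too short");

    for (px, &luma) in out.chunks_exact_mut(bpp).zip(y).take(width) {
        px[0] = luma;
        px[1] = luma;
        px[2] = luma;
        if ALPHA {
            px[3] = 0xFF; // opaque alpha, only in the RGBA instantiation
        }
    }
}

/// Thin `::<false>` wrapper: the pre-existing RGB entry point keeps its signature.
pub fn gray_to_rgb_row(y: &[u8], rgb_out: &mut [u8], width: usize) {
    gray_to_rgb_or_rgba_row::<false>(y, rgb_out, width);
}

/// Thin `::<true>` wrapper: the new RGBA entry point.
pub fn gray_to_rgba_row(y: &[u8], rgba_out: &mut [u8], width: usize) {
    gray_to_rgb_or_rgba_row::<true>(y, rgba_out, width);
}
```

Since both wrappers run the same body, the colour channels of the RGBA output should match the RGB output pixel for pixel, with only the alpha byte added.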

Compile-time BITS guards: the Pn shared kernels now use `const { assert!(BITS == 10 || BITS == 12) }` instead of `debug_assert!`. This is a hard fix for a pre-existing release-only corruption trap that the new RGBA paths would have widened. If a future dispatcher accidentally instantiated `p_n_to_rgb_or_rgba_*_row::<16>`, it would silently route 16-bit input through the i32-chroma path and overflow before the clamp; the compile-time assertion now fails monomorphization for any BITS outside {10, 12}, eliminating that bug class. Routing for P016 stays unambiguous: its dedicated `p16_to_rgb_or_rgba_*_row<ALPHA>` kernel uses an i64 chroma multiply, and the comments on the Pn wrappers explicitly call out "P016 has its own kernel family — never routed here."
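A small sketch of the compile-time guard, assuming a toolchain with inline `const` blocks (stable since Rust 1.79). `pn_sample_to_native` is a hypothetical helper, not one of the PR's kernels; it only unpacks a high-bit-packed P010/P012 sample so the guard has something to protect.

```rust
/// Toy stand-in for a Pn shared kernel: only BITS = 10 or 12 may instantiate it.
fn pn_sample_to_native<const BITS: u32>(packed: u16) -> u16 {
    // Evaluated when this function is monomorphized; any other BITS
    // turns into a build error instead of a release-mode surprise.
    const { assert!(BITS == 10 || BITS == 12) };

    // P010/P012 keep the active bits in the high end of each u16.
    packed >> (16 - BITS)
}
```

With this in place, `pn_sample_to_native::<10>(0xFFC0)` yields `0x3FF`, while an instantiation such as `pn_sample_to_native::<16>(..)` is expected to fail the build at monomorphization, unlike a `debug_assert!`, which is compiled out of release builds.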

`expand_rgb_u16_to_rgba_u16_row` — u16 analogue of the existing expand_rgb_to_rgba_row helper. Strategy A on the u16 path (run RGB once, fan-out to RGBA via memory-bound copy + alpha pad). Alpha element is (1 << BITS) - 1, resolved at compile time per format. Marked #[allow(dead_code)] for now — the consumer (MixedSinker with_rgba_u16 Strategy A) lands in 5b alongside the rest of the high-bit sinker integration.
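The fan-out step could be sketched like this. The signature, the `BITS` const parameter placement, and the `_sketch` name are assumptions made for illustration; only the behaviour (copy three channels, pad alpha with `(1 << BITS) - 1`) comes from the description above.

```rust
/// Sketch of a u16 RGB → RGBA expansion with a compile-time alpha value.
pub fn expand_rgb_u16_to_rgba_u16_sketch<const BITS: u32>(
    rgb: &[u16],
    rgba_out: &mut [u16],
    width: usize,
) {
    const { assert!(BITS >= 9 && BITS <= 16) };
    assert!(rgb.len() >= width * 3, "rgb row too short");
    assert!(rgba_out.len() >= width * 4, "rgba_out row too short");

    // Opaque alpha at native depth, e.g. 0x3FF for 10-bit output.
    let alpha = ((1u32 << BITS) - 1) as u16;

    for (dst, src) in rgba_out
        .chunks_exact_mut(4)
        .zip(rgb.chunks_exact(3))
        .take(width)
    {
        dst[..3].copy_from_slice(src); // memory-bound copy of R, G, B
        dst[3] = alpha; // alpha pad
    }
}
```

The appeal of this design is that the colour math runs once for RGB and the RGBA row is derived by a cheap copy, at the cost of one extra pass over the row.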

What's deferred

  • Tranche 5a (next PR): Add u8 RGBA SIMD per-arch routes across all 5 backends (NEON, SSE4.1, AVX2, AVX-512, wasm simd128) for all 8 formats. Fills in the let _ = use_simd; stubs in the new u8 RGBA dispatchers. Removes the use_simd no-op annotation. ~3k LOC.
  • Tranche 5b (after 5a): Add u16 RGBA SIMD per-arch routes + new write_rgba_u16_* SIMD store helpers (parallel to existing write_rgb_u16_* in src/row/arch/x86_common.rs and per-backend equivalents) + 8 MixedSinker<F>::with_rgba_u16 blocks across the high-bit sinker module + Strategy A wiring on the u16 path (consumes expand_rgb_u16_to_rgba_u16_row). ~3–4k LOC.
  • Tests: format-level + per-backend equivalence tests land alongside the SIMD impls in 5a / 5b — the scalar paths added here are exercised by the existing high-bit RGB tests' kernel reuse (they go through the same const-ALPHA template, so the ALPHA = false half of every test already covers the new shared kernel body).
  • Compile_fail doctest advance: stays pointing at Yuv420p10 until 5b (which is when MixedSinker with_rgba_u16 for Yuv420p10 lands).

Resolved Codex review findings during this branch

  • Finding #1: the pre-existing `debug_assert!(BITS == 10 || BITS == 12)` on the Pn kernels was a release-only corruption trap if a future dispatcher misrouted P016 through the Pn family. Upgraded to `const { assert!(...) }` (compile-time monomorphization failure) on both the u8 and u16 Pn shared kernels. Comments on the Pn wrappers corrected to call out P016's separate kernel family explicitly.
  • Finding #2: the initial scalar-only branch had no public dispatchers, so the new RGBA paths were unreachable. Added all 16 dispatchers in this PR; the new RGBA scalar wrappers now have callers and the `#[allow(dead_code)]` annotations were removed (kept only on `expand_rgb_u16_to_rgba_u16_row`, whose consumer is in 5b).

Verification

  • `cargo test --lib`: 479 passed; 0 failed (unchanged; pure foundation, tests land with SIMD in 5a/5b)
  • `cargo test --doc`: 1 passed
  • `RUSTFLAGS=-Dwarnings cargo clippy --lib --tests`: clean (matches CI)
  • `RUSTFLAGS=-Dwarnings cargo clippy --lib --no-default-features`: clean
  • `RUSTFLAGS=-Dwarnings cargo clippy --lib --no-default-features --features alloc`: clean
  • `cargo check --lib --target wasm32-unknown-unknown`: clean
  • `cargo check --lib --target x86_64-unknown-linux-gnu`: clean

Test plan

  • CI green on test, test-sde-avx512, cross, coverage, clippy, build, miri-* jobs.
  • Spot-check that callers of the existing high-bit RGB dispatchers (yuv420p10_to_rgb_row etc.) still produce identical output — the _or_rgba::<false> refactor is a pure restructuring with no behavior change.
  • Verify the new RGBA dispatchers route through the scalar reference at the expected speed (this PR's intentional regression vs the future SIMD-enabled path; tracked for 5a/5b).

🤖 Generated with Claude Code

@al8n al8n requested a review from Copilot April 26, 2026 10:27
@al8n al8n changed the title from "feat: ship8 rgba high bit 420" to "feat(row): Ship 8 — high-bit 4:2:0 RGBA scalar (SIMD lands in 5a/5b)" Apr 26, 2026

codecov Bot commented Apr 26, 2026

Codecov Report

❌ Patch coverage is 34.38596% with 187 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/row/mod.rs | 0.00% | 142 Missing ⚠️ |
| src/row/scalar.rs | 68.53% | 45 Missing ⚠️ |

Copilot AI left a comment

Pull request overview

Adds high-bit-depth 4:2:0 RGBA support (prep work for “Ship 8” tranche) by introducing scalar RGBA kernels and wiring new public row dispatchers (currently scalar-only).

Changes:

  • Add scalar RGB→RGBA expansion for u16 rows and new RGBA variants for high-bit-depth planar/semi-planar YUV 4:2:0 conversions (u8 RGBA + native-depth u16 RGBA).
  • Refactor several existing scalar RGB kernels into shared RGB/RGBA kernels via a const ALPHA: bool monomorphization.
  • Add public RGBA row dispatchers for Yuv420p{9,10,12,14,16} and P01{0,2,6} plus a new rgba_row_elems() sizing helper.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| src/row/scalar.rs | Introduces/extends scalar kernels to support RGBA outputs (u8 and u16) and adds a u16 RGB→RGBA fan-out helper; refactors into shared RGB/RGBA kernels. |
| src/row/mod.rs | Adds public RGBA dispatchers for high-bit 4:2:0 formats and a u16 RGBA row sizing helper. |

Comment thread src/row/scalar.rs
}
}

// ---- High-bit-depth YUV 4:2:0 → RGB (BITS ∈ {10, 12, 14}) -------------

Copilot AI Apr 26, 2026

The section header comment says BITS ∈ {10, 12, 14}, but this module now supports BITS = 9 as well (via the compile-time guard and new 9-bit RGBA/RGB dispatchers). Please update the header to include 9-bit so documentation matches the actual supported set.

Suggested change:

    - // ---- High-bit-depth YUV 4:2:0 → RGB (BITS ∈ {10, 12, 14}) -------------
    + // ---- High-bit-depth YUV 4:2:0 → RGB (BITS ∈ {9, 10, 12, 14}) ----------

Copilot uses AI. Check for mistakes.
Comment thread src/row/scalar.rs Outdated
Comment thread src/row/scalar.rs
matrix: ColorMatrix,
full_range: bool,
) {
yuv_420p16_to_rgb_or_rgba_row::<false>(y, u_half, v_half, rgb_out, width, matrix, full_range);

Copilot AI Apr 26, 2026

This call is written as a long single line and will be reformatted by cargo fmt (CI runs cargo fmt -- --check). Please run rustfmt or format this call to match the rest of the file’s wrapping style.

Suggested change:

    - yuv_420p16_to_rgb_or_rgba_row::<false>(y, u_half, v_half, rgb_out, width, matrix, full_range);
    + yuv_420p16_to_rgb_or_rgba_row::<false>(
    +     y,
    +     u_half,
    +     v_half,
    +     rgb_out,
    +     width,
    +     matrix,
    +     full_range,
    + );

Comment thread src/row/scalar.rs
y, u_half, v_half, rgba_out, width, matrix, full_range,
);
}

Copilot AI Apr 26, 2026

New RGBA conversion entry points are introduced here, but there are no accompanying scalar tests validating (a) alpha is set to the expected opaque value (0xFF for u8 output / (1<<BITS)-1 for u16 output) and (b) the RGB bytes match the existing *_to_rgb_row outputs for the same inputs. Since src/row/scalar/tests.rs already covers the RGB variants, please add analogous tests for these RGBA wrappers/shared kernels.

Suggested change:

    #[cfg(test)]
    mod tests {
        use super::*;

        fn assert_rgba_matches_rgb_and_has_opaque_alpha(rgb: &[u8], rgba: &[u8], width: usize) {
            for x in 0..width {
                assert_eq!(&rgba[4 * x..4 * x + 3], &rgb[3 * x..3 * x + 3]);
                assert_eq!(rgba[4 * x + 3], 0xFF);
            }
        }

        #[test]
        fn yuv_420_to_rgba_row_matches_rgb_and_sets_opaque_alpha() {
            let width = 4;
            let y = [0u8, 64, 128, 255];
            let u_half = [128u8, 128];
            let v_half = [128u8, 128];
            let mut rgb = [0u8; 12];
            let mut rgba = [0u8; 16];
            yuv_420_to_rgb_row(&y, &u_half, &v_half, &mut rgb, width, ColorMatrix::Bt601, true);
            yuv_420_to_rgba_row(
                &y,
                &u_half,
                &v_half,
                &mut rgba,
                width,
                ColorMatrix::Bt601,
                true,
            );
            assert_rgba_matches_rgb_and_has_opaque_alpha(&rgb, &rgba, width);
        }

        #[test]
        fn yuv_420p10_to_rgba_row_matches_rgb_and_sets_opaque_alpha() {
            let width = 4;
            let y = [0u16, 256, 768, 1023];
            let u_half = [512u16, 512];
            let v_half = [512u16, 512];
            let mut rgb = [0u8; 12];
            let mut rgba = [0u8; 16];
            yuv_420p_n_to_rgb_row::<10>(
                &y,
                &u_half,
                &v_half,
                &mut rgb,
                width,
                ColorMatrix::Bt601,
                true,
            );
            yuv_420p_n_to_rgba_row::<10>(
                &y,
                &u_half,
                &v_half,
                &mut rgba,
                width,
                ColorMatrix::Bt601,
                true,
            );
            assert_rgba_matches_rgb_and_has_opaque_alpha(&rgb, &rgba, width);
        }
    }

Comment thread src/row/mod.rs
Comment on lines +2677 to +2679
/// `use_simd = false` forces scalar. SIMD per-arch routes land in the
/// follow-up Ship 8 Tranche 5a PR — for now this dispatcher always
/// runs the scalar reference regardless of `use_simd`.

Copilot AI Apr 26, 2026

The docs currently say "use_simd = false forces scalar" but the implementation ignores use_simd entirely (let _ = use_simd; and always scalar). To avoid misleading callers, please document this explicitly in the rustdoc in the same style used elsewhere (e.g. "use_simd is currently a no-op"), until the SIMD branches land.

Comment thread src/row/mod.rs
assert!(rgba_out.len() >= rgba_min, "rgba_out row too short");

let _ = use_simd; // SIMD per-arch routes land in Ship 8 Tranche 5a.
scalar::yuv_420p_n_to_rgba_row::<9>(y, u_half, v_half, rgba_out, width, matrix, full_range);

Copilot AI Apr 26, 2026

This call (and several nearby ones) is long enough that cargo fmt will re-wrap it; CI enforces cargo fmt -- --check. Please run rustfmt before merging so formatting matches the project standard.

Suggested change:

    - scalar::yuv_420p_n_to_rgba_row::<9>(y, u_half, v_half, rgba_out, width, matrix, full_range);
    + scalar::yuv_420p_n_to_rgba_row::<9>(
    +     y,
    +     u_half,
    +     v_half,
    +     rgba_out,
    +     width,
    +     matrix,
    +     full_range,
    + );

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@uqio uqio merged commit 32b03b5 into main Apr 26, 2026
43 checks passed
@uqio uqio deleted the feat/ship8-rgba-high-bit-420 branch April 26, 2026 10:40
uqio added a commit that referenced this pull request Apr 26, 2026
## Summary

Adds u8 RGBA SIMD across all 5 backends for high-bit 4:2:0 YUV (`yuv420p9/10/12/14/16`, `p010/p012/p016`) and wires them into the 8 high-bit u8 RGBA dispatchers in `src/row/mod.rs`. Builds on the scalar prep + dispatcher signatures landed in PR #24. The companion u16 RGBA SIMD work is deferred to Tranche 5b.

## Changes

- **5 SIMD backends** (NEON / SSE4.1 / AVX2 / AVX-512 / wasm simd128) each gain a const-generic `*_to_rgb_or_rgba_row<BITS, ALPHA>` template across 4 kernel families:
  - planar BITS-generic: `yuv_420p_n_to_rgb_or_rgba_row<BITS={9,10,12,14}, ALPHA>`
  - semi-planar BITS-generic: `p_n_to_rgb_or_rgba_row<BITS={10,12}, ALPHA>` (P016 has its own family)
  - 16-bit planar: `yuv_420p16_to_rgb_or_rgba_row<ALPHA>`
  - 16-bit semi-planar: `p16_to_rgb_or_rgba_row<ALPHA>`

  Existing RGB and new RGBA wrappers are thin shims over the shared template. Only the store (`vst3q_u8` vs `vst4q_u8`, `write_rgb_*` vs `write_rgba_*`) and the scalar tail dispatch branch on `ALPHA`; per-pixel math is unchanged.

- **8 high-bit u8 RGBA dispatchers** wired in `src/row/mod.rs` (`yuv420p9/10/12/14/16_to_rgba_row`, `p010/p012/p016_to_rgba_row`): these replace the prior `let _ = use_simd` stubs with the standard `cfg_select!` per-arch route block, mirroring the existing RGB dispatchers. `use_simd = false` still forces scalar.

- **Per-backend RGBA equivalence tests**: ~30 new `#[test]` functions across the 5 backend test modules. Each new x86 test gates on `is_x86_feature_detected!` so the suite stays clean under sanitizer/Miri/non-feature-flagged CI runners.

- Compile-time `const { assert!(BITS == ...) }` retained on every shared template (was already a Codex-flagged hardening from prior tranches).
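
The runtime gating mentioned for the x86 equivalence tests can be sketched as below. This is an illustration under assumptions, not the crate's test code: every function name here is hypothetical, and the "backend" kernel simply defers to the scalar path so the sketch runs on any machine.

```rust
/// Reports whether the running CPU supports AVX2.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn avx2_available() -> bool {
    std::is_x86_feature_detected!("avx2")
}

/// On non-x86 targets the feature is never available.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn avx2_available() -> bool {
    false
}

/// Toy scalar reference: gray RGBA with opaque alpha.
fn scalar_gray_to_rgba_row(y: &[u8], out: &mut [u8]) {
    for (px, &luma) in out.chunks_exact_mut(4).zip(y) {
        px.copy_from_slice(&[luma, luma, luma, 0xFF]);
    }
}

/// Placeholder for a backend kernel. A real test would call the AVX2
/// routine here; this sketch defers to scalar instead.
fn backend_gray_to_rgba_row(y: &[u8], out: &mut [u8]) {
    scalar_gray_to_rgba_row(y, out);
}

/// Returns `None` when AVX2 is missing (the check is skipped), otherwise
/// whether the backend output equals the scalar reference.
pub fn avx2_equivalence_check() -> Option<bool> {
    if !avx2_available() {
        return None;
    }
    let y: Vec<u8> = (0..=255).collect();
    let mut expected = vec![0u8; y.len() * 4];
    let mut actual = vec![0u8; y.len() * 4];
    scalar_gray_to_rgba_row(&y, &mut expected);
    backend_gray_to_rgba_row(&y, &mut actual);
    Some(expected == actual)
}
```

Returning early when the feature is absent is what keeps such tests from executing unsupported instructions on runners without the feature.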