
feat(row): Ship 8 — high-bit 4:4:4 RGBA scalar (SIMD lands in 6b/6c) #29

Merged: al8n merged 1 commit into main from feat/ship8-rgba-high-bit-444-scalar on Apr 27, 2026

@uqio (Collaborator) commented on Apr 27, 2026

Summary

First of three sub-PRs that add high-bit-depth 4:4:4 RGBA support, exactly mirroring the 3-PR structure that landed high-bit 4:2:0 RGBA in PRs #24, #25, and #26.

This PR (analogous to #24): scalar reference RGBA kernels + 16 public dispatchers. The dispatchers are SIMD-shaped (accept use_simd: bool) but always route to scalar — the use_simd flag is held in the signature so the follow-up SIMD PRs (6b: u8 SIMD, 6c: u16 SIMD + sinker) wire per-arch routes without breaking call-site signatures.

Why ship the scalar layer first

Three reasons match the 4:2:0 precedent:

  1. Call-site stability: callers can wire `with_rgba` / `with_rgba_u16` once; subsequent SIMD PRs become non-API-breaking changes.
  2. Reviewability: the scalar layer is the numerical reference — easier to validate in isolation than alongside 5 backends × 8 kernels of SIMD math.
  3. Bisect surface: if a future SIMD PR introduces a regression, the scalar reference is a stable byte-identical baseline to compare against.

The dispatchers' `use_simd` parameter is deliberately ignored in this PR — see the "Out of scope" + "Codex review verdict" sections below.

Changes

Scalar (src/row/scalar.rs)

8 new const-ALPHA template wrappers + 8 thin RGB shims over those templates. Mirrors the established 4:2:0 pattern (e.g. yuv_420p_n_to_rgb_or_rgba_row<BITS, ALPHA>):

| Family | Shared template | RGB wrapper | RGBA wrapper |
|---|---|---|---|
| Yuv444p_n (BITS ∈ {9,10,12,14}) | `yuv_444p_n_to_rgb_or_rgba_row<BITS, ALPHA>` | `yuv_444p_n_to_rgb_row<BITS>` | `yuv_444p_n_to_rgba_row<BITS>` |
| Yuv444p_n u16 | `yuv_444p_n_to_rgb_or_rgba_u16_row<BITS, ALPHA>` | `yuv_444p_n_to_rgb_u16_row<BITS>` | `yuv_444p_n_to_rgba_u16_row<BITS>` |
| Yuv444p16 (16-bit dedicated) | `yuv_444p16_to_rgb_or_rgba_row<ALPHA>` | `yuv_444p16_to_rgb_row` | `yuv_444p16_to_rgba_row` |
| Yuv444p16 u16 | `yuv_444p16_to_rgb_or_rgba_u16_row<ALPHA>` | `yuv_444p16_to_rgb_u16_row` | `yuv_444p16_to_rgba_u16_row` |
| P_n_444 (BITS ∈ {10,12}) | `p_n_444_to_rgb_or_rgba_row<BITS, ALPHA>` | `p_n_444_to_rgb_row<BITS>` | `p_n_444_to_rgba_row<BITS>` |
| P_n_444 u16 | `p_n_444_to_rgb_or_rgba_u16_row<BITS, ALPHA>` | `p_n_444_to_rgb_u16_row<BITS>` | `p_n_444_to_rgba_u16_row<BITS>` |
| P_n_444_16 (P416) | `p_n_444_16_to_rgb_or_rgba_row<ALPHA>` | `p_n_444_16_to_rgb_row` | `p_n_444_16_to_rgba_row` |
| P_n_444_16 u16 | `p_n_444_16_to_rgb_or_rgba_u16_row<ALPHA>` | `p_n_444_16_to_rgb_u16_row` | `p_n_444_16_to_rgba_u16_row` |

Alpha contracts:

  • u8 RGBA: 0xFF (always opaque, no alpha plane in source)
  • u16 RGBA, BITS-generic: (1 << BITS) - 1
  • u16 RGBA, 16-bit dedicated: 0xFFFF

Each shared template preserves the original RGB function's const { assert!(BITS == ...) } BITS guard.
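
For readers unfamiliar with the pattern, here is a minimal sketch of a const-`ALPHA` shared template with its two thin wrappers and a compile-time `BITS` guard. It is an illustration only, not the crate's kernels: the `gray_n_*` names and `opaque_alpha_u16` are hypothetical, the pixel math is a gray-only stand-in that ignores chroma, and the inline `const { ... }` block assumes Rust 1.79 or newer.

```rust
/// Hypothetical shared template: ALPHA = false writes 3 bytes per pixel,
/// ALPHA = true writes 4 with a constant opaque alpha.
/// The "conversion" is a gray-only stand-in (luma narrowed to 8 bits),
/// NOT real YUV -> RGB math.
pub fn gray_n_to_rgb_or_rgba_row<const BITS: u32, const ALPHA: bool>(
    y: &[u16],
    out: &mut [u8],
) {
    // Compile-time guard: instantiating any other BITS fails to build.
    const { assert!(BITS == 9 || BITS == 10 || BITS == 12 || BITS == 14) };

    let bpp = if ALPHA { 4 } else { 3 };
    assert!(out.len() >= y.len() * bpp, "output row too short");

    for (px, &s) in out.chunks_exact_mut(bpp).zip(y) {
        // Narrow a BITS-deep sample to 8 bits; out-of-range input saturates.
        let v = (s >> (BITS - 8)).min(255) as u8;
        px[0] = v;
        px[1] = v;
        px[2] = v;
        if ALPHA {
            px[3] = 0xFF; // u8 RGBA contract: always opaque
        }
    }
}

/// Thin RGB shim over the template.
pub fn gray_n_to_rgb_row<const BITS: u32>(y: &[u16], out: &mut [u8]) {
    gray_n_to_rgb_or_rgba_row::<BITS, false>(y, out)
}

/// Thin RGBA shim over the template.
pub fn gray_n_to_rgba_row<const BITS: u32>(y: &[u16], out: &mut [u8]) {
    gray_n_to_rgb_or_rgba_row::<BITS, true>(y, out)
}

/// Opaque alpha for a BITS-generic u16 output: (1 << BITS) - 1.
/// Computed in u32 so BITS = 16 yields 0xFFFF; only meaningful for BITS <= 16.
pub const fn opaque_alpha_u16<const BITS: u32>() -> u16 {
    ((1u32 << BITS) - 1) as u16
}
```

Because `ALPHA` is a const generic, the `if ALPHA` branches are resolved per monomorphization, so the RGB and RGBA wrappers share one body without a runtime flag.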

Dispatchers (src/row/mod.rs)

16 new public dispatchers under a new // ---- High-bit 4:4:4 RGBA dispatchers (Ship 8 Tranche 6 prep) ---- section header:

  • u8 RGBA (8): yuv444p9_to_rgba_row, yuv444p10_to_rgba_row, yuv444p12_to_rgba_row, yuv444p14_to_rgba_row, yuv444p16_to_rgba_row, p410_to_rgba_row, p412_to_rgba_row, p416_to_rgba_row.
  • u16 RGBA (8): same names + _u16_row suffix.

Each dispatcher validates slice lengths (mirroring the existing RGB dispatchers' bounds checks), then drops use_simd and calls the scalar reference:

let _ = use_simd; // SIMD per-arch routes land in Ship 8 Tranche 6b (u8) / 6c (u16) PR.
scalar::<fn>(...);

This is identical to how PR #24 staged the 4:2:0 dispatchers before #25/#26 wired SIMD.
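
The staging described above can be sketched as follows. This is a simplified, hypothetical dispatcher: `demo_to_rgba_row`, its single-plane signature, and the placeholder scalar body are assumptions for illustration, and the real dispatchers take whatever planes and parameters their RGB siblings take.

```rust
mod scalar {
    /// Hypothetical scalar reference. Placeholder math: treats the input as
    /// 10-bit gray and writes opaque RGBA8.
    pub fn demo_to_rgba_row(y: &[u16], rgba_out: &mut [u8], width: usize) {
        for (px, &s) in rgba_out[..width * 4]
            .chunks_exact_mut(4)
            .zip(&y[..width])
        {
            let v = (s >> 2).min(255) as u8;
            px.copy_from_slice(&[v, v, v, 0xFF]);
        }
    }
}

/// Hypothetical SIMD-shaped dispatcher: validates slice lengths, then always
/// routes to scalar. `use_simd` stays in the signature so a later change can
/// add per-arch routes without touching call sites.
pub fn demo_to_rgba_row(y: &[u16], rgba_out: &mut [u8], width: usize, use_simd: bool) {
    assert!(y.len() >= width, "y row too short");
    assert!(rgba_out.len() >= width * 4, "rgba_out row too short");

    let _ = use_simd; // intentionally ignored until SIMD routes exist
    scalar::demo_to_rgba_row(y, rgba_out, width);
}
```

With this shape, calling the dispatcher with `use_simd = true` and `use_simd = false` must produce byte-identical rows, which is the property callers can rely on until SIMD lands.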

Tests (src/row/scalar/tests.rs)

+6 scalar reference tests, one per kernel family:

  • yuv_444p_n_to_rgba_row::<10> gray-to-gray (validates u8 alpha = 0xFF)
  • yuv_444p_n_to_rgba_u16_row::<10> gray-to-gray (validates alpha = 1023)
  • yuv_444p16_to_rgba_row gray-to-gray (validates alpha = 0xFF)
  • yuv_444p16_to_rgba_u16_row gray-to-gray (validates alpha = 0xFFFF)
  • p_n_444_to_rgba_row::<10> gray-to-gray
  • p_n_444_16_to_rgba_u16_row gray-to-gray (validates alpha = 0xFFFF)

These exercise the scalar code path and pin the alpha contract per family. SIMD equivalence tests (per backend × per kernel × per BITS) land alongside the SIMD wiring in 6b/6c.
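
As a rough picture of what a gray-to-gray check looks like: feed neutral chroma, then require R = G = B and the family's opaque alpha. The sketch below is self-contained and uses a hypothetical stand-in kernel (`demo_yuv444p10_to_rgba_row`, full-range BT.601 coefficients in `f32`), which is an assumption for illustration and not the crate's fixed-point kernel or its actual test code.

```rust
/// Hypothetical full-range 10-bit YUV 4:4:4 -> RGBA8 row.
/// Floating-point stand-in; panics if `u` or `v` is shorter than `y`.
pub fn demo_yuv444p10_to_rgba_row(y: &[u16], u: &[u16], v: &[u16], out: &mut [u8]) {
    for (i, px) in out.chunks_exact_mut(4).enumerate().take(y.len()) {
        let yf = y[i] as f32;
        let cb = u[i] as f32 - 512.0; // 512 is neutral chroma at 10 bits
        let cr = v[i] as f32 - 512.0;
        // Scale a 10-bit value to 8 bits with rounding and clamping.
        let to8 = |x: f32| (x * 255.0 / 1023.0).round().clamp(0.0, 255.0) as u8;
        px[0] = to8(yf + 1.402 * cr);
        px[1] = to8(yf - 0.344136 * cb - 0.714136 * cr);
        px[2] = to8(yf + 1.772 * cb);
        px[3] = 0xFF; // u8 alpha contract: always opaque
    }
}

/// Gray-to-gray check: neutral chroma must give R == G == B, alpha == 0xFF.
pub fn assert_gray_to_gray() {
    let width = 4;
    let y = vec![512u16; width];
    let u = vec![512u16; width];
    let v = vec![512u16; width];
    let mut out = vec![0u8; width * 4];

    demo_yuv444p10_to_rgba_row(&y, &u, &v, &mut out);

    for px in out.chunks_exact(4) {
        assert_eq!(px[0], px[1], "R and G differ for gray input");
        assert_eq!(px[1], px[2], "G and B differ for gray input");
        assert_eq!(px[3], 0xFF, "alpha must be opaque");
    }
}
```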

Out of scope (deferred to follow-up sub-PRs)

  • Per-arch SIMD kernels (NEON / SSE4.1 / AVX2 / AVX-512 / wasm simd128) — Tranche 6b for u8 RGBA, 6c for u16 RGBA. The use_simd: bool parameter is held in every dispatcher's signature now so 6b/6c can wire cfg_select! blocks without changing the public API.
  • Sinker integration (MixedSinker<Yuv444p9..16>, <P410/P412/P416>, <Yuv440p10/12>) — Tranche 6c, alongside u16 SIMD.
  • Per-format RGBA equivalence tests (5 backends × 6 widths × 6 matrices × 2 ranges) — land with their respective SIMD backends in 6b/6c.
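
For context on what the deferred equivalence tests check, here is a minimal sketch of a byte-for-byte comparison between a reference row function and a candidate across several widths. Everything here is hypothetical (`rows_match`, `demo_reference`, `demo_broken`, the synthetic input pattern); the real tests additionally sweep backends, matrices, and ranges.

```rust
/// Hypothetical reference kernel: 10-bit gray -> opaque RGBA8.
pub fn demo_reference(y: &[u16], out: &mut [u8]) {
    for (px, &s) in out.chunks_exact_mut(4).zip(y) {
        let v = (s >> 2).min(255) as u8;
        px.copy_from_slice(&[v, v, v, 0xFF]);
    }
}

/// Hypothetical candidate with a deliberate bug: alpha written as 0.
pub fn demo_broken(y: &[u16], out: &mut [u8]) {
    for (px, &s) in out.chunks_exact_mut(4).zip(y) {
        let v = (s >> 2).min(255) as u8;
        px.copy_from_slice(&[v, v, v, 0x00]);
    }
}

/// Runs both kernels on the same synthetic rows and reports the first
/// divergence as Err((width, byte_index)).
pub fn rows_match<F, G>(reference: F, candidate: G, widths: &[usize]) -> Result<(), (usize, usize)>
where
    F: Fn(&[u16], &mut [u8]),
    G: Fn(&[u16], &mut [u8]),
{
    for &w in widths {
        // Deterministic 10-bit test pattern.
        let y: Vec<u16> = (0..w).map(|i| ((i * 37) % 1024) as u16).collect();
        let mut expected = vec![0u8; w * 4];
        let mut actual = vec![0u8; w * 4];
        reference(&y, &mut expected);
        candidate(&y, &mut actual);
        if let Some(i) = expected.iter().zip(&actual).position(|(p, q)| p != q) {
            return Err((w, i));
        }
    }
    Ok(())
}
```

Odd widths such as 1, 7, or 33 are worth including in such a sweep, since vectorized kernels typically handle the row tail on a separate path.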

Codex adversarial review verdict

Verdict: needs-attention, with one finding flagging that use_simd is silently ignored in the new dispatchers. This is intentional and matches the established precedent set by PR #24 (titled "SIMD lands in 5a/5b"); see PR #24 description for the same rationale. The flag will become meaningful when 6b (u8 SIMD) and 6c (u16 SIMD) wire per-arch routes — as it did for the 4:2:0 family in PRs #25 and #26.

Codex did not flag any correctness or design issues with the scalar refactor itself.

Test plan

  • cargo test --lib: 513 pass (was 507 before the new scalar tests; +6 in this PR)
  • cargo check --tests --lib clean across host (aarch64-darwin), x86_64-unknown-freebsd, wasm32-unknown-unknown
  • RUSTFLAGS="-Dwarnings" cargo clippy --lib --tests clean (zero dead-code warnings: every new template is consumed by its RGB+RGBA wrapper pair, and every new dispatcher consumes its scalar)
  • cargo doc --lib --no-deps clean (56 doc warnings, all pre-existing baseline)

Follow-ups

  • Tranche 6b: u8 RGBA SIMD across all 5 backends + per-backend equivalence tests.
  • Tranche 6c: u16 RGBA SIMD across all 5 backends + sinker integration (Yuv444p9/10/12/14/16, P410/P412/P416, Yuv440p10/12) + sinker tests.

🤖 Generated with Claude Code

@al8n changed the title from "update" to "feat(row): Ship 8 — high-bit 4:4:4 RGBA scalar (SIMD lands in 6b/6c)" on Apr 27, 2026
@al8n requested a review from Copilot on April 27, 2026 00:12

Copilot AI left a comment


Pull request overview

Adds the scalar-reference layer and public row-level API surface for high-bit-depth 4:4:4 → RGBA conversions (u8 and native-depth u16), mirroring the earlier high-bit 4:2:0 staged rollout so SIMD wiring can land later without signature churn.

Changes:

  • Add scalar shared-template kernels for high-bit 4:4:4 RGBA across planar (Yuv444p{9,10,12,14,16}) and semi-planar (P410/P412/P416) families, with constant-opaque alpha.
  • Add 16 new public row dispatchers in row:: for u8 RGBA and u16 RGBA outputs (currently scalar-routed; use_simd intentionally ignored for now).
  • Add scalar reference tests that pin the alpha contract for representative kernel families.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| `src/row/scalar.rs` | Introduces shared `*_to_rgb_or_rgba_*` scalar templates and RGBA wrappers for high-bit 4:4:4 planar + semi-planar kernels. |
| `src/row/mod.rs` | Adds 16 new public high-bit 4:4:4 RGBA dispatchers (u8 + u16) that validate slice lengths and route to scalar. |
| `src/row/scalar/tests.rs` | Adds 6 scalar tests validating gray mapping sanity and the required opaque-alpha values per family. |


Comment thread src/row/scalar.rs
Comment on lines 971 to 975
  // Compile-time guard — fails monomorphization for any BITS outside
- // {10, 12, 14}. The 16-bit path lives in `yuv_444p16_to_rgb_row`
+ // {9, 10, 12, 14}. The 16-bit path lives in `yuv_444p16_to_rgb_row`
  // (i32 u8-output kernel family). Without this guard a caller
- // invoking ::<16> would reach the NEON clamp where
+ // invoking ::<16, _> would reach the NEON clamp where
  // `(1 << BITS) - 1 as i16` silently wraps to -1.

Copilot AI Apr 27, 2026


The compile-time guard comment here references a “NEON clamp”, but this is the scalar kernel implementation. Consider rewording this comment to describe the generic 16-bit footgun (e.g., i16/i32 clamping/overflow behavior) without mentioning a specific SIMD backend, or move the backend-specific rationale to the SIMD dispatcher/docs.
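
The backend-independent footgun behind that guard can be shown with plain integer arithmetic. The sketch below is an illustration only: `max_sample_i16`, `max_sample_i32`, and `clamp_to_depth` are hypothetical helpers, not functions from this PR.

```rust
/// Largest sample value for a given bit depth, narrowed to i16.
/// For bits = 16 the i32 value 65535 (0xFFFF) reinterprets as -1.
pub fn max_sample_i16(bits: u32) -> i16 {
    ((1i32 << bits) - 1) as i16
}

/// The same limit kept in i32, which holds 65535 without wrapping.
pub fn max_sample_i32(bits: u32) -> i32 {
    (1i32 << bits) - 1
}

/// Clamp a sample to [0, max] using the i16 limit.
/// With bits = 16 the upper bound is -1, so every input collapses to 0.
pub fn clamp_to_depth(v: i16, bits: u32) -> i16 {
    v.min(max_sample_i16(bits)).max(0)
}
```

For 14-bit input the i16 limit is 16383 and clamping behaves as expected; at 16 bits the limit becomes -1 and the clamp zeroes every sample, which is why the 16-bit formats use a dedicated i32 kernel family and the `BITS`-generic template rejects 16 at compile time.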

@al8n merged commit 85277c2 into main on Apr 27, 2026
47 checks passed
@al8n deleted the feat/ship8-rgba-high-bit-444-scalar branch on April 27, 2026 00:17
@al8n pushed a commit that referenced this pull request on Apr 27, 2026:
The 4:4:4 high-bit YUV planar SIMD docs claimed `BITS ∈ {10, 12, 14}`
across all 5 backends, but the const-assert in every implementation
accepts `BITS == 9 || 10 || 12 || 14` and the `yuv444p9_to_rgba_row`
public dispatcher (added in PR #29) instantiates the kernel with
`<9>`. The doc string was stale from before BITS=9 was added in Ship 6b.

Updates both the const-generic bound (`{10, 12, 14}` → `{9, 10, 12, 14}`)
and the prose bit-list (`10/12/14-bit` → `9/10/12/14-bit`) on every
4:4:4 planar SIMD doc — covers the u8 RGB, u8 RGBA (added in this PR),
and u16 RGB siblings across NEON, SSE4.1, AVX2, AVX-512, and wasm
simd128. 23 lines updated total.

Addresses Copilot review comments on PR #30. Also retroactively fixes
the matching drift on the u16 RGB and pre-existing u8 RGB docs that
Copilot didn't explicitly flag but had identical wording.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>