feat(yuv): add yuv420p12/14 + P012 via const-generic BITS#6
Conversation
Benchmark Results Summary
Date: 2026-04-19 10:35:00 UTC
Detailed Criterion results have been uploaded as artifacts. Download them from the workflow run to view charts and detailed statistics.
Pull request overview
This PR generalizes existing 10‑bit YUV420p/P010 row kernels to be const‑generic over bit depth (BITS) and updates dispatch/doc references accordingly, as groundwork for adding additional high bit‑depth YUV420p formats.
Changes:
- Generalize scalar P010 row kernels into p_n_* const-generic functions and update scalar tests to call the new entrypoints.
- Update row dispatch (src/row/mod.rs) and multiple SIMD backends to call the renamed/generic kernels for the 10-bit path.
- Update SIMD/scalar equivalence tests and documentation references to the new generic function names (partially).
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| src/row/scalar.rs | Renames P010 scalar kernels to p_n_* and makes them const‑generic over BITS; updates scalar tests to match. |
| src/row/mod.rs | Updates dispatcher calls/docs to use the renamed/generic SIMD entrypoints for 10‑bit. |
| src/row/arch/neon.rs | Generalizes NEON high bit‑depth kernels over BITS (incl. u16 clamp max derived from BITS). |
| src/row/arch/x86_sse41.rs | Renames SIMD entrypoints to const‑generic forms and updates some test scaffolding (but leaves compilation issues). |
| src/row/arch/x86_avx2.rs | Renames SIMD entrypoints to const‑generic forms (but leaves u16 clamp/shift logic and test callsites inconsistent). |
| src/row/arch/x86_avx512.rs | Renames SIMD entrypoints to const‑generic forms (but leaves u16 clamp/shift logic and test callsites inconsistent). |
| src/row/arch/wasm_simd128.rs | Renames SIMD entrypoints to const‑generic forms (test updates appear incomplete in the provided diff). |
Comments suppressed due to low confidence (3)
src/row/arch/x86_avx512.rs:750
p_n_to_rgb_u16_row is generic over BITS, but it still uses OUT_MAX_10 = 1023 and shifts by a fixed 6 bits. This makes non-10-bit instantiations (e.g. P012) incorrect and clamps away valid output. Compute both the shift (16 - BITS) and the max ((1 << BITS) - 1) from BITS.
let coeffs = scalar::Coefficients::for_matrix(matrix);
let (y_off, y_scale, c_scale) = scalar::range_params_n::<BITS, BITS>(full_range);
let bias = scalar::chroma_bias::<BITS>();
const RND: i32 = 1 << 14;
const OUT_MAX_10: i16 = 1023;
// SAFETY: AVX‑512BW availability is the caller's obligation.
unsafe {
let rnd_v = _mm512_set1_epi32(RND);
let y_off_v = _mm512_set1_epi16(y_off as i16);
let y_scale_v = _mm512_set1_epi32(y_scale);
let c_scale_v = _mm512_set1_epi32(c_scale);
let bias_v = _mm512_set1_epi16(bias as i16);
let max_v = _mm512_set1_epi16(OUT_MAX_10);
let zero_v = _mm512_set1_epi16(0);
let cru = _mm512_set1_epi32(coeffs.r_u());
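A minimal scalar sketch of the clamp issue raised in this comment. The helper names out_max, clamp_native, and clamp_fixed_10 are hypothetical and are not functions from the PR; they only illustrate why a ceiling fixed at 1023 discards valid 12-bit output, while a ceiling derived from BITS does not.

```rust
/// Largest value representable in a `BITS`-bit sample (1023 for 10 bits,
/// 4095 for 12 bits). Hypothetical helper, for illustration only.
const fn out_max<const BITS: u32>() -> i32 {
    (1i32 << BITS) - 1
}

/// Clamp an intermediate value into the native `BITS`-bit output range.
fn clamp_native<const BITS: u32>(v: i32) -> u16 {
    v.clamp(0, out_max::<BITS>()) as u16
}

/// What a stale 10-bit constant does: the ceiling is 1023 regardless of
/// the bit depth the caller instantiated.
fn clamp_fixed_10(v: i32) -> u16 {
    v.clamp(0, 1023) as u16
}
```

With these definitions, clamp_native::<12>(3000) keeps the value 3000, whereas clamp_fixed_10(3000) flattens it to 1023, which is the corruption the comment describes for P012.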
src/row/arch/x86_avx2.rs:701
p_n_to_rgb_u16_row is now const-generic over BITS, but it still clamps to OUT_MAX_10 = 1023 and shifts by a fixed 6 bits. That makes BITS=12 (P012) output incorrect and clamps away valid values. Derive both the shift (16 - BITS) and the max ((1 << BITS) - 1) from BITS.
let coeffs = scalar::Coefficients::for_matrix(matrix);
let (y_off, y_scale, c_scale) = scalar::range_params_n::<BITS, BITS>(full_range);
let bias = scalar::chroma_bias::<BITS>();
const RND: i32 = 1 << 14;
const OUT_MAX_10: i16 = 1023;
// SAFETY: AVX2 availability is the caller's obligation.
unsafe {
let rnd_v = _mm256_set1_epi32(RND);
let y_off_v = _mm256_set1_epi16(y_off as i16);
let y_scale_v = _mm256_set1_epi32(y_scale);
let c_scale_v = _mm256_set1_epi32(c_scale);
let bias_v = _mm256_set1_epi16(bias as i16);
let max_v = _mm256_set1_epi16(OUT_MAX_10);
let zero_v = _mm256_set1_epi16(0);
src/row/arch/x86_avx512.rs:640
p_n_to_rgb_row is const-generic over BITS, but it still shifts samples by a fixed 6 bits (_mm512_srli_epi16::<6>). That only matches 10-bit high packing (P010). For BITS=12 (P012) this needs to shift by 16 - BITS (4) to extract the active high bits correctly.
pub(crate) unsafe fn p_n_to_rgb_row<const BITS: u32>(
y: &[u16],
uv_half: &[u16],
rgb_out: &mut [u8],
width: usize,
matrix: ColorMatrix,
full_range: bool,
) {
debug_assert_eq!(width & 1, 0);
debug_assert!(y.len() >= width);
debug_assert!(uv_half.len() >= width);
debug_assert!(rgb_out.len() >= width * 3);
let coeffs = scalar::Coefficients::for_matrix(matrix);
let (y_off, y_scale, c_scale) = scalar::range_params_n::<BITS, 8>(full_range);
let bias = scalar::chroma_bias::<BITS>();
const RND: i32 = 1 << 14;
// SAFETY: AVX‑512BW availability is the caller's obligation.
unsafe {
let rnd_v = _mm512_set1_epi32(RND);
let y_off_v = _mm512_set1_epi16(y_off as i16);
let y_scale_v = _mm512_set1_epi32(y_scale);
let c_scale_v = _mm512_set1_epi32(c_scale);
let bias_v = _mm512_set1_epi16(bias as i16);
let cru = _mm512_set1_epi32(coeffs.r_u());
let crv = _mm512_set1_epi32(coeffs.r_v());
let cgu = _mm512_set1_epi32(coeffs.g_u());
let cgv = _mm512_set1_epi32(coeffs.g_v());
let cbu = _mm512_set1_epi32(coeffs.b_u());
let cbv = _mm512_set1_epi32(coeffs.b_v());
let pack_fixup = _mm512_setr_epi64(0, 2, 4, 6, 1, 3, 5, 7);
let dup_lo_idx = _mm512_setr_epi64(0, 1, 8, 9, 2, 3, 10, 11);
let dup_hi_idx = _mm512_setr_epi64(4, 5, 12, 13, 6, 7, 14, 15);
let mut x = 0usize;
while x + 64 <= width {
let y_low_i16 = _mm512_srli_epi16::<6>(_mm512_loadu_si512(y.as_ptr().add(x).cast()));
let y_high_i16 = _mm512_srli_epi16::<6>(_mm512_loadu_si512(y.as_ptr().add(x + 32).cast()));
let (u_vec, v_vec) = deinterleave_uv_u16_avx512(uv_half.as_ptr().add(x));
let u_vec = _mm512_srli_epi16::<6>(u_vec);
let v_vec = _mm512_srli_epi16::<6>(v_vec);
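To make the shift problem in this comment concrete, here is a small scalar sketch. It assumes the high-bit-packed layout described in the PR (active bits in the top of each u16, low bits zero); unpack_high and unpack_fixed_6 are hypothetical names used only for illustration.

```rust
/// Extract the active `BITS`-bit value from a high-bit-packed `u16`
/// (P010 / P012 style). The shift count is derived from `BITS`.
fn unpack_high<const BITS: u32>(sample: u16) -> u16 {
    sample >> (16 - BITS)
}

/// The stale behaviour: always shift by 6, which is only correct
/// for 10-bit input.
fn unpack_fixed_6(sample: u16) -> u16 {
    sample >> 6
}
```

For a 12-bit mid-gray sample 2048 stored as 2048 << 4, unpack_high::<12> returns 2048, while unpack_fixed_6 returns 512, a quarter of the intended value.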
scalar::yuv_420p_n_to_rgb_row::<10>(&y, &u, &v, &mut rgb_scalar, width, matrix, full_range);
scalar::yuv_420p_n_to_rgb_row::<BITS>(&y, &u, &v, &mut rgb_scalar, width, matrix, full_range);
unsafe {
yuv420p10_to_rgb_row(&y, &u, &v, &mut rgb_simd, width, matrix, full_range);

@@ -419,7 +419,7 @@ pub(crate) unsafe fn yuv420p10_to_rgb_u16_row(
let y_scale_v = _mm512_set1_epi32(y_scale);
let c_scale_v = _mm512_set1_epi32(c_scale);
let bias_v = _mm512_set1_epi16(bias as i16);
let mask_v = _mm512_set1_epi16(scalar::bits_mask::<10>() as i16);
let mask_v = _mm512_set1_epi16(scalar::bits_mask::<BITS>() as i16);
let max_v = _mm512_set1_epi16(OUT_MAX_10);
let zero_v = _mm512_set1_epi16(0);

full_range,
);
unsafe {
yuv420p10_to_rgb_u16_row(&y, &u, &v, &mut rgb_simd, width, matrix, full_range);

scalar::p010_to_rgb_u16_row(&y, &uv, &mut rgb_scalar, width, matrix, full_range);
scalar::p_n_to_rgb_u16_row::<BITS>(&y, &uv, &mut rgb_scalar, width, matrix, full_range);
unsafe {
p010_to_rgb_u16_row(&y, &uv, &mut rgb_simd, width, matrix, full_range);

@@ -562,8 +562,8 @@ pub(crate) unsafe fn p010_to_rgb_row(
debug_assert!(rgb_out.len() >= width * 3);
let coeffs = scalar::Coefficients::for_matrix(matrix);
let (y_off, y_scale, c_scale) = scalar::range_params_n::<10, 8>(full_range);
let bias = scalar::chroma_bias::<10>();
let (y_off, y_scale, c_scale) = scalar::range_params_n::<BITS, 8>(full_range);
let bias = scalar::chroma_bias::<BITS>();

_mm_setr_epi8, _mm_shuffle_epi8, _mm_srai_epi32, _mm_srl_epi16, _mm_srli_si128, _mm_sub_epi16,
_mm_unpackhi_epi16, _mm_unpackhi_epi64, _mm_unpacklo_epi16, _mm_unpacklo_epi64,

scalar::p_n_to_rgb_row::<BITS>(&y, &uv, &mut rgb_scalar, width, matrix, full_range);
unsafe {
p010_to_rgb_row(&y, &uv, &mut rgb_simd, width, matrix, full_range);
}

scalar::p010_to_rgb_u16_row(&y, &uv, &mut rgb_scalar, width, matrix, full_range);
scalar::p_n_to_rgb_u16_row::<BITS>(&y, &uv, &mut rgb_scalar, width, matrix, full_range);
unsafe {
p010_to_rgb_u16_row(&y, &uv, &mut rgb_simd, width, matrix, full_range);

@@ -403,7 +403,7 @@ pub(crate) unsafe fn yuv420p10_to_rgb_u16_row(
let y_scale_v = _mm256_set1_epi32(y_scale);
let c_scale_v = _mm256_set1_epi32(c_scale);
let bias_v = _mm256_set1_epi16(bias as i16);
let mask_v = _mm256_set1_epi16(scalar::bits_mask::<10>() as i16);
let mask_v = _mm256_set1_epi16(scalar::bits_mask::<BITS>() as i16);
let max_v = _mm256_set1_epi16(OUT_MAX_10);
let zero_v = _mm256_set1_epi16(0);

full_range,
);
unsafe {
yuv420p10_to_rgb_u16_row(&y, &u, &v, &mut rgb_simd, width, matrix, full_range);
Pull request overview
This PR generalizes the existing 10-bit YUV420p/P010 row conversion kernels to support additional high-bit-depth variants via a const BITS generic (notably enabling P012-style high-bit-packed semi-planar input), and updates SIMD backends/dispatchers accordingly.
Changes:
- Generalize scalar P010 row kernels into p_n_to_rgb_*_row<const BITS: u32> using sample >> (16 - BITS) for high-bit-packed extraction.
- Refactor SIMD backends (NEON/SSE4.1/AVX2/AVX-512/wasm simd128) to expose *_n_* kernels parameterized by BITS and update dispatch call sites.
- Update/rename internal calls and tests for the new function names (still primarily exercised at BITS == 10).
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| src/row/scalar.rs | Generalizes P010 scalar kernels to p_n_* with BITS-dependent shifts and range params; updates related tests. |
| src/row/mod.rs | Updates SIMD dispatch call sites to the renamed/generic backend entry points. |
| src/row/arch/x86_sse41.rs | Refactors P010 and yuv420p10 SIMD kernels to BITS-generic variants and adjusts tails to call the new scalar generics. |
| src/row/arch/x86_avx512.rs | Same as above for AVX-512: BITS-generic kernels and shift-count handling. |
| src/row/arch/x86_avx2.rs | Same as above for AVX2: BITS-generic kernels and shift-count handling. |
| src/row/arch/wasm_simd128.rs | Same as above for wasm simd128: BITS-generic kernels and shift-count handling. |
| src/row/arch/neon.rs | Extends NEON high-bit-depth kernels to BITS generics and switches semi-planar shifting to a 16 - BITS variable shift. |
Comments suppressed due to low confidence (1)
src/row/arch/x86_sse41.rs:213
This doc block still describes the fixed 10-bit P010 implementation (e.g. “shifted right by 6 (_mm_srli_epi16::<6>)” and references yuv420p10_to_rgb_row), but the function is now generic over BITS and uses a variable shift (16 - BITS). Update the docs and the numerical-contract link to reference scalar::p_n_to_rgb_row::<BITS> so rustdoc doesn’t mislead or produce stale intra-doc links.
/// Byte‑identical to [`scalar::p_n_to_rgb_row::<10>`].
///
/// # Safety
///
/// 1. **SSE4.1 must be available on the current CPU.**
/// # Numerical contract
///
/// Byte‑identical to [`scalar::p010_to_rgb_u16_row`].
/// Byte‑identical to [`scalar::p_n_to_rgb_u16_row::<10>`].
///
/// # Safety

/// # Numerical contract
///
/// Byte‑identical to [`scalar::p010_to_rgb_u16_row`].
/// Byte‑identical to [`scalar::p_n_to_rgb_u16_row::<10>`].
///
/// # Safety

let (y_off, y_scale, c_scale) = range_params_n::<BITS, BITS>(full_range);
let bias = chroma_bias::<BITS>();
let out_max: i32 = (1i32 << BITS) - 1;
let shift = 16 - BITS;

// SAFETY: NEON verified on this CPU; bounds / parity are
// the caller's obligation (asserted above).
unsafe {
arch::neon::yuv420p10_to_rgb_row(y, u_half, v_half, rgb_out, width, matrix, full_range);
arch::neon::yuv_420p_n_to_rgb_row::<10>(y, u_half, v_half, rgb_out, width, matrix, full_range);
}

/// # Numerical contract
///
/// Byte‑identical to [`scalar::p010_to_rgb_u16_row`].
/// Byte‑identical to [`scalar::p_n_to_rgb_u16_row::<10>`].
///
/// # Safety
| /// shift: each `u16` load is extracted to its `BITS`‑bit value via | ||
| /// `sample >> (16 - BITS)`, then the same Q15 pipeline as | ||
| /// [`yuv_420p_n_to_rgb_row`] runs with the same `BITS`. For `BITS == | ||
| /// 10` this is P010 (`>> 6`); for `BITS == 12` it's P012 (`>> 4`). | ||
| /// Mispacked input — e.g. a low‑bit‑packed buffer handed to this |
/// # Numerical contract
///
/// Byte‑identical to [`scalar::p010_to_rgb_row`].
/// Byte‑identical to [`scalar::p_n_to_rgb_row::<10>`].
///
/// # Safety

///
/// # Numerical contract
///
/// Byte‑identical to [`scalar::p010_to_rgb_row`].
/// Byte‑identical to [`scalar::p_n_to_rgb_row::<10>`].
///

/// # Numerical contract
///
/// Byte‑identical to [`scalar::p010_to_rgb_row`].
/// Byte‑identical to [`scalar::p_n_to_rgb_row::<10>`].
///

/// # Numerical contract
///
/// Byte‑identical to [`scalar::p010_to_rgb_u16_row`].
/// Byte‑identical to [`scalar::p_n_to_rgb_u16_row::<10>`].
///
Benchmark Results Summary
Date: 2026-04-19 11:01:06 UTC
Benchmark targets: macos-aarch64-neon, macos-aarch64-scalar, ubuntu-x86_64-avx2-max, ubuntu-x86_64-default, ubuntu-x86_64-native, ubuntu-x86_64-scalar, ubuntu-x86_64-sse41-max, windows-x86_64-default
Detailed Criterion results have been uploaded as artifacts. Download them from the workflow run to view charts and detailed statistics.
Pull request overview
This PR expands the crate’s high-bit-depth YUV support by adding 12-bit and 14-bit planar YUV420 (yuv420p12le/yuv420p14le) plus 12-bit semi-planar P012, and by generalizing the existing P010/YUV420p10 SIMD pathways to const-generic “N-bit” kernels.
Changes:
- Add new source formats + walkers: Yuv420p12, Yuv420p14, and P012, and wire them into yuv::mod.
- Generalize P010 and YUV420p10 SIMD/scalar kernels into const-generic *_n_* implementations to support 10/12/14-bit variants.
- Extend MixedSinker to consume the new formats, and add Criterion benches for the new conversions.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/yuv/yuv420p12.rs | Adds 12-bit planar YUV420 row type + walker (yuv420p12_to). |
| src/yuv/yuv420p14.rs | Adds 14-bit planar YUV420 row type + walker (yuv420p14_to). |
| src/yuv/p012.rs | Adds P012 row type + walker (p012_to). |
| src/yuv/mod.rs | Registers and re-exports the new formats/functions. |
| src/sinker/mixed.rs | Adds MixedSinker support for Yuv420p12/Yuv420p14/P012 and new RowSlice variants. |
| src/row/scalar.rs | Generalizes P010 scalar kernels to const-generic p_n_* and updates tests accordingly. |
| src/row/mod.rs | Dispatch layer adds 12/14-bit planar and P012 row conversion APIs and routes to generalized SIMD fns. |
| src/row/arch/neon.rs | Generalizes NEON implementations to const-generic BITS for planar + Pn paths; adds equivalence tests. |
| src/row/arch/x86_sse41.rs | Generalizes SSE4.1 implementations for planar + Pn paths; adds 12/14-bit equivalence tests. |
| src/row/arch/x86_avx2.rs | Generalizes AVX2 implementations for planar + Pn paths; adds 12/14-bit equivalence tests. |
| src/row/arch/x86_avx512.rs | Generalizes AVX-512 implementations for planar + Pn paths; adds 12/14-bit equivalence tests. |
| src/row/arch/wasm_simd128.rs | Generalizes wasm simd128 implementations for planar + Pn paths; adds 12/14-bit equivalence tests. |
| src/frame.rs | Introduces PnFrame<BITS> (BITS=10/12) + P012Frame alias; adds Yuv420p12/14 frame aliases. |
| benches/yuv_420p12_to_rgb.rs | Adds Criterion bench for yuv420p12 row conversions. |
| benches/yuv_420p14_to_rgb.rs | Adds Criterion bench for yuv420p14 row conversions. |
| benches/p012_to_rgb.rs | Adds Criterion bench for P012 row conversions. |
| Cargo.toml | Registers the new bench targets. |
Comments suppressed due to low confidence (2)
src/frame.rs:536
The doc comment for PnFrame::try_new still refers to constructing a P010Frame and returning P010FrameError, but the function is now generic over the Pn family (P010/P012) and returns PnFrameError. Updating these docs to reference PnFrame/PnFrameError (and mentioning the P010/P012 type aliases) would prevent confusion for callers reading docs via rustdoc.
/// Constructs a new [`P010Frame`], validating dimensions and plane
/// lengths. Strides are in `u16` **samples**.
///
/// Returns [`P010FrameError`] if any of:
/// - `width` or `height` is zero,
src/row/arch/neon.rs:652
This section’s doc comment still describes the u16-output semi-planar path in 10-bit-specific terms (e.g. range_params_n::<10, 10>, clamp to [0, 1023], and references to p010_to_rgb_row). Since the implementation is now const-generic over BITS, the docs should be updated to use BITS and (1 << BITS) - 1, and to reference the new generic function names, so readers don’t apply the wrong constraints to the 12-bit (P012) instantiation.
/// Same structure as [`p010_to_rgb_row`] up to the chroma compute;
/// the only differences are:
/// - `range_params_n::<10, 10>` → larger scales targeting the 10‑bit
/// output range.
/// - Clamp is explicit min/max to `[0, 1023]` via
/// [`clamp_u10`](crate::row::arch::neon::clamp_u10).
/// - Writes use two `vst3q_u16` calls per 16‑pixel block.
/// Full‑width Y row of a **12‑bit** planar source ([`Yuv420p12`]).
/// `u16` samples, `width` elements, low‑bit‑packed.

// ---- Yuv420p12 impl ----------------------------------------------------
impl<'a> MixedSinker<'a, Yuv420p12> {
/// Attaches a packed **`u16`** RGB output buffer. Mirrors
/// [`MixedSinker<Yuv420p10>::with_rgb_u16`] but produces 12‑bit
/// output (values in `[0, 4095]` in the low 12 of each `u16`, upper
/// 4 zero). Length is measured in `u16` **elements** (`width ×
/// height × 3`).

/// # Numerical contract
///
/// Byte‑identical to [`scalar::p010_to_rgb_u16_row`].
/// Byte‑identical to [`scalar::p_n_to_rgb_u16_row::<10>`].

fn clamp_u10(v: int16x8_t, zero_v: int16x8_t, max_v: int16x8_t) -> uint16x8_t {
unsafe { vreinterpretq_u16_s16(vminq_s16(vmaxq_s16(v, zero_v), max_v)) }
}
Benchmark Results Summary
Date: 2026-04-19 11:36:01 UTC
Benchmark targets: macos-aarch64-neon, macos-aarch64-scalar, ubuntu-x86_64-avx2-max, ubuntu-x86_64-default, ubuntu-x86_64-native, ubuntu-x86_64-scalar, ubuntu-x86_64-sse41-max, windows-x86_64-default
Detailed Criterion results have been uploaded as artifacts. Download them from the workflow run to view charts and detailed statistics.
Benchmark Results Summary
Date: 2026-04-19 12:17:27 UTC
Benchmark targets: macos-aarch64-neon, macos-aarch64-scalar, ubuntu-x86_64-avx2-max, ubuntu-x86_64-default, ubuntu-x86_64-native, ubuntu-x86_64-scalar, ubuntu-x86_64-sse41-max, windows-x86_64-default
Detailed Criterion results have been uploaded as artifacts. Download them from the workflow run to view charts and detailed statistics.
Summary
Ships three new high-bit-depth YUV source formats on top of a const-generic refactor of the existing 10-bit pipeline:
- `Yuv420p12`: planar 4:2:0, 12-bit, low-bit-packed (`yuv420p12le`).
- `Yuv420p14`: planar 4:2:0, 14-bit, low-bit-packed (`yuv420p14le`).
- `P012`: semi-planar 4:2:0, 12-bit, high-bit-packed (`p012le`).
The existing 10-bit scalar + 5 SIMD backends (NEON, SSE4.1, AVX2, AVX-512, wasm simd128) are refactored to `const BITS: u32`, so all three new formats reuse the exact same Q15 kernel machinery, monomorphized per depth. No new SIMD backend code.
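The "one generic kernel, monomorphized per depth" shape can be sketched with a toy row kernel. Everything here is hypothetical (`bits_mask`, `mask_row_n`, and the `mask_row_*` wrappers are illustration-only names, loosely modelled on the `scalar::bits_mask::<BITS>()` call visible in the diff); the real kernels do the full Q15 conversion.

```rust
/// Mask with the low `BITS` bits set (hypothetical stand-in, for illustration).
const fn bits_mask<const BITS: u32>() -> u16 {
    ((1u32 << BITS) - 1) as u16
}

/// One generic "kernel": keeps only the low `BITS` bits of each sample.
/// The compiler emits a separate copy for each `BITS` it is called with.
fn mask_row_n<const BITS: u32>(row: &mut [u16]) {
    debug_assert!(BITS == 10 || BITS == 12 || BITS == 14);
    for s in row.iter_mut() {
        *s &= bits_mask::<BITS>();
    }
}

/// Thin per-format entry points; each one pins its own `BITS`, so callers
/// never have to spell the const generic themselves.
pub fn mask_row_10(row: &mut [u16]) {
    mask_row_n::<10>(row)
}
pub fn mask_row_12(row: &mut [u16]) {
    mask_row_n::<12>(row)
}
pub fn mask_row_14(row: &mut [u16]) {
    mask_row_n::<14>(row)
}
```

Keeping the named wrappers alongside the generic function is one way to add new depths without changing existing call sites.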
Core changes
Frame layer (`src/frame.rs`)
- `PnFrame<'a, const BITS: u32>` generalizes the old `P010Frame`. Type aliases pin each shipped depth: `P010Frame = PnFrame<'_, 10>`, `P012Frame = PnFrame<'_, 12>`.
- `PnFrameError` / `PnFramePlane` enums (back-compat aliased to `P010FrameError` / `P010FramePlane`). New `UnsupportedBits` variant; `SampleLowBitsSet` now carries `low_bits: u32`.
- `try_new_checked` low-bits scan generalizes from hardcoded `& 0x3F` to `& ((1 << (16 - BITS)) - 1)`. Docs updated to flag the P012 weakness explicitly: at `BITS == 12` the 4-low-bits check accepts all multiple-of-16 samples, which includes common `yuv420p12le` flat-region values (`Y=256/1024`, `UV=2048`).
- `Yuv420p12Frame = Yuv420pFrame16<'_, 12>` and `Yuv420p14Frame = Yuv420pFrame16<'_, 14>` (the underlying struct already accepts `BITS ∈ {10, 12, 14}`).
Scalar + SIMD kernels
Renamed and made const-generic:
| Before | After (`const BITS: u32`) |
|---|---|
| `yuv420p10_to_rgb_row` | `yuv_420p_n_to_rgb_row::<BITS>` |
| `yuv420p10_to_rgb_u16_row` | `yuv_420p_n_to_rgb_u16_row::<BITS>` |
| `p010_to_rgb_row` | `p_n_to_rgb_row::<BITS>` |
| `p010_to_rgb_u16_row` | `p_n_to_rgb_u16_row::<BITS>` |
`BITS` flows through:
- Planar kernels: `debug_assert!(BITS ∈ {10, 12, 14})`.
- Semi-planar (Pn) kernels: `debug_assert!(BITS ∈ {10, 12})`.
- Sample shift: NEON `vshlq_u16(_, vdupq_n_s16(-(16 - BITS)))`, x86 `_mm*_srl_epi16(_, shr_count)` with `shr_count` derived once per call from `BITS`, wasm `u16x8_shr(_, (16 - BITS) as u32)`.
- `out_max = (1 << BITS) - 1` at call time (was hardcoded `OUT_MAX_10 = 1023`).
Public per-format dispatchers stay (`yuv420p10_to_rgb_row`, `p010_to_rgb_row`, etc.) plus six new ones: `yuv420p12_to_rgb_row` / `_u16_row`, `yuv420p14_to_rgb_row` / `_u16_row`, `p012_to_rgb_row` / `_u16_row`. Each monomorphizes its own `BITS`; back-compat preserved.
New yuv/ modules
- `src/yuv/yuv420p12.rs`: marker `Yuv420p12`, `Yuv420p12Row`, `Yuv420p12Sink`, `yuv420p12_to` walker.
- `src/yuv/yuv420p14.rs`: same shape with `BITS == 14`.
- `src/yuv/p012.rs`: marker `P012`, `P012Row`, `P012Sink`, `p012_to` walker (via `P012Frame = PnFrame<'_, 12>`).
MixedSinker (`src/sinker/mixed.rs`)
Three new `impl PixelSink for MixedSinker<'_, F>` blocks, each with their own luma downshift, native-depth `rgb_u16` packing, scratch-buffer/HSV branching, and row-shape validation:
- `MixedSinker<Yuv420p12>`: luma `>> 4`, u16 output in `yuv420p12le` low-packed convention.
- `MixedSinker<Yuv420p14>`: luma `>> 6`.
- `MixedSinker<P012>`: luma `>> 8` (same accessor as P010 since both put active bits in the high positions of the u16).
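The three luma downshifts listed above follow from where the active bits sit in each `u16`. A minimal sketch, assuming in-range samples; `luma8_low_packed` and `luma8_high_packed` are hypothetical helper names, not the sinker's actual accessors:

```rust
/// Low-bit-packed planar sample (yuv420p12le / yuv420p14le style): the value
/// occupies the low `BITS` bits, so dropping `BITS - 8` bits leaves 8.
/// Assumes the sample is within the `BITS`-bit range.
fn luma8_low_packed<const BITS: u32>(y: u16) -> u8 {
    (y >> (BITS - 8)) as u8
}

/// High-bit-packed sample (P010 / P012 style): the value occupies the top
/// bits of the word, so the top byte is the 8-bit luma for any depth.
fn luma8_high_packed(y: u16) -> u8 {
    (y >> 8) as u8
}
```

With `BITS == 12` the low-packed shift is 4 and with `BITS == 14` it is 6, matching the list; a P012 sample `2048 << 4` and a `yuv420p12le` sample `2048` both reduce to 128.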
Seven new `RowSlice` variants: `Y12`, `UHalf12`, `VHalf12`, `UvHalf12`, `Y14`, `UHalf14`, `VHalf14`.
Tests
SIMD equivalence per backend (5 backends × 2 formats × 2 depths)
Every backend (NEON / SSE4.1 / AVX2 / AVX-512 / wasm simd128) gained const-generic equivalence helpers:
- `check_planar_u8_*_equivalence_n::<BITS>`
- `check_planar_u16_*_equivalence_n::<BITS>`
- `check_pn_u8_*_equivalence_n::<BITS>` (semi-planar only, `BITS ∈ {10, 12}`)
- `check_pn_u16_*_equivalence_n::<BITS>`
Exercised for `BITS == 12` and `BITS == 14` across every matrix (`Bt601`, `Bt709`, `Bt2020Ncl`, `Smpte240m`, `Fcc`, `YCgCo`) × range × tail-width case (`1920`, odd tails, etc.).
MixedSinker integration (19 new end-to-end tests)
For `Yuv420p12` / `Yuv420p14` / `P012`:
- `rgb_u8_only_gray_is_gray`
- `rgb_u16_only_native_depth_gray`
- `rgb_u8_and_u16_both_populated`
- `luma_downshifts_to_8bit`
- `hsv_from_gray_is_zero_hue_zero_sat`
- `rgb_u16_too_short_returns_err`
- `with_simd_false_matches_with_simd_true`
- `p012_matches_yuv420p12_mixed_sinker_with_shifted_samples` (cross-layout parity)
Frame-level regression tests
- `p012_try_new_checked_rejects_low_bits_set`: positive validation.
- `p012_try_new_checked_accepts_low_packed_flat_content_by_design`: pins the known `try_new_checked` limitation at `BITS == 12` (accepts `Y=0x0100` / `UV=0x0800` even though that's mispacked `yuv420p12le`). Documents the weakness in code so future attempts to strengthen `try_new_checked` have a concrete test to validate against.
Benches
Three new Criterion harnesses mirroring the 10-bit pattern:
- `yuv_420p12_to_rgb` (u8 + u16 output, 720p / 1080p / 4K widths)
- `yuv_420p14_to_rgb` (same)
- `p012_to_rgb` (same, high-bit-packed sample generator)
Documentation updates
- `src/lib.rs`: added "Supported source formats" table + "Not yet shipped" section.
- `src/yuv/mod.rs`: structured by 8-bit / high-bit-depth planar / high-bit-depth semi-planar, with an explicit "Not yet shipped" block (16-bit family, 4:2:2 / 4:4:4, packed RGB).
- Doc comments: `shifted right by BITS - 6` → `shifted right by 16 - BITS`, `scalar::p_n_to_rgb_*::<10>` → `::<BITS>`, clarified that the `clamp_u10` helper's max is derived from `BITS`.
- `docs/color-conversion-functions.md`: Tier 1/2 tables split out per-depth rows; added "Shipped (v0.4a)" section; split original Ship 4 into Ship 4a (SHIPPED) and Ship 4b (16-bit, not yet; blocker: Q15 chroma_sum overflows i32 at `BITS == 16`).
- `docs/hardware-decode-with-ffmpeg-next.md`: added P012 decode example, updated "P010 frames have garbage Y values" section to cover both P010 and P012 with the generalized `>> (16 - BITS)` rule, dropped `(planned)` markers now that the API ships.
Verification
- `cargo test --lib` → 186 passed (164 → 186 with +19 `MixedSinker` + +3 frame regression tests).
- `cargo check --lib --tests --target x86_64-unknown-freebsd`: clean.
- `RUSTFLAGS='-C target-feature=+simd128' cargo check --lib --tests --target wasm32-wasip1`: clean.
- `cargo check --benches`: clean.
- `cargo doc --lib --no-deps`: builds; 21 pre-existing "redundant explicit link target" warnings (not introduced here).
Review history in this PR
Earlier iterations (pre-rebase on the final commit) surfaced issues
that are now fixed:
- Hardcoded `srli::<6>` and `OUT_MAX_10 = 1023` remaining after the `const BITS` rename: every `::<6>` / `1023` now derives from `BITS`.
- `scalar::...::<BITS>` in test scope replaced with `::<10>`, and old SIMD function names in test helpers renamed.
- `P012 try_new_checked` flagged as a silent-corruption path for common `yuv420p12le` flat content: docs now call this out explicitly and the behavior is pinned by a named regression test, so the type system (choosing `P012Frame` vs `Yuv420p12Frame` at construction based on decoder metadata) is clearly the intended provenance guarantee.
Follow-ups (not in this PR)
- 16-bit family (`yuv420p16le`, `p016le`). Blocked on the Q15 chroma_sum overflow: needs either i64 intermediates or a lower-Q coefficient format. Current kernels explicitly `debug_assert!` against `BITS == 16` so this won't silently enter the Q15 code path.
- P410 / P416) and planar `yuv422p` / `yuv444p` families.
Test plan
- wasm32 (`wasm32-wasip1` with `+simd128`).
- `cargo bench` baseline on Apple M-series: expected ~5× SIMD speedup on the 12/14-bit u8 paths and ~4× on the u16 paths (mirrors the v0.2 / v0.3 numbers for 10-bit).
- decoder emitting `p012le` is available): `MixedSinker<P012>` RGB output against a reference `libswscale` conversion.
🤖 Generated with Claude Code