
Editor playback optimization #1598

Merged
richiemcilroy merged 56 commits into main from cursor/editor-playback-optimization-5f42
Feb 15, 2026

Conversation

@richiemcilroy
Member

@richiemcilroy richiemcilroy commented Feb 14, 2026

Optimise playback performance by implementing a comprehensive benchmarking infrastructure, tuning decoder prefetch, batching GPU command submissions, pipelining GPU readbacks, and improving audio synchronization.

This PR addresses key bottlenecks such as the double GPU round-trip for frames, conservative prefetching, and synchronous GPU readbacks, aiming for smoother 60fps playback, faster scrubbing, and tighter audio-video sync across macOS and Windows.



Greptile Overview

Greptile Summary

Comprehensive performance optimization focused on eliminating GPU bottlenecks and improving playback smoothness through pipelined operations and smarter prefetching.

Key optimizations:

  • Pipelined GPU readbacks with triple buffering eliminate synchronous waits between frames
  • Batched GPU command submissions reduce overhead by combining YUV conversion and rendering into single command buffers
  • Increased prefetch buffer (60→90 frames) and parallel decode tasks (4→6) with ramped startup (8 tasks initially)
  • LRU frame cache with eviction strategy prevents memory bloat during seeks
  • Reduced decoder timeouts (200ms→100ms) and separate initial seek timeout (10s vs 2s)
  • NV12 output format option bypasses double GPU round-trip for WebSocket streaming
  • Comprehensive benchmarking infrastructure added for measuring pipeline performance
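The pipelined-readback idea behind the first bullet can be sketched independently of wgpu: rather than waiting on the frame just submitted, return the result of the previously submitted frame, which has had a full frame time to complete. This is an illustrative stand-in (the real code maps GPU buffers), not the PR's actual types.

```rust
// Sketch of the pipelined-readback pattern: submitting frame N returns
// frame N-1's result, so the GPU readback overlaps CPU work on the next frame.
struct PipelinedReadback<T> {
    pending: Option<T>,
}

impl<T> PipelinedReadback<T> {
    fn new() -> Self {
        Self { pending: None }
    }

    /// Submit the current frame's readback and return the previous frame.
    fn submit(&mut self, current: T) -> Option<T> {
        self.pending.replace(current)
    }

    /// Drain the last in-flight frame (used at end of export so the final
    /// frame is not lost).
    fn flush(&mut self) -> Option<T> {
        self.pending.take()
    }
}

fn main() {
    let mut pipe = PipelinedReadback::new();
    assert_eq!(pipe.submit(0u32), None); // first frame: nothing to return yet
    assert_eq!(pipe.submit(1), Some(0)); // frame 1 returns frame 0
    assert_eq!(pipe.submit(2), Some(1));
    assert_eq!(pipe.flush(), Some(2)); // final frame recovered by flush
}
```

The `flush` method mirrors the PR's `flush_pipeline()`, which exists precisely because this pattern would otherwise drop the last frame of an export.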

Architecture improvements:

  • Shared YUV converter pipelines with bind group caching reduce GPU object creation overhead
  • Encoder-based layer preparation allows batching GPU work instead of immediate submission
  • prepare_with_encoder variants across display and camera layers enable command buffering
  • Frame cache eviction when seeking prevents unbounded memory growth
  • Statistics logging every 2s tracks cache hits, prefetch hits, and sync decodes

Confidence Score: 4/5

  • Safe to merge with minor observations on timeout tuning and buffer sizing
  • Well-structured performance optimizations with clear separation of concerns. The pipelined GPU operations and batched command submission are sound architectural improvements. Timeout reductions and increased buffer sizes are reasonable tuning choices. One concern is that the reduced GPU timeout (30s→10s) may be too aggressive for slower systems, but the retry logic provides a fallback. The NV12 conversion path is properly validated with dimension checks.
  • Pay attention to crates/editor/src/playback.rs timeout tuning and crates/rendering/src/frame_pipeline.rs GPU buffer timeout reduction

Important Files Changed

Filename Overview
crates/editor/src/playback.rs Optimized playback prefetching with increased buffer sizes (90 frames), parallel decode tasks (6-8), reduced timeouts, and added frame cache eviction strategy
crates/rendering/src/frame_pipeline.rs Added pipelined GPU readbacks with triple buffering, RGBA-to-NV12 converter, reduced GPU timeout from 30s to 10s, and improved async buffer mapping
crates/rendering/src/yuv_converter.rs New YUV converter with shared pipelines, bind group caching, encoder batching support for both NV12 and YUV420P formats
crates/rendering/src/lib.rs Added NV12 rendering path, pipeline flushing for final frames, encoder-based layer preparation to batch GPU commands
crates/rendering/src/layers/display.rs Added prepare_with_encoder variant to batch YUV conversion with other GPU work instead of submitting separately

Flowchart

flowchart TD
    A[Playback Start] --> B[Prefetch Task Spawned]
    A --> C[Main Playback Loop]
    
    B --> D{Frames Decoded < 15?}
    D -->|Yes| E[Use 8 Parallel Tasks]
    D -->|No| F[Use 6 Parallel Tasks]
    
    E --> G[Decode Frames Ahead]
    F --> G
    
    G --> H[Send to Prefetch Buffer]
    H --> I[LRU Frame Cache]
    
    C --> J{Frame in Cache?}
    J -->|Yes| K[Cache Hit - Use Cached]
    J -->|No| L{Frame in Prefetch Buffer?}
    
    L -->|Yes| M[Prefetch Hit - Use Frame]
    L -->|No| N{Frame In-Flight?}
    
    N -->|Yes| O[Wait 100ms for Frame]
    N -->|No| P[Sync Decode with Timeout]
    
    K --> Q[Render Pipeline]
    M --> Q
    O --> Q
    P --> Q
    
    Q --> R[Prepare Layers with Encoder]
    R --> S[YUV to RGBA Conversion]
    S --> T[Batch GPU Commands]
    T --> U{Output Format?}
    
    U -->|RGBA| V[Pipelined RGBA Readback]
    U -->|NV12| W[RGBA to NV12 Conversion]
    
    W --> X[Pipelined NV12 Readback]
    
    V --> Y[Triple Buffer Rotation]
    X --> Y
    
    Y --> Z[Return Previous Frame]
    Z --> AA[WebSocket Send]
    
    AA --> AB{Frames Behind?}
    AB -->|Yes > 6| AC[Skip Frames]
    AB -->|No| C
    AC --> AD[Evict Cache]
    AD --> C

Last reviewed commit: f60bbbe

cursoragent and others added 18 commits February 14, 2026 18:22
…ance

Creates a comprehensive benchmark (playback-pipeline-benchmark) that measures
each stage of the playback pipeline independently:
- Decode-only performance (frame retrieval latency)
- Full pipeline (decode + GPU render + readback) at multiple resolutions
- Scrubbing performance (random-access frame rendering)

Reports min/avg/p50/p95/p99/max statistics for each stage, effective FPS,
and frame budget utilization analysis.
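A minimal nearest-rank percentile over a sorted latency sample shows the kind of statistic the benchmark reports; the function name and method are illustrative, not necessarily the benchmark's actual implementation.

```rust
// Nearest-rank percentile on an already-sorted sample of latencies (ms).
fn percentile(sorted_ms: &[f64], p: f64) -> f64 {
    assert!(!sorted_ms.is_empty());
    let rank = ((p / 100.0) * (sorted_ms.len() as f64 - 1.0)).round() as usize;
    sorted_ms[rank]
}

fn main() {
    let samples: Vec<f64> = (1..=100).map(|i| i as f64).collect();
    println!("p50 = {}", percentile(&samples, 50.0));
    println!("p99 = {}", percentile(&samples, 99.0));
    println!("max = {}", percentile(&samples, 100.0));
}
```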

Usage: cargo run -p cap-editor --example playback-pipeline-benchmark -- \
  --recording-path /path/to/recording [--fps 60] [--frames 300]

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Key changes to playback.rs:
- Increase PARALLEL_DECODE_TASKS from 4 to 6 (sustained)
- Add INITIAL_PARALLEL_DECODE_TASKS of 8 during ramp-up phase
- Increase PREFETCH_BUFFER_SIZE from 60 to 90 frames
- Increase FRAME_CACHE_SIZE from 60 to 90 (matches decoder cache)
- Reduce warmup threshold from 20 to 10 frames (faster first frame)
- Reduce warmup timeout from 1000ms to 500ms
- Reduce frame wait timeouts from 200ms to 100ms for better responsiveness
- Reduce aggressive skip threshold from 10 to 6 frames behind
- Reduce PREFETCH_BEHIND from 15 to 10 (focus more on forward frames)

These changes should improve:
- First-frame latency (faster warmup)
- Sustained playback FPS (more parallel decoding)
- Frame availability (larger cache and prefetch buffer)
- Responsiveness (shorter wait timeouts)
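The ramped startup described above reduces to a small selection function: more decode tasks while the pipeline warms up, then the sustained count. Constant values come from this commit message and the review flowchart (ramp-up ends below 15 decoded frames); the function name is illustrative.

```rust
// Ramped decode parallelism: 8 tasks during warmup, 6 sustained.
const INITIAL_PARALLEL_DECODE_TASKS: usize = 8;
const PARALLEL_DECODE_TASKS: usize = 6;
const RAMP_UP_FRAMES: u64 = 15; // threshold per the PR's flowchart

fn decode_task_count(frames_decoded: u64) -> usize {
    if frames_decoded < RAMP_UP_FRAMES {
        INITIAL_PARALLEL_DECODE_TASKS
    } else {
        PARALLEL_DECODE_TASKS
    }
}

fn main() {
    assert_eq!(decode_task_count(0), 8);  // warmup: extra parallelism
    assert_eq!(decode_task_count(100), 6); // steady state
}
```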

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Previously, YUV→RGBA conversion (compute shader) created its own command
encoder and called queue.submit() independently, then the main render pass
created another encoder and submitted again. This meant 2+ GPU submissions
per frame.

Changes:
- Add convert_nv12_to_encoder() and convert_yuv420p_to_encoder() methods
  to YuvToRgbaConverter that dispatch compute passes on an external encoder
  instead of creating their own
- Add prepare_with_encoder() to DisplayLayer that uses the batched
  conversion methods
- Add prepare_with_encoder() to RendererLayers that passes the shared
  encoder through to display layer
- Modify produce_frame() to create the command encoder first and pass it
  to both prepare and render phases

Result: Single queue.submit() per frame instead of 2+, reducing GPU
overhead and improving frame throughput.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Previously finish_encoder() would:
1. Discard any pending readback result
2. Submit current frame readback
3. Wait synchronously for current readback

Now it:
1. Wait for previous frame's readback (should already be done)
2. Submit current frame's readback (starts async)
3. Return the previous frame immediately

This means GPU readback of frame N overlaps with CPU processing
and decode of frame N+1, reducing the synchronous wait time per
frame. The first frame still waits synchronously, but subsequent
frames benefit from the pipelining.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Audio sync improvements:
- Pre-rendered audio: tighten jump detection threshold from 100ms to 50ms
  (audio now re-syncs when video playhead drifts by more than 50ms)
- macOS (non-prerendered): reduce sync threshold from 120ms to 80ms
- Windows (non-prerendered): reduce sync threshold from 200ms to 100ms
- Windows: reduce hard seek threshold from 500ms to 300ms
- Windows: reduce min sync interval callbacks from 50 to 30 for more
  responsive correction

These changes address user reports of audio being behind video or
taking a couple of seconds to sync up during playback.
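All of these thresholds reduce to the same drift check: re-sync when the gap between the video playhead and the audio clock exceeds a platform-specific limit. A hedged sketch with illustrative names:

```rust
// Re-sync when |video - audio| drift exceeds the platform threshold (ms).
fn needs_resync(video_secs: f64, audio_secs: f64, threshold_ms: f64) -> bool {
    (video_secs - audio_secs).abs() * 1000.0 > threshold_ms
}

fn main() {
    // macOS non-prerendered threshold is now 80ms:
    assert!(!needs_resync(10.000, 10.050, 80.0)); // ~50ms drift: tolerated
    assert!(needs_resync(10.000, 10.100, 80.0)); // ~100ms drift: re-sync
}
```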

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
ZoomFocusInterpolator was cloning CursorEvents (Vec of cursor moves and
clicks) on every frame during playback. For recordings with many cursor
events, this adds unnecessary allocation pressure.

Changes:
- Add ZoomFocusInterpolator::new_arc() that accepts Arc<CursorEvents>
- Playback and preview paths now use Arc sharing instead of deep cloning
- Reduces per-frame allocation during playback
- Renderer channel capacity reduced from 8 to 4 to reduce stale frame
  queuing and wasted decode work

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
- Reduce decoder frame timeout from 5000ms to 2000ms (normal frames)
- Reduce decoder initial seek timeout from 20000ms to 10000ms
- Reduce GPU buffer wait timeout from 30s to 10s
- Improve GPU readback poll loop: use progressive backoff instead of
  tight busy-wait (yield for first 10 polls, 100us sleep for 10-100
  polls, 1ms sleep after that)
- Reduces CPU usage during GPU readback waiting while maintaining
  low latency for fast readbacks
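The progressive-backoff loop can be sketched as below. The tier boundaries (10 and 100 polls) and sleep durations come from this commit message; the function shape and the `ready` closure are illustrative stand-ins for checking whether the GPU buffer map has completed.

```rust
use std::time::Duration;

// Poll with progressive backoff: yield for the first 10 polls, sleep 100us
// for polls 10-100, then 1ms thereafter. Returns true once `ready` succeeds.
fn poll_with_backoff(mut ready: impl FnMut() -> bool, max_polls: u32) -> bool {
    for i in 0..max_polls {
        if ready() {
            return true;
        }
        if i < 10 {
            std::thread::yield_now(); // hot path: fast readbacks stay low-latency
        } else if i < 100 {
            std::thread::sleep(Duration::from_micros(100));
        } else {
            std::thread::sleep(Duration::from_millis(1)); // cold path: back off
        }
    }
    false
}

fn main() {
    let mut calls = 0;
    let done = poll_with_backoff(
        || {
            calls += 1;
            calls >= 5 // pretend the readback completes on the 5th poll
        },
        1_000,
    );
    assert!(done);
    assert_eq!(calls, 5);
}
```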

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
The pipelined readback optimization returns the previous frame's readback
result while submitting the current frame's readback. This means the very
last frame in a render loop would be lost since no subsequent call would
collect it.

Changes:
- Add flush_pending_readback() function to frame_pipeline
- Add flush_pipeline() method to FrameRenderer
- Call flush_pipeline() at end of render_video_to_channel() to capture
  the last frame during exports
- For playback, losing the last frame is acceptable (playback loop
  manages its own frame counting)

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
@cursor

cursor bot commented Feb 14, 2026

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
Learn more about Cursor Agents

cursoragent and others added 11 commits February 14, 2026 18:48
Adds periodic (every 2s) tracing::info! logs during playback that report:
- Effective FPS (frames rendered / elapsed time)
- Total frames rendered and skipped
- Cache hit count (frames served from LRU cache)
- Prefetch hit count (frames from prefetch buffer)
- Sync decode count (frames decoded synchronously on-demand)
- Current prefetch buffer utilization

This data helps diagnose playback performance issues in real-world
usage without requiring the benchmark tool.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Convert RGBA frames to NV12 format before sending over WebSocket,
reducing per-frame data from width*height*4 bytes (RGBA) to
width*height*1.5 bytes (NV12) — a 62.5% bandwidth reduction.

At half-res (1248x702) this reduces per-frame size from ~3.5MB to
~1.3MB, dropping bandwidth needs from ~210MB/s to ~78MB/s at 60fps.
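The size arithmetic checks out: RGBA is 4 bytes per pixel, while NV12 carries a full-resolution Y plane plus a quarter-resolution interleaved UV plane, i.e. 1.5 bytes per pixel.

```rust
// Per-frame sizes for RGBA vs NV12 (even dimensions assumed).
fn rgba_bytes(w: usize, h: usize) -> usize {
    w * h * 4
}

fn nv12_bytes(w: usize, h: usize) -> usize {
    w * h + (w / 2) * (h / 2) * 2 // Y plane + interleaved UV plane
}

fn main() {
    let (w, h) = (1248, 702);
    println!("RGBA: {:.2} MB", rgba_bytes(w, h) as f64 / 1e6); // ~3.5 MB
    println!("NV12: {:.2} MB", nv12_bytes(w, h) as f64 / 1e6); // ~1.3 MB
}
```

At 1248x702 this gives 3,504,384 bytes for RGBA versus 1,314,144 for NV12, matching the figures above (1.5/4 = 37.5% of the original size, a 62.5% reduction).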

Changes:
- Add WSFrameFormat enum (Rgba/Nv12) to WSFrame struct
- Add WSFrame::from_rendered_frame_nv12() constructor that converts
  RGBA data to NV12 using the existing convert_to_nv12() function
- Add pack_ws_frame() helper that selects correct packing format
- Editor window frame callbacks now use NV12 format
- Screenshot editor and camera legacy paths remain RGBA
- WebSocket stats logging now reports actual format (NV12/RGBA)

The frontend already has full NV12 support via WebGPU compute shader
(renderNv12FrameWebGPU) and Canvas2D fallback, so no frontend changes
are needed.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Add evict_far_from() method to FrameCache that removes entries far from
the current playhead position. Called when aggressive frame skipping
occurs (playback falls behind) to prevent the cache from holding stale
frames that will never be used again.

This keeps the LRU cache focused on frames near the current playhead,
improving cache hit rates during normal sequential playback.
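The eviction policy can be sketched with a plain map keyed by frame index; the real `FrameCache` is LRU-based and this stand-in only illustrates the distance-from-playhead retain step.

```rust
use std::collections::HashMap;

// Illustrative cache: drop entries whose frame index is far from the playhead.
struct FrameCache {
    frames: HashMap<u32, Vec<u8>>,
}

impl FrameCache {
    fn evict_far_from(&mut self, playhead: u32, keep_radius: u32) {
        self.frames
            .retain(|&idx, _| idx.abs_diff(playhead) <= keep_radius);
    }
}

fn main() {
    let mut cache = FrameCache {
        frames: (0..200).map(|i| (i, vec![0u8; 4])).collect(),
    };
    cache.evict_far_from(150, 45); // keep frames 105..=195
    assert_eq!(cache.frames.len(), 91);
    assert!(cache.frames.contains_key(&150));
    assert!(!cache.frames.contains_key(&50)); // stale frame evicted
}
```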

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Previously the camera layer created its own command encoder and called
queue.submit() for both YUV→RGBA conversion and texture copy — 2 extra
GPU submissions per frame when camera is visible.

Changes:
- Add copy_from_yuv_output_to_encoder() that copies using external encoder
- Add prepare_with_encoder() to CameraLayer that uses batched
  convert_nv12_to_encoder/convert_yuv420p_to_encoder methods
- Update RendererLayers::prepare_with_encoder() to use camera's batched
  path for both camera and camera_only layers

Result: When camera is visible, saves 2-4 queue.submit() calls per frame,
reducing GPU overhead.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Rewrite convert_to_nv12() for better performance:
- Separate Y and UV computation with row-level bounds checking
  (one check per row instead of per-pixel)
- Use row slices for sequential memory access
- Process UV pairs in a tight while loop instead of branching per pixel
- Use bit shift (>>1) instead of division for averaging
- More cache-friendly: process each row's Y values contiguously,
  then UV values for even rows

This conversion runs on every frame in the frame callback, so reducing
its cost directly improves sustained FPS.
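A row-wise version of the conversion might look like the sketch below: per-row slices, UV computed on even rows from horizontal pairs, shifts instead of division. The BT.601 integer coefficients are an assumption, since the commit does not state which matrix `convert_to_nv12()` uses.

```rust
// Row-wise RGBA -> NV12 (video-range BT.601 approximation, even w and h).
fn rgba_to_nv12(rgba: &[u8], w: usize, h: usize) -> (Vec<u8>, Vec<u8>) {
    let mut y_plane = vec![0u8; w * h];
    let mut uv_plane = vec![0u8; w * h / 2];
    for row in 0..h {
        let src = &rgba[row * w * 4..(row + 1) * w * 4]; // one row slice
        let y_row = &mut y_plane[row * w..(row + 1) * w];
        for x in 0..w {
            let (r, g, b) = (src[x * 4] as i32, src[x * 4 + 1] as i32, src[x * 4 + 2] as i32);
            y_row[x] = (((66 * r + 129 * g + 25 * b + 128) >> 8) + 16) as u8;
        }
        if row % 2 == 0 {
            // UV for even rows only; pairs averaged with a shift, not division.
            let uv_row = &mut uv_plane[(row / 2) * w..(row / 2 + 1) * w];
            let mut x = 0;
            while x + 1 < w {
                let r = (src[x * 4] as i32 + src[(x + 1) * 4] as i32) >> 1;
                let g = (src[x * 4 + 1] as i32 + src[(x + 1) * 4 + 1] as i32) >> 1;
                let b = (src[x * 4 + 2] as i32 + src[(x + 1) * 4 + 2] as i32) >> 1;
                uv_row[x] = (((-38 * r - 74 * g + 112 * b + 128) >> 8) + 128) as u8;
                uv_row[x + 1] = (((112 * r - 94 * g - 18 * b + 128) >> 8) + 128) as u8;
                x += 2;
            }
        }
    }
    (y_plane, uv_plane)
}

fn main() {
    // 2x2 white frame: Y should be 235, UV neutral (128) in video range.
    let rgba = [255u8; 2 * 2 * 4];
    let (y, uv) = rgba_to_nv12(&rgba, 2, 2);
    assert!(y.iter().all(|&v| v == 235));
    assert!(uv.iter().all(|&v| v == 128));
}
```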

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
cursoragent and others added 21 commits February 14, 2026 19:13
Previously, the WebSocket handler cloned the entire WSFrame (~1.3MB of
NV12 data) on every frame from the watch channel. Now the watch channel
stores Arc<WSFrame>, so the clone is just an atomic reference count
increment.

Changes:
- Change create_watch_frame_ws to accept watch::Receiver<Option<Arc<WSFrame>>>
- Add pack_ws_frame_ref/pack_nv12_frame_ref/pack_frame_data_ref that
  take &WSFrame references instead of consuming ownership
- Update editor_window.rs to wrap WSFrame in Arc before sending
- Update screenshot_editor.rs to wrap WSFrame in Arc before sending
- Broadcast-based create_frame_ws (camera_legacy) unchanged as
  broadcast handles cloning differently

This eliminates ~1.3MB of allocation per frame in the WS handler,
reducing GC pressure and improving frame delivery throughput.
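Why the Arc change removes the per-frame copy: cloning an `Arc` bumps a reference count instead of duplicating the ~1.3MB payload. `WsFrame` below is a stand-in struct, not the real `WSFrame` type.

```rust
use std::sync::Arc;

struct WsFrame {
    data: Vec<u8>,
}

fn main() {
    let frame = Arc::new(WsFrame { data: vec![0u8; 1_300_000] });
    let for_ws_handler = Arc::clone(&frame); // O(1): no buffer copy
    assert_eq!(Arc::strong_count(&frame), 2);
    assert_eq!(for_ws_handler.data.len(), 1_300_000);
    // Both handles point at the same allocation:
    assert!(std::ptr::eq(&frame.data, &for_ws_handler.data));
}
```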

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
Add compute shader and Rust pipeline for converting rendered RGBA frames
to NV12 format on the GPU before readback. This will reduce readback
data size by 62% (from width*height*4 to width*height*1.5 bytes).

New components:
- shaders/rgba_to_nv12.wgsl: Compute shader that processes 4x2 pixel
  blocks, writing complete u32 words to avoid data races. Each thread
  produces 4 Y values (2 rows) and 2 UV pairs.
- RgbaToNv12Converter: Creates compute pipeline, manages storage and
  readback buffers, dispatches conversion compute pass
- PendingNv12Readback: Async readback with same progressive backoff
  poll loop as RGBA readback
- Nv12RenderedFrame: Output frame with NV12 data + metadata

Not yet wired into the render pipeline - infrastructure only.
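The dispatch math for the 4x2-pixels-per-thread layout: the thread grid is ceil(w/4) by ceil(h/2), divided by the workgroup size. An 8x8 workgroup is assumed here, since the commit does not state it.

```rust
// Workgroup counts for a shader where each invocation covers a 4x2 block.
fn div_ceil(a: u32, b: u32) -> u32 {
    (a + b - 1) / b
}

fn workgroups(w: u32, h: u32) -> (u32, u32) {
    let threads_x = div_ceil(w, 4); // one thread per 4 horizontal pixels
    let threads_y = div_ceil(h, 2); // one thread per 2 rows
    (div_ceil(threads_x, 8), div_ceil(threads_y, 8)) // assumed 8x8 workgroup
}

fn main() {
    // 1248x702 frame: 312x351 threads -> 39x44 workgroups of 8x8.
    assert_eq!(workgroups(1248, 702), (39, 44));
}
```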

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
…reduces readback by 62%

Major optimization: instead of reading back full RGBA from GPU (width*height*4)
and converting to NV12 on CPU, the render pipeline now converts RGBA→NV12 on
GPU using a compute shader and reads back only NV12 data (width*height*1.5).

This eliminates two bottlenecks simultaneously:
1. GPU readback size reduced by 62% (e.g., 3.5MB → 1.3MB at half-res)
2. CPU RGBA→NV12 conversion (~1-2ms per frame) eliminated from render thread

Pipeline flow change:
  Before: GPU render RGBA → readback RGBA → CPU convert NV12 → WS send NV12
  After:  GPU render RGBA → GPU convert NV12 → readback NV12 → WS send NV12

Implementation:
- Add rgba_to_nv12.wgsl compute shader (processes 4x2 pixel blocks,
  writes complete u32 words to avoid data races)
- Add RgbaToNv12Converter with compute pipeline, storage/readback buffers
- Add finish_encoder_nv12() for NV12 readback path
- Add FrameRenderer::render_nv12() method
- Add EditorFrameOutput enum (Rgba/Nv12) for frame callback
- Editor renderer now produces NV12 frames directly
- Frame callback receives NV12 data without CPU conversion
- Export path unchanged (still uses RGBA readback)

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
…_of)

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
The NV12 readback path was submitting and immediately waiting for each
frame, losing the pipelining benefit of the RGBA path. Now it uses the
same pattern: return previous frame's result while submitting current
frame's readback.

Changes:
- Add dual readback buffers to RgbaToNv12Converter (alternating)
- Refactor convert_and_readback into submit_conversion + start_readback
- Store pending readback in converter (take_pending to retrieve)
- finish_encoder_nv12 now returns previous frame while current renders
- First frame still waits synchronously, subsequent frames pipelined

This ensures the NV12 GPU path gets the same pipelining benefit as
the RGBA path — GPU readback of frame N overlaps with rendering of
frame N+1.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
The render_nv12 method was missing the retry-on-GPU-error logic that the
RGBA render method has. If a GPU buffer mapping fails (which can happen
transiently on some hardware), it now retries up to 3 times with
progressive backoff — same as the RGBA path.

On retry, both the RenderSession and the NV12 converter are reset to
clear any stale GPU state.

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
When the GPU NV12 conversion fails (e.g., dimensions not a multiple of 4),
the fallback path was wrapping RGBA data in Nv12RenderedFrame and sending
it to the frontend tagged as NV12 format. The frontend WebGPU shader would
try to decode NV12, producing incorrect colors.

Fix:
- Add GpuOutputFormat enum (Nv12/Rgba) to Nv12RenderedFrame
- GPU NV12 path sets format=Nv12, fallback sets format=Rgba
- Editor frame callback checks the format field: if Nv12, sends directly;
  if Rgba, does CPU RGBA→NV12 conversion before sending
- Ensures the frontend always receives correctly formatted NV12 data

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
…ression

The GPU RGBA→NV12 compute shader was adding too much GPU-side work at
full resolution (1920x1080), causing playback to drop from ~60fps to
~24-30fps on M4 Max. The GPU compute overhead exceeded the readback
bandwidth savings.

Changes:
- Revert editor renderer from render_nv12 back to render (RGBA output)
- Restore renderer channel capacity from 4 back to 8
- CPU NV12 conversion still happens in the frame callback for WS
  bandwidth savings (~1-2ms, well within frame budget)

The GPU NV12 infrastructure (shader, pipeline, pipelined readback) is
preserved for future use when it can be made resolution-adaptive, but
the editor now uses the standard RGBA render path that works well at
all resolutions.

All other optimizations remain active:
- Batched GPU command submissions (YUV→RGBA + render in single encoder)
- Pipelined RGBA readback (previous frame returned while current renders)
- NV12 over WebSocket (CPU conversion, 62% bandwidth reduction)
- Arc<WSFrame> (no deep clone)
- Prefetch/decode parallelism (6-8 tasks, 90-frame cache)
- Audio sync improvements
- Performance instrumentation

Co-authored-by: Richie McIlroy <richiemcilroy@users.noreply.github.com>
@richiemcilroy richiemcilroy marked this pull request as ready for review February 15, 2026 00:22

@@ -93,32 +36,39 @@ fn pack_nv12_frame(
let y_stride = width;

NV12 metadata currently hardcodes y_stride = width. Since WSFrame carries stride (and callers fill it for NV12), it seems safer to encode that stride here so the receiver can handle padded NV12 rows.

uv_stride: u32,
) -> Result<&wgpu::TextureView, YuvConversionError> {
let (effective_width, effective_height, _downscaled) =
validate_dimensions(width, height, self.gpu_max_texture_size)?;

validate_dimensions can return effective_width/effective_height that differ from the input. This function still uploads/dispatches using the original width/height, which can exceed the allocated texture size when downscaling kicks in. Consider either early-returning when effective_* != * (since we’re not resampling here) or consistently using effective_* for uploads/dispatch after ensuring the planes match.
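A minimal sketch of the suggested early-return, with `validate_dimensions` stubbed to mirror its assumed `(width, height, downscaled)` return shape (all names here are assumptions, not the real signatures):

```rust
// Stub mirroring the assumed shape of validate_dimensions: it may return
// effective dimensions smaller than the inputs when they exceed the GPU max.
#[derive(Debug, PartialEq)]
enum YuvConversionError {
    TooLarge,
}

fn validate_dimensions(w: u32, h: u32, max: u32) -> Result<(u32, u32, bool), YuvConversionError> {
    if w <= max && h <= max {
        Ok((w, h, false))
    } else {
        // Stand-in for the real downscaling logic; rounds down to even.
        let scale = (max as f32 / w.max(h) as f32).min(1.0);
        Ok((
            ((w as f32 * scale) as u32) & !1,
            ((h as f32 * scale) as u32) & !1,
            true,
        ))
    }
}

// The suggested guard: this path uploads raw planes and cannot resample,
// so reject rather than silently writing past the allocated textures.
fn check_uploadable(w: u32, h: u32, max: u32) -> Result<(u32, u32), YuvConversionError> {
    let (ew, eh, downscaled) = validate_dimensions(w, h, max)?;
    if downscaled || ew != w || eh != h {
        return Err(YuvConversionError::TooLarge);
    }
    Ok((ew, eh))
}
```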

uv_stride: u32,
) -> Result<&wgpu::TextureView, YuvConversionError> {
let (effective_width, effective_height, _downscaled) =
validate_dimensions(width, height, self.gpu_max_texture_size)?;

Same thing for convert_yuv420p_to_encoder: effective_width/effective_height is computed but the rest of the function uses the original width/height for uploads and dispatch. If downscaling is possible here, it’d be good to make the behavior explicit (reject vs. handle) to avoid mismatched extents.


let pending = nv12_converter
.take_pending()
.expect("just submitted a conversion");

This expect can take the whole render down if something goes sideways; probably better to surface it as a RenderingError like the other failure paths.

Suggested change
.expect("just submitted a conversion");
let pending = nv12_converter
.take_pending()
.ok_or(RenderingError::BufferMapWaitingFailed)?;

Comment on lines +535 to +542
if let Some(Ok(final_frame)) = frame_renderer.flush_pipeline().await
&& final_frame.width > 0
&& final_frame.height > 0
{
sender
.send((final_frame, frame_number.saturating_sub(1)))
.await?;
}

This currently drops any flush error silently. Even if you don’t want to fail the whole render on a flush failure, a warn-level log here would make it much easier to spot pipeline issues.

Suggested change
if let Some(Ok(final_frame)) = frame_renderer.flush_pipeline().await
&& final_frame.width > 0
&& final_frame.height > 0
{
sender
.send((final_frame, frame_number.saturating_sub(1)))
.await?;
}
if let Some(result) = frame_renderer.flush_pipeline().await {
match result {
Ok(final_frame) if final_frame.width > 0 && final_frame.height > 0 => {
sender
.send((final_frame, frame_number.saturating_sub(1)))
.await?;
}
Err(e) => {
tracing::warn!(error = %e, "Failed to flush pipelined readback");
}
_ => {}
}
}

Comment on lines 29 to 47
@@ -93,32 +36,39 @@ fn pack_nv12_frame(
let y_stride = width;
let metadata_size = 28;
let mut output = Vec::with_capacity(data.len() + metadata_size);
output.extend_from_slice(&data);
output.extend_from_slice(data);
output.extend_from_slice(&y_stride.to_le_bytes());
output.extend_from_slice(&height.to_le_bytes());
output.extend_from_slice(&width.to_le_bytes());
output.extend_from_slice(&frame_number.to_le_bytes());
output.extend_from_slice(&target_time_ns.to_le_bytes());
output.extend_from_slice(&NV12_FORMAT_MAGIC.to_le_bytes());

output
}

NV12 path currently hardcodes y_stride = width and ignores WSFrame.stride. If the NV12 producer ever has padded rows, the receiver will misinterpret the buffer.

Suggested change
fn pack_nv12_frame_ref(
data: &[u8],
y_stride: u32,
width: u32,
height: u32,
frame_number: u32,
target_time_ns: u64,
) -> Vec<u8> {
let metadata_size = 28;
let mut output = Vec::with_capacity(data.len() + metadata_size);
output.extend_from_slice(data);
output.extend_from_slice(&y_stride.to_le_bytes());
output.extend_from_slice(&height.to_le_bytes());
output.extend_from_slice(&width.to_le_bytes());
output.extend_from_slice(&frame_number.to_le_bytes());
output.extend_from_slice(&target_time_ns.to_le_bytes());
output.extend_from_slice(&NV12_FORMAT_MAGIC.to_le_bytes());
output
}
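For reference, the 28-byte little-endian trailer laid out by the packing code above could be decoded on the receiving side like this (a sketch; the struct and helper names are assumptions inferred from the packing order):

```rust
// Fields in the order they are appended after the NV12 data:
// y_stride(4), height(4), width(4), frame_number(4), target_time_ns(8), magic(4).
struct Nv12Meta {
    y_stride: u32,
    height: u32,
    width: u32,
    frame_number: u32,
    target_time_ns: u64,
    magic: u32,
}

fn parse_nv12_trailer(buf: &[u8]) -> Option<Nv12Meta> {
    if buf.len() < 28 {
        return None;
    }
    let m = &buf[buf.len() - 28..];
    let u32_at = |o: usize| u32::from_le_bytes(m[o..o + 4].try_into().unwrap());
    Some(Nv12Meta {
        y_stride: u32_at(0),
        height: u32_at(4),
        width: u32_at(8),
        frame_number: u32_at(12),
        target_time_ns: u64::from_le_bytes(m[16..24].try_into().unwrap()),
        magic: u32_at(24),
    })
}
```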

Comment on lines +89 to +95
WSFrameFormat::Nv12 => pack_nv12_frame_ref(
&frame.data,
frame.width,
frame.height,
frame.frame_number,
frame.target_time_ns,
),

If you take the y_stride param in pack_nv12_frame_ref, this call site should pass through the stride from the frame.

Suggested change
WSFrameFormat::Nv12 => pack_nv12_frame_ref(
&frame.data,
frame.width,
frame.height,
frame.frame_number,
frame.target_time_ns,
),
WSFrameFormat::Nv12 => pack_nv12_frame_ref(
&frame.data,
frame.stride,
frame.width,
frame.height,
frame.frame_number,
frame.target_time_ns,
),

self.ensure_texture_size(device, effective_width, effective_height);
self.swap_output_buffer();

upload_plane_with_stride(queue, &self.y_texture, y_data, width, height, y_stride, "Y")?;

validate_dimensions can return effective_width/effective_height smaller than the inputs, but this function still uploads/dispatches using the original width/height. If downscaling kicks in, that looks like it can write past the allocated textures. Probably want to consistently use the effective dimensions for upload extents + dispatch (or explicitly bail when downscaled).

Comment on lines +764 to +768
let previous_frame = if let Some(prev) = session.pipelined_readback.take_pending() {
Some(prev.wait(device).await?)
} else {
None
};

Small behavior change: previously a failed previous readback was best-effort (error dropped) and rendering continued; now it hard-errors the whole pipeline via ?. If map failures can be transient on some GPUs, keeping this non-fatal might be safer.

Suggested change
let previous_frame = if let Some(prev) = session.pipelined_readback.take_pending() {
Some(prev.wait(device).await?)
} else {
None
};
let previous_frame = if let Some(prev) = session.pipelined_readback.take_pending() {
prev.wait(device).await.ok()
} else {
None
};

@richiemcilroy richiemcilroy merged commit 7b228c4 into main Feb 15, 2026
4 of 5 checks passed