Add FPGA-based test application for realtime predecoder #490

Merged
bmhowe23 merged 5 commits into NVIDIA:main from wsttiger:add_realtime_predecoder_fpga2
Apr 8, 2026
Conversation

Collaborator

@wsttiger wsttiger commented Apr 8, 2026

Summary

Add FPGA RDMA transport for the AI predecoder + PyMatching pipeline. Syndrome data arrives from a Hololink FPGA via RoCE v2, passes through TensorRT inference (AI predecoder), then PyMatching MWPM decoding, all orchestrated by realtime_pipeline.

Changes

New: FPGA predecoder bridge (hololink_predecoder_bridge.cpp)

  • Creates a Hololink GpuRoceTransceiver (DOCA GPU-RoCE) and feeds its ring buffer into realtime_pipeline via the external_ringbuffer path
  • Runs 8 TRT predecoder workers + 16 PyMatching decode threads
  • Supports --data-dir for ground-truth correctness verification against observables.bin
  • Logs per-shot diagnostic output confirming RDMA receipt, TRT inference result, and PyMatching decode
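The ground-truth check behind --data-dir can be pictured as a per-shot comparison between the decoded observable and the reference value. The sketch below is only an illustration: the PR does not show the layout of observables.bin, so the one-byte-per-shot layout and the function name `count_mismatches` are assumptions, not the bridge's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: counts shots whose predicted observable differs from
// the reference. Assumes one byte per shot with the observable in bit 0; the
// real observables.bin layout is not shown in this PR.
std::size_t count_mismatches(const std::vector<std::uint8_t> &predicted,
                             const std::vector<std::uint8_t> &expected) {
  assert(predicted.size() == expected.size());
  std::size_t mismatches = 0;
  for (std::size_t i = 0; i < predicted.size(); ++i) {
    // Compare only the low bit so stray upper bits do not count as errors.
    if ((predicted[i] & 1u) != (expected[i] & 1u))
      ++mismatches;
  }
  return mismatches;
}
```

Dividing the returned count by the number of shots would give a logical error rate of the kind reported under Test results.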

New: Orchestration script (hololink_predecoder_test.sh)

  • 2-process FPGA mode: bridge + hololink_fpga_syndrome_playback
  • 3-process emulated mode: emulator + bridge + playback
  • Converts binary detectors.bin to the text format the playback tool expects
  • Config-aware defaults for page size, num shots, and BRAM constraints

Fix: realtime_pipeline.cu external ring buffer consumer

  • The consumer_loop gated slot processing on slot_occupied[], which is only set by the software ring_buffer_injector. With an FPGA-sourced external ring buffer, slots were silently skipped and no requests ever completed. Skip the slot_occupied and drain checks when external_ring_ is true.
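A simplified model may help show why the gate starved the consumer. The code below is not the realtime_pipeline.cu implementation: `Slot`, `count_processed`, and the `apply_fix` switch are hypothetical names, and only the slot_occupied gate is modeled (the drain check is left out).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified ring-buffer slot: one readiness flag that the
// producer (software injector or FPGA) sets when data has landed.
struct Slot {
  bool data_ready = false;
};

// Counts the slots a single consumer pass would process.
// slot_occupied mirrors bookkeeping that only the software injector updates.
std::size_t count_processed(const std::vector<Slot> &ring,
                            const std::vector<bool> &slot_occupied,
                            bool external_ring, bool apply_fix) {
  assert(ring.size() == slot_occupied.size());
  std::size_t processed = 0;
  for (std::size_t i = 0; i < ring.size(); ++i) {
    // Before the fix the gate always applied; after the fix it is skipped
    // when an external producer owns the ring buffer.
    const bool gate_applies = !(apply_fix && external_ring);
    if (gate_applies && !slot_occupied[i])
      continue; // slot skipped without any error being raised
    if (ring[i].data_ready)
      ++processed;
  }
  return processed;
}
```

With every slot ready but slot_occupied never set, as happens when the FPGA writes the ring directly, the model processes nothing until the gate is bypassed for the external ring.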

Refactor: Consolidate shared predecoder code

  • Extract ~350 lines of duplicated types (PipelineConfig, DecoderContext, PreLaunchCopyCtx, WorkerCtx, PyMatchQueue, TestData, SparseCSR, loaders) into predecoder_pipeline_common.{h,cpp}
  • Move both drivers and the orchestration script into libs/qec/unittests/realtime/
  • test_realtime_predecoder_w_pymatching.cpp is a git mv from libs/qec/lib/realtime/ with the shared code removed

Test results

Software benchmark (d13_r104, 20 s run, one request injected every 104 us):

| Metric | Value |
| --- | --- |
| Submitted / Completed | 192,309 / 192,309 |
| Throughput | 9,610 req/s |
| Mean latency | 370 us (p50=332, p99=1,204) |
| PyMatching decode | 218 us avg |
| Syndrome reduction | 98.3% |
| Pipeline LER | 0.0020 (384 / 192,309) |
| Predecoder-only LER | 0.3980 |
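The reported figures can be cross-checked with two small calculations. This is a sketch with hypothetical helper names, and it assumes the 104 us figure is the interval between injected requests.

```cpp
#include <cstdint>

// Logical error rate: failed shots divided by completed shots.
double logical_error_rate(std::uint64_t failures, std::uint64_t shots) {
  if (shots == 0)
    return 0.0; // avoid dividing by zero on an empty run
  return static_cast<double>(failures) / static_cast<double>(shots);
}

// Requests per second implied by a fixed injection interval in microseconds.
double rate_from_interval_us(double interval_us) {
  return 1.0e6 / interval_us;
}
```

`logical_error_rate(384, 192309)` is about 0.001997, which rounds to the reported 0.0020, and `rate_from_interval_us(104.0)` is about 9,615 req/s, close to the measured 9,610 req/s.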

FPGA bridge (d13_r104, 1 shot via real Hololink FPGA on GB200):

wsttiger added 4 commits April 6, 2026 23:21
…Hololink

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…ernal ring buffer

Three main changes:

1. Add hololink_predecoder_bridge: receives syndrome data from the
   Hololink FPGA via RDMA and runs AI predecoder (TRT) + PyMatching
   through realtime_pipeline using the external_ringbuffer path.
   Includes --data-dir for ground-truth correctness verification.

2. Fix consumer_loop for external ring buffers: the consumer gated
   slot processing on slot_occupied[], which is only set by the
   software ring_buffer_injector. With an FPGA-sourced external ring
   buffer, slots were silently skipped. Skip the slot_occupied and
   drain checks when external_ring_ is true.

3. Consolidate shared code: extract ~350 lines of duplicated types
   (PipelineConfig, DecoderContext, PreLaunchCopyCtx, WorkerCtx,
   PyMatchQueue, TestData, SparseCSR, loaders) into
   predecoder_pipeline_common.{h,cpp}. Move both drivers and the
   orchestration script into unittests/realtime/.

Tested: 20s software benchmark reproduces 192,309 requests at 9,610
req/s with LER=0.0020. FPGA bridge completes 1 shot of d13_r104 via
real RDMA from the Hololink FPGA on GB200.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Add a per-shot log line in the bridge CPU stage that confirms each
pipeline step: RDMA receipt (detector count + input nonzero), TRT
inference (logical_pred + residual nonzero count). This makes it
possible to verify the full data path without ground-truth data.
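A log line of this kind could be assembled roughly as follows. The field names, the byte-per-detector buffers, and the output format are assumptions made for illustration; the bridge's actual log format is not shown in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Counts nonzero entries (fired detectors) in a byte-per-detector buffer.
std::size_t count_nonzero(const std::vector<std::uint8_t> &bits) {
  std::size_t n = 0;
  for (std::uint8_t b : bits)
    n += (b != 0) ? 1 : 0;
  return n;
}

// Hypothetical formatter: one line per shot summarizing what arrived over
// RDMA (input) and what the TRT predecoder produced (logical_pred, residual).
std::string format_shot_log(std::size_t shot,
                            const std::vector<std::uint8_t> &input,
                            int logical_pred,
                            const std::vector<std::uint8_t> &residual) {
  std::ostringstream os;
  os << "shot=" << shot << " detectors=" << input.size()
     << " input_nonzero=" << count_nonzero(input)
     << " logical_pred=" << logical_pred
     << " residual_nonzero=" << count_nonzero(residual);
  return os.str();
}
```

A nonzero input count shows that syndrome data reached the bridge, and a residual count lower than the input count suggests the predecoder removed part of the syndrome before PyMatching runs.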

Update hololink_predecoder_test.sh to resolve the bridge binary from
its new location in unittests/realtime/ instead of unittests/utils/.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>

copy-pr-bot bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Collaborator

bmhowe23 commented Apr 8, 2026

/ok to test 87a6872

Collaborator

bmhowe23 commented Apr 8, 2026

I submitted the regular per-PR CI plus

Collaborator

@bmhowe23 bmhowe23 left a comment


The changes to the core part of the library seem pretty minimal and low-risk, so this LGTM.

@bmhowe23 bmhowe23 changed the title from "Add realtime predecoder fpga2" to "Add FPGA-based test application for realtime predecoder" on Apr 8, 2026
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
@bmhowe23 bmhowe23 merged commit 31ae759 into NVIDIA:main Apr 8, 2026
5 checks passed
3 participants