Add realtime AI decoder / predecoder infrastructure (GPU + Host) w/ host dispatcher#457
Conversation
…atch

Introduce the cudaq-realtime library under realtime/, providing infrastructure for low-latency GPU-accelerated realtime coprocessing between FPGA/CPU and GPU systems in the NVQLink architecture.

Key components:
- C-compatible host API (cudaq_realtime.h) with dispatch manager/dispatcher lifecycle management (create, configure ring buffers, start/stop)
- Persistent GPU dispatch kernel that polls a ring buffer for incoming RPC requests and dispatches to registered handlers via function table lookup using FNV-1a hashed function IDs
- Two dispatch modes: DeviceCallMode (direct __device__ function calls) and GraphLaunchMode (device-side cudaGraphLaunch with backpressure and single-launch guards; requires sm_80+)
- Two kernel synchronization strategies: RegularKernel (__syncthreads) and CooperativeKernel (grid-wide cooperative_groups sync)
- Schema-driven type system for RPC argument/result descriptors
- Shared library (libcudaq-realtime.so) for the host API and static library (libcudaq-realtime-dispatch.a) for GPU kernel device code
- GTest-based unit tests covering device-call dispatch, host API integration, and device-side graph launch

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Introduce AIDecoderService and AIPreDecoderService in the QEC library, enabling a hybrid realtime pipeline where GPU-side TensorRT inference (predecoding) hands off results to CPU-side classical decoders like PyMatching.

Key components:
- AIDecoderService: wraps TensorRT inference in a CUDA graph using a gateway kernel pattern (mailbox pointer indirection) to bridge the dispatch kernel's dynamic ring buffer addresses to TRT's fixed I/O buffers. Supports a SKIP_TRT env var for testing without TensorRT.
- AIPreDecoderService: extends AIDecoderService with an N-deep pinned-memory circular queue for GPU-to-CPU handoff, a slot claim/release protocol (d_claimed_slot, d_inflight_flag), backpressure signaling via d_ready_flags/d_queue_idx, and a poll_next_job/release_job API with proper acquire/release memory ordering
- ThreadPool utility with optional Linux CPU core pinning for low-latency PyMatching worker threads
- End-to-end integration test demonstrating the full hybrid pipeline: dispatcher -> 4x AIPreDecoderService GPU inference -> polling thread -> 4-worker PyMatching thread pool -> TX flag acknowledgment
- CMake integration to find TensorRT and build the test with CUDA separable compilation

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Upgrade the AI predecoder test from a dummy identity TRT engine to a real d=7 r=7 surface code Z-type ONNX model. The service classes now support ONNX→TRT engine compilation, multi-output tensor bindings, and type-agnostic (INT32) I/O. The test fires 8 realistic syndrome payloads through 4 GPU pre-decoders and verifies end-to-end residual detector output handed off to simulated PyMatching workers. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Replace the simulated PyMatching worker with a real MWPM decoder using the d=7 surface code's static Z parity check matrix via the cudaq-qec decoder plugin system. The 336 residual detectors from the AI predecoder are sliced into 14 spatial rounds of 24 Z-stabilizer syndromes and decoded independently. A mutex protects the decoder for thread safety across the 4-worker thread pool. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Extract hard-coded d=7 parameters into a PipelineConfig struct with static factory methods for d=7, d=13, d=21, and d=31 surface codes. Runtime config selection via command-line argument (d7|d13|d21|d31) preserves existing d=7 functionality while enabling larger-distance experiments. ONNX_MODEL_PATH replaced with ONNX_MODEL_DIR to support per-config model filenames. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Introduce a reusable header-only latency and throughput tracker for realtime decoding pipelines. Provides per-request submit/complete timestamping, percentile statistics (p50/p90/p95/p99), and a formatted report including wall time, throughput, and per-request breakdown. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
… requests Enhance PipelineBenchmark to distinguish submitted vs completed requests, report timeouts, and cap per-request output to 50 entries. Integrate it into the predecoder pipeline test with per-request submit/complete markers and spin-wait polling for accurate latency measurement. Increase default total_requests from 20 to 100 across all distance configs. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Instrument the PyMatching worker with high-resolution timestamps to measure decode time vs worker overhead. Report a breakdown showing PyMatching decode, worker overhead, and GPU+dispatch+poll latency as percentages of the total end-to-end pipeline, plus per-round latency. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Add engine caching: prefer a pre-built .engine file when available, otherwise build from ONNX and save the engine for subsequent runs. Replace the single mutex-protected PyMatching decoder with a pool of per-worker decoder instances using thread-local index assignment, eliminating lock contention in the decode path. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Introduce a streaming test mode alongside the existing batch mode, activated via CLI (`stream [rate_us] [duration_s]`). The streaming mode uses dedicated producer/consumer threads to simulate continuous FPGA syndrome arrival with configurable inter-arrival rate, in-flight throttling (capped to num_predecoders), backpressure tracking, and warmup period exclusion from latency stats. Reports steady-state throughput, percentile latencies, and per-round timing breakdown. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Signed-off-by: Scott Thornton <sthornton@nvidia.com>
…nt kernel

The CUDA device runtime has a hardcoded 128 fire-and-forget graph launch slot limit that is never reclaimed while a persistent parent kernel runs, making the device-side dispatcher unsuitable for sustained operation. This adds a host-side CPU dispatcher thread that polls rx_flags and calls cudaGraphLaunch from host code on per-predecoder CUDA streams, bypassing the device runtime limit entirely. Streaming mode uses the host dispatcher; batch mode retains the device-side dispatcher for backward compatibility.

Key changes:
- New host_dispatcher.h/.cpp with host_dispatcher_loop()
- AIPreDecoderService::capture_graph() gains a device_launch flag for conditional cudaGraphInstantiateFlagDeviceLaunch vs standard instantiation
- d_queue_idx_ changed from cudaMalloc to cudaHostAllocMapped so the host dispatcher can read backpressure state without cudaMemcpy
- Mailbox bank changed to mapped pinned memory for zero-copy host writes
- Streaming test uses the host dispatcher with per-predecoder streams

Verified: d7 streaming 16,824 requests (219 us mean, 31 us/round); d13 streaming 6,227 requests (455 us mean, 35 us/round); zero errors.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
…actor

- Add host dispatcher with dynamic worker pool (idle_mask, inflight_slot_tags) to avoid head-of-line blocking; use libcu++ system-scope atomics for rx/tx/ready flags and mapped pinned memory.
- Extend AIPreDecoderService and PreDecoderJob with origin_slot for out-of-order completion; default queue_depth 1 for host dispatch.
- Add design doc (host_side_dispatcher_design_gemini.md) with spin-polling dispatcher and worker pseudocode/constraints.
- Refactor test_realtime_predecoder_w_pymatching for the dynamic pool and update CMakeLists; adjust the nvqlink daemon and dispatch_kernel for host-side dispatch.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
This commit fundamentally redesigns the host-side execution model to achieve
microsecond-level latency, shifting from a general-purpose thread pool to
a strict, pinned, and lock-free architecture.
Key architectural changes in `test_realtime_predecoder_w_pymatching.cpp`:
1. Dedicated Polling Threads (Removed Thread Pool)
- Replaced `cudaq::qec::utils::ThreadPool` and the single `incoming_thread`
with a vector of dedicated `std::thread` worker loops.
- Eliminates queueing latency, mutex locking, and context switching
overhead. Each worker thread now spins continuously checking for its own
GPU completions.
2. Strict CPU Thread Pinning
- Introduced `pin_thread_to_core` and `pin_current_thread_to_core` using
the Linux `pthread_setaffinity_np` API.
- Pinned the Dispatcher (Core 2), Producer (Core 3), Consumer (Core 4),
and all Worker threads (Cores 10+) to ensure they never migrate, keeping
their CPU caches perfectly warm.
3. High-Resolution Sub-Component Timing
- Added tracking arrays (`dispatch_ts`, `poll_ts`, `debug_dispatch_ts_arr`)
piped through `WorkerPoolContext` and `PreDecoderJob`.
- Updated end-of-run reporting to calculate differences between timestamps,
proving that Host Dispatch overhead is negligible (~1-3µs) and the
bottleneck is the GPU inference itself.
4. PyMatching Data Conversion Optimization
- Inside `pymatching_worker_task`, replaced the conversion of `int32_t`
syndrome data into a `std::vector<double>`.
- Now populates a pre-allocated `cudaqx::tensor<uint8_t>` to avoid slow
double-precision conversions inside the latency-critical worker loop.
5. NVTX Profiling Markers
- Included `<nvtx3/nvToolsExt.h>` and wrapped key blocks in
`nvtxRangePushA` and `nvtxRangePop`.
- Enables generation of `nsys` profiles to visually align CPU thread
activity with GPU TensorRT execution.
Other changes:
- Enable TensorRT FP16 builder flag (`kFP16`) in `ai_decoder_service.cu`
for supported platforms to accelerate GPU inference.
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…e directory. Refactored the pymatching demo code to use the updated functions in realtime. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…nstrumentation
Remove the duplicate QEC-local host_dispatcher.{h,cpp} in favor of
the canonical realtime library versions, eliminating link ambiguity.
Fix three correctness/performance bugs in the streaming pipeline:
- Consumer was strictly in-order, causing 327 µs of head-of-line
  blocking when parallel workers complete out of order. Changed to scan
  all active slots and harvest whichever are ready.
- Dispatcher set tx_flags=READY immediately on graph launch (when
tx_data_host was non-null), causing phantom completions. Set
tx_data_host/dev to nullptr so dispatcher uses IN_FLIGHT sentinel.
- Race between consumer clearing tx_flags and resetting slot_request:
producer could see slot available and write slot_request before the
consumer's slot_request=-1, permanently orphaning the slot. Fixed
by resetting slot_request before clearing tx_flags with a store
fence (__sync_synchronize) for ARM memory ordering.
Replace broken timing breakdown (dispatch_ts was always 0, making the
entire report show "Other/Misc Wait") with a 3-stage per-request
breakdown: [A] submit→worker poll, [B] worker task, [C] consumer
poll lag, with p50/p99 percentiles.
Also: reduce NUM_SLOTS 64→16 to cut queuing delay, remove unused
queue_depth from PipelineConfig, add DISABLE_PYMATCHING conditional
compilation, add stuck-request diagnostics, and remove batch mode /
watchdog / dead code.
Results (d7, 8 workers, open-loop):
62.5K req/s, 230 µs mean latency, 500K/500K completed, 0 drops.
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Increase num_predecoders and num_workers from 8 to 16 across all config presets, and set NUM_SLOTS to 32. With 8 workers the pipeline capacity (~24K req/s) was below the 33K req/s arrival rate at 30 µs spacing, causing unbounded queuing and p99 latency spikes to 4.9 ms. With 16 workers and 32 slots, d13 at 30 µs arrival sustains 25K req/s with 299 µs mean latency (23 µs/round), p99 = 334 µs, and near-zero backpressure (9K stalls vs 38M previously). Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Add optimization profile in build_engine_from_onnx for ONNX models with dynamic dimensions (batch dim = 0). When detected, pin all dynamic dims to 1 via min/opt/max profile so TensorRT can build the engine. Previously these models failed with "Failed to build TRT engine from ONNX". Switch d13 config to predecoder_memory_d13_T13_X.onnx, which takes detectors as input rather than raw measurements. End-to-end latency drops from 299 µs to 226 µs, mainly from PyMatching (69 µs → 12 µs). Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Replace byte-by-byte memory copies with wider load/store operations to reduce memory transactions in the CUDA graph. The input kernel now uses uint32_t (4-byte) copies, the passthrough kernel uses uint4 (16-byte) copies, and the output kernel is replaced entirely with cudaMemcpyAsync (DMA copy engine) followed by a minimal 1-thread signal kernel. Thread counts bumped from 128 to 256. Reduces d13 mean end-to-end latency from 226 µs to 141 µs (a ~85 µs improvement) and per-round latency from 17.4 µs to 10.8 µs. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Move the input copy from an SM-based kernel inside the CUDA graph to a host-issued cudaMemcpyAsync via a new pre_launch_fn callback on HostDispatchWorker. This frees GPU compute units for TRT inference and reduces Stage A latency by ~19 µs. Add get_trt_input_ptr() and get_host_ring_ptrs() accessors to support the callback wiring. Separate the T104 ONNX model into its own d13_r104 pipeline config (104 rounds, 32K slots) and restore d13_r13 to use the T13 model. Update design document to reflect DMA data movement, pre-launch callback, out-of-order consumer, and ARM memory ordering constraints. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Introduce a RealtimePipeline class (pipeline.h, realtime_pipeline.cu) that encapsulates all ring buffer allocation, atomic synchronization, dispatcher wiring, worker thread management, and consumer slot lifecycle behind a callback-driven API. Application code provides a GPU stage factory, a CPU stage callback, and a completion handler -- zero direct atomic access required. Refactor test_realtime_predecoder_w_pymatching.cpp from 1083 lines to ~470 lines by replacing inline atomics, thread management, and slot tracking with pipeline.submit() / pipeline.stop() / pipeline.stats(). Add d13_r104 config (T=104 model, 131K slot size). Signed-off-by: Scott Thornton <wsttiger@gmail.com>
21 tests covering AIDecoderService, AIPreDecoderService, and the host-side dispatcher. Correctness tests push 5,000 random 6.4 KB payloads through the full CUDA graph pipeline and verify bitwise identity. Integration tests exercise multi-predecoder concurrency and sustained throughput (200 requests, regression for the 128-launch limit fix). SKIP_TRT buffer size increased to 1600 floats to match realistic syndrome payload sizes. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
The legacy predecoder_input_kernel and its cudaq::nvqlink includes are no longer used since input data arrives via the pre_launch DMA callback. Design doc updated to reflect current code: removed kernel deletion, RealtimePipeline scaffolding, test suite, and SKIP_TRT buffer size (1600 floats). Signed-off-by: Scott Thornton <sthornton@nvidia.com>
…provements Add GPU-only pipeline mode that skips CPU worker threads when no cpu_stage callback is registered, using cudaLaunchHostFunc for completion signaling instead. Add post_launch_fn/post_launch_data callback to HostDispatchWorker and GpuWorkerResources, called after successful cudaGraphLaunch. Rename CpuStageContext fields to gpu_output/gpu_output_size and AIPreDecoderService buffers to h_predecoder_outputs_/d_predecoder_outputs_ for clarity. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…om:wsttiger/cudaqx into add_realtime_ai_predecoder_host_side_gb200
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Address all critical (C1-C4) and major (M1-M12) defects identified during code review.

Critical fixes:
- C1: Fix race condition in try_submit via compare_exchange_weak
- C2: Use uint64_t + a separate occupancy flag for slot_request to support the full request_id range (was int64_t with a -1 sentinel)
- C3: Add __syncthreads() before the response header write in gateway_output_kernel to prevent partially-written result reads
- C4: Always write the IN_FLIGHT sentinel to tx_flags after graph launch

Major fixes:
- M1: Remove cudaSetDeviceFlags from RingBufferManager (caller's duty)
- M2: Use std::atomic load with memory_order_acquire for tx/rx flag reads instead of plain volatile (ARM correctness)
- M3: Validate num_workers <= 64 (idle_mask capacity)
- M4: Validate gpu_factory is set before start()
- M5: Check producer_stop in RingBufferInjector::submit to prevent infinite spin after shutdown
- M6: Make the started flag std::atomic<bool>
- M7: Add CUDA error checks in AIDecoderService::capture_graph
- M8: Check the enqueueV3 return value in both service files
- M9: Fix tensor_volume for dynamic-shape dims (was wrapping to SIZE_MAX on dim=-1)
- M10: Assert num_workers == num_predecoders in the benchmark
- M11: Add aarch64 paths to the predecoder test's TRT CMake search
- M12: Replace vector<bool> with vector<uint8_t> to avoid concurrent-write UB

Also extracts the submit logic into a RingBufferInjector class to separate test infrastructure from the pipeline core.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Rename all public classes, structs, and type aliases in the realtime QEC headers to snake_case, matching the cudaqx project convention per PR review feedback. Key renames: AIDecoderService → ai_decoder_service, AIPreDecoderService → ai_predecoder_service, PreDecoderJob → pre_decoder_job, RealtimePipeline → realtime_pipeline, RingBufferInjector → ring_buffer_injector, PipelineStageConfig → pipeline_stage_config, GpuWorkerResources → gpu_worker_resources, CpuStageContext → cpu_stage_context, Completion → completion, and all associated callback type aliases. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
The gateway_output_kernel was leaving request_id and ptp_timestamp unset in the RPCResponse. Read both fields from the incoming RPCHeader before overwriting with the response, then echo them into the corresponding RPCResponse fields. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Yes, we use cudaq::qec::decoder for the CPU-side PyMatching stage — the benchmark creates a pool via decoder::get("pymatching", H, params) and each worker thread grabs one to decode residual syndromes. The ai_decoder_service / ai_predecoder_service classes don't subclass decoder though — they're TRT wrappers that manage CUDA graphs, ring buffer I/O, and ready-flag signaling, which is a fundamentally different abstraction than the synchronous decode(syndrome) → result interface. The trt_decoder plugin isn't used here. So it's a two-tier hybrid: GPU tier = ai_predecoder_service (TRT), CPU tier = standard decoder plugin (PyMatching). |
….github.com> I, Ben Howe <141149032+bmhowe23@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 8cd20a5 Signed-off-by: Ben Howe <141149032+bmhowe23@users.noreply.github.com>
I, Ben Howe <bhowe@nvidia.com>, hereby add my Signed-off-by to this commit: 74221d6 Signed-off-by: Ben Howe <bhowe@nvidia.com>
I, Ben Howe <bhowe@nvidia.com>, hereby add my Signed-off-by to this commit: 30107d0 Signed-off-by: Ben Howe <bhowe@nvidia.com>
I, Ben Howe <bhowe@nvidia.com>, hereby add my Signed-off-by to this commit: 008a734 Signed-off-by: Ben Howe <bhowe@nvidia.com>
Replace CMAKE_SOURCE_DIR with CMAKE_CURRENT_SOURCE_DIR-relative paths for test source files and ONNX_MODEL_DIR. In standalone QEC builds (CI), CMAKE_SOURCE_DIR is libs/qec rather than the repo root, causing doubled paths like libs/qec/libs/qec/lib/realtime/... that fail to resolve. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
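A hypothetical sketch of the fix: resolve test sources and the model directory relative to the current CMakeLists.txt so standalone builds (where CMAKE_SOURCE_DIR is libs/qec) and monorepo builds produce the same paths. The target and directory names below are illustrative.

```cmake
# Paths anchored at this CMakeLists.txt work the same whether the build root
# is the repo or libs/qec, avoiding doubled paths like libs/qec/libs/qec/...
add_executable(test_realtime_predecoder
  ${CMAKE_CURRENT_SOURCE_DIR}/test_realtime_predecoder_w_pymatching.cpp)
target_compile_definitions(test_realtime_predecoder PRIVATE
  ONNX_MODEL_DIR="${CMAKE_CURRENT_SOURCE_DIR}/models")
```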
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…om:wsttiger/cudaqx into add_realtime_ai_predecoder_host_side_gb200
…decoder_host_side_gb200
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
The ai_decoder_service constructor checked std::getenv("SKIP_TRT") to
decide between TRT model loading and a passthrough identity kernel.
This was fragile and caused CI GPU test failures on 3 of 4 platforms
where the env var wasn't visible at construction time.
Add create_passthrough() static factories on ai_decoder_service and
ai_predecoder_service that construct test-only instances without
touching TRT. Remove the getenv check from the production constructor
and replace the SKIP_TRT check in capture_graph() with a context_
null check. Update all tests to use the factory instead of setenv.
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…om:wsttiger/cudaqx into add_realtime_ai_predecoder_host_side_gb200 Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
Host-side AI predecoder pipeline with RealtimePipeline abstraction
Summary
Adds a complete host-side realtime decoding pipeline that pairs a GPU-based AI predecoder (TensorRT) with CPU-based PyMatching MWPM decoding, orchestrated by a new RealtimePipeline C++ abstraction that hides all low-level atomics and thread management from application code.

- Service classes (AIPreDecoderService / AIDecoderService): TensorRT inference wrapped in CUDA graphs with gateway kernels for ring buffer I/O, supporting ONNX model loading, dynamic batch dims, FP16, and engine caching
- Callback-driven API (GpuStageFactory, CpuStageCallback, CompletionCallback), eliminating direct atomic access from application code
- compare_exchange_weak slot claiming and backpressure tracking
- Wide copies (uint4 loads), DMA-based output copy, out-of-order consumer harvesting, ARM memory-ordering fixes (std::atomic acquire loads, __sync_synchronize fences), vector<bool> UB fix

Key files
- realtime/include/cudaq/realtime/pipeline.h, realtime/lib/pipeline/realtime_pipeline.cu
- realtime/include/.../host_dispatcher.h, realtime/lib/.../host_dispatcher.cu, host_dispatcher_capi.cu
- realtime/include/.../cudaq_realtime.h, realtime/lib/.../cudaq_realtime_api.cpp
- libs/qec/include/.../ai_decoder_service.h, ai_predecoder_service.h, libs/qec/lib/realtime/ai_decoder_service.cu, ai_predecoder_service.cu
- libs/qec/lib/realtime/test_realtime_predecoder_w_pymatching.cpp
- libs/qec/unittests/test_realtime_pipeline.cu, realtime/unittests/test_host_dispatcher.cu
- docs/host_side_dispatcher_design_gemini.md, docs/hybrid_ai_predecoder_pipeline.md

Test plan
- test_realtime_pipeline — 21 GTest cases pass (identity passthrough, multi-request correctness, shutdown, slot wraparound, sustained throughput)
- test_realtime_predecoder_w_pymatching — end-to-end benchmark runs with d7, d13_r13, d13_r104 configs at various injection rates
- test_host_dispatcher — host dispatcher unit tests pass
- test_dispatch_kernel — dispatch kernel unit tests pass
- SKIP_TRT=1 passthrough mode works for CI environments without GPU/TRT