Add realtime AI decoder / predecoder infrastructure (GPU + Host) w/ host dispatcher#457
Conversation
…atch

Introduce the cudaq-realtime library under realtime/, providing infrastructure for low-latency GPU-accelerated realtime coprocessing between FPGA/CPU and GPU systems in the NVQLink architecture.

Key components:
- C-compatible host API (cudaq_realtime.h) with dispatch manager/dispatcher lifecycle management (create, configure ring buffers, start/stop)
- Persistent GPU dispatch kernel that polls a ring buffer for incoming RPC requests and dispatches to registered handlers via function table lookup using FNV-1a hashed function IDs
- Two dispatch modes: DeviceCallMode (direct __device__ function calls) and GraphLaunchMode (device-side cudaGraphLaunch with backpressure and single-launch guards; requires sm_80+)
- Two kernel synchronization strategies: RegularKernel (__syncthreads) and CooperativeKernel (grid-wide cooperative_groups sync)
- Schema-driven type system for RPC argument/result descriptors
- Shared library (libcudaq-realtime.so) for the host API and static library (libcudaq-realtime-dispatch.a) for GPU kernel device code
- GTest-based unit tests covering device-call dispatch, host API integration, and device-side graph launch

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Introduce AIDecoderService and AIPreDecoderService in the QEC library, enabling a hybrid realtime pipeline where GPU-side TensorRT inference (predecoding) hands off results to CPU-side classical decoders like PyMatching.

Key components:
- AIDecoderService: wraps TensorRT inference in a CUDA graph using a gateway kernel pattern (mailbox pointer indirection) to bridge the dispatch kernel's dynamic ring buffer addresses to TRT's fixed I/O buffers. Supports a SKIP_TRT env var for testing without TensorRT.
- AIPreDecoderService: extends AIDecoderService with an N-deep pinned-memory circular queue for GPU-to-CPU handoff, a slot claim/release protocol (d_claimed_slot, d_inflight_flag), backpressure signaling via d_ready_flags/d_queue_idx, and a poll_next_job/release_job API with proper acquire/release memory ordering
- ThreadPool utility with optional Linux CPU core pinning for low-latency PyMatching worker threads
- End-to-end integration test demonstrating the full hybrid pipeline: dispatcher -> 4x AIPreDecoderService GPU inference -> polling thread -> 4-worker PyMatching thread pool -> TX flag acknowledgment
- CMake integration to find TensorRT and build the test with CUDA separable compilation

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Upgrade the AI predecoder test from a dummy identity TRT engine to a real d=7 r=7 surface code Z-type ONNX model. The service classes now support ONNX→TRT engine compilation, multi-output tensor bindings, and type-agnostic (INT32) I/O. The test fires 8 realistic syndrome payloads through 4 GPU pre-decoders and verifies end-to-end residual detector output handed off to simulated PyMatching workers. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Replace the simulated PyMatching worker with a real MWPM decoder using the d=7 surface code's static Z parity check matrix via the cudaq-qec decoder plugin system. The 336 residual detectors from the AI predecoder are sliced into 14 spatial rounds of 24 Z-stabilizer syndromes and decoded independently. A mutex protects the decoder for thread safety across the 4-worker thread pool. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Extract hard-coded d=7 parameters into a PipelineConfig struct with static factory methods for d=7, d=13, d=21, and d=31 surface codes. Runtime config selection via command-line argument (d7|d13|d21|d31) preserves existing d=7 functionality while enabling larger-distance experiments. ONNX_MODEL_PATH replaced with ONNX_MODEL_DIR to support per-config model filenames. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Introduce a reusable header-only latency and throughput tracker for realtime decoding pipelines. Provides per-request submit/complete timestamping, percentile statistics (p50/p90/p95/p99), and a formatted report including wall time, throughput, and per-request breakdown. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
… requests Enhance PipelineBenchmark to distinguish submitted vs completed requests, report timeouts, and cap per-request output to 50 entries. Integrate it into the predecoder pipeline test with per-request submit/complete markers and spin-wait polling for accurate latency measurement. Increase default total_requests from 20 to 100 across all distance configs. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Instrument the PyMatching worker with high-resolution timestamps to measure decode time vs worker overhead. Report a breakdown showing PyMatching decode, worker overhead, and GPU+dispatch+poll latency as percentages of the total end-to-end pipeline, plus per-round latency. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Add engine caching: prefer a pre-built .engine file when available, otherwise build from ONNX and save the engine for subsequent runs. Replace the single mutex-protected PyMatching decoder with a pool of per-worker decoder instances using thread-local index assignment, eliminating lock contention in the decode path. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Introduce a streaming test mode alongside the existing batch mode, activated via CLI (`stream [rate_us] [duration_s]`). The streaming mode uses dedicated producer/consumer threads to simulate continuous FPGA syndrome arrival with configurable inter-arrival rate, in-flight throttling (capped to num_predecoders), backpressure tracking, and warmup period exclusion from latency stats. Reports steady-state throughput, percentile latencies, and per-round timing breakdown. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Signed-off-by: Scott Thornton <sthornton@nvidia.com>
…nt kernel

The CUDA device runtime has a hardcoded 128 fire-and-forget graph launch slot limit that is never reclaimed while a persistent parent kernel runs, making the device-side dispatcher unsuitable for sustained operation. This adds a host-side CPU dispatcher thread that polls rx_flags and calls cudaGraphLaunch from host code on per-predecoder CUDA streams, bypassing the device runtime limit entirely. Streaming mode uses the host dispatcher; batch mode retains the device-side dispatcher for backward compatibility.

Key changes:
- New host_dispatcher.h/.cpp with host_dispatcher_loop()
- AIPreDecoderService::capture_graph() gains a device_launch flag for conditional cudaGraphInstantiateFlagDeviceLaunch vs standard instantiation
- d_queue_idx_ changed from cudaMalloc to cudaHostAllocMapped so the host dispatcher can read backpressure state without cudaMemcpy
- Mailbox bank changed to mapped pinned memory for zero-copy host writes
- Streaming test uses the host dispatcher with per-predecoder streams

Verified: d7 streaming 16,824 requests (219 us mean, 31 us/round); d13 streaming 6,227 requests (455 us mean, 35 us/round); zero errors.

Signed-off-by: Scott Thornton <sthornton@nvidia.com>
…actor

- Add host dispatcher with dynamic worker pool (idle_mask, inflight_slot_tags) to avoid head-of-line blocking; use libcu++ system-scope atomics for rx/tx/ready flags and mapped pinned memory.
- Extend AIPreDecoderService and PreDecoderJob with origin_slot for out-of-order completion; default queue_depth 1 for host dispatch.
- Add design doc (host_side_dispatcher_design_gemini.md) with spin-polling dispatcher and worker pseudocode/constraints.
- Refactor test_realtime_predecoder_w_pymatching for the dynamic pool and update CMakeLists; adjust the nvqlink daemon and dispatch_kernel for host-side dispatch.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
This commit fundamentally redesigns the host-side execution model to achieve
microsecond-level latency, shifting from a general-purpose thread pool to
a strict, pinned, and lock-free architecture.
Key architectural changes in `test_realtime_predecoder_w_pymatching.cpp`:
1. Dedicated Polling Threads (Removed Thread Pool)
- Replaced `cudaq::qec::utils::ThreadPool` and the single `incoming_thread`
with a vector of dedicated `std::thread` worker loops.
- Eliminates queueing latency, mutex locking, and context switching
overhead. Each worker thread now spins continuously checking for its own
GPU completions.
2. Strict CPU Thread Pinning
- Introduced `pin_thread_to_core` and `pin_current_thread_to_core` using
the Linux `pthread_setaffinity_np` API.
- Pinned the Dispatcher (Core 2), Producer (Core 3), Consumer (Core 4),
and all Worker threads (Cores 10+) to ensure they never migrate, keeping
their CPU caches perfectly warm.
3. High-Resolution Sub-Component Timing
- Added tracking arrays (`dispatch_ts`, `poll_ts`, `debug_dispatch_ts_arr`)
piped through `WorkerPoolContext` and `PreDecoderJob`.
- Updated end-of-run reporting to calculate differences between timestamps,
proving that Host Dispatch overhead is negligible (~1-3µs) and the
bottleneck is the GPU inference itself.
4. PyMatching Data Conversion Optimization
- Inside `pymatching_worker_task`, replaced the conversion of `int32_t`
syndrome data into a `std::vector<double>`.
- Now populates a pre-allocated `cudaqx::tensor<uint8_t>` to avoid slow
double-precision conversions inside the latency-critical worker loop.
5. NVTX Profiling Markers
- Included `<nvtx3/nvToolsExt.h>` and wrapped key blocks in
`nvtxRangePushA` and `nvtxRangePop`.
- Enables generation of `nsys` profiles to visually align CPU thread
activity with GPU TensorRT execution.
Other changes:
- Enable TensorRT FP16 builder flag (`kFP16`) in `ai_decoder_service.cu`
for supported platforms to accelerate GPU inference.
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…e directory. Refactored the pymatching demo code to use the updated functions in realtime. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…nstrumentation
Remove the duplicate QEC-local host_dispatcher.{h,cpp} in favor of
the canonical realtime library versions, eliminating link ambiguity.
Fix three correctness/performance bugs in the streaming pipeline:
- Consumer was strictly in-order, causing 327 µs of head-of-line
  blocking when parallel workers complete out of order. Changed to scan
  all active slots and harvest whichever are ready.
- Dispatcher set tx_flags=READY immediately on graph launch (when
tx_data_host was non-null), causing phantom completions. Set
tx_data_host/dev to nullptr so dispatcher uses IN_FLIGHT sentinel.
- Race between consumer clearing tx_flags and resetting slot_request:
producer could see slot available and write slot_request before the
consumer's slot_request=-1, permanently orphaning the slot. Fixed
by resetting slot_request before clearing tx_flags with a store
fence (__sync_synchronize) for ARM memory ordering.
Replace broken timing breakdown (dispatch_ts was always 0, making the
entire report show "Other/Misc Wait") with a 3-stage per-request
breakdown: [A] submit→worker poll, [B] worker task, [C] consumer
poll lag, with p50/p99 percentiles.
Also: reduce NUM_SLOTS 64→16 to cut queuing delay, remove unused
queue_depth from PipelineConfig, add DISABLE_PYMATCHING conditional
compilation, add stuck-request diagnostics, and remove batch mode /
watchdog / dead code.
Results (d7, 8 workers, open-loop):
62.5K req/s, 230 µs mean latency, 500K/500K completed, 0 drops.
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Increase num_predecoders and num_workers from 8 to 16 across all config presets, and set NUM_SLOTS to 32. With 8 workers the pipeline capacity (~24K req/s) was below the 33K req/s arrival rate at 30 µs spacing, causing unbounded queuing and p99 latency spikes to 4.9 ms. With 16 workers and 32 slots, d13 at 30 µs arrival sustains 25K req/s with 299 µs mean latency (23 µs/round), p99 = 334 µs, and near-zero backpressure (9K stalls vs 38M previously). Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Add optimization profile in build_engine_from_onnx for ONNX models with dynamic dimensions (batch dim = 0). When detected, pin all dynamic dims to 1 via min/opt/max profile so TensorRT can build the engine. Previously these models failed with "Failed to build TRT engine from ONNX". Switch d13 config to predecoder_memory_d13_T13_X.onnx, which takes detectors as input rather than raw measurements. End-to-end latency drops from 299 µs to 226 µs, mainly from PyMatching (69 µs → 12 µs). Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Replace byte-by-byte memory copies with wider load/store operations to reduce memory transactions in the CUDA graph. The input kernel now uses uint32_t (4-byte) copies, the passthrough kernel uses uint4 (16-byte) copies, and the output kernel is replaced entirely with cudaMemcpyAsync (DMA copy engine) followed by a minimal 1-thread signal kernel. Thread counts bumped from 128 to 256. Reduces d13 mean end-to-end latency from 226 µs to 141 µs (a ~85 µs improvement) and per-round latency from 17.4 µs to 10.8 µs. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
Move the input copy from an SM-based kernel inside the CUDA graph to a host-issued cudaMemcpyAsync via a new pre_launch_fn callback on HostDispatchWorker. This frees GPU compute units for TRT inference and reduces Stage A latency by ~19 µs. Add get_trt_input_ptr() and get_host_ring_ptrs() accessors to support the callback wiring. Separate the T104 ONNX model into its own d13_r104 pipeline config (104 rounds, 32K slots) and restore d13_r13 to use the T13 model. Update design document to reflect DMA data movement, pre-launch callback, out-of-order consumer, and ARM memory ordering constraints. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Introduce a RealtimePipeline class (pipeline.h, realtime_pipeline.cu) that encapsulates all ring buffer allocation, atomic synchronization, dispatcher wiring, worker thread management, and consumer slot lifecycle behind a callback-driven API. Application code provides a GPU stage factory, a CPU stage callback, and a completion handler -- zero direct atomic access required. Refactor test_realtime_predecoder_w_pymatching.cpp from 1083 lines to ~470 lines by replacing inline atomics, thread management, and slot tracking with pipeline.submit() / pipeline.stop() / pipeline.stats(). Add d13_r104 config (T=104 model, 131K slot size). Signed-off-by: Scott Thornton <wsttiger@gmail.com>
21 tests covering AIDecoderService, AIPreDecoderService, and the host-side dispatcher. Correctness tests push 5,000 random 6.4 KB payloads through the full CUDA graph pipeline and verify bitwise identity. Integration tests exercise multi-predecoder concurrency and sustained throughput (200 requests, regression for the 128-launch limit fix). SKIP_TRT buffer size increased to 1600 floats to match realistic syndrome payload sizes. Signed-off-by: Scott Thornton <sthornton@nvidia.com>
The legacy predecoder_input_kernel and its cudaq::nvqlink includes are no longer used since input data arrives via the pre_launch DMA callback. Design doc updated to reflect current code: removed kernel deletion, RealtimePipeline scaffolding, test suite, and SKIP_TRT buffer size (1600 floats). Signed-off-by: Scott Thornton <sthornton@nvidia.com>
…provements Add GPU-only pipeline mode that skips CPU worker threads when no cpu_stage callback is registered, using cudaLaunchHostFunc for completion signaling instead. Add post_launch_fn/post_launch_data callback to HostDispatchWorker and GpuWorkerResources, called after successful cudaGraphLaunch. Rename CpuStageContext fields to gpu_output/gpu_output_size and AIPreDecoderService buffers to h_predecoder_outputs_/d_predecoder_outputs_ for clarity. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…om:wsttiger/cudaqx into add_realtime_ai_predecoder_host_side_gb200
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Address all critical (C1-C4) and major (M1-M12) defects identified during code review.

Critical fixes:
- C1: Fix race condition in try_submit via compare_exchange_weak
- C2: Use uint64_t + a separate occupancy flag for slot_request to support the full request_id range (was int64_t with a -1 sentinel)
- C3: Add __syncthreads() before the response header write in gateway_output_kernel to prevent partially-written result reads
- C4: Always write the IN_FLIGHT sentinel to tx_flags after graph launch

Major fixes:
- M1: Remove cudaSetDeviceFlags from RingBufferManager (caller's duty)
- M2: Use std::atomic load with memory_order_acquire for tx/rx flag reads instead of plain volatile (ARM correctness)
- M3: Validate num_workers <= 64 (idle_mask capacity)
- M4: Validate gpu_factory is set before start()
- M5: Check producer_stop in RingBufferInjector::submit to prevent infinite spin after shutdown
- M6: Make the started flag std::atomic<bool>
- M7: Add CUDA error checks in AIDecoderService::capture_graph
- M8: Check the enqueueV3 return value in both service files
- M9: Fix tensor_volume for dynamic-shape dims (was wrapping to SIZE_MAX on dim=-1)
- M10: Assert num_workers == num_predecoders in the benchmark
- M11: Add aarch64 paths to the predecoder test's TRT CMake search
- M12: Replace vector<bool> with vector<uint8_t> to avoid concurrent-write UB

Also extracts the submit logic into a RingBufferInjector class to separate test infrastructure from the pipeline core.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Rename all public classes, structs, and type aliases in the realtime QEC headers to snake_case, matching the cudaqx project convention per PR review feedback. Key renames: AIDecoderService → ai_decoder_service, AIPreDecoderService → ai_predecoder_service, PreDecoderJob → pre_decoder_job, RealtimePipeline → realtime_pipeline, RingBufferInjector → ring_buffer_injector, PipelineStageConfig → pipeline_stage_config, GpuWorkerResources → gpu_worker_resources, CpuStageContext → cpu_stage_context, Completion → completion, and all associated callback type aliases. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
The gateway_output_kernel was leaving request_id and ptp_timestamp unset in the RPCResponse. Read both fields from the incoming RPCHeader before overwriting with the response, then echo them into the corresponding RPCResponse fields. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Yes, we use cudaq::qec::decoder for the CPU-side PyMatching stage — the benchmark creates a pool via decoder::get("pymatching", H, params) and each worker thread grabs one to decode residual syndromes. The ai_decoder_service / ai_predecoder_service classes don't subclass decoder though — they're TRT wrappers that manage CUDA graphs, ring buffer I/O, and ready-flag signaling, which is a fundamentally different abstraction than the synchronous decode(syndrome) → result interface. The trt_decoder plugin isn't used here. So it's a two-tier hybrid: GPU tier = ai_predecoder_service (TRT), CPU tier = standard decoder plugin (PyMatching). |
….github.com> I, Ben Howe <141149032+bmhowe23@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 8cd20a5 Signed-off-by: Ben Howe <141149032+bmhowe23@users.noreply.github.com>
I, Ben Howe <bhowe@nvidia.com>, hereby add my Signed-off-by to this commit: 74221d6 Signed-off-by: Ben Howe <bhowe@nvidia.com>
I, Ben Howe <bhowe@nvidia.com>, hereby add my Signed-off-by to this commit: 30107d0 Signed-off-by: Ben Howe <bhowe@nvidia.com>
I, Ben Howe <bhowe@nvidia.com>, hereby add my Signed-off-by to this commit: 008a734 Signed-off-by: Ben Howe <bhowe@nvidia.com>
Replace CMAKE_SOURCE_DIR with CMAKE_CURRENT_SOURCE_DIR-relative paths for test source files and ONNX_MODEL_DIR. In standalone QEC builds (CI), CMAKE_SOURCE_DIR is libs/qec rather than the repo root, causing doubled paths like libs/qec/libs/qec/lib/realtime/... that fail to resolve. Signed-off-by: Scott Thornton <wsttiger@gmail.com>
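A hypothetical sketch of the fix: resolve test sources and the model directory relative to the current CMakeLists.txt so standalone builds (where CMAKE_SOURCE_DIR is libs/qec) and monorepo builds produce the same paths. The target and directory names below are illustrative.

```cmake
# Paths anchored at this CMakeLists.txt work the same whether the build root
# is the repo or libs/qec, avoiding doubled paths like libs/qec/libs/qec/...
add_executable(test_realtime_predecoder
  ${CMAKE_CURRENT_SOURCE_DIR}/test_realtime_predecoder_w_pymatching.cpp)
target_compile_definitions(test_realtime_predecoder PRIVATE
  ONNX_MODEL_DIR="${CMAKE_CURRENT_SOURCE_DIR}/models")
```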
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…om:wsttiger/cudaqx into add_realtime_ai_predecoder_host_side_gb200
…decoder_host_side_gb200
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
The ai_decoder_service constructor checked std::getenv("SKIP_TRT") to
decide between TRT model loading and a passthrough identity kernel.
This was fragile and caused CI GPU test failures on 3 of 4 platforms
where the env var wasn't visible at construction time.
Add create_passthrough() static factories on ai_decoder_service and
ai_predecoder_service that construct test-only instances without
touching TRT. Remove the getenv check from the production constructor
and replace the SKIP_TRT check in capture_graph() with a context_
null check. Update all tests to use the factory instead of setenv.
Signed-off-by: Scott Thornton <wsttiger@gmail.com>
…om:wsttiger/cudaqx into add_realtime_ai_predecoder_host_side_gb200 Signed-off-by: Scott Thornton <wsttiger@gmail.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
Signed-off-by: Ben Howe <bhowe@nvidia.com>
Host-side AI predecoder pipeline with RealtimePipeline abstraction
Summary
Adds a complete host-side realtime decoding pipeline that pairs a GPU-based AI predecoder (TensorRT) with CPU-based PyMatching MWPM decoding, orchestrated by a new RealtimePipeline C++ abstraction that hides all low-level atomics and thread management from application code.

- Service classes (AIPreDecoderService / AIDecoderService): TensorRT inference wrapped in CUDA graphs with gateway kernels for ring buffer I/O, supporting ONNX model loading, dynamic batch dims, FP16, and engine caching
- Callback-driven API (GpuStageFactory, CpuStageCallback, CompletionCallback), eliminating direct atomic access from application code
- compare_exchange_weak slot claiming and backpressure tracking
- Wide copies (uint4 loads), DMA-based output copy, out-of-order consumer harvesting, ARM memory-ordering fixes (std::atomic acquire loads, __sync_synchronize fences), vector<bool> UB fix

Key files
- realtime/include/cudaq/realtime/pipeline.h, realtime/lib/pipeline/realtime_pipeline.cu
- realtime/include/.../host_dispatcher.h, realtime/lib/.../host_dispatcher.cu, host_dispatcher_capi.cu
- realtime/include/.../cudaq_realtime.h, realtime/lib/.../cudaq_realtime_api.cpp
- libs/qec/include/.../ai_decoder_service.h, ai_predecoder_service.h, libs/qec/lib/realtime/ai_decoder_service.cu, ai_predecoder_service.cu
- libs/qec/lib/realtime/test_realtime_predecoder_w_pymatching.cpp
- libs/qec/unittests/test_realtime_pipeline.cu, realtime/unittests/test_host_dispatcher.cu
- docs/host_side_dispatcher_design_gemini.md, docs/hybrid_ai_predecoder_pipeline.md

Test plan
- test_realtime_pipeline — 21 GTest cases pass (identity passthrough, multi-request correctness, shutdown, slot wraparound, sustained throughput)
- test_realtime_predecoder_w_pymatching — end-to-end benchmark runs with d7, d13_r13, d13_r104 configs at various injection rates
- test_host_dispatcher — host dispatcher unit tests pass
- test_dispatch_kernel — dispatch kernel unit tests pass
- SKIP_TRT=1 passthrough mode works for CI environments without GPU/TRT