NVIDIA · RamyaGuru · May 6, 2026 · May 5, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -20,7 +20,7 @@ CMake options (full table in `docs/getting-started.md`):
 - `DAQIRI_BUILD_EXAMPLES` — builds the benchmark executables (default `ON`).
 - `DAQIRI_REORDER_GPU_PROFILE` — enable CUDA event timing in the DPDK reorder kernels (off by default).
 
-CUDA architectures are hardcoded to `80;90` (A100, H100) in `src/CMakeLists.txt:25`. Change this when targeting other GPUs.
+CUDA architectures are hardcoded to `80;90;121` (A100, H100, GB10) in `src/CMakeLists.txt:25`. Change this when targeting other GPUs.
 
 **Socket → RDMA dependency**: the socket backend reuses the RoCE transport from the RDMA implementation, so `src/CMakeLists.txt:144-152` automatically prepends `rdma` to `DAQIRI_MGR_LIST` whenever `socket` is requested. The reverse is not true — listing `rdma` alone does not pull in `socket`.
 
@@ -30,11 +30,11 @@ There is no unit test suite. Verification is done via the benchmark executables
 
 | Executable | Source | Typical config |
 |---|---|---|
-| `daqiri_bench_raw_gpudirect` | `raw_gpudirect_bench.cpp` | `daqiri_bench_raw_tx_rx.yaml`, `daqiri_bench_raw_sw_loopback.yaml`, `daqiri_bench_raw_rx_multi_q.yaml` |
+| `daqiri_bench_raw_gpudirect` | `raw_gpudirect_bench.cpp` | `daqiri_bench_raw_tx_rx.yaml`, `daqiri_bench_raw_tx_rx_spark.yaml`, `daqiri_bench_raw_sw_loopback.yaml`, `daqiri_bench_raw_rx_multi_q.yaml` |
 | `daqiri_bench_raw_hds` | `raw_hds_bench.cpp` | `daqiri_bench_raw_tx_rx_hds.yaml` |
 | `daqiri_bench_raw_reorder_seq` | `raw_reorder_seq_bench.cpp` | `daqiri_bench_raw_tx_rx_reorder_seq_1024*.yaml`, `daqiri_bench_raw_rx_reorder_seq_*.yaml` |
 | `daqiri_bench_raw_reorder_quantize` | `raw_reorder_quantize_bench.cpp` | `daqiri_bench_raw_tx_rx_reorder_quantize_seq_batch.yaml` |
-| `daqiri_bench_rdma` | `rdma_bench.cpp` | `daqiri_bench_rdma_tx_rx.yaml` |
+| `daqiri_bench_rdma` | `rdma_bench.cpp` | `daqiri_bench_rdma_tx_rx.yaml`, `daqiri_bench_rdma_tx_rx_spark.yaml` |
 | `daqiri_bench_socket` | `socket_bench.cpp` | `daqiri_bench_socket_{udp,tcp}_tx_rx.yaml` |
 
 The four `raw_*` benches share `raw_bench_common.{cpp,h}` and accept `--seconds N`. `daqiri_bench_rdma` and `daqiri_bench_socket` also take `--mode {tx,rx,both}`.

diff --git a/docs/configuration.md b/docs/configuration.md
@@ -46,8 +46,11 @@ and their `kind` determines the receive mode (CPU-only, header-data split, or ba
   - type: `string`
   - values:
     - `huge` — CPU hugepages (recommended for CPU buffers)
-    - `device` — GPU VRAM
-    - `host_pinned` — Pinned CPU pages (not recommended for high-throughput RX/TX)
+    - `device` — GPU VRAM (discrete GPUs only; requires GPUDirect via peermem or DMA-BUF)
+    - `host_pinned` — Pinned CPU pages allocated via `cudaHostAlloc`. **Recommended on
+      integrated GPUs (e.g. NVIDIA GB10 / DGX Spark)**, where the NIC cannot peer-DMA
+      into device memory and CUDA reports DMA-BUF unsupported. On discrete-GPU systems,
+      prefer `device` for high-throughput RX/TX paths.
     - `host` — Regular CPU memory (not recommended)
 - **`affinity`**: GPU ID for `device` memory, or NUMA node ID for CPU memory.
   - type: `integer`

diff --git a/docs/getting-started.md b/docs/getting-started.md
@@ -96,7 +96,7 @@ Then build the DAQIRI library:
 | `DAQIRI_BUILD_EXAMPLES` | `ON` | Build benchmark executables. |
 | `BUILD_SHARED_LIBS` | — | Build as shared library. |
 
-CUDA architectures are hardcoded to `80;90` (A100, H100) in `src/CMakeLists.txt`.
+CUDA architectures are hardcoded to `80;90;121` (A100, H100, GB10) in `src/CMakeLists.txt`.
 
 ## Next Steps
 

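When targeting a GPU outside the hardcoded list, the value to add is the GPU's compute capability with the dot removed (for example, `12.1` becomes `121`). The sketch below shows one way to derive it. `cc_to_arch` is a hypothetical helper, not part of DAQIRI, and the `nvidia-smi` query shown in the comment assumes a driver recent enough to expose the `compute_cap` field.

```shell
# Hypothetical helper (not part of DAQIRI): convert a CUDA compute
# capability such as "12.1" into the CMake architecture form "121".
cc_to_arch() {
  printf '%s\n' "$1" | tr -d '. '
}

# Example (needs an NVIDIA driver; prints one value per visible GPU):
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader |
#     while read -r cc; do cc_to_arch "$cc"; done
```

The resulting value can then be appended to the architecture list in `src/CMakeLists.txt`.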
diff --git a/docs/tutorials/benchmarking_examples.md b/docs/tutorials/benchmarking_examples.md
@@ -35,6 +35,13 @@ docker run --rm -it --privileged \
 
 ## Update the loopback configuration
 
+!!! tip "DGX Spark"
+
+    For systems configured per the [DGX Spark profile](system_configuration.md#dgx-spark-profile), use these configs to skip the PCIe/IP/CPU-core edits below:
+
+    - [`daqiri_bench_raw_tx_rx_spark.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_spark.yaml) for `daqiri_bench_raw_gpudirect` — still set `eth_dst_addr` to the RX MAC: `cat /sys/class/net/enP2p1s0f0np0/address`.
+    - [`daqiri_bench_rdma_tx_rx_spark.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_rdma_tx_rx_spark.yaml) for `daqiri_bench_rdma` — no further edits needed.
+
 The benchmark executables and example YAML configurations are located at:
 
 | | Binaries | YAML configs |

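The `eth_dst_addr` step from the DGX Spark tip could be scripted roughly as follows. This is a minimal sketch, not part of DAQIRI: `set_eth_dst_addr` is a hypothetical helper, it assumes GNU `sed`, and it assumes `eth_dst_addr` appears in the YAML as a plain `key: value` pair on its own line. It rewrites every matching line and drops any trailing comment on that line.

```shell
# Hypothetical helper (not part of DAQIRI).
# $1: file holding the MAC address (e.g. /sys/class/net/<iface>/address)
# $2: YAML config to update in place
# The body runs in a subshell, so variables stay local and `exit`
# only ends the function.
set_eth_dst_addr() (
  addr_file="$1"
  yaml="$2"
  mac="$(tr -d '[:space:]' < "$addr_file")" || exit 1
  # Refuse to write anything that does not look like a MAC address.
  printf '%s\n' "$mac" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$' || {
    echo "unexpected MAC address: $mac" >&2
    exit 1
  }
  # Replace the value after "eth_dst_addr:" and keep the indentation.
  sed -i -E "s|^([[:space:]]*eth_dst_addr:[[:space:]]*).*|\1${mac}|" "$yaml"
)
```

For the interface named in the tip, a call would look like `set_eth_dst_addr /sys/class/net/enP2p1s0f0np0/address daqiri_bench_raw_tx_rx_spark.yaml`, run from the directory that holds the config.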
diff --git a/docs/tutorials/configuration-walkthrough.md b/docs/tutorials/configuration-walkthrough.md
@@ -100,7 +100,7 @@ bench_tx: # (30)!
 3. :material-wrench: `master_core` is the ID of the CPU core used for setup. It does not need to be isolated, and is recommended to differ from the `cpu_core` fields below used for polling the NIC.
 4. The `memory_regions` section lists where the NIC will write/read data from/to when bypassing the OS kernel. Tip: when using GPU buffer regions, keeping the sum of their buffer sizes lower than 80% of your BAR1 size is generally a good rule of thumb.
 5. A descriptive name for that memory region to refer to later in the `interfaces` section.
-6. :material-package-variant: The type of memory region. Best options are `device` (GPU) or `huge` (hugepages - CPU). Also supported: `host_pinned` (CPU, pinned) and `host` (CPU, unpinned). Choose based on whether your application processes packets on the GPU or CPU.
+6. :material-package-variant: The type of memory region. Choose based on whether your application processes packets on the GPU or the CPU, and on the GPU class. On discrete GPUs, the best options are `device` (GPU VRAM, via GPUDirect) or `huge` (CPU hugepages). On integrated GPUs (e.g. NVIDIA GB10 / DGX Spark), where the NIC cannot peer-DMA into `device` memory, use `host_pinned` instead. Also supported: `host` (CPU, unpinned). See the full [memory regions reference](../configuration.md#memory-regions).
 7. :material-wrench: The GPU ID for `device` memory regions. The NUMA node ID for CPU memory regions.
 8. :material-package-variant: The number of buffers in the memory region. A higher value means more time to process the data, but takes additional space on the GPU BAR1. Too low increases the risk of dropping packets from the NIC having nowhere to write (Rx) or higher latency from buffering (Tx). A good starting point is 5x the `batch_size` below.
 9. :material-package-variant: The size of each buffer in the memory region. These should be equal to your maximum packet size, or less if breaking down packets (e.g. header-data split, see the `rx` queue below).