diff --git a/CLAUDE.md b/CLAUDE.md index 74a0cfa..d3aea65 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -20,7 +20,7 @@ CMake options (full table in `docs/getting-started.md`): - `DAQIRI_BUILD_EXAMPLES` — builds the benchmark executables (default `ON`). - `DAQIRI_REORDER_GPU_PROFILE` — enable CUDA event timing in the DPDK reorder kernels (off by default). -CUDA architectures are hardcoded to `80;90` (A100, H100) in `src/CMakeLists.txt:25`. Change this when targeting other GPUs. +CUDA architectures are hardcoded to `80;90;121` (A100, H100, GB10) in `src/CMakeLists.txt:25`. Change this when targeting other GPUs. **Socket → RDMA dependency**: the socket backend reuses the RoCE transport from the RDMA implementation, so `src/CMakeLists.txt:144-152` automatically prepends `rdma` to `DAQIRI_MGR_LIST` whenever `socket` is requested. The reverse is not true — listing `rdma` alone does not pull in `socket`. @@ -30,11 +30,11 @@ There is no unit test suite. Verification is done via the benchmark executables | Executable | Source | Typical config | |---|---|---| -| `daqiri_bench_raw_gpudirect` | `raw_gpudirect_bench.cpp` | `daqiri_bench_raw_tx_rx.yaml`, `daqiri_bench_raw_sw_loopback.yaml`, `daqiri_bench_raw_rx_multi_q.yaml` | +| `daqiri_bench_raw_gpudirect` | `raw_gpudirect_bench.cpp` | `daqiri_bench_raw_tx_rx.yaml`, `daqiri_bench_raw_tx_rx_spark.yaml`, `daqiri_bench_raw_sw_loopback.yaml`, `daqiri_bench_raw_rx_multi_q.yaml` | | `daqiri_bench_raw_hds` | `raw_hds_bench.cpp` | `daqiri_bench_raw_tx_rx_hds.yaml` | | `daqiri_bench_raw_reorder_seq` | `raw_reorder_seq_bench.cpp` | `daqiri_bench_raw_tx_rx_reorder_seq_1024*.yaml`, `daqiri_bench_raw_rx_reorder_seq_*.yaml` | | `daqiri_bench_raw_reorder_quantize` | `raw_reorder_quantize_bench.cpp` | `daqiri_bench_raw_tx_rx_reorder_quantize_seq_batch.yaml` | -| `daqiri_bench_rdma` | `rdma_bench.cpp` | `daqiri_bench_rdma_tx_rx.yaml` | +| `daqiri_bench_rdma` | `rdma_bench.cpp` | `daqiri_bench_rdma_tx_rx.yaml`, `daqiri_bench_rdma_tx_rx_spark.yaml` 
| | `daqiri_bench_socket` | `socket_bench.cpp` | `daqiri_bench_socket_{udp,tcp}_tx_rx.yaml` | The four `raw_*` benches share `raw_bench_common.{cpp,h}` and accept `--seconds N`. `daqiri_bench_rdma` and `daqiri_bench_socket` also take `--mode {tx,rx,both}`. diff --git a/docs/configuration.md b/docs/configuration.md index 606357f..34059c8 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -46,8 +46,11 @@ and their `kind` determines the receive mode (CPU-only, header-data split, or ba - type: `string` - values: - `huge` — CPU hugepages (recommended for CPU buffers) - - `device` — GPU VRAM - - `host_pinned` — Pinned CPU pages (not recommended for high-throughput RX/TX) + - `device` — GPU VRAM (discrete GPUs only; requires GPUDirect via peermem or DMA-BUF) + - `host_pinned` — Pinned CPU pages allocated via `cudaHostAlloc`. **Recommended on + integrated GPUs (e.g. NVIDIA GB10 / DGX Spark)**, where the NIC cannot peer-DMA + into device memory and CUDA reports DMA-BUF unsupported. On discrete-GPU systems, + prefer `device` for high-throughput RX/TX paths. - `host` — Regular CPU memory (not recommended) - **`affinity`**: GPU ID for `device` memory, or NUMA node ID for CPU memory. - type: `integer` diff --git a/docs/getting-started.md b/docs/getting-started.md index e8f2143..75c214f 100644 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -96,7 +96,7 @@ Then build the DAQIRI library: | `DAQIRI_BUILD_EXAMPLES` | `ON` | Build benchmark executables. | | `BUILD_SHARED_LIBS` | — | Build as shared library. | -CUDA architectures are hardcoded to `80;90` (A100, H100) in `src/CMakeLists.txt`. +CUDA architectures are hardcoded to `80;90;121` (A100, H100, GB10) in `src/CMakeLists.txt`. 
## Next Steps diff --git a/docs/tutorials/benchmarking_examples.md b/docs/tutorials/benchmarking_examples.md index ef3dafa..f265366 100644 --- a/docs/tutorials/benchmarking_examples.md +++ b/docs/tutorials/benchmarking_examples.md @@ -35,6 +35,13 @@ docker run --rm -it --privileged \ ## Update the loopback configuration +!!! tip "DGX Spark" + + For systems configured per the [DGX Spark profile](system_configuration.md#dgx-spark-profile), use these configs to skip the PCIe/IP/CPU-core edits below: + + - [`daqiri_bench_raw_tx_rx_spark.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_spark.yaml) for `daqiri_bench_raw_gpudirect` — still set `eth_dst_addr` to the RX MAC: `cat /sys/class/net/enP2p1s0f0np0/address`. + - [`daqiri_bench_rdma_tx_rx_spark.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_rdma_tx_rx_spark.yaml) for `daqiri_bench_rdma` — no further edits needed. + The benchmark executables and example YAML configurations are located at: | | Binaries | YAML configs | diff --git a/docs/tutorials/configuration-walkthrough.md b/docs/tutorials/configuration-walkthrough.md index d16615e..ba8e63b 100644 --- a/docs/tutorials/configuration-walkthrough.md +++ b/docs/tutorials/configuration-walkthrough.md @@ -100,7 +100,7 @@ bench_tx: # (30)! 3. :material-wrench: `master_core` is the ID of the CPU core used for setup. It does not need to be isolated, and is recommended to differ from the `cpu_core` fields below used for polling the NIC. 4. The `memory_regions` section lists where the NIC will write/read data from/to when bypassing the OS kernel. Tip: when using GPU buffer regions, keeping the sum of their buffer sizes lower than 80% of your BAR1 size is generally a good rule of thumb. 5. A descriptive name for that memory region to refer to later in the `interfaces` section. -6. :material-package-variant: The type of memory region. Best options are `device` (GPU) or `huge` (hugepages - CPU). 
Also supported: `host_pinned` (CPU, pinned) and `host` (CPU, unpinned). Choose based on whether your application processes packets on the GPU or CPU. +6. :material-package-variant: The type of memory region. On discrete GPUs, the best options are `device` (GPU VRAM, GPUDirect) or `huge` (hugepages, CPU). On integrated GPUs (e.g. NVIDIA GB10 / DGX Spark) where `device` cannot be peer-DMA'd by the NIC, use `host_pinned` instead. Also supported: `host` (CPU, unpinned). See the full [memory regions reference](../configuration.md#memory-regions). Choose based on whether your application processes packets on the GPU or CPU and on the GPU class. 7. :material-wrench: The GPU ID for `device` memory regions. The NUMA node ID for CPU memory regions. 8. :material-package-variant: The number of buffers in the memory region. A higher value means more time to process the data, but takes additional space on the GPU BAR1. Too low increases the risk of dropping packets from the NIC having nowhere to write (Rx) or higher latency from buffering (Tx). A good starting point is 5x the `batch_size` below. 9. :material-package-variant: The size of each buffer in the memory region. These should be equal to your maximum packet size, or less if breaking down packets (e.g. header-data split, see the `rx` queue below). diff --git a/docs/tutorials/system_configuration.md b/docs/tutorials/system_configuration.md index a0579e7..21683ad 100644 --- a/docs/tutorials/system_configuration.md +++ b/docs/tutorials/system_configuration.md @@ -5,563 +5,607 @@ hide: # System Configuration -DAQIRI requires a system with an [**NVIDIA SmartNIC**](https://www.nvidia.com/en-us/networking/ethernet-adapters/) (ConnectX-6 Dx or later) and a [**discrete GPU**](https://www.nvidia.com/en-us/design-visualization/desktop-graphics/) (GPUDirect-capable). Check the [Getting Started](../getting-started.md) page to confirm you are running on a supported platform. 
+DAQIRI requires an [**NVIDIA SmartNIC**](https://www.nvidia.com/en-us/networking/ethernet-adapters/) (ConnectX-6 Dx or later) and a CUDA-capable GPU. Two reference platforms are documented in this tutorial — pick yours below: -This page covers both the **required system setup** to get DAQIRI running and **optional performance tuning** to maximize throughput and minimize latency. Complete the [System Setup for DAQIRI](#system-setup-for-daqiri) section first, then move on to [System Optimization](#system-optimization) as needed. +- **IGX Orin** with a discrete GPU (e.g. [RTX 6000 Ada](https://www.nvidia.com/en-us/design-visualization/rtx-6000/)): peermem-based GPUDirect, a separate GPU BAR1, and a discrete-PCIe path between GPU and NIC. The originally-supported reference platform. +- **DGX Spark** (Grace Blackwell **GB10** superchip): unified CPU/GPU memory via NVLink-C2C, integrated **ConnectX-7**, no peermem, and GPUDirect via `kind: host_pinned` data buffers (see [PR #41](https://github.com/nvidia/daqiri/pull/41)). -## System Setup for DAQIRI +The two tabs below are linked across the page (mkdocs-material `content.tabs.link`), so other same-named sub-tabs (`tune_system.py` vs `manual`, `One-time` vs `Persistent`, etc.) will switch in lockstep as you toggle the platform. -This section covers the essential system setup steps needed before using DAQIRI. Complete this setup before moving on to [System Optimization](#system-optimization) or [running benchmarks](benchmarking_examples.md). +=== "IGX Orin" -In this tutorial, we will be developing on an **NVIDIA IGX Orin platform** with [IGX SW 1.1](https://docs.nvidia.com/igx-orin/user-guide/latest/base-os.html) and an [NVIDIA RTX 6000 ADA GPU](https://www.nvidia.com/en-us/design-visualization/rtx-6000/), which is the configuration that is currently actively tested. The concepts should be applicable to other systems based on Ubuntu 22.04 as well. 
It should also work on other Linux distributions with a glibc version of 2.35 or higher by containerizing the dependencies and applications on top of an Ubuntu 22.04 image, but this is not actively tested at this time. + This tab covers both the **required system setup** to get DAQIRI running on IGX Orin and **optional performance tuning** to maximize throughput and minimize latency. Complete the [System Setup for DAQIRI](#system-setup-for-daqiri) section first, then move on to [System Optimization](#system-optimization) as needed. -!!! Warning "Secure boot conflict" + ## System Setup for DAQIRI - If you have secure boot enabled on your system, you might need to disable it as a prerequisite to run some of the configurations below ([switching the NIC link layers to Ethernet](#switch-your-nic-link-layers-to-ethernet), [updating the MRRS of your NIC ports](#step-3-maximize-the-nics-max-read-request-size-mrrs), [updating the BAR1 size of your GPU](#step-8-maximize-gpu-bar1-size)). Secure boot can be re-enabled after the configurations are completed. + This section covers the essential system setup steps needed before using DAQIRI. Complete this setup before moving on to [System Optimization](#system-optimization) or [running benchmarks](benchmarking_examples.md). -### Check your NIC drivers + In this tutorial, we will be developing on an **NVIDIA IGX Orin platform** with [IGX SW 1.1](https://docs.nvidia.com/igx-orin/user-guide/latest/base-os.html) and an [NVIDIA RTX 6000 ADA GPU](https://www.nvidia.com/en-us/design-visualization/rtx-6000/), which is the configuration that is currently actively tested. The concepts should be applicable to other systems based on Ubuntu 22.04 as well. It should also work on other Linux distributions with a glibc version of 2.35 or higher by containerizing the dependencies and applications on top of an Ubuntu 22.04 image, but this is not actively tested at this time. -Ensure your NIC drivers are loaded: + !!! 
Warning "Secure boot conflict" -```bash -lsmod | grep ib_core -``` + If you have secure boot enabled on your system, you might need to disable it as a prerequisite to run some of the configurations below ([switching the NIC link layers to Ethernet](#switch-your-nic-link-layers-to-ethernet), [updating the MRRS of your NIC ports](#step-3-maximize-the-nics-max-read-request-size-mrrs), [updating the BAR1 size of your GPU](#step-8-maximize-gpu-bar1-size)). Secure boot can be re-enabled after the configurations are completed. -??? abstract "See an example output" + ### Check your NIC drivers - This would be an expected output, where `ib_core` is listed on the left. + Ensure your NIC drivers are loaded: ```bash - ib_core 442368 8 rdma_cm,ib_ipoib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm - mlx_compat 20480 11 rdma_cm,ib_ipoib,mlxdevm,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core + lsmod | grep ib_core ``` -If this is empty, install the latest OFED drivers from DOCA (the DOCA APT repository should already be configured from the [DAQIRI build setup](../getting-started.md#build-the-daqiri-library)), and reboot your system: + ??? abstract "See an example output" -```bash -sudo apt update -sudo apt install doca-ofed -sudo reboot -``` + This would be an expected output, where `ib_core` is listed on the left. -!!! note + ```bash + ib_core 442368 8 rdma_cm,ib_ipoib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm + mlx_compat 20480 11 rdma_cm,ib_ipoib,mlxdevm,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core + ``` - If this is not empty, you can still install the newest OFED drivers from `doca-ofed` above. If you choose to keep your current drivers, install the following utilities for convenience later on. 
They include tools like `ibstat`, `ibv_devinfo`, `ibdev2netdev`, `mlxconfig`: + If this is empty, install the latest OFED drivers from DOCA (the DOCA APT repository should already be configured from the [DAQIRI build setup](../getting-started.md#build-the-daqiri-library)), and reboot your system: ```bash sudo apt update - sudo apt install infiniband-diags ibverbs-utils mlnx-ofed-kernel-utils mft - ``` - - Also upgrade the user space libraries to make sure your tools have all the symbols they need: - - ```bash - sudo apt install libibverbs1 librdmacm1 rdma-core + sudo apt install doca-ofed + sudo reboot ``` -Running `ibstat` or `ibv_devinfo` will confirm your NIC interfaces are recognized by your drivers. + !!! note -### Switch your NIC Link Layers to Ethernet + If this is not empty, you can still install the newest OFED drivers from `doca-ofed` above. If you choose to keep your current drivers, install the following utilities for convenience later on. They include tools like `ibstat`, `ibv_devinfo`, `ibdev2netdev`, `mlxconfig`: -NVIDIA SmartNICs can function in two separate modes (called link layer): + ```bash + sudo apt update + sudo apt install infiniband-diags ibverbs-utils mlnx-ofed-kernel-utils mft + ``` -- Ethernet (ETH) -- Infiniband (IB) + Also upgrade the user space libraries to make sure your tools have all the symbols they need: -To identify the current mode, run `ibstat` or `ibv_devinfo` and look for the `Link Layer` value. + ```bash + sudo apt install libibverbs1 librdmacm1 rdma-core + ``` -```bash -ibv_devinfo -``` + Running `ibstat` or `ibv_devinfo` will confirm your NIC interfaces are recognized by your drivers. -??? note "Warning about `libvmw_pvrdma-rdmav34.so`" + ### Switch your NIC Link Layers to Ethernet - If you see a warning like `couldn't load driver 'libvmw_pvrdma-rdmav34.so'`, this is harmless. It refers to a VMware paravirtual RDMA driver that is not relevant on bare-metal systems and can be safely ignored. 
+ NVIDIA SmartNICs can function in two separate modes (called link layer): -??? failure "Couldn't load driver 'libmlx5-rdmav34.so'" + - Ethernet (ETH) + - Infiniband (IB) - If you see an error like this, you might have different versions for your OFED tools and libraries. Attempt after upgrading your user space libraries to match the version of your OFED tools like so: + To identify the current mode, run `ibstat` or `ibv_devinfo` and look for the `Link Layer` value. ```bash - sudo apt update - sudo apt install libibverbs1 librdmacm1 rdma-core - ``` - -??? abstract "See an example output" - - In the example below, the `mlx5_0` interface is in Ethernet mode, while the `mlx5_1` interface is in Infiniband mode. Do not pay attention to the `transport` value which is always `InfiniBand`. - - ```sh hl_lines="18 37" - hca_id: mlx5_0 - transport: InfiniBand (0) - fw_ver: 28.38.1002 - node_guid: 48b0:2d03:00f4:07fb - sys_image_guid: 48b0:2d03:00f4:07fb - vendor_id: 0x02c9 - vendor_part_id: 4129 - hw_ver: 0x0 - board_id: NVD0000000033 - phys_port_cnt: 1 - port: 1 - state: PORT_ACTIVE (4) - max_mtu: 4096 (5) - active_mtu: 4096 (5) - sm_lid: 0 - port_lid: 0 - port_lmc: 0x00 - link_layer: Ethernet - - hca_id: mlx5_1 - transport: InfiniBand (0) - fw_ver: 28.38.1002 - node_guid: 48b0:2d03:00f4:07fc - sys_image_guid: 48b0:2d03:00f4:07fb - vendor_id: 0x02c9 - vendor_part_id: 4129 - hw_ver: 0x0 - board_id: NVD0000000033 - phys_port_cnt: 1 - port: 1 - state: PORT_ACTIVE (4) - max_mtu: 4096 (5) - active_mtu: 4096 (5) - sm_lid: 0 - port_lid: 0 - port_lmc: 0x00 - link_layer: InfiniBand + ibv_devinfo ``` -**For Holoscan Networking, we want the NIC to use the ETH link layer.** To switch the link layer mode, there are two possible options: + ??? note "Warning about `libvmw_pvrdma-rdmav34.so`" -1. On IGX Orin developer kits, you can switch that setting through the BIOS: [see IGX Orin documentation](https://docs.nvidia.com/igx-orin/user-guide/latest/switch-network-link.html). -2. 
On any system with a NVIDIA NIC (including the IGX Orin developer kits), you can run the commands below from a terminal: + If you see a warning like `couldn't load driver 'libvmw_pvrdma-rdmav34.so'`, this is harmless. It refers to a VMware paravirtual RDMA driver that is not relevant on bare-metal systems and can be safely ignored. - 1. Identify the PCI address of your NVIDIA NIC + ??? failure "Couldn't load driver 'libmlx5-rdmav34.so'" - === "ibdev2netdev" + If you see an error like this, you might have different versions for your OFED tools and libraries. Attempt after upgrading your user space libraries to match the version of your OFED tools like so: - ```bash - nic_pci=$(sudo ibdev2netdev -v | awk '{print $1}' | head -n1) - ``` + ```bash + sudo apt update + sudo apt install libibverbs1 librdmacm1 rdma-core + ``` - === "lspci" + ??? abstract "See an example output" - ```bash - # `0200` is the PCI-SIG class code for Ethernet controllers - # `0207` is the PCI-SIG class code for Infiniband controllers - # `15b3` is the Vendor ID for Mellanox - nic_pci=$(lspci -n | awk '($2 == "0200:" || $2 == "0207:") && $3 ~ /^15b3:/ {print $1; exit}') - ``` + In the example below, the `mlx5_0` interface is in Ethernet mode, while the `mlx5_1` interface is in Infiniband mode. Do not pay attention to the `transport` value which is always `InfiniBand`. 
+ + ```sh hl_lines="18 37" + hca_id: mlx5_0 + transport: InfiniBand (0) + fw_ver: 28.38.1002 + node_guid: 48b0:2d03:00f4:07fb + sys_image_guid: 48b0:2d03:00f4:07fb + vendor_id: 0x02c9 + vendor_part_id: 4129 + hw_ver: 0x0 + board_id: NVD0000000033 + phys_port_cnt: 1 + port: 1 + state: PORT_ACTIVE (4) + max_mtu: 4096 (5) + active_mtu: 4096 (5) + sm_lid: 0 + port_lid: 0 + port_lmc: 0x00 + link_layer: Ethernet + + hca_id: mlx5_1 + transport: InfiniBand (0) + fw_ver: 28.38.1002 + node_guid: 48b0:2d03:00f4:07fc + sys_image_guid: 48b0:2d03:00f4:07fb + vendor_id: 0x02c9 + vendor_part_id: 4129 + hw_ver: 0x0 + board_id: NVD0000000033 + phys_port_cnt: 1 + port: 1 + state: PORT_ACTIVE (4) + max_mtu: 4096 (5) + active_mtu: 4096 (5) + sm_lid: 0 + port_lid: 0 + port_lmc: 0x00 + link_layer: InfiniBand + ``` - 2. Set both link layers to Ethernet. `LINK_TYPE_P1` and `LINK_TYPE_P2` are for `mlx5_0` and `mlx5_1` respectively. You can choose to only set one of them. `ETH` or `2` is Ethernet mode, and `IB` or `1` is for InfiniBand. + **For Holoscan Networking, we want the NIC to use the ETH link layer.** To switch the link layer mode, there are two possible options: - ```bash - sudo mlxconfig -d $nic_pci set LINK_TYPE_P1=ETH LINK_TYPE_P2=ETH - ``` + 1. On IGX Orin developer kits, you can switch that setting through the BIOS: [see IGX Orin documentation](https://docs.nvidia.com/igx-orin/user-guide/latest/switch-network-link.html). + 2. On any system with a NVIDIA NIC (including the IGX Orin developer kits), you can run the commands below from a terminal: - Apply with `y`. + 1. Identify the PCI address of your NVIDIA NIC - ??? 
abstract "See an example output" + === "ibdev2netdev" - ```sh - Device #1: - ---------- + ```bash + nic_pci=$(sudo ibdev2netdev -v | awk '{print $1}' | head -n1) + ``` - Device type: ConnectX7 - Name: P3740-B0-QSFP_Ax - Description: NVIDIA Prometheus P3740 ConnectX-7 VPI PCIe Switch Motherboard; 400Gb/s; dual-port QSFP; PCIe switch5.0 X8 SLOT0 ;X16 SLOT2; secure boot; - Device: 0005:03:00.0 + === "lspci" - Configurations: Next Boot New - LINK_TYPE_P1 ETH(2) ETH(2) - LINK_TYPE_P2 IB(1) ETH(2) + ```bash + # `0200` is the PCI-SIG class code for Ethernet controllers + # `0207` is the PCI-SIG class code for Infiniband controllers + # `15b3` is the Vendor ID for Mellanox + nic_pci=$(lspci -n | awk '($2 == "0200:" || $2 == "0207:") && $3 ~ /^15b3:/ {print $1; exit}') + ``` - Apply new Configuration? (y/n) [n] : - y + 2. Set both link layers to Ethernet. `LINK_TYPE_P1` and `LINK_TYPE_P2` are for `mlx5_0` and `mlx5_1` respectively. You can choose to only set one of them. `ETH` or `2` is Ethernet mode, and `IB` or `1` is for InfiniBand. - Applying... Done! - -I- Please reboot machine to load new configurations. + ```bash + sudo mlxconfig -d $nic_pci set LINK_TYPE_P1=ETH LINK_TYPE_P2=ETH ``` - - `Next Boot` is the current value that was expected to be used at the next reboot. - - `New` is the value you're about to set to override `Next Boot`. + Apply with `y`. - ??? failure "ERROR: write counter to semaphore: Operation not permitted" + ??? abstract "See an example output" - Disable secure boot on your system ahead of changing the link type of your NIC ports. It can be re-enabled afterwards. + ```sh + Device #1: + ---------- - 3. Reboot your system. 
+ Device type: ConnectX7 + Name: P3740-B0-QSFP_Ax + Description: NVIDIA Prometheus P3740 ConnectX-7 VPI PCIe Switch Motherboard; 400Gb/s; dual-port QSFP; PCIe switch5.0 X8 SLOT0 ;X16 SLOT2; secure boot; + Device: 0005:03:00.0 - ```bash - sudo reboot - ``` + Configurations: Next Boot New + LINK_TYPE_P1 ETH(2) ETH(2) + LINK_TYPE_P2 IB(1) ETH(2) -### Configure the IP addresses of the NIC ports + Apply new Configuration? (y/n) [n] : + y -First, we want to identify the logical names of your NIC interfaces. Connecting an SFP cable in just one of the ports of the NIC will help you identify which port is which. Run the following command once the cable is in place: + Applying... Done! + -I- Please reboot machine to load new configurations. + ``` -```bash -ibdev2netdev -``` + - `Next Boot` is the current value that was expected to be used at the next reboot. + - `New` is the value you're about to set to override `Next Boot`. -??? abstract "See an example output" + ??? failure "ERROR: write counter to semaphore: Operation not permitted" - In the example below, only `mlx5_1` has a cable connected (`Up`), and its logical ethernet name is `eth1`: + Disable secure boot on your system ahead of changing the link type of your NIC ports. It can be re-enabled afterwards. - ```bash - $ ibdev2netdev - mlx5_0 port 1 ==> eth0 (Down) - mlx5_1 port 1 ==> eth1 (Up) - ``` + 3. Reboot your system. + + ```bash + sudo reboot + ``` -??? failure "ibdev2netdev does not show the NIC" + ### Configure the IP addresses of the NIC ports - If you have a cable connected but it does not show Up/Down in the output of `ibdev2netdev`, you can try to parse the output of `dmesg` instead. The example below shows that `0005:03:00.1` is plugged, and that it is associated with `eth1`: + First, we want to identify the logical names of your NIC interfaces. Connecting an SFP cable in just one of the ports of the NIC will help you identify which port is which. 
Run the following command once the cable is in place: - ```sh - $ sudo dmesg | grep -w mlx5_core - ... - [ 11.512808] mlx5_core 0005:03:00.0 eth0: Link down - [ 11.640670] mlx5_core 0005:03:00.1 eth1: Link down - ... - [ 3712.267103] mlx5_core 0005:03:00.1: Port module event: module 1, Cable plugged + ```bash + ibdev2netdev ``` -The next step is to set a static IP on the interface you'd like to use so you can refer to it in your Holoscan applications. First, check if you already have any addresses configured using the ethernet interface names identified above (in our case, `eth0` and `eth1`): + ??? abstract "See an example output" -```bash -ip -f inet addr show eth0 -ip -f inet addr show eth1 -``` + In the example below, only `mlx5_1` has a cable connected (`Up`), and its logical ethernet name is `eth1`: -If nothing appears, or you'd like to change the address, you can set an IP address through the Network Manager user interface, CLI (`nmcli`), or other IP configuration tools. In the example below, we configure the `eth0` interface with an address of `1.1.1.1/24`, and the `eth1` interface with an address of `2.2.2.2/24`. + ```bash + $ ibdev2netdev + mlx5_0 port 1 ==> eth0 (Down) + mlx5_1 port 1 ==> eth1 (Up) + ``` -=== "One-time" + ??? failure "ibdev2netdev does not show the NIC" - ```bash - sudo ip addr add 1.1.1.1/24 dev eth0 - sudo ip addr add 2.2.2.2/24 dev eth1 - ``` + If you have a cable connected but it does not show Up/Down in the output of `ibdev2netdev`, you can try to parse the output of `dmesg` instead. The example below shows that `0005:03:00.1` is plugged, and that it is associated with `eth1`: -=== "Persistent" + ```sh + $ sudo dmesg | grep -w mlx5_core + ... + [ 11.512808] mlx5_core 0005:03:00.0 eth0: Link down + [ 11.640670] mlx5_core 0005:03:00.1 eth1: Link down + ... 
+ [ 3712.267103] mlx5_core 0005:03:00.1: Port module event: module 1, Cable plugged + ``` - Set these variables to your desired values: + The next step is to set a static IP on the interface you'd like to use so you can refer to it in your Holoscan applications. First, check if you already have any addresses configured using the ethernet interface names identified above (in our case, `eth0` and `eth1`): ```bash - if_name=eth0 - if_static_ip=1.1.1.1/24 + ip -f inet addr show eth0 + ip -f inet addr show eth1 ``` - === "NetworkManager" + If nothing appears, or you'd like to change the address, you can set an IP address through the Network Manager user interface, CLI (`nmcli`), or other IP configuration tools. In the example below, we configure the `eth0` interface with an address of `1.1.1.1/24`, and the `eth1` interface with an address of `2.2.2.2/24`. - Update the IP with `nmcli`: + === "One-time" ```bash - sudo nmcli connection modify $if_name ipv4.addresses $if_static_ip - sudo nmcli connection up $if_name + sudo ip addr add 1.1.1.1/24 dev eth0 + sudo ip addr add 2.2.2.2/24 dev eth1 ``` - === "systemd-networkd" - - Create a network config file with the static IP: - - ```bash - cat << EOF | sudo tee /etc/systemd/network/20-$if_name.network - [Match] - MACAddress=$(cat /sys/class/net/$if_name/address) - - [Network] - Address=$if_static_ip - EOF - ``` + === "Persistent" - Apply now: + Set these variables to your desired values: ```bash - sudo systemctl restart systemd-networkd + if_name=eth0 + if_static_ip=1.1.1.1/24 ``` -!!! note + === "NetworkManager" - If you are connecting the NIC to another NIC with an [interconnect](https://www.nvidia.com/en-us/networking/interconnect/), do the same on the other system with an IP address on the same network segment. - For example, to communicate with `1.1.1.1/24` above (`/24` -> `255.255.255.0` submask), setup your other system with an IP between `1.1.1.2` and `1.1.1.254`, and the same `/24` submask. 
+ Update the IP with `nmcli`: -### Enable GPUDirect + ```bash + sudo nmcli connection modify $if_name ipv4.addresses $if_static_ip + sudo nmcli connection up $if_name + ``` -Assuming you already have [NVIDIA drivers](https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html#ubuntu-installation-network) installed, check if the `nvidia_peermem` kernel module is loaded: + === "systemd-networkd" -=== "tune_system.py" + Create a network config file with the static IP: - === "Debian installation" + ```bash + cat << EOF | sudo tee /etc/systemd/network/20-$if_name.network + [Match] + MACAddress=$(cat /sys/class/net/$if_name/address) - ```bash - sudo /opt/nvidia/holoscan/bin/tune_system.py --check topo - ``` + [Network] + Address=$if_static_ip + EOF + ``` - === "From source" + Apply now: - ```bash - cd holohub - sudo ./operators/advanced_network/python/tune_system.py --check topo + ```bash + sudo systemctl restart systemd-networkd + ``` - ``` + !!! note - ??? abstract "See an example output" + If you are connecting the NIC to another NIC with an [interconnect](https://www.nvidia.com/en-us/networking/interconnect/), do the same on the other system with an IP address on the same network segment. + For example, to communicate with `1.1.1.1/24` above (`/24` -> `255.255.255.0` submask), setup your other system with an IP between `1.1.1.2` and `1.1.1.254`, and the same `/24` submask. - ```log - 2025-03-12 14:15:07 - INFO - GPU 0: NVIDIA RTX A6000 has GPUDirect support. - 2025-03-12 14:15:27 - INFO - nvidia-peermem module is loaded. 
- ``` + ### Enable GPUDirect -```bash -lsmod | grep nvidia_peermem -``` + Assuming you already have [NVIDIA drivers](https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/index.html#ubuntu-installation-network) installed, check if the `nvidia_peermem` kernel module is loaded: -If it's not loaded, run the following command, then check again: + === "tune_system.py" -=== "One-time" + === "Debian installation" - ```bash - sudo modprobe nvidia_peermem - ``` + ```bash + sudo /opt/nvidia/holoscan/bin/tune_system.py --check topo + ``` -=== "Persistent" + === "From source" - ```bash - sudo echo "nvidia-peermem" >> /etc/modules - sudo systemctl restart systemd-modules-load.service - ``` + ```bash + cd holohub + sudo ./operators/advanced_network/python/tune_system.py --check topo -??? failure "Error loading the `nvidia-peermem` kernel module" + ``` - If you run into an error loading the `nvidia-peermem` kernel module, follow these steps: + ??? abstract "See an example output" - 1. Install the `doca-ofed` package to get the latest drivers for your NIC as [documented above](#check-your-nic-drivers). - 2. Restart your system. - 3. Rebuild your NVIDIA drivers with DKMS like so: + ```log + 2025-03-12 14:15:07 - INFO - GPU 0: NVIDIA RTX A6000 has GPUDirect support. + 2025-03-12 14:15:27 - INFO - nvidia-peermem module is loaded. + ``` ```bash - peermem_ko=$(find /lib/modules/$(uname -r) -name "*peermem*.ko") - nv_dkms=$(dpkg -S "$peermem_ko" | cut -d: -f1) - sudo dpkg-reconfigure $nv_dkms - sudo modprobe nvidia_peermem + lsmod | grep nvidia_peermem ``` -??? info "Why peermem and not dma buf?" + If it's not loaded, run the following command, then check again: - `peermem` is currently the only GPUDirect interface supported by all our [networking backends](background.md#kernel-bypass). This section will therefore provide instructions for `peermem` and not `dma buf`. - ---- + === "One-time" -## System Optimization + ```bash + sudo modprobe nvidia_peermem + ``` -!!! 
warning "Advanced" + === "Persistent" - The section below is for advanced users looking to extract more performance out of their system. You can choose to skip this section and return to it later if performance if your application is not satisfactory. + ```bash + echo "nvidia-peermem" | sudo tee -a /etc/modules + sudo systemctl restart systemd-modules-load.service + ``` -While the configurations above are the minimum requirements to get a NIC and a NVIDIA GPU to communicate while bypassing the OS kernel stack, performance can be further improved in most scenarios by tuning the system as described below. + ??? failure "Error loading the `nvidia-peermem` kernel module" -The table below summarizes all optimization steps covered in this section, along with the corresponding `tune_system.py` flags and whether each setting can be made persistent across reboots. Use it as a checklist to track your progress. + If you run into an error loading the `nvidia-peermem` kernel module, follow these steps: -| Step | Description | Tuning Script Flag | Persistent Option Available?
| -|------|-------------|--------------------|-------------| -| 1 | [PCIe topology](#step-1-ensure-ideal-pcie-topology) | `--check topo` | N/A (hardware) | -| 2 | [PCIe config (MPS/Speed)](#step-2-check-the-nics-pcie-configuration) | `--check mps` | N/A (hardware) | -| 3 | [NIC MRRS](#step-3-maximize-the-nics-max-read-request-size-mrrs) | `--check mrrs` / `--set mrrs` | No — use a startup script | -| 4 | [Hugepages](#step-4-enable-huge-pages) | `--check hugepages` | Yes — kernel bootline or `/etc/fstab` | -| 5 | [CPU isolation](#step-5-isolate-cpu-cores) | `--check cmdline` | Yes — kernel bootline | -| 6 | [CPU governor](#step-6-prevent-cpu-cores-from-going-idle) | `--check cpu-freq` | Yes — see persistent option in section | -| 7 | [GPU clocks](#step-7-prevent-the-gpu-from-going-idle) | `--check gpu-clock` | Partial — `nvidia-smi -pm 1` persists driver; clock locks need a startup script | -| 8 | [GPU BAR1 size](#step-8-maximize-gpu-bar1-size) | `--check bar1-size` | Yes — firmware flash | -| 9 | [Jumbo frames (MTU)](#step-9-enable-jumbo-frames) | `--check mtu` | Yes — see persistent option in section | + 1. Install the `doca-ofed` package to get the latest drivers for your NIC as [documented above](#check-your-nic-drivers). + 2. Restart your system. + 3. Rebuild your NVIDIA drivers with DKMS like so: -!!! tip "Plan your reboots" + ```bash + peermem_ko=$(find /lib/modules/$(uname -r) -name "*peermem*.ko") + nv_dkms=$(dpkg -S "$peermem_ko" | cut -d: -f1) + sudo dpkg-reconfigure $nv_dkms + sudo modprobe nvidia_peermem + ``` - Several steps below require adding flags to the kernel bootline in `/etc/default/grub` (hugepages in [Enable Huge pages](#step-4-enable-huge-pages), CPU isolation in [Isolate CPU cores](#step-5-isolate-cpu-cores)). We recommend reading through both sections first and adding all the flags at once to avoid multiple reboots. 
Other items like MRRS, GPU clocks, and MTU can be applied at runtime but reset on reboot — consider scripting them or using a systemd service for persistence. + ??? info "Why peermem and not dma buf?" -Before diving in each of the setups below, we provide a utility script as part of the DAQIRI library which provides an overview of the configurations that potentially need to be tuned on your system. + `peermem` is currently the only GPUDirect interface supported by all our [networking backends](background.md#kernel-bypass). This section will therefore provide instructions for `peermem` and not `dma buf`. -??? example "Work In Progress" + --- - This utility script is under active development and will be updated in future releases with additional checks, more actionable recommendations, and automated tuning. + ## System Optimization -=== "Debian installation" + !!! warning "Advanced" - ```bash - sudo /opt/nvidia/holoscan/bin/tune_system.py --check all - ``` + The section below is for advanced users looking to extract more performance out of their system. You can choose to skip this section and return to it later if the performance of your application is not satisfactory. -=== "From source" + While the configurations above are the minimum requirements to get a NIC and an NVIDIA GPU to communicate while bypassing the OS kernel stack, performance can be further improved in most scenarios by tuning the system as described below. - ```bash - cd holohub - sudo ./operators/advanced_network/python/tune_system.py --check all - ``` + The table below summarizes all optimization steps covered in this section, along with the corresponding `tune_system.py` flags and whether each setting can be made persistent across reboots. Use it as a checklist to track your progress. -??? abstract "See an example output" - - Our tuned-up IGX system with A6000 can optimize most settings: - - ```log - 2025-03-12 14:16:06 - INFO - CPU 0: Governor is correctly set to 'performance'. 
- 2025-03-12 14:16:06 - INFO - CPU 1: Governor is correctly set to 'performance'. - 2025-03-12 14:16:06 - INFO - CPU 2: Governor is correctly set to 'performance'. - 2025-03-12 14:16:06 - INFO - CPU 3: Governor is correctly set to 'performance'. - 2025-03-12 14:16:06 - INFO - CPU 4: Governor is correctly set to 'performance'. - 2025-03-12 14:16:06 - INFO - CPU 5: Governor is correctly set to 'performance'. - 2025-03-12 14:16:06 - INFO - CPU 6: Governor is correctly set to 'performance'. - 2025-03-12 14:16:06 - INFO - CPU 7: Governor is correctly set to 'performance'. - 2025-03-12 14:16:06 - INFO - CPU 8: Governor is correctly set to 'performance'. - 2025-03-12 14:16:06 - INFO - CPU 9: Governor is correctly set to 'performance'. - 2025-03-12 14:16:06 - INFO - CPU 10: Governor is correctly set to 'performance'. - 2025-03-12 14:16:06 - INFO - CPU 11: Governor is correctly set to 'performance'. - 2025-03-12 14:16:06 - INFO - cx7_0/0005:03:00.0: MRRS is correctly set to 4096. - 2025-03-12 14:16:06 - INFO - cx7_1/0005:03:00.1: MRRS is correctly set to 4096. - 2025-03-12 14:16:06 - WARNING - cx7_0/0005:03:00.0: PCIe Max Payload Size is not set to 256 bytes. Found: 128 bytes. - 2025-03-12 14:16:06 - WARNING - cx7_1/0005:03:00.1: PCIe Max Payload Size is not set to 256 bytes. Found: 128 bytes. - 2025-03-12 14:16:06 - INFO - HugePages_Total: 4 - 2025-03-12 14:16:06 - INFO - HugePage Size: 1024.00 MB - 2025-03-12 14:16:06 - INFO - Total Allocated HugePage Memory: 4096.00 MB - 2025-03-12 14:16:06 - INFO - Hugepages are sufficiently allocated with at least 500 MB. - 2025-03-12 14:16:06 - INFO - GPU 0: SM Clock is correctly set to 1920 MHz (within 500 of the 2100 MHz theoretical Max). - 2025-03-12 14:16:06 - INFO - GPU 0: Memory Clock is correctly set to 8000 MHz. - 2025-03-12 14:16:06 - INFO - GPU 00000005:09:00.0: BAR1 size is 8192 MiB. 
- 2025-03-12 14:16:06 - INFO - GPU GPU0 has at least one PIX/PXB connection to a NIC - 2025-03-12 14:16:06 - INFO - isolcpus found in kernel boot line - 2025-03-12 14:16:06 - INFO - rcu_nocbs found in kernel boot line - 2025-03-12 14:16:06 - INFO - irqaffinity found in kernel boot line - 2025-03-12 14:16:06 - INFO - Interface cx7_0 has an acceptable MTU of 9000 bytes. - 2025-03-12 14:16:06 - INFO - Interface cx7_1 has an acceptable MTU of 9000 bytes. - 2025-03-12 14:16:06 - INFO - GPU 0: NVIDIA RTX A6000 has GPUDirect support. - 2025-03-12 14:16:06 - INFO - nvidia-peermem module is loaded. - ``` + | Step | Description | Tuning Script Flag | Persistent Option Available? | + |------|-------------|--------------------|-------------| + | 1 | [PCIe topology](#step-1-ensure-ideal-pcie-topology) | `--check topo` | N/A (hardware) | + | 2 | [PCIe config (MPS/Speed)](#step-2-check-the-nics-pcie-configuration) | `--check mps` | N/A (hardware) | + | 3 | [NIC MRRS](#step-3-maximize-the-nics-max-read-request-size-mrrs) | `--check mrrs` / `--set mrrs` | No — use a startup script | + | 4 | [Hugepages](#step-4-enable-huge-pages) | `--check hugepages` | Yes — kernel bootline or `/etc/fstab` | + | 5 | [CPU isolation](#step-5-isolate-cpu-cores) | `--check cmdline` | Yes — kernel bootline | + | 6 | [CPU governor](#step-6-prevent-cpu-cores-from-going-idle) | `--check cpu-freq` | Yes — see persistent option in section | + | 7 | [GPU clocks](#step-7-prevent-the-gpu-from-going-idle) | `--check gpu-clock` | Partial — `nvidia-smi -pm 1` persists driver; clock locks need a startup script | + | 8 | [GPU BAR1 size](#step-8-maximize-gpu-bar1-size) | `--check bar1-size` | Yes — firmware flash | + | 9 | [Jumbo frames (MTU)](#step-9-enable-jumbo-frames) | `--check mtu` | Yes — see persistent option in section | -Based on the results, you can figure out which of the sections below are appropriate to update configurations on your system. + !!! 
tip "Plan your reboots" -### Step 1: Ensure ideal PCIe topology + Several steps below require adding flags to the kernel bootline in `/etc/default/grub` (hugepages in [Enable Huge pages](#step-4-enable-huge-pages), CPU isolation in [Isolate CPU cores](#step-5-isolate-cpu-cores)). We recommend reading through both sections first and adding all the flags at once to avoid multiple reboots. Other items like MRRS, GPU clocks, and MTU can be applied at runtime but reset on reboot — consider scripting them or using a systemd service for persistence. -Kernel bypass and GPUDirect rely on PCIe to communicate between the GPU and the NIC at high speeds. As-such, the topology of the PCIe tree on a system is critical to ensure optimal performance. + Before you dive into each of the steps below, you can run a utility script, provided as part of the DAQIRI library, that gives an overview of the configurations that may need to be tuned on your system. -Run the following command to check the GPUDirect communication matrix. **You are looking for a `PXB` or `PIX` connection between the GPU and the NIC interfaces to get the best performance.** + ??? example "Work In Progress" -=== "tune_system.py" + This utility script is under active development and will be updated in future releases with additional checks, more actionable recommendations, and automated tuning. === "Debian installation" ```bash - sudo /opt/nvidia/holoscan/bin/tune_system.py --check topo + sudo /opt/nvidia/holoscan/bin/tune_system.py --check all ``` === "From source" ```bash cd holohub - sudo ./operators/advanced_network/python/tune_system.py --check topo + sudo ./operators/advanced_network/python/tune_system.py --check all ``` ??? abstract "See an example output" - On IGX developer kits, the board's internal switch is designed to connect the GPU to the NIC interfaces with a `PXB` connection, offering great performance. 
+ Our tuned-up IGX system with A6000 can optimize most settings: ```log - 2025-03-06 12:07:45 - INFO - GPU GPU0 has at least one PIX/PXB connection to a NIC + 2025-03-12 14:16:06 - INFO - CPU 0: Governor is correctly set to 'performance'. + 2025-03-12 14:16:06 - INFO - CPU 1: Governor is correctly set to 'performance'. + 2025-03-12 14:16:06 - INFO - CPU 2: Governor is correctly set to 'performance'. + 2025-03-12 14:16:06 - INFO - CPU 3: Governor is correctly set to 'performance'. + 2025-03-12 14:16:06 - INFO - CPU 4: Governor is correctly set to 'performance'. + 2025-03-12 14:16:06 - INFO - CPU 5: Governor is correctly set to 'performance'. + 2025-03-12 14:16:06 - INFO - CPU 6: Governor is correctly set to 'performance'. + 2025-03-12 14:16:06 - INFO - CPU 7: Governor is correctly set to 'performance'. + 2025-03-12 14:16:06 - INFO - CPU 8: Governor is correctly set to 'performance'. + 2025-03-12 14:16:06 - INFO - CPU 9: Governor is correctly set to 'performance'. + 2025-03-12 14:16:06 - INFO - CPU 10: Governor is correctly set to 'performance'. + 2025-03-12 14:16:06 - INFO - CPU 11: Governor is correctly set to 'performance'. + 2025-03-12 14:16:06 - INFO - cx7_0/0005:03:00.0: MRRS is correctly set to 4096. + 2025-03-12 14:16:06 - INFO - cx7_1/0005:03:00.1: MRRS is correctly set to 4096. + 2025-03-12 14:16:06 - WARNING - cx7_0/0005:03:00.0: PCIe Max Payload Size is not set to 256 bytes. Found: 128 bytes. + 2025-03-12 14:16:06 - WARNING - cx7_1/0005:03:00.1: PCIe Max Payload Size is not set to 256 bytes. Found: 128 bytes. + 2025-03-12 14:16:06 - INFO - HugePages_Total: 4 + 2025-03-12 14:16:06 - INFO - HugePage Size: 1024.00 MB + 2025-03-12 14:16:06 - INFO - Total Allocated HugePage Memory: 4096.00 MB + 2025-03-12 14:16:06 - INFO - Hugepages are sufficiently allocated with at least 500 MB. + 2025-03-12 14:16:06 - INFO - GPU 0: SM Clock is correctly set to 1920 MHz (within 500 of the 2100 MHz theoretical Max). 
+ 2025-03-12 14:16:06 - INFO - GPU 0: Memory Clock is correctly set to 8000 MHz. + 2025-03-12 14:16:06 - INFO - GPU 00000005:09:00.0: BAR1 size is 8192 MiB. + 2025-03-12 14:16:06 - INFO - GPU GPU0 has at least one PIX/PXB connection to a NIC + 2025-03-12 14:16:06 - INFO - isolcpus found in kernel boot line + 2025-03-12 14:16:06 - INFO - rcu_nocbs found in kernel boot line + 2025-03-12 14:16:06 - INFO - irqaffinity found in kernel boot line + 2025-03-12 14:16:06 - INFO - Interface cx7_0 has an acceptable MTU of 9000 bytes. + 2025-03-12 14:16:06 - INFO - Interface cx7_1 has an acceptable MTU of 9000 bytes. + 2025-03-12 14:16:06 - INFO - GPU 0: NVIDIA RTX A6000 has GPUDirect support. + 2025-03-12 14:16:06 - INFO - nvidia-peermem module is loaded. + ``` + + Based on the results, you can figure out which of the sections below are appropriate to update configurations on your system. + + ### Step 1: Ensure ideal PCIe topology + + Kernel bypass and GPUDirect rely on PCIe to communicate between the GPU and the NIC at high speeds. As-such, the topology of the PCIe tree on a system is critical to ensure optimal performance. + + Run the following command to check the GPUDirect communication matrix. **You are looking for a `PXB` or `PIX` connection between the GPU and the NIC interfaces to get the best performance.** + + === "tune_system.py" + + === "Debian installation" + + ```bash + sudo /opt/nvidia/holoscan/bin/tune_system.py --check topo + ``` + + === "From source" + + ```bash + cd holohub + sudo ./operators/advanced_network/python/tune_system.py --check topo + ``` + + ??? abstract "See an example output" + + On IGX developer kits, the board's internal switch is designed to connect the GPU to the NIC interfaces with a `PXB` connection, offering great performance. + + ```log + 2025-03-06 12:07:45 - INFO - GPU GPU0 has at least one PIX/PXB connection to a NIC + ``` + + === "nvidia-smi" + + ```bash + nvidia-smi topo -mp ``` -=== "nvidia-smi" + ??? 
abstract "See an example output" + + On IGX developer kits, the board's internal switch is designed to connect the GPU to the NIC interfaces with a `PXB` connection, offering great performance. + ``` + GPU0 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID + GPU0 X PXB PXB 0-11 0 N/A + NIC0 PXB X PIX + NIC1 PXB PIX X + + Legend: + + X = Self + SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) + NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node + PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) + PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) + PIX = Connection traversing at most a single PCIe bridge + + NIC Legend: + + NIC0: mlx5_0 + NIC1: mlx5_1 + ``` + + If your connection is not optimal, you might be able to improve it by moving your NIC and/or GPU on a different PCIe port, so that they can share a branch and do not require going back to the Host Bridge (the CPU) to communicate. Refer to your system manufacturer for documentation, or run the following command to inspect the topology of your system: ```bash - nvidia-smi topo -mp + lspci -tv ``` ??? abstract "See an example output" - On IGX developer kits, the board's internal switch is designed to connect the GPU to the NIC interfaces with a `PXB` connection, offering great performance. + Here is the PCIe tree of an IGX system. Note how the ConnectX-7 and RTX A6000 are connected to the same branch. + ``` hl_lines="2 3 5" + -+-[0007:00]---00.0-[01-ff]----00.0 Marvell Technology Group Ltd. 
88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller + +-[0005:00]---00.0-[01-ff]----00.0-[02-09]--+-00.0-[03]--+-00.0 Mellanox Technologies MT2910 Family [ConnectX-7] + | | \-00.1 Mellanox Technologies MT2910 Family [ConnectX-7] + | +-01.0-[04-06]----00.0-[05-06]----08.0-[06]-- + | \-02.0-[07-09]----00.0-[08-09]----00.0-[09]--+-00.0 NVIDIA Corporation GA102GL [RTX A6000] + | \-00.1 NVIDIA Corporation GA102 High Definition Audio Controller + +-[0004:00]---00.0-[01-ff]----00.0 Sandisk Corp WD PC SN810 / Black SN850 NVMe SSD + +-[0001:00]---00.0-[01-ff]----00.0-[02-fc]--+-01.0-[03-34]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller + | +-02.0-[35-66]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller + | +-03.0-[67-98]----00.0 Device 1c00:3450 + | +-04.0-[99-ca]----00.0-[9a]--+-00.0 ASPEED Technology, Inc. ASPEED Graphics Family + | | \-02.0 ASPEED Technology, Inc. Device 2603 + | \-05.0-[cb-fc]----00.0 Realtek Semiconductor Co., Ltd. RTL8822CE 802.11ac PCIe Wireless Network Adapter + \-[0000:00]- ``` - GPU0 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID - GPU0 X PXB PXB 0-11 0 N/A - NIC0 PXB X PIX - NIC1 PXB PIX X - Legend: + !!! warning "x86_64 compatibility" - X = Self - SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) - NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node - PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) - PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) - PIX = Connection traversing at most a single PCIe bridge + Most x86_64 systems are not designed for this topology as they lack a discrete PCIe switch. In that case, the best connection they can achieve is `NODE`. - NIC Legend: + ### Step 2: Check the NIC's PCIe configuration - NIC0: mlx5_0 - NIC1: mlx5_1 - ``` + !!! 
quote "[Understanding PCIe Configuration for Maximum Performance - May 27, 2022](https://enterprise-support.nvidia.com/s/article/understanding-pcie-configuration-for-maximum-performance)" -If your connection is not optimal, you might be able to improve it by moving your NIC and/or GPU on a different PCIe port, so that they can share a branch and do not require going back to the Host Bridge (the CPU) to communicate. Refer to your system manufacturer for documentation, or run the following command to inspect the topology of your system: - -```bash -lspci -tv -``` - -??? abstract "See an example output" - - Here is the PCIe tree of an IGX system. Note how the ConnectX-7 and RTX A6000 are connected to the same branch. - ``` hl_lines="2 3 5" - -+-[0007:00]---00.0-[01-ff]----00.0 Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller - +-[0005:00]---00.0-[01-ff]----00.0-[02-09]--+-00.0-[03]--+-00.0 Mellanox Technologies MT2910 Family [ConnectX-7] - | | \-00.1 Mellanox Technologies MT2910 Family [ConnectX-7] - | +-01.0-[04-06]----00.0-[05-06]----08.0-[06]-- - | \-02.0-[07-09]----00.0-[08-09]----00.0-[09]--+-00.0 NVIDIA Corporation GA102GL [RTX A6000] - | \-00.1 NVIDIA Corporation GA102 High Definition Audio Controller - +-[0004:00]---00.0-[01-ff]----00.0 Sandisk Corp WD PC SN810 / Black SN850 NVMe SSD - +-[0001:00]---00.0-[01-ff]----00.0-[02-fc]--+-01.0-[03-34]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller - | +-02.0-[35-66]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller - | +-03.0-[67-98]----00.0 Device 1c00:3450 - | +-04.0-[99-ca]----00.0-[9a]--+-00.0 ASPEED Technology, Inc. ASPEED Graphics Family - | | \-02.0 ASPEED Technology, Inc. Device 2603 - | \-05.0-[cb-fc]----00.0 Realtek Semiconductor Co., Ltd. 
RTL8822CE 802.11ac PCIe Wireless Network Adapter - \-[0000:00]- - ``` + PCIe is used in any system for communication between different modules [including the NIC and the GPU]. This means that in order to process network traffic, the different devices communicating via the PCIe should be well configured. When connecting the network adapter to the PCIe, it auto-negotiates for the maximum capabilities supported between the network adapter and the CPU. -!!! warning "x86_64 compatibility" + The instructions below are meant to understand if your system is able to extract the maximum capabilities of your NIC, but they're not configurable. The two values that we are looking at here are the Max Payload Size (MPS - the maximum size of a PCIe packet) and the Speed (or PCIe generation). - Most x86_64 systems are not designed for this topology as they lack a discrete PCIe switch. In that case, the best connection they can achieve is `NODE`. + #### Max Payload Size (MPS) -### Step 2: Check the NIC's PCIe configuration + === "tune_system.py" + + === "Debian installation" + + ```bash + sudo /opt/nvidia/holoscan/bin/tune_system.py --check mps + ``` -!!! quote "[Understanding PCIe Configuration for Maximum Performance - May 27, 2022](https://enterprise-support.nvidia.com/s/article/understanding-pcie-configuration-for-maximum-performance)" + === "From source" - PCIe is used in any system for communication between different modules [including the NIC and the GPU]. This means that in order to process network traffic, the different devices communicating via the PCIe should be well configured. When connecting the network adapter to the PCIe, it auto-negotiates for the maximum capabilities supported between the network adapter and the CPU. + ```bash + cd holohub + sudo ./operators/advanced_network/python/tune_system.py --check mps + ``` -The instructions below are meant to understand if your system is able to extract the maximum capabilities of your NIC, but they're not configurable. 
The two values that we are looking at here are the Max Payload Size (MPS - the maximum size of a PCIe packet) and the Speed (or PCIe generation). + ??? abstract "See an example output" -#### Max Payload Size (MPS) + The PCIe configuration on the IGX Orin developer kit is not able to leverage the max payload size of the NIC: -=== "tune_system.py" + ```log + 2025-03-10 16:15:54 - WARNING - cx7_0/0005:03:00.0: PCIe Max Payload Size is not set to 256 bytes. Found: 128 bytes. + 2025-03-10 16:15:54 - WARNING - cx7_1/0005:03:00.1: PCIe Max Payload Size is not set to 256 bytes. Found: 128 bytes. + ``` - === "Debian installation" + === "manual" - ```bash - sudo /opt/nvidia/holoscan/bin/tune_system.py --check mps - ``` + Identify the PCIe address of your NVIDIA NIC: - === "From source" + === "ibdev2netdev" + + ```bash + nic_pci=$(sudo ibdev2netdev -v | awk '{print $1}' | head -n1) + ``` + + === "lspci" + + ```bash + # `0200` is the PCI-SIG class code for NICs + # `15b3` is the Vendor ID for Mellanox + nic_pci=$(lspci -n | awk '$2 == "0200:" && $3 ~ /^15b3:/ {print $1}' | head -n1) + ``` + + Check current and max MPS: ```bash - cd holohub - sudo ./operators/advanced_network/python/tune_system.py --check mps + sudo lspci -vv -s $nic_pci | awk '/DevCap/{s=1} /DevCtl/{s=0} /MaxPayload /{match($0, /MaxPayload [0-9]+/, m); if(s){print "Max " m[0]} else{print "Current " m[0]}}' ``` - ??? abstract "See an example output" + ??? abstract "See an example output" - The PCIe configuration on the IGX Orin developer kit is not able to leverage the max payload size of the NIC: + The PCIe configuration on the IGX Orin developer kit is not able to leverage the max payload size of the NIC: - ```log - 2025-03-10 16:15:54 - WARNING - cx7_0/0005:03:00.0: PCIe Max Payload Size is not set to 256 bytes. Found: 128 bytes. - 2025-03-10 16:15:54 - WARNING - cx7_1/0005:03:00.1: PCIe Max Payload Size is not set to 256 bytes. Found: 128 bytes. 
- ``` + ```log + Max MaxPayload 512 + Current MaxPayload 128 + ``` + + !!! note + + While your NIC might be capable of more, 256 bytes is generally the largest supported by any switch/CPU at this time. -=== "manual" + ##### PCIe Speed/Generation Identify the PCIe address of your NVIDIA NIC: @@ -579,808 +623,1033 @@ The instructions below are meant to understand if your system is able to extract nic_pci=$(lspci -n | awk '$2 == "0200:" && $3 ~ /^15b3:/ {print $1}' | head -n1) ``` - Check current and max MPS: + Check current and max Speeds: ```bash - sudo lspci -vv -s $nic_pci | awk '/DevCap/{s=1} /DevCtl/{s=0} /MaxPayload /{match($0, /MaxPayload [0-9]+/, m); if(s){print "Max " m[0]} else{print "Current " m[0]}}' + sudo lspci -vv -s $nic_pci | awk '/LnkCap/{s=1} /LnkSta/{s=0} /Speed /{match($0, /Speed [0-9]+GT\/s/, m); if(s){print "Max " m[0]} else{print "Current " m[0]}}' ``` ??? abstract "See an example output" - The PCIe configuration on the IGX Orin developer kit is not able to leverage the max payload size of the NIC: + On IGX, the switch is able to maximize the NIC speed, both being PCIe 5.0: ```log - Max MaxPayload 512 - Current MaxPayload 128 + Max Speed 32GT/s + Current Speed 32GT/s ``` - !!! note + ### Step 3: Maximize the NIC's Max Read Request Size (MRRS) - While your NIC might be capable of more, 256 bytes is generally the largest supported by any switch/CPU at this time. + !!! quote "[Understanding PCIe Configuration for Maximum Performance - May 27, 2022](https://enterprise-support.nvidia.com/s/article/understanding-pcie-configuration-for-maximum-performance)" -##### PCIe Speed/Generation + PCIe Max Read Request determines the maximal PCIe read request allowed. A PCIe device usually keeps track of the number of pending read requests due to having to prepare buffers for an incoming response. The size of the PCIe max read request may affect the number of pending requests (when using data fetch larger than the PCIe MTU). 
-Identify the PCIe address of your NVIDIA NIC: + Unlike the PCIe properties queried in the previous section, the MRRS is configurable. **We recommend maxing it to 4096 bytes**. Run the following to check your current settings: -=== "ibdev2netdev" + === "tune_system.py" - ```bash - nic_pci=$(sudo ibdev2netdev -v | awk '{print $1}' | head -n1) - ``` + === "Debian installation" -=== "lspci" + ```bash + sudo /opt/nvidia/holoscan/bin/tune_system.py --check mrrs + ``` - ```bash - # `0200` is the PCI-SIG class code for NICs - # `15b3` is the Vendor ID for Mellanox - nic_pci=$(lspci -n | awk '$2 == "0200:" && $3 ~ /^15b3:/ {print $1}' | head -n1) - ``` + === "From source" -Check current and max Speeds: + ```bash + cd holohub + sudo ./operators/advanced_network/python/tune_system.py --check mrrs + ``` -```bash -sudo lspci -vv -s $nic_pci | awk '/LnkCap/{s=1} /LnkSta/{s=0} /Speed /{match($0, /Speed [0-9]+GT\/s/, m); if(s){print "Max " m[0]} else{print "Current " m[0]}}' -``` + === "manual" -??? abstract "See an example output" + Identify the PCIe address of your NVIDIA NIC: - On IGX, the switch is able to maximize the NIC speed, both being PCIe 5.0: + === "ibdev2netdev" - ```log - Max Speed 32GT/s - Current Speed 32GT/s - ``` + ```bash + nic_pci=$(sudo ibdev2netdev -v | awk '{print $1}' | head -n1) + ``` -### Step 3: Maximize the NIC's Max Read Request Size (MRRS) + === "lspci" -!!! quote "[Understanding PCIe Configuration for Maximum Performance - May 27, 2022](https://enterprise-support.nvidia.com/s/article/understanding-pcie-configuration-for-maximum-performance)" + ```bash + # `0200` is the PCI-SIG class code for NICs + # `15b3` is the Vendor ID for Mellanox + nic_pci=$(lspci -n | awk '$2 == "0200:" && $3 ~ /^15b3:/ {print $1}' | head -n1) + ``` - PCIe Max Read Request determines the maximal PCIe read request allowed. A PCIe device usually keeps track of the number of pending read requests due to having to prepare buffers for an incoming response. 
The size of the PCIe max read request may affect the number of pending requests (when using data fetch larger than the PCIe MTU). + Check current MRRS: -Unlike the PCIe properties queried in the previous section, the MRRS is configurable. **We recommend maxing it to 4096 bytes**. Run the following to check your current settings: + ```bash + sudo lspci -vv -s $nic_pci | grep DevCtl: -A2 | grep -oE "MaxReadReq [0-9]+" + ``` -=== "tune_system.py" + Update MRRS: === "Debian installation" ```bash - sudo /opt/nvidia/holoscan/bin/tune_system.py --check mrrs + sudo /opt/nvidia/holoscan/bin/tune_system.py --set mrrs ``` === "From source" ```bash cd holohub - sudo ./operators/advanced_network/python/tune_system.py --check mrrs + sudo ./operators/advanced_network/python/tune_system.py --set mrrs ``` -=== "manual" + !!! note - Identify the PCIe address of your NVIDIA NIC: + This value is reset on reboot and needs to be set every time the system boots - === "ibdev2netdev" + ??? failure "ERROR: pcilib: sysfs_write: write failed: Operation not permitted" + + Disable secure boot on your system ahead of changing the MRRS of your NIC ports. It can be re-enabled afterwards. + + ### Step 4: Enable Huge pages + + Huge pages are a memory management feature that allows the OS to allocate large blocks of memory (typically 2MB or 1GB) instead of the default 4KB pages. This reduces the number of page table entries and the amount of memory used for translation, improving cache performance and reducing TLB (Translation Lookaside Buffer) misses, which leads to lower latencies. + + While it is naturally beneficial for CPU packets, it is also needed when routing data packets to the GPU in order to handle metadata (mbufs) on the CPU. 
+ + === "hugeadm" + + We recommend installing the `libhugetlbfs-bin` package for the `hugeadm` utility: ```bash - nic_pci=$(sudo ibdev2netdev -v | awk '{print $1}' | head -n1) + sudo apt update + sudo apt install -y libhugetlbfs-bin ``` - === "lspci" + Then, check your huge page pools: ```bash - # `0200` is the PCI-SIG class code for NICs - # `15b3` is the Vendor ID for Mellanox - nic_pci=$(lspci -n | awk '$2 == "0200:" && $3 ~ /^15b3:/ {print $1}' | head -n1) + hugeadm --pool-list ``` - Check current MRRS: + ??? abstract "See an example output" - ```bash - sudo lspci -vv -s $nic_pci | grep DevCtl: -A2 | grep -oE "MaxReadReq [0-9]+" - ``` + The example below shows that this system supports huge pages of 64K, 2M (default), 32M, and 1G, but that none of them are currently allocated. -Update MRRS: + ``` + Size Minimum Current Maximum Default + 65536 0 0 0 + 2097152 0 0 0 * + 33554432 0 0 0 + 1073741824 0 0 0 + ``` -=== "Debian installation" + And your huge page mount points: - ```bash - sudo /opt/nvidia/holoscan/bin/tune_system.py --set mrrs - ``` + ```bash + hugeadm --list-all-mounts + ``` -=== "From source" + ??? abstract "See an example output" - ```bash - cd holohub - sudo ./operators/advanced_network/python/tune_system.py --set mrrs - ``` + The default huge pages are mounted on `/dev/hugepages` with a page size of 2M. -!!! note + ``` + Mount Point Options + /dev/hugepages rw,relatime,pagesize=2M + ``` - This value is reset on reboot and needs to be set every time the system boots + === "vanilla" -??? failure "ERROR: pcilib: sysfs_write: write failed: Operation not permitted" + First, check your huge page pools: - Disable secure boot on your system ahead of changing the MRRS of your NIC ports. It can be re-enabled afterwards. + ```bash + ls -1 /sys/kernel/mm/hugepages/ + grep Huge /proc/meminfo + ``` -### Step 4: Enable Huge pages + ??? 
abstract "See an example output" -Huge pages are a memory management feature that allows the OS to allocate large blocks of memory (typically 2MB or 1GB) instead of the default 4KB pages. This reduces the number of page table entries and the amount of memory used for translation, improving cache performance and reducing TLB (Translation Lookaside Buffer) misses, which leads to lower latencies. + The example below shows that this system supports huge pages of 64K, 2M (default), 32M, and 1G, but that none of them are currently allocated. -While it is naturally beneficial for CPU packets, it is also needed when routing data packets to the GPU in order to handle metadata (mbufs) on the CPU. + ``` + hugepages-1048576kB + hugepages-2048kB + hugepages-32768kB + hugepages-64kB + ``` -=== "hugeadm" + ``` + HugePages_Total: 0 + HugePages_Free: 0 + HugePages_Rsvd: 0 + HugePages_Surp: 0 + Hugepagesize: 2048 kB + Hugetlb: 0 kB + ``` - We recommend installing the `libhugetlbfs-bin` package for the `hugeadm` utility: + And your huge page mount points: - ```bash - sudo apt update - sudo apt install -y libhugetlbfs-bin - ``` + ```bash + mount | grep huge + ``` - Then, check your huge page pools: + ??? abstract "See an example output" - ```bash - hugeadm --pool-list - ``` + The default huge pages are mounted on `/dev/hugepages` with a page size of 2M. - ??? abstract "See an example output" + ``` + hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M) + ``` - The example below shows that this system supports huge pages of 64K, 2M (default), 32M, and 1G, but that none of them are currently allocated. + **As a rule of thumb, we recommend starting with 3 to 4 GB of total huge pages, with an individual page size of 500 MB to 1 GB** (per system availability). 
- ``` - Size Minimum Current Maximum Default - 65536 0 0 0 - 2097152 0 0 0 * - 33554432 0 0 0 - 1073741824 0 0 0 - ``` + There are two ways to allocate huge pages: - And your huge page mount points: + - in the kernel bootline (recommended to ensure contiguous memory allocation) or + - dynamically at runtime (risk of fragmentation for large page sizes) - ```bash - hugeadm --list-all-mounts - ``` + The example below allocates 4 huge pages of 1GB each. - ??? abstract "See an example output" + === "Kernel bootline" - The default huge pages are mounted on `/dev/hugepages` with a page size of 2M. + Add the flags below to the `GRUB_CMDLINE_LINUX` variable in `/etc/default/grub`: + ```bash + default_hugepagesz=1G hugepagesz=1G hugepages=4 ``` - Mount Point Options - /dev/hugepages rw,relatime,pagesize=2M - ``` - -=== "vanilla" - First, check your huge page pools: + ??? info "Show explanation" - ```bash - ls -1 /sys/kernel/mm/hugepages/ - grep Huge /proc/meminfo - ``` - - ??? abstract "See an example output" + - `default_hugepagesz`: the default huge page size to use, making them available from the default mount point, `/dev/hugepages`. + - `hugepagesz`: the size of the huge pages to allocate. + - `hugepages`: the number of huge pages to allocate. - The example below shows that this system supports huge pages of 64K, 2M (default), 32M, and 1G, but that none of them are currently allocated. + Then rebuild your GRUB configuration and reboot: - ``` - hugepages-1048576kB - hugepages-2048kB - hugepages-32768kB - hugepages-64kB + ```bash + sudo update-grub + sudo reboot ``` - ``` - HugePages_Total: 0 - HugePages_Free: 0 - HugePages_Rsvd: 0 - HugePages_Surp: 0 - Hugepagesize: 2048 kB - Hugetlb: 0 kB - ``` + === "Runtime" - And your huge page mount points: + Allocate the 4x 1GB huge pages: - ```bash - mount | grep huge - ``` + === "hugeadm" - ??? 
abstract "See an example output" + ```bash + sudo hugeadm --pool-pages-min 1073741824:4 + ``` - The default huge pages are mounted on `/dev/hugepages` with a page size of 2M. + === "vanilla" - ``` - hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M) - ``` + ```bash + echo 4 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages + ``` + + Create a mount point to access the 1GB huge pages pool since that is not the default size on that system. We will name it `/mnt/huge` here. + + === "One-time" + + ```bash + sudo mkdir -p /mnt/huge + sudo mount -t hugetlbfs -o pagesize=1G none /mnt/huge + ``` + + === "Persistent" + + ```bash + echo "nodev /mnt/huge hugetlbfs pagesize=1G 0 0" | sudo tee -a /etc/fstab + sudo mount /mnt/huge + ``` + + !!! note + + If you work with containers, remember to mount this directory in your container as well with `-v /mnt/huge:/mnt/huge`. + + Rerunning the initial commands should now list 4 hugepages of 1GB each. 1GB will be the default huge page size if updated in the kernel bootline only. + + ### Step 5: Isolate CPU cores + + !!! note + + This optimization is less impactful when using the `gpunetio` backend since the GPU polls the NIC. + + The CPU interacting with the NIC to route packets is sensitive to perturbations, especially with smaller packet/batch sizes requiring more frequent work. Isolating a CPU in Linux prevents unwanted user or kernel threads from running on it, reducing context switching and latency spikes from noisy neighbors. + + We recommend isolating the CPU cores you will select to interact with the NIC (defined in the `daqiri` configuration [described in the configuration reference](configuration-walkthrough.md) in this tutorial). This is done by setting additional flags on the kernel bootline. 
+ + You can first check if any of the recommended flags were already set on the last boot: + + === "tune_system.py" + + === "Debian installation" + + ```bash + sudo /opt/nvidia/holoscan/bin/tune_system.py --check cmdline + ``` -**As a rule of thumb, we recommend to start with 3 to 4 GB of total huge pages, with an individual page size of 500 MB to 1 GB** (per system availability). + === "From source" -There are two ways to allocate huge pages: + ```bash + cd holohub + sudo ./operators/advanced_network/python/tune_system.py --check cmdline + ``` -- in the kernel bootline (recommended to ensure contiguous memory allocation) or -- dynamically at runtime (risk of fragmentation for large page sizes) + === "manual" + + ```bash + cat /proc/cmdline | grep -e isolcpus -e irqaffinity -e nohz_full -e rcu_nocbs -e rcu_nocb_poll + ``` + + Decide which cores to isolate based on your configuration. We recommend one core per queue as a rule of thumb. First, identify your core IDs: + + ```bash + cat /proc/cpuinfo | grep processor + ``` -The example below allocates 4 huge pages of 1GB each. + ??? abstract "See an example output" -=== "Kernel bootline" + This system has 12 cores, numbered 0 to 11: + ```bash + processor # 0 + processor # 1 + processor # 2 + processor # 3 + processor # 4 + processor # 5 + processor # 6 + processor # 7 + processor # 8 + processor # 9 + processor # 10 + processor # 11 + ``` - Add the flags below to the `GRUB_CMDLINE_LINUX` variable in `/etc/default/grub`: + As an example, the line below will isolate cores 9, 10 and 11, leaving cores 0-8 free for other tasks and hardware interrupts: ```bash - default_hugepagesz=1G hugepagesz=1G hugepages=4 + isolcpus=9-11 irqaffinity=0-8 nohz_full=9-11 rcu_nocbs=9-11 rcu_nocb_poll ``` ??? info "Show explanation" - - `default_hugepagesz`: the default huge page size to use, making them available from the default mount point, `/dev/hugepages`. - - `hugepagesz`: the size of the huge pages to allocate. 
- - `hugepages`: the number of huge pages to allocate. + | Parameter | Description | + | --------- | ----------- | + | `isolcpus` | Isolates specific CPU cores from the Linux scheduler, preventing regular system tasks from running on them. This ensures dedicated cores are available exclusively for your networking tasks, reducing context switches and interruptions that can cause latency spikes. | + | `irqaffinity` | Controls which CPU cores can handle hardware interrupts. By directing network interrupts away from your isolated cores, you prevent networking tasks from being interrupted by hardware events, maintaining consistent processing time. | + | `nohz_full` | Disables regular kernel timer ticks on specified cores when they're running user space applications. This reduces overhead and prevents periodic interruptions, allowing your networking code to run with fewer disturbances. | + | `rcu_nocbs` | Offloads Read-Copy-Update (RCU) callback processing from specified cores. RCU is a synchronization mechanism in the Linux kernel that can cause periodic processing bursts. Moving this work away from your networking cores helps maintain consistent performance. | + | `rcu_nocb_poll` | Works with `rcu_nocbs` to improve how RCU callbacks are processed on non-callback CPUs. This can reduce latency spikes by changing how the kernel polls for RCU work. | + + Together, these parameters create an environment where specific CPU cores can focus exclusively on network packet processing with minimal interference from the operating system, resulting in lower and more consistent latency. - Then rebuild your GRUB configuration and reboot: + Add these flags to the `GRUB_CMDLINE_LINUX` variable in `/etc/default/grub`, then rebuild your GRUB configuration and reboot: ```bash sudo update-grub sudo reboot ``` -=== "Runtime" + Verify that the flags were properly set after boot by rerunning the check commands above. 
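If you prefer a scripted check after the reboot, a minimal sketch along these lines could report which of the five flags are still absent. `missing_isolation_flags` is a hypothetical helper written for this example (it is not part of `tune_system.py`), and it only checks that each flag is present, not that the core ranges match your configuration.

```bash
# Hypothetical helper (illustration only): print the recommended flags that
# are absent from the kernel command line string passed as $1.
missing_isolation_flags() {
  local cmdline=" $1 " flag missing=""
  for flag in isolcpus irqaffinity nohz_full rcu_nocbs rcu_nocb_poll; do
    case "$cmdline" in
      # Match "flag=value" or a bare "flag", delimited by spaces
      *" $flag="*|*" $flag "*) ;;
      *) missing="$missing $flag" ;;
    esac
  done
  echo "${missing# }"
}

# Check the command line of the running kernel, when available
if [ -r /proc/cmdline ]; then
  missing_isolation_flags "$(cat /proc/cmdline)"
fi
```

An empty line means all five flags were found. On recent kernels, `cat /sys/devices/system/cpu/isolated` should also list the isolated core range.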
- Allocate the 4x 1GB huge pages: + ### Step 6: Prevent CPU cores from going idle - === "hugeadm" + When a core goes idle or to sleep, waking it up to poll the NIC can cause latency spikes and dropped packets. To prevent this, **we recommend setting the scaling governor to `performance` for these CPU cores**. - ```bash - sudo hugeadm --pool-pages-min 1073741824:4 - ``` + !!! note - === "vanilla" + Cores from a single cluster will always share the same governor. + + !!! bug + + We have witnessed instances where setting the governor to `performance` on only the isolated cores (dedicated to polling the NIC) does not lead to the expected performance gains. As such, we currently recommend setting the governor to `performance` for all cores, which has proven reliably effective. + + Check the current governor for each of your cores: + + === "tune_system.py" + + === "Debian installation" + + ```bash + sudo /opt/nvidia/holoscan/bin/tune_system.py --check cpu-freq + ``` + + === "From source" + + ```bash + cd holohub + sudo ./operators/advanced_network/python/tune_system.py --check cpu-freq + ``` + + ??? abstract "See an example output" + + ``` + 2025-03-06 12:20:27 - WARNING - CPU 0: Governor is set to 'powersave', not 'performance'. + 2025-03-06 12:20:27 - WARNING - CPU 1: Governor is set to 'powersave', not 'performance'. + 2025-03-06 12:20:27 - WARNING - CPU 2: Governor is set to 'powersave', not 'performance'. + 2025-03-06 12:20:27 - WARNING - CPU 3: Governor is set to 'powersave', not 'performance'. + 2025-03-06 12:20:27 - WARNING - CPU 4: Governor is set to 'powersave', not 'performance'. + 2025-03-06 12:20:27 - WARNING - CPU 5: Governor is set to 'powersave', not 'performance'. + 2025-03-06 12:20:27 - WARNING - CPU 6: Governor is set to 'powersave', not 'performance'. + 2025-03-06 12:20:27 - WARNING - CPU 7: Governor is set to 'powersave', not 'performance'. + 2025-03-06 12:20:27 - WARNING - CPU 8: Governor is set to 'powersave', not 'performance'.
+ 2025-03-06 12:20:27 - WARNING - CPU 9: Governor is set to 'powersave', not 'performance'. + 2025-03-06 12:20:27 - WARNING - CPU 10: Governor is set to 'powersave', not 'performance'. + 2025-03-06 12:20:27 - WARNING - CPU 11: Governor is set to 'powersave', not 'performance'. + ``` + + === "manual" ```bash - echo 4 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages + cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ``` - Create a mount point to access the 1GB huge pages pool since that is not the default size on that system. We will name it `/mnt/huge` here. + ??? abstract "See an example output" + + In this example, all cores were defaulted to `powersave` instead of the recommended `performance`. + + ``` + powersave + powersave + powersave + powersave + powersave + powersave + powersave + powersave + powersave + powersave + powersave + powersave + ``` + + Install `cpupower` to more conveniently set the governor: + + ```bash + sudo apt update + sudo apt install -y linux-tools-$(uname -r) + ``` + + Set the governor to `performance` for all cores: === "One-time" ```bash - sudo mkdir -p /mnt/huge - sudo mount -t hugetlbfs -o pagesize=1G none /mnt/huge + sudo cpupower frequency-set -g performance ``` === "Persistent" ```bash - echo "nodev /mnt/huge hugetlbfs pagesize=1G 0 0" | sudo tee -a /etc/fstab - sudo mount /mnt/huge - ``` + cat << EOF | sudo tee /etc/systemd/system/cpu-performance.service + [Unit] + Description=Set CPU governor to performance + After=multi-user.target - !!! note + [Service] + Type=oneshot + ExecStart=/usr/bin/cpupower -c all frequency-set -g performance - If you work with containers, remember to mount this directory in your container as well with `-v /mnt/huge:/mnt/huge`. + [Install] + WantedBy=multi-user.target + EOF + sudo systemctl enable cpu-performance.service + sudo systemctl start cpu-performance.service + ``` -Rerunning the initial commands should now list 4 hugepages of 1GB each. 
1GB will be the default huge page size if updated in the kernel bootline only. + Running the checks above should now list `performance` as the governor for all cores. You can also run `sudo cpupower -c all frequency-info` for more details. -### Step 5: Isolate CPU cores + ### Step 7: Prevent the GPU from going idle -!!! note + Similarly to the above, we want to maximize the GPU's clock speed and prevent it from going idle. - This optimization is less impactful when using the `gpunetio` backend since the GPU polls the NIC. + Run the following command to check your current clocks and whether they're locked (persistence mode): -The CPU interacting with the NIC to route packets is sensitive to perturbations, especially with smaller packet/batch sizes requiring more frequent work. Isolating a CPU in Linux prevents unwanted user or kernel threads from running on it, reducing context switching and latency spikes from noisy neighbors. + ```text + nvidia-smi -q | grep -i "Persistence Mode" + nvidia-smi -q -d CLOCK + ``` -We recommend isolating the CPU cores you will select to interact with the NIC (defined in the `daqiri` configuration [described in the configuration reference](configuration-walkthrough.md) in this tutorial). This is done by setting additional flags on the kernel bootline. + ??? abstract "See an example output" -You can first check if any of the recommended flags were already set on the last boot: + ``` hl_lines="1 7 8 20 21" + Persistence Mode: Enabled + ... + Attached GPUs : 1 + GPU 00000005:09:00.0 + Clocks + Graphics : 420 MHz + SM : 420 MHz + Memory : 405 MHz + Video : 1680 MHz + Applications Clocks + Graphics : 1800 MHz + Memory : 8001 MHz + Default Applications Clocks + Graphics : 1800 MHz + Memory : 8001 MHz + Deferred Clocks + Memory : N/A + Max Clocks + Graphics : 2100 MHz + SM : 2100 MHz + Memory : 8001 MHz + Video : 1950 MHz + ... 
+ ``` -=== "tune_system.py" + To lock the GPU's clocks to their max values: - === "Debian installation" + === "One-time" ```bash - sudo /opt/nvidia/holoscan/bin/tune_system.py --check cmdline + sudo nvidia-smi -pm 1 + sudo nvidia-smi -lgc=$(nvidia-smi --query-gpu=clocks.max.sm --format=csv,noheader,nounits) + sudo nvidia-smi -lmc=$(nvidia-smi --query-gpu=clocks.max.mem --format=csv,noheader,nounits) ``` - === "From source" + === "Persistent" ```bash - cd holohub - sudo ./operators/advanced_network/python/tune_system.py --check cmdline + cat << EOF | sudo tee /etc/systemd/system/gpu-max-clocks.service + [Unit] + Description=Max GPU clocks + After=multi-user.target + + [Service] + Type=oneshot + ExecStart=/usr/bin/nvidia-smi -pm 1 + ExecStart=/bin/bash -c '/usr/bin/nvidia-smi --lock-gpu-clocks=$(/usr/bin/nvidia-smi --query-gpu=clocks.max.sm --format=csv,noheader,nounits)' + ExecStart=/bin/bash -c '/usr/bin/nvidia-smi --lock-memory-clocks=$(/usr/bin/nvidia-smi --query-gpu=clocks.max.mem --format=csv,noheader,nounits)' + RemainAfterExit=true + + [Install] + WantedBy=multi-user.target + EOF + + sudo systemctl enable gpu-max-clocks.service + sudo systemctl start gpu-max-clocks.service ``` -=== "manual" + ??? info "Show explanation" - ```bash - cat /proc/cmdline | grep -e isolcpus -e irqaffinity -e nohz_full -e rcu_nocbs -e rcu_nocb_poll - ``` + This queries the max clocks for the GPU SM (`clocks.max.sm`) and memory (`clocks.max.mem`) and locks the current clocks to those values (`lock-gpu-clocks` and `lock-memory-clocks` respectively). `-pm 1` (or `--persistence-mode=1`) enables persistence mode, which keeps the driver loaded so that these settings are retained when no application is using the GPU. -Decide which cores to isolate based on your configuration. We recommend one core per queue as a rule of thumb. First, identify your core IDs: + ??? abstract "See an example output" -```bash -cat /proc/cpuinfo | grep processor -``` + ``` + GPU clocks set to "(gpuClkMin 2100, gpuClkMax 2100)" for GPU 00000005:09:00.0 + All done. 
+ Memory clocks set to "(memClkMin 8001, memClkMax 8001)" for GPU 00000005:09:00.0 + All done. + ``` -??? abstract "See an example output" + You can confirm that the clocks are set to the max values by running `nvidia-smi -q -d CLOCK` again. - This system has 12 cores, numbered 0 to 11: - ```bash - processor # 0 - processor # 1 - processor # 2 - processor # 3 - processor # 4 - processor # 5 - processor # 6 - processor # 7 - processor # 8 - processor # 9 - processor # 10 - processor # 11 - ``` + !!! note -As an example, the line below will isolate cores 9, 10 and 11, leaving cores 0-8 free for other tasks and hardware interrupts: + Some max clocks might not be achievable in certain configurations, or due to boost clocks (SM) or rounding errors (Memory), even though the lock commands report success. For example, on IGX, the max non-boost SM clock will be 1920 MHz, and the max memory clock will show 8000 MHz, both of which are satisfactory compared to the initial clocks. -```bash -isolcpus=9-11 irqaffinity=0-8 nohz_full=9-11 rcu_nocbs=9-11 rcu_nocb_poll -``` + ### Step 8: Maximize GPU BAR1 size -??? info "Show explanation" + The GPU BAR1 memory is the primary resource consumed by `GPUDirect`. It allows other PCIe devices (like the CPU and the NIC) to access the GPU's memory space. The larger the BAR1 size, the more memory the GPU can expose to these devices in a single PCIe transaction, reducing the number of transactions needed and improving performance. - | Parameter | Description | - | --------- | ----------- | - | `isolcpus` | Isolates specific CPU cores from the Linux scheduler, preventing regular system tasks from running on them. This ensures dedicated cores are available exclusively for your networking tasks, reducing context switches and interruptions that can cause latency spikes. | - | `irqaffinity` | Controls which CPU cores can handle hardware interrupts. 
By directing network interrupts away from your isolated cores, you prevent networking tasks from being interrupted by hardware events, maintaining consistent processing time. | - | `nohz_full` | Disables regular kernel timer ticks on specified cores when they're running user space applications. This reduces overhead and prevents periodic interruptions, allowing your networking code to run with fewer disturbances. | - | `rcu_nocbs` | Offloads Read-Copy-Update (RCU) callback processing from specified cores. RCU is a synchronization mechanism in the Linux kernel that can cause periodic processing bursts. Moving this work away from your networking cores helps maintain consistent performance. | - | `rcu_nocb_poll` | Works with `rcu_nocbs` to improve how RCU callbacks are processed on non-callback CPUs. This can reduce latency spikes by changing how the kernel polls for RCU work. | + **We recommend a BAR1 size of 1GB or above.** Check the current BAR1 size: - Together, these parameters create an environment where specific CPU cores can focus exclusively on network packet processing with minimal interference from the operating system, resulting in lower and more consistent latency. + === "tune_system.py" -Add these flags to the `GRUB_CMDLINE_LINUX` variable in `/etc/default/grub`, then rebuild your GRUB configuration and reboot: + === "Debian installation" -```bash -sudo update-grub -sudo reboot -``` + ```bash + sudo /opt/nvidia/holoscan/bin/tune_system.py --check bar1-size + ``` -Verify that the flags were properly set after boot by rerunning the check commands above. + === "From source" -### Step 6: Prevent CPU cores from going idle + ```bash + cd holohub + sudo ./operators/advanced_network/python/tune_system.py --check bar1-size + ``` -When a core goes idle/to sleep, coming back online to poll the NIC can cause latency spikes and dropped packets. To prevent this, **we recommend setting the scaling governor to `performance` for these CPU cores**. + ??? 
abstract "See an example output" -!!! note + ``` + 2025-03-06 12:22:53 - INFO - GPU 00000005:09:00.0: BAR1 size is 8192 MiB. + ``` - Cores from a single cluster will always share the same governor. + === "manual" -!!! bug + ```bash + nvidia-smi -q | grep -A 3 BAR1 + ``` - We have witnessed instances where setting the governor to `performance` on only the isolated cores (dedicated to polling the NIC) does not lead to the performance gains expected. As such, we currently recommend setting the governor to `performance` for all cores which has shown to be reliably effective. + ??? abstract "See an example output" -Check the current governor for each of your cores: + For our RTX A6000, this shows a BAR1 size of 256 MiB: -=== "tune_system.py" + ``` + BAR1 Memory Usage + Total : 256 MiB + Used : 13 MiB + Free : 243 MiB + ``` - === "Debian installation" + !!! warning - ```bash - sudo /opt/nvidia/holoscan/bin/tune_system.py --check cpu-freq - ``` + Resizing the BAR1 size requires: - === "From source" + - A BIOS with resizable BAR support + - A GPU with physical resizable BAR - ```bash - cd holohub - sudo ./operators/advanced_network/python/tune_system.py --check cpu-freq - ``` + **If you attempt to go forward with the instructions below without meeting the above requirements, you might render your GPU unusable.** + + #### BIOS Resizable BAR support + + First, check if your system and BIOS support resizable BAR. Refer to your system's manufacturer documentation to access the BIOS. The Resizable BAR option is often categorized under `Advanced > PCIe` settings. Enable this feature if found. + + !!! note + + The IGX Developer kit with IGX OS 1.1+ supports resizable BAR by default. + + #### GPU Resizable BAR support + + Next, you can check if your GPU has physical resizable BAR by running the following command: + + ```bash + sudo lspci -vv -s $(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader) | grep BAR + ``` ??? 
abstract "See an example output" + This RTX A6000 has a resizable BAR1, currently set to 256 MiB: + ``` - 2025-03-06 12:20:27 - WARNING - CPU 0: Governor is set to 'powersave', not 'performance'. - 2025-03-06 12:20:27 - WARNING - CPU 1: Governor is set to 'powersave', not 'performance'. - 2025-03-06 12:20:27 - WARNING - CPU 2: Governor is set to 'powersave', not 'performance'. - 2025-03-06 12:20:27 - WARNING - CPU 3: Governor is set to 'powersave', not 'performance'. - 2025-03-06 12:20:27 - WARNING - CPU 4: Governor is set to 'powersave', not 'performance'. - 2025-03-06 12:20:27 - WARNING - CPU 5: Governor is set to 'powersave', not 'performance'. - 2025-03-06 12:20:27 - WARNING - CPU 6: Governor is set to 'powersave', not 'performance'. - 2025-03-06 12:20:27 - WARNING - CPU 7: Governor is set to 'powersave', not 'performance'. - 2025-03-06 12:20:27 - WARNING - CPU 8: Governor is set to 'powersave', not 'performance'. - 2025-03-06 12:20:27 - WARNING - CPU 9: Governor is set to 'powersave', not 'performance'. - 2025-03-06 12:20:27 - WARNING - CPU 10: Governor is set to 'powersave', not 'performance'. - 2025-03-06 12:20:27 - WARNING - CPU 11: Governor is set to 'powersave', not 'performance'. + Capabilities: [bb0 v1] Physical Resizable BAR + BAR 0: current size: 16MB, supported: 16MB + BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB 64GB + BAR 3: current size: 32MB, supported: 32MB ``` -=== "manual" + If your GPU is listed [on this page](https://developer.nvidia.com/displaymodeselector), you can download the `Display Mode Selector` to resize the BAR1 to 8GB. + + 1. Press `Join Now`. + 2. Once approved, download the `Display Mode Selector` archive. + 3. Unzip the archive. + 4. Access your system without an X-server running. SSH into the machine, or switch to a Virtual Console (`Alt+F1`). 
You do not need to physically disconnect the monitor; the only requirement is that no display server (X11/Wayland) is holding a lock on the NVIDIA driver. + 5. Go to the folder matching your OS and architecture (`linux/aarch64` or `linux/x64`). + 6. Run the `displaymodeselector` command like so: ```bash - cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor + chmod +x displaymodeselector + sudo ./displaymodeselector --gpumode physical_display_enabled_8GB_bar1 ``` - ??? abstract "See an example output" + Press `y` to confirm you'd like to continue, then `y` again to apply to all the eligible adapters. - In this example, all cores were defaulted to `powersave` instead of the recommended `performance`. + ??? abstract "See an example output" ``` - powersave - powersave - powersave - powersave - powersave - powersave - powersave - powersave - powersave - powersave - powersave - powersave - ``` + NVIDIA Display Mode Selector Utility (Version 1.67.0) + Copyright (C) 2015-2021, NVIDIA Corporation. All Rights Reserved. -Install `cpupower` to more conveniently set the governor: + WARNING: This operation updates the firmware on the board and could make + the device unusable if your host system lacks the necessary support. -```bash -sudo apt update -sudo apt install -y linux-tools-$(uname -r) -``` + Are you sure you want to continue? + Press 'y' to confirm (any other key to abort): + y + Specified GPU Mode "physical_display_enabled_8GB_bar1" -Set the governor to `performance` for all cores: -=== "One-time" + Update GPU Mode of all adapters to "physical_display_enabled_8GB_bar1"? 
+ Press 'y' to confirm or 'n' to choose adapters or any other key to abort: + y - ```bash - sudo cpupower frequency-set -g performance - ``` + Updating GPU Mode of all eligible adapters to "physical_display_enabled_8GB_bar1" -=== "Persistent" + Apply GPU Mode <6> corresponds to "physical_display_enabled_8GB_bar1" - ```bash - cat << EOF | sudo tee /etc/systemd/system/cpu-performance.service - [Unit] - Description=Set CPU governor to performance - After=multi-user.target + Reading EEPROM (this operation may take up to 30 seconds) - [Service] - Type=oneshot - ExecStart=/usr/bin/cpupower -c all frequency-set -g performance + [==================================================] 100 % + Reading EEPROM (this operation may take up to 30 seconds) - [Install] - WantedBy=multi-user.target - EOF - sudo systemctl enable cpu-performance.service - sudo systemctl start cpu-performance.service - ``` + Successfully updated GPU mode to "physical_display_enabled_8GB_bar1" ( Mode 6 ). -Running the checks above should now list `performance` as the governor for all cores. You can also run `sudo cpupower -c all frequency-info` for more details. - -### Step 7: Prevent the GPU from going idle - -Similarly to the above, we want to maximize the GPU's clock speed and prevent it from going idle. - -Run the following command to check your current clocks and whether they're locked (persistence mode): - -```text -nvidia-smi -q | grep -i "Persistence Mode" -nvidia-smi -q -d CLOCK -``` - -??? abstract "See an example output" - - ``` hl_lines="1 7 8 20 21" - Persistence Mode: Enabled - ... - Attached GPUs : 1 - GPU 00000005:09:00.0 - Clocks - Graphics : 420 MHz - SM : 420 MHz - Memory : 405 MHz - Video : 1680 MHz - Applications Clocks - Graphics : 1800 MHz - Memory : 8001 MHz - Default Applications Clocks - Graphics : 1800 MHz - Memory : 8001 MHz - Deferred Clocks - Memory : N/A - Max Clocks - Graphics : 2100 MHz - SM : 2100 MHz - Memory : 8001 MHz - Video : 1950 MHz - ... 
- ``` + A reboot is required for the update to take effect. + ``` -To lock the GPU's clocks to their max values: + ??? failure "Error: unload the NVIDIA kernel driver first" -=== "One-time" + If you see this error: - ```bash - sudo nvidia-smi -pm 1 - sudo nvidia-smi -lgc=$(nvidia-smi --query-gpu=clocks.max.sm --format=csv,noheader,nounits) - sudo nvidia-smi -lmc=$(nvidia-smi --query-gpu=clocks.max.mem --format=csv,noheader,nounits) - ``` + ```bash + ERROR: In order to avoid the irreparable damage to your graphics adapter it is necessary to unload the NVIDIA kernel driver first: -=== "Persistent" + rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia_peermem nvidia + ``` - ```bash - cat << EOF | sudo tee /etc/systemd/system/gpu-max-clocks.service - [Unit] - Description=Max GPU clocks - After=multi-user.target + Stop the display server and then unload the NVIDIA kernel modules listed in the error message (the exact list may vary by system): - [Service] - Type=oneshot - ExecStart=/usr/bin/nvidia-smi -pm 1 - ExecStart=/bin/bash -c '/usr/bin/nvidia-smi --lock-gpu-clocks=$(/usr/bin/nvidia-smi --query-gpu=clocks.max.sm --format=csv,noheader,nounits)' - ExecStart=/bin/bash -c '/usr/bin/nvidia-smi --lock-memory-clocks=$(/usr/bin/nvidia-smi --query-gpu=clocks.max.mem --format=csv,noheader,nounits)' - RemainAfterExit=true + ```bash + sudo systemctl isolate multi-user + sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia_peermem nvidia + ``` - [Install] - WantedBy=multi-user.target - EOF + !!! tip "IGX Systems" - sudo systemctl enable gpu-max-clocks.service - sudo systemctl start gpu-max-clocks.service - ``` + IGX systems may have additional NVIDIA kernel modules (e.g. `nvidia_vrs_pseq`) that must also be unloaded. Check for remaining modules with `lsmod | grep nvidia` and `rmmod` each one before retrying. -??? info "Show explanation" + ??? failure "/dev/mem: Operation not permitted. 
Access to physical memory denied" - This queries the max clocks for the GPU SM (`clocks.max.sm`) and memory (`clocks.max.mem`) and sets them to the current clocks (`lock-gpu-clocks` and `lock-memory-clocks` respectively). `-pm 1` (or `--persistence-mode=1`) enables persistence mode to lock these values. + Disable secure boot on your system ahead of changing your GPU's BAR1 size. It can be re-enabled afterwards. -??? abstract "See an example output" + Reboot your system, and check the BAR1 size again to confirm the change. - ``` - GPU clocks set to "(gpuClkMin 2100, gpuClkMax 2100)" for GPU 00000005:09:00.0 - All done. - Memory clocks set to "(memClkMin 8001, memClkMax 8001)" for GPU 00000005:09:00.0 - All done. + ```bash + sudo reboot ``` -You can confirm that the clocks are set to the max values by running `nvidia-smi -q -d CLOCK` again. + ### Step 9: Enable Jumbo Frames -!!! note + Jumbo frames are Ethernet frames that carry a payload larger than the standard 1500 bytes MTU (Maximum Transmission Unit). They can significantly improve network performance when transferring large amounts of data by reducing the overhead of packet headers and the number of packets that need to be processed. - Some max clocks might not be achievable in certain configurations, or due to boost clocks (SM) or rounding errors (Memory), despite the lock commands indicating it worked. For example - on IGX - the max non-boot SM clock will be 1920 MHz, and the max memory clock will show 8000 MHz, which are satisfying compared to the initial mode. + **We recommend an MTU of 9000 bytes on all interfaces involved in the data path.** You can check the current MTU of your interfaces: -### Step 8: Maximize GPU BAR1 size + === "tune_system.py" -The GPU BAR1 memory is the primary resource consumed by `GPUDirect`. It allows other PCIe devices (like the CPU and the NIC) to access the GPU's memory space. 
The larger the BAR1 size, the more memory the GPU can expose to these devices in a single PCIe transaction, reducing the number of transactions needed and improving performance. + === "Debian installation" -**We recommend a BAR1 size of 1GB or above.** Check the current BAR1 size: + ```bash + sudo /opt/nvidia/holoscan/bin/tune_system.py --check mtu + ``` -=== "tune_system.py" + === "From source" - === "Debian installation" + ```bash + cd holohub + sudo ./operators/advanced_network/python/tune_system.py --check mtu + ``` + + ??? abstract "See an example output" + + ``` + 2025-03-06 16:51:19 - INFO - Interface eth0 has an acceptable MTU of 9000 bytes. + 2025-03-06 16:51:19 - INFO - Interface eth1 has an acceptable MTU of 9000 bytes. + ``` + + === "manual" + + For a given `if_name` interface: ```bash - sudo /opt/nvidia/holoscan/bin/tune_system.py --check bar1-size + if_name=eth0 + ip link show dev $if_name | grep -oE "mtu [0-9]+" ``` - === "From source" + ??? abstract "See an example output" + + ``` + mtu 1500 + ``` + + You can set the MTU for each interface like so, for a given `if_name` name identified [above](#configure-the-ip-addresses-of-the-nic-ports): + + === "One-time" ```bash - cd holohub - sudo ./operators/advanced_network/python/tune_system.py --check bar1-size + sudo ip link set dev $if_name mtu 9000 ``` - ??? abstract "See an example output" + === "Persistent" - ``` - 2025-03-06 12:22:53 - INFO - GPU 00000005:09:00.0: BAR1 size is 8192 MiB. - ``` + === "NetworkManager" + + ```bash + sudo nmcli connection modify $if_name ipv4.mtu 9000 + sudo nmcli connection up $if_name + ``` -=== "manual" + === "systemd-networkd" - ```bash - nvidia-smi -q | grep -A 3 BAR1 - ``` + Assuming you've set an IP address for the interface [above](#configure-the-ip-addresses-of-the-nic-ports), you can add the MTU to the interface's network configuration file like so: - ??? 
abstract "See an example output" + ```bash + sudo sed -i '/\[Network\]/a MTU=9000' /etc/systemd/network/20-$if_name.network + sudo systemctl restart systemd-networkd + ``` - For our RTX A6000, this shows a BAR1 size of 256 MiB: + ??? info "Can I do more than 9000?" - ``` - BAR1 Memory Usage - Total : 256 MiB - Used : 13 MiB - Free : 243 MiB - ``` + While your NIC might have a maximum MTU capability larger than 9000, we typically recommend setting the MTU to 9000 bytes, as that is the standard size for jumbo frames that's widely supported for compatibility with other network equipment. When using jumbo frames, all devices in the communication path must support the same MTU size. If any device in between has a smaller MTU, packets will be fragmented or dropped, potentially degrading performance. -!!! warning + Example with the CX-7 NIC: - Resizing the BAR1 size requires: + ```bash + $ ip -d link show dev $if_name | grep -oE "maxmtu [0-9]+" + maxmtu 9978 + ``` - - A BIOS with resizable BAR support - - A GPU with physical resizable BAR + --- + **Next:** [Benchmarking Examples](benchmarking_examples.md) — run your first DAQIRI benchmark - **If you attempt to go forward with the instructions below without meeting the above requirements, you might render your GPU unusable.** +=== "DGX Spark" -#### BIOS Resizable BAR support + ## DGX Spark profile -First, check if your system and BIOS support resizable BAR. Refer to your system's manufacturer documentation to access the BIOS. The Resizable BAR option is often categorized under `Advanced > PCIe` settings. Enable this feature if found. + This tab covers a **DGX Spark** workstation: Grace Blackwell **GB10** superchip (unified CPU/GPU memory via NVLink-C2C, integrated **ConnectX-7**, ARM64), running Ubuntu 24.04 with a CUDA-13 / driver-580 stack. Several IGX optimization steps are physically inapplicable on Spark (no separate GPU BAR1, no peermem, no discrete-PCIe path between GPU and NIC). 
Each is called out as **N/A on Spark** in the corresponding step, with the rationale. -!!! note + ### Pre-flight: the CX-7 disappears without a cable - The IGX Developer kit with IGX OS 1.1+ supports resizable BAR by default. + !!! warning "QSFP cable is required" -#### GPU Resizable BAR support + DGX Spark removes the integrated CX-7 PFs from the PCI bus when no QSFP cable is plugged in. After boot, `mlx5_core` probes 4 PFs, each port immediately emits `Cable unplugged`, and the platform service `/opt/nvidia/dgx-spark-mlnx-hotplug` removes all four CX-7 PCIe devices between roughly t=5s and t=20s. Symptoms: `lspci -d 15b3:` is empty, `ibv_devinfo` reports "No IB devices found", and `/sys/class/infiniband/` is empty even though `mlx5_core` and `mlx5_ib` are loaded. -Next, you can check if your GPU has physical resizable BAR by running the following command: + **Plug a cable into the chassis QSFP socket before debugging firmware, drivers, or BIOS.** The hotplug service brings the device back when a cable is detected. -```bash -sudo lspci -vv -s $(nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader) | grep BAR -``` + The hotplug behavior is implemented as a power/thermal management policy and coordinates with `nvidia-spark-mlnx-firmware-manager.service`. If you need the NIC alive without a cable for software-only testing, the override point is the scripts under `/opt/nvidia/dgx-spark-mlnx-hotplug` — read them before disabling. -??? 
abstract "See an example output" + ### Port topology: 4 PFs, 2 chips, tied chassis sockets - This RTX A6000 has a resizable BAR1, currently set to 256 MiB: + `lspci -d 15b3:` on Spark shows **four** CX-7 PFs across **two** PCIe domains, fronting two chips that share a single internal fabric: + ```text + 0000:01:00.0 → mlx5_0 → enp1s0f0np0 + 0000:01:00.1 → mlx5_1 → enp1s0f1np1 + 0002:01:00.0 → mlx5_2 → enP2p1s0f0np0 + 0002:01:00.1 → mlx5_3 → enP2p1s0f1np1 ``` - Capabilities: [bb0 v1] Physical Resizable BAR - BAR 0: current size: 16MB, supported: 16MB - BAR 1: current size: 256MB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB 64GB - BAR 3: current size: 32MB, supported: 32MB + + The two chassis QSFPs are **tied** through the internal fabric. Pulling **just one end** of a loopback cable drops `carrier` to 0 on **all four** PFs simultaneously, which is the reliable diagnostic for confirming the topology: + + ```bash + for i in /sys/class/net/{enp1s0f0np0,enp1s0f1np1,enP2p1s0f0np0,enP2p1s0f1np1}; do + echo "$i: $(cat $i/carrier)" + done ``` -If your GPU is listed [on this page](https://developer.nvidia.com/displaymodeselector), you can download the `Display Mode Selector` to resize the BAR1 to 8GB. + Practical consequence for daqiri loopback benchmarks: any pair of PFs forms a working loopback. There is no "wrong pair" — pick TX/RX based on what your YAML expects. The Spark example YAMLs use `mlx5_0` (TX) ↔ `mlx5_2` (RX), so traffic crosses the cable. -1. Press `Join Now`. -2. Once approved, download the `Display Mode Selector` archive. -3. Unzip the archive. -4. Access your system without an X-server running. SSH into the machine, or switch to a Virtual Console (`Alt+F1`). You do not need to physically disconnect the monitor — the requirement is that no display server (X11/Wayland) is holding a lock on the NVIDIA driver. -5. Go down the right OS and architecture folder for your system (`linux/aarch64` or `linux/x64`). -6. 
Run the `displaymodeselector` command like so: + `ethtool -m` reports identical `Connector: 0x23 No separable connector` on all 4 PFs and is **not** useful for distinguishing them; use the cable-yank test. -```bash -chmod +x displaymodeselector -sudo ./displaymodeselector --gpumode physical_display_enabled_8GB_bar1 -``` + ## System Setup for DAQIRI -Press `y` to confirm you'd like to continue, then `y` again to apply to all the eligible adapters. + ### Check your NIC drivers -??? abstract "See an example output" + Same as IGX — verify `ib_core` is loaded: + ```bash + lsmod | grep ib_core ``` - NVIDIA Display Mode Selector Utility (Version 1.67.0) - Copyright (C) 2015-2021, NVIDIA Corporation. All Rights Reserved. - WARNING: This operation updates the firmware on the board and could make - the device unusable if your host system lacks the necessary support. + Spark ships with the Mellanox stack pre-installed. Two Spark-specific packages are involved at boot: `nvidia-spark-mlnx-firmware-manager` (flashes CX-7 firmware when a cable is present) and `nvidia-mlnx-tools`. Don't disable them. - Are you sure you want to continue? - Press 'y' to confirm (any other key to abort): - y - Specified GPU Mode "physical_display_enabled_8GB_bar1" + ### Switch your NIC link layers to Ethernet + Run `mlxconfig` once per PCIe domain (one command per chip). The CX-7 firmware on Spark already ships with both link types set to `ETH`, so this is usually a no-op verification: - Update GPU Mode of all adapters to "physical_display_enabled_8GB_bar1"? - Press 'y' to confirm or 'n' to choose adapters or any other key to abort: - y + ```bash + for d in 0000:01:00.0 0002:01:00.0; do + sudo mlxconfig -d "$d" set LINK_TYPE_P1=ETH LINK_TYPE_P2=ETH + done + ``` - Updating GPU Mode of all eligible adapters to "physical_display_enabled_8GB_bar1" + Reboot only if any flag actually changed. 
- Apply GPU Mode <6> corresponds to "physical_display_enabled_8GB_bar1" + ### Configure the IP addresses of the NIC ports - Reading EEPROM (this operation may take up to 30 seconds) + Spark uses NetworkManager. Create persistent `daqiri-tx` / `daqiri-rx` profiles that pin both the IP and the MTU (so [Step 9: Jumbo frames](#step-9-enable-jumbo-frames-already-covered) is folded in here): - [==================================================] 100 % - Reading EEPROM (this operation may take up to 30 seconds) + ```bash + sudo nmcli connection add type ethernet ifname enp1s0f0np0 con-name daqiri-tx \ + ipv4.addresses 1.1.1.1/24 ipv4.method manual ipv4.mtu 9000 \ + ipv4.gateway "" ipv6.method ignore + sudo nmcli connection add type ethernet ifname enP2p1s0f0np0 con-name daqiri-rx \ + ipv4.addresses 2.2.2.2/24 ipv4.method manual ipv4.mtu 9000 \ + ipv4.gateway "" ipv6.method ignore + sudo nmcli connection up daqiri-tx + sudo nmcli connection up daqiri-rx + ``` - Successfully updated GPU mode to "physical_display_enabled_8GB_bar1" ( Mode 6 ). + Verify: - A reboot is required for the update to take effect. + ```bash + ip -f inet addr show enp1s0f0np0 + ip -f inet addr show enP2p1s0f0np0 + ip link show enp1s0f0np0 | grep -oE "mtu [0-9]+" + ip link show enP2p1s0f0np0 | grep -oE "mtu [0-9]+" ``` -??? failure "Error: unload the NVIDIA kernel driver first" + ### Enable GPUDirect - If you see this error: + !!! warning "Skip `nvidia_peermem` on GB10" - ```bash - ERROR: In order to avoid the irreparable damage to your graphics adapter it is necessary to unload the NVIDIA kernel driver first: + `sudo modprobe nvidia_peermem` returns `Invalid argument` (EINVAL, exit=1) on GB10. The module file ships in `/lib/modules/$(uname -r)/kernel/nvidia-580-open/nvidia-peermem.ko`, but loading fails by design: peermem maps the NIC into a separate GPU BAR1, and GB10's NVLink-C2C unified memory has no separate BAR1. - rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia_peermem nvidia - ``` + !!! 
note "DMA-BUF is also unreachable as of CUDA 13.1" + + The Open kernel module on Grace platforms expects the standard Linux **DMA-BUF** path instead of peermem, but as of CUDA 13.1 / driver 580.142 the device-attribute query reports `flag=0`: + + ```text + cuDeviceGetAttribute(CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED, 0) → SUCCESS, flag=0 + cuDeviceGetAttribute(CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED, 0) → SUCCESS, flag=0 + cuDeviceGetAttribute(CU_DEVICE_ATTRIBUTE_INTEGRATED, 0) → SUCCESS, flag=1 + ``` + + DAQIRI's CUDA-DMA-BUF code path is therefore unreachable on Spark; `dpdk_patches/dmabuf.patch` still ships and is mandatory for the build, but the daqiri-side dma-buf branch never fires. + + **The right configuration on Spark is `kind: "host_pinned"` in the YAML** — there is no system-side step. Buffers are allocated by daqiri via `cudaHostAlloc` (so they are CUDA-addressable) and registered with DPDK via `rte_extmem_register`. End-to-end TX↔RX over the QSFP loop with `kind: "host_pinned"`, `num_bufs: 51200`, `batch_size: 10240` reaches **~94 Gbps** unicast (verified against `main` 9ebd729, which contains [PR #41](https://github.com/nvidia/daqiri/pull/41)). `kind: "huge"` works as a fallback at the same rate; `kind: "device"` does **not** work and is not expected to on GB10. + + See the ready-to-run [`examples/daqiri_bench_raw_tx_rx_spark.yaml`](https://github.com/nvidia/daqiri/blob/main/examples/daqiri_bench_raw_tx_rx_spark.yaml) for the complete config. - Stop the display server and then unload the NVIDIA kernel modules listed in the error message (the exact list may vary by system): + --- + + ## System Optimization + + The IGX checklist below applies on Spark with a few items skipped or reshaped. Quick map: + + | Step | Spark status | Notes | + |------|--------------|-------| + | 1. PCIe topology | **N/A** | Single-SoC integrated GPU; no separable PCIe path GPU↔NIC | + | 2. PCIe MPS / Speed | unchanged | Same diagnostic commands; PCIe Gen5 native | + | 3. 
NIC MRRS | reshape | Use systemd unit + `setpci CAP_EXP+8.w` (capability-relative); **disable Secure Boot** | + | 4. Hugepages | reshape | Use a grub **drop-in** under `/etc/default/grub.d/`, not `/etc/default/grub` | + | 5. CPU isolation | reshape | Pin to big cores 16-19 (X925 cluster 1); folds into the same grub drop-in as Step 4 | + | 6. CPU governor | already set | Spark default is `performance` on all 20 cores | + | 7. GPU clocks | unchanged | Same systemd-unit recipe; GB10 max SM clock is 3003 MHz | + | 8. GPU BAR1 size | **N/A** | Unified memory; no resizable BAR1 | + | 9. Jumbo frames | folded into setup | MTU=9000 was set in the `daqiri-tx`/`daqiri-rx` nmcli profiles above | + + `tune_system.py --check all` on Spark suppresses the WARNs that are false positives on integrated GPUs (peermem, gpudirect, topology, BAR1) — see the source comments in [`python/tune_system.py`](https://github.com/nvidia/daqiri/blob/main/python/tune_system.py). + + ### Step 1: PCIe topology — N/A on Spark + + `nvidia-smi topo -m` reports the integrated GPU as `SYS`-connected to the NIC PFs. This is structural, not tunable: there is no separable PCIe path GPU↔NIC on a single-SoC integrated GPU. `tune_system.py --check topo` recognizes integrated GPUs and reports INFO instead of WARNING for this case. + + ### Step 2: Check the NIC's PCIe configuration + + Same diagnostic commands as IGX — for each PF, query `MaxPayload` (DevCap vs DevCtl) and PCIe `Speed` (LnkCap vs LnkSta). Spark's CX-7 PFs negotiate PCIe Gen5 at the chip's native MPS; nothing to set. 
```bash - sudo systemctl isolate multi-user - sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia_peermem nvidia + for d in 0000:01:00.0 0000:01:00.1 0002:01:00.0 0002:01:00.1; do + echo "=== $d ===" + sudo lspci -vv -s "$d" | awk '/DevCap/{s=1} /DevCtl/{s=0} /MaxPayload /{match($0, /MaxPayload [0-9]+/, m); if(s){print "Max " m[0]} else{print "Current " m[0]}}' + sudo lspci -vv -s "$d" | awk '/LnkCap/{s=1} /LnkSta/{s=0} /Speed /{match($0, /Speed [0-9]+GT\/s/, m); if(s){print "Max " m[0]} else{print "Current " m[0]}}' + done ``` - !!! tip "IGX Systems" + ### Step 3: Maximize the NIC's MRRS via a systemd unit - IGX systems may have additional NVIDIA kernel modules (e.g. `nvidia_vrs_pseq`) that must also be unloaded. Check for remaining modules with `lsmod | grep nvidia` and `rmmod` each one before retrying. + `tune_system.py --set mrrs` writes `0x68.w`. On Spark, write to `CAP_EXP+8.w` (capability-relative) so the change is robust to capability-layout differences and easy to do in a unit file. **Secure Boot must be disabled** for `setpci` writes to succeed (otherwise the kernel's lockdown policy returns EPERM). -??? failure "/dev/mem: Operation not permitted. Access to physical memory denied" + ```bash + cat << 'EOF' | sudo tee /etc/systemd/system/nic-mrrs.service + [Unit] + Description=Set CX-7 PFs MRRS to 4096 (capability-relative) + After=multi-user.target - Disable secure boot on your system ahead of changing your GPU's BAR1 size. It can be re-enabled afterwards. + [Service] + Type=oneshot + ExecStart=/usr/bin/bash -c 'for d in 0000:01:00.0 0000:01:00.1 0002:01:00.0 0002:01:00.1; do setpci -s "$d" CAP_EXP+8.w=0x5000:0xf000; done' + RemainAfterExit=true -Reboot your system, and check the BAR1 size again to confirm the change. 
+ [Install] + WantedBy=multi-user.target + EOF + sudo systemctl daemon-reload + sudo systemctl enable --now nic-mrrs.service + ``` -```bash -sudo reboot -``` + Verify: -### Step 9: Enable Jumbo Frames + ```bash + for d in 0000:01:00.0 0000:01:00.1 0002:01:00.0 0002:01:00.1; do + sudo setpci -s "$d" CAP_EXP+8.w + done + # Each line should print 5xxx (high nibble 5 = 4096-byte MRRS). + ``` -Jumbo frames are Ethernet frames that carry a payload larger than the standard 1500 bytes MTU (Maximum Transmission Unit). They can significantly improve network performance when transferring large amounts of data by reducing the overhead of packet headers and the number of packets that need to be processed. + ### Step 4: Enable Huge pages — grub drop-in pattern -**We recommend an MTU of 9000 bytes on all interfaces involved in the data path.** You can check the current MTU of your interfaces: + Spark composes its `GRUB_CMDLINE_LINUX` from drop-ins under `/etc/default/grub.d/`. Edit a new file rather than `/etc/default/grub` directly so Spark platform updates don't fight your changes: -=== "tune_system.py" + ```bash + cat << 'EOF' | sudo tee /etc/default/grub.d/daqiri-tuning.cfg + GRUB_CMDLINE_LINUX="${GRUB_CMDLINE_LINUX} default_hugepagesz=1G hugepagesz=1G hugepages=3 isolcpus=16-19 nohz_full=16-19 rcu_nocbs=16-19 rcu_nocb_poll irqaffinity=0-15" + EOF + sudo update-grub + ``` - === "Debian installation" + Add the 1G hugepage mount and pre-create the directory: - ```bash - sudo /opt/nvidia/holoscan/bin/tune_system.py --check mtu - ``` + ```bash + echo "nodev /mnt/huge hugetlbfs pagesize=1G 0 0" | sudo tee -a /etc/fstab + sudo mkdir -p /mnt/huge + ``` - === "From source" + Reboot once — Steps 4 and 5 land together. - ```bash - cd holohub - sudo ./operators/advanced_network/python/tune_system.py --check mtu - ``` + ```bash + sudo reboot + ``` - ??? 
abstract "See an example output" + After reboot, verify: - ``` - 2025-03-06 16:51:19 - INFO - Interface eth0 has an acceptable MTU of 9000 bytes. - 2025-03-06 16:51:19 - INFO - Interface eth1 has an acceptable MTU of 9000 bytes. - ``` + ```bash + grep Huge /proc/meminfo + mount | grep huge + ``` -=== "manual" + ### Step 5: Isolate CPU cores - For a given `if_name` interface: + Spark has 20 cores arranged big.LITTLE-style: cluster 0 is 10 Cortex-A725 LITTLE cores (IDs 0-9 + part of 10-15), cluster 1 is the big Cortex-X925 cores (IDs 16-19 are the four highest-frequency big cores). Pin the daqiri TX/RX/processing threads onto **16, 17, 18, 19**; the rest of the system continues working normally on cores 0-15. - ```bash - if_name=eth0 - ip link show dev $if_name | grep -oE "mtu [0-9]+" + The grub drop-in created in [Step 4](#step-4-enable-huge-pages-grub-drop-in-pattern) above already includes: + + ```text + isolcpus=16-19 nohz_full=16-19 rcu_nocbs=16-19 rcu_nocb_poll irqaffinity=0-15 ``` - ??? abstract "See an example output" + Verify after reboot: - ``` - mtu 1500 - ``` + ```bash + cat /proc/cmdline | grep -oE "isolcpus=[^ ]+|nohz_full=[^ ]+|rcu_nocbs=[^ ]+|irqaffinity=[^ ]+" + ``` -You can set the MTU for each interface like so, for a given `if_name` name identified [above](#configure-the-ip-addresses-of-the-nic-ports): + ### Step 6: CPU governor — already `performance` on Spark -=== "One-time" + Spark ships with `performance` on all 20 cores. Verify: ```bash - sudo ip link set dev $if_name mtu 9000 + cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ``` -=== "Persistent" + If you want a belt-and-suspenders unit anyway, the IGX `cpu-performance.service` recipe works unchanged. 
- === "NetworkManager" + ### Step 7: Prevent the GPU from going idle - ```bash - sudo nmcli connection modify $if_name ipv4.mtu 9000 - sudo nmcli connection up $if_name - ``` + The IGX [`gpu-max-clocks.service`](#step-7-prevent-the-gpu-from-going-idle) systemd-unit recipe works unchanged on Spark; query and lock at runtime: - === "systemd-networkd" + ```bash + sudo nvidia-smi -pm 1 + sudo nvidia-smi -lgc=$(nvidia-smi --query-gpu=clocks.max.sm --format=csv,noheader,nounits) + sudo nvidia-smi -lmc=$(nvidia-smi --query-gpu=clocks.max.mem --format=csv,noheader,nounits) + ``` - Assuming you've set an IP address for the interface [above](#configure-the-ip-addresses-of-the-nic-ports), you can add the MTU to the interface's network configuration file like so: + On a production GB10 the locked SM clock is 3003 MHz. If `nvidia-smi -pm 1` reports persistence mode is unsupported on this platform, the lock-clocks calls still take effect for the current driver session — fold them into a unit and start it after reboot. - ```bash - sudo sed -i '/\[Network\]/a MTU=9000' /etc/systemd/network/20-$if_name.network - sudo systemctl restart systemd-networkd - ``` + ### Step 8: GPU BAR1 size — N/A on Spark -??? info "Can I do more than 9000?" + GB10 has unified CPU/GPU memory (NVLink-C2C coherent) — there is no resizable BAR1 to enlarge, so the entire IGX displaymodeselector / firmware-flash flow does not apply. `nvidia-smi -q | grep -A 3 BAR1` may print numbers but they are not actionable. `tune_system.py --check bar1-size` reports INFO instead of WARNING when an integrated GPU is detected. - While your NIC might have a maximum MTU capability larger than 9000, we typically recommend setting the MTU to 9000 bytes, as that is the standard size for jumbo frames that's widely supported for compatibility with other network equipment. When using jumbo frames, all devices in the communication path must support the same MTU size. 
If any device in between has a smaller MTU, packets will be fragmented or dropped, potentially degrading performance. + ### Step 9: Enable Jumbo frames — already covered - Example with the CX-7 NIC: + The `daqiri-tx` / `daqiri-rx` nmcli profiles created in [Configure the IP addresses](#configure-the-ip-addresses-of-the-nic-ports_1) already pin `ipv4.mtu 9000`, so this step is a no-op on Spark. Verify: ```bash - $ ip -d link show dev $if_name | grep -oE "maxmtu [0-9]+" - maxmtu 9978 + ip link show enp1s0f0np0 | grep -oE "mtu [0-9]+" + ip link show enP2p1s0f0np0 | grep -oE "mtu [0-9]+" ``` ---- -**Next:** [Benchmarking Examples](benchmarking_examples.md) — run your first DAQIRI benchmark + --- + **Next:** [Benchmarking Examples](benchmarking_examples.md) — run your first DAQIRI benchmark + diff --git a/examples/daqiri_bench_raw_tx_rx_spark.yaml b/examples/daqiri_bench_raw_tx_rx_spark.yaml new file mode 100644 index 0000000..c228850 --- /dev/null +++ b/examples/daqiri_bench_raw_tx_rx_spark.yaml @@ -0,0 +1,89 @@ +# DGX Spark (GB10) ready-to-run config for daqiri_bench_raw_gpudirect. +# Templated version (with placeholders) is in +# daqiri_bench_raw_tx_rx.yaml. +# +# Spark substitutions baked in here: +# - PCIe addresses: 0000:01:00.0 (TX, mlx5_0) / 0002:01:00.0 (RX, mlx5_2), +# traffic crosses the chassis QSFP loop; see docs/tutorials/system_configuration.md +# "DGX Spark" tab for the 4-PFs/2-chips/tied-ports topology. +# - kind: host_pinned, NOT device. GB10 has unified memory; nvidia_peermem +# does not load and CUDA reports DMA_BUF_SUPPORTED=0. PR #41 made +# host_pinned the working GPUDirect path on Spark (~94 Gbps unicast). +# - cpu_core: 17/18 from the four big-cluster X925 cores 16-19 isolated by the +# daqiri-tuning grub drop-in. +# - master_core: 8 (a non-isolated big core; any 0-15 works). +# - bench_tx ip_src/ip_dst: match the daqiri-tx / daqiri-rx nmcli profiles +# (1.1.1.1/24 and 2.2.2.2/24, MTU 9000). 
+# - eth_dst_addr: per-system, fill from the RX interface MAC: +# cat /sys/class/net/enP2p1s0f0np0/address +# +%YAML 1.2 +--- +daqiri: + cfg: + version: 1 + stream_type: "raw" + master_core: 8 + debug: false + log_level: "info" + loopback: "" + + memory_regions: + - name: "Data_TX_GPU" + kind: "host_pinned" + affinity: 0 + num_bufs: 51200 + buf_size: 8064 + - name: "Data_RX_GPU" + kind: "host_pinned" + affinity: 0 + num_bufs: 51200 + buf_size: 8064 + + interfaces: + - name: "tx_port" + address: 0000:01:00.0 + tx: + queues: + - name: "tx_q_0" + id: 0 + batch_size: 10240 + cpu_core: 17 + memory_regions: + - "Data_TX_GPU" + offloads: + - "tx_eth_src" + - name: "rx_port" + address: 0002:01:00.0 + rx: + flow_isolation: true + queues: + - name: "rq_q_0" + id: 0 + cpu_core: 18 + batch_size: 10240 + memory_regions: + - "Data_RX_GPU" + flows: + - name: "flow_0" + id: 0 + action: + type: queue + id: 0 + match: + udp_src: 4096 + udp_dst: 4096 + +bench_rx: + interface_name: "rx_port" + +bench_tx: + interface_name: "tx_port" + batch_size: 10240 + payload_size: 8000 + header_size: 64 + eth_dst_addr: <00:00:00:00:00:00> # cat /sys/class/net/enP2p1s0f0np0/address + ip_src_addr: 1.1.1.1 + ip_dst_addr: 2.2.2.2 + udp_src_port: 4096 + udp_dst_port: 4096 diff --git a/examples/daqiri_bench_rdma_tx_rx_spark.yaml b/examples/daqiri_bench_rdma_tx_rx_spark.yaml new file mode 100644 index 0000000..5bc12ce --- /dev/null +++ b/examples/daqiri_bench_rdma_tx_rx_spark.yaml @@ -0,0 +1,105 @@ +# DGX Spark (GB10) ready-to-run config for daqiri_bench_rdma. +# Templated version (different IPs, different cores) is in +# daqiri_bench_rdma_tx_rx.yaml. +# +# Spark substitutions: +# - IPs: 1.1.1.1 (client/TX) and 2.2.2.2 (server/RX), matching the +# daqiri-tx / daqiri-rx nmcli profiles documented in the system +# configuration tutorial DGX Spark tab. +# - cpu_core values pulled from the isolated big-cluster X925 cores +# 16-19 (see grub drop-in /etc/default/grub.d/daqiri-tuning.cfg). 
+# - master_core: 8 (non-isolated big core). +# - kind: host_pinned (already correct upstream and required on GB10 +# where peermem is N/A and DMA-BUF is unreachable). +# +%YAML 1.2 +--- +daqiri: + cfg: + version: 1 + stream_type: "socket" + protocol: "roce" + master_core: 8 + debug: false + log_level: "info" + + memory_regions: + - name: "DATA_RX_GPU_SERVER" + kind: "host_pinned" + affinity: 0 + num_bufs: 20 + buf_size: 9000000 + - name: "DATA_TX_GPU_SERVER" + kind: "host_pinned" + affinity: 0 + num_bufs: 20 + buf_size: 9000000 + - name: "DATA_TX_GPU_CLIENT" + kind: "host_pinned" + affinity: 0 + num_bufs: 20 + buf_size: 90000000 + - name: "DATA_RX_GPU_CLIENT" + kind: "host_pinned" + affinity: 0 + num_bufs: 20 + buf_size: 90000000 + + interfaces: + - name: my_client + address: 1.1.1.1 + socket_config: + mode: client + remote_ip: 2.2.2.2 + remote_port: 4096 + roce_config: + transport_mode: RC + tx: + queues: + - name: "Client_TX_Queue" + id: 0 + batch_size: 1 + cpu_core: 17 + rx: + queues: + - name: "Client_RX_Queue" + id: 0 + cpu_core: 18 + batch_size: 1 + - name: my_server + address: 2.2.2.2 + socket_config: + mode: server + local_ip: 2.2.2.2 + local_port: 4096 + roce_config: + transport_mode: RC + rx: + queues: + - name: "Server_RX_Queue" + id: 0 + cpu_core: 19 + batch_size: 1 + tx: + queues: + - name: "Server_TX_Queue" + id: 0 + cpu_core: 16 + batch_size: 1 + +rdma_bench_server: + server_address: 2.2.2.2 + server_port: 4096 + message_size: 8000000 + send: true + receive: true + server: true + +rdma_bench_client: + message_size: 8000000 + client_address: 1.1.1.1 + server_address: 2.2.2.2 + server_port: 4096 + receive: true + send: true + server: false diff --git a/python/tune_system.py b/python/tune_system.py index 66118d2..e6cb7cb 100755 --- a/python/tune_system.py +++ b/python/tune_system.py @@ -96,6 +96,32 @@ def parse_args(): return parser.parse_args() +def is_any_integrated_gpu(): + """ + Returns True if any visible CUDA device reports 
CU_DEVICE_ATTRIBUTE_INTEGRATED == 1. + Integrated GPUs (e.g. NVIDIA GB10 / DGX Spark) share memory with the CPU and + several discrete-GPU tuning checks (peermem, GPUDirect-RDMA-supported, + PIX/PXB topology, BAR1 size) do not apply to them. + """ + try: + libcuda = CDLL("libcuda.so") + if libcuda.cuInit(0) != 0: + return False + cuDevAttrIntegrated = 18 # CU_DEVICE_ATTRIBUTE_INTEGRATED + count = c_int() + libcuda.cuDeviceGetCount(byref(count)) + for i in range(count.value): + dev = c_int() + libcuda.cuDeviceGet(byref(dev), i) + flag = c_int() + libcuda.cuDeviceGetAttribute(byref(flag), cuDevAttrIntegrated, dev) + if flag.value == 1: + return True + except Exception: + pass + return False + + def check_peermem_kernel(): """ Check if the nvidia-peermem module for GPUDirect is loaded in the kernel. @@ -111,6 +137,12 @@ def check_peermem_kernel(): if bool(result.stdout.strip()): logging.info("nvidia-peermem module is loaded.") + elif is_any_integrated_gpu(): + logging.info( + "nvidia-peermem module is not loaded, but the platform has an integrated GPU " + "(e.g. GB10 / DGX Spark) where peermem does not apply. Use kind: host_pinned " + "in the daqiri YAML for GPUDirect on this platform." + ) else: logging.warning("nvidia-peermem module is not loaded. GPUDirect may not work.") @@ -127,6 +159,7 @@ def check_gpudirect_support(): libcuda = CDLL("libcuda.so") cudaDevAttrGPUDirectRDMASupported = 116 + cuDevAttrIntegrated = 18 result = libcuda.cuInit(0) if result != 0: @@ -145,8 +178,17 @@ def check_gpudirect_support(): supported = c_int() libcuda.cuDeviceGetAttribute(byref(supported), cudaDevAttrGPUDirectRDMASupported, device) + integrated = c_int() + libcuda.cuDeviceGetAttribute(byref(integrated), cuDevAttrIntegrated, device) + if bool(supported.value): logging.info(f"GPU {i}: {name.value.decode()} has GPUDirect support.") + elif bool(integrated.value): + logging.info( + f"GPU {i}: {name.value.decode()} is integrated (unified memory). 
" + "GPUDirect-RDMA-supported reported as 0 is expected on this platform; " + "use kind: host_pinned in the daqiri YAML." + ) else: logging.warning(f"GPU {i}: {name.value.decode()} does not have GPUDirect support.") @@ -491,6 +533,12 @@ def check_bar1_size(): Checks the BAR1 size of all NVIDIA GPUs using nvidia-smi. Logs the BAR1 size for each GPU and ensures it is non-zero. """ + if is_any_integrated_gpu(): + logging.info( + "BAR1 size check skipped: integrated GPU detected (unified memory). " + "There is no resizable BAR1 to enlarge on platforms like GB10 / DGX Spark." + ) + return try: # Run nvidia-smi to get BAR1 memory information result = subprocess.run( @@ -544,6 +592,13 @@ def check_topology_connections(): Executes `nvidia-smi topo -m`, parses its output, and ensures that every GPU has at least one PIX or PXB connection to a NIC. If not, logs an error specifying the GPU, NIC, and the actual connection type. """ + if is_any_integrated_gpu(): + logging.info( + "Skipping PIX/PXB topology requirement: integrated GPU detected. " + "On single-SoC unified-memory platforms (e.g. GB10 / DGX Spark) there is " + "no separable PCIe path GPU↔NIC to optimize." + ) + return try: # Run nvidia-smi topo -m to get topology information result = subprocess.run( @@ -682,7 +737,9 @@ def check_mtu_size(): def update_mrrs_for_nvidia_devices(): """ Updates the PCIe Maximum Read Request Size (MRRS) to 4096 for all Mellanox devices, - preserving the lower 12 bits of the current setting. + preserving the lower 12 bits of the current setting. Reads back after the write + so a silently-failing setpci (e.g. Secure Boot lockdown) is reported as an error + rather than misreported as a success. 
""" try: nic_info = get_nic_info() @@ -706,9 +763,25 @@ def update_mrrs_for_nvidia_devices(): # Write the new MRRS value back subprocess.run(["setpci", "-s", pci_address, f"68.w={new_value:04x}"], check=True) - logging.info( - f"Successfully updated MRRS to 4096 for device at PCIe address {pci_address}={hex(new_value)}." + + # Read back to verify the write actually landed. + verify_result = subprocess.run( + ["setpci", "-s", pci_address, "68.w"], + capture_output=True, + text=True, + check=True, ) + verified_value = int(verify_result.stdout.strip(), 16) + if (verified_value & 0xF000) >> 12 == 5: + logging.info( + f"Successfully updated MRRS to 4096 for device at PCIe address {pci_address}={hex(verified_value)}." + ) + else: + logging.error( + f"MRRS write to {pci_address} did not take effect (read back {hex(verified_value)}). " + "Most common cause: kernel lockdown from Secure Boot blocks setpci writes " + "silently. Disable Secure Boot in firmware and retry." + ) except subprocess.CalledProcessError as e: logging.error( f"Failed to update MRRS for device at PCIe address {pci_address}: {e}" diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt index 265a368..0534ec0 100644 --- a/src/CMakeLists.txt +++ b/src/CMakeLists.txt @@ -22,7 +22,7 @@ find_package(yaml-cpp QUIET) enable_language(CUDA) set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} --diag-suppress 1217") -set(CMAKE_CUDA_ARCHITECTURES "80;90") +set(CMAKE_CUDA_ARCHITECTURES "80;90;121") add_compile_definitions(ALLOW_EXPERIMENTAL_API) add_compile_definitions(DOCA_ALLOW_EXPERIMENTAL_API)