DGPU support
Add support for x86 PC systems. This requires the following issues to be
addressed:

- Various differences in the nv-p2p.h API and semantics.
- A different GPU page size.
- P2P APIs are implemented in nvidia.ko external module, rather than the
  kernel itself, so the build system must locate nv-p2p.h and nvidia.ko,
  and extract required symbol version information from nvidia.ko.
- Differences in the allocation routines that CUDA applications must use
  for memory used with GPUDirect RDMA.
nvswarren committed Apr 17, 2019
1 parent 34e7f62 commit 77fdda4
Showing 9 changed files with 287 additions and 52 deletions.
67 changes: 47 additions & 20 deletions README.md
@@ -1,11 +1,14 @@
# Introduction

This repository provides a minimal hardware-based demonstration of GPUDirect
RDMA on NVIDIA Jetson AGX Xavier (Jetson) running Linux for Tegra (L4T). This
feature allows a PCIe device to directly access CUDA GPU memory, thus allowing
zero-copy sharing of data between CUDA and a PCIe device.
RDMA. This feature allows a PCIe device to directly access CUDA memory, thus
allowing zero-copy sharing of data between CUDA and a PCIe device.

A graphical repreentation of the system configuration created by the software
The code supports both:
* NVIDIA Jetson AGX Xavier (Jetson) running Linux for Tegra (L4T).
* A PC running the NVIDIA CUDA drivers and containing a Quadro or Tesla GPU.

A graphical representation of the system configuration created by the software
in this repository, and the data flow between components, is shown below:

![RDMA Configuration and Data Flow](rdma-flow.svg)
@@ -26,13 +29,13 @@ USB bus. It is available from:
### PCIe Adapter Board

The PicoEVB board is a double-sided M.2 device. Jetson physically only supports
boards with a full-size PCIe connector, or single-sided M.2 devices. Some form
of adapter is required to connect the two in a mechanically reliable way.
boards with a full-size PCIe connector, or single-sided M.2 devices. PCs
typically only support boards with a full-size PCIe connector. Some form of
adapter is required to connect the two in a mechanically reliable way.

A PCIe x16/x8/x4/x2/x1 to M.2 key E adapter may be used to plug the PicoEVB
board into Jetson's full-size PCIe slot. The same adapter enables the PicoEVB
board to be plugged into a desktop PC for development. One such adapter board
may be available from Amazon as ASIN B013U4401W, product name "Sourcingbay
board into a full-size PCIe slot on Jetson or a PC. One such adapter board may
be available from Amazon as ASIN B013U4401W, product name "Sourcingbay
M.2(NGFF) Wireless Card to PCI-e 1X Adapter".

The following pair of adapters may be used to connect the PicoEVB board to
@@ -137,7 +140,7 @@ hung, but is actually running.

# Linux Kernel Driver

## Building on Jetson
## Building on Jetson, to Run on Jetson

To build the Linux kernel driver on Jetson, execute:

@@ -150,7 +153,7 @@ cd /path/to/this/project/kernel-module/

This will generate `picoevb-rdma.ko`.

## Building on an x86 Linux PC
## Building on an x86 Linux PC, to Run on Jetson

The Linux kernel driver may alternatively be built (cross-compiled) on an x86
Linux PC. You will first need to obtain a copy of the "Linux headers" or
@@ -170,6 +173,17 @@ KDIR=/path/to/linux-headers-4.9.140-tegra-linux_x86_64/kernel-4.9/ ./build-for-j

This will generate `picoevb-rdma.ko`. This file must be copied to Jetson.

## Building on an x86 Linux PC, to Run on That PC

```
sudo apt update
sudo apt install build-essential bc
cd /path/to/this/project/kernel-module/
./build-for-pc-native.sh
```

This will generate `picoevb-rdma.ko`.

## Loading the Module

To load the kernel module, execute:
@@ -195,7 +209,7 @@ $ lspci -v

# User-space Applications

## Building on Jetson
## Building on Jetson, to Run on Jetson

The client applications are best built on Jetson itself. Make sure you have the
CUDA development tools installed, and execute:
@@ -207,7 +221,7 @@ cd /path/to/this/project/client-applications/
./build-for-jetson-igpu-native.sh
```

## Building on an x86 Linux PC
## Building on an x86 Linux PC, to Run on Jetson

Building (cross-compiling) the client applications on an x86 Linux PC is only
partially supported; the makefile does not yet support cross-compiling the CUDA
@@ -225,13 +239,25 @@ You may need to adjust the value of variable `CROSS_COMPILE` in script
`./build-for-jetson-igpu-on-pc.sh` to match the configuration of your x86 Linux
PC.
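
For example, on a Debian/Ubuntu host with the aarch64 cross toolchain package
installed, the variable inside the script would typically be set as follows
(an illustrative value; the correct prefix depends on your installed
toolchain):

```
CROSS_COMPILE=aarch64-linux-gnu-
```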

## Building on an x86 Linux PC, to Run on That PC

Make sure you have the CUDA development tools installed, and execute:

```
sudo apt update
sudo apt install build-essential bc
cd /path/to/this/project/client-applications/
./build-for-pc-native.sh
```

## Running the Tests

### Data Access Tests

Two PCIe data access tests are provided; `rdma-malloc` and `rdma-cuda`. Both
tests are structurally identical, but allocate memory using different APIs; the
former using `malloc()`, and the latter via `cudaHostAlloc()`.
former using `malloc()`, and the latter via `cudaHostAlloc()` (Jetson) or
`cudaMalloc()` (PC).

Both tests proceed as follows:

@@ -254,12 +280,13 @@ You can avoid the need to use `sudo` by applying appropriate permissions to the
kernel driver's device file, `/dev/picoevb`.
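
One hypothetical way to apply such permissions persistently is a udev rule
(this sketch is not part of the repository, and assumes the device node is
named `picoevb`); for example, in `/etc/udev/rules.d/99-picoevb.rules`:

```
KERNEL=="picoevb", MODE="0666"
```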

Internally to the kernel driver, the copy operation divides the surface into
64K chunks, and for each chunk first copies that chunk's data from the source
surface to the FPGA's internal memory, then copies the data from the FPGA's
internal memory to the destination surface. This demonstrates both PCIe read
and write access to CUDA GPU memory. The requirement to divide the data into
64K chunks is a limitation of the internal memory size of the PicoEVB board's
FPGA, and likely would not apply in a production device.
64KiB chunks (or smaller, depending on memory alignment), and for each chunk
first copies that chunk's data from the source surface to the FPGA's internal
memory, then copies the data from the FPGA's internal memory to the destination
surface. This demonstrates both PCIe read and write access to CUDA GPU memory.
The requirement to divide the data into chunks is a limitation of the internal
memory size of the PicoEVB board's FPGA, and likely would not apply in a
production device.
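
The chunked copy described above can be sketched as follows. This is an
illustrative model, not the driver's actual code; `CHUNK_SIZE`,
`count_chunks`, `chunked_copy`, and `copy_chunk` are all hypothetical names:

```c
#include <stddef.h>

/* The FPGA's internal memory holds at most 64KiB at a time. */
#define CHUNK_SIZE (64 * 1024)

/* Number of chunk-sized transfers needed for a surface of `len` bytes. */
static size_t count_chunks(size_t len)
{
	return (len + CHUNK_SIZE - 1) / CHUNK_SIZE;
}

/*
 * Walk the surface chunk by chunk; `copy_chunk` stands in for the two DMA
 * steps (source surface -> FPGA internal memory -> destination surface).
 */
static void chunked_copy(size_t len,
			 void (*copy_chunk)(size_t offset, size_t nbytes))
{
	for (size_t offset = 0; offset < len; offset += CHUNK_SIZE) {
		size_t nbytes = len - offset;
		if (nbytes > CHUNK_SIZE)
			nbytes = CHUNK_SIZE;
		copy_chunk(offset, nbytes);
	}
}
```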

### set-leds

4 changes: 4 additions & 0 deletions client-applications/Makefile
@@ -29,6 +29,10 @@ NVCC ?= $(CUDA_TOOLKIT)/bin/nvcc

CFLAGS := \
-ggdb
ifdef NV_BUILD_DGPU
CFLAGS += \
-DNV_BUILD_DGPU
endif

TARGETS := rdma-cuda rdma-malloc set-leds
default: $(TARGETS)
24 changes: 24 additions & 0 deletions client-applications/build-for-pc-native.sh
@@ -0,0 +1,24 @@
#!/bin/sh

# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

export NV_BUILD_DGPU=1
exec make
42 changes: 42 additions & 0 deletions client-applications/rdma-cuda.cu
@@ -72,8 +72,12 @@ int main(int argc, char **argv)
return 1;
}

#ifdef NV_BUILD_DGPU
ce = cudaMalloc(&src_d, SURFACE_SIZE * sizeof(*src_d));
#else
ce = cudaHostAlloc(&src_d, SURFACE_SIZE * sizeof(*src_d),
cudaHostAllocDefault);
#endif
if (ce != cudaSuccess) {
fprintf(stderr, "Allocation of src_d failed: %d\n", ce);
return 1;
@@ -94,8 +98,12 @@ int main(int argc, char **argv)
return 1;
}

#ifdef NV_BUILD_DGPU
ce = cudaMalloc(&dst_d, SURFACE_SIZE * sizeof(*dst_d));
#else
ce = cudaHostAlloc(&dst_d, SURFACE_SIZE * sizeof(*dst_d),
cudaHostAllocDefault);
#endif
if (ce != cudaSuccess) {
fprintf(stderr, "Allocation of dst_d failed: %d\n", ce);
return 1;
@@ -138,7 +146,25 @@ int main(int argc, char **argv)
return 1;
}

/*
 * dGPU on x86 does not allow GPUDirect RDMA on host pinned memory
 * (cudaHostAlloc), so we must allocate device memory (cudaMalloc), and
 * manually copy it to the host for validation.
 */
#ifdef NV_BUILD_DGPU
ce = cudaMallocHost(&dst_cpu, SURFACE_SIZE * sizeof(*dst_cpu), 0);
if (ce != cudaSuccess) {
fprintf(stderr, "cudaMallocHost(dst_cpu) failed\n");
return 1;
}
ce = cudaMemcpy(dst_cpu, dst_d, SURFACE_SIZE * sizeof(*dst_cpu), cudaMemcpyDeviceToHost);
if (ce != cudaSuccess) {
fprintf(stderr, "cudaMemcpy() failed: %d\n", ce);
return 1;
}
#else
dst_cpu = dst_d;
#endif

ret = 0;
for (y = 0; y < SURFACE_H; y++) {
@@ -157,14 +183,26 @@ int main(int argc, char **argv)
if (ret)
return 1;

#ifdef NV_BUILD_DGPU
ce = cudaFreeHost(dst_cpu);
if (ce != cudaSuccess) {
fprintf(stderr, "cudaFreeHost(dst_cpu) failed: %d\n", ce);
return 1;
}
#endif

unpin_params_dst.handle = pin_params_dst.handle;
ret = ioctl(fd, PICOEVB_IOC_UNPIN_CUDA, &unpin_params_dst);
if (ret != 0) {
fprintf(stderr, "ioctl(UNPIN_CUDA dst) failed: %d\n", ret);
return 1;
}

#ifdef NV_BUILD_DGPU
ce = cudaFree(dst_d);
#else
ce = cudaFreeHost(dst_d);
#endif
if (ce != cudaSuccess) {
fprintf(stderr, "Free of dst_d failed: %d\n", ce);
return 1;
@@ -177,7 +215,11 @@ int main(int argc, char **argv)
return 1;
}

#ifdef NV_BUILD_DGPU
ce = cudaFree(src_d);
#else
ce = cudaFreeHost(src_d);
#endif
if (ce != cudaSuccess) {
fprintf(stderr, "Free of src_d failed: %d\n", ce);
return 1;
6 changes: 6 additions & 0 deletions kernel-module/Kbuild
@@ -9,4 +9,10 @@
# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
# more details.

ifdef NV_BUILD_DGPU
KBUILD_CFLAGS += \
-I$(NVIDIA_SRC_DIR) \
-DNV_BUILD_DGPU
endif

obj-m += picoevb-rdma.o
9 changes: 8 additions & 1 deletion kernel-module/Makefile
@@ -9,10 +9,17 @@
# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
# more details.

KDIR ?= "/lib/modules/$(shell uname -r)/build"
KDIR ?= /lib/modules/$(shell uname -r)/build

.PHONY: default
default: modules

%:
$(MAKE) -C "$(KDIR)" "M=$$PWD" "$@"

ifdef NV_BUILD_DGPU
modules: Module.symvers

Module.symvers: $(NVIDIA_KO) nvidia-ko-to-module-symvers
./nvidia-ko-to-module-symvers "$<" "$@"
endif
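
For reference, each line of the generated `Module.symvers` follows the
kernel's four-column format: CRC, symbol name, defining module, and export
type. An illustrative entry (the CRC and module path here are made up):

```
0x1a2b3c4d	nvidia_p2p_get_pages	/lib/modules/4.15.0-47-generic/updates/nvidia.ko	EXPORT_SYMBOL
```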
36 changes: 36 additions & 0 deletions kernel-module/build-for-pc-native.sh
@@ -0,0 +1,36 @@
#!/bin/sh

# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

export NV_BUILD_DGPU=1
export NVIDIA_SRC_DIR="$(find /usr/src/nvidia-* -name nv-p2p.h 2>/dev/null|head -1|xargs dirname 2>/dev/null)"
if [ ! -d "${NVIDIA_SRC_DIR}" ]; then
echo "ERROR: Could not find nv-p2p.h"
exit 1
fi

export NVIDIA_KO="$(find /lib/modules/$(uname -r)/ -name 'nvidia*.ko'|grep -P 'nvidia(_[0-9]+)?.ko'|head -1)"
if [ ! -f "${NVIDIA_KO}" ]; then
echo "ERROR: Could not find nvidia.ko"
exit 1
fi

exec make
24 changes: 24 additions & 0 deletions kernel-module/nvidia-ko-to-module-symvers
@@ -0,0 +1,24 @@
#!/bin/bash

syms=()
syms+=(nvidia_p2p_init_mapping)
syms+=(nvidia_p2p_destroy_mapping)
syms+=(nvidia_p2p_get_pages)
syms+=(nvidia_p2p_put_pages)
syms+=(nvidia_p2p_free_page_table)
syms+=(nvidia_p2p_dma_map_pages)
syms+=(nvidia_p2p_dma_unmap_pages)

nvidia_ko_fn="$1"
symvers_fn="$2"

touch "${symvers_fn}"
for sym in "${syms[@]}"; do
crc="$(objdump -t "${nvidia_ko_fn}" | grep "__crc_${sym}" | awk '{print $1}')"
if [ -z "${crc}" ]; then
echo "Warning: Can't find symbol ${sym} in ${nvidia_ko_fn}; assuming CONFIG_MODVERSIONS=n so setting CRC=0"
crc=00000000
fi
sed -i "/${sym}/d" "${symvers_fn}"
echo "0x${crc} ${sym} ${nvidia_ko_fn} EXPORT_SYMBOL" >> "${symvers_fn}"
done
