DGPU support
Add support for x86 PC systems. This requires the following issues to be
addressed:

- Various differences in the nv-p2p.h API and semantics.
- A different GPU page size.
- P2P APIs are implemented in nvidia.ko external module, rather than the
  kernel itself, so the build system must locate nv-p2p.h and nvidia.ko,
  and extract required symbol version information from nvidia.ko.
- Differences in the allocation routines that CUDA applications must use
  for memory used with GPUDirect RDMA.
nvswarren committed Apr 17, 2019
1 parent 34e7f62 commit 77fdda4
Showing 9 changed files with 287 additions and 52 deletions.
67 changes: 47 additions & 20 deletions README.md
@@ -1,11 +1,14 @@
# Introduction

This repository provides a minimal hardware-based demonstration of GPUDirect
RDMA on NVIDIA Jetson AGX Xavier (Jetson) running Linux for Tegra (L4T). This
feature allows a PCIe device to directly access CUDA GPU memory, thus allowing
zero-copy sharing of data between CUDA and a PCIe device.
RDMA. This feature allows a PCIe device to directly access CUDA memory, thus
allowing zero-copy sharing of data between CUDA and a PCIe device.

A graphical repreentation of the system configuration created by the software
The code supports both:
* NVIDIA Jetson AGX Xavier (Jetson) running Linux for Tegra (L4T).
* A PC running the NVIDIA CUDA drivers and containing a Quadro or Tesla GPU.

A graphical representation of the system configuration created by the software
in this repository, and the data flow between components, is shown below:

![RDMA Configuration and Data Flow](rdma-flow.svg)
@@ -26,13 +29,13 @@ USB bus. It is available from:
### PCIe Adapter Board

The PicoEVB board is a double-sided M.2 device. Jetson physically only supports
boards with a full-size PCIe connector, or single-sided M.2 devices. Some form
of adapter is required to connect the two in a mechanically reliable way.
boards with a full-size PCIe connector, or single-sided M.2 devices. PCs
typically only support boards with a full-size PCIe connector. Some form of
adapter is required to connect the two in a mechanically reliable way.

A PCIe x16/x8/x4/x2/x1 to M.2 key E adapter may be used to plug the PicoEVB
board into Jetson's full-size PCIe slot. The same adapter enables the PicoEVB
board to be plugged into a desktop PC for development. One such adapter board
may be available from Amazon as ASIN B013U4401W, product name "Sourcingbay
board into a full-size PCIe slot on Jetson or a PC. One such adapter board may
be available from Amazon as ASIN B013U4401W, product name "Sourcingbay
M.2(NGFF) Wireless Card to PCI-e 1X Adapter".

The following pair of adapters may be used to connect the PicoEVB board to
@@ -137,7 +140,7 @@ hung, but is actually running.

# Linux Kernel Driver

## Building on Jetson
## Building on Jetson, to Run on Jetson

To build the Linux kernel driver on Jetson, execute:

@@ -150,7 +153,7 @@ cd /path/to/this/project/kernel-module/

This will generate `picoevb-rdma.ko`.

## Building on an x86 Linux PC
## Building on an x86 Linux PC, to Run on Jetson

The Linux kernel driver may alternatively be built (cross-compiled) on an x86
Linux PC. You will first need to obtain a copy of the "Linux headers" or
@@ -170,6 +173,17 @@ KDIR=/path/to/linux-headers-4.9.140-tegra-linux_x86_64/kernel-4.9/ ./build-for-j

This will generate `picoevb-rdma.ko`. This file must be copied to Jetson.

## Building on an x86 Linux PC, to Run on That PC

```
sudo apt update
sudo apt install build-essential bc
cd /path/to/this/project/kernel-module/
./build-for-pc-native.sh
```

This will generate `picoevb-rdma.ko`.

## Loading the Module

To load the kernel module, execute:
@@ -195,7 +209,7 @@ $ lspci -v

# User-space Applications

## Building on Jetson
## Building on Jetson, to Run on Jetson

The client applications are best built on Jetson itself. Make sure you have the
CUDA development tools installed, and execute:
@@ -207,7 +221,7 @@ cd /path/to/this/project/client-applications/
./build-for-jetson-igpu-native.sh
```

## Building on an x86 Linux PC
## Building on an x86 Linux PC, to Run on Jetson

Building (cross-compiling) the client applications on an x86 Linux PC is only
partially supported; the makefile does not yet support cross-compiling the CUDA
@@ -225,13 +239,25 @@ You may need to adjust the value of variable `CROSS_COMPILE` in script
`./build-for-jetson-igpu-on-pc.sh` to match the configuration of your x86 Linux
PC.
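
For example, on a Debian/Ubuntu host with the aarch64 cross toolchain package
installed, the variable inside the script would typically be set as follows
(an illustrative value; the correct prefix depends on your installed
toolchain):

```
CROSS_COMPILE=aarch64-linux-gnu-
```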

## Building on an x86 Linux PC, to Run on That PC

Make sure you have the CUDA development tools installed, and execute:

```
sudo apt update
sudo apt install build-essential bc
cd /path/to/this/project/client-applications/
./build-for-pc-native.sh
```

## Running the Tests

### Data Access Tests

Two PCIe data access tests are provided; `rdma-malloc` and `rdma-cuda`. Both
tests are structurally identical, but allocate memory using different APIs; the
former using `malloc()`, and the latter via `cudaHostAlloc()`.
former using `malloc()`, and the latter via `cudaHostAlloc()` (Jetson) or
`cudaMalloc()` (PC).

Both tests proceed as follows:

@@ -254,12 +280,13 @@ You can avoid the need to use `sudo` by applying appropriate permissions to the
kernel driver's device file, `/dev/picoevb`.
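
One hypothetical way to apply such permissions persistently is a udev rule
(this sketch is not part of the repository, and assumes the device node is
named `picoevb`); for example, in `/etc/udev/rules.d/99-picoevb.rules`:

```
KERNEL=="picoevb", MODE="0666"
```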

Internally to the kernel driver, the copy operation divides the surface into
64K chunks, and for each chunk first copies that chunk's data from the source
surface to the FPGA's internal memory, then copies the data from the FPGA's
internal memory to the destination surface. This demonstrates both PCIe read
and write access to CUDA GPU memory. The requirement to divide the data into
64K chunks is a limitation of the internal memory size of the PicoEVB board's
FPGA, and likely would not apply in a production device.
64KiB chunks (or smaller, depending on memory alignment), and for each chunk
first copies that chunk's data from the source surface to the FPGA's internal
memory, then copies the data from the FPGA's internal memory to the destination
surface. This demonstrates both PCIe read and write access to CUDA GPU memory.
The requirement to divide the data into chunks is a limitation of the internal
memory size of the PicoEVB board's FPGA, and likely would not apply in a
production device.
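
The chunked copy described above can be sketched as follows. This is an
illustrative model, not the driver's actual code; `CHUNK_SIZE`,
`count_chunks`, `chunked_copy`, and `copy_chunk` are all hypothetical names:

```c
#include <stddef.h>

/* The FPGA's internal memory holds at most 64KiB at a time. */
#define CHUNK_SIZE (64 * 1024)

/* Number of chunk-sized transfers needed for a surface of `len` bytes. */
static size_t count_chunks(size_t len)
{
	return (len + CHUNK_SIZE - 1) / CHUNK_SIZE;
}

/*
 * Walk the surface chunk by chunk; `copy_chunk` stands in for the two DMA
 * steps (source surface -> FPGA internal memory -> destination surface).
 */
static void chunked_copy(size_t len,
			 void (*copy_chunk)(size_t offset, size_t nbytes))
{
	for (size_t offset = 0; offset < len; offset += CHUNK_SIZE) {
		size_t nbytes = len - offset;
		if (nbytes > CHUNK_SIZE)
			nbytes = CHUNK_SIZE;
		copy_chunk(offset, nbytes);
	}
}
```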

### set-leds

4 changes: 4 additions & 0 deletions client-applications/Makefile
@@ -29,6 +29,10 @@ NVCC ?= $(CUDA_TOOLKIT)/bin/nvcc

CFLAGS := \
-ggdb
ifdef NV_BUILD_DGPU
CFLAGS += \
-DNV_BUILD_DGPU
endif

TARGETS := rdma-cuda rdma-malloc set-leds
default: $(TARGETS)
24 changes: 24 additions & 0 deletions client-applications/build-for-pc-native.sh
@@ -0,0 +1,24 @@
#!/bin/sh

# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

export NV_BUILD_DGPU=1
exec make
42 changes: 42 additions & 0 deletions client-applications/rdma-cuda.cu
@@ -72,8 +72,12 @@ int main(int argc, char **argv)
return 1;
}

#ifdef NV_BUILD_DGPU
ce = cudaMalloc(&src_d, SURFACE_SIZE * sizeof(*src_d));
#else
ce = cudaHostAlloc(&src_d, SURFACE_SIZE * sizeof(*src_d),
cudaHostAllocDefault);
#endif
if (ce != cudaSuccess) {
fprintf(stderr, "Allocation of src_d failed: %d\n", ce);
return 1;
@@ -94,8 +98,12 @@ int main(int argc, char **argv)
return 1;
}

#ifdef NV_BUILD_DGPU
ce = cudaMalloc(&dst_d, SURFACE_SIZE * sizeof(*dst_d));
#else
ce = cudaHostAlloc(&dst_d, SURFACE_SIZE * sizeof(*dst_d),
cudaHostAllocDefault);
#endif
if (ce != cudaSuccess) {
fprintf(stderr, "Allocation of dst_d failed: %d\n", ce);
return 1;
@@ -138,7 +146,25 @@ int main(int argc, char **argv)
return 1;
}

/*
 * dGPU on x86 does not allow GPUDirect RDMA on host pinned memory
 * (cudaHostAlloc), so we must allocate device memory (cudaMalloc), and
 * manually copy it to the host for validation.
 */
#ifdef NV_BUILD_DGPU
ce = cudaMallocHost(&dst_cpu, SURFACE_SIZE * sizeof(*dst_cpu), 0);
if (ce != cudaSuccess) {
fprintf(stderr, "cudaMallocHost(dst_cpu) failed\n");
return 1;
}
ce = cudaMemcpy(dst_cpu, dst_d, SURFACE_SIZE * sizeof(*dst_cpu), cudaMemcpyDeviceToHost);
if (ce != cudaSuccess) {
fprintf(stderr, "cudaMemcpy() failed: %d\n", ce);
return 1;
}
#else
dst_cpu = dst_d;
#endif

ret = 0;
for (y = 0; y < SURFACE_H; y++) {
@@ -157,14 +183,26 @@ int main(int argc, char **argv)
if (ret)
return 1;

#ifdef NV_BUILD_DGPU
ce = cudaFreeHost(dst_cpu);
if (ce != cudaSuccess) {
fprintf(stderr, "cudaFreeHost(dst_cpu) failed: %d\n", ce);
return 1;
}
#endif

unpin_params_dst.handle = pin_params_dst.handle;
ret = ioctl(fd, PICOEVB_IOC_UNPIN_CUDA, &unpin_params_dst);
if (ret != 0) {
fprintf(stderr, "ioctl(UNPIN_CUDA dst) failed: %d\n", ret);
return 1;
}

#ifdef NV_BUILD_DGPU
ce = cudaFree(dst_d);
#else
ce = cudaFreeHost(dst_d);
#endif
if (ce != cudaSuccess) {
fprintf(stderr, "Free of dst_d failed: %d\n", ce);
return 1;
@@ -177,7 +215,11 @@ int main(int argc, char **argv)
return 1;
}

#ifdef NV_BUILD_DGPU
ce = cudaFree(src_d);
#else
ce = cudaFreeHost(src_d);
#endif
if (ce != cudaSuccess) {
fprintf(stderr, "Free of src_d failed: %d\n", ce);
return 1;
6 changes: 6 additions & 0 deletions kernel-module/Kbuild
@@ -9,4 +9,10 @@
# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
# more details.

ifdef NV_BUILD_DGPU
KBUILD_CFLAGS += \
-I$(NVIDIA_SRC_DIR) \
-DNV_BUILD_DGPU
endif

obj-m += picoevb-rdma.o
9 changes: 8 additions & 1 deletion kernel-module/Makefile
@@ -9,10 +9,17 @@
# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
# more details.

KDIR ?= "/lib/modules/$(shell uname -r)/build"
KDIR ?= /lib/modules/$(shell uname -r)/build

.PHONY: default
default: modules

%:
$(MAKE) -C "$(KDIR)" "M=$$PWD" "$@"

ifdef NV_BUILD_DGPU
modules: Module.symvers

Module.symvers: $(NVIDIA_KO) nvidia-ko-to-module-symvers
./nvidia-ko-to-module-symvers "$<" "$@"
endif
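
For reference, each line of the generated `Module.symvers` follows the
kernel's four-column format: CRC, symbol name, defining module, and export
type. An illustrative entry (the CRC and module path here are made up):

```
0x1a2b3c4d	nvidia_p2p_get_pages	/lib/modules/4.15.0-47-generic/updates/nvidia.ko	EXPORT_SYMBOL
```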
36 changes: 36 additions & 0 deletions kernel-module/build-for-pc-native.sh
@@ -0,0 +1,36 @@
#!/bin/sh

# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

export NV_BUILD_DGPU=1
export NVIDIA_SRC_DIR="$(find /usr/src/nvidia-* -name nv-p2p.h 2>/dev/null|head -1|xargs dirname 2>/dev/null)"
if [ ! -d "${NVIDIA_SRC_DIR}" ]; then
echo "ERROR: Could not find nv-p2p.h"
exit 1
fi

export NVIDIA_KO="$(find /lib/modules/$(uname -r)/ -name 'nvidia*.ko'|grep -P 'nvidia(_[0-9]+)?.ko'|head -1)"
if [ ! -f "${NVIDIA_KO}" ]; then
echo "ERROR: Could not find nvidia.ko"
exit 1
fi

exec make
24 changes: 24 additions & 0 deletions kernel-module/nvidia-ko-to-module-symvers
@@ -0,0 +1,24 @@
#!/bin/bash

syms=()
syms+=(nvidia_p2p_init_mapping)
syms+=(nvidia_p2p_destroy_mapping)
syms+=(nvidia_p2p_get_pages)
syms+=(nvidia_p2p_put_pages)
syms+=(nvidia_p2p_free_page_table)
syms+=(nvidia_p2p_dma_map_pages)
syms+=(nvidia_p2p_dma_unmap_pages)

nvidia_ko_fn="$1"
symvers_fn="$2"

touch "${symvers_fn}"
for sym in "${syms[@]}"; do
crc="$(objdump -t "${nvidia_ko_fn}" | grep "__crc_${sym}" | awk '{print $1}')"
if [ -z "${crc}" ]; then
echo "Warning: Can't find symbol ${sym} in ${nvidia_ko_fn}; assuming CONFIG_MODVERSIONS=n so setting CRC=0"
crc=00000000
fi
sed -i "/${sym}/d" "${symvers_fn}"
echo "0x${crc} ${sym} ${nvidia_ko_fn} EXPORT_SYMBOL" >> "${symvers_fn}"
done
