This repository contains the source code from the book *CUDA Fortran for Scientists and Engineers: Best Practices for Efficient CUDA Fortran Programming*, Second Edition, arranged by chapter.
SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use the files in this directory except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

- Section 1.3.1: `increment.f90` and `increment.cuf` demonstrate differences between Fortran and CUDA Fortran versions of a simple code (a minimal sketch in this spirit follows the list)
- Section 1.3.2: `multiblock.cuf` demonstrates using multiple thread blocks
- Section 1.3.3: `multidim.cuf` demonstrates how multiple dimensions are accommodated in CUDA Fortran kernels
- Section 1.3.4: `explicitInterface.cuf` demonstrates how explicit interfaces are used when device code is defined outside a `use`-d module
- Section 1.3.5: `managed.cuf` and `managedImplicit.cuf` demonstrate use of managed memory
- Section 1.3.6: `multidimCUF.cuf`, `managedCUF.cuf`, and `managedCUF2.f90` demonstrate use of CUF kernels
- Section 1.4.1: `deviceQuery.cuf` demonstrates how to determine device properties at runtime, and `pciBusID.cuf` demonstrates how to determine the PCI bus for a specified device
- Section 1.5: `errorHandling.cuf`, `syncError.cuf`, and `asynError.cuf` demonstrate different aspects of error handling of device code
- Section 1.7: `version.cuf` demonstrates how to determine the CUDA driver and CUDA Toolkit versions at runtime
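
A minimal CUDA Fortran sketch in the spirit of `increment.cuf` (not the repository's actual source; the module and kernel names here are illustrative):

```fortran
module increment_m
contains
  ! Each thread increments one element of the array.
  attributes(global) subroutine increment(a, b)
    implicit none
    integer :: a(:)
    integer, value :: b
    integer :: i
    i = blockDim%x*(blockIdx%x-1) + threadIdx%x
    if (i <= size(a)) a(i) = a(i) + b
  end subroutine increment
end module increment_m

program main
  use cudafor
  use increment_m
  implicit none
  integer, parameter :: n = 1024*1024
  integer, allocatable :: a(:)
  integer, device, allocatable :: a_d(:)
  integer :: b = 3

  allocate(a(n), a_d(n))
  a = 1
  a_d = a                                       ! host-to-device copy
  call increment<<<(n+255)/256, 256>>>(a_d, b)  ! launch with multiple thread blocks
  a = a_d                                       ! device-to-host copy
  if (all(a == 4)) print *, 'Test passed'
  deallocate(a, a_d)
end program main
```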

- Section 2.1.1: `accuracy.cuf` demonstrates some accuracy issues with summations using a single accumulator
- Section 2.1.2: `fma.cuf` demonstrates how to verify if a fused multiply-add (FMA) is used
- Section 2.2.1: `print.cuf` shows how to print from device code (a minimal sketch follows the list)
- Section 2.2.2: `debug.cuf` is used for debugging with `cuda-gdb`
- Section 2.2.3: `memcheck.cuf` and `initcheck.cuf` demonstrate how `compute-sanitizer` can be used to check for out-of-bounds and initialization errors
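
A minimal sketch of device-side printing in the spirit of `print.cuf` (illustrative names, not the repository's actual source):

```fortran
module print_m
contains
  attributes(global) subroutine printTest()
    implicit none
    ! print statements are supported in device code
    print *, 'block', blockIdx%x, 'thread', threadIdx%x
  end subroutine printTest
end module print_m

program main
  use cudafor
  use print_m
  implicit none
  integer :: istat
  call printTest<<<2,4>>>()
  ! synchronize so device-side output is flushed before the program exits
  istat = cudaDeviceSynchronize()
end program main
```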

- Section 3.1.2: `events.cuf` demonstrates how to use CUDA events to time kernel execution (a minimal sketch follows the list)
- Section 3.1.3: `multidim.cuf` is used to demonstrate profiling with the Nsight Systems command-line interface `nsys`
- Section 3.1.4.1: `nvtxBasic.cuf` demonstrates use of the basic NVTX tooling interfaces
- Section 3.1.4.2: `nvtxAdv.cuf` and `nvtxAdv2.cuf` demonstrate use of the advanced NVTX tooling interfaces
- Section 3.1.4.3: `nvtxAuto.cuf` is used to show how NVTX ranges can be generated automatically without modifying the source code (see the Makefile)
- Section 3.2: `limitingFactor.cuf` is used to show how kernels can be modified to determine performance-limiting factors (instruction vs. memory)
- Section 3.3.1: `peakBandwidth.cuf` uses device management API routines to determine the theoretical peak bandwidth
- Section 3.3.2: `effectiveBandwidth.cuf` uses a simple copy kernel to calculate a representative achievable bandwidth
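
A minimal sketch of timing with CUDA events in the spirit of `events.cuf` (the CUF kernel being timed here is illustrative):

```fortran
program timing
  use cudafor
  implicit none
  integer, parameter :: n = 1024*1024
  real, device :: a_d(n)
  type(cudaEvent) :: startEvent, stopEvent
  real :: time
  integer :: i, istat

  istat = cudaEventCreate(startEvent)
  istat = cudaEventCreate(stopEvent)

  istat = cudaEventRecord(startEvent, 0)   ! record in the default stream
  !$cuf kernel do <<<*,*>>>
  do i = 1, n
     a_d(i) = 2.0*i
  end do
  istat = cudaEventRecord(stopEvent, 0)
  istat = cudaEventSynchronize(stopEvent)  ! wait until the kernel has finished
  istat = cudaEventElapsedTime(time, startEvent, stopEvent)
  print *, 'Kernel time (ms): ', time

  istat = cudaEventDestroy(startEvent)
  istat = cudaEventDestroy(stopEvent)
end program timing
```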

- Section 4.1.2.1: `twoKernels.cuf` demonstrates synchronization characteristics of kernels run in different streams
- Section 4.1.3: `pipeline.cuf` demonstrates overlapping data transfers and kernel execution
- Section 4.1.4.2: `streamSync.cuf` demonstrates use of `cudaStreamSynchronize()`
- Section 4.1.4.3: `eventSync.cuf` demonstrates use of `cudaEventSynchronize()`
- Section 4.1.5.1: `defaultStream.cuf`, `defaultStreamVar.cuf`, and `defaultStreamVarExplicit.cuf` show how to set the default stream used for kernel launches, data transfers, and reduction operations
- Section 4.1.5.2: `differentStreamTypes.cuf` and `concurrentKernels.cuf` demonstrate characteristics of non-blocking streams
- Section 4.2.1: `sharedExample.cuf` demonstrates use of static and dynamic shared memory (a minimal sketch follows the list); `sharedMultiple.cuf` shows how offsets are used when multiple dynamic shared memory arrays are declared as assumed-size arrays
- Section 4.2.2: `syncthreads.cuf` demonstrates use of the `syncthreads_*()` variants
- Section 4.2.3: `ballot.cuf` demonstrates use of the warp ballot functions
- Section 4.2.3.1: `shfl.cuf` demonstrates use of the warp shuffle function `__shfl_xor()`
- Section 4.2.4: `raceAndAtomic.cuf` and `raceAndAtomicShared.cuf` demonstrate how atomic operations can be used to avoid race conditions when modifying global and shared memory
- Section 4.2.5: `threadfence.cuf` demonstrates how `threadfence()` is used to order memory accesses
- Section 4.2.6: `cgReverse.cuf` is a cooperative groups version of the `sharedExample.cuf` code from Section 4.2.1
- Section 4.2.6.1: `smooth.cuf` demonstrates use of grid synchronization via cooperative groups
- Section 4.2.6.2: `swap.cuf` demonstrates how to use distributed shared memory via thread block clusters
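
A minimal static shared memory sketch in the spirit of `sharedExample.cuf` (a single-block array reversal; names are illustrative):

```fortran
module reverse_m
  integer, parameter :: blockSize = 64
contains
  attributes(global) subroutine staticReverse(d)
    implicit none
    real :: d(:)
    real, shared :: s(blockSize)   ! statically sized shared memory
    integer :: t, tr
    t = threadIdx%x
    tr = blockSize - t + 1
    s(t) = d(t)
    call syncthreads()             ! ensure all loads into shared memory are complete
    d(t) = s(tr)
  end subroutine staticReverse
end module reverse_m

program main
  use cudafor
  use reverse_m
  implicit none
  real :: a(blockSize), gold(blockSize)
  real, device :: a_d(blockSize)
  integer :: i

  a = [(real(i), i = 1, blockSize)]
  gold = a(blockSize:1:-1)
  a_d = a
  call staticReverse<<<1, blockSize>>>(a_d)
  a = a_d
  if (all(a == gold)) print *, 'Test passed'
end program main
```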

- Section 5.1.1: `HDtransfer.cuf` shows the performance of data transfers between host and device using pageable and pinned host memory (a minimal sketch of a pinned-memory transfer follows the list); `sliceTransfer.cuf` shows (when profiled with `nsys`) that multiple transfers of array slices can be mapped to a single `cudaMemcpy2D()` call; and `async.cuf` demonstrates pipelining of data transfers and kernel execution in different streams to achieve overlap
- Section 5.2.2.1: `assumedShapeSize.cuf` shows (when compiled with `-gpu=ptxinfo`) how assumed-shape array declaration of kernel arguments results in large register usage relative to assumed-size declarations
- Section 5.2.2.2: `stride.cuf` and `offset.cuf` are used to determine the effective bandwidth of accessing global data with various strides and offsets
- Section 5.2.3: `local.cuf` shows how to check for local memory usage
- Section 5.2.4: `constant.cuf` and `constantAttribute.cuf` demonstrate use and verification of user-allocated constant memory
- Section 5.2.5: `loads.cuf` demonstrates caching behavior of loads from global memory
- Section 5.2.6.1: `maxSharedMemory.cuf` shows how to reserve the maximum amount of shared memory allowable
- Section 5.2.6.2: `transpose.cuf` uses a progressive sequence of kernels to show the benefits of various shared-memory optimization strategies when performing a matrix transpose
- Section 5.2.7: `spill.cuf` demonstrates the use of the `launch_bounds()` attribute
- Section 5.3.1: `parallelism.cuf` demonstrates how the execution configuration and occupancy affect performance
- Section 5.3.2.1: `parallelismPipeline.cuf` demonstrates asynchronous transfers between global and shared memory using the pipeline primitives interface
- Section 5.3.2.2: `cufILP.cuf` demonstrates how to achieve instruction-level parallelism in CUF kernels
- Section 5.4.1.4: `fma.cuf` is used to demonstrate how `-gpu=[no]fma` is used to control the use of fused multiply-add instructions
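
A minimal sketch of a pinned host-memory transfer in the spirit of `HDtransfer.cuf` (array size and bandwidth arithmetic are illustrative):

```fortran
program pinnedTransfer
  use cudafor
  implicit none
  integer, parameter :: n = 4*1024*1024
  real, allocatable, pinned :: a(:)        ! page-locked (pinned) host memory
  real, device, allocatable :: a_d(:)
  type(cudaEvent) :: startEvent, stopEvent
  real :: time
  logical :: pinnedFlag
  integer :: istat

  allocate(a(n), STAT=istat, PINNED=pinnedFlag)
  if (.not. pinnedFlag) print *, 'Warning: allocation was not pinned'
  allocate(a_d(n))
  a = 1.0

  istat = cudaEventCreate(startEvent)
  istat = cudaEventCreate(stopEvent)
  istat = cudaEventRecord(startEvent, 0)
  a_d = a                                   ! host-to-device transfer
  istat = cudaEventRecord(stopEvent, 0)
  istat = cudaEventSynchronize(stopEvent)
  istat = cudaEventElapsedTime(time, startEvent, stopEvent)
  ! 4 bytes per element, time in ms -> GB/s
  print *, 'H2D bandwidth (GB/s): ', 4.0*n/time/1.0e6
end program pinnedTransfer
```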

- Section 6.1: `portingBase.f90` is a host code ported to CUDA using managed memory (`portingManaged.cuf`) and global memory (`portingDevice.cuf`)
- Section 6.2: Conditional inclusion of code using the predefined symbol `_CUDA` (`portingManaged_CUDA.F90`, `portingDevice_CUDA.F90`) and the `!@cuf` sentinel (`portingManagedSent.F90`, `portingDeviceSent.F90`); a minimal sketch follows the list
- Sections 6.3.1-2: Porting of the `laplace2D.f90` code via variable renaming with `use` statements (`laplace2DUse.F90`) and via `associate` blocks (`portingAssociate.f90`, `laplace2DAssoc.f90`)
- Section 6.4: The module `union_m.cuf` contains a C-like union for reducing the global memory footprint of work arrays
- Section 6.5: The modules `compact_m.cuf` and the optimized `compactOpt_m.cuf` contain routines for array compaction
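
A minimal sketch of the conditional-inclusion idioms from Section 6.2 (this is an illustrative fragment, not a repository file; it would need a preprocessed extension such as `.F90` and would be built with and without CUDA Fortran enabled):

```fortran
program portable
  ! Compiles as plain Fortran, and as CUDA Fortran when built with CUDA enabled.
!@cuf use cudafor                 ! sentinel line is compiled only under CUDA Fortran
  implicit none
  integer, parameter :: n = 1024
  real :: a(n)
!@cuf real, device :: a_d(n)
  integer :: i

  a = [(real(i), i = 1, n)]
#ifdef _CUDA
  a_d = a                         ! device path: round-trip through GPU memory
  a = a_d
#endif
  print *, 'sum = ', sum(a)
end program portable
```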

- Section 7.1: `callingC.cuf` shows how to interface CUDA Fortran with CUDA C routines in `c.cu`
- Sections 7.2.1-2: `sgemmLegacy.cuf` and `sgemmNew.cuf` demonstrate how to interface with the cuBLAS library using the legacy and new cuBLAS APIs (a minimal sketch of the legacy approach follows the list)
- Section 7.2.3: `getrfBatched.cuf` shows how to interface with batched cuBLAS routines
- Section 7.2.4: `gemmPerf.cuf` shows how to opt in to using the TF32 format and tensor cores for matrix multiplication
- Section 7.3: `cusparseMV.cuf` and `cusparseMV_ex.cuf` demonstrate use of the cuSPARSE library
- Section 7.4: `potr.cuf` demonstrates use of the cuSOLVER library
- Section 7.5: `matmulTC.cuf` and `matmulTranspose.cuf` demonstrate use of the tensor core library through an overloaded `matmul()` routine as well as through the cuBLAS interfaces, via the `cutensorEx` module
- Section 7.5.1: `cutensorContraction.cuf` illustrates use of the low-level cuTENSOR interfaces
- Section 7.6: `testSort.cuf` uses interfaces to the Thrust C++ template library to sort an array
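
A minimal sketch of the legacy cuBLAS interface in the spirit of `sgemmLegacy.cuf`, assuming the `cublas` module shipped with the NVIDIA HPC SDK:

```fortran
program sgemmExample
  use cudafor
  use cublas                       ! overloaded BLAS interfaces for device arrays
  implicit none
  integer, parameter :: n = 256
  real :: a(n,n), b(n,n), c(n,n)
  real, device :: a_d(n,n), b_d(n,n), c_d(n,n)

  call random_number(a)
  call random_number(b)
  a_d = a; b_d = b; c_d = 0.0

  ! with device-array arguments this call dispatches to cuBLAS
  call sgemm('N', 'N', n, n, n, 1.0, a_d, n, b_d, n, 0.0, c_d, n)

  c = c_d
  print *, 'max error: ', maxval(abs(c - matmul(a, b)))
end program sgemmExample
```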

- Section 8.1: `minimal.cuf` shows how to select, and allocate global memory on, different devices at runtime (a minimal sketch follows the list)
- Section 8.1.1.1: `p2pAccess.cuf` shows how to check for peer-to-peer access between devices
- Section 8.1.2: `directTransfer.cuf` shows how to transfer data between global memory on different devices without staging through host memory, and `p2pBandwidth.cuf` measures the bandwidth of transfers between GPUs
- Section 8.1.3: `transposeP2P.cuf` performs a distributed transpose using P2P transfers
- Section 8.2.1: `mpiDevices.cuf` shows how MPI ranks are mapped to devices based on the compute mode, and `assignDevice.cuf` shows how to ensure each MPI rank maps to a different device regardless of the compute mode setting, through a routine in the `mpiDeviceUtil.cuf` module
- Section 8.2.2: `transposeMPI.cuf` and `transposeCAMPI.cuf` are MPI and CUDA-aware MPI versions of the distributed transpose (similar to the P2P transpose performed in Section 8.1.3)
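
A minimal sketch of runtime device selection and per-device allocation in the spirit of `minimal.cuf`:

```fortran
program multiDevice
  use cudafor
  implicit none
  integer :: nDevices, i, istat
  real, device, allocatable :: a_d(:)
  type(cudaDeviceProp) :: prop

  istat = cudaGetDeviceCount(nDevices)
  do i = 0, nDevices-1
     istat = cudaSetDevice(i)            ! subsequent allocations land on device i
     istat = cudaGetDeviceProperties(prop, i)
     allocate(a_d(1024))
     a_d = real(i)
     print *, 'Device ', i, ' (', trim(prop%name), '): sum = ', sum(a_d)
     deallocate(a_d)
  end do
end program multiDevice
```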

- Section 9.1: `generate_randomnumbers.cuf` demonstrates use of the CURAND library to generate random numbers
- Section 9.2: `compute_pi.cuf` computes pi using the Monte Carlo technique
- Section 9.2.1: `ieee_accuracy.f90` is used to illustrate accuracy issues related to FMA
- Section 9.3: `pi_performance.CUF` measures performance of the pi calculation using shared memory, shuffle, atomic locks, and cooperative group kernels
- Section 9.3.1: `shflExample.cuf` demonstrates use of the warp shuffle instructions (a minimal warp-shuffle reduction sketch follows the list)
- Section 9.3.3: `testPiGridGroup.cuf` shows how to use the `grid_group` cooperative group to perform reductions
- Section 9.4: `accuracy_sum.cuf` demonstrates issues encountered with the accuracy of summations
- Section 9.5: `montecarlo_european_option.cuf` uses Monte Carlo methods to price European options
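
A minimal warp-shuffle reduction sketch in the spirit of `shflExample.cuf` (a single warp summing 32 values; names are illustrative):

```fortran
module shfl_m
contains
  attributes(global) subroutine warpReduce(a, res)
    implicit none
    real :: a(32), res
    real :: v
    integer :: offset
    v = a(threadIdx%x)
    ! butterfly (xor) reduction across the 32 lanes of the warp;
    ! after the last step every lane holds the total
    offset = 16
    do while (offset > 0)
       v = v + __shfl_xor(v, offset)
       offset = offset/2
    end do
    if (threadIdx%x == 1) res = v
  end subroutine warpReduce
end module shfl_m

program main
  use cudafor
  use shfl_m
  implicit none
  real :: a(32), res
  real, device :: a_d(32), res_d
  integer :: i

  a = [(real(i), i = 1, 32)]
  a_d = a
  call warpReduce<<<1,32>>>(a_d, res_d)
  res = res_d
  print *, 'sum = ', res, ' expected = ', sum(a)
end program main
```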

- Section 10.1: `finiteDifference.cuf` calculates numerical derivatives using a nine-point stencil
- Section 10.1.2: `limitingFactor.cuf` uses modified derivative kernels to isolate the limiting factor
- Section 10.1.4: `finiteDifferenceStr.cuf` calculates derivatives on a non-uniform grid
- Section 10.2: `laplace2D.cuf` is a finite difference solution to the 2D Laplace equation (a minimal Jacobi-sweep sketch follows the list)
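
A minimal Jacobi-sweep sketch for the 2D Laplace equation in the spirit of `laplace2D.cuf` (grid size, boundary values, and iteration count are illustrative):

```fortran
program jacobi
  use cudafor
  implicit none
  integer, parameter :: nx = 512, ny = 512, nIter = 100
  real, device :: a_d(nx,ny), anew_d(nx,ny)
  real :: a(nx,ny)
  integer :: i, j, iter

  a_d = 0.0
  a_d(:,ny) = 1.0                        ! fixed boundary value along one edge

  do iter = 1, nIter
     ! one Jacobi sweep over the interior points as a CUF kernel
     !$cuf kernel do(2) <<<*,*>>>
     do j = 2, ny-1
        do i = 2, nx-1
           anew_d(i,j) = 0.25*(a_d(i-1,j) + a_d(i+1,j) &
                             + a_d(i,j-1) + a_d(i,j+1))
        end do
     end do
     a_d(2:nx-1,2:ny-1) = anew_d(2:nx-1,2:ny-1)   ! device-to-device update
  end do

  a = a_d
  print *, 'max interior value after', nIter, 'sweeps:', maxval(a(2:nx-1,2:ny-1))
end program jacobi
```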

- Section 11.1: `fft_test_c2c.cuf` and `fft_test_r2c.cuf` demonstrate use of the CUFFT library (a minimal sketch follows the list)
- Section 11.2: `fft_derivative.cuf` demonstrates use of the CUFFT routines to calculate derivatives
- Section 11.3: `exampleOverlapFFT.cuf` performs a convolution via FFTs
- Section 11.4: `ns2d.cuf` is a vortex simulation using FFTs
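
A minimal sketch of a forward/inverse complex-to-complex transform in the spirit of `fft_test_c2c.cuf`, assuming the `cufft` interface module from the NVIDIA HPC SDK:

```fortran
program fftTest
  use cudafor
  use cufft
  implicit none
  integer, parameter :: n = 1024
  complex, device :: a_d(n), b_d(n)
  complex :: a(n), b(n)
  integer :: plan, istat, i

  a = [(cmplx(sin(6.2831853*i/n), 0.0), i = 1, n)]
  a_d = a

  istat = cufftPlan1d(plan, n, CUFFT_C2C, 1)
  istat = cufftExecC2C(plan, a_d, b_d, CUFFT_FORWARD)   ! forward transform
  istat = cufftExecC2C(plan, b_d, b_d, CUFFT_INVERSE)   ! inverse, in place
  b = b_d
  ! cuFFT transforms are unnormalized: divide by n to recover the input
  print *, 'max error: ', maxval(abs(a - b/n))
  istat = cufftDestroy(plan)
end program fftTest
```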

- Section 12.1: `ppmExample.f90` generates a simple PPM file, the format used for images in this chapter
- Section 12.2: `rgb_m.F90` contains the RGB derived type and overloaded operations
- Section 12.3: `ray.F90` uses the ray derived type in the first ray tracing code
- Section 12.4: `sphere.F90` shows how intersections of rays with a sphere are calculated
- Section 12.5: `normal.F90` calculates surface normals, and `twoSpheres.F90` accommodates multiple objects
- Section 12.6: `antialias.F90` shows how multiple rays per pixel are used in antialiasing
- Section 12.7.1: `diffuse.F90` generates an image of a sphere with a Lambertian or diffuse surface
- Section 12.7.2: `metal.F90` generates an image of metallic and diffuse spheres
- Section 12.7.3: `dielectric.F90` generates an image with glass, metal, and diffuse spheres
- Section 12.8: `camera.F90` implements a positionable camera
- Section 12.9: `defocusBlur.F90` implements focal-length effects
- Section 12.10: `cover.F90` generates a scene with many spheres
- Section 12.11: `triangle.F90` implements triangular objects
- Section 12.12: `lights.f90` implements lighted objects
- Section 12.13: `texture.F90` implements a textured surface