# GPU Computing

<!-- <img src="./imgs/Julia-code-cpu-gpu.png" width=768>

* **host**: CPU + system memory (host memory)
* **device**: GPU with its memory (device memory) -->

### NVIDIA A100 GPU

<img src="./imgs/a100_front.png" width=512px>

**Source:** [NVIDIA whitepaper](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf)

**Streaming Multiprocessor in NVIDIA A100**

<img src="./imgs/a100_SM.png" width=512px>

**Source:** [NVIDIA whitepaper](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf)

| Compute units              | Count            |
|----------------------------|------------------|
| **SMs**                    | 108              |
| **CUDA cores** / FP32 ALUs | 6912 (64 per SM) |
| **Tensor cores**           | 432 (4 per SM)   |

### Topology

<img src="./imgs/gpu_topology.svg" width=500px>

* **host**: CPU + system memory (host memory)
* **device**: GPU with its memory (device memory)
* **SM**: Streaming Multiprocessor

Communication:
* Host-device bandwidth: **31.5 GB/s**
* GPU global memory bandwidth: **1555 GB/s**
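To get a feeling for these two numbers, the following back-of-the-envelope sketch estimates how long it takes to move a buffer over each link. The helper `transfer_time` and the 1 GB buffer size are illustrative assumptions; real transfers also pay latency and rarely reach the peak bandwidth.

```julia
# Peak bandwidths quoted above, in bytes per second
const HOST_DEVICE_BW = 31.5e9   # host <-> device
const GPU_MEMORY_BW  = 1555e9   # GPU global memory

# Idealized transfer time in seconds (ignores latency; hypothetical helper)
transfer_time(nbytes, bandwidth) = nbytes / bandwidth

nbytes = 1e9  # an assumed 1 GB buffer

t_host_device = transfer_time(nbytes, HOST_DEVICE_BW)
t_gpu_memory  = transfer_time(nbytes, GPU_MEMORY_BW)

println("host <-> device: ", round(t_host_device * 1e3; digits=2), " ms")
println("GPU memory:      ", round(t_gpu_memory * 1e3; digits=2), " ms")
println("ratio:           ", round(t_host_device / t_gpu_memory; digits=1))
```

Under these idealized assumptions, moving data between host and device is roughly 50 times slower than accessing it in GPU memory.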

## Quick comparison: CPU vs GPU

### AMD EPYC 7763 vs NVIDIA A100

|               | compute units   | maximum clock frequency [GHz] | FP32 peak performance [TFLOPS] |
|:-------------:|:---------------:|:-----------------------------:|:------------------------------:|
| AMD EPYC 7763 | 64 x86 cores    | 3.50                          | 5.0                            |
| NVIDIA A100   | 6912 CUDA cores | 1.41                          | 19.5 (155.9 for Tensor cores)  |
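
The FP32 peak of the A100 in the table can be reproduced from the core count and the clock frequency. This is a sketch under the assumption that every CUDA core completes one fused multiply-add (counted as 2 FLOPs) per cycle; `peak_tflops` is a hypothetical helper.

```julia
# Theoretical peak = units × clock × FLOPs per unit per cycle
# Assumption: one fused multiply-add (2 FLOPs) per CUDA core per cycle
peak_tflops(units, clock_ghz; flops_per_cycle=2) =
    units * clock_ghz * 1e9 * flops_per_cycle / 1e12

a100_fp32 = peak_tflops(6912, 1.41)
println("A100 FP32 peak ≈ ", round(a100_fp32; digits=1), " TFLOPS")
```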

Many of the [top500](https://top500.org/lists/top500/2023/11/) systems use GPUs as accelerators, and GPU-accelerated systems dominate the top of the list!

<img src="./imgs/cpu_gpu_fraction.svg" width=768>

**Source:** [J. Apostolakis et al., *Detector simulation challenges for future accelerator experiments.*, Frontiers in Physics 10 (2022)](https://doi.org/10.3389/fphy.2022.913510)

### Differences between CPU and GPU

|                   | CPU                               | GPU                                 |
|:-----------------:|:---------------------------------:|:-----------------------------------:|
| optimized for     | latency and per-core performance  | computational throughput            |
| cores             | complex                           | rather simple                       |
| number of threads | O(100)                            | O(1_000_000)                        |
| thread pinning    | a must for good performance       | not needed at all                   |

### Memory-bound scientific computing

The performance of most scientific codes is **memory-bound** (limited by how fast data can be read from and written to memory) rather than **compute-bound** (limited by how fast computations can be done). In a given time interval, GPUs (and CPUs) can perform many more floating-point operations than they can load numbers from memory.

**Peak performance over peak memory bandwidth** (for NVIDIA A100 40GB SXM)

$$
\dfrac{19.5 \ [\textrm{TFLOPS}]}{1.56 \ [\textrm{TB/s}]} \cdot 4 \ \textrm{B} = 50\ \textrm{FLOPs}
$$

To achieve the peak FP32 performance one needs 50 FLOPs per FP32 number (i.e. `Float32`) from memory.

Crucially, the peak memory bandwidth of GPUs is much higher than for CPUs: **~1.56 TB/s** (A100 40GB SXM) vs **~400 GB/s** (2x AMD EPYC 7763).
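
The same ratio tells us how far a simple streaming kernel stays below the compute peak. The sketch below redoes the calculation above and applies it to SAXPY (`y .= a .* x .+ y`), assuming 2 FLOPs and three `Float32` memory accesses (two reads, one write) per element. The function and variable names are illustrative.

```julia
peak_flops = 19.5e12   # FP32 peak of the A100 [FLOP/s]
peak_bw    = 1.56e12   # peak memory bandwidth [B/s]

# FLOPs needed per Float32 (4 bytes) loaded to keep the compute units busy
flops_per_float32 = peak_flops / peak_bw * sizeof(Float32)

# Bandwidth-limited performance of a kernel with a given arithmetic
# intensity (FLOPs per byte moved), capped by the compute peak
attainable(intensity) = min(peak_flops, intensity * peak_bw)

# SAXPY per element (assumed): 2 FLOPs, 3 Float32 accesses = 12 bytes
saxpy_intensity = 2 / (3 * sizeof(Float32))
saxpy_bound = attainable(saxpy_intensity)

println("FLOPs per Float32 for peak: ", round(flops_per_float32; digits=1))
println("SAXPY upper bound: ", round(saxpy_bound / 1e12; digits=2), " TFLOPS")
```

Under these assumptions SAXPY can reach only a small fraction (about 1%) of the FP32 peak, no matter how well the kernel is written.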

(→ exercise **saxpy_gpu** and **daxpy_cpu**)

## Julia + GPU (NVIDIA)

Website: https://juliagpu.org/

We'll focus on **NVIDIA GPUs** but there is [support for other GPUs](https://juliagpu.org/) (AMD, Intel, etc.) as well.

The interface to NVIDIA GPU computing is the [CUDA language extension](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html). In Julia there is [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl).

It leverages LLVM, specifically parts of the Julia compiler as well as [GPUCompiler.jl](https://github.com/JuliaGPU/GPUCompiler.jl), to compile **native GPU code**. (compare to `nvcc`)

It provides:

* **High-level abstraction `CuArray`**
* **Tools for writing CUDA kernels**
* **Wrappers to proprietary NVIDIA libraries (e.g. cuBLAS, cuFFT, cuSOLVER, cuSPARSE)**

### Getting CUDA

By default, **it's automatic**. The CUDA toolkit is installed automatically when **using** CUDA.jl for the first time. The only requirement is a working NVIDIA driver.

**Note:** You can readily add CUDA.jl to a Julia environment on a machine without GPUs, say, a login node. See [Precompiling CUDA.jl without CUDA](https://cuda.juliagpu.org/stable/installation/overview/#Precompiling-CUDA.jl-without-CUDA) for more information.

#### System CUDA

You can opt out of the automatic installation and use a CUDA toolkit provided by the system instead by setting a Julia preference, e.g.

```julia
CUDA.set_runtime_version!(v"12.2"; local_toolkit=true)
```

In [1]:
using CUDA

In [2]:
CUDA.versioninfo()

CUDA runtime 12.2, artifact installation
CUDA driver 12.2
NVIDIA driver 535.154.5

CUDA libraries: 
- CUBLAS: 12.2.5
- CURAND: 10.3.3
- CUFFT: 11.0.8
- CUSOLVER: 11.5.2
- CUSPARSE: 12.1.2
- CUPTI: 20.0.0
- NVML: 12.0.0+535.154.5

Julia packages: 
- CUDA: 5.1.2
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.10.1+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

Preferences:
- CUDA_Runtime_jll.version: 12.2

1 device:
  0: NVIDIA GeForce RTX 3070 Ti Laptop GPU (sm_86, 7.372 GiB / 8.000 GiB available)


In [3]:
CUDA.functional() # if this works, you're good to go 👍

true

In [4]:
device() # the currently selected GPU

CuDevice(0): NVIDIA GeForce RTX 3070 Ti Laptop GPU

## Array programming: `CuArray`

The simplest way to use a GPU is via **vectorized array operations** (e.g. broadcasting). Each of these operations will be backed by one or more GPU kernels, either natively written in Julia or from some application library. As long as your data is large enough you should be able to get nice speedups in many cases.

You use the `CuArray` type, which serves a dual purpose:

* a managed container for GPU memory
* a way to dispatch to operations that execute on the GPU

A `CuArray` is a **CPU object representing GPU memory**.

In [5]:
x_gpu = CuArray{Float32}(undef, 4)

4-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.0
 0.0
 0.0
 0.0

In [6]:
CUDA.rand(4) # Note: defaults to Float32

4-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.7391175
 0.84908134
 0.046983227
 0.64861226

In [7]:
CUDA.zeros(4)

4-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.0
 0.0
 0.0
 0.0

We can readily move data to the GPU by converting to `CuArray`.

<img src="./imgs/cpu_gpu_transfer.svg" width=180px>

In [8]:
x_cpu = [1,2,3,4]
x_gpu = CuArray(x_cpu) 

4-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
 1
 2
 3
 4

(or by using `copyto!` or `copy!` to move it into already allocated memory)

For better performance the data movement between CPU and GPU should be minimized as much as possible.

### Array computations on GPU

In [9]:
CuArray <: AbstractArray

true

Therefore, we should be able to do all kinds of operations with it that we would also do with regular `Array`s (**duck typing**).
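
A minimal sketch of what duck typing buys us: the hypothetical function `scaled_distance` below is written only in terms of whole-array operations and never mentions `CuArray`. It is run here on regular `Array`s so that it works without a GPU; passing `CuArray`s instead should execute the same code on the device.

```julia
# Generic code: only broadcasting, no element-wise indexing
function scaled_distance(A::AbstractArray, B::AbstractArray, α)
    return sqrt.(α .* (A .- B) .^ 2)
end

A = rand(Float32, 4, 4)
B = rand(Float32, 4, 4)
D = scaled_distance(A, B, 2f0)   # runs on the CPU

# With CUDA.jl loaded, the same call works on GPU arrays:
#   D_gpu = scaled_distance(CuArray(A), CuArray(B), 2f0)
```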

#### Example: Matrix multiplication

In [10]:
N = 2048
A_gpu = CUDA.rand(N,N);
B_gpu = CUDA.rand(N,N);

In [11]:
CUDA.@sync A_gpu * B_gpu # we need CUDA.@sync because GPU operations are typically asynchronous

2048×2048 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 501.325  520.195  512.339  508.041  …  506.885  515.095  506.574  511.452
 490.967  495.588  507.527  491.685     497.625  506.091  488.61   502.771
 502.253  510.948  513.907  501.452     505.972  513.187  503.681  513.226
 496.215  510.388  509.005  506.202     507.746  509.221  497.762  507.773
 499.031  506.784  514.251  507.418     511.988  511.682  493.963  507.552
 502.38   514.254  529.072  511.22   …  517.469  522.785  510.055  521.209
 493.301  506.185  512.416  503.791     502.829  505.601  491.318  505.199
 505.15   517.611  509.744  517.33      516.554  524.496  509.856  516.612
 504.27   519.443  523.288  508.838     513.612  517.515  511.861  513.502
 502.099  502.83   511.259  499.184     506.761  511.678  500.742  509.539
   ⋮                                 ⋱             ⋮               
 522.559  518.192  530.008  512.315     515.162  522.757  505.014  513.793
 508.399  510.44   519.091  512.039  …  511.234  518.

In [12]:
using BenchmarkTools

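# CPU reference: no CUDA.@sync needed, the CPU multiplication is synchronous
@btime A_cpu * B_cpu setup=(A_cpu = rand(Float32, N,N); B_cpu = rand(Float32, N,N););
# GPU: CUDA.@sync makes sure we time the computation, not just the launch
@btime CUDA.@sync(A_gpu * B_gpu) setup=(A_gpu = CUDA.rand(N,N); B_gpu = CUDA.rand(N,N););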

  42.645 ms (2 allocations: 16.00 MiB)


  1.774 ms (46 allocations: 1.12 KiB)


(Note: `*` for `CuArray`s uses a cuBLAS kernel under the hood)

#### More examples: Broadcasting, `map`, `reduce`, etc.

In [13]:
CUDA.@sync A_gpu .+ B_gpu; # runs on the GPU!

In [14]:
CUDA.@sync sqrt.(A_gpu.^2 + B_gpu.^2); # runs on the GPU!

In [15]:
CUDA.@sync mapreduce(sin, +, A_gpu); # runs on the GPU!

**The power of simple GPU array programming should not be underestimated!** Entire codes (like deep learning frameworks) can be ported to the GPU without ever writing a single CUDA kernel manually.

Of course, it isn't always that easy, and performance can often be improved by writing custom kernels. (→ exercise **heat_diffusion**)

#### "Counter-example:" Scalar indexing

In [16]:
A_gpu[1]

ErrorException: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.

In [17]:
CUDA.allowscalar(false)

In [18]:
A_gpu[1]

ErrorException: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.

**Note**:

* Access to individual elements of an array (scalar indexing) must not be used in GPU array programming.
* You must express arithmetic operations in terms of whole arrays and treat the `CuArray` as a single entity, e.g.

```julia
function gpu_broadcasting!(C, A, B)
    CUDA.@sync C .= A .* B
end
```
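
To make the contrast concrete, here is the same operation written once with scalar indexing and once as a whole-array expression. Both are checked on CPU `Array`s so that the snippet runs anywhere; the function names are illustrative. Only the second form carries over to `CuArray`s, where the loop version would trigger the scalar indexing error shown above.

```julia
# Scalar indexing: fine for Array, disallowed (or very slow) for CuArray
function multiply_loop!(C, A, B)
    for i in eachindex(C, A, B)
        C[i] = A[i] * B[i]
    end
    return C
end

# Whole-array formulation: a single fused broadcast
function multiply_broadcast!(C, A, B)
    C .= A .* B
    return C
end

A = rand(Float32, 1000)
B = rand(Float32, 1000)
C_loop  = multiply_loop!(similar(A), A, B)
C_bcast = multiply_broadcast!(similar(A), A, B)
```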

#### A few words on memory management

`CuArray`s are managed by Julia's **garbage collector**. If they are unreachable, they will get cleaned up automatically during a GC run. However, keep in mind that the (CPU-focused) GC isn't good at sensing GPU memory pressure.

In [19]:
CUDA.memory_status()

Effective GPU memory usage: 99.13% (7.719 GiB/7.787 GiB)
Memory pool usage: 64.000 MiB (7.188 GiB reserved)


In [20]:
x_gpu = CUDA.rand(10_000_000);

In [21]:
sizeof(x_gpu) |> Base.format_bytes

"38.147 MiB"

In [22]:
CUDA.memory_status()

Effective GPU memory usage: 99.22% (7.726 GiB/7.787 GiB)
Memory pool usage: 102.147 MiB (7.188 GiB reserved)


In [23]:
x_gpu = nothing; GC.gc(true)

In [24]:
CUDA.memory_status()

Effective GPU memory usage: 99.48% (7.747 GiB/7.787 GiB)
Memory pool usage: 70.147 MiB (7.188 GiB reserved)


By default CUDA.jl uses a **memory pool** to speed up future allocations. So it might appear as if the objects have not been freed. (On Noctua 2 we have disabled it. You can disable the memory pool with `JULIA_CUDA_MEMORY_POOL=none`.)

To release memory more eagerly, we can use `CUDA.unsafe_free!(x_gpu)` and `CUDA.reclaim()`.

In [25]:
x_gpu = CUDA.rand(10_000_000);

In [26]:
CUDA.memory_status()

Effective GPU memory usage: 99.37% (7.738 GiB/7.787 GiB)
Memory pool usage: 108.294 MiB (7.188 GiB reserved)


In [27]:
CUDA.unsafe_free!(x_gpu)

In [28]:
CUDA.memory_status()

Effective GPU memory usage: 99.63% (7.758 GiB/7.787 GiB)
Memory pool usage: 70.147 MiB (7.188 GiB reserved)


Of course, one must be careful with `CUDA.unsafe_free!` because the handle `x_gpu` still exists and now points to freed memory. But it is fine and very useful in a pattern like this:

```julia
function myfunction(x::CuArray)
    tmp_memory = similar(x)
    expensive_operation!(x, tmp_memory)
    CUDA.unsafe_free!(tmp_memory)
    return x
end
```

In [29]:
x_gpu = nothing # to be safe :)

## Kernel programming: Writing CUDA kernels

High-level array programming is well suited to computations that can be expressed as whole-array operations. For more general algorithms it may be less efficient, or not applicable at all.

**CUDA kernel**: a function that will be executed by all *GPU threads* in parallel.

Based on its index, each thread can be made to operate on a different piece of the given data.

(It might be helpful to think of the CUDA kernel as being the body of a loop.)

In [30]:
function cuda_kernel!(x)
    i = threadIdx().x # the thread index ("loop index")
    x[i] += 1
    return nothing # CUDA kernels should never return anything
end

cuda_kernel! (generic function with 1 method)
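
The loop analogy can be spelled out with a CPU counterpart of `cuda_kernel!` (a sketch; `cpu_loop!` is a hypothetical name). The loop body is what each GPU thread executes, and the loop index plays the role of `threadIdx().x`.

```julia
function cpu_loop!(x)
    for i in eachindex(x)   # GPU: no loop, one thread per value of i
        x[i] += 1           # identical to the kernel body
    end
    return nothing
end

x_host = zeros(Float32, 1024)
cpu_loop!(x_host)
```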

One can launch the kernel on the GPU with the `@cuda` macro (non-blocking, asynchronous):

In [31]:
x = CUDA.zeros(1024)

CUDA.@sync @cuda threads=length(x) cuda_kernel!(x)

CUDA.HostKernel for cuda_kernel!(CuDeviceVector{Float32, 1})

In [32]:
x

1024-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0

As you can imagine, kernel programming can become more difficult, especially if you care about performance. A few reasons:

* you need to respect **hardware limitations** of the GPU, e.g. maximum number of GPU threads per block
* **not all operations can readily be expressed as scalar kernels**, e.g. reduction
* kernels execute on the GPU where the **Julia runtime isn't available**

In particular due to the last point, kernel code has limitations:
  * no GC
  * no `print` etc. (→ `@cuprint`)
  * code must be fully type inferred (no dynamic dispatch allowed)
  * no `try ... catch ... end`
  * ...

**You can't just write arbitrary Julia code in kernels.** Fortunately though, many things just work and can get you far (see e.g. exercises).

#### Example: Hardware limitation

In [33]:
x = CUDA.zeros(1025) # one more element than before

CUDA.@sync @cuda threads=length(x) cuda_kernel!(x)

CuError: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)

In [34]:
CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)

1024

### CUDA programming model

<!-- <img src="./imgs/CUDA_programming_model.png" width=1024> -->
<img src="imgs/cuda_prog_model.svg" width=1024>

Conceptual mapping:

* **Grid** of blocks → entire GPU
* **Blocks** of threads → SMs
* **Threads** → CUDA cores

**Note**: up to three dimensions, $(x, y, z)$, can be used to organize the thread blocks and threads in each block.

In [35]:
function cuda_kernel_blocks!(x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x # global thread index
    if i <= length(x) # make sure that we're inbounds (c.f. "loop" iteration range)
        @inbounds x[i] += 1
    end
    return nothing
end

cuda_kernel_blocks! (generic function with 1 method)
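
The index arithmetic can be checked on the CPU by looping over all blocks and threads of a launch. This is only an emulation (the GPU runs the threads in parallel and in no particular order), and `emulate_launch!` is a hypothetical helper in which `block` and `thread` stand in for `blockIdx().x` and `threadIdx().x`, and `threads` for `blockDim().x`.

```julia
function emulate_launch!(x; threads, blocks)
    idle = 0   # counts threads that fail the bounds check
    for block in 1:blocks, thread in 1:threads
        i = (block - 1) * threads + thread   # global thread index
        if i <= length(x)
            x[i] += 1
        else
            idle += 1
        end
    end
    return idle
end

x_host = zeros(Float32, 1025)
idle = emulate_launch!(x_host; threads=1024, blocks=2)
println("idle threads: ", idle)
```

Every element is incremented exactly once, and the second block consists almost entirely of threads that fail the bounds check.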

In [36]:
x = CUDA.zeros(1025);

**execution configuration** for a CUDA kernel:
* `threads`: number of threads in each block
* `blocks`: number of blocks in the grid

In [37]:
CUDA.@sync @cuda threads=1024 blocks=2 cuda_kernel_blocks!(x)

CUDA.HostKernel for cuda_kernel_blocks!(CuDeviceVector{Float32, 1})

In [38]:
x

1025-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0

#### How does our CUDA kernel compare to broadcasting?

In [39]:
x = CUDA.rand(1024*1024);

function add_one_broadcasting(x)
    CUDA.@sync x .+ 1
end

# same CUDA kernel, but with different execution configurations
function add_one_kernel_1024_1k(x)
    CUDA.@sync @cuda threads=1024 blocks=1024 cuda_kernel_blocks!(x)
end

function add_one_kernel_256_4k(x)
    CUDA.@sync @cuda threads=256 blocks=4*1024 cuda_kernel_blocks!(x)
end

function add_one_kernel_64_16k(x)
    CUDA.@sync @cuda threads=64 blocks=16*1024 cuda_kernel_blocks!(x)
end

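# Note: `setup` allocates only 1024 elements, while the kernels below launch 1024*1024 threads;
# all threads beyond the first 1024 fail the bounds check and do no work.
@btime add_one_broadcasting(x) setup=(x = CUDA.zeros(1024););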
@btime add_one_kernel_1024_1k(x) setup=(x = CUDA.zeros(1024););
@btime add_one_kernel_256_4k(x) setup=(x = CUDA.zeros(1024););
@btime add_one_kernel_64_16k(x) setup=(x = CUDA.zeros(1024););

  7.351 μs (28 allocations: 1.39 KiB)


  9.450 μs (2 allocations: 112 bytes)


  8.040 μs (2 allocations: 112 bytes)


  16.594 μs (2 allocations: 112 bytes)


The performance of a CUDA kernel is affected by the execution configuration (because of the runtime scheduling of thread blocks and threads).

How to obtain a good execution configuration for a CUDA kernel?

### Simplifying kernel launches: [Occupancy API](https://developer.nvidia.com/blog/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/)

Hardcoding the execution configuration is not a good idea. A few reasons:

* The maximum number of threads per block that can actually be used depends on kernel details, such as how many resources (e.g. registers) the kernel needs.
* You might want to support different GPUs with different hardware limitations.

**Occupancy** measures the ratio of the number of active *warps* per SM to the maximum number of possible warps per SM.

*warp*: a group of 32 parallel threads executing the same instructions.

* Low occupancy usually implies low performance (because of underutilized hardware resources).
* High occupancy, however, may not guarantee the best performance.
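
As a worked example of the definition, the sketch below computes the theoretical occupancy for a given block size. The limit of 48 resident warps per SM is an assumed value (check the limits of your GPU), and the number of blocks resident on an SM is treated as an input, although in reality it depends on the register and shared memory usage of the kernel. `occupancy` is a hypothetical helper.

```julia
const WARP_SIZE = 32

# Theoretical occupancy: resident warps per SM / maximum warps per SM
function occupancy(threads_per_block, blocks_per_sm; max_warps_per_sm)
    warps_per_block = cld(threads_per_block, WARP_SIZE)
    active_warps = min(warps_per_block * blocks_per_sm, max_warps_per_sm)
    return active_warps / max_warps_per_sm
end

# Assumed hardware limit: 48 resident warps per SM
occ_full = occupancy(768, 2; max_warps_per_sm=48)   # 24 warps × 2 blocks
occ_half = occupancy(768, 1; max_warps_per_sm=48)   # only one block resident
```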

The occupancy API is an automatic tool that can be used to obtain *reasonably good* execution configurations.

In [40]:
kernel = @cuda launch=false cuda_kernel_blocks!(x) # don't launch the kernel

CUDA.HostKernel for cuda_kernel_blocks!(CuDeviceVector{Float32, 1})

In [41]:
config = CUDA.launch_configuration(kernel.fun)

(blocks = 92, threads = 768)

Here, the number `blocks` indicates how many blocks we would need to fully occupy the GPU. For a given input `x`, we might need fewer or more blocks.

In [42]:
threads = min(length(x), config.threads)

768

In [43]:
blocks = cld(length(x), threads)

1366
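
These two steps can be wrapped into a small helper. The sketch below only needs the suggested maximum number of threads per block (e.g. `config.threads`), so it can be tried without a GPU; `launch_dims` is a hypothetical name.

```julia
# Threads per block and number of blocks needed to cover n elements
function launch_dims(n, max_threads)
    threads = min(n, max_threads)
    blocks = cld(n, threads)   # ceiling division: the last block may be only partially used
    return (; threads, blocks)
end

launch_dims(1024 * 1024, 768)   # the case above
launch_dims(1025, 1024)         # needs a second, nearly empty block
launch_dims(100, 768)           # small input: a single, smaller block
```

Because of the ceiling division the launch may contain more threads than elements, which is why the kernel needs its bounds check.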

Launching the kernel with the dynamic launch parameters:

In [44]:
kernel(x; threads, blocks) # calling `kernel` like a regular function

In [45]:
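# Note: without CUDA.@sync this times only the asynchronous kernel launch, not the kernel execution
@btime kernel(x; threads=$threads, blocks=$blocks) setup=(x = CUDA.zeros(1024););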

  862.500 ns (2 allocations: 112 bytes)


### Introspection

Similar to the `@code_*` macros for CPU code, there are `@device_code_*` macros for GPU code. Note, however, that the GPU counterpart of `@code_native` is `@device_code_ptx`.

**PTX**: a low-level **p**arallel **t**hread e**x**ecution virtual machine and instruction set architecture used in NVIDIA CUDA programming.

In [46]:
@device_code_warntype @cuda threads=1024 blocks=1024 cuda_kernel_blocks!(x)

PTX CompilerJob of MethodInstance for cuda_kernel_blocks!(::CuDeviceVector{Float32, 1}) for sm_86

MethodInstance for cuda_kernel_blocks!(::CuDeviceVector{Float32, 1})
  from cuda_kernel_blocks!(x) @ Main ~/JuliaUCL24/notebooks/Day3/3_gpu_computing.ipynb:1
Arguments
  #self#::Core.Const(cuda_kernel_blocks!)
  x::CuDeviceVector{Float32, 1}
Locals
  val::Float32
  i::Int64
Body::Nothing
1 ─       Core.NewvarNode(:(val))
│   %2  = Main.blockIdx()::@NamedTuple{x::Int32, y::Int32, z::Int32}
│   %3  = Base.getproperty(%2, :x)::Int32
│   %4  = (%3 - 1)::Int64
│   %5  = Main.blockDim()::@NamedTuple{x::Int32, y::Int32, z::Int32}
│   %6  = Base.getproperty(%5, :x)::Int32
│   %7  = (%4 * %6)::Int64
│   %8  = Main.threadIdx()::@NamedTuple{x::Int32, y::Int32, z::Int32}
│   %9  = Base.getproperty(%8, :x)::Int32
│         (i = %7 + %9)
│   %11 = i::Int64
│   %12 = Main.length(x)::Int64
│   %13 = (%11 <= %12)::Bool
└──       goto #3 if not %13
2 ─       nothing
│   %16 = Base.getindex(x, i)::Float32
│   %17 = (%16 + 1)::Float32
│         Base.setindex!(x, %17, i)
│         (val = %17)
│         nothing
└──       val
3 ┄       return Main.nothing



In [47]:
@device_code_llvm debuginfo=:none @cuda threads=1024 blocks=1024 cuda_kernel_blocks!(x)

; PTX CompilerJob of MethodInstance for cuda_kernel_blocks!(::CuDeviceVector{Float32, 1}) for sm_86
define ptx_kernel void @_Z19cuda_kernel_blocks_13CuDeviceArrayI7Float32Li1ELi1EE({ i64, i32 } %state, { i8 addrspace(1)*, i64, [1 x i64], i64 } %0) local_unnamed_addr {
conversion:
  %.fca.3.extract = extractvalue { i8 addrspace(1)*, i64, [1 x i64], i64 } %0, 3
  %1 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %2 = zext i32 %1 to i64
  %3 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %4 = zext i32 %3 to i64

In [48]:
@device_code_ptx @cuda threads=1024 blocks=1024 cuda_kernel_blocks!(x)

// PTX CompilerJob of MethodInstance for cuda_kernel_blocks!(::CuDeviceVector{Float32, 1}) for sm_86

//
// Generated by LLVM NVPTX Back-End
//

.version 8.2
.target sm_86
.address_size 64

	// .globl	_Z19cuda_kernel_blocks_13CuDeviceArrayI7Float32Li1ELi1EE // -- Begin function _Z19cuda_kernel_blocks_13CuDeviceArrayI7Float32Li1ELi1EE
                                        // @_Z19cuda_kernel_blocks_13CuDeviceArrayI7Float32Li1ELi1EE
.visible .entry _Z19cuda_kernel_blocks_13CuDeviceArrayI7Float32Li1ELi1EE(
	.param .align 8 .b8 _Z19cuda_kernel_blocks_13CuDeviceArrayI7Float32Li1ELi1EE_param_0[16],
	.param .align 8 .b8 _Z19cuda_kernel_blocks_13Cu

## Multi-GPUs in one compute node

**(Note: You can't run this part because you only have a single device in this session.)**

There are **4x NVIDIA A100 GPUs** in a regular Noctua 2 GPU node:

<!-- <img src="./imgs/Noctua2_GPU_node.png" width=320px> -->
<img src="imgs/Noctua2_GPU_node.svg" width=320px>

Since each Julia task gets its own local CUDA execution environment, it is easy to use multiple GPUs in one compute node in parallel by launching multiple Julia tasks.

In [49]:
using Base.Threads

In the following we will use `Threads.@spawn` (since multi-threading support is a rather recent addition to CUDA.jl).

In [50]:
function gpu_computation(A, B, C)
    for i in 1:512
        C = A * B
        A = B * C
        B = C * A
    end
    sin.(B)
    return B
end

function multi_gpu()
    n = 4096
    @sync begin
        # Julia task for the 1st GPU
        @spawn begin
            device!(0) # first GPU
            A = CUDA.rand(n, n)
            B = CUDA.rand(n, n)
            C = CUDA.zeros(n, n)
            println("GPU 1: running gpu_computation")
            CUDA.@sync gpu_computation(A,B,C)
            println("GPU 1: done")
        end
        # Julia task for the 2nd GPU
        @spawn begin
            device!(1) # second GPU
            A = CUDA.rand(n, n)
            B = CUDA.rand(n, n)
            C = CUDA.zeros(n, n)
            println("GPU 2: running gpu_computation")
            CUDA.@sync gpu_computation(A,B,C)
            println("GPU 2: done")
        end
    end
    return nothing
end

multi_gpu (generic function with 1 method)

In [51]:
multi_gpu()

GPU 1: running gpu_computation


GPU 1: done


CompositeException: TaskFailedException

    nested task error: CUDA error: invalid device ordinal (code 101, ERROR_INVALID_DEVICE)
    Stacktrace:
     [1] throw_api_error(res::CUDA.cudaError_enum)
       @ CUDA ~/.julia/packages/CUDA/rXson/lib/cudadrv/libcuda.jl:27
     [2] check
       @ ~/.julia/packages/CUDA/rXson/lib/cudadrv/libcuda.jl:34 [inlined]
     [3] cuDeviceGet
       @ ~/.julia/packages/CUDA/rXson/lib/utils/call.jl:26 [inlined]
     [4] CuDevice
       @ ~/.julia/packages/CUDA/rXson/lib/cudadrv/devices.jl:17 [inlined]
     [5] device!(dev::CuDevice, flags::Nothing) (repeats 2 times)
       @ CUDA ~/.julia/packages/CUDA/rXson/lib/cudadrv/state.jl:336 [inlined]
     [6] (::var"#12#14"{Int64})()
       @ Main ~/JuliaUCL24/notebooks/Day3/3_gpu_computing.ipynb:26

In [52]:
# call CUDA.memory_status() for all GPUs
for dev in devices()
    device!(dev)
    println()
    CUDA.memory_status()
end
device!(0);


Effective GPU memory usage: 99.07% (7.715 GiB/7.787 GiB)
Memory pool usage: 1.848 GiB (7.188 GiB reserved)


## Benchmarking and Profiling

In [53]:
device!(0)

CuDevice(0): NVIDIA GeForce RTX 3070 Ti Laptop GPU

In [54]:
A = CUDA.rand(1024, 1024)
B = CUDA.rand(1024, 1024)

@btime CUDA.@sync A .* B;

  43.354 μs (29 allocations: 1.86 KiB)


Note that "allocations" here means CPU allocations. For GPU allocations you can e.g. use `CUDA.@time`.

In [55]:
CUDA.@time A .* B;

  0.000298 seconds (32 CPU allocations: 1.906 KiB) (1 GPU allocation: 4.000 MiB, 4.73% memmgmt time)


### Integrated profiler: `CUDA.@profile`

In [56]:
CUDA.@profile A .* B

Profiler ran for 218.87 µs, capturing 11 events.

Host-side activity: calling CUDA APIs took 47.68 µs (21.79% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                    │
├──────────┼────────────┼───────┼─────────────────────────┤
│   11.22% │   24.56 µs │     1 │ cuLaunchKernel          │
│    5.45% │   11.92 µs │     1 │ cuMemAllocFromPoolAsync │
└──────────┴────────────┴───────┴─────────────────────────┘

Device-side activity: GPU was busy for 39.1 µs (17.86% of the trace)
┌──────────┬────────────┬───────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name                                                       

In [57]:
CUDA.@profile trace=true A .* B

Profiler ran for 287.77 µs, capturing 11 events.

Host-side activity: calling CUDA APIs took 65.57 µs (22.78% of the trace)
┌────┬──────────┬──────────┬─────────────────────────┬──────────────────────────┐
│ ID │    Start │     Time │ Name                    │                  Details │
├────┼──────────┼──────────┼─────────────────────────┼──────────────────────────┤
│  5 │ 25.03 µs │ 37.91 µs │ cuMemAllocFromPoolAsync │ 4.000 MiB, device memory │
│  9 │ 95.13 µs │  24.8 µs │ cuLaunchKernel          │                        - │
└────┴──────────┴──────────┴─────────────────────────┴──────────────────────────┘

Device-side activity: GPU was busy for 40.05 µs (13.92% of the trace)
┌────┬───────────┬──────────┬─────────┬────────┬──────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

### External profiler: [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems) (optional)

See https://cuda.juliagpu.org/stable/development/profiling/#External-profilers

**Command**: `CUDA.@profile external=true`

Use [**NVTX.jl**](https://github.com/JuliaGPU/NVTX.jl) to annotate (i.e. label and colorize) code blocks.

<img src="./imgs/nsight_systems.png" width=800px>

## Case study: Three ways to SAXPY on the GPU

**SAXPY** = **S**ingle precision **A** times **X** **P**lus **Y**

→ exercise **saxpy_gpu**
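
As a reference point, SAXPY can be written as a single fused broadcast. The sketch below runs on CPU `Array`s so that it works without a GPU; `saxpy!` is an illustrative name and not necessarily what the exercise uses. The same function should accept `CuArray`s unchanged.

```julia
# y ← a * x + y, in place, as one fused broadcast
function saxpy!(a, x, y)
    y .= a .* x .+ y
    return y
end

a = 2.0f0
x = fill(1.0f0, 8)
y = fill(3.0f0, 8)
saxpy!(a, x, y)
```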

<img src="./imgs/a100_saxpy_results.png" width=1024px>

## Core messages of this notebook

* GPU architecture is optimal for **high throughput computations using millions of threads**.
* For good performance the communication between host and GPU device(s) must be minimized.
* CUDA.jl makes it possible to program NVIDIA GPUs in Julia
    - high-level array abstraction with `CuArray`
        * convenient for some types of computation
        * reasonably good performance is achievable
    - writing custom CUDA kernels
        * flexible for general computations
        * **CUDA programming model**
            - **Grid of blocks** → entire GPU
            - **Blocks of threads** → SMs
            - **Threads** → CUDA cores
        * the performance is influenced by **the execution configuration**, e.g. number of thread blocks and number of threads per block
        * use the occupancy API to obtain a reasonably good execution configuration for a CUDA kernel
    - using the vendor GPU libraries, e.g. cuBLAS, if possible
* Multiple GPUs in one compute node can be used in parallel with multiple Julia tasks.