![img](https://github.com/JuliaLang/julia/raw/master/doc/src/assets/logo.svg)![img](https://avatars.githubusercontent.com/u/7346142?s=200&v=4)

# Installation and Setup

JuliaGPU packages are easy to install: just run `Pkg.add("CUDA")` to install the CUDA.jl package, which provides Julia bindings to NVIDIA's CUDA. CUDA.jl provides all of the compiler and runtime logic needed to program NVIDIA GPUs. The only thing you need to provide is a functional NVIDIA driver (which most HPC systems already have installed and configured); you don't need to install the CUDA toolkit yourself! CUDA.jl downloads one if it's not already available on your system (again, HPC systems usually provide this for you):

In [1]:
# This can take a little while to download and compile, so just be patient
import Pkg
Pkg.add("CUDA")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `/global/u2/t/train921/julia-hpc-tutorial-sc24-main/Project.toml`
[32m[1m  No Changes[22m[39m to `/global/u2/t/train921/julia-hpc-tutorial-sc24-main/Manifest.toml`


We're going to be running some small benchmarks in this notebook, so let's also grab Julia's BenchmarkTools.jl while we're at it:

In [2]:
import Pkg
Pkg.add("BenchmarkTools")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `/global/u2/t/train921/julia-hpc-tutorial-sc24-main/Project.toml`
[32m[1m  No Changes[22m[39m to `/global/u2/t/train921/julia-hpc-tutorial-sc24-main/Manifest.toml`


And now we import these packages, along with the built-in LinearAlgebra standard library:

In [3]:
using CUDA
using BenchmarkTools

using LinearAlgebra

Now, GPU vendor libraries can be difficult to configure, so CUDA.jl provides a convenient way to check that everything is set up correctly: the `CUDA.versioninfo()` function. Like Julia's built-in `versioninfo()`, it prints some information about the available hardware and loaded libraries:

In [4]:
CUDA.versioninfo()

CUDA runtime 12.2, local installation
CUDA driver 12.6
NVIDIA driver 535.183.6

CUDA libraries: 
- CUBLAS: 12.2.1
- CURAND: 10.3.3
- CUFFT: 11.0.8
- CUSOLVER: 11.5.0
- CUSPARSE: 12.1.1
- CUPTI: 2023.2.0 (API 20.0.0)
- NVML: 12.0.0+535.183.6

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0
- CUDA_Runtime_Discovery: 0.3.5

Toolchain:
- Julia: 1.10.4
- LLVM: 15.0.7

Preferences:
- CUDA_Runtime_jll.version: 12.2
- CUDA_Runtime_jll.local: true

1 device:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.390 GiB / 40.000 GiB available)


It's always good practice to check this at least once on a new system, or whenever you change which `module`s are loaded, just to ensure that everything is connected and happy!

# Example: Vector Addition

As a simple example, let's take a look at vector addition. Let's assume you have two vectors $\vec{a}$ and $\vec{b}$ and you want to add them elementwise. You can do this in many ways in Julia:
1. A simple `for` loop on the CPU
2. Julia's built-in array addition (`+`) on the CPU or GPU
3. GPU kernel programming in CUDA (or KernelAbstractions using CUDA as the backend - we'll see this soon!)

In [5]:
# define our input a, b vectors, and output c vector in CPU RAM
vector_size = 1024
a = rand(1:4, vector_size)
b = rand(1:4, vector_size)
c = zeros(Int, vector_size)

# what's in a?
a

1024-element Vector{Int64}:
 3
 1
 3
 1
 3
 3
 1
 1
 2
 3
 1
 3
 4
 ⋮
 4
 3
 1
 2
 3
 4
 4
 1
 3
 2
 2
 4

Let's write a simple CPU loop to add two vectors in serial:

In [6]:
# Note: the exclamation mark (!) doesn't do anything special
# It's just used to indicate that a function mutates its arguments
function vadd!(a, b, c)
    for i in 1:length(c)
        c[i] = a[i] + b[i]
    end
end
vadd!(a, b, c)
c

1024-element Vector{Int64}:
 4
 4
 5
 3
 5
 5
 4
 5
 6
 5
 4
 7
 5
 ⋮
 5
 4
 4
 5
 7
 8
 7
 5
 4
 3
 6
 6

Thankfully, Julia has a ton of built-in array operations, so we don't actually need to implement this ourselves. Julia's vector add (+) operation works exactly as you'd expect:

In [7]:
c = a + b

# Note that, unlike `vadd!`, the above operation allocates a new `c` as the output vector - this is important to remember.

1024-element Vector{Int64}:
 4
 4
 5
 3
 5
 5
 4
 5
 6
 5
 4
 7
 5
 ⋮
 5
 4
 4
 5
 7
 8
 7
 5
 4
 3
 6
 6
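
If that allocation matters (for example, inside a hot loop), broadcasting offers an in-place alternative. The following is a minimal sketch using only Base Julia; the setup mirrors the cells above, and `buffer_before` is just an illustrative name:

```julia
# Same setup as above: two input vectors and a preallocated output
vector_size = 1024
a = rand(1:4, vector_size)
b = rand(1:4, vector_size)
c = zeros(Int, vector_size)

buffer_before = pointer(c)   # remember where c's memory lives

# Fused, in-place broadcast: writes into the existing c instead of allocating
c .= a .+ b

# c still uses the same buffer, and holds the elementwise sum
@assert pointer(c) == buffer_before
@assert c == a + b
```

The dotted form reuses the existing output buffer, much like `vadd!` does, while staying a one-liner.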

Great! But isn't this a GPU tutorial? Let's get to it!

In [8]:
# First, we need to make copies of the a and b vectors on the GPU, and define a new zero-initialized GPU vector dc

# The CuArray() function automatically allocates a new GPU array of the same size and shape as the input,
# and copies from the input CPU array to the newly-allocated GPU array
da = CuArray(a)
db = CuArray(b)

# CUDA.zeros takes the desired element type and array size, and automatically allocates and initializes
# a new GPU array filled with Int64 zeros
dc = CUDA.zeros(Int, size(a))

# We can also safely take a look at what's in da, even though it's on the GPU!
# It's a different array type, and that fact is made clear to us:
da

1024-element CuArray{Int64, 1, CUDA.DeviceMemory}:
 3
 1
 3
 1
 3
 3
 1
 1
 2
 3
 1
 3
 4
 ⋮
 4
 3
 1
 2
 3
 4
 4
 1
 3
 2
 2
 4

Now that we know how to allocate on the GPU, let's see how to use this same add (+) operation on the GPU to add two of those vectors using CUDA:

In [9]:
dc = da + db

# We can add GPU vectors using the same `+` operator, thanks to Julia's multiple dispatch!
# Like the CPU add operation, this one also allocates a new GPU array for the output `dc`

1024-element CuArray{Int64, 1, CUDA.DeviceMemory}:
 4
 4
 5
 3
 5
 5
 4
 5
 6
 5
 4
 7
 5
 ⋮
 5
 4
 4
 5
 7
 8
 7
 5
 4
 3
 6
 6

Cool, but vector addition is pretty... simple? Let us learn how to write our own GPU kernels with CUDA.jl in pure Julia.

In array operations, CUDA.jl can leverage implicit parallelism (expressed over the array's elements) to automatically execute these operations in parallel on a GPU. When using hand-rolled kernels, it is instead the programmer's responsibility to decide how to effectively assign the available parallel execution resources for the specific operation. Let's see how this is done for vector addition, before moving on to more interesting examples:

In [10]:
function vadd_kernel!(c, a, b)
    # Obtain GPU thread index, which should be mapped to the valid indices of a and b
    i = threadIdx().x
    # Each thread will add its own element to c
    c[i] = a[i] + b[i]
    
    # GPU kernels don't return anything
    return
end

vadd_kernel! (generic function with 1 method)

At a high level, that's pretty easy: you just need to write a scalar function, just like you'd do if you were writing CUDA C++. Now we just need to launch that function in parallel using the `@cuda` macro, and specify the number of GPU threads with the `threads` keyword argument:

In [11]:
# Launch our `vadd_kernel!` GPU kernel, with our GPU arrays as inputs
@cuda threads=length(da) vadd_kernel!(dc, da, db)

dc

1024-element CuArray{Int64, 1, CUDA.DeviceMemory}:
 4
 4
 5
 3
 5
 5
 4
 5
 6
 5
 4
 7
 5
 ⋮
 5
 4
 4
 5
 7
 8
 7
 5
 4
 3
 6
 6

OK, this is great, and was not a lot of work for us! But to see a downside of this simple approach, let's try to work with bigger vectors, by setting `vector_size` to 10240:

In [12]:
vector_size = 10240
da = CuArray(rand(1:4, vector_size))
db = CuArray(rand(1:4, vector_size))
dc = CUDA.zeros(Int, vector_size)

@cuda threads=length(da) vadd_kernel!(dc, da, db)

LoadError: Number of threads in x-dimension exceeds device limit (10240 > 1024).

Oh no! What is going on here?

GPUs limit how many threads a single block may contain, and each block runs on a single streaming multiprocessor (SM). We just tried to put all 10240 threads into one block, which isn't possible:

In [13]:
# To query the number of threads per block, we can inspect CUDA attributes:
CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)

1024

Since 10240 > 1024, a single block (and therefore a single SM) couldn't satisfy our request, at least not with the default of a single block per kernel launch.

Thankfully, GPUs also have multiple SMs, so in theory this should be solvable. To take advantage of more than one SM, we need to run a kernel with multiple blocks, as a single block can only execute on one SM (which has limited resources available, as we saw in the query above). In order to exploit multiple blocks, though, we need to understand how to index into our arrays when our index depends not just on our thread index, but also on our block index and block sizes.

In CUDA.jl, the expression `i = threadIdx().x + (blockIdx().x - 1) * blockDim().x` calculates a unique index for each thread across multiple blocks in a CUDA kernel execution. Here's a breakdown of each component and how they contribute to computing this index:

- `threadIdx().x`: This returns the x-coordinate of the thread within its block. It's the thread's index within the block, starting from 1 (unlike C/C++ CUDA where it starts from 0).

- `blockIdx().x`: This gives the x-coordinate of the block within the grid. It represents the block's index in the grid, also starting from 1.

- `blockDim().x`: This represents the number of threads per block along the x-axis.

In essence, the formula `i = threadIdx().x + (blockIdx().x - 1) * blockDim().x` is used to compute a global index for each thread, regardless of how blocks are sized. It positions the threads linearly across all blocks. Here's what each part does:

- `(blockIdx().x - 1) * blockDim().x`: This part calculates the offset to the start of the current block. Subtracting 1 from `blockIdx().x` makes it zero-based, and then it is multiplied by the number of threads in each block `(blockDim().x)`. This gives the index of the first thread in the current block relative to the entire grid.

- `threadIdx().x`: Adding this to the block offset gives the specific thread's index within the whole grid.
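
To make the formula concrete, the index arithmetic can be checked on the CPU without a GPU. This is only a sketch: `global_index` is a hypothetical helper that takes the thread index, block index, and block size as plain integers in place of `threadIdx().x`, `blockIdx().x`, and `blockDim().x`:

```julia
# Hypothetical CPU stand-in for the kernel's index formula (all values are 1-based)
global_index(thread_idx, block_idx, block_dim) = thread_idx + (block_idx - 1) * block_dim

block_dim = 1024   # threads per block
num_blocks = 10    # blocks in the grid

# Enumerate every (block, thread) pair and compute its global index
indices = [global_index(t, b, block_dim) for b in 1:num_blocks for t in 1:block_dim]

# Every index from 1 to 10240 is produced exactly once, in order
@assert indices == 1:(num_blocks * block_dim)

# The first thread of block 2 picks up right after the last thread of block 1
@assert global_index(block_dim, 1, block_dim) == 1024
@assert global_index(1, 2, block_dim) == 1025
```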

Knowing this, let's now rewrite our kernel to properly handle multiple blocks, using the above global indexing formula:

In [14]:
function vadd_cuda!(c, a, b)
    # Calculate a unique index for each thread across multiple blocks
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    
    # Ensure that we skip invalid indices, if we over-allocated a few threads
    if i <= length(a)
        c[i] = a[i] + b[i]
    end

    return
end

vadd_cuda! (generic function with 1 method)

Now we can launch our kernel with the maximum number of threads per block (1024), and then divide up our computation across multiple 1024-wide blocks:

In [15]:
@cuda threads=1024 blocks=cld(length(da),1024) vadd_cuda!(dc, da, db) # cld(x, y) is (x / y) with round-up behavior
dc

10240-element CuArray{Int64, 1, CUDA.DeviceMemory}:
 5
 7
 5
 4
 6
 5
 4
 4
 6
 5
 7
 8
 8
 ⋮
 7
 2
 6
 3
 4
 5
 3
 7
 5
 7
 8
 4
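
The bounds check in `vadd_cuda!` matters whenever the vector length is not a multiple of the block size. As a quick worked example in plain Julia (the length 10_000 is an arbitrary choice for illustration, not a value used in this notebook):

```julia
n = 10_000                    # hypothetical vector length, not divisible by 1024
threads = 1024                # threads per block
blocks = cld(n, threads)      # round up so that every element gets a thread

launched = threads * blocks   # total number of threads actually launched
idle = launched - n           # threads whose index exceeds n and skip the addition

@assert blocks == 10
@assert launched == 10_240
@assert idle == 240
```

Without the `if i <= length(a)` guard, those surplus threads would index past the end of the arrays.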

Yay! Now that we're thoroughly done with vector addition, let's move on to something a bit heavier:

# Example: Matrix-Matrix Multiplication

Matrix multiplication is a mainstay of all kinds of applications, so we should be able to implement this in Julia with ease. Let's first take a look at doing this on the CPU:

In [24]:
# Allocate our random matrix inputs and zero'd output
matrix_size = 1024
A = rand(matrix_size, matrix_size)
B = rand(matrix_size, matrix_size)
C = zeros(matrix_size, matrix_size)

A

1024×1024 Matrix{Float64}:
 0.599324   0.136234   0.210912    …  0.963063    0.0903736   0.46398
 0.636538   0.864631   0.00851327     0.232077    0.257339    0.183778
 0.591745   0.197747   0.344443       0.196597    0.908298    0.53485
 0.921838   0.105807   0.803605       0.459633    0.509019    0.967507
 0.342914   0.849087   0.725722       0.884828    0.699581    0.218655
 0.100464   0.557632   0.890053    …  0.0495604   0.885375    0.641849
 0.304138   0.459314   0.24382        0.326963    0.134166    0.648276
 0.33685    0.107411   0.521201       0.291425    0.919088    0.230217
 0.453407   0.472649   0.462337       0.0498801   0.873891    0.910634
 0.700216   0.0216538  0.279405       0.687379    0.0839788   0.687677
 0.717115   0.928108   0.669639    …  0.999369    0.569745    0.241245
 0.0106636  0.415395   0.14221        0.153659    0.00537939  0.0914411
 0.373983   0.909296   0.174363       0.7676      0.148141    0.0880714
 ⋮                                 ⋱              

The three nested loops implementation of matrix multiplication is easy to express on the CPU:

In [18]:
function MatrixMultiplication!(C, A, B)
    for i in 1:size(C, 1)
        for j in 1:size(C, 2)
            C[i, j] = 0
            for k in 1:size(A, 2)
                C[i, j] += A[i, k] * B[k, j]
            end
        end
    end
end


MatrixMultiplication! (generic function with 1 method)
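
Julia stores matrices in column-major order, so the loop nest above, whose innermost loop walks along a row of `A`, does not access `A` contiguously. As a sketch of one common reordering (the name `matmul_colmajor!` is made up here, and any speedup depends on the machine), the innermost loop can instead run down a column:

```julia
using LinearAlgebra  # for the built-in A * B used in the check below

# Hypothetical variant of the triple loop, reordered for column-major storage
function matmul_colmajor!(C, A, B)
    fill!(C, zero(eltype(C)))
    for j in 1:size(C, 2)
        for k in 1:size(A, 2)
            b = B[k, j]                # reused across the whole inner loop
            for i in 1:size(C, 1)      # contiguous walk down column j of C and column k of A
                C[i, j] += A[i, k] * b
            end
        end
    end
    return C
end

# Small rectangular test matrices (sizes chosen arbitrarily)
A_small = rand(64, 48)
B_small = rand(48, 32)
C_small = zeros(64, 32)
matmul_colmajor!(C_small, A_small, B_small)
@assert C_small ≈ A_small * B_small
```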

In [19]:
MatrixMultiplication!(C, A, B)
C

2048×2048 Matrix{Float64}:
 522.008  514.447  518.353  508.014  …  503.772  518.87   516.198  510.185
 513.823  501.285  509.5    505.66      495.307  511.577  502.189  501.256
 537.5    518.566  517.116  523.175     513.203  522.36   520.318  514.112
 525.524  510.32   507.626  518.828     505.366  515.603  512.018  511.279
 521.791  511.283  505.692  513.941     500.857  511.991  510.209  512.336
 509.001  495.895  503.534  504.875  …  498.997  512.214  499.784  502.016
 516.151  510.726  503.878  518.626     504.707  515.551  511.767  500.716
 521.474  510.29   521.157  521.408     512.018  524.482  513.75   512.201
 525.957  512.427  515.678  527.4       507.682  525.113  512.892  508.593
 513.135  494.763  502.173  503.78      498.956  503.753  501.352  506.024
 526.618  512.589  505.468  517.26   …  496.107  522.325  513.331  501.411
 521.852  510.839  498.612  515.724     501.931  519.637  514.263  503.614
 516.697  505.221  503.88   508.651     499.331  515.61   514.946  500.09

Of course, Julia has this one built-in already (it calls BLAS):

In [20]:
C = A * B

2048×2048 Matrix{Float64}:
 522.008  514.447  518.353  508.014  …  503.772  518.87   516.198  510.185
 513.823  501.285  509.5    505.66      495.307  511.577  502.189  501.256
 537.5    518.566  517.116  523.175     513.203  522.36   520.318  514.112
 525.524  510.32   507.626  518.828     505.366  515.603  512.018  511.279
 521.791  511.283  505.692  513.941     500.857  511.991  510.209  512.336
 509.001  495.895  503.534  504.875  …  498.997  512.214  499.784  502.016
 516.151  510.726  503.878  518.626     504.707  515.551  511.767  500.716
 521.474  510.29   521.157  521.408     512.018  524.482  513.75   512.201
 525.957  512.427  515.678  527.4       507.682  525.113  512.892  508.593
 513.135  494.763  502.173  503.78      498.956  503.753  501.352  506.024
 526.618  512.589  505.468  517.26   …  496.107  522.325  513.331  501.411
 521.852  510.839  498.612  515.724     501.931  519.637  514.263  503.614
 516.697  505.221  503.88   508.651     499.331  515.61   514.946  500.09

Our implementation performs quite a bit worse than Julia's (OpenBLAS), but that's OK - we haven't really optimized it at all, since this is just a tutorial:

In [27]:
@benchmark MatrixMultiplication!(C, A, B)

BenchmarkTools.Trial: 1 sample with 1 evaluation.
 Single result which took 6.027 s (0.00% GC) to evaluate,
 with a memory estimate of 0 bytes, over 0 allocations.

In [26]:
@benchmark A * B

BenchmarkTools.Trial: 197 samples with 1 evaluation.
 Range (min … max):   7.025 ms … 228.372 ms  ┊ GC (min … max): 0.00% … 84.16%
 Time  (median):     17.285 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.397 ms ±  22.751 ms  ┊ GC (mean ± σ):  4.42% ±  7.89%

Ok, let's now implement matrix multiplication on the GPU:

In [None]:
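# First, we need to move the A and B matrices to the GPU and define a new zero'd DC matrix on the GPU
DA = CuArray(A)
DB = CuArray(B)
# Pass the element type explicitly: CUDA.zeros defaults to Float32 when no type is given
DC = CUDA.zeros(eltype(A), size(A))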

DA

In the same way, here we can multiply the `DA` matrix by `DB` matrix using the `*` operator (which forwards the call to CUBLAS), thanks again to Julia's multiple dispatch:

In [None]:
DC = DA * DB

In [None]:
function MatrixMultiplication_cuda!(C, A, B)
    # Calculate the global row and column indices
    row = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    col = (blockIdx().y - 1) * blockDim().y + threadIdx().y

    # Create a 0 of the same type as C's element type (for type stability)
    sum = zero(eltype(C))

    if row <= size(A, 1) && col <= size(B, 2)
        for i in 1:size(A, 2)
            # @inbounds disables bounds checking for array accesses, to improve performance
            # Note that incorrect usage can result in segfaults/memory faults/wrong results
            @inbounds sum += A[row, i] * B[i, col]
        end
        C[row, col] = sum
    end

    return
end
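
The two-dimensional launch can be emulated on the CPU to check that the index calculation, together with a bounds test of the form `row <= size(A, 1) && col <= size(B, 2)`, covers every output element, including when the matrix size is not a multiple of the block size. This is a sketch in plain Julia; `emulate_matmul_launch!` is a hypothetical helper, and ordinary loops stand in for the GPU's blocks and threads:

```julia
using LinearAlgebra  # for the built-in A * B used in the check below

# Hypothetical CPU emulation of a 2D grid of 2D blocks
function emulate_matmul_launch!(C, A, B; threads=(32, 32))
    # Round up so that the grid covers the whole output matrix
    blocks = (cld(size(C, 1), threads[1]), cld(size(C, 2), threads[2]))
    for by in 1:blocks[2], bx in 1:blocks[1], ty in 1:threads[2], tx in 1:threads[1]
        # Same global index formula as in the kernel, one per dimension
        row = (bx - 1) * threads[1] + tx
        col = (by - 1) * threads[2] + ty
        # Skip "threads" that fall outside the matrix
        if row <= size(A, 1) && col <= size(B, 2)
            acc = zero(eltype(C))
            for k in 1:size(A, 2)
                acc += A[row, k] * B[k, col]
            end
            C[row, col] = acc
        end
    end
    return blocks
end

# Sizes deliberately not divisible by 32
A_test = rand(70, 40)
B_test = rand(40, 50)
C_test = zeros(70, 50)
grid = emulate_matmul_launch!(C_test, A_test, B_test)

@assert grid == (3, 2)
@assert C_test ≈ A_test * B_test
```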

In [24]:
# Split blocks up into 32x32 tiles
@cuda threads=(32, 32) blocks=(matrix_size ÷ 32, matrix_size ÷ 32) MatrixMultiplication_cuda!(DC, DA, DB)

DC

2048×2048 CuArray{Float64, 2, CUDA.DeviceMemory}:
 512.654  491.693  497.835  491.77   …  498.166  497.263  508.779  501.667
 506.188  496.73   505.53   499.666     502.857  492.82   505.705  498.718
 510.076  494.765  504.817  498.36      509.65   498.694  507.214  508.859
 508.611  498.84   506.446  498.362     512.131  503.416  511.78   502.783
 516.901  503.984  513.068  506.68      509.245  516.674  517.319  505.767
 515.004  500.863  505.083  504.142  …  515.888  503.512  520.505  516.691
 516.63   501.365  509.056  507.993     518.387  510.541  517.607  508.491
 523.411  508.684  512.396  511.45      518.802  510.688  518.202  512.118
 502.255  490.325  496.933  509.615     504.088  494.503  504.638  501.606
 527.083  514.173  518.128  511.688     518.804  519.843  521.059  517.776
 512.59   496.871  509.555  503.781  …  510.958  502.675  512.275  506.564
 522.832  503.34   520.555  513.428     518.154  516.196  519.143  515.926
 507.4    493.934  513.169  503.977     506.983  4

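And as we'd expect, the optimized CUBLAS implementation is far faster than ours. Note that GPU operations launch asynchronously and the first benchmark below is not wrapped in `CUDA.@sync`, so its minimum time likely reflects launch overhead rather than the full multiplication; the median is more representative: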

In [22]:
@benchmark DA * DB

BenchmarkTools.Trial: 5594 samples with 1 evaluation.
 Range (min … max):   20.429 μs … 132.732 ms  ┊ GC (min … max): 0.00% … 98.82%
 Time  (median):     972.307 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   891.426 μs ±   1.789 ms  ┊ GC (mean ± σ):  2.67% ±  1.57%

In [25]:
@benchmark CUDA.@sync @cuda threads=(32, 32) blocks=(matrix_size ÷ 32, matrix_size ÷ 32) MatrixMultiplication_cuda!(DC, DA, DB)

BenchmarkTools.Trial: 545 samples with 1 evaluation.
 Range (min … max):  9.129 ms …  9.940 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.178 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.178 ms ± 53.594 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

Ouch! Why exactly is our kernel implementation slower?

The answer is that this is only the naive implementation of matrix multiplication; if you did the same in CUDA C++, you'd get similarly bad performance. A performant implementation relies on tiling, where the matrices are divided into smaller submatrices (tiles) that fit within the faster levels of the GPU's memory hierarchy, such as shared memory and cache, which improves data locality and reduces global memory bandwidth needs. Optimized kernels may additionally use WMMA instructions to increase arithmetic throughput (both shared memory and WMMA can be easily used in Julia).

In an optimized implementation, each thread block on the GPU handles a specific tile of the output matrix, loading portions of the input tiles into shared memory to reduce the repeated global memory access. This approach enables a higher level of parallelism by allowing multiple tiles to be processed concurrently across the GPU cores, without bottlenecking on memory transfers.

For the purpose of this tutorial, we will not be implementing these optimizations, but do know that they are as easy (or easier) to use in Julia compared to CUDA C++.
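
Although a tiled GPU kernel is out of scope here, the loop structure behind tiling can be illustrated on the CPU. The sketch below is an illustration under stated assumptions, not the optimized GPU algorithm: `tiled_matmul!` and its `tile` keyword are hypothetical names, and no shared memory is involved, only the blocking of the iteration space into tiles:

```julia
using LinearAlgebra  # for the built-in A * B used in the check below

# Hypothetical CPU sketch of a blocked (tiled) matrix multiplication
function tiled_matmul!(C, A, B; tile=32)
    n, m = size(C)
    p = size(A, 2)
    fill!(C, zero(eltype(C)))
    # Outer loops step over tiles of the output (ii, jj) and of the shared dimension (kk)
    for jj in 1:tile:m, kk in 1:tile:p, ii in 1:tile:n
        # Inner loops stay within one tile; min() handles partial tiles at the edges
        for j in jj:min(jj + tile - 1, m), k in kk:min(kk + tile - 1, p)
            b = B[k, j]
            for i in ii:min(ii + tile - 1, n)
                C[i, j] += A[i, k] * b
            end
        end
    end
    return C
end

# Sizes deliberately not divisible by the tile size
A_tiled = rand(70, 50)
B_tiled = rand(50, 45)
C_tiled = zeros(70, 45)
tiled_matmul!(C_tiled, A_tiled, B_tiled)
@assert C_tiled ≈ A_tiled * B_tiled
```

On a GPU, each output tile would map to one thread block, and the `kk` slices of the inputs would be staged through shared memory.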

# KernelAbstractions

Now that we know how to write vendor-specific kernels, let's explore the naive matrix multiplication example using KernelAbstractions.jl:

In [30]:
import Pkg
Pkg.add("KernelAbstractions")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `/global/u2/t/train921/julia-hpc-tutorial-sc24-main/Project.toml`
[32m[1m  No Changes[22m[39m to `/global/u2/t/train921/julia-hpc-tutorial-sc24-main/Manifest.toml`


We'll also load the Random standard library to assist with certain random array initializations:

In [32]:
using KernelAbstractions
using Random

Implementing a kernel with KernelAbstractions is very similar to implementing a kernel with CUDA.jl. The primary differences include annotating a kernel function with `@kernel`, and doing thread indexing using `@index` (which efficiently abstracts away the index calculations we were previously doing). Otherwise, most things are the same:

In [28]:
@kernel function MatrixMultiplication_kernel!(C, A, B)
    # Global index of each thread across multiple blocks in both x and y dimensions of the grid
    row, col = @index(Global, NTuple)

    # Everything else is the same!
    sum = zero(eltype(C))

    if row <= size(A, 1) && col <= size(B, 2)
        for i = 1:size(A, 2)
             @inbounds sum += A[row, i] * B[i, col]
        end
        @inbounds C[row, col] = sum
     end
end

MatrixMultiplication_kernel! (generic function with 4 methods)

One key difference between KernelAbstractions and CUDA.jl is that, because KernelAbstractions is portable, we need to select the CUDA "backend" when we compile our kernel (AMDGPU, Apple, and Intel are also supported). Most operations take the backend as the first argument, to allow Julia's multiple dispatch to redirect calls to the correct implementation. Additionally, KernelAbstractions separates the compilation and kernel launch stages, and provides configuration options for each step to allow further optimization.
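
To illustrate the portability claim, the sketch below runs a small kernel on the host, assuming the installed KernelAbstractions version provides its `CPU()` backend. The names `vadd_ka!`, `cpu_backend`, and `vadd_on_cpu!` are made up for this example; only the backend object would need to change to target a GPU:

```julia
using KernelAbstractions

# A small portable kernel: one work-item per vector element
@kernel function vadd_ka!(c, @Const(a), @Const(b))
    i = @index(Global, Linear)
    @inbounds c[i] = a[i] + b[i]
end

# Assumption: run on the host through KernelAbstractions' CPU backend
cpu_backend = KernelAbstractions.CPU()

x = rand(1:4, 1024)
y = rand(1:4, 1024)
z = zeros(Int, 1024)

# Instantiate the kernel for this backend with a workgroup size of 64,
# then launch it over all elements and wait for completion
vadd_on_cpu! = vadd_ka!(cpu_backend, 64)
vadd_on_cpu!(z, x, y, ndrange=length(z))
KernelAbstractions.synchronize(cpu_backend)

@assert z == x .+ y
```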

In [29]:
# Select the CUDA backend
Backend = CUDA.CUDABackend()

# Use KernelAbstractions's APIs to allocate GPU matrices DA, DB, and DC
matrix_size = 2048
T = Float64
DA = rand!(allocate(Backend, T, matrix_size, matrix_size))
DB = rand!(allocate(Backend, T, matrix_size, matrix_size))
DC = KernelAbstractions.zeros(Backend, T, matrix_size, matrix_size)

# Compile the kernel
# We'll statically assign the workgroup (AKA block) size to allow for additional compile-time optimizations
workgroupsize = (32, 32)
kernel! = MatrixMultiplication_kernel!(Backend, workgroupsize)

# Launch the kernel with our GPU matrices as inputs
kernel!(DC, DA, DB, ndrange=(size(DC)))

# Explicitly wait for the kernel to complete
KernelAbstractions.synchronize(Backend)

# Are our results what we'd expect to see (compared to CUBLAS)?
isapprox(DC, DA * DB)

true

In [30]:
@benchmark begin
    kernel!(DC, DA, DB, ndrange=size(DC))
    KernelAbstractions.synchronize(Backend)
end

BenchmarkTools.Trial: 348 samples with 1 evaluation.
 Range (min … max):  13.950 ms … 14.059 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.013 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.012 ms ± 20.283 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

# Example: Memory copy with KernelAbstractions

Let's now see a different kind of example, to show how to use shared memory in KernelAbstractions. This kernel performs a matrix copy through local memory (known as shared memory in CUDA). A plain copy gains little from this, but the same pattern, applied to kernels that reuse data, can significantly speed up memory access by reducing global memory bandwidth usage:

In [31]:
@kernel function lmem_copy_kernel!(output, @Const(input))
    # Gets the global index of the thread in a multidimensional grid, which is used to index into the global input and output arrays.
    I, J = @index(Global, NTuple) 
    # Gets the local index within a thread block or workgroup, useful for indexing into locally shared memory.
    i, j = @index(Local, NTuple) # Local index of thread

    # @groupsize() retrieves the dimensions of the thread block or workgroup.
    # The @uniform ensures that these values are treated as constants that are the same for all threads.
    N = @uniform @groupsize()[1] # same as blockDim().x 
    M = @uniform @groupsize()[2] # same as blockDim().y

    # Allocate local (shared) memory
    tile = @localmem eltype(output) (N, M)

    # First, data from the global input array is loaded into the shared tile array using local indices.
    @inbounds tile[i, j] = input[I, J]

    # @synchronize ensures that all threads in the workgroup have completed their memory writes to the shared memory before proceeding. 
    # This is crucial to prevent race conditions.
    @synchronize

    # Finally, the data is written back from the shared tile array to the global output array.
    @inbounds output[I, J] = tile[i, j]
end

lmem_copy_kernel! (generic function with 4 methods)

This kernel is a little bit more "advanced" than prior kernels, but is still quite readable once you know what each of the macros does. We can quickly test that it works correctly:

In [32]:
# Allocate inputs and outputs
input = rand!(allocate(Backend, T, matrix_size, matrix_size))
output = KernelAbstractions.zeros(Backend, T, matrix_size, matrix_size)

# Compile and launch the kernel, and wait for it to complete
lmem_copy! = lmem_copy_kernel!(Backend, workgroupsize)
lmem_copy!(output, input, ndrange=size(input))
KernelAbstractions.synchronize(Backend)

# Confirm that the output matrix now matches the input matrix
input == output

true