# Installation

In Julia, GPU packages are easy to install: just run `Pkg.add("CUDA")`. The only requirement is a functional NVIDIA driver; you do not need to install the CUDA toolkit yourself.

In [1]:
using Pkg
Pkg.add("CUDA")

   Resolving package versions...
  No Changes to `/global/u2/t/train920/.julia/environments/v1.9/Project.toml`
  No Changes to `/global/u2/t/train920/.julia/environments/v1.9/Manifest.toml`


In [2]:
import Pkg
Pkg.add("BenchmarkTools")

   Resolving package versions...
  No Changes to `/global/u2/t/train920/.julia/environments/v1.9/Project.toml`
  No Changes to `/global/u2/t/train920/.julia/environments/v1.9/Manifest.toml`


In [3]:
using CUDA
using LinearAlgebra

In [4]:
using BenchmarkTools

To check whether CUDA.jl is functional, call the `CUDA.versioninfo()` function. Like `Base.versioninfo()`, it prints information on the available hardware and loaded libraries:

In [5]:
CUDA.versioninfo()

CUDA runtime 12.2, local installation
CUDA driver 12.6
NVIDIA driver 535.183.6, originally for CUDA 12.2

CUDA libraries: 
- CUBLAS: 12.2.1
- CURAND: 10.3.3
- CUFFT: 11.0.8
- CUSOLVER: 11.5.0
- CUSPARSE: 12.1.1
- CUPTI: 2023.2.0 (API 20.0.0)
- NVML: 12.0.0+535.183.6

Julia packages: 
- CUDA: 5.4.3
- CUDA_Driver_jll: 0.9.2+0
- CUDA_Runtime_jll: 0.14.1+0
- CUDA_Runtime_Discovery: 0.3.5

Toolchain:
- Julia: 1.9.4
- LLVM: 14.0.6

Preferences:
- CUDA_Runtime_jll.version: 12.2
- CUDA_Runtime_jll.local: true

1 device:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.390 GiB / 40.000 GiB available)
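For scripts, a programmatic check can be more convenient than reading the printed report. The following is a minimal sketch, assuming CUDA.jl is already installed; it uses `CUDA.functional()` to test whether a usable GPU setup was found and prints the device name if so:

```julia
using CUDA

if CUDA.functional()
    # A driver and a device were found and initialized
    dev = CUDA.device()
    println("Using GPU: ", CUDA.name(dev))
else
    # The package loaded, but no usable GPU setup was detected
    println("CUDA.jl is loaded, but no functional GPU setup was detected.")
end
```

This kind of guard lets the same script fall back to CPU code on machines without a GPU.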


# Add Two Vectors Example

Let us take vector addition as an example. Assume you have two vectors $\vec{a}$ and $\vec{b}$ and you want to add them. You can do this in several ways in Julia:
1. a simple `for` loop on the CPU
2. Julia's built-in addition (`+`) on the CPU or on the GPU
3. GPU kernel programming in CUDA, or a kernel abstraction using CUDA as the backend

In [6]:
# Define the input vectors a and b and the output vector c on the CPU
vector_size = 1024
a = rand(1:4, vector_size)
b = rand(1:4, vector_size)
c = zeros(Int, vector_size)

1024-element Vector{Int64}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

A simple CPU loop to add two vectors:

In [7]:
function vadd(a, b, c)
    # Iterate over the arrays' own indices instead of the global vector_size
    for i in eachindex(a, b, c)
        c[i] = a[i] + b[i]
    end
    return
end
vadd(a, b, c)
c

1024-element Vector{Int64}:
 3
 8
 2
 4
 2
 4
 3
 3
 6
 6
 2
 6
 4
 ⋮
 3
 5
 5
 2
 5
 7
 3
 4
 4
 2
 7
 3

Julia's built-in addition (`+`) on the CPU:

In [8]:
c = a + b

1024-element Vector{Int64}:
 3
 8
 2
 4
 2
 4
 3
 3
 6
 6
 2
 6
 4
 ⋮
 3
 5
 5
 2
 5
 7
 3
 4
 4
 2
 7
 3

Great!
Let us see how to use `+` to add two vectors using GPU resources.

In [9]:
# First move the a and b vectors to the GPU and allocate a zero-initialized dc vector on the GPU
da = CuArray(a)
db = CuArray(b)
dc = CUDA.zeros(Int, size(a))

1024-element CuArray{Int64, 1, CUDA.DeviceMemory}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

We can add the `da` vector to the `db` vector using the `+` operator, thanks to Julia's multiple dispatch.

In [10]:
dc = da + db

1024-element CuArray{Int64, 1, CUDA.DeviceMemory}:
 3
 8
 2
 4
 2
 4
 3
 3
 6
 6
 2
 6
 4
 ⋮
 3
 5
 5
 2
 5
 7
 3
 4
 4
 2
 7
 3
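The same dispatch mechanism also covers broadcasting. The sketch below assumes a working GPU and uses fresh illustrative variable names (`x`, `y`, `z`) rather than the notebook's arrays; it writes the sum into a preallocated array with a fused in-place broadcast instead of allocating a new result:

```julia
using CUDA

x = CuArray(rand(1:4, 1024))   # illustrative inputs, not the notebook's da/db
y = CuArray(rand(1:4, 1024))
z = CUDA.zeros(Int, 1024)      # preallocated output on the GPU

# Fused, in-place broadcast: the result is written into z, no new output array
z .= x .+ y

# Copy back to the CPU to compare against the CPU result
println(Array(z) == Array(x) + Array(y))
```

Reusing the output buffer this way can matter when the operation sits inside a loop.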

Let us now learn how to write a GPU kernel with `CUDA.jl` in Julia.

In array operations, `CUDA.jl` can leverage implicit parallelism to automatically execute these operations in parallel on a GPU. However, when using kernels, it is the programmer's responsibility to effectively utilize the available parallel execution resources for the specific operation.

In [11]:
function vadd(c, a, b)
    # Obtain the thread index, which maps directly to an index of a and b
    i = threadIdx().x
    # Each thread adds one pair of elements and stores the result in c
    c[i] = a[i] + b[i]
    return
end

vadd (generic function with 1 method)

At a high level, this is straightforward: write a scalar function and launch it in parallel using the `@cuda` macro with its `threads` keyword argument.

In [12]:
@cuda threads=length(a) vadd(dc, da, db)
dc

1024-element CuArray{Int64, 1, CUDA.DeviceMemory}:
 3
 8
 2
 4
 2
 4
 3
 3
 6
 6
 2
 6
 4
 ⋮
 3
 5
 5
 2
 5
 7
 3
 4
 4
 2
 7
 3

This works, but try setting `vector_size` to 10240. The simple CPU loop and the `+` operator on both the CPU and the GPU still work, but the hand-written GPU kernel fails.

What is going on here?

A single thread block can hold only a limited number of threads (1024 on this device, as queried below). Each block runs on a single streaming multiprocessor (SM), and a GPU has multiple SMs.

To take advantage of them all, and to cover vectors longer than one block, we need to run the kernel with multiple blocks.

In CUDA.jl, the expression `i = threadIdx().x + (blockIdx().x - 1) * blockDim().x` calculates a unique index for each thread across multiple blocks in a CUDA kernel execution. Here's a breakdown of each component and how they contribute to computing this index:

- `threadIdx().x`: This returns the x-coordinate of the thread within its block. It's the thread's index within the block, starting from 1 (unlike C/C++ CUDA where it starts from 0).

- `blockIdx().x`: This gives the x-coordinate of the block within the grid. It represents the block's index in the grid, also starting from 1.

- `blockDim().x`: This represents the number of threads per block along the x-axis.

The formula `i = threadIdx().x + (blockIdx().x - 1) * blockDim().x` is used to compute a global index for each thread. It positions the threads linearly across all blocks. Here's what each part does:

- `(blockIdx().x - 1) * blockDim().x`: This part calculates the offset to the start of the current block. Subtracting 1 from `blockIdx().x` makes it zero-based, and then it is multiplied by the number of threads in each block `(blockDim().x)`. This gives the index of the first thread in the current block relative to the entire grid.

- `threadIdx().x`: Adding this to the block offset gives the specific thread's index within the whole grid.
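To make the arithmetic concrete, the index formula can be emulated on the CPU without any GPU. This is only an illustrative sketch: `global_index`, `block_dim`, and `n_blocks` are hypothetical names, and the small sizes are chosen for readability:

```julia
# CPU-only emulation of the 1-based global index formula used in CUDA.jl kernels
global_index(thread_idx, block_idx, block_dim) = thread_idx + (block_idx - 1) * block_dim

block_dim = 4   # threads per block (illustrative value)
n_blocks  = 3   # blocks in the grid (illustrative value)

# Visit every (block, thread) pair in launch order and collect its global index
indices = [global_index(t, b, block_dim) for b in 1:n_blocks for t in 1:block_dim]

println(indices)
```

Every index from 1 to `block_dim * n_blocks` appears exactly once, so each thread owns a distinct element.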

The 2D case is similar: compute one global index per dimension, for example `row = threadIdx().x + (blockIdx().x - 1) * blockDim().x` and `col = threadIdx().y + (blockIdx().y - 1) * blockDim().y`.


In [13]:
# Query the maximum number of threads per block
CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)

1024

In [14]:
function vadd(c, a, b)
    # calculates a unique index for each thread across multiple blocks
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(a)
        c[i] = a[i] + b[i]
    end
    return
end

vadd (generic function with 1 method)

In [15]:
@cuda threads=1024 blocks=cld(length(da),1024) vadd(dc, da, db)
dc

1024-element CuArray{Int64, 1, CUDA.DeviceMemory}:
 3
 8
 2
 4
 2
 4
 3
 3
 6
 6
 2
 6
 4
 ⋮
 3
 5
 5
 2
 5
 7
 3
 4
 4
 2
 7
 3
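Instead of hard-coding 1024 threads, the block size can be derived from the compiled kernel. The following is a sketch, assuming a working GPU; `vadd_kernel!`, `x`, `y`, `z`, and `n` are hypothetical names introduced here to avoid clashing with the notebook's variables. It compiles the kernel without launching it and asks CUDA.jl's occupancy API (`launch_configuration`) for a suggested number of threads:

```julia
using CUDA

# Same kernel shape as above, under a hypothetical name
function vadd_kernel!(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(a)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

n = 10_240
x = CuArray(rand(1:4, n))
y = CuArray(rand(1:4, n))
z = CUDA.zeros(Int, n)

# Compile without launching, then query a suggested launch configuration
kernel = @cuda launch=false vadd_kernel!(z, x, y)
config = launch_configuration(kernel.fun)
threads = min(n, config.threads)
blocks = cld(n, threads)   # round up so every element is covered

CUDA.@sync kernel(z, x, y; threads, blocks)
println(Array(z) == Array(x) + Array(y))
```

Because `cld` rounds up, the last block may contain threads past the end of the array, which is why the `i <= length(a)` guard in the kernel is needed.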

# Matrix Multiplication Example

In [16]:
matrix_size = 2048
A = rand(matrix_size, matrix_size)
B = rand(matrix_size, matrix_size)
C = zeros(matrix_size, matrix_size)

2048×2048 Matrix{Float64}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0

The three-nested-loop implementation of matrix multiplication on the CPU:

In [17]:
function MatrixMultiplication!(A, B, C)
    for i in 1:matrix_size
        Threads.@threads for j in 1:matrix_size
            C[i, j] = 0
            for k in 1:matrix_size
                C[i, j] += A[i, k] * B[k, j]
            end
        end
    end
end


MatrixMultiplication! (generic function with 1 method)

Julia's built-in multiplication (`*`) on the CPU:

In [18]:
C = A * B

2048×2048 Matrix{Float64}:
 497.727  498.172  505.842  496.282  …  518.785  515.704  511.647  527.195
 512.367  512.068  517.075  508.459     528.052  527.279  525.13   529.496
 508.418  513.269  519.497  504.399     529.693  520.459  516.247  525.33
 507.398  524.236  517.82   512.141     524.056  533.715  523.636  527.291
 518.574  525.062  519.645  523.654     541.132  533.416  529.764  534.653
 502.198  519.423  513.635  504.205  …  521.792  529.731  523.782  529.043
 507.997  516.699  509.864  512.743     538.144  528.51   524.629  521.334
 519.55   519.476  522.8    508.789     534.098  536.234  529.813  529.768
 488.082  501.328  503.469  493.955     511.72   514.566  506.7    512.4
 501.65   512.958  513.608  501.925     520.92   531.522  518.559  514.894
 500.202  513.806  508.432  504.929  …  518.806  519.545  516.367  513.643
 514.988  517.088  520.564  518.217     535.746  531.564  525.136  530.639
 499.068  513.717  514.405  510.414     531.018  529.449  520.731  523.751
 

In [19]:
@benchmark  A * B

BenchmarkTools.Trial: 45 samples with 1 evaluation.
 Range (min … max):   78.016 ms … 217.613 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     104.030 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   107.660 ms ±  24.252 ms  ┊ GC (mean ± σ):  1.24% ± 2.68%

Now let us see how to use `*` to multiply two matrices using GPU resources.

In [20]:
# First move the A and B matrices to the GPU and allocate a zero-initialized DC matrix on the GPU
DA = CuArray(A)
DB = CuArray(B)
DC = CUDA.zeros(size(A))

2048×2048 CuArray{Float32, 2, CUDA.DeviceMemory}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0

In the same way, we can multiply the `DA` matrix by the `DB` matrix using the `*` operator, thanks again to Julia's multiple dispatch.

In [21]:
DC = DA * DB

2048×2048 CuArray{Float64, 2, CUDA.DeviceMemory}:
 497.727  498.172  505.842  496.282  …  518.785  515.704  511.647  527.195
 512.367  512.068  517.075  508.459     528.052  527.279  525.13   529.496
 508.418  513.269  519.497  504.399     529.693  520.459  516.247  525.33
 507.398  524.236  517.82   512.141     524.056  533.715  523.636  527.291
 518.574  525.062  519.645  523.654     541.132  533.416  529.764  534.653
 502.198  519.423  513.635  504.205  …  521.792  529.731  523.782  529.043
 507.997  516.699  509.864  512.743     538.144  528.51   524.629  521.334
 519.55   519.476  522.8    508.789     534.098  536.234  529.813  529.768
 488.082  501.328  503.469  493.955     511.72   514.566  506.7    512.4
 501.65   512.958  513.608  501.925     520.92   531.522  518.559  514.894
 500.202  513.806  508.432  504.929  …  518.806  519.545  516.367  513.643
 514.988  517.088  520.564  518.217     535.746  531.564  525.136  530.639
 499.068  513.717  514.405  510.414     531.018  529.

In [22]:
@benchmark  DA * DB

BenchmarkTools.Trial: 5883 samples with 1 evaluation.
 Range (min … max):   17.414 μs …   6.583 ms  ┊ GC (min … max): 0.00% … 80.61%
 Time  (median):     972.299 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   834.409 μs ± 357.439 μs  ┊ GC (mean ± σ):  0.36% ±  2.17%

In [23]:
function MatrixMultiplication!(A, B, C)
    # Global row and column handled by this thread
    row = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    col = (blockIdx().y - 1) * blockDim().y + threadIdx().y

    sum = zero(eltype(C))

    # Guard against threads that fall outside the matrix (note <= for both bounds)
    if row <= size(A, 1) && col <= size(B, 2)
        for i = 1:size(A, 2)
            # @inbounds disables bounds checking for array accesses for performance optimization.
            @inbounds sum += A[row, i] * B[i, col]
        end
        C[row, col] = sum
    end

    return
end

MatrixMultiplication! (generic function with 1 method)

In [24]:
@cuda threads=(32, 32) blocks=(matrix_size ÷ 32, matrix_size ÷ 32) MatrixMultiplication!(DA, DB, DC)
DC

2048×2048 CuArray{Float64, 2, CUDA.DeviceMemory}:
 497.727  498.172  505.842  496.282  …  518.785  515.704  511.647  527.195
 512.367  512.068  517.075  508.459     528.052  527.279  525.13   529.496
 508.418  513.269  519.497  504.399     529.693  520.459  516.247  525.33
 507.398  524.236  517.82   512.141     524.056  533.715  523.636  527.291
 518.574  525.062  519.645  523.654     541.132  533.416  529.764  534.653
 502.198  519.423  513.635  504.205  …  521.792  529.731  523.782  529.043
 507.997  516.699  509.864  512.743     538.144  528.51   524.629  521.334
 519.55   519.476  522.8    508.789     534.098  536.234  529.813  529.768
 488.082  501.328  503.469  493.955     511.72   514.566  506.7    512.4
 501.65   512.958  513.608  501.925     520.92   531.522  518.559  514.894
 500.202  513.806  508.432  504.929  …  518.806  519.545  516.367  513.643
 514.988  517.088  520.564  518.217     535.746  531.564  525.136  530.639
 499.068  513.717  514.405  510.414     531.018  529.

In [25]:
@benchmark CUDA.@sync @cuda threads=(32, 32) blocks=(matrix_size ÷ 32, matrix_size ÷ 32) MatrixMultiplication!(DA, DB, DC)

BenchmarkTools.Trial: 304 samples with 1 evaluation.
 Range (min … max):  16.102 ms … 16.930 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     16.198 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   16.201 ms ± 63.703 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

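Why is the kernel implementation slower than Julia's multiplication operator?

The answer is that this is only the naive implementation of matrix multiplication in Julia, whereas `DA * DB` dispatches to the vendor-tuned cuBLAS library. Note also that the `DA * DB` benchmark above was run without `CUDA.@sync`; GPU operations are launched asynchronously, so its shortest timings (such as the 17 μs minimum) reflect launch time rather than completed work. A performant implementation relies on tiling, where the matrix is divided into smaller submatrices (tiles) that fit more effectively within the GPU's memory hierarchy, including shared memory and cache. By processing these tiles independently, the GPU can optimize memory access patterns and minimize data transfer overhead.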

In a tiled implementation, each thread block on the GPU handles a specific tile of the output matrix, loading portions of the input tiles into shared memory to reduce the repeated global memory access. This approach enables a higher level of parallelism by allowing multiple tiles to be processed concurrently across the GPU cores.
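The tiling idea described above can be sketched with CUDA.jl's static shared memory. This is a minimal illustration under stated assumptions, not a tuned implementation: `tiled_matmul!`, `TILE`, `X`, `Y`, and `Z` are hypothetical names, the tile width of 16 is an arbitrary choice, and the launch must use exactly `TILE × TILE` threads per block:

```julia
using CUDA

const TILE = 16  # tile width (assumed); must equal the threads per block in each dimension

function tiled_matmul!(C::AbstractMatrix{T}, A, B) where {T}
    tx, ty = threadIdx().x, threadIdx().y
    row = (blockIdx().x - 1) * TILE + tx
    col = (blockIdx().y - 1) * TILE + ty

    # One tile of A and one tile of B, shared by all threads of the block
    tileA = CuStaticSharedArray(T, (TILE, TILE))
    tileB = CuStaticSharedArray(T, (TILE, TILE))

    acc = zero(T)
    K = size(A, 2)
    for t in 0:(cld(K, TILE) - 1)
        ka = t * TILE + ty   # column of A loaded by this thread
        kb = t * TILE + tx   # row of B loaded by this thread
        # Load one element of each tile, padding with zeros outside the matrices
        tileA[tx, ty] = (row <= size(A, 1) && ka <= K) ? A[row, ka] : zero(T)
        tileB[tx, ty] = (kb <= K && col <= size(B, 2)) ? B[kb, col] : zero(T)
        sync_threads()       # wait until both tiles are fully loaded
        for k in 1:TILE
            acc += tileA[tx, k] * tileB[k, ty]
        end
        sync_threads()       # wait before the tiles are overwritten
    end

    if row <= size(C, 1) && col <= size(C, 2)
        C[row, col] = acc
    end
    return
end

n = 100  # deliberately not a multiple of TILE, to exercise the bounds checks
X = CUDA.rand(Float64, n, n)
Y = CUDA.rand(Float64, n, n)
Z = CUDA.zeros(Float64, n, n)

CUDA.@sync @cuda threads=(TILE, TILE) blocks=(cld(n, TILE), cld(n, TILE)) tiled_matmul!(Z, X, Y)
println(Array(Z) ≈ Array(X) * Array(Y))
```

Each element of `X` and `Y` is read from global memory once per tile instead of once per output element, which is the saving the tiled approach aims for. Whether this sketch actually outperforms the naive kernel depends on the tile size and hardware and would need to be benchmarked.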

# Kernel Abstraction

Let us explore the naive matrix multiplication example using KernelAbstractions.jl.

In [26]:
import Pkg
Pkg.add("KernelAbstractions")

   Resolving package versions...
  No Changes to `/global/u2/t/train920/.julia/environments/v1.9/Project.toml`
  No Changes to `/global/u2/t/train920/.julia/environments/v1.9/Manifest.toml`


In [27]:
using KernelAbstractions
using Random

Note how a kernel is written in KernelAbstractions.jl. There are minimal changes compared to the vendor-specific package (CUDA.jl), and the `@index` macro abstracts away the index calculations.

In [28]:
@kernel function MatrixMultiplication_kernel!(A, B, C)
    # Global (row, col) index of this work-item across all workgroups
    row, col = @index(Global, NTuple)

    sum = zero(eltype(C))

    if row <= size(A, 1) && col <= size(B, 2)
        for i = 1:size(A, 2)
            @inbounds sum += A[row, i] * B[i, col]
        end
        @inbounds C[row, col] = sum
    end
end


MatrixMultiplication_kernel! (generic function with 4 methods)

In [29]:
Backend =  CUDA.CUDABackend()
matrix_size = 2048
T = Float64
DA = rand!(allocate(Backend, T, matrix_size, matrix_size))
DB = rand!(allocate(Backend, T, matrix_size, matrix_size))
DC = KernelAbstractions.zeros(Backend, T, matrix_size, matrix_size)

workgroupsize = (32, 32)

kernel! = MatrixMultiplication_kernel!(Backend, workgroupsize)
kernel!(DA, DB, DC, ndrange=(size(DC)))
KernelAbstractions.synchronize(Backend)

isapprox(DC, DA * DB)

true
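One consequence of this abstraction is that the same kernel source can target a different backend. The sketch below assumes only KernelAbstractions.jl and runs a small vector-add kernel on the CPU backend; `vadd_ka!`, `x`, `y`, and `z` are hypothetical names, and the workgroup size of 256 is an arbitrary choice:

```julia
using KernelAbstractions

@kernel function vadd_ka!(c, @Const(a), @Const(b))
    # Linear global index of this work-item; no manual block arithmetic needed
    i = @index(Global, Linear)
    @inbounds c[i] = a[i] + b[i]
end

backend = CPU()   # the notebook's CUDA.CUDABackend() could be used with GPU arrays instead
n = 10_240
x = rand(1:4, n)
y = rand(1:4, n)
z = KernelAbstractions.zeros(backend, Int, n)

# Instantiate the kernel for the backend and workgroup size, then launch over n items
vadd_ka!(backend, 256)(z, x, y; ndrange = n)
KernelAbstractions.synchronize(backend)

println(z == x + y)
```

Only the `backend` value and the array types change between CPU and GPU runs; the kernel body stays the same.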

In [30]:
@benchmark begin
    kernel!(DA, DB, DC, ndrange=(size(DC)))
    KernelAbstractions.synchronize(Backend)
end


BenchmarkTools.Trial: 348 samples with 1 evaluation.
 Range (min … max):  13.950 ms … 14.059 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     14.013 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.012 ms ± 20.283 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

# Memory copy with KernelAbstractions

Here is another example that shows how to use shared memory in KernelAbstractions. This kernel performs a matrix copy through local memory (known as shared memory in CUDA). Staging data in local memory can reduce global memory traffic in kernels that reuse the staged data; a plain copy does not reuse anything, so this example only illustrates the mechanics.

In [31]:
@kernel function lmem_copy_kernel!(output, @Const(input))

    # Gets the global index of the thread in a multidimensional grid, which is used to index into the global input and output arrays.
    I, J= @index(Global, NTuple) 
    # Gets the local index within a thread block or workgroup, useful for indexing into locally shared memory.
    i, j = @index(Local, NTuple) # Local index of thread

    #@groupsize() retrieves the dimensions of the thread block or workgroup. 
    #The @uniform ensures that these values are treated as constants that are the same for all threads.
    N = @uniform @groupsize()[1] # blockDim.x 
    M = @uniform @groupsize()[2] # blockDim.y
    
    tile = @localmem eltype(output) (N, M) # Allocate local (shared) memory

    #First, data from the global input array is loaded into the shared tile array using local indices.
    @inbounds tile[i, j] = input[I, J]

    #@synchronize ensures that all threads in the workgroup have completed their memory writes to the shared memory before proceeding. 
    #This is crucial to prevent race conditions.
    @synchronize

    #Finally, the data is written back from the shared tile array to the global output array.
    @inbounds output[I, J] = tile[i, j]

end


lmem_copy_kernel! (generic function with 4 methods)

In [32]:
input = rand!(allocate(Backend, T, matrix_size, matrix_size))
output = KernelAbstractions.zeros(Backend, T, matrix_size, matrix_size)

const lmem_copy! = lmem_copy_kernel!(Backend, workgroupsize)
lmem_copy!(output, input, ndrange=size(input))
KernelAbstractions.synchronize(Backend)

input == output


true