# Grid of resistors (GPU version)
## Stencil computation with CUDAnative

In this notebook you will learn how to run a few simple GPU kernels in Julia by using the `CUDAnative` package. The exercise introduces a GPU translation of the "Grid of resistors" code as an example of how to write GPU kernels in Julia. The purpose is not to write the simplest or fastest GPU stencil code but to introduce GPU kernel computations in Julia.

First, load some packages that will be used in the exercise.

In [10]:
using CUDAnative # Compile Julia programs to GPUs
using CuArrays   # Arrays on GPU
using BenchmarkTools

As explained in the "π in many ways" notebook, the command `nvidia-smi` can be used to list the available NVIDIA GPUs.

In [11]:
;nvidia-smi

Wed Jul  4 04:10:42 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.59                 Driver Version: 390.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 23%   24C    P8    15W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:0E:00.0 Off |                  N/A |
| 23%   26C    P8     8W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:0F:00.0 Off |                    0 |
| N/A   

Now we allocate two GPU arrays. `A` will contain random values, except at the boundary, where it is zero at both ends.

In [13]:
A = CuArray(zeros(Float32, 16))
B = CuArray(zeros(Float32, 16))
rand!(A[2:15])
@show A
@show B;

A = CuArray(Float32[0.0, 0.431209, 0.820723, 0.234987, 0.00420462, 0.461162, 0.346231, 0.777714, 0.508115, 0.864385, 0.579559, 0.124908, 0.347255, 0.0388117, 0.874848, 0.0])
B = CuArray(Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])


Below is a first example of a stencil kernel.

In [14]:
function stencil_kernel!(B, A)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i > 1 && i < 16
        B[i] = A[i-1] - 2*A[i] + A[i+1]    
    end
    
    return nothing
end

stencil_kernel! (generic function with 1 method)
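For checking results, it can help to have a plain CPU version of the same second-difference stencil. The sketch below is an illustration only: `stencil_cpu!`, `A_cpu` and `B_cpu` are hypothetical names that do not appear elsewhere in the notebook, and the loop bounds use `length(A)` rather than the hard-coded 16 of the kernel.

```julia
# CPU reference for the second-difference stencil (hypothetical helper).
# Boundary entries of B are left untouched, as in the GPU kernel.
function stencil_cpu!(B, A)
    for i in 2:(length(A) - 1)
        B[i] = A[i-1] - 2*A[i] + A[i+1]
    end
    return B
end

A_cpu = Float32[0, 1, 4, 9, 16, 25, 0]
B_cpu = zeros(Float32, length(A_cpu))
stencil_cpu!(B_cpu, A_cpu)
println(B_cpu)  # Float32[0.0, 2.0, 2.0, 2.0, 2.0, -34.0, 0.0]
```

On a machine with a GPU, the output of the kernel could be copied back with `Array(B)` and compared to the result of this loop applied to `Array(A)`.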

**Exercise** If you are not familiar with CUDA programming then try to explain the code. What is the value of `i`?

We will now launch the kernel on the GPU by using the `@cuda` macro. The first argument to the macro is a tuple of the grid dimension and the block dimension. Each is allowed to be either a scalar or a tuple of length two or three, but we will only use scalars in this notebook.
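To see how a (grid, block) pair translates into the index `i` used in the kernel, the formula can be replayed on the CPU. This is only a sketch: `global_indices` is a hypothetical helper, not part of `CUDAnative`, and on a real GPU the threads run concurrently, so the order of execution is not guaranteed.

```julia
# CPU-only sketch: enumerate the global index each GPU thread would compute.
# `global_indices` is a hypothetical helper, not a CUDAnative function.
function global_indices(griddim, blockdim)
    idx = Int[]
    for blockidx in 1:griddim, threadidx in 1:blockdim
        # Same formula as in the kernel, with 1-based block and thread indices
        push!(idx, (blockidx - 1) * blockdim + threadidx)
    end
    return idx
end

for (griddim, blockdim) in ((1, 16), (2, 8), (4, 4))
    println((griddim, blockdim), " => ", global_indices(griddim, blockdim))
end
```

All three configurations cover the indices 1 to 16 exactly once; only the split into blocks and threads differs.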

**Exercise** Try running `@cuda (1, 16) stencil_kernel!(B, A)` and print `B`. Explain what has happened.

In [18]:
@cuda (1, 16) stencil_kernel!(B, A)
B

16-element CuArray{Float32,1}:
   0.0     
   4.04747 
  -6.41678 
   2.66617 
   4.56534 
  -8.71407 
   9.68398 
 -10.1086  
   9.50052 
  -5.49083 
   0.704027
   0.676997
  -0.530789
   1.14448 
  -1.71088 
   0.0     

**Exercise** Add these two lines to the kernel:
```julia
@cuprintf("blockIdx().x = %d, blockDim().x = %d, threadIdx().x = %d, i = %d\n",
    Int32(blockIdx().x), Int32(blockDim().x), Int32(threadIdx().x), Int32(i))
```
Now launch the kernel with combinations of grid and block sizes for which the product is 16. Explain.

In [22]:
function stencil_kernel!(B, A)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    @cuprintf("blockIdx().x = %d, blockDim().x = %d, threadIdx().x = %d, i = %d\n",
        Int32(blockIdx().x), Int32(blockDim().x), Int32(threadIdx().x), Int32(i))
    if i > 1 && i < 16
        B[i] = A[i-1] - 2*A[i] + A[i+1]    
    end
    
    return nothing
end

stencil_kernel! (generic function with 1 method)

In [26]:
@cuda (1, 16) stencil_kernel!(B, A)
@cuda (2, 8) stencil_kernel!(B, A)
@cuda (4, 4) stencil_kernel!(B, A)

blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 1, i = 1
blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 2, i = 2
blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 3, i = 3
blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 4, i = 4
blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 5, i = 5
blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 6, i = 6
blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 7, i = 7
blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 8, i = 8
blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 9, i = 9
blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 10, i = 10
blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 11, i = 11
blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 12, i = 12
blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 13, i = 13
blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 14, i = 14
blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 15, i = 15
blockIdx().x = 1, blockDim().x = 16, threadIdx().x = 16, i

We will now turn to the grid of resistors computation. First we repeat the original version for easy reference and comparison of results.

In [53]:
function compute_resistance(n, nreps = 100)
    # Original MATLAB version, Alan Edelman, January 1994
    # Julia translations, Andreas Noack, June 2018

    # assume n and omega already defined or take
    # the following values for the optimal omega
    μ = (cos(π/(2*n)) + cos(π/(2*n + 1)))/2
    ω = 2*(1 - sqrt(1 - μ^2))/μ^2
    # (See page 409 of Strang Intro to Applied Math , this is equation 16)

    # Initialize voltages
    v = zeros(Float32, 2*n + 1, 2*n + 2)

    # Define Input Currents
    b = copy(v)
    b[n + 1, (n + 1):(n + 2)]  = [1 -1]

    # Makes indices easy to read
    ie = 2:2:(2*n)      # even i's
    io = 3:2:(2*n - 1)  # odd i's
    je = 2:2:(2*n)      # even j's
    jo = 3:2:(2*n + 1)  # odd j's

    # Jacobi Steps
    for k in 1:nreps
        v[ie, je] = (1 - ω) * v[ie,je] +
                      ω*(v[ie + 1, je] + v[ie - 1, je] + v[ie, je + 1] + v[ie, je - 1] + b[ie, je])/4
        v[io, jo] = (1 - ω) * v[io, jo] +
                      ω*(v[io + 1, jo] + v[io - 1, jo] + v[io, jo + 1] + v[io, jo - 1] + b[io, jo])/4
        v[ie, jo] = (1 - ω) * v[ie, jo] +
                      ω*(v[ie + 1, jo] + v[ie - 1, jo] + v[ie, jo + 1] + v[ie, jo - 1] + b[ie, jo])/4
        v[io, je] = (1 - ω) * v[io, je] +
                      ω*(v[io + 1, je] + v[io - 1, je] + v[io, je + 1] + v[io, je - 1] + b[io, je])/4
    end
    # Compute resistance = v_A - v_b = 2 v_A
    r = 2*v[n + 1, n + 1]
    return v, r
end

compute_resistance (generic function with 2 methods)
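The relaxation parameter ω is computed from `n` alone, so it can be inspected separately. The sketch below simply evaluates the two formulas from `compute_resistance`; `optimal_omega` is a hypothetical helper name introduced for illustration.

```julia
# Evaluate the relaxation parameter used in compute_resistance.
# `optimal_omega` is a hypothetical helper wrapping the two formulas above.
function optimal_omega(n)
    μ = (cos(π/(2*n)) + cos(π/(2*n + 1)))/2
    ω = 2*(1 - sqrt(1 - μ^2))/μ^2
    return ω
end

for n in (3, 10, 100)
    println("n = ", n, ", ω = ", optimal_omega(n))
end
```

The values lie between 1 and 2 and move towards 2 as `n` grows.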

Below is a first GPU version.

In [55]:
function stencil(v, b, ω, i, j)
    vij = (1 - ω)*v[i, j] +
        ω*(v[i + 1, j] + v[i - 1, j] + v[i, j + 1] + v[i, j - 1] + 
            b[i, j])/4
    return vij
end

function stencil_inner_kernel!(v, b, ω, n)
    li = (blockIdx().x-1) * blockDim().x + threadIdx().x
    mm, nn = size(v)
    i, j = ind2sub(v, li)
    
    # even-even
    if iseven(i) && i <= 2n &&
       iseven(j) && j <= 2n
       
        vij = stencil(v, b, ω, i, j)
        v[i, j] = vij
    end
    sync_threads()
    
    # odd-odd
    if isodd(i) && 1 < i <= (2n - 1) &&
       isodd(j) && 1 < j <= (2n + 1)
        
        vij = stencil(v, b, ω, i, j)
        v[i, j] = vij
    end
    sync_threads()
    
    # even-odd
    if iseven(i) && i <= 2n &&
       isodd(j)  && 1 < j <= (2n + 1)
        
        vij = stencil(v, b, ω, i, j)
        v[i, j] = vij
    end
    sync_threads()
    
    # odd-even
    if isodd(i)  && 1 < i <= (2n - 1) &&
       iseven(j) && j <= 2n
        
        vij = stencil(v, b, ω, i, j)
        v[i, j] = vij
    end    
    sync_threads()

    return nothing
end

function stencil_outer_kernel!(v, b, ω, n, iter)
    for i in 1:iter
        stencil_inner_kernel!(v, b, ω, n)
    end
    
    return nothing
end

function compute_resistance_gpu_bad(n, nreps = 100, blocksize = 64)
    # assume n and omega already defined or take
    # the following values for the optimal omega
    μ = (cos(π/(2*n)) + cos(π/(2*n + 1)))/2
    ω = 2*(1 - sqrt(1 - μ^2))/μ^2
    # (See page 409 of Strang Intro to Applied Math , this is equation 16)

    # Initialize voltages
    v = fill!(CuArray{Float32}(2*n + 1, 2*n + 2), 0)

    # Define Input Currents
    b = copy(v)
    b[n + 1, (n + 1):(n + 2)]  = [1 -1]
    
    # Jacobi Steps
    @show mn = length(v)
    @show gridsize = div(mn, blocksize) + 1
    @show blocksize
    @cuda (gridsize, blocksize) stencil_outer_kernel!(v, b, ω, n, nreps)

    # Compute resistance = v_A - v_b = 2 v_A
    r = 2*v[n + 1, n + 1]
    return v, r
end

compute_resistance_gpu_bad (generic function with 3 methods)

**Exercise** Read the code and try to understand what is going on.

**Exercise** Launch the *bad* kernel with `n=3` and compare to the original version. Next try with `n=4`. Can you explain what happens?

In [66]:
compute_resistance_gpu_bad(3)[2]

mn = length(v) = 56
gridsize = div(mn, blocksize) + 1 = 1
blocksize = 64


0.4876364f0

In [65]:
compute_resistance_gpu_bad(4)[2]

mn = length(v) = 90
gridsize = div(mn, blocksize) + 1 = 2
blocksize = 64


0.516162f0

Note that `sync_threads()` only synchronizes the threads within a single block. To force synchronization across the blocks of a grid, we need to exit the CUDA kernel.
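The difference between `n=3` and `n=4` can be made visible on the CPU by computing which block each updated grid point falls into. This is a sketch under the launch parameters of `compute_resistance_gpu_bad` (block size 64); `block_of` and `blocks_used` are hypothetical helpers, and `CartesianIndices` is used as the Julia 1.x counterpart of `ind2sub`.

```julia
# CPU-only sketch: which thread blocks touch the interior grid points?
# `block_of` and `blocks_used` are hypothetical helpers.
block_of(li, blocksize) = div(li - 1, blocksize) + 1

function blocks_used(n, blocksize = 64)
    dims = (2n + 1, 2n + 2)
    cart = CartesianIndices(dims)      # Julia 1.x counterpart of ind2sub
    used = Set{Int}()
    for li in 1:prod(dims)
        i, j = Tuple(cart[li])
        # Interior points are the ones the stencil kernels update
        if 1 < i <= 2n && 1 < j <= 2n + 1
            push!(used, block_of(li, blocksize))
        end
    end
    return sort!(collect(used))
end

println(blocks_used(3))  # [1]
println(blocks_used(4))  # [1, 2]
```

With `n=4` the updated points are spread over two blocks, and threads in different blocks are not coordinated by `sync_threads()`.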

**Exercise** Discuss how the previous program could be modified to force synchronization.

Launching a CUDA kernel has some overhead, so we'd like to minimize the number of launches.

**Exercise** Define the two kernel functions `stencil_kernel_eeoo!(v, b, ω, n)` and `stencil_kernel_eooe!(v, b, ω, n)` from the source of `stencil_inner_kernel!`, where `eeoo` means the "even-even and odd-odd" loop indices and `eooe` means the "even-odd and odd-even" ones.
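Why the four updates can be paired this way can be checked on the CPU: every even-even or odd-odd point has `i + j` even, and its four stencil neighbours all have `i + j` odd, so the points within one pair never read each other. The sketch below verifies this for the interior points; `group` and `neighbours_in_other_group` are hypothetical helpers introduced for illustration.

```julia
# CPU-only sketch: points updated by the same kernel never read each other.
# `group` and `neighbours_in_other_group` are hypothetical helpers.
group(i, j) = iseven(i + j) ? :eeoo : :eooe

function neighbours_in_other_group(n)
    # Loop over the interior points that the stencil kernels update
    for i in 2:2n, j in 2:(2n + 1)
        for (di, dj) in ((1, 0), (-1, 0), (0, 1), (0, -1))
            # A neighbour in the same group would be a read/write conflict
            group(i + di, j + dj) == group(i, j) && return false
        end
    end
    return true
end

println(neighbours_in_other_group(3))  # true
```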

In [49]:
function stencil_kernel_eeoo!(v, b, ω, n)
    li = (blockIdx().x-1) * blockDim().x + threadIdx().x
    mm, nn = size(v)
    i, j = ind2sub(v, li)
    
    # even-even
    if iseven(i) && i <= 2n &&
       iseven(j) && j <= 2n
       
        vij = stencil(v, b, ω, i, j)
        v[i, j] = vij
    end
    
    # odd-odd
    if isodd(i) && 1 < i <= (2n - 1) &&
       isodd(j) && 1 < j <= (2n + 1)
        
        vij = stencil(v, b, ω, i, j)
        v[i, j] = vij
    end

    return nothing
end

function stencil_kernel_eooe!(v, b, ω, n)
    li = (blockIdx().x-1) * blockDim().x + threadIdx().x
    mm, nn = size(v)
    i, j = ind2sub(v, li)
    
    # even-odd
    if iseven(i) && i <= 2n &&
       isodd(j)  && 1 < j <= (2n + 1)
        
        vij = stencil(v, b, ω, i, j)
        v[i, j] = vij
    end
    
    # odd-even
    if isodd(i)  && 1 < i <= (2n - 1) &&
       iseven(j) && j <= 2n
        
        vij = stencil(v, b, ω, i, j)
        v[i, j] = vij
    end    

    return nothing
end

stencil_kernel_eooe! (generic function with 1 method)

**Exercise** Add the two kernel functions to the loop with the right block and grid sizes. Verify that the result is correct.

In [None]:
function compute_resistance_gpu_good(n, nreps = 100, blocksize = 32)
    # assume n and omega already defined or take
    # the following values for the optimal omega
    μ = (cos(π/(2*n)) + cos(π/(2*n + 1)))/2
    ω = 2*(1 - sqrt(1 - μ^2))/μ^2
    # (See page 409 of Strang Intro to Applied Math , this is equation 16)

    # Initialize voltages
    v = fill!(CuArray{Float32}(2*n + 1, 2*n + 2), 0)

    # Define Input Currents
    b = copy(v)
    b[n + 1, (n + 1):(n + 2)]  = [1 -1]
    
    # Jacobi Steps
    mn = length(v)
    gridsize = div(mn, blocksize) + 1
    for i in 1:nreps
        @cuda (gridsize, blocksize) stencil_kernel_eeoo!(v, b, ω, n)
        @cuda (gridsize, blocksize) stencil_kernel_eooe!(v, b, ω, n)
    end

    # Compute resistance = v_A - v_b = 2 v_A
    r = 2*v[n + 1, n + 1]
    return v, r
end

**Exercise** Compare timings to the original version.