# July 31st Individual Meeting: Rework on D2y_GPU() function

## What slows down D2y_GPU()?

```
function D2y_GPU(d_u, d_y, Nx, Ny, h, ::Val{TILE_DIM}) where {TILE_DIM}
	# tidx = ((blockIdx().x - 1) * TILE_DIM + threadIdx().x - 1)*Ny + 1
	tidx = (blockIdx().x - 1) * TILE_DIM + threadIdx().x
	N = Nx*Ny
	# d_y = zeros(N)
	t1 = (tidx - 1)*Ny + 1
	if 1 <= t1 <= N - Ny + 1
		d_y[t1] = (d_u[t1] - 2 * d_u[t1 + 1] + d_u[t1 + 2]) / h^2
	end
	sync_threads()

	t2 = tidx * Ny
	if Ny <= t2 <= N
		d_y[t2] = (d_u[t2 - 2] - 2 * d_u[t2 - 1] + d_u[t2]) / h^2
	end

	sync_threads()

	for j = 1:Nx
		if 2 + (j-1)*Ny <= tidx <= j*Ny-1
			d_y[tidx] = (d_u[tidx - 1] - 2 * d_u[tidx] + d_u[tidx + 1]) / h^2
		end
	end
	sync_threads()

	nothing
end
```
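This works, but it is extremely slow. With the `for` loop removed, the memory throughput of this code exceeds 100 GB/s, so the loop is the bottleneck: every thread runs through all `Nx` iterations just to find out whether `tidx` is an interior point of some column. The loop should be rewritten for better performance on the GPU.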
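As a sketch of what the kernel is meant to compute, here is a plain CPU version, together with a check that the loop test can be replaced by a row index computed with `mod`. The names `D2y_ref`, `in_interior_loop` and `in_interior_mod` are made up for this note, and the code assumes the grid is stored column by column (linear index `i = (col - 1) * Ny + row`) with `Ny >= 3`.

```julia
# CPU reference for D2y on a grid stored column by column,
# so that linear index i = (col - 1) * Ny + row.
function D2y_ref(u::AbstractVector, Nx::Integer, Ny::Integer, h::Real)
    y = similar(u)
    for col in 1:Nx
        top = (col - 1) * Ny + 1   # first row of this column
        bot = col * Ny             # last row of this column
        y[top] = (u[top] - 2u[top + 1] + u[top + 2]) / h^2      # one-sided stencil
        for i in (top + 1):(bot - 1)
            y[i] = (u[i - 1] - 2u[i] + u[i + 1]) / h^2          # centered stencil
        end
        y[bot] = (u[bot - 2] - 2u[bot - 1] + u[bot]) / h^2      # one-sided stencil
    end
    return y
end

# The loop test from D2y_GPU(): is i strictly inside some column j?
in_interior_loop(i, Nx, Ny) = any(2 + (j - 1) * Ny <= i <= j * Ny - 1 for j in 1:Nx)

# The same question answered without a loop, from the row index.
function in_interior_mod(i, Nx, Ny)
    row = mod(i - 1, Ny) + 1
    return 1 <= i <= Nx * Ny && 2 <= row <= Ny - 1
end
```

Both tests agree for every index, including indices outside `1:Nx*Ny`, so a thread can decide which stencil to use from `mod` alone instead of scanning all columns.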

## First rework

```
function D2y_GPU_v3(d_u, d_y, Nx, Ny, h, ::Val{TILE_DIM}) where {TILE_DIM}
	tidx = (blockIdx().x - 1) * TILE_DIM + threadIdx().x
	N = Nx*Ny
	if 1 <= tidx <= N && 1 <= tidx <= N
		d_y[tidx] = (d_u[tidx] - 2d_u[tidx+1] + d_u[tidx + 2]) / h^2
	elseif mod(tidx,Ny) == 0
		d_y[tidx] = (d_u[tidx] - 2d_u[tidx-1] + d_u[tidx - 2]) / h^2
	else
		d_y[tidx] = (d_u[tidx-1] - 2d_u[tidx] + d_u[tidx + 1]) / h^2
	end
	nothing

end
```

This cannot be executed. Error output:
```
julia> tester_D2y(20)
ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.
ERROR: KernelException: exception thrown during kernel execution on device GeForce GTX 1660 Ti
Stacktrace:
 [1] check_exceptions() at C:\Users\cheny\.juliapro\JuliaPro_v1.4.2-1\packages\CUDA\5t6R9\src\compiler\exceptions.jl:84
 [2] prepare_cuda_call() at C:\Users\cheny\.juliapro\JuliaPro_v1.4.2-1\packages\CUDA\5t6R9\src\state.jl:37
 [3] initialize_api() at C:\Users\cheny\.juliapro\JuliaPro_v1.4.2-1\packages\CUDA\5t6R9\lib\cuda\error.jl:98
 [4] macro expansion at C:\Users\cheny\.juliapro\JuliaPro_v1.4.2-1\packages\CUDA\5t6R9\lib\cuda\libcuda.jl:502 [inlined]
 [5] macro expansion at C:\Users\cheny\.juliapro\JuliaPro_v1.4.2-1\packages\CUDA\5t6R9\lib\cuda\error.jl:108 [inlined]
 [6] cuMemcpyDtoH_v2(::Ptr{Float64}, ::CuPtr{Float64}, ::Int64) at C:\Users\cheny\.juliapro\JuliaPro_v1.4.2-1\packages\CUDA\5t6R9\lib\utils\call.jl:93
 [7] #unsafe_copyto!#6 at C:\Users\cheny\.juliapro\JuliaPro_v1.4.2-1\packages\CUDA\5t6R9\lib\cuda\memory.jl:324 [inlined]
 [8] unsafe_copyto! at C:\Users\cheny\.juliapro\JuliaPro_v1.4.2-1\packages\CUDA\5t6R9\lib\cuda\memory.jl:317 [inlined]
 [9] unsafe_copyto! at C:\Users\cheny\.juliapro\JuliaPro_v1.4.2-1\packages\CUDA\5t6R9\src\array.jl:309 [inlined]
 [10] copyto! at C:\Users\cheny\.juliapro\JuliaPro_v1.4.2-1\packages\CUDA\5t6R9\src\array.jl:284 [inlined]
 [11] copyto! at C:\Users\cheny\.juliapro\JuliaPro_v1.4.2-1\packages\GPUArrays\JqOUg\src\host\abstractarray.jl:102 [inlined]
 [12] collect(::CuArray{Float64,1,Nothing}) at C:\Users\cheny\.juliapro\JuliaPro_v1.4.2-1\packages\CUDA\5t6R9\src\array.jl:264
 [13] tester_D2y(::Int64) at C:\Users\cheny\OneDrive\Documents\version-control\Poisson_Julia\original_src\deriv_ops_GPU.jl:378
 [14] top-level scope at none:0
 ```
 
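At first it seemed that `if`/`elseif` chains are not convenient on the GPU, so I should try something else. Looking at the conditions again, though, a more likely cause of the exception is an out-of-bounds access: the first test, `1 <= tidx <= N && 1 <= tidx <= N`, is true for every thread with `tidx <= N`, so the forward stencil runs everywhere and reads `d_u[tidx + 2]` past the end of the array for `tidx = N - 1` and `tidx = N`. Any threads with `tidx > N` fall through to the other branches and index out of bounds as well.

Still, on the GPU it is often easier to do a duplicate calculation than to add more conditional statements.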
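One thing worth checking before giving up on branches is whether every thread stays within bounds. Below is a minimal sketch of a guarded branch version, written as plain Julia so it runs on the CPU. `d2y_thread!` and `emulate_launch!` are hypothetical names: the first plays the role of the kernel body, the second stands in for a launch with a few more threads than elements. Column-by-column storage and `Ny >= 3` are assumed.

```julia
# Hypothetical per-thread body: one thread per element, one write per thread.
function d2y_thread!(d_y, d_u, tidx, Nx, Ny, h)
    N = Nx * Ny
    if !(1 <= tidx <= N)
        return nothing                        # surplus threads do nothing
    elseif mod(tidx, Ny) == 1                 # first row of a column
        d_y[tidx] = (d_u[tidx] - 2d_u[tidx + 1] + d_u[tidx + 2]) / h^2
    elseif mod(tidx, Ny) == 0                 # last row of a column
        d_y[tidx] = (d_u[tidx] - 2d_u[tidx - 1] + d_u[tidx - 2]) / h^2
    else                                      # interior point
        d_y[tidx] = (d_u[tidx - 1] - 2d_u[tidx] + d_u[tidx + 1]) / h^2
    end
    return nothing
end

# CPU stand-in for a launch that has more threads than elements.
function emulate_launch!(d_y, d_u, Nx, Ny, h; nthreads = Nx * Ny + 7)
    for tidx in 1:nthreads
        d2y_thread!(d_y, d_u, tidx, Nx, Ny, h)
    end
    return d_y
end
```

In a real kernel `tidx` would be computed from `blockIdx()` and `threadIdx()` as in `D2y_GPU_v3`; whether the branches cost anything measurable on the GPU would still have to be timed.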
 
 
 
## Second rework

```
function D2y_GPU_v4(d_u, d_y, Nx, Ny, h, ::Val{TILE_DIM}) where {TILE_DIM}
	tidx = (blockIdx().x - 1) * TILE_DIM + threadIdx().x
	N = Nx*Ny
	if 2 <= tidx <= N-1
		d_y[tidx] = (d_u[tidx-1] - 2d_u[tidx] + d_u[tidx + 1]) / h^2
	end
	sync_threads()

	tb = tidx * Ny
	if 1 <= tb <= N
		d_y[tb] = (d_u[tb] - 2d_u[tb - 1] + d_u[tb - 2]) / h^2
	end
	sync_threads()

	te = (tidx-1) * Ny + 1
	if 1 <= te <= N
		d_y[te] = (d_u[te] - 2d_u[te + 1] + d_u[te + 2]) / h^2
	end
	sync_threads()
	nothing
end
```

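This works when `Ny` is small (<= 111), but some odd bugs appear when `Ny` is larger (>= 112): some of the updates at `tb` and `te` look as if they were never executed. My guess was that it has something to do with missing threads due to the operation `tb = tidx * Ny`. A more specific possibility is a write race: thread `tidx` writes the boundary value to `d_y[tidx * Ny]`, while thread `tidx * Ny`, usually in a different block, writes the interior stencil to the same element, and `sync_threads()` only synchronizes the threads within one block. If that other block runs later, it overwrites the boundary value. So I tried the version below, in which every thread writes only `d_y[tidx]`, and it worked without bugs: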

```
function D2y_GPU_v4(d_u, d_y, Nx, Ny, h, ::Val{TILE_DIM}) where {TILE_DIM}
	tidx = (blockIdx().x - 1) * TILE_DIM + threadIdx().x
	N = Nx*Ny
	if 2 <= tidx <= N-1
		d_y[tidx] = (d_u[tidx-1] - 2d_u[tidx] + d_u[tidx + 1]) / h^2
	end


	if 1 <= tidx <= N && mod(tidx,Ny) == 0
		d_y[tidx] = (d_u[tidx] - 2d_u[tidx - 1] + d_u[tidx - 2]) / h^2
	end

	if 1 <= tidx <= N && mod(tidx,Ny) == 1
		d_y[tidx] = (d_u[tidx] - 2d_u[tidx + 1] + d_u[tidx + 2]) / h^2
	end
	sync_threads()
	nothing
end
```
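One way to see why the two versions of `D2y_GPU_v4` can behave differently is to replay their per-thread bodies on the CPU in different orders. The sketch below is only an emulation: it runs each "thread" to completion one after another, which is not how a GPU schedules work, and the names `v4_scatter!`, `v4_local!` and `run_in_order` are made up for this note. It does show that the first version's result depends on which thread writes last, while the second version's does not.

```julia
# First version of v4: thread tidx also writes d_y[tidx * Ny] and d_y[(tidx - 1) * Ny + 1].
function v4_scatter!(d_y, d_u, tidx, Nx, Ny, h)
    N = Nx * Ny
    if 2 <= tidx <= N - 1
        d_y[tidx] = (d_u[tidx - 1] - 2d_u[tidx] + d_u[tidx + 1]) / h^2
    end
    tb = tidx * Ny
    if 1 <= tb <= N
        d_y[tb] = (d_u[tb] - 2d_u[tb - 1] + d_u[tb - 2]) / h^2
    end
    te = (tidx - 1) * Ny + 1
    if 1 <= te <= N
        d_y[te] = (d_u[te] - 2d_u[te + 1] + d_u[te + 2]) / h^2
    end
    return nothing
end

# Second version of v4: thread tidx only ever writes d_y[tidx].
function v4_local!(d_y, d_u, tidx, Nx, Ny, h)
    N = Nx * Ny
    if 2 <= tidx <= N - 1
        d_y[tidx] = (d_u[tidx - 1] - 2d_u[tidx] + d_u[tidx + 1]) / h^2
    end
    if 1 <= tidx <= N && mod(tidx, Ny) == 0
        d_y[tidx] = (d_u[tidx] - 2d_u[tidx - 1] + d_u[tidx - 2]) / h^2
    end
    if 1 <= tidx <= N && mod(tidx, Ny) == 1
        d_y[tidx] = (d_u[tidx] - 2d_u[tidx + 1] + d_u[tidx + 2]) / h^2
    end
    return nothing
end

# Run every "thread" to completion, one after another, in the given order.
function run_in_order(body!, order, d_u, Nx, Ny, h)
    d_y = fill(NaN, Nx * Ny)
    for tidx in order
        body!(d_y, d_u, tidx, Nx, Ny, h)
    end
    return d_y
end

Nx, Ny, h = 4, 5, 0.5
N = Nx * Ny
# Quadratic in the row plus a per-column offset: the exact answer is 2 everywhere,
# and any stencil that straddles two columns gives something else.
d_u = [(mod(i - 1, Ny) * h)^2 + 10 * fld(i - 1, Ny) for i in 1:N]

late_first  = run_in_order(v4_scatter!, N:-1:1, d_u, Nx, Ny, h)
early_first = run_in_order(v4_scatter!, 1:N, d_u, Nx, Ny, h)
local_up    = run_in_order(v4_local!, 1:N, d_u, Nx, Ny, h)
local_down  = run_in_order(v4_local!, N:-1:1, d_u, Nx, Ny, h)

println("scatter, high tidx first: ", all(late_first .≈ 2.0))
println("scatter, low tidx first:  ", all(early_first .≈ 2.0))
println("local, either order:      ", all(local_up .≈ 2.0) && all(local_down .≈ 2.0))
```

This prints `true`, `false`, `true`. With low `tidx` first, thread 1 writes the boundary value to `d_y[5]` and thread 5 later overwrites it with the interior stencil. That ordering would resemble a grid too large to be resident all at once, where late blocks start after early ones have finished, which may be why the problem only showed up for larger `Ny`.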

## Further optimization with loop fusion

```
function D2y_GPU_v5(d_u, d_y, Nx, Ny, h, ::Val{TILE_DIM}) where {TILE_DIM}
	tidx = (blockIdx().x - 1) * TILE_DIM + threadIdx().x
	N = Nx*Ny
	if 2 <= tidx <= N-1
		d_y[tidx] = (d_u[tidx-1] - 2d_u[tidx] + d_u[tidx + 1]) / h^2
	end


	if 1 <= tidx <= N && mod(tidx,Ny) == 0
		d_y[tidx] = (d_u[tidx] - 2d_u[tidx - 1] + d_u[tidx - 2]) / h^2
		d_y[tidx-Ny+1] = (d_u[tidx-Ny+1] - 2d_u[tidx - Ny + 2] + d_u[tidx - Ny + 3]) / h^2
	end

	# if 1 <= tidx <= N && mod(tidx,Ny) == 1
	# 	d_y[tidx] = (d_u[tidx] - 2d_u[tidx + 1] + d_u[tidx + 2]) / h^2
	# end
	sync_threads()
	nothing
end
```
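One caveat with `D2y_GPU_v5`: the thread at the bottom of a column writes `d_y[tidx - Ny + 1]`, but thread `tidx - Ny + 1` also writes that element with the interior stencil, so the result relies on the boundary write landing last, and nothing in the kernel guarantees that order. A possible alternative, sketched here as plain CPU Julia with the hypothetical name `d2y_shifted!`, keeps one write per thread by shifting the centre of the stencil on the first and last row. Column-by-column storage and `Ny >= 3` are assumed.

```julia
# Hypothetical per-thread body: pick a stencil shift from the row, then write once.
function d2y_shifted!(d_y, d_u, tidx, Nx, Ny, h)
    if 1 <= tidx <= Nx * Ny
        row = mod(tidx - 1, Ny) + 1
        # shift = +1 on the first row, -1 on the last row, 0 in the interior
        shift = (row == 1) - (row == Ny)
        c = tidx + shift                      # centre of the 3-point stencil
        d_y[tidx] = (d_u[c - 1] - 2d_u[c] + d_u[c + 1]) / h^2
    end
    return nothing
end

# CPU stand-in for a launch, run in reverse order with a few surplus threads.
Nx, Ny, h = 3, 6, 0.5
d_u = [(mod(i - 1, Ny) * h)^2 + 10 * fld(i - 1, Ny) for i in 1:Nx*Ny]
d_y = fill(NaN, Nx * Ny)
for tidx in (Nx * Ny + 4):-1:1
    d2y_shifted!(d_y, d_u, tidx, Nx, Ny, h)
end
println(all(d_y .≈ 2.0))
```

Because each element is written by exactly one thread, the order of execution no longer matters and no `sync_threads()` is needed. Whether this is faster than v5 on the GPU is not something this sketch can tell; it would have to be measured.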


## Performance of D2y_GPU_v5() vs D2x_GPU()

```
julia> tester_D2y(2000)
y ≈ y_gpu = true
y ≈ y_gpu_2 = true
Float64(t1) = 7.446997e8
Float64(t2) = 5.322033e8
Float64(t3) = 6.2246e6
t1 / t2 = 1.3992767425530808
t1 / t3 = 119.6381614882884
CPU Through-put                 0.86
GPU Through-put                 1.20
GPU (v2) Through-put               102.82
(7.446997e8, 5.322033e8, 6.2246e6)

julia> tester_D2x(2000)
y ≈ y_gpu = true
y ≈ y_gpu_2 = true
Float64(t1) = 1.0318478e9
Float64(t2) = 7.868501e6
Float64(t3) = 7.118e6
t1 / t2 = 131.13651507447224
t1 / t3 = 144.963163810059
CPU Through-put                 0.62
GPU Through-put                81.34
GPU (v2) Through-put                89.91
(1.0318478e9, 7.868501e6, 7.118e6)
```
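To relate the columns of this output to each other, here is a small sketch of the arithmetic. The tester code is not shown in these notes, so two things are assumptions: that the timings are in nanoseconds, and that the throughput is a fixed 6.4e8 bytes divided by the elapsed time. The second is inferred only from the fact that it reproduces all six printed values (for instance, 2 arrays × 8 bytes × 2000² elements × 10 repetitions would give that figure).

```julia
# Timings printed by tester_D2y(2000) and tester_D2x(2000); assumed to be nanoseconds.
timings_ns = (
    D2y = (cpu = 7.446997e8, gpu = 5.322033e8, gpu_v2 = 6.2246e6),
    D2x = (cpu = 1.0318478e9, gpu = 7.868501e6, gpu_v2 = 7.118e6),
)

# Assumption: 6.4e8 bytes moved per timed run, inferred from the printed numbers.
# Bytes per nanosecond is the same as GB/s.
throughput_GBs(t_ns; bytes = 6.4e8) = bytes / t_ns

for (name, t) in pairs(timings_ns)
    println(name, ": speedup over CPU = ", round(t.cpu / t.gpu_v2, digits = 2),
            ", throughput = ", round(throughput_GBs(t.gpu_v2), digits = 2), " GB/s")
end
```

Read this way, `D2y_GPU_v5()` at about 103 GB/s is slightly ahead of the better `D2x_GPU()` version at about 90 GB/s, while the original `D2y_GPU()` barely beats the CPU.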

## To Do: Shared memory and tiling ...

## Other interesting topics from JuliaCon 2020

The exascale computing group at Argonne, [Exanauts](https://exanauts.github.io/), has done large-scale work with Julia and GPUs, for example [ExaPF.jl](https://github.com/exanauts/ExaPF.jl) (on optimization solvers).

Implicit Global Grid: [ImplicitGlobalGrid.jl](https://github.com/eth-cscs/ImplicitGlobalGrid.jl)

Its authors wrote multi-GPU Julia code with MPI. I am particularly interested in ImplicitGlobalGrid.jl: there should be something I can learn from their code if I want to do a multi-GPU implementation later.