# August 14th Meeting


## Finite Difference Operator with Shared Memory

```
using CUDA

# Second-derivative kernel. Each block stages a (TILE_DIM1, TILE_DIM2) tile of d_u
# in shared memory, padded with two halo columns on each side, and then applies
# the stencil along the second index.
function D2x_GPU_v5(d_u, d_y, Nx, Ny, h, ::Val{TILE_DIM1}, ::Val{TILE_DIM2}) where {TILE_DIM1, TILE_DIM2}
	tidx = threadIdx().x
	tidy = threadIdx().y

	i = (blockIdx().x - 1) * TILE_DIM1 + tidx
	j = (blockIdx().y - 1) * TILE_DIM2 + tidy

	# Column-major linear index of entry (i, j)
	global_index = i + (j-1)*Ny

	tile = @cuStaticSharedMem(eltype(d_u),(TILE_DIM1,TILE_DIM2+4))

	k = tidx
	l = tidy

	# Writing pencil-shaped shared memory

	# for tile itself (stored at columns 3 : TILE_DIM2+2 of the padded tile)
	if k <= TILE_DIM1 && l <= TILE_DIM2 && global_index <= Nx*Ny
		tile[k,l+2] = d_u[global_index]
	end

	sync_threads()

	# for left halo (two columns to the left of the tile)
	if k <= TILE_DIM1 && l <= 2 && 2*Ny+1 <= global_index <= (Nx+2)*Ny
		tile[k,l] = d_u[global_index - 2*Ny]
	end

	sync_threads()


	# for right halo (columns to the right of the tile)
	if k <= TILE_DIM1 && l >= TILE_DIM2 - 2 && 2*Ny+1 <= global_index <= (Nx-2)*Ny
		tile[k,l+4] = d_u[global_index + 2*Ny]
	end

	sync_threads()

	# Finite difference operation starts here

	# first grid column: boundary row (1 -2 1) applied to columns j, j+1, j+2
	if k <= TILE_DIM1 && l + 2 <= TILE_DIM2 + 4 && global_index <= Ny
		d_y[global_index] = (tile[k,l + 2] - 2*tile[k,l+3] + tile[k,l+4]) / h^2
	end

	# interior grid columns: central stencil on columns j-1, j, j+1
	if k <= TILE_DIM1 &&  l + 2 <= TILE_DIM2 + 4 && Ny+1 <= global_index <= (Nx-1)*Ny
		d_y[global_index] = (tile[k,l + 1] - 2*tile[k, l + 2] + tile[k,l+3]) / h^2
	end

	# last grid column: boundary row (1 -2 1) applied to columns j-2, j-1, j
	if k <= TILE_DIM1 && l + 2 <= TILE_DIM2 + 4 && (Nx-1)*Ny + 1 <= global_index <= Nx*Ny
		d_y[global_index] = (tile[k,l] - 2*tile[k,l + 1] + tile[k,l+2]) / h^2
	end

	sync_threads()

	nothing
end
```

D2x_GPU_v5 (generic function with 1 method)

This works, but the performance is not ideal. I believe it has something to do with how I write data into shared memory.

This is how it's done in the C++ code. I tried something similar, but there were bugs, so I went with my current implementation, which doesn't copy data within shared memory to fill the halo.


```
// fill in periodic images in shared memory array 
  if (i < 4) {
    s_f[sj][si-4]  = s_f[sj][si+mx-5];
    s_f[sj][si+mx] = s_f[sj][si+1];   
  }
```

## What is slowing it down?

Set `Nx = 3000`, `TILE_DIM1 = 1`, `TILE_DIM2 = 1024`.
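
With these settings it may help to look at which global addresses neighboring threads touch. The sketch below is plain CPU Julia that evaluates the kernel's `global_index` formula for the first 32 threads of a block. It assumes `Ny = 3000` (these notes only give `Nx`) and relies on CUDA's rule that threads in a block are linearized with `threadIdx().x` varying fastest. The helper name `global_index_of` is made up for illustration.

```julia
# Evaluate the kernel's index formula on the CPU (no GPU required).
# Assumption: Ny = 3000, matching Nx; the notes only state Nx.
const Ny = 3000
const TILE_DIM1 = 1
const TILE_DIM2 = 1024

# Hypothetical helper mirroring the index arithmetic in D2x_GPU_v5.
function global_index_of(bidx, bidy, tidx, tidy)
    i = (bidx - 1) * TILE_DIM1 + tidx
    j = (bidy - 1) * TILE_DIM2 + tidy
    return i + (j - 1) * Ny
end

# With TILE_DIM1 = 1, consecutive threads of a warp differ only in tidy.
first_warp = [global_index_of(1, 1, 1, tidy) for tidy in 1:32]
strides = diff(first_warp)

println("first indices: ", first_warp[1:4])
println("stride between neighboring threads: ", unique(strides))
```

Under these assumptions neighboring threads are `Ny` elements apart, so a warp's global loads and stores are strided rather than contiguous. Whether that explains the measurements in this section has not been checked; swapping the two tile dimensions would be one way to test it.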

### The simplest version: only reading data into shared memory
I comment out all sections except this one:

```
	# for tile itself
	if k <= TILE_DIM1 && l <= TILE_DIM2 && global_index <= Nx*Ny
		tile[k,l+2] = d_u[global_index]
	end

	sync_threads()
```

Memory throughput: 130-140 GB/s (hardware maximum: 188 GB/s, so about 75% of peak at best).
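
For reference, a throughput figure like this can be derived from a timing as bytes moved divided by elapsed time. The sketch below is an assumption about the bookkeeping, not the script used for these measurements: it counts one read of `d_u` and one write of `d_y` per entry, and the helper name and the example timing are invented. How bytes were counted for the read-only variants is not stated in these notes.

```julia
# Effective memory throughput in GB/s, assuming the kernel reads the
# input once and writes the output once (assumed bookkeeping).
function effective_throughput_GBs(Nx, Ny, ::Type{T}, seconds) where {T}
    bytes_moved = 2 * Nx * Ny * sizeof(T)   # one read + one write per entry
    return bytes_moved / seconds / 1e9
end

# Hypothetical timing, for illustration only: 1.0 ms on a 3000 x 3000 Float64 grid.
gbs = effective_throughput_GBs(3000, 3000, Float64, 1.0e-3)
println(gbs, " GB/s")

# Fraction of the 188 GB/s peak quoted above when running at 140 GB/s.
fraction_of_peak = round(100 * 140 / 188; digits = 1)
println(fraction_of_peak, " % of peak")
```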

### Add left halo

I uncomment this section:

```
	# for left halo
	if k <= TILE_DIM1 && l <= 2 && 2*Ny+1 <= global_index <= (Nx+2)*Ny
		tile[k,l] = d_u[global_index - 2*Ny]
	end

	sync_threads()
```

Memory throughput: 112-116 GB/s, not too far from the above.

### Add right halo

```
	# for right halo
	if k <= TILE_DIM1 && l >= TILE_DIM2 - 2 && 2*Ny+1 <= global_index <= (Nx-2)*Ny
		tile[k,l+4] = d_u[global_index + 2*Ny]
	end

	sync_threads()
```


Memory throughput: 107-108 GB/s.

So even though our halo implementation is not ideal, reading data into shared memory isn't really what limits the memory throughput.





### Now add finite difference: Only central

I uncomment this section:

```
	if k <= TILE_DIM1 &&  l + 2 <= TILE_DIM2 + 4 && Ny+1 <= global_index <= (Nx-1)*Ny
		d_y[global_index] = (tile[k,l + 1] - 2*tile[k, l + 2] + tile[k,l+3]) / h^2
	end
```

Memory throughput drops to 40 GB/s. By comparison, the implementation without shared memory reaches 95 GB/s.


### Now add finite difference for left boundary data

I uncomment this section:

```
	if k <= TILE_DIM1 && l + 2 <= TILE_DIM2 + 4 && global_index <= Ny
		d_y[global_index] = (tile[k,l + 2] - 2*tile[k,l+3] + tile[k,l+4]) / h^2
	end
```

Memory throughput is still 40 GB/s.


### Now add finite difference for right boundary data

I uncomment this section:

```
	if k <= TILE_DIM1 && l + 2 <= TILE_DIM2 + 4 && (Nx-1)*Ny + 1 <= global_index <= Nx*Ny
		d_y[global_index] = (tile[k,l] - 2*tile[k,l + 1] + tile[k,l+2]) / h^2
	end
```

Memory throughput drops to 30 GB/s. I don't understand why this differs from the left boundary data.

This is the final version of D2x using shared memory. It works, but it's slower than the version without shared memory.

## Why is the shared memory version slow?

### Little reuse of data from shared memory

We are only using a second-order finite difference operator, so each entry in shared memory is used at most 3 times (I think), and halo entries are used only once. Against that, the shared memory version pays additional costs: reading data from global memory into shared memory, reading it back from shared memory for the stencil, and writing the result to global memory. That trade-off might not be favorable here. For higher-order finite difference operators, or operators that involve more entries in shared memory (6th order, DxDy), we should see better efficiency from using shared memory.
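
The reuse argument above can be put in numbers. The sketch below is a rough estimate under stated assumptions: a 1-D central stencil of radius `r` (3, 5, and 7 points for the 2nd-, 4th-, and 6th-order interior stencils), a halo of exactly `r` points per side, and no special boundary rows. The function name is made up for illustration.

```julia
# Average number of shared-memory reads per element loaded from global
# memory, for a 1-D stencil of radius r on a tile of `tile` points.
# Assumption: the halo holds exactly r points per side.
function reuse_factor(tile, r)
    loads_from_global = tile + 2r          # tile plus both halos
    reads_from_shared = tile * (2r + 1)    # every output touches 2r + 1 entries
    return reads_from_shared / loads_from_global
end

for (name, r) in [("2nd order", 1), ("4th order", 2), ("6th order", 3)]
    println(name, ": ", round(reuse_factor(1024, r); digits = 2))
end
```

For a tile of 1024 points the factor stays just under `2r + 1`, so the 3-point stencil reuses each loaded value about three times, while a 7-point stencil reuses it about seven times.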


### Data in shared memory is only used for one operator

We are only using shared memory for D2x here, but the same data can be used for Dx, or for any other finite difference operator in the x direction. We should be able to write a kernel `DX(du, dx, d2x, dxx, ...)` that reads data from `du` into shared memory once and outputs `dx` for the Dx operator, `d2x` for the D2x operator, and so on. This would also increase the utilization of shared memory.
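
As a CPU reference for what such a fused operator would compute, here is a minimal 1-D sketch. It is not a GPU kernel, and several pieces are assumptions: the name `DX!`, the argument order, and the one-sided first difference used for the Dx boundary rows. The D2x boundary rows follow the `(1 -2 1)` form used in the kernel above.

```julia
# CPU reference for a fused x-direction operator: each value of u is
# read once per output point and reused for both Dx and D2x.
# Assumptions: 1-D data, uniform spacing h, second-order interior stencils,
# a one-sided first difference at the ends for Dx, and the boundary
# rows (1 -2 1) for D2x as in the kernel above.
function DX!(dx, d2x, u, h)
    n = length(u)
    @assert n >= 3 && length(dx) == n && length(d2x) == n
    for i in 2:n-1
        um, uc, up = u[i-1], u[i], u[i+1]     # loaded once, used twice
        dx[i]  = (up - um) / (2 * h)
        d2x[i] = (um - 2 * uc + up) / h^2
    end
    dx[1]  = (u[2] - u[1]) / h
    dx[n]  = (u[n] - u[n-1]) / h
    d2x[1] = (u[1] - 2 * u[2] + u[3]) / h^2
    d2x[n] = (u[n-2] - 2 * u[n-1] + u[n]) / h^2
    return nothing
end

h = 0.1
x = collect(0.0:h:1.0)
u = x .^ 2
dx, d2x = similar(u), similar(u)
DX!(dx, d2x, u, h)

# Errors against the exact derivatives of u = x^2.
err_d2x = maximum(abs.(d2x .- 2))
err_dx  = maximum(abs.(dx[2:end-1] .- 2 .* x[2:end-1]))
println(err_d2x, "  ", err_dx)
```

On the GPU the three values `um`, `uc`, `up` would come from the shared-memory tile, so the extra operator adds arithmetic and one more global write but no extra global reads.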


### Restriction from SBP operator
SBP operators use different stencils for boundary data than for interior data. This is probably not very different from other schemes, but it adds complexity.

For example, if the finite difference operator is <br>
(-2 1 0 0 ... , <br>
1 -2 1 0 ... , <br>
0 1 -2 1 0 ... , <br>
... ) <br>

We can use a unified `d_y[global_index] = (tile[k,l + 1] - 2*tile[k, l + 2] + tile[k,l+3]) / h^2` to apply the finite difference from shared memory. We only need to set the left halo of the left-most tile and the right halo of the right-most tile to zero. For interior tiles, we only need the same kind of memory copy as in the periodic case to fill the halos.
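
The zero-halo idea can be checked on the CPU with a padded 1-D array. This is a sketch only: `padded` stands in for the shared-memory tile, it uses a single halo point per side instead of the two in the kernel, and the function name is invented.

```julia
# CPU sketch: pad the data with a one-point halo and apply one stencil
# everywhere. Zeros in the outermost halos reproduce the boundary rows
# (-2 1 0 ...) and (... 0 1 -2) without a separate branch.
function d2_unified_zero_halo(u, h)
    n = length(u)
    padded = zeros(eltype(u), n + 2)   # hypothetical stand-in for `tile`
    padded[2:n+1] .= u                 # interior; halos stay zero
    return [(padded[i] - 2 * padded[i+1] + padded[i+2]) / h^2 for i in 1:n]
end

u = [1.0, 4.0, 9.0, 16.0]
y = d2_unified_zero_halo(u, 1.0)
println(y)
```

The first and last entries of `y` equal `(-2*u[1] + u[2]) / h^2` and `(u[end-1] - 2*u[end]) / h^2`, which are exactly the first and last rows of the matrix above.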


But if the finite difference operator is <br>
(1 -2 1 0 0 ... , <br>
1 -2 1 0 0 ... , <br>
0 1 -2 1 0 0 ... , <br>
... ) <br>

It's hard to make a unified `d_y[global_index] = (tile[k,l + 1] - 2*tile[k, l + 2] + tile[k,l+3]) / h^2` work just by altering the values in the halo region. That puzzled me for a while. In my implementation, I used three if-end blocks to write data into shared memory and three if-end blocks to read data from shared memory for the finite difference. Four of the six if-end blocks touch only a small amount of data when the domain and the tile are large, but the cost does not appear to be proportional to the amount of data.
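
One option that might be worth trying, offered as an untested idea rather than a fix: for this particular boundary row, a halo value can be chosen so that the unified stencil still applies. Setting the left ghost value to `3*u[1] - 3*u[2] + u[3]` makes `ghost - 2*u[1] + u[2]` equal to `u[1] - 2*u[2] + u[3]`, and symmetrically on the right. The CPU sketch below checks that identity against the explicit boundary rows. The function names are invented, and it says nothing about whether this would be faster on the GPU or whether it carries over to higher-order SBP boundary closures.

```julia
# Reference: boundary rows (1 -2 1) written out explicitly.
function d2_explicit(u, h)
    n = length(u)
    y = similar(u)
    for i in 2:n-1
        y[i] = (u[i-1] - 2 * u[i] + u[i+1]) / h^2
    end
    y[1] = (u[1] - 2 * u[2] + u[3]) / h^2
    y[n] = (u[n-2] - 2 * u[n-1] + u[n]) / h^2
    return y
end

# Same operator using one stencil plus extrapolated ghost values.
function d2_unified_ghost(u, h)
    n = length(u)
    padded = similar(u, n + 2)
    padded[2:n+1] .= u
    padded[1]   = 3 * u[1] - 3 * u[2] + u[3]        # left ghost value
    padded[n+2] = 3 * u[n] - 3 * u[n-1] + u[n-2]    # right ghost value
    return [(padded[i] - 2 * padded[i+1] + padded[i+2]) / h^2 for i in 1:n]
end

u = [1.0, 3.0, 2.0, 7.0, 5.0]
y_explicit = d2_explicit(u, 0.5)
y_ghost = d2_unified_ghost(u, 0.5)
println(y_explicit)
println(y_ghost)
```

In a tiled kernel only the left-most and right-most tiles would need these ghost values; interior tiles would keep loading their halos from neighboring data.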