# Ch. 5 Local Thread Interaction: Stencils and Shared Memory

Up to this point, all of the examples have fit the
map pattern. The threads in the computational grid could proceed independently, and no thread interactions took place in which multiple threads tried to read from or write to the same array location. 

The map pattern’s complete thread independence is ideal for parallelization, but not all computations are so parallel-friendly. There are other common computing patterns that involve thread interactions, and we need to
be prepared to handle those interactions with reasonable efficiency.

In this notebook we deal with the case of __local thread interactions__, which involve computations requiring information from neighboring threads in the computational
grid.

> The case of __global thread interactions__, where every thread in the grid can contribute to a result, is discussed in Chapter 7, "Global Thread Interactions and
Reductions".

Local interactions frequently arise in association with 
another common computing pattern known as a __stencil computation__.

> ___Stencils___ are also referred to as ___masks___.

## 5.1 Finite Difference Stencils

Again, we focus on a specific application to keep the presentation focused and concrete. Here the application involves the finite difference method, the numeric estimation of a function’s derivative. The definition
of a derivative is:

$$ \frac{df(x)}{dx} = \lim_{h\to 0} \frac{f (x + h) - f (x)}{h}$$

The finite difference approximation avoids taking the limit of $h$ as it goes to zero (since floating point arithmetic has no concept of infinitesimals)
and instead treats $h$ as the finite distance between points where the function is sampled. Because the spacing between values is small but not infinitesimal, we arrive at an approximation of the derivative. For this example we stick to the simple case of an array of points $x_i$ with uniform spacing $h$. We apply this idea in the context of function values computed using the _map_ app. The _map_ app used `numpy.linspace()` to create
an array of N sample points equally spaced over the interval $[x_0,x_{N-1}] = [0, 1]$ such that $N-1$ subintervals each of length $\frac{1}{N-1}$ are created. In finite difference language:

- $h = \frac{1}{N-1}$

- $x_0 + i*h = x_i \to$ `x[i]`

- $f(x_i) = f_i \to$ `f[i]`
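The correspondence between the sample points and the spacing $h$ can be checked directly. The following is a minimal sketch, assuming `N = 128` (the value used in the listings below) and using $f(x) = x^2$ as a stand-in test function; the variable name `x_from_h` is introduced here only for illustration.

```python
import numpy as np

N = 128                        # number of sample points
x = np.linspace(0.0, 1.0, N)   # N points, hence N-1 subintervals on [0, 1]
h = 1.0 / (N - 1)              # uniform spacing between neighboring points

# x[i] should coincide with x_0 + i*h for every index i
i = np.arange(N)
x_from_h = x[0] + i * h

f = x**2                       # sample values f[i] = f(x[i]) for an assumed test function

# Largest mismatch between the two ways of computing the sample points
print(h, np.max(np.abs(x - x_from_h)))
```

The printed mismatch should be at the level of floating point round-off, which confirms that `np.linspace()` produces exactly the uniform grid assumed by the finite difference formulas.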

The forward difference estimate of the derivative is given by:

$$\nabla f_i = \frac{f_{i+1}-f_i}{h}$$

and produces an array `forwardD` of forward derivative estimates with entries:

```
forwardD[i] = 1/h*(f[i+1] - f[i])
```

There is also a backward difference that would produce an array of derivative estimates with entries:

```
backwardD[i] = 1/h*(f[i] - f[i-1])
```

and a central difference estimate that averages the forward and backward estimates, and produces an array with entries:

```
centralD[i] = 1/(2*h)*(f[i+1] - f[i-1])
```

Applying the forward difference formula to the backward difference array produces a central difference estimate of the second derivative

$$\nabla^2f_i = \frac{f_{i-1}-2 f_i + f_{i+1}}{h^2}$$

which produces an array of second difference estimates with the following
entries:

```
centralD2[i] = (1/h**2)*(f[i-1] - 2*f[i] + f[i+1])
```
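The relationships among these estimates can be verified with NumPy slicing. This is a small sketch rather than the app's implementation: it assumes the test function $f(x) = x^2$ on 11 points and only evaluates interior indices, where both neighbors exist. The names `avgD` and `fwd_of_bwd` are introduced here for illustration.

```python
import numpy as np

N = 11
x = np.linspace(0.0, 1.0, N)
h = 1.0 / (N - 1)
f = x**2                        # known derivatives: 2x and 2

# Estimates at interior indices i = 1..N-2
forwardD  = (f[2:]   - f[1:-1]) / h
backwardD = (f[1:-1] - f[:-2])  / h
centralD  = (f[2:]   - f[:-2])  / (2 * h)
centralD2 = (f[:-2] - 2 * f[1:-1] + f[2:]) / h**2

# The central estimate is the average of the forward and backward estimates
avgD = 0.5 * (forwardD + backwardD)

# Forward difference applied to the backward difference array
# reproduces the central second difference (one fewer entry)
fwd_of_bwd = (backwardD[1:] - backwardD[:-1]) / h

print(np.max(np.abs(centralD - 2 * x[1:-1])))   # error of the central estimate
print(np.max(np.abs(forwardD - 2 * x[1:-1])))   # error of the forward estimate
```

For a parabola the central estimate is exact up to round-off, while the forward estimate is off by $h$, which illustrates why the central scheme is usually preferred when both neighbors are available.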

Whichever derivative estimate you choose, the derivative at index `i` is computed as some linear combination of the entries with index values `i-1` , `i` , and `i+1` from the initial array; that is, the value
at the current position (identified by index `i`) and the neighboring entries in the array (indexed by `i-1` and `i+1`).

To construct the derivative estimate array we first build a __small array of fixed coefficients__, or ___stencil___, where each array entry corresponds to a coefficient of a term in the difference scheme being used. In the case of the second-order central difference estimate of the second derivative, the coefficients of
the stencil would be `1/(h**2)*[1,-2,1]`. Computing the finite difference derivative estimate for index `i` can then be thought of as laying this short array over the larger array (with the center of the stencil aligned with `f[i]`) and computing the dot product of the overlapping entries (by multiplying corresponding entries and summing). These inner products of the stencil coefficients with the neighborhoods of the array of function values covered by the stencil form the entries in the array of derivative estimates.
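The "lay the stencil over the array and take a dot product" picture translates almost literally into code. The sketch below assumes the test function $f(x) = x^2$ and a radius 1 second-derivative stencil; it also shows that NumPy's `np.correlate()` in `'valid'` mode performs the same sliding dot product at every position where the stencil fits.

```python
import numpy as np

N = 11
x = np.linspace(0.0, 1.0, N)
h = 1.0 / (N - 1)
f = x**2

RAD = 1
stencil = np.array([1.0, -2.0, 1.0]) / h**2   # second-derivative stencil, radius 1

deriv = np.zeros(N)
for i in range(RAD, N - RAD):
    window = f[i - RAD : i + RAD + 1]          # entries of f covered by the stencil
    deriv[i] = np.dot(stencil, window)         # multiply corresponding entries and sum

# np.correlate slides the stencil along f without flipping it; 'valid' keeps
# only the positions where the stencil lies entirely inside the array.
deriv_valid = np.correlate(f, stencil, mode="valid")

print(deriv[RAD : N - RAD])
print(deriv_valid)
```

Both approaches produce the same interior values; the explicit loop is kept here because it maps directly onto the one-thread-per-index kernels that follow.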

> To keep things simple for the moment, we focus on
generic entries, not on endpoints. When we present the implementation, we explain how to handle entries near the
start and end of the array where the stencil extends past the bounds of the array.

Here are other stencils for the derivative estimates mentioned above:

- $1^{st}$-derivative forward difference: 
$$c * [0,-1,1] \text{, where } c = \frac{1}{h}$$

- $1^{st}$-derivative backward difference: 
$$c * [-1,1,0] \text{, where } c = \frac{1}{h}$$

- $1^{st}$-derivative central difference: 
$$c * [-1,0,1] \text{, where } c = \frac{1}{2h}$$

In each of these cases, the stencil corresponds to an array of 3 constant coefficients that cover the $i^{th}$ entry and its neighbors 1 position
to either side. The size of the interacting coefficient neighborhood is referred to as the __stencil radius__,
so these are all examples of stencils with radius 1. It
is not uncommon to use stencils with larger radius values to achieve higher orders of accuracy. For example, here are two commonly used finite difference stencils of radius 2: 

- $1^{st}$-derivative central difference with radius 2: 
$$c * [1,-8,0,8,-1] \text{, where } c = \frac{1}{12h}$$

- $2^{nd}$-derivative central difference with radius 2: 
$$c * [-1,16,-30,16,-1] \text{, where } c = \frac{1}{12h^2}$$
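To see what the larger radius buys, the following sketch compares the radius 1 and radius 2 first-derivative stencils on the assumed test function $f(x) = x^4$, whose exact derivative is $4x^3$. Coefficients are listed in order from `f[i-2]` to `f[i+2]`, matching the ordering used by `nth_deriv()` in the shared memory listing later in this chapter. The helper `apply_stencil()` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

def apply_stencil(f, stencil):
    """Apply a centered stencil at every index where it fits inside f."""
    rad = len(stencil) // 2
    out = np.full(f.shape, np.nan)           # NaN marks entries that are not computed
    for i in range(rad, f.size - rad):
        out[i] = np.dot(stencil, f[i - rad : i + rad + 1])
    return out

N = 21
x = np.linspace(0.0, 1.0, N)
h = 1.0 / (N - 1)
f = x**4
exact = 4 * x**3

d1_rad1 = np.array([-1.0, 0.0, 1.0]) / (2 * h)
d1_rad2 = np.array([1.0, -8.0, 0.0, 8.0, -1.0]) / (12 * h)

# Largest error over the entries each stencil can reach (NaN entries ignored)
err1 = np.nanmax(np.abs(apply_stencil(f, d1_rad1) - exact))
err2 = np.nanmax(np.abs(apply_stencil(f, d1_rad2) - exact))
print(err1, err2)
```

The radius 1 stencil has an error proportional to $h^2$, whereas the radius 2 stencil differentiates a quartic exactly (up to round-off), at the cost of two more reads per entry and a wider band of end entries that need special handling.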

We start by discussing the simple case of radius 1, but we provide an implementation that handles stencils of arbitrary radius (within some practical limits).

## 5.2 Serial Implementation of a 1D Finite Difference: The _stencil_ App

In this section, we present the serial version of the _stencil_ app that computes finite difference derivative approximations. The code for the app includes 2 files: _apps/stencil/main.py_ to control execution and
_apps/stencil/serial.py_ with the function definitions for the serial implementation.

```
File: main.py
01: import numpy as np
02: import matplotlib.pyplot as plt
03: from serial import pArray, nth_deriv
04: 
05: N = 128
06: 
07: def main():
08:     x = np.linspace(0, 1, N, endpoint=True, dtype=np.float32)
09:     
10:     f = pArray(x)
11:     print(f)
12:     plt.plot(x, f, 'o', label='f(x) = x**2')
13: 
14:     dfdx = nth_deriv(f, 1)
15:     print(dfdx)
16:     plt.plot(x, dfdx, 'o', label='First derivative')
17:     
18:     d2fdx2 = nth_deriv(f, 2)
19:     print(d2fdx2)
20:     plt.plot(x, d2fdx2, 'o', label='Second derivative')
21:     
22:     plt.legend()
23:     plt.show()
24: 
25: if __name__ == '__main__':
26:     main()
```

$$\text{Listing 5.1 - } apps/stencil/main.py$$

Listing 5.1, _apps/stencil/main.py_, is derived from _apps/map/main.py_ with a few modifications:

- The imported functions include `pArray()` (which replaces `sArray()` and returns samples corresponding to the parabola $y = p(x) = x^2$) and `nth_deriv(f,n)` which computes an array of finite difference estimates of the $n^{th}$ derivative as detailed in the discussion of _apps/stencil/serial.py_ below.

- Following the call to evaluate, print, and plot the function values on Lines 10-12, there are calls to evaluate/print/plot the first and second derivative estimates on lines 14-16 and 18-20 respectively.

Listing 5.2 shows code for an implementation of _apps/stencil/serial.py_. The first part of the code closely resembles _apps/map/serial.py_ with `s()` replaced with a new `p()` function and `sArray()` renamed to
`pArray()`. Note that $p(x)=x^2$ is chosen so that its derivatives are simple and recognizable:

$$
\frac{df}{dx} = 2x, \;
\frac{d^2f}{dx^2} = 2
$$

```
File: serial.py
01: import numpy as np
02: 
03: def p(x0):
04:     return x0**2
05: 
06: def pArray(x):
07:     n = x.size
08:     f = np.empty(n)
09:     for i in range(n):
10:         f[i] = p(x[i])
11:     return f
12: 
13: def nth_deriv(f, order):
14:     n = f.shape[0]
15:     if order == 1:
16:         stencil = np.array([-1., 0., 1.])
17:         c = (n-1) / 2.
18:     elif order == 2:
19:         stencil = np.array([1., -2., 1.])
20:         c = (n-1)*(n-1)
21:     deriv = np.zeros(n)
22: 
23:     for i in range(1,n-1):
24:         deriv[i] = c * (
25:             f[i-1]*stencil[0]+
26:             f[i] * stencil[1]+
27:             f[i+1]*stencil[2])
28: 
29:     # Alternative version of stencil by loop sum
30:     #RAD = 1
31:     #for i in range(1,n-1):
32:     #    for di in range(-RAD, RAD+1):
33:     #        deriv[i] += c*f[i+di]*stencil[di+RAD]
34: 
35:     return deriv
```

$$\text{Listing 5.2 - } apps/stencil/serial.py$$

The important new code is within the `nth_deriv()` function
on lines 13-35. The function takes 2 arguments: `f` is the array
of sample values and `order` is the order of the derivative (which is implemented for `order`$\in \{1,2\}$). Line 14 establishes
the dimension of the array (which also determines the spacing between the evaluation points; i.e. $n-1 = 1/h$). Lines 15-20 specify the stencil coefficients, and line 21 generates an array of 0’s to contain the array of derivative values. The actual stencil computation occurs on lines 23-27, and an equivalent alternative version that uses a loop to sum over a stencil of given radius appears in the comments on lines 30-33.

When applying the stencil to the array of input values, `f`, there
are cases when the stencil cannot be applied properly because it would extend beyond the bounds of the array. Near the ends of the array (for a radius 1 stencil this means the first and last entries),
something else needs to be done. One approach involves extending the array to provide entries to cover the full stencil. Typical choices for this involve treating the array as periodic or padding with specified values. Another approach is to use an alternative differencing scheme for which the necessary function values lie within the bounds of the array. In particular, a forward difference stencil can be applied at the left end (`i=0`) and a backward difference stencil can be applied at the right end (`i=n-1`). Note that the listings in this chapter compute only the entries where the full stencil fits; _apps/stencil/serial.py_ leaves the end entries at their initial value of 0.
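As an illustration of the second approach, here is a minimal sketch of a first-derivative estimate that uses central differences in the interior and one-sided differences at the two ends. The function name `first_deriv_with_ends()` and its explicit `h` argument are assumptions made for this example; they are not part of the _stencil_ app.

```python
import numpy as np

def first_deriv_with_ends(f, h):
    """Central differences in the interior, one-sided differences at the ends."""
    n = f.size
    deriv = np.empty(n)
    for i in range(1, n - 1):
        deriv[i] = (f[i + 1] - f[i - 1]) / (2 * h)   # central difference
    deriv[0] = (f[1] - f[0]) / h                     # forward difference at i = 0
    deriv[n - 1] = (f[n - 1] - f[n - 2]) / h         # backward difference at i = n-1
    return deriv

N = 128
x = np.linspace(0.0, 1.0, N)
h = 1.0 / (N - 1)
dfdx = first_deriv_with_ends(x**2, h)

# Interior entries match 2x; the end entries are off by h
print(dfdx[0], dfdx[1], dfdx[-2], dfdx[-1])
```

This is the same scheme that `np.gradient(f, h)` applies with its default settings, which provides a convenient cross-check. The one-sided estimates are only first-order accurate, so the end entries are noticeably less accurate than the interior ones.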

## 5.3 Parallel Implementation of the _stencil_ App

In this section we move on to a parallel implementation of the _stencil_
app. We build on the structure of the _map_ app and create a modified
version of _apps/map/parallel.py_ to provide the parallel function
definitions. Listing 5.3 shows the code for _apps/stencil/parallel.py_.

```
File: parallel.py
01: import numpy as np
02: from numba import jit, cuda, float32
03: TPB = 32
04: 
05: @cuda.jit(device = True)
06: def p(x0):
07:     return x0**2
08: 
09: @cuda.jit #Lazy compilation
10: #@cuda.jit('void(float32[:], float32[:])') #Eager compilation
11: def pKernel(d_f, d_x):
12:     i = cuda.grid(1)
13:     n = d_x.shape[0]    
14:     if i < n:
15:         d_f[i] = p(d_x[i])
16: 
17: def pArray(x):
18:     n = x.shape[0]
19:     d_x = cuda.to_device(x)
20:     d_f = cuda.device_array(n, dtype = np.float32) #need dtype for eager compile
21:     pKernel[(n+TPB-1)//TPB, TPB](d_f, d_x)
22:     return d_f.copy_to_host()
23: 
24: @cuda.jit
25: def df_kernel(d_deriv, d_f, stencil):
26:     i = cuda.grid(1)
27:     n = d_f.shape[0]    
28:     if i >= 1 and i < n-1:
29:         d_deriv[i] = (
30:             d_f[i-1]*stencil[0]+
31:             d_f[i] * stencil[1]+
32:             d_f[i+1]*stencil[2])
33: 
34: def nth_deriv(f, order):
35:     n = f.shape[0]
36:     if order == 1:
37:         stencil =(n-1)/2. * np.array([-1., 0., 1.])
38:     elif order == 2:
39:         stencil = (n-1)*(n-1)* np.array([1., -2., 1.])
40:     d_f = cuda.to_device(f)
41:     d_deriv = cuda.device_array(n, dtype = np.float32)
42:     df_kernel[(n+TPB-1)//TPB, TPB](d_deriv, d_f, stencil)
43: 
44:     return d_deriv.copy_to_host()
```
$$\text{Listing 5.3 - } apps/stencil/parallel.py$$

This version of _apps/stencil/parallel.py_ includes parallel implementations of both `pArray()` and `nth_deriv()`. The first portion of the code, through line 22, is directly analogous to _apps/map/parallel.py_.

The more novel portion of the code is the parallelized version of `nth_deriv()` on lines 34-44, which serves as the wrapper for `df_kernel()` on lines 24-32. Hopefully, this code now looks pretty readable. The one significant change is that the bounds test is `if i >= 1 and i < n-1` (or more generally `if i >= RAD and i < n-RAD` for a stencil of radius `RAD`) to ensure that the stencil remains within the array bounds. Keep in mind that `cuda.device_array()` allocates memory without initializing it, so the end entries of `d_deriv`, which the kernel never writes, do not hold meaningful values.

After changing `serial` to `parallel` in the import statement of _apps/stencil/main.py_, you can execute the parallel implementation
and verify that the results agree with those from the serial implementation.
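When no GPU is at hand, the indexing logic of the kernel can still be checked on the host. The sketch below is an emulation, not Numba code: it loops over every thread that the launch configuration `[(n+TPB-1)//TPB, TPB]` would create and applies a bounds test that keeps a radius 1 stencil inside the array. The function name `emulate_df_kernel()` is hypothetical, and `n = 100` is chosen deliberately so that the last block contains threads with no work to do.

```python
import numpy as np

TPB = 32  # threads per block, as in the listing

def emulate_df_kernel(f, stencil):
    """Host-side stand-in for a 1D kernel launch: one loop pass per thread."""
    n = f.shape[0]
    blocks = (n + TPB - 1) // TPB            # ceiling division, as in the launch configuration
    deriv = np.zeros(n)
    for block in range(blocks):
        for t in range(TPB):
            i = block * TPB + t              # the value cuda.grid(1) would return
            if i >= 1 and i < n - 1:         # radius 1 stencil must stay in bounds
                deriv[i] = (f[i - 1] * stencil[0] +
                            f[i] * stencil[1] +
                            f[i + 1] * stencil[2])
    return deriv, blocks

n = 100                                      # not a multiple of TPB
x = np.linspace(0.0, 1.0, n)
stencil = (n - 1) / 2.0 * np.array([-1.0, 0.0, 1.0])
dfdx, blocks = emulate_df_kernel(x**2, stencil)

print(blocks, blocks * TPB - n)              # number of blocks and number of idle threads
```

The grid contains more threads than array entries, and the bounds test is what prevents the surplus threads (and the two end threads) from reading outside the array.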

## 5.4 Performance Considerations

Hopefully you are now comfortable enough accessing the basic features
of Python, numpy, numba, and CUDA so that converting serial
code to a basic parallel equivalent seems straightforward, because we have
reached the point where it is worth discussing some not-so-basic features
that can have a significant impact on the code performance. 

So far, our attention has been entirely focused on creating code that can
run in parallel on the numerous compute cores on the GPU. We have
noted that the cores on the GPU do not in general have direct access to
host memory, so the GPU computations need to read input data and
write output data to the device memory that resides on the GPU. In a
massively parallel setting with hundreds or thousands of cores at our
disposal, an app’s performance may not be limited by computing capacity
(e.g. the maximum number of GFlops or TFlops the system can
achieve). It is at least equally likely that performance will be limited by data access and transfer rates. Having all the compute cores in the world will do us no good if the cores cannot efficiently access the input data they need or write the output data that has been computed.

The codes developed so far have dealt primarily with global device memory. However, there are several other types of memory on the device, each with different properties in terms of quantity, access rate, and scope of thread access. There is an essential rule of thumb that you need to know about memory management in CUDA:

> Memory that is closer to compute cores supports faster data transfer
but is less abundant and accessible to fewer threads. 

Here is a list of memory types we have used or will use in the future:

- ___Host memory___ resides on the CPU side of the system. This
is typically your most abundant memory but, for GPU cores to access
this memory, the data has to be transferred across the PCIe
bus to the GPU. Transferring data across the bus is relatively slow, so we try to minimize transfers to/from host memory or organize the data in such a way as to hide the time that is required to transfer the data. (Only transferring data to and from disk is slower than accessing host memory.)

- ___Global memory___ is the most abundant type of device memory. It is
accessible by all threads in the computation grid, but it is farthest
from the compute cores and provides the slowest data transfers on
the device. All the device arrays that we created with `cuda.device_array()` and `cuda.to_device()` reside in global memory.

- ___Shared memory___ is included in each SM and is allocated
per thread block; all threads in a block have access to the
same shared memory. Shared arrays are created on the device side
using `cuda.shared.array()`. Shared arrays are advantageous in calculations that access the same location in memory multiple times.
Instead of repeatedly accessing the same location in global memory, that location can be accessed once and the data can be stored in shared memory from which future accesses can occur much more quickly. While modern GPUs include additional caches that attempt to mitigate the difference, data access from global memory can be $\sim 100 \times$ slower than from shared memory. The speed advantage makes shared memory attractive, but there are practical limitations as each SM has a very limited amount of shared memory. (Even higher-end cards like the GTX 1080Ti
provide only 48 KB of shared memory per block.)

- ___Register memory___ resides immediately adjacent to each compute core. 
It provides the fastest transfers, but is not something we can control directly through the application programming interface (API). A register is typically used to store scalar arguments and values of thread-specific variables created during kernel execution. Each thread has a __very__ limited number of available registers; current hardware typically provides a maximum of 255 32-bit registers per thread. (If this limit is exceeded, the data "spills" over into local memory.) When profiling a kernel, trying to optimize the register memory usage can lead to improved kernel execution speed.

- __Local memory__ is memory local to a specific thread (i.e. it cannot
be accessed by any other thread) but, unlike the previous types of
memory, it is not a physical location on the GPU. Local memory is
privately assigned global memory, used primarily when the register
memory is at capacity or when assigned using `cuda.local.array()`.

CUDA also provides specialized ___texture memory___, ___surface memory___,
and ___constant memory___. Texture and surface memory are not currently implemented
in Numba, while constant memory is available through `cuda.const.array_like()`. If you are interested in learning more about them, refer to the
CUDA documentation.

With an understanding of the types of memory at our
disposal, we can move on to a discussion of how specific types of memory can be used to achieve performance enhancements which can be grouped into 3 general categories:

- __Basic performance enhancements__ occur with the initial shift to CUDA,
which allows the numerous compute cores on your GPU to perform
computations in parallel. Since the SIMT model of parallelism
supports scaling with the number of cores, basic acceleration
can also be achieved by simply obtaining an enhanced GPU.

- __Intermediate performance enhancements__ usually involve "Jedi memory tricks" that are the main topic of this section. Utilization
of appropriate memory, especially shared memory, can make significant progress in reducing GPU latency and optimizing runtimes.

- __Advanced performance enhancements__ involve fine-tuning
your app using advanced programming techniques as motivated
by detailed profiling information. 

We have already seen how to translate serial code into parallel code to
achieve basic performance enhancements. The coding effort involved
is very reasonable, and the acceleration achieved by the basic transition
can be significant. 

The next step is to discuss some of the techniques
used to achieve intermediate performance enhancements. The following
section, "Accelerating Stencils with Shared Memory", will serve as our initial foray into these intermediate performance enhancements.
The implementation of shared memory requires some additional code
development effort, but not an unreasonable amount, and the associated
acceleration can be significant.

While we have discussed basic performance enhancement and will
discuss intermediate performance enhancements, advanced performance
enhancement techniques involve significant use of profiling tools and intimate knowledge of what goes on within the GPU when a kernel is
run; and these factors are often code and hardware specific. Understanding how registers are being used during kernel execution, utilizing multiple data streams, accessing multiple GPUs, etc. are just a few of the more advanced techniques used to wring out that last bit of performance from a system. Each level of optimization comes with increased effort and diminishing returns, so the reward-to-effort ratio can become small. While you should not be deterred from gaining the expertise needed to achieve advanced performance enhancements, you should also be aware that an effort/reward consideration is appropriate and, on that basis, our focus will remain primarily on basic acceleration and
intermediate acceleration.

> For those who are interested in advanced acceleration techniques, a number of references and resources are available. For specifics, see the notebook on "Further Steps".

## 5.5 Accelerating Stencils with Shared Memory

Let's begin by taking a close look at `deriv_kernel()`, a version of the kernel function from _apps/stencil/parallel.py_ generalized to a stencil of radius `rad`, with the intent of identifying inefficiencies or redundancies that could provide opportunities for performance improvement. 

```
@cuda.jit
def deriv_kernel(d_deriv, d_f, d_stencil, rad):
    i = cuda.grid(1)
    n = d_f.shape[0]
    if i >= n:
        return  # threads beyond the end of the array do nothing
    d_deriv[i] = 0
    if i >= rad and i < n-rad:
        for j in range(2*rad + 1):
            d_deriv[i] += d_f[i + j - rad] * d_stencil[j]
```

$$\text{Code snippet for deriv\_kernel, based on } apps/stencil/parallel.py$$

The first 2 lines of the function body set the index for the thread and the size of the array. Threads whose index lies beyond the end of the array return immediately, and the remaining threads initialize `d_deriv[i]` to 0. The second `if` statement performs the usual test to keep the stencil within the array bounds. The essential computation is performed in the final 2 lines, which compute the dot product of the stencil with the corresponding array entries.

Let's think specifically about what happens when this kernel is executed
by each thread. In particular, how many times does each element of `d_f` get read from global memory? Each thread computes the inner product of the stencil with the segment of `d_f` that is covered by the stencil centered on `d_f[i]`. Since the length of the stencil is `2*RAD+1`, a typical element of `d_f` lies under the stencil in `2*RAD+1` threads: the thread where the stencil is centered on that element and the threads centered on the `RAD` entries on either side.

As mentioned previously, because the GPU provides ready access to hundreds or thousands of processors that are capable of teraflop-scale computing, operation counts may not be the performance-limiting factor. In many cases the more critical issue involves data access, and we have now identified a key data access redundancy. Global memory has the largest latency (slowest access) of any memory on the GPU, and the initial stencil implementation accesses each array element repeatedly (`2*RAD+1` times) from that slowest portion of GPU memory. This redundancy becomes even worse when we move to higher-dimensional problems. On a D-dimensional grid, the redundancy is `(2*RAD+1)**D`, so a very reasonable stencil of radius 2 on a grid with $D=3$ has a data access redundancy factor of $5^3 = 125$.
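The redundancy can be made concrete by counting reads on the host. The sketch below is an emulation under simple assumptions: every thread that passes the bounds test reads its full neighborhood from the input array, and in higher dimensions the stencil covers a full box of side `2*rad+1`. The function names `global_read_counts()` and `redundancy()` are hypothetical and introduced only for this illustration.

```python
import numpy as np

def global_read_counts(n, rad):
    """Count how often each input entry is read by a 1D stencil kernel."""
    reads = np.zeros(n, dtype=np.int64)
    for i in range(rad, n - rad):             # threads that pass the bounds test
        reads[i - rad : i + rad + 1] += 1     # each one reads its full neighborhood
    return reads

def redundancy(rad, dim):
    """Reads per interior entry on a dim-dimensional grid with a box stencil."""
    return (2 * rad + 1) ** dim

reads = global_read_counts(n=32, rad=2)
print(reads)                       # interior entries are each read 2*rad+1 times
print(redundancy(rad=2, dim=3))    # redundancy factor for radius 2 in 3D
```

Entries near the ends of the array are read fewer times because fewer stencils cover them, but for large arrays nearly every entry is read the full `2*rad+1` times.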

Avoiding such redundancies provides the opportunity for a significant acceleration that involves what is perhaps the most important intermediate performance enhancement: the use of shared memory. Recall that shared memory offers a limited amount (typically 48 KB) of storage that is available to all threads in a block and offers significantly faster access (compared to global memory).

Here is the typical plan for achieving intermediate level acceleration using shared memory to deal with local thread interaction:

- Create a shared memory array to hold the data needed by an entire thread block. The entries to be read from global memory include the elements whose indices coincide with index values of threads in the block and also `RAD` neighboring ___halo values___ (also called ___ghost values___) on each side that are needed to evaluate the stencil for threads at the edges of the block. The size of the shared array is thus set to be `NSHARED = TPB + 2*RAD`.

- Read the entries from the input global memory array __JUST ONCE__ and store them in the block’s shared memory array.

- Have each thread access the data it needs from the shared memory array.

- Finally, when all threads in the block have completed execution, write their output to the appropriate positions in a global output array. (It is also possible to have threads store results in a shared output array and, after all threads have completed, transfer the data from the shared output array to a global output array.)
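The plan above can be rehearsed on the host before writing any kernel code. The following is a sketch in plain NumPy, not a CUDA implementation: each pass of the outer loop plays the role of one thread block, the array `tile` stands in for the block's shared array, and the inner loop stands in for the threads of the block. The function name `tiled_stencil()` and the choice `tpb=32` are assumptions made for the example.

```python
import numpy as np

def tiled_stencil(f, stencil, tpb):
    """Emulate blocks that copy a tile plus halo once, then compute from the copy."""
    n = f.size
    rad = len(stencil) // 2
    out = np.zeros(n)
    blocks = (n + tpb - 1) // tpb
    for b in range(blocks):
        start = b * tpb                        # first global index owned by this block
        stop = min(start + tpb, n)
        lo = max(start - rad, 0)               # halo range, clipped at the array ends
        hi = min(stop + rad, n)
        tile = np.zeros(tpb + 2 * rad)         # plays the role of the shared array
        # Single read of the global data: global index g lands at tile[g - start + rad]
        tile[lo - start + rad : hi - start + rad] = f[lo:hi]
        for i in range(max(start, rad), min(stop, n - rad)):
            s = i - start + rad                # position of entry i inside the tile
            out[i] = np.dot(stencil, tile[s - rad : s + rad + 1])
    return out

n = 100
x = np.linspace(0.0, 1.0, n)
h = 1.0 / (n - 1)
stencil = np.array([-1.0, 16.0, -30.0, 16.0, -1.0]) / (12 * h**2)
d2 = tiled_stencil(x**2, stencil, tpb=32)
print(d2[2:6])
```

Every entry of `f` is copied into a tile at most once as a regular value and at most twice more as a halo value for the neighboring blocks, instead of being read `2*rad+1` times.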

> __Feasible shared array size:__ Given the 48 KB capacity, about 6,000 double-precision values can be stored in a block's shared memory. Given that a block is limited to 1024 threads, shared memory provides room to store several sub-arrays whose length is near the number of threads in the block.
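The arithmetic behind this estimate is short enough to spell out. The sketch below assumes the 48 KB per-block figure quoted in the text and the 1024-thread block limit; the helper names `max_entries()` and `arrays_that_fit()` are hypothetical.

```python
SHARED_BYTES = 48 * 1024           # 48 KB of shared memory per block
MAX_TPB = 1024                     # maximum number of threads per block

def max_entries(bytes_per_value):
    """Number of values of the given size that fit in one block's shared memory."""
    return SHARED_BYTES // bytes_per_value

def arrays_that_fit(tpb, rad, bytes_per_value):
    """How many shared arrays of length tpb + 2*rad fit in the same budget."""
    nshared = tpb + 2 * rad
    return max_entries(bytes_per_value) // nshared

print(max_entries(8))                      # capacity in float64 values
print(max_entries(4))                      # capacity in float32 values
print(arrays_that_fit(MAX_TPB, 2, 8))      # float64 arrays for a full block, radius 2
```

With single-precision data the capacity doubles, which is one practical reason the listings in this chapter use `float32` arrays.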

With the steps laid out, we create an updated implementation of `nth_deriv()` and an updated `deriv_kernel()` which now involve the use of shared memory. The updated function definitions are shown in Listing 5.4 of _apps/stencil/shared.py_ below. 

```
File: shared.py
01: import numpy as np
02: from numba import jit, cuda, float32, int32
03: 
04: TPB = 128
05: #define length of the shared memory array
06: NSHARED = 130 #value must agree with TPB + 2*RAD
07: 
08: @cuda.jit(device = True)
09: def p(x0):
10:     return x0**2
11: 
12: @cuda.jit #Lazy compilation
13: #@cuda.jit('void(float32[:], float32[:])') #Eager compilation
14: def pKernel(d_f, d_x):
15:     i = cuda.grid(1)
16:     n = d_x.shape[0]    
17:     if i < n:
18:         d_f[i] = p(d_x[i])
19: 
20: def pArray(x):
21:     n = x.shape[0]
22:     d_x = cuda.to_device(x)
23:     d_f = cuda.device_array(n, dtype = np.float32) #need dtype spec for eager compilation
24:     pKernel[(n+TPB-1)//TPB, TPB](d_f, d_x)
25:     return d_f.copy_to_host()
26: 
27: @cuda.jit
28: #@cuda.jit("void(float32[:],float32[:],float32[:])")
29: def deriv_kernel(d_deriv, d_f, d_stencil, rad):
30:     n = d_f.shape[0]
31:     i = cuda.grid(1)
32:     sh_f = cuda.shared.array(NSHARED, dtype = float32)
33:     #thread index (and index for optional shared output array)
34:     tIdx = cuda.threadIdx.x
35:     #index for shared input array
36:     shIdx = tIdx + rad
37: 
38:     if i>=n:
39:         return
40: 
41:     #Load regular cells
42:     sh_f[shIdx] = d_f[i]
43: 
44:     #Halo cells- Check that the entries to be loaded are within array bounds
45:     if tIdx < rad:
46:         if i >= rad:
47:             sh_f[shIdx - rad] = d_f[i-rad]
48:         if i + cuda.blockDim.x < n:
49:             sh_f[shIdx + cuda.blockDim.x] = d_f[i + cuda.blockDim.x]
50: 
51:     #make sure that shared array is fully loaded before any thread reads from it
52:     cuda.syncthreads()
53: 
54:     #write values only where the full stencil is "in bounds"
55:     if i >= rad and i < n-rad:
56:         stencil_dot =  sh_f[shIdx] * d_stencil[rad]
57:         for d in range(1,rad+1):
58:             stencil_dot += sh_f[shIdx-d]*d_stencil[rad-d] + sh_f[shIdx+d]*d_stencil[rad+d]
59:         d_deriv[i] = stencil_dot
60: 
61: def nth_deriv(f, order, rad):
62:     n = f.shape[0]
63:     if rad == 1:
64:         if order == 1:
65:             stencil =(n-1)/2. * np.array([-1., 0., 1.])
66:         elif order == 2:
67:             stencil = (n-1)*(n-1)* np.array([1., -2., 1.])
68:     elif rad == 2:
69:         if order == 1:
70:             stencil =(n-1)/12. * np.array([1., -8., 0., 8., -1.])
71:         elif order == 2:
72:             stencil = (n-1)*(n-1)* np.array([-1., 16., -30., 16., -1.])/12.
73:     print(order, stencil)
74:     d_f = cuda.to_device(f)
75:     d_stencil = cuda.to_device(stencil)
76:     d_deriv = cuda.device_array(n, dtype = np.float32)
77:     deriv_kernel[(n+TPB-1)//TPB, TPB](d_deriv, d_f, d_stencil, rad)
78: 
79:     return d_deriv.copy_to_host()
```

$$\text{Listing 5.4 - } apps/stencil/shared.py$$

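Much like the transition from serial to parallel, the shift from parallel
to shared maintains much of the same code. Outside the kernel
function, the main change made to _apps/stencil/parallel.py_ is the addition of `NSHARED = 130` on line 6, which dictates the size of the shared memory array created in `deriv_kernel()`. (In addition, `TPB` is raised to 128, and `nth_deriv()` now takes a `rad` argument, selects among stencils of radius 1 and 2, and copies the stencil to the device.) Note that 130 equals `TPB + 2*RAD` for `RAD = 1`; running with a stencil of radius 2 requires `NSHARED = 132`.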

> The kernel implementation above uses ___static shared memory___ (as opposed to ___dynamic shared memory___ that will be discussed later) which requires that the size of the shared array must be specified directly as a numerical value known at compile time. A variable cannot be passed into the kernel and used to determine the size of the shared array, so line 6 needs to be of the form  `NSHARED = 130`, ___NOT___ `NSHARED = TPB+2*RAD`

The rest of the alterations occur within the kernel and involve preparing the shared memory array. The code for the shared memory version of `deriv_kernel()` is isolated in Listing 5.5 below: 

```
27: @cuda.jit
28: #@cuda.jit("void(float32[:],float32[:],float32[:])")
29: def deriv_kernel(d_deriv, d_f, d_stencil, rad):
30:     n = d_f.shape[0]
31:     i = cuda.grid(1)
32:     sh_f = cuda.shared.array(NSHARED, dtype = float32)
33:     #thread index (and index for optional shared output array)
34:     tIdx = cuda.threadIdx.x
35:     #index for shared input array
36:     shIdx = tIdx + rad
37: 
38:     if i>=n:
39:         return
40: 
41:     #Load regular cells
42:     sh_f[shIdx] = d_f[i]
43: 
44:     #Halo cells- Check that the entries to be loaded are within array bounds
45:     if tIdx < rad:
46:         if i >= rad:
47:             sh_f[shIdx - rad] = d_f[i-rad]
48:         if i + cuda.blockDim.x < n:
49:             sh_f[shIdx + cuda.blockDim.x] = d_f[i + cuda.blockDim.x]
50: 
51:     #make sure that shared array is fully loaded before any thread reads from it
52:     cuda.syncthreads()
53: 
54:     #write values only where the full stencil is "in bounds"
55:     if i >= rad and i < n-rad:
56:         stencil_dot =  sh_f[shIdx] * d_stencil[rad]
57:         for d in range(1,rad+1):
58:             stencil_dot += sh_f[shIdx-d]*d_stencil[rad-d] + sh_f[shIdx+d]*d_stencil[rad+d]
59:         d_deriv[i] = stencil_dot

```
$$\text{Listing 5.5 - Shared memory version of deriv\_kernel}$$

The plan within the kernel is to load the shared array, `sh_f`, with the necessary elements from the global array, `d_f`. Once the data has been stored within `sh_f`, the code can proceed as it did in the parallel case, but accessing data from the shared array instead of the array in global memory. This requires some setup: after the usual assignment of the array length `n` and the thread's index in the computational grid using the numba shorthand `i=cuda.grid(1)`, the shared array is created on line 32.

While `i` provides the natural way to index the global array `d_f`, it is helpful to introduce additional indices to help make sure we get the "bookkeeping" right. In particular, on lines 34 and 36, we introduce `tIdx` as shorthand for the index of the thread within the block and `shIdx` as the index into the shared array `sh_f`. To compute the stencil for the thread with `tIdx=0` based on data from the shared array, `sh_f` needs to include `rad` left-side halo values. Thus the value of `d_f[i]` associated with thread 0 must be stored in the shared array at index `rad`. In general, data associated with a given value of `tIdx` should be stored at position `tIdx+rad`, and we ensure this correspondence by introducing `shIdx=tIdx+rad` and consistently using `shIdx` to locate shared array values.

> For consistent bookkeeping, just use the index variable appropriate for each entity: use `i` with `d_f`, `tIdx` for threads, and `shIdx` with `sh_f`.

With the bookkeeping foundation in place, the next step is to load all of the data needed for the threads in the block to perform their stencil computations. This process starts on line 42 where `sh_f[shIdx] = d_f[i]` tells each thread to load a __regular value__; i.e. a value from `d_f` whose array index matches the grid index `i` of a thread in the block. (The preceding bounds check on line 38 prevents attempts to load entries from beyond the bounds of `d_f`.)

The next step is to load the halo values from the `rad` entries on either side of where the regular values reside in `d_f`. On line 45, `if tIdx < rad:` starts a block of code instructing the first `rad` threads in the block to load the halo values. The typical organization is to have each of those threads load one halo value on each side: the left-most thread (with `tIdx=0`) loads the left-most left halo value and the left-most right halo value. With this approach, each of the threads loading halo values determines the relevant right halo index by adding the blocksize plus the stencil radius to the left halo index. Each ensuing thread loads left and right halo values one index further to the right, until the thread with `tIdx=rad-1` loads the right-most left halo value and the right-most right halo value. The code on lines 46-47 for loading the left halo values includes a left bounds test:
```
if i >= rad:
    sh_f[shIdx - rad] = d_f[i-rad]
```
and the code on lines 48-49 for loading the right halo values includes a right bounds test:
```
if i + cuda.blockDim.x < n:
    sh_f[shIdx + cuda.blockDim.x] = d_f[i + cuda.blockDim.x]
```
Note that the indices used for reading and storing the right and left halo values differ by `cuda.blockDim.x+rad`.
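To check the loading logic without a GPU, the steps above can be emulated serially with NumPy. This is only a sketch under stated assumptions: `load_tile` is a hypothetical helper (not part of the app), a loop over `t_idx` stands in for the threads of one block, `tpb` stands in for `cuda.blockDim.x`, and `NaN` marks shared-array entries that were never loaded.

```python
import numpy as np

def load_tile(d_f, block_idx, tpb, rad):
    """Serial stand-in for one block's loads into sh_f (regular plus halo values)."""
    n = d_f.size
    sh_f = np.full(tpb + 2 * rad, np.nan, dtype=d_f.dtype)  # NaN = never loaded
    for t_idx in range(tpb):             # each pass plays the role of one thread
        i = block_idx * tpb + t_idx      # global index (cuda.grid(1) in the kernel)
        sh_idx = t_idx + rad             # index into the shared array
        if i >= n:                       # bounds check on the global array
            continue
        sh_f[sh_idx] = d_f[i]            # regular value
        if t_idx < rad:                  # first rad threads also load halo values
            if i >= rad:                 # left bounds test
                sh_f[sh_idx - rad] = d_f[i - rad]
            if i + tpb < n:              # right bounds test
                sh_f[sh_idx + tpb] = d_f[i + tpb]
    return sh_f

d_f = np.arange(12, dtype=np.float32)
print(load_tile(d_f, block_idx=1, tpb=4, rad=1))
```

For the interior block in this example the tile holds the values 3 through 8: four regular values flanked by one halo value on each side. For block 0 the left halo entry stays `NaN`, because the left bounds test skips the load at the edge of the array.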

That completes the code for loading the data into the shared array, but it does not necessarily mean that the shared array is fully populated when a given thread reaches this point. Loading data into the array requires action by every thread in the block, since each one executes a load from global memory into the shared array. Remember that the block actually gets broken up into warps of 32 threads: while the threads within a warp do execute instructions in lock-step, they are not executing synchronously with the threads from other warps in the block. As a result, it is necessary to pause and make sure that all the threads in the block have executed to this point, and that all of the loads into the shared array have completed, before any thread attempts to proceed with the actual stencil computation (which requires reading data from the shared array).

The pause in execution described above is exactly the purpose of the new command, `cuda.syncthreads()`, introduced on line 52. This is our first encounter with issues involving synchronization and warps, and it is important to note that synchronization prevents cores from proceeding with execution and makes execution take longer. While this is antithetical to the goal of accelerating codes via parallelism, it should be done when necessary (and only when necessary) to obtain correct results.

> Reminder: Getting incorrect results faster does not count as acceleration!

> __Data type specifications in kernels:__ The data type of a shared array is set within a kernel and must be specified using the types provided by numba (not numpy). A valid specification looks like `dtype=cuda.float32` (or `dtype=float32`, which is equivalent after `from numba import float32`). A specification like `dtype=np.float32` is __NOT__ valid. If you get the following error messages, check that arrays created in the kernel use numba types:
```
ValueError: Specified type '<class 'numpy.float32'>' (<class 'type'>) is not a Numba type

This error is usually caused by passing an argument of a type that is unsupported by the named function.
```

Now that the description of the shared memory implementation is complete, you should edit _apps/stencil/main.py_ to import from _shared_
and execute the app to verify that the shared memory implementation produces correct results.

## 5.6 Warps, Scheduling, and Race Conditions

We glossed over warps in our earlier discussion, but we have reached a point where it is appropriate to mention how they affect the execution of code.

When a kernel is launched, the system scheduler takes each of the blocks defined during the kernel launch and assigns it to an available SM. The SM further divides the block into warps of 32 threads and, following the SIMT model, the threads within a warp execute the same instructions in lock-step. We do not have direct control over the order in which warps execute; what we do know is that the warps from a block are not guaranteed to execute together, so it pays to be very careful __not__ to assume synchronization of all threads in a block.

If we made that invalid assumption and left the `cuda.syncthreads()` out of `deriv_kernel()`, a warp would start executing, partially populate the shared array, and then start reading values from `sh_f` that are needed for the stencil computation. Some of those values will be entries in `sh_f` that are loaded by threads from another warp in the block. If that other warp has not yet executed, then instead of reading data transferred from `d_f` to `sh_f`, threads in the current warp will actually read whatever random bits happened to be in the memory allocated for `sh_f`.

This is the first case we have encountered where the result of a computation depends in a significant, unpredictable, and irreproducible way on the order of execution. Such behavior is called __undefined__ and the situation leading to undefined behavior is called a ___race condition___ (because the results are determined by the unpredictable order in which the warps finish in the race to execute).
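The effect of a missing barrier can be imitated on the CPU. The sketch below is a serial emulation, not GPU code, and everything in it is an assumption made for illustration: the helper `run_block`, the toy sizes (`WARP = 4` rather than 32), a radius-1 unnormalized central difference, a schedule in which warp 0 runs entirely before warp 1, and `NaN` standing in for uninitialized shared memory. Bounds tests are omitted because only an interior block is emulated.

```python
import numpy as np

WARP, TPB, RAD = 4, 8, 1   # toy sizes for readability; real warps have 32 threads

def run_block(d_f, block_idx, use_barrier):
    """Serially emulate one interior block whose warps run one after another."""
    sh_f = np.full(TPB + 2 * RAD, np.nan)   # NaN stands in for uninitialized bits
    out = np.full(TPB, np.nan)

    def load(t_idx):
        i = block_idx * TPB + t_idx
        sh_idx = t_idx + RAD
        sh_f[sh_idx] = d_f[i]                       # regular value
        if t_idx < RAD:                             # halo values
            sh_f[sh_idx - RAD] = d_f[i - RAD]
            sh_f[sh_idx + TPB] = d_f[i + TPB]

    def compute(t_idx):
        sh_idx = t_idx + RAD
        out[t_idx] = sh_f[sh_idx + 1] - sh_f[sh_idx - 1]  # unnormalized difference

    warps = [range(w, w + WARP) for w in range(0, TPB, WARP)]
    if use_barrier:
        # Emulates cuda.syncthreads(): every load finishes before any read.
        for warp in warps:
            for t in warp:
                load(t)
        for warp in warps:
            for t in warp:
                compute(t)
    else:
        # No barrier: each warp loads and then immediately computes.
        for warp in warps:
            for t in warp:
                load(t)
            for t in warp:
                compute(t)
    return out

d_f = np.arange(24.0) ** 2
print("with barrier:   ", run_block(d_f, 1, use_barrier=True))
print("without barrier:", run_block(d_f, 1, use_barrier=False))
```

In this emulation the last thread of warp 0 needs a value that only warp 1 loads, so without the barrier its result comes out as `NaN`. On real hardware the outcome is less tidy: the stale value is arbitrary and the warp ordering can change from run to run.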

> (A note on small blocks: Any block with 32 or fewer threads has 1 warp and takes up 32 threads, so specifying blocks with fewer than 32 threads leads to wasted resources.)
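The arithmetic behind this note can be sketched in a few lines. The helper name `warp_usage` is hypothetical, and the warp size of 32 is the value quoted in the text.

```python
WARP_SIZE = 32   # threads per warp, as stated in the text

def warp_usage(tpb, warp_size=WARP_SIZE):
    """Return (number of warps, idle thread slots) for a block of tpb threads."""
    warps = -(-tpb // warp_size)       # ceiling division
    idle = warps * warp_size - tpb     # slots reserved in warps but not used
    return warps, idle

for tpb in (8, 32, 48, 128):
    print(tpb, warp_usage(tpb))
```

A block of 8 threads still occupies one full warp and leaves 24 slots idle, while a block of 48 threads needs two warps and leaves 16 idle. Choosing a blocksize that is a multiple of 32 avoids this waste.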

In this situation, where multiple threads load values into a shared memory array, `cuda.syncthreads()` is the right tool for ensuring the necessary synchronization across the block. The other common CUDA synchronization tools include `synchronize()` (for synchronizing a computational stream) and atomic operations, which we will encounter when we consider the reduction pattern.

## 5.7 Suggested Projects

1. Write a version of _apps/stencil/main.py_ that executes the different implementations of the _stencil_ app, verifies that the results are the same, and compares execution times for different array sizes.

2. Revise the _stencil_ app to use difference stencils with larger radii. Compare and contrast the results.

3. Explore how execution time depends on blocksize and stencil radius. (Remember to adjust `NSHARED` to be consistent with revised values of `TPB` and `RAD`.) What is the largest blocksize you can execute? What error is produced when you go beyond that limit?

4. Profile the different versions of the _stencil_ app.
