# Iterative Solvers for Elliptic PDEs

In "Intro to Finite Difference Methods for PDEs" we classified the second-order quasilinear PDEs as elliptic, parabolic, or hyperbolic, and we started looking at solvers for elliptic equations.

In particular, we looked at the 2D Laplace equation:
- 2D Cartesian (rectangular, coord-aligned) domain (coords: $\{x,y\}$)
- Dirichlet BCs (values specified on boundary)

$$ \nabla^2 u = \frac{\partial^2u}{\partial x^2} + \frac{\partial^2u}{\partial y^2} = 0 \text{ on } x \in [0,L_x]; \; y \in [0,L_y]$$

BCs: $u(0,y)=u(L_x,y)=u(x,0)=0, \; u(x,L_y) = U$

After discretizing the spatial domain to a regular 2D grid, we discretized the PDE by substituting with central difference approximations for the derivatives:

$$
\begin{aligned}
\frac{\partial^2u}{\partial x^2} & \rightarrow \frac{1}{h^2}(u_{i-1,j}-2u_{i,j}+u_{i+1,j}) + O(h^2) \\
\frac{\partial^2u}{\partial y^2} & \rightarrow \frac{1}{h^2} (u_{i,j-1} -2 u_{i,j} + u_{i,j+1}) + O(h^2)
\end{aligned}
$$
- For each interior grid point, we get an equation of the form (after multiplying through by $h^2$):

$$ \nabla^2 u = \frac{\partial ^2 u}{\partial x^2} + \frac{\partial^2u}{\partial y^2} \rightarrow u_{i-1,j}+u_{i+1,j}+u_{i,j-1} + u_{i,j+1} -4u_{i,j} = 0$$

or in code format:

`u[i-1,j] + u[i+1,j] + u[i,j-1] + u[i,j+1] - 4*u[i,j] == 0`

$\implies$ __5-point stencil computation__: 

![stencil_5point](stencil_5point.png)
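
As a quick sanity check, the stencil can be applied to every interior point at once with NumPy array slicing. The sketch below is illustrative only: the helper name `stencil_residual`, the grid size, and the test function $u = x^2 - y^2$ (which is harmonic, and for which the central differences are exact) are assumptions, not part of the course codes.

```python
import numpy as np

def stencil_residual(u):
    """Evaluate the 5-point stencil at every interior point of a 2D array."""
    return (u[:-2, 1:-1] + u[2:, 1:-1]      # u[i-1,j] + u[i+1,j]
            + u[1:-1, :-2] + u[1:-1, 2:]    # u[i,j-1] + u[i,j+1]
            - 4.0 * u[1:-1, 1:-1])          # -4*u[i,j]

# Hypothetical test case: u = x^2 - y^2 satisfies Laplace's equation
n = 9                                       # grid points per side (assumed)
x = np.linspace(0.0, 1.0, n)
X, Y = np.meshgrid(x, x, indexing="ij")     # u[i, j] <-> (x_i, y_j)
u_harmonic = X**2 - Y**2

max_residual = np.abs(stencil_residual(u_harmonic)).max()
print("largest stencil residual:", max_residual)
```

The residual should vanish up to floating-point roundoff, since the second differences of a quadratic carry no truncation error.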

With no source terms (so the right-hand side is zero), we can solve for $u_{i,j}$ in terms of values at neighboring grid points. At equilibrium (for the actual solution for the steady-state temperature distribution), the following relation should be satisfied (in the limit as the grid is refined):

$$u_{i,j} = \frac{1}{4} (u_{i-1,j} + u_{i+1,j} +u_{i,j-1} +u_{i,j+1} )$$

We turn this into an iterative scheme by adding an iteration count superscript, so that at each iteration the value at each grid point is replaced by the stencil result based on its neighbors (in this case, by the average of the neighboring values).

$$u^{k+1}_{i,j} = \frac{1}{4} (u^k_{i-1,j} + u^k_{i+1,j} +u^k_{i,j-1} +u^k_{i,j+1} )$$

Game plan for computing solution:
  - Repeatedly update each grid point. 
  - Converge to numerical solution!?
  - This approach is known as Jacobi iteration.
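
A minimal serial sketch of Jacobi iteration in NumPy follows. It assumes a square `n` x `n` grid with `u[i, j]` corresponding to $(x_i, y_j)$, so the side $y = L_y$ is the last column. The function name `jacobi_solve`, the tolerance `tol`, the cap `max_iters`, and the "largest change per sweep" stopping test are illustrative choices rather than the course implementation.

```python
import numpy as np

def jacobi_solve(n, U=1.0, tol=1e-6, max_iters=100_000):
    """Jacobi iteration for the 2D Laplace equation on an n x n grid.

    Returns the grid and the number of sweeps performed.
    """
    u = np.zeros((n, n))
    u[:, -1] = U                      # Dirichlet value U on the side y = L_y
    for k in range(1, max_iters + 1):
        u_new = u.copy()              # Jacobi needs a second copy of the array
        # Replace each interior value by the average of its 4 OLD neighbors
        u_new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                                    + u[1:-1, :-2] + u[1:-1, 2:])
        change = np.abs(u_new - u).max()
        u = u_new
        if change < tol:              # assumed stopping criterion
            return u, k
    return u, max_iters

u, sweeps = jacobi_solve(17)
print("sweeps:", sweeps, " center value:", u[8, 8])
```

By symmetry, the value at the center of a square with one side held at $U$ and the other three at zero is $U/4$, which gives a simple check on the converged result.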

Let's take a look at implementation and convergence issues:

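While the truncation error is $O(h^2)$, refining the grid does not by itself reduce the error in the computed solution: the iteration also has to converge on the finer grid.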

Also need to worry about rate of convergence:
- Jacobi spectral radius (magnitude of the largest eigenvalue of the BIG iteration matrix), with $h$ measured relative to a unit domain: 
$$\rho_{jacobi} \approx 1-\frac{1}{2} (\pi^2 h^2)$$
- Good news: $\rho_{jacobi} < 1 \implies$ convergence
- Bad news: grid refinement $\implies h \to 0, \rho_{jacobi} \to 1$
- Attempts to reduce truncation error by refinement cause SLOW convergence
- Number of iterations for given accuracy $\sim O(1/h^2) = O(N^2)$ for an $N \times N$ grid
- Total operation count ($N^2$ points $\times$ $N^2$ iterations) $\sim O(N^4) \implies$ Serial = Painful
- Sub-optimal memory usage: <br>new values computed from old values $\implies$ need a copy of the array  
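
To put numbers on the slow convergence, the estimate below asks how many sweeps $k$ are needed for $\rho^k$ to fall below a target error reduction. It is a rough sketch under stated assumptions: a unit domain with $h = 1/N$, an arbitrary reduction factor of $10^{-6}$, and a hypothetical helper name `iterations_needed`.

```python
import math

def iterations_needed(rho, reduction=1e-6):
    """Smallest k with rho**k <= reduction (error shrinks by rho per sweep)."""
    return math.ceil(math.log(reduction) / math.log(rho))

for N in (16, 32, 64, 128):
    h = 1.0 / N                                  # assumed unit domain
    rho_jacobi = 1.0 - 0.5 * (math.pi * h) ** 2  # estimate from the notes
    print(N, iterations_needed(rho_jacobi))
```

Each doubling of $N$ should roughly quadruple the sweep count, consistent with the $O(N^2)$ iteration estimate.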

Alternatives:
- Gauss-Seidel: systematically use updated (improved) values
- Can operate "in place" with single version of array
- Spectral radius: 
$$\rho_{GS} \approx 1-(\pi^2 h^2)$$ 
    - Improved
    - But only by factor of 2
    - CUDA parallel version will be non-deterministic
- Red-Black version (inspired by checkerboard)
  - Grid points of one color updated based only on other color
  - Run kernel twice:
    - update red points
    - update black points
  - Deterministic, and retains the improved Gauss-Seidel convergence rate
- Successive over-relaxation
  - Think of the change in values of $u$ as a direction to move to reduce the residual error

$$\Delta u_{i,j} = u^{k+1}_{i,j}-u^k_{i,j} = \frac{1}{4} (u^k_{i-1,j} + u^k_{i+1,j} +u^k_{i,j-1} +u^k_{i,j+1} -4u^k_{i,j})$$

```
du = (u[i-1,j]+u[i+1,j]+u[i,j-1]+u[i,j+1]-4.*u[i,j])/4.
```

  - If it is good to move in that direction, wouldn't it be better to move farther in that direction?

$$u^{k+1}_{i,j} = u^k_{i,j} + w \, \Delta u_{i,j}$$
```
u[i,j] = u[i,j] + w * du
```
  - `w` is the over-relaxation factor

  - Don't get carried away: $w>2$ $\implies$ divergence
  - Optimal value (for small $h$): $w_{opt} = \frac{2}{1+\sin(\pi h)} \approx 2-2 \pi h$
  - Optimal convergence: 
  $$\rho_{opt} = w_{opt} - 1 \approx 1 - 2 \pi h $$
  - As the grid is refined, the convergence multiplier approaches 1 more slowly ($1-\rho \sim h$ instead of $h^2$), so the iterations required for a given accuracy scale closer to $O(N)$
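
The red-black ordering and over-relaxation can be combined in one short serial sketch. Here each color is updated in place from the current values of the other color, and `w = 1` reduces to red-black Gauss-Seidel. The function name `red_black_sor`, the boolean color masks, and the stopping test are assumptions for illustration. A GPU version would launch one kernel per color instead of looping over masks.

```python
import numpy as np

def red_black_sor(n, w, U=1.0, tol=1e-6, max_iters=100_000):
    """Red-black SOR for the 2D Laplace equation (w = 1 gives Gauss-Seidel)."""
    u = np.zeros((n, n))
    u[:, -1] = U                                   # side y = L_y held at U
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    interior = (i > 0) & (i < n - 1) & (j > 0) & (j < n - 1)
    red = interior & ((i + j) % 2 == 0)            # checkerboard coloring
    black = interior & ((i + j) % 2 == 1)
    for k in range(1, max_iters + 1):
        max_change = 0.0
        for mask in (red, black):                  # "run the kernel twice"
            du = np.zeros_like(u)
            # du recomputed from the current u, so black sees updated red
            du[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                                     + u[1:-1, :-2] + u[1:-1, 2:]
                                     - 4.0 * u[1:-1, 1:-1])
            u[mask] += w * du[mask]                # over-relaxed in-place update
            max_change = max(max_change, np.abs(du[mask]).max())
        if max_change < tol:                       # assumed stopping criterion
            return u, k
    return u, max_iters

n = 17
h = 1.0 / (n - 1)
_, sweeps_gs = red_black_sor(n, w=1.0)
_, sweeps_sor = red_black_sor(n, w=2.0 - 2.0 * np.pi * h)  # estimate of w_opt
print("Gauss-Seidel sweeps:", sweeps_gs, " SOR sweeps:", sweeps_sor)
```

Only a single array is stored, and the result does not depend on the order in which points of the same color are processed, which is what makes the scheme attractive for a parallel implementation.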

Our interest lies in parallelizing these methods and characterizing their performance.

- Compared to CPU world:
  - less concerned about operation counts
  - more concerned about memory and iterations (when forced to serialize)
  - 2D tiled approach using shared memory with halo values

Let's look at sample serial and parallel codes...