# Introduction to MPI - Day 1

## 1. Code Description

### 1.1 The model problem
The code numerically solves the 2D Poisson equation with Dirichlet boundary conditions on the unit square, discretised with a second-order central finite difference operator.

Consider 

\begin{align*}
-\Delta u &= f \; \text{in} \; \Omega \\
 u &= g \; \text{on} \; \partial \Omega
\end{align*}

for $\Omega = [0,1] \times [0,1]$.

Define a uniform partition of the domain $\Omega$ with nodal points at which the solution of the Poisson equation is sampled. Let $h$ be the uniform spacing between neighbouring nodal points; the nodal points of the mesh are then

\begin{align*}
x_i = i h, \; y_j = j h\qquad i,j = 0,\cdots, N
\end{align*}
where $N = 1/h$ is the number of mesh intervals per direction and the integers $i$, $j$ index the nodal points along the $x$- and $y$-axes.

### 1.2 Discretisation
We use the second-order central finite difference method to discretise the Laplace operator
 \begin{align*}
     \big(\Delta u\big)_{i,j} & = \big(D_{xx}^2u\big)_{i,j}+ \big(D_{yy}^2 u\big)_{i,j}\\[2ex]
     & \approx \frac{u_{i+1,j}-2u_{i,j}+u_{i-1,j}}{h^2} + \frac{u_{i,j+1}-2u_{i,j}+u_{i,j-1}}{h^2} 
 \end{align*}

leading to
\begin{align}
    -\big(\Delta u\big)_{i,j} \approx \frac{4u_{i,j}-u_{i+1,j}-u_{i-1,j}-u_{i,j+1}-u_{i,j-1}}{h^2}=f_{i,j}, \qquad f_{i,j} = f(x_i, y_j).
\end{align}

The above finite-difference formula can be represented compactly by a five-point stencil (scaled by $1/h^2$) applied at each interior mesh point
\begin{align*}S = 
    \begin{pmatrix}
    & -1 & \\
    -1 & 4 &-1\\
    & -1 &
    \end{pmatrix}.
\end{align*}

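Imposing the Dirichlet boundary condition and ordering the interior nodal points row by row, the discretisation can be written as a linear system

\begin{align*}
Au = f, \qquad A=\frac{1}{h^2}
    \begin{pmatrix}
T & -I \\
-I & T & -I \\
& -I & \ddots & \ddots \\
& & \ddots & \ddots & -I \\
& & & -I & T
\end{pmatrix}, \qquad
T = \begin{pmatrix}
4 & -1 \\
-1 & 4 & \ddots \\
& \ddots & \ddots & -1 \\
& & -1 & 4
\end{pmatrix}
\end{align*}
where $T, I \in \mathbb{R}^{(N-1) \times (N-1)}$, $A \in \mathbb{R}^{(N-1)^2 \times (N-1)^2}$, $u \in \mathbb{R}^{(N-1)^2}$ and $f \in \mathbb{R}^{(N-1)^2}$. The known boundary values $g$ are moved into the right-hand side $f$.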

Note that with the five-point stencil, the matrix $A$ is never assembled: its action on $u$ is computed directly from the stencil.

### 1.3 Numerical Solvers

To solve the linear system, the Jacobi iterative method is used.

**Jacobi method**

\begin{align*}
u^{(k+1)} = D^{-1}\big( f - Lu^{(k)} -Uu^{(k)}\big),
\end{align*}
where $D$, $L$ and $U$ are the diagonal, strictly lower triangular and strictly upper triangular parts of $A$, respectively.
Written in stencil form,
\begin{align*}
u^{(k+1)}_{ij} = (h^2 f_{ij} + u^{(k)}_{i-1,j} +u^{(k)}_{i+1,j} + u^{(k)}_{i, j-1}+u^{(k)}_{i,j+1})/4 \qquad i,j = 1,\cdots, N-1 
\end{align*} 
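The stencil update above can be sketched as a single serial sweep. This is a minimal illustration, not the routine in solver.c: the function name `jacobi_sweep`, the row-major storage of an $(N+1) \times (N+1)$ grid in a flat array, and the argument order are all assumptions made for this example.

```c
#include <assert.h>

/* One Jacobi sweep on an (n+1) x (n+1) grid stored row-major in flat arrays.
 * Hypothetical helper: reads the old iterate u, writes the new iterate u_new.
 * Only interior nodes (1..n-1 in each direction) are updated; boundary
 * entries of u_new are left untouched. */
void jacobi_sweep(int n, double h, const double *u, double *u_new,
                  const double *f)
{
    assert(n >= 2 && h > 0.0);
    const int stride = n + 1; /* number of nodes per row */

    for (int j = 1; j < n; ++j) {
        for (int i = 1; i < n; ++i) {
            const int c = j * stride + i; /* index of node (i, j) */
            u_new[c] = 0.25 * (h * h * f[c]
                               + u[c - 1] + u[c + 1]            /* i-1, i+1 */
                               + u[c - stride] + u[c + stride]); /* j-1, j+1 */
        }
    }
}
```

Writing into a separate array `u_new` matters: Jacobi uses only values from iteration $k$, so updating `u` in place would turn the sweep into a different method.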

The method is implemented in [solver.c](./solver.c) if you want a peek, but remember that the routines in this file do not need refactoring.

### 1.4 Convergence measurement 
Usually it is impossible to measure the error of the numerical solution, $e^{(k)}= u^{(k)} -u^\ast$, because it requires the true solution $u^\ast$, which is unknown.

Often the difference between successive approximations, $d^{(k)} = u^{(k)} - u^{(k-1)}$, is monitored instead, and the iteration is stopped once $\|d^{(k)}\|$ is small. However, this Cauchy-sequence type of quantity does not by itself show how well $u^{(k)}$ satisfies the matrix problem: a slowly converging iteration produces small updates while the error is still large. Therefore this practice should be avoided.
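A short derivation makes the risk concrete. It is a sketch that assumes the Jacobi iteration is written as $u^{(k+1)} = Gu^{(k)} + c$ with $G = -D^{-1}(L+U)$ and $c = D^{-1}f$, that $L$ and $U$ are the strictly triangular parts of $A$, and that $\|G\|_2 < 1$. Since $u^\ast = Gu^\ast + c$,

\begin{align*}
e^{(k+1)} &= G e^{(k)}, \\
d^{(k+1)} &= e^{(k+1)} - e^{(k)} = (G - I)\, e^{(k)}, \\
\|e^{(k)}\|_2 &\le \frac{\|d^{(k+1)}\|_2}{1 - \|G\|_2}.
\end{align*}

For the five-point Laplacian on this mesh, $\|G\|_2 = \cos(\pi h)$, so $1/(1 - \|G\|_2) \approx 2/(\pi^2 h^2)$. Taking $h = 1/100$ purely as an illustration, this factor is about $2 \times 10^3$: an update that looks negligible may sit next to an error three orders of magnitude larger.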

A standard measure is the residual of the problem,
$$r^{(k)} = f - Au^{(k)},$$
as implemented in [solver.c](./solver.c). Since $Au^\ast = f$, the residual and the error are related by
$$Ae^{(k)} = -r^{(k)} \implies  \|r^{(k)}\|\le \|A\| \|e^{(k)}\|.$$
Because $e^{(k)} = -A^{-1}r^{(k)}$, this also gives $$\|e^{(k)}\| \le \|A^{-1}\| \|r^{(k)}\|,$$
that is $$\lim_{k\to \infty} \|r^{(k)}\| = 0 \implies \lim_{k\to \infty}\|e^{(k)}\| = 0.$$
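The residual can be evaluated with the stencil alone, without forming $A$. The sketch below is an assumption-laden illustration rather than the code in solver.c: the name `residual_max_norm`, the flat row-major $(N+1) \times (N+1)$ layout and the choice of the maximum norm are all picked for this example, and the course code may use a different norm.

```c
#include <assert.h>

/* Maximum norm of the residual r = f - A u over the interior nodes of an
 * (n+1) x (n+1) grid stored row-major. Hypothetical helper: A is applied
 * through the five-point stencil, so no matrix is stored. */
double residual_max_norm(int n, double h, const double *u, const double *f)
{
    assert(n >= 2 && h > 0.0);
    const int stride = n + 1; /* number of nodes per row */
    double rmax = 0.0;

    for (int j = 1; j < n; ++j) {
        for (int i = 1; i < n; ++i) {
            const int c = j * stride + i;
            /* (A u) at node (i, j) */
            const double Au = (4.0 * u[c] - u[c - 1] - u[c + 1]
                               - u[c - stride] - u[c + stride]) / (h * h);
            const double r = f[c] - Au;
            const double abs_r = (r < 0.0) ? -r : r;
            if (abs_r > rmax) {
                rmax = abs_r;
            }
        }
    }
    return rmax;
}
```

A quick sanity check is $u = x^2 + y^2$ with $f = -4$: the central difference is exact for quadratics, so the residual should vanish up to rounding.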



### 1.5 Parallelisation
As discussed in the slides, the unit square domain is decomposed into subdomains for parallelisation. Each MPI process hosts a subdomain.
```cpp
double (*submesh)[mesh_size] = malloc(sizeof *submesh * *ptr_rows);
```
where `mesh_size` is the number of grid points per dimension and `*ptr_rows` is the number of rows stored by the process, ghost rows included.


The domain is split along the vertical direction into horizontal slabs, so each subdomain has two rows of ghost nodes, one at the top and one at the bottom.
```cpp
/* top row of ghost nodes */
submesh[*ptr_rows -1]

/*bottom row of ghost nodes */
submesh[0]
```
The mesh initialisation is written in [mesh.c](./mesh.c). Please do not edit.

Note that the value `ptr_rows` points to is different for the top slab subdomain, as it may host extra rows. The following example demonstrates the decomposition of a 27x27 domain distributed over 4 MPI processes.
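As a rough numerical illustration of such a split, the sketch below assumes that the rows are divided evenly and that the remainder goes to the highest rank, taken here to be the top slab. Both the rule and the function name `owned_rows` are assumptions for this example; the actual logic in mesh.c may differ, and ghost rows are not counted.

```c
#include <assert.h>

/* Number of full (non-ghost) rows owned by `rank` when `total_rows` rows are
 * split over `nprocs` slabs. Hypothetical rule: equal shares, with the
 * remainder given to the last rank (the top slab). */
int owned_rows(int total_rows, int nprocs, int rank)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs && total_rows >= nprocs);
    const int base = total_rows / nprocs;  /* rows every slab gets */
    const int extra = total_rows % nprocs; /* leftover rows */
    return (rank == nprocs - 1) ? base + extra : base;
}
```

Under this assumed rule, 27 rows on 4 processes give 6 rows each to ranks 0, 1 and 2, and 9 rows to rank 3.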

## 2. Blocking Communication

Each subdomain needs to update its ghost rows from neighbouring processes and send the top and bottom rows of full nodes to the neighbouring processes. 

For MPI process $i$ to receive the update for its top row of ghost nodes using standard blocking communication, we use
```cpp 
/* on MPI process i */
 MPI_Recv(submesh[*ptr_rows -1], mesh_size, MPI_DOUBLE, upper, highertag, MPI_COMM_WORLD, &status);
```

This requires the upper rank $i+1$ to coordinate with a send call
```cpp
/* on MPI process i+1 */
MPI_Send(submesh[1], mesh_size, MPI_DOUBLE, lower, highertag, MPI_COMM_WORLD);
```

Because every process executes both calls, they can be placed together without branching on the rank. See [laplace_mpi_blocking.c](./laplace_mpi_blocking.c).
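To make the row indexing concrete without MPI, the sketch below mimics the exchange between two neighbouring slabs held in one process, with `memcpy` standing in for a matched `MPI_Send`/`MPI_Recv` pair. The function name `swap_ghost_rows` is hypothetical, and the second copy (last full row of the lower slab into the bottom ghost row of the upper slab) is an assumed mirror image of the exchange shown above, not code taken from the course files.

```c
#include <assert.h>
#include <string.h>

/* Serial stand-in for the ghost-row exchange between a lower slab (rank i)
 * and the slab directly above it (rank i+1). Each slab is indexed as
 * slab[row][col]; row 0 is the bottom ghost row and row rows-1 is the top
 * ghost row, as in the submesh layout. */
void swap_ghost_rows(int cols, int lower_rows,
                     double lower[][cols], double upper[][cols])
{
    assert(cols > 0 && lower_rows >= 3);
    const size_t row_bytes = sizeof(double) * (size_t)cols;

    /* upper's first full row -> lower's top ghost row */
    memcpy(lower[lower_rows - 1], upper[1], row_bytes);

    /* lower's last full row -> upper's bottom ghost row (assumed mirror) */
    memcpy(upper[0], lower[lower_rows - 2], row_bytes);
}
```

Only ghost rows are written; the full rows of both slabs are read but never modified, which is why the exchange can run before every Jacobi sweep without disturbing the current iterate.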

**`TODO`**: in [laplace_mpi_blocking.c](./laplace_mpi_blocking.c), only the top ghost row is updated; complete the same communication for the bottom ghost row. Once you are finished, run the next cell. If you are stuck, peek at the solution in [soln_laplace_mpi_blocking.c](./solutions/laplace_mpi_blocking.c).

!make clean && make blocking && echo "Compilation Successful!" && mpiexec -np 8 ./laplace_mpi_blocking 100 500 Jacobi > convergence_blocking.txt
!tail -20 convergence_blocking.txt 

rm -f *.o laplace_mpi_blocking laplace_mpi_nonblocking laplace_mpi_persistent
mpicc -g -Wall -O3 -lm -o laplace_mpi_blocking laplace_mpi_blocking.c mesh.c solver.c
laplace_mpi_blocking.c: In function ‘main’:
  226 | int count, w_count ;
      |            ^~~~~~~
Compilation Successful!
Residual  0.075310 after iter 482 
Residual  0.075272 after iter 483 
Residual  0.075234 after iter 484 
Residual  0.075196 after iter 485 
Residual  0.075158 after iter 486 
Residual  0.075120 after iter 487 
Residual  0.075082 after iter 488 
Residual  0.075045 after iter 489 
Residual  0.075007 after iter 490 
Residual  0.074969 after iter 491 
Residual  0.074931 after iter 492 
Residual  0.074894 after iter 493 
Residual  0.074856 after iter 494 
Residual  0.074818 after iter 495 
Residual  0.074780 after iter 496 
Residual  0.074743 after iter 497 
Residual  0.074705 after iter 498 
Residual  0.074668 after iter 499 
Residual  0.074630 a

## Nonblocking Communication
While the MPI processes communicate, the iterative method can already update the interior grid nodes (all full rows except the top and bottom ones), since these do not depend on ghost values. To this end, a `Jacobi_int` routine is separated from `Jacobi_top` and `Jacobi_bottom` (see [solver.c](./solver.c)).

Hence, each iteration starts by posting the sends and receives for the top and bottom rows, followed by the `Jacobi_int` routine. Before applying `Jacobi_top` and `Jacobi_bottom` respectively, one needs to ensure that the corresponding communication has completed. The communications use `MPI_Request`
```cpp
 MPI_Request top_bnd_requests[2],  bottom_bnd_requests[2];
```
to identify the communications. 

The code in [laplace_mpi_nonblocking.c](./laplace_mpi_nonblocking.c) is written on a "first come, first served" basis: whichever pair of requests completes first gets to proceed with its part of the Jacobi method.
```cpp
    /* Test the top and bottom layers; abort if either call reports an error */
    if ((MPI_Testall(2, top_bnd_requests, &top_flag, top_bnd_status) != MPI_SUCCESS) ||
        (MPI_Testall(2, bottom_bnd_requests, &bottom_flag, bottom_bnd_status) != MPI_SUCCESS))
    {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* if the top layer is ready */
    if (top_flag) {
        /* perform jacobi on the top bnd */
        Jacobi_top(ptr_rows, mesh_size, &submesh[0][0], &submesh_new[0][0], &subrhs[0][0], space);

        /* if the bottom layer is ready */
        if (bottom_flag) {
            /* perform jacobi on the bottom bnd */
            Jacobi_bottom(ptr_rows, mesh_size, &submesh[0][0], &submesh_new[0][0], &subrhs[0][0], space);
        }
        /* if the bottom layer is not ready yet */
        else {
            /* wait on the bottom layer */
            MPI_Waitall(2, bottom_bnd_requests, bottom_bnd_status);
            Jacobi_bottom(ptr_rows, mesh_size, &submesh[0][0], &submesh_new[0][0], &subrhs[0][0], space);
        }
    }
    /* if the top layer is not ready yet but the bottom is */
    else if (bottom_flag) {
        /* perform jacobi on the bottom bnd, which is ready */
        Jacobi_bottom(ptr_rows, mesh_size, &submesh[0][0], &submesh_new[0][0], &subrhs[0][0], space);
        /* wait on the top layer */
        MPI_Waitall(2, top_bnd_requests, top_bnd_status);
        Jacobi_top(ptr_rows, mesh_size, &submesh[0][0], &submesh_new[0][0], &subrhs[0][0], space);
    }
    /* if neither the top nor the bottom is ready, wait on both */
    else {
        MPI_Waitall(2, bottom_bnd_requests, bottom_bnd_status);
        Jacobi_bottom(ptr_rows, mesh_size, &submesh[0][0], &submesh_new[0][0], &subrhs[0][0], space);
        MPI_Waitall(2, top_bnd_requests, top_bnd_status);
        Jacobi_top(ptr_rows, mesh_size, &submesh[0][0], &submesh_new[0][0], &subrhs[0][0], space);
    }
```
However, this may overcomplicate the program.

**`TODO`**  Change the code in [laplace_mpi_nonblocking.c](./laplace_mpi_nonblocking.c) to a simplified version that binds both the top and bottom communications into one batch of requests. Once you are finished, run the next cell. If you are stuck, peek at the solution in [soln_laplace_mpi_nonblocking.c](./solutions/laplace_mpi_nonblocking.c).



!make clean && make nonblocking && echo "Compilation Successful!" && mpiexec -np 8 ./laplace_mpi_nonblocking 100 500 Jacobi > convergence_nonblocking.txt
!tail -20 convergence_nonblocking.txt 

rm -f *.o laplace_mpi_blocking laplace_mpi_nonblocking laplace_mpi_persistent
mpicc -g -Wall -O3 -lm -o laplace_mpi_nonblocking laplace_mpi_nonblocking.c mesh.c solver.c
laplace_mpi_nonblocking.c: In function ‘main’:
  144 | int index, top_flag, bottom_flag;
      |     ^~~~~
Compilation Successful!
Residual  0.075310 after iter 482 
Residual  0.075272 after iter 483 
Residual  0.075234 after iter 484 
Residual  0.075196 after iter 485 
Residual  0.075158 after iter 486 
Residual  0.075120 after iter 487 
Residual  0.075082 after iter 488 
Residual  0.075045 after iter 489 
Residual  0.075007 after iter 490 
Residual  0.074969 after iter 491 
Residual  0.074931 after iter 492 
Residual  0.074894 after iter 493 
Residual  0.074856 after iter 494 
Residual  0.074818 after iter 495 
Residual  0.074780 after iter 496 
Residual  0.074743 after iter 497 
Residual  0.074705 after iter 498 
Residual  0.074668 after iter 499 
Residua

## 3. Persistent Communication

You should now be quite familiar with our code, and may have noticed that the same communication is executed over and over again inside the iteration loop, only with different data in the buffer. Recall the four stages of an `MPI Operation` mentioned in the slides. The `initialisation stage` hands over the argument list, which stays unchanged in our program. Persistent communication binds the argument list to the communication request once and reuses it in all subsequent communication calls.

**`TODO`** Refactor the code in [laplace_mpi_persistent.c](./laplace_mpi_persistent.c), which currently uses the nonblocking routines from the previous section, to use persistent communication. Once you are finished, run the next cell. If you are stuck, peek at the solution in [soln_laplace_persistent.c](./solutions/laplace_mpi_persistent.c).

Essential MPI functions needed: `MPI_Recv_init`, `MPI_Send_init`, `MPI_Startall`, `MPI_Waitall`, `MPI_Request_free`

!cd ./solutions && make clean && make persistent && echo "Compilation Successful!" && mpiexec -np 8 ./laplace_mpi_persistent 100 500 Jacobi > convergence_persistent.txt
!cd ./solutions && tail -20 convergence_persistent.txt

rm -f *.o laplace_mpi_blocking laplace_mpi_nonblocking laplace_mpi_persistent
mpicc -g -Wall -O3 -lm -o laplace_mpi_persistent laplace_mpi_persistent.c mesh.c solver.c
laplace_mpi_persistent.c: In function ‘main’:
  159 | int index, top_flag, bottom_flag;
      |     ^~~~~
Compilation Successful!
Residual  0.075272 after iter 483 
Residual  0.075234 after iter 484 
Residual  0.075196 after iter 485 
Residual  0.075158 after iter 486 
Residual  0.075120 after iter 487 
Residual  0.075082 after iter 488 
Residual  0.075045 after iter 489 
Residual  0.075007 after iter 490 
Residual  0.074969 after iter 491 
Residual  0.074931 after iter 492 
Residual  0.074894 after iter 493 
Residual  0.074856 after iter 494 
Residual  0.074818 after iter 495 
Residual  0.074780 after iter 496 
Residual  0.074743 after iter 497 
Residual  0.074705 after iter 498 
Residual  0.074668 after iter 499 
Residual  0.074630 after iter 500 
Final resi