<center><img src="../img/ICHEC_Logo.png" alt="Drawing" style="width: 500px;"/></center>

<center><img src="../img/MPI_Logo.png" alt="Drawing" style="width: 250px;"/></center>

# <center>MPI (Message Passing Interface)</center>

***
## <center><b>Non-blocking Communication</b></center>

Let's recall our deadlock situation, which can happen for a cyclic structure.

<center><img src="../img/deadlock.png" alt="Drawing" style="width: 350px;"/> </center>

We also have a serialisation case, which can happen for non-cyclic workflows.

<center><img src="../img/serialMPI.png" alt="Drawing" style="width: 350px;"/> </center>

<details>
    <summary markdown="span">Which situation is worse, deadlock or serialisation?</summary>
<br>
   Serialisation! 
   
   At least with a deadlock it's easier to identify the problem.
<br>
</details>


* Non-blocking sends and receives.
    - **`isend`** & **`irecv`**
    - Returns immediately and sends/receives in background.
    - Return value is a Request object.
* Enables some computing concurrently with communication.
* Avoids many common deadlock situations.



```python
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    data = np.arange(size, dtype=float) * (rank + 1)
    # start a send
    req = comm.Isend(data, dest=1)
    # ... do something else ...
    calculate_something(rank)
    # wait for the send to finish
    req.wait()
    # now safe to read/write data again

elif rank == 1:
    data = np.empty(size, dtype=float)
    # post a receive
    req = comm.Irecv(data, source=0)
    # ... do something else ...
    calculate_something(rank)
    # wait for the receive to finish
    req.wait()
    # data is now ready to use
```

`isend` hands the message over and returns immediately; the message is then transferred and received by the matching `recv`.

<center><img src="../img/isend.png" alt="Drawing" style="width: 350px;"/> </center>

`irecv` sets up the receive buffer and returns immediately. The matching `send` then delivers the message, which may already have arrived by the time the receiver needs it; at the latest it is received during `wait`.


<center><img src="../img/irecv.png" alt="Drawing" style="width: 350px;"/> </center>


<center><img src="../img/4.1.1.png" alt="Drawing" style="width: 350px;"/> </center>

* In the above example, the data are colour coded as:
    - **<span style="color:green">ghost_data</span>**
    - **<span style="color:purple">border_data</span>**
    - **<span style="color:blue">local_data</span>**

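```python
# `neighbour` is a placeholder for the rank of the neighbouring process
request = comm.Irecv(ghost_data, source=neighbour)
request2 = comm.Isend(border_data, dest=neighbour)
compute(ghost_independent_data)
# ghost_data must have arrived before the border is updated
request.wait()
# border_data must have been sent before it is overwritten
request2.wait()
compute(border_data)
```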

* Interleaving communication and computation

* Methods **`waitall()`** and **`waitany()`** come in handy when dealing with multiple non-blocking operations (available in the **`MPI.Request`** class).
* **`Request.waitall(requests)`**
    - Wait for all initiated requests to complete.
* **`Request.waitany(requests)`**
    - Wait for any initiated request to complete.

In [None]:
%%writefile Isend_Irecv.py 

from mpi4py import MPI
from mpi4py.MPI import Request
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# data = send buffer
data = np.arange(10, dtype=float) * (rank + 1)
# buffer = receive buffer
buffer = np.zeros(10, dtype=float)

print('rank', rank, 'sending:', data)

if rank == 0:
    req = [comm.Isend(data, dest=1)]
    req.append(comm.Irecv(buffer, source=1))
if rank == 1:
    req = [comm.Isend(data, dest=0)]
    req.append(comm.Irecv(buffer, source=0))
    
print('rank', rank, 'receive buffer before wait:', buffer)
    
Request.waitall(req)

print('rank', rank, 'receive buffer after wait:', buffer)

In [None]:
!mpirun -np 2 python3 Isend_Irecv.py

* Non-blocking communication is usually the smart way to do point-to-point communication in MPI.
* Non-blocking communication is realised with:
    - **`isend`** / **`Isend`**
    - **`irecv`** / **`Irecv`**
    - **`request.wait()`**

## <center>[Exercise 5](./05-Exercises-MPI.ipynb)</center>

***
## <center><b>Collective Communication</b></center>

* Collective communication transmits data among all processes in a process group (communicator).
    - These routines must be called by all the processes in the group.
    - Amount of sent and received data must match.
* Collective communication includes
    - Data movement
    - Collective computation
    - Synchronization
* Example
    - **`comm.barrier()`** makes every task hold until all tasks in the communicator comm have called it.
* Collective communication typically outperforms point-to-point communication.

The original point-to-point code can be reduced from:

```python
if rank == 0:
    for i in range(1, size):
        comm.Send(data, i)
else:
    comm.Recv(data, 0)
```
to only one line.

We will cover 4 types of collective communication: broadcasting, scattering, gathering and reduction.

### Broadcasting

* Sends the same data from one process to all the others.
* Code becomes more compact (and efficient!) and easier to maintain.
    - For example, communicating a NumPy array from task 0 to all other tasks:

```python
comm.Bcast(data, 0)
```
<center><img src="../img/4.1.3.png" alt="Drawing" style="width: 450px;"/></center>

In [None]:
%%writefile broadcast.py

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Python object
    py_data = {'key1' : 0.0, 'key2' : 11}
    # NumPy array
    data = np.arange(8)/10
else:
    py_data = None
    data = np.zeros(8)

# Broadcasting the python object
new_data = comm.bcast(py_data, root=0)
print('rank', rank, 'received python object:', new_data)
# Broadcasting the NumPy array
comm.Bcast(data, root=0)
print('rank', rank, 'received NumPy array:', data)

In [None]:
!mpirun -np 4 python3 broadcast.py

### Scattering

* Sends/distributes equal amounts of data from one process to others.
* Segments $A$, $B$, $...$ may contain multiple elements

<center><img src="../img/4.1.4.png" alt="Drawing" style="width: 450px;"/></center>

In [None]:
%%writefile scatter.py

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Python object
    py_data = range(size)
    data = np.arange(size**2, dtype=float)
else:
    py_data = None
    data = None

# Scatter the python object
new_data = comm.scatter(py_data, root=0)
print('rank', rank, 'received python object:', new_data)

# Scatter the NumPy array
# A receive buffer is needed here!
buffer = np.empty(size, dtype=float)
comm.Scatter(data, buffer, root=0)
print('rank', rank, 'received NumPy array:', buffer)

In [None]:
!mpirun -np 4 python3 scatter.py

### Gathering

* Collects/pulls data from all the processes into one process.
* Segments $A$, $B$, $...$ may contain multiple elements.

<center><img src="../img/4.1.5.png" alt="Drawing" style="width: 450px;"/></center>

In [None]:
%%writefile gather.py

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

data = np.arange(10, dtype=float) * (rank + 1)

# Gather the value of rank from each rank, then send to rank 0
n = comm.gather(rank, root=0)

# Gather the NumPy array from each rank, then send to rank 0
buffer = np.zeros(size * 10, dtype=float)
comm.Gather(data, buffer, root=0)

if rank == 0:
    print('gathered ranks:', n)
    print('gathered NumPy arrays:', buffer)

In [None]:
!mpirun -np 4 python3 gather.py

## <center>[Exercise 6](./05-Exercises-MPI.ipynb)</center>

***
### <center><b>Reduction Operation</b></center>

* Applies an operation over a set of processes and places the result in a single process

<center><img src="../img/4.1.6.png" alt="Drawing" style="width: 450px;"/></center>

Below we will look at reduce routines for a single integer (rank) and a NumPy array. For the questions that follow, take the array on each process to be `[0,1,2,3] * (rank + 1)`.

What would be the expected result when running on 3 processes for:

<details>
    <summary markdown="span">Rank? </summary>
   3
<br>
</details>
<br>
<details>
    <summary markdown="span">And the numpy array? </summary>
   [0,6,12,18]
<br>
</details>

In [None]:
%%writefile reduce.py

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

data = np.arange(10, dtype=float) * (rank + 1)
print(data)
# Reduce (sum) the value of rank over all ranks, result goes to rank 0
n = comm.reduce(rank, root=0)

# Reduce (sum) the NumPy arrays element-wise, result goes to rank 0
buffer = np.zeros(10, dtype=float)
comm.Reduce(data, buffer, root=0)

if rank == 0:
    print('rank reduction:', n)
    print('NumPy reduction:', buffer)

In [None]:
!mpirun -np 4 python3 reduce.py
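The output on rank 0 can be checked without MPI. The sketch below simply emulates the default sum reduction in plain NumPy for the 4 processes used in the run above; `size` is hard-coded here as an assumption matching `-np 4`.

```python
import numpy as np

size = 4  # assumed to match `mpirun -np 4`

# What each rank contributes in reduce.py
rank_values = list(range(size))
arrays = [np.arange(10, dtype=float) * (rank + 1) for rank in range(size)]

# The default reduction operation is a sum (MPI.SUM)
rank_sum = sum(rank_values)
array_sum = np.sum(arrays, axis=0)

print('rank reduction:', rank_sum)      # 0 + 1 + 2 + 3 = 6
print('NumPy reduction:', array_sum)    # np.arange(10) * (1 + 2 + 3 + 4)
```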

### Other Common Collective Operations

* **`Scatterv`**: Each process receives a different amount of data.
* **`Gatherv`**: Each process sends a different amount of data.
* **`Allreduce`**: All processes receive the result of the reduction.
* **`Alltoall`**: Each process sends to and receives from every other process.
* **`Alltoallv`**: Each process sends and receives a different amount of data to/from every other process.

### Non-blocking Collectives

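* A feature of MPI-3, available in `mpi4py` through uppercase, buffer-based methods such as `Ibcast` and `Iallreduce`: <span style="color:red">there are no lowercase (pickle-based) variants</span>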
* Non-blocking collectives enable the overlapping of communication and computation together with the benefits of collective communication.
* Restrictions:
    - Have to be called in same order by all ranks in a communicator.
    - Mixing of blocking and non-blocking collectives is not allowed.

### Common Mistakes with Collectives

* Using a collective operation within one branch of an if-test of the rank.
    - **`if rank == 0: comm.bcast(...)`**
    - All processes in a communicator must call a collective routine!
* Assuming that all processes making a collective call would complete at the same time.
* Using the input buffer as the output buffer:
    - **`comm.Reduce(a, a, op=MPI.SUM)`** (pass `MPI.IN_PLACE` as the send buffer when an in-place operation is needed)

## <center>[Exercise 7](./05-Exercises-MPI.ipynb)</center>

***
## <center><b>Communicators</b></center>

<center><img src="../img/4.1.2.png" alt="Drawing" style="width: 350px;"/> </center>

* By default a single, universal communicator exists to which all processes belong (**`MPI.COMM_WORLD`**).
* One can create new communicators, e.g. by splitting this into sub-groups.

In [None]:
%%writefile comm_ranks.py

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

color = rank % 4

local_comm = comm.Split(color)
local_rank = local_comm.Get_rank()

print("Global rank: %d Local rank: %d" % (rank, local_rank))

In [None]:
!mpirun -np 8 python3 comm_ranks.py

## <center>[Exercise 8](./05-Exercises-MPI.ipynb)</center>

***

## <center><b>Other MPI Routines & Methods</b></center>

## Summary

* Collective communications involve all the processes within a communicator.
    - All processes must call them.
* Collective operations make code more transparent and compact.
* Collective routines allow optimizations by MPI library.
* MPI-3 also contains non-blocking collectives; `mpi4py` provides them for buffer-like objects (e.g. `Ibcast`, `Iallreduce`) but not for generic Python objects.
* Documentation for `mpi4py` is quite limited.
    - MPI is primarily used from C, C++ and Fortran, and Python as a whole is not ideally suited to it.
    - If you are serious about MPI, we suggest using one of those languages.
        - This generally leads to better performance.
        
<center><img src="../img/4.1.7.png" alt="Drawing" style="width: 450px;"/> </center>

Performance of `mpi4py` in a ping-pong test:

* It is possible, but not recommended, to communicate arbitrary Python objects.
* NumPy arrays can be communicated at nearly the same speed as in C/Fortran.

## Useful Links

* [Read the docs](https://mpi4py.readthedocs.io/en/stable/index.html)
* ["A Python Introduction to Parallel Programming with MPI"](https://materials.jeremybejarano.com/MPIwithPython/): _Jeremy Bejarano_
* [mpi4py examples](https://github.com/jbornschein/mpi4py-examples): _Jörg Bornschein_
