<center><img src="../img/ICHEC_Logo.png" alt="Drawing" style="width: 500px;"/>

<center><img src="../img/MPI_Logo.png" alt="Drawing" style="width: 250px;"/>

# <center>MPI (Message Passing Interface)</center>

***
## <center> <b>Processes vs. Threads</b>

<details>
    <summary markdown="span"><b>What is a thread?</b></summary>
<br>
   - A dispatchable unit of work <b>within</b> a process<br>
   - Lightweight: threads use the memory of the process they belong to<br>
   - Threads share the same memory with other threads of the same process<br>
<br>
</details>
    
<center><img src="../img/Threads.png" alt="Drawing" style="width: 400px;"/></center>


<details>
    <summary markdown="span"><b>What is a process?</b></summary>
<br>
   - An instance of a program running on a computer<br>
   - Heavyweight: every process has its own memory space<br>
   - Processes do not share memory with other processes<br>
<br>
</details>


<center><img src="../img/Process.png" alt="Drawing" style="width: 250px;"/> 
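
The difference in memory sharing can be seen with the Python standard library alone. The sketch below is only an illustration, not MPI: a thread changes a dictionary in place, while a separate process (started here with `subprocess`) can only work on a copy that is explicitly sent to it and sent back. The names `increment` and `child_program` are made up for this example.

```python
import json
import subprocess
import sys
import threading

data = {"value": 0}


def increment(shared):
    """Add one to the counter held in `shared`."""
    shared["value"] += 1


# Thread: runs inside this process, so it changes `data` directly.
worker = threading.Thread(target=increment, args=(data,))
worker.start()
worker.join()
after_thread = data["value"]

# Process: has its own memory, so the data must be sent to it explicitly
# (here as JSON on stdin) and the result sent back (on stdout).
child_program = (
    "import json, sys\n"
    "received = json.load(sys.stdin)\n"
    "received['value'] += 1\n"
    "json.dump(received, sys.stdout)\n"
)
result = subprocess.run(
    [sys.executable, "-c", child_program],
    input=json.dumps(data),
    capture_output=True,
    text=True,
    check=True,
)
child_copy = json.loads(result.stdout)
after_process = data["value"]

print("after thread: ", after_thread)    # the thread changed our data
print("child's copy: ", child_copy)      # the child changed only its copy
print("after process:", after_process)   # our data is unchanged by the child
```

Exchanging data explicitly between processes, as done here through stdin and stdout, is the idea that MPI formalises as message passing.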

***
## <center> <b>Introduction to MPI</b>

MPI is the de facto standard for parallel programming on distributed-memory systems.

<details>
    <summary markdown="span"><b>What is MPI?</b></summary>
<br>
   - An application programming interface (API) for communication between separate processes<br>
   - The MPI standard defines C and Fortran interfaces (the C++ bindings were removed in MPI-3.0)<br>
   - Imported in Python using the <b>unofficial</b> <code>mpi4py</code> module<br>
   - Programs with MPI are portable and scalable to run on tens to tens of thousands of cores<br>
   - Over 300 procedures are available, but only ~10 are really needed<br>
   - The program is launched as separate processes (<b>tasks</b>), each with its own address space<br>
   - Created in the 1980s-1990s, when scientific problems were demanding more memory<br>
   - The idea was to treat the memory of several interconnected compute nodes as one, now known as <b>distributed memory</b><br>
<br>
</details>

<center><img src="../img/DistributedMemory.png" alt="Drawing" style="width: 450px;"/>



<details>
    <summary markdown="span"><b>What is Distributed Memory?</b></summary>
<br>
   - A set of processors that use their own local memory during computation<br>
   - The processors exchange data by sending and receiving messages<br>
   - Cooperative operations are needed by each process (if one sends, another receives)<br>
<br>
</details>

<br>

As architecture trends changed, shared memory systems were combined over networks creating hybrid distributed memory / shared memory systems. MPI implementors adapted their libraries to handle both types of underlying memory architectures seamlessly. They also adapted/developed ways of handling different interconnects and protocols.

Today, MPI runs on virtually any hardware platform:
* Distributed Memory
* Shared Memory
* Hybrid

### Execution Model

<details>
    <summary markdown="span"><b>How does an MPI program execute?</b></summary>
<br>
   - In Python, the process model is simpler than with C/Fortran<br>
   - MPI normally has to be initialised with an explicit call (<code>MPI.Init()</code>), but <code>from mpi4py import MPI</code> does this automatically<br>
   - The MPI program is launched as a set of independent, identical processes<br>
   - These execute the same program code and instructions, and can reside on different nodes/computers<br>
   - The way to launch MPI programs differs between systems: <code>mpirun</code>, <code>mpiexec</code>, <code>srun</code>...<br>
<br>
</details>
<br>
<div class="alert alert-block alert-info">
<b>Notebooks are suitable only for small MPI demonstrations. For real runs, use the provided job scripts to run MPI codes on the compute nodes instead of the login nodes.</b>
</div>

### Getting Started

To get started with an MPI program you need a **communicator**.

<details>
    <summary markdown="span"><b>What is a communicator?</b></summary>
<br>
   - A group containing all the processes that will participate in communication<br>
   - In <code>mpi4py</code> most MPI calls are implemented as methods of a communicator object<br>
   - The default communicator, which contains all processes, is <code>MPI.COMM_WORLD</code><br>
   - The user can define custom communicators, which will be covered tomorrow<br>
</details>
<br>

Once you have a communicator, you need a way of identifying all the MPI processes. These are known as **ranks**.


<details>
    <summary markdown="span"><b>What is a rank?</b></summary>
<br>
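   - A logical ID number given to each process in a communicator, counting from 0 up to one less than the number of processes<br>
   - Queried with <code>comm.Get_rank()</code><br>
   - Processes can perform different tasks based on their rank<br>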
<br>
</details>

```python
# `rank` is this process's ID within the communicator
if rank == 0:
    pass  # do something
elif rank == 1:
    pass  # do something else
else:
    pass  # all other processes do something else
```

Aside from the rank, the number of processes also needs to be known. This is called the **size**, and is specified at runtime.
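
Rank and size together are often used to decide which part of a problem each process handles. The sketch below shows one common choice, a round-robin split, in plain Python; the helper name `my_indices` is hypothetical and the loop over ranks only imitates what each process would compute for itself.

```python
def my_indices(rank, size, n_items):
    """Indices of the work items handled by `rank` in a round-robin split."""
    return list(range(rank, n_items, size))


size = 4       # number of processes (would come from the communicator)
n_items = 10   # total number of work items

# Each process would evaluate this once with its own rank;
# here we loop over all ranks to show the whole picture.
for rank in range(size):
    print("rank", rank, "handles items", my_indices(rank, size, n_items))
```

Every item is handled by exactly one rank, and no communication is needed to agree on the split because every process evaluates the same formula.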

After importing MPI, an mpi4py program typically begins with the following three instructions.

```python
comm = MPI.COMM_WORLD 
size = comm.Get_size()
rank = comm.Get_rank()
```
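
To see why the same three instructions give a different `rank` in every process, it can help to imitate what a launcher does. The sketch below is an analogy only, using the standard library instead of MPI: it starts several copies of one program and tells each copy its rank through the environment variables `DEMO_RANK` and `DEMO_SIZE`, which are invented for this example. A real MPI launcher also sets up the communication between the processes.

```python
import os
import subprocess
import sys

# The same program text is run by every process; only the rank differs.
PROGRAM = """
import os
rank = int(os.environ["DEMO_RANK"])   # hypothetical variable, not set by MPI
size = int(os.environ["DEMO_SIZE"])   # hypothetical variable, not set by MPI
print("I am rank %d in group of %d processes" % (rank, size))
"""


def launch(size):
    """Start `size` copies of PROGRAM and collect the output of each."""
    processes = []
    for rank in range(size):
        env = dict(os.environ, DEMO_RANK=str(rank), DEMO_SIZE=str(size))
        processes.append(
            subprocess.Popen(
                [sys.executable, "-c", PROGRAM],
                env=env,
                stdout=subprocess.PIPE,
                text=True,
            )
        )
    return [process.communicate()[0].strip() for process in processes]


for line in launch(4):
    print(line)
```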

<details>
    <summary markdown="span"><b>Common Routines in MPI for Python</b></summary>
<br>
   - Communication: sending and receiving messages between two or more processes<br>
   - Synchronization between processes<br>
   - Communicator creation and manipulation<br>
   - Advanced features (e.g. user defined datatypes, one-sided communication and parallel I/O)<br>
<br>
</details>

***
## <center><b>Running a Simple MPI Program</b></center>

In [None]:
%%writefile process_greeting.py

from mpi4py import MPI

# communicator containing all processes
comm = MPI.COMM_WORLD 

size = comm.Get_size()
rank = comm.Get_rank()

print("I am rank %d in group of %d processes" % (rank, size))


### Running MPI

On the login node:
```bash
$ mpirun -np 4 python3 process_greeting.py

```

On the compute node:
- Put the above line in a bash job script
- Submit the job script to the compute node

```bash
$ sbatch my_mpi_job.sh

```

In [None]:
!mpirun -np 4 python3 process_greeting.py

## <center> [Exercise 1](./05-Exercises-MPI.ipynb)

***
## <center><b>Point to Point Communication</b></center>

<details>
    <summary markdown="span"><b>How does Point to Point communication work?</b></summary>
<br>
   - MPI communicates through <b>messages</b>, each consisting of a number of elements of a particular datatype. The datatypes can be basic or MPI derived datatypes<br>
   - Point to Point communication is the communication between two processes, where a source (<code>src</code>) <span style="color:red"><b>sends</b></span> a message to a destination (<code>dest</code>) process which has to <span style="color:blue"><b>receive</b></span> it<br>
   - This communication takes place within a communicator, e.g. <code>MPI.COMM_WORLD</code><br>
   - Each process in a communicator is identified by its rank in that communicator<br>
   - Sends and receives in a program should match, one <span style="color:blue"><b>receive</b></span> per <span style="color:red"><b>send</b></span><br>
</details>
<br>

<center><img src="../img/MPI_SendRecv.png" alt="Drawing" style="width: 600px;"/>





* Two main types of send:
    * synchronous send - the sender is informed that the message has been received
    * buffered/asynchronous send - the sender only knows that the message has left
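
The two kinds of send can be imitated with a thread and a queue from the standard library. This is a sketch of the idea only, not of how MPI is implemented: `channel` stands in for the message transport, and the function names are invented for this example.

```python
import queue
import threading

channel = queue.Queue()   # stands in for the message transport
events = []               # records the order in which things happen


def buffered_send(message):
    channel.put(message)  # the message has left the sender
    events.append("buffered send returned")


def synchronous_send(message):
    channel.put(message)
    channel.join()        # wait until the receiver has taken the message
    events.append("synchronous send returned")


def receive():
    message = channel.get()
    events.append("received " + message)
    channel.task_done()   # tell the sender that the message arrived
    return message


# Buffered: the send returns before anyone has received the message.
buffered_send("a")
receive()

# Synchronous: the send returns only after the message was received.
sender = threading.Thread(target=synchronous_send, args=("b",))
sender.start()
receive()
sender.join()

for event in events:
    print(event)
```

In the recorded order, the buffered send returns before `a` is received, while the synchronous send returns only after `b` has been received.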


### Sending and Receiving Data

Here rank 0 sends a dictionary and rank 1 receives it.

In [None]:
%%writefile send_receive.py

from mpi4py import MPI

# communicator containing all processes
comm = MPI.COMM_WORLD 
rank = comm.Get_rank()

if rank == 0:
    data = {'a': 7, 'b': 3.14}
    comm.send(data, dest=1)
    print('rank ', rank, ' sent: ', data)
elif rank == 1:
    data = comm.recv(source=0)
    print('rank ', rank, ' received: ', data)
else:
    print('rank ', rank, ' did not receive data')

In [None]:
!mpirun -np 4 python3 send_receive.py

* Arbitrary Python objects can be communicated with the <span style="color:red">**send**</span> and <span style="color:blue">**receive**</span> methods of a communicator.
* `send(data, dest)`
    - `data` - Python object to send
    - `dest`  - Destination rank
* `recv(source)`
    - `source` - source rank
    - Data is provided as return value
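* Destination and source ranks have to match: the `dest` of the send must be the rank that calls `recv`, and the `source` of the receive must be the rank that calls `send`.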


## Blocking Routines & Deadlocks

* `send()` and `recv()` are <span style="color:red">**blocking**</span> routines
  - The functions return only once it is safe to use the data (memory) involved in the communication.
* Completion depends on other processes => risk of **_deadlocks_**.
  - For example, if all processes call `send()` and there is no one left to call a corresponding `recv()`, then the program can be **stuck forever**.
  - This is discussed further under non-blocking communication.

<center><img src="../img/deadlock.png" alt="Drawing" style="width: 300px;"/></center>


Typical Point-to-Point communication patterns include pairwise exchanges, where processes talk to their neighbours. An incorrect ordering of sends and receives can result in a deadlock.

<center><img src="../img/3.2.3.png" alt="Drawing" style="width: 450px;"/></center>
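
The effect of ordering can be checked with a small pure-Python model. The sketch below assumes the worst case, in which every send is synchronous and completes only when the matching receive is reached (a real MPI library may buffer small messages, so a badly ordered program can appear to work). The function `runs_to_completion` and the two example programs are invented for this illustration.

```python
def runs_to_completion(programs):
    """Return True if every rank can finish its list of operations.

    `programs` maps each rank to an ordered list of ("send" | "recv", peer).
    A send completes only together with the matching receive on the peer.
    """
    position = {rank: 0 for rank in programs}
    progressed = True
    while progressed:
        progressed = False
        for rank, operations in programs.items():
            if position[rank] == len(operations):
                continue  # this rank has finished
            kind, peer = operations[position[rank]]
            if kind != "send" or position[peer] == len(programs[peer]):
                continue
            # The send can complete only if the peer is waiting for it now.
            if programs[peer][position[peer]] == ("recv", rank):
                position[rank] += 1
                position[peer] += 1
                progressed = True
    return all(position[r] == len(ops) for r, ops in programs.items())


# Both ranks send first: each waits for a receive that is never reached.
all_send_first = {
    0: [("send", 1), ("recv", 1)],
    1: [("send", 0), ("recv", 0)],
}

# Rank 1 receives first, so every send finds a matching receive.
ordered = {
    0: [("send", 1), ("recv", 1)],
    1: [("recv", 0), ("send", 0)],
}

print("all send first completes:", runs_to_completion(all_send_first))
print("ordered exchange completes:", runs_to_completion(ordered))
```

Letting one side of each pair receive first (for example, odd ranks) is a common way to make a pairwise exchange safe regardless of buffering.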

## <center> [Exercise 2](./05-Exercises-MPI.ipynb)
***

## <center><b>Case study: Parallel Sum</b></center>

* Array originally on process #0 ($P_{0}$).
* Parallel algorithm:
  * **Scatter**
    - Half of the array is sent to process 1 ($P_{1}$).
  * **Compute**
    - $P_{0}$ & $P_{1}$ independently sum their segments.
  * **Reduction**
    - Partial sum on $P_{1}$ is sent to $P_{0}$.
    - $P_{0}$ sums the partial sums.

<center><img src="../img/3.2.4.png" alt="Drawing" style="width: 150px;"/>

**Step 1.1**: Receive operation in scatter

<center><img src="../img/3.2.4.2.png" alt="Drawing" style="width: 400px;"/></center>

$P_{1}$ posts a <span style="color:blue">**receive**</span> to <span style="color:blue">**receive**</span> *half* of the array **FROM** $P_{0}$

**Step 1.2**: Send operation in scatter

<center><img src="../img/3.2.4.3.png" alt="Drawing" style="width: 400px;"/></center>

$P_{0}$ posts a <span style="color:red">**send**</span> to <span style="color:red">**send**</span> the lower part of the array **TO** $P_{1}$

**Step 2**: Compute the sum in parallel

<center><img src="../img/3.2.4.4.png" alt="Drawing" style="width: 400px;"/></center>

$P_{0}$ & $P_{1}$ compute their partial sums in parallel and store them locally

**Step 3.1**: <span style="color:blue">**Receive**</span> operation in reduction

<center><img src="../img/3.2.4.5.png" alt="Drawing" style="width: 400px;"/></center>

$P_{0}$ posts a <span style="color:blue">**receive**</span> to <span style="color:blue">**receive**</span> the partial sum **FROM** $P_{1}$

**Step 3.2**: <span style="color:red">**Send**</span> operation in reduction

<center><img src="../img/3.2.4.6.png" alt="Drawing" style="width: 400px;"/></center>

$P_{1}$ posts a <span style="color:red">**send**</span> to <span style="color:red">**send**</span> its partial sum **TO** $P_{0}$

**Step 4**: Compute the final answer.

<center><img src="../img/3.2.4.7.png" alt="Drawing" style="width: 400px;"/></center>

$P_{0}$ sums the partial sums
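
The steps above can be sketched as a single function that every rank calls. The function uses only the communicator methods `Get_rank`, `send` and `recv`. So that the sketch runs without an MPI installation, it is driven here by `ThreadComm`, a hypothetical two-rank stand-in built from threads and queues; it is not part of mpi4py. Which half of the array goes to $P_{1}$ is an assumption.

```python
import queue
import threading


class ThreadComm:
    """Hypothetical two-rank stand-in for a communicator (not mpi4py)."""

    def __init__(self, rank, inboxes):
        self._rank = rank
        self._inboxes = inboxes  # one queue per rank, shared by both ranks

    def Get_rank(self):
        return self._rank

    def send(self, obj, dest):
        self._inboxes[dest].put(obj)

    def recv(self, source):
        # With only two ranks, everything in our inbox comes from `source`.
        return self._inboxes[self._rank].get()


def parallel_sum(comm, values=None):
    """Sum `values` (held by rank 0) using ranks 0 and 1."""
    rank = comm.Get_rank()
    if rank == 0:
        half = len(values) // 2
        comm.send(values[half:], dest=1)   # Step 1.2: send half to P1
        partial = sum(values[:half])       # Step 2: local partial sum
        other = comm.recv(source=1)        # Step 3.1: receive P1's sum
        return partial + other             # Step 4: final answer
    segment = comm.recv(source=0)          # Step 1.1: receive half from P0
    comm.send(sum(segment), dest=0)        # Steps 2 and 3.2
    return None


inboxes = [queue.Queue(), queue.Queue()]
p1 = threading.Thread(target=parallel_sum, args=(ThreadComm(1, inboxes),))
p1.start()
total = parallel_sum(ThreadComm(0, inboxes), list(range(1, 11)))
p1.join()
print(total)  # 55
```

With mpi4py, the same function could be called with `MPI.COMM_WORLD` in a script launched on exactly two processes, passing the array only on rank 0.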

***
## <center><b>Communicating NumPy Arrays</b></center>

* Arbitrary Python objects are converted to byte streams (pickled) when sending and back to Python objects (unpickled) when receiving.
    - These conversions may be a serious overhead to communication.
* Contiguous memory buffers (such as NumPy arrays) can be communicated with very little overhead using upper case methods:
    - **`Send(data, dest)`**
    - **`Recv(data, source)`**
    - Note the difference in receiving: the data array has to exist at the time of call.
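
The difference between the two styles can be imitated without MPI. In the sketch below, `pickle` plays the role of the object-based methods and a raw memory copy into a preallocated `array` plays the role of the buffer-based methods. It is an analogy using only the standard library and does not measure real communication cost.

```python
import pickle
from array import array

source = array("d", range(100))   # 100 contiguous doubles

# Object style (like send/recv): serialise, move the bytes, build a new object.
wire_bytes = pickle.dumps(source.tolist())
received_object = pickle.loads(wire_bytes)

# Buffer style (like Send/Recv): the receiver allocates its buffer first,
# then the raw memory is copied into it without any conversion.
receive_buffer = array("d", bytes(source.itemsize * len(source)))
memoryview(receive_buffer)[:] = memoryview(source)

print(type(received_object).__name__, len(received_object))
print(type(receive_buffer).__name__, len(receive_buffer))
```

In the object style the data comes back as a return value, while in the buffer style the receiver must already own a buffer of the right type and size.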

### Send/Receive a NumPy Array

* Note the difference between upper/lower case!
* `send`/`recv`: general Python objects --> slow
* `Send`/`Recv`: contiguous arrays --> fast


In [None]:
%%writefile Send_Recv.py

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank==0:
    data = np.arange(100, dtype=float)
    comm.Send(data, dest=1)
    print('rank ', rank, ' sent: ', data)
elif rank==1:
    data = np.empty(100, dtype=float)
    comm.Recv(data, source=0)
    print('rank ', rank, ' received: ', data)

In [None]:
!mpirun -np 2 python3 Send_Recv.py

## <center> [Exercise 3](./05-Exercises-MPI.ipynb)
***

## <center><b>Combined Send and Receive</b></center>

* <span style="color:red">Send</span> one message and <span style="color:blue">receive</span> another with a single command
    - reduces the risk of deadlocks
* Destination and source ranks can be the same or different
    - **`MPI.PROC_NULL`** can be used for *no destination/source*

In [None]:
%%writefile SendRecv.py

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Send buffer
data = np.arange(10, dtype=float) * (rank + 1)

# Receive buffer
buffer = np.empty(10, float)

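if rank == 0:
    dest, source = 1, 1
elif rank == 1:
    dest, source = 0, 0
else:
    # Any extra ranks take no part: with PROC_NULL nothing is sent or received
    dest, source = MPI.PROC_NULL, MPI.PROC_NULL
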
print('rank ', rank, ' send buffer: ', data)
    
comm.Sendrecv(data, dest=dest, recvbuf=buffer, source=source)

print('rank ', rank, ' receive buffer: ', buffer)

In [None]:
!mpirun -np 2 python3 SendRecv.py
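
A typical use of *no destination/source* is a chain of processes in which everyone passes data to the right-hand neighbour: the first rank has nobody to receive from and the last rank has nobody to send to. The sketch below models this in plain Python. `NO_NEIGHBOUR` is a stand-in for `MPI.PROC_NULL`, and both helper functions are invented for this illustration.

```python
NO_NEIGHBOUR = None  # stand-in for MPI.PROC_NULL in this sketch


def chain_neighbours(rank, size):
    """Return (dest, source) for passing data one step to the right."""
    dest = rank + 1 if rank + 1 < size else NO_NEIGHBOUR
    source = rank - 1 if rank > 0 else NO_NEIGHBOUR
    return dest, source


def simulate_shift(send_values):
    """Model one combined send/receive step on every rank of the chain."""
    size = len(send_values)
    received = [None] * size  # None: receive buffer left untouched
    for rank in range(size):
        dest, _ = chain_neighbours(rank, size)
        if dest is not NO_NEIGHBOUR:
            received[dest] = send_values[rank]
    return received


print(simulate_shift([10, 20, 30, 40]))  # [None, 10, 20, 30]
```

Because every rank makes the same combined call, no rank-dependent ordering of sends and receives is needed at the ends of the chain.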

## <center> [Exercise 4](./05-Exercises-MPI.ipynb)
***