# (truly) Parallel Python
Unfortunately, not all problems are of the "embarrassingly parallel" type. For example, simulations of dynamical systems (nervous, weather, or quantum systems) can easily become too computationally heavy for a single machine, both in terms of memory and compute. One solution is to *distribute* such simulations across multiple machines. In particular, this implies that we use multiple processes, each working on a part of the simulation, which therefore need to communicate. The de facto standard for inter-process communication (in academia) is the Message Passing Interface (MPI). It defines a standard way for processes to send data to and receive data from each other.

Compared to the applications we have considered so far, using MPI *effectively* requires significant cognitive and development overhead, so you should evaluate very carefully whether you need to get your hands this dirty before reimplementing your simulation (or do it for fun as a challenge while your supervisor is on holiday ;) ). In the following we focus on the `mpi4py` package.

## Starting multiple processes: who am I?
Using MPI requires you to change the way you start your Python program. First, we cannot (easily) run it from a Jupyter notebook. Second, instead of calling it like `python <script>` from the command line, we need to run it via the `mpirun` executable. At this point, we also specify how many processes we would like to start via `-np <number of processes>`. The number of processes should reflect both the available hardware and our choice of how to distribute the work.
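The launch line described above can be assembled as follows. This is a minimal sketch: the helper name `build_mpirun_command` is made up for illustration, and it assumes that `mpirun` is on your `PATH` and that the script is called `mpi_hello_world.py`. It only builds and prints the command; actually running it requires a working MPI installation.

```python
import shlex
import sys


def build_mpirun_command(script, n_processes):
    """Return the argument list that launches `script` on `n_processes` ranks."""
    if n_processes < 1:
        raise ValueError("need at least one process")
    # sys.executable ensures every rank uses the current Python interpreter
    return ["mpirun", "-np", str(n_processes), sys.executable, script]


command = build_mpirun_command("mpi_hello_world.py", 4)
print(shlex.join(command))

# To actually launch the processes (requires a working MPI installation):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the command as a list to `subprocess.run` is also one way to start an MPI program from within a notebook cell, with the MPI processes running outside the notebook kernel.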

In MPI each process is assigned a rank. This helps you to organize work ("rank X does Y") and communication ("rank X sends Z to rank Y"). We first implement (one of) the simplest possible MPI program(s): report your rank and exit.

WARNING: Here we use MPI on a single machine, which is not the typical use case; usually you want to use it to distribute work across *many* machines connected via a network.

In [None]:
# %load mpi_hello_world.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

print(f"Hello world from rank {rank}")


## Performing different work on different processes
As mentioned, the rank lets different processes do different work, for example, drawing random numbers from normal distributions with different means:

In [None]:
# %load mpi_random_numbers.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

x = np.random.normal(loc=rank, size=10_000)

print(f"Hello world from rank {rank}. My mean is {np.mean(x)}.")

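The script above relies on NumPy's global, unseeded generator, so the numbers differ from run to run. If reproducibility matters, one option is to derive a separate generator per rank from a common seed. The sketch below assumes a helper called `draw_samples` and a root seed of `1235` (both chosen for illustration), and it loops over ranks in a single process instead of calling `comm.Get_rank()` so that it runs without MPI.

```python
import numpy as np


def draw_samples(rank, root_seed=1235, size=10_000):
    """Draw normal samples centred on `rank` from a rank-specific stream."""
    # Seeding with [root_seed, rank] gives every rank its own reproducible
    # generator whose stream is independent of the other ranks' streams.
    rng = np.random.default_rng([root_seed, rank])
    return rng.normal(loc=rank, size=size)


# Stand-in for comm.Get_rank(): loop over the ranks in a single process.
for rank in range(3):
    x = draw_samples(rank)
    print(f"rank {rank}: mean = {np.mean(x):.3f}")
```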

So far, the ranks are acting independently. The distinguishing feature of MPI is that it allows communication between processes. In the following we consider *one particular view* of parallelization across processes.
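As a first, small illustration of such communication, the sketch below combines per-rank samples into a global mean with `comm.allreduce`. The per-rank seeding and the fallback to a single rank when `mpi4py` cannot be imported are assumptions made so that the example also runs without MPI; under `mpirun` the fallback is not used.

```python
import numpy as np

try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, n_ranks = comm.Get_rank(), comm.Get_size()
except ImportError:
    # Assumption for this sketch: without mpi4py, act as a single rank.
    MPI, comm, rank, n_ranks = None, None, 0, 1

rng = np.random.default_rng([1235, rank])
x = rng.normal(loc=rank, size=10_000)

local_sum = float(np.sum(x))
local_count = int(x.size)

if comm is not None:
    # allreduce combines the values of all ranks and returns the result on
    # every rank; it is a collective call, so every rank has to make it.
    global_sum = comm.allreduce(local_sum, op=MPI.SUM)
    global_count = comm.allreduce(local_count, op=MPI.SUM)
else:
    global_sum, global_count = local_sum, local_count

global_mean = global_sum / global_count
print(f"rank {rank}: local mean {local_sum / local_count:.3f}, "
      f"global mean {global_mean:.3f}")
```

Reducing the sum and the count separately, instead of averaging the local means, stays correct even if the ranks hold different numbers of samples.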

## Distributed simulation (don't do it like this, though)
Now, let's consider a (super simplified) dynamical systems simulation. We have particles moving in a one-dimensional "box" between 0 and 1. Assuming lots and lots of particles, we may want to split the work of propagating the particles, i.e., computing their new positions, across different processes. Here we decide that each process propagates the particles within a certain "volume" of the box. For example, using two processes, one of them propagates all particles between 0 and 0.5, the other all particles between 0.5 and 1. Of course, particles can cross the boundary from below 0.5 to above 0.5 (and vice versa), so we need to communicate positions between the processes. Here, we go for a simple implementation: after each propagation step, the new positions are shared across all ranks. Afterwards, each rank determines which particles it should propagate in the next step.
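The mapping between positions and ranks described above can be sketched as two small functions. The names `rank_bounds` and `owner_rank` are made up for illustration, and the sketch assumes that each rank owns the half-open interval `[x_min, x_max)`, so that a particle sitting exactly on a boundary belongs to the upper rank.

```python
def rank_bounds(rank, n_ranks):
    """Return the half-open interval [x_min, x_max) owned by `rank`."""
    x_min = rank / n_ranks
    return x_min, x_min + 1 / n_ranks


def owner_rank(x, n_ranks):
    """Return the rank responsible for a particle at position `x`."""
    if not 0.0 <= x < 1.0:
        raise ValueError(f"position {x} is outside the box [0, 1)")
    # min() guards against rounding pushing the result up to n_ranks
    return min(int(x * n_ranks), n_ranks - 1)


n_ranks = 2
for rank in range(n_ranks):
    print(f"rank {rank} owns {rank_bounds(rank, n_ranks)}")
for x in (0.1, 0.49, 0.5, 0.9):
    print(f"x = {x} belongs to rank {owner_rank(x, n_ranks)}")
```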

In [None]:
# %load mpi_particles.py
from mpi4py import MPI
import numpy as np
import time

from lib_particles import Particle, propagate_particle


comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n_ranks = comm.Get_size()

rng = np.random.RandomState(1235 + rank)

n_particles = 12
n_steps = 5

# determine work distribution between ranks
x_min = rank / n_ranks
x_max = x_min + 1 / n_ranks

# initialize local particles assuming homogeneity
n_particles_per_process = int(np.round(n_particles / n_ranks))
local_particles = [
    Particle(n_particles_per_process * rank + idx, rng.uniform(x_min, x_max))
    for idx in range(n_particles_per_process)
]
print(f"initial configuration ({rank=}):",
      sorted(local_particles, key=lambda p: p.idx))

# simulate for a couple of time steps
for step in range(n_steps):

    # propagate particle positions
    for particle in local_particles:
        propagate_particle(particle)  # TODO check implementation of propagate_particle -> "update_particle_position"
        # or particle.x = get_new_particle_position(particle.x)

    # communicate particle positions
    local_particles_new = []
    for root in range(n_ranks):
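        # bcast is a blocking collective: every rank has to call it with the same root
        recv_buffer = comm.bcast(local_particles, root=root)
        # WARNING: do not overwrite local_particles inside this loop, e.g. with
        # local_particles = [particle for particle in recv_buffer if ...];
        # this rank still has to broadcast its original list when root == rank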
        for particle_new in recv_buffer:
            if x_min <= particle_new.x < x_max:
                local_particles_new.append(particle_new)
    local_particles = local_particles_new

    print()
    print(f"{step=} ({rank=}):",
          sorted(local_particles, key=lambda p: p.idx))

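The module `lib_particles` is not shown here. The following is a hypothetical stand-in that assumes only what the script above uses: a `Particle` with the attributes `idx` and `x`, and a `propagate_particle` function that updates `x` in place. The random-walk dynamics, the periodic boundary, and the `step_size` parameter are assumptions made for illustration, not the original implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Particle:
    idx: int    # unique particle identifier
    x: float    # position inside the box [0, 1)


def propagate_particle(particle, step_size=0.1, rng=random):
    """Move the particle by a random step, wrapping around at the box edges."""
    x = (particle.x + rng.uniform(-step_size, step_size)) % 1.0
    # Guard against the rare floating-point case where the modulo yields 1.0.
    particle.x = 0.0 if x >= 1.0 else x


random.seed(1235)
p = Particle(idx=0, x=0.95)
for _ in range(5):
    propagate_particle(p)
    print(p)
```

A dataclass defined at module level can be pickled, which is what the lowercase `comm.bcast` in the script above relies on to send Python objects.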

## Outlook
- optimizations/different communication patterns: e.g., send data only to the processes that need it, interleaving communication and computation
- (most) common issue: deadlocks, i.e., some processes are waiting for messages that they never receive :O
- as soon as you start using bigger systems, you will have to go through a scheduler to run your scripts, e.g., SLURM, LoadLeveler, etc.; this means writing a (bash) script which tells the scheduler what kind of resources you need and how to run your scripts
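The first point, sending data only to the processes that need it, could look roughly like this. It is a sketch under several assumptions: `Particle` is a hypothetical stand-in defined as a named tuple, `bucket_by_owner` is a made-up helper, the initial positions are chosen arbitrarily, and the code falls back to a single rank when `mpi4py` cannot be imported so that it also runs without MPI.

```python
from collections import namedtuple

# Hypothetical stand-in for the particle type used in the simulation.
Particle = namedtuple("Particle", ["idx", "x"])


def bucket_by_owner(particles, n_ranks):
    """Sort particles into one outgoing list per destination rank."""
    buckets = [[] for _ in range(n_ranks)]
    for p in particles:
        owner = min(int(p.x * n_ranks), n_ranks - 1)
        buckets[owner].append(p)
    return buckets


try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, n_ranks = comm.Get_rank(), comm.Get_size()
except ImportError:
    # Assumption for this sketch: without mpi4py, act as a single rank.
    comm, rank, n_ranks = None, 0, 1

# Every rank starts with four particles spread over the whole box.
local_particles = [Particle(10 * rank + i, (i + 0.5) / 4) for i in range(4)]
outgoing = bucket_by_owner(local_particles, n_ranks)

if comm is not None:
    # alltoall delivers outgoing[j] to rank j and returns one list per sender.
    incoming = comm.alltoall(outgoing)
else:
    incoming = outgoing

local_particles = [p for bucket in incoming for p in bucket]
print(f"{rank=}: {sorted(local_particles)}")
```

Compared to broadcasting every list to every rank, each particle is now sent only to the rank that will propagate it next. `alltoall` is still a collective call that all ranks have to make, so the deadlock caveat above applies here as well.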