# Library functions

The library functions are *exactly* the same as for the serial numpy version. If you write MPI code carefully, parallelisation need not affect the whole code base. As here, you can often reuse the same routines for the serial work (i.e. where there is no communication). The speedup comes because each process operates on a shorter length of road, i.e. the value of `n` is smaller here.

`gettime` and `updatebcs` are no longer required: MPI has its own timer, and the boundary values are now exchanged between processes using send and receive operations. They are left in because we want to use the same file as for the serial code.
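
To see what `updateroad` expects, here is a minimal sketch on a five-cell road. It uses copies of `updateroad` and `updatebcs` from the library listing; the road contents are made up purely for illustration. The array has one halo cell at each end, so a process holding `n` real cells passes in `n+2` entries, whatever the value of `n`.

```python
import numpy as np

def updateroad(newroad, oldroad):
    # Copy of the library routine: cells 1..n are real, 0 and n+1 are halos
    n = len(oldroad) - 2
    newroad[1:n+1] = np.where(oldroad[1:n+1] == 0, oldroad[0:n], oldroad[2:n+2])
    nmove = (newroad[1:n+1] != oldroad[1:n+1]).sum(dtype=int)
    return nmove / 2

def updatebcs(road):
    # Copy of the library routine: periodic boundaries for the serial case
    n = len(road) - 2
    road[0] = road[n]
    road[n+1] = road[1]

# Illustrative road: five real cells (1 = car, 0 = empty) plus two halo cells
oldroad = np.zeros(7, dtype=np.int32)
newroad = np.zeros(7, dtype=np.int32)
oldroad[1:6] = [1, 1, 0, 1, 0]

updatebcs(oldroad)                    # fill the halos from the opposite ends
nmove = updateroad(newroad, oldroad)

print(newroad[1:6])   # [1 0 1 0 1]
print(nmove)          # 2.0
```

The first car is blocked by the car in front of it, while the other two each advance one cell, so two cars move. In the MPI version the only difference is how the halo cells get filled: by messages from neighbouring processes rather than by `updatebcs`.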

In [None]:
# %load trafficlib.py
import sys
import numpy as np

def initroad(road, density, seedval):

    # Here we expect a road without halos

    n = len(road)

    np.random.seed(seedval)

    rng = np.random.random(n)

    road[0:n] = np.where(rng[:] < density, 1, 0)

    ncar = np.sum(road)
    
    return ncar


def updateroad(newroad, oldroad):

    n = len(oldroad)-2

    newroad[1:n+1] = np.where(oldroad[1:n+1]==0, oldroad[0:n], oldroad[2:n+2])

    nmove = (newroad[1:n+1] != oldroad[1:n+1]).sum(dtype=int)
    nmove = nmove / 2

    return nmove


def updatebcs(road):

    n = len(road)-2

    road[0]   = road[n]
    road[n+1] = road[1]


import time

def gettime():

    return time.time()


# Main program

There are a number of changes required to the main program.

* We compute `size` and our `rank` - as is conventional, we nominate rank 0 as the controller (e.g. the process that prints any output)
* The local number of cells `nlocal` is smaller than `ncell` by a factor of `size`. We still need an array `bigroad` to store the whole road as we do the initialisation in serial to ensure it is done identically to the serial numpy code. We also check that `ncell` is an exact multiple of `size` - it would be straightforward to allow an arbitrary length of road but this would make the code a little more complicated.
* `bigroad` is scattered to all the processes. Note that *all* processes call `comm.Scatter` although they have different roles as the input data only exists on rank 0.
* We compute our neighbours up and down, taking into account the periodic boundary conditions (i.e. the cars are on a roundabout).
* The barrier calls are solely there to make sure that the processes start and finish the main calculation at the same time, which gives us more reliable runtimes. Removing all the barriers will have *no effect* on the correctness of the program.
* Halos are exchanged up and down using a combined send and receive call `Sendrecv`.
* Each process can compute how many cars moved on its section of road but to get the total number we need to sum up across all processes using `Allreduce`. To use the fast numpy MPI functions I use a numpy array of size 1 rather than a scalar.
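
The decomposition described above can be checked without MPI. The following is only a sketch, assuming we emulate the processes with a Python loop: `serial_step` and `decomposed_step` are hypothetical helper names that do not appear in `traffic.py`, and plain array copies stand in for `Scatter`, `Sendrecv` and `Allreduce`. The neighbour calculation uses the modulo operator, which gives the same wrap-around as the two `if` tests in the program.

```python
import numpy as np

def updateroad(newroad, oldroad):
    # Copy of the library routine: cells 1..n are real, 0 and n+1 are halos
    n = len(oldroad) - 2
    newroad[1:n+1] = np.where(oldroad[1:n+1] == 0, oldroad[0:n], oldroad[2:n+2])
    nmove = (newroad[1:n+1] != oldroad[1:n+1]).sum(dtype=int)
    return nmove / 2

def serial_step(road):
    # One update of the whole road with periodic boundaries (hypothetical helper)
    n = len(road)
    old = np.zeros(n + 2, dtype=np.int32)
    new = np.zeros(n + 2, dtype=np.int32)
    old[1:n+1] = road
    old[0], old[n+1] = old[n], old[1]
    nmove = updateroad(new, old)
    return new[1:n+1].copy(), nmove

def decomposed_step(road, size):
    # Same update, with the road split over `size` emulated processes
    ncell = len(road)
    nlocal = ncell // size
    assert nlocal * size == ncell, "ncell must be a multiple of size"

    # Stand-in for Scatter: each rank gets nlocal cells plus two halo cells
    olds = []
    for rank in range(size):
        old = np.zeros(nlocal + 2, dtype=np.int32)
        old[1:nlocal+1] = road[rank*nlocal:(rank+1)*nlocal]
        olds.append(old)

    # Stand-in for the two Sendrecv calls: fill halos from the neighbours
    for rank in range(size):
        rankup = (rank + 1) % size      # wraps size -> 0
        rankdown = (rank - 1) % size    # wraps -1 -> size-1
        olds[rank][0] = olds[rankdown][nlocal]
        olds[rank][nlocal+1] = olds[rankup][1]

    # Local updates, then a sum over ranks as a stand-in for Allreduce
    pieces, nmove = [], 0.0
    for rank in range(size):
        new = np.zeros(nlocal + 2, dtype=np.int32)
        nmove += updateroad(new, olds[rank])
        pieces.append(new[1:nlocal+1])
    return np.concatenate(pieces), nmove

rng = np.random.default_rng(0)
road = (rng.random(24) < 0.5).astype(np.int32)

ref_road, ref_nmove = serial_step(road)
for size in (1, 2, 3, 4, 6):
    par_road, par_nmove = decomposed_step(road, size)
    print(size, np.array_equal(par_road, ref_road), par_nmove == ref_nmove)
```

Note that the count returned on a single process can be a half-integer: a car that crosses a process boundary changes one cell on each side, so each process records half a move. Only the sum over all processes is guaranteed to be a whole number, which is one reason the reduction is needed before computing the velocity.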

If you execute the program in the notebook it will run using a single process. Is the time comparable to what you saw with the serial numpy version? Is the result the same, i.e. is the final velocity identical to before?

The main parallel exercises are in the cell below the program.

In [None]:
# %load traffic.py
#!/usr/bin/env python

import sys
import time
import numpy as np

from mpi4py import MPI

from trafficlib import initroad, updatebcs, updateroad, gettime

def main(argv):

    comm = MPI.COMM_WORLD

    size = comm.Get_size()
    rank = comm.Get_rank()

    # Simulation parameters
    seedval = 5743
    ncell = 10240000
    maxiter = 1024000000//ncell
    printfreq = maxiter//10

    nlocal = ncell//size

    # Check consistency

    if (nlocal*size != ncell):
        if (rank == 0):
            print(f"ERROR: ncell = {ncell} not a multiple of size = {size}")
        sys.exit()

    bigroad  = np.zeros(ncell,dtype=np.int32)
    newroad  = np.zeros(nlocal+2,dtype=np.int32)
    oldroad  = np.zeros(nlocal+2,dtype=np.int32)

    sbuf = np.zeros(1)
    rbuf = np.zeros(1)

    density = 0.52

    if (rank == 0):

        print(f"Length of road is {ncell}")
        print(f"Number of iterations is {maxiter}")
        print(f"Target density of cars is {density}")
        print(f"Running on {size} process(es)")

        # Initialise road accordingly using random number generator
        print(f"Initialising ...")

        ncars = initroad(bigroad, density, seedval)

        print(f"Actual Density of cars is {format(float(ncars)/float(ncell))}\n")
        print(f"Scattering data ...")

    comm.Scatter(bigroad, oldroad[1:nlocal+1], root=0)

    if (rank == 0):
        print(f"... done\n")

    # Compute neighbours

    rankup   = (rank + 1)
    rankdown = (rank - 1)

    # Wrap-around for cyclic boundary conditions, i.e. a roundabout

    if (rankup == size):
        rankup = 0

    if (rankdown == -1):
        rankdown = size-1

    nmove = 0
    nmovelocal = 0

    comm.barrier()
        
    tstart = MPI.Wtime()

    for iter in range(1, maxiter+1):

        comm.Sendrecv(oldroad[nlocal:nlocal+1], dest=rankup,
                      recvbuf=oldroad[0:1], source=rankdown)

        comm.Sendrecv(oldroad[1:2], dest=rankdown,
                      recvbuf=oldroad[nlocal+1:nlocal+2], source=rankup)

        nmovelocal = updateroad(newroad, oldroad)

        sbuf[0] = nmovelocal
        comm.Allreduce(sbuf, rbuf)
        nmove = rbuf[0]

        # Copy new to old array
        oldroad[1:nlocal+1] = newroad[1:nlocal+1]

        if iter % printfreq == 0:

            if (rank == 0):

                print(f"At iteration {iter} average velocity is {float(nmove)/float(ncars):.6f}")

    comm.barrier()

    tstop = MPI.Wtime()

    if (rank == 0):

        print(f"\nFinished\n")
        print(f"Time taken was {tstop-tstart:.2f} seconds")
        print(f"Update rate was {1.0e-6*ncell*maxiter/(tstop-tstart):.2f} MCOPs")

if __name__ == "__main__":
    main(sys.argv[1:])


## Parallel exercises


Mac and Linux users should use the first command given below. Windows users should use the second as, without a timeout, Windows MPI does not seem to exit gracefully if the program crashes. Both use a shell escape (`!`) to execute a local command.

To get the best performance and reliable runtimes you should make sure you are running as few other programs as possible on your laptop, e.g. quit as many browser tabs as possible, shut down Spotify and Twitter, ...

* Execute the program on a single process and check if the performance is the same as running in the notebook. You will not see continuous output - it will all appear at the end so there will be some delay.

* Now run on two processes using `-n 2`. Is the program any faster? Is the answer the same as before?
* Run on 4 and 8 processes - does the code get any faster? It can be interesting to run a performance monitor while doing this as you should see multiple Python programs running at the same time (which may put a heavy load on your machine!).
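
When comparing runs, it can help to turn the reported times into a speedup (time on one process divided by time on `P` processes) and a parallel efficiency (speedup divided by `P`). The following is a small sketch; `speedup_table` is a hypothetical helper, and the timings in `example_times` are placeholders, not measurements, so replace them with the "Time taken" values you observe.

```python
def speedup_table(times):
    # times maps number of processes -> runtime in seconds; must include 1
    base = times[1]
    rows = []
    for nproc in sorted(times):
        speedup = base / times[nproc]
        efficiency = speedup / nproc
        rows.append((nproc, times[nproc], speedup, efficiency))
    return rows

# Placeholder values only: substitute your own measured runtimes
example_times = {1: 8.0, 2: 4.0, 4: 2.5}

print("procs  time(s)  speedup  efficiency")
for nproc, t, s, e in speedup_table(example_times):
    print(f"{nproc:5d}  {t:7.2f}  {s:7.2f}  {e:10.2f}")
```

An efficiency close to 1 means the extra processes are being used well; a value that falls as you add processes suggests that communication, or competition for your laptop's cores, is starting to dominate.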

If you want to change the program (e.g. alter the length of the road) you will have to edit the copy stored locally on your machine - changing the program above in the notebook will have no effect. To do this use "File -> Open" from the Jupyter menu and remember to save your changes before re-running the program.

**Mac and Linux users run this command**:

In [None]:
!mpiexec -n 1 python traffic.py

**Windows users run this command**:

In [None]:
!mpiexec /timeout 20 -n 1 python traffic.py