# Parallel Python

In this section we briefly introduce two approaches for parallel computing in Python: `ipyparallel` and `mpi4py`.

## Contents

- [Why parallelize Python code?](#Why-would-you-want-to-do-parallel-programming-in-Python?)
    - [Before parallelizing Python code](#Before-spending-a-lot-of-time-parallelizing-your-Python-code)
- [Global interpreter lock](#The-global-interpreter-lock-(GIL))
- [IPython for parallel computing](#IPython-for-parallel-computing)
    - [Configuration](#Configuration)
    - [Using `ipyparallel`](#using_ipyparallel)
    - [Parallel magic commands](#Parallel-magic-commands)
    - [Example problem](#Example-problem:-Computing-the-Mandelbrot-set)
- [MPI4Py](#MPI4Py)
    - [MPI4Py basics](#MPI4Py-basics)
    - [Parallelization via MPI4Py](#Parallelization-via-MPI4Py)

> Many resources on parallel Python exist on the web, and some have been used as inspiration for the material presented here. In particular, the following resources are recommended:
> - https://github.com/dvalters/RSE18-Python-Parallel-workshop
> - https://nbviewer.jupyter.org/github/CQuIC/summer17-computing-workshop/blob/master/Parallelization/mpi4py/Introduction_to_MPI4py.ipynb

## Why would you want to do parallel programming in Python? 

Traditionally, Python has been considered poorly suited to parallel programming, with "proper" parallel programming left to "heavy-duty" languages such as Fortran or C/C++, where libraries and standards such as OpenMP and MPI can be used.

For large-scale, massively parallel applications this is probably still the case, but a rich variety of libraries and packages has been developed outside the core Python language, so parallel programming is now much better supported.

### Before spending a lot of time parallelizing your Python code
- If your Python code is running too slowly, there are many ways to speed it up:
    - Begin by identifying the performance bottlenecks in the code - **profile before optimizing!**
    - Use fast numerical packages like [NumPy](http://www.numpy.org/).
    - Use a just-in-time (JIT) compiler like [Numba](https://numba.pydata.org/).
    - Use C-extensions from [Cython](http://cython.org/).
    - Rewrite the performance-critical functions in C/C++/Fortran, and import them into Python.
    - Any of these methods could speed up Python code by orders of magnitude!
- So why bother with parallelizing Python?
    - Perhaps you're already using NumPy, Numba, Cython and/or C/C++/Fortran for the most compute-intensive parts of your code.
    - Perhaps you have a problem that is particularly suitable for parallelization, e.g. a large dataset that can be processed independently in chunks.
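
As a minimal sketch of the "profile before optimizing" advice, the standard-library `cProfile` and `pstats` modules can show where the time goes. The functions `slow_sum_of_squares` and `run_workload` are hypothetical stand-ins for your own code, not part of this course material.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Hypothetical hot spot: a pure-Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_workload():
    """Hypothetical workload that calls the hot spot several times."""
    return [slow_sum_of_squares(50_000) for _ in range(5)]


# Collect timing statistics while the workload runs.
profiler = cProfile.Profile()
profiler.enable()
results = run_workload()
profiler.disable()

# Report the five most expensive entries, sorted by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
report = stream.getvalue()
print(report)
```

In a notebook, the `%prun` and `%timeit` magics offer similar information with less typing.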

## The global interpreter lock (GIL)

- The most common implementation of Python (interpreter/executable that runs your Python code) is called CPython.
- CPython has limited support for thread-based parallelism: its memory management is not thread-safe, so only one thread may execute Python bytecode at a time.
- CPython enforces this with the Global Interpreter Lock (GIL), which protects access to Python objects and prevents multiple threads from executing Python bytecode at once. Threads can still overlap while waiting on I/O.
- Much existing Python and extension code relies on the GIL being present, which has made it hard to remove. An optional free-threaded build without the GIL has been available experimentally since CPython 3.13 (PEP 703), but the GIL remains the default.
- Parallel approaches to Python are normally based around running multiple instances of the Python interpreter, each with its own copy of the code being run and each with its own separate GIL.
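
The idea of running several interpreter instances can be illustrated with the standard library alone. This is only a sketch: it uses `subprocess` to start three independent Python interpreters (the number is arbitrary) and reads back their process IDs. `ipyparallel` and `mpi4py` manage such processes for you and add communication on top.

```python
import os
import subprocess
import sys

# Launch a few independent Python interpreters; each is a separate OS
# process with its own memory and its own GIL.
n_workers = 3
procs = [
    subprocess.Popen(
        [sys.executable, "-c", "import os; print(os.getpid())"],
        stdout=subprocess.PIPE,
        text=True,
    )
    for _ in range(n_workers)
]

# Read back the process ID reported by each child interpreter.
child_pids = [int(p.communicate()[0]) for p in procs]

print("parent PID:", os.getpid())
print("child PIDs:", child_pids)
```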

## IPython for parallel computing

IPython abstracts parallelism in a general way and supports many styles: single program, multiple data (SPMD) and multiple program, multiple data (MPMD) parallelism, message passing using MPI, task farming, data parallelism, combinations of these approaches, and custom user-defined approaches.

Most importantly, IPython and the `ipyparallel` package enable all types of parallel applications to be developed, executed, debugged, and monitored *interactively*.

The following are some example use cases:

- Quickly parallelize algorithms that are embarrassingly parallel using a number of simple approaches. Many simple things can be parallelized interactively in one or two lines of code.
- Steer traditional MPI applications on a supercomputer from an IPython session on your laptop.
- Analyze and visualize large datasets (that could be remote and/or distributed) interactively using IPython and tools like matplotlib.
- Develop, test and debug new parallel algorithms (that may use MPI) interactively.
- Tie together multiple MPI jobs running on different systems into one giant distributed and parallel system.
- Run a set of tasks on a set of CPUs using dynamic load balancing.

> Adapted from the [official documentation](https://ipyparallel.readthedocs.io/en/)

### Configuration

This was already done as part of the Jupyter notebook setup, using the command `ipython profile create`.

<a id='using_ipyparallel'></a>

### Using `ipyparallel`

Starting an IPython cluster for parallel computing can be done from inside Jupyter: 
- Go to the Jupyter dashboard 
- Click the "IPython clusters" tab 
- Choose number of engines (e.g. 4), and click "Start". 

Interaction with the engines goes through an `ipyparallel.Client` object:

In [None]:
import ipyparallel as ipp
client = ipp.Client()
print("Number of ipyparallel engines:", len(client.ids))

#### Direct view

The direct view represents one way of working with IPython engines where the capabilities of each engine are directly and explicitly exposed to the user.

We construct a `DirectView` object via list-access to the client:

In [None]:
dview = client[:]
print(dview)

Suppose we want to calculate the squares of 10 integers. We can first define a function and then calculate the squares serially:

In [None]:
def square(x):
    return x*x

In [None]:
output = [square(x) for x in range(1,11)]
print(output)

With `ipyparallel`, the same computation can be distributed over the engines with `map_sync`:

In [None]:
output = dview.map_sync(square, range(1,11))
print(output)

The syntax of `map_sync` mirrors the built-in `map`: it takes the function followed by one or more sequences of input arguments, and it blocks until all results are available.

#### Load balanced view

The load-balanced view is appropriate when you have many jobs that take different amounts of time to complete, because tasks are assigned to engines dynamically rather than being split up front.

In [None]:
lview = client.load_balanced_view()
print(lview)

In [None]:
import numpy as np
lview.map_sync(lambda x: sum(x), np.random.random((10, 100000)))

### Parallel magic commands

The simplest way to use `ipyparallel` is via the `%px` magic command, which executes a line of code on the engines in parallel.

In [None]:
%px import os
%px a = os.getpid()
%px print(a)

Let's use the `%px` magic together with the `scatter` and `gather` methods of the direct view:

In [None]:
dview.scatter('xs', range(24))
%px y = [x**2 for x in xs]
%px print(y)

In [None]:
np.array(dview.gather('y').get())
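
To make the data flow concrete, here is a pure-Python sketch of what the scatter, compute, gather sequence does conceptually, assuming 4 engines. The helpers `scatter_blocks` and `gather_blocks` are hypothetical and not part of `ipyparallel`; the partitioning `ipyparallel` actually uses when the length is not evenly divisible may differ.

```python
def scatter_blocks(seq, n_engines):
    """Split seq into n_engines contiguous blocks of near-equal size."""
    seq = list(seq)
    base, extra = divmod(len(seq), n_engines)
    blocks, start = [], 0
    for engine in range(n_engines):
        # The first `extra` blocks get one additional element.
        stop = start + base + (1 if engine < extra else 0)
        blocks.append(seq[start:stop])
        start = stop
    return blocks


def gather_blocks(blocks):
    """Concatenate per-engine blocks back into one list."""
    return [item for block in blocks for item in block]


# Analogous to dview.scatter('xs', range(24)) with 4 engines.
xs_per_engine = scatter_blocks(range(24), 4)
# Analogous to %px y = [x**2 for x in xs], run once per engine.
y_per_engine = [[x**2 for x in xs] for xs in xs_per_engine]
# Analogous to dview.gather('y').
y = gather_blocks(y_per_engine)

print([len(xs) for xs in xs_per_engine])
print(y[:6])
```

Each engine only ever sees its own block of `xs`, which is why the squared values have to be gathered before they can be used as one array.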

### Example problem: Computing the Mandelbrot set

We will now look at a more compute-intensive problem and measure the speedup from using the `@parallel` decorator.

In [None]:
# Serial version
def mandel1(x, y, max_iters=80):
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z*z + c
        if z.real*z.real + z.imag*z.imag >= 4:
            return i
    return max_iters

In [None]:
# Parallel version using the @parallel decorator
@dview.parallel(block=True)
def mandel2(x, y, max_iters=80):
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z*z + c
        if z.real*z.real + z.imag*z.imag >= 4:
            return i
    return max_iters

Initialize the coordinate arrays and the meshgrid:

In [None]:
x = np.arange(-2, 1, 0.01)
y = np.arange(-1, 1, 0.01)
X, Y = np.meshgrid(x, y)

In [None]:
#%%timeit
im1 = np.reshape(list(map(mandel1, X.ravel(), Y.ravel())), 
                 (len(y), len(x)))

In [None]:
#%%timeit
im2 = np.reshape(mandel2.map(X.ravel(), Y.ravel()),  (len(y), len(x)))
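
Outside a notebook, where `%%timeit` is not available, one way to measure the speedup is with `time.perf_counter`. The following is a self-contained sketch for the serial case only: it repeats the escape-time function so that it runs on its own, uses a much coarser grid than the notebook, and relies on a hypothetical helper `timed`.

```python
import time


def mandel(x, y, max_iters=80):
    """Same escape-time iteration as mandel1 above."""
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z*z + c
        if z.real*z.real + z.imag*z.imag >= 4:
            return i
    return max_iters


def timed(func, *args):
    """Hypothetical helper: return (result, elapsed seconds) for one call."""
    t0 = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - t0


# A coarse grid keeps this sketch fast; the notebook uses a step of 0.01.
xs = [-2 + 0.1*i for i in range(30)]
ys = [-1 + 0.1*j for j in range(20)]
points = [(x, y) for y in ys for x in xs]

image, t_serial = timed(lambda: [mandel(x, y) for x, y in points])
print("serial time: %.4f s" % t_serial)

# With a parallel timing t_parallel measured the same way,
# the speedup would be t_serial / t_parallel.
```

For small problems the cost of sending data to the engines can outweigh the gain, so a measured speedup below the number of engines is to be expected.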

Finally, plot both results:

In [None]:
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].grid(False)
axes[0].imshow(im1, cmap='jet')
axes[1].grid(False)
axes[1].imshow(im2, cmap='jet')
pass

## MPI4Py

MPI, the Message Passing Interface, is a standard for parallel programming based on communication between separate processes, each with its own memory. MPI processes must pass messages to one another to coordinate execution and share data.

### MPI4Py basics

With MPI4Py, it is straightforward to obtain the basic MPI settings: the communicator, the rank of the current process, and the total number of processes.

In [None]:
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

MPI4Py also provides communication methods on the communicator, such as `send`, `recv`, `scatter`, and `gather`.

Unlike `ipyparallel`, MPI4Py has no automatic `map`. In practice, `scatter` and `gather` are the usual way to distribute work and collect the results.

<a id='exercise36'></a>

### Parallelization via MPI4Py

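Below is an example that computes the distances between pairs of cities with `mpi4py`. Executed in a single notebook kernel it runs as one process; to use several processes, such code is typically saved as a script and started with an MPI launcher such as `mpiexec`.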

In [None]:
%cd cities

In [None]:
import time
from mpi4py import MPI

import dist_cities as dc

# MPI settings

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

# prepare data and determine workloads

if rank == 0:
    cities = dc.read_cities()
    coord_pairs = dc.create_coord_pairs(cities)
    npairs = len(coord_pairs)

    # chunk size per process, rounded up so that nprocs chunks cover all pairs
    dn = npairs // nprocs
    if npairs % nprocs != 0:
        dn += 1

# compute via MPI
# 1. Slice coord_pairs for processes
# 2. Scatter the sliced pieces
# 3. Do computation on each process
# 4. Gather results to master process
# 5. Collect the results into one list

t0 = time.time()

if rank == 0:
    data = [coord_pairs[x*dn:(x+1)*dn] for x in range(nprocs)]
else:
    data = None

data = comm.scatter(data, root=0)

result = [dc.calc_dist(p) for p in data]

result = comm.gather(result, root=0)

if rank == 0:
    output = []
    for a in result:
        output += a

t1 = time.time()

if rank == 0:
    print("Maximum distance: %.0f km" % max(output))
    print("Computing time: %.3f sec" % (t1-t0))
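
The workload split used above (a chunk size `dn` rounded up, then one slice per rank) can be checked in isolation. This is a pure-Python sketch with a hypothetical helper `chunk_bounds`; it reproduces the slicing arithmetic only and does not use MPI.

```python
def chunk_bounds(npairs, nprocs):
    """Return the (start, stop) slice bounds used for each rank."""
    dn = npairs // nprocs
    if npairs % nprocs != 0:
        dn += 1  # round up so that nprocs chunks cover all pairs
    # Clamp to npairs, as list slicing does implicitly.
    return [(min(r*dn, npairs), min((r+1)*dn, npairs)) for r in range(nprocs)]


for npairs in (8, 10, 9):
    bounds = chunk_bounds(npairs, 4)
    sizes = [stop - start for start, stop in bounds]
    print(npairs, "pairs ->", sizes)
```

With 10 pairs on 4 processes the chunk sizes are 3, 3, 3 and 1, and with 9 pairs the last process receives an empty chunk. The list comprehension over `data` handles an empty chunk without special cases, but the load is not perfectly balanced.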