# Embarrassingly parallel Python
In the following we focus on "embarrassingly parallel" problems, i.e., problems which can be easily parallelized across threads or processes because the individual tasks do not need to interact while running. Examples include reading a collection of large files, querying databases, or running multiple trials of neural networks, molecular dynamics simulations or sampling methods.

# Threading in Python
We first work with the `ThreadPool` available from the `multiprocessing.pool` module. We assume CPython, in which the GIL (global interpreter lock) prevents several threads from executing Python code in parallel. However, for some use cases, in particular those which are **I/O bound**, threading can be very useful. Consider for example obtaining data from some database: you would like to query a couple of measurements, and completing each of these queries may take some processing time on the server. Here we mimic this server-side processing time by merely sleeping.

In [1]:
from multiprocessing.pool import ThreadPool
import time

In [2]:
def query_database(x):
    """Query your database to retrieve awesome measurements."""
    time.sleep(x)  # mimics (input-dependent) server-side processing
    y = x ** 2
    return y

In [3]:
l = [1, 8, 1.5, 2]  # some dummy queries

First, we use the builtin `map` function to perform the database query for each item in `l`:

In [4]:
%%time
result = list(map(query_database, l))
print(result)

[1, 64, 2.25, 4]
CPU times: user 2.01 ms, sys: 487 µs, total: 2.5 ms
Wall time: 12.5 s


Observations?
- total duration is the sum of the duration of each query -> queries processed in serial, one after the other

Now, we use `ThreadPool` to perform these queries using two threads (here the `processes` argument actually refers to the number of threads):

In [5]:
%%time
with ThreadPool(processes=2) as pool:  # context manager providing a `ThreadPool` instance
    result = pool.map(query_database, l)
print(result)

[1, 64, 2.25, 4]
CPU times: user 8.32 ms, sys: 832 µs, total: 9.15 ms
Wall time: 8.03 s


Observations?
- results are identical to serial processing of queries; good!
- total duration is reduced: work (here: waiting for results) is distributed across threads
- allocation: queries are handed out in order; thread 0 takes query 0 and thread 1 takes query 1; thread 0 then works through the remaining queries while thread 1 is still busy with query 1
- careful: the load is not balanced automatically (`ThreadPool` cannot know how long each query takes); in our example, if the long query comes last, the total duration increases
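To check the allocation described in the observations, one option is to let each query report which thread executed it. The following is a minimal sketch, not part of the original notebook: `traced_query` is a hypothetical helper, and the durations are the queries from above scaled down by a factor of 10 so that the cell finishes quickly.

```python
import threading
import time
from multiprocessing.pool import ThreadPool


def traced_query(x):
    """Sleep for x seconds, then report which thread did the work."""
    time.sleep(x)  # stands in for server-side processing, as in query_database
    return x, threading.get_ident()


# Assumption: same queries as above, scaled down by 10 to keep the run short.
durations = [0.1, 0.8, 0.15, 0.2]

with ThreadPool(processes=2) as pool:
    trace = pool.map(traced_query, durations)

for x, ident in trace:
    print(f"query of {x} s handled by thread {ident}")
```

With two threads you would typically see one identifier attached only to the long query and the other identifier attached to the three short ones. The identifiers themselves differ from run to run.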

In [6]:
l = [1, 1.5, 2, 8]  # some dummy queries

In [7]:
%%time
with ThreadPool(processes=2) as pool:
    result = pool.map(query_database, l)
print(result)

[1, 2.25, 4, 64]
CPU times: user 5.92 ms, sys: 1.41 ms, total: 7.33 ms
Wall time: 9.52 s
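The two wall times measured above (roughly 8 s and 9.5 s) can be reproduced with a small back-of-the-envelope model. The sketch below assumes an idealized pool: tasks are handed out in order, one at a time, to whichever worker becomes free first, and there is no overhead. `simulated_wall_time` is a hypothetical helper written for this illustration; it is not part of `multiprocessing`.

```python
import heapq


def simulated_wall_time(durations, n_workers):
    """Estimate the wall time of an idealized pool without any overhead.

    Tasks are assigned in order, each to the worker that becomes free first.
    """
    # Each heap entry is the time at which one worker becomes available.
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    finish = 0.0
    for d in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + d
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


print(simulated_wall_time([1, 8, 1.5, 2], 2))  # long query second
print(simulated_wall_time([1, 1.5, 2, 8], 2))  # long query last
```

Under these assumptions the model gives 8.0 s for the first ordering and 9.5 s for the second, in line with the measurements. It also suggests a possible workaround when rough durations are known in advance: submitting the longest tasks first, e.g. `sorted(l, reverse=True)`, brings the estimate for this example back to 8.0 s.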


Now let's consider a compute-intensive number-crunching task, for example simulating multiple trials of our fancy neural network model. Here we mimic the simulation by merely counting down from a large number.

In [8]:
def crunch_numbers(x):
    """Run a simulation of your favourite neural network model."""
    n = x * 1e7  # mimic a compute-intensive simulation
    while n > 0:
        n -= 1
    y = x ** 2
    return y

In [9]:
l = [1, 8, 1.5, 2]  # some dummy simulations

Again, we first use the builtin `map` function to perform the number crunching for each item in `l`:

In [10]:
%%time
result = list(map(crunch_numbers, l))
print(result)

[1, 64, 2.25, 4]
CPU times: user 9.9 s, sys: 2.16 ms, total: 9.9 s
Wall time: 9.9 s


Now, we again use `ThreadPool` to perform these simulations using two threads:

In [11]:
%%time
with ThreadPool(processes=2) as pool:
    result = pool.map(crunch_numbers, l)
print(result)

[1, 64, 2.25, 4]
CPU times: user 10.2 s, sys: 24.2 ms, total: 10.3 s
Wall time: 10.3 s


Observations?
- runtime is (almost) identical to serial execution: the GIL prevents simultaneous number crunching. :'(

# Multiprocessing in Python
We now introduce the `ProcessPool`, which is our alias for `multiprocessing.pool.Pool`. In contrast to the `ThreadPool`, it distributes work across multiple processes, each running a separate instance of the Python interpreter. This allows you to circumvent the limitations of the GIL and achieve truly parallel code execution. Unfortunately, it comes with some downsides, such as overhead for launching processes and increased memory consumption (depending on implementation and use case). However, for use cases which are **compute bound**, it is an excellent, simple-to-use option. As already introduced above, these use cases may include numerical simulations, sampling methods etc.

In [12]:
from multiprocessing.pool import Pool as ProcessPool

In [13]:
%%time
with ProcessPool(processes=2) as pool:  # context manager providing a `Pool` instance
    result = pool.map(crunch_numbers, l)
print(result)

[1, 64, 2.25, 4]
CPU times: user 10.3 ms, sys: 8.06 ms, total: 18.4 ms
Wall time: 6.74 s


Observations?
- results are identical to serial and threaded execution; good!
- runtime is reduced compared to both serial and threaded execution
- as before, tasks are handed out to processes one by one, with no automatic load balancing

In [14]:
l = [1, 1.5, 2, 8]  # some dummy simulations

In [15]:
%%time
with ProcessPool(processes=2) as pool:
    result = pool.map(crunch_numbers, l)
print(result)

[1, 2.25, 4, 64]
CPU times: user 12.9 ms, sys: 3.98 ms, total: 16.8 ms
Wall time: 7.82 s


# TBD
- race conditions for threading
- memory sharing for processes