# Background - What is Parallel Python




Parallel computing means carrying out many tasks simultaneously. It typically does this by creating independent processes and shipping data, program files and libraries to an isolated environment where the computation is performed. There are three main models:

- **Embarrassingly parallel:** the code does not need to synchronize/communicate with other instances, and you can run multiple instances of the code separately, and combine the results later. If you can do this, great! (array jobs, task queues)

- **Shared memory parallelism:** Parallel threads need to communicate and do so via the same memory (variables, state, etc). (OpenMP)

- **Message passing:** Different processes manage their own memory segments. They share data by communicating (passing messages) as needed. (Message Passing Interface (MPI)).
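
As a rough illustration of the first model in the list above, the sketch below splits a Monte Carlo estimate of π into independent tasks and combines the partial counts at the end. The names `count_inside`, `estimate_pi` and `TASKS`, and the task sizes, are hypothetical choices made for this example. Because the tasks never communicate, the built-in `map` can be swapped for `Pool.map` without touching the rest of the code.

```python
import random
from multiprocessing import Pool


def count_inside(task):
    """Count random points that fall inside the unit quarter circle."""
    n_points, seed = task
    rng = random.Random(seed)  # private generator: no state shared between tasks
    hits = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1.0:
            hits += 1
    return hits


def estimate_pi(mapper, tasks):
    """Run every task through `mapper`, then combine the partial counts."""
    tasks = list(tasks)
    total_hits = sum(mapper(count_inside, tasks))
    total_points = sum(n_points for n_points, _ in tasks)
    return 4 * total_hits / total_points


# Hypothetical workload: four independent tasks, each with its own seed.
TASKS = [(50_000, seed) for seed in range(4)]

if __name__ == "__main__":
    serial = estimate_pi(map, TASKS)  # one task after another
    with Pool(processes=2) as pool:
        parallel = estimate_pi(pool.map, TASKS)  # same tasks, two worker processes
    print(serial, parallel)  # the two values match, since each task carries its own seed
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the script.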


### Modern approach to speeding up code

Traditional approaches to making code parallel are implemented at a low level.

However, open source software has ***evolved*** dramatically over the last few years, allowing more ***high level implementations and concise 'pythonic' syntax*** that wraps around low level tools. These modern tools also address why code takes a long time to run in today's Big Data / Data Science world: the common reasons for slow code have shifted from function-bound bottlenecks to data-bound bottlenecks.

***The focus of this course is to use these modern high level implementations to address both data-bound and function-bound bottlenecks.***

## Some terminology - Processes, threads and shared memory
A ***process*** is a collection of resources, including program files and memory, that operates as an independent entity. Since each process has a separate memory space, it can operate independently from other processes. It cannot easily access shared data in other processes.

A ***thread*** is the unit of execution within a process. A process can have anywhere from just one thread to many threads. Threads are considered lightweight because they use far fewer resources than processes. Threads within a process also share the same memory space, so they are not independent of one another.
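
The difference in memory behaviour can be seen in a minimal sketch like the one below. The names `shared_log`, `record`, `run_in_thread` and `run_in_process` are hypothetical and exist only for this illustration. A thread appends to a list owned by the parent, while a child process appends only to its own copy.

```python
import threading
from multiprocessing import Process

shared_log = []  # lives in the memory of the process that runs this script


def record(label):
    """Append a label to the module-level list of whichever process runs this."""
    shared_log.append(label)


def run_in_thread(label):
    """Run record() in a new thread of the current process."""
    t = threading.Thread(target=record, args=(label,))
    t.start()
    t.join()


def run_in_process(label):
    """Run record() in a separate child process."""
    p = Process(target=record, args=(label,))
    p.start()
    p.join()


if __name__ == "__main__":
    run_in_thread("from thread")
    print(shared_log)  # ['from thread']: the thread wrote to our memory
    run_in_process("from process")
    print(shared_log)  # still ['from thread']: the child changed only its own copy
```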

<figure>
  <img src="../fig/process_v_thread.png" style="margin:6px;width:300px"/>
</figure><br>

<figure>
  <img src="../fig/process_threads_comparison.png" style="margin:6px;width:400px"/>
</figure><br>

The designers of the Python language chose to allow only one thread in a process to run actual Python code at any one time. This is enforced by the so-called global interpreter lock (GIL).

External libraries (NumPy, SciPy, Pandas, etc), written in C or other languages, can release the lock and run multi-threaded. Code written in native Python is subject to the GIL limitation.

The ***multiprocessing library*** can be used to work around the GIL for native Python code: each worker process runs its own interpreter with its own GIL.
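
One way to observe the effect is to time the same CPU-bound, pure-Python function with a thread pool and with a process pool. The sketch below uses the standard library's `concurrent.futures`; the helper names `count_down` and `timed_map` and the workload size are assumptions made for this example. On a standard CPython build with several cores, the process version would typically finish sooner, but the actual timings depend on the machine and the interpreter build, so none are quoted here.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_down(n):
    """Pure-Python CPU-bound loop; on standard CPython it holds the GIL while running."""
    while n > 0:
        n -= 1
    return n


def timed_map(executor_cls, work, workers=2):
    """Map count_down over `work` and return (results, elapsed seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as executor:
        results = list(executor.map(count_down, work))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    work = [5_000_000, 5_000_000]  # hypothetical workload: two equal tasks
    _, t_threads = timed_map(ThreadPoolExecutor, work)
    _, t_processes = timed_map(ProcessPoolExecutor, work)
    print(f"threads:   {t_threads:.2f} s")
    print(f"processes: {t_processes:.2f} s")
```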


In [1]:
import dask
import dask.distributed  # no alias here: `dd` is used for dask.dataframe below
import dask.array as da
import dask.dataframe as dd
import pandas as pd
import random
from multiprocessing import Pool ### The default pool makes one process per CPU


### Speed up a function that could take a while to run

In [17]:

#reference:
#https://aaltoscicomp.github.io/python-for-scicomp/parallel/

def sample(n):
    n_inside_circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            n_inside_circle += 1
    return n_inside_circle / n * 4
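
For reference, `sample` is a Monte Carlo estimate of π. Points (x, y) are drawn uniformly from the unit square, and the fraction that lands inside the quarter circle of radius 1 approaches the ratio of the two areas:

$$
P\left(x^2 + y^2 < 1\right) = \frac{\pi/4}{1} = \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx 4 \cdot \frac{n_{\text{inside}}}{n}
$$

The statistical error of such an estimate shrinks roughly as $1/\sqrt{n}$, which is why the function needs many iterations and makes a convenient function-bound test case.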


In [3]:
# Using apply from pandas
ps = pd.Series([10**5,20**5])
ps.apply(sample)

0    3.143000
1    3.141715
dtype: float64

In [10]:
# Create a Pool object inside a with statement
with Pool() as p:
    result = p.map(sample,ps)
    # leaving the with block shuts the pool down automatically (it calls p.terminate())

Multiprocessing introduces a fixed start-up cost in time (creating the Pool object and its worker processes). You need to know what hardware you are working on so that the number of processes matches what is available. Creating too many processes makes this fixed cost excessively large.
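
A minimal sketch of sizing the pool to the hardware, assuming a hypothetical helper `choose_workers` that caps the pool at both the number of tasks and the number of CPUs reported by `os.cpu_count()`. Note that `os.cpu_count()` reports the CPUs of the machine, which on a shared cluster node may be more than your job is actually allowed to use.

```python
import os
from multiprocessing import Pool


def choose_workers(n_tasks, max_workers=None):
    """Pick a pool size no larger than the task count or the available CPU count."""
    available = max_workers or os.cpu_count() or 1
    return max(1, min(n_tasks, available))


def square(x):
    """Stand-in for a real, more expensive function."""
    return x * x


if __name__ == "__main__":
    inputs = [1, 2, 3]
    n_workers = choose_workers(len(inputs))  # never more workers than tasks
    with Pool(processes=n_workers) as pool:
        print(n_workers, pool.map(square, inputs))
```

With only three tasks there is no benefit in starting more than three workers, so the helper avoids paying the start-up cost for processes that would sit idle.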

### An alternative (Dask) that we will come back to later

In [11]:
# Create the dask equivalent input
ds = dd.from_pandas(ps,npartitions = 2)

In [13]:
%%timeit #605 ms ± 10.9 ms per loop
result = ds.apply(sample,meta=('x', 'float64')).mean().compute()

605 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%timeit # 1.08 s ± 47.8 ms per loop
p = Pool()
result = p.map(sample,ps)
p.close()


### Compare Multiprocessing to Dask


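Dask can use multiple processes to work around the GIL, for example through its multiprocessing scheduler or a `dask.distributed` cluster (depending on the collection and configuration, the default scheduler may instead be thread-based). Hence, when both use processes, comparing the run time of the multiprocessing library to Dask on a function-bound problem can be expected to yield broadly similar results.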


Yet Dask offers an ecosystem of resource management tools (scheduler, diagnostics, data partitions and task graphs) that makes it a more attractive way to achieve the same thing in most cases. Resource management is handled automatically by the scheduler.




In [15]:
# for reference, delaying the same function.
@dask.delayed
def dd_sample(n):
    n_inside_circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            n_inside_circle += 1
    return n_inside_circle / n * 4


result = ds.apply(dd_sample,meta=('x', 'float64')).mean().compute()
result

3.141130625

In [16]:
%%timeit #595 ms ± 1.54 ms per loop
result = ds.apply(dd_sample,meta=('x', 'float64')).mean().compute()

595 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


<div class="keypoints">

### Key points

- Multiprocessing (or mpi4py, not covered here) is the traditional way to make functions run in parallel in Python
- Using Dask and its ecosystem is the modern approach


</div>
