# Multiprocessing

## Table of Contents

 1. [Introduction](#Introduction)
     1. [Multithreading vs Multiprocessing](#Multithreading-vs-Multiprocessing)
 2. [Multiprocessing vs Multithreading Source](#Multiprocessing-vs-Multithreading-Source)
 3. [Python Concurrent Futures](#Python-Concurrent-Futures)
    1. [Executor Class](#Executor-Class)
    2. [Thread Pool Executor](#Thread-Pool-Executor)
    3. [Process Pool Operator](#Process-Pool-Operator)
    4. [Example of using concurrent futures](#Example-of-using-concurrent-futures)

## Introduction

Multiprocessing is one way to reduce processing time: work that would otherwise run on a single thread or processor
is spread across several of them.

There are several ways to do this, and we will explore them first.

### Multithreading vs Multiprocessing

This section covers a few misconceptions about how the two approaches behave in Python.

The first thing to understand is the two modules that Python provides: **threading** and **multiprocessing**. The big difference
between them is that the threading module uses threads, which work in the same memory space, while the multiprocessing module
uses processes, which use separate memory spaces.
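
The difference in memory spaces can be seen with a small sketch. The names `shared`, `append_item`, `run_in_thread`, and
`run_in_process` are illustrative only and are not part of either module.

```python
import multiprocessing
import threading

shared = []  # module-level list used to observe whether memory is shared


def append_item(item):
    shared.append(item)


def run_in_thread(item):
    # The thread runs in this process, so it appends to the same list object.
    worker = threading.Thread(target=append_item, args=(item,))
    worker.start()
    worker.join()


def run_in_process(item):
    # The child process works on its own copy of the list.
    worker = multiprocessing.Process(target=append_item, args=(item,))
    worker.start()
    worker.join()


if __name__ == "__main__":
    run_in_thread("from thread")
    print(shared)  # ['from thread']
    run_in_process("from process")
    print(shared)  # still ['from thread']: the child's append is not visible here
```

To get data back from a process it has to be sent explicitly, for example as a return value through an executor.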

There are more differences between multithreading and multiprocessing which we need to understand to decide on the best deployment.

Consider an experiment with the following code:

```
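from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_heavy(x):
    # CPU-bound task: sum the first 10**8 integers in pure Python.
    print('I am', x)
    count = 0
    for i in range(10**8):
        count += i

def multithreading(func, args, workers):
    # Run func over args using a pool of threads.
    with ThreadPoolExecutor(workers) as ex:
        res = ex.map(func, args)
    return list(res)

def multiprocessing(func, args, workers):
    # Run func over args using a pool of processes.
    with ProcessPoolExecutor(workers) as ex:
        res = ex.map(func, args)
    return list(res)

# visualize_runtimes is not defined here; it is assumed to be a plotting helper from the source cited below.
visualize_runtimes(multithreading(cpu_heavy, range(4), 4))
visualize_runtimes(multiprocessing(cpu_heavy, range(4), 4))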
```
The resulting graph shows that multithreading took 20 seconds while multiprocessing took only 5 seconds.
This may seem strange, because a **common misconception** is that the two perform in the same way.

(Source: https://github.com/baatout/multithreading-vs-multiprocessing/blob/master/live_comparison.py)

The biggest difference between the two is that the thread pool does not run in parallel. This can be seen by
measuring the start and end time of each task: with the thread pool, all of the threads are started, but they take turns
executing rather than running at the same moment. In standard CPython builds this is a consequence of the global interpreter
lock (GIL), which allows only one thread to execute Python bytecode at a time.
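
One way to observe this is to record when each task starts and finishes. The sketch below is an assumption about how such
a measurement could look, not the code from the linked source; `timed_cpu_task` and `run_threads` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def timed_cpu_task(n):
    """Sum the first n integers and report which thread ran it and when."""
    start = time.perf_counter()
    total = 0
    for i in range(n):
        total += i
    stop = time.perf_counter()
    return threading.current_thread().name, start, stop


def run_threads(n, workers):
    # Submit one CPU-bound task per worker and collect the timings.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(timed_cpu_task, [n] * workers))


if __name__ == "__main__":
    t0 = time.perf_counter()
    for name, start, stop in run_threads(2_000_000, 4):
        print(f"{name}: started at {start - t0:.3f}s, finished at {stop - t0:.3f}s")
```

On a standard CPython build the printed intervals typically overlap, yet each task takes noticeably longer than it would
on its own, because the threads share the interpreter.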

Another big misconception is that multithreading always makes code faster. In Python it can make CPU-bound code slower,
because the work is still done serially and extra time is spent switching between the threads.

Although this does not sound good for multithreading, the discussion so far has been skewed towards CPU-bound processing
and ignores other use cases. Multithreading is very useful for I/O-bound work, where threads spend most of their time waiting
and the waits can overlap.
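
A minimal sketch of the I/O case, assuming `time.sleep` as a stand-in for a real network or disk call. The helper names
`fake_io`, `run_serial`, and `run_threaded` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(delay):
    """Stand-in for an I/O call; the thread is idle while it sleeps."""
    time.sleep(delay)
    return delay


def run_serial(delays):
    start = time.perf_counter()
    results = [fake_io(d) for d in delays]
    return results, time.perf_counter() - start


def run_threaded(delays):
    start = time.perf_counter()
    # One thread per task, so all of the waits happen at the same time.
    with ThreadPoolExecutor(max_workers=len(delays)) as ex:
        results = list(ex.map(fake_io, delays))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.2, 0.2, 0.2, 0.2]
    _, serial_time = run_serial(delays)
    _, threaded_time = run_threaded(delays)
    print(f"serial:   {serial_time:.2f}s")
    print(f"threaded: {threaded_time:.2f}s")
```

The serial run takes roughly the sum of the delays, while the threaded run takes roughly the longest single delay.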

### Multiprocessing vs Multithreading Source

https://medium.com/contentsquare-engineering-blog/multithreading-vs-multiprocessing-in-python-ece023ad55a

## Python Concurrent Futures

The `concurrent.futures` module provides a high-level interface for asynchronously executing callables. The asynchronous
execution can be performed with threads, using **ThreadPoolExecutor**, or with separate processes, using **ProcessPoolExecutor**.
Both implement the same interface, which is defined by the abstract **Executor** class.

### Executor Class

This is an abstract class that provides methods to execute calls asynchronously. It should not be used directly,
but rather through its concrete subclasses.
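
Because both pools share the Executor interface (`submit`, `map`, and `shutdown`), code can be written against the base
class and handed either kind of pool. The sketch below illustrates this with a thread pool; `square` and `run_squares`
are hypothetical names.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def square(x):
    return x * x


def run_squares(executor: Executor, values):
    """Works with any Executor subclass, since it only uses submit() and map()."""
    first = executor.submit(square, values[0])  # submit() returns a Future
    rest = executor.map(square, values[1:])     # map() yields results in input order
    return [first.result(), *rest]


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as executor:
        print(run_squares(executor, [1, 2, 3, 4]))  # [1, 4, 9, 16]
```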

### Thread Pool Executor

This class is an Executor subclass that uses a pool of threads to execute calls asynchronously. One thing to watch out
for is deadlock, which can occur when the callable associated with one future waits on the result of another future. For example:

```
import time
from concurrent.futures import ThreadPoolExecutor

def wait_on_b():
    time.sleep(5)
    print(b.result())  # b will never complete because it is waiting on a
    return 5

def wait_on_a():
    time.sleep(5)
    print(a.result())  # a will never complete because it is waiting on b
    return 6

executor = ThreadPoolExecutor(max_workers=2)
a = executor.submit(wait_on_b)
b = executor.submit(wait_on_a)
```

This will never finish, because `a` is waiting on `b` and `b` is waiting on `a`.

Another way that a deadlock can occur is shown below:

```
from concurrent.futures import ThreadPoolExecutor

def wait_on_future():
    f = executor.submit(pow, 5, 2)
    # Blocks forever: the only worker thread is busy running wait_on_future itself.
    print(f.result())

executor = ThreadPoolExecutor(max_workers=1)
executor.submit(wait_on_future)
```

```
<Future at 0x1220c81c0 state=running>
```


This will never complete, because the executor has only one worker thread: that thread is occupied by `wait_on_future`,
which blocks waiting for a second call that cannot start until a worker becomes free.
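
The single-worker case can be demonstrated without hanging the interpreter by giving `Future.result` a timeout. This is
only a sketch: `nested_call` and `run` are hypothetical helper names, and the 0.5 second timeout is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def nested_call(executor, timeout):
    """Submit a second call to the same pool and wait for it from inside a worker."""
    inner = executor.submit(pow, 5, 2)
    try:
        return inner.result(timeout=timeout)
    except TimeoutError:
        # The inner call never started because no worker was free.
        return "timed out"


def run(max_workers, timeout=0.5):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        outer = executor.submit(nested_call, executor, timeout)
        return outer.result()


if __name__ == "__main__":
    print(run(max_workers=1))  # timed out
    print(run(max_workers=2))  # 25
```

With one worker the inner call cannot start until the outer call returns, so the wait times out. With two workers a
second thread picks up the inner call and the result arrives.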

Now that we have seen some causes of deadlocks, let's look at how to use the ThreadPoolExecutor properly:

```
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and return its contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and map each future back to its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
```

```
'http://europe.wsj.com/' generated an exception: HTTP Error 403: Forbidden
'http://www.foxnews.com/' page is 288814 bytes
'http://some-made-up-domain.com/' page is 64668 bytes
'http://www.cnn.com/' page is 1138977 bytes
'http://www.bbc.co.uk/' page is 311362 bytes
```


This code submits one task per URL, collecting the futures in a dictionary, and then prints each result as soon as
its future has completed.
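
The same submit-then-`as_completed` pattern can be tried without network access. The sketch below uses sleeps in place of
downloads; `slow_echo` and `completion_order` are hypothetical names, and the delays are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_echo(name, delay):
    time.sleep(delay)
    return name


def completion_order(tasks):
    """tasks maps a name to a delay in seconds; returns the names in the order they finish."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as executor:
        future_to_name = {executor.submit(slow_echo, n, d): n for n, d in tasks.items()}
        # as_completed yields each future when it finishes, not in submission order.
        return [future.result() for future in as_completed(future_to_name)]


if __name__ == "__main__":
    print(completion_order({"slow": 0.6, "fast": 0.1, "medium": 0.3}))
    # ['fast', 'medium', 'slow']
```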

### Process Pool Operator

This is an Executor subclass (**ProcessPoolExecutor**) that uses a pool of processes to execute calls asynchronously.
It is built on the multiprocessing module.
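
A minimal sketch of a process pool, assuming a CPU-bound helper named `sum_of_squares` (a hypothetical function). The pool
is created under an `if __name__ == "__main__":` guard because, depending on how worker processes are started, they may
import the main module again.

```python
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(n):
    """CPU-bound work; defined at module level so worker processes can locate it."""
    return sum(i * i for i in range(n))


def main():
    with ProcessPoolExecutor(max_workers=2) as executor:
        # Arguments and results are pickled to cross the process boundary.
        return list(executor.map(sum_of_squares, [10, 100, 1000]))


if __name__ == "__main__":
    print(main())
```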

### Example of using concurrent futures

```
import concurrent.futures
import time

def power_calc(x):
    time.sleep(x/2)
    return pow(x,2)

values = [1,4,3,5,6,3,7,5,8,10]
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(power_calc,value) for value in values]
    results = [result.result() for result in futures]

print(results)
```

```
1
16
9
25
36
9
49
25
64
100
26 s ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```


Using the code above, we can summarise the time it takes to complete with different numbers of threads:

| Threads | Time |
| --- | --- |
| No threads | 26 seconds |
| 2 | 13.5 seconds |
| 5 | 7.51 seconds |
| 7 | 6.41 seconds |
| 10 | 5.01 seconds |
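
The two ends of the table follow from the sleep times: the sleeps sum to 26 seconds (52 / 2), which matches the run without
threads, and the longest single sleep is 5 seconds, which matches the run with 10 threads. A sketch of how such timings
could be collected is given below. It assumes the sleeps are shortened 50x so that it finishes quickly; `SCALE` and
`timed_run` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor

VALUES = [1, 4, 3, 5, 6, 3, 7, 5, 8, 10]
SCALE = 0.01  # assumption: sleep x * 0.01 seconds instead of x / 2


def power_calc(x):
    time.sleep(x * SCALE)
    return pow(x, 2)


def timed_run(workers):
    """Return (results, elapsed seconds); workers=None runs serially without threads."""
    start = time.perf_counter()
    if workers is None:
        results = [power_calc(v) for v in VALUES]
    else:
        with ThreadPoolExecutor(max_workers=workers) as executor:
            results = list(executor.map(power_calc, VALUES))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    for workers in (None, 2, 5, 7, 10):
        _, elapsed = timed_run(workers)
        label = "no threads" if workers is None else f"{workers} threads"
        print(f"{label:>10}: {elapsed:.2f}s")
```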


## DASK

Dask allows for multi-core execution on larger-than-memory datasets. We can think of Dask as having a high level
and a low level:

 - **High level**: Dask provides high-level Array and DataFrame collections that mimic NumPy and pandas but can
 operate in parallel on data that does not fit into memory. These are alternatives to NumPy arrays and pandas DataFrames.
 - **Low level**: The low level consists of dynamic task schedulers. These are the Dask equivalent
 of threading and multiprocessing.


