# **Multiprocessing with Python**

CPython, the reference implementation of Python, was not designed with multicore execution in mind. Its GIL (global interpreter lock) lets only one thread execute Python bytecode at a time, so threads effectively run serially whenever they interact with the interpreter. That does not mean there is no way to write multicore programs in Python.

There are a few things to note before diving into multiprocessing.

1. Speedup is not linear in the number of cores/threads used. Multiprocessing carries overhead, such as process startup and data transfer (as in any other language, of course).
2. Shared state between threads or processes is more awkward to handle, which means higher overhead and development effort.
3. The program is bound by Amdahl's law: the serial part of the work caps the achievable speedup. Read up on it if you do not remember it.
4. Python threads are real OS-native threads, as in Java or C++, but they all run in a single process, which means a single Python interpreter. CPU-bound threads are therefore bottlenecked by the GIL. That is why we use processes instead, each with its own independent interpreter.
5. Keep message/state passing between processes to a minimum.
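
To make the Amdahl's law point concrete, here is a minimal sketch of the formula `speedup = 1 / ((1 - p) + p / n)`, where `p` is the fraction of the runtime that can be parallelised and `n` is the number of workers. The function name `amdahl_speedup` and the 90% figure are illustrative assumptions, not measurements from this notebook.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the runtime is parallelisable."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Assumed workload: 90% of the runtime can be spread across workers.
    for workers in (1, 2, 4, 8, 16):
        print(f"{workers:>2} workers -> {amdahl_speedup(0.9, workers):.2f}x")
```

Even with an unlimited number of workers, the 10% serial part in this assumed workload keeps the speedup below 10x.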

## Python Multiprocessing overview

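* Process - A separate child process with its own identifier. With the `fork` start method it is a forked copy of the current process; other start methods launch a fresh interpreter. We can provide a target function to run in it.
* Pool - A pool of worker processes (or threads, via `multiprocessing.dummy`) that distributes tasks among the workers.
* Queue - A FIFO queue for the multiple-producer/multiple-consumer pattern.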
* Pipe - A communication channel between 2 processes.
* Manager - A high level managed interface to share objects between processes.
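
As a rough illustration of three of the building blocks listed above, the sketch below starts two `Process` workers that report through a `Queue`, then exchanges one message over a `Pipe`. The names `square_worker`, `echo_worker` and `run_demo` are hypothetical helpers made up for this example.

```python
import multiprocessing as mp
import os


def square_worker(numbers, results):
    """Runs in a child process: push (pid, n*n) onto the shared queue."""
    for n in numbers:
        results.put((os.getpid(), n * n))


def echo_worker(conn):
    """Runs in a child process: receive one message over the pipe and reply."""
    message = conn.recv()
    conn.send(message.upper())
    conn.close()


def run_demo():
    # Process + Queue: two producers, the parent process is the consumer.
    results = mp.Queue()
    chunks = [[1, 2], [3, 4]]
    workers = [mp.Process(target=square_worker, args=(chunk, results)) for chunk in chunks]
    for worker in workers:
        worker.start()
    # Drain the queue before joining so the children can exit cleanly.
    squares = sorted(results.get()[1] for _ in range(4))
    for worker in workers:
        worker.join()

    # Pipe: a two-ended channel between exactly two processes.
    parent_end, child_end = mp.Pipe()
    echo = mp.Process(target=echo_worker, args=(child_end,))
    echo.start()
    parent_end.send("hello")
    reply = parent_end.recv()
    echo.join()
    return squares, reply


if __name__ == "__main__":
    print(run_demo())
```

The `if __name__ == "__main__":` guard matters here: with start methods that launch a fresh interpreter, the child re-imports the main module, and unguarded code would try to start processes again.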


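As an example of multiprocessing, we will estimate Pi with a Monte Carlo simulation (learned in stats). A point drawn uniformly from the unit square falls inside the quarter circle of radius 1 with probability Pi/4, so Pi is approximately 4 × (points inside the circle / total points). Below is the base implementation of the experiment unit (random x, y values in the unit square), followed by a driver that splits the trials across a pool of workers.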

In [4]:
%%writefile Multiprocessing/pi_estimate.py

import os
import random

def calc_point_inside_circle(num_of_estimates):
    
    print(f"Executing calc_point_inside_circle with {num_of_estimates:,} on pid {os.getpid()}")

    trials_inside_circle = 0

    for step in range(int(num_of_estimates)):
        x = random.uniform(0,1)
        y = random.uniform(0,1)

        is_inside_circle = 1 if (x**2 + y**2) <= 1 else 0
        trials_inside_circle += is_inside_circle

    return trials_inside_circle


from multiprocessing import Pool # This is process based
# from multiprocessing.dummy import Pool # This is thread based
import time

if __name__ == "__main__":
    total_trials = 1e8
    num_workers = 4

    pool = Pool(processes=num_workers)
    trials_per_worker = total_trials/num_workers
    trials_per_processes = [trials_per_worker]*num_workers

    start_time = time.time()
    trials_inside_circle = pool.map(calc_point_inside_circle, trials_per_processes)
    pi_estimate = (sum(trials_inside_circle)*4)/float(total_trials)
    print(pi_estimate)
    print(f"Time consumed: {time.time()-start_time}")


Overwriting Multiprocessing/pi_estimate.py



Running the above script with 4 processes gives the following output.

<center><image src="./img/15.jpg" width="700"/></center>

With only 1 process, the output looks like this.

<center><image src="./img/16.jpg" width="600"/></center>

We can see a clear performance improvement from using multiple processes for this operation. If we used threads instead, we could not expect the same gain, because of the GIL.
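
To try the thread-based variant yourself, here is a minimal sketch that runs the same kind of estimate on `multiprocessing.pool.ThreadPool`. The function names and the trial count are assumptions chosen to keep the run short. On a standard GIL-enabled CPython build, the expectation is that adding threads gives little or no speedup for this CPU-bound loop.

```python
import random
import time
from multiprocessing.pool import ThreadPool


def count_inside_circle(num_trials):
    """Count random points in the unit square that fall inside the quarter circle."""
    inside = 0
    for _ in range(num_trials):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside


def estimate_pi_with_threads(total_trials, num_workers):
    """Split the trials across a pool of threads (not processes)."""
    per_worker = total_trials // num_workers
    with ThreadPool(processes=num_workers) as pool:
        counts = pool.map(count_inside_circle, [per_worker] * num_workers)
    return 4 * sum(counts) / (per_worker * num_workers)


if __name__ == "__main__":
    start = time.perf_counter()
    print(estimate_pi_with_threads(2_000_000, 4))
    print(f"Time consumed: {time.perf_counter() - start:.2f}s")
```

Comparing the timing of this script for 1 and 4 workers against the process-based version above is a simple way to see the effect of the GIL on your own machine.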

## Python Joblib module

Joblib is a third-party library that offers a more convenient interface than the multiprocessing module, together with lightweight pipelining. It is easy to use for pure Python/NumPy workloads that are embarrassingly parallel. It can also cache the outputs of expensive functions to disk, so they can be reused between sessions.
Install the joblib package with the command below.

<center>pip install joblib</center>

In [13]:
%%writefile Multiprocessing/pi_estimate_joblib.py

import os
import random
import time

def calc_point_inside_circle(num_of_estimates):
    
    print(f"Executing calc_point_inside_circle with {num_of_estimates:,} on pid {os.getpid()}")

    trials_inside_circle = 0

    for step in range(int(num_of_estimates)):
        x = random.uniform(0,1)
        y = random.uniform(0,1)

        is_inside_circle = 1 if (x**2 + y**2) <= 1 else 0
        trials_inside_circle += is_inside_circle

    return trials_inside_circle

from joblib import Parallel, delayed
if __name__ == "__main__":

    total_trials = 1e8
    num_workers = 4

    trials_per_worker = total_trials/num_workers
    trials_per_processes = [trials_per_worker]*num_workers


    parallel_obj = Parallel(n_jobs=num_workers, verbose=1)
    async_function = delayed(calc_point_inside_circle)

    start_time = time.time()
    trials_inside_circle = parallel_obj(async_function(trials_per_worker) for _ in range(num_workers))
    pi_estimate = (sum(trials_inside_circle)*4)/float(total_trials)

    print(pi_estimate)
    print(f"Time consumed: {time.time()-start_time}")


Overwriting Multiprocessing/pi_estimate_joblib.py


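Joblib's call syntax is a bit confusing because it chains calls: `delayed(f)` wraps the function, calling that wrapper with arguments records the call without running it, and the `Parallel` object then executes the recorded calls. The output is as follows.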

<center><image src="./img/17.jpg" width="600"/></center>

The total time consumed is less than with a single process.
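
The chained call syntax can be unpacked with a toy model. The sketch below is not joblib's code: `toy_delayed` and `toy_parallel` are hypothetical stand-ins written for illustration, assuming that the delayed wrapper simply captures the function and its arguments, and that the runner consumes those captured calls (serially here, where joblib would hand them to workers).

```python
def toy_delayed(function):
    """Hypothetical stand-in for joblib.delayed: capture a call instead of making it."""
    def capture(*args, **kwargs):
        return function, args, kwargs
    return capture


def toy_parallel(tasks):
    """Hypothetical stand-in for a Parallel object: run each captured call (serially)."""
    return [function(*args, **kwargs) for function, args, kwargs in tasks]


def add(a, b=0):
    return a + b


if __name__ == "__main__":
    task = toy_delayed(add)(2, b=3)   # nothing is computed yet
    print(task[1], task[2])           # (2,) {'b': 3}
    # The generator of captured calls is what gets handed to the runner.
    print(toy_parallel(toy_delayed(add)(i, b=10) for i in range(3)))  # [10, 11, 12]
```

Read this way, `parallel_obj(async_function(x) for ...)` is a runner being called with a generator of recorded calls.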

The `Parallel` class has many parameters to experiment with, including debug verbosity, timeouts, using threads instead of processes, and changing the backend. Adjust them based on your requirements.
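
As a small sketch of a few of those parameters, the example below asks joblib for a thread-based backend through `prefer="threads"`, silences progress output with `verbose=0`, and sets a `timeout` in seconds. The helper names `square` and `squares_with_threads` and the chosen values are assumptions for illustration.

```python
from joblib import Parallel, delayed


def square(n):
    return n * n


def squares_with_threads(values, num_workers=2):
    # prefer="threads" is a soft hint asking for a thread-based backend;
    # timeout limits how long joblib waits for a task result.
    runner = Parallel(n_jobs=num_workers, prefer="threads", verbose=0, timeout=30)
    return runner(delayed(square)(v) for v in values)


if __name__ == "__main__":
    print(squares_with_threads(range(5)))
```

Threads suit I/O-bound tasks or code that releases the GIL; for a pure Python loop like the Pi estimate, the process-based default is usually the better fit.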

Another useful feature in joblib is its `Memory` class. Its `cache` decorator saves a function's results to an on-disk cache keyed by the input arguments. This requires each call made through async_function to have unique arguments; otherwise every worker maps to the same cache entry and the cache store cannot tell the results of the different processes apart. To avoid that, we can add an index as an additional parameter. Check the code below.


In [15]:
%%writefile Multiprocessing/pi_estimate_joblib_cached.py

import os
import random
import time
from joblib import Parallel, delayed
from joblib import Memory
memory = Memory("./Multiprocessing/joblib_cache", verbose=0)

@memory.cache
def calc_point_inside_circle_with_idx(num_of_estimates, idx):
    
    print(f"Executing calc_point_inside_circle with {num_of_estimates:,} and index {idx} on pid {os.getpid()}")

    trials_inside_circle = 0

    for step in range(int(num_of_estimates)):
        x = random.uniform(0,1)
        y = random.uniform(0,1)

        is_inside_circle = 1 if (x**2 + y**2) <= 1 else 0
        trials_inside_circle += is_inside_circle

    return trials_inside_circle

if __name__ == "__main__":

    total_trials = 1e8
    num_workers = 4

    trials_per_worker = total_trials/num_workers
    trials_per_processes = [trials_per_worker]*num_workers


    parallel_obj = Parallel(n_jobs=num_workers, verbose=1)
    async_function = delayed(calc_point_inside_circle_with_idx)

    start_time = time.time()
    trials_inside_circle = parallel_obj(async_function(trials_per_worker, i) for i in range(num_workers))
    pi_estimate = (sum(trials_inside_circle)*4)/float(total_trials)

    print(pi_estimate)
    print(f"Time consumed: {time.time()-start_time}")


Overwriting Multiprocessing/pi_estimate_joblib_cached.py


Running the above code for the first time gives the following output.

<center><image src="./img/18.jpg" width="600"/></center>

If we run it again without changing the arguments, the results come from the cache, which improves performance significantly.

<center><image src="./img/19.jpg" width="600"/></center>
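
The caching behaviour can also be checked in isolation. The minimal sketch below records every real execution of a cached function in a list, so it is visible when a call is served from disk instead. The throwaway cache directory and the names `slow_square` and `executions` are assumptions for this example.

```python
import tempfile

from joblib import Memory

cache_dir = tempfile.mkdtemp()          # throwaway cache location (assumption)
memory = Memory(cache_dir, verbose=0)

executions = []                         # records each real run of the function body


@memory.cache
def slow_square(n, idx):
    executions.append((n, idx))
    return n * n


if __name__ == "__main__":
    slow_square(3, 0)   # computed and stored on disk
    slow_square(3, 0)   # same arguments: loaded from the cache, body not run
    slow_square(3, 1)   # different idx: new cache entry, body runs again
    print(executions)   # [(3, 0), (3, 1)]
```

Note that for a random simulation like the Pi estimate, a cache hit returns the previously stored count, so repeated runs with the same arguments report the same estimate.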