# Parallelization

**Table of contents**<a id='toc0_'></a>    
- 1. [Serial problem](#toc1_)    
- 2. [Parallelization with joblib](#toc2_)    
- 3. [Parallelization with Numba](#toc3_)    
- 4. [Limitations](#toc4_)    


This notebook introduces **parallelization**: splitting independent tasks across several CPUs to reduce the total run time.

In [1]:
import time
import joblib

import numpy as np
import numba as nb

from scipy import optimize

import matplotlib.pyplot as plt
plt.rcParams.update({"axes.grid":True,"grid.color":"black","grid.alpha":"0.25","grid.linestyle":"--"})
plt.rcParams.update({'font.size': 14})

In [2]:
import psutil
CPUs = psutil.cpu_count() # number of logical CPUs
CPUs_list = set(np.sort([1,2,4,*np.arange(8,CPUs+1,4)])) # note: a set does not keep the sorted order
print(f'this computer has {CPUs} CPUs')

this computer has 8 CPUs


## 1. <a id='toc1_'></a>[Serial problem](#toc0_)

Assume we need to **solve the following optimization problem**

In [3]:
def solver(alpha,beta,gamma):
    return optimize.minimize(lambda x: (x[0]-alpha)**2 + 
                                       (x[1]-beta)**2 + 
                                       (x[2]-gamma)**2,[0,0,0],method='nelder-mead')

$n$ times:

In [8]:
n = 500*CPUs
alphas = np.random.uniform(size=n)
betas = np.random.uniform(size=n)
gammas = np.random.uniform(size=n)

def serial_solver(alphas,betas,gammas):
    results = [solver(alpha,beta,gamma) for (alpha,beta,gamma) in zip(alphas,betas,gammas)]
    return [result.x for result in results]

%time xopts = serial_solver(alphas,betas,gammas)

Wall time: 23 s
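
Because the objective is a sum of squared deviations, its minimizer is known in closed form, $x^{\ast} = (\alpha,\beta,\gamma)$, which gives a quick sanity check of the solver. The sketch below repeats the ``solver()`` definition so that it runs on its own; the test values 0.3, 0.5 and 0.7 are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import optimize

def solver(alpha, beta, gamma):
    # same objective as above: squared distance from x to (alpha, beta, gamma)
    return optimize.minimize(lambda x: (x[0]-alpha)**2 +
                                       (x[1]-beta)**2 +
                                       (x[2]-gamma)**2, [0, 0, 0], method='nelder-mead')

# arbitrary test values (an assumption for this check, not taken from the notebook)
target = np.array([0.3, 0.5, 0.7])
result = solver(*target)

# largest deviation from the analytical minimizer; limited by Nelder-Mead's tolerances
error = np.max(np.abs(result.x - target))
print(error)
```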


## 2. <a id='toc2_'></a>[Parallelization with joblib](#toc0_)

**Joblib** can be used to run Python code in **parallel**.

1. ``joblib.delayed(FUNC)(ARGS)`` creates a task that calls ``FUNC`` with ``ARGS``.
2. ``joblib.Parallel(n_jobs=K)(TASKS)`` executes the tasks in ``TASKS`` in ``K`` parallel processes.
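
The two building blocks can be tried on a toy problem before applying them to the solver. The following minimal sketch computes square roots in two processes; the choice of ``math.sqrt`` and of five inputs is only for illustration.

```python
import math
import joblib

# 1. build the tasks lazily: nothing is computed here yet
tasks = (joblib.delayed(math.sqrt)(i**2) for i in range(5))

# 2. run the tasks in 2 parallel processes; results come back in task order
results = joblib.Parallel(n_jobs=2)(tasks)
print(results)
```

The results are returned as a list in the same order as the tasks, which is why the solver below can simply loop over ``results``.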


In [9]:
def parallel_solver_joblib(alphas,betas,gammas,n_jobs=1):

    tasks = (joblib.delayed(solver)(alpha,beta,gamma) for (alpha,beta,gamma) in zip(alphas,betas,gammas))
    results = joblib.Parallel(n_jobs=n_jobs)(tasks)
    
    return [result.x for result in results]
    
for n_jobs in CPUs_list:
    if n_jobs > 36: break
    print(f'n_jobs = {n_jobs}')
    %time xopts = parallel_solver_joblib(alphas,betas,gammas,n_jobs=n_jobs)
    print()

n_jobs = 8
Wall time: 10.7 s

n_jobs = 1
Wall time: 21.1 s

n_jobs = 2
Wall time: 13.6 s

n_jobs = 4
Wall time: 9.66 s



**Drawback:** The inputs to the functions are serialized and copied to each parallel process, which adds overhead that grows with the size of the inputs.
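
What serialization and copying mean can be illustrated with the standard library's ``pickle`` module. This is only a sketch of the principle: joblib uses its own machinery for shipping arguments (including special handling of large NumPy arrays), and the name ``copy_in_worker`` is hypothetical.

```python
import pickle
import numpy as np

alphas = np.random.uniform(size=4000)    # same kind of input as used above

payload = pickle.dumps(alphas)           # bytes that would have to be sent to a worker
copy_in_worker = pickle.loads(payload)   # what the worker would reconstruct from them

copy_in_worker[0] = -1.0                 # a change made to the worker's copy ...
unchanged = alphas[0] != -1.0            # ... does not reach the original array

print(len(payload), alphas.nbytes, unchanged)
```

The payload is at least as large as the array itself, so passing large inputs to many small tasks can eat up the gain from running in parallel.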

[More on Joblib](https://joblib.readthedocs.io/en/latest/index.html) ([examples](https://joblib.readthedocs.io/en/latest/parallel.html))

**Question:** What happens if you remove ``method='nelder-mead'`` from the ``solver()`` function? Why?

## 3. <a id='toc3_'></a>[Parallelization with Numba](#toc0_)

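``scipy.optimize.minimize`` cannot be called from Numba-compiled code, so the ``solver()`` above cannot simply be placed in a Numba-parallelized loop. Use QuantEcon, which provides Numba-compatible optimizers, instead.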

## 4. <a id='toc4_'></a>[Limitations](#toc0_)

**Parallelization** cannot always be used. Some problems are inherently sequential: if the result from a previous iteration of a loop is required in a later iteration, the iterations cannot be executed separately in parallel (except in some special cases such as summing). The larger the share of the code that can be run in parallel, the larger the potential speed-up. This is called **Amdahl's Law**.

<img src="https://github.com/NumEconCopenhagen/lectures-2019/raw/master/11/amdahls_law.png" alt="amdahls_law" width=40% />
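
In its usual form, Amdahl's Law says that if a share $p$ of the run time can be parallelized over $N$ workers, the speed-up is $S(N) = \frac{1}{(1-p) + p/N}$, which is bounded above by $\frac{1}{1-p}$. The sketch below implements this formula and, as a rough back-of-the-envelope exercise, inverts it using the joblib wall times reported above for 1 and 4 jobs. The function names are hypothetical, and treating all overhead as a fixed serial share is an assumption.

```python
def amdahl_speedup(p, n_workers):
    """Theoretical speed-up when a share p of the run time is spread over n_workers."""
    return 1.0 / ((1.0 - p) + p / n_workers)

def implied_parallel_share(speedup, n_workers):
    """Invert Amdahl's Law: the p that would produce the observed speed-up."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_workers)

# wall times reported above for the joblib solver with 1 and 4 jobs
t1, t4 = 21.1, 9.66
p_hat = implied_parallel_share(t1 / t4, 4)

print(f'implied parallel share: {p_hat:.2f}')
for k in (2, 4, 8, 16):
    print(f'n_workers = {k:2d}: predicted speed-up = {amdahl_speedup(p_hat, k):.2f}')
```

The formula ignores the cost of starting processes and copying inputs, so it cannot reproduce the observation that 8 jobs were slower than 4 on this computer.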