# Embarrassingly parallelize drivers

Often a *driver* or *diagnostic* use case is **embarrassingly parallel**: for example, the code loops over several *models*, *variables*, *scenarios*, etc.

In some cases it makes sense to keep the loop inside the driver in order to optimize data reuse (e.g. all models are compared to the same set of observations, and these require heavy pre-processing).

But sometimes this is not necessary and all the added loops can make the code cumbersome and hard to read.

The pcmdi_metrics package offers a simple solution for embarrassingly parallel drivers, provided they are based on the [Community Diagnostics Package](https://cdp.readthedocs.io/en/latest/)'s argument parser.

The idea is simple: identify the parameters over which your code is embarrassingly parallel, make sure these parameters are declared as argument inputs to your driver, add the **granularize** option to your input parameter file, and use **parallelize_driver.py** to run your driver in an embarrassingly parallel way.
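
To make the idea concrete, here is a minimal sketch of what granularizing over a set of parameters amounts to: taking the Cartesian product of the listed values and producing one single-valued parameter set per combination. This is not the code used by `parallelize_driver.py`; the `expand` helper and the parameter values are hypothetical and only illustrate the principle.

```python
from itertools import product


def expand(params, granularize):
    """Yield one parameter dict per combination of the granularized values.

    Hypothetical helper for illustration; not part of pcmdi_metrics.
    """
    keys = list(granularize)
    for combo in product(*(params[key] for key in keys)):
        single = dict(params)            # keep the non-granularized parameters
        single.update(zip(keys, combo))  # one value per granularized parameter
        yield single


params = {
    "model": ["model_a", "model_b"],
    "variable": "variable_1",
    "analysis": ["analysis 0", "analysis 1"],
}

runs = list(expand(params, ["model", "analysis"]))
for run in runs:
    print(run)
```

With two models and two analyses this yields four independent parameter sets, each of which could be handed to a separate run of the driver.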

# Example

## A simple example

### Driver

The following is a mock driver, accessible [here](mock_driver.py):

```python
#!/usr/bin/env python
from __future__ import print_function

# Prepare the parser
from pcmdi_metrics.driver.pmp_parser import PMPParser
parser = PMPParser(description='A mock driver')

# Some parameters that could be embarrassingly parallelized
parser.add_argument("--model", help="model to run over")
parser.add_argument("--variable", help="variable to process")
parser.add_argument("--analysis", help="analysis to run")

p = parser.get_parameter()

print("We are running analysis {} on model {}, using variable {}".format(p.analysis, p.model, p.variable))
```

### Parameter file

You can access the sample parameter file [here](sample_parameter_file.py)

```python
model = "model_a"
variable = "variable_1"
analysis = "analysis 1"
```

### Running

```
python mock_driver.py -p sample_parameter_file.py
```

prints

```
We are running analysis analysis 1 on model model_a, using variable variable_1
```

## Parallelization

We need to edit our parameter file as follows:
  * Add the *granularize* keyword to indicate which arguments we are parallelizing over.
  * Convert each of these parameters to a list of the values we need (the entries can be of any type, even lists).
  
The updated parameter file can be obtained [here](sample_parameter_file_updated.py)

```python
granularize = ["model", "analysis"]
model = ["model_a", "second model", 3]  # Mixing types is possible
variable = "variable_1"
analysis = ["analysis {}".format(i) for i in range(2)]
```

And use the `parallelize_driver.py` script:

```
parallelize_driver.py --driver  mock_driver.py -p sample_parameter_file_updated.py
```

prints

```
Executing: /Users/doutriaux1/anaconda2/envs/nightly/bin/python mock_driver.py -p /var/folders/nv/3xl0t1xx4yxb6tyd0yqdm238001cpd/T/tmp7Pc8tg.py
True
Executing: /Users/doutriaux1/anaconda2/envs/nightly/bin/python mock_driver.py -p /var/folders/nv/3xl0t1xx4yxb6tyd0yqdm238001cpd/T/tmpENSvHq.py
True
Executing: /Users/doutriaux1/anaconda2/envs/nightly/bin/python mock_driver.py -p /var/folders/nv/3xl0t1xx4yxb6tyd0yqdm238001cpd/T/tmpvqeJjP.py
True
Executing: /Users/doutriaux1/anaconda2/envs/nightly/bin/python mock_driver.py -p /var/folders/nv/3xl0t1xx4yxb6tyd0yqdm238001cpd/T/tmpIGakna.py
True
Executing: /Users/doutriaux1/anaconda2/envs/nightly/bin/python mock_driver.py -p /var/folders/nv/3xl0t1xx4yxb6tyd0yqdm238001cpd/T/tmpBsuuEs.py
True
Executing: /Users/doutriaux1/anaconda2/envs/nightly/bin/python mock_driver.py -p /var/folders/nv/3xl0t1xx4yxb6tyd0yqdm238001cpd/T/tmpCOOKbc.py
True
We are running analysis analysis 0 on model model_a, using variable variable_1
We are running analysis analysis 0 on model second model, using variable variable_1
We are running analysis analysis 1 on model model_a, using variable variable_1
We are running analysis analysis 1 on model 3, using variable variable_1
We are running analysis analysis 1 on model second model, using variable variable_1
We are running analysis analysis 0 on model 3, using variable variable_1
```
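
Each `Executing:` line in the output above points at a temporary `.py` file, which suggests that one single-valued parameter file is generated per combination and passed to the driver with `-p`. The following sketch shows one way such a file could be written. It is an assumption based on the log, not the actual code of `parallelize_driver.py`, and `write_parameter_file` is a hypothetical helper.

```python
import os
import tempfile


def write_parameter_file(params):
    """Write params as Python assignments to a temporary .py file.

    Hypothetical helper for illustration; returns the path of the new file.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        for name, value in sorted(params.items()):
            # repr() keeps strings quoted and numbers bare, so the file is valid Python
            handle.write("{} = {!r}\n".format(name, value))
        return handle.name


path = write_parameter_file(
    {"model": 3, "variable": "variable_1", "analysis": "analysis 0"}
)
with open(path) as handle:
    content = handle.read()
print(content)
os.remove(path)  # clean up the temporary file
```

Because the values are written with `repr()`, mixed types such as the integer model `3` survive the round trip through the parameter file.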

By default `parallelize_driver.py` launches as many processes as your system has processors, but you can control this via the `--num_workers` argument (on the command line or in the parameter file).


```
parallelize_driver.py --driver  mock_driver.py -p sample_parameter_file_updated.py --num_workers=2
```

```
Executing: /Users/doutriaux1/anaconda2/envs/nightly/bin/python mock_driver.py -p /var/folders/nv/3xl0t1xx4yxb6tyd0yqdm238001cpd/T/tmpETNvYa.py
True
Executing: /Users/doutriaux1/anaconda2/envs/nightly/bin/python mock_driver.py -p /var/folders/nv/3xl0t1xx4yxb6tyd0yqdm238001cpd/T/tmp8DQiEr.py
True
We are running analysis analysis 0 on model second model, using variable variable_1
We are running analysis analysis 0 on model model_a, using variable variable_1
Executing: /Users/doutriaux1/anaconda2/envs/nightly/bin/python mock_driver.py -p /var/folders/nv/3xl0t1xx4yxb6tyd0yqdm238001cpd/T/tmpTeSVsm.py
Executing: True/Users/doutriaux1/anaconda2/envs/nightly/bin/python mock_driver.py -p /var/folders/nv/3xl0t1xx4yxb6tyd0yqdm238001cpd/T/tmpcwKzCp.py

True
We are running analysis analysis 1 on model model_a, using variable variable_1
We are running analysis analysis 0 on model 3, using variable variable_1
Executing: /Users/doutriaux1/anaconda2/envs/nightly/bin/python mock_driver.py -p /var/folders/nv/3xl0t1xx4yxb6tyd0yqdm238001cpd/T/tmpnEXyBa.py
TrueExecuting:
 /Users/doutriaux1/anaconda2/envs/nightly/bin/python mock_driver.py -p /var/folders/nv/3xl0t1xx4yxb6tyd0yqdm238001cpd/T/tmpeynGmt.py
True
We are running analysis analysis 1 on model 3, using variable variable_1
We are running analysis analysis 1 on model second model, using variable variable_1
```
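
The interleaved lines in the output above (for example `TrueExecuting:`) presumably come from several jobs writing to the same terminal at once. The following is a minimal sketch, not the actual implementation of `parallelize_driver.py`, of how a capped pool of workers can run independent jobs while capturing each job's output separately. The `run_one` helper and the job labels are hypothetical; threads are used here only to launch and wait on child Python processes.

```python
import multiprocessing
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_one(label):
    """Run one independent job in its own Python process and return its output."""
    code = "print('processing {}')".format(label)
    output = subprocess.check_output([sys.executable, "-c", code])
    return output.decode().strip()


labels = ["model_a", "second model", "3"]      # hypothetical job labels
num_workers = min(2, multiprocessing.cpu_count())  # cap on simultaneous jobs

with ThreadPoolExecutor(max_workers=num_workers) as pool:
    # map() returns results in input order, whatever order the jobs finish in
    results = list(pool.map(run_one, labels))

for line in results:
    print(line)
```

Capturing the output per job and printing it afterwards keeps the lines in a stable order, at the cost of not seeing progress while the jobs are running.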



