In [None]:
# default_exp workflow
%load_ext lab_black
# nb_black if running in jupyter
%load_ext autoreload
# automatically reload python modules if there are changes in the
%autoreload 2

In [None]:
# hide
from nbdev.showdoc import *

# Workflow

> Define static or dynamic workflow for automatically updating, training and deploying your ML model!



***input:*** Workflow definition parameters

***output:*** python or snakemake script for running the workflow

***description:***

While you are developing your ML application, you might prefer running the notebooks manually again and again.
However, once you have deployed your model into production, this becomes impractical and compromises scalability, modularity and reproducibility.
This holds regardless of what 'production' means to you - it might well be that you are just running the notebooks and viewing the results directly from them.
Whatever you are doing, having a single command to run the whole workflow makes things much easier.

Workflow automation is also the stage where you will probably notice a lot of bugs and fragile code in your notebooks,
probably a lot more than you anticipated. Try not to get frustrated! Debugging is a big and important part of the work.

In this notebook we explain alternatives for automating workflows, either as a static or as a dynamic workflow.
By following these examples (and further documentation on [papermill](https://papermill.readthedocs.io/) and [Snakemake](https://snakemake.readthedocs.io/)) you can parameterize your notebooks,
run them automatically in a workflow, and even parameterize and automate the workflow definition.
With this template, you can define complex and versatile workflows that are well documented in a notebook.

We selected these tools for the template because they have stable community support, are relatively easy to use and fit our needs.
There are also other tools for workflow management and orchestration that may better suit your needs. Feel free to use them.
For more information, see for example this
[comparison of workflow tools for Python](https://medium.com/@Minyus86/comparison-of-pipeline-workflow-packages-airflow-luigi-gokart-metaflow-kedro-pipelinex-5daf57c17e7).

## Import relevant modules

In [None]:
import numpy as np
import pandas as pd
import papermill as pm

## Define notebook parameters

Make direct derivations from the parameters:

## How to run parameterized notebooks with papermill

Papermill allows parameterizing and running notebooks from a Python runtime with `papermill.execute_notebook(input, output, parameters)`.
The `input` argument is the notebook to be run. The `output` argument is the file path where a copy of the executed notebook is saved with the results.
This can be the same as the `input`, but you probably want to keep them separate - otherwise your version control may get messy.
In this example, executed notebooks are saved under `results/notebooks`. The `parameters` argument allows you to change the settings of the notebook.

You may have noticed that at the beginning of each notebook there is a cell with the comment `# This cell is tagged parameters`.
The cell has been given the `parameters` tag. The template notebooks already contain the tag,
but you can check the [papermill documentation](https://papermill.readthedocs.io/en/latest/usage-parameterize.html) on how to add it in different notebook editors.
In this cell, variables are assigned their default values. Papermill writes any parameters given to the `execute_notebook` function
into a new cell right below the one tagged `parameters`. The values in the new cell override the default assignments.
This is why you should do nothing but simple assignments in the parameters cell.
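
To see why only simple assignments belong in the parameters cell, the following minimal sketch imitates the override with plain Python. It does not use papermill itself: the two cell sources, the `n_rows` variable and the `run_cells` helper are hypothetical stand-ins for the tagged cell and the injected cell.

```python
# Hypothetical stand-ins for two notebook cells; papermill is not used here.
default_cell = "seed = 0\nn_rows = 100"  # the cell tagged `parameters` (n_rows is made up)
injected_cell = "seed = 1"  # the kind of cell papermill adds right below it


def run_cells(*cell_sources):
    """Execute cell sources in order in one shared namespace, like a notebook kernel."""
    namespace = {}
    for source in cell_sources:
        exec(source, namespace)
    # drop the builtins entry that exec adds to the namespace
    return {k: v for k, v in namespace.items() if not k.startswith("__")}


values = run_cells(default_cell, injected_cell)
print(values)  # seed is overridden, n_rows keeps its default
```

Anything derived from a parameter inside the tagged cell would be computed from the default value before the override takes effect, so derived values are better placed in a later cell.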

Let's show an example. Let's run the loss notebook `02_loss.ipynb` with and without changing the `seed` parameter.
Copies of the notebooks executed with different settings are stored in `results/notebooks`.
The copies are saved with an underscore prefix (`_notebook.ipynb`) so that they are ignored by nbdev.

In [None]:
# slow
# run loss notebook with default parameters
_ = pm.execute_notebook(
    "02_loss.ipynb",
    "results/notebooks/_02_loss_default_params.ipynb",
)

Executing:   0%|          | 0/43 [00:00<?, ?cell/s]

In [None]:
# slow
# run loss notebook with the 'seed' parameter changed from 0 to 1
_ = pm.execute_notebook(
    "02_loss.ipynb", "results/notebooks/_02_loss_seed_1.ipynb", parameters={"seed": 1}
)

Executing:   0%|          | 0/44 [00:00<?, ?cell/s]

You can now open the notebooks and compare the results. 

Now we could just define and run the complete workflow from this notebook: just define which notebooks to run, in which order and with which parameters.
Then just run this notebook and the workflow is executed.
We could even go further and parameterize this notebook to get parameterizable workflow execution.
Then, we could use papermill in another application to run this notebook to execute the rest of the workflow.
However, this approach has two main restrictions. The workflow execution script would not be included in this documentation, and it does not allow dynamic workflows because launching Snakemake from inside a Python runtime will cause all sorts of problems.
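
The idea of listing which notebooks to run, in which order and with which parameters can be sketched as follows. This is only an illustration: `WORKFLOW`, `plan_runs` and `run_workflow` are hypothetical names, and the notebook names and output folder are taken from the examples in this notebook.

```python
from pathlib import Path

# Ordered list of (notebook, parameters); hypothetical workflow definition.
WORKFLOW = [
    ("00_data.ipynb", {"seed": 0}),
    ("01_model.ipynb", {}),
    ("02_loss.ipynb", {}),
]


def plan_runs(workflow, out_dir="results/notebooks", prefix="_"):
    """Turn the ordered workflow into keyword arguments for papermill.execute_notebook."""
    return [
        {
            "input_path": notebook,
            "output_path": (Path(out_dir) / f"{prefix}{notebook}").as_posix(),
            "parameters": params,
        }
        for notebook, params in workflow
    ]


def run_workflow(workflow):
    """Execute the notebooks one after another (requires papermill)."""
    import papermill as pm  # imported lazily so that planning works without papermill

    for kwargs in plan_runs(workflow):
        pm.execute_notebook(**kwargs)


runs = plan_runs(WORKFLOW)
for run in runs:
    print(run["input_path"], "->", run["output_path"])
```

Keeping the plan separate from the execution makes it possible to inspect the run order before anything is executed; calling `run_workflow(WORKFLOW)` would then run the notebooks in that order.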



## Static executable workflow with papermill

If we want to run the workflow automatically, we need to create an executable script to run it.
We can define it in this notebook to include it in our documentation:


In [None]:
%%writefile static_workflow.py
# execute workflow of the example notebooks
# to run the script, call python static_workflow.py
# this file has been added to .gitignore
# NOTE: use curly brackets only to format in global variables!
# hint: you can include additional parameters with sys.argv

# import relevant libraries
import papermill as pm
import os
import sys
import yaml

# update modules before running just to be sure
os.system('nbdev_build_lib')

# run data notebook
_ = pm.execute_notebook('00_data.ipynb', # input
                        'results/notebooks/_00_data.ipynb', # output
                       parameters = dict(seed = 0) # params (optional)
                       )
# run model notebook
_ = pm.execute_notebook('01_model.ipynb',
                        'results/notebooks/_01_model.ipynb')
# run loss notebook
_ = pm.execute_notebook('02_loss.ipynb',
                        'results/notebooks/_02_loss.ipynb')

# optional (uncomment): make backup of the index and workflow notebooks:
# os.system('cp {workflow} {save_notebooks_to}{workflow}')

## Parameterized static executable workflow with papermill

What if some of your input files change, or you would like to run your workflow with a slightly different setup?
Just as we parameterized the components of the workflow, we might want to parameterize the workflow definition.
You can read the parameters directly from `sys.argv`, with Python's `argparse` module or, as in this example, from a configuration file.
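
For comparison with the configuration file approach used below, here is a minimal sketch of reading workflow parameters with `argparse`. The positional `config` argument and the `--seed` flag are illustrative assumptions and not part of the template.

```python
import argparse


def parse_args(argv=None):
    """Parse workflow options; the argument names are illustrative assumptions."""
    parser = argparse.ArgumentParser(description="Run the notebook workflow")
    parser.add_argument("config", help="path to the workflow configuration file")
    parser.add_argument(
        "--seed", type=int, default=None, help="override the seed of the notebooks"
    )
    return parser.parse_args(argv)


# Passing a list makes the parser easy to try out without a real command line.
args = parse_args(["workflow_setup.yml", "--seed", "1"])
print(args.config, args.seed)
```

In a script you would call `parse_args()` without arguments, so that the values are read from `sys.argv`.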

Let's begin by defining the configuration file. We use the [yaml](https://yaml.org/) format, because it is easy to read and write for both humans and machines.
The file is defined in this notebook and written directly into the `workflow_setup.yml` file. The config file is added to `.gitignore`: it is already defined in this notebook, so we do not need to version it twice. See [here](https://stackoverflow.com/questions/1773805/how-can-i-parse-a-yaml-file-in-python/1774043#1774043) how to use yaml with Python.

In [None]:
%%writefile workflow_setup.yml
---
notebooks: # workflow notebook setup
    index: # name of notebook
        notebook: index.ipynb # notebook file

    data:
        notebook: 00_data.ipynb
        params: # notebook parameters
            seed: 0
    model:
        notebook: 01_model.ipynb
        params:
            seed: 0
    loss:
        notebook: 02_loss.ipynb
        params:
            seed: 0
utils: # general workflow settings
    save_notebooks_to: results/notebooks/
    notebook_save_prefix: _

Overwriting workflow_setup.yml


Let's take a look at how the setup looks when loaded as a Python dictionary:

In [None]:
import yaml

with open("workflow_setup.yml", "r") as f:
    setup_dict = yaml.load(f, Loader=yaml.Loader)
setup_dict

{'notebooks': {'index': {'notebook': 'index.ipynb'},
  'data': {'notebook': '00_data.ipynb', 'params': {'seed': 0}},
  'model': {'notebook': '01_model.ipynb', 'params': {'seed': 0}},
  'loss': {'notebook': '02_loss.ipynb', 'params': {'seed': 0}}},
 'utils': {'save_notebooks_to': 'results/notebooks/',
  'notebook_save_prefix': '_'}}

Now run the cell below to create the execution script. The code is not run in this notebook, but written in the file `static_workflow.py`:

In [None]:
%%writefile static_workflow.py
# execute workflow of the example notebooks
# to run the script, call python static_workflow.py workflow_setup.yml
# this file has been added to .gitignore
# NOTE: use curly brackets only to format in global variables!
# hint: you can include additional parameters with sys.argv

# import relevant libraries
import papermill as pm
import os
import sys
import yaml

## parse arguments from workflow_setup.yml
configfilename = sys.argv[1]
with open(configfilename, 'r') as f:
    config = yaml.load(f, Loader = yaml.Loader)

# variables
notebooks = config['notebooks']
data = notebooks['data']
model = notebooks['model']
loss = notebooks['loss']

utils = config['utils']

# update modules before running just to be sure
os.system('nbdev_build_lib')

# run data notebook
_ = pm.execute_notebook(data['notebook'], # input
                        utils['save_notebooks_to'] \
                        + utils['notebook_save_prefix'] \
                        + data['notebook'], # output
                       parameters = data['params'] # params
                       )
# run model notebook
_ = pm.execute_notebook(model['notebook'],
                        utils['save_notebooks_to'] \
                        + utils['notebook_save_prefix'] \
                        + model['notebook'],
                       parameters = model['params'])
# run loss notebook
_ = pm.execute_notebook(loss['notebook'],
                        utils['save_notebooks_to'] \
                        + utils['notebook_save_prefix'] \
                        + loss['notebook'],
                       parameters = loss['params'])

# optional (uncomment): make backup of the index and workflow notebooks:
# os.system('cp {workflow} {save_notebooks_to}{workflow}')

Overwriting static_workflow.py
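
The three nearly identical `execute_notebook` calls in the script could also be generated from the configuration. The sketch below works on a dictionary shaped like the loaded `workflow_setup.yml`; the helper names `output_path` and `runs_from_config` are hypothetical, and papermill is left out so that the path handling can be checked on its own.

```python
# A dictionary shaped like the loaded workflow_setup.yml shown earlier.
config = {
    "notebooks": {
        "index": {"notebook": "index.ipynb"},
        "data": {"notebook": "00_data.ipynb", "params": {"seed": 0}},
        "model": {"notebook": "01_model.ipynb", "params": {"seed": 0}},
        "loss": {"notebook": "02_loss.ipynb", "params": {"seed": 0}},
    },
    "utils": {"save_notebooks_to": "results/notebooks/", "notebook_save_prefix": "_"},
}


def output_path(utils, notebook):
    """Build the path of the executed copy, as the script above does by hand."""
    return utils["save_notebooks_to"] + utils["notebook_save_prefix"] + notebook


def runs_from_config(config, steps=("data", "model", "loss")):
    """Return (input, output, parameters) for each step, in execution order."""
    utils = config["utils"]
    runs = []
    for name in steps:
        entry = config["notebooks"][name]
        runs.append(
            (entry["notebook"], output_path(utils, entry["notebook"]), entry.get("params", {}))
        )
    return runs


for source, target, params in runs_from_config(config):
    print(source, "->", target, params)
```

Each tuple could then be passed to `pm.execute_notebook(source, target, parameters=params)`.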


If you open the file `static_workflow.py`, you notice that the contents of curly brackets were replaced with the parameters of this notebook.

Now, you can run the workflow:

In [None]:
# slow
# run this in your terminal to also run the nbdev_build_lib
!python static_workflow.py workflow_setup.yml

sh: 1: nbdev_build_lib: not found
Executing: 100%|██████████████████████████████| 58/58 [00:19<00:00,  3.77cell/s]
Executing: 100%|██████████████████████████████| 45/45 [00:18<00:00,  1.91cell/s]
Executing: 100%|██████████████████████████████| 44/44 [00:13<00:00,  4.33cell/s]


You can again make a visible copy of the hidden notebooks folder just like above (remember to delete it afterwards) and view the notebooks.
You can change some of the notebook parameters and rerun the workflow to see how it affects the results.
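
The `nbdev_build_lib: not found` line in the output above shows that `os.system` carries on even when the command is missing. A minimal sketch of a stricter alternative, assuming you would rather be told explicitly, could use `shutil.which` and `subprocess.run`; the helper name `run_if_available` is hypothetical.

```python
import shutil
import subprocess
import sys


def run_if_available(command):
    """Run `command` (a list of strings); return False if the program is not installed."""
    if shutil.which(command[0]) is None:
        print(f"skipping: {command[0]} not found on PATH", file=sys.stderr)
        return False
    # check=True raises CalledProcessError if the command exits with an error
    subprocess.run(command, check=True)
    return True


# In the workflow script this could replace os.system('nbdev_build_lib'):
# run_if_available(["nbdev_build_lib"])
print(run_if_available([sys.executable, "-c", "pass"]))
```

Whether a missing build step should only be reported or should stop the workflow is a design choice; this sketch reports it and continues.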

You see that a static workflow definition is quite simple. In the script above,
we did not define any inputs, outputs or relations between the different steps.
It's good to keep things that way, unless there is a reason not to.
It might be that we have multiple, changing data sources, a complex workflow structure,
a need for parallelization or other issues that make it difficult to hard-code
the steps required in the workflow. Then, you might need a dynamic workflow.

## Dynamic executable workflow with Snakemake

Snakemake is a tool that automatically determines which steps to run based on inputs and outputs.
It's like GNU Make, but for Python: easy to read and write, but powerful.
Unfortunately, it is impossible to cover all the features of the tool here, but see its documentation and online discussions for ideas.
Here we only cover a tiny portion of the possibilities of Snakemake, but it can do very complex things.

Snakemake executes the workflow as a rule-based directed acyclic graph (DAG).
Each workflow step is determined by a rule, which consists of inputs, outputs and execution of code.
The inputs and outputs are files (data, config, source code, notebooks, images, tables etc.).
The code executed can either be shell commands or Python, written directly into the Snakefile.

Let's consider a workflow where we have two parallel rules 1a and 1b, and one subsequent rule 2.
In addition, we have the rule all, which determines the whole workflow.
Rules 1a and 1b only depend on their inputs. Rule 2 depends on the outputs of both rules 1a and 1b.
Rule 2 also has an additional input independent of the other rules.
We can visualize the workflow as follows:

![Workflow visualization example](./visuals/snakemake_illustration_simple_workflow.png)

Now, if we turned it into a Snakefile script, it would look something like this:

    rule all: # used to determine the whole workflow
        input:
            output_2
            
    rule 1a: # parallel to rule 1b
        input:
            input_1a
        output:
            output_1a
        run: # run python commands
            # Python script to run rule 1a
            
    rule 1b: # parallel to rule 1a
        input:
            input_1b
        output:
            output_1b
        shell: # you can also run shell commands
            # shell script to run rule 1b

    rule 2: # subsequent to rules 1a and 1b
        input:
            output_1a, # depends on rule 1a
            output_1b, # depends on rule 1b
            input_2 # independent input
        output:
            output_2
        run:
            # Python script to run rule2

Snakemake can be either used to run a single rule, or the complete workflow based on rule all.
Based on the inputs and outputs, snakemake will determine which other rules will then need to be executed.

In our example, the notebooks form the nodes of the DAG. Based on the changes in input and output files since the last execution,
Snakemake determines which steps need to be run. If no changes are observed, Snakemake does not do anything.

For example, if we change the input of rule 1b, rules 1b and 2 are executed. If we make a change to the independent input of rule 2, only rule 2 is executed.
If you want to rerun all steps without changing anything, you can just touch the inputs of the independent rules: `touch input_1a input_1b`.
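
The rerun logic described above can be imitated with a few lines of Python. This is a simplified sketch of the idea, not Snakemake's actual algorithm: the rule table and the `rules_to_rerun` helper are hypothetical, and "changed" files are given by hand instead of being detected from the file system.

```python
# Rules from the illustration above: name -> (inputs, outputs)
RULES = {
    "1a": ({"input_1a"}, {"output_1a"}),
    "1b": ({"input_1b"}, {"output_1b"}),
    "2": ({"output_1a", "output_1b", "input_2"}, {"output_2"}),
}


def rules_to_rerun(changed_files, rules=RULES):
    """Return the rules whose inputs changed, directly or through upstream outputs."""
    stale = set(changed_files)
    rerun = set()
    progress = True
    while progress:
        progress = False
        for name, (inputs, outputs) in rules.items():
            if name not in rerun and inputs & stale:
                rerun.add(name)
                stale |= outputs  # outputs of a rerun rule count as changed
                progress = True
    return rerun


print(sorted(rules_to_rerun({"input_1b"})))  # ['1b', '2']
print(sorted(rules_to_rerun({"input_2"})))  # ['2']
```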

The following script will be written into a Snakefile, that you can run to execute the workflow.



In [None]:
%%writefile Snakefile
# import relevant libraries
import papermill as pm
import os

# determine global variables, wildcards etc.

# all: final output of the workflow
rule all:
    input:
        'results/LogisticRegressionClassifier.pkl' # trained model
        
# data
rule data:
    input:
        '00_data.ipynb' # every notebook is its own input
        # if we had input files, they would be listed here, separated by commas
    output: # Snakemake checks that after running these files are created / updated
        'data/preprocessed_data/dataset_clean_switzerland_cleveland.csv',  # clean dataset
        'data/preprocessed_data/dataset_toy_switzerland_cleveland.csv', # toy dataset
        'ml_project_template/data.py', # plot functions
        'results/notebooks/_00_data.ipynb' # copy of the executed notebook
    run:
        # run notebook with papermill
        _ = pm.execute_notebook('00_data.ipynb', # input
                                'results/notebooks/_00_data.ipynb', # output
                               parameters = {'seed':0}) # params (optional)
        os.system('nbdev_build_lib --fname 00_data.ipynb') # build data.py

rule model:
    input:
        '01_model.ipynb',
        'data/preprocessed_data/dataset_toy_switzerland_cleveland.csv',
    output:
        'ml_project_template/model.py', # model class
        'results/notebooks/_01_model.ipynb'
    run:
        _ = pm.execute_notebook('01_model.ipynb', # we could also use '{input[0]}'
                                'results/notebooks/_01_model.ipynb') # we could also use '{output[1]}'
        os.system('nbdev_build_lib --fname 01_model.ipynb')
        
rule loss:
    input:
        '02_loss.ipynb',
        'data/preprocessed_data/dataset_clean_switzerland_cleveland.csv',
        'ml_project_template/data.py',
        'ml_project_template/model.py'
    output:
        'results/LogisticRegressionClassifier.pkl', # trained model
        'results/notebooks/_02_loss.ipynb'
    run:
        _ = pm.execute_notebook('02_loss.ipynb',
                               'results/notebooks/_02_loss.ipynb')
        os.system('nbdev_build_lib --fname 02_loss.ipynb')


Overwriting Snakefile


One thing to notice is that Snakemake should be used from terminal directly, not from a Python script or notebook.

To run Snakemake, call:
    
    snakemake -n # dry-run snakemake to check what would be done
    snakemake --jobs 1 # run workflow

The --jobs parameter (you can also use -j) is the maximum number of CPU cores to use with the pipeline (local/cluster/cloud cores).
Usually 1 is enough with simple pipelines, since our primary workflow step is a notebook, and most Python operations are difficult to parallelize.

## Parameterized dynamic executable workflow with Snakemake

Again, we might want to parameterize the workflow for scalability.

For simple parameterization, use rule parameters (the `params` directive):
    
    rule model:
        input:
        output:
        params: seed = 0
        run:
            _ = pm.execute_notebook(...,
                                    ...,
                               parameters = {'seed': params.seed})

For complex parameterization, including the parameterization of inputs and outputs,
consider using a config file. Snakemake can automatically load a JSON or YAML file into a Python dictionary:

    ## in config.yml:
    rules:
        data:
            ...
        model:
            input:
                - 01_model.ipynb
                - data/preprocessed_data/dataset_toy_switzerland_cleveland.csv
            parameters:
                seed: 0
            ...
        ...
    
    
    ## in Snakefile:
    
    # automatically load json or yaml file to python dictionary 'config'
    configfile: "config.yml"
    
    rule model:
        input:
            config['rules']['model']['input']
        output:
            ...
        run:
            _ = pm.execute_notebook(...,
                                    ...,
                               parameters = {'seed': config['rules']['model']['parameters']['seed']})
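
Deeply nested lookups such as the one above fail with a `KeyError` as soon as a rule has no entry in the config file. The sketch below shows one possible way to read settings with a default value. It uses a plain dictionary shaped like the `config.yml` fragment above, and the helper name `rule_setting` is hypothetical.

```python
# The config.yml fragment above, written as the dictionary it would be loaded into.
config = {
    "rules": {
        "model": {
            "input": [
                "01_model.ipynb",
                "data/preprocessed_data/dataset_toy_switzerland_cleveland.csv",
            ],
            "parameters": {"seed": 0},
        }
    }
}


def rule_setting(config, rule, key, default=None):
    """Look up config['rules'][rule][key], falling back to `default` if anything is missing."""
    return config.get("rules", {}).get(rule, {}).get(key, default)


print(rule_setting(config, "model", "parameters"))
print(rule_setting(config, "loss", "parameters", default={}))
```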


For more information, see [Snakemake documentation](https://snakemake.readthedocs.io/).

## Alternative: Use this notebook to create an API

If you just want to create a conventional Python application, you can use this notebook to create `main.py` so that you can run your module as a Python application.
In that case, change the name of this notebook and the `default_exp` location to `main`. This requires that you define and export all functions and classes to modules so that they can be used elsewhere. This is a bit messy and diminishes many of the benefits of notebook development, because instead of just running the notebooks you already created, you have to redefine all the required steps. However, sometimes this might be what you want to do, so we wanted to mention it here.

Pseudo example of how to define `main.py` in a notebook:

In [None]:
%%script False
## remove this line and the line above if you want to run and export the code (note that it needs editing to work)
### export main
# remove the two extra # in the line above

import numpy as np
import sys
import data, model, loss

def get_input():
    """
    get input from user
    """
    pass

def main(filename):
    """
    run main loop
    """
    # load data
    X, y = data.load(filename)
    # initialize model, fit, optimize
    m = model.LogisticRegressionClassifier(X, y).fit().optimize()
    # main loop
    input_data = get_input()
    while input_data:
        if input_data[0] == 'predict':
            # predict on input
            print(m.predict(input_data[1]))
        else:
            # do other things
            pass
        # ask for the next input; an empty input ends the loop
        input_data = get_input()
        
if __name__ == "__main__":
    """
    run with: python ml_project_template filename
    """
    np.random.seed(0)
    filename = sys.argv[1]
    main(filename)

Couldn't find program: 'False'


In [None]:
%%script False
## remove this line and the line above if you want to run the code (note that it needs editing to work)
## test the main loop in this notebook and interact with the application
main()

Couldn't find program: 'False'
