In [None]:
# default_exp workflow
%load_ext lab_black
# nb_black if running in jupyter
%load_ext autoreload
# automatically reload python modules if there are changes in the
%autoreload 2

In [None]:
# hide
from nbdev.showdoc import *

# Workflow

> Define static or dynamic workflow for automatically updating, training and deploying your ML model!



***input:*** Workflow definition parameters

***output:*** python or snakemake script for running the workflow

***description:***

While you are developing your ML application, you might prefer running the notebooks manually again and again.
However, once you have deployed your model into production, this becomes impractical and compromises scalability, modularity and ease of reproducibility.
This happens regardless of what 'production' means to you - it might well be that you are just running the notebooks and viewing the results directly from them.
Whatever you are doing, having a single command to run the whole workflow makes things so much easier.

Workflow automation is also the part of the work where you'll probably notice a lot of bugs and fragile spots in your notebooks.
Probably a lot more than you anticipated, but try not to get frustrated! Debugging is a big and important part of the work.

In this notebook we explain alternatives for automating workflows, either static or dynamic.
By following these examples (and further documentation on [papermill](https://papermill.readthedocs.io/) and [Snakemake](https://snakemake.readthedocs.io/)) you can parameterize your notebooks,
run them automatically in a workflow, and even parameterize and automate the workflow definition.
With this template, you can define complex and versatile workflows that are well documented in a notebook.

We selected these tools for the template because they have stable community support, they are relatively easy to use and they fit our needs.
There are also other tools for workflow management and orchestration that may better suit your needs. Feel free to use them.
For more information, see for example this 
[comparison of workflow tools for Python](https://medium.com/@Minyus86/comparison-of-pipeline-workflow-packages-airflow-luigi-gokart-metaflow-kedro-pipelinex-5daf57c17e7).

## Import relevant modules

In [None]:
import numpy as np
import pandas as pd
import papermill as pm

## Define notebook parameters

In [None]:
# This cell is tagged with 'parameters'

save_notebooks_to = (
    "results/.notebooks/"  # TODO: make example of timestamping notebook runs
)

## NOTE: copies of executed notebooks are saved to hidden folder, because currently nbdev searches
## all subfolders when building docs, and the notebook copies may cause confusing results.
## hidden folders should be ignored.
## See more: https://github.com/fastai/nbdev/issues/357
## TODO: follow updates on nbdev, see if there is a cleaner fix to this!

# notebook names
index = "index.ipynb"
index_params = {}

data = "00_data.ipynb"
data_params = {}

model = "01_model.ipynb"
model_params = {}

loss = "02_loss.ipynb"
loss_params = {}

workflow = "03_workflow.ipynb"

Make direct derivations from the parameters:
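
One derivation you might want here is a timestamped output folder, as hinted by the TODO in the parameters cell. The sketch below is an assumption and not part of the template: the helper name `timestamped_folder` and the folder layout are hypothetical. It only builds the path, so the folder would likely still need to be created before notebooks are saved into it.

```python
from datetime import datetime
from pathlib import PurePosixPath


def timestamped_folder(base, now=None):
    """Return base/<YYYYmmdd_HHMMSS>/ as a string with a trailing slash."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return str(PurePosixPath(base) / stamp) + "/"


# default from the parameters cell above
save_notebooks_to = "results/.notebooks/"

# a fixed datetime is used here only to make the example reproducible
run_folder = timestamped_folder(save_notebooks_to, datetime(2021, 1, 31, 12, 0, 0))
print(run_folder)
```

Keeping the derivation outside the parameters cell means it is computed after papermill has injected any overriding values.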

## Custom magic function to write code cell contents to a file with global variables formatted:

This is needed so that we can create parameterizable non-python scripts from inside a notebook.

In [None]:
from IPython.core.magic import register_line_cell_magic


@register_line_cell_magic
def writefile_format_globals(filename, cell):
    """
    This is a function to write contents of a notebook cell to a file.
    To use it, call '%%writefile_format_globals filename' in the first row of
    the cell the contents of which you want to write into a file.
    The code written in this cell is not run in the notebook.
    This means that you can also write and define non-python scripts and
    execute them from a notebook.

    You can format in global variables by placing them inside curly
    brackets '{variable name}'.

    Note, that your code should not include curly brackets.
    If curly brackets need to be written in the file, you can include
    them through the variable insertion:

    ## CELL 1:
    bracket_open_string = '{'
    bracket_close_string = '}'

    ## CELL 2:
    %%writefile_format_globals myscriptname
    {bracket_open_string}
        # I want this part inside curly brackets!
    {bracket_close_string}

    ## file 'myscriptname' after running the cells 1 and 2:
    {
        # I want this part inside curly brackets!
    }
    """
    with open(filename, "w") as f:
        f.write(cell.format(**globals()))
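
To see what the magic does to the cell text, the sketch below applies the same `str.format` call to a plain string, outside IPython. The template text and the variable values are made up for illustration. It also shows that `str.format` treats doubled brackets as literal brackets, which may be a simpler alternative to the bracket variables described in the docstring.

```python
# Hypothetical stand-ins for the notebook's global variables
variables = {"data": "00_data.ipynb", "save_notebooks_to": "results/.notebooks/"}

# Cell text as the magic function would receive it
cell = 'pm.execute_notebook("{data}", "{save_notebooks_to}{data}")\n'
cell += "params = {{}}  # doubled brackets become single literal brackets\n"

# Same call as in writefile_format_globals, with a plain dict instead of globals()
rendered = cell.format(**variables)
print(rendered)
```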

## How to run parameterized notebooks with papermill

Papermill allows parameterizing and running notebooks from the Python runtime with `papermill.execute_notebook(input, output, parameters)`.
The `input` parameter is the notebook to be run. The `output` parameter is the filepath where a copy of the executed notebook is saved with the results.
This can be the same as the `input`, but you probably want to keep it separate - otherwise your version control may get messy.
In this example executed notebooks are saved under `results/.notebooks`. The `parameters` argument allows you to change settings of the notebooks.

You may have noticed that at the beginning of each notebook there is a cell with the comment `# This cell is tagged with 'parameters'`.
The cell has been given the `parameters` tag. The template notebooks already contain the tag,
but you can see [here](https://papermill.readthedocs.io/en/latest/usage-parameterize.html) how to add it in different notebook editors.
In this cell, variables are assigned. Papermill lists any parameters given to the `execute_notebook` function
in a new cell right below the one tagged with `parameters`. The listed parameters override the default assignments.
This is why you should not do anything but simple assignments in the parameters cell.
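
The effect of the injected cell can be imitated in plain Python, without papermill. The sketch below is only a simulation under the assumption that cells run top to bottom in one namespace. The variable `n_samples` is hypothetical and stands for any value derived inside the parameters cell.

```python
# Simulate three notebook cells executed in order in one namespace.
parameters_cell = "seed = 0\nn_samples = seed + 100  # derived value: avoid this here"
injected_cell = "seed = 1  # what would be injected for parameters={'seed': 1}"
later_cell = "message = f'seed={seed}, n_samples={n_samples}'"

namespace = {}
for source in (parameters_cell, injected_cell, later_cell):
    exec(source, namespace)

# seed was overridden, but n_samples was computed from the old default
print(namespace["message"])
```

The derived value still reflects the default `seed`, which is why derivations belong in a cell after the injected one.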

Let's show an example. Let's run the loss notebook with and without changing the `seed` parameter.
You can then compare the resulting notebooks saved under `results/.notebooks`.

In [None]:
# slow
# run loss notebook with default parameters
_ = pm.execute_notebook(
    "02_loss.ipynb",
    "results/.notebooks/02_loss_default_params.ipynb",
)

Executing:   0%|          | 0/43 [00:00<?, ?cell/s]

In [None]:
# slow
# run loss notebook with the 'seed' parameter changed from 0 to 1
_ = pm.execute_notebook(
    "02_loss.ipynb", "results/.notebooks/02_loss_seed_1.ipynb", parameters={"seed": 1}
)

Executing:   0%|          | 0/44 [00:00<?, ?cell/s]

You can now compare the results. 

Copies of the executed notebooks are saved in a hidden folder, so that they don't confuse nbdev.
Notebooks cannot be viewed from hidden folders, so you have to make a visible copy of the folder.
This may appear a bit impractical, and we hope to find a more convenient solution in the future (without making custom edits to nbdev).
However, in practice this rarely matters: the location where you save the notebooks can be outside the project folder, and it only serves as a backup,
so you probably don't need to view the automatically executed notebooks that often.

How to copy the notebooks to a visible folder in the shell:

    cp -r results/.notebooks/ results/notebooks

You can then open the notebooks as usual in your editor. See how changing the seed changes the results?
Now, before you build the modules and docs with `nbdev_build_lib && nbdev_build_docs`, remember to delete the notebook files from under the results folder:

    rm -r results/notebooks
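
If you prefer to stay in Python, the same copy and cleanup can be sketched with `shutil`. The example below is an illustration only: it works in a temporary directory with a placeholder notebook file so that it does not touch a real project, and `shutil.copytree` expects that the target folder does not exist yet.

```python
import shutil
import tempfile
from pathlib import Path

# Work in a temporary directory so the sketch does not touch a real project
root = Path(tempfile.mkdtemp())
hidden = root / "results" / ".notebooks"
visible = root / "results" / "notebooks"
hidden.mkdir(parents=True)
(hidden / "02_loss_seed_1.ipynb").write_text("placeholder")

# roughly: cp -r results/.notebooks/ results/notebooks
shutil.copytree(hidden, visible)
copied = sorted(p.name for p in visible.iterdir())
print(copied)

# roughly: rm -r results/notebooks
shutil.rmtree(visible)
visible_exists_after_cleanup = visible.exists()
print(visible_exists_after_cleanup)

# remove the temporary directory used by this sketch
shutil.rmtree(root)
```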

## Static executable workflow with papermill

Now we can define a simple static parameterizable workflow. The workflow is a Python script, completely defined in this notebook for consistency of documentation (everything is defined in notebooks). We then export the script into a file, so that it can be executed outside this notebook. We added the script file to gitignore, because it is defined in this notebook and will be recreated every time this notebook is run. We define the script parameterization so that the parameters are hard-coded into the script file when it is generated, based on this notebook. So, to change the setup, you should run this notebook with different parameters. You may make different choices, but consider ease of reproducibility and be sure to document your work well.

Now run the cell below to create the execution script (the code is not run in this notebook):

In [None]:
%%writefile_format_globals static_workflow.py
# execute workflow of the example notebooks
# to run the script, call python static_workflow.py
# this file has been added to .gitignore
# NOTE: use curly brackets only to format in global variables!
# hint: you can include additional parameters with sys.argv

# import relevant libraries
import papermill as pm

# update modules

# optional (uncomment): make backup of index:
# _ = pm.execute_notebook("{index}", "{save_notebooks_to}{index}")

# run data notebook
_ = pm.execute_notebook("{data}", "{save_notebooks_to}{data}")
# remember that you can change the notebook parameters by passing
# a parameter dict to the execute_notebook function:
# parameters=dict(parameter_name=parameter_value, ...)

# run model notebook
_ = pm.execute_notebook("{model}", "{save_notebooks_to}{model}")
# run loss notebook
_ = pm.execute_notebook("{loss}", "{save_notebooks_to}{loss}")

# optional (uncomment): make backup of the workflow notebook:
# import os
# os.system('cp {workflow} {save_notebooks_to}{workflow}')
# (a recursive call with pm could also work,
# but only do it if you know what you are doing)

If you open the file `static_workflow.py`, you will notice that the contents of the curly brackets were replaced with the parameters of this notebook.

Now, you can run the workflow:

In [None]:
# slow
!python static_workflow.py

Executing: 100%|██████████████████████████████| 57/57 [00:21<00:00,  2.86cell/s]
Executing: 100%|██████████████████████████████| 44/44 [00:22<00:00,  2.02cell/s]
Executing: 100%|██████████████████████████████| 43/43 [00:17<00:00,  2.98cell/s]


You can again make a visible copy of the hidden notebooks folder just like above (remember to delete it afterwards) and view the notebooks.
You can change some of the notebook parameters and rerun the workflow to see how it affects the results.
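
The generated script hints that additional parameters can be passed with `sys.argv`. A minimal sketch of that idea follows, assuming arguments of the form `name=value`. The helper `parse_parameters` is hypothetical and not part of the template or of papermill. The code avoids curly brackets on purpose, so that it could also be placed inside a `%%writefile_format_globals` cell.

```python
import ast


def parse_parameters(argv):
    """Turn a list like ['seed=1', 'name=abc'] into a parameters dict."""
    parameters = dict()
    for argument in argv:
        key, _, raw_value = argument.partition("=")
        try:
            # numbers, lists, True/False and quoted strings are converted
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # anything else is kept as a plain string
            value = raw_value
        parameters[key] = value
    return parameters


# In a script this would be parse_parameters(sys.argv[1:])
parsed = parse_parameters(["seed=1", "name=abc"])
print(parsed)
```

The resulting dict could then be given to `pm.execute_notebook` through its `parameters` argument.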

As you can see, a static workflow definition is quite simple. In the script above,
we did not define any inputs, outputs or relations between the different steps.
It's good to keep things that way, unless there is a reason not to.
You might have multiple, changing data sources, a complex workflow structure,
a need for parallelization or other issues that make it difficult to hard-code
the steps required in your workflow. Then, you might need a dynamic workflow.

## Dynamic executable workflow with Snakemake

Snakemake is a tool that will automatically determine which steps to run based on inputs and outputs.
It's like GNU Make, but for Python: easy to read and write, but powerful.
Here we only cover a tiny portion of the possibilities of Snakemake, but it can do very complex things.
One thing to notice is that Snakemake must be the root runtime of Python - you cannot launch it from inside a Python script (at least with the current version).
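
The idea of choosing steps from inputs and outputs can be illustrated with a toy sketch in plain Python. This is not how Snakemake is implemented and it does not use the Snakemake API. The rule table, the file names and the `steps_to_run` helper are all hypothetical. Here a step is scheduled when its output is missing or when one of its inputs is going to be regenerated; file timestamps are ignored to keep the sketch short.

```python
# Hypothetical rules: output file -> (notebook to run, input files)
rules = {
    "data.csv": ("00_data.ipynb", []),
    "model.pkl": ("01_model.ipynb", ["data.csv"]),
    "loss.json": ("02_loss.ipynb", ["model.pkl", "data.csv"]),
}


def steps_to_run(target, existing_files, rules):
    """Return the notebooks needed to produce target, in execution order."""
    plan = []

    def visit(output):
        notebook, inputs = rules[output]
        rebuilt_input = False
        for name in inputs:
            if visit(name):
                rebuilt_input = True
        # run the step if an input changes or the output does not exist yet
        needed = rebuilt_input or output not in existing_files
        if needed and notebook not in plan:
            plan.append(notebook)
        return needed

    visit(target)
    return plan


# Only the data exists: the model and loss notebooks must run
print(steps_to_run("loss.json", {"data.csv"}, rules))
# Everything exists: nothing to do
print(steps_to_run("loss.json", {"data.csv", "model.pkl", "loss.json"}, rules))
```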

The following script will be written into a Snakefile, that you can run to execute the workflow.