In [None]:
from IPython.display import SVG

# Building a pipeline

## Outline

We are going to create a pipeline with two steps:

<img alt="Pipeline" src="tutorial-images/pipeline.svg" />

For each step, the process of creating it will be the following:

1. Create notebook
2. Commit changes to git
3. Run notebook reproducibly

By chaining the outputs of one notebook into the inputs of the next, we will create a pipeline.
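To make the idea of chaining concrete, here is a minimal sketch that models the two steps as plain data and checks that each step's output is the next step's input. The `Step` class and the `is_chained` helper are hypothetical names used only for illustration; they are not part of Renku or papermill. The paths are the ones used later in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One pipeline step: a notebook with a single input and a single output."""
    notebook: str
    input_path: str
    output_path: str


def is_chained(steps):
    """Return True when every step's output is the next step's input."""
    return all(a.output_path == b.input_path for a, b in zip(steps, steps[1:]))


steps = [
    Step(
        "notebooks/00-FilterFlights.ipynb",
        "data/flights/2019-01-flights.csv.zip",
        "data/output/2019-01-flights-filtered.csv",
    ),
    Step(
        "notebooks/01-CountFlights.ipynb",
        "data/output/2019-01-flights-filtered.csv",
        "data/output/2019-01-flights-count.txt",
    ),
]

print(is_chained(steps))
```

In the tutorial itself we never write this bookkeeping by hand; it only illustrates what "chaining" means for the two notebooks.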

In [None]:
%cd renku-tutorial-flights

## Create a notebooks directory

We want to follow a standard project structure, so we will put our notebooks in a `notebooks` folder.

In [None]:
%mkdir notebooks

## Filter Flights

### Create a notebook

There are notebooks prepared in the templates folder. We are going to use these to build our pipeline.

In [None]:
%cp ../templates/00-FilterFlights.ipynb ./notebooks/

Step through the [FilterFlights notebook](./renku-tutorial-flights/notebooks/00-FilterFlights.ipynb).

### Commit changes to git

We've done something new that we need to track in version control.

In [None]:
!git status

In [None]:
!git add notebooks/00-FilterFlights.ipynb
!git commit -m"Created notebook to filter flights to AUS, TX."

In [None]:
!git status

**Tip**: `git status` should now report that the working tree is clean. If it does not, you can run `!git clean -f` to remove any untracked files.
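If you prefer to check for a clean working tree programmatically, one possible sketch is shown below. It relies on `git status --porcelain`, which prints nothing when there is nothing to commit. The helper names `is_clean` and `working_tree_is_clean` are made up for this example.

```python
import subprocess


def is_clean(porcelain_output):
    """A working tree is clean when `git status --porcelain` prints nothing."""
    return porcelain_output.strip() == ""


def working_tree_is_clean(repo_dir="."):
    """Run `git status --porcelain` in repo_dir and interpret its output."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return is_clean(result.stdout)
```

Calling `working_tree_is_clean()` from a cell after the commit should then return `True`, assuming the current directory is the project repository.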

### Run reproducibly

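Stepping through the notebook by hand, as we just did, is not guaranteed to be reproducible. Instead, we will use papermill to execute the notebook, wrapped in `renku run` so that Renku records the command together with its inputs and outputs.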

Papermill takes two positional arguments: the notebook to run (`notebooks/00-FilterFlights.ipynb` in this case) and a filename for the executed copy (`notebooks/00-FilterFlights.ran.ipynb`, following the convention `[Notebook Name].ran.ipynb`). Parameters given with `-p` are passed on to the notebook.
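As an illustration of how the command line is put together, the following sketch assembles the same argument list in Python. The `papermill_command` function is a hypothetical helper written for this example; it only builds and prints the command and does not run anything.

```python
import os
import shlex


def papermill_command(notebook, parameters):
    """Build the `renku run papermill` argument list for one notebook.

    The executed copy is named [Notebook Name].ran.ipynb, next to the original.
    """
    root, ext = os.path.splitext(notebook)
    ran_notebook = f"{root}.ran{ext}"
    argv = ["renku", "run", "papermill"]
    for name, value in parameters.items():
        # Each parameter becomes: -p <name> <value>
        argv += ["-p", name, str(value)]
    argv += [notebook, ran_notebook]
    return argv


argv = papermill_command(
    "notebooks/00-FilterFlights.ipynb",
    {
        "input_path": "data/flights/2019-01-flights.csv.zip",
        "output_path": "data/output/2019-01-flights-filtered.csv",
    },
)
print(shlex.join(argv))
```

The printed line should be equivalent to the command in the next cell, written on a single line.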

In [None]:
%%bash
mkdir -p data/output
renku run papermill \
  -p input_path data/flights/2019-01-flights.csv.zip \
  -p output_path data/output/2019-01-flights-filtered.csv \
  notebooks/00-FilterFlights.ipynb \
  notebooks/00-FilterFlights.ran.ipynb

Before we continue, let us take a look at the [FilterFlights.ran notebook](./renku-tutorial-flights/notebooks/00-FilterFlights.ran.ipynb).

## Count Flights

### Create a notebook

Again, you can use the notebook prepared in the templates folder.

In [None]:
%cp ../templates/01-CountFlights.ipynb ./notebooks/

Step through the [CountFlights notebook](./renku-tutorial-flights/notebooks/01-CountFlights.ipynb).

### Commit changes to git

We've done something new that we need to track in version control.

In [None]:
!git status

In [None]:
!git add notebooks/01-CountFlights.ipynb
!git commit -m"Created notebook to count flights to AUS, TX."

In [None]:
!git status

Again, `git status` should report a clean working tree. Remove any lingering files if this is not the case.

### Run reproducibly

Let's run the CountFlights notebook reproducibly as well.

In [None]:
%%bash
mkdir -p data/output
renku run papermill \
  -p input_path data/output/2019-01-flights-filtered.csv  \
  -p output_path data/output/2019-01-flights-count.txt \
  notebooks/01-CountFlights.ipynb \
  notebooks/01-CountFlights.ran.ipynb

<div style="color: #004085; background-color: #cce5ff; border-color: #b8daff; padding: 1.5rem 1.25rem; margin-bottom: 1rem; border: 1px solid transparent; border-radius: .25rem; font-size: xx-large;">
Now we have a pipeline in Renku!
</div>



# Viewing the pipeline

If you have `dot` (part of Graphviz) available, you can visualize the pipeline.
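One way to find out whether `dot` is on your `PATH` is sketched here, using only the standard library. The `dot_available` function name is made up for this example.

```python
import shutil


def dot_available():
    """Return True if a `dot` executable can be found on the PATH."""
    return shutil.which("dot") is not None


if dot_available():
    print("dot found; the graph can be rendered")
else:
    print("dot not found; install Graphviz to render the graph")
```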

In [None]:
graph = !renku log --format dot | dot -Tsvg
SVG("\n".join(graph))

# Viewing the result

We can also look at the final output file, [2019-01-flights-count.txt](renku-tutorial-flights/data/output/2019-01-flights-count.txt).

# Exercises

Solutions are provided in the commented `%load` statements.

Enter the renku-tutorial-diamonds repo to do the exercises.

In [None]:
%cd ../renku-tutorial-diamonds

# Ex 2.0

Create a pipeline for the `renku-tutorial-diamonds` project using the `00-FilterDiamonds.ipynb` and `01-CountDiamonds.ipynb` notebooks from the `templates` directory.

To get started, enter the renku-tutorial-diamonds repo and make folders for the notebooks and data outputs.

If something fails, restore the working directory to a clean state (e.g., using `git clean -f`), fix the problem and try again.

In [None]:
!mkdir -p notebooks
!mkdir -p data/output

In [None]:
# %load ../solutions/ex2-0.fragment

As before, if you have `dot` available, you can view the graph.

In [None]:
graph = !renku log --format dot | dot -Tsvg -Gsize=12,10
SVG("\n".join(graph))

# A note on papermill

The notebooks provided in this tutorial have already been set up to be papermill compatible. To make a notebook parameterizable by papermill, see the [papermill documentation](https://papermill.readthedocs.io/en/latest/usage-parameterize.html).
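In short, papermill looks for a cell tagged `parameters` and injects a new cell with the values passed via `-p` right after it, so the injected values override the defaults. A parameters cell for the FilterFlights notebook might look like the sketch below. The default values are an assumption for illustration; the actual template notebook may define them differently.

```python
# Contents of a notebook cell tagged "parameters".
# These defaults (assumed here) are used when stepping through the notebook by hand;
# papermill overrides them with the values given via -p.
input_path = "data/flights/2019-01-flights.csv.zip"
output_path = "data/output/2019-01-flights-filtered.csv"
```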

# [On to part 2](02-2-UpdatePipeline.ipynb)

