In [None]:
from IPython.display import SVG

# Starting a Renku project

### Pre-Setup

Make sure your git environment is correctly configured with a username and email.

In [None]:
!git config --global user.name
!git config --global user.email

In [None]:
# If the above does not print anything, set a username and email:
#!git config --global --add user.name "John Doe"
#!git config --global --add user.email "john.doe@example.com"

## Outline

<table class="table table-striped" style="font-size: 18px; margin: 10px;">
    <thead>
        <tr>
            <td>Renku</td>
            <td>Pandas</td>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td style="font-weight: bold">Create repository</td>
            <td></td>
        </tr>
        <tr>
            <td>Declare environment</td>
            <td></td>
        </tr>
        <tr>
            <td>Import data</td>
            <td></td>
        </tr>
        <tr>
            <td></td>
            <td>Inspect and preprocess data</td>
        </tr>
        <tr>
            <td></td>
            <td>Verify preprocessing</td>
        </tr>        
     </tbody>
</table>

# Creating a repository

The first thing to do when starting a project is to initialize a repository. A Renku repository is just a git repository with a little bit of extra structure. 

In [None]:
!renku init renku-tutorial-flights

Let us take a look at what's inside the Renku repository.

In [None]:
%ls -l renku-tutorial-flights

Renku creates a git repository and generates two files: 
- `Dockerfile` 
- `requirements.txt`

We will take a look at the Dockerfile when we share our project. The `requirements.txt` file is empty and serves as a placeholder. In a minute, we will fill it out.

## Housekeeping

For the rest of the tutorial, we will work in the renku repository.

In [None]:
%cd renku-tutorial-flights

We also set up some git configuration for this tutorial.

In [None]:
# Workaround for https://github.com/SwissDataScienceCenter/renku-python/issues/579 to diff text in lfs
!git config diff.lfs.textconv cat

## Declaring the environment

<table class="table table-striped" style="font-size: 18px; margin: 10px;">
    <thead>
        <tr>
            <td>Renku</td>
            <td>Pandas</td>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Create repository &#10004;</td>
            <td></td>
        </tr>
        <tr>
            <td style="font-weight: bold">Declare environment</td>
            <td></td>
        </tr>
        <tr>
            <td>Import data</td>
            <td></td>
        </tr>
        <tr>
            <td></td>
            <td>Inspect and preprocess data</td>
        </tr>
        <tr>
            <td></td>
            <td>Verify preprocessing</td>
        </tr>        
     </tbody>
</table>

To make the project reproducible, we need to declare the environment it runs in. We will be working with pandas, numpy, scipy, matplotlib, and seaborn, so let us create a requirements.txt file that makes this explicit.
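
As a rough illustration of what such a file contains, the sketch below renders a requirements file with one package per line. The package names come from the paragraph above; leaving them unpinned is an assumption made here, and the prepared template file may well pin specific versions. The helper name `render_requirements` is hypothetical.

```python
# Package names are taken from the text; the absence of version pins is an
# assumption of this sketch, not a statement about the template file.
PACKAGES = ["pandas", "numpy", "scipy", "matplotlib", "seaborn"]


def render_requirements(packages):
    """Return requirements.txt content with one package name per line."""
    return "".join(f"{name}\n" for name in packages)


requirements_text = render_requirements(PACKAGES)
print(requirements_text, end="")

# To write the file from inside the repository, one could run:
# from pathlib import Path
# Path("requirements.txt").write_text(requirements_text)
```

Pinning versions (for example `pandas==<version>`) would make the environment more reproducible, at the cost of having to update the pins by hand.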

In [None]:
# Normally, you would write a requirements file, but we have one ready here
%cp ../templates/requirements.txt ./

As we work, we want to track the process. Some of this information is kept in git, for example the fact that we have filled out the requirements.txt file.

In [None]:
!git status

Let us tell git about the requirements.txt file.

In [None]:
!git add requirements.txt
!git commit -m"Declare the python environment for the project."

## Importing data


<table class="table table-striped" style="font-size: 18px; margin: 10px;">
    <thead>
        <tr>
            <td>Renku</td>
            <td>Pandas</td>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Create repository &#10004;</td>
            <td></td>
        </tr>
        <tr>
            <td>Declare environment &#10004;</td>
            <td></td>
        </tr>
        <tr>
            <td style="font-weight: bold">Import data</td>
            <td></td>
        </tr>
        <tr>
            <td></td>
            <td>Inspect and preprocess data</td>
        </tr>
        <tr>
            <td></td>
            <td>Verify preprocessing</td>
        </tr>        
     </tbody>
</table>

Renku can import data in several ways: from a data repository such as [Zenodo](https://zenodo.org), from another Renku or git repository, from a file on the file system, or from a URL. We will use the last of these options.

### Create a dataset

First, we will create a dataset which will group together the files we want to work with.

In [None]:
!renku dataset create flights

### Add data

And we will get some data and add it to the dataset.

In [None]:
!renku dataset add flights https://renkulab.io/gitlab/cramakri/renku-tutorial-flights-data-raw/raw/master/data/v1/2019-01-flights.csv.zip

This copies the data into the folder for the flights dataset.

### Inspecting a dataset

Let's take a look at the dataset.

In [None]:
!renku dataset ls-files flights

# Data science: Inspect and preprocess data

<table class="table table-striped" style="font-size: 18px; margin: 10px;">
    <thead>
        <tr>
            <td>Renku</td>
            <td>Pandas</td>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Create repository &#10004;</td>
            <td></td>
        </tr>
        <tr>
            <td>Declare environment &#10004;</td>
            <td></td>
        </tr>
        <tr>
            <td>Import data &#10004;</td>
            <td></td>
        </tr>
        <tr>
            <td></td>
            <td style="font-weight: bold">Inspect and preprocess data</td>
        </tr>
        <tr>
            <td></td>
            <td>Verify preprocessing</td>
        </tr>        
     </tbody>
</table>

There are notebooks prepared in the templates folder. One of these notebooks will get us started with reading in and preprocessing the data.

In [None]:
%mkdir notebooks
%cp ../templates/01-Preprocess-00.ipynb ./notebooks/01-Preprocess.ipynb

## Inspect data

In the data we have **flight date**, **scheduled arrival time** and **actual arrival time**. To compute the delay, we need to take the difference between the actual arrival time and the scheduled arrival time. To complicate matters, the arrival times are not stored as times, but as integers in the format 'HHMM'.
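
To make the 'HHMM' conversion concrete, here is a minimal sketch in plain Python. The function names are hypothetical, and the handling of arrivals that cross midnight (assuming the true delay lies within 12 hours either way) is an assumption of this sketch; the prepared notebook may treat these cases differently.

```python
MINUTES_PER_DAY = 24 * 60
HALF_DAY = MINUTES_PER_DAY // 2


def hhmm_to_minutes(hhmm):
    """Convert an integer such as 1435 (14:35) to minutes after midnight."""
    hours, minutes = divmod(int(hhmm), 100)
    return hours * 60 + minutes


def arrival_delay(scheduled_hhmm, actual_hhmm):
    """Return the delay in minutes; negative values mean an early arrival.

    Assumption: the true delay lies within +/- 12 hours, so a larger raw
    difference is treated as an arrival on the other side of midnight.
    """
    delay = hhmm_to_minutes(actual_hhmm) - hhmm_to_minutes(scheduled_hhmm)
    if delay < -HALF_DAY:
        delay += MINUTES_PER_DAY
    elif delay > HALF_DAY:
        delay -= MINUTES_PER_DAY
    return delay


print(arrival_delay(1430, 1445))  # scheduled 14:30, arrived 14:45
print(arrival_delay(2350, 15))    # scheduled 23:50, arrived 00:15
```

Subtracting the raw integers directly would be wrong: 1500 minus 1455 gives 45, although the two times are only 5 minutes apart.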

### Run through the notebook
Open [01-Preprocess.ipynb](renku-tutorial-flights/notebooks/01-Preprocess.ipynb) and run through it. When done, return to this notebook to continue.

## Preprocess data

The data looks good. We want to save the output as a file. We could just save the file in the notebook, but then we have not recorded what input and processing were used to produce an output.

Instead, we can use the tool **papermill** which runs Jupyter notebooks in a reproducible way. To do this, we need to convert the notebook into one that is papermill compatible and then run it with papermill.

### Modify the notebook to use papermill

Converting the notebook to be papermill compatible is done by creating a cell that initializes all the values we want to use as parameters and converting the code to use these parameters. The cell needs to be tagged as a parameters cell. This is done by editing the cell metadata and adding the following:
```
{
    "tags": [
        "parameters"
    ]
}
```
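
The tag can also be added programmatically, since a notebook file is plain JSON. The sketch below builds a minimal notebook in the nbformat v4 layout and tags its first cell. The helper name `tag_parameters_cell` is hypothetical, and the default values shown in the parameters cell are assumptions borrowed from the paths used in this tutorial, not the contents of the template notebook.

```python
import json

# A minimal notebook in the nbformat v4 JSON layout, built for illustration.
# The parameter defaults below are assumed values, not the template's.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": [
                'input_path = "data/flights/2019-01-flights.csv.zip"\n',
                'output_path = "data/output/2019-01-flights-preprocessed.csv"',
            ],
        }
    ],
}


def tag_parameters_cell(nb, cell_index=0):
    """Add the "parameters" tag to one cell's metadata (safe to repeat)."""
    tags = nb["cells"][cell_index]["metadata"].setdefault("tags", [])
    if "parameters" not in tags:
        tags.append("parameters")
    return nb


tag_parameters_cell(notebook)
print(json.dumps(notebook["cells"][0]["metadata"], indent=4))
```

For a real notebook, the same function could be applied to the dictionary returned by `json.load` and the result written back with `json.dump`.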

In [None]:
# Update the notebook to a version with parameters
%cp ../templates/01-Preprocess-01-papermill.ipynb ./notebooks/01-Preprocess.ipynb 

### Resolve the dirty repository state

Renku uses information from git to determine the output of a program. For this to work, the working directory needs to be clean (without modifications).
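
If you want to check for a clean working directory from code rather than by reading the `git status` output, one possible approach is to look at `git status --porcelain`, which prints one line per change and nothing at all when the tree is clean. The function names below are hypothetical, and this is only a sketch of the check, not how Renku itself performs it.

```python
import subprocess


def is_clean(porcelain_output):
    """Return True when `git status --porcelain` printed nothing."""
    return porcelain_output.strip() == ""


def working_tree_is_clean(repo_dir="."):
    """Run git in repo_dir and report whether there are no changes."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return is_clean(result.stdout)


# Example with captured output: an untracked notebooks folder is a change.
print(is_clean("?? notebooks/\n"))
print(is_clean(""))
```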

In [None]:
!git status

We added a notebook. Let us put it into git and make a commit. *Remember to make a useful commit message!*

In [None]:
!git add notebooks
!git commit -m"Initial data inspection and processing"

In [None]:
!git status

Now we can run the notebook with papermill.

## Reproducible notebook run

In [None]:
%%bash
mkdir -p data/output
renku run papermill \
  -p input_path data/flights/2019-01-flights.csv.zip \
  -p output_path data/output/2019-01-flights-preprocessed.csv \
  notebooks/01-Preprocess.ipynb \
  notebooks/01-Preprocess.ran.ipynb
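
The command above follows a regular pattern: each parameter becomes a `-p name value` triple, followed by the input notebook and the executed output notebook. As a sketch of that pattern, the hypothetical helper below assembles the same argument list from a dictionary; it only builds the command and does not run it.

```python
import shlex


def papermill_command(notebook, output_notebook, parameters):
    """Build the `renku run papermill` argument list for one notebook run."""
    command = ["renku", "run", "papermill"]
    for name, value in parameters.items():
        command += ["-p", name, str(value)]
    command += [notebook, output_notebook]
    return command


command = papermill_command(
    "notebooks/01-Preprocess.ipynb",
    "notebooks/01-Preprocess.ran.ipynb",
    {
        "input_path": "data/flights/2019-01-flights.csv.zip",
        "output_path": "data/output/2019-01-flights-preprocessed.csv",
    },
)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` from inside the repository, assuming `renku` and `papermill` are installed.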

Let's take a look at how things look from the renku perspective.

In [None]:
!git status

In [None]:
!renku status

In [None]:
graph = !renku log --format dot | dot -Tsvg
SVG("\n".join(graph))

# Verify preprocessing

Let us examine the preprocessed data to make sure we interpreted the original data correctly.

To do this, we use another notebook. This notebook takes the preprocessed data and visualizes it so we can see if it conforms to our expectations regarding how it should look.
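
The output file of this step is named `2019-01-flights-delay-fivenums.csv`, which hints at a five-number summary of the delays (minimum, lower quartile, median, upper quartile, maximum). That reading is an assumption, as are the sample values and the quartile method in the sketch below; the inspection notebook may compute its summary differently.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    # The "inclusive" method treats the data as the whole population;
    # choosing it here is an assumption of this sketch.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])


# Hypothetical arrival delays in minutes, invented for this example.
delays = [-5, 0, 3, 10, 45]
print(five_number_summary(delays))
```

A summary like this makes implausible values easy to spot, for example a minimum of several hundred minutes early.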

In [None]:
%cp ../templates/02-Inspection-00.ipynb ./notebooks/02-Inspection.ipynb 

## Inspect preprocessed data

Run through the [02-Inspection.ipynb](./renku-tutorial-flights/notebooks/02-Inspection.ipynb) notebook, and then come back here.

### Modify the notebook to use papermill (again)

The 02-Inspection.ipynb notebook was already written with papermill in mind: its one parameter is already declared in its own cell. To make the notebook papermill compatible, all that needs to be done is to tag that cell as a parameters cell. This is done by editing the cell metadata and adding the same `"tags": ["parameters"]` entry used for 01-Preprocess.ipynb.

## **Exercise 1**

Make the 02-Inspection.ipynb work with papermill.

In [None]:
# Ex. 1 Solution
# Update the notebook to a version with parameters
# %cp ../templates/02-Inspection-01-papermill.ipynb ./notebooks/02-Inspection.ipynb 

### Resolve the dirty repository state

Again, we need to ensure the working directory is clean.

In [None]:
!git status

We added a notebook. Let us put it into git and make a commit.

In [None]:
!git add notebooks/02-Inspection.ipynb
!git commit -m"Inspecting the results of preprocessing."

Now we can run the notebook with papermill.

In [None]:
%%bash
renku run papermill \
  -p input_path data/output/2019-01-flights-preprocessed.csv \
  -p output_path data/output/2019-01-flights-delay-fivenums.csv \
  notebooks/02-Inspection.ipynb \
  notebooks/02-Inspection.ran.ipynb

<div style="color: #004085; background-color: #cce5ff; border-color: #b8daff; padding: .75rem 1.25rem; margin-bottom: 1rem; border: 1px solid transparent; border-radius: .25rem; font-size: larger;">
Now we have a workflow in Renku!
</div>



If you have `dot` (part of Graphviz) available, you can render the graph as an image.

In [None]:
# graph = !renku log --format dot | dot -Tsvg -Gsize=12,10
# SVG("\n".join(graph))

If we want to see what we have done so far, we can just look at the git log.

In [None]:
!git log --graph --oneline