# Reproducible Data Science with Renku &mdash; Index

## notebooks/03-renku/index.ipynb

# Renku

Renku is software for doing reproducible, collaborative data science. It has two parts:

* renku CLI (think `git`)
* Renkulab server (think `GitHub` or `GitLab`)

The CLI records the commands you execute, building workflows and capturing provenance as you go; the server provides a place for sharing and collaboration. Just as with git and GitHub, a project can be started on the server (e.g., [renkulab.io](https://renkulab.io)) or locally, on your laptop or desktop computer, and it is easy to move a project from one location to the other.

In this tutorial, we will start our projects on our laptops and, in the end, move them to Renkulab, where we can share and collaborate with others.

# Renku's building blocks

<table class="table table-condensed" style="font-size: 16px; margin: 10px;">
    <thead>
        <tr>
            <th>Tool</th>
            <th>Environment</th>
            <th>Code</th>
            <th>Data</th>
            <th>Workflow</th>
            <th>Provenance</th>
        </tr>
    </thead>
    <tbody>
        <tr style="font-size:24px;">
            <th><a href="https://renkulab.io">Renku</a></th>
            <td style="text-align: center">Docker</td>
            <td style="text-align: center">git</td>
            <td style="text-align: center">git-lfs</td>
            <td style="text-align: center">CWL</td>
            <td style="text-align: center">PROV-O/RDF</td>
        </tr>
     </tbody>
</table>



Renku combines many tools that you may be familiar with and packages them in a unified way. Renku is a sort of "syntactic sugar" for the building blocks: users can peek under the covers and work directly with the underlying tools, such as git, whenever that is convenient.

# Hands-on with Renku (1h 30m)

<table class="table table-striped" style="font-size: 18px; margin: 10px;">
    <tbody>
        <tr>
            <th>30 min</th>
            <td><a href="01-Starting.ipynb">Starting</a></td>
            <td style="text-align: left">Starting a project, importing data, building a workflow.</td>
        </tr>
        <tr>
            <th>30 min</th>
            <td><a href="02-Iterating.ipynb">Iterating</a></td>
            <td style="text-align: left">Updating code and data to do the analysis.</td>
        </tr>
        <tr>
            <th>30 min</th>
            <td><a href="03-Sharing.ipynb">Sharing</a></td>
            <td style="text-align: left">Sharing results and collaborating using <a href="https://renkulab.io">renkulab.io</a>.</td>
        </tr>
     </tbody>
</table>

# Task

Working with data from [US Dept. of Transportation, Bureau of Transportation Statistics](https://www.transtats.bts.gov), we will answer the following questions:

- Do flights to Austin, TX generally arrive on time?
- Are flights to Austin from certain locations more often delayed than others?

Statistically, the tasks are very simple: just compute some means. The difficulty lies in deriving the values to average. To do this, we will need to build a pipeline consisting of data preprocessing, analysis and visualization of the preprocessing output, and finally computing the result.
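To make the "just compute some means" step concrete, here is a minimal sketch of the final computation. It assumes the preprocessing has already produced one record per flight; the records, field names (`origin`, `dest`, `arr_delay_min`), and values are invented for illustration and are not BTS data. The tutorial notebooks do this kind of work with pandas; the sketch uses only the standard library to stay dependency-free.

```python
from collections import defaultdict
from statistics import mean

# Illustrative records only (not real BTS data); field names are hypothetical.
flights = [
    {"origin": "DFW", "dest": "AUS", "arr_delay_min": 12.0},
    {"origin": "DFW", "dest": "AUS", "arr_delay_min": -3.0},  # arrived early
    {"origin": "JFK", "dest": "AUS", "arr_delay_min": 45.0},
    {"origin": "JFK", "dest": "AUS", "arr_delay_min": None},  # no delay recorded
    {"origin": "DFW", "dest": "SFO", "arr_delay_min": 5.0},   # not an Austin flight
]


def usable_delays(records, dest="AUS"):
    """Keep only flights to `dest` that have a recorded arrival delay."""
    return [
        r for r in records
        if r["dest"] == dest and r["arr_delay_min"] is not None
    ]


def mean_delay_by_origin(records, dest="AUS"):
    """Mean arrival delay (minutes) for flights to `dest`, per origin."""
    delays = defaultdict(list)
    for r in usable_delays(records, dest):
        delays[r["origin"]].append(r["arr_delay_min"])
    return {origin: mean(values) for origin, values in delays.items()}


# Question 1: overall mean delay for flights to Austin.
overall = mean([r["arr_delay_min"] for r in usable_delays(flights)])

# Question 2: mean delay broken down by origin.
by_origin = mean_delay_by_origin(flights)

print(overall)
print(by_origin)
```

Most of the code is filtering and grouping rather than averaging, which is the point: deciding which records count, and how to handle missing delays, is where the real effort goes.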

**Spoiler!** The exercise will simulate a real-world situation. Certain steps will not work correctly, but we will only realize it later on, meaning that we will need to backtrack and fix errors upstream.

# Approach

In the hands-on, we will be doing our data science using Jupyter Notebooks. Notebooks have their [detractors](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit#slide=id.g362da58057_0_1), and they make good points, but their popularity is also undeniable.

## Ten Simple Rules

[Ten Simple Rules for Reproducible Research in Jupyter Notebooks](https://arxiv.org/abs/1810.08055)
* Tell a story and make it understandable
* Document environment and process
* Structure work in a pipeline 
* Share code and data

[Ten Simple Rules for Reproducible Research in Jupyter Notebooks](https://arxiv.org/abs/1810.08055) provides a good set of best practices for working with notebooks. We adapt their suggestions to leverage the extra support provided by Renku.

Their advice is essentially the same as what we have been discussing, but they provide some tips for handling problems specific to notebooks. 

Two of these problems are that (1) cells can be executed in any order and (2) it is difficult to pass parameters to a notebook. The first complicates reproducibility; the second makes reuse hard.

The authors suggest using [Papermill](https://papermill.readthedocs.io/en/latest/), which addresses both problems: it lets you parameterize notebooks, and it executes them from top to bottom in a reproducible way.
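As a rough sketch of what this looks like from Python, the helper below wraps papermill's `execute_notebook` function. The helper name `run_parameterized`, the notebook paths, and the parameter names are hypothetical and chosen only for illustration. Papermill injects the parameters after the cell tagged `parameters` in the input notebook, so the sketch assumes such a cell exists.

```python
def run_parameterized(input_nb, output_nb, **parameters):
    """Execute `input_nb` from the first cell to the last, writing the
    executed copy to `output_nb` with `parameters` injected."""
    # Imported lazily so the sketch can be read and loaded without
    # papermill installed; actually running a notebook requires it.
    import papermill as pm

    return pm.execute_notebook(input_nb, output_nb, parameters=parameters)


# Hypothetical parameters for a preprocessing notebook.
PARAMS = {"destination": "AUS", "input_path": "data/flights.csv"}
```

A call such as `run_parameterized("preprocess.ipynb", "preprocess.ran.ipynb", **PARAMS)` would then run every cell in order and leave the original notebook untouched, so the same notebook can be reused with different parameter values.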

## Hats

As we work through the tutorial, we will alternate between two different hats: our "pandas" data-science hat and our "renku" data-science hat. When we have our pandas hat on, we will be working within the widely known pandas ecosystem. In terms of data science, the real work happens here. However, we are not going to dedicate much attention to this part, and it is possible to work through the tutorial with little to no pandas knowledge.

# Cast of Characters

## General

<table class="table table-striped" style="font-size: 18px; margin: 10px;">
    <tbody>
        <tr>
            <th><code>!</code></th>
            <td>IPython syntax for executing a shell command</td>
        </tr>
        <tr>
            <th><a href="https://papermill.readthedocs.io/en/latest/">papermill</a></th>
            <td>Tool for parameterizing and running notebooks in a reproducible way</td>
        </tr>
     </tbody>
</table>

## Renku

<table class="table table-striped" style="font-size: 18px; margin: 10px;">
    <tbody>
        <tr>
            <th><code>renku dataset</code></th>
            <td>Renku commands for working with datasets</td>
        </tr>        
        <tr>
            <th><code>renku run</code></th>
            <td>Renku wrapper for running commands</td>
        </tr>
     </tbody>
</table>
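The sketch below shows how these commands might fit together in a session. It is an illustration under assumptions, not the tutorial's actual steps: the dataset name, data URL, and notebook paths are placeholders, and the commands are assumed to run inside an already initialized Renku project. By default the helper only prints the commands (a dry run); pass `dry_run=False` to execute them with the `renku` CLI installed.

```python
import shlex
import subprocess


def renku_commands(dataset, data_url, input_nb, output_nb):
    """Assemble the command lines for a small Renku session."""
    return [
        # Create a dataset and add a file to it.
        ["renku", "dataset", "create", dataset],
        ["renku", "dataset", "add", dataset, data_url],
        # Wrap a command with `renku run` so that it is recorded.
        ["renku", "run", "papermill", input_nb, output_nb],
    ]


def execute(commands, dry_run=True):
    """Print each command; run it as well unless `dry_run` is set."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Placeholder names and URL, for illustration only.
commands = renku_commands(
    dataset="flights",
    data_url="https://example.org/flights.csv",
    input_nb="notebooks/preprocess.ipynb",
    output_nb="notebooks/preprocess.ran.ipynb",
)
execute(commands)
```

In a notebook, the same commands would typically be typed directly with the `!` prefix rather than assembled in Python.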
