# Reproducible Data Science with Renku

## notebooks/03-renku/index.ipynb

You should be at this location in the repository.

# Renku

Renku is software for doing data science that is directly and conceptually reproducible.

It has two parts:

* renku CLI (think `git`)
* Renkulab server (think `GitLab` or `GitHub`)

Renku is a tool for reproducible data science that we are developing at the [Swiss Data Science Center](http://datascience.ch/). It is quite new (about 1.5 years old as of July 2019) and under active development, with many new features underway.

Renku is made up of two parts: the renku command-line interface, and the Renkulab server. The distinction is similar to git vs. GitLab. `git` is a set of command-line tools for using version control on a project. GitLab is a server application for managing multiple projects and giving others access to them.

Similarly, `renku` is a set of command-line tools for working reproducibly; Renkulab is a server for sharing and collaborating on projects, and it includes a zero-install environment for running code, including, but not limited to, notebooks.

Just as with GitHub and git, projects can be started on the server (e.g., [renkulab.io](https://renkulab.io)), or locally, on your laptop or desktop computer. And it is easy to transition a project from one location to the other.

In this tutorial, we will start our project on our laptops and, in the end, move it to Renkulab, where we can share and collaborate with others.

# Renku's building blocks

<table class="table table-condensed" style="font-size: 16px; margin: 10px;">
    <thead>
        <tr>
            <th>Tool</th>
            <th>Environment</th>
            <th>Code</th>
            <th>Data</th>
            <th>Workflow</th>
            <th>Provenance</th>
        </tr>
    </thead>
    <tbody>
        <tr style="font-size:24px;">
            <th><a href="https://renkulab.io">Renku</a></th>
            <td>Docker</td>
            <td>git</td>
            <td>git-lfs</td>
            <td>CWL</td>
            <td>PROV-O/RDF</td>
        </tr>
     </tbody>
</table>



Renku combines many tools that you may be familiar with and packages them in a unified way. Renku is a sort of "syntactic sugar" for the building blocks: users can peek under the covers and work directly with the underlying tools, e.g., git, if that is convenient.

# Task

Working with data from [US Dept. of Transportation, Bureau of Transportation Statistics](https://www.transtats.bts.gov), we will answer the following question:

- How many flights were there to Austin, TX in Jan 2019?

The tools we will use for the task are a bit oversized for such a simple question, but this gives us an opportunity to look at reproducibility in an understandable and manageable context.
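To make the question concrete, here is a minimal pandas sketch of the kind of count we are after. It is not the tutorial's actual analysis: the column names `FL_DATE` and `DEST`, the airport code `AUS`, and the helper `count_flights_to` are assumptions made for illustration, and the rows are made up rather than taken from the BTS data.

```python
import pandas as pd


def count_flights_to(flights, dest, year, month):
    """Count rows with destination `dest` whose flight date falls in the given month."""
    # Column names FL_DATE and DEST are assumptions; adjust to the real file.
    dates = pd.to_datetime(flights["FL_DATE"])
    mask = (
        (flights["DEST"] == dest)
        & (dates.dt.year == year)
        & (dates.dt.month == month)
    )
    return int(mask.sum())


# Made-up rows for illustration only; these are not real BTS records.
toy_flights = pd.DataFrame(
    {
        "FL_DATE": ["2019-01-03", "2019-01-15", "2019-02-01", "2019-01-20"],
        "ORIGIN": ["DFW", "SFO", "DFW", "AUS"],
        "DEST": ["AUS", "AUS", "AUS", "DFW"],
    }
)

print(count_flights_to(toy_flights, "AUS", 2019, 1))
```

With the real data, the same function would be applied to a DataFrame loaded from the downloaded file, provided the columns match the assumed names.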

# Approach

In the hands-on, we will be doing our data science using Jupyter Notebooks. Notebooks have their [detractors](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit#slide=id.g362da58057_0_1), and they make good points, but their popularity is also undeniable.

Renku does not specifically target notebooks &mdash; it can work with any kind of program &mdash; but it is possible to use renku in combination with notebooks.

## Ten Simple Rules

[Ten Simple Rules for Reproducible Research in Jupyter Notebooks](https://arxiv.org/abs/1810.08055)

<img src="./tutorial-images/ten-simple-rules-fig-1.png" alt="Ten Simple Rules Fig. 1" width="900px" />

[Ten Simple Rules for Reproducible Research in Jupyter Notebooks](https://arxiv.org/abs/1810.08055) provides a good set of best practices for working with notebooks. We adapt their suggestions to leverage the extra support provided by Renku.

Their advice is essentially the same as what we have been discussing, but they provide some tips for handling problems specific to notebooks. 

Two of these problems are: (1) cells can be executed in any order, which complicates reproducibility; and (2) it is difficult to provide parameters to notebooks, which makes reuse hard.

The authors suggest using [Papermill](https://papermill.readthedocs.io/en/latest/), which solves both of these problems. Using papermill, it is possible to parameterize notebooks, and it is possible to execute them in a reproducible way.
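As a rough sketch of how this looks from Python, the snippet below wraps `papermill.execute_notebook`, which runs the cells of the input notebook from top to bottom in a fresh kernel and writes the executed copy to a new file. The notebook path, the parameter names, the `runs` directory, and both helper functions are hypothetical; papermill is assumed to be installed, and the input notebook is assumed to have a cell tagged `parameters`.

```python
from pathlib import Path


def output_notebook_path(notebook, parameters, out_dir="runs"):
    """Build an output file name that records the parameter values used."""
    stem = Path(notebook).stem
    suffix = "-".join(f"{key}={parameters[key]}" for key in sorted(parameters))
    name = f"{stem}-{suffix}.ipynb" if suffix else f"{stem}.ipynb"
    return Path(out_dir) / name


def run_notebook(notebook, parameters, out_dir="runs"):
    """Execute `notebook` with `parameters` and return the output notebook path."""
    import papermill as pm  # third-party; assumed to be installed

    output = output_notebook_path(notebook, parameters, out_dir)
    output.parent.mkdir(parents=True, exist_ok=True)
    # Papermill injects the parameters and executes the cells in order.
    pm.execute_notebook(str(notebook), str(output), parameters=parameters)
    return output


# Hypothetical usage; the notebook name and parameters are placeholders:
# run_notebook("notebooks/CountFlights.ipynb", {"dest": "AUS", "month": 1})
```

Writing each run to its own output notebook keeps the input notebook unchanged and leaves a record of which parameters produced which results.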

## Hats

* "Renku" Hat
* "Pandas" Hat

As we work through the tutorial, we will be alternating between two different hats: our "pandas" hat and our "renku" hat. When we have our pandas hat on, we will be working within the widely known pandas ecosystem. In terms of data science, the real work happens here. But we are not going to dedicate much of our attention to this part, and it is possible to work through the tutorial with little to no pandas knowledge. When we have our renku hat on, we will be using renku to keep that work reproducible.

# Cast of Characters

<table class="table table-striped" style="font-size: 18px; margin: 10px;">
    <tbody>
        <tr>
            <th width="20%"><code>!</code></th>
            <td>IPython syntax for executing a shell command</td>
        </tr>
        <tr>
            <th width="20%"><code>cp</code></th>
            <td>In practice, we would be writing the code, notebooks, and other files we work with. But, in this tutorial, we are going to write them by copying a pre-written version.</td>
        </tr>
        <tr>
            <th width="20%"><code>git status;</code><br>
                <code>git add;</code><br>
                <code>git commit</code>
            </th>
            <td>As we work, we will be committing to git to keep track of changes we make and the reasons for making them.</td>
        </tr>
        <tr>
            <th width="20%"><a href="https://papermill.readthedocs.io/en/latest/">papermill</a></th>
            <td>Tool for parameterizing and running notebooks in a reproducible way. It takes a notebook and its parameters as input, and produces a new notebook as output. We will use it together with <code>renku run</code>.</td>
        </tr>
        <tr>
            <th width="10%"><code>renku</code></th>
            <td>Tools for reproducible data science.</td>
        </tr>      
     </tbody>
</table>
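
For readers who have not seen the `!` syntax before: in a notebook cell, a line such as `!git status` hands the rest of the line to the system shell. Outside of IPython, a roughly corresponding call can be sketched with the standard library, as below. The helper name `run_shell` is hypothetical, and the example assumes a POSIX-like shell where `echo` is available.

```python
import subprocess


def run_shell(command):
    """Run `command` through the system shell and return what it printed."""
    result = subprocess.run(
        command,
        shell=True,           # let the shell parse the command line, as `!` does
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


print(run_shell("echo hello from the shell"))
```

In the notebooks themselves we will simply use the `!` form, since it is shorter and shows the command's output directly below the cell.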


# Hands-on with Renku (1h 30m)

<table class="table table-striped" style="font-size: 18px; margin: 10px;">
    <tbody>
        <tr>
            <th width="10%">30 min</th>
            <td width="10%"><a href="01-GettingStarted.ipynb">Starting</a></td>
            <td style="text-align: left">Starting a project, importing data</td>
        </tr>
        <tr>
            <th width="10%">30 min</th>
            <td width="10%"><a href="02-1-BuildPipeline.ipynb">Pipeline</a></td>
            <td style="text-align: left">Build a pipeline that performs an analysis</td>
        </tr>
        <tr>
            <th width="10%">30 min</th>
            <td width="10%"><a href="03-Sharing.ipynb">Sharing</a></td>
            <td style="text-align: left">Sharing results and collaborating using <a href="https://renkulab.io">renkulab.io</a>.</td>
        </tr>
     </tbody>
</table>