# Task

Working with data from [US Dept. of Transportation, Bureau of Transportation Statistics](https://www.transtats.bts.gov), we will answer the following question:

- How many flights were there to Austin, TX in January 2019?

The tools we will use for the task are a bit oversized for such a simple question, but this gives us an opportunity to look at reproducibility in an understandable and manageable context.

# Approach

In the hands-on, we will be doing our data science using Jupyter Notebooks. Notebooks have their [detractors](https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit#slide=id.g362da58057_0_1), and they make good points, but their popularity is also undeniable.

Renku does not specifically target notebooks (it can work with any kind of program), but it can be used in combination with them.

## Ten Simple Rules

[Ten Simple Rules for Reproducible Research in Jupyter Notebooks](https://arxiv.org/abs/1810.08055)

<img src="./tutorial-images/ten-simple-rules-fig-1.png" alt="Ten Simple Rules Fig. 1" width="900px" />

[Ten Simple Rules for Reproducible Research in Jupyter Notebooks](https://arxiv.org/abs/1810.08055) provides a good set of best practices for working with notebooks. We adapt their suggestions to leverage the extra support provided by Renku.

Their advice is essentially the same as what we have been discussing, but they provide some tips for handling problems specific to notebooks. 

Two of these problems are that cells can be executed in any order, which complicates reproducibility, and that it is difficult to provide parameters to notebooks, which makes reuse hard.

The authors suggest using [Papermill](https://papermill.readthedocs.io/en/latest/), which addresses both of these problems: it lets us parameterize notebooks and execute them in a reproducible way.
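To make the idea of a parameterized notebook run concrete, here is a minimal sketch that assembles a papermill command line. It relies only on the `papermill INPUT OUTPUT -p NAME VALUE` form of the papermill CLI. The notebook paths and the parameter names `dest` and `month` are hypothetical and chosen purely for illustration; they are not taken from the tutorial.

```python
import shlex


def papermill_command(input_nb, output_nb, parameters):
    """Build the argument list for a papermill CLI call.

    Each parameter becomes a `-p NAME VALUE` pair, which papermill
    injects into the notebook before running it from the first cell
    to the last.
    """
    argv = ["papermill", input_nb, output_nb]
    for name, value in parameters.items():
        argv += ["-p", name, str(value)]
    return argv


# Hypothetical paths and parameter names, for illustration only.
argv = papermill_command(
    "notebooks/CountFlights.ipynb",
    "notebooks/CountFlights.ran.ipynb",
    {"dest": "AUS", "month": "2019-01"},
)

# Print the command as it could be typed in a shell.
print(shlex.join(argv))
```

Because the input notebook is left untouched and the executed copy is written to a separate output file, the same notebook can be rerun with different parameter values. Prefixing such a command with `renku run` is one way the two tools might be combined.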

## Hats

* "Renku" Hat
* "Pandas" Hat

As we work through the tutorial, we will be alternating between two different hats: our "pandas" hat and our "renku" hat. When we have our pandas hat on, we will be working within the widely known pandas ecosystem. In terms of data science, the real work happens here, but we are not going to dedicate much of our attention to this part, and it is possible to work through the tutorial with little to no pandas knowledge. When we have our renku hat on, we will be using Renku to keep that work reproducible, and this is where most of our attention will go.

# Cast of Characters

<table class="table table-striped" style="font-size: 18px; margin: 10px;">
    <tbody>
        <tr>
            <th width="20%"><code>!</code></th>
            <td>IPython syntax for executing a shell command</td>
        </tr>
        <tr>
            <th width="20%"><code>cp</code></th>
            <td>In practice, we would write the code, notebooks, and other files ourselves. In this tutorial, we instead create them by copying pre-written versions.</td>
        </tr>
        <tr>
            <th width="20%"><code>git status;</code><br>
                <code>git add;</code><br>
                <code>git commit</code>
            </th>
            <td>As we work, we will be committing to git to keep track of changes we make and the reasons for making them.</td>
        </tr>
        <tr>
            <th width="20%"><a href="https://papermill.readthedocs.io/en/latest/">papermill</a></th>
            <td>Tool for parameterizing and running notebooks in a reproducible way. It takes a notebook and its parameters as input, and produces a new notebook as output. We will use it together with <code>renku run</code>.</td>
        </tr>
        <tr>
            <th width="10%"><code>renku</code></th>
            <td>Tools for reproducible data science.</td>
        </tr>      
     </tbody>
</table>


# Hands-on with Renku (1h 30m)

<table class="table table-striped" style="font-size: 18px; margin: 10px;">
    <tbody>
        <tr>
            <th width="10%">30 min</th>
            <td style="text-align: left">Initialize a project and import data</td>
        </tr>
        <tr>
            <th width="10%"></th>
            <td style="text-align: left"><a href="01-1-CreateProject.ipynb">Create a project</a></td>
        </tr>        
        <tr>
            <th width="10%"></th>
            <td style="text-align: left"><a href="01-2-DeclareEnv.ipynb">Declare the project environment</a></td>
        </tr>
        <tr>
            <th width="10%"></th>
            <td style="text-align: left"><a href="01-3-CreateDataset.ipynb">Create a dataset</a></td>
        </tr>        
        <tr>
            <th width="10%">30 min</th>
            <td style="text-align: left">Analyze data</td>
        </tr>
        <tr>
            <th width="10%"></th>
            <td style="text-align: left"><a href="02-1-BuildPipeline.ipynb">Build an initial pipeline</a></td>
        </tr>
        <tr>
            <th width="10%"></th>
            <td style="text-align: left"><a href="02-2-UpdatePipeline.ipynb">Improve the pipeline</a></td>            
        </tr>
        <tr>
            <th width="10%">30 min</th>
            <td style="text-align: left">Share results and collaborate using <a href="https://renkulab.io">renkulab.io</a>.</td>
        </tr>        
        <tr>
            <th width="10%"></th>
            <td style="text-align: left"><a href="03-Sharing.ipynb">Share results</a></td>
        </tr>
     </tbody>
</table>