## Who are we?

### Swiss Data Science Center

"A joint initiative between EPFL & ETH to foster the adoption of data science in academia and industry"

Developing tools for collaborative, reproducible data science is one of our major activities.


* **Rok Roškar**: background in computational astrophysics, HPC, developing Renku

* **Andreas Bleuler**: background in computational astrophysics, HPC, developing Renku and working on industrial data science collaborations

* **Chandrasekhar Ramakrishnan**: Renku UI lead, main author of the tutorial material

# Reproducibility in (Data) Science

An email I recently received:

![repro email](images/repro-email.png)

## Why do we care about reproducibility?

## Example: Computational Physics

Victoria Stodden, M. Krafczyk, A. Bhaskar (2018): [Enabling the Verification of Computational Results](https://web.stanford.edu/~vcs/papers/P-RECS-2018-SKB-DOI.pdf)


The authors looked at 306 articles published in 2016-2017 in the Journal of Computational Physics, which asks for research data to be made available.

* 6 of 306 articles included information on how to get the data
* The study team contacted the remaining 298 authors and obtained data for 49 further articles

In the end, they found the following:

* 0 of 306 results could be straightforwardly reproduced
* 5 results were fairly easy to reproduce
* 22 results could be reproduced with some skill and effort

**Over 90% of articles were not reproducible at all!**
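The headline figure can be checked directly from the counts quoted above. This is only a quick arithmetic sketch; the variable names are illustrative and the numbers are the ones on this slide, not re-derived from the paper.

```python
# Counts quoted on this slide (Stodden, Krafczyk, Bhaskar 2018).
total_articles = 306
reproduced = {
    "straightforward": 0,
    "fairly easy": 5,
    "some skill and effort": 22,
}

# Articles reproduced at any difficulty level.
reproducible = sum(reproduced.values())
not_reproducible = total_articles - reproducible
share = not_reproducible / total_articles

print(f"{not_reproducible} of {total_articles} articles ({share:.1%}) could not be reproduced")
# prints: 279 of 306 articles (91.2%) could not be reproduced
```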

# Why is so much research not reproducible?

<div style="text-align: center; width: 100%">
    <img alt="Excuses" src="./images/excuses.svg">
</div>

## Reproducibility:

* Ensures transparency
* Controls for errors and artifacts
* Enables verification and generalization of hypotheses

(see [Stanford Encyclopedia of Philosophy](https://plato.stanford.edu/entries/scientific-reproducibility/))

### Has fringe benefits

* Enabling reuse of data
* Enabling reuse of code
* Enabling sharing and extension of results

but really...

# Because you might be asked to redo an analysis you did 10 years ago!

## Easier said than done...

```python
# https://bitbucket.org/rokroskar/homegrown/src/master/spiral_structure.py

"""
A set of routines for studying the properties of spiral structure


Rok Roskar 12/2010
Institute for Theoretical Physics, University of Zurich

"""

import pynbody
import pynbody.analysis.profile as profile
import numpy as np
import glob
import scipy as sp
import matplotlib.pylab as plt
import warnings
from scipy.stats import gaussian_kde as kde
```

```python
def get_fft(fourier_data,t1,t2,r,m=2, window=False) :

    data = np.load(fourier_data)

    rbin = np.digitize([r],data['r'])

    ind = np.where((data['t'] >= t1) & (data['t'] <= t2))[0]

    # the optimal N for the FFT
    nfft = 2**np.ceil(np.log(len(ind))/np.log(2))

    # this is the sample we are interested in
    sample = data['c'][ind,m,rbin]

    # window the sample

    if window:
        # Hanning
        x = np.arange(0,len(ind))
        win = 0.5*(1-np.cos(2*np.pi*x/(len(ind)-1)))
        sample *= win
```
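Two numerical steps in `get_fft` can be studied in isolation: padding the sample length up to a power of two, and applying a Hann window. The following is a minimal pure-Python sketch of both. The helper names `next_power_of_two` and `hann_window` are hypothetical and not part of the original script, and the integer bit trick is an alternative to the floating-point expression used for `nfft` in the excerpt.

```python
import math


def next_power_of_two(n):
    """Smallest power of two that is >= n, for an integer n >= 1."""
    return 1 << (n - 1).bit_length()


def hann_window(n):
    """Symmetric Hann window: 0.5 * (1 - cos(2*pi*k / (n - 1)))."""
    if n == 1:
        return [1.0]  # avoid dividing by zero for a single sample
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * k / (n - 1))) for k in range(n)]


if __name__ == "__main__":
    print(next_power_of_two(1000))
    print([round(w, 3) for w in hann_window(5)])
```

The window tapers the sample to zero at both ends, which reduces spectral leakage when the selected time range does not contain a whole number of periods.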


## Three (of four) key ingredients missing for a good answer

1. What does the code do? Versioned, well-structured code 😃
2. What kind of data is needed? Preserved and available data 😥
3. How does the code run? Runtime environment 😫
4. How are code and data combined to produce results? Workflow description 😭
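For ingredient 3, even a small snapshot of the runtime environment saved next to the results helps a lot ten years later. Below is a minimal sketch using only the Python standard library. The function name `describe_environment` is hypothetical, and the package list is simply taken from the imports of the script shown earlier; a real project would more likely rely on a pinned environment file or a container image.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Return interpreter, OS, and installed versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Package names assumed from the imports of the example script.
    snapshot = describe_environment(["numpy", "scipy", "matplotlib", "pynbody"])
    print(json.dumps(snapshot, indent=2))
```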

## Tools can help - and there are many!

* version control: git, GitHub, GitLab
* data repositories: Zenodo, FigShare, Dataverse
* packaging and environments: pip, conda, containerization (docker, singularity)
* workflow/pipeline description: a myriad of workflow managers and languages

#### It is very difficult for any single person to do all of this and the science too!

## Summary Matrix of Tools


<table class="table table-condensed" style="font-size: 16px; margin: 10px;">
    <thead>
        <tr>
            <th>Tool</th>
            <th>SaaS</th>
            <th>Local</th>
            <th>Versions code</th>
            <th>Versions data</th>
            <th>Provenance</th>
            <th>Open source</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th><a href="https://github.com/allenai/beaker">Beaker</a></th>
            <td style="text-align: center"></td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
        </tr>
        <tr>
            <th><a href="http://mybinder.org">Binder</a></th>
            <td style="text-align: center">X</td>
            <td style="text-align: center"></td>
            <td style="text-align: center"></td>
            <td style="text-align: center"></td>
            <td style="text-align: center"></td>
            <td style="text-align: center">X</td>
        </tr>
        <tr>
            <th><a href="https://codeocean.com">Code Ocean</a></th>
            <td style="text-align: center">X</td>
            <td style="text-align: center"></td>
            <td style="text-align: center">X</td>
            <td style="text-align: center"></td>
            <td style="text-align: center"></td>
            <td style="text-align: center"></td>
        </tr>         
        <tr>
            <th><a href="https://www.dominodatalab.com/product/">Domino</a></th>
            <td style="text-align: center">X</td>
            <td style="text-align: center"></td>
            <td style="text-align: center">?</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center"></td>
        </tr>
        <tr>
            <th><a href="https://dvc.org">DVC</a></th>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
        </tr>
        <tr>
            <th><a href="https://gigantum.com">Gigantum</a></th>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">?</td>
            <td style="text-align: center">X</td>
        </tr>
        <tr>
            <th><a href="https://cos.io/our-products/osf/">OSF</a></th>
            <td style="text-align: center">X</td>
            <td style="text-align: center"></td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">?</td>
            <td style="text-align: center">X</td>
        </tr>
        <tr>
            <th><a href="https://www.pachyderm.io">Pachyderm</a></th>
            <td style="text-align: center">X</td>
            <td style="text-align: center"></td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
        </tr>
        <tr style="font-size:24px;">
            <th><a href="https://renkulab.io">Renku</a></th>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">X</td>
        </tr>
        <tr>
            <th><a href="https://stenci.la">Stencila</a></th>
            <td style="text-align: center"></td>
            <td style="text-align: center">X</td>
            <td style="text-align: center">?</td>
            <td style="text-align: center">?</td>
            <td style="text-align: center">?</td>
            <td style="text-align: center">X</td>
        </tr>
        <tr>
            <th><a href="https://wholetale.org">WholeTale</a></th>
            <td style="text-align: center">X</td>
            <td style="text-align: center"></td>
            <td style="text-align: center">?</td>
            <td style="text-align: center">?</td>
            <td style="text-align: center">?</td>
            <td style="text-align: center">X</td>
        </tr>
     </tbody>
</table>


## Renku: a platform for reproducible data science

We are building Renku to glue together the pieces needed to work reproducibly. 

With Renku you can:

* create a versioned, collaborative project
* launch interactive environments for data exploration
* record provenance of results for reuse
* explore the global knowledge graph to find data and algorithms

Renku is built on open standards with interoperability in mind.
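To make "recording provenance" concrete, here is a minimal sketch of the idea: run a command, then log which inputs (by content hash) produced which outputs. This is an illustration only and not how Renku implements provenance. The names `file_digest` and `run_and_record` and the JSON-lines log format are assumptions made for this example.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's contents, used to identify an exact version."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_and_record(command, inputs, outputs, log_path):
    """Run `command`, then append a record linking inputs to outputs."""
    # Hash inputs before running, in case the command modifies them.
    input_digests = {str(p): file_digest(p) for p in inputs}
    subprocess.run(command, check=True)  # raises if the command fails
    record = {
        "command": [str(part) for part in command],
        "finished": datetime.now(timezone.utc).isoformat(),
        "inputs": input_digests,
        "outputs": {str(p): file_digest(p) for p in outputs},
    }
    # One JSON object per line, so the log can be appended to over time.
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")
    return record
```

With such a log, one can later ask which command and which version of the data produced a given result file, and notice when an input has changed since the result was generated.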

## Renku: a platform for reproducible data science

Development of the current version started in January 2018

Open source, Apache 2.0 license: https://github.com/SwissDataScienceCenter/renku

Public beta deployment running at https://renkulab.io


## Renku: roadmap

* support for Dataset viewing and searching
* improving provenance graph recording
* streamlining the CLI experience
* providing support for on-line workflow (re)execution
* deeper integration of the Knowledge Graph
