# Reproducibility

Adapted from Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10), e1003285.



## Computational reproducibility

For example, as defined by https://reproducible-builds.org/:

> Reproducible builds are a set of software development practices that create a verifiable path from human readable source code to the binary code used by computers.

In the context of data analysis, this means having a verifiable path from the raw data to the results reported in a paper.

## Rule 1: For Every Result, Keep Track of How It Was Produced

> By ... specifying the full
analysis workflow in a form that allows for
direct execution, one can ensure that the
specification matches the analysis that was
(subsequently) performed, and that the
analysis can be reproduced by yourself or
others in an automated way.

Recording your analyses in a Jupyter notebook, for instance, is an excellent way to follow this rule.

## Rule 2: Avoid Manual Data Manipulation Steps

This one is simple. If you did it by hand and weren't recording everything you did, how will anyone else reproduce it? And for digital data, if you were recording everything you did, why not record it in a way that allows you to automate the analysis?

To follow this rule, you could record *all* your analysis steps in a Jupyter notebook, calling out to external scripts when necessary.
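As a minimal sketch of what replacing a manual step can look like, the snippet below drops rows with a missing value in code rather than by hand in a spreadsheet. The column names and the data are hypothetical and serve only as an illustration.

```python
import csv
import io

# Hypothetical raw data; in practice this would be read from a file on disk.
RAW_CSV = """participant,condition,rt_ms
p01,control,512
p02,treatment,
p03,control,478
"""


def clean_rows(raw_text):
    """Drop rows with a missing reaction time and convert the rest to int."""
    reader = csv.DictReader(io.StringIO(raw_text))
    cleaned = []
    for row in reader:
        if row["rt_ms"] == "":
            continue  # the step someone might otherwise do by hand
        row["rt_ms"] = int(row["rt_ms"])
        cleaned.append(row)
    return cleaned


cleaned = clean_rows(RAW_CSV)
print(cleaned)
```

Because the exclusion rule lives in `clean_rows`, rerunning the cell repeats the same cleaning on the same raw data, and a reader can see exactly which rows were dropped and why.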

## Rule 3: Archive the Exact Versions of All External Programs Used

Recording the version of external programs is important because different versions of software can behave differently. In Python, a key tool is the requirements file, often named `requirements.txt`.

**Exercise**: Create a `requirements.txt` file that specifies pandas version 0.20.2 and requests version 2.18.1.

Some requirements aren't Python modules and can't be specified in a `requirements.txt` file. For those, you might want to check out [Docker](https://www.docker.com/).
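Alongside a requirements file, it can help to record the versions that were actually installed when an analysis ran. The sketch below uses the standard library's `importlib.metadata` (Python 3.8 or later); the package names are only examples, and `installed_version` is a helper introduced here for illustration.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Example package names; list whatever your analysis actually imports.
versions = {name: installed_version(name) for name in ["pandas", "requests"]}
versions["python"] = sys.version.split()[0]
print(versions)
```

Saving this dictionary next to your results gives you a record of the environment even if the requirements file later drifts.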

## Rule 4: Version Control All Custom Scripts

Any code you write should be versioned so that you can keep track of the exact code that was run to produce experimental results and analyses. A common technique is to copy and paste the entire analysis pipeline into a new directory, appending a unique identifier such as `analysis_NEW` for version 2, `analysis_NEWNEW` for version 3, etc. Though it's certainly better than nothing, this tends to produce a mess of easily confusable directories (see 9 below), bloated directories, and inconsistencies when bugs are fixed in one version but not the others.

Instead, use a proper version control system. As of 2017, it's worth starting with [Git](https://git-scm.com/), the de facto standard. The learning curve is fairly steep, so you may consider starting with an interactive training tool like [`githug`](https://github.com/Gazler/githug) that will run you through the basics quickly. A good first goal would be to learn the vocabulary: repository, clone, stage, commit, push, pull, origin, master, remote, local, merge, and rebase.
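Once your scripts are under version control, one way to tie a result to the exact code that produced it is to record the current commit hash when the analysis runs. The following is a sketch, assuming Git is installed and the notebook or script lives inside a repository; `current_commit` is a hypothetical helper name.

```python
import subprocess


def current_commit():
    """Return the checked-out Git commit hash, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # Git is not installed, or this is not a repository
    return result.stdout.strip()


commit = current_commit()
print(commit)
```

Note that the hash only identifies the code exactly if there are no uncommitted changes, so it is worth committing before running an analysis you intend to report.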

## Rule 5: Record All Intermediate Results, When Possible in Standardized Formats

> In principle, as long as the full process used to produce a given result is tracked, all intermediate data can also be regenerated. In practice, having easily accessible intermediate results may be of great value.

**Exercise**: Think of an example of an intermediate result, either in your own work or from one of this week's tutorials, that could profitably be recorded rather than recomputed anew each time. Which storage formats might be appropriate?
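One possible pattern for keeping an intermediate result, sketched here with JSON as the storage format: compute the result once, write it to disk, and read it back on later runs. The function names, file name, and values are hypothetical, and JSON is only one of several formats that might suit your data.

```python
import json
import os
import tempfile


def load_or_compute(path, compute):
    """Return the result stored at `path`, computing and saving it first if absent."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    with open(path, "w") as f:
        json.dump(result, f)
    return result


def expensive_summary():
    # Stand-in for a slow computation.
    return {"n": 3, "mean_rt_ms": 495.0}


cache_path = os.path.join(tempfile.mkdtemp(), "summary.json")
first = load_or_compute(cache_path, expensive_summary)   # computes and writes the file
second = load_or_compute(cache_path, expensive_summary)  # reads the file back
print(first == second)
```

A plain-text format like JSON or CSV has the advantage that the stored result can be inspected without running any code.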

## Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds

Many experiments in the behavioral and social sciences make use of randomness: e.g., to assign participants to conditions. Most modern programming languages have good support for random number generation, but care must be taken to ensure reproducibility. Two things can go wrong: first, you can fail to record the seed, losing information needed to reproduce the experiment or analysis exactly; second, you can fail to change the seed when you do want a fresh batch of random numbers.

Setting the seed in Python is simple:

```python
import random

random.seed(19145822646)
```

Now, let's generate some random numbers:

```python
r = [random.random() for _ in range(10)]
r
```

```
[0.7764387061390589,
 0.642291262684286,
 0.391831358886754,
 0.7916397779561769,
 0.13015380897795403,
 0.9518298935760082,
 0.326529667693343,
 0.015399332301525126,
 0.7547546949053473,
 0.9290506858367827]
```

Next, we'll set the seed to the same value to confirm we can recreate the same random numbers:

```python
random.seed(19145822646)
r2 = [random.random() for _ in range(10)]
r2
```

```
[0.7764387061390589,
 0.642291262684286,
 0.391831358886754,
 0.7916397779561769,
 0.13015380897795403,
 0.9518298935760082,
 0.326529667693343,
 0.015399332301525126,
 0.7547546949053473,
 0.9290506858367827]
```

Are they the same?

```python
r == r2
```

```
True
```

Yes. And if you compare with a friend, you'll see that they generated the same numbers, too.
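Both failure modes described at the start of this rule can be guarded against by drawing a fresh seed when none is given and always returning the seed together with the numbers it produced. The sketch below is one way to do that; `draw` is a hypothetical helper, and the 32-bit seed size is an arbitrary choice.

```python
import random
import secrets


def draw(seed=None, n=3):
    """Draw n random numbers and return them with the seed that produced them."""
    if seed is None:
        seed = secrets.randbits(32)  # fresh seed when none is supplied
    rng = random.Random(seed)        # local generator; global state is untouched
    return {"seed": seed, "values": [rng.random() for _ in range(n)]}


run = draw()                        # a new batch, with its seed recorded
replay = draw(seed=run["seed"])     # the same batch, recreated from the record
print(run["values"] == replay["values"])
```

Using a local `random.Random` instance rather than the module-level functions also means that other code calling `random.seed` cannot silently change your results.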

## Rule 7: Always Store Raw Data behind Plots

Even better, always store the raw data. And you can version your data, too, perhaps by committing it to a version-control system like Git or by uploading it to the [Open Science Framework](https://osf.io/).
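A simple convention, sketched below, is to write the plotted values to a CSV file that shares its name with the figure. The helper name, file names, and values are hypothetical, and the plotting call itself is omitted.

```python
import csv
import os
import tempfile


def save_plot_data(figure_path, header, rows):
    """Write the plotted values to a CSV file named after the figure."""
    data_path = os.path.splitext(figure_path)[0] + ".csv"
    with open(data_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return data_path


# Hypothetical figure name and values, for illustration only.
figure_path = os.path.join(tempfile.mkdtemp(), "figure1.png")
data_path = save_plot_data(
    figure_path,
    ["condition", "mean_rt_ms"],
    [["control", 495.0], ["treatment", 530.5]],
)
print(data_path)
```

With this naming scheme, anyone looking at `figure1.png` knows to look for `figure1.csv` in the same directory.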

## Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

> When the storage context allows, it is
better to simply incorporate permanent
output of all underlying data when a
main result is generated, using a systematic
naming convention to allow the full
data underlying a given summarized
value to be easily found. 

## Rule 9: Connect Textual Statements to Underlying Results

> Throughout a typical research project, a
range of different analyses are tried and
interpretation of the results made... If you want to reevaluate your previous
interpretations, or allow peers to
make their own assessment of claims you
make in a scientific paper, you will have
to connect a given textual statement
(interpretation, claim, conclusion) to the
precise results underlying the statement.

A Jupyter notebook enables you to put explanations and code side by side, following the recommendations above.
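One way to keep a statement connected to its result is to generate the sentence from the computed value instead of typing the number by hand. A minimal sketch, using hypothetical reaction times:

```python
import statistics

rts = [512, 478, 495]  # hypothetical reaction times in milliseconds
mean_rt = statistics.mean(rts)

# The number in the sentence is filled in from the computed value.
statement = f"The mean reaction time was {mean_rt:.1f} ms (n = {len(rts)})."
print(statement)
```

If the data change, rerunning the cell updates the sentence, so the text and the result cannot drift apart.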

## Rule 10: Provide Public Access to Scripts, Runs, and Results

> Last, but not least, all input data,
scripts, versions, parameters, and intermediate
results should be made publicly
and easily accessible.

Check out the [Open Science Framework](https://osf.io/).

**Discussion**: What are some of the limitations to sharing behavioral and social science data? Has this affected your work?