# Technologies for Implementing Reproducibility

To implement a solution for reproducible science, it is necessary to make the following information explicit:

1. What does the code do?
2. What environment does the code run in?
3. How did the code and data evolve?
4. How are programs and data plugged together to arrive at the final result?

Humans need to cooperate and provide the information, but tools can help in some areas. Machine-readable formats for this information make it possible to automate tasks.

# What does the code do?

For others to follow, understand, and verify what code is doing, the code needs to be written in a way that facilitates this. Practices such as **literate programming** and **clean code** provide guidelines for structuring code for this purpose.

Tools like **linters** and **formatters** can also help to keep code readable by enforcing conventions. **Unit tests** help explain the intention behind code, and make you more efficient when developing.
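As an illustration of how unit tests can document intent, here is a minimal sketch using the standard-library `unittest` module. The `normalize` function is a hypothetical helper invented for this example; the test names are meant to read as a specification of its behavior.

```python
import unittest


def normalize(values):
    """Scale a list of numbers so that they sum to 1 (hypothetical helper)."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]


class TestNormalize(unittest.TestCase):
    """Each test name states one thing the function is expected to do."""

    def test_result_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize([1, 2, 3, 4])), 1.0)

    def test_proportions_are_preserved(self):
        self.assertEqual(normalize([1, 3]), [0.25, 0.75])

    def test_all_zero_input_is_rejected(self):
        with self.assertRaises(ValueError):
            normalize([0, 0])
```

Saved in a file, the tests can be run with `python -m unittest <file>`. A reader who has never seen `normalize` can learn from the test names alone what it promises and which inputs it refuses.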

A **README.md** can help explain the motivation for your project, describe the data sources and methods used, explain how to install and use the code, etc. It should be kept up-to-date.

A standardized project structure can also help others understand your code quickly.

<table style="font-size: 14px; margin: 10px; text-align: left">
    <thead>
        <tr>
            <th>Folder/File</th>
            <th>Purpose</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>README.md</th>
            <td>Describe project</td>
        </tr>        
        <tr>
            <th>data</th>
            <td>Root for data</td>
        </tr>
        <tr> 
            <th>docs</th>
            <td>Root for documentation</td>
        </tr>
        <tr>
            <th>notebooks</th>
            <td>Root for notebooks</td>
        </tr>
        <tr>
            <th>src</th>
            <td>Root for source code</td>
        </tr>
        <tr>
            <th>src/python</th>
            <td>Python code goes here</td>
        </tr>
        <tr>
            <th>src/r</th>
            <td>R code goes here</td>
        </tr>
     </tbody>
</table>
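The layout in the table can be created by hand, but a small script keeps it consistent across projects. The following is a sketch only: the function name `create_project` and the one-line README it writes are assumptions made for this example.

```python
from pathlib import Path

# Folders from the table above; src/python and src/r imply src itself.
LAYOUT = ["data", "docs", "notebooks", "src/python", "src/r"]


def create_project(root):
    """Create the standard folders and a stub README.md under `root`."""
    root = Path(root)
    for folder in LAYOUT:
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():  # never overwrite an existing README
        readme.write_text(f"# {root.name}\n", encoding="utf-8")
    return root
```

Calling `create_project("my-analysis")` creates the folders relative to the current directory. Running it again is harmless because existing folders and an existing README are left untouched.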

### How reproducible is this code?

You can execute the following code and get the expected answer. In that sense it is *reproducible*, but is that enough? 

In [None]:
%%bash
# From https://benkurtovic.com
echo "
(lambda _, __, ___, ____, _____, ______, _______, ________:
    getattr(
        __import__(True.__class__.__name__[_] + [].__class__.__name__[__]),
        ().__class__.__eq__.__class__.__name__[:__] +
        ().__iter__().__class__.__name__[_:][_____:________]
    )(
        _, (lambda _, __, ___: _(_, __, ___))(
            lambda _, __, ___:
                bytes([___ % __]) + _(_, __, ___ // __) if ___ else
                (lambda: _).__code__.co_lnotab,
            _ << ________,
            (((_____ << ____) + _) << ((___ << _____) - ___)) + (((((___ << __)
            - _) << ___) + _) << ((_____ << ____) + (_ << _))) + (((_______ <<
            __) - _) << (((((_ << ___) + _)) << ___) + (_ << _))) + (((_______
            << ___) + _) << ((_ << ______) + _)) + (((_______ << ____) - _) <<
            ((_______ << ___))) + (((_ << ____) - _) << ((((___ << __) + _) <<
            __) - _)) - (_______ << ((((___ << __) - _) << __) + _)) + (_______
            << (((((_ << ___) + _)) << __))) - ((((((_ << ___) + _)) << __) +
            _) << ((((___ << __) + _) << _))) + (((_______ << __) - _) <<
            (((((_ << ___) + _)) << _))) + (((___ << ___) + _) << ((_____ <<
            _))) + (_____ << ______) + (_ << ___)
        )
    )
)(
    *(lambda _, __, ___: _(_, __, ___))(
        (lambda _, __, ___:
            [__(___[(lambda: _).__code__.co_nlocals])] +
            _(_, __, ___[(lambda _: _).__code__.co_nlocals:]) if ___ else []
        ),
        lambda _: _.__code__.co_argcount,
        (
            lambda _: _,
            lambda _, __: _,
            lambda _, __, ___: _,
            lambda _, __, ___, ____: _,
            lambda _, __, ___, ____, _____: _,
            lambda _, __, ___, ____, _____, ______: _,
            lambda _, __, ___, ____, _____, ______, _______: _,
            lambda _, __, ___, ____, _____, ______, _______, ________: _
        )
    )
)" | python

# What environment does the code run in?

To reproduce a result, it is necessary to isolate the environment in which the computation runs. Of course, it would be possible to describe the environment as prose, but there are tools that can be used to automatically instantiate an environment when the environment is described in an executable format.

For your own Python code, you should at least write a *requirements.txt* file describing the requirements for your code, or you might want to think about packaging your code as a *Python package*. Tools like [cookiecutter](https://cookiecutter.readthedocs.io/en/latest/) make this easy to do. There was a tutorial on this topic last year, [The Sheer Joy of Packaging](https://www.youtube.com/watch?v=xiI1i525ljE).
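To show what a *requirements.txt* file with pinned versions contains, here is a minimal sketch that lists the packages installed in the current environment using the standard-library `importlib.metadata` module. The function name `pinned_requirements` is an assumption for this example. In practice, `pip freeze` produces similar output and handles more edge cases.

```python
from importlib import metadata


def pinned_requirements():
    """Return sorted 'name==version' lines for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that have no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    print("\n".join(pinned_requirements()))
```

Pinning exact versions records the environment that actually produced a result, whereas a loose specification only describes what the code is believed to need.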

Your software environment may be pure Python, in which case a *requirements.txt* file is sufficient for describing it. But it may have other dependencies as well. The following tools can be used to create an isolated environment:


<table style="font-size: 14px; margin: 10px; text-align: left">
    <thead>
        <tr>
            <th>tool</th>
            <th>format</th>
            <th>features</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>pipenv</th>
            <td>Pipfile, Pipfile.lock (can import requirements.txt)</td>
            <td>makes it possible to define an environment for executing Python code</td>
        </tr>        
        <tr>
            <th>conda</th>
            <td>requirements.txt, environment.yml</td>
            <td>can be used to specify Python environments, but is more general and can incorporate executables as well</td>
        </tr>
        <tr> 
            <th>containers</th>
            <td>e.g., Dockerfile, .def</td>
            <td>(e.g., docker, singularity) can isolate environments that can execute on a shared OS kernel</td>
        </tr>
        <tr>
            <th>virtual machines</th>
            <td>e.g., vbox, vagrant</td>
            <td>isolate the OS kernel as well</td>
        </tr>
     </tbody>
</table>

These tools can be combined (e.g., running conda inside a container).


# How did the code and data evolve?

To understand how the decisions that led to the final result were made, it is helpful to trace how code and data evolved within a project.

Version control software like **git** and **mercurial** (hg) provides a way to do that. Standard version-control software works well with text (e.g., code and certain data formats like csv or json), but has limitations when dealing with binary data. Tools like **git-lfs** and **git-annex** extend version control to the realm of data.

Regardless of what tool you use, make sure you **write useful commit messages.** Good commit messages are concise and explain the intention behind the change that was made. The tools make it possible to view the exact modification in great detail, but it is not always possible to intuit the intention or reason for the change.

### Advanced

When collaborating, agree on a workflow. [OneFlow](https://www.endoflineblog.com/oneflow-a-git-branching-model-and-workflow) is likely a good fit for many data-science projects.

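Think about making the commit history readable. `git rebase` is helpful for this, but use it with care: it rewrites history, which causes problems for collaborators if the rewritten commits have already been shared.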

You may want to consider adopting a consistent structure to your commit messages like [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0-beta.4/).
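A consistent structure can also be checked by a machine. The sketch below tests whether the first line of a commit message follows the general `type(scope): description` shape used by Conventional Commits. The regular expression is a simplified approximation written for this example, not the full specification, and `parse_header` is a hypothetical name.

```python
import re

# Simplified header pattern: type, optional (scope), optional "!", then ": description".
HEADER = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(\((?P<scope>[^()\s]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def parse_header(message):
    """Return the parts of the first line of a commit message, or None if it does not match."""
    first_line = message.splitlines()[0] if message else ""
    match = HEADER.match(first_line)
    return match.groupdict() if match else None
```

For example, `parse_header("fix(parser): handle empty input files")` returns a dictionary with the type `fix` and the scope `parser`, while `parse_header("updated stuff")` returns `None`. A check like this could be run from a git hook, but whether that is worthwhile depends on the team.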

# How are programs and data plugged together to arrive at the final result?

In data science, results are typically the end product of a series of steps. It is necessary to make those steps explicit so that others can retrace them.

## Documenting the workflow

* **Scripts** are an imperative description of steps. 
* **Workflow tools** describe the steps in a declarative fashion and provide greater flexibility to introspect and execute steps in varying environments.
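The difference between the two styles can be sketched in a few lines. Below, a pipeline is declared as data (each step lists the steps it depends on), and the execution order is derived from that declaration with the standard-library `graphlib` module (Python 3.9 or later). The step names are hypothetical, and real workflow tools offer far more than this, such as caching and remote execution.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step names the steps it depends on.
STEPS = {
    "download": [],
    "clean": ["download"],
    "features": ["clean"],
    "model": ["features"],
    "report": ["model", "clean"],
}


def execution_order(steps):
    """Return an order in which every step runs after its dependencies."""
    return list(TopologicalSorter(steps).static_order())


def upstream(steps, target):
    """Return every step that `target` depends on, directly or indirectly."""
    seen = set()
    stack = list(steps[target])
    while stack:
        step = stack.pop()
        if step not in seen:
            seen.add(step)
            stack.extend(steps[step])
    return seen
```

Because the dependencies are data rather than the order of lines in a script, the same declaration can answer questions a script cannot, for example which steps must be rerun before `features` can be rebuilt (`upstream(STEPS, "features")`).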

## Documenting provenance

Workflows describe intermediary and final results with a focus on execution. Formats like **PROV-O** describe results with a focus on tracking the metadata around the results. The programs that were executed are one form of metadata, but so is information such as who executed which steps and where the data came from. This kind of information is also important for reproducibility and allows you to answer questions like, "Who should I talk to if I do not understand something?"
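To make the idea of a provenance record concrete, here is a minimal sketch that captures who ran a step, when, and on which data. The field names (`activity`, `agent`, `used`, `generated`) are loosely inspired by PROV terminology, but the structure is an assumption made for this example and is not valid PROV-O. The file names and contents are also hypothetical.

```python
import getpass
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(step, inputs, outputs, agent=None):
    """Describe one executed step: who ran it, when, and on which data.

    `inputs` and `outputs` map a label to the raw bytes of that artifact.
    """
    return {
        "activity": step,
        "agent": agent or getpass.getuser(),
        "ended_at": datetime.now(timezone.utc).isoformat(),
        "used": {name: sha256_of(data) for name, data in inputs.items()},
        "generated": {name: sha256_of(data) for name, data in outputs.items()},
    }


# Hypothetical usage: record a cleaning step that drops one malformed row.
record = provenance_record(
    "clean",
    inputs={"raw.csv": b"a,b\n1,2\nx\n"},
    outputs={"clean.csv": b"a,b\n1,2\n"},
    agent="alice",
)
print(json.dumps(record, indent=2))
```

Storing content hashes rather than file names means the record still identifies the exact data even if files are later renamed or overwritten.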
