# Reproducible Research

> An article about computational science in a scientific publication is **not** the scholarship itself, it is merely **advertising** of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.
> 
> -- Buckheit and Donoho (1995)

## Non-reproducible research

1. Duke Potti Scandal

  <img src="./screenshot_dukepottiscandal.png" width="500" align="center"/>

  * Potti et al (2006) Genomic signatures to guide the use of chemotherapeutics, [Nature Medicine](http://www.nature.com/nm/journal/v12/n11/full/nm1491.html), 12(11):1294--1300.  

  * Baggerly and Coombes (2009) Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology, [Ann. Appl. Stat.](https://projecteuclid.org/euclid.aoas/1267453942), 3(4):1309--1334.  

  * More information:
    * [Wiki page](http://en.wikipedia.org/wiki/Anil_Potti)
    * [Simply Statistics Blog: The Duke Saga Starter Set](http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/)

2. Nature Genetics (2015 Impact Factor: 31.616). 20 articles about microarray profiling published in _Nature Genetics_ between Jan 2005 and Dec 2006.

  <img src="./screenshot_naturegeneticsrepeatability.png" width="350" align="center"/>
  <img src="./screenshot_naturegeneticsrepeatabilityfig.png" width="350" align="center"/>

3. Bible code.

  <img src="./biblecode_statsci.png" width="200" align="center"/> 
  <img src="./biblecode.jpg" width="200" align="center"/>
  <img src="http://eliotelwarapologetics.com/wp-content/uploads/2016/11/TRUMP-BIBLE-CODES_5.jpg" width="500" align="center"/>

  * Witztum, Rips, and Rosenberg (1994) Equidistant letter sequences in the Book of Genesis, [Statist. Sci.](http://projecteuclid.org/euclid.ss/1177010393), 9(3):429--438. 

  * McKay, Bar-Natan, Bar-Hillel, and Kalai (1999) Solving the Bible code puzzle, [Statist. Sci.](https://www.math.washington.edu/~greenber/BibleCode.html), 14(2):150--173.

## Why reproducible research

0. Replicability has been a foundation of science. It helps accumulate scientific knowledge.

0. Greater research impact.

0. Better work habits boost the quality of research.

0. Better teamwork. For **you** (graduate students), it means better communication with your advisor.  
```julia
# loops forever, by design
while true
    println("Stud: that idea you told me to try - it doesn't work!")
    println("Prof: ok. how about trying this instead.")
end
```
Unless you can reproduce the computing environment (algorithms, dataset, tuning parameters), there's no way your professor can help you.

## How to be reproducible in statistics?

> When we publish articles containing figures which were generated by computer, we also publish the complete software environment which generates the figures.
> 
> -- Buckheit and Donoho (1995)

A good example: [http://stanford.edu/~boyd/papers/admm_distr_stats.html](http://stanford.edu/~boyd/papers/admm_distr_stats.html)

## Tools for reproducible research

* Version control: Git.  
* Distributing research, e.g., Julia or R packages: GitHub, Bitbucket. 
* Dynamic document: IJulia for Julia or RMarkdown for R.  
* Docker container for reproducing a computing environment.  
* Cloud computing tools.

We are going to practice reproducible research **now**. That is, you will make your homework reproducible using Git, github.com, and IJulia.

## Version control using Git

> If it's not in source control, it doesn't exist.

<img src="./screenshot_versioncontrolcartoon.png" width="700" align="center"/>

### Collaborative research. 

Statisticians, as opposed to _closet mathematicians_, rarely do things in a vacuum.  
* We talk to scientists/clients about their data and questions.  
* We write code (a lot!) together with team members or coauthors.  
* We run code/program on different platforms.  
* We write manuscripts/reports with co-authors.  
* We distribute software so potential users have access to our methods.  

### Why version control?

* A centralized repository helps coordinate multi-person projects.  
* Time machine. Keep track of all the changes and revert back easily (reproducible).  
* Storage efficiency.  
* Synchronize files across multiple computers and platforms.  
* [github.com](https://github.com) is becoming a _de facto_ central repository for open source development.  
E.g., all packages in Julia are distributed through github.com.  
* Advertise yourself through github.com.


### Available version control tools

* Open source: **Git**, Subversion (aka svn), CVS, Mercurial, ...
* Proprietary: Visual SourceSafe (VSS), ...
* Dropbox? Mostly for file backup and sharing, limited version control (1 month?), ...

We use Git in this course.

### Why Git?


* As of 2016, Git is the most popular version control system.  
[https://rhodecode.com/insights/version-control-systems-2016](https://rhodecode.com/insights/version-control-systems-2016)

(Embedded Google Trends chart: search interest over time for five version control systems.)

* History: Initially designed and developed by [Linus Torvalds](http://en.wikipedia.org/wiki/Linus_Torvalds#The_Linus.2FLinux_connection) in 2005 for Linux kernel development.  
In British English slang, a _git_ is an unpleasant person. 

> I'm an egotistical bastard, and I name all my projects after myself. First 'Linux', now 'git'.
>
> -- <cite>Linus Torvalds</cite>

* svn: **centralized** version control system.  
<img src="./centralized_vcs.png" width="300" align="center"/>
Git: **distributed** version control system.
<img src="./distributed_vcs.png" width="300" align="center"/>
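
The practical consequence of the distributed model can be sketched in a sandbox: every clone carries the full history, so the history survives even if the central copy disappears. This is a minimal illustration, not part of the course workflow; the directory names (`server.git`, `alice`, `bob`) and the identity are placeholders, and a local bare repository stands in for a real server.

```bash
set -eu
sandbox=$(mktemp -d)                          # scratch directory for the demo
git init -q --bare "$sandbox/server.git"      # stand-in for a central server

# First collaborator makes two commits and pushes them.
git clone -q "$sandbox/server.git" "$sandbox/alice"
cd "$sandbox/alice"
git config user.name "Demo User"              # placeholder identity, local to this clone
git config user.email "demo@example.com"
echo "x = 1" > code.jl
git add code.jl
git commit -q -m "First commit"
echo "y = 2" >> code.jl
git commit -q -a -m "Second commit"
git push -q origin HEAD

# Second collaborator clones, then the "server" is lost.
git clone -q "$sandbox/server.git" "$sandbox/bob"
rm -rf "$sandbox/server.git"
cd "$sandbox/bob"
git log --oneline                             # both commits are still available locally
```

With a centralized system such as svn, the working copy alone would not contain this history.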

### What do I need to use Git?

* A **Git server** enabling multi-person collaboration through a centralized repository.
    - [github.com](https://github.com): unlimited public repositories, private repositories cost money, academic users can get 5 private repositories for free (or unlimited from the [Student Developer Pack](https://education.github.com/pack)?)  
    - [bitbucket.org](https://bitbucket.org): unlimited public repositories, unlimited private repositories for academic accounts (register for free using your UCLA email)   
    - We use [github.com](https://github.com) in this course for developing and submitting homework  

* **Git client** on your own machine.
    - Linux: shipped with many Linux distributions, e.g., Ubuntu. If not, install using a package manager, e.g., `yum install git` on CentOS  
    - Mac: install by `port install git` or other package managers  
    - Windows: GitHub for Windows (GUI), TortoiseGit (is this good?)  
    
Don't rely entirely on a GUI. Learn to use Git on the command line, which is needed for cluster and cloud computing.

### Basic workflow of Git

<img src="./git_workflow.png" width="500" align="center"/>

* Synchronize local Git directory with remote repository (`git pull` = `git fetch` + `git merge`).  
* Modify files in local working directory.  
* Add snapshots of them to staging area (`git add`).  
* Commit: store snapshots permanently to (local) Git repository (`git commit`).  
* Push commits to remote repository (`git push`).  
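
The steps above can be tried end to end without a server. The following is a minimal sketch that assumes a local bare repository as a stand-in for the remote; the paths, file name, and identity are placeholders for illustration.

```bash
set -eu
sandbox=$(mktemp -d)                          # scratch directory for the demo
git init -q --bare "$sandbox/remote.git"      # stand-in for the remote repository
git clone -q "$sandbox/remote.git" "$sandbox/work"
cd "$sandbox/work"
git config user.name "Demo User"              # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "x = 1" > code.jl                        # modify files in the working directory
git add code.jl                               # add a snapshot to the staging area
git commit -q -m "Add code.jl"                # store the snapshot in the local repository
git push -q origin HEAD                       # publish the commit to the remote
branch=$(git rev-parse --abbrev-ref HEAD)     # name of the current branch
git pull -q origin "$branch"                  # fetch + merge; nothing new to merge here
git log --oneline                             # history now holds one commit
```

Pushing `HEAD` avoids hard-coding the branch name, which differs between Git installations.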

### Basic Git usage

0. Register for an account on a Git server, e.g., [github.com](https://github.com). Fill out your profile, upload your public key to the server, ...  

0. Identify yourself on your local machine, e.g.,   
```
git config --global user.name "Hua Zhou"
git config --global user.email "huazhou@ucla.edu"
```
Name and email appear in each commit you make.

0. Initialize a project: 
  - Create a repository, e.g., `biostat-m280-2017-HuaZhou` on the server.  
  **This step is done for you by TA**.  
  - Clone the repository to your local machine
  ```bash
  git clone git@github.com:UCLA-BIOSTAT-M280-2017-Spring/biostat-m280-2017-HuaZhou.git
  ```
  Now you have a local copy of the project.
  
0. Working with your local copy.
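  - `git pull`: update local Git repository with remote repository (fetch + merge)  
  - `git status`: display the current status of the working directory  
  - `git log filename`: display the commit history of a file  
  - `git diff`: show differences (by default, unstaged changes relative to the staging area; use `git diff HEAD` to compare with the most recent commit)  
  - `git add file1 file2 ...`: add file(s) to the staging area  
  - `git commit`: commit changes in the staging area to the local Git repository  
  - `git push`: publish commits in local Git repository to remote repository  
  - `git reset --soft HEAD~1`: undo the last commit, keeping its changes staged  
  - `git checkout filename`: restore the file from the staging area (the last commit if nothing is staged), **discarding** unstaged changes to it  
  - `git rm`: delete files from the working directory and stop tracking them (`git rm --cached` stops tracking but keeps the file)  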
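
The inspection and undo commands are easiest to learn in a throwaway repository. The sketch below assumes a fresh sandbox directory and a placeholder identity; it shows what `git checkout` and `git reset --soft` do to the working directory and the staging area.

```bash
set -eu
sandbox=$(mktemp -d)                  # scratch directory for the demo
git init -q "$sandbox/demo"
cd "$sandbox/demo"
git config user.name "Demo User"      # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "x = 1" > code.jl
git add code.jl
git commit -q -m "Add code.jl"

echo "y = 2" >> code.jl               # an unstaged edit
git status --short                    # code.jl is reported as modified
git --no-pager diff                   # working directory vs staging area
git checkout -- code.jl               # discard the unstaged edit

echo "z = 3" >> code.jl
git commit -q -a -m "Add z"           # -a stages tracked, modified files first
git reset --soft HEAD~1               # undo the commit; the change stays staged
git diff --cached --name-only         # files with staged changes
git log --oneline -- code.jl          # commit history of one file
```

After the soft reset the edit is not lost: it sits in the staging area, ready to be committed again with a better message.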

## Branching in Git

* Branching in Git.  
<img src="./git_branching.png" width="350" align="center"/>

* For this course, you need to have two branches: 
    - `develop` for your own development
    - `master` for releases (homework submission). Note `master` is the default branch when you initialize the project; create and switch to `develop` branch immediately after project initialization.
<img src="./git_branching_simplified.png" width="250" align="center"/>

* Commonly used commands:  
    - `git branch branchname`: create a branch  
    - `git branch`: list local branches (the current one is marked with `*`)  
    - `git checkout branchname`: switch to a branch  
    - `git tag`: show tags (major landmarks)
    - `git tag tagname`: create a tag
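
These commands can be rehearsed locally before touching the course repository. The following is a sketch under assumptions: a sandbox repository, a placeholder identity, and an initial branch explicitly named `master` so that the example does not depend on the local Git default. It merges with `git merge`, the local counterpart of pulling a pushed branch.

```bash
set -eu
sandbox=$(mktemp -d)                        # scratch directory for the demo
git init -q "$sandbox/demo"
cd "$sandbox/demo"
git symbolic-ref HEAD refs/heads/master     # name the initial branch master
git config user.name "Demo User"            # placeholder identity, local to this repo
git config user.email "demo@example.com"

echo "# demo" > README.md
git add README.md
git commit -q -m "Initial commit"

git branch develop                          # create a branch
git checkout -q develop                     # switch to it
echo "x = 1" > code.jl
git add code.jl
git commit -q -m "Add code.jl on develop"

git checkout -q master                      # back on master, code.jl is absent
git merge -q develop                        # fast-forward master to develop
git tag hw1                                 # mark the release
git branch                                  # list branches
git tag                                     # list tags
```

Because `master` had no commits of its own after branching, the merge is a fast-forward and creates no merge commit.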

### Sample sessions

* Clone the project, create a `develop` branch, where you write your solution for HW1.  
```bash
# clone the project
git clone git@github.com:UCLA-BIOSTAT-M280-2017-Spring/biostat-m280-2017-HuaZhou.git
# enter project folder
cd biostat-m280-2017-HuaZhou
# what branches are there?
git branch
# create develop branch
git branch develop
# switch to the develop branch
git checkout develop
# create folder for HW1
mkdir hw1
cd hw1
# let's write some code
echo "x = 1" > code.jl
echo "some bug" >> code.jl
# commit the code
git add code.jl
git commit -m "famous x = 1 function"
# push to remote repo; -u sets origin/develop as the upstream of the new branch
git push -u origin develop
```

* Submit and tag HW1 solution to `master` branch.  
```bash
# which branch are we in
git branch
# change to the master branch
git checkout master
# merge the pushed develop branch into master (pull = fetch + merge)
git pull origin develop 
# push to the remote master branch
git push
# tag version hw1
git tag hw1
git push --tags
```

### Etiquette for using Git and version control systems in general

* Be judicious about what to put in the repository.  
     - Not too little: Make sure collaborators or yourself can reproduce everything on other machines  
     - Not too much: No need to put all intermediate files in the repository. Make good use of the `.gitignore` file
     
* Strictly speaking, a version control system is for source files only. E.g., only `xxx.tex`, `xxx.bib`, and figure files are necessary to produce a pdf file. The pdf file doesn't need to be version controlled or, if version controlled, doesn't need to be frequently committed.

* > Commit early, commit often and don't spare the horses.

* Adding an informative message when you commit is **not** optional. Spending one minute on a commit message saves hours later for your collaborators and yourself. Read the following sentence to yourself 3 times:
> Write every commit message like the next person who reads it is an axe-wielding maniac who knows where you live.
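
As an illustration of the `.gitignore` advice, here is a minimal sketch for a hypothetical manuscript project. The file names and ignore patterns are assumptions chosen for the example: the sources are tracked, while LaTeX by-products and the compiled pdf are left out.

```bash
set -eu
sandbox=$(mktemp -d)                    # scratch directory for the demo
git init -q "$sandbox/paper"
cd "$sandbox/paper"

# Hypothetical ignore rules: LaTeX by-products and the compiled manuscript.
printf '%s\n' '*.aux' '*.log' 'paper.pdf' > .gitignore
touch paper.tex refs.bib paper.aux paper.log paper.pdf

git add .                               # ignored files are skipped
git status --short                      # only .gitignore and the sources are staged
git check-ignore -v paper.pdf           # reports the rule that excludes the file
```

Committing `.gitignore` itself means collaborators get the same rules when they clone.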

## Dynamic document using IJulia notebook

* IPython notebook is a powerful tool for authoring dynamic documents, which combine code, formatted text, math, and multimedia in a single document.  

* [Jupyter](http://jupyter.org) is the current development that encompasses multiple languages including **Ju**lia, **Pyt**hon, and **R**. 

* Julia uses Jupyter notebook through the [IJulia.jl](https://github.com/JuliaLang/IJulia.jl) package.

* In this course, you are required to write your homework reports using IJulia.

* For each homework, you need to submit your IJulia notebook (e.g., `hw1.ipynb`) and its html export (e.g., `hw1.html`), along with all code and data that are necessary to reproduce the results.

* You can start with the Jupyter notebook for the lectures.  