Scientific Collaboration with GitHub

Will Beasley, Dept of Pediatrics,

Biomedical and Behavioral Methodology Core (BBMC)

August 28, 2015

Overview of Git

Git is the underlying version control system. It's similar to 'Track Changes' in MS Word, with three huge differences:

Collaborators can make changes simultaneously. Track Changes frequently involves a painful cognitive load to reconcile different versions.
The entire history is accessible, not just the most recent version. At any time, you can turn back the clock to any committed change (example).
Coordinates an entire repository of files, not just isolated documents.

Overview of GitHub

GitHub is an online service that leverages Git and adds some sauce for scientists:

Hosts the repository online.
- Code.
- Data.
- Reports & output.
Adds options for user permissions, such as read-only
(unlike Dropbox).
Tools for visualizing code differences and developer activity.
Project Management Tracking, "Issues", & notifications.

Outline

Benefits & Complete examples.
Creation and organization.
Communicate with statisticians and non-statisticians.
Precautions with health care and PHI data.

Benefits

Reproducibility for internal team.
Reproducibility for outsiders.
Hosting reports.

Complete examples

Public and applied

github.com/OuhscBbmc/OsctrAstonWeber
github.com/LiveOak/LylesCarbonSteelCorrosion
github.com/LiveOak/UcaBullying

Public and methodological

github.com/OuhscBbmc/Wats
github.com/OuhscBbmc/REDCapR
github.com/OuhscBbmc/DeSheaToothakerIntroStats

Communication

github.com/bwawrik/MBIO5810
github.com/OuhscBbmc/RedcapExample
github.com/OuhscBbmc/StatisticalComputing

Private

github.com/OuhscBbmc/Tfcbt

Reproducibility for Internal Team

Easier to be disciplined about:
- maintaining a current & coherent code base.
- programmatic data manipulation (instead of manual).
- encapsulating analyses in different files.
Team members can more easily review and synchronize changes.
Easier to jump between computers.
github.com/OuhscBbmc/DeSheaToothakerIntroStats quickly becomes a small website.

Reproducibility for Outsiders

The inputs (ie, data and code) can be inspected & downloaded immediately.
- Details too trivial for your article are available too.
The outputs (ie, stats, graphs, and reports) can be compared to the outsiders' own results.
Ideally, the exact software versions are easily determined.
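
One way to make the exact software versions easy to determine is to write them to a small text file that is committed next to the reports. The sketch below is in Python rather than R, and the function names `describe_environment` and `format_environment` are hypothetical; it assumes Python 3.8 or later.

```python
import platform
from importlib import metadata


def describe_environment(packages):
    """Return the interpreter version, the OS, and the version of each package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap instead of failing, so the file is always written.
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


def format_environment(env):
    """Format the description as plain text, one item per line, so it diffs well."""
    lines = [f"python: {env['python']}", f"platform: {env['platform']}"]
    for name, version in sorted(env["packages"].items()):
        lines.append(f"{name}: {version}")
    return "\n".join(lines)
```

Committing the formatted text with each report lets an outsider see which versions produced the results.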

Benefits of Hosting Reports

Single URL to send to anyone interested
(not just those with access to the OU file server).
Single report to send anyone
(not a bunch of loose graphic files).

Four Life Cycle Templates

Before you start, decide if the repo will be:

Public from the start.
Private forever.
Private during development, then public.
Dual: maintain both a public and a private repo.

Mechanism

The typical sequence of operations is

Log in to your computer and Sync the repository to make sure it's up-to-date.
Modify/create/delete a file (as normal).
Locally save the changes to your computer's hard drive (as normal).
Commit your saved changes to your local repository.
Sync your local repository with the central repository again. This "pulls" any changes from the server, attempts to merge the changes (which is usually successful), and finally "pushes" your recent changes to the server.

The GitHub Desktop Client (for Windows & Mac) hides a lot of the complexity.
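
For the curious, the sequence above corresponds roughly to a handful of command-line Git calls. This is a minimal sketch, not what the desktop client literally runs; the function names are hypothetical, and it assumes `git` is installed and the repository already has a remote configured.

```python
import subprocess


def sync_commit_sync_commands(message):
    """Return the Git commands for one sync / commit / sync cycle, in order."""
    return [
        ["git", "pull"],                   # Sync before you start working.
        ["git", "add", "--all"],           # Stage the files you saved to disk.
        ["git", "commit", "-m", message],  # Commit to your local repository.
        ["git", "pull"],                   # Pull and merge teammates' changes.
        ["git", "push"],                   # Push your commits to the server.
    ]


def run_cycle(repo_dir, message):
    """Run the commands inside repo_dir; an exception is raised at the first failure."""
    for command in sync_commit_sync_commands(message):
        subprocess.run(command, cwd=repo_dir, check=True)
```

In real use, the first pull happens before you edit anything, so the commands are better run one at a time than as a single batch.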

Demo 1: Create and Assign a Teammate

Create a new repository in wibeasley:
Mbio5810Demo-2015-08.
Assign privileges to an existing user: bwawrik.
Clone on my local machine.
Copy RAnalysisSkeleton.
Push changes.
Boris makes changes in the browser version & commits.
I sync/pull his changes.
Create an issue w/ links.

Demo 2a: More Descriptions

Sree navigates to Mbio5810Demo-2015-08 webpage.
"Fork" the repo (but still on central server).
"Clone" the repo to his local machine.
Modify some file.
Save the file to disk.
"Commit" the change to local repository.
"Sync" the change(s) to central server.

Demo 2b: Pull Request from Outsider

Sree: Navigate to repo website.
Sree: Submit a "pull request" (ie, a "PR").
Will: Receives email notification.
Will: Accepts pull request.
Boris: Syncs his local machine.

Large Data Files

Git is not intended to work with big data files (say, over 1MB).
- Especially ones that change frequently.
- The Git history gets bloated and syncing is sluggish.
- Excluding data from repo hurts reproducibility.
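
A quick check before committing can catch large files early. The sketch below assumes the rough 1MB threshold mentioned above; the function name `find_large_files` is hypothetical.

```python
from pathlib import Path

ONE_MEGABYTE = 1_000_000  # The rough threshold mentioned above.


def find_large_files(repo_dir, threshold_bytes=ONE_MEGABYTE):
    """Return sorted (relative path, size in bytes) pairs for files above the threshold."""
    repo_dir = Path(repo_dir)
    large = []
    for path in repo_dir.rglob("*"):
        relative = path.relative_to(repo_dir)
        # Skip Git's own internal storage; only the working files matter here.
        if ".git" in relative.parts:
            continue
        if path.is_file() and path.stat().st_size > threshold_bytes:
            large.append((relative.as_posix(), path.stat().st_size))
    return sorted(large)
```

Anything it reports is a candidate for one of the data strategies on the next slide instead of a direct commit.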

Containing/Referencing Data

The BBMC employs a variety of strategies. As we descend, security increases while reproducibility decreases.

Public data is contained directly as CSVs.
Assume users can download the same public database
(eg, reference genome, census file).
Unshared data that Git "ignores" and doesn't push to the server.
Pulling from the OUHSC file server.
Pulling from a database.

GitHub repositories should never contain patient data (or anything legally protected), not even "private" repos.

RAnalysisSkeleton Repository

This is a minimal example that contains elements of most of my moderately sized projects (say, those that take a few weeks from start to finish).

https://github.com/wibeasley/RAnalysisSkeleton.

We'll return to this after we finish the slides.

Project Management and Communication

Communicate with internal and external collaborators.
Three forms of communication have their place.
1. Long-term documentation stored in the repository. It should outlive GitHub.
2. Email has private/internal thoughts & criticisms.
3. GitHub issues host publicly acceptable thoughts & criticisms. Treat as public. Don't assume GitHub will be in business in three years; fortunately the code & reports aren't tied to GitHub. Worst case, you can serve them as a zip file on your personal page.
Example issues: REDCapR and MBIO5810.

Distributing/Hosting Static Reports

The markdown report is a quick way to share results, but it has narrow margins.

For public repositories, routing the html report through http://rawgit.com is typically better.

For private reports, knitr produces a self-contained html report. The graphics, text, and numeric output are in a single file you can email. Anyone with a modern browser can open the file.

Inspecting the diffs is a great way to see if the results changed over time.
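
GitHub shows these diffs in the browser. The same idea can be sketched locally with Python's standard `difflib` module; the function name `report_diff` and its labels are hypothetical, and it assumes the reports are plain text (eg, markdown).

```python
import difflib


def report_diff(old_text, new_text, old_label="previous", new_label="current"):
    """Return the unified-diff lines between two versions of a plain-text report."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return list(diff)
```

An empty list means the two versions are identical. Otherwise, lines starting with `-` or `+` show what was removed or added.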

My "utility" Directory

Contains files that aren't absolutely necessary for the analysis, but make reproduction much easier.

Examples: RAnalysisSkeleton

My "reproduce" File

Ideally, expose a single file that calls your other files in the correct order.

It's almost as easy as creating a documentation file that offers clear directions to a human.

Plus, you can assert that the intermediate & final files have been produced roughly correctly.

Examples: SteelCorrosion and Wats
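
The idea can be sketched in a few lines. This is not the code from the example repositories (those are R projects); it is a minimal Python sketch, and the `reproduce` function and its step tuples are hypothetical.

```python
from pathlib import Path


def reproduce(steps):
    """Run each step in order and check that its expected output was produced.

    `steps` is a list of (description, function, expected_output_path) tuples.
    """
    completed = []
    for description, run_step, expected_output in steps:
        run_step()
        output = Path(expected_output)
        # A rough check only: the file exists and is not empty.
        if not output.is_file() or output.stat().st_size == 0:
            raise RuntimeError(f"Step '{description}' did not produce {output}")
        completed.append(description)
    return completed
```

Each step would typically wrap one analysis file, listed in the order a human would otherwise have to follow by hand.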

Publicity & Search Engine Optimizations

In the repository's README.md file, provide any relevant information for humans and search engines.
It's obvious how it reduces barriers for human readers.
Search engine optimization (SEO) is also important. A descriptive README not only improves the repository's search ranking, it also improves the performance of your articles. Examples:

Branches, Forks, & Pull Requests

While you're a beginner, I recommend you
- Give everyone permissions within reason
  (it's much simpler and hassle-free).
- If a PR is necessary, try w/ the browser
  (not your local machine).
As you gain experience
- Experiment w/ branches/forks locally.
- Make small contributions to MBIO5810 scripts & documentation.
- Carefully study the rebase.

Cautions & Limitations -Part 1

Sync early & often.
When working in a team, avoid modifying the same file simultaneously. Reconciliation costs you time (but is still easier than without GitHub).
Works easiest with plain text (eg, shell, SAS, R, csv, map), rather than binary/proprietary formats (eg, docx & sas7bdat). The storage mechanism doesn't care much, but the "diff" views won't be available, and reconciling differences can't be done automatically.

Cautions & Limitations -Part 2

Reconciliation strategies range in sophistication, including
- Command line functions for experts.
- Formal branching & merging using GitHub's visual tools.
- Our "Hard Reset" (ie, the only possibility without version control).
Git branching & forking are important in software development, but I discourage them for repositories focused on analytics.

Cautions & Limitations -Part 3

Securing private information (data & comments).

Layered defense.
Good protocols & practices for data.
Use eager .gitignore exclusions.
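
"Eager" here means excluding whole file types and directories up front, rather than naming sensitive files one at a time. The sketch below approximates that check with Python's `fnmatch` module. The patterns and function names are hypothetical, and real `.gitignore` matching has more rules (eg, negation and anchoring) than this covers.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical patterns: exclude whole file types and directories, not single files.
EAGER_PATTERNS = ["*.sas7bdat", "*.xlsx", "*.rds", "data-unshared/*"]


def is_excluded(relative_path, patterns=EAGER_PATTERNS):
    """Approximate whether a path would be caught by the exclusion patterns."""
    path = PurePosixPath(relative_path)
    return any(
        fnmatchcase(path.as_posix(), pattern) or fnmatchcase(path.name, pattern)
        for pattern in patterns
    )


def not_excluded(relative_paths, patterns=EAGER_PATTERNS):
    """Return the paths that no pattern covers, and so could still be committed."""
    return [p for p in relative_paths if not is_excluded(p, patterns)]
```

Reviewing the output of `not_excluded` before a commit is one more layer of defense. It doesn't replace good data protocols.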

Resources

Git and GitHub Mechanics

http://git-scm.com/
https://help.github.com/
Version Control with Git, Loeliger & McCullough (2012)
Introducing GitHub: a Non-Technical Guide, Bell & Beer (2014)

Implementing Reproducible Research

Victoria Stodden, Friedrich Leisch, Roger D. Peng (editors; 2014)

Reproducible Research with R and RStudio

Christopher Gandrud (2013, 2015)
