OU Bioinformatics Breakfast Club
Will Beasley, Dept of Pediatrics,
Biomedical and Behavioral Methodology Core (BBMC)
Git is the underlying version control system. It's similar to 'Track Changes' in MS Word, with three huge differences:
- Collaborators can make changes simultaneously. Track Changes frequently involves a painful cognitive load to reconcile different versions.
- The entire history is accessible, not just the most recent version. At any time, you can turn back the clock to any committed change (example).
- Coordinates an entire repository of files, not just isolated documents.
GitHub is an online service that leverages Git and adds some sauce for scientists:
- Hosts the repository online.
- Code.
- Data.
- Reports & output.
- Adds options for user permissions, such as read-only (unlike Dropbox).
- Tools for visualizing code differences and developer activity.
- Project Management Tracking, "Issues", & notifications.
- Benefits & Complete examples.
- Creation and organization.
- Communicate with statisticians and non-statisticians.
- Precautions with health care and PHI data.
- Reproducibility for internal team.
- Reproducibility for outsiders.
- Hosting reports.
- github.com/OuhscBbmc/OsctrAstonWeber
- github.com/LiveOak/LylesCarbonSteelCorrosion
- github.com/LiveOak/UcaBullying
- github.com/OuhscBbmc/Wats
- github.com/OuhscBbmc/REDCapR
- github.com/OuhscBbmc/DeSheaToothakerIntroStats
- github.com/bwawrik/MBIO5810
- github.com/OuhscBbmc/RedcapExample
- github.com/OuhscBbmc/StatisticalComputing
- Easier to be disciplined about:
- maintaining a current & coherent code base.
- programmatic data manipulation (instead of manual).
- encapsulating analyses in different files.
- Team members can more easily review and synchronize changes.
- Easier to jump between computers.
- github.com/OuhscBbmc/DeSheaToothakerIntroStats quickly becomes a small website.
- The inputs (ie, data and code) can be inspected & downloaded immediately.
- Details too trivial for your article are available too.
- The outputs (ie, stats, graphs, and reports) can be compared to their results.
- Ideally the exact software versions are easily determined.
- Single URL to send to anyone interested (not just those with access to the OU file server).
- Single report to send anyone (not a bunch of loose graphic files).
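On the point about determining exact software versions: a minimal sketch of recording them alongside a report. It is written in Python only for illustration (most of the repos listed here are R projects, where `sessionInfo()` plays a similar role). The function name `describe_environment` and the package names passed to it are hypothetical.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable


def describe_environment(packages: Iterable[str]) -> Dict[str, str]:
    """Collect interpreter, OS, and package versions for a report footer."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            # Version of the installed distribution, as recorded by the installer.
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info
```

Calling `describe_environment(["numpy", "pandas"])` at the end of an analysis and writing the result into the report gives outsiders something concrete to compare against their own setup.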
Before you start, decide if the repo will be:
- Public from the start.
- Private forever.
- Private during development, then public.
- Dual: Maintain both a public and a private repo.
The typical sequence of operations is
- Log in to your computer and Sync the repository to make sure it's up-to-date.
- Modify/create/delete a file (as normal).
- Locally save the changes to your computer's hard drive (as normal).
- Commit your saved changes to your local repository.
- Sync your local repository with the central repository again. This "pulls" any changes from the server, attempts to merge the changes (which is usually successful), and finally "pushes" your recent changes to the server.
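The sequence above can be sketched as the plain Git commands that a "Sync" button roughly stands for. This is an approximation under assumptions, not a description of what the GitHub client actually runs: the remote name `origin`, the branch name `main`, and the helper names `build_sync_commands` and `run_sync` are all illustrative.

```python
import shlex
import subprocess
from typing import List


def build_sync_commands(message: str, remote: str = "origin", branch: str = "main") -> List[List[str]]:
    """Return the git commands behind one edit-commit-sync cycle."""
    return [
        ["git", "pull", remote, branch],    # start from an up-to-date copy
        ["git", "add", "--all"],            # stage the changes saved to disk
        ["git", "commit", "-m", message],   # commit to the local repository
        ["git", "pull", remote, branch],    # fetch and merge collaborators' changes
        ["git", "push", remote, branch],    # send your commits to the central repository
    ]


def run_sync(message: str, repo_dir: str, dry_run: bool = True) -> List[str]:
    """Print each command; execute it only when dry_run is False."""
    printed = []
    for cmd in build_sync_commands(message):
        line = shlex.join(cmd)
        printed.append(line)
        print(line)
        if not dry_run:
            # check=True stops the sequence at the first failing command,
            # e.g. a merge conflict or a commit with nothing staged.
            subprocess.run(cmd, cwd=repo_dir, check=True)
    return printed
```

With the default `dry_run=True`, `run_sync("Update analysis", ".")` only prints the five commands, which is a safe way to see what a sync involves before running anything.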
The GitHub Desktop Client (for Windows & Mac) hides a lot of the complexity.
- Create a new repository in wibeasley: Mbio5810Demo-2015-08.
- Assign privileges to an existing user: bwawrik.
- Clone on my local machine.
- Copy RAnalysisSkeleton.
- Push changes.
- Boris makes changes in the browser version & commits.
- I sync/pull his changes.
- Create an issue w/ links.
- Sree navigates to the Mbio5810Demo-2015-08 webpage.
- "Fork" the repo (but still on central server).
- "Clone" the repo to his local machine.
- Modify some file.
- Save the file to disk.
- "Commit" the change to local repository.
- "Sync" the change(s) to central server.
- Sree: Navigate to repo website.
- Sree: Submit a "pull request" (ie, a "PR").
- Will: Receives email notification.
- Will: Accepts pull request.
- Boris: Syncs his local machine.
- Git is not intended to work with big data files (say, over 1MB).
- Especially ones that change frequently.
- The Git history gets bloated and syncing is sluggish.
- Excluding data from repo hurts reproducibility.
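To spot files likely to bloat the history before they are committed, a minimal sketch that lists everything in a working copy above a size limit. The 1,000,000-byte default mirrors the rough 1MB figure above; the function name `find_large_files` is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def find_large_files(repo_dir: str, limit_bytes: int = 1_000_000) -> List[Tuple[str, int]]:
    """List files larger than limit_bytes, skipping Git's own .git directory."""
    root = Path(repo_dir)
    hits = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        if ".git" in relative.parts:
            continue  # Git's internal storage is not part of the project files
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                hits.append((relative.as_posix(), size))
    # Largest files first, so the worst offenders are easy to see.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Running `find_large_files(".")` from the repository root returns (path, size) pairs, largest first. Each hit is a candidate for one of the data strategies described next.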
The BBMC employs a variety of strategies. As we descend, security increases while reproducibility decreases.
- Public data is contained directly as CSVs.
- Assume users can download the same public database (eg, reference genome, census file).
- Unshared data that Git "ignores" and doesn't push to the server.
- Pulling from the OUHSC file server.
- Pulling from a database.
GitHub repositories should never contain patient data (or anything legally protected), not even "private" repos.
This is a minimal example that contains elements of most of my moderately sized projects (say, those that take a few weeks start to finish).
https://github.com/wibeasley/RAnalysisSkeleton.
We'll return to this after we finish the slides.
- Communicate with internal and external collaborators.
- Three forms of communication have their place.
- Long-term documentation stored in the repository. It should outlive GitHub.
- Email has private/internal thoughts & criticisms.
- GitHub issues host publicly acceptable thoughts & criticisms. Treat as public. Don't assume GitHub will be in business in three years; fortunately the code & reports aren't tied to GitHub. Worst case, you can serve them as a zip file on your personal page.
- Example issues: REDCapR and MBIO5810.
The markdown report is a quick option, but it has narrow margins.
For public repositories, routing the html report through http://rawgit.com is typically better.
For private reports, knitr produces a self-contained html report. The graphics, text, and numeric output are in a single file you can email. Anyone with a modern browser can open the file.
Inspecting the diffs is a great way to see if the results changed over time.
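A minimal sketch of what such a diff looks like for a plain-text results file, using Python's standard `difflib`. The file name `results.csv` and the numbers are made up for illustration; GitHub's own diff view renders the same kind of line-by-line comparison.

```python
import difflib
from typing import List


def diff_text(old: str, new: str, name: str = "results.csv") -> List[str]:
    """Return unified-diff lines between two versions of a plain-text file."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
            lineterm="",
        )
    )


# Two hypothetical commits of the same summary table.
old_version = "group,mean\ncontrol,4.10\ntreatment,5.32\n"
new_version = "group,mean\ncontrol,4.10\ntreatment,5.47\n"

for line in diff_text(old_version, new_version):
    print(line)
```

Lines starting with `-` were removed and lines starting with `+` were added, so a changed estimate stands out immediately. An empty diff means the results did not change between the two versions.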
Contains files that aren't absolutely necessary for the analysis, but make reproduction much easier.
Examples: RAnalysisSkeleton
Ideally expose a single file that calls your other files in the correct order.
It's almost as easy as creating a documentation file that offers clear directions to a human.
Plus, you can assert that the intermediate & final files have been produced roughly correctly.
Examples: SteelCorrosion and Wats
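A minimal sketch of the single-entry-point idea, in Python rather than the R used in the example repos. Each step names the file it is supposed to produce, and the runner stops as soon as that file is missing or empty. The `Step` layout and the `run_pipeline` name are illustrative, not taken from those repositories.

```python
from pathlib import Path
from typing import Callable, List, Tuple

# (label, action to run, file the action is expected to produce)
Step = Tuple[str, Callable[[], None], Path]


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run each step in order; fail fast if its expected output is missing or empty."""
    completed = []
    for label, action, expected in steps:
        action()
        if not expected.is_file():
            raise RuntimeError(f"{label}: {expected} was not created")
        if expected.stat().st_size == 0:
            raise RuntimeError(f"{label}: {expected} is empty")
        completed.append(label)
    return completed
```

A driver file then reduces to one list of steps (eg, manipulate the data, run the analysis, render the report) and a single call to `run_pipeline`. The checks are deliberately rough: they confirm that something was produced, not that it is correct.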
- In the repository's README.md file, provide any relevant information for humans and search engines.
- It's obvious how it reduces barriers for human readers.
- SEO is also important. Not only will it help improve the repository's SEO, it also improves the performance of your articles. Examples:
- While you're a beginner, I recommend you
- Give everyone permissions within reason (it's much simpler and hassle-free).
- If a PR is necessary, try w/ the browser (not your local machine).
- As you gain experience
- Sync early & often
- When working in a team, avoid modifying the same file simultaneously. Reconciliation costs you time (but is still easier than without GitHub).
- Works easiest with plain text (eg, shell, SAS, R, csv, map), rather than binary/proprietary formats (eg, docx & sas7bdat). The storage mechanism doesn't care much, but the "diff" views won't be available, and reconciling differences can't be done automatically.
- Reconciliation strategies range in sophistication, including
- Command line functions for experts.
- Formal branching & merging using GitHub's visual tools.
- Our "Hard Reset" (ie, the only possibility without version control).
- Git branching & forking is important in software development, but I discourage it for repositories focused on analytics.
Securing private information (data & comments).
- Layered defense.
- Good protocols & practices for data.
- Use eager .gitignore exclusions.
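As one extra layer on top of .gitignore, a minimal sketch of a check that flags data-like files before they are committed. The blocked suffixes and the `data-public` folder name are hypothetical and should be adapted to your own project layout. The check only looks at file names, so it complements good data protocols rather than replacing them.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Hypothetical deny-list of formats that often hold sensitive data.
BLOCKED_SUFFIXES = {".csv", ".xlsx", ".sas7bdat", ".rds", ".sqlite"}
# Hypothetical top-level folder for data already cleared for release.
ALLOWED_DIRS = {"data-public"}


def find_risky_paths(paths: Iterable[str]) -> List[str]:
    """Return the paths whose suffix is blocked and that sit outside the allowed folders."""
    risky = []
    for raw in paths:
        path = PurePosixPath(raw)
        if path.parts and path.parts[0] in ALLOWED_DIRS:
            continue  # explicitly public data is fine to commit
        if path.suffix.lower() in BLOCKED_SUFFIXES:
            risky.append(raw)
    return risky
```

One way to use it is to pass in the file names printed by `git diff --cached --name-only` (the files staged for the next commit) and refuse to commit if the returned list is not empty.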
- http://git-scm.com/
- https://help.github.com/
- Version Control with Git, Loeliger & McCullough (2012)
- Introducing GitHub: a Non-Technical Guide, Bell & Beer (2014)
- Victoria Stodden, Friedrich Leisch, Roger D. Peng (editors; 2014)
- Christopher Gandrud (2013, 2015)