<center>
<h1>Using Bioconda to streamline software installation for bioinformatics</h1>
<br/>
<img src="bioconda.png">
<p>
<em>2018-01-30</em>
<br/>
</p>
<p>
<em>Andrew Perry, <br/>Monash Bioinformatics Platform</em>
</p>
</center>

Follow along: https://github.com/MonashBioinformaticsPlatform/bioconda-tutorial/blob/master/Bioconda_Installation.ipynb

Disclaimer: while I'm talking about it, I don't pretend to be a conda / Bioconda expert.

This is my whirlwind reinterpretation of a workshop delivered by Simon Gladman, Saskia Hiltemann and Eric Rasche: ["Packaging your bioinformatics tool with Bioconda and Galaxy"](https://www.melbournebioinformatics.org.au/projects-blog/bioconda/).

# Why conda ?

* Manages self contained _environments_, including dependencies. No `sudo` required.

* Large ecosystem of precompiled packages, organized as 'channels' (eg conda-forge, bioconda)

* Language agnostic (not only Python !)

* Creating new packages is straightforward compared with many systems (a recipe composed of a short [build.sh](https://github.com/pansapiens/bioconda-recipes/blob/master/recipes/seqtk/build.sh) and [meta.yaml](https://github.com/pansapiens/bioconda-recipes/blob/master/recipes/seqtk/meta.yaml)).

* Interoperates gracefully with Python `virtualenv`s and `pip`.

* Open Source tool (BSD), backed by a company (Continuum Analytics). Some compiled packages in the default 'channel' aren't Open Source.

# Woah .. waddayamean 'environment'

Conda sets some UNIX shell environment variables, primarily `PATH`, that determine which copy of a binary is run when you type its name.

Type: 

```bash
env
```

and you will see your environment variables.
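
To make the `PATH` idea concrete, here is a minimal sketch that needs no conda at all. It creates two throwaway directories, each with its own copy of a made-up program called `mytool` (a hypothetical name, used only for illustration), and shows that whichever directory comes first in `PATH` wins. Activating a conda environment does roughly the same thing by putting the environment's `bin` directory at the front.

```shell
# Create two throwaway directories, each holding a different "mytool" (hypothetical name)
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/system/bin" "$demo_dir/env/bin"
printf '#!/bin/sh\necho "system copy"\n' > "$demo_dir/system/bin/mytool"
printf '#!/bin/sh\necho "environment copy"\n' > "$demo_dir/env/bin/mytool"
chmod +x "$demo_dir/system/bin/mytool" "$demo_dir/env/bin/mytool"

# With only the "system" directory added to PATH, that copy is the one found
PATH="$demo_dir/system/bin:$PATH"
first=$(mytool)

# Prepending the "environment" directory makes its copy take over
PATH="$demo_dir/env/bin:$PATH"
second=$(mytool)

echo "before: $first"
echo "after:  $second"
```

Nothing was uninstalled or overwritten; both copies still exist, and only the lookup order changed.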

<img src="captainplanet.jpg">

# Why use Bioconda ?

* Large ecosystem of pre-packaged software already available, ~250 contributors.
  - !!!--> http://bioconda.github.io/recipes.html <---!!!

* Conda environments aid reproducibility - an `environment.yml` file can record tool versions (+build number) that can be reliably reinstalled elsewhere.

* Is the supported tool installation method for _Galaxy_, so existing packages will be maintained and grow.

* Quality control - well documented guidelines (http://bioconda.github.io/guidelines.html), automatic testing and package builds (TravisCI / CircleCI), Pull Requests reviewed by core team.
  - Fun fact: Bioconda recipes don't blindly trust version numbers on tools, they use an md5 or sha256 checksum of the source tarball.

* Docker and Singularity containers are automatically generated too (Biocontainers project)
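
The checksum point in the list above is easy to try by hand. This is a rough sketch of the idea only, not how conda-build implements it: the helper names `sha256_of` and `verify_sha256` are made up for this example, and a small temporary file stands in for a downloaded source tarball.

```shell
# Pick whichever SHA-256 tool is available (sha256sum on most Linux, shasum on macOS)
sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | awk '{print $1}'
    else
        shasum -a 256 "$1" | awk '{print $1}'
    fi
}

# Succeed only if the file matches the checksum we expect
verify_sha256() {
    [ "$(sha256_of "$1")" = "$2" ]
}

# A throwaway file standing in for a downloaded source tarball
tarball=$(mktemp)
printf 'pretend source code\n' > "$tarball"
recorded=$(sha256_of "$tarball")

verify_sha256 "$tarball" "$recorded" && echo "checksum matches"

# Any change to the file, however small, is detected
printf 'tampered\n' >> "$tarball"
verify_sha256 "$tarball" "$recorded" || echo "checksum mismatch"
```

A version number is only a label that an upstream author can reuse; a checksum ties the recipe to the exact bytes that were tested.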

More sales pitches on [this blog](http://blogs.nature.com/naturejobs/2017/11/03/techblog-bioconda-promises-to-ease-bioinformatics-software-installation-woes/) and [bioRxiv](https://www.biorxiv.org/content/early/2017/10/27/207092).

# Let's install Miniconda and some tools from Bioconda 

Miniconda installs the `conda` package manager in your home directory, in its own conda 'environment' (Miniconda is a trimmed down version of 'Anaconda').

Miniconda can be found at:
https://conda.io/miniconda.html

In [1]:
cd ~
# macOS
curl -O https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh

# or Linux
# curl -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 34.3M  100 34.3M    0     0  15.1M      0  0:00:02  0:00:02 --:--:-- 15.1M


In [2]:
sh ./Miniconda3-latest-MacOSX-x86_64.sh --help


usage: ./Miniconda3-latest-MacOSX-x86_64.sh [options]

Installs Miniconda3 4.3.31

-b           run install in batch mode (without manual intervention),
             it is expected the license terms are agreed upon
-f           no error if install prefix already exists
-h           print this help message and exit
-p PREFIX    install prefix, defaults to /Users/perry/miniconda3, must not contain spaces.
-s           skip running pre/post-link/install scripts
-u           update an existing installation
-t           run package tests after installation (may install conda-build)




In [3]:
sh ./Miniconda3-latest-MacOSX-x86_64.sh
# sh ./Miniconda3-latest-MacOSX-x86_64.sh -b


Welcome to Miniconda3 4.3.31

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>> 


Read license (press ENTER many times) then type 'yes'.

The installer adds a line to `~/.bash_profile` (the `-s` option prevents this).

In [7]:
export PATH="$HOME/miniconda3/bin:$PATH"

In [8]:
conda --help

usage: conda [-h] [-V] command ...

conda is a tool for managing and deploying applications, environments and packages.

Options:

positional arguments:
  command
    info         Display information about current conda install.
    help         Displays a list of available conda commands and their help
                 strings.
    list         List linked packages in a conda environment.
    search       Search for packages and display their information. The input
                 is a Python regular expression. To perform a search with a
                 search string that starts with a -, separate the search from
                 the options with --, like 'conda search -- -h'. A * in the
                 results means that package is installed in the current
                 environment. A . means that package is not installed but is
                 cached in the pkgs directory.
    create       Create a new conda environment from a list of specified
                 packages.
   

In [9]:
# Enable the bioconda 'channel' and some others
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda



In [11]:
# This is the config file that was modified by the conda config command. 
# Channels at the top take precedence (each `--add` puts the new channel first)
cat ~/.condarc

channels:
  - bioconda
  - conda-forge
  - defaults
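
If the ordering is hard to keep straight, a small sketch like the one below can help. It writes the same channel list to a temporary file (so your real `~/.condarc` is untouched) and numbers the channels in the order conda consults them, rank 1 first. The `awk` one-liner is just an illustration, not a real YAML parser. If conda happens to be installed, `conda config --get channels` reports the ordering directly.

```shell
# Write an example .condarc to a throwaway location (not your real ~/.condarc)
condarc=$(mktemp)
cat > "$condarc" <<'EOF'
channels:
  - bioconda
  - conda-forge
  - defaults
EOF

# Number the channels in file order: rank 1 is the highest priority
ranked=$(awk '/^[[:space:]]*-[[:space:]]/ {n++; print n ": " $2}' "$condarc")
echo "$ranked"

# If conda is installed, it can report the configured ordering itself
if command -v conda >/dev/null 2>&1; then
    conda config --get channels
fi
```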


In [14]:
# Install some conda packages
conda install -y seqtk

Fetching package metadata ...............
Solving package specifications: .

Package plan for installation in environment /Users/perry/miniconda3:

The following NEW packages will be INSTALLED:

    seqtk:     1.2-0            bioconda

The following packages will be UPDATED:

    conda:     4.3.31-py36_0             --> 4.3.33-py36_0 conda-forge

The following packages will be SUPERSEDED by a higher-priority channel:

    conda-env: 2.6.0-h36134e3_0          --> 2.6.0-0       conda-forge

conda-env-2.6. 100% |################################| Time: 0:00:00 395.33 kB/s
seqtk-1.2-0.ta 100% |################################| Time: 0:00:00  63.60 kB/s
conda-4.3.33-p 100% |################################| Time: 0:00:01 279.73 kB/s


In [16]:
# samtools version 0.1.15, build number 0
conda install -y samtools=0.1.15=0

# conda install kallisto
# conda install salmon=0.9.1 sailfish

Fetching package metadata ...............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /Users/perry/miniconda3:
#
samtools                  0.1.15                        0    bioconda


In [17]:
# Take a look at where those packages were installed (default environment 'root')
ls ~/miniconda3/bin

total 5792
drwxr-xr-x  75 perry  staff   2.3K 31 Jan 12:49 .
drwxr-xr-x  13 perry  staff   416B 31 Jan 12:48 ..
-rwxr-xr-x   2 perry  staff   216B 27 Oct 02:51 .python.app-post-link.sh
-rwxr-xr-x   2 perry  staff    39B 27 Oct 02:51 .python.app-pre-unlink.sh
lrwxr-xr-x   1 perry  staff     8B 31 Jan 12:49 2to3 -> 2to3-3.6
-rwxrwxr-x   1 perry  staff   111B 31 Jan 12:49 2to3-3.6
-rwxrwxr-x   2 perry  staff   3.7K 25 Jan 01:56 activate
-rwxr-xr-x   1 perry  staff   5.0K 31 Jan 12:09 c_rehash
lrwxr-xr-x   1 perry  staff     3B 31 Jan 12:48 captoinfo -> tic
-rwxr-xr-x   1 perry  staff   243B 31 Jan 12:10 chardetect
-rwxrwxr-x   2 perry  staff    13K  3 Dec  2016 clear
-rwxrwxr-x   1 perry  staff   132B 31 Jan 12:48 conda
-rwxrwxr-x   1 perry  staff   150B 31 Jan 12:48 conda-env

## Using environments

Let's create and switch to a new environment, install some packages, record what we installed, deactivate it.

https://conda.io/docs/user-guide/tasks/manage-environments.html

In [19]:
# Create a new conda environment
conda create -y --name new_env

# Activate it explicitly (`conda create` does not activate the new environment for you)
source activate new_env

Fetching package metadata ...............
Solving package specifications: 
Package plan for installation in environment /Users/perry/miniconda3/envs/new_env:

#
# To activate this environment, use:
# > source activate new_env
#
# To deactivate an active environment, use:
# > source deactivate
#


In [20]:
# Install some packages
conda install -y samtools=0.1.14

Fetching package metadata ...............
Solving package specifications: .

Package plan for installation in environment /Users/perry/miniconda3/envs/new_env:

The following NEW packages will be INSTALLED:

    ncurses:  5.9-10   conda-forge
    samtools: 0.1.14-0 bioconda   
    zlib:     1.2.8-3  conda-forge

samtools-0.1.1 100% |################################| Time: 0:00:02 189.56 kB/s

In [22]:
# By default, environments live under ~/miniconda3/envs
# Take a look
ls ~/miniconda3/envs/new_env/bin
which samtools

total 1336
drwxr-xr-x  18 perry  staff   576B 31 Jan 13:18 .
drwxr-xr-x   7 perry  staff   224B 31 Jan 13:18 ..
lrwxr-xr-x   1 perry  staff    36B 31 Jan 13:18 activate -> /Users/perry/miniconda3/bin/activate
lrwxr-xr-x   1 perry  staff     3B 31 Jan 13:18 captoinfo -> tic
-rwxrwxr-x   3 perry  staff    13K  3 Dec  2016 clear
lrwxr-xr-x   1 perry  staff    33B 31 Jan 13:18 conda -> /Users/perry/miniconda3/bin/conda
lrwxr-xr-x   1 perry  staff    38B 31 Jan 13:18 deactivate -> /Users/perry/miniconda3/bin/deactivate
-rwxrwxr-x   3 perry  staff    71K  3 Dec  2016 infocmp
lrwxr-xr-x   1 perry  staff     3B 31 Jan 13:18 infotocap -> tic
-rwxrwxr-x   1 perry  staff   5.2K 31 Jan 13:18 ncurses5-config
-rwxrwxr-x   1 perry  staff   5.2K 31 Jan 13:18 ncursesw5-config
lrwxr-xr-x   1 perry  s


In [23]:
# You can export the current environment to an environment.yml
# (For Pythonistas, this is functionally similar to a requirements.txt file)
# Keep the environment.yml with your analysis project
conda env export >environment.yml

cat environment.yml

name: new_env
channels:
- bioconda
- conda-forge
- defaults
dependencies:
- samtools=0.1.14=0
- ncurses=5.9=10
- zlib=1.2.8=3
prefix: /Users/perry/miniconda3/envs/new_env


In [24]:
# You can recreate the environment somewhere else with
# conda env create -f environment.yml
# One step toward reproducibility !

# Leave the environment
source deactivate

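
One practical wrinkle: the exported file ends with a `prefix:` line that records a path on the machine where it was made. A common tidy-up, sketched below, is to strip that line before sharing the file. The sketch recreates the exported file from earlier in a scratch directory so it runs without conda. The name `environment.portable.yml` is made up for this example, and the conda commands are left commented out because they need conda and network access.

```shell
# Recreate the exported file from above in a scratch directory
workdir=$(mktemp -d)
cat > "$workdir/environment.yml" <<'EOF'
name: new_env
channels:
- bioconda
- conda-forge
- defaults
dependencies:
- samtools=0.1.14=0
- ncurses=5.9=10
- zlib=1.2.8=3
prefix: /Users/perry/miniconda3/envs/new_env
EOF

# Drop the machine-specific prefix line so the file travels cleanly
grep -v '^prefix:' "$workdir/environment.yml" > "$workdir/environment.portable.yml"
cat "$workdir/environment.portable.yml"

# On the other machine (requires conda and network access, so left commented out):
# conda env create -f environment.portable.yml
# source activate new_env
```

Build numbers can differ between platforms, so a file exported on macOS may need its build strings relaxed before it resolves on Linux.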

# What about brew / linuxbrew ?

* macOS `brew` and `linuxbrew` are essentially separate projects with little cooperation.

* Anecdotally, brew/linuxbrew bioinformatics installations were often low quality, frequently broken.

## _homebrew-science is now deprecated_

https://github.com/Homebrew/homebrew-science/issues/6365

# What about apt, yum, etc

* Targeted at system-wide packages - typically don't handle non-privileged installations or concurrent versions gracefully, if at all.

* Official distribution (Debian, Ubuntu, CentOS) software repositories rarely keep up to date with recent versions. Unofficial repositories sometimes fill this void (eg Debian Med).

* Creating packages seems needlessly complex for mere mortals.


# What about _modules_, LMOD etc

* Well suited to using the shell environment (ie `PATH`) to manage concurrent versions.

* LMOD environments can be 'additive' (many 'modules' can be loaded simultaneously, unlike conda environments).

* LMOD isn't a package manager (doesn't handle download/compilation/installation)

# What about _pip_ and _virtualenv_ ?

* Both come with Python, great for Python packages, large ecosystem of existing packages.
* Not well suited to non-Python packages (binaries and dependencies).

# What about bio-ansible ?

* [bio-ansible](https://github.com/MonashBioinformaticsPlatform/bio-ansible) is awesome because we wrote it.
* Not awesome because we have to write every task to install new tools.
  - But! bio-ansible also supports installing `conda` packages as LMOD modules (we can add anything from Bioconda easily)
* Allows unprivileged installation of tools in your home directory, and system-wide packages & config.
* Allows concurrent versions and 'stacking' environments via LMOD modules.
* Higher barrier to entry for casual users, per-project installs (IMO).

# Things I've noticed conda doesn't do well

* Doesn't *officially* support 'stacking' environments, *à la* LMOD.
  - Do we care ? Software is small, disk space is cheap, just make environments with the combinations of tool versions you need.
  - Unofficially, adding `max_shlvl: 16` to `~/.condarc` allows it (YMMV) (https://github.com/conda/conda/issues/3580).
* Recipes for building packages (coming up) specify dependency names but not the 'channel' they come from.
* No streamlined method to install from source (eg `conda install https://github.com/me/my_recipe`), pre-compiled binary centric ?
  - Feature appears to have been considered and rejected: https://github.com/conda/conda/issues/306

# Creating (Bio)conda packages

## Let's look at the Bioconda recipe for _seqtk_

https://github.com/bioconda/bioconda-recipes/tree/master/recipes/seqtk

### build.sh
```bash
#!/bin/bash

export C_INCLUDE_PATH=${PREFIX}/include
export LIBRARY_PATH=${PREFIX}/lib

make all
mkdir -p $PREFIX/bin
cp -f seqtk $PREFIX/bin/
```

`PREFIX` is the installation path in the conda build environment (in a configure script you'd do `configure --prefix=$PREFIX`)
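
To get a feel for what `PREFIX` does without installing conda-build, the sketch below fakes the build environment: it points `PREFIX` at a temporary directory, writes a tiny `build.sh` that follows the same mkdir-and-copy pattern as the seqtk recipe, and runs it. The program `mytool` is a made-up stand-in for a compiled binary; a real recipe would run `make` or `./configure --prefix=$PREFIX` at that point.

```shell
# Stand-in for the build environment that conda-build would normally provide
PREFIX=$(mktemp -d)
export PREFIX
workdir=$(mktemp -d)

# A pretend "compiled" program (hypothetical; a real recipe would run make here)
printf '#!/bin/sh\necho "hello from mytool"\n' > "$workdir/mytool"
chmod +x "$workdir/mytool"

# A minimal build.sh following the same pattern as the seqtk recipe
cat > "$workdir/build.sh" <<'EOF'
#!/bin/bash
set -e
mkdir -p "$PREFIX/bin"
cp -f mytool "$PREFIX/bin/"
EOF

# Run it from the source directory, much as conda-build would
(cd "$workdir" && bash build.sh)

# The program now lives under $PREFIX/bin, ready to be packaged
installed=$("$PREFIX/bin/mytool")
echo "$installed"
```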

### meta.yaml
```yaml
package:
  name: seqtk
  version: 1.2

source:
  fn: v1.2.tar.gz
  url: https://github.com/lh3/seqtk/archive/v1.2.tar.gz
  md5: 255ffe05bf2f073dc57abcff97f11a37

build:
  number: 0

requirements:
  build:
    - gcc   # [not osx]
    - llvm  # [osx]
    - zlib
  run:
    - zlib


about:
  home: https://github.com/lh3/seqtk
  license: MIT License
  summary: Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format

test:
  commands:
    - seqtk seq
```

## Building a recipe (vanilla conda way)

In [25]:
cd ~
mkdir -p recipes/seqtk

# Create your build.sh and meta.yaml
# See also: `conda skeleton`, which can help generate templates for specific package types

# vi recipes/seqtk/build.sh
# vi recipes/seqtk/meta.yaml

In [28]:
# We need the conda-build package to ... build packages
conda install -y conda-build

Fetching package metadata ...............
Solving package specifications: .

Package plan for installation in environment /Users/perry/miniconda3:

The following NEW packages will be INSTALLED:

    beautifulsoup4: 4.6.0-py36_0  conda-forge
    conda-build:    2.1.18-py36_0 conda-forge
    conda-verify:   2.0.0-py36_0  conda-forge
    filelock:       2.0.6-py36_0  conda-forge
    jinja2:         2.10-py36_0   conda-forge
    markupsafe:     1.0-py36_0    conda-forge
    pkginfo:        1.4.1-py36_0  conda-forge
    pycrypto:       2.6.1-py36_1  conda-forge
    pyyaml:         3.12-py36_1   conda-forge

beautifulsoup4 100% |################################| Time: 0:00:02  59.83 kB/s
filelock-2.0.6 100% |################################| Time: 0:00:00   3.52 MB/s
markupsafe-1.0 100% |################################| Time: 0:00:00  29.10 kB/s
pkginfo-1.4.1- 100% |################################| Time: 0:00:00  83.65 kB/s
pycrypto-2.6.1 100% |################################| Time: 0:00:

In [31]:
# Once a new recipe is created, we can build it
conda build recipes/seqtk/

BUILD START: seqtk-1.2-0
updating index in: /Users/perry/miniconda3/conda-bld/osx-64
updating index in: /Users/perry/miniconda3/conda-bld/noarch

The following NEW packages will be INSTALLED:

    cctools:       895-h7512d6f_0              
    ld64:          274.2-h7c2db76_0            
    libcxx:        4.0.1-h579ed51_0            
    libcxxabi:     4.0.1-hebd6815_0            
    llvm:          4.0.1-hc748206_0            
    llvm-lto-tapi: 4.0.1-h6701bc3_0            
    zlib:          1.2.11-0         conda-forge


latest version is 3.3.0. Run

conda update -n root conda-build

to get the latest version.

Source cache directory is: /Users/perry/miniconda3/conda-bld/src_cache
Downloading source to cache: v1.2.tar.gz
Downloading https://github.com/lh3/seqtk/archive/v1.2.tar.gz
Success
Extracting download
Package: seqtk-1.2-0
source tree in: /Users/perry/miniconda3/conda-bld/seqtk_1517365865334/work/seqtk-1.2
+ source /Users/perry/miniconda3/bin/activate /Users/perry/miniconda3/

In [32]:
# Where is the package archive we just built ?
conda build recipes/seqtk/ --output

/Users/perry/miniconda3/conda-bld/osx-64/seqtk-1.2-0.tar.bz2
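
The path printed above encodes everything conda needs to identify the package: the directory is the platform, and the filename is `<name>-<version>-<build>.tar.bz2`. As a small illustration (plain shell string handling, no conda required), the sketch below takes that path apart and rebuilds the `name=version=build` spec used with `conda install` and in `environment.yml`.

```shell
# Path as printed by `conda build recipes/seqtk/ --output` in the cell above
pkg_path="/Users/perry/miniconda3/conda-bld/osx-64/seqtk-1.2-0.tar.bz2"

# The parent directory is the platform the package was built for
platform=$(basename "$(dirname "$pkg_path")")

# Peel the filename apart from the right: <name>-<version>-<build>.tar.bz2
file=${pkg_path##*/}
stem=${file%.tar.bz2}
build=${stem##*-}
rest=${stem%-*}
version=${rest##*-}
name=${rest%-*}

# The same three parts form the exact spec understood by conda install
spec="$name=$version=$build"
echo "$platform $spec"
```

Splitting from the right matters because package names can themselves contain hyphens, while the version and build fields do not.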


In [34]:
# Install the locally built package from the Conda build cache (the 'local' channel)
conda install -y --use-local seqtk

# or just using the package tar.bz2 directly, or an http:// link where it is hosted
# conda install /Users/perry/miniconda3/conda-bld/osx-64/seqtk-1.2-0.tar.bz2
# conda install http://bioinformatics.erc.monash.edu/home/andrewperry/conda/seqtk-1.2-0.tar.bz2

Fetching package metadata .................
Solving package specifications: .

Package plan for installation in environment /Users/perry/miniconda3:

The following NEW packages will be INSTALLED:

    seqtk: 1.2-0 local



### Distributing your build on Anaconda.org

(on a personal `conda` channel, independently of the Bioconda project / channel)

Step 1:
Create an account at https://anaconda.org/

In [None]:
# Step 2: Install the `anaconda-client` package
conda install anaconda-client

# Step 3: Login and upload
anaconda login
# use the path to the package reported by `conda build recipes/seqtk/ --output`
anaconda upload <path_to_package_bz2>

# Step 4: Test that it worked - install from the channel you uploaded to
conda uninstall seqtk
conda install -c https://conda.anaconda.org/pansapiens seqtk

# The Bioconda way

http://bioconda.github.io/contributing.html

Adding a recipe and testing locally before making a Pull Request.

First step - go to https://github.com/bioconda/bioconda-recipes and fork the repo.

In [None]:
# Then clone your fork
export MY_GIT_USERNAME=pansapiens
git clone https://github.com/${MY_GIT_USERNAME}/bioconda-recipes.git
cd bioconda-recipes

In [None]:
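# Create a new branch for your recipe and switch to it
git checkout -b mynewtool

# Existing recipes in Bioconda usually make good starting points
mkdir -p recipes/mynewtool
vi recipes/mynewtool/meta.yaml
vi recipes/mynewtool/build.sh

# New files are untracked, so `git commit -a` alone would skip them; add them first
git add recipes/mynewtool
git commit -m "Added new recipe for mynewtool."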
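
The branch-and-commit steps can be rehearsed safely in a scratch repository before touching your real fork. The sketch below is only a dry run under a few assumptions: the repository lives in a temporary directory, the recipe contents are placeholders, and the author identity passed with `-c` is made up. It also shows why brand-new recipe files need an explicit `git add`: `git commit -a` only picks up files git already tracks.

```shell
# A scratch repository standing in for your fork of bioconda-recipes
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" checkout -q -b mynewtool

# Placeholder recipe files (real ones would be adapted from an existing recipe)
mkdir -p "$repo/recipes/mynewtool"
printf 'package:\n  name: mynewtool\n  version: "0.1"\n' > "$repo/recipes/mynewtool/meta.yaml"
printf '#!/bin/bash\nmkdir -p "$PREFIX/bin"\n' > "$repo/recipes/mynewtool/build.sh"

# New files are untracked until they are added explicitly
git -C "$repo" add recipes/mynewtool
git -C "$repo" -c user.name="Example" -c user.email="example@example.org" \
    commit -q -m "Added new recipe for mynewtool."

# Confirm which branch we are on and what ended up in the commit
branch=$(git -C "$repo" rev-parse --abbrev-ref HEAD)
tracked=$(git -C "$repo" ls-files)
echo "$branch"
echo "$tracked"
```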

In [None]:
# This Bioconda script creates a fresh Miniconda install, similar to the Bioconda Travis CI
./simulate-travis.py --bootstrap /tmp/anaconda --overwrite

# Build your recipe and run tests
./simulate-travis.py --disable-docker --packages mynewtool --force

# If tests passed, you can now squash your commits, push the branch
git push -u origin mynewtool

# Go to https://github.com/bioconda/bioconda-recipes and make a Pull Request for your branch

# TravisCI will automatically build your branch.
# If TravisCI tests pass, the Bioconda team will review your PR and if it's okay, merge it

# Recipes in the Bioconda master branch are automatically built on Linux and macOS
# (via TravisCI) and uploaded to the Anaconda `bioconda` channel.

# Thanks !