# How to build and deploy reproducible environments?

This notebook has been produced for a mini workshop about Conda and Docker. It was presented as a seminar for the evo-adapt scientific animation.

January 31st, 2025

## Installation

[Install conda](https://docs.conda.io/projects/conda/en/stable/user-guide/install/index.html)

[Install Docker](https://docs.docker.com/engine/install/)

[Install Singularity](https://anaconda.org/conda-forge/singularity)

## Install with `conda`

`conda` is a package manager. It is best known in the `python` programming community (historically it is a package manager for Python packages), but it is now widely used in bioinformatics and can install a large amount of software with C, C++, Python, or R dependencies.

Conda installs packages into isolated virtual environments. By default they are stored in the `~/.conda` directory, though conda can be configured to use your working directory, or any other directory that makes sense to you.

In [8]:
!ls -lha ~/.conda

# Virtual envs are in separate directories in `~/.conda/envs/`
!ls -lha ~/.conda/envs/ | head -n 8

# Package binaries are downloaded and stored in separate directories in `~/.conda/pkgs/`
!ls -lha ~/.conda/pkgs/ | head -n 8

total 124K
drwxr-xr-x.   4 tbrazier UR1 4,0K 10 oct.  17:22 .
drwx------.  45 tbrazier UR1 4,0K 27 janv. 17:43 ..
-rw-r--r--.   1 tbrazier UR1  806 27 janv. 18:12 environments.txt
drwxrwsr-x.  14 tbrazier UR1 4,0K 27 janv. 15:24 envs
drwxrwsr-x. 718 tbrazier UR1 104K 27 janv. 18:12 pkgs
total 56K
drwxrwsr-x. 14 tbrazier UR1 4,0K 27 janv. 15:24 .
drwxr-xr-x.  4 tbrazier UR1 4,0K 10 oct.  17:22 ..
drwxr-sr-x. 15 tbrazier UR1 4,0K 21 janv. 11:39 bcftools
-rw-r--r--.  1 tbrazier UR1    0 10 oct.  17:22 .conda_envs_dir_test
drwxr-sr-x.  8 tbrazier UR1 4,0K 17 déc.  14:44 goat
drwxr-sr-x. 17 tbrazier UR1 4,0K  4 janv. 13:17 herho
drwxr-sr-x. 17 tbrazier UR1 4,0K 16 janv. 14:13 jasminesv
total 80M
drwxrwsr-x. 718 tbrazier UR1 104K 27 janv. 18:12 .
drwxr-xr-x.   4 tbrazier UR1 4,0K 10 oct.  17:22 ..
drwxr-sr-x.   7 tbrazier UR1 4,0K 10 oct.  17:30 alsa-lib-1.2.12-h4ab18f5_0
drwxr-sr-x.   7 tbrazier UR1 4,0K  4 janv. 11:10 alsa-lib-1.2.13-hb9d3cd8_0
drwxr-sr-x.   4 tbrazier UR1 4,0K 20 janv. 14

You don't need to worry about these directories: you manage all your packages and envs through conda commands. Each `conda <command>` has a specific purpose and its own set of options. The most used commands are `conda create`, `conda install`, `conda remove` and `conda clean`.

In [9]:
# conda commands
!conda --help

usage: conda [-h] [-v] [--no-plugins] [-V] COMMAND ...

conda is a tool for managing and deploying applications, environments and packages.

options:
  -h, --help            Show this help message and exit.
  -v, --verbose         Can be used multiple times. Once for detailed output,
                        twice for INFO logging, thrice for DEBUG logging, four
                        times for TRACE logging.
  --no-plugins          Disable all plugins that are not built into conda.
  -V, --version         Show the conda version number and exit.

commands:
  The following built-in and plugins subcommands are available.

  COMMAND
    activate            Activate a conda environment.
    build               Build conda packages from a conda recipe.
    clean               Remove unused packages and caches.
    commands            List all available conda subcommands (including those
                        from plugins). Generally only used by tab-completion.
    compare             Com

In [None]:
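# Create a new virtual env
!conda create --yes --quiet -c conda-forge --name workshop python=3.11 # Can take a while, conda is not fast
!conda init
# Each `!` line runs in its own subshell, so this activation does not carry over to the next lines
!source activate workshop

# Install a package (target the env explicitly with --name)
!conda install -y -q --name workshop numpy

# Remove a package
!conda remove -y --name workshop numpy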



  conda config --add channels defaults

For more information see https://docs.conda.io/projects/conda/en/stable/user-guide/configuration/use-condarc.html

  deprecated.topic(
Channels:
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): ...working... 
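A note on the `source activate` line above: in a notebook, each `!` line is sent to a new shell, so anything set by one line (an exported variable, an activated env) is gone on the next one. The sketch below illustrates this with plain Python and does not call conda at all. The variable name `WORKSHOP_DEMO` and the helper `run_in_new_shell` are made up for the illustration, and a POSIX shell is assumed.

```python
import subprocess


def run_in_new_shell(command):
    """Run `command` in a fresh shell, as a notebook `!` line does, and return its output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=False
    )
    return result.stdout.strip()


# Two separate shells: the export made in the first one is lost in the second.
run_in_new_shell("export WORKSHOP_DEMO=activated")
separate = run_in_new_shell('echo "${WORKSHOP_DEMO:-unset}"')

# One single shell: the export is still visible to the echo.
same = run_in_new_shell('export WORKSHOP_DEMO=activated; echo "${WORKSHOP_DEMO:-unset}"')

print("separate shells:", separate)
print("same shell:", same)
```

In practice this means that in a notebook it is safer to name the env explicitly, for example with `conda install --name workshop ...` or `conda run -n workshop <command>`, than to rely on activation.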

You can also manage your conda envs with `conda env <command>`. It is much easier than calling `conda` commands directly and lets you define your future virtual env in a `.yaml` file. This `yaml` file is crucial for keeping a record of what has been installed in your env and for recreating it automatically on another machine or for another user.

YAML is a simple, human-readable data format. Its structure of keys and indented lists is used here to define the virtual env.

In [4]:
%%script false --no-raise-error

# Example of a basic yaml file for conda env
name: workshop-test-1 # the name of the virtual env
channels: # The conda channels where to look for packages
    - conda-forge
dependencies: # Packages to install
    - python=3.7 # Specify the required version
    - numpy
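If you create many envs, you may prefer to generate these files rather than write them by hand. Here is a minimal sketch, using only the Python standard library, that renders the same structure as the example above. The helper names `render_environment` and `write_environment` are hypothetical, and the sketch only covers the three keys shown here (`name`, `channels`, `dependencies`).

```python
from pathlib import Path


def render_environment(name, channels, dependencies):
    """Return the text of a conda environment file (YAML)."""
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dependency}" for dependency in dependencies]
    return "\n".join(lines) + "\n"


def write_environment(path, name, channels, dependencies):
    """Write the environment file to `path` and return the path."""
    path = Path(path)
    path.write_text(render_environment(name, channels, dependencies))
    return path


# Same content as the example above: one channel, a pinned Python, and numpy.
text = render_environment(
    name="workshop-test-1",
    channels=["conda-forge"],
    dependencies=["python=3.7", "numpy"],
)
print(text)
```

The resulting file can then be passed to `conda env create -f <file>`.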

In [None]:
!conda env --help

usage: conda env [-h] command ...

positional arguments:
  command
    config    Configure a conda environment.
    create    Create an environment based on an environment definition file.
    export    Export a given environment
    list      An alias for `conda info --envs`. Lists all conda environments.
    remove    Remove an environment.
    update    Update the current environment based on environment file.

options:
  -h, --help  Show this help message and exit.


In [6]:
%%script false --no-raise-error
# Create a conda virtual env from a file
!conda env create -q -f workshop-test-1.yaml

# Update with a new package
!conda env update -q -f workshop-test-2.yaml

# Remove the virtual env
!conda env remove -n workshop-test-1

After you have installed your `conda` packages, it is good practice to run `conda clean` to remove all cached tarballs and unused packages (remember, they are in `~/.conda/pkgs/`). These packages are unpacked into a large number of small files, which can cause quota issues on a cluster.

Note that having all your envs defined in `yaml` files is a real time-saver. You can safely remove envs on your cluster to save space and recreate them easily when needed.

In [None]:
!conda clean -y --all

Will remove 324 (726.7 MB) tarball(s).
Will remove 2 index cache(s).
Will remove 210 (1.43 GB) package(s).
^C

CondaError: KeyboardInterrupt
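To see whether the number of files is becoming a problem for your quota, you can count them. The sketch below walks a directory and reports the number of files and their total size. The paths are the `~/.conda` locations shown earlier and the function names are my own. Sizes are only indicative: when conda hard-links files between `pkgs` and `envs`, the same data is counted in both.

```python
import os
from pathlib import Path


def count_files_and_bytes(directory):
    """Walk `directory` and return (number of files, total size in bytes)."""
    n_files = 0
    n_bytes = 0
    for root, _dirs, files in os.walk(directory):
        for filename in files:
            path = os.path.join(root, filename)
            if os.path.islink(path):
                continue  # symbolic links are skipped
            n_files += 1
            n_bytes += os.path.getsize(path)
    return n_files, n_bytes


def report(directory):
    """Return a one-line summary for `directory`."""
    directory = Path(directory).expanduser()
    if not directory.is_dir():
        return f"{directory}: not found"
    n_files, n_bytes = count_files_and_bytes(directory)
    return f"{directory}: {n_files} files, {n_bytes / 1e6:.1f} MB"


print(report("~/.conda/pkgs"))
print(report("~/.conda/envs"))
```

Running it before and after `conda clean` gives a rough idea of how many files were freed.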



### Where to find packages. Anaconda

The [Anaconda](https://anaconda.org/conda-forge) website is a catalogue of all the `conda` packages available. Note that `Anaconda` is not `conda`: `Anaconda` is a package provider whose license is free only for non-commercial use (if you use the `anaconda` channel). However, we usually stay with the `conda-forge` and `bioconda` channels, which are totally free. `bioconda` contains a lot of additional bioinformatics packages.

![An overview of the Anaconda website](https://europe1.discourse-cdn.com/anaconda/original/2X/b/beba40e0306afb252446c553c9c6670a106ab3ab.png)

### Advanced. Complex conda env

With `conda env` and `yaml` files you can configure complex environments. You can include `pip` packages (`pip` is another Python package manager), `R` packages, `C` libraries, and Unix command-line software.

In [6]:
!cat workshop-test-3.yaml

# Example of a complex yaml file for conda env
name: workshop-test-complex # the name of the virtual env
channels:
  - mamba
  - conda-forge
  - bioconda
dependencies:
  - git # Unix software
  - r-base # R
  - jq
  - bcftools # bioinformatic software
  - vcftools # bioinformatic software
  - samtools # bioinformatic software
  - htslib # bioinformatic library
  - blas
  - cyvcf2
  - gsl # GNU scientific library in C
  - openssl>1.0 # Unix software
  - pip # pip package manager
  - python=3.8
  - pip: # install with pip - not in conda or dependencies for github installs
    - cython
    - msprime
    - numba
    - numpy
    - scikit-learn
    - pandas
    - "--editable=git+https://github.com/popgenmethods/ldpop.git#egg=ldpop" # Install directly from github
    - tskit
    - "--editable=git+https://github.com/popgenmethods/pyrho.git#egg=pyrho" # Install directly from github
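The structure to notice in this file is that the `pip:` entry is itself a list nested inside `dependencies`. As a small sketch, here is how the two kinds of dependencies can be separated once the file has been parsed. The dictionary below is a shortened copy of the file above, written by hand in the shape a YAML parser would return, and `split_dependencies` and `lists_pip` are hypothetical helpers.

```python
import re


def split_dependencies(dependencies):
    """Separate conda packages from the packages nested under the `pip:` entry."""
    conda_deps, pip_deps = [], []
    for entry in dependencies:
        if isinstance(entry, dict):
            # A nested mapping such as {"pip": [...]} holds the pip packages.
            pip_deps.extend(entry.get("pip", []))
        else:
            conda_deps.append(entry)
    return conda_deps, pip_deps


def lists_pip(conda_deps):
    """True if `pip` itself is among the conda dependencies, with or without a version."""
    names = [re.split(r"[=<>! ]", dep, maxsplit=1)[0] for dep in conda_deps]
    return "pip" in names


environment = {
    "name": "workshop-test-complex",
    "channels": ["mamba", "conda-forge", "bioconda"],
    "dependencies": [
        "git", "r-base", "bcftools", "pip", "python=3.8",
        {"pip": ["cython", "msprime", "tskit"]},
    ],
}

conda_deps, pip_deps = split_dependencies(environment["dependencies"])
print("conda:", conda_deps)
print("pip:", pip_deps)
if pip_deps and not lists_pip(conda_deps):
    print("pip packages are requested but pip is not listed as a conda dependency")
```

Listing `pip` as a conda dependency, as the file above does, makes sure the `pip` used for the nested section is the one from the env.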


### Advanced. Create your own conda recipe

You may have a new package you want to publish, or the package you need is not available in any `conda` channel, cannot be installed, or requires bug fixes or new features that you want to implement.

There is a solution that lets you stay with `conda` (and you want to stay with the same package manager as much as possible): you can write your own conda recipe. `conda` channels are freely accessible, and you can push the package built from your recipe to `https://anaconda.org`, where it can remain private or be public. For example, I needed the `rCNV` R package, which had no `conda install`. I wrote a `conda` recipe myself and pushed it to `https://anaconda.org`, so it can be installed simply with `conda install tbrazier::r-rcnv`. In fact, it is relatively easy to turn R packages that are already on CRAN into a `conda` recipe. See below how I created the `conda` recipe for this R package.

In [8]:
%%script false --no-raise-error

# Get the package metadata from CRAN and build a recipe skeleton
!conda skeleton cran rCNV

# Build the package
!conda build --build-only -c conda-forge -c bioconda r-rcnv/.

# Store the path of the built package (`--output` only prints the path, it does not build)
!conda build -c conda-forge -c bioconda r-rcnv/. --output > conda_path

# Push to anaconda as tbrazier::r-rcnv
# To upload to anaconda, you need to be logged in. See https://docs.anaconda.com/anacondaorg/
# See the package at https://anaconda.org/tbrazier/r-rcnv
!anaconda upload $(cat conda_path)

# Cleaning
!conda build purge
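The `tbrazier::r-rcnv` argument mentioned above reads as `<channel>::<package>`, and a version can be pinned with `=` as in `python=3.11`. To make that explicit, here is a deliberately simplified parser. It is only an illustration: `parse_spec` is a hypothetical helper, it understands nothing beyond `::` and a single `=`, and conda's real specification syntax is richer (comparison operators, build strings).

```python
def parse_spec(spec):
    """Split 'channel::name=version' into its parts; missing parts are None."""
    channel, separator, rest = spec.partition("::")
    if not separator:
        # No '::' in the spec: there is no explicit channel.
        channel, rest = None, spec
    name, _, version = rest.partition("=")
    return {"channel": channel, "name": name, "version": version or None}


for spec in ["tbrazier::r-rcnv", "python=3.11", "numpy"]:
    print(spec, "->", parse_spec(spec))
```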

If the package is not an R package on CRAN, you have to write the `meta.yaml` manually. See below for the `ngsParalog` software.

In [25]:
!cat ngsParalog/meta.yaml

package:
  name: ngsparalog
  version: 1.3.3

build:
  script:
    - git clone --branch v1.3.3 https://github.com/tplinderoth/ngsParalog
    - cd ngsParalog
    - make
    - make clean

requirements:
  build:
    - make
    - git
    - htslib
    - samtools
    - numpy=1.23
  run:
    - python
    - htslib
    - samtools
    - r-truncnorm
    - r-docopt


about:
  home: https://github.com/tplinderoth/ngsParalog
  license: GPL-3.0
  summary: 'Copy number variation detection using NGS data'
  description: |
    ngsParalog is a program for detecting genomic regions that are problematic for short reading mapping using population-level, next generation sequencing (NGS) data.
  dev_url: https://github.com/tplinderoth/ngsParalog
  doc_url: https://github.com/tplinderoth/ngsParalog
  doc_source_url: https://github.com/tplinderoth/ngsParalog/blob/master/README.md

In [None]:
# Build the package
!conda build --build-only -c conda-forge -c bioconda ngsParalog/.

# Store the path of the built package (`--output` only prints the path, it does not build)
!conda build -c conda-forge -c bioconda ngsParalog/. --output > conda_path

# Push to anaconda (the package is named ngsparalog in meta.yaml)
# To upload to anaconda, you need to be logged in. See https://docs.anaconda.com/anacondaorg/
!anaconda upload $(cat conda_path)

# Cleaning
!conda build purge

## Containers

### Build a container (with `Docker`)

To build a Docker image, you must describe its architecture and content in a definition file. It is a plain-text file of instructions, as easy to read as `yaml`. The Docker definition file is usually simply named `Dockerfile`. For Singularity, definition files have the suffix `.def`. After the build, Docker saves your application as an image stored by the container manager (you will not see it in your working directory), whereas Singularity images are `.sif` files. I prefer to write and build containers with Docker, because I find it easier and there are more resources to get help. Ultimately it does not matter whether you build with Docker or Singularity, as both produce images that can be run with Singularity at the next step.

In [9]:
%%script false --no-raise-error

# Build locally (the Dockerfile is in the directory)
# Add the tag <user>/<dockername>:<version>
!sudo docker build -t tombrazier/ldhat:latest .

It is as simple as that to build! Now look at the definition file... It is very close to what you would type to install the software on a Unix system, except that each command is introduced by an instruction such as `RUN`, `USER`, `ENV` or `WORKDIR`, and the instructions that modify the filesystem each add a layer to the image.

We check that the Docker image has been saved on the system.
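To give an idea of what such a file looks like, the sketch below assembles a small Dockerfile from a list of instructions and prints it. This is a hypothetical recipe, not the Dockerfile used to build `tombrazier/ldhat`: the base image, the installed packages, the `worker` user and the `/opt/app` directory are all placeholders chosen for the example.

```python
def render_dockerfile(instructions):
    """Join (INSTRUCTION, argument) pairs into the text of a Dockerfile."""
    return "\n".join(f"{keyword} {argument}" for keyword, argument in instructions) + "\n"


# Hypothetical recipe, NOT the Dockerfile of tombrazier/ldhat.
instructions = [
    ("FROM", "ubuntu:22.04"),  # base image to start from
    ("ENV", "DEBIAN_FRONTEND=noninteractive"),  # environment variable set in the image
    ("RUN", "apt-get update && apt-get install -y --no-install-recommends "
            "build-essential git && rm -rf /var/lib/apt/lists/*"),  # shell command run at build time
    ("WORKDIR", "/opt/app"),  # directory used by the following instructions
    ("RUN", "useradd --create-home worker"),
    ("USER", "worker"),  # user for the following instructions and at run time
    ("CMD", '["bash"]'),  # default command when the container starts
]

dockerfile = render_dockerfile(instructions)
print(dockerfile)
```

Saving the printed text as `Dockerfile` in an empty directory would be enough to try `docker build` on it.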

In [10]:
%%script false --no-raise-error

# List the images stored locally
!sudo docker image ls

# List all containers, including stopped ones
!sudo docker ps -a

# Run an interactive shell within the container
!sudo docker container run -it tombrazier/ldhat bash

Now that your image is built on your system, you can push it to a registry, where it will be available to other users and to yourself. The best choice is probably Docker Hub (https://hub.docker.com/), which is similar to the anaconda repository, but for containers.

In [11]:
%%script false --no-raise-error

!sudo docker image tag tombrazier/ldhat:latest tombrazier/ldhat:v2
!sudo docker image push tombrazier/ldhat:v2
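The names used in these commands follow the `<user>/<dockername>:<version>` pattern, and Docker falls back to the `latest` tag when the version is omitted (which is why `tombrazier/ldhat` worked above without a tag). Here is a small sketch that splits such a reference into its parts. `parse_image_reference` is a hypothetical helper that covers only the simple forms used in this notebook, not every reference Docker accepts.

```python
def parse_image_reference(reference):
    """Split '<namespace>/<name>:<tag>' into its parts; the tag defaults to 'latest'."""
    repository, separator, tag = reference.rpartition(":")
    if not separator or "/" in tag:
        # No tag given (a ':' followed by '/' belongs to a registry port, not a tag).
        repository, tag = reference, "latest"
    namespace, _, name = repository.rpartition("/")
    return {"namespace": namespace or None, "name": name, "tag": tag}


for reference in [
    "tombrazier/ldhat:v2",
    "tombrazier/ldhat",
    "quay.io/biocontainers/minimap2:2.28--h577a1d6_4",
]:
    print(reference, "->", parse_image_reference(reference))
```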

### Pull a container (with `Singularity`)

Most of the time, you will find an existing Docker image for your application, and you will just have to pull it. Otherwise, you have seen how to build an image and push it to a Docker registry (Docker Hub).

We will play with the `minimap2` image from the `Biocontainers` project (see https://biocontainers.pro/tools/minimap2).

In [12]:
%%script false --no-raise-error

# Get Minimap2 from Biocontainers registry
!singularity pull minimap2.sif docker://quay.io/biocontainers/minimap2:2.28--h577a1d6_4

The containerized application is now saved as a `.sif` image.

### Run the container (with `Singularity`)

In Singularity, `exec` is used to run a command. The container stops once the command is finished.

`shell` is used to launch an interactive shell within the container. It is particularly useful to navigate within container directories and debug containers.

In [None]:
%%script false --no-raise-error

# By default, Singularity binds the current directory
!singularity exec minimap2.sif pwd
!singularity exec minimap2.sif echo $USER
!singularity exec minimap2.sif minimap2

# Run an interactive shell to play with the container
!singularity shell minimap2.sif

### Advanced. Binding directories

From the Singularity documentation: "When Singularity ‘swaps’ the host operating system for the one inside your container, the host file systems becomes inaccessible. But you may want to read and write files on the host system from within the container."

See https://docs.sylabs.io/guides/3.0/user-guide/bind_paths_and_mounts.html

In [16]:
!echo 'What is in the local directory'
!ls -lh

!echo 'The local directory mounted in the /mnt container directory'
!singularity exec --bind $PWD:/mnt minimap2.sif ls -lh /mnt

What is in the local directory
total 715M
-rw-r--r--. 1 tbrazier UR1  26K 20 janv. 16:55 build-deploy-softwares.ipynb
-rw-r--r--. 1 tbrazier UR1    0 20 janv. 14:43 conda_path
-rwxr-xr-x. 1 tbrazier UR1 1,4K 20 janv. 15:14 Dockerfile
-rwxr-xr-x. 1 tbrazier UR1 263M 20 janv. 15:38 ldhat.sif
-rwxr-xr-x. 1 tbrazier UR1  58M 20 janv. 16:30 minimap2.sif
-rwxr-xr-x. 1 tbrazier UR1 394M 20 janv. 15:57 pyrho.sif
-rw-r--r--. 1 tbrazier UR1   91 20 janv. 11:47 python-notebook.yaml
drwxr-xr-x. 2 tbrazier UR1 4,0K 20 janv. 14:20 r-rcnv
-rw-r--r--. 1 tbrazier UR1  237 20 janv. 12:44 workshop-test-1.yaml
-rw-r--r--. 1 tbrazier UR1  271 20 janv. 12:45 workshop-test-2.yaml
-rw-r--r--. 1 tbrazier UR1  886 20 janv. 12:58 workshop-test-3.yaml
The local directory mounted in the /mnt container directory
total 715M   
-rwxr-xr-x    1 tbrazier UR1         1.4K Jan 20 15:14 Dockerfile
-rw-r--r--    1 tbrazier UR1        25.7K Jan 20 16:55 build-deploy-softwares.ipynb
-rw-r--r--    1 tbrazie

Binding directories is essential when you want to modify files in your local directories and make sure that files produced during `singularity exec` are not lost when the container stops.

Compare these two commands, which look similar.
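When the same bind options come back in every command, it can help to build the command line in one place. The sketch below assembles the arguments of `singularity exec --bind <host>:<container>` in Python. The function names are my own and `/path/to/project` is a placeholder for a directory on your machine. Nothing is executed unless you call `run`, which does nothing when Singularity is not installed.

```python
import shlex
import shutil
import subprocess


def singularity_exec_command(image, command, binds=None):
    """Build the argument list for `singularity exec` with optional bind mounts."""
    argv = ["singularity", "exec"]
    for host_dir, container_dir in (binds or {}).items():
        argv += ["--bind", f"{host_dir}:{container_dir}"]
    argv.append(image)
    argv += list(command)
    return argv


def run(argv):
    """Run the command if the program is installed; return its exit code, else None."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, check=False).returncode


argv = singularity_exec_command(
    image="minimap2.sif",
    command=["ls", "-lh", "/mnt"],
    binds={"/path/to/project": "/mnt"},  # placeholder host directory
)
print(shlex.join(argv))
```

Printing the command before running it is a cheap way to check which host directory ends up where in the container.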

In [21]:
!echo 'The local directory mounted in the /mnt container directory'
!singularity exec minimap2.sif touch /mnt/test-nobind
!singularity exec minimap2.sif ls -l /mnt # Containers are really immutable. The file test-nobind is not saved, container is not modified
!ls -l /mnt

!echo 'The local directory mounted in the /mnt container directory'
!singularity exec --bind $PWD:/mnt minimap2.sif touch /mnt/test-bind

!ls -l
# test-bind does exist, but not test-nobind

The local directory mounted in the /mnt container directory
touch: /mnt/test-nobind: Read-only file system
total 0
total 0
The local directory mounted in the /mnt container directory
total 731772
-rw-r--r--. 1 tbrazier UR1     30722 20 janv. 17:03 build-deploy-softwares.ipynb
-rw-r--r--. 1 tbrazier UR1         0 20 janv. 14:43 conda_path
-rwxr-xr-x. 1 tbrazier UR1      1390 20 janv. 15:14 Dockerfile
-rwxr-xr-x. 1 tbrazier UR1 275746816 20 janv. 15:38 ldhat.sif
-rwxr-xr-x. 1 tbrazier UR1  60669952 20 janv. 16:30 minimap2.sif
-rwxr-xr-x. 1 tbrazier UR1 412848128 20 janv. 15:57 pyrho.sif
-rw-r--r--. 1 tbrazier UR1        91 20 janv. 11:47 python-notebook.yaml
drwxr-xr-x. 2 tbrazier UR1      4096 20 janv. 14:20 r-rcnv
-rw-r--r--. 1 tbrazier UR1         0 20 janv. 17:03 test-bind
-rw-r--r--. 1 tbrazier UR1       237 20 janv. 12:44 workshop-test-1.yaml
-rw-r--r--. 1 tbrazier UR1       271 20 janv. 12:45 workshop-test-2.yaml
-rw-r--r--. 1 tbrazier UR1       886 20 janv. 12:58 workshop-test-3.