# Dependency Management & Containerization

**Contact:** Alex Kanitz (alexander.kanitz@unibas.ch)

# Table of Contents

* [Introduction](#Introduction)
  * [Deep down in dependency hell](#Deep-down-in-dependency-hell)
  * [What happens when we install software?](#What-happens-when-we-install-software?)
  * [What does it mean to _build_ software?](#What-does-it-mean-to-build-software?)
  * [What does it mean to _compile_ software?](#What-does-it-mean-to-compile-software?)
* [Package managers](#Package-managers)
  * [Pip - A Python package manager](#Pip---A-Python-package-manager)
    * [Storing your dependencies with your code](#Storing-your-dependencies-with-your-code)
    * [Specifying versions](#Specifying-versions)
    * [Virtual environments](#Virtual-environments)
  * [Conda - A more universal package manager](#Conda---A-more-universal-package-manager)
    * [Installing Miniconda](#Installing-Miniconda)
    * [Conda basics](#Conda-basics)
    * [Channels & Bioconda](#Channels-&-Bioconda)
    * [Conda environments](#Conda-environments)
    * [Using Conda with Pip](#Using-Conda-with-Pip)
    * [Use Mamba!](#Use-Mamba!)
* [Containers](#Containers)
  * [A (very) short history of containers](#A-(very)-short-history-of-containers)
  * [What is Docker?](#What-is-Docker?)
  * [Installing Docker](#Installing-Docker)
  * [Running our first container](#Running-our-first-container)
  * [Containers in bioinformatics](#Containers-in-bioinformatics)
  * [Composing Dockerfiles](#Composing-Dockerfiles)
  * [Building Docker images](#Building-Docker-images)
  * [Publishing Docker images](#Publishing-Docker-images)
  * [Cleaning up after yourself](#Cleaning-up-after-yourself)
  * [Further reading](#Further-reading)
* [Publishing on Bioconda & BioContainers](#Publishing-on-Bioconda-&-BioContainers)
* [Homework](#Homework)

This lesson will give you a bit of background on what it means to install a piece of software, how you can bundle dependencies with your software, and how you can set up encapsulated development environments to avoid dependency hell. The second part of the lesson deals with the related concept of containerization, where you will learn how to build container images and run them. Both are important for developing as well as consuming software.

# Introduction

## Deep down in dependency hell

Surely you have run into software that you haven't been able to install. Or at least software that you had a hard time installing. Perhaps it listed a number of dependencies as prerequisites for installation, and so you needed to find and install these one by one. Or perhaps there wasn't even a list of dependencies and you only found out that some library or other was missing while you were going through the installation process. Okay, backtrack, install the dependency, try again. Only to run into the next missing dependency. Or you can't install the dependency because... the dependency's dependency (or _dependencies_) is missing... A recursive problem.

Finally, you may find out that a library cannot be installed at all, because no version of that library exists that fulfills the requirements of all the software that requires it. For example, software A requires version >1.0.0 and <2.0.0 of library X, but software B that you want to install requires version >2.0.0 of that same library X. However, you have version 1.2.3 of the library and you can only install _one_ version of each library. So what to do? You can keep the current version of the library, but that means you will not be able to use the software B that you wanted to install. Or you upgrade library X to version 2.3.1, which will allow you to use software B - but it will break your software A that you are using every day.

Welcome to _dependency hell_!
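To make the conflict concrete, here is a minimal sketch of the situation described above. The list of available releases of library X is made up for illustration, and versions are modeled as plain tuples rather than with a real version parser:

```python
# Hypothetical release history of library X (illustrative, not real data)
available = [(1, 0, 0), (1, 2, 3), (2, 0, 0), (2, 3, 1)]


def ok_for_a(version):
    """Software A needs >1.0.0 and <2.0.0."""
    return (1, 0, 0) < version < (2, 0, 0)


def ok_for_b(version):
    """Software B needs >2.0.0."""
    return version > (2, 0, 0)


works_for_a = {v for v in available if ok_for_a(v)}
works_for_b = {v for v in available if ok_for_b(v)}

print("A accepts:", sorted(works_for_a))
print("B accepts:", sorted(works_for_b))
# the intersection is empty: no single version satisfies both
print("Both accept:", sorted(works_for_a & works_for_b))
```

As long as only one version of X can be installed, an empty intersection means that A and B cannot be used side by side.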

So, with installing software, you never know what you will get. Everything may just work and you are done in a few moments. Or you may end up spending the better part of an afternoon to try to install that software, only to give up at some point because you just don't see how to get there!

_So why do these problems occur?_

And perhaps more interestingly: **Are these issues avoidable? And are the developers to blame?**

_Let's see!_

## What happens when we install software?

The short answer is: _it depends_!

We will not give you the long answer here (feel free to google to find out more if you are interested), but instead will make do with a summary.

Below are some of the steps that may occur during the installation process:

- **Build dependencies are installed**
- **The software is built**
- **Any other dependencies are installed**
- Executables are placed in locations where the user (or system) can find them
- Environment variables and/or aliases are set
- Configuration files are created
- The software is initialized
- Icons and other shortcuts are generated
- ...

What exactly is done, how and when depends on the programming language or languages used to implement the software, the operating system the software is installed on, the software's dependencies, the entry points that the software provides (a graphical user interface? a command-line interface? a web server? all of the above?) and many other factors...

Here, we will only cover two aspects highlighted in the list above, starting with...

## What does it mean to _build_ software?

Let's first see what [Wikipedia says](https://en.wikipedia.org/wiki/Software_build):

> In software development, a build is the process of converting source code files into standalone
> software artifact(s) that can be run on a computer, or the result of doing so.

Okay, but surely Python files (_modules_) can just be shared and executed anywhere?!

Not always. Even though Python is indeed an _interpreted_ language, meaning that source code is interpreted at runtime, Python _may_ do some magic to improve or optimize the code for efficiency when a package is installed (and also when it is run). Besides, remember when we created the command line executables? Those executables are not part of the code base in version control (try to find them in your repository!), but instead they are _built_ during the ... build process when you installed the package with `python setup.py install` or (better) `pip install .`. You may also remember the files and directories that were generated when you created your package with `python -m build` or the `.pyc` files that were created when you first ran your executable after installation. These are all _build artifacts_.
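If you want to see such an artifact appear, the following sketch uses the standard library module `py_compile` to byte-compile a throwaway module in a temporary directory. The file name `hello.py` is just a placeholder, and the exact name and location of the resulting `.pyc` file depend on your Python version and configuration:

```python
import py_compile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # write a tiny module to a temporary location
    source = Path(tmp_dir) / "hello.py"
    source.write_text("print('Hello, build artifacts!')\n", encoding="utf-8")

    # compile the source to bytecode; the call returns the artifact's path
    artifact = Path(py_compile.compile(str(source), doraise=True))

    artifact_name = artifact.name
    artifact_exists = artifact.exists()
    print(f"source:   {source.name}")
    print(f"artifact: {artifact.parent.name}/{artifact_name}")
```

This is roughly what Python does behind the scenes the first time a module is imported.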

We will not go into detail here, but remember that as soon as you go beyond dealing with a single Python script containing all the required code for its execution, you will likely end up with a software build - even in interpreted languages like Python!

However, there is one common aspect of building software that we will briefly look at here, as it is important for every programmer to know about (after all, you may end up using a different programming language): The concept of _compiling_ software.

## What does it mean to _compile_ software?

Let's start again with [Wikipedia's definition of what a compiler is (in programming)](https://en.wikipedia.org/wiki/Compiler):

> In computing, a compiler is a computer program that translates computer code written in one
> programming language (the source language) into another language (the target language). The
> name "compiler" is **primarily used for programs that translate source code from a high-level
> programming language to a lower level language (e.g. assembly language, object code, or 
> machine code) to create an executable program**.

For many purposes, compiling software means taking the source code with human readable
instructions and translating it into a binary (i.e., _not_ human readable) _executable_ form.
Did you ever try to open an `.exe` file in a text editor?

Let's look at the first two lines of `samtools`:

```bash
which samtools
# /usr/bin/samtools
head -n2 "$(which samtools)"
```

I get the following output:

```console
Q�tdR�td���������%��	�%	P:P:/lib64/ld-linux-x86-64.so.2GNU�GNUd�?�$Wc􅽞��O�'iGNU7()�C+�>��L@ 78:;=?ACDGHo;b�2���+����j����K��
��}F��8���n��i]�������^|��e�mgUa�}#Q b�)�*y
```

Certainly not "human readable". We don't even get _two lines_!
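Even though the content is not human readable, the first few bytes of a compiled file usually identify its format (a so-called "magic number"). The sketch below inspects the Python interpreter itself, which is typically a compiled executable. The table of magic numbers is deliberately short and incomplete, and the helper name `identify_executable` is made up for this example:

```python
import sys

# A few well-known "magic numbers" that open common executable formats
KNOWN_FORMATS = {
    b"\x7fELF": "ELF (Linux and most Unix-like systems)",
    b"MZ": "PE/COFF (Windows .exe)",
    b"\xcf\xfa\xed\xfe": "Mach-O 64-bit (macOS)",
    b"\xca\xfe\xba\xbe": "Mach-O universal binary (macOS)",
}


def identify_executable(path):
    """Read the first four bytes of a file and look them up."""
    with open(path, "rb") as handle:
        magic = handle.read(4)
    for prefix, name in KNOWN_FORMATS.items():
        if magic.startswith(prefix):
            return magic, name
    return magic, "unknown format"


magic, name = identify_executable(sys.executable)
print(magic, "->", name)
```

What you see depends on your operating system and on how Python was installed.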

What we are seeing here is optimized machine code that was generated when building samtools during installation, because for performance reasons, Samtools is (mostly) written in C, which - unlike Python - is not an interpreted, but a _compiled_ language.

Compiled languages are generally faster than interpreted languages, because the compiler analyzes the code and translates it into a byte/machine code representation that is optimized for the specific machine architecture. This _compilation_ happens at build time, before you first execute the code. This is made possible, among other things, because C - unlike Python - is a statically typed language, i.e., the type of every variable is known beforehand, which allows for better code analysis. It's hard to programmatically "understand" what code may be doing if you are free to throw anything at it. In statically typed languages like C, what can happen at any given point is much more constrained by the preceding statements, and compilers make use of these constraints for code optimization.
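A tiny example shows why dynamic typing makes this kind of analysis hard. The function name `combine` is made up for illustration:

```python
def combine(a, b):
    # Python only decides at runtime what "+" means for these arguments
    return a + b


print(combine(1, 2))            # integer addition
print(combine("sam", "tools"))  # string concatenation
print(combine([1], [2]))        # list concatenation
```

The very same line of code triggers three different operations, so there is little that could be decided ahead of time. A C compiler, in contrast, knows the types of `a` and `b` before the program ever runs.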

So why do we mention this here, given that all we do is write Python code?

Apart from the fact that it is good to know this (you really don't want to implement a competitive short read mapper in Python, for example!), there are two main reasons:

1. Sometimes, you may want to run software inside your Python software that was implemented in another language, and that may well be a compiled language
2. Python offers "bindings" that allow Python programs to contain code written in other languages, for example C; in those cases, the foreign language code may need to be compiled, too (this is quite frequently the case, because typically, people make use of this feature to speed up parts of their software while still wanting to retain the simplicity of Python)

While both of these points are beyond the scope of this course, we will further down discuss solutions that can help you manage dependencies in case you are using software written in different programming languages.

In addition to the listed reasons, it is very important to remember that the **artifacts resulting from compiling code on one machine may be different from those generated on another machine** (which is why they should not be under version control, so remember to populate your `.gitignore` file accordingly).

However, let's first circle back to the second part of the build process we wanted to address: installing dependencies!

# Package managers

Coming back to the dependency problem described above: our software needs specific versions of packages A and B, which in turn depend on some specific versions of packages C, D and E. And so forth. Surely, you don't want to _resolve_ the resulting dependency tree manually. Package managers to the rescue!

Package managers are such crucial pieces of software that they exist for pretty much every programming language. And generally they do not only (try to) resolve software dependencies (i.e., they try to fulfill all the version requirements of all dependencies recursively, if possible), but also allow you to easily install, remove, up- and downgrade packages. Package managers may also be tightly associated with _package repositories_, content management servers that host software packages provided by the community. In Python, the primary registry is called the Python Package Index, or [PyPI](https://pypi.org), for short.

Here, we are looking at two package managers that you will most likely already have heard about:

1. `pip`: A package manager for Python that you can use to install packages available on [PyPI](https://pypi.org/), as well as Python packages on Git servers and local packages
2. `conda`: A Python-based package manager that can be used to install tools implemented in different programming languages, including Python

## Pip - A Python package manager

There are a number of package managers in Python. The most commonly used one is `pip`. It's easy enough to use:

```bash
# install a package
pip install PACKAGE_NAME

# upgrade a package
pip install --upgrade PACKAGE_NAME

# remove a package
pip uninstall PACKAGE_NAME

# show all installed packages
pip freeze
```

Executing this last command, you might be surprised to learn how many packages are actually installed on your system. Very likely the majority of them already came with your Python installation. Others might have been dependencies of the packages you installed.
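You can get a similar overview from within Python itself. The following sketch uses the standard library module `importlib.metadata` (Python 3.8+) to list the distributions that the running interpreter can see. Note that this is not guaranteed to match the output of `pip freeze` line by line:

```python
from importlib import metadata

# collect "name==version" strings for everything the interpreter can see
installed = sorted(
    f"{dist.metadata['Name']}=={dist.version}"
    for dist in metadata.distributions()
    if dist.metadata["Name"]  # skip entries with broken metadata
)

print(f"{len(installed)} distributions found")
for line in installed[:5]:
    print(line)
```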

> **Careful:** Do not go ahead and just randomly install packages. Having too many packages in your _global
> environment_ is generally not a good idea, because different packages can get in the way of one another,
> especially if you need different versions. We will learn below how to avoid that!

Let's see this last bit in practice, because resolving dependencies is one of the most important functionalities of package managers. Say we want to install the web framework [Flask](https://flask.palletsprojects.com/en/2.2.x/). The package name is `flask`, and so we do:

```bash
pip install flask
```

Now, on my reasonably pristine environment, this gave me the following output:

```console
Collecting flask
  Downloading Flask-2.2.2-py3-none-any.whl (101 kB)
     |████████████████████████████████| 101 kB 3.9 MB/s 
Collecting Werkzeug>=2.2.2
  Downloading Werkzeug-2.2.2-py3-none-any.whl (232 kB)
     |████████████████████████████████| 232 kB 17.5 MB/s 
Collecting itsdangerous>=2.0
  Using cached itsdangerous-2.1.2-py3-none-any.whl (15 kB)
Collecting click>=8.0
  Using cached click-8.1.3-py3-none-any.whl (96 kB)
Collecting Jinja2>=3.0
  Using cached Jinja2-3.1.2-py3-none-any.whl (133 kB)
Collecting MarkupSafe>=2.0
  Using cached MarkupSafe-2.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB)
Installing collected packages: MarkupSafe, Werkzeug, Jinja2, itsdangerous, click, flask
Successfully installed Jinja2-3.1.2 MarkupSafe-2.1.1 Werkzeug-2.2.2 click-8.1.3 flask-2.2.2 itsdangerous-2.1.2
```

Importantly, even though I only asked for a single package, a total of six packages were installed. The five packages that I did not specifically ask for are dependencies of Flask, and, therefore, in order to provide a functional Flask installation, these had to be installed as well - and `pip` did that for us. Note that I did not actually need to download some of the packages, because they were already _cached_. We will find out later why that is the case.

It's really convenient that `pip` automatically installs a package's dependencies, as well as _their_ dependencies, if necessary (e.g., `Werkzeug` is a direct dependency of `flask`, but `MarkupSafe` is not; it is a direct dependency of _both_ `Werkzeug` and `Jinja2`).
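To get a feeling for what installing dependencies "in the right order" involves, here is a minimal sketch based on `graphlib.TopologicalSorter` from the standard library (Python 3.9+). The dependency graph is written by hand and simplified (no versions, no optional dependencies), and this is not how `pip` works internally:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hand-written, simplified graph: package -> its direct dependencies
dependencies = {
    "flask": {"Werkzeug", "Jinja2", "itsdangerous", "click"},
    "Werkzeug": {"MarkupSafe"},
    "Jinja2": {"MarkupSafe"},
    "itsdangerous": set(),
    "click": set(),
    "MarkupSafe": set(),
}

# static_order() yields every package only after all of its dependencies
install_order = list(TopologicalSorter(dependencies).static_order())
print(install_order)
```

The exact order can vary between runs, but `MarkupSafe` always comes before `Werkzeug` and `Jinja2`, and `flask` always comes last, much like in the `Installing collected packages` line above.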

Importantly, the reverse operation `uninstall` does _not_ automatically remove the dependencies of the package you uninstall:

```bash
pip uninstall flask  # add the -y flag (short for --yes) to skip the confirmation prompt
```

We get something like this:

```console
Found existing installation: Flask 2.2.2
Uninstalling Flask-2.2.2:
  Would remove:
    /path/to/bin/flask
    /path/to/lib/python3.10/site-packages/Flask-2.2.2.dist-info/*
    /path/to/lib/python3.10/site-packages/flask/*
Proceed (Y/n)? Y
  Successfully uninstalled Flask-2.2.2
```

Only the Flask package was removed!

There isn't all that much more to say about `pip` at this level, but here are some more things
you can do with it:

```bash
# don't actually install, just say what would happen
# only available from pip v22.2 onwards
pip install --dry-run PACKAGE_NAME

# install multiple packages
pip install PACKAGE_1 PACKAGE_2

# use pip to upgrade pip itself
pip install --upgrade pip

# use pip to install the current local package
# requires a setup.py or pyproject.toml file in the directory
pip install .

# same as before, but install in "editable" mode
# use this to install a package that you are currently developing
pip install -e .

# install a Python package from GitHub
# requires Git to be available on the system
# the same general syntax works with GitLab, BitBucket etc., too
pip install git+https://github.com/<owner_name>/<repo_name>.git

# same as previous, but using SSH instead of HTTPS to clone the repository
pip install git+ssh://git@github.com/<owner_name>/<repo_name>.git
```

### Storing your dependencies with your code

Now, there is one more very useful feature of `pip` that we want to focus on: Instead of passing it one or multiple package names, we can also use the `-r` option to pass it a text file containing package names:

```bash
pip install -r requirements.txt
```

Here, each line of the file `requirements.txt` is interpreted by `pip` as the name of a package. There's nothing special about the file, it's just a plain text file. There is also nothing special about the filename, you could name it anything you like. The name `requirements.txt` is really just a convention.

This feature is very useful, because it allows us to commit a file to version control that keeps the dependencies of our software project for us, all in one place. So now when another person tries to install our package, rather than scanning through the `README.md` file to search for hints on what dependencies the software may have, all that people need to do to install an _environment_ that supports the execution of our tool is run the above line.

You may remember that the `setup()` function from the `setuptools` package that is typically used in the `setup.py` file describing a Python package has a similar parameter called `install_requires`, which is used to keep track of the dependencies of a package and which takes a list of package names. You may also remember the DRY (Don't Repeat Yourself) principle. It makes sense not to define your dependencies in two different places, because what if those two different places end up saying different things (and _they will_, eventually)?

So if we wanted to have just a single source of truth, which one should we keep? The answer is, generally, the `requirements.txt` file. Why? Because it's pure in the sense that it does not contain anything other than package names, one per line. This property makes it easy to parse that file. So instead of trying to parse `setup.py` to extract the dependencies and then somehow passing them to `pip` to create our environment for us, we do the opposite, i.e., we parse the `requirements.txt` file in `setup.py` and create the list of package names that we then pass to `install_requires`. This typically looks similar to this (comments were added for clarity):

```py
from pathlib import Path

from setuptools import setup

# get directory where setup.py is located
project_root_dir = Path(__file__).parent.resolve()

# parse package names in requirements.txt and store them in variable
with open(project_root_dir / "requirements.txt", "r", encoding="utf-8") as _file:
    INSTALL_REQUIRES = _file.read().splitlines()

# pass the variable containing the list of package names
# to the install_requires parameter
setup(
    ...
    install_requires=INSTALL_REQUIRES,
    ...
)
```
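Note that `read().splitlines()` passes on every line as is, including blank lines and comments. If your `requirements.txt` contains any of those, you may want to filter them out first. Below is a minimal sketch of such a filter. The function name `parse_requirements` is our own, and the sketch does not handle `pip` options such as `-r` or `-e` inside the file:

```python
def parse_requirements(text):
    """Return requirement lines, skipping blank lines and comments."""
    requirements = []
    for line in text.splitlines():
        line = line.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("#"):
            continue
        requirements.append(line)
    return requirements


example = """\
# web framework
flask

click  # command-line helper
"""
print(parse_requirements(example))
```

In `setup.py`, you could then pass the result of `parse_requirements(_file.read())` to `install_requires`.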

Note that it is good practice to put into the `requirements.txt` file only those packages that your tool really needs to execute. So while it's useful to have linters and other tools installed for development and testing purposes, do not place them in `requirements.txt`. Rather, put them in a separate file `requirements_dev.txt` instead.

End users of your software can then just execute `pip install -r requirements.txt` (provided you don't publish a package and describe your dependencies in `install_requires` as described above, in which case they only need to do `pip install PACKAGE_NAME`), while developers/contributors would install the development/testing requirements as well, with:

```bash
pip install -r requirements.txt -r requirements_dev.txt
```

### Specifying versions

Like most package managers, `pip` allows you to specify not only the name of a package, but also the desired version when installing a package. If you want to know what versions are available for a given package, you can check on PyPI (e.g., see [Flask's version history](https://pypi.org/project/Flask/#history)).

By default, if you only specify the package name, the latest version of that package will be installed. For Flask, at the time of writing that version is `2.2`. Let's say we have a reason to use the latest Flask version with the major version `1`: We can look up the available versions on PyPI (or run `pip index versions PACKAGENAME` if your Pip version is at least 21.2) and we will realize that that release has version `1.1.4`. Let's install that with `pip`:

```bash
pip install flask==1.1.4
```

If you don't want to check on PyPI, you can also specify the requirement in your call like this:

```bash
pip install "flask<2"
```

Or more precisely like this:

```bash
pip install "flask>=1,<2"
```

Available operators for specifying versions are listed in the table below (from the [documentation](https://pip.pypa.io/en/stable/topics/dependency-resolution/)):


| Operator | Description | Example |
| --- | --- | --- |
| `==` | Exactly the specified version. | `==3.1`: only `3.1`. |
| `>` | Any version greater than the specified version. | `>3.1`: any version greater than `3.1`. |
| `<` | Any version lower than the specified version. | `<3.1`: any version lower than `3.1`. |
| `>=` | Any version greater than or equal to the specified version. | `>=3.1`: version `3.1` and greater. |
| `<=` | Any version lower than or equal to the specified version. | `<=3.1`: version `3.1` or lower. |
| `!=` | Any version not equal to the specified version. | `!=3.1`: any version other than `3.1`. |
| `*` | Can be used at the end of a version number to represent all versions of that particular version number. | `==3.1.*`: any version that starts with `3.1`. |
| `~=` | Any "compatible" version. Compatible versions are higher versions that only differ in the final segment. | `~=3.1.2` is equivalent to `>=3.1.2, ==3.1.*`. `~=3.1` is equivalent to `>=3.1, ==3.*`. |

As we have already seen above, **multiple conditions can be chained!** If multiple versions fulfill all conditions, `pip` will install the latest one that fulfills the requirements of all other dependencies.
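To illustrate how chained conditions narrow down the candidates, here is a simplified sketch. It only handles purely numeric versions and the six comparison operators (no `*`, no `~=`, no pre-releases), it compares versions as plain tuples, and the helper names are made up. The candidate list is just a handful of the Flask versions mentioned in this lesson. Real tools implement the full rules for Python version specifiers:

```python
import operator

OPERATORS = {  # two-character symbols first so ">=" is not read as ">"
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
}


def as_tuple(version):
    """Turn '1.1.4' into (1, 1, 4) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, spec):
    """Check a version against comma-separated conditions such as '>=1,<2'."""
    for condition in spec.split(","):
        condition = condition.strip()
        for symbol, compare in OPERATORS.items():
            if condition.startswith(symbol):
                wanted = as_tuple(condition[len(symbol):])
                if not compare(as_tuple(version), wanted):
                    return False
                break
        else:
            raise ValueError(f"unsupported condition: {condition!r}")
    return True


candidates = ["0.12.5", "1.1.4", "2.0.0", "2.2.2"]
matching = [v for v in candidates if satisfies(v, ">=1,<2")]
best = max(matching, key=as_tuple)
print(best)
```

Of the four candidates, only `1.1.4` fulfills both conditions, so that is the version that gets picked.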

#### Advice on specifying versions

So now that we can "pin" versions of a dependency, **how should we specify versions or version ranges in our requirements files?**

Are there reasons why we wouldn't always want to use the latest version of every dependency? Absolutely!

- If we write our software against a specific behavior of a given dependency, our code may break if the API of the dependency breaks. Our function calls, e.g., just won't work anymore, maybe because we are missing a required parameter all of a sudden, or we are using a parameter that was dropped or renamed.
- Not locking versions may limit reproducibility, because different versions of different tools may potentially yield different results!

> Note that these reasons underline the importance of having both unit and functional/integration tests available
> for your code. Without them, you can never safely update version numbers.

So are there reasons why we wouldn't always want to just lock the precise versions of each dependency? Unfortunately, the answer here is also "yes"!

- If everyone used strict _version locking_ for their packages, we would very quickly end up in situations where different packages cannot be used together anymore. Imagine a situation where your code depends on packages A and B, but package A _also_ requires package B. Now, if your code and package A can each only play with one specific version of package B and that version requirement differs, you are out of luck!
- Another problem is security: It is not unusual that new versions of a software package are released to patch security vulnerabilities that have been uncovered. If you lock your dependency versions strictly, this means that to guard your software against security vulnerabilities, you will need to update your dependencies (and maybe create a new release) each time that a vulnerability is identified in any of your packages, which can quickly become very tedious.

So what to do?

Quoting Thomas Sowell:

> _"There are no solutions. There are only trade-offs."_

Indeed, **both _version locking_ and _version promiscuity_ can lead you to dependency hell**, and therefore you need to find a reasonable balance.

A **good middle ground** is often to make use of the `~=` operator to denote that any _compatible_ (higher) version may be used. For example, we start developing against version `1.1` of package `dependonme`. To be on the safe side, we don't want to guarantee that older versions will work just as well. However, given that we trust that `dependonme`'s developers are such nice chaps and strictly adhere to [Semantic Versioning](https://semver.org/), we also believe that _future_ minor versions likely won't give us many headaches. On the other hand, we would expect breaking changes from a new major version release (`2.0` in this case), and these may very well break our code - or not, it's hard to say (though sometimes, package authors will be nice enough to issue _deprecation warnings_ that forecast breaking changes for the next major release).

So in that case we would specify `dependonme~=1.1` in our requirements file. Now, `1.1`, `1.1.1`, `1.2`, `1.3` and `1.23.19` etc. are all good to use, but for example `1.0.9` and `2.0.0` are not.
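If you would like to convince yourself of this, here is a rough approximation of the `~=` rule for purely numeric versions. The function name `is_compatible` is our own, and the sketch ignores pre-releases and other special cases that real tools handle:

```python
def as_tuple(version):
    """Turn '1.23.19' into (1, 23, 19) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_compatible(version, minimum):
    """Approximate `~=minimum`: at least `minimum`, same leading segments."""
    candidate, floor = as_tuple(version), as_tuple(minimum)
    if len(floor) < 2:
        raise ValueError("~= needs at least two segments, e.g. 1.1")

    # pad with zeros so that, e.g., 1.1 and 1.1.0 are treated alike
    width = max(len(candidate), len(floor))

    def pad(parts):
        return parts + (0,) * (width - len(parts))

    prefix = floor[:-1]  # everything but the final segment must match
    return pad(candidate) >= pad(floor) and candidate[:len(prefix)] == prefix


for version in ["1.0.9", "1.1", "1.1.1", "1.2", "1.3", "1.23.19", "2.0.0"]:
    print(version, is_compatible(version, "1.1"))
```

Only `1.0.9` and `2.0.0` are rejected, in line with what we said above.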

However, this "fuzzy version matcher" is just a good starting ground. There are other considerations that may influence your decision about how to best specify dependency versions for your code, e.g.:

- Your software is a _library_, i.e., you intend it to be used by developers for many different use cases; in that case you want to prioritize compatibility with many different environments.
- Your software is a _web service_, in which case it may be particularly vulnerable to security issues.
- Your software critically depends on an abandoned package that is not maintained anymore, thus adding constraints to the versions you support (especially if that package depends on one or more other packages that you also rely on).

Please also note that there is absolutely no reason why you should use the same strategy for all packages. You can mix and match strategies as you like! For example, consider that your software uses a package for short read mapping, as well as a bunch of linters. Now, for reproducibility reasons, you may be wary of being too promiscuous with regard to the short read mapper version, because upgrades in that dependency may very well lead to different outcomes for your software. On the other hand, staying up to date with the linters will never have an impact on the reproducibility of your software, and therefore you can be a lot more lenient with your version requirements for these.

In the end, **specifying versions is a trade-off across maintainability, usability, security and reproducibility. There is no one best way and you will need to figure it out for yourself given your specific requirements and preferences.**

> Lastly, please also be aware that **Python versions** themselves have a [supported
> lifespan](https://devguide.python.org/versions/), currently of around 5 years. If you develop and
> maintain a tool for a long time, you may occasionally want to update the supported Python version (or
> versions). When you do so, make sure to check out the features of the latest Python versions and consider
> updating your codebase to make use of them.

### Virtual environments

Let's turn back to the problem of _dependency hell_ and our advice/warning not to just randomly install packages into your _global environment_. We have just seen that carefully setting version ranges for your dependencies can help avoid this problem - but only to an extent. The general problem still exists that we cannot install multiple versions of the same package. _Or can we?_

If we look at our _global_ environment, i.e., all the packages that our Python system installation knows about, then the answer is: no, we can't! However, we are, in fact, able to use _namespacing_ as a way to keep packages encapsulated from one another. In this way, we can effectively create what is known as _virtual environments_.

There are multiple tools to create and manage virtual environments in Python. One of them is called `venv` and comes preinstalled with Python.

Let's have a look how it works:

```bash
# create virtual environment
# this creates a directory 'name_of_environment' in the current directory
python -m venv name_of_environment

# activate the virtual environment
source name_of_environment/bin/activate
```

We are now using a _virtual environment_. Try out `pip freeze` and you will see - there are no packages installed here. All is clean, very nice!
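You can also ask Python itself whether it is running inside a virtual environment. The sketch below relies on the fact that, for environments created with `venv`, `sys.prefix` points to the environment directory while `sys.base_prefix` still points to the underlying Python installation. The helper name `in_virtualenv` is made up, and environments created by other tools may behave differently:

```python
import sys


def in_virtualenv():
    # inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the underlying installation
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("virtual environment active:", in_virtualenv())
```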

Let's go ahead now and install some older version of Flask in our virtual environment:

```bash
pip install "flask <1"
```

Executing `pip freeze` again now gives us:

```console
click==8.1.3
Flask==0.12.5
itsdangerous==2.1.2
Jinja2==3.1.2
MarkupSafe==2.1.1
Werkzeug==0.16.1
```

Cool!

Now let's deactivate our virtual environment with:

```bash
deactivate
```

To prove a point here, let's create a _different_ virtual environment and install a different version of Flask in it:

```bash
python -m venv newer_flask
source newer_flask/bin/activate
pip install flask==2.0
```

Let's see what `pip freeze` gives us:

```console
click==8.1.3
Flask==2.0.0
itsdangerous==2.1.2
Jinja2==3.1.2
MarkupSafe==2.1.1
Werkzeug==2.2.2
```

So now we have _different_ environments, each with a different version of Flask! Note that the Werkzeug version also changed drastically, because the old Flask version needs an older version of Werkzeug to work. Likewise, the newer Flask version relies on a newer version of Werkzeug.

If we work on different projects, say two Flask-based web apps, one being a bit older, the other newer, we now don't have a problem anymore: we simply use virtual environments to switch between these apps and install in each the corresponding dependencies!
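Virtual environments are really just directories, and you can create them from Python as well. The following sketch uses the `venv` module from the standard library to create two independent environments in a temporary location. The environment name `older_flask` is made up for illustration, and `pip` is left out to keep the example fast:

```python
import tempfile
import venv
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    created = []
    for name in ("older_flask", "newer_flask"):
        env_dir = Path(tmp_dir) / name
        # with_pip=False keeps the sketch fast; pass True to bootstrap pip
        venv.create(env_dir, with_pip=False)
        # every environment records its settings in pyvenv.cfg
        created.append((env_dir / "pyvenv.cfg").exists())
        print(name, "->", sorted(p.name for p in env_dir.iterdir()))
```

Each environment gets its own directory tree, which is why packages installed in one do not interfere with the other.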

**_Now that you know about virtual environments, don't install any packages in your global environment anymore!_** Keeping your global environment clean keeps your system healthy and stable and allows you to avoid at least one of the hells of dependency management.

> You may have wondered what you can do if you somehow work on a **single project that requires two or more
> versions of the same software**. The short answer would be that you should try your best to avoid this scenario
> at all cost. If you absolutely _can't_ avoid it, we know of no easy way to handle that (please let us know if
> you find one). The hard way would require you to manually install the different package versions under
> different names, then patch your codebases (and/or your dependencies' codebases) to make use of those different
> versions, for example by adding import aliases (see [here](https://stackoverflow.com/a/6572017/20831802)
> for more details and a concrete example).

## Conda - A more universal package manager

If all you are doing falls neatly within the Python ecosystem, using `pip`, `venv` and `requirements.txt` etc. will be just fine and there isn't really much need to read on, at least not from a dependency management point of view. 

However, if you answer any of the following questions with "no", then we recommend that you learn about [Conda](https://conda.io/projects/conda/en/latest/index.html), a Python-based package manager that also provides packages that are not implemented in Python:

- Is all code you write written in Python (or has Python bindings)?
- Does your code _only_ depend on Python packages?
- _**Are you only using software written in Python for your work?**_

Still here? Got you with the last question? :)

Indeed, Python may be the only programming language you use to develop, and as a general-purpose programming language with a huge ecosystem, it may well cover all your programming needs. But surely you use a variety of software in your work, and surely not all of it is written in Python (thank God!).

Conda has answers for dealing with non-Python dependencies where `pip` & co fall short. But what _is_ Conda?

Conda is an open-source package management system and environment management system that runs on Windows, macOS, and Linux. Conda installs and updates packages and their dependencies, not unlike `pip`. Conda can also be conveniently used to create and manage virtual environments on your machine. While it was originally created for Python, _Conda can be used to package and distribute software for any language!_

The `conda` package and environment manager is included in all versions of Anaconda and Miniconda. Whereas `conda` automates the process of managing packages and environments, Anaconda and Miniconda are software _distributions_, each of which contains a collection of pre-built and pre-configured packages that can be used on the system.

To use `conda`, you need to first install either of these distributions. What's the difference?

- [Anaconda](https://www.anaconda.com/products/distribution) comes with `conda` as well as a very wide range of commonly used data science libraries to get you started, such as [NumPy](https://numpy.org/) (N-dimensional array for numerical computation), [SciPy](https://www.scipy.org/) (Scientific computing library for Python), [Matplotlib](https://matplotlib.org/) (2D Plotting library for Python), [Pandas](https://pandas.pydata.org/) (Powerful Python data structures and data analysis toolkit), [Seaborn](https://seaborn.pydata.org/) (Statistical graphics library for Python), [Bokeh](https://bokeh.org/) (Interactive web visualization library), [Scikit-Learn](https://scikit-learn.org/) (Python modules for machine learning and data mining), [NLTK](https://www.nltk.org/) (Natural language toolkit) and [Jupyter Notebook](https://jupyter.org/) (Web app that allows you to create and share documents that contain live code, equations, visualizations and explanatory text). But it doesn't stop with Python packages, as it will also add, for example, [R essentials](https://docs.anaconda.com/anaconda/packages/r-language-pkg-docs/), a collection of 80+ of the most used R packages for data science. Altogether, Anaconda comes with more than 1'500 packages!
- [Miniconda](https://docs.conda.io/en/latest/miniconda.html) is a "free minimal installer for `conda`. It is a small, bootstrap version of Anaconda that includes only `conda`, Python, the packages they depend on, and a small number of other useful packages, including `pip`, `zlib`."

_So which one to use?_

For beginners, Anaconda can be very convenient. However, it is huge, and very likely you are not going to need the vast majority of the packages it comes with (at least not anytime soon). So for the level of this course and beyond, **we strongly recommend that you install Miniconda** to keep your system crisp and clean.

###  Installing Miniconda

To install Miniconda, please follow the instructions in the [official user guide](https://docs.conda.io/projects/conda/en/latest/user-guide/install/). Of course make sure to use the instructions for your operating system.

For example, for Linux (at least for _most_ Linux machines, depending on your CPU architecture), you would first download the following Miniconda3 installer:

```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
```

This will download a file called `Miniconda3-latest-Linux-x86_64.sh`. You can make sure that the downloaded file is complete and valid by computing its hash sum and comparing it against the expected one published on the Miniconda website. The SHA256 hash sum for the latest "regular" Miniconda installer (`Miniconda3 Linux 64-bit`) at the time of writing is:

```console
8d936ba600300e08eca3d874dee88c61c6f39303597b2b66baee54af4f7b4122
```

To check the SHA256 hash sum of the downloaded file, you can do:

```bash
sha256sum Miniconda3-latest-Linux-x86_64.sh
```

This gave me the following output:

```console
8d936ba600300e08eca3d874dee88c61c6f39303597b2b66baee54af4f7b4122  Miniconda3-latest-Linux-x86_64.sh
```

All good!
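
If `sha256sum` is not available on your system, the same check can be sketched in a few lines of Python using only the standard library. The helper names `sha256_of_file` and `matches_expected` are our own; the expected value is the one quoted above and will differ for newer installers:

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read chunk by chunk so that large installers need not fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Return True if the file's digest equals the published hash sum."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    installer = Path("Miniconda3-latest-Linux-x86_64.sh")
    # Hash sum quoted in the text; replace with the currently published one
    expected = "8d936ba600300e08eca3d874dee88c61c6f39303597b2b66baee54af4f7b4122"
    if installer.exists():
        print("All good!" if matches_expected(installer, expected) else "MISMATCH!")
    else:
        print(f"{installer} not found, download it first")
```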

Next, let's make use of the downloaded script file to install Miniconda:

```bash
bash Miniconda3-latest-Linux-x86_64.sh
```

This will produce quite a big log message. Among other things, it will tell you which versions of which packages were installed. Have a look if you are interested.

The last part of the log is actually important:

```console
==> For changes to take effect, close and re-open your current shell. <==

If you'd prefer that conda's base environment not be activated on startup,
   set the auto_activate_base parameter to false:

conda config --set auto_activate_base false

Thank you for installing Miniconda3!
```

First of all, close and re-open your shell before you use Conda!

Also, now that we have Miniconda installed, whenever we open a shell, a Conda environment is always activated for us! While this goes a great way to ensure that you are not installing software in your global environment (which really you should only do for software that you need on a daily basis, like your browser or code editor), it may not necessarily be what you want. Use the command indicated in the log to disable that feature.

To make sure everything worked out fine with the installation, after starting a new shell, type:

```bash
conda info
```

We can see, among other things,

- the location where `conda` is installed
- whether the `base` environment is active or not
- the installed `conda` version
- the default Python version used
- the path where the virtual environments are stored

###  Conda basics

Basic usage of `conda` is quite similar to that of `pip`, as you can install, update or remove packages:

```bash
conda install PACKAGE_NAME
conda update PACKAGE_NAME
conda remove PACKAGE_NAME
```

Note that unlike `pip`, `conda remove` will remove a package's dependencies as well, provided they are not being used by any other package in the environment.

Similar to `pip`, you can use `conda` to update `conda`:

```bash
conda update conda
```

Also similar to `pip`, you can specify versions when installing or updating packages:

```bash
conda install PACKAGE_NAME=1.2
```

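A single `=` performs a fuzzy (prefix) match, so the above command would install a `1.2.*` release of `PACKAGE_NAME` (e.g., `1.2`, `1.2.0` or `1.2.7`, but not `1.3`). This corresponds to `pip`'s `==1.2.*` rather than to `~=1.2`, which would also allow `1.3`.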

`conda` also defines the `|` operator, which allows us to specify a set of versions that are not part of a range. For example, the following command would install _either_ Flask version `1.1.4` _or_ `2.2`:

```bash
conda install "flask=1.1.4|2.2"
```

Otherwise, version match specifications are similar to those used in `pip`, with some minor differences. See [here](https://conda.io/projects/conda-build/en/latest/resources/package-spec.html#package-match-specifications) for the full, authoritative information.

In many cases, installation via Pip doesn't actually build software, but rather it fetches prebuilt packages that were created with `python -m build` before uploading a release. However, this doesn't always work out that way and so, more generally, `pip` may still be required to build. This is not the case for `Conda` - it _only_ stores prebuilt packages! It does so based on the assumption that while building software on different machines leads to different executables (often referred to as _binaries_), there aren't actually all that many different architectures out there - at least if we are fine with targeting the vast majority of users. So Conda is basically a registry of binaries, prebuilt for the most commonly used systems. While this may lead to cases where a binary for your particular system is not available (rare, unless you have a really unusual machine), it has the added benefit that installing (but not resolving, see below!) packages is relatively fast.
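
To get a feeling for what "prebuilt for the most commonly used systems" means in practice, the following sketch guesses which Conda platform label applies to your machine. The mapping is an assumption on our part and deliberately incomplete (it covers only a handful of common operating system and CPU combinations); `guess_conda_platform` is a hypothetical helper, not part of Conda:

```python
import platform

# Assumed, incomplete mapping from (operating system, CPU architecture), as
# reported by Python's `platform` module, to Conda platform labels
_PLATFORM_LABELS = {
    ("Linux", "x86_64"): "linux-64",
    ("Linux", "aarch64"): "linux-aarch64",
    ("Darwin", "x86_64"): "osx-64",
    ("Darwin", "arm64"): "osx-arm64",
    ("Windows", "AMD64"): "win-64",
}


def guess_conda_platform(system=None, machine=None):
    """Return the Conda platform label for a system, or None if unknown."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return _PLATFORM_LABELS.get((system, machine))


if __name__ == "__main__":
    label = guess_conda_platform()
    print(f"Prebuilt packages would be fetched for: {label or 'unknown platform'}")
```

If the function returns `None` for your machine, you may be on one of the less common systems for which fewer prebuilt packages exist.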

### Channels & Bioconda

One nice feature of Conda is the concept of (and support for) channels. You can imagine Conda channels as different PyPI deployments, like the regular and the testing one. Packages available on the one are not necessarily available on the other, and when you don't tell the package manager in which "channel" to look, it may not find the package you are looking for. In short, channels represent different collections of packages/tools.

The [Anaconda website](https://anaconda.org) allows you to search across public Conda channels. Sometimes, packages are available on multiple channels, although you may find the different channels providing different versions of a package. Conda's default channel is `defaults` - it represents a rather small collection of trusted and commonly used packages. The biggest channel is the `conda-forge` community channel. Here, you can find all kinds of packages that users (not necessarily maintainers) contributed, because they really wanted a certain tool to be available via Conda. The `conda-forge` channel is great for finding software that isn't so commonly used, but given that a lot of the packages on there are not officially maintained, you may not find all versions of a given tool.

Apart from `defaults` and `conda-forge`, surely the most important channel for us as bioinformaticians is the [`bioconda`](https://bioconda.github.io/) channel - a collection of bioinformatics tools.

When installing a package, you can specify the channel with the `-c` parameter:

```bash
conda install -c bioconda kallisto
```

### Conda environments

As mentioned, Conda is also an environment manager, so no need for a separate package like `venv`. In fact, Conda is always used in the context of an environment, i.e., you cannot install a Conda package _outside_ of an environment.

A Conda environment is a folder/directory that contains a specific collection of Conda packages and their dependencies, so they can be maintained and run separately without interference from other environments. For example, you may use a Conda environment only for a particular project, with a particular Python version and dependencies, and another Conda environment with another Python version and dependencies.

> If you don't create and activate a project-specific environment, any packages you add via `conda` will be
> installed in the `base` environment. However, we strongly recommend you treat Conda's `base` environment
> like the global environment - keep it as clean as possible and prefer to install your dependencies in
> project-specific environments.

Here are the most common commands for managing environments:

```bash
# list the available environments
conda env list

# create an environment using Python 3.11
conda create --name my-env python=3.11

# activate an environment
conda activate my-env

# list the packages installed in the current environment
conda list

# deactivate the current environment
conda deactivate

# delete an environment
conda env remove --name my-env
```

Conda also has an equivalent of Pip's requirements files. However, unlike Pip's, these files use the YAML format and are stored, by convention, in a file named `environment.yml` (or `environment_dev.yml`), although other names are possible.

A Conda environment file may look like this:

```yaml
name: my_env
channels:
  - bioconda
  - conda-forge
dependencies:
  - biopython>=1.78
  - kallisto>=0.46.1
  - pandas>=1.0.5
  - pip>=20.2.3
  - pyahocorasick>=1.4.0
  - pydantic>=1.8.1
  - pysam>=0.16.0
  - python>=3.6,<=3.10
  - star>=2.7.6
```

> Note: The order in which channels are listed indicates the order in which those channels are searched through to
> find the individual packages. So if a package with a matching version is available on multiple channels, it is
> installed from the channel furthest up on the list of channels.
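
The channel lookup order described in the note can be sketched as follows. This is a deliberately simplified model: the channel contents in `TOY_INDEX` are made up for illustration, `pick_channel` is a hypothetical helper, and the real Conda solver additionally takes versions, dependencies and its channel priority setting into account:

```python
from typing import Dict, List, Optional, Set

# Made-up channel contents, for illustration only
TOY_INDEX: Dict[str, Set[str]] = {
    "bioconda": {"kallisto", "star", "pysam"},
    "conda-forge": {"pandas", "pysam", "python"},
}


def pick_channel(
    package: str, channels: List[str], index: Dict[str, Set[str]]
) -> Optional[str]:
    """Return the first channel, in listed order, that provides the package."""
    for channel in channels:
        if package in index.get(channel, set()):
            return channel
    return None  # package not available on any of the listed channels


if __name__ == "__main__":
    channels = ["bioconda", "conda-forge"]
    for package in ("kallisto", "pandas", "pysam", "not-a-package"):
        print(f"{package}: {pick_channel(package, channels, TOY_INDEX)}")
```

In this toy model, `pysam` is available on both channels and is therefore taken from whichever channel is listed first.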

You can install an environment using an environment file like this:

```bash
conda env create -f environment.yml
```

### Using Conda with Pip

Conveniently, Conda environments give you Python environments for free. So if you are used to working with `pip` and you can't be bothered to check if Python package `xyz` is available on Conda (and in the version you need) or if it is indeed not available, you can simply use `pip` for installation in your Conda environment. So to make this clear: Whenever you have a Conda environment activated, installing a package with `pip` will install that package in the active Conda environment, _not_ in the global environment. Yay!
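
If you want to convince yourself of this before installing anything, you can compare the running interpreter's location with the active Conda environment. The sketch below assumes the environment variables `CONDA_PREFIX` and `CONDA_DEFAULT_ENV`, which `conda activate` sets; the function names are our own:

```python
import os
import sys


def active_conda_environment(environ=None):
    """Return the name of the active Conda environment, or None."""
    environ = os.environ if environ is None else environ
    return environ.get("CONDA_DEFAULT_ENV")


def python_lives_in_conda_env(environ=None, prefix=None):
    """Return True if the interpreter prefix is the active Conda environment.

    If this holds, `python -m pip install ...` run with this interpreter
    installs into the Conda environment rather than somewhere else.
    """
    environ = os.environ if environ is None else environ
    prefix = sys.prefix if prefix is None else prefix
    conda_prefix = environ.get("CONDA_PREFIX")
    if not conda_prefix:
        return False  # no Conda environment is active
    return os.path.realpath(prefix) == os.path.realpath(conda_prefix)


if __name__ == "__main__":
    print(f"Active Conda environment: {active_conda_environment()}")
    print(f"Python runs from it: {python_lives_in_conda_env()}")
```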

You can even instruct Conda to use `pip` for installing certain packages by including something like the following in your `environment.yml` file:

```yaml
name: my_environment
channels:
  ...
dependencies:
  ...
  - pip:
    - flask~=2.1  # this will install a Flask version >=2.1 and <3
    - .  # this will install the current package
```

> Note, however, that `pip` support in Conda is experimental and comes with some caveats (see the
> [Conda documentation](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-pkgs.html#installing-packages-from-anaconda-org)
> for details). Therefore, if your Python dependencies are available on a Conda channel (which is not always
> the case!), it may be better to use the native Conda packages instead.

# Containers

Now, it's nice that we are able to share recipes to create environments for using or contributing to our tools by way of `requirements.txt` or `environment.yml` files to be used with Pip or Conda, respectively. But wouldn't it be nice if instead of sharing these recipes, we could share ready-made environments themselves?

Using Linux container technology, this is indeed possible!

As such, **containers** are a really important concept for reproducible data science. A container, just as its name suggests, is an encapsulated collection of software that can be run on any Linux machine in a manner that almost entirely isolates it from the host system, while still allowing several host services and features to be used (thus avoiding the need to include an _entire_ operating system inside of a container, as is done in _virtual machines_). Containers may include code, environment variables, system libraries, data, etc.: in principle, everything that is necessary to perform the functions that the container was built to support, such as the execution of a particular bioinformatics tool.

So, importantly, container _images_ can be created for any kind of tool!

## A (very) short history of containers

Although containers are a [fairly new](https://blog.aquasec.com/a-brief-history-of-containers-from-1970s-chroot-to-docker-2016) concept (the underlying technology was introduced in Linux in 2008), the impact that the isolation of environments has had, especially on systems administration, is enormous, so that by now the landscape of tooling around containers is very complex.

So let's focus on what's important for our use cases, at the risk of possibly oversimplifying a few things.

Container technology achieves a sort of separation of environments that sits somewhere between the sort of virtual environments we get with `venv` and Conda on the one hand, and virtual machines that completely separate entire systems via a software called a hypervisor on the other hand. Effectively, containers do make use of host resources, alleviating the need to package _everything_ inside a container, such that they can remain reasonably small and can be fired up quickly. Yet containers are isolated to such an extent from one another, that leakages of any sort from one container to another are minimized.

## What is Docker?

The first company that began to fully exploit the potential of container technology commercially was Docker Inc. When its product, Docker, first emerged in 2013, containers rapidly began gaining traction. But what _is_ Docker?

From the [Docker website](https://docs.docker.com/get-docker/):

> Docker is an open platform for developing, shipping, and running applications. Docker enables you to separate your applications from your infrastructure so you can deliver software quickly. With Docker, you can manage your infrastructure in the same ways you manage your applications. By taking advantage of Docker’s methodologies for shipping, testing, and deploying code quickly, you can significantly reduce the delay between writing code and running it in production.

Nowadays, there are different container types, from [Docker](https://www.docker.com/) to [Singularity](https://sylabs.io/singularity/) to [Podman](https://podman.io/), different container _runtimes_, and different tools for orchestrating the execution of containers (the most important one being [Kubernetes](https://kubernetes.io/)). The ecosystem around container technology quickly became so rich and the impact of containers so substantial that it created an entirely new, highly sought-after job: the DevOps engineer, a sort of mixture between a software developer and a systems administrator.

But we are digressing. Let's focus on how we can _use_ containers. For this (and the following parts, minus one small digression), we will focus on Docker only. And that means **Docker**, the tool, and Docker, the container type/format, which are still among the most widely used container technologies to this day.

## Installing Docker

To get our feet wet with containers, let's install the Docker Engine. Follow the [instructions from the website](https://docs.docker.com/engine/install/) for your particular operating system.

An overview of the installation options for all platforms is available [here](https://docs.docker.com/get-docker/). It is important to realize that once installed, Docker runs as a "daemon" on your system, i.e., it is always on, managing your containers.

> **Important:** Note that while Docker is available on Mac and Windows, neither of these is a native
> implementation. Container technology simply does not exist on these operating systems, at least not fully.
> Therefore, as of 2022, the Docker clients for Mac and Windows are still effectively virtual machines that run an
> embedded version of Linux. This is important to remember, because there are still limitations on what you can
> do with Docker when using it on these operating systems.

If everything went well, you should be able to run the following command:

```bash
docker version
```

In my case this gave me the following output:

```console
Client: Docker Engine - Community
 Version:           24.0.7
 API version:       1.43
 Go version:        go1.20.10
 Git commit:        afdd53b
 Built:             Thu Oct 26 09:08:01 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          24.0.7
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.10
  Git commit:       311b9ff
  Built:            Thu Oct 26 09:08:01 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.24
  GitCommit:        61f9fd88f79f081d64d6fa3bb1a0dc71ec870523
 runc:
  Version:          1.1.9
  GitCommit:        v1.1.9-0-gccaecfc
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
```

## Running our first container

Let's say we wanted to run [BEDtools](https://hub.docker.com/r/biocontainers/bedtools/), but given that we only need it this once, we don't want to bother with installing it on our system.

With the Docker client available on our system now, we can do a quick Google search to identify a container image that already has BEDtools installed. The [image we find](https://hub.docker.com/r/biocontainers/bedtools/) is `biocontainers/bedtools` and it lives on the Docker image registry [Docker Hub](https://hub.docker.com/).

> There are probably plenty of other Docker images with BEDtools on it, but container images from BioContainers
> are always a good source, as [BioContainers](https://biocontainers.pro/) has long-term plans (and funding!) for
> preserving container images for bioinformatics, so short of an official BEDtools image provided by the tool's
> maintainers, this is the best source we can get.

So let's pull the Docker image to our local Docker client. But first we need to decide exactly which image version to pull (if we don't do that, we will always pull the image tagged `latest`, which may or may not be the latest version - it's usually better to be explicit about versions when running software!).
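
The fallback to `latest` can be illustrated with a small sketch that splits an image reference into its name and tag. This is a simplified model of how such references are read, not Docker's actual parser: the helper `split_image_reference` is our own, and references that pin an image by digest (`name@sha256:...`) are not handled:

```python
def split_image_reference(reference):
    """Split an image reference into (name, tag), defaulting to 'latest'.

    A colon only separates a tag if it comes after the last slash; otherwise
    it belongs to a registry address such as `localhost:5000`.
    """
    last_slash = reference.rfind("/")
    last_colon = reference.rfind(":")
    if last_colon > last_slash:
        return reference[:last_colon], reference[last_colon + 1:]
    return reference, "latest"


if __name__ == "__main__":
    for ref in ("biocontainers/bedtools:2.25.0", "biocontainers/bedtools"):
        name, tag = split_image_reference(ref)
        print(f"{ref!r} -> name={name!r}, tag={tag!r}")
```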

You can find the available versions on the ["Tags"](https://hub.docker.com/r/biocontainers/bedtools/tags) page of the Docker Hub repository for the image. For our purpose, one is as good as another, so let's use the image with tag `2.25.0`. To retrieve this image, we run:

```bash
docker pull biocontainers/bedtools:2.25.0
```

This will take a while, and you can use the time to look at the screen output. What you will see is that the Docker client first downloads and then extracts different _layers_ of the image. This is crucial, because Docker's _layered_ design comes with advantages and disadvantages, some of which we will learn about.

Let's start with one of the disadvantages: Where did the Docker client pull the image? It's hard to say! A Docker image isn't a single file. The Docker client manages the different layers and knows how they fit together to form an image, but there isn't just a single place where that image lives, and therefore you can't simply copy it to a thumb drive and give it to someone else (you would first have to export it to an archive with `docker save`, which the recipient then imports with `docker load`). In practice, Docker images are therefore almost always shared via dedicated Docker registries, such as the Docker Hub.
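
As for the advantages, one of them is that layers shared between images only need to be downloaded once. The following is a purely conceptual sketch of that idea: the layer names are made up, `pull` is a hypothetical stand-in for what the Docker client does, and real layers are identified by content digests rather than readable names:

```python
from typing import List, Set


def pull(image_layers: List[str], local_store: Set[str]) -> List[str]:
    """Add an image's layers to the local store; return those downloaded."""
    downloaded = []
    for layer in image_layers:
        if layer not in local_store:
            # Only layers that are not yet available locally are fetched
            local_store.add(layer)
            downloaded.append(layer)
    return downloaded


if __name__ == "__main__":
    store: Set[str] = set()
    # Two hypothetical images that share their first two layers
    tool_a = ["debian-base", "python-runtime", "tool-a"]
    tool_b = ["debian-base", "python-runtime", "tool-b"]
    print("First pull downloads: ", pull(tool_a, store))
    print("Second pull downloads:", pull(tool_b, store))
```

In this model, the second pull only fetches the single layer that differs, which is why pulling a second image built on the same base is typically much faster.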

Once the image is pulled, we can verify that our Docker client knows about it by running the following command:

```bash
docker images
```

We should have an entry for the BEDtools container, something like this:

```console
biocontainers/bedtools    2.25.0    01e329eb92a8    4 years ago    1.19GB
```

Apart from the image name and tag, we get a hash sum of the container. We can use this to verify that we got _exactly_ what we wanted to fetch by comparing the hash sum to the one shown on Docker Hub for that image version. We can also see that the image was built years ago and is actually quite big - 1.19GB just for one tool. This is partly because environments can indeed be big (there are quite a number of libraries that a tool like BEDtools needs). But it's mostly because BioContainers are built automatically via Conda. So rather than just including only what's necessary in the image, it contains Conda and many other unnecessary things. In comparison, the [BEDtools 2.27 image that we built ourselves](https://hub.docker.com/layers/zavolab/bedtools/2.27.0/images/sha256-7e3999f28d6960dfff02fda7ac259defcab41bc8dc74ce095910b536333ae70d?context=explore) in the Zavolan lab is only about 265MB in size. This is good to know - but still it's useful to use BioContainers, because unlike the Zavolan lab image, the BioContainers image is likely to still be around in 5, 10 or perhaps even 20 years!

Okay, now that we pulled the image, we next want to run a container from it (this is the terminology here: we have a container _image_, but we still need to create a container from that image). The basic command to do that is `docker run`, and we use it like this for our example:

```bash
docker run biocontainers/bedtools:2.25.0
```

Hmmm, we seem to be getting the BEDtools help screen when executing that:

```console
bedtools: flexible tools for genome arithmetic and DNA sequence analysis.
usage:    bedtools <subcommand> [options]

The bedtools sub-commands include:

[ Genome arithmetic ]
    intersect     Find overlapping intervals in various ways.
    window        Find overlapping intervals within a window around an interval.
    closest       Find the closest, potentially non-overlapping interval.
    coverage      Compute the coverage over defined intervals.
<OUTPUT TRUNCATED>
```

Interesting.

So what just happened? The image that we pulled is actually "primed" in a way to run BEDtools. That is what it was built for. So when running the image, we are actually executing `bedtools` inside it. As we didn't pass any arguments, that's it - `bedtools` is started, and finishes, and the container with it.

We can actually see that no container is running by executing the following command:

```bash
docker ps
```

Unless you have been using Docker before, you should not get any output. Try the following command instead:

```bash
docker ps --all
```

Now we should see output similar to this:

```console
7a52d2148aa0    biocontainers/bedtools:2.25.0    "bedtools"    About a minute ago    Exited (0) 3 seconds ago
```

This confirms that we actually did have a `biocontainers/bedtools:2.25.0` container running a little while ago.

We could now go on and tell you how you can actually run BEDtools with some inputs, but it would require us to go quite a bit deeper into the workings of containers and Docker, only for you to find out that it is a bit too tedious to do on an everyday basis whenever you want to run some tool. You absolutely _could_ do that, with some utility, if, for example, you don't want to install a certain tool in your global environment just to run it once or twice, and would rather run it isolated as a container. However, nowadays Linux distributions like Ubuntu already use a way of installing apps that uses some aspects of container technology to achieve isolation and prevent cluttering of the "global environment", so installations are better isolated from one another compared to how things were done in the past, and installing and uninstalling apps on such systems basically leaves no traces. So just installing BEDtools the conventional way these days already comes with most of the same benefits, and pulling a ready-made app including its environment from the registry is not much faster or easier than installing that same app and creating your own environment for it.

So why are we introducing container technologies then?

## Containers in bioinformatics

The main reason why containers became so popular in bioinformatics is the **improved reproducibility** that they provide. Unlike Conda, for example, which will fetch the appropriate binary for your system, leading to slightly different environments on different machines, container images very nearly ship entire operating systems (this is a simplification!): Apart from kernel code and a few other host system dependencies, the software environment is _exactly_ the same for everyone using the container. And as you can imagine, this greatly reduces the risk of receiving different outputs when different users run the same software in the same way on different machines.

> To be clear: There are still cases where even containers produce different outcomes, either due to the remaining
> software differences across different machines, or because of different hardware (e.g., chip architectures).
> However, using containers is vastly superior in terms of reproducibility when compared to
> shipping `requirements.txt` or `environment.yml` files for Pip and Conda, respectively, let alone not providing
> any concrete information on dependencies and their versions.

As we have already pointed out, running individual tools inside containers manually is tedious and has so far never taken off. While this may change as more and more compute workloads in bioinformatics and the life sciences are being moved to the cloud, it is still the reality now. However, as we rarely run just one single tool, let alone one single command, but rather run chains of tools one after another to make sense of our data, what we really want to concern ourselves with is how to best encode these complex analysis **_workflows_** and make _them_ reproducible. And in this scenario, container technologies have had a big impact on bioinformatics. In particular:

1. Computational analysis workflows have had a colorful history in bioinformatics and have come a long way. Literally hundreds of _workflow engines_, often with their own _workflow languages_, have been developed to prevent people from just stringing together raw commands in a shell (or Python or PUT_LANGUAGE_OF_CHOICE_HERE) script and sharing these (good luck trying to get these running!). Workflow engines allow users to manage the execution of workflows in ways that ensure that workflow steps are run at the right time, with the right amount of resources and within the right environments. Fortunately, we are now at a time where only a few of these engines have endured and they are used, increasingly, by researchers in academia and industry on a daily basis to achieve robust and reproducible results. Importantly, modern **workflow engines allow workflow developers to specify container images to set up the proper environments for each step** (Conda packages are also generally supported, but they do suffer from decreased reproducibility). The engines will then do everything that is necessary to ensure that the right commands are executed to create and run the containers (so that we _don't_ need to do this manually).
2. Conversely, bioinformatics **tool developers can easily provide container images** which ship their tools together with a precisely defined environment, and these images can then be easily consumed by workflow engines.
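
To make the first point a bit more tangible, here is a toy sketch of what an engine might do behind the scenes: turn a list of workflow steps, each with its own container image, into `docker run` commands. Everything in it is hypothetical (the image names, the arguments and the `docker_command` helper), and real engines do far more, such as mounting input and output data, scheduling and resource management:

```python
import shlex
from typing import Dict, List

# Hypothetical two-step workflow; image names and arguments are made up
WORKFLOW: List[Dict] = [
    {"image": "example/aligner:1.0", "args": ["--threads", "4", "reads.fastq"]},
    {"image": "example/counter:2.3", "args": ["--input", "alignments.bam"]},
]


def docker_command(image: str, tool_args: List[str]) -> List[str]:
    """Assemble the command that runs one step in a throwaway container."""
    # `--rm` removes the container again once the step has finished
    return ["docker", "run", "--rm", image, *tool_args]


if __name__ == "__main__":
    for step in WORKFLOW:
        command = docker_command(step["image"], step["args"])
        # Only print the command here; an engine would execute it
        print(shlex.join(command))
```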

We will look at workflow languages and workflow engines later on, so in this session we will just have a brief look at how you can create a Docker image for your tool.

> Apart from workflows, another major use case for containers is web services. Whether you provide a database resource or just a user-friendly web interface for your fancy new tool, "containerizing" your web app and using container orchestration tools like Kubernetes (by far the most popular!), Docker Swarm or even (if just for a small website) Docker Compose is the state of the art of systems administration! However, we will not cover that in this course.

## Composing `Dockerfile`s

To create a container image, a "recipe" is required that tells the container engine how to build the image. This recipe will tell the engine what software should be installed in the container image, how the software should be configured, what data it should include, what environment variables should be defined, and so on. In Docker, such a recipe is called a `Dockerfile` and is typically contained in a text file of the same name (although the name can be different).

As an example, here is a minimal `Dockerfile` (with comments) that should be good enough to ship a simple Python package without dependencies, together with a precisely defined environment.

```dockerfile
FROM python:3.11.6-slim-bullseye

MAINTAINER zavolab-biozentrum@unibas.ch

# Set build arguments for the user name, group name and user home
# directory to be used further down
ARG USER="bioc"
ARG GROUP="bioc"
ARG WORKDIR="/home/${USER}"

# Create user home, user and group & make user and group the owner of the user
# home
RUN mkdir -p $WORKDIR \
  && groupadd -r $GROUP \
  && useradd --no-log-init -r -g $GROUP $USER \
  && chown -R ${USER}:${GROUP} $WORKDIR \
  && chmod 700 $WORKDIR

# Set the container user to the new user
USER $USER

# Make sure the location where Pip installs console scripts/executables is
# available in the $PATH variable so that the container's operating system is
# able to locate them
ENV PATH="${WORKDIR}/.local/bin:${PATH}"

# Set the working directory to the user's home
WORKDIR $WORKDIR

# Copy the entire content of the current directory to the working directory and
# make sure the copied files are owned by the container user and corresponding
# group
COPY --chown=${USER}:${GROUP} . $WORKDIR

# install app
RUN pip install -e .

# Set default entry point for containers created from the image
ENTRYPOINT ["NAME_OF_YOUR_TOOL_EXECUTABLE"]

# Set default command-line arguments
CMD ["--help"]
```

That's it!

Let's have a look at the different `Dockerfile`-specific instructions we used (see the official [reference](https://docs.docker.com/engine/reference/builder/) for details):

- `FROM`: Allows us to specify a base image, an existing Docker image that we want to start from. In this case, our base image is a slim version of the Debian v11 Linux distribution nicknamed "Bullseye", with Python (v3.11.6) already installed. This means that our tool will run in Bullseye and on Python 3.11.6, regardless of the host's operating system and Python version. Cool!
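- `MAINTAINER`: Allows us to specify, well, the maintainer of the `Dockerfile` recipe. This is just metadata that allows people to reach out with any issues and questions they may have regarding the Docker image. Note that the official reference marks `MAINTAINER` as deprecated and recommends `LABEL maintainer="NAME_OR_EMAIL"` instead.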
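- `ARG`: Allows us to specify a variable that can be used during the build process only. Here, we have supplied default values that can be overridden by the person building the image with `docker build --build-arg NAME=VALUE`. If no default is set and no value is provided at build time, the variable is simply empty, so builders effectively **have to** provide values in that case.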
- `RUN`: Allows us to specify (more or less) arbitrary shell commands, e.g., to install software into the Docker image. Note that we have chained multiple commands with the `&&` operator under a single `RUN` instruction rather than using multiple `RUN` instructions. The reason for this is that Docker creates an additional layer for each instruction that changes the file system (`RUN`, `COPY` and `ADD`), which can lead to an excessive number of layers and bloat the image. On the other hand, layers are great, because the layering system allows inheriting from existing base images, which then do not need to be built again. Furthermore, Docker is smart enough that upon successive builds, only those layers are rebuilt that come at or after the first instruction that has changed since the last build. This can be useful when developing tricky `Dockerfile`s, because you can move parts that already work fine into a separate, earlier layer, reducing the build time.
- `USER`: Allows us to explicitly set the container user. Including this in your `Dockerfile`s is good practice, because without it, the default user is the root user, which has excessive privileges and might, under certain circumstances, allow container users to access data on the host system with those same privileges.
- `ENV`: Allows us to specify environment variables that propagate into containers created from the built image.
- `WORKDIR`: Allows us to set a working directory inside the container and have the Docker build engine move inside it. If the directory does not exist, it is created.
- `COPY`: Allows us to copy files and directories from the host system into the Docker image. Differs in some important ways from the `cp` shell command, so you may want to refer to the reference documentation for further details. Importantly, the `--chown` flag will _not_ work on Windows containers!
- `ENTRYPOINT`: Allows us to optionally set what is executed automatically when a container is created from the image with `docker run`.
- `CMD`: Allows us to specify default arguments passed to the executable specified in `ENTRYPOINT`. Can easily be overridden by supplying positional arguments to `docker run`.
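
To make the interplay between `ARG`, `ENTRYPOINT` and `CMD` a bit more tangible, here is a minimal sketch of how their defaults could be overridden from the command line once the `Dockerfile` is in place. The function names and the user/group name `alice` are made up for illustration, and `my-image` is just a placeholder image name. The commands are wrapped in shell functions so that nothing runs until you call them.

```sh
# Build the image, overriding the ARG defaults for USER and GROUP.
# "alice" is a hypothetical user/group name; the optional first argument is
# the build context (defaults to the current directory).
build_image() {
  docker build \
    --build-arg USER="alice" \
    --build-arg GROUP="alice" \
    -t my-image \
    "${1:-.}"
}

# Run the tool; without arguments, the default CMD (--help) is used, while
# any arguments given here replace the default CMD.
run_image() {
  docker run --rm my-image "$@"
}

# Replace the ENTRYPOINT to explore the container interactively.
# Assumes that bash is available in the image, as should be the case for
# Debian-based images.
shell_in_image() {
  docker run --rm -it --entrypoint bash my-image
}
```

For example, `build_image` builds from the current directory and `run_image` prints the help of your tool, while `run_image --version` would replace `--help` with `--version` (assuming your tool supports such an option).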

Now let's create and build the corresponding Docker image.

## Building Docker images

Let's follow these simple steps to have Docker build our image:

1. Create a file `Dockerfile` with the contents above in your repository root directory.
   > **ATTENTION:** Make sure to replace the placeholder in the `ENTRYPOINT` instruction with the actual name of
   > your console script/executable!
2. Optionally create a `.dockerignore` file (similar to `.gitignore`) where you list the files and directories that you do **not** want to include in the Docker image. Make sure **not** to list any files and directories that the packaged tool requires to be installed and run.
   > **WARNING:** You MUST create and populate a `.dockerignore` file if your current working directory contains
   > any secrets!
3. Build the image with the following command:
   ```sh
   docker build -t my-image .
   ```

This will take a little while, but if everything is okay, your image should have been created and is ready for use... locally. Let's see how we can make it available to others!
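
As a rough sketch of step 2 above, a `.dockerignore` file for a typical Python repository could be created as follows. The listed patterns are assumptions about what such a repository might contain (in particular, `.env` and `*.pem` are merely examples of files that may hold secrets), so adapt them to your project. Note that the command overwrites any existing `.dockerignore` file in the current directory.

```sh
# Write a .dockerignore file to the current directory (overwrites an
# existing one)
cat > .dockerignore << 'EOF'
# Version control data
.git
# Python caches and bytecode at any depth
**/__pycache__
**/*.pyc
# Local virtual environments
.venv
# Files that may contain secrets (example names only)
.env
*.pem
EOF
```

Be careful with excluding `.git` if your package derives its version from Git metadata at install time, as the installation inside the image may then fail.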

## Publishing Docker images

If you are happy with your tool and have nicely wrapped it in a Docker image, the next step is to publish that image, e.g., on Docker Hub. This requires the following steps:

1. Create an account on [Docker Hub](https://hub.docker.com)
2. Use your Docker Hub credentials to log in at the shell with [`docker login`](https://docs.docker.com/engine/reference/commandline/login/)
3. Tag the image with your Docker user name: `docker tag my-image user-name/my-image:some-optional-version-tag`
4. Push the image to Docker Hub: `docker push user-name/my-image:some-optional-version-tag`

This will make the image accessible to anyone via `docker run --rm user-name/my-image:some-optional-version-tag COMMAND_LINE_OPTIONS`. To read and write files on the host, users will typically also need to make one or more local directories available inside the container, e.g., via the `-v` option; see `docker run --help` and [its documentation](https://docs.docker.com/engine/reference/run/) for details.
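
To illustrate the `-v` option, here is a minimal sketch of a wrapper function that users of the published image could define in their shell. The function name `run_tool` and the `data` directory are assumptions made for this example; the image name is the placeholder used above.

```sh
# Hypothetical wrapper around `docker run` for the published image.
# Mounts the directory ./data from the host at /data inside the container
# and passes all arguments on to the tool (replacing the default CMD).
run_tool() {
  docker run --rm \
    -v "$PWD/data:/data" \
    user-name/my-image:some-optional-version-tag \
    "$@"
}
```

Any file paths passed to the tool then need to refer to the location _inside_ the container, e.g., `/data/some_input_file` rather than `./data/some_input_file`. Also keep in mind that the container user needs sufficient permissions on the mounted directory to read from and write to it.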

## Cleaning up after yourself

Note that after installing Docker, the Docker daemon is constantly running in the background, and it doesn't forget - unless you tell it to! So over time, more and more images, volumes etc. may accumulate, including artifacts from failed build attempts etc. Depending on how much you use Docker, you can easily end up with tens of gigabytes of disk space being reserved for Docker stuff.

So let's have a look at how you can clean up after yourself. This will also come with a little crash course on the most important Docker commands.

For starters, let's see how we can find out more about the state of Docker:

```sh
docker info
```

This shows all kinds of interesting information (output truncated):

```console
Client: Docker Engine - Community
 Version:    24.0.7
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.21.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 21
  Running: 10
  Paused: 0
  Stopped: 11
 Images: 611
 Server Version: 24.0.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 [...]
```

For example, I now know that I have more than 600 images on my machine!

Let's see how much space they are using:

```sh
docker system df
```

```console
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          210       12        23.82GB   23.05GB (96%)
Containers      21        10        11.38MB   8.932MB (78%)
Local Volumes   193       6         4.042MB   3.921MB (97%)
Build Cache     164       0         102.3MB   102.3MB
```

Almost 24 GB for me - time to clean up!

I can also see how much space the other resources managed by the Docker daemon are consuming, but images are using most of the space (not unexpectedly). So let's see how we can list the images that the Docker daemon knows about:

```sh
docker images
docker images -a  # shows all images, including intermediate images
```

You can use the `IMAGE ID` (or the repository name together with the tag) to remove an individual image like so:

```sh
docker rmi IMAGE_ID
docker rmi REPOSITORY:TAG
```

However, this only works if no container, whether running or stopped, is based on the image you would like to remove. So how can we find out which containers we have?

```sh
docker ps
docker ps -a  # shows all containers, including stopped ones 
```

Now, given the `CONTAINER ID` (or any of the associated `NAMES`), we can stop a running container:

```sh
docker stop CONTAINER_ID
docker stop NAME
```

Once a container is stopped, you can also remove it (but note that this will only remove the particular container instance, not the _image_):

```sh
docker rm CONTAINER_ID
docker rm NAME
```

Alternatively, you could also restart a stopped container with... you guessed it:

```sh
docker start CONTAINER_ID
docker start NAME
```

So now we know how we can list images and containers, stop (and restart) containers and remove individual (stopped) containers and images that are not in use by any container. Nice!

However, after using Docker for a while there may easily be dozens or even hundreds of images accumulating. It'd be a bit tedious to have to remove them all one by one. Instead, Docker provides a set of `docker ... prune` commands, which are very convenient to quickly clear out unused Docker images.

To clear out images only, use:

```sh
docker image prune  # removes "dangling" images only, i.e., images that are
                    # not tagged and are not referenced by a container
docker image prune -a  # removes all images that are not referenced by any
                       # container (running or stopped)
```

Something similar exists for containers as well:

```sh
docker container prune  # removes all stopped containers
```

The Docker daemon also manages volumes (used to deal with data) and networks (used to deal with incoming/outgoing network traffic), which we haven't looked at in this short primer. Individual `docker ... prune` commands exist for these as well. In case you ever need them, look them up in the official documentation.
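
For reference, a minimal sketch of what pruning volumes and networks could look like is shown here. The function name is made up, and the commands are wrapped in a function so that nothing is deleted just by pasting the snippet; the `-f` flag skips the confirmation prompt.

```sh
# Hypothetical helper: remove unused volumes and networks without asking for
# confirmation (-f).
# CAUTION: data stored in removed volumes is gone for good!
prune_volumes_and_networks() {
  docker volume prune -f
  docker network prune -f
}
```

Call `prune_volumes_and_networks` only once you are sure that you no longer need the data in any unused volume.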

Finally, to clean _everything_ that is not running/being used all at once, you can use this gem:

```sh
docker system prune  # for images, only removes dangling ones
docker system prune -a  # include all unused images
```

After running `docker system prune -a`, `docker system df` now shows the following for me:

```console
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          5         5         1.925GB   0B (0%)
Containers      10        10        2.448MB   0B (0%)
Local Volumes   193       6         4.042MB   3.921MB (97%)
Build Cache     0         0         0B        0B
```

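Quite a cleanup! The remaining images and containers belong to containers that are still running and that I did not want to stop/remove. The volumes were left untouched, because `docker system prune` only removes volumes when called with the `--volumes` option.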

## Further reading

As we have already hinted at, there is a lot more to Docker and containers, especially when it comes to web servers.

If you are interested, do check out the [Docker](https://docs.docker.com/) and [Kubernetes](https://kubernetes.io/docs/home/) documentation to start your journey to become a DevOps engineer :)

# Publishing on Bioconda & BioContainers

One final point we would like to draw your attention to: It is in your best interest as a developer to make your tool as usable as possible to others. And this involves making it available through different outlets. So far we have looked at PyPI, for pure Python packages. In this session we have also learned about [Bioconda](https://bioconda.github.io/) and briefly touched upon [BioContainers](https://biocontainers.pro/), a community effort to create container images for bioinformatics tools, streamline their building and facilitate their hosting.

In fact, Bioconda and BioContainers are related (they are maintained by a largely overlapping group of people). Better yet, **a (Conda-based) container image is automatically created for every tool on Bioconda! Therefore, we strongly encourage you to publish all your mature bioinformatics tools on Bioconda.** You get the container image for free and don't need to take care of maintaining the `Dockerfile` and building and publishing the image. And you also don't need to worry about hosting the image (and possible limitations that Docker Hub or other image registries may impose on you).

Explaining how you can publish your tools on Bioconda is beyond the scope of this course. However, [it is not very difficult](https://bioconda.github.io/contributor/index.html), [especially if you already have a package on PyPI](https://bioconda.github.io/contributor/guidelines.html#python).

# Conclusion

So going back to the original questions we asked: Are installation problems for your users avoidable? Are the developers of the more naughty tools culpable?

With this lecture, we hope to have convinced you that they very likely did not do very much to make life easier for the user. Perhaps they had a good reason to implement their code in Java. And perhaps their software really needs some system libraries that are themselves not available as Java packages, so that they cannot be installed with a Java package manager along with the app code. But did they try to make their tool available on Conda - which might just be able to resolve all dependencies, including the non-Java ones? And did they publish a Docker image that has the software and all its dependencies installed?

The moral of the story is: Don't be that lazy developer! Make it easy for your users (and potential contributors!) to install and use your software. It will not only increase your reputation as a developer, it will also lead to more adoption and citations!

> Of course it can be tedious to manage releases manually, as doing it well requires a lot of steps:
> - Run all checks and tests to make sure that your tool runs as expected with the new code
> - Maintain a strict versioning policy and upgrade your version accordingly
> - Create a change log and release notes describing all changes
> - Create a Git tag and associated release on GitLab or GitHub
> - Build your tool and upload to PyPI
> - Publish the update on Bioconda  
>  
> However, note that if you followed the recommendations in this course (particularly on writing Conventional Commit messages, implementing extensive tests and setting up Continuous Integration), **all of these steps can be fully automated** according to rules you set up. For example, you could choose to create a release for every single commit that you merge into your default branch. Or you could extend your Git workflow to include the addition of release branches that you merge your feature branches into. Then, once you feel that you would like to have a new release, simply merge the release branch into the default branch to trigger the automatic release pipeline set up in your CI!

# Homework

1. If you haven't done so already, please add all of your app dependencies to a file `requirements.txt` in the repository root directory. Put all of your _development_ dependencies in a separate file `requirements_dev.txt` in the same directory. Make sure to only include _direct_ dependencies and specify a reasonable version range for each dependency, using the appropriate operator (add the version ranges if you already had the "bare" dependencies in those files). Choose the most appropriate strategy for pinning versions. As usual, use the GitHub flow to merge these files into the code base.
2. Create a `Dockerfile` for your code. Try to build a Docker image from the `Dockerfile` and make sure you are able to run your tool from within the container. Once you are happy, you can let us know that your `Dockerfile` is ready and we will build it and push the corresponding image to our Docker Hub repository so that we can make use of the Docker image in the Nextflow workflow.
3. If you haven't done so yet, 

## Alternative: Use Bioconda instead!

As an alternative to the instructions (1) and (2) above, you can also create Conda environment files (`environment.yml` and `environment_dev.yml`) instead of the `requirements.txt` and `requirements_dev.txt` files and publish your tool on Bioconda by following the [instructions](https://bioconda.github.io/contributor/index.html). As a Docker image will be automatically created and published on [BioContainers](https://biocontainers.pro/) for every tool in Bioconda, there will be no need to prepare and maintain a `Dockerfile` and publish the corresponding image. In fact, publishing your tool on Bioconda is preferred, as this ensures long-term availability of your Docker image (and it also provides another source for your users to obtain your tool from).