How to run Singularity containers

Virtualization vs. Containerization

In our group we make extensive use of virtual machines (VMs). We utilize VMs that run on our dedicated hardware on UITS-managed servers, and we utilize VMs that run on NSF-funded Jetstream servers. Furthermore, we always have the option to buy compute time on VMs hosted on cloud services such as AWS or Google Cloud. Our use of bgRAMOSE, as described in the Howto-Jetstream handbook entry, provides a comprehensive review of why and how we utilize VMs.

As wonderful as these VMs are, they also incur significant overhead. Each VM involves a computer simulating another computer, i.e., an entire operating system running within another. As efficiently as this may be implemented, there is just no way around the overhead. Furthermore, while it is generally straightforward to share, spin up, and spin down these machines, it is not necessarily trivial. Sharing a VM image requires several gigabytes of bandwidth and storage, and once the VM is spun up, the resources allocated to it cannot be shared by other processes.

Containerization provides a lightweight alternative to virtualization with many of the benefits and negligible overhead.
Containerized applications run in isolated environments within the host operating system, sharing resources with other processes running on the same machine. Because they do not run an entire operating system, containers are trivial to spin up; and as they are much smaller in size, they are much easier to distribute. Finally, it is noteworthy that containers can also run on VMs.

In summary, containers provide portability and reproducibility without significant overhead, while VMs can provide full compute infrastructure, but at the cost of significant overhead.
For those reasons, when we want to distribute our workflows to run predictably on a large number of distinct compute platforms, containers are more suitable than VMs.

Singularity

While the industry standard containerization solution is Docker, it is not well suited for scientific applications, chiefly because it requires root-level privileges and a system-wide daemon that users of shared academic clusters typically do not have. We therefore use Singularity as our containerization solution. Singularity is compatible with Docker images and provides a feature set better suited to scientific applications. It is already installed on many academic clusters, and it is straightforward to install on your own machine or almost any other platform; the Singularity website provides concise instructions.

Unlike Docker, Singularity does not require any special privileges (sudo) and does not depend on a system-wide daemon running in the background. In short, if Singularity is already installed on your system or you were able to install it yourself, you are ready to run Singularity containers.
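A quick sanity check is to ask for the version; if the following command prints a version number, you are set:

singularity --version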

Recipes, Images, and Instances

Recipes are plaintext files that specify a compute environment; this includes the base operating system, the commands to install all the necessary programs along with their dependencies, and environment variables. When we develop workflows with Singularity containers, we meticulously prepare recipes (such as this one) that describe how to build the environment our workflow expects to run in.
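As a rough, hypothetical illustration (not one of our actual recipes), a minimal recipe might look like this:

Bootstrap: docker
From: ubuntu:18.04

%post
    apt-get update && apt-get install -y samtools

%environment
    export LC_ALL=C

%labels
    Maintainer BrendelGroup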

Images are binary files built from recipes. Given a recipe, any version of Singularity on any machine should build an identical image. An image can be read-only or writable depending on the use case. Furthermore, it is possible to modify an image to make another (possibly read-only) image, although this is considered bad form because the modified image cannot be reproduced from a recipe. One caveat is that you need sudo privileges to properly build an image from a recipe.
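For example, assuming the recipe above were saved in a file named Singularity, building an image called bwasp.simg would look something like:

sudo singularity build bwasp.simg Singularity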

Instances are images in action. Every time we run a program within a container, we run another instance of that image. Unlike VMs, container instances can be spawned and killed and share resources just like any other process. When we execute a program within a container, we are simply executing that program with a sandbox around it that provides a familiar environment; each such sandbox is an instance of the same image. It is important to note that these instances do not communicate by default; e.g., a file written by an instance will not be accessible by another unless they are writing to a shared filesystem.
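Named instances can also be started, listed, and stopped explicitly; the exact subcommand spelling depends on the Singularity version, but in recent versions it takes a form like:

singularity instance start bwasp.simg bwasp1
singularity instance list
singularity instance stop bwasp1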

Because building images from recipes takes a bit of time and requires sudo privileges, we prefer to share pre-built images along with our recipes. This way, those who would like to use our containers can simply download the turn-key solution.

Singularity Hub

Singularity Hub is where we keep our publicly available images for all to access. The BrendelGroup projects that are hosted on GitHub and contain Singularity recipes are connected to Singularity Hub (if they are properly registered). Singularity images for these projects are automatically built by the Singularity Hub infrastructure every time we push to the GitHub repository. Singularity offers a convenient way to download these images. For example:

singularity pull --name bwasp.simg shub://brendelgroup/BWASP

would download the latest BWASP container image and name it bwasp.simg. This would be all you need to do to run the BWASP workflow, which in fact depends on a very large number of other programs (all conveniently bundled into the bwasp.simg container!).

Running Singularity containers

Once you have a working Singularity installation and have downloaded a container image (from Singularity Hub or otherwise), you are ready to run. Singularity comes with a number of subcommands such as build, pull, run, and shell, but the one you will mostly use to get work done is exec. The command

singularity exec container.simg command

will execute command from within the specified container image. Let us work through some of the caveats before we move on to the additional parameters that we will set.

Caveat regarding executable paths

You may or may not have the same command available as an executable in your path in the host environment, and if you do, the versions may differ. For example, let us assume you have genometools version 1.5.7 on your machine (the host), whereas the BWASP container image comes with version 1.5.9. When you run genometools on the host directly, you will be running version 1.5.7; within the container, you will be running version 1.5.9. This can easily be confirmed by executing:

gt --version
singularity exec bwasp.simg gt --version

and comparing the outputs. The first line will report the version that is available on the host (if any), and the second line will report the version available in the container. The host-installed version of a program may change depending on the particular host, but the container-bound version will always be the same.

The above example nicely showcases how convenient (and also disorienting) containers can be. When we run, for example, make within the BWASP container, it uses all the programs as they exist within the container, which may or may not resemble the environment we have on the host machine. In fact, we often test our containers on clean installs of various Linux distributions where none of the relevant tools other than Singularity is installed. Had we run the same makefile using the make on the host, we would likely get an error due to a missing dependency, or different results due to differing program versions. When we use the container, we get exactly the same results every time, regardless of the host system, and this is what we mean by portable and reproducible.
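A simple way to see this for yourself is to ask where a given executable resolves from, on the host versus inside the container (assuming, for illustration, that samtools is bundled in bwasp.simg):

which samtools
singularity exec bwasp.simg which samtools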

Caveat regarding the data path

Our programs and workflows operate on data, and we always need to specify where this data is. Let us say we want to convert a bam file to a sam file using the tools available in the BWASP container. We can do:

singularity exec bwasp.simg samtools view -h file.bam -o file.sam

When we provide such relative paths, we are implicitly referring to the working directory. But where is this working directory on the host, and does it even exist in the container? This can get tricky.

In practice, this would actually work automagically as long as your working directory is within your home directory, because Singularity binds your home directory to the same path in the container by default. However, we often work outside our home directories, in scratch spaces or on network mounts where more space is available. In that case, we simply need to specify the folder where the data is, so that the container instance has access to it. If file.bam were in a scratch space outside our home directory (/scratch), the command would look like this:

singularity exec --bind /scratch bwasp.simg samtools view -h /scratch/file.bam -o /scratch/file.sam

All we needed to add was --bind /scratch. We also used full paths to indicate the files, but this may or may not be necessary depending on the exact setup. When in doubt, use full paths.
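Note that --bind also accepts a source:destination pair if you want the host path to appear at a different location inside the container; a hypothetical variation of the command above:

singularity exec --bind /scratch:/data bwasp.simg samtools view -h /data/file.bam -o /data/file.sam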

Recommended parameters

While working with Singularity containers, we found that avoiding some of the automagical features and explicitly specifying the following three parameters should help avoid surprises:

  1. --cleanenv or -e : do not inherit environment variables from the host environment
  2. --contain or -c : do not autobind home directory
  3. --bind path or -b path : bind this specific path

By using these extra parameters, we make sure that:

  1. Custom Python/R/etc. environments that might exist in the host system do not interfere with our container.
  2. The container does not have access to any directories that we did not explicitly specify.
  3. The container does have access to the path where we have the relevant data and where we wish the relevant output to be written.

A typical BrendelGroup Singularity command will thus look like the following:

cd /scratch/BWASP/data
singularity exec -e -c -b /scratch/BWASP/data /DATA/GROUP/prj/SINGULARITY/bwasp.simg make -n

which would run make from within the container found in /DATA/GROUP/prj/SINGULARITY/bwasp.simg to process the data that is found under /scratch/BWASP/data. In this way, all executables available in the container can be used (here by make), and all the data I/O is restricted to /scratch/BWASP/data (and subdirectories).