# A short introduction to containerized software

After spending time using nf-core pipelines to answer bioinformatic questions, we will now focus on the processes that lie behind these pipelines.

Today, we will focus on containerization, specifically with Docker.



1. Check if Docker is installed.

In [1]:
!docker info

Client:
 Version:    27.2.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.16.2-desktop.1
    Path:     /usr/local/lib/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.2-desktop.2
    Path:     /usr/local/lib/docker/cli-plugins/docker-compose
  debug: Get a shell into any image or container (Docker Inc.)
    Version:  0.0.34
    Path:     /usr/local/lib/docker/cli-plugins/docker-debug
  desktop: Docker Desktop commands (Alpha) (Docker Inc.)
    Version:  v0.0.15
    Path:     /usr/local/lib/docker/cli-plugins/docker-desktop
  dev: Docker Dev Environments (Docker Inc.)
    Version:  v0.1.2
    Path:     /usr/local/lib/docker/cli-plugins/docker-dev
  extension: Manages Docker extensions (Docker Inc.)
    Version:  v0.2.25
    Path:     /usr/local/lib/docker/cli-plugins/docker-extension
  feedback: Provide feedback, right in your terminal! (Docker Inc.)
    Version:  v1.0.5
    Path:     
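
The same check can be scripted from a Python cell. This is a minimal sketch, not part of the original exercise: the helper names `docker_cli_path` and `docker_info_succeeds` are made up for illustration, and it assumes that `docker info` exits with a non-zero status when the daemon cannot be reached, which should hold for recent Docker versions.

```python
import shutil
import subprocess


def docker_cli_path():
    """Return the path of the docker executable, or None if it is not on PATH."""
    return shutil.which("docker")


def docker_info_succeeds(timeout=10):
    """Return True if `docker info` runs and exits with status 0."""
    if docker_cli_path() is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,  # we only care about the exit status
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("docker CLI found:", docker_cli_path() is not None)
print("docker info succeeds:", docker_info_succeeds())
```

If the first line prints `False`, Docker is not installed or not on the PATH; if only the second one does, the CLI is present but the daemon is probably not running.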

For the following tasks, I used the Docker documentation (https://docs.docker.com/).

### What is a container?

- A container is an isolated process that runs from an image bundling code with all of its dependencies, so an application behaves consistently across different computing environments. Containers are lightweight and easy to package and share.
- For example, in an app made of several components (such as a frontend, a backend, and a database), each component can run in its own container, isolated from the others.

### Why do we use containers?

- We use them because they are portable: the same container runs consistently across different environments. Isolating an application from its surroundings also prevents conflicts, for example between package versions. Containers are efficient and scalable, and they provide a consistent development environment.
- If we use multiple tools in bioinformatics, containers spare us from dependency issues. Creating a single environment for all tools would be much harder, as some tools may have conflicting dependencies.

### What is a Docker image?

- A Docker image is a package that includes all of the files, binaries, libraries, and configurations needed to run a container.
- Images are immutable, meaning that once they are created, they cannot be changed. To make modifications, you must either create a new image or add changes as new layers.
- Container images consist of multiple layers, where each layer represents changes made to the file system, such as adding, removing, or modifying files.
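
The layer idea can be illustrated with a toy model. This is only a sketch for intuition, not how Docker stores layers on disk: the function `apply_layers` and the example paths are hypothetical. Each layer is a dictionary of file-system changes, and the layers are merged in order without modifying any of them.

```python
def apply_layers(layers):
    """Merge layers in order into one file-system view.

    Each layer maps a path to its new content, or to None if the
    layer deletes that path. The layers themselves are never modified.
    """
    filesystem = {}
    for layer in layers:
        for path, content in layer.items():
            if content is None:
                filesystem.pop(path, None)  # this layer removes the file
            else:
                filesystem[path] = content  # this layer adds or overwrites the file
    return filesystem


# Hypothetical layers, loosely mirroring the steps of an image build
base_layer = {"/bin/sh": "shell binary"}
install_layer = {
    "/usr/games/cowsay": "cowsay script",
    "/var/cache/apt/pkg.deb": "cached package",
}
cleanup_layer = {"/var/cache/apt/pkg.deb": None}

view = apply_layers([base_layer, install_layer, cleanup_layer])
print(sorted(view))
```

The cached package disappears from the merged view, yet `install_layer` still contains it. In this toy model, as with real images, earlier layers are never edited; later layers only record further changes on top of them.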

### Let's run our first docker image:

### Log in to Docker

In [None]:
# This you need to do on the command line directly

# I pasted "docker login" into the terminal and then logged in

### Run your first docker container

In [9]:
!docker run hello-world


Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/



### Find the container ID

In [10]:
! docker ps -a
# ps is used to list running containers including the container ID, image, command and when it was created
# -a is used to show all containers


CONTAINER ID   IMAGE         COMMAND    CREATED          STATUS                      PORTS     NAMES
84ee87cefad8   hello-world   "/hello"   12 seconds ago   Exited (0) 11 seconds ago             awesome_galois


### Delete the container again and prove that it is deleted

In [11]:
# Remove container using rm
! docker rm 84ee87cefad8


84ee87cefad8


In [12]:
# List containers again to verify the deletion -> The container is not listed anymore.
! docker ps -a

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES


### FastQC is a very useful tool, as you learned last week. Let's try to run it from the command line

Link to the software: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

#### Please describe the steps you took to download and run the software for the example fastq file from last week below:

1. I first ran "conda list" to see all installed packages. Because FastQC was not installed in the environment, I googled "conda fastqc", opened "https://anaconda.org/bioconda/fastqc", and ran the install command given there: "conda install bioconda::fastqc". Afterwards, I verified the installation by running "conda list" again.

2. Before running fastqc, we can look up its arguments with "fastqc --help". Then we can run fastqc with the desired arguments, e.g.:
fastqc -o [OUTPUT_DIR] *.fq, when we want to use multiple files as input. If we want to process 3 files at once, we can add the argument -t 3.


### Very well, now let's try to make use of its docker container

1. create a container holding fastqc using seqera containers (https://seqera.io/containers/)
2. use the container to generate a fastqc html of the example fastq file

In [38]:
# Click on add after having found the fastqc package (I used the one from bioconda), then click on "get container"
# pull the container
! docker pull community.wave.seqera.io/library/fastqc:0.12.1--5cfd0f3cb6760c42

0.12.1--5cfd0f3cb6760c42: Pulling from library/fastqc
Digest: sha256:0c524d3abe2642c09c5852299bd79bf78ba0ee2ef040473324caab0826f64d44
Status: Image is up to date for community.wave.seqera.io/library/fastqc:0.12.1--5cfd0f3cb6760c42
community.wave.seqera.io/library/fastqc:0.12.1--5cfd0f3cb6760c42


In [39]:
# Check the image
! docker images


REPOSITORY                                                 TAG                        IMAGE ID       CREATED         SIZE
community.wave.seqera.io/library/fastqc                    0.12.1--5cfd0f3cb6760c42   1df9a8700d59   4 months ago    908MB
quay.io/biocontainers/r-shinyngs                           1.8.8--r43hdfd78af_0       3ae022b36dce   5 months ago    1.34GB
quay.io/biocontainers/samtools                             1.20--h50ea8bc_0           4ac62e588716   5 months ago    69.6MB
quay.io/biocontainers/atlas-gene-annotation-manipulation   1.1.1--hdfd78af_0          db9ec43ce403   5 months ago    1.25GB
quay.io/biocontainers/r-shinyngs                           1.8.4--r43hdfd78af_0       33b8b24630c4   11 months ago   1.32GB
hello-world                                                latest                     d2c94e258dcb   17 months ago   13.3kB
quay.io/nf-core/ubuntu                                     20.04                      88bd68917189   17 months ago   72.8MB
quay.io/bio

In [None]:
! mkdir $(pwd)/fastqc_results

In [42]:
# run the container and save the results to a new "fastqc_results" directory
! docker run -v "/home/tabea/ComputationalWorkflows/Tag2/SRA_files/fastq":/data -v "/home/tabea/ComputationalWorkflows/Tag4/fastqc_results":/output community.wave.seqera.io/library/fastqc:0.12.1--5cfd0f3cb6760c42 fastqc -o /output /data/SRX19144486_SRR23195516_1.fastq.gz

# The -v option mounts a directory from the host system into the container (host_path:container_path).
# First, we mount the directory "/home/tabea/ComputationalWorkflows/Tag2/SRA_files/fastq" as /data inside the container.
# Second, we mount the directory "/home/tabea/ComputationalWorkflows/Tag4/fastqc_results" as /output inside the container.
# "community.wave.seqera.io/library/fastqc:0.12.1--5cfd0f3cb6760c42" is the Docker image used to run the FastQC tool.
# fastqc is the command that runs inside the container.
# -o specifies the output directory for the results. I set it to /output, which is mapped to the local directory "fastqc_results".
# /data/SRX19144486_SRR23195516_1.fastq.gz is the input file analyzed by fastqc.

application/gzip
Started analysis of SRX19144486_SRR23195516_1.fastq.gz
Approx 5% complete for SRX19144486_SRR23195516_1.fastq.gz
Approx 10% complete for SRX19144486_SRR23195516_1.fastq.gz
Approx 15% complete for SRX19144486_SRR23195516_1.fastq.gz
Approx 20% complete for SRX19144486_SRR23195516_1.fastq.gz
Approx 25% complete for SRX19144486_SRR23195516_1.fastq.gz
Approx 30% complete for SRX19144486_SRR23195516_1.fastq.gz
Approx 35% complete for SRX19144486_SRR23195516_1.fastq.gz
Approx 40% complete for SRX19144486_SRR23195516_1.fastq.gz
Approx 45% complete for SRX19144486_SRR23195516_1.fastq.gz
Approx 50% complete for SRX19144486_SRR23195516_1.fastq.gz
Approx 55% complete for SRX19144486_SRR23195516_1.fastq.gz
Approx 60% complete for SRX19144486_SRR23195516_1.fastq.gz
Approx 65% complete for SRX19144486_SRR23195516_1.fastq.gz
Approx 70% complete for SRX19144486_SRR23195516_1.fastq.gz
Approx 75% complete for SRX19144486_SRR23195516_1.fastq.gz
Approx 80% complete for SRX19144486_SRR23195
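
The `docker run` call above can also be assembled programmatically, which makes the role of each part easier to see. This is a minimal sketch; the helper `docker_run_command` is a made-up name, and it only builds the argument list without executing anything.

```python
import shlex


def docker_run_command(image, tool_args, mounts):
    """Build the argument list for `docker run` with bind mounts.

    mounts maps a host directory to the path where it appears
    inside the container (the two halves of each -v option).
    """
    command = ["docker", "run"]
    for host_dir, container_dir in mounts.items():
        command += ["-v", f"{host_dir}:{container_dir}"]
    command.append(image)       # the image comes after all docker options
    command += list(tool_args)  # everything after the image runs inside the container
    return command


cmd = docker_run_command(
    image="community.wave.seqera.io/library/fastqc:0.12.1--5cfd0f3cb6760c42",
    tool_args=["fastqc", "-o", "/output", "/data/SRX19144486_SRR23195516_1.fastq.gz"],
    mounts={
        "/home/tabea/ComputationalWorkflows/Tag2/SRA_files/fastq": "/data",
        "/home/tabea/ComputationalWorkflows/Tag4/fastqc_results": "/output",
    },
)
print(shlex.join(cmd))
```

To actually run the container from Python, the list could be passed to `subprocess.run(cmd, check=True)`.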

### Now that you know how to use a docker container, which approach between running everything manually and using docker was easier and which approach will be easier in the future?

- In this case, running it manually was easier, because I didn't have to create the container before running fastqc. I also had no other tools in the environment, so I did not run into any dependency issues.
- In general, however, once the container is created, using it is easier and more stable: Docker eliminates the need to manually install software dependencies, manage versions, or resolve conflicts between libraries. Docker containers also provide the same environment every time they are run, and they run on any system that supports Docker.

### What would you say, which approach is more reproducible?

- The approach using Docker containers is more reproducible. A Docker image fixes the tool and all of its dependencies in an isolated environment, so once it is created, we can easily share it with others, and they can run it regardless of their local configuration.

### Compare the file to last week's FastQC results. Are they identical?

- They are identical.
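
One way to back up such a comparison is to compare checksums of the two result files. This is a sketch under assumptions: the helper names are hypothetical, and the demonstration uses small temporary stand-in files rather than the real reports. For real FastQC output, comparing the `fastqc_data.txt` files from the result archives may be more informative than comparing the HTML reports byte for byte, since a report may embed run-specific details.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """Return True if both files have the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)


# Self-contained demonstration with temporary stand-in files
with tempfile.TemporaryDirectory() as tmp:
    report_a = os.path.join(tmp, "a.txt")
    report_b = os.path.join(tmp, "b.txt")
    report_c = os.path.join(tmp, "c.txt")
    for path, text in [(report_a, "PASS\n"), (report_b, "PASS\n"), (report_c, "WARN\n")]:
        with open(path, "w") as handle:
            handle.write(text)
    identical = same_content(report_a, report_b)
    different = same_content(report_a, report_c)

print(identical, different)
```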

### Is the fastqc version identical?

- nf-core pipeline: I found the version on the GitHub page of nf-core/rnaseq ("https://github.com/nf-core/rnaseq/blob/3.15.1/modules/nf-core/fastqc/environment.yml"). It uses bioconda::fastqc=0.12.1 (version 0.12.1).

- docker: I used bioconda::fastqc=0.12.1.

- Both versions are identical

## Dockerfiles

We have now used Docker images and containers directly to boost our research.

Let's create our own toy Dockerfile including the "cowsay" tool (https://en.wikipedia.org/wiki/Cowsay)

Hints:
1. Docker containers typically run Linux, so on a Debian-based image you need the apt-get command to install "cowsay"

In [43]:
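! mkdir cowsay-docker
# Note: "! cd" runs in a temporary subshell, so it does not change the notebook's working directory.
# The "%cd cowsay-docker" magic would be needed if later cells should run inside the new directory.
! cd cowsay-docker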

### Explain the RUN and ENV lines you added to the file

RUN apt-get update && apt-get install -y curl cowsay && apt-get clean

- apt-get update: This command updates the package index of the Debian operating system.
- apt-get install -y curl cowsay: This command installs the packages curl and cowsay
- The -y flag automatically answers "yes" to prompts during installation
- apt-get clean: This command removes any cached files from the package installation process.


ENV PATH="/usr/games:${PATH}"

- The ENV instruction sets environment variables in the image, so they are available in every container started from it. In this case, it sets the PATH variable.
- /usr/games is prepended to the PATH because that is where Debian installs the cowsay executable. With this directory in the PATH, the container can find cowsay when it is called by name.
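
Put together, a Dockerfile with these two lines might look like the sketch below, written out from a Python cell. Only the RUN and ENV lines and the `debian:bullseye-slim` base image (visible in the build output) come from this notebook; the original `my_dockerfile` is not shown and may contain further instructions. The helper `dockerfile_text` is a made-up name.

```python
# Sketch of a Dockerfile for the cowsay image; the original file may differ.
DOCKERFILE_LINES = [
    # base image, as reported by the build output
    "FROM debian:bullseye-slim",
    # refresh the package index, install the tools, then drop cached packages
    "RUN apt-get update && apt-get install -y curl cowsay && apt-get clean",
    # make the cowsay executable in /usr/games findable by name
    'ENV PATH="/usr/games:${PATH}"',
]


def dockerfile_text(lines=DOCKERFILE_LINES):
    """Join the instructions into Dockerfile text with a trailing newline."""
    return "\n".join(lines) + "\n"


print(dockerfile_text())
```

The text could be saved with `pathlib.Path("my_dockerfile").write_text(dockerfile_text())` before building the image.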

In [51]:
# make sure that the image has been built
! docker build -f my_dockerfile -t cowsay-image .

# -f my_dockerfile: This flag specifies the name of my Dockerfile.
# -t cowsay-image: This tags the image with the name "cowsay-image".
# . This specifies the build context, which is the current directory.

[1A[1B[0G[?25l[+] Building 0.0s (0/1)                                          docker:default
[?25h[1A[0G[?25l[+] Building 0.2s (1/2)                                          docker:default
[34m => [internal] load build definition from my_dockerfile                    0.0s
[0m[34m => => transferring dockerfile: 824B                                       0.0s
[0m => [internal] load metadata for docker.io/library/debian:bullseye-slim    0.2s
[?25h[1A[1A[1A[1A[0G[?25l[+] Building 0.3s (1/2)                                          docker:default
[34m => [internal] load build definition from my_dockerfile                    0.0s
[0m[34m => => transferring dockerfile: 824B                                       0.0s
[0m => [internal] load metadata for docker.io/library/debian:bullseye-slim    0.3s
[?25h[1A[1A[1A[1A[0G[?25l[+] Building 0.5s (1/2)                                          docker:default
[34m => [internal] load build definition from my_dockerfile  

In [53]:
# run a container from the image
! docker run cowsay-image cowsay "Hello from Cowsay!"


 ____________________
< Hello from Cowsay! >
 --------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||


## Let's do some bioinformatics with Dockerfiles and create a new Dockerfile that holds the salmon tool used in rnaseq

To do so, use "curl" in your new Dockerfile to get salmon from https://github.com/COMBINE-lab/salmon/releases/download/v1.5.2/salmon-1.5.2_linux_x86_64.tar.gz

In [64]:
# use the file "salmon_docker" in this directory to build a new docker image
# build the image
! docker build -f salmon_docker -t salmon-image .

[1A[1B[0G[?25l[+] Building 0.0s (0/0)  docker:default
[?25h[1A[0G[?25l[+] Building 0.0s (0/1)                                          docker:default
[?25h[1A[0G[?25l[+] Building 0.2s (2/3)                                          docker:default
[34m => [internal] load build definition from salmon_docker                    0.0s
[0m[34m => => transferring dockerfile: 577B                                       0.0s
[0m => [internal] load metadata for docker.io/library/debian:bullseye-slim    0.2s
[34m => [auth] library/debian:pull token for registry-1.docker.io              0.0s
[0m[?25h[1A[1A[1A[1A[1A[0G[?25l[+] Building 0.4s (2/3)                                          docker:default
[34m => [internal] load build definition from salmon_docker                    0.0s
[0m[34m => => transferring dockerfile: 577B                                       0.0s
[0m => [internal] load metadata for docker.io/library/debian:bullseye-slim    0.4s
[34m => [auth] libra

In [65]:
! docker run salmon-image ls

bin
boot
dev
etc
home
lib
lib64
media
mnt
opt
proc
root
run
sbin
srv
sys
tmp
usr
var


In [67]:
# run a container from the image to print the version of salmon
# salmon version v1.5.2
! docker run salmon-image usr/bin/salmon-1.5.2_linux_x86_64/bin/salmon --version

salmon 1.5.2
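
The file `salmon_docker` itself is not shown here. A Dockerfile consistent with the path used above might look like the following sketch, again generated from a Python cell. It rests on several assumptions: the `ca-certificates` package and the ENV line are additions for illustration, and the archive is assumed to unpack into the directory `salmon-1.5.2_linux_x86_64`, as the path in the previous cell suggests.

```python
SALMON_URL = (
    "https://github.com/COMBINE-lab/salmon/releases/download/"
    "v1.5.2/salmon-1.5.2_linux_x86_64.tar.gz"
)

# Sketch only; the original salmon_docker file may differ.
SALMON_DOCKERFILE = "\n".join([
    "FROM debian:bullseye-slim",
    # curl downloads the release; ca-certificates lets it verify HTTPS (assumption)
    "RUN apt-get update && apt-get install -y curl ca-certificates && apt-get clean",
    # -L follows redirects; unpacking into /usr/bin matches the path used above
    f"RUN curl -L {SALMON_URL} | tar -xz -C /usr/bin",
    # optional addition: lets the tool be called simply as `salmon`
    'ENV PATH="/usr/bin/salmon-1.5.2_linux_x86_64/bin:${PATH}"',
]) + "\n"

print(SALMON_DOCKERFILE)
```

With the ENV line in place, `docker run salmon-image salmon --version` should work without spelling out the full path to the binary.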


#### Do you think bioinformaticians have to create a docker image every time they want to run a tool?

- No, they don't. Public, well-maintained images already exist for many tools, including many in the field of bioinformatics.

### Find the salmon docker image online and run it on your computer.

In [68]:
! docker pull combinelab/salmon:latest

latest: Pulling from combinelab/salmon

7f213c76: Pulling fs layer
1ed9ab84: Pulling fs layer
0bdd40c3: Pulling fs layer
893c1bc1: Pulling fs layer
Digest: sha256:cefd8bb0b2ed9b07f22b5f0fc317ddda540e5b0dc00810d1ff0d92fee5d80370
Status: Downloaded newer image for combinelab/salmon:latest
docker.io/combinelab/salmon:latest


In [70]:
! docker run combinelab/salmon salmon --version

# salmon version 1.10.3

salmon 1.10.3


### What is https://biocontainers.pro/ ?

BioContainers is a community-driven initiative that facilitates the creation, management, and distribution of bioinformatics packages and containers. It builds on popular frameworks such as Conda, Docker, and Singularity.

##### Goals of BioContainers

- Establish a foundation for creating, building, and deploying bioinformatics software, including source code and examples.
- Offer pre-made containers to the bioinformatics community.
- Provide guidelines for creating reproducible pipelines and workflows using bioinformatics containers.
- Coordinate best practices: bring together developers and bioinformaticians to promote best practices in documentation and software development.

##### Key Components of BioContainers

- Docker Containers: A collection of Dockerfile recipes for automatically building containers.
- Conda-based Containers: Recipes for creating Conda packages and subsequently Docker containers.
- BioContainers Registry: A hosted registry for all available BioContainers images.
- Specifications: Guidelines and rules for contributing to BioContainers.