# Docker hands-on

In this exercise we will build together the Docker image used in the next exercises.
Along the way, we will go through some basic Docker concepts and practice with the syntax of the command-line tool and of `Dockerfile`s.

## Jupyter magics
Jupyter notebooks are collections of `cells` that can contain either Code or annotations, usually in Markdown.
The code cells are executed by a *Kernel*. You can check the name of the kernel (and its status) in the top-right corner of the notebook window.
In this case, the kernel is `Python 3`.

In addition to annotations and *Code* cells, it is possible to define special cells that are interpreted as code, but by an interpreter different from the one used for the normal *Code* cells (the notebook kernel stays the same and simply hands the cell over to that interpreter).

For example, you may want to define a cell executed in `bash`. Then, you just need to use `%%bash` as the first line of a *Code* cell.




In [1]:
%%bash 
echo "Ehy! This is bash running in $PWD!"

Ehy! This is bash!


These cell annotations are named **cell magics**, and you can discover the most common ones in [the documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html#cell-magics).

In this notebook, we will use another **cell magic** to dump the content of a cell into a text file.
For example, let's create a file containing a quote by Giosuè Carducci on Perugia.

In [2]:
%%writefile OnPerugia.txt

Cosí fece in Perugia. Ove l’altera
Mole ingombrava di vasta ombra il suol
Or ride amore e ride primavera,
Ciancian le donne ed i fanciulli al sol.

Overwriting OnPerugia.txt


When the cell is executed, it automatically creates (or overwrites) a file named `OnPerugia.txt` and dumps the raw content of the cell into it.
You can then read the content of the file in Python, or in any other language.

In [3]:
print (open("OnPerugia.txt").read())


Cosí fece in Perugia. Ove l’altera
Mole ingombrava di vasta ombra il suol
Or ride amore e ride primavera,
Ciancian le donne ed i fanciulli al sol.



## Setup docker
Docker is based on a client-server architecture.
In our case, the docker server runs in the same container as the Jupyter notebook, but in general it can be in another container on the same machine or even on a different machine.

To make sure the server (or `daemon`) part of docker is running in the container, you can use the command `ps | grep dockerd`.

If the output is not empty, there is a good chance the `docker daemon` is running 😏

In [4]:
%%bash
ps | grep dockerd || echo Error ignored.

     26 ?        00:00:27 dockerd


Make sure you run the following cell to remove the variable `DOCKER_HOST`, if set by mistake.

> **What's happening here?** Installing Docker inside a Docker container is rather cumbersome and not recommended, but it is good enough for practicing and for discussing things in a notebook, close to commands that you can run immediately.
> After polishing and refining the setup, an environment variable, `DOCKER_HOST`, may remain incorrectly set.

In [5]:
import os

## Remove DOCKER_HOST if it is set; do nothing otherwise.
os.environ.pop("DOCKER_HOST", None)
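
To see why a stale `DOCKER_HOST` matters, recall that the client uses it to locate the daemon. The sketch below is a simplification that ignores Docker contexts and command-line options; the fallback address is the usual default socket on Linux, and the TCP address is a made-up example.

```python
DEFAULT_DOCKER_SOCKET = "unix:///var/run/docker.sock"  # usual default on Linux


def docker_endpoint(environ):
    """Return the address the client would contact, given an environment mapping."""
    return environ.get("DOCKER_HOST") or DEFAULT_DOCKER_SOCKET


# Without the variable, the local socket is used.
print(docker_endpoint({}))
# With the variable set (hypothetical value), the client looks elsewhere.
print(docker_endpoint({"DOCKER_HOST": "tcp://127.0.0.1:2375"}))
```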

## Images and containers
Images are the abstract definition of the computing environment of a docker container.
Multiple containers can be *spawned* from the same image, and *all* containers are spawned from an image.

However, once the container is created and running, it may evolve to a state different from the one defined in the image.
For example, you can install in the container an additional library that is not available in the original image.

Nevertheless, a new container spawned from the original image will not contain the library you have added.

In addition, in the cloud infrastructure, we tend to restart and possibly recreate the containers quite often. Recreating a container destroys its internal state and restores the original filesystem as defined in the image.

In this sense, the filesystem and the data in a container are said to be **ephemeral**. If some data has to outlive the container, it must be stored **outside** of the container, as we will discuss later.

Docker images can be downloaded from remote registries, such as [Dockerhub](https://hub.docker.com), [INFN harbor](https://harbor.cloud.infn.it) or even [GitLab](https://baltig.infn.it), if you are brave enough. Downloading an image is called *pulling* the image.


### List the local images and pull a remote one
To practice a bit, let's list the images available locally.

In [6]:
%%bash
docker image list

REPOSITORY   TAG       IMAGE ID       CREATED          SIZE
<none>       <none>    16e5e107c6ff   14 seconds ago   1.01GB
custom_np    test      1c254538077c   37 minutes ago   1.12GB
numpy        test      680b4389bfdb   37 minutes ago   1.12GB
python       3.11      22c957c35e37   2 weeks ago      1.01GB
python       latest    22c957c35e37   2 weeks ago      1.01GB


Then we pull (or download) a public image from Dockerhub, for example `python:latest`.

In [7]:
%%bash
docker pull python:latest

latest: Pulling from library/python
Digest: sha256:cc7372fe4746ca323f18c6bd0d45dadf22d192756abc5f73e39f9c7f10cba5aa
Status: Image is up to date for python:latest
docker.io/library/python:latest


Finally, we list the contents of the local cache again.

In [8]:
%%bash
docker image list

REPOSITORY   TAG       IMAGE ID       CREATED          SIZE
cygno        test      6f56c877838e   2 seconds ago    1.13GB
custom_np    test      1c254538077c   37 minutes ago   1.12GB
numpy        test      680b4389bfdb   37 minutes ago   1.12GB
python       3.11      22c957c35e37   2 weeks ago      1.01GB
python       latest    22c957c35e37   2 weeks ago      1.01GB


You should see `python:latest` in the list.

Note that:
 * `python` (before the colon) is named the **Repository**
 * `latest` (after the colon) is named the **tag** and usually indicates a specific version of the software made available in the repository. Python in this case.

If you want to know more about what you have downloaded, you can visit the dedicated documentation on Dockerhub.

For example, for `python:latest` see https://hub.docker.com/_/python.
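
The repository/tag convention can be illustrated with a few lines of Python. This is a simplified sketch, not the parser used by Docker: it ignores digests (`@sha256:...`), the function name `split_image_reference` is illustrative, and `localhost:5000` is a hypothetical registry address used only to show why the position of the colon matters.

```python
def split_image_reference(reference):
    """Split 'repository:tag' into (repository, tag); the tag defaults to 'latest'."""
    # A colon separates the tag only if it comes after the last slash, so that
    # a registry port (as in 'localhost:5000/python') is not mistaken for a tag.
    last_slash = reference.rfind("/")
    last_colon = reference.rfind(":")
    if last_colon > last_slash:
        return reference[:last_colon], reference[last_colon + 1:]
    return reference, "latest"


for name in ["python:latest", "python:3.11", "python", "localhost:5000/python"]:
    print(name, "->", split_image_reference(name))
```

When the tag is omitted, Docker assumes `latest`, which is why `docker pull python` and `docker pull python:latest` refer to the same image.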

## Executing code in a container

Ok, now we have a local copy of the latest version of the Python image.

Let's try to execute some code in a container spawned from it. For example, let's print the Python version (in a fancy way):

In [9]:
%%bash 
docker run python:latest python3 -c "import sys; print (sys.version)"

3.11.5 (main, Sep  7 2023, 12:36:05) [GCC 12.2.0]


The line of code above has:
 * checked if the image `python:latest` is available locally (it would have pulled it otherwise)
 * spawned a container based on the image `python:latest`
 * executed the command `python3` with arguments `-c "import sys; print (sys.version)"` inside the spawned container
 * retrieved the standard output (and the standard error, empty here) stream
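 * stopped the container once the command exited (the stopped container is kept until it is removed, unless `docker run` is given the `--rm` option)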

### Detached containers
If the execution of the task in the container is long, it may be worth starting the container in *detached mode*.
In this way, it will not block the Jupyter notebook (or the shell) it is launched from, but will run in the background.

To execute a command in a new container in detached mode, you can use the `-d` (or `--detach`) flag.

In [10]:
%%bash 
docker run --detach python:latest python3 -c "import time; time.sleep(30)"

b9a093e458d4060d8b8d26dadee350fc7ffe4580a05f573fb31fe07c50b32cc6


> **Important notice**. The docker command-line interface is rather rigid about the order of the arguments:
>  * first you have `docker` (of course)
>  * then you have the docker command (in this case `run`)
>  * then you have the options to the docker command (in this case `--detach`)
>  * then the image
>  * and finally what has to happen inside the container (in the case of the `run` command, the instructions to be executed)
>
> Mixing the arguments, for example placing `--detach` after the image name, is a common mistake: everything after the image name is passed to the container as the command to execute, which usually results in an error.
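
The ordering rule can be made explicit by assembling the argument list in Python. The sketch below only builds and prints the command, without running it; the helper name `docker_run_argv` and its parameters are illustrative choices, and the resulting list could be passed to `subprocess.run` on a machine where Docker is available.

```python
import shlex


def docker_run_argv(image, command, detach=False, volumes=None):
    """Assemble `docker run` arguments in the order: options, image, command."""
    argv = ["docker", "run"]
    # Options of the `run` command must come before the image name.
    if detach:
        argv.append("--detach")
    for host_path, container_path in (volumes or {}).items():
        argv += ["-v", f"{host_path}:{container_path}"]
    # Everything after the image name is executed inside the container.
    argv.append(image)
    argv += list(command)
    return argv


argv = docker_run_argv(
    "python:latest",
    ["python3", "-c", "import time; time.sleep(30)"],
    detach=True,
)
print(shlex.join(argv))
```

Keeping the options in a dedicated section of the function makes it hard to place `--detach` after the image by accident.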

## Listing the active containers

You can list the active containers with `docker ps`.
If you run the following cell immediately after the previous one, you should see a container with `IMAGE` `python:latest` running.
If you wait more than 30 seconds, the container will finish its job and stop: `docker ps` will no longer list it (stopped containers are shown by `docker ps -a`).

In [11]:
%%bash
docker ps

CONTAINER ID   IMAGE           COMMAND                  CREATED         STATUS                  PORTS     NAMES
b9a093e458d4   python:latest   "python3 -c 'import …"   7 seconds ago   Up Less than a second             modest_mclean


## Customizing the image

Let's try to run a command involving numpy. Importing it is enough for our purpose.

In [12]:
%%bash
docker run python:latest python3 -c "import numpy as np" || echo "An error was ignored."

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'numpy'


An error was ignored.


You get an error because numpy is not preinstalled in the `python:latest` image.

We need to create our custom image, inheriting from `python:latest`, but this time with numpy.

The first thing we need to do is to create a file, usually named `Dockerfile`, that describes how to build the image.

In [13]:
%%writefile Dockerfile

## FROM is a Dockerfile keyword that defines the image we are inheriting from.
## If the image is not available locally, it is automatically pulled from Dockerhub.
FROM python:latest

## RUN is a Dockerfile keyword that defines a task to be performed on top of the
## inherited image. These instructions are executed only once, at "build-time",
## to modify the image; they are not executed again when you spawn a container.
RUN pip install numpy

Overwriting Dockerfile


Then, we can run the `docker build` command to digest the Dockerfile description and build a new image.

In [14]:
%%bash
docker build . -t numpy:test 

#0 building with "default" instance using docker driver

#1 [internal] load .dockerignore
#1 transferring context: 2B done
#1 ...

#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 489B done
#2 DONE 0.2s

#1 [internal] load .dockerignore
#1 DONE 0.6s

#3 [internal] load metadata for docker.io/library/python:latest
#3 DONE 0.0s

#4 [1/2] FROM docker.io/library/python:latest
#4 DONE 0.0s

#5 [2/2] RUN pip install numpy
#5 CACHED

#6 exporting to image
#6 exporting layers done
#6 writing image sha256:680b4389bfdbc66ca303cdb01d933ec3a697fc395422744e1c24151bbd96bddb 0.0s done
#6 naming to docker.io/library/numpy:test done
#6 DONE 0.0s


See how it appears in `docker image list`:

In [15]:
%%bash
docker image list

REPOSITORY   TAG       IMAGE ID       CREATED          SIZE
cygno        test      6f56c877838e   26 seconds ago   1.13GB
custom_np    test      1c254538077c   37 minutes ago   1.12GB
numpy        test      680b4389bfdb   37 minutes ago   1.12GB
python       3.11      22c957c35e37   2 weeks ago      1.01GB
python       latest    22c957c35e37   2 weeks ago      1.01GB


And see how it fixes the error we got before.

In [16]:
%%bash
docker run numpy:test python3 -c "import numpy as np"

## Installing custom software in the image
So far we have seen how to install software from online repositories inside your Docker image, but what if the software we want to install is local?
Then you can use the `COPY` instruction.

Consider the following "extremely complicated and unique custom software module", entirely stored in the file `my_custom_module.py`.


In [17]:
%%writefile my_custom_module.py

import numpy as np

def greetings():
    """Print hello numpy and a random number in the interval 0-1"""
    print (f"Hello numpy, {np.random.uniform(0, 1):.3f}")


Overwriting my_custom_module.py


Then we edit the Dockerfile to add the `COPY` instruction.

In [18]:
%%writefile Dockerfile

## FROM is a Dockerfile keyword that defines the image we are inheriting from.
## If the image is not available locally, it is automatically pulled from Dockerhub.
FROM python:latest

## RUN is a Dockerfile keyword that defines a task to be performed on top of the
## inherited image. These instructions are executed only once, at "build-time",
## to modify the image; they are not executed again when you spawn a container.
RUN pip install numpy

## COPY is a Dockerfile keyword that copies a file from the build folder into
## the image. Here we use it to add the custom software module.
COPY my_custom_module.py ./my_custom_module.py

Overwriting Dockerfile
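
The structure of this Dockerfile (one base image, then a sequence of instructions applied in order) is simple enough to be generated programmatically when many similar images are needed. The following is a minimal sketch under that assumption; the function `render_dockerfile` and its parameters are illustrative and handle only the three instructions used in this notebook.

```python
def render_dockerfile(base_image, pip_packages=(), files_to_copy=()):
    """Return the text of a Dockerfile with FROM, RUN pip install and COPY lines."""
    lines = [f"FROM {base_image}"]
    if pip_packages:
        # A single RUN instruction installs all the requested packages.
        lines.append("RUN pip install " + " ".join(pip_packages))
    for filename in files_to_copy:
        # Each file is copied to the working directory of the image.
        lines.append(f"COPY {filename} ./{filename}")
    return "\n".join(lines) + "\n"


dockerfile_text = render_dockerfile(
    "python:latest",
    pip_packages=["numpy"],
    files_to_copy=["my_custom_module.py"],
)
print(dockerfile_text)
```

The generated text contains the same three instructions as the file written above, without the explanatory comments.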


Finally, we build the image (naming it `custom_np`) and we run some code that uses the custom module.

In [19]:
%%bash
## Build the custom_np image 
docker build -t custom_np:test .
## Run code accessing the custom module
docker run custom_np:test python3 -c "from my_custom_module import greetings; greetings()"

#0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 589B done
#1 DONE 1.1s

#2 [internal] load .dockerignore
#2 transferring context: 2B done
#2 DONE 1.1s

#3 [internal] load metadata for docker.io/library/python:latest
#3 DONE 0.0s

#4 [1/3] FROM docker.io/library/python:latest
#4 DONE 0.0s

#5 [internal] load build context
#5 transferring context: 212B done
#5 DONE 0.3s

#6 [2/3] RUN pip install numpy
#6 CACHED

#7 [3/3] COPY my_custom_module.py ./my_custom_module.py
#7 CACHED

#8 exporting to image
#8 exporting layers done
#8 writing image sha256:1c254538077c428e792714f699e9fad72e13124ff1b9d29917021ac0a6aafb6b 0.1s done
#8 naming to docker.io/library/custom_np:test
#8 naming to docker.io/library/custom_np:test 0.1s done
#8 DONE 0.2s


Hello numpy, 0.220


## Sharing data (volumes) with the container

Storing custom code and data in the Docker image is great for software distribution, but it is most often the wrong way to provide input data to the code running in the container.
As an alternative, one can share a directory of the host (the filesystem where you are running this notebook, in this case) with the container.
This can be achieved with the `--volume` (or `-v`) flag of the `docker run` command (see [docs](https://docs.docker.com/engine/reference/commandline/run/#volume)).

The syntax is as follows:
```bash
docker run \
 -v <path_in_host_filesystem>:<path_in_the_container> \
 image_name:tag \
 [command]
```

For example, let's create an input data file containing an array of random numbers normally distributed. We will store it in `data/file.npz`.

In [20]:
## First we make sure the data folder exists, creating it if needed.
import os
os.makedirs("data", exist_ok=True)

## Then we import numpy and we use it to create and store a random dataset in the data folder.
import numpy as np
np.savez("data/file.npz", dataset=np.random.normal(0, 1, 1000))

Then, we write a simple line of code to read the data file and compute the mean.

> Note that the path in the container must be absolute, and the path on the host should be absolute too (a bare name would be interpreted as the name of a Docker volume rather than as a directory).
> We use the `$PWD` environment variable to make this step independent of the installation.

Note that, inside the container, the snippet that reads the array and computes its mean must look for the file in the location where it has been mounted (`/my_mounted_volume`), rather than in the original `data/` folder.

In [21]:
%%bash
docker run \
 -v "$PWD/data":/my_mounted_volume \
 numpy:test \
 python3 -c "import numpy as np; print (np.load('/my_mounted_volume/file.npz')['dataset'].mean());"

0.03574120443484179
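
The value passed to `-v` can also be assembled in Python, which makes the requirement on absolute paths explicit. This is a small sketch assuming a POSIX-like host; the helper name `volume_argument` is illustrative.

```python
import os


def volume_argument(host_dir, container_dir):
    """Build the value of `-v` as '<absolute host path>:<absolute container path>'."""
    if not container_dir.startswith("/"):
        raise ValueError("the path in the container must be absolute")
    # os.path.abspath resolves a relative path against the current directory,
    # playing the same role as $PWD in the shell command above.
    return f"{os.path.abspath(host_dir)}:{container_dir}"


print(volume_argument("data", "/my_mounted_volume"))
```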


Of course, nothing prevents you from treating *software* as a kind of *input data*.

For example, to define a configuration or to ease debugging during the development phase, you can simply share snippets of code with the container.

In the cell below, we rewrite the one-line snippet in a more readable and maintainable format.
Then, we store it in `data/script.py` to make it "visible" from the container.


In [22]:
%%writefile data/script.py

import numpy as np 
print ("Hello! You are running a script defined in data/script.py")
dataset = np.load('/my_mounted_volume/file.npz')['dataset']
print (dataset.mean())

Overwriting data/script.py


Then, you can simply run the script from the mounted volume, and retrieve your output!

In [23]:
%%bash
docker run \
  -v "$PWD/data":/my_mounted_volume \
  numpy:test \
  python3 /my_mounted_volume/script.py

Hello! You are running a script defined in data/script.py
0.03574120443484179


## Python bindings

We close this introduction with the Python bindings to the Docker engine, which provide an interesting alternative to the Command-Line Interface (CLI) when more complex setups must be described.

Starting from an environment where the command line works properly, one can reuse the working configuration by creating the Docker client instance from the environment.

In [24]:
import docker

docker_client = docker.from_env()

Once the client is set up, it can be used to retrieve all the information on the images and the containers discussed above, as Python objects.

In [25]:
docker_client.images.list()

[<Image: 'cygno:test'>,
 <Image: 'custom_np:test'>,
 <Image: 'numpy:test'>,
 <Image: 'python:3.11', 'python:latest'>]

In [26]:
docker_client.containers.list()

[]

Containers can be spawned and volumes can be attached, exactly in the same way as we did from the command line, but with Python syntax for easier integration in Python applications.

In [27]:
docker_client.containers.run("numpy:test", """python3 -c "print('Hello world')" """)

b'Hello world\n'

In [28]:
docker_client.containers.run(
    "numpy:test",
    "python3 /my_mounted_volume/script.py",
    volumes={os.path.join(os.getcwd(), "data"): {"bind": "/my_mounted_volume", "mode": "rw"}},
)

b'Hello! You are running a script defined in data/script.py\n0.03574120443484179\n'
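
Two details of the last call are worth isolating: the `volumes` argument is a dictionary mapping each host path to a `{"bind": ..., "mode": ...}` entry, and the result is returned as bytes. The sketch below does not contact the Docker engine; the helper `sdk_volumes` is an illustrative name, and the bytes are copied from the output shown above.

```python
import os


def sdk_volumes(mounts, mode="rw"):
    """Translate {host_dir: container_dir} into the `volumes` dictionary format."""
    return {
        os.path.abspath(host_dir): {"bind": container_dir, "mode": mode}
        for host_dir, container_dir in mounts.items()
    }


volumes = sdk_volumes({"data": "/my_mounted_volume"})
print(volumes)

# The output of the container is returned as bytes: decode it to obtain text.
raw_output = b"Hello! You are running a script defined in data/script.py\n0.03574120443484179\n"
for line in raw_output.decode("utf-8").splitlines():
    print(line)
```

Passing `mode="ro"` to the helper would describe a read-only mount, which can be a safer choice when the container only needs to read its input data.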