# Docker for the Modern Data Scientist: 6 Concepts You Can't Ignore in 2023
## A guide rich with visuals
![](attachment:1c6d4eda-d86d-41a2-9dbc-ed62f91bdb81.jpg)

### The story

This is by far one of the funniest memes I've ever seen:

https://twitter.com/gerardsans/status/1413936148846727179?s=20

It touches on one of the most painful problems not just in data science and ML but in all of programming - sharing applications/scripts and making the darn things work on other people's machines as well.

While Microsoft, Apple and Linus Torvalds meant well when they released different operating systems, they led to the never-ending struggle for software compatibility. Linux, Windows, macOS - each has its own quirks and idiosyncrasies. And let's not forget the variations in Python versions, library versions and the unpredictable landscapes of GPU drivers.

Containers had existed for a while as a solution to this problem, but it was the release of Docker in 2013 that made them boom in popularity. Since then, Docker and its containers have become the go-to tools for sharing anything that runs on code.

https://www.reddit.com/r/ProgrammerHumor/comments/cw58z7/it_works_on_my_machine/?utm_source=share&utm_medium=web2x&context=3

So, this tutorial will highlight the six most important concepts to help you navigate your way in the complex world of Docker as a data scientist or an ML engineer.

-------------

### A little note

Like much other great software, Docker is intuitive and easy to interact with. You only have to read the docs once to learn the commands required to make the most of the tool.

That's why we are more concerned with the theory behind each command - that part is much harder to understand, and the documentation almost always does a poor job of explaining it.

So, throughout the tutorial, I will focus more on the concepts than on code, but I will sprinkle in links to a few relevant pages whenever you might want to learn more about a certain item.

Let's get started!

-------------

### 0. Why not ZIP files?

![image.png](attachment:7b5bf40c-f077-4f18-ac8e-e23b1fb0b0f8.png)

Why learn a totally new tool when you can put all the code and datasets for your model into a zip file and share that? Well, that would be the same as sending the legos to build a car via mail instead of just driving the ready car to your friend's house. 

There are many excellent reasons to consider Docker above zip files or other methods:

1. **Dependency and compatibility chaos**: Zip files don't care about the host system. They are like globetrotting tourists who expect all machines to speak their language. But operating systems and CPU architectures differ, and those differences become a massive issue for libraries, dependencies and their versions.
2. **Reproducibility woes**: Imagine things break when someone opens your zip file. Is it because of a bug or an environment issue? Finding out can take hours of debugging that would make even the most patient person scream-swear.
3. **Isolation illusion**: You don't really know the contents of a zip file beforehand, and unpacking it is like releasing a bunch of mischievous mice into your operating system. You don't know where they will run and make a mess. People with malicious intent can take advantage of such chaos in the form of security attacks.
4. **Deployment dilemmas**: Deploying models from zip files requires tedious manual configuration, environment setup and dependency management. It is like building a house from scratch every time you move to a new city.

In short, while zip files may look like the easiest way to share applications, they can't match the power of Docker containers.

But what *is* a container, you ask? Let's answer that next.

-------------

### 1. Container

Containers are like mini-operating systems on your machine, isolated from other processes and applications such as Spotify, Chrome, Photoshop, games, you name it. They have access to your machine's resources, including RAM, CPU, Disk, and sometimes even GPUs, allowing them to run any software with custom configurations. 

![image.png](attachment:8a5f74fe-8f6b-431d-a6b6-f98293b30025.png)

These lightweight and portable computing environments are tailored to provide everything a machine learning model needs to run in isolation without disrupting the processes on the host machine. They use limited resources to run themselves, leaving the rest of your machine unaffected.

![image.png](attachment:947e0826-b7c0-45ba-96dd-aabb8237e146.png)

Another key advantage is that containers ensure identical results over time. Whether it's a day, a month, or a year later, the outputs will remain consistent for the same inputs. And it's not just consistency over time; containers also offer consistency anywhere. They run identically on your laptop, your neighbor's rusty Windows machine, or even in the cloud (AWS, Azure, GCP).

![image.png](attachment:3343bbd4-62d5-4aac-9bfa-b802895fbc92.png)

Another benefit is that containers provide a high level of security and isolation. Even if you make a mess inside a container, rest assured that the mess won't leak out to the rest of your machine or impact other containers. Everything is nicely _contained_ within the container.

Also, containers are lightweight and require minimal resources compared to alternatives like virtual machines. This makes them highly efficient, allowing you to run what look like entire operating systems, such as Ubuntu, Debian, and CentOS, as ordinary processes on top of your own operating system.

There are many tools that work with containers, but the most popular by far is Docker. It is an open source project with millions of users and the go-to tool for creating, managing and running any application as a container.

-------------

### 2. Virtualization

So, how can containers have so many excellent features without overwhelming their host?

The answer lies in virtualization technology. Virtualization creates isolated environments within the host operating system, allowing multiple containers to run independently and efficiently.

![image.png](attachment:adaed244-bd60-4191-9c00-6c6dabeb51d8.png)

Virtualization splits up the host resources (CPU, RAM, Disk) and makes each piece look like a separate resource to the software running on it. For example, 64GB of RAM can be virtualized to look like four individual 16GB pools of RAM.

Unlike virtual machines (VMs), which achieve many of the same goals but virtualize all the way down to the hardware level, containers virtualize only at the operating-system level. They use the host operating system's kernel and share the underlying OS resources.

This approach provides lightweight and efficient virtualization, allowing multiple containers to run side by side on a single host. There won't be much overhead in starting and stopping containers, which makes it much faster to update and distribute them.
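
One way to see the shared kernel for yourself is a tiny, standard-library-only check like the one below. It is only a sketch, and the function name `kernel_info` is made up for illustration. Run it once on a Linux host and once inside a container on that host, and the kernel release should match, because the container has no kernel of its own. (On macOS and Windows, Docker Desktop runs containers inside a small Linux VM, so you would see that VM's kernel instead.)

```python
import platform


def kernel_info():
    """Report which OS kernel this Python process is running on."""
    return {
        "system": platform.system(),           # e.g. "Linux"
        "kernel_release": platform.release(),  # version string of the running kernel
    }


print(kernel_info())
```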

-------------

### 3. Image

While working with Docker, you will often come across the terms _image_ and _container_ being used seemingly interchangeably. But there are certain differences.

A Docker image is like a food recipe containing meticulous instructions and steps to run an application. A Docker container, on the other hand, is a recipe come to life - a fully prepared dish you can consume. 

A single image can have multiple running instances as containers. But as a rule, even if you have multiple running containers from the same image, they won't interact with each other or even know of each other's existence.
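
If the recipe analogy doesn't click, a rough Python analogy might: an image behaves a bit like a class and containers like its instances. The toy model below is not how Docker is implemented, and every name in it (`Image`, `Container`, `run`, `write`) is invented for illustration. It only shows that two containers started from one image get separate writable state.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Image:
    """Toy stand-in for a Docker image: a read-only template."""
    name: str
    files: tuple  # files baked into the image, never modified

    def run(self):
        """Start a new 'container' from this image."""
        return Container(image=self)


@dataclass
class Container:
    """Toy stand-in for a container: the image plus private writable state."""
    image: Image
    written: set = field(default_factory=set)  # changes made inside this container only

    def write(self, path):
        self.written.add(path)

    def visible_files(self):
        # Image files plus whatever this container wrote itself
        return set(self.image.files) | self.written


image = Image(name="python:3.12-rc-bullseye", files=("/usr/local/bin/python",))
first, second = image.run(), image.run()
first.write("/tmp/model.pkl")

print("/tmp/model.pkl" in first.visible_files())   # True
print("/tmp/model.pkl" in second.visible_files())  # False
```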

For your own projects, you will usually build the images by yourself. But for many tasks, there are already many pre-built images from the community. 

For example, Docker Hub is the biggest registry, containing more than a million images, all just a couple of terminal commands away from running on your machine (after you have Docker installed, of course).

The registry contains official and community-maintained images for operating systems (Ubuntu, CentOS, Debian), languages and runtimes (Node.js, Python), databases and web servers (MySQL, Nginx), pre-packaged ML frameworks (TensorFlow, PyTorch with GPU access, Sklearn) and so on.

For example, to download the official release candidate for Python 3.12 and start using it on your machine, you just have to run the following two commands:

```bash
$ docker pull python:3.12-rc-bullseye
$ docker run -it python:3.12-rc-bullseye
```

The last command starts a container from the `python:3.12-rc-bullseye` image and drops you into an interactive Python 3.12 session. This running container instance is like a mini-operating system (Debian, as the `bullseye` tag suggests) with Python 3.12 preinstalled and little else.

If you append `bash` to the `docker run` command, you get a shell instead of the Python prompt. From there, you can treat the container like any other Debian machine: install additional tools like Git or Conda and do (almost) anything you could do on Debian, just without a graphical user interface (GUI).

-------------

### 4. Dockerfile

When we call [`docker pull`](https://docs.docker.com/engine/reference/commandline/pull/) and [`docker run`](https://docs.docker.com/engine/reference/commandline/run/), how does the image come to contain the binaries for Python 3.12 and all of its dependencies, already installed?

The answer lies in Dockerfiles. They are text files serving as blueprints or recipes for creating custom images that encapsulate our Python scripts or machine learning models along with their dependencies and configuration.

You will use Dockerfiles extensively when creating your images (one Dockerfile for one directory/project). While they can get pretty large for complex projects, they typically contain the following instructions for Python projects:

```dockerfile
# Use an official Python runtime as the base image
FROM python:3.9-slim

# Set the working directory inside the container
WORKDIR /app

# Copy the requirements file to the container
COPY requirements.txt .

# Install the required Python packages
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the application code to the container
COPY . .

# Define the command to run when the container starts
CMD ["python", "train.py"]
```

Above is a sample Dockerfile to containerize a `train.py` Python script located in our current working directory. Here is an overview of the commands:

1. `FROM` - a keyword to specify a base image. Base images are pre-built images on Docker Hub you can use in your custom images without having to reinvent the wheel. Above, we are using the slim Python 3.9 base image so that we don't have to install Python manually with `apt-get`.
2. `WORKDIR` - sets the working directory inside the container to `/app`. The application files, such as `train.py` and `requirements.txt`, will be copied there.
3. `COPY` - copies files from your project directory into the image. Copying `requirements.txt` on its own before the rest of the code lets Docker reuse the installed packages when only your code changes.
4. `RUN` - after this keyword, you can write any valid terminal command like `pip install` or run bash scripts to perform certain tasks when building images.
5. `CMD` - specifies the command to run when the container starts with `docker run`. In this case, it trains a new model with `python train.py`.

To build a new image with this Dockerfile, you call `docker build -t my_image .`. As simple as that.
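
If you would rather drive the build from Python, for example from a notebook or a test script, a thin wrapper around the CLI is enough. The sketch below assumes the `docker` CLI is installed and on your `PATH`. The helper names (`build_command`, `run_command`, `build_and_run`) are hypothetical, while `docker build -t`, `docker run` and `--rm` are standard CLI usage.

```python
import subprocess


def build_command(tag, context="."):
    """Arguments for building an image from the Dockerfile in `context`."""
    return ["docker", "build", "-t", tag, context]


def run_command(tag):
    """Arguments for starting a container; --rm deletes it once it exits."""
    return ["docker", "run", "--rm", tag]


def build_and_run(tag, context="."):
    """Build the image, then run its CMD; check=True raises if a step fails."""
    subprocess.run(build_command(tag, context), check=True)
    subprocess.run(run_command(tag), check=True)


# Show the commands without executing them
print(" ".join(build_command("my_image")))
print(" ".join(run_command("my_image")))
```

Calling `build_and_run("my_image")` from the project directory would then build the image and run `train.py` inside a fresh container.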

As you've observed, Dockerfile syntax is not entirely alien to someone who has been using YAML files or the terminal.

Check out [this page](https://docs.docker.com/language/python/) of the Docker documentation to learn more about building images and writing Dockerfiles for Python applications.

-------------

### 5. Image layers

A layer is a bit of a voodoo concept of Docker images. Each instruction/command in a Dockerfile contributes to creating a new, read-only, immutable layer in the resulting image. Layers are stacked on top of each other, forming a layered file system that represents the final image.

![image.png](attachment:b1be936d-9322-46f3-9be9-6e47f89d91ca.png)

There are many benefits to using a layered structure, such as caching. Since building images is an incremental process with many updates to the contents within, caching makes repeated calls of `docker build` much faster. 

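Heavy commands such as `FROM` or `RUN` will take only a fraction of a second if Docker detects that these layers weren't changed in the current build. Once one layer changes, though, that layer and every layer after it must be rebuilt, which is why the sample Dockerfile above copies `requirements.txt` and installs the packages before copying the rest of the code.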

![image.png](attachment:160111c8-fda1-4c79-99dd-3c2f92e9c6d6.png)

Apart from caching, layers allow efficient storage utilization, version control (image history, easy rollbacks) and lightweight distribution.
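
To see why the order of instructions matters for caching, here is a toy model of the build cache. It is a deliberate simplification: the real builder also checksums the files behind `COPY` and `ADD`, which this sketch imitates with a made-up `file_versions` argument. The function names are invented for illustration as well.

```python
import hashlib


def layer_keys(instructions, file_versions=None):
    """Return one cache key per instruction.

    Each key hashes the instruction, the parent layer's key, and (for COPY)
    a stand-in for the copied files' contents.
    """
    file_versions = file_versions or {}
    keys, parent = [], ""
    for instruction in instructions:
        content = file_versions.get(instruction, "")
        digest = hashlib.sha256(f"{parent}|{instruction}|{content}".encode()).hexdigest()
        keys.append(digest)
        parent = digest  # the next layer depends on this one
    return keys


def cached_layers(old_keys, new_keys):
    """Count the leading layers that can be reused from a previous build."""
    count = 0
    for old, new in zip(old_keys, new_keys):
        if old != new:
            break
        count += 1
    return count


dockerfile = [
    "FROM python:3.9-slim",
    "WORKDIR /app",
    "COPY requirements.txt .",
    "RUN pip install --no-cache-dir -r requirements.txt",
    "COPY . .",
]

first_build = layer_keys(dockerfile, {"COPY . .": "train.py v1"})
# Only the application code changed, so everything before `COPY . .` is reused
second_build = layer_keys(dockerfile, {"COPY . .": "train.py v2"})
print(cached_layers(first_build, second_build))  # 4

# requirements.txt changed, so pip install and all later layers are rebuilt
third_build = layer_keys(
    dockerfile,
    {"COPY requirements.txt .": "new deps", "COPY . .": "train.py v2"},
)
print(cached_layers(second_build, third_build))  # 2
```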

Learn more about layers, multi-stage builds and cache from [this page](https://docs.docker.com/build/guide/layers/).

-------------

### 6. Docker engine

A single host can have dozens of built images and running containers. How does the host machine distribute resources across all of them without going up in smoke? Enter the Docker Engine.

![image.png](attachment:a6760313-7352-46df-8696-f897260cc127.png)

Docker Engine is responsible for all the magical Docker jiu-jitsu that takes care of creating, running and maintaining images and containers. It has many components, but here are the _three_ most important ones:

1. Docker Daemon or `dockerd` - a background process on the host machine that manages the lifecycle of containers. It is responsible for virtualization and allocation of resources.
2. Docker Client - software that allows users to interact with Docker Engine. Primarily, it is the Docker command-line interface (`docker` CLI) but there is also the cross-platform Docker Desktop for people who prefer a graphical user interface (GUI).
3. Docker API - a set of interfaces and protocols that allows Docker clients or other external tools to interact with Docker Daemon. An internal language for Docker, if you will.

99% of your time will be spent working through the Docker client but it is important to understand other components as they play such a crucial role in how containers operate.
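
To make the client/daemon split less abstract, the sketch below plays the role of a very small Docker client: it sends one HTTP request to the daemon and reads the JSON reply. It assumes a Linux or macOS host where the daemon listens on the default Unix socket `/var/run/docker.sock` and your user is allowed to access it. The class and function names are hypothetical, and the official Docker SDK for Python is the more robust choice for real work.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that talks over a Unix socket instead of TCP."""

    def __init__(self, socket_path, timeout=5):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        # Replace the default TCP connection with a Unix domain socket
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def daemon_version(socket_path="/var/run/docker.sock"):
    """Ask the Docker daemon for its version details via the Engine API."""
    connection = UnixHTTPConnection(socket_path)
    try:
        connection.request("GET", "/version")
        response = connection.getresponse()
        return json.loads(response.read())
    finally:
        connection.close()


# Usage (requires a running Docker daemon):
# print(daemon_version())
```

On a machine with Docker running, `daemon_version()` returns a dictionary describing the daemon. The `docker version` command gets its server details from the same API.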

-------------

### Conclusion

Because of all the benefits I mentioned (and didn't mention) here, Docker is extremely popular in the community. As such, many awesome projects have been built upon it to extend the default functionality.

For example, Kubernetes, often abbreviated as K8s, is a powerful container orchestration platform that automates the deployment, scaling, and management of containerized applications. It can manage and schedule Docker containers across a cluster of nodes, providing features like automatic scaling, load balancing, and self-healing capabilities.

There is also Docker Compose, which allows you to spin up multiple containers, define their relationships, and manage their configurations as a single application stack.

And specific to us, Kubeflow is an open-source platform designed to simplify the deployment, management, and scaling of machine learning (ML) workloads on Kubernetes. It aims to provide a seamless and integrated experience for running ML workflows, making it easier for data scientists and engineers to build, train, and deploy machine learning models at scale.

Each of these technologies is worth spending your time on as they will greatly enhance the quality of your life when doing MLOps.

Thank you for reading!
