# Creating a Dockerfile

## Dockerfile Basics

*Dockerfiles* are text-based configuration files used to specify how a Docker image should be built. They play a crucial role in creating consistent and reproducible container environments. A Dockerfile contains a series of instructions that define the image's base, environment setup, and application code and dependencies.

### Dockerfile Structure

A typical Dockerfile follows a structured format:

- *Base Image*: This is the starting point for your Docker image, often based on an existing image from a registry like Docker Hub

- *Instructions*: Dockerfiles consist of a series of instructions that specify how the image should be configured and what should be included in it. These instructions include actions like installing software, copying files, and configuring environment variables.

- *Commands*: Shell commands are used to execute actions during the image build process. These commands can be used for tasks like installing packages, setting up configurations, or running scripts.

Here's a basic Dockerfile structure:

``` docker
# This is a comment
# Use a base image
FROM base_image:tag

# Set environment variables
ENV key=value

# Run commands to install dependencies or configure the image
RUN command1 && command2

# Copy files from the host to the image
COPY source destination

# Specify the default command when a container starts
CMD ["executable", "param1", "param2"]
```

Let's break down each component:

- **Comments**: Lines starting with # are comments and are ignored during image build

- `FROM`: Specifies the base image to use as a starting point. It defines the foundation of your image, often based on official images from Docker Hub. For example, `FROM ubuntu:20.04` sets the base image to Ubuntu 20.04.

- `ENV`: Sets environment variables within the image, allowing you to configure the runtime environment for your application

- `RUN`: Executes commands in the image during the build process. You can use this instruction to install software, update packages, and perform any necessary configuration. For instance, you can use `RUN apt-get update && apt-get install -y package-name` to install packages.

- `COPY`: Copies files or directories from the host machine into the image. This is useful for adding application code, configuration files, or assets. For example, `COPY app.py /app/` copies the `app.py` file from your local machine into the `/app` directory in the Docker image.

- `CMD`: Specifies the default command that will run when a container is started from the image. It can be overridden when starting a container. For example, `CMD ["python", "app.py"]` starts the `app.py` script when the container starts.
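As a minimal sketch, the examples from the list above can be combined into one small Dockerfile. It assumes a hypothetical `app.py` sitting next to the Dockerfile, and it installs and calls `python3` because the interpreter has to be added on top of the plain Ubuntu base image. The variable name `APP_ENV` is also only illustrative.

``` docker
# Start from the Ubuntu 20.04 base image
FROM ubuntu:20.04

# Example environment variable (hypothetical name), visible to the application at runtime
ENV APP_ENV=production

# Install Python; update and install are chained so they run in one step
RUN apt-get update && apt-get install -y python3

# Copy the (assumed) script from the build context into /app/
COPY app.py /app/

# Default command; it can be overridden when starting a container
CMD ["python3", "/app/app.py"]
```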

> By convention, a Dockerfile has no extension: the file is literally named `Dockerfile`. A prefix can still be added to tell several Dockerfiles apart. For example, a Dockerfile that specifies the steps for building an API image can be called `api.Dockerfile`.

When a Dockerfile is created in VSCode, it will automatically be recognised as a Dockerfile, as indicated by the characteristic whale icon.

<p align=center> <img src=images/Docker_icon.png width=200> </p>

### Dockerfile Commands

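Dockerfile instructions build the image up in layers. Instructions that change the filesystem (such as `RUN`, `COPY`, and `ADD`) each add a new layer, while the remaining instructions only record metadata. This layering contributes to the overall structure and functionality of the resulting image: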

- Images can be constructed by building on top of existing layers
- Layers are cached and can be reused in consecutive builds, improving build efficiency
- Layers can also be shared among different images, enhancing resource utilization

#### [FROM](https://docs.docker.com/engine/reference/builder/#from)

The `FROM` instruction is the starting point for defining the image-building process. It has the following syntax:

> `FROM [--platform=<platform>] <image>[:<tag>] [AS <name>]`

Key points to note about the `FROM` instruction:

- It initiates the build stage for creating an image
- Specifies the base image (e.g., Ubuntu, node, conda) that determines the environment and capabilities of the image
- Optionally, you can use `AS` to assign a name to the build stage, a feature we'll explore in the next lesson when we delve into multi-stage builds

The `FROM` instruction can also be combined with `ARG`, allowing you to pass values from the command line during the build process, providing flexibility in image customization.

``` docker
# An ARG declared before FROM is outside of any build stage
ARG VERSION=latest
# The build stage starts here
FROM busybox:$VERSION

# Redeclare the ARG to make its value available inside the build stage
ARG VERSION
RUN echo $VERSION > image_version
```

#### [RUN](https://docs.docker.com/engine/reference/builder/#run)

The `RUN` command executes a specified command during the build stage, which is commonly used for tasks such as installing packages. The `RUN` instruction offers two forms:

1. `RUN <command>` (shell form): The command is run through a shell, which by default is `/bin/sh -c` on Linux. Use this form when you want shell features such as variable expansion or `&&`.

2. `RUN ["executable", "param1", "param2"]` (exec form): The exec form is utilized when either the base image lacks a shell or you wish to avoid any form of unintended interpretation of the command string

To determine which form to use:
- Opt for the shell form when you need to execute a command that typically runs within a shell, such as `apt-get install`

- Choose the exec form when working with base images that lack a shell or when you desire precise control over the command execution without any string manipulation
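The difference between the two forms can be sketched as follows. The base image and the `echo` commands are chosen purely for illustration.

``` docker
FROM ubuntu:20.04

# Shell form: executed via /bin/sh -c, so $HOME is expanded and && chains commands
RUN echo "Home is $HOME" && ls /

# Exec form: no shell is involved, so "$HOME" is passed to echo as literal text
RUN ["echo", "Home is $HOME"]
```

The second instruction prints the literal string `Home is $HOME` during the build, because nothing expands the variable when no shell is invoked.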

#### [ENTRYPOINT](https://docs.docker.com/engine/reference/builder/#entrypoint)

`ENTRYPOINT` defines the entry point, which is the command executed when a container is started from an image. The `ENTRYPOINT` instruction can be expressed in two forms:

1. `ENTRYPOINT ["executable", "param1", "param2"]` (preferred exec form): This form specifies the executable and its parameters to be run when the container starts. This form does not invoke a shell, making it independent of the shell environment. It also permits the use of an optional `CMD` (specified after the command) to provide default arguments or parameters for the entry point command.

2. `ENTRYPOINT command param1 param2` (shell form): In the shell form, the entry point command is written as it would be typed in a shell and is run via `/bin/sh -c`. In this form, any `CMD` or `docker run` arguments are ignored.

Noteworthy aspects of `ENTRYPOINT`:
- The container runs as an executable, a practice that is generally recommended for robust containerization

- It is advisable to specify an `ENTRYPOINT` (unless you intend to use shell)

- Either `ENTRYPOINT` or `CMD` must be defined for container execution

``` docker
FROM ubuntu
# When we run a container from the image, top -b will be run
ENTRYPOINT ["top", "-b"]
```

#### [CMD](https://docs.docker.com/engine/reference/builder/#cmd)

`CMD` defines the default command, or default arguments for the entry point, which users can override when using `docker run`. The `CMD` instruction provides several forms:

1. `CMD ["executable","param1","param2"]`: In this form, you specify the executable as the entry point and provide default parameters. Users have the flexibility to override the entire command if needed.

2. `CMD ["param1","param2"]`: Here, the parameters `param1` and `param2` are set as default arguments to the previously defined `ENTRYPOINT`. Users can override these parameters during `docker run`.

3. `CMD command param1 param2` (shell form, discouraged): The `CMD` instruction is written as a plain command line and is run through `/bin/sh -c`. This form is discouraged because the shell, rather than your executable, becomes the container's main process, and because it does not combine cleanly with an exec-form `ENTRYPOINT`.

``` docker
FROM ubuntu
ENTRYPOINT ["top", "-b"]
CMD ["-c"]
```

Now, if we run a container from the above image, the command `top -b -c` will be executed.

- `top -b` __will always run__
- `-c` can be changed to some other flag/command via `docker run`

Let's see how `CMD` interacts with `ENTRYPOINT` for a better understanding. Note: `/bin/sh -c` is simply a command that executes the string that follows it in a shell.

![](images/docker_entrypoint_cmd_interaction.png)
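The most common combination, an exec-form `ENTRYPOINT` with an exec-form `CMD`, can be sketched with a small image. The `echo` command and its arguments are illustrative only, and `<image>` in the comments stands for whatever name you give the built image.

``` docker
FROM busybox

# Fixed part of the command: always executed
ENTRYPOINT ["echo", "hello"]

# Default argument: appended to the entry point unless overridden
CMD ["world"]

# docker run <image>         runs: echo hello world
# docker run <image> there   runs: echo hello there
```

Arguments given after the image name in `docker run` replace the `CMD` values, while the `ENTRYPOINT` part stays in place.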

#### [COPY](https://docs.docker.com/engine/reference/builder/#copy)

`COPY` specifies which files or directories should be copied from the host system into the image:

``` docker
COPY <src> <destination>
```

> One commonly seen idiom is `COPY . .`, which copies the files from the build context (the directory passed to `docker build`) into the current working directory inside the image. While it may appear as if files are being copied to the same location, the distinction lies in the two file systems involved:

- The first argument to `COPY` references the build context file system, determined by the path given to `docker build`
- The second argument to `COPY` points to the file system within the Docker image
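A few variations on `COPY` are sketched below. The file and folder names are borrowed from the scraper example later in this lesson, and `/app` is an arbitrary directory chosen for illustration.

``` docker
FROM python:3.8-slim-buster

# Relative destinations are resolved against the working directory
WORKDIR /app

# Copy a single file from the build context to /app/requirements.txt
COPY requirements.txt .

# Copy the contents of a directory from the build context into /app/scraper/
COPY scraper/ ./scraper/
```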

#### Other Dockerfile Commands

There are a few other noteworthy instructions:

- `LABEL <key>=<value>`: Facilitates the addition of metadata to the image, such as authorship, maintenance details, or contact information

- `WORKDIR dir`: Sets the working directory within the image for the instructions that follow it (such as `RUN`, `COPY`, `CMD`, and `ENTRYPOINT`)

- `ENV <key>=<value>`: Sets environment variables that remain accessible to later instructions in the build stage and to containers started from the image

- `EXPOSE <port>`: Although it's used less frequently, `EXPOSE` documents that the container listens on a specific port (e.g., `EXPOSE 80`). It does not publish the port by itself; users typically publish ports with the `-p` flag when running `docker run`.
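A short sketch showing these instructions together follows. All values (the maintainer address, the version, the directory, the variable name, and the port) are placeholders rather than values taken from a real project.

``` docker
FROM python:3.8-slim-buster

# Metadata attached to the image (placeholder values)
LABEL maintainer="jane@example.com" version="1.0"

# Later instructions run relative to /app; the directory is created if missing
WORKDIR /app

# Available to later build instructions and to running containers
ENV APP_ENV=production

# Documents the listening port; publishing still requires -p at docker run
EXPOSE 80
```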

### Dockerfile Best Practices

When writing Dockerfiles, it's essential to follow best practices to create efficient and secure images:

- **Use Official Base Images**: Whenever possible, start with official base images provided by the software's maintainers (e.g., Node.js, Python, Nginx) to ensure security and reliability

- **Minimize Layer Count**: Limit the number of layers (a layer represents a set of changes) in your image to reduce image size and improve build and push/pull times

- **Clean Up**: Remove unnecessary files and dependencies in the same Dockerfile instruction to minimize the image size

- **Security**: Ensure your Dockerfile and image follow security best practices, such as not running as root and using trusted sources for software installation

- **Documentation**: Include comments and labels in your Dockerfile to document the image's purpose, maintainer, and version
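Some of these practices can be sketched in a few lines. The user name `appuser` and the label values are hypothetical, and the base image is only one possible choice.

``` docker
# Official base image with an explicit tag
FROM ubuntu:20.04

# Document the image with labels (placeholder values)
LABEL maintainer="jane@example.com" description="Example image"

# Create an unprivileged user so the container does not run as root
RUN useradd --create-home appuser

# Subsequent instructions and the running container use this user
USER appuser
WORKDIR /home/appuser
```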

### Understand Cache Management

In Docker, some commands can invalidate the cache, necessitating the re-execution of every subsequent step when creating an image.

Consider the following Dockerfile example:

``` docker

FROM ubuntu:18.04

RUN apt-get update
COPY . .

RUN apt-get install -y --no-install-recommends python3
RUN rm -rf /var/lib/apt/lists/*
```

In this case, `python3` will be reinstalled whenever any file in the build context changes, because a change to the copied files invalidates the cache for `COPY . .` and for every instruction that follows it.

Instead, a more efficient approach is to follow the principle of placing `COPY` statements after setting up the operating system dependencies, especially when the installation of packages like Python is not dependent on the build context. This practice helps optimize the use of cache during image builds.
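A reordered version of the example above might look like the following sketch. The instructions are the same; only the position of `COPY` has changed.

``` docker
FROM ubuntu:18.04

# Operating system dependencies first: these layers stay cached
# as long as the instructions themselves do not change
RUN apt-get update
RUN apt-get install -y --no-install-recommends python3
RUN rm -rf /var/lib/apt/lists/*

# Build context last: a changed file only invalidates this step
COPY . .
```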

### Chaining Commands

To enhance the efficiency of your Dockerfile, it's advisable to chain multiple commands together using `&&` within a single `RUN` directive whenever possible.

Docker operates similarly to git, where it records only the changes (additions) made to the system. However, this behavior can have some undesired consequences:

- Temporary files left behind contribute to the image's size 
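- Containers become less opaque: files deleted in a later layer still exist in the earlier layer that created them, so anyone with access to the image can recover them, potentially exposing inner workings and vulnerabilities to attackers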

It's important to note that the primary instruction to watch out for is `RUN`, as most other instructions (apart from `COPY` and `ADD`) do not create additional filesystem layers. By consolidating multiple commands into a single `RUN` directive, you can optimize image size, maintain container opacity, and enhance the efficiency of your Dockerfile.
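As an illustration, the three separate `apt-get` steps from the cache example can be chained into one `RUN` directive. This is a sketch of the idea rather than a complete image.

``` docker
FROM ubuntu:18.04

# Update, install, and clean up in a single layer, so the
# package lists never end up stored in the final image
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/*
```

Because the cleanup happens in the same instruction that created the temporary files, they are removed before the layer is recorded.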

## Hands-On: Creating a Dockerfile

In this example, you will create a Docker image that runs the `celebrity_births` web scraper. You can download the necessary files for running this scraper [here](https://aicore-files.s3.amazonaws.com/Foundations/DevOps/celebrity_example.zip).

After downloading the file, `cd` to that folder, and create a Dockerfile named `Dockerfile`. Inside the Dockerfile, write the following: 

```docker
FROM python:3.8-slim-buster
```

> Every Docker image starts with a base image. This is the foundation upon which your image will be built.

Conventionally, Docker images are built from a pre-built Docker image that can be found on Docker Hub. The pre-built image usually contains some dependencies. A common use case is to use an image with Python installed. You can build on top of the pre-built image using the `FROM` clause, as indicated above. 

Thus, with the first added command, we begin creating the image with the necessary Python dependencies.

Dockerfiles then consist of a series of instructions that specify how the image should be configured and what should be included in it. These instructions include actions like installing software, copying files, setting environment variables, and more. 

In our example, we will continue by adding the following line to our Dockerfile:

``` docker
COPY . . 
```
This will copy everything in the directory containing the Dockerfile (`requirements.txt` and the `scraper` folder) into the image.

> Understanding this step is extremely important. When an image is built, the relevant files are copied into the container, which is analogous to copying them into a different and separate computer. In other words, it is almost as if there is a separate mini computer containing the scraper, with Python installed.

The first `.` argument following the `COPY` instruction is the location of the assets **on your machine** that you wish to copy. The second `.` argument following the `COPY` instruction is the location where the assets will be copied to **on the Docker container**. 

As the final step before running the scraper, your Python packages must be installed, e.g. `beautifulsoup` and `requests`. Fortunately, the requirements file was also copied into the image. Thus, the packages can be installed directly using the `RUN` instruction, followed by the shell command:

``` docker
RUN pip install -r requirements.txt
```

Now, we can run the Python script. Note that the `RUN` clause is unsuitable here because `RUN` is executed when the image is built. This is where you perform actions like installing software, setting up configurations, and adding files to the image. It affects the image's content but doesn't dictate what happens when a container is started from the image. 

On the other hand, the `CMD` instruction is used to specify the default command that should be executed when a container is run from the image. In essence, it determines the container's behaviour when it starts:

``` docker
CMD ["python", "scraper/celebrity_scraper.py"]
```
The `CMD` clause can be declared in many ways. In this case, we employ square brackets, and the first item is the executable (`python`), while the rest are the parameters (files). We will discuss in more detail different Dockerfile instructions in a later lesson.
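Putting the four instructions from this walkthrough together, the finished `Dockerfile` would look like this. The comments are added here for explanation only.

``` docker
# Base image with Python 3.8 preinstalled
FROM python:3.8-slim-buster

# Copy requirements.txt and the scraper folder into the image
COPY . .

# Install the Python packages at build time
RUN pip install -r requirements.txt

# Run the scraper when a container starts
CMD ["python", "scraper/celebrity_scraper.py"]
```

One possible refinement, in line with the cache advice earlier in this lesson, is to copy `requirements.txt` and run `pip install` before copying the rest of the files, so that code changes alone do not trigger a reinstall of the packages.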

## Key Takeaways

- Dockerfiles are text-based configuration files that define how Docker images should be built
- Dockerfile structure includes base image selection, instructions for setting up the environment, copying files, running commands, exposing ports, and defining the default command
- Best practices when creating Dockerfiles include: minimizing image layers, version tagging of base images, and adhering to security measures
- `FROM` sets the base image, defining the initial environment for your Docker image
- `RUN` executes commands during image creation, used for installing software and configuration tasks
- `COPY` copies files or directories into the image, crucial for adding application code and assets
- `ENTRYPOINT` defines the primary command when a container starts, while `CMD` provides default arguments that can be overridden during `docker run`
- Commands that invalidate the cache can lead to re-execution of subsequent steps; optimizing cache usage involves careful consideration of `COPY` placement and chaining commands using `&&` in a single `RUN` directive