**Chapter 5: Dockerfiles - Building Images**

While running pre-built images is useful, the true power of Docker in CI/CD lies in creating custom images tailored to your applications. A **Dockerfile** is a text document containing instructions to assemble a Docker image automatically. This chapter teaches you to write production-ready Dockerfiles—transforming source code into immutable, deployable artifacts. We will cover the complete instruction set, multi-language patterns, optimization techniques, and security hardening required for enterprise-grade containerization.

---

### 5.1 Dockerfile Anatomy and Syntax

A Dockerfile is a script of sequential instructions that Docker executes in order to build an image. Instructions that change the filesystem (`RUN`, `COPY`, `ADD`) each add a layer; the others record metadata in the image configuration. Understanding the syntax and execution model is fundamental to writing efficient, secure containers.

#### File Structure and Naming

**Standard Practice:**
- Filename: `Dockerfile` (no extension, capital D)
- Location: Repository root or `docker/` subdirectory
- Encoding: UTF-8, LF line endings (not CRLF)

**Example Repository Structure:**
```
myapp/
├── Dockerfile
├── .dockerignore
├── package.json
├── src/
└── README.md
```

#### Instruction Format

```dockerfile
# Comment
INSTRUCTION arguments
```

**Rules:**
- Instructions are case-insensitive (convention: UPPERCASE for visibility)
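- The first instruction must be `FROM` (base image); only comments, parser directives, and `ARG` may precede it
- Comments must start their own line; a `#` after an instruction is parsed as part of its arguments
- Instructions execute in order, top to bottom
- Each instruction is a cacheable build step; `RUN`, `COPY`, and `ADD` add filesystem layers, while the rest only record metadata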

#### The Layer Cache Mechanism

Docker builds images incrementally, caching layers to speed up rebuilds:

```dockerfile
# Layer 1: Base image (cached if available)
FROM node:20-alpine
# Layer 2: Directory creation
WORKDIR /app
# Layer 3: File copy (cache invalidated if files change)
COPY package*.json ./
# Layer 4: Dependency installation (expensive, cache desired)
RUN npm ci
# Layer 5: Source code (changes frequently)
COPY . .
# Metadata only (no filesystem layer)
CMD ["node", "server.js"]
```

**Critical Principle:** Order instructions from least-frequent change to most-frequent change. Dependencies (which change rarely) should install before source code (which changes frequently).
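
To make the principle concrete, here is a sketch of the same image with the order reversed. It reuses the illustrative `node:20-alpine` base and `server.js` entry point from the example above; nothing else is assumed.

```dockerfile
# Anti-pattern sketch: source is copied before dependencies are installed
FROM node:20-alpine
WORKDIR /app
# Any edit to any file in the context invalidates this layer...
COPY . .
# ...so the expensive install below reruns on every build
RUN npm ci
CMD ["node", "server.js"]
```

Because a changed layer invalidates every layer after it, a one-line source edit here forces a full `npm ci`. In the correctly ordered version, the install layer is reused until `package.json` or `package-lock.json` changes.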

#### Build Context

When you run `docker build`, the CLI sends the **build context** (the directory you pass as the final argument, minus anything excluded by `.dockerignore`) to the builder. A large context slows builds.

**View context size:**
```bash
docker build --no-cache -t test .  # Note transfer size in output
```

**Optimization:** Use `.dockerignore` (Chapter 9) to exclude files:
```gitignore
# .dockerignore
node_modules
.git
*.md
.env
.dockerignore
Dockerfile
```

**Key Takeaway:** Dockerfile syntax is simple, but **layer ordering** determines build performance. Structure your Dockerfile to maximize cache hits on expensive operations (dependency installation, compilation).

---

### 5.2 Base Images Selection

The `FROM` instruction defines the parent image—the foundation of your container. This choice impacts security, size, and compatibility more than any other decision.

#### Image Categories

**Official Images:**
- Maintained by Docker or upstream projects
- Security patched regularly
- Examples: `node`, `python`, `nginx`, `ubuntu`

**Variant Tags:**
- `latest`: Points to newest stable (avoid in production—non-deterministic)
- `alpine`: Minimal Linux (~5MB), musl libc, busybox tools
- `slim`: Debian-based with minimal packages (~50-100MB)
- `bookworm`, `bullseye`: Specific Debian releases (stable)
- `jammy`, `focal`: Specific Ubuntu LTS releases

#### Selection Strategy Matrix

| Requirement | Recommended Base | Rationale |
|-------------|-----------------|-----------|
| **Minimal size** | `alpine` or `distroless` | Smallest attack surface, fastest pulls |
| **Compatibility** | `slim` (Debian) | glibc compatibility, easier debugging |
| **Enterprise/Governance** | `ubuntu:22.04` | Familiar to ops teams, long-term support |
| **Security Critical** | `distroless` or `scratch` | No shell, minimal attack surface |

#### Security Considerations

**Anti-Pattern:**
```dockerfile
# Non-deterministic, may break builds
FROM ubuntu:latest
```

**Best Practice:**
```dockerfile
# Specific Node.js version on a specific Alpine release
FROM node:20.11.0-alpine3.18
```

**Digest Pinning (Maximum Security):**
```dockerfile
# Immutable reference
FROM node:20-alpine@sha256:abc123...
```

#### Specialized Bases

**Distroless (Google):**
Language-specific images containing only application and runtime—no package manager, shell, or unnecessary utilities.

```dockerfile
# Distroless example (Go application)
FROM golang:1.21 AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o server

FROM gcr.io/distroless/static-debian12
COPY --from=builder /app/server /server
CMD ["/server"]
```

**Scratch (Empty Base):**
For statically compiled binaries (Go, Rust):
```dockerfile
FROM scratch
COPY hello /
CMD ["/hello"]
```

**Key Takeaway:** Base image selection is a **security and supply chain decision**. Prefer specific versions over `latest`, Alpine or Distroless for production, and pin digests for critical systems. The base image is the largest layer—choose wisely.

---

### 5.3 Core Instructions (FROM, RUN, COPY, CMD)

Four instructions form the foundation of most Dockerfiles. Mastering their nuances separates working containers from production-grade artifacts.

#### FROM - Setting the Foundation

```dockerfile
FROM [--platform=<platform>] <image>[:<tag>] [AS <name>]
```

**Multi-Stage Usage:**
```dockerfile
FROM node:20-alpine AS builder
# ... build steps ...

FROM nginx:alpine AS production
# ... runtime steps ...
```

**Best Practice:** Always include a tag. `FROM node` implicitly uses `latest`, which changes over time.

#### RUN - Executing Commands

`RUN` executes commands in a new layer on top of the current image and commits the results.

**Shell Form (Uses `/bin/sh -c`):**
```dockerfile
RUN apt-get update && apt-get install -y curl
```

**Exec Form (JSON array, no shell processing):**
```dockerfile
# WRONG: && is passed to apt-get as a literal argument
RUN ["apt-get", "update", "&&", "apt-get", "install", "-y", "curl"]
# Correct: invoke a shell explicitly
RUN ["/bin/bash", "-c", "apt-get update && apt-get install -y curl"]
```

**Critical Best Practice - Clean Up in Same Layer:**
Each `RUN` creates a layer. Clean up temporary files in the same command to prevent bloat:

```dockerfile
# BAD: 100MB layer with package lists
RUN apt-get update
RUN apt-get install -y curl
RUN rm -rf /var/lib/apt/lists/*

# GOOD: Single layer, cleaned up
RUN apt-get update && apt-get install -y curl \
    && rm -rf /var/lib/apt/lists/*
```

#### COPY - Adding Files

Copies files from build context into the image.

```dockerfile
COPY [--chown=<user>:<group>] <src>... <dest>
```

**Behavior:**
- `<src>` is relative to the build context (the directory passed to `docker build`, not necessarily where the Dockerfile lives)
- If `<src>` is a directory, contents are copied (not directory itself)
- Supports wildcards (`COPY *.json ./`)

**Ownership:**
```dockerfile
# Change ownership during copy (avoids separate RUN chown)
COPY --chown=node:node package.json package-lock.json ./
```

**Comparison with ADD:**
- `COPY`: Simple file/directory copy (preferred)
- `ADD`: Additional tar extraction and remote URL support (avoid—non-obvious behavior)

**Example:**
```dockerfile
# Copy package files first (dependency caching)
COPY package*.json ./

# Copy source code
COPY src/ ./src/
```

#### CMD - Default Execution

Specifies the default command to run when the container starts. Only the last `CMD` in a Dockerfile takes effect, and any arguments given to `docker run` after the image name replace it.

**Shell Form:**
```dockerfile
CMD node server.js
# Executes: /bin/sh -c 'node server.js'
# Problem: PID 1 is shell, not node (signal handling issues)
```

**Exec Form (Preferred):**
```dockerfile
CMD ["node", "server.js"]
# Executes: node server.js directly
# PID 1 is node process (proper signal handling)
```

**Key Takeaway:** Use `COPY` over `ADD`, `RUN` with cleanup chains, and always prefer **Exec form** for `CMD` to ensure proper process management and signal handling in production containers.

---

### 5.4 Working Directories and File Structure

Proper filesystem organization ensures clarity, security, and compatibility with Kubernetes security policies.

#### WORKDIR Instruction

Sets the working directory for subsequent `RUN`, `CMD`, `ENTRYPOINT`, `COPY`, and `ADD` instructions.

```dockerfile
WORKDIR /app
```

**Benefits over `RUN mkdir` and `cd`:**
- Creates directory if it doesn't exist
- Cleaner syntax
- Absolute path for all subsequent operations

**Standard Directory Conventions:**

| Directory | Purpose |
|-----------|---------|
| `/app` | Application code (generic) |
| `/usr/src/app` | Application code (Node.js convention) |
| `/opt/app` | Optional/add-on software |
| `/var/lib/<name>` | Persistent data (databases) |
| `/tmp` | Temporary files (often mounted as tmpfs) |

#### Non-Root User Setup

Running containers as root is a security risk. Implement least privilege:

```dockerfile
FROM node:20-alpine

# Create app directory
WORKDIR /app

# Create non-root user
# -G makes nodejs the user's primary group
RUN addgroup -g 1001 -S nodejs \
    && adduser -S -u 1001 -G nodejs nodejs

# Copy with ownership
COPY --chown=nodejs:nodejs . .

# Switch to non-root
USER nodejs

EXPOSE 3000
CMD ["node", "index.js"]
```

**Verification:**
```bash
docker run --rm myapp id  # Should show uid=1001(nodejs)
```

#### File Permissions

Ensure files are readable by the runtime user but not writable where unnecessary:

```dockerfile
COPY --chown=app:app --chmod=755 startup.sh .
```

**Read-Only Filesystems:**
For maximum security, run containers with a read-only root filesystem (`docker run --read-only`, or `readOnlyRootFilesystem: true` in a Kubernetes `securityContext`):

```dockerfile
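# Ensure /tmp exists with sticky, world-writable permissions
# (-p avoids a build failure when the base image already has /tmp)
RUN mkdir -p /tmp && chmod 1777 /tmp

# With a read-only root filesystem, mount /tmp as tmpfs at runtime
# (for example: docker run --read-only --tmpfs /tmp); the application writes only there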
```

**Key Takeaway:** Structure containers with **explicit working directories**, **non-root users**, and **minimal permissions**. These practices satisfy Kubernetes Pod Security Standards and reduce attack surface.

---

### 5.5 Environment Variables

Environment variables configure containers without modifying code, aligning with the 12-Factor App methodology.

#### ENV Instruction

Sets environment variables available during build and runtime.

```dockerfile
# Build-time and runtime variables
ENV NODE_ENV=production
ENV PORT=3000
ENV LOG_LEVEL=info

# Can reference other variables
ENV APP_HOME=/app
ENV DATA_DIR=$APP_HOME/data
```

**Build Arguments vs. Environment Variables:**
- `ARG`: Build-time only (not set in the running container, but values remain visible in `docker history`, so do not pass secrets this way)
- `ENV`: Runtime available (visible in `docker inspect`, `docker exec`)

```dockerfile
# Build-time value (not set in the running container)
ARG BUILD_VERSION
RUN echo "Building version $BUILD_VERSION"

# Runtime configuration
ENV APP_VERSION=$BUILD_VERSION
```

#### Configuration Best Practices

**1. Provide Defaults:**
```dockerfile
ENV DATABASE_URL=postgres://localhost:5432/myapp
ENV CACHE_TTL=3600
```

**2. Document Required Variables:**
```dockerfile
# Required: API_KEY
# Optional: LOG_LEVEL (default: info)
```

**3. Sensitive Data:**
Never hardcode secrets in ENV instructions. Inject at runtime:

```bash
# Good: Runtime injection
docker run -e API_KEY="$API_KEY" myapp

# Or via env-file
docker run --env-file .env.production myapp
```

#### Build Arguments for Variability

Use `ARG` to customize builds without changing the Dockerfile:

```dockerfile
ARG NODE_VERSION=20
FROM node:$NODE_VERSION-alpine

ARG BUILD_DATE
ARG VCS_REF

# Label with build metadata
LABEL org.opencontainers.image.created=$BUILD_DATE \
      org.opencontainers.image.revision=$VCS_REF
```

**Build with arguments:**
```bash
docker build \
  --build-arg NODE_VERSION=18 \
  --build-arg BUILD_DATE=$(date -u +'%Y-%m-%dT%H:%M:%SZ') \
  -t myapp:latest .
```
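
One scoping detail is easy to miss: an `ARG` declared before `FROM` lives outside any build stage, so it can be used in `FROM` lines but is empty inside the stage unless it is redeclared. A minimal sketch, reusing `NODE_VERSION` from above (the `echo` is there only to show the value):

```dockerfile
# Declared before FROM: usable only in FROM lines
ARG NODE_VERSION=20
FROM node:${NODE_VERSION}-alpine

# Redeclare without a value to inherit the pre-FROM default (or a --build-arg override)
ARG NODE_VERSION
RUN echo "Base image Node version: ${NODE_VERSION}"
```

Without the second `ARG NODE_VERSION`, the `RUN` line would see an empty string.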

**Key Takeaway:** Use `ENV` for runtime configuration with sensible defaults, `ARG` for build-time variability. Never commit secrets to images—inject sensitive configuration at runtime via orchestration platforms (Kubernetes Secrets, Docker Secrets).

---

### 5.6 Exposing Ports

The `EXPOSE` instruction documents which ports the container listens on. It does not publish them (that happens at runtime with `-p`), but it serves as documentation for operators and determines which ports `docker run -P` publishes.

#### Syntax and Usage

```dockerfile
EXPOSE <port> [<port>/<protocol>...]

# Examples (TCP is the default protocol):
EXPOSE 3000
EXPOSE 80/tcp
EXPOSE 53/udp
EXPOSE 8080/tcp 8080/udp
```

#### Multi-Port Applications

Document all listening ports:

```dockerfile
# Web application with metrics endpoint
# Application traffic
EXPOSE 8080
# Prometheus metrics
EXPOSE 9090
# Node.js debugger (development only)
EXPOSE 9229
```

#### Runtime Port Mapping

`EXPOSE` is metadata; actual port binding occurs at runtime:

```bash
# Map host port 80 to container port 8080
docker run -p 80:8080 myapp

# Map to random host port
docker run -P myapp  # Maps all EXPOSEd ports to random high ports
```

**Dynamic Ports in Kubernetes:**
```yaml
containers:
  - name: app
    image: myapp:latest
    ports:
      - containerPort: 8080  # Port the app listens on; Kubernetes does not read EXPOSE
        name: http
      - containerPort: 9090
        name: metrics
```

**Key Takeaway:** Always `EXPOSE` every port your application listens on, even though it is only documentation. It tells operators and tooling (for example `docker run -P`) which ports matter; Kubernetes ignores `EXPOSE` and relies on the ports declared in the Pod spec.

---

### 5.7 CMD vs. ENTRYPOINT

These instructions define the container's default execution. Understanding their interaction is crucial for creating flexible, reusable images.

#### CMD (Command)

Sets default command and/or parameters, which can be overridden from the command line.

```dockerfile
CMD ["node", "server.js"]
```

Can be completely replaced:
```bash
docker run myapp /bin/sh  # Overrides CMD, runs shell instead
```

#### ENTRYPOINT

Configures the container to run as an executable. Arguments appended to `docker run` are passed to ENTRYPOINT, not replacing it.

**Exec Form (Preferred):**
```dockerfile
ENTRYPOINT ["node", "server.js"]

# Usage (shell command, not a Dockerfile instruction):
#   docker run myapp --port=8080
# Executes: node server.js --port=8080
```

**Shell Form (Avoid):**
```dockerfile
ENTRYPOINT node server.js
# Runs as /bin/sh -c 'node server.js' (signal handling issues; CMD and docker run arguments are ignored)
```

#### Combined Pattern (Best Practice)

Use `ENTRYPOINT` for the fixed executable and `CMD` for default arguments:

```dockerfile
ENTRYPOINT ["node", "server.js"]
CMD ["--port=3000", "--env=production"]
```

**Behavior:**
```bash
# Default execution
docker run myapp
# Executes: node server.js --port=3000 --env=production

# Override defaults
docker run myapp --port=8080
# Executes: node server.js --port=8080

# Override entrypoint entirely
docker run --entrypoint /bin/sh myapp
# Executes: /bin/sh
```

#### Real-World Example: Nginx

```dockerfile
FROM nginx:alpine
COPY default.conf /etc/nginx/conf.d/
EXPOSE 80
ENTRYPOINT ["nginx", "-g", "daemon off;"]
```

The `daemon off;` directive keeps Nginx in the foreground as PID 1. If Nginx daemonized instead, the container's main process would exit and the container would stop.

#### Init Systems and Signal Handling

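Containers should run a single main process as PID 1. If you genuinely need several processes in one container, use a process manager such as `supervisord`. For a single application that does not reap zombie processes or forward signals itself, run it under a minimal init such as `tini`: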

```dockerfile
FROM alpine:3.18
RUN apk add --no-cache tini
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["myapp"]
```

**Key Takeaway:** Use `ENTRYPOINT` for the main executable to ensure the container always behaves as that application. Use `CMD` for default flags that operators might want to override. This pattern creates self-documenting containers that behave predictably in CI/CD pipelines.

---

### 5.8 Health Checks

A container running doesn't mean the application is healthy. The `HEALTHCHECK` instruction tells Docker how to test if the container is functioning correctly.

#### Syntax

```dockerfile
HEALTHCHECK [OPTIONS] CMD command
```

**Options:**
- `--interval=DURATION` (default: 30s): Time between checks
- `--timeout=DURATION` (default: 30s): Maximum time for check to complete
- `--start-period=DURATION` (default: 0s): Grace period for container startup
- `--retries=N` (default: 3): Consecutive failures before marking unhealthy

#### Implementation Examples

**HTTP Endpoint Check:**
```dockerfile
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
```

**TCP Port Check (for databases):**
```dockerfile
HEALTHCHECK --interval=10s --timeout=5s \
  CMD nc -z localhost 5432 || exit 1
```

**Custom Script:**
```dockerfile
COPY healthcheck.sh /usr/local/bin/
HEALTHCHECK --interval=5s CMD /usr/local/bin/healthcheck.sh
```
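
The `curl` check above assumes `curl` is installed, which is typically not the case on Alpine-based images. A hedged alternative is BusyBox `wget`, which Alpine includes by default. The `/health` path and port 3000 are carried over from the example above and are assumptions about your application.

```dockerfile
FROM node:20-alpine
WORKDIR /app
COPY . .
EXPOSE 3000
# --spider checks the URL without saving the body; -q suppresses output.
# 127.0.0.1 avoids localhost resolving to an IPv6 address the app may not listen on.
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD wget -q --spider http://127.0.0.1:3000/health || exit 1
CMD ["node", "server.js"]
```

Whichever tool you pick, confirm it exists in the final image; a missing binary makes every check fail and the container is reported as `unhealthy`.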

#### Health Status

Docker tracks container health:
- `starting`: Initial grace period (`start-period`)
- `healthy`: Command returns exit code 0
- `unhealthy`: Command returns non-zero `retries` consecutive times

**View Status:**
```bash
docker ps
# Shows (healthy) or (unhealthy) next to container status

docker inspect --format='{{.State.Health.Status}}' myapp
```

#### Kubernetes Integration

Kubernetes ignores Docker's `HEALTHCHECK` and relies on its own liveness, readiness, and startup probes, but Docker health checks are still useful for:
- Local development consistency
- Docker Compose orchestration
- Swarm mode auto-healing

**Key Takeaway:** Always implement `HEALTHCHECK` for long-running services. It enables Docker Swarm auto-recovery and provides visibility into application state beyond process existence.

---

### 5.9 Common Pitfalls and Solutions

Even experienced developers make these mistakes. Recognizing and avoiding them ensures reliable, secure, and efficient images.

#### Pitfall 1: Layer Bloat (Multiple RUN Commands)

**Problem:**
```dockerfile
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get install -y git
RUN rm -rf /var/lib/apt/lists/*
```

Each RUN creates a layer. The deleted files in layer 4 still exist in layers 1-3, bloating the image.

**Solution:**
```dockerfile
RUN apt-get update && apt-get install -y \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*
```

#### Pitfall 2: Sensitive Data in Layers

**Problem:**
```dockerfile
COPY .env .
RUN . ./.env && build-app
```

The `.env` file exists in the layer history even if deleted in a later step.

**Solution:**
Use BuildKit secrets or multi-stage builds:

```dockerfile
# syntax=docker/dockerfile:1
FROM base AS builder
RUN --mount=type=secret,id=api_key \
    build-app --api-key="$(cat /run/secrets/api_key)"

# Build command (api_key.txt is a hypothetical file holding only the key):
# DOCKER_BUILDKIT=1 docker build --secret id=api_key,src=api_key.txt .
```

#### Pitfall 3: Using ADD Instead of COPY

**Problem:**
```dockerfile
ADD https://example.com/app.tar.gz /app/
```
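`ADD` fetches remote URLs and auto-extracts local tar archives, but it does not extract archives it downloads, so this line leaves a still-compressed `app.tar.gz` in `/app/` with no integrity check. The dual behavior is non-obvious and easy to misuse.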

**Solution:**
Use COPY for local files. Explicitly download and verify remote resources:

```dockerfile
RUN mkdir -p /app \
    && curl -fsSL https://example.com/app.tar.gz -o app.tar.gz \
    && echo "expected_checksum  app.tar.gz" | sha256sum -c - \
    && tar -xzf app.tar.gz -C /app \
    && rm app.tar.gz
```

#### Pitfall 4: Running as Root

**Problem:**
Most base images default to root (UID 0). Compromised applications have full container access.

**Solution:**
Create and switch to non-root user:

```dockerfile
RUN useradd -m -u 1000 appuser
USER appuser
```
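
Some official images already ship an unprivileged account, in which case creating another one is unnecessary. The official Node.js images, for instance, include a `node` user. A minimal sketch (the `server.js` entry point is illustrative):

```dockerfile
FROM node:20-alpine
WORKDIR /app
# Hand the files to the built-in unprivileged user at copy time
COPY --chown=node:node . .
# Everything from here on, including the running container, uses this user
USER node
CMD ["node", "server.js"]
```

The `node` user in these images has UID 1000, so `useradd -u 1000` would collide with it on the Debian-based variants. Check the base image's documentation for existing accounts before adding your own.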

#### Pitfall 5: Missing .dockerignore

**Problem:**
Build context includes `node_modules`, `.git`, and local secrets, increasing build time and potentially leaking data.

**Solution:**
Create comprehensive `.dockerignore` (Chapter 9).

#### Pitfall 6: Hardcoded Configuration

**Problem:**
```dockerfile
ENV DATABASE_URL=postgres://prod-db:5432/mydb
```

Image is tied to specific environment.

**Solution:**
```dockerfile
# Set defaults only
ENV DATABASE_URL=postgres://localhost:5432/mydb
# Override at runtime
```

#### Pitfall 7: Not Handling Signals Properly

**Problem:**
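Shell form `CMD` runs the application under `/bin/sh -c`, and the shell does not forward `SIGTERM` (sent by `docker stop`) to it. Shutdown then stalls until the 10s grace period expires and Docker sends `SIGKILL`.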

**Solution:**
Use exec form and ensure PID 1 is the application:

```dockerfile
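CMD ["node", "server.js"]
# or, if shell form is required, let exec replace the shell
# (exec is a shell builtin, so it cannot be used in exec form):
ENTRYPOINT exec node server.js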
```
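
Shell form is sometimes chosen deliberately because it expands environment variables at container start. A sketch of that case, assuming a hypothetical `--port` flag understood by `server.js`:

```dockerfile
FROM node:20-alpine
WORKDIR /app
COPY . .
ENV PORT=3000
# Shell form: /bin/sh expands $PORT, then exec replaces the shell with node,
# so node runs as PID 1 and receives SIGTERM directly
CMD exec node server.js --port="$PORT"
```

Exec form would pass the literal string `$PORT` instead, since no shell is involved to expand it.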

**Key Takeaway:** Most Dockerfile errors stem from misunderstanding the layer cache or process management. Review your Dockerfile with these pitfalls in mind: minimize layers, avoid secrets in history, run as non-root, and respect PID 1 responsibilities.

---

### Chapter Summary and Preview

In this chapter, you mastered the art of creating Docker images through Dockerfiles. You learned the **syntax and layer architecture**, understanding that instruction ordering directly impacts build performance through caching. We explored **base image selection**—weighing Alpine against Distroless against standard distributions based on security and compatibility needs. You implemented the four core instructions (`FROM`, `RUN`, `COPY`, `CMD`) with security-conscious patterns including non-root users and minimal permissions. We covered **environment variable management** distinguishing between build-time arguments and runtime configuration, **port documentation** through `EXPOSE`, and the critical distinction between `CMD` and `ENTRYPOINT` for defining container behavior. You added **health checks** for operational visibility and learned to avoid common pitfalls including layer bloat and signal handling errors.

You can now write Dockerfiles that produce secure, efficient, and maintainable images—the fundamental artifacts that flow through CI/CD pipelines. These skills translate directly to the next phase of our journey: creating multi-language application containers with optimized dependency management and runtime configurations.

In **Chapter 6: Building Multi-Language Applications**, we apply these Dockerfile fundamentals to real-world technology stacks. You will build production-ready containers for Python (Django/Flask), Node.js (Express), Java (Spring Boot), and Go applications. We will handle language-specific challenges: Python's virtual environments, Node.js multi-stage builds for minimal production images, Java's JAR packaging and JVM tuning, and Go's static compilation for Distroless deployment. Each section provides complete, working Dockerfiles following the patterns established here, preparing you to containerize any application in your CI/CD pipeline.