# Module 05: Containerization with Docker

**Difficulty**: ⭐⭐ Intermediate
**Estimated Time**: 65 minutes
**Prerequisites**:
- [Module 04: ML APIs with FastAPI](04_ml_apis_fastapi.ipynb)
- Basic command line knowledge

## Learning Objectives
By the end of this notebook, you will be able to:
1. Understand why Docker is essential for ML deployment
2. Create Dockerfiles for ML applications
3. Build Docker images for ML models
4. Run containers with proper configuration
5. Optimize Docker images for production
6. Use Docker best practices for ML workloads

## 1. Why Docker for ML?

### The Problem: "It Works on My Machine"

You've trained a perfect model on your laptop:
- Python 3.9, scikit-learn 1.0, CUDA 11.4
- Works flawlessly!

Production server:
- Python 3.7, scikit-learn 0.23, no CUDA
- Model crashes... 😰

### The Solution: Docker Containers

**Docker** packages your application with **ALL** dependencies:
- Exact Python version
- All libraries with correct versions
- System dependencies
- Environment variables
- Configuration files

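**Result**: If it works in Docker on your laptop, it behaves the same on any host with a compatible container runtime and CPU architecture. (GPU drivers are the main exception: they still come from the host.)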

### Benefits for ML

1. **Reproducibility**: Exact same environment everywhere
2. **Portability**: Run on any machine with Docker
3. **Isolation**: Each model in its own container
4. **Scalability**: Easy to replicate containers
5. **Version Control**: Images are tagged (e.g. `v1`), so you can roll back to a known-good build

### Docker vs VM

| Feature | Docker Container | Virtual Machine |
|---------|------------------|----------------|
| Typical size | Tens to hundreds of MB | Several GB |
| Startup | Seconds or less | Tens of seconds to minutes |
| Performance | Near-native | Hypervisor overhead |
| Isolation | Process-level (shared host kernel) | Full guest OS |

In [None]:
# Setup: Import required libraries
import warnings
warnings.filterwarnings('ignore')

import os
import json
from pathlib import Path
import subprocess

print("Setup complete!")
print("\nNote: This notebook demonstrates Docker concepts.")
print("Actual Docker commands should be run in a terminal.")
print("\nDocker must be installed on your system to run the examples.")

## 2. Docker Fundamentals

### Key Concepts

**Image**: Blueprint for containers (like a class)
- Contains OS, code, dependencies
- Immutable
- Can be shared via registries

**Container**: Running instance of an image (like an object)
- Isolated process
- Has its own filesystem
- Can be started/stopped

**Dockerfile**: Recipe to build an image
- Text file with instructions
- Instructions that change the filesystem (such as `RUN`, `COPY`, `ADD`) add a layer; most others only record metadata

**Registry**: Storage for images
- Docker Hub (public)
- AWS ECR, Google Artifact Registry (successor to GCR) (private)

### Essential Docker Commands

```bash
# Build image from Dockerfile
docker build -t my-ml-model:v1 .

# Run container
docker run -p 8000:8000 my-ml-model:v1

# List images
docker images

# List running containers
docker ps

# Stop container
docker stop <container-id>

# View logs
docker logs <container-id>

# Remove container
docker rm <container-id>

# Remove image
docker rmi my-ml-model:v1
```
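
Because this notebook writes files from Python, it can be handy to assemble the same commands from Python as well. The following is a minimal sketch, not part of the original module: the helper names `docker_available`, `build_command`, `run_command`, and `run` are hypothetical, and by default nothing is executed (the command is only printed).

```python
import shutil
import subprocess
from typing import List


def docker_available() -> bool:
    """Return True if a `docker` executable is found on PATH."""
    return shutil.which("docker") is not None


def build_command(tag: str, context: str = ".") -> List[str]:
    """Argument list equivalent to: docker build -t <tag> <context>"""
    return ["docker", "build", "-t", tag, context]


def run_command(tag: str, host_port: int = 8000, container_port: int = 8000) -> List[str]:
    """Argument list equivalent to: docker run -p <host>:<container> <tag>"""
    return ["docker", "run", "-p", f"{host_port}:{container_port}", tag]


def run(command: List[str], dry_run: bool = True):
    """Print the command; execute it only when dry_run is False."""
    print("$ " + " ".join(command))
    if dry_run:
        return None
    if not docker_available():
        raise RuntimeError("docker executable not found on PATH")
    return subprocess.run(command, check=True)


# Dry run: shows the command without needing Docker installed.
run(build_command("my-ml-model:v1"))
run(run_command("my-ml-model:v1"))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when a tag or path contains spaces.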

## 3. Creating a Dockerfile

Let's create a Dockerfile for our FastAPI ML application.

In [None]:
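# Create project directory
project_dir = Path("docker_ml_app")
project_dir.mkdir(exist_ok=True)

# Create app and models subdirectories (the Dockerfile copies both,
# so the build fails if either is missing from the build context)
app_dir = project_dir / "app"
app_dir.mkdir(exist_ok=True)
models_dir = project_dir / "models"
models_dir.mkdir(exist_ok=True)

print("✓ Project structure created")
print(f"  Directory: {project_dir.absolute()}")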

In [None]:
# Create requirements.txt
requirements_content = """fastapi==0.104.1
uvicorn[standard]==0.24.0
scikit-learn==1.3.2
numpy==1.24.3
pandas==2.0.3
pydantic==2.5.0
joblib==1.3.2
"""

with open(project_dir / "requirements.txt", "w") as f:
    f.write(requirements_content)

print("✓ requirements.txt created")
print("\nContents:")
print(requirements_content)

In [None]:
# Create Dockerfile
dockerfile_content = """# Start from official Python image
FROM python:3.9-slim

# Set working directory
WORKDIR /app

# Copy requirements first (for layer caching)
COPY requirements.txt .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY ./app ./app

# Copy model files
COPY ./models ./models

# Expose port
EXPOSE 8000

# Run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
"""

with open(project_dir / "Dockerfile", "w") as f:
    f.write(dockerfile_content)

print("✓ Dockerfile created")
print("\nContents:")
print(dockerfile_content)

## 4. Understanding Dockerfile Instructions

Let's break down each instruction:

### FROM python:3.9-slim
- **Purpose**: Choose base image
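- **Why slim?**: Smaller size (roughly 100 MB vs roughly 900 MB for the full image; exact sizes vary by release)
- **Trade-off**: Compilers and many system libraries are missing, so you may need to install system dependencies with `apt-get`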

### WORKDIR /app
- **Purpose**: Set working directory inside container
- **Why?**: All subsequent commands run from here

### COPY requirements.txt .
- **Purpose**: Copy file from host to container
- **Why first?**: Docker layer caching - if requirements don't change, this layer is reused

### RUN pip install...
- **Purpose**: Execute command during build
- **--no-cache-dir**: Don't store pip cache (saves space)

### COPY ./app ./app
- **Purpose**: Copy application code
- **Why after requirements?**: Code changes more often than dependencies

### EXPOSE 8000
- **Purpose**: Document which port the app uses
- **Note**: Doesn't actually publish the port (done with `docker run -p`)

### CMD [...]
- **Purpose**: Command to run when container starts
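- **Format**: JSON array (exec form), which starts the process directly instead of through a shell, so signals such as the `SIGTERM` sent by `docker stop` reach the application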
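
To make the breakdown above concrete, here is a small sketch that splits a Dockerfile into instructions and labels each one. It is a simplification written for this notebook, not how Docker parses files: `parse_instructions` and `describe` are hypothetical helpers, only backslash continuations are handled, and `WORKDIR` is labeled as metadata even though it can create a directory.

```python
# Instructions treated here as adding files to the image (a simplification).
FILESYSTEM_INSTRUCTIONS = {"RUN", "COPY", "ADD"}


def parse_instructions(dockerfile_text):
    """Return (keyword, arguments) pairs, skipping comments and blank lines."""
    instructions = []
    pending = ""  # accumulates lines joined by a trailing backslash
    for raw_line in dockerfile_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("\\"):
            pending += line[:-1].rstrip() + " "
            continue
        keyword, _, arguments = (pending + line).partition(" ")
        instructions.append((keyword.upper(), arguments.strip()))
        pending = ""
    return instructions


def describe(keyword):
    """Rough label for what an instruction contributes to the image."""
    if keyword == "FROM":
        return "base image"
    if keyword in FILESYSTEM_INSTRUCTIONS:
        return "adds files"
    return "metadata"


SAMPLE = """\
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY ./app ./app
COPY ./models ./models
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
"""

for keyword, arguments in parse_instructions(SAMPLE):
    print(f"{keyword:<8} {describe(keyword):<11} {arguments}")
```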

## 5. Optimizing Docker Images for ML

### Multi-Stage Builds

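Use separate build and runtime stages so that compilers, headers, and other build-only tooling stay out of the final image. The saving is largest when dependencies must be compiled; when everything installs from prebuilt wheels it can be small.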

In [None]:
# Create optimized Dockerfile with multi-stage build
optimized_dockerfile = """# Build stage
FROM python:3.9 AS builder

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Runtime stage
FROM python:3.9-slim

WORKDIR /app

# Copy only installed packages from builder
COPY --from=builder /root/.local /root/.local

# Copy application
COPY ./app ./app
COPY ./models ./models

# Make sure scripts in .local are usable
ENV PATH=/root/.local/bin:$PATH

EXPOSE 8000

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
"""

with open(project_dir / "Dockerfile.optimized", "w") as f:
    f.write(optimized_dockerfile)

print("✓ Optimized Dockerfile created")
print("\nKey optimizations:")
print("  1. Multi-stage build (smaller final image)")
print("  2. Only runtime dependencies in final image")
print("  3. Build artifacts discarded")

## 6. .dockerignore File

Exclude unnecessary files from the build context:

In [None]:
# Create .dockerignore
dockerignore_content = """# Python
# Patterns are anchored at the context root; the **/ prefix matches at any depth
**/__pycache__
**/*.pyc
**/*.pyo
**/*.pyd
.Python
**/*.so

# Jupyter
**/.ipynb_checkpoints
**/*.ipynb

# Virtual environments
venv/
env/
ENV/

# IDEs
.vscode/
.idea/
**/*.swp

# Git
.git/
.gitignore

# Data (use separate volume)
data/raw/
data/processed/
**/*.csv
**/*.parquet

# Documentation
README.md
docs/

# Testing
tests/
.pytest_cache/
.coverage
"""

with open(project_dir / ".dockerignore", "w") as f:
    f.write(dockerignore_content)

print("✓ .dockerignore created")
print("\nBenefits:")
print("  - Faster builds (smaller context)")
print("  - Smaller images")
print("  - Prevents copying sensitive files")
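
To get a feel for how much a `.dockerignore` can save, you can measure a directory before and after skipping some folders. The sketch below is a rough approximation only: `context_size_bytes` is a hypothetical helper that prunes directories by name, and it does not implement Docker's `.dockerignore` pattern syntax.

```python
import os
import tempfile
from pathlib import Path


def context_size_bytes(root, skip_dirs=frozenset()):
    """Sum file sizes under root, pruning directories named in skip_dirs."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [name for name in dirnames if name not in skip_dirs]
        for filename in filenames:
            total += os.path.getsize(os.path.join(dirpath, filename))
    return total


# Throwaway project: a small source file next to a larger data file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "app").mkdir()
    (root / "app" / "main.py").write_text("print('hi')\n")
    (root / "data").mkdir()
    (root / "data" / "train.csv").write_bytes(b"0" * 100_000)

    full = context_size_bytes(root)
    trimmed = context_size_bytes(root, skip_dirs={"data"})
    print(f"whole directory: {full} bytes")
    print(f"without data/:   {trimmed} bytes")
```

On a real project, pointing `context_size_bytes` at the directory that holds the Dockerfile gives a quick estimate of what gets sent to the Docker daemon on each build.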

## 7. Docker Best Practices for ML

### 1. Use Specific Base Images

```dockerfile
# ❌ Bad: Latest tag changes over time
FROM python:latest

# ✅ Good: Specific version
FROM python:3.9.18-slim
```
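
A check like this is easy to automate. The following sketch flags `FROM` lines that use `latest` or no tag at all; `unpinned_base_images` is a hypothetical helper, and it treats any explicit tag as "pinned" even though a tag such as `3.9` can still move between patch releases.

```python
def unpinned_base_images(dockerfile_text):
    """Return FROM images that use `latest` or no tag at all."""
    flagged = []
    stage_names = set()  # aliases from earlier `FROM ... AS name` lines
    for raw_line in dockerfile_text.splitlines():
        parts = raw_line.split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Drop flags such as --platform=linux/amd64.
        args = [part for part in parts[1:] if not part.startswith("--")]
        if not args:
            continue
        image = args[0]
        alias = args[2].lower() if len(args) >= 3 and args[1].upper() == "AS" else None
        refers_to_stage = image.lower() in stage_names or image == "scratch"
        # An image referenced by digest (name@sha256:...) is already pinned.
        if not refers_to_stage and "@" not in image:
            name = image.rsplit("/", 1)[-1]  # ignore registry host and port
            tag = name.split(":", 1)[1] if ":" in name else None
            if tag is None or tag == "latest":
                flagged.append(image)
        if alias:
            stage_names.add(alias)
    return flagged


example = """\
FROM python:latest
FROM python:3.9.18-slim
FROM python
FROM python:3.9 AS builder
FROM builder
"""
print(unpinned_base_images(example))
```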

### 2. Minimize Layers

```dockerfile
# ❌ Bad: Multiple RUN commands (more layers)
RUN pip install numpy
RUN pip install pandas
RUN pip install scikit-learn

# ✅ Good: Single RUN command
RUN pip install numpy pandas scikit-learn
```

### 3. Order Instructions by Change Frequency

```dockerfile
# ✅ Good order:
# 1. Base image (rarely changes)
# 2. System dependencies (rarely changes)
# 3. Python dependencies (changes occasionally)
# 4. Application code (changes frequently)
```
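
Why the order matters can be shown with a toy model of the build cache. This is a deliberately simplified sketch, not Docker's actual algorithm: each step's key is a hash of its parent key, its instruction text, and a stand-in for the files it copies, so a change at one step invalidates every step after it. The names `layer_keys`, `rebuilt_steps`, `good_order`, and `bad_order` are hypothetical.

```python
import hashlib


def layer_keys(steps):
    """Chain a hash through the steps so each key depends on all earlier ones."""
    keys = []
    parent = ""
    for instruction, content in steps:
        digest = hashlib.sha256(f"{parent}|{instruction}|{content}".encode()).hexdigest()
        keys.append(digest)
        parent = digest
    return keys


def rebuilt_steps(old_steps, new_steps):
    """Return the instructions in new_steps that would miss the cache."""
    old_keys = layer_keys(old_steps)
    new_keys = layer_keys(new_steps)
    rebuilt = []
    for index, (instruction, _) in enumerate(new_steps):
        if index >= len(old_keys) or new_keys[index] != old_keys[index]:
            rebuilt.append(instruction)
    return rebuilt


def good_order(code):
    """Requirements are copied and installed before the application code."""
    return [
        ("FROM python:3.9-slim", ""),
        ("COPY requirements.txt .", "fastapi==0.104.1"),
        ("RUN pip install -r requirements.txt", ""),
        ("COPY ./app ./app", code),
    ]


def bad_order(code):
    """Everything is copied at once, before installing dependencies."""
    return [
        ("FROM python:3.9-slim", ""),
        ("COPY . .", "fastapi==0.104.1 + " + code),
        ("RUN pip install -r requirements.txt", ""),
    ]


# Only the application code changes between the two builds.
print("good order rebuilds:", rebuilt_steps(good_order("v1"), good_order("v2")))
print("bad order rebuilds: ", rebuilt_steps(bad_order("v1"), bad_order("v2")))
```

In the good order a code change reruns only the final `COPY`; in the bad order it also reruns the slow `pip install` step.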

### 4. Use .dockerignore

Exclude unnecessary files from the build context (see Section 6).

### 5. Run as Non-Root User

```dockerfile
# Create non-root user
RUN useradd -m -u 1000 mluser
USER mluser
```

### 6. Set Environment Variables

```dockerfile
# Prevent Python buffering (see logs immediately)
ENV PYTHONUNBUFFERED=1

# Disable pip version check
ENV PIP_DISABLE_PIP_VERSION_CHECK=1
```

## 8. Building and Running

Here's how you would build and run the Docker image:

In [None]:
# Create example build and run commands
build_run_guide = """# Step 1: Build the Docker image
# Run this in the directory containing the Dockerfile
docker build -t ml-api:v1.0 .

# Step 2: Run the container
# -p 8000:8000 publishes container port 8000 on host port 8000 (format is host:container)
# -d runs in detached mode (background)
# --name gives the container a friendly name
docker run -d -p 8000:8000 --name ml-api-container ml-api:v1.0

# Step 3: Check if running
docker ps

# Step 4: View logs
docker logs ml-api-container

# Step 5: Test the API
curl http://localhost:8000/

# Step 6: Stop the container
docker stop ml-api-container

# Step 7: Remove the container
docker rm ml-api-container

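# Advanced: Run with environment variables
# (container names must be unique; remove any earlier one first with: docker rm -f ml-api-container)
docker run -d -p 8000:8000 \\
  -e MODEL_VERSION=1.0 \\
  -e LOG_LEVEL=INFO \\
  --name ml-api-container \\
  ml-api:v1.0

# Advanced: Mount model directory as volume
# (again, remove the previous container first; the path is quoted in case it contains spaces)
docker run -d -p 8000:8000 \\
  -v "$(pwd)/models":/app/models \\
  --name ml-api-container \\
  ml-api:v1.0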
"""

print("Docker Build & Run Guide:")
print(build_run_guide)
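
Values passed with `-e` show up as ordinary environment variables inside the container. As a minimal sketch of how the application might read them, the hypothetical helper `load_settings` below picks up `MODEL_VERSION` and `LOG_LEVEL` with defaults. The default values and the validation rule are assumptions made for illustration.

```python
import os

VALID_LOG_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def load_settings(env=None):
    """Read container configuration from environment variables."""
    env = os.environ if env is None else env
    log_level = env.get("LOG_LEVEL", "INFO").upper()
    if log_level not in VALID_LOG_LEVELS:
        raise ValueError(f"Unsupported LOG_LEVEL: {log_level!r}")
    return {
        "model_version": env.get("MODEL_VERSION", "unknown"),
        "log_level": log_level,
    }


# Simulate what the container would see after `-e MODEL_VERSION=1.0 -e LOG_LEVEL=INFO`.
print(load_settings({"MODEL_VERSION": "1.0", "LOG_LEVEL": "INFO"}))
```

Accepting the environment as a parameter keeps the function easy to test without touching the real process environment.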

## 9. Exercises

Practice Dockerfile creation and optimization.

### Exercise 1: Create Complete Project Structure

Create a complete Docker project with a simple FastAPI app.

**Requirements**:
1. Create app/main.py with a simple FastAPI app
2. Create requirements.txt
3. Create Dockerfile
4. Add .dockerignore

In [None]:
# Exercise 1: Your code here

# YOUR CODE HERE

In [None]:
# Exercise 1 Solution

# Create complete project structure
exercise_dir = Path("exercise_docker_app")
exercise_dir.mkdir(exist_ok=True)
(exercise_dir / "app").mkdir(exist_ok=True)
(exercise_dir / "models").mkdir(exist_ok=True)

# Create simple FastAPI app
app_code = '''"""Simple ML API for Docker demo"""
from fastapi import FastAPI
import joblib
from pathlib import Path

app = FastAPI(title="Simple ML API")

@app.get("/")
def root():
    return {"message": "ML API is running in Docker!", "status": "healthy"}

@app.get("/health")
def health():
    return {"status": "OK"}
'''

with open(exercise_dir / "app" / "main.py", "w") as f:
    f.write(app_code)

# Requirements
reqs = "fastapi==0.104.1\nuvicorn[standard]==0.24.0\njoblib==1.3.2\n"
with open(exercise_dir / "requirements.txt", "w") as f:
    f.write(reqs)

# Dockerfile
dockerfile = """FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY ./app ./app
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
"""
with open(exercise_dir / "Dockerfile", "w") as f:
    f.write(dockerfile)

# .dockerignore
dockerignore = "**/__pycache__/\n**/*.pyc\n.git/\n.vscode/\n"
with open(exercise_dir / ".dockerignore", "w") as f:
    f.write(dockerignore)

print("✓ Complete Docker project created!")
print(f"  Location: {exercise_dir.absolute()}")
print(f"  Files: app/main.py, requirements.txt, Dockerfile, .dockerignore")
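
Once a container built from this project is running, the `/health` endpoint can be polled from Python instead of `curl`. This is a sketch using only the standard library; `is_healthy` and `wait_until_healthy` are hypothetical helpers, and the URL assumes the container was started with `-p 8000:8000`.

```python
import http.client
import time
import urllib.request


def is_healthy(url, timeout=2.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and non-2xx responses.
        return False


def wait_until_healthy(url, attempts=10, delay=1.0):
    """Poll url until it is healthy or the attempts run out."""
    for _ in range(attempts):
        if is_healthy(url):
            return True
        time.sleep(delay)
    return False
```

For example, `wait_until_healthy("http://localhost:8000/health")` returns `True` once the API is up, and `False` if it has not answered after ten tries.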

### Exercise 2: Optimize Dockerfile

Improve the Dockerfile with best practices.

**Requirements**:
1. Use multi-stage build
2. Add non-root user
3. Set Python environment variables
4. Minimize layers

In [None]:
# Exercise 2 Solution

optimized_ex_dockerfile = """# Build stage
FROM python:3.9 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Runtime stage
FROM python:3.9-slim

# Set environment variables
ENV PYTHONUNBUFFERED=1 \\
    PIP_DISABLE_PIP_VERSION_CHECK=1

# Create non-root user
RUN useradd -m -u 1000 mluser

WORKDIR /app

# Copy dependencies from builder and hand ownership to the non-root user
COPY --from=builder --chown=mluser:mluser /root/.local /home/mluser/.local

# Copy application
COPY --chown=mluser:mluser ./app ./app

# Switch to non-root user
USER mluser

# Update PATH
ENV PATH=/home/mluser/.local/bin:$PATH

EXPOSE 8000

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
"""

with open(exercise_dir / "Dockerfile.optimized", "w") as f:
    f.write(optimized_ex_dockerfile)

print("✓ Optimized Dockerfile created!")
print("\nOptimizations applied:")
print("  ✓ Multi-stage build (smaller image)")
print("  ✓ Non-root user (security)")
print("  ✓ Environment variables (best practices)")
print("  ✓ Minimal layers")
print("\nSize reduction depends on your dependencies; compare builds with `docker images`")

### Exercise 3: Docker Compose for ML Stack

Create a docker-compose.yml for a complete ML serving stack.

**Requirements**:
1. ML API service
2. Redis for caching
3. Proper networking
4. Volume mounts

In [None]:
# Exercise 3 Solution

docker_compose = """services:
  # ML API Service
  ml-api:
    build:
      context: .
      dockerfile: Dockerfile.optimized
    ports:
      - "8000:8000"
    environment:
      - MODEL_VERSION=1.0
      - REDIS_HOST=redis
      - REDIS_PORT=6379
    volumes:
      - ./models:/app/models:ro
    depends_on:
      - redis
    restart: unless-stopped
    healthcheck:
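      # python:3.9-slim ships without curl, so probe with the standard library
      test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]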
      interval: 30s
      timeout: 10s
      retries: 3

  # Redis for caching predictions
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis-data:/data
    restart: unless-stopped

volumes:
  redis-data:

networks:
  default:
    name: ml-network
"""

with open(exercise_dir / "docker-compose.yml", "w") as f:
    f.write(docker_compose)

print("✓ Docker Compose configuration created!")
print("\nStack includes:")
print("  - ML API (port 8000)")
print("  - Redis cache (port 6379)")
print("  - Health checks")
print("  - Persistent volumes")
print("\nTo run: docker compose up -d  (docker-compose up -d with the legacy v1 CLI)")
print("To stop: docker compose down")
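
Inside the Compose network, the API reaches Redis through the service name passed in `REDIS_HOST`. The sketch below shows one way the API might derive a connection URL and a cache key for predictions. It is an illustration under assumptions: `redis_url` and `prediction_cache_key` are hypothetical helpers, the key format is invented, and no Redis client is imported so the snippet runs without extra packages.

```python
import hashlib
import json


def redis_url(env):
    """Build a redis:// URL from REDIS_HOST and REDIS_PORT."""
    host = env.get("REDIS_HOST", "localhost")
    port = int(env.get("REDIS_PORT", "6379"))
    return f"redis://{host}:{port}/0"


def prediction_cache_key(model_version, features):
    """Deterministic key: same model version and features give the same key."""
    # sort_keys makes the JSON independent of dictionary insertion order.
    payload = json.dumps(features, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"prediction:{model_version}:{digest}"


compose_env = {"MODEL_VERSION": "1.0", "REDIS_HOST": "redis", "REDIS_PORT": "6379"}
print(redis_url(compose_env))
print(prediction_cache_key(compose_env["MODEL_VERSION"], {"age": 42, "income": 55000.0}))
```

Including the model version in the key means cached predictions from an older model are never served after an upgrade.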

## 10. Summary

### Key Takeaways

1. **Docker ensures consistency** across development and production environments

2. **Dockerfiles are recipes** for building reproducible images

3. **Layer caching speeds up builds** - order instructions by change frequency

4. **Multi-stage builds reduce image size** by keeping build-only tooling out of the final image

5. **Security matters** - use non-root users and specific base images

6. **.dockerignore prevents bloat** by excluding unnecessary files

### Docker Best Practices Checklist

- ✅ Use specific base image versions (not `latest`)
- ✅ Leverage layer caching (requirements before code)
- ✅ Use multi-stage builds
- ✅ Run as non-root user
- ✅ Set Python environment variables
- ✅ Create .dockerignore file
- ✅ Minimize number of layers
- ✅ Use COPY instead of ADD
- ✅ Don't store secrets in images
- ✅ Tag images properly (version numbers)

### What's Next?

In **Module 06**, we'll explore:
- **Different model serving patterns** (batch, real-time, streaming)
- **Trade-offs** between serving approaches
- **When to use** each pattern
- **Implementation examples** for each

## 11. Additional Resources

### Documentation
- **Docker Docs**: https://docs.docker.com/
- **Dockerfile Reference**: https://docs.docker.com/engine/reference/builder/
- **Docker Compose**: https://docs.docker.com/compose/
- **Best Practices**: https://docs.docker.com/develop/dev-best-practices/

### Tutorials
- **Docker for Data Science**: https://www.docker.com/blog/tag/data-science/
- **ML in Production**: https://madewithml.com/courses/mlops/docker/

### Advanced Topics
- Multi-container applications with Docker Compose
- Docker networking and volumes
- GPU support in Docker (NVIDIA Container Toolkit, formerly nvidia-docker)
- Kubernetes for container orchestration
- Docker security scanning