# Lab 4.4.1: Building Optimized Docker Images for ML

**Module:** 4.4 - Containerization & Cloud Deployment  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate)

---

## Learning Objectives

By the end of this lab, you will:
- [ ] Understand Docker fundamentals and why containerization matters for ML
- [ ] Create Dockerfiles extending NGC base images
- [ ] Implement multi-stage builds for smaller image sizes
- [ ] Configure GPU support in containers
- [ ] Add health check endpoints for production readiness
- [ ] Optimize Dockerfile layers for faster builds

---

## Prerequisites

- Docker installed with NVIDIA Container Toolkit
- Basic understanding of Linux commands
- Completed: Module 3.3 (Deployment basics)

---

## Real-World Context

**Why Docker for ML?**

Imagine you've trained an amazing model on your DGX Spark. It works perfectly! But then:
- Your colleague tries to run it and gets `ModuleNotFoundError`
- You deploy to production and PyTorch version mismatch breaks everything
- Six months later, your own code doesn't work because dependencies changed

**58% of organizations now use Kubernetes for AI workloads**, and containers are the foundation.

Docker solves this by packaging your model, code, and EXACT environment into a portable container.

---

## ELI5: What is Docker?

> **Imagine you're shipping a fish tank...**
>
> You could ship just the fish, but then the recipient needs to have the right tank, the right water temperature, the right filter, the right food...
>
> **Docker is like shipping the entire aquarium** - fish, water, tank, filter, heater - everything needed for the fish to be happy. It works exactly the same whether it's in your house or your friend's house.
>
> **In AI terms:**
> - **Fish** = Your model
> - **Water/Tank/Filter** = Python, PyTorch, CUDA, libraries
> - **Aquarium** = Docker container
>
> The container includes everything your model needs to run, so it works the same everywhere!

---

## Part 1: Docker Fundamentals

### Key Concepts

| Term | Description | Analogy |
|------|-------------|---------|
| **Image** | A template with all files and config | Recipe |
| **Container** | A running instance of an image | Dish made from recipe |
| **Dockerfile** | Instructions to build an image | Recipe card |
| **Layer** | Each instruction creates a cacheable layer | Recipe step |
| **Registry** | Where images are stored (Docker Hub, NGC) | Cookbook library |

### Why NVIDIA NGC Containers?

NVIDIA's NGC (NVIDIA GPU Cloud) catalog provides pre-optimized containers that:
- Are **tested on NVIDIA hardware** (including DGX Spark's Blackwell GPU)
- Have **correct CUDA/cuDNN versions** pre-installed
- Support **ARM64 architecture** (DGX Spark uses ARM v9.2 CPUs)
- Include **TensorRT, NCCL, and other optimizations**

Using NGC containers saves hours of debugging CUDA compatibility issues!

In [None]:
# First, let's verify Docker is properly installed with GPU support
import subprocess
import sys

def run_command(cmd, capture=True):
    """Run a shell command and return output."""
    result = subprocess.run(cmd, shell=True, capture_output=capture, text=True)
    if capture:
        return result.stdout.strip(), result.returncode
    return None, result.returncode

# Check Docker installation
print("=" * 60)
print("Docker Environment Check")
print("=" * 60)

# Docker version
output, code = run_command("docker --version")
if code == 0:
    print(f"Docker: {output}")
else:
    print("Docker NOT installed! Please install Docker first.")

# Docker Compose version
output, code = run_command("docker compose version 2>/dev/null || docker-compose --version")
if code == 0:
    print(f"Docker Compose: {output}")
else:
    print("Docker Compose: Not available")

# NVIDIA Container Toolkit
output, code = run_command("nvidia-container-cli --version 2>/dev/null")
if code == 0:
    # Show the whole first line of the version output, not just its first word
    print(f"NVIDIA Container Toolkit: {output.splitlines()[0] if output else 'Installed'}")
else:
    print("NVIDIA Container Toolkit: Not installed (GPU support unavailable)")

print("\n" + "=" * 60)

In [None]:
# Test GPU access in Docker (this may take a moment to pull the image)
print("Testing GPU access in Docker...")
print("(This may take a minute if the image isn't cached)\n")

output, code = run_command(
    "docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi --query-gpu=name,memory.total --format=csv",
    capture=True
)

if code == 0:
    print("GPU accessible in Docker container!")
    print("\nGPU Information:")
    print(output)
else:
    print("GPU access failed. Possible issues:")
    print("1. NVIDIA Container Toolkit not installed")
    print("2. Docker not configured for GPU")
    print("3. GPU drivers not properly installed")
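    print("\nTo install NVIDIA Container Toolkit (Ubuntu/Debian):")
    print("  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg")
    print("  curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list")
    print("  sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit")
    print("  sudo nvidia-ctk runtime configure --runtime=docker")
    print("  sudo systemctl restart docker")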

---

## Part 2: Understanding NGC Base Images

### Available NGC Containers

NVIDIA provides optimized containers for different ML frameworks:

| Container | Use Case | ARM64 Support |
|-----------|----------|---------------|
| `nvcr.io/nvidia/pytorch:25.11-py3` | PyTorch training/inference | Full |
| `nvcr.io/nvidia/tensorrt:25.01-py3` | Optimized inference | Full |
| `nvcr.io/nvidia/tritonserver:25.01-py3` | Multi-model serving | Full |
| `nvcr.io/nvidia/nemo:25.01` | NLP with NeMo | Full |
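
The tags in the table appear to follow a `YY.MM` release convention (for example, `25.11` for November 2025). A minimal sketch for pulling the release date out of a tag, so two images can be compared by age, might look like this. The helper name `parse_ngc_tag` is hypothetical and not part of any NVIDIA tooling, and the sketch assumes the tag starts with `YY.MM`:

```python
import re
from typing import NamedTuple


class NgcRelease(NamedTuple):
    repository: str
    year: int
    month: int


# Assumes tags that begin with YY.MM, optionally followed by a suffix such as "-py3"
_TAG_RE = re.compile(r"^(?P<repo>[^:]+):(?P<yy>\d{2})\.(?P<mm>\d{2})(?:[.-].*)?$")


def parse_ngc_tag(image: str) -> NgcRelease:
    """Split an image reference like 'repo:YY.MM-py3' into repository, year, month."""
    match = _TAG_RE.match(image)
    if match is None:
        raise ValueError(f"Not a YY.MM-style tag: {image!r}")
    return NgcRelease(match["repo"], 2000 + int(match["yy"]), int(match["mm"]))


older = parse_ngc_tag("nvcr.io/nvidia/pytorch:24.12-py3")
newer = parse_ngc_tag("nvcr.io/nvidia/pytorch:25.11-py3")
print(newer)
print("newer is more recent:", newer[1:] > older[1:])
```

Comparing the `(year, month)` part is enough to tell which of two tags for the same repository is the more recent release.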

### DGX Spark Considerations

Your DGX Spark has:
- **ARM64 architecture** (not x86!) - must use ARM-compatible containers
- **128GB unified memory** - can run large models
- **Blackwell GPU with FP4/FP8** - use latest containers for hardware support
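
Because the architecture point is easy to get wrong, it can help to check which Docker platform string a machine corresponds to. The following is a small sketch; the helper `docker_platform` and its lookup table are illustrative assumptions that cover only the two most common architectures:

```python
import platform
from typing import Optional

# Map common `platform.machine()` values to Docker platform strings
_DOCKER_PLATFORMS = {
    "aarch64": "linux/arm64",
    "arm64": "linux/arm64",
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
}


def docker_platform(machine: Optional[str] = None) -> str:
    """Return the Docker platform string for a machine name (default: this host)."""
    name = (machine or platform.machine()).lower()
    if name not in _DOCKER_PLATFORMS:
        raise ValueError(f"Unrecognized architecture: {name!r}")
    return _DOCKER_PLATFORMS[name]


print(docker_platform("aarch64"))  # linux/arm64
print(docker_platform("x86_64"))   # linux/amd64
```

Calling `docker_platform()` with no argument inspects the current host. The resulting string is the value accepted by Docker's `--platform` option, for example `docker pull --platform linux/arm64 <image>`.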

In [None]:
# Let's examine what's inside an NGC container
# This shows you what you get "for free" with NGC base images

print("Examining NGC PyTorch container contents...")
print("(This shows what's pre-installed in the base image)\n")

# Check Python version
output, _ = run_command(
    "docker run --rm nvcr.io/nvidia/pytorch:24.12-py3 python --version 2>&1 || echo 'Image not pulled'"
)
print(f"Python: {output}")

# Check PyTorch version
output, _ = run_command(
    "docker run --rm nvcr.io/nvidia/pytorch:24.12-py3 python -c 'import torch; print(torch.__version__)' 2>&1 || echo 'Image not pulled'"
)
print(f"PyTorch: {output}")

# Check CUDA version
output, _ = run_command(
    "docker run --rm nvcr.io/nvidia/pytorch:24.12-py3 nvcc --version 2>&1 | grep release || echo 'Image not pulled'"
)
print(f"CUDA: {output}")

print("\nThese are all pre-configured and tested by NVIDIA!")

---

## Part 3: Your First ML Dockerfile

Let's create a Dockerfile for an LLM inference server step by step.

### Dockerfile Anatomy

```dockerfile
# 1. Start from a base image
FROM nvcr.io/nvidia/pytorch:25.11-py3

# 2. Install dependencies
RUN pip install transformers accelerate

# 3. Copy your code
COPY app/ /app/

# 4. Set the working directory
WORKDIR /app

# 5. Define how to run
CMD ["python", "serve.py"]
```

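Each instruction is a build step whose result gets cached (`RUN`, `COPY`, and `ADD` add filesystem **layers**; the others only add metadata). A changed step invalidates every step after it, so order matters for build speed!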

---

## Understanding Our Docker Utilities Module

This curriculum provides a `docker_utils` module to simplify Dockerfile creation. Let's understand what it offers before using it.

### DockerImageBuilder Class

The `DockerImageBuilder` class provides a fluent API for generating Dockerfiles:

| Method | Description | Example |
|--------|-------------|---------|
| `add_base(image)` | Set the base image | `add_base("nvcr.io/nvidia/pytorch:24.12-py3")` |
| `add_python_deps(list)` | Add Python packages | `add_python_deps(["transformers", "fastapi"])` |
| `add_env(key, value)` | Set environment variable | `add_env("MODEL_PATH", "/models")` |
| `add_copy(src, dst)` | Copy files into image | `add_copy("app/", "/app/")` |
| `set_workdir(path)` | Set working directory | `set_workdir("/app")` |
| `expose(port)` | Expose a port | `expose(8000)` |
| `add_healthcheck(path, port)` | Add health check | `add_healthcheck("/health", port=8000)` |
| `add_entrypoint(cmd)` | Set entrypoint command | `add_entrypoint("python main.py")` |
| `generate()` | Generate Dockerfile content | Returns string |
| `save(path)` | Save to file | `save("./Dockerfile")` |
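
To make the idea of a fluent API concrete, here is a stripped-down sketch of how such a builder could work. `MiniDockerfileBuilder` is a hypothetical teaching stand-in, not the curriculum's `DockerImageBuilder`, whose internals are not shown in this lab. It only borrows a few method names from the table:

```python
import shlex


class MiniDockerfileBuilder:
    """Hypothetical teaching stand-in, not the curriculum's DockerImageBuilder."""

    def __init__(self):
        self._lines = []

    def _add(self, line):
        self._lines.append(line)
        return self  # returning self is what makes method chaining possible

    def add_base(self, image):
        return self._add(f"FROM {image}")

    def add_python_deps(self, packages):
        # Quote each requirement so ">=" is not treated as a shell redirect
        quoted = " ".join(shlex.quote(p) for p in packages)
        return self._add(f"RUN pip install --no-cache-dir {quoted}")

    def add_env(self, key, value):
        return self._add(f"ENV {key}={value}")

    def set_workdir(self, path):
        return self._add(f"WORKDIR {path}")

    def generate(self):
        return "\n".join(self._lines) + "\n"


dockerfile = (
    MiniDockerfileBuilder()
    .add_base("nvcr.io/nvidia/pytorch:24.12-py3")
    .add_python_deps(["transformers>=4.37.0", "fastapi"])
    .add_env("MODEL_PATH", "/models")
    .set_workdir("/app")
    .generate()
)
print(dockerfile)
```

Each method records one Dockerfile instruction and returns the builder itself, so calls can be chained or issued one per line, as the cells in this lab do.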

### Other Utility Functions

| Function | Description |
|----------|-------------|
| `create_dockerignore(path)` | Create a .dockerignore file with ML-friendly defaults |
| `optimize_dockerfile(path)` | Analyze a Dockerfile and suggest optimizations |
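
A `.dockerignore` file matters for ML projects because checkpoints and datasets can make the build context very large. The sketch below estimates how many bytes would be sent to the Docker daemon with and without a set of ignore patterns. It uses `fnmatch`, which only approximates Docker's real `.dockerignore` matching rules, and the pattern list in the usage comment is an assumption rather than the output of `create_dockerignore`:

```python
import fnmatch
import os


def _ignored(rel_path, patterns):
    """True if the relative path or its basename matches any pattern."""
    base = os.path.basename(rel_path)
    return any(fnmatch.fnmatch(rel_path, p) or fnmatch.fnmatch(base, p) for p in patterns)


def context_size_bytes(root, ignore_patterns=()):
    """Sum file sizes under root, skipping paths that match an ignore pattern."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        # Prune ignored directories in place so os.walk does not descend into them
        dirnames[:] = [
            d for d in dirnames
            if not _ignored(os.path.normpath(os.path.join(rel_dir, d)), ignore_patterns)
        ]
        for name in filenames:
            rel_path = os.path.normpath(os.path.join(rel_dir, name))
            if not _ignored(rel_path, ignore_patterns):
                total += os.path.getsize(os.path.join(dirpath, name))
    return total


# Example usage (patterns are illustrative):
# patterns = ["*.pt", "*.safetensors", "__pycache__", ".git"]
# print(context_size_bytes(".") - context_size_bytes(".", patterns), "bytes saved")
```

A large difference between the two numbers suggests that the ignore file would noticeably speed up `docker build`.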

In [None]:
# Let's use our docker_utils to build a proper Dockerfile
import sys
sys.path.insert(0, '..')

from scripts.docker_utils import DockerImageBuilder

# Create a builder for an LLM inference server
builder = DockerImageBuilder("llm-inference-server", use_multistage=True)

# Configure the image
builder.add_base("nvcr.io/nvidia/pytorch:24.12-py3")  # Using 24.12 for stability

# Add Python dependencies (commonly used for LLM serving)
builder.add_python_deps([
    "transformers>=4.37.0",
    "accelerate>=0.25.0",
    "bitsandbytes>=0.41.0",
    "fastapi>=0.109.0",
    "uvicorn>=0.27.0",
    "pydantic>=2.0.0",
])

# Add environment variables
builder.add_env("MODEL_PATH", "/models")
builder.add_env("CUDA_VISIBLE_DEVICES", "0")
builder.add_env("HF_HOME", "/models/cache")  # TRANSFORMERS_CACHE is deprecated in recent transformers releases

# Copy application code
builder.add_copy("app/", "/app/")

# Set working directory
builder.set_workdir("/app")

# Expose port for API
builder.expose(8000)

# Add health check (Docker HEALTHCHECK; Kubernetes probes can target the same endpoint)
builder.add_healthcheck("/health", port=8000, interval=30, timeout=10)

# Set the entrypoint
builder.add_entrypoint("python -m uvicorn main:app --host 0.0.0.0 --port 8000")

# Generate the Dockerfile
dockerfile_content = builder.generate()

print("Generated Dockerfile:")
print("=" * 60)
print(dockerfile_content)
print("=" * 60)

### What's Happening Here?

Let's break down the generated Dockerfile:

1. **Multi-stage build**: We have two stages:
   - `builder`: Installs dependencies (can be larger)
   - `production`: Only copies what's needed (smaller final image)

2. **`--user` flag**: Installs packages to `/root/.local` for easy copying

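3. **Health check**: Docker's `HEALTHCHECK` polls `/health` to verify the server is responding. Kubernetes ignores `HEALTHCHECK` and uses its own liveness/readiness probes, which can target the same endpoint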

4. **Environment variables**: Configure model path and GPU settings
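
For readers who cannot run the cell, the layout described in the list can be sketched by hand. The Dockerfile text below is an assumption about the general shape of a two-stage build and may differ from what `builder.generate()` actually emits. It also assumes `curl` is available in the base image:

```python
# Hypothetical two-stage layout; the real generated Dockerfile may differ.
multistage_sketch = """\
# Stage 1: install Python packages into /root/.local
FROM nvcr.io/nvidia/pytorch:24.12-py3 AS builder
RUN pip install --user --no-cache-dir fastapi uvicorn

# Stage 2: start clean and copy only the installed packages
FROM nvcr.io/nvidia/pytorch:24.12-py3 AS production
COPY --from=builder /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH
ENV MODEL_PATH=/models
COPY app/ /app/
WORKDIR /app
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=10s CMD curl -f http://localhost:8000/health || exit 1
CMD ["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
"""

# List the stage names declared with "FROM ... AS <name>"
stages = [line.split()[-1] for line in multistage_sketch.splitlines() if line.startswith("FROM ")]
print(stages)  # ['builder', 'production']
```

When both stages share the same base image, the savings come only from what the builder stage leaves behind, such as pip caches and build tools. Larger reductions usually require a slimmer runtime base.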

In [None]:
# Let's also create the .dockerignore file
# This prevents unnecessary files from being copied into the container

from scripts.docker_utils import create_dockerignore
import os

# Create output directory
os.makedirs("../docker-examples/inference-server", exist_ok=True)

# Create .dockerignore
dockerignore_path = create_dockerignore("../docker-examples/inference-server/.dockerignore")

# Show content
with open(dockerignore_path) as f:
    content = f.read()

print(".dockerignore content:")
print("=" * 60)
print(content)
print("=" * 60)
print(f"\nSaved to: {dockerignore_path}")

---

## Part 4: Creating the Inference Server Code

Now let's create the actual Python code that will run inside our container.

We'll build a simple FastAPI server that:
1. Loads a language model
2. Provides a `/predict` endpoint
3. Has a `/health` endpoint for Kubernetes probes

In [None]:
# Create the application directory structure
import os

app_dir = "../docker-examples/inference-server/app"
os.makedirs(app_dir, exist_ok=True)

# Create main.py - the FastAPI application
main_py = '''"""FastAPI LLM Inference Server.

This server provides endpoints for:
- Health checks (/health)
- Text generation (/predict)
- Chat completion (/chat)
"""

import os
import time
import logging
from typing import Optional, List
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Global model reference
model = None
tokenizer = None


class GenerateRequest(BaseModel):
    """Request body for text generation."""
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7
    top_p: float = 0.9


class ChatMessage(BaseModel):
    """A single chat message."""
    role: str  # "user", "assistant", or "system"
    content: str


class ChatRequest(BaseModel):
    """Request body for chat completion."""
    messages: List[ChatMessage]
    max_tokens: int = 256
    temperature: float = 0.7


class GenerateResponse(BaseModel):
    """Response from generation endpoints."""
    generated_text: str
    tokens_generated: int
    latency_ms: float


class HealthResponse(BaseModel):
    """Health check response."""
    status: str
    model_loaded: bool
    gpu_available: bool


def load_model():
    """Load the language model."""
    global model, tokenizer
    
    model_path = os.environ.get("MODEL_PATH", "/models")
    model_name = os.environ.get("MODEL_NAME", "gpt2")  # Default to small model
    
    logger.info(f"Loading model: {model_name}")
    
    try:
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer
        
        # Check if local model exists
        local_path = os.path.join(model_path, model_name)
        if os.path.exists(local_path):
            load_path = local_path
        else:
            load_path = model_name
        
        # Load tokenizer
        tokenizer = AutoTokenizer.from_pretrained(
            load_path,
            trust_remote_code=True,
        )
        
        # Add pad token if missing
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        
        # Load model with GPU if available
        device = "cuda" if torch.cuda.is_available() else "cpu"
        
        model = AutoModelForCausalLM.from_pretrained(
            load_path,
            torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
            device_map="auto" if device == "cuda" else None,
            trust_remote_code=True,
        )
        
        logger.info(f"Model loaded successfully on {device}")
        return True
        
    except Exception as e:
        logger.error(f"Failed to load model: {e}")
        return False


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Lifespan handler for startup/shutdown."""
    # Startup
    logger.info("Starting inference server...")
    load_model()
    yield
    # Shutdown
    logger.info("Shutting down...")


# Create FastAPI app
app = FastAPI(
    title="LLM Inference Server",
    description="A simple LLM inference server for DGX Spark",
    version="1.0.0",
    lifespan=lifespan,
)


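@app.get("/health", response_model=HealthResponse)
async def health_check():
    """Health check endpoint for Kubernetes probes."""
    # Import torch defensively so the endpoint still answers in images
    # built without ML dependencies (e.g. a lightweight test image).
    try:
        import torch
        gpu_available = torch.cuda.is_available()
    except ImportError:
        gpu_available = False

    return HealthResponse(
        status="healthy" if model is not None else "loading",
        model_loaded=model is not None,
        gpu_available=gpu_available,
    )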


@app.post("/predict", response_model=GenerateResponse)
async def generate(request: GenerateRequest):
    """Generate text from a prompt."""
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    
    import torch
    
    start_time = time.time()
    
    try:
        # Tokenize input
        inputs = tokenizer(
            request.prompt,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=2048,
        )
        
        # Move to GPU if available
        if torch.cuda.is_available():
            inputs = {k: v.cuda() for k, v in inputs.items()}
        
        # Generate
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                top_p=request.top_p,
                do_sample=True,
                pad_token_id=tokenizer.pad_token_id,
            )
        
        # Decode output
        generated_text = tokenizer.decode(
            outputs[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        )
        
        tokens_generated = outputs.shape[1] - inputs["input_ids"].shape[1]
        latency_ms = (time.time() - start_time) * 1000
        
        return GenerateResponse(
            generated_text=generated_text,
            tokens_generated=tokens_generated,
            latency_ms=latency_ms,
        )
        
    except Exception as e:
        logger.error(f"Generation failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/chat", response_model=GenerateResponse)
async def chat(request: ChatRequest):
    """Chat completion endpoint."""
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    
    # Format messages as prompt
    prompt_parts = []
    for msg in request.messages:
        if msg.role == "system":
            prompt_parts.append(f"System: {msg.content}")
        elif msg.role == "user":
            prompt_parts.append(f"User: {msg.content}")
        elif msg.role == "assistant":
            prompt_parts.append(f"Assistant: {msg.content}")
    
    prompt_parts.append("Assistant:")
    prompt = "\\n".join(prompt_parts)
    
    # Use the generate endpoint
    return await generate(GenerateRequest(
        prompt=prompt,
        max_tokens=request.max_tokens,
        temperature=request.temperature,
    ))


@app.get("/")
async def root():
    """Root endpoint with API info."""
    return {
        "name": "LLM Inference Server",
        "version": "1.0.0",
        "endpoints": [
            "/health - Health check",
            "/predict - Text generation",
            "/chat - Chat completion",
            "/docs - API documentation",
        ]
    }


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
'''

# Save main.py
with open(f"{app_dir}/main.py", "w") as f:
    f.write(main_py)

print(f"Created: {app_dir}/main.py")
print(f"Lines of code: {len(main_py.splitlines())}")

In [None]:
# Create requirements.txt
requirements = """# LLM Inference Server Dependencies
# Optimized for DGX Spark (ARM64 + CUDA)

# API Framework
fastapi>=0.109.0
uvicorn[standard]>=0.27.0
pydantic>=2.0.0

# ML Libraries (PyTorch is pre-installed in NGC container)
transformers>=4.37.0
accelerate>=0.25.0
bitsandbytes>=0.41.0

# Optional: For better performance
# vllm  # Use for production - much faster than HF generate
"""

with open("../docker-examples/inference-server/requirements.txt", "w") as f:
    f.write(requirements)

print("Created: requirements.txt")
print(requirements)

In [None]:
# Now save the Dockerfile
builder.save("../docker-examples/inference-server/Dockerfile")

print("Complete project structure:")
print("=" * 60)
!ls -la ../docker-examples/inference-server/
print("\napp/ contents:")
!ls -la ../docker-examples/inference-server/app/

---

## Part 5: Understanding Layer Optimization

### ELI5: Docker Layers

> **Imagine building with LEGO...**
>
> Each LEGO layer you add stays in place. If you want to change a layer at the bottom, you have to remove all the layers above it.
>
> Docker works the same way. Each instruction creates a layer. When you rebuild:
> - Unchanged layers are **reused from cache** (fast!)
> - Changed layers and everything after must be **rebuilt** (slow!)
>
> **Best Practice:** Put things that change rarely (dependencies) near the TOP of the Dockerfile, where they are built first (the bottom of the LEGO stack), and things that change often (your code) near the BOTTOM.

### Layer Order Matters!

```dockerfile
# BAD - Code changes invalidate dependency cache
COPY . /app
RUN pip install -r requirements.txt

# GOOD - Dependencies only reinstall when requirements.txt changes
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . /app
```
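
The effect of instruction order can be reproduced with a toy model of the build cache. The sketch below is a simplification and not Docker's actual cache implementation: each step's key is a hash of its parent key, the instruction text, and (for `COPY`) a stand-in hash of the copied content. The function names are made up for this illustration:

```python
import hashlib


def layer_keys(instructions, file_hashes):
    """Return one cache key per instruction, chained from the previous key."""
    keys, parent = [], ""
    for inst in instructions:
        content = ""
        if inst.startswith("COPY "):
            source = inst.split()[1]
            content = file_hashes[source]  # COPY steps depend on file content
        parent = hashlib.sha256(f"{parent}|{inst}|{content}".encode()).hexdigest()
        keys.append(parent)
    return keys


def rebuilt_layers(instructions, before, after):
    """List the instructions whose cache key differs between two file states."""
    old = layer_keys(instructions, before)
    new = layer_keys(instructions, after)
    return [inst for inst, a, b in zip(instructions, old, new) if a != b]


bad = ["COPY . /app", "RUN pip install -r requirements.txt"]
good = ["COPY requirements.txt .", "RUN pip install -r requirements.txt", "COPY . /app"]

# Simulate editing application code only: "." changes, requirements.txt does not
before = {".": "code-v1", "requirements.txt": "reqs-v1"}
after = {".": "code-v2", "requirements.txt": "reqs-v1"}

print("bad order rebuilds: ", rebuilt_layers(bad, before, after))
print("good order rebuilds:", rebuilt_layers(good, before, after))
```

In the bad ordering both steps are rebuilt, including the slow `pip install`. In the good ordering only the final `COPY . /app` step changes.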

In [None]:
# Let's analyze a Dockerfile for optimization opportunities
from scripts.docker_utils import optimize_dockerfile

# First, let's create a "bad" Dockerfile to analyze
bad_dockerfile = """FROM python:3.10

COPY . /app
WORKDIR /app

RUN apt-get update
RUN apt-get install -y gcc
RUN pip install torch transformers
RUN pip install fastapi uvicorn

CMD ["python", "main.py"]
"""

# Save it temporarily
with open("/tmp/bad_dockerfile", "w") as f:
    f.write(bad_dockerfile)

# Analyze it
suggestions = optimize_dockerfile("/tmp/bad_dockerfile")

print("Dockerfile Analysis")
print("=" * 60)
print("\nOriginal Dockerfile:")
print(bad_dockerfile)
print("\nOptimization Suggestions:")
print("-" * 60)
for i, suggestion in enumerate(suggestions, 1):
    print(f"\n{i}. {suggestion}")

---

## Part 6: Building and Testing the Image

Now let's build our Docker image and test it!
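
Once an image is built and a container is listening on port 8000 (for example via `docker run --rm -p 8000:8000 llm-inference:test`, an assumed invocation that this lab does not otherwise run), the `/predict` endpoint can be exercised from Python. The sketch below only constructs the request. The helper name `build_predict_request` is hypothetical, and the payload fields mirror `GenerateRequest` in `main.py`:

```python
import json
import urllib.request


def build_predict_request(base_url, prompt, max_tokens=64, temperature=0.7, top_p=0.9):
    """Build (but do not send) a POST request for the /predict endpoint."""
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_predict_request("http://localhost:8000", "Docker is")
print(req.get_method(), req.full_url)

# To send it once a container is listening on port 8000:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.loads(resp.read()))
```

In an image without the ML dependencies no model gets loaded, so `/predict` should answer with HTTP 503. The full NGC-based image is needed to get generated text back.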

In [None]:
# Build the Docker image
# Note: This may take several minutes the first time (downloading base image)

import subprocess
import os

project_dir = "../docker-examples/inference-server"

print("Building Docker image...")
print("This may take 5-10 minutes on first build.")
print("=" * 60)

# For learning purposes, let's create a smaller test image first
# using a simpler Dockerfile

test_dockerfile = """# Simple test Dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install dependencies
RUN pip install --no-cache-dir fastapi uvicorn

# Copy app
COPY app/main.py /app/

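# Health check (python:3.10-slim does not ship curl, so use Python's standard library)
HEALTHCHECK --interval=30s --timeout=10s CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=5)" || exit 1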

EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
"""

# Save test Dockerfile
with open(f"{project_dir}/Dockerfile.test", "w") as f:
    f.write(test_dockerfile)

print("Created simple test Dockerfile (without ML dependencies for quick testing)")
print("\nTo build the full ML image, run:")
print(f"  cd {os.path.abspath(project_dir)}")
print("  docker build -t llm-inference:latest .")
print("\nTo build the test image:")
print(f"  docker build -f Dockerfile.test -t llm-inference:test .")

In [None]:
# If you want to actually build (uncomment to run):
# This builds the lightweight test version

# Build command
build_cmd = f"cd {os.path.abspath(project_dir)} && docker build -f Dockerfile.test -t llm-inference:test . 2>&1"

print("Building test image...")
print("Command:", build_cmd.split(" && ")[1])
print()

# Uncomment to actually build:
# result = subprocess.run(build_cmd, shell=True, capture_output=True, text=True)
# print(result.stdout)
# if result.returncode != 0:
#     print("Build errors:", result.stderr)

print("\nUncomment the lines above to actually build the image.")

---

## Common Mistakes

### Mistake 1: Not Using NGC Containers

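```dockerfile
# BAD - Likely to hit CUDA compatibility issues on ARM64
FROM python:3.10
RUN pip install torch

# GOOD - Pre-configured for NVIDIA hardware
FROM nvcr.io/nvidia/pytorch:24.12-py3
```
**Why:** A plain `pip install torch` on ARM64 may pull a CPU-only wheel or one built against a CUDA version that doesn't match the host driver, whereas NGC images ship a PyTorch build tested against their bundled CUDA stack.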

---

### Mistake 2: Copying Everything Before Installing Dependencies

```dockerfile
# BAD - Any code change reinstalls all dependencies
COPY . /app
RUN pip install -r requirements.txt

# GOOD - Dependencies cached unless requirements.txt changes
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . /app
```

---

### Mistake 3: Running as Root

```dockerfile
# BAD - Security risk
CMD ["python", "app.py"]

# GOOD - Non-root user
RUN useradd -m appuser
USER appuser
CMD ["python", "app.py"]
```

---

### Mistake 4: Not Including Health Checks

```dockerfile
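# BAD - Docker has no way to tell whether the app inside is healthy
CMD ["python", "app.py"]

# GOOD - Docker can monitor health (requires curl in the image).
# Kubernetes ignores HEALTHCHECK; point its liveness/readiness probes at the same /health endpoint.
HEALTHCHECK --interval=30s CMD curl -f http://localhost:8000/health || exit 1
CMD ["python", "app.py"]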
```
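
One subtlety is that the `/health` handler in this lab returns HTTP 200 with `status: "loading"` while the model is still `None`, so a status-code check alone passes before the server can actually generate text. A smoke test that waits for real readiness could inspect the response body instead. This is a minimal sketch using only the standard library, and the function names are illustrative:

```python
import json
import time
import urllib.error
import urllib.request


def fetch_health(url, timeout_s=5.0):
    """GET the health URL and return the decoded JSON body, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        return None


def wait_until_ready(url, timeout_s=120.0, interval_s=2.0, fetch=fetch_health):
    """Poll until the body reports model_loaded=True; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        body = fetch(url)
        if body and body.get("model_loaded"):
            return True
        time.sleep(interval_s)
    return False


# Example usage against a running container:
# ready = wait_until_ready("http://localhost:8000/health")
# print("ready" if ready else "timed out")
```

The `fetch` parameter exists so the polling logic can be tested without a live server.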

---

## Try It Yourself

### Exercise 1: Optimize This Dockerfile

The following Dockerfile has several issues. Identify and fix them:

```dockerfile
FROM python:3.10

RUN apt-get update
RUN apt-get install -y curl
RUN apt-get install -y git

COPY . /app
WORKDIR /app

RUN pip install torch
RUN pip install transformers
RUN pip install fastapi

CMD python main.py
```

<details>
<summary>Hint 1</summary>
Combine RUN commands with && to reduce layers.
</details>

<details>
<summary>Hint 2</summary>
Add --no-cache-dir to pip install.
</details>

<details>
<summary>Hint 3</summary>
Use NGC base image instead of python:3.10.
</details>

In [None]:
# Your optimized Dockerfile here:
optimized_dockerfile = """
# TODO: Write your optimized Dockerfile
# Consider:
# 1. Base image choice
# 2. Layer combining
# 3. Dependency ordering
# 4. Cache optimization
# 5. Health checks
"""

print(optimized_dockerfile)

### Exercise 2: Create a Dockerfile for a RAG Application

Create a Dockerfile for a RAG (Retrieval-Augmented Generation) application that needs:
- ChromaDB for vector storage
- Sentence Transformers for embeddings
- FastAPI for the API
- An LLM for generation

In [None]:
# Use the DockerImageBuilder to create a RAG application Dockerfile
from scripts.docker_utils import DockerImageBuilder

# TODO: Complete this
rag_builder = DockerImageBuilder("rag-server")

# Add your configuration here:
# rag_builder.add_base(...)
# rag_builder.add_python_deps([...])
# etc.

# Uncomment when ready:
# print(rag_builder.generate())

---

## Checkpoint

You've learned:
- Why Docker is essential for ML deployment (reproducibility, portability)
- How to use NGC containers optimized for DGX Spark
- Multi-stage builds for smaller images
- Layer optimization for faster builds
- Health checks for production readiness

---

## Challenge (Optional)

Create a complete Docker image that:
1. Uses NGC PyTorch base
2. Includes vLLM for faster inference
3. Has OpenTelemetry for observability
4. Supports multiple models via environment variables
5. Includes GPU memory monitoring endpoint

---

## Further Reading

- [Dockerfile Best Practices](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/)
- [NVIDIA NGC Catalog](https://catalog.ngc.nvidia.com/containers)
- [Multi-stage Builds](https://docs.docker.com/build/building/multi-stage/)
- [Docker BuildKit](https://docs.docker.com/build/buildkit/)

---

## Cleanup

In [None]:
# Clean up Docker resources (optional)
print("To clean up Docker resources, run:")
print("  docker system prune -f          # Remove unused data")
print("  docker image prune -a -f        # Remove ALL unused images (including cached NGC base images)")
print("  docker volume prune -f          # Remove unused volumes")
print("\nTo remove the test image:")
print("  docker rmi llm-inference:test")