# Lab 4.4.1: Docker ML Image - SOLUTION

**Module:** 4.4 - Containerization & Cloud Deployment  
**This is the complete solution notebook with all exercises solved.**

---

## Exercise 1 Solution: Optimize the Dockerfile

**Original (with issues):**
```dockerfile
FROM python:3.10

RUN apt-get update
RUN apt-get install -y curl
RUN apt-get install -y git

COPY . /app
WORKDIR /app

RUN pip install torch
RUN pip install transformers
RUN pip install fastapi

CMD python main.py
```

In [None]:
# SOLUTION: Optimized Dockerfile

# Raw string: keeps the Dockerfile's backslash line continuations intact when printed
optimized_dockerfile = r'''
# ============================================
# OPTIMIZED DOCKERFILE FOR LLM INFERENCE
# ============================================

# Improvement 1: Use NGC base image for ARM64 + CUDA compatibility
FROM nvcr.io/nvidia/pytorch:24.12-py3 AS builder

# Improvement 2: Combine apt commands, clean up cache
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*

# Improvement 3: Copy requirements first for layer caching
COPY requirements.txt /tmp/requirements.txt

# Improvement 4: Single pip layer, --no-cache-dir, install to --user.
# Assumes requirements.txt lists transformers, fastapi, and uvicorn;
# torch is already provided by the NGC PyTorch base image.
RUN pip install --user --no-cache-dir -r /tmp/requirements.txt

# ============================================
# Production stage - minimal image
# ============================================
FROM nvcr.io/nvidia/pytorch:24.12-py3

# Improvement 8: Run as non-root user (security). The user is created first
# so that later COPY steps can assign ownership without an extra chown layer.
RUN useradd -m appuser

# Copy only installed packages from builder into appuser's home;
# /root/.local is typically unreadable once the container drops root.
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
ENV PATH=/home/appuser/.local/bin:$PATH

# Improvement 5: Set workdir before copying code
WORKDIR /app

# Improvement 6: Copy app code LAST (changes most frequently)
COPY --chown=appuser:appuser . /app

# Improvement 7: Add health check
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

USER appuser

# Improvement 9: Use explicit command format
EXPOSE 8000
CMD ["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
'''

print("OPTIMIZED DOCKERFILE:")
print("=" * 60)
print(optimized_dockerfile)

print("\n" + "=" * 60)
print("IMPROVEMENTS MADE:")
print("=" * 60)
improvements = [
    "1. NGC base image: Ensures ARM64 + CUDA compatibility for DGX Spark",
    "2. Combined apt commands: Reduces layers, cleans cache",
    "3. Requirements first: Better layer caching (deps only reinstall if requirements.txt changes)",
    "4. Combined pip install: Fewer layers, --no-cache-dir saves space",
    "5. Multi-stage build: apt packages and pip caches from the builder stage stay out of the final image",
    "6. Copy code last: Changes to code don't invalidate dependency cache",
    "7. Health check: Kubernetes/Docker can monitor container health",
    "8. Non-root user: Security best practice",
    "9. Explicit CMD format: More robust, easier to override",
]
for imp in improvements:
    print(f"  {imp}")
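
Several of the issues in the original Dockerfile can also be detected mechanically. The following is a minimal sketch, not a substitute for a dedicated Dockerfile linter; the function name `find_dockerfile_issues` and its three rules are illustrative assumptions that mirror improvements 2, 6, and 9 above.

```python
def find_dockerfile_issues(dockerfile: str) -> list[str]:
    """Flag a few anti-patterns from the original Dockerfile (hypothetical helper)."""
    # Join backslash continuations so each instruction is one logical line.
    logical = dockerfile.replace("\\\n", " ")
    lines = [ln.strip() for ln in logical.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]

    issues = []

    # Rule 1: several separate "RUN apt-get" instructions create extra layers.
    apt_runs = [ln for ln in lines if ln.startswith("RUN apt-get")]
    if len(apt_runs) > 1:
        issues.append(f"{len(apt_runs)} separate apt-get RUN instructions")

    # Rule 2: copying the whole build context before installing dependencies
    # invalidates the pip layer on every code change.
    copy_all = next((i for i, ln in enumerate(lines) if ln.startswith("COPY . ")), None)
    first_pip = next((i for i, ln in enumerate(lines) if ln.startswith("RUN pip install")), None)
    if copy_all is not None and first_pip is not None and copy_all < first_pip:
        issues.append("COPY . appears before pip install")

    # Rule 3: shell-form CMD (no JSON array) wraps the process in /bin/sh -c.
    for ln in lines:
        if ln.startswith("CMD ") and not ln[4:].lstrip().startswith("["):
            issues.append("CMD uses shell form")

    return issues


original = """FROM python:3.10
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get install -y git
COPY . /app
WORKDIR /app
RUN pip install torch
CMD python main.py
"""

for issue in find_dockerfile_issues(original):
    print("-", issue)
```

Run against the original Dockerfile, the sketch reports all three rules; a Dockerfile that combines its apt commands, copies code last, and uses a JSON-array `CMD` returns an empty list.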

## Exercise 2 Solution: RAG Application Dockerfile

In [None]:
import sys
sys.path.insert(0, '..')

from scripts.docker_utils import DockerImageBuilder

# Create RAG application Dockerfile
rag_builder = DockerImageBuilder("rag-server", use_multistage=True)

# Use NGC PyTorch base for GPU support
rag_builder.add_base("nvcr.io/nvidia/pytorch:24.12-py3")

# Add dependencies for RAG pipeline
rag_builder.add_python_deps([
    # Vector database
    "chromadb>=0.4.0",
    
    # Embeddings
    "sentence-transformers>=2.2.0",
    
    # LLM
    "transformers>=4.37.0",
    "accelerate>=0.25.0",
    "bitsandbytes>=0.41.0",
    
    # API
    "fastapi>=0.109.0",
    "uvicorn>=0.27.0",
    "pydantic>=2.0.0",
    
    # Document processing
    "pypdf>=3.0.0",
    "python-multipart>=0.0.6",
    
    # Utilities
    "langchain>=0.1.0",
    "tiktoken>=0.5.0",
])

# Environment variables
rag_builder.add_env("MODEL_PATH", "/models")
rag_builder.add_env("CHROMA_PATH", "/data/chroma")
rag_builder.add_env("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5")
rag_builder.add_env("LLM_MODEL", "llama-8b")
rag_builder.add_env("CUDA_VISIBLE_DEVICES", "0")

# Copy application
rag_builder.add_copy("app/", "/app/")

# Set working directory
rag_builder.set_workdir("/app")

# Expose port
rag_builder.expose(8000)

# Add health check
rag_builder.add_healthcheck("/health", port=8000, interval=30, timeout=10)

# Set entrypoint
rag_builder.add_entrypoint("python -m uvicorn main:app --host 0.0.0.0 --port 8000")

# Generate and display
print("RAG APPLICATION DOCKERFILE:")
print("=" * 60)
print(rag_builder.generate())
print("=" * 60)

In [None]:
# Complete RAG server implementation

rag_server_code = '''
"""RAG Server with ChromaDB and HuggingFace models."""

import os
from typing import List, Optional
from contextlib import asynccontextmanager

from fastapi import FastAPI, UploadFile, File, HTTPException
from pydantic import BaseModel
import chromadb
from sentence_transformers import SentenceTransformer

# Configuration
CHROMA_PATH = os.environ.get("CHROMA_PATH", "/data/chroma")
EMBEDDING_MODEL = os.environ.get("EMBEDDING_MODEL", "BAAI/bge-small-en-v1.5")

# Global clients
embedding_model = None
chroma_client = None
collection = None


class QueryRequest(BaseModel):
    query: str
    top_k: int = 5


class QueryResponse(BaseModel):
    query: str
    results: List[dict]
    answer: Optional[str] = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Initialize models on startup."""
    global embedding_model, chroma_client, collection
    
    # Load embedding model
    embedding_model = SentenceTransformer(EMBEDDING_MODEL)
    
    # Initialize ChromaDB
    chroma_client = chromadb.PersistentClient(path=CHROMA_PATH)
    collection = chroma_client.get_or_create_collection("documents")
    
    yield
    
    # Cleanup
    del embedding_model, chroma_client, collection


app = FastAPI(title="RAG Server", lifespan=lifespan)


@app.get("/health")
async def health():
    return {"status": "healthy", "collection_count": collection.count()}


@app.post("/ingest")
async def ingest_document(file: UploadFile = File(...)):
    """Ingest a UTF-8 text document into the vector store."""
    content = await file.read()
    try:
        text = content.decode("utf-8")
    except UnicodeDecodeError:
        raise HTTPException(status_code=400, detail="File must be UTF-8 encoded text")
    if not text.strip():
        raise HTTPException(status_code=400, detail="File is empty")

    # Chunk the document: 512-character windows starting every 400 characters,
    # so consecutive chunks overlap by 112 characters
    chunks = [text[i:i+512] for i in range(0, len(text), 400)]

    # Generate embeddings
    embeddings = embedding_model.encode(chunks).tolist()

    # Add to collection
    ids = [f"{file.filename}_{i}" for i in range(len(chunks))]
    collection.add(
        documents=chunks,
        embeddings=embeddings,
        ids=ids,
        metadatas=[{"source": file.filename}] * len(chunks),
    )

    return {"status": "success", "chunks_added": len(chunks)}


@app.post("/query", response_model=QueryResponse)
async def query(request: QueryRequest):
    """Query the vector store."""
    # Generate query embedding
    query_embedding = embedding_model.encode([request.query]).tolist()
    
    # Search
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=request.top_k,
    )
    
    # Format results (distances may be missing or None, so look them up once)
    distances = results.get("distances")
    formatted_results = []
    for i, doc in enumerate(results["documents"][0]):
        formatted_results.append({
            "content": doc,
            "source": results["metadatas"][0][i].get("source", "unknown"),
            "distance": distances[0][i] if distances else None,
        })
    
    return QueryResponse(
        query=request.query,
        results=formatted_results,
    )
'''

print("RAG Server Implementation:")
print("=" * 60)
print(rag_server_code)
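
The `/ingest` endpoint splits text into 512-character windows that start every 400 characters, so neighbouring chunks share 112 characters. Below is a minimal standalone sketch of that step; the function name `chunk_text` and its `size` and `stride` parameters are illustrative and do not appear in the server code.

```python
def chunk_text(text: str, size: int = 512, stride: int = 400) -> list[str]:
    """Split text into fixed-size character windows that overlap by size - stride."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    return [text[i:i + size] for i in range(0, len(text), stride)]


sample = "x" * 1000
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [512, 512, 200]
```

Windows start at offsets 0, 400, and 800, which is why a 1000-character input yields three chunks. The last window can be short, and for some input lengths it is fully contained in the previous chunk; a production pipeline might drop or merge such tails.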

## Challenge Solution: Complete Production-Ready Docker Image

In [None]:
# Complete production-ready Dockerfile with vLLM and OpenTelemetry

# Raw string: keeps the Dockerfile's backslash line continuations intact when printed
production_dockerfile = r'''
# ============================================
# PRODUCTION-READY LLM INFERENCE SERVER
# Features:
#   - vLLM for high-performance inference
#   - OpenTelemetry for observability
#   - Multi-model support via env vars
#   - GPU memory monitoring
#   - Health and metrics endpoints
# ============================================

# Build stage
FROM nvcr.io/nvidia/pytorch:24.12-py3 AS builder

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies (specifiers are quoted so the shell does not
# treat ">" as an output redirection)
RUN pip install --user --no-cache-dir \
    "vllm>=0.3.0" \
    "opentelemetry-api>=1.20.0" \
    "opentelemetry-sdk>=1.20.0" \
    "opentelemetry-exporter-otlp>=1.20.0" \
    "opentelemetry-instrumentation-fastapi>=0.41b0" \
    "prometheus-client>=0.19.0" \
    "py-spy>=0.3.14" \
    "nvidia-ml-py>=12.535.0"

# ============================================
# Production stage
# ============================================
FROM nvcr.io/nvidia/pytorch:24.12-py3

# Create non-root user
RUN useradd -m -u 1000 inference && \
    mkdir -p /app /models /data && \
    chown -R inference:inference /app /models /data

# Copy Python packages into the runtime user's home; /root/.local is
# typically unreadable after the switch to the non-root user below
COPY --from=builder --chown=inference:inference /root/.local /home/inference/.local
ENV PATH=/home/inference/.local/bin:$PATH

WORKDIR /app

# Copy application code
COPY --chown=inference:inference app/ /app/

# Environment variables
ENV MODEL_PATH=/models \
    MODEL_NAME=meta-llama/Llama-2-7b-chat-hf \
    CUDA_VISIBLE_DEVICES=0 \
    MAX_MODEL_LEN=4096 \
    GPU_MEMORY_UTILIZATION=0.9 \
    TENSOR_PARALLEL_SIZE=1 \
    OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
    OTEL_SERVICE_NAME=llm-inference

# Expose ports
EXPOSE 8000
EXPOSE 9090

# Health check
HEALTHCHECK --interval=30s --timeout=15s --start-period=120s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Switch to non-root user
USER inference

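# Start server. Shell form is used because exec form (JSON array) does not
# expand $MODEL_NAME; "exec" replaces the shell so the server receives signals.
CMD exec python -m vllm.entrypoints.openai.api_server \
    --model "$MODEL_NAME" \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len "$MAX_MODEL_LEN" \
    --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
    --tensor-parallel-size "$TENSOR_PARALLEL_SIZE"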
'''

print("PRODUCTION-READY DOCKERFILE:")
print("=" * 60)
print(production_dockerfile)
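
The environment variables declared in the Dockerfile only take effect if they reach the server's command line. One option is a small Python launcher that builds the argument list from the environment. This is a sketch under stated assumptions: the helper name `build_vllm_args` is hypothetical, the defaults simply repeat the `ENV` values above, and the flags `--max-model-len`, `--gpu-memory-utilization`, and `--tensor-parallel-size` should be checked against the installed vLLM version.

```python
import os
import sys


def build_vllm_args(env: dict) -> list[str]:
    """Translate the Dockerfile's environment variables into server arguments.

    Hypothetical helper; defaults mirror the ENV block of the Dockerfile.
    """
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", env.get("MODEL_NAME", "meta-llama/Llama-2-7b-chat-hf"),
        "--host", "0.0.0.0",
        "--port", "8000",
        "--max-model-len", env.get("MAX_MODEL_LEN", "4096"),
        "--gpu-memory-utilization", env.get("GPU_MEMORY_UTILIZATION", "0.9"),
        "--tensor-parallel-size", env.get("TENSOR_PARALLEL_SIZE", "1"),
    ]


args = build_vllm_args(dict(os.environ))
print(" ".join(args))
# Inside the container, the launcher would hand over to the server with:
# os.execvp(args[0], args)
```

Switching models then only requires overriding the variable at run time, for example `docker run -e MODEL_NAME=... <image>`, without rebuilding the image.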

In [None]:
# GPU monitoring endpoint implementation

gpu_monitoring_code = '''
"""GPU Metrics Endpoint for inference server."""

from fastapi import APIRouter
from prometheus_client import Gauge, generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response

try:
    import pynvml
    NVML_AVAILABLE = True
except ImportError:
    NVML_AVAILABLE = False

router = APIRouter()

# Prometheus metrics
gpu_memory_used = Gauge("gpu_memory_used_bytes", "GPU memory used", ["gpu_id"])
gpu_memory_total = Gauge("gpu_memory_total_bytes", "GPU memory total", ["gpu_id"])
gpu_utilization = Gauge("gpu_utilization_percent", "GPU utilization", ["gpu_id"])
gpu_temperature = Gauge("gpu_temperature_celsius", "GPU temperature", ["gpu_id"])


def get_gpu_metrics():
    """Get GPU metrics using NVML."""
    if not NVML_AVAILABLE:
        return [{"error": "NVML not available"}]

    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError as exc:
        # The module imported, but no NVIDIA driver or GPU is reachable
        return [{"error": f"NVML initialization failed: {exc}"}]

    metrics = []
    try:
        device_count = pynvml.nvmlDeviceGetCount()
        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)

            # Memory info
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)

            # Utilization
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)

            # Temperature
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)

            # Update Prometheus metrics
            gpu_memory_used.labels(gpu_id=str(i)).set(mem_info.used)
            gpu_memory_total.labels(gpu_id=str(i)).set(mem_info.total)
            gpu_utilization.labels(gpu_id=str(i)).set(util.gpu)
            gpu_temperature.labels(gpu_id=str(i)).set(temp)

            metrics.append({
                "gpu_id": i,
                "memory_used_gb": mem_info.used / (1024**3),
                "memory_total_gb": mem_info.total / (1024**3),
                "memory_percent": (mem_info.used / mem_info.total) * 100,
                "gpu_utilization": util.gpu,
                "temperature_c": temp,
            })
    finally:
        # Release NVML even if one of the queries above raises
        pynvml.nvmlShutdown()

    return metrics


@router.get("/gpu")
async def gpu_status():
    """Get GPU status."""
    return {"gpus": get_gpu_metrics()}


@router.get("/metrics")
async def prometheus_metrics():
    """Prometheus metrics endpoint."""
    # Update GPU metrics before returning
    get_gpu_metrics()
    
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST,
    )
'''

print("GPU MONITORING ENDPOINT:")
print("=" * 60)
print(gpu_monitoring_code)
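
The `/gpu` endpoint returns one dictionary per device, which makes it straightforward to check from a client or an alerting script. Below is a minimal sketch, assuming the response shape produced by `get_gpu_metrics()`; the helper name `gpus_over_threshold` and the 90% default are illustrative, and the example values are made up.

```python
def gpus_over_threshold(payload: dict, max_memory_percent: float = 90.0) -> list[int]:
    """Return IDs of GPUs whose memory use exceeds the threshold (hypothetical helper)."""
    over = []
    for gpu in payload.get("gpus", []):
        # Entries such as {"error": "NVML not available"} carry no metrics.
        if "memory_percent" not in gpu:
            continue
        if gpu["memory_percent"] > max_memory_percent:
            over.append(gpu["gpu_id"])
    return over


# Example payload in the shape returned by GET /gpu (values are made up).
example = {
    "gpus": [
        {"gpu_id": 0, "memory_percent": 95.2, "gpu_utilization": 80, "temperature_c": 71},
        {"gpu_id": 1, "memory_percent": 41.0, "gpu_utilization": 12, "temperature_c": 48},
    ]
}
print(gpus_over_threshold(example))  # [0]
```

Error entries are skipped rather than treated as failures, so the caller can decide separately how to react when NVML is unavailable.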

---

## Summary

This solution notebook demonstrated:

1. **Dockerfile Optimization**
   - NGC base images for DGX Spark compatibility
   - Multi-stage builds for smaller images
   - Layer caching optimization
   - Security best practices (non-root user)

2. **RAG Application Dockerfile**
   - Complete dependency list for RAG pipeline
   - ChromaDB + sentence-transformers integration
   - Production-ready FastAPI server

3. **Production Features**
   - vLLM for high-performance inference
   - OpenTelemetry for observability
   - GPU memory monitoring with NVML
   - Prometheus metrics endpoint