# Chapter 17: Advanced Kubernetes Concepts

Having mastered core resources and storage patterns, we now examine advanced workload controllers and governance mechanisms essential for production operations. These resources address specific operational requirements: ensuring node-level services run everywhere (DaemonSets), executing batch and scheduled workloads (Jobs), automatically right-sizing applications (HPA/VPA), and maintaining availability during maintenance (PDBs). Resource governance tools (Quotas and LimitRanges) enable safe multi-tenancy by preventing resource starvation and enforcing organizational policies.

Together, these resources round out your operational toolkit: they let a cluster run batch workloads, scale applications automatically, and enforce resource governance across teams.

## 17.1 DaemonSets

DaemonSets ensure that a copy of a specific Pod runs on all (or specific) nodes in the cluster. They are ideal for cluster-wide infrastructure services that must exist on every node, such as log collectors, monitoring agents, network proxies, and storage drivers.

### DaemonSet vs Deployment

| Feature | Deployment | DaemonSet |
|---------|------------|-----------|
| **Scheduling** | Distributed across cluster for availability | One per node (or matching nodes) |
| **Scaling** | Manual or HPA-based | Automatic with node count |
| **Use Case** | Application workloads | Infrastructure/Node agents |
| **Pod Naming** | Deployment name, ReplicaSet hash, random suffix | DaemonSet name, random suffix |
| **Update Strategy** | RollingUpdate or Recreate | RollingUpdate or OnDelete |

### Typical DaemonSet Workloads

- **Log Aggregation**: Fluentd, Fluent Bit, Filebeat
- **Monitoring**: Prometheus Node Exporter, Datadog Agent
- **Networking**: Calico/Flannel agents, kube-proxy
- **Security**: Falco (intrusion detection), audit log forwarders
- **Storage**: Ceph OSDs, CSI node plugins
- **Utilities**: Node problem detector, cleanup agents

### DaemonSet Specification

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    app: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # Update one node at a time
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      tolerations:
      # Tolerate control-plane nodes if desired
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:2.1
        resources:
          limits:
            cpu: "500m"
            memory: "256Mi"
          requests:
            cpu: "100m"
            memory: "128Mi"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
          type: Directory
      - name: varlibdockercontainers
        hostPath:
          # Docker-runtime path; on containerd nodes container logs live under /var/log/pods
          path: /var/lib/docker/containers
          type: Directory
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      serviceAccountName: fluent-bit
      terminationGracePeriodSeconds: 30
```

### Node Selection

Run on specific nodes using node selectors:

```yaml
spec:
  template:
    spec:
      nodeSelector:
        monitoring: "true"  # Only nodes with this label
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "monitoring"
        effect: "NoSchedule"
```

### Update Strategies

**RollingUpdate (Default):**
Replaces old Pods gradually to maintain coverage:

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%  # Can use percentage or absolute number (e.g., 1)
```
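To make the percentage form concrete, the following Python sketch converts a `maxUnavailable` value into a Pod count. It assumes the percentage is taken of the number of DaemonSet Pods at the start of the update and rounded up; the function name `max_unavailable_pods` is hypothetical and not part of any Kubernetes client.

```python
def max_unavailable_pods(max_unavailable, desired_pods):
    """Return how many DaemonSet Pods may be unavailable during a rolling update."""
    if isinstance(max_unavailable, str) and max_unavailable.endswith("%"):
        percent = int(max_unavailable[:-1])
        # Integer ceiling division avoids floating-point rounding surprises.
        return -(-desired_pods * percent // 100)
    # Absolute values are used as given.
    return int(max_unavailable)


# On a 25-node cluster, "10%" lets a few Pods update at once instead of one.
print(max_unavailable_pods("10%", 25))
```

On small clusters a percentage still resolves to at least one Pod under this rounding assumption, so the rollout can always make progress.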

**OnDelete:**
Updates are manual: the controller creates a new Pod only after you delete the old one:

```yaml
spec:
  updateStrategy:
    type: OnDelete
```

Useful for critical infrastructure where automatic rolling updates might cause instability.

### Monitoring DaemonSet Health

```bash
# Check status
kubectl get daemonset fluent-bit -n logging

# View which nodes are running the Pod
kubectl get pods -n logging -l app=fluent-bit -o wide

# Check if any failed to schedule
kubectl describe daemonset fluent-bit -n logging | grep -A 10 Events
```

## 17.2 StatefulSets Deep Dive

While Chapter 16 introduced StatefulSets, advanced patterns enable sophisticated deployment strategies for stateful applications.

### Partitioned Rolling Updates

Control the rollout to update only Pods with ordinal >= partition value, enabling canary-style updates for StatefulSets:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: "postgres"
  replicas: 5
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 4  # Only update pods with ordinal >= 4 (postgres-4)
  template:
    spec:
      containers:
      - name: postgres
        image: postgres:15.4  # New version
```

**Canary Pattern:**
1. Set `partition: 4` to update only postgres-4
2. Test postgres-4 with new version
3. If successful, reduce partition to update remaining pods gradually
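
The effect of each partition value in the canary steps can be sketched in Python. This is an illustration only; `pods_in_rollout` is a hypothetical helper, and it relies on the fact that a StatefulSet rolling update proceeds from the highest ordinal downward.

```python
def pods_in_rollout(name, replicas, partition):
    """List the Pods a partitioned rolling update replaces, in update order."""
    # Only ordinals >= partition are updated, highest ordinal first.
    return [f"{name}-{i}" for i in range(replicas - 1, partition - 1, -1)]


# Lowering the partition widens the rollout step by step.
for partition in (4, 2, 0):
    print(partition, pods_in_rollout("postgres", 5, partition))
```

Lowering `partition` from 4 to 2 to 0 widens the set of updated Pods while the lower ordinals keep running the previous revision.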

### Ordinal Controls (Kubernetes 1.27+)

Start replicas from a specific ordinal number:

```yaml
spec:
  ordinals:
    start: 0  # Default, but can be set to start from higher number
  replicas: 3
```

### Parallel Pod Management

For stateful applications that don't require strict startup ordering:

```yaml
spec:
  podManagementPolicy: Parallel  # Default is OrderedReady
```

- **OrderedReady**: Creates pods 0, 1, 2 sequentially; waits for each to be Running/Ready before next
- **Parallel**: Creates all pods simultaneously (faster startup, but no ordering guarantees)

### Orphaning Pods on Deletion

Delete the StatefulSet object while leaving its Pods running. This does not prevent deletion; it only stops the deletion from cascading to the Pods:

```bash
kubectl delete statefulset postgres --cascade=orphan
# Deletes StatefulSet but leaves pods postgres-0, postgres-1, etc. running
```

Recreate the StatefulSet with the same selector to adopt the existing Pods.

## 17.3 Jobs and CronJobs

Jobs run finite tasks to completion (batch processing), while CronJobs schedule Jobs periodically like Linux cron.

### Job Specification

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processor
  namespace: batch
spec:
  completions: 5          # Must complete successfully 5 times
  parallelism: 2          # Run 2 pods in parallel
  completionMode: Indexed # Pods get completion index (0 to completions-1)
  activeDeadlineSeconds: 600  # Timeout after 10 minutes
  backoffLimit: 3         # Retry up to 3 times before marking failed
  ttlSecondsAfterFinished: 86400  # Delete Job 24h after completion
  template:
    spec:
      restartPolicy: OnFailure  # Never or OnFailure (not Always)
      containers:
      - name: processor
        image: batch-processor:latest
        env:
        - name: JOB_COMPLETION_INDEX
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
        command: ["python", "process.py", "--shard=$(JOB_COMPLETION_INDEX)"]
        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
```

**Completion Modes:**
- **NonIndexed**: Pods are fungible; any completion counts toward total
- **Indexed**: Each Pod gets a unique index (0 to N-1), useful for parallel processing shards

### Parallel Processing Patterns

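**Work Queue:**
A single Job runs several worker Pods that pull from a shared queue. Leave `completions` unset: the Job finishes once any Pod exits successfully and the remaining Pods have terminated. Setting `completions: 1` would cap the Job at one running Pod.

```yaml
spec:
  parallelism: 10  # completions is intentionally unset for the work-queue pattern
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: queue-processor
        command: ["python", "worker.py"]  # Runs until queue empty, then exits
```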
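
A worker for this pattern usually loops until the queue reports no more work and then exits with status 0. The sketch below shows that loop, using Python's in-memory `queue.Queue` as a stand-in for a real broker; `drain` and `handle` are hypothetical names, and the actual `worker.py` is not shown in this chapter.

```python
import queue


def drain(work_queue, handle):
    """Process items until the queue is empty; return how many were handled."""
    processed = 0
    while True:
        try:
            item = work_queue.get_nowait()
        except queue.Empty:
            # Returning normally lets the container exit 0, which the Job
            # counts as a successful completion.
            return processed
        handle(item)
        processed += 1


if __name__ == "__main__":
    q = queue.Queue()
    for job_id in ("a", "b", "c"):
        q.put(job_id)
    print(drain(q, lambda item: None))
```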

**Fixed Completion Count:**
For embarrassingly parallel batch processing:

```yaml
spec:
  parallelism: 5
  completions: 100  # Process 100 items, 5 at a time
```

### CronJobs

Schedule Jobs using cron expressions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
  namespace: analytics
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  timeZone: "America/New_York"  # Beta in Kubernetes 1.25, stable in 1.27
  concurrencyPolicy: Forbid  # Forbid, Allow, or Replace
  startingDeadlineSeconds: 3600  # Must start within 1 hour of scheduled time
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  suspend: false  # Can pause by setting to true
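  jobTemplate:
    spec:
      activeDeadlineSeconds: 7200  # Job must complete within 2 hours
      backoffLimit: 2              # Job-level field; it is not valid inside the Pod spec
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: report-generator
            image: analytics/reporter:v1.2
            # Env values are not shell-expanded, so "$(date +%Y-%m-%d)" would be
            # passed through literally. Compute the date inside the container;
            # /app/generate-report is a hypothetical entrypoint.
            command: ["/bin/sh", "-c"]
            args: ["REPORT_DATE=$(date +%Y-%m-%d) exec /app/generate-report"]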
            resources:
              requests:
                memory: "8Gi"
                cpu: "4000m"
              limits:
                memory: "16Gi"
                cpu: "8000m"
```

**Cron Expression Format:**
```
┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of week (0 - 6) (Sunday=0)
│ │ │ │ │
* * * * *
```

**Common Patterns:**
- `*/5 * * * *` : Every 5 minutes
- `0 */6 * * *` : Every 6 hours
- `0 0 * * 0` : Weekly on Sunday
- `0 0 1 * *` : Monthly on the 1st
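
To check what a schedule means before deploying it, a small matcher can help. The following Python sketch handles only `*`, single numbers, lists, ranges, and `/step`; it is a simplified illustration rather than the parser Kubernetes uses, and `cron_matches` is a hypothetical name. It follows the conventional cron rule that when both day fields are restricted, matching either one is enough.

```python
from datetime import datetime


def _field_matches(field, value, lo, hi):
    """Check one cron field against a value within the range lo..hi."""
    for part in field.split(","):
        rng, _, step_s = part.partition("/")
        step = int(step_s) if step_s else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = int(rng)
            end = hi if step_s else start  # "N/step" means N through max
        if start <= value <= end and (value - start) % step == 0:
            return True
    return False


def cron_matches(expr, when):
    """Return True if the five-field cron expression fires at `when`."""
    minute, hour, dom, month, dow = expr.split()
    weekday = when.isoweekday() % 7  # Sunday=0, matching the diagram above
    time_ok = (
        _field_matches(minute, when.minute, 0, 59)
        and _field_matches(hour, when.hour, 0, 23)
        and _field_matches(month, when.month, 1, 12)
    )
    dom_ok = _field_matches(dom, when.day, 1, 31)
    dow_ok = _field_matches(dow, weekday, 0, 6)
    if dom == "*" or dow == "*":
        day_ok = dom_ok and dow_ok
    else:
        day_ok = dom_ok or dow_ok
    return time_ok and day_ok


print(cron_matches("0 2 * * *", datetime(2024, 1, 15, 2, 0)))
```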

### Handling Job Failures

```yaml
spec:
  backoffLimit: 4
  backoffLimitPerIndex: 2  # Kubernetes 1.28+: max failures per index; requires completionMode: Indexed and restartPolicy: Never
  maxFailedIndexes: 5       # Kubernetes 1.28+: max failed indexes before aborting
  podFailurePolicy:         # Kubernetes 1.26+: fine-grained failure handling; requires restartPolicy: Never
    rules:
    - action: FailJob
      onExitCodes:
        operator: In
        values: [1, 2, 137]
    - action: Ignore  # Don't count as failure
      onPodConditions:
      - type: DisruptionTarget
        status: "True"
```

## 17.4 Horizontal Pod Autoscaling (HPA)

HPA automatically scales the number of Pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed metrics such as CPU utilization, memory usage, or custom metrics.

### Basic CPU-Based Autoscaling

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Target 70% CPU utilization
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:  # Kubernetes 1.18+
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
      policies:
      - type: Percent
        value: 10  # Scale down max 10% of replicas per minute
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100  # Double replicas if needed
        periodSeconds: 60
      - type: Pods
        value: 4   # Or add 4 pods per minute, whichever is higher
        periodSeconds: 60
      selectPolicy: Max  # Use the policy that adds most replicas
```
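
The replica count the HPA aims for follows the documented rule `desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue)`, skipped when the ratio is within a tolerance (0.1 by default). The Python sketch below applies that rule to a single metric and clamps the result to `minReplicas` and `maxReplicas`. It is a simplification: it ignores Pod readiness, missing metrics, and the `behavior` policies, and `desired_replicas` is a hypothetical name.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas, max_replicas, tolerance=0.1):
    """Estimate the replica count the HPA would request for one metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count alone.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Four replicas averaging 90% CPU against a 70% target.
print(desired_replicas(4, 90, 70, min_replicas=3, max_replicas=50))
```

When several metrics are configured, as in the manifest above, the HPA evaluates each one and uses the largest resulting replica count.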

### Custom Metrics (Prometheus Adapter)

Scale based on application-specific metrics like requests per second:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 2
  maxReplicas: 100
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"  # 1000 requests/second per pod
  - type: External
    external:
      metric:
        name: queue_length
        selector:
          matchLabels:
            queue: orders
      target:
        type: Value
        value: "100"  # Scale when queue has 100+ messages
```

**Prerequisites:**
- Custom and external metrics APIs served by an adapter such as Prometheus Adapter (Metrics Server only provides CPU and memory resource metrics)
- Application exposing metrics in Prometheus format

### Scaling Policies

**Scale Down Constraints:**
Prevent flapping by limiting scale-down speed:

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600  # Use the highest replica recommendation from the last 10 minutes
    policies:
    - type: Percent
      value: 10  # Remove max 10% of pods per minute
      periodSeconds: 60
    - type: Pods
      value: 2   # Or remove max 2 pods per minute
      periodSeconds: 60
    selectPolicy: Min  # Use the policy that removes fewer pods
```
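
To see how the two policies interact, the sketch below computes the lowest replica count each policy would allow within one period and then applies `selectPolicy`. It is an approximation under stated assumptions: the percentage result is rounded up, Pods already removed earlier in the period are ignored, and `scale_down_floor` is a hypothetical helper.

```python
def scale_down_floor(current_replicas, percent=None, pods=None, select_policy="Max"):
    """Return the lowest replica count permitted in one scale-down period."""
    floors = []
    if percent is not None:
        # ceil(current * (100 - percent) / 100) using integer math.
        floors.append(-(-current_replicas * (100 - percent) // 100))
    if pods is not None:
        floors.append(max(0, current_replicas - pods))
    if not floors:
        raise ValueError("at least one policy is required")
    # "Min" picks the policy with the smallest change (highest floor);
    # "Max" picks the policy with the largest change (lowest floor).
    return max(floors) if select_policy == "Min" else min(floors)


# With 50 replicas, 10% would allow removing 5 Pods, but Min keeps it to 2.
print(scale_down_floor(50, percent=10, pods=2, select_policy="Min"))
```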

### HPA Troubleshooting

```bash
# Check HPA status
kubectl get hpa web-app-hpa -n production
kubectl describe hpa web-app-hpa -n production

# View current metrics
kubectl get --raw "/apis/autoscaling/v2/namespaces/production/horizontalpodautoscalers/web-app-hpa" | jq .

# Check if metrics server is running
kubectl get pods -n kube-system | grep metrics-server

# Verify resource requests are set (HPA requires requests)
kubectl get deployment web-app -n production -o yaml | grep -A 5 resources
```

**Common Issues:**
- **Missing Resource Requests**: HPA requires resource requests to calculate utilization
- **Metrics Server Unavailable**: HPA needs metrics-server for CPU/memory
- **Stabilization**: Scale-down may be delayed by stabilization windows

## 17.5 Vertical Pod Autoscaling (VPA)

While HPA scales horizontally (more Pods), VPA scales vertically (more CPU/memory per Pod). VPA is ideal for applications that cannot scale horizontally (databases, singleton services) or have variable resource needs.

### VPA Modes

- **Off**: Recommendations only, no automatic changes
- **Initial**: Applies recommendations only when Pods are created
- **Recreate**: Applies recommendations at creation and evicts running Pods whose requests drift too far from the recommendation (each eviction restarts the Pod)
- **Auto**: Uses the preferred update mechanism, which currently means the same behavior as Recreate

### VPA Specification

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: database-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres
  updatePolicy:
    updateMode: "Auto"  # Off, Initial, Recreate, or Auto
    minReplicas: 2  # Minimum replicas to consider for update (prevents single-pod downtime)
  resourcePolicy:
    containerPolicies:
    - containerName: postgres
      minAllowed:
        cpu: "500m"
        memory: "1Gi"
      maxAllowed:
        cpu: "4"
        memory: "16Gi"
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsAndLimits  # Or RequestsOnly
```
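
The `minAllowed` and `maxAllowed` bounds act as a clamp on whatever the recommender suggests. A minimal Python sketch of that clamp for CPU follows; `parse_cpu_millicores` and `clamp_cpu` are hypothetical helpers that understand only plain cores (`"4"`, `"0.5"`) and millicores (`"500m"`).

```python
def parse_cpu_millicores(quantity):
    """Convert a CPU quantity such as '500m' or '4' to millicores."""
    quantity = str(quantity)
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def clamp_cpu(recommended, min_allowed, max_allowed):
    """Clamp a CPU recommendation to the policy bounds, in millicores."""
    value = parse_cpu_millicores(recommended)
    low = parse_cpu_millicores(min_allowed)
    high = parse_cpu_millicores(max_allowed)
    return max(low, min(high, value))


# A 250m recommendation is raised to the 500m floor from the policy above.
print(clamp_cpu("250m", "500m", "4"))
```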

### VPA Recommendations

View recommendations without applying:

```yaml
spec:
  updatePolicy:
    updateMode: "Off"
```

Check recommendations:
```bash
kubectl get vpa database-vpa -o yaml

# Output includes:
# recommendation:
#   containerRecommendations:
#   - containerName: postgres
#     lowerBound:
#       cpu: "800m"
#       memory: 2.5Gi
#     target:
#       cpu: "1000m"
#       memory: 3Gi
#     upperBound:
#       cpu: "1200m"
#       memory: 4Gi
```

### Combining HPA and VPA

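**Warning**: Do not let HPA and VPA both act on the same workload using CPU or memory metrics unless VPA is in "Off" or "Initial" mode. VPA changes the requests that HPA utilization targets are measured against, so the two controllers will work against each other.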

**Recommended Pattern:**
- Use VPA in `Off` mode to get recommendations
- Manually update Deployment resource requests based on recommendations
- Use HPA for horizontal scaling

Or use VPA for initial sizing only:
```yaml
spec:
  updatePolicy:
    updateMode: "Initial"  # Only size new pods, let HPA handle scaling
```

## 17.6 Pod Disruption Budgets (PDB)

PDBs ensure that voluntary disruptions (node drains, cluster upgrades, manual deletions) do not compromise application availability by maintaining a minimum number of available Pods.

### Voluntary vs Involuntary Disruptions

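**Voluntary** (respects PDB when performed through the Eviction API):
- Node drain (`kubectl drain`)
- Pod eviction requested by a human or automation
- Cluster autoscaler scaling down

Deleting a Pod or its controller directly bypasses PDBs, and scheduler preemption honors them only on a best-effort basis.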

**Involuntary** (ignores PDB):
- Hardware failure
- Kernel panic
- Network partition
- Out-of-memory killing

### PDB Specification

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2  # OR maxUnavailable: 1 (cannot use both)
  selector:
    matchLabels:
      app: api-gateway
      tier: backend
```

**Strategies:**
- **minAvailable**: Absolute number or percentage of Pods that must remain available
- **maxUnavailable**: Maximum number or percentage of Pods that can be unavailable during disruption

**Percentage Values:**
```yaml
spec:
  minAvailable: 50%  # At least half must remain
  # OR
  maxUnavailable: 25%  # No more than 25% can be down
```
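
How these settings translate into evictions can be worked through in Python. The sketch assumes percentages are rounded up for both fields and that allowed disruptions equal the number of healthy Pods minus the number that must stay healthy; `allowed_disruptions` is a hypothetical helper, not a Kubernetes API.

```python
def _resolve(value, expected):
    """Turn an absolute value or a percentage string into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        # Integer ceiling of expected * percent / 100.
        return -(-expected * int(value[:-1]) // 100)
    return int(value)


def allowed_disruptions(expected, healthy, min_available=None, max_unavailable=None):
    """Estimate how many Pods may be evicted right now under a PDB."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = _resolve(min_available, expected)
    else:
        desired_healthy = max(0, expected - _resolve(max_unavailable, expected))
    return max(0, healthy - desired_healthy)


# Seven Pods with minAvailable: 50% must keep four healthy.
print(allowed_disruptions(7, 7, min_available="50%"))
```

Note that a single-replica workload with `minAvailable: 1` yields zero allowed disruptions, so evictions are refused until the budget or the replica count changes.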

### PDB with StatefulSets

Critical for stateful applications:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb
spec:
  maxUnavailable: 1  # Only one PostgreSQL pod down at a time
  selector:
    matchLabels:
      app: postgres
```

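**Important**: With a single replica, a PDB forces a trade-off. `maxUnavailable: 1` allows the only Pod to be evicted, so it offers no protection, while `minAvailable: 1` or `maxUnavailable: 0` leaves zero allowed disruptions and blocks `kubectl drain` until someone intervenes. Run multiple replicas if you need both availability and drainable nodes.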

### Checking PDB Status

```bash
kubectl get pdb -n production

# NAME       MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# api-pdb    2               N/A               1                     10d
# postgres   N/A             1                 0                     10d

# "Allowed Disruptions" shows how many can be evicted currently
```

If `ALLOWED DISRUPTIONS` is 0, `kubectl drain` will block until more Pods become available.

## 17.7 Resource Quotas

ResourceQuotas limit aggregate resource consumption per namespace, preventing a single team or application from monopolizing cluster resources.

### ResourceQuota Specification

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    # Compute Resources
    requests.cpu: "100"
    requests.memory: 500Gi
    limits.cpu: "200"
    limits.memory: 1000Gi
    
    # Storage
    requests.storage: 5Ti
    persistentvolumeclaims: "50"
    <storage-class-name>.storageclass.storage.k8s.io/requests.storage: 2Ti
    
    # Object Counts
    pods: "100"
    replicationcontrollers: "20"
    secrets: "50"
    configmaps: "50"
    services: "50"
    services.loadbalancers: "10"
    services.nodeports: "10"
    count/ingresses.networking.k8s.io: "20"
    count/jobs.batch: "50"
    count/cronjobs.batch: "20"
    
    # GPU Resources (if using device plugins)
    requests.nvidia.com/gpu: "10"
    
    # Local Ephemeral Storage
    requests.ephemeral-storage: 500Gi
    limits.ephemeral-storage: 1Ti
```

### Scope Selectors

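Scopes restrict a quota to Pods that match a QoS class or termination behavior. `BestEffort`/`NotBestEffort` and `Terminating`/`NotTerminating` are mutually exclusive pairs, so pick at most one from each:

```yaml
spec:
  scopes:
    - BestEffort        # Only quota BestEffort pods (no resource requests or limits)
    - NotTerminating    # Only quota pods without activeDeadlineSeconds
    # - NotBestEffort   # Alternative: only quota Guaranteed/Burstable pods
    # - Terminating     # Alternative: only quota pods with activeDeadlineSeconds
```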

**Priority Class Scoping (stable since Kubernetes 1.17):**
```yaml
spec:
  scopeSelector:
    matchExpressions:
    - scopeName: PriorityClass
      operator: In
      values: ["high-priority", "production"]
```

### Checking Quota Usage

```bash
kubectl get resourcequota -n production
kubectl describe resourcequota production-quota -n production

# Output shows used vs hard limits:
# Name:            production-quota
# Resource         Used    Hard
# --------         ----    ----
# limits.cpu       50      200
# limits.memory    200Gi   1000Gi
# pods             35      100
# requests.cpu     30      100
# requests.memory  120Gi   500Gi
```
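
The admission check behind this output amounts to comparing `used + requested` against `hard` for every tracked resource. A minimal Python sketch, using plain numbers instead of Kubernetes quantity strings and a hypothetical `quota_violations` helper:

```python
def quota_violations(used, hard, requested):
    """Return the resources a new object would push over the quota."""
    violations = []
    for resource, amount in requested.items():
        if resource not in hard:
            continue  # Resources without a hard limit are not constrained.
        if used.get(resource, 0) + amount > hard[resource]:
            violations.append(resource)
    return sorted(violations)


used = {"requests.cpu": 30, "pods": 35}
hard = {"requests.cpu": 100, "pods": 100}

# A Pod requesting 80 CPUs would exceed the 100-CPU request quota.
print(quota_violations(used, hard, {"requests.cpu": 80, "pods": 1}))
```

A request that lands exactly on the hard limit is still admitted; only exceeding it is rejected.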

## 17.8 Limit Ranges

While ResourceQuotas constrain aggregate namespace consumption, LimitRanges constrain individual Pod and container resources, enforcing defaults and preventing resource starvation.

### Container Limits

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:  # Applied if container specifies no limits
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:  # Applied if container specifies no requests
      cpu: "100m"
      memory: "128Mi"
    max:  # Maximum allowed
      cpu: "2"
      memory: "4Gi"
    min:  # Minimum required
      cpu: "50m"
      memory: "64Mi"
    type: Container
    
  # Pod-level limits (sum of all containers)
  - max:
      cpu: "4"
      memory: "8Gi"
    min:
      cpu: "100m"
      memory: "128Mi"
    type: Pod
    
  # PVC limits (only min and max are enforced for PersistentVolumeClaims)
  - max:
      storage: 500Gi
    min:
      storage: 1Gi
    type: PersistentVolumeClaim
    
  # Ratio constraints (limit to request ratio); these must admit the
  # defaults above (500m/100m = 5x CPU, 512Mi/128Mi = 4x memory)
  - maxLimitRequestRatio:
      cpu: "5"  # Limit cannot be more than 5x request
      memory: "4"
    type: Container
```
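
The container-level checks can be approximated in a few lines of Python. This sketch uses hypothetical numbers in millicores and a hypothetical `limit_range_violations` helper; it covers only the minimum, maximum, and ratio rules for a single resource and leaves out defaulting.

```python
def limit_range_violations(request, limit, minimum, maximum, max_ratio):
    """Return which LimitRange rules a container's request/limit pair breaks."""
    problems = []
    if request < minimum:
        problems.append("min")
    if limit > maximum:
        problems.append("max")
    if limit > request * max_ratio:
        problems.append("ratio")
    return problems


# A 200m request with a 1000m limit is a 5x ratio, above an assumed cap of 4.
print(limit_range_violations(request=200, limit=1000, minimum=50, maximum=2000, max_ratio=4))
```

The same ratio check applies to values injected from `default` and `defaultRequest`, so the defaults themselves need to satisfy `maxLimitRequestRatio`.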

### Enforcing Quality of Service

Setting `default` equal to `defaultRequest` gives Guaranteed QoS to any container that declares no resources of its own:

```yaml
spec:
  limits:
  - default:
      cpu: "100m"
      memory: "128Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"  # Same as limit = Guaranteed QoS
    type: Container
```

### Checking LimitRange Application

```bash
kubectl describe limitrange default-limits -n production

# Verify applied to new pods
kubectl get pod mypod -n production -o yaml | grep -A 10 resources
```

---

## Chapter Summary and Preview

In this chapter, we explored advanced Kubernetes resources essential for production operations. DaemonSets ensure infrastructure services run on every node, critical for logging, monitoring, and networking agents. Jobs and CronJobs handle batch processing and scheduled tasks with sophisticated retry logic and parallel execution patterns. Horizontal Pod Autoscaler (HPA) enables dynamic scaling based on resource utilization or custom application metrics, while Vertical Pod Autoscaler (VPA) automatically rightsizes container resource allocations. Pod Disruption Budgets (PDBs) safeguard availability during voluntary disruptions like node maintenance or cluster upgrades. Resource Quotas prevent namespace resource monopolization in multi-tenant environments, and LimitRanges enforce organizational policies on individual container specifications, ensuring consistent resource requests and limits across workloads.

**Key Takeaways:**
- Use DaemonSets for node-level infrastructure services requiring presence on every machine; avoid for application workloads that should scale independently of node count.
- Implement PodDisruptionBudgets for all production workloads to ensure cluster maintenance operations do not compromise availability.
- Combine HPA for variable load scaling with VPA in "Off" or "Initial" mode for capacity planning, but avoid simultaneous automatic scaling to prevent conflicts.
- Enforce LimitRanges in every namespace to prevent resource starvation and ensure consistent QoS class distributions.
- Use Indexed Jobs for parallel processing workloads requiring shard awareness, and set appropriate backoff limits to prevent infinite retry loops on fundamentally failed jobs.

**Next Chapter Preview:**
Chapter 18: Kubernetes Security explores defense-in-depth strategies for production clusters. You will implement Role-Based Access Control (RBAC) for fine-grained permissions, configure Pod Security Standards to enforce runtime constraints, utilize Network Policies for micro-segmentation, manage Secrets with external secret management integration, and implement admission controllers for policy enforcement. These security mechanisms protect the advanced workloads and autoscaling configurations established in this chapter, ensuring that operational flexibility does not compromise security posture.