# Chapter 51: Scaling CI/CD Infrastructure

As organizations adopt CI/CD across hundreds of teams and thousands of repositories, the infrastructure supporting these pipelines becomes a critical production system in its own right. A CI/CD platform that cannot scale becomes the bottleneck constraining engineering velocity, causing queue delays, build failures due to resource starvation, and frustrated developers. This chapter treats CI/CD infrastructure as a first-class citizen, applying the same infrastructure-as-code, observability, and resilience patterns to Jenkins controllers, GitHub Actions runners, and Kubernetes build farms that we apply to production applications. We examine **horizontal scaling** through ephemeral build agents and autoscaling groups, **vertical scaling** for resource-intensive compilation and testing, **multi-region deployments** for disaster recovery and data sovereignty compliance, **high availability** configurations that eliminate single points of failure, **disaster recovery** strategies for pipeline state and build history, **capacity planning** methodologies to predict and prevent bottlenecks, and **cost optimization** techniques including spot instances, build caching, and rightsizing that prevent cloud bill shock while maintaining performance.

## 51.1 Horizontal Scaling

Horizontal scaling adds more build agents to distribute workload, rather than making individual agents larger. This approach provides elasticity, fault isolation, and cost efficiency through ephemeral infrastructure.

### Kubernetes-Based Build Farms

Running CI/CD agents on Kubernetes leverages cluster autoscaling for dynamic capacity.

**Jenkins Kubernetes Plugin**:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: jenkins-agent-config
data:
  config.yaml: |
    jenkins:
      clouds:
        - kubernetes:
            name: "kubernetes"
            serverUrl: "https://kubernetes.default"
            namespace: "jenkins-agents"
            jenkinsUrl: "http://jenkins:8080"
            jenkinsTunnel: "jenkins-agent:50000"
            credentialsId: "kubernetes-service-account"
            containerCapStr: "100"
            retentionTimeout: 5
            templates:
              - name: "default-agent"
                label: "jenkins-agent"
                containers:
                  - name: "jnlp"
                    image: "jenkins/inbound-agent:latest"
                    alwaysPullImage: true
                    workingDir: "/home/jenkins/agent"
                    resourceRequestCpu: "500m"
                    resourceRequestMemory: "512Mi"
                    resourceLimitCpu: "2000m"
                    resourceLimitMemory: "2Gi"
                volumes:
                  - emptyDirVolume:
                      memory: false
                      mountPath: "/tmp"
                  - persistentVolumeClaim:
                      claimName: "build-cache"
                      mountPath: "/cache"
                      readOnly: false
                yaml: |
                  spec:
                    affinity:
                      podAntiAffinity:
                        preferredDuringSchedulingIgnoredDuringExecution:
                        - weight: 100
                          podAffinityTerm:
                            labelSelector:
                              matchExpressions:
                              - key: jenkins
                                operator: In
                                values:
                                - slave
                            topologyKey: kubernetes.io/hostname
```

**GitHub Actions Self-Hosted Runners on Kubernetes**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: github-actions-runner
  namespace: actions-runners
spec:
  replicas: 3
  selector:
    matchLabels:
      app: github-runner
  template:
    metadata:
      labels:
        app: github-runner
    spec:
      serviceAccountName: github-runner
      containers:
      - name: runner
        image: summerwind/actions-runner:latest
        env:
        - name: GITHUB_URL
          value: https://github.com/myorg
        - name: RUNNER_TOKEN
          valueFrom:
            secretKeyRef:
              name: runner-token
              key: token
        - name: RUNNER_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: RUNNER_WORKDIR
          value: /tmp/github-runner
        - name: LABELS
          value: self-hosted,linux,x64,gpu  # Custom labels for job routing
        resources:
          requests:
            cpu: "2000m"
            memory: "4Gi"
            nvidia.com/gpu: "1"  # GPU runners for ML workloads
          limits:
            cpu: "4000m"
            memory: "8Gi"
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: docker-sock
          mountPath: /var/run/docker.sock
        - name: cache
          mountPath: /cache
      volumes:
      - name: docker-sock
        hostPath:
          path: /var/run/docker.sock
          type: Socket
      - name: cache
        persistentVolumeClaim:
          claimName: runner-cache
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values:
                - build-agent
```
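The `LABELS` value in the manifest above drives job routing: a job is eligible for a runner only when every label in its `runs-on` list is present on that runner. The following is a minimal sketch of that matching rule, not GitHub's implementation; the runner names and the `eligible_runners` helper are hypothetical.

```python
def eligible_runners(job_labels, runners):
    """Return the names of runners whose label set contains every job label."""
    wanted = set(job_labels)
    return sorted(
        name for name, labels in runners.items()
        if wanted <= set(labels)
    )


# Hypothetical fleet: one GPU runner and one general-purpose runner.
runners = {
    "gpu-runner-0": ["self-hosted", "linux", "x64", "gpu"],
    "cpu-runner-0": ["self-hosted", "linux", "x64"],
}

gpu_only = eligible_runners(["self-hosted", "gpu"], runners)
any_linux = eligible_runners(["self-hosted", "linux"], runners)
```

Because matching is a subset test, adding a label such as `gpu` to a runner never stops it from picking up general jobs. Keeping GPU runners for GPU work therefore requires that general jobs target a label the GPU runners lack.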

### Autoscaling Strategies

**Horizontal Pod Autoscaler (HPA)** for build agents:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: github-runner-hpa
  namespace: actions-runners
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: github-actions-runner
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: github_runner_queued_jobs
        selector:
          matchLabels:
            repository: myorg/myrepo
      target:
        type: AverageValue
        averageValue: "1"  # Target one queued job per runner; replicas grow with the queue
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 5
        periodSeconds: 60  # Add 5 pods per minute
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
      - type: Pods
        value: 2
        periodSeconds: 120
```
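To see how the HPA settings above interact, it can help to approximate a single evaluation by hand. For an External metric with an `AverageValue` target, the desired replica count is roughly the total metric value divided by the target, rounded up. The sketch below applies that rule and then the per-period step limits and the min/max bounds from the manifest. It is a simplification: it ignores the stabilization windows and the controller's tolerance band, and `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(queued_jobs, current, target_per_pod=1.0,
                     min_replicas=3, max_replicas=50,
                     max_step_up=5, max_step_down=2):
    """Approximate one HPA evaluation for an External AverageValue metric."""
    raw = math.ceil(queued_jobs / target_per_pod)
    # Scaling policies: add at most max_step_up pods, remove at most
    # max_step_down pods, per policy period.
    stepped = max(current - max_step_down, min(raw, current + max_step_up))
    # minReplicas / maxReplicas always bound the result.
    return max(min_replicas, min(max_replicas, stepped))


burst = desired_replicas(queued_jobs=20, current=3)   # limited by the step-up policy
drain = desired_replicas(queued_jobs=0, current=10)   # limited by the step-down policy
```

With these settings a burst of 20 queued jobs takes several policy periods to absorb, which is the trade-off the `scaleUp` policy makes between responsiveness and node churn.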

**Cluster Autoscaler** for node scaling:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
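spec:
  replicas: 1
  selector:  # Required field; must match the pod template labels
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers: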
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --balance-similar-node-groups
        - --skip-nodes-with-system-pods=false
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        resources:
          limits:
            cpu: "1000m"
            memory: "1Gi"
```

## 51.2 Vertical Scaling

While horizontal scaling adds more workers, vertical scaling increases resources per worker for tasks that cannot be parallelized or require significant memory/CPU.

### Resource-Intensive Builds

**Machine Learning Training**:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-training-job
spec:
  containers:
  - name: training
    image: ml-training:latest
    resources:
      requests:
        cpu: "32"
        memory: "128Gi"
        nvidia.com/gpu: "4"
        ephemeral-storage: "100Gi"
      limits:
        cpu: "32"
        memory: "128Gi"
        nvidia.com/gpu: "4"
        ephemeral-storage: "200Gi"
    volumeMounts:
    - name: dataset
      mountPath: /data
      readOnly: true
    - name: model-output
      mountPath: /output
  volumes:
  - name: dataset
    persistentVolumeClaim:
      claimName: training-dataset-100tb
  - name: model-output
    persistentVolumeClaim:
      claimName: model-artifacts
  nodeSelector:
    node-type: gpu-high-memory
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```

**Large Compilation Jobs** (C++, Rust):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rust-build
spec:
  containers:
  - name: builder
    image: rust:1.75
    command: ["cargo", "build", "--release"]
    resources:
      requests:
        cpu: "16"
        memory: "32Gi"
        ephemeral-storage: "50Gi"
      limits:
        cpu: "16"
        memory: "32Gi"
        ephemeral-storage: "100Gi"
    env:
    - name: CARGO_HOME
      value: /cache/cargo
    - name: CARGO_TARGET_DIR
      value: /cache/target
    volumeMounts:
    - name: cache
      mountPath: /cache
  volumes:
  - name: cache
    persistentVolumeClaim:
      claimName: build-cache-ssd
  nodeSelector:
    node-type: high-cpu
```

### Node Pools for Specialized Workloads

**AWS EKS Node Groups**:
```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ci-cluster
  region: us-east-1
nodeGroups:
  - name: general-build
    instanceType: m6i.2xlarge
    desiredCapacity: 3
    minSize: 1
    maxSize: 20
    volumeSize: 100
    labels:
      node-type: general
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      
  - name: high-memory
    instanceType: r6i.8xlarge  # 256GB RAM
    desiredCapacity: 0
    minSize: 0
    maxSize: 10
    volumeSize: 500
    labels:
      node-type: high-memory
    taints:
      - key: dedicated
        value: memory
        effect: NoSchedule
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      
  - name: gpu-build
    instanceType: p4d.24xlarge  # 8x A100 GPUs
    desiredCapacity: 0
    minSize: 0
    maxSize: 5
    volumeSize: 1000
    labels:
      node-type: gpu-high-memory
      nvidia.com/gpu.present: "true"
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
```
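When choosing instance types for these node groups, a quick packing estimate shows how many build pods actually fit on a node once system overhead is subtracted. The sketch below is a rough calculation under stated assumptions: the reserved CPU and memory figures are placeholders (real allocatable values depend on the AMI and kubelet settings), and `agents_per_node` is a hypothetical helper.

```python
def agents_per_node(node_cpu_m, node_mem_mi, agent_cpu_m, agent_mem_mi,
                    reserved_cpu_m=500, reserved_mem_mi=2048):
    """Estimate how many agent pods fit on one node by resource requests.

    reserved_* approximates capacity held back for the kubelet, system
    daemons, and DaemonSets; the defaults are assumptions, not measurements.
    """
    usable_cpu = node_cpu_m - reserved_cpu_m
    usable_mem = node_mem_mi - reserved_mem_mi
    if usable_cpu <= 0 or usable_mem <= 0:
        return 0
    return min(usable_cpu // agent_cpu_m, usable_mem // agent_mem_mi)


# m6i.2xlarge: 8 vCPU, 32 GiB. Runner pod requesting 2 CPU / 4 GiB.
general = agents_per_node(8000, 32 * 1024, 2000, 4 * 1024)

# r6i.8xlarge: 32 vCPU, 256 GiB. Build pod requesting 16 CPU / 32 GiB.
large = agents_per_node(32000, 256 * 1024, 16000, 32 * 1024)
```

Under these assumptions only one 16-CPU build fits on a 32-vCPU node, because the reserved capacity leaves slightly less than two full slots. Requesting 15 CPUs instead, or choosing a larger instance, avoids stranding half the node.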

## 51.3 Multi-Region Deployments

Geographic distribution ensures disaster recovery, data residency compliance, and reduced latency for distributed teams.

### Disaster Recovery Strategy

**Active-Passive** (Cost-efficient):
- Primary region handles all traffic
- Secondary region has infrastructure provisioned but scaled to zero or minimum
- RTO (Recovery Time Objective): 15-30 minutes (time to scale up)
- RPO (Recovery Point Objective): 5 minutes (data loss window)

**Active-Active** (High availability):
- Both regions serve traffic (geo-DNS or global load balancer)
- Continuous synchronization
- RTO: Near zero (automatic failover)
- RPO: Near zero (synchronous replication where possible)
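An RPO target only holds if the backup or replication cadence supports it: the worst-case data loss is roughly the interval between successful copies plus any replication lag. The sketch below checks a cadence against a target; the function names are hypothetical and the model deliberately ignores failed or partial backups.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, replication_lag=timedelta(0)):
    """Upper bound on lost data if failure strikes just before the next copy."""
    return backup_interval + replication_lag


def meets_rpo(backup_interval, rpo_target, replication_lag=timedelta(0)):
    """True when the cadence can satisfy the stated recovery point objective."""
    return worst_case_data_loss(backup_interval, replication_lag) <= rpo_target


rpo = timedelta(minutes=5)
six_hourly = meets_rpo(timedelta(hours=6), rpo)      # scheduled archive backups
streaming = meets_rpo(timedelta(minutes=1), rpo,
                      replication_lag=timedelta(minutes=2))
```

A six-hour backup schedule, such as the one used for Jenkins later in this chapter, cannot meet a five-minute RPO on its own. It needs to be paired with continuous replication of the underlying volume or object store.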

### Infrastructure Replication

**Terraform Multi-Region**:
```hcl
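# regions.tf
# Provider configurations must be static: Terraform cannot select a provider
# per for_each instance, so each region gets its own module block.
provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}

provider "aws" {
  alias  = "secondary"
  region = "us-west-2"
}

# Backup vault in the primary region (the vault name is a placeholder)
resource "aws_backup_vault" "primary" {
  provider = aws.primary
  name     = "ci-cd-primary"
}

module "ci_infrastructure_primary" {
  source = "./modules/ci-cd"

  region      = "us-east-1"
  environment = "primary"
  is_primary  = true

  # DR configuration
  backup_vault_arn = aws_backup_vault.primary.arn

  providers = {
    aws = aws.primary
  }
}

module "ci_infrastructure_secondary" {
  source = "./modules/ci-cd"

  region      = "us-west-2"
  environment = "secondary"
  is_primary  = false

  # DR configuration: the standby region has no backup vault of its own
  backup_vault_arn = null

  providers = {
    aws = aws.secondary
  }
}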
```

**Cross-Region Replication**:
```yaml
# ArgoCD ApplicationSet for multi-region deployment
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: ci-cd-platform
spec:
  generators:
  - list:
      elements:
      - cluster: production-us-east-1
        url: https://prod-us-east-1.api.internal
        region: us-east-1
        backupRegion: us-west-2  # Backups are copied to the other region
        weight: 100
      - cluster: production-us-west-2
        url: https://prod-us-west-2.api.internal
        region: us-west-2
        backupRegion: us-east-1
        weight: 0  # Standby, weight 0 until failover
  template:
    metadata:
      name: '{{cluster}}-ci-platform'
    spec:
      project: infrastructure
      source:
        repoURL: https://github.com/company/gitops.git
        targetRevision: HEAD
        path: infrastructure/ci-cd
        helm:
          values: |
            region: {{region}}
            replicaCount: {{weight}}
            backup:
              enabled: true
              crossRegionTarget: {{backupRegion}}
      destination:
        server: '{{url}}'
        namespace: ci-cd
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```

### Data Residency and Sovereignty

**Geo-Fencing Deployments**:
```yaml
# Ensure EU data stays in EU
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-data-residency
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["eu-*"]
  parameters:
    labels:
      - "data-residency/eu-only"
      - "compliance.company.com/gdpr"
---
# Pod spec with node affinity for EU
apiVersion: v1
kind: Pod
metadata:
  name: eu-data-processor
  labels:
    data-residency/eu-only: "true"
    compliance.company.com/gdpr: "true"  # Second label required by the constraint above
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/region
            operator: In
            values:
            - eu-west-1
            - eu-central-1
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "eu-only"
    effect: "NoSchedule"
```

## 51.4 High Availability

The CI/CD control plane itself must be resilient to node failures, zone outages, and regional disasters.

### Jenkins High Availability

**Active-Standby with Shared Storage**:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: jenkins
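spec:
  serviceName: jenkins
  replicas: 1  # Jenkins doesn't support active-active
  selector:  # Required field; must match the pod template labels
    matchLabels:
      app: jenkins
  template:
    metadata:
      labels:
        app: jenkins  # Also matched by the PodDisruptionBudget below
    spec: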
      containers:
      - name: jenkins
        image: jenkins/jenkins:lts
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: jenkins-home
          mountPath: /var/jenkins_home
        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
          limits:
            memory: "8Gi"
            cpu: "4000m"
        livenessProbe:
          httpGet:
            path: /login
            port: 8080
          initialDelaySeconds: 90
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /login
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 5
  volumeClaimTemplates:
  - metadata:
      name: jenkins-home
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi
---
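# Protect the single controller pod from voluntary evictions. With one
# replica, minAvailable: 1 also blocks node drains until the budget is
# relaxed; recovery from a node failure relies on the StatefulSet
# rescheduling the pod and reattaching its volume.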
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: jenkins-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: jenkins
```

**External Storage for Build History**:
```yaml
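# Pod spec fragment: mount S3 for build artifacts instead of local storage.
# Assumes an S3 CSI driver that supports inline volumes is installed. Object
# storage mounts lack full POSIX semantics, so test Jenkins against them first.
containers:
- name: jenkins
  image: jenkins/jenkins:lts
  env:
  - name: JAVA_OPTS
    value: "-Djenkins.model.Jenkins.slaveAgentPort=50000"
  volumeMounts:
  - name: s3-artifacts
    mountPath: /var/jenkins_home/jobs
volumes:  # Pod-level field, a sibling of containers
- name: s3-artifacts
  csi:
    driver: s3.csi.aws.com
    volumeAttributes:
      bucketName: jenkins-artifacts-prod
      region: us-east-1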
```

### GitHub Actions HA

**Multi-Runner Architecture**:
```hcl
# Terraform for GitHub Actions Runner Controller (ARC)
resource "helm_release" "actions_runner_controller" {
  name       = "actions-runner-controller"
  repository = "https://actions-runner-controller.github.io/actions-runner-controller"
  chart      = "actions-runner-controller"
  namespace  = "actions-runner-system"
  
  set {
    name  = "authSecret.github_token"
    value = var.github_token
  }
  
  set {
    name  = "replicaCount"
    value = "2"  # HA for controller itself
  }
}

# Runner Deployment with autoscaling
resource "kubernetes_manifest" "runner_deployment" {
  manifest = {
    apiVersion = "actions.summerwind.dev/v1alpha1"
    kind       = "RunnerDeployment"
    metadata = {
      name      = "production-runners"
      namespace = "actions-runner-system"
    }
    spec = {
      template = {
        spec = {
          repository = "myorg/myrepo"
          labels     = ["self-hosted", "production", "large"]
          resources = {
            limits = {
              cpu    = "4000m"
              memory = "16Gi"
            }
            requests = {
              cpu    = "1000m"
              memory = "4Gi"
            }
          }
          # Ephemeral runners for security
          ephemeral = true
        }
      }
    }
  }
}

# HorizontalRunnerAutoscaler
resource "kubernetes_manifest" "runner_autoscaler" {
  manifest = {
    apiVersion = "actions.summerwind.dev/v1alpha1"
    kind       = "HorizontalRunnerAutoscaler"
    metadata = {
      name      = "production-runners-autoscaler"
      namespace = "actions-runner-system"
    }
    spec = {
      scaleTargetRef = {
        name = "production-runners"
      }
      minReplicas = 3
      maxReplicas = 50
      metrics = [
        {
          type = "TotalNumberOfQueuedAndInProgressWorkflowRuns"
          repositoryNames = ["myorg/myrepo"]
        }
      ]
      scaleUpTriggers = [
        {
          githubEvent = {
            checkRun = {
              types = ["created"]
              status = "queued"
            }
          }
          duration = "5m"
        }
      ]
    }
  }
}
```

## 51.5 Disaster Recovery

CI/CD infrastructure is critical path; its loss paralyzes all software delivery. DR strategies must account for both the control plane (Jenkins, GitLab, ArgoCD) and the data (build history, artifacts, pipeline state).

### Backup Strategies

**Jenkins Backup**:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: jenkins-backup
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: amazon/aws-cli:latest
            command:
            - /bin/sh
            - -c
            - |
              set -eu
              # Create timestamped backup
              TIMESTAMP=$(date +%Y%m%d-%H%M%S)
              BUCKET="jenkins-backups-${AWS_REGION}"
              ARCHIVE="jenkins-backup-${TIMESTAMP}.tar.gz"

              # Backup Jenkins home (excluding workspace for size). Archiving
              # relative to the home directory makes the ./ exclude patterns
              # match and lets the archive be restored with -C /var/jenkins_home.
              # Assumes tar and gzip are available in the image.
              tar -czf "/tmp/${ARCHIVE}" \
                -C /var/jenkins_home \
                --exclude='./workspace' \
                --exclude='./war' \
                .

              # Upload to S3 (bucket versioning is enabled on the bucket itself)
              aws s3 cp "/tmp/${ARCHIVE}" "s3://${BUCKET}/backups/"

              # Confirm the object exists; this checks presence, not content integrity
              aws s3api head-object \
                --bucket "${BUCKET}" \
                --key "backups/${ARCHIVE}"

              # Cleanup old backups (keep 30 days). Only the YYYYMMDD part of the
              # name is passed to date, which cannot parse YYYYMMDD-HHMMSS.
              aws s3 ls "s3://${BUCKET}/backups/" | \
                awk '{print $4}' | \
                while read -r file; do
                  DATE=$(echo "$file" | grep -oE '[0-9]{8}' | head -n 1)
                  [ -n "$DATE" ] || continue
                  if [ $(($(date +%s) - $(date -d "$DATE" +%s))) -gt 2592000 ]; then
                    aws s3 rm "s3://${BUCKET}/backups/$file"
                  fi
                done
            env:
            - name: AWS_REGION
              value: us-east-1
            volumeMounts:
            - name: jenkins-home
              mountPath: /var/jenkins_home
              readOnly: true
          volumes:
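          - name: jenkins-home
            persistentVolumeClaim:
              # A StatefulSet volumeClaimTemplate named jenkins-home creates the
              # PVC jenkins-home-jenkins-0. The volume is ReadWriteOnce, so this
              # pod must land on the same node as the Jenkins controller.
              claimName: jenkins-home-jenkins-0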
          restartPolicy: OnFailure
```
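The retention step in the backup job depends on recovering a timestamp from each object name, which is easy to get wrong in shell. The same rule can be expressed and unit-tested in a few lines of Python. This is a sketch that assumes the `jenkins-backup-YYYYMMDD-HHMMSS.tar.gz` naming used above; `expired_backups` is a hypothetical helper that only selects keys and leaves deletion to the caller.

```python
import re
from datetime import datetime, timedelta

# Matches the archive naming convention used by the backup job.
BACKUP_NAME = re.compile(r"jenkins-backup-(\d{8}-\d{6})\.tar\.gz$")


def expired_backups(keys, now, keep_days=30):
    """Return the keys whose embedded timestamp is older than keep_days."""
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for key in keys:
        match = BACKUP_NAME.search(key)
        if not match:
            continue  # Never delete objects that do not follow the convention
        taken = datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")
        if taken < cutoff:
            expired.append(key)
    return expired


keys = [
    "backups/jenkins-backup-20240101-000000.tar.gz",
    "backups/jenkins-backup-20240225-060000.tar.gz",
    "backups/README.txt",
]
to_delete = expired_backups(keys, now=datetime(2024, 3, 1))
```

Where the bucket is dedicated to backups, an S3 lifecycle expiration rule achieves the same retention without any script.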

**ArgoCD Backup**:
```bash
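# Export ArgoCD applications and projects
BACKUP_FILE="argocd-backup-$(date +%Y%m%d).yaml"
argocd admin export > "$BACKUP_FILE"

# Append secrets as a separate YAML document. Secret values are only
# base64-encoded, not encrypted, so encrypt the file or the bucket at rest.
echo '---' >> "$BACKUP_FILE"
kubectl get secret -n argocd -o yaml >> "$BACKUP_FILE"

# Store in versioned S3 with server-side encryption
aws s3 cp "$BACKUP_FILE" s3://argocd-backups/ --sse AES256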
```

### Recovery Procedures

**Jenkins Restore**:
```bash
#!/bin/bash
# restore-jenkins.sh
set -euo pipefail

BACKUP_FILE="${1:?usage: restore-jenkins.sh <backup-file-name>}"
NAMESPACE="jenkins"
AWS_REGION="${AWS_REGION:-us-east-1}"
# Same bucket layout that the backup CronJob writes to
BACKUP_URI="s3://jenkins-backups-${AWS_REGION}/backups/${BACKUP_FILE}"
# PVC created by the StatefulSet's jenkins-home volumeClaimTemplate
PVC_NAME="jenkins-home-jenkins-0"

# Scale down Jenkins
kubectl scale statefulset jenkins --replicas=0 -n "$NAMESPACE"

# Wait for termination
kubectl wait --for=delete pod/jenkins-0 --timeout=300s -n "$NAMESPACE"

# Restore PVC (destructive: the old volume is deleted and recreated empty)
kubectl delete pvc "$PVC_NAME" -n "$NAMESPACE"
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $PVC_NAME
  namespace: $NAMESPACE
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd
EOF

# Create restore job (remove any job left over from a previous restore)
kubectl delete job jenkins-restore -n "$NAMESPACE" --ignore-not-found
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: jenkins-restore
  namespace: $NAMESPACE
spec:
  template:
    spec:
      containers:
      - name: restore
        image: amazon/aws-cli:latest
        command:
        - /bin/sh
        - -c
        - |
          # Assumes the archive was created relative to /var/jenkins_home
          aws s3 cp ${BACKUP_URI} /tmp/backup.tar.gz
          tar -xzf /tmp/backup.tar.gz -C /var/jenkins_home
          chown -R 1000:1000 /var/jenkins_home
        volumeMounts:
        - name: jenkins-home
          mountPath: /var/jenkins_home
      volumes:
      - name: jenkins-home
        persistentVolumeClaim:
          claimName: $PVC_NAME
      restartPolicy: Never
EOF

# Wait for restore
kubectl wait --for=condition=complete job/jenkins-restore -n "$NAMESPACE" --timeout=600s

# Scale up Jenkins
kubectl scale statefulset jenkins --replicas=1 -n "$NAMESPACE"

## 51.6 Capacity Planning

Preventing bottlenecks requires understanding pipeline demand patterns and provisioning capacity accordingly.
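A useful first-order estimate comes from Little's law: the average number of busy agents equals the build arrival rate multiplied by the average build duration. Dividing by a target utilization leaves headroom for bursts. The sketch below applies that formula; the 70% default and the example workload are assumptions, and `required_agents` is a hypothetical helper.

```python
import math


def required_agents(builds_per_hour, avg_build_minutes, target_utilization=0.7):
    """Estimate the agent count from arrival rate and mean build duration."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    # Little's law: average busy agents = arrival rate x time in service.
    busy_agents = builds_per_hour * avg_build_minutes / 60.0
    return math.ceil(busy_agents / target_utilization)


# Example workload: 120 builds per hour averaging 10 minutes each.
with_headroom = required_agents(120, 10)
fully_loaded = required_agents(120, 10, target_utilization=1.0)
```

The gap between the two results is the cost of headroom. Running at full utilization looks cheaper but leaves no slack, so any burst turns directly into queue time.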

### Metrics Collection

**Pipeline Metrics**:
```yaml
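# Prometheus exporter sidecar for Jenkins metrics.
# The image name is a placeholder: substitute the exporter your team uses and
# check its documentation for the environment variables it expects.
- name: jenkins-metrics
  image: prometheus/jenkins-exporter:latest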
  env:
  - name: JENKINS_URL
    value: http://jenkins:8080
  - name: JENKINS_USER
    value: metrics
  - name: JENKINS_API_TOKEN
    valueFrom:
      secretKeyRef:
        name: jenkins-metrics
        key: token
  ports:
  - containerPort: 9118
    name: metrics
```

**Key Metrics to Track**:
- Queue depth (number of jobs waiting)
- Build duration trends
- Agent utilization percentage
- Failure rate by agent type
- Cost per build
- Cache hit rates
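Several of these metrics can be derived from plain build records before any dashboard exists. The sketch below computes utilization, failure rate, worst queue wait, and cost per build for one observation window. The `Build` record shape, the hourly agent price, and the `summarize` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Build:
    queue_seconds: float  # Time spent waiting for an agent
    run_seconds: float    # Time spent executing on an agent
    succeeded: bool


def summarize(builds, agent_count, window_seconds, hourly_agent_cost):
    """Derive headline capacity metrics for one observation window."""
    if not builds:
        raise ValueError("no builds in window")
    busy = sum(b.run_seconds for b in builds)
    capacity = agent_count * window_seconds
    total_cost = agent_count * (window_seconds / 3600) * hourly_agent_cost
    return {
        "utilization": busy / capacity,
        "failure_rate": sum(not b.succeeded for b in builds) / len(builds),
        "max_queue_seconds": max(b.queue_seconds for b in builds),
        "cost_per_build": total_cost / len(builds),
    }


# One hour of activity on two agents at an assumed $0.40 per agent-hour.
builds = [
    Build(10, 900, True),
    Build(30, 900, False),
    Build(0, 900, True),
    Build(120, 900, True),
]
report = summarize(builds, agent_count=2, window_seconds=3600, hourly_agent_cost=0.40)
```

Cost per build here includes idle agent time, which is usually the number finance cares about. Dividing only the busy time by the build count would understate it.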

### Predictive Scaling

**Machine Learning-Based Prediction**:
```python
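# capacity_predictor.py
import math
from datetime import datetime, timedelta, timezone

import boto3
import pandas as pd
from sklearn.linear_model import LinearRegression


def predict_required_agents(current_agents, scale_agents, jobs_per_agent=1.0):
    """Predict next hour's agent demand from 30 days of queue-depth history.

    current_agents and scale_agents (a callable that sets the agent count)
    are supplied by the caller. jobs_per_agent is an assumed conversion from
    queue depth to agents; use recorded agent counts instead if you have them.
    """
    cloudwatch = boto3.client('cloudwatch')
    now = datetime.now(timezone.utc)

    # Get hourly average queue depth for the last 30 days
    response = cloudwatch.get_metric_statistics(
        Namespace='CI/CD',
        MetricName='QueueDepth',
        StartTime=now - timedelta(days=30),
        EndTime=now,
        Period=3600,
        Statistics=['Average'],
    )

    df = pd.DataFrame(response['Datapoints'])
    if df.empty:
        return current_agents  # No history yet; leave capacity unchanged

    # Feature engineering: time of day, day of week
    timestamps = pd.to_datetime(df['Timestamp'], utc=True)
    df['hour'] = timestamps.dt.hour
    df['day_of_week'] = timestamps.dt.dayofweek

    # Train model: calendar position predicts the agents needed
    X = df[['hour', 'day_of_week']]
    y = df['Average'] / jobs_per_agent

    model = LinearRegression()
    model.fit(X, y)

    # Predict the next hour, rolling over midnight and week boundaries
    next_slot = now + timedelta(hours=1)
    features = pd.DataFrame(
        [[next_slot.hour, next_slot.weekday()]],
        columns=['hour', 'day_of_week'],
    )
    prediction = max(0, math.ceil(model.predict(features)[0]))

    # Scale agents ahead of predicted demand
    if prediction > current_agents:
        scale_agents(prediction)

    return prediction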
```

### Queue Management

**Priority Queues**:
```groovy
// Jenkins Pipeline: Priority based on environment
properties([
  parameters([
    choice(
      name: 'ENVIRONMENT',
      choices: ['development', 'staging', 'production'],
      description: 'Deployment target'
    )
  ])
])

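// Set priority based on environment (lower number = higher priority).
// NOTE: setPriority is not part of the core currentBuild API; this sketch
// assumes a plugin or shared library (for example, a wrapper around the
// Priority Sorter plugin) provides it.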
if (params.ENVIRONMENT == 'production') {
  currentBuild.setPriority(1)  // Highest
} else if (params.ENVIRONMENT == 'staging') {
  currentBuild.setPriority(5)
} else {
  currentBuild.setPriority(10) // Lowest
}
```
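The scheduling behavior that priority values are meant to produce can be illustrated independently of any CI server. The sketch below uses a binary heap keyed on priority, then on submission order, so that production jobs run first and jobs of equal priority keep first-in, first-out order. The `BuildQueue` class is hypothetical; the priority numbers mirror the pipeline above.

```python
import heapq
import itertools

# Lower number means higher priority, matching the pipeline above.
PRIORITY = {"production": 1, "staging": 5, "development": 10}


class BuildQueue:
    """Priority queue that preserves FIFO order within a priority level."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()

    def submit(self, job, environment):
        # The sequence number breaks ties so equal priorities stay FIFO.
        heapq.heappush(self._heap, (PRIORITY[environment], next(self._sequence), job))

    def next_job(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = BuildQueue()
queue.submit("feature-branch-tests", "development")
queue.submit("hotfix-deploy", "production")
queue.submit("release-candidate", "staging")
queue.submit("rollback", "production")

order = [queue.next_job() for _ in range(len(queue))]
```

Strict priority ordering can starve development builds during a busy release. A common mitigation is to age waiting jobs by lowering their priority number over time.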

**Resource Quotas per Team**:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-quota
  namespace: team-alpha
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 400Gi
    pods: "50"  # Single pod limit (a second count/pods entry would duplicate it)
    # CI/CD specific
    count/jobs.batch: "100"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-alpha-limits
  namespace: team-alpha
spec:
  limits:
  - default:
      cpu: "2000m"
      memory: "4Gi"
    defaultRequest:
      cpu: "500m"
      memory: "1Gi"
    type: Container
  - max:
      cpu: "8000m"
      memory: "32Gi"
    min:
      cpu: "100m"
      memory: "128Mi"
    type: Pod
```

## 51.7 Cost Optimization

CI/CD infrastructure can become expensive without governance. Optimization requires right-sizing, spot instances, and intelligent caching.

### Spot Instances and Preemptible VMs

**AWS Spot for Build Agents**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-runners
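spec:
  replicas: 10
  selector:  # Required field; must match the pod template labels
    matchLabels:
      app: spot-runner
  template:
    metadata:
      labels:
        app: spot-runner
    spec: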
      nodeSelector:
        node-type: spot
      tolerations:
      - key: "spot"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      containers:
      - name: runner
        image: github-runner:latest
        resources:
          requests:
            cpu: "4000m"
            memory: "8Gi"
          limits:
            cpu: "4000m"
            memory: "8Gi"
        env:
        - name: RUNNER_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: RUNNER_TOKEN
          valueFrom:
            secretKeyRef:
              name: runner-token
              key: token
        # Handle spot interruption: deregister the runner before the pod stops.
        # config.sh remove expects a runner removal token, which is distinct
        # from the registration token; this sketch assumes RUNNER_TOKEN holds
        # a valid removal token at shutdown.
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - |
                # AWS gives roughly two minutes of notice before reclaiming a
                # spot instance, so a long-running job may not finish; design
                # builds to be retried.
                /actions-runner/config.sh remove --token "${RUNNER_TOKEN}"
```

**Karpenter for Spot Optimization**:
```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-builders
spec:
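  template:
    metadata:
      labels:
        node-type: spot  # Matches the nodeSelector on the spot-runners Deployment
    spec:
      nodeClassRef:
        name: default  # Assumed EC2NodeClass name; v1beta1 NodePools require one
      taints:
      - key: spot  # Matches the toleration on the spot-runners Deployment
        value: "true"
        effect: NoSchedule
      requirements: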
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["m6i.2xlarge", "m6i.4xlarge", "c6i.2xlarge"]
      - key: topology.kubernetes.io/zone
        operator: In
        values: ["us-east-1a", "us-east-1b", "us-east-1c"]
  limits:
    cpu: 1000
    memory: 4000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h  # 30 days max node lifetime
```
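Whether spot capacity is cheaper in practice depends on how often builds are interrupted and how much work each interruption wastes. The sketch below models the expected cost of one build when an interrupted attempt is retried from scratch. Every number in the example (price, discount, interruption probability, wasted fraction) is a hypothetical input, and `expected_spot_cost` is an illustrative helper, not a pricing API.

```python
def expected_spot_cost(on_demand_hourly, spot_discount, build_hours,
                       interrupt_prob, wasted_fraction=0.5):
    """Expected cost of one build on spot, including retried attempts.

    interrupt_prob is the chance that a single attempt is interrupted;
    wasted_fraction is the share of the build assumed complete (and billed)
    when the interruption happens.
    """
    if not 0 <= interrupt_prob < 1:
        raise ValueError("interrupt_prob must be in [0, 1)")
    spot_hourly = on_demand_hourly * (1 - spot_discount)
    # Expected number of interrupted attempts before the first success.
    failed_attempts = interrupt_prob / (1 - interrupt_prob)
    billed_hours = build_hours * (1 + failed_attempts * wasted_fraction)
    return spot_hourly * billed_hours


on_demand = 0.40 * 0.5  # Half-hour build at an assumed $0.40 per hour
spot = expected_spot_cost(0.40, spot_discount=0.6, build_hours=0.5, interrupt_prob=0.2)
```

Under these assumptions spot remains well under half the on-demand cost even with a 20% interruption rate. The model ignores developer time lost to retries, which is why long, non-resumable jobs are often kept on on-demand capacity.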

### Build Caching

**Layer Caching for Docker**:
```yaml
# Kaniko with cache
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build
spec:
  containers:
  - name: kaniko
    image: gcr.io/kaniko-project/executor:latest
    args:
    - --dockerfile=Dockerfile
    - --context=git://github.com/myorg/myrepo
    - --destination=myregistry/myapp:${COMMIT_SHA}  # Substitute before applying (e.g. envsubst); Kubernetes only expands $(VAR) from env
    - --cache=true
    - --cache-repo=myregistry/cache
    - --cache-copy-layers=true
    - --cache-run-layers=true
    volumeMounts:
    - name: docker-config
      mountPath: /kaniko/.docker
  volumes:
  - name: docker-config
    secret:
      secretName: registry-credentials
```

**Dependency Caching**:
```yaml
# Persistent cache volume for Maven/Gradle/NPM
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: build-cache
spec:
  accessModes:
    - ReadWriteMany  # Shared across builds
  resources:
    requests:
      storage: 500Gi
  storageClassName: efs-sc  # AWS EFS for multi-AZ access
---
# Build pod using cache
apiVersion: v1
kind: Pod
metadata:
  name: maven-build
spec:
  containers:
  - name: maven
    image: maven:3.9-eclipse-temurin-17
    command: ["mvn", "clean", "package"]
    volumeMounts:
    - name: m2-cache
      mountPath: /root/.m2
    env:
    - name: MAVEN_OPTS
      value: "-Dmaven.repo.local=/root/.m2/repository"
  volumes:
  - name: m2-cache
    persistentVolumeClaim:
      claimName: build-cache
```
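A shared cache volume works best when each build reads from a directory keyed on its dependency definition, so that a changed lockfile starts a fresh cache instead of mutating a shared one. The sketch below derives such a key by hashing the lockfile contents. The key layout and the `cache_key` helper are hypothetical conventions, not part of Maven or Kubernetes.

```python
import hashlib


def cache_key(tool, lockfile_bytes, os_name="linux"):
    """Build a cache key that changes whenever the lockfile changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return f"{tool}-{os_name}-{digest}"


pom_v1 = b"<project><version>1.0</version></project>"
pom_v2 = b"<project><version>1.1</version></project>"

key_a = cache_key("maven", pom_v1)
key_b = cache_key("maven", pom_v1)
key_c = cache_key("maven", pom_v2)
```

Identical inputs always map to the same key, so concurrent builds of the same commit share a cache. Any dependency change yields a new key, and stale directories can be pruned by last access time.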

### Cost Monitoring

**KubeCost / OpenCost**:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubecost-config
data:
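  kubecost.yaml: |
    # Illustrative structure only: key names vary between Kubecost and OpenCost
    # releases, so map these settings onto the schema of the version you deploy.
    # Allocation for CI/CD namespaces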
    allocations:
      filters:
        namespaces:
          - ci-cd
          - jenkins
          - github-actions
          - gitlab-runner
      aggregation: namespace
      accumulate: true
      
    # Alerts for cost anomalies
    alerts:
      - type: budget
        threshold: 10000  # $10k/day
        window: daily
        aggregation: namespace
        filter: ci-cd
        
      - type: efficiency
        threshold: 0.20  # 20% resource utilization
        window: 7d
        aggregation: controller
```
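The efficiency alert above amounts to comparing measured usage against requested resources per workload. The sketch below applies the same 20% threshold to a hypothetical usage table. The workload names, CPU figures, and `underutilized` helper are illustrative and do not reflect any cost tool's API.

```python
def underutilized(workloads, threshold=0.20):
    """Return workloads whose CPU usage is below threshold of their request.

    workloads maps a name to (cpu_used_cores, cpu_requested_cores).
    """
    flagged = []
    for name, (used, requested) in workloads.items():
        if requested > 0 and used / requested < threshold:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical 7-day averages, in CPU cores.
workloads = {
    "jenkins-controller": (1.5, 2.0),
    "spot-runners": (4.0, 40.0),
    "gpu-runners": (0.5, 2.0),
}
candidates = underutilized(workloads)
```

Workloads flagged this way are candidates for smaller requests or a lower `minReplicas`. Runners that idle between bursts will always look inefficient on a weekly average, so check peak usage before rightsizing.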

**Cost Optimization Rules**:
```yaml
# Delete succeeded pods after 1 hour and failed pods after 24 hours
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cleanup-completed-builds
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cleanup
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
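              # Delete succeeded pods created more than 1 hour ago. Assumes jq is
              # available in the image and that the job's service account may
              # list and delete pods in every namespace.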
              kubectl get pods --all-namespaces --field-selector=status.phase=Succeeded \
                -o json | jq -r '.items[] | select(.metadata.creationTimestamp | fromdateiso8601 < now - 3600) | "\(.metadata.namespace) \(.metadata.name)"' | \
                while read ns pod; do
                  kubectl delete pod $pod -n $ns
                done
                
              # Delete failed pods older than 24 hours
              kubectl get pods --all-namespaces --field-selector=status.phase=Failed \
                -o json | jq -r '.items[] | select(.metadata.creationTimestamp | fromdateiso8601 < now - 86400) | "\(.metadata.namespace) \(.metadata.name)"' | \
                while read ns pod; do
                  kubectl delete pod $pod -n $ns
                done
          restartPolicy: OnFailure
```

---

## Chapter Summary and Preview

This chapter addressed the operational challenges of scaling CI/CD infrastructure from a single team tool to an enterprise platform serving hundreds of engineers. We examined **horizontal scaling** through Kubernetes-based build farms and autoscaling runner controllers that adjust capacity to queue depth, so that developers rarely wait for build agents and idle resource costs stay low. **Vertical scaling** strategies support resource-intensive workloads, such as machine learning training and large-scale compilations, through specialized node pools with high-memory and GPU capabilities.

**Multi-region deployments** ensure business continuity and regulatory compliance, implementing active-passive or active-active strategies that replicate CI/CD infrastructure across geographic boundaries to satisfy data residency requirements and provide disaster recovery capabilities. **High availability** configurations eliminate single points of failure in the control plane through stateful sets with persistent storage, pod disruption budgets, and rapid failover mechanisms.

**Disaster recovery** strategies encompass backup procedures for build history, pipeline configurations, and artifact repositories, with automated checks that each backup was stored and documented recovery time objectives (RTO) and recovery point objectives (RPO). **Capacity planning** methodologies use historical metrics and demand prediction to provision infrastructure ahead of demand spikes, preventing the pipeline itself from becoming a constraint on delivery velocity.

**Cost optimization** techniques—including spot instances for fault-tolerant builds, intelligent layer caching for container images, persistent dependency caches shared across builds, and automated cleanup of completed resources—ensure that scaling infrastructure does not scale costs linearly.

**Key Takeaways:**
- Treat CI/CD infrastructure as a production service with the same reliability, observability, and disaster recovery requirements as customer-facing applications.
- Implement horizontal pod autoscaling for build agents based on queue depth rather than static capacity to balance cost and availability.
- Use spot instances and preemptible VMs for CI/CD workloads that can tolerate interruptions (most builds), reserving on-demand capacity for critical deployments.
- Implement multi-region CI/CD infrastructure for disaster recovery, with automated failover and data replication to meet RPO/RTO requirements.
- Maintain persistent caches for dependencies and Docker layers across builds to reduce execution time and external bandwidth costs.
- Implement automated cleanup of completed builds, old artifacts, and unused images to prevent storage costs from accumulating indefinitely.
- Use capacity planning based on historical build patterns and growth projections to provision infrastructure ahead of demand rather than reactively.

**Next Chapter Preview:** Chapter 52: Multi-Cluster Deployments extends scaling from single clusters to federated environments managing hundreds of Kubernetes clusters. We will explore **cluster federation** with Kubernetes Federation v2 and GitOps-based management, **multi-cluster service mesh** for cross-cluster communication and traffic management, **global load balancing** strategies that route users to healthy clusters, **data replication** patterns for stateful applications across geographic boundaries, **failover automation** that detects cluster degradation and shifts workloads without manual intervention, **cluster configuration drift** detection and remediation, and **fleet management** tools like Rancher, Open Cluster Management (OCM), and Cluster API that provide unified control planes for distributed infrastructure. We will examine how to maintain consistency, security, and observability across a growing fleet of clusters while enabling team autonomy and regional compliance.