# Chapter 39: Multi-Environment Deployments

Modern applications must traverse multiple environments before reaching production users, each with distinct operational characteristics, security boundaries, and data requirements. Development environments prioritize rapid iteration with synthetic data, staging environments mirror production topology for validation, and production environments demand immutable artifacts with zero-downtime deployments. Managing this progression—environment promotion—requires sophisticated configuration management that maintains artifact immutability while accommodating environmental variance, secure secret handling that prevents credential leakage across boundaries, and database migration strategies that preserve data integrity across schema evolution.

This chapter establishes patterns for reliable environment progression, from local development through production, ensuring that code validated in lower environments behaves predictably when exposed to production traffic and data volumes.

## 39.1 Environment Configuration

Configuration varies across environments while application code remains constant. This separation of concerns enables the same container image to execute in development and production, differing only in externalized configuration.

### Configuration Hierarchy

Effective configuration management follows a hierarchy where defaults are overridden progressively:

```
Base Configuration (Default values)
    ↓
Environment Configuration (dev/staging/prod overlays)
    ↓
Runtime Configuration (Secrets, feature flags)
```

**Base Configuration (Kustomize):**
```yaml
# base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
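spec:
  replicas: 1  # Default, overridden per environment
  selector:
    matchLabels:
      app: payment-service  # Required; must match the pod template labels
  template:
    metadata:
      labels:
        app: payment-service
    spec: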
      containers:
      - name: payment
        image: payment-service:latest  # Replaced by CI/CD
        env:
        - name: LOG_LEVEL
          value: "info"  # Default, overridden in dev for verbosity
        - name: DB_POOL_SIZE
          value: "10"    # Base value
        resources:
          requests:
            memory: "256Mi"  # Minimum viable
            cpu: "100m"
```

**Explanation:**
The base configuration defines defaults that are sufficient to run the application. A single replica suits local development, and `LOG_LEVEL: info` gives a standard level of log detail. The values are deliberately conservative; production overlays raise the resource allocations and replica count.

### Kustomize Environment Overlays

Kustomize provides a declarative approach to environment-specific customization without template duplication:

```yaml
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: production

resources:
  - ../../base
  - hpa.yaml  # New objects are listed as resources, not patches
  - pdb.yaml

namePrefix: prod-

commonLabels:
  environment: production
  tier: critical

commonAnnotations:
  cost-center: "platform-engineering"
  compliance: "pci-dss"

patches:
  - path: deployment-patch.yaml

configMapGenerator:
  - name: payment-config
    behavior: merge
    literals:
      - LOG_LEVEL=warn
      - DB_POOL_SIZE=50
      - PAYMENT_GATEWAY_MODE=live
      - RATE_LIMIT_ENABLED=true

replicas:
  - name: payment-service
    count: 5
```

**Explanation:**
This Kustomize overlay transforms the base configuration for production:
- **namespace**: Isolates resources in the `production` namespace
- **namePrefix**: Prefixes all resource names with `prod-` (e.g., `prod-payment-service`)
- **commonLabels**: Adds labels for monitoring and cost allocation
- **resources**: Pulls in the base and adds the HorizontalPodAutoscaler (HPA) and PodDisruptionBudget (PDB), which are new objects rather than modifications of existing ones
- **patches**: Applies a strategic merge patch that raises the Deployment's resource requests and limits
- **configMapGenerator**: Merges with the base ConfigMap, overriding `LOG_LEVEL` to `warn` (reducing noise) and raising `DB_POOL_SIZE` to `50` for production load. These values reach the container only if the Deployment reads them from the ConfigMap (for example through `envFrom`); literal `env` values in the Deployment are not affected
- **replicas**: Overrides the base `replicas: 1` with `5` for high availability

**Production Deployment Patch:**
```yaml
# overlays/production/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  template:
    spec:
      containers:
      - name: payment
        resources:
          requests:
            memory: "1Gi"      # 4x base
            cpu: "500m"        # 5x base
          limits:
            memory: "2Gi"
            cpu: "2000m"
        env:
        - name: JAVA_OPTS
          value: "-XX:+UseG1GC -XX:MaxRAMPercentage=75.0 -XX:InitialRAMPercentage=50.0"
        - name: SPRING_PROFILES_ACTIVE
          value: "production,observability"
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
          initialDelaySeconds: 60  # Longer for production startup
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 5
```

**Explanation:**
The patch modifies specific fields of the base Deployment:
- **Resources**: Increases memory and CPU allocations for production workload
- **JAVA_OPTS**: Tunes JVM garbage collection (G1GC) and heap sizing for containerized environments
- **SPRING_PROFILES_ACTIVE**: Activates production-specific Spring profiles
- **Probes**: Sets initial delays long enough to cover production database connection pool initialization and JVM warm-up
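
How a patch like this combines with the base can be illustrated with a small Python sketch. It is a simplification under stated assumptions: maps are merged recursively, and lists whose items all carry a `name` field (such as `containers` and `env`) are merged by that name. The real strategic merge logic takes its merge keys from the Kubernetes API schema, so the helper names `strategic_merge` and `merge_named_list` are illustrative only.

```python
from copy import deepcopy


def merge_named_list(base, patch, key="name"):
    """Merge two lists of dicts, matching entries on `key`."""
    merged = [deepcopy(item) for item in base]
    index = {item[key]: i for i, item in enumerate(merged)}
    for item in patch:
        if item[key] in index:
            pos = index[item[key]]
            merged[pos] = strategic_merge(merged[pos], item)
        else:
            merged.append(deepcopy(item))
    return merged


def strategic_merge(base, patch):
    """Recursively merge `patch` into a copy of `base`."""
    result = deepcopy(base)
    for field, value in patch.items():
        current = result.get(field)
        if isinstance(current, dict) and isinstance(value, dict):
            result[field] = strategic_merge(current, value)
        elif (isinstance(current, list) and isinstance(value, list)
              and all(isinstance(x, dict) and "name" in x for x in current + value)):
            result[field] = merge_named_list(current, value)
        else:
            # Scalars and unnamed lists are simply replaced by the patch value.
            result[field] = deepcopy(value)
    return result


base_container = {
    "name": "payment",
    "image": "payment-service:latest",
    "env": [{"name": "LOG_LEVEL", "value": "info"},
            {"name": "DB_POOL_SIZE", "value": "10"}],
    "resources": {"requests": {"memory": "256Mi", "cpu": "100m"}},
}
patch_container = {
    "name": "payment",
    "env": [{"name": "SPRING_PROFILES_ACTIVE", "value": "production,observability"}],
    "resources": {"requests": {"memory": "1Gi", "cpu": "500m"},
                  "limits": {"memory": "2Gi", "cpu": "2000m"}},
}

merged = strategic_merge({"containers": [base_container]},
                         {"containers": [patch_container]})["containers"][0]
env_names = [entry["name"] for entry in merged["env"]]
```

The patched container keeps the base `image` and base environment variables, gains the new variable, and takes its resource requests from the patch.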

### Helm Values Hierarchy

For Helm-based deployments, environment-specific values files layer configurations:

```yaml
# values.yaml (Base defaults)
replicaCount: 1

image:
  repository: company/payment-service
  tag: latest
  pullPolicy: IfNotPresent

resources:
  requests:
    memory: 256Mi
    cpu: 100m

config:
  logLevel: debug
  dbPoolSize: 5
  paymentGateway:
    mode: sandbox
    timeout: 30s

ingress:
  enabled: false

serviceMonitor:
  enabled: false
```

**Production Values:**
```yaml
# values-production.yaml
replicaCount: 5

resources:
  requests:
    memory: 1Gi
    cpu: 500m
  limits:
    memory: 2Gi
    cpu: 2000m

config:
  logLevel: warn
  dbPoolSize: 50
  paymentGateway:
    mode: live
    timeout: 10s  # Stricter timeouts in production

ingress:
  enabled: true
  hosts:
    - host: payments.company.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: payments-tls
      hosts:
        - payments.company.com

serviceMonitor:
  enabled: true
  interval: 15s

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

podDisruptionBudget:
  enabled: true
  minAvailable: 3
```

**Deployment Command:**
```bash
# Development
helm install payment-service ./chart -f values.yaml -f values-dev.yaml

# Production
helm install payment-service ./chart \
  -f values.yaml \
  -f values-production.yaml \
  --set image.tag=v2.1.0  # Immutable tag from CI
```

**Explanation:**
Helm merges values files left to right, with later files overriding earlier ones, and `--set` flags take precedence over every file. Nested maps are merged key by key, while lists are replaced as a whole. The production values enable ingress with TLS termination, turn on the Prometheus ServiceMonitor for metrics scraping, configure the HPA for autoscaling, and set a PodDisruptionBudget so that a minimum number of pods stays available during cluster maintenance.
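
The precedence rules can be sketched in a few lines of Python. This is an illustration of the layering behaviour, not Helm's implementation: `merge_values` and `set_path` are hypothetical helpers, and `set_path` handles only plain dotted keys (no list indexes or escaping).

```python
from copy import deepcopy


def merge_values(base, override):
    """Later values win; nested maps merge key by key, everything else is replaced."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_values(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def set_path(values, dotted_key, value):
    """Apply a simple `--set a.b=c` style override on top of merged values."""
    result = deepcopy(values)
    node = result
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return result


defaults = {
    "replicaCount": 1,
    "image": {"repository": "company/payment-service", "tag": "latest"},
    "config": {"logLevel": "debug", "dbPoolSize": 5,
               "paymentGateway": {"mode": "sandbox", "timeout": "30s"}},
}
production = {
    "replicaCount": 5,
    "config": {"logLevel": "warn", "dbPoolSize": 50,
               "paymentGateway": {"mode": "live", "timeout": "10s"}},
}

# Equivalent in spirit to: -f values.yaml -f values-production.yaml --set image.tag=v2.1.0
final = set_path(merge_values(defaults, production), "image.tag", "v2.1.0")
```

Keys that the production file does not mention, such as `image.repository`, survive from the defaults, which is why the production file can stay short.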

## 39.2 Configuration Management

### ConfigMaps for Non-Sensitive Data

ConfigMaps store environment-specific configuration without secrets:

```yaml
# base/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-config
data:
  application.yaml: |
    server:
      port: 8080
      tomcat:
        threads:
          max: 200
    
    spring:
      datasource:
        url: jdbc:postgresql://localhost:5432/payments
        driver-class-name: org.postgresql.Driver
      
      jpa:
        hibernate:
          ddl-auto: validate  # Never auto-create in production
        properties:
          hibernate:
            dialect: org.hibernate.dialect.PostgreSQLDialect
    
    payment:
      gateway:
        connect-timeout: 5s
        read-timeout: 10s
      retry:
        max-attempts: 3
        backoff-delay: 1s
```

**Environment-Specific Override:**
```yaml
# overlays/production/configmap-patch.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-config
data:
  application.yaml: |
    server:
      port: 8080
      tomcat:
        threads:
          max: 500  # Increased for production load
    
    spring:
      datasource:
        url: jdbc:postgresql://prod-db-cluster.company.com:5432/payments
        hikari:
          maximum-pool-size: 50
          minimum-idle: 10
          connection-timeout: 30000
          idle-timeout: 600000
          max-lifetime: 1800000
    
    payment:
      gateway:
        url: https://api.stripe.com/v1
        connect-timeout: 3s
        read-timeout: 5s
      retry:
        max-attempts: 5
        backoff-delay: 2s
      features:
        new-checkout-flow: true
        fraud-detection-v2: true
```

**Explanation:**
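The production ConfigMap override specifies:
- **Database URL**: Points to the production cluster rather than localhost
- **Connection Pool**: HikariCP configuration tuned for production (50 max connections, 30s connection timeout)
- **Payment Gateway**: Live Stripe API endpoint with stricter timeouts
- **Feature Flags**: Enables new features in production only after they have been validated in staging

Because `application.yaml` is a single string value, the override replaces the whole file rather than merging it key by key; base settings such as `jpa.hibernate.ddl-auto` that are still needed must be repeated in the production version.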

### External Configuration (Spring Cloud Config)

For dynamic configuration without redeployment:

```yaml
# bootstrap.yaml (in application)
spring:
  application:
    name: payment-service
  cloud:
    config:
      uri: http://config-server:8888
      fail-fast: true
      retry:
        initial-interval: 1000
        max-attempts: 6
      profile: ${SPRING_PROFILES_ACTIVE:default}
```

**Config Server Repository Structure:**
```
config-repo/
├── payment-service.yaml          # Default profile
├── payment-service-dev.yaml      # Development overrides
├── payment-service-staging.yaml  # Staging overrides
└── payment-service-prod.yaml     # Production overrides
```

**Environment-Specific Config:**
```yaml
# payment-service-prod.yaml
server:
  tomcat:
    threads:
      max: 500  # Same property path as in the ConfigMap above

payment:
  gateway:
    stripe:
      api-key: ${STRIPE_API_KEY}  # Referenced from secret, not hardcoded
      webhook-secret: ${STRIPE_WEBHOOK_SECRET}
  
  limits:
    max-transaction-amount: 10000
    daily-volume-limit: 1000000
  
  features:
    enable-3ds: true
    enable-instant-payouts: false
```

**Explanation:**
Spring Cloud Config externalizes configuration to a backing store such as a Git repository or Vault. Applications fetch their configuration from the Config Server at startup. Later changes can be picked up without restarting the container, but only by beans that support refresh (for example `@RefreshScope` beans) and only after a refresh is triggered; other properties still need a restart. The `${}` syntax references environment variables injected via Kubernetes Secrets, keeping sensitive data out of Git.
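
The placeholder behaviour can be illustrated with a short Python sketch. It mimics the `${NAME}` and `${NAME:default}` forms used above and is not Spring's resolver: nested placeholders are not handled, and the `resolve` helper and the sample key value are hypothetical.

```python
import re

# ${NAME} or ${NAME:default}; the default may be empty.
PLACEHOLDER = re.compile(r"\$\{([^}:]+)(?::([^}]*))?\}")


def resolve(text, env):
    """Replace placeholders with values from `env`, falling back to inline defaults."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        # Fail loudly rather than start with a missing credential.
        raise KeyError(f"unresolved placeholder: {name}")
    return PLACEHOLDER.sub(substitute, text)


profile = resolve("${SPRING_PROFILES_ACTIVE:default}", {})
api_key = resolve("${STRIPE_API_KEY}", {"STRIPE_API_KEY": "sk_test_example"})
```

Failing on an unresolved placeholder without a default is the useful property here: a pod missing its Secret stops at startup instead of running with an empty API key.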

## 39.3 Secrets Management

Secrets require special handling to prevent exposure in Git repositories and ensure environment isolation.

### Sealed Secrets (Bitnami)

Sealed Secrets encrypt secrets for safe storage in Git:

```bash
# Install kubeseal CLI
brew install kubeseal

# Create a secret locally (the namespace is part of what gets sealed)
kubectl create secret generic db-credentials \
  --namespace production \
  --from-literal=username=payment_user \
  --from-literal=password='SuperSecret123!' \
  --dry-run=client -o yaml > secret.yaml

# Seal the secret for the production cluster
kubeseal --controller-namespace=sealed-secrets \
         --controller-name=sealed-secrets \
         --format yaml < secret.yaml > sealed-secret-prod.yaml

# sealed-secret-prod.yaml can now be committed to Git
```

**Generated Sealed Secret:**
```yaml
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  encryptedData:
    password: AgByBBg0f5QGx8J7...
    username: AgAr5t7G8K9m3Pq...
  template:
    type: Opaque
    metadata:
      name: db-credentials
      namespace: production
      labels:
        environment: production
```

**Explanation:**
`kubeseal` uses asymmetric cryptography. The Sealed Secrets controller running in the cluster holds the private key. The CLI encrypts the secret using the cluster's public key (fetched from the controller). Only the target cluster can decrypt the SealedSecret into a regular Kubernetes Secret. The encrypted data is safe to commit to Git because it can only be decrypted by the production cluster's controller.

### External Secrets Operator (ESO)

ESO synchronizes secrets from external secret managers (AWS Secrets Manager, Azure Key Vault, GCP Secret Manager, HashiCorp Vault):

```yaml
# SecretStore (cluster-scoped or namespace-scoped)
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa
            namespace: security
---
# ExternalSecret (fetches from AWS and creates K8s Secret)
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payment-service-db
  namespace: production
spec:
  refreshInterval: 1h  # Sync frequency
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: db-credentials  # Name of created K8s Secret
    creationPolicy: Owner
    template:
      type: Opaque
      metadata:
        annotations:
          reloader.stakater.com/match: "true"  # Restart Deployments annotated reloader.stakater.com/search: "true"
      data:
        connection-string: "postgresql://{{ .username }}:{{ .password }}@{{ .host }}:5432/payments"
  data:
    - secretKey: username
      remoteRef:
        key: production/payment-service/db
        property: username
    - secretKey: password
      remoteRef:
        key: production/payment-service/db
        property: password
    - secretKey: host
      remoteRef:
        key: production/payment-service/db
        property: host
```

**Explanation:**
- **ClusterSecretStore**: Defines connection to AWS Secrets Manager using IRSA (IAM Roles for Service Accounts) for authentication
- **ExternalSecret**: Specifies which secrets to fetch (`production/payment-service/db`) and how to map them to Kubernetes Secret keys
- **Template**: Constructs a PostgreSQL connection string from individual secret components
- **RefreshInterval**: ESO polls AWS every hour for changes, updating the Kubernetes Secret automatically
- **Reloader Annotation**: Stakater Reloader watches the Secret and triggers a rolling restart of Deployments that opt in through Reloader's annotations, so rotated credentials are picked up without manual intervention
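
One detail worth checking when a connection string is assembled from separate fields is how special characters in the credentials are handled. The template above concatenates the values as they are, so a password containing `@` or `/` would yield an ambiguous URI. The Python sketch below shows the percent-encoding that avoids this; the `connection_string` helper and the host `db.internal` are hypothetical, and how to apply the same encoding inside an ESO template is left open here.

```python
from urllib.parse import quote


def connection_string(username, password, host, port=5432, database="payments"):
    """Build a PostgreSQL URI with percent-encoded credentials."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"postgresql://{user}:{secret}@{host}:{port}/{database}"


plain = connection_string("payment_user", "SuperSecret123!", "db.internal")
tricky = connection_string("payment_user", "p@ss/word", "db.internal")
```

With encoding in place, the only unescaped `@` in the result is the one separating credentials from the host.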

### Environment-Specific Secret Rotation

Different environments have different rotation schedules:

```yaml
# Development: Short-lived, auto-rotated
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: dev-db-credentials
  namespace: development
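spec:
  refreshInterval: 5m  # Re-sync every 5 minutes to follow short-lived credentials
  secretStoreRef:      # Required; this example reuses the store defined earlier
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: db-credentials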
  data:
    - secretKey: password
      remoteRef:
        key: dev/dynamic-credentials/payment-db
        property: password

---
# Production: Static credentials with manual rotation
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: prod-db-credentials
  namespace: production
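spec:
  refreshInterval: "0"  # Fetch once; no periodic sync (quoted so it parses as a duration string)
  secretStoreRef:       # Required; this example reuses the store defined earlier
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: db-credentials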
  data:
    - secretKey: password
      remoteRef:
        key: prod/static-credentials/payment-db
        property: password
```

**Explanation:**
Development uses dynamic credentials (e.g., Vault database secrets) that are short-lived; ESO re-reads them every 5 minutes so the Kubernetes Secret follows each rotation. Production uses static credentials that change only during scheduled maintenance windows: a `refreshInterval` of zero tells ESO to fetch the value once and not poll again.

## 39.4 Environment Promotion

Environment promotion is the process of moving artifacts through environments while maintaining immutability and audit trails.

### GitOps Promotion Pattern

In GitOps, promotion is a Git operation—merging or copying configurations between branches or directories:

```
# Directory-based promotion (Kustomize/Flux)
repo/
├── apps/
│   ├── payment-service/
│   │   ├── base/
│   │   └── overlays/
│   │       ├── dev/
│   │       ├── staging/
│   │       └── prod/     # Promotion target
```

**Promotion via Pull Request:**
```yaml
# .github/workflows/promote-to-prod.yml
name: Promote to Production
on:
  push:
    branches:
      - main
    paths:
      - 'apps/**/overlays/staging/**'

jobs:
  promote:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
    env:
      GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}  # Read by the gh CLI
    steps:
      - uses: actions/checkout@v4
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
          fetch-depth: 2  # HEAD~1 must exist for the diff below

      - name: Detect Changed Services
        id: changes
        run: |
          # Services whose staging overlay changed, as one space-separated line
          changed_services=$(git diff --name-only HEAD~1 HEAD -- 'apps/*/overlays/staging/*' | \
            grep -oP '^apps/\K[^/]+' | sort -u | tr '\n' ' ')
          echo "services=$changed_services" >> "$GITHUB_OUTPUT"

      - name: Create Promotion PR
        env:
          SERVICES: ${{ steps.changes.outputs.services }}
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git checkout -b "promote-to-prod-$(date +%s)"

          for service in $SERVICES; do
            # Carry over only the staging-tested image tag; the prod overlay
            # keeps its own namespace, patches, and replica settings
            staging_tag=$(yq '.images[0].newTag' \
              "apps/$service/overlays/staging/kustomization.yaml")

            yq -i ".images[0].newTag = \"$staging_tag\"" \
              "apps/$service/overlays/prod/kustomization.yaml"
          done

          git add apps
          git commit -m "chore: Promote tested versions to production"
          git push origin HEAD

          gh pr create \
            --title "Production Promotion: $SERVICES" \
            --body "Promoting tested staging versions to production" \
            --base main \
            --reviewer platform-team
```

**Explanation:**
This workflow detects which services changed in staging, writes the staging-tested image tag into each service's production overlay, and opens a pull request. Only the tag is carried over, so production-specific settings in the overlay stay intact and production runs the same image tag that was validated in staging. The PR requires human approval; once it is merged, the GitOps controller applies the change to the production cluster.
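
The service-detection step is easy to get subtly wrong, so it helps to isolate its logic. The Python sketch below takes a list of changed file paths (what `git diff --name-only` would print) and returns the services whose staging overlay was touched. The `changed_services` helper and the sample service names other than `payment-service` are hypothetical.

```python
import re

# Matches apps/<service>/overlays/staging/... and captures <service>.
STAGING_PATH = re.compile(r"^apps/([^/]+)/overlays/staging/")


def changed_services(paths):
    """Return the sorted, de-duplicated services with staging overlay changes."""
    services = set()
    for path in paths:
        match = STAGING_PATH.match(path)
        if match:
            services.add(match.group(1))
    return sorted(services)


diff_output = [
    "apps/payment-service/overlays/staging/kustomization.yaml",
    "apps/payment-service/overlays/staging/deployment-patch.yaml",
    "apps/ledger/overlays/prod/kustomization.yaml",   # prod change: not a promotion candidate
    "apps/auth/overlays/staging/kustomization.yaml",
    "README.md",
]
services = changed_services(diff_output)
```

Filtering on the staging overlay path matters: a diff that also touches `base/` or `prod/` files should not by itself mark a service for promotion.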

### Helm Chart Promotion

For Helm-based workflows, promotion updates version references:

```yaml
# environments/values-staging.yaml
payment-service:
  image:
    tag: "v2.1.0-rc.3"  # Release candidate
  
  config:
    paymentGateway:
      mode: "sandbox"
      testCardNumbers: true

---
# environments/values-production.yaml (after promotion)
payment-service:
  image:
    tag: "v2.1.0"  # Same artifact, promoted after staging validation
  
  config:
    paymentGateway:
      mode: "live"
      testCardNumbers: false
```

**Promotion Script:**
```bash
#!/bin/bash
# promote.sh - Promotes validated staging version to production
set -euo pipefail

SERVICE=${1:?usage: promote.sh <service>}
STAGING_VALUES="environments/values-staging.yaml"
PROD_VALUES="environments/values-production.yaml"
REPO="registry.company.com/${SERVICE}"

# Extract image tag from staging (bracket syntax copes with hyphenated keys)
STAGING_TAG=$(yq ".[\"${SERVICE}\"].image.tag" "$STAGING_VALUES")

# Verify tag exists in registry
if ! crane manifest "${REPO}:${STAGING_TAG}" > /dev/null 2>&1; then
  echo "Error: Image ${SERVICE}:${STAGING_TAG} not found in registry" >&2
  exit 1
fi

# Derive the production tag by removing the -rc suffix if present
PROD_TAG=${STAGING_TAG%%-rc.*}

# Point the production tag at the same manifest; nothing is rebuilt
crane tag "${REPO}:${STAGING_TAG}" "${PROD_TAG}"

# Update production values
yq -i ".[\"${SERVICE}\"].image.tag = \"${PROD_TAG}\"" "$PROD_VALUES"

# Create commit
git add "$PROD_VALUES"
git commit -m "promote(${SERVICE}): ${STAGING_TAG} -> production"
git push origin main
```

**Explanation:**
The script extracts the image tag from the staging configuration, verifies that the image exists in the registry (which blocks promotion of locally built images), and derives the production tag by stripping the release-candidate suffix. `crane tag` then attaches the production tag to the same manifest, so `v2.1.0` and `v2.1.0-rc.3` resolve to one digest. The binary tested in staging is therefore the binary that runs in production; only configuration differs (sandbox vs. live payment gateway).
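
The tag handling deserves a stricter check than a blind suffix strip. As a sketch, assuming tags follow the `vMAJOR.MINOR.PATCH` form with an optional `-rc.N` suffix as in the examples above, the rule can be written so that anything else is rejected. The `production_tag` helper is hypothetical.

```python
import re

# vX.Y.Z with an optional release-candidate suffix such as -rc.3
RELEASE_TAG = re.compile(r"^(v\d+\.\d+\.\d+)(?:-rc\.\d+)?$")


def production_tag(staging_tag):
    """Return the final release tag for a staging tag, rejecting unexpected formats."""
    match = RELEASE_TAG.match(staging_tag)
    if match is None:
        raise ValueError(f"refusing to promote unrecognized tag: {staging_tag!r}")
    return match.group(1)


promoted = production_tag("v2.1.0-rc.3")
unchanged = production_tag("v2.1.0")
```

Rejecting tags such as `latest` at this point keeps mutable tags out of production values files.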

## 39.5 Database Migrations

Database schema changes require careful coordination across environments to prevent data loss and downtime.

### Migration Strategy per Environment

**Development:**
```yaml
# Job for dev environment - automatic, destructive allowed
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate-dev
  namespace: development
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: migrate
        image: payment-service:latest
        command: ["flyway", "migrate"]  # Same tool as production, looser settings
        env:
        - name: FLYWAY_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
        - name: FLYWAY_OUT_OF_ORDER
          value: "true"   # Tolerate migrations merged out of sequence
        - name: FLYWAY_CLEAN_DISABLED
          value: "false"  # Permit destructive resets in dev only

**Production:**
```yaml
# CronJob for production - runs in the weekly maintenance window, after a backup
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-migrate-prod
  namespace: production
spec:
  schedule: "0 2 * * 0"  # Weekly maintenance window
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600
      template:
        spec:
          restartPolicy: Never
          serviceAccountName: db-migrator
          initContainers:
          # Pre-migration backup
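          - name: backup
            image: postgres:15-alpine
            command:
            - sh
            - -c
            - |
              pg_dump "$DATABASE_URL" \
                --clean --if-exists \
                > /backups/pre-migration-$(date +%s).sql
            env:
            # pg_dump expects a libpq URI; Flyway below expects a JDBC URL,
            # so the two may need separate Secret keys
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
            volumeMounts:
            - name: backup-vol
              mountPath: /backups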
          containers:
          - name: migrate
            image: payment-service:v2.1.0  # Specific version
            command: ["flyway", "migrate"]
            env:
            - name: FLYWAY_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
            - name: FLYWAY_BASELINE_ON_MIGRATE
              value: "false"
            - name: FLYWAY_VALIDATE_ON_MIGRATE
              value: "true"
            - name: FLYWAY_OUT_OF_ORDER
              value: "false"  # Strict ordering in prod
            resources:
              limits:
                memory: "1Gi"
                cpu: "1000m"
          volumes:
          - name: backup-vol
            persistentVolumeClaim:
              claimName: db-backups
```

**Explanation:**
The production migration CronJob includes:
- **Init Container**: Creates a full database backup before migration using `pg_dump`
- **Flyway Configuration**: Validates checksums of applied migrations (`validateOnMigrate`), prevents out-of-order execution
- **Resource Limits**: Restricts migration job to prevent resource starvation
- **Schedule**: Runs during maintenance windows (2 AM Sunday) rather than on every deploy
- **Specific Image**: Uses explicit version tag, not `latest`, ensuring repeatable migrations
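
The idea behind checksum validation can be sketched in Python. This is an illustration of the principle, not Flyway's algorithm or schema history format: the `checksum` and `validate` helpers are hypothetical, and CRC32 over the script text stands in for whatever Flyway computes internally.

```python
import zlib


def checksum(sql_text):
    """Checksum of a migration script's text (CRC32 used for illustration)."""
    return zlib.crc32(sql_text.encode("utf-8"))


def validate(applied, scripts):
    """Compare checksums recorded at apply time with the scripts shipped in the image.

    applied: {version: checksum recorded when the migration ran}
    scripts: {version: current SQL text}
    Returns a list of human-readable problems; empty means validation passed.
    """
    problems = []
    for version, recorded in sorted(applied.items()):
        if version not in scripts:
            problems.append(f"{version}: applied migration missing from image")
        elif checksum(scripts[version]) != recorded:
            problems.append(f"{version}: checksum mismatch (script edited after apply)")
    return problems


original = "ALTER TABLE users ADD COLUMN preferences JSONB;"
history = {"2.1.0": checksum(original)}

clean = validate(history, {"2.1.0": original})
edited = validate(history, {"2.1.0": original + " -- tweaked later"})
```

A non-empty result is the signal to stop before running anything: an edited script means the production schema may no longer match what the repository describes.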

### Migration Rollback Procedure

When migrations fail in production:

```yaml
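# rollback-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-rollback
  namespace: production
spec:
  backoffLimit: 0  # Do not retry an undo automatically
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: rollback
        image: payment-service:v2.1.0  # Release that ships the undo (U) scripts
        command: ["flyway", "undo"]  # Requires a Flyway edition with undo support
        env:
        - name: FLYWAY_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
        - name: FLYWAY_TARGET
          value: "2"  # Bounds how far back the undo goes
```

**Explanation:**
The rollback Job runs `flyway undo`, which executes undo migrations (`U`-prefixed scripts) paired with the versioned migrations being reversed. Those scripts ship with the release that introduced the migration, so the Job uses that release's image rather than the previous one; `undo` is also not part of Flyway's free Community edition. Undo reverses migrations that were applied successfully. A migration that fails midway on PostgreSQL is normally rolled back by its own transaction, and where that does not apply, restoring the pre-migration backup is the fallback. `FLYWAY_TARGET` bounds how far back the undo goes, so rerunning the Job once that point is reached leaves nothing to undo.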

## 39.6 Drift Detection

Drift occurs when the actual cluster state diverges from the Git-defined desired state, typically through manual `kubectl` interventions or failed automated processes.

### ArgoCD Drift Detection

ArgoCD continuously compares live state with Git:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
spec:
  syncPolicy:
    automated:
      selfHeal: true  # Automatically correct drift
      prune: true     # Remove resources not in Git
    syncOptions:
      - CreateNamespace=true
      - RespectIgnoreDifferences=true  # Leave ignored fields alone during sync
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas  # Ignore HPA-driven scaling
```

**Drift Notification (Flux):**
```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta3
kind: Alert
metadata:
  name: drift-detection
spec:
  summary: "Configuration Drift Detected"
  providerRef:
    name: slack
  eventSeverity: info
  eventSources:
    - kind: Kustomization
      name: '*'
  inclusionList:
    - ".*DriftDetected.*"
```

### Manual Drift Audit

Periodic auditing for unauthorized changes:

```bash
#!/bin/bash
# drift-audit.sh
set -euo pipefail

NAMESPACE=production
OVERLAY=k8s/overlays/production
DRIFT_LOG="drift-$(date +%Y%m%d).log"
: > "$DRIFT_LOG"

# Render the desired state from Git once: "<deployment> <image>" per line
desired=$(kubectl kustomize "$OVERLAY" | \
  yq 'select(.kind == "Deployment") | .metadata.name + " " + .spec.template.spec.containers[0].image')

# Compare each live deployment's image with the rendered definition
kubectl get deployments -n "$NAMESPACE" -o json \
  | jq -r '.items[] | "\(.metadata.name) \(.spec.template.spec.containers[0].image)"' \
  | while read -r name image; do
      git_image=$(awk -v n="$name" '$1 == n {print $2}' <<< "$desired")
      if [[ "$image" != "$git_image" ]]; then
        echo "DRIFT: $name - Live: $image, Git: ${git_image:-<not in Git>}" >> "$DRIFT_LOG"
      fi
    done

# Alert if drift detected (slack-notify stands in for your alerting command)
if [[ -s "$DRIFT_LOG" ]]; then
  slack-notify "#alerts" < "$DRIFT_LOG"
fi

**Explanation:**
The audit script compares running container images with those defined in Git. If a manual `kubectl set image` command changed a deployment, this detects the mismatch and alerts the team via Slack. This is a safety net for environments not using GitOps with self-healing enabled.
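
The comparison at the heart of the audit is a diff between two name-to-image maps. A minimal Python sketch follows, assuming the desired and live states have already been collected into dictionaries; the `find_drift` helper, the tags, and the `debug-shell` deployment are hypothetical.

```python
def find_drift(desired, live):
    """Return (name, desired_image, live_image) for every deployment that differs.

    A value of None means the deployment exists on only one side.
    """
    drift = []
    for name in sorted(set(desired) | set(live)):
        want, have = desired.get(name), live.get(name)
        if want != have:
            drift.append((name, want, have))
    return drift


desired = {"payment-service": "payment-service:v2.1.0"}
live = {
    "payment-service": "payment-service:v2.1.1-hotfix",  # changed with kubectl set image
    "debug-shell": "busybox:1.36",                       # created by hand, not in Git
}
report = find_drift(desired, live)
```

Looking at the union of both sides catches resources created by hand as well as images changed in place, which a loop over live deployments alone would only partly cover.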

## 39.7 Hotfixes and Rollbacks

Production incidents require rapid response procedures that bypass normal CI/CD pipelines while maintaining audit trails.

### Hotfix Branch Strategy

```bash
# Emergency hotfix procedure
git checkout -b hotfix/payment-critical-bug v2.1.0  # Branch from production tag

# Fix bug
vim src/payment/processor.java
git commit -am "fix(payment): Null pointer in transaction processing"

# Build and push emergency image (bypassing full CI for speed)
docker build -t payment-service:v2.1.1-hotfix .
docker push payment-service:v2.1.1-hotfix

# Emergency override: change only the image on the live Deployment.
# If the GitOps controller self-heals, pause auto-sync first or it will revert this.
kubectl set image deployment/payment-service \
  payment=payment-service:v2.1.1-hotfix -n production
kubectl rollout status deployment/payment-service -n production

# After incident resolved, merge properly
git checkout main
git merge hotfix/payment-critical-bug
git tag v2.1.1
git push origin main v2.1.1

**Explanation:**
The hotfix branches from the production tag (v2.1.0), not from main (which may contain untested features). The fix is built locally and deployed via imperative kubectl (emergency override), then properly merged back to main and tagged after the incident to ensure the fix persists in the mainline and receives full CI validation retroactively.

### Automated Rollback

```yaml
# Argo Rollout with automatic rollback
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 40
        - pause: {duration: 10m}
        - setWeight: 60
        - pause: {duration: 10m}
        - setWeight: 80
        - pause: {duration: 10m}
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: payment-service
  template:
    spec:
      containers:
      - name: payment
        image: payment-service:v2.1.0
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service=~"{{args.service-name}}",status=~"2.."}[1m]))
            /
            sum(rate(http_requests_total{service=~"{{args.service-name}}"}[1m]))
```

**Explanation:**
Argo Rollouts performs a canary deployment: it shifts 20% of traffic to the new version, waits 10 minutes, and continues in 20% steps while the analysis checks Prometheus for a success rate of at least 99%. If the success rate drops below 99%, the rollout aborts and traffic returns to the previous stable version without human intervention. This prevents a faulty release from reaching all users.
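
The decision the analysis makes can be sketched in Python. This mirrors the success condition from the template above (a ratio of 2xx requests to all requests, compared against 0.99) and assumes no failed measurement is tolerated; the `success_rate` and `evaluate` helpers are hypothetical and do not model inconclusive or errored measurements.

```python
def success_rate(ok_requests, total_requests):
    """Fraction of successful requests in one measurement window."""
    if total_requests == 0:
        raise ValueError("no traffic in window; the measurement is inconclusive")
    return ok_requests / total_requests


def evaluate(measurements, threshold=0.99, failure_limit=0):
    """Return 'promote' if failed measurements stay within the limit, else 'rollback'."""
    failures = sum(1 for rate in measurements if rate < threshold)
    return "promote" if failures <= failure_limit else "rollback"


healthy = [success_rate(ok, total) for ok, total in
           [(995, 1000), (1998, 2000), (990, 1000), (4990, 5000), (999, 1000)]]
degraded = healthy[:2] + [success_rate(960, 1000)] + healthy[3:]

healthy_verdict = evaluate(healthy)
degraded_verdict = evaluate(degraded)
```

A single window at 96% is enough to abort under these settings, which is the conservative choice for a payment path; raising `failure_limit` trades that strictness for tolerance of brief blips.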

### Database Rollback Considerations

Rolling back application code without rolling back database migrations causes schema-version mismatch:

```bash
# Safe rollback procedure
# 1. Roll back the application
kubectl rollout undo deployment/payment-service
# 2. Check if DB migrations are backward compatible
#    - If yes: No action needed
#    - If no: Restore from pre-migration backup
# 3. Verify application starts with current DB schema

**Backward Compatible Migration Example:**
```sql
-- V2.1.0__add_user_preferences.sql (Expand)
ALTER TABLE users ADD COLUMN preferences JSONB;
-- Old code ignores new column (safe)

-- V2.2.0__migrate_preferences.sql (Contract - weeks later)
UPDATE users SET preferences = migrate_legacy_format(old_column);
ALTER TABLE users DROP COLUMN old_column;
-- Only drop after all apps updated
```

**Explanation:**
The expand-contract pattern ensures migrations are backward compatible. New columns are added (expand) but old columns remain until all services are updated (contract). This allows application rollback without database restoration, as the previous app version simply ignores the new columns.
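
The compatibility claim can be demonstrated with an in-memory SQLite database. This is a sketch under stated assumptions: SQLite stands in for PostgreSQL, `TEXT` stands in for `JSONB`, the contract step rebuilds the table instead of using `DROP COLUMN` so that it runs on older SQLite versions, and `old_release_read` is a hypothetical stand-in for a query issued by the previous application version.

```python
import sqlite3


def old_release_read(conn):
    # The previous release selects only the columns it knows about.
    return conn.execute("SELECT id, old_column FROM users").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, old_column TEXT)")
conn.execute("INSERT INTO users (old_column) VALUES ('theme=dark')")
before = old_release_read(conn)

# Expand: an additive change that the old release never notices.
conn.execute("ALTER TABLE users ADD COLUMN preferences TEXT")
after_expand = old_release_read(conn)

# Contract: copy the data over, then rebuild the table without old_column.
conn.execute("UPDATE users SET preferences = old_column")
conn.execute("CREATE TABLE users_new AS SELECT id, preferences FROM users")
conn.execute("DROP TABLE users")
conn.execute("ALTER TABLE users_new RENAME TO users")

try:
    old_release_read(conn)
    old_release_survives_contract = True
except sqlite3.OperationalError:
    old_release_survives_contract = False

print(before == after_expand, old_release_survives_contract)
```

The old query returns the same rows before and after the expand step and fails only after the contract step, which is why the contract migration has to wait until no running release still reads the old column.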

---

## Chapter Summary and Preview

This chapter established comprehensive patterns for managing microservices deployments across multiple environments while maintaining security boundaries and operational reliability. We examined configuration management strategies using Kustomize overlays and Helm values hierarchies, enabling the same container image to execute across development, staging, and production with environment-specific tuning for resource allocation, logging verbosity, and feature enablement. The critical distinction between configuration data (ConfigMaps) and sensitive credentials (Secrets) was addressed through Sealed Secrets for Git-safe encryption and External Secrets Operator for dynamic integration with cloud secret managers.

Environment promotion strategies in GitOps workflows treat progression as Git operations—merging tested configurations from staging directories to production—ensuring immutable artifact promotion and comprehensive audit trails through pull request history. Database migration strategies emphasized the expand-contract pattern for zero-downtime schema evolution, with environment-specific automation levels ranging from automatic migrations in development to scheduled maintenance windows with pre-migration backups in production.

Drift detection mechanisms using ArgoCD self-healing or periodic auditing scripts ensure manual cluster interventions are detected and corrected, maintaining the declarative state defined in version control. Emergency procedures for hotfixes and automated rollbacks using canary analysis provide safety nets for production incidents while preserving the integrity of the mainline codebase.

**Key Takeaways:**
- Maintain artifact immutability across environments—promote the exact image digest tested in staging to production, varying only external configuration via ConfigMaps and Secrets.
- Use Kustomize overlays or Helm values files to manage environment-specific configurations without duplicating base manifests, ensuring changes propagate consistently across environments.
- Implement the expand-contract pattern for database migrations: add new structures in one release (expand), migrate data, then remove old structures in subsequent releases (contract) to enable safe application rollbacks.
- Store secrets in external secret managers (AWS Secrets Manager, Vault) synchronized to Kubernetes via External Secrets Operator, so rotated values reach the cluster without manual redeployment.
- Enable automated drift detection and self-healing in production to prevent configuration rot, but implement ignore rules for fields managed by other controllers (HPA replica counts).

**Next Chapter Preview:**
Chapter 40: Progressive Delivery explores advanced deployment strategies that minimize risk when releasing new versions to production. We will examine canary deployments that shift traffic gradually while analyzing metrics, blue-green deployments that enable instant cutover and rollback, feature flags that decouple deployment from release, and A/B testing frameworks for data-driven feature validation. The chapter covers service mesh integration (Istio, Linkerd) for traffic management, automated analysis and promotion pipelines, and strategies for handling long-running transactions during gradual rollouts, building upon the multi-environment foundations to achieve zero-downtime deployments with measurable safety guarantees.