# **Chapter 21: Deployment & Infrastructure**

Building software is only half the battle; delivering it safely to production is where systems live or die. This chapter covers the engineering practices that separate artisanal deployments from industrial-grade release automation. We progress from continuous integration pipelines to advanced deployment patterns, infrastructure as code, and the multi-region strategies that power global applications.

---

## **21.1 CI/CD Pipelines: The Delivery Highway**

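Continuous Integration (CI) and Continuous Deployment (CD) form the backbone of modern software delivery. CI ensures code changes integrate safely; CD automates the path to production. When a human approval gate remains before the final release, as in the production job of the workflow in this section, the practice is usually called continuous delivery.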

### **The Pipeline Stages**

A production-grade pipeline moves through distinct quality gates:

```
Developer Push
      │
      ▼
┌─────────────┐
│    Build    │ ◄── Compile, package, containerize
│   (2 min)   │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    Test     │ ◄── Unit tests, integration tests, linting
│   (5 min)   │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Security  │ ◄── SAST (code scanning), dependency checks
│   (3 min)   │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│    Push     │ ◄── Upload artifact to registry (ECR, GCR, Docker Hub)
│   (1 min)   │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Deploy    │ ◄── Rolling update, Blue/Green, or Canary
│   (5 min)   │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Verify    │ ◄── Smoke tests, health checks, synthetic monitoring
│   (2 min)   │
└─────────────┘
```

**Total Lead Time**: 18 minutes from commit to production (the "elite" DevOps metric threshold is under 1 hour).
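
The lead-time figure is simply the sum of the stage durations in the diagram. A minimal sketch of that arithmetic, using the stage names and minutes shown above and assuming the stages run strictly one after another (`ELITE_THRESHOLD_MIN` is an illustrative name for the one-hour threshold):

```python
# Stage durations in minutes, copied from the pipeline diagram.
STAGES = {
    "build": 2,
    "test": 5,
    "security": 3,
    "push": 1,
    "deploy": 5,
    "verify": 2,
}

ELITE_THRESHOLD_MIN = 60  # the "under 1 hour" threshold quoted in the text


def lead_time_minutes(stages):
    """Total commit-to-production time when stages run sequentially."""
    return sum(stages.values())


total = lead_time_minutes(STAGES)
print(f"lead time: {total} min, under threshold: {total < ELITE_THRESHOLD_MIN}")
```

Running independent stages in parallel (for example, tests and security scans) would shorten the total to the length of the longest path through the pipeline.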

### **Pipeline as Code**

Modern CI/CD defines pipelines in version-controlled files rather than GUI configurations.

**GitHub Actions Example**:
```yaml
# .github/workflows/deploy.yml
name: Deploy to Production

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      
      - name: Install dependencies
        run: npm ci
      
      - name: Run unit tests
        run: npm test -- --coverage
      
      - name: Run linter
        run: npm run lint
      
      - name: Build application
        run: npm run build
      
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3

  security-scan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write  # Needed to upload SARIF results
    steps:
      - uses: actions/checkout@v4
      
      - name: Run Trivy vulnerability scanner
        # In real pipelines, pin to a release tag or commit SHA instead of a moving branch
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          ignore-unfixed: true
          format: 'sarif'
          output: 'trivy-results.sarif'
      
      - name: Upload results to GitHub Security
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: 'trivy-results.sarif'

  deploy-staging:
    needs: [build-and-test, security-scan]
    # Deploy only on pushes to main; pull request runs stop after build and scan
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      
      - name: Deploy to EKS
        # Assumes an image tagged with this commit SHA was already pushed to the registry
        run: |
          aws eks update-kubeconfig --name staging-cluster
          kubectl set image deployment/api api=myapp:${{ github.sha }}
          kubectl rollout status deployment/api

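  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production  # Manual approval, if required reviewers are configured for this environment
    steps:
      # Each job runs on a fresh runner, so credentials must be configured again
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Deploy to Production
        run: |
          aws eks update-kubeconfig --name prod-cluster
          kubectl set image deployment/api api=myapp:${{ github.sha }}
          kubectl rollout status deployment/api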
      
      - name: Run smoke tests
        run: |
          curl -f https://api.example.com/health || exit 1
          curl -f https://api.example.com/ready || exit 1
```

**Key Concepts**:
- **Artifacts**: Immutable build outputs (Docker images, JARs) tagged with Git SHA
- **Environments**: Staging → Production progression with protection rules
- **Secrets Management**: Never commit credentials; use built-in secret stores or external vaults (HashiCorp Vault, AWS Secrets Manager)
- **Matrix Builds**: Test against multiple versions (Node 18, 20, 22) or platforms (AMD64, ARM64)
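
A matrix build is a Cartesian product of its axes: every version is tested on every platform. The sketch below illustrates that expansion in plain Python; `expand_matrix` and the platform strings are illustrative names, not part of any CI system's API.

```python
from itertools import product

NODE_VERSIONS = ["18", "20", "22"]
PLATFORMS = ["linux/amd64", "linux/arm64"]


def expand_matrix(**axes):
    """Return one dict per combination of axis values."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


jobs = expand_matrix(node=NODE_VERSIONS, platform=PLATFORMS)
for job in jobs:
    print(job)
print(f"{len(jobs)} jobs in total")
```

The job count grows multiplicatively with each axis, which is worth keeping in mind before adding a third dimension such as operating system.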

---

## **21.2 Deployment Patterns**

How you release code is as important as the code itself. These patterns minimize risk and enable rapid rollback.

### **Pattern 1: Rolling Deployment (Kubernetes Default)**

Gradually replace old instances with new ones.

```
Time: T0          T1          T2          T3
      │           │           │           │
Pod 1: [v1]  →   [v2]        [v2]        [v2]
Pod 2: [v1]  →   [v1]   →   [v2]        [v2]  
Pod 3: [v1]  →   [v1]   →   [v1]   →   [v2]

Traffic: 100% v1 → 66% v1 → 33% v1 → 100% v2
```

**Configuration**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
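spec:
  replicas: 3
  selector:            # Required in apps/v1; must match the template labels
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Max pods above desired (1 extra during update)
      maxUnavailable: 0  # Never drop below 3 pods available
  template:
    metadata:
      labels:
        app: api
    spec: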
      containers:
      - name: api
        image: myapp:v2
        readinessProbe:  # Critical: Don't send traffic until ready
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
```

**Pros**: Simple, no extra infrastructure, zero downtime
**Cons**: Mixed versions coexist (compatibility required), slow rollback (must re-roll)
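
To see how `maxSurge` and `maxUnavailable` bound the rollout, the sketch below steps through a simplified model of the update. It assumes every new pod becomes ready immediately and is not the Deployment controller's actual algorithm; `rolling_update` is an illustrative name.

```python
def rolling_update(replicas, max_surge=1, max_unavailable=0):
    """Yield (old_pods, new_pods) after each step of a simplified rollout."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    yield old, new
    while old > 0 or new < replicas:
        # Start new pods without exceeding replicas + maxSurge in total.
        new += min(replicas + max_surge - (old + new), replicas - new)
        # Retire old pods while keeping at least replicas - maxUnavailable running.
        old -= min(old, (old + new) - (replicas - max_unavailable))
        yield old, new


steps = list(rolling_update(3, max_surge=1, max_unavailable=0))
for old, new in steps:
    share_v1 = 100 * old // (old + new)
    print(f"v1 pods={old} v2 pods={new} share of pods on v1={share_v1}%")
```

With 3 replicas, `maxSurge: 1`, and `maxUnavailable: 0`, the model replaces one pod per step, which matches the timeline in the diagram.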

### **Pattern 2: Blue-Green Deployment**

Run two identical environments, switch traffic instantly.

```
┌─────────────┐         ┌─────────────┐
│   Blue      │         │   Green     │
│   (Live)    │◄───────►│  (Idle)     │
│   v1.0      │         │   v2.0      │
└─────────────┘         └─────────────┘
       ▲                        │
       │                        │
       └──────── Load Balancer ──┘
                (Switch traffic)
```

**Implementation**:
```yaml
# Service selects by label 'version'
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: blue  # Change to 'green' to switch
  ports:
  - port: 80
    targetPort: 8080
---
# Blue deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-blue
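spec:
  replicas: 3
  selector:  # Required in apps/v1; must match the template labels
    matchLabels:
      app: api
      version: blue
  template: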
    metadata:
      labels:
        app: api
        version: blue
    spec:
      containers:
      - name: api
        image: myapp:v1
---
# Green deployment  
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
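spec:
  replicas: 3
  selector:  # Required in apps/v1; must match the template labels
    matchLabels:
      app: api
      version: green
  template: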
    metadata:
      labels:
        app: api
        version: green
    spec:
      containers:
      - name: api
        image: myapp:v2
```

**Switching Traffic**:
```bash
# Instant cutover
kubectl patch service api -p '{"spec":{"selector":{"version":"green"}}}'

# Instant rollback (if issues detected)
kubectl patch service api -p '{"spec":{"selector":{"version":"blue"}}}'
```

**Pros**: Instant rollback, zero downtime, no mixed versions
**Cons**: Double the infrastructure cost (2x pods running), database schema changes are complex (both versions must work with same DB)
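
The cutover is a one-field change to the Service selector, which makes it easy to script. A minimal sketch that builds the same patch body used in the `kubectl patch` commands above (`other_color` and `selector_patch` are illustrative helper names):

```python
import json

COLORS = ("blue", "green")


def other_color(live):
    """Return the idle environment for the given live one."""
    if live not in COLORS:
        raise ValueError(f"unknown color: {live!r}")
    return "green" if live == "blue" else "blue"


def selector_patch(color):
    """JSON body that points the Service selector at one environment."""
    if color not in COLORS:
        raise ValueError(f"unknown color: {color!r}")
    body = {"spec": {"selector": {"version": color}}}
    return json.dumps(body, separators=(",", ":"))


live = "blue"
target = other_color(live)
print(f"kubectl patch service api -p '{selector_patch(target)}'")
```

Because rollback is the same operation with the colors swapped, a script like this can serve both directions.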

### **Pattern 3: Canary Deployment**

Route a small percentage of traffic to the new version, monitor it, and gradually increase the share.

```
Phase 1: 100% v1
Phase 2: 95% v1, 5% v2  ◄── Monitor error rate, latency
Phase 3: 80% v1, 20% v2  ◄── Automated analysis
Phase 4: 50% v1, 50% v2
Phase 5: 0% v1, 100% v2
```

**Tools**: Flagger and Argo Rollouts (Kubernetes progressive delivery controllers), Spinnaker, or traffic shifting in a service mesh (Istio, AWS App Mesh).

**Istio Service Mesh Implementation**:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
  - api
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: api
        subset: v2  # Canary users
  - route:
    - destination:
        host: api
        subset: v1  # Main traffic
      weight: 95
    - destination:
        host: api
        subset: v2
      weight: 5
```

**Automated Canary Analysis** (Flagger):
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  service:
    port: 8080
  analysis:
    interval: 30s
    threshold: 5  # Max failed checks before rollback
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
    webhooks:
    - name: load-test
      url: http://flagger-loadtester.test/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://api:8080/"
```

**Rollback Trigger**: A check fails when the request success rate drops below 99% or P99 latency exceeds 500ms. After 5 failed checks (`threshold: 5`), Flagger automatically reverts to 100% v1.
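
The interaction between `stepWeight`, `maxWeight`, and `threshold` is easier to follow as a loop. The sketch below is a simplified model with parameter defaults copied from the Canary spec above; it is not Flagger's implementation, and `run_canary` is an illustrative name.

```python
def run_canary(samples, step_weight=10, max_weight=50, threshold=5,
               min_success=99.0, max_p99_ms=500.0):
    """Return (outcome, weights) for a sequence of (success_rate, p99_ms) checks."""
    weight, failed, weights = 0, 0, []
    for success_rate, p99_ms in samples:
        healthy = success_rate >= min_success and p99_ms <= max_p99_ms
        if healthy:
            weight = min(weight + step_weight, max_weight)
        else:
            failed += 1
        if failed >= threshold:
            weights.append(0)  # all traffic back to the stable version
            return "rolled_back", weights
        weights.append(weight)
        if healthy and weight == max_weight:
            return "promoted", weights
    return "in_progress", weights


print(run_canary([(99.9, 120.0)] * 5))
print(run_canary([(97.0, 900.0)] * 5))
```

In this model a failed check pauses the weight increase rather than rolling back immediately, so a single noisy interval does not abort the release.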

### **Pattern 4: Feature Flags (Dark Launches)**

Deploy code to production but hide features behind flags. Enable gradually by user segment.

```
Code deployed: v2.0 (contains new 'recommendation-engine')
                │
                ├─► Feature flag 'recommendations' = OFF (100% users)
                │
                ├─► Enable for 'internal-users' (dogfooding)
                │
                ├─► Enable for 5% of beta users
                │
                └─► Enable for 100% users (general availability)
```

**Implementation** (LaunchDarkly or open-source Unleash):
```python
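from UnleashClient import UnleashClient  # pip install UnleashClient

client = UnleashClient(url="https://unleash.example.com", app_name="api")
client.initialize_client()  # Start fetching flag definitions in the background

# ml_recommendations() and popular_items() are application functions defined elsewhere.
def get_recommendations(user_id):
    if client.is_enabled("new-recommendation-engine", {"userId": user_id}):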
        # New code path (v2 logic)
        return ml_recommendations(user_id)
    else:
        # Old code path (v1 logic)
        return popular_items()
```
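
Staged rollouts like the one in the diagram usually combine segment targeting with a stable percentage bucket, so that a given user keeps seeing the same variant. The sketch below shows one common way to do this without any SDK; the rule format, `bucket`, and `is_enabled` are assumptions for illustration, not the Unleash or LaunchDarkly evaluation logic.

```python
import hashlib


def bucket(flag, user_id):
    """Map (flag, user) to a stable number in the range 0-99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag, user, rules):
    """Segment members are always on; everyone else is on by percentage."""
    rule = rules.get(flag)
    if rule is None:
        return False  # unknown flags default to off
    if user.get("segment") in rule["segments"]:
        return True
    return bucket(flag, user["id"]) < rule["percentage"]


RULES = {"recommendations": {"segments": {"internal-users"}, "percentage": 5}}

print(is_enabled("recommendations", {"id": "u1", "segment": "internal-users"}, RULES))
print(is_enabled("recommendations", {"id": "u2"}, RULES))
```

Including the flag name in the hash input keeps the 5% cohort for one flag independent of the cohort for another.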

**Database Schema Changes with Feature Flags**:
```python
# Phase 1: Deploy code that writes to both old and new tables, reads from old
def create_user(data):
    # Write to both (dual write)
    db.execute("INSERT INTO users_old ...", data)
    db.execute("INSERT INTO users_new ...", data)

# Phase 2: Backfill old data to new table (one-time job)

# Phase 3: Switch flag to read from new table
def get_user(id):
    if feature_flag('new-schema'):
        return db.query("SELECT * FROM users_new WHERE id = ?", id)
    else:
        return db.query("SELECT * FROM users_old WHERE id = ?", id)

# Phase 4: Remove old table access, clean up code
```
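
The phases above can be exercised end to end against an in-memory SQLite database. This is a sketch under assumptions: the column names (`name`, `full_name`) and the `FLAGS` dictionary standing in for a flag service are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users_old (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT)")

FLAGS = {"new-schema": False}  # stand-in for a real feature flag service


def create_user(user_id, name):
    """Phase 1: dual write. `with db` commits both inserts or neither."""
    with db:
        db.execute("INSERT INTO users_old (id, name) VALUES (?, ?)", (user_id, name))
        db.execute("INSERT INTO users_new (id, full_name) VALUES (?, ?)", (user_id, name))


def backfill():
    """Phase 2: copy rows that predate dual writes, skipping existing ids."""
    with db:
        db.execute("INSERT OR IGNORE INTO users_new (id, full_name) "
                   "SELECT id, name FROM users_old")


def get_user(user_id):
    """Phase 3: the flag decides which table serves reads."""
    if FLAGS["new-schema"]:
        sql = "SELECT id, full_name FROM users_new WHERE id = ?"
    else:
        sql = "SELECT id, name FROM users_old WHERE id = ?"
    return db.execute(sql, (user_id,)).fetchone()


with db:
    db.execute("INSERT INTO users_old (id, name) VALUES (1, 'Ada')")  # legacy row
create_user(2, "Grace")
backfill()
before = get_user(1)
FLAGS["new-schema"] = True
after = get_user(1)
print(before, after)
```

Reading the same row before and after the flag flip is a cheap consistency check to run before moving on to Phase 4.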

---

## **21.3 Infrastructure as Code (IaC)**

Manual infrastructure creation is error-prone and non-reproducible. IaC defines infrastructure in version-controlled code.

### **Terraform: The Cloud-Agnostic Standard**

Terraform uses declarative HCL (HashiCorp Configuration Language) to create, modify, and version infrastructure.

**Architecture**:
```hcl
# main.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  
  backend "s3" {
    bucket = "terraform-state-prod"
    key    = "infrastructure/terraform.tfstate"
    region = "us-east-1"
    encrypt = true
    dynamodb_table = "terraform-locks"  # State locking
  }
}

provider "aws" {
  region = var.aws_region
}

# VPC Module
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"
  
  name = "production-vpc"
  cidr = "10.0.0.0/16"
  
  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  
  enable_nat_gateway = true
  single_nat_gateway = false
}

# EKS Cluster
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "19.0.0"
  
  cluster_name    = "production-cluster"
  cluster_version = "1.28"
  
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets
  
  eks_managed_node_groups = {
    general = {
      desired_size = 3
      min_size     = 2
      max_size     = 10
      
      instance_types = ["m6i.large"]
      capacity_type  = "ON_DEMAND"
    }
    
    spot = {
      desired_size = 2
      min_size     = 0
      max_size     = 20
      
      instance_types = ["m6i.large", "m5.large", "m5a.large"]
      capacity_type  = "SPOT"  # Often 70-90% cheaper than on-demand, but interruptible
    }
  }
}
```

**Workflow**:
```bash
terraform init          # Download providers, initialize backend
terraform plan          # Preview changes (dry run)
terraform apply         # Execute changes
terraform destroy       # Tear down infrastructure (use with caution!)
```

**State Management**:
- **Remote State**: Store in S3 (encrypted) with DynamoDB locking (prevents concurrent modifications)
- **State Separation**: Separate environments (dev/staging/prod) into different state files
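- **Sensitive Data**: Mark outputs as `sensitive = true` to keep passwords out of CLI output and logs. The values are still stored in plaintext in the state file, so restrict access to the state bucket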

### **CloudFormation (AWS Native)**

AWS-specific but tightly integrated. Uses YAML or JSON.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Production API Infrastructure'

Resources:
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      Tags:
        - Key: Name
          Value: production-vpc

  ALB:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Name: api-alb
      Scheme: internet-facing
      Type: application
      Subnets: !Ref PublicSubnets
      SecurityGroups: [!Ref ALBSecurityGroup]

  ECSCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: production-api
      CapacityProviders:
        - FARGATE
        - FARGATE_SPOT

  ECSService:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref ECSCluster
      TaskDefinition: !Ref TaskDefinition
      DesiredCount: 3
      DeploymentConfiguration:
        MaximumPercent: 200
        MinimumHealthyPercent: 100
        DeploymentCircuitBreaker:
          Enable: true
          Rollback: true  # Auto-rollback on failure
      LoadBalancers:
        - ContainerName: api
          ContainerPort: 8080
          TargetGroupArn: !Ref TargetGroup
```

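### **Pulumi: IaC in General-Purpose Languages**

Write infrastructure in general-purpose languages (TypeScript, Python, Go). The program itself is ordinary imperative code, but what it produces is a declarative desired-state model that the Pulumi engine diffs against the real infrastructure, much as Terraform does.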

```python
# Pulumi Python example
import pulumi
import pulumi_aws as aws

# Create VPC
vpc = aws.ec2.Vpc("production-vpc",
    cidr_block="10.0.0.0/16",
    tags={"Name": "production-vpc"})

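# Create EC2 instance with user data.
# The shebang must be the very first line, so the string cannot start with a newline.
# Assumes python3 is installed on the chosen AMI.
user_data = """#!/bin/bash
echo "Hello, World!" > index.html
nohup python3 -m http.server 80 &
"""

# security_group and public_subnet are assumed to be defined elsewhere in the program.
server = aws.ec2.Instance("web-server",
    instance_type="t3.micro",
    ami="ami-0c55b159cbfafe1f0",  # Example ID only; AMI IDs are region-specific
    user_data=user_data,
    vpc_security_group_ids=[security_group.id],
    subnet_id=public_subnet.id,
    tags={"Name": "web-server"})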

pulumi.export("public_ip", server.public_ip)
```

**Comparison**:
| Tool | Paradigm | Cloud | Learning Curve | Best For |
|------|----------|-------|----------------|----------|
| Terraform | Declarative | Multi-cloud | Medium | Standardization, teams |
| CloudFormation | Declarative | AWS only | Low (for AWS users) | AWS-only shops |
| Pulumi | Declarative, written in general-purpose languages | Multi-cloud | Low (for developers) | Complex logic, existing code |

---

## **21.4 GitOps: The Modern Deployment Paradigm**

GitOps treats Git as the single source of truth for infrastructure and application state. Automated agents continuously reconcile the live system with Git.

**Principles**:
1. **Declarative**: System described in version-controlled files
2. **Versioned & Immutable**: Git history provides audit trail and rollback
3. **Pulled Automatically**: Agents (not humans) apply changes
4. **Continuously Reconciled**: Drift detection and auto-healing
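
At the heart of every GitOps agent is a reconcile loop: compare the declared state with the live state, then create, update, or prune until they match. The sketch below models both states as plain dictionaries; `diff`, `reconcile`, and the resource names are illustrative and do not reflect how ArgoCD or Flux are implemented.

```python
def diff(desired, live):
    """Compare two {name: spec} mappings and list the actions needed."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    for name in live:
        if name not in desired:
            actions.append(("prune", name))
    return actions


def reconcile(desired, live):
    """Mutate `live` so it matches `desired`; return the actions applied."""
    actions = diff(desired, live)
    for action, name in actions:
        if action == "prune":
            del live[name]
        else:
            live[name] = dict(desired[name])
    return actions


desired = {
    "deployment/api": {"image": "myapp:v2", "replicas": 3},
    "service/api": {"port": 80},
}
live = {
    "deployment/api": {"image": "myapp:v2", "replicas": 5},  # manual drift
    "configmap/debug": {"level": "trace"},                   # not in Git
}

print(reconcile(desired, live))
print(reconcile(desired, live))  # a second pass finds nothing to do
```

The second call returning no actions is the property that makes continuous reconciliation safe to run on a short interval.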

### **ArgoCD (Kubernetes GitOps)**

ArgoCD monitors Git repositories and applies changes to Kubernetes clusters.

**Architecture**:
```
Git Repository (Source of Truth)
    │
    ├─► manifests/
    │   ├── deployment.yaml
    │   ├── service.yaml
    │   └── ingress.yaml
    │
    └─► kustomization.yaml

         │
         │ (Poll or Webhook)
         ▼
    ┌─────────────┐
    │   ArgoCD    │
    │   Server    │
    └──────┬──────┘
           │
           ▼
    ┌─────────────┐
    │  Kubernetes │
    │   Cluster   │
    └─────────────┘
```

**Application Definition**:
```yaml
# argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/infrastructure.git
    targetRevision: HEAD
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true        # Delete resources not in Git
      selfHeal: true     # Correct drift automatically
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
```

**Benefits**:
- **Drift Detection**: If someone manually changes a deployment with `kubectl edit`, ArgoCD detects the drift and reverts it
- **Audit Trail**: Git log shows exactly who changed what and when
- **Disaster Recovery**: New cluster restored by pointing ArgoCD at Git repo
- **Multi-cluster**: Single ArgoCD instance manages dev, staging, and prod clusters

### **Flux (CNCF GitOps)**

A CNCF-graduated alternative to ArgoCD, built as a set of composable Kubernetes controllers (the GitOps Toolkit).

```yaml
# flux-system/gotk-sync.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/company/infrastructure
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: production
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: infrastructure
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: api
      namespace: production
```

---

## **21.5 Database Migrations in CI/CD**

Database schema changes are the riskiest deployments. They require backward compatibility and careful orchestration.

### **Migration Strategies**

**Expand-Contract Pattern** (Zero-downtime):
```
Phase 1 (Expand):
    Add new column 'email_normalized' alongside existing 'email'
    Dual write to both columns
    Backfill new column with transformed data

Phase 2 (Transition):
    Switch reads to new column (feature flag)
    Keep writing to both

Phase 3 (Contract):
    Stop writing to old column
    Remove old column (after verification)
```
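
The backfill step in the expand phase is where large tables cause trouble, because one giant `UPDATE` holds locks for a long time. A common mitigation is to backfill in small batches, each in its own short transaction. The sketch below demonstrates the idea on SQLite; the table contents, batch size, and `normalize` rule (trim and lowercase) are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
with db:
    db.executemany("INSERT INTO users (email) VALUES (?)",
                   [(f" User{i}@Example.COM ",) for i in range(1, 11)])

# Expand: add the new column as nullable so existing writers keep working.
db.execute("ALTER TABLE users ADD COLUMN email_normalized TEXT")


def normalize(email):
    return email.strip().lower()


def backfill(batch_size=3):
    """Fill email_normalized in small batches; return the number of batches."""
    batches = 0
    while True:
        rows = db.execute(
            "SELECT id, email FROM users "
            "WHERE email_normalized IS NULL AND email IS NOT NULL "
            "ORDER BY id LIMIT ?", (batch_size,)).fetchall()
        if not rows:
            return batches
        with db:  # one short transaction per batch
            db.executemany(
                "UPDATE users SET email_normalized = ? WHERE id = ?",
                [(normalize(email), row_id) for row_id, email in rows])
        batches += 1


batches_run = backfill()
print(f"backfilled in {batches_run} batches")
```

Because the loop selects only rows that are still `NULL`, it can be stopped and restarted safely, and it also picks up rows written by code that has not yet been updated to dual write.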

**Tools**:
- **Flyway** (Java/SQL): Versioned SQL scripts
- **Liquibase** (Cross-platform): XML/YAML/JSON changelog
- **Alembic** (Python/SQLAlchemy): Auto-generate migrations from model changes
- **Atlas** (Modern): Terraform-like for databases

**Flyway Example**:
```sql
-- V1__Initial_schema.sql
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    username VARCHAR(50) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- V2__Add_email_column.sql
ALTER TABLE users ADD COLUMN email VARCHAR(255);

-- V3__Create_indexes.sql
CREATE INDEX idx_users_email ON users(email);
```
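
Versioned migration tools decide what to run by comparing the version prefix in each filename with the versions already recorded in the database. The sketch below illustrates that selection for the `V<number>__<description>.sql` names used above. It handles integer versions only and is not Flyway's implementation; `V10__Add_orders.sql` is a hypothetical file added to show numeric ordering.

```python
import re

PATTERN = re.compile(r"^V(\d+)__(\w+)\.sql$")


def parse(filename):
    """Split a migration filename into (version, description)."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a versioned migration: {filename!r}")
    return int(match.group(1)), match.group(2).replace("_", " ")


def pending(filenames, applied_versions):
    """Return unapplied migrations in ascending numeric version order."""
    ordered = sorted(filenames, key=lambda name: parse(name)[0])
    return [name for name in ordered if parse(name)[0] not in applied_versions]


FILES = [
    "V3__Create_indexes.sql",
    "V10__Add_orders.sql",  # hypothetical later migration
    "V1__Initial_schema.sql",
    "V2__Add_email_column.sql",
]

print(pending(FILES, applied_versions={1}))
```

Sorting by the parsed integer rather than by filename matters: a plain string sort would place `V10` before `V2`.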

**CI/CD Integration**:
```yaml
# Run before application deployment
- name: Database Migration
  run: |
    flyway -url=${{ secrets.DB_URL }} \
           -user=${{ secrets.DB_USER }} \
           -password=${{ secrets.DB_PASSWORD }} \
           migrate
  
- name: Deploy Application
  run: kubectl apply -f k8s/
```

**Safety Checks**:
1. **Backward Compatibility**: New code must work with old schema; old code must work with new schema (during transition)
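2. **Long Transactions**: Avoid running a blocking `ALTER TABLE` on large tables, since it can lock the table for the duration of the change. On MySQL, use `pt-online-schema-change` (Percona) or `gh-ost` (GitHub); on PostgreSQL, prefer changes that avoid a full table rewrite and build indexes with `CREATE INDEX CONCURRENTLY`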
3. **Rollback Plan**: Migrations must be reversible (down scripts), though data loss scenarios require backup restoration

---

## **21.6 Capacity Planning and Cost Optimization**

Cloud costs can spiral unexpectedly. Capacity planning ensures you have enough resources without over-provisioning.

### **Vertical vs. Horizontal Scaling Decision Matrix**

| Metric | Scale Up (Vertical) | Scale Out (Horizontal) |
|--------|---------------------|------------------------|
| **Latency** | Better (local memory/CPU) | Network overhead |
| **Cost** | Expensive at high end (diminishing returns) | Linear, commodity hardware |
| **Limit** | Hardware max (e.g., 128 CPU, 4TB RAM) | Theoretically unlimited |
| **Complexity** | Simple (resize VM) | Complex (distributed state) |
| **Availability** | Single point of failure | Resilient (node loss tolerant) |

**Rule of Thumb**:
- Databases: Scale up first (avoid distributed transaction complexity), then shard
- Web servers: Scale out (stateless, easy to add nodes)
- Caches: Scale out (Redis Cluster)

### **Auto-scaling Strategies**

**Reactive Scaling** (Kubernetes HPA):
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60  # Double pods every minute
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60  # Remove 10% every minute (slow scale down)
```
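
The HPA's core calculation is documented as `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, skipped when the ratio is within a tolerance (0.1 by default), and with multiple metrics the largest proposal wins. A minimal sketch of that formula, ignoring the `behavior` rate limits and stabilization windows configured above (function names are illustrative):

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Apply the HPA ratio formula for a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    return math.ceil(current_replicas * ratio)


def hpa_recommendation(current_replicas, metrics, min_replicas=3, max_replicas=100):
    """metrics is a list of (current_utilization, target_utilization) pairs."""
    proposal = max(desired_replicas(current_replicas, cur, tgt) for cur, tgt in metrics)
    return max(min_replicas, min(max_replicas, proposal))


# 10 pods at 90% CPU (target 70%) and 60% memory (target 80%)
print(hpa_recommendation(10, [(90, 70), (60, 80)]))
```

Taking the maximum across metrics means the busiest resource drives scale-up, while scale-down only happens once every metric agrees.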

**Predictive Scaling** (AWS Auto Scaling):
- Machine learning on historical traffic patterns
- Scale out 30 minutes before predicted peak (e.g., 9 AM daily spike)
- Pre-warm instances to avoid cold start latency

**Scheduled Scaling**:
```yaml
# Scale up before known events (Black Friday)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-scheduled
spec:
  minReplicas: 10  # Override normal min of 3 during event
  # ... rest of config
---
# CronJob to patch HPA before event
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-blackfriday
spec:
  schedule: "0 0 * 11 5"  # 00:00 every Friday in November; cron cannot single out Black Friday
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - kubectl
            - patch
            - hpa
            - api-hpa
            - --patch
            - '{"spec":{"minReplicas":50}}'
          restartPolicy: OnFailure
```
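
The cron expression `0 0 * 11 5` matches every Friday in November, because standard cron has no way to say "the Friday after the fourth Thursday". One option is to let the job run weekly and have it check the date before acting. A sketch of that date check (`black_friday` and `should_scale_up` are illustrative names):

```python
from datetime import date, timedelta

THURSDAY = 3  # date.weekday(): Monday is 0


def black_friday(year):
    """Day after the fourth Thursday of November."""
    nov1 = date(year, 11, 1)
    first_thursday = nov1 + timedelta(days=(THURSDAY - nov1.weekday()) % 7)
    return first_thursday + timedelta(weeks=3, days=1)


def should_scale_up(today):
    """Guard for a weekly job so it only acts on the event date."""
    return today == black_friday(today.year)


for year in (2023, 2024, 2025):
    print(year, black_friday(year).isoformat())
```

A companion job is still needed to restore the normal minimum after the event, otherwise the raised floor stays in place.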

### **Cost Optimization Techniques**

**1. Right-sizing**:
Analyze CloudWatch/Monitoring data:
- CPU < 20% average → Downsize instance
- Memory never above 50% → Reduce allocation
- Tools: AWS Compute Optimizer, Kubecost (for K8s)

**2. Spot/Preemptible Instances**:
```hcl
# Terraform: Mixed instance policy with Spot
resource "aws_autoscaling_group" "workers" {
  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.worker.id
      }
      
      override {
        instance_type     = "m6i.large"
        weighted_capacity = "2"
      }
      override {
        instance_type     = "m5.large"
        weighted_capacity = "2"
      }
    }
    
    instances_distribution {
      on_demand_percentage_above_base_capacity = 20  # 20% on-demand, 80% spot
      spot_allocation_strategy                 = "capacity-optimized"
    }
  }
}
```
Spot instances are 70-90% cheaper but can be reclaimed with a 2-minute warning. Use for:
- Batch processing
- CI/CD runners
- Stateless, horizontally scaled services (if a node is interrupted, its pods are rescheduled onto remaining capacity, such as on-demand nodes)

**3. Storage Tiering**:
```hcl
# S3 Lifecycle policy
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id
  
  rule {
    id     = "archive-old-logs"
    status = "Enabled"

    filter {}  # Apply the rule to every object in the bucket
    
    transition {
      days          = 30
      storage_class = "STANDARD_IA"  # Infrequent Access (40% cheaper)
    }
    
    transition {
      days          = 90
      storage_class = "GLACIER"  # Archive (80% cheaper, 3-5 hour retrieval)
    }
    
    expiration {
      days = 2555  # Delete after 7 years (compliance)
    }
  }
}
```
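
The lifecycle rule amounts to a lookup from object age to storage class. The sketch below mirrors the thresholds in the policy above. It is a simplified model: S3 applies lifecycle actions asynchronously, so real transitions can lag the nominal day.

```python
# (minimum age in days, storage class), checked from oldest to newest
TRANSITIONS = [(90, "GLACIER"), (30, "STANDARD_IA")]
EXPIRATION_DAYS = 2555


def storage_class(age_days):
    """Return the class an object of this age should be in, or None if expired."""
    if age_days >= EXPIRATION_DAYS:
        return None
    for min_age, name in TRANSITIONS:
        if age_days >= min_age:
            return name
    return "STANDARD"


for age in (0, 30, 90, 2555):
    print(age, storage_class(age))
```

A table like this is also a convenient place to sanity-check a policy before applying it, for instance to confirm that no transition is scheduled after the expiration date.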

**4. Reserved Instances / Savings Plans**:
For baseline capacity (always-running databases, core services), purchase 1-year or 3-year commitments for 30-60% savings vs on-demand.

---

## **21.7 Multi-Region and Multi-AZ Strategies**

Single points of failure fail. Distributing across availability zones (AZs) and regions ensures survival of datacenter-level outages.

### **Availability Zones (AZs) vs Regions**

| Concept | Scope | Latency | Use Case |
|---------|-------|---------|----------|
| **AZ** | One or more discrete datacenters within a region | < 1ms (within the AZ) | High availability within region |
| **Region** | Geographic area (e.g., us-east-1) | - | Disaster recovery, data sovereignty |
| **Multi-AZ** | Active-Active or Active-Passive across AZs | < 2ms | Database replication, failover |
| **Multi-Region** | Active-Active or DR across regions | 20-200ms | Global availability, DR |

### **Multi-AZ Deployment**

**Stateful Services (Databases)**:
```hcl
# RDS Multi-AZ
resource "aws_db_instance" "primary" {
  identifier           = "production-db"
  allocated_storage    = 100
  engine               = "postgres"
  instance_class       = "db.r6g.xlarge"
  multi_az             = true  # Standby in different AZ
  
  # Automatic failover: typically 60-120 seconds
  # Synchronous replication: Zero data loss
}
```

**Kubernetes Multi-AZ**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
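spec:
  replicas: 6
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      # Spread pods evenly across AZs. A required podAntiAffinity on the zone key
      # would allow only one pod per zone and leave the remaining replicas Pending.
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: api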
      containers:
      - name: api
        image: myapp:v1
```

### **Multi-Region Strategies**

**1. Active-Passive (DR)**:
- Primary region handles 100% traffic
- Secondary region on standby with data replicated
- Failover: DNS switch or load balancer reconfiguration (RTO: minutes to hours)
- Cost: 2x infrastructure, but secondary can be smaller (scale up during failover)

**2. Active-Active (Global Load Balancing)**:
- Both regions serve traffic simultaneously
- Users routed to nearest region (GeoDNS or Global Load Balancer)
- Data replication challenges (conflict resolution, latency)

**Route 53 Geolocation Routing**:
```hcl
resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.primary.zone_id
  name    = "api.example.com"
  type    = "A"
  set_identifier = "us"  # Required when a record uses a routing policy
  
  geolocation_routing_policy {
    country = "US"
  }
  
  alias {
    name                   = aws_lb.us_east.dns_name
    zone_id                = aws_lb.us_east.zone_id
    evaluate_target_health = true
  }
}

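# Without a default record (country = "*"), clients outside the US and EU get no answer.
resource "aws_route53_record" "api_eu" {
  zone_id = aws_route53_zone.primary.zone_id
  name    = "api.example.com"
  type    = "A"
  set_identifier = "eu"  # Required when a record uses a routing policy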
  
  geolocation_routing_policy {
    continent = "EU"
  }
  
  alias {
    name                   = aws_lb.eu_west.dns_name
    zone_id                = aws_lb.eu_west.zone_id
    evaluate_target_health = true
  }
}
```
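
Geolocation routing resolves a query by looking for the most specific matching record: country first, then continent, then the default. The sketch below models that precedence with a dictionary. The region names are shorthand for the two load balancers above, and the default entry is an assumption added for illustration, since the Terraform snippet does not define one.

```python
RECORDS = {
    ("country", "US"): "us-east",
    ("continent", "EU"): "eu-west",
    ("default", "*"): "us-east",  # hypothetical fallback record
}


def resolve(country, continent, records):
    """Return the target for the most specific matching geolocation record."""
    for key in (("country", country), ("continent", continent), ("default", "*")):
        if key in records:
            return records[key]
    return None  # no match and no default: the query gets no answer


print(resolve("US", "NA", RECORDS))
print(resolve("DE", "EU", RECORDS))
print(resolve("JP", "AS", RECORDS))
```

Removing the default entry makes the third lookup return `None`, which is the situation to avoid in production.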

**3. Cell-Based Architecture**:
Instead of two large regions, deploy many smaller "cells" (independent instances of the application).
- Each cell handles subset of users (sharding by user ID)
- Failure affects only 1/N users
- Netflix uses 3 AWS regions, each with multiple cells
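
Cell assignment by user ID can be sketched as a stable hash modulo the number of cells. The example below is a minimal illustration of that idea and of the "1/N users" blast radius; `cell_for` is an illustrative name, and real systems often use a lookup table or consistent hashing instead so that adding a cell does not move most users.

```python
import hashlib
from collections import Counter


def cell_for(user_id, num_cells):
    """Map a user ID to a cell index in the range 0..num_cells-1."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


NUM_CELLS = 10
users = [f"user-{i}" for i in range(10_000)]
load = Counter(cell_for(u, NUM_CELLS) for u in users)

failed_cell = 0
affected = load[failed_cell] / len(users)
print(f"users per cell: min={min(load.values())} max={max(load.values())}")
print(f"share of users affected if cell {failed_cell} fails: {affected:.1%}")
```

With ten cells, losing one should affect roughly a tenth of users, while the other nine cells continue serving normally.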

**Data Replication Patterns**:
- **Read Replicas**: Async replication to other regions (eventual consistency, seconds of lag)
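- **Global Databases**: Spanner provides strongly consistent cross-region writes at the cost of higher write latency; Cosmos DB offers tunable consistency levels; Aurora Global Database replicates asynchronously to secondary regions, so reads there are eventually consistent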
- **Event Sourcing**: Kafka MirrorMaker 2 replicates events cross-region; consumers rebuild state

---

## **21.8 Chapter Summary**

Deployment and infrastructure engineering has evolved from manual server configuration to automated, self-healing systems:

1. **CI/CD Pipelines** automate quality gates from commit to production, with deployment patterns (Blue-Green, Canary) reducing release risk.

2. **Infrastructure as Code** (Terraform/CloudFormation/Pulumi) makes infrastructure reproducible, versioned, and reviewable like application code.

3. **GitOps** (ArgoCD/Flux) establishes Git as the single source of truth, with automated agents ensuring live systems match declared state.

4. **Database Migrations** require careful choreography—expand-contract patterns and backward compatibility ensure zero-downtime schema changes.

5. **Cost Optimization** balances performance with spend through right-sizing, spot instances, and storage tiering.

6. **Multi-Region/AZ** strategies provide resilience against datacenter and regional failures, with trade-offs between cost, complexity, and recovery time.

**The Deployment Maturity Model**:
- **Level 1**: Manual deployments, SSH into servers
- **Level 2**: Scripted deployments, some automation
- **Level 3**: CI/CD pipelines, automated testing
- **Level 4**: GitOps, automated canary analysis, self-healing infrastructure
- **Level 5**: Continuous deployment (every commit to production automatically), chaos engineering

---

**Exercises**:

1. **Deployment Strategy**: Compare Blue-Green vs. Canary for a database schema change that adds a `NOT NULL` column. Which is safer and why?

2. **Terraform**: Write a Terraform module that creates an auto-scaling group with mixed on-demand and spot instances, ensuring at least 30% of capacity is on-demand for stability.

3. **GitOps**: Design an ArgoCD ApplicationSet that deploys the same application to 3 different environments (dev, staging, prod) with different replica counts and resource limits.

4. **Database Migration**: You need to rename a column from `username` to `user_name` in a table with 100 million rows without downtime. Describe the expand-contract steps.

5. **Cost Analysis**: Calculate the monthly cost difference between running 10 `m6i.large` instances entirely on-demand (assume $0.086/hour) vs. a mix of 80% spot instances (assume $0.025/hour) and 20% on-demand for stability. Treat both prices as illustrative; actual us-east-1 rates change over time.

6. **Multi-Region**: Design a failover strategy for a payment processing system where data consistency is critical (no lost transactions). How do you handle the split-brain scenario where both regions think they are primary?

---

The journey from `hello world` to planetary scale is measured not in lines of code, but in decisions—trade-offs between consistency and availability, latency and durability, cost and complexity. May your systems be highly available, your caches be warm, and your deployments be uneventful.

