# Chapter 40: Progressive Delivery

Deploying a new software version to production carries inherent risk. Traditional strategies such as recreate, rolling update, and blue-green eventually move every user to the new version, usually without a checkpoint that gates the rollout on observed behavior. Progressive delivery reduces this risk by decoupling deployment (the technical act of installing new code) from release (the business decision to expose users to new features). With canary releases, feature flags, and automated metric analysis, teams can validate production changes with a small blast radius, roll back automatically when anomalies appear, and expand exposure based on empirical evidence rather than a fixed schedule.

This chapter explores the toolchain and patterns for progressive delivery in Kubernetes environments, from simple feature toggles to sophisticated automated canary analysis using service meshes and observability platforms.

## 40.1 Feature Flags

Feature flags (feature toggles) decouple deployment from release, allowing code to be deployed to production while remaining dormant or restricted to specific user segments. This enables trunk-based development, A/B testing, and instant rollback without redeployment.

### Feature Flag Architecture

**Static Configuration Flags:**
Simple environment-based toggles suitable for infrastructure changes:

```yaml
# configmap-feature-flags.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: production
data:
  features.yaml: |
    payment:
      newCheckoutFlow: false
      instantPayouts: true
      fraudDetectionV2: false
    
    user:
      ssoLogin: true
      profileRedesign: false
      
    infrastructure:
      useNewCacheLayer: true
      enableTracing: true
```

**Application Integration (Spring Boot):**
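```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;

@Configuration
@ConfigurationProperties(prefix = "features")
public class FeatureFlags {
    // Bound from features.payment.*, features.user.*, features.infrastructure.*
    private Map<String, Boolean> payment = new HashMap<>();
    private Map<String, Boolean> user = new HashMap<>();
    private Map<String, Boolean> infrastructure = new HashMap<>();

    // Getters let Spring's binder reach and populate the maps
    public Map<String, Boolean> getPayment() { return payment; }
    public Map<String, Boolean> getUser() { return user; }
    public Map<String, Boolean> getInfrastructure() { return infrastructure; }

    /** Looks up a flag by "category.flag" key; unknown or malformed keys are disabled. */
    public boolean isEnabled(String feature) {
        String[] parts = feature.split("\\.", 2);
        if (parts.length != 2) {
            return false;  // Malformed key: fail closed instead of throwing
        }
        String category = parts[0];
        String flag = parts[1];

        return switch (category) {
            case "payment" -> payment.getOrDefault(flag, false);
            case "user" -> user.getOrDefault(flag, false);
            case "infrastructure" -> infrastructure.getOrDefault(flag, false);
            default -> false;
        };
    }
}
```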

**Usage in Business Logic:**
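```java
@Service
public class PaymentService {
    private final FeatureFlags featureFlags;
    // CheckoutProcessor is an assumed interface implemented by both checkout flows
    private final CheckoutProcessor newCheckoutProcessor;
    private final CheckoutProcessor legacyCheckoutProcessor;

    // Constructor injection; bean qualifiers for the two processors are omitted for brevity
    public PaymentService(FeatureFlags featureFlags,
                          CheckoutProcessor newCheckoutProcessor,
                          CheckoutProcessor legacyCheckoutProcessor) {
        this.featureFlags = featureFlags;
        this.newCheckoutProcessor = newCheckoutProcessor;
        this.legacyCheckoutProcessor = legacyCheckoutProcessor;
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // The flag decides at request time which implementation handles the payment
        if (featureFlags.isEnabled("payment.newCheckoutFlow")) {
            return newCheckoutProcessor.process(request);
        }
        return legacyCheckoutProcessor.process(request);
    }
}
```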

**Explanation:**
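The `FeatureFlags` class uses Spring Boot's `@ConfigurationProperties` to bind properties under the `features` prefix, so the flag YAML must reach the Spring environment with that prefix (for example, by nesting `payment`, `user`, and `infrastructure` under a top-level `features:` key in a file that Spring loads as configuration). If Spring Cloud Kubernetes configuration reload is enabled (optionally combined with `@RefreshScope` beans), ConfigMap updates can be picked up without a restart; otherwise the Pods must be restarted to see new values. The `isEnabled` method looks flags up by string key and defaults to `false` for any flag that is not defined, so a missing entry disables a feature rather than enabling it. The new checkout code can therefore ship to production and stay inactive until the flag is toggled.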

### Dynamic Feature Flags (LaunchDarkly)

For user-targeted flags and complex rollouts, specialized platforms provide SDKs:

```javascript
// payment-service/src/config/launchdarkly.js
const LaunchDarkly = require('launchdarkly-node-server-sdk');

const ldClient = LaunchDarkly.init(process.env.LAUNCHDARKLY_SDK_KEY);

async function initializeFeatureFlags() {
  await ldClient.waitForInitialization();
  console.log('LaunchDarkly initialized');
}

// variation() returns a Promise in the Node.js server SDK, so callers must await the result
function getFeatureFlag(flagKey, user, defaultValue = false) {
  return ldClient.variation(flagKey, user, defaultValue);
}

// User context construction
function createUserContext(req) {
  return {
    key: req.user.id,                    // Unique user identifier
    email: req.user.email,
    custom: {
      tier: req.user.subscriptionTier,    // Target by plan level
      region: req.headers['x-region'],    // Target by geography
      signupDate: req.user.createdAt,     // Target by cohort
      betaUser: req.user.betaOptIn        // Target beta users
    }
  };
}

// ldClient is exported so route handlers can call ldClient.track()
module.exports = { ldClient, initializeFeatureFlags, getFeatureFlag, createUserContext };
```

**Route Handling with Feature Flags:**
```javascript
// payment-service/src/routes/checkout.js
const express = require('express');
const { ldClient, getFeatureFlag, createUserContext } = require('../config/launchdarkly');

const router = express.Router();

router.post('/checkout', async (req, res) => {
  const user = createUserContext(req);
  
  // Check if user should see new checkout (the flag lookup is asynchronous, so await it)
  const useNewCheckout = await getFeatureFlag('new-checkout-flow', user, false);
  
  if (useNewCheckout) {
    // Track metrics for analysis
    ldClient.track('new-checkout-accessed', user);
    
    // Route to new implementation
    return newCheckoutHandler(req, res);
  } else {
    // Route to legacy implementation
    return legacyCheckoutHandler(req, res);
  }
});

// Gradual rollout endpoint
router.post('/instant-payout', async (req, res) => {
  const user = createUserContext(req);
  
  // Percentage-based rollout (managed in LaunchDarkly dashboard); await the async lookup
  const enabled = await getFeatureFlag('instant-payouts', user, false);
  
  if (!enabled) {
    return res.status(404).json({ error: 'Feature not available' });
  }
  
  try {
    const result = await processInstantPayout(req.body);
    
    // Success metric for canary analysis
    ldClient.track('instant-payout-success', user);
    
    res.json(result);
  } catch (error) {
    // Failure metric
    ldClient.track('instant-payout-failure', user, null, 1);
    res.status(500).json({ error: error.message });
  }
});

module.exports = router;
```

**Explanation:**
LaunchDarkly's SDK evaluates flags based on user context. The `instant-payouts` flag can be configured in the LaunchDarkly dashboard to target:
- 5% of users initially (canary)
- Users with `tier: 'enterprise'`
- Users in `region: 'us-west'`
- Users where `betaUser: true`

The SDK caches flag rules locally (updated via streaming connection), evaluating flags in microseconds without network calls. This enables complex targeting without application redeployment.
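Percentage rollouts stay stable per user because the bucket is derived from the user key rather than from a random draw on each request. The following is a minimal sketch of that idea, not LaunchDarkly's actual algorithm; `bucketFor` and `inRollout` are hypothetical helper names, and the SHA-256 hashing scheme is an assumption chosen for illustration.

```javascript
const crypto = require('crypto');

// Map a (flagKey, userKey) pair to a stable bucket in [0, 100).
// Hashing scheme is an illustrative assumption, not a vendor algorithm.
function bucketFor(flagKey, userKey) {
  const digest = crypto
    .createHash('sha256')
    .update(`${flagKey}:${userKey}`)
    .digest();
  // Interpret the first 4 bytes as an unsigned integer and scale to a percentage.
  return (digest.readUInt32BE(0) / 0x100000000) * 100;
}

// A user is in the rollout when their bucket falls below the rollout percentage.
function inRollout(flagKey, userKey, percentage) {
  return bucketFor(flagKey, userKey) < percentage;
}
```

Because a user's bucket never changes, raising the percentage from 5 to 20 keeps every user who already had the feature and only adds new ones. Including the flag key in the hash gives each flag an independent set of early users.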

### Kubernetes-Native Feature Flags (Flagger)

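Flagger is not a feature-flag system, but its header- and cookie-based routing gives a similar effect at the traffic layer: selected users reach the new version while everyone else stays on stable. The following Canary resource combines that routing with a weighted canary: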

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  service:
    port: 8080
    gateways:
      - istio-gateway
    match:
      - uri:
          prefix: /api/v1/payments
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 30s
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 30s
    webhooks:
      - name: conformance-test
        type: pre-rollout
        url: http://flagger-tester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sf http://payment-service-canary:8080/api/v1/payments/health"
      - name: load-test
        type: rollout  # Generates traffic during analysis so the metrics have data
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://payment-service-canary:8080/api/v1/payments"
    # Header and cookie match conditions belong under analysis.match
    match:
      - headers:
          x-canary:
            exact: "insider"
      - headers:
          cookie:
            regex: "^(.*?;)?(canary=always)(;.*)?$"
```

**Explanation:**
This Flagger Canary resource:
- **Traffic Splitting**: Routes 10% of traffic to canary (new version), increasing by 10% every minute up to 50%
- **Header and Cookie Matching**: `analysis.match` targets requests carrying header `x-canary: insider` or cookie `canary=always` (useful for internal testing). In Flagger, A/B testing is normally configured with `match` plus `iterations` instead of `stepWeight`/`maxWeight`; treat the combination shown here as illustrative and check the Flagger documentation for your mesh provider
- **Metrics**: Requires a 99% success rate and latency under 500ms; once 5 checks have failed (threshold: 5), Flagger rolls back
- **Webhooks**: Runs the conformance check before any traffic is shifted and the load test during analysis

## 40.2 Canary Releases

Canary releases route a small percentage of production traffic to a new version, validating behavior with real users before full rollout.

### Manual Canary with Kubernetes

Basic canary using replica ratios behind a shared Service:

```yaml
# canary-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment-service
    # No version selector - includes both stable and canary
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: payment-service
      version: stable
  template:
    metadata:
      labels:
        app: payment-service
        version: stable
    spec:
      containers:
        - name: payment
          image: payment-service:v2.0.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service-canary
spec:
  replicas: 1  # 10% of total (1/10)
  selector:
    matchLabels:
      app: payment-service
      version: canary
  template:
    metadata:
      labels:
        app: payment-service
        version: canary
    spec:
      containers:
        - name: payment
          image: payment-service:v2.1.0  # New version
          env:
            - name: METRICS_TAGS
              value: "version:canary"
```

**Explanation:**
This naive approach relies on the replica ratio (1 canary : 9 stable, roughly 10% of traffic) and assumes requests are spread evenly across Pods. The canary Pod carries a distinct label (`version: canary`) for monitoring but matches the same Service selector, so kube-proxy load balances across both Deployments. The traffic split can only be adjusted in whole-replica increments, and long-lived connections can skew the real ratio, but the approach works for basic validation.
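The granularity limit of the replica-ratio approach is easy to see with a little arithmetic. The sketch below uses a hypothetical helper, `replicaSplit`, and assumes perfectly even per-Pod load balancing; it computes the replica counts for a desired canary share and the share that is actually achievable.

```javascript
// Given a total Pod count and a desired canary share, return replica counts
// and the traffic share actually achieved (assuming even per-Pod balancing).
function replicaSplit(totalReplicas, desiredCanaryPercent) {
  // At least one canary Pod is needed to receive any traffic at all.
  const canary = Math.max(
    1,
    Math.round((totalReplicas * desiredCanaryPercent) / 100)
  );
  const stable = totalReplicas - canary;
  return {
    stable,
    canary,
    actualCanaryPercent: (canary * 100) / totalReplicas,
  };
}

console.log(replicaSplit(10, 10)); // { stable: 9, canary: 1, actualCanaryPercent: 10 }
console.log(replicaSplit(10, 1));  // { stable: 9, canary: 1, actualCanaryPercent: 10 }
```

With 10 Pods, a 1% canary is not expressible: the smallest possible step is one Pod, or 10% of traffic. Reaching 1% this way would take 100 Pods, which is the main reason to move to mesh-level weights.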

### Istio Traffic Splitting

Service mesh enables precise traffic management:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
    - payment-service
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: payment-service
            subset: canary
          weight: 100
    - route:
        - destination:
            host: payment-service
            subset: stable
          weight: 90
        - destination:
            host: payment-service
            subset: canary
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5  # Replaces the deprecated consecutiveErrors field
      interval: 30s
      baseEjectionTime: 30s
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary
      trafficPolicy:
        outlierDetection:
          consecutive5xxErrors: 3  # More sensitive for canary
          interval: 10s
```

**Explanation:**
The VirtualService defines traffic routing rules:
- **Header Match**: Users with `x-canary: true` header go 100% to canary (forced testing)
- **Weight-based**: Normal traffic is 90% stable, 10% canary

The DestinationRule defines subsets (groups of Pods) based on labels and configures circuit breaking:
- **Connection Pooling**: Limits concurrent connections to prevent cascade failures
- **Outlier Detection**: Removes unhealthy Pods from the pool after consecutive errors (5 for stable, 3 for canary)

### Automated Canary Analysis

Flagger automates the canary process with metric analysis:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-service
spec:
  provider: istio
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  service:
    port: 8080
    targetPort: 8080
    gateways:
      - payment-gateway.istio-system.svc.cluster.local
    hosts:
      - payment.company.com
  analysis:
    interval: 30s
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        interval: 1m
        thresholdRange:
          min: 99
      - name: request-duration
        interval: 1m
        thresholdRange:
          max: 500
      - name: custom-error-rate
        templateRef:
          name: payment-error-rate
          namespace: flagger
        thresholdRange:
          max: 1
    webhooks:
      - name: conformance-testing
        type: pre-rollout
        url: http://flagger-tester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://payment-service-canary:8080/api/v1/payments/validate | grep 'valid'"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://payment-service-canary:8080/api/v1/payments"
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: payment-service
  progressDeadlineSeconds: 600
```

**Explanation:**
Flagger manages the entire lifecycle:
1. **Pre-rollout**: Runs conformance tests (bash script checks canary responds correctly)
2. **Traffic Shift**: Increases weight by 10% every 30s (stepWeight) up to 50% (maxWeight)
3. **Monitoring**: Checks Prometheus metrics every interval
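4. **Rollback**: If the success rate drops below 99% or latency exceeds 500ms, the check fails; once failed checks reach the threshold (5), Flagger routes all traffic back to stable
5. **Promotion**: If analysis passes at the maximum weight, Flagger promotes the canary by copying its spec to the primary Deployment (`payment-service-primary`), which serves stable traffic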
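The interaction between `stepWeight`, `maxWeight`, and `threshold` can be sketched as a small simulation. This is a simplified model, not Flagger's implementation: `simulateCanary` is a hypothetical function, each boolean stands for one analysis interval, and failed checks are assumed to accumulate over the whole run.

```javascript
// Simulate a weight-stepping canary analysis.
// checks: one boolean per analysis interval (true = every metric passed).
function simulateCanary({ stepWeight, maxWeight, threshold }, checks) {
  let weight = 0;
  let failed = 0;
  const history = []; // canary weights reached, in order

  for (const passed of checks) {
    if (!passed) {
      failed += 1;
      if (failed >= threshold) return { outcome: 'rolled-back', history };
      continue; // hold the current weight and re-check next interval
    }
    // A passing check at the maximum weight ends the analysis.
    if (weight >= maxWeight) return { outcome: 'promoted', history };
    weight = Math.min(weight + stepWeight, maxWeight);
    history.push(weight);
  }
  return { outcome: 'in-progress', history };
}

const config = { stepWeight: 10, maxWeight: 50, threshold: 5 };
console.log(simulateCanary(config, [true, true, true, true, true, true]));
// { outcome: 'promoted', history: [ 10, 20, 30, 40, 50 ] }
```

With a 30-second interval, the happy path above takes about three minutes from first traffic to promotion, while five failed checks at any point end the rollout.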

## 40.3 Traffic Splitting

Advanced traffic splitting strategies enable fine-grained control over user routing.

### Header-Based Routing

Route internal users or beta testers to new versions:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
    - payment-service
  http:
    # Separate match entries are ORed; conditions inside one entry are ANDed
    - match:
        - headers:
            x-internal-user:
              exact: "true"
        - headers:
            x-beta-tester:
              exact: "true"
      route:
        - destination:
            host: payment-service
            subset: canary
          weight: 100
    - match:
        - headers:
            user-agent:
              regex: ".*Mobile.*"
      route:
        - destination:
            host: payment-service
            subset: stable
          weight: 95
        - destination:
            host: payment-service
            subset: canary
          weight: 5
    - route:
        - destination:
            host: payment-service
            subset: stable
          weight: 90
        - destination:
            host: payment-service
            subset: canary
          weight: 10
```

**Explanation:**
- **Internal Users**: Employees (identified by `x-internal-user: true` header from corporate proxy) get 100% canary
- **Beta Testers**: Users who opted into the beta program (identified by an `x-beta-tester: true` header) get 100% canary
- **Mobile Users**: 5% of mobile traffic (User-Agent matching) goes to canary (testing mobile-specific changes)
- **Default**: 10% of remaining traffic to canary
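The routing behavior described above follows a first-match-wins evaluation. The sketch below models that logic in plain JavaScript; it is an illustration of the semantics, not Istio code. The `routes` table, `entryMatches`, and `pickSubset` are hypothetical names, header names are assumed to be lowercased, and `roll` stands in for the proxy's random draw.

```javascript
// Rules are evaluated top to bottom; the first rule that matches wins.
// A rule matches if ANY of its match entries matches; within one entry,
// ALL listed header conditions must hold. An empty match list is the default.
const routes = [
  { match: [{ 'x-internal-user': /^true$/ }, { 'x-beta-tester': /^true$/ }],
    weights: { canary: 100 } },
  { match: [{ 'user-agent': /^.*Mobile.*$/ }],
    weights: { stable: 95, canary: 5 } },
  { match: [], weights: { stable: 90, canary: 10 } },
];

function entryMatches(entry, headers) {
  return Object.entries(entry).every(([name, re]) => re.test(headers[name] || ''));
}

// roll is a number in [0, 100), normally Math.random() * 100.
function pickSubset(headers, roll) {
  const rule = routes.find(
    r => r.match.length === 0 || r.match.some(e => entryMatches(e, headers))
  );
  let cumulative = 0;
  for (const [subset, weight] of Object.entries(rule.weights)) {
    cumulative += weight;
    if (roll < cumulative) return subset;
  }
  return 'stable'; // unreachable when weights sum to 100
}

console.log(pickSubset({ 'x-beta-tester': 'true' }, 99)); // canary
console.log(pickSubset({}, 50));                          // stable
```

Rule order matters: an internal user on a mobile device matches the first rule and never reaches the 95/5 mobile split.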

### Cookie-Based Sticky Canary

Ensure users consistently hit the same version during a session:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
    - payment-service
  http:
    - match:
        - headers:
            cookie:
              regex: "^(.*?; )?(canary=always)(;.*)?$"
      route:
        - destination:
            host: payment-service
            subset: canary
    - match:
        - headers:
            cookie:
              regex: "^(.*?; )?(canary=never)(;.*)?$"
      route:
        - destination:
            host: payment-service
            subset: stable
    - route:
        - destination:
            host: payment-service
            subset: stable
          weight: 90
        - destination:
            host: payment-service
            subset: canary
          weight: 10
---
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: canary-cookie
spec:
  configPatches:
    - applyTo: HTTP_ROUTE
      match:
        route:
          cluster: outbound|8080||payment-service
      patch:
        operation: MERGE
        value:
          hash_policy:
            - cookie:
                name: canary-user
                ttl: 3600s  # 1 hour sticky session
```

**Explanation:**
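The VirtualService honors `canary=always` and `canary=never` cookies, pinning those users to the canary or stable subset. The EnvoyFilter adds a cookie-based hash policy on `canary-user` with a 1-hour TTL. Note that a hash policy influences which endpoint is chosen inside a destination (and only when the DestinationRule uses consistent-hash load balancing); it does not by itself decide the weighted stable/canary split. To keep a user on one version, something must set the `canary` cookie after the first weighted assignment, for example the application or an edge proxy. This avoids the inconsistent experience of seeing the new UI on one page and the old UI on the next.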
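One way to make the weighted split sticky is to record the first assignment in the `canary` cookie that the VirtualService already matches. The following sketch shows that decision as a pure function that an application or edge component might call. `parseCookies` and `assignVersion` are hypothetical helpers, and the one-hour `Max-Age` is an assumption mirroring the TTL above.

```javascript
// Parse a Cookie header into a plain object of name -> value.
function parseCookies(header = '') {
  return Object.fromEntries(
    header
      .split(';')
      .map(part => part.trim())
      .filter(Boolean)
      .map(part => {
        const i = part.indexOf('=');
        return i === -1 ? [part, ''] : [part.slice(0, i), part.slice(i + 1)];
      })
  );
}

// Decide which version a request is pinned to.
// Returns the version and, for first-time visitors, a Set-Cookie value.
function assignVersion(cookieHeader, canaryPercent, roll = Math.random() * 100) {
  const { canary } = parseCookies(cookieHeader);
  if (canary === 'always') return { version: 'canary', setCookie: null };
  if (canary === 'never') return { version: 'stable', setCookie: null };

  // No preference yet: make the weighted choice once and remember it.
  const version = roll < canaryPercent ? 'canary' : 'stable';
  const value = version === 'canary' ? 'always' : 'never';
  return { version, setCookie: `canary=${value}; Max-Age=3600; Path=/` };
}

console.log(assignVersion('session=abc; canary=always', 10).version); // canary
```

After the cookie is set, subsequent requests match the cookie rules in the VirtualService and bypass the weighted route entirely until the cookie expires.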

### Geographic Traffic Splitting

Route traffic by region for localized testing:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
    - payment.company.com
  http:
    # Both conditions sit in one match entry, so they must both hold (AND)
    - match:
        - uri:
            prefix: /api/v1/payments
          sourceLabels:
            region: us-west
      route:
        - destination:
            host: payment-service
            subset: canary
          weight: 50
        - destination:
            host: payment-service
            subset: stable
          weight: 50
    - route:
        - destination:
            host: payment-service
            subset: stable
---
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: production
spec:
  outboundTrafficPolicy:
    mode: REGISTRY_ONLY
```

**Explanation:**
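Requests to `/api/v1/payments` from workloads labeled `region: us-west` get 50% canary exposure, while all other traffic stays on stable. `sourceLabels` matches labels on the calling workload inside the mesh, not client IP geolocation; to split external traffic by geography, an edge component would need to attach the region as a request header that the VirtualService can match. This limits the blast radius to one region during the initial rollout.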

## 40.4 Automated Rollbacks

Automated rollbacks detect anomalies and revert traffic to stable versions without human intervention.

### Metric-Based Rollback

Flagger's automated rollback triggers:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-service
spec:
  analysis:
    interval: 30s
    threshold: 3  # Allow 3 failed checks before rollback
    maxWeight: 50
    stepWeight: 25
    metrics:
      - name: request-success-rate
        interval: 30s
        thresholdRange:
          min: 99.0
      - name: request-duration
        interval: 30s
        thresholdRange:
          max: 500  # ms
      - name: error-5xx-rate
        templateRef:
          name: error-rate
        thresholdRange:
          max: 1.0  # 1%
      - name: payment-failure-rate
        templateRef:
          name: custom-payment-metric
        thresholdRange:
          max: 0.1  # 0.1% payment processing failures
```

**Prometheus Metric Template:**
```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: custom-payment-metric
spec:
  provider:
    type: prometheus
    address: http://prometheus:9090
  query: |
    sum(
      rate(
        payment_failures_total{
          namespace="{{ namespace }}",
          service="{{ service }}"
        }[1m]
      )
    )
    /
    sum(
      rate(
        payment_total{
          namespace="{{ namespace }}",
          service="{{ service }}"
        }[1m]
      )
    ) * 100
```

**Explanation:**
Flagger queries Prometheus every 30s. The custom metric calculates the percentage of failed payments. Each interval in which any metric falls outside its threshold counts as a failed check; once 3 checks have failed (threshold: 3), Flagger shifts 100% of traffic back to stable and scales the canary Deployment to zero. With a 30s interval, that puts rollback roughly 90 seconds after a sustained error condition begins, with no human intervention.
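The arithmetic behind the payment failure query can be reproduced outside Prometheus to sanity-check a threshold. The sketch below is a rough approximation of `rate()` over two counter samples, not Prometheus' actual extrapolation logic; `rate` and `failurePercent` are hypothetical helpers.

```javascript
// Approximate PromQL rate(): per-second increase of a counter between two samples.
function rate(prev, curr, seconds) {
  return Math.max(0, curr - prev) / seconds;
}

// Failed payments as a percentage of all payments over the window.
function failurePercent(failures, totals, windowSeconds) {
  const totalRate = rate(totals.prev, totals.curr, windowSeconds);
  if (totalRate === 0) return NaN; // no traffic: the ratio is undefined
  const failureRate = rate(failures.prev, failures.curr, windowSeconds);
  return (failureRate / totalRate) * 100;
}

// 3 failures out of 3000 payments in one minute is about 0.1%,
// which sits right at the `max: 0.1` threshold used above.
const pct = failurePercent({ prev: 10, curr: 13 }, { prev: 1000, curr: 4000 }, 60);
console.log(pct.toFixed(3));
```

The undefined result for zero traffic is the reason a load-test webhook is usually paired with metric analysis: without requests, a ratio-based metric has nothing to evaluate.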

### Circuit Breaker Rollback

Istio's outlier detection provides automatic ejection:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service-canary
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 30s
      maxEjectionPercent: 100
  subsets:
    - name: canary
      labels:
        version: canary
```

**Explanation:**
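If a canary Pod returns 5 consecutive 5xx responses, Istio ejects it from the load-balancing pool for 30 seconds (`baseEjectionTime`); `interval` is how often the ejection sweep runs, not a window in which the errors must occur. Because the `trafficPolicy` here is defined at the top level of the DestinationRule, it applies to every subset of `payment-service`, not only the canary. With `maxEjectionPercent: 100`, every Pod in a pool can be ejected at once. Ejection removes failing Pods from rotation, but it does not change VirtualService weights, so traffic still needs to be shifted back to stable by a rollback.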
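The consecutive-error counting behind outlier detection can be modeled in a few lines. This is a simplified sketch of the idea, not Envoy's implementation: `OutlierDetector` is a hypothetical class, timestamps are passed in explicitly, and the ejection time is assumed to grow with the number of ejections.

```javascript
class OutlierDetector {
  constructor({ consecutive5xxErrors, baseEjectionTimeMs }) {
    this.limit = consecutive5xxErrors;
    this.baseMs = baseEjectionTimeMs;
    this.hosts = new Map(); // host -> { streak, ejections, ejectedUntil }
  }

  state(host) {
    if (!this.hosts.has(host)) {
      this.hosts.set(host, { streak: 0, ejections: 0, ejectedUntil: 0 });
    }
    return this.hosts.get(host);
  }

  // Record one response; returns true if this response caused an ejection.
  record(host, status, nowMs) {
    const s = this.state(host);
    if (status < 500) {
      s.streak = 0; // any non-5xx response resets the streak
      return false;
    }
    s.streak += 1;
    if (s.streak < this.limit) return false;
    s.streak = 0;
    s.ejections += 1;
    // Repeat offenders stay out longer: base time multiplied by ejection count.
    s.ejectedUntil = nowMs + this.baseMs * s.ejections;
    return true;
  }

  isEjected(host, nowMs) {
    return nowMs < this.state(host).ejectedUntil;
  }
}

const detector = new OutlierDetector({ consecutive5xxErrors: 5, baseEjectionTimeMs: 30000 });
for (let i = 0; i < 5; i++) detector.record('canary-pod-1', 503, 0);
console.log(detector.isEjected('canary-pod-1', 1000)); // true
```

A single successful response resets the streak, so a Pod that fails intermittently may never be ejected; metric-based analysis is still needed to catch elevated but non-consecutive error rates.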

### Argo Rollouts Automated Rollback

Argo Rollouts provides sophisticated rollback capabilities:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 10
  selector:
    matchLabels:
      app: payment-service
  strategy:
    canary:
      canaryService: payment-service-canary
      stableService: payment-service-stable
      trafficRouting:
        nginx:
          stableIngress: payment-service-ingress
          annotationPrefix: nginx.ingress.kubernetes.io
      steps:
        - setWeight: 5
        - pause: {duration: 10m}
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
          - templateName: latency
        args:
          - name: service-name
            value: payment-service-canary
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment
          image: payment-service:v2.1.0
          resources:
            requests:
              memory: 256Mi
              cpu: 100m
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 5m
      count: 3
      successCondition: result[0] >= 0.99
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(nginx_ingress_controller_requests{
              service="{{ args.service-name }}",
              status=~"2.."
            }[5m]))
            /
            sum(rate(nginx_ingress_controller_requests{
              service="{{ args.service-name }}"
            }[5m]))
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency
spec:
  args:
    - name: service-name
  metrics:
    - name: latency
      interval: 5m
      count: 3
      successCondition: result[0] <= 500
      failureLimit: 1
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{
                service="{{ args.service-name }}"
              }[5m])) by (le)
            ) * 1000
```

**Explanation:**
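Argo Rollouts:
1. Starts with 5% traffic to canary
2. Pauses for 10 minutes at each step; the `analysis` block under `strategy.canary` runs as background analysis alongside the steps
3. Each AnalysisTemplate queries Prometheus every 5 minutes for up to 3 measurements (count: 3), expecting a success rate of at least 99% and p99 latency of at most 500ms
4. `failureLimit` is the number of failed measurements tolerated: the `success-rate` template uses the default of 0, so a single failed measurement aborts the rollout, while `latency` (failureLimit: 1) tolerates one failure and aborts on the second. An abort shifts traffic back to the previous stable ReplicaSet
5. If no metric fails, the rollout advances through the remaining weight steps as each pause expires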
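The relationship between `count` and `failureLimit` is easy to misread, so a small model may help. The sketch below scores a series of measurements under the assumption that a metric fails once its failures exceed `failureLimit` and succeeds once `count` measurements have been taken; `scoreMetric` is a hypothetical function, not part of Argo Rollouts.

```javascript
// Score a metric from its measurements.
// - Fails as soon as the number of failed measurements exceeds failureLimit.
// - Succeeds once `count` measurements were taken without exceeding it.
function scoreMetric({ count, failureLimit = 0, successCondition }, results) {
  let failures = 0;
  const taken = Math.min(count, results.length);
  for (let i = 0; i < taken; i++) {
    if (!successCondition(results[i])) failures += 1;
    if (failures > failureLimit) return 'Failed';
  }
  return results.length >= count ? 'Successful' : 'Running';
}

const successRate = { count: 3, successCondition: r => r >= 0.99 };
const measurements = [0.995, 0.98, 0.999]; // the second measurement dips below 99%

console.log(scoreMetric(successRate, measurements));                        // Failed
console.log(scoreMetric({ ...successRate, failureLimit: 1 }, measurements)); // Successful
```

The same three measurements abort the rollout with the default limit but pass with `failureLimit: 1`, which is why the limit should be chosen deliberately for each metric.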

## 40.5 Metric-Based Promotion

Metric-based promotion advances a release on the basis of observability data rather than fixed time intervals.

### Custom Metrics for Promotion

Define service-level indicators (SLIs) for automated gates:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: business-metrics
spec:
  provider:
    type: prometheus
    address: http://prometheus:9090
  query: |
    # Business metric: Revenue per minute
    sum(
      rate(
        payment_revenue_usd_total{
          namespace="{{ namespace }}",
          service="{{ service }}"
        }[1m]
      )
    )
---
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-service
spec:
  analysis:
    metrics:
      - name: revenue-impact
        templateRef:
          name: business-metrics
        thresholdRange:
          min: 1000  # $1000/minute minimum
        interval: 1m
```

**Explanation:**
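The canary is promoted only if the revenue metric stays healthy. If the new version introduces a bug that prevents checkout completion, revenue drops below $1000/min and the failed checks trigger an automatic rollback. Because absolute revenue varies with time of day, a ratio against the stable version is usually a more robust gate than a fixed dollar amount.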

### Datadog Integration

Using external observability platforms:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: datadog-error-rate
spec:
  provider:
    type: datadog
    address: https://api.datadoghq.com
    secretRef:
      name: datadog  # Secret assumed to hold datadog-api-key and datadog-application-key
  query: |
    sum:trace.web.request.errors{
      service:payment-service,
      version:canary
    }.as_rate()
---
apiVersion: flagger.app/v1beta1
kind: Canary
spec:
  analysis:
    metrics:
      - name: datadog-errors
        templateRef:
          name: datadog-error-rate
        thresholdRange:
          max: 10  # 10 errors per second max
```

### CloudWatch Integration

For AWS-native monitoring:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: cloudwatch-latency
spec:
  provider:
    type: cloudwatch
    region: us-east-1
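  # Flagger's CloudWatch provider takes a JSON list of metric data queries (alias `canary` assumed)
  query: |
    [
      {
        "Id": "duration",
        "MetricStat": {
          "Metric": {
            "Namespace": "AWS/Lambda",
            "MetricName": "Duration",
            "Dimensions": [
              {"Name": "FunctionName", "Value": "payment-processor"},
              {"Name": "Resource", "Value": "payment-processor:canary"}
            ]
          },
          "Period": 60,
          "Stat": "Average"
        }
      }
    ]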
```

## 40.6 Analysis Templates

Reusable metric definitions keep canary analysis consistent across services.

### Standard Analysis Template

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: http-success-rate
  namespace: flagger
spec:
  provider:
    type: prometheus
    address: http://prometheus:9090
  query: |
    100 - sum(
      rate(
        http_requests_total{
          namespace="{{ namespace }}",
          service="{{ service }}",
          status=~"5.."
        }[1m]
      )
    )
    /
    sum(
      rate(
        http_requests_total{
          namespace="{{ namespace }}",
          service="{{ service }}"
        }[1m]
      )
    ) * 100
```

**Usage Across Services:**
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: user-service
spec:
  analysis:
    metrics:
      - name: success-rate
        templateRef:
          name: http-success-rate
          namespace: flagger
        thresholdRange:
          min: 99.9
```

### Composite Analysis

Several metrics can be combined in one analysis; every metric must stay within its threshold for a check to pass:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: critical-service
spec:
  analysis:
    interval: 1m
    threshold: 2
    metrics:
      # Latency check
      - name: latency-p99
        interval: 30s
        thresholdRange:
          max: 1000
        query: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[1m])) by (le)
          ) * 1000
          
      # Error rate check  
      - name: error-rate
        interval: 30s
        thresholdRange:
          max: 1
        query: |
          sum(rate(http_requests_total{status=~"5.."}[1m]))
          /
          sum(rate(http_requests_total[1m])) * 100
          
      # Custom business metric
      - name: checkout-completion
        interval: 1m
        thresholdRange:
          min: 95
        query: |
          sum(rate(checkout_completed_total[1m]))
          /
          sum(rate(checkout_started_total[1m])) * 100
```

## 40.7 Progressive Delivery Tools

### Flagger

Flagger is a Kubernetes operator that automates canary deployments on top of service meshes and ingress controllers.

**Installation:**
```bash
kubectl apply -k github.com/fluxcd/flagger//kustomize/flagger?ref=main

# With Istio
kubectl apply -k github.com/fluxcd/flagger//kustomize/istio?ref=main
```

**Load Tester:**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flagger-loadtester
spec:
  replicas: 1
  selector:
    matchLabels:
      app: loadtester
  template:
    metadata:
      labels:
        app: loadtester  # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: loadtester
          image: ghcr.io/fluxcd/flagger-loadtester:1.0.0
          command:
            - ./loadtester
            - -port=8080
---
apiVersion: v1
kind: Service
metadata:
  name: flagger-loadtester
spec:
  selector:
    app: loadtester
  ports:
    - port: 80
      targetPort: 8080
```

### Argo Rollouts

Argo Rollouts provides advanced deployment strategies including canary, blue-green, and experiments.

**Installation:**
```bash
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
```

**CLI:**
```bash
curl -LO https://github.com/argoproj/argo-rollouts/releases/latest/download/kubectl-argo-rollouts-linux-amd64
chmod +x kubectl-argo-rollouts-linux-amd64
mv kubectl-argo-rollouts-linux-amd64 /usr/local/bin/kubectl-argo-rollouts

# View rollout status
kubectl argo rollouts get rollout payment-service
kubectl argo rollouts promote payment-service  # Manual promotion
kubectl argo rollouts abort payment-service    # Manual rollback
```

### Spinnaker

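Spinnaker, a continuous delivery platform originally developed at Netflix, models releases as pipelines of stages. The following stage definition is a simplified illustration rather than an exact pipeline schema; in practice, automated canary analysis in Spinnaker is provided by its Kayenta service: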

```json
{
  "stages": [
    {
      "type": "deploy",
      "name": "Deploy Canary",
      "clusters": [
        {
          "account": "production",
          "application": "payment-service",
          "strategy": "redblack",
          "canary": {
            "enabled": true,
            "analysisType": "realtime",
            "canaryResult": {
              "scoreThresholds": {
                "marginal": 75,
                "pass": 95
              }
            }
          }
        }
      ]
    }
  ]
}
```

### Harness

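Harness is a commercial, cloud-based delivery platform with AI-driven analysis of canary steps. The following YAML is a schematic of a stepped canary with verification thresholds, not literal Harness pipeline syntax: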

```yaml
apiVersion: harness.io/v1
kind: CanaryWorkflow
metadata:
  name: payment-service
spec:
  service: payment-service
  canarySteps:
    - step: 10
      analysis:
        threshold: 95
    - step: 25
      analysis:
        threshold: 95
    - step: 50
      analysis:
        threshold: 95
    - step: 100
```

## 40.8 Best Practices

### Safety First

**1. Start Conservative**
Begin with small traffic percentages (5-10%) and short analysis intervals (1-2 minutes). Gradually increase as confidence grows.

**2. Define SLOs**
Establish clear Service Level Objectives before canarying:
- Latency p99 < 500ms
- Error rate < 0.1%
- Business metrics (checkout completion) > 99%
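Writing the SLOs down as data makes them usable as an automated gate. The sketch below checks observed values against the objectives listed above, using inclusive bounds as a simplification; the metric names and the `evaluateSlos` helper are hypothetical, and a missing metric is treated as a violation.

```javascript
// SLOs expressed as min/max bounds, mirroring the list above.
const slos = [
  { name: 'latency-p99-ms', max: 500 },
  { name: 'error-rate-percent', max: 0.1 },
  { name: 'checkout-completion-percent', min: 99 },
];

// Return the names of all violated SLOs; an empty array means the gate passes.
function evaluateSlos(objectives, observed) {
  return objectives
    .filter(({ name, min = -Infinity, max = Infinity }) => {
      const value = observed[name];
      // A missing or non-numeric value counts as a violation (fail closed).
      return !(typeof value === 'number' && value >= min && value <= max);
    })
    .map(slo => slo.name);
}

console.log(evaluateSlos(slos, {
  'latency-p99-ms': 650,
  'error-rate-percent': 0.05,
  'checkout-completion-percent': 99.4,
})); // [ 'latency-p99-ms' ]
```

Failing closed on missing data is a deliberate choice here: a canary that emits no metrics should not be promoted on the strength of silence.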

**3. Automated Rollbacks**
Never rely on manual rollback for production canaries. Humans are too slow during incidents.

### Testing in Production

**1. Synthetic Traffic**
Generate synthetic requests to canary versions to ensure code paths are exercised:

```yaml
webhooks:
  - name: synthetic-tests
    type: rollout
    url: http://synthetic-monitor.test/
    timeout: 30s
    metadata:
      tests: |
        - name: payment-flow
          steps:
            - POST /api/v1/payments {amount: 100}
            - GET /api/v1/payments/{id}/status
            - assert: status == 'completed'
```

**2. Shadow Traffic**
Mirror production traffic to canary without affecting users:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
    - payment-service
  http:
    - match:
        - uri:
            prefix: /api/v1/payments
      route:
        - destination:
            host: payment-service
            subset: stable
          weight: 100
      mirror:
        host: payment-service
        subset: canary
      mirrorPercentage:
        value: 100.0  # Mirror 100% of traffic
```

**Explanation:**
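The `mirror` field sends a copy of each request to the canary subset. The canary processes the request, but its response is discarded, so users only ever see the stable response. This validates canary behavior under real production load. Mirrored requests are still executed, however, so for a payment service the canary must not perform real side effects (charging cards, writing to shared tables) while it receives shadow traffic. Istio appends `-shadow` to the `Host`/`Authority` header of mirrored requests, which the application can use to detect them.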

### Observability

**1. Distinct Metrics**
Tag canary metrics separately:

```java
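import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class CanaryMetrics {
    private final MeterRegistry registry;

    // Constructor injection keeps the registry final and simplifies testing
    public CanaryMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    public void recordPayment(Payment payment, boolean isCanary) {
        // Tag each sample with the serving version so canary and stable
        // can be queried and compared separately
        registry.counter("payments.processed",
            "version", isCanary ? "canary" : "stable",
            "status", String.valueOf(payment.getStatus()) // tag values must be strings
        ).increment();
    }
}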
```
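
Once counters carry a `version` tag, comparing the two versions reduces to simple arithmetic over the per-version totals. The sketch below assumes the counts have already been read from the metrics backend; `VersionComparison`, the absolute error-rate margin, and the minimum sample size are illustrative choices, not settings from any specific tool.

```java
public class VersionComparison {
    /** Error rate as failed / total; 0.0 when no requests have been recorded yet. */
    public static double errorRate(long failed, long total) {
        return total == 0 ? 0.0 : (double) failed / total;
    }

    /**
     * Returns true when the canary's error rate exceeds the stable error rate
     * by more than allowedMargin. With fewer than minSamples canary requests
     * the data is treated as inconclusive and the method returns false.
     */
    public static boolean canaryIsWorse(long stableFailed, long stableTotal,
                                        long canaryFailed, long canaryTotal,
                                        double allowedMargin, long minSamples) {
        if (canaryTotal < minSamples) {
            return false; // too little canary traffic to judge
        }
        double difference = errorRate(canaryFailed, canaryTotal)
                          - errorRate(stableFailed, stableTotal);
        return difference > allowedMargin;
    }

    public static void main(String[] args) {
        // Stable: 10 failures in 10,000 requests; canary: 30 failures in 1,000 requests
        System.out.println(canaryIsWorse(10, 10_000, 30, 1_000, 0.005, 100)); // prints true
    }
}
```

Comparing against the stable version rather than a fixed threshold accounts for conditions that affect both versions equally, such as a degraded downstream dependency.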

**2. Dashboards**
Create canary-specific Grafana dashboards comparing canary vs. stable metrics side-by-side.

**3. Alerting**
Alert on canary analysis failures but use low-severity notifications (Slack) rather than paging, since automated rollback handles the incident.

### Database Considerations

**1. Schema Compatibility**
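Canary and stable pods share one database during a rollout, so both versions must work against the same schema. Follow the expand-contract pattern: add new columns or tables first (expand), deploy code that tolerates both shapes, and remove the old structures only after the previous version is fully retired (contract).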

**2. Feature Flags for Schema**
Use flags to disable new DB features if rollback occurs:

```java
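// Use the new index only while the flag is on and no rollback is in progress
if (featureFlags.isEnabled("new-index-usage") && !isRollbackMode()) {
    return repository.findUsingNewIndex(query);
}
// Otherwise fall back to the existing query path (method name is illustrative)
return repository.findUsingExistingIndex(query);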
```
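
Code that runs during the expand phase has to tolerate rows in both the old and the new shape. The following sketch shows one way to read such a row; the `full_name`, `first_name`, and `last_name` columns and the `CustomerRowReader` class are hypothetical and serve only to illustrate the pattern.

```java
import java.util.HashMap;
import java.util.Map;

public class CustomerRowReader {
    /**
     * Reads a display name during the expand phase: the new full_name column
     * exists but may be empty for rows that have not been backfilled yet, so
     * the old first_name and last_name columns remain the fallback.
     */
    public static String displayName(Map<String, String> row) {
        String fullName = row.get("full_name");
        if (fullName != null && !fullName.trim().isEmpty()) {
            return fullName.trim();
        }
        String first = row.getOrDefault("first_name", "");
        String last = row.getOrDefault("last_name", "");
        return (first + " " + last).trim();
    }

    public static void main(String[] args) {
        Map<String, String> legacyRow = new HashMap<>();
        legacyRow.put("first_name", "Ada");
        legacyRow.put("last_name", "Lovelace");
        System.out.println(displayName(legacyRow)); // prints Ada Lovelace
    }
}
```

Because the reader works with either shape, the same build can serve as canary or be rolled back to without a schema change in between.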

---

## Chapter Summary and Preview

This chapter established progressive delivery as a risk mitigation strategy that decouples deployment from release, enabling validation of production changes with minimal blast radius. We examined feature flags as a fundamental primitive, from simple environment-based toggles to sophisticated user-targeted flags using LaunchDarkly, enabling trunk-based development and gradual feature exposure. Canary releases using Flagger and Argo Rollouts automate the gradual traffic shifting process, routing small percentages of users to new versions while monitoring key metrics.

Traffic splitting strategies using Istio service mesh enable sophisticated routing based on headers, cookies, and geography, ensuring internal testers and beta users can validate changes before general availability. Automated rollback mechanisms detect anomalies in success rates, latency, and business metrics, reverting traffic to stable versions within seconds of detecting degradation without requiring human intervention or redeployment. Metric-based promotion using Prometheus, Datadog, or CloudWatch ensures releases proceed only when empirical evidence confirms stability, replacing time-based deployment windows with data-driven quality gates.

Analysis templates provide reusable metric definitions across services, ensuring consistent canary criteria organization-wide. The tooling landscape includes Flagger for Kubernetes-native automation, Argo Rollouts for advanced deployment strategies, and enterprise platforms like Spinnaker and Harness for complex pipeline orchestration.

**Key Takeaways:**
- Always decouple deployment from release using feature flags, enabling code to be deployed to production while remaining dormant or restricted to specific user segments.
- Implement automated canary analysis with multiple metric dimensions (latency, error rate, business metrics) and conservative thresholds; never rely on manual rollback for production safety.
- Use shadow traffic (traffic mirroring) to validate canary versions under production load without user impact, particularly for critical payment or transaction processing systems.
- Ensure database schema changes follow the expand-contract pattern so canary versions can safely roll back without schema incompatibility or data loss.
- Tag metrics by version (canary vs. stable) to enable precise comparison and automated decision-making during progressive delivery.

**Next Chapter Preview:**
Chapter 41: Database CI/CD addresses the unique challenges of evolving database schemas within continuous delivery pipelines. We will explore migration-based schema management using tools like Flyway and Liquibase, strategies for testing database changes in ephemeral environments, handling long-running transactions during deployments, and ensuring backward compatibility between application versions and database schemas. The chapter also covers database-per-service patterns, schema versioning strategies, and rollback procedures for data definition language (DDL) changes. Together these complete the progressive delivery foundation, so that data layer changes can be deployed with the same confidence and safety as application code.

