# **Chapter 11: CI/CD Pipeline Management**

---

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Implement Pipeline as Code using industry-standard formats (Jenkinsfile, GitHub Actions, GitLab CI)
- Design build automation strategies that optimize for speed, reliability, and reproducibility
- Integrate comprehensive testing strategies (Unit, Integration, E2E) into CI/CD pipelines
- Select and implement deployment strategies (Blue-Green, Canary, Rolling) based on risk tolerance and business requirements
- Apply DevSecOps principles to shift security left in the delivery pipeline

---

## **Real-World Case Study: The 3-Hour Build**

You're the Engineering Manager for "ShopStream," an e-commerce platform. Monday morning, 9:00 AM, your lead developer messages you:

*"The build is broken. Again. It takes 45 minutes to fail, and the error is 'connection timeout to test database.' I tried rerunning it 3 times. The demo to the board is in 2 hours."*

You investigate and find a cascade of issues:

- **The Build**: 3 hours to complete when it works (used to be 15 minutes)
- **The Tests**: Flaky integration tests that fail randomly 30% of the time
- **The Artifacts**: Each build produces a 2GB Docker image (bloated with dev dependencies)
- **The Deployment**: Manual "copy-paste" deployment instructions in a Confluence page
- **The Security**: Production credentials hardcoded in the pipeline YAML (visible to all 50 developers)
- **The Rollback**: Last deployment took 6 hours to roll back because "the database migration already ran"

The immediate crisis: The board demo is at risk. The deeper crisis: Your team is spending 40% of their time dealing with pipeline issues instead of building features.

This scenario illustrates that CI/CD isn't just "automation"—it's the circulatory system of your software delivery. When it's clogged, the whole organization suffers.

---

## **11.1 Pipeline as Code (Jenkinsfile, GitHub Actions, GitLab CI)**

### **The Evolution: From GUI to Code**

**Generation 1: Manual Click-Ops (2000s)**
- Log into Jenkins web UI
- Click "Configure"
- Add build steps in text boxes
- **Problem**: Configuration drift, no versioning, "works on my Jenkins"

**Generation 2: Scripted Pipelines (2010s)**
- Jenkins Job DSL
- XML configuration files
- **Problem**: Proprietary formats, vendor lock-in

**Generation 3: Pipeline as Code (Modern)**
- Jenkinsfile (Groovy-based)
- GitHub Actions (YAML)
- GitLab CI (YAML)
- Azure Pipelines (YAML)
- **Benefit**: Version controlled, code-reviewed, portable

---

### **Core Concepts**

**1. Declarative vs. Scripted**

**Declarative** (Recommended for most teams):
- Structured, opinionated syntax
- Built-in validation
- Easier to read for non-experts
- Better for simple to medium complexity

**Scripted** (Advanced use cases):
- Groovy-based (Jenkins)
- Full programming language flexibility
- Better for complex logic, loops, parallel execution
- Harder to maintain

**2. The Pipeline Structure**

Most pipelines follow the same basic pattern, although individual stages may be skipped or reordered:
```
Trigger → Checkout → Build → Test → Security Scan → Artifact → Deploy → Notify
```

**Key Components**:
- **Stages**: Logical groupings (Build, Test, Deploy)
- **Steps**: Individual commands (shell scripts, API calls)
- **Agents/Runners**: Where the pipeline executes
- **Environment**: Variables and secrets
- **Artifacts**: Persistent outputs (binaries, reports)
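
To make the relationship between stages and steps concrete, here is a minimal sketch of a pipeline runner. It is not how any particular CI platform is implemented; the `Stage` class, `run_pipeline`, and the three step functions are hypothetical names used only for illustration. It shows the one behavior every platform shares: stages run in order, and a failing step stops everything after it.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Stage:
    """A logical grouping of steps, e.g. Build, Test, or Deploy."""
    name: str
    steps: List[Callable[[], bool]] = field(default_factory=list)


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failing step.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for stage in stages:
        for step in stage.steps:
            if not step():
                # Fail fast: later stages never start.
                return completed
        completed.append(stage.name)
    return completed


# Hypothetical steps standing in for real shell commands or API calls.
def compile_ok() -> bool:
    return True


def unit_tests_fail() -> bool:
    return False


def deploy_ok() -> bool:
    return True


pipeline = [
    Stage("Build", [compile_ok]),
    Stage("Test", [unit_tests_fail]),
    Stage("Deploy", [deploy_ok]),
]

completed_stages = run_pipeline(pipeline)
print(completed_stages)  # ['Build']
```

Because the Test stage fails, Deploy never runs. That ordering is what turns a list of commands into a quality gate.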

---

### **GitHub Actions Deep Dive**

**Architecture**:
- **Workflow**: Top-level YAML file (`.github/workflows/ci.yml`)
- **Jobs**: Parallel or sequential units of work
- **Steps**: Commands within a job
- **Actions**: Reusable components (from Marketplace or internal)

**Example: Comprehensive CI/CD Pipeline**

```yaml
# .github/workflows/main.yml
name: CI/CD Pipeline

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]
  workflow_dispatch:  # Manual trigger

env:
  NODE_VERSION: '18.x'
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  # Job 1: Lint and Unit Tests (Fast feedback)
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
      
      - name: Install dependencies
        run: npm ci
      
      - name: Run linter
        run: npm run lint
      
      - name: Run unit tests
        run: npm run test:unit -- --coverage
      
      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage/lcov.info

  # Job 2: Integration Tests (Requires services)
  integration-tests:
    runs-on: ubuntu-latest
    needs: quality-gate  # Wait for quality gate
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: test  # Must match the database name in DATABASE_URL
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
        ports:
          - 5432:5432
      redis:
        image: redis:7  # Pin the version for reproducible builds
        ports:
          - 6379:6379
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
      
      - name: Install dependencies
        run: npm ci
      
      - name: Run database migrations
        run: npm run db:migrate
        env:
          DATABASE_URL: postgresql://postgres:postgres@localhost:5432/test
      
      - name: Run integration tests
        run: npm run test:integration
        env:
          DATABASE_URL: postgresql://postgres:postgres@localhost:5432/test

  # Job 3: Security Scanning
  security-scan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write  # Required to upload SARIF results
    needs: quality-gate
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history so the secret scan can compare base and HEAD
      
      - name: Run Trivy vulnerability scanner
        # In real pipelines, pin to a release tag or commit SHA instead of a moving branch
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
          format: 'sarif'
          output: 'trivy-results.sarif'
      
      - name: Upload Trivy results
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: 'trivy-results.sarif'
      
      - name: Check for secrets
        uses: trufflesecurity/trufflehog@main
        with:
          path: ./
          base: ${{ github.event.repository.default_branch }}
          head: HEAD
          extra_args: --debug --only-verified

  # Job 4: Build and Push Artifact
  build-artifact:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write  # Required to push images to ghcr.io with GITHUB_TOKEN
    needs: [integration-tests, security-scan]
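    outputs:
      # Expose one image reference; steps.meta.outputs.tags can hold several tags, one per line.
      # Assumes the repository name is lowercase, as container image names must be.
      image-tag: ${{ env.REGISTRY }}/${{ github.repository }}:${{ steps.version.outputs.version }}
      version: ${{ steps.version.outputs.version }}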
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Generate version
        id: version
        run: |
          VERSION=$(date +'%Y.%m.%d')-${GITHUB_SHA::7}
          echo "version=$VERSION" >> $GITHUB_OUTPUT
      
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      
      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      
      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=semver,pattern={{version}}
            type=sha
            type=raw,value=${{ steps.version.outputs.version }}
      
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: ${{ github.event_name != 'pull_request' }}  # Build on pull requests, but only push from branch builds
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          build-args: |
            VERSION=${{ steps.version.outputs.version }}

  # Job 5: Deploy to Staging
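  deploy-staging:
    runs-on: ubuntu-latest
    needs: build-artifact
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'  # Never deploy pull requests or other branches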
    environment:
      name: staging
      url: https://staging.example.com
    steps:
      - name: Deploy to Staging
        run: |
          echo "Deploying ${{ needs.build-artifact.outputs.image-tag }}"
          # kubectl set image deployment/app app=${{ needs.build-artifact.outputs.image-tag }}
          # or use a deployment tool like ArgoCD, Helm, etc.

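  # Job 6: Production Deployment (manual approval)
  # The approval gate comes from required reviewers configured on the
  # "production" environment in the repository settings, not from this YAML.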
  deploy-production:
    runs-on: ubuntu-latest
    needs: [build-artifact, deploy-staging]
    environment:
      name: production
      url: https://example.com
    steps:
      - name: Deploy to Production
        run: |
          echo "Deploying to production..."
          # Production deployment commands
```

---

### **GitLab CI/CD**

GitLab uses a single `.gitlab-ci.yml` file with a different structure:

```yaml
# .gitlab-ci.yml
stages:
  - build
  - test
  - security
  - deploy

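variables:
  NODE_VERSION: "18"
  DOCKER_IMAGE: $CI_REGISTRY_IMAGE
  npm_config_cache: "$CI_PROJECT_DIR/.npm"  # Keep npm's download cache inside the project directory

# Run jobs in a Node.js image unless a job overrides it
default:
  image: node:${NODE_VERSION}

# Cache npm's download cache between jobs and pipelines.
# `npm ci` deletes node_modules before installing, so caching node_modules does not help.
cache:
  key: ${CI_COMMIT_REF_SLUG}
  paths:
    - .npm/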

# Job templates
.unit_test_template: &unit_test
  stage: test
  script:
    - npm ci
    - npm run test:unit -- --coverage
  coverage: '/All files[^|]*\|[^|]*\s+([\d\.]+)/'
  artifacts:
    reports:
      coverage_report:
        coverage_format: cobertura
        path: coverage/cobertura-coverage.xml
    paths:
      - coverage/

# Specific jobs
lint:
  stage: build
  script:
    - npm ci
    - npm run lint
  only:
    - merge_requests
    - main

unit_tests:
  <<: *unit_test
  only:
    - merge_requests
    - main

integration_tests:
  stage: test
  services:
    - postgres:15
    - redis:7
  variables:
    POSTGRES_DB: test
    POSTGRES_USER: postgres
    POSTGRES_PASSWORD: postgres
    DATABASE_URL: postgresql://postgres:postgres@postgres/test
  script:
    - npm ci
    - npm run db:migrate
    - npm run test:integration
  only:
    - main

container_scanning:
  stage: security
  image: docker:24
  services:
    - docker:24-dind
  script:
    - docker build -t $DOCKER_IMAGE:$CI_COMMIT_SHA .
    # Report only (--exit-code 0); use --exit-code 1 with --severity to fail the job
    - docker run --rm -v /var/run/docker.sock:/var/run/docker.sock
      -v $(pwd):/tmp aquasec/trivy image --exit-code 0
      --format template --template "@contrib/gitlab.tpl"
      -o /tmp/gl-container-scanning-report.json $DOCKER_IMAGE:$CI_COMMIT_SHA
  artifacts:
    reports:
      container_scanning: gl-container-scanning-report.json

deploy_staging:
  stage: deploy
  script:
    - echo "Deploy to staging"
  environment:
    name: staging
    url: https://staging.example.com
  only:
    - main

deploy_production:
  stage: deploy
  script:
    - echo "Deploy to production"
  environment:
    name: production
    url: https://example.com
  when: manual  # Requires button click
  only:
    - main
```

---

### **Project Management Considerations**

**1. Pipeline Speed Optimization**

**The Goal**: Feedback in under 10 minutes for basic checks, full pipeline under 30 minutes.

**Techniques**:
- **Parallelization**: Run independent jobs concurrently (in the GitHub Actions example, the integration tests and the security scan run in parallel once the quality gate passes)
- **Caching**: Cache dependencies (`node_modules`, `~/.m2`, `~/.gradle`)
- **Docker Layer Caching**: Reuse unchanged layers in builds
- **Incremental Builds**: Only test what changed (monorepo tools like Nx, Turborepo)
- **Test Splitting**: Distribute tests across multiple runners
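
The effect of parallelization can be estimated before changing any YAML. The sketch below computes the critical path through a job graph shaped like the GitHub Actions example (staging deploy included, production omitted). The durations are hypothetical, and it assumes enough runners are available that no job waits in a queue.

```python
from functools import lru_cache

# Job durations in minutes (hypothetical values) and `needs` dependencies
# mirroring the job graph of the GitHub Actions example.
DURATION = {
    "quality-gate": 4,
    "integration-tests": 10,
    "security-scan": 6,
    "build-artifact": 5,
    "deploy-staging": 3,
}
NEEDS = {
    "quality-gate": [],
    "integration-tests": ["quality-gate"],
    "security-scan": ["quality-gate"],
    "build-artifact": ["integration-tests", "security-scan"],
    "deploy-staging": ["build-artifact"],
}


@lru_cache(maxsize=None)
def finish_time(job: str) -> int:
    """Earliest finish time of a job when independent jobs run in parallel."""
    start = max((finish_time(dep) for dep in NEEDS[job]), default=0)
    return start + DURATION[job]


sequential_minutes = sum(DURATION.values())
parallel_minutes = max(finish_time(job) for job in DURATION)

print(sequential_minutes)  # 28
print(parallel_minutes)    # 22
```

Only jobs on the critical path (here the integration tests, not the security scan) are worth speeding up; shaving time off the security scan would not change the total.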

**2. Pipeline Reliability**

**Flaky Test Management**:
- Detect and quarantine flaky tests automatically
- Rerun failed tests up to N times (temporary band-aid)
- Root cause analysis: Is it the test, the code, or the infrastructure?

**Infrastructure as Code for CI**:
- Define runner infrastructure in Terraform
- Auto-scale runners based on queue depth
- Ephemeral runners (fresh VM/container per job) for consistency

**3. Cost Management**

Cloud CI/CD costs money:
- **Compute time**: Optimize job duration
- **Storage**: Artifact retention policies (delete old artifacts)
- **Parallelism**: Balance speed vs. cost (don't use 100 parallel runners for a 5-minute job)
- **Spot instances**: Use preemptible VMs for non-critical jobs
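
The speed-versus-cost trade-off is simple arithmetic. As a rough sketch, assume a test suite with 60 minutes of total work that splits evenly, and a fixed 2 minutes of setup (checkout, dependency install) on every runner. Both numbers and the `parallel_run` helper are hypothetical.

```python
import math


def parallel_run(total_test_minutes: int, runners: int, setup_minutes: int):
    """Return (wall_clock_minutes, billed_minutes) for an evenly split suite."""
    per_runner = math.ceil(total_test_minutes / runners) + setup_minutes
    return per_runner, per_runner * runners


for runners in (1, 4, 20, 100):
    wall, billed = parallel_run(total_test_minutes=60, runners=runners, setup_minutes=2)
    print(f"{runners:>3} runners: {wall:>2} min wall clock, {billed:>3} min billed")
```

Under these assumptions, going from 20 to 100 runners saves 2 minutes of wall-clock time and triples the billed minutes, because the fixed setup cost is paid once per runner.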

**4. Compliance and Auditability**

- **Immutable Logs**: Store pipeline logs in tamper-proof storage (S3 with Object Lock)
- **Approval Gates**: Manual approvals for production with sign-off records
- **SBOM Generation**: Software Bill of Materials generated in pipeline
- **Traceability**: Link every artifact back to commit, issue, and author

---

## **11.2 Build Automation and Artifact Management**

### **The Build Process**

A build is more than "compile the code":

```
Source Code → [Lint] → [Compile] → [Test] → [Package] → [Publish]
                ↓          ↓          ↓          ↓           ↓
             Quality    Binary      Verified   Deployable  Registry
             Gates      Artifact    Code       Unit        (Docker Hub,
                                                (JAR, EXE)   NPM, PyPI)
```

**Build Environments**:
Must be **reproducible** and **ephemeral**:
- Same OS version every time
- Same dependency versions (lock files)
- No leftover files from previous builds
- Infrastructure as Code for build agents

---

### **Artifact Management**

**What is an Artifact?**
Any file produced by the build process:
- Compiled binaries (JAR, WAR, EXE)
- Docker images
- NPM packages
- Documentation (PDF, HTML)
- Test reports and coverage data

**Artifact Repository Strategy**:

```
Development                    Production
     │                              │
     ▼                              ▼
┌──────────┐                  ┌──────────┐
│  Build   │ ──Push──→        │ Artifact │
│  Server  │   Artifact       │  Store   │
└──────────┘                  └────┬─────┘
                                   │
         ┌─────────────────────────┼─────────────────────────┐
         ▼                         ▼                         ▼
    ┌─────────┐              ┌──────────┐              ┌──────────┐
    │  Dev    │              │ Staging  │              │  Prod    │
    │  Env    │              │   Env    │              │   Env    │
    └─────────┘              └──────────┘              └──────────┘
```

**Best Practices**:
1. **Immutable Artifacts**: Once built, never modify. If you need to change it, build a new version.
2. **Semantic Versioning**: Tag artifacts with version numbers
3. **Metadata**: Store build timestamp, git commit SHA, author
4. **Retention**: Keep production artifacts for as long as compliance requires (often years); expire dev artifacts after about 30 days
5. **Promotion**: Copy artifacts between repositories (don't rebuild for each environment)
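
The practices above can be sketched in a few lines. This is an illustration rather than the API of any artifact repository: `describe_artifact` and `promote` are hypothetical helpers, a plain dictionary stands in for the target repository, and the commit SHA and author are made-up values. The version format follows the date-plus-short-SHA scheme used in the GitHub Actions example.

```python
import hashlib
from datetime import datetime, timezone


def build_version(commit_sha: str, built_at: datetime) -> str:
    """Date plus short commit SHA, e.g. 2024.05.17-3f9c2ab."""
    return f"{built_at:%Y.%m.%d}-{commit_sha[:7]}"


def describe_artifact(content: bytes, commit_sha: str, author: str, built_at: datetime) -> dict:
    """Metadata stored next to the artifact; the digest pins its exact bytes."""
    return {
        "version": build_version(commit_sha, built_at),
        "commit": commit_sha,
        "author": author,
        "built_at": built_at.isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


def promote(content: bytes, metadata: dict, target_repo: dict) -> None:
    """Copy an artifact to another repository only if its bytes are unchanged."""
    if hashlib.sha256(content).hexdigest() != metadata["sha256"]:
        raise ValueError("artifact was modified after build; refusing to promote")
    target_repo[metadata["version"]] = content


# Hypothetical build output and commit.
artifact = b"compiled application bytes"
meta = describe_artifact(
    artifact,
    commit_sha="3f9c2ab41d0e7c55a1b2c3d4e5f60718293a4b5c",
    author="dev@example.com",
    built_at=datetime(2024, 5, 17, tzinfo=timezone.utc),
)

staging_repo = {}
promote(artifact, meta, staging_repo)  # Same bytes move forward, no rebuild
print(meta["version"])  # 2024.05.17-3f9c2ab
```

Checking the digest at promotion time is what makes "build once, promote everywhere" enforceable: staging and production provably run the bytes that were tested.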

---

### **Docker Build Optimization**

**Multi-Stage Builds** (Essential for production):

```dockerfile
# Stage 1: Dependencies
FROM node:18-alpine AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev

# Stage 2: Build (dev dependencies included)
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 3: Production (smallest possible)
FROM node:18-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production

# Copy only necessary files from previous stages
COPY --from=deps /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist
COPY package.json ./

# Security: Run as non-root (BusyBox syntax, as used on Alpine)
RUN addgroup -S -g 1001 nodejs \
    && adduser -S -u 1001 -G nodejs nodejs
USER nodejs

EXPOSE 3000
CMD ["node", "dist/main.js"]
```

**Illustrative results** (actual sizes depend on the application):
- Without multi-stage: 1.2GB image (includes TypeScript compiler, dev tools)
- With multi-stage: 180MB image (only runtime + compiled code)

**BuildKit Features**:
```dockerfile
# syntax=docker/dockerfile:1
# Mount secrets (don't leak in layers)
# Build with: docker build --secret id=npmrc,src=$HOME/.npmrc .
RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm ci

# Cache mount (reuse between builds)
RUN --mount=type=cache,target=/root/.npm \
    npm ci
```

---

### **Artifact Security**

**Signing and Verification**:
```bash
# Sign artifact with Cosign (part of Sigstore)
cosign sign --key cosign.key myregistry/myapp:v1.0.0

# Verify before deployment
cosign verify --key cosign.pub myregistry/myapp:v1.0.0
```

**SBOM (Software Bill of Materials)**:
```bash
# Generate SBOM with Syft
syft myregistry/myapp:v1.0.0 -o spdx-json > sbom.json

# Attach to the image in the registry as a signed attestation
cosign attest --key cosign.key --type spdxjson --predicate sbom.json myregistry/myapp:v1.0.0
```

**Vulnerability Scanning**:
- Scan images before pushing to registry
- Block deployments with Critical/High CVEs
- Regular rescans of deployed images (new vulnerabilities discovered daily)
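
A blocking rule like "no Critical or High CVEs" boils down to a filter over scanner findings. The sketch below assumes the scanner output has already been parsed into a list of dictionaries with `id` and `severity` keys. That shape, the function names, and the placeholder IDs are assumptions for illustration, not the output format of any specific scanner.

```python
from typing import Dict, List

BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


def blocking_findings(findings: List[Dict[str, str]]) -> List[str]:
    """Return the IDs of findings severe enough to block a deployment."""
    return [
        finding["id"]
        for finding in findings
        if finding["severity"].upper() in BLOCKING_SEVERITIES
    ]


def deployment_allowed(findings: List[Dict[str, str]]) -> bool:
    """The gate passes only when nothing blocking was found."""
    return not blocking_findings(findings)


# Placeholder findings; the IDs are not real CVEs.
scan_results = [
    {"id": "CVE-0000-0001", "severity": "LOW"},
    {"id": "CVE-0000-0002", "severity": "High"},
    {"id": "CVE-0000-0003", "severity": "MEDIUM"},
]

print(blocking_findings(scan_results))   # ['CVE-0000-0002']
print(deployment_allowed(scan_results))  # False
```

In a pipeline, the job would exit with a non-zero status when `deployment_allowed` returns `False`, so the deploy stage never starts.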

---

## **11.3 Testing in the Pipeline (Unit, Integration, E2E)**

### **The Testing Pyramid in CI**

```
            /\
           /  \
          / E2E\              ← Few tests, slow, expensive
         / Tests\               (Run in staging only)
        /________\
       /          \
      / Integration\          ← Medium tests, moderate speed
     /    Tests     \           (Run on every build)
    /________________\
   /                  \
  /        Unit        \      ← Many tests, fast, cheap
 /        Tests         \       (Run on every commit)
/________________________\
```

**Pipeline Strategy**:
- **Commit Stage** (Fast): Lint + Unit Tests (< 5 minutes)
- **Integration Stage** (Medium): Integration tests with databases/services (< 15 minutes)
- **Acceptance Stage** (Slow): E2E tests, performance tests (< 30 minutes)
- **Production Stage**: Smoke tests, synthetic monitoring

---

### **Test Parallelization and Optimization**

**Problem**: Test suites grow until they're too slow for CI.

**Solutions**:

**1. Test Sharding** (Split across multiple runners):
```yaml
# GitHub Actions strategy
strategy:
  matrix:
    shard: [1, 2, 3, 4]
steps:
  - name: Run tests
    run: npm test -- --shard=${{ matrix.shard }}/4
```
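
Sharding works best when the shards finish at about the same time. The following sketch shows one way to balance them, assuming recorded durations from a previous run are available. It is a greedy illustration, not the algorithm any test runner uses for its own sharding option, and the test names and timings are hypothetical.

```python
import heapq
from typing import Dict, List


def assign_shards(durations: Dict[str, int], shard_count: int) -> List[List[str]]:
    """Greedy balancing: place the slowest remaining test on the lightest shard."""
    # Heap entries are (total_seconds_so_far, shard_index).
    heap = [(0, index) for index in range(shard_count)]
    shards: List[List[str]] = [[] for _ in range(shard_count)]
    # Slowest first; ties broken by name so the result is deterministic.
    for test, seconds in sorted(durations.items(), key=lambda item: (-item[1], item[0])):
        total, index = heapq.heappop(heap)
        shards[index].append(test)
        heapq.heappush(heap, (total + seconds, index))
    return shards


# Hypothetical durations in seconds from a previous run.
recorded = {
    "checkout": 120,
    "search": 90,
    "cart": 60,
    "login": 30,
    "profile": 30,
    "api": 30,
}

shards = assign_shards(recorded, shard_count=2)
shard_totals = [sum(recorded[test] for test in shard) for shard in shards]
print(shards)        # [['checkout', 'api', 'login'], ['search', 'cart', 'profile']]
print(shard_totals)  # [180, 180]
```

Splitting the same six tests alphabetically into two halves would give 210 and 150 seconds, so the whole job would wait 30 seconds longer for the slower shard.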

**2. Test Selection** (Run only affected tests):
- **Jest**: `--changedSince=origin/main`
- **Nx**: `nx affected -t test` (older releases: `nx affected:test`)
- **Bazel**: Test only targets with changed dependencies

**3. Parallel Test Execution**:
```javascript
// jest.config.js
module.exports = {
  maxWorkers: '50%', // Use half available CPUs
  testTimeout: 30000,
  // Run slow tests first to optimize total time
  testSequencer: './custom-sequencer.js'
};
```

---

### **Flaky Test Management**

**Definition**: A test that passes and fails with the same code (non-deterministic).

**Common Causes**:
- Async timing issues (not waiting for elements)
- Shared state between tests (database not cleaned)
- External dependencies (network, time, randomness)
- Resource leaks (file handles, database connections)
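
The definition translates directly into a check over CI history: a test is flaky if the same commit produced both a pass and a fail. The sketch below assumes run results are available as `(test_name, commit_sha, passed)` tuples; the record shape, test names, and commit IDs are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def find_flaky(runs: Iterable[Tuple[str, str, bool]]) -> List[str]:
    """Return tests that both passed and failed on the same commit."""
    outcomes = defaultdict(set)
    for test, commit, passed in runs:
        outcomes[(test, commit)].add(passed)
    flaky = {test for (test, _commit), seen in outcomes.items() if seen == {True, False}}
    return sorted(flaky)


history = [
    ("checkout", "abc123", False),  # Fails, then passes on rerun: flaky
    ("checkout", "abc123", True),
    ("login", "abc123", False),     # Fails every time: broken, not flaky
    ("login", "abc123", False),
    ("search", "abc123", True),     # Different result on a different commit:
    ("search", "def456", False),    # a real regression, not flakiness
]

flaky_tests = find_flaky(history)
print(flaky_tests)  # ['checkout']
```

Keying on the commit matters: a test that starts failing after a code change is doing its job, and should not be quarantined.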

**Detection**:
```yaml
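# Flag tests that fail and then pass when rerun (flaky candidates)
- name: Run tests with flake detection
  run: |
    # Assumes Jest: on failure, rerun only the tests that failed.
    # A failure that passes on the rerun is flaky: warn, but don't fail the build.
    if ! npm test; then
      echo "::warning::Test failures detected; rerunning failed tests only"
      npm test -- --onlyFailures
    fi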
```

**Quarantine Strategy**:
```javascript
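// jest.config.js: skip known flaky tests in CI, but keep them tracked
const quarantined = [
  '<rootDir>/tests/checkout.test.js',         // Known flaky test - see JIRA-1234
  '<rootDir>/tests/payment-timeout.test.js'
];

module.exports = {
  // Quarantined files are ignored only in CI; they still run locally so they can be fixed
  testPathIgnorePatterns: ['/node_modules/', ...(process.env.CI ? quarantined : [])]
};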
```

---

## **11.4 Deployment Strategies (Blue-Green, Canary, Rolling)**

### **Strategy Selection Matrix**

| Strategy | Zero Downtime | Rollback Speed | Resource Cost | Risk Level | Complexity |
|----------|--------------|----------------|---------------|------------|------------|
| **Recreate** | No | Slow | Low | High | Simple |
| **Rolling** | Yes | Medium | Low | Medium | Medium |
| **Blue-Green** | Yes | Instant | High | Low | Medium |
| **Canary** | Yes | Fast | Medium | Low | High |
| **A/B Testing** | Yes | Fast | High | Low | High |

---

### **Blue-Green Deployment**

**Concept**: Two identical environments (Blue=Live, Green=Idle). Deploy to Green, test, switch traffic instantly.

```
Phase 1: Blue Live, Green Idle
┌─────────────┐      ┌─────────────┐
│   Blue      │      │   Green     │
│  (Live)     │      │  (Idle)     │
│  Traffic →  │      │             │
└─────────────┘      │   New       │
                     │   Version   │
                     └─────────────┘

Phase 2: Switch Traffic
┌─────────────┐      ┌─────────────┐
│   Blue      │      │   Green     │
│  (Idle)     │◄─────┤  (Live)     │
│             │      │  Traffic →  │
│  Rollback   │      │   New       │
│  Option     │      │   Version   │
└─────────────┘      └─────────────┘
```

**Implementation with Kubernetes**:
```yaml
# service.yaml - The traffic switcher
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # Change this to "green" to switch
  ports:
    - port: 80
      targetPort: 3000
```

**CI/CD Integration**:
```yaml
deploy:
  script:
    # Deploy to green (inactive)
    - kubectl apply -f k8s/green-deployment.yaml
    - kubectl rollout status deployment/myapp-green
    
    # Run smoke tests on green
    - kubectl port-forward svc/myapp-green 8080:80 &
    - sleep 5  # Give the port-forward time to start listening
    - curl -f http://localhost:8080/health || exit 1
    
    # Switch service to green (atomic operation)
    - kubectl patch service myapp -p '{"spec":{"selector":{"version":"green"}}}'
    
    # Keep blue for 5 minutes (instant rollback window)
    - sleep 300
    - kubectl scale deployment myapp-blue --replicas=0  # Scale down old version
```
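
Stripped of Kubernetes details, blue-green is a pointer swap guarded by a smoke test. The sketch below models only that logic. `BlueGreenRouter` and the smoke-test callbacks are hypothetical names, and a real switch would patch the Service selector as shown above.

```python
from typing import Callable


class BlueGreenRouter:
    """Tracks which of two environments receives live traffic."""

    def __init__(self, live: str = "blue") -> None:
        self.live = live

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def release(self, smoke_test: Callable[[str], bool]) -> bool:
        """Switch traffic to the idle environment only if its smoke test passes."""
        candidate = self.idle
        if not smoke_test(candidate):
            return False  # Live traffic never left the old version
        self.live = candidate
        return True

    def rollback(self) -> None:
        """Point traffic back at the previous environment."""
        self.live = self.idle


router = BlueGreenRouter()
router.release(lambda env: True)   # Smoke test on green passes
print(router.live)                 # green
router.rollback()                  # Problem found after the switch
print(router.live)                 # blue
```

Rollback is instant only while the old environment is still running, which is why the pipeline above waits before scaling blue down.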

**Pros**: Instant rollback, zero downtime, simple mental model

**Cons**: Double the infrastructure cost, data migration challenges (database schema changes)

---

### **Canary Deployment**

**Concept**: Gradually roll out to a subset of users, monitor metrics, increase rollout percentage.

```
Hour 0:   100% Stable
Hour 1:   95% Stable, 5% Canary (New)
Hour 2:   80% Stable, 20% Canary
Hour 4:   50% Stable, 50% Canary
Hour 6:   0% Stable, 100% Canary (if metrics good)
          OR
Hour 6:   100% Stable, 0% Canary (if error rate > 1%)
```
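
The schedule above is a loop with an exit condition. As a minimal sketch, assume the error rate observed at each traffic weight is already known; in practice it would come from a metrics system. The function name and the sample rates are hypothetical, and the 1% threshold is the one from the schedule.

```python
from typing import List, Sequence, Tuple


def run_canary(
    weights: Sequence[int],
    error_rates: Sequence[float],
    max_error_rate: float = 0.01,
) -> Tuple[int, List[Tuple[int, str]]]:
    """Step through canary traffic weights, rolling back on the first bad reading.

    Returns the final canary weight and the decision taken at each step.
    """
    if len(weights) != len(error_rates):
        raise ValueError("need one observed error rate per weight")
    history = []
    for weight, error_rate in zip(weights, error_rates):
        if error_rate > max_error_rate:
            history.append((weight, "rollback"))
            return 0, history  # All traffic returns to the stable version
        history.append((weight, "promote"))
    return weights[-1], history


steps = [5, 20, 50, 100]

healthy = run_canary(steps, [0.002, 0.004, 0.003, 0.005])
print(healthy[0])    # 100

unhealthy = run_canary(steps, [0.002, 0.03, 0.0, 0.0])
print(unhealthy)     # (0, [(5, 'promote'), (20, 'rollback')])
```

In the unhealthy run, the problem is caught while 20% of traffic is exposed, so 80% of users never see the faulty version.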

**Implementation with Istio (Service Mesh)**:
```yaml
# The "stable" and "canary" subsets must be defined in a matching DestinationRule
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp.example.com
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: myapp
            subset: canary
          weight: 100
    - route:
        - destination:
            host: myapp
            subset: stable
          weight: 95
        - destination:
            host: myapp
            subset: canary
          weight: 5
```

**Automated Canary Analysis** (using Flagger or Argo Rollouts):
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5  # Max failed checks before rollback
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
    webhooks:
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://myapp-canary/"
```

**Pros**: Real user testing, automatic rollback on metrics, cost-effective

**Cons**: Complex tooling required, users see different versions temporarily, both versions must work against the same database schema

---

### **Rolling Deployment**

**Concept**: Gradually replace old instances with new ones, one (or a few) at a time.

```
Time 0:   [Old] [Old] [Old] [Old]
Time 1:   [New] [Old] [Old] [Old]  (Drain and replace 1)
Time 2:   [New] [New] [Old] [Old]
Time 3:   [New] [New] [New] [Old]
Time 4:   [New] [New] [New] [New]
```

**Kubernetes Rolling Update**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 4
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Can exceed desired count by 1
      maxUnavailable: 0  # Never drop below desired count
  template:
    metadata:
      labels:
        app: myapp  # Must match spec.selector
    spec:
      containers:
        - name: app
          image: myapp:v2
          readinessProbe:  # Critical for rolling updates
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 5
```

**Key Configuration**:
- **Readiness Probes**: Traffic only routes to pods passing health checks
- **Max Surge**: How many extra pods can be created (speed vs. cost)
- **Max Unavailable**: How many pods can be down during update (availability vs. speed)
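
The interaction between the surge and unavailability settings is easier to see in a simulation. The sketch below is a simplified model, not the Kubernetes controller's actual algorithm: it assumes new pods become ready immediately and that both settings are absolute pod counts rather than percentages.

```python
from typing import List, Tuple


def rolling_update(replicas: int, max_surge: int, max_unavailable: int) -> List[Tuple[int, int]]:
    """Simulate a rolling update; returns (old_pods, new_pods) after each round."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    states = [(old, new)]
    while old > 0 or new < replicas:
        # Scale up: total pods may exceed the desired count by max_surge.
        to_create = min(replicas - new, replicas + max_surge - (old + new))
        new += to_create
        # Scale down: running pods must stay at or above replicas - max_unavailable.
        to_remove = min(old, (old + new) - (replicas - max_unavailable))
        old -= to_remove
        states.append((old, new))
    return states


surge_first = rolling_update(replicas=4, max_surge=1, max_unavailable=0)
print(surge_first)  # [(4, 0), (3, 1), (2, 2), (1, 3), (0, 4)]

no_surge = rolling_update(replicas=4, max_surge=0, max_unavailable=1)
print(no_surge)     # [(4, 0), (3, 0), (2, 1), (1, 2), (0, 3), (0, 4)]
```

With `maxSurge: 1` and `maxUnavailable: 0`, capacity never drops below four pods but a fifth is needed briefly. With the settings reversed, no extra pod is needed but the service runs on three pods during the update.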

**Pros**: Native to Kubernetes, no extra infrastructure, automatic

**Cons**: Slow rollback (must re-roll), harder to monitor (mixed versions), potential for compatibility issues during transition

---

### **Database Migration Strategies**

**The Hard Problem**: Code can roll back in seconds. Database changes cannot.

**Pattern: Expand-Contract (also known as Parallel Change)**

```
Phase 1: Expand (Deploy code + migration)
- Add new column (nullable)
- Dual write to old and new columns
- Read from old column

Old App: Read/Write → Column A
New App: Read Column A, Write Column A + Column B

Phase 2: Migrate (Backfill)
- Update all existing rows to populate Column B
- Switch reads to Column B

Old App: Read/Write → Column A
New App: Read Column B, Write Column A + Column B

Phase 3: Contract (Cleanup)
- Remove old column writes
- Eventually drop Column A

New App: Read/Write → Column B
```
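
The three phases can be tried end to end against an in-memory SQLite database. This is a sketch under stated assumptions: the `users` table and its `fullname` (Column A) and `display_name` (Column B) columns are hypothetical, and a production migration would run each phase as a separate deployment rather than in one script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")  # Column A
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")        # Written by the old app

# Phase 1: Expand. The new column is nullable, so the old app keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")              # Column B


def create_user_dual_write(name: str) -> None:
    """New app during phases 1 and 2: write both columns."""
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


create_user_dual_write("Grace Hopper")

# Phase 2: Migrate. Backfill rows that only the old app has written.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")
missing = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]

# With nothing missing, reads can switch to Column B.
names = [row[0] for row in conn.execute("SELECT display_name FROM users ORDER BY id")]

# Phase 3: Contract. Only once no deployed version reads or writes Column A.
if sqlite3.sqlite_version_info >= (3, 35, 0):  # DROP COLUMN requires SQLite 3.35+
    conn.execute("ALTER TABLE users DROP COLUMN fullname")
columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]

print(missing)  # 0
print(names)    # ['Ada Lovelace', 'Grace Hopper']
```

Every step before the final `DROP COLUMN` can be rolled back by redeploying the previous application version, because Column A stays valid and up to date throughout.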

**Implementation in Pipeline**:
```yaml
deploy:
  steps:
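    # 1. Run migrations (backward compatible only!)
    - kubectl apply -f k8s/job-migration.yaml
    - kubectl wait --for=condition=complete --timeout=300s job/db-migration
    
    # 2. Verify database connectivity (assumes DATABASE_URL is set in the CI environment;
    #    replace the query with a status check from your migration tool)
    - kubectl run db-verify --rm -i --restart=Never --image=postgres:15 -- psql "$DATABASE_URL" -c "SELECT version()"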
    
    # 3. Deploy app (new code works with new schema)
    - kubectl set image deployment/app app=myapp:v2
    
    # 4. Verify app health
    - kubectl rollout status deployment/app
    
    # 5. Post-deployment verification
    - ./scripts/smoke-tests.sh
```

---

## **Chapter Summary**

This chapter covered the technical backbone of modern software delivery—CI/CD pipelines that transform code into production value reliably and safely.

### **Key Takeaways:**

1. **Pipeline as Code**:
   - YAML-based configuration (GitHub Actions, GitLab CI) version-controlled alongside application code
   - Declarative syntax preferred for maintainability
   - Stages: Build → Test → Security → Artifact → Deploy
   - Parallel jobs reduce feedback time; dependencies ensure proper sequencing

2. **Build Automation**:
   - Immutable artifacts built once, promoted through environments (never rebuild)
   - Multi-stage Docker builds optimize image size and security
   - Caching strategies (dependencies, Docker layers) critical for speed
   - Artifact management includes versioning, metadata, and retention policies

3. **Testing Strategy**:
   - **Shift-left**: Fast tests (lint, unit) run first; slow tests (E2E) run later
   - **Parallelization**: Shard tests across multiple runners
   - **Flaky test management**: Detect, quarantine, and fix non-deterministic tests
   - **Coverage gates**: Enforce minimum thresholds but don't optimize for coverage alone

4. **Deployment Strategies**:
   - **Blue-Green**: Instant rollback, double infrastructure, good for critical systems
   - **Canary**: Gradual rollout with automated metric-based promotion/rollback
   - **Rolling**: Native to Kubernetes, resource-efficient, slower rollback
   - **Database**: Expand-contract pattern for zero-downtime schema changes

5. **DevSecOps Integration**:
   - Security scanning (SAST, DAST, container scanning) in pipeline
   - Secrets management (vaults, not hardcoded)
   - SBOM generation for supply chain security
   - Signed artifacts for verification

### **The CI/CD Mindset:**

- **Automation over manual**: If you do it twice, automate it
- **Fast feedback**: Fail fast, fix fast; 10-minute feedback loops vs. 3-hour cycles
- **Immutable infrastructure**: Replace artifacts and servers rather than patching them in place (cattle, not pets)
- **Safety nets**: Automated rollback triggers based on health metrics
- **Observability**: Pipeline metrics (DORA) indicate organizational health

---

## **Review Questions**

1. **Your pipeline currently runs all tests (unit, integration, E2E) sequentially and takes 2 hours.** Reorder and parallelize them to achieve sub-30-minute feedback while maintaining quality gates. What runs in parallel? What must be sequential?

2. **Compare Blue-Green vs. Canary deployments for a financial trading platform.** Which would you choose for a high-frequency trading system, and which for a retail banking app? Why?

3. **A developer wants to add a step to the pipeline that sends a Slack notification.** What are the pros and cons of adding this to the pipeline YAML vs. using a webhook triggered by the CI platform?

4. **You need to deploy a database migration that renames a column.** Using the expand-contract pattern, write the sequence of deployments required to achieve this with zero downtime.

5. **Your Docker images are 3GB and take 20 minutes to build.** List three specific optimizations you would implement, with expected impact on size and build time.

6. **What is the "feedback loop" in CI/CD, and why does speed matter?** Connect this to the "cost of delay" concept from lean product development.

---

## **Practical Exercise: Rescue the Pipeline**

**Scenario**: Return to ShopStream from the case study. You have 2 weeks to fix the CI/CD pipeline before the next board demo.

**Current State**:
- Monolithic Node.js app, 3-hour build time
- Single "test" stage that runs everything (lint, unit, integration, E2E)
- No artifact repository (builds happen on each deployment)
- Manual deployment via SSH and git pull
- No rollback capability (last outage required restoring database from 6-hour-old backup)
- Secrets stored in repository (`.env.production` committed to git)

**Requirements**:
- Build time: < 15 minutes
- Deployment frequency: On-demand (multiple times per day)
- Rollback: < 5 minutes
- Zero-downtime deployments
- Security: No secrets in code, vulnerability scanning

**Tasks**:

1. **Pipeline Architecture**:
   - Design the stage breakdown (what runs when?)
   - Choose parallelization strategy
   - Select deployment strategy (Blue-Green, Canary, or Rolling)

2. **Implementation**:
   - Write the CI/CD YAML file (GitHub Actions or GitLab CI)
   - Create the Dockerfile (optimized, multi-stage)
   - Design the Kubernetes deployment manifests (or docker-compose for simpler case)

3. **Database Strategy**:
   - Current issue: Database migrations run manually
   - Design: Automated migrations in pipeline with rollback capability

4. **Security Hardening**:
   - Secrets management plan
   - Container scanning integration
   - Artifact signing

5. **Monitoring**:
   - How do you know if a deployment succeeded?
   - Automated rollback triggers (what metrics?)

**Deliverable**: A working pipeline configuration (YAML files) plus a runbook for the team:

- "Happy Path": Normal deployment procedure
- "Emergency Runbook": 3 AM outage response
- "Adding a New Secret": Step-by-step (how to add a new API key safely)
- "Troubleshooting Guide": Build failed, what to check?

Present to the "CTO" (instructor/peer) demonstrating:
- Build time improvement (before/after)
- Safety mechanisms (can't deploy broken code to prod)
- Rollback demonstration

---

## **Further Reading and Resources**

**Books:**
- "Continuous Delivery" by Jez Humble and David Farley (The definitive text)
- "Accelerate" by Nicole Forsgren et al. (DORA metrics and research)
- "The DevOps Handbook" by Gene Kim et al. (Implementation patterns)
- "Docker Deep Dive" by Nigel Poulton (Container optimization)

**Online Resources:**
- GitHub Actions Documentation (docs.github.com/en/actions)
- GitLab CI/CD Best Practices (docs.gitlab.com/ee/ci/best_practices/)
- DORA Metrics (cloud.google.com/devops)
- CNCF Continuous Delivery landscape (landscape.cncf.io)

**Tools to Explore:**
- **CI/CD**: GitHub Actions, GitLab CI, CircleCI, Travis CI, Jenkins, Azure DevOps
- **GitOps**: ArgoCD, Flux, Jenkins X
- **Progressive Delivery**: Flagger, Argo Rollouts, Spinnaker
- **Security**: Trivy, Snyk, SonarQube, OWASP Dependency-Check
- **Build**: Bazel, Nx, Turborepo (monorepo tools)

---

**End of Chapter 11**

