# Chapter 54: CI/CD Documentation

Documentation is the institutional memory of engineering organizations. While code describes *what* a system does, documentation explains *why* it exists, *how* to operate it, and *when* to modify it. In CI/CD contexts, documentation serves multiple audiences: developers seeking to understand deployment procedures, operators troubleshooting failed pipelines, auditors verifying compliance controls, and future maintainers inheriting architectural decisions. This chapter treats documentation as code—version controlled, reviewed in pull requests, tested for accuracy, and deployed automatically. We examine **pipeline documentation** that captures the intent behind complex workflows, **architecture documentation** using diagrams-as-code that remain synchronized with implementation, **runbooks** for operational procedures that reduce incident response time, **API documentation** for platform services, **change logs** that communicate breaking changes, **README standards** that serve as service front doors, **automated generation** from source code and schemas, and **documentation-as-code** workflows that prevent the stagnation and rot that plague traditional wikis.

## 54.1 Pipeline Documentation

Pipelines encode complex business logic—security gates, deployment strategies, environment promotion rules—that requires explanation beyond YAML comments.

### Self-Documenting Pipelines

**Semantic Naming and Structure**:
```yaml
# Bad: generic stage names that hide what each stage does
stages:
  - build
  - test
  - deploy

# Good: Intent-revealing names
stages:
  - compile_and_unit_test
  - security_validation
  - integration_test
  - build_container_artifacts
  - deploy_to_staging
  - smoke_test_staging
  - production_canary_deploy
  - production_verification
```

**Inline Documentation**:
```yaml
# .github/workflows/deploy-payment.yml
# 
# Deployment Pipeline for Payment Service
# ========================================
# 
# Purpose: Safely deploy payment processing service to production
#          with zero-downtime canary releases.
#
# Triggers:
#   - Push to main branch (after PR merge)
#   - Manual dispatch for hotfixes (requires SRE approval)
#
# Security:
#   - All production deployments require signed commits
#   - Secrets injected via OIDC, not stored in repo
#   - SAST scan must pass before container build
#
# Rollback:
#   - Automatic rollback if error rate > 0.1% for 5 minutes
#   - Manual rollback: Run 'rollback-production' workflow
#
# Owners: platform-team@company.com

name: Payment Service Deployment

on:
  push:
    branches: [main]
    paths:
      - 'services/payment/**'
      - '.github/workflows/deploy-payment.yml'
  
  workflow_dispatch:
    inputs:
      rollback_target:
        description: 'Commit SHA to rollback to'
        required: true

env:
  SERVICE_NAME: payment-service
  # Using regional registry for latency and redundancy
  REGISTRY: ${{ vars.AWS_ACCOUNT_ID }}.dkr.ecr.${{ vars.AWS_REGION }}.amazonaws.com

jobs:
  # Build job documentation...
```
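A header comment like the one above only helps if every workflow carries it. The following is a minimal sketch of a lint check that could run in CI. It assumes the five section labels from this example (Purpose, Triggers, Security, Rollback, Owners) are the team's convention; the function name `missing_header_sections` is hypothetical.

```python
import re

# Section labels taken from the header comment in the example workflow (an assumed convention).
REQUIRED_SECTIONS = ("Purpose", "Triggers", "Security", "Rollback", "Owners")


def missing_header_sections(workflow_text: str) -> list[str]:
    """Return required sections absent from the leading comment block."""
    header_lines = []
    for line in workflow_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            header_lines.append(stripped.lstrip("#").strip())
        elif stripped:
            break  # the first non-comment, non-blank line ends the header

    found = set()
    for line in header_lines:
        match = re.match(r"([A-Za-z ]+):", line)
        if match:
            found.add(match.group(1).strip())
    return [name for name in REQUIRED_SECTIONS if name not in found]


example = """\
# Purpose: Deploy the payment service
# Triggers:
#   - Push to main branch
# Owners: platform-team@company.com

name: Payment Service Deployment
"""
print(missing_header_sections(example))  # ['Security', 'Rollback']
```

A CI job could run this over every file under `.github/workflows/` and fail when the returned list is not empty.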

### Pipeline Architecture Documents

**Decision Records in Pipelines**:
````markdown
<!-- docs/pipeline-architecture.md -->
# Architecture: Payment Service Deployment Pipeline

## Overview
This pipeline implements a progressive deployment strategy with
automated rollback capabilities to ensure payment service reliability.

## Deployment Strategy: Canary with Automatic Rollback

```mermaid
graph TD
    A[Build] --> B[Deploy to Staging]
    B --> C[Smoke Tests]
    C --> D[Deploy Canary 5%]
    D --> E[Metrics Verification]
    E -->|Error Rate <= 0.1%| F[Deploy 25%]
    E -->|Error Rate > 0.1%| G[Automatic Rollback]
    F --> H[Deploy 100%]
    H --> I[Verification]
```

## Key Design Decisions

### Decision: Blue/Green vs Canary
**Context**: Need zero-downtime deployment with automatic rollback capability
**Decision**: Canary deployment with automatic rollback
**Rationale**: 
  - Blue/green requires 2x capacity which is expensive for payment service
  - Canary allows gradual traffic shift with real-user metrics
**Consequences**: 
  - More complex monitoring required
  - Longer deployment time (15 min vs 5 min)

### Decision: Parallel Security Scanning
**Context**: Security scans were adding 20 minutes to pipeline
**Decision**: Run SAST and dependency scans in parallel with unit tests
**Rationale**: 
  - Security gates shouldn't block fast feedback
  - Unit test results arrive without waiting for the scans, so scan time leaves the critical path
**Consequences**: 
  - Requires larger runner instances
  - Potential for wasted compute if unit tests fail while scans are still running

## Troubleshooting Guide

### Symptom: Canary deployment stays at 5%
**Cause**: Metrics verification step failing to query Prometheus
**Check**: 
  1. Verify Prometheus query endpoint: `kubectl get svc prometheus -n monitoring`
  2. Check query syntax in `scripts/verify-metrics.sh`
  3. Ensure canary pods are emitting metrics: `kubectl logs -l app=payment-canary`

### Symptom: Automatic rollback not triggering
**Cause**: Error rate threshold configuration mismatch
**Check**: 
  1. Verify threshold in workflow env: `CANARY_ERROR_THRESHOLD`
  2. Check that the Prometheus query returns a ratio (0.001 for 0.1%), not a raw error count (1)
  3. Ensure alertmanager is routing to webhook
````

## 54.2 Architecture Documentation

Architecture diagrams must evolve with code or they become misleading. Diagrams-as-code tools enable version-controlled, reviewable visualizations.

### Diagrams as Code

**Mermaid** (Native GitHub/GitLab support):
````markdown
<!-- docs/architecture.md -->

## CI/CD Platform Architecture

```mermaid
graph TB
    subgraph Developer["Developer Workflow"]
        A[Local Development] -->|Push| B[Feature Branch]
        B -->|PR| C[GitHub Actions]
    end
    
    subgraph CI["CI Pipeline"]
        C --> D[Lint & Unit Test]
        D --> E[Security Scan]
        E -->|Parallel| F[SAST]
        E -->|Parallel| G[Dependency Check]
        D --> H[Integration Test]
    end
    
    subgraph CD["CD Pipeline"]
        H -->|Merge| I[Staging Deploy]
        I --> J[Smoke Test]
        J -->|Manual Gate| K[Production]
        K --> L[Canary 5%]
        L -->|Error rate > 0.1%| M[Automatic Rollback]
        L -->|Success| N[Full Deploy]
    end
    
    subgraph Observability["Observability"]
        O[Prometheus] --> P[Grafana]
        Q[Jaeger] --> P
        R[ELK Stack] --> P
    end
    
    K -.->|Metrics| O
    L -.->|Traces| Q
    
    style CI fill:#e1f5fe
    style CD fill:#e8f5e9
```
````

**Structurizr** (C4 Model):
```java
// docs/architecture.dsl
workspace {

    model {
        developer = person "Developer" "Builds and deploys applications"
        operator = person "Platform Operator" "Manages CI/CD infrastructure"
        
        ciSystem = softwareSystem "CI/CD Platform" "GitHub Actions + ArgoCD" {
            pipeline = container "Pipeline Engine" "GitHub Actions" "Orchestrates builds"
            registry = container "Artifact Registry" "ECR" "Stores container images"
            gitops = container "GitOps Controller" "ArgoCD" "Manages deployments"
            secrets = container "Secret Manager" "Vault" "Dynamic secrets"
        }
        
        k8s = softwareSystem "Kubernetes Platform" "EKS Clusters" {
            staging = container "Staging Cluster" "Pre-production"
            prod = container "Production Cluster" "Multi-AZ"
        }
        
        developer -> pipeline "Pushes code, triggers builds"
        pipeline -> registry "Pushes images"
        pipeline -> secrets "Retrieves credentials"
        gitops -> registry "Pulls images"
        gitops -> staging "Deploys to"
        gitops -> prod "Deploys to"
        operator -> ciSystem "Monitors and maintains"
    }

    views {
        systemContext ciSystem {
            include *
            autolayout lr
        }
        
        container ciSystem {
            include *
            autolayout lr
        }
        
        theme default
    }
}
```

**PlantUML** (Advanced diagrams):
```plantuml
@startuml ci-cd-sequence

title Deployment Sequence - Production Canary

actor Developer
participant "GitHub Actions" as GH
participant "ArgoCD" as Argo
participant "Staging Cluster" as Staging
participant "Production Cluster" as Prod
participant "Prometheus" as Prom

Developer -> GH : Push to main
activate GH

GH -> GH : Build & Test
GH -> Argo : Update GitOps repo\n(deploy to staging)

activate Argo
Argo -> Staging : Deploy v2.0
activate Staging
Argo --> GH : Staging deployed
deactivate Argo

GH -> Staging : Smoke tests
Staging --> GH : Pass
deactivate Staging

GH -> Argo : Approve production
activate Argo
Argo -> Prod : Deploy canary (5%)
activate Prod

loop Every 30 seconds for 10 minutes
    GH -> Prom : Query error rate
    Prom --> GH : Current metrics
end

alt Error rate <= 0.1%
    Argo -> Prod : Deploy 100%
    Prod --> Argo : Confirmed
else Error rate > 0.1%
    Argo -> Prod : Rollback to v1.9
    Prod --> Argo : Rolled back
    GH -> Developer : Alert: Deployment failed
end

deactivate Prod
deactivate Argo
deactivate GH

@enduml
```
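The verification loop in the sequence diagram can be sketched in Python. This is an illustration under stated assumptions, not the pipeline's actual implementation: `query` stands in for whatever call returns the current error ratio from Prometheus, and the 30-second interval, 10-minute window, and 0.1% threshold are taken from the diagram. The function names are hypothetical.

```python
import time
from typing import Callable, Iterable, Iterator

ERROR_RATIO_THRESHOLD = 0.001  # 0.1% expressed as a ratio, not a percentage


def poll_error_ratio(
    query: Callable[[], float],
    interval_s: int = 30,
    duration_s: int = 600,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[float]:
    """Yield one error-ratio sample per interval across the whole window."""
    for _ in range(duration_s // interval_s):
        yield query()
        sleep(interval_s)


def canary_verdict(
    samples: Iterable[float], threshold: float = ERROR_RATIO_THRESHOLD
) -> str:
    """Return 'rollback' on the first sample above the threshold, else 'promote'."""
    for error_ratio in samples:
        if error_ratio > threshold:
            return "rollback"
    return "promote"


# Hypothetical readings; a real `query` would call the metrics backend.
readings = iter([0.0002, 0.0004, 0.0030])
verdict = canary_verdict(
    poll_error_ratio(lambda: next(readings), duration_s=90, sleep=lambda _: None)
)
print(verdict)  # rollback
```

Injecting `sleep` keeps the loop testable without waiting, and comparing ratios rather than counts avoids the percentage-versus-count mismatch that the troubleshooting guide warns about.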

### Living Documentation

**Generating Docs from Infrastructure State**:
```python
# scripts/generate-arch-docs.py
# Auto-generate a resource inventory from a local Terraform state file.
# With a remote backend, export it first: terraform state pull > terraform.tfstate

import json

def generate_infra_docs(tf_dir):
    """Generate a Markdown inventory of managed resources from Terraform state"""
    
    with open(f"{tf_dir}/terraform.tfstate") as f:
        state = json.load(f)
    
    # Count instances per (type, name); keying on name alone would let
    # resources of different types overwrite each other.
    resources = {}
    for resource in state.get('resources', []):
        if resource.get('mode') != 'managed':
            continue  # skip data sources
        key = (resource['type'], resource['name'])
        resources[key] = resources.get(key, 0) + len(resource.get('instances', []))
    
    # Generate markdown table
    markdown = "## Infrastructure Resources\n\n"
    markdown += "| Resource | Type | Count |\n"
    markdown += "|----------|------|-------|\n"
    
    for (resource_type, name), count in sorted(resources.items()):
        markdown += f"| {name} | {resource_type} | {count} |\n"
    
    return markdown

# Run in CI
if __name__ == "__main__":
    docs = generate_infra_docs("./terraform")
    with open("docs/infrastructure.md", "w") as f:
        f.write(docs)
```
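Generated documentation stays "living" only if CI notices when the committed copy falls behind. A minimal sketch of such a drift check follows, assuming the generator's output is available as a string and the committed file lives at a path like `docs/infrastructure.md`; the function name `docs_drift` is hypothetical.

```python
import difflib
from pathlib import Path


def docs_drift(generated: str, committed_path: Path) -> list[str]:
    """Return unified-diff lines between the committed file and fresh output.

    An empty list means the committed documentation is up to date.
    """
    committed = committed_path.read_text() if committed_path.exists() else ""
    return list(
        difflib.unified_diff(
            committed.splitlines(),
            generated.splitlines(),
            fromfile=str(committed_path),
            tofile="generated",
            lineterm="",
        )
    )
```

In a pipeline, a non-empty result would be printed and the job failed, prompting the author to regenerate and commit the file in the same pull request.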

## 54.3 Runbooks

Runbooks are operational procedures that reduce cognitive load during incidents. They must be executable (tested) and version-controlled.

### Runbook Structure

**Template**:
````markdown
# Runbook: Database Connection Pool Exhaustion

## Metadata
- **Service**: Payment API
- **Severity**: SEV-2 (Degraded service)
- **Last Updated**: 2024-01-15
- **Owner**: Database Team
- **Related**: [INC-2024-001](./incidents/INC-2024-001.md)

## Symptoms
- Alert: `payment_db_connection_pool_usage > 80%`
- Symptom: Response times increasing (> 500ms p95)
- Symptom: Error rate climbing on `/api/v1/payments`

## Diagnosis Steps

### Step 1: Verify the Alert
```bash
# Check current connection pool status
kubectl exec -it deployment/payment-api -c app -- \
  curl -s localhost:8080/actuator/metrics/jdbc.connections.active

# Expected: Number close to max (100)
# If < 50: False positive, check alert configuration
```

### Step 2: Identify Long-Running Queries
```bash
# Access database pod (postgres-0 is a pod of the postgres StatefulSet)
kubectl exec -it postgres-0 -n database -- psql -U admin

# At the psql prompt, list active connections, oldest first
SELECT pid, state, query_start, now() - query_start AS running_for, query 
FROM pg_stat_activity 
WHERE state = 'active' 
ORDER BY query_start ASC;

# Look for queries running > 30 seconds
```

### Step 3: Check for Connection Leaks
```bash
# Review application logs for connection leaks
kubectl logs deployment/payment-api --tail=500 | grep -i "connection"

# Look for: "Connection leak detected" or "Unable to get connection"
```

## Resolution Procedures

### Immediate Mitigation (Stop the bleeding)

**Option A: Restart Application (Fastest, may briefly reduce capacity)**
```bash
# Rolling restart to reset connection pools
kubectl rollout restart deployment/payment-api
kubectl rollout status deployment/payment-api --timeout=300s
```

**Option B: Scale Up (No downtime, costs money)**
```bash
# Add replicas to distribute load.
# Each replica opens its own pool, so confirm the database connection limit has headroom first.
kubectl scale deployment/payment-api --replicas=10

# Verify new pods healthy
kubectl get pods -l app=payment-api -w
```

### Root Cause Fix

**If caused by slow query:**
```sql
-- Kill long-running query (get PID from Step 2)
SELECT pg_terminate_backend(<PID>);

-- Add query to slow query log for analysis
-- Create index if missing:
CREATE INDEX CONCURRENTLY idx_payments_created_at 
ON payments(created_at) 
WHERE status = 'pending';
```

**If caused by connection leak:**
```bash
# Deploy hotfix with connection timeout fix
kubectl set image deployment/payment-api \
  app=myregistry/payment-api:hotfix-conn-pool-1

# Monitor for 10 minutes
kubectl logs -l app=payment-api --tail=100 -f
```

## Verification

```bash
# Verify connection pool healthy
kubectl exec -it deployment/payment-api -c app -- \
  curl -s localhost:8080/actuator/health | jq '.status'

# Should return: "UP"

# Check error rates returning to normal (-G with --data-urlencode encodes the PromQL safely)
curl -sG https://prometheus.company.com/api/v1/query \
  --data-urlencode 'query=rate(payment_errors_total[5m])' | jq '.data.result[0].value[1]'

# Should be < 0.01 errors per second (rate() returns a per-second rate, not a percentage)
```

## Escalation

If above steps don't resolve within 15 minutes:
1. **Page Database On-Call**: Run `pagerduty trigger -s database-oncall`
2. **Slack**: Post in #incident-response with findings from diagnosis
3. **Bridge**: Join Zoom bridge: https://company.zoom.us/j/incident-bridge

## Post-Incident

1. **Document findings** in incident channel
2. **Update this runbook** if steps were incorrect or missing
3. **Schedule post-mortem** within 24 hours for SEV-2
````

### Executable Runbooks

**Test Runbooks in CI**:
```yaml
# .github/workflows/test-runbooks.yml
name: Test Runbooks
on:
  schedule:
    - cron: '0 2 * * 1'  # Weekly Monday 2am
  workflow_dispatch:

jobs:
  test-runbooks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup test environment
        run: |
          # Start kind cluster for testing
          kind create cluster --name runbook-test
          
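      - name: Test database runbook
        run: |
          # Deploy test application and wait until it is ready
          kubectl apply -f test/fixtures/payment-api.yaml
          kubectl rollout status deployment/payment-api --timeout=120s
          
          # Simulate connection pool issue.
          # seq replaces {1..200}: brace expansion is a bash feature that plain sh may not expand.
          kubectl exec deployment/payment-api -- \
            sh -c 'for i in $(seq 1 200); do curl -s -o /dev/null localhost:8080/slow-query & done'
          
          # Execute runbook diagnosis steps and verify the result.
          # The step shell runs with -e, so the check must wrap the command itself.
          if ! ./runbooks/diagnose-connection-pool.sh; then
            echo "Runbook failed validation"
            exit 1
          fi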
```
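One way to keep a script such as `diagnose-connection-pool.sh` from drifting away from its runbook is to derive it from the runbook's own fenced shell blocks. The sketch below shows the extraction step only, under the assumption that runbook commands live in `bash` or `sh` fences; the function names are hypothetical, and blocks containing interactive sessions or placeholders like `<PID>` would still need manual handling.

```python
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown

# Matches a bash/sh fenced block and captures its body.
BLOCK_RE = re.compile(
    rf"^{FENCE}(?:bash|sh)[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_shell_blocks(runbook_markdown: str) -> list[str]:
    """Return the body of every bash/sh fenced block, in document order."""
    return [match.group(1) for match in BLOCK_RE.finditer(runbook_markdown)]


def to_script(blocks: list[str]) -> str:
    """Join extracted blocks into one script that stops at the first failure."""
    return "#!/usr/bin/env bash\nset -euo pipefail\n\n" + "\n".join(blocks)


sample = "\n".join(
    [
        "### Step 1",
        FENCE + "bash",
        "kubectl get pods",
        FENCE,
        FENCE + "sql",
        "SELECT 1;",
        FENCE,
        FENCE + "bash",
        "echo ok",
        FENCE,
        "",
    ]
)
print(extract_shell_blocks(sample))  # ['kubectl get pods\n', 'echo ok\n']
```

The SQL block is skipped on purpose: only shell fences are collected, so the generated script contains commands a CI runner can execute directly.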

## 54.4 API Documentation

Internal platform APIs (CI/CD webhooks, deployment APIs, artifact registries) require the same documentation rigor as external products.

### OpenAPI Specifications

**Pipeline API**:
```yaml
# docs/openapi/pipeline-api.yaml
openapi: 3.0.3
info:
  title: CI/CD Pipeline API
  description: |
    API for triggering and monitoring CI/CD pipelines.
    Used by developer portals and automation tools.
  version: 1.2.0
  contact:
    name: Platform Team
    email: platform@company.com

servers:
  - url: https://api.ci.company.com/v1
    description: Production

paths:
  /pipelines:
    get:
      summary: List pipelines
      description: |
        Returns list of pipelines accessible to the authenticated user.
        Supports filtering by repository and status.
      security:
        - bearerAuth: []
      parameters:
        - name: repository
          in: query
          schema:
            type: string
          example: "myorg/payment-service"
        - name: status
          in: query
          schema:
            type: string
            enum: [running, success, failed, pending]
      responses:
        '200':
          description: List of pipelines
          content:
            application/json:
              schema:
                type: object
                properties:
                  pipelines:
                    type: array
                    items:
                      $ref: '#/components/schemas/Pipeline'
                  total:
                    type: integer
                    example: 42

  /pipelines/{id}/trigger:
    post:
      summary: Trigger pipeline execution
      description: |
        Manually trigger a pipeline run. Requires `write` permission
        on the repository.
        
        **Rate limiting**: 100 requests/hour per repository.
      security:
        - bearerAuth: []
      parameters:
        - name: id
          in: path
          required: true
          schema:
            type: string
            pattern: '^[a-z0-9-]+$'
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                branch:
                  type: string
                  default: main
                parameters:
                  type: object
                  description: Key-value pairs passed to pipeline
      responses:
        '201':
          description: Pipeline triggered successfully
          headers:
            Location:
              description: URL of new pipeline run
              schema:
                type: string
        '429':
          description: Rate limit exceeded
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Error'
              example:
                error: "Rate limit exceeded"
                retry_after: 3600

components:
  schemas:
    Pipeline:
      type: object
      properties:
        id:
          type: string
          example: "pay-svc-001"
        repository:
          type: string
          example: "myorg/payment-service"
        last_run:
          type: string
          format: date-time
        status:
          type: string
          enum: [pending, running, success, failed, unknown]
    
    Error:
      type: object
      properties:
        error:
          type: string
        retry_after:
          type: integer
          description: Seconds until retry is allowed

  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT
```
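A specification like this can itself be checked in CI. The following is a minimal sketch of a documentation lint over an already-parsed OpenAPI document (a plain `dict`, however it was loaded). The three rules are illustrative assumptions about what a team might require, and `lint_operations` is a hypothetical name.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def lint_operations(spec: dict) -> list[str]:
    """Return human-readable findings for under-documented operations."""
    findings = []
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as 'parameters'
            label = f"{method.upper()} {path}"
            if not operation.get("summary"):
                findings.append(f"{label}: missing summary")
            if not operation.get("description"):
                findings.append(f"{label}: missing description")
            codes = [str(code) for code in operation.get("responses", {})]
            if not any(code.startswith(("4", "5")) for code in codes):
                findings.append(f"{label}: no error response documented")
    return findings


spec = {
    "paths": {
        "/pipelines": {
            "get": {
                "summary": "List pipelines",
                "description": "Returns pipelines visible to the caller.",
                "responses": {"200": {"description": "List of pipelines"}},
            }
        },
        "/pipelines/{id}/trigger": {
            "post": {
                "summary": "Trigger pipeline execution",
                "description": "Manually trigger a pipeline run.",
                "responses": {"201": {}, "429": {}},
            }
        },
    }
}
for finding in lint_operations(spec):
    print(finding)  # GET /pipelines: no error response documented
```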

### Documentation Generation

**From Code**:
```python
# FastAPI auto-generates OpenAPI from code
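from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from typing import Optional

app = FastAPI(
    title="CI/CD Pipeline API",
    description="API for triggering and monitoring pipelines",
    version="1.2.0"
)

class TriggerRequest(BaseModel):
    """Request body for triggering pipelines"""
    # Attribute docstrings are ignored by default; Field descriptions reach the generated schema
    branch: str = Field("main", description="Git branch to build")
    parameters: Optional[dict] = Field(None, description="Additional build parameters")

class PipelineResponse(BaseModel):
    """Response returned after a run is created (illustrative shape)"""
    run_id: str = Field(..., description="Identifier of the newly created pipeline run")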

@app.post(
    "/pipelines/{pipeline_id}/trigger",
    response_model=PipelineResponse,
    status_code=201,
    tags=["pipelines"],
    summary="Trigger pipeline execution",
    description="Manually trigger a pipeline run. Requires write permission."
)
async def trigger_pipeline(
    pipeline_id: str,
    request: TriggerRequest
):
    """
    Triggers a new pipeline execution.
    
    - **pipeline_id**: Unique identifier for the pipeline
    - **branch**: Git branch to checkout (default: main)
    - **parameters**: Custom parameters passed to build steps
    
    Returns the newly created pipeline run ID.
    """
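    # Implementation omitted; a real handler would create the run and return a PipelineResponse
    raise HTTPException(status_code=501, detail="Not implemented")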
```

## 54.5 Change Logs and Breaking Changes

Communication of changes prevents "it worked yesterday" surprises and gives teams time to adapt.

### Structured Change Logs

**Keep a Changelog Format**:
```markdown
# Changelog

All notable changes to the CI/CD platform are documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added
- Support for ARM64 builds (Apple Silicon native)
- New `security/scan` job template for container scanning

### Changed
- **BREAKING**: Minimum Kubernetes version raised to 1.25
  - Migration guide: [docs/migrations/k8s-1.25.md](./docs/migrations/k8s-1.25.md)
  - Deadline: 2024-03-01
- Updated base image to `ubuntu:22.04`

### Deprecated
- Jenkins pipeline support (end-of-life 2024-06-01)
  - Migrate to GitHub Actions using [migration tool](./tools/jenkins-migrator)

### Removed
- **BREAKING**: Removed support for Node.js 14
  - Action: Upgrade to Node.js 18 or 20

### Fixed
- Fixed race condition in parallel test execution
- Corrected artifact retention policy (now 90 days per compliance)

### Security
- Updated OpenSSL to 3.0.8 (CVE-2023-0286)
- Enforced SBOM generation for all builds
```
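Because the format is structured, breaking changes can be pulled out mechanically, for example to post them to an announcement channel. The sketch below assumes the conventions used in this changelog: release headings of the form `## [version]` and top-level bullets that start with `- **BREAKING**:`. The function name and the older `1.0.0` release in the sample are made up for illustration.

```python
import re


def breaking_changes(changelog: str, version: str = "Unreleased") -> list[str]:
    """Return the text of top-level bullets marked **BREAKING** in one release."""
    marker = "- **BREAKING**:"
    in_release = False
    found = []
    for line in changelog.splitlines():
        heading = re.match(r"^## \[(.+?)\]", line)
        if heading:
            # Entering a new release section; only collect inside the requested one.
            in_release = heading.group(1) == version
            continue
        if in_release and line.startswith(marker):
            found.append(line.removeprefix(marker).strip())
    return found


sample = """\
## [Unreleased]

### Changed
- **BREAKING**: Minimum Kubernetes version raised to 1.25
  - Deadline: 2024-03-01
- Updated base image

### Removed
- **BREAKING**: Removed support for Node.js 14

## [1.0.0]

### Changed
- **BREAKING**: Entry from an older, made-up release
"""
print(breaking_changes(sample))
# ['Minimum Kubernetes version raised to 1.25', 'Removed support for Node.js 14']
```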

### Breaking Change Process

**Deprecation Workflow** (issue template, e.g. `.github/ISSUE_TEMPLATE/breaking-change.md`):
```markdown
---
name: Breaking Change Proposal
about: Propose a breaking change to the platform
title: "[BREAKING] "
labels: breaking-change, needs-review
---

## Summary
One-sentence description of the change

## Motivation
Why is this change necessary? What problem does it solve?

## Detailed Design
Technical details of the change

## Migration Path
1. Current state: ...
2. Action required: ...
3. Timeline: ...

## Impact Assessment
- [ ] Security impact
- [ ] Performance impact  
- [ ] Cost impact
- [ ] Developer experience impact

## Rollout Plan
- **Announcement Date**: 
- **Deprecation Warning**: 
- **Hard Deadline**: 
- **Communication Channels**: 
  - [ ] Email to engineering@company.com
  - [ ] Slack #announcements
  - [ ] Developer portal banner
```

**Automated Deprecation Warnings**:
```yaml
# In pipeline configuration
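jobs:
  check-deprecations:
    runs-on: ubuntu-latest
    steps:
      # Without a checkout, the Dockerfile and Jenkinsfile checks never see the repository
      - uses: actions/checkout@v4

      - name: Check for deprecated features
        run: |
          # Check if using deprecated Node version
          if [ -f Dockerfile ] && grep -q "node:14" Dockerfile; then
            echo "::warning::Node.js 14 is deprecated. Migrate to 18+ by March 1st."
            echo "::warning::See: https://docs.company.com/migrations/node-18"
          fi
          
          # Check if using legacy Jenkins: warn until the end-of-life date, fail after it
          if [ -f "Jenkinsfile" ]; then
            if [[ "$(date -u +%F)" > "2024-06-01" ]]; then
              echo "::error::Jenkins support ended on 2024-06-01. Migrate to GitHub Actions."
              exit 1  # Fail build after deadline
            fi
            echo "::warning::Jenkins is deprecated (end-of-life 2024-06-01). Migrate to GitHub Actions."
          fi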
```
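The warn-then-fail behavior can be factored into a small helper so that every deprecation follows the same rule. This is a sketch under the assumption that each deprecated feature has a single removal date; `deprecation_message` is a hypothetical name, while `::warning::` and `::error::` are the GitHub Actions workflow commands already used above.

```python
from datetime import date


def deprecation_message(feature: str, deadline: date, today: date) -> tuple[str, int]:
    """Return a workflow command line and the exit code the check should use."""
    if today > deadline:
        # Past the deadline: surface an error and fail the build.
        return f"::error::{feature} was removed on {deadline.isoformat()}.", 1
    days_left = (deadline - today).days
    return f"::warning::{feature} is deprecated; {days_left} day(s) until removal.", 0


message, exit_code = deprecation_message("Jenkins", date(2024, 6, 1), date(2024, 5, 2))
print(message)    # ::warning::Jenkins is deprecated; 30 day(s) until removal.
print(exit_code)  # 0
```

Passing `today` explicitly, rather than reading the clock inside the function, makes the deadline logic straightforward to unit test.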

## 54.6 README Best Practices

The README is the front door to every repository. It must enable self-service understanding and contribution.
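A README standard is easier to uphold when CI checks for it. As a minimal sketch, the function below reports which second-level headings are missing from a README. The required list is an assumption drawn from the template in this section, and `missing_readme_sections` is a hypothetical name.

```python
import re

# Assumed house standard; adjust to the sections your organization requires.
REQUIRED_HEADINGS = ("Overview", "Quick Start", "Configuration", "Monitoring", "Contributing")


def missing_readme_sections(readme: str) -> list[str]:
    """Return required second-level headings that the README lacks."""
    headings = set()
    in_fence = False
    for line in readme.splitlines():
        if line.lstrip().startswith("`" * 3):
            in_fence = not in_fence  # ignore '##' lines inside code fences
            continue
        if in_fence:
            continue
        match = re.match(r"^##\s+(.+?)\s*$", line)
        if match:
            headings.add(match.group(1))
    return [name for name in REQUIRED_HEADINGS if name not in headings]


sample_readme = """\
# Payment Service

## Overview
Processes card payments.

## Quick Start
### Prerequisites
- Java 17

## Contributing
See CONTRIBUTING.md.
"""
print(missing_readme_sections(sample_readme))  # ['Configuration', 'Monitoring']
```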

### README Template

````markdown
# Service Name

[![CI Status](https://github.com/myorg/service/actions/workflows/ci.yml/badge.svg)](https://github.com/myorg/service/actions)
[![Coverage](https://codecov.io/gh/myorg/service/branch/main/graph/badge.svg)](https://codecov.io/gh/myorg/service)
[![Docs](https://img.shields.io/badge/docs-latest-blue)](https://docs.company.com/services/service)

One-sentence description of what this service does.

## Overview

- **Purpose**: Brief description of business function
- **Tech Stack**: Java 17, Spring Boot, PostgreSQL, Kubernetes
- **Team**: [#team-payments](https://company.slack.com/archives/team-payments)
- **On-Call**: [PagerDuty Rotation](https://company.pagerduty.com/rotations)
- **Architecture**: [ADR-012](./docs/adr/012-service-architecture.md)

## Quick Start

### Prerequisites
- Java 17 (see [SDKMAN setup](https://docs.company.com/java-setup))
- Docker Desktop
- kubectl configured for dev cluster

### Local Development
```bash
# Clone and setup
git clone git@github.com:myorg/service.git
cd service
make setup  # Installs dependencies, sets up database

# Run locally
make dev    # Starts with hot reload on http://localhost:8080

# Run tests
make test   # Unit + integration tests
```

### Deployment
```bash
# Deploy to staging (auto on merge to main)
git push origin main

# Deploy to production (manual gate)
gh workflow run deploy-production.yml
```

## Architecture

```mermaid
graph LR
    A[API Gateway] --> B[Payment Service]
    B --> C[(PostgreSQL)]
    B --> D[Redis Cache]
    B --> E[Kafka Events]
```

See [full architecture docs](./docs/architecture.md).

## Configuration

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `DB_HOST` | Database hostname | none | Yes |
| `DB_PORT` | Database port | `5432` | No |
| `LOG_LEVEL` | Logging level | `INFO` | No |

## API Documentation

- OpenAPI Spec: [openapi.yaml](./openapi.yaml)
- Live Docs: https://api.company.com/payments/docs

## Monitoring

- Dashboard: [Grafana](https://grafana.company.com/d/payments)
- Alerts: [#alerts-payments](https://company.slack.com/archives/alerts-payments)
- Runbooks: [./runbooks](./runbooks)

## Contributing

See [CONTRIBUTING.md](./CONTRIBUTING.md) for:
- Branch naming conventions
- Commit message format
- PR review process
- Testing requirements

## License

[MIT](./LICENSE)
````

## 54.7 Automated Documentation

Documentation that is manually maintained will drift from reality. Automate generation where possible.

### Code to Docs

**Swagger/OpenAPI from Annotations**:
```java
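// Assumes Spring Boot 3 with springdoc-openapi; on Spring Boot 2, import javax.validation.Valid instead.
// PaymentRequest and PaymentResponse are application DTOs defined elsewhere.
import io.swagger.v3.oas.annotations.Operation;
import io.swagger.v3.oas.annotations.responses.ApiResponse;
import io.swagger.v3.oas.annotations.tags.Tag;
import jakarta.validation.Valid;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/v1/payments")
@Tag(name = "Payments", description = "Payment processing endpoints")
public class PaymentController {

    @Operation(
        summary = "Process payment",
        description = "Charges customer and creates transaction record",
        responses = {
            @ApiResponse(responseCode = "201", description = "Payment successful"),
            @ApiResponse(responseCode = "402", description = "Payment declined"),
            @ApiResponse(responseCode = "400", description = "Invalid request")
        }
    )
    @PostMapping
    public ResponseEntity<PaymentResponse> processPayment(
        @RequestBody @Valid PaymentRequest request
    ) {
        // Implementation omitted; throwing keeps the example compilable
        throw new UnsupportedOperationException("Not implemented");
    }
}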
```

**Database Schema Documentation**:
```yaml
# Example output: documentation generated from the database schema
tables:
  - name: payments
    description: Records of financial transactions
    columns:
      - name: id
        type: uuid
        description: Primary key
      - name: amount_cents
        type: integer
        description: Transaction amount in smallest currency unit
        constraints: NOT NULL, CHECK > 0
      - name: status
        type: enum(pending, completed, failed)
        description: Transaction state
```

### Pipeline Documentation from Code

**GitHub Actions to Markdown**:
```python
# scripts/generate-workflow-docs.py
import yaml
import sys

def parse_workflow(file_path):
    with open(file_path) as f:
        workflow = yaml.safe_load(f)
    
    # Workflow files have no top-level description key, so only the name is emitted
    markdown = f"## {workflow.get('name', 'Workflow')}\n\n"
    
    markdown += "### Triggers\n\n"
    # PyYAML follows YAML 1.1, where the bare key `on` is parsed as boolean True
    triggers = workflow.get('on', workflow.get(True, {}))
    if isinstance(triggers, str):
        triggers = {triggers: None}
    elif isinstance(triggers, list):
        triggers = {name: None for name in triggers}
    for trigger, config in triggers.items():
        markdown += f"- **{trigger}**" + (f": {config}" if config else "") + "\n"
    
    markdown += "\n### Jobs\n\n"
    for job_name, job_config in workflow.get('jobs', {}).items():
        markdown += f"#### {job_name}\n\n"
        markdown += f"Runner: `{job_config.get('runs-on', 'unknown')}`\n\n"
        markdown += "Steps:\n"
        for number, step in enumerate(job_config.get('steps', []), start=1):
            name = step.get('name', step.get('uses', 'unnamed'))
            markdown += f"{number}. {name}\n"
        markdown += "\n"
    
    return markdown

if __name__ == "__main__":
    docs = parse_workflow(sys.argv[1])
    print(docs)
```

## 54.8 Documentation-as-Code

Treat documentation with the same rigor as source code: version control, CI/CD, reviews, and testing.

### GitOps for Documentation

**Repository Structure**:
```
docs/
├── README.md                 # Documentation index
├── architecture/             # C4 models, ADRs
│   ├── c4-level-1.md
│   ├── c4-level-2.md
│   └── decisions/            # Architecture Decision Records
├── runbooks/                 # Operational procedures
│   ├── database-failover.md
│   └── incident-response.md
├── api/                      # OpenAPI specs
│   └── pipeline-api.yaml
└── migrations/               # Migration guides
    ├── k8s-1.25.md
    └── node-18.md

mkdocs.yml                   # Documentation site config
.github/
└── workflows/
    └── docs.yml             # CI for docs
```

**Documentation CI/CD**:
```yaml
name: Documentation
on:
  push:
    paths:
      - 'docs/**'
      - 'mkdocs.yml'
  pull_request:
    paths:
      - 'docs/**'

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Markdown linting
        uses: DavidAnson/markdownlint-cli2-action@v13
        with:
          globs: 'docs/**/*.md'
          
      - name: Check links
        uses: lycheeverse/lychee-action@v1
        with:
          args: --timeout 30 docs/
          
      - name: Spell check
        uses: streetsidesoftware/cspell-action@v5
        with:
          files: 'docs/**/*.md'

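  build:
    needs: lint  # publish only when linting passes
    runs-on: ubuntu-latest
    permissions:
      contents: write  # mkdocs gh-deploy pushes to the gh-pages branch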
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          
      - name: Install dependencies
        run: |
          pip install mkdocs-material
          pip install mkdocs-mermaid2-plugin
          
      - name: Build site
        run: mkdocs build --strict
        
      - name: Deploy to GitHub Pages
        if: github.ref == 'refs/heads/main'
        run: mkdocs gh-deploy --force
```
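External link checkers handle URLs well, but relative links between documentation files can also be verified with a few lines of code. The sketch below is a simplified illustration, not a replacement for the tools above: it only recognizes inline `[text](target)` links, and the function name is hypothetical.

```python
import re
from pathlib import Path

# Inline Markdown links: [text](target). Reference-style links are not handled.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(docs_root: Path) -> list[tuple[str, str]]:
    """Return (file, target) pairs whose relative target does not exist."""
    broken = []
    for md_file in sorted(docs_root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for target in LINK_RE.findall(text):
            if SCHEME_RE.match(target) or target.startswith("#"):
                continue  # external URL, mailto:, or same-page anchor
            path_part = target.split("#", 1)[0]
            if not (md_file.parent / path_part).exists():
                broken.append((str(md_file.relative_to(docs_root)), target))
    return broken
```

Calling `broken_relative_links(Path("docs"))` in a CI step and failing on a non-empty result catches renamed or deleted pages before they reach the published site.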

### Review Process

**Documentation PR Template**:
```markdown
## Type of Documentation Change
- [ ] New feature documentation
- [ ] Bug fix (incorrect existing docs)
- [ ] Improvement (clarity, examples)
- [ ] Breaking change notice

## Checklist
- [ ] Technical accuracy verified by subject matter expert
- [ ] Screenshots updated (if UI changed)
- [ ] Links tested (no 404s)
- [ ] Spelling and grammar checked
- [ ] Code examples tested/executable
- [ ] ADR created if architectural decision documented

## Audience
Who is the primary reader?
- [ ] New hire onboarding
- [ ] Experienced developer
- [ ] Platform operator
- [ ] External customer
```

---

## Chapter Summary and Preview

This chapter established documentation as a critical operational capability for CI/CD platforms, treating docs with the same engineering rigor as production code. We examined **pipeline documentation** that explains not merely the steps executed but the architectural decisions behind deployment strategies, using inline comments, decision records, and troubleshooting guides to capture the "why" alongside the "what."

**Architecture documentation** using diagrams-as-code tools (Mermaid, Structurizr, PlantUML) keeps visual representations in the same repository and review flow as the implementation, which narrows the "architecture diagram vs. reality" divergence that plagues traditional documentation. These diagrams live in version control, undergo review in pull requests, and render automatically in documentation sites.

**Runbooks** provide operational procedures for incident response, structured with clear metadata, diagnosis steps, resolution procedures, and escalation paths. Executable runbooks—tested in CI pipelines—ensure that procedures remain valid as systems evolve, preventing the dangerous scenario where operators discover outdated runbooks during critical incidents.

**API documentation** using OpenAPI specifications provides contracts for internal platform services, enabling self-service consumption of CI/CD capabilities by development teams. **Change logs** following structured formats (Keep a Changelog) communicate breaking changes with adequate notice and migration paths, preventing surprise disruptions to dependent teams.

**README standards** establish the front door to every repository, providing immediate orientation for developers through quick start guides, architecture overviews, configuration references, and links to deeper documentation. **Automated documentation generation** from code annotations, database schemas, and pipeline definitions reduces maintenance burden while ensuring accuracy.

Finally, **documentation-as-code** workflows—version control, CI/CD linting, link checking, spell checking, and automated deployment—prevent documentation rot by enforcing quality gates and making documentation updates part of the standard development workflow.

**Key Takeaways:**
- Treat documentation as code: version controlled, reviewed in PRs, tested in CI, and deployed automatically.
- Use diagrams-as-code (Mermaid, Structurizr, PlantUML) so architecture diagrams are versioned and reviewed alongside the implementation they describe.
- Structure runbooks with clear metadata (severity, owner, related incidents), diagnosis steps, and executable commands that can be tested.
- Maintain change logs with breaking change notices and migration guides; provide at least 30 days' notice for breaking changes.
- Automate documentation generation from OpenAPI specs, code annotations, and infrastructure definitions to reduce maintenance burden.
- Include documentation review in Definition of Done; every feature ships with updated docs or it is not complete.
- Test documentation accuracy regularly; broken links, outdated screenshots, and incorrect commands erode trust and utility.

**Next Chapter Preview:** Chapter 55: CI/CD Best Practices synthesizes the patterns and anti-patterns observed throughout this handbook into actionable guidelines. We will examine the **Twelve-Factor App** methodology as it applies to CI/CD, **immutable infrastructure** principles that prevent configuration drift, **infrastructure as code** patterns for reproducible environments, **small frequent changes** that reduce risk and improve debuggability, **automation of toil** that eliminates manual steps, **fail-fast** mechanisms that surface errors immediately, **repeatability** through hermetic builds, and **simplicity** as the ultimate sophistication. We will contrast these with common anti-patterns: the monolithic pipeline, manual interventions, hardcoded configuration, skipped tests in urgency, bloated container images, over-engineered abstractions, siloed team structures, and security as an afterthought. This chapter serves as a concise reference for daily decision-making in pipeline design and implementation.

