# Chapter 56: Anti-Patterns to Avoid

Anti-patterns are recurring solutions to common problems that appear reasonable on the surface but produce negative consequences over time. In CI/CD, anti-patterns emerge from organizational pressure, historical baggage, or misunderstood requirements. They degrade velocity, compromise security, and create operational fragility. This chapter provides a forensic examination of the eight most destructive CI/CD anti-patterns: the **Giant Monolith Pipeline** that creates tight coupling and cascading failures; **Manual Intervention Gates** that reintroduce human error and variability; **Hardcoded Configurations** that prevent portability and create security risks; **Skipping Tests** that destroys confidence in automation; **Bloated Container Images** that expand attack surfaces and slow deployments; **Over-Engineered Abstractions** that obscure intent and hinder debugging; **Siloed Teams** that create handoffs and blame cultures; and **Ignoring Security** until production that results in expensive rework and compliance failures. For each anti-pattern, we provide detection criteria, refactoring strategies, and prevention measures to help teams recognize and eliminate these practices.

## 56.1 Giant Monolith Pipeline

The monolith pipeline attempts to build, test, and deploy every service in a repository through a single workflow, regardless of which code actually changed.

### The Anti-Pattern

**Symptoms**:
- One workflow file handles 50+ microservices
- Changing a comment in Service A triggers integration tests for Services B through Z
- Pipeline duration exceeds 2 hours for minor changes
- "Pipeline roulette" where unrelated failures block deployments

**Example**:
```yaml
# Anti-pattern: The God Workflow
# .github/workflows/monolith.yml
name: Build Everything

on:
  push:
    branches: [main]
    # No path filters - everything triggers everything

jobs:
  build-all:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      # Build all services regardless of what changed
      - name: Build Service A
        run: ./build.sh service-a  # 5 minutes
      
      - name: Build Service B
        run: ./build.sh service-b  # 5 minutes
      
      # ... 48 more services ...
      
      - name: Build Service Z
        run: ./build.sh service-z  # 5 minutes
      
      # All tests run sequentially
      - name: Test All Services
        run: ./test-all.sh  # 45 minutes
      
      - name: Deploy All
        run: ./deploy-all.sh  # High blast radius
```

**Consequences**:
- **Feedback latency**: Developers wait hours for feedback on one-line changes
- **Cascading failures**: A flaky test in Service X blocks deployment of critical Service Y
- **Resource waste**: Compute spent building unchanged artifacts
- **Deployment fear**: High blast radius discourages frequent deployments

### Detection Criteria

```bash
# Detect monolith pipelines
# Pipeline runs > 30 minutes for minor changes
if [ "$PIPELINE_DURATION" -gt 1800 ]; then
  echo "WARNING: Potential monolith pipeline detected"
fi

# Build affects services with no code changes
CHANGED_FILES=$(git diff --name-only HEAD~1)
if echo "$CHANGED_FILES" | grep -q "service-a/" && \
   ! echo "$CHANGED_FILES" | grep -q "service-b/"; then
  # service-a changed and service-b did not, yet service-b was also built
  # (pipeline_log_contains stands in for your CI system's log query)
  if pipeline_log_contains "Building service-b"; then
    echo "VIOLATION: Building unchanged services"
  fi
fi
```

### Refactoring Strategy

**Path-Based Triggering**:
```yaml
# Solution: Service-specific workflows
# .github/workflows/service-a.yml
name: Service A CI/CD

on:
  push:
    paths:
      - 'services/service-a/**'
      - 'libs/shared/**'  # Only if shared lib changed
      - '.github/workflows/service-a.yml'
    branches: [main]
  pull_request:
    paths:
      - 'services/service-a/**'
      - 'libs/shared/**'  # Keep PR validation in step with the push trigger

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Build Service A only
        working-directory: services/service-a
        run: make build
      
      - name: Test Service A only
        working-directory: services/service-a
        run: make test
      
      - name: Deploy Service A only
        if: github.ref == 'refs/heads/main'
        working-directory: services/service-a
        run: make deploy
```

**Dependency-Aware Builds** (Bazel/Pants):
```python
# BUILD file with precise dependencies
# Only rebuild if this service or its deps changed
java_binary(
    name = "service-a",
    srcs = glob(["src/**/*.java"]),
    deps = [
        "//libs/shared:utils",  # Only if this changes
        "//libs/shared:models",
    ],
)

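# Bazel has no "only changed" flag. Incremental builds and caching already
# skip unchanged actions; to list targets affected by a shared library, query
# its reverse dependencies:
# bazel query 'rdeps(//services/..., //libs/shared:utils)'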
```

**Micro-Pipelines Architecture**:
```yaml
# Solution: Pipeline of pipelines (Azure Pipelines syntax)
# A detection stage decides which per-service stages run

stages:
  - stage: Changes
    displayName: Detect Changes
    jobs:
    - job: changes
      steps:
      # ChangedFiles@1 is a third-party Marketplace task; how it publishes
      # output variables depends on the extension, so verify against its docs
      - task: ChangedFiles@1
        name: detect
        inputs:
          rules: |
            [ServiceA]
            services/service-a/**
            [ServiceB]
            services/service-b/**
  
  - stage: BuildServiceA
    dependsOn: Changes
    condition: eq(dependencies.Changes.outputs['changes.detect.ServiceA'], 'true')
    jobs:
    - template: service-pipeline.yml
      parameters:
        service: service-a
  
  - stage: BuildServiceB
    dependsOn: Changes
    condition: eq(dependencies.Changes.outputs['changes.detect.ServiceB'], 'true')
    jobs:
    - template: service-pipeline.yml
      parameters:
        service: service-b
```

### Prevention

- **Service Boundaries**: Enforce repository per service or clear directory boundaries with CODEOWNERS
- **Build Impact Analysis**: Use tools like Nx, Bazel, or Pants to determine affected projects
- **Pipeline Time SLI**: Alert if pipeline duration exceeds 15 minutes for any service
- **Contract Testing**: Replace integration tests with consumer-driven contract tests
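
The build impact analysis idea can be sketched in a few lines of Python. This is only an illustration of the principle, not how Nx, Bazel, or Pants work internally. It assumes the `services/<name>/` and `libs/shared/` layout used in the examples above, and `affected_services` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def affected_services(changed_files, all_services, shared_prefix="libs/shared"):
    """Return the sorted list of services that need a rebuild.

    Assumes the hypothetical layout services/<name>/... and treats any change
    under shared_prefix as affecting every service.
    """
    affected = set()
    shared_parts = PurePosixPath(shared_prefix).parts
    for path in changed_files:
        parts = PurePosixPath(path).parts
        if parts[: len(shared_parts)] == shared_parts:
            # Shared code changed: every service may be affected
            return sorted(all_services)
        if len(parts) >= 3 and parts[0] == "services" and parts[1] in all_services:
            affected.add(parts[1])
    return sorted(affected)


if __name__ == "__main__":
    services = {"service-a", "service-b"}
    changed = ["services/service-a/src/app.py", "README.md"]
    print(affected_services(changed, services))
```

Real tools resolve a full dependency graph instead of a single shared directory, so treat the "shared change rebuilds everything" rule as a deliberately coarse default.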

## 56.2 Manual Interventions

Manual gates in pipelines—whether for approvals, environment configuration, or "just to be safe" checks—reintroduce the variability and delay that CI/CD seeks to eliminate.

### The Anti-Pattern

**Symptoms**:
- Pipelines pause for human input multiple times
- "Works on my machine" because manual steps differ between environments
- Deployments only happen during business hours when approvers are available
- "Emergency" deployments bypass procedures because gates take too long

**Example**:
```yaml
# Anti-pattern: Manual gates everywhere
jobs:
  deploy-staging:
    steps:
      - deploy staging
  
  wait-for-approval:
    steps:
      - task: ManualValidation@0
        inputs:
          instructions: 'Approve to continue to production?'
          onTimeout: 'reject'
        timeoutInMinutes: 1440  # Wait up to 24 hours!
  
  deploy-production:
    dependsOn: wait-for-approval
    steps:
      - deploy production
  
  # Another manual gate for verification
  verify-production:
    dependsOn: deploy-production
    steps:
      - task: ManualValidation@0
        inputs:
          instructions: 'Did you check the logs?'
```

**Consequences**:
- **Deployment latency**: Hours or days between staging and production
- **Context switching**: Approvers lack context hours after the change was made
- **Hero culture**: Specific individuals become bottlenecks
- **Cognitive load**: Manual checks that could be automated

### Detection Criteria

```bash
# Query for manual interventions
# Azure DevOps: list in-progress runs with their queue time. Runs that stay
# in progress for hours are often parked at a manual validation step.
az pipelines runs list --pipeline-ids 123 --status inProgress \
  --query "[].{id:id, queued:queueTime}" --output table

# GitHub Actions - check for workflow_dispatch with no automation
# Look for workflows with "workflow_dispatch" as primary trigger
# and no automated verification steps
```

### Refactoring Strategy

**Automated Quality Gates**:
```yaml
# Solution: Automated verification instead of human approval
jobs:
  deploy-staging:
    steps:
      - deploy staging
      - run: smoke-tests.sh
      - run: security-scan.sh
  
  deploy-production:
    needs: deploy-staging
    # No manual gate - automated quality checks passed
    steps:
      - deploy canary 5%
      - run: automated-verification.sh  # Metrics-based
      - deploy 100%
```

**Policy-Based Deployment**:
```yaml
# Solution: OPA/Gatekeeper for automated approval
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-change-ticket
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
  parameters:
    labels:
      - "change.ticket/id"
      - "change.ticket/approved"
---
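# Pipeline automatically validates ticket status
- name: Verify Change Ticket
  env:
    # Pass untrusted text through env instead of interpolating it into the
    # script, which would allow script injection via the commit message
    COMMIT_MESSAGE: ${{ github.event.head_commit.message }}
  run: |
    # verify-ticket is a placeholder for your own ticket-lookup script
    if ! verify-ticket "$COMMIT_MESSAGE"; then
      echo "No valid change ticket found"
      exit 1
    fi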
```

**SRE Error Budgets**:
```yaml
# Instead of manual approval, gate promotion on metrics
# (illustrative schema; Argo Rollouts and Flagger each define their own fields)
deploy:
  strategy:
    canary:
      steps:
      - setWeight: 25
      - analysis:
          thresholdRange: [0.99, 1.01]  # 99% success rate required
          interval: 5m
          count: 3
      # Automatic promotion if metrics pass
      - setWeight: 100
```
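
The promotion logic behind such a canary analysis can be sketched in Python. This is a minimal illustration under assumptions, not the algorithm of any particular rollout controller: `canary_decision` is a hypothetical name, the 99% threshold and the count of three passing intervals mirror the configuration above, and the success rates are assumed to come from your metrics system.

```python
def canary_decision(success_rates, min_success_rate=0.99, required_passes=3):
    """Decide whether to promote, roll back, or keep waiting.

    success_rates holds one success ratio (0.0 to 1.0) per completed
    analysis interval. Any interval below the threshold triggers a rollback.
    """
    for rate in success_rates:
        if rate < min_success_rate:
            return "rollback"
    if len(success_rates) >= required_passes:
        return "promote"
    return "wait"


if __name__ == "__main__":
    print(canary_decision([0.995, 0.999, 1.0]))
    print(canary_decision([0.995, 0.97]))
```

Failing fast on the first bad interval keeps the blast radius small; a more tolerant policy could allow a limited number of failed intervals before rolling back.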

### Prevention

- **Automated Testing**: Replace manual QA with comprehensive automated suites
- **Feature Flags**: Deploy to production dark, enable via configuration rather than pipeline gates
- **ChatOps**: Move approvals to Slack/Teams with context and audit trails, not UI clicks
- **Deployment Windows**: Use policy-as-code to enforce deployment windows automatically
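
As a rough sketch of a deployment window expressed as code rather than as a human gate, the check below could run as the first pipeline step. The window definition (Monday to Thursday, 09:00 to 16:00) and the name `deployment_allowed` are assumptions chosen for illustration.

```python
from datetime import datetime, time


def deployment_allowed(now, windows):
    """Return True if `now` falls inside any allowed window.

    windows is a list of (weekdays, start, end) tuples, where weekdays is a
    set of integers with Monday as 0, and start/end are datetime.time values.
    """
    for weekdays, start, end in windows:
        if now.weekday() in weekdays and start <= now.time() < end:
            return True
    return False


# Hypothetical policy: Monday to Thursday, 09:00 to 16:00
BUSINESS_HOURS = [({0, 1, 2, 3}, time(9, 0), time(16, 0))]

if __name__ == "__main__":
    print(deployment_allowed(datetime.now(), BUSINESS_HOURS))
```

The sketch uses naive local time; a production policy would need an explicit time zone and an audited override path for emergencies.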

## 56.3 Hardcoded Configurations

Embedding environment-specific values, secrets, or URLs in source code or container images creates security risks and prevents environment portability.

### The Anti-Pattern

**Symptoms**:
- `if (environment === 'production')` statements in code
- Database credentials in Git repositories
- Container images that only work in one environment
- Configuration files with commented-out sections for different environments

**Example**:
```dockerfile
# Anti-pattern: Hardcoded in Dockerfile
FROM node:20
COPY . /app
WORKDIR /app

# Hardcoded environment
ENV NODE_ENV=production
ENV DATABASE_HOST=prod-db-01.company.internal
# Secret baked into the image!
ENV DATABASE_PASSWORD=SuperSecret123!
# Visible in container layers
ENV API_KEY=sk_live_abc123xyz

CMD ["node", "server.js"]
```

```javascript
// Anti-pattern: Hardcoded in source
function getDatabaseConfig() {
  if (process.env.NODE_ENV === 'production') {
    return {
      host: 'prod-db-01.company.internal',
      password: 'SuperSecret123!',  // In git history forever
      ssl: true
    };
  } else {
    return {
      host: 'localhost',
      password: 'password',
      ssl: false
    };
  }
}
```

**Consequences**:
- **Security breaches**: Secrets committed to Git remain in history after deletion; treat them as compromised and rotate them
- **Inflexibility**: Cannot deploy same image to multiple environments
- **Configuration drift**: "Works in staging but not production" due to hardcoded differences
- **Compliance failures**: Secrets in code violate SOC 2, PCI-DSS

### Detection Criteria

```bash
# Detect secrets in code
# Using TruffleHog v3 (git-secrets is an alternative)
trufflehog filesystem . --only-verified

# Detect hardcoded URLs
grep -rE "company\.(internal|com)" --include="*.js" --include="*.py" --include="*.go" .

# Check Docker image layers for secrets set via ENV
# (--no-trunc shows the full command for each layer)
docker history --no-trunc myapp:latest | grep -iE "password|secret|key"

# Check Kubernetes manifests for hardcoded values
kubectl get configmaps -o yaml | grep -E "(password|secret|key)" | grep -v "kind:"
```

### Refactoring Strategy

**External Configuration**:
```javascript
// Solution: External configuration only
function getDatabaseConfig() {
  // No environment conditionals
  return {
    host: process.env.DATABASE_HOST,  // Required, no default
    port: parseInt(process.env.DATABASE_PORT || '5432', 10),
    database: process.env.DATABASE_NAME,
    user: process.env.DATABASE_USER,
    password: process.env.DATABASE_PASSWORD,  // From vault/secret manager
    ssl: process.env.DATABASE_SSL === 'true'
  };
}

// Validate required config at startup
function validateConfig() {
  const required = ['DATABASE_HOST', 'DATABASE_USER', 'DATABASE_PASSWORD'];
  for (const env of required) {
    if (!process.env[env]) {
      throw new Error(`Missing required environment variable: ${env}`);
    }
  }
}
```

**12-Factor Config**:
```yaml
# Kubernetes External Secrets
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: SecretStore
    name: aws-secrets-manager
  target:
    name: db-credentials
  data:
  - secretKey: password
    remoteRef:
      key: prod/db/password
---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:latest  # Same image for all environments
        env:
        - name: DATABASE_HOST
          value: "db.company.com"  # Different per environment via overlay
        - name: DATABASE_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
```

**ConfigMaps per Environment**:
```yaml
# Base configuration
# base/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: "info"
  CACHE_TTL: "300"

---
# Production overlay
# production/configmap-patch.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  DATABASE_HOST: "prod-db.cluster.local"
  REDIS_HOST: "prod-redis.cluster.local"

---
# Staging overlay
# staging/configmap-patch.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  DATABASE_HOST: "staging-db.cluster.local"
  REDIS_HOST: "staging-redis.cluster.local"
```

### Prevention

- **Pre-commit hooks**: Block commits with patterns matching secrets
- **Container scanning**: Reject images with hardcoded secrets in layers
- **Config validation**: CI checks that no hardcoded IPs or URLs exist in source
- **Secret rotation**: Regular rotation forces externalization (hardcoded secrets break)
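
A pre-commit secret check can be approximated with a few regular expressions. The sketch below is illustrative only: the patterns are a small assumed sample, `find_secrets` is a hypothetical helper, and dedicated scanners ship far larger rule sets with entropy checks and credential verification.

```python
import re

# Illustrative patterns only; not a complete rule set
SECRET_PATTERNS = [
    # Quoted literal assigned to a sensitive-looking name, e.g. password: 'abc12345'
    re.compile(r"(?i)(password|secret|api_?key|token)\w*['\"]?\s*[:=]\s*['\"][^'\"]{8,}['\"]"),
    # Dockerfile ENV with a sensitive-looking variable name
    re.compile(r"(?i)^\s*ENV\s+\w*(password|secret|key|token)\w*=\S+"),
    # Shape of an AWS access key ID
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # PEM private key header
    re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
]


def find_secrets(text):
    """Return (line_number, stripped_line) pairs that match any pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    sample = "ENV NODE_ENV=production\nENV DATABASE_PASSWORD=SuperSecret123!"
    for number, line in find_secrets(sample):
        print(f"line {number}: {line}")
```

A hook would run this over the staged diff and exit non-zero on any hit. Values read from `process.env` are not flagged because the first pattern requires a quoted literal.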

## 56.4 Skipping Tests

Under pressure, teams disable or skip tests to "save time," destroying the safety net that enables confident deployment.

### The Anti-Pattern

**Symptoms**:
- `@Disabled` or `@Skip` annotations proliferating in test suites
- "Skip tests" checkbox in deployment pipelines
- Flaky tests commented out rather than fixed
- Tests that only run in CI, never locally

**Example**:
```yaml
# Anti-pattern: Optional test execution
jobs:
  build:
    steps:
      - checkout
      
      - name: Build
        run: mvn package -DskipTests  # Always skipping!
      
      # Or conditional based on "urgency"
      - name: Test (Optional)
        run: mvn test
        continue-on-error: true  # Ignores failures!
      
      - name: Deploy
        if: always()  # Deploys even if tests failed
        run: deploy.sh
```

```java
// Anti-pattern: Skipped tests in code
@Test
@Disabled("Flaky, need to fix later")  // Never fixed
public void testPaymentProcessing() {
    // ...
}

@Test
@DisabledIfEnvironmentVariable(named = "CI", matches = "true")  // Skip in CI!
public void testDatabaseMigration() {
    // ...
}
```

**Consequences**:
- **Regression defects**: Broken features reach production
- **Fear of change**: Teams afraid to refactor because tests don't catch breaks
- **Test debt**: Skipped tests rot and become unusable
- **False confidence**: Green builds that don't actually verify anything

### Detection Criteria

```bash
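# Find skipped tests
# Surefire reports skips in its "Tests run: ..., Skipped: N" summary lines
mvn test 2>&1 | grep -E "Tests run:.*Skipped: [1-9]"
# -rs prints a summary line with the reason for every skipped test
pytest -rs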

# Check for skip flags in CI
grep -r "skipTests\|skip-tests\|--skip" .github/workflows/

# Check for continue-on-error with tests
yq eval '.jobs.*.steps[] | select(.run | contains("test")) | .["continue-on-error"]' .github/workflows/*.yml
```

### Refactoring Strategy

**Fast, Reliable Test Suite**:
```yaml
# Solution: Parallel, optimized tests
jobs:
  test:
    strategy:
      matrix:
        shard: [1, 2, 3, 4]  # Split test suite
    steps:
      - checkout
      
      - name: Cache dependencies
        uses: actions/cache@v4
        with:
          path: ~/.m2/repository
          key: ${{ runner.os }}-m2-${{ hashFiles('**/pom.xml') }}
      
      - name: Test Shard ${{ matrix.shard }}
        # test.shard and test.shard.total are project-defined properties;
        # Surefire does not shard by itself, so the build must act on them
        run: |
          mvn test \
            -Dtest.shard=${{ matrix.shard }} \
            -Dtest.shard.total=4 \
            -Dsurefire.failIfNoSpecifiedTests=false
      
      # Fail build on test failure - no exceptions
      - name: Verify Test Results
        run: |
          if grep -lE "FAILURE|ERROR" target/surefire-reports/*.txt 2>/dev/null; then
            exit 1
          fi
```
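
One way a build could act on a shard index is to assign each test to a shard by hashing its identifier. The following Python sketch shows the idea under assumptions: `shard_for` and `select_tests` are hypothetical helpers, and the test identifiers are made-up class names.

```python
import zlib


def shard_for(test_id, total_shards):
    """Map a test identifier to a 1-based shard index that is stable across runs."""
    return zlib.crc32(test_id.encode("utf-8")) % total_shards + 1


def select_tests(test_ids, shard, total_shards):
    """Return the tests that belong to the given shard."""
    return [t for t in test_ids if shard_for(t, total_shards) == shard]


if __name__ == "__main__":
    tests = ["PaymentTest", "OrderTest", "UserTest", "InventoryTest", "CartTest"]
    for shard in range(1, 5):
        print(shard, select_tests(tests, shard, 4))
```

Hash-based assignment needs no shared state between runners, but it balances by test count rather than by duration; splitting on recorded timings gives more even shards at the cost of storing history.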

**Flaky Test Quarantine**:
```yaml
# Isolate flaky tests instead of skipping
jobs:
  reliable-tests:
    steps:
      - run: mvn test -DexcludedGroups=flaky
  
  flaky-tests:
    continue-on-error: true  # Only this job allowed to fail
    steps:
      - run: mvn test -Dgroups=flaky
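      - if: failure()  # Alert only when a quarantined test actually failed
        run: |
          # Alert on flaky tests so they get fixed
          # (incoming webhooks authenticate via the URL, no bearer token needed)
          curl -X POST \
            -H "Content-Type: application/json" \
            -d '{"text":"Flaky tests failed in run ${{ github.run_id }}"}' \
            "${{ secrets.SLACK_WEBHOOK }}"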
```

**Test Data Management**:
```java
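// Solution: Test containers for reliable integration tests
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
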
@Testcontainers
public class DatabaseIntegrationTest {
    
    @Container
    private static final PostgreSQLContainer<?> postgres = 
        new PostgreSQLContainer<>("postgres:15")
            .withDatabaseName("test")
            .withUsername("test")
            .withPassword("test");
    
    @BeforeAll
    static void setup() {
        // Dynamic configuration - no hardcoded ports
        System.setProperty("DB_URL", postgres.getJdbcUrl());
    }
    
    @Test
    public void testPersistData() {
        // Reliable, isolated test
    }
}
```

### Prevention

- **Coverage gates**: Block PRs that decrease coverage
- **Test time budgets**: Fail builds if test duration increases >10%
- **Flaky test tracking**: Track and fix flaky tests aggressively (target: <0.1% flakiness)
- **Local-first**: Tests must pass locally before pushing (pre-commit hooks)
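
Flaky test tracking needs a working definition of "flaky". A common one, assumed here, is a test that both passed and failed on the same commit. The sketch below computes a flakiness rate from result records; the record shape and function names are hypothetical.

```python
from collections import defaultdict


def flaky_tests(results):
    """Return names of tests that both passed and failed on the same commit.

    results is an iterable of (test_name, commit_sha, passed) tuples.
    """
    outcomes = defaultdict(set)
    for name, sha, passed in results:
        outcomes[(name, sha)].add(passed)
    return sorted({name for (name, _), seen in outcomes.items() if seen == {True, False}})


def flakiness_rate(results):
    """Fraction of distinct tests that showed flaky behaviour."""
    results = list(results)
    names = {name for name, _, _ in results}
    if not names:
        return 0.0
    return len(flaky_tests(results)) / len(names)


if __name__ == "__main__":
    history = [
        ("testCheckout", "abc123", True),
        ("testCheckout", "abc123", False),
        ("testLogin", "abc123", True),
    ]
    print(flaky_tests(history), flakiness_rate(history))
```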

## 56.5 Large Images

Bloated container images containing debug tools, build artifacts, or multiple versions of dependencies increase attack surface, slow deployments, and waste resources.

### The Anti-Pattern

**Symptoms**:
- Images > 1GB for simple services
- Images contain package managers (apt, yum) in production
- Source code, test files, and documentation in production images
- Multiple layer caches bloating storage

**Example**:
```dockerfile
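# Anti-pattern: Bloated image
# General-purpose base (~80MB) on a moving tag
# (Dockerfile comments must start the line; a trailing "#" after FROM is an error)
FROM ubuntu:latest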

# Install everything including kitchen sink
RUN apt-get update && apt-get install -y \
    build-essential \
    gcc \
    g++ \
    vim \
    curl \
    wget \
    net-tools \
    tcpdump \
    openssh-server \
    mysql-client \
    nodejs \
    npm \
    python3 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Copy entire repo including tests and docs
COPY . /app

# Build in same layer
RUN cd /app && npm install && npm run build

# Keep devDependencies
RUN npm install  # Includes test frameworks, linters, etc.

EXPOSE 8080
CMD ["node", "app.js"]
```

**Consequences**:
- **Security**: Larger attack surface with unused tools (vim, curl, ssh)
- **Performance**: Slower image pulls, which lengthen deployments, cold starts, and scale-out
- **Cost**: Increased registry storage and data transfer
- **Reliability**: More layers = more potential failure points

### Detection Criteria

```bash
# Check image size
docker images myapp --format "{{.Size}}"

# Analyze image layers
dive myapp:latest  # Interactive TUI
docker history myapp:latest

# Check for sensitive tools in image
docker run --rm myapp:latest which \
  sshd curl wget nc telnet vim apt yum 2>/dev/null

# Check for source code
docker run --rm myapp:latest ls -la /app/tests 2>/dev/null
docker run --rm myapp:latest ls -la /app/src/*.test.js 2>/dev/null
```

### Refactoring Strategy

**Multi-Stage Builds**:
```dockerfile
# Solution: Distroless multi-stage build
# Stage 1: Build environment
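# glibc-based builder so native modules match the Debian-based runtime image
FROM node:20-bookworm-slim AS builder
WORKDIR /build

# Only copy dependency files first (caching)
COPY package*.json ./
# Install all dependencies; the build step usually needs devDependencies
RUN npm ci

# Copy source and build
COPY . .
RUN npm run build

# Drop devDependencies before node_modules is copied into the runtime stage
RUN npm prune --omit=dev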

# Stage 2: Production environment
FROM gcr.io/distroless/nodejs20-debian12:nonroot

WORKDIR /app

# Copy only built artifacts and node_modules
COPY --from=builder --chown=nonroot:nonroot /build/dist ./dist
COPY --from=builder --chown=nonroot:nonroot /build/node_modules ./node_modules
COPY --from=builder --chown=nonroot:nonroot /build/package.json .

# No shell, no package manager, minimal attack surface
USER nonroot:nonroot
EXPOSE 8080
CMD ["dist/server.js"]
```

**BuildKit Optimizations**:
```dockerfile
# syntax=docker/dockerfile:1
# The syntax directive above must be the first line of the Dockerfile
FROM node:20-alpine AS builder
WORKDIR /build
COPY package*.json ./

# The cache mount keeps the npm cache between builds; the secret mount exposes
# .npmrc only during this RUN step, so it is never written to a layer
RUN --mount=type=cache,target=/root/.npm \
    --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm ci

# Mount ssh key for private repos (not in image)
# Requires git and an ssh client in the image, plus known_hosts for github.com
RUN --mount=type=ssh,id=github \
    git clone git@github.com:company/private.git
```

**Alpine/Scratch Base**:
```dockerfile
# For Go binaries - static compilation + scratch
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o main .

# Empty base image
FROM scratch
COPY --from=builder /app/main /main
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
EXPOSE 8080
ENTRYPOINT ["/main"]
```

### Prevention

- **Image size limits**: CI rejects images > 200MB without approval
- **Approved base images**: Only approved minimal bases allowed (distroless, alpine, scratch)
- **Multi-stage builds**: Mandatory for all services, so build tooling stays out of runtime images
- **Regular base updates**: Automated PRs for base image updates
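
An image size limit can be enforced with a short CI check. The sketch below is one possible shape, with assumptions: it reads the size in bytes from `docker image inspect`, uses decimal megabytes to match how Docker displays sizes, and the helper names are hypothetical.

```python
import subprocess


def image_size_bytes(image):
    """Ask the local Docker daemon for an image's size in bytes."""
    result = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{.Size}}", image],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


def within_size_limit(size_bytes, limit_mb=200):
    """Return True when the image fits the budget (decimal megabytes)."""
    return size_bytes <= limit_mb * 1_000_000


if __name__ == "__main__":
    # Example with a fixed number so the sketch runs without Docker installed
    print(within_size_limit(150_000_000))
```

In a pipeline, the script would call `image_size_bytes` on the freshly built tag and exit non-zero when `within_size_limit` returns False.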

## 56.6 Over-Engineering

Over-engineering means creating abstractions, generic frameworks, or complex configurations for hypothetical future requirements that never materialize. The result is a system that is hard to understand and modify.

### The Anti-Pattern

**Symptoms**:
- "Generic" pipeline templates with 50 parameters used by 2 services
- Abstract base classes for "future extensibility" with no implementations
- Complex DSLs for configuration when YAML would suffice
- Microservices architecture for a 3-person team

**Example**:
```yaml
# Anti-pattern: Over-abstracted pipeline
# .github/workflows/reusable-mega-workflow.yml
name: Mega Reusable Workflow

on:
  workflow_call:
    inputs:
      service_name:
        required: true
        type: string
      environment:
        required: true
        type: string
      strategy:
        required: false
        type: string
        default: 'rolling'
      canary_percentage:
        required: false
        type: number
        default: 10
      rollback_strategy:
        required: false
        type: string
        default: 'automatic'
      notification_channels:
        required: false
        type: string
        default: 'slack,email,pagerduty'
      # ... 40 more parameters ...
      enable_feature_x:
        required: false
        type: boolean
        default: false
      feature_x_config:
        required: false
        type: string

jobs:
  deploy:
    steps:
      - name: Parse 50 parameters
        run: |
          # Complex bash script to handle all combinations
          if [ "${{ inputs.enable_feature_x }}" == "true" ]; then
            # Complex logic for feature that doesn't exist yet
            :
          fi
```

```python
# Anti-pattern: Abstract base classes for everything
class AbstractDeploymentStrategyFactory:
    def create_strategy(self):
        raise NotImplementedError

class AbstractCanaryStrategy(AbstractDeploymentStrategyFactory):
    pass  # No implementations, "for future use"

class AbstractBlueGreenStrategy(AbstractDeploymentStrategyFactory):
    pass
```

**Consequences**:
- **Cognitive load**: Engineers cannot understand the pipeline without deep diving
- **Rigidity**: Hard to change because of unknown dependencies on "flexible" features
- **Debug difficulty**: When failures occur, hard to trace through abstraction layers
- **Wasted effort**: Building for requirements that never come

### Detection Criteria

```bash
# Complexity metrics
# Count lines in workflow files
find .github/workflows -name "*.yml" -exec wc -l {} + | sort -n

# Check for excessive parameters
yq eval '.on.workflow_call.inputs | length' .github/workflows/reusable.yml

# Count ShellCheck findings as a rough proxy for script health
# (ShellCheck is a linter; it does not compute cyclomatic complexity)
shellcheck -f json deploy.sh | jq 'length'
```

### Refactoring Strategy

**Explicit over Generic**:
```yaml
# Solution: Simple, explicit pipeline
# .github/workflows/service-a.yml
name: Deploy Service A

on:
  push:
    branches: [main]
    paths: ['services/a/**']

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Build
        run: docker build -t service-a:${{ github.sha }} ./services/a
      
      - name: Deploy to Staging
        if: github.ref == 'refs/heads/main'
        run: |
          helm upgrade --install service-a ./chart \
            --namespace staging \
            --set image.tag=${{ github.sha }}
      
      - name: Deploy to Production
        if: github.ref == 'refs/heads/main'
        run: |
          helm upgrade --install service-a ./chart \
            --namespace production \
            --set image.tag=${{ github.sha }} \
            --wait
```

**YAGNI (You Aren't Gonna Need It)**:
```python
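# Solution: Concrete implementation, refactor when needed
# set_weight, error_rate, rollback, and run_standard_deployment are
# placeholders for your own deployment helpers
from time import sleep
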
def deploy_service(service_name, environment):
    """Deploy service - simple and explicit."""
    if environment == "production":
        run_canary_deployment(service_name)
    else:
        run_standard_deployment(service_name)

# Only abstract when you have 3+ implementations
def run_canary_deployment(service):
    # Concrete implementation
    set_weight(service, 5)
    sleep(300)
    if error_rate() < 0.001:
        set_weight(service, 100)
    else:
        rollback(service)
```

### Prevention

- **Rule of Three**: Don't abstract until you have three concrete implementations
- **Code review checklist**: "Is this the simplest solution that works?"
- **Complexity budgets**: Reject PRs that increase cyclomatic complexity >20%
- **Documentation requirement**: Complex abstractions must include architecture decision records explaining why complexity is necessary
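
One cheap signal of an over-abstracted template is a set of inputs that no caller ever passes. The sketch below assumes you have already extracted the declared input names and each caller's `with:` arguments (for example with `yq`); `unused_inputs` is a hypothetical helper.

```python
def unused_inputs(declared_inputs, caller_arguments):
    """Return declared inputs that no caller sets.

    declared_inputs is an iterable of input names from the reusable workflow;
    caller_arguments is a list of dicts, one per calling workflow.
    """
    used = set()
    for arguments in caller_arguments:
        used.update(arguments)
    return sorted(set(declared_inputs) - used)


if __name__ == "__main__":
    declared = ["service_name", "environment", "strategy", "enable_feature_x"]
    callers = [
        {"service_name": "a", "environment": "staging"},
        {"service_name": "b", "environment": "production", "strategy": "canary"},
    ]
    print(unused_inputs(declared, callers))
```

Inputs that appear in this list are candidates for deletion under the Rule of Three.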

## 56.7 Siloed Teams

Organizational structures that separate development, operations, security, and QA create handoffs, delays, and adversarial relationships ("throw it over the wall").

### The Anti-Pattern

**Symptoms**:
- Separate Jira projects for Dev and Ops with ticket handoffs
- "Code complete" means throwing to QA, then to Ops
- Security review happens only before release (blocking)
- "Works in dev, Ops problem now" mentality
- Different tooling and access between teams

**Example**:
```markdown
# Anti-pattern: Siloed workflow

1. DEV writes code → Commits to repo
2. DEV creates ticket "Please deploy to staging" → Assigns to OPS
3. OPS deploys 2 days later (backlog)
4. QA tests → Finds bug → Ticket back to DEV
5. DEV fixes → New ticket to OPS for redeploy
6. Eventually reaches PROD
7. SECURITY scans → Finds vulnerability → Blocks release
8. Emergency patch → Emergency deploy (high risk)
```

**Consequences**:
- **Long lead times**: Weeks from code complete to production
- **Low quality**: Issues found late when expensive to fix
- **Blame culture**: "They deployed it wrong" vs "They wrote buggy code"
- **Knowledge silos**: Only Ops knows how to deploy; only Dev knows the code

### Detection Criteria

- **MTTR (Mean Time To Recovery)**: Consistently >1 hour can indicate slow handoffs between teams during incidents
- **Change failure rate**: >15% suggests handoff quality issues
- **Deployment frequency**: <1 per week indicates batching due to coordination cost
- **On-call rotation**: Only Ops on-call (Dev never carries pager)
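
These indicators can be computed from deployment and incident records. The following is a minimal sketch with an assumed record shape (a `failed` flag per deployment, start and end timestamps per incident); `delivery_metrics` is a hypothetical name, and the thresholds above remain heuristics rather than hard rules.

```python
from datetime import datetime, timedelta


def delivery_metrics(deployments, incidents, period_days):
    """Summarize delivery health over a reporting period.

    deployments is a list of dicts with a boolean "failed" key;
    incidents is a list of (start, end) datetime pairs.
    """
    total = len(deployments)
    failures = sum(1 for d in deployments if d["failed"])
    durations = [end - start for start, end in incidents]
    return {
        "deploys_per_week": total / (period_days / 7),
        "change_failure_rate": failures / total if total else 0.0,
        "mttr": sum(durations, timedelta()) / len(durations) if durations else None,
    }


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0)
    metrics = delivery_metrics(
        deployments=[{"failed": False}, {"failed": True}],
        incidents=[(start, start + timedelta(minutes=45))],
        period_days=14,
    )
    print(metrics)
```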

### Refactoring Strategy

**You Build It, You Run It**:
```yaml
# Solution: Self-service platform
# Platform team provides golden paths, product teams operate

# Platform provides:
# - Reusable GitHub Actions workflows
# - Terraform modules for infrastructure
# - Helm charts for deployment
# - Observability dashboards
# - Runbooks and SOPs

# Product team owns:
# - Their CI/CD pipeline (uses platform tools)
# - Their production deployments
# - Their on-call rotation
```

**Cross-Functional Teams**:
```yaml
# Team structure
team: "Payments Squad"
members:
  - backend_engineers: 3
  - frontend_engineer: 1
  - site_reliability_engineer: 1  # Embedded Ops
  - security_champion: 1          # Embedded Security
  - product_manager: 1

responsibilities:
  - "Write code"
  - "Deploy to production"
  - "Monitor and respond to incidents"
  - "Security compliance for domain"

# Shared on-call
on_call_rotation:
  includes_all_engineers: true
  schedule: "weekly"
```

**Platform as Product**:
```yaml
# Platform team treats internal services as products
platform_team:
  customers: "Development teams"
  services:
    ci_cd_platform:
      sla: "99.9% availability"
      documentation: "developer.company.com"
      support_channel: "#platform-support"
    
    kubernetes_platform:
      self_service: true
      terraform_modules: "github.com/company/terraform-modules"
      runbooks: "runbooks.company.com"
```

### Prevention

- **Team charters**: Explicitly define "You build it, you run it"
- **Shared on-call**: Developers carry pager duty for their services
- **Embedded specialists**: SREs and Security embedded in teams, not gatekeepers
- **Internal NPS**: Measure developer satisfaction with platform teams

## 56.8 Ignoring Security

Treating security as a final gate or separate concern results in bolted-on protections, expensive remediation, and compliance failures.

### The Anti-Pattern

**Symptoms**:
- Security scan only on release branches
- Vulnerabilities found in production images
- No secrets management (hardcoded credentials)
- Penetration testing only annually
- Security team is the "department of no"

**Example**:
```yaml
# Anti-pattern: Security as final gate
stages:
  - build
  - unit_test
  - integration_test
  - deploy_to_staging
  - deploy_to_production  # Deploy first!
  - security_scan          # Check after deployment (too late!)
```

```dockerfile
# Anti-pattern: No security hardening
# Unpinned base image, never scanned for vulnerabilities
FROM ubuntu:latest
# Unnecessary attack surface
RUN apt-get update && apt-get install -y openssh-server

# Running as root
USER root

# No health checks
# No resource limits
```

**Consequences**:
- **Data breaches**: Vulnerabilities reach production
- **Compliance failures**: Cannot pass audits due to missing controls
- **Expensive remediation**: Fixing production incidents vs preventing in CI
- **Delayed releases**: Security findings block releases at last minute

### Detection Criteria

```bash
# Check if security scans only happen late
grep -A 5 "security_scan" .github/workflows/*.yml | grep -E "(needs:|stage:)" | grep -v "build\|test"

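# Check for Dockerfiles that run as root: an explicit USER root,
# or no USER instruction at all (root is the default)
grep -l "^USER root" Dockerfile*
grep -L "^USER" Dockerfile*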

# Check for missing resource limits
yq eval '.spec.template.spec.containers[].resources' deployment.yaml | grep -c "null"
```

### Refactoring Strategy

**Shift Left Security**:
```yaml
# Solution: Security in every stage
jobs:
  lint:
    steps:
      - run: npm audit  # Dependency check early
      - run: hadolint Dockerfile  # Dockerfile best practices
  
  sast:
    steps:
      - uses: github/codeql-action/init@v3
      - uses: github/codeql-action/analyze@v3
  
  build:
    needs: [lint, sast]
    steps:
      - build image
      
      - name: Scan image before push
        # @master shown for brevity; pin to a release tag or commit SHA in real pipelines
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: myapp:${{ github.sha }}
          severity: 'CRITICAL,HIGH'
          exit-code: '1'  # Fail build on vulnerabilities
      
      - push image (only if scan passes)
  
  deploy:
    needs: build
    steps:
      - deploy with security contexts
      - verify network policies active
```
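The gate that the `severity` and `exit-code` settings express can be sketched as a filter over a scan report. The example below is illustrative only. It assumes a JSON report shaped like Trivy's (`Results[].Vulnerabilities[]` entries carrying `VulnerabilityID` and `Severity`), and the names `blocking_findings` and `gate_exit_code` are hypothetical.

```python
import json
from typing import Any, Dict, List

# Severities that should fail the build, mirroring 'CRITICAL,HIGH' in the workflow.
BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


def blocking_findings(report: Dict[str, Any]) -> List[str]:
    """List findings whose severity should block the image push."""
    findings: List[str] = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" may be absent or null for targets with no findings.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING_SEVERITIES:
                findings.append(
                    f'{vuln.get("VulnerabilityID")} ({vuln.get("Severity")}) '
                    f'in {result.get("Target")}'
                )
    return findings


def gate_exit_code(report_json: str) -> int:
    """Return 1 when any blocking finding exists, otherwise 0."""
    return 1 if blocking_findings(json.loads(report_json)) else 0


if __name__ == "__main__":
    import sys

    # Usage: python gate.py report.json
    with open(sys.argv[1]) as handle:
        text = handle.read()
    for finding in blocking_findings(json.loads(text)):
        print(finding)
    sys.exit(gate_exit_code(text))
```

Keeping the blocking set in one constant makes the policy explicit and easy to review when the team tightens or relaxes the threshold.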

**Secure by Default**:
```yaml
# Solution: Security contexts mandatory
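apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp  # Placeholder name
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 65532
        fsGroup: 65532
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: app
        image: myapp:latest  # Placeholder; prefer an immutable tag or digest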
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        resources:
          limits:
            cpu: "1000m"
            memory: "1Gi"
          requests:
            cpu: "100m"
            memory: "128Mi"
```
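The settings in the manifest above can also be checked automatically before deployment. The sketch below assumes the Deployment has already been parsed into a Python dictionary (for example by a YAML library), and `find_violations` is a hypothetical helper, not part of any Kubernetes tooling. It covers only the fields shown in the manifest.

```python
from typing import Any, Dict, List


def find_violations(deployment: Dict[str, Any]) -> List[str]:
    """Report containers that miss the hardening settings from the manifest."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    pod_ctx = pod_spec.get("securityContext") or {}
    violations: List[str] = []

    for container in pod_spec.get("containers") or []:
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}

        # A container-level runAsNonRoot takes precedence over the pod-level value.
        if ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot")) is not True:
            violations.append(f"{name}: runAsNonRoot is not true")
        if ctx.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation is not false")
        if ctx.get("readOnlyRootFilesystem") is not True:
            violations.append(f"{name}: readOnlyRootFilesystem is not true")

        dropped = (ctx.get("capabilities") or {}).get("drop") or []
        if "ALL" not in dropped:
            violations.append(f"{name}: capabilities do not drop ALL")

        limits = (container.get("resources") or {}).get("limits") or {}
        if not {"cpu", "memory"} <= set(limits):
            violations.append(f"{name}: missing cpu or memory limits")

    return violations
```

An empty list means the manifest passes. In practice an admission controller enforces the same rules inside the cluster, and a check like this only gives earlier feedback in CI.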

**DevSecOps Integration**:
```yaml
# Security as enabling, not blocking
security_champion:
  role: "Embedded security advisor"
  activities:
    - "Threat modeling sessions"
    - "Secure coding training"
    - "Automated security tooling setup"
    - "Incident response support"
  
  # Security gates automated
  gates:
    - sast: "Automated in CI"
    - dast: "Automated in staging"
    - dependency_check: "Automated on every build"
    - manual_review: "Only for major architecture changes"
```

### Prevention

- **Security automation**: SAST/DAST/SCA in every build, not just releases
- **Vulnerability SLAs**: Critical vulnerabilities fixed within 24 hours
- **Least privilege**: Default deny all, explicit allows only
- **Security champions**: Security expertise embedded in teams
- **Compliance as code**: Policy enforcement (OPA/Kyverno) prevents insecure deployments
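
A vulnerability SLA only helps if breaches are detected automatically. The following is a minimal sketch of such a check. Only the 24-hour window for critical findings comes from the list above; the names `REMEDIATION_SLA` and `sla_breached` are hypothetical, and windows for other severities are left for each team to define.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

# Only the critical window is taken from the text; add other severities per your policy.
REMEDIATION_SLA: Dict[str, timedelta] = {"CRITICAL": timedelta(hours=24)}


def sla_breached(
    severity: str,
    detected_at: datetime,
    now: datetime,
    sla: Dict[str, timedelta] = REMEDIATION_SLA,
) -> bool:
    """Return True when an open finding has exceeded its remediation window."""
    window = sla.get(severity.upper())
    if window is None:
        return False  # No SLA defined for this severity
    return now - detected_at > window


if __name__ == "__main__":
    detected = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
    current = detected + timedelta(hours=30)
    print(sla_breached("critical", detected, current))
```

A scheduled job could run this over all open findings and page the owning team for each breach.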

---

## Chapter Summary and Preview

This chapter examined eight critical anti-patterns that undermine CI/CD initiatives and provided concrete detection criteria and refactoring strategies for each. The **Giant Monolith Pipeline** creates coupling and feedback delays that destroy velocity; the solution is path-based triggering and micro-pipelines that isolate services. **Manual Intervention Gates** reintroduce human variability and delay; replace them with automated quality gates and policy-based deployment. **Hardcoded Configurations** prevent portability and create security risks; externalize all configuration and use secret managers. **Skipping Tests** destroys the safety net that enables confident deployment; maintain fast, reliable test suites and fix flaky tests rather than ignoring them. **Bloated Container Images** expand attack surfaces and slow deployments; implement multi-stage builds and distroless bases. **Over-Engineered Abstractions** create complexity for hypothetical futures; embrace YAGNI and explicit simplicity. **Siloed Teams** create handoffs and adversarial relationships; implement "You Build It, You Run It" and embed specialists in teams. **Ignoring Security** until production results in expensive rework; shift security left with automated scanning and secure-by-default configurations.

These anti-patterns are interconnected: siloed teams create manual interventions, over-engineering creates monolithic pipelines, and ignoring security leads to hardcoded credentials. Addressing them requires both technical changes (refactoring pipelines) and cultural changes (team structure, accountability).

**Key Takeaways:**
- Monolithic pipelines are the enemy of velocity; decouple services with independent pipelines triggered by path filters.
- Every manual gate is a queue with unbounded wait time; automate approvals through policy and metrics.
- Hardcoded secrets and configuration are security time bombs; externalize everything and rotate credentials regularly.
- Skipped tests represent unverified risk; maintain a zero-tolerance policy for skipped tests in main branches.
- Container images should be minimal attack surfaces, not development environments; multi-stage builds are mandatory.
- Complexity is a liability; do not abstract until you have three concrete implementations, and prefer explicit over clever.
- Organizational design determines software architecture; align teams with services and embed operational responsibility.
- Security is a quality attribute, not a phase; automate security checks in CI and fail builds on critical vulnerabilities.

**Detection Checklist**:
- [ ] Pipeline duration < 15 minutes for any service
- [ ] Zero manual gates between commit and production
- [ ] No secrets in Git or container layers
- [ ] Test coverage never decreases
- [ ] Container images < 200MB
- [ ] Cyclomatic complexity < 20 per workflow
- [ ] Development teams can deploy without tickets
- [ ] Security scan passes before image push

**Next Chapter Preview:** Chapter 57: Measuring CI/CD Success establishes the metrics and KPIs that indicate whether your CI/CD transformation is succeeding. We will explore **DORA metrics** (Deployment Frequency, Lead Time, Change Failure Rate, MTTR) as the industry-standard indicators of software delivery performance, **lead time** measurement from commit to production, **deployment frequency** as a predictor of organizational agility, **change failure rate** tracking the stability of your delivery process, **mean time to recovery** measuring operational resilience, **pipeline success rate** indicating build health, **developer productivity** metrics that avoid vanity measures, and **cost metrics** that optimize cloud spend. We will also examine how to implement metric collection without creating perverse incentives, build dashboards that drive improvement rather than blame, and establish SLOs for the CI/CD platform itself.

