# Chapter 46: Debugging CI/CD Pipelines

Continuous delivery pipelines are complex distributed systems that orchestrate code compilation, artifact creation, security scanning, and deployment across multiple environments. When these pipelines fail, the root cause may reside in application code, infrastructure configuration, network connectivity, resource constraints, or the pipeline definitions themselves. Debugging CI/CD pipelines requires distinct strategies from debugging applications: pipelines are ephemeral, execute in isolated environments, and often lack interactive debugging capabilities. Understanding how to extract diagnostic information from failed builds, reproduce failures locally, and trace execution across distributed systems is essential for maintaining reliable delivery.

This chapter establishes systematic approaches for diagnosing pipeline failures, from local container debugging to distributed tracing across Kubernetes clusters, ensuring rapid resolution when continuous delivery breaks down.

## 46.1 Common Pipeline Failures

Pipeline failures manifest in distinct categories, each requiring specific diagnostic approaches.

### Build Failures

**Compilation Errors:**
```
[ERROR] COMPILATION ERROR
[ERROR] /src/main/java/com/company/PaymentService.java:[42,17] cannot find symbol
  symbol:   method processPayment()
  location: class com.company.PaymentService
```

**Dependency Resolution:**
```
npm ERR! code E404
npm ERR! 404 Not Found - GET https://registry.npmjs.org/@company/private-pkg/-/private-pkg-1.0.0.tgz
npm ERR! 404 '@company/private-pkg@1.0.0' is not in this registry.
```

**Explanation:**
Build failures are typically deterministic: the same code in the same environment produces the same failure. Check for:
- Missing imports or dependencies
- Version conflicts (transitive dependency upgrades)
- Environment differences (JDK version, Node version)
- Private registry authentication failures

### Test Failures

**Flaky Tests:**
```
Test PaymentServiceTest.shouldProcessPayment: FAILED
Expected: COMPLETED
Actual: PENDING
```

**Timeout Failures:**
```
Test execution timed out after 300 seconds
```

**Explanation:**
Flaky tests pass and fail inconsistently, often due to:
- Race conditions in async code
- External service dependencies
- Time-sensitive assertions
- Shared state between tests
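
One way to confirm that a test is flaky rather than broken is to run it repeatedly against unchanged code and count the failures. The sketch below assumes the test can be invoked as a single command; `count_failures` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Run a command repeatedly and print how many of the runs failed.
# Usage: count_failures <runs> <command> [args...]
count_failures() {
  local runs="$1" failures=0 i
  shift
  for ((i = 1; i <= runs; i++)); do
    # Discard output; only the exit status matters here.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
  done
  echo "$failures"
}

# Example usage (Maven Surefire's -Dtest selects a single test class):
# count_failures 20 ./mvnw -q test -Dtest=PaymentServiceTest
```

A result of 0 or 20 points to a deterministic outcome; anything in between suggests one of the causes listed above.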

**Mitigation:**
```yaml
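# GitLab CI: retry the job when the test script itself fails
# (runner_system_failure would only cover runner problems, not flaky tests)
test:
  script: ./mvnw test
  retry:
    max: 2
    when: script_failure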
```

### Deployment Failures

**Image Pull Errors:**
```
Failed to pull image "payment-service:v2.1.0": 
rpc error: code = Unknown desc = Error response from daemon: 
pull access denied for payment-service, repository does not exist or may require 'docker login'
```

**Crash Loop Backoffs:**
```
Back-off restarting failed container
Error: container create failed: time="2024-01-15T10:00:00Z" level=error msg="container_linux.go:380: starting container process caused: exec: \"./start.sh\": permission denied"
```

## 46.2 Container Debugging Techniques

### Docker Exec

Access running containers for interactive debugging:

```bash
# Shell into running container
docker exec -it payment-pod-abc123 /bin/sh

# Check process status
docker exec payment-pod-abc123 ps aux

# View environment variables
docker exec payment-pod-abc123 env

# Copy files from container to host for analysis
docker cp payment-pod-abc123:/var/log/app.log ./local-app.log

# Run specific debugging commands
docker exec payment-pod-abc123 java -XX:+PrintFlagsFinal -version
```

**Explanation:**
The `docker exec` command runs a new process in an existing container's namespaces. The `-it` flags allocate a pseudo-TTY and keep stdin open for interactive shells. This is useful for checking file system state, running diagnostic commands, or inspecting running processes without stopping the container.

### Debug Containers (Distroless)

Distroless images lack shells, requiring special debugging techniques:

```bash
# Attach an ephemeral debug container that joins the target's process namespace
kubectl debug -it payment-pod-abc123 \
  --image=busybox:1.36 \
  --target=payment-service \
  -- /bin/sh

# Or use a network-focused toolbox image
# (ephemeral containers are enabled by default from Kubernetes 1.23 and stable in 1.25)
kubectl debug -it payment-pod-abc123 \
  --image=nicolaka/netshoot \
  -- /bin/bash
```

**Debug Container Configuration:**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-service
spec:
  containers:
    - name: payment-service
      image: gcr.io/distroless/java17-debian12
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
    # Debug sidecar: it runs in every replica, so include it only in non-production manifests
    - name: debug
      image: busybox:1.36
      command: ["sleep", "infinity"]
      resources:
        limits:
          cpu: "50m"
          memory: "32Mi"
      lifecycle:
        postStart:
          exec:
            command: ["/bin/sh", "-c", "echo 'Debug container ready'"]
```

**Explanation:**
Distroless images (Google's `gcr.io/distroless`) contain only the application and its runtime: no shell, package manager, or utilities. The `kubectl debug` command adds an ephemeral container to the running pod. Like every container in a pod, it shares the pod's network and IPC namespaces; with `--target`, it also joins the target container's process namespace (if the container runtime supports it), so debugging tools can inspect the target process. The `nicolaka/netshoot` image bundles network debugging tools such as `tcpdump`, `netstat`, and `curl`.

### Ephemeral Debug Containers

Ephemeral containers have been enabled by default since Kubernetes 1.23 (beta) and are stable as of 1.25:

```bash
# Add ephemeral container to running pod
kubectl debug payment-pod-abc123 \
  -it \
  --image=busybox \
  -- /bin/sh

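# Debug specific container in multi-container pod
# (--target selects the container to inspect; --container would only name the debug container)
kubectl debug payment-pod-abc123 \
  -it \
  --target=payment-service \
  --image=nicolaka/netshoot \
  -- /bin/bash

# Copy a file out through an ephemeral container
# (-i attaches so output is streamed; /proc/1 is the target's main process only with --target)
kubectl debug payment-pod-abc123 \
  -i --quiet \
  --target=payment-service \
  --image=busybox \
  -- /bin/sh -c "cat /proc/1/root/app/config/application.yml" > /tmp/config.yml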
```

**Explanation:**
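Ephemeral containers are temporary containers added to a running pod through the pod's `ephemeralcontainers` subresource; the containers defined in the original manifest are left untouched and nothing restarts. They share the pod's network namespace and, when `--target` is set, the target container's process namespace, which enables inspection of its filesystem (via `/proc/1/root/`), network stack, and process table. An ephemeral container cannot be removed once added: after the debugging session ends, it remains listed on the pod in a terminated state until the pod itself is deleted.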
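
To see which ephemeral containers have been added to a pod, the pod spec can be queried directly. This is a small sketch; `list_ephemeral_containers` is a hypothetical wrapper name, and the namespace default is an assumption.

```bash
#!/usr/bin/env bash
# Print the name of every ephemeral container recorded on a pod, one per line.
# Usage: list_ephemeral_containers <pod> [namespace]
list_ephemeral_containers() {
  local pod="$1" namespace="${2:-default}"
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{range .spec.ephemeralContainers[*]}{.name}{"\n"}{end}'
}

# Example usage:
# list_ephemeral_containers payment-pod-abc123 production
```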

## 46.3 Kubernetes Pod Troubleshooting

### Crash Loop Backoffs

**Diagnosing Crash Loops:**
```bash
# Check pod status and restart count
kubectl get pods -n production
# NAME              READY   STATUS             RESTARTS   AGE
# payment-pod       0/1     CrashLoopBackOff   5          10m

# View logs from previous (crashed) container
kubectl logs payment-pod -n production --previous

# Describe pod for events
kubectl describe pod payment-pod -n production
```
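
The exit code of the crashed container often narrows the search before any logs are read. The mapping below is a sketch based on common shell and signal conventions (128 plus the signal number); `explain_exit_code` is a hypothetical helper, and the hints are starting points rather than definitive diagnoses.

```bash
#!/usr/bin/env bash
# Map a container exit code to a likely cause.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check whether the process is meant to keep running" ;;
    1)   echo "application error; read the --previous logs" ;;
    126) echo "command found but not executable; check file permissions" ;;
    127) echo "command not found; check ENTRYPOINT/CMD and the image contents" ;;
    137) echo "killed by SIGKILL (128+9); often OOMKilled or a forced kill after the grace period" ;;
    143) echo "terminated by SIGTERM (128+15); usually an orderly shutdown request" ;;
    *)   echo "unrecognized exit code $1; inspect logs and pod events" ;;
  esac
}

# Example usage, reading the exit code of the last terminated container:
# code="$(kubectl get pod payment-pod -n production \
#   -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
# explain_exit_code "$code"
```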

**Common Causes and Fixes:**

**1. Missing Start Command:**
```dockerfile
# Bad - no CMD or ENTRYPOINT
FROM openjdk:17
COPY target/app.jar /app.jar

# Good
FROM openjdk:17
COPY target/app.jar /app.jar
ENTRYPOINT ["java", "-jar", "/app.jar"]
```

**2. Permission Denied:**
```dockerfile
# Fix executable permissions
RUN chmod +x /app/start.sh

# Or, in the Kubernetes pod spec (YAML, not Dockerfile syntax)
securityContext:
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000
```

**3. Resource Limits:**
```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"  # Increase if OOMKilled
    cpu: "1000m"
```

**4. Liveness Probe Too Aggressive:**
```yaml
livenessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 60  # Increase for slow-starting apps
  periodSeconds: 10
  failureThreshold: 3
```

### Image Pull Errors

**Authentication Issues:**
```bash
# Verify image exists
docker pull ghcr.io/company/payment-service:v2.1.0

# Check Kubernetes pull secrets
kubectl get secret regcred -n production -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

# Create registry secret
kubectl create secret docker-registry regcred \
  --docker-server=ghcr.io \
  --docker-username=$GITHUB_USER \
  --docker-password=$GITHUB_TOKEN \
  --namespace=production
```

**Image Tag Issues:**
```bash
# Check available tags
crane ls ghcr.io/company/payment-service | grep v2.1

# Verify digest matches
docker manifest inspect ghcr.io/company/payment-service:v2.1.0 --verbose
```

**Explanation:**
`crane` (Google's container tool) lists tags without pulling. Image pull errors often indicate:
- Typo in tag name
- Image not pushed to registry
- Registry authentication expired
- Network policies blocking registry access
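
Many pull failures come down to the image reference resolving to a different registry or tag than intended. For example, `payment-service:v2.1.0` in the error shown earlier names no registry, so the runtime falls back to Docker Hub. The sketch below splits a reference into its parts; `split_image_ref` is a hypothetical helper that approximates the usual resolution rules and does not handle digest references (`@sha256:...`) or Docker Hub's implicit `library/` prefix.

```bash
#!/usr/bin/env bash
# Split an image reference into registry, repository, and tag.
split_image_ref() {
  local ref="$1" name tag registry repo first
  local last="${ref##*/}"   # final path component, e.g. "payment-service:v2.1.0"
  if [[ "$last" == *:* ]]; then
    tag="${last##*:}"
    name="${ref%:*}"
  else
    tag="latest"            # no tag given: runtimes default to "latest"
    name="$ref"
  fi
  first="${name%%/*}"
  # The first component is a registry host only if it looks like one.
  if [[ "$name" == */* && ( "$first" == *.* || "$first" == *:* || "$first" == "localhost" ) ]]; then
    registry="$first"
    repo="${name#*/}"
  else
    registry="docker.io"
    repo="$name"
  fi
  printf 'registry=%s repository=%s tag=%s\n' "$registry" "$repo" "$tag"
}

# Example usage:
# split_image_ref ghcr.io/company/payment-service:v2.1.0
```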

### Resource Constraints

**Out of Memory (OOMKilled):**
```bash
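# Check why the last container instance terminated (look for "reason":"OOMKilled")
kubectl get pod payment-pod -n production -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

# View recent events for the pod
kubectl get events -n production --field-selector involvedObject.name=payment-pod

# Check current memory usage (requires metrics-server)
kubectl top pod payment-pod -n production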
```

**Memory Debugging:**
```yaml
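# Enable JVM heap dumps on OutOfMemoryError
# (the JVM reads JAVA_TOOL_OPTIONS itself; JAVA_OPTS only works if the start script passes it on)
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps"
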
volumeMounts:
  - name: dumps
    mountPath: /dumps

volumes:
  - name: dumps
    emptyDir:
      sizeLimit: 2Gi
```

**Explanation:**
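When a container exceeds its memory limit, the kernel's OOM killer terminates it with SIGKILL and Kubernetes reports the reason as `OOMKilled`. A process cannot react to SIGKILL, so no heap dump is written in that case. The JVM flags above help in the other common scenario: the Java heap fills up while the container is still under its limit, the JVM throws `OutOfMemoryError`, and it writes a heap dump to `/dumps`. Because `/dumps` is an `emptyDir` volume, the dump survives container restarts (though not pod deletion); copy it out with `kubectl cp` and analyze it with Eclipse Memory Analyzer Tool (MAT).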
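
Comparing current usage with the configured limit requires converting Kubernetes quantities such as `512Mi` or `1Gi` to a common unit. The helpers below are a sketch that handles only whole numbers with the binary suffixes `Ki`, `Mi`, and `Gi`; both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Convert a Kubernetes memory quantity (e.g. 512Mi, 1Gi, or plain bytes) to bytes.
to_bytes() {
  local qty="$1" num unit
  num="${qty%%[A-Za-z]*}"   # numeric part
  unit="${qty#"$num"}"      # suffix, possibly empty
  case "$unit" in
    "") echo "$num" ;;
    Ki) echo $((num * 1024)) ;;
    Mi) echo $((num * 1024 * 1024)) ;;
    Gi) echo $((num * 1024 * 1024 * 1024)) ;;
    *)  echo "unsupported unit: $unit" >&2; return 1 ;;
  esac
}

# Print usage as an integer percentage of the limit (rounded down).
memory_usage_percent() {
  local used limit
  used="$(to_bytes "$1")" || return 1
  limit="$(to_bytes "$2")" || return 1
  echo $((used * 100 / limit))
}

# Example usage, with the usage figure taken from `kubectl top pod`:
# memory_usage_percent 480Mi 1Gi
```

A pod that sits close to 100 percent shortly before each restart is a strong hint that the limit, not an application bug, is ending the process.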

**CPU Throttling:**
```bash
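# Show configured CPU requests and limits
kubectl get pod payment-pod -n production -o jsonpath='{.spec.containers[0].resources}'

# View metrics
kubectl top pod payment-pod -n production --containers

# Check cgroup limits inside container (cgroup v1 paths)
kubectl exec payment-pod -n production -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
kubectl exec payment-pod -n production -- cat /sys/fs/cgroup/cpu/cpu.cfs_period_us

# Check throttling counters (nr_periods, nr_throttled)
kubectl exec payment-pod -n production -- cat /sys/fs/cgroup/cpu/cpu.stat

# On cgroup v2 nodes the equivalents are /sys/fs/cgroup/cpu.max and /sys/fs/cgroup/cpu.stat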
```

**Explanation:**
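The CPU limit is the ratio of `cpu.cfs_quota_us` to `cpu.cfs_period_us`: a quota of 100000 with the default period of 100000 (100 ms) allows one full core, 50000 allows half a core, and -1 means no limit is set. A quota does not by itself mean the container is being throttled; throttling shows up as a growing `nr_throttled` counter in `cpu.stat`. Increase CPU limits if throttling causes latency spikes.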
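
The arithmetic behind these cgroup values can be sketched in a few lines. The function names are hypothetical; `throttled_percent` assumes `cpu.stat` content on standard input with `nr_periods` and `nr_throttled` lines, which both cgroup v1 and v2 provide.

```bash
#!/usr/bin/env bash
# Convert a CFS quota and period (microseconds) into a CPU limit in cores.
cpu_limit_cores() {
  local quota="$1" period="$2"
  if [ "$quota" -lt 0 ]; then
    echo "unlimited"   # a quota of -1 means no limit is configured
  else
    awk -v q="$quota" -v p="$period" 'BEGIN { printf "%.2f\n", q / p }'
  fi
}

# Read cpu.stat on stdin and print the share of scheduler periods that were throttled.
throttled_percent() {
  awk '$1 == "nr_periods"   { periods = $2 }
       $1 == "nr_throttled" { throttled = $2 }
       END {
         if (periods > 0) printf "%.1f\n", throttled * 100 / periods
         else print "0.0"
       }'
}

# Example usage:
# cpu_limit_cores 50000 100000
# kubectl exec payment-pod -n production -- cat /sys/fs/cgroup/cpu/cpu.stat | throttled_percent
```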

## 46.4 Network Debugging

### DNS Resolution Issues

```bash
# Test DNS from inside pod
kubectl exec -it payment-pod -- nslookup kubernetes.default

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Test with debug pod
kubectl run -it --rm debug --image=busybox:1.28 --restart=Never -- nslookup payment-service.production.svc.cluster.local

# Check /etc/resolv.conf
kubectl exec payment-pod -- cat /etc/resolv.conf
# Should show the cluster DNS service IP, e.g. nameserver 10.96.0.10
```

**DNS Debugging with Netshoot:**
```bash
kubectl run netshoot --rm -i --tty --image nicolaka/netshoot -- /bin/bash

# Inside netshoot container
dig payment-service.production.svc.cluster.local
tcpdump -i eth0 port 53
```

### Service Connectivity

**Port Forwarding for Local Testing:**
```bash
# Access cluster service locally
kubectl port-forward svc/payment-service 8080:80 -n production

# Test in another terminal
curl http://localhost:8080/actuator/health
```

**Service Endpoint Verification:**
```bash
# Check if endpoints exist
kubectl get endpoints payment-service -n production

# If empty, check label selector
kubectl get svc payment-service -n production -o jsonpath='{.spec.selector}'

# Verify pods have matching labels
kubectl get pods -l app=payment-service,version=v2.1.0 -n production
```

**Explanation:**
Services route traffic to Pods via label selectors. If `kubectl get endpoints` returns an empty list, either no Pods match the selector or the matching Pods are not Ready. Check for:
- Missing or incorrect labels on Pods
- Pods in a different namespace
- Selectors with typos
- Pods failing their readiness probe
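
The selector rule itself is simple: every `key=value` pair in the Service selector must appear among the Pod's labels. The sketch below checks that rule on plain strings so a mismatch can be spotted without a cluster; `selector_matches` is a hypothetical helper that expects comma-separated `key=value` pairs.

```bash
#!/usr/bin/env bash
# Check that every key=value pair in a selector appears in a pod's labels.
# Prints each missing pair and returns non-zero if any are missing.
# Usage: selector_matches "<selector pairs>" "<pod label pairs>"
selector_matches() {
  local selector="$1" labels=",$2," pair missing=0
  local IFS=','
  for pair in $selector; do
    if [[ "$labels" != *",$pair,"* ]]; then
      echo "missing label: $pair"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage:
# selector_matches "app=payment-service,version=v2.1.0" "app=payment-service,version=v2.0.0"
```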

### Network Policies

**Verify Policy Blocking:**
```bash
# Check if network policies exist
kubectl get networkpolicies -n production

# Test connectivity between pods
kubectl exec -it payment-pod -- /bin/sh -c "nc -zv order-service 8080"

# Allow all traffic temporarily for testing
# (remove it afterwards: kubectl delete networkpolicy allow-all -n production)
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - {}
  egress:
    - {}
EOF
```

## 46.5 Permission Problems

### RBAC Debugging

**Service Account Verification:**
```bash
# Check pod service account
kubectl get pod payment-pod -o jsonpath='{.spec.serviceAccountName}'

# Verify token mounted
kubectl exec payment-pod -- ls /var/run/secrets/kubernetes.io/serviceaccount/

# Check RBAC permissions
kubectl auth can-i --list --as=system:serviceaccount:production:payment-service -n production

# Test specific permission
kubectl auth can-i get pods --as=system:serviceaccount:production:payment-service -n production
```

**Common RBAC Issues:**
```yaml
# Missing get pods permission for discovery
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payment-service
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]  # Required for Spring Cloud Kubernetes
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get"]
    resourceNames: ["payment-service"]  # Specific secret only
```

### Security Context Issues

**Read-Only Filesystem:**
```yaml
securityContext:
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL

volumeMounts:
  - name: tmp
    mountPath: /tmp  # Writable volume for temp files
  - name: cache
    mountPath: /app/cache
    
volumes:
  - name: tmp
    emptyDir: {}
  - name: cache
    emptyDir:
      sizeLimit: 1Gi
```

**Explanation:**
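`readOnlyRootFilesystem: true` mounts the container's root filesystem read-only, so nothing can be written on top of the image layers. Applications must write to mounted volumes (`emptyDir`, `persistentVolumeClaim`) instead. This prevents attackers from modifying binaries, but the application must be configured to use writable mounted directories for temp files and caches. The remaining settings (`runAsNonRoot`, dropped capabilities) additionally require that it runs correctly as a non-root user.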
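
A quick way to verify such a configuration is to check, from inside the container, which directories the application user can actually write to. This is a minimal sketch; `check_writable` is a hypothetical helper, and the paths in the usage example are the mount points from the manifest above.

```bash
#!/usr/bin/env bash
# Report whether each given directory exists and is writable by the current user.
# Returns non-zero if any directory is missing or read-only.
check_writable() {
  local dir status=0
  for dir in "$@"; do
    if [ -d "$dir" ] && [ -w "$dir" ]; then
      echo "writable: $dir"
    else
      echo "NOT writable: $dir"
      status=1
    fi
  done
  return "$status"
}

# Example usage inside the container:
# check_writable /tmp /app/cache
```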

## 46.6 Debugging Tools

### Stern (Multi-Pod Log Tailing)

```bash
# Install stern
brew install stern

# Tail logs from all pods matching pattern
stern payment-service -n production --since 10m

# Tail specific container
stern payment-service -c payment-service -n production

# Filter log lines (case-insensitive match on "error")
stern payment-service -n production | grep -i error

# Show timestamps
stern payment-service -n production -t
```

**Explanation:**
Stern tails logs from multiple Pods simultaneously, color-coding by Pod name. Its positional argument is a regular expression matched against pod names (use `-l` for a label selector), and it automatically picks up new pods that match, which `kubectl logs -f` does not do.

### K9s (Terminal UI)

```bash
# Install k9s
brew install k9s

# Launch
k9s -n production

# Key bindings:
# :pods → view pods
# :svc → view services
# l → logs
# s → shell
# d → describe
# shift-f → port-forward
# ctrl-k → kill pod
```

### Inspektor Gadget

eBPF-based debugging tools for Kubernetes:

```bash
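# Deploy Inspektor Gadget to the cluster
# (requires the kubectl-gadget plugin, e.g. `kubectl krew install gadget`)
kubectl gadget deploy

# Note: recent releases run these as image-based gadgets (e.g. `kubectl gadget run trace_dns`);
# check `kubectl gadget --help` for the commands your version supports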

# Trace DNS requests
kubectl gadget trace dns -n production

# Trace TCP connections
kubectl gadget trace tcp -n production

# Trace OOM kills
kubectl gadget trace oomkill -n production

# Snapshot process tree
kubectl gadget snapshot process -n production
```

**Explanation:**
Inspektor Gadget uses eBPF (extended Berkeley Packet Filter) to trace kernel events without modifying applications. It shows DNS queries, TCP connections, file opens, and OOM kills in real-time across the cluster.

### Kubectl Debug Node

Debug node-level issues:

```bash
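# Start a debug pod on the node (the node's root filesystem is mounted at /host)
kubectl debug node/minikube -it --image=alpine -- /bin/sh

# Switch into the node's filesystem to use its tools
chroot /host

# Check the container runtime
crictl ps
crictl logs <container-id>

# Check kubelet logs
journalctl -u kubelet -f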
```

## 46.7 Build Reproducibility

### Reproducing CI Failures Locally

**Docker BuildKit:**
```bash
# Build with same context as CI
# (the GitHub Actions cache backend, type=gha, is only reachable inside a workflow run, so it is omitted here)
docker buildx build \
  --platform linux/amd64 \
  --build-arg JAR_FILE=target/app.jar \
  --tag payment-service:local \
  --load \
  .

# Run with same resources as production
docker run \
  --memory=512m \
  --cpus=0.5 \
  --read-only \
  --tmpfs /tmp \
  payment-service:local
```

**Act (Run GitHub Actions Locally):**
```bash
# Install act
brew install act

# Run the build job locally, passing a secret on the command line
act -j build -s GITHUB_TOKEN=$GITHUB_TOKEN

# Run the test job, loading secrets from a file
act -j test --secret-file .env
```

**Explanation:**
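Act runs GitHub Actions workflows locally in Docker containers that approximate the GitHub Actions runner environment. This lets you reproduce many CI failures without pushing commits, although the default images do not include every tool preinstalled on GitHub-hosted runners.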

### Caching Issues

**Clear Caches:**
```bash
# Maven
rm -rf ~/.m2/repository/com/company

# npm
npm cache clean --force

# Docker
docker builder prune -f

# Gradle
rm -rf ~/.gradle/caches/build-cache-1/
```

### Deterministic Builds

```dockerfile
# Pin base image versions
FROM eclipse-temurin:17.0.9_9-jre-alpine@sha256:abc123...

# Pin dependency versions in pom.xml
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
    <version>3.2.0</version>  <!-- Exact version -->
</dependency>

# Use lock files
COPY package-lock.json ./
RUN npm ci  # Uses exact versions from lock file
```
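
Pinning is easy to forget when a new build stage is added, so a small check can flag base images that lack a digest. The sketch below is a heuristic: `unpinned_bases` is a hypothetical helper, and it will also flag legitimate lines such as `FROM scratch` or a `FROM` that references an earlier build stage.

```bash
#!/usr/bin/env bash
# Print every FROM line in a Dockerfile that does not pin a sha256 digest.
# Usage: unpinned_bases <path-to-Dockerfile>
unpinned_bases() {
  local dockerfile="$1"
  # "|| true" keeps the function from failing when every base image is pinned.
  grep -iE '^[[:space:]]*FROM[[:space:]]+' "$dockerfile" | grep -v '@sha256:' || true
}

# Example usage:
# unpinned_bases Dockerfile
```

An empty result means every base image carries a digest; in CI, a non-empty result could be turned into a failing check.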

## 46.8 Pipeline Observability

### Distributed Tracing in CI/CD

Trace pipeline execution across stages:

```yaml
# .github/workflows/trace.yml
name: Traced Build
on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Start Trace
        id: start  # Needed so later steps can read steps.start.outputs.trace_id
        run: |
          TRACE_ID=$(openssl rand -hex 16)
          echo "trace_id=$TRACE_ID" >> "$GITHUB_OUTPUT"
          # Jaeger's Zipkin-compatible endpoint; it must be reachable from the runner.
          # Zipkin v2 timestamps and durations are in microseconds.
          curl -X POST http://jaeger:9411/api/v2/spans \
            -H "Content-Type: application/json" \
            -d "[{
              \"traceId\": \"$TRACE_ID\",
              \"id\": \"$(openssl rand -hex 8)\",
              \"name\": \"ci-build-start\",
              \"timestamp\": $(date +%s%6N),
              \"duration\": 1,
              \"tags\": {
                \"ci.provider\": \"github-actions\",
                \"repository\": \"${{ github.repository }}\",
                \"commit\": \"${{ github.sha }}\"
              }
            }]"

      - name: Build
        run: ./mvnw -B package

      - name: End Trace
        if: always()  # Emit the closing span even when the build fails
        run: |
          curl -X POST http://jaeger:9411/api/v2/spans \
            -H "Content-Type: application/json" \
            -d "[{
              \"traceId\": \"${{ steps.start.outputs.trace_id }}\",
              \"id\": \"$(openssl rand -hex 8)\",
              \"name\": \"ci-build-end\",
              \"timestamp\": $(date +%s%6N),
              \"duration\": 1
            }]"
```
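
Building the span payload in a function keeps the JSON quoting in one place and makes it testable outside the workflow. The sketch below targets the Zipkin v2 JSON format, where `timestamp` and `duration` are expressed in microseconds; the function names and the `ci-pipeline` service name are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Generate a random lowercase hex string from the given number of bytes
# (16 bytes give a 32-character trace ID, 8 bytes a 16-character span ID).
random_hex() {
  od -An -N"$1" -tx1 /dev/urandom | tr -d ' \n'
}

# Build a single-span Zipkin v2 JSON array.
# Usage: zipkin_span <trace_id> <span_name> <start_us> <duration_us>
zipkin_span() {
  printf '[{"traceId":"%s","id":"%s","name":"%s","timestamp":%s,"duration":%s,"localEndpoint":{"serviceName":"ci-pipeline"}}]\n' \
    "$1" "$(random_hex 8)" "$2" "$3" "$4"
}

# Example usage (GNU date; %6N yields microsecond precision):
# trace_id="$(random_hex 16)"
# zipkin_span "$trace_id" ci-build "$(date +%s%6N)" 1500000 |
#   curl -X POST -H 'Content-Type: application/json' -d @- http://jaeger:9411/api/v2/spans
```

Recording the start time once and sending a single span with the measured duration, rather than separate start and end spans, usually gives a more readable trace.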

### Pipeline Metrics

Export pipeline metrics to Prometheus:

```yaml
# Push gateway for CI metrics
# (assumes an earlier step with `id: build` that writes a `duration` output)
- name: Record Metrics
  if: always()  # Record failed builds too
  run: |
    cat <<EOF | curl --data-binary @- http://prometheus-pushgateway:9091/metrics/job/ci-pipeline/instance/${{ github.run_id }}
    # HELP ci_build_duration_seconds Build duration
    # TYPE ci_build_duration_seconds gauge
    ci_build_duration_seconds{repository="${{ github.repository }}",branch="${{ github.ref_name }}"} ${{ steps.build.outputs.duration }}
    # HELP ci_build_status Build status (0=success, 1=failure)
    # TYPE ci_build_status gauge
    ci_build_status{repository="${{ github.repository }}"} ${{ job.status == 'success' && '0' || '1' }}
    EOF
```
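
The same payload can be rendered by a shell function, which makes it possible to inspect the exposition text locally before anything is pushed. This is a sketch using the metric names from the step above; `render_ci_metrics` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Render CI metrics in the Prometheus text exposition format.
# Usage: render_ci_metrics <repository> <branch> <duration_seconds> <status>
render_ci_metrics() {
  local repo="$1" branch="$2" duration="$3" status="$4"
  cat <<EOF
# HELP ci_build_duration_seconds Build duration
# TYPE ci_build_duration_seconds gauge
ci_build_duration_seconds{repository="$repo",branch="$branch"} $duration
# HELP ci_build_status Build status (0=success, 1=failure)
# TYPE ci_build_status gauge
ci_build_status{repository="$repo"} $status
EOF
}

# Example usage:
# render_ci_metrics company/payment-service main 42 0 |
#   curl --data-binary @- http://prometheus-pushgateway:9091/metrics/job/ci-pipeline
```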

---

## Chapter Summary and Preview

This chapter established systematic debugging strategies for CI/CD pipelines, addressing the unique challenges of ephemeral build environments and distributed execution. We examined container debugging techniques including `docker exec` for running containers, ephemeral debug containers for distroless images, and `kubectl debug` for Kubernetes pods without modifying specifications. Kubernetes troubleshooting covered crash loop backoff diagnosis through log analysis and pod events, image pull error resolution via registry authentication verification and tag existence checking, and resource constraint debugging using cgroup inspection and memory dump analysis.

Network debugging strategies included DNS resolution verification using `nslookup` and CoreDNS log analysis, service connectivity testing via port forwarding and endpoint validation, and network policy troubleshooting through temporary allow-all policies. Permission debugging focused on RBAC verification using `kubectl auth can-i`, service account token validation, and security context configuration for read-only root filesystems. Advanced tools like Stern for multi-pod log tailing, K9s for interactive cluster exploration, and Inspektor Gadget for eBPF-based kernel tracing provide deep visibility into running systems.

Build reproducibility techniques ensure CI failures can be reproduced locally using Docker BuildKit with resource constraints and Act for GitHub Actions simulation. Pipeline observability extends distributed tracing concepts to CI/CD, enabling correlation between pipeline stages and visualization of build duration trends alongside application metrics.

**Key Takeaways:**
- Use `kubectl debug` with ephemeral containers to inspect distroless images without modifying Dockerfiles or pod specs; the debug container shares namespaces with the target container while providing debugging tools.
- When diagnosing CrashLoopBackOff, always check `--previous` logs for the actual error, verify resource limits aren't causing OOMKilled, and ensure executable permissions on scripts.
- For ImagePullBackOff errors, verify image tags exist using `crane ls` or `docker manifest inspect`, check `imagePullSecrets` are correctly attached to service accounts, and ensure registries are accessible from cluster nodes.
- Debug DNS issues using `nslookup` from within pods, verify CoreDNS is running, and check that search domains in `/etc/resolv.conf` include the correct namespace.
- Use `kubectl auth can-i` to verify RBAC permissions for service accounts. Role and RoleBinding changes take effect immediately; a pod only needs to be recreated when it should run under a different service account.
- Implement pipeline observability by pushing metrics to Prometheus Pushgateway or emitting OpenTelemetry spans from CI jobs, enabling correlation between deployment frequency and application reliability metrics.

**Next Chapter Preview:**
Chapter 47: Pipeline Performance Optimization addresses the efficiency and speed of continuous delivery pipelines. We will examine strategies for reducing build times through intelligent caching, parallel execution, and incremental builds. The chapter covers Docker layer caching optimization, dependency resolution acceleration, test parallelization strategies, and resource right-sizing for CI runners. We will explore network optimization for distributed builds, artifact management strategies to reduce transfer times, and cost optimization techniques for cloud-based CI/CD platforms, ensuring that fast feedback loops support rather than hinder developer velocity.

