Troubleshooting

Captain Dany edited this page Jun 11, 2026 · 1 revision

CD Pipeline

"exec plugin: invalid apiVersion" (OCI CLI)

Problem: Kubeconfig has OCI exec plugin config but OCI CLI is not installed on the runner.

Solution: The workflow now builds kubeconfig programmatically from a static SA token instead of using OCI exec auth.

"gh secret set" corruption

Problem: gh secret set KUBECONFIG_DEV --body "$(<kubeconfig)" stores truncated/broken data.

Root Cause: The Base64-encoded kubeconfig is large (~4000 chars), and gh secret set --body fails intermittently with payloads of that size.

Solution:

  • Use a small SA token (~930 chars) instead of full kubeconfig
  • Build kubeconfig programmatically in the workflow:
    kubectl config set-cluster ... --server=... --certificate-authority=...
    kubectl config set-credentials ... --token=${{ secrets.SA_TOKEN }}
    kubectl config set-context ... --cluster=... --user=...
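The three commands above can be wrapped in a small helper so that an empty secret fails fast instead of producing a broken kubeconfig. This is only a sketch, not the workflow's actual step: the function name `build_kubeconfig` and the cluster, user, and context names (`oke`, `ci`, `ci@oke`) are hypothetical, and the server URL and CA file are whatever your cluster uses.

```shell
#!/bin/sh
# Hypothetical helper: writes a token-based kubeconfig with no exec plugin.
# Usage: build_kubeconfig <api-server-url> <ca-cert-file> <sa-token>
build_kubeconfig() {
  server=$1 ca_file=$2 token=$3
  # Fail early if any input is empty (for example, a secret that was never set).
  if [ -z "$server" ] || [ -z "$ca_file" ] || [ -z "$token" ]; then
    echo "build_kubeconfig: server, CA file and token are required" >&2
    return 1
  fi
  # Names "oke", "ci" and "ci@oke" are placeholders chosen for this sketch.
  kubectl config set-cluster oke --server="$server" \
    --certificate-authority="$ca_file" --embed-certs=true &&
  kubectl config set-credentials ci --token="$token" &&
  kubectl config set-context ci@oke --cluster=oke --user=ci &&
  kubectl config use-context ci@oke
}
```

In a workflow step, the token would typically be passed through an environment variable mapped from `secrets.SA_TOKEN` rather than interpolated directly into the script.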

"helm deploy" stuck beyond timeout

Problem: Deployment hangs for 5+ minutes then fails with context deadline exceeded.

Causes:

  • Missing ServiceAccount (pod can't be created)
  • Missing ghcr-pull secret (ImagePullBackOff)
  • Missing oscar-app-secret (CrashLoopBackOff)
  • Missing DATABASE_URL (connection refused)

Debug: The CD workflow has debug commands that run on failure, but to inspect interactively:

kubectl get pods -n oscar-dev
kubectl describe pod <pod-name> -n oscar-dev
kubectl logs <pod-name> -n oscar-dev
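Before reading pod logs, it can be quicker to check whether the objects from the cause list exist at all. The following is a sketch under the assumption that the ServiceAccount is named `oscar` (as referenced later on this page) and the secrets are named `ghcr-pull` and `oscar-app-secret`; the function name `check_deploy_prereqs` is hypothetical. It does not check DATABASE_URL, since this page does not say where that value is stored.

```shell
#!/bin/sh
# Sketch: report which of the objects named in the cause list are missing.
# Usage: check_deploy_prereqs [namespace]   (defaults to oscar-dev)
check_deploy_prereqs() {
  ns=${1:-oscar-dev}
  missing=0
  for obj in serviceaccount/oscar secret/ghcr-pull secret/oscar-app-secret; do
    if kubectl get "$obj" -n "$ns" >/dev/null 2>&1; then
      printf '%-8s %s\n' ok "$obj"
    else
      printf '%-8s %s\n' MISSING "$obj"
      missing=$((missing + 1))
    fi
  done
  # Recent events usually name the exact failure (FailedCreate, ErrImagePull, ...).
  kubectl get events -n "$ns" --sort-by=.lastTimestamp 2>/dev/null | tail -n 10
  # Exit status is the number of missing objects, so 0 means all were found.
  return "$missing"
}
```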

"Failed to ping database"

Problem: The app starts but crashes after 5 seconds because the database is unreachable.

Solution: Deploy Postgres manually or set postgresql.enabled=true in values:

# Manual deployment
kubectl run postgres --image=docker.io/library/postgres:16-alpine \
  --env=POSTGRES_PASSWORD=postgres \
  --env=POSTGRES_DB=opencrm \
  -n oscar-dev
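After the manual deployment, it may help to confirm that Postgres is actually accepting connections before redeploying the app. A minimal sketch, assuming the pod is named `postgres` as in the command above and uses the image's default `postgres` user; the function name `wait_for_postgres` is hypothetical.

```shell
#!/bin/sh
# Sketch: block until the manually created pod is Ready and accepting connections.
# Usage: wait_for_postgres [namespace]   (defaults to oscar-dev)
wait_for_postgres() {
  ns=${1:-oscar-dev}
  # Wait up to two minutes for the pod to report Ready.
  kubectl wait --for=condition=Ready pod/postgres -n "$ns" --timeout=120s || return 1
  # pg_isready ships in the official postgres image; exit status 0 means
  # the server is accepting connections.
  kubectl exec postgres -n "$ns" -- pg_isready -U postgres -d opencrm
}
```

Note that this only checks the pod itself. Whether the app can reach it depends on how DATABASE_URL is set, which this sketch does not verify.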

OKE Cluster

CRI-O short name mode

Problem: ImagePullBackOff on official images like postgres:16-alpine.

Root Cause: OKE's CRI-O runtime refuses short image names unless they're in the public registry allowlist.

Solution: Always use full image path:

  • docker.io/library/postgres:16-alpine (not postgres:16-alpine)
  • docker.io/library/redis:7-alpine (not redis:7-alpine)
  • docker.io/bitnami/nginx-ingress-controller:latest (not nginx-ingress-controller:latest)
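To catch short names before they reach the cluster, a small helper can expand them the way Docker Hub clients conventionally do. This is a sketch of that convention only, and the function name `qualify_image` is hypothetical. It maps a bare name to `docker.io/library/...`, so images that live under a namespace (such as the bitnami one above) still need that namespace written out.

```shell
#!/bin/sh
# Sketch: expand a short image name using the usual Docker Hub normalization.
# Usage: qualify_image <image-reference>
qualify_image() {
  ref=$1
  case "$ref" in
    */*)
      first=${ref%%/*}
      case "$first" in
        # First component looks like a registry host (dot, port, or localhost):
        # the reference is already fully qualified.
        *.*|*:*|localhost) printf '%s\n' "$ref" ;;
        # Otherwise it is a Docker Hub namespace such as "bitnami".
        *) printf 'docker.io/%s\n' "$ref" ;;
      esac
      ;;
    # No slash at all: an official image in the "library" namespace.
    *) printf 'docker.io/library/%s\n' "$ref" ;;
  esac
}
```

For example, `qualify_image postgres:16-alpine` prints `docker.io/library/postgres:16-alpine`.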

"exec: executable oci not found"

Problem: kubectl commands fail on the GitHub runner.

Root Cause: Kubeconfig references OCI exec plugin (command: oci) which doesn't exist on the runner.

Solution: Recreate kubeconfig using kubectl config commands with static SA token (see Deployment-Guide#Service Account Setup).

Kubernetes Objects

ServiceAccount not created

Problem: Helm deploy succeeds but no pods appear.

Root Cause: Helm chart was missing templates/serviceaccount.yaml. The deployment referenced serviceAccountName: oscar but no such SA existed.

Solution: Add the template — see deploy/helm/opencrm/templates/serviceaccount.yaml.

Ingress not created

Problem: Service is accessible internally but no ingress.

Root Cause: Helm chart was missing templates/ingress.yaml.

Solution: Add the template — see deploy/helm/opencrm/templates/ingress.yaml.

Docker Build

"invalid reference format"

Problem: Docker build fails with this error.

Root Cause: Git tags like v0.1.0+abc123 contain + which Docker rejects.

Solution: Replace + with - in the version string:

VERSION=$(echo "$VERSION" | sed 's/+/-/g')
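The same substitution can be combined with a check that the result is a valid tag, so that other bad characters are caught before the build runs. A sketch, assuming the usual Docker tag grammar (letters, digits, `_`, `.`, `-`, at most 128 characters, not starting with `.` or `-`); the function name `docker_tag` is hypothetical.

```shell
#!/bin/sh
# Sketch: convert a version string into a Docker tag and validate the result.
# Usage: docker_tag <version>
docker_tag() {
  # Replace every "+" with "-", as in the one-liner above.
  tag=$(printf '%s' "$1" | tr '+' '-')
  # Accept only tags that match the Docker tag grammar.
  if printf '%s\n' "$tag" | grep -Eq '^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$'; then
    printf '%s\n' "$tag"
  else
    echo "docker_tag: '$1' is still not a valid tag" >&2
    return 1
  fi
}
```

For example, `docker_tag v0.1.0+abc123` prints `v0.1.0-abc123`.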

"repository name must be lowercase"

Problem: Docker push fails.

Root Cause: ${{ github.repository }} expands to CaptDany/oscar (uppercase C and D).

Solution: Hardcode as captdany/oscar or pipe through tr '[:upper:]' '[:lower:]'.
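The tr approach can be sketched as a helper that builds the full image name. The function name `ghcr_image` is hypothetical. In a workflow step the input could come from the `GITHUB_REPOSITORY` environment variable, which carries the same owner/name value.

```shell
#!/bin/sh
# Sketch: build a lowercase ghcr.io image name from an owner/repo string.
# Usage: ghcr_image <owner/repo>
ghcr_image() {
  # Registries require lowercase repository names, so fold the case first.
  repo=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  printf 'ghcr.io/%s\n' "$repo"
}
```

For example, `ghcr_image CaptDany/oscar` prints `ghcr.io/captdany/oscar`.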

"denied: installation not allowed" (ghcr.io)

Problem: Docker push denied.

Root Cause: Docker push to ghcr.io/CaptDany/oscar fails because the repository name has uppercase characters.

Solution: Use all-lowercase repository name: ghcr.io/captdany/oscar.

cert-manager

Certificate stuck at "pending"

Problem: kubectl get certificate still shows READY as False after hours.

Root Cause: DNS for oscar-crm.cc doesn't resolve to the cluster's public ingress IP, so Let's Encrypt can't complete the HTTP-01 challenge.

Solution: Configure DNS first, then cert-manager will automatically complete the challenge. See DNS-TLS.
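A quick way to confirm the DNS side is to compare what the domain resolves to with the address the ingress is exposed on. This is a sketch that assumes `dig` is available and that you already know the expected IP. The function name `dns_points_to` is hypothetical, and the IP in the usage comment is a documentation placeholder, not the real address.

```shell
#!/bin/sh
# Sketch: succeed only if <host> has an A record equal to <expected-ip>.
# Usage: dns_points_to oscar-crm.cc 203.0.113.10   (placeholder IP)
dns_points_to() {
  host=$1 expected_ip=$2
  # dig +short prints one record per line; grep -x matches the whole line
  # and -F treats the IP as a literal string rather than a regex.
  dig +short A "$host" | grep -qxF "$expected_ip"
}
```

If the check fails, the challenge cannot succeed yet, and waiting for cert-manager to retry will not help until the record is fixed.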

cm-acme-http-solver pods stick around

Problem: HTTP solver pods remain after failed challenge.

Solution: They're managed by cert-manager and will be cleaned up when the challenge succeeds, or you can delete them:

kubectl delete pod -n oscar-dev -l acme.cert-manager.io/http01-solver=true

GitHub Actions

Node.js 20 deprecation warning

Problem: The warning "Node.js 20 actions are deprecated." appears in workflow runs.

Context: GitHub Actions is phasing out Node.js 20. By June 16, 2026, all actions must use Node.js 24.

Fix: Update action versions once releases that run on Node.js 24 ship. No action is required on our part for now.