Skip to content

ops: backup failure alerting — notify when nightly backup cron fails #133

@GitAddRemote

Description

@GitAddRemote

Tech Story

As a solo engineer, I want to receive an alert when the nightly database backup fails to run so that a broken backup is caught within 24 hours — not discovered when I actually need to restore and find nothing there.

ELI5 Context

What is a dead-man's switch?
A dead-man's switch works in reverse to a normal alert. Instead of alerting when something bad happens, it alerts when something STOPS happening. Healthchecks.io works like this: your backup script pings the service every night to say "I ran successfully." If healthchecks.io doesn't receive a ping within the expected window (24h + 1h grace), it concludes the backup failed and sends you an alert. Silence is the alarm.

Why not just check the backup log file?
You'd have to SSH in and look at the log every day. The whole point of this system is that you don't have to think about it — it tells you when something is wrong, not the other way around.

What are the failure modes this catches?

  • Backblaze B2 credentials expired
  • rclone misconfigured or updated and broken
  • PostgreSQL container not running (pg_dump fails)
  • VPS disk full (no space to write the temp file)
  • The VPS cron daemon crashed or the cron job was accidentally deleted
  • Any other reason backup-db.sh exits non-zero

Does this cost anything?
No. Healthchecks.io free tier allows 20 checks. You need 1.

Technical Elaboration

Step 1: Create Healthchecks.io check (manual, one-time)

  1. Sign up at healthchecks.io
  2. Create a new check:
    • Name: Station — nightly backup
    • Schedule: 0 3 * * * (same as the backup cron — runs at 3 AM)
    • Grace time: 1 hour (alert fires at 4 AM if no ping received)
  3. Click the check name to get the Ping URL (looks like https://hc-ping.com/abc-123-uuid)
  4. Configure alert channels: email (required), Discord webhook (optional)
  5. Copy the Ping URL — it goes into GitHub secrets as BACKUP_HEALTHCHECK_URL

Step 2: Update infra/scripts/backup-db.sh

Add a ping on success at the very end of the script, and ensure the script exits non-zero on any failure (already handled by set -euo pipefail):

#!/bin/bash
set -euo pipefail   # any command failure exits immediately (non-zero) -> healthchecks.io times out

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LABEL="${LABEL:-nightly}"
BACKUP_FILE="/tmp/station_backup_${TIMESTAMP}_${LABEL}.sql.gz"
B2_BUCKET="${B2_BUCKET:-station-backups}"
B2_PATH="postgres/${TIMESTAMP:0:6}/${TIMESTAMP}_${LABEL}"

echo "[backup] Starting at ${TIMESTAMP}"

docker exec station-postgres-1 pg_dump \
  -U "${DATABASE_USER}" \
  -d "${DATABASE_NAME}" \
  | gzip > "${BACKUP_FILE}"

echo "[backup] Created: ${BACKUP_FILE} ($(du -sh ${BACKUP_FILE} | cut -f1))"

rclone copy "${BACKUP_FILE}" "b2:${B2_BUCKET}/${B2_PATH}.sql.gz" \
  --b2-chunk-size 96M

echo "[backup] Uploaded to b2:${B2_BUCKET}/${B2_PATH}.sql.gz"

rm "${BACKUP_FILE}"

# Ping healthchecks.io on success (optional — skip gracefully if URL not set)
if [ -n "${BACKUP_HEALTHCHECK_URL:-}" ]; then
  curl -fsS --retry 3 "${BACKUP_HEALTHCHECK_URL}" > /dev/null \
    && echo "[backup] Healthcheck ping sent" \
    || echo "[backup] WARNING: healthcheck ping failed (backup itself succeeded)"
fi

echo "[backup] Complete"

Key design decisions:

  • set -euo pipefail at the top means any failure exits non-zero, which prevents the healthcheck ping and causes healthchecks.io to time out and alert
  • The healthcheck URL is optional (:- default) — the script works without it (graceful degradation)
  • The curl failure on the ping is a warning, not an error — the backup itself succeeded, so don't suppress it

Step 3: Add BACKUP_HEALTHCHECK_URL to .env.production on VPS

This variable is read by the cron job (which runs backup-db.sh directly on the VPS, not via Docker). Add it to the deploy workflow's .env injection step (issue #128):

# In the Write .env step in release.yml:
BACKUP_HEALTHCHECK_URL=${{ secrets.BACKUP_HEALTHCHECK_URL }}

Also add the cron job to read from the .env file:

0 3 * * * bash -c 'source /opt/station/.env.production && bash /opt/station/infra/scripts/backup-db.sh' >> /opt/station/logs/backup.log 2>&1

Step 4: Add BACKUP_HEALTHCHECK_URL to GitHub environment secrets

In GitHub repo Settings -> Environments -> production:

  • Add secret: BACKUP_HEALTHCHECK_URL = (the Ping URL from healthchecks.io)

This is a production-only secret (staging does not run backups).

Step 5: Verify the alert fires

Test that the system actually works end-to-end:

  1. Temporarily comment out the curl ping in backup-db.sh on the VPS
  2. Wait for the next scheduled run (or manually trigger: bash /opt/station/infra/scripts/backup-db.sh)
  3. Wait for the grace period to expire (or in healthchecks.io, click "Force notification" to test immediately)
  4. Confirm alert email/Discord message received
  5. Re-enable the curl ping

New file: infra/docs/backups.md

Document:

  1. What is backed up — PostgreSQL data, nightly at 3 AM UTC, pre-deploy snapshots on every deploy
  2. Where backups live — Backblaze B2 bucket station-backups, path structure postgres/YYYYMM/filename.sql.gz
  3. How to verify backups are runningtail /opt/station/logs/backup.log; check healthchecks.io dashboard
  4. What to do when a backup alert fires — SSH in, check log, run backup manually, check B2 credentials
  5. Retention policy — 180-day B2 lifecycle rule
  6. How to silence a false alarm — log into healthchecks.io, click "Mute for X hours" if you know a backup will be delayed (e.g. planned maintenance)
  7. How to restore — link to infra/docs/restore.md

Definition of Done

  • Healthchecks.io check created with 24h period and 1h grace window
  • Alert channel configured — test alert received (use "Send test notification" in healthchecks.io)
  • infra/scripts/backup-db.sh pings BACKUP_HEALTHCHECK_URL on successful upload
  • Ping is skipped gracefully when BACKUP_HEALTHCHECK_URL is unset (script still completes successfully)
  • BACKUP_HEALTHCHECK_URL added to GitHub environment secrets (production only)
  • VPS cron job updated to source .env.production so BACKUP_HEALTHCHECK_URL is available
  • End-to-end test: remove ping temporarily -> run backup -> confirm alert fires within grace window -> restore ping -> confirm next run clears alert
  • infra/docs/backups.md written with all 7 sections

Dependencies

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions