ops: backup failure alerting — notify when nightly backup cron fails

## Tech Story

As a solo engineer, I want to receive an alert when the nightly database backup fails to run so that a broken backup is caught within 24 hours — not discovered when I actually need to restore and find nothing there.

## ELI5 Context

**What is a dead-man's switch?**
A dead-man's switch works in reverse to a normal alert. Instead of alerting when something bad happens, it alerts when something STOPS happening. Healthchecks.io works like this: your backup script pings the service every night to say "I ran successfully." If healthchecks.io doesn't receive a ping within the expected window (24h + 1h grace), it concludes the backup failed and sends you an alert. Silence is the alarm.

**Why not just check the backup log file?**
You'd have to SSH in and look at the log every day. The whole point of this system is that you don't have to think about it — it tells you when something is wrong, not the other way around.

**What are the failure modes this catches?**
- Backblaze B2 credentials expired
- rclone misconfigured or updated and broken
- PostgreSQL container not running (pg_dump fails)
- VPS disk full (no space to write the temp file)
- The VPS cron daemon crashed or the cron job was accidentally deleted
- Any other reason `backup-db.sh` exits non-zero

**Does this cost anything?**
No. Healthchecks.io free tier allows 20 checks. You need 1.

## Technical Elaboration

### Step 1: Create Healthchecks.io check (manual, one-time)

1. Sign up at healthchecks.io
2. Create a new check:
   - **Name:** Station — nightly backup
   - **Schedule:** `0 3 * * *` (same as the backup cron — runs at 3 AM)
   - **Grace time:** 1 hour (alert fires at 4 AM if no ping received)
3. Click the check name to get the **Ping URL** (looks like `https://hc-ping.com/abc-123-uuid`)
4. Configure alert channels: email (required), Discord webhook (optional)
5. Copy the Ping URL — it goes into GitHub secrets as `BACKUP_HEALTHCHECK_URL`

### Step 2: Update `infra/scripts/backup-db.sh`

Add a ping on success at the very end of the script, and ensure the script exits non-zero on any failure (already handled by `set -euo pipefail`):

```bash
#!/bin/bash
set -euo pipefail   # any command failure exits immediately (non-zero) -> healthchecks.io times out

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LABEL="${LABEL:-nightly}"
BACKUP_FILE="/tmp/station_backup_${TIMESTAMP}_${LABEL}.sql.gz"
B2_BUCKET="${B2_BUCKET:-station-backups}"
B2_PATH="postgres/${TIMESTAMP:0:6}/${TIMESTAMP}_${LABEL}"

echo "[backup] Starting at ${TIMESTAMP}"

docker exec station-postgres-1 pg_dump \
  -U "${DATABASE_USER}" \
  -d "${DATABASE_NAME}" \
  | gzip > "${BACKUP_FILE}"

echo "[backup] Created: ${BACKUP_FILE} ($(du -sh ${BACKUP_FILE} | cut -f1))"

rclone copy "${BACKUP_FILE}" "b2:${B2_BUCKET}/${B2_PATH}.sql.gz" \
  --b2-chunk-size 96M

echo "[backup] Uploaded to b2:${B2_BUCKET}/${B2_PATH}.sql.gz"

rm "${BACKUP_FILE}"

# Ping healthchecks.io on success (optional — skip gracefully if URL not set)
if [ -n "${BACKUP_HEALTHCHECK_URL:-}" ]; then
  curl -fsS --retry 3 "${BACKUP_HEALTHCHECK_URL}" > /dev/null \
    && echo "[backup] Healthcheck ping sent" \
    || echo "[backup] WARNING: healthcheck ping failed (backup itself succeeded)"
fi

echo "[backup] Complete"
```

**Key design decisions:**
- `set -euo pipefail` at the top means any failure exits non-zero, which prevents the healthcheck ping and causes healthchecks.io to time out and alert
- The healthcheck URL is optional (`:-` default) — the script works without it (graceful degradation)
- The curl failure on the ping is a warning, not an error — the backup itself succeeded, so don't suppress it

### Step 3: Add `BACKUP_HEALTHCHECK_URL` to `.env.production` on VPS

This variable is read by the cron job (which runs `backup-db.sh` directly on the VPS, not via Docker). Add it to the deploy workflow's `.env` injection step (issue #128):

```yaml
# In the Write .env step in release.yml:
BACKUP_HEALTHCHECK_URL=${{ secrets.BACKUP_HEALTHCHECK_URL }}
```

Also add the cron job to read from the `.env` file:
```cron
0 3 * * * bash -c 'source /opt/station/.env.production && bash /opt/station/infra/scripts/backup-db.sh' >> /opt/station/logs/backup.log 2>&1
```

### Step 4: Add `BACKUP_HEALTHCHECK_URL` to GitHub environment secrets

In GitHub repo Settings -> Environments -> production:
- Add secret: `BACKUP_HEALTHCHECK_URL` = (the Ping URL from healthchecks.io)

This is a production-only secret (staging does not run backups).

### Step 5: Verify the alert fires

Test that the system actually works end-to-end:

1. Temporarily comment out the `curl` ping in `backup-db.sh` on the VPS
2. Wait for the next scheduled run (or manually trigger: `bash /opt/station/infra/scripts/backup-db.sh`)
3. Wait for the grace period to expire (or in healthchecks.io, click "Force notification" to test immediately)
4. Confirm alert email/Discord message received
5. Re-enable the curl ping

### New file: `infra/docs/backups.md`

Document:
1. **What is backed up** — PostgreSQL data, nightly at 3 AM UTC, pre-deploy snapshots on every deploy
2. **Where backups live** — Backblaze B2 bucket `station-backups`, path structure `postgres/YYYYMM/filename.sql.gz`
3. **How to verify backups are running** — `tail /opt/station/logs/backup.log`; check healthchecks.io dashboard
4. **What to do when a backup alert fires** — SSH in, check log, run backup manually, check B2 credentials
5. **Retention policy** — 180-day B2 lifecycle rule
6. **How to silence a false alarm** — log into healthchecks.io, click "Mute for X hours" if you know a backup will be delayed (e.g. planned maintenance)
7. **How to restore** — link to `infra/docs/restore.md`

## Definition of Done

- [ ] Healthchecks.io check created with 24h period and 1h grace window
- [ ] Alert channel configured — test alert received (use "Send test notification" in healthchecks.io)
- [ ] `infra/scripts/backup-db.sh` pings `BACKUP_HEALTHCHECK_URL` on successful upload
- [ ] Ping is skipped gracefully when `BACKUP_HEALTHCHECK_URL` is unset (script still completes successfully)
- [ ] `BACKUP_HEALTHCHECK_URL` added to GitHub environment secrets (production only)
- [ ] VPS cron job updated to source `.env.production` so `BACKUP_HEALTHCHECK_URL` is available
- [ ] End-to-end test: remove ping temporarily -> run backup -> confirm alert fires within grace window -> restore ping -> confirm next run clears alert
- [ ] `infra/docs/backups.md` written with all 7 sections

## Dependencies

- Depends on: #125 (backup script must exist — this issue modifies it)
- Depends on: #128 (secrets management — BACKUP_HEALTHCHECK_URL stored in GitHub environment secrets)
- Blocks: nothing


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ops: backup failure alerting — notify when nightly backup cron fails #133

Tech Story

ELI5 Context

Technical Elaboration

Step 1: Create Healthchecks.io check (manual, one-time)

Step 2: Update `infra/scripts/backup-db.sh`

Step 3: Add `BACKUP_HEALTHCHECK_URL` to `.env.production` on VPS

Step 4: Add `BACKUP_HEALTHCHECK_URL` to GitHub environment secrets

Step 5: Verify the alert fires

New file: `infra/docs/backups.md`

Definition of Done

Dependencies

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

ops: backup failure alerting — notify when nightly backup cron fails #133

Description

Tech Story

ELI5 Context

Technical Elaboration

Step 1: Create Healthchecks.io check (manual, one-time)

Step 2: Update infra/scripts/backup-db.sh

Step 3: Add BACKUP_HEALTHCHECK_URL to .env.production on VPS

Step 4: Add BACKUP_HEALTHCHECK_URL to GitHub environment secrets

Step 5: Verify the alert fires

New file: infra/docs/backups.md

Definition of Done

Dependencies

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Step 2: Update `infra/scripts/backup-db.sh`

Step 3: Add `BACKUP_HEALTHCHECK_URL` to `.env.production` on VPS

Step 4: Add `BACKUP_HEALTHCHECK_URL` to GitHub environment secrets

New file: `infra/docs/backups.md`