
ops: backup restore runbook and periodic restore verification #130

@GitAddRemote

Description

Tech Story

As a solo engineer, I want a written, tested restore procedure for the PostgreSQL backups created by issue #125 so that when I need to recover data — from accidental deletion, disk failure, or a bad migration — I can execute the restore mechanically in under 30 minutes without figuring it out under pressure.

ELI5 Context

Why write this down if backups already exist?
Backups are useless if you cannot restore them. Restoring under pressure — production is down, users are waiting — is the worst time to learn the procedure. The first restore should be a calm drill, not an emergency. This issue creates that drill and leaves a written record so future-you does not have to think.

What does a restore involve?

  1. Find the right backup file in Backblaze B2 (by timestamp)
  2. Download it to the VPS
  3. Decompress it (gzip compressed)
  4. Pipe it into the running PostgreSQL container as SQL commands
  5. Verify the data looks right

What is a throwaway container?
Instead of restoring into your live production database, you spin up a separate temporary PostgreSQL container, restore into that, verify the data, then tear it down. This lets you confirm a backup is valid without any risk to production. It is the equivalent of test-restoring a backup to a different server.

What is a restore drill?
A planned, calm run-through of the restore procedure using a real backup. Done quarterly to confirm: the backups are valid SQL, you know how to run the procedure, and it completes in the time you expect.

Technical Elaboration

New file: infra/docs/restore.md

Section 1: Prerequisites

- rclone configured on VPS with B2 credentials (done in #125)
- Docker running: docker ps returns no error
- .env.production present at /opt/station/.env.production
- Enough disk space: df -h /tmp should show at least 3x the compressed backup size free
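The checks above can be scripted so a drill starts from a known-good state. A minimal sketch, under assumptions from this runbook (the `required_kb` and `preflight` helper names are hypothetical; `b2` as the rclone remote name and the 3x disk multiplier come from the list above):

```shell
# Sketch: preflight checks for the prerequisites above.
# required_kb: free space needed in /tmp, 3x the compressed backup size (KB in, KB out).
required_kb() { echo $(( $1 * 3 )); }

# preflight: takes the compressed backup size in KB, fails fast on any missing prerequisite.
preflight() {
  local backup_kb=$1
  docker ps > /dev/null               || { echo "FAIL: docker not running" >&2; return 1; }
  [ -f /opt/station/.env.production ] || { echo "FAIL: .env.production missing" >&2; return 1; }
  rclone listremotes | grep -q '^b2:' || { echo "FAIL: rclone b2 remote not configured" >&2; return 1; }
  local free_kb
  free_kb=$(df -Pk /tmp | awk 'NR==2 {print $4}')   # available KB on /tmp
  [ "${free_kb}" -ge "$(required_kb "${backup_kb}")" ] \
    || { echo "FAIL: need $(required_kb "${backup_kb}") KB free in /tmp" >&2; return 1; }
  echo "Preflight OK"
}
```

Called as e.g. `preflight 45232` for a ~45MB compressed backup.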

Section 2: List available backups

# List all backups in B2, most recent first
rclone ls b2:station-backups/ | sort -k2 -r | head -20

# Example output:
# 45231234 postgres/202605/20260510_030000_nightly.sql.gz
# 44891234 postgres/202605/20260509_030000_nightly.sql.gz
# 44123456 postgres/202605/20260508_150000_pre-deploy-abc1234.sql.gz
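When you just want the newest nightly, the `size path` output of `rclone ls` can be reduced mechanically. A sketch (`latest_backup` is a hypothetical helper; it relies on the timestamped naming scheme shown above, where lexical order equals chronological order):

```shell
# latest_backup: read "size path" lines on stdin, print the lexically greatest
# path — with timestamp-prefixed filenames, that is the most recent backup.
latest_backup() {
  awk '{print $2}' | sort -r | head -1
}
```

Usage: `BACKUP_PATH=$(rclone ls b2:station-backups/ | latest_backup)`.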

Section 3: Download a backup

BACKUP_PATH="postgres/202605/20260510_030000_nightly.sql.gz"
LOCAL_FILE="/tmp/restore_$(date +%s).sql.gz"

rclone copy "b2:station-backups/${BACKUP_PATH}" /tmp/ --progress
mv "/tmp/$(basename "${BACKUP_PATH}")" "${LOCAL_FILE}"
echo "Downloaded: ${LOCAL_FILE} ($(du -sh "${LOCAL_FILE}" | cut -f1))"
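Before piping anything into psql, it is worth confirming the download is an intact gzip archive; `gunzip -t` reads the whole file and verifies its checksum without writing output. A sketch (`check_archive` is a hypothetical helper):

```shell
# check_archive: verify a downloaded backup decompresses cleanly before any restore.
check_archive() {
  if gunzip -t "$1" 2>/dev/null; then
    echo "OK: $1 is a valid gzip archive"
  else
    echo "CORRUPT: $1 failed gzip integrity check" >&2
    return 1
  fi
}
```

Run as `check_archive "${LOCAL_FILE}"` right after the download; a truncated transfer fails here instead of halfway through a restore.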

Section 4a: Restore drill — throwaway container (zero production risk)

source /opt/station/.env.production

# Spin up throwaway Postgres
docker run --rm -d \
  --name station-restore-test \
  -e POSTGRES_USER="${DATABASE_USER}" \
  -e POSTGRES_PASSWORD=testonly \
  -e POSTGRES_DB="${DATABASE_NAME}" \
  postgres:16-alpine

# Wait until Postgres accepts connections (a fixed sleep can race on slow hosts)
until docker exec station-restore-test pg_isready -U "${DATABASE_USER}" > /dev/null 2>&1; do
  sleep 1
done

# Restore into it
gunzip -c "${LOCAL_FILE}" | \
  docker exec -i station-restore-test \
    psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}"

# Verify key tables
docker exec station-restore-test \
  psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" \
  -c "SELECT 'users' AS t, COUNT(*) FROM users UNION ALL SELECT 'organizations', COUNT(*) FROM organizations;"

# Tear down
docker stop station-restore-test
echo "Drill complete. Throwaway container removed."
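To get the duration figure for the drill log in Section 8, the drill can be wrapped with bash's built-in `SECONDS` counter. A sketch (`format_duration` is a hypothetical helper):

```shell
# Sketch: capture the drill duration for the Section 8 log.
# format_duration: seconds -> "Xm Ys".
format_duration() { echo "$(( $1 / 60 ))m $(( $1 % 60 ))s"; }

SECONDS=0
# ... run the drill commands from Section 4a here ...
echo "Drill duration: $(format_duration "${SECONDS}")"
```

`SECONDS` is bash-specific; it resets to 0 on assignment and counts wall-clock seconds thereafter.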

Section 4b: Live restore — production (only for actual data loss)

source /opt/station/.env.production

# Stop backend to prevent writes during restore
docker compose -f /opt/station/docker-compose.prod.yml stop backend

# Restore
gunzip -c "${LOCAL_FILE}" | \
  docker exec -i station-postgres-1 \
    psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}"

# Restart backend
docker compose -f /opt/station/docker-compose.prod.yml start backend
curl -f https://api.drdnt.org/health && echo "Restore successful"

Section 5: Post-restore verification

docker exec station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" -c "
  SELECT
    (SELECT COUNT(*) FROM users)         AS users,
    (SELECT COUNT(*) FROM organizations) AS organizations,
    (SELECT COUNT(*) FROM roles)         AS roles;
"
curl -f https://api.drdnt.org/health
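Row counts alone only confirm the tables exist; a verification step should fail loudly if a key table came back empty. A minimal sketch (`assert_min_rows` and the threshold values are illustrative assumptions, not part of the runbook above):

```shell
# assert_min_rows: fail if a table's row count is below an expected floor.
assert_min_rows() {
  local table=$1 count=$2 min=$3
  if [ "${count}" -ge "${min}" ]; then
    echo "OK: ${table} has ${count} rows (>= ${min})"
  else
    echo "FAIL: ${table} has ${count} rows, expected at least ${min}" >&2
    return 1
  fi
}
```

Counts can be fed in with psql's tuples-only mode, e.g. `assert_min_rows users "$(docker exec station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" -tAc 'SELECT COUNT(*) FROM users')" 1`.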

Section 6: Decision tree

Need to restore?
- Data accidentally deleted (rows/table)?
    -> Live restore (Section 4b): stop backend, restore, restart
- Bad migration / schema corruption?
    -> Check migration-rollback.md first — TypeORM revert may be faster
       If that fails -> live restore (Section 4b)
- Disk failure / volume deleted?
    -> Provision new VPS -> run bootstrap-vps.sh -> live restore -> verify

Section 7: Estimated downtime

| Scenario | Estimated downtime |
| --- | --- |
| Drill (throwaway container) | Zero — production untouched |
| Live restore, small DB (<100MB compressed) | 5–10 minutes |
| Live restore, medium DB (100MB–1GB) | 15–30 minutes |
| Full VPS rebuild + restore | 45–90 minutes |

Section 8: Drill log (append after each quarterly drill)

| Date | Backup tested | Duration | Notes |
| --- | --- | --- | --- |
| (first drill date here) | (filename) | (X min) | First drill. Tables verified. |

Update infra/scripts/restore-db.sh (from #125)

Ensure the script sources .env.production for credentials instead of taking them as CLI arguments, which prevents credentials from appearing in ps aux output:

#!/bin/bash
set -euo pipefail
source /opt/station/.env.production

BACKUP_PATH="${1:?Usage: restore-db.sh <backup path under b2:station-backups>}"
LOCAL_FILE="/tmp/restore_$(date +%s).sql.gz"

echo "Downloading ${BACKUP_PATH}..."
rclone copy "b2:station-backups/${BACKUP_PATH}" /tmp/ --progress
mv "/tmp/$(basename "${BACKUP_PATH}")" "${LOCAL_FILE}"

echo "WARNING: About to restore into live database ${DATABASE_NAME}."
echo "Stopping backend in 5 seconds... Ctrl+C to abort."
sleep 5

docker compose -f /opt/station/docker-compose.prod.yml stop backend
gunzip -c "${LOCAL_FILE}" | docker exec -i station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}"
docker compose -f /opt/station/docker-compose.prod.yml start backend
rm "${LOCAL_FILE}"
echo "Restore complete. Verify: curl https://api.drdnt.org/health"

Definition of Done

  • infra/docs/restore.md written with all 8 sections: prerequisites, list backups, download, drill procedure, live restore, post-restore verification, decision tree, drill log
  • infra/scripts/restore-db.sh sources .env.production for credentials (not CLI args)
  • Drill completed: a real B2 backup downloaded, restored to a throwaway Postgres container, row counts verified, container torn down
  • Drill date and result recorded in the drill log table in infra/docs/restore.md
  • Time to complete the drill documented (establishes baseline for how long a restore takes)
  • infra/docs/restore.md linked from docs/deployment.md in the "What to do if things go wrong" section

Dependencies

- #125: PostgreSQL backups to Backblaze B2 (provides the rclone/B2 configuration and infra/scripts/restore-db.sh)
