
ops: backup restore runbook and periodic restore verification #130

@GitAddRemote

Description

Tech Story

As a solo engineer, I want a written, tested restore procedure for the PostgreSQL backups created by issue #125 so that when I need to recover data — from accidental deletion, disk failure, or a bad migration — I can execute the restore mechanically in under 30 minutes without figuring it out under pressure.

ELI5 Context

Why write this down if backups already exist?
Backups are useless if you cannot restore them. Restoring under pressure — production is down, users are waiting — is the worst time to learn the procedure. The first restore should be a calm drill, not an emergency. This issue creates that drill and leaves a written record so future-you does not have to think.

What does a restore involve?

  1. Find the right backup file in Backblaze B2 (by timestamp)
  2. Download it to the VPS
  3. Decompress it (gzip compressed)
  4. Pipe it into the running PostgreSQL container as SQL commands
  5. Verify the data looks right

What is a throwaway container?
Instead of restoring into your live production database, you spin up a separate temporary PostgreSQL container, restore into that, verify the data, then tear it down. This lets you confirm a backup is valid without any risk to production. It is the equivalent of test-restoring a backup to a different server.

What is a restore drill?
A planned, calm run-through of the restore procedure using a real backup. Done quarterly to confirm: the backups are valid SQL, you know how to run the procedure, and it completes in the time you expect.

Technical Elaboration

New file: infra/docs/restore.md

Section 1: Prerequisites

- rclone configured on VPS with B2 credentials (done in #125)
- Docker running: docker ps returns no error
- .env.production present at /opt/station/.env.production
- Enough disk space: df -h /tmp should show at least 3x the compressed backup size free
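The checks above can be scripted so a drill starts from a known-good state. A minimal sketch, under assumptions from this runbook (the `required_kb` and `preflight` helper names are hypothetical; `b2` as the rclone remote name and the 3x disk multiplier come from the list above):

```shell
# Sketch: preflight checks for the prerequisites above.
# required_kb: free space needed in /tmp, 3x the compressed backup size (KB in, KB out).
required_kb() { echo $(( $1 * 3 )); }

# preflight: takes the compressed backup size in KB, fails fast on any missing prerequisite.
preflight() {
  local backup_kb=$1
  docker ps > /dev/null               || { echo "FAIL: docker not running" >&2; return 1; }
  [ -f /opt/station/.env.production ] || { echo "FAIL: .env.production missing" >&2; return 1; }
  rclone listremotes | grep -q '^b2:' || { echo "FAIL: rclone b2 remote not configured" >&2; return 1; }
  local free_kb
  free_kb=$(df -Pk /tmp | awk 'NR==2 {print $4}')   # available KB on /tmp
  [ "${free_kb}" -ge "$(required_kb "${backup_kb}")" ] \
    || { echo "FAIL: need $(required_kb "${backup_kb}") KB free in /tmp" >&2; return 1; }
  echo "Preflight OK"
}
```

Called as e.g. `preflight 45232` for a ~45MB compressed backup.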

Section 2: List available backups

# List all backups in B2, most recent first
rclone ls b2:station-backups/ | sort -k2 -r | head -20

# Example output:
# 45231234 postgres/202605/20260510_030000_nightly.sql.gz
# 44891234 postgres/202605/20260509_030000_nightly.sql.gz
# 44123456 postgres/202605/20260508_150000_pre-deploy-abc1234.sql.gz
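When you just want the newest nightly, the `size path` output of `rclone ls` can be reduced mechanically. A sketch (`latest_backup` is a hypothetical helper; it relies on the timestamped naming scheme shown above, where lexical order equals chronological order):

```shell
# latest_backup: read "size path" lines on stdin, print the lexically greatest
# path — with timestamp-prefixed filenames, that is the most recent backup.
latest_backup() {
  awk '{print $2}' | sort -r | head -1
}
```

Usage: `BACKUP_PATH=$(rclone ls b2:station-backups/ | latest_backup)`.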

Section 3: Download a backup

BACKUP_PATH="postgres/202605/20260510_030000_nightly.sql.gz"
LOCAL_FILE="/tmp/restore_$(date +%s).sql.gz"

rclone copy "b2:station-backups/${BACKUP_PATH}" /tmp/ --progress
mv "/tmp/$(basename "${BACKUP_PATH}")" "${LOCAL_FILE}"
echo "Downloaded: ${LOCAL_FILE} ($(du -sh "${LOCAL_FILE}" | cut -f1))"
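Before piping anything into psql, it is worth confirming the download is an intact gzip archive; `gunzip -t` reads the whole file and verifies its checksum without writing output. A sketch (`check_archive` is a hypothetical helper):

```shell
# check_archive: verify a downloaded backup decompresses cleanly before any restore.
check_archive() {
  if gunzip -t "$1" 2>/dev/null; then
    echo "OK: $1 is a valid gzip archive"
  else
    echo "CORRUPT: $1 failed gzip integrity check" >&2
    return 1
  fi
}
```

Run as `check_archive "${LOCAL_FILE}"` right after the download; a truncated transfer fails here instead of halfway through a restore.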

Section 4a: Restore drill — throwaway container (zero production risk)

source /opt/station/.env.production

# Spin up throwaway Postgres
docker run --rm -d \
  --name station-restore-test \
  -e POSTGRES_USER="${DATABASE_USER}" \
  -e POSTGRES_PASSWORD=testonly \
  -e POSTGRES_DB="${DATABASE_NAME}" \
  postgres:16-alpine

# Wait until Postgres accepts connections (a fixed sleep can race on slow hosts)
until docker exec station-restore-test pg_isready -U "${DATABASE_USER}" > /dev/null 2>&1; do
  sleep 1
done

# Restore into it
gunzip -c "${LOCAL_FILE}" | \
  docker exec -i station-restore-test \
    psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}"

# Verify key tables
docker exec station-restore-test \
  psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" \
  -c "SELECT 'users' AS t, COUNT(*) FROM users UNION ALL SELECT 'organizations', COUNT(*) FROM organizations;"

# Tear down
docker stop station-restore-test
echo "Drill complete. Throwaway container removed."
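To get the duration figure for the drill log in Section 8, the drill can be wrapped with bash's built-in `SECONDS` counter. A sketch (`format_duration` is a hypothetical helper):

```shell
# Sketch: capture the drill duration for the Section 8 log.
# format_duration: seconds -> "Xm Ys".
format_duration() { echo "$(( $1 / 60 ))m $(( $1 % 60 ))s"; }

SECONDS=0
# ... run the drill commands from Section 4a here ...
echo "Drill duration: $(format_duration "${SECONDS}")"
```

`SECONDS` is bash-specific; it resets to 0 on assignment and counts wall-clock seconds thereafter.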

Section 4b: Live restore — production (only for actual data loss)

source /opt/station/.env.production

# Stop backend to prevent writes during restore
docker compose -f /opt/station/docker-compose.prod.yml stop backend

# Restore
gunzip -c "${LOCAL_FILE}" | \
  docker exec -i station-postgres-1 \
    psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}"

# Restart backend
docker compose -f /opt/station/docker-compose.prod.yml start backend
curl -f https://api.drdnt.org/health && echo "Restore successful"

Section 5: Post-restore verification

docker exec station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" -c "
  SELECT
    (SELECT COUNT(*) FROM users)         AS users,
    (SELECT COUNT(*) FROM organizations) AS organizations,
    (SELECT COUNT(*) FROM roles)         AS roles;
"
curl -f https://api.drdnt.org/health
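Row counts alone only confirm the tables exist; a verification step should fail loudly if a key table came back empty. A minimal sketch (`assert_min_rows` and the threshold values are illustrative assumptions, not part of the runbook above):

```shell
# assert_min_rows: fail if a table's row count is below an expected floor.
assert_min_rows() {
  local table=$1 count=$2 min=$3
  if [ "${count}" -ge "${min}" ]; then
    echo "OK: ${table} has ${count} rows (>= ${min})"
  else
    echo "FAIL: ${table} has ${count} rows, expected at least ${min}" >&2
    return 1
  fi
}
```

Counts can be fed in with psql's tuples-only mode, e.g. `assert_min_rows users "$(docker exec station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" -tAc 'SELECT COUNT(*) FROM users')" 1`.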

Section 6: Decision tree

Need to restore?
- Data accidentally deleted (rows/table)?
    -> Live restore (Section 4b): stop backend, restore, restart
- Bad migration / schema corruption?
    -> Check migration-rollback.md first — TypeORM revert may be faster
       If that fails -> live restore (Section 4b)
- Disk failure / volume deleted?
    -> Provision new VPS -> run bootstrap-vps.sh -> live restore -> verify

Section 7: Estimated downtime

| Scenario | Estimated downtime |
| --- | --- |
| Drill (throwaway container) | Zero — production untouched |
| Live restore, small DB (<100MB compressed) | 5–10 minutes |
| Live restore, medium DB (100MB–1GB) | 15–30 minutes |
| Full VPS rebuild + restore | 45–90 minutes |

Section 8: Drill log (append after each quarterly drill)

| Date | Backup tested | Duration | Notes |
| --- | --- | --- | --- |
| (first drill date here) | (filename) | (X min) | First drill. Tables verified. |

Update infra/scripts/restore-db.sh (from #125)

Ensure the script sources .env.production for credentials instead of taking them as CLI arguments, which prevents credentials from appearing in ps aux output:

#!/bin/bash
set -euo pipefail
source /opt/station/.env.production

BACKUP_PATH="${1:?Usage: restore-db.sh <backup path under b2:station-backups>}"
LOCAL_FILE="/tmp/restore_$(date +%s).sql.gz"

echo "Downloading ${BACKUP_PATH}..."
rclone copy "b2:station-backups/${BACKUP_PATH}" /tmp/ --progress
mv "/tmp/$(basename "${BACKUP_PATH}")" "${LOCAL_FILE}"

echo "WARNING: About to restore into live database ${DATABASE_NAME}."
echo "Stopping backend in 5 seconds... Ctrl+C to abort."
sleep 5

docker compose -f /opt/station/docker-compose.prod.yml stop backend
gunzip -c "${LOCAL_FILE}" | docker exec -i station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}"
docker compose -f /opt/station/docker-compose.prod.yml start backend
rm "${LOCAL_FILE}"
echo "Restore complete. Verify: curl https://api.drdnt.org/health"

Definition of Done

  • infra/docs/restore.md written with all 8 sections: prerequisites, list backups, download, drill procedure, live restore, post-restore verification, decision tree, drill log
  • infra/scripts/restore-db.sh sources .env.production for credentials (not CLI args)
  • Drill completed: a real B2 backup downloaded, restored to a throwaway Postgres container, row counts verified, container torn down
  • Drill date and result recorded in the drill log table in infra/docs/restore.md
  • Time to complete the drill documented (establishes baseline for how long a restore takes)
  • infra/docs/restore.md linked from docs/deployment.md in the "What to do if things go wrong" section

Dependencies

- #125: PostgreSQL backups to Backblaze B2 (provides the rclone/B2 configuration and infra/scripts/restore-db.sh)
