ops: migration rollback runbook for failed production deploys #132

@GitAddRemote

Description

Tech Story

As a solo engineer, I want a written, rehearsed runbook for rolling back a failed database migration so that when a migration breaks production at 2 AM, I can restore service in under 15 minutes by following a checklist — not by improvising.

ELI5 Context

What is a database migration and why does it fail?
A migration is a SQL script that changes your database schema: adding a column, dropping a table, renaming a field. Migrations fail for reasons like: the SQL has a syntax error, a constraint is violated (e.g. adding a NOT NULL column without a default to a table that already has rows), or the migration runs out of time or memory midway. A half-run migration can leave the schema in a state that crashes the app on startup.
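
As a concrete sketch of the constraint-violation case, run against the production Postgres container from Section 1 (the users table and email column are hypothetical):

# Fails when "users" already has rows, because existing rows would
# have no value for the new NOT NULL column (hypothetical table/column)
docker exec station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" \
  -c "ALTER TABLE users ADD COLUMN email text NOT NULL;"
# ERROR:  column "email" of relation "users" contains null values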

Why are there two rollback paths?
The fast path (TypeORM revert) undoes the last migration using TypeORM's built-in mechanism. It takes seconds, but it only works if the migration's down() method is correctly written and the database is in a consistent enough state to run it. The safe path (backup restore) is the nuclear option: it ignores the migration entirely and restores the database to the state it was in 60 seconds before the deploy started. It works no matter what state the failed migration left the schema in, but it takes longer, and any data written between the backup and the restore is lost.

What is TypeORM's down() method?
Every TypeORM migration file has an up() method (the migration) and a down() method (how to undo it). Running pnpm migration:revert executes the down() of the most recently run migration. This is why all migrations in this project must have a working down() method — it's your first line of rollback defence.
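
A minimal local check of a down() method, assuming the repo also exposes a pnpm migration:run script alongside the migration:revert confirmed above:

# Apply the newest migration, then immediately exercise its down()
pnpm migration:run
pnpm migration:revert
# Re-apply so your local schema ends up where it started
pnpm migration:run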

What does "pre-deploy backup" mean here?
Issue #126 adds a step to the deploy workflow that runs the backup script before migrations. This backup is labelled with the git SHA, so you can find it instantly: pre-deploy-abc1234.sql.gz. If a migration fails, this is the backup you restore.
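
A sketch of how that labelling might work inside the deploy step; the actual backup script comes from issue #126, so treat this pg_dump invocation as illustrative:

# Illustrative only: tag the dump with the short SHA of the commit being deployed
SHA="$(git rev-parse --short HEAD)"   # e.g. abc1234
docker exec station-postgres-1 pg_dump -U "${DATABASE_USER}" "${DATABASE_NAME}" \
  | gzip > "pre-deploy-${SHA}.sql.gz"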

Technical Elaboration

New file: infra/docs/migration-rollback.md

Section 1: Before you start — assess the situation

SSH into the VPS and run:

# Load DB credentials so ${DATABASE_USER} and ${DATABASE_NAME} expand below
source /opt/station/.env.production

# Check if the backend is running
docker compose -f /opt/station/docker-compose.prod.yml ps

# Check recent logs for the error
docker compose -f /opt/station/docker-compose.prod.yml logs backend --tail=50

# Check which migration was being run
docker exec station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" \
  -c "SELECT name, timestamp FROM migrations ORDER BY timestamp DESC LIMIT 5;"

Answer these questions:

  1. Is the backend container running? (even if it's crashing and restarting)
  2. What is the exact error message?
  3. Which migration was running when it failed?

Section 2: Decision tree

Migration failed during deploy
|
+-- Does the backend container start at all?
|   |
|   +-- YES -> Try Fast Path (Section 3)
|   |
|   +-- NO  -> Go to Safe Path (Section 4)
|
+-- Did Fast Path succeed?
    |
    +-- YES -> Done. Redeploy previous image tag (Section 5).
    |
    +-- NO  -> Go to Safe Path (Section 4)

Section 3: Fast path — TypeORM revert

Estimated time: 2–5 minutes.

# On the VPS, load the production env vars first
source /opt/station/.env.production

# Run the TypeORM revert inside the backend container
docker exec station-backend-1 sh -c "cd /app && node dist/node_modules/.bin/typeorm migration:revert -d dist/data-source.js"

# If the container is not running, start it with the CURRENT (broken) image just long enough to revert:
docker run --rm \
  --env-file /opt/station/.env.production \
  --network station_default \
  ghcr.io/gitaddremote/station-backend:latest \
  sh -c "node dist/node_modules/.bin/typeorm migration:revert -d dist/data-source.js"

# Verify the migration was reverted
docker exec station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" \
  -c "SELECT name FROM migrations ORDER BY timestamp DESC LIMIT 3;"

If migration:revert succeeds, proceed to Section 5 (redeploy previous image).
If it errors, proceed to Section 4 (safe path).

Section 4: Safe path — restore pre-deploy backup

Estimated time: 10–20 minutes depending on database size.

Find the pre-deploy backup for this deploy:

# The backup is named with the git SHA from the failed deploy
# Find the SHA: check the GitHub Actions workflow run that triggered this deploy
DEPLOY_SHA="abc1234"   # replace with actual SHA

# List backups with that SHA
rclone ls b2:station-backups/postgres/ | grep "pre-deploy-${DEPLOY_SHA}"
# Expected: postgres/202605/20260510_150000_pre-deploy-abc1234.sql.gz

Restore:

# Stop the backend to prevent writes during restore
docker compose -f /opt/station/docker-compose.prod.yml stop backend

# Restore using the script from issue #130
bash /opt/station/infra/scripts/restore-db.sh "postgres/202605/20260510_150000_pre-deploy-abc1234.sql.gz"

# The script will restart the backend automatically

Redeploy the previous image (Section 5 still applies).

Section 5: After rollback — redeploy previous image

After either path, the database schema is back to its pre-deploy state, but the running backend image may still be the broken one. Redeploy the last known-good image:

# Find the previous image tag in GitHub Container Registry
# Go to: https://github.com/GitAddRemote/station/pkgs/container/station-backend
# Find the tag before the current broken one (e.g. v0.1.9)

PREVIOUS_TAG="v0.1.9"

# On VPS:
STATION_VERSION=${PREVIOUS_TAG} \
  docker compose -f /opt/station/docker-compose.prod.yml up -d backend

# Verify health
curl -f https://api.drdnt.org/health && echo "Rollback complete"

Section 6: Post-rollback checklist

  • Health endpoint returns 200
  • Spot-check key tables (row counts look right; see the sketch below)
  • Open an incident issue on GitHub: what failed, what was restored, data loss window (if any)
  • Fix the migration's down() method or the SQL error
  • Do NOT skip the pre-deploy backup on the retry deploy
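
A quick sketch for the row-count spot check; users and sessions are hypothetical table names, substitute your own key tables:

# Compare these counts against what you'd expect to see
docker exec station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" \
  -c "SELECT 'users' AS tbl, count(*) FROM users UNION ALL SELECT 'sessions', count(*) FROM sessions;"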

Section 7: Rehearsal log

Date                   | Path tested | Duration | Notes
(first rehearsal date) | Fast path   | X min    | Rehearsed in staging.

Rehearsal procedure (run in staging before v0.2.0 ships)

  1. On staging, intentionally write a migration with a bad down() method
  2. Deploy to staging (triggering the migration)
  3. Run the fast path revert — confirm it fails due to bad down()
  4. Run the safe path restore using the pre-deploy backup
  5. Verify staging is back to normal
  6. Record date and findings in Section 7

This rehearsal proves both paths work before you ever need them in production.
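
Condensed commands for steps 3 and 4, assuming staging mirrors the production container names and paths used in Sections 3 and 4:

# Step 3 (fast path): expect this to FAIL with the error from the bad down()
docker exec station-backend-1 sh -c "cd /app && node dist/node_modules/.bin/typeorm migration:revert -d dist/data-source.js"

# Step 4 (safe path): substitute the real pre-deploy backup path from rclone ls
bash /opt/station/infra/scripts/restore-db.sh "postgres/<month>/<timestamp>_pre-deploy-<sha>.sql.gz"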

Definition of Done

  • infra/docs/migration-rollback.md written with all 7 sections: assessment, decision tree, fast path, safe path, redeploy, post-rollback checklist, rehearsal log
  • Fast path commands tested in staging: migration:revert executes successfully
  • Safe path tested in staging: pre-deploy backup from #126 restored successfully using restore-db.sh
  • Rehearsal date and result recorded in Section 7
  • infra/docs/migration-rollback.md linked from docs/deployment.md in the "What to do if things go wrong" section
  • All migrations in the codebase confirmed to have working down() methods (run migration:revert locally on each; see the loop sketch below)
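
A rough local loop for that last item; the src/migrations/ path and the pnpm migration:run script are assumptions about this repo's layout:

# Revert every applied migration one step at a time (count is assumed to
# match the files in src/migrations/; adjust the path to your layout)
N=$(ls src/migrations/*.ts | wc -l)
for _ in $(seq "$N"); do pnpm migration:revert; done
# Walk back up to confirm every up() still applies cleanly
pnpm migration:run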

Dependencies

  • #126 (pre-deploy migration backup step in the deploy workflow)
  • #130 (restore-db.sh, used by the safe path restore)