ops: migration rollback runbook for failed production deploys #132

@GitAddRemote

Description

Tech Story

As a solo engineer, I want a written, rehearsed runbook for rolling back a failed database migration so that when a migration breaks production at 2 AM, I can restore service in under 15 minutes by following a checklist — not by improvising.

ELI5 Context

What is a database migration and why does it fail?
A migration is a SQL script that changes your database schema: adding a column, dropping a table, renaming a field. Migrations fail for reasons like: the SQL has a syntax error, a constraint is violated (e.g. adding a NOT NULL column without a default to a table that already has rows), or the migration runs out of time or memory midway. A half-run migration can leave the schema in a state that crashes the app on startup.
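
As a concrete sketch of the constraint-violation case, run against the production Postgres container from Section 1 (the users table and email column are hypothetical):

# Fails when "users" already has rows, because existing rows would
# have no value for the new NOT NULL column (hypothetical table/column)
docker exec station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" \
  -c "ALTER TABLE users ADD COLUMN email text NOT NULL;"
# ERROR:  column "email" of relation "users" contains null values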

Why are there two rollback paths?
The fast path (TypeORM revert) undoes the last migration using TypeORM's built-in mechanism. It takes seconds, but it only works if the migration's down() method is correctly written and the database is in a consistent enough state to run it. The safe path (backup restore) is the nuclear option: it ignores the migration entirely and restores the database to the state it was in 60 seconds before the deploy started. It works no matter what state the failed migration left the schema in, but it takes longer, and any data written between the backup and the restore is lost.

What is TypeORM's down() method?
Every TypeORM migration file has an up() method (the migration) and a down() method (how to undo it). Running pnpm migration:revert executes the down() of the most recently run migration. This is why all migrations in this project must have a working down() method — it's your first line of rollback defence.
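
A minimal local check of a down() method, assuming the repo also exposes a pnpm migration:run script alongside the migration:revert confirmed above:

# Apply the newest migration, then immediately exercise its down()
pnpm migration:run
pnpm migration:revert
# Re-apply so your local schema ends up where it started
pnpm migration:run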

What does "pre-deploy backup" mean here?
Issue #126 adds a step to the deploy workflow that runs the backup script before migrations. This backup is labelled with the git SHA, so you can find it instantly: pre-deploy-abc1234.sql.gz. If a migration fails, this is the backup you restore.
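
A sketch of how that labelling might work inside the deploy step; the actual backup script comes from issue #126, so treat this pg_dump invocation as illustrative:

# Illustrative only: tag the dump with the short SHA of the commit being deployed
SHA="$(git rev-parse --short HEAD)"   # e.g. abc1234
docker exec station-postgres-1 pg_dump -U "${DATABASE_USER}" "${DATABASE_NAME}" \
  | gzip > "pre-deploy-${SHA}.sql.gz"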

Technical Elaboration

New file: infra/docs/migration-rollback.md

Section 1: Before you start — assess the situation

SSH into the VPS and run:

# Load DB credentials so ${DATABASE_USER} and ${DATABASE_NAME} expand below
source /opt/station/.env.production

# Check if the backend is running
docker compose -f /opt/station/docker-compose.prod.yml ps

# Check recent logs for the error
docker compose -f /opt/station/docker-compose.prod.yml logs backend --tail=50

# Check which migration was being run
docker exec station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" \
  -c "SELECT name, timestamp FROM migrations ORDER BY timestamp DESC LIMIT 5;"

Answer these questions:

  1. Is the backend container running? (even if it's crashing and restarting)
  2. What is the exact error message?
  3. Which migration was running when it failed?

Section 2: Decision tree

Migration failed during deploy
|
+-- Does the backend container start at all?
|   |
|   +-- YES -> Try Fast Path (Section 3)
|   |
|   +-- NO  -> Go to Safe Path (Section 4)
|
+-- Did Fast Path succeed?
    |
    +-- YES -> Done. Redeploy previous image tag (Section 5).
    |
    +-- NO  -> Go to Safe Path (Section 4)

Section 3: Fast path — TypeORM revert

Estimated time: 2–5 minutes.

# On the VPS, load the production env vars first
source /opt/station/.env.production

# Run the TypeORM revert inside the backend container
docker exec station-backend-1 sh -c "cd /app && node dist/node_modules/.bin/typeorm migration:revert -d dist/data-source.js"

# If the container is not running, start it with the CURRENT (broken) image just long enough to revert:
docker run --rm \
  --env-file /opt/station/.env.production \
  --network station_default \
  ghcr.io/gitaddremote/station-backend:latest \
  sh -c "node dist/node_modules/.bin/typeorm migration:revert -d dist/data-source.js"

# Verify the migration was reverted
docker exec station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" \
  -c "SELECT name FROM migrations ORDER BY timestamp DESC LIMIT 3;"

If migration:revert succeeds, proceed to Section 5 (redeploy previous image).
If it errors, proceed to Section 4 (safe path).

Section 4: Safe path — restore pre-deploy backup

Estimated time: 10–20 minutes depending on database size.

Find the pre-deploy backup for this deploy:

# The backup is named with the git SHA from the failed deploy
# Find the SHA: check the GitHub Actions workflow run that triggered this deploy
DEPLOY_SHA="abc1234"   # replace with actual SHA

# List backups with that SHA
rclone ls b2:station-backups/postgres/ | grep "pre-deploy-${DEPLOY_SHA}"
# Expected: postgres/202605/20260510_150000_pre-deploy-abc1234.sql.gz

Restore:

# Stop the backend to prevent writes during restore
docker compose -f /opt/station/docker-compose.prod.yml stop backend

# Restore using the script from issue #130
bash /opt/station/infra/scripts/restore-db.sh "postgres/202605/20260510_150000_pre-deploy-abc1234.sql.gz"

# The script will restart the backend automatically

Redeploy the previous image (Section 5 still applies).

Section 5: After rollback — redeploy previous image

After either path, the database schema is back to its pre-deploy state, but the running backend image may still be the broken one. Redeploy the last known-good image:

# Find the previous image tag in GitHub Container Registry
# Go to: https://github.com/GitAddRemote/station/pkgs/container/station-backend
# Find the tag before the current broken one (e.g. v0.1.9)

PREVIOUS_TAG="v0.1.9"

# On VPS:
STATION_VERSION=${PREVIOUS_TAG} \
  docker compose -f /opt/station/docker-compose.prod.yml up -d backend

# Verify health
curl -f https://api.drdnt.org/health && echo "Rollback complete"

Section 6: Post-rollback checklist

  • Health endpoint returns 200
  • Spot-check key tables (row counts look right; see the sketch below)
  • Open an incident issue on GitHub: what failed, what was restored, data loss window (if any)
  • Fix the migration's down() method or the SQL error
  • Do NOT skip the pre-deploy backup on the retry deploy
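
A quick sketch for the row-count spot check; users and sessions are hypothetical table names, substitute your own key tables:

# Compare these counts against what you'd expect to see
docker exec station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" \
  -c "SELECT 'users' AS tbl, count(*) FROM users UNION ALL SELECT 'sessions', count(*) FROM sessions;"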

Section 7: Rehearsal log

Date                   | Path tested | Duration | Notes
(first rehearsal date) | Fast path   | X min    | Rehearsed in staging.

Rehearsal procedure (run in staging before v0.2.0 ships)

  1. On staging, intentionally write a migration with a bad down() method
  2. Deploy to staging (triggering the migration)
  3. Run the fast path revert — confirm it fails due to bad down()
  4. Run the safe path restore using the pre-deploy backup
  5. Verify staging is back to normal
  6. Record date and findings in Section 7

This rehearsal proves both paths work before you ever need them in production.
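
Condensed commands for steps 3 and 4, assuming staging mirrors the production container names and paths used in Sections 3 and 4:

# Step 3 (fast path): expect this to FAIL with the error from the bad down()
docker exec station-backend-1 sh -c "cd /app && node dist/node_modules/.bin/typeorm migration:revert -d dist/data-source.js"

# Step 4 (safe path): substitute the real pre-deploy backup path from rclone ls
bash /opt/station/infra/scripts/restore-db.sh "postgres/<month>/<timestamp>_pre-deploy-<sha>.sql.gz"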

Definition of Done

  • infra/docs/migration-rollback.md written with all 7 sections: assessment, decision tree, fast path, safe path, redeploy, post-rollback checklist, rehearsal log
  • Fast path commands tested in staging: migration:revert executes successfully
  • Safe path tested in staging: pre-deploy backup from #126 restored successfully using restore-db.sh
  • Rehearsal date and result recorded in Section 7
  • infra/docs/migration-rollback.md linked from docs/deployment.md in the "What to do if things go wrong" section
  • All migrations in the codebase confirmed to have working down() methods (run migration:revert locally on each; see the loop sketch below)
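
A rough local loop for that last item; the src/migrations/ path and the pnpm migration:run script are assumptions about this repo's layout:

# Revert every applied migration one step at a time (count is assumed to
# match the files in src/migrations/; adjust the path to your layout)
N=$(ls src/migrations/*.ts | wc -l)
for _ in $(seq "$N"); do pnpm migration:revert; done
# Walk back up to confirm every up() still applies cleanly
pnpm migration:run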

Dependencies

  • #126 (pre-deploy migration backup step in the deploy workflow)
  • #130 (restore-db.sh, used by the safe path restore)