Tech Story
As a solo engineer, I want a written, rehearsed runbook for rolling back a failed database migration so that when a migration breaks production at 2 AM, I can restore service in under 15 minutes by following a checklist — not by improvising.
ELI5 Context
What is a database migration and why does it fail?
A migration is a SQL script that changes your database schema — adding a column, dropping a table, renaming a field. Migrations fail for reasons like: the SQL has a syntax error, a constraint is violated (e.g. adding a NOT NULL column to a table with existing rows), or the migration runs out of time/memory mid-way. A half-run migration can leave the schema in a state that crashes the app on startup.
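The constraint case is the one most worth internalizing. A minimal illustration (hypothetical `users` table; Postgres semantics assumed, not a migration from this project):

```shell
# Hypothetical illustration of the NOT NULL failure mode: on a table that already
# has rows, adding a NOT NULL column without a DEFAULT cannot succeed, because the
# existing rows would have no value for it.
SQL_EXAMPLE=$(cat <<'SQL'
ALTER TABLE users ADD COLUMN email varchar(255) NOT NULL;
-- ERROR:  column "email" of relation "users" contains null values
-- Fix: add the column with a DEFAULT, or add it nullable and backfill first.
SQL
)
printf '%s\n' "$SQL_EXAMPLE"
```

The same migration succeeds if the column is added with a `DEFAULT`, because every existing row then gets a value at the moment the constraint is checked.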
Why are there two rollback paths?
The fast path (TypeORM revert) undoes the last migration using TypeORM's built-in mechanism. It's fast (seconds) but only works if the migration's down() method is correctly written and the database is in a consistent enough state to run it. The safe path (backup restore) is the nuclear option — it ignores the migration entirely and restores the database to the state it was in 60 seconds before the deploy started. It always works, but it takes longer and means any data written between the backup and the restore is lost.
What is TypeORM's down() method?
Every TypeORM migration file has an up() method (the migration) and a down() method (how to undo it). Running pnpm migration:revert executes the down() of the most recently run migration. This is why all migrations in this project must have a working down() method — it's your first line of rollback defence.
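For reference, this is the shape of a migration file with a working down() (a minimal sketch — the class name, timestamp, and table are hypothetical, but the up()/down() signatures are TypeORM's MigrationInterface):

```shell
# Write a minimal example migration file. down() must exactly undo up() —
# that symmetry is what makes the fast rollback path possible.
cat > /tmp/1715000000000-AddBio.ts <<'TS'
import { MigrationInterface, QueryRunner } from "typeorm";

export class AddBio1715000000000 implements MigrationInterface {
  public async up(queryRunner: QueryRunner): Promise<void> {
    await queryRunner.query(`ALTER TABLE "user" ADD "bio" text`);
  }

  // migration:revert runs this method for the most recent migration
  public async down(queryRunner: QueryRunner): Promise<void> {
    await queryRunner.query(`ALTER TABLE "user" DROP COLUMN "bio"`);
  }
}
TS
echo "wrote example migration"
```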
What does "pre-deploy backup" mean here?
Issue #126 adds a step to the deploy workflow that runs the backup script before migrations. This backup is labelled with the git SHA, so you can find it instantly: pre-deploy-abc1234.sql.gz. If a migration fails, this is the backup you restore.
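The naming convention can be sketched as follows. The dump/upload commands are commented out and are an assumption about how the issue #126 step is built; only the SHA-labelled filename convention comes from this document:

```shell
# Sketch of the pre-deploy backup step (assumed shape; the real workflow step may
# differ). The git SHA label is what makes the backup instantly findable later.
DEPLOY_SHA="abc1234"               # in the workflow: the commit being deployed
STAMP="$(date -u +%Y%m%d_%H%M%S)"
BACKUP_NAME="${STAMP}_pre-deploy-${DEPLOY_SHA}.sql.gz"

# Dump and upload (not executed in this sketch):
#   pg_dump -U "${DATABASE_USER}" "${DATABASE_NAME}" | gzip > "${BACKUP_NAME}"
#   rclone copy "${BACKUP_NAME}" "b2:station-backups/postgres/$(date -u +%Y%m)/"
echo "${BACKUP_NAME}"
```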
Technical Elaboration
New file: infra/docs/migration-rollback.md
Section 1: Before you start — assess the situation
SSH into the VPS and run:
# Check if the backend is running
docker compose -f /opt/station/docker-compose.prod.yml ps
# Check recent logs for the error
docker compose -f /opt/station/docker-compose.prod.yml logs backend --tail=50
# Check which migration was being run
docker exec station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" \
-c "SELECT name, timestamp FROM migrations ORDER BY timestamp DESC LIMIT 5;"
Answer these questions:
- Is the backend container running? (even if it's crashing and restarting)
- What is the exact error message?
- Which migration was running when it failed?
Section 2: Decision tree
Migration failed during deploy
|
+-- Does the backend container start at all?
| |
| +-- YES -> Try Fast Path (Section 3)
| |
| +-- NO -> Go to Safe Path (Section 4)
|
+-- Did Fast Path succeed?
|
+-- YES -> Done. Redeploy previous image tag (Section 5).
|
+-- NO -> Go to Safe Path (Section 4)
Section 3: Fast path — TypeORM revert
Estimated time: 2–5 minutes.
# SSH into VPS
source /opt/station/.env.production
# Run the TypeORM revert inside the backend container
docker exec station-backend-1 sh -c "cd /app && node dist/node_modules/.bin/typeorm migration:revert -d dist/data-source.js"
# If the container is not running, start it with the CURRENT (broken) image just long enough to revert:
docker run --rm \
--env-file /opt/station/.env.production \
--network station_default \
ghcr.io/gitaddremote/station-backend:latest \
sh -c "node dist/node_modules/.bin/typeorm migration:revert -d dist/data-source.js"
# Verify the migration was reverted
docker exec station-postgres-1 psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}" \
-c "SELECT name FROM migrations ORDER BY timestamp DESC LIMIT 3;"
If migration:revert succeeds, proceed to Section 5 (redeploy previous image).
If it errors, proceed to Section 4 (safe path).
Section 4: Safe path — restore pre-deploy backup
Estimated time: 10–20 minutes depending on database size.
Find the pre-deploy backup for this deploy:
# The backup is named with the git SHA from the failed deploy
# Find the SHA: check the GitHub Actions workflow run that triggered this deploy
DEPLOY_SHA="abc1234" # replace with actual SHA
# List backups with that SHA
rclone ls b2:station-backups/postgres/ | grep "pre-deploy-${DEPLOY_SHA}"
# Expected: postgres/202605/20260510_150000_pre-deploy-abc1234.sql.gz
Restore:
# Stop the backend to prevent writes during restore
docker compose -f /opt/station/docker-compose.prod.yml stop backend
# Restore using the script from issue #130
bash /opt/station/infra/scripts/restore-db.sh "postgres/202605/20260510_150000_pre-deploy-abc1234.sql.gz"
# The script will restart the backend automatically
Redeploy the previous image (Section 5 still applies).
Section 5: After rollback — redeploy previous image
After either path, the database is restored but the running backend image may still be the broken one. Redeploy the last known-good image:
# Find the previous image tag in GitHub Container Registry
# Go to: https://github.com/GitAddRemote/station/pkgs/container/station-backend
# Find the tag before the current broken one (e.g. v0.1.9)
PREVIOUS_TAG="v0.1.9"
# On VPS:
STATION_VERSION=${PREVIOUS_TAG} \
docker compose -f /opt/station/docker-compose.prod.yml up -d backend
# Verify health
curl -f https://api.drdnt.org/health && echo "Rollback complete"
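A single curl can fail simply because the container is still booting. A small retry helper (a sketch, not a script from this repo) makes the final health check more forgiving:

```shell
# retry ATTEMPTS DELAY CMD... — run CMD until it succeeds or attempts run out.
retry() {
  attempts="$1"; delay="$2"; shift 2
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Poll the health endpoint for up to 60 seconds after the redeploy:
#   retry 12 5 curl -fsS https://api.drdnt.org/health && echo "Rollback complete"
```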
Section 6: Post-rollback checklist
Section 7: Rehearsal log
| Date | Path tested | Duration | Notes |
|------|-------------|----------|-------|
| (first rehearsal date) | Fast path | X min | Rehearsed in staging. |
Rehearsal procedure (run in staging before v0.2.0 ships)
- On staging, intentionally write a migration with a bad down() method
- Deploy to staging (triggering the migration)
- Run the fast path revert — confirm it fails due to the bad down()
- Run the safe path restore using the pre-deploy backup
- Verify staging is back to normal
- Record the date and findings in Section 7
This rehearsal proves both paths work before you ever need them in production.
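Step 1 of the rehearsal can be as small as this (hypothetical file and class names; the broken down() references a column that was never created, so migration:revert is guaranteed to fail and force the safe path):

```shell
# Write a rehearsal migration whose down() is deliberately broken. Deploying this
# to staging and then reverting exercises both the fast-path failure and the
# safe-path restore.
cat > /tmp/1715000000001-RehearsalBadDown.ts <<'TS'
import { MigrationInterface, QueryRunner } from "typeorm";

export class RehearsalBadDown1715000000001 implements MigrationInterface {
  public async up(queryRunner: QueryRunner): Promise<void> {
    await queryRunner.query(`ALTER TABLE "user" ADD "rehearsal_scratch" text`);
  }

  public async down(queryRunner: QueryRunner): Promise<void> {
    // Deliberately wrong: this column does not exist, so the revert errors out.
    await queryRunner.query(`ALTER TABLE "user" DROP COLUMN "no_such_column"`);
  }
}
TS
echo "wrote rehearsal migration"
```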
Definition of Done
- infra/docs/migration-rollback.md written with all 7 sections: assessment, decision tree, fast path, safe path, redeploy, post-rollback checklist, rehearsal log
- Fast path rehearsed in staging: migration:revert executes successfully
- Safe path rehearsed in staging using restore-db.sh
- infra/docs/migration-rollback.md linked from docs/deployment.md in the "What to do if things go wrong" section
- All existing migrations have working down() methods (run migration:revert locally on each)
Dependencies