feat: add Redis AOF persistence and pre-deploy migration backup #126

@GitAddRemote

Tech Story

As a platform engineer, I want Redis to persist data across restarts and the deploy pipeline to capture a database snapshot before running migrations, so that a Redis crash costs at most about a second of cached session data and I always have a safe rollback point before any schema change.

ELI5 Context

What is RDB vs AOF persistence in Redis?
By default, Redis periodically saves a point-in-time snapshot of all data to disk (RDB); how often depends on the configured save thresholds and the write rate. If Redis crashes between snapshots, everything written since the last snapshot is lost, which is typically minutes of data. AOF (Append-Only File) persistence instead logs every write command as it happens. With appendfsync everysec, the worst-case data loss is about one second of writes. Think of RDB as saving a Word document every 5 minutes and AOF as Google Docs auto-saving every keystroke.

Why does a migration need a backup before it runs?
A database migration changes the schema — adding columns, dropping tables, renaming fields. If a migration runs halfway and then fails, the database can be left in an inconsistent state that prevents the app from starting. With a backup taken 30 seconds before the migration, the worst-case recovery is: restore backup, redeploy previous image, investigate. Without it, recovery means manually reverse-engineering what the half-run migration did.

What is appendfsync everysec?
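This Redis config option tells Redis to fsync the AOF log to disk once per second. It is Redis's default appendfsync policy and the usual balance between performance (not syncing on every write) and safety (losing about one second of writes at most). The alternatives are always (fsync on every write), which is the safest but noticeably slower, and no (let the OS decide when to flush), which is the fastest but leaves the size of the loss window up to the operating system.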

Technical Elaboration

File: docker-compose.prod.yml — Redis service changes

redis:
  image: redis:7-alpine
  restart: unless-stopped
  command: redis-server --requirepass ${REDIS_PASSWORD} --appendonly yes --appendfsync everysec
  volumes:
    - redis_aof:/data      # named volume — persists AOF file across container restarts
  healthcheck:
    test: ["CMD", "redis-cli", "-a", "${REDIS_PASSWORD}", "ping"]
    interval: 10s
    timeout: 3s
    retries: 3

volumes:
  postgres_data:
  redis_aof:               # renamed from redis_data to reflect AOF usage

Apply the same change to docker-compose.staging.yml — staging should mirror production configuration.

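Important: if you previously had a volume named redis_data, rename it to redis_aof in the compose file. Docker has no volume rename command, so on the next deploy Compose creates a new, empty station_redis_aof volume and leaves station_redis_data untouched. Either copy the old volume's contents into the new one with a temporary container, or simply recreate the Redis container and let it start empty (Redis data is ephemeral cache, so loss is acceptable during a planned rename). Afterwards, remove the orphaned volume with docker volume rm station_redis_data on the VPS.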

Verifying AOF is active

After restarting the Redis container:

docker exec station-redis-1 redis-cli -a "${REDIS_PASSWORD}" INFO persistence | grep aof_enabled
# Expected output: aof_enabled:1
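For a scripted check (for example in a deploy smoke test), the grep above can be turned into a pass/fail gate. This is a minimal sketch, not part of the issue's required changes: the function name aof_is_enabled is hypothetical, and it only inspects text on stdin, so it can be exercised without a running Redis. Redis terminates INFO lines with CRLF, which is why the carriage returns are stripped before the exact-line match.

```shell
# Reads `INFO persistence` output on stdin and succeeds only if AOF is on.
# INFO lines end in CRLF, so drop '\r' before matching the whole line.
aof_is_enabled() {
  tr -d '\r' | grep -x 'aof_enabled:1' >/dev/null
}

# Hypothetical usage on the VPS (container name as used in this issue):
#   docker exec station-redis-1 redis-cli -a "$REDIS_PASSWORD" --no-auth-warning \
#     INFO persistence | aof_is_enabled || { echo "AOF is not enabled" >&2; exit 1; }
```

The --no-auth-warning flag only silences the password warning that redis-cli prints to stderr; it does not change the command's output.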

File: .github/workflows/release.yml — pre-migration backup step

In the deploy-production job, add this step before the migration step:

- name: Pre-migration database backup
  run: |
    ssh ${{ secrets.VPS_USER }}@${{ secrets.VPS_HOST }} \
      "cd /opt/station && \
       LABEL=pre-deploy-${{ github.sha }} \
       bash infra/scripts/backup-db.sh"
  # If this step fails, the workflow stops — migrations never run against an un-backed-up database

The backup-db.sh script already exists from issue #125. This step just calls it with a label so the backup filename includes the git SHA for traceability.
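The fail-stop behaviour relies on backup-db.sh exiting non-zero when the dump fails, because ssh passes the remote exit status back to the workflow step. The script's contents are not shown in this issue; assuming it pipes a dump command into gzip (the .sql.gz suffix suggests this), the sketch below illustrates why the script would need set -o pipefail. Here `false` stands in for a failing dump, and both function names are hypothetical.

```shell
# `false` stands in for a failing dump command (an assumption; the real
# backup-db.sh from issue #125 is not shown here).

# Without pipefail, the pipeline reports gzip's exit status, so the failed
# dump is masked and the step would appear to succeed.
without_pipefail() {
  bash -c 'false | gzip -c >/dev/null'
}

# With pipefail, the pipeline fails if any stage fails, so the non-zero
# status reaches ssh and, through it, the workflow step.
with_pipefail() {
  bash -c 'set -o pipefail; false | gzip -c >/dev/null'
}

without_pipefail && echo "without pipefail: exit 0 (failure masked)"
with_pipefail    || echo "with pipefail: non-zero exit (failure propagates)"
```

If the script already starts with set -euo pipefail, nothing needs to change; the sketch is only a way to confirm the behaviour before relying on it in the Definition of Done check.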

File: infra/scripts/backup-db.sh — label support (small change)

Add an optional LABEL environment variable to the backup filename:

LABEL="${LABEL:-nightly}"
BACKUP_FILE="/tmp/station_backup_${TIMESTAMP}_${LABEL}.sql.gz"

So pre-deploy backups are named like station_backup_20260510_030000_pre-deploy-abc1234.sql.gz (SHA abbreviated here; github.sha expands to the full 40-character commit SHA) and are distinguishable from nightly backups in B2.
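The naming rule can be sketched as a small function, which makes the default and the labelled case easy to check in isolation. Several things here are assumptions rather than part of the existing script: the function name backup_file_name, the timestamp format (inferred from the example filename), and the sanitising step, which is an optional extra that replaces characters that are awkward in filenames.

```shell
# Builds the backup path from an optional LABEL, mirroring the two lines above.
# TIMESTAMP falls back to the current UTC time; the format is an assumption
# inferred from the example filename.
backup_file_name() {
  local timestamp="${TIMESTAMP:-$(date -u +%Y%m%d_%H%M%S)}"
  local label="${LABEL:-nightly}"
  # Optional hardening (not in the original script): any character outside
  # letters, digits, '.', '_' and '-' becomes '-'.
  label="$(printf '%s' "$label" | tr -c 'A-Za-z0-9._-' '-')"
  printf '/tmp/station_backup_%s_%s.sql.gz\n' "$timestamp" "$label"
}

# Hypothetical usage:
#   LABEL=pre-deploy-abc1234 backup_file_name
#   backup_file_name            # no LABEL set, so the label is "nightly"
```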

New file: infra/docs/redis.md

Document:

  1. Why AOF is enabled (data safety, 1-second loss window)
  2. How to verify AOF is active (INFO persistence)
  3. What to do if Redis data is lost (it's a cache — the app degrades gracefully, no restore needed; sessions are re-created on next login)
  4. How to inspect the AOF file size: docker exec station-redis-1 redis-cli -a "${REDIS_PASSWORD}" INFO persistence | grep aof_current_size
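For the size check in item 4, redis.md could include a helper that pulls a single field out of the INFO output instead of grepping the whole line. A minimal sketch, assuming a hypothetical function name info_field; it reads from stdin, so it works on saved output as well as a live container. Redis reports aof_current_size in bytes, and the field is only present while AOF is enabled.

```shell
# Prints the value of one INFO field read from stdin, e.g. aof_current_size.
# CRLF line endings from Redis are stripped before the field is matched.
info_field() {
  tr -d '\r' | awk -F: -v key="$1" '$1 == key { print $2 }'
}

# Hypothetical usage on the VPS (container name as used in this issue):
#   docker exec station-redis-1 redis-cli -a "$REDIS_PASSWORD" --no-auth-warning \
#     INFO persistence | info_field aof_current_size
```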

Definition of Done

  • docker-compose.prod.yml Redis service uses --appendonly yes --appendfsync everysec
  • docker-compose.staging.yml Redis service matches production configuration
  • Redis data volume renamed to redis_aof in both compose files
  • docker exec ... redis-cli INFO persistence | grep aof_enabled returns aof_enabled:1 on the VPS
  • .github/workflows/release.yml has a pre-migration backup step that runs before migration:run
  • If the backup step fails, the workflow fails (no migration runs) — verified by testing with a broken B2 credential
  • infra/scripts/backup-db.sh accepts optional LABEL env var and includes it in the backup filename
  • infra/docs/redis.md written covering AOF, verification, and recovery
  • Staging environment tested end-to-end with AOF enabled

Dependencies
