Merged
24 changes: 24 additions & 0 deletions .github/workflows/release.yml
@@ -401,13 +401,37 @@ jobs:
ENVEOF
chmod 600 /opt/station/.env.production"

- name: Write production rclone config
env:
VPS_HOST: ${{ secrets.VPS_HOST }}
VPS_USER: ${{ secrets.VPS_USER }}
B2_ACCOUNT_ID: ${{ secrets.B2_ACCOUNT_ID }}
B2_APPLICATION_KEY: ${{ secrets.B2_APPLICATION_KEY }}
run: |
ssh -o StrictHostKeyChecking=yes "${VPS_USER}@${VPS_HOST}" "cat > /opt/station/rclone.conf <<'RCLONEEOF'
[b2]
type = b2
account = ${B2_ACCOUNT_ID}
key = ${B2_APPLICATION_KEY}
hard_delete = false
RCLONEEOF
chmod 600 /opt/station/rclone.conf"

- name: Log in to GHCR
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}

- name: Pre-deploy database backup
env:
VPS_HOST: ${{ secrets.VPS_HOST }}
VPS_USER: ${{ secrets.VPS_USER }}
run: |
ssh -o StrictHostKeyChecking=yes "${VPS_USER}@${VPS_HOST}" \
"cd /opt/station && BACKUP_LABEL=pre-deploy-${GITHUB_SHA::7} bash infra/scripts/backup-db.sh"

- name: Deploy production
env:
STATION_VERSION: ${{ needs.build-and-push.outputs.version }}
2 changes: 2 additions & 0 deletions docs/cicd.md
@@ -83,6 +83,7 @@ bash infra/scripts/deploy.sh
```

The release workflow rewrites `/opt/station/.env.production` before every production deploy and locks it down with `chmod 600`.
It also writes `/opt/station/rclone.conf` from the production B2 secrets and runs a pre-deploy PostgreSQL backup before the backend rollout begins.

## Rollback

@@ -96,6 +97,7 @@ The release workflow rewrites `/opt/station/.env.production` before every produc
- Release runs are serialized per release branch with a workflow-level concurrency group so repeated pushes or reruns on the same release branch queue behind the in-flight run instead of canceling it mid-deploy.
- The shared staging and production deploy jobs also use a global `station-deploy` concurrency group so different release branches cannot race each other on the same VPS or image promotion path.
- Release deployments pin the target host through `VPS_KNOWN_HOSTS` and use `StrictHostKeyChecking=yes` instead of trusting first use.
- Production deploys fail closed if the pre-deploy backup cannot be created and uploaded to Backblaze B2.
- Backend and frontend CI still run on `release/**` pushes, but the release workflow no longer depends on those separate runs to gate deploys because it executes the same validation steps itself.
- The release workflow shell-quotes `STATION_VERSION` before sending it over SSH so the remote deploy treats the version as data rather than shell syntax.
- Health-check polling bounds each `curl` attempt with explicit connect and total timeouts so a single hung request cannot stall the full deploy window.
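The bounded health-check behavior described above can be sketched as a plain bash loop. Here `check_health` is a hypothetical stand-in for the workflow's bounded `curl` probe (`--connect-timeout 5 --max-time …`), and the 120-second deadline mirrors the workflow's `deadline=$((SECONDS + 120))`:

```bash
#!/bin/bash
# Sketch: each probe is individually bounded, and the loop as a whole is
# bounded by a wall-clock deadline, so one hung probe cannot stall the deploy.
set -euo pipefail

attempts=0
check_health() {
  # Stand-in for:
  #   curl --fail --silent --show-error --connect-timeout 5 \
  #        --max-time "$max_time" "$HEALTH_URL"
  attempts=$((attempts + 1))
  [ "${attempts}" -ge 3 ]  # pretend the service becomes healthy on probe 3
}

deadline=$((SECONDS + 120))  # overall deploy window
until check_health; do
  if [ "${SECONDS}" -ge "${deadline}" ]; then
    echo "service never became healthy" >&2
    exit 1
  fi
  sleep 1
done
echo "healthy after ${attempts} attempts"
```

The two bounds are independent: the per-probe timeout protects against a hung connection, while the deadline caps total retries even when every probe fails fast.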
39 changes: 39 additions & 0 deletions docs/deployment.md
@@ -0,0 +1,39 @@
# Deployment Runbook

## Backups

Production deploys now create a pre-deploy PostgreSQL backup before the backend rollout starts. Nightly backups also run on the VPS at `03:00` via the `deploy` user's cron.

### Verify nightly backups

```bash
ssh deploy@<vps-host>
tail -f /opt/station/logs/backup.log
```

### List backups in Backblaze B2

```bash
ssh deploy@<vps-host>
set -a; source /opt/station/.env.production; set +a   # provides B2_BUCKET
export RCLONE_CONFIG=/opt/station/rclone.conf
rclone ls "b2:${B2_BUCKET}/postgres/"
```

### Trigger a manual backup

```bash
ssh deploy@<vps-host>
cd /opt/station
bash infra/scripts/backup-db.sh manual
```

### Restore from a backup

```bash
ssh deploy@<vps-host>
cd /opt/station
bash infra/scripts/restore-db.sh postgres/202605/20260510_030000_nightly.sql.gz
```

The restore script stops the backend, replays the SQL dump into the running production Postgres container, and starts the backend again once the import finishes. The dump is applied on top of the current database; if you need a clean replacement restore, drop and recreate the target database first.
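The `postgres/202605/...` path in the examples above follows the naming scheme in `backup-db.sh`: the `YYYYMM` folder is the first six characters of the `YYYYMMDD_HHMMSS` timestamp. A minimal sketch of the derivation:

```bash
# Mirrors backup-db.sh's key derivation:
#   REMOTE_PATH="postgres/${TIMESTAMP:0:6}/${TIMESTAMP}_${LABEL}.sql.gz"
TIMESTAMP="20260510_030000"   # normally $(date +%Y%m%d_%H%M%S)
LABEL="nightly"               # or "manual", or a pre-deploy label
REMOTE_PATH="postgres/${TIMESTAMP:0:6}/${TIMESTAMP}_${LABEL}.sql.gz"
echo "${REMOTE_PATH}"  # → postgres/202605/20260510_030000_nightly.sql.gz
```

Knowing this layout lets you reconstruct the B2 object key for any backup from its date alone, without listing the whole bucket.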
10 changes: 10 additions & 0 deletions infra/README.md
@@ -78,3 +78,13 @@ Issue `#128` documents environment-scoped secret management in `infra/docs/secre
- GitHub `staging` and `production` environments hold the deploy-time secrets.
- The release workflow writes `/opt/station/.env.staging` and `/opt/station/.env.production` on the VPS during deploys.
- Those files are recreated on every deploy and locked down with `chmod 600`.

## Backups

Issue `#125` adds the production backup contract:

- `infra/scripts/backup-db.sh`: creates a gzip-compressed `pg_dump` from the running production Postgres container and uploads it to Backblaze B2 via `rclone`
- `infra/scripts/restore-db.sh`: downloads a backup from B2 and restores it into the production Postgres container
- `infra/logrotate/station-backup`: rotates `/opt/station/logs/backup.log`

The production release workflow writes `/opt/station/rclone.conf` from GitHub environment secrets and runs a pre-deploy backup before rolling the backend forward.
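The label that lands in the backup filename follows a small precedence chain in `backup-db.sh` (`LABEL="${1:-${BACKUP_LABEL:-nightly}}"`): a positional argument wins, then the `BACKUP_LABEL` variable the release workflow sets, then the `nightly` default used by cron. Sketched in isolation:

```bash
# Same default-chaining as backup-db.sh's LABEL assignment.
resolve_label() {
  local cli_arg="${1:-}"
  echo "${cli_arg:-${BACKUP_LABEL:-nightly}}"
}

resolve_label                                   # nightly (cron case)
BACKUP_LABEL=pre-deploy-abc1234 resolve_label   # pre-deploy-abc1234 (release workflow)
resolve_label manual                            # manual (operator case)
```

Because `:-` treats an empty string the same as unset, an accidentally blank `BACKUP_LABEL` still falls through to `nightly` rather than producing an unnamed backup.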
6 changes: 3 additions & 3 deletions infra/docs/secrets.md
@@ -21,9 +21,9 @@ Use GitHub repository environments for `staging` and `production`. Store the fol
| `REDIS_PASSWORD` | Yes | Yes | Generate with `openssl rand -base64 24`. |
| `ALLOWED_ORIGIN` | Yes | Yes | `https://staging.station.drdnt.org` / `https://station.drdnt.org`. |
| `FRONTEND_URL` | Yes | Yes | Used in password-reset links. Should match the frontend URL. |
| `B2_ACCOUNT_ID` | Optional | Optional | Needed for backup work. |
| `B2_APPLICATION_KEY` | Optional | Optional | Needed for backup work. |
| `B2_BUCKET` | Optional | Optional | Example: `station-backups`. |
| `B2_ACCOUNT_ID` | Optional | Yes | Needed for production PostgreSQL backups to Backblaze B2. |
| `B2_APPLICATION_KEY` | Optional | Yes | Needed for production PostgreSQL backups to Backblaze B2. |
| `B2_BUCKET` | Optional | Yes | Example: `station-backups`. |
| `SENTRY_DSN` | Optional | Optional | Needed once Sentry is enabled. |
| `LOGTAIL_SOURCE_TOKEN` | Optional | Optional | Needed once log aggregation is enabled. |
| `BACKUP_HEALTHCHECK_URL` | Optional | Recommended | Production backup dead-man switch URL. Leave blank in staging if unused. |
10 changes: 10 additions & 0 deletions infra/logrotate/station-backup
@@ -0,0 +1,10 @@
/opt/station/logs/backup.log {
weekly
rotate 12
compress
missingok
notifempty
copytruncate
su deploy deploy
}
58 changes: 58 additions & 0 deletions infra/scripts/backup-db.sh
@@ -0,0 +1,58 @@
#!/bin/bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
STATION_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"
ENV_FILE="${STATION_ROOT}/.env.production"
COMPOSE_FILE="${STATION_ROOT}/docker-compose.prod.yml"
RCLONE_CONFIG_FILE="${STATION_ROOT}/rclone.conf"
LOG_PREFIX="[backup]"

if [ ! -f "${ENV_FILE}" ]; then
echo "${LOG_PREFIX} Missing ${ENV_FILE}" >&2
exit 1
fi

if [ ! -f "${RCLONE_CONFIG_FILE}" ]; then
echo "${LOG_PREFIX} Missing ${RCLONE_CONFIG_FILE}" >&2
exit 1
fi

set -a
source "${ENV_FILE}"
set +a

: "${DATABASE_USER:?DATABASE_USER is required}"
: "${DATABASE_NAME:?DATABASE_NAME is required}"
: "${B2_BUCKET:?B2_BUCKET is required}"

LABEL="${1:-${BACKUP_LABEL:-nightly}}"
TIMESTAMP="$(date +%Y%m%d_%H%M%S)"
BACKUP_FILE="/tmp/station_backup_${TIMESTAMP}_${LABEL}.sql.gz"
REMOTE_PATH="postgres/${TIMESTAMP:0:6}/${TIMESTAMP}_${LABEL}.sql.gz"

export RCLONE_CONFIG="${RCLONE_CONFIG_FILE}"
trap 'rm -f "${BACKUP_FILE}"' EXIT

echo "${LOG_PREFIX} Starting backup at ${TIMESTAMP} (${LABEL})"

docker compose --env-file "${ENV_FILE}" -f "${COMPOSE_FILE}" exec -T postgres \
pg_dump -U "${DATABASE_USER}" -d "${DATABASE_NAME}" \
| gzip > "${BACKUP_FILE}"

echo "${LOG_PREFIX} Created ${BACKUP_FILE} ($(du -sh "${BACKUP_FILE}" | cut -f1))"

rclone copyto "${BACKUP_FILE}" "b2:${B2_BUCKET}/${REMOTE_PATH}" \
--b2-chunk-size 96M

echo "${LOG_PREFIX} Uploaded to b2:${B2_BUCKET}/${REMOTE_PATH}"

if [ -n "${BACKUP_HEALTHCHECK_URL:-}" ]; then
if curl -fsS --retry 3 "${BACKUP_HEALTHCHECK_URL}" >/dev/null; then
echo "${LOG_PREFIX} Healthcheck ping sent"
else
echo "${LOG_PREFIX} WARNING: healthcheck ping failed after upload" >&2
fi
fi

echo "${LOG_PREFIX} Complete"
14 changes: 13 additions & 1 deletion infra/scripts/bootstrap-vps.sh
@@ -13,7 +13,7 @@ STATION_ROOT="/opt/station"
apt update
apt upgrade -y

apt install -y ca-certificates curl gnupg lsb-release
apt install -y ca-certificates curl gnupg lsb-release cron logrotate rclone

install -m 0755 -d /etc/apt/keyrings
if [ ! -f /etc/apt/keyrings/docker.asc ]; then
@@ -41,6 +41,7 @@ apt install -y \

systemctl enable --now docker
systemctl enable --now nginx
systemctl enable --now cron

if ! id -u "${DEPLOY_USER}" >/dev/null 2>&1; then
useradd -m -s /bin/bash "${DEPLOY_USER}"
@@ -67,9 +68,20 @@ install -d -m 755 -o "${DEPLOY_USER}" -g "${DEPLOY_USER}" "${STATION_ROOT}/logs"

bash "$(dirname "$0")/setup-swap.sh"

BACKUP_CRON='0 3 * * * cd /opt/station && bash infra/scripts/backup-db.sh >> /opt/station/logs/backup.log 2>&1'
(
crontab -u "${DEPLOY_USER}" -l 2>/dev/null | grep -Fv 'infra/scripts/backup-db.sh' || true
echo "${BACKUP_CRON}"
) | crontab -u "${DEPLOY_USER}" -

if [ -f "$(dirname "$0")/../logrotate/station-backup" ]; then
install -m 644 "$(dirname "$0")/../logrotate/station-backup" /etc/logrotate.d/station-backup
fi

echo
echo "Bootstrap complete."
echo "- Install Nginx configs from infra/nginx/ into /etc/nginx/sites-available/"
echo "- Enable the sites and reload Nginx."
echo "- Run infra/scripts/issue-certs.sh once DNS is live."
echo "- Confirm the deploy user can SSH and run Docker commands without sudo."
echo "- Configure B2 secrets and verify /opt/station/rclone.conf is written during deploy."
55 changes: 55 additions & 0 deletions infra/scripts/restore-db.sh
@@ -0,0 +1,55 @@
#!/bin/bash
set -euo pipefail

if [ $# -ne 1 ]; then
echo "Usage: $0 <b2-path-to-backup>" >&2
echo "Example: $0 postgres/202605/20260510_030000_nightly.sql.gz" >&2
exit 1
fi

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
STATION_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"
ENV_FILE="${STATION_ROOT}/.env.production"
COMPOSE_FILE="${STATION_ROOT}/docker-compose.prod.yml"
RCLONE_CONFIG_FILE="${STATION_ROOT}/rclone.conf"
LOG_PREFIX="[restore]"
BACKUP_PATH="$1"
LOCAL_FILE="/tmp/restore_$(date +%s).sql.gz"

if [ ! -f "${ENV_FILE}" ]; then
echo "${LOG_PREFIX} Missing ${ENV_FILE}" >&2
exit 1
fi

if [ ! -f "${RCLONE_CONFIG_FILE}" ]; then
echo "${LOG_PREFIX} Missing ${RCLONE_CONFIG_FILE}" >&2
exit 1
fi

set -a
source "${ENV_FILE}"
set +a

: "${DATABASE_USER:?DATABASE_USER is required}"
: "${DATABASE_NAME:?DATABASE_NAME is required}"
: "${B2_BUCKET:?B2_BUCKET is required}"

export RCLONE_CONFIG="${RCLONE_CONFIG_FILE}"
trap 'rm -f "${LOCAL_FILE}"' EXIT

echo "${LOG_PREFIX} Downloading ${BACKUP_PATH} from b2:${B2_BUCKET}"
rclone copyto "b2:${B2_BUCKET}/${BACKUP_PATH}" "${LOCAL_FILE}" \
--b2-chunk-size 96M

echo "${LOG_PREFIX} WARNING: backend writes will be stopped during restore"
echo "${LOG_PREFIX} WARNING: this restore replays the SQL dump into the existing database."
echo "${LOG_PREFIX} WARNING: if you need a clean replacement, drop and recreate the target database first."
echo "${LOG_PREFIX} Starting in 5 seconds. Press Ctrl+C to abort."
sleep 5

docker compose --env-file "${ENV_FILE}" -f "${COMPOSE_FILE}" stop backend
gunzip -c "${LOCAL_FILE}" | docker compose --env-file "${ENV_FILE}" -f "${COMPOSE_FILE}" exec -T postgres \
psql -U "${DATABASE_USER}" -d "${DATABASE_NAME}"
docker compose --env-file "${ENV_FILE}" -f "${COMPOSE_FILE}" start backend

rm -f "${LOCAL_FILE}"
echo "${LOG_PREFIX} Restore complete"
45 changes: 45 additions & 0 deletions infra/tests/infrastructure.test.mjs
@@ -109,6 +109,8 @@ test('bash scripts have valid shell syntax', () => {

const scripts = [
path.join(infraRoot, 'scripts/bootstrap-vps.sh'),
path.join(infraRoot, 'scripts/backup-db.sh'),
path.join(infraRoot, 'scripts/restore-db.sh'),
path.join(infraRoot, 'scripts/setup-swap.sh'),
path.join(infraRoot, 'scripts/issue-certs.sh'),
path.join(infraRoot, 'scripts/deploy.sh'),
@@ -137,6 +139,9 @@ test('bootstrap script provisions required VPS baseline steps', () => {

assert.match(script, /apt update/);
assert.match(script, /apt upgrade -y/);
assert.match(script, /cron/);
assert.match(script, /logrotate/);
assert.match(script, /rclone/);
assert.match(script, /docker-ce/);
assert.match(script, /docker-compose-plugin/);
assert.match(script, /nginx/);
@@ -147,6 +152,36 @@
assert.match(script, /authorized_keys/);
assert.match(script, /\/opt\/station/);
assert.match(script, /bash "\$\(dirname "\$0"\)\/setup-swap\.sh"/);
assert.match(script, /backup-db\.sh >> \/opt\/station\/logs\/backup\.log/);
assert.match(script, /logrotate\/station-backup/);
});

test('backup and restore scripts use docker compose and rclone with production env', () => {
const backupScript = readInfraFile('scripts/backup-db.sh');
const restoreScript = readInfraFile('scripts/restore-db.sh');
const logrotateConfig = readInfraFile('logrotate/station-backup');

assert.match(backupScript, /RCLONE_CONFIG_FILE="\$\{STATION_ROOT\}\/rclone\.conf"/);
assert.match(backupScript, /source "\$\{ENV_FILE\}"/);
assert.match(backupScript, /trap 'rm -f "\$\{BACKUP_FILE\}"' EXIT/);
assert.match(backupScript, /docker compose --env-file "\$\{ENV_FILE\}" -f "\$\{COMPOSE_FILE\}" exec -T postgres/);
assert.match(backupScript, /pg_dump -U "\$\{DATABASE_USER\}" -d "\$\{DATABASE_NAME\}"/);
assert.match(backupScript, /rclone copyto "\$\{BACKUP_FILE\}" "b2:\$\{B2_BUCKET\}\/\$\{REMOTE_PATH\}"/);
assert.match(backupScript, /LABEL="\$\{1:-\$\{BACKUP_LABEL:-nightly\}\}"/);
assert.match(backupScript, /BACKUP_HEALTHCHECK_URL/);
assert.match(backupScript, /curl -fsS --retry 3 "\$\{BACKUP_HEALTHCHECK_URL\}"/);

assert.match(restoreScript, /rclone copyto "b2:\$\{B2_BUCKET\}\/\$\{BACKUP_PATH\}"/);
assert.match(restoreScript, /replays the SQL dump into the existing database/);
assert.match(restoreScript, /drop and recreate the target database first/);
assert.match(restoreScript, /docker compose --env-file "\$\{ENV_FILE\}" -f "\$\{COMPOSE_FILE\}" stop backend/);
assert.match(restoreScript, /psql -U "\$\{DATABASE_USER\}" -d "\$\{DATABASE_NAME\}"/);
assert.match(restoreScript, /docker compose --env-file "\$\{ENV_FILE\}" -f "\$\{COMPOSE_FILE\}" start backend/);

assert.match(logrotateConfig, /\/opt\/station\/logs\/backup\.log/);
assert.match(logrotateConfig, /weekly/);
assert.match(logrotateConfig, /rotate 12/);
assert.match(logrotateConfig, /compress/);
});

test('swap script creates and persists a 2 GB swap file', () => {
@@ -266,6 +301,13 @@ test('release workflow safely quotes station version for remote deploys', () =>
assert.match(workflow, /B2_APPLICATION_KEY=\$\{B2_APPLICATION_KEY\}/);
assert.match(workflow, /B2_BUCKET=\$\{B2_BUCKET\}/);
assert.match(workflow, /BACKUP_HEALTHCHECK_URL=\$\{BACKUP_HEALTHCHECK_URL\}/);
assert.match(workflow, /Write production rclone config/);
assert.match(workflow, /cat > \/opt\/station\/rclone\.conf <<'RCLONEEOF'/);
assert.match(workflow, /account = \$\{B2_ACCOUNT_ID\}/);
assert.match(workflow, /key = \$\{B2_APPLICATION_KEY\}/);
assert.match(workflow, /chmod 600 \/opt\/station\/rclone\.conf/);
assert.match(workflow, /Pre-deploy database backup/);
assert.match(workflow, /BACKUP_LABEL=pre-deploy-\$\{GITHUB_SHA::7\} bash infra\/scripts\/backup-db\.sh/);
assert.match(workflow, /curl --fail --silent --show-error --connect-timeout 5 --max-time "\$max_time"/);
assert.match(workflow, /deadline=\$\(\(SECONDS \+ 120\)\)/);
});
@@ -406,6 +448,8 @@ test('release workflow and CI branch rules are configured', () => {
assert.match(cicdDoc, /JWT_SECRET/);
assert.match(cicdDoc, /REDIS_PASSWORD/);
assert.match(cicdDoc, /B2_APPLICATION_KEY/);
assert.match(cicdDoc, /rclone\.conf/);
assert.match(cicdDoc, /pre-deploy PostgreSQL backup/);
assert.match(cicdDoc, /BACKUP_HEALTHCHECK_URL/);
assert.match(cicdDoc, /staging-up\.sh/);
assert.match(cicdDoc, /station-staging/);
@@ -432,6 +476,7 @@
assert.match(secretsDoc, /B2_APPLICATION_KEY/);
assert.match(secretsDoc, /LOGTAIL_SOURCE_TOKEN/);
assert.match(secretsDoc, /BACKUP_HEALTHCHECK_URL/);
assert.match(secretsDoc, /production PostgreSQL backups to Backblaze B2/);
assert.match(secretsDoc, /## Generic Rotation Procedure/);
assert.match(secretsDoc, /## JWT Secret Rotation/);
assert.match(secretsDoc, /## Database Password Rotation/);