fix(docker): serialize build stages + bump SSH timeout#46

Merged
alejandro-runner merged 2 commits into synvya-staging from fix/serialize-build-stages
Apr 29, 2026
Conversation

@alejandro-runner
Member

Why

The cold build on the production t3.medium locked the host hard enough that both SSH and SSM became unresponsive; a reboot was required to recover.

Root cause: BuildKit runs independent stages in parallel by default. The Dockerfile has `rust-builder` and `web-builder` as independent stages, so the cold build was running:

  • cargo release build with `-j 2` (~3 GB RSS)
  • bun install + vite build with `NODE_OPTIONS=--max-old-space-size=2048` (2 GB)
  • ...simultaneously, on a 4 GB instance with 4 GB swap

That pushed combined peak usage well past physical RAM; even with swap available, the system thrashed until the kernel effectively locked up.

Fix

  1. Serialize the build stages. Add a no-op `COPY --from=rust-builder /artifacts/keycast /tmp/.rust-builder-done` as the first instruction of `web-builder`. BuildKit honors the cross-stage dependency and won't start web-builder until rust-builder finishes. Now the two heavy compilers run sequentially, never together.

  2. Bump `command_timeout` from 30m to 60m on all four `appleboy/ssh-action` steps. A cold cargo + bun build on t3.medium with `-j 2` realistically takes 45-55 min. The previous 30m would kill the SSH session mid-build even on a healthy host.
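In the Dockerfile, the serialization in step 1 amounts to a single extra instruction; a minimal sketch (base images and build commands here are illustrative, only the `COPY` line is from this PR):

```dockerfile
# Stage 1: heavy Rust build (~3 GB RSS with -j 2).
FROM rust:1 AS rust-builder
# ... build steps producing /artifacts/keycast ...

# Stage 2: web build. The COPY below is the fix: it creates a
# cross-stage dependency, so BuildKit will not start web-builder
# until rust-builder has finished.
FROM oven/bun:1 AS web-builder
COPY --from=rust-builder /artifacts/keycast /tmp/.rust-builder-done
# ... bun install && vite build ...
```

The copied file is never used; it exists solely so BuildKit's dependency graph orders the stages.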

After this lands

  • Cold build: still ~45-55 min, but stays under the timeout and doesn't OOM the host.
  • Warm builds: still fast (cache mounts reused), well under 60m.
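The timeout bump is a one-line change on each of the four `appleboy/ssh-action` steps; a sketch of one such step (step name, secrets, and remote path are placeholders, not from this repo):

```yaml
- name: Deploy
  uses: appleboy/ssh-action@v1
  with:
    host: ${{ secrets.EC2_HOST }}
    username: ec2-user
    key: ${{ secrets.EC2_SSH_KEY }}
    command_timeout: 60m   # was 30m; a cold cargo + bun build runs ~45-55 min
    script: |
      cd /opt/keycast
      docker compose up -d --build
```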

Test plan

  • Production EC2 recovered (manual reboot)
  • Merge → confirm production deploy completes a cold build without locking the host
  • Confirm `free -h` and `swapon --show` after build show healthy state
  • Push a trivial follow-up commit → confirm warm build is fast
  • Confirm SSH access remains available throughout the build (test from another terminal during deploy)

Follow-up worth doing

This whole class of problem goes away if we stop building on the production host. Build the image on the GitHub-hosted runner (16 GB RAM) and push to a registry; production just `docker compose pull && up -d`. Standard prod deploy pattern. Worth scheduling as the real fix.
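A sketch of that pattern, assuming GHCR and `docker/build-push-action` (image name, registry, and remote path are placeholders; login/buildx setup steps elided):

```yaml
# Build on the 16 GB GitHub-hosted runner and push to a registry;
# the production host never compiles anything.
- uses: docker/build-push-action@v6
  with:
    push: true
    tags: ghcr.io/synvya/keycast:latest

- name: Deploy
  uses: appleboy/ssh-action@v1
  with:
    host: ${{ secrets.EC2_HOST }}
    username: ec2-user
    key: ${{ secrets.EC2_SSH_KEY }}
    command_timeout: 10m   # no build on the host, just pull + restart
    script: |
      cd /opt/keycast
      docker compose pull && docker compose up -d
```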

alejandro-runner and others added 2 commits April 29, 2026 14:45
The cold build on a t3.medium production host locked the box hard
enough that SSH and SSM both became unresponsive. Root cause:
BuildKit runs independent stages in parallel by default, so cargo
release build (-j 2, ~3 GB RSS) and bun/vite build (NODE_OPTIONS
2 GB) ran simultaneously on a 4 GB instance. Even with 4 GB swap
the system thrashed into a kernel lockup.

Two changes:

1. Add a no-op `COPY --from=rust-builder /artifacts/keycast
   /tmp/.rust-builder-done` as the first instruction of
   web-builder. BuildKit sees the cross-stage dependency and only
   starts web-builder once rust-builder finishes, so cargo and
   bun never run concurrently.

2. Bump appleboy/ssh-action `command_timeout` from 30m to 60m
   across all four deploy/QA steps. A cold cargo + bun build on
   t3.medium with -j 2 takes ~45-55 min; the previous 30m killed
   the SSH session mid-build.

Once cache mounts are populated by a successful cold build,
warm builds return to a few minutes and stay well under the
new timeout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When ec2-prepare-host.sh is invoked under sudo (e.g. while
debugging on the host), $HOME resolves to /root and the cargo
[build] jobs=2 config lands in /root/.cargo/config.toml. The
deploy/QA workflow runs the script over SSH as ec2-user, so it
never reads that config — defeating the limit.

Detect SUDO_USER and write to that user's home instead, then
chown the .cargo/ tree back to them so cargo can read/write it
when running under their UID.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
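The `SUDO_USER` handling described above can be sketched like this (a demo that writes under a temp directory; the real script writes the same config under the resolved home):

```shell
#!/bin/sh
# Sketch of the sudo-aware config placement in ec2-prepare-host.sh.
set -eu

# Resolve the home of the user who invoked sudo, not root's.
resolve_home() {
  if [ -n "${SUDO_USER:-}" ]; then
    getent passwd "$SUDO_USER" | cut -d: -f6
  else
    printf '%s\n' "$HOME"
  fi
}

echo "cargo config would land under: $(resolve_home)/.cargo"

# Demo: write the [build] jobs limit into a scratch dir.
demo_home=$(mktemp -d)
mkdir -p "$demo_home/.cargo"
printf '[build]\njobs = 2\n' > "$demo_home/.cargo/config.toml"

# Hand the tree back to the invoking user so cargo can read/write
# it under their UID (no-op when not running under sudo).
if [ -n "${SUDO_USER:-}" ]; then
  chown -R "$SUDO_USER" "$demo_home/.cargo"
fi
```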
@alejandro-runner alejandro-runner merged commit 8bd492d into synvya-staging Apr 29, 2026
@alejandro-runner alejandro-runner deleted the fix/serialize-build-stages branch April 29, 2026 23:04