fix(docker): serialize build stages + bump SSH timeout #46
Merged: alejandro-runner merged 2 commits into synvya-staging on Apr 29, 2026
Conversation
The cold build on a t3.medium production host locked the box hard enough that SSH and SSM both became unresponsive. Root cause: BuildKit runs independent stages in parallel by default, so the cargo release build (`-j 2`, ~3 GB RSS) and the bun/vite build (`NODE_OPTIONS` capped at 2 GB) ran simultaneously on a 4 GB instance. Even with 4 GB of swap, the system thrashed into a kernel lockup.

Two changes:

1. Add a no-op `COPY --from=rust-builder /artifacts/keycast /tmp/.rust-builder-done` as the first instruction of web-builder. BuildKit sees the cross-stage dependency and only starts web-builder once rust-builder finishes, so cargo and bun never run concurrently.
2. Bump appleboy/ssh-action `command_timeout` from 30m to 60m across all four deploy/QA steps. A cold cargo + bun build on t3.medium with `-j 2` takes ~45-55 min; the previous 30m killed the SSH session mid-build. Once the cache mounts are populated by a successful cold build, warm builds return to a few minutes and stay well under the new timeout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When ec2-prepare-host.sh is invoked under sudo (e.g. while debugging on the host), `$HOME` resolves to /root and the cargo `[build] jobs = 2` config lands in /root/.cargo/config.toml. The deploy/QA workflow runs the script over SSH as ec2-user, so it never reads that config, defeating the limit.

Detect `SUDO_USER` and write to that user's home instead, then chown the .cargo/ tree back to them so cargo can read/write it when running under their UID.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
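A minimal sketch of what that part of ec2-prepare-host.sh could look like after this change; only `SUDO_USER`, the `jobs = 2` config, and the chown-back are from this PR, the variable names are illustrative:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Under sudo, $HOME is /root, but the deploy workflow runs cargo as
# ec2-user -- so resolve the invoking user's home and write there.
target_user="${SUDO_USER:-$USER}"
target_home="$(getent passwd "$target_user" | cut -d: -f6)"

mkdir -p "$target_home/.cargo"
cat > "$target_home/.cargo/config.toml" <<'EOF'
[build]
jobs = 2
EOF

# Hand the tree back so cargo can read/write it under that user's UID.
chown -R "$target_user:$target_user" "$target_home/.cargo"
```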
Why
The cold build on production t3.medium locked the host hard enough that SSH and SSM both became unresponsive. Recovery required a reboot.
Root cause: BuildKit runs independent stages in parallel by default. The Dockerfile has `rust-builder` and `web-builder` as independent stages, so the cold build was running the cargo release build (`-j 2`, ~3 GB RSS) and the bun/vite build (`NODE_OPTIONS` capped at 2 GB) simultaneously on a 4 GB instance.

That blew past available memory + swap and the kernel locked up.
Fix
Serialize the build stages. Add a no-op `COPY --from=rust-builder /artifacts/keycast /tmp/.rust-builder-done` as the first instruction of `web-builder`. BuildKit honors the cross-stage dependency and won't start web-builder until rust-builder finishes. Now the two heavy compilers run sequentially, never together.
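A sketch of the resulting Dockerfile shape, with stage contents abbreviated; the base images, paths, and build commands here are assumptions, only the stage names, the artifact path, and the synchronization `COPY` are from this PR:

```dockerfile
# Stage 1: compile the Rust binary. Cache mounts keep warm builds fast.
FROM rust:1 AS rust-builder
WORKDIR /src
COPY . .
RUN --mount=type=cache,target=/usr/local/cargo/registry \
    cargo build --release -j 2 && \
    mkdir -p /artifacts && cp target/release/keycast /artifacts/keycast

# Stage 2: build the web frontend. The no-op COPY below creates a
# cross-stage dependency, so BuildKit will not start this stage until
# rust-builder has finished -- the two compilers never run concurrently.
FROM oven/bun:1 AS web-builder
COPY --from=rust-builder /artifacts/keycast /tmp/.rust-builder-done
WORKDIR /web
COPY web/ .
ENV NODE_OPTIONS=--max-old-space-size=2048
RUN bun install && bun run build
```

The copied artifact is never used; its only purpose is the edge it adds to BuildKit's stage dependency graph.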
Bump `command_timeout` from 30m to 60m on all four `appleboy/ssh-action` steps. A cold cargo + bun build on t3.medium with `-j 2` realistically takes 45-55 min. The previous 30m would kill the SSH session mid-build even on a healthy host.
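For reference, each of the four steps ends up shaped like this; the step name and connection inputs are illustrative, only `command_timeout: 60m` is what this PR changes:

```yaml
- name: Deploy to staging host
  uses: appleboy/ssh-action@v1
  with:
    host: ${{ secrets.EC2_HOST }}
    username: ec2-user
    key: ${{ secrets.EC2_SSH_KEY }}
    command_timeout: 60m   # was 30m; a cold build takes ~45-55 min
    script: |
      cd /opt/keycast
      docker compose up -d --build
```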
After this lands
Cold builds run rust-builder and web-builder sequentially, ~45-55 min total on t3.medium, inside the new 60m timeout. Once the cache mounts are populated by one successful cold build, warm builds return to a few minutes.
Test plan
Follow-up worth doing
This whole class of problem goes away if we stop building on the production host. Build the image on the GitHub-hosted runner (16 GB RAM) and push to a registry; production just `docker compose pull && up -d`. Standard prod deploy pattern. Worth scheduling as the real fix.
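A rough sketch of that pattern on the runner side (abbreviated: checkout and registry login omitted; the registry, image tag, and action version are placeholders, not something this repo has set up):

```yaml
- name: Build and push image
  uses: docker/build-push-action@v6
  with:
    push: true
    tags: ghcr.io/synvya/keycast:staging
    cache-from: type=gha
    cache-to: type=gha,mode=max
```

With that in place, the host-side deploy step reduces to `docker compose pull && docker compose up -d`, and the production box never runs a compiler again.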