ci(workers): smoke-test worker image boots before promoting to :latest #237
Merged
therealbrad merged 1 commit into main on Apr 23, 2026
On 2026-04-22 the multitenant-workers deploy crashlooped at startup
because testmoImportWorker's require graph pulled in next/headers,
which is stripped from the workers Docker image. `docker build` never
executed the compiled modules, so the failure didn't surface in CI —
only at deploy time.
Close the gap with a smoke-test step in each per-arch build job:
- testplanit/scripts/smoke-test-workers.js (new): require() each
compiled worker entrypoint + scheduler.js. Any require-time error
(missing module, bad native dep) throws synchronously and fails the
script. Force-exits after the require loop to short-circuit the
async startWorker() side-effects that run under CJS — this is why we
don't need live Valkey/Postgres in CI.
- .github/workflows/release.yml: after each `docker buildx bake --push`
(tag-push and manual-dispatch, amd64 and arm64), pull the just-
published workers image and run the smoke script via
`docker run --entrypoint node ... ./scripts/smoke-test-workers.js`.
Failure blocks the per-arch job, which blocks merge-manifests and
prevents :latest-workers from being retagged to a broken image.
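The two bullets above describe a require-and-report loop. A minimal sketch of that idea follows; it is not the contents of `smoke-test-workers.js`. The helper name `smokeTest` and the demo entries are assumptions made for illustration, and the real script lists the compiled worker entrypoints plus `scheduler.js` instead.

```javascript
// Sketch only: `smokeTest` and the demo entries below are hypothetical,
// not taken from scripts/smoke-test-workers.js.

/**
 * require() every entrypoint and collect failures instead of stopping at the
 * first one, so a CI log shows one pass/fail line per entry.
 */
function smokeTest(entrypoints, load = require) {
  const failures = [];
  for (const entry of entrypoints) {
    try {
      load(entry); // a missing module or bad native dep throws synchronously here
      console.log(`ok   ${entry}`);
    } catch (err) {
      console.error(`FAIL ${entry}: ${err.message}`);
      failures.push({ entry, message: err.message });
    }
  }
  return failures;
}

// Demo with one module that resolves and one that does not.
const failures = smokeTest(["node:path", "definitely-not-a-real-module"]);

// A real script would end with process.exit(exitCode) so that handles opened
// while loading cannot keep the process alive.
const exitCode = failures.length === 0 ? 0 : 1;
console.log(`exit code would be ${exitCode}`);
```

Collecting failures rather than throwing on the first one means a single CI run reports every broken entrypoint.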
Verified locally against a full workers build:
✓ All 16 entrypoints load, script exits 0.
✗ Injected a missing require into one entrypoint → smoke script exits
1 with a clear "Cannot find module ..." error naming the broken
worker. This is the signal that would have caught 2026-04-22.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🎉 This PR is included in version 0.22.8 🎉 The release is available on GitHub release. Your semantic-release bot 📦🚀
therealbrad added a commit that referenced this pull request Apr 23, 2026
…start workers (#239)

The previous guard combined an ESM-style import.meta check with a CJS fallback:

    if (
      (typeof import.meta !== "undefined" &&
        import.meta.url === pathToFileURL(process.argv[1]).href) ||
      typeof import.meta === "undefined" ||
      (import.meta as any).url === undefined
    ) {
      startWorker()...
    }

esbuild compiles each worker to CommonJS (platform: node, format: cjs) and polyfills `import.meta` as a plain object whose `.url` is `undefined`. At runtime that makes `import.meta.url === void 0` always true, so the guard always fires, meaning `require("./forecastWorker")` unintentionally invokes `startWorker()` and all its connection logic. Three workers (forecastWorker, repoCacheWorker, testmoImportWorker) call `process.exit(1)` synchronously when Valkey is unreachable, so the CI smoke test added in #237 died mid-loop and releases blocked on the smoke-test step.

Fix: replace with the canonical CJS pattern `require.main === module`. esbuild preserves `require`/`module` in CJS output, and the smoke test can now `require()` each worker without triggering startup side effects, matching the intent stated in the original comment.

Drops the `pathToFileURL` import from each worker (was only used by the old guard).
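To make the always-true claim concrete, here is a small sketch that evaluates the old condition as a plain function. `oldGuardFires`, `importMeta`, and `argv1Url` are illustrative names rather than code from the workers, and the empty object models the `import.meta` polyfill described above.

```javascript
// Sketch only: the function and parameter names are hypothetical.
// `importMeta` models what `import.meta` evaluates to in the compiled file.
function oldGuardFires(importMeta, argv1Url) {
  return (
    (typeof importMeta !== "undefined" && importMeta.url === argv1Url) ||
    typeof importMeta === "undefined" ||
    importMeta.url === undefined
  );
}

const entryUrl = "file:///app/dist/workers/forecastWorker.js";

// CJS build: import.meta is an empty object, so the third clause is true
// regardless of which file was actually launched.
const firesUnderCjs = oldGuardFires({}, "file:///app/scripts/smoke-test-workers.js");

// Native ESM: only fires when the module URL matches the launched script.
const firesForEsmEntry = oldGuardFires({ url: entryUrl }, entryUrl);
const firesForEsmImport = oldGuardFires({ url: entryUrl }, "file:///app/other.js");

console.log({ firesUnderCjs, firesForEsmEntry, firesForEsmImport });
```

The third clause was presumably meant as a CJS fallback, but it cannot tell "run directly" apart from "loaded via `require()`", which is the distinction `require.main === module` makes.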
therealbrad added a commit that referenced this pull request Apr 24, 2026
)

* fix(workers): use require.main === module guard so require() doesn't start workers

* fix(workers): guard generateFromUrlWorker + stub env in smoke test

  v0.22.9 smoke test surfaced two issues hidden behind the main-guard bug fixed in the previous commit:

  1. generateFromUrlWorker.ts had no main guard at all. It called startGenerateFromUrlWorker() unconditionally at module scope, so require()'ing it in the smoke test attempted to construct a BullMQ Worker with no Valkey and crashed. Wrapped in `if (require.main === module)` to match the other 14 workers.

  2. env.js validates DATABASE_URL / NEXTAUTH_SECRET / NEXTAUTH_URL at module-load time via @t3-oss/env-nextjs, so any worker whose transitive imports reach env.js (syncWorker, elasticsearchReindexWorker, copyMoveWorker, duplicateScanWorker, magicSelectWorker) threw during require with zod 'invalid_type' errors. Added dummy env shims at the top of smoke-test-workers.js using `||=` so real CI-provided values still win. The smoke test is verifying module-graph integrity, not runtime config correctness.

  Together with the main-guard fix, the smoke test should now complete cleanly for every worker.

* chore(workers): shorten NEXTAUTH_SECRET stub to fit prettier printWidth
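A sketch of the `||=` shim pattern from item 2, assuming placeholder values. The three variable names come from the commit message; the values and the `applyStubs` helper are invented here so the behaviour can be shown on a plain object rather than on the real `process.env`.

```javascript
// Sketch only: stub values and the applyStubs helper are hypothetical.
const STUBS = {
  DATABASE_URL: "postgresql://smoke:smoke@localhost:5432/smoke",
  NEXTAUTH_SECRET: "smoke-test-secret",
  NEXTAUTH_URL: "http://localhost:3000",
};

// `||=` assigns only when the current value is falsy (unset or empty string),
// so anything the CI environment already provides is left untouched.
function applyStubs(env = process.env) {
  for (const [key, value] of Object.entries(STUBS)) {
    env[key] ||= value;
  }
  return env;
}

// Demo on a plain object: one real value, one empty value, one missing value.
const demoEnv = applyStubs({
  DATABASE_URL: "postgresql://real-host/real-db",
  NEXTAUTH_SECRET: "",
});
console.log(demoEnv);
```

The shims have to run before the first worker is loaded, because the validation described above happens at module-load time.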
therealbrad added a commit that referenced this pull request Apr 24, 2026
…t run scheduling (#241)

* fix(workers): use require.main === module guard so require() doesn't start workers

* fix(workers): guard generateFromUrlWorker + stub env in smoke test

* chore(workers): shorten NEXTAUTH_SECRET stub to fit prettier printWidth

* fix(scheduler): add require.main guard so smoke-test require() doesn't run scheduling

  scheduler.ts called scheduleJobs() at module top-level, so the v0.22.10 smoke test's require() of dist/scheduler.js tried to connect to Valkey and exited 1 ('Required queues are not initialized. Cannot schedule jobs.') even with SKIP_VALKEY_CONNECTION=true (which makes queues no-op but still marks them unavailable).

  Same pattern as the workers fix in #239/#240: gate the runtime code with `if (require.main === module)`. Production start-workers.sh invokes scheduler directly via `tsx scheduler.ts`, so require.main IS module at runtime and scheduleJobs() still runs.
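The difference between running a guarded file directly and loading it with `require()` can be checked with a throwaway script. This sketch uses plain `node` rather than `tsx`, and the temporary file with its `scheduleJobs` stub is a stand-in, not the project's scheduler.

```javascript
// Sketch only: writes a tiny guarded module to a temp dir and runs it two ways.
const { execFileSync } = require("node:child_process");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "main-guard-"));
const file = path.join(dir, "scheduler.js");
fs.writeFileSync(
  file,
  [
    'function scheduleJobs() { console.log("scheduling"); }',
    "if (require.main === module) scheduleJobs();",
    "module.exports = { scheduleJobs };",
  ].join("\n")
);

// Launched as the entry script: the guard is true and scheduleJobs() runs.
const direct = execFileSync(process.execPath, [file], { encoding: "utf8" });

// Loaded via require() from another program: the guard is false, nothing runs.
const loader = `require(${JSON.stringify(file)})`;
const required = execFileSync(process.execPath, ["-e", loader], { encoding: "utf8" });

console.log({ direct: direct.trim(), required });
fs.rmSync(dir, { recursive: true, force: true });
```

The second invocation mirrors what the smoke test does: the module graph is loaded and checked, but the startup path stays dormant.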
Description
Adds a smoke-test step to every workers Docker image build so a broken require graph can't silently ship again.
Background. On 2026-04-22 the multitenant-workers deploy crashlooped at startup with `Error: Cannot find module 'next/headers'`. The bug was a module import that was fine in the Next.js server image but unresolvable in the workers image (Next.js is intentionally stripped to save ~900MB; see the `Dockerfile` `deps-workers` stage). `docker build` compiles files but never executes the entrypoints, so CI stayed green and the failure only surfaced at `kubectl rollout`. This PR closes that gap.
Changes
- `testplanit/scripts/smoke-test-workers.js`: CJS script that `require()`s each compiled worker entry plus `dist/scheduler.js`. Any require-time error (missing module, unreachable native dep, bundler-stripped package) throws synchronously and the script fails with a clear message. After the loop it calls `process.exit(0)` to short-circuit the async `startWorker()` side-effects that fire from the `typeof import.meta === "undefined"` branch of each worker's main-guard. This is why CI doesn't need live Valkey/Postgres to run the probe.
- `.github/workflows/release.yml`: one new step per build job, run right after `docker buildx bake --push`:
  - `build-amd64` + `build-arm64` (tag-push)
  - `docker-manual-amd64` + `docker-manual-arm64` (manual dispatch)
  Each step pulls the just-published workers image and runs `docker run --rm --entrypoint node <image> ./scripts/smoke-test-workers.js`. A failure blocks the per-arch job and therefore `merge-manifests`, which is what retags `:latest-workers`. A broken image can't promote to the floating tag.
The smoke script ships inside the workers image already (`COPY --from=build /app/scripts ./scripts` at `Dockerfile:329`), so no Dockerfile changes are needed.
Related Issue
Follow-up to hotfix c804cac (workers crashloop 2026-04-22 16:30 UTC). Companion to #235 (the architectural fix for that incident).
Type of Change
The fix is a CI-only guardrail; runtime behavior is unchanged.
How Has This Been Tested?
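Built workers locally (`pnpm build:workers`) and ran the smoke script against the real compiled `dist/`: all 16 entrypoints loaded and the script exited 0.
Verified the failure path by injecting a missing `require("definitely-not-a-real-module")` into one compiled worker: the smoke script exited 1 with a clear "Cannot find module ..." error naming the broken worker.
That is the exact failure shape that would have caught the next/headers incident.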
Test Configuration:
- `Dockerfile` base image: `FROM node:24-alpine`
Checklist
Additional Notes
Maintenance contract. When adding a new worker, update three files (existing convention, per CLAUDE memory):
- `scripts/build-workers.js`: add to `entryPoints`
- `ecosystem.config.js`: add to PM2 `apps`
- `package.json`: add `worker:<name>` script + include in `workers` concurrently list
This PR adds a fourth:
- `scripts/smoke-test-workers.js`: add to `WORKERS` array
A comment in the new file points at `ecosystem.config.js` + `scripts/build-workers.js` as the two lists that must stay in sync. It's explicit rather than metaprogrammed on purpose: a smoke test that silently skips a new worker would defeat its own purpose.
Why require() instead of the one-liner in the post-mortem. The user-suggested `docker run --entrypoint node <image> -e "require('./dist/workers/testmoImportWorker.js')"` would hang: workers use a main-guard (the `typeof import.meta === "undefined"` branch fires under Node's CJS loader) that kicks off BullMQ's background Valkey connection. Without `process.exit(0)` after the require, the Node event loop stays busy until the Valkey connection errors after its timeout, which is slow, and the failure mode is noisy and easy to misread. The script wraps every require in a try/catch, force-exits cleanly, and gives one pass/fail line per worker, which is much easier to read in a CI log.
Scope. This PR is CI/test-only. It does not change worker runtime, does not change Dockerfile stages, and does not change what images are tagged where. PR 3 (next) addresses the floating-tag problem that made rollback impossible during the incident.
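The hang described under "Why require() instead of the one-liner" can be reproduced without BullMQ or Valkey. In this sketch a repeating timer stands in for the background connection; that substitution is an assumption made to keep the example self-contained, and the one-second timeout is arbitrary.

```javascript
// Sketch only: a timer plays the role of a handle that keeps the event loop busy.
const { spawnSync } = require("node:child_process");

const KEEP_ALIVE = "setInterval(() => {}, 1000);";

// No explicit exit: the child never finishes on its own, so spawnSync
// terminates it once the timeout elapses.
const hung = spawnSync(process.execPath, ["-e", KEEP_ALIVE], { timeout: 1000 });

// Explicit process.exit(0): the child ends right away despite the live timer.
const exited = spawnSync(process.execPath, ["-e", `${KEEP_ALIVE} process.exit(0);`]);

console.log("no exit   ->", hung.status, hung.signal);
console.log("with exit ->", exited.status, exited.signal);
```

A child that has to be killed reports no exit status of its own, which is part of why a bare one-liner gives a harder-to-read signal in CI than a script that decides its exit code explicitly.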