
ci(workers): smoke-test worker image boots before promoting to :latest#237

Merged
therealbrad merged 1 commit into main from ci/workers-image-smoke-test
Apr 23, 2026

Conversation

@therealbrad
Contributor

Description

Adds a smoke-test step to every workers Docker image build so a broken require graph can't silently ship again.

Background. On 2026-04-22 the multitenant-workers deploy crashlooped at startup with Error: Cannot find module 'next/headers'. The bug was a module import that was fine in the Next.js server image but unresolvable in the workers image (Next.js is intentionally stripped to save ~900MB — see Dockerfile deps-workers stage). docker build compiles files but never executes the entrypoints, so CI stayed green and the failure only surfaced at kubectl rollout. This PR closes that gap.

Changes

  • New testplanit/scripts/smoke-test-workers.js: a CJS script that require()s each compiled worker entry plus dist/scheduler.js. Any require-time error (missing module, unreachable native dep, bundler-stripped package) throws synchronously, and the script fails with a clear message. After the loop it calls process.exit(0) to short-circuit the async startWorker() side effects fired from the typeof import.meta === "undefined" branch of each worker's main guard; this is why CI doesn't need live Valkey/Postgres to run the probe.
  • Modified .github/workflows/release.yml — one new step per build job, run right after docker buildx bake --push:
    • build-amd64 + build-arm64 (tag-push)
    • docker-manual-amd64 + docker-manual-arm64 (manual dispatch)
    • Each step pulls the image we just pushed, then runs docker run --rm --entrypoint node <image> ./scripts/smoke-test-workers.js.
    • A failure blocks the per-arch job, which blocks merge-manifests, which is what retags :latest-workers. A broken image can't promote to the floating tag.

The smoke script ships inside the workers image already (COPY --from=build /app/scripts ./scripts at Dockerfile:329), so no Dockerfile changes are needed.

Related Issue

Follow-up to hotfix c804cac (workers crashloop 2026-04-22 16:30 UTC). Companion to #235 (the architectural fix for that incident).

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

The fix is a CI-only guardrail; runtime behavior is unchanged.

How Has This Been Tested?

  • Unit tests
  • Integration tests
  • E2E tests
  • Manual testing

Built workers locally (pnpm build:workers) and ran the smoke script against the real compiled dist/:

$ node testplanit/scripts/smoke-test-workers.js
✓ notificationWorker
✓ emailWorker
✓ forecastWorker
✓ syncWorker
✓ testmoImportWorker
✓ elasticsearchReindexWorker
✓ auditLogWorker
✓ autoTagWorker
✓ budgetAlertWorker
✓ repoCacheWorker
✓ copyMoveWorker
✓ duplicateScanWorker
✓ magicSelectWorker
✓ stepSequenceScanWorker
✓ generateFromUrlWorker
✓ scheduler

All 16 worker entrypoints loaded successfully.
$ echo $?
0

Verified the failure path by injecting a missing require("definitely-not-a-real-module") into one compiled worker:

✗ auditLogWorker: Cannot find module 'definitely-not-a-real-module'
Require stack:
- .../dist/workers/auditLogWorker.js
- .../scripts/smoke-test-workers.js

1 worker entrypoint(s) failed to load.
$ echo $?
1

That is the exact failure shape that would have caught the next/headers incident.

Test Configuration:

  • OS: macOS (Darwin 25.4.0, ARM)
  • Node version: v24.14.0 (matches Dockerfile FROM node:24-alpine)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published
  • I have signed the CLA

Additional Notes

Maintenance contract. When adding a new worker, update three files (existing convention, per CLAUDE memory):

  • scripts/build-workers.js — add to entryPoints
  • ecosystem.config.js — add to PM2 apps
  • package.json — add worker:<name> script + include in workers concurrently list

This PR adds a fourth:

  • scripts/smoke-test-workers.js — add to WORKERS array

A comment in the new file points at ecosystem.config.js + scripts/build-workers.js as the two lists that must stay in sync. It's explicit rather than metaprogrammed on purpose: a smoke test that silently skips a new worker would defeat its own purpose.

Why require() instead of the one-liner in the post-mortem. The user-suggested docker run --entrypoint node <image> -e "require('./dist/workers/testmoImportWorker.js')" would hang: workers use a main-guard (typeof import.meta === "undefined" branch fires under Node's CJS loader) that kicks off BullMQ's background Valkey connection. Without process.exit(0) after the require, the Node event loop stays busy until the Valkey connection errors after its timeout — slow, and the failure mode is noisy and easy to misread. The script wraps every require in a try/catch, force-exits cleanly, and gives one pass/fail line per worker — much easier to read in a CI log.

Scope. This PR is CI/test-only. It does not change worker runtime, does not change Dockerfile stages, and does not change what images are tagged where. PR 3 (next) addresses the floating-tag problem that made rollback impossible during the incident.

On 2026-04-22 the multitenant-workers deploy crashlooped at startup
because testmoImportWorker's require graph pulled in next/headers,
which is stripped from the workers Docker image. `docker build` never
executed the compiled modules, so the failure didn't surface in CI —
only at deploy time.

Close the gap with a smoke-test step in each per-arch build job:

- testplanit/scripts/smoke-test-workers.js (new): require() each
  compiled worker entrypoint + scheduler.js. Any require-time error
  (missing module, bad native dep) throws synchronously and fails the
  script. Force-exits after the require loop to short-circuit the
  async startWorker() side-effects that run under CJS — this is why we
  don't need live Valkey/Postgres in CI.

- .github/workflows/release.yml: after each `docker buildx bake --push`
  (tag-push and manual-dispatch, amd64 and arm64), pull the just-
  published workers image and run the smoke script via
  `docker run --entrypoint node ... ./scripts/smoke-test-workers.js`.
  Failure blocks the per-arch job, which blocks merge-manifests and
  prevents :latest-workers from being retagged to a broken image.

Verified locally against a full workers build:
  ✓ All 16 entrypoints load, script exits 0.
  ✗ Injected a missing require into one entrypoint → smoke script exits
    1 with a clear "Cannot find module ..." error naming the broken
    worker. This is the signal that would have caught 2026-04-22.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@therealbrad therealbrad merged commit d4d5a2a into main Apr 23, 2026
5 checks passed
@therealbrad therealbrad deleted the ci/workers-image-smoke-test branch April 23, 2026 09:53
@therealbrad
Copy link
Copy Markdown
Contributor Author

🎉 This PR is included in version 0.22.8 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

therealbrad added a commit that referenced this pull request Apr 23, 2026
…start workers (#239)

The previous guard combined an ESM-style import.meta check with a CJS
fallback:

    if (
      (typeof import.meta !== "undefined" &&
        import.meta.url === pathToFileURL(process.argv[1]).href) ||
      typeof import.meta === "undefined" ||
      (import.meta as any).url === undefined
    ) { startWorker()... }

esbuild compiles each worker to CommonJS (platform: node, format: cjs)
and polyfills `import.meta` as a plain object whose `.url` is
`undefined`. At runtime that makes `import.meta.url === void 0` always
true, so the guard always fires — meaning `require("./forecastWorker")`
unintentionally invokes `startWorker()` and all its connection logic.

Three workers (forecastWorker, repoCacheWorker, testmoImportWorker)
call `process.exit(1)` synchronously when Valkey is unreachable, so the
CI smoke test added in #237 died mid-loop and releases blocked on the
smoke-test step.

Fix: replace with the canonical CJS pattern `require.main === module`.
esbuild preserves `require`/`module` in CJS output, and the smoke test
can now `require()` each worker without triggering startup side effects
— matching the intent stated in the original comment.

Drops the `pathToFileURL` import from each worker (was only used by
the old guard).
therealbrad added a commit that referenced this pull request Apr 24, 2026

* fix(workers): use require.main === module guard so require() doesn't start workers


* fix(workers): guard generateFromUrlWorker + stub env in smoke test

v0.22.9 smoke test surfaced two issues hidden behind the main-guard
bug fixed in the previous commit:

1. generateFromUrlWorker.ts had no main guard at all — it called
   startGenerateFromUrlWorker() unconditionally at module scope, so
   require()'ing it in the smoke test attempted to construct a BullMQ
   Worker with no Valkey and crashed. Wrapped in
   `if (require.main === module)` to match the other 14 workers.

2. env.js validates DATABASE_URL / NEXTAUTH_SECRET / NEXTAUTH_URL at
   module-load time via @t3-oss/env-nextjs, so any worker whose
   transitive imports reach env.js (syncWorker,
   elasticsearchReindexWorker, copyMoveWorker, duplicateScanWorker,
   magicSelectWorker) threw during require with zod 'invalid_type'
   errors. Added dummy env shims at the top of smoke-test-workers.js
   using `||=` so real CI-provided values still win. The smoke test
   is verifying module-graph integrity, not runtime config
   correctness.

Together with the main-guard fix, the smoke test should now complete
cleanly for every worker.

* chore(workers): shorten NEXTAUTH_SECRET stub to fit prettier printWidth
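The env shims described above amount to a few lines at the top of the smoke script; the values below are illustrative dummies, not the repository's exact stubs:

```javascript
// Hypothetical env shims in the spirit of the fix described above: dummy
// values satisfy module-load-time validation (@t3-oss/env-nextjs + zod),
// while `||=` lets real CI-provided values win over the stubs.
process.env.DATABASE_URL ||= "postgresql://smoke:smoke@localhost:5432/smoke";
process.env.NEXTAUTH_SECRET ||= "smoke-test-secret";
process.env.NEXTAUTH_URL ||= "http://localhost:3000";
```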
therealbrad added a commit that referenced this pull request Apr 24, 2026
…t run scheduling (#241)

* fix(workers): use require.main === module guard so require() doesn't start workers


* fix(workers): guard generateFromUrlWorker + stub env in smoke test


* chore(workers): shorten NEXTAUTH_SECRET stub to fit prettier printWidth

* fix(scheduler): add require.main guard so smoke-test require() doesn't run scheduling

scheduler.ts called scheduleJobs() at module top-level, so the v0.22.10
smoke test's require() of dist/scheduler.js tried to connect to Valkey
and exited 1 ('Required queues are not initialized. Cannot schedule
jobs.') even with SKIP_VALKEY_CONNECTION=true (which makes queues no-op
but still marks them unavailable).

Same pattern as the workers fix in #239/#240: gate the runtime code
with `if (require.main === module)`. Production start-workers.sh
invokes scheduler directly via `tsx scheduler.ts`, so require.main IS
module at runtime and scheduleJobs() still runs.
