
ci(workers): smoke-test worker image boots before promoting to :latest#237

Merged
therealbrad merged 1 commit into main from ci/workers-image-smoke-test
Apr 23, 2026

Conversation

@therealbrad
Contributor

Description

Adds a smoke-test step to every workers Docker image build so a broken require graph can't silently ship again.

Background. On 2026-04-22 the multitenant-workers deploy crashlooped at startup with Error: Cannot find module 'next/headers'. The bug was a module import that was fine in the Next.js server image but unresolvable in the workers image (Next.js is intentionally stripped to save ~900MB — see Dockerfile deps-workers stage). docker build compiles files but never executes the entrypoints, so CI stayed green and the failure only surfaced at kubectl rollout. This PR closes that gap.

Changes

  • New testplanit/scripts/smoke-test-workers.js: a CJS script that require()s each compiled worker entry plus dist/scheduler.js. Any require-time error (missing module, unreachable native dep, bundler-stripped package) throws synchronously, and the script fails with a clear message. After the loop it calls process.exit(0) to short-circuit the async startWorker() side effects fired from the typeof import.meta === "undefined" branch of each worker's main guard; this is why CI doesn't need live Valkey/Postgres to run the probe.
  • Modified .github/workflows/release.yml — one new step per build job, run right after docker buildx bake --push:
    • build-amd64 + build-arm64 (tag-push)
    • docker-manual-amd64 + docker-manual-arm64 (manual dispatch)
    • Each step pulls the image we just pushed, then runs docker run --rm --entrypoint node <image> ./scripts/smoke-test-workers.js.
    • A failure blocks the per-arch job, which blocks merge-manifests, which is what retags :latest-workers. A broken image can't promote to the floating tag.

The smoke script ships inside the workers image already (COPY --from=build /app/scripts ./scripts at Dockerfile:329), so no Dockerfile changes are needed.

Related Issue

Follow-up to hotfix c804cac (workers crashloop 2026-04-22 16:30 UTC). Companion to #235 (the architectural fix for that incident).

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

The fix is a CI-only guardrail; runtime behavior is unchanged.

How Has This Been Tested?

  • Unit tests
  • Integration tests
  • E2E tests
  • Manual testing

Built workers locally (pnpm build:workers) and ran the smoke script against the real compiled dist/:

$ node testplanit/scripts/smoke-test-workers.js
✓ notificationWorker
✓ emailWorker
✓ forecastWorker
✓ syncWorker
✓ testmoImportWorker
✓ elasticsearchReindexWorker
✓ auditLogWorker
✓ autoTagWorker
✓ budgetAlertWorker
✓ repoCacheWorker
✓ copyMoveWorker
✓ duplicateScanWorker
✓ magicSelectWorker
✓ stepSequenceScanWorker
✓ generateFromUrlWorker
✓ scheduler

All 16 worker entrypoints loaded successfully.
$ echo $?
0

Verified the failure path by injecting a missing require("definitely-not-a-real-module") into one compiled worker:

✗ auditLogWorker: Cannot find module 'definitely-not-a-real-module'
Require stack:
- .../dist/workers/auditLogWorker.js
- .../scripts/smoke-test-workers.js

1 worker entrypoint(s) failed to load.
$ echo $?
1

That is the exact failure shape that would have caught the next/headers incident.

Test Configuration:

  • OS: macOS (Darwin 25.4.0, ARM)
  • Node version: v24.14.0 (matches Dockerfile FROM node:24-alpine)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published
  • I have signed the CLA

Additional Notes

Maintenance contract. When adding a new worker, update three files (existing convention, per CLAUDE memory):

  • scripts/build-workers.js — add to entryPoints
  • ecosystem.config.js — add to PM2 apps
  • package.json — add worker:<name> script + include in workers concurrently list

This PR adds a fourth:

  • scripts/smoke-test-workers.js — add to WORKERS array

A comment in the new file points at ecosystem.config.js + scripts/build-workers.js as the two lists that must stay in sync. It's explicit rather than metaprogrammed on purpose: a smoke test that silently skips a new worker would defeat its own purpose.

Why require() instead of the one-liner in the post-mortem. The user-suggested docker run --entrypoint node <image> -e "require('./dist/workers/testmoImportWorker.js')" would hang: workers use a main-guard (typeof import.meta === "undefined" branch fires under Node's CJS loader) that kicks off BullMQ's background Valkey connection. Without process.exit(0) after the require, the Node event loop stays busy until the Valkey connection errors after its timeout — slow, and the failure mode is noisy and easy to misread. The script wraps every require in a try/catch, force-exits cleanly, and gives one pass/fail line per worker — much easier to read in a CI log.

Scope. This PR is CI/test-only. It does not change worker runtime, does not change Dockerfile stages, and does not change what images are tagged where. PR 3 (next) addresses the floating-tag problem that made rollback impossible during the incident.

On 2026-04-22 the multitenant-workers deploy crashlooped at startup
because testmoImportWorker's require graph pulled in next/headers,
which is stripped from the workers Docker image. `docker build` never
executed the compiled modules, so the failure didn't surface in CI —
only at deploy time.

Close the gap with a smoke-test step in each per-arch build job:

- testplanit/scripts/smoke-test-workers.js (new): require() each
  compiled worker entrypoint + scheduler.js. Any require-time error
  (missing module, bad native dep) throws synchronously and fails the
  script. Force-exits after the require loop to short-circuit the
  async startWorker() side-effects that run under CJS — this is why we
  don't need live Valkey/Postgres in CI.

- .github/workflows/release.yml: after each `docker buildx bake --push`
  (tag-push and manual-dispatch, amd64 and arm64), pull the just-
  published workers image and run the smoke script via
  `docker run --entrypoint node ... ./scripts/smoke-test-workers.js`.
  Failure blocks the per-arch job, which blocks merge-manifests and
  prevents :latest-workers from being retagged to a broken image.

Verified locally against a full workers build:
  ✓ All 16 entrypoints load, script exits 0.
  ✗ Injected a missing require into one entrypoint → smoke script exits
    1 with a clear "Cannot find module ..." error naming the broken
    worker. This is the signal that would have caught 2026-04-22.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@therealbrad therealbrad merged commit d4d5a2a into main Apr 23, 2026
5 checks passed
@therealbrad therealbrad deleted the ci/workers-image-smoke-test branch April 23, 2026 09:53
@therealbrad
Copy link
Copy Markdown
Contributor Author

🎉 This PR is included in version 0.22.8 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

therealbrad added a commit that referenced this pull request Apr 23, 2026
…start workers (#239)

The previous guard combined an ESM-style import.meta check with a CJS
fallback:

    if (
      (typeof import.meta !== "undefined" &&
        import.meta.url === pathToFileURL(process.argv[1]).href) ||
      typeof import.meta === "undefined" ||
      (import.meta as any).url === undefined
    ) { startWorker()... }

esbuild compiles each worker to CommonJS (platform: node, format: cjs)
and polyfills `import.meta` as a plain object whose `.url` is
`undefined`. At runtime that makes `import.meta.url === void 0` always
true, so the guard always fires — meaning `require("./forecastWorker")`
unintentionally invokes `startWorker()` and all its connection logic.

Three workers (forecastWorker, repoCacheWorker, testmoImportWorker)
call `process.exit(1)` synchronously when Valkey is unreachable, so the
CI smoke test added in #237 died mid-loop and releases blocked on the
smoke-test step.

Fix: replace with the canonical CJS pattern `require.main === module`.
esbuild preserves `require`/`module` in CJS output, and the smoke test
can now `require()` each worker without triggering startup side effects
— matching the intent stated in the original comment.

Drops the `pathToFileURL` import from each worker (was only used by
the old guard).
therealbrad added a commit that referenced this pull request Apr 24, 2026

* fix(workers): use require.main === module guard so require() doesn't start workers


* fix(workers): guard generateFromUrlWorker + stub env in smoke test

v0.22.9 smoke test surfaced two issues hidden behind the main-guard
bug fixed in the previous commit:

1. generateFromUrlWorker.ts had no main guard at all — it called
   startGenerateFromUrlWorker() unconditionally at module scope, so
   require()'ing it in the smoke test attempted to construct a BullMQ
   Worker with no Valkey and crashed. Wrapped in
   `if (require.main === module)` to match the other 14 workers.

2. env.js validates DATABASE_URL / NEXTAUTH_SECRET / NEXTAUTH_URL at
   module-load time via @t3-oss/env-nextjs, so any worker whose
   transitive imports reach env.js (syncWorker,
   elasticsearchReindexWorker, copyMoveWorker, duplicateScanWorker,
   magicSelectWorker) threw during require with zod 'invalid_type'
   errors. Added dummy env shims at the top of smoke-test-workers.js
   using `||=` so real CI-provided values still win. The smoke test
   is verifying module-graph integrity, not runtime config
   correctness.

Together with the main-guard fix, the smoke test should now complete
cleanly for every worker.

* chore(workers): shorten NEXTAUTH_SECRET stub to fit prettier printWidth
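The env shims described above amount to a few lines at the top of the smoke script; the values below are illustrative dummies, not the repository's exact stubs:

```javascript
// Hypothetical env shims in the spirit of the fix described above: dummy
// values satisfy module-load-time validation (@t3-oss/env-nextjs + zod),
// while `||=` lets real CI-provided values win over the stubs.
process.env.DATABASE_URL ||= "postgresql://smoke:smoke@localhost:5432/smoke";
process.env.NEXTAUTH_SECRET ||= "smoke-test-secret";
process.env.NEXTAUTH_URL ||= "http://localhost:3000";
```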
therealbrad added a commit that referenced this pull request Apr 24, 2026
…t run scheduling (#241)

* fix(workers): use require.main === module guard so require() doesn't start workers


* fix(workers): guard generateFromUrlWorker + stub env in smoke test


* chore(workers): shorten NEXTAUTH_SECRET stub to fit prettier printWidth

* fix(scheduler): add require.main guard so smoke-test require() doesn't run scheduling

scheduler.ts called scheduleJobs() at module top-level, so the v0.22.10
smoke test's require() of dist/scheduler.js tried to connect to Valkey
and exited 1 ('Required queues are not initialized. Cannot schedule
jobs.') even with SKIP_VALKEY_CONNECTION=true (which makes queues no-op
but still marks them unavailable).

Same pattern as the workers fix in #239/#240: gate the runtime code
with `if (require.main === module)`. Production start-workers.sh
invokes scheduler directly via `tsx scheduler.ts`, so require.main IS
module at runtime and scheduleJobs() still runs.
