Recovered newsletter sends interrupted by container restarts#27822
Recovered newsletter sends interrupted by container restarts#278229larsons wants to merge 5 commits into
Conversation
ref https://linear.app/tryghost/issue/BER-3589 - container restarts (deploys, node rescheduling) send SIGTERM with a ~60s grace period; in-flight newsletter sends were being force-killed mid-batch, leaving the email row stuck in `submitting` with some batches sent and some never attempted - BatchSendingService now registers an onShutdown cleanup task on ghost-server; the task flips a flag so workers stop picking up new batches, and awaits in-flight sendBatches so the cleanup chain blocks on Mailgun requests and EmailBatch DB writes before process.exit fires (otherwise mid-flight requests would be cut off and produce duplicate sends on resume) - when the worker queue still has unstarted batches at shutdown, sendBatches throws an InternalServerError with a SHUTDOWN_CODE sentinel; emailJob catches that specific code and returns without writing status=failed, leaving the row in `submitting` so a future restart can resume it
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
ref https://linear.app/tryghost/issue/BER-3589 - on container boot, scans for newsletter emails left in status=submitting from a previous container's interrupted send and resumes them through the normal emailJob path (flips to pending if the post is still published, or failed if the post was unpublished/deleted while the send was in flight) - per-email try/catch with logging.error so one bad row does not skip the others; runs unconditionally before activitypub.init so an activitypub failure cannot disable recovery - fixes sendBatch short-circuit accounting: when updateStatusLock returns undefined, distinguish between batch already submitted on a prior run (return true, expected resume path) and orphan submitting batch from a crashed worker (return false so the parent email promotes to failed and an operator can reconcile against Mailgun before retrying); previously both cases returned true, silently laundering partial sends as success - tightens retryEmail to throw a 400 BadRequestError when the email's status is not failed, instead of silently passing through to scheduleEmail; the internal callsite in post-email-handler.js is already pre-gated to failed, so this only changes behavior for misuse via the admin API - pairs with the shutdown side already on this branch to close the newsletter-send-recovery loop
ref https://linear.app/tryghost/issue/BER-3589 - exercises resumeInterruptedSends() against a real DB with mocked Mailgun; the unit tests fake updateStatusLock's return value, so they can't prove the load-bearing claim that the sendBatch short-circuit cooperates with the real lock when a batch is left in submitting status - three scenarios: clean resume (one batch already submitted, one batch pending — Mailgun called once for the pending batch only, email re-promotes to submitted); orphan submitting batch (parent email promotes to failed, orphan row preserved for operator reconciliation, no Mailgun calls); post-no-longer-published (email flipped to failed, no resume attempted, no Mailgun calls) - new file outside the existing describe.skip batch-sending.test.js suite so it actually runs in CI
…h window ref https://linear.app/tryghost/issue/BER-3589 - adds a max-age cutoff (default 7 days, override via bulkEmail:resumeMaxAgeMs config) so the boot scanner does not pick up emails that have been stuck in submitting from prior incidents long enough that the content is stale; rows beyond the cutoff are flipped to failed via updateStatusLock and surface in the admin UI for operator review instead of being silently resumed and sending old newsletters to current members - fixes the delivery-window behavior on resume: previously, if the original created_at + targetDeliveryWindow deadline had passed (the case for almost every resume), calculateDeliveryTimes returned undefined for every batch and Mailgun fired them all in the same second, breaking the rate-spread the targetDeliveryWindow exists to provide - on past deadline, calculateDeliveryTimes now respreads remaining batches over a fresh window of the same size starting from now; sendBatches uses the same condition (targetDeliveryWindow configured) to gate whether to apply deliveryTimes at all, instead of gating on the original deadline being in the future - this also improves the non-resume edge case where the job system delayed a send past its original window (rare); previously those sends also went out instantaneously; now they respect the spread
ref https://linear.app/tryghost/issue/BER-3589 - onShutdown logs the number of in-flight sendBatches awaited and emits a bookend "drain complete" line on exit, making it easy to confirm from logs that the cleanup task waited for in-flight Mailgun calls before exit - sendBatches emits a per-email completion summary listing succeeded vs total batches and unstarted count, so a partial-success case (resume hit an orphan submitting batch) shows up as a single grep-able line - resumeInterruptedSends emits a single end-of-scan summary listing stale rows flipped to failed and fresh rows rescheduled, alongside the existing per-row warn breadcrumbs
Summary
When a container restart (deploy, node reschedule, scale-down) sends SIGTERM mid-newsletter-send, the email row would be left in
status='submitting'with someEmailBatchrows submitted and others never attempted, and no automatic recovery on next boot. Operators recovered manually via DB edits.This branch closes the loop end-to-end: graceful shutdown leaves the row in a recoverable state, and the next boot scans for interrupted sends (bounded by age) and resumes them through the normal email-job path, respecting the original delivery-window pacing.
How it works
Shutdown side
BatchSendingService.onShutdown()is registered as a cleanup task onGhostServer. On SIGTERM:#shuttingDown = true. The worker loop checks the flag at the top of every iteration, so no new batches get picked up after the signal.Promise.allSettled([...#inFlight])— the set of livesendBatchespromises. The cleanup chain blocks on in-flight Mailgun POSTs + EmailBatch DB writes beforeprocess.exitfires, so a request mid-call isn't cut off.submitted. Unstarted batches stay in their current status.sendBatchesthrows anInternalServerErrorwith aSHUTDOWN_CODEsentinel.emailJobcatches that specific code, logs an info breadcrumb, and returns without writingstatus='failed'. The parent email row stays insubmitting.SIGKILL. Timing out and exiting while a request is in flight is the failure mode we're preventing.Resume side
initBackgroundServicescallsemailService.service.resumeInterruptedSends()beforeactivitypub.initso an activitypub failure can't disable recovery. The boot site has its own try/catch +logging.error.emails WHERE status='submitting':created_at < now - maxAge): flipped tofailedviaupdateStatusLockso they surface in the admin UI for operator review. Default cutoff is 7 days, override viabulkEmail:resumeMaxAgeMsconfig.created_at > now - maxAge): iterated sequentially, each in its own try/catch. For each row:publishedorsent(post was unpublished/deleted), flips email tofailedviaupdateStatusLock.pendingviaupdateStatusLock(required to satisfyemailJob's['pending', 'failed']precondition), logs a structured breadcrumb (email_id,post_id,batch_counts_by_status,ms_since_last_status_write,target_delivery_window_ms), then callsscheduleEmail.emailJobruns the normalsendBatchespath. Already-submittedbatches short-circuit and skip Mailgun (see the short-circuit fix below);pendingbatches send normally.sendBatchshort-circuit distinguishes between two cases that the old code collapsed:submitted: Mailgun accepted it on a prior run → returntrue(success, skip), loginfo.submitting: orphan from a crashed worker — Mailgun-side state unknown → returnfalse, logerror. The parent email is then promoted tofailedso an operator can reconcile against the Mailgun dashboard before retrying. Previously both cases returnedtrue, silently laundering partial sends as success.retryEmailnow throws a 400BadRequestErrorwhen the email's status isn'tfailed, instead of silently passing through toscheduleEmail. The internal callsite inpost-email-handler.jsis already pre-gated tofailed, so this only changes behavior for misuse via the admin API.Known gaps and follow-ups
These are explicitly not addressed in this branch. Flagged so a deployer can think about them.
findAllwith no limit and iterates sequentially. On a site with thousands of stuck rows within the 7-day cutoff (unlikely after this lands, but possible during the transition), it can blockinitBackgroundServicesfor minutes and load every row into memory. A follow-up could add aLIMIT+ a "scanner truncated, run again" warning.post.statusbut notnewsletter.status. If a newsletter was archived between the original send attempt and the resume, the resumed send goes out anyway. The right fix is a parallelnewsletter.status === 'active'check in the scanner before flipping the email topending.createBatchescrash mid-flight. If the original worker crashed between flipping email status tosubmittingandcreateBatchescompleting, theemail_recipientstable can be partially populated; the resumed send may hit a duplicate-key on(email_id, member_email)or send partial batches. Pre-existing failure mode, not introduced by this change.Test plan
cd ghost/core && pnpm test:single test/unit/server/services/email-service/— 380 passing. Covers each branch in isolation: shutdown drain, queue-non-empty throws, in-flight tracking; scanner happy path, bad row doesn't skip others, post-not-sendable marks failed, stale row flipped to failed, config override;sendBatchshort-circuit cases;retryEmail400;calculateDeliveryTimesrespread on past deadline.cd ghost/core && pnpm test:integration --grep "Resume interrupted sends"— 3 passing. Loads a real DB fixture with mixed-status batches, calls the scanner directly, asserts final state (clean resume, orphan submitting → email failed, post-not-published → email failed).cd ghost/core && pnpm lint:server— clean.pnpm dev+pnpm reset:datadocker compose kill -s SIGTERM ghostwhile the send is in flightsubmitting(notfailed)docker compose up -d ghostEmail resume: scheduling …; verify the email completes tosubmittedand no member receives a duplicateref https://linear.app/tryghost/issue/BER-3589