Recovered newsletter sends interrupted by container restarts by 9larsons · Pull Request #27822 · TryGhost/Ghost

9larsons · 2026-05-11T21:46:04Z

Summary

When a container restart (deploy, node reschedule, scale-down) sends SIGTERM mid-newsletter-send, the email row would be left in status='submitting' with some EmailBatch rows submitted and others never attempted, and no automatic recovery on next boot. Operators recovered manually via DB edits.

This branch closes the loop end-to-end: graceful shutdown leaves the row in a recoverable state, and the next boot scans for interrupted sends (bounded by age) and resumes them through the normal email-job path, respecting the original delivery-window pacing.

How it works

Shutdown side

BatchSendingService.onShutdown() is registered as a cleanup task on GhostServer. On SIGTERM:
1. Flips #shuttingDown = true. The worker loop checks the flag at the top of every iteration, so no new batches get picked up after the signal.
2. Awaits Promise.allSettled([...#inFlight]) — the set of live sendBatches promises. The cleanup chain blocks on in-flight Mailgun POSTs + EmailBatch DB writes before process.exit fires, so a request mid-call isn't cut off.
In-flight batches finish normally and get marked submitted. Unstarted batches stay in their current status.
When the worker queue still has unstarted batches after settle, sendBatches throws an InternalServerError with a SHUTDOWN_CODE sentinel. emailJob catches that specific code, logs an info breadcrumb, and returns without writing status='failed'. The parent email row stays in submitting.
No timeout on the in-flight await — if the K8s grace period elapses, accept SIGKILL. Timing out and exiting while a request is in flight is the failure mode we're preventing.

Resume side

On boot, initBackgroundServices calls emailService.service.resumeInterruptedSends() before activitypub.init so an activitypub failure can't disable recovery. The boot site has its own try/catch + logging.error.
The scanner runs two queries against emails WHERE status='submitting':
1. Stale rows (created_at < now - maxAge): flipped to failed via updateStatusLock so they surface in the admin UI for operator review. Default cutoff is 7 days, override via bulkEmail:resumeMaxAgeMs config.
2. Fresh rows (created_at > now - maxAge): iterated sequentially, each in its own try/catch. For each row:
  - Loads the post. If status isn't published or sent (post was unpublished/deleted), flips email to failed via updateStatusLock.
  - Otherwise flips email to pending via updateStatusLock (required to satisfy emailJob's ['pending', 'failed'] precondition), logs a structured breadcrumb (email_id, post_id, batch_counts_by_status, ms_since_last_status_write, target_delivery_window_ms), then calls scheduleEmail.
The resumed emailJob runs the normal sendBatches path. Already-submitted batches short-circuit and skip Mailgun (see the short-circuit fix below); pending batches send normally.
sendBatch short-circuit distinguishes between two cases that the old code collapsed:
- Batch status submitted: Mailgun accepted it on a prior run → return true (success, skip), log info.
- Batch status submitting: orphan from a crashed worker — Mailgun-side state unknown → return false, log error. The parent email is then promoted to failed so an operator can reconcile against the Mailgun dashboard before retrying. Previously both cases returned true, silently laundering partial sends as success.
retryEmail now throws a 400 BadRequestError when the email's status isn't failed, instead of silently passing through to scheduleEmail. The internal callsite in post-email-handler.js is already pre-gated to failed, so this only changes behavior for misuse via the admin API.

Known gaps and follow-ups

These are explicitly not addressed in this branch. Flagged so a deployer can think about them.

No upper bound on iteration count or time. Scanner does findAll with no limit and iterates sequentially. On a site with thousands of stuck rows within the 7-day cutoff (unlikely after this lands, but possible during the transition), it can block initBackgroundServices for minutes and load every row into memory. A follow-up could add a LIMIT + a "scanner truncated, run again" warning.
Newsletter status isn't re-checked at resume time. The scanner gates on post.status but not newsletter.status. If a newsletter was archived between the original send attempt and the resume, the resumed send goes out anyway. The right fix is a parallel newsletter.status === 'active' check in the scanner before flipping the email to pending.
createBatches crash mid-flight. If the original worker crashed between flipping email status to submitting and createBatches completing, the email_recipients table can be partially populated; the resumed send may hit a duplicate-key on (email_id, member_email) or send partial batches. Pre-existing failure mode, not introduced by this change.

Test plan

Unit: cd ghost/core && pnpm test:single test/unit/server/services/email-service/ — 380 passing. Covers each branch in isolation: shutdown drain, queue-non-empty throws, in-flight tracking; scanner happy path, bad row doesn't skip others, post-not-sendable marks failed, stale row flipped to failed, config override; sendBatch short-circuit cases; retryEmail 400; calculateDeliveryTimes respread on past deadline.
Integration: cd ghost/core && pnpm test:integration --grep "Resume interrupted sends" — 3 passing. Loads a real DB fixture with mixed-status batches, calls the scanner directly, asserts final state (clean resume, orphan submitting → email failed, post-not-published → email failed).
Lint: cd ghost/core && pnpm lint:server — clean.
Manual smoke test (required before merging out of draft):
1. pnpm dev + pnpm reset:data
2. Trigger a newsletter to all members so it splits into 2+ batches
3. docker compose kill -s SIGTERM ghost while the send is in flight
4. Verify the email row is in submitting (not failed)
5. docker compose up -d ghost
6. Watch boot logs for Email resume: scheduling …; verify the email completes to submitted and no member receives a duplicate

ref https://linear.app/tryghost/issue/BER-3589

ref https://linear.app/tryghost/issue/BER-3589 - container restarts (deploys, node rescheduling) send SIGTERM with a ~60s grace period; in-flight newsletter sends were being force-killed mid-batch, leaving the email row stuck in `submitting` with some batches sent and some never attempted - BatchSendingService now registers an onShutdown cleanup task on ghost-server; the task flips a flag so workers stop picking up new batches, and awaits in-flight sendBatches so the cleanup chain blocks on Mailgun requests and EmailBatch DB writes before process.exit fires (otherwise mid-flight requests would be cut off and produce duplicate sends on resume) - when the worker queue still has unstarted batches at shutdown, sendBatches throws an InternalServerError with a SHUTDOWN_CODE sentinel; emailJob catches that specific code and returns without writing status=failed, leaving the row in `submitting` so a future restart can resume it

coderabbitai · 2026-05-11T21:46:11Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 361c7aa5-8898-4cad-a7ff-29d5a39755c0

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch ber-3589-newsletter-send-recovery

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

ref https://linear.app/tryghost/issue/BER-3589 - on container boot, scans for newsletter emails left in status=submitting from a previous container's interrupted send and resumes them through the normal emailJob path (flips to pending if the post is still published, or failed if the post was unpublished/deleted while the send was in flight) - per-email try/catch with logging.error so one bad row does not skip the others; runs unconditionally before activitypub.init so an activitypub failure cannot disable recovery - fixes sendBatch short-circuit accounting: when updateStatusLock returns undefined, distinguish between batch already submitted on a prior run (return true, expected resume path) and orphan submitting batch from a crashed worker (return false so the parent email promotes to failed and an operator can reconcile against Mailgun before retrying); previously both cases returned true, silently laundering partial sends as success - tightens retryEmail to throw a 400 BadRequestError when the email's status is not failed, instead of silently passing through to scheduleEmail; the internal callsite in post-email-handler.js is already pre-gated to failed, so this only changes behavior for misuse via the admin API - pairs with the shutdown side already on this branch to close the newsletter-send-recovery loop

ref https://linear.app/tryghost/issue/BER-3589 - exercises resumeInterruptedSends() against a real DB with mocked Mailgun; the unit tests fake updateStatusLock's return value, so they can't prove the load-bearing claim that the sendBatch short-circuit cooperates with the real lock when a batch is left in submitting status - three scenarios: clean resume (one batch already submitted, one batch pending — Mailgun called once for the pending batch only, email re-promotes to submitted); orphan submitting batch (parent email promotes to failed, orphan row preserved for operator reconciliation, no Mailgun calls); post-no-longer-published (email flipped to failed, no resume attempted, no Mailgun calls) - new file outside the existing describe.skip batch-sending.test.js suite so it actually runs in CI

…h window ref https://linear.app/tryghost/issue/BER-3589 - adds a max-age cutoff (default 7 days, override via bulkEmail:resumeMaxAgeMs config) so the boot scanner does not pick up emails that have been stuck in submitting from prior incidents long enough that the content is stale; rows beyond the cutoff are flipped to failed via updateStatusLock and surface in the admin UI for operator review instead of being silently resumed and sending old newsletters to current members - fixes the delivery-window behavior on resume: previously, if the original created_at + targetDeliveryWindow deadline had passed (the case for almost every resume), calculateDeliveryTimes returned undefined for every batch and Mailgun fired them all in the same second, breaking the rate-spread the targetDeliveryWindow exists to provide - on past deadline, calculateDeliveryTimes now respreads remaining batches over a fresh window of the same size starting from now; sendBatches uses the same condition (targetDeliveryWindow configured) to gate whether to apply deliveryTimes at all, instead of gating on the original deadline being in the future - this also improves the non-resume edge case where the job system delayed a send past its original window (rare); previously those sends also went out instantaneously; now they respect the spread

ref https://linear.app/tryghost/issue/BER-3589 - onShutdown logs the number of in-flight sendBatches awaited and emits a bookend "drain complete" line on exit, making it easy to confirm from logs that the cleanup task waited for in-flight Mailgun calls before exit - sendBatches emits a per-email completion summary listing succeeded vs total batches and unstarted count, so a partial-success case (resume hit an orphan submitting batch) shows up as a single grep-able line - resumeInterruptedSends emits a single end-of-scan summary listing stale rows flipped to failed and fresh rows rescheduled, alongside the existing per-row warn breadcrumbs

9larsons changed the title ~~Stopped newsletter sends cleanly when the container shuts down~~ Recovered newsletter sends interrupted by container restarts May 11, 2026

9larsons added 3 commits May 11, 2026 18:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Recovered newsletter sends interrupted by container restarts#27822

Recovered newsletter sends interrupted by container restarts#27822
9larsons wants to merge 5 commits into
mainfrom
ber-3589-newsletter-send-recovery

9larsons commented May 11, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 11, 2026 •

edited

Loading

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

9larsons commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How it works

Known gaps and follow-ups

Test plan

Uh oh!

coderabbitai Bot commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

9larsons commented May 11, 2026 •

edited

Loading

coderabbitai Bot commented May 11, 2026 •

edited

Loading