Recovered newsletter sends interrupted by container restarts#27822

Draft
9larsons wants to merge 5 commits into main from ber-3589-newsletter-send-recovery

@9larsons 9larsons commented May 11, 2026

Summary

When a container restart (deploy, node reschedule, scale-down) sent SIGTERM mid-newsletter-send, the email row was left in status='submitting' with some EmailBatch rows submitted and others never attempted, and nothing recovered it automatically on the next boot. Operators had to recover manually via DB edits.

This branch closes the loop end-to-end: graceful shutdown leaves the row in a recoverable state, and the next boot scans for interrupted sends (bounded by age) and resumes them through the normal email-job path, respecting the original delivery-window pacing.

How it works

Shutdown side

  • BatchSendingService.onShutdown() is registered as a cleanup task on GhostServer. On SIGTERM:
    1. Flips #shuttingDown = true. The worker loop checks the flag at the top of every iteration, so no new batches get picked up after the signal.
    2. Awaits Promise.allSettled([...#inFlight]) — the set of live sendBatches promises. The cleanup chain blocks on in-flight Mailgun POSTs + EmailBatch DB writes before process.exit fires, so a request mid-call isn't cut off.
  • In-flight batches finish normally and get marked submitted. Unstarted batches stay in their current status.
  • When the worker queue still has unstarted batches after settle, sendBatches throws an InternalServerError with a SHUTDOWN_CODE sentinel. emailJob catches that specific code, logs an info breadcrumb, and returns without writing status='failed'. The parent email row stays in submitting.
  • No timeout on the in-flight await — if the K8s grace period elapses, accept SIGKILL. Timing out and exiting while a request is in flight is the failure mode we're preventing.
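
The drain described above can be sketched roughly as follows. This is a minimal illustration, not the code in this branch: the class name `DrainableSender`, the `_drainQueue` helper, and the `sendOne` callback are hypothetical, and only `onShutdown`, `#shuttingDown`, and `#inFlight` are names taken from the description.

```javascript
// Hypothetical sketch of the shutdown drain; not Ghost's actual implementation.
class DrainableSender {
    #shuttingDown = false;
    #inFlight = new Set();

    // Starts a send and registers its promise so onShutdown() can await it.
    sendBatches(queue, sendOne) {
        const run = this._drainQueue(queue, sendOne);
        this.#inFlight.add(run);
        const untrack = () => this.#inFlight.delete(run);
        run.then(untrack, untrack);
        return run;
    }

    // Worker loop: the flag is checked at the top of every iteration,
    // so a batch that has already started is allowed to finish.
    async _drainQueue(queue, sendOne) {
        const sent = [];
        while (queue.length > 0) {
            if (this.#shuttingDown) {
                break;
            }
            const batch = queue.shift();
            await sendOne(batch);
            sent.push(batch);
        }
        return {sent, unstarted: [...queue]};
    }

    // Cleanup task: stop picking up new work, then wait (with no timeout)
    // for every live send to settle.
    async onShutdown() {
        this.#shuttingDown = true;
        await Promise.allSettled([...this.#inFlight]);
    }
}
```

In this sketch the unstarted batches are returned to the caller; in the branch itself that situation is reported by throwing the SHUTDOWN_CODE error instead.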

Resume side

  • On boot, initBackgroundServices calls emailService.service.resumeInterruptedSends() before activitypub.init so an activitypub failure can't disable recovery. The boot site has its own try/catch + logging.error.
  • The scanner runs two queries against emails WHERE status='submitting':
    1. Stale rows (created_at < now - maxAge): flipped to failed via updateStatusLock so they surface in the admin UI for operator review. Default cutoff is 7 days, override via bulkEmail:resumeMaxAgeMs config.
    2. Fresh rows (created_at > now - maxAge): iterated sequentially, each in its own try/catch. For each row:
      • Loads the post. If status isn't published or sent (post was unpublished/deleted), flips email to failed via updateStatusLock.
      • Otherwise flips email to pending via updateStatusLock (required to satisfy emailJob's ['pending', 'failed'] precondition), logs a structured breadcrumb (email_id, post_id, batch_counts_by_status, ms_since_last_status_write, target_delivery_window_ms), then calls scheduleEmail.
  • The resumed emailJob runs the normal sendBatches path. Already-submitted batches short-circuit and skip Mailgun (see the short-circuit fix below); pending batches send normally.
  • sendBatch short-circuit distinguishes between two cases that the old code collapsed:
    • Batch status submitted: Mailgun accepted it on a prior run → return true (success, skip), log info.
    • Batch status submitting: orphan from a crashed worker — Mailgun-side state unknown → return false, log error. The parent email is then promoted to failed so an operator can reconcile against the Mailgun dashboard before retrying. Previously both cases returned true, silently laundering partial sends as success.
  • retryEmail now throws a 400 BadRequestError when the email's status isn't failed, instead of silently passing through to scheduleEmail. The internal callsite in post-email-handler.js is already pre-gated to failed, so this only changes behavior for misuse via the admin API.
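
The per-row decision the scanner makes can be summarised in a small pure function. This is a sketch under stated assumptions: `decideResumeAction`, its argument shapes, and the `reason` strings are hypothetical, and a row whose age is exactly `maxAgeMs` is treated as fresh here, which the description above does not specify.

```javascript
// Hypothetical sketch of the scanner's per-row decision; not the branch's code.
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000; // default cutoff from the description

function decideResumeAction(email, post, now, maxAgeMs = SEVEN_DAYS_MS) {
    const cutoff = now - maxAgeMs;

    // Stale rows are surfaced for operator review instead of being resumed.
    if (email.createdAt < cutoff) {
        return {status: 'failed', reason: 'stale'};
    }

    // A missing post, or one that is no longer published/sent, is not sendable.
    const sendable = Boolean(post) && ['published', 'sent'].includes(post.status);
    if (!sendable) {
        return {status: 'failed', reason: 'post-not-sendable'};
    }

    // Fresh and sendable: flip to pending so the normal email job accepts it.
    return {status: 'pending', reason: 'resume'};
}

// Usage with illustrative timestamps (milliseconds since epoch).
const now = Date.UTC(2026, 4, 11);
const oneDayAgo = now - 24 * 60 * 60 * 1000;
console.log(decideResumeAction({createdAt: oneDayAgo}, {status: 'published'}, now));
```

Keeping the decision separate from the database writes makes each branch easy to unit test, which matches the per-branch coverage listed in the test plan.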

Known gaps and follow-ups

These are explicitly not addressed in this branch. Flagged so a deployer can think about them.

  1. No upper bound on iteration count or time. Scanner does findAll with no limit and iterates sequentially. On a site with thousands of stuck rows within the 7-day cutoff (unlikely after this lands, but possible during the transition), it can block initBackgroundServices for minutes and load every row into memory. A follow-up could add a LIMIT + a "scanner truncated, run again" warning.
  2. Newsletter status isn't re-checked at resume time. The scanner gates on post.status but not newsletter.status. If a newsletter was archived between the original send attempt and the resume, the resumed send goes out anyway. The right fix is a parallel newsletter.status === 'active' check in the scanner before flipping the email to pending.
  3. createBatches crash mid-flight. If the original worker crashed between flipping the email status to submitting and createBatches completing, the email_recipients table can be partially populated; the resumed send may hit a duplicate-key error on (email_id, member_email) or send partial batches. Pre-existing failure mode, not introduced by this change.

Test plan

  • Unit: cd ghost/core && pnpm test:single test/unit/server/services/email-service/ — 380 passing. Covers each branch in isolation: shutdown drain, queue-non-empty throws, in-flight tracking; scanner happy path, bad row doesn't skip others, post-not-sendable marks failed, stale row flipped to failed, config override; sendBatch short-circuit cases; retryEmail 400; calculateDeliveryTimes respread on past deadline.
  • Integration: cd ghost/core && pnpm test:integration --grep "Resume interrupted sends" — 3 passing. Loads a real DB fixture with mixed-status batches, calls the scanner directly, asserts final state (clean resume, orphan submitting → email failed, post-not-published → email failed).
  • Lint: cd ghost/core && pnpm lint:server — clean.
  • Manual smoke test (required before merging out of draft):
    1. pnpm dev + pnpm reset:data
    2. Trigger a newsletter to all members so it splits into 2+ batches
    3. docker compose kill -s SIGTERM ghost while the send is in flight
    4. Verify the email row is in submitting (not failed)
    5. docker compose up -d ghost
    6. Watch boot logs for Email resume: scheduling …; verify the email completes to submitted and no member receives a duplicate

ref https://linear.app/tryghost/issue/BER-3589

ref https://linear.app/tryghost/issue/BER-3589

- container restarts (deploys, node rescheduling) send SIGTERM with a ~60s
  grace period; in-flight newsletter sends were being force-killed mid-batch,
  leaving the email row stuck in `submitting` with some batches sent and
  some never attempted
- BatchSendingService now registers an onShutdown cleanup task on ghost-server;
  the task flips a flag so workers stop picking up new batches, and awaits
  in-flight sendBatches so the cleanup chain blocks on Mailgun requests and
  EmailBatch DB writes before process.exit fires (otherwise mid-flight
  requests would be cut off and produce duplicate sends on resume)
- when the worker queue still has unstarted batches at shutdown, sendBatches
  throws an InternalServerError with a SHUTDOWN_CODE sentinel; emailJob
  catches that specific code and returns without writing status=failed,
  leaving the row in `submitting` so a future restart can resume it
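
A rough sketch of the sentinel handling described in this commit message follows. The value of `SHUTDOWN_CODE`, the local `InternalServerError` stand-in, the fields on the `email` object, and the simplified `sendBatches` are all assumptions made for illustration; only the names `SHUTDOWN_CODE`, `InternalServerError`, `sendBatches`, and `emailJob` come from the text.

```javascript
// Hypothetical sentinel value; the commit only names the constant.
const SHUTDOWN_CODE = 'EMAIL_SEND_SHUTDOWN';

// Minimal stand-in for the error class, so the sketch is self-contained.
class InternalServerError extends Error {
    constructor({message, code}) {
        super(message);
        this.code = code;
    }
}

// Simplified stand-in for sendBatches: throws the sentinel when batches
// were left unstarted at shutdown, or a plain error for a real failure.
async function sendBatches(email) {
    if (email.unstartedAfterShutdown > 0) {
        throw new InternalServerError({
            message: 'Shutdown requested with unstarted batches',
            code: SHUTDOWN_CODE
        });
    }
    if (email.providerDown) {
        throw new Error('Provider unavailable');
    }
}

async function emailJob(email) {
    email.status = 'submitting';
    try {
        await sendBatches(email);
        email.status = 'submitted';
    } catch (err) {
        if (err.code === SHUTDOWN_CODE) {
            // Interrupted, not failed: leave the row in submitting for the next boot.
            console.info(`Email ${email.id} interrupted by shutdown; leaving status as submitting`);
            return;
        }
        email.status = 'failed';
    }
}
```

Matching on a specific code rather than on the error class keeps every other InternalServerError on the existing failed path.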

coderabbitai Bot commented May 11, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

ref https://linear.app/tryghost/issue/BER-3589

- on container boot, scans for newsletter emails left in status=submitting
  from a previous container's interrupted send and resumes them through the
  normal emailJob path (flips to pending if the post is still published, or
  failed if the post was unpublished/deleted while the send was in flight)
- per-email try/catch with logging.error so one bad row does not skip the
  others; runs unconditionally before activitypub.init so an activitypub
  failure cannot disable recovery
- fixes sendBatch short-circuit accounting: when updateStatusLock returns
  undefined, distinguish between batch already submitted on a prior run
  (return true, expected resume path) and orphan submitting batch from a
  crashed worker (return false so the parent email promotes to failed and
  an operator can reconcile against Mailgun before retrying); previously
  both cases returned true, silently laundering partial sends as success
- tightens retryEmail to throw a 400 BadRequestError when the email's
  status is not failed, instead of silently passing through to
  scheduleEmail; the internal callsite in post-email-handler.js is already
  pre-gated to failed, so this only changes behavior for misuse via the
  admin API
- pairs with the shutdown side already on this branch to close the
  newsletter-send-recovery loop
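
The short-circuit fix in this commit might look roughly like the sketch below. The dependency-injected `acquireLock`, `submitToProvider`, and `log` parameters are hypothetical stand-ins for updateStatusLock, the Mailgun call, and Ghost's logger; the assumption that a failed lock yields a falsy value follows the "returns undefined" wording above.

```javascript
// Hypothetical sketch of the sendBatch short-circuit; not the branch's code.
async function sendBatch(batch, {acquireLock, submitToProvider, log}) {
    const locked = await acquireLock(batch);

    if (!locked) {
        if (batch.status === 'submitted') {
            // Accepted by the provider on a prior run: count as success, skip the send.
            log.info(`Batch ${batch.id} already submitted, skipping`);
            return true;
        }
        // Orphan left in submitting by a crashed worker: provider-side state is
        // unknown, so report failure and let the parent email become failed.
        log.error(`Batch ${batch.id} is in ${batch.status}; needs reconciliation`);
        return false;
    }

    await submitToProvider(batch);
    batch.status = 'submitted';
    return true;
}
```

Returning false for the orphan case is what allows the parent email to be promoted to failed rather than being reported as a complete send.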
@9larsons 9larsons changed the title Stopped newsletter sends cleanly when the container shuts down Recovered newsletter sends interrupted by container restarts May 11, 2026
9larsons added 3 commits May 11, 2026 18:44
ref https://linear.app/tryghost/issue/BER-3589

- exercises resumeInterruptedSends() against a real DB with mocked Mailgun;
  the unit tests fake updateStatusLock's return value, so they can't prove
  the load-bearing claim that the sendBatch short-circuit cooperates with
  the real lock when a batch is left in submitting status
- three scenarios: clean resume (one batch already submitted, one batch
  pending — Mailgun called once for the pending batch only, email
  re-promotes to submitted); orphan submitting batch (parent email
  promotes to failed, orphan row preserved for operator reconciliation,
  no Mailgun calls); post-no-longer-published (email flipped to failed,
  no resume attempted, no Mailgun calls)
- new file outside the existing describe.skip batch-sending.test.js suite
  so it actually runs in CI
…h window

ref https://linear.app/tryghost/issue/BER-3589

- adds a max-age cutoff (default 7 days, override via bulkEmail:resumeMaxAgeMs
  config) so the boot scanner does not pick up emails that have been stuck in
  submitting from prior incidents long enough that the content is stale; rows
  beyond the cutoff are flipped to failed via updateStatusLock and surface in
  the admin UI for operator review instead of being silently resumed and
  sending old newsletters to current members
- fixes the delivery-window behavior on resume: previously, if the original
  created_at + targetDeliveryWindow deadline had passed (the case for almost
  every resume), calculateDeliveryTimes returned undefined for every batch
  and Mailgun fired them all in the same second, breaking the rate-spread
  the targetDeliveryWindow exists to provide
- on past deadline, calculateDeliveryTimes now respreads remaining batches
  over a fresh window of the same size starting from now; sendBatches uses
  the same condition (targetDeliveryWindow configured) to gate whether to
  apply deliveryTimes at all, instead of gating on the original deadline
  being in the future
- this also improves the non-resume edge case where the job system delayed
  a send past its original window (rare); previously those sends also went
  out instantaneously; now they respect the spread
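
The respread behaviour described in this commit can be illustrated with a simplified function. This is a sketch only: the parameter names, the even spacing, and the use of numeric millisecond timestamps are assumptions, and the real calculateDeliveryTimes may differ in signature and detail.

```javascript
// Hypothetical sketch of delivery-time spreading with a respread on past deadline.
function calculateDeliveryTimes({createdAt, now, windowMs, batchCount}) {
    // No window configured: no scheduled delivery times at all.
    if (!windowMs) {
        return new Array(batchCount).fill(undefined);
    }

    const deadline = createdAt + windowMs;

    // If the original deadline is still ahead, spread over the time that is left.
    // If it has passed (the usual case on resume), spread over a fresh window
    // of the same size starting from now.
    const spreadMs = deadline > now ? deadline - now : windowMs;
    const spacing = spreadMs / batchCount;

    return Array.from({length: batchCount}, (_, i) => now + i * spacing);
}

// Resume case: the original 4-second window ended 6 seconds ago.
console.log(calculateDeliveryTimes({createdAt: 0, now: 10000, windowMs: 4000, batchCount: 4}));
// prints [ 10000, 11000, 12000, 13000 ]
```

Gating on whether a window is configured, rather than on whether the original deadline is still in the future, is what keeps resumed sends paced.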
ref https://linear.app/tryghost/issue/BER-3589

- onShutdown logs the number of in-flight sendBatches awaited and emits a
  bookend "drain complete" line on exit, making it easy to confirm from logs
  that the cleanup task waited for in-flight Mailgun calls before exit
- sendBatches emits a per-email completion summary listing succeeded vs
  total batches and unstarted count, so a partial-success case (resume hit
  an orphan submitting batch) shows up as a single grep-able line
- resumeInterruptedSends emits a single end-of-scan summary listing stale
  rows flipped to failed and fresh rows rescheduled, alongside the existing
  per-row warn breadcrumbs