Skip to content

Operations Guide

SametGoktepe edited this page Jun 17, 2026 · 1 revision

Everything that matters for running eventferry in production: rollout checklist, scaling many relays, retention, monitoring, common failure modes.

This page assumes you've read Core Concepts and have a working setup. It's not a tutorial — it's the production runbook.


Pre-launch checklist

Tick before traffic hits the relay in production.

Database

  • Outbox table migrated with createMigrationSql. Indexes match what the package ships (verify with \d outbox on pg, SHOW INDEXES FROM outbox on mysql).
  • claimTimeoutMs larger than your slowest typical publish — measure on staging, set 2–3× the p99.
  • purgeDone scheduled (cron, sidecar, or startup task). Default olderThanMs: 7 days is safe; set lower if your table grows fast.
  • Database has headroom — outbox writes add latency to your business commits. Budget 100–300μs per enqueue under load.
  • (Postgres) wal_level=logical set IF using PostgresStreamingRelay. Skip if polling-only.
  • (MySQL) Binlog enabled (log-bin, binlog-format=ROW, binlog-row-image=FULL) IF using MysqlBinlogRelay. Skip if polling-only.

Publisher

  • clientId set to a meaningful service name (broker-side logs are useless with "eventferry" as the ID).
  • acks: -1 (default). Never run production with acks: 0.
  • idempotent: true (default). No reason to disable outside debugging.
  • validateTopicsOnConnect populated with every topic the relay will publish to. Catches "topic doesn't exist" at startup instead of per-record later.
  • DLQ topics provisioned (same partition count and retention as the source — DLQ traffic is usually < 0.01% of source, but you want headroom for incidents).
  • (Transactional) transactionalId derived from per-instance runtime context. See Transactions and EOS for the multi-instance rules.
  • (Transactional) autoRecoverFromFence decision matches your instance count. Single instance → safe to enable. Multiple instances → leave off.

Observability

  • Logger plumbed into every relay and publisher.
  • onBatchClaimed, onBatchPublished, onFailed metrics flowing.
  • OTel tracing wired if you're already running OTel. Producer span + tracer.inject for end-to-end traces.
  • /healthz endpoint backed by publisher.healthCheck().
  • Alert on onFailed(record, err, willRetry: false) — that's DLQ-rate. Rate of change, not absolute count.
  • (Confluent driver) onStats wired if you want queue-depth / broker-latency dashboards.

Schema / contract

  • defineOutbox(registry) if your producer + consumer share a monorepo. Same registry both sides.
  • Schema Registry serializer + autoRegister: false if you're on Confluent Cloud production.

Scaling: run many relays

The store's SKIP LOCKED claim query + strict head-of-aggregate ordering let you run N relays concurrently without coordination. Common patterns:

Active-active (default)

Run K replicas of the same relay deployment. Each independently calls store.claimBatch; rows are partitioned across them at claim time. Scales linearly until the database's claim query becomes the bottleneck (usually around K = number of CPU cores on the DB host).

┌──────────┐   ┌──────────┐   ┌──────────┐
│  Relay-A │   │  Relay-B │   │  Relay-C │   ← N replicas
└─────┬────┘   └─────┬────┘   └─────┬────┘
      │              │              │
      └──────────────┼──────────────┘
                     ▼
              ┌────────────┐
              │  Postgres  │ ← SKIP LOCKED handles concurrency
              └────────────┘

No leader election, no coordinator, no shared state. Each relay is independent.

Dedicated retry-only relay

Run one extra relay with claimFailedOnly: true. It picks up rows from the failed queue without competing with the pending-traffic relays for the hot path.

const retryRelay = new Relay({
  store: new PostgresStore({ pool, table: "outbox", claimFailedOnly: true }),
  publisher,
  pollIntervalMs: 5_000,                  // retries don't need sub-second latency
  batchSize: 50,
});

Useful when you have a sustained retry backlog from a recent outage — segregates the recovery work so it doesn't slow down new events.

Shard by aggregate type

For massive workloads, run separate relays per aggregate domain. Each has its own outbox table (outbox_orders, outbox_users, outbox_payments) and its own publisher pool.

const ordersRelay = new Relay({
  store: new PostgresStore({ pool, table: "outbox_orders" }),
  publisher: ordersPublisher,
});

const usersRelay = new Relay({
  store: new PostgresStore({ pool, table: "outbox_users" }),
  publisher: usersPublisher,
});

Each domain scales independently. Trade-off: more moving parts, more deploys to coordinate.


Retention

Run store.purgeDone periodically. eventferry has no built-in scheduler — pick a mechanism that fits your stack.

Cron sidecar

// runs every hour as a separate process
import { PostgresStore } from "@eventferry/postgres";

const store = new PostgresStore({ pool, table: "outbox" });
const ONE_HOUR = 60 * 60 * 1000;

setInterval(async () => {
  const deleted = await store.purgeDone({
    olderThanMs: 7 * 24 * ONE_HOUR,
    batchSize: 1_000,
  });
  metrics.counter("outbox.purged", deleted);
}, ONE_HOUR);

Inline at relay startup

await store.purgeDone({ olderThanMs: 7 * 24 * 60 * 60 * 1000, batchSize: 1_000 });
await relay.start();

Fine for small tables. For tables > 1M rows, run as a sidecar so the relay doesn't wait on cleanup.

Partitioning instead

For very high-volume outboxes (> 10M rows / day), the right answer is partition the outbox table by processed_at month and drop old partitions. eventferry doesn't ship this — Postgres declarative partitioning + a small ALTER cron does it natively, without changing the schema eventferry expects.


Monitoring

Minimum signal set every production deployment should have:

Signal Source Alert on
Outbox depth (pending rows) SELECT count(*) FROM outbox WHERE status = 0 Sustained growth (claim < publish rate)
Reaper rate Per-tick row count where claimed_at is past timeout Any non-zero baseline (means a relay crashed)
DLQ rate onFailed(record, err, willRetry: false) count Rate of change
Publish p99 latency onBatchPublished minus claim time > 5× baseline
Connection failures onError count classified as KafkaJSConnectionError Sustained > 0
Producer fence rate onProducerFenced count Any non-zero → transactionalId collision investigation

Alerting recipes

// 1. Outbox depth — sustained growth alert
new Relay({
  store,
  publisher,
  hooks: {
    onBatchClaimed: (n) => metrics.gauge("outbox.claim_size", n),
  },
});

// Out-of-band SQL probe:
setInterval(async () => {
  const { rows } = await pool.query(`SELECT count(*)::int AS n FROM outbox WHERE status = 0`);
  metrics.gauge("outbox.pending_depth", rows[0].n);
}, 30_000);

// Alert rule (prom):
// alert: OutboxBacklog
// expr: outbox_pending_depth > 10000 for 5m
// → Either publish rate dropped (broker outage) or business throughput spiked.

Graceful shutdown

const relay = new Relay({ store, publisher });
await relay.start();

process.on("SIGTERM", async () => {
  log.info("SIGTERM received — draining");
  await relay.stop();                     // wait for in-flight batch to finish
  await publisher.disconnect();
  await pool.end();
  process.exit(0);
});

relay.stop() is drain-aware: it sets a stop flag and waits for the in-flight batch to call markDone / markFailed before returning. Rows claimed but not yet acked become reapable by another replica after claimTimeoutMs.

Container orchestrators usually send SIGTERM, wait terminationGracePeriodSeconds (default 30s on k8s), then SIGKILL. Set this above your claimTimeoutMs + batchSize × avg-publish-time so a normal shutdown completes inside the grace window.


Failure modes & responses

Symptom Likely cause Response
Outbox depth growing, publish rate steady Broker outage upstream Check broker logs; relays will catch up when broker recovers.
Outbox depth growing, no publishes happening Relay crashed without restart Check process supervisor; restart the relay.
Steady DLQ flow Misconfigured topic, auth issue, or schema mismatch Triage by dlq-error-class. See Dead-Letter Queue.
Reaper repeatedly reclaiming the same row Slow batch publish > claimTimeoutMs Increase claimTimeoutMs or reduce batchSize.
Producer fences in a loop Two instances with same transactionalId Make transactionalId unique per instance. See Transactions and EOS.
validateTopicsOnConnect fails after deploy Topic not provisioned, name typo Either provision the topic (Terraform) or fix the relay config.

What's next

Clone this wiki locally