Operations Guide

Everything that matters for running eventferry in production: rollout checklist, scaling many relays, retention, monitoring, common failure modes.

This page assumes you've read Core Concepts and have a working setup. It's not a tutorial — it's the production runbook.

Pre-launch checklist

Tick before traffic hits the relay in production.

Database

Outbox table migrated with createMigrationSql. Indexes match what the package ships (verify with \d outbox on pg, SHOW INDEXES FROM outbox on mysql).
claimTimeoutMs larger than your slowest typical publish — measure on staging, set 2–3× the p99.
purgeDone scheduled (cron, sidecar, or startup task). Default olderThanMs: 7 days is safe; set lower if your table grows fast.
Database has headroom — outbox writes add latency to your business commits. Budget 100–300μs per enqueue under load.
(Postgres) wal_level=logical set IF using PostgresStreamingRelay. Skip if polling-only.
(MySQL) Binlog enabled (log-bin, binlog-format=ROW, binlog-row-image=FULL) IF using MysqlBinlogRelay. Skip if polling-only.

Publisher

clientId set to a meaningful service name (broker-side logs are useless with "eventferry" as the ID).
acks: -1 (default). Never run production with acks: 0.
idempotent: true (default). No reason to disable outside debugging.
validateTopicsOnConnect populated with every topic the relay will publish to. Catches "topic doesn't exist" at startup instead of per-record later.
DLQ topics provisioned (same partition count and retention as the source — DLQ traffic is usually < 0.01% of source, but you want headroom for incidents).
(Transactional) transactionalId derived from per-instance runtime context. See Transactions and EOS for the multi-instance rules.
(Transactional) autoRecoverFromFence decision matches your instance count. Single instance → safe to enable. Multiple instances → leave off.

Observability

Logger plumbed into every relay and publisher.
onBatchClaimed, onBatchPublished, onFailed metrics flowing.
OTel tracing wired if you're already running OTel. Producer span + tracer.inject for end-to-end traces.
/healthz endpoint backed by publisher.healthCheck().
Alert on onFailed(record, err, willRetry: false) — that's DLQ-rate. Rate of change, not absolute count.
(Confluent driver) onStats wired if you want queue-depth / broker-latency dashboards.

Schema / contract

defineOutbox(registry) if your producer + consumer share a monorepo. Same registry both sides.
Schema Registry serializer + autoRegister: false if you're on Confluent Cloud production.

Scaling: run many relays

The store's SKIP LOCKED claim query + strict head-of-aggregate ordering let you run N relays concurrently without coordination. Common patterns:

Active-active (default)

Run K replicas of the same relay deployment. Each independently calls store.claimBatch; rows are partitioned across them at claim time. Scales linearly until the database's claim query becomes the bottleneck (usually around K = number of CPU cores on the DB host).

┌──────────┐   ┌──────────┐   ┌──────────┐
│  Relay-A │   │  Relay-B │   │  Relay-C │   ← N replicas
└─────┬────┘   └─────┬────┘   └─────┬────┘
      │              │              │
      └──────────────┼──────────────┘
                     ▼
              ┌────────────┐
              │  Postgres  │ ← SKIP LOCKED handles concurrency
              └────────────┘

No leader election, no coordinator, no shared state. Each relay is independent.

Dedicated retry-only relay

Run one extra relay with claimFailedOnly: true. It picks up rows from the failed queue without competing with the pending-traffic relays for the hot path.

const retryRelay = new Relay({
  store: new PostgresStore({ pool, table: "outbox", claimFailedOnly: true }),
  publisher,
  pollIntervalMs: 5_000,                  // retries don't need sub-second latency
  batchSize: 50,
});

Useful when you have a sustained retry backlog from a recent outage — segregates the recovery work so it doesn't slow down new events.

Shard by aggregate type

For massive workloads, run separate relays per aggregate domain. Each has its own outbox table (outbox_orders, outbox_users, outbox_payments) and its own publisher pool.

const ordersRelay = new Relay({
  store: new PostgresStore({ pool, table: "outbox_orders" }),
  publisher: ordersPublisher,
});

const usersRelay = new Relay({
  store: new PostgresStore({ pool, table: "outbox_users" }),
  publisher: usersPublisher,
});

Each domain scales independently. Trade-off: more moving parts, more deploys to coordinate.

Retention

Run store.purgeDone periodically. eventferry has no built-in scheduler — pick a mechanism that fits your stack.

Cron sidecar

// runs every hour as a separate process
import { PostgresStore } from "@eventferry/postgres";

const store = new PostgresStore({ pool, table: "outbox" });
const ONE_HOUR = 60 * 60 * 1000;

setInterval(async () => {
  const deleted = await store.purgeDone({
    olderThanMs: 7 * 24 * ONE_HOUR,
    batchSize: 1_000,
  });
  metrics.counter("outbox.purged", deleted);
}, ONE_HOUR);

Inline at relay startup

await store.purgeDone({ olderThanMs: 7 * 24 * 60 * 60 * 1000, batchSize: 1_000 });
await relay.start();

Fine for small tables. For tables > 1M rows, run as a sidecar so the relay doesn't wait on cleanup.

Partitioning instead

For very high-volume outboxes (> 10M rows / day), the right answer is partition the outbox table by processed_at month and drop old partitions. eventferry doesn't ship this — Postgres declarative partitioning + a small ALTER cron does it natively, without changing the schema eventferry expects.

Monitoring

Minimum signal set every production deployment should have:

Signal	Source	Alert on
Outbox depth (pending rows)	`SELECT count(*) FROM outbox WHERE status = 0`	Sustained growth (claim < publish rate)
Reaper rate	Per-tick row count where `claimed_at` is past timeout	Any non-zero baseline (means a relay crashed)
DLQ rate	`onFailed(record, err, willRetry: false)` count	Rate of change
Publish p99 latency	`onBatchPublished` minus claim time	> 5× baseline
Connection failures	`onError` count classified as `KafkaJSConnectionError`	Sustained > 0
Producer fence rate	`onProducerFenced` count	Any non-zero → transactionalId collision investigation

Alerting recipes

// 1. Outbox depth — sustained growth alert
new Relay({
  store,
  publisher,
  hooks: {
    onBatchClaimed: (n) => metrics.gauge("outbox.claim_size", n),
  },
});

// Out-of-band SQL probe:
setInterval(async () => {
  const { rows } = await pool.query(`SELECT count(*)::int AS n FROM outbox WHERE status = 0`);
  metrics.gauge("outbox.pending_depth", rows[0].n);
}, 30_000);

// Alert rule (prom):
// alert: OutboxBacklog
// expr: outbox_pending_depth > 10000 for 5m
// → Either publish rate dropped (broker outage) or business throughput spiked.

Graceful shutdown

const relay = new Relay({ store, publisher });
await relay.start();

process.on("SIGTERM", async () => {
  log.info("SIGTERM received — draining");
  await relay.stop();                     // wait for in-flight batch to finish
  await publisher.disconnect();
  await pool.end();
  process.exit(0);
});

relay.stop() is drain-aware: it sets a stop flag and waits for the in-flight batch to call markDone / markFailed before returning. Rows claimed but not yet acked become reapable by another replica after claimTimeoutMs.

Container orchestrators usually send SIGTERM, wait terminationGracePeriodSeconds (default 30s on k8s), then SIGKILL. Set this above your claimTimeoutMs + batchSize × avg-publish-time so a normal shutdown completes inside the grace window.

Failure modes & responses

Symptom	Likely cause	Response
Outbox depth growing, publish rate steady	Broker outage upstream	Check broker logs; relays will catch up when broker recovers.
Outbox depth growing, no publishes happening	Relay crashed without restart	Check process supervisor; restart the relay.
Steady DLQ flow	Misconfigured topic, auth issue, or schema mismatch	Triage by `dlq-error-class`. See Dead-Letter Queue.
Reaper repeatedly reclaiming the same row	Slow batch publish > `claimTimeoutMs`	Increase `claimTimeoutMs` or reduce `batchSize`.
Producer fences in a loop	Two instances with same `transactionalId`	Make `transactionalId` unique per instance. See Transactions and EOS.
`validateTopicsOnConnect` fails after deploy	Topic not provisioned, name typo	Either provision the topic (Terraform) or fix the relay config.

What's next

Troubleshooting recipes for specific symptoms → Troubleshooting
Migration playbook between versions → Migrations and Upgrades
API reference for every option mentioned above → API Reference

Repository · Issues · npm: @eventferry/all · MIT

Home

Get going

Adapters

Type & schema

Security

Operational

Operations

Reference

Uh oh!

Operations Guide

Pre-launch checklist

Database

Publisher

Observability

Schema / contract

Scaling: run many relays

Active-active (default)

Dedicated retry-only relay

Shard by aggregate type

Retention

Cron sidecar

Inline at relay startup

Partitioning instead

Monitoring

Alerting recipes

Graceful shutdown

Failure modes & responses

What's next

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally