Operations Guide
Everything that matters for running eventferry in production: rollout checklist, scaling many relays, retention, monitoring, common failure modes.
This page assumes you've read Core Concepts and have a working setup. It's not a tutorial — it's the production runbook.
Tick every item before traffic hits the relay in production.
- Outbox table migrated with createMigrationSql. Indexes match what the package ships (verify with \d outbox on pg, SHOW INDEXES FROM outbox on mysql).
- claimTimeoutMs larger than your slowest typical publish: measure on staging, set 2–3× the p99.
- purgeDone scheduled (cron, sidecar, or startup task). Default olderThanMs: 7 days is safe; set lower if your table grows fast.
- Database has headroom: outbox writes add latency to your business commits. Budget 100–300μs per enqueue under load.
- (Postgres) wal_level=logical set IF using PostgresStreamingRelay. Skip if polling-only.
- (MySQL) Binlog enabled (log-bin, binlog-format=ROW, binlog-row-image=FULL) IF using MysqlBinlogRelay. Skip if polling-only.

- clientId set to a meaningful service name (broker-side logs are useless with "eventferry" as the ID).
- acks: -1 (default). Never run production with acks: 0.
- idempotent: true (default). No reason to disable outside debugging.
- validateTopicsOnConnect populated with every topic the relay will publish to. Catches "topic doesn't exist" at startup instead of per-record later.
- DLQ topics provisioned (same partition count and retention as the source; DLQ traffic is usually < 0.01% of source, but you want headroom for incidents).
- (Transactional) transactionalId derived from per-instance runtime context. See Transactions and EOS for the multi-instance rules.
- (Transactional) autoRecoverFromFence decision matches your instance count. Single instance → safe to enable. Multiple instances → leave off.

- Logger plumbed into every relay and publisher.
- onBatchClaimed, onBatchPublished, onFailed metrics flowing.
- OTel tracing wired if you're already running OTel. Producer span + tracer.inject for end-to-end traces.
- /healthz endpoint backed by publisher.healthCheck().
- Alert on onFailed(record, err, willRetry: false): that's the DLQ rate. Alert on rate of change, not absolute count.
- (Confluent driver) onStats wired if you want queue-depth / broker-latency dashboards.

- defineOutbox(registry) if your producer + consumer share a monorepo. Same registry both sides.
- Schema Registry serializer + autoRegister: false if you're on Confluent Cloud production.
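
The claimTimeoutMs item in the checklist above (2–3× the p99 of publish latency) can be turned into a small calculation. The following is a minimal sketch, assuming you have collected publish durations in milliseconds on staging. `percentile` and `suggestClaimTimeoutMs` are hypothetical helpers, not part of eventferry.

```typescript
// Hypothetical helpers for sizing claimTimeoutMs; not part of eventferry.

/** Nearest-rank percentile of a non-empty sample (p in (0, 100]). */
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((x, y) => x - y);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

/** Suggest claimTimeoutMs as `factor` times the p99 of observed publish durations. */
function suggestClaimTimeoutMs(publishDurationsMs: number[], factor = 3): number {
  return Math.ceil(percentile(publishDurationsMs, 99) * factor);
}

// Example: 100 staging samples, mostly 40 ms with two slow outliers.
const samples: number[] = [...new Array<number>(98).fill(40), 900, 1_200];
const suggested = suggestClaimTimeoutMs(samples);
console.log(suggested); // 2700 (p99 is 900 ms, times 3)
```

A factor of 3 sits at the upper end of the 2–3× guidance; pass `2` as the second argument if reclaim latency after a crash matters more than tolerance for slow publishes.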
The store's SKIP LOCKED claim query + strict head-of-aggregate ordering let you run N relays concurrently without coordination. Common patterns:
Run K replicas of the same relay deployment. Each independently calls store.claimBatch; rows are partitioned across them at claim time. Scales linearly until the database's claim query becomes the bottleneck (usually around K = number of CPU cores on the DB host).
┌──────────┐   ┌──────────┐   ┌──────────┐
│ Relay-A  │   │ Relay-B  │   │ Relay-C  │   ← N replicas
└─────┬────┘   └─────┬────┘   └─────┬────┘
      │              │              │
      └──────────────┼──────────────┘
                     ▼
               ┌────────────┐
               │  Postgres  │   ← SKIP LOCKED handles concurrency
               └────────────┘
No leader election, no coordinator, no shared state. Each relay is independent.
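
To see why the replicas need no coordination, here is a toy in-memory model of the claim step. It is a sketch, not eventferry's claim query: `Row`, `claimBatch`, and the rule that only the oldest unfinished row of each aggregate is claimable are assumptions made to illustrate SKIP LOCKED combined with head-of-aggregate ordering.

```typescript
// Toy model of a SKIP LOCKED claim with head-of-aggregate ordering.
// All names are illustrative; this is not eventferry's implementation.
type Status = "pending" | "claimed" | "done";

interface Row {
  id: number;
  aggregateId: string;
  status: Status;
  claimedBy?: string;
}

function claimBatch(table: Row[], relayId: string, batchSize: number): Row[] {
  const claimed: Row[] = [];
  const seenAggregates = new Set<string>();
  for (const row of [...table].sort((x, y) => x.id - y.id)) {
    if (claimed.length >= batchSize) break;
    if (row.status === "done") continue;
    // The first unfinished row of an aggregate is its head; later rows must wait.
    if (seenAggregates.has(row.aggregateId)) continue;
    seenAggregates.add(row.aggregateId);
    // A head already claimed by another relay is "locked": skip it instead of blocking.
    if (row.status === "claimed") continue;
    row.status = "claimed";
    row.claimedBy = relayId;
    claimed.push(row);
  }
  return claimed;
}

const table: Row[] = [
  { id: 1, aggregateId: "order-1", status: "pending" },
  { id: 2, aggregateId: "order-1", status: "pending" },
  { id: 3, aggregateId: "order-2", status: "pending" },
  { id: 4, aggregateId: "order-3", status: "pending" },
];

const claimedByA = claimBatch(table, "relay-a", 2); // heads of order-1 and order-2
const claimedByB = claimBatch(table, "relay-b", 2); // those heads are locked, so only order-3
console.log(claimedByA.map((r) => r.id), claimedByB.map((r) => r.id)); // [ 1, 3 ] [ 4 ]
```

Row 2 stays pending until row 1 is done, so per-aggregate order holds even though the two relays never talk to each other.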
Run one extra relay with claimFailedOnly: true. It picks up rows from the failed queue without competing with the pending-traffic relays for the hot path.
const retryRelay = new Relay({
  store: new PostgresStore({ pool, table: "outbox", claimFailedOnly: true }),
  publisher,
  pollIntervalMs: 5_000, // retries don't need sub-second latency
  batchSize: 50,
});

Useful when you have a sustained retry backlog from a recent outage: it segregates the recovery work so it doesn't slow down new events.
For massive workloads, run separate relays per aggregate domain. Each has its own outbox table (outbox_orders, outbox_users, outbox_payments) and its own publisher pool.
const ordersRelay = new Relay({
  store: new PostgresStore({ pool, table: "outbox_orders" }),
  publisher: ordersPublisher,
});
const usersRelay = new Relay({
  store: new PostgresStore({ pool, table: "outbox_users" }),
  publisher: usersPublisher,
});

Each domain scales independently. Trade-off: more moving parts, more deploys to coordinate.
Run store.purgeDone periodically. eventferry has no built-in scheduler — pick a mechanism that fits your stack.
// Runs every hour as a separate process.
// Assumes `pool` and `metrics` are already set up in this process.
import { PostgresStore } from "@eventferry/postgres";
const store = new PostgresStore({ pool, table: "outbox" });
const ONE_HOUR = 60 * 60 * 1000;
setInterval(async () => {
  try {
    const deleted = await store.purgeDone({
      olderThanMs: 7 * 24 * ONE_HOUR,
      batchSize: 1_000,
    });
    metrics.counter("outbox.purged", deleted);
  } catch (err) {
    // Keep the timer alive; a failed purge is retried on the next tick.
    console.error("outbox purge failed", err);
  }
}, ONE_HOUR);

Alternatively, purge once at startup, before the relay starts:

await store.purgeDone({ olderThanMs: 7 * 24 * 60 * 60 * 1000, batchSize: 1_000 });
await relay.start();

Fine for small tables. For tables > 1M rows, run as a sidecar so the relay doesn't wait on cleanup.
For very high-volume outboxes (> 10M rows / day), the right answer is to partition the outbox table by processed_at month and drop old partitions. eventferry doesn't ship this, but Postgres declarative partitioning plus a small ALTER cron does it natively, without changing the schema eventferry expects.
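
The monthly partition rotation could be driven by a small DDL generator like the one below. This is a sketch under stated assumptions: the outbox table is already declared with PARTITION BY RANGE (processed_at) and has a DEFAULT partition for rows whose processed_at is still NULL. The helper names and the `outbox_YYYY_MM` naming scheme are hypothetical, and how this interacts with the indexes and constraints eventferry ships should be verified on staging first.

```typescript
// Hypothetical DDL generator for month-based outbox partitions; not part of eventferry.
// Assumes the parent table is partitioned BY RANGE (processed_at) with a DEFAULT partition.

const pad2 = (n: number): string => String(n).padStart(2, "0");

/** First day (UTC) of the month `offset` months away from (year, month 1-12). */
function monthStart(year: number, month: number, offset = 0): Date {
  return new Date(Date.UTC(year, month - 1 + offset, 1)); // Date.UTC normalises overflow
}

const isoDate = (d: Date): string => d.toISOString().slice(0, 10);

const partitionName = (table: string, d: Date): string =>
  `${table}_${d.getUTCFullYear()}_${pad2(d.getUTCMonth() + 1)}`;

/** DDL that creates the partition covering the given month. */
function createPartitionSql(table: string, year: number, month: number): string {
  const from = monthStart(year, month);
  const to = monthStart(year, month, 1);
  return (
    `CREATE TABLE IF NOT EXISTS ${partitionName(table, from)} PARTITION OF ${table} ` +
    `FOR VALUES FROM ('${isoDate(from)}') TO ('${isoDate(to)}');`
  );
}

/** DDL that drops the partition `keepMonths` months before the given month. */
function dropPartitionSql(table: string, year: number, month: number, keepMonths: number): string {
  return `DROP TABLE IF EXISTS ${partitionName(table, monthStart(year, month, -keepMonths))};`;
}

console.log(createPartitionSql("outbox", 2025, 12));
console.log(dropPartitionSql("outbox", 2025, 1, 3));
```

A cron job would run the create statement for next month ahead of time and the drop statement for whichever month has fallen out of the retention window. Dropping a partition is a metadata operation, which is why it scales better than batched deletes at this volume.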
Minimum signal set every production deployment should have:
| Signal | Source | Alert on |
|---|---|---|
| Outbox depth (pending rows) | SELECT count(*) FROM outbox WHERE status = 0 | Sustained growth (publish rate < enqueue rate) |
| Reaper rate | Per-tick row count where claimed_at is past timeout | Any non-zero baseline (means a relay crashed) |
| DLQ rate | onFailed(record, err, willRetry: false) count | Rate of change |
| Publish p99 latency | onBatchPublished minus claim time | > 5× baseline |
| Connection failures | onError count classified as KafkaJSConnectionError | Sustained > 0 |
| Producer fence rate | onProducerFenced count | Any non-zero → transactionalId collision investigation |
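
The DLQ-rate row alerts on rate of change rather than absolute count. One way to compute that in-process is a sliding-window counter compared against a baseline, sketched below. `RateWindow`, `dlqRateSpiked`, and the default factor of 5 are hypothetical choices for illustration, not eventferry APIs or recommended thresholds.

```typescript
// Hypothetical sliding-window counter for DLQ-rate alerting; not an eventferry API.
class RateWindow {
  private timestamps: number[] = [];
  private readonly windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  /** Record one event at time `now` (ms since epoch). */
  record(now: number): void {
    this.timestamps.push(now);
  }

  /** Number of events inside the window that ends at `now`. */
  count(now: number): number {
    const cutoff = now - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length;
  }
}

/** True when the current window holds more than `factor` times the baseline count. */
function dlqRateSpiked(current: number, baseline: number, factor = 5): boolean {
  // Floor the baseline at 1 so a quiet system doesn't alert on a single failure.
  return current > Math.max(baseline, 1) * factor;
}

// Simulated timeline: 1 failure in the first minute, 8 in the second.
const dlqWindow = new RateWindow(60_000);
dlqWindow.record(10_000);
const baselineCount = dlqWindow.count(60_000); // 1
for (let i = 0; i < 8; i++) dlqWindow.record(100_000 + i);
const currentCount = dlqWindow.count(120_000); // 8
console.log(dlqRateSpiked(currentCount, baselineCount)); // true
```

In a relay, `record(Date.now())` would be called from the onFailed hook whenever willRetry is false. If you already export the count to a metrics backend, a rate query there does the same job without in-process state.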
// 1. Outbox depth: sustained growth alert
new Relay({
  store,
  publisher,
  hooks: {
    onBatchClaimed: (n) => metrics.gauge("outbox.claim_size", n),
  },
});
// Out-of-band SQL probe:
setInterval(async () => {
  const { rows } = await pool.query(`SELECT count(*)::int AS n FROM outbox WHERE status = 0`);
  metrics.gauge("outbox.pending_depth", rows[0].n);
}, 30_000);
// Alert rule (prom):
//   alert: OutboxBacklog
//   expr: outbox_pending_depth > 10000
//   for: 5m
// → Either publish rate dropped (broker outage) or business throughput spiked.

// Graceful shutdown on SIGTERM:
const relay = new Relay({ store, publisher });
await relay.start();
process.on("SIGTERM", async () => {
  log.info("SIGTERM received, draining");
  await relay.stop(); // wait for in-flight batch to finish
  await publisher.disconnect();
  await pool.end();
  process.exit(0);
});

relay.stop() is drain-aware: it sets a stop flag and waits for the in-flight batch to call markDone / markFailed before returning. Rows claimed but not yet acked become reapable by another replica after claimTimeoutMs.
Container orchestrators usually send SIGTERM, wait terminationGracePeriodSeconds (default 30s on k8s), then SIGKILL. Set this above claimTimeoutMs + batchSize × avg-publish-time (converted from milliseconds to seconds) so a normal shutdown completes inside the grace window.
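
The grace-period rule of thumb is easy to get wrong because the relay options are in milliseconds and the Kubernetes setting is in seconds. A worked sketch follows; `minGracePeriodSeconds` is a hypothetical helper and the input numbers are made up for the example.

```typescript
// Hypothetical helper: smallest terminationGracePeriodSeconds that covers a drain,
// following claimTimeoutMs + batchSize × avg-publish-time and converting ms to s.
function minGracePeriodSeconds(
  claimTimeoutMs: number,
  batchSize: number,
  avgPublishMs: number,
): number {
  const drainMs = claimTimeoutMs + batchSize * avgPublishMs;
  return Math.ceil(drainMs / 1000) + 1; // +1 s so the value is strictly above the bound
}

// Example: 30 s claim timeout, batches of 100, about 20 ms per publish.
const graceSeconds = minGracePeriodSeconds(30_000, 100, 20);
console.log(graceSeconds); // 33
```

With these example numbers the Kubernetes default of 30 seconds would be too short, so terminationGracePeriodSeconds would need to be raised in the pod spec.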
| Symptom | Likely cause | Response |
|---|---|---|
| Outbox depth growing, publish rate steady | Broker outage upstream | Check broker logs; relays will catch up when broker recovers. |
| Outbox depth growing, no publishes happening | Relay crashed without restart | Check process supervisor; restart the relay. |
| Steady DLQ flow | Misconfigured topic, auth issue, or schema mismatch | Triage by dlq-error-class. See Dead-Letter Queue. |
| Reaper repeatedly reclaiming the same row | Slow batch publish > claimTimeoutMs | Increase claimTimeoutMs or reduce batchSize. |
| Producer fences in a loop | Two instances with same transactionalId | Make transactionalId unique per instance. See Transactions and EOS. |
| validateTopicsOnConnect fails after deploy | Topic not provisioned, name typo | Either provision the topic (Terraform) or fix the relay config. |
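
For the producer-fence row, one way to make transactionalId unique per instance is to derive it from an instance identity in the environment. The sketch below is an assumption-laden illustration: `deriveTransactionalId` is hypothetical, and it assumes each instance sees a distinct value in an environment variable such as HOSTNAME, which Kubernetes sets to the pod name. The authoritative rules are on the Transactions and EOS page.

```typescript
// Hypothetical helper; not part of eventferry.
// Assumes every instance has a distinct value in the chosen environment variable.
function deriveTransactionalId(
  service: string,
  env: Record<string, string | undefined>,
  variable = "HOSTNAME",
): string {
  const instance = env[variable]?.trim();
  if (!instance) {
    // Failing fast beats silently sharing one ID across instances.
    throw new Error(`${variable} is not set; cannot derive a per-instance transactionalId`);
  }
  return `${service}-${instance}`;
}

const exampleId = deriveTransactionalId("orders-outbox", { HOSTNAME: "relay-0" });
console.log(exampleId); // orders-outbox-relay-0
```

In a real service you would pass `process.env` as the second argument. Whether the ID also needs to stay stable across restarts of the same instance depends on your deployment model, so check that against the multi-instance rules before relying on pod names.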
- Troubleshooting recipes for specific symptoms → Troubleshooting
- Migration playbook between versions → Migrations and Upgrades
- API reference for every option mentioned above → API Reference
Repository · Issues · npm: @eventferry/all · MIT