Troubleshooting
Symptom → cause → fix. Each entry is a real failure mode users hit, with the diagnostic SQL / log line you should look for.
SELECT
status,
count(*) AS n,
min(created_at) AS oldest,
max(created_at) AS newest
FROM outbox
GROUP BY status;

Status codes: 0=pending 1=processing 2=done 3=failed 4=dead.
| Where the rows pile up | Cause |
|---|---|
| `status=0` (pending) | Relay isn't claiming: relay process down, or DB connection pool exhausted. |
| `status=1` (processing) | Relay claimed but not finishing: publisher hung, or `claimTimeoutMs` < typical publish time so the reaper is fighting the relay. |
| `status=3` (failed) | Retries firing but not succeeding: broker auth broken, or a genuine outage with `next_retry_at` far in the future. |
- Pending pileup, no relay process → restart the relay; investigate why it died (probably `OOM` or a DB connection drop).
- Processing pileup with reaper churn → bump `claimTimeoutMs`. The 60s default assumes batch publishes complete in seconds; high-volume / large-batch setups need 300s+.
- Failed pileup → check `dlq-error-class` on the rows already DLQ'd. They tell you what's wrong upstream.
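The triage above can be sketched as a small function over the rows returned by the status query. This is only an illustration: `diagnosePileup`, `hintFor` and the `StatusCount` shape are hypothetical names, not part of eventferry, and the hint strings paraphrase the table.

```typescript
// One row of the GROUP BY status query above. `n` is a number here; some
// drivers return count(*) as a string, so convert before calling.
type StatusCount = { status: number; n: number };

// Hints paraphrased from the table above. Status 2 (done) and 4 (dead)
// are not treated as pileups in this sketch.
function hintFor(status: number): string | undefined {
  switch (status) {
    case 0:
      return "pending pileup: relay is not claiming (process down or DB pool exhausted)";
    case 1:
      return "processing pileup: publisher hung or claimTimeoutMs too low";
    case 3:
      return "failed pileup: retries not succeeding (broker auth or outage)";
    default:
      return undefined;
  }
}

// Hypothetical helper, not part of eventferry: returns the hint for the
// stuck status that currently holds the most rows.
function diagnosePileup(rows: StatusCount[]): string {
  let worstCount = 0;
  let worstHint = "no pileup detected";
  for (const row of rows) {
    const hint = hintFor(row.status);
    if (hint !== undefined && row.n > worstCount) {
      worstCount = row.n;
      worstHint = hint;
    }
  }
  return worstHint;
}
```

A large `done` count is ignored on purpose, so a healthy table with many completed rows reports no pileup.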
SELECT id, attempts, claimed_at, next_retry_at, status
FROM outbox
WHERE id = <suspect-id>;

If `claimed_at` was recently stamped (say 30s ago) and `claimTimeoutMs` is 60s, the row shouldn't be reclaimable. If it IS being reclaimed, the row's publish is slower than `claimTimeoutMs`: the reaper picks it up while the original relay is still trying.
- Bump `claimTimeoutMs` to 2–3× the p99 publish time. Default 60s breaks down when batches are large or the broker is slow.
- Reduce `batchSize` so each batch finishes faster.
- Check `onBatchPublished` timing in your metrics: if individual batches take 90s, your reaper is wrong, not your relay.
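As a rough sketch of the sizing rule, the helpers below derive a timeout from observed publish durations and check whether a claim has aged past it. The function names are hypothetical, and the reclaim check assumes the reaper simply compares the claim's age against `claimTimeoutMs`, which this page implies but does not spell out.

```typescript
// Nearest-rank p99 of observed publish durations, in milliseconds.
function p99(durationsMs: number[]): number {
  if (durationsMs.length === 0) throw new Error("no samples");
  const sorted = durationsMs.slice().sort((a, b) => a - b);
  // 1-based nearest rank, using integer arithmetic to avoid float drift.
  const rank = Math.ceil((99 * sorted.length) / 100);
  return sorted[rank - 1] as number;
}

// Suggested claimTimeoutMs: a multiple (2 to 3) of the p99 publish time.
function suggestClaimTimeoutMs(durationsMs: number[], multiplier: number = 3): number {
  return Math.ceil(p99(durationsMs) * multiplier);
}

// Assumption: a claim becomes reclaimable once its age reaches claimTimeoutMs.
function isReclaimable(claimedAtMs: number, nowMs: number, claimTimeoutMs: number): boolean {
  return nowMs - claimedAtMs >= claimTimeoutMs;
}
```

Feed `suggestClaimTimeoutMs` with whatever batch timings your metrics already record; a handful of samples gives a crude p99, so prefer a window that covers peak load.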
Look at dlq-reason headers on the most recent DLQ records:
// In your DLQ consumer
console.log(m.headers["dlq-reason"]);
// e.g. "Connection error: ECONNREFUSED 10.0.5.12:9093"

This is a network / broker availability issue, NOT eventferry's failure mode. The relay retried `maxAttempts` times, the broker was unreachable each time, and eventually the row was sent to the DLQ.
- Check broker logs for the same window.
- Increase `retry.maxAttempts` so a 5-minute broker rolling restart doesn't burn the budget.
- Use `validateTopicsOnConnect` to catch broker reachability at startup instead of per-record at publish time.
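To judge whether a retry budget survives a given outage, it helps to add up the waits between attempts. The sketch below assumes a capped exponential backoff; eventferry's actual schedule is not described on this page, so `baseDelayMs`, `factor` and `maxDelayMs` are hypothetical parameters, and the time spent inside each publish attempt is ignored.

```typescript
// Total time between the first and the last attempt, assuming (hypothetically)
// that the wait before retry k is min(baseDelayMs * factor^k, maxDelayMs).
function retryWindowMs(
  maxAttempts: number,
  baseDelayMs: number,
  factor: number,
  maxDelayMs: number,
): number {
  let total = 0;
  // maxAttempts attempts means maxAttempts - 1 waits between them.
  for (let retry = 0; retry < maxAttempts - 1; retry++) {
    total += Math.min(baseDelayMs * Math.pow(factor, retry), maxDelayMs);
  }
  return total;
}

// Smallest maxAttempts whose retry window is at least as long as the outage.
function minAttemptsToCover(
  outageMs: number,
  baseDelayMs: number,
  factor: number,
  maxDelayMs: number,
): number {
  if (baseDelayMs <= 0 || maxDelayMs <= 0 || factor < 1) {
    throw new Error("delays must be positive and factor must be >= 1");
  }
  let attempts = 1;
  while (retryWindowMs(attempts, baseDelayMs, factor, maxDelayMs) < outageMs) {
    attempts++;
  }
  return attempts;
}
```

Under these assumed numbers (1s base, doubling, 30s cap), five attempts span only 15 seconds, while covering a 5-minute restart takes 15 attempts.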
dlq-error-class: TopicAuthorizationException
dlq-reason: Not authorized to write to topic [orders.created]
The IAM / ACL identity the relay runs as doesn't have write permission on the target topic. This is a Terraform / IAM problem, not eventferry's.
- (MSK IAM) Check the IAM policy includes `kafka-cluster:WriteData` on the topic ARN. See AWS MSK IAM.
- (Confluent Cloud) Verify the API key ACL includes `WRITE` on the topic.
- (Self-managed) Run `kafka-acls.sh --list --topic orders.created --principal User:CN=...`.
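When many records land in the DLQ, sorting them by cause before reading individual messages saves time. The sketch below buckets a record using the two headers shown on this page. `classifyDlq` and its category names are hypothetical, and the headers are assumed to be already decoded to strings (Kafka clients often deliver header values as buffers).

```typescript
// Header values are assumed to be decoded to strings by the caller.
type DlqHeaders = { [key: string]: string | undefined };

type DlqCategory = "authorization" | "connectivity" | "unknown";

// Hypothetical helper: buckets a DLQ record by the failure families
// discussed on this page.
function classifyDlq(headers: DlqHeaders): DlqCategory {
  const errorClass = headers["dlq-error-class"] ?? "";
  const reason = headers["dlq-reason"] ?? "";
  if (errorClass === "TopicAuthorizationException" || /not authorized/i.test(reason)) {
    return "authorization"; // fix IAM / ACLs, not the relay
  }
  if (/ECONNREFUSED|ENOTFOUND|Connection error/i.test(reason)) {
    return "connectivity"; // check broker availability and network
  }
  return "unknown";
}
```

Counting records per category over the last hour usually shows at a glance whether the problem is permissions or reachability.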
new KafkaPublisher({
hooks: {
onProducerFenced: (err) => log.warn("fenced", { err: err.message }),
},
});

Fenced rate > 0 in a steady state means two producer instances are taking the same `transactionalId`. Each time one instance initializes with that ID, the broker fences the instance that claimed it earlier, so if both instances keep restarting, they fence each other forever.
Make transactionalId unique per instance:
new KafkaPublisher({
transactional: true,
transactionalId: () => `${process.env.POD_NAME}-${process.env.HOSTNAME}`,
});

DON'T enable `autoRecoverFromFence: true` on a multi-instance setup: it amplifies the loop. Leave it off, let the fence propagate, and rely on the unique-ID rule to make fences rare. See Transactions and EOS.
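One way to confirm a suspected ID collision is to collect the `transactionalId` each running instance reports (from logs or your deployment tooling) and look for duplicates. The helper below is a hypothetical sketch of that check and is not part of eventferry.

```typescript
type Collision = { transactionalId: string; instances: string[] };

// Hypothetical helper: given instance name -> transactionalId, returns every
// ID that is used by more than one instance.
function findSharedTransactionalIds(idsByInstance: Record<string, string>): Collision[] {
  // Prototype-free object so IDs like "constructor" behave as plain keys.
  const instancesById: Record<string, string[]> = Object.create(null);
  for (const instance of Object.keys(idsByInstance)) {
    const id = idsByInstance[instance] as string;
    const owners = instancesById[id] ?? [];
    owners.push(instance);
    instancesById[id] = owners;
  }
  const collisions: Collision[] = [];
  for (const id of Object.keys(instancesById)) {
    const instances = instancesById[id] ?? [];
    if (instances.length > 1) collisions.push({ transactionalId: id, instances });
  }
  return collisions;
}
```

An empty result with a non-zero fenced rate suggests the duplicate lives outside the set you inspected, for example a stale pod that has not yet been terminated.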
The publisher's connect() calls validateTopicsOnConnect if set. A hang there means the admin client can't reach the broker — usually a TLS / SASL config problem.
Common log lines:
KafkaJSConnectionError: Connection error: getaddrinfo ENOTFOUND broker.bad-dns
KafkaJSConnectionError: Connection error: unable to verify the first certificate
KafkaJSSASLAuthenticationError: SASL PLAIN authentication failed: Invalid credentials
Match the error to a config fix:
- `getaddrinfo ENOTFOUND` → wrong broker hostname. Check the bootstrap list.
- `unable to verify the first certificate` → broker cert isn't signed by a CA the trust store knows. Pin your CA via `ssl.ca`. See Authentication and TLS.
- `Invalid credentials` → check SASL username/password. For OAUTHBEARER, check the token your provider returns.
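The mapping above is mechanical enough to encode, for example in a startup wrapper that prints a hint next to the raw error. A minimal sketch follows; `connectFixHint` is a hypothetical name and the patterns cover only the three log lines listed here.

```typescript
// Pattern -> hint pairs taken from the list above.
const CONNECT_HINTS: Array<[RegExp, string]> = [
  [/getaddrinfo ENOTFOUND/, "wrong broker hostname: check the bootstrap list"],
  [/unable to verify the first certificate/, "untrusted broker certificate: pin your CA via ssl.ca"],
  [/Invalid credentials/, "check SASL username/password (for OAUTHBEARER, the token your provider returns)"],
];

// Hypothetical helper: returns a fix hint for a connect-time error message.
function connectFixHint(message: string): string {
  for (const [pattern, hint] of CONNECT_HINTS) {
    if (pattern.test(message)) return hint;
  }
  return "unrecognized connect error: inspect the full message";
}
```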
Testcontainers needs Docker. CI environments without a Docker daemon fail at container startup.
Error: connect ECONNREFUSED /var/run/docker.sock
- GitHub Actions: use `services: docker:dind`, or run on a self-hosted runner with Docker.
- GitLab: enable Docker-in-Docker.
- Local but failing: check `docker info`. Restart Docker Desktop.
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';
SHOW VARIABLES LIKE 'binlog_row_image';
SHOW GRANTS FOR CURRENT_USER();

Expected:
log_bin: ON
binlog_format: ROW
binlog_row_image: FULL
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO ...
Anything not matching → fix the MySQL config or the user's grants. See MySQL Adapter for the canonical [mysqld] block.
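The comparison against the expected values can be automated, for instance as a preflight check before starting the relay. The sketch below takes the variable values and grant lines as plain data, so it does not depend on any particular MySQL driver. `findMysqlMismatches` is a hypothetical name, and the grant check is a plain substring match, so a broader grant such as ALL PRIVILEGES would be reported as missing.

```typescript
// Expected values from the list above.
const EXPECTED_VARIABLES: Array<[string, string]> = [
  ["log_bin", "ON"],
  ["binlog_format", "ROW"],
  ["binlog_row_image", "FULL"],
];
const REQUIRED_GRANTS = ["REPLICATION SLAVE", "REPLICATION CLIENT"];

// Hypothetical helper: `variables` maps variable name -> value (from the
// SHOW VARIABLES rows), `grants` holds the lines returned by SHOW GRANTS.
function findMysqlMismatches(
  variables: Record<string, string | undefined>,
  grants: string[],
): string[] {
  const problems: string[] = [];
  for (const [name, expected] of EXPECTED_VARIABLES) {
    const actual = variables[name];
    if (actual === undefined) {
      problems.push(`${name}: expected ${expected}, variable missing`);
    } else if (actual.toUpperCase() !== expected) {
      problems.push(`${name}: expected ${expected}, got ${actual}`);
    }
  }
  // Substring match only: broader grants are not recognized by this sketch.
  const allGrants = grants.join("\n").toUpperCase();
  for (const grant of REQUIRED_GRANTS) {
    if (allGrants.indexOf(grant) === -1) problems.push(`missing grant: ${grant}`);
  }
  return problems;
}
```

An empty array means the four checks above all pass.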
SHOW wal_level;
-- → "replica" (wrong)
-- → "logical" (right)

ALTER SYSTEM SET wal_level = logical;
-- Restart Postgres (wal_level change requires a restart, not just reload).

On managed Postgres (RDS, Cloud SQL), the equivalent knob is a parameter group / flag toggle + a database restart. Check your provider's docs.
If you can't change `wal_level`, use `PostgresNotifyWaker` instead; it gives you sub-second wake-ups without the cluster config requirement. See Postgres Adapter.
try {
await events.enqueue(client, "orders.created", { aggregateId, payload });
} catch (err) {
if (err instanceof OutboxValidationError) {
console.error(err.topic, JSON.stringify(err.issues, null, 2));
}
}

`err.issues` is the Standard Schema issues array. Each issue has `message` and (usually) a `path` showing the field that failed.
Common cases:
- Typo in `topic`: `"orders.created"` vs `"orders.create"`. The registry throws `unknown topic`.
- An extra field on the payload that Zod rejects in `.strict()` mode.
- `bigint` in payload: JSON can't serialize bigint. Stringify it before enqueue, parse on consume.
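For the bigint case, one possible approach is to convert every bigint to its decimal string before enqueue and parse the known fields back on the consumer side. The sketch below uses only standard `JSON` and `BigInt`; `toJsonSafe` and the payload fields are hypothetical, and it needs a runtime and compile target with BigInt support (ES2020 or later).

```typescript
// Deep-copies a payload, replacing every bigint with its decimal string.
function toJsonSafe(payload: unknown): unknown {
  const json = JSON.stringify(payload, (_key, value) =>
    typeof value === "bigint" ? value.toString() : value,
  );
  return JSON.parse(json);
}

// Producer side: convert before enqueue.
const safePayload = toJsonSafe({
  orderId: "o-1",
  amountCents: BigInt("9007199254740993"), // too large for an exact JS number
}) as { orderId: string; amountCents: string };

// Consumer side: the consumer has to know which fields to parse back.
const amountCents = BigInt(safePayload.amountCents);
```

The round trip is lossless because the value travels as a string. The schema for the topic then has to declare the field as a string rather than a bigint.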
The wiki sidebar shows a page, but clicking it brings up "Create new page".
The page hasn't been written yet. This wiki is still being actively written; missing pages will be filled in as their turn comes. If you want to tell me which page to prioritize, open an issue or message me directly.
Open an issue with:
- Package + version (`@eventferry/kafka@3.5.0`).
- Reproducer (minimal code + container setup if possible).
- The full error message and any relevant log lines.
- What you've already tried.
The integration test suite (packages/integration/test/) is the canonical example of working setups — if your config differs in some way from what's tested there, that's usually the first diagnostic.
Repository · Issues · npm: @eventferry/all · MIT