
Troubleshooting

SametGoktepe edited this page Jun 17, 2026 · 1 revision

Symptom → cause → fix. Each entry is a real failure mode users hit, with the diagnostic SQL / log line you should look for.


"Outbox depth keeps growing"

Diagnose

SELECT
  status,
  count(*) AS n,
  min(created_at) AS oldest,
  max(created_at) AS newest
FROM outbox
GROUP BY status;

Status codes: 0=pending 1=processing 2=done 3=failed 4=dead.

| Where the rows pile up | Cause |
| --- | --- |
| status=0 (pending) | Relay isn't claiming: relay process down, or DB connection pool exhausted. |
| status=1 (processing) | Relay claimed but not finishing: publisher hung, or claimTimeoutMs < typical publish time so the reaper is fighting the relay. |
| status=3 (failed) | Retries firing but not succeeding: broker auth broken, or a genuine outage with next_retry_at far in the future. |

Fix

  • Pending pileup, no relay process → restart the relay; investigate why it died (probably OOM or DB connection drop).
  • Processing pileup with reaper churn → bump claimTimeoutMs. Default 60s assumes batch publishes complete in seconds; high-volume / large-batch setups need 300s+.
  • Failed pileup → check dlq-error-class on the rows already DLQ'd. They tell you what's wrong upstream.
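
To turn the diagnostic query into an automated check, a small helper can report which non-terminal status holds the most rows. This is a minimal sketch, not part of eventferry: `StatusRow` and `largestBacklog` are hypothetical names, and it assumes `n` has already been converted to a number (some drivers return `count(*)` as a string).

```typescript
// Status codes as listed on this page.
const STATUS_NAMES: Record<number, string> = {
  0: "pending",
  1: "processing",
  2: "done",
  3: "failed",
  4: "dead",
};

// One row of the GROUP BY status query (hypothetical shape).
interface StatusRow {
  status: number;
  n: number;
}

// Returns the name of the backlog status (pending, processing, failed)
// with the most rows, or null when none of them holds any rows.
function largestBacklog(rows: StatusRow[]): string | null {
  const backlog = rows.filter(
    (r) => r.status === 0 || r.status === 1 || r.status === 3,
  );
  if (backlog.length === 0) return null;
  const worst = backlog.reduce((a, b) => (b.n > a.n ? b : a));
  return worst.n > 0 ? (STATUS_NAMES[worst.status] ?? null) : null;
}
```

The returned name maps directly onto the cause table: `"processing"` points at claimTimeoutMs or a hung publisher, `"pending"` at the relay process itself.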

"Reaper keeps reclaiming the same row"

Diagnose

SELECT id, attempts, claimed_at, next_retry_at, status
FROM outbox
WHERE id = <suspect-id>;

If claimed_at was recently stamped (say 30s ago) and claimTimeoutMs is 60s, the row shouldn't be reclaimable. If it IS being reclaimed, the row's publish is slower than claimTimeoutMs — the reaper picks it up while the original relay is still trying.

Fix

  • Bump claimTimeoutMs to 2–3× the p99 publish time. Default 60s breaks down when batches are large or the broker is slow.
  • Reduce batchSize so each batch finishes faster.
  • Check onBatchPublished timing in your metrics. If individual batches take 90s while claimTimeoutMs is 60s, the reaper timeout is what's misconfigured, not the relay.
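
The 2–3× p99 rule can be computed from recorded batch durations. The sketch below assumes you collect those durations yourself (for example from onBatchPublished timings); `percentile` and `suggestClaimTimeoutMs` are hypothetical helpers, not eventferry APIs.

```typescript
// Nearest-rank percentile over a list of durations in milliseconds.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1]!;
}

// Suggested claimTimeoutMs: p99 publish time multiplied by 2 to 3.
function suggestClaimTimeoutMs(samplesMs: number[], multiplier = 3): number {
  return Math.ceil(percentile(samplesMs, 99) * multiplier);
}
```

With observed batch times of 10s, 20s, 30s and 90s, the p99 is 90s, so the suggestion lands between 180s (2×) and 270s (3×), well above the 60s default.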

"Steady DLQ flow with KafkaJSConnectionError"

Diagnose

Look at dlq-reason headers on the most recent DLQ records:

// In your DLQ consumer. kafkajs delivers header values as Buffers,
// so decode them before logging.
console.log(m.headers?.["dlq-reason"]?.toString());
// e.g. "Connection error: ECONNREFUSED 10.0.5.12:9093"

Fix

This is a network / broker availability issue, NOT eventferry's failure mode. The relay retried maxAttempts times, the broker was unreachable each time, eventually the row DLQ'd.

  • Check broker logs for the same window.
  • Increase retry.maxAttempts so a 5-minute broker rolling restart doesn't burn the budget.
  • Use validateTopicsOnConnect to catch broker reachability at startup instead of per-record at publish time.
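
To size retry.maxAttempts against an expected outage, it helps to add up the waits between attempts. This page does not describe eventferry's backoff schedule, so the sketch below ASSUMES capped exponential backoff; `retryWindowMs`, `minAttemptsToCover`, and all their parameters are hypothetical and should be replaced with your actual retry settings.

```typescript
// Total time spent waiting between attempts, ASSUMING capped exponential
// backoff: the wait before retry k is baseDelayMs * factor^k, up to maxDelayMs.
function retryWindowMs(
  maxAttempts: number,
  baseDelayMs: number,
  factor = 2,
  maxDelayMs = Infinity,
): number {
  let total = 0;
  for (let wait = 0; wait < maxAttempts - 1; wait++) {
    total += Math.min(baseDelayMs * Math.pow(factor, wait), maxDelayMs);
  }
  return total;
}

// Smallest attempt count whose retry window spans the given outage.
function minAttemptsToCover(
  outageMs: number,
  baseDelayMs: number,
  factor = 2,
  maxDelayMs = Infinity,
): number {
  if (baseDelayMs <= 0 || factor < 1 || maxDelayMs <= 0) {
    throw new Error("baseDelayMs and maxDelayMs must be > 0 and factor >= 1");
  }
  let attempts = 1;
  while (retryWindowMs(attempts, baseDelayMs, factor, maxDelayMs) < outageMs) {
    attempts++;
  }
  return attempts;
}
```

Under these assumptions, a 5-minute restart with a 1s base delay that doubles each time needs 10 attempts; capping the delay at 30s raises that to 15.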

"Publish fails with TopicAuthorizationException after Terraform deploy"

Diagnose

dlq-error-class: TopicAuthorizationException
dlq-reason: Not authorized to write to topic [orders.created]

Fix

The IAM / ACL identity the relay runs as doesn't have write permission on the target topic. This is a Terraform / IAM problem, not eventferry's.

  • (MSK IAM) Check the IAM policy includes kafka-cluster:WriteData on the topic ARN. See AWS MSK IAM.
  • (Confluent Cloud) Verify the API key ACL includes WRITE on the topic.
  • (Self-managed) Run kafka-acls.sh --list --topic orders.created --principal User:CN=....

"Producer fences in a loop"

Diagnose

new KafkaPublisher({
  hooks: {
    onProducerFenced: (err) => log.warn("fenced", { err: err.message }),
  },
});

Fenced rate > 0 in a steady state means two producer instances are using the same transactionalId. Each time one instance registers the ID, the broker fences the instance that held it before, so if both instances keep restarting, they fence each other forever.

Fix

Make transactionalId unique per instance:

new KafkaPublisher({
  transactional: true,
  transactionalId: () => `${process.env.POD_NAME}-${process.env.HOSTNAME}`,
});

DON'T enable autoRecoverFromFence: true on a multi-instance setup — it amplifies the loop. Leave it off, let the fence propagate, and rely on the unique-ID rule to make fences rare. See Transactions and EOS.
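
One pitfall with building the ID from environment variables: if neither variable is set, a template string evaluates to "undefined-undefined" on every instance, which recreates the collision. A minimal sketch of a guard follows; `buildTransactionalId` and the "eventferry" prefix are hypothetical, and it takes the environment as a parameter so it can be tested without touching process.env.

```typescript
// Builds a per-instance transactional ID and fails fast when the
// instance identity is missing, instead of silently colliding.
function buildTransactionalId(
  env: Record<string, string | undefined>,
  prefix = "eventferry",
): string {
  const instance = env["POD_NAME"] || env["HOSTNAME"];
  if (!instance) {
    throw new Error("POD_NAME or HOSTNAME must be set to derive a unique transactionalId");
  }
  return `${prefix}-${instance}`;
}

// Possible wiring, assuming the option accepts a function as shown above:
//   transactionalId: () => buildTransactionalId(process.env)
```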


"publisher.connect() hangs"

Diagnose

The publisher's connect() calls validateTopicsOnConnect if set. A hang there means the admin client can't reach the broker — usually a TLS / SASL config problem.

Common log lines:

KafkaJSConnectionError: Connection error: getaddrinfo ENOTFOUND broker.bad-dns
KafkaJSConnectionError: Connection error: unable to verify the first certificate
KafkaJSSASLAuthenticationError: SASL PLAIN authentication failed: Invalid credentials

Fix

Match the error to a config fix:

  • getaddrinfo ENOTFOUND → wrong broker hostname. Check the bootstrap list.
  • unable to verify the first certificate → broker cert isn't signed by a CA the trust store knows. Pin your CA via ssl.ca. See Authentication and TLS.
  • Invalid credentials → check SASL username/password. For OAUTHBEARER, check the token your provider returns.
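
If you want startup logs to name the likely fix directly, the mapping above can be encoded as a lookup on the error message. This is an illustrative sketch only: `classifyConnectError` and the `ConnectFix` labels are hypothetical, and matching on message substrings is brittle across library versions.

```typescript
type ConnectFix =
  | "check-bootstrap-hosts"
  | "pin-ca-certificate"
  | "check-sasl-credentials"
  | "unknown";

// Maps a connection error message to the fix category listed above.
function classifyConnectError(message: string): ConnectFix {
  if (message.includes("ENOTFOUND")) return "check-bootstrap-hosts";
  if (message.includes("unable to verify the first certificate")) {
    return "pin-ca-certificate";
  }
  if (message.includes("Invalid credentials")) return "check-sasl-credentials";
  return "unknown";
}
```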

"Tests pass locally, integration tests fail in CI"

Diagnose

Testcontainers needs Docker. CI environments without a Docker daemon fail at container startup.

Error: connect ECONNREFUSED /var/run/docker.sock

Fix

  • GitHub Actions: GitHub-hosted Ubuntu runners already ship with Docker. If the job runs inside a container, use services: docker:dind, or run on a self-hosted runner with Docker.
  • GitLab: enable Docker-in-Docker.
  • Local but failing: check docker info. Restart Docker Desktop.

"MySQL binlog relay never wakes up on insert"

Diagnose

SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';
SHOW VARIABLES LIKE 'binlog_row_image';
SHOW GRANTS FOR CURRENT_USER();

Expected:

log_bin: ON
binlog_format: ROW
binlog_row_image: FULL
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO ...

Fix

Anything not matching → fix the MySQL config or the user's grants. See MySQL Adapter for the canonical [mysqld] block.
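
The comparison against the expected values can be scripted as a preflight check. The sketch below assumes you have already read the SHOW VARIABLES results into a plain name-to-value object; `binlogMismatches` is a hypothetical helper and does not cover the grants check.

```typescript
// Expected values from the list above.
const EXPECTED_BINLOG_VARS: Record<string, string> = {
  log_bin: "ON",
  binlog_format: "ROW",
  binlog_row_image: "FULL",
};

// Returns one human-readable line per variable that differs from the
// expected value; an empty array means the binlog settings match.
function binlogMismatches(actual: Record<string, string | undefined>): string[] {
  return Object.entries(EXPECTED_BINLOG_VARS)
    .filter(([name, want]) => (actual[name] ?? "").toUpperCase() !== want)
    .map(([name, want]) => `${name}: expected ${want}, got ${actual[name] ?? "(unset)"}`);
}
```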


"PostgresStreamingRelay errors with WAL level not 'logical'"

Diagnose

SHOW wal_level;
-- → "replica"  (wrong)
-- → "logical"  (right)

Fix

ALTER SYSTEM SET wal_level = logical;
-- Restart Postgres (wal_level change requires a restart, not just reload).

On managed Postgres (RDS, Cloud SQL), the equivalent knob is a parameter group / flag toggle + a database restart. Check your provider's docs.

If you can't change wal_level, use PostgresNotifyWaker instead — gives you sub-second wake without the cluster config requirement. See Postgres Adapter.


"OutboxValidationError thrown but my schema looks right"

Diagnose

try {
  await events.enqueue(client, "orders.created", { aggregateId, payload });
} catch (err) {
  if (err instanceof OutboxValidationError) {
    console.error(err.topic, JSON.stringify(err.issues, null, 2));
  }
}

err.issues is the Standard Schema issues array. Each issue has message and (usually) a path showing the field that failed.
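
For logs that are easier to scan than raw JSON, each issue can be flattened to a "path: message" line. This is a minimal sketch based on the Standard Schema issue shape, where a path segment is either a property key or an object with a `key` field; `formatIssue` is a hypothetical helper.

```typescript
type PathSegment = PropertyKey | { key: PropertyKey };

interface Issue {
  message: string;
  path?: ReadonlyArray<PathSegment>;
}

// Renders one issue as "a.b.0: message", or just the message when
// the issue carries no path.
function formatIssue(issue: Issue): string {
  const path = (issue.path ?? [])
    .map((seg) => String(typeof seg === "object" ? seg.key : seg))
    .join(".");
  return path ? `${path}: ${issue.message}` : issue.message;
}
```

Inside the catch block above, `err.issues.map(formatIssue).join("\n")` would give one line per failing field.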

Fix

Common cases:

  • Typo in topic: "orders.created" vs "orders.create". The registry throws unknown topic.
  • Extra field on payload that Zod rejects with .strict() mode.
  • bigint in payload — JSON can't serialize bigint. Stringify it before enqueue, parse on consume.
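
For the bigint case, one way to stringify before enqueue is a JSON replacer that converts every bigint to its decimal string. This is a sketch under the assumption that the payload is otherwise plain JSON data; `bigintsToStrings` is a hypothetical helper, and the consumer has to know which fields to turn back with BigInt().

```typescript
// Deep-copies a JSON-compatible payload, replacing every bigint with
// its decimal string so that JSON serialization no longer throws.
function bigintsToStrings(payload: unknown): unknown {
  return JSON.parse(
    JSON.stringify(payload, (_key, value) =>
      typeof value === "bigint" ? value.toString() : value,
    ),
  );
}

// Producer side: convert before enqueue.
const safePayload = bigintsToStrings({
  orderId: BigInt("9007199254740993"),
  total: 42,
}) as { orderId: string; total: number };

// Consumer side: parse the known bigint field back.
const restoredOrderId = BigInt(safePayload.orderId);
```

Strings are used rather than numbers because values above Number.MAX_SAFE_INTEGER, like the one in this example, lose precision as JSON numbers.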

"Wiki pages show up as redlinks"

Diagnose

The wiki sidebar lists the page, but clicking it brings up "Create new page".

Fix

The page hasn't been written yet. This wiki is still being actively written, and missing pages will be filled in as their turn comes. If you want to tell me which page to prioritize, you can open an issue or message me directly.


"I don't see my issue here"

Open an issue with:

  • Package + version (@eventferry/kafka@3.5.0).
  • Reproducer (minimal code + container setup if possible).
  • The full error message and any relevant log lines.
  • What you've already tried.

The integration test suite (packages/integration/test/) is the canonical example of working setups — if your config differs in some way from what's tested there, that's usually the first diagnostic.
