Reliability and Error Handling

How eventferry classifies publish failures and reacts to each kind. The errorKind taxonomy is what lets the relay distinguish a transient blip from a poison record from a fenced producer — three failures that look identical at the wire level but want very different responses.

The `errorKind` union

Every failed PublishResult carries an errorKind:

type PublishErrorKind =
  | "retriable"        // default for unclassified
  | "fatal"            // skip retry, DLQ + dead
  | "poison"           // record-level rejection, DLQ + dead
  | "backpressure"     // requeue WITHOUT attempts++
  | "quota"            // retry with longer backoff
  | "fenced";          // producer epoch fenced — see [[Transactions and EOS]]

The relay reads errorKind to decide between retry / requeue / DLQ. If the publisher's classifier doesn't recognize an error, it defaults to "retriable" — the safe bias. At worst we retry an error that should have been skipped; in practice we'd rather over-retry than mis-classify a transient blip as terminal.

Per-kind reaction

Kind	Attempts++	Backoff	DLQ?
`retriable`	yes	normal	only after `maxAttempts`
`fatal`	(irrelevant)	none	immediately
`poison`	(irrelevant)	none	immediately
`backpressure`	no	`backpressureDelayMs` (default 1000)	only if requeue isn't supported
`quota`	yes	normal × `quotaMultiplier` (default 5)	only after `maxAttempts`
`fenced`	yes	normal	only after `maxAttempts`, OR the publisher's `autoRecoverFromFence` path recovered transparently

`retriable`

Transient. Network blip, leader election, REQUEST_TIMED_OUT. Increments attempts, schedules next_retry_at by the backoff policy, eventually DLQs after maxAttempts.

`fatal`

Auth or permission denied. Retrying with the same credentials cannot help. Skip the backoff entirely — straight to DLQ + dead.

Examples: TopicAuthorizationException, ClusterAuthorizationException, SaslAuthenticationFailed, UnsupportedVersionException.

`poison`

The record itself is rejectable. Oversized payload, corrupt bytes, broker refused encoding. Same handling as fatal — straight to DLQ.

Examples: MESSAGE_TOO_LARGE, CORRUPT_MESSAGE, INVALID_RECORD, schema-registry refused encoding.

`backpressure`

The producer's own outbound buffer is full (librdkafka __QUEUE_FULL). This isn't a record-level failure — it's a "slow down" signal from the driver. Burning a retry attempt would penalize the relay for the broker's slowness.

eventferry handles this with requeue instead of markFailed:

Status → failed
claimed_at → NULL (so the reaper doesn't race)
attempts → unchanged
next_retry_at → now() + backpressureDelayMs

The relay reclaims it on the next tick with the same attempt count. Backpressure does not consume retry budget.

new Relay({
  store,
  publisher,
  retry: {
    backpressureDelayMs: 1000,            // default
  },
});

Stores without requeue (a requeue?: never shape) fall back to markFailed — attempts increment, log warns once.

`quota`

The broker is throttling (THROTTLING_QUOTA_EXCEEDED). A normal retry rate would make it worse. eventferry stretches the backoff via a multiplier:

new Relay({
  store,
  publisher,
  retry: {
    quotaMultiplier: 5,                   // default — 5x the computed backoff
  },
});

A quota failure at attempt 3 with initialBackoffMs: 1000, factor: 2 lands at next_retry_at = 8000ms × 5 = 40s instead of 8s. Cumulative effect: hours of breathing room before the relay gives up.

`fenced`

See Transactions and EOS for the full story. Briefly: split out of "fatal" because some fences are transient (broker restart, network partition recovery). The publisher's autoRecoverFromFence: true does ONE transparent reconnect + retry before surfacing the fence to the relay.

Configuring retries

new Relay({
  store,
  publisher,
  retry: {
    maxAttempts: 5,                       // default — 6th failure → DLQ
    initialBackoffMs: 1_000,              // default — first retry delay
    factor: 2,                            // default — exponential factor
    jitter: 0.3,                          // default — ±30% random jitter
    maxBackoffMs: 60_000,                 // default — cap on the computed delay
    backpressureDelayMs: 1_000,           // default — see above
    quotaMultiplier: 5,                   // default — see above
  },
});

Knob	Default	Notes
`maxAttempts`	`5`	Total attempts including the first. 5 means: try, 4 retries, then DLQ.
`initialBackoffMs`	`1_000`	Backoff before the 2nd attempt (1st retry).
`factor`	`2`	Exponential factor. `factor: 2` doubles each retry.
`jitter`	`0.3`	Fraction of randomness — `0.3` means the delay is multiplied by a random `[0.7, 1.3]`. Prevents thundering herds.
`maxBackoffMs`	`60_000`	Caps the computed delay. Default 60s means even attempt 10 (= 1024s raw) lands at 60s.
`backpressureDelayMs`	`1_000`	Fixed delay for `errorKind: "backpressure"`.
`quotaMultiplier`	`5`	Multiplier on the computed backoff for `errorKind: "quota"`.

Backoff arithmetic

For errorKind: "retriable" with the defaults:

attempt 1 → 1s × jitter
attempt 2 → 2s × jitter
attempt 3 → 4s × jitter
attempt 4 → 8s × jitter
attempt 5 → 16s × jitter   ← maxAttempts hit; DLQ

Total time from first failure to DLQ: ~30 seconds.

Tune initialBackoffMs higher for workloads with long-running maintenance windows (e.g. broker rolling restart over 5 minutes). Tune maxAttempts higher for workloads where you'd rather wait hours than land on DLQ.

The classifier internals

KafkaPublisher ships two classifiers:

classifyKafkajsError(err) — maps kafkajs's type + code + class name.
classifyConfluentError(err) — maps librdkafka's code + name.

Both return PublishErrorKind. Coverage tables (the RETRIABLE_TYPES, FATAL_TYPES, etc. sets) live in packages/kafka/src/{kafkajs,confluent}-classifier.ts — read them when classifying a new error class for your driver.

Custom drivers set result.errorKind directly. The relay treats absent errorKind as "retriable", so the safe default is to leave it unset and only opt into specific kinds when you've classified the error.

End-to-end failure flow

Publish attempt
     │
     ▼
  ┌─────────────────────┐
  │ classifier(err)     │  → errorKind
  └─────────────────────┘
     │
     ├── retriable / quota / fenced ──► markFailed, next_retry_at scheduled
     │                                  attempts++; eventually DLQ at maxAttempts
     │
     ├── backpressure ─────────────────► requeue (attempts unchanged)
     │
     └── fatal / poison ──────────────► markFailed (status=dead), DLQ immediately

What's next

DLQ enrichment + replay → Dead-Letter Queue
Producer-fenced restart deep-dive → Transactions and EOS
Configure the publisher's tuning → Kafka Publisher

Repository · Issues · npm: @eventferry/all · MIT

Home

Get going

Adapters

Type & schema

Security

Operational

Operations

Reference

Uh oh!

Reliability and Error Handling

The errorKind union

Per-kind reaction

retriable

fatal

poison

backpressure

quota

fenced

Configuring retries

Backoff arithmetic

The classifier internals

End-to-end failure flow

What's next

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

The `errorKind` union

`retriable`

`fatal`

`poison`

`backpressure`

`quota`

`fenced`