Skip to content

Reliability and Error Handling

SametGoktepe edited this page Jun 17, 2026 · 1 revision

How eventferry classifies publish failures and reacts to each kind. The errorKind taxonomy is what lets the relay distinguish a transient blip from a poison record from a fenced producer — three failures that look identical at the wire level but want very different responses.


The errorKind union

Every failed PublishResult carries an errorKind:

type PublishErrorKind =
  | "retriable"        // default for unclassified
  | "fatal"            // skip retry, DLQ + dead
  | "poison"           // record-level rejection, DLQ + dead
  | "backpressure"     // requeue WITHOUT attempts++
  | "quota"            // retry with longer backoff
  | "fenced";          // producer epoch fenced — see [[Transactions and EOS]]

The relay reads errorKind to decide between retry / requeue / DLQ. If the publisher's classifier doesn't recognize an error, it defaults to "retriable" — the safe bias. At worst we retry an error that should have been skipped; in practice we'd rather over-retry than mis-classify a transient blip as terminal.


Per-kind reaction

Kind Attempts++ Backoff DLQ?
retriable yes normal only after maxAttempts
fatal (irrelevant) none immediately
poison (irrelevant) none immediately
backpressure no backpressureDelayMs (default 1000) only if requeue isn't supported
quota yes normal × quotaMultiplier (default 5) only after maxAttempts
fenced yes normal only after maxAttempts, OR the publisher's autoRecoverFromFence path recovered transparently

retriable

Transient. Network blip, leader election, REQUEST_TIMED_OUT. Increments attempts, schedules next_retry_at by the backoff policy, eventually DLQs after maxAttempts.

fatal

Auth or permission denied. Retrying with the same credentials cannot help. Skip the backoff entirely — straight to DLQ + dead.

Examples: TopicAuthorizationException, ClusterAuthorizationException, SaslAuthenticationFailed, UnsupportedVersionException.

poison

The record itself is rejectable. Oversized payload, corrupt bytes, broker refused encoding. Same handling as fatal — straight to DLQ.

Examples: MESSAGE_TOO_LARGE, CORRUPT_MESSAGE, INVALID_RECORD, schema-registry refused encoding.

backpressure

The producer's own outbound buffer is full (librdkafka __QUEUE_FULL). This isn't a record-level failure — it's a "slow down" signal from the driver. Burning a retry attempt would penalize the relay for the broker's slowness.

eventferry handles this with requeue instead of markFailed:

  • Status → failed
  • claimed_atNULL (so the reaper doesn't race)
  • attemptsunchanged
  • next_retry_atnow() + backpressureDelayMs

The relay reclaims it on the next tick with the same attempt count. Backpressure does not consume retry budget.

new Relay({
  store,
  publisher,
  retry: {
    backpressureDelayMs: 1000,            // default
  },
});

Stores without requeue (a requeue?: never shape) fall back to markFailed — attempts increment, log warns once.

quota

The broker is throttling (THROTTLING_QUOTA_EXCEEDED). A normal retry rate would make it worse. eventferry stretches the backoff via a multiplier:

new Relay({
  store,
  publisher,
  retry: {
    quotaMultiplier: 5,                   // default — 5x the computed backoff
  },
});

A quota failure at attempt 3 with initialBackoffMs: 1000, factor: 2 lands at next_retry_at = 8000ms × 5 = 40s instead of 8s. Cumulative effect: hours of breathing room before the relay gives up.

fenced

See Transactions and EOS for the full story. Briefly: split out of "fatal" because some fences are transient (broker restart, network partition recovery). The publisher's autoRecoverFromFence: true does ONE transparent reconnect + retry before surfacing the fence to the relay.


Configuring retries

new Relay({
  store,
  publisher,
  retry: {
    maxAttempts: 5,                       // default — 6th failure → DLQ
    initialBackoffMs: 1_000,              // default — first retry delay
    factor: 2,                            // default — exponential factor
    jitter: 0.3,                          // default — ±30% random jitter
    maxBackoffMs: 60_000,                 // default — cap on the computed delay
    backpressureDelayMs: 1_000,           // default — see above
    quotaMultiplier: 5,                   // default — see above
  },
});
Knob Default Notes
maxAttempts 5 Total attempts including the first. 5 means: try, 4 retries, then DLQ.
initialBackoffMs 1_000 Backoff before the 2nd attempt (1st retry).
factor 2 Exponential factor. factor: 2 doubles each retry.
jitter 0.3 Fraction of randomness — 0.3 means the delay is multiplied by a random [0.7, 1.3]. Prevents thundering herds.
maxBackoffMs 60_000 Caps the computed delay. Default 60s means even attempt 10 (= 1024s raw) lands at 60s.
backpressureDelayMs 1_000 Fixed delay for errorKind: "backpressure".
quotaMultiplier 5 Multiplier on the computed backoff for errorKind: "quota".

Backoff arithmetic

For errorKind: "retriable" with the defaults:

attempt 1 → 1s × jitter
attempt 2 → 2s × jitter
attempt 3 → 4s × jitter
attempt 4 → 8s × jitter
attempt 5 → 16s × jitter   ← maxAttempts hit; DLQ

Total time from first failure to DLQ: ~30 seconds.

Tune initialBackoffMs higher for workloads with long-running maintenance windows (e.g. broker rolling restart over 5 minutes). Tune maxAttempts higher for workloads where you'd rather wait hours than land on DLQ.


The classifier internals

KafkaPublisher ships two classifiers:

  • classifyKafkajsError(err) — maps kafkajs's type + code + class name.
  • classifyConfluentError(err) — maps librdkafka's code + name.

Both return PublishErrorKind. Coverage tables (the RETRIABLE_TYPES, FATAL_TYPES, etc. sets) live in packages/kafka/src/{kafkajs,confluent}-classifier.ts — read them when classifying a new error class for your driver.

Custom drivers set result.errorKind directly. The relay treats absent errorKind as "retriable", so the safe default is to leave it unset and only opt into specific kinds when you've classified the error.


End-to-end failure flow

Publish attempt
     │
     ▼
  ┌─────────────────────┐
  │ classifier(err)     │  → errorKind
  └─────────────────────┘
     │
     ├── retriable / quota / fenced ──► markFailed, next_retry_at scheduled
     │                                  attempts++; eventually DLQ at maxAttempts
     │
     ├── backpressure ─────────────────► requeue (attempts unchanged)
     │
     └── fatal / poison ──────────────► markFailed (status=dead), DLQ immediately

What's next

Clone this wiki locally