Reliability and Error Handling
How eventferry classifies publish failures and reacts to each kind. The errorKind taxonomy is what lets the relay distinguish a transient blip from a poison record from a fenced producer — three failures that look identical at the wire level but want very different responses.
Every failed PublishResult carries an errorKind:
type PublishErrorKind =
| "retriable" // default for unclassified
| "fatal" // skip retry, DLQ + dead
| "poison" // record-level rejection, DLQ + dead
| "backpressure" // requeue WITHOUT attempts++
| "quota" // retry with longer backoff
| "fenced"; // producer epoch fenced; see [[Transactions and EOS]]

The relay reads errorKind to decide between retry / requeue / DLQ. If the publisher's classifier doesn't recognize an error, it defaults to "retriable", the safe bias. At worst we retry an error that should have been skipped; in practice we'd rather over-retry than mis-classify a transient blip as terminal.
| Kind | Attempts++ | Backoff | DLQ? |
|---|---|---|---|
| retriable | yes | normal | only after maxAttempts |
| fatal | (irrelevant) | none | immediately |
| poison | (irrelevant) | none | immediately |
| backpressure | no | backpressureDelayMs (default 1000) | only if requeue isn't supported |
| quota | yes | normal × quotaMultiplier (default 5) | only after maxAttempts |
| fenced | yes | normal | only after maxAttempts, OR the publisher's autoRecoverFromFence path recovered transparently |
retriable
Transient failures: network blip, leader election, REQUEST_TIMED_OUT. The relay increments attempts, schedules next_retry_at by the backoff policy, and eventually DLQs the record after maxAttempts.
fatal
Auth or permission denied, or a protocol mismatch. Retrying with the same credentials cannot help, so the relay skips the backoff entirely and goes straight to DLQ + dead.
Examples: TopicAuthorizationException, ClusterAuthorizationException, SaslAuthenticationFailed, UnsupportedVersionException.
poison
The record itself is rejected: oversized payload, corrupt bytes, or an encoding that was refused. Same handling as fatal: straight to DLQ.
Examples: MESSAGE_TOO_LARGE, CORRUPT_MESSAGE, INVALID_RECORD, schema-registry refused encoding.
backpressure
The producer's own outbound buffer is full (librdkafka __QUEUE_FULL). This isn't a record-level failure; it's a "slow down" signal from the driver. Burning a retry attempt would penalize the relay for the broker's slowness.
eventferry handles this with requeue instead of markFailed:
- Status → failed
- claimed_at → NULL (so the reaper doesn't race)
- attempts → unchanged
- next_retry_at → now() + backpressureDelayMs
The relay reclaims it on the next tick with the same attempt count. Backpressure does not consume retry budget.
new Relay({
store,
publisher,
retry: {
backpressureDelayMs: 1000, // default
},
});

Stores without requeue (a requeue?: never shape) fall back to markFailed: attempts increment and a warning is logged once.
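
The difference between requeue and the markFailed fallback can be sketched on an in-memory row. This is a minimal illustration, not eventferry's store code: the OutboxRow shape, the camel-cased field names, the status values other than failed, and both helper functions are assumptions made for the example.

```typescript
// Hypothetical row shape; only the four fields touched by requeue come from the list above.
interface OutboxRow {
  id: string;
  status: "pending" | "claimed" | "failed" | "dead"; // values other than "failed" are assumed
  claimedAt: Date | null;
  attempts: number;
  nextRetryAt: Date | null;
}

// Requeue: schedule another try without spending retry budget.
function requeue(row: OutboxRow, now: Date, backpressureDelayMs: number = 1000): OutboxRow {
  return {
    ...row,
    status: "failed",
    claimedAt: null, // release the claim so the reaper has nothing to race on
    nextRetryAt: new Date(now.getTime() + backpressureDelayMs),
    // attempts is deliberately left unchanged
  };
}

// Fallback for a store without requeue: the same failure now costs one attempt.
// The delay is passed in because the text does not say which delay the fallback uses.
function markFailedFallback(row: OutboxRow, now: Date, delayMs: number): OutboxRow {
  return {
    ...row,
    status: "failed",
    claimedAt: null,
    attempts: row.attempts + 1,
    nextRetryAt: new Date(now.getTime() + delayMs),
  };
}
```

With a row at attempts: 2, requeue returns attempts: 2 while the fallback returns attempts: 3, which is why a store without requeue reaches the DLQ sooner under sustained backpressure.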
quota
The broker is throttling (THROTTLING_QUOTA_EXCEEDED). Retrying at the normal rate would make it worse, so eventferry stretches the backoff via a multiplier:
new Relay({
store,
publisher,
retry: {
quotaMultiplier: 5, // default — 5x the computed backoff
},
});

With initialBackoffMs: 1000 and factor: 2, a quota failure whose normal backoff has doubled three times to 8000ms is scheduled at next_retry_at = 8000ms × 5 = 40s instead of 8s. Cumulatively, that gives the broker several times more breathing room before the relay gives up.
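
As a worked example of the multiplier, the sketch below stretches a list of already-computed backoffs. The helper name quotaBackoffMs is hypothetical, and how the multiplier interacts with maxBackoffMs and jitter is not specified on this page, so both are left out.

```typescript
// Hypothetical helper: stretch an already-computed backoff for errorKind "quota".
function quotaBackoffMs(computedBackoffMs: number, quotaMultiplier: number = 5): number {
  return computedBackoffMs * quotaMultiplier;
}

// Un-jittered exponential delays for initialBackoffMs: 1000, factor: 2.
const normalDelaysMs = [1000, 2000, 4000, 8000, 16000];
const quotaDelaysMs = normalDelaysMs.map((ms) => quotaBackoffMs(ms));

const sum = (xs: number[]): number => xs.reduce((a, b) => a + b, 0);

console.log(quotaDelaysMs);       // [ 5000, 10000, 20000, 40000, 80000 ]
console.log(sum(normalDelaysMs)); // 31000
console.log(sum(quotaDelaysMs));  // 155000
```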
fenced
See Transactions and EOS for the full story. Briefly: fenced is split out of "fatal" because some fences are transient (broker restart, network partition recovery). With autoRecoverFromFence: true, the publisher does one transparent reconnect + retry before surfacing the fence to the relay.
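
The "one reconnect, one retry" behavior can be sketched as a small wrapper. This is an assumption-laden illustration rather than the publisher's actual code: publishWithFenceRecovery, the PublishResult shape, and the publish and reconnect callbacks are hypothetical, and the callbacks are synchronous only to keep the example short.

```typescript
// Hypothetical result shape; only errorKind comes from the text above.
interface PublishResult {
  ok: boolean;
  errorKind?: string;
}

// Synchronous callbacks keep the sketch short; a real publisher would be async.
function publishWithFenceRecovery(
  publish: () => PublishResult,
  reconnect: () => void,
  autoRecoverFromFence: boolean,
): PublishResult {
  const first = publish();
  if (first.ok || first.errorKind !== "fenced" || !autoRecoverFromFence) {
    return first; // success, a non-fence failure, or recovery disabled
  }
  reconnect();      // exactly one transparent reconnect
  return publish(); // one retry; a second fence is surfaced to the caller as-is
}
```

Because the wrapper never loops, a producer that stays fenced reaches the relay after two publish calls and is then handled like any other fenced result.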
new Relay({
store,
publisher,
retry: {
maxAttempts: 5, // default — 6th failure → DLQ
initialBackoffMs: 1_000, // default — first retry delay
factor: 2, // default — exponential factor
jitter: 0.3, // default — ±30% random jitter
maxBackoffMs: 60_000, // default — cap on the computed delay
backpressureDelayMs: 1_000, // default — see above
quotaMultiplier: 5, // default — see above
},
});

| Knob | Default | Notes |
|---|---|---|
| maxAttempts | 5 | Total attempts including the first. 5 means: try, 4 retries, then DLQ. |
| initialBackoffMs | 1_000 | Backoff before the 2nd attempt (1st retry). |
| factor | 2 | Exponential factor. factor: 2 doubles each retry. |
| jitter | 0.3 | Fraction of randomness: 0.3 means the delay is multiplied by a random value in [0.7, 1.3]. Prevents thundering herds. |
| maxBackoffMs | 60_000 | Caps the computed delay. With the default of 60s, even a raw delay of 1024s (ten doublings) lands at 60s. |
| backpressureDelayMs | 1_000 | Fixed delay for errorKind: "backpressure". |
| quotaMultiplier | 5 | Multiplier on the computed backoff for errorKind: "quota". |
For errorKind: "retriable" with the defaults:
attempt 1 → 1s × jitter
attempt 2 → 2s × jitter
attempt 3 → 4s × jitter
attempt 4 → 8s × jitter
attempt 5 → 16s × jitter ← maxAttempts hit; DLQ
Total time from first failure to DLQ: ~30 seconds.
Tune initialBackoffMs higher for workloads with long-running maintenance windows (e.g. broker rolling restart over 5 minutes). Tune maxAttempts higher for workloads where you'd rather wait hours than land on DLQ.
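
The schedule above can be reproduced with a few lines of arithmetic. The sketch below is one plausible reading of the knobs, not the relay's implementation: the function name backoffMs, the 1-based attempt index (chosen to match the schedule listing), and the choice to cap before applying jitter are all assumptions.

```typescript
interface RetryConfig {
  initialBackoffMs: number;
  factor: number;
  jitter: number;
  maxBackoffMs: number;
}

const DEFAULTS: RetryConfig = { initialBackoffMs: 1000, factor: 2, jitter: 0.3, maxBackoffMs: 60000 };

// attempt is 1-based: the delay scheduled after the nth failed attempt.
// rand is injectable so the jitter can be pinned in tests.
function backoffMs(attempt: number, cfg: RetryConfig = DEFAULTS, rand: () => number = Math.random): number {
  const raw = cfg.initialBackoffMs * Math.pow(cfg.factor, attempt - 1);
  const capped = Math.min(raw, cfg.maxBackoffMs);   // assumed order: cap first, then jitter
  const spread = 1 + cfg.jitter * (2 * rand() - 1); // uniform in [1 - jitter, 1 + jitter]
  return Math.round(capped * spread);
}

// With the jitter pinned to its midpoint the delays match the listing: 1s, 2s, 4s, 8s, 16s.
const noJitter = (): number => 0.5;
console.log([1, 2, 3, 4, 5].map((n) => backoffMs(n, DEFAULTS, noJitter)));
// [ 1000, 2000, 4000, 8000, 16000 ]
```

Raising initialBackoffMs shifts every delay by the same ratio until maxBackoffMs takes over, so a long maintenance window usually needs both knobs adjusted together.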
KafkaPublisher ships two classifiers:
- classifyKafkajsError(err): maps kafkajs's type + code + class name.
- classifyConfluentError(err): maps librdkafka's code + name.
Both return PublishErrorKind. Coverage tables (the RETRIABLE_TYPES, FATAL_TYPES, etc. sets) live in packages/kafka/src/{kafkajs,confluent}-classifier.ts — read them when classifying a new error class for your driver.
Custom drivers set result.errorKind directly. The relay treats absent errorKind as "retriable", so the safe default is to leave it unset and only opt into specific kinds when you've classified the error.
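
For a custom driver, a classifier following that advice might look like the sketch below. The DriverError shape and the function names are hypothetical; the error codes are the ones named earlier on this page, and anything unrecognized is deliberately left unclassified.

```typescript
type PublishErrorKind = "retriable" | "fatal" | "poison" | "backpressure" | "quota" | "fenced";

// Hypothetical error shape for a custom driver; real drivers expose different fields.
interface DriverError {
  code?: string;
}

// Returns a kind only for errors that have been positively classified.
function classifyDriverError(err: DriverError): PublishErrorKind | undefined {
  switch (err.code) {
    case "MESSAGE_TOO_LARGE":
    case "CORRUPT_MESSAGE":
    case "INVALID_RECORD":
      return "poison";
    case "THROTTLING_QUOTA_EXCEEDED":
      return "quota";
    case "__QUEUE_FULL":
      return "backpressure";
    default:
      return undefined; // leave errorKind unset; the relay treats that as "retriable"
  }
}

// Mirrors the relay-side default described above.
function effectiveKind(kind?: PublishErrorKind): PublishErrorKind {
  return kind ?? "retriable";
}
```

Keeping the default branch empty means a new, unknown broker error costs a few extra retries rather than an immediate trip to the DLQ.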
Publish attempt
│
▼
┌─────────────────────┐
│ classifier(err) │ → errorKind
└─────────────────────┘
│
├── retriable / quota / fenced ──► markFailed, next_retry_at scheduled
│ attempts++; eventually DLQ at maxAttempts
│
├── backpressure ─────────────────► requeue (attempts unchanged)
│
└── fatal / poison ──────────────► markFailed (status=dead), DLQ immediately
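
The branching in the diagram can be written as a pure decision function. This is a sketch under stated assumptions, not the relay's code: the Action type, DecideOptions, and decide are hypothetical names, the normal backoff is passed in already computed, and the record is assumed to be dead-lettered once the failed-attempt count reaches maxAttempts.

```typescript
type PublishErrorKind = "retriable" | "fatal" | "poison" | "backpressure" | "quota" | "fenced";

// Hypothetical action type; the real relay calls store methods instead of returning values.
type Action =
  | { type: "markFailed"; delayMs: number } // attempts++ and schedule next_retry_at
  | { type: "requeue"; delayMs: number }    // attempts unchanged
  | { type: "deadLetter" };                 // status=dead, DLQ immediately

interface DecideOptions {
  maxAttempts: number;
  backpressureDelayMs: number;
  quotaMultiplier: number;
  normalBackoffMs: number; // backoff already computed for this attempt
}

// attemptsIncludingThis counts the attempt that just failed.
// Assumption: the record is dead-lettered once that count reaches maxAttempts.
function decide(kind: PublishErrorKind | undefined, attemptsIncludingThis: number, o: DecideOptions): Action {
  switch (kind ?? "retriable") {
    case "fatal":
    case "poison":
      return { type: "deadLetter" };
    case "backpressure":
      return { type: "requeue", delayMs: o.backpressureDelayMs };
    default: {
      // retriable, quota, and fenced share the counted-retry path.
      if (attemptsIncludingThis >= o.maxAttempts) return { type: "deadLetter" };
      const delayMs = kind === "quota" ? o.normalBackoffMs * o.quotaMultiplier : o.normalBackoffMs;
      return { type: "markFailed", delayMs };
    }
  }
}
```

Returning a value instead of touching the store keeps the policy testable on its own; the caller maps each Action onto markFailed, requeue, or the DLQ write.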
- DLQ enrichment + replay → Dead-Letter Queue
- Producer-fenced restart deep-dive → Transactions and EOS
- Configure the publisher's tuning → Kafka Publisher
Repository · Issues · npm: @eventferry/all · MIT