
Queue Semantics support in Kafka Ingestion

Shekhar Prasad Rajak edited this page Apr 18, 2026 · 16 revisions

Ref https://github.com/apache/druid/issues/18439

Motivation

The Core Tension

Kafka is a distributed log. Queues (Pub/Sub, RabbitMQ, SQS) are work distribution systems. When users try to use Kafka as a work distribution system, they hit these friction points:

  1. Scaling workers beyond the partition count is impossible
Topic: 10 partitions
Need: 100 workers for burst processing
Kafka consumer group: 90 workers sit idle
Pub/Sub/RabbitMQ/SQS: all 100 workers active
  2. Rebalancing pauses are unacceptable for elastic workloads
Auto-scaling event: 50 -> 100 workers
Kafka: Triggers rebalancing, all 100 consumers pause for 30-60s
Pub/Sub: Zero pause, new workers start consuming immediately
  3. Per-message error handling is essential for job queues, and head-of-line blocking kills throughput
Message #50 takes 30s to process (slow DB query)
Kafka: Messages #51-#100 wait behind #50 in the same partition
SQS: Messages #51-#100 are processed by other workers immediately

Share Groups Close the Gap

Share Groups make Kafka competitive with queue systems for these use cases while retaining Kafka's strengths (log retention, replay, high throughput, multi-consumer patterns).

Proposed design

TLDR

1. Explicit Acknowledgment Mode

Use share.acknowledgement.mode=explicit so Druid controls exactly when records are acknowledged. Records are ACCEPT-ed only after the segment containing them is atomically registered in the metadata store. This gives at-least-once delivery with no data loss.
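As a minimal sketch of the setting above, the consumer properties could be assembled as follows. The class name, the broker address and the share group name are placeholders invented for illustration; the resulting Properties object is the kind of value that would be handed to a KafkaShareConsumer constructor.

```java
import java.util.Properties;

/** Hypothetical helper: builds share-consumer settings for explicit acknowledgement. */
final class ShareConsumerConfigSketch {
    private ShareConsumerConfigSketch() {}

    static Properties explicitAckProps(String bootstrapServers, String shareGroupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // For a share consumer, group.id names the share group.
        props.setProperty("group.id", shareGroupId);
        // The setting named in the text: no record is acknowledged implicitly.
        props.setProperty("share.acknowledgement.mode", "explicit");
        props.setProperty("key.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.setProperty("value.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        return props;
    }

    public static void main(String[] args) {
        // Placeholder address and group name, for illustration only.
        Properties props = explicitAckProps("localhost:9092", "druid-share-ingest");
        System.out.println(props.getProperty("share.acknowledgement.mode"));
    }
}
```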

2. Two-Thread Architecture

KafkaShareConsumer requires single-threaded access. We split the work:

  • ConsumerThread: owns KafkaShareConsumer exclusively (poll, ack, commit, RENEW)
  • TaskRunnerThread: processes records, builds segments, publishes

The threads communicate via two bounded queues, which gives finer control over the record lifecycle:

  • RecordBuffer: ConsumerThread -> TaskRunner (records to process)
  • AckCommandQueue: TaskRunner -> ConsumerThread (ACCEPT/REJECT/RELEASE/COMMIT)
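The hand-off between the two threads can be sketched with two ArrayBlockingQueue instances. This is an illustration under simplifying assumptions, not the proposed implementation: records are modelled as bare offsets, the class and the END_OF_INPUT sentinel are hypothetical, and no Kafka client is involved. The consumer side uses a non-blocking offer so that a full RecordBuffer cannot stop it from draining the AckCommandQueue, which would otherwise risk a deadlock between the two bounded queues.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Illustrative only: models the two queues with plain offsets instead of Kafka records. */
final class TwoThreadHandoffSketch {
    enum AckType { ACCEPT, REJECT, RELEASE, COMMIT }

    static final class AckCommand {
        final AckType type;
        final long offset; // ignored for COMMIT

        AckCommand(AckType type, long offset) {
            this.type = type;
            this.offset = offset;
        }
    }

    private static final long END_OF_INPUT = -1L; // hypothetical sentinel that ends the demo

    /** Runs both roles and returns the offsets the consumer side saw ACCEPT-ed, in order. */
    static List<Long> run(int recordCount, int queueCapacity) {
        BlockingQueue<Long> recordBuffer = new ArrayBlockingQueue<>(queueCapacity);
        BlockingQueue<AckCommand> ackCommandQueue = new ArrayBlockingQueue<>(queueCapacity);
        List<Long> accepted = new ArrayList<>();

        Thread runner = new Thread(() -> {
            try {
                while (true) {
                    long offset = recordBuffer.take();
                    if (offset == END_OF_INPUT) {
                        break;
                    }
                    // Parse, build and publish would happen here, before ACCEPT is requested.
                    ackCommandQueue.put(new AckCommand(AckType.ACCEPT, offset));
                }
                ackCommandQueue.put(new AckCommand(AckType.COMMIT, END_OF_INPUT));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "TaskRunnerThread");
        runner.start();

        // The calling thread plays the ConsumerThread role.
        try {
            long next = 0;
            boolean endSent = false;
            boolean committed = false;
            while (!committed) {
                // Non-blocking hand-off, so a full RecordBuffer never starves ack draining.
                while (next < recordCount && recordBuffer.offer(next)) {
                    next++;
                }
                if (next == recordCount && !endSent) {
                    endSent = recordBuffer.offer(END_OF_INPUT);
                }
                AckCommand cmd = ackCommandQueue.poll(10, TimeUnit.MILLISECONDS);
                while (cmd != null) {
                    if (cmd.type == AckType.COMMIT) {
                        committed = true; // a real consumer thread would commit here
                    } else if (cmd.type == AckType.ACCEPT) {
                        accepted.add(cmd.offset); // a real consumer thread would acknowledge here
                    }
                    cmd = ackCommandQueue.poll();
                }
            }
            runner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
        return accepted;
    }
}
```

Calling TwoThreadHandoffSketch.run(100, 8) pushes 100 offsets through queues of capacity 8 and returns them in the order they were accepted.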

3. Background Lock Renewal (RENEW)

A scheduled executor posts a RENEW_ALL command at an interval of lockTimeout * 0.4. On each command, ConsumerThread renews all in-flight records, which prevents lock expiry during slow segment building. RENEW is O(1) on the broker (timer cancel + insert, no persistence).
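A rough sketch of the renewal timer, assuming the command queue carries simple string commands (the real AckCommand type is richer) and that the lock timeout is known in milliseconds. The class and method names are hypothetical. The factor 0.4 is computed as 2/5 in integer arithmetic to avoid floating-point rounding.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative only: posts RENEW_ALL at 0.4 x lockTimeout. */
final class RenewSchedulerSketch {
    private RenewSchedulerSketch() {}

    /** 0.4 of the lock timeout, with a floor of 1 ms so the scheduler accepts it. */
    static long renewIntervalMillis(long lockTimeoutMillis) {
        return Math.max(1L, lockTimeoutMillis * 2 / 5);
    }

    static ScheduledExecutorService start(BlockingQueue<String> ackCommandQueue,
                                          long lockTimeoutMillis) {
        long interval = renewIntervalMillis(lockTimeoutMillis);
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "share-group-renew");
            t.setDaemon(true); // do not keep the JVM alive just for renewals
            return t;
        });
        // offer() never blocks: if the queue is full, this tick is skipped
        // instead of stalling the timer thread.
        scheduler.scheduleAtFixedRate(
            () -> ackCommandQueue.offer("RENEW_ALL"),
            interval, interval, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

With a 30 000 ms lock timeout this yields a 12 000 ms interval, so a record gets two renewal attempts before its lock would expire.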

4. DedupCache (NEW -- does not exist in Druid today)

Druid has no record-level deduplication today. With Share Groups, the broker redelivers records on lock timeout or consumer crash. DedupCache stores published offset ranges per partition in the metadata store; it is loaded on task startup and updated atomically with segment publish. Any redelivered record found in the cache is immediately ACCEPT-ed and skipped.
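The lookup side of such a cache could look roughly like the sketch below. It is in-memory only and leaves out the metadata-store persistence and the atomic update with segment publish described above. The class name, the integer partition ids and the use of a TreeMap of inclusive ranges are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: in-memory published-offset ranges, without metadata-store backing. */
final class DedupCacheSketch {
    // partition -> (range start -> range end), both inclusive; ranges are kept disjoint.
    private final Map<Integer, TreeMap<Long, Long>> published = new HashMap<>();

    /** Records that offsets [start, end] of a partition are covered by a published segment. */
    void addPublishedRange(int partition, long start, long end) {
        if (start > end) {
            throw new IllegalArgumentException("start > end");
        }
        TreeMap<Long, Long> ranges = published.computeIfAbsent(partition, p -> new TreeMap<>());
        // Merge with a preceding range that overlaps or touches the new one.
        Map.Entry<Long, Long> prev = ranges.floorEntry(start);
        if (prev != null && prev.getValue() >= start - 1) {
            start = prev.getKey();
            end = Math.max(end, prev.getValue());
            ranges.remove(prev.getKey());
        }
        // Absorb following ranges that overlap or touch the (possibly grown) range.
        Map.Entry<Long, Long> next = ranges.ceilingEntry(start);
        while (next != null && next.getKey() <= end + 1) {
            end = Math.max(end, next.getValue());
            ranges.remove(next.getKey());
            next = ranges.ceilingEntry(start);
        }
        ranges.put(start, end);
    }

    /** True if a redelivered record at this offset was already published. */
    boolean isAlreadyPublished(int partition, long offset) {
        TreeMap<Long, Long> ranges = published.get(partition);
        if (ranges == null) {
            return false;
        }
        Map.Entry<Long, Long> range = ranges.floorEntry(offset);
        return range != null && offset <= range.getValue();
    }

    int rangeCount(int partition) {
        TreeMap<Long, Long> ranges = published.get(partition);
        return ranges == null ? 0 : ranges.size();
    }
}
```

Storing ranges rather than individual offsets keeps the cache small when offsets are published in contiguous runs, and each lookup is a single floorEntry call.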

5. Three Safety Invariants

INVARIANT 1: ACCEPT only after atomic segment registration in metadata store.
INVARIANT 2: Every polled record must reach exactly one of: ACCEPT, RELEASE, REJECT, or task crash.
INVARIANT 3: RENEW_ALL only touches records still in InFlightTracker (decided records removed first).
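One way to make these invariants checkable in code is a tracker that refuses illegal transitions. The sketch below is an assumption about how that might look, not the proposed InFlightTracker: it tracks only offsets per partition, and the segmentRegistered flag stands in for the real check against the metadata store.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Illustrative only: enforces the three invariants on an in-memory set of offsets. */
final class InFlightTrackerSketch {
    enum Decision { ACCEPT, RELEASE, REJECT }

    private final Map<Integer, TreeSet<Long>> inFlight = new HashMap<>();

    /** Called for every polled record. */
    void acquire(int partition, long offset) {
        boolean added = inFlight.computeIfAbsent(partition, p -> new TreeSet<>()).add(offset);
        if (!added) {
            throw new IllegalStateException("already in flight: " + partition + ":" + offset);
        }
    }

    /** Applies exactly one decision to a record and removes it from the in-flight set. */
    void decide(int partition, long offset, Decision decision, boolean segmentRegistered) {
        // Invariant 1: ACCEPT is legal only after the segment is registered.
        if (decision == Decision.ACCEPT && !segmentRegistered) {
            throw new IllegalStateException("Invariant 1: ACCEPT before segment registration");
        }
        // Invariant 2: a record can be decided once; a second decision finds nothing to remove.
        TreeSet<Long> offsets = inFlight.get(partition);
        if (offsets == null || !offsets.remove(offset)) {
            throw new IllegalStateException("Invariant 2: record not in flight or already decided");
        }
    }

    /** Invariant 3: RENEW_ALL sees only records that are still undecided. */
    List<Long> renewTargets(int partition) {
        TreeSet<Long> offsets = inFlight.get(partition);
        if (offsets == null) {
            return new ArrayList<Long>();
        }
        return new ArrayList<Long>(offsets);
    }
}
```

Because decide removes the record before any later renewTargets call can observe it, a decided record can never be renewed.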

Happy case

[Figure: happy case]

Overlapped Processing: RENEW Enables Next Poll

[Figure: overlapped processing, RENEW enables next poll]

Failure Scenario: Task Runner Crash Mid-Processing

[Figure: task runner crash mid-processing]

Failure Scenario: Crash After Publish, Before commitSync

[Figure: crash after publish, before commitSync]

Supervisor Scaling Decision Flow

Goal: improve resource utilisation while maintaining a high-throughput, low-latency end-to-end system.

[Figure: supervisor scaling decision flow]

Code Changes Required

New Interfaces

| Interface | Module | Purpose |
| --- | --- | --- |
| AcknowledgingRecordSupplier<P,S,R> | indexing-service | Parallel to RecordSupplier; subscribe/poll/acknowledge/commit instead of assign/seek/poll |
| AcknowledgeType (enum) | indexing-service | ACCEPT, RELEASE, REJECT, RENEW mapped to Kafka byte values |
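The two interfaces might take a shape like the following. Everything here is a guess at the eventual signatures: the type names carry a Sketch suffix to mark them as hypothetical, the type parameters are assumed to mean partition id, sequence number and record type, and the byte ids (1 to 4) are my understanding of Kafka's own AcknowledgeType and should be verified against the client version in use.

```java
import java.time.Duration;
import java.util.List;
import java.util.Set;

/** Hypothetical Druid-side enum mirroring Kafka's acknowledgement types. */
enum AcknowledgeTypeSketch {
    // Byte ids are assumed to match the Kafka client; verify before relying on them.
    ACCEPT((byte) 1),
    RELEASE((byte) 2),
    REJECT((byte) 3),
    RENEW((byte) 4);

    private final byte id;

    AcknowledgeTypeSketch(byte id) {
        this.id = id;
    }

    byte id() {
        return id;
    }

    static AcknowledgeTypeSketch fromId(byte id) {
        for (AcknowledgeTypeSketch type : values()) {
            if (type.id == id) {
                return type;
            }
        }
        throw new IllegalArgumentException("unknown acknowledge type id: " + id);
    }
}

/**
 * Hypothetical counterpart to RecordSupplier: subscribe/poll/acknowledge/commit
 * replace assign/seek/poll. P = partition id, S = sequence number, R = record type.
 */
interface AcknowledgingRecordSupplierSketch<P, S, R> {
    void subscribe(Set<String> topics);

    List<R> poll(Duration timeout);

    void acknowledge(P partition, S sequence, AcknowledgeTypeSketch type);

    void commit();

    void close();
}
```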

New Classes

| Class | Module | Purpose |
| --- | --- | --- |
| KafkaShareGroupRecordSupplier | kafka-indexing-service | Implements AcknowledgingRecordSupplier; wraps KafkaShareConsumer in explicit mode; dedup at poll time |
| ShareGroupConsumerThread | kafka-indexing-service | Dedicated thread owning KafkaShareConsumer; poll loop, RENEW scheduling, ack command draining |
| ShareGroupIndexTask | kafka-indexing-service | New task type; does not extend SeekableStreamIndexTask; carries topic + consumer config (no start/end offsets) |
| ShareGroupIndexTaskRunner | kafka-indexing-service | Ingestion loop: take from buffer -> dedup -> parse -> add -> persist -> publish -> ACCEPT -> COMMIT |
| DedupCache | indexing-service | Map<Partition, TreeSet>; backed by metadata store; loaded on startup; updated atomically with publish |
| ShareGroupDataSourceMetadata | kafka-indexing-service | Stores published offset ranges; no consumer offsets (broker manages those) |
| InFlightTracker | kafka-indexing-service | Map<Partition, Map<Offset, TrackedRecord>>; tracks per-record state: ACQUIRED, RENEW_PENDING, ACCEPT_PENDING |
| AckCommand | kafka-indexing-service | Command object: type (ACCEPT/REJECT/RELEASE/RENEW_ALL/COMMIT) + offsets + optional CompletableFuture |

Modified Classes

| Class | Change |
| --- | --- |
| KafkaSupervisor | Add groupType field ("consumer" or "share") |

Rejected:

Two-Thread Explicit Acknowledgement Design

Design: Each task ran a consumer thread (poll + RENEW/ACCEPT each record) and a runner thread (parse + publish), connected by a blocking queue. Used Kafka's explicit acknowledgement mode requiring per-record acknowledge() calls.

Why rejected: Adding tasks decreased throughput. With explicit ack, every in-flight record requires a RENEW RPC to the broker on each poll cycle. With N tasks sharing one partition, this creates O(records x N) broker RPCs per cycle. The broker's single-threaded per-partition acknowledgement processing became the bottleneck.

Benchmark (Kafka 4.2.0, 1 partition, 500 RPS producer):

| Tasks | Throughput | Scaling |
| --- | --- | --- |
| 1 | 25 rows/sec | baseline |
| 2 | 12 rows/sec | 0.48x (regression) |