Skip to content

InterviewPrepHub/ImplementKafka

Repository files navigation

ImplementKafka

Design & implement in memory message bus system like apache kafka in java

requirements:

topic management -

  • singleton
  • create & delete topic (CRUD for topic)
  • topic can have multiple partition

producer API

  • publish message to specified topic
  • partition selection based on key.hashcode % partition_count(random hash)
  • key is part of message payload

consumer API

  • subscribe and unsubscribe from topics
  • poll for messages on partition index of the subscribed topic
  • maintain offset tracking per consumer

message ordering

  • message within single partition must maintain their order
  • no ordering gaurantees across different partitions

concurrency

  • thread safe implementations for all components support for concurrent producers and consumer

user multi class design, thread safety, design pattern & system level complexity and use the right data structures

Relational-style schema of entities and their relationships

Relationship between components:

•	Topic (1) ────< (N) Partition   [Manages list of partitions]
•	Partition (1) ────< (N) Message     [Manages list of messages]
•	Consumer (1) ────< (N) Subscription (Topic names)   [Manages list of topics]
•	Consumer (1) ────< (N) Offset (per partition)   [Manages list of offsets]

1. Topic

Columns         Type            Description
topic _id       UUID            primary key
name            String          unique topic name
partitions      List            associated partitions

2. Partition

Columns         Type            Description
partition_id    UUID            primary key
topic_id        UUID            foreign key to topic
index           Integer         partition number (e.g., 0 to N-1 for a topic)
messages        Queue           Ordered queue of messages
offsetCounter   Long            Tracks the next offset to be assigned to a message.

helpful when consumers poll and need to track what offset they are at.

* Why does a Partition have both index and id?

index (int: 0..N-1)

Purpose: routing and array access. Producers pick a partition with: partitionIndex = abs(key.hashCode()) % partitionCount.

The key is a value provided by the producer that is hashed to determine the partition index so that all messages with the same key go to the same partition.

That formula only makes sense on a contiguous range 0..N-1. It also lets you do topic.getPartitions().get(index) in O(1).

partitionId (UUID)

Purpose: identity and admin/logging/persistence.

Globally unique, stable for tracing (“message X belongs to partition UUID=...”), auditing, metrics, and storing offsets externally. Useful if partitions are exported to dashboards or cross-process APIs, or if topics could be recreated.

3. Message

Columns         Type            Description
message_id      UUID            primary key
partition_id    UUID            foreign key to partition
key             String          message key
payload         String          message value
timestamp       Long            message timestamp
offset          Long            message offset within partition

Producer:

  1. Get Topic via TopicManager.getInstance().getTopic(topicName)
  2. Pick partition: • if key: Math.abs(key.hashCode()) % partitionCount • else: random number in [0, partitionCount)
  3. Call: partition.addMessage(key, payload)

Consumer:

  1. Each Consumer instance will: • Subscribe/unsubscribe to one or more topics • Internally track offset per partition • poll(topicName) returns next unread message from any partition

What “Producer is stateless” means

*   The Producer doesn’t keep any business state between calls to publish(...).
*   It doesn’t store offsets, partitions, subscriptions, or per-topic/per-key memory.
*   Each call computes the target partition and delegates to Topic/Partition. So you can safely share one Producer
    instance across threads without extra locking (its correctness doesn’t depend on prior calls).

Which classes are (non)-stateless in your design?

Class Stateless? Why
Producer Yes (recommended with ThreadLocalRandom) No business state; every call is independent.
Message Immutable (thread-safe) All fields final; once built, it never changes. It contains state (data) but no mutable behavior.
Partition Stateful Holds messages and offsetCounter; state changes on publish/consume. Must be thread-safe.
Topic Stateful (mostly static) Owns a fixed list of partitions. Partitions mutate internally as messages arrive.
TopicManager Stateful Registry of topics in a ConcurrentHashMap; creates/deletes topics.
Consumer Stateful Tracks subscriptions and per-partition offsets; state evolves as it polls.

Stateless objects (Producer) can be reused freely across threads, typically no internal locking required.

Immutable data objects (Message) are inherently thread-safe after construction.

Stateful components (Partition, TopicManager, Consumer) must use thread-safe structures and/or locking:

Partition: AtomicLong for offsets; concurrent container / RWLock for storage.

TopicManager: ConcurrentHashMap + synchronized create/delete.

Consumer: concurrent sets/maps for subscriptions & offsets.

About

Demo project for designing & implementing a messaging queue like kafka

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages