# Kafka

* Apache Kafka: open-source distributed event streaming platform
    - used as either a message queue or as a stream processing system
    - high performance, scalable, and durable
    - built to handle high-throughput data streams with low latency

## Basic Terminology and Architecture

* `broker`: the physical/virtual servers that host the Kafka `partitions`
    - they store data and serve clients
    - Kafka clusters are made up of multiple brokers
* `partitions`: ordered, immutable sequence of messages
    - like an append-only log file
    - these help Kafka scale: a topic's partitions can be spread across brokers, so producers can append and consumers can read in parallel
* `topics`: logical grouping of partitions
    - e.g. one topic for basketball matches, another for soccer matches, etc.
    - you publish/subscribe to data in Kafka using topics
    - topics are multi-producer: a topic can have many producers writing to it
    - topics can have multiple partitions that are in different brokers
* `producers`: they write data to the topics
* `consumers`: read data from topics

## How Kafka Works

* when an event occurs, Producers respond by creating messages to be published
* a message consists of the following:
    - value (required)
    - key (optional): determines which partition the message goes to
    - timestamp (optional): records when the event happened or was written; ordering within a partition comes from the offset, not the timestamp
    - headers (optional): similar to HTTP headers that can be used to store metadata about the message
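
To make the message layout concrete, here is a minimal sketch of the four fields as a plain Python dataclass. The class name `Message`, the field name `timestamp_ms`, and the sample values are assumptions for illustration only; this is not a type from any Kafka client library.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Message:
    """Illustrative stand-in for a Kafka message (hypothetical type)."""
    value: bytes                                  # required payload
    key: Optional[bytes] = None                   # optional; used to pick a partition
    timestamp_ms: Optional[int] = None            # optional; event time in milliseconds
    headers: Tuple[Tuple[str, bytes], ...] = ()   # optional metadata pairs


# Only the value is mandatory; everything else falls back to a default.
msg = Message(
    value=b'{"home": 2, "away": 1}',
    key=b"match-42",
    headers=(("source", b"scoreboard"),),
)
```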

### What Happens When a Message is Published?
* when a message is published to a Kafka topic, there are 2 steps:
1. `Partition Determination`: if the message has a key, the key is hashed and the hash is mapped to a specific partition
    - __messages with the same key go to the same partition, as long as the topic's partition count does not change__
    - if there is no key, the producer spreads messages across partitions (round-robin in older clients; newer clients default to a sticky strategy that fills a batch for one partition before moving to the next)
    - or it uses custom partitioning logic supplied in the producer configuration
2. `Broker Assignment`: once a partition is determined, the producer has to find out which broker hosts that partition
    - remember that a broker is a server that hosts the partitions
    - the Kafka cluster metadata maps each partition to the broker that hosts its leader replica
    - this metadata is maintained by the Kafka controller (a role within the broker cluster) and fetched by clients
* once the broker for the partition is found, the producer sends the message directly to that broker, which appends it to the target partition
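
The two steps above can be sketched in a few lines of Python. This is a simplified model, not Kafka's client code: `zlib.crc32` stands in for the hash the real client uses (murmur2), plain round-robin is used for keyless messages, and `PARTITION_LEADERS`, `choose_partition`, and `route` are hypothetical names.

```python
import zlib
from itertools import count
from typing import Optional, Tuple

NUM_PARTITIONS = 3

# Hypothetical cluster metadata: partition id -> broker hosting its leader.
PARTITION_LEADERS = {0: "broker-1", 1: "broker-2", 2: "broker-3"}

_keyless_counter = count()  # drives round-robin for messages without a key


def choose_partition(key: Optional[bytes], num_partitions: int = NUM_PARTITIONS) -> int:
    """Step 1: pick a partition from the key, or round-robin without one."""
    if key is not None:
        # Same key -> same hash -> same partition (while num_partitions is fixed).
        return zlib.crc32(key) % num_partitions
    return next(_keyless_counter) % num_partitions


def route(key: Optional[bytes]) -> Tuple[int, str]:
    """Step 2: look up which broker hosts the chosen partition."""
    partition = choose_partition(key)
    return partition, PARTITION_LEADERS[partition]


partition, broker = route(b"match-42")
print(f"key match-42 -> partition {partition} on {broker}")
```

Note that changing `NUM_PARTITIONS` changes the result of the modulo, which is why the same-key guarantee only holds while the partition count stays fixed.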

### Benefits of Partitions as an Append-Only Log File

* messages are appended to the end of partition like a log file
* this provides several benefits:
1. __Immutability__: once a message is written to the partition, it cannot be modified in place; old messages are only removed later by retention or compaction policies
    - this improves performance and reliability b/c:
        1. replication is simpler
        2. speeds up recovery process
        3. avoids consistency issues common in systems where data can be changed
2. __Efficiency__: since the only operation is append, this minimizes disk seek times which are a major bottleneck in many storage systems
3. __Scalability__: helps with horizontal scaling
    - more partitions can be added and distributed across a cluster of brokers
    - and each partition is replicated across multiple brokers for more fault tolerance
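
As a rough illustration of the append-only idea, the sketch below models one partition as an in-memory list that only supports appending and reading. `PartitionLog` is a hypothetical class for this example; a real broker writes to segment files on disk.

```python
from typing import List


class PartitionLog:
    """Toy append-only log for one partition (illustrative, in memory)."""

    def __init__(self) -> None:
        self._records: List[bytes] = []

    def append(self, value: bytes) -> int:
        """Add a record at the end and return its position in the log."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, start: int, max_records: int = 10) -> List[bytes]:
        """Return up to max_records starting at position start."""
        return self._records[start:start + max_records]

    # Deliberately no update() or delete(): existing records never change.


log = PartitionLog()
first = log.append(b"kickoff")
second = log.append(b"goal")
```

Because there is no way to rewrite an existing entry, copying the log to another machine only ever means sending the records that come after what the copy already has.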

### Offsets

* each message has an offset that determines its position in the Kafka partition
    - think of it like an index in an array
* offsets are used by consumers to track which messages they've read already, like a bookmark
    - when consumers read messages, they maintain the current offset
    - consumers will periodically commit offsets back to Kafka
        * this allows consumers that failed/restarted to resume where they left off before the failure/restart
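
The bookmark behaviour can be sketched with a list (index = offset) and a dictionary acting as the committed-offset store. The names `committed`, `consume`, and the group id `scores-app` are assumptions for illustration; in Kafka the committed offsets are stored by the cluster, not by the consumer process.

```python
from typing import Dict, List, Tuple

# One partition; the list index plays the role of the offset.
partition_log: List[str] = ["m0", "m1", "m2", "m3", "m4"]

# Hypothetical offset store: (consumer group, partition) -> next offset to read.
committed: Dict[Tuple[str, int], int] = {}


def consume(group: str, partition: int, max_records: int) -> List[str]:
    """Read from the last committed offset, then commit the new position."""
    start = committed.get((group, partition), 0)
    batch = partition_log[start:start + max_records]
    # ... process the batch here ...
    committed[(group, partition)] = start + len(batch)  # commit after processing
    return batch


first_run = consume("scores-app", 0, max_records=3)   # reads offsets 0..2

# Simulated restart: the consumer keeps no local state, yet the next call
# resumes at offset 3 because the committed offset survived.
second_run = consume("scores-app", 0, max_records=3)  # reads offsets 3..4
```

Committing after processing, as done here, means a crash between the two steps causes the batch to be read again rather than skipped.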

### Replication

* replication helps ensure availability and durability of messages
* Kafka uses a `leader-follower model` of replication
1. `Leader Replica Assignment`: each partition has a designated Leader replica which resides on a broker
    - handles ALL writes and, by default, all reads for the partition (newer Kafka versions can optionally let consumers fetch from followers)
    - assignment of Leader Replica handled by the cluster controller
        * it ensures that each partition's leader replica is distributed across the cluster to balance load
2. `Follower Replication`: the remaining replicas of a partition are followers (how many depends on the topic's replication factor)
    - their only responsibility is to replicate data from their respective leader replica
    - each replica of a partition resides on a different broker, so a follower never shares a broker with its leader
    - they are essentially backups for the leader replica in case it fails
        * if the leader fails, one of the in-sync follower replicas is promoted to leader
3. `Synchronization and Consistency`: followers continuously sync with the leader replica to get the latest messages
    - very important for maintaining consistency across the cluster
    - if leader replica fails, follower replica gets promoted => minimizes downtime and data loss
4. `Controller's Role in Replication`: manages replication process
    - monitors health of all brokers and manages leadership and replication dynamics
    - if a broker fails, it reassigns the leader role for each affected partition to one of the in-sync follower replicas for continued availability
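
The failover step can be sketched as a small state update performed by the controller. This is a simplified model under stated assumptions: `PartitionState` and `handle_broker_failure` are hypothetical names, and picking the first remaining in-sync replica is a simplification of the real election logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PartitionState:
    leader: str
    replicas: List[str]   # every broker holding a copy of the partition
    isr: List[str]        # in-sync replicas (includes the leader)


def handle_broker_failure(state: PartitionState, failed_broker: str) -> PartitionState:
    """Drop the failed broker from the ISR and elect a new leader if needed."""
    state.isr = [b for b in state.isr if b != failed_broker]
    if state.leader == failed_broker:
        if not state.isr:
            raise RuntimeError("no in-sync replica left to take over")
        state.leader = state.isr[0]  # simplification: first in-sync replica wins
    return state


state = PartitionState(
    leader="broker-1",
    replicas=["broker-1", "broker-2", "broker-3"],
    isr=["broker-1", "broker-3"],   # broker-2 has fallen behind
)
handle_broker_failure(state, "broker-1")
print(state.leader)
```

In this run `broker-2` is skipped even though it holds a replica, because it is not in the in-sync set and promoting it could lose messages it has not yet copied.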

### How Consumers Read Messages

* consumers read messages from Kafka topics using a `pull-based model`
    - consumers repeatedly poll the broker for new messages, at their own pace
* this is by design and has several benefits:
1. lets consumers control their consumption rate
2. simplifies failure handling
3. prevents overwhelming slow consumers
4. enables efficient batching
* [Apache Kafka Pull Design](https://kafka.apache.org/documentation.html#design_pull)
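
A pull loop can be sketched without any Kafka library by letting the consumer ask an in-memory stand-in for the next batch. `FakeBroker`, `fetch`, and `poll_loop` are hypothetical names for this example; a real client would block or sleep between empty polls instead of stopping.

```python
from typing import List


class FakeBroker:
    """In-memory stand-in for a broker holding one partition."""

    def __init__(self, records: List[str]) -> None:
        self._records = list(records)

    def fetch(self, offset: int, max_records: int) -> List[str]:
        """Return up to max_records starting at offset (empty if caught up)."""
        return self._records[offset:offset + max_records]


def poll_loop(broker: FakeBroker, max_records: int = 2, max_polls: int = 10) -> List[str]:
    """Pull batches at the consumer's own pace until caught up."""
    offset = 0
    processed: List[str] = []
    for _ in range(max_polls):
        batch = broker.fetch(offset, max_records)   # the consumer asks; nothing is pushed
        if not batch:
            break                                   # caught up; a real consumer would poll again later
        processed.extend(record.upper() for record in batch)  # placeholder "processing"
        offset += len(batch)
    return processed


result = poll_loop(FakeBroker(["a", "b", "c"]))
```

The consumer chooses `max_records` and when to call `fetch`, which is what lets a slow consumer fall behind safely and a fast one pull larger batches.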

## When to use Kafka in your Interview