# Apache Kafka
Apache Kafka is an open-source distributed event streaming platform designed for building real-time data pipelines and streaming applications. Streaming data is data that is continuously generated by thousands of data sources, which typically send their records simultaneously. A streaming platform needs to handle this constant influx of data and process it sequentially and incrementally.

## Analogy
Imagine a news agency that collects stories and distributes them to various news outlets (like TV channels, websites, and newspapers).

**Producers (Reporters)**:

- Reporters gather stories (data) from different locations and send them to the central news agency.
- These stories are tagged by category (e.g., politics, sports, entertainment).

**Topics (Categories)**:

- The news agency organizes stories into categories like politics, sports, or entertainment. These categories are like Kafka topics.
- Each category has multiple folders (partitions) to store stories.

**Brokers (News Editors)**:

- The news editors manage the categories (topics), decide where stories go, and ensure they are accessible to news outlets.

**Consumers (News Outlets)**:

- Different outlets (TV, websites, newspapers) subscribe to categories they are interested in (e.g., a sports website subscribes to the sports category).
- Each outlet decides when to fetch stories, and they can replay old stories if they want.

## what is it used for?
Kafka is used to build real-time streaming data pipelines and real-time streaming applications. A data pipeline reliably processes and moves data from one system to another, and a streaming application is an application that consumes streams of data. For example, if you want to create a data pipeline that takes in user activity data to track how people use your website in real-time, Kafka would be used to ingest and store streaming data while serving reads for the applications powering the data pipeline. Kafka is also often used as a message broker solution, which is a platform that processes and mediates communication between two applications.

## key components
- **topics**: A category or feed name under which messages (records) are stored.
- **producers**: Publish messages to topics.
- **brokers**: Kafka servers that store the topic data and manage partitions.
- **consumers**: Read messages from topics and process them.
- **consumer groups**: A set of consumers that work together to read data from Kafka topics. Each partition is assigned to only one consumer in a group, so a message is not processed twice within the same group.
- **partitions**: A topic in Kafka is divided into multiple partitions. Each partition is an independent, ordered log of messages.
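
The relationship between keys, partitions, and consumer groups can be sketched in a few lines of plain Python. This is a toy model rather than Kafka's actual implementation: `partition_for` and `assign_partitions` are hypothetical helper names, `zlib.crc32` stands in for the hash used by a real client's partitioner, and the round-robin assignment is only one possible strategy.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index.

    Assumption: crc32 is a stand-in hash for illustration only.
    The point is that the same key always lands in the same partition.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions over the consumers of one group, round-robin.

    Each partition is given to exactly one consumer, so no record is
    handled twice within the group.
    """
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Hypothetical topic with 4 partitions, read by a group of 2 consumers.
group_assignment = assign_partitions([0, 1, 2, 3], ["consumer-a", "consumer-b"])
```

If a group has more consumers than the topic has partitions, the extra consumers receive no partitions in this sketch, which is why the partition count bounds the parallelism of a group.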

## how does it work?
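Kafka combines two messaging models, queuing and publish-subscribe, to provide the key benefits of each to consumers. Queuing lets the work of processing be split across several consumers, while publish-subscribe lets multiple independent subscribers each receive every message.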
- Producers send messages to a specific topic in Kafka.
- Kafka stores messages in a fault-tolerant, distributed log across its brokers. A log is an ordered sequence of records, and each topic's log is split into partitions that can be spread across brokers and read in parallel by the consumers in a group.
- Consumers read messages from topics, either in real-time or by replaying them from a specific point.
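
The flow above can be illustrated with a minimal in-memory sketch of a single partition. It assumes nothing about Kafka's real storage format: `PartitionLog` is a hypothetical class, and the only ideas it demonstrates are that records are appended in order, each record gets an offset, and reading does not remove anything.

```python
class PartitionLog:
    """Toy append-only log for one partition (illustrative only)."""

    def __init__(self) -> None:
        self._records: list[str] = []

    def append(self, value: str) -> int:
        """Store a record at the end of the log and return its offset."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> list[str]:
        """Return up to max_records starting at offset; the log is unchanged."""
        return self._records[offset:offset + max_records]


log = PartitionLog()
for story in ["goal scored", "match ends", "trophy lifted"]:
    log.append(story)  # the producer side

# The consumer tracks its own position (offset) in the log.
position = 0
batch = log.read(position)
position += len(batch)

# Replaying: read again from an earlier offset of the consumer's choosing.
replayed = log.read(1)
```

Because the consumer owns its position, "real-time" reading and replaying are the same operation with a different starting offset.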

## architecture

![kafka](kafka.png)

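- ZooKeeper is used for cluster coordination, managing metadata, leader election, and configuration in Kafka. Newer Kafka versions can instead run in KRaft mode, which moves this coordination role into Kafka itself.
- Each partition has a primary broker (leader) that, by default, handles all read and write requests for that partition.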
- Replica brokers (followers) maintain copies of the data and can take over as leader if the primary broker fails.
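
A rough sketch of the leader and follower roles, assuming a simplified model in which a new leader is picked from the replicas that are still in sync. The `Partition` dataclass, the `handle_broker_failure` function, and the broker names are hypothetical; real clusters delegate this decision to the coordination layer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Partition:
    replicas: list[str]     # brokers holding a copy, in preference order
    leader: Optional[str]   # broker currently serving reads and writes
    in_sync: set[str]       # replicas that are fully caught up


def handle_broker_failure(partition: Partition, failed: str) -> None:
    """Drop a failed broker and, if it was the leader, promote a follower."""
    partition.in_sync.discard(failed)
    if partition.leader == failed:
        candidates = [b for b in partition.replicas if b in partition.in_sync]
        # With no in-sync replica left, the partition has no leader.
        partition.leader = candidates[0] if candidates else None


p = Partition(
    replicas=["broker-1", "broker-2", "broker-3"],
    leader="broker-1",
    in_sync={"broker-1", "broker-2", "broker-3"},
)
handle_broker_failure(p, "broker-1")  # broker-2 takes over as leader
```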

## benefits
**Scalable**
- Kafka’s partitioned log model allows data to be distributed across multiple servers, making it scalable beyond what would fit on a single server. 

**Fast**
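- Kafka decouples producers from consumers, so neither side has to wait for the other. Combined with sequential, append-only writes to the log, this keeps latency low and throughput high.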

**Durable**
- Partitions are distributed and replicated across many servers, and the data is all written to disk. This helps protect against server failure, making the data very fault-tolerant and durable. 

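## Key Differences vs MQ
- Retention: Kafka keeps data for a configured retention period (or size limit) whether or not it has been consumed, while a traditional message queue (MQ) typically deletes messages after consumption.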
- Scalability: Kafka scales better with partitions for massive data loads.
- Consumer Behavior: Kafka consumers can replay messages, while MQ consumers process and remove them.
- Focus: Kafka focuses on high throughput and distributed event streaming, while MQ specializes in reliable message delivery.
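
The difference in consumer behavior can be shown with a small, self-contained comparison. This is a conceptual sketch only: a `deque` stands in for a traditional queue, a plain list stands in for a Kafka partition, and the group names `billing` and `analytics` are made up for illustration.

```python
from collections import deque

# Traditional queue: consuming a message removes it for everyone.
queue = deque(["order-1", "order-2", "order-3"])
taken = queue.popleft()  # "order-1" is now gone from the queue

# Kafka-style log: records stay put; each group only moves its own offset.
log = ["order-1", "order-2", "order-3"]
offsets = {"billing": 0, "analytics": 0}


def poll(group: str) -> list[str]:
    """Return the records this group has not seen yet and advance its offset."""
    records = log[offsets[group]:]
    offsets[group] = len(log)
    return records


billing_batch = poll("billing")      # sees all three records
analytics_batch = poll("analytics")  # also sees all three records

# Replay: rewinding a group's offset makes the old records readable again.
offsets["billing"] = 0
billing_replay = poll("billing")
```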