# Kafka Hands-On

> Kafka is a publish-subscribe (pub-sub) messaging system used for streaming data in real time (in addition to its ability to process batch data).

To handle massive volumes of data arriving at high velocity, a robust Kafka system is built from several components.

As a quick reminder, below is the overall Kafka topology at a high level:

<p align="center">

<img src= "images/Kafka_Architecture2.png">

</p>


## Kafka Terminology

Before getting into the details of what each component is responsible for, let's first define some of the main Kafka terms we'll be using:

### Events
> An _event_ in Kafka is used to store a fact that has occurred. When Kafka reads or writes any data, it does so in the form of events. Events can be thought of as data points.

An event typically contains the following properties:
- Key
- Value
- Timestamp
- Metadata headers (optional)
    
Below is an example Event:

- Event Key: "AiCore"

- Event Value: "User made a payment of $300"

- Event Timestamp: "March 22, 2021 at 3:07 p.m."

### Producers

> A _producer_ is the component that creates data for Kafka. It's an API that enables another tool or application to publish (write) data to one or more Kafka topics.

In essence, a producer provides data that Kafka will then ingest. The producer API comes built-in with Kafka and can be configured to connect to a wide variety of data sources.

### Consumers

> A _consumer_ is the component that reads data from Kafka. It's also an API, responsible for reading data records from one or more Kafka topics.

One of Kafka's key strengths is that producers and consumers are decoupled: they don't depend on one another to complete their tasks. This design choice is one of the reasons Kafka scales so well. A Kafka cluster can serve any number of producers and consumers, and the number of producers doesn't have to equal the number of consumers.

### Topics

> Events are organised and stored in a Kafka _topic_. A topic is analogous to a folder in a filesystem, and the events are analogous to the files stored within that folder.

Events (data points) stored within a Kafka topic can be consumed once, or as often as required, because Kafka does not delete data once it has been consumed. Instead, you define how long data is kept by configuring a retention setting (for example, the broker's `log.retention.hours` property covered later). Kafka's performance is effectively constant with respect to data size, so storing data for a long time is perfectly fine.

### Partitions

> Every Kafka topic is composed of one or more _partitions_, or sub-divisions. When a new data point is written, it is appended to one of the topic's partitions.

Events have keys that identify them. Events written with the _same key_, such as those having the same `Customer ID`, go to the _same partition_. Within a partition, Kafka guarantees that events are read in exactly the order in which they were written; no ordering is guaranteed across different partitions.
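
The key-to-partition rule can be illustrated with a toy sketch in plain shell. It assumes a simple hash-modulo scheme with `cksum` standing in for the hash; Kafka's default partitioner uses a different hash function (murmur2), so the partition numbers printed here will not match a real cluster. The function name `partition_for` and the partition count of 3 are hypothetical.

```shell
#!/bin/sh
# Toy model: choose a partition as hash(key) mod number_of_partitions.
# cksum is only a stand-in hash; real Kafka clients use murmur2.
NUM_PARTITIONS=3

partition_for() {
  # cksum prints "<checksum> <byte count>"; keep only the checksum.
  hash=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo $(( hash % NUM_PARTITIONS ))
}

# The same key always maps to the same partition.
for k in customer-42 customer-7 customer-42; do
  echo "key=$k -> partition $(partition_for "$k")"
done
```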

Below is a diagram showing what a Kafka topic and its partitions look like:

<p align="center">
<img src= "images/Kafka_Topics.png" width=600>
</p>

## Kafka Components

To recap what we've covered so far in Lessons 1 and 2, there are __five main components__ in a Kafka System:

__1.	Broker:__ 
- A broker is a Kafka node (server) that is part of the Kafka system
- A Kafka __cluster__ is composed of one or more brokers, usually several
- Each broker has a unique ID
- Brokers store the topic log partitions
- Brokers handle all requests from clients (produce, consume, and metadata) and keep data replicated within the cluster

For a video explanation on brokers, please watch the following:

- [__Brokers Introduction Video__](https://www.youtube.com/embed/jHnyBSUVcOU)

A broker can be configured by updating its properties in the `server.properties` file.

The important configuration options for a Broker are:
1. `broker.id`
    - This is an integer that must be set as a unique value for each broker
2. `listeners`
    - This is the address that the socket server listens on (hostname and the port)
3. `advertised.listeners`
    - This is the hostname and port the broker will advertise to the producers and consumers
4. `log.dirs`
    - This is a comma-separated list of directories under which to store the log files
5. `num.partitions`
    - The default number of partitions per topic
6. `log.retention.hours`
    - The minimum age, in hours, a log file must reach before it becomes eligible for deletion
7. `zookeeper.connect`
    - A comma-separated list of `host:port` pairs, each corresponding to a Zookeeper server
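
As a rough illustration, the options above might look like this for a single local broker. The sketch writes them to a scratch file (`server.example.properties`, a hypothetical name) rather than touching your real `config/server.properties`. All values, including the `data/kafka` log directory and the Zookeeper port `2181`, are assumptions for a local setup, not recommendations.

```shell
#!/bin/sh
# Write an example broker configuration to a scratch file for inspection.
# All values are illustrative assumptions for a single local broker.
EXAMPLE_FILE="server.example.properties"

cat > "$EXAMPLE_FILE" <<'EOF'
# Unique integer ID of this broker
broker.id=0
# Address the socket server listens on
listeners=PLAINTEXT://127.0.0.1:9092
# Address advertised to producers and consumers
advertised.listeners=PLAINTEXT://127.0.0.1:9092
# Where the partition log files are stored
log.dirs=data/kafka
# Default number of partitions for new topics
num.partitions=1
# Log files become eligible for deletion after 7 days
log.retention.hours=168
# Zookeeper connection string (host:port)
zookeeper.connect=localhost:2181
EOF

# Show only the active (non-comment) settings.
grep -v '^#' "$EXAMPLE_FILE"
```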




__2.	Zookeeper:__
- Kafka uses Zookeeper to manage service discovery for brokers (e.g. when a new broker joins or a broker dies)
- Zookeeper originated in the Hadoop technology stack and is external to the Kafka software itself, even though the Kafka version used here needs it to run
- Zookeeper maintains the state of the cluster (brokers, topics, users)

The configuration for Zookeeper can be found in the `zookeeper.properties` file. The important configuration options which can be found there are:

1. `dataDir`
    -   The directory where the snapshots will be stored
2. `clientPort`
    -   The port which clients will use to connect
3. `maxClientCnxns`
    -   The maximum number of connections a single client IP address may open; setting it to `0` disables the per-IP limit
4. `admin.enableServer`
    -   Enables/disables the Zookeeper admin server; disabling it avoids port conflicts



## Setting up Kafka and configuring Zookeeper

Before we get into things, make sure that Kafka is downloaded and untarred on your machine.

You'll need to create a new "data" folder which will store the data records. Create this in the same folder that contains Kafka, so that when you list the contents, it is the same as shown below.

Within the "data" folder, we need to create `kafka` and `zookeeper` subfolders using the commands below.
These folders are needed before we update the configuration files.

    mkdir data
    mkdir data/kafka
    mkdir data/zookeeper
    ls -al

Now, your output should be similar to:

![](images/kafka-data.png)

Next, we'll need to update the __Zookeeper properties__ file (`zookeeper.properties`) and the __Kafka server properties__ file (`server.properties`) for our local setup, starting with the new data directory we just created.

Within the `config` folder, edit the `zookeeper.properties` file so that the `dataDir` key has the value `data` (the relative path to your data folder).

    dataDir=data
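
If you prefer to make this change from the command line, a sketch along the following lines should work. It operates on a scratch copy so that it can be tried safely. The helper name `set_property` and the file `zookeeper.example.properties` are hypothetical, and the starting value `/tmp/zookeeper` is assumed to be what the sample config ships with.

```shell
#!/bin/sh
# set_property FILE KEY VALUE: rewrite the line "KEY=..." in FILE to "KEY=VALUE".
# A temporary file is used instead of `sed -i`, whose syntax differs between GNU and BSD sed.
# Note: this simple version assumes VALUE contains no "|" or "&" characters.
set_property() {
  file="$1"; key="$2"; value="$3"
  sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Scratch copy standing in for config/zookeeper.properties.
printf 'dataDir=/tmp/zookeeper\nclientPort=2181\n' > zookeeper.example.properties

set_property zookeeper.example.properties dataDir data
cat zookeeper.example.properties
```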

Finally, again within the `config` folder, edit the `server.properties` file so that the `listeners` key has the value `PLAINTEXT://127.0.0.1:9092`.

    listeners=PLAINTEXT://127.0.0.1:9092

The next step is to start the actual Kafka Cluster.

To achieve this, we first need to start Zookeeper by running its start script and passing it the config file, as shown below.

_Note that Kafka and Zookeeper run on Java, so at this point you may need to install Java from [here](https://www.oracle.com/java/technologies/downloads/#jdk17-mac) to prevent an error._

    bin/zookeeper-server-start.sh config/zookeeper.properties

To read more about the __`zookeeper-server-start.sh`__ command and its various properties, please check the [official documentation](https://zookeeper.apache.org/doc/r3.6.3/zookeeperTools.html).

_Note: `zookeeper-server-start.sh` and `zkServer.sh` serve essentially the same purpose. The naming difference exists because the former ships with the Kafka distribution while the latter is the script from Zookeeper itself._

Zookeeper will print a long stream of log messages while it starts, which can take a little while.

Now, if everything works correctly you should see an output similar to the following:

![](images/kafka-zookeeper.png)


After displaying several updates, the terminal will remain open and the cursor will be blinking.  This is normal and to be expected.  We'll leave this terminal open and continue.

The next step is to run the Kafka Broker.

Open a second Ubuntu terminal session (and make sure to keep the Zookeeper one open) and type the following command:

    cd kafka_2.13-3.0.0
    bin/kafka-server-start.sh config/server.properties

The __`kafka-server-start.sh`__ command starts the Kafka server.

It takes the following arguments:
- Path to the server properties file
- `--override property=value` (optional), which overrides a value from the properties file
 


If all runs correctly, you should see a long output that looks similar to the following:

![](images/kafka-server.png)


__3.	Topic:__
- A topic is a category/feed name under which data records are published and stored
- All Kafka data records are organized into topics
- Producers write data to the topics and consumers read data from the topics
- Data records published in the cluster remain persisted until a specified retention period has passed
- Each topic is divided into _partitions_, each of which contains records in an immutable, ordered sequence. Each record within a partition is assigned and identified by an offset (ID) that is unique within that partition

Below is a visual representation of how these different components interact with one another:


<p align="center">
<img src= "images/Kafka Zookeeper Brokers.png" width=600>
</p>

Check out this [video](https://www.youtube.com/embed/kj9JH3ZdsBQ) for more info on topics.

Now we have both the Zookeeper and Kafka server running.  This prepares us to start producing and consuming messages!

The next step is to open another Ubuntu terminal and create a Kafka topic, which we'll use to share messages between the local producer and consumer.

There are some required parameters: the __number of partitions__ and the __replication factor__, along with the __topic name__ and the __server details__.

For the time being, we'll keep things simple and limit both the number of partitions and the replication factor to 1.

We can create the topic by running the command shown below.

    bin/kafka-topics.sh --create --topic MyFirstKafkaTopic --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092

The `kafka-topics.sh` command is used to create and configure Kafka Topics.

You can read more about `kafka-topics.sh` [here](https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-admin-TopicCommand.html).

Here are some important arguments to be aware of:
- `--create`:
    -   Creates a new topic. This is required only the first time we work with a new topic.
- `--topic`:
    -   The name of the topic to create, alter, describe or delete.
- `--partitions`:
    -   The number of partitions for the topic being created or altered.
- `--replication-factor`:
    -   The number of copies of each partition kept in the cluster. A replication factor of three means there are three copies of every message in the Kafka cluster. This is mandatory if there is no default set up in the cluster itself.
- `--bootstrap-server`:
    -   The Kafka server to connect to. Use `localhost:9092` for a local stand-alone environment.

__Important Optional arguments:__
- `--alter`:
    -   Alters the number of partitions, replica assignment and/or configuration for the topic
- `--config`:
    -   A topic configuration override for the topic being created or altered. Allows configurations such as:
        -   Cleanup policy
        -   Compression type
        -   Delete retention time
        -   Flush interval for messages
- `--delete`:
    -   Deletes a topic
- `--describe`:
    -   Lists the details for a particular topic
- `--list`:
    -   Lists all the available topics
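
To see some of the optional arguments in action, the following sketch lists all topics, describes `MyFirstKafkaTopic`, and shows (commented out) how the topic could be deleted. It assumes you run it from the Kafka installation folder with the broker listening on `localhost:9092`; the variable names are only for illustration. The guard simply skips the commands when the Kafka scripts are not found.

```shell
#!/bin/sh
# Assumptions: current directory is the Kafka folder, broker runs on localhost:9092.
BOOTSTRAP_SERVER="localhost:9092"
TOPIC="MyFirstKafkaTopic"

if [ -x bin/kafka-topics.sh ]; then
  # List every topic in the cluster.
  bin/kafka-topics.sh --list --bootstrap-server "$BOOTSTRAP_SERVER"

  # Show the partition count, replication factor and partition leaders for one topic.
  bin/kafka-topics.sh --describe --topic "$TOPIC" --bootstrap-server "$BOOTSTRAP_SERVER"

  # Uncomment to delete the topic (this removes its data).
  # bin/kafka-topics.sh --delete --topic "$TOPIC" --bootstrap-server "$BOOTSTRAP_SERVER"
else
  echo "bin/kafka-topics.sh not found: run this from the Kafka folder" >&2
fi
```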


Your output for the `--create` command should look something like:

![](images/kafka-topic.png)

__4.  Producer:__ 
- Connects to a Kafka cluster through one or more of its brokers (see the `bootstrap.servers` setting below)
- Sends records (data) to a broker

Check out this [video](https://www.youtube.com/embed/I7zm3on_cQQ) on producers.

A producer can be configured and the properties can be updated in the `producer.properties` file. Here are the important configuration options you should know about:
1. `bootstrap.servers`
    - This is a list of Brokers used for bootstrapping knowledge about the rest of the cluster.
2. `compression.type`
    - Specifies the compression codec for all data generated (none, gzip, snappy, lz4, zstd)
3. `partitioner.class`
    - Name of the partitioner class used to assign events to partitions (by default, events with a key are assigned by hashing the key, and events without a key are spread across partitions)
4. `request.timeout.ms`
    - The maximum amount of time the client will wait for the response to a request
5. `buffer.memory`
    -  The total bytes of memory the producer can use to buffer records waiting to be sent to the server



Now, we need to open two additional terminals (bringing the total number of open Linux terminals to five): one for the producer and one for the consumer.

We will create a producer which sends messages to the topic. These are then read by the consumer and printed to the terminal's standard output.

To create a producer, open a new terminal and run the following from the Kafka folder:

    bin/kafka-console-producer.sh --topic MyFirstKafkaTopic --bootstrap-server localhost:9092

For further information about this command, check the following [documentation](https://docs.cloudera.com/runtime/7.2.0/kafka-managing/topics/kafka-manage-cli-producer.html?).

__Required Arguments:__
- `--topic`:
    -   The name of the topic to which the producer will send data
- `--bootstrap-server`:
    -   The Kafka server to connect to


__Important Optional Arguments:__
- `--batch-size`:
    -   Number of messages to send in a single batch if they're not being sent synchronously (default is 200)
- `--compression-codec`:
    -   The data compression codec used. If the flag is given without a value, gzip is used. Can be one of the following:
        - None
        - Gzip
        - Snappy
        - Lz4
        - Zstd
- `--max-memory-bytes`:
    -   The total memory used by the producer to buffer records waiting to be sent to the server
- `--max-partition-memory-bytes`:
    -   The buffer memory size allocated for a partition. When data records are received which are smaller than this size, the producer will attempt to group them together until the specified size is reached
- `--property`:
    -   A mechanism to pass user-defined properties in the form `key=value` to the message reader. This allows custom configurations for a user-defined message reader
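
As one illustration of the property mechanism, the console producer's default message reader accepts `parse.key` and `key.separator`, which let you send keyed events from the terminal. The sketch below pipes two `key:value` lines into the producer. It assumes the Kafka folder is the current directory and a broker is running on `localhost:9092`; the second sample event is made up for illustration.

```shell
#!/bin/sh
# Assumptions: current directory is the Kafka folder, broker runs on localhost:9092.
BOOTSTRAP_SERVER="localhost:9092"
TOPIC="MyFirstKafkaTopic"

# Each line is one event in the form key:value (sample data for illustration).
KEYED_EVENTS='AiCore:User made a payment of $300
AiCore:User requested a refund'

if [ -x bin/kafka-console-producer.sh ]; then
  # parse.key=true tells the reader to split each line into key and value
  # at the first occurrence of key.separator.
  printf '%s\n' "$KEYED_EVENTS" | bin/kafka-console-producer.sh \
    --topic "$TOPIC" \
    --bootstrap-server "$BOOTSTRAP_SERVER" \
    --property parse.key=true \
    --property key.separator=:
else
  echo "bin/kafka-console-producer.sh not found: run this from the Kafka folder" >&2
fi
```

Because both events share the key `AiCore`, they end up in the same partition.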


__5.	Consumer:__ 
- Consumes batches of records (data) from the broker.
- Consumers can specify both the Topic and Partition from which they will consume data
- There are two types of Consumers:
    - Low-level
    - High-level

Check out this video on [consumers](https://www.youtube.com/embed/Z9g4jMQwog0).

A Consumer can be configured and the properties can be updated in the __consumer.properties__ file.

__Core configurations for a Consumer consist of:__
1. `bootstrap.servers`
    - This is a list of Brokers used for bootstrapping knowledge about the rest of the cluster.
2. `group.id`
    - The Consumer group ID
3. `auto.offset.reset`
    - Tells the Consumer what to do when there is no initial offset in Kafka or if the current offset does not exist anymore on the server.  Options include latest, earliest, none


To continue with our hands-on example, the final remaining step is to call the Kafka Consumer.  

To achieve this, open a new Ubuntu terminal for the __Consumer__ alongside the producer terminal and run the following command from the Kafka folder:

    bin/kafka-console-consumer.sh --topic MyFirstKafkaTopic --from-beginning --bootstrap-server localhost:9092

The __`kafka-console-consumer.sh`__ command is used to start the Consumer, which will then read (consume) the data records from the specified Topic.

__Required Arguments:__
- `--topic`:
    -   The name of the topic from which to consume data records.
- `--bootstrap-server`:
    -   The Kafka server to connect to.

__Important Optional Arguments:__
- `--consumer-property`:
    -   A mechanism to pass user-defined properties in the form `key=value` to the Consumer.
- `--consumer.config`:
    -   The Consumer configuration properties file. Note that the `--consumer-property` settings take precedence over this file.
- `--from-beginning`:
    -   Tells the Consumer that if it doesn't already have an established offset to consume from, it should start with the earliest message present in the log.
- `--max-messages`:
    -   The maximum number of messages to consume before exiting. If it's not set, consumption is continual.
- `--offset`:
    -   A non-negative number representing the offset to consume data records from. Can also be 'earliest' or 'latest'. The default is 'latest'.
- `--partition`:
    -   The partition to consume data records from. Consumption starts from the end of the partition unless `--offset` is specified.
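
A few of these options can be combined. The sketch below reads at most three messages from the beginning of the topic and prints each key next to its value. `print.key` and `key.separator` are properties of the console consumer's default formatter, passed with `--property`. The broker address and the limit of three are assumptions for this walkthrough.

```shell
#!/bin/sh
# Assumptions: current directory is the Kafka folder, broker runs on localhost:9092.
BOOTSTRAP_SERVER="localhost:9092"
TOPIC="MyFirstKafkaTopic"
MAX_MESSAGES=3

if [ -x bin/kafka-console-consumer.sh ]; then
  # Exits after MAX_MESSAGES messages have been read; until then it keeps waiting.
  bin/kafka-console-consumer.sh \
    --topic "$TOPIC" \
    --bootstrap-server "$BOOTSTRAP_SERVER" \
    --from-beginning \
    --max-messages "$MAX_MESSAGES" \
    --property print.key=true \
    --property key.separator=:
else
  echo "bin/kafka-console-consumer.sh not found: run this from the Kafka folder" >&2
fi
```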

As always, make sure to check [the documentation](https://docs.cloudera.com/runtime/7.2.10/kafka-managing/topics/kafka-manage-cli-consumer.html) for more detail.

Now, with both terminal windows side by side, click on the __Producer__ terminal and enter the following JSON data:

`{
  "vehicleId":"0bf45cac-d1b8-4364-a906-980e1c2bdbcb",
  "vehicleType":"Taxi",
  "routeId":"Route-37",
  "longitude":"-95.255615",
  "latitude":"33.49808",
  "timestamp":"2017-10-16 12:31:03",
  "speed":49.0,
  "fuelLevel":38.0
}`

`{
  "vehicleId":"d5fd4b42-3742-11ec-8d3d-0242ac130003",
  "vehicleType":"Bus",
  "routeId":"Route-32",
  "longitude":"-81.615234",
  "latitude":"13.56599",
  "timestamp":"2017-10-17 14:22:03",
  "speed":37.0,
  "fuelLevel":19.0
}`

`{
  "vehicleId":"04be0177-8326-4b59-a15d-19f015c0be63",
  "vehicleType":"Passenger",
  "routeId":"Route-19",
  "longitude":"-15.611331",
  "latitude":"44.59816",
  "timestamp":"2017-10-18 09:07:08",
  "speed":75.0,
  "fuelLevel":48.0
}`
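
One caveat worth knowing: by default the console producer treats every line of input as a separate message, so a pretty-printed JSON document pasted as-is would arrive as several fragments rather than one event. A possible workaround, sketched here with standard shell tools, is to collapse the document onto a single line before sending it. The variable names are illustrative, and the final (commented) pipe to the producer assumes the same topic and broker used throughout this walkthrough.

```shell
#!/bin/sh
# Pretty-printed event, copied from the first sample record.
EVENT='{
  "vehicleId":"0bf45cac-d1b8-4364-a906-980e1c2bdbcb",
  "vehicleType":"Taxi",
  "routeId":"Route-37",
  "longitude":"-95.255615",
  "latitude":"33.49808",
  "timestamp":"2017-10-16 12:31:03",
  "speed":49.0,
  "fuelLevel":38.0
}'

# Remove the line breaks and squeeze runs of spaces so the JSON fits on one line.
ONE_LINE=$(printf '%s' "$EVENT" | tr -d '\n' | tr -s ' ')
printf '%s\n' "$ONE_LINE"

# To send it as a single message (assumes the broker and topic from this walkthrough):
# printf '%s\n' "$ONE_LINE" | bin/kafka-console-producer.sh --topic MyFirstKafkaTopic --bootstrap-server localhost:9092
```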

You should see the messages automatically show up on the Consumer terminal, similar to the output below:

![](images/kafka-producer-consumer.png)

Now you should be getting the hang of Kafka.