# Kafka-Python

> kafka-python is a Python client that lets you interact with Apache Kafka in a Pythonic way. With it you can write Python code to perform many of the Kafka tasks you would otherwise run from the terminal, such as creating topics and producing data to them.

The client has six main API classes, some of which will be familiar if you've used Kafka before. Their main uses and some useful methods are listed below.

- __KafkaConsumer__ - Allows the consumption of records from a Kafka Cluster:
    - `assign()` - Manually assign a list of partitions to this consumer
    - `subscribe()` - Subscribe to a list of Kafka topics
    - `topics()` - Return all topics this consumer is authorised to view<br><br>

- __KafkaProducer__ - Client that publishes records to a Kafka cluster:
    - `send()` - Publish a message to a topic
    - `metrics()` - Get metrics about the producer's performance<br><br>
- __KafkaAdminClient__ - The class to perform any administration of your Kafka cluster:
    - `create_topics()` - From a list of topics create new topics in the cluster
    - `list_consumer_groups()` - List all consumer groups known to the cluster
    - `create_partitions()` - Create additional partitions for an existing topic<br><br>

- __KafkaClient__ - Low-level network client that handles asynchronous request/response I/O with the brokers:
    - `add_topic()` - Add a topic to a list of topics tracked by metadata
    - `set_topics()` - Set topics to track for metadata
    - `bootstrap_connected()` - Returns true if the bootstrap server is connected<br><br>

- __BrokerConnection__ - Initialise a connection to a Kafka Broker:
    - `connect()` - Attempt to connect to the broker and return the connection state
    - `check_version()` - Try to guess the version of the broker<br><br>    
 
- __ClusterMetadata__ - Manage the Kafka cluster metadata:
    - `leader_for_partition()` - Return the node id of the partition leader
    - `topics()` - Get a list of known topics<br><br>

There are many other methods available; you can browse them in the kafka-python documentation [here](https://kafka-python.readthedocs.io/en/master/).


### Connecting to Kafka

kafka-python can be installed with pip: just run `pip install kafka-python`. Once it is installed you will need to start Zookeeper and your Kafka broker before you can begin interacting with Kafka.

In [None]:
# Run both commands in the terminal from inside your Kafka folder
# Remember to start Zookeeper first as it orchestrates the Kafka brokers

# Starting Zookeeper
./bin/zookeeper-server-start.sh ./config/zookeeper.properties

# Starting Kafka (in a second terminal, once Zookeeper is up)
./bin/kafka-server-start.sh ./config/server.properties

With Kafka now running and `kafka-python` installed, let's try to establish a connection to the Kafka broker.

In [None]:
from kafka import KafkaClient
from kafka.cluster import ClusterMetadata

# Create a cluster metadata object seeded with the bootstrap server
meta_cluster_conn = ClusterMetadata(
    bootstrap_servers="localhost:9092", # Specify the broker address to connect to
)

# List the brokers known to this metadata object
# (until metadata is refreshed this is only the bootstrap broker)
print(meta_cluster_conn.brokers())


# Create a connection to our Kafka broker to check if it is running
client_conn = KafkaClient(
    bootstrap_servers="localhost:9092", # Specify the broker address to connect to
    client_id="Broker test" # Create an id for this client for reference
)

# Check that the server is connected and running
print(client_conn.bootstrap_connected())
# Check our Kafka version number
print(client_conn.check_version())



If all goes well you should see that `bootstrap_connected` returned `True`, signifying that we can establish a connection to the broker. Additionally, `check_version` should have attempted to infer the Kafka broker version. This gives you a way to check individual Kafka brokers through Python without having to interact with the terminal.
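Before creating a client at all, it can be handy to confirm that something is listening on the broker port. The sketch below is a minimal pre-flight check using only the standard library; the helper name `broker_port_open` is made up for this example, and it only shows that a TCP port accepts connections, not that the process behind it is a healthy Kafka broker.

```python
import socket


def broker_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError if the connection is refused or times out
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumed address: the same localhost:9092 broker used throughout this notebook
    print(broker_port_open("localhost", 9092))
```

If this prints `False`, the broker is probably not started yet, and the `KafkaClient` calls above are unlikely to succeed either.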



### Kafka Administration

Now that we know we can establish a connection to the broker we should create a Kafka admin client to create some topics.

In [None]:
from kafka import KafkaAdminClient
from kafka.admin import NewTopic

# Create a new Kafka client to administer our Kafka broker
admin_client = KafkaAdminClient(
    bootstrap_servers="localhost:9092",
    client_id="Kafka Administrator"
)

# Describe each topic we want with a NewTopic object
topics = []
topics.append(NewTopic(name="MLdata", num_partitions=3, replication_factor=1))
topics.append(NewTopic(name="Retaildata", num_partitions=2, replication_factor=1))

# Topics to create must be passed as a list
admin_client.create_topics(new_topics=topics)


Now that our topics are created we can list them with the `admin_client` to check that they exist. Notice that if you run the above code again you will get a `TopicAlreadyExistsError`, signifying that the topics have already been created.
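One way to make the creation cell safe to re-run is to filter out names that already exist before building the `NewTopic` objects. The following is a minimal sketch of that idea in plain Python; the helper `topics_still_needed` and the extra topic name `Clickdata` are invented for illustration, and the `existing` list stands in for whatever `admin_client.list_topics()` returns.

```python
def topics_still_needed(existing_names, wanted_names):
    """Return the wanted topic names that are not already in the cluster, in order."""
    existing = set(existing_names)
    return [name for name in wanted_names if name not in existing]


# Hypothetical result of admin_client.list_topics() after the first run
existing = ["MLdata", "Retaildata"]
# "Clickdata" is a made-up extra topic for this example
wanted = ["MLdata", "Retaildata", "Clickdata"]

print(topics_still_needed(existing, wanted))  # → ['Clickdata']
```

In the notebook you would then create `NewTopic` objects only for the returned names and pass that list to `create_topics`.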

In [None]:
admin_client.list_topics()

We can also get more detailed information about our topics with the `describe_topics` method of the `admin_client`.

In [None]:
# We can pass in a list of topics to describe, or describe all topics by omitting the topics keyword argument
admin_client.describe_topics(topics=["MLdata"])

The details listed by our `describe_topics` method are as follows:

- __error_code__: 
    - Indicates whether there is currently an error with the topic. 
      - __0__ - __NONE__ - No error; the topic is healthy. 
      - __1__ - __OFFSET_OUT_OF_RANGE__ - The requested offset is not within the range of offsets maintained by the server.
      - __2__ - __CORRUPT_MESSAGE__ - The message failed its checksum (CRC) check or is otherwise corrupt.
      - __3__ - __UNKNOWN_TOPIC_OR_PARTITION__ - The server doesn't host this topic or partition.
      - Other error codes can be found in the Kafka documentation [here](https://kafka.apache.org/11/protocol.html)
- __is_internal__:
    - Boolean value indicating whether this is an internal topic used by Kafka. For instance the `__consumer_offsets` topic is used internally by Kafka to manage offsets.
- __partitions__:
    - List of all partitions within the displayed topic and details for each.
        - __partition__ - Partition number within the topic
        - __leader__ - The ID of the broker that hosts the leader replica of this partition. All other replicas of the partition copy from this leader.
        - __replicas__ - List of the brokers that host a replica of this partition
        - __isr (in-sync replicas)__ - List of all replicas that are currently in sync with the leader. A replica counts as in sync if it has caught up with the leader within `replica.lag.time.max.ms`, which defaults to 10 seconds in older Kafka releases and 30 seconds from Kafka 2.5 onwards.
        - __offline_replicas__ - List of any replica brokers that are currently offline.
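To show how these fields fit together, here is a small sketch that flags partitions whose in-sync replica list is shorter than their replica list. The `sample` dictionary is hand-written to mirror the fields described above, with invented values for an imaginary two-broker cluster; it is not real output from this notebook, and the helper name `under_replicated` is an assumption of this example.

```python
def under_replicated(topic_description):
    """Return partition numbers whose in-sync replica set is smaller than the replica set."""
    return [
        p["partition"]
        for p in topic_description["partitions"]
        if len(p["isr"]) < len(p["replicas"])
    ]


# Hand-written sample shaped like one describe_topics() entry; all values are invented
sample = {
    "error_code": 0,
    "topic": "MLdata",
    "is_internal": False,
    "partitions": [
        {"error_code": 0, "partition": 0, "leader": 0,
         "replicas": [0, 1], "isr": [0, 1], "offline_replicas": []},
        {"error_code": 0, "partition": 1, "leader": 0,
         "replicas": [0, 1], "isr": [0], "offline_replicas": [1]},
    ],
}

print(under_replicated(sample))  # → [1]
```

With the single-broker setup used here (`replication_factor=1`), every partition has exactly one replica, so the list should normally be empty.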


### Producing Data

With the topics created we can now send data to the Kafka topics by creating a Kafka producer from the `KafkaProducer` class. Let's create some test data we can send to our topics. 

In [None]:
# Let's create some test data to send using our Kafka producer
ml_models = [
    {
        "Model_name": "ResNet-50",
        "Accuracy": "92.1",
        "Framework_used": "Pytorch"
    },
    {
        "Model_name": "Random Forest",
        "Accuracy": "82.7",
        "Framework_used": "SKLearn"
    }
] 

retail_data = [
    {
        "Item": "42 LCD TV",
        "Price": "209.99",
        "Quantity": "1"
    },
    {
        "Item": "Large Sofa",
        "Price": "259.99",
        "Quantity": "2"
    }
]

We can now create two producers each sending data to their respective topics.

In [None]:
from kafka import KafkaProducer
from json import dumps

# Configure our producer which will send data to the MLdata topic
ml_producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    client_id="ML data producer",
    value_serializer=lambda mlmessage: dumps(mlmessage).encode("ascii")
) 

# Configure our producer which will send data to the Retaildata topic
retail_producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    client_id="Retail data producer",
    value_serializer=lambda retailmessage: dumps(retailmessage).encode("ascii")
)


The `value_serializer` parameter is one we haven't seen before. It takes a callable that converts each message value into bytes before it is sent to the topic. Kafka transports messages as `bytes`, so we need to serialise our data into that form.

We used the `json` module's `dumps` function to serialise each message as a JSON-formatted string, and then called `encode` to turn that string into bytes.
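The serialiser can be tried on its own, without a broker, to see exactly what would be handed to Kafka. This sketch wraps the same `dumps(...).encode("ascii")` logic in a named function (`serialise` is just a name chosen for this example) and round-trips one of the test messages.

```python
from json import dumps, loads


def serialise(message):
    """Same logic as the value_serializer lambdas: dict -> JSON text -> ASCII bytes."""
    return dumps(message).encode("ascii")


payload = serialise({"Model_name": "ResNet-50", "Accuracy": "92.1"})
print(type(payload))   # <class 'bytes'>
print(payload)

# json.loads accepts bytes directly and gives back the original dictionary
print(loads(payload))

# dumps escapes non-ASCII characters by default, so the ASCII encode step does not fail
print(serialise({"Item": "Café"}))
```

Because `dumps` escapes non-ASCII characters unless `ensure_ascii=False` is passed, encoding to `"ascii"` is safe for any JSON-serialisable dictionary.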

With the two producers set up, we can now send the data to our Kafka topics with the `send` method.

In [None]:
# Send our ML data to the MLdata topic
for mlmessage in ml_models:
    ml_producer.send(topic="MLdata", value=mlmessage)

# Send retail data to the Retaildata topic
for retail_message in retail_data:
    retail_producer.send(topic="Retaildata", value=retail_message)

# send() is asynchronous, so block until all buffered messages have been sent
ml_producer.flush()
retail_producer.flush()

If the code cell ran without error, our messages should have been serialised correctly and sent to their topics. We can now try to consume these messages from the topics to check the output.

### Consuming Messages

We now need to consume the messages stored in our topics. We have the option of subscribing one consumer to both topics or creating an individual consumer for each topic.

For this example we will go with the former option and consume all the messages in one data stream. Begin by creating your consumer.

In [None]:
from kafka import KafkaConsumer
from json import loads

# Create our consumer to retrieve the messages from the topics
data_stream_consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda message: loads(message),
    auto_offset_reset="earliest" # With no committed offsets, start reading from the beginning of each topic
)

data_stream_consumer.subscribe(topics=["MLdata", "Retaildata"])


With the producer we had to serialise our data to JSON and encode it as bytes. On the consumer end we now need to decode each message back into a readable format. Where the producer used the `dumps` function from the `json` module, the consumer uses `loads`, which decodes the message into a Python dictionary.
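A plain `loads` raises an exception if a message on the topic is not valid JSON, which would likely interrupt the consumer loop. A more forgiving deserialiser could look like the sketch below; `safe_loads` is a hypothetical helper for this example, not part of kafka-python, and returning `None` for bad payloads is just one possible policy.

```python
from json import loads


def safe_loads(raw):
    """Decode JSON bytes to a Python object, returning None if the payload is not valid JSON."""
    try:
        return loads(raw)
    except ValueError:
        # JSONDecodeError and UnicodeDecodeError are both subclasses of ValueError
        return None


print(safe_loads(b'{"Item": "Large Sofa"}'))
print(safe_loads(b"not json at all"))
```

It could be passed as `value_deserializer=safe_loads` in place of the lambda, with the consuming loop skipping any message whose value is `None`.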

With the messages in this format we can then print the consumer messages.

In [None]:
# Loop through all messages in the consumer and print them out individually
# The loop keeps waiting for new messages, so interrupt the cell to stop it
for message in data_stream_consumer:
    print(message)

Notice that each printed message is a `ConsumerRecord`, a named tuple that includes all the associated metadata, e.g. timestamp, partition number and topic name. We may only be interested in the `value` part of the message, where the actual payload lies. We can access individual parts of the message in the following way.

In [None]:
for message in data_stream_consumer:
    print(message.value)
    print(message.topic)
    print(message.timestamp)

Excellent, we can now access the components of each message we're interested in from the consumer.
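The same attribute access can be combined into a one-line summary per message. The sketch below uses a stand-in named tuple, `FakeRecord`, so that it runs without a broker; the field names mirror the ones used above, the values are invented, and the timestamp is assumed to be in milliseconds since the Unix epoch.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Hypothetical stand-in for the records the consumer yields; only a few fields are modelled
FakeRecord = namedtuple("FakeRecord", ["topic", "partition", "offset", "timestamp", "value"])


def summarise(record):
    """Format a record's topic, position and timestamp alongside its value."""
    # Assumption: the timestamp is in milliseconds since the Unix epoch
    sent_at = datetime.fromtimestamp(record.timestamp / 1000, tz=timezone.utc)
    return f"{record.topic}[{record.partition}]@{record.offset} {sent_at.isoformat()} {record.value}"


record = FakeRecord("Retaildata", 0, 5, 1700000000000, {"Item": "Large Sofa"})
print(summarise(record))
```

Inside the consumer loop, `summarise(message)` should work the same way on real records, since it only relies on attribute access.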

This covers the main tasks you will perform when interacting with Kafka through Python, showing the whole process of producing data to and consuming data from Kafka. I would further recommend browsing the [documentation](https://kafka-python.readthedocs.io/en/master/) to see some of the other tasks you can perform when managing your Kafka cluster.
