# Kafka Consumer

In [1]:
# set this variable with one of the following values
# -> 'local'
# -> 'docker_cluster'
CLUSTER_TYPE = 'docker_cluster'

In [2]:
import os

KAFKA_BOOTSTRAP_SERVERS = ''

if CLUSTER_TYPE == 'local':

    KAFKA_HOME = '<PATH_TO_YOUR_kafka_2.13-3.2.0_FOLDER>'
    KAFKA_BOOTSTRAP_SERVERS = ['localhost:9092']
    
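    # Start Zookeeper in the background (-daemon); without it os.system blocks until the server exits
    os.system('{0}/bin/zookeeper-server-start.sh -daemon {0}/config/zookeeper.properties'.format(KAFKA_HOME))
    
    # Start one Kafka Broker, also in the background
    os.system('{0}/bin/kafka-server-start.sh -daemon {0}/config/server.properties'.format(KAFKA_HOME))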
    
elif CLUSTER_TYPE == 'docker_cluster':

    KAFKA_BOOTSTRAP_SERVERS = ['kafka-broker:9092']  # add any other brokers of your kafka cluster to this list

In [3]:
! pip3 install kafka-python confluent-kafka



In [4]:
from kafka import KafkaConsumer

Kafka consumers can be instantiated via the `KafkaConsumer` class:

```python
#--- A TYPICAL CONSUMER (the value deserializer below assumes `import msgpack`)
consumer = KafkaConsumer(
    bootstrap_servers=['62.30.10.23:9092'],  #<<<--- list of brokers
    security_protocol="SSL",                 #<<<--- security protocol (if any) 
    ssl_cafile="./ca.pem",                   #<<<--- certificate details (if any)
    ssl_certfile="./service.cert",           #           ...
    ssl_keyfile="./service.key",             #           ...
    value_deserializer=msgpack.unpackb,      #<<<--- message value deserialization function (e.g. unpack the message from a specific format)
    auto_offset_reset='earliest',            #<<<--- if the group has no committed offset, start reading from the earliest message
    group_id="group_A",                      #<<<--- identify this consumer as part of group_A
)
```
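
The `value_deserializer` argument accepts any callable that takes the raw message bytes and returns a Python object. As a minimal sketch, assuming the producer publishes UTF-8 encoded JSON (the helper name `json_value_deserializer` is hypothetical), such a callable could look like this:

```python
import json

def json_value_deserializer(raw_value):
    """Turn the raw bytes of a message value into a Python object."""
    # a missing (None) or empty (b'') value cannot be parsed as JSON
    if not raw_value:
        return None
    return json.loads(raw_value.decode('utf-8'))

# quick check on a hand-made payload, no broker required
print(json_value_deserializer(b'{"name": "Joe", "flag": 0}'))
```

It would then be passed as `value_deserializer=json_value_deserializer` when creating the `KafkaConsumer`.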


Once again we'll use a minimal consumer, without any of the specific configurations shown above.

In [5]:
# create a consumer pointing to a kafka cluster
consumer = KafkaConsumer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS,
                         consumer_timeout_ms=10000 # if no message is available for consumption 
                                                   # after 10s stop the consumer
                        )

Inspect the brokers for the available topics

In [6]:
# list all available topics on the kafka brokers
consumer.topics()

{'a_partitioned_topic', 'my_awesome_topic', 'results'}

In the PUB/SUB model, we first need to subscribe to the topic of choice.

Subscribing doesn't mean any message is actually received/consumed... 

It only means that from now on the consumer will be able to poll from the partitions of the chosen topics hosted on the brokers.

In [7]:
# subscribe to topic
consumer.subscribe('my_awesome_topic')

# and check the active subscriptions
consumer.subscription()

{'my_awesome_topic'}

The `KafkaConsumer` class also lets us inspect topics (for instance, their number of partitions), but **not** modify them.

We can inspect how many partitions the specific topic is made of:

In [8]:
# print the list of partition IDs 
# e.g. a topic with three partitions will have partition IDs {0, 1, 2}
consumer.partitions_for_topic('my_awesome_topic')

{0}

We can instruct the consumer to `poll` (i.e. ask for new messages from the topic) with a given cadence/logic.

For instance, we can limit each poll to at most 10 messages (`max_records`) and let each poll wait up to a given $\Delta t$ for new data to arrive (`timeout_ms`).

In [9]:
# poll the brokers once for new messages
consumer.poll(timeout_ms=0,         #<<--- return immediately, without waiting for data to arrive
              max_records=None,     #<<--- no explicit limit: fall back to the consumer's max_poll_records setting
              update_offsets=True   #<<--- advance the reading offsets past the returned records
             )

{}
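
`poll()` returns a dictionary that maps each topic partition to the list of records fetched from it (the empty `{}` above means nothing was available yet). The sketch below shows one way to flatten such a batch. `FakeTopicPartition` and `FakeRecord` are stand-in namedtuples and `flatten_batch` is a hypothetical helper, so the cell runs without a broker.

```python
from collections import namedtuple

# stand-ins for kafka.TopicPartition and for the records returned by the consumer
FakeTopicPartition = namedtuple('FakeTopicPartition', ['topic', 'partition'])
FakeRecord = namedtuple('FakeRecord', ['offset', 'value'])

def flatten_batch(batch):
    """Yield (partition, offset, value) for every record in a poll() result."""
    for topic_partition, records in batch.items():
        for record in records:
            yield topic_partition.partition, record.offset, record.value

# a hand-made batch with the same shape as the dictionary returned by poll()
fake_batch = {
    FakeTopicPartition('my_awesome_topic', 0): [FakeRecord(0, b'hello'),
                                                FakeRecord(1, b'world')],
}

for partition, offset, value in flatten_batch(fake_batch):
    print("%d:%d v=%s" % (partition, offset, value))
```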

Now the consumer is ready to read messages (until it is stopped or it reaches a timeout).

Let's look for all messages in the consumer:

In [10]:
# this consumer will keep polling and reading for messages until stopped (or it reaches the consumer_timeout_ms)
for message in consumer:
    print (message)

The reading offset can also be brought back to the beginning of the topic, to re-read the entire topic:

In [11]:
consumer.seek_to_beginning()

for message in consumer:
    print (message)

ConsumerRecord(topic='my_awesome_topic', partition=0, offset=0, timestamp=1654606909275, timestamp_type=0, key=None, value=b'', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=0, serialized_header_size=-1)
ConsumerRecord(topic='my_awesome_topic', partition=0, offset=1, timestamp=1654606909663, timestamp_type=0, key=None, value=b'', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=0, serialized_header_size=-1)
ConsumerRecord(topic='my_awesome_topic', partition=0, offset=2, timestamp=1654606909830, timestamp_type=0, key=None, value=b'', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=0, serialized_header_size=-1)
ConsumerRecord(topic='my_awesome_topic', partition=0, offset=3, timestamp=1654606913108, timestamp_type=0, key=None, value=b' hello?', headers=[], checksum=None, serialized_key_size=-1, serialized_value_size=7, serialized_header_size=-1)
ConsumerRecord(topic='my_awesome_topic', partition=0, offset=4, times

The full `ConsumerRecord` is quite verbose, but it can be easily inspected by extracting only the relevant fields.

In [12]:
from datetime import datetime

consumer.seek_to_beginning()

# break down the message into its main components
for message in consumer:
    print ("%d:%d [%s] k=%s v=%s" % (message.partition,
                          message.offset,
                          datetime.fromtimestamp(message.timestamp/1000).time(),
                          message.key,
                          message.value))

0:0 [13:01:49.275000] k=None v=b''
0:1 [13:01:49.663000] k=None v=b''
0:2 [13:01:49.830000] k=None v=b''
0:3 [13:01:53.108000] k=None v=b' hello?'
0:4 [13:01:58.569000] k=None v=b' anybody here?'
0:5 [13:05:28.139000] k=None v=b'is this working='
0:6 [13:05:34.254000] k=None v=b'wow'
0:7 [13:14:14.144000] k=None v=b'message 1'
0:8 [13:17:01.105000] k=None v=b'a new message'
0:9 [13:17:21.782000] k=None v=b'a message from the revived producer'
0:10 [13:17:44.882000] k=b'some_key' v=b'a message with key'
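
The `timestamp` field is expressed in milliseconds since the Unix epoch, and `datetime.fromtimestamp` without a time zone converts it to the local time of the machine running the notebook. A small sketch of a time-zone-aware conversion (the helper name `record_time_utc` is hypothetical):

```python
from datetime import datetime, timezone

def record_time_utc(timestamp_ms):
    """Convert a record timestamp (ms since the epoch) to an aware UTC datetime."""
    # split seconds and milliseconds to avoid floating point rounding
    seconds, millis = divmod(timestamp_ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=millis * 1000)

# timestamp of the first record printed above
print(record_time_utc(1654606909275).isoformat())
```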


Let's change the topic to which the consumer is subscribed to a partitioned one:

In [13]:
consumer.subscribe('a_partitioned_topic')
consumer.subscription()

{'a_partitioned_topic'}

Inspecting the partitions of this topic, we now see two of them: partition #0 and partition #1.

In [14]:
consumer.partitions_for_topic('a_partitioned_topic')

{0, 1}

Reading from a partitioned topic, it's easy to see that the messages are spread across the two partitions in a seemingly arbitrary way:

In [15]:
import json

consumer.seek_to_beginning()

for message in consumer:
    print ("%d:%d:\t v=%s" % (message.partition,
                          message.offset,
                          json.loads(message.value)))

1:0:	 v={'name': 'Joe', 'surname': 'Smith', 'amount': '654.11', 'delta_t': '9.33', 'flag': 0}
1:1:	 v={'name': 'Joe', 'surname': 'Jones', 'amount': '244.90', 'delta_t': '7.22', 'flag': 0}
1:2:	 v={'name': 'Andy', 'surname': 'Smith', 'amount': '354.93', 'delta_t': '7.04', 'flag': 0}
1:3:	 v={'name': 'John', 'surname': 'Millers', 'amount': '874.58', 'delta_t': '6.32', 'flag': 0}
1:4:	 v={'name': 'Joe', 'surname': 'Jones', 'amount': '620.05', 'delta_t': '4.02', 'flag': 0}
1:5:	 v={'name': 'Joe', 'surname': 'Millers', 'amount': '54.48', 'delta_t': '9.99', 'flag': 0}
1:6:	 v={'name': 'Andy', 'surname': 'Smith', 'amount': '34.26', 'delta_t': '2.54', 'flag': 0}
0:0:	 v={'name': 'Joe', 'surname': 'Jones', 'amount': '304.19', 'delta_t': '8.55', 'flag': 0}
0:1:	 v={'name': 'Andy', 'surname': 'Millers', 'amount': '434.78', 'delta_t': '2.15', 'flag': 0}
0:2:	 v={'name': 'Joe', 'surname': 'Smith', 'amount': '134.71', 'delta_t': '6.89', 'flag': 0}
0:3:	 v={'name': 'Joe', 'surname': 'Millers', 'amoun
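
The split is decided on the producer side. When a record has a key, the producer typically hashes the key and takes the result modulo the number of partitions, so equal keys always land in the same partition; records without a key are simply spread over the partitions. The sketch below illustrates the keyed case only. It uses `zlib.crc32` as a stand-in hash (an assumption made for illustration; Kafka's default partitioner uses a different hash function), and `pick_partition` is a hypothetical helper.

```python
import zlib

def pick_partition(key, num_partitions):
    """Map a message key (bytes) to a partition id: hash(key) modulo num_partitions."""
    # crc32 is only a stand-in, not the hash used by Kafka
    return zlib.crc32(key) % num_partitions

for key in [b'Joe', b'Andy', b'John', b'Joe']:
    print(key, '->', pick_partition(key, num_partitions=2))
```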

## Creating a consumer accessing only one partition

Publishing records to a partitioned topic is typically transparent for the user: the producer publishes to the topic, and the kafka cluster will redirect the message to the partition leader, later replicating that to the followers.

The same goes for a generic consumer: as we have just seen, data is consumed from all partitions of the topic.

In some cases, it can however be more suitable to instantiate multiple consumers, each reading from a specific partition of a topic.

Let's create a consumer dedicated to reading partition #0 of the previous partitioned topic.

In [16]:
from kafka import TopicPartition

# create a standard consumer
consumer_part_0 = KafkaConsumer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS,
                                client_id='consumer_n_0',
                                consumer_timeout_ms=10000)

# assign it to a specific topic - partition combination
consumer_part_0.assign([TopicPartition('a_partitioned_topic', 0)]) # <<--- name of the topic, partition id

In [17]:
consumer_part_0.seek_to_beginning()

for message in consumer_part_0:
    print ("%d:%d:\t v=%s" % (message.partition,
                          message.offset,
                          json.loads(message.value)))

0:0:	 v={'name': 'Joe', 'surname': 'Jones', 'amount': '304.19', 'delta_t': '8.55', 'flag': 0}
0:1:	 v={'name': 'Andy', 'surname': 'Millers', 'amount': '434.78', 'delta_t': '2.15', 'flag': 0}
0:2:	 v={'name': 'Joe', 'surname': 'Smith', 'amount': '134.71', 'delta_t': '6.89', 'flag': 0}
0:3:	 v={'name': 'Joe', 'surname': 'Millers', 'amount': '902.71', 'delta_t': '1.19', 'flag': 1}
0:4:	 v={'name': 'Alice', 'surname': 'Millers', 'amount': '833.05', 'delta_t': '2.85', 'flag': 0}
0:5:	 v={'name': 'Joe', 'surname': 'Jones', 'amount': '635.61', 'delta_t': '7.81', 'flag': 0}
0:6:	 v={'name': 'John', 'surname': 'Jones', 'amount': '654.43', 'delta_t': '6.08', 'flag': 0}
0:7:	 v={'name': 'Alice', 'surname': 'Smith', 'amount': '917.92', 'delta_t': '4.13', 'flag': 0}
0:8:	 v={'name': 'John', 'surname': 'Johnson', 'amount': '786.98', 'delta_t': '1.64', 'flag': 1}
0:9:	 v={'name': 'Andy', 'surname': 'Jones', 'amount': '500.63', 'delta_t': '5.62', 'flag': 0}
0:10:	 v={'name': 'Andy', 'surname': 'Johnso

## Creating a consumer group

Multiple consumers can read from the same topic.

In kafka, every consumer is part of a consumer group (even a single consumer is part of its own consumer group, really). 

A consumer group is a number (1 or more) of cooperating consumers gathering data from the same topic, balancing the load across them and redistributing the consume calls dynamically.

If a consumer inside a consumer group fails, its partitions are reassigned to the remaining consumers of the group, so the topic to which they are subscribed keeps being read in full.
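
The load balancing can be pictured as follows: each partition of the topic is owned by exactly one consumer of the group at any time, and the ownership is recomputed (a rebalance) whenever a consumer joins or leaves. The sketch below is a simplified round-robin illustration, not Kafka's actual assignor; `assign_partitions` is a hypothetical helper.

```python
def assign_partitions(partitions, consumers):
    """Spread partition ids over consumer names in round-robin order."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment

# two consumers share the two partitions of the topic
print(assign_partitions({0, 1}, ['consumer_one', 'consumer_two']))

# if consumer_two fails, a rebalance hands both partitions to consumer_one
print(assign_partitions({0, 1}, ['consumer_one']))
```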

In [18]:
# create consumer_one and make it part of the consumer group 'my_group'
consumer_one = KafkaConsumer(bootstrap_servers=KAFKA_BOOTSTRAP_SERVERS,
                             client_id='consumer_one',
                             group_id='my_group',
                             auto_offset_reset='earliest',
                             consumer_timeout_ms=10000)

In [19]:
from kafka import ConsumerRebalanceListener
# subscribe this specific consumer to the partitioned topic
consumer_one.subscribe('a_partitioned_topic',
                       listener=ConsumerRebalanceListener())
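
`ConsumerRebalanceListener` is meant to be subclassed: the consumer calls `on_partitions_revoked` before a rebalance starts and `on_partitions_assigned` once the new assignment is known. A minimal sketch of a listener that just records these events (the class name `LoggingRebalanceListener` is hypothetical, and the fallback import only exists so that the cell also runs where kafka-python is not installed):

```python
try:
    from kafka import ConsumerRebalanceListener
except ImportError:
    # fallback so the sketch can run without kafka-python
    ConsumerRebalanceListener = object

class LoggingRebalanceListener(ConsumerRebalanceListener):
    """Keep track of the partitions handed to, or taken from, this consumer."""

    def __init__(self):
        self.events = []

    def on_partitions_revoked(self, revoked):
        # called before the rebalance starts
        self.events.append(('revoked', sorted(revoked)))
        print('revoked:', sorted(revoked))

    def on_partitions_assigned(self, assigned):
        # called after the rebalance completes
        self.events.append(('assigned', sorted(assigned)))
        print('assigned:', sorted(assigned))

listener = LoggingRebalanceListener()
# simulate a callback by hand; in real use the consumer invokes it
listener.on_partitions_assigned([('a_partitioned_topic', 0)])
```

It would be passed as `listener=LoggingRebalanceListener()` in the `subscribe` call.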

#### BREAK
Use the ConsumerGroup notebook to:
1. create a second consumer `consumer_two`
2. assign it to the same consumer group `my_group`
3. subscribe to the same topic `a_partitioned_topic`

Each consumer within a group is an independent process (it should run in parallel with the others) and provides access to a fraction of the incoming data.

In [20]:
consumer_one.assignment()

set()

In [None]:
# Use multiple consumers in parallel --> typically you would run each on a different thread / process / executor
for message in consumer_one:
    print ("%d:%d: k=%s v=%s" % (message.partition,
                          message.offset,
                          message.key,
                          json.loads(message.value)))

## Reading from the Kafka+Spark results topic

Let's subscribe to the `results` topic and monitor the detected frauds:

In [None]:
consumer.subscribe('results')

for message in consumer:
    print ("%d:%d: k=%s v=%s" % (message.partition,
                          message.offset,
                          message.key,
                          message.value))
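    # records published without a key have key=None, which cannot be decoded
    user = message.key.decode('ascii') if message.key is not None else '<no key>'
    print('      --> sending alert message to user {}\n'.format(user))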