# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Assignment 7 Kafka Producer: Apache Kafka Streaming

## Learning Objectives

At the end of the experiment, you will be able to

* understand what Kafka is and what its components are
* perform real-time data analytics with Kafka

## Information

### Introduction

Stream processing refers to the processing of data in motion, or in other words, computing on data directly as it is produced or received. The majority of data are born as continuous streams: sensor events, user activity on a website, financial trades, and so on – all these data are created as a series of events over time.

Before stream processing, this data was often stored in a database, a file system, or other forms of mass storage. Applications would query the data or compute over the data as needed. Stream Processing turns this paradigm around: the application logic, analytics, and queries exist continuously, and data flows through them continuously.
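To make the contrast concrete, here is a minimal sketch (not part of the original assignment) that computes the same average in both styles: once over data at rest, and once incrementally as each event arrives. The function names and the sample readings are illustrative assumptions.

```python
from typing import Iterable, Iterator, List


def batch_mean(stored: List[float]) -> float:
    """Batch style: the data is already stored; query it once."""
    return sum(stored) / len(stored)


def streaming_mean(events: Iterable[float]) -> Iterator[float]:
    """Streaming style: the logic stays alive and emits an updated
    result every time a new event flows through it."""
    count = 0
    total = 0.0
    for value in events:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0]          # hypothetical sensor readings
print(batch_mean(readings))            # one answer after all data is stored
print(list(streaming_mean(readings)))  # one answer per incoming event
```

The generator never needs the full data set in memory, which is the property that lets streaming logic run over unbounded input.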

Stream processing frameworks include Apache Flink, Apache Storm, Apache Kafka, and Spark Streaming, as shown in the figure below.

<img src= "https://cdn.iisc.talentsprint.com/CDS/Images/Modern_Stream_Processing_Frameworks.jpg" width= 550 px/>

Here we will consider Apache Kafka for streaming.

Apache Kafka is an open-source software platform, written in Scala and Java, developed by the Apache Software Foundation. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Kafka was originally developed by LinkedIn and was subsequently open-sourced in early 2011. 

Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable. Kafka maintains messages in topics. Producers write data to topics and consumers read from topics. Because Kafka is a distributed system, topics are partitioned and replicated across multiple nodes.

It is based on the commit log, and it allows users to subscribe to it and publish data to any number of systems or real-time applications. Example applications include managing passenger and driver matching at Uber, providing real-time analytics and predictive maintenance for British Gas smart home, and performing numerous real-time services across all of LinkedIn.

To know more about the use cases of Kafka click [here](https://kafka.apache.org/documentation.html#uses).

### Components of Kafka cluster

The components of Kafka are as follows:

* **Log:**
Also called a write-ahead log, commit log, or transaction log. Each partition in Apache Kafka is a log: a time-ordered, append-only sequence of records, from which data is removed only when a given retention period has been exceeded. Records are appended to the end of the log and can be read in order. Consumers can also rewind the log or skip ahead to read from any point in the partition.

* **Record or Message:**
Data sent to and from the broker is called a record, a key-value pair. The record contains the topic name and partition number. The Kafka broker keeps records inside topic partitions.

* **Broker:**
The brokers in a Kafka cluster handle the process of receiving, storing and forwarding the records to the interested consumers. 

* **Topics:**
Records are grouped into categories called topics. A topic is a category or feed name under which records are stored and published, for example LogMessage or StockMessage. To send a record, we send it to a specific topic; to read a record, we read it from a specific topic.

* **Retention period:**
Records published to the cluster will stay in the cluster until a configurable retention period has passed. Kafka retains all records for a set amount of time or until a configurable size is reached. The consumption time is not impacted by the size of the log.

* **Producer, Producer API:**
The processes that publish records/messages to a topic are called producers; they use the Producer API.

* **Consumer, Consumer API:**
The processes that consume records/messages from a topic are called consumers; they use the Consumer API.

* **Partition:**
Topics are divided into one or more partitions, which can be replicated between nodes. Partitions are the unit of parallelism in Kafka. Partitions allow records in a topic to be distributed to multiple brokers. The number of partitions in a topic is configurable.

<center>
<figure>
<img src="https://cdn.iisc.talentsprint.com/CDS/Images/Kafka_offset.png" width= 450 px/>
</figure>
</center>


* **Offset:**
Kafka topics are divided into several partitions, which contain records in an unchangeable sequence. Each record in a partition is assigned and identified by its unique offset.

* **Consumer group:**
A consumer group is the set of consumers that subscribe to a specific topic. Kafka consumers are usually part of a consumer group. Each consumer in the group is assigned a different subset of the topic's partitions and receives records only from those partitions.

* **ZooKeeper:**
ZooKeeper is a stand-alone, centralized service that acts across nodes to relieve Kafka of administrative duties. ZooKeeper is responsible for controller elections, topic configuration, access control lists, and cluster membership.

* **Instance (“As in a CloudKarafka instance”):**
When a CloudKarafka plan is created, we get a CloudKarafka instance, that is, an instance of Apache Kafka. It could be a dedicated instance (an Apache Kafka broker) or a shared instance, which gives you five dedicated topics on a shared plan.
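The log, offset, and retention-period ideas from the component list above can be illustrated with a toy in-memory model of a single partition. This is a teaching sketch under stated assumptions, not how Kafka stores data on disk: the class name `PartitionLog`, its methods, and the fixed timestamps are all hypothetical.

```python
import time
from typing import List, Optional, Tuple


class PartitionLog:
    """Toy model of one partition: an append-only list of
    (offset, timestamp, value) entries."""

    def __init__(self) -> None:
        self._entries: List[Tuple[int, float, str]] = []
        self._next_offset = 0

    def append(self, value: str, timestamp: Optional[float] = None) -> int:
        """Add a record at the end of the log and return its offset."""
        ts = time.time() if timestamp is None else timestamp
        offset = self._next_offset
        self._entries.append((offset, ts, value))
        self._next_offset += 1
        return offset

    def read_from(self, offset: int) -> List[Tuple[int, str]]:
        """Return (offset, value) pairs, in order, starting at the given offset."""
        return [(o, v) for (o, _, v) in self._entries if o >= offset]

    def apply_retention(self, now: float, retention_seconds: float) -> None:
        """Drop entries older than the retention period; offsets are never reused."""
        self._entries = [e for e in self._entries if now - e[1] <= retention_seconds]


log = PartitionLog()
for i, text in enumerate(["a", "b", "c"]):
    log.append(text, timestamp=float(i))   # timestamps 0, 1, 2 keep the demo repeatable

print(log.read_from(1))                    # a reader "rewinds" to offset 1
log.apply_retention(now=2.5, retention_seconds=1.0)
print(log.read_from(0))                    # only the newest record survives retention
```

Note that reading does not remove anything from the log; only the retention step does, which is why several readers can consume the same records independently.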


### Kafka Architecture


<img src= "https://cdn.iisc.talentsprint.com/CDS/Images/KafkaCluster.png" width= 500 px/>

Kafka stores messages that come from arbitrarily many processes called **producers**. The data can be partitioned into different **partitions** within different **topics**. Within a partition, messages are strictly ordered by their **offsets** (the position of a message within a partition) and indexed and stored together with a timestamp. Other processes called **consumers** can read messages from partitions. 

Kafka runs on a cluster of one or more servers (called **brokers**), and the partitions of all topics are distributed across the cluster nodes. Additionally, partitions are replicated to multiple brokers. This architecture allows Kafka to deliver massive streams of messages in a fault-tolerant fashion.
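As a rough illustration of how records spread over partitions and how partitions spread over the consumers of one group, consider the sketch below. It is an assumption-laden toy: `choose_partition` uses CRC32 purely as a stand-in hash (real clients ship their own partitioners), `assign_partitions` uses a plain round-robin rule, and the partition count and consumer names are hypothetical.

```python
import zlib
from typing import Dict, List


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index with a stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions over the consumers of one group, round-robin,
    so that each partition has exactly one owner within the group."""
    assignment: Dict[str, List[int]] = {name: [] for name in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# The same key always maps to the same partition, which preserves per-key order.
print(choose_partition("sensor-7", 5) == choose_partition("sensor-7", 5))

print(assign_partitions(5, ["consumer-a", "consumer-b"]))
```

Because ordering is guaranteed only within a partition, records that must stay in order relative to each other should share a key.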

**Running Producer and Consumer from two separate notebooks**

Two Colab notebooks are needed because the producer and consumer files must run **simultaneously**, and two code cells cannot run at the same time within one Colab notebook. While the consumer file runs in the other notebook, we run the producer file (given in this notebook) and send messages. The messages sent by the producer become available on the consumer side, where we can operate on them: for example, print the message, count its words, compute a rolling mean if the data is numerical, or trigger an action when a particular message is received.

The steps for the same are given below:

* Go to the Kafka Consumer notebook and run the `consumer.py` file first
* While the consumer file is running, run the `producer.py` file from the Kafka Producer notebook
* When both files are running, type your message in the Producer file and send it
* The message will be received on the consumer side, and the output of the operations will be displayed
* Stop the Producer cell first and then the corresponding Consumer cell

### Setup Steps:

### Install Confluent kafka

In [4]:
# Confluent (Open Source) is a developer-optimized, more complete distribution of Apache Kafka.
# It expands Kafka's integration capabilities, adding tools to optimize and manage Kafka clusters
# and methods to ensure the streams are secure, which makes Kafka easier to build on and operate.
# The confluent_kafka package installed here is Confluent's Python client for Kafka.

!pip install confluent_kafka

Collecting confluent_kafka
  Downloading confluent_kafka-1.8.2-cp37-cp37m-manylinux2010_x86_64.whl (2.8 MB)
     |████████████████████████████████| 2.8 MB 26.1 MB/s
Installing collected packages: confluent-kafka
Successfully installed confluent-kafka-1.8.2


### Connect to Kafka cluster

CloudKarafka provides Apache Kafka as a service and offers tools to simplify its use.

To know more about CloudKarafka click [here](https://www.cloudkarafka.com/docs/productoverview.html).

**CloudKarafka login:** Log in to [CloudKarafka](https://www.cloudkarafka.com) and create an instance.

For detailed instructions on the account and instance creation, please refer to this [document](https://cdn.iisc.talentsprint.com/CDS/CloudKarafka.pdf).

**Connect the cluster:**

* Create an instance and get credentials
* Create two topics (one for each example) and note down the topic names

Specify your `BROKERS`, `USERNAME`, `PASSWORD`, and `TOPIC` in the script files below.
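One way to keep these settings out of the script body is to read them from environment variables and fail early when one is missing. The sketch below is a suggestion, not the assignment's method: the helper name `build_producer_conf` and its validation are assumptions, and the broker address and credentials are placeholders. The `CLOUDKARAFKA_*` variable names and the SASL settings mirror the ones used by the scripts in this notebook, and only properties that apply to a producer are included.

```python
import os
from typing import Dict, Mapping


def build_producer_conf(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Assemble a producer configuration from CLOUDKARAFKA_* settings."""
    required = ["CLOUDKARAFKA_BROKERS", "CLOUDKARAFKA_USERNAME", "CLOUDKARAFKA_PASSWORD"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError("Missing settings: " + ", ".join(missing))
    return {
        "bootstrap.servers": env["CLOUDKARAFKA_BROKERS"],
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "SCRAM-SHA-256",
        "sasl.username": env["CLOUDKARAFKA_USERNAME"],
        "sasl.password": env["CLOUDKARAFKA_PASSWORD"],
    }


# Placeholder values for illustration only; substitute your own instance details.
example_env = {
    "CLOUDKARAFKA_BROKERS": "broker-1.example.com:9094",
    "CLOUDKARAFKA_USERNAME": "my-user",
    "CLOUDKARAFKA_PASSWORD": "my-password",
}
conf = build_producer_conf(example_env)
print(sorted(conf))   # the configuration keys that would be passed to Producer(**conf)
```

Checking for empty values up front gives a clearer error than a failed connection attempt later.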

### Example 1: Send and receive messages

Here we create two files: `producer1.py` and `consumer1.py` (in the Consumer notebook). The producer sends messages to a topic; the consumer reads these messages in real time from that topic and displays each message along with its word count, plus an alert if the number of words exceeds 6.
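The consumer-side processing described here can be sketched as a small pure function. The actual `consumer1.py` lives in the Consumer notebook and is not shown in this notebook, so the function name `analyze_message`, its return shape, and the whitespace-based word split are assumptions.

```python
from typing import Tuple


def analyze_message(text: str, word_limit: int = 6) -> Tuple[int, bool]:
    """Return the word count of a message and whether it should raise an alert."""
    count = len(text.split())          # words are separated by whitespace
    return count, count > word_limit   # alert only when the limit is exceeded


for message in ["check me", "this message has more than six words in it"]:
    count, alert = analyze_message(message)
    print(message, "| words:", count, "| alert:", alert)
```

Keeping the logic separate from the Kafka polling loop makes it easy to test without a running cluster.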

#### Write Producer file

Here the producer will send messages to the specified `topic`.

In [5]:
%%writefile producer1.py

import sys
import os

from confluent_kafka import Producer

# Specify BROKERS, USERNAME, PASSWORD and TOPIC
brokers = "tricycle-01.srvs.cloudkafka.com:9094,tricycle-02.srvs.cloudkafka.com:9094,tricycle-03.srvs.cloudkafka.com:9094" 
username = "dgcdj8l4"
password = "RoVwni79jy5zootYD7nCiVZXv0DexzQI"
topic = "dgcdj8l4-default"

# Store the connection settings in environment variables so that the configuration below can read them
os.environ['CLOUDKARAFKA_BROKERS']= brokers
os.environ['CLOUDKARAFKA_USERNAME']= username
os.environ['CLOUDKARAFKA_PASSWORD']= password
os.environ['CLOUDKARAFKA_TOPIC']= topic

if __name__ == '__main__':
    topic = os.environ['CLOUDKARAFKA_TOPIC'].split(",")[0]

    # Producer configuration
    conf = {
        'bootstrap.servers': os.environ['CLOUDKARAFKA_BROKERS'],                # Specify kafka servers
        'session.timeout.ms': 6000,                                             # Consumer-only property: the producer ignores it and logs a CONFWARN
        'default.topic.config': {'auto.offset.reset': 'smallest'},              # Consumer-only property: the producer ignores it and logs a CONFWARN
        'security.protocol': 'SASL_SSL',                                        # protocol used to communicate with brokers
        'sasl.mechanisms': 'SCRAM-SHA-256',                                     # SASL mechanism to use for authentication. Supported: GSSAPI, PLAIN, SCRAM-SHA-256, SCRAM-SHA-512, OAUTHBEARER
        'sasl.username': os.environ['CLOUDKARAFKA_USERNAME'],
        'sasl.password': os.environ['CLOUDKARAFKA_PASSWORD']
    }

    p = Producer(**conf)

    def delivery_callback(err, msg):
        if err:
            sys.stderr.write('%% Message failed delivery: %s\n' % err)
        else:
            sys.stderr.write('%% Message delivered to %s [%d]\n' %(msg.topic(), msg.partition()))
    print("\nEnter text: ")
    # Take input data continuously
    for line in sys.stdin:                             
        try:
            # send data to specified topic
            p.produce(topic, line.rstrip(), callback=delivery_callback)       
        except BufferError as e:
            sys.stderr.write('%% Local producer queue is full (%d messages awaiting delivery): try again\n' %len(p))
        p.poll(0)
        print("\nEnter text or interupt the execution to stop.")

    sys.stderr.write('%% Waiting for %d deliveries\n' % len(p))
    p.flush()            # wait until all buffered messages have been delivered

Writing producer1.py
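The loop in `producer1.py` reports a full local queue but then moves on, so the line that triggered the `BufferError` is not sent. One possible refinement, sketched here under assumptions, is to call `poll()` to serve delivery callbacks (which frees queue space) and then retry. The helper `send_with_retry`, its `max_retries` and `poll_timeout` parameters, and the `StubProducer` class are hypothetical; the helper relies only on `produce()`, `poll()`, and `BufferError`, which the script above already uses. The stub stands in for `confluent_kafka.Producer` so the sketch runs without a broker.

```python
from typing import Callable, List, Optional


def send_with_retry(producer, topic: str, value: str,
                    callback: Optional[Callable] = None,
                    max_retries: int = 3, poll_timeout: float = 1.0) -> bool:
    """Try to enqueue one message; on a full local queue, poll and retry."""
    for _ in range(max_retries + 1):
        try:
            producer.produce(topic, value, callback=callback)
            return True
        except BufferError:
            # Serving delivery callbacks frees space in the local queue.
            producer.poll(poll_timeout)
    return False


class StubProducer:
    """Stand-in for a real producer: rejects a fixed number of produce() calls."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.sent: List[str] = []

    def produce(self, topic, value, callback=None):
        if self.failures > 0:
            self.failures -= 1
            raise BufferError("queue full")
        self.sent.append(value)

    def poll(self, timeout):
        return 0


stub = StubProducer(failures=2)
print(send_with_retry(stub, "demo-topic", "hello"))   # enqueued on the third attempt
```

With a real producer, the same helper could replace the `try`/`except` body inside the input loop, passing `delivery_callback` as the `callback` argument.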


#### Run Producer file

Before running the producer file, make sure that the corresponding consumer file `consumer1.py` is running in [Consumer notebook](https://drive.google.com/file/d/1xX0DA_QsDQeCnYtLklZVE2G9ke0gs41R/view?usp=sharing).

The producer will keep on running and allow us to send messages. The output will be shown on the consumer side.

<font color='blue'>Before executing the cell below, ensure that you have created the CloudKarafka account and specified the credentials.</font>

In [7]:
!python producer1.py

%4|1644061938.739|CONFWARN|rdkafka#producer-1| [thrd:app]: Configuration property session.timeout.ms is a consumer property and will be ignored by this producer instance
%4|1644061938.739|CONFWARN|rdkafka#producer-1| [thrd:app]: Configuration property auto.offset.reset is a consumer property and will be ignored by this producer instance

Enter text: 
lets try y gettig a word with a log legth to tes this

Enter text or interupt the execution to stop.
check me
% Message delivered to dgcdj8l4-default [3]

Enter text or interupt the execution to stop.
example 2
% Message delivered to dgcdj8l4-default [4]

Enter text or interupt the execution to stop.
example 3
% Message delivered to dgcdj8l4-default [0]

Enter text or interupt the execution to stop.
Traceback (most recent call last):
  File "producer1.py", line 44, in <module>
KeyboardInterrupt


For the next example, **create a new topic** on CloudKarafka and use its topic name. To create a topic, please refer to step 11 in this [document](https://cdn.iisc.talentsprint.com/CDS/CloudKarafka.pdf).

### Example 2: Compute the rolling mean of the last three insertions

Here we create two files: `producer2.py` and `consumer2.py` (in the Consumer notebook). The producer sends data to a topic; the consumer reads these records in real time from that topic and displays the rolling mean of the last three insertions. For the first two insertions, only the added numbers are displayed.
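A rolling mean over the last three insertions can be sketched with a fixed-length `deque`. The actual `consumer2.py` is in the Consumer notebook and is not shown here, so the class name `RollingMean` and the choice to return `None` until the window is full are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Optional


class RollingMean:
    """Keep the most recent `window` values and report their mean."""

    def __init__(self, window: int = 3) -> None:
        self.window = window
        self.values: Deque[float] = deque(maxlen=window)   # old values drop off automatically

    def add(self, value: float) -> Optional[float]:
        """Record a value; return the mean once the window is full, else None."""
        self.values.append(value)
        if len(self.values) < self.window:
            return None
        return sum(self.values) / self.window


roller = RollingMean(window=3)
results = [roller.add(float(n)) for n in [1, 2, 3, 4, 5]]
print(results)   # no mean for the first two insertions, then one per new value
```

Because the `deque` discards the oldest value on each append, memory use stays constant no matter how many numbers the producer sends.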

#### Write Producer file

Here the producer will send messages to the specified `topic`.

In [13]:
%%writefile producer2.py

import sys
import os

from confluent_kafka import Producer

# Specify BROKERS, USERNAME, PASSWORD and new TOPIC
brokers = "" 
username = ""
password = ""
topic = ""

# Store the connection settings in environment variables so that the configuration below can read them
os.environ['CLOUDKARAFKA_BROKERS']= brokers
os.environ['CLOUDKARAFKA_USERNAME']= username
os.environ['CLOUDKARAFKA_PASSWORD']= password
os.environ['CLOUDKARAFKA_TOPIC']= topic

if __name__ == '__main__':
    
    topic = os.environ['CLOUDKARAFKA_TOPIC'].split(",")[0]

    # Producer configuration
    conf = {
        'bootstrap.servers': os.environ['CLOUDKARAFKA_BROKERS'],                # Specify kafka servers
        'session.timeout.ms': 6000,                                             # Consumer-only property: the producer ignores it and logs a CONFWARN
        'default.topic.config': {'auto.offset.reset': 'smallest'},              # Consumer-only property: the producer ignores it and logs a CONFWARN
        'security.protocol': 'SASL_SSL',                                        # protocol used to communicate with brokers
        'sasl.mechanisms': 'SCRAM-SHA-256',                                     # SASL mechanism to use for authentication. Supported: GSSAPI, PLAIN, SCRAM-SHA-256, SCRAM-SHA-512, OAUTHBEARER
        'sasl.username': os.environ['CLOUDKARAFKA_USERNAME'],
        'sasl.password': os.environ['CLOUDKARAFKA_PASSWORD']
    }

    
    p = Producer(**conf)

    def delivery_callback(err, msg):
        if err:
            sys.stderr.write('%% Message failed delivery: %s\n' % err)
        else:
            sys.stderr.write('%% Message delivered to %s [%d]\n' %(msg.topic(), msg.partition()))
    print("\nEnter number: ")
    # Take input data continuously
    for line in sys.stdin:                           
        try:
            # send data to specified topic
            p.produce(topic, line, callback=delivery_callback)                 
        except BufferError as e:
           
            sys.stderr.write('%% Local producer queue is full (%d messages awaiting delivery): try again\n' %len(p))

        p.poll(0)
        print("\nEnter a number or interupt the execution to stop.")

    sys.stderr.write('%% Waiting for %d deliveries\n' % len(p))

    p.flush()                     # wait until all buffered messages have been delivered

Writing producer2.py


#### Run Producer file

Before running the producer file, please make sure that the corresponding consumer file `consumer2.py` is running in [Consumer notebook](https://drive.google.com/file/d/1xX0DA_QsDQeCnYtLklZVE2G9ke0gs41R/view?usp=sharing).

The producer will keep on running and allow us to send messages. The output will be shown on the consumer side.

<font color='blue'>Before executing the cell below, ensure that you have created the CloudKarafka account and specified the credentials.</font>

In [15]:

!python producer2.py

%4|1644063074.499|CONFWARN|rdkafka#producer-1| [thrd:app]: Configuration property session.timeout.ms is a consumer property and will be ignored by this producer instance
%4|1644063074.499|CONFWARN|rdkafka#producer-1| [thrd:app]: Configuration property auto.offset.reset is a consumer property and will be ignored by this producer instance

Enter number: 
1

Enter a number or interupt the execution to stop.
2

Enter a number or interupt the execution to stop.
3

Enter a number or interupt the execution to stop.
4
% Message delivered to dgcdj8l4-topic1 [4]

Enter a number or interupt the execution to stop.
5

Enter a number or interupt the execution to stop.
6

Enter a number or interupt the execution to stop.
7

Enter a number or interupt the execution to stop.
8
% Message delivered to dgcdj8l4-topic1 [3]
% Message delivered to dgcdj8l4-topic1 [0]
% Message delivered to dgcdj8l4-topic1 [0]

Enter a number or interupt the execution to stop.
9

Enter a number or interupt the execution to st