# Apache Kafka Streaming Analytics
### One Broker Setup
<br>
<hr>

#### Component: Producer
This notebook uses Apache Kafka to stream a Twitter dataset and measures the resulting streaming performance.

In this setup, a single **Kafka Broker** streams the data to the **Kafka Consumer**.

In [None]:
# Install the Python Client for Apache Kafka
!pip install confluent-kafka

In [29]:
# Load dependencies and set constants
from confluent_kafka import Producer
import time 

# DATA_GENERATION_IN_MB = 1000  # ~1 GB
DATA_GENERATION_IN_MB = 100  # Total amount of data to send to the topic
DATASET_SIZE_IN_MB = 10  # Size of the dataset file that is replayed

TWITTER_DATA_PATH = "/home/ubuntu/Stream-Analytics/data/dataset.json"
KAFKA_TOPIC_TWITTER = "twitter-stream"


def producer_stats(start, end, data_size):
    """
    Print the elapsed time and throughput of a producer run.
    Throughput values are truncated to whole numbers.
    :param start: start time in milliseconds since the Unix epoch (UTC)
    :param end: end time in milliseconds since the Unix epoch (UTC)
    :param data_size: size of the produced data in MB
    """
    delta_seconds = (end - start) / 1000
    throughput = data_size / delta_seconds
    
    print("Sent {0} MB in {1} seconds.".format(data_size, delta_seconds))
    print("Throughput: ~{0} MB/s | ~{1} MBit/s".format(int(throughput), int(throughput * 8)))
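
As a worked example of the calculation in `producer_stats`, the sketch below returns the values instead of printing them, so they can be checked by hand. The helper name `throughput_mb_per_s` is hypothetical and not part of the notebook; the input figures are taken from the recorded run (100 MB in 1364 ms).

```python
def throughput_mb_per_s(start_ms, end_ms, data_size_mb):
    """Return (seconds, MB/s, MBit/s) for a transfer timed in epoch milliseconds.

    Hypothetical helper for illustration; mirrors the arithmetic of producer_stats.
    """
    delta_seconds = (end_ms - start_ms) / 1000
    if delta_seconds <= 0:
        raise ValueError("end_ms must be later than start_ms")
    mb_per_s = data_size_mb / delta_seconds
    return delta_seconds, mb_per_s, mb_per_s * 8  # 1 byte = 8 bits


# 100 MB sent in 1364 ms
seconds, mb_s, mbit_s = throughput_mb_per_s(0, 1364, 100)
print(f"{seconds:.3f} s | {mb_s:.1f} MB/s | {mbit_s:.1f} MBit/s")
```

Because `producer_stats` applies `int()` to both values, it reports them truncated rather than rounded.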
    

### Reminder: Running Kafka Environment Required
The following cells assume a running Apache Kafka environment.
<hr>
<br>

Because this notebook runs the **Producer** component, make sure that the **Consumer** component is already running and waiting for streaming data, so that the performance is measured under realistic conditions.

In [4]:
# Configure the producer that writes the data to the Kafka cluster
producer_config = {
    "bootstrap.servers": "localhost:9092",
}
p = Producer(producer_config)
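
The throughput reported in this notebook is based on the nominal `DATA_GENERATION_IN_MB`, not on what the broker actually acknowledged. One way to cross-check it, sketched here as an assumption rather than as part of the original measurement, is a delivery callback that counts acknowledged bytes. `DeliveryCounter` is a hypothetical name; the `(err, msg)` callback signature and the `on_delivery` argument of `produce()` are those of the confluent-kafka client.

```python
class DeliveryCounter:
    """Counts acknowledged messages and bytes; usable as a delivery callback."""

    def __init__(self):
        self.delivered_messages = 0
        self.delivered_bytes = 0
        self.failed_messages = 0

    def __call__(self, err, msg):
        # The client calls this with (err, msg); err is None on success
        if err is not None:
            self.failed_messages += 1
            return
        value = msg.value()
        self.delivered_messages += 1
        self.delivered_bytes += len(value) if value is not None else 0


# Intended use (needs a running broker, so it is left commented out here):
# counter = DeliveryCounter()
# p.produce(KAFKA_TOPIC_TWITTER, value=tweet, on_delivery=counter)
# After p.flush(), counter.delivered_bytes / 1_000_000 is the delivered size in MB
```

Callbacks are only served from `poll()` and `flush()`, so the counter is complete only after the final `flush()` has returned.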

In [30]:
start = int(time.time() * 1000)

# Fill the topic with the specified amount of data by replaying the dataset file
generation_steps = int(DATA_GENERATION_IN_MB / DATASET_SIZE_IN_MB)
with open(TWITTER_DATA_PATH, "r") as dataset:
    for step in range(generation_steps):
        print(f"Executing data generation step {step}...")
        dataset.seek(0)  # Jump back to the first line

        for tweet in dataset:
            # Keep retrying the same tweet until the local queue accepts it
            while True:
                try:
                    p.produce(KAFKA_TOPIC_TWITTER, value=tweet)
                    p.poll(0)  # Serve delivery callbacks without blocking
                    break
                except BufferError:
                    # Local queue is full: wait up to 1 s for deliveries, then retry
                    print('[INFO] Local producer queue is full (%d messages awaiting delivery): Trying again after flushing...\n' % len(p))
                    p.poll(1)

                
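remaining = p.flush(30)  # Wait up to 30 s for all queued messages to be delivered
end = int(time.time() * 1000)
if remaining > 0:
    print(f"[WARN] {remaining} messages were still undelivered after the flush timeout; throughput is overstated.")
print("Data generation done!" + "\n")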

producer_stats(start, end, DATA_GENERATION_IN_MB)

Executing data generation step 0...
Executing data generation step 1...
Executing data generation step 2...
Executing data generation step 3...
Executing data generation step 4...
Executing data generation step 5...
Executing data generation step 6...
Executing data generation step 7...
[INFO] Local producer queue is full (100000 messages awaiting delivery): Trying again after flushing...

Executing data generation step 8...
[INFO] Local producer queue is full (100000 messages awaiting delivery): Trying again after flushing...

Executing data generation step 9...
Data generation done!

Sent 100 MB in 1.364 seconds.
Throughput: ~73 MB/s | ~586 MBit/s
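
The two `[INFO]` lines in the output correspond to `produce()` raising `BufferError` once 100000 messages were waiting in the local queue. A reusable version of the retry handling, with an upper bound on the number of attempts, could look like the sketch below. `produce_with_retry` and its parameters are hypothetical names; the only assumption about `producer` is that it behaves like `confluent_kafka.Producer`.

```python
def produce_with_retry(producer, topic, value, max_retries=5, poll_timeout=1.0):
    """Enqueue one message, draining the local queue whenever it is full.

    `producer` is expected to behave like confluent_kafka.Producer:
    produce() raises BufferError when the local queue is full, and
    poll() serves delivery reports, which frees queue space.
    Returns the number of retries that were needed.
    """
    for attempt in range(max_retries + 1):
        try:
            producer.produce(topic, value=value)
            producer.poll(0)  # Serve delivery callbacks without blocking
            return attempt
        except BufferError:
            if attempt == max_retries:
                raise  # Give up and surface the error to the caller
            producer.poll(poll_timeout)  # Block briefly so the queue can drain
```

Inside the generation loop this would be called as `produce_with_retry(p, KAFKA_TOPIC_TWITTER, tweet)`. Bounding the retries means a broker outage ends in an error instead of an endless loop.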
