# Apache Kafka Streaming Analytics
### One Broker Setup
<br>
<hr>

#### Component: Producer
This notebook uses Apache Kafka to stream the Twitter dataset and measures the streaming performance, i.e. the throughput of the producer.

The setup uses a single **Kafka Broker**, which streams the data to the **Kafka Consumer**.

In [None]:
# Install the Python Client for Apache Kafka
!pip install confluent-kafka

In [5]:
# Load dependencies and set constants
from confluent_kafka import Producer
from time import time, perf_counter

# DATA_GENERATION_IN_MB = 1000 # ~ 1GB
DATA_GENERATION_IN_MB = 100 # For Testing
DATASET_SIZE_IN_MB = 10

TWITTER_DATA_PATH = "/home/ubuntu/Stream-Analytics/data/dataset.json"
KAFKA_TOPIC_TWITTER = "twitter-stream"


def producer_stats(start: float, end: float, data_size: int):
    """
    Prints the throughput of the producer.
    :param start: start timestamp in seconds, as returned by perf_counter()
    :param end: end timestamp in seconds, as returned by perf_counter()
    :param data_size: size of produced data in MB
    """
    delta_seconds = float(end - start)
    throughput = data_size / delta_seconds
    
    print("Sent {0} MB in {1} seconds.".format(data_size, delta_seconds))
    print("Throughput: ~{0} MB/s | ~{1} MBit/s".format(int(throughput), int(throughput * 8)))
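
The throughput arithmetic behind `producer_stats` can be checked in isolation. The following is a minimal sketch: `compute_throughput` is a hypothetical helper that is not part of this notebook, and the input values are illustrative, not measurements. It returns the values instead of printing them and rejects a non-positive duration.

```python
def compute_throughput(start: float, end: float, data_size_mb: float) -> tuple:
    """Return (MB/s, MBit/s) for data_size_mb sent between start and end (seconds)."""
    delta_seconds = end - start
    if delta_seconds <= 0:
        raise ValueError("end must be later than start")
    mb_per_s = data_size_mb / delta_seconds
    return mb_per_s, mb_per_s * 8  # 1 byte = 8 bit


# Illustrative values only: 100 MB sent in 4 seconds
mb_per_s, mbit_per_s = compute_throughput(0.0, 4.0, 100)
print(f"Throughput: ~{mb_per_s:.1f} MB/s | ~{mbit_per_s:.1f} MBit/s")
# prints: Throughput: ~25.0 MB/s | ~200.0 MBit/s
```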
    

### Reminder: Running Kafka Architecture required
The following cells assume a running Apache Kafka Environment.
<hr>
<br>

Furthermore, because this notebook runs the **Producer** component, the **Consumer** component must already be running and waiting for streaming data, so that the performance is measured under realistic conditions.
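
Before producing, it can help to check that something is listening on the broker address. The following is a minimal sketch using only the standard library. `broker_reachable` is a hypothetical helper, and `localhost:9092` is the bootstrap address assumed by this notebook. It only tests whether the TCP port accepts connections. It does not verify that Kafka is healthy or that the consumer is running.

```python
import socket


def broker_reachable(host: str = "localhost", port: int = 9092, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors
        return False


if not broker_reachable():
    print("[WARN] No listener on localhost:9092. Is the Kafka broker running?")
```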

In [6]:
# Configure and create the producer that writes to the Kafka cluster
producer_config = {
    "bootstrap.servers": "localhost:9092",
}
p = Producer(producer_config)
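
`Producer.produce` accepts an optional delivery callback, which is invoked from `poll()` or `flush()` with the arguments `(err, msg)` once a message has been delivered or has failed. A minimal sketch of using this to count the bytes actually delivered, rather than relying on an assumed dataset size, could look as follows. `DeliveryCounter` is a hypothetical helper that is not part of this notebook, and the MB conversion assumes 1 MB = 1,000,000 bytes.

```python
class DeliveryCounter:
    """Callable delivery report handler that tallies delivered bytes and failures."""

    def __init__(self):
        self.delivered_bytes = 0
        self.delivered_messages = 0
        self.failed_messages = 0

    def __call__(self, err, msg):
        # err is None on success; msg.value() returns the message payload
        if err is not None:
            self.failed_messages += 1
            return
        payload = msg.value()
        self.delivered_messages += 1
        self.delivered_bytes += len(payload) if payload is not None else 0

    @property
    def delivered_mb(self) -> float:
        # Assumption: 1 MB = 1,000,000 bytes
        return self.delivered_bytes / 1_000_000


counter = DeliveryCounter()
# Assumed usage with the producer `p` created above:
#   p.produce(KAFKA_TOPIC_TWITTER, value=tweet, callback=counter)
#   p.flush(30)
#   print(counter.delivered_mb, counter.failed_messages)
```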

In [None]:
# Quick test run with a reduced data volume; the benchmark itself is defined further below
start = perf_counter()

# Fill the topic with the specified amount of data
generation_steps = int(DATA_GENERATION_IN_MB / DATASET_SIZE_IN_MB)
with open(TWITTER_DATA_PATH, "r") as dataset:
    for step in range(generation_steps):
        print(f"Executing data generation step {step}...")
        dataset.seek(0) # Jump back to first line  
 
        for tweet in dataset:
            try:
                # print("IN QUEUE: {}".format(len(p)))
                p.produce(KAFKA_TOPIC_TWITTER, value=tweet)
                p.poll(0)
            except BufferError:
                print('[INFO] Local producer queue is full (%d messages awaiting delivery): Trying again...\n' % len(p)) 
                p.poll(1)
        
                # Retry sending tweet
                p.produce(KAFKA_TOPIC_TWITTER, value=tweet)   
p.flush(30)

end = perf_counter()
print("Data generation done!" + "\n")

producer_stats(start, end, DATA_GENERATION_IN_MB)
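
The retry in the loop above is attempted only once, so a second `BufferError` would abort the run. A bounded retry could be sketched as follows. `produce_with_retry` is a hypothetical helper that is not part of this notebook. It only assumes an object with `produce(topic, value=...)` and `poll(timeout)` methods, which the `Producer` from `confluent_kafka` provides.

```python
def produce_with_retry(producer, topic: str, value, max_retries: int = 5,
                       wait_seconds: float = 1.0) -> int:
    """Enqueue value on topic; on BufferError serve callbacks via poll() and retry.

    Returns the number of retries that were needed. Re-raises BufferError if the
    local queue is still full after max_retries retries.
    """
    for attempt in range(max_retries + 1):
        try:
            producer.produce(topic, value=value)
            producer.poll(0)  # serve delivery callbacks without blocking
            return attempt
        except BufferError:
            if attempt == max_retries:
                raise
            producer.poll(wait_seconds)  # block briefly so the queue can drain


# Assumed usage inside the tweet loop:
#   produce_with_retry(p, KAFKA_TOPIC_TWITTER, tweet)
```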

In [9]:
def benchmark(data_vol: int = 1000, exec_count: int = 10):
    """ Executes a throughput benchmark procedure, i.e. measures the amount of MB per second that
        the producer sends to the broker. To get meaningful results, the benchmark procedure
        is repeated exec_count times.
        Default is ~ 1GB per execution, while executing 10 times.
        
        :param data_vol: Amount of data to send per execution, given in MB. As the dataset is 10 MB large,
                         data_vol should be a multiple of 10 (data_vol % 10 == 0).
        :param exec_count: Number of times to repeat the benchmark procedure
    """
    print("Executing throughput benchmark ...")
    
    throughputs = []
    data_generation_steps = int(data_vol / DATASET_SIZE_IN_MB)
    
    with open(TWITTER_DATA_PATH, "r") as dataset:
        for execution in range(exec_count):
            start = perf_counter()
            
            for step in range(data_generation_steps):
                dataset.seek(0) # Jump back to first line 
                
                for tweet in dataset:
                    try:
                        p.produce(KAFKA_TOPIC_TWITTER, value=tweet)
                        p.poll(0)
                    except BufferError:
                        print('[INFO] Local producer queue is full (%d messages awaiting delivery): Trying again...\n' % len(p)) 
                        p.poll(5)

                        # Retry sending tweet
                        p.produce(KAFKA_TOPIC_TWITTER, value=tweet)   
                        
            # Wait for outstanding deliveries so the timing covers sent data, not only enqueued data
            p.flush(30)
            end = perf_counter()
        
            delta_seconds = float(end - start)
            throughput = data_vol / delta_seconds
            throughputs.append(throughput)

    print("Benchmark procedure finished. Visualizing the results ...")
    #for count, throughput in enumerate(throughputs):
        # TODO: plot throughputs

In [12]:
# Execute the benchmark
benchmark(data_vol=10, exec_count=2)

Executing throughput benchmark ...
Benchmark procedure finished. Visualizing the results ...
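
Until the plot is implemented, the recorded throughputs could be summarized numerically. The following is a minimal sketch that assumes `benchmark` were changed to return its `throughputs` list, which it currently does not do. `summarize_throughputs` is a hypothetical helper, and the example values are placeholders, not measured results.

```python
from statistics import mean, stdev


def summarize_throughputs(throughputs: list) -> dict:
    """Return run count, min, max, mean and sample standard deviation (MB/s)."""
    if not throughputs:
        raise ValueError("no throughput values recorded")
    return {
        "runs": len(throughputs),
        "min": min(throughputs),
        "max": max(throughputs),
        "mean": mean(throughputs),
        # stdev needs at least two samples
        "stdev": stdev(throughputs) if len(throughputs) > 1 else 0.0,
    }


# Placeholder values for illustration only, not measured results
example = [20.0, 30.0]
summary = summarize_throughputs(example)
print("Runs: {runs} | min {min:.1f} | max {max:.1f} | "
      "mean {mean:.1f} | stdev {stdev:.1f} MB/s".format(**summary))
```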
