# Part 1: Producing the data
In this task, we implement an Apache Kafka producer to simulate real-time data streaming. Spark is neither required nor allowed in this part, since the producer only simulates a streaming data source.

1. Your program should send one batch of applications every 5 seconds. One batch consists of a random number of rows, between 100 and 500, from the application_data_stream dataset. Only the number of rows needs to be random; the file itself can be read sequentially.  
    As an example, if the random numbers generated for the first and second batches are 100 and 400, the first batch will send records 1-100 from the CSV file, and the second batch will send records 101-500.  
    To conserve memory, the CSV shouldn’t be loaded into memory all at once (i.e. read rows as needed).  
2. Add an integer column named ‘ts’ to each row: a Unix timestamp in seconds since the epoch (UTC). Spread the timestamps of each batch evenly over the 5-second interval.  
    For example, if you send a batch of 100 records at 2024-02-01 00:00:00 UTC (format: YYYY-MM-DD HH:MM:SS), the first record has ts = 1706745600:
    - Records 1-20: ts = 1706745600
    - Records 21-40: ts = 1706745601
    - Records 41-60: ts = 1706745602
    - ….
3. Send your batch to a Kafka topic with an appropriate name.
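
Steps 1 and 2 can be tried out without Kafka. The following is a minimal sketch rather than the required solution: `take_batch` and `stamp_batch` are hypothetical helper names, the in-memory CSV only stands in for the real dataset, and the offset formula assumes a 5-second window in which a record's position is scaled down to whole seconds.

```python
import csv
import io
import random
from itertools import islice
from time import time

WINDOW_SECONDS = 5  # assumed batch interval, taken from the task description


def take_batch(reader, min_rows=100, max_rows=500):
    """Pull a random number of rows from an open CSV reader.

    Only the requested rows are read; the reader keeps its position,
    so the next call continues where this one stopped.
    """
    size = random.randint(min_rows, max_rows)
    return list(islice(reader, size))


def stamp_batch(rows, start_ts, window=WINDOW_SECONDS):
    """Return copies of the rows with an integer 'ts' spread over `window` seconds."""
    n = len(rows)
    stamped = []
    for i, row in enumerate(rows):
        record = dict(row)  # copy, so the input rows are left untouched
        record["ts"] = start_ts + (i * window) // n
        stamped.append(record)
    return stamped


if __name__ == "__main__":
    # Small in-memory CSV standing in for application_data_stream.csv
    lines = "\n".join(f"{i},{i * 10}" for i in range(1, 601))
    sample = io.StringIO("id,amount\n" + lines)
    reader = csv.DictReader(sample)
    batch = stamp_batch(take_batch(reader), start_ts=int(time()))
    print(len(batch), batch[0], batch[-1])
```

With 100 rows, the integer division `(i * 5) // 100` gives offset 0 for the first 20 rows, 1 for the next 20, and so on, which matches the example in step 2.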

All the data except for the ‘ts’ column should be sent as strings, without conversion to other data types. In many stream-processing applications, the data sources have little to no processing power (e.g. sensors). To simulate this, the producer shouldn’t do any processing or transformation.
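
To illustrate the point about types, the sketch below shows that `csv.DictReader` yields every field as a string and that JSON serialization keeps those strings intact, so only `ts` arrives as a number. The column names and values are made up for illustration, and the `serialize` helper is a hypothetical stand-in for a `json.dumps(...).encode('utf-8')` value serializer.

```python
import csv
import io
import json


def serialize(record):
    """Encode a record as UTF-8 JSON bytes, as a value serializer might."""
    return json.dumps(record).encode("utf-8")


# Hypothetical two-column CSV; the real column names come from the dataset header
raw = io.StringIO("id,amount\n100001,568800.0\n")
row = next(csv.DictReader(raw))  # every value is a str, nothing is parsed
row["ts"] = 1706745600           # the only non-string field

payload = serialize(row)
decoded = json.loads(payload.decode("utf-8"))

print(payload)
print({key: type(value).__name__ for key, value in decoded.items()})
```

Because no field is converted on the producer side, any casting (for example, of numeric columns) is left to the consumer.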


In [None]:
# import statements
import csv
import random
from json import dumps
from time import sleep, time

from kafka3 import KafkaProducer

# configurations
hostip = "host.docker.internal"  # resolves to the Docker host machine

def publish_message(producer_instance, topic_name, data):
    try:
        producer_instance.send(topic_name, data)
    except Exception as ex:
        print('Exception in publishing message.')
        print(str(ex))
        
def connect_kafka_producer():
    """Return a connected KafkaProducer, or None if the connection fails."""
    try:
        # Each message value is serialised to UTF-8 encoded JSON
        return KafkaProducer(bootstrap_servers=[f'{hostip}:9092'],
                             value_serializer=lambda x: dumps(x).encode('utf-8'),
                             api_version=(0, 10))
    except Exception as ex:
        print('Exception while connecting Kafka.')
        print(str(ex))
        return None
    
if __name__ == "__main__":
    topic = 'test'
    path = "application_data_stream.csv"
    
    print('Publishing records..')
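    producer = connect_kafka_producer()
    if producer is None:
        # Without a producer every send would fail, so stop early
        raise SystemExit('Could not connect to Kafka.')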
    
    with open(path, "r", encoding="utf-8") as file:
        reader = csv.reader(file)
        header = next(reader)
        
        while True:
            size = random.randint(100, 500)  # the task asks for 100-500 rows per batch
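            batch = []
            batch_start = int(time())  # Unix time in whole seconds (UTC)
            
            for i in range(size):
                try:
                    line = next(reader)
                except StopIteration:
                    # End of file reached: rewind and continue from the first data row
                    file.seek(0)
                    next(reader)  # Skip header
                    line = next(reader)
                
                # Field values stay as strings; only the integer 'ts' is added
                data = dict(zip(header, line))
                # Spread the batch evenly across the 5-second window
                data["ts"] = batch_start + (i * 5) // size
                batch.append(data)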
            
            publish_message(producer, topic, batch)
            print(f'it has {len(batch)} data')
            sleep(5)
    

Publishing records..
it has 13 data
it has 11 data
it has 19 data
it has 16 data
it has 13 data
it has 12 data
it has 18 data
it has 11 data
it has 18 data
it has 10 data
it has 20 data
it has 17 data
it has 16 data
it has 14 data
it has 10 data
it has 18 data
it has 12 data
it has 15 data
it has 13 data
it has 12 data
it has 15 data
it has 16 data
it has 18 data
it has 19 data
it has 14 data
it has 20 data
it has 14 data
it has 11 data
it has 13 data
it has 18 data
it has 10 data
it has 20 data
it has 10 data
it has 18 data
it has 15 data
it has 14 data
it has 18 data
it has 13 data
it has 20 data
it has 13 data
it has 16 data
it has 16 data
it has 15 data
it has 11 data
it has 14 data
it has 20 data
it has 17 data
it has 11 data
it has 18 data
it has 13 data
it has 15 data
it has 10 data
it has 11 data
it has 18 data
it has 10 data
it has 11 data
it has 13 data
it has 10 data
it has 12 data
it has 14 data
it has 13 data
it has 19 data
it has 17 data
it has 18 data
it has 15 data
it h