# Building Models to Predict Prospective Customers for MonPG Part 2 - Producer

Date: 10/10/2022

Version: 1.0

Environment: Python 3.10.5 and Anaconda 4.11.0 (64-bit)

#### Libraries used:

* `datetime` to generate the batch timestamp.
* `pytz` (`timezone`) to specify the UTC timezone.
* `time` (`sleep`) to control the interval between batches.
* `kafka3` (`KafkaProducer`) to stream the data.
* `random` to select the batch size.
* `json` (`dumps`) to serialise the data.
* `csv` to read the source files.

## 1. Producing the Data

The goal of this section is to initialise the main producer that sends the customer and bureau data to Spark Structured Streaming. Each batch contains a randomly selected number of customer records.

### 1.1 Loading the libraries

Before starting the producer, let us load the required libraries as follows:

In [1]:
# Importing the required libraries
from time import sleep
from json import dumps
from kafka3 import KafkaProducer
import random
from datetime import datetime
from pytz import timezone
import csv

### 1.2 Defining the Functions and Initialising the Kafka Connection

In this step, we will set up the functions that will form the backbone of this Kafka producer. The functions being defined are as follows:

* **readCSVFile**: To read the customer or bureau CSV file, keep only the records whose customer IDs belong to the current batch, and attach the batch timestamp to each of those records.
* **publish_message**: To publish a batch. It takes the producer instance (which holds the broker connection), the name of the topic from which Spark reads the streamed data, and the batch of data to send.
* **connect_kafka_producer**: To create the Kafka producer connected to the broker at `localhost:9092`. The producer serialises each batch to JSON before sending it.

In [2]:
# Function to read the csv file and prepare the data for streaming
def readCSVFile(dataFile):

    # Opening the csv file
    with open(dataFile, 'rt') as f:
        
        # Reading the data into a dictionary
        read_data = csv.DictReader(f)
        
        # Creating an empty list to store the incoming data
        data_store = []
        
        # Looping through each record in the batch
        for record in read_data:
            
            # Keeping only the records whose ID belongs to the current batch
            if record['ID'] in cust_id_rotation:
                
                # Attaching the batch timestamp to each record
                data_store.append(dict(record, **batch_ts))

    return data_store

# Defining a function to publish the data acquired from the csv file
def publish_message(producer_instance, topic_name, data):
    
    try:
        producer_instance.send(topic_name, data)
        
    except Exception as ex:
        print('Exception in publishing message.')
        print(str(ex))

# Creating a Kafka producer connected to the local broker
def connect_kafka_producer():
    _producer = None
    try:
        _producer = KafkaProducer(bootstrap_servers = ['localhost:9092'],
                                  value_serializer = lambda x: dumps(x).encode('ascii'),
                                  api_version=(0, 10))
    except Exception as ex:
        print('Exception while connecting Kafka.')
        print(str(ex))
    finally:
        return _producer
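
To see the filtering, timestamping and serialisation steps in isolation, the following is a minimal sketch that runs without a Kafka broker. The in-memory CSV text, the `income` column and the helper name `select_rows` are assumptions made for illustration only; the real files are `customer.csv` and `bureau.csv`, and the real function reads the batch IDs and timestamp from global variables rather than from arguments.

```python
import csv
import io
from json import dumps, loads

# Hypothetical stand-in for customer.csv; only the ID column is taken from the notebook
SAMPLE_CSV = "ID,income\n1,100\n2,200\n3,300\n4,400\n"

def select_rows(csv_text, wanted_ids, batch_ts):
    """Keep the rows whose ID is in wanted_ids and add the batch timestamp to each."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row, **batch_ts) for row in reader if row["ID"] in wanted_ids]

# csv.DictReader yields every value as a string, so the IDs are compared as strings
rows = select_rows(SAMPLE_CSV, {"2", "3"}, {"ts": 1666086713})

# Same transformation as the value_serializer above: JSON text encoded to bytes
payload = dumps(rows).encode("ascii")

# A consumer would reverse the step to recover the list of dictionaries
decoded = loads(payload.decode("ascii"))
print(decoded)
```

Passing the IDs and timestamp as arguments, as in this sketch, makes the function easier to test than relying on globals, although the notebook's version behaves the same way when those globals are set.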

### 1.3 Extracting the Data and Publishing the Records

The code below works as follows:

* A batch size between 10 and 30 (inclusive) is selected at random. The first batch contains the data for customer IDs 1 up to that number.
* The batch size is randomly selected again for each subsequent batch, which starts at the customer ID following the last customer ID of the previous batch. This prevents duplication of rows.
* A UTC timestamp (in epoch seconds) is attached to every record of the batch, in a field called `ts`, before the batch is published.
* Once the last row of the customer data has been reached, the batches restart from the first customer ID. The row count of the customer data is read separately from the customer CSV file for this purpose.
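
The batching rules above can be sketched without any files or Kafka connection. This is only an illustration under stated assumptions: the helper name `next_batch` and the total of 50 rows are hypothetical, and customer IDs are assumed to run consecutively from 1.

```python
import random

def next_batch(start_id, batch_size, total_rows):
    """Return the IDs for one batch and the starting ID of the following batch."""
    ids = [str(start_id + offset) for offset in range(batch_size)]
    next_start = start_id + batch_size
    # Restart from 1 once the last customer row falls inside this batch
    if str(total_rows) in ids:
        next_start = 1
    return ids, next_start

start = 1
for _ in range(4):
    size = random.randint(10, 30)  # both bounds are inclusive
    ids, start = next_batch(start, size, total_rows=50)
    print(ids[0], "to", ids[-1])
```

Note that the final batch before a restart can list IDs beyond the last row; these simply match no record in the CSV file, so that batch is shorter than its nominal size.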

In [None]:
# Start Main here...
if __name__ == '__main__':

    # Assigning a topic name for the bureau data
    topic_bureau = 'bureau_data'

    # Assigning a topic name for the customer data
    topic_customer = 'customer_data'
    
    # Creating the list of possible batch sizes (10 to 30 inclusive)
    cust_ids_rand = list(range(10, 31))
    
    # Initialising the customer ID to start from in the first batch
    cust_id_position = 1

    # Indicating the start of the publishing
    print('Publishing Top-up Customer batches..\n')
    producer = connect_kafka_producer()
    
    # Randomly selecting the size of the first batch (between 10 and 30)
    cust_limit_rand = random.choice(cust_ids_rand)
    
    # Separately loading the customer data to get the row count
    with open("customer.csv") as cust_file:

        # Counting the number of rows in the customer csv data minus the header
        total_cust_rows = len(list(csv.reader(cust_file))) - 1
    
    # Initialising an infinite while loop to stream the data
    while True:
        
        # Building the list of customer IDs for the current batch, starting at the current position
        cust_id_rotation = [str(id + cust_id_position) for id in range(0, cust_limit_rand)]
        
        # Moving the position to the ID that follows the last one in this batch
        cust_id_position = int(cust_id_rotation[-1]) + 1

        # Defining the batch timestamp as UTC epoch seconds
        batch_ts = {'ts': int(datetime.now(timezone('UTC')).timestamp())}

        # Reading the customer csv file
        customer_rows = readCSVFile('customer.csv')

        # Reading the bureau csv file
        bureau_rows = readCSVFile('bureau.csv')
        
        # Printing the customer IDs to indicate the customers included in the current batch
        for cust in customer_rows:
            print(datetime.fromtimestamp(cust['ts']),'ID:',cust['ID'])
        print('-------------------------')

        # Optional Code: Printing the customer data
        #print(customer_rows)

        # Optional Code: Printing the bureau data
        #print(bureau_rows)

        # Broadcasting the customer topic data
        publish_message(producer, topic_customer, customer_rows)
        
        # Broadcasting the bureau topic data for the batch of customers
        publish_message(producer, topic_bureau, bureau_rows)
        
        # Selecting another random number for the next batch
        cust_limit_rand = random.choice(cust_ids_rand)

        # Resetting to start from the first customer ID if all rows of the customer data have been extracted
        if str(total_cust_rows) in cust_id_rotation:

            # Reverting to the first customer id
            cust_id_position = 1

        sleep(5)

Publishing Top-up Customer batches..

2022-10-18 19:51:53 ID: 1
2022-10-18 19:51:53 ID: 2
2022-10-18 19:51:53 ID: 3
2022-10-18 19:51:53 ID: 4
2022-10-18 19:51:53 ID: 5
2022-10-18 19:51:53 ID: 6
2022-10-18 19:51:53 ID: 7
2022-10-18 19:51:53 ID: 8
2022-10-18 19:51:53 ID: 9
2022-10-18 19:51:53 ID: 10
-------------------------
2022-10-18 19:52:04 ID: 11
2022-10-18 19:52:04 ID: 12
2022-10-18 19:52:04 ID: 13
2022-10-18 19:52:04 ID: 14
2022-10-18 19:52:04 ID: 15
2022-10-18 19:52:04 ID: 16
2022-10-18 19:52:04 ID: 17
2022-10-18 19:52:04 ID: 18
2022-10-18 19:52:04 ID: 19
2022-10-18 19:52:04 ID: 20
2022-10-18 19:52:04 ID: 21
2022-10-18 19:52:04 ID: 22
2022-10-18 19:52:04 ID: 23
2022-10-18 19:52:04 ID: 24
2022-10-18 19:52:04 ID: 25
2022-10-18 19:52:04 ID: 26
2022-10-18 19:52:04 ID: 27
2022-10-18 19:52:04 ID: 28
2022-10-18 19:52:04 ID: 29
2022-10-18 19:52:04 ID: 30
2022-10-18 19:52:04 ID: 31
-------------------------
2022-10-18 19:52:15 ID: 32
2022-10-18 19:52:15 ID: 33
2022-10-18 19:52:15 ID: 34
2