INFO-H-515 Project <br>
2022–2023

# Phase 2 : Producer
Gianluca Bontempi, Théo Verhelst, Cédric Simar <br>
Computer Science Department, ULB

### Information
Group Number : 5 <br>
Group Members : Rania Baguia (000459242), Hakim Amri (000459153), Julian Cailliau (000459856), Mehdi Jdaoudi (000457507)

### Imports

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col, rank, monotonically_increasing_id
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType
import socket
import pandas as pd
import numpy as np
import os
import logging
import json
import time
logging.basicConfig(level=logging.INFO)

### Key notebook variables

In [None]:
FILE_PATH:str = "bike_counts.csv"

### Producer
We opted for the socket-based approach for sending data rather than the file-based one. The producer first binds port 9999 on localhost. It then creates a Spark session so that the CSV table can be read as a Spark DataFrame, which benefits from parallelised reading of the file. Finally, each batch is queried from the DataFrame, collected, converted to an array, encoded, and sent over the socket.
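
As an illustration of the socket-based approach, here is a minimal loopback sketch, not the notebook's actual code. It assumes a hypothetical consumer that reads until a newline delimiter, and it binds to an OS-chosen port on `127.0.0.1` instead of port 9999 so that it cannot clash with a running producer. The function name `run_loopback_demo` and the sample payload are made up for the example.

```python
import socket
import threading


def run_loopback_demo(payload: str = "[['2022-01-01','1']]\n") -> str:
    """Send one newline-terminated message from a server socket to a client."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
    server.listen(1)
    port = server.getsockname()[1]

    received = []

    def consumer() -> None:
        # Hypothetical consumer: connect, then read until the newline delimiter
        with socket.create_connection(("127.0.0.1", port)) as client:
            buffer = b""
            while not buffer.endswith(b"\n"):
                chunk = client.recv(4096)
                if not chunk:  # the producer closed the connection
                    break
                buffer += chunk
            received.append(buffer.decode())

    thread = threading.Thread(target=consumer)
    thread.start()

    conn, _addr = server.accept()   # blocks until the consumer connects
    conn.sendall(payload.encode())  # one message = one line
    conn.close()
    thread.join()
    server.close()
    return received[0]


print(repr(run_loopback_demo()))
```

The producer plays the server role here: it binds and listens first, and only starts sending once a consumer has connected.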

#### Configuring the producer

In [None]:
# host name and port number of the server
host = 'localhost'
port = 9999
  
# create a socket at server side
# using TCP / IP protocol
s = socket.socket(socket.AF_INET,
                  socket.SOCK_STREAM)
  
# bind the socket to the host
# and port number
s.bind((host, port))
  
s.listen(5)

In [None]:
# creating the spark session for the producer
spark = SparkSession \
    .builder \
    .master("local[10]")\
    .config("spark.executor.instances", "1") \
    .config("spark.executor.cores", "10") \
    .config("spark.executor.memory", "16G") \
    .appName("Producer") \
    .getOrCreate()

# Let us retrieve the sparkContext object
sc=spark.sparkContext

#### Loading the necessary information
We first load the sensor information, essentially the sensor names. This allows us to restrict the producer to sending data for a predefined number of sensors. This restriction exists for scalability purposes.

In [2]:
# Read the file with the sensor information. It is small, so plain Python is enough
sensors = []
with open("data/bikes_sensors.json", "r") as f:
    sensors = json.load(f)
    sensors = [
        sensor["properties"]["device_name"] for sensor in sensors["features"]
    ]

Then, we use Spark to read the CSV, first specifying the schema of the data. Note that the date is typed as `StringType()`. Reading it as `DateType()` caused formatting issues, because it inserted `,` characters into the records. As a result, the consumer could not correctly interpret the shape of the input.

In [None]:
schema = StructType([
    StructField("Date", StringType(), nullable = False),
    StructField("Time Gap", IntegerType(), nullable = False),
    StructField("Count", IntegerType(), nullable = False),
    StructField("Average speed", IntegerType(), nullable = False),
    StructField("sensor", StringType(), nullable = False),
    StructField("timestamp", IntegerType(), nullable = False)
    ]) 

bike_counts = spark.read.format("csv") \
        .option("header", True) \
        .schema(schema) \
        .load(FILE_PATH)\
        .orderBy("timestamp", "sensor")\
        .cache()
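
The choice of `StringType()` for the `Date` column can be illustrated without Spark. The sketch below shows one plausible mechanism for the extra commas, assuming that a `DateType()` column is collected as Python `datetime.date` objects: NumPy then prints each date through its `repr`, which itself contains commas. The helper `to_message` and the sample record are hypothetical; the helper mirrors the serialisation used in the sending loop, with `sys.maxsize` as the threshold.

```python
import datetime
import sys

import numpy as np


def to_message(rows) -> str:
    """Serialise rows in the same spirit as the producer's sending loop."""
    arr = np.array(rows)
    text = np.array2string(arr, separator=",", threshold=sys.maxsize)
    return text.replace("\n", "").replace(" ", "") + "\n"


# Hypothetical record (Date, Time Gap, Count); the values are made up
as_string = to_message([("2022-01-01", 1, 3)])
as_date = to_message([(datetime.date(2022, 1, 1), 1, 3)])

print(as_string)
print(as_date)
```

With the date kept as a string, every comma in the message separates two fields. With a `datetime.date` object, the year, month, and day are separated by commas as well, so a consumer that splits on `,` sees more fields than expected.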

#### Run the producer

To send the data over sockets, three parameters are key:
- `batchTimeInterval`: the time interval (in seconds) between two batches.
- `timePeriod`: the number of days to send in a batch.
- `n_sensors`: the number of sensors to consider (essentially for scalability). If set to `-1`, all the sensors are considered.
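
To make the effect of `timePeriod` concrete, the following sketch reproduces the timestamp window arithmetic used by the sending loop. The function name `batch_bounds` is hypothetical. The default of 96 time gaps per day is the value used in the notebook, and both bounds are exclusive.

```python
from typing import Tuple


def batch_bounds(i: int, time_period: int = 10, time_gaps: int = 96) -> Tuple[int, int]:
    """Return the exclusive (lower, upper) timestamp bounds of batch i."""
    timesteps_per_batch = time_period * time_gaps  # timesteps per batch per sensor
    lower = i * timesteps_per_batch
    upper = lower + timesteps_per_batch + 1
    return lower, upper


for i in range(3):
    lb, hb = batch_bounds(i)
    print(f"batch {i}: {lb} < timestamp < {hb}  (timestamps {lb + 1} to {hb - 1})")
```

With `timePeriod = 10`, each batch carries 10 × 96 = 960 timesteps per sensor: batch 0 covers timestamps 1 to 960, batch 1 covers 961 to 1920, and so on. Because the lower bound is exclusive, this windowing assumes that the `timestamp` column starts at 1; a record with timestamp 0 would never be sent.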

In [None]:
i = 0 # Counter variable
batchTimeInterval = 10 # Time interval between batches
timePeriod = 10 # Number of days to send in a batch
timeGaps = 96 # Number of time gaps in a day
n_sensors = -1 # Number of sensors (-1 indicates no restriction)

# Checking if the number of sensors is valid
if n_sensors != -1 :
    if n_sensors > len(sensors) :
        # Warning message if the requested number of sensors is larger than the available number
        logging.warning(f"The requested number of sensors ({n_sensors}) is larger than the number of sensors available ({len(sensors)}). Truncating the number of sensors to {len(sensors)}.")
        n_sensors = len(sensors)
    # Selecting a subset of sensors based on the requested number
    sensors_restricted = sensors[:n_sensors]

timestepsPerBatch = timePeriod * timeGaps # Number of timesteps per batch per sensor

logging.info(f"Waiting for connections on port {port}")
c, addr = s.accept()
logging.info(f"Connection from : {str(addr)}")

while True:
    LB = i * timestepsPerBatch # Lower bound of timestamp range
    HB = i * timestepsPerBatch + timestepsPerBatch + 1 # Upper bound of timestamp range
    if n_sensors == -1 :
        # Filtering bike_counts based on timestamp range only 
        query = bike_counts.filter((col("timestamp") > LB) & (col("timestamp") < HB))
    else :
        # Filtering bike_counts based on timestamp range and restricted sensors
        query = bike_counts.filter((col("timestamp") > LB) & (col("timestamp") < HB) & (col("sensor").isin(sensors_restricted)))

    arr = np.array(query.collect()) # Collecting query results as a numpy array
    if arr.shape[0] > 0 : 
        #  Converting numpy array to string and sending it over the connection
        message  = np.array2string(arr, separator=",", threshold=np.inf).replace("\n", "").replace(" ", "") + "\n"
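        try:
            # sendall keeps writing until the whole message is sent (send may stop early)
            c.sendall(message.encode())
        except socket.error:
            # The consumer disconnected: close and wait for a new connection
            c.close()
            c, addr = s.accept()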
    else :
        # Logging a message when all the data has been consumed and closing the connection
        logging.info("Consumed all the data.")
        c.close()
        break
    
    time.sleep(batchTimeInterval) # Waiting for the specified time interval
    i  += 1 # Incrementing the counter for the next batch
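
For reference, here is a sketch of the message format produced by the loop above and of one way a consumer could parse it. It is an assumption, not the actual consumer: `encode_batch` mirrors the serialisation of the loop (with `sys.maxsize` as the threshold), `decode_batch` is a hypothetical helper based on `ast.literal_eval`, and the sensor names and values are made up.

```python
import ast
import sys

import numpy as np


def encode_batch(rows) -> bytes:
    """Turn collected rows into one newline-terminated line of bytes."""
    arr = np.array(rows)
    text = np.array2string(arr, separator=",", threshold=sys.maxsize)
    return (text.replace("\n", "").replace(" ", "") + "\n").encode()


def decode_batch(line: bytes) -> list:
    """Hypothetical consumer-side helper: parse one line into a list of records."""
    return ast.literal_eval(line.decode().strip())


# Made-up records: (Date, Time Gap, Count, Average speed, sensor, timestamp)
rows = [
    ("2022-01-01", 1, 3, 15, "SENSOR_A", 1),
    ("2022-01-01", 1, 0, -1, "SENSOR_B", 1),
]
records = decode_batch(encode_batch(rows))
print(records[0])
```

Two properties of this format are worth keeping in mind. Since the rows mix strings and integers, NumPy stores every field as a string, so the consumer has to cast the numeric fields back. The `replace(" ", "")` call also removes spaces inside field values, which is harmless only as long as dates and sensor names contain no spaces.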

### Closing the spark session

In [None]:
spark.stop()