# Stream Processing
## Goal
The goal of the producer is to load all historical measurements between 2018-12-06 and 2023-03-31, and then send batches of measurements at regular time intervals.
Specifically, every $∆$ seconds, the producer sends a batch of measurements corresponding to all timestamps within a time period $Π$.
For example, if $Π$ = 30 days and $∆$ = 5 seconds, the producer sends 30 days' worth of data for each sensor every 5 seconds. The producer should be parameterizable by $∆$ and $Π$.
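
To make the relationship between $∆$, $Π$ and the date range concrete, the short sketch below counts how many batches the example above would produce. It is only an illustration: `count_batches` is a hypothetical helper that is not part of the notebook, and it assumes consecutive, non-overlapping windows of $Π$ days starting on the first day.

```python
from datetime import date, timedelta


def count_batches(first_day: date, last_day: date, pi_days: int) -> int:
    """Count the windows of `pi_days` days whose start falls in [first_day, last_day]."""
    n = 0
    start = first_day
    while start <= last_day:
        n += 1
        start += timedelta(days=pi_days)
    return n


# Example values from the text: Pi = 30 days, Delta = 5 seconds.
n_batches = count_batches(date(2018, 12, 6), date(2023, 3, 31), 30)
# Time spent waiting between batches, ignoring the time needed to build each batch.
total_wait_seconds = n_batches * 5
print(n_batches, total_wait_seconds)
```

Under these assumptions the full date range is covered by 53 batches, so with $∆$ = 5 seconds the producer would spend about 265 seconds waiting, on top of the time needed to build each batch.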

## Common imports and Environment variables

In [1]:
import time
import numpy as np
import socket
import os
from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, first
import sys

os.environ['PYSPARK_SUBMIT_ARGS'] = '--conf spark.ui.port=4040 ' + \
                                    '--conf spark.driver.memory=1g  pyspark-shell '
np.set_printoptions(threshold=sys.maxsize)

## Parameters
In this section we define the values of $∆$ and $Π$.
After several attempts, we set $∆$ to 2 minutes: the consumer needs around 2 minutes to process one batch, so this interval gives it time to finish computing before the next batch arrives.

In [2]:
delta = 120  # seconds
pi = 30  # days

## Spark session creation

In [3]:
spark = SparkSession.builder.master("local[*]").appName("stream_processing").getOrCreate()

# Get Spark context
sc = spark.sparkContext
sc.setLogLevel("ERROR")

23/05/15 21:58:57 WARN Utils: Your hostname, pop-os resolves to a loopback address: 127.0.1.1; using 192.168.129.3 instead (on interface enp38s0)
23/05/15 21:58:57 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/05/15 21:58:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Data read
The file `data.csv` contains the pre-processed data from Part 1. If you don't have this file, you can generate it by running the notebook `retrieve_preprocess_data`.
The variable `df` is the dataframe holding the content of this file.

In [4]:
df = spark.read.option("header", True).csv("data.csv")
df.show()

+---+----------+--------+-------+-----+-------------+
|_c0|      Date|Time gap| Sensor|Count|Average speed|
+---+----------+--------+-------+-----+-------------+
|  0|2018-12-06|       1|  CAT17|  0.0|         -1.0|
|  1|2018-12-06|       1|CB02411|  0.0|         -1.0|
|  2|2018-12-06|       1| CB1101|  0.0|         -1.0|
|  3|2018-12-06|       1| CB1142|  0.0|         -1.0|
|  4|2018-12-06|       1| CB1143|  0.0|         -1.0|
|  5|2018-12-06|       1| CB1599|  0.0|         -1.0|
|  6|2018-12-06|       1| CB1699|  0.0|         -1.0|
|  7|2018-12-06|       1| CB2105|  0.0|         -1.0|
|  8|2018-12-06|       1| CEE016|  0.0|         -1.0|
|  9|2018-12-06|       1| CEK049|  0.0|         -1.0|
| 10|2018-12-06|       1|  CEK18|  0.0|         -1.0|
| 11|2018-12-06|       1|  CEK31|  0.0|         -1.0|
| 12|2018-12-06|       1| CEV011|  0.0|         -1.0|
| 13|2018-12-06|       1| CJE181|  0.0|         -1.0|
| 14|2018-12-06|       1|  CJM90|  0.0|         -1.0|
| 15|2018-12-06|       1| CL

## Socket creation
We open a listening TCP socket on port `9999` and wait for a consumer to connect.

In [5]:
host = 'localhost'
port = 9999
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind((host, port))
s.listen(2)
print("Waiting for a new connection...")
c, addr = s.accept()
print("CONNECTION FROM:", str(addr))

Waiting for a new connection...
CONNECTION FROM: ('127.0.0.1', 39770)


This cell blocks until the `consumer` notebook connects to `localhost` on port `9999`.
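
For reference, a client could attach to this socket roughly as sketched below. This is a minimal sketch and not the code of the actual `consumer` notebook: `connect_consumer` and `iter_lines` are hypothetical helper names, and the sketch assumes the producer sends UTF-8 text in which lines are terminated by `\n`.

```python
import socket
from typing import Iterator


def connect_consumer(host: str = "localhost", port: int = 9999) -> socket.socket:
    """Open a TCP connection to the producer (same host and port as in the cell above)."""
    return socket.create_connection((host, port))


def iter_lines(conn: socket.socket) -> Iterator[str]:
    """Yield received text lines until the producer closes the connection."""
    # makefile() wraps the socket in a buffered text reader, so partial
    # TCP reads are reassembled into complete lines for us.
    with conn.makefile("r", encoding="utf-8", newline="\n") as reader:
        for line in reader:
            yield line.rstrip("\n")
```

A client would then loop with `for line in iter_lines(connect_consumer()):` and process each line as it arrives.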

## Data streaming
Next, we send the data through the socket, one batch every $∆$ seconds, with each batch containing $Π$ days of measurements.
We have added two customizable bounds: `from_day`, which specifies the starting point, and `max_day`, which halts batch sending once it is reached.
These variables are useful because not every sensor is active from the beginning: by modifying `from_day` or `max_day` we can isolate a specific time frame and compute the Pearson coefficient on it.

In [None]:
max_day = '2023-05-31'
from_day = '2018-12-06'
date_format = '%Y-%m-%d'

start_date = datetime.strptime(from_day, date_format)
end_date = datetime.strptime(max_day, date_format)

while start_date <= end_date:
    # Extract the rows of the next window of pi days, ordered chronologically
    window_end = start_date + timedelta(days=pi)
    pi_days = (
        df.filter((col("Date") >= start_date) & (col("Date") < window_end))
        .orderBy(col("Date"), col("Time gap"))
        .collect()
    )

    # Move the start date to the next window
    start_date = window_end

    # Put the data to send in an array
    data = np.array(pi_days)
    # Serialize the array as a string and print the first characters of the message
    message = np.array2string(data, separator=',', max_line_width=1000) + '\n'
    print(message[:10])

    # Send the message to the client
    try:
        # sendall() keeps sending until the whole message is written;
        # send() may transmit only part of a large batch
        c.sendall(message.encode())
    except socket.error:
        print("Failed. Waiting for a new connection...")
        # If failed, the client is probably disconnected. Wait for another connection.
        # Note that the batch that failed is not sent again.
        c.close()
        c, addr = s.accept()

    time.sleep(delta)

                                                                                

[['0','201
[['51840',
[['103680'
[['155520'

Then we properly close the sockets.

In [None]:
c.close()
s.close()