# Bike counter forecasting - Data Stream Producer 

This notebook replays the historical data stored in the `historical_data_csv/data.csv` file as a stream.

It sends one **week** of data per batch, covering the period from 2022-04-15 to 2023-03-31.
Weekly batches were chosen so that the whole period can be processed in a reasonable amount of time, without waiting an hour for all the data to go through.
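
To give a rough idea of why weekly batches keep the replay short, the sketch below counts the daily and weekly batches in the period and the time spent pausing between them. It treats both bounds as inclusive and assumes a 2-second pause per batch (the `delta` value set in the Parameters cell). `count_batches` is a hypothetical helper, not part of the notebook, and processing time on the consumer side is not modelled.

```python
from datetime import date, timedelta


def count_batches(start, end):
    """Return (number of days, number of ISO weeks) in [start, end]."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    # isocalendar() gives (ISO year, ISO week, weekday); keep the first two
    weeks = {tuple(d.isocalendar())[:2] for d in days}
    return len(days), len(weeks)


n_days, n_weeks = count_batches(date(2022, 4, 15), date(2023, 3, 31))
pause = 2  # seconds between batches (assumed, mirrors `delta`)
print(f"daily batches:  {n_days}, at least {n_days * pause} s of pauses")
print(f"weekly batches: {n_weeks}, at least {n_weeks * pause} s of pauses")
```

The pauses are only a lower bound on the total replay time, since each batch must also be collected, serialized and consumed.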


## Common imports

In [1]:
import os
import socket
import sys
import time

import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, first
from pyspark.sql.functions import to_timestamp, date_format, lpad
from pyspark.sql.functions import dayofweek, dayofyear, year, weekofyear, concat, lit

# Print full arrays so that no ellipsis ("..." string) ends up in a message
np.set_printoptions(threshold=sys.maxsize)

# Spark launch options; they must be set before the SparkSession is created
os.environ['PYSPARK_SUBMIT_ARGS'] = '--conf spark.ui.port=4040 ' + \
                                    '--conf spark.driver.memory=1g  pyspark-shell '

## Parameters

In [2]:
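delta = 2                 # pause between two batches, in seconds
start_day = "2022-04-15"  # lower bound of the streamed period
end_day = "2023-03-31"    # upper bound of the streamed period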

## Spark session creation

In [3]:
spark = SparkSession.builder.master("local[*]").appName("bike_forecasting").getOrCreate() 

23/05/15 20:59:20 WARN Utils: Your hostname, pop-os resolves to a loopback address: 127.0.1.1; using 192.168.129.3 instead (on interface enp38s0)
23/05/15 20:59:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/05/15 20:59:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Data read
The file `data.csv` contains the pre-processed (completed) data from Part 1.
The variable `df` is the dataframe holding the content of this file.
The dataframe is augmented with additional time-related features needed for Task 4.
Because the data are sent week by week, we also add a `year_week` column that combines the year and the week of the year of each row. All rows sharing the same `year_week` value are sent to the consumer in the same batch.
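
The sketch below reproduces the `year_week` key in plain Python to show how rows are grouped. Spark's `weekofyear` follows ISO 8601 week numbering while `year` returns the calendar year, so the two can disagree on the first days of January. Both helper names are hypothetical, and `iso_year_week` is an alternative key shown for comparison only; the notebook does not use it.

```python
from datetime import date


def calendar_year_week(d):
    """Key built like the notebook's year_week: calendar year + ISO week."""
    iso_week = d.isocalendar()[1]
    return f"{d.year}-{iso_week:02d}"


def iso_year_week(d):
    """Alternative key (an assumption, not used here): ISO year + ISO week."""
    iso = d.isocalendar()
    return f"{iso[0]}-{iso[1]:02d}"


for d in (date(2022, 4, 15), date(2023, 1, 1), date(2023, 1, 2)):
    print(d, calendar_year_week(d), iso_year_week(d))
```

With the calendar-year key, 2023-01-01 (a Sunday, ISO week 52 of 2022) gets the key `2023-52`. If the data contain that date, it forms a one-day batch that sorts after all the other 2023 weeks.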

In [4]:
# Read the CSV file (without schema inference, all columns are strings)
df = spark.read.option("header", True).csv("data.csv")
# Time-related features derived from the Date column
df = df.withColumn("day", dayofyear("Date"))
df = df.withColumn("dayofweek", dayofweek("Date"))
df = df.withColumn("year", year("Date"))
df = df.withColumn("week", weekofyear("Date"))
df = df.withColumn("year_day", concat(col("year"), lit("-"), col("day")))
# Batch key: year and zero-padded week number, e.g. "2018-49"
df = df.withColumn("week_str", lpad(col("week"), 2, "0"))
df = df.withColumn("year_week", concat(col("year"), lit("-"), col("week_str")))
df.show()

23/05/15 20:59:25 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , Date, Time gap, Sensor, Count, Average speed
 Schema: _c0, Date, Time gap, Sensor, Count, Average speed
Expected: _c0 but found: 
CSV file: file:///home/lquivron/Documents/Year_3/BigDATA/Projet/scalable-analytics-phase-2-group-3/data.csv
+---+----------+--------+-------+-----+-------------+---+---------+----+----+--------+--------+---------+
|_c0|      Date|Time gap| Sensor|Count|Average speed|day|dayofweek|year|week|year_day|week_str|year_week|
+---+----------+--------+-------+-----+-------------+---+---------+----+----+--------+--------+---------+
|  0|2018-12-06|       1|  CAT17|  0.0|         -1.0|340|        5|2018|  49|2018-340|      49|  2018-49|
|  1|2018-12-06|       1|CB02411|  0.0|         -1.0|340|        5|2018|  49|2018-340|      49|  2018-49|
|  2|2018-12-06|       1| CB1101|  0.0|         -1.0|340|        5|2018|  49|2018-340|      49|  2018-49|
|  3|2018-12-06|       1| CB114

## Data streaming
Then we send the data through the socket, one batch every `delta` seconds, each batch containing all the data from one week.
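
On the receiving side, a consumer can treat the stream as newline-delimited text. The sketch below is a minimal illustration, not the project's consumer. `read_messages` is a hypothetical helper. It assumes each message is a Python list literal of quoted strings (the form `np.array2string` produces for an array of strings), so that `ast.literal_eval` can parse it. The demonstration uses a local socket pair in place of the real producer, with sample values taken from the first columns of the table above.

```python
import ast
import socket


def read_messages(conn):
    """Yield one parsed record per newline-terminated message on `conn`."""
    with conn.makefile("r", encoding="utf-8") as stream:
        for raw in stream:  # iteration stops when the peer closes
            raw = raw.strip()
            if raw:
                yield ast.literal_eval(raw)


# Demonstration: a connected socket pair stands in for producer and consumer
producer, consumer = socket.socketpair()
producer.sendall(b"['0','2018-12-06','1','CAT17','0.0','-1.0']\n")
producer.sendall(b"['1','2018-12-06','1','CB02411','0.0','-1.0']\n")
producer.close()

for record in read_messages(consumer):
    print(record[1], record[3], record[4])
consumer.close()
```

Against the producer in this notebook, the connection would presumably be opened with `socket.create_connection(("localhost", 2222))` instead of the socket pair.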

In [5]:
host = 'localhost'
port = 2222
# Open a TCP server socket and block until the consumer connects
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind((host, port))
s.listen(2)
c, addr = s.accept()
print("CONNECTION FROM:", str(addr))

CONNECTION FROM: ('127.0.0.1', 54270)


In [10]:
# Select the time period (both bounds are excluded by the strict comparisons)
df = df.filter(col("Date") > start_day).filter(col("Date") < end_day)

# Retrieve the distinct year-week combinations to send, in order
test_years_week = df.select("year_week").distinct().orderBy(col("year_week")).collect()

# Iterate through the year-week combinations, one batch per week
for date in test_years_week:
    print(date)
    data_to_send = df.filter(col("year_week") == date["year_week"])
    data = np.array(data_to_send.collect())

    for line in data:
        # Serialize the row as a single newline-terminated line
        message = np.array2string(line, separator=',', max_line_width=1000) + '\n'

        # Send the whole message to the client
        try:
            c.sendall(message.encode())
        except socket.error:
            print("Failed. Waiting for a new connection...")
            # The client is probably disconnected: wait for another
            # connection, then resend the message that failed
            c.close()
            c, addr = s.accept()
            c.sendall(message.encode())

    # Pause before sending the next batch
    time.sleep(delta)
print("All data sent")


Row(year_week='2022-15')
23/05/14 13:58:18 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , Date, Time gap, Sensor, Count, Average speed
 Schema: _c0, Date, Time gap, Sensor, Count, Average speed
Expected: _c0 but found: 
CSV file: file:///home/lquivron/Documents/Year_3/BigDATA/Projet/scalable-analytics-phase-2-group-3/historical_data_csv/data.csv
Failed. Waiting for a new connection...
Row(year_week='2022-16')
Row(year_week='2022-17')
[...]
Row(year_week='2022-37')
Row(year_week='2022-38')
Row(year_week='2022-39')
[...]
Row(year_week='2023-07')
Row(year_week='2023-08')
Row(year_week='2023-09')
[...]

In [11]:
c.close()
s.close()