## Assignment 3 - DGIM on a stream of bits

Francisco Marques 97639 Data Science

Here we apply the DGIM method to a synthetic stream of bits generated by 'simple_socket_server_bits.py', using Spark's Structured Streaming. I was not able to use the console sink format in Jupyter Notebook, so all of the code was tested in a regular Python script.

### Install and load modules

In [None]:
%pip install pyspark

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col
from dgim import * # custom class

In [None]:
N = 100 # window size
k = 50 # last k entries
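
The custom `dgim` module is not shown in this notebook, so the following is only a minimal sketch of how a DGIM counter with the same constructor arguments might work. The class name `DGIMSketch`, its internals, and the rounding rule for the oldest bucket are assumptions for illustration, not the author's implementation. It keeps buckets of power-of-two sizes, allows at most two buckets per size, and estimates the number of 1s in the last `k` bits by counting only half of the oldest overlapping bucket.

```python
class DGIMSketch:
    """Hypothetical stand-in for the custom DGIM class (illustration only)."""

    def __init__(self, N, k):
        self.N = N                  # window size
        self.k = k                  # query range: the last k bits (k <= N)
        self.stream_timestamp = 0   # number of bits seen so far
        self.buckets = []           # (timestamp of newest 1, size), newest first

    def update(self, bit):
        """Consume one bit (int or numeric string) and maintain the buckets."""
        self.stream_timestamp += 1
        t = self.stream_timestamp
        # Drop the oldest bucket once its newest 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= t - self.N:
            self.buckets.pop()
        if int(bit) != 1:
            return
        self.buckets.insert(0, (t, 1))
        # Whenever three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            ts, size = self.buckets[i + 1]
            self.buckets[i + 1] = (ts, 2 * size)
            del self.buckets[i + 2]
            i += 1

    def count(self):
        """Estimate the number of 1s among the last k bits."""
        t = self.stream_timestamp
        sizes = [size for ts, size in self.buckets if ts > t - self.k]
        if not sizes:
            return 0
        # Full size of every bucket except the oldest, which counts as half
        # (rounded up, so a size-1 bucket is counted exactly).
        return sum(sizes) - sizes[-1] // 2


sketch = DGIMSketch(N=100, k=50)
for b in [1, 0, 1, 1, 0, 1, 1, 1]:
    sketch.update(b)
estimate = sketch.count()
```

Because at most two buckets of each size are kept, the number of buckets grows only logarithmically with the window size, which is the point of the method: the estimate trades exactness for memory.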

### Initialize Spark session

In [None]:
# start Spark session
spark = SparkSession.builder.appName("Assignment 3 - DGIM").getOrCreate()
spark

### Connect to server via socket

Default values: host = 'localhost'; port = 9999

In [None]:
socket_df = (spark.readStream.format("socket") 
        .option("host", 'localhost') 
        .option("port", 9999)
        .load())

### Spark DataFrame with a timestamp (int) column and a value (bit) column

In [None]:
split_df = (socket_df.withColumn("tmp", split(col("value"), ",")) # split values in column
            .withColumn("timestamp", col("tmp").getItem(0)) # time column 
            .withColumn("value", col("tmp").getItem(1)) # bit column
            .drop(col("tmp")))
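
To make the effect of the split concrete, here is a plain-Python sketch of the same parsing step, without Spark. It assumes each line sent by the server looks like `"12,1"` (timestamp, comma, bit), which is inferred from the split on `","` above; the helper name `parse_line` and the sample lines are made up for illustration. Note that both fields come out as strings, so any numeric use needs an explicit conversion.

```python
def parse_line(line):
    """Split a 'timestamp,bit' line into two strings, mirroring split() + getItem()."""
    parts = line.split(",")
    return parts[0], parts[1]


# Hypothetical sample of what the socket might deliver.
sample = ["0,1", "1,0", "2,1"]
rows = [parse_line(s) for s in sample]

# Both fields are still strings at this point; convert the bit before counting.
bits = [int(bit) for _, bit in rows]
```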

### Print the schema of the latest DataFrame and initialize the DGIM object

In [None]:
# print the schema of the DataFrame
split_df.printSchema()

# Initialize DGIM object
dgim = DGIM(N, k)

### Define a function to update the DGIM object every batch

In [None]:
def update_cumulative_sum(batch_df, batch_id):
    """Update the DGIM object's cumulative estimated and real sums with each batch."""

    print(f"Batch: {batch_id}\n")
    t_start = dgim.stream_timestamp  # stream time at the start of the batch
    rows = batch_df.toLocalIterator()  # iterate over the batch rows on the driver

    for row in rows:
        bit = row["value"]
        dgim.update(bit)
    dgim.estimated_count += dgim.count()  # add this batch's estimate to the running total
    t_end = dgim.stream_timestamp  # stream time at the end of the batch

    # one-row DataFrame summarizing the batch
    new_df = spark.createDataFrame(
        [(t_start, t_end, dgim.estimated_count, dgim.real_count)],
        ['t_start', 't_end', 'estimated_sum', 'real_sum'])
    new_df.show()

### Create query to apply the previous function to each batch

In [None]:
# apply update_cumulative_sum to each batch
df_dgim = (split_df.writeStream
        .foreachBatch(update_cumulative_sum)
        .start())

df_dgim.awaitTermination()