## Assignment 3 - Exponentially Decaying Window

Francisco Marques 97639 Data Science

Here we apply the Exponentially Decaying Window method to a stream of events (letters) generated by 'simple_socket_server_csv.py', using Spark's Structured Streaming. I could not get the console sink to work in Jupyter Notebook, so all of the code was tested in a regular Python script.
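
For reference, the commonly described update rule behind an exponentially decaying window is: on every new arrival, multiply all existing scores by (1 - c), then add 1 to the score of the arriving item. The sketch below is plain Python, not the Spark code used in this notebook. The function name `decayed_counts` and the three-letter sample stream are made up for illustration.

```python
from collections import defaultdict

def decayed_counts(stream, c):
    """Return the exponentially decayed score of each item after reading the stream."""
    counts = defaultdict(float)
    for item in stream:
        # age every existing score by one step
        for key in counts:
            counts[key] *= (1 - c)
        # the arriving item contributes a weight of 1
        counts[item] += 1.0
    return dict(counts)

# hypothetical stream of letters, with the same decay constant as below
scores = decayed_counts(["a", "b", "a"], c=0.1)
print(scores)
```

With this rule an item seen recently keeps a score close to 1 per occurrence, while older occurrences shrink geometrically.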

### Install and load modules

In [None]:
%pip install pyspark

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col, lit, window
from pyspark.sql.functions import sum as spark_sum
from pyspark.sql.functions import round as spark_round
from pyspark.sql.types import TimestampType
import argparse

In [None]:
window_size = 10 # window size
slide = 1 # slide duration (e.g. with window_size = 10 → 0-10, 1-11, 2-12, etc.)
decay = 0.1 # decay constant
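
To make the effect of `window_size` and `slide` concrete, the plain-Python sketch below lists the sliding windows that contain a given event time. It assumes times are whole seconds and that windows are start-inclusive and end-exclusive, aligned to multiples of `slide`. The helper `windows_containing` is a hypothetical name and is not part of Spark.

```python
def windows_containing(t, window_size, slide):
    """List the [start, end) windows, in seconds, that contain time t."""
    # latest window start that is <= t, aligned to a multiple of slide
    last_start = (t // slide) * slide
    # walk back in steps of slide while the window still covers t
    starts = range(last_start, t - window_size, -slide)
    return [(s, s + window_size) for s in sorted(starts)]

# an event at second 12 with the settings used in this notebook
wins = windows_containing(12, window_size=10, slide=1)
print(len(wins), wins[0], wins[-1])
```

With `window_size = 10` and `slide = 1`, every event belongs to ten overlapping windows, so it contributes to ten output rows.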

### Initialize Spark session

In [None]:
# start Spark session
spark = SparkSession.builder.appName("Assignment 3 - Exponentially Decaying Window").getOrCreate()
spark

### Connect to server via socket

Default values: host = 'localhost'; port = 9999

In [None]:
socket_df = (spark.readStream.format("socket") 
        .option("host", 'localhost') 
        .option("port", 9999)
        .load())

### Spark Dataframe with a timestamp (TimestampType) column and value (str) column

In [None]:
split_df = (socket_df.withColumn('tmp', split(socket_df.value, ',')) # split by comma (csv)
            .withColumn('event_time', col('tmp').getItem(0).cast(TimestampType())) # create column with timestamps
            .withColumn('event', col('tmp').getItem(1)) # column with events
            .drop(col("tmp")).drop(col("value"))
            .withColumn("count", lit(1))) # column for initial weighted sum
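
The same parsing step can be sketched on a single line in plain Python. The exact line format emitted by the server script is not shown here, so the sample line and its 'YYYY-MM-DD HH:MM:SS,letter' layout are assumptions, and `parse_line` is a hypothetical helper.

```python
from datetime import datetime

def parse_line(line):
    """Split one 'timestamp,event' line into the columns built above."""
    tmp = line.split(",")                        # split by comma (csv)
    event_time = datetime.fromisoformat(tmp[0])  # assumed timestamp format
    event = tmp[1]
    return {"event_time": event_time, "event": event, "count": 1}

# hypothetical input line
row = parse_line("2023-05-01 12:00:03,A")
print(row)
```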

### Print the schema of latest Dataframe

In [None]:
# print the schema of the DataFrame
split_df.printSchema()

### Create a new DataFrame with a sliding window that also computes the weighted sum, making up the Exponentially Decaying Window method

In [None]:
windowed_df = (split_df.withColumn("decay_factor", lit(1 - decay)) # create decay factor column
    .withColumn("weighted_sum", (col("count") * col("decay_factor")) + 1)  # compute weighted_sum for each row
    .groupBy(window(col("event_time"), f"{window_size} seconds", slideDuration = f"{slide} seconds") # group by sliding time window
            .alias("window"), col("event")) # and by event
    .agg(spark_sum(col("weighted_sum")).alias("total_weighted_sum")) # sum weighted_sum per (window, event); event is already kept as a grouping column
    .withColumn("start", col("window").getField("start"))
    .withColumn("end", col("window").getField("end"))
    .withColumn("total_weighted_sum", spark_round(col("total_weighted_sum"), 4))
    .select("event", "start", "end", "total_weighted_sum")
    .orderBy(col("total_weighted_sum").desc(), col("start"))) # order by descending total_weighted_sum, then by window start
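
As a quick check of the arithmetic in the cell above: `count` is always 1 and `decay_factor` is constant, so every row receives the same `weighted_sum`, and the total for one (window, event) group is that value times the number of events in the group. The plain-Python sketch below reproduces this for three hypothetical events of one letter falling in the same window.

```python
decay = 0.1
decay_factor = 1 - decay

# hypothetical: three events of the same letter in one window, each with count = 1
counts = [1, 1, 1]

# same per-row formula as the weighted_sum column
weighted = [c * decay_factor + 1 for c in counts]

# same aggregation and rounding as total_weighted_sum
total_weighted_sum = round(sum(weighted), 4)
print(total_weighted_sum)
```

With `decay = 0.1`, each row contributes 1.9, so within a window the ranking by `total_weighted_sum` follows the number of occurrences of each letter.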

### Once again print the schema of this DataFrame

In [None]:
windowed_df.printSchema()

### Create query to output results to console

In [None]:
# output to console
window_query = (windowed_df.writeStream
                .outputMode("complete") 
                .format("console") 
                .start())

# wait for the query to terminate
window_query.awaitTermination()