<center><h1>Reading Streaming Data from Sockets</h1></center>
<hr><hr><hr>

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("003 - Reading streaming data from sockets").getOrCreate()
spark

In [3]:
sc = spark.sparkContext
sc

In [4]:
sc.defaultParallelism

8

In [5]:
# The default value is 200, so each stage that follows a shuffle is broken down into 200 tasks
spark.conf.get("spark.sql.shuffle.partitions")

'200'

- By default, `spark.sql.shuffle.partitions` is `200`, so every non-initial stage (any stage created after at least one shuffle operation) is divided into 200 tasks, one per shuffle partition. The image below shows the task count for a non-initial stage of this streaming job: the last value, `(195 + 5)`, means `200` tasks were executed for that stage. The image shows `Stage 15` because the same job was run several times in the same kernel: <br><br>
    <img src="./images/001-each_stage_divides_into_200_tasks_by_default.jpg" width="700px">

In [6]:
spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)

In [7]:
# Checking if the value of shuffle partitions has been changed
spark.conf.get("spark.sql.shuffle.partitions")

'8'

- `spark.sql.shuffle.partitions` is now set to the value of `sc.defaultParallelism`, which is `8` on my system. Each non-initial stage (a stage formed after at least one shuffle operation) is therefore divided into `8` tasks, as shown in the image below:
    <img src="./images/002-each_stage_divides_into_spark-sql-shuffle-partition_num_of_tasks.jpg" width="700px">
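
As a rough illustration of why the task count follows this setting: during a shuffle, each row is routed to a partition based on a hash of its grouping key, and each partition is then processed by one task. The plain-Python sketch below mimics that routing. It is not Spark's implementation: `zlib.crc32` stands in for Spark's internal hash function, and `assign_partitions` is a hypothetical helper.

```python
import zlib
from collections import defaultdict

def assign_partitions(keys, num_partitions):
    """Route each key to one of num_partitions buckets by hashing it."""
    partitions = defaultdict(list)
    for key in keys:
        # crc32 is a deterministic stand-in for Spark's internal hash (assumption)
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append(key)
    return dict(partitions)

words = ["spark", "stream", "socket", "spark", "batch", "stream"]
buckets = assign_partitions(words, num_partitions=8)

# Every occurrence of a word lands in the same bucket, so one task sees all of its rows
for index, bucket in sorted(buckets.items()):
    print(index, bucket)
```

With 200 partitions and only a handful of distinct keys, most buckets would stay empty, which is why lowering the setting suits small local jobs.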

In [8]:
from pyspark.sql import functions as F

In [9]:
import os

#  This directory path will be utilised as the checkpoint directory for all the streaming jobs in this notebook
CHECKPOINT_DIRECTORY = "./checkpoints/003-reading_from_sockets"

In [10]:
# Check whether the checkpoint folder used by the streaming jobs below is empty

# A checkpoint directory that does not exist yet is treated as empty
checkpoint_dir_contents = os.listdir(CHECKPOINT_DIRECTORY) if os.path.isdir(CHECKPOINT_DIRECTORY) else []
print( "Contents: ", checkpoint_dir_contents )

if len(checkpoint_dir_contents) == 0:
    print("Checkpoint directory is empty. Streaming jobs of this notebook can be started afresh successfully!")
else:
    print("Please delete the checkpoint directory set for this notebook, before running the streaming jobs below.")

Contents:  []
Checkpoint directory is empty. Streaming jobs of this notebook can be started afresh successfully!
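
If the directory is not empty, it can be cleared from Python rather than by hand. A minimal sketch, assuming it is acceptable to lose the saved streaming state; `reset_checkpoint_dir` is a hypothetical helper, not part of Spark:

```python
import os
import shutil

def reset_checkpoint_dir(path):
    """Delete the directory at path (if any) and recreate it empty."""
    # ignore_errors=True also suppresses the error raised when the directory does not exist
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, exist_ok=True)
    return os.listdir(path)
```

Calling `reset_checkpoint_dir(CHECKPOINT_DIRECTORY)` should only be done while no streaming query is using that directory.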


In [11]:
# This is an unbounded DataFrame: rows received from the socket keep getting appended, so it continues to grow indefinitely (no duration is specified in this case).

streaming_df = (
    spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", "9999")
    .load()
)
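
The socket source connects as a client, so something must already be listening on `localhost:9999` before the query starts. A common choice is netcat (`nc -lk 9999`). Where netcat is not available, a small stand-in can be written with Python's standard library. The sketch below is an assumption about how test data might be fed, not part of the original notebook; `open_server` and `serve_lines` are hypothetical helpers.

```python
import socket
import time

def open_server(host="localhost", port=9999):
    """Create a TCP socket listening on host:port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server

def serve_lines(server, lines, delay=0.0):
    """Wait for one client, send it each line terminated by a newline, then close."""
    conn, _ = server.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))
            time.sleep(delay)  # spread lines out so they can fall into different microbatches
    server.close()
```

Running `serve_lines(open_server(), ["hello world", "hello spark"], delay=1.0)` in a separate terminal or thread blocks until a client connects, then sends the lines one by one.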

In [12]:
streaming_df.printSchema()

root
 |-- value: string (nullable = true)



In [13]:
word_count_df = (
    streaming_df.withColumn("value", F.regexp_replace( F.col("value"), r"([.,])", "" ) )
    .withColumn("value", F.lower(F.col("value")) )
    .withColumn("words", F.split(F.col("value"), " ") )
    .drop("value")
    .withColumn("words", F.explode(F.col("words")) )
    .groupBy( "words" )
    .agg( F.count("words").alias("num_of_occurences") )
    # .orderBy( F.col("num_of_occurences").desc() )         # sorting could only be applied if writeStream outputMode is "complete"
)
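
To see what each step of this chain does, the same logic can be traced on ordinary strings. The plain-Python sketch below mirrors the transformations (strip `.` and `,`, lowercase, split on a single space, explode, count); `word_counts` is a hypothetical helper for illustration and does not involve Spark.

```python
import re
from collections import Counter

def word_counts(lines):
    """Mirror the streaming word count on a list of plain strings."""
    counts = Counter()
    for line in lines:
        cleaned = re.sub(r"[.,]", "", line)   # like F.regexp_replace
        lowered = cleaned.lower()             # like F.lower
        words = lowered.split(" ")            # like F.split on a single space
        counts.update(words)                  # like F.explode + groupBy + count
    return dict(counts)

print(word_counts(["Hello, world.", "hello Spark"]))
# prints {'hello': 2, 'world': 1, 'spark': 1}
```

As in the Spark version, consecutive spaces produce empty-string "words", because the split is on exactly one space.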

- `awaitTermination()` blocks the calling (driver) thread until the streaming query terminates, either through `stop()` or because of an exception. The query itself runs in the background, so without this call a driver script could exit while the query is still processing data.

In [15]:
# Uncomment and run this cell if you want to view the microbatch received into your console.

# complete outputMode
# -------------------------
# 1. Sorting is only applicable for aggregated streaming dataset in "complete" mode

"""
stream_write_query = (
    word_count_df.writeStream
    .format("console")
    .outputMode("complete")
    # .outputMode("update")
    # .outputMode("append")
    .option("checkpointLocation", "./checkpoints/003-reading_from_sockets")
    .start()
)
"""


# update outputMode
# -------------------------
# 1. In "update" output mode, only the rows that changed since the last trigger are written to the sink. For this aggregation, words that appear in the new microbatch are written with their updated (increased) count, and words seen for the first time are written as new rows. Words that are not in the new microbatch are simply not written again; their counts are still kept in the query's state.

# 2. The "console" sink supports "update" outputMode. For targets such as Delta Lake or an RDBMS, updates are typically applied through foreachBatch (for example with a MERGE/upsert).
#    "file": writing to files does not support "update" outputMode.
"""
stream_write_query = (
    word_count_df.writeStream
    .format("console")
    .outputMode("update")
    .option("checkpointLocation", "./checkpoints/003-reading_from_sockets")
    .start()
)
"""


# append outputMode
# -------------------------
# 1. In "append" outputMode, rows that were already written are never updated; only new rows are added. This makes "append" very popular for logs written to persistent storage with Spark streaming, because log entries are never updated, only appended at the end.

# 2. "append" output mode is not supported for streaming aggregations on streaming DataFrames/Datasets without a watermark. So the query below will fail, because no watermark has been added.

"""
stream_write_query = (
    word_count_df.writeStream
    .format("console")
    .outputMode("append")
    .option("checkpointLocation", "./checkpoints/003-reading_from_sockets")
    .start()
)
"""

# 3. However, we can use it to write streaming_df directly as output.

# The code below simply shows each microbatch as it is received by Spark streaming.
# """
stream_write_query = (
    streaming_df.writeStream
    .format("console")
    .outputMode("append")
    .option("checkpointLocation", "./checkpoints/003-reading_from_sockets")
    .start()
)
# """
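
The difference between "complete" and "update" can be traced without Spark. The sketch below keeps a running count across microbatches and reports what each mode would write per trigger: the whole result table for "complete", and only the rows whose count changed for "update". `simulate_output_modes` is a hypothetical helper, and the model ignores watermarks and state cleanup.

```python
from collections import Counter

def simulate_output_modes(batches):
    """Yield (complete_rows, update_rows) for each microbatch of words."""
    state = Counter()  # running counts, analogous to the aggregation state
    for batch in batches:
        batch_counts = Counter(batch)
        state.update(batch_counts)
        complete_rows = dict(state)                        # full result table
        update_rows = {w: state[w] for w in batch_counts}  # only rows changed by this batch
        yield complete_rows, update_rows

for complete_rows, update_rows in simulate_output_modes([["a", "b"], ["b", "c"]]):
    print("complete:", complete_rows, "| update:", update_rows)
```

In the second microbatch, "a" still appears in the "complete" output but not in the "update" output, even though its count remains in the state.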


In [None]:
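# awaitTermination() blocks this cell until the query stops (via stop() or an exception), keeping the driver alive while the query runs.
# With trigger(once=True) or trigger(availableNow=True), the query stops on its own after processing the available data, so this call then returns.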
stream_write_query.awaitTermination()

In [None]:
stream_write_query

In [None]:
stream_write_query.stop()