# Structured Streaming Demo

### Demo

In [1]:
import findspark

findspark.init('/opt/spark-3.5.0/')

# SparkSession is the only entry point Structured Streaming needs
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Structured Streaming').getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/25 19:55:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/03/25 19:55:15 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


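In a separate terminal, run `nc -lk 5555` to start a netcat server listening on port 5555, then type some text. Each line you enter is sent to the stream. Start the server before launching the query, since the socket source connects to it when the query starts.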

In [2]:
# Create DataFrame representing the stream of input lines from connection to localhost:5555
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 5555).load()

24/03/25 19:55:17 WARN TextSocketSourceProvider: The socket source should not be used for production applications! It does not support recovery.


In [3]:
# Split each line into words.
# split() turns the string column `value` into an array of words.
# explode() takes an array column and produces one output row per array element,
# so every word generated by split() ends up in its own row of the `word` column.
from pyspark.sql.functions import explode, split

words = lines.select(explode(split(lines.value, " ")).alias("word"))

In [4]:
# Illustration of explode(). Given a DataFrame with an array column:
# +------------+
# | words      |
# +------------+
# | ["a", "b"] |
# | ["c", "d"] |
# +------------+
# explode() flattens the arrays into separate rows, one per element.
# The values of any other columns are duplicated for each of those rows.
# Result:
# +-----+
# | word|
# +-----+
# |    a|
# |    b|
# |    c|
# |    d|
# +-----+
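The same flattening can be sketched without Spark. The snippet below is a plain-Python analogy, not the PySpark implementation; the helper name `split_and_explode` is made up for illustration. It also shows a detail worth knowing: splitting on a single space leaves an empty string between two consecutive spaces, and Spark's `split()` with a `" "` pattern should behave the same way.

```python
# Plain-Python analogy of explode(split(value, " ")); no Spark required.
def split_and_explode(lines):
    """Return one word per output row, in input order."""
    rows = []
    for line in lines:
        for word in line.split(" "):  # split on a single space, like the demo
            rows.append(word)
    return rows


print(split_and_explode(["a b", "c d"]))  # ['a', 'b', 'c', 'd']
print(split_and_explode(["a  b"]))        # ['a', '', 'b'] (note the empty string)
```

If empty words are unwanted in the stream, one option is to filter them out after the explode step, for example with `words.filter(words.word != "")`.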


 

In [5]:
# Generate running word count
wordCounts = words.groupBy("word").count()
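On a streaming DataFrame, `groupBy("word").count()` keeps a running aggregate: each new micro-batch updates the totals accumulated so far, and in complete output mode (used later in this demo) the whole table is emitted after every trigger. A minimal plain-Python sketch of that behaviour follows. It is an analogy only; the function name and the sample batches are assumptions chosen for illustration.

```python
from collections import Counter


def running_word_counts(batches):
    """Yield the full counts table after each micro-batch (complete-mode style)."""
    totals = Counter()
    for batch in batches:
        for line in batch:
            totals.update(line.split(" "))
        yield dict(totals)  # snapshot of the whole table, not just the changes


# Hypothetical input: the lines that arrived between consecutive triggers.
batches = [[], ["a"], ["a a"], ["a a"]]
for batch_id, table in enumerate(running_word_counts(batches)):
    print(batch_id, table)
```

With this hypothetical input the totals for `a` grow from 1 to 3 to 5, because each batch adds to the previous state instead of starting from zero.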

Some of the operations available on a running streaming query. Here `query` is the `StreamingQuery` handle returned by `start()`. In PySpark most of these are properties, so they are written without parentheses:

| Operation              | Purpose                                                                                            |
|------------------------|----------------------------------------------------------------------------------------------------|
| `query.name`           | get the user-specified name of the query, or `None` if no name was set                             |
| `query.id`             | get the unique identifier of the running query that persists across restarts from checkpoint data |
| `query.runId`          | get the unique id of this run of the query, which is generated at every start/restart              |
| `query.recentProgress` | a list of the most recent progress updates for this query                                          |
| `query.lastProgress`   | the most recent progress update of this streaming query                                            |
| `spark.streams.active` | get the list of currently active streaming queries                                                 |
| `query.stop()`         | stop the query                                                                                     |
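As a rough illustration of how a progress update might be inspected, the helper below picks a few fields out of a progress record. It assumes `query.lastProgress` returns a dict (as in PySpark 3.x) or `None` when no batch has completed yet. The helper name `summarize_progress` and the sample values are hypothetical.

```python
def summarize_progress(progress):
    """Extract a few fields from a progress dict; return None if there is none yet."""
    if progress is None:
        return None
    return {
        "batchId": progress.get("batchId"),
        "numInputRows": progress.get("numInputRows"),
        "processedRowsPerSecond": progress.get("processedRowsPerSecond"),
    }


# Hypothetical sample record, shaped like a progress update.
sample = {"batchId": 3, "numInputRows": 2, "processedRowsPerSecond": 10.0}
print(summarize_progress(sample))
print(summarize_progress(None))  # no progress reported yet
```

Against a live query this would be called as `summarize_progress(query.lastProgress)`, from a different cell or thread than the one blocked in `awaitTermination()`.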

In [6]:
# Start the query: print the full table of running counts to the console after every trigger
query = wordCounts.writeStream.outputMode("complete").format("console").start()
# Block until the query stops or fails; interrupt the kernel to end the demo
query.awaitTermination()

24/03/25 19:55:18 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /private/var/folders/v4/r34w0mtd5xzfzyxp6hnbp8y00000gn/T/temporary-ee823b38-0b1b-4515-b20c-7563deffc83b. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
24/03/25 19:55:18 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
+----+-----+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
|   a|    1|
+----+-----+



                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
|   a|    3|
+----+-----+



                                                                                

-------------------------------------------
Batch: 3
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
|   a|    5|
+----+-----+



ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/opt/spark-3.5.0/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/spark-3.5.0/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
                          ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/hso/.pyenv/versions/3.11.0/lib/python3.11/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/spark-3.5.0/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During h

KeyboardInterrupt: 
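The traceback above is what you get when the kernel is interrupted while `query.awaitTermination()` is still blocking. One way to end a demo more cleanly is to wait for a bounded time and then stop the query explicitly. The sketch below assumes a `StreamingQuery`-like object exposing `awaitTermination(timeout)`, `isActive`, and `stop()`; the helper name `run_for` is hypothetical.

```python
def run_for(query, seconds):
    """Wait up to `seconds` for the query to end, then stop it if it is still running.

    Returns True if the query ended by itself within the timeout, False otherwise.
    """
    # With a timeout, awaitTermination returns whether the query has terminated.
    finished = query.awaitTermination(seconds)
    if not finished and query.isActive:
        query.stop()
    return finished


# Intended usage in the notebook, replacing the bare awaitTermination() call:
# run_for(query, 60)
```

This keeps the cell from blocking indefinitely, at the cost of ending the demo after a fixed time whether or not you are done typing into netcat.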