<h1><center>Spark Streaming Read from Files | Flatten JSON data</center></h1>
<hr><hr><hr>

- Sample JSON data that we will work with in this notebook:
```
{
  "eventId": "e3cb26d3-41b2-49a2-84f3-0156ed8d7502",
  "eventOffset": 10001,
  "eventPublisher": "device",
  "customerId": "CI00103",
  "data": {
    "devices": [
      {
        "deviceId": "D001",
        "temperature": 15,
        "measure": "C",
        "status": "ERROR"
      },
      {
        "deviceId": "D002",
        "temperature": 16,
        "measure": "C",
        "status": "SUCCESS"
      }
    ]
  },
  "eventTime": "2023-01-05 11:13:53.643364"
}
```
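
To make the target shape concrete before bringing Spark in, here is a minimal plain-Python sketch of the flattening this notebook performs: one output row per entry in `data.devices`, with the top-level event fields repeated on every row. The helper name `flatten_event` is an illustrative assumption, not part of the notebook.

```python
import json

# The sample event shown above, embedded as a string for a self-contained demo.
SAMPLE_EVENT = """
{
  "eventId": "e3cb26d3-41b2-49a2-84f3-0156ed8d7502",
  "eventOffset": 10001,
  "eventPublisher": "device",
  "customerId": "CI00103",
  "data": {
    "devices": [
      {"deviceId": "D001", "temperature": 15, "measure": "C", "status": "ERROR"},
      {"deviceId": "D002", "temperature": 16, "measure": "C", "status": "SUCCESS"}
    ]
  },
  "eventTime": "2023-01-05 11:13:53.643364"
}
"""


def flatten_event(event):
    """Return one flat dict per device (hypothetical helper for illustration)."""
    # Every top-level field except the nested "data" struct is copied to each row.
    top_level = {key: value for key, value in event.items() if key != "data"}
    rows = []
    for device in event.get("data", {}).get("devices", []):
        # Merge the shared event fields with the fields of this one device.
        rows.append({**top_level, **device})
    return rows


flat_rows = flatten_event(json.loads(SAMPLE_EVENT))
for row in flat_rows:
    print(row["eventId"], row["deviceId"], row["temperature"], row["status"])
```

The Spark code later in the notebook gets the same result with `F.explode("data.devices")` followed by a column selection.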

- Install the `ipynbname` package, if it is not already present, using `pip install ipynbname`.

In [1]:
pip show ipynbname

Name: ipynbname
Version: 2024.1.0.0
Summary: Simply returns either notebook filename or the full path to the notebook when run from Jupyter notebook in browser.
Home-page: https://github.com/msm1089/ipynbname
Author: Mark McPherson
Author-email: msm1089@yahoo.co.uk
License: MIT
Location: e:\programs & codes\apache_spark\_spark_venv\lib\site-packages
Requires: ipykernel
Required-by: 
Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import ipynbname

In [3]:
notebook_name = ipynbname.name()
print(notebook_name)

004-flatten_streaming_JSON_files


In [4]:
import findspark
findspark.init()

In [5]:
from pyspark.sql import SparkSession

spark = ( 
    SparkSession.builder
    .master("local")
    .appName("004 - flatten streaming JSON files data")
    .config("spark.streaming.stopGracefullyOnShutdown", True)
    .getOrCreate() 
)
spark

In [6]:
import pyspark.sql.functions as F
from pyspark.sql.types import *

In [7]:
# os.listdir(f"./checkpoints/{notebook_name}")

In [8]:
# This snippet checks whether the base directory for saving the checkpoints of streaming jobs exists. If it does not exist, it creates the base directory.
BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK = f"./checkpoints/{notebook_name}"

try:
    os.listdir( BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK )
    print(f"Base checkpoint directory '{BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK}' already exists.")
except FileNotFoundError:
    os.makedirs(BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK)  # makedirs also creates the parent './checkpoints' directory if it is missing
    print(f"Base checkpoint directory did not exist, so it has been created at the relative path: '{BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK}'")

Base checkpoint directory './checkpoints/004-flatten_streaming_JSON_files' already exists.


In [9]:
"""
# For each run of the notebook's streaming job, the checkpoint directory follows the pattern "{BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK}/1", "{BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK}/2", and so on.
# For each run, the code below finds the highest-numbered folder inside "BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK". If that folder is empty (contains no checkpoint files), it is used as the checkpoint directory for the current run; otherwise, a new folder numbered (highest + 1) is created and used instead.
"""
def update_checkpoint_dir( BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK ):
    # get list of all folders inside the 'BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK'
    past_checkpoint_dirs_of_this_notebook = os.listdir(BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK)
    
    
    if len(past_checkpoint_dirs_of_this_notebook) == 0:
        # If 'BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK' is empty, then create a folder named '1' inside it, and select this '1' named folder path as 'CHECKPOINT_DIRECTORY_FOR_CURRENT_RUN'
        os.mkdir(f"{BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK}/1")
        CHECKPOINT_DIRECTORY_FOR_CURRENT_RUN = f"{BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK}/1"
    else:
        # If 'BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK' is not empty, then we need to get the highest numbered folder.
    
        # Convert the folder-name strings to integers so that the highest number is found correctly (compared as strings, "10" would sort before "9")
        checkpoint_folder_names_converted_to_integers = [int(i) for i in past_checkpoint_dirs_of_this_notebook]
        highest_existing_folder_number = max(checkpoint_folder_names_converted_to_integers)  # max() gives the largest integer value in this list of folder names
    
        # checking if the highest numbered folder is empty, or it contains checkpoint related files inside it.
        # This check is important as it might happen that a new folder is created but not used for storing any checkpoint data. Then, without creating another new folder with higher number, we will use that existing highest numbered empty folder.
        if len( os.listdir( f"{BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK}/{highest_existing_folder_number}" ) ) == 0:
            # If existing highest numbered folder is empty, then its path will be set to 'CHECKPOINT_DIRECTORY_FOR_CURRENT_RUN'
            CHECKPOINT_DIRECTORY_FOR_CURRENT_RUN = f"{BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK}/{highest_existing_folder_number}"
        else:
            # If existing highest numbered folder is not empty(contains checkpoint files), then a new folder with number 1 higher than highest number will be created, and its path will be set to 'CHECKPOINT_DIRECTORY_FOR_CURRENT_RUN'
            new_folder_number = highest_existing_folder_number + 1
            os.mkdir(f"{BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK}/{new_folder_number}")
            CHECKPOINT_DIRECTORY_FOR_CURRENT_RUN = f"{BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK}/{new_folder_number}"

    return CHECKPOINT_DIRECTORY_FOR_CURRENT_RUN
    
print(f"The checkpoint directory to be utilised for current execution of spark streaming job: '{update_checkpoint_dir( BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK )}'")

The checkpoint directory to be utilised for current execution of spark streaming job: './checkpoints/004-flatten_streaming_JSON_files/7'
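
The function above assumes every entry under the base directory has a purely numeric name; a stray folder such as `.ipynb_checkpoints` would make `int(i)` raise a `ValueError`. Below is a sketch of a more defensive variant using `pathlib`. The name `next_checkpoint_dir` and the demo paths are assumptions made for illustration, and unlike the original it returns a `Path` rather than a string.

```python
import tempfile
from pathlib import Path


def next_checkpoint_dir(base_dir):
    """Pick the checkpoint directory for a new run under base_dir (illustrative helper)."""
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)

    # Consider only sub-directories whose names are purely numeric.
    numbered = sorted(
        int(p.name) for p in base.iterdir() if p.is_dir() and p.name.isdecimal()
    )

    if numbered:
        latest = base / str(numbered[-1])
        if not any(latest.iterdir()):
            # The highest-numbered folder was never used, so reuse it.
            return latest
        target = base / str(numbered[-1] + 1)
    else:
        target = base / "1"

    target.mkdir()
    return target


# Demo in a throwaway directory so nothing is written next to the notebook.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "checkpoints" / "demo_notebook"
    first = next_checkpoint_dir(base)       # creates .../1
    again = next_checkpoint_dir(base)       # .../1 is still empty, so it is reused
    (first / "offsets").mkdir()             # pretend a query wrote checkpoint data
    (base / ".ipynb_checkpoints").mkdir()   # non-numeric folder, ignored
    second = next_checkpoint_dir(base)      # creates .../2
    results = (first.name, again.name, second.name)
    print(results)
```

Pass `str(next_checkpoint_dir(...))` if the result is handed to `.option("checkpointLocation", ...)`.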


## Batch Code for JSON transformation and flattening:
---------------------------------------------------------

In [10]:
json_df = (
    spark.read
    .format("json")
    .option("inferSchema", True)
    .load(f"./data/{notebook_name}/input/device_files/device_01.json")
)

json_df.printSchema()

root
 |-- customerId: string (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- devices: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- deviceId: string (nullable = true)
 |    |    |    |-- measure: string (nullable = true)
 |    |    |    |-- status: string (nullable = true)
 |    |    |    |-- temperature: long (nullable = true)
 |-- eventId: string (nullable = true)
 |-- eventOffset: long (nullable = true)
 |-- eventPublisher: string (nullable = true)
 |-- eventTime: string (nullable = true)



In [11]:
json_df.show(truncate=False)

+----------+------------------------------------------------+------------------------------------+-----------+--------------+--------------------------+
|customerId|data                                            |eventId                             |eventOffset|eventPublisher|eventTime                 |
+----------+------------------------------------------------+------------------------------------+-----------+--------------+--------------------------+
|CI00103   |{[{D001, C, ERROR, 15}, {D002, C, SUCCESS, 16}]}|e3cb26d3-41b2-49a2-84f3-0156ed8d7502|10001      |device        |2023-01-05 11:13:53.643364|
+----------+------------------------------------------------+------------------------------------+-----------+--------------+--------------------------+



In [12]:
flattened_df = ( 
    json_df.withColumn( "data", F.explode( "data.devices" ) )
    .withColumn( "deviceId", F.col("data.deviceId") )
    .withColumn( "temperature", F.col("data.temperature") )
    .withColumn( "measure", F.col("data.measure") )
    .withColumn( "status", F.col("data.status") )
    # .drop("data")
    .select( "eventId", "eventOffset", "eventPublisher", "customerId", "deviceId", "temperature", "measure", "status", "eventTime" )
)

flattened_df.show(truncate=False)

+------------------------------------+-----------+--------------+----------+--------+-----------+-------+-------+--------------------------+
|eventId                             |eventOffset|eventPublisher|customerId|deviceId|temperature|measure|status |eventTime                 |
+------------------------------------+-----------+--------------+----------+--------+-----------+-------+-------+--------------------------+
|e3cb26d3-41b2-49a2-84f3-0156ed8d7502|10001      |device        |CI00103   |D001    |15         |C      |ERROR  |2023-01-05 11:13:53.643364|
|e3cb26d3-41b2-49a2-84f3-0156ed8d7502|10001      |device        |CI00103   |D002    |16         |C      |SUCCESS|2023-01-05 11:13:53.643364|
+------------------------------------+-----------+--------------+----------+--------+-----------+-------+-------+--------------------------+



In [13]:
# Converting StructType data to MapType data.
# The "data" column in "json_df" is a struct-type column with a single key, "devices", which holds an array of StructType objects. The code below explodes that array and converts each StructType object into a map.

map_type_df = ( 
    json_df.select( F.explode("data.devices").alias("exploded_devices") )
    .select( F.create_map(
        F.lit("deviceId"), F.col("exploded_devices.deviceId"),
        F.lit("temperature"), F.col("exploded_devices.temperature"),
        F.lit("measure"), F.col("exploded_devices.measure"),
        F.lit("status"), F.col("exploded_devices.status")
    ).alias( "map_type_data" ) ) 
)

map_type_df.show(truncate=False)

+----------------------------------------------------------------------+
|map_type_data                                                         |
+----------------------------------------------------------------------+
|{deviceId -> D001, temperature -> 15, measure -> C, status -> ERROR}  |
|{deviceId -> D002, temperature -> 16, measure -> C, status -> SUCCESS}|
+----------------------------------------------------------------------+



In [14]:
# We can now apply "explode()" and "posexplode()" to the MapType column

map_type_df.select( "*" , F.explode("map_type_data").alias( "data_key", "data_value" ) ).show(truncate=False)

+----------------------------------------------------------------------+-----------+----------+
|map_type_data                                                         |data_key   |data_value|
+----------------------------------------------------------------------+-----------+----------+
|{deviceId -> D001, temperature -> 15, measure -> C, status -> ERROR}  |deviceId   |D001      |
|{deviceId -> D001, temperature -> 15, measure -> C, status -> ERROR}  |temperature|15        |
|{deviceId -> D001, temperature -> 15, measure -> C, status -> ERROR}  |measure    |C         |
|{deviceId -> D001, temperature -> 15, measure -> C, status -> ERROR}  |status     |ERROR     |
|{deviceId -> D002, temperature -> 16, measure -> C, status -> SUCCESS}|deviceId   |D002      |
|{deviceId -> D002, temperature -> 16, measure -> C, status -> SUCCESS}|temperature|16        |
|{deviceId -> D002, temperature -> 16, measure -> C, status -> SUCCESS}|measure    |C         |
|{deviceId -> D002, temperature -> 16, m

## Converting the Batch Code to Streaming Code to run the JSON flattening as a streaming job:
<hr>

In [9]:
# This configuration enables automatic schema inference while reading streaming data.
# For Spark streaming, it is the equivalent of "inferSchema=True" in PySpark batch processing.
spark.conf.set("spark.sql.streaming.schemaInference", True)
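
As an alternative to schema inference, the schema printed by the batch read can be declared up front and passed to the streaming reader. The sketch below writes it as a DDL string, which `DataStreamReader.schema()` accepts alongside `StructType` objects. The constant `DEVICE_EVENT_SCHEMA_DDL` and the helper `read_device_stream` are assumed names for illustration, and the field types simply mirror the `printSchema()` output above.

```python
# Schema of the device events as a Spark DDL string, mirroring the inferred schema.
DEVICE_EVENT_SCHEMA_DDL = (
    "customerId STRING, "
    "data STRUCT<devices: ARRAY<STRUCT<"
    "deviceId: STRING, measure: STRING, status: STRING, temperature: BIGINT>>>, "
    "eventId STRING, "
    "eventOffset BIGINT, "
    "eventPublisher STRING, "
    "eventTime STRING"
)


def read_device_stream(spark, input_dir):
    """Build a streaming JSON reader with a fixed schema (hypothetical helper).

    `spark` is an existing SparkSession; `input_dir` must be a directory path.
    """
    return (
        spark.readStream
        .format("json")
        .schema(DEVICE_EVENT_SCHEMA_DDL)   # explicit schema, so no inference is needed
        .option("maxFilesPerTrigger", 1)
        .load(input_dir)
    )
```

With an explicit schema the stream no longer depends on which files happen to be in the input directory when the query starts.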

In [14]:
"""
# The "cleanSource" option deletes or archives input files once the streaming query has processed them. It has 3 possible values: "off" (default), "archive", and "delete".
# By default, processed files are left untouched. When "cleanSource" is set to "delete", the files are deleted after they have been processed.
# When "cleanSource" is set to "archive", each processed file is moved to the directory given in the "sourceArchiveDir" option.

# ("maxFilesPerTrigger", 1) option: for each trigger/micro-batch, at most 1 new file is processed by this streaming read query. Without this option, Spark tries to process all the new files it finds in the input directory in a single micro-batch. Setting a limit is advisable for production workloads, since it keeps the size of each micro-batch bounded.
"""

streaming_json_df = (
    spark.readStream
    .format("json")
    .option("cleanSource", "archive")
    .option("sourceArchiveDir", "./data/"+notebook_name+"/archive")
    .option("spark.sql.streaming.fileSource.cleaner.numThreads", "0")  # If the number of cleaner threads is set to 0, the main thread is forced to do the cleanup
    .option("maxFilesPerTrigger", 1)
    .load(f"./data/{notebook_name}/input/device_files/")  # This must be the path to a directory, not to a single file
)

In [15]:
streaming_json_df.printSchema()

root
 |-- customerId: string (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- devices: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- deviceId: string (nullable = true)
 |    |    |    |-- measure: string (nullable = true)
 |    |    |    |-- status: string (nullable = true)
 |    |    |    |-- temperature: long (nullable = true)
 |-- eventId: string (nullable = true)
 |-- eventOffset: long (nullable = true)
 |-- eventPublisher: string (nullable = true)
 |-- eventTime: string (nullable = true)



In [16]:
streaming_flattened_df = ( 
    streaming_json_df.withColumn( "data", F.explode( "data.devices" ) )
    .withColumn( "deviceId", F.col("data.deviceId") )
    .withColumn( "temperature", F.col("data.temperature") )
    .withColumn( "measure", F.col("data.measure") )
    .withColumn( "status", F.col("data.status") )
    # .drop("data")
    .select( "eventId", "eventOffset", "eventPublisher", "customerId", "deviceId", "temperature", "measure", "status", "eventTime" )
)

In [17]:
streaming_flattened_df.printSchema()

root
 |-- eventId: string (nullable = true)
 |-- eventOffset: long (nullable = true)
 |-- eventPublisher: string (nullable = true)
 |-- customerId: string (nullable = true)
 |-- deviceId: string (nullable = true)
 |-- temperature: long (nullable = true)
 |-- measure: string (nullable = true)
 |-- status: string (nullable = true)
 |-- eventTime: string (nullable = true)



In [None]:
# CHECKPOINT_DIRECTORY_FOR_CURRENT_RUN = update_checkpoint_dir( BASE_CHECKPOINT_DIR_FOR_CURRENT_NOTEBOOK )
CHECKPOINT_DIRECTORY_FOR_CURRENT_RUN = "./checkpoints/004-flatten_streaming_JSON_files/7"
print(CHECKPOINT_DIRECTORY_FOR_CURRENT_RUN)

In [18]:
stream_write_to_console_query = (
    streaming_flattened_df.writeStream
    .format("console")
    .outputMode("append")
    .option("checkpointLocation", CHECKPOINT_DIRECTORY_FOR_CURRENT_RUN)
    .start()
)

In [16]:
stream_write_as_csv_query = (
    streaming_flattened_df.writeStream
    .format("csv")
    .outputMode("append")
    .option("header", True)
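    .option("checkpointLocation", CHECKPOINT_DIRECTORY_FOR_CURRENT_RUN)  # Each streaming query needs its own checkpoint location; do not reuse the console query's directory if both queries are started
    .option("path", "./data/"+notebook_name+"/output/device_data.csv")  # Output path for the streaming data; Spark creates a directory of part files here, not a single CSV file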
    .start()
)
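
Both write queries above point at the same `CHECKPOINT_DIRECTORY_FOR_CURRENT_RUN`. Because a checkpoint location stores the progress of a single query, one way to run both sinks side by side is to give each its own sub-directory under the run's checkpoint directory. The helper `checkpoint_dir_for_sink` and the sink names below are assumptions for illustration only.

```python
import posixpath


def checkpoint_dir_for_sink(run_checkpoint_dir, sink_name):
    """Return a per-sink checkpoint path under the run's checkpoint directory."""
    # posixpath keeps forward slashes, matching the relative paths used in this notebook.
    return posixpath.join(run_checkpoint_dir, sink_name)


RUN_DIR = "./checkpoints/004-flatten_streaming_JSON_files/7"
console_checkpoint = checkpoint_dir_for_sink(RUN_DIR, "console")
csv_checkpoint = checkpoint_dir_for_sink(RUN_DIR, "csv")

print(console_checkpoint)
print(csv_checkpoint)
```

Each value would then be passed to its own query through `.option("checkpointLocation", ...)`.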

In [None]:
stream_write_to_console_query.awaitTermination()
# stream_write_as_csv_query.awaitTermination()