### Lab 5: Processing Incremental Updates with PySpark Structured Streaming and Delta Tables
In this lab you'll apply your knowledge of PySpark and structured streaming to implement a simple multi-hop (medallion) architecture.

#### 1.0. Import Required Libraries

In [None]:
import findspark
findspark.init()
print(findspark.find())

import os
import sys
import json
import shutil
import time

import pyspark
from delta import *
from pyspark.sql.functions import *
from pyspark.sql.types import *

#### 2.0. Instantiate Global Variables

In [None]:
# --------------------------------------------------------------------------------
# Specify Directory Structure for Source Data
# --------------------------------------------------------------------------------
base_dir = os.path.join(os.getcwd(), 'lab_data')
data_dir = os.path.join(base_dir, 'retail-org')
customers_stream_dir = os.path.join(data_dir, 'customers')

# --------------------------------------------------------------------------------
# Create Directory Structure for Data Lakehouse Files
# --------------------------------------------------------------------------------
dest_database = "customers_dlh"
sql_warehouse_dir = os.path.abspath('spark-warehouse')
database_dir = os.path.join(sql_warehouse_dir, dest_database)

customers_output_bronze = os.path.join(database_dir, 'customers_bronze')
customers_output_silver = os.path.join(database_dir, 'customers_silver')
customers_output_gold = os.path.join(database_dir, 'customers_gold')

#### 3.0. Define Global Functions

In [None]:
def remove_directory_tree(path: str):
    '''Remove the directory tree at 'path', if it exists, and return a status message.'''
    try:
        if os.path.exists(path):
            shutil.rmtree(path)
            return f"Directory '{path}' has been removed successfully."
        else:
            return f"Directory '{path}' does not exist."
            
    except Exception as e:
        return f"An error occurred: {e}"

#### 4.0. Create a New Spark Session

In [None]:
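# Use half of the available cores, but always at least one worker thread.
# (The built-in max() is shadowed by the wildcard import from pyspark.sql.functions.)
cpu_count = os.cpu_count() or 2
worker_threads = f"local[{cpu_count // 2 or 1}]"
shuffle_partitions = cpu_count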

builder = pyspark.sql.SparkSession.builder \
    .appName('PySpark Customers Delta Table in Jupyter')\
    .master(worker_threads)\
    .config('spark.executor.memory', '2g')\
    .config('spark.driver.memory', '4g') \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config('spark.sql.shuffle.partitions', shuffle_partitions) \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config('spark.sql.streaming.forceDeleteTempCheckpointLocation', 'true') \
    .config("spark.sql.streaming.schemaInference", "true") \
    .config('spark.sql.warehouse.dir', database_dir)

spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark

#### 5.0. Initialize Data Lakehouse Directory Structure
Remove the data lakehouse database directory structure so that the notebook can be re-run from a clean state (idempotency).

In [None]:
remove_directory_tree(database_dir)
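
Why this step makes the notebook safe to re-run can be seen in a small stand-alone sketch. It repeats the helper from section 3.0 in condensed form and runs it twice against a scratch folder; the folder names are hypothetical stand-ins for `database_dir` and its `_checkpoint` subfolders. Because streaming checkpoints record what has already been processed, clearing them along with the tables is what lets every stream start from scratch.

```python
import os
import shutil
import tempfile


def remove_directory_tree(path: str) -> str:
    """Remove a directory tree if it exists and report what happened."""
    if os.path.exists(path):
        shutil.rmtree(path)
        return f"Directory '{path}' has been removed successfully."
    return f"Directory '{path}' does not exist."


# Hypothetical scratch location standing in for database_dir.
scratch_root = tempfile.mkdtemp()
demo_dir = os.path.join(scratch_root, "customers_dlh")
os.makedirs(os.path.join(demo_dir, "customers_bronze", "_checkpoint"))

first_call = remove_directory_tree(demo_dir)   # removes tables and checkpoints
second_call = remove_directory_tree(demo_dir)  # nothing left, still safe to call
print(first_call)
print(second_call)

shutil.rmtree(scratch_root)  # clean up the scratch folder
```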

#### 6.0. Bronze Table: Ingest and Stage Data
This lab uses a collection of customer-related CSV data found in *`../04-PySpark/lab_data/retail-org/customers/`*. 
<br>This is available to you by way of the `customers_stream_dir` variable.

##### 6.1. Read this data into a Stream using schema inference
- Use the **`schemaLocation`** option to store the schema info in a dedicated **`_checkpoint`** folder for **`customers`**.
- Set the **`maxFilesPerTrigger`** option to **`1`**.
- Set the **`inferSchema`** and **`header`** options to **`true`**.
- Use **`.csv()`** to specify the source directory.
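
The options in the list above could be wired together roughly as follows. This is a sketch rather than the lab's reference solution: the helper names `csv_stream_options` and `read_customers_stream` are made up for illustration, and `schemaLocation` is passed only because the instructions call for it (whether the CSV source acts on that option is not confirmed here).

```python
def csv_stream_options(schema_dir: str) -> dict:
    """Collect the reader options named in the instructions in one place."""
    return {
        "schemaLocation": schema_dir,  # per the lab instructions (assumption: harmless if unused)
        "maxFilesPerTrigger": 1,       # process one source file per micro-batch
        "inferSchema": "true",         # derive column types from the data
        "header": "true",              # first row of each file holds column names
    }


def read_customers_stream(spark, source_dir: str, schema_dir: str):
    """Return a streaming DataFrame over the CSV files in source_dir."""
    return (
        spark.readStream
        .options(**csv_stream_options(schema_dir))
        .csv(source_dir)
    )
```

In the notebook this might be called as `read_customers_stream(spark, customers_stream_dir, customers_checkpoint_bronze)`, after which `isStreaming` should report `True`.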

In [None]:
customers_checkpoint_bronze = os.path.join(customers_output_bronze, '_checkpoint')

df_customers_bronze = (
    spark.readStream \
    # TODO: Configurations
)

df_customers_bronze.isStreaming

In [None]:
df_customers_bronze.printSchema()

##### 6.2. Stream the raw data to a Delta table.
 - Use the **`delta`** format.
 - Use the **`append`** output mode.
 - Use **`customers_bronze`** as the **`queryName`**.
 - Use **`availableNow = True`** for the **`trigger`**.
 - Use the **`_checkpoint`** folder with the **`checkpointLocation`** option.
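
One way the writer settings above might fit together is sketched below. The function name `write_stream_to_delta` and its parameters are hypothetical, and starting the query with a path via `.start(output_path)` is an assumption; the lab may intend a different way of naming the target.

```python
def write_stream_to_delta(df, output_path: str, checkpoint_path: str,
                          query_name: str, output_mode: str = "append"):
    """Start a streaming write of df to a Delta table stored at output_path."""
    return (
        df.writeStream
        .format("delta")
        .outputMode(output_mode)
        .queryName(query_name)
        .trigger(availableNow=True)  # process everything available, then stop
        .option("checkpointLocation", checkpoint_path)
        .start(output_path)          # assumption: path-based Delta table
    )
```

Because an `availableNow` trigger stops the query once the backlog is processed, calling `awaitTermination()` on the returned query blocks only until the available files have been written.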

In [None]:
customers_bronze_query = (
    df_customers_bronze \
    .writeStream \
    # TODO: Configurations
)

In [None]:
customers_bronze_query.awaitTermination()

##### 6.3. Create a Streaming Temporary View named **`customers_bronze_temp`**
- Use the **`delta`** format.
- Set the **`inferSchema`** option to **`true`**.
- Load the data from the output of the **`bronze`** delta table (**`customers_output_bronze`**)
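
A possible shape for this step is shown below, with a hypothetical helper name (`register_streaming_view`). The `inferSchema` option is included only because the instructions list it; a Delta table already carries its own schema, so the option is assumed to have no effect here.

```python
def register_streaming_view(spark, delta_path: str, view_name: str) -> None:
    """Expose a Delta table as a streaming temporary view for use in SQL."""
    (
        spark.readStream
        .format("delta")
        .option("inferSchema", "true")  # per the lab instructions
        .load(delta_path)
        .createOrReplaceTempView(view_name)
    )
```

In the notebook the call might look like `register_streaming_view(spark, customers_output_bronze, "customers_bronze_temp")`.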

In [None]:
(spark.readStream \
    # TODO: Configurations
)

##### 6.4. Clean and Enhance the Data
Use CTAS-style syntax (**`CREATE OR REPLACE TEMPORARY VIEW ... AS SELECT`**) to define a new streaming view called **`bronze_enhanced_temp`** that does the following:
* Omits records with a null **`postcode`** (null values are recorded as zero).
* Inserts a column called **`receipt_time`** containing a current timestamp using the **`current_timestamp()`** function.
* Inserts a column called **`source_file`** containing the input filename using the **`input_file_name()`** function.
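
As a hedged sketch, the statement could look something like the following. It assumes that schema inference typed `postcode` as a number, so that `postcode > 0` drops both zero and SQL-null values; if the column was inferred as a string, the filter would need adjusting.

```python
# Assumption: postcode is numeric, with missing values recorded as 0.
sql_bronze_temp = """
    CREATE OR REPLACE TEMPORARY VIEW bronze_enhanced_temp AS
    SELECT
        *,
        current_timestamp() AS receipt_time,
        input_file_name()   AS source_file
    FROM customers_bronze_temp
    WHERE postcode > 0
"""
print(sql_bronze_temp)
```

Since `customers_bronze_temp` is a streaming view, the view defined on top of it is streaming as well.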

In [None]:
sql_bronze_temp = """
   -- TODO: Author SQL statement
"""
spark.sql(sql_bronze_temp)

#### 7.0. Silver Table
##### 7.1. Stream the data from **`bronze_enhanced_temp`** to a **`Delta`** table named **`customers_silver`**.
 - Use the **`append`** output mode.
 - Use **`customers_silver`** as the **`queryName`**.
 - Use **`availableNow = True`** for the **`trigger`**.
 - Use a **`_checkpoint`** folder with the **`checkpointLocation`** option so that the stream's progress is tracked in a dedicated folder for **`customers`**.

In [None]:
customers_checkpoint_silver = os.path.join(customers_output_silver, '_checkpoint')

customers_silver_query = \
(spark.table("bronze_enhanced_temp") \
     .writeStream \
     # TODO: Configurations
)

##### 7.2. Create a Streaming Temporary View
- Create another streaming temporary view named **`customers_silver_temp`** for the **`customers_silver`** table so we can perform business-level queries using SQL.

In [None]:
(spark.readStream \
     # TODO: Configurations
)

#### 8.0. Gold Table
##### 8.1. Use the CTAS syntax to define a new streaming view called **`customer_count_by_state_temp`** that does the following:
- Reads data from the **`customers_silver_temp`** temporary view created in the preceding step.
- Selects the **`state`** along with the number of customers per (grouped by) state.
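
The `SELECT` part of such a view is ordinary SQL, so its shape can be tried out on toy rows with Python's built-in `sqlite3` before running it in Spark. The sketch below does that; the sample rows and the `customer_id` column are hypothetical, and SQLite is only a stand-in for checking the grouping logic.

```python
import sqlite3

# The aggregation itself: one row per state with its customer count.
select_counts = """
    SELECT state, count(*) AS customer_count
    FROM customers_silver_temp
    GROUP BY state
"""

# In the notebook, the statement handed to spark.sql() would wrap the SELECT.
sql_gold_temp = (
    "CREATE OR REPLACE TEMPORARY VIEW customer_count_by_state_temp AS"
    + select_counts
)

# Toy stand-in for the silver view (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers_silver_temp (customer_id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO customers_silver_temp VALUES (?, ?)",
    [(1, "VA"), (2, "CA"), (3, "VA")],
)
toy_counts = dict(conn.execute(select_counts).fetchall())
conn.close()
print(toy_counts)
```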

In [None]:
sql_gold_temp = """
    -- TODO: Author SQL statement
"""
spark.sql(sql_gold_temp)

##### 8.2. Stream the data from the **`customer_count_by_state_temp`** view to a Delta table called **`customer_count_by_state_gold`**.
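- Use the **`complete`** output mode: the result of a streaming aggregation such as **`count()`** keeps changing as new data arrives, so the full result table is rewritten on each trigger. Sorting a stream is likewise only supported after an aggregation and in **`complete`** mode.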
- Use a **`_checkpoint`** folder with the **`checkpointLocation`** option and a dedicated folder for **`customers`** as the checkpoint path.
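
What `complete` mode means for the sink can be pictured without Spark. The following plain-Python analogy (not Spark code; the function name and the state values are invented for illustration) keeps a running count per state and emits the whole result table after every micro-batch, which is what a `complete`-mode query writes on each trigger.

```python
from collections import Counter


def complete_mode_outputs(micro_batches):
    """Yield the full running result table after each micro-batch."""
    running = Counter()
    for batch in micro_batches:
        running.update(batch)  # counts for existing states can change
        yield dict(running)    # the sink receives the entire table again


# Hypothetical micro-batches, each a list of customer states.
batches = [["VA", "CA"], ["VA"], ["NY", "CA", "VA"]]
for batch_id, table in enumerate(complete_mode_outputs(batches)):
    print(batch_id, table)
```

Rows already emitted for a state become stale as soon as another customer from that state arrives, which is why appending new rows alone would not keep the gold table correct.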

In [None]:
customer_count_checkpoint_gold = os.path.join(customers_output_gold, '_checkpoint')

customer_count_by_state_gold_query = \
(spark.table("customer_count_by_state_temp") \
     # TODO: Configurations
)

In [None]:
customer_count_by_state_gold_query.awaitTermination()

#### 9.0. Query the Results
- Query the **`customer_count_by_state_gold`** table (this will not be a streaming query).
- Select the **`state`** and **`customer_count`** columns.
- Sort the results by **`customer_count`** in descending order (i.e., from highest to lowest).
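
If the gold stream was written to a path rather than registered as a named table, the batch query can address the Delta files directly. The sketch below builds such a statement; the helper name `gold_query_sql` is hypothetical, and the path-based ``delta.`...` `` form is an assumption about how the table was written.

```python
def gold_query_sql(gold_table_path: str) -> str:
    """Build a batch query over a path-based Delta table, largest counts first."""
    return f"""
        SELECT state, customer_count
        FROM delta.`{gold_table_path}`
        ORDER BY customer_count DESC
    """


# Hypothetical path; in the notebook this would be customers_output_gold.
print(gold_query_sql("/tmp/spark-warehouse/customers_dlh/customers_gold"))
```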

In [None]:
sql_customer_count_query = """
    -- TODO: Author SQL query
"""
spark.sql(sql_customer_count_query).toPandas()

In [None]:
spark.stop()