## Lab 6: Building a Data Lakehouse with the PySpark Structured Streaming Medallion Architecture
This lab introduces many of the software libraries and programming techniques needed to complete the end-of-session capstone project for **DS-2002: Data Systems**. The capstone asks students to demonstrate a practical, working understanding of each of the data systems and architectural principles covered throughout the session.

**These include:**
- Relational Database Management Systems (e.g., MySQL, Microsoft SQL Server, Oracle, IBM DB2)
  - Online Transaction Processing Systems (OLTP): *Optimized for High-Volume Write Operations; Normalized to 3rd Normal Form.*
  - Online Analytical Processing Systems (OLAP): *Optimized for Read/Aggregation Operations; Dimensional Model (i.e., Star Schema)*
- NoSQL *(Not Only SQL)* Systems (e.g., MongoDB, CosmosDB, Cassandra, HBase, Redis)
- File System *(Data Lake)* Source Systems (e.g., AWS S3, Microsoft Azure Data Lake Storage)
  - Various Datafile Formats (e.g., JSON, CSV, Parquet, Text, Binary)
- Massively Parallel Processing *(MPP)* Data Integration Systems (e.g., Apache Spark/PySpark, Databricks)
- Data Integration Patterns (e.g., Extract-Transform-Load, Extract-Load-Transform, Extract-Load-Transform-Load, Lambda & Kappa Architectures)

## Section I: Prerequisites

### 1.0. Import Required Libraries

In [None]:
import findspark
findspark.init()
print(findspark.find())

import os
import sys
import json
import time
import pymongo
import certifi
import shutil
import pandas as pd

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window as W

### 2.0. Instantiate Global Variables

In [None]:
# --------------------------------------------------------------------------------
# Specify MySQL Server Connection Information
# --------------------------------------------------------------------------------
mysql_args = {
    "host_name" : "localhost",
    "port" : "3306",
    "db_name" : "northwind",
    "conn_props" : {
        "user" : "jtupitza",
        "password" : "Passw0rd123!",
        "driver" : "com.mysql.cj.jdbc.Driver"
    }
}

# --------------------------------------------------------------------------------
# Specify MongoDB Cluster Connection Information
# --------------------------------------------------------------------------------
mongodb_args = {
    "cluster_location" : "local", # "atlas"
    "user_name" : "jtupitza",
    "password" : "Passw0rd1234",
    "cluster_name" : "sandbox",
    "cluster_subnet" : "zibbf",
    "db_name" : "northwind",
    "collection" : "",
    "null_column_threshold" : 0.5
}

# --------------------------------------------------------------------------------
# Specify Directory Structure for Source Data
# --------------------------------------------------------------------------------
base_dir = os.path.join(os.getcwd(), 'lab_data')
data_dir = os.path.join(base_dir, 'northwind')
batch_dir = os.path.join(data_dir, 'batch')
stream_dir = os.path.join(data_dir, 'streaming')

orders_stream_dir = os.path.join(stream_dir, 'orders')
purchase_orders_stream_dir = os.path.join(stream_dir, 'purchase_orders')
inventory_trans_stream_dir = os.path.join(stream_dir, 'inventory_transactions')

# --------------------------------------------------------------------------------
# Create Directory Structure for Data Lakehouse Files
# --------------------------------------------------------------------------------
dest_database = "northwind_dlh"
sql_warehouse_dir = os.path.abspath('spark-warehouse')
dest_database_dir = f"{dest_database}.db"
database_dir = os.path.join(sql_warehouse_dir, dest_database_dir)

orders_output_bronze = os.path.join(database_dir, 'fact_orders', 'bronze')
orders_output_silver = os.path.join(database_dir, 'fact_orders', 'silver')
orders_output_gold = os.path.join(database_dir, 'fact_orders', 'gold')

purchase_orders_output_bronze = os.path.join(database_dir, 'fact_purchase_orders', 'bronze')
purchase_orders_output_silver = os.path.join(database_dir, 'fact_purchase_orders', 'silver')
purchase_orders_output_gold = os.path.join(database_dir, 'fact_purchase_orders', 'gold')

inventory_trans_output_bronze = os.path.join(database_dir, 'fact_inventory_transactions', 'bronze')
inventory_trans_output_silver = os.path.join(database_dir, 'fact_inventory_transactions', 'silver')
inventory_trans_output_gold = os.path.join(database_dir, 'fact_inventory_transactions', 'gold')
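
The output paths above follow one pattern: a directory per fact table, each containing a `bronze`, `silver`, and `gold` subdirectory. As a minimal sketch of that layout (the relative `warehouse_root` here is an assumption made for illustration; the lab derives the real root from `os.path.abspath('spark-warehouse')`), the same paths can be generated with a nested dictionary:

```python
import os

# Hypothetical relative root used only for this illustration.
warehouse_root = os.path.join("spark-warehouse", "northwind_dlh.db")

fact_tables = ["fact_orders", "fact_purchase_orders", "fact_inventory_transactions"]
layers = ["bronze", "silver", "gold"]

# One output directory per (fact table, medallion layer) pair.
layout = {
    table: {layer: os.path.join(warehouse_root, table, layer) for layer in layers}
    for table in fact_tables
}

for table, paths in layout.items():
    for layer, path in paths.items():
        print(f"{table:<28} {layer:<7} {path}")
```

The explicit variables used in the lab are easier to read in later cells; the dictionary form simply makes the naming convention visible in one place.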

### 3.0. Define Global Functions

In [None]:
def get_file_info(path: str):
    file_sizes = []
    modification_times = []

    '''Fetch each item in the directory, and filter out any directories.'''
    items = os.listdir(path)
    files = sorted([item for item in items if os.path.isfile(os.path.join(path, item))])

    '''Populate lists with the Size and Last Modification DateTime for each file in the directory.'''
    for file in files:
        file_sizes.append(os.path.getsize(os.path.join(path, file)))
        modification_times.append(pd.to_datetime(os.path.getmtime(os.path.join(path, file)), unit='s'))

    data = list(zip(files, file_sizes, modification_times))
    column_names = ['name','size','modification_time']
    
    return pd.DataFrame(data=data, columns=column_names)


def wait_until_stream_is_ready(query, min_batches=1):
    while len(query.recentProgress) < min_batches:
        time.sleep(5)
        
    print(f"The stream has processed {len(query.recentProgress)} batches")


def remove_directory_tree(path: str):
    '''If it exists, remove the entire contents of a directory structure at a given 'path' parameter's location.'''
    try:
        if os.path.exists(path):
            shutil.rmtree(path)
            return f"Directory '{path}' has been removed successfully."
        else:
            return f"Directory '{path}' does not exist."
            
    except Exception as e:
        return f"An error occurred: {e}"
        

def drop_null_columns(df, threshold):
    '''Drop Columns having a fraction of NULL values that exceeds the given 'threshold' parameter value.'''
    # Count the rows once (not once per column), and guard against an empty DataFrame.
    total_rows = df.count()
    if total_rows == 0:
        return df

    columns_with_nulls = [c for c in df.columns if df.filter(df[c].isNull()).count() / total_rows > threshold]
    df_dropped = df.drop(*columns_with_nulls)
    
    return df_dropped
    
    
def get_mysql_dataframe(spark_session, sql_query : str, **args):
    '''Create a JDBC URL to the MySQL Database'''
    jdbc_url = f"jdbc:mysql://{args['host_name']}:{args['port']}/{args['db_name']}"
    
    '''Invoke the spark.read.format("jdbc") function to query the database, and fill a DataFrame.'''
    dframe = spark_session.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("driver", args['conn_props']['driver']) \
    .option("user", args['conn_props']['user']) \
    .option("password", args['conn_props']['password']) \
    .option("query", sql_query) \
    .load()
    
    return dframe
    

def get_mongo_uri(**args):
    '''Validate proper input'''
    if args["cluster_location"] not in ['atlas', 'local']:
        raise Exception("You must specify either 'atlas' or 'local' for the 'cluster_location' parameter.")
        
    if args['cluster_location'] == "atlas":
        uri = f"mongodb+srv://{args['user_name']}:{args['password']}@"
        uri += f"{args['cluster_name']}.{args['cluster_subnet']}.mongodb.net/"
    else:
        uri = "mongodb://localhost:27017/"

    return uri


def get_spark_conf_args(spark_jars : list, **args):
    # Spark expects 'spark.jars' as a comma-separated list of paths.
    jars = ",".join(spark_jars)
    
    sparkConf_args = {
        "app_name" : "PySpark Northwind Data Lakehouse (Medallion Architecture)",
        # Use half of the logical CPUs, but never fewer than one worker thread.
        "worker_threads" : f"local[{max(1, os.cpu_count() // 2)}]",
        "shuffle_partitions" : int(os.cpu_count()),
        "mongo_uri" : get_mongo_uri(**args),
        "spark_jars" : jars,
        "database_dir" : sql_warehouse_dir
    }
    
    return sparkConf_args
    

def get_spark_conf(**args):
    sparkConf = SparkConf().setAppName(args['app_name'])\
    .setMaster(args['worker_threads']) \
    .set('spark.driver.memory', '4g') \
    .set('spark.executor.memory', '2g') \
    .set('spark.jars', args['spark_jars']) \
    .set('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.12:3.0.1') \
    .set('spark.mongodb.input.uri', args['mongo_uri']) \
    .set('spark.mongodb.output.uri', args['mongo_uri']) \
    .set('spark.sql.adaptive.enabled', 'false') \
    .set('spark.sql.debug.maxToStringFields', 35) \
    .set('spark.sql.shuffle.partitions', args['shuffle_partitions']) \
    .set('spark.sql.streaming.forceDeleteTempCheckpointLocation', 'true') \
    .set('spark.sql.streaming.schemaInference', 'true') \
    .set('spark.sql.warehouse.dir', args['database_dir']) \
    .set('spark.streaming.stopGracefullyOnShutdown', 'true')
    
    return sparkConf


def get_mongo_client(**args):
    '''Get MongoDB Client Connection'''
    mongo_uri = get_mongo_uri(**args)
    if args['cluster_location'] == "atlas":
        client = pymongo.MongoClient(mongo_uri, tlsCAFile=certifi.where())

    elif args['cluster_location'] == "local":
        client = pymongo.MongoClient(mongo_uri)
        
    else:
        raise Exception("A MongoDB Client could not be created.")

    return client
    
    
# TODO: Rewrite this to leverage PySpark?
def set_mongo_collections(mongo_client, db_name : str, data_directory : str, json_files : dict):
    '''Drop and reload one collection per entry in 'json_files' (collection name -> JSON file name).'''
    db = mongo_client[db_name]
    
    for collection_name, file_name in json_files.items():
        db.drop_collection(collection_name)
        json_file = os.path.join(data_directory, file_name)
        with open(json_file, 'r') as openfile:
            json_object = json.load(openfile)
            db[collection_name].insert_many(json_object)
        
    mongo_client.close()
    

def get_mongodb_dataframe(spark_session, **args):
    '''Query MongoDB, and create a DataFrame'''
    dframe = spark_session.read.format("com.mongodb.spark.sql.DefaultSource") \
        .option("database", args['db_name']) \
        .option("collection", args['collection']).load()

    '''Drop the '_id' index column to clean up the response.'''
    dframe = dframe.drop('_id')
    
    '''Call the drop_null_columns() function passing in the dataframe.'''
    dframe = drop_null_columns(dframe, args['null_column_threshold'])
    
    return dframe
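
Among the helpers above, `get_mongo_uri()` interpolates the user name and password directly into the connection string. MongoDB connection strings require reserved characters in credentials (such as `@`, `:` or `/`) to be percent-encoded, so a password containing them would produce a malformed URI. The sketch below is a hypothetical variant, not part of the lab, that escapes both values with `urllib.parse.quote_plus`; the function name `build_mongo_uri` and the sample credentials are assumptions made for illustration.

```python
from urllib.parse import quote_plus


def build_mongo_uri(cluster_location, user_name="", password="",
                    cluster_name="", cluster_subnet=""):
    """Hypothetical variant of get_mongo_uri() that percent-encodes credentials."""
    if cluster_location not in ("atlas", "local"):
        raise ValueError("cluster_location must be 'atlas' or 'local'")
    if cluster_location == "local":
        return "mongodb://localhost:27017/"
    user = quote_plus(user_name)
    pwd = quote_plus(password)
    return f"mongodb+srv://{user}:{pwd}@{cluster_name}.{cluster_subnet}.mongodb.net/"


local_uri = build_mongo_uri("local")
atlas_uri = build_mongo_uri("atlas", "student", "p@ss/word", "sandbox", "zibbf")
print(local_uri)   # mongodb://localhost:27017/
print(atlas_uri)   # mongodb+srv://student:p%40ss%2Fword@sandbox.zibbf.mongodb.net/
```

With the credentials used in this lab the encoded and unencoded URIs are expected to be identical, so the variant only matters once a password contains reserved characters.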

### 4.0. Initialize Data Lakehouse Directory Structure
Remove the Data Lakehouse Database Directory Structure to Ensure Idempotency

In [None]:
remove_directory_tree(database_dir)
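
Idempotency here means the cell leaves the file system in the same state no matter how many times it runs: the first call removes the tree, and later calls find nothing to remove and do nothing. The following sketch demonstrates that behavior on a throwaway temporary directory; `remove_tree` is a simplified stand-in for the lab's `remove_directory_tree()` helper, and the directory names are illustrative.

```python
import os
import shutil
import tempfile


def remove_tree(path):
    """Simplified stand-in mirroring the lab's remove_directory_tree() helper."""
    if os.path.exists(path):
        shutil.rmtree(path)
        return "removed"
    return "absent"


# Build a temporary tree that imitates a lakehouse output folder.
root = tempfile.mkdtemp()
database_path = os.path.join(root, "northwind_dlh.db")
os.makedirs(os.path.join(database_path, "fact_orders", "bronze"))

first = remove_tree(database_path)    # the tree exists, so it is deleted
second = remove_tree(database_path)   # nothing left to delete, and no error
print(first, second)                  # removed absent

shutil.rmtree(root)  # clean up the temporary root
```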

### 5.0. Create a New Spark Session

In [None]:
worker_threads = f"local[{int(os.cpu_count()/2)}]"

jars = []
mysql_spark_jar = os.path.join(os.getcwd(), "mysql-connector-j-9.1.0", "mysql-connector-j-9.1.0.jar")
mssql_spark_jar = os.path.join(os.getcwd(), "sqljdbc_12.8", "enu", "jars", "mssql-jdbc-12.8.1.jre11.jar")

jars.append(mysql_spark_jar)
#jars.append(mssql_spark_jar)

sparkConf_args = get_spark_conf_args(jars, **mongodb_args)

sparkConf = get_spark_conf(**sparkConf_args)
spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
spark.sparkContext.setLogLevel("OFF")
spark
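
Two of the values passed to the Spark configuration are plain strings assembled before the session starts: the master URL (`local[N]`, where N is the number of worker threads) and the `spark.jars` list of connector paths. The sketch below shows one way to build both defensively; the helper names `local_master` and `jars_option` and the jar paths are hypothetical, and the lower bound of one thread is an assumption added here so that a single-CPU machine does not end up requesting zero threads.

```python
import os


def local_master(cpu_count=None):
    """Build a 'local[N]' master string using half the logical CPUs, but at least 1."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return f"local[{max(1, cpus // 2)}]"


def jars_option(jar_paths):
    """Join jar paths into the comma-separated form used by 'spark.jars'."""
    return ",".join(jar_paths)


print(local_master(8))   # local[4]
print(local_master(1))   # local[1]
print(jars_option(["/opt/jars/a.jar", "/opt/jars/b.jar"]))
```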

### 6.0. Create a New Metadata Database

In [None]:
spark.sql(f"DROP DATABASE IF EXISTS {dest_database} CASCADE;")

sql_create_db = f"""
    CREATE DATABASE IF NOT EXISTS {dest_database}
    COMMENT 'DS-2002 Lab 06 Database'
    WITH DBPROPERTIES (contains_pii = true, purpose = 'DS-2002 Lab 6.0');
"""
spark.sql(sql_create_db)

## Section II: Populate Dimensions by Ingesting "Cold-path" Reference Data 
### 1.0. Fetch Data from the File System
#### 1.1. Verify the location of the source data files on the file system

In [None]:
get_file_info(batch_dir)

#### 1.2. Populate the <span style="color:darkred">Employees Dimension</span>
##### 1.2.1. Use PySpark to Read data from a CSV file

In [None]:
employee_csv = os.path.join(batch_dir, 'northwind_employees.csv')
print(employee_csv)

df_dim_employees = spark.read.format('csv').options(header='true', inferSchema='true').load(employee_csv)
df_dim_employees.toPandas().head(2)

##### 1.2.2. Make Necessary Transformations to the New DataFrame

In [None]:
# ----------------------------------------------------------------------------------
# Rename the 'id' column to 'employee_id' ------------------------------------------
# ----------------------------------------------------------------------------------
df_dim_employees = df_dim_employees.withColumnRenamed("id", "employee_id")

# ----------------------------------------------------------------------------------
# Add Primary Key column using SQL Windowing function: ROW_NUMBER() 
# ----------------------------------------------------------------------------------
df_dim_employees.createOrReplaceTempView("employees")
sql_employees = f"""
    SELECT *, ROW_NUMBER() OVER (ORDER BY employee_id) AS employee_key
    FROM employees;
"""
df_dim_employees = spark.sql(sql_employees)

# ----------------------------------------------------------------------------------
# Reorder Columns and display the first two rows in a Pandas dataframe
# ----------------------------------------------------------------------------------
ordered_columns = ['employee_key', 'employee_id', 'first_name', 'last_name'
                   , 'company', 'job_title', 'business_phone', 'home_phone', 'fax_number'
                   , 'address', 'city', 'state_province', 'zip_postal_code', 'country_region']

df_dim_employees = df_dim_employees[ordered_columns]
df_dim_employees.toPandas().head(2)
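
`ROW_NUMBER() OVER (ORDER BY employee_id)` assigns consecutive integers starting at 1 in the order of the business key, which is what makes it suitable for generating a surrogate key. To show those semantics without a Spark session, the sketch below runs the same window function in Python's built-in `sqlite3` module as a stand-in for Spark SQL (this assumes the bundled SQLite is version 3.25 or newer, which introduced window functions). The three sample rows are illustrative and deliberately inserted out of order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, last_name TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(30, "Kotas"), (10, "Freehafer"), (20, "Cencini")],  # illustrative rows
)

# Same window expression as the Spark SQL statement in the lab.
rows = conn.execute(
    """
    SELECT ROW_NUMBER() OVER (ORDER BY employee_id) AS employee_key,
           employee_id,
           last_name
    FROM employees
    ORDER BY employee_key
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
# (1, 10, 'Freehafer')
# (2, 20, 'Cencini')
# (3, 30, 'Kotas')
```

In Spark, a window with an `ORDER BY` but no `PARTITION BY` moves all rows into a single partition to number them. That is acceptable for small dimension tables like these, but it would not scale to large fact tables.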

##### 1.2.3. Save as the <span style="color:darkred">dim_employees</span> table in the Data Lakehouse

In [None]:
df_dim_employees.write.saveAsTable(f"{dest_database}.dim_employees", mode="overwrite")

##### 1.2.4. Unit Test: Describe and Preview Table

In [None]:
spark.sql(f"DESCRIBE EXTENDED {dest_database}.dim_employees;").show()
spark.sql(f"SELECT * FROM {dest_database}.dim_employees LIMIT 2").toPandas()

#### 1.3. Populate the <span style="color:darkred">Shippers Dimension</span>
##### 1.3.1. Use PySpark to Read Data from a CSV File

In [None]:
# 1). Get a reference to the 'northwind_shippers.csv' file.

# 2). Use Spark to read the CSV file data into the 'df_dim_shippers' variable.
#     Remember to specify that the first row contains column names (header), and to infer the schema.

# 3). Unit Test: Convert the spark dataframe to a Pandas dataframe, and display the first two rows.


##### 1.3.2. Make Necessary Transformations to the New DataFrame

In [None]:
# ----------------------------------------------------------------------------------
# Rename the 'id' column to 'shipper_id' ------------------------------------------
# ----------------------------------------------------------------------------------


# ----------------------------------------------------------------------------------
# Add Primary Key column using SQL Windowing function: ROW_NUMBER() 
# ----------------------------------------------------------------------------------


# ----------------------------------------------------------------------------------
# Reorder Columns and display the first two rows in a Pandas dataframe
# ----------------------------------------------------------------------------------



##### 1.3.3. Save as the <span style="color:darkred">dim_shippers</span> table in the Data Lakehouse

##### 1.3.4. Unit Test: Describe and Preview Table

### 2.0. Fetch Reference Data from a MongoDB Atlas Database
#### 2.1. Create a New MongoDB Database, and Load Each JSON File into a New MongoDB Collection
**NOTE:** The following cell can be run more than once because the **set_mongo_collections()** function is idempotent: it drops each collection before reloading it.

In [None]:
client = get_mongo_client(**mongodb_args)

json_files = {"customers" : "northwind_customers.json",
              "invoices" : 'northwind_invoices.json',
              "suppliers" : 'northwind_suppliers.json'
             }

set_mongo_collections(client, mongodb_args["db_name"], batch_dir, json_files) 

#### 2.2. Populate the <span style="color:darkred">Customers Dimension</span>
##### 2.2.1. Fetch Data from the New MongoDB <span style="color:darkred">Customers</span> Collection

In [None]:
mongodb_args["collection"] = "customers"

df_dim_customers = get_mongodb_dataframe(spark, **mongodb_args)
df_dim_customers.toPandas().head(2)
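
Behind this call, `get_mongodb_dataframe()` drops any column whose share of NULL values is strictly greater than `null_column_threshold` (0.5 in this lab). The rule is easy to misread at the boundary, so here is a minimal pure-Python sketch of the same comparison on a list of dictionaries; the function name `columns_over_null_threshold` and the sample rows are hypothetical and stand in for the Spark DataFrame.

```python
def columns_over_null_threshold(rows, threshold):
    """Return the column names whose fraction of None values exceeds threshold."""
    if not rows:
        return []
    total = len(rows)
    return [
        name
        for name in rows[0].keys()
        if sum(1 for row in rows if row.get(name) is None) / total > threshold
    ]


sample = [
    {"id": 1, "company": "A", "fax_number": None, "notes": None},
    {"id": 2, "company": "B", "fax_number": None, "notes": "call"},
    {"id": 3, "company": "C", "fax_number": "555", "notes": None},
    {"id": 4, "company": "D", "fax_number": None, "notes": "vip"},
]

to_drop = columns_over_null_threshold(sample, 0.5)
print(to_drop)  # ['fax_number']
```

`fax_number` is 75% NULL and is dropped, while `notes` is exactly 50% NULL and is kept because the comparison is "greater than", not "greater than or equal to".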

##### 2.2.2. Make Necessary Transformations to the New Dataframe

In [None]:
# ----------------------------------------------------------------------------------
# Rename the 'id' column to 'customer_id' ------------------------------------------
# ----------------------------------------------------------------------------------


# ----------------------------------------------------------------------------------
# Add Primary Key column using the SQL Windowing function: ROW_NUMBER() 
# ----------------------------------------------------------------------------------


# ----------------------------------------------------------------------------------
# Reorder Columns and display the first two rows in a Pandas dataframe
# ----------------------------------------------------------------------------------


##### 2.2.3. Save as the <span style="color:darkred">dim_customers</span> table in the Data Lakehouse

##### 2.2.4. Unit Test: Describe and Preview Table

#### 2.3. Populate the <span style="color:darkred">Suppliers Dimension</span>
##### 2.3.1. Fetch Data from the New MongoDB <span style="color:darkred">Suppliers</span> Collection

##### 2.3.2. Make Necessary Transformations to the New Dataframe

In [None]:
# ----------------------------------------------------------------------------------
# Rename the 'id' column to 'supplier_id' ------------------------------------------
# ----------------------------------------------------------------------------------


# ----------------------------------------------------------------------------------
# Add Primary Key column using SQL Windowing function: ROW_NUMBER() 
# ----------------------------------------------------------------------------------


# ----------------------------------------------------------------------------------
# Reorder Columns and display the first two rows in a Pandas dataframe
# ----------------------------------------------------------------------------------


##### 2.3.3. Save as the <span style="color:darkred">dim_suppliers</span> table in the Data Lakehouse

##### 2.3.4. Unit Test: Describe and Preview Table

#### 2.4. Populate the <span style="color:darkred">Invoices Dimension</span>
##### 2.4.1. Fetch Data from the New MongoDB <span style="color:darkred">Invoices</span> Collection

##### 2.4.2. Make Necessary Transformations to the New Dataframe

In [None]:
# ----------------------------------------------------------------------------------
# Rename the 'id' column to 'invoice_id' ------------------------------------------
# ----------------------------------------------------------------------------------


# ----------------------------------------------------------------------------------
# Add Primary Key column using SQL Windowing function: ROW_NUMBER() 
# ----------------------------------------------------------------------------------


# ----------------------------------------------------------------------------------
# Reorder Columns and display the first two rows in a Pandas dataframe
# ----------------------------------------------------------------------------------


##### 2.4.3. Save as the <span style="color:darkred">dim_invoices</span> table in the Data Lakehouse

##### 2.4.4. Unit Test: Describe and Preview Table

### 3.0. Fetch Reference Data from a MySQL Database
#### 3.1. Populate the <span style="color:darkred">Date Dimension</span>
##### 3.1.1. Fetch data from the <span style="color:darkred">dim_date</span> table in MySQL

In [None]:
sql_dim_date = f"SELECT * FROM {mysql_args['db_name']}.dim_date"
df_dim_date = get_mysql_dataframe(spark, sql_dim_date, **mysql_args)

##### 3.1.2. Save as the <span style="color:darkred">dim_date</span> table in the Data Lakehouse

In [None]:
df_dim_date.write.saveAsTable(f"{dest_database}.dim_date", mode="overwrite")

##### 3.1.3. Unit Test: Describe and Preview Table

In [None]:
spark.sql(f"DESCRIBE EXTENDED {dest_database}.dim_date;").show()
spark.sql(f"SELECT * FROM {dest_database}.dim_date LIMIT 2").toPandas()

#### 3.2. Populate the <span style="color:darkred">Product Dimension</span>
##### 3.2.1. Fetch data from the <span style="color:darkred">Products</span> table in MySQL

In [None]:
# ----------------------------------------------------------------------------------
# Add Primary Key column using the SQL Windowing function: ROW_NUMBER() 
# ----------------------------------------------------------------------------------


##### 3.2.2. Perform any Necessary Transformations

In [None]:
# ----------------------------------------------------------------------------------
# Rename the 'id' column to 'product_id' 
# ----------------------------------------------------------------------------------
# Using the monotonically_increasing_id() function has some limitations: starts with zero (0), and is not sequential.
    # df_dim_products = df_dim_products.withColumn("product_key", monotonically_increasing_id())
df_dim_products = df_dim_products.withColumnRenamed("id", "product_id")


# ----------------------------------------------------------------------------------
# Drop unwanted columns (description and attachments)
# ----------------------------------------------------------------------------------


# ----------------------------------------------------------------------------------
# Reorder Columns and display the first two rows in a Pandas dataframe
# ----------------------------------------------------------------------------------


##### 3.2.3. Save as the <span style="color:darkred">dim_products</span> table in the Data Lakehouse

##### 3.2.4. Unit Test: Describe and Preview Table

### 4.0. Verify Dimension Tables

In [None]:
spark.sql(f"USE {dest_database};")
spark.sql("SHOW TABLES").toPandas()

## Section III: Integrate Reference Data with Real-Time Data
### 6.0. Use PySpark Structured Streaming to Process (Hot Path) <span style="color:darkred">Orders</span> Fact Data  
#### 6.1. Verify the location of the source data files on the file system

In [None]:
get_file_info(orders_stream_dir)

#### 6.2. Create the Bronze Layer: Stage <span style="color:darkred">Orders Fact table</span> Data
##### 6.2.1. Read "Raw" JSON file data into a Stream

In [None]:
df_orders_bronze = (
    spark.readStream \
    .option("schemaLocation", orders_output_bronze) \
    .option("maxFilesPerTrigger", 1) \
    .option("multiLine", "true") \
    .json(orders_stream_dir)
)

df_orders_bronze.isStreaming

##### 6.2.2. Write the Streaming Data to a Parquet file

In [None]:
orders_checkpoint_bronze = os.path.join(orders_output_bronze, '_checkpoint')

orders_bronze_query = (
    df_orders_bronze
    # Add Current Timestamp and Input Filename columns for Traceability
    .withColumn("receipt_time", current_timestamp())
    .withColumn("source_file", input_file_name())
    
    .writeStream \
    .format("parquet") \
    .outputMode("append") \
    .queryName("orders_bronze")
    .trigger(availableNow = True) \
    .option("checkpointLocation", orders_checkpoint_bronze) \
    .option("compression", "snappy") \
    .start(orders_output_bronze)
)
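
The `checkpointLocation` option is what allows a restarted query to resume where it left off instead of re-ingesting files it has already written to the Bronze layer. The toy model below illustrates only that idea; it is not how Spark implements checkpoints. The function `run_available_now`, the JSON log format, and the file names are all hypothetical.

```python
import json
import os
import shutil
import tempfile


def run_available_now(source_dir, checkpoint_file):
    """Process files not yet recorded in the checkpoint; return the newly seen ones."""
    done = set()
    if os.path.exists(checkpoint_file):
        with open(checkpoint_file) as fh:
            done = set(json.load(fh))
    new_files = sorted(f for f in os.listdir(source_dir) if f not in done)
    # A real engine would append each new file's rows to the sink at this point.
    with open(checkpoint_file, "w") as fh:
        json.dump(sorted(done | set(new_files)), fh)
    return new_files


work = tempfile.mkdtemp()
src = os.path.join(work, "orders")
os.makedirs(src)
ckpt = os.path.join(work, "checkpoint.json")

for name in ("orders_01.json", "orders_02.json"):
    open(os.path.join(src, name), "w").close()

first_run = run_available_now(src, ckpt)    # both files are new
second_run = run_available_now(src, ckpt)   # nothing new has arrived
open(os.path.join(src, "orders_03.json"), "w").close()
third_run = run_available_now(src, ckpt)    # only the late-arriving file
print(first_run, second_run, third_run)

shutil.rmtree(work)  # clean up the temporary working directory
```

This is also why the lab deletes the lakehouse directory (checkpoints included) before each full run: with an old checkpoint in place, an `availableNow` trigger would find no new files to process.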

##### 6.2.3. Unit Test: Implement Query Monitoring

In [None]:
print(f"Query ID: {orders_bronze_query.id}")
print(f"Query Name: {orders_bronze_query.name}")
print(f"Query Status: {orders_bronze_query.status}")

In [None]:
orders_bronze_query.awaitTermination()

#### 6.3. Create the Silver Layer: Integrate "Cold-path" Data & Make Transformations
##### 6.3.1. Prepare Role-Playing Dimension Primary and Business Keys

In [None]:
df_dim_order_date = df_dim_date.select(col("date_key").alias("order_date_key"), col("full_date").alias("order_full_date"))
df_dim_paid_date = df_dim_date.select(col("date_key").alias("paid_date_key"), col("full_date").alias("paid_full_date"))
df_dim_shipped_date = df_dim_date.select(col("date_key").alias("shipped_date_key"), col("full_date").alias("shipped_full_date"))
df_dim_shippers = df_dim_shippers.withColumnRenamed("shipper_id", "shipper_no")
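
A role-playing dimension is a single physical dimension (here, the date dimension) that is joined to the fact table several times under different names, once per date column. The pure-Python sketch below shows the idea with dictionaries instead of DataFrames; the `YYYYMMDD` key format, the sample rows, and the helper `role_playing_lookup` are assumptions made for illustration.

```python
# A tiny stand-in for the date dimension (the key format is assumed).
dim_date = [
    {"date_key": 20060115, "full_date": "2006-01-15"},
    {"date_key": 20060122, "full_date": "2006-01-22"},
]


def role_playing_lookup(rows, role):
    """Return the role-specific key column name and a {full_date: date_key} map."""
    return f"{role}_date_key", {row["full_date"]: row["date_key"] for row in rows}


order = {"order_id": 30, "order_date": "2006-01-15",
         "shipped_date": "2006-01-22", "paid_date": None}

fact_row = {"order_id": order["order_id"]}
for role in ("order", "shipped", "paid"):
    key_name, lookup = role_playing_lookup(dim_date, role)
    # A missing or unmatched date yields None, as a left outer join would.
    fact_row[key_name] = lookup.get(order[f"{role}_date"])

print(fact_row)
# {'order_id': 30, 'order_date_key': 20060115, 'shipped_date_key': 20060122, 'paid_date_key': None}
```

Renaming the key and date columns per role, as the cell above does with `alias()`, keeps the three joins from producing ambiguous column names.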

##### 6.3.2. Define Silver Query to Join Streaming with Batch Data

In [None]:
df_orders_silver = spark.readStream.format("parquet").load(orders_output_bronze) \
    .join(df_dim_customers, "customer_id") \
    .join(df_dim_employees, "employee_id") \
    .join(df_dim_products, "product_id") \
    .join(df_dim_shippers, df_dim_shippers.shipper_no == col("shipper_id").cast(IntegerType()), "left_outer") \
    .join(df_dim_order_date, df_dim_order_date.order_full_date.cast(DateType()) == col("order_date").cast(DateType()), "inner") \
    .join(df_dim_shipped_date, df_dim_shipped_date.shipped_full_date.cast(DateType()) == col("shipped_date").cast(DateType()), "left_outer") \
    .join(df_dim_paid_date, df_dim_paid_date.paid_full_date.cast(DateType()) == col("paid_date").cast(DateType()), "left_outer") \
    .select(col("order_id").cast(LongType()), \
            col("order_detail_id").cast(LongType()), \
            df_dim_customers.customer_key.cast(LongType()), \
            df_dim_employees.employee_key.cast(LongType()), \
            df_dim_products.product_key.cast(LongType()), \
            df_dim_shippers.shipper_key.cast(IntegerType()), \
            df_dim_order_date.order_date_key.cast(LongType()), \
            df_dim_paid_date.paid_date_key.cast(LongType()), \
            df_dim_shipped_date.shipped_date_key.cast(LongType()), \
            col("quantity"), \
            col("unit_price"), \
            col("discount"), \
            col("shipping_fee"), \
            col("taxes"), \
            col("tax_rate"), \
            col("payment_type"), \
            col("order_status"), \
            col("order_details_status") \
           )

In [None]:
df_orders_silver.isStreaming

In [None]:
df_orders_silver.printSchema()

##### 6.3.3. Write the Transformed Streaming data to the Data Lakehouse

In [None]:
orders_checkpoint_silver = os.path.join(orders_output_silver, '_checkpoint')

orders_silver_query = (
    df_orders_silver.writeStream \
    .format("parquet") \
    .outputMode("append") \
    .queryName("orders_silver")
    .trigger(availableNow = True) \
    .option("checkpointLocation", orders_checkpoint_silver) \
    .option("compression", "snappy") \
    .start(orders_output_silver)
)

##### 6.3.4. Unit Test: Implement Query Monitoring

In [None]:
print(f"Query ID: {orders_silver_query.id}")
print(f"Query Name: {orders_silver_query.name}")
print(f"Query Status: {orders_silver_query.status}")

In [None]:
orders_silver_query.awaitTermination()

#### 6.4. Create Gold Layer: Perform Aggregations
##### 6.4.1. Define a Query to Create a Business Report
Create a new Gold table using the PySpark API. The table should include the number of Products sold per Category each Month. The results should include the Month, the Product Category, and the Number of Products sold, sorted by the month number in which the orders were placed (e.g., January, February, March).

In [None]:
df_orders_by_product_category_gold = spark.readStream.format("parquet").load(orders_output_silver) \
.join(df_dim_products, "product_key") \
.join(df_dim_date, df_dim_date.date_key.cast(IntegerType()) == col("order_date_key").cast(IntegerType())) \
.groupBy("month_of_year", "category", "month_name") \
.agg(count("product_key").alias("product_count")) \
.orderBy(asc("month_of_year"), desc("product_count"))

In [None]:
df_orders_by_product_category_gold.printSchema()

##### 6.4.2. Write the Streaming data to Memory in "Complete" mode

In [None]:
orders_gold_query = (
    df_orders_by_product_category_gold.writeStream \
    .format("memory") \
    .outputMode("complete") \
    .queryName("fact_orders_by_product_category")
    .start()
)

In [None]:
wait_until_stream_is_ready(orders_gold_query, 1)
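
The `wait_until_stream_is_ready()` helper polls `query.recentProgress` until the requested number of batches has been reported, which means it would loop indefinitely if the query failed before producing one. A possible variant with a timeout is sketched below; `wait_for_batches`, its parameters, and the two stand-in query classes are hypothetical and exist only so the loop can be exercised without Spark.

```python
import time


def wait_for_batches(query, min_batches=1, timeout_s=60.0, poll_s=0.5):
    """Poll query.recentProgress until min_batches are reported or time runs out."""
    deadline = time.monotonic() + timeout_s
    while len(query.recentProgress) < min_batches:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True


class FakeQuery:
    """Hypothetical stand-in that reports one more batch each time it is polled."""

    def __init__(self):
        self._polls = 0

    @property
    def recentProgress(self):
        self._polls += 1
        return [{"batchId": i} for i in range(self._polls - 1)]


class StalledQuery:
    """Hypothetical stand-in for a query that never completes a batch."""
    recentProgress = []


ready = wait_for_batches(FakeQuery(), min_batches=2, timeout_s=5.0, poll_s=0.01)
stalled = wait_for_batches(StalledQuery(), min_batches=1, timeout_s=0.05, poll_s=0.01)
print(ready, stalled)  # True False
```

Returning a boolean lets the calling cell decide whether to query the in-memory table or to inspect the query's status first.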

##### 6.4.3. Query the Gold Data from Memory

In [None]:
df_fact_orders_by_product_category = spark.sql("SELECT * FROM fact_orders_by_product_category")
df_fact_orders_by_product_category.printSchema()

##### 6.4.4. Create the Final Selection

In [None]:
df_fact_orders_by_product_category_gold_final = df_fact_orders_by_product_category \
.select(col("month_name").alias("Month"), \
        col("category").alias("Product Category"), \
        col("product_count").alias("Product Count")) \
.orderBy(asc("month_of_year"), desc("Product Count"))

##### 6.4.5. Load the Final Results into a New Table and Display the Results

In [None]:
df_fact_orders_by_product_category_gold_final.write.saveAsTable(f"{dest_database}.fact_orders_by_product_category", mode="overwrite")
spark.sql(f"SELECT * FROM {dest_database}.fact_orders_by_product_category").toPandas()

### 7.0. Use PySpark Structured Streaming to Process (Hot Path) <span style="color:darkred">Inventory Transactions</span> Fact Data
#### 7.1. Verify the location of the source data files on the file system

In [None]:
get_file_info(inventory_trans_stream_dir)

#### 7.2. Create the Bronze Layer: Stage <span style="color:darkred">Inventory Transactions Fact table</span> Data
##### 7.2.1. Read "Raw" JSON file data into a Stream

In [None]:
df_inventory_trans_bronze = (
    spark.readStream \
    #TODO: load data from 'inventory_trans_stream_dir'
)

df_inventory_trans_bronze.isStreaming

##### 7.2.2. Write the Streaming Data to a Parquet file

In [None]:
inventory_trans_checkpoint_bronze = os.path.join(inventory_trans_output_bronze, '_checkpoint')

inventory_trans_bronze_query = (
    df_inventory_trans_bronze
    # TODO: Add Current Timestamp and Input Filename columns for Traceability
    # TODO: writeStream to 'inventory_trans_output_bronze' in 'append' mode
)

##### 7.2.3. Unit Test: Implement Query Monitoring

In [None]:
print(f"Query ID: {inventory_trans_bronze_query.id}")
print(f"Query Name: {inventory_trans_bronze_query.name}")
print(f"Query Status: {inventory_trans_bronze_query.status}")

In [None]:
inventory_trans_bronze_query.awaitTermination()

#### 7.3. Create the Silver Layer: Integrate "Cold-path" Data & Make Transformations
##### 7.3.1. Prepare Role-Playing Dimension Primary and Business Keys

In [None]:
df_dim_created_date = None   # TODO: Copy df_dim_date and rename 'date_key' and 'full_date' columns.
df_dim_modified_date = None  # TODO: Copy df_dim_date and rename 'date_key' and 'full_date' columns.

##### 7.3.2. Define Silver Query to Join Streaming with Batch Data

In [None]:
df_inventory_trans_silver = spark.readStream.format("parquet").load(inventory_trans_output_bronze) \
    # .join to the dim_products dimension
    # .join to the dim_created_date dimension
    # .join to the dim_modified_date dimension
    # .select() the appropriate columns

In [None]:
df_inventory_trans_silver.isStreaming

In [None]:
df_inventory_trans_silver.printSchema()

##### 7.3.3. Write the Transformed Streaming data to the Data Lakehouse

In [None]:
inventory_trans_checkpoint_silver = os.path.join(inventory_trans_output_silver, '_checkpoint')

inventory_trans_silver_query = (
    df_inventory_trans_silver.writeStream \
    # TODO: writeStream, in 'parquet' format, to 'inventory_trans_output_silver' in 'append' mode
)

##### 7.3.4. Unit Test: Implement Query Monitoring

In [None]:
print(f"Query ID: {inventory_trans_silver_query.id}")
print(f"Query Name: {inventory_trans_silver_query.name}")
print(f"Query Status: {inventory_trans_silver_query.status}")

In [None]:
inventory_trans_silver_query.awaitTermination()

#### 7.4. Create Gold Layer: Perform Aggregations
##### 7.4.1. Define a Query to Create a Business Report
Create a new Gold table using the PySpark API. The table should include the total quantity of the inventory transactions placed per Product. Include the Inventory Transaction Type and the Product Name.

In [None]:
df_fact_inventory_trans_by_product_gold = spark.readStream.format("parquet").load(inventory_trans_output_silver) \
    # .join to the df_dim_products dimension
    # .join to the df_dim_date dimension on the 'created_date_key'
    # group by the 'calendar_quarter', 'transaction_type', and 'product_name' columns
    # sum the 'quantity' column to create the 'Total Quantity' column
    # order by the 'Total Quantity' column

##### 7.4.2. Write the Streaming data to Memory in "Complete" mode

In [None]:
inventory_trans_gold_query = (
    df_fact_inventory_trans_by_product_gold.writeStream \
    # TODO: write to the 'memory' sink in 'complete' mode, naming the query "fact_inventory_trans_by_product"
)
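
The `memory` sink keeps the query result in an in-memory table named after the query, which is what lets a later cell read it with `spark.sql`. `complete` output mode rewrites the entire aggregated result on every trigger, and sorting a streaming DataFrame is supported only after an aggregation in that mode. The sketch below shows the shape of such a writer chain. The helper name `start_memory_table` is hypothetical; the chained calls are standard `DataStreamWriter` methods.

```python
def start_memory_table(stream_df, table_name):
    """Start a complete-mode query whose results are queryable as `table_name`.

    The function name is an illustrative assumption. The query name given
    to the memory sink doubles as the name of the in-memory table.
    """
    return (
        stream_df.writeStream
        .format("memory")
        .queryName(table_name)
        .outputMode("complete")
        .start()
    )
```

Since the whole result is held in driver memory, this sink suits small aggregates such as the report built here rather than large tables.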

In [None]:
wait_until_stream_is_ready(inventory_trans_gold_query, 1)

##### 7.4.3. Query the Gold Data from Memory

In [None]:
df_fact_inventory_trans_by_product = spark.sql("SELECT * FROM fact_inventory_trans_by_product")
df_fact_inventory_trans_by_product.printSchema()

##### 7.4.4. Create the Final Selection

In [None]:
df_fact_inventory_trans_by_product_gold_final = df_fact_inventory_trans_by_product \
    # .select() the 'calendar_quarter' column as 'Quarter Created',
    # 'transaction_type' as 'Transaction', 'product_name' as 'Product', and 'Total Quantity'
    # ordered by 'Total Quantity'.
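
Report-friendly column names such as 'Quarter Created' contain spaces, so they must be quoted with backticks when written as Spark SQL expressions. The sketch below builds such expressions from a plain mapping. The helper name `alias_exprs` is hypothetical, and `selectExpr` is only one of several ways to apply the aliases (chained `.alias()` calls inside `.select()` work as well).

```python
def alias_exprs(renames):
    """Build Spark SQL expressions of the form `source` AS `alias`.

    Backticks inside a name are escaped by doubling them. The function
    name is an illustrative assumption.
    """
    def quote(name):
        return "`" + name.replace("`", "``") + "`"

    return [f"{quote(src)} AS {quote(alias)}" for src, alias in renames.items()]


exprs = alias_exprs({
    "calendar_quarter": "Quarter Created",
    "transaction_type": "Transaction",
    "product_name": "Product",
    "Total Quantity": "Total Quantity",
})

# Hypothetical usage inside the notebook:
# df_fact_inventory_trans_by_product.selectExpr(*exprs)
```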

##### 7.4.5. Load the Final Results into a New Table and Display the Results

In [None]:
df_fact_inventory_trans_by_product_gold_final.write.saveAsTable(f"{dest_database}.fact_inventory_trans_by_product", mode="overwrite")
spark.sql(f"SELECT * FROM {dest_database}.fact_inventory_trans_by_product").toPandas()

### 8.0. Use PySpark Structured Streaming to Process (Hot Path) <span style="color:darkred">Purchase Orders</span> Fact Data
#### 8.1. Verify the location of the source data files on the file system

In [None]:
get_file_info(purchase_orders_stream_dir)

#### 8.2. Create the Bronze Layer: Stage <span style="color:darkred">Purchase Orders Fact table</span> Data
##### 8.2.1. Read "Raw" JSON file data into a Stream

In [None]:
df_purchase_orders_bronze = (
    spark.readStream \
    # TODO: load data from 'purchase_orders_stream_dir'
)

df_purchase_orders_bronze.isStreaming

##### 8.2.2. Write the Streaming Data to a Parquet file

In [None]:
purchase_orders_checkpoint_bronze = os.path.join(purchase_orders_output_bronze, '_checkpoint')

purchase_orders_bronze_query = (
    df_purchase_orders_bronze
    # TODO: Add Current Timestamp and Input Filename columns for Traceability
    # TODO: writeStream to 'purchase_orders_output_bronze' in 'append' mode

)

##### 8.2.3. Unit Test: Implement Query Monitoring

In [None]:
print(f"Query ID: {purchase_orders_bronze_query.id}")
print(f"Query Name: {purchase_orders_bronze_query.name}")
print(f"Query Status: {purchase_orders_bronze_query.status}")

In [None]:
purchase_orders_bronze_query.awaitTermination()

#### 8.3. Create the Silver Layer: Integrate "Cold-path" Data & Make Transformations
##### 8.3.1. Prepare Role-Playing Dimension Primary and Business Keys

In [None]:
df_dim_created_by = #TODO: Copy 'df_dim_employees' and rename 'employee_key' and 'employee_id' columns.
df_dim_approved_by = #TODO: Copy 'df_dim_employees' and rename 'employee_key' and 'employee_id' columns.
df_dim_submitted_by = #TODO: Copy 'df_dim_employees' and rename 'employee_key' and 'employee_id' columns.

df_dim_submitted_date = #TODO: Copy df_dim_date and rename 'date_key' and 'full_date' columns.
df_dim_creation_date = #TODO: Copy df_dim_date and rename 'date_key' and 'full_date' columns.
df_dim_approved_date = #TODO: Copy df_dim_date and rename 'date_key' and 'full_date' columns.
df_dim_date_received = #TODO: Copy df_dim_date and rename 'date_key' and 'full_date' columns.

##### 8.3.2. Define Silver Query to Join Streaming with Batch Data

In [None]:
df_purchase_orders_silver = spark.readStream.format("parquet").load(purchase_orders_output_bronze) \
    # .join 'inner' to the df_dim_products dimension
    # .join 'inner' to the df_dim_suppliers dimension
    # .join 'left_outer' to the df_dim_created_by dimension
    # .join 'left_outer' to the df_dim_approved_by dimension
    # .join 'left_outer' to the df_dim_submitted_by dimension
    # .join 'inner' to the df_dim_submitted_date dimension
    # .join 'inner' to the df_dim_creation_date dimension
    # .join 'left_outer' to the df_dim_approved_date dimension
    # .join 'left_outer' to the df_dim_date_received dimension
    # .select() the appropriate columns from the 'purchase orders bronze' stream
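
The join types listed above matter because an inner join drops any purchase order whose key has no match in the dimension, while a left outer join keeps the order and fills the dimension columns with nulls. A purchase order that has not been approved yet, for instance, would presumably have no approver to match. The plain-Python sketch below illustrates the difference on two invented rows. The data and the `join` helper are hypothetical and only mimic what Spark does on DataFrames.

```python
def join(left_rows, right_rows, key, how="inner"):
    """Join two lists of dicts on `key`, mimicking 'inner' and 'left_outer'."""
    index = {row[key]: row for row in right_rows}
    right_cols = {col for row in right_rows for col in row} - {key}
    joined = []
    for row in left_rows:
        # As in SQL, a null (None) key never matches anything.
        match = index.get(row[key]) if row[key] is not None else None
        if match is not None:
            joined.append({**row, **match})
        elif how == "left_outer":
            joined.append({**row, **{col: None for col in right_cols}})
    return joined


orders = [
    {"po_id": 1, "approved_by": 7},
    {"po_id": 2, "approved_by": None},  # not approved yet
]
approvers = [{"approved_by": 7, "approver_name": "Ada"}]

inner = join(orders, approvers, "approved_by", how="inner")
left = join(orders, approvers, "approved_by", how="left_outer")
```

Here `inner` keeps only purchase order 1, while `left` keeps both and leaves `approver_name` empty for order 2. In Spark, a left outer join between a stream and a static DataFrame is supported when the streaming DataFrame is on the left side, which matches the layout of this query.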


In [None]:
df_purchase_orders_silver.isStreaming

In [None]:
df_purchase_orders_silver.printSchema()

##### 8.3.3. Write the Transformed Streaming data to the Data Lakehouse

In [None]:
purchase_orders_checkpoint_silver = os.path.join(purchase_orders_output_silver, '_checkpoint')

purchase_orders_silver_query = (
    df_purchase_orders_silver.writeStream \
    # TODO: writeStream, in 'parquet' format, to 'purchase_orders_output_silver' in 'append' mode
)

##### 8.3.4. Unit Test: Implement Query Monitoring

In [None]:
print(f"Query ID: {purchase_orders_silver_query.id}")
print(f"Query Name: {purchase_orders_silver_query.name}")
print(f"Query Status: {purchase_orders_silver_query.status}")

In [None]:
purchase_orders_silver_query.awaitTermination()

#### 8.4. Create Gold Layer: Perform Aggregations
##### 8.4.1. Define a Query to Create a Business Report
Create a new Gold table using the PySpark API. The table should include the Suppliers' Company Name, the Product Name, the Total Quantity, Total Unit Cost, and Total List Price for all the purchase orders placed per Supplier for each Product.

In [None]:
df_fact_pos_products_per_supplier_gold = spark.readStream.format("parquet").load(purchase_orders_output_silver) \
    # .join to the 'df_dim_products' dimension
    # .join to the 'df_dim_suppliers' dimension
    # .groupBy 'company' and 'product_name'
    # sum 'po_detail_quantity' as 'Total Quantity'
    # sum 'po_detail_unit_cost' as 'Total Unit Cost'
    # sum 'list_price' as 'Total List Price'
    # orderBy 'Total Quantity' in descending order

##### 8.4.2. Write the Streaming data to Memory in "Complete" mode

In [None]:
purchase_orders_gold_query = (
    df_fact_pos_products_per_supplier_gold.writeStream \
    # TODO: write to the 'memory' sink in 'complete' mode, naming the query "fact_pos_products_per_supplier"
)

In [None]:
wait_until_stream_is_ready(purchase_orders_gold_query, 1)

##### 8.4.3. Query the Gold Data from Memory

In [None]:
df_fact_pos_products_per_supplier = spark.sql("SELECT * FROM fact_pos_products_per_supplier")
df_fact_pos_products_per_supplier.printSchema()

##### 8.4.4. Create the Final Selection

In [None]:
df_fact_pos_products_per_supplier_gold_final = df_fact_pos_products_per_supplier \
    # .select() the 'company' column as 'Supplier', the 'product_name' column as 'Product',
    # along with the 'Total Quantity', 'Total Unit Cost', and 'Total List Price' columns

##### 8.4.5. Load the Final Results into a New Table and Display the Results

In [None]:
df_fact_pos_products_per_supplier_gold_final.write.saveAsTable(f"{dest_database}.fact_pos_products_per_supplier", mode="overwrite")
spark.sql(f"SELECT * FROM {dest_database}.fact_pos_products_per_supplier").toPandas()

### 9.0. Stop the Spark Session

In [None]:
spark.stop()