<a href="https://colab.research.google.com/github/Mbaroudi/DELTA_LAKE_TIPS/blob/main/Delta_Lake_Tips_SCD7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction to Slowly Changing Dimension (SCD) Type 7
When dimensions are managed with PySpark and Delta Lake, most slowly changing dimension discussions stop at Types 1 through 6, so a **"Type 7 Dimension"** is less familiar. "Type 7" generally refers to a **hybrid approach** that combines Type 1 (overwriting old data) with Type 2 (tracking history through versioned rows): the full history is kept, and the current state is exposed alongside it. This allows querying both the current state of the data and its historical versions.

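In this notebook, we will implement a Type 7 SCD in PySpark by essentially building a **Type 2 SCD** history and marking the latest version of each record with an `is_current` flag, which makes the current state easy to filter out. The cells below use plain Spark DataFrames; they install only `pyspark` and do not configure Delta Lake.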
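
To make the two access paths concrete, here is a minimal plain-Python sketch (no Spark required) of what a Type 7 style dimension offers: the full version history plus a current view derived from it. The helper names `current_view` and `as_of` are hypothetical and not part of the notebook; the rows mirror the customer data used later.

```python
# Version history for one dimension: every change adds a row (Type 2 behaviour).
history = [
    {"customer_id": 1, "customer_name": "John Doe", "effective_date": "2020-01-15", "is_current": False},
    {"customer_id": 1, "customer_name": "Johnathan Doe", "effective_date": "2020-02-01", "is_current": True},
    {"customer_id": 2, "customer_name": "Jane Smith", "effective_date": "2020-01-20", "is_current": True},
]

def current_view(rows):
    """Return {customer_id: row} holding only the latest version (Type 1 style access)."""
    return {r["customer_id"]: r for r in rows if r["is_current"]}

def as_of(rows, customer_id, date):
    """Return the version of a customer in effect on an ISO date, or None (Type 2 style access)."""
    candidates = [r for r in rows
                  if r["customer_id"] == customer_id and r["effective_date"] <= date]
    # ISO-formatted date strings sort chronologically, so max() picks the latest version
    return max(candidates, key=lambda r: r["effective_date"], default=None)

print(current_view(history)[1]["customer_name"])         # Johnathan Doe
print(as_of(history, 1, "2020-01-31")["customer_name"])  # John Doe
```

The same history answers both questions: "what does this customer look like now?" and "what did it look like on a given date?".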

In [None]:
!pip install pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("SCD Type 7 with Temporal Aggregates") \
    .master("local[*]") \
    .getOrCreate()


Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=78645a1f225b8e9e780e160038e97f693a622b52ee9f9e53cb3e97f9131a71a4
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


### Preparing the Data
First, initialize your DataFrame with a set of customer data. This data includes several fields: customer ID, name, email, revenue, effective date, and an indicator of whether the record is current. This setup prepares us for implementing SCD operations.

In [None]:
from pyspark.sql import SparkSession, functions as F

# Initialize Spark Session again (redundant: getOrCreate() returns the session created above)
spark = SparkSession.builder \
    .appName("SCD Type 7 with Temporal Aggregates") \
    .master("local[*]") \
    .getOrCreate()

# Sample data with 'is_current' column added
data = [
    (1, "John Doe", "john.doe@email.com", 1000, "2020-01-15", True),
    (2, "Jane Smith", "jane.smith@email.com", 1500, "2020-01-20", True)
]
columns = ["customer_id", "customer_name", "email", "revenue", "effective_date", "is_current"]
df = spark.createDataFrame(data, schema=columns)

df.show()


+-----------+-------------+--------------------+-------+--------------+----------+
|customer_id|customer_name|               email|revenue|effective_date|is_current|
+-----------+-------------+--------------------+-------+--------------+----------+
|          1|     John Doe|  john.doe@email.com|   1000|    2020-01-15|      true|
|          2|   Jane Smith|jane.smith@email.com|   1500|    2020-01-20|      true|
+-----------+-------------+--------------------+-------+--------------+----------+



### Applying SCD Type 7
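The following code simulates the operations of SCD Type 7. The function `apply_scd_type_2` merges a batch of updates into the existing data: when a customer's name or revenue has changed, the existing row is flagged `is_current = False`, and every incoming row is appended with `is_current = True`. Filtering on `is_current` then gives the latest records.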

In [None]:
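def apply_scd_type_2(base_df, updates_df):
    """Expire changed current rows in base_df and append all rows of updates_df as current."""
    # A current base row counts as changed if its name or revenue differs from the update
    condition = (base_df["customer_id"] == updates_df["customer_id"]) & \
                (base_df["is_current"] == True) & \
                ((base_df["customer_name"] != updates_df["customer_name"]) |
                 (base_df["revenue"] != updates_df["revenue"]))

    # Flag every incoming row as current
    updates_df = updates_df.withColumn("is_current", F.lit(True))
    # Set existing records to not current if changes are detected
    updated_existing_df = base_df.join(updates_df, "customer_id", "inner") \
                                 .filter(condition) \
                                 .select(base_df["*"]) \
                                 .withColumn("is_current", F.lit(False))

    # Union all: untouched existing rows, expired rows, and incoming rows set to true.
    # The anti-join also matches on effective_date so that older, already expired
    # versions of a changed customer are not dropped from the history.
    # Note: incoming rows are appended even when nothing changed.
    final_df = base_df.join(updated_existing_df, ["customer_id", "effective_date"], "left_anti") \
                      .unionByName(updated_existing_df) \
                      .unionByName(updates_df)

    return final_df

# Incoming batch; the values match the output shown below
new_data = [
    (1, "Johnathan Doe", "john.doe@email.com", 1200, "2020-02-01"),
    (2, "Jane Smith", "jane.smith@email.com", 1500, "2020-02-01"),
    (3, "Mike Jones", "mike.jones@email.com", 500, "2020-02-01")
]
# is_current is added inside the function, so it is left out of the schema here
new_df = spark.createDataFrame(new_data, schema=columns[:-1])

historical_df = apply_scd_type_2(df, new_df)
historical_df.show()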


+-----------+-------------+--------------------+-------+--------------+----------+
|customer_id|customer_name|               email|revenue|effective_date|is_current|
+-----------+-------------+--------------------+-------+--------------+----------+
|          2|   Jane Smith|jane.smith@email.com|   1500|    2020-01-20|      true|
|          1|     John Doe|  john.doe@email.com|   1000|    2020-01-15|     false|
|          1|Johnathan Doe|  john.doe@email.com|   1200|    2020-02-01|      true|
|          2|   Jane Smith|jane.smith@email.com|   1500|    2020-02-01|      true|
|          3|   Mike Jones|mike.jones@email.com|    500|    2020-02-01|      true|
+-----------+-------------+--------------------+-------+--------------+----------+
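
In the output above, customer 2 appears twice with `is_current = true`, because every incoming row is appended even when nothing changed. If that is not wanted, unchanged rows can be dropped from the batch before appending. The sketch below shows that rule in plain Python (no Spark required). `merge_scd2` and `TRACKED` are hypothetical names, and the tracked columns are assumed to be the same two the notebook compares.

```python
TRACKED = ("customer_name", "revenue")  # assumption: the columns checked for changes

def merge_scd2(base, updates):
    """Expire changed current rows and append only new or changed updates."""
    current = {r["customer_id"]: r for r in base if r["is_current"]}
    result = [dict(r) for r in base]  # copy so the input list is left untouched
    for upd in updates:
        old = current.get(upd["customer_id"])
        if old is not None and all(old[c] == upd[c] for c in TRACKED):
            continue  # unchanged: keep the existing current row, skip the duplicate
        if old is not None:
            for r in result:  # changed: expire the previous current version
                if r["customer_id"] == upd["customer_id"] and r["is_current"]:
                    r["is_current"] = False
        result.append({**upd, "is_current": True})
    return result

base = [
    {"customer_id": 1, "customer_name": "John Doe", "revenue": 1000,
     "effective_date": "2020-01-15", "is_current": True},
    {"customer_id": 2, "customer_name": "Jane Smith", "revenue": 1500,
     "effective_date": "2020-01-20", "is_current": True},
]
updates = [
    {"customer_id": 1, "customer_name": "Johnathan Doe", "revenue": 1200, "effective_date": "2020-02-01"},
    {"customer_id": 2, "customer_name": "Jane Smith", "revenue": 1500, "effective_date": "2020-02-01"},
    {"customer_id": 3, "customer_name": "Mike Jones", "revenue": 500, "effective_date": "2020-02-01"},
]
merged = merge_scd2(base, updates)
print(len(merged))  # 4
```

With this rule each customer keeps exactly one current row, so later aggregates do not count an unchanged record twice.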



### Analyzing Temporal Aggregates
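Next, compute temporal aggregates of revenue: MTD (Month-to-Date), QTD (Quarter-to-Date), and YTD (Year-to-Date). The query below hardcodes February 2020 as the reporting period and groups by `customer_id` and `customer_name`, so each name version of a customer gets its own row. This demonstrates how to extract actionable insights from historical data.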

In [None]:
# Calculating temporal aggregates
from pyspark.sql import functions as F

historical_df = historical_df.withColumn("month", F.month("effective_date"))
historical_df = historical_df.withColumn("quarter", F.quarter("effective_date"))
historical_df = historical_df.withColumn("year", F.year("effective_date"))

historical_df.createOrReplaceTempView("historical_data")

# SQL query for MTD, QTD, YTD calculations
aggregates_query = """
SELECT customer_id, customer_name,
       SUM(CASE WHEN month = 2 AND year = 2020 THEN revenue ELSE 0 END) as MTD_Revenue,
       SUM(CASE WHEN quarter = (SELECT quarter FROM historical_data WHERE month = 2 AND year = 2020 LIMIT 1) AND year = 2020 THEN revenue ELSE 0 END) as QTD_Revenue,
       SUM(CASE WHEN year = 2020 THEN revenue ELSE 0 END) as YTD_Revenue
FROM historical_data
GROUP BY customer_id, customer_name
"""
aggregates_df = spark.sql(aggregates_query)
aggregates_df.show()


+-----------+-------------+-----------+-----------+-----------+
|customer_id|customer_name|MTD_Revenue|QTD_Revenue|YTD_Revenue|
+-----------+-------------+-----------+-----------+-----------+
|          2|   Jane Smith|       1500|       3000|       3000|
|          1|     John Doe|          0|       1000|       1000|
|          1|Johnathan Doe|       1200|       1200|       1200|
|          3|   Mike Jones|        500|        500|        500|
+-----------+-------------+-----------+-----------+-----------+
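
The query above fixes the reporting period at February 2020. As a sketch of how the same three windows can be derived from an arbitrary as-of date, here is a plain-Python version (no Spark required). The names `quarter` and `period_totals` and the `(effective_date, revenue)` tuple layout are assumptions made for illustration; the sample rows are the two customer 2 rows from the historical output.

```python
from datetime import date

def quarter(d):
    """Calendar quarter (1-4) of a date."""
    return (d.month - 1) // 3 + 1

def period_totals(rows, as_of):
    """Sum revenue for rows dated within the month, quarter and year of as_of, up to as_of."""
    totals = {"MTD": 0, "QTD": 0, "YTD": 0}
    for effective, revenue in rows:
        if effective > as_of or effective.year != as_of.year:
            continue  # outside the year-to-date window
        totals["YTD"] += revenue
        if quarter(effective) == quarter(as_of):
            totals["QTD"] += revenue
            if effective.month == as_of.month:
                totals["MTD"] += revenue
    return totals

rows = [(date(2020, 1, 20), 1500), (date(2020, 2, 1), 1500)]
print(period_totals(rows, date(2020, 2, 29)))
# {'MTD': 1500, 'QTD': 3000, 'YTD': 3000}
```

For an as-of date at the end of February 2020 this reproduces the Jane Smith row of the table. Unlike the SQL query, the sketch also excludes rows dated after the as-of date.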



### Writing Historical Data
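Finally, persist the results with `saveAsTable`. The cell below writes the aggregates in `aggregates_df` to a table named `fact_revenue_aggregates`. To preserve every version so that data changes can be audited or analyzed later, `historical_df` can be written to its own table in the same way.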

In [None]:
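# Store the aggregates in a fact table. The table lives in a Hive metastore only if the
# session was built with enableHiveSupport(); otherwise Spark uses its default local catalog.
aggregates_df.write.mode("overwrite").saveAsTable("fact_revenue_aggregates")
# To append data instead of overwriting, you could use mode("append")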

This notebook walks through the setup and application of a Type 7 SCD using PySpark in a Colab environment; the same steps can be adapted to Delta Lake tables. It is structured for easy replication and adaptation for similar analytics needs.