
# Enhanced SCD Type 7 with Delta Lake and Spark SQL

This notebook walks through an SCD Type 7 style workflow with Delta Lake and Spark SQL, using ACID merges to apply changes and time travel to read historical data.

In [None]:
# delta-spark 1.0.x is the release line that targets Spark 3.1.x
!pip install pyspark==3.1.2 delta-spark==1.0.0

from pyspark.sql import SparkSession
from delta import DeltaTable, configure_spark_with_delta_pip

builder = SparkSession.builder \
    .appName("Enhanced SCD Type 7 Example") \
    .master("local[*]") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

## Data Creation

Create an initial dataset and write it to a Delta table. This table will be used as the base for our SCD transformations.

In [None]:
data = [
    (1, "Alice", "alice@example.com", 1000, "2020-01-01"),
    (2, "Bob", "bob@example.com", 1500, "2020-01-01")
]
columns = ["customer_id", "name", "email", "revenue", "effective_date"]

df = spark.createDataFrame(data, schema=columns)
df.write.format("delta").mode("overwrite").save("/tmp/delta_scd7")

## Merge Changes Using Delta Table API

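Simulate incoming changes and merge them with the Delta Table API. Matched rows whose email or revenue changed are updated in place, and unmatched rows are inserted; the previous values remain reachable through earlier table versions.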

In [None]:
from pyspark.sql.functions import col, current_date

new_data = [
    (1, "Alice", "alice@example.com", 1100, "2020-02-01"),
    (2, "Bob", "bobby@example.com", 1500, "2020-02-01"),
    (3, "Charlie", "charlie@example.com", 500, "2020-02-01")
]
new_df = spark.createDataFrame(new_data, schema=columns)

deltaTable = DeltaTable.forPath(spark, "/tmp/delta_scd7")

deltaTable.alias("old").merge(
    new_df.alias("new"),
    "old.customer_id = new.customer_id"
).whenMatchedUpdate(
    condition="""
        old.email != new.email OR
        old.revenue != new.revenue
    """,
    set={
        "name": col("new.name"),
        "email": col("new.email"),
        "revenue": col("new.revenue"),
        # Stamps the load date; use col("new.effective_date") to keep the source date
        "effective_date": current_date()
    }
).whenNotMatchedInsertAll().execute()
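
To keep history as rows rather than only as table versions (the Type 2 half of an SCD Type 7 design), a changed customer's open row is closed out and a new row is appended. The following is a minimal, Spark-free sketch of that row-level logic in plain Python. The `end_date` and `is_current` fields and the `apply_scd2` helper are hypothetical names introduced for illustration; they are not part of the notebook's table.

```python
from copy import deepcopy

TRACKED = ("email", "revenue")  # columns whose change opens a new version


def apply_scd2(table, incoming, tracked=TRACKED):
    """Return a new row list in which changes are kept as extra rows.

    table: dicts with customer_id, name, email, revenue, effective_date,
           end_date (None while the row is open) and is_current.
    incoming: dicts with the first five of those fields.
    """
    result = deepcopy(table)  # leave the caller's rows untouched
    current = {r["customer_id"]: r for r in result if r["is_current"]}
    for new in incoming:
        old = current.get(new["customer_id"])
        if old is not None and all(old[c] == new[c] for c in tracked):
            continue  # nothing tracked changed; keep the open row
        if old is not None:
            # Close out the previous version instead of overwriting it.
            old["is_current"] = False
            old["end_date"] = new["effective_date"]
        # Append the new version (this also covers brand-new customers).
        new_row = {**new, "end_date": None, "is_current": True}
        result.append(new_row)
        current[new["customer_id"]] = new_row
    return result


fields = ["customer_id", "name", "email", "revenue", "effective_date"]
base = [
    dict(zip(fields, row), end_date=None, is_current=True)
    for row in [
        (1, "Alice", "alice@example.com", 1000, "2020-01-01"),
        (2, "Bob", "bob@example.com", 1500, "2020-01-01"),
    ]
]
changes = [
    dict(zip(fields, row))
    for row in [
        (1, "Alice", "alice@example.com", 1100, "2020-02-01"),
        (2, "Bob", "bobby@example.com", 1500, "2020-02-01"),
        (3, "Charlie", "charlie@example.com", 500, "2020-02-01"),
    ]
]

history = apply_scd2(base, changes)
for row in sorted(history, key=lambda r: (r["customer_id"], r["effective_date"])):
    print(row)
```

In Delta Lake the same effect is commonly achieved with a merge whose source stages each changed row twice, once to close the old version and once to insert the new one.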

## Querying Data with Spark SQL

Register the table as a temporary view and query it with Spark SQL, showing the latest row for each customer and a full listing ordered by date.

In [None]:
# Register the Delta table as a SQL view
spark.read.format("delta").load("/tmp/delta_scd7").createOrReplaceTempView("customer_data")

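# Current state: the latest row for each customer
spark.sql("""
SELECT c.* FROM customer_data c
JOIN (
    SELECT customer_id, MAX(effective_date) AS latest
    FROM customer_data GROUP BY customer_id
) m ON c.customer_id = m.customer_id AND c.effective_date = m.latest
""").show()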

# Full listing ordered by customer and effective date
# (one row per customer when the merge updates rows in place)
spark.sql("""
SELECT * FROM customer_data
ORDER BY customer_id, effective_date
""").show()

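Once a table holds several rows per customer, filtering on the global `MAX(effective_date)` returns only the rows that share the newest date, which can drop customers whose latest change is older. A per-customer maximum avoids that. The sketch below compares the two queries using Python's built-in `sqlite3` as a stand-in for Spark SQL, an assumption made so the example runs without a Spark session. The sample rows, including Charlie's earlier date, are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_data ("
    "customer_id INTEGER, name TEXT, email TEXT, "
    "revenue INTEGER, effective_date TEXT)"
)
conn.executemany(
    "INSERT INTO customer_data VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Alice", "alice@example.com", 1000, "2020-01-01"),
        (1, "Alice", "alice@example.com", 1100, "2020-02-01"),
        (2, "Bob", "bob@example.com", 1500, "2020-01-01"),
        (2, "Bob", "bobby@example.com", 1500, "2020-02-01"),
        (3, "Charlie", "charlie@example.com", 500, "2020-01-15"),
    ],
)

# Keeps only rows dated with the single newest date in the whole table.
GLOBAL_MAX = """
SELECT customer_id, revenue, effective_date FROM customer_data
WHERE effective_date = (SELECT MAX(effective_date) FROM customer_data)
ORDER BY customer_id
"""

# Keeps the newest row of every customer.
PER_CUSTOMER_MAX = """
SELECT c.customer_id, c.revenue, c.effective_date
FROM customer_data c
JOIN (
    SELECT customer_id, MAX(effective_date) AS latest
    FROM customer_data
    GROUP BY customer_id
) m ON c.customer_id = m.customer_id AND c.effective_date = m.latest
ORDER BY c.customer_id
"""

global_rows = conn.execute(GLOBAL_MAX).fetchall()
per_customer_rows = conn.execute(PER_CUSTOMER_MAX).fetchall()
print(global_rows)        # Charlie is missing here
print(per_customer_rows)  # one row per customer
```

The join form uses only standard SQL, so it should carry over to `spark.sql` unchanged.
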
## Time Travel Query

Delta Lake’s time travel feature allows querying previous snapshots of the dataset, useful for auditing and rollbacks.

In [None]:
# Query an older snapshot of the data
version_number = 0  # 0 is the first commit at this path; check the table history for others
spark.read.format("delta").option("versionAsOf", version_number).load("/tmp/delta_scd7").show()
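
Choosing `version_number` requires knowing which versions exist. The usual route is `DeltaTable.forPath(spark, path).history()`, which returns one row per commit. As a Spark-free alternative, the sketch below lists the commit files in the table's `_delta_log` directory, where each version is stored as a zero-padded 20-digit JSON file. `list_delta_versions` is a hypothetical helper, and a version appearing in the log does not guarantee that its data files are still present after a `VACUUM`.

```python
import re
from pathlib import Path

# Commit files look like 00000000000000000003.json (20 digits).
_COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def list_delta_versions(table_path):
    """Return the sorted version numbers found in <table_path>/_delta_log."""
    log_dir = Path(table_path) / "_delta_log"
    if not log_dir.is_dir():
        return []
    versions = []
    for entry in log_dir.iterdir():
        match = _COMMIT_FILE.match(entry.name)
        if match:  # skips checkpoints, checksum files, and _last_checkpoint
            versions.append(int(match.group(1)))
    return sorted(versions)


print(list_delta_versions("/tmp/delta_scd7"))
```

Any number in the returned list can be tried as the `versionAsOf` value in the cell above.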

## Best Practices and Additional Tips

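- **Optimization**: Use `OPTIMIZE` to compact small files and `ZORDER BY` to co-locate related data. In open-source Delta Lake these arrived in releases 1.2 and 2.0 respectively, so they need a newer Spark and Delta pairing than Spark 3.1.2.
- **Data Retention**: Set the `delta.logRetentionDuration` and `delta.deletedFileRetentionDuration` table properties and run `VACUUM` to remove unreferenced files. Time travel only reaches versions whose log entries and data files still exist.
- **Monitoring and Maintenance**: Review `DESCRIBE HISTORY` and `DESCRIBE DETAIL` output regularly to track operations, file counts, and table size.
- **Incremental Loading**: Merge only the new or changed records in each batch (for example, filtered by `effective_date`) instead of rewriting the table.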
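
The retention and monitoring tips map onto a handful of Delta SQL statements. The sketch below only builds the statement strings for the notebook's table path, so they can be reviewed before being passed to `spark.sql(...)`. The 30-day and 7-day intervals mirror Delta Lake's documented defaults and are placeholders, and `maintenance_statements` is a hypothetical helper.

```python
def maintenance_statements(table_path, log_days=30, deleted_file_days=7):
    """Build Delta SQL strings for retention, cleanup, and inspection."""
    target = f"delta.`{table_path}`"
    return [
        # How long commit history and removed data files are retained.
        f"ALTER TABLE {target} SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = 'interval {log_days} days', "
        f"'delta.deletedFileRetentionDuration' = 'interval {deleted_file_days} days')",
        # Delete unreferenced data files older than the retention window.
        f"VACUUM {target} RETAIN {deleted_file_days * 24} HOURS",
        # List recent operations on the table for monitoring.
        f"DESCRIBE HISTORY {target}",
    ]


for statement in maintenance_statements("/tmp/delta_scd7"):
    print(statement)
```

Shortening either interval reduces storage but also narrows how far back time travel queries can reach.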