# Slowly Changing Dimension Type 2 (SCD2) - User Table

## Overview
This notebook implements an SCD2 dimension table for users (`dimuser`). The goal is to track changes in user details over time.

## Steps
1. **Load existing user dimension table (`dimuser`)** from Delta Lake.
2. **Load latest changes** from the operational database (`velo_users`).
3. **Detect new and changed records** using an MD5 hash comparison.
4. **Insert new record versions** and close the previous versions, so the full history is kept.
5. **Use Delta Lake's `MERGE INTO`** for efficient updates.

## Key Fields
| Column     | Description |
|------------|------------|
| `user_sk`  | Unique identifier for each user record |
| `userid`   | Business key (same across historical records) |
| `street`, `city`, etc. | Address details |
| `md5`      | Hash of address details to detect changes |
| `scd_start` | Start of record validity (timestamp of the run that created the row) |
| `scd_end`   | End of record validity (`2100-12-12` while the record is still active) |
| `current`   | Indicates active record (TRUE) or historical (FALSE) |

## Example
If a user changes address, a new row is added while the old one is kept with an end date.
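To make this concrete, the plain-Python sketch below (no Spark required) models the two rows that would exist after one address change and looks up the version that was valid at a given moment. The user id, street names, dates and the helper `version_at` are made up for illustration, and treating the validity interval as half-open (`scd_start <= moment < scd_end`) is an assumption of this sketch. The far-future end date mirrors the `2100-12-12` placeholder used later in this notebook.

```python
from datetime import datetime

OPEN_END = datetime(2100, 12, 12)  # far-future placeholder for "still valid"

# Hypothetical history for one user after a single address change.
history = [
    {"userid": 1, "street": "Old Street", "scd_start": datetime(2023, 1, 1),
     "scd_end": datetime(2024, 6, 1), "current": False},
    {"userid": 1, "street": "New Street", "scd_start": datetime(2024, 6, 1),
     "scd_end": OPEN_END, "current": True},
]

def version_at(rows, userid, moment):
    """Return the row whose interval [scd_start, scd_end) contains moment, else None."""
    for row in rows:
        if row["userid"] == userid and row["scd_start"] <= moment < row["scd_end"]:
            return row
    return None

past = version_at(history, 1, datetime(2023, 7, 15))
now = version_at(history, 1, datetime(2024, 7, 15))
print(past["street"], "->", now["street"])  # Old Street -> New Street
```

Because the old row is closed rather than overwritten, a fact recorded in 2023 can still be joined to the address that was valid at that time.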


### 📌 Explanation
✅ **Import required libraries:**
- `pyspark.sql` → Provides the Spark DataFrame API.
- `pyspark.sql.functions` → Contains useful SQL functions.
- `ConnectionConfig (cc)` → Custom module to set up connections.

✅ **Set up the environment:**
- `cc.setupEnvironment()` → Configures the Spark environment.
- `cc.listEnvironment()` → Lists current environment settings.


In [None]:
import ConnectionConfig as cc
cc.setupEnvironment()
cc.listEnvironment()

### 📌 Explanation
✅ **Start a local Spark cluster:**
- `"dimUserChanges"` → Name of the cluster.
- `4` → Number of worker threads.

✅ **Get the active Spark session:**
- `getActiveSession()` → Returns the currently active session, confirming that it is running.

In [None]:
spark = cc.startLocalCluster("dimUserChanges",4)
spark.getActiveSession()

### 📌 Explanation
✅ **Capture the job execution timestamp:**
- `datetime.now()` → Retrieves the current timestamp.
- This timestamp is reused later as the `scd_start` of new versions and the `scd_end` of the versions they replace.
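As a small illustration of how the captured value ends up in the SQL statements further down, the sketch below interpolates a timestamp into an f-string. It uses a fixed, made-up datetime instead of `datetime.now()` so the result is reproducible; the variable `sql_fragment` is only for illustration.

```python
from datetime import datetime

# Fixed, made-up value standing in for datetime.now().
run_timestamp = datetime(2024, 6, 1, 9, 30, 0, 123456)

# An f-string interpolates str(run_timestamp); this text is what Spark SQL receives.
sql_fragment = f"to_timestamp('{run_timestamp}')"
print(sql_fragment)  # to_timestamp('2024-06-01 09:30:00.123456')
```

Capturing the value once means the closing `scd_end` of an old version and the `scd_start` of its successor are identical. Note that `datetime.now()` returns a naive local time, so how Spark interprets the string likely depends on the session time zone.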

In [None]:
from datetime import datetime
run_timestamp = datetime.now()  # The job run time, reused below for scd_start and scd_end
print(run_timestamp)

### 📌 Explanation
✅ **Load the existing dimension table:**
- `DeltaTable.forPath(...)` → Loads the `dimuser` table from Delta Lake.

✅ **Create a temporary SQL view:**
- `createOrReplaceTempView("dim_users_current")` → Allows querying via SQL.

✅ **Show existing records:**
- `spark.sql("SELECT * FROM dim_users_current").show()` → Displays data.


In [None]:
from delta import DeltaTable
current_user_table = DeltaTable.forPath(spark, "./spark-warehouse/dimuser")

In [None]:
current_user_table.toDF().createOrReplaceTempView("dim_users_current")

In [None]:
spark.sql("select * from dim_users_current").show()

### 📌 Explanation
✅ **Read new user data from the operational database:**
- Uses **JDBC** to connect to the database.

✅ **Create a temporary SQL view:**
- `createOrReplaceTempView("users_operational_db")` → Enables SQL queries on the latest user data.


In [None]:
### LOAD THE NEWEST CHANGES FROM THE OPERATIONAL DATABASE
df_users = spark.read \
    .format("jdbc") \
    .option("driver", cc.get_Property("driver")) \
    .option("url", cc.create_jdbc()) \
    .option("dbtable", "velo_users") \
    .option("user", cc.get_Property("username")) \
    .option("password", cc.get_Property("password")) \
    .load()

df_users.createOrReplaceTempView("users_operational_db")

### 📌 Explanation
✅ **Transform the operational data into a dimension format:**
- `uuid()` → Generates a **unique key** for each record.
- `md5(...)` → Creates a **hash of address details** to detect changes.

✅ **Create a SQL view for transformed data:**
- `createOrReplaceTempView("dim_users_new")` → Allows comparison with existing data.

✅ **Show transformed records:**
- `dim_users_new.show()` → Displays the new data.
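The hashing idea can be tried outside Spark with `hashlib`. The sketch below is an illustration, not the notebook's implementation: the helper `address_hash`, the `|` separator and the sample addresses are assumptions. It also shows a caveat of plain concatenation: different field values can produce the same concatenated string and therefore the same hash. In Spark SQL, `concat` additionally returns NULL when any argument is NULL, which this helper does not reproduce.

```python
import hashlib

def address_hash(fields, sep=""):
    """MD5 hex digest of the fields joined by sep.

    None is mapped to '' here; Spark's concat would return NULL instead.
    """
    joined = sep.join("" if f is None else str(f) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

old = ("Main Street", "12", "2000", "Antwerp", "BE")
new = ("Main Street", "14", "2000", "Antwerp", "BE")

# A changed house number yields a different hash, so the row counts as modified.
changed = address_hash(old) != address_hash(new)
print(changed)  # True

# Without a separator, a character moving between fields goes unnoticed.
a = ("Main Street", "1", "22000", "Antwerp", "BE")
b = ("Main Street", "12", "2000", "Antwerp", "BE")
collides = address_hash(a) == address_hash(b)
collides_with_sep = address_hash(a, "|") == address_hash(b, "|")
print(collides, collides_with_sep)  # True False
```

If this matters for the data at hand, `concat_ws` with a separator is one possible alternative in Spark SQL. Switching would change every stored hash, so all existing rows would be flagged as modified on the next run.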


In [None]:
### TRANSFORM THE SOURCE TABLE TO THE DIMENSION FORMAT
### NOTE: only `street` is carried over as an address attribute for now;
### the MD5 hash covers street, number, zipcode, city and country_code.
dim_users_new = spark.sql("""
    select uuid() as source_user_sk,
           userid as source_userid,
           name   as source_name,
           street as source_street,
           md5(concat(street, number, zipcode, city, country_code)) as source_md5
    from users_operational_db
""")
dim_users_new.createOrReplaceTempView("dim_users_new")

In [None]:
dim_users_new.show()

### 📌 Explanation
✅ **Detect changes:**
- `LEFT OUTER JOIN` → Compares new records (`dim_users_new`) with the active existing ones (`dim_users_current`).
- `WHERE dwh.userid IS NULL` → Identifies **new users**.
- `OR dwh.md5 <> source.source_md5` → Identifies **modified users**.
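The join and filter described above can be mimicked in plain Python to see which rows they select. This is a sketch under stated assumptions: the dictionaries, the placeholder hash values and the helper `detect_changes` are made up and stand in for the two temporary views.

```python
# Made-up data: keys are userid, values stand in for the MD5 of the address.
current_dim = {1: "aaa", 2: "bbb"}        # active rows in the dimension
source = {1: "aaa", 2: "ccc", 3: "ddd"}   # latest operational snapshot

def detect_changes(source_hashes, current_hashes):
    """Return (new_ids, modified_ids), mirroring the LEFT OUTER JOIN plus WHERE filter."""
    new_ids, modified_ids = [], []
    for userid, md5 in source_hashes.items():
        if userid not in current_hashes:       # dwh.userid IS NULL
            new_ids.append(userid)
        elif current_hashes[userid] != md5:    # dwh.md5 <> source.source_md5
            modified_ids.append(userid)
    return new_ids, modified_ids

new_ids, modified_ids = detect_changes(source, current_dim)
print(new_ids, modified_ids)  # [3] [2]
```

User 1 is unchanged and is filtered out. Because the join starts from the source side, users that disappear from the operational table are not flagged by this query.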


In [None]:
detectedChanges = spark.sql("""
    select *
    from dim_users_new as source
    left outer join dim_users_current as dwh
        on dwh.userid = source.source_userid and dwh.current = true
    where dwh.userid is null or dwh.md5 <> source.source_md5
""")

detectedChanges.createOrReplaceTempView("detectedChanges")

In [None]:
#DEBUG CODE TO SHOW CONTENT OF DETECTED CHANGES
detectedChanges.show()

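### 📌 Explanation
✅ **Prepare data for update and insert:**
- One row per detected change holds the **new version** to insert (`current = TRUE`, open-ended `scd_end`).
- One extra row per **modified user** closes the existing version (sets `scd_end` to the run timestamp and `current = FALSE`).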
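The shape of the resulting upsert set can be sketched in plain Python. The input rows, the nested `dwh` key and the helper `build_upserts` are assumptions made for illustration; in the notebook the same information comes from the flat `detectedChanges` view.

```python
from datetime import datetime

run_timestamp = datetime(2024, 6, 1)   # made-up run time
OPEN_END = datetime(2100, 12, 12)      # open-ended validity placeholder

# Made-up detected changes: user 2 moved (has an active row), user 3 is new (has none).
detected = [
    {"source_userid": 2, "source_street": "New Street",
     "dwh": {"userid": 2, "street": "Old Street", "scd_start": datetime(2023, 1, 1)}},
    {"source_userid": 3, "source_street": "First Street", "dwh": None},
]

def build_upserts(changes, ts):
    rows = []
    for c in changes:
        # Every detected change yields a new active version.
        rows.append({"userid": c["source_userid"], "street": c["source_street"],
                     "scd_start": ts, "scd_end": OPEN_END, "current": True})
        # Only users that already had an active row also yield a closing row.
        if c["dwh"] is not None:
            rows.append({**c["dwh"], "scd_end": ts, "current": False})
    return rows

upserts = build_upserts(detected, run_timestamp)
print(len(upserts))  # 3
```

The modified user contributes two rows (one new version, one closing row) and the new user contributes one, which corresponds to the two halves of the `union` in the query.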


In [None]:
upserts = spark.sql(f"""
    -- New versions: one active row per detected change
    select source_user_sk as user_sk,
           source_userid as userid,
           source_name as name,
           source_street as street,
           to_timestamp('{run_timestamp}') as scd_start,
           to_timestamp('2100-12-12', 'yyyy-MM-dd') as scd_end,
           source_md5 as md5,
           true as current
    from detectedChanges
    union
    -- Closing rows: end the previously active version of modified users
    select userSK,
           userid,
           name,
           street,
           scd_start,
           to_timestamp('{run_timestamp}') as scd_end,
           md5,
           false
    from detectedChanges
    where current is not null
""")

upserts.createOrReplaceTempView("upserts")

In [None]:
#DEBUG CODE TO SHOW CONTENT OF UPSERTS
spark.sql("select * from upserts").show()

### 📌 Explanation
✅ **Perform the SCD2 merge operation:**
- **UPDATE** existing records → Sets `scd_end` and `current = FALSE`.
- **INSERT** new records → Tracks changes as new entries.
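The matched / not-matched logic can be traced on a tiny in-memory example. This is a simplified sketch of the semantics, not of how Delta Lake executes a merge; the rows and the helper `merge` are made up for illustration.

```python
from datetime import datetime

OPEN_END = datetime(2100, 12, 12)
ts = datetime(2024, 6, 1)  # made-up run timestamp

# Made-up dimension with one active row, plus the upserts for an address change.
dimension = [{"userid": 2, "street": "Old Street", "scd_start": datetime(2023, 1, 1),
              "scd_end": OPEN_END, "current": True}]
upserts = [
    {"userid": 2, "street": "New Street", "scd_start": ts,
     "scd_end": OPEN_END, "current": True},
    {"userid": 2, "street": "Old Street", "scd_start": datetime(2023, 1, 1),
     "scd_end": ts, "current": False},
]

def merge(target, source):
    """Closing rows update the active target row; all other source rows are inserted."""
    inserts = []
    for s in source:
        match = None
        if not s["current"]:
            # ON target.userid = source.userid AND source.current = false AND target.current = true
            match = next((t for t in target
                          if t["userid"] == s["userid"] and t["current"]), None)
        if match is not None:
            match["scd_end"], match["current"] = s["scd_end"], s["current"]  # WHEN MATCHED
        else:
            inserts.append(dict(s))                                          # WHEN NOT MATCHED
    target.extend(inserts)  # inserted last, so new rows are never matched within the same run
    return target

merge(dimension, upserts)
for row in sorted(dimension, key=lambda r: r["scd_start"]):
    print(row["street"], row["current"])
# Old Street False
# New Street True
```

The condition `source.current = false` is what keeps the new version from matching the row it replaces: only the closing row can match, so the new version always falls through to the insert branch.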


In [None]:
spark.sql("""
    MERGE INTO dim_users_current AS target
    USING upserts AS source
        ON target.userid = source.userid
       AND source.current = false
       AND target.current = true
    WHEN MATCHED THEN
        UPDATE SET scd_end = source.scd_end, current = source.current
    WHEN NOT MATCHED THEN
        INSERT (userSK, userid, name, street, scd_start, scd_end, md5, current)
        VALUES (source.user_sk, source.userid, source.name, source.street,
                source.scd_start, source.scd_end, source.md5, source.current)
""")


### 📌 Explanation
✅ **Display the final dimension table sorted by `userid` and `scd_start`:**
- Shows **historical versions** and the **latest active record**.


In [None]:
current_user_table.toDF().sort("userid", "scd_start").show(100)

In [None]:
spark.stop()