# Slowly Changing Dimension Type 2 (SCD2) - User Table

## Overview
This notebook implements an SCD2 dimension table for users (`dimuser`). The goal is to track changes in user details over time.

## Steps
1. **Load existing user dimension table (`dimuser`)** from Delta Lake.
2. **Load latest changes** from the operational database (`velo_users`).
3. **Detect new and changed records** using an MD5 hash comparison.
4. **Insert new records** while keeping old versions with history.
5. **Use Delta Lake's `MERGE INTO`** for efficient updates.

## Key Fields
| Column | Description |
|--------|-------------|
| `userSK` | Surrogate key, unique for each version of a user record |
| `userid` | Business key (same across historical records) |
| `street`, `city`, etc. | Address details |
| `md5` | Hash of the address details, used to detect changes |
| `scd_start` | Start of record validity |
| `scd_end` | End of record validity (`2100-12-12` while the record is active) |
| `current` | `true` for the active record, `false` for historical ones |

## Example
If a user changes address, a new row is added while the old one is kept with an end date.
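
As a rough illustration of that behaviour, the sketch below applies one address change to an in-memory history using plain Python dictionaries. The helper name `apply_address_change` is hypothetical and not part of this notebook; the street values and the `2100-12-12` end date mirror the ones that appear in the cell outputs further down.

```python
from datetime import datetime

OPEN_END = datetime(2100, 12, 12)  # far-future end date used for active rows

def apply_address_change(history, userid, new_street, changed_at):
    """Close the active row for `userid` and append a new active row (hypothetical helper)."""
    for row in history:
        if row["userid"] == userid and row["current"]:
            row["scd_end"] = changed_at   # the old version stops being valid now
            row["current"] = False
    history.append({
        "userid": userid,
        "street": new_street,
        "scd_start": changed_at,          # the new version starts being valid now
        "scd_end": OPEN_END,
        "current": True,
    })
    return history

history = [{
    "userid": 3,
    "street": "Maria Clarastraat",
    "scd_start": datetime(1999, 1, 1),
    "scd_end": OPEN_END,
    "current": True,
}]
apply_address_change(history, 3, "WWWWWWWWWWW", datetime(2025, 3, 12, 14, 52, 37))

for row in history:
    print(row["userid"], row["street"], row["scd_start"], row["scd_end"], row["current"])
```

After the call, the history holds two rows for user 3: the closed historical row and the new active one.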


### 📌 Explanation
✅ **Import required libraries:**
- `pyspark.sql` → Provides the Spark DataFrame API.
- `pyspark.sql.functions` → Contains useful SQL functions.
- `ConnectionConfig (cc)` → Custom module to set up connections.

✅ **Set up the environment:**
- `cc.setupEnvironment()` → Configures the Spark environment.
- `cc.listEnvironment()` → Lists current environment settings.


In [58]:
import ConnectionConfig as cc
cc.setupEnvironment()
cc.listEnvironment()

HOMEBREW_PREFIX: /opt/homebrew
COMMAND_MODE: unix2003
INFOPATH: /opt/homebrew/share/info:
SHELL: /bin/zsh
PYTHONPATH: /Users/user/Desktop/data4_project_group5
__CFBundleIdentifier: com.jetbrains.pycharm
TMPDIR: /var/folders/k_/tkt88xx94n17f7_nvrzjrkwc0000gn/T/
LC_ALL: en_US.UTF-8
HOME: /Users/user
HOMEBREW_REPOSITORY: /opt/homebrew
PATH: /Users/user/Desktop/data4_project_group5/myenv/bin:/Users/user/Library/Java/JavaVirtualMachines/temurin-21.0.2/Contents/Home/bin:/opt/homebrew/opt/python@3.11/bin:/opt/homebrew/opt/python@3.11/bin:/opt/homebrew/bin:/opt/homebrew/opt/python@3.11/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/Users/user/Desktop/KDG/VMware Fusion.app/Contents/Public:/usr/local/go/bin

### 📌 Explanation
✅ **Start a local Spark cluster:**
- `"dimUserChanges"` → Name of the cluster.
- `4` → Number of worker threads.

✅ **Get the active Spark session:**
- `getActiveSession()` → Returns the currently active session, confirming that it is running.

In [59]:
spark = cc.startLocalCluster("dimUserChanges",4)
spark.getActiveSession()

### 📌 Explanation
✅ **Capture the job execution timestamp:**
- `datetime.now()` → Retrieves the current timestamp.
- This timestamp will be used to track when records were processed.
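
The timestamp is later interpolated into SQL text, so it helps to see what string it turns into. The sketch below is only an illustration: the variable `as_sql_literal` is a hypothetical name, and the check relies on the standard-library behaviour that `str()` of a `datetime` equals its ISO format with a space separator.

```python
from datetime import datetime

# Capture the run time once so every row written in this run shares the same value.
run_timestamp = datetime.now()

# str() on a datetime gives 'YYYY-MM-DD HH:MM:SS.ffffff' (the fraction is omitted
# when the microsecond part is zero); this is the text an f-string places in SQL.
as_sql_literal = f"to_timestamp('{run_timestamp}')"

print(as_sql_literal)
print(str(run_timestamp) == run_timestamp.isoformat(sep=" "))
```

Capturing the value once, rather than calling `datetime.now()` in several places, keeps `scd_end` of a closed row equal to `scd_start` of the row that replaces it.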

In [60]:
from datetime import datetime
run_timestamp = datetime.now()  # The job run time is stored in a variable
print(run_timestamp)

2025-03-12 14:52:37.854308


### 📌 Explanation
✅ **Load the existing dimension table:**
- `DeltaTable.forPath(...)` → Loads the `dimuser` table from Delta Lake.

✅ **Create a temporary SQL view:**
- `createOrReplaceTempView("dim_users_current")` → Allows querying via SQL.

✅ **Show existing records:**
- `spark.sql("SELECT * FROM dim_users_current").show()` → Displays data.


In [61]:
from delta import DeltaTable
current_user_table = DeltaTable.forPath(spark, "./spark-warehouse/dimuser")

In [62]:
current_user_table.toDF().createOrReplaceTempView("dim_users_current")

In [63]:
spark.sql("select * from dim_users_current").show()

                                                                                

+--------------------+------+--------------------+--------------------+--------------------+--------+-------+--------------------+------------+-------------------+-------------------+--------------------+-------+
|              userSK|userid|                name|               email|              street|  number|zipcode|                city|country_code|          scd_start|            scd_end|                 md5|current|
+--------------------+------+--------------------+--------------------+--------------------+--------+-------+--------------------+------------+-------------------+-------------------+--------------------+-------+
|29767795-15df-4c9...|    15|           Hoek Emma|Emma.Hoek@telenet.be|    John Kennedylaan|  9 0801|   2520|Broechem/Emblem/O...|          BE|1999-01-01 00:00:00|2100-12-12 00:00:00|6e3779642d13bf4ab...|   true|
|4c470fb3-cfa5-4ec...|    16|     Stevens Suzanne|Suzanne.Stevens@t...|              Wijk 2| 19 0802|   2531|              Vremde|          BE|1999-

### 📌 Explanation
✅ **Read new user data from the operational database:**
- Uses **JDBC** to connect to the database.

✅ **Create a temporary SQL view:**
- `createOrReplaceTempView("users_operational_db")` → Enables SQL queries on the latest user data.


In [64]:
### LOAD THE NEWEST CHANGES FROM THE OPERATIONAL DATABASE
df_users = spark.read \
    .format("jdbc") \
    .option("driver", cc.get_Property("driver")) \
    .option("url", cc.create_jdbc()) \
    .option("dbtable", "velo_users") \
    .option("user", cc.get_Property("username")) \
    .option("password", cc.get_Property("password")) \
    .load()

df_users.createOrReplaceTempView("users_operational_db")

### 📌 Explanation
✅ **Transform the operational data into a dimension format:**
- `uuid()` → Generates a **unique key** for each record.
- `md5(...)` → Creates a **hash of address details** to detect changes.

✅ **Create a SQL view for transformed data:**
- `createOrReplaceTempView("dim_users_new")` → Allows comparison with existing data.

✅ **Show transformed records:**
- `dim_users_new.show()` → Displays the new data.
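
To make the hashing idea concrete outside Spark, here is a minimal pure-Python sketch using `hashlib`. It is not the notebook's implementation: the function `address_hash`, the `|` separator, and the zipcode and city values are assumptions chosen for illustration. The separator and the `None` handling address two general pitfalls of hashing concatenated columns: adjacent fields can run together, and in Spark SQL `concat` returns NULL when any argument is NULL.

```python
import hashlib

def address_hash(street, number, zipcode, city, country_code, sep="|"):
    """MD5 over the address fields (hypothetical helper); None is treated as an empty string."""
    parts = [street, number, zipcode, city, country_code]
    joined = sep.join("" if p is None else str(p) for p in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Illustrative values only: zipcode and city are made up for this sketch.
old_hash = address_hash("Maria Clarastraat", "5656", "2000", "Antwerpen", "BE")
new_hash = address_hash("WWWWWWWWWWW", "5656", "2000", "Antwerpen", "BE")

print(old_hash != new_hash)  # a changed street produces a different hash

# With a separator, fields that would otherwise run together stay distinguishable.
print(address_hash("ab", "c", "", "", "") != address_hash("a", "bc", "", "", ""))
```

In Spark SQL, one way to get comparable behaviour would be `concat_ws` with a separator and `coalesce(col, '')` on each column; whether that is needed depends on whether the source columns can contain NULLs.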


### IMPORTANT: for now only `street` (plus `number` as `strNumber`) is carried into the new rows, although the hash covers all five address columns.


In [65]:
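# Build the candidate dimension rows from the operational view registered above.
# Note: concat() returns NULL if any argument is NULL, so such a row gets a NULL hash.
dim_users_new = spark.sql("""
    select uuid() as source_user_sk,
           userid as source_userid,
           name as source_name,
           street as source_street,
           number as strNumber,
           md5(concat(street, number, zipcode, city, country_code)) as source_md5
    from users_operational_db
""")
dim_users_new.createOrReplaceTempView("dim_users_new")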

In [66]:
dim_users_new.show()

+--------------------+-------------+--------------------+--------------------+---------+--------------------+
|      source_user_sk|source_userid|         source_name|       source_street|strNumber|          source_md5|
+--------------------+-------------+--------------------+--------------------+---------+--------------------+
|6a7430f9-7cf8-468...|            4|      Willems Angela|Graaf Joseph de P...|      15 |92680e7e5a3c54a58...|
|921336bd-9e95-441...|            5|    Heijnen Patricia|          Meylstraat|     111 |48ea9f32068a93fbe...|
|90de19b1-a830-481...|            6|      Driessen Anouk|   Jan Ockegemstraat| 168 0107|706882038769b7fd0...|
|92aba685-d571-441...|            7|      Dijkstra Frank|        Klamperdreef|     154 |ed370331b0fcb12d4...|
|3429534b-6105-431...|            8|  den Hartog Suzanne|      Kolibriestraat| 138 0608|f485183bb9a7d886c...|
|e6e03274-1c21-450...|            9|            Smit Tim|       Bikschotelaan|      60 |fd41f6e42aef717ff...|
|c285dc68-

### 📌 Explanation
✅ **Detect changes:**
- `LEFT OUTER JOIN` → Compare new records (`dim_users_new`) with the existing ones (`dim_users_current`).
- `WHERE dwh.userid IS NULL` → Identifies **new users**.
- `OR dwh.md5 <> source.source_md5` → Identifies **modified users**.
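
The join-and-filter logic above can be mimicked in plain Python to see which rows survive. This is a sketch with made-up hash values, not the Spark query itself; `detect_changes` is a hypothetical helper, and the nested `dwh` key stands in for the right-hand columns of the join.

```python
def detect_changes(source_rows, dim_rows):
    """Return source rows that are new, or whose hash differs from the active dimension row."""
    # Equivalent of the join condition: only active dimension rows, keyed by userid.
    active = {r["userid"]: r for r in dim_rows if r["current"]}
    changes = []
    for src in source_rows:
        dwh = active.get(src["source_userid"])  # None plays the role of the NULL side
        if dwh is None or dwh["md5"] != src["source_md5"]:
            changes.append({**src, "dwh": dwh})
    return changes

dim_rows = [
    {"userid": 1, "md5": "aaa", "current": True},
    {"userid": 2, "md5": "bbb", "current": True},
]
source_rows = [
    {"source_userid": 1, "source_md5": "aaa"},  # unchanged: filtered out
    {"source_userid": 2, "source_md5": "zzz"},  # changed hash: kept
    {"source_userid": 3, "source_md5": "ccc"},  # no dimension row: kept as new
]

changes = detect_changes(source_rows, dim_rows)
print([c["source_userid"] for c in changes])
```

Only users 2 and 3 remain, which corresponds to the modified and new cases listed above.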


In [67]:
detectedChanges = spark.sql("""
    select *
    from dim_users_new as source
    left outer join dim_users_current as dwh
        on dwh.userid = source.source_userid and dwh.current = true
    where dwh.userid is null or dwh.md5 <> source.source_md5
""")

detectedChanges.createOrReplaceTempView("detectedChanges")

In [68]:
#DEBUG CODE TO SHOW CONTENT OF DETECTED CHANGES
detectedChanges.show()

                                                                                

+--------------------+-------------+---------------+-------------+---------+--------------------+--------------------+------+---------------+--------------------+-----------------+------+-------+---------+------------+-------------------+-------------------+--------------------+-------+
|      source_user_sk|source_userid|    source_name|source_street|strNumber|          source_md5|              userSK|userid|           name|               email|           street|number|zipcode|     city|country_code|          scd_start|            scd_end|                 md5|current|
+--------------------+-------------+---------------+-------------+---------+--------------------+--------------------+------+---------------+--------------------+-----------------+------+-------+---------+------------+-------------------+-------------------+--------------------+-------+
|eb463a7c-2e8e-4ce...|            3|de Boer Ricardo|  WWWWWWWWWWW|     5656|c3b39038c4a5997f1...|ca12d28c-fbed-432...|     3|de Boer Ric

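### 📌 Explanation
✅ **Prepare data for update and insert:**
- **New versions** to insert: one row per detected change, valid from the run timestamp.
- **Previous versions** to update: the superseded rows, with `scd_end` set to the run timestamp and `current = FALSE`.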
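
As a sketch of how the two kinds of rows are produced, the plain-Python version below builds them from a list of detected changes. The helper `build_upserts`, the surrogate-key strings, the hash strings, and user 99 are all hypothetical; only the street names and dates for user 3 come from the notebook output.

```python
from datetime import datetime

OPEN_END = datetime(2100, 12, 12)  # far-future end date used for active rows

def build_upserts(changes, run_timestamp):
    """For each change emit the new active row and, if one exists, the closed old row."""
    upserts = []
    for c in changes:
        upserts.append({
            "user_sk": c["source_user_sk"], "userid": c["source_userid"],
            "street": c["source_street"], "scd_start": run_timestamp,
            "scd_end": OPEN_END, "md5": c["source_md5"], "current": True,
        })
        old = c["dwh"]
        if old is not None:  # new users have no previous version to close
            upserts.append({
                "user_sk": old["userSK"], "userid": old["userid"],
                "street": old["street"], "scd_start": old["scd_start"],
                "scd_end": run_timestamp, "md5": old["md5"], "current": False,
            })
    return upserts

run_timestamp = datetime(2025, 3, 12, 14, 52, 37)
changes = [
    {"source_user_sk": "sk-new-3", "source_userid": 3, "source_street": "WWWWWWWWWWW",
     "source_md5": "hash-new",
     "dwh": {"userSK": "sk-old-3", "userid": 3, "street": "Maria Clarastraat",
             "scd_start": datetime(1999, 1, 1), "md5": "hash-old"}},
    {"source_user_sk": "sk-new-99", "source_userid": 99, "source_street": "Example Street",
     "source_md5": "hash-99", "dwh": None},
]

upserts = build_upserts(changes, run_timestamp)
for u in upserts:
    print(u["userid"], u["street"], u["scd_start"], u["scd_end"], u["current"])
```

A changed user yields two rows (new and closed), while a brand-new user yields only the new row.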


In [70]:
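# First branch: the new active version of every detected change.
# Second branch: the previous version (present only for changed users), closed at run_timestamp.
# Note: strNumber in the second branch is still the source value; the MERGE updates only scd_end and current.
upserts = spark.sql(f"""
    select source_user_sk as user_sk,
           source_userid as userid,
           source_name as name,
           source_street as street,
           strNumber,
           to_timestamp('{run_timestamp}') as scd_start,
           to_timestamp('2100-12-12', 'yyyy-MM-dd') as scd_end,
           source_md5 as md5,
           true as current
    from detectedChanges
    union
    select userSK,
           userid,
           name,
           street,
           strNumber,
           scd_start,
           to_timestamp('{run_timestamp}') as scd_end,
           md5,
           false
    from detectedChanges
    where current is not null
""")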

upserts.createOrReplaceTempView("upserts")

In [71]:
#DEBUG CODE TO SHOW CONTENT OF UPSERTS
spark.sql("select * from upserts").show()


+--------------------+------+---------------+-----------------+---------+--------------------+--------------------+--------------------+-------+
|             user_sk|userid|           name|           street|strNumber|           scd_start|             scd_end|                 md5|current|
+--------------------+------+---------------+-----------------+---------+--------------------+--------------------+--------------------+-------+
|eb463a7c-2e8e-4ce...|     3|de Boer Ricardo|      WWWWWWWWWWW|     5656|2025-03-12 14:52:...| 2100-12-12 00:00:00|c3b39038c4a5997f1...|   true|
|ca12d28c-fbed-432...|     3|de Boer Ricardo|Maria Clarastraat|     5656| 1999-01-01 00:00:00|2025-03-12 14:52:...|5372c35ac3d3a8b04...|  false|
+--------------------+------+---------------+-----------------+---------+--------------------+--------------------+--------------------+-------+



                                                                                

### 📌 Explanation
✅ **Perform the SCD2 merge operation:**
- **UPDATE** existing records → Sets `scd_end` and `current = FALSE`.
- **INSERT** new records → Tracks changes as new entries.
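
The matched and not-matched branches can be simulated on Python lists to check the expected end state. This is only a sketch of the merge semantics, not Delta Lake's implementation; `merge_upserts` is a hypothetical helper, and the keys and hashes are placeholders.

```python
from datetime import datetime

OPEN_END = datetime(2100, 12, 12)

def merge_upserts(dim_rows, upserts):
    """Close matched active rows, insert everything else (sketch of the merge branches)."""
    # Snapshot the active rows first: matching is done against the table
    # as it was before this merge touched it.
    active = {t["userid"]: t for t in dim_rows if t["current"]}
    for src in upserts:
        target = active.get(src["userid"]) if not src["current"] else None
        if target is not None:
            # WHEN MATCHED: close the existing version.
            target["scd_end"] = src["scd_end"]
            target["current"] = src["current"]
        else:
            # WHEN NOT MATCHED: insert the row as a new version.
            dim_rows.append(dict(src))
    return dim_rows

run_ts = datetime(2025, 3, 12, 14, 52, 37)
dim_rows = [{"userSK": "sk-old-3", "userid": 3, "street": "Maria Clarastraat",
             "scd_start": datetime(1999, 1, 1), "scd_end": OPEN_END, "current": True}]
upserts = [
    {"userSK": "sk-new-3", "userid": 3, "street": "WWWWWWWWWWW",
     "scd_start": run_ts, "scd_end": OPEN_END, "current": True},
    {"userSK": "sk-old-3", "userid": 3, "street": "Maria Clarastraat",
     "scd_start": datetime(1999, 1, 1), "scd_end": run_ts, "current": False},
]

merge_upserts(dim_rows, upserts)
for row in dim_rows:
    print(row["userSK"], row["street"], row["scd_end"], row["current"])
```

The old row is closed in place and the new row is appended, so user 3 ends up with exactly one active version.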


In [None]:
spark.sql("""
    MERGE INTO dim_users_current AS target
    USING upserts AS source ON target.userid = source.userid AND source.current = false AND target.current = true
    WHEN MATCHED THEN UPDATE SET scd_end = source.scd_end, current = source.current
    WHEN NOT MATCHED THEN INSERT (userSK, userid, name, street, scd_start, scd_end, md5, current)
        VALUES (source.user_sk, source.userid, source.name, source.street, source.scd_start, source.scd_end, source.md5, source.current)
""")


### 📌 Explanation
✅ **Display the final dimension table sorted by `userid` and `scd_start`**
- Shows **historical versions** and the **latest active record**.
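
When inspecting the final table, one property worth checking is that every `userid` has exactly one active record. The sketch below expresses that check over plain Python rows; `users_without_single_current` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter

def users_without_single_current(rows):
    """Return the userids that do not have exactly one row flagged as current."""
    current_counts = Counter(r["userid"] for r in rows if r["current"])
    all_ids = {r["userid"] for r in rows}
    return sorted(uid for uid in all_ids if current_counts[uid] != 1)

rows = [
    {"userid": 1, "current": True},
    {"userid": 3, "current": False},  # closed historical version
    {"userid": 3, "current": True},   # active version
    {"userid": 7, "current": True},
    {"userid": 7, "current": True},   # two active versions: a problem
    {"userid": 8, "current": False},  # no active version: also a problem
]

print(users_without_single_current(rows))
```

An empty result would indicate that the history is consistent in this respect.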


In [37]:
current_user_table.toDF().sort("userid", "scd_start").show(100)

                                                                                

+--------------------+------+--------------------+--------------------+--------------------+--------+-------+--------------------+------------+--------------------+--------------------+--------------------+-------+
|              userSK|userid|                name|               email|              street|  number|zipcode|                city|country_code|           scd_start|             scd_end|                 md5|current|
+--------------------+------+--------------------+--------------------+--------------------+--------+-------+--------------------+------------+--------------------+--------------------+--------------------+-------+
|77dec1af-5b0f-4d5...|     1|         Bouman Lars|Lars.Bouman@gmail...|               gosho|    156 |   2060|           Antwerpen|          BE| 1999-01-01 00:00:00| 2100-12-12 00:00:00|2cd208b8e9e5f4a95...|   true|
|a0518b84-aa50-4af...|     2|   van der Zee Julia|Julia.van.der.Zee...|          Europalaan|     43 |   2610| Wilrijk (Antwerpen)|          

In [None]:
spark.stop()