# Configuration

In [1]:

import ConnectionConfig as cc
from delta import DeltaTable
from datetime import datetime

In [2]:
cc.setupEnvironment()
spark = cc.startLocalCluster("dimSalesIncrementalLoad")
spark.getActiveSession()

## Incremental load

After the sales rep dimension has been filled for the first time, updates have to be handled differently: a changed record in the source system has to be recorded as a new version in the dimension. SCD2 (slowly changing dimension type 2) logic is used for this.

The SCD2 implementation requires a more complex transformation to handle changes in the source data correctly. For detailed information, consult the comments in the code.
### Setting the parameters
The timestamp of the job is used to set the scd_end date of the previous record and the scd_start date of the new record.

In [3]:
run_timestamp = datetime.now()  # The job runtime is stored in a variable
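
As a small illustration of how this timestamp ends up in the SQL of the later steps, the sketch below interpolates a fixed example value into a `to_timestamp(...)` expression. The name `example_run_timestamp` and the fixed date are assumptions made for reproducibility; the notebook itself uses `datetime.now()`.

```python
from datetime import datetime

# Fixed example value (an assumption for this sketch); the notebook uses datetime.now().
example_run_timestamp = datetime(2024, 9, 3, 13, 17, 5)

# str() of a datetime yields 'YYYY-MM-DD HH:MM:SS[.ffffff]', which Spark SQL's
# to_timestamp() should be able to parse without an explicit pattern.
scd_start_expr = f"to_timestamp('{example_run_timestamp}') as scd_start"
print(scd_start_expr)  # to_timestamp('2024-09-03 13:17:05') as scd_start
```

Capturing the timestamp once and reusing it means the scd_end of the closed record and the scd_start of the new record are exactly equal, so the two periods connect without a gap or overlap.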

### Read existing dimension

In [5]:
dt_dimSalesRep = DeltaTable.forPath(spark,".\\spark-warehouse\\dimsalesrep")

dt_dimSalesRep.toDF().createOrReplaceTempView("dimSalesRep_current")

#DEBUG CODE TO SHOW CONTENT OF DIMENSION
spark.sql("select salesRepID, name, office, salesRepSK, md5  from dimSalesRep_current ").show()

+----------+-------------+-------------+--------------------+--------------------+
|salesRepID|         name|       office|          salesRepSK|                 md5|
+----------+-------------+-------------+--------------------+--------------------+
|         5|     T. Mosby|       Berlin|d3858661-67e5-4fb...|947579dec8084039e...|
|         6|   H. Simpson|       Berlin|3ad7fdb8-02a2-467...|d636d1b0685650b34...|
|         7|   B. Stinson|San Fransisco|5880490b-03a6-4c3...|e726b2d8dc0cf9a6f...|
|         8|L. Hofstadter|     Brussels|b7c5e413-413e-421...|a2bbe52f8274b0f08...|
|         9|    S. Cooper|     Brussels|e28d311d-c5d3-412...|d85c73c9d03df0002...|
|        10| F. Underwood|     Brussels|76e8eb70-4124-41d...|44cd1a6d596b05688...|
|        11|     W. White|     New York|b86a98c1-b1a1-447...|f9ea69ce2aa4482b4...|
|        12| T. Lannister|     New York|d8c23730-aa18-44b...|3259a471f9816d7c3...|
|        13|      M. Ross|       London|1d1ba85a-a7ea-40c...|d0faf94c1bbe2d4a7...|
|   

### Read source table


##### 1 READ SOURCE TABLE
Create a dataframe from the source table (from the operational system) and transform it to the dimension format.
The surrogate key is a UUID, so it is unique.
An md5 hash over the tracked attributes (name and office) is used to identify changes in the source table.
A view is created on the resulting dataframe to make it available for the next step.
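
The idea behind hash-based change detection can be sketched in plain Python with `hashlib`. The helper `row_hash` and its `sep` parameter are hypothetical and only serve the illustration; the notebook computes the hash in Spark SQL.

```python
import hashlib

def row_hash(*values, sep=""):
    """Hex md5 of the values joined by sep (hypothetical helper for this sketch)."""
    return hashlib.md5(sep.join(values).encode("utf-8")).hexdigest()

# Identical attribute values give the same hash; a changed office gives a different one.
unchanged = row_hash("F. Crane", "New York") == row_hash("F. Crane", "New York")
changed = row_hash("F. Crane", "New York") != row_hash("F. Crane", "Berlin")

# Without a separator, different column values can concatenate to the same string.
collision = row_hash("ab", "c") == row_hash("a", "bc")
no_collision = row_hash("ab", "c", sep="|") != row_hash("a", "bc", sep="|")

print(unchanged, changed, collision, no_collision)  # True True True True
```

Two caveats follow from this. Plain concatenation without a separator can hide a change when characters move from one column to the other. In Spark SQL, `concat` returns NULL as soon as one argument is NULL, so a row with a missing name or office would get a NULL hash; `concat_ws` with a separator is one possible alternative, but switching would change every stored hash.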

In [7]:
cc.set_connection("tutorial_op")

#a. Reading from a JDBC source
df_operational_sales_rep = spark.read \
    .format("jdbc") \
    .option("driver" , cc.get_Property("driver")) \
    .option("url", cc.create_jdbc()) \
    .option("dbtable", "salesrep") \
    .option("user", cc.get_Property("username")) \
    .option("password", cc.get_Property("password")) \
    .option("partitionColumn", "salesRepID") \
    .option("numPartitions", 4) \
    .option("lowerBound", 0) \
    .option("upperBound", 20) \
    .load()

df_operational_sales_rep.createOrReplaceTempView("operational_sales_rep")

#b. Transforming the source to the dimension format
df_dim_sales_rep_new = spark.sql( "select uuid() as source_salesRepSK, \
                                        salesRepId as source_salesRepId, \
                                        name as source_name, \
                                        office as source_office, \
                                        md5(concat( name, office)) as source_md5 \
                                    from operational_sales_rep")

df_dim_sales_rep_new.createOrReplaceTempView("dimSalesRep_new")

#DEBUG CODE TO SHOW CONTENT OF SOURCE
#df_dim_sales_rep_new.printSchema()
#df_dim_sales_rep_new.show()
spark.sql("select * from dimSalesRep_new").show()
#df_dim_sales_rep.write.format("delta").mode("overwrite").saveAsTable("dimSalesRep")


+--------------------+-----------------+-------------+-------------+--------------------+
|   source_salesRepSK|source_salesRepId|  source_name|source_office|          source_md5|
+--------------------+-----------------+-------------+-------------+--------------------+
|783faf4d-83ca-4c9...|                1|      R. Zane|       Berlin|1f8cbbc272a33dcc1...|
|02306d02-6bf3-455...|                2|   P. Chapman|       Berlin|14b094c31bf9e4149...|
|8c0d6326-6d5b-443...|                4|    R. Geller|     New York|6212c0ce01f144d66...|
|cf00619c-d6f3-402...|                3|     F. Crane|       Berlin|382ec0d8b8cd28ce4...|
|5844fbdf-5810-406...|                5|     T. Mosby|       Berlin|947579dec8084039e...|
|1e2f0cdc-12bb-4f8...|                6|   H. Simpson|       Berlin|d636d1b0685650b34...|
|e56bbe6d-e99b-4a0...|                7|   B. Stinson|San Fransisco|e726b2d8dc0cf9a6f...|
|d033b0db-3fa2-49f...|                8|L. Hofstadter|     Brussels|a2bbe52f8274b0f08...|
|18743c9c-


##### 2 DETECT CHANGES
Build a dataframe that identifies the rows that changed (SCD2).
First, a left outer join between SOURCE (operational system) and DIMENSION (dwh) is performed, restricted to the current dimension rows.
The md5 hash is used to identify differences.
The result contains:
- updated source rows (the join finds a row and the md5 is different), and
- new source rows (the left outer join does not find a row in the dimension, so dwh.salesRepId is null).
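
The detection rule can be sketched without Spark using plain dictionaries. The function `detect_changes`, the sample rows and the placeholder hash strings are all hypothetical; the sketch assumes at most one current row per salesRepID.

```python
def detect_changes(source_rows, dimension_rows):
    """Return source rows that are new or whose md5 differs from the current dimension row."""
    # Only current dimension rows take part in the join.
    current_by_id = {row["salesRepID"]: row for row in dimension_rows if row["current"]}
    changes = []
    for src in source_rows:
        # None plays the role of the NULL columns produced by a left outer join miss.
        dwh = current_by_id.get(src["source_salesRepId"])
        if dwh is None or dwh["md5"] != src["source_md5"]:
            changes.append({**src, "dwh": dwh})
    return changes

dimension = [
    {"salesRepID": 3, "office": "New York", "md5": "old-hash", "current": True},
    {"salesRepID": 5, "office": "Berlin", "md5": "same-hash", "current": True},
]
source = [
    {"source_salesRepId": 3, "source_office": "Berlin", "source_md5": "new-hash"},     # updated
    {"source_salesRepId": 5, "source_office": "Berlin", "source_md5": "same-hash"},    # unchanged
    {"source_salesRepId": 14, "source_office": "London", "source_md5": "other-hash"},  # new
]

changed_ids = [row["source_salesRepId"] for row in detect_changes(source, dimension)]
print(changed_ids)  # [3, 14]
```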

In [8]:

detectedChanges = spark.sql("select * \
                          from dimSalesRep_new source \
                          left outer join dimSalesRep_current dwh on dwh.salesRepID = source.source_salesRepId and dwh.current = true \
                          where dwh.salesRepId is null or dwh.md5 <> source.source_md5")

detectedChanges.createOrReplaceTempView("detectedChanges")

#DEBUG CODE TO SHOW CONTENT OF DETECTED CHANGES
detectedChanges.show()


+--------------------+-----------------+-----------+-------------+--------------------+--------------------+----------+--------+--------+-------------------+-------------------+--------------------+-------+
|   source_salesRepSK|source_salesRepId|source_name|source_office|          source_md5|          salesRepSK|salesrepid|    name|  office|          scd_start|            scd_end|                 md5|current|
+--------------------+-----------------+-----------+-------------+--------------------+--------------------+----------+--------+--------+-------------------+-------------------+--------------------+-------+
|cf00619c-d6f3-402...|                3|   F. Crane|       Berlin|382ec0d8b8cd28ce4...|cc52e5ac-765e-457...|         3|F. Crane|New York|1999-01-01 00:00:00|2100-12-12 00:00:00|0715f05df18a3a794...|   true|
+--------------------+-----------------+-----------+-------------+--------------------+--------------------+----------+--------+--------+-------------------+---------------


##### 3 TRANSFORM TO UPSERTS
Before the union: every updated and new source row requires the insertion of a new record in the SCD2 dimension. This new record starts at the runtime of the job and ends at the end of time (2100-12-12). Current is set to true.
After the union: updated source rows also require an update of the SCD fields of the existing record. The scd_end date of the existing record is set to the runtime of the job and current is set to false.

In the next step, rows with current = true are inserted into the dimension table, and rows with current = false update (close) the existing row in the dimension.
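
A minimal plain-Python sketch of this transformation is shown below. The function `build_upserts`, the `dwh` key that holds the matched dimension row (or `None` for a new sales rep) and the sample data are assumptions for the illustration, not part of the notebook.

```python
from datetime import datetime

END_OF_TIME = datetime(2100, 12, 12)

def build_upserts(detected_changes, run_timestamp):
    """Turn detected changes into SCD2 upsert rows (hypothetical helper)."""
    upserts = []
    for change in detected_changes:
        # Every new or updated source row becomes a new current version.
        upserts.append({
            "salesRepID": change["source_salesRepId"],
            "office": change["source_office"],
            "scd_start": run_timestamp,
            "scd_end": END_OF_TIME,
            "current": True,
        })
        # Updated rows additionally produce a row that closes the old version.
        if change["dwh"] is not None:
            upserts.append({**change["dwh"], "scd_end": run_timestamp, "current": False})
    return upserts

old_version = {"salesRepID": 3, "office": "New York",
               "scd_start": datetime(1999, 1, 1), "scd_end": END_OF_TIME, "current": True}
changes = [
    {"source_salesRepId": 3, "source_office": "Berlin", "dwh": old_version},  # updated
    {"source_salesRepId": 14, "source_office": "London", "dwh": None},        # new
]

upserts = build_upserts(changes, datetime(2024, 9, 3, 13, 17, 5))
print([(row["salesRepID"], row["office"], row["current"]) for row in upserts])
# [(3, 'Berlin', True), (3, 'New York', False), (14, 'London', True)]
```

An updated sales rep therefore yields two rows and a new sales rep yields one.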

In [9]:

df_upserts = spark.sql(f"select source_salesRepSK as salesRepSK,\
                                source_salesRepId as salesRepID,\
                                source_name as name,\
                                source_office as office,\
                                to_timestamp('{run_timestamp}') as scd_start, \
                                to_timestamp('2100-12-12','yyyy-MM-dd') as scd_end,\
                                source_md5 as md5,\
                                true as current\
                        from  detectedChanges\
                        union \
                        select  salesRepSK,\
                                salesRepId,\
                                name,\
                                office,\
                                scd_start,\
                                to_timestamp('{run_timestamp}') as scd_end,\
                                md5, \
                                false \
                                from detectedChanges \
                        where current is not null")

df_upserts.createOrReplaceTempView("upserts")

In [10]:

#DEBUG CODE TO SHOW CONTENT OF UPSERTS
spark.sql("select * from upserts").show()

+--------------------+----------+--------+--------+--------------------+--------------------+--------------------+-------+
|          salesRepSK|salesRepID|    name|  office|           scd_start|             scd_end|                 md5|current|
+--------------------+----------+--------+--------+--------------------+--------------------+--------------------+-------+
|cf00619c-d6f3-402...|         3|F. Crane|  Berlin|2024-09-03 13:17:...| 2100-12-12 00:00:00|382ec0d8b8cd28ce4...|   true|
|cc52e5ac-765e-457...|         3|F. Crane|New York| 1999-01-01 00:00:00|2024-09-03 13:17:...|0715f05df18a3a794...|  false|
+--------------------+----------+--------+--------+--------------------+--------------------+--------------------+-------+



#### PERFORM MERGE DIMSALESREP AND UPSERTS
The merge looks for a current dimension row (target.current = true) whose salesRepID matches an upsert row that closes a period (source.current = false).
- When a match is found (the dimension contains a current row for that salesRepID) -> update the row to close the period: scd_end is set and current becomes false.
- When no match is found (the upsert row has current = true, so it can never satisfy the join condition) -> insert the row from the upserts view. Its scd_start is filled with the run_timestamp.
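
To make the matched and not-matched branches concrete, the sketch below mimics the match condition of the MERGE statement in this step on plain Python lists. `merge_upserts` and the sample rows are hypothetical, and the sketch only approximates MERGE semantics: matches are evaluated against the dimension as it was before the merge, and the input list is left untouched.

```python
def merge_upserts(dimension, upserts):
    """Return a new dimension with the upserts applied (simplified MERGE simulation)."""
    result = [dict(row) for row in dimension]
    inserts = []
    for src in upserts:
        # The condition is checked against the pre-merge dimension, never against new inserts.
        matched = [
            i for i, tgt in enumerate(dimension)
            if tgt["salesRepID"] == src["salesRepID"] and tgt["current"] and not src["current"]
        ]
        if matched:
            for i in matched:  # WHEN MATCHED: close the existing period
                result[i]["scd_end"] = src["scd_end"]
                result[i]["current"] = src["current"]
        else:                  # WHEN NOT MATCHED: insert the new version
            inserts.append(dict(src))
    return result + inserts

dimension = [
    {"salesRepID": 3, "office": "New York", "scd_start": "1999-01-01",
     "scd_end": "2100-12-12", "current": True},
]
upserts = [
    {"salesRepID": 3, "office": "Berlin", "scd_start": "2024-09-03",
     "scd_end": "2100-12-12", "current": True},
    {"salesRepID": 3, "office": "New York", "scd_start": "1999-01-01",
     "scd_end": "2024-09-03", "current": False},
]

merged = merge_upserts(dimension, upserts)
print([(row["office"], row["scd_end"], row["current"]) for row in merged])
# [('New York', '2024-09-03', False), ('Berlin', '2100-12-12', True)]
```

After the merge, each salesRepID should have exactly one row with current = true, which is a useful sanity check on the real dimension as well.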

In [11]:
spark.sql("MERGE INTO dimSalesRep_current AS target \
          using upserts AS source ON target.salesRepID = source.salesRepID and source.current = false and target.current=true \
          WHEN MATCHED THEN UPDATE SET scd_end = source.scd_end, current = source.current  \
          WHEN NOT MATCHED THEN INSERT (salesRepSK, salesRepId, name, office, scd_start, scd_end, md5, current) values (source.salesRepSK, source.salesRepId, source.name, source.office, source.scd_start, source.scd_end, source.md5, source.current)")

#DEBUG CODE TO SHOW CONTENT OF DIMENSION
dt_dimSalesRep.toDF().sort("salesRepID", "scd_start").show(100)


DataFrame[num_affected_rows: bigint, num_updated_rows: bigint, num_deleted_rows: bigint, num_inserted_rows: bigint]

+--------------------+----------+-------------+-------------+--------------------+--------------------+--------------------+-------+
|          salesRepSK|salesrepid|         name|       office|           scd_start|             scd_end|                 md5|current|
+--------------------+----------+-------------+-------------+--------------------+--------------------+--------------------+-------+
|e29017f3-ee76-462...|         1|      R. Zane|       Berlin| 1999-01-01 00:00:00| 2100-12-12 00:00:00|1f8cbbc272a33dcc1...|   true|
|e8dfac53-cfdb-436...|         2|   P. Chapman|       Berlin| 1999-01-01 00:00:00| 2100-12-12 00:00:00|14b094c31bf9e4149...|   true|
|cc52e5ac-765e-457...|         3|     F. Crane|     New York| 1999-01-01 00:00:00|2024-09-03 13:17:...|0715f05df18a3a794...|  false|
|cf00619c-d6f3-402...|         3|     F. Crane|       Berlin|2024-09-03 13:17:...| 2100-12-12 00:00:00|382ec0d8b8cd28ce4...|   true|
|e7b68ade-1c92-4e8...|         4|    R. Geller|     New York| 1999-01

## Stop the Spark session

In [12]:
spark.stop()