### Config stuff

In [1]:

import ConnectionConfig as cc
from delta import DeltaTable
cc.setupEnvironment()

In [2]:
spark = cc.startLocalCluster("factSales")
spark.getActiveSession()

# Fact transformations
This notebook creates the sales fact table from scratch, based on the operational source table "sales".
When creating a fact table, always follow the listed steps in order.


#### 1 READ NECESSARY SOURCE TABLE(S) AND PERFORM TRANSFORMATIONS
**When reading from the source table, make sure you include all the data you need:**
- the columns needed to calculate the measure values
- the source table keys needed to look up the correct surrogate keys in the dimension tables.

**If more than one table is needed to gather the necessary information, you can opt for one of two strategies:**
- Use a select query when reading from the JDBC source with the spark.read operation. Avoid complex queries, because the operational database needs a lot of resources to run them.
- Perform a spark.read operation for each table separately and join the tables within Spark. The joins then take place on the cluster instead of in the database. This limits the database resources used, but it can add significant overhead because unnecessary data is transferred to the cluster.
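
The trade-off between the two strategies can be illustrated without a cluster. The sketch below is only a stand-in: it uses an in-memory SQLite database instead of the operational JDBC source, and the `salesrep` table, its columns, and all row values are hypothetical. It shows that both strategies give the same result, while the second one pulls more rows out of the database.

```python
import sqlite3

# Hypothetical operational database with two tiny tables (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (order_id INTEGER, salesrepid INTEGER, amount INTEGER);
    CREATE TABLE salesrep (salesrepid INTEGER, name TEXT);
    INSERT INTO sales VALUES (1, 1, 100), (2, 1, 250), (3, 2, 75);
    INSERT INTO salesrep VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cem');
""")

# Strategy 1: one select query, the database performs the join.
joined_in_db = conn.execute(
    "SELECT s.order_id, r.name, s.amount "
    "FROM sales s JOIN salesrep r ON s.salesrepid = r.salesrepid "
    "ORDER BY s.order_id"
).fetchall()

# Strategy 2: read each table separately and join on the consumer side
# (in the notebook this role is played by Spark).
sales_rows = conn.execute("SELECT order_id, salesrepid, amount FROM sales").fetchall()
rep_rows = conn.execute("SELECT salesrepid, name FROM salesrep").fetchall()
rep_names = dict(rep_rows)
joined_in_app = sorted(
    (order_id, rep_names[rep_id], amount)
    for order_id, rep_id, amount in sales_rows
    if rep_id in rep_names
)

# Rows that had to leave the database under each strategy.
rows_transferred_db = len(joined_in_db)
rows_transferred_app = len(sales_rows) + len(rep_rows)
```

In this toy data the unused sales rep is transferred under strategy 2 but never reaches the result, which is the "unnecessary data" cost described above.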


In this case we just rename amount to revenue_MV and create a default count_MV column.
The transformations are minimal. In reality, transformations can be far more complex. If so, it can be advisable to work out the transformations in more than one step.



In [4]:
#EXTRACT
cc.set_connectionProfile("tutorial_op")
sale_src_df = spark.read \
    .format("jdbc") \
    .option("url", cc.create_jdbc()) \
    .option("driver" , cc.get_Property("driver")) \
    .option("dbtable", "(select order_id, order_date, salesrepid, amount from sales) as subq") \
    .option("user", cc.get_Property("username")) \
    .option("password", cc.get_Property("password")) \
    .option("partitionColumn", "Order_ID") \
    .option("numPartitions", 4) \
    .option("lowerBound", 0) \
    .option("upperBound", 1000) \
    .load()

sale_src_df.show(20)

+--------+-------------------+----------+----------+
|order_id|         order_date|salesrepid|    amount|
+--------+-------------------+----------+----------+
|       1|2010-10-13 00:00:00|         1| 851804379|
|       2|2012-10-01 00:00:00|         1| 683057055|
|       3|2011-07-10 00:00:00|         1|1732115679|
|       4|2010-08-28 00:00:00|         1|1275042249|
|       5|2011-06-17 00:00:00|         1| 694153767|
|       6|2011-03-24 00:00:00|         1|1959464599|
|       7|2010-02-26 00:00:00|         1|1170677605|
|       8|2010-11-23 00:00:00|         1|1588502393|
|       9|2012-06-08 00:00:00|         1|1173163372|
|      10|2012-08-04 00:00:00|         1| 788682390|
|      11|2011-05-30 00:00:00|         1|1951236590|
|      12|2009-11-25 00:00:00|         1| 343432817|
|      13|2012-02-14 00:00:00|         1| 340274106|
|      14|2012-04-15 00:00:00|         1| 958504424|
|      15|2010-03-12 00:00:00|         1|1517930834|
|      16|2011-03-09 00:00:00|         1|10300

In [7]:
sale_src_df.createOrReplaceTempView("sales_source")



#### 2 MAKE DIMENSION TABLES AVAILABLE AS VIEWS

In [6]:
#EXTRACT
dim_date = spark.read.format("delta").load("spark-warehouse/dimdate")
dim_date.createOrReplaceTempView("dimDate")
dim_salesrep = spark.read.format("delta").load("spark-warehouse/dimsalesrep/")
dim_salesrep.createOrReplaceTempView("dimSalesRep")


#### 3 BUILD THE FACT TABLE

When building a fact table, always perform these two tasks:
1.   Include the measures of the fact.
2.   Use the dimension tables to look up the surrogate keys that correspond to the natural key values. For an SCD2 dimension, use scd_start and scd_end to find the version of the dimension row that was valid at the time of the fact.
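
The SCD2 lookup in task 2 can be sketched in plain Python. This is only an illustration of the rule, not the notebook's implementation: the dimension rows, the surrogate key values `sk-a` and `sk-b`, and the function name `lookup_surrogate_key` are made up. The validity test mirrors the comparison used in the notebook's join (start exclusive, end inclusive).

```python
from datetime import datetime
from typing import Optional

# Hypothetical SCD2 dimension rows: two versions of sales rep 1.
dim_salesrep_rows = [
    {"salesrep_SK": "sk-a", "salesRepId": 1,
     "scd_start": datetime(2000, 1, 1), "scd_end": datetime(2011, 1, 1)},
    {"salesrep_SK": "sk-b", "salesRepId": 1,
     "scd_start": datetime(2011, 1, 1), "scd_end": datetime(2100, 12, 12)},
]


def lookup_surrogate_key(rows, natural_key, event_time) -> Optional[str]:
    """Return the surrogate key of the version valid at event_time, or None."""
    for row in rows:
        if (row["salesRepId"] == natural_key
                and row["scd_start"] < event_time <= row["scd_end"]):
            return row["salesrep_SK"]
    return None


# An order placed in 2010 resolves to the first version, one in 2012 to the second.
sk_2010 = lookup_surrogate_key(dim_salesrep_rows, 1, datetime(2010, 10, 13))
sk_2012 = lookup_surrogate_key(dim_salesrep_rows, 1, datetime(2012, 10, 1))
```

A natural key without a matching version yields `None`, which corresponds to the NULL surrogate key a left outer join would produce.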


In [8]:
dim_date.printSchema()

root
 |-- date_SK: long (nullable = true)
 |-- dateInt: integer (nullable = true)
 |-- CalendarDate: date (nullable = true)
 |-- CalendarYear: integer (nullable = true)
 |-- CalendarMonth: string (nullable = true)
 |-- MonthOfYear: integer (nullable = true)
 |-- CalendarDay: string (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DayOfWeekStartMonday: integer (nullable = true)
 |-- IsWeekDay: string (nullable = true)
 |-- DayOfMonth: integer (nullable = true)
 |-- IsLastDayOfMonth: string (nullable = true)
 |-- DayOfYear: integer (nullable = true)
 |-- WeekOfYearIso: integer (nullable = true)
 |-- QuarterOfYear: integer (nullable = true)


In [10]:
sale_src_df.printSchema()
sale_src_df.show(5)

root
 |-- order_id: integer (nullable = true)
 |-- order_date: timestamp (nullable = true)
 |-- salesrepid: integer (nullable = true)
 |-- amount: integer (nullable = true)
+--------+-------------------+----------+----------+
|order_id|         order_date|salesrepid|    amount|
+--------+-------------------+----------+----------+
|       1|2010-10-13 00:00:00|         1| 851804379|
|       2|2012-10-01 00:00:00|         1| 683057055|
|       3|2011-07-10 00:00:00|         1|1732115679|
|       4|2010-08-28 00:00:00|         1|1275042249|
|       5|2011-06-17 00:00:00|         1| 694153767|
+--------+-------------------+----------+----------+
only showing top 5 rows


In [19]:
#TRANSFORM

# Build the fact table based on the source table and the dimension tables
salesFactFromSource = spark.sql("select src.Order_ID as Order_ID, dd.date_SK as date_SK, ds.salesrep_SK as salesrep_SK, 1 as count_MV, src.amount as revenue_MV, md5(concat(src.Order_ID,dd.date_SK,ds.salesrep_SK, 1, src.amount)) as md5  from sales_source as src \
                                    left outer join dimdate as dd on cast(src.order_date as DATE) = dd.CalendarDate \
                                    left outer join dimsalesrep as ds on \
                                        src.SalesRepId = ds.salesRepId \
                                        and src.Order_Date > ds.scd_start \
                                        and src.Order_Date <= ds.scd_end")
                                    
salesFactFromSource.show(5)

+--------+-------+--------------------+--------+----------+--------------------+
|Order_ID|date_SK|         salesrep_SK|count_MV|revenue_MV|                 md5|
+--------+-------+--------------------+--------+----------+--------------------+
|       1|    650|d83fc5ef-6592-4c7...|       1| 851804379|a36d84edbd018a3a9...|
|       2|   1369|d83fc5ef-6592-4c7...|       1| 683057055|3cb2b90afa409a2fc...|
|       3|    920|d83fc5ef-6592-4c7...|       1|1732115679|cc505b78f8070b46f...|
|       4|    604|d83fc5ef-6592-4c7...|       1|1275042249|bc03a8b6d45b197be...|
|       5|    897|d83fc5ef-6592-4c7...|       1| 694153767|d89eb2886a3087753...|
+--------+-------+--------------------+--------+----------+--------------------+


## Initial load
The first time you load the fact table, perform a FULL load: all data is written to the Delta table.
After the initial load, disable (comment out) this line so that later runs do not overwrite the table.
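
Instead of disabling the line by hand, a run could decide for itself which load to perform. The sketch below assumes the table is stored on a local filesystem path and relies on the fact that a Delta table keeps its transaction log in a `_delta_log` subdirectory. The helper name `needs_initial_load` is hypothetical, and the demo uses a throwaway directory rather than the real warehouse path.

```python
import os
import tempfile


def needs_initial_load(table_path: str) -> bool:
    """True when no Delta transaction log exists at table_path yet."""
    return not os.path.isdir(os.path.join(table_path, "_delta_log"))


# Demo on a temporary directory standing in for the fact table location.
with tempfile.TemporaryDirectory() as table_path:
    before = needs_initial_load(table_path)            # no log yet: full load
    os.mkdir(os.path.join(table_path, "_delta_log"))   # what a first write creates
    after = needs_initial_load(table_path)             # log present: incremental
```

In the notebook, a check like this could guard the overwrite so that only the very first run performs the full load.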

In [20]:
salesFactFromSource.write.format("delta").mode("overwrite").saveAsTable("factSales")


## Incremental load
When previous runs were performed, you can opt for a 'faster' incremental run that only writes the changes. UPDATES and INSERTS are performed in one run.
In our solution we use an md5 hash based on all fields in the source table to detect changes. This is not the most efficient way to detect changes. A better way is to use a timestamp field in the source table and use that to detect changes. This is not implemented in this example.
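
The md5-based change detection can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the notebook's code: the dictionary `current` stands in for the existing Delta table, the names `row_md5` and `classify_changes` are hypothetical, and the `|` separator is an addition that the notebook's `concat` call does not use. The separator keeps field values such as `('1', '23')` and `('12', '3')` from hashing to the same string.

```python
import hashlib

# Columns of the fact table that feed the hash.
COLUMNS = ["Order_ID", "date_SK", "salesrep_SK", "count_MV", "revenue_MV"]


def row_md5(row: dict, columns: list) -> str:
    """Hash the chosen columns, joined with a separator."""
    joined = "|".join(str(row[c]) for c in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def classify_changes(current: dict, incoming: list) -> dict:
    """Split incoming rows into inserts, updates and unchanged rows.

    current maps Order_ID to the md5 stored in the fact table.
    """
    result = {"insert": [], "update": [], "unchanged": []}
    for row in incoming:
        key = row["Order_ID"]
        if key not in current:
            result["insert"].append(key)
        elif current[key] != row_md5(row, COLUMNS):
            result["update"].append(key)
        else:
            result["unchanged"].append(key)
    return result
```

The three buckets map onto the merge that follows: rows in `update` correspond to the matched branch with a differing md5, rows in `insert` to the not-matched branch, and unchanged rows are left alone.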

In [None]:
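# Expose the freshly built fact rows as the merge source
salesFactFromSource.createOrReplaceTempView("factSales_new")
dt_factSales = DeltaTable.forPath(spark, "spark-warehouse/factsales")
dt_factSales.toDF().createOrReplaceTempView("factSales_current")
# Update rows whose md5 changed and insert rows that are new
result = spark.sql("MERGE INTO factSales_current AS target \
          using factSales_new AS source ON target.Order_ID = source.Order_ID \
          WHEN MATCHED and source.md5<>target.md5 THEN UPDATE SET * \
          WHEN NOT MATCHED THEN INSERT *")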

result.show()

In [None]:
# IMPORTANT: ALWAYS TEST THE CREATED CODE.
# In this example I changed order 498 in the operational database and checked the change after the run.
# spark.sql("select * from factsales f join dimsalesrep ds on f.salesrep_SK = ds.salesrep_SK where Order_ID = 192").show()
spark.sql("select count(*) from factsales").show()
spark.sql("select * from factsales where Order_ID = 1").show()



### Checking the history of your delta fact table

In [None]:
# The history information is derived from the Delta table log files. They contain a lot of information about all the actions performed on the table. In this case it tells us something about the merge operations. You can find statistics about the update and insert counts in the operationMetrics column of the output.

dt_factSales.history().show(10, False)
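
To see where the history comes from, the commit files can also be read directly. The sketch below assumes a table on a local path whose `_delta_log` directory still holds its JSON commit files (older commits may have been cleaned up or folded into checkpoint files, which this sketch does not read). The function name `read_commit_info` is hypothetical.

```python
import glob
import json
import os


def read_commit_info(table_path: str) -> list:
    """Return (version, operation, operationMetrics) for each JSON commit file."""
    commits = []
    pattern = os.path.join(table_path, "_delta_log", "*.json")
    for log_file in sorted(glob.glob(pattern)):
        # Commit files are named after their zero-padded version number.
        version = int(os.path.basename(log_file).split(".")[0])
        with open(log_file, encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue
                action = json.loads(line)  # one action per line
                if "commitInfo" in action:
                    info = action["commitInfo"]
                    commits.append(
                        (version, info.get("operation"), info.get("operationMetrics", {}))
                    )
    return commits
```

Calling `read_commit_info("spark-warehouse/factsales")` after a merge would list one entry per commit. The metrics dictionary is where counts of updated and inserted rows are recorded for merge operations.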

In [None]:
spark.stop()