# Moving our changed data into our curated tables

This notebook moves our newly loaded data into our existing Delta tables. All but one of them, the kiosk dimension, are loaded incrementally, so we keep each table's history of changes and move only newly added or changed rows.

## User Dimension

The user dimension is a standard Type 1 Slowly Changing Dimension: changed attributes overwrite the existing row. We find all changes by selecting from our conformed tables and using EXCEPT to subtract the rows already in the curated table. The new or changed rows that remain go into a staging table.

We then use a Spark SQL MERGE to upsert the staged rows into our curated DimUser table. Once the merge completes, we drop the staging table.

In [None]:
DROP TABLE IF EXISTS StgDimUser;

CREATE TABLE StgDimUser AS
SELECT u.Id as UserId, u.FirstName, u.LastName, u.Email, u.PhoneNumber, u.MemberSince,
    ua.Address, ua.State, ua.ZipCode, us.RenewalDay, us.Active as IsSubscriptionActive
FROM appusers as u
JOIN appuseraddresses as ua ON u.Id = ua.UserId
JOIN appusersubscriptionstatus as us ON u.Id = us.UserId
WHERE ua.Default = true
-- EXCEPT compares by column position, so the SELECT list above must match DimUser's column order
EXCEPT
SELECT * FROM DimUser;

MERGE INTO DimUser
USING StgDimUser as s
ON DimUser.UserId = s.UserId
WHEN MATCHED THEN
  UPDATE SET
    FirstName = s.FirstName,
    LastName = s.LastName,
    Email = s.Email,
    PhoneNumber = s.PhoneNumber,
    MemberSince = s.MemberSince,
    Address = s.Address,
    State = s.State,
    ZipCode = s.ZipCode,
    RenewalDay = s.RenewalDay,
    IsSubscriptionActive = s.IsSubscriptionActive 
WHEN NOT MATCHED
  THEN INSERT (
    UserId,
    FirstName,
    LastName,
    Email,
    PhoneNumber,
    MemberSince,
    Address,
    State,
    ZipCode,
    RenewalDay,
    IsSubscriptionActive
  )
  VALUES (
    s.UserId,
    s.FirstName,
    s.LastName,
    s.Email,
    s.PhoneNumber,
    s.MemberSince,
    s.Address,
    s.State,
    s.ZipCode,
    s.RenewalDay,
    s.IsSubscriptionActive 
  );

DROP TABLE IF EXISTS StgDimUser;
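
To see why EXCEPT picks up both new and changed rows, here is a minimal sketch that uses inline VALUES in place of the real tables. The src and tgt relations and their id and val columns are made up for illustration and are not part of this notebook.

```sql
-- src stands in for the conformed query, tgt for the current curated table (both hypothetical)
SELECT * FROM VALUES (1, 'a'), (2, 'b'), (3, 'c') AS src(id, val)
EXCEPT
SELECT * FROM VALUES (1, 'a'), (2, 'x') AS tgt(id, val);
-- Row 1 is identical on both sides and is removed.
-- Row 2 differs in val, so the source version (2, 'b') is kept as a changed row.
-- Row 3 does not exist in tgt, so (3, 'c') is kept as a new row.
```

Rows are compared as a whole, so a change in any single column is enough to stage the row. The MERGE then decides per UserId whether a staged row becomes an update or an insert.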

## Kiosk Dimension

The kiosk dimension is very small, so here we cheat: we drop the whole table and recreate it from the source.

In [4]:
DROP TABLE IF EXISTS DimKiosk;
CREATE TABLE DimKiosk AS
SELECT Id as KioskId, Address, State, ZipCode, InstallDate FROM appkiosk
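
Dropping and recreating leaves a short window in which DimKiosk does not exist and, for a managed table, removes its history along with it. If the table is stored as Delta and the Spark runtime supports it, a possible alternative is CREATE OR REPLACE TABLE, which swaps the contents in a single commit. This is a sketch under those assumptions, not what this notebook runs:

```sql
-- Assumes DimKiosk is (or should become) a Delta table
CREATE OR REPLACE TABLE DimKiosk
USING DELTA
AS
SELECT Id AS KioskId, Address, State, ZipCode, InstallDate
FROM appkiosk;
```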

## Movies Dimension

Since we never update a movie (at least at the moment), we only insert newly added movies, using the same EXCEPT query pattern as in the user dimension to find the new rows.

In [7]:
INSERT INTO DimMovies
SELECT m.movie_id as MovieId, m.title as Title, m.mpaa_rating as MpaaRating, g.genre as Genre,
    m.poster_url as PosterImageUrl, m.release_date as ReleaseDate
FROM dbomovies as m
JOIN dbogenres as g ON m.genre_id = g.genre_id
EXCEPT
SELECT * FROM DimMovies;

## Purchases Fact

Here we again use the EXCEPT pattern to find all new rows that have landed in our conformed data. The result is a traditional insert-only, non-versioned fact table.

In [8]:
INSERT INTO factpurchases
SELECT pli.Id as PurchaseLineItemId, p.Id as PurchaseId, pj.PurchasingUsersId, p.PurchaseLocationId, d.dateInt as TransactionCreatedOnDateId, 
        pli.Quantity, pli.TotalPrice, i.ItemDescription 
    FROM apppurchases as p
    JOIN apppurchaselineitems as pli ON p.Id = pli.PurchaseId
    JOIN apppurchaseuser as pj ON p.Id = pj.PurchasesId
    JOIN appinventory as i ON pli.ItemId = i.Id
    JOIN dimdate as d ON CAST(p.TransactionCreatedOn as date) = d.CalendarDate
EXCEPT
SELECT * FROM factpurchases;

## Rentals Fact

Rows in the Rentals table can change (a rental can be returned, which shows up as an update), so we build a Kimball-style fact table with a DateModified column that identifies the most recently added row for a given rental. In a traditional data warehouse you would then query this table only through a view that returns just the latest row per rental.

Both sides of the EXCEPT select current_timestamp as DateModified. Because the value is identical on both sides within a single query, the column is effectively ignored in the comparison, yet it is still populated on any new row that gets inserted.
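
The effect of the shared current_timestamp can be seen in a small sketch with inline VALUES. The src and tgt relations and the id and status columns are hypothetical stand-ins for the rental query and factrentals.

```sql
-- Both sides add the same current_timestamp, so it never makes two rows differ
SELECT id, status, current_timestamp AS DateModified
FROM VALUES (1, 'rented'), (2, 'returned') AS src(id, status)
EXCEPT
SELECT id, status, current_timestamp AS DateModified
FROM VALUES (1, 'rented'), (2, 'rented') AS tgt(id, status);
-- Only rental 2 comes back: its status changed, and it carries the load time as DateModified.
```

Had the right-hand side selected the stored DateModified instead, every row would differ from its counterpart and the whole source would be inserted again on each run.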

In [11]:
INSERT INTO factrentals
SELECT r.Id as RentalId, r.MovieId, r.UserId, pur.PurchaseLocationId as RentalLocationId,
    d.dateInt as RentalDateId, dr.dateInt as ExpectedReturnDateId, drt.dateInt as ReturnDateId,
    purt.PurchaseLocationId as ReturnLocationId, rt.LateDays,
    pr.TotalPrice as RentalPrice, prt.TotalPrice as LateFee,
    -- Reference the source columns directly rather than the RentalPrice/LateFee aliases defined in this same SELECT
    (pr.TotalPrice + COALESCE(prt.TotalPrice, 0)) as TotalPrice, current_timestamp as DateModified
    FROM apprentals as r
    JOIN apppurchaselineitems as pr ON r.PurchaseLineItemId = pr.Id
    JOIN apppurchases as pur ON pr.PurchaseId = pur.Id
    JOIN dimdate as d ON CAST(r.RentalDate as date) = d.CalendarDate
    JOIN dimdate as dr on CAST(r.ExpectedReturnDate as date) = dr.CalendarDate
    LEFT JOIN appreturns as rt ON r.Id = rt.RentalId
    LEFT JOIN apppurchaselineitems as prt ON rt.LateChargeLineItemId = prt.Id
    LEFT JOIN apppurchases as purt ON prt.PurchaseId = purt.Id
    LEFT JOIN dimdate as drt ON drt.CalendarDate = CAST(rt.ReturnDate as date)
EXCEPT
SELECT RentalId, MovieId, UserId, RentalLocationId, RentalDateId, ExpectedReturnDateId, ReturnDateId,
    ReturnLocationId, LateDays, RentalPrice, LateFee, TotalPrice, current_timestamp as DateModified
FROM factrentals;
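
The view mentioned above, which exposes only the latest row per rental, could look roughly like the following. The view name vw_factrentals_latest is an assumption made for this sketch; the notebook does not define such a view.

```sql
-- Hypothetical view: keeps the row with the newest DateModified for each RentalId
CREATE OR REPLACE VIEW vw_factrentals_latest AS
SELECT RentalId, MovieId, UserId, RentalLocationId, RentalDateId,
       ExpectedReturnDateId, ReturnDateId, ReturnLocationId, LateDays,
       RentalPrice, LateFee, TotalPrice, DateModified
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY RentalId ORDER BY DateModified DESC) AS rn
    FROM factrentals
) AS ranked
WHERE rn = 1;
```

Reports would then read from the view, while factrentals itself keeps every version of a rental for auditing.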

