# Event RSVPs silver

Checks to be performed:
- Primary key is unique
- Foreign keys refer to existing values in the events and memberships tables
- Time information can be correctly converted to Timestamp type
- People who responded are part of the team
- Answers are in the correct range

Enrichment: 
- Date information extracted from the responded_at timestamp.

Data is stored in an SCD1 table. The assumption here (simplistic, but enough to give a sample of data writing) is that if a row is updated in the source database, it arrives with a new "responded_at" timestamp.
If that is not the case, the silver notebooks still show two possible update methods for batch data. I have implemented several others (also for streaming), but I do not know enough about how this data evolves in the source to make further assumptions.

N.B. In UC, constraints can also be declared directly on the tables, but primary and foreign key constraints are informational only (not enforced), and enforced constraints such as NOT NULL and CHECK make the write operation fail or succeed as a whole. With this approach it is possible to identify the single rows that do not satisfy the conditions; the natural evolution of this could be a DLT implementation. As mentioned in the data quality module, the next improvement would be to use DBX programmatically.

In [0]:
from modules.data_quality import (check_data_quality_id,
                                  check_data_quality_foreign_keys,
                                  check_data_quality_timestamps,
                                  check_data_quality_events_rsvps_table)
from modules.enrichment import create_integer_datekeys
from modules.write import (add_scd1_columns,
                           identify_new_and_updated_data,
                           merge_df_to_table)

In [0]:
dbutils.widgets.text("env", "dev")

In [0]:
environment = dbutils.widgets.get('env')
catalog = "use_case_" + environment
source_schema = "bronze_layer"
target_schema = "silver_layer"

source_table = "event_rsvps"
target_table = "event_rsvps_refined"
memberships_cross_ref_table = "memberships_refined"
events_cross_ref_table = "events_refined"

source_table_reference = f"{catalog}.{source_schema}.{source_table}"
target_table_reference = f"{catalog}.{target_schema}.{target_table}"
memberships_cross_ref_table_reference = f"{catalog}.{target_schema}.{memberships_cross_ref_table}"
events_cross_ref_table_reference = f"{catalog}.{target_schema}.{events_cross_ref_table}"


In [0]:
# source_catalog = "hive_metastore"
# source_schema = "default"
# source_table_name = "event_rsvps"

# target_catalog = "hive_metastore"
# target_schema = "default"
# target_table_name = "event_rsvps_silver"
# memberships_cross_ref_table_name = "memberships_silver"
# events_cross_ref_table_name = "events_silver"

# source_table_reference = source_catalog + "." + source_schema + "." +  source_table_name
# target_table_reference = target_catalog + "." + target_schema + "." + target_table_name
# memberships_cross_ref_table_reference = target_catalog + "." + target_schema + "." + memberships_cross_ref_table_name
# events_cross_ref_table_reference = target_catalog + "." + target_schema + "." + events_cross_ref_table_name


In [0]:
events_rsvps_df = spark.table(source_table_reference)
target_table_df = spark.table(target_table_reference)
memberships_df = spark.table(memberships_cross_ref_table_reference).where("system_is_active IS True")
events_df = spark.table(events_cross_ref_table_reference).where("system_is_active IS True")

In [0]:
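# Check the primary key; rows that fail are returned separately in bad_formed_df.
# Note: bad_formed_df is reassigned by each of the following checks.
events_rsvps_df, bad_formed_df = check_data_quality_id(events_rsvps_df, "event_rsvp_id")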
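
A minimal plain-Python sketch of the kind of row-level split a primary key check can perform, written without Spark so it runs anywhere. It is not the implementation of `check_data_quality_id`, which lives in `modules.data_quality` and is not shown here. The function `split_by_unique_id` and the `answer` field are hypothetical, and rejecting null keys together with every copy of a duplicated key is an assumption.

```python
from collections import Counter


def split_by_unique_id(rows, id_column):
    """Split dict rows into (valid, rejected) based on key uniqueness."""
    # Count how often each key value occurs across all rows.
    counts = Counter(row.get(id_column) for row in rows)
    valid, rejected = [], []
    for row in rows:
        key = row.get(id_column)
        # Assumption: null keys and all copies of a duplicated key are rejected.
        if key is not None and counts[key] == 1:
            valid.append(row)
        else:
            rejected.append(row)
    return valid, rejected


rows = [
    {"event_rsvp_id": 1, "answer": "yes"},
    {"event_rsvp_id": 2, "answer": "no"},
    {"event_rsvp_id": 2, "answer": "yes"},   # duplicated key
    {"event_rsvp_id": None, "answer": "no"},  # missing key
]
valid_rows, rejected_rows = split_by_unique_id(rows, "event_rsvp_id")
```

Returning the rejected rows instead of raising keeps the remaining rows flowing through the pipeline, which is the point of checking row by row.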

In [0]:
check_list = [{"foreign_key_column": "event_id",
               "cross_check_table": events_df,
               "cross_check_primary_key_column": "event_id"},
              {"foreign_key_column": "membership_id",
               "cross_check_table": memberships_df,
               "cross_check_primary_key_column": "membership_id"}]
events_rsvps_df, bad_formed_df = check_data_quality_foreign_keys(events_rsvps_df, check_list)
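
The foreign key check above can be pictured with the following plain-Python sketch. It only illustrates the idea and is not the code in `modules.data_quality`. The function `split_by_foreign_keys` and the `existing_keys` field are hypothetical stand-ins for the cross-check tables, and the sample key values are made up.

```python
def split_by_foreign_keys(rows, checks):
    """Split dict rows into (valid, rejected) based on key membership.

    Each check holds a foreign key column name and the set of primary key
    values that exist in the referenced table.
    """
    valid, rejected = [], []
    for row in rows:
        # A row is valid only if every foreign key points at an existing value.
        is_valid = all(
            row.get(check["foreign_key_column"]) in check["existing_keys"]
            for check in checks
        )
        (valid if is_valid else rejected).append(row)
    return valid, rejected


checks = [
    {"foreign_key_column": "event_id", "existing_keys": {10, 11}},
    {"foreign_key_column": "membership_id", "existing_keys": {100, 101}},
]
rows = [
    {"event_rsvp_id": 1, "event_id": 10, "membership_id": 100},
    {"event_rsvp_id": 2, "event_id": 99, "membership_id": 101},   # unknown event
    {"event_rsvp_id": 3, "event_id": 11, "membership_id": None},  # missing membership
]
valid_rows, rejected_rows = split_by_foreign_keys(rows, checks)
```

In Spark the same split is commonly expressed with a `left_semi` join for the valid rows and a `left_anti` join for the rejected ones; whether the module does this is not known from this notebook.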

In [0]:
events_rsvps_df, bad_formed_df = check_data_quality_timestamps(events_rsvps_df,["responded_at"],"event_rsvp_id")
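
As a rough illustration of a timestamp conversion check, the sketch below tries to parse each listed column and rejects rows where parsing fails. The function `split_by_parsable_timestamps` is hypothetical, and the `%Y-%m-%d %H:%M:%S` format is an assumption; the format actually used in the source data is not stated in this notebook.

```python
from datetime import datetime


def split_by_parsable_timestamps(rows, timestamp_columns, fmt="%Y-%m-%d %H:%M:%S"):
    """Split dict rows into (valid, rejected) based on timestamp parsing."""
    valid, rejected = [], []
    for row in rows:
        try:
            for column in timestamp_columns:
                # Raises TypeError for None and ValueError for a bad format.
                datetime.strptime(row.get(column), fmt)
        except (TypeError, ValueError):
            rejected.append(row)
        else:
            valid.append(row)
    return valid, rejected


rows = [
    {"event_rsvp_id": 1, "responded_at": "2024-03-05 14:30:00"},
    {"event_rsvp_id": 2, "responded_at": "05/03/2024"},  # unexpected format
    {"event_rsvp_id": 3, "responded_at": None},          # missing value
]
valid_rows, rejected_rows = split_by_parsable_timestamps(rows, ["responded_at"])
```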

In [0]:
events_rsvps_df, bad_formed_df = check_data_quality_events_rsvps_table(events_rsvps_df,memberships_df,events_df)

In [0]:
events_rsvps_df = create_integer_datekeys(events_rsvps_df,["responded_at"])
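
A small sketch of what an integer date key can look like, assuming the common `yyyyMMdd` layout. Both the layout and the `_datekey` column suffix are assumptions; the output of `create_integer_datekeys` is not shown in this notebook, and the helper names below are hypothetical.

```python
from datetime import datetime


def to_integer_datekey(timestamp):
    """Convert a datetime to an integer date key in yyyyMMdd form."""
    return timestamp.year * 10000 + timestamp.month * 100 + timestamp.day


def add_integer_datekeys(row, timestamp_columns):
    """Return a copy of the row with one date key per timestamp column."""
    enriched = dict(row)
    for column in timestamp_columns:
        # Assumption: the new column is named "<column>_datekey".
        enriched[f"{column}_datekey"] = to_integer_datekey(row[column])
    return enriched


row = {"event_rsvp_id": 1, "responded_at": datetime(2024, 3, 5, 14, 30)}
enriched_row = add_integer_datekeys(row, ["responded_at"])
print(enriched_row["responded_at_datekey"])  # 20240305
```

An integer key of this form sorts chronologically and joins cheaply against a date dimension, which is the usual reason for adding it.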

In [0]:
events_rsvps_df = add_scd1_columns(events_rsvps_df, "responded_at")

In [0]:
events_rsvps_df = identify_new_and_updated_data(events_rsvps_df,target_table_df)

In [0]:
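# Update every column except system_created_at on matched rows.
update_columns_list = events_rsvps_df.columns
update_columns_list.remove("system_created_at")
# Rows are matched between source and target on the primary key.
key_columns_list = ["event_rsvp_id"]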

merge_df_to_table(spark, events_rsvps_df, target_table_reference, update_columns_list, key_columns_list)
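
To make the SCD1 behaviour concrete, here is a plain-Python sketch of an upsert under the assumption stated at the top of this notebook: an updated source row arrives with a newer "responded_at". It does not reproduce `identify_new_and_updated_data` or `merge_df_to_table`. The function `scd1_upsert` and the `answer` field are hypothetical, and keeping `system_created_at` from the existing row mirrors its exclusion from `update_columns_list`.

```python
from datetime import datetime


def scd1_upsert(target_rows, incoming_rows, key_column, version_column,
                preserved_columns=("system_created_at",)):
    """Insert new keys and overwrite existing ones when the incoming row is newer."""
    merged = {row[key_column]: dict(row) for row in target_rows}
    for row in incoming_rows:
        key = row[key_column]
        current = merged.get(key)
        if current is None:
            merged[key] = dict(row)  # new key: insert
        elif row[version_column] > current[version_column]:
            updated = dict(row)  # newer version: overwrite, no history kept (SCD1)
            for column in preserved_columns:
                updated[column] = current[column]
            merged[key] = updated
        # Otherwise the incoming row is not newer and is ignored.
    return list(merged.values())


target = [
    {"event_rsvp_id": 1, "answer": "no",
     "responded_at": datetime(2024, 3, 1), "system_created_at": datetime(2024, 3, 2)},
]
incoming = [
    {"event_rsvp_id": 1, "answer": "yes",
     "responded_at": datetime(2024, 3, 4), "system_created_at": datetime(2024, 3, 5)},
    {"event_rsvp_id": 2, "answer": "yes",
     "responded_at": datetime(2024, 3, 4), "system_created_at": datetime(2024, 3, 5)},
]
result = scd1_upsert(target, incoming, "event_rsvp_id", "responded_at")
by_id = {row["event_rsvp_id"]: row for row in result}
```

If source updates do not always carry a newer "responded_at", the version comparison above would miss them, which is the limitation the introduction already points out.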