# Teams silver

Checks to be performed:
- Primary key is unique
- Time information can be correctly converted to Timestamp type
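
The two checks above can be sketched in plain Python to show the row-level idea: rows that fail a check are set aside instead of failing the whole load. This is only an illustration, not the implementation in `modules.data_quality`; the helper names `split_by_unique_key` and `split_by_parseable_timestamp`, the timestamp format, and the sample rows are all assumptions.

```python
from collections import Counter
from datetime import datetime


def split_by_unique_key(rows, key):
    """Return (valid, rejected): rows whose key is non-null and occurs exactly once, and the rest."""
    counts = Counter(row.get(key) for row in rows)
    valid, rejected = [], []
    for row in rows:
        value = row.get(key)
        if value is not None and counts[value] == 1:
            valid.append(row)
        else:
            rejected.append(row)
    return valid, rejected


def split_by_parseable_timestamp(rows, column, fmt="%Y-%m-%d %H:%M:%S"):
    """Return (valid, rejected): rows whose column parses with fmt (converted), and the rest."""
    valid, rejected = [], []
    for row in rows:
        try:
            parsed = datetime.strptime(row[column], fmt)
        except (KeyError, TypeError, ValueError):
            rejected.append(row)
        else:
            valid.append({**row, column: parsed})
    return valid, rejected


# Hypothetical sample data: one bad timestamp and one duplicated key.
sample_rows = [
    {"team_id": 1, "created_at": "2024-01-05 10:00:00"},
    {"team_id": 2, "created_at": "not a timestamp"},
    {"team_id": 3, "created_at": "2024-01-06 11:30:00"},
    {"team_id": 3, "created_at": "2024-01-07 09:15:00"},
]

unique_rows, duplicate_rows = split_by_unique_key(sample_rows, "team_id")
clean_rows, unparseable_rows = split_by_parseable_timestamp(unique_rows, "created_at")
```

In this sketch both rows sharing a duplicated key are rejected; keeping one of them (for example the most recent) would be an equally valid policy.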

Enrichment:
- Date information extracted from created_at timestamp.
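
As a rough illustration of this enrichment, an integer date key can be derived from a timestamp as below. The `yyyyMMdd` encoding and the helper name `to_integer_datekey` are assumptions; the actual format produced by `create_integer_datekeys` may differ.

```python
from datetime import datetime


def to_integer_datekey(ts):
    """Encode the calendar date of a timestamp as an integer in yyyyMMdd form."""
    return ts.year * 10000 + ts.month * 100 + ts.day


# Hypothetical created_at value.
created_at = datetime(2024, 3, 9, 14, 25, 0)
datekey = to_integer_datekey(created_at)  # 20240309
```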

Data is stored in an SCD1 table. The assumption here (simplistic, but enough to give a sample of data writing) is that when a row is updated in the source database, it arrives with a new "created_at" timestamp.
If that is not the case, the silver notebooks still show two possible update methods for batch data. I implemented several others (also for streaming), but I do not know enough about how this data evolves in the source to make further assumptions.
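
The SCD1 assumption can be sketched on plain dictionaries: an incoming row overwrites the stored one only when it carries a newer "created_at", and no history is kept. This is a minimal sketch, not the logic in `modules.write`; the function name `scd1_merge`, the `name` column, and the sample rows are hypothetical.

```python
def scd1_merge(target, incoming, key, change_column):
    """Upsert incoming rows into a copy of target (a dict keyed by `key`).

    A row is inserted if its key is new, and overwrites the stored row only if
    its change_column value is greater. Overwriting in place is what makes
    this SCD1: the previous version is not kept.
    """
    merged = dict(target)
    for row in incoming:
        current = merged.get(row[key])
        if current is None or row[change_column] > current[change_column]:
            merged[row[key]] = row
    return merged


# ISO-formatted timestamp strings compare in chronological order.
target = {
    1: {"team_id": 1, "name": "Alpha", "created_at": "2024-01-05 10:00:00"},
    2: {"team_id": 2, "name": "Beta", "created_at": "2024-01-06 11:30:00"},
}
incoming = [
    {"team_id": 1, "name": "Alpha", "created_at": "2024-01-05 10:00:00"},  # unchanged
    {"team_id": 2, "name": "Beta United", "created_at": "2024-02-01 08:00:00"},  # updated
    {"team_id": 3, "name": "Gamma", "created_at": "2024-02-02 09:00:00"},  # new
]

result = scd1_merge(target, incoming, "team_id", "created_at")
```

Here team 1 is left untouched, team 2 is overwritten, and team 3 is inserted. If the source can change a row without changing "created_at", this comparison misses the update, which is why the assumption matters.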

N.B. In UC, constraints can also be declared directly on the tables, but an enforced constraint makes the write operation fail or succeed as a whole, and primary and foreign key constraints in particular are informational only and are not enforced. With the approach used here it is possible to identify the single rows that don't satisfy the conditions; the natural evolution of this could be a DLT implementation. As mentioned in the data quality module, the next improvement would be to use DBX programmatically.

In [0]:
from modules.data_quality import (check_data_quality_id,
                                  check_data_quality_timestamps,
                                  check_data_quality_teams_table)
from modules.enrichment import create_integer_datekeys
from modules.write import (add_scd1_columns,
                           identify_new_and_updated_data,
                           merge_df_to_table)

In [0]:
source_catalog = "hive_metastore"
source_schema = "default"
source_table_name = "teams"

target_catalog = "hive_metastore"
target_schema = "default"
target_table_name = "teams_silver"

source_table_reference = f"{source_catalog}.{source_schema}.{source_table_name}"
target_table_reference = f"{target_catalog}.{target_schema}.{target_table_name}"

In [0]:
teams_df = spark.table(source_table_reference)
target_table = spark.table(target_table_reference)

In [0]:
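# Each check returns the rows that passed and the rows that failed.
# The rejected rows get distinct names so that a later check does not overwrite them.
teams_df, bad_id_df = check_data_quality_id(teams_df, "team_id")

In [0]:
teams_df, bad_timestamp_df = check_data_quality_timestamps(teams_df, ["created_at"], "team_id")

In [0]:
teams_df, bad_teams_df = check_data_quality_teams_table(teams_df)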

In [0]:
teams_df = create_integer_datekeys(teams_df, ["created_at"])

In [0]:
teams_df = add_scd1_columns(teams_df, "created_at")

In [0]:
teams_df = identify_new_and_updated_data(teams_df, target_table)

In [0]:
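# On update, overwrite every column except system_created_at.
update_columns_list = teams_df.columns
update_columns_list.remove("system_created_at")
# Rows are matched between source and target on the primary key.
key_columns_list = ["team_id"]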

merge_df_to_table(spark, teams_df, target_table_reference, update_columns_list, key_columns_list)