# Events silver

Checks to be performed:
- Primary key is unique
- Foreign key (`team_id`) refers to an existing team
- Time information can be correctly converted to Timestamp type
- Logic checks (start date is not after end date; coordinates are not half-inserted, i.e. latitude and longitude are either both present or both missing)
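
The timestamp and logic rules above can be stated per row in plain Python. This is only a sketch of the rules, not the implementation inside `modules.data_quality`; the helper names and the `latitude` / `longitude` column names are assumptions, and timestamps are assumed to arrive as ISO-8601 strings.

```python
from datetime import datetime
from typing import List, Optional


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Return a datetime for an ISO-8601 string, or None if it cannot be parsed."""
    if value is None:
        return None
    try:
        return datetime.fromisoformat(value)
    except (ValueError, TypeError):
        return None


def logic_violations(row: dict) -> List[str]:
    """List the logic rules broken by one event row (empty list means the row is fine)."""
    problems = []

    start = parse_timestamp(row.get("event_start"))
    end = parse_timestamp(row.get("event_end"))
    # Only compare when both timestamps are present and parseable.
    if start is not None and end is not None and start > end:
        problems.append("start_after_end")

    # "latitude" and "longitude" are assumed column names.
    has_lat = row.get("latitude") is not None
    has_lon = row.get("longitude") is not None
    if has_lat != has_lon:
        problems.append("half_inserted_coordinates")

    return problems


bad_row = {
    "event_start": "2024-05-02 10:00:00",
    "event_end": "2024-05-01 10:00:00",
    "latitude": 45.07,
    "longitude": None,
}
violations = logic_violations(bad_row)
```

Here `violations` lists both `start_after_end` and `half_inserted_coordinates`, so a row can be reported together with the reason it was rejected.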

Enrichment: 
- Date information extracted from the start/end timestamps
- Place information derived from latitude and longitude
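
As a rough illustration of the two enrichment steps, the sketch below adds date columns and a place label to a single row. It is not the code behind `enrich_geography_by_coordinates`: the function name, the output column names, the `latitude` / `longitude` names and the `reverse_geocode` callable are all hypothetical. In the notebook the lookup is presumably backed by `geopy`; here it is passed in so that the example runs without network access.

```python
from datetime import datetime
from typing import Callable, Optional


def enrich_event(row: dict,
                 reverse_geocode: Callable[[float, float], Optional[str]]) -> dict:
    """Return a copy of `row` with date columns and a place label added."""
    enriched = dict(row)

    # Date information: keep only the calendar date of each timestamp.
    for column in ("event_start", "event_end"):
        ts = row.get(column)
        enriched[column + "_date"] = ts.date() if ts is not None else None

    # Place information: only look up rows that have both coordinates.
    lat, lon = row.get("latitude"), row.get("longitude")
    if lat is not None and lon is not None:
        enriched["place"] = reverse_geocode(lat, lon)
    else:
        enriched["place"] = None
    return enriched


def fake_reverse_geocode(lat: float, lon: float) -> Optional[str]:
    """Stand-in for a real geocoder; always returns the same label."""
    return "Turin"


event = {
    "event_start": datetime(2024, 5, 1, 10, 0),
    "event_end": None,
    "latitude": 45.07,
    "longitude": 7.69,
}
enriched_event = enrich_event(event, fake_reverse_geocode)
```

Passing the geocoder as an argument keeps the row logic testable and makes it easy to swap in a real service later.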

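Data is to be stored in an SCD1 (Slowly Changing Dimension Type 1) table, i.e. changed rows overwrite the existing ones and no history is kept.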
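
SCD1 semantics amount to an upsert by key: a new version of a row replaces the stored one, and unseen keys are inserted. The snippet below is a minimal in-memory sketch of that behaviour; `scd1_upsert` is a hypothetical helper, not part of this notebook's modules.

```python
from typing import Dict, List


def scd1_upsert(target: Dict[object, dict], updates: List[dict],
                key: str = "event_id") -> Dict[object, dict]:
    """Return a new table (keyed by `key`) where each update overwrites its match."""
    result = dict(target)
    for row in updates:
        # Existing key: the old version is replaced, no history is kept.
        # New key: the row is simply inserted.
        result[row[key]] = row
    return result


current = {
    1: {"event_id": 1, "name": "Kick-off"},
    2: {"event_id": 2, "name": "Workshop"},
}
incoming = [
    {"event_id": 2, "name": "Workshop (rescheduled)"},
    {"event_id": 3, "name": "Retrospective"},
]
updated = scd1_upsert(current, incoming)
```

After the call, event 2 holds only its latest version and event 3 has been added. On Delta tables the same semantics are usually expressed with a `MERGE` statement.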

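N.B. In Unity Catalog (UC), primary and foreign keys can also be declared directly on the tables. Those key constraints are informational rather than enforced, though, and the constraints that are enforced (`NOT NULL`, `CHECK`) make the whole write operation fail or succeed as a unit. With the approach used here it is possible to identify the single rows that don't satisfy the conditions. The natural evolution of this could be a DLT (Delta Live Tables) implementation.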
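
The difference from an all-or-nothing constraint is that each row is routed individually. A minimal sketch of that idea for the foreign-key rule, using plain Python lists instead of DataFrames (`split_by_foreign_key` is a hypothetical name, not the function in `modules.data_quality`):

```python
from typing import Iterable, List, Tuple


def split_by_foreign_key(rows: List[dict], foreign_key: str,
                         valid_keys: Iterable) -> Tuple[List[dict], List[dict]]:
    """Split rows into (good, bad) depending on whether the key exists in the parent."""
    valid = set(valid_keys)
    good, bad = [], []
    for row in rows:
        # A missing key (None) is treated as a violation as well.
        if row.get(foreign_key) in valid:
            good.append(row)
        else:
            bad.append(row)
    return good, bad


events = [
    {"event_id": 1, "team_id": 10},
    {"event_id": 2, "team_id": 99},   # no such team
    {"event_id": 3, "team_id": None},
]
team_ids = [10, 20]
good_events, orphan_events = split_by_foreign_key(events, "team_id", team_ids)
```

Only event 1 ends up in `good_events`; the other two are kept in `orphan_events` instead of failing the whole write.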

In [0]:
%pip install geopy

In [0]:
from modules.geo_info import enrich_geography_by_coordinates
from modules.data_quality import (check_data_quality_id,
                                  check_data_quality_foreign_keys,
                                  check_data_quality_timestamps,
                                  check_data_quality_events_table)


In [0]:
source_catalog = "hive_metastore"
source_schema = "default"
source_table_name = "events"
cross_check_schema = "default"
cross_check_table_name = "teams"

source_table_reference = f"{source_catalog}.{source_schema}.{source_table_name}"

cross_check_teams_reference = f"{source_catalog}.{cross_check_schema}.{cross_check_table_name}"

In [0]:
events_df = spark.table(source_table_reference)#.limit(0)
teams_df = spark.table(cross_check_teams_reference)

In [0]:
events_df, bad_formed_df = check_data_quality_id(events_df, "event_id")

In [0]:
check_list = [{"foreign_key_column":"team_id",
              "cross_check_table": teams_df,
              "cross_check_primary_key_column": "team_id"}]
events_df, bad_formed_df = check_data_quality_foreign_keys(events_df, check_list)

In [0]:
events_df, bad_formed_df = check_data_quality_timestamps(
    events_df, ["event_start", "event_end", "created_at"], "event_id"
)

In [0]:
events_df, bad_formed_df = check_data_quality_events_table(events_df)

One row in this table has inconsistent start/end dates. For now the poor-quality rows are only identified. Several follow-ups are possible; ideally, all bad-quality rows from one refresh cycle would be collected and saved in a table for further analysis.
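
One possible shape for that follow-up is to run the checks as a pipeline and tag every rejected row with the check that rejected it. The sketch below uses plain Python lists and toy checks; `run_checks` and `require` are hypothetical names, and each check is assumed to return a `(good, bad)` pair like the functions used in this notebook.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Check = Callable[[List[Row]], Tuple[List[Row], List[Row]]]


def require(column: str) -> Check:
    """Build a toy check that rejects rows whose `column` is missing."""
    def check(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
        good = [r for r in rows if r.get(column) is not None]
        bad = [r for r in rows if r.get(column) is None]
        return good, bad
    return check


def run_checks(rows: List[Row],
               checks: List[Tuple[str, Check]]) -> Tuple[List[Row], List[Row]]:
    """Apply checks in order and collect every rejected row with its reason."""
    good = list(rows)
    rejected: List[Row] = []
    for name, check in checks:
        # Each check only sees the rows that passed the previous ones.
        good, bad = check(good)
        rejected.extend({**row, "failed_check": name} for row in bad)
    return good, rejected


rows = [
    {"event_id": 1, "team_id": 10},
    {"event_id": None, "team_id": 10},
    {"event_id": 3, "team_id": None},
]
clean_rows, rejected_rows = run_checks(
    rows,
    [("event_id_present", require("event_id")),
     ("team_id_present", require("team_id"))],
)
```

`rejected_rows` then holds both bad rows, each with a `failed_check` column, and could be appended to a quarantine table once per refresh. Note that in this sketch a row is recorded only under the first check it fails.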

In [0]:
bad_formed_df.display()

In [0]:
events_df = enrich_geography_by_coordinates(events_df)