## 99 - Cleanup: Remove Bad Records from Silver and Gold Layers

This notebook deletes records with invalid or placeholder values (e.g., `1970-01-01` timestamps) from the Silver and Gold Delta tables to preserve data quality and integrity.

### Purpose
Remove rows with corrupted or default (Unix epoch) timestamps from the GTFS real-time and weather datasets so that downstream analytics and visualizations stay accurate.

### Workflow Summary
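- Deletes GTFS real-time records with `event_date = '1970-01-01'`
- Deletes weather records whose `forecast_time` is null or falls on `1970-01-01`
- Applies the GTFS real-time deletion to the Silver table and both Gold tables; weather rows are deleted from the Silver table only
- Validates each cleanup by counting the rows that still match the delete condition (expected: 0)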
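
The steps above could also be expressed as one table-driven loop. The following is a minimal sketch, not the notebook's actual implementation: the paths and predicates are copied from the cells below, while `EPOCH_DATE`, `CLEANUP_TARGETS`, `build_delete_sql`, and `run_cleanup` are hypothetical names introduced here for illustration. It assumes a `spark` session like the one Databricks provides in a notebook.

```python
# Hypothetical table-driven version of this notebook's cleanup steps.
# Paths and predicates mirror the cells below; all helper names are illustrative.
EPOCH_DATE = "1970-01-01"

# (Delta table path, predicate that identifies bad rows)
CLEANUP_TARGETS = [
    ("dbfs:/silver/gtfs_rt", f"event_date = '{EPOCH_DATE}'"),
    ("dbfs:/gold/gtfs_rt_enriched", f"event_date = '{EPOCH_DATE}'"),
    ("dbfs:/gold/gtfs_rt_weather_joined", f"event_date = '{EPOCH_DATE}'"),
    (
        "dbfs:/silver/weather",
        f"forecast_time IS NULL OR DATE(forecast_time) = '{EPOCH_DATE}'",
    ),
]


def build_delete_sql(path, predicate):
    """Return the DELETE statement for one path-addressed Delta table."""
    return f"DELETE FROM delta.`{path}` WHERE {predicate}"


def run_cleanup(spark, targets=CLEANUP_TARGETS):
    """Delete bad rows from each target, then count rows still matching.

    Returns a dict mapping each path to its remaining bad-row count.
    """
    remaining = {}
    for path, predicate in targets:
        spark.sql(build_delete_sql(path, predicate))
        # Re-read the table with the same predicate to confirm the delete.
        remaining[path] = (
            spark.read.format("delta").load(path).filter(predicate).count()
        )
    return remaining
```

Calling `run_cleanup(spark)` would run all four deletions and return the remaining counts, each of which should be 0. Keeping the predicate in one place means the delete and its verification cannot drift apart.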


In [0]:
# Delete corrupted GTFS-RT Silver rows with Unix epoch default date
spark.sql("""
DELETE FROM delta.`dbfs:/silver/gtfs_rt`
WHERE event_date = '1970-01-01'
""")


In [0]:
# Confirm deletion was successful (should return 0)
spark.read.format("delta").load("dbfs:/silver/gtfs_rt") \
    .filter("event_date = '1970-01-01'") \
    .count()


In [0]:
# Delete matching bad records in GTFS-RT Enriched Gold layer
spark.sql("""
DELETE FROM delta.`dbfs:/gold/gtfs_rt_enriched`
WHERE event_date = '1970-01-01'
""")


In [0]:
# Delete bad records from GTFS-RT + Weather joined table
spark.sql("""
DELETE FROM delta.`dbfs:/gold/gtfs_rt_weather_joined`
WHERE event_date = '1970-01-01'
""")


In [0]:
# Confirm deletion in both Gold tables (each count should be 0)
for path in ["dbfs:/gold/gtfs_rt_enriched", "dbfs:/gold/gtfs_rt_weather_joined"]:
    remaining = spark.read.format("delta").load(path).filter("event_date = '1970-01-01'").count()
    print(f"{path}: remaining 1970-01-01 rows = {remaining}")


In [0]:
# Delete weather records with null or default forecast_time
spark.sql("""
DELETE FROM delta.`dbfs:/silver/weather`
WHERE forecast_time IS NULL OR DATE(forecast_time) = '1970-01-01'
""")


In [0]:
# Confirm deletion of bad weather rows (should return 0)
spark.read.format("delta").load("dbfs:/silver/weather") \
    .filter("forecast_time IS NULL OR DATE(forecast_time) = '1970-01-01'") \
    .count()
