# 🥈 Silver Layer: Enhanced Data Engineering with Temporal Features

**Purpose:** This notebook transforms Bronze data into a clean, enriched Silver table with advanced temporal features for comprehensive analytics and ML readiness.

**Key Features:**
- ✅ Data cleaning and column standardization
- ✅ Advanced temporal features (day of week, holidays, seasons)
- ✅ US holiday detection system
- ✅ Weekend and seasonal classifications
- ✅ Single source of truth for all analytics

**Pipeline:** Bronze (33 cols) → **Enhanced Silver (17 cols)** → Gold (ML)

**Source Table:** `default.bronze_flights_data`
**Output Table:** `default.silver_flights_processed` (Enhanced with temporal intelligence)

In [0]:
!pip install holidays

Collecting holidays
  Downloading holidays-0.85-py3-none-any.whl.metadata (50 kB)
Downloading holidays-0.85-py3-none-any.whl (1.3 MB)
Installing collected packages: holidays
Successfully installed holidays-0.85
Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
# Standard library: date arithmetic for the holiday windows
from datetime import datetime, timedelta

# Holiday detection
import holidays

# Core PySpark imports (cleaning, aggregation, and temporal feature functions)
from pyspark.sql.functions import (
    broadcast,
    col,
    avg,
    count,
    datediff,
    dayofmonth,
    dayofweek,
    expr,
    isnan,
    lit,
    month,
    to_date,
    trim,
    upper,
    weekofyear,
    when,
    year,
)
from pyspark.sql.types import BooleanType, DateType

print("Extra Silver imports loaded (data cleaning + temporal feature engineering)")

Extra Silver imports loaded (data cleaning + temporal feature engineering)


In [0]:
df_bronze = spark.table("default.bronze_flights_data")

In [0]:

# Register a temp view of the Bronze table to inspect exactly which columns will be removed during cleaning and preprocessing
df_bronze.createOrReplaceTempView(
    "bronze_temp"
)
display(
    spark.sql(
        "SELECT * FROM bronze_temp LIMIT 10"
    )
)

FL_DATE,AIRLINE,AIRLINE_DOT,AIRLINE_CODE,DOT_CODE,FL_NUMBER,ORIGIN,ORIGIN_CITY,DEST,DEST_CITY,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,CANCELLED,CANCELLATION_CODE,DIVERTED,CRS_ELAPSED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,DELAY_DUE_CARRIER,DELAY_DUE_WEATHER,DELAY_DUE_NAS,DELAY_DUE_SECURITY,DELAY_DUE_LATE_AIRCRAFT,bronze_ingestion_timestamp
2019-05-03,Southwest Airlines Co.,Southwest Airlines Co.: WN,WN,19393,1861,STL,"St. Louis, MO",TUL,"Tulsa, OK",2200,2359.0,119.0,10.0,9.0,105.0,4.0,2315,109.0,114.0,0.0,,0.0,75.0,70.0,56.0,351.0,114.0,0.0,0.0,0.0,0.0,2025-11-28T19:15:28.621Z
2022-04-04,Southwest Airlines Co.,Southwest Airlines Co.: WN,WN,19393,805,MSP,"Minneapolis, MN",BWI,"Baltimore, MD",1815,1850.0,35.0,10.0,1900.0,2152.0,17.0,2135,2209.0,34.0,0.0,,0.0,140.0,139.0,112.0,936.0,0.0,0.0,0.0,0.0,34.0,2025-11-28T19:15:28.621Z
2022-04-25,PSA Airlines Inc.,PSA Airlines Inc.: OH,OH,20397,5260,FAY,"Fayetteville, NC",CLT,"Charlotte, NC",1303,1255.0,-8.0,16.0,1311.0,1350.0,6.0,1410,1356.0,-14.0,0.0,,0.0,67.0,61.0,39.0,118.0,,,,,,2025-11-28T19:15:28.621Z
2020-09-26,Envoy Air,Envoy Air: MQ,MQ,20398,3755,ORD,"Chicago, IL",FSD,"Sioux Falls, SD",1430,1422.0,-8.0,23.0,1445.0,1557.0,5.0,1612,1602.0,-10.0,0.0,,0.0,102.0,100.0,72.0,463.0,,,,,,2025-11-28T19:15:28.621Z
2022-07-13,Republic Airline,Republic Airline: YX,YX,20452,5857,BNA,"Nashville, TN",BOS,"Boston, MA",620,616.0,-4.0,14.0,630.0,935.0,9.0,958,944.0,-14.0,0.0,,0.0,158.0,148.0,125.0,942.0,,,,,,2025-11-28T19:15:28.621Z
2021-05-15,Southwest Airlines Co.,Southwest Airlines Co.: WN,WN,19393,2580,LAS,"Las Vegas, NV",COS,"Colorado Springs, CO",1910,1916.0,6.0,11.0,1927.0,2145.0,4.0,2200,2149.0,-11.0,0.0,,0.0,110.0,93.0,78.0,604.0,,,,,,2025-11-28T19:15:28.621Z
2023-06-01,American Airlines Inc.,American Airlines Inc.: AA,AA,19805,2275,STL,"St. Louis, MO",PHL,"Philadelphia, PA",1435,1437.0,2.0,9.0,1446.0,1738.0,10.0,1744,1748.0,4.0,0.0,,0.0,129.0,131.0,112.0,814.0,,,,,,2025-11-28T19:15:28.621Z
2021-01-20,Horizon Air,Horizon Air: QX,QX,19687,2426,GEG,"Spokane, WA",BOI,"Boise, ID",1825,1813.0,-12.0,7.0,1820.0,2009.0,4.0,2035,2013.0,-22.0,0.0,,0.0,70.0,60.0,49.0,287.0,,,,,,2025-11-28T19:15:28.621Z
2020-04-03,Mesa Airlines Inc.,Mesa Airlines Inc.: YV,YV,20378,5837,PHX,"Phoenix, AZ",MRY,"Monterey, CA",2010,2000.0,-10.0,14.0,2014.0,2141.0,2.0,2204,2143.0,-21.0,0.0,,0.0,114.0,103.0,87.0,598.0,,,,,,2025-11-28T19:15:28.621Z
2019-01-22,American Airlines Inc.,American Airlines Inc.: AA,AA,19805,2817,DFW,"Dallas/Fort Worth, TX",ORD,"Chicago, IL",1215,,,,,,,1433,,,1.0,B,0.0,138.0,,,802.0,,,,,,2025-11-28T19:15:28.621Z


In [0]:
column_count = len(df_bronze.columns)

print(f"The bronze DataFrame has {column_count} columns.")

The bronze DataFrame has 33 columns.


In [0]:
print("📋 Bronze Table Schema:")
df_bronze.printSchema()

📋 Bronze Table Schema:
root
 |-- FL_DATE: date (nullable = true)
 |-- AIRLINE: string (nullable = true)
 |-- AIRLINE_DOT: string (nullable = true)
 |-- AIRLINE_CODE: string (nullable = true)
 |-- DOT_CODE: integer (nullable = true)
 |-- FL_NUMBER: integer (nullable = true)
 |-- ORIGIN: string (nullable = true)
 |-- ORIGIN_CITY: string (nullable = true)
 |-- DEST: string (nullable = true)
 |-- DEST_CITY: string (nullable = true)
 |-- CRS_DEP_TIME: integer (nullable = true)
 |-- DEP_TIME: double (nullable = true)
 |-- DEP_DELAY: double (nullable = true)
 |-- TAXI_OUT: double (nullable = true)
 |-- WHEELS_OFF: double (nullable = true)
 |-- WHEELS_ON: double (nullable = true)
 |-- TAXI_IN: double (nullable = true)
 |-- CRS_ARR_TIME: integer (nullable = true)
 |-- ARR_TIME: double (nullable = true)
 |-- ARR_DELAY: double (nullable = true)
 |-- CANCELLED: double (nullable = true)
 |-- CANCELLATION_CODE: string (nullable = true)
 |-- DIVERTED: double (nullable = true)
 |-- CRS_ELAPSED_TIME

In [0]:
# Columns to drop from the Bronze table (not needed in the Silver layer)
columns_to_drop = [
    "AIRLINE_DOT",
    "DOT_CODE",
    "FL_NUMBER",
    "ORIGIN_CITY",
    "DEST_CITY",
    "CRS_DEP_TIME",
    "DEP_TIME",
    "DEP_DELAY",
    "TAXI_OUT",
    "WHEELS_OFF",
    "WHEELS_ON",
    "TAXI_IN",
    "CRS_ARR_TIME",
    "ARR_TIME",
    "CANCELLED",
    "CANCELLATION_CODE",
    "DIVERTED",
    "CRS_ELAPSED_TIME",
    "ELAPSED_TIME",
    "AIR_TIME",
    "DISTANCE",
    "DELAY_DUE_CARRIER",
    "DELAY_DUE_WEATHER",
    "DELAY_DUE_NAS",
    "DELAY_DUE_SECURITY",
    "DELAY_DUE_LATE_AIRCRAFT",
    "bronze_ingestion_timestamp",
]

df_silver = df_bronze.drop(*columns_to_drop)

# 1. Print the new schema to see what's left
print("📋 New Silver Table Schema (after dropping columns):")
df_silver.printSchema()

# 2. Show a sample of the new DataFrame
print("\n🔎 Sample data from the new Silver Table:")
df_silver.show(5)

📋 New Silver Table Schema (after dropping columns):
root
 |-- FL_DATE: date (nullable = true)
 |-- AIRLINE: string (nullable = true)
 |-- AIRLINE_CODE: string (nullable = true)
 |-- ORIGIN: string (nullable = true)
 |-- DEST: string (nullable = true)
 |-- ARR_DELAY: double (nullable = true)


🔎 Sample data from the new Silver Table:
+----------+--------------------+------------+------+----+---------+
|   FL_DATE|             AIRLINE|AIRLINE_CODE|ORIGIN|DEST|ARR_DELAY|
+----------+--------------------+------------+------+----+---------+
|2019-05-03|Southwest Airline...|          WN|   STL| TUL|    114.0|
|2022-04-04|Southwest Airline...|          WN|   MSP| BWI|     34.0|
|2022-04-25|   PSA Airlines Inc.|          OH|   FAY| CLT|    -14.0|
|2020-09-26|           Envoy Air|          MQ|   ORD| FSD|    -10.0|
|2022-07-13|    Republic Airline|          YX|   BNA| BOS|    -14.0|
+----------+--------------------+------------+------+----+---------+
only showing top 5 rows


In [0]:
column_count = len(df_silver.columns)

print(f"The silver DataFrame has {column_count} columns.")

The silver DataFrame has 6 columns.


In [0]:
# 1. Ensure the flight date is a proper date column
#    (FL_DATE is already a date in Bronze, so to_date acts as a safeguard)
df_silver_data = df_silver.withColumn("flight_date", to_date(col("FL_DATE")))

# 2. Extract month and year into new columns
df_silver_data = df_silver_data.withColumn("flight_month", month(col("flight_date")))
df_silver_data = df_silver_data.withColumn("flight_year", year(col("flight_date")))

# 3. Drop the original FL_DATE column, now replaced by flight_date
df_silver_data = df_silver_data.drop("FL_DATE")

print("New Silver Table Schema (with date columns):")
df_silver_data.printSchema()

df_silver = df_silver_data

New Silver Table Schema (with date columns):
root
 |-- AIRLINE: string (nullable = true)
 |-- AIRLINE_CODE: string (nullable = true)
 |-- ORIGIN: string (nullable = true)
 |-- DEST: string (nullable = true)
 |-- ARR_DELAY: double (nullable = true)
 |-- flight_date: date (nullable = true)
 |-- flight_month: integer (nullable = true)
 |-- flight_year: integer (nullable = true)



In [0]:
all_columns = df_silver.columns
# Find just the float/double columns
numeric_cols = [c_name for (c_name, c_type) in df_silver.dtypes if c_type in ("float", "double")]

# Get all *other* columns
other_cols = [c_name for c_name in all_columns if c_name not in numeric_cols]

# Create expressions for numeric columns (check for null OR nan)
numeric_expressions = [count(when(col(c).isNull() | isnan(c), c)).alias(c) for c in numeric_cols]

# Create expressions for all other columns (check for null only)
other_expressions = [count(when(col(c).isNull(), c)).alias(c) for c in other_cols]

# Combine the lists of expressions
all_expressions = numeric_expressions + other_expressions

# Run the counts and show the result
print("Missing value counts per column:")
df_silver.select(*all_expressions).show()

Missing value counts per column:
+---------+-------+------------+------+----+-----------+------------+-----------+
|ARR_DELAY|AIRLINE|AIRLINE_CODE|ORIGIN|DEST|flight_date|flight_month|flight_year|
+---------+-------+------------+------+----+-----------+------------+-----------+
|    86198|      0|           0|     0|   0|          0|           0|          0|
+---------+-------+------------+------+----+-----------+------------+-----------+
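
The cell above treats numeric and non-numeric columns differently because `isnan` is only meaningful for float/double columns, while `isNull` applies to every type. The plain-Python sketch below mirrors that rule on a handful of made-up values; the `is_missing` helper and the sample list are illustrative assumptions, not part of the pipeline.

```python
import math


def is_missing(value):
    """Treat None as missing for any type, and NaN as missing for floats only."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


# Made-up sample resembling an arrival-delay column
sample_delays = [114.0, None, -14.0, float("nan"), 34.0]
missing_count = sum(is_missing(v) for v in sample_delays)
print(missing_count)
```

A string such as an airline code can never be NaN, so only the `None` check applies to it.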



In [0]:
# df_silver = df_silver.fillna(0, subset=["ARR_DELAY"])
# We do NOT fill or drop missing values in the arrival delay column: they indicate
# that the flight was cancelled or otherwise did not arrive. These values are
# handled in the Gold_table notebook, where the prediction target columns are built.

In [0]:
# Step 4: Clean and rename columns
df_silver_clean = (
    df_silver.withColumnRenamed("AIRLINE", "airline_name")
    .withColumnRenamed("AIRLINE_CODE", "airline_code")
    .withColumn("airline_code", trim(upper(col("airline_code"))))
    .withColumnRenamed("ORIGIN", "origin_airport_code")
    .withColumn("origin_airport_code", trim(upper(col("origin_airport_code"))))
    .withColumnRenamed("DEST", "destination_airport_code")
    .withColumn("destination_airport_code", trim(upper(col("destination_airport_code"))))
    .withColumnRenamed("ARR_DELAY", "arrival_delay")
)

print("✅ Basic data cleaning completed")
print(f"Clean columns: {len(df_silver_clean.columns)}")
print("\n📋 Clean Silver Schema:")
df_silver_clean.printSchema()

✅ Basic data cleaning completed
Clean columns: 8

📋 Clean Silver Schema:
root
 |-- airline_name: string (nullable = true)
 |-- airline_code: string (nullable = true)
 |-- origin_airport_code: string (nullable = true)
 |-- destination_airport_code: string (nullable = true)
 |-- arrival_delay: double (nullable = true)
 |-- flight_date: date (nullable = true)
 |-- flight_month: integer (nullable = true)
 |-- flight_year: integer (nullable = true)



## 🎄 Advanced Temporal Feature Engineering

In [0]:
print("🎄 Creating US holiday detection system...")

# Get the year range from the data in a single aggregation pass.
# (A dict such as {"flight_year": "min", "flight_year": "max"} keeps only the
# last duplicate key, so it cannot return both values.)
year_bounds = df_silver_clean.selectExpr(
    "min(flight_year) AS min_year", "max(flight_year) AS max_year"
).first()
min_year, max_year = int(year_bounds["min_year"]), int(year_bounds["max_year"])
print(f"Data spans: {min_year} to {max_year}")

# Generate US holidays for all years in dataset.
# The loop variable is named holiday_year so it does not shadow pyspark's year().
all_holidays = []
for holiday_year in range(min_year, max_year + 1):
    year_holidays = holidays.UnitedStates(years=holiday_year)
    all_holidays.extend(list(year_holidays.keys()))

print(f"✅ Generated {len(all_holidays)} US federal holiday dates")

# Create holiday DataFrames for efficient joins
holidays_df = spark.createDataFrame([(holiday_date,) for holiday_date in all_holidays], ["holiday_date"])

# Create extended holiday periods for proximity detection
near_holidays = []
period_holidays = []

for holiday in all_holidays:
    # Near holiday (±3 days)
    for offset in range(-3, 4):
        near_holidays.append(holiday + timedelta(days=offset))

    # Holiday period (±7 days)
    for offset in range(-7, 8):
        period_holidays.append(holiday + timedelta(days=offset))

# Remove duplicates and create DataFrames
near_holidays_df = spark.createDataFrame([(date,) for date in set(near_holidays)], ["near_holiday_date"])

period_holidays_df = spark.createDataFrame([(date,) for date in set(period_holidays)], ["period_holiday_date"])

print(f"✅ Holiday proximity periods created")
print(f"Near-holiday dates: {len(set(near_holidays))}")
print(f"Holiday-period dates: {len(set(period_holidays))}")

🎄 Creating US holiday detection system...
Data spans: 2019 to 2023
✅ Generated 62 US federal holiday dates
✅ Holiday proximity periods created
Near-holiday dates: 378
Holiday-period dates: 762
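
The proximity counts above are smaller than 62 × 7 = 434 and 62 × 15 = 930 because windows around neighbouring holidays overlap and `set(...)` keeps each date only once. Here is a minimal standard-library sketch of that effect. It uses two hand-picked dates (25 December 2022 and 1 January 2023) rather than the `holidays` package, and `expand_window` is a hypothetical helper name.

```python
from datetime import date, timedelta


def expand_window(holiday_dates, radius_days):
    """Return the set of dates within +/- radius_days of any holiday."""
    window = set()
    for holiday in holiday_dates:
        for offset in range(-radius_days, radius_days + 1):
            window.add(holiday + timedelta(days=offset))
    return window


# Two hand-picked holidays, one week apart
sample_holidays = [date(2022, 12, 25), date(2023, 1, 1)]

near = expand_window(sample_holidays, 3)    # two 7-day windows that do not overlap
period = expand_window(sample_holidays, 7)  # two 15-day windows that do overlap

print(len(near), len(period))
```

Without de-duplication the ±7-day windows would contribute 30 entries; the set keeps the 22 distinct dates from 18 December through 8 January.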


In [0]:
print("📅 Adding comprehensive temporal features...")

# Step 1: Basic temporal features
# Spark's dayofweek returns 1 = Sunday through 7 = Saturday, so 1 and 7 mark the weekend.
df_temporal = (
    df_silver_clean.withColumn("day_of_week", dayofweek(col("flight_date")))
    .withColumn("week_of_year", weekofyear(col("flight_date")))
    .withColumn("day_of_month", dayofmonth(col("flight_date")))
    .withColumn("is_weekend", when((col("day_of_week") == 1) | (col("day_of_week") == 7), True).otherwise(False))
)

print("✅ Basic temporal features added")

# Step 2: Holiday detection using broadcast joins
df_with_holidays = (
    df_temporal.join(broadcast(holidays_df), col("flight_date") == col("holiday_date"), "left")
    .withColumn("is_holiday", when(col("holiday_date").isNotNull(), True).otherwise(False))
    .drop("holiday_date")
)

print("✅ Holiday detection completed")

# Step 3: Holiday proximity features
df_enhanced = (
    df_with_holidays.join(broadcast(near_holidays_df), col("flight_date") == col("near_holiday_date"), "left")
    .withColumn("is_near_holiday", when(col("near_holiday_date").isNotNull(), True).otherwise(False))
    .drop("near_holiday_date")
)

df_enhanced = (
    df_enhanced.join(broadcast(period_holidays_df), col("flight_date") == col("period_holiday_date"), "left")
    .withColumn("is_holiday_period", when(col("period_holiday_date").isNotNull(), True).otherwise(False))
    .drop("period_holiday_date")
)

print("✅ Holiday proximity features completed")

📅 Adding comprehensive temporal features...
✅ Basic temporal features added
✅ Holiday detection completed
✅ Holiday proximity features completed
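
Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), which is why `is_weekend` tests for 1 and 7. Python's `date.isoweekday()` uses 1 (Monday) to 7 (Sunday) instead, so a local spot check needs a conversion. A small sketch follows, with `spark_dayofweek` as a hypothetical helper and two dates taken from the Bronze sample rows.

```python
from datetime import date


def spark_dayofweek(d):
    """Map Python's ISO weekday (Mon=1..Sun=7) to Spark's dayofweek (Sun=1..Sat=7)."""
    return d.isoweekday() % 7 + 1


def is_weekend(d):
    """Same rule as the is_weekend column: Sunday (1) or Saturday (7)."""
    return spark_dayofweek(d) in (1, 7)


for d in (date(2019, 5, 3), date(2020, 9, 26)):
    print(d, spark_dayofweek(d), is_weekend(d))
```

The first date is a Friday (6, not a weekend) and the second a Saturday (7, weekend).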


In [0]:
print("🌿 Adding seasonal and quarterly features...")

# Step 4: Seasonal features
df_final_enhanced = df_enhanced.withColumn(
    "season",
    when(col("flight_month").isin([12, 1, 2]), "Winter")
    .when(col("flight_month").isin([3, 4, 5]), "Spring")
    .when(col("flight_month").isin([6, 7, 8]), "Summer")
    .when(col("flight_month").isin([9, 10, 11]), "Fall")
    .otherwise("Unknown"),
).withColumn(
    "quarter",
    when(col("flight_month").isin([1, 2, 3]), 1)
    .when(col("flight_month").isin([4, 5, 6]), 2)
    .when(col("flight_month").isin([7, 8, 9]), 3)
    .when(col("flight_month").isin([10, 11, 12]), 4)
    .otherwise(0),
)

print("✅ Seasonal features completed")
print(f"Enhanced Silver columns: {len(df_final_enhanced.columns)}")

# Assign final DataFrame
df_silver = df_final_enhanced

🌿 Adding seasonal and quarterly features...
✅ Seasonal features completed
Enhanced Silver columns: 17
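
The `quarter` column above follows the calendar-quarter rule, which can also be written arithmetically as `(month - 1) // 3 + 1`. PySpark also provides a built-in `quarter` function in `pyspark.sql.functions` that could stand in for the `when` chain. The sketch below re-implements both mappings in plain Python purely as a cross-check; the helper names are illustrative.

```python
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def season_of(month):
    """Meteorological season for a month number, 'Unknown' for anything else."""
    return SEASON_BY_MONTH.get(month, "Unknown")


def quarter_of(month):
    """Calendar quarter for months 1-12, 0 for anything else."""
    return (month - 1) // 3 + 1 if 1 <= month <= 12 else 0


for m in (1, 4, 9, 12):
    print(m, season_of(m), quarter_of(m))
```

Note that December falls in the "Winter" season but in quarter 4, so the two features are not interchangeable.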


Based on EDA, arrival delays in 2020 are anomalously low because of COVID-19 disruptions: the mean is -5.01 minutes, versus 5.95 minutes for the other years. These patterns do not generalize to other years and may distort model training, so we remove all flights with `flight_year = 2020` from the Silver table.

In [0]:
# Optional Cell for dropping 2020 data from the dataset
# --- Compute averages ---

avg_2020 = (
    df_silver.filter(col("flight_year") == 2020)
             .agg(avg("arrival_delay").alias("avg_delay_2020"))
             .collect()[0]["avg_delay_2020"]
)

avg_non2020 = (
    df_silver.filter(col("flight_year") != 2020)
             .agg(avg("arrival_delay").alias("avg_delay_non2020"))
             .collect()[0]["avg_delay_non2020"]
)

avg_overall = (
    df_silver.agg(avg("arrival_delay").alias("avg_delay_overall"))
             .collect()[0]["avg_delay_overall"]
)

# --- Print results ---
print("Average arrival delay for 2020:       ", round(avg_2020, 2))
print("Average arrival delay for other years:", round(avg_non2020, 2))
print("Overall average arrival delay:        ", round(avg_overall, 2))

# Drop all flights from 2020
df_silver_no2020 = df_silver.filter(df_silver.flight_year != 2020)

print("Original row count:", df_silver.count())
print("Row count after dropping 2020:", df_silver_no2020.count())


# Assign final DataFrame
df_silver = df_silver_no2020


Average arrival delay for 2020:        -5.01
Average arrival delay for other years: 5.95
Overall average arrival delay:         4.26
Original row count: 3000000
Row count after dropping 2020: 2520650
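
As a quick arithmetic cross-check on the figures above, 2020 accounts for 3,000,000 − 2,520,650 = 479,350 rows. Weighting the two averages by row count gives roughly 4.20 rather than the reported 4.26. One plausible reason (an assumption, not verified here) is that `avg` skips null `arrival_delay` values, so the true weights are the non-null counts per group, which this notebook does not show.

```python
# Figures copied from the cell output above
total_rows = 3_000_000
rows_non2020 = 2_520_650
avg_2020 = -5.01
avg_non2020 = 5.95

rows_2020 = total_rows - rows_non2020

# Row-count-weighted mean; this ignores that avg() skips null delays
weighted = (rows_2020 * avg_2020 + rows_non2020 * avg_non2020) / total_rows

print(rows_2020, round(weighted, 2))
```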


## 📊 Enhanced Silver Validation

In [0]:
print("🔍 Validating Enhanced Silver table...")

# Show final schema (column count computed rather than hard-coded)
print(f"\n📋 Enhanced Silver Schema ({len(df_silver.columns)} columns):")
df_silver.printSchema()

# Show sample with temporal features
print("\n🔎 Sample Enhanced Silver Data:")
df_silver.select(
    "flight_date",
    "airline_name",
    "origin_airport_code",
    "day_of_week",
    "week_of_year",
    "is_weekend",
    "is_holiday",
    "is_near_holiday",
    "season",
    "quarter",
).show(5, truncate=False)

# Feature statistics
print("\n📊 Temporal Feature Statistics:")
total_flights = df_silver.count()
weekend_flights = df_silver.filter(col("is_weekend")).count()
holiday_flights = df_silver.filter(col("is_holiday")).count()
near_holiday_flights = df_silver.filter(col("is_near_holiday")).count()

print(f"Total flights: {total_flights:,}")
print(f"Weekend flights: {weekend_flights:,} ({weekend_flights/total_flights*100:.1f}%)")
print(f"Holiday flights: {holiday_flights:,} ({holiday_flights/total_flights*100:.1f}%)")
print(f"Near holiday flights: {near_holiday_flights:,} ({near_holiday_flights/total_flights*100:.1f}%)")

# Column summary
original_cols = [
    "airline_name",
    "airline_code",
    "origin_airport_code",
    "destination_airport_code",
    "arrival_delay",
    "flight_date",
    "flight_month",
    "flight_year",
]
temporal_cols = [
    "day_of_week",
    "week_of_year",
    "day_of_month",
    "is_weekend",
    "is_holiday",
    "is_near_holiday",
    "is_holiday_period",
    "season",
    "quarter",
]

print(f"\n✅ Enhanced Silver Success:")
print(f"Original business columns: {len(original_cols)}")
print(f"New temporal columns: {len(temporal_cols)}")
print(f"Total columns: {len(df_silver.columns)}")

🔍 Validating Enhanced Silver table...

📋 Enhanced Silver Schema (17 columns):
root
 |-- airline_name: string (nullable = true)
 |-- airline_code: string (nullable = true)
 |-- origin_airport_code: string (nullable = true)
 |-- destination_airport_code: string (nullable = true)
 |-- arrival_delay: double (nullable = true)
 |-- flight_date: date (nullable = true)
 |-- flight_month: integer (nullable = true)
 |-- flight_year: integer (nullable = true)
 |-- day_of_week: integer (nullable = true)
 |-- week_of_year: integer (nullable = true)
 |-- day_of_month: integer (nullable = true)
 |-- is_weekend: boolean (nullable = false)
 |-- is_holiday: boolean (nullable = false)
 |-- is_near_holiday: boolean (nullable = false)
 |-- is_holiday_period: boolean (nullable = false)
 |-- season: string (nullable = false)
 |-- quarter: integer (nullable = false)


🔎 Sample Enhanced Silver Data:
+-----------+----------------------+-------------------+-----------+------------+----------+----------+

In [0]:
def path_exists(path):
    """Check if a path exists"""
    try:
        dbutils.fs.ls(path)
        return True
    except Exception:
        return False


def create_directory_if_not_exists(path):
    """Create directory if it doesn't exist"""
    if not path_exists(path):
        dbutils.fs.mkdirs(path)
        print(f"✅ Created directory: {path}")
    else:
        print(f"ℹ️  Directory already exists: {path}")


def table_exists(table_name):
    """Check if a table exists"""
    try:
        spark.table(table_name)
        return True
    except Exception:
        return False

In [0]:
assert df_silver is not None, "The DataFrame 'df_silver' does not exist."

# Define the paths for your new Silver table.
# As in the Bronze_table notebook, the path uses the "ds-capstone" convention
# rather than "ds_capstone".
SILVER_PATH = "/Volumes/workspace/default/ds-capstone/silver/flights_processed"
# The old path here was "/Volumes/workspace/default/ds_capstone/silver/flights_processed"
SILVER_TABLE_NAME = "default.silver_flights_processed"
DATABASE_NAME = "default"


assert DATABASE_NAME, "DATABASE_NAME is not defined."

print(f"\n📁 Checking Silver path: {SILVER_PATH}")
if path_exists(SILVER_PATH):
    print(f"⚠️  Path already exists. Checking if it's a valid Delta table...")
    try:
        # Try to read as Delta
        test_df = spark.read.format("delta").load(SILVER_PATH)
        print(f"✅ Valid Delta table found with {test_df.count()} records")
        print(f"💡 Will overwrite existing table")
    except Exception:
        print(f"⚠️  Path exists but is not a valid Delta table")
        print(f"🧹 Cleaning up old data...")
        dbutils.fs.rm(SILVER_PATH, recurse=True)
        print(f"✅ Old data removed")
else:
    print(f"✅ Path is clear, ready to create new table")

# Create parent directory if needed
silver_parent = "/".join(SILVER_PATH.split("/")[:-1])
create_directory_if_not_exists(silver_parent)

print(f"\n💾 Writing Silver Delta table...")
try:
    df_silver.write.format("delta").mode("overwrite").save(SILVER_PATH)
    print(f"✅ Delta table written to: {SILVER_PATH}")
    print(f"✅ Records written: {df_silver.count():,}")
except Exception as e:
    print(f"❌ ERROR: Could not write Delta table")
    print(f"   Error: {str(e)}")
    print(f"\n💡 Trying to clean and retry...")
    try:
        dbutils.fs.rm(SILVER_PATH, recurse=True)
        df_silver.write.format("delta").mode("overwrite").save(SILVER_PATH)
        print(f"✅ Successfully wrote Delta table after cleanup")
    except Exception as e2:
        print(f"❌ Still failed: {str(e2)}")
        raise

print(f"\n📌 Registering Delta table as: {SILVER_TABLE_NAME}")
try:
    # Ensure database exists
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {DATABASE_NAME}")
    print(f"✅ Database '{DATABASE_NAME}' ready")

    # Drop table if it exists (to avoid conflicts)
    spark.sql(f"DROP TABLE IF EXISTS {SILVER_TABLE_NAME}")
    print(f"   Dropped existing table (if any)")

    # Create managed table
    # This reads the data you JUST wrote and saves it as a managed table
    df_for_table = spark.read.format("delta").load(SILVER_PATH)
    df_for_table.write.format("delta").mode("overwrite").saveAsTable(SILVER_TABLE_NAME)

    print(f"✅ Table registered successfully as '{SILVER_TABLE_NAME}'!")
except Exception as e:
    print(f"⚠️  Could not create table with saveAsTable, trying alternative method...")
    print(f"   Reason: {str(e)}")
    try:
        # Alternative: Create external table with explicit LOCATION
        # This just points the table name to the files written to SILVER_PATH above
        spark.sql(
            f"""
            CREATE TABLE IF NOT EXISTS {SILVER_TABLE_NAME}
            USING DELTA
            LOCATION '{SILVER_PATH}'
        """
        )
        print(f"✅ Table registered with LOCATION clause!")
    except Exception as e2:
        print(f"⚠️  Table registration failed: {str(e2)}")
        print(f"💡 You can still access the data directly using:")
        print(f"   spark.read.format('delta').load('{SILVER_PATH}')")


📁 Checking Silver path: /Volumes/workspace/default/ds-capstone/silver/flights_processed
✅ Path is clear, ready to create new table
✅ Created directory: /Volumes/workspace/default/ds-capstone/silver

💾 Writing Silver Delta table...
✅ Delta table written to: /Volumes/workspace/default/ds-capstone/silver/flights_processed
✅ Records written: 2,520,650

📌 Registering Delta table as: default.silver_flights_processed
✅ Database 'default' ready
   Dropped existing table (if any)
✅ Table registered successfully as 'default.silver_flights_processed'!