# Silver Layer – Data Cleaning & Quality Checks

This notebook cleans the Bronze churn dataset and prepares it for feature engineering.
The Silver layer standardizes column names, imputes nulls, and adds basic data quality signals.
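
One risk of name standardization is that two distinct Bronze columns could map to the same Silver name, which would leave the DataFrame with ambiguous duplicate columns. The sketch below is a minimal pre-check in plain Python, assuming the same rule as this notebook's `clean_column_names` (lowercase, spaces to underscores). The helper names and the example column names are hypothetical, not part of the original notebook.

```python
from collections import defaultdict


def standardize_name(name):
    """Apply the notebook's rule: lowercase, spaces replaced by underscores."""
    return name.lower().replace(" ", "_")


def find_name_collisions(columns):
    """Return {standardized_name: [original names]} for names shared by 2+ columns."""
    groups = defaultdict(list)
    for original in columns:
        groups[standardize_name(original)].append(original)
    return {new: olds for new, olds in groups.items() if len(olds) > 1}


# Hypothetical column names, for illustration only
example_columns = ["Customer ID", "customer_id", "Tenure", "Monthly Charges"]
print(find_name_collisions(example_columns))
# {'customer_id': ['Customer ID', 'customer_id']}
```

In the notebook, this could be called as `find_name_collisions(bronze_df.columns)` before renaming; an empty dict means the rename is safe.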

In [0]:
# Load Bronze customer churn data from Unity Catalog
bronze_df = spark.table(
    "ai_trust_catalog.churn_trust.bronze_customer_churn"
)

display(bronze_df)

In [0]:
# Inspect schema to understand data types before cleaning
bronze_df.printSchema()

In [0]:
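from pyspark.sql.functions import col, count, when

# Calculate the fraction of null values (0.0 to 1.0) per column.
# count() skips nulls, so counting the when() result counts only the null rows.
null_stats = bronze_df.select([
    (count(when(col(c).isNull(), 1)) / count("*")).alias(c)
    for c in bronze_df.columns
])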

display(null_stats)

In [0]:
# Standardize column names: lowercase, with spaces replaced by underscores
def clean_column_names(df):
    new_names = [c.lower().replace(" ", "_") for c in df.columns]
    # toDF renames every column in a single projection
    return df.toDF(*new_names)

silver_df = clean_column_names(bronze_df)

In [0]:
# Verify standardized column names
silver_df.columns

In [0]:
from pyspark.sql.functions import col, when
from functools import reduce

# Count number of null values per row (data quality signal)
null_exprs = [
    when(col(c).isNull(), 1).otherwise(0)
    for c in silver_df.columns
]

silver_df = silver_df.withColumn(
    "row_null_count",
    reduce(lambda a, b: a + b, null_exprs)
)

In [0]:
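# Flag potential tenure outliers for downstream trust analysis.
# The threshold of 60 is a fixed cutoff; rows with a null tenure get flag 0
# because the comparison evaluates to null and falls through to otherwise().
silver_df = silver_df.withColumn(
    "tenure_outlier_flag",
    when(col("tenure") > 60, 1).otherwise(0)
)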

In [0]:
from pyspark.sql.types import StringType, NumericType

# Fill nulls:
# - Numeric columns → approximate median (approxQuantile, 1% relative error)
# - String columns → 'unknown'
for field in silver_df.schema.fields:
    if isinstance(field.dataType, NumericType):
        quantiles = silver_df.approxQuantile(field.name, [0.5], 0.01)
        # approxQuantile returns an empty list when the column has no non-null values
        if quantiles:
            silver_df = silver_df.fillna({field.name: quantiles[0]})
    elif isinstance(field.dataType, StringType):
        silver_df = silver_df.fillna({field.name: "unknown"})

In [0]:
# Final inspection of cleaned Silver dataset
display(silver_df)
silver_df.printSchema()

In [0]:
# Write cleaned data to the Silver layer as a Delta table.
# overwriteSchema lets the overwrite replace the existing table's schema,
# which is needed here because the new quality columns change it.
silver_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable(
        "ai_trust_catalog.churn_trust.silver_customer_churn"
    )

In [0]:
%sql
DESCRIBE TABLE ai_trust_catalog.churn_trust.silver_customer_churn;

In [0]:
%sql
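-- Validate record count after Silver transformation.
-- No rows are filtered in this notebook, so the two counts should match.
SELECT
  (SELECT COUNT(*) FROM ai_trust_catalog.churn_trust.bronze_customer_churn) AS bronze_rows,
  (SELECT COUNT(*) FROM ai_trust_catalog.churn_trust.silver_customer_churn) AS silver_rows;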

In [0]:
%sql
-- Preview cleaned Silver data
SELECT *
FROM ai_trust_catalog.churn_trust.silver_customer_churn
LIMIT 10;

## Summary

- Loaded raw churn data from the Bronze layer
- Standardized column names for consistency
- Measured the null fraction per column and the null count per row
- Imputed nulls: approximate median for numeric columns, 'unknown' for string columns
- Added data quality indicators (`row_null_count`, `tenure_outlier_flag`)
- Persisted cleaned data to the Silver Delta table
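
The column-level null assessment above only displays the values. As a possible next step, they could be turned into an explicit check. The sketch below is a minimal plain-Python example, assuming null fractions on a 0 to 1 scale keyed by column name; the function name, the 0.2 threshold, and the example values are hypothetical and would need tuning for real data.

```python
def columns_over_null_threshold(null_fractions, threshold=0.2):
    """Return columns whose null fraction exceeds the threshold, worst first."""
    flagged = [
        (name, frac)
        for name, frac in null_fractions.items()
        # A None fraction (e.g. from an empty table) is skipped rather than flagged
        if frac is not None and frac > threshold
    ]
    flagged.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in flagged]


# Hypothetical values, for illustration only. In the notebook they could
# come from null_stats.first().asDict().
example = {"tenure": 0.02, "total_charges": 0.35, "gender": 0.0, "contract": 0.25}
print(columns_over_null_threshold(example))
# ['total_charges', 'contract']
```

Columns returned by this check are candidates for closer review before imputation, since filling a mostly-null column with a single value can hide a data collection problem.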