# 02 — Dataset Profiling (Bronze Events)

This notebook profiles the **Bronze layer** to validate:
- schema & row counts
- date coverage and daily volumes (incremental readiness)
- event_type distribution
- null patterns (brand, category_code, price, user_session)
- price sanity checks for purchases
- duplicate risk (based on a composite event key)

The findings feed directly into the cleaning and deduplication rules for the **Silver layer**.


In [0]:
from pyspark.sql import functions as F

SOURCE_TABLE = "workspace.default.bronze_events"
df = spark.table(SOURCE_TABLE)

print("SOURCE_TABLE =", SOURCE_TABLE)
display(df.limit(10))


In [0]:
row_count = df.count()
col_count = len(df.columns)

print("Total rows:", row_count)
print("Total columns:", col_count)
print("Columns:", df.columns)

df.printSchema()


In [0]:
display(
    df.selectExpr(
        "min(event_time) as min_event_time",
        "max(event_time) as max_event_time",
        "count(distinct bronze_batch_date) as days_loaded"
    )
)


In [0]:
display(
    df.groupBy("bronze_batch_date")
      .count()
      .orderBy("bronze_batch_date")
)
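
Daily counts show the volume per batch, but a missing day is easy to overlook in a long table. The sketch below is one way to check for gaps, assuming `bronze_batch_date` holds dates or ISO `YYYY-MM-DD` strings. `missing_batch_dates` is a hypothetical helper and the sample dates are made up.

```python
from datetime import date, timedelta

def missing_batch_dates(loaded):
    """Return the calendar dates between min and max of `loaded` that are absent."""
    # Accept datetime.date objects or ISO 'YYYY-MM-DD' strings (assumption).
    days = {d if isinstance(d, date) else date.fromisoformat(d) for d in loaded}
    if not days:
        return []
    start, end = min(days), max(days)
    span = (end - start).days
    # Walk every calendar day in the covered range and keep the ones not loaded.
    return [start + timedelta(days=n) for n in range(span + 1)
            if start + timedelta(days=n) not in days]

# Made-up sample input. In the notebook, the list could be collected with:
# [r["bronze_batch_date"] for r in df.select("bronze_batch_date").distinct().collect()]
print(missing_batch_dates(["2024-01-01", "2024-01-02", "2024-01-04"]))
```

An empty result suggests the loaded range is contiguous, which is what per-day incremental processing would expect.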


In [0]:
display(
    df.groupBy("event_type")
      .count()
      .orderBy(F.col("count").desc())
)


In [0]:
display(df.select(
    F.count(F.when(F.col("brand").isNull(), 1)).alias("brand_nulls"),
    F.count(F.when(F.col("category_code").isNull(), 1)).alias("category_code_nulls"),
    F.count(F.when(F.col("price").isNull(), 1)).alias("price_nulls"),
    F.count(F.when(F.col("user_session").isNull(), 1)).alias("user_session_nulls")
))


In [0]:
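# Null rate per column, computed as the mean of a 0/1 null indicator.
# This avoids dividing by row_count, which fails or yields null on an empty table.
null_rates = df.select([
    F.avg(F.col(c).isNull().cast("int")).alias(c)
    for c in df.columns
])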

display(null_rates)


In [0]:
purchases = df.filter(F.col("event_type") == "purchase")

display(
    purchases.select(
        F.count("*").alias("purchase_rows"),
        F.count(F.when(F.col("price") < 0, 1)).alias("negative_price_rows"),
        F.min("price").alias("min_price"),
        F.expr("percentile_approx(price, 0.5)").alias("median_price"),
        F.max("price").alias("max_price")
    )
)


In [0]:
display(
    df.groupBy("brand")
      .count()
      .orderBy(F.col("count").desc())
      .limit(20)
)

display(
    df.groupBy("category_code")
      .count()
      .orderBy(F.col("count").desc())
      .limit(20)
)


In [0]:
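# Composite key assumed to identify a single event.
key_cols = ["user_session", "event_time", "event_type", "product_id"]

total = row_count
distinct_keys = df.select(*key_cols).distinct().count()
# Rows beyond the first occurrence of each key.
dupe_rows = total - distinct_keys

print("Total rows:", total)
print("Distinct event keys:", distinct_keys)
print("Estimated duplicate rows:", dupe_rows)
# Guard against an empty table to avoid ZeroDivisionError.
print("Duplicate rate:", round(dupe_rows / total, 6) if total else 0.0)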
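
Subtracting distinct keys from total rows counts surplus rows, not duplicated keys: a key seen three times contributes two surplus rows. The plain-Python sketch below runs the same arithmetic on made-up tuples to show the difference. `duplicate_summary` and the sample rows are illustrative only. In Spark, the offending keys themselves could be listed with `df.groupBy(*key_cols).count().filter(F.col("count") > 1)`.

```python
from collections import Counter

def duplicate_summary(keys):
    """Summarise duplication for an iterable of hashable composite keys."""
    counts = Counter(keys)
    total = sum(counts.values())
    distinct = len(counts)
    return {
        "total_rows": total,
        "distinct_keys": distinct,
        # Rows beyond the first occurrence of each key.
        "surplus_rows": total - distinct,
        # Number of keys that occur more than once.
        "duplicated_keys": sum(1 for n in counts.values() if n > 1),
        "duplicate_rate": round((total - distinct) / total, 6) if total else 0.0,
    }

# Made-up rows shaped like (user_session, event_time, event_type, product_id).
sample = [
    ("s1", "t1", "view", 10),
    ("s1", "t1", "view", 10),
    ("s1", "t1", "view", 10),
    ("s2", "t2", "cart", 11),
]
print(duplicate_summary(sample))
```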


## Conclusion → Silver Rules

Based on the Bronze profiling, Silver will:
- fill missing `brand` and `category_code` with `"unknown"`
- cast `price` to a numeric type and filter out invalid values (e.g., negative prices)
- deduplicate using the composite key:
  `(user_session, event_time, event_type, product_id)`
- stay incremental by processing one `bronze_batch_date` at a time
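
The rules above could translate into a transformation along these lines. This is a sketch under stated assumptions, not the actual Silver job: `to_silver`, `SILVER_KEY_COLS`, and `FILL_DEFAULTS` are hypothetical names, `price` is cast to `double`, and only negative prices are dropped (how to treat null or zero prices is left open).

```python
# Hypothetical constants mirroring the rules listed above.
SILVER_KEY_COLS = ["user_session", "event_time", "event_type", "product_id"]
FILL_DEFAULTS = {"brand": "unknown", "category_code": "unknown"}

def to_silver(bronze_df, batch_date):
    """Apply the Silver rules to a single Bronze batch (illustrative sketch)."""
    # Imported here so that defining the function does not require Spark.
    from pyspark.sql import functions as F

    return (
        bronze_df
        # Incremental: handle one bronze_batch_date per call.
        .filter(F.col("bronze_batch_date") == batch_date)
        # Replace missing brand/category_code with a placeholder.
        .fillna(FILL_DEFAULTS)
        # Cast price to a numeric type; values that cannot be cast become null.
        .withColumn("price", F.col("price").cast("double"))
        # Assumption: only negative prices are treated as invalid.
        .filter(F.col("price").isNull() | (F.col("price") >= 0))
        # Keep one arbitrary row per composite event key.
        .dropDuplicates(SILVER_KEY_COLS)
    )

# Usage in the notebook (the date is a placeholder):
# silver_batch = to_silver(df, "2024-01-01")
```

`dropDuplicates` does not guarantee which of the duplicate rows survives. If that matters, a window ordered by an ingestion timestamp would be a more deterministic choice.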

