### DATA QUALITY CHECKS – SILVER LAYER

---

#### Purpose
This notebook validates the **integrity and quality** of data within the **Silver layer**.

It ensures data is **clean, consistent, and ready** for downstream processing.

---

#### Key Checks Performed
- Detection of **NULL** or **duplicate** primary keys
- Identification of **unwanted spaces** in text fields
- **Standardization** and **normalization** of categorical data
- **Validation** of date fields for correctness and logical ordering
- **Consistency checks** between related numerical fields (e.g., sales = quantity × price)

---

#### Usage Instructions
- Execute this notebook **after loading data** into the Silver layer
- **Review and address** any anomalies before progressing to the **Gold layer** or analytics


In [0]:
from datetime import datetime
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType, TimestampType
from pyspark.sql.functions import to_date, to_timestamp

In [0]:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

In [0]:
# Base paths for the Bronze and Silver Delta tables
bronze_path = 'gs://my-bucket-deep/Medallion/bronze/'
silver_path = 'gs://my-bucket-deep/Medallion/silver/'

In [0]:
tables = [
    "crm_cust_info",
    "crm_prd_info",
    "crm_sales_details",
    "erp_cust_az12",
    "erp_loc_a101",
    "erp_px_cat_g1v2"
]
bronze_dfs = {
    table: spark.read.format('delta').load(f'{bronze_path}{table}')
    for table in tables
}

silver_dfs = {
    table: spark.read.format('delta').load(f'{silver_path}{table}')
    for table in tables
}
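
Once every Silver table is loaded into `silver_dfs`, the individual checks in the sections that follow could also be driven from a small registry, so that failures are collected in one place before moving on to the Gold layer. The sketch below is one possible arrangement and not part of the original notebook: `CHECKS`, `run_checks`, and `failed_checks` are hypothetical names, only a few of the checks are registered, and each predicate is a Spark SQL string that selects the *bad* rows (`DataFrame.filter` accepts such strings).

```python
# Hypothetical registry: table name -> list of (check name, Spark SQL predicate).
# Each predicate selects the rows that VIOLATE the rule.
CHECKS = {
    "crm_cust_info": [
        ("null_pk", "cst_id IS NULL"),
        ("untrimmed_firstname", "cst_firstname != trim(cst_firstname)"),
    ],
    "crm_prd_info": [
        ("negative_or_null_cost", "prd_cost < 0 OR prd_cost IS NULL"),
        ("end_before_start", "prd_end_dt < prd_start_dt"),
    ],
}


def run_checks(dfs, checks):
    """Return {(table, check_name): number_of_bad_rows} for every registered check."""
    results = {}
    for table, table_checks in checks.items():
        for name, predicate in table_checks:
            # filter() takes the SQL string; count() triggers the Spark job.
            results[(table, name)] = dfs[table].filter(predicate).count()
    return results


def failed_checks(results):
    """Return the sorted (table, check_name) pairs that found at least one bad row."""
    return sorted(key for key, bad_rows in results.items() if bad_rows > 0)
```

In the notebook this would be called as `failed_checks(run_checks(silver_dfs, CHECKS))`; an empty list means every registered check passed.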

#### Validation: `crm_cust_info`

In [0]:
# Identify NULL or duplicate customer IDs (primary key)
silver_dfs['crm_cust_info'].groupBy('cst_id') \
                           .count().filter((f.col('count') > 1) | (f.col('cst_id').isNull())) \
                           .display()

cst_id,count
,1


In [0]:
# Detect leading/trailing spaces in customer first and last names
silver_dfs['crm_cust_info'].filter(
    (f.trim(f.col('cst_firstname')) != f.col('cst_firstname')) |
    (f.trim(f.col('cst_lastname')) != f.col('cst_lastname'))
).select(['cst_firstname', 'cst_lastname']).display()

cst_firstname,cst_lastname


In [0]:
# Data Standardization & Consistency (cst_gndr and cst_marital_status)
silver_dfs['crm_cust_info'].select('cst_marital_status').distinct().show()
silver_dfs['crm_cust_info'].select('cst_gndr').distinct().show()

+------------------+
|cst_marital_status|
+------------------+
|           Married|
|               n/a|
|            Single|
+------------------+

+--------+
|cst_gndr|
+--------+
|  Female|
|     n/a|
|    Male|
+--------+
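
Eyeballing the distinct values works for a handful of categories; the same review could also be expressed as a comparison against an allowed set. The sketch below assumes the allowed values are exactly those shown in the outputs above; `ALLOWED_VALUES` and `unexpected_values` are hypothetical names, not part of the original notebook.

```python
# Assumed allowed values, copied from the distinct-value outputs shown above.
ALLOWED_VALUES = {
    "cst_marital_status": {"Married", "Single", "n/a"},
    "cst_gndr": {"Female", "Male", "n/a"},
}


def unexpected_values(column, observed):
    """Return the observed values that are not in the allowed set for `column`."""
    return set(observed) - ALLOWED_VALUES[column]


# Hand-written sample values. In the notebook, `observed` could instead be built with:
# {row[0] for row in silver_dfs['crm_cust_info'].select('cst_gndr').distinct().collect()}
print(unexpected_values("cst_gndr", ["Female", "Male", "F", "n/a"]))  # prints {'F'}
```

A NULL in the column arrives as `None` after `collect()` and is reported as unexpected, since `None` is not in either allowed set.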



#### Validation: `crm_prd_info`

In [0]:
# Identify NULL or duplicate product IDs (primary key)
silver_dfs['crm_prd_info'].groupBy('prd_id') \
                          .count().filter((f.col('count') > 1) | (f.col('prd_id').isNull())) \
                          .show()

+------+-----+
|prd_id|count|
+------+-----+
+------+-----+



In [0]:
# Detect unwanted spaces in product names
silver_dfs['crm_prd_info'].filter(f.trim(f.col('prd_nm')) != f.col('prd_nm')).select(['prd_nm']).show()

+------+
|prd_nm|
+------+
+------+



In [0]:
# Identify NULLs or negative values in product cost
silver_dfs['crm_prd_info'].filter((f.col('prd_cost') < 0) | f.col('prd_cost').isNull()).select('prd_cost').show()

+--------+
|prd_cost|
+--------+
+--------+



In [0]:
# List distinct product lines for normalization
silver_dfs['crm_prd_info'].select('prd_line').distinct().show()

+-----------+
|   prd_line|
+-----------+
|   Mountain|
|       Road|
|        n/a|
|Other Sales|
|    Touring|
+-----------+



In [0]:
# Find records with invalid product date ranges (start_date > end_date)
silver_dfs['crm_prd_info'].filter(f.col('prd_end_dt') < f.col('prd_start_dt')).select(['prd_start_dt', 'prd_end_dt']).show()

+------------+----------+
|prd_start_dt|prd_end_dt|
+------------+----------+
+------------+----------+



#### Validation: `crm_sales_details`

In [0]:
# Check for invalid due dates: NULL, or still 8 characters long when cast to string
# (which suggests a raw value that was not converted to a date)
silver_dfs['crm_sales_details'].filter(f.col('sls_due_dt').isNull() | (f.length(f.col('sls_due_dt').cast('string')) == 8)).show()

# Repeat the same check for sls_order_dt and sls_ship_dt

+-----------+-----------+-----------+------------+-----------+----------+---------+------------+---------+
|sls_ord_num|sls_prd_key|sls_cust_id|sls_order_dt|sls_ship_dt|sls_due_dt|sls_sales|sls_quantity|sls_price|
+-----------+-----------+-----------+------------+-----------+----------+---------+------------+---------+
+-----------+-----------+-----------+------------+-----------+----------+---------+------------+---------+
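
The closing comment in the cell above asks for the same check on `sls_order_dt` and `sls_ship_dt`. One way to cover all three columns in a single pass is to build the condition as a Spark SQL string. This is a sketch under that assumption; `DATE_COLUMNS`, `suspicious_date_predicate`, and `ANY_SUSPICIOUS_DATE` are hypothetical names.

```python
DATE_COLUMNS = ["sls_order_dt", "sls_ship_dt", "sls_due_dt"]


def suspicious_date_predicate(column):
    """Spark SQL predicate: the column is NULL, or 8 characters long when cast to string."""
    return f"({column} IS NULL OR length(cast({column} AS string)) = 8)"


# A row is flagged if ANY of the three date columns is suspicious.
ANY_SUSPICIOUS_DATE = " OR ".join(suspicious_date_predicate(c) for c in DATE_COLUMNS)

# In the notebook (requires the Spark session and `silver_dfs`):
# silver_dfs['crm_sales_details'].filter(ANY_SUSPICIOUS_DATE).show()
```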



In [0]:
# Validate date ordering: flag rows where order date > ship date or ship date > due date
silver_dfs['crm_sales_details'].filter((f.col('sls_ship_dt') > f.col('sls_due_dt')) | (f.col('sls_order_dt') > f.col('sls_ship_dt'))).select(['sls_order_dt', 'sls_ship_dt', 'sls_due_dt']).show()

+------------+-----------+----------+
|sls_order_dt|sls_ship_dt|sls_due_dt|
+------------+-----------+----------+
+------------+-----------+----------+
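
One subtlety of the filter above: in Spark, a comparison involving NULL evaluates to NULL, and `filter` keeps only rows where the condition is true. A row with a missing date is therefore flagged only when the remaining comparison is true. The plain-Python sketch below mirrors that behavior for a single row; the function name `out_of_order` is hypothetical and the dates are made up for illustration.

```python
from datetime import date
from typing import Optional


def out_of_order(order_dt: Optional[date], ship_dt: Optional[date], due_dt: Optional[date]) -> bool:
    """True if the row would be flagged: ship after due, or order after ship.

    A comparison with a missing (None) date is treated as "unknown" and does
    not flag the row on its own, mirroring Spark's NULL handling in filter().
    """
    ship_after_due = ship_dt is not None and due_dt is not None and ship_dt > due_dt
    order_after_ship = order_dt is not None and ship_dt is not None and order_dt > ship_dt
    return ship_after_due or order_after_ship


# Made-up sample rows
print(out_of_order(date(2024, 1, 1), date(2024, 1, 5), date(2024, 1, 10)))  # False: dates in order
print(out_of_order(date(2024, 1, 6), date(2024, 1, 5), None))               # True: ordered after shipping
print(out_of_order(date(2024, 1, 1), None, date(2024, 1, 10)))              # False: missing ship date, nothing comparable
```

If rows with missing dates should also be surfaced, a separate NULL check on the date columns would be needed.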



In [0]:
# Business rules:
# - sls_sales = sls_quantity * sls_price
# - Negative, zero, or NULL values are not allowed

silver_dfs['crm_sales_details'].filter(
    (f.col('sls_sales') != f.col('sls_quantity') * f.col('sls_price')) |
    (f.col('sls_sales') <= 0) |
    (f.col('sls_quantity') <= 0) |
    (f.col('sls_price') <= 0) |
    (f.col('sls_sales').isNull()) |
    (f.col('sls_quantity').isNull()) |
    (f.col('sls_price').isNull())
).select(['sls_sales', 'sls_quantity', 'sls_price']).show()

# Note: if this check returns rows, the sales or price fields have data issues
# Fix option 1: Clean the data at the source-system level
# Fix option 2: Handle the errors in the data warehouse logic

+---------+------------+---------+
|sls_sales|sls_quantity|sls_price|
+---------+------------+---------+
+---------+------------+---------+
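
The business rules above can be restated as a plain-Python predicate, which may be convenient for unit-testing the rule outside Spark. This is a sketch, not the notebook's implementation: `violates_sales_rule` is a hypothetical name, and the exact-equality comparison assumes integer or decimal values rather than floating-point ones. The explicit `None` test plays the role of the `isNull()` conditions in the Spark filter, which are needed there because a comparison with NULL yields NULL rather than true.

```python
from typing import Optional, Union

Number = Union[int, float]


def violates_sales_rule(
    sales: Optional[Number],
    quantity: Optional[Number],
    price: Optional[Number],
) -> bool:
    """True if the row breaks any rule: NULL, non-positive, or sales != quantity * price."""
    values = (sales, quantity, price)
    if any(v is None for v in values):
        return True  # NULL values are not allowed
    if any(v <= 0 for v in values):
        return True  # negative or zero values are not allowed
    return sales != quantity * price


# Made-up sample rows
print(violates_sales_rule(20, 2, 10))    # False: 20 == 2 * 10
print(violates_sales_rule(25, 2, 10))    # True: 25 != 2 * 10
print(violates_sales_rule(None, 2, 10))  # True: missing sales value
```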



#### Validation: `erp_cust_az12`

In [0]:
# Detect unrealistic birthdate values
silver_dfs['erp_cust_az12'].filter((f.col('bdate') < '1915-01-01') | (f.col('bdate') > f.current_date())).select('bdate').show()

+-----+
|bdate|
+-----+
+-----+



In [0]:
# Data standardization & consistency: list distinct gender values
silver_dfs['erp_cust_az12'].select('gen').distinct().show()

+------+
|   gen|
+------+
|Female|
|   n/a|
|  Male|
+------+



#### Validation: `erp_loc_a101`

In [0]:
# Data standardization & consistency: list distinct country values
silver_dfs['erp_loc_a101'].select('cntry').distinct().show()

+--------------+
|         cntry|
+--------------+
|       Germany|
|        France|
| United States|
|           n/a|
|        Canada|
|     Australia|
|United Kingdom|
+--------------+



#### Validation: `erp_px_cat_g1v2`

In [0]:
# Identify unwanted spaces in category fields
silver_dfs['erp_px_cat_g1v2'].filter(
    (f.col('cat') != f.trim(f.col('cat'))) |
    (f.col('subcat') != f.trim(f.col('subcat'))) |
    (f.col('maintenance') != f.trim(f.col('maintenance')))
).show()

+---+---+------+-----------+
| id|cat|subcat|maintenance|
+---+---+------+-----------+
+---+---+------+-----------+



In [0]:
# Data standardization & consistency: list distinct maintenance flags
silver_dfs['erp_px_cat_g1v2'].select('maintenance').distinct().show()

+-----------+
|maintenance|
+-----------+
|         No|
|        Yes|
+-----------+

