# Data Quality and Cleaning

**Training Objective:** Learn techniques for identifying and resolving data quality issues, including null handling, type validation, deduplication, and data standardization.

**Topics Covered:**
- Handling null values
- Type validation
- Deduplication
- Standardization
- Common quality issues

## Theoretical Introduction

**Section Objective:** Understand the foundations of data quality and data cleansing techniques.

**Basic Concepts:**
- **Data Quality**: A measure of data suitability for its intended purpose.
- **Data Cleansing**: The process of identifying and correcting errors in data.
- **Data Validation**: Verification of data compliance with business rules.
- **Data Standardization**: Unification of data formats and representation.
- **Data Profiling**: Analysis of structure, content, and relationships in data.

## User Isolation

We run the shared setup notebook, which provides the per-user variables used below (for example `CATALOG`, `BRONZE_SCHEMA`, and `DATASET_BASE_PATH`).

In [None]:
%run ../00_setup

## Environment Configuration

We import necessary libraries and set up the context (Catalog and Schema) to ensure we are working in our isolated environment.

In [None]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql.window import Window

# Set default catalog and schema
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {BRONZE_SCHEMA}")

# Display user context
display(spark.createDataFrame([
    ("Catalog", CATALOG),
    ("Bronze Schema", BRONZE_SCHEMA),
    ("Silver Schema", SILVER_SCHEMA),
    ("Gold Schema", GOLD_SCHEMA),
    ("User", raw_user)
], ["Parameter", "Value"]))

## Loading Data

We load customer data from a CSV file using the `DATASET_BASE_PATH` defined in `00_setup.ipynb`. The `inferSchema` option detects column types automatically, which is convenient for exploration. In production, prefer an explicit schema: inference requires an extra pass over the file and can guess a type incorrectly.

In [None]:
# Path to file in dataset
customers_path = f"{DATASET_BASE_PATH}/customers/customers.csv"

# Load data with schema inference
df_customers = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(customers_path)

# Display basic info
print(f"Total Records: {df_customers.count()}")
df_customers.printSchema()
display(df_customers.limit(5))

## Data Profiling

Before cleaning, we must understand the quality of our data. In this section we profile completeness (null values) and uniqueness (duplicate rows); accuracy is checked later, in the validation section.

### 1. Completeness - Null Value Analysis

We calculate the number and percentage of null values for each column.
*Note: We use an optimized aggregation approach to scan the data only once, rather than iterating through columns.*

In [None]:
# Calculate the total row count and the null count of every column in one pass
# count(when(col.isNull(), 1)) counts only the rows where the column is null
profile_row = df_customers.select(
    F.count(F.lit(1)).alias("__total_rows"),
    *[F.count(F.when(F.col(c).isNull(), 1)).alias(c) for c in df_customers.columns]
).collect()[0].asDict()

# Separate the total from the per-column null counts
total_count = profile_row.pop("__total_rows")
null_stats = [
    (c, v, (v / total_count) * 100 if total_count else 0.0)
    for c, v in profile_row.items()
]

df_nulls = spark.createDataFrame(null_stats, ["Column", "Null Count", "Null Pct"])
display(df_nulls.orderBy(F.desc("Null Count")))
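
The counting rule can be checked on a tiny in-memory sample before it runs against the full table. The sketch below is plain Python with no Spark dependency; the `rows` sample and the `null_profile` helper are hypothetical and only mirror the aggregation above.

```python
def null_profile(rows, columns):
    """Return (column, null_count, null_pct) tuples for a list of dict rows."""
    total = len(rows)
    profile = []
    for col in columns:
        nulls = sum(1 for row in rows if row.get(col) is None)
        # Guard against an empty input to avoid dividing by zero
        pct = (nulls / total) * 100 if total else 0.0
        profile.append((col, nulls, pct))
    # Most incomplete columns first, as in the Spark version
    return sorted(profile, key=lambda item: item[1], reverse=True)


# Hypothetical sample rows, not taken from customers.csv
rows = [
    {"customer_id": 1, "city": "Warsaw", "phone": None},
    {"customer_id": 2, "city": None, "phone": None},
    {"customer_id": 3, "city": "Boston", "phone": "555-0100"},
    {"customer_id": 4, "city": "Austin", "phone": None},
]

for column, nulls, pct in null_profile(rows, ["customer_id", "city", "phone"]):
    print(f"{column}: {nulls} nulls ({pct:.1f}%)")
```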

### 2. Uniqueness - Duplicate Analysis

We check how many unique rows exist compared to the total count to identify full-row duplicates.

In [None]:
# Calculate unique rows
unique_count = df_customers.distinct().count()

display(spark.createDataFrame([
    ("Total Rows", total_count),
    ("Unique Rows", unique_count),
    ("Duplicate Rows", total_count - unique_count)
], ["Metric", "Value"]))

## Handling Null Values

Strategies for handling missing data:
1.  **Fill**: Replace nulls with default values.
2.  **Drop**: Remove records with missing critical keys.
3.  **Coalesce**: Fallback to alternative columns.

### Strategy 1: Fill with Defaults

We replace null values in non-critical columns with placeholders like "Unknown" or "no phone".

In [None]:
fill_values = {
    "phone": "no phone",
    "city": "Unknown", 
    "state": "Unknown",
    "country": "Unknown"
}

df_filled = df_customers.fillna(fill_values)

# Verify: Check records that were originally null
display(df_filled.filter(F.col("city") == "Unknown").limit(5))

### Strategy 2: Drop Records

For critical columns like `customer_id`, missing values might render the record useless. In such cases, we drop the rows.

In [None]:
# customer_id is mandatory for our business logic
df_valid = df_customers.dropna(subset=["customer_id"])

# Count each DataFrame once and reuse the results
original_count = df_customers.count()
valid_count = df_valid.count()

display(spark.createDataFrame([
    ("Original Count", original_count),
    ("Valid Count", valid_count),
    ("Dropped Count", original_count - valid_count)
], ["Metric", "Value"]))

### Strategy 3: Coalesce (Fallback)

We can create a new column that takes the first non-null value from a list of columns. Here, we create a `primary_contact` column that prefers `email`, falls back to `phone`, and finally to a default string.

In [None]:
# Create primary_contact from email OR phone OR default
df_with_contact = df_customers.withColumn(
    "primary_contact",
    F.coalesce(F.col("email"), F.col("phone"), F.lit("no contact"))
)

display(df_with_contact.select("customer_id", "email", "phone", "primary_contact").limit(10))
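
`coalesce` skips only nulls, so an empty or whitespace-only string still counts as a value. If blank strings should also be treated as missing, the fallback rule needs an extra check. The sketch below shows that stricter rule in plain Python; `first_present` and the `contacts` sample are hypothetical names used only for illustration.

```python
def first_present(*values, default="no contact"):
    """Return the first value that is neither None nor blank."""
    for value in values:
        if value is not None and value.strip():
            return value.strip()
    return default


# Hypothetical (email, phone) pairs
contacts = [
    ("ann@example.com", "555-0101"),
    (None, "555-0102"),
    ("", "555-0103"),   # blank email: a null-only check would keep ""
    (None, None),
]

resolved = [first_present(email, phone) for email, phone in contacts]
print(resolved)
```

In Spark, one way to get the same effect is to convert blank strings to null before calling `coalesce`.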

## Type Validation and Conversion

We validate data types and formats, specifically dates and emails.

### Date and Email Validation Logic

We define the logic to parse dates and validate email formats using Regular Expressions.

In [None]:
# We try to parse dates and flag invalid ones
df_typed = df_customers.withColumn(
    "registration_date_parsed",
    F.to_date(F.col("registration_date"), "yyyy-MM-dd")
).withColumn(
    "registration_date_valid",
    (F.col("registration_date_parsed").isNotNull()) & 
    (F.col("registration_date_parsed") >= "2020-01-01") & 
    (F.col("registration_date_parsed") <= "2026-12-31")
).withColumn(
    "email_valid",
    F.col("email").rlike("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$")
)
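
Validation rules like these are easy to get subtly wrong, so it can help to exercise them on a few hand-picked values first. The following is a rough plain-Python approximation of the two rules; the helper names are hypothetical, and Python's regex and date parsing are not identical to Spark's. Note one deliberate difference: these helpers return `False` for a missing value, whereas the Spark `rlike` expression yields null for a null email.

```python
import re
from datetime import date, datetime

# Same pattern and date range as the Spark cell above
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
MIN_DATE = date(2020, 1, 1)
MAX_DATE = date(2026, 12, 31)


def is_valid_email(value):
    """True when the whole string matches the email pattern."""
    return value is not None and EMAIL_PATTERN.fullmatch(value) is not None


def is_valid_registration_date(value):
    """True when the value parses as yyyy-MM-dd and falls within the allowed range."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return False
    return MIN_DATE <= parsed <= MAX_DATE


for sample in ["jane.doe@example.com", "jane.doe@example", None]:
    print(repr(sample), is_valid_email(sample))

for sample in ["2023-05-17", "2019-12-31", "17/05/2023", None]:
    print(repr(sample), is_valid_registration_date(sample))
```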

### Validation Statistics

We calculate how many records passed our validation rules.

In [None]:
# Validation Statistics
valid_dates = df_typed.filter(F.col("registration_date_valid")).count()
valid_emails = df_typed.filter(F.col("email_valid")).count()

display(spark.createDataFrame([
    ("Valid Dates", valid_dates),
    ("Valid Emails", valid_emails)
], ["Metric", "Count"]))

### Advanced Date Parsing

Real-world data often contains mixed date formats. We can use `coalesce` to try multiple formats in sequence.

In [None]:
# Using coalesce to try multiple date formats
df_with_dates = df_customers.withColumn(
    "registration_date_parsed",
    F.coalesce(
        F.to_date(F.col("registration_date"), "yyyy-MM-dd"),
        F.to_date(F.col("registration_date"), "dd/MM/yyyy"),
        F.to_date(F.col("registration_date"), "MM-dd-yyyy"),
        F.to_date(F.col("registration_date")) # Fallback: Spark's default string-to-date cast
    )
)

display(df_with_dates.select("registration_date", "registration_date_parsed").limit(10))
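
The "try each format in order" idea is independent of Spark. As a minimal sketch, the same fallback chain can be written with `datetime.strptime`; the `parse_date` helper and the format tuple are hypothetical, and the Python format codes are only rough equivalents of the Spark patterns above.

```python
from datetime import datetime

# Python equivalents of yyyy-MM-dd, dd/MM/yyyy, and MM-dd-yyyy, tried in this order
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")


def parse_date(value, formats=DATE_FORMATS):
    """Return the first successful parse as a date, or None if no format matches."""
    if value is None:
        return None
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue  # Try the next format
    return None


for raw in ["2023-05-17", "17/05/2023", "05-17-2023", "May 17", None]:
    print(repr(raw), "->", parse_date(raw))
```

Order matters whenever two formats can both accept the same string (for example day-first and month-first variants with the same separator), because the first match wins.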

## Deduplication

Removing duplicates is essential for data integrity.

### 1. Exact Duplicates

We remove rows where **all** columns are identical.

In [None]:
df_distinct = df_customers.distinct()
print(f"Distinct count: {df_distinct.count()}")

### 2. Key-based Deduplication

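We remove duplicates based on a specific key (e.g., `customer_id`), keeping one row per key. `dropDuplicates` does not guarantee which of the duplicate rows is kept, so use it only when any row for a given key is acceptable.

In [None]:
# Keeps one row per customer_id; which row survives is not guaranteed
df_deduped = df_customers.dropDuplicates(["customer_id"])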

display(spark.createDataFrame([
    ("Total Records", df_customers.count()),
    ("Distinct Records", df_distinct.count()),
    ("Unique Customers", df_deduped.count())
], ["Metric", "Value"]))
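
When it matters which duplicate survives, the rule has to be stated explicitly. In Spark this is commonly done with `F.row_number()` over a `Window` partitioned by the key and ordered by a tiebreak column. The sketch below shows the same "keep the latest row per key" logic in plain Python; choosing `registration_date` as the tiebreak is an assumption, and `dedupe_latest` and the sample records are hypothetical.

```python
def dedupe_latest(records, key="customer_id", order_by="registration_date"):
    """Keep one record per key: the one with the greatest order_by value."""
    best = {}
    for record in records:
        current = best.get(record[key])
        # On a tie, the record seen first is kept
        if current is None or record[order_by] > current[order_by]:
            best[record[key]] = record
    return list(best.values())


# Hypothetical records; ISO-formatted date strings sort chronologically
records = [
    {"customer_id": 1, "registration_date": "2023-01-10", "city": "Warsaw"},
    {"customer_id": 1, "registration_date": "2024-03-02", "city": "Krakow"},
    {"customer_id": 2, "registration_date": "2022-07-19", "city": "Boston"},
]

for row in dedupe_latest(records):
    print(row)
```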

## Standardization

We unify text formats (case, whitespace) and codes (phone, country).

### Text Standardization

We trim whitespace, capitalize names, lowercase emails, and uppercase country values to ensure consistency.

In [None]:
# Trim whitespace and apply casing rules
text_columns = ["first_name", "last_name", "city", "email"]
df_standardized = df_customers.select(
    *[
        F.trim(F.col(c)).alias(c) if c in text_columns else F.col(c)
        for c in df_customers.columns
    ]
).withColumn(
    "first_name", F.initcap(F.col("first_name"))
).withColumn(
    "last_name", F.initcap(F.col("last_name"))
).withColumn(
    "email", F.lower(F.col("email"))
).withColumn(
    "country", F.upper(F.col("country"))
)

display(df_standardized.limit(5))

### Code Standardization

We clean phone numbers (removing non-digits) and map country names to ISO 3166-1 alpha-3 codes. Countries without a mapping pass through unchanged.

In [None]:
# Clean phone numbers and standardize country codes
df_codes = df_standardized.withColumn(
    "phone_clean",
    F.regexp_replace(F.col("phone"), "[^0-9]", "") # Keep digits only
).withColumn(
    "country_iso",
    F.when(F.col("country") == "USA", "USA")
     .when(F.col("country") == "POLAND", "POL")
     .otherwise(F.col("country"))
)

display(df_codes.select("phone", "phone_clean", "country", "country_iso").limit(5))
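
As the list of countries grows, a chain of `when` clauses becomes hard to maintain, and a lookup table is often easier to extend. The sketch below illustrates that design in plain Python. The `UNITED STATES` entry is a hypothetical extra variant that does not appear in the original mapping, and both helper names are illustrative.

```python
import re

# Lookup table from uppercased country names to ISO 3166-1 alpha-3 codes.
# "UNITED STATES" is a hypothetical variant added for illustration.
COUNTRY_TO_ISO3 = {
    "USA": "USA",
    "UNITED STATES": "USA",
    "POLAND": "POL",
}


def clean_phone(phone):
    """Keep digits only; None stays None."""
    if phone is None:
        return None
    return re.sub(r"[^0-9]", "", phone)


def to_iso3(country):
    """Map a country name to its code; unmapped values pass through uppercased."""
    if country is None:
        return None
    normalized = country.strip().upper()
    return COUNTRY_TO_ISO3.get(normalized, normalized)


print(clean_phone("+48 (22) 555-01-23"))
print(to_iso3(" poland "), to_iso3("Germany"))
```

Stripping every non-digit also removes a leading `+`, so keep the raw value if the international prefix marker is needed later.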

### Advanced Standardization (Company Names)

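Company names often appear in various formats (e.g., "Dad & Sons", "Dad and Sons", "dad&sons"). We can use regular expressions to normalize them into a standard format for better matching and aggregation. These rules unify punctuation, casing, and spacing variants; names with a missing or fused connector (such as "Dad Sons" or "DadnSons") remain distinct and need fuzzy matching or a lookup table.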


In [None]:
# Create sample data with messy company names
# We simulate common data entry variations
data = [
    (1, "Dad & Sons"),
    (2, "Dad And Sons"),
    (3, "Dad Sons"),
    (4, "dad&sons"),
    (5, "Dad  &  Sons"),
    (6, "Dad-and-Sons"),
    (7, "DadnSons") # Hard case: missing delimiters
]
df_companies = spark.createDataFrame(data, ["id", "company_name"])

# Standardization logic
df_normalized = df_companies.withColumn(
    "company_normalized",
    F.lower(F.col("company_name"))
).withColumn(
    # Replace '&' or '+' with ' and '
    "company_normalized",
    F.regexp_replace(F.col("company_normalized"), "[&\\+]", " and ")
).withColumn(
    # Replace non-alphanumeric characters with space
    "company_normalized",
    F.regexp_replace(F.col("company_normalized"), "[^a-z0-9]", " ")
).withColumn(
    # Collapse multiple spaces
    "company_normalized",
    F.regexp_replace(F.col("company_normalized"), "\\s+", " ")
).withColumn(
    "company_normalized",
    F.trim(F.col("company_normalized"))
)

display(df_normalized)
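
To see which variants actually converge, the same four regex steps can be applied in plain Python and the results grouped by normalized name. This is a sketch for checking the rules outside Spark; `normalize_company` is a hypothetical helper that follows the steps above.

```python
import re
from collections import defaultdict


def normalize_company(name):
    """Lowercase, expand '&'/'+', replace other symbols with spaces, collapse spaces."""
    text = name.lower()
    text = re.sub(r"[&+]", " and ", text)
    text = re.sub(r"[^a-z0-9]", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


names = [
    "Dad & Sons", "Dad And Sons", "Dad Sons", "dad&sons",
    "Dad  &  Sons", "Dad-and-Sons", "DadnSons",
]

# Group the original spellings by their normalized form
groups = defaultdict(list)
for name in names:
    groups[normalize_company(name)].append(name)

for normalized, originals in groups.items():
    print(f"{normalized!r}: {originals}")
```

Five of the seven spellings collapse to `dad and sons`; "Dad Sons" and "DadnSons" each stay in their own group.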

## Summary

1.  **Profiling**: Analyzed nulls and duplicates.
2.  **Null Handling**: Used `fillna`, `dropna`, and `coalesce`.
3.  **Validation**: Validated dates and emails.
4.  **Deduplication**: Removed exact and key-based duplicates.
5.  **Standardization**: Cleaned text and codes.

## Clean up resources

In [None]:
# Clear cache if used
spark.catalog.clearCache()
print("Cleanup completed.")