# Data Quality and Cleaning

**Training Objective:** Understand techniques for identifying and resolving data quality issues, including strategies for handling null values, type validation, deduplication, and data standardization.

**Topics Covered:**
- Handling null values
- Type validation
- Deduplication
- Standardization
- Common quality issues

## Theoretical Introduction

**Section Objective:** Understand the foundations of data quality and data cleansing techniques.

**Basic Concepts:**
- **Data Quality**: A measure of data suitability for its intended purpose
- **Data Cleansing**: The process of identifying and correcting errors in data
- **Data Validation**: Verification of data compliance with business rules
- **Data Standardization**: Unification of data formats and representation
- **Data Profiling**: Analysis of structure, content, and relationships in data

## User Isolation

In [0]:
%run ../00_setup

## Environment Configuration

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from pyspark.sql.window import Window
import re
from datetime import datetime, timedelta

# Display user context (variables from 00_setup)
print("=== User Context ===")
print(f"Catalog: {CATALOG}")
print(f"Bronze Schema: {BRONZE_SCHEMA}")
print(f"Silver Schema: {SILVER_SCHEMA}")
print(f"Gold Schema: {GOLD_SCHEMA}")
print(f"User: {raw_user}")

# Set default catalog and schema
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {BRONZE_SCHEMA}")

print("\n=== Configuration completed successfully ===")

### User Context

After running setup, we check environment variables:

**We use:**
- **CATALOG**: Isolated catalog per user
- **BRONZE_SCHEMA**: Raw data layer
- **SILVER_SCHEMA**: Cleaned data layer
- **GOLD_SCHEMA**: Aggregated data layer
- **raw_user**: User identifier
- **DATASET_BASE_PATH**: Path to the data folder

**Library Imports:**
- `functions as F` - PySpark functions for data transformation
- `types` - Data type definitions (StructType, StringType)
- `Window` - Window functions for deduplication and ranking
- `re` - Regular expressions for validation
- `datetime` - Date and time operations

## Loading Data from Dataset

**Theoretical Introduction:**

In this notebook, we use files from the local `dataset/` folder (Training Day 1-2). CSV, JSON, and Parquet files are loaded directly from the file system using the `DATASET_BASE_PATH` path from `00_setup.ipynb`.

**Key Concepts:**
- **DATASET_BASE_PATH**: Path to the dataset/ folder defined in 00_setup.ipynb
- **CSV Reader**: spark.read.format("csv") with options (header, inferSchema)

### Example: Loading Customers Data

**Goal:** Load customer data from a CSV file stored in the dataset/ folder.

**Approach:**
1. Define the path using DATASET_BASE_PATH from 00_setup.ipynb
2. Load CSV with options (header, inferSchema)
3. Basic exploration of loaded data

In [0]:
# RESOURCE: CSV file: {DATASET_BASE_PATH}/customers/customers.csv
# VARIABLE: df_customers - DataFrame with customer data

# Path to file in dataset
customers_path = f"{DATASET_BASE_PATH}/customers/customers.csv"

# Load data
df_customers = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(customers_path)

# Basic exploration
print("=== Customer data loaded ===")
print(f"Record count: {df_customers.count()}")
print(f"Column count: {len(df_customers.columns)}")
print(f"\nColumns: {df_customers.columns}")

# Data schema
print("\n=== Data Schema ===")
df_customers.printSchema()

# Preview first records
print("\n=== First 10 records ===")
display(df_customers.limit(10))

We load customer data from a CSV file using the `DATASET_BASE_PATH` defined in `00_setup.ipynb`. The `inferSchema=true` option makes Spark detect column types automatically, at the cost of an extra pass over the data.

In [0]:
# Basic statistics of loaded data
total_rows = df_customers.count()
total_columns = len(df_customers.columns)

display(f"Record count: {total_rows}")
display(f"Column count: {total_columns}")
display(f"Columns: {df_customers.columns}")

We display basic information about the loaded DataFrame - number of records, columns, and column names.

In [0]:
# Data schema - column types
df_customers.printSchema()

We check the data schema - column types automatically detected by Spark. In production, explicit schema definition is recommended.

In [0]:
# Preview first records
display(df_customers.limit(10))

We display the first 10 records to see the actual data and identify potential quality issues.

### Data Profiling - Data Quality Analysis

**Goal:** Identify quality issues in loaded data before cleaning begins.

In [0]:
# RESOURCE: DataFrame df_customers
# VARIABLE: null_analysis - list of (column, null_count, null_pct) tuples

print("=" * 80)
print("DATA QUALITY PROFILING REPORT")
print("=" * 80)

# 1. Completeness - Null value analysis
print("\n1. COMPLETENESS - Null values per column:")
print("-" * 80)

null_analysis = []
total_count = df_customers.count()  # count once, outside the loop
for col_name in df_customers.columns:
    null_count = df_customers.filter(F.col(col_name).isNull()).count()
    null_pct = (null_count / total_count) * 100 if total_count else 0.0
    null_analysis.append((col_name, null_count, null_pct))

# Display null value statistics
for col_name, null_count, null_pct in null_analysis:
    display(f"{col_name:20s}: {null_count:4d} nulls ({null_pct:5.1f}%)")

# 2. Uniqueness - Duplicate analysis
print("\n2. UNIQUENESS - Duplicate analysis:")
print("-" * 80)
total_rows = df_customers.count()
unique_rows = df_customers.distinct().count()
duplicate_rows = total_rows - unique_rows
print(f"  Total rows: {total_rows}")
print(f"  Unique rows: {unique_rows}")
print(f"  Duplicate rows: {duplicate_rows}")
print(f"  Duplication rate: {(duplicate_rows/total_rows)*100:.1f}%")

# 3. Consistency - Unique values in key columns
print("\n3. CONSISTENCY - Unique values:")
print("-" * 80)

for col_name in df_customers.columns:
    distinct_count = df_customers.select(col_name).distinct().count()
    print(f"  {col_name:20s}: {distinct_count:4d} unique values")

# 4. Accuracy - Sample values
print("\n4. ACCURACY - Sample values (first 5):")
print("-" * 80)
display(df_customers.limit(5))

print("\n" + "=" * 80)
print("PROFILING COMPLETED")
print("=" * 80)

**Completeness Analysis:** We check the percentage of missing values in each column. A high percentage of null values may indicate issues in the source system or the need for alternative filling strategies.

In [0]:
# 2. Uniqueness - Duplicate analysis
total_rows = df_customers.count()
unique_rows = df_customers.distinct().count()
duplicate_rows = total_rows - unique_rows
duplication_rate = (duplicate_rows/total_rows)*100

display(f"Total rows: {total_rows}")
display(f"Unique rows: {unique_rows}")
display(f"Duplicate rows: {duplicate_rows}")
display(f"Duplication rate: {duplication_rate:.1f}%")

**Uniqueness Analysis:** We identify duplicates - records that are completely identical in all columns. Duplicates can result from ETL errors or reloading the same data multiple times.

In [0]:
# 3. Consistency - Unique value analysis
consistency_analysis = []
for col_name in df_customers.columns:
    distinct_count = df_customers.select(col_name).distinct().count()
    consistency_analysis.append((col_name, distinct_count))

# Display unique value statistics
for col_name, distinct_count in consistency_analysis:
    display(f"{col_name:20s}: {distinct_count:4d} unique values")

**Consistency Analysis:** We check the number of unique values in each column. This helps identify categorical columns, potential keys, and data formatting issues.

In [0]:
# 4. Accuracy - Preview sample values
display(df_customers.limit(5))

**Accuracy Analysis:** Previewing actual data values allows for manual verification of correctness - whether the data looks realistic and meets business expectations.

## Handling Null Values

**Theoretical Introduction:**

Missing values are one of the most common data quality issues. The strategy for handling null values depends on the business context and the nature of the data. Incorrect handling can lead to errors in analysis and ML models.

**Key Concepts:**
- **fillna()**: Fill null values with a specific value or strategy
- **dropna()**: Remove records containing null values
- **coalesce()**: Select the first non-null value from multiple columns
- **Imputation**: Statistical filling methods (mean, median, mode)

**Practical Application:**
- Filling missing values with sensible defaults
- Removing records with critical missing data
- Fallback to alternative data sources

### Filling Null Values (fillna)

**Goal:** Fill missing values in columns using different strategies.

**Approach:**
1. Identify columns with null values
2. Choose appropriate strategy per column
3. Apply fillna() with a dictionary of values

We define a null filling strategy using a dictionary with default values for each column. We choose sensible defaults consistent with the business context.

We verify the effectiveness of the fillna operation by comparing the number of null values before and after the transformation. All values in the selected columns should be filled.

In [0]:
# RESOURCE: DataFrame df_customers
# VARIABLE: df_filled - DataFrame with filled null values

# Null filling strategy - only for columns that can actually have nulls
fill_values = {
    "phone": "no phone",
    "city": "Unknown", 
    "state": "Unknown",
    "country": "Unknown"
}

# Fill null values
df_filled = df_customers.fillna(fill_values)

# Verify changes
print("=== Comparison before and after fillna ===")
for col_name in fill_values.keys():
    before_nulls = df_customers.filter(F.col(col_name).isNull()).count()
    after_nulls = df_filled.filter(F.col(col_name).isNull()).count()
    print(f"{col_name:15s}: {before_nulls:3d} nulls → {after_nulls:3d} nulls")

# Sample records after filling
print("\n=== Sample filled records ===")
display(df_filled.filter(
    df_customers["phone"].isNull() | 
    df_customers["city"].isNull() |
    df_customers["state"].isNull()
).limit(5))

We check records that originally had null values - they should now be filled with default values according to our strategy.

In [0]:
# Sample records after filling - records that had null values
display(df_filled.filter(
    df_customers["phone"].isNull() | 
    df_customers["city"].isNull() |
    df_customers["state"].isNull()
).limit(5))

### Dropping Null Values (dropna)

**Goal:** Remove records with missing values in key columns.

In [0]:
# RESOURCE: DataFrame df_customers
# VARIABLE: df_valid - DataFrame with records removed where key columns are null

# Remove records without key information (customer_id is required)
df_valid = df_customers.dropna(subset=["customer_id"])

# Verify changes
print("=== Comparison before and after dropna ===")
print(f"Record count BEFORE: {df_customers.count()}")
print(f"Record count AFTER: {df_valid.count()}")
print(f"Records removed: {df_customers.count() - df_valid.count()}")

# Check if there are still nulls in customer_id
null_ids = df_valid.filter(F.col("customer_id").isNull()).count()
print(f"\nNulls in customer_id after dropna: {null_ids}")

# Alternative usage: dropna with how='all' (removes only if all columns are null)
df_any_data = df_customers.dropna(how='all')
print(f"\nRecords with any data: {df_any_data.count()}")

We remove records that do not have values in key columns. `customer_id` is mandatory - records without it are useless in business analysis.

In [0]:
# Verify changes after dropna
records_before = df_customers.count()
records_after = df_valid.count()
records_removed = records_before - records_after

display(f"Record count BEFORE: {records_before}")
display(f"Record count AFTER: {records_after}")
display(f"Records removed: {records_removed}")

# Check if there are still nulls in customer_id
null_ids = df_valid.filter(F.col("customer_id").isNull()).count()
display(f"Nulls in customer_id after dropna: {null_ids}")

We verify the effectiveness of the dropna operation - we check how many records were removed and if there are no more null values in the `customer_id` column.

In [0]:
# Alternative usage: dropna with how='all' (removes only if all columns are null)
df_any_data = df_customers.dropna(how='all')
display(f"Records with any data: {df_any_data.count()}")

Alternative strategy: `how='all'` removes only records where all columns are null. It is less restrictive and preserves records with at least one non-null value.

### Coalesce Strategy

**Goal:** Use coalesce() to fallback between alternative data sources.

In [0]:
# RESOURCE: DataFrame df_customers
# VARIABLE: df_with_contact - DataFrame with new column primary_contact

from pyspark.sql.functions import coalesce, lit

# Example: Create primary_contact selecting the first non-null value
df_with_contact = df_customers.withColumn(
    "primary_contact",
    coalesce(F.col("email"), F.col("phone"), lit("no contact"))
)

# Create full address from available fields.
# concat_ws() skips nulls and returns "" (not null) when every input is null,
# so the empty string is mapped to the default explicitly.
address_parts = F.concat_ws(", ", F.col("city"), F.col("state"), F.col("country"))
df_with_contact = df_with_contact.withColumn(
    "full_address",
    F.when(address_parts != "", address_parts).otherwise(lit("Address Unknown"))
)

# Verify
print("=== primary_contact Analysis ===")
contact_stats = df_with_contact.groupBy("primary_contact").count().orderBy(F.desc("count"))
display(contact_stats.limit(10))

# Examples
print("\n=== Sample records with primary_contact and full_address ===")
display(df_with_contact.select("customer_id", "email", "phone", "primary_contact", "city", "state", "country", "full_address").limit(10))

We use `coalesce()` to create the `primary_contact` column - selecting the first non-null value from `email`, `phone`, or a default value. This implements fallback logic in case of missing data.

In [0]:
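# Create full address from available fields.
# concat_ws() skips nulls and returns "" (not null) when every input is null,
# so the empty string is mapped to the default explicitly.
address_parts = F.concat_ws(", ", F.col("city"), F.col("state"), F.col("country"))
df_with_contact = df_with_contact.withColumn(
    "full_address",
    F.when(address_parts != "", address_parts).otherwise(lit("Address Unknown"))
)

We build `full_address` with `concat_ws()`, which skips null inputs, so a single call already covers every partial combination (city + country, country only, and so on). When all three fields are null, `concat_ws()` returns an empty string rather than null, so a `coalesce()` fallback around it would never fire. We therefore map the empty string to the default with `when()`/`otherwise()`.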

In [0]:
# primary_contact analysis - statistics of using different contact types
contact_stats = df_with_contact.groupBy("primary_contact").count().orderBy(F.desc("count"))
display(contact_stats.limit(10))

We analyze the effectiveness of the coalesce strategy - checking how many records use email, phone, or default value as primary_contact. This helps assess the quality of contact data.

In [0]:
# Sample records with primary_contact and full_address
display(df_with_contact.select(
    "customer_id", "email", "phone", "primary_contact", 
    "city", "state", "country", "full_address"
).limit(10))

The record preview shows how the fallback logic filled the new columns `primary_contact` and `full_address` using available source data or default values.

## Data Type Validation and Conversion

**Theoretical Introduction:**

Incorrect data types are a common problem when loading data from text files (CSV, JSON). Type conversions must be performed safely with error handling to avoid data loss or incorrect analysis results.

**Key Concepts:**
- **cast()**: Data type conversion (string → int, date, timestamp)
- **to_date()**: Parsing strings to DateType with format specification
- **to_timestamp()**: Parsing strings to TimestampType
- **try_cast()**: Safe conversion returning null on error (Spark 3.4+)

**Practical Application:**
- Type validation after loading CSV with inferSchema
- Parsing dates in non-standard formats
- Type conversion before joins and aggregations

### Numeric Conversion (cast)

**Goal:** Safe conversion of strings to numeric types with validation.

**Approach:**
1. Clean values before conversion (remove special characters)
2. Convert type using cast()
3. Validate value range

In [0]:
# RESOURCE: DataFrame df_customers
# VARIABLE: df_typed - DataFrame with correctly converted types

# Example: Validate registration dates and add quality flags
df_typed = df_customers.withColumn(
    "registration_date_parsed",
    F.to_date(F.col("registration_date"), "yyyy-MM-dd")
)

# Validate date range (2020-2026)
df_typed = df_typed.withColumn(
    "registration_date_valid",
    (F.col("registration_date_parsed").isNotNull()) & 
    (F.col("registration_date_parsed") >= "2020-01-01") & 
    (F.col("registration_date_parsed") <= "2026-12-31")
)

# Add account age in days
df_typed = df_typed.withColumn(
    "account_age_days",
    F.datediff(F.current_date(), F.col("registration_date_parsed"))
)

# Validate email format
df_typed = df_typed.withColumn(
    "email_valid",
    (F.col("email").isNotNull()) & 
    F.col("email").rlike("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$")
)

# Conversion statistics
total = df_typed.count()
valid_dates = df_typed.filter(F.col("registration_date_valid") == True).count()
invalid_dates = df_typed.filter(F.col("registration_date_valid") == False).count()
valid_emails = df_typed.filter(F.col("email_valid") == True).count()

print("=== Validation Statistics ===")
print(f"Total: {total}")
print(f"Valid registration dates: {valid_dates} ({(valid_dates/total)*100:.1f}%)")
print(f"Invalid registration dates: {invalid_dates} ({(invalid_dates/total)*100:.1f}%)")
print(f"Valid emails: {valid_emails} ({(valid_emails/total)*100:.1f}%)")

# Examples of invalid values
if invalid_dates > 0:
    print("\n=== Examples of invalid registration dates ===")
    display(df_typed.filter(F.col("registration_date_valid") == False).select("customer_id", "registration_date", "registration_date_parsed", "registration_date_valid").limit(5))

# Schema after conversion
print("\n=== Schema after conversion ===")
df_typed.printSchema()

We calculate the account age in days using `datediff()` between the current date and the registration date. This is a useful metric for behavioral analysis.

In [0]:
# Validate email format using regular expression
df_typed = df_typed.withColumn(
    "email_valid",
    (F.col("email").isNotNull()) & 
    F.col("email").rlike("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$")
)

We validate the email format using the `rlike()` regular expression. We check the basic structure: local_part@domain.extension. The `email_valid` flag can be used for filtering and reporting.
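
To see what this pattern accepts and rejects, the same expression can be tried in plain Python with the standard `re` module; for a pattern this simple, Python and the Java regex engine behind `rlike()` should agree. The sample addresses are made up, and `looks_like_email()` is a hypothetical helper, not part of the notebook.

```python
import re

# Same pattern as in the rlike() call, compiled once.
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def looks_like_email(value):
    """Structural check only; None and non-matching strings return False."""
    return value is not None and EMAIL_PATTERN.match(value) is not None

samples = [
    "alice@example.com",        # well formed
    "bob@localhost",            # no dot in the domain part
    "carol@example.c",          # top-level domain shorter than 2 letters
    "dave..smith@example.com",  # consecutive dots still pass this pattern
    " erin@example.com",        # leading space is not allowed by the pattern
    None,                       # missing value
]

results = {s: looks_like_email(s) for s in samples}
for sample, ok in results.items():
    print(f"{sample!r:30} -> {ok}")
```

Two practical points follow: the check is structural only (it does not prove that an address exists), and untrimmed values fail, so whitespace should be stripped before validation.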

In [0]:
# Calculate validation statistics
total = df_typed.count()
valid_dates = df_typed.filter(F.col("registration_date_valid") == True).count()
invalid_dates = df_typed.filter(F.col("registration_date_valid") == False).count()
valid_emails = df_typed.filter(F.col("email_valid") == True).count()

# Display statistics using display()
display(f"Total records: {total}")
display(f"Valid registration dates: {valid_dates} ({(valid_dates/total)*100:.1f}%)")
display(f"Invalid registration dates: {invalid_dates} ({(invalid_dates/total)*100:.1f}%)")
display(f"Valid emails: {valid_emails} ({(valid_emails/total)*100:.1f}%)")

We calculate and display validation effectiveness statistics - the percentage of valid dates and emails. These metrics are crucial for assessing source data quality.

In [0]:
# Examples of invalid registration dates (if any)
invalid_dates_sample = df_typed.filter(F.col("registration_date_valid") == False).select(
    "customer_id", "registration_date", "registration_date_parsed", "registration_date_valid"
).limit(5)

if invalid_dates_sample.count() > 0:
    display(invalid_dates_sample)

Previewing records with invalid dates helps understand the nature of issues in the source data. This may indicate the need for additional validation rules.

In [0]:
# Check data schema after adding new columns
df_typed.printSchema()

We check the updated DataFrame schema after adding new columns with validation and calculations. New columns of DateType, BooleanType, and IntegerType (`datediff()` returns an integer) extend the original schema.

The `registration_date_parsed` column comes from `to_date()` with an explicitly specified format. `yyyy-MM-dd` is the ISO 8601 calendar date format.

In [0]:
# Validate date range (2020-2026) and add quality flag
df_typed = df_typed.withColumn(
    "registration_date_valid",
    (F.col("registration_date_parsed").isNotNull()) & 
    (F.col("registration_date_parsed") >= "2020-01-01") & 
    (F.col("registration_date_parsed") <= "2026-12-31")
)

We add the `registration_date_valid` flag checking the logical date range (2020-2026). Quality flags are crucial for auditability and monitoring of data pipelines.

### Date Conversion (to_date)

**Goal:** Parse strings to DateType with support for multiple formats.

In [0]:
# RESOURCE: DataFrame df_customers
# VARIABLE: df_with_dates - DataFrame with correctly parsed dates

# Convert registration_date with support for multiple formats
df_with_dates = df_customers.withColumn(
    "registration_date_parsed",
    coalesce(
        F.to_date(F.col("registration_date"), "yyyy-MM-dd"),     # Format: 2024-01-15
        F.to_date(F.col("registration_date"), "dd/MM/yyyy"),     # Format: 15/01/2024
        F.to_date(F.col("registration_date"), "MM-dd-yyyy"),     # Format: 01-15-2024
        F.to_date(F.col("registration_date"), "yyyy.MM.dd"),     # Format: 2024.01.15
        F.to_date(F.col("registration_date"))                    # No format: default cast rules (ISO-style strings)
    )
)

# Validate conversion
total = df_with_dates.count()
parsed = df_with_dates.filter(F.col("registration_date_parsed").isNotNull()).count()
failed = df_with_dates.filter(
    F.col("registration_date").isNotNull() & 
    F.col("registration_date_parsed").isNull()
).count()

print("=== Date Conversion Statistics ===")
print(f"Total: {total}")
print(f"Successfully parsed: {parsed} ({(parsed/total)*100:.1f}%)")
print(f"Failed to parse: {failed} ({(failed/total)*100:.1f}%)")

# Examples of conversion
print("\n=== Sample date conversions ===")
display(df_with_dates.select("customer_id", "registration_date", "registration_date_parsed").limit(10))

# Records with parsing errors
if failed > 0:
    print("\n=== Records with parsing errors ===")
    display(df_with_dates.filter(
        F.col("registration_date").isNotNull() & 
        F.col("registration_date_parsed").isNull()
    ).select("customer_id", "registration_date", "registration_date_parsed").limit(5))

We chain several `to_date()` calls inside `coalesce()`: each call tries one format, and the first non-null result wins. This tolerates inconsistent formats in the source data, but the order of the formats matters whenever a value could match more than one of them.
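
The first-match-wins behaviour can be illustrated with a plain-Python analogue built on `datetime.strptime`. This is a sketch of the idea, not a reproduction of Spark's date parser; `parse_first_match()` and the sample strings are hypothetical.

```python
from datetime import date, datetime

# The four explicit formats from the coalesce() chain, in strptime notation.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y", "%Y.%m.%d"]

def parse_first_match(text):
    """Return the date from the first format that parses, else None."""
    if text is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # this format did not fit; try the next one
    return None

samples = ["2024-01-15", "15/01/2024", "01-15-2024", "2024.01.15",
           "03/04/2024", "Jan 15, 2024", None]
parsed = {s: parse_first_match(s) for s in samples}
for raw_value, value in parsed.items():
    print(f"{raw_value!r:16} -> {value}")
```

Note that `03/04/2024` is read as 3 April only because day-first is the sole slash-separated format in the list. If a month-first slash format were appended after it, that format would never be chosen for days up to 12, which is why ambiguous sources need a documented convention rather than a longer format list.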

In [0]:
# Calculate date conversion effectiveness statistics
total = df_with_dates.count()
parsed = df_with_dates.filter(F.col("registration_date_parsed").isNotNull()).count()
failed = df_with_dates.filter(
    F.col("registration_date").isNotNull() & 
    F.col("registration_date_parsed").isNull()
).count()

# Display statistics
display(f"Total records: {total}")
display(f"Successfully parsed: {parsed} ({(parsed/total)*100:.1f}%)")
display(f"Failed to parse: {failed} ({(failed/total)*100:.1f}%)")

We calculate the effectiveness of the coalesce strategy for date conversion - what percentage of records were successfully parsed. A high failure rate may indicate the need to add other formats to coalesce.

In [0]:
# Preview sample date conversions
display(df_with_dates.select("customer_id", "registration_date", "registration_date_parsed").limit(10))

The preview shows original date values and their parsed counterparts. This allows for verification of conversion correctness and identification of format issues.

In [0]:
# Preview records with parsing errors (if any)
parsing_errors = df_with_dates.filter(
    F.col("registration_date").isNotNull() & 
    F.col("registration_date_parsed").isNull()
).select("customer_id", "registration_date", "registration_date_parsed").limit(5)

if parsing_errors.count() > 0:
    display(parsing_errors)

Analysis of records with parsing errors helps identify unplanned date formats in source data. This information can be used to extend the coalesce strategy with additional formats.

### Timestamp Conversion and Time Calculations

**Goal:** Convert to timestamp and perform time calculations.

In [0]:
# RESOURCE: DataFrame df_with_dates
# VARIABLE: df_with_timestamp - DataFrame with timestamp and calculations

from pyspark.sql.functions import to_timestamp, current_timestamp, current_date, datediff

# Convert registration_date_parsed to timestamp (adds time 00:00:00)
df_with_timestamp = df_with_dates.withColumn(
    "registration_timestamp",
    F.to_timestamp(F.col("registration_date_parsed"))
)

# Time calculations
df_with_timestamp = df_with_timestamp \
    .withColumn("current_date", current_date()) \
    .withColumn("days_since_registration", 
        datediff(F.col("current_date"), F.col("registration_date_parsed"))
    )

# Statistics
print("=== Time Statistics ===")
df_with_timestamp.select(
    F.min("days_since_registration").alias("min_days"),
    F.max("days_since_registration").alias("max_days"),
    F.avg("days_since_registration").alias("avg_days")
).show()

# Examples
print("\n=== Sample time calculations ===")
display(df_with_timestamp.select(
    "customer_id",
    "registration_date",
    "registration_date_parsed",
    "registration_timestamp",
    "days_since_registration"
).orderBy(F.desc("days_since_registration")).limit(10))

We convert DateType to TimestampType using `to_timestamp()`. Timestamp contains date and time information (default 00:00:00 for dates only), which is useful for precise time calculations.

In [0]:
# Time calculations - add current date and days since registration
df_with_timestamp = df_with_timestamp \
    .withColumn("current_date", current_date()) \
    .withColumn("days_since_registration", 
        datediff(F.col("current_date"), F.col("registration_date_parsed"))
    )

We perform time calculations using `current_date()` and `datediff()`. The `days_since_registration` column shows the "age" of each customer account, which is useful for segmentation and cohort analysis.

In [0]:
# Time statistics - min, max, average days since registration
time_stats = df_with_timestamp.select(
    F.min("days_since_registration").alias("min_days"),
    F.max("days_since_registration").alias("max_days"),
    F.avg("days_since_registration").alias("avg_days")
)

display(time_stats)

We calculate descriptive statistics for account age - minimum, maximum, and average days since registration. These metrics help understand the tenure profile of the customer base.

In [0]:
# Preview sample time calculations
display(df_with_timestamp.select(
    "customer_id",
    "registration_date",
    "registration_date_parsed",
    "registration_timestamp",
    "days_since_registration"
).orderBy(F.desc("days_since_registration")).limit(10))

Preview of records sorted by account age (oldest first) shows all stages of date transformation - from original string, through parsed date, timestamp, to calculated number of days.

## Data Deduplication

**Theoretical Introduction:**

Duplicates are a common data quality problem resulting from errors in source systems, reloading the same data multiple times, or errors in ETL processes. The deduplication strategy depends on the business context.

**Key Concepts:**
- **dropDuplicates()**: Remove duplicates based on all or selected columns

### Deduplication - All Columns

**Goal:** Remove completely identical records (exact duplicates).

In [0]:
# RESOURCE: DataFrame df_customers
# VARIABLE: df_distinct - DataFrame without exact duplicates

# Remove exact duplicates (all columns identical)
df_distinct = df_customers.distinct()

# Statistics
total = df_customers.count()
distinct = df_distinct.count()
duplicates = total - distinct

print("=== Deduplication - all columns ===")
print(f"Total records: {total}")
print(f"Unique records: {distinct}")
print(f"Removed duplicates: {duplicates}")
print(f"Duplication rate: {(duplicates/total)*100:.1f}%")

# Identify duplicates before removal
from pyspark.sql.functions import count as spark_count

duplicated_records = df_customers \
    .groupBy(df_customers.columns) \
    .agg(spark_count("*").alias("count")) \
    .filter(F.col("count") > 1) \
    .orderBy(F.desc("count"))

if duplicated_records.count() > 0:
    print("\n=== Examples of duplicated records ===")
    display(duplicated_records.limit(5))

We remove completely identical duplicates (exact duplicates) using `distinct()`. This operation compares all columns and keeps only unique records.

In [0]:
# Calculate deduplication statistics
total = df_customers.count()
distinct = df_distinct.count()
duplicates = total - distinct
duplication_rate = (duplicates/total)*100

display(f"Total records: {total}")
display(f"Unique records: {distinct}")
display(f"Removed duplicates: {duplicates}")
display(f"Duplication rate: {duplication_rate:.1f}%")

We calculate deduplication statistics - the number of removed duplicates and the duplication rate. These metrics help assess source data quality and the effectiveness of the cleaning process.

In [0]:
# Identify duplicates before removal (if any)
from pyspark.sql.functions import count as spark_count

duplicated_records = df_customers \
    .groupBy(df_customers.columns) \
    .agg(spark_count("*").alias("count")) \
    .filter(F.col("count") > 1) \
    .orderBy(F.desc("count"))

if duplicated_records.count() > 0:
    display(duplicated_records.limit(5))

We identify specific records that are duplicates - grouping by all columns and looking for groups with more than one record. This helps understand the nature of duplicates in the data.

### Deduplication - Key Columns

**Goal:** Remove duplicates based on business key (customer_id), keeping the latest record.

In [0]:
# RESOURCE: DataFrame df_customers
# VARIABLE: df_deduped - DataFrame with duplicates removed per customer_id

# Strategy 1: dropDuplicates() per customer_id - keeps one record per key (which one is not guaranteed)
df_deduped_simple = df_customers.dropDuplicates(["customer_id"])

print("=== Deduplication per customer_id (simple strategy) ===")
print(f"Before: {df_customers.count()} records")
print(f"After: {df_deduped_simple.count()} records")
print(f"Removed: {df_customers.count() - df_deduped_simple.count()} duplicates")

# Strategy 2: Window function - keep the latest record (if we have a timestamp)
# Assuming we have a created_at or other timestamp column

if "created_at" in df_customers.columns or "last_updated" in df_customers.columns:
    from pyspark.sql.window import Window
    
    timestamp_col = "created_at" if "created_at" in df_customers.columns else "last_updated"
    
    # Window partitioned by customer_id, sorted by timestamp desc
    window_spec = Window.partitionBy("customer_id").orderBy(F.desc(timestamp_col))
    
    df_deduped = df_customers \
        .withColumn("row_num", F.row_number().over(window_spec)) \
        .filter(F.col("row_num") == 1) \
        .drop("row_num")
    
    print(f"\n=== Deduplication per customer_id (strategy: latest record) ===")
    print(f"Before: {df_customers.count()} records")
    print(f"After: {df_deduped.count()} records")
    print(f"Removed: {df_customers.count() - df_deduped.count()} duplicates")
else:
    # If no timestamp, use simple strategy
    df_deduped = df_deduped_simple
    print("\n(No timestamp column - used simple strategy)")

# Identify duplicates before removal
duplicate_ids = df_customers \
    .groupBy("customer_id") \
    .agg(spark_count("*").alias("count")) \
    .filter(F.col("count") > 1) \
    .orderBy(F.desc("count"))

if duplicate_ids.count() > 0:
    print(f"\n=== {duplicate_ids.count()} customer_id with duplicates ===")
    display(duplicate_ids.limit(10))
    
    # Examples of duplicates
    sample_duplicate_id = duplicate_ids.first()["customer_id"]
    print(f"\n=== Example of duplicates for customer_id={sample_duplicate_id} ===")
    display(df_customers.filter(F.col("customer_id") == sample_duplicate_id))

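We apply a simple deduplication strategy per business key (`customer_id`) using `dropDuplicates()`. It keeps one record for each `customer_id`; which record survives is not guaranteed, so the approach is fast but cannot select the "best" record. When a timestamp column is available, the window-function strategy keeps the latest record instead.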
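
For contrast, the selection rule behind a "keep the latest record" strategy can be modelled in plain Python, without Spark. This is only a sketch: the sample rows, the `keep_latest` helper, and the `email` and `created_at` values are hypothetical and not taken from the dataset.

```python
from datetime import datetime

# Hypothetical sample rows; only customer_id mirrors a column used in the notebook.
rows = [
    {"customer_id": 1, "email": "old@example.com", "created_at": datetime(2024, 1, 1)},
    {"customer_id": 1, "email": "new@example.com", "created_at": datetime(2024, 3, 1)},
    {"customer_id": 2, "email": "solo@example.com", "created_at": datetime(2024, 2, 1)},
]


def keep_latest(records, key, ts):
    """Keep one record per key: the one with the greatest timestamp."""
    latest = {}
    for rec in records:
        current = latest.get(rec[key])
        # Replace the stored record only when the new one is strictly newer.
        if current is None or rec[ts] > current[ts]:
            latest[rec[key]] = rec
    return list(latest.values())


deduped = keep_latest(rows, key="customer_id", ts="created_at")
```

Unlike an unordered "first encountered" rule, the result here depends only on the timestamps, not on the order in which the rows arrive (ties aside).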

In [0]:
# Deduplication statistics per customer_id
records_before = df_customers.count()
records_after = df_deduped_simple.count()
records_removed = records_before - records_after

print(f"Records before: {records_before}")
print(f"Records after: {records_after}")
print(f"Removed duplicates: {records_removed}")

We calculate the effectiveness of deduplication per business key - how many records were removed due to customer_id duplicates. This is a key metric for assessing data quality and business key uniqueness.
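
The same counts can also be turned into a relative duplicate rate, which is easier to compare across tables of different sizes. A minimal sketch in plain Python; the `dedup_stats` helper and the sample numbers are hypothetical.

```python
def dedup_stats(records_before: int, records_after: int) -> dict:
    """Return the number of removed records and the duplicate rate in percent."""
    removed = records_before - records_after
    # Guard against an empty input table to avoid division by zero.
    rate_pct = round(100 * removed / records_before, 2) if records_before else 0.0
    return {"removed": removed, "duplicate_rate_pct": rate_pct}


# Hypothetical counts, for illustration only.
stats = dedup_stats(1000, 950)
```

In the notebook, the two arguments would come from the `count()` calls on the DataFrame before and after deduplication.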

## Data Standardization

**Theoretical Introduction:**

Standardization involves unifying data formats and representations according to established business rules. Non-standard data (different case, whitespace, formats) hinders analysis, joins, and aggregations.

**Key Concepts:**
- **Text standardization**: trim(), lower(), upper(), initcap()
- **Pattern standardization**: regexp_replace() for codes, phones
- **Format standardization**: Unifying date and address formats
- **Categorical standardization**: Mapping variants to standard values

**Practical Application:**
- Unifying case in text fields
- Removing whitespace from beginning and end
- Standardizing country codes, phone numbers, zip codes
- Consolidating category variants (Active/active/ACTIVE → Active)
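
The last item, consolidating category variants, comes down to a normalized lookup. The sketch below shows the idea in plain Python; the `CANONICAL_STATUS` mapping and the `standardize_category` helper are hypothetical and would need to be filled with the variants found during profiling.

```python
# Hypothetical variant-to-canonical mapping; keys are already trimmed and lowercased.
CANONICAL_STATUS = {"active": "Active", "inactive": "Inactive"}


def standardize_category(value, mapping, default="Unknown"):
    """Map any case/whitespace variant of a category to its canonical form."""
    if value is None:
        return default
    # Normalize first, so that " ACTIVE " and "active" hit the same key.
    return mapping.get(value.strip().lower(), default)


examples = [standardize_category(v, CANONICAL_STATUS) for v in ["Active", "active", " ACTIVE "]]
```

In Spark, the same rule is usually written as a chain of `F.when(...)` conditions on a trimmed, case-normalized column.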

### Text Standardization

**Goal:** Clean and standardize text fields (trim, case, whitespace).

In [0]:
# RESOURCE: DataFrame df_customers
# VARIABLE: df_standardized - DataFrame with standardized text fields

# Standardize text fields
df_standardized = df_customers

# 1. Trim leading and trailing spaces from the listed string columns
for col_name in ["first_name", "last_name", "email", "phone", "city", "state", "country"]:
    if col_name in df_standardized.columns:
        df_standardized = df_standardized.withColumn(
            col_name,
            F.trim(F.col(col_name))
        )

# 2. Standardize specific columns
# first_name, last_name: Title Case
df_standardized = df_standardized.withColumn(
    "first_name",
    F.initcap(F.col("first_name"))
).withColumn(
    "last_name", 
    F.initcap(F.col("last_name"))
)

# email: lowercase (standard for emails)
df_standardized = df_standardized.withColumn(
    "email",
    F.lower(F.col("email"))
)

# country: uppercase (ISO standard for country codes)
df_standardized = df_standardized.withColumn(
    "country",
    F.upper(F.col("country"))
)

# city: Title Case
df_standardized = df_standardized.withColumn(
    "city",
    F.initcap(F.col("city"))
)

# Comparison before and after
print("=== Standardization Comparison ===")
display(df_customers.select("first_name", "last_name", "email", "city", "country").limit(5))
print("\n↓↓↓ AFTER STANDARDIZATION ↓↓↓\n")
display(df_standardized.select("first_name", "last_name", "email", "city", "country").limit(5))

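We first strip leading and trailing spaces from the strings using `trim()`, then apply the case conventions per column. Trimming is the basic standardization step: it eliminates accidental spaces that make otherwise equal values fail to match in analysis and joins. Spark's `trim()` targets space characters; other whitespace such as tabs is left in place.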
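
When tabs or newlines at the edges of a value also have to go, a pattern-based strip is one option. The sketch below shows the rule in plain Python; the `strip_edge_whitespace` helper is hypothetical, and the assumption is that the same pattern could be passed to `F.regexp_replace()` on the Spark side.

```python
import re

# Matches a run of whitespace at the start or at the end of the string.
_EDGE_WS = re.compile(r"^\s+|\s+$")


def strip_edge_whitespace(value):
    """Remove leading and trailing whitespace of any kind; keep inner spacing."""
    if value is None:
        return None
    return _EDGE_WS.sub("", value)


cleaned = strip_edge_whitespace("\t Anna \n")
```

Whitespace inside the value is left untouched, so multi-word values such as city names keep their internal spaces.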

In [0]:
# Standardize names - Title Case
df_standardized = df_standardized.withColumn(
    "first_name",
    F.initcap(F.col("first_name"))
).withColumn(
    "last_name", 
    F.initcap(F.col("last_name"))
)

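We standardize first and last names to Title Case using `initcap()`: the first letter of each word becomes uppercase and the remaining letters lowercase. This is a common convention for proper-name fields and ensures uniform formatting, although names with internal capitals (for example "McDonald") are flattened as well.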
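
To see what this kind of Title Case does to less regular names, the rule can be approximated in plain Python. This is a sketch under the assumption that words are delimited by single spaces; `initcap_like` is a hypothetical helper, not a Spark function.

```python
def initcap_like(value):
    """Uppercase the first letter of each space-delimited word, lowercase the rest."""
    if value is None:
        return None
    return " ".join(word[:1].upper() + word[1:].lower() for word in value.split(" "))


samples = ["jOHN", "mary ann", "McDonald", "anne-marie"]
title_cased = [initcap_like(s) for s in samples]
```

Names with internal capitals or hyphens lose their original casing after the first letter, so it is worth checking a sample of such names before overwriting the source column.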

In [0]:
# Standardize emails (lowercase) and countries (uppercase)
df_standardized = df_standardized.withColumn(
    "email",
    F.lower(F.col("email"))
).withColumn(
    "country",
    F.upper(F.col("country"))
).withColumn(
    "city",
    F.initcap(F.col("city"))
)

We apply a different case convention to each field: email in lowercase (addresses are treated as case-insensitive in practice), country in uppercase (matching ISO country codes), and city in Title Case (place names).

In [0]:
# Comparison before and after standardization
print("BEFORE Standardization:")
display(df_customers.select("first_name", "last_name", "email", "city", "country").limit(5))

print("AFTER Standardization:")
display(df_standardized.select("first_name", "last_name", "email", "city", "country").limit(5))

Comparing the data before and after standardization shows the effect of the transformation: unified formatting, trimmed spaces, and consistent case conventions. Consistent values are what make joins and aggregations on these fields reliable.

### Code and Category Standardization

**Goal:** Unify code formats and map category variants.
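
One possible rule for phone numbers, sketched in plain Python: keep only the digits, keep an existing international prefix, and otherwise assume a default country code. The `standardize_phone` helper, the default code `"1"`, and the sample numbers are all assumptions made for illustration.

```python
import re


def standardize_phone(raw, default_country_code="1"):
    """Return '+<country code><digits>' or None when no digits are present."""
    if raw is None:
        return None
    digits = re.sub(r"[^0-9]", "", raw)
    if not digits:
        # Values such as "n/a" carry no usable number.
        return None
    if raw.strip().startswith("+"):
        # The number already carries its own country code.
        return "+" + digits
    # Assumption: numbers without a prefix belong to the default country.
    return "+" + default_country_code + digits


normalized = [standardize_phone(p) for p in ["(555) 123-4567", "+48 600 700 800", "n/a"]]
```

Writing the result to a separate column, rather than overwriting the raw value, keeps the original input available for auditing.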

In [0]:
# RESOURCE: DataFrame df_standardized
# VARIABLE: df_codes_standardized - DataFrame with standardized codes

df_codes_standardized = df_standardized

# 1. Standardize phone numbers (digits only, prefixed with "+" and a country code)
if "phone" in df_codes_standardized.columns:
    df_codes_standardized = df_codes_standardized.withColumn(
        "phone_standardized",
        F.when(F.col("phone").isNotNull(),
            F.concat(
                # Keep the "+" of numbers that already carry a country code;
                # otherwise assume a US number (the default prefix is an assumption)
                F.when(F.col("phone").startswith("+"), F.lit("+"))
                 .otherwise(F.lit("+1")),
                F.regexp_replace(F.col("phone"), "[^0-9]", "")
            )
        ).otherwise(F.col("phone"))
    )

# 2. Standardize customer_segment (consistent naming)
if "customer_segment" in df_codes_standardized.columns:
    df_codes_standardized = df_codes_standardized.withColumn(
        "customer_segment_standardized",
        F.when(F.upper(F.trim(F.col("customer_segment"))) == "PREMIUM", "Premium")
         .when(F.upper(F.trim(F.col("customer_segment"))) == "STANDARD", "Standard")
         .when(F.upper(F.trim(F.col("customer_segment"))) == "BASIC", "Basic")
         .otherwise("Unknown")
    )

# 3. Standardize country codes (ISO 3166-1 alpha-3 codes, a few countries as an example)
df_codes_standardized = df_codes_standardized.withColumn(
    "country_iso",
    F.when(F.upper(F.col("country")) == "USA", "USA")
     .when(F.upper(F.col("country")) == "POLAND", "POL")
     .when(F.upper(F.col("country")) == "GERMANY", "DEU")
     .when(F.upper(F.col("country")) == "FRANCE", "FRA")
     .otherwise(F.upper(F.col("country")))
)

# Verify standardization
print("=== Code and Category Standardization ===")

if "phone" in df_codes_standardized.columns:
    print("\n--- Phone numbers ---")
    display(df_codes_standardized.select("phone", "phone_standardized").limit(5))

if "customer_segment" in df_codes_standardized.columns:
    print("\n--- Customer segments ---")
    display(df_codes_standardized.groupBy("customer_segment", "customer_segment_standardized").count().orderBy("customer_segment"))

print("\n--- Country codes ---")
display(df_codes_standardized.groupBy("country", "country_iso").count().orderBy("country").limit(10))

## Summary

1. **Data profiling first**: Always analyze data before starting cleaning
2. **Context matters**: Cleaning strategy depends on business context
3. **Validation is critical**: Always validate conversion and transformation results
4. **Document decisions**: Log statistics and decisions for auditability

### Key Takeaways:
- Loading data from dataset/ (using DATASET_BASE_PATH)
- Data profiling and quality issue identification
- Null value handling (fillna, dropna, coalesce)
- Type validation and conversion (cast, to_date, to_timestamp)
- Record deduplication (distinct, dropDuplicates, window functions)
- Text and code standardization (trim, case, regexp_replace)
- PySpark vs SQL approach comparison

## Clean up resources

In [0]:
# Optional test resource cleanup
# WARNING: Run only if you want to remove all created data

# Remove temporary views
spark.catalog.dropTempView("customers_raw")

# Clear cache
spark.catalog.clearCache()

print("Temporary views and cache have been cleared")
print("Source data in Volume remains intact")