# Workshop 2: Data Cleaning and Feature Engineering

## Objective

Clean the RetailMax sales data and build customer-level features (Customer 360) for the ML model.

## Context and Requirements

- **Workshop:** Customer Segmentation for RetailMax
- **Notebook type:** Hands-on Exercise
- **Prerequisites:** `01_Workshop_Data_Exploration.ipynb` completed
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
- **Execution time:** ~30 minutes

---

## Theoretical Background

**Data Cleaning Strategy:**

| Issue Type | Strategy | Example |
|------------|----------|---------|
| Missing key IDs | Remove row | `customer_id` is NULL |
| Missing descriptive fields | Impute default | `email` = "unknown@example.com" |
| Invalid values | Filter out | `quantity` <= 0 |
| Duplicates | Remove | Exact row duplicates |
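
To make the four rules concrete, here is a minimal sketch on a handful of in-memory rows. It uses plain Python rather than Spark so it runs anywhere; the sample rows and the `clean` helper are hypothetical and only illustrate the table above, they are not the workshop solution.

```python
# Hypothetical sample rows; the column names follow the table above.
rows = [
    {"customer_id": "C1", "order_id": "O1", "email": "a@example.com", "quantity": 2},
    {"customer_id": None, "order_id": "O2", "email": "b@example.com", "quantity": 1},  # missing key ID
    {"customer_id": "C2", "order_id": "O3", "email": None, "quantity": 3},             # missing email
    {"customer_id": "C3", "order_id": "O4", "email": "c@example.com", "quantity": 0},  # invalid quantity
    {"customer_id": "C1", "order_id": "O1", "email": "a@example.com", "quantity": 2},  # exact duplicate
]


def clean(rows):
    """Apply the four cleaning rules and return a new list of rows."""
    cleaned, seen = [], set()
    for row in rows:
        if row["customer_id"] is None:    # missing key ID: remove the row
            continue
        if row["quantity"] <= 0:          # invalid value: filter out
            continue
        row = dict(row)                   # copy so the input is not mutated
        if row["email"] is None:          # missing descriptive field: impute a default
            row["email"] = "unknown@example.com"
        key = tuple(sorted(row.items()))  # whole-row key for exact-duplicate detection
        if key in seen:                   # duplicate: remove
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


cleaned_rows = clean(rows)
print(f"Original: {len(rows)}, Clean: {len(cleaned_rows)}")
```

Each rule removes or repairs exactly one of the sample rows, which is why comparing row counts before and after each step is a useful sanity check.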

**RFM Analysis:**

RFM (Recency, Frequency, Monetary) is a customer segmentation technique that describes each customer with three measures:
- **Recency:** Days since last purchase (lower = better)
- **Frequency:** Number of purchases (higher = better)
- **Monetary:** Total spend (higher = better)
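
The three measures can be sketched on a tiny in-memory order list. This is an illustration in plain Python, not the Spark code used later in the workshop; the `orders` data is hypothetical, and the reference date is assumed to be the latest order date in the data rather than today's date.

```python
from datetime import date

# Hypothetical orders as (customer_id, order_date, amount); values are illustrative only.
orders = [
    ("C1", date(2024, 1, 5), 40.0),
    ("C1", date(2024, 3, 1), 60.0),
    ("C2", date(2024, 2, 10), 25.0),
]

# Reference date: the latest order date in the data (an assumption of this sketch).
reference_date = max(order_date for _, order_date, _ in orders)

rfm = {}
for customer_id, order_date, amount in orders:
    entry = rfm.setdefault(
        customer_id,
        {"last_purchase": order_date, "frequency": 0, "monetary": 0.0},
    )
    entry["last_purchase"] = max(entry["last_purchase"], order_date)
    entry["frequency"] += 1       # Frequency: number of purchases
    entry["monetary"] += amount   # Monetary: total spend

for entry in rfm.values():
    # Recency: days between the reference date and the customer's last purchase
    entry["recency"] = (reference_date - entry["last_purchase"]).days

for customer_id, entry in rfm.items():
    print(customer_id, entry["recency"], entry["frequency"], entry["monetary"])
```

Using the latest date in the dataset as the reference keeps recency stable when the notebook is re-run on the same data at a later time.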

---

## Data Leakage Warning

> **Critical Rule:** Calculate imputation statistics (mean, median) and any other transformation parameters from the training data ONLY. Statistics that include test rows leak information about the test set into the model and make evaluation results look better than they really are.

| Correct | Incorrect |
|---------|-----------|
| Calculate mean on train set | Calculate mean on full dataset |
| Apply train mean to test set | Calculate separate mean for test |
| Fit scaler on train only | Fit scaler on all data |

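In this workshop, we work with raw data before splitting. The row filters and constant-value imputation used here do not learn any statistics from the data, so applying them before the split is safe. The Pipeline in Workshop 03 handles the remaining transformations correctly by fitting transformers only on training data.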
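
The rule can be illustrated with mean imputation on a toy split. This is a minimal sketch in plain Python with hypothetical values; the `impute` helper is an illustrative name, not part of any library.

```python
from statistics import mean

# Hypothetical feature values; None marks a missing entry.
train = [10.0, 20.0, None, 30.0]
test = [None, 100.0]

# Fit: compute the imputation statistic on the training split only.
train_mean = mean(v for v in train if v is not None)

# For contrast: a mean over all rows is influenced by the test value 100.0.
leaky_mean = mean(v for v in train + test if v is not None)


def impute(values, fill):
    """Replace missing entries with the given fill value."""
    return [fill if v is None else v for v in values]


# Transform: apply the same training statistic to both splits.
train_imputed = impute(train, train_mean)
test_imputed = impute(test, train_mean)

print(f"train mean: {train_mean}, leaky mean: {leaky_mean}")
print(f"test after imputation: {test_imputed}")
```

The gap between the two means shows how a single test value can shift a statistic that the model should never have seen.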

---

In [0]:
%run ../demo/00_Setup

## Section 1: Load Data

In [0]:
# 1. Load Data
df = spark.table("workshop_sales_data")

## Section 2: Data Cleaning

In the previous workshop, the following issues were identified:
- Non-positive quantities (`quantity <= 0`)
- Missing values (nulls)
- Potential duplicates

These must be removed or corrected to create a clean "Silver" layer.

In [0]:
from pyspark.sql.functions import col, count, when, lit, max, datediff, current_date, to_date

# Exercise 1: Filter invalid records
# Remove rows where 'quantity' <= 0 OR 'total_amount' < 0
# Save result to 'df_clean'
# Compare row counts before and after

df_clean = # TODO: Filter invalid records
# print(f"Original: {df.count()}, Clean: {df_clean.count()}")

## Section 3: Handle Missing Values

Strategy depends on the column type:
- **Key IDs (customer_id, order_id):** If missing, record is unusable -> REMOVE
- **Descriptive attributes (email, phone):** Can impute default value -> IMPUTATION

In [0]:
# Exercise 2: Handle missing values
# a) Remove rows where 'customer_id' OR 'order_id' is NULL
# b) Fill missing 'email' values with "unknown@example.com"
# Hint: Use .dropna() and .fillna()

df_clean = # TODO: Handle nulls

# print(f"Count after null handling: {df_clean.count()}")

In [0]:
# Exercise 3: Deduplication
# Check for and remove duplicate rows (exact duplicates)
# Hint: Use .dropDuplicates()

df_clean = # TODO: Remove duplicates
# print(f"Count after deduplication: {df_clean.count()}")

### Missing Flags (Informative Missingness)

Sometimes the fact that data is missing is informative. For example:
- A customer without email may be privacy-conscious
- A missing phone number may indicate an online-only customer

Creating a "missing flag" allows the model to learn from these patterns.
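
The order of operations matters: the flag must be computed before nulls are filled. A small plain-Python sketch with a hypothetical `emails` list shows what happens when the order is reversed.

```python
# Hypothetical email column; None marks a missing value.
emails = ["a@example.com", None, "b@example.com"]

# Correct order: create the flag first, then fill the nulls.
email_missing = [1 if e is None else 0 for e in emails]
emails_filled = [e if e is not None else "unknown@example.com" for e in emails]

# Wrong order: after filling, nothing is None any more, so the flag is always 0.
flag_after_fill = [1 if e is None else 0 for e in emails_filled]

print(email_missing)
print(flag_after_fill)
```

Once the default has been written into the column, the information about which rows were originally missing can only be recovered by comparing against the default value.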

In [0]:
# Exercise 3b (Optional): Create missing flag for email
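# Create a column 'email_missing' that is 1 if email is NULL, 0 otherwise
# This must be done BEFORE filling nulls: Exercise 2 already filled 'email' in df_clean,
# so add this column ahead of the .fillna() call there and re-run; otherwise the flag is always 0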

# TODO: Create missing flag
# df_clean = df_clean.withColumn("email_missing", when(col("email").isNull(), 1).otherwise(0))

## Section 4: Feature Engineering - Customer 360

The goal is to predict customer behavior (segment). Transaction data must be aggregated to the customer level.

**Planned features (RFM + Demographics):**

| Feature | Type | Description |
|---------|------|-------------|
| `total_spend` | Monetary | Sum of all purchases |
| `order_count` | Frequency | Number of orders |
| `recency` | Recency | Days since last purchase |
| `tenure` | Demographics | Days from registration to last purchase |

In [0]:
from pyspark.sql.functions import sum, count, avg, min, max, datediff, lit

# Helper: Find reference date (max date in dataset) for Recency calculation
max_date = df_clean.select(max("order_datetime")).collect()[0][0]
print(f"Reference Date: {max_date}")

# Exercise 4: Aggregate data
# Group by 'customer_id', 'customer_segment', 'country', 'registration_date'
# Calculate:
# - total_spend (sum of total_amount)
# - order_count (count of order_id)
# - last_purchase_date (max of order_datetime)

customer_features = df_clean.groupBy("customer_id", "customer_segment", "country", "registration_date") \
    .agg(
        # TODO: Add aggregations
    )

# Exercise 5: Calculate derived features
# - recency: days between max_date and last_purchase_date
# - tenure: days between last_purchase_date and registration_date

customer_features = customer_features.withColumn("recency", ...) \
                                     .withColumn("tenure", ...)

display(customer_features)

## Section 5: Encoding Categorical Features

Most machine learning models require numerical input, so the categorical `country` column must be encoded as a number.
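
To build intuition for what an index encoder produces, the sketch below mimics frequency-based label indexing in plain Python. It assumes the behavior `StringIndexer` documents by default (most frequent label gets index 0) and an extra index for unseen labels similar in spirit to `handleInvalid="keep"`; the `countries` list and the `encode` helper are hypothetical, and alphabetical tie-breaking is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical country values, one per customer.
countries = ["PL", "DE", "PL", "FR", "DE", "PL"]

counts = Counter(countries)

# Most frequent label first; ties are broken alphabetically in this sketch.
labels = sorted(counts, key=lambda label: (-counts[label], label))
index_of = {label: float(i) for i, label in enumerate(labels)}


def encode(value):
    """Map a label to its index; unseen labels share one extra index."""
    return index_of.get(value, float(len(labels)))


print(index_of)
print(encode("US"))
```

The resulting numbers are identifiers, not magnitudes: an index of 2.0 does not mean a country is "twice" an index of 1.0.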

In [0]:
from pyspark.ml.feature import StringIndexer

# Exercise 6: Encode categorical features
# Use StringIndexer to convert 'country' to 'country_index'

indexer = # TODO: Create StringIndexer
# customer_features_encoded = ...

# display(customer_features_encoded)

## Section 6: Save Feature Table

Save this aggregated and cleaned dataset for the next workshop.

In [0]:
# Exercise 7: Save the feature table as 'workshop_customer_features'

# TODO: Save table

## Best Practices: Data Cleaning

| Practice | Description |
|----------|-------------|
| **Clean before split** | Remove invalid data before train/test split |
| **Document decisions** | Record why rows were removed and why each column was imputed |
| **Preserve information** | Create missing flags before filling nulls |
| **Validate counts** | Check row counts after each cleaning step |
| **Avoid data leakage** | Calculate statistics on training data only |
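
The "validate counts" practice can be supported by a small reporting helper. The sketch below is plain Python; `report_counts` is a hypothetical helper and the counts are made-up placeholders, which in the notebook would come from calling `.count()` after each cleaning step.

```python
def report_counts(step_counts):
    """Return (step, rows, dropped_since_previous_step) for each cleaning step."""
    report = []
    previous = None
    for step, rows in step_counts:
        dropped = 0 if previous is None else previous - rows
        report.append((step, rows, dropped))
        previous = rows
    return report


# Hypothetical counts, for illustration only.
step_counts = [
    ("raw", 1000),
    ("filtered", 950),
    ("nulls handled", 940),
    ("deduplicated", 935),
]

for step, rows, dropped in report_counts(step_counts):
    print(f"{step:<15} rows={rows:>6} dropped={dropped:>4}")
```

An unexpectedly large drop at one step is usually the first sign that a filter condition or a null rule is broader than intended.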

---

# Solutions

Reference solutions for the exercises above. Compare them with your results.

In [0]:
# Load Data
df = spark.table("workshop_sales_data")

from pyspark.sql.functions import max, datediff, lit, col, sum, count

# 1. Filter
df_clean = df.filter((col("quantity") > 0) & (col("total_amount") >= 0))

# 2. Nulls & Duplicates
# 'customer_segment' is also required here because it is a grouping key in step 3
df_clean = df_clean.dropna(subset=["customer_id", "order_id", "customer_segment"]) \
                   .fillna({"email": "unknown@example.com"}) \
                   .dropDuplicates()

# 3. Aggregation (RFM + Tenure)

max_date_row = df_clean.select(max("order_datetime")).collect()
max_date = max_date_row[0][0]

customer_features = df_clean.groupBy("customer_id", "customer_segment", "country", "registration_date") \
    .agg(
        sum("total_amount").alias("total_spend"),
        count("order_id").alias("order_count"),
        max("order_datetime").alias("last_purchase_date")
    ) \
    .withColumn("recency", datediff(lit(max_date), col("last_purchase_date"))) \
    .withColumn("tenure", datediff(col("last_purchase_date"), col("registration_date")))

# 4. Encoding
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="country", outputCol="country_index", handleInvalid='keep')
customer_features_encoded = indexer.fit(customer_features).transform(customer_features)

# 5. Save
customer_features_encoded.write.mode("overwrite").option("mergeSchema", "true").saveAsTable("workshop_customer_features")
display(customer_features_encoded)
