# Module 4: Feature Transformation

**Training Objective:** Master techniques for converting raw data into ML-ready features through encoding and scaling.

**Scope:**
- Categorical Encoding: `StringIndexer` and `OneHotEncoder`
- Target Encoding: Advanced encoding technique (introduction)
- Feature Scaling: `StandardScaler`, `MinMaxScaler`, `RobustScaler`
- Window Functions: Creating features from sequential data

## Context and Requirements

- **Training day:** Day 1 - Data Preparation Fundamentals
- **Notebook type:** Demo
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
  - Permissions: CREATE TABLE, SELECT, MODIFY
- **Dependencies:** `03_Data_Imputing.ipynb` (creates `customer_train_imputed` table)
- **Execution time:** ~25 minutes

> **Note:** This module covers essential transformations that prepare data for ML algorithms.

## Theoretical Introduction

**Why transform features?**

Machine Learning models require numerical input. Raw data often contains text categories and features with vastly different scales.

**Encoding Techniques:**

| Technique | When to Use | Pros | Cons |
|-----------|-------------|------|------|
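| **StringIndexer** | Converting string labels to indices, typically as input to `OneHotEncoder` or tree-based models | Simple, compact | Orders by label frequency by default, so it does not preserve a natural order (Small/Medium/Large) and introduces false ordinality for nominal data |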
| **OneHotEncoder** | Nominal categories (Country, Color) | No false ordinality | High dimensionality |
| **Target Encoding** | High-cardinality categories | Reduces dimensions | High overfitting risk |

**Scaling Techniques:**

| Scaler | Formula | When to Use |
|--------|---------|-------------|
| **StandardScaler** | $(x - \mu) / \sigma$ | Normally distributed data |
| **MinMaxScaler** | $(x - x_{\min}) / (x_{\max} - x_{\min})$ | Neural Networks, bounded range needed |
| **RobustScaler** | $(x - \text{median}) / \text{IQR}$ | Data with outliers |

**Why scaling matters:**
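> Distance-based algorithms (K-Means, KNN, SVM) and models fitted with gradient descent or regularization (for example, regularized Linear Regression) are sensitive to feature scale. If one feature has range [0, 1] and another [0, 1,000,000], the second dominates!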
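
A small numeric sketch of this effect, written in plain Python rather than Spark. The two customers and the feature ranges are made-up values chosen only for illustration.

```python
import math

# Two hypothetical customers as (age, salary). Values are illustrative only.
a = (25.0, 50_000.0)
b = (60.0, 51_000.0)

# Assumed feature ranges, used only for this example.
AGE_RANGE = (18.0, 90.0)
SALARY_RANGE = (20_000.0, 200_000.0)


def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def min_max(value, lo, hi):
    """Rescale a value to [0, 1] given the feature's min and max."""
    return (value - lo) / (hi - lo)


def scale(point):
    """Apply min-max scaling to both features of a point."""
    age, salary = point
    return (min_max(age, *AGE_RANGE), min_max(salary, *SALARY_RANGE))


raw_distance = euclidean(a, b)
scaled_distance = euclidean(scale(a), scale(b))

print(f"raw distance:    {raw_distance:.2f}")
print(f"scaled distance: {scaled_distance:.4f}")
```

On the raw values the 1,000 salary gap swamps the 35-year age gap. After scaling, the age gap contributes most of the distance, which matches intuition for these two customers.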

## Per-User Isolation

Run the initialization script for per-user catalog and schema isolation:

In [None]:
%run ./00_Setup

**Load Imputed Data:**

In [None]:
# Load Imputed Data
df = spark.table("customer_train_imputed")

## Section 1: Categorical Encoding

### Example 1.1: StringIndexer
Machine Learning algorithms generally require numerical input. `StringIndexer` maps each unique string category to a numerical index (0.0, 1.0, 2.0, ...).

- **How it works:** It assigns indices based on frequency (most frequent = 0.0).
- **Limitation:** It introduces an artificial order (e.g., 0 < 1 < 2). If the category is "Country", this implies "USA < UK", which is meaningless for nominal data and can mislead models that treat the index as a magnitude.

In [None]:
from pyspark.ml.feature import StringIndexer

# Index 'country'
indexer = StringIndexer(inputCol="country", outputCol="country_idx", handleInvalid="keep")
indexer_model = indexer.fit(df)
df_idx = indexer_model.transform(df)

display(df_idx.select("country", "country_idx").distinct())
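
The frequency-based ordering described above can be mimicked in plain Python. This is a minimal sketch, not Spark's implementation: the sample values are hypothetical, and the alphabetical tie-break is an assumption made for this example.

```python
from collections import Counter


def frequency_index(values):
    """Map each label to a float index, most frequent label first.

    Ties are broken alphabetically here; treat that as an assumption
    of this sketch rather than a guarantee.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample of the `country` column.
countries = ["USA", "UK", "USA", "Germany", "USA", "UK"]
mapping = frequency_index(countries)

# With handleInvalid="keep", labels unseen during fit share one extra index
# placed after all known labels.
unseen_index = float(len(mapping))

print(mapping)       # {'USA': 0.0, 'UK': 1.0, 'Germany': 2.0}
print(unseen_index)  # 3.0
```

The indices reflect how often a label occurs, not any property of the label itself, which is why they should not be read as an ordering of countries.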

### Example 1.2: OneHotEncoder
To fix the ordinality issue of StringIndexer, we use One-Hot Encoding (or Dummy Variables).

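- **How it works:** It creates a binary vector for each category. Spark stores it as a sparse vector and, by default (`dropLast=True`), omits the slot for the last category, which is then encoded as all zeros.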
- **Why use it:** It allows the model to treat each category independently without assuming any order (e.g., USA is not "smaller" than UK).
- **Trade-off:** It increases the dimensionality of the dataset (Curse of Dimensionality).

In [None]:
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCols=["country_idx"], outputCols=["country_vec"])
encoder_model = encoder.fit(df_idx)
df_encoded = encoder_model.transform(df_idx)

display(df_encoded.select("country", "country_idx", "country_vec").limit(5))
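
To make the vector layout concrete, here is a simplified dense version of one-hot encoding in plain Python. The function name and the `drop_last` flag are illustrative and only mirror the idea behind Spark's `dropLast` parameter; Spark itself returns sparse vectors.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    With drop_last=True the last category is encoded as all zeros,
    so the vector has num_categories - 1 slots.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Three categories with indices 0, 1, 2 (as produced by an indexer).
vectors = [one_hot(idx, num_categories=3) for idx in range(3)]
for idx, vec in enumerate(vectors):
    print(idx, vec)
```

Dropping the last slot avoids a redundant column: if every other slot is zero, the row must belong to the last category.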

### Example 1.3: Target Encoding (Advanced)
Instead of indexing, we replace the category with the **mean of the target variable** for that category.
*Example:* If average salary in "USA" is 80k, replace "USA" with 80000.

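> ⚠️ **Warning:** High risk of overfitting! Each row's own target value contributes to its encoded feature, so compute the means on training data only and use regularization (smoothing) or cross-validation.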

In [None]:
from pyspark.sql import functions as F

# Manual Target Encoding Example
# Calculate mean salary per country (salary_imputed plays the role of the target here)
country_means = df.groupBy("country").agg(
    F.avg("salary_imputed").alias("country_target_enc")
)

# Join the per-country mean back to the main table
df_target_enc = df.join(country_means, on="country", how="left")

display(df_target_enc.select("country", "country_target_enc").distinct())
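
One common form of the regularization mentioned in the warning is additive smoothing, which pulls the mean of rare categories toward the global mean. The sketch below uses plain Python. The `smoothing` weight and the sample rows are hypothetical, and this is one possible scheme, not the method used in this notebook.

```python
from collections import defaultdict


def smoothed_target_encoding(rows, smoothing=10.0):
    """Return {category: encoded value} using additive smoothing.

    encoded = (n * category_mean + smoothing * global_mean) / (n + smoothing)
    where n is the number of rows in the category.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in rows:
        sums[category] += target
        counts[category] += 1
    global_mean = sum(sums.values()) / sum(counts.values())
    return {
        # sums[category] equals n * category_mean
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in sums
    }


# Hypothetical (country, salary) training rows.
rows = [
    ("USA", 80_000.0),
    ("USA", 90_000.0),
    ("USA", 70_000.0),
    ("UK", 60_000.0),
    ("Malta", 150_000.0),
]
encoding = smoothed_target_encoding(rows, smoothing=2.0)
print(encoding)
```

With a single observation, "Malta" moves from its raw mean of 150,000 to 110,000, much closer to the global mean of 90,000, while "USA" with three rows moves far less.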

## Section 2: Feature Scaling

Spark ML scalers operate on a single vector column, so we first assemble the numerical features into a vector with `VectorAssembler`.

In [None]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["age_imputed", "salary_imputed"], outputCol="features_num")
df_vec = assembler.transform(df_encoded)

### Example 2.1: StandardScaler vs MinMaxScaler vs RobustScaler

Many algorithms (like K-Means and KNN) calculate distances between data points, and regularized or gradient-based models such as Linear Regression are also sensitive to feature magnitude. If one feature has a range of [0, 1] and another [0, 1,000,000], the second feature will dominate the distance calculation. Scaling brings them to a comparable range.

| Scaler | Best For | Characteristics |
|--------|----------|-----------------|
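| **StandardScaler** | Normally distributed data | Mean=0, Std=1 (Spark centers only when `withMean=True`) |
| **MinMaxScaler** | Neural Networks, bounded range | Range [0, 1] |
| **RobustScaler** | Data with outliers | Uses Median/IQR (Spark subtracts the median only when `withCentering=True`) |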

In [None]:
from pyspark.ml.feature import StandardScaler, MinMaxScaler, RobustScaler

# 1. Standard Scaler
# withMean=True centers the data; Spark's default (withMean=False) only divides by the standard deviation
scaler_std = StandardScaler(inputCol="features_num", outputCol="features_std", withMean=True, withStd=True)
df_scaled = scaler_std.fit(df_vec).transform(df_vec)

# 2. MinMax Scaler
scaler_minmax = MinMaxScaler(inputCol="features_num", outputCol="features_minmax")
df_scaled = scaler_minmax.fit(df_scaled).transform(df_scaled)

# 3. Robust Scaler (Great for our Salary outliers!)
# withCentering=True subtracts the median; Spark's default (withCentering=False) only divides by the IQR
scaler_robust = RobustScaler(inputCol="features_num", outputCol="features_robust", withCentering=True, withScaling=True)
df_scaled = scaler_robust.fit(df_scaled).transform(df_scaled)

display(df_scaled.select("features_num", "features_std", "features_minmax", "features_robust").limit(5))
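
The three formulas can be compared on a tiny sample with one outlier. This sketch uses only Python's `statistics` module. The salaries are invented, and the quartiles are exact here, whereas Spark's `RobustScaler` estimates them approximately on large data.

```python
import statistics

# Hypothetical salaries with one extreme outlier.
salaries = [40_000.0, 50_000.0, 60_000.0, 70_000.0, 1_000_000.0]

mean = statistics.mean(salaries)
std = statistics.stdev(salaries)  # sample standard deviation
lo, hi = min(salaries), max(salaries)
median = statistics.median(salaries)
q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1

standard = [(x - mean) / std for x in salaries]    # (x - mean) / std
minmax = [(x - lo) / (hi - lo) for x in salaries]  # (x - min) / (max - min)
robust = [(x - median) / iqr for x in salaries]    # (x - median) / IQR

for name, values in [("standard", standard), ("minmax", minmax), ("robust", robust)]:
    print(f"{name:>8}: {[round(v, 3) for v in values]}")
```

The outlier squeezes the four ordinary salaries into a narrow band under min-max scaling (all below 0.04), while the robust version keeps them spread between -1 and 0.5 and leaves the outlier clearly visible.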

## Section 3: Window Functions (Sequential Features)

For time-series or ordered data, we often need values from "previous rows".

In [None]:
from pyspark.sql.window import Window
from pyspark.sql.functions import lag, row_number, avg

# Define Window: Partition by Country, Order by Date
w = Window.partitionBy("country").orderBy("registration_date")

# 1. Lag: Previous salary in the same country
df_window = df_scaled.withColumn("prev_salary", lag("salary_imputed", 1).over(w))

# 2. Row Number: Order of registration
df_window = df_window.withColumn("reg_rank", row_number().over(w))

# 3. Rolling Average: Avg salary over the current row and the 2 preceding rows (3 rows total)
w_rolling = w.rowsBetween(-2, 0)
df_window = df_window.withColumn("rolling_avg_salary", avg("salary_imputed").over(w_rolling))

display(df_window.select("country", "registration_date", "salary_imputed", "prev_salary", "rolling_avg_salary"))
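
What `lag` and a `rowsBetween(-2, 0)` frame compute within a single partition can be sketched in plain Python. The helper names and the salary list are illustrative, and the list is assumed to be already ordered by registration date.

```python
def lag(values, offset=1):
    """Value from `offset` rows earlier; None where no earlier row exists."""
    return [values[i - offset] if i >= offset else None for i in range(len(values))]


def rolling_mean(values, window=3):
    """Mean over the current row and up to `window - 1` preceding rows."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out


# Hypothetical salaries for one country, ordered by registration date.
salaries = [50_000.0, 60_000.0, 70_000.0, 100_000.0]

prev_salary = lag(salaries, 1)
rolling_avg_salary = rolling_mean(salaries, window=3)

print(prev_salary)
print([round(v, 2) for v in rolling_avg_salary])
```

The first row has no predecessor, so its lag is empty (`null` in Spark), and the first two rolling averages are taken over fewer than three rows because the frame is cut off at the start of the partition.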

## Best Practices

### 🎯 Transformation Strategy Guide:

| Feature Type | Recommended Transformation |
|--------------|---------------------------|
| Categorical (low cardinality <10) | OneHotEncoder |
| Categorical (high cardinality >100) | Target Encoding or Embeddings |
| Categorical (ordinal) | Explicit index mapping in the natural order (`StringIndexer` orders by frequency by default) |
| Numerical (normal distribution) | StandardScaler |
| Numerical (with outliers) | RobustScaler |
| Numerical (bounded range needed) | MinMaxScaler |

### ⚠️ Common Mistakes to Avoid:

1. **Using StringIndexer for nominal data** → Introduces false ordinality
2. **OneHotEncoder on high-cardinality** → Curse of dimensionality
3. **Target Encoding without regularization** → Overfitting
4. **Scaling before splitting** → Data leakage
5. **Not scaling for distance-based models** → Feature dominance

### 💡 Pro Tips:

- Always fit scalers on TRAINING data only
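- Use `handleInvalid="keep"` for StringIndexer so that categories unseen during fitting are mapped to an extra index instead of raising an error
- Consider RobustScaler when features contain outliers (the median and IQR are less affected by extreme values than the mean and standard deviation)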
- Window functions are powerful for time-series feature engineering
- Save transformer models for applying to test/production data
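
The "fit on training data only" rule can be illustrated with a toy scaler in plain Python. `SimpleStandardScaler` is a hypothetical class written for this sketch, not a Spark or scikit-learn API, and the numbers are invented.

```python
import statistics


class SimpleStandardScaler:
    """Toy scaler that mirrors the fit/transform split; not a Spark class."""

    def fit(self, values):
        # Statistics are learned from the data passed to fit() and nothing else.
        self.mean_ = statistics.mean(values)
        self.std_ = statistics.stdev(values)
        return self

    def transform(self, values):
        # The stored statistics are reused for any later data.
        return [(v - self.mean_) / self.std_ for v in values]


train = [30.0, 40.0, 50.0]
test = [60.0]

scaler = SimpleStandardScaler().fit(train)  # fit on TRAINING data only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)        # test data never influences mean/std

print(train_scaled)
print(test_scaled)
```

Because the mean and standard deviation come from the training rows alone, the test value is scaled with the same parameters a production system would use and no information from the test set leaks into the features.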

## Summary

### What we achieved:

- **StringIndexer**: Converted categories to numerical indices
- **OneHotEncoder**: Created binary vectors for nominal categories
- **Target Encoding**: Introduced advanced encoding (with caveats)
- **Scalers**: Compared StandardScaler, MinMaxScaler, RobustScaler
- **Window Functions**: Created lag and rolling average features

### Key Takeaways:

| # | Principle |
|---|-----------|
| 1 | **Choose encoder by category type** - nominal vs ordinal |
| 2 | **RobustScaler for outliers** - uses median/IQR |
| 3 | **Fit on train only** - prevent data leakage |
| 4 | **Window functions for time-series** - powerful feature engineering |
| 5 | **Target Encoding is risky** - use with cross-validation |

### Data Pipeline Status:

| Table | Created | Used By |
|-------|---------|---------|
| `customer_train_imputed` | Module 3 | This module |
| `customer_train_transformed` | ✅ This module | Modules 5-7 |

### Next Steps:

📚 **Next Module:** Module 5 - Feature Engineering (VectorAssembler, log transforms)

## Cleanup

Optionally remove demo tables created during exercises:

In [None]:
# Cleanup - remove demo tables created in this notebook

# Uncomment the lines below to remove demo tables:

# spark.sql(f"DROP TABLE IF EXISTS {catalog_name}.{schema_name}.customer_train_transformed")

# print("✅ All demo tables removed")

print("ℹ️ Cleanup disabled (uncomment code to remove demo tables)")