# Module 4: Feature Transformation

## Business Context: TechCorp HR Analytics (continued)

**Where We Are:**
Our TechCorp employee data now has complete (imputed) values, but ML algorithms require numerical input. The remaining text categories must be encoded and the numeric columns scaled.

**Columns to Transform:**
| Column | Type | Transformation |
|--------|------|----------------|
| `country` | Categorical (5 values) | StringIndexer → OneHotEncoder |
| `source` | Categorical (5 values incl. UNKNOWN) | StringIndexer → OneHotEncoder |
| `age_imputed` | Numerical | VectorAssembler → Scaler |
| `salary_imputed` | Numerical | VectorAssembler → Scaler |

---

**Training Objective:** Master feature transformation techniques.

**Scope:**
- Categorical Encoding: `StringIndexer` and `OneHotEncoder`
- Feature Scaling: `StandardScaler`, `MinMaxScaler`, `RobustScaler`
- Window Functions: Creating features from sequential data


## Context and Requirements

- **Training day:** Day 1 - Data Preparation Fundamentals
- **Notebook type:** Demo
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
  - Permissions: CREATE TABLE, SELECT, MODIFY
- **Dependencies:** `03_Data_Imputing.ipynb` (creates `customer_train_imputed` table)
- **Execution time:** ~25 minutes

> **Note:** This module covers essential transformations that prepare data for ML algorithms.

## Theoretical Introduction

**Why transform features?**

Machine Learning models require numerical input. Raw data often contains text categories and features with vastly different scales.

**Encoding Techniques:**

| Technique | When to Use | Pros | Cons |
|-----------|-------------|------|------|
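| **StringIndexer** | Ordinal categories (Small/Medium/Large), or as the first step before OneHotEncoder | Simple, one compact column | Index order follows frequency or alphabet, not meaning; introduces false ordinality for nominal data |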
| **OneHotEncoder** | Nominal categories (Country, Color) | No false ordinality | High dimensionality |
| **Target Encoding** | High-cardinality categories | Reduces dimensions | High overfitting risk |

**Scaling Techniques:**

| Scaler | Formula | When to Use |
|--------|---------|-------------|
| **StandardScaler** | $(x - \mu) / \sigma$ | Normally distributed data |
| **MinMaxScaler** | $(x - x_{\min}) / (x_{\max} - x_{\min})$ | Neural Networks, bounded range needed |
| **RobustScaler** | $(x - \text{median}) / \text{IQR}$ | Data with outliers |
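
The three formulas in the table can be checked by hand on a tiny list. The sketch below uses plain Python rather than Spark, purely to show the arithmetic; the sample values and function names are hypothetical, and the quartile method (`inclusive`) is an assumption that will not match Spark's approximate quantiles exactly. Note also that Spark's `StandardScaler` and `RobustScaler` do not center by default, whereas this sketch always does.

```python
import statistics

def standard_scale(values):
    # (x - mean) / sample standard deviation
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(x - mu) / sigma for x in values]

def min_max_scale(values):
    # (x - min) / (max - min), mapping the data onto [0, 1]
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def robust_scale(values):
    # (x - median) / IQR, where IQR = Q3 - Q1
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(x - med) / (q3 - q1) for x in values]

# Hypothetical feature with one extreme outlier
values = [1.0, 2.0, 3.0, 4.0, 100.0]

std_scaled = standard_scale(values)
minmax_scaled = min_max_scale(values)
robust_scaled = robust_scale(values)

print(minmax_scaled)
print(robust_scaled)
```

With the outlier present, min-max scaling squeezes the four ordinary values into a narrow band near 0, while the median/IQR version keeps them spread out and pushes only the outlier far away.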

**Why scaling matters:**
> Distance-based algorithms (K-Means, KNN, SVM) compare feature values directly, and linear models trained with regularization or gradient descent are also sensitive to scale. If one feature has range [0, 1] and another [0, 1,000,000], the second dominates!
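
To make the dominance effect concrete, here is a small plain-Python sketch (not Spark) that compares Euclidean distances before and after min-max scaling. The three (age, salary) records and the helper names are hypothetical, chosen only for illustration.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two feature tuples
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_columns(rows):
    # Scale each column independently onto [0, 1]
    columns = list(zip(*rows))
    bounds = [(min(c), max(c)) for c in columns]
    return [
        tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(row, bounds))
        for row in rows
    ]

# Hypothetical (age, salary) records
people = [
    (25.0, 50_000.0),  # A
    (60.0, 50_500.0),  # B: 35 years older, nearly the same salary
    (26.0, 52_000.0),  # C: same age group, somewhat higher salary
]

raw_ab = euclidean(people[0], people[1])
raw_ac = euclidean(people[0], people[2])

scaled = min_max_columns(people)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```

On raw values, B looks much closer to A than C does, because the 35-year age gap barely registers next to salary differences in the hundreds. After scaling, both features contribute on the same footing and the ordering flips.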

## Per-User Isolation

Run the initialization script for per-user catalog and schema isolation:

In [0]:
%run ./00_Setup

**Load Imputed Data:**

In [0]:
# Load Imputed Data
df = spark.table("customer_train_imputed")

## Section 1: Categorical Encoding

### Example 1.1: StringIndexer
Machine Learning algorithms generally require numerical input. `StringIndexer` maps each unique string category to a numerical index (0.0, 1.0, 2.0, ...).

- **How it works:** By default it assigns indices by descending frequency (most frequent = 0.0).
- **Limitation:** It introduces an artificial order (e.g., 0 < 1 < 2). If the category is "Country", this implies "USA < UK", which is meaningless for nominal data.
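
The frequency-based assignment described above can be sketched in a few lines of plain Python. This is an illustration of the idea, not Spark's implementation: the country values are hypothetical, and breaking ties alphabetically is an assumption worth checking against the Spark version in use.

```python
from collections import Counter

def frequency_desc_index(values):
    # Count each label, then order by descending frequency.
    # Ties are broken alphabetically (assumption for this sketch).
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical column values
countries = ["USA", "UK", "USA", "DE", "UK", "USA"]
mapping = frequency_desc_index(countries)
print(mapping)
```

The resulting numbers reflect only how often each label occurs, which is why feeding them directly into a model as a numeric feature is misleading for nominal data.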


### StringIndexer Options

| Option         | Description                                                                                                 | Values / Default                |
|----------------|------------------------------------------------------------------------------------------------------------|---------------------------------|
| **inputCol**   | Name of the input column containing string categories.                                                      | String                          |
| **outputCol**  | Name of the output column with indexed values.                                                              | String                          |
| **handleInvalid** | How to handle unseen or NULL values in the input column.                                                 | 'error' (default), 'skip', 'keep' |
| **stringOrderType** | How to order labels before assigning indices.                                                          | 'frequencyDesc' (default), 'frequencyAsc', 'alphabetDesc', 'alphabetAsc' |
| **inputCols**  | List of input columns (for multi-column indexing).                                                         | List[String]                    |
| **outputCols** | List of output columns (for multi-column indexing).                                                        | List[String]                    |

**Details:**
- `handleInvalid='error'`: Throws error for unseen/null values (default).
- `handleInvalid='skip'`: Drops rows with unseen/null values.
- `handleInvalid='keep'`: Assigns unseen/null values to a special index.
- `stringOrderType`: Controls how categories are ordered before indexing (by frequency or alphabetically, ascending or descending).
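
The three `handleInvalid` modes listed above differ only in what happens when a label was not seen during fitting. A minimal plain-Python sketch of that behaviour follows; `apply_index` and the sample labels are hypothetical, and the "extra index equals the number of known labels" rule for `keep` mirrors how Spark documents it.

```python
def apply_index(values, mapping, handle_invalid="error"):
    # mapping: label -> index learned during fitting
    indexed = []
    for value in values:
        if value in mapping:
            indexed.append(mapping[value])
        elif handle_invalid == "keep":
            # Unseen labels share one extra bucket after the known labels
            indexed.append(float(len(mapping)))
        elif handle_invalid == "skip":
            # Drop the row entirely
            continue
        else:
            raise ValueError(f"Unseen label: {value!r}")
    return indexed

mapping = {"USA": 0.0, "UK": 1.0, "DE": 2.0}
new_values = ["UK", "FR", "USA"]  # "FR" was not seen during fitting

kept = apply_index(new_values, mapping, handle_invalid="keep")
skipped = apply_index(new_values, mapping, handle_invalid="skip")
print(kept)
print(skipped)
```

Note that `skip` silently changes the row count, which is easy to miss when scoring production data.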

In [0]:
from pyspark.ml.feature import StringIndexer

# Index 'country'
indexer = StringIndexer(inputCol="country", outputCol="country_idx", handleInvalid="keep")
indexer_model = indexer.fit(df)
df_idx = indexer_model.transform(df)

In [0]:
display(df_idx.select("country", "country_idx").distinct())

### Example 1.2: OneHotEncoder
To fix the ordinality issue of StringIndexer, we use One-Hot Encoding (or Dummy Variables).

- **How it works:** It creates a binary vector for each category.
- **Why use it:** It allows the model to treat each category independently without assuming any order (e.g., USA is not "smaller" than UK).
- **Trade-off:** It increases the dimensionality of the dataset (Curse of Dimensionality).

### OneHotEncoder Options

| Option           | Description                                                                                                 | Values / Default                |
|------------------|------------------------------------------------------------------------------------------------------------|---------------------------------|
| **inputCol**     | Name of the input column to encode (single column).                                                        | String                          |
| **inputCols**    | List of input columns to encode (multi-column support).                                                    | List[String]                    |
| **outputCol**    | Name of the output column for the encoded vector (single column).                                          | String                          |
| **outputCols**   | List of output columns for the encoded vectors (multi-column support).                                     | List[String]                    |
| **dropLast**     | Whether to drop the last category to avoid collinearity (one less output column per input).                | True (default), False           |
| **handleInvalid**| How to handle invalid (unseen or null) values during transform: 'error' (default) or 'keep' (extra index). | 'error' (default), 'keep'       |

**Details:**
- `inputCol`/`outputCol`: For encoding a single column.
- `inputCols`/`outputCols`: For encoding multiple columns at once.
- `dropLast=True`: Drops the last category to avoid the dummy variable trap.
- `handleInvalid='error'`: Throws error for unseen/null values (default).
- `handleInvalid='keep'`: Assigns unseen/null values to an extra category.
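
The effect of `dropLast` is easiest to see on a toy example. The sketch below is plain Python with dense lists (Spark produces sparse vectors, so the displayed form differs); the `one_hot` helper is hypothetical.

```python
def one_hot(index, num_categories, drop_last=True):
    # With drop_last=True the vector has one slot fewer, and the
    # last category is represented by the all-zeros vector.
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

# Three categories with indices 0, 1, 2
dropped = [one_hot(i, 3) for i in range(3)]
full = [one_hot(i, 3, drop_last=False) for i in range(3)]
print(dropped)
print(full)
```

Dropping one slot removes the redundancy that the slots always sum to 1, which matters mainly for linear models without regularization.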

In [0]:
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCols=["country_idx"], outputCols=["country_vec"])
encoder_model = encoder.fit(df_idx)
df_encoded = encoder_model.transform(df_idx)

In [0]:
%skip
from pyspark.ml.feature import OneHotEncoder

encoder = OneHotEncoder(inputCols=["country_idx"], outputCols=["country_vec"], dropLast=True)
encoder_model = encoder.fit(df_idx)
df_encoded = encoder_model.transform(df_idx)

In [0]:
display(df_encoded.select("country", "country_idx", "country_vec").limit(5))

### Example 1.3: Target Encoding (Advanced)
Instead of indexing, we replace the category with the **mean of the target variable** for that category.
*Example:* If average salary in "USA" is 80k, replace "USA" with 80000.

> ⚠️ **Warning:** High risk of overfitting! Use with regularization or cross-validation.

In [0]:
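# Manual Target Encoding Example
from pyspark.sql import functions as F

# Calculate mean salary per country (salary stands in for the target here).
# Note: computing means on the full dataset leaks target information;
# in practice, compute them on the training split only.
country_means = df.groupBy("country").agg(
    F.avg("salary_imputed").alias("country_target_enc")
)

# Join the per-country mean back to the main table
df_target_enc = df.join(country_means, on="country", how="left")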

In [0]:
display(df_target_enc.select("country", "country_target_enc").distinct())
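
The warning above mentions regularization. One common form is additive smoothing, which pulls each category mean toward the global mean, more strongly for categories with few rows. The sketch below is plain Python, not the notebook's Spark code; the smoothing weight `m`, the function name, and the sample data are all assumptions for illustration.

```python
from collections import defaultdict

def smoothed_target_encoding(categories, targets, m=10.0):
    # encoded = (sum of targets in category + m * global mean) / (count + m)
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        category: (sums[category] + m * global_mean) / (counts[category] + m)
        for category in sums
    }

# Hypothetical data: "UK" has a single, extreme observation
countries = ["USA", "USA", "USA", "UK"]
salaries = [80_000.0, 90_000.0, 70_000.0, 200_000.0]

encoding = smoothed_target_encoding(countries, salaries, m=2.0)
print(encoding)
```

Here the raw mean for "UK" would be 200,000 from one row; smoothing moves it well toward the global mean of 110,000, while "USA" with three rows moves less.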

## Section 2: Feature Scaling

Spark ML scalers operate on a single vector column, so we first assemble the numeric features into one vector.

### VectorAssembler Options

| Option           | Description                                                                                                 | Values / Default                |
|------------------|------------------------------------------------------------------------------------------------------------|---------------------------------|
| **inputCols**    | List of input column names to assemble into a vector.                                                      | List[String] (required)         |
| **outputCol**    | Name of the output column for the assembled vector.                                                        | String (required)               |
| **handleInvalid**| How to handle invalid (NULL or NaN) values: 'error' (throw), 'skip' (drop row), 'keep' (NaN in vector).    | 'error' (default), 'skip', 'keep' |
| **params**       | Not a constructor option: read-only attribute that returns all params ordered by name.                     | -                               |
| **uid**          | Not a constructor option: read-only unique identifier for the instance.                                    | -                               |

**Details:**
- `inputCols`: Columns can be numeric or vector type.
- `outputCol`: Output column will be of vector type.
- `handleInvalid`: Use 'keep' to retain rows with invalid values as NaN in the output vector.

In [0]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["age_imputed", "salary_imputed"], outputCol="features_num")
df_vec = assembler.transform(df_encoded)

In [0]:
display(df_vec)

### Example 2.1: StandardScaler vs MinMaxScaler vs RobustScaler

Distance-based algorithms (like K-Means and KNN) compare data points feature by feature, and linear models trained with regularization or gradient descent are also sensitive to feature scale. If one feature has a range of [0, 1] and another [0, 1,000,000], the second feature will dominate the distance calculation. Scaling brings them to a comparable range.

| Scaler | Best For | Characteristics |
|--------|----------|-----------------|
| **StandardScaler** | Normally distributed data | Mean=0, Std=1 (with `withMean=True`; the Spark default only scales to unit std) |
| **MinMaxScaler** | Neural Networks, bounded range | Range [0, 1] |
| **RobustScaler** | Data with outliers | Uses Median/IQR |

### Scaler Options: StandardScaler, MinMaxScaler, RobustScaler

| Scaler            | Option         | Description                                                                 | Values / Default                |
|-------------------|---------------|-----------------------------------------------------------------------------|---------------------------------|
| **StandardScaler**| `inputCol`    | Name of input column (vector).                                              | String (required)               |
|                   | `outputCol`   | Name of output column.                                                      | String (required)               |
|                   | `withMean`    | Center data with mean.                                                      | False (default), True           |
|                   | `withStd`     | Scale to unit standard deviation.                                           | True (default), False           |
| **MinMaxScaler**  | `inputCol`    | Name of input column (vector).                                              | String (required)               |
|                   | `outputCol`   | Name of output column.                                                      | String (required)               |
|                   | `min`         | Lower bound after transformation.                                           | 0.0 (default)                   |
|                   | `max`         | Upper bound after transformation.                                           | 1.0 (default)                   |
| **RobustScaler**  | `inputCol`    | Name of input column (vector).                                              | String (required)               |
|                   | `outputCol`   | Name of output column.                                                      | String (required)               |
|                   | `withCentering`| Center data with median.                                                    | False (default), True           |
|                   | `withScaling` | Scale data according to IQR (interquartile range).                          | True (default), False           |
|                   | `lower`       | Lower quantile to calculate IQR.                                            | 0.25 (default)                  |
|                   | `upper`       | Upper quantile to calculate IQR.                                            | 0.75 (default)                  |

**Notes:**
- All scalers require input as a vector column (use `VectorAssembler` first).
- `StandardScaler` is sensitive to outliers; `RobustScaler` is robust to outliers.
- `MinMaxScaler` is useful for neural networks or when a bounded range is needed.

In [0]:
from pyspark.ml.feature import StandardScaler, MinMaxScaler, RobustScaler

In [0]:
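# 1. Standard Scaler
# withMean=True centers each feature (Spark's default is False, i.e. scaling only)
scaler_std = StandardScaler(inputCol="features_num", outputCol="features_std", withMean=True, withStd=True)
df_scaled = scaler_std.fit(df_vec).transform(df_vec)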


In [0]:
display(df_scaled.select("features_std"))

In [0]:
# 2. MinMax Scaler
scaler_minmax = MinMaxScaler(inputCol="features_num", outputCol="features_minmax")
df_scaled = scaler_minmax.fit(df_scaled).transform(df_scaled)

In [0]:
display(df_scaled.select("features_minmax"))

In [0]:
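# 3. Robust Scaler (Great for our Salary outliers!)
# withCentering=True subtracts the median (Spark's default is False, i.e. IQR scaling only)
scaler_robust = RobustScaler(inputCol="features_num", outputCol="features_robust", withCentering=True, withScaling=True)
df_scaled = scaler_robust.fit(df_scaled).transform(df_scaled)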



In [0]:
display(df_scaled.select("features_robust"))

In [0]:
display(df_scaled.select("features_num", "features_std", "features_minmax", "features_robust").limit(5))

## Section 3: Window Functions (Sequential Features)

For time-series or ordered data, we often need values from "previous rows".

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import lag, lead, row_number, avg, col

In [0]:
# Define Window: Partition by Country, Order by Date
w = Window.partitionBy("country").orderBy("registration_date")

# 1. Lag: Salary of the previous row in the same country (NULL for the first row)
df_window = df_scaled.withColumn("prev_salary", lag("salary_imputed", 1).over(w))

# 2. Row Number: Order of registration within the country
df_window = df_window.withColumn("reg_rank", row_number().over(w))

# 3. Rolling Average: Avg salary over the current row and the 2 preceding rows
w_rolling = w.rowsBetween(-2, 0)
df_window = df_window.withColumn("rolling_avg_salary", avg("salary_imputed").over(w_rolling))

In [0]:
display(df_window.select("country", "registration_date", "salary_imputed", "prev_salary", "rolling_avg_salary"))
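
To see what the lag and rolling-average columns contain, it can help to compute them by hand for a single partition. The sketch below is plain Python over an already-ordered list (standing in for one country sorted by registration date); the helper names and salary values are hypothetical.

```python
def lag(values, offset=1):
    # Value from `offset` rows earlier; None where no such row exists
    return [values[i - offset] if i >= offset else None for i in range(len(values))]

def rolling_mean(values, preceding=2):
    # Mean over the current row and up to `preceding` earlier rows,
    # matching a frame of rowsBetween(-preceding, 0)
    result = []
    for i in range(len(values)):
        window = values[max(0, i - preceding): i + 1]
        result.append(sum(window) / len(window))
    return result

# Hypothetical salaries for one country, ordered by registration date
salaries = [50.0, 60.0, 70.0, 80.0]

prev_salary = lag(salaries)
rolling_avg = rolling_mean(salaries)
print(prev_salary)
print(rolling_avg)
```

The first rows of each partition use a shorter window, so the first rolling average equals the first salary itself and the first lag value is empty.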

In [0]:
df_window.write.mode("overwrite").saveAsTable("customer_train_transformed") 

In [0]:
display(spark.table("customer_train_transformed"))

In [0]:
# Register the DataFrame as a temp view for SQL
df_scaled.createOrReplaceTempView("df_scaled_sql")

# SQL version of the window operations
query = """
SELECT
  country,
  registration_date,
  salary_imputed,
  LAG(salary_imputed, 1) OVER (PARTITION BY country ORDER BY registration_date) AS prev_salary,
  ROW_NUMBER() OVER (PARTITION BY country ORDER BY registration_date) AS reg_rank,
  AVG(salary_imputed) OVER (
    PARTITION BY country
    ORDER BY registration_date
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS rolling_avg_salary
FROM df_scaled_sql
"""

df_window_sql = spark.sql(query)
display(df_window_sql.select("country", "registration_date", "salary_imputed", "prev_salary", "rolling_avg_salary"))

## Best Practices

### 🎯 Transformation Strategy Guide:

| Feature Type | Recommended Transformation |
|--------------|---------------------------|
| Categorical (low cardinality <10) | OneHotEncoder |
| Categorical (high cardinality >100) | Target Encoding or Embeddings |
| Categorical (ordinal) | StringIndexer only, provided the resulting index order matches the real order |
| Numerical (normal distribution) | StandardScaler |
| Numerical (with outliers) | RobustScaler |
| Numerical (bounded range needed) | MinMaxScaler |

### ⚠️ Common Mistakes to Avoid:

1. **Using StringIndexer alone for nominal data** → Introduces false ordinality
2. **OneHotEncoder on high-cardinality** → Curse of dimensionality
3. **Target Encoding without regularization** → Overfitting
4. **Fitting scalers or encoders before the train/test split** → Data leakage
5. **Not scaling for distance-based models** → Feature dominance

### 💡 Pro Tips:

- Always fit scalers on TRAINING data only
- Use `handleInvalid="keep"` for StringIndexer to handle new categories
- Consider RobustScaler as a default when outliers are likely (median and IQR are less sensitive to extreme values than mean and standard deviation)
- Window functions are powerful for time-series feature engineering
- Save transformer models for applying to test/production data
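
The "fit on training data only" tip can be sketched with a toy scaler that follows the same fit/transform split as Spark's estimators. This is plain Python for illustration; the `MinMax` class and the sample values are hypothetical.

```python
class MinMax:
    """Toy min-max scaler with separate fit and transform steps."""

    def fit(self, values):
        # Learn the bounds from the training data only
        self.lo = min(values)
        self.hi = max(values)
        return self

    def transform(self, values):
        # Reuse the learned bounds; never recompute them on new data
        return [(x - self.lo) / (self.hi - self.lo) for x in values]

train = [20.0, 30.0, 40.0]
test = [25.0, 50.0]

scaler = MinMax().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
print(train_scaled)
print(test_scaled)
```

A test value outside the training range lands outside [0, 1]. That is expected: refitting on the test set to "fix" it would leak information about unseen data into the model.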

## Summary

### What we achieved:

- **StringIndexer**: Converted categories to numerical indices
- **OneHotEncoder**: Created binary vectors for nominal categories
- **Target Encoding**: Introduced advanced encoding (with caveats)
- **Scalers**: Compared StandardScaler, MinMaxScaler, RobustScaler
- **Window Functions**: Created lag and rolling average features

### Key Takeaways:

| # | Principle |
|---|-----------|
| 1 | **Choose encoder by category type** - nominal vs ordinal |
| 2 | **RobustScaler for outliers** - uses median/IQR |
| 3 | **Fit on train only** - prevent data leakage |
| 4 | **Window functions for time-series** - powerful feature engineering |
| 5 | **Target Encoding is risky** - use with cross-validation |

### Data Pipeline Status:

| Table | Created | Used By |
|-------|---------|---------|
| `customer_train_imputed` | Module 3 | This module |
| `customer_train_transformed` | ✅ This module | Modules 5-7 |

### Next Steps:

📚 **Next Module:** Module 5 - Feature Engineering (VectorAssembler, log transforms)

## Cleanup

Optionally remove demo tables created during exercises:

In [0]:
# Cleanup - remove demo tables created in this notebook

# Uncomment the lines below to remove demo tables:

# spark.sql(f"DROP TABLE IF EXISTS {catalog_name}.{schema_name}.customer_train_transformed")

# print("✅ All demo tables removed")

print("ℹ️ Cleanup disabled (uncomment code to remove demo tables)")