# Module 5: Advanced Feature Engineering

**Training Objective:** Master the art of creating new features from existing data to improve model performance.

**Scope:**
- Log Transformation: Handling skewed distributions
- Interaction Features: Creating new signals (e.g., LTV Proxy)
- VectorAssembler: Preparing final feature vector for Spark ML
- Feature Selection: Correlation analysis for dimensionality reduction

## Context and Requirements

- **Training day:** Day 1 - Data Preparation Fundamentals
- **Notebook type:** Demo
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
  - Permissions: CREATE TABLE, SELECT, MODIFY
- **Dependencies:** `04_Feature_Transformation.ipynb` (creates `customer_train_transformed` table)
- **Execution time:** ~20 minutes

> **Note:** Feature engineering is often the difference between a mediocre model and a great one!

## Theoretical Introduction

**What is Feature Engineering?**

Feature Engineering is the process of creating new features from raw data to improve model performance. It's often considered the most creative and impactful part of ML.

**Common Techniques:**

| Technique | When to Use | Example |
|-----------|-------------|---------|
| **Log Transform** | Skewed distributions | `log(salary)` |
| **Polynomial Features** | Non-linear relationships | `age^2`, `age*income` |
| **Interaction Features** | Combined effects | `salary * tenure` |
| **Date Features** | Time-based patterns | `day_of_week`, `month` |
| **Binning** | Continuous → categorical | Age groups: 18-25, 26-35 |
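
As a rough illustration of the last two rows of the table, the sketch below derives calendar features from a date and maps an age to a bucket. It is plain Python rather than Spark code, and the function names, bucket edges, and sample values are assumptions made for the example.

```python
from datetime import date


def date_features(d: date) -> dict:
    """Extract simple calendar features from a date."""
    return {
        "day_of_week": d.isoweekday(),  # ISO convention: 1 = Monday ... 7 = Sunday
        "month": d.month,
    }


def age_bin(age: int) -> str:
    """Map an age to a labeled bucket (bucket edges are illustrative)."""
    bins = [(18, 25), (26, 35), (36, 50)]
    for low, high in bins:
        if low <= age <= high:
            return f"{low}-{high}"
    return "other"  # ages outside every bucket


features = date_features(date(2024, 3, 15))
bucket = age_bin(29)
```

Day-of-week conventions differ between libraries (Spark's `dayofweek` counts Sunday as 1), so it is worth checking which one a pipeline uses before comparing values.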

**Why VectorAssembler?**
> Unlike Scikit-Learn which accepts a feature matrix $X$, Spark MLlib requires a **single column** of type `Vector`. `VectorAssembler` combines multiple columns into this vector.

**Feature Selection Importance:**
- Too many features → Overfitting, slow training
- Correlated features → Multicollinearity (confuses linear models)
- Irrelevant features → Noise that hurts performance

## Per-User Isolation

Run the initialization script for per-user catalog and schema isolation:

In [0]:
%run ./00_Setup

**Load Transformed Data:**

In [0]:
# Load Transformed Data
df = spark.table("customer_train_transformed")

## Section 1: Feature Extraction

### Example 1.1: Log Transformation
Many real-world variables (like Salary, House Prices, Population) follow a "Power Law" or "Long Tail" distribution.

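- **The Problem:** Heavy skew lets a handful of extreme values dominate a linear model's fit, and the residuals often end up skewed too, which violates the normal-residual assumption behind the model's standard errors and tests.
- **The Solution:** Applying a logarithm compresses the long tail, making the distribution more symmetric and closer to bell-shaped (Gaussian).
- **Note:** We use `log1p`, which computes `log(x + 1)`, because `log(0)` is undefined. It still requires `x > -1`, so it suits non-negative values such as salaries.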
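
To make the "compresses the long tail" claim concrete, the sketch below compares the skewness of a small sample before and after `log1p`. It uses Python's standard `math.log1p` rather than the Spark function, and the salary values and the `skewness` helper are hypothetical, chosen only for illustration.

```python
import math


def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5


# Hypothetical right-skewed salaries with one very large value
salaries = [20_000, 30_000, 35_000, 40_000, 45_000, 60_000, 80_000, 500_000]
log_salaries = [math.log1p(s) for s in salaries]

raw_skew = skewness(salaries)
log_skew = skewness(log_salaries)  # smaller than raw_skew for this sample

# log1p is defined at zero, unlike a plain logarithm
zero_safe = math.log1p(0)
```

The transformed sample is still right-skewed, only less so. A log transform reduces skew; it does not guarantee a Gaussian shape.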

In [0]:
from pyspark.sql.functions import log1p, col

In [0]:
# log1p calculates log(x + 1) to handle zeros safely
df_eng = df.withColumn("log_salary", log1p(col("salary_imputed")))

display(df_eng.select("salary_imputed", "log_salary").limit(5))

### Example 1.2: Interaction Features
Combining two features can reveal hidden patterns.
*Example:* `LTV_Proxy = Salary * Tenure` (Lifetime Value approximation).

In [0]:
from pyspark.sql.functions import datediff, current_date

In [0]:
# First, calculate tenure (days since registration).
# Note: current_date() makes tenure_days depend on the day the notebook is run.
df_eng = df_eng.withColumn("tenure_days", datediff(current_date(), col("registration_date")))

# Now create the interaction: Salary * Tenure
df_eng = df_eng.withColumn("ltv_proxy", col("salary_imputed") * col("tenure_days"))


In [0]:
display(df_eng.select("salary_imputed", "tenure_days", "ltv_proxy").limit(5))
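
Because the tenure calculation uses `current_date()`, both `tenure_days` and `ltv_proxy` shift every day the notebook is re-run. One possible way to keep the feature reproducible is to measure tenure against a fixed snapshot date. The plain-Python sketch below shows the idea; `SNAPSHOT_DATE`, the helper names, and the sample values are assumptions, not part of the notebook.

```python
from datetime import date

# Hypothetical fixed reference date, so tenure is identical across runs
SNAPSHOT_DATE = date(2024, 1, 1)


def tenure_days(registration_date, snapshot=SNAPSHOT_DATE):
    """Days between registration and the snapshot; None if the date is missing."""
    if registration_date is None:
        return None
    return (snapshot - registration_date).days


def ltv_proxy(salary, tenure):
    """Interaction feature salary * tenure; a missing input yields None."""
    if salary is None or tenure is None:
        return None
    return salary * tenure


tenure = tenure_days(date(2023, 1, 1))
proxy = ltv_proxy(50_000.0, tenure)
missing = ltv_proxy(50_000.0, tenure_days(None))
```

The `None` handling mirrors how a null `registration_date` would propagate into the interaction column, which matters later when the features are assembled into a vector.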

## Section 2: VectorAssembler (The Final Step)

Unlike Scikit-Learn which accepts a matrix of features ($X$), Spark MLlib requires a **single column** of type `Vector` that contains all input features.

`VectorAssembler` takes a list of columns (numerical, boolean, or vector) and combines them into this single feature vector.
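
As a conceptual illustration only, the sketch below mimics in plain Python what assembling does to a single row: scalar and boolean values become floats, and vector-valued columns are flattened in place. It is not the Spark API. The `assemble` helper, the `is_active` column, and the sample row are hypothetical.

```python
def assemble(row, input_cols):
    """Flatten the selected columns of one row into a single list of floats."""
    vector = []
    for name in input_cols:
        value = row[name]
        if value is None:
            # Assumption: fail fast on missing values rather than guess a default
            raise ValueError(f"column {name!r} is null")
        if isinstance(value, (list, tuple)):
            # Vector-valued column: its elements are appended in order
            vector.extend(float(v) for v in value)
        else:
            # Numeric or boolean scalar (True becomes 1.0)
            vector.append(float(value))
    return vector


# Hypothetical row; "country_vec" stands in for a one-hot encoded column
row = {"age_imputed": 34, "is_active": True, "country_vec": [0.0, 1.0, 0.0]}
features = assemble(row, ["age_imputed", "is_active", "country_vec"])
```

The order of `input_cols` determines the position of each value in the result, which is why the same list is reused later to label the correlation matrix.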

In [0]:
from pyspark.ml.feature import VectorAssembler

# Numerical features to assemble into the final vector
input_cols = ["age_imputed", "log_salary", "ltv_proxy", "reg_rank"]
# Note: encoded categorical vectors (e.g., "country_vec") are usually included here too.
# VectorAssembler's default handleInvalid="error" raises on null or NaN inputs.

assembler = VectorAssembler(inputCols=input_cols, outputCol="features_final")
df_final = assembler.transform(df_eng)

In [0]:
display(df_final.select("features_final").limit(5))

## Section 3: Feature Selection

### Example 3.1: Correlation Analysis
Which features are strongly correlated with each other? Strong pairwise correlation is a sign of multicollinearity.

In [0]:
from pyspark.ml.stat import Correlation
import pandas as pd

# Calculate the correlation matrix (Pearson by default)
matrix = Correlation.corr(df_final, "features_final").head()
corr_array = matrix[0].toArray()

# Convert to Pandas DataFrame for better visualization
corr_df = pd.DataFrame(corr_array, columns=input_cols, index=input_cols)


In [0]:
# Display nicely
print(f"Features: {input_cols}")
display(corr_df)
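
One way to act on a correlation matrix is to list the feature pairs whose absolute correlation exceeds a cutoff, then drop one feature from each pair. The sketch below does this in plain Python on a few hypothetical columns. The 0.95 threshold is a common rule of thumb rather than a fixed rule, and the helper names and data are assumptions for illustration.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlated_pairs(columns, threshold=0.95):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical columns: the second is an exact multiple of the first
columns = {
    "salary": [30.0, 40.0, 50.0, 60.0, 70.0],
    "salary_doubled": [60.0, 80.0, 100.0, 120.0, 140.0],
    "age": [25.0, 52.0, 31.0, 47.0, 38.0],
}
flagged = correlated_pairs(columns)
```

Using the absolute value matters here: a correlation near -1 signals redundancy just as much as one near +1.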

In [0]:
# Save for Pipeline
df_final.write.mode("overwrite").saveAsTable("customer_train_engineered")
print("✅ Saved 'customer_train_engineered'")

## Best Practices

### 🎯 Feature Engineering Strategy Guide:

| Technique | When to Use | Watch Out For |
|-----------|-------------|---------------|
| **Log Transform** | Right-skewed data (salary, prices) | Zero values (use log1p) |
| **Polynomial** | Non-linear relationships | Overfitting, high dimensions |
| **Interaction** | Combined effects matter | Exponential feature growth |
| **Date Extraction** | Time patterns | Timezone issues |
| **Binning** | Reduce noise, interpretability | Loss of information |

### ⚠️ Common Mistakes to Avoid:

1. **Creating too many features** → Overfitting and slow training
2. **Not checking correlations** → Multicollinearity issues
3. **Leaking target info** → Features derived from target
4. **Ignoring domain knowledge** → Missing obvious patterns
5. **Not validating on holdout** → Overly optimistic results

### 💡 Pro Tips:

- Always visualize new features vs target
- Use domain knowledge to create meaningful interactions
- Drop one feature from each highly correlated pair (absolute correlation above 0.95)
- Consider using automated feature selection (RFE, Lasso)
- For right-skewed numeric features, a log transform is often one of the most effective single techniques

## Summary

### What we achieved:

- **Log Transformation**: Applied `log1p` to compress skewed salary distribution
- **Interaction Features**: Created `ltv_proxy` from salary × tenure
- **VectorAssembler**: Combined all features into a single vector column
- **Correlation Analysis**: Computed the pairwise correlation matrix to check for multicollinearity between features

### Key Takeaways:

| # | Principle |
|---|-----------|
| 1 | **Log transform skewed data** - often one of the most effective single techniques for right-skewed features |
| 2 | **VectorAssembler is required** - Spark MLlib needs a single vector column |
| 3 | **Check correlations** - avoid multicollinearity |
| 4 | **Domain knowledge matters** - create meaningful features |
| 5 | **Less can be more** - too many features cause overfitting |

### Data Pipeline Status:

| Table | Created | Used By |
|-------|---------|---------|
| `customer_train_transformed` | Module 4 | This module |
| `customer_train_engineered` | ✅ This module | Modules 6-7 |

### Next Steps:

📚 **Next Module:** Module 6 - ML Pipelines (putting it all together)

## Cleanup

Optionally remove demo tables created during exercises:

In [0]:
# Cleanup - remove demo tables created in this notebook

# Uncomment the lines below to remove demo tables:

# spark.sql(f"DROP TABLE IF EXISTS {catalog_name}.{schema_name}.customer_train_engineered")

# print("✅ All demo tables removed")

print("ℹ️ Cleanup disabled (uncomment code to remove demo tables)")