# Module 5: Advanced Feature Engineering

## Business Context: TechCorp HR Analytics (continued)

**Where We Are:**
Data is encoded and scaled. Now we create NEW features from existing columns for **salary prediction**.

**Features to Engineer for Salary Prediction:**
| Feature | Logic | Business Meaning |
|---------|-------|------------------|
| `experience_years` | age - 22 | Years since graduation |
| `tenure_days` | today - registration_date | Time with company |
| `ltv_proxy` | age × tenure_days | Experience-loyalty interaction |

> **Note:** We also demonstrate the log transformation of salary (`log_salary_demo`) as a technique, but it CANNOT be used as a feature when predicting salary because it is derived from the target (data leakage!)

---

**Training Objective:** Master feature engineering techniques.

**Scope:**
- Log Transformation for skewed distributions (demo only)
- Experience and Tenure calculations
- Interaction Features
- VectorAssembler for Spark ML

## Context and Requirements

- **Training day:** Day 1 - Data Preparation Fundamentals
- **Notebook type:** Demo
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
  - Permissions: CREATE TABLE, SELECT, MODIFY
- **Dependencies:** `04_Feature_Transformation.ipynb` (creates `customer_train_transformed` table)
- **Execution time:** ~20 minutes

> **Note:** Feature engineering is often the difference between a mediocre model and a great one!

## Theoretical Introduction

**What is Feature Engineering?**

Feature Engineering is the process of creating new features from raw data to improve model performance. It's often considered the most creative and impactful part of ML.

**Common Techniques:**

| Technique | When to Use | Example |
|-----------|-------------|---------|
| **Log Transform** | Skewed distributions | `log(salary)` |
| **Polynomial Features** | Non-linear relationships | `age^2`, `age*income` |
| **Interaction Features** | Combined effects | `age * tenure_days` |
| **Date Features** | Time-based patterns | `day_of_week`, `month` |
| **Binning** | Continuous → categorical | Age groups: 18-25, 26-35 |
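
As a rough illustration of the Binning and Date Features rows, the plain-Python sketch below maps an age to a group and extracts calendar parts from a date. The bin edges and the helper names `age_bin` and `date_features` are hypothetical and exist only for this example; in the notebook the same logic would be written with Spark column expressions.

```python
from datetime import date

# Hypothetical bin edges for illustration only: (lower, upper, label), half-open [lower, upper).
AGE_BINS = [(18, 26, "18-25"), (26, 36, "26-35"), (36, 51, "36-50")]

def age_bin(age):
    """Map a continuous age to a categorical label (binning)."""
    for lower, upper, label in AGE_BINS:
        if lower <= age < upper:
            return label
    return "other"  # ages outside every bin

def date_features(d):
    """Extract simple calendar features from a date."""
    return {
        "month": d.month,
        "day_of_week": d.isoweekday(),  # 1 = Monday ... 7 = Sunday
    }

print(age_bin(30))
print(date_features(date(2024, 3, 15)))
```

Half-open intervals are used so that fractional ages, which can appear after mean imputation, still land in exactly one bin.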

**Why VectorAssembler?**
> Unlike Scikit-Learn which accepts a feature matrix $X$, Spark MLlib requires a **single column** of type `Vector`. `VectorAssembler` combines multiple columns into this vector.

**Feature Selection Importance:**
- Too many features → Overfitting, slow training
- Correlated features → Multicollinearity (confuses linear models)
- Irrelevant features → Noise that hurts performance

## Per-User Isolation

Run the initialization script for per-user catalog and schema isolation:

In [0]:
%run ./00_Setup

**Load Transformed Data:**

In [0]:
# Load Transformed Data
df = spark.table("customer_train_transformed")

## Section 1: Feature Extraction

### Example 1.1: Log Transformation (Demo Only)

Many real-world variables (like Salary, House Prices, Population) follow a "Power Law" or "Long Tail" distribution.

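- **The Problem:** A few extreme values dominate a heavily skewed variable. When that variable is the regression target, the residuals of a linear model tend to be skewed and unevenly spread too, which undermines the usual assumption of normally distributed residuals.
- **The Solution:** Applying a logarithm compresses the long tail, which usually makes the distribution more symmetric and closer to bell-shaped (Gaussian).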
- **Note:** We use `log1p` (log(x+1)) because `log(0)` is undefined.

> **WARNING - DATA LEAKAGE:** 
> This is a **demonstration of the technique only**. In our salary prediction task, we CANNOT use `log_salary` as a feature because it's derived from the target variable (`salary`). Using it would cause **data leakage**!
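
To make the compression effect concrete, here is a small plain-Python sketch using `math.log1p`. The salary values are made up for illustration and are not taken from the training data; `spread_ratio` is a hypothetical helper that serves as a crude measure of tail length.

```python
import math

# Made-up, right-skewed salaries (illustration only, not from the dataset).
salaries = [0, 30_000, 45_000, 60_000, 1_000_000]

# log1p(x) = log(1 + x), so a zero salary maps to 0.0 instead of raising an error.
log_salaries = [math.log1p(s) for s in salaries]

def spread_ratio(values):
    """Ratio of the largest value to the median: a crude measure of tail length."""
    ordered = sorted(values)
    median = ordered[len(ordered) // 2]
    return ordered[-1] / median

print(f"raw max/median: {spread_ratio(salaries):.1f}")
print(f"log max/median: {spread_ratio(log_salaries):.2f}")

# expm1 inverts log1p, which is how a value on the log scale
# would be converted back to the original units.
recovered = [math.expm1(v) for v in log_salaries]
```

The outlier that is many times larger than the median on the raw scale ends up only slightly above it on the log scale.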

In [0]:
from pyspark.sql.functions import log1p, col, datediff, current_date

In [0]:
# DEMO: Log transformation technique
# log1p calculates log(x + 1) to handle zeros safely
#
# WARNING: This is for DEMONSTRATION only!
# We do NOT use log_salary as a feature in our model because:
# - We are predicting salary
# - log_salary = log(salary) is derived from target = DATA LEAKAGE!

df_eng = df.withColumn("log_salary_demo", log1p(col("salary_imputed")))

print("Log transformation demo (NOT used as feature!):")
display(df_eng.select("salary_imputed", "log_salary_demo").limit(5))

### Example 1.2: Experience, Tenure and Interaction Features

Combining features can reveal hidden patterns useful for salary prediction.

**Experience Years:** `age - 22` - approximates years of work experience (assuming graduation at 22). Anyone younger than 22 gets a negative value.

**Tenure Days:** Days since registration, that is, how long the customer has been with us.

**LTV Proxy:** `age * tenure_days` - approximates lifetime value based on experience and loyalty.

> **Note:** These features are safe to use because they don't contain the target variable (salary)!
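
The arithmetic behind these three features can be checked on a single record. The sketch below is plain Python, not Spark; the function name `engineer` and the sample values are hypothetical.

```python
from datetime import date

def engineer(age, registration_date, today):
    """Compute the three engineered features for a single record."""
    experience_years = age - 22                      # assumes graduation at 22
    tenure_days = (today - registration_date).days   # same idea as datediff(end, start)
    ltv_proxy = age * tenure_days                    # interaction term
    return {
        "experience_years": experience_years,
        "tenure_days": tenure_days,
        "ltv_proxy": ltv_proxy,
    }

# Hypothetical record; a fixed "today" keeps the result reproducible.
row = engineer(age=35, registration_date=date(2023, 1, 1), today=date(2024, 1, 1))
print(row)
```

Passing the reference date explicitly is a design choice for this sketch: a feature computed from the current date changes every day the code is re-run, so training and scoring data may need the same reference date to stay comparable.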

In [0]:
# Experience Years: estimated years since graduation (age - 22)
# This is a key predictor for salary!
df_eng = df_eng.withColumn("experience_years", col("age_imputed") - 22)

# Calculate Tenure (days since registration); current_date() ties the value to the day the notebook runs
df_eng = df_eng.withColumn("tenure_days", datediff(current_date(), col("registration_date")))

# Create the interaction: Age * Tenure (Lifetime Value approximation)
# NOTE: We use age * tenure, NOT salary * tenure (to avoid data leakage!)
df_eng = df_eng.withColumn("ltv_proxy", col("age_imputed") * col("tenure_days"))

In [0]:
display(df_eng.select("age_imputed", "experience_years", "tenure_days", "ltv_proxy").limit(5))

## Section 2: VectorAssembler (The Final Step)

Unlike Scikit-Learn which accepts a matrix of features ($X$), Spark MLlib requires a **single column** of type `Vector` that contains all input features.

`VectorAssembler` takes a list of columns (numerical, boolean, or vector) and combines them into this single feature vector.
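
Conceptually, assembling means concatenating the chosen columns, in order, into one flat vector per row. The sketch below is a plain-Python analogy and not the Spark API; the function `assemble` and the columns `is_manager` and `country_vec` are hypothetical.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat feature list."""
    features = []
    for name in input_cols:
        value = row[name]
        if value is None:
            # Spark's VectorAssembler also rejects nulls by default.
            raise ValueError(f"null value in column {name!r}")
        if isinstance(value, (list, tuple)):
            features.extend(float(v) for v in value)  # vector column: append each element
        else:
            features.append(float(value))             # numeric or boolean column
    return features

# Hypothetical row; "country_vec" stands in for a one-hot encoded vector column.
row = {"age_imputed": 35, "tenure_days": 365, "is_manager": True, "country_vec": [0.0, 1.0]}
print(assemble(row, ["age_imputed", "tenure_days", "is_manager", "country_vec"]))
```

The order of the input columns fixes the position of each value in the vector, which matters later when vector positions are mapped back to feature names.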

In [0]:
from pyspark.ml.feature import VectorAssembler

# List of all numerical features we want to use for SALARY PREDICTION
# These are VALID features (no data leakage!)
input_cols = ["age_imputed", "experience_years", "tenure_days", "ltv_proxy", "reg_rank"] 
# Note: We usually include encoded categorical vectors here too, e.g., "country_vec"
# NOTE: log_salary_demo is NOT included (data leakage - derived from target!)

assembler = VectorAssembler(inputCols=input_cols, outputCol="features_final")
df_final = assembler.transform(df_eng)

print(f"Features used for salary prediction: {input_cols}")
print("Note: log_salary_demo NOT used (would be data leakage!)")

In [0]:
display(df_final.select("features_final").limit(5))

## Section 3: Feature Selection

### Example 3.1: Correlation Analysis
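Which features are strongly correlated with each other (multicollinearity)? Because `experience_years` is just `age_imputed` minus a constant, expect that pair to show a correlation of 1.0.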

In [0]:
from pyspark.ml.stat import Correlation
import pandas as pd

# Calculate Correlation Matrix
matrix = Correlation.corr(df_final, "features_final").head()
corr_array = matrix[0].toArray()

# Convert to Pandas DataFrame for better visualization
corr_df = pd.DataFrame(corr_array, columns=input_cols, index=input_cols)


### Why Correlation Analysis?

Correlation analysis helps us identify pairs of features that move together (are highly correlated). In feature engineering, this is important because:

- **Multicollinearity**: Highly correlated features can confuse many ML models (especially linear models), making it hard to interpret coefficients and sometimes degrading performance.
- **Feature Selection**: By detecting and removing redundant features, we simplify the model, reduce overfitting, and speed up training.
- **Better Insights**: Understanding feature relationships can reveal hidden patterns and guide further feature engineering.

> **Goal:** Keep only the most informative, independent features for robust, interpretable models.
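
For intuition about what each cell of the matrix means, the sketch below computes the Pearson coefficient by hand in plain Python (Pearson is the default method of `Correlation.corr`). The sample values and the helper name `pearson` are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration.
ages = [25, 32, 41, 50, 58]
experience_years = [a - 22 for a in ages]   # same construction as in Example 1.2
tenure_days = [900, 200, 1500, 400, 1100]

# Subtracting a constant does not change the correlation, so this pair is perfectly correlated.
print(pearson(ages, experience_years))
print(pearson(ages, tenure_days))
```

A pair like `age_imputed` and `experience_years` carries the same information twice, so a linear model gains nothing from keeping both.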

In [0]:
# Raw correlation matrix as a NumPy array; rows and columns follow the order of input_cols
display(corr_array)

In [0]:
# Display nicely
print(f"Features: {input_cols}")
display(corr_df)

In [0]:
display(df_final)

In [0]:
# Save for Pipeline
df_final.write.mode("overwrite").saveAsTable("customer_train_engineered")
print("Saved 'customer_train_engineered'")

## Best Practices

### Feature Engineering Strategy Guide:

| Technique | When to Use | Watch Out For |
|-----------|-------------|---------------|
| **Log Transform** | Right-skewed data (salary, prices) | Zero values (use log1p) |
| **Polynomial** | Non-linear relationships | Overfitting, high dimensions |
| **Interaction** | Combined effects matter | Exponential feature growth |
| **Date Extraction** | Time patterns | Timezone issues |
| **Binning** | Reduce noise, interpretability | Loss of information |

### Common Mistakes to Avoid:

1. **Creating too many features** → Overfitting and slow training
2. **Not checking correlations** → Multicollinearity issues
3. **Leaking target info** → Features derived from target
4. **Ignoring domain knowledge** → Missing obvious patterns
5. **Not validating on holdout** → Overly optimistic results

### Pro Tips:

- Always visualize new features vs target
- Use domain knowledge to create meaningful interactions
- Remove one feature from each highly correlated pair (for example, absolute correlation > 0.95)
- Consider using automated feature selection (RFE, Lasso)
- For right-skewed variables, a log transform is often one of the most impactful single techniques
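
One simple way to act on the correlation tip is a greedy filter over the correlation matrix. The sketch below is one possible approach and not a Spark utility; the function `drop_correlated` and the matrix values are hypothetical.

```python
def drop_correlated(names, corr, threshold=0.95):
    """Greedily keep a feature only if its |correlation| with every kept feature is <= threshold."""
    kept = []  # indices of the features kept so far
    for i in range(len(names)):
        if all(abs(corr[i][j]) <= threshold for j in kept):
            kept.append(i)
    return [names[i] for i in kept]

# Made-up correlation matrix for three features (illustration only).
names = ["age_imputed", "experience_years", "tenure_days"]
corr = [
    [1.0, 1.0, 0.1],
    [1.0, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(drop_correlated(names, corr))
```

The result depends on feature order: the first feature of a correlated pair survives, so list the features you would rather keep first. In this notebook the inputs could be `input_cols` and `corr_array`.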

## Summary

### What we achieved:

- **Log Transformation**: Demonstrated `log1p` on the skewed salary distribution (demo only, not used as a feature)
- **Experience, Tenure and Interaction Features**: Created `experience_years`, `tenure_days`, and `ltv_proxy` (age × tenure_days)
- **VectorAssembler**: Combined all features into single vector column
- **Correlation Analysis**: Identified multicollinearity between features

### Key Takeaways:

| # | Principle |
|---|-----------|
| 1 | **Log transform skewed data** - often one of the most impactful single techniques |
| 2 | **VectorAssembler is required** - Spark MLlib needs vector column |
| 3 | **Check correlations** - avoid multicollinearity |
| 4 | **Domain knowledge matters** - create meaningful features |
| 5 | **Less can be more** - too many features cause overfitting |

### Data Pipeline Status:

| Table | Created | Used By |
|-------|---------|---------|
| `customer_train_transformed` | Module 4 | This module |
| `customer_train_engineered` | This module | Modules 6-7 |

### Next Steps:

**Next Module:** Module 6 - ML Pipelines (putting it all together)

## Cleanup

Optionally remove demo tables created during exercises:

In [0]:
# Cleanup - remove demo tables created in this notebook

# Uncomment the lines below to remove demo tables:

# spark.sql(f"DROP TABLE IF EXISTS {catalog_name}.{schema_name}.customer_train_engineered")

# print("Demo table removed")

print("ℹ️ Cleanup disabled (uncomment code to remove demo tables)")