# Module 10: Feature Engineering

**Goal:** Transform raw data into features models can use, while avoiding leakage.

**Prerequisites:** Modules 3-4 (Linear/Logistic Regression)

**Expected Runtime:** ~25 minutes

**Outputs:**
- Transformation comparisons
- Encoding demonstrations
- Leakage detection

---

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
plt.rcParams['figure.figsize'] = (12, 5)

## Part 1: Generate Sample E-commerce Data

In [None]:
n = 1000

# Raw features
df = pd.DataFrame({
    'customer_id': range(n),
    'tenure_days': np.random.uniform(30, 1000, n),
    'revenue': np.random.exponential(100, n),  # Skewed!
    'sessions': np.random.poisson(15, n),
    'support_tickets': np.random.poisson(2, n),
    'plan_type': np.random.choice(['Basic', 'Premium', 'Enterprise'], n, p=[0.5, 0.35, 0.15]),
    'region': np.random.choice(['US', 'EU', 'APAC', 'LATAM'], n, p=[0.4, 0.3, 0.2, 0.1]),
})

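# Generate target (churn) based on features via a logistic model.
# Note: with these signs, churn probability rises with tenure, revenue, and
# sessions, and falls with support tickets.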
churn_prob = 1 / (1 + np.exp(
    2 - 
    0.002 * df['tenure_days'] - 
    0.005 * df['revenue'] + 
    0.3 * df['support_tickets'] - 
    0.05 * df['sessions']
))
df['churn'] = (np.random.random(n) < churn_prob).astype(int)

print("Dataset shape:", df.shape)
print(f"\nChurn rate: {df['churn'].mean():.1%}")
df.head()

## Part 2: Numeric Transformations

In [None]:
# Look at revenue distribution
fig, axes = plt.subplots(1, 4, figsize=(16, 4))

# Original
axes[0].hist(df['revenue'], bins=30, color='#8b5cf6', edgecolor='white')
axes[0].set_title(f"Original\nSkew: {df['revenue'].skew():.2f}")
axes[0].set_xlabel('Revenue')

# Log transform
log_revenue = np.log1p(df['revenue'])
axes[1].hist(log_revenue, bins=30, color='#22c55e', edgecolor='white')
axes[1].set_title(f"Log Transform\nSkew: {log_revenue.skew():.2f}")
axes[1].set_xlabel('log(Revenue + 1)')

# Standardized
std_revenue = (df['revenue'] - df['revenue'].mean()) / df['revenue'].std()
axes[2].hist(std_revenue, bins=30, color='#0ea5e9', edgecolor='white')
axes[2].set_title(f"Standardized\nMean: {std_revenue.mean():.2f}, Std: {std_revenue.std():.2f}")
axes[2].set_xlabel('Z-score')

# Min-Max
minmax_revenue = (df['revenue'] - df['revenue'].min()) / (df['revenue'].max() - df['revenue'].min())
axes[3].hist(minmax_revenue, bins=30, color='#f97316', edgecolor='white')
axes[3].set_title(f"Min-Max\nRange: [{minmax_revenue.min():.2f}, {minmax_revenue.max():.2f}]")
axes[3].set_xlabel('Scaled [0,1]')

plt.tight_layout()
plt.show()

print("💡 Key Insight: Log transform reduced skewness from {:.2f} to {:.2f}".format(
    df['revenue'].skew(), log_revenue.skew()))
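
The same log transform can be wrapped in a scikit-learn transformer so that it fits into a pipeline and can be inverted later. The sketch below is one way to do that, assuming synthetic exponential data; `demo_revenue` is a hypothetical stand-in, not the notebook's `df`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(0)
# Hypothetical stand-in for a right-skewed revenue column
demo_revenue = pd.DataFrame({'revenue': rng.exponential(100, 1000)})

# log1p handles zeros safely; expm1 is its inverse
log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1)

demo_log = log_tf.fit_transform(demo_revenue)
demo_back = log_tf.inverse_transform(demo_log)

print(f"Skew before: {demo_revenue['revenue'].skew():.2f}")
print(f"Skew after:  {demo_log['revenue'].skew():.2f}")
print("Round trip OK:", np.allclose(demo_back, demo_revenue))
```

Keeping the inverse available is handy when a transformed quantity has to be reported back in its original units.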

## Part 3: Categorical Encoding

In [None]:
print("=== Categorical Variables ===")
print(f"\nplan_type: {df['plan_type'].nunique()} categories")
print(df['plan_type'].value_counts())

print(f"\nregion: {df['region'].nunique()} categories")
print(df['region'].value_counts())

In [None]:
# One-Hot Encoding
df_onehot = pd.get_dummies(df[['plan_type', 'region']], prefix=['plan', 'region'])
print("=== One-Hot Encoding ===")
print(f"Created {df_onehot.shape[1]} columns")
df_onehot.head()

In [None]:
# Ordinal Encoding (for plan_type with natural order)
plan_order = {'Basic': 1, 'Premium': 2, 'Enterprise': 3}
df['plan_ordinal'] = df['plan_type'].map(plan_order)

print("=== Ordinal Encoding ===")
print("Mapping:", plan_order)
df[['plan_type', 'plan_ordinal']].head()
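
The dictionary mapping above can also be expressed with scikit-learn's `OrdinalEncoder`, which may be convenient inside a pipeline. This is a sketch on a hypothetical `demo_plans` frame. Note that the encoder numbers categories from 0, whereas the mapping above starts at 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data
demo_plans = pd.DataFrame({'plan_type': ['Premium', 'Basic', 'Enterprise', 'Basic']})

# Pass the order explicitly; otherwise categories are sorted alphabetically
ordinal = OrdinalEncoder(categories=[['Basic', 'Premium', 'Enterprise']])
demo_plans['plan_ordinal'] = ordinal.fit_transform(demo_plans[['plan_type']]).ravel()
print(demo_plans)
```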

In [None]:
# Target Encoding (mean churn rate by region)
# ⚠️ Must be done carefully to avoid leakage!

# For illustration only: the means below are computed on ALL rows,
# so each row's own label contributes to its encoding (leakage).
region_target_mean = df.groupby('region')['churn'].mean()
df['region_target_enc'] = df['region'].map(region_target_mean)

print("=== Target Encoding ===")
print("Region → Mean Churn Rate:")
print(region_target_mean.round(3))
print("\n⚠️ Warning: In practice, compute these means on training data only (ideally out-of-fold) to avoid leakage!")
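
One common way to reduce target-encoding leakage is out-of-fold encoding: each training row is encoded with category means computed from the other folds. The sketch below shows that idea under simple assumptions. The frame `train_df`, the 5-fold setup, and the fallback to the global mean are illustrative choices, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Hypothetical training frame with a categorical column and a binary target
train_df = pd.DataFrame({
    'region': rng.choice(['US', 'EU', 'APAC', 'LATAM'], 400),
    'churn': rng.integers(0, 2, 400),
})

global_mean = train_df['churn'].mean()
oof_enc = pd.Series(np.nan, index=train_df.index)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fit_idx, enc_idx in kf.split(train_df):
    # Category means come from the other folds only,
    # so no row's own target value feeds its encoding
    fold_means = train_df.iloc[fit_idx].groupby('region')['churn'].mean()
    oof_enc.iloc[enc_idx] = (
        train_df['region'].iloc[enc_idx].map(fold_means).to_numpy()
    )

# Categories missing from a fold fall back to the overall training mean
train_df['region_target_enc'] = oof_enc.fillna(global_mean)
print(train_df.head())
```

Test rows would then typically be encoded with means computed from the full training set, never from test labels.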

## Part 4: Feature Creation

In [None]:
# Create engineered features
df['revenue_per_session'] = df['revenue'] / (df['sessions'] + 1)
df['ticket_rate'] = df['support_tickets'] / (df['tenure_days'] / 30)  # tickets per month
df['log_revenue'] = np.log1p(df['revenue'])
df['is_high_value'] = (df['revenue'] > df['revenue'].quantile(0.75)).astype(int)
df['tenure_months'] = df['tenure_days'] / 30

print("=== Engineered Features ===")
print(df[['revenue_per_session', 'ticket_rate', 'log_revenue', 'is_high_value', 'tenure_months']].describe())

## Part 5: Impact on Model Performance

In [None]:
# Compare raw vs engineered features
y = df['churn']

# Raw numeric features
X_raw = df[['tenure_days', 'revenue', 'sessions', 'support_tickets']]

# Engineered features
X_eng = df[['tenure_months', 'log_revenue', 'sessions', 'support_tickets', 
            'revenue_per_session', 'ticket_rate', 'plan_ordinal', 'region_target_enc']]

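# Split
X_raw_train, X_raw_test, X_eng_train, X_eng_test, y_train, y_test = train_test_split(
    X_raw, X_eng, y, test_size=0.3, random_state=42, stratify=y
)

# Recompute the region target encoding from training rows only,
# so test labels never influence a feature
region_train = df.loc[X_eng_train.index, 'region']
region_test = df.loc[X_eng_test.index, 'region']
train_region_mean = y_train.groupby(region_train).mean()
X_eng_train = X_eng_train.assign(region_target_enc=region_train.map(train_region_mean))
X_eng_test = X_eng_test.assign(region_target_enc=region_test.map(train_region_mean))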

# Scale for logistic regression
scaler_raw = StandardScaler()
scaler_eng = StandardScaler()

X_raw_train_scaled = scaler_raw.fit_transform(X_raw_train)
X_raw_test_scaled = scaler_raw.transform(X_raw_test)

X_eng_train_scaled = scaler_eng.fit_transform(X_eng_train)
X_eng_test_scaled = scaler_eng.transform(X_eng_test)

# Train models
lr_raw = LogisticRegression().fit(X_raw_train_scaled, y_train)
lr_eng = LogisticRegression().fit(X_eng_train_scaled, y_train)

rf_raw = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_raw_train, y_train)
rf_eng = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_eng_train, y_train)

# Evaluate
results = pd.DataFrame({
    'Model': ['Logistic (Raw)', 'Logistic (Engineered)', 'Random Forest (Raw)', 'Random Forest (Engineered)'],
    'Train AUC': [
        roc_auc_score(y_train, lr_raw.predict_proba(X_raw_train_scaled)[:, 1]),
        roc_auc_score(y_train, lr_eng.predict_proba(X_eng_train_scaled)[:, 1]),
        roc_auc_score(y_train, rf_raw.predict_proba(X_raw_train)[:, 1]),
        roc_auc_score(y_train, rf_eng.predict_proba(X_eng_train)[:, 1])
    ],
    'Test AUC': [
        roc_auc_score(y_test, lr_raw.predict_proba(X_raw_test_scaled)[:, 1]),
        roc_auc_score(y_test, lr_eng.predict_proba(X_eng_test_scaled)[:, 1]),
        roc_auc_score(y_test, rf_raw.predict_proba(X_raw_test)[:, 1]),
        roc_auc_score(y_test, rf_eng.predict_proba(X_eng_test)[:, 1])
    ]
})

print("=== Model Comparison ===")
print(results.to_string(index=False))

print("\n💡 Insight: Feature engineering often helps linear models more than tree models.")

## Part 6: Data Leakage Demo

Let's see what happens when we accidentally include future information.

In [None]:
# Create a "leaky" feature - future activity that correlates with churn
# In reality, this would be activity AFTER the prediction point
df['future_activity'] = np.where(
    df['churn'] == 1, 
    np.random.normal(2, 1, n),  # Churners have low future activity
    np.random.normal(10, 2, n)  # Non-churners have high future activity
)

# Features with leakage
X_leaky = df[['tenure_months', 'log_revenue', 'sessions', 'support_tickets', 'future_activity']]

# Split (AFTER creating leaky feature - the damage is done)
X_leaky_train, X_leaky_test, y_train, y_test = train_test_split(
    X_leaky, y, test_size=0.3, random_state=42, stratify=y
)

# Scale
scaler_leaky = StandardScaler()
X_leaky_train_scaled = scaler_leaky.fit_transform(X_leaky_train)
X_leaky_test_scaled = scaler_leaky.transform(X_leaky_test)

# Train
lr_leaky = LogisticRegression().fit(X_leaky_train_scaled, y_train)

# Evaluate
train_auc_leaky = roc_auc_score(y_train, lr_leaky.predict_proba(X_leaky_train_scaled)[:, 1])
test_auc_leaky = roc_auc_score(y_test, lr_leaky.predict_proba(X_leaky_test_scaled)[:, 1])

print("=== ⚠️ LEAKAGE DEMO ===")
print("\nWith 'future_activity' feature (LEAKY):")
print(f"  Train AUC: {train_auc_leaky:.3f}")
print(f"  Test AUC:  {test_auc_leaky:.3f}")
print("\n🚨 Red Flag: Suspiciously high AUC!")
print("   The model learned a shortcut using future information.")
print("   In production, this feature wouldn't exist at prediction time.")
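
A quick screen that can surface this kind of leak is to score each feature on its own against the target and flag any that is nearly perfect alone. The sketch below assumes synthetic data, and the 0.95 cutoff is an arbitrary illustration. `X_demo`, `y_demo`, and the threshold are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_demo = 500
y_demo = rng.integers(0, 2, n_demo)

# Hypothetical features: one weak legitimate signal, one leaky column
X_demo = pd.DataFrame({
    'sessions': rng.poisson(15, n_demo) + y_demo,
    'future_activity': np.where(y_demo == 1,
                                rng.normal(2, 1, n_demo),
                                rng.normal(10, 2, n_demo)),
})

# AUC of each raw feature against the target; fold values below 0.5
# so that strongly negative relationships are also flagged
single_auc = X_demo.apply(lambda col: roc_auc_score(y_demo, col))
single_auc = single_auc.where(single_auc >= 0.5, 1 - single_auc)

# Assumed threshold for illustration; tune to the problem
suspects = single_auc[single_auc > 0.95]
print(single_auc.round(3))
print("Suspicious features:", list(suspects.index))
```

A flagged feature is not proof of leakage, but it is a prompt to check when that value becomes available relative to the prediction point.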

## Part 7: TODO - Correct Scaling Pipeline

A common mistake is fitting the scaler on all data. Let's compare.

In [None]:
# TODO: Compare correct vs incorrect scaling

X_simple = df[['tenure_days', 'revenue', 'sessions', 'support_tickets']]

# WRONG: Fit scaler on ALL data before split
scaler_wrong = StandardScaler()
X_scaled_wrong = scaler_wrong.fit_transform(X_simple)  # Fitted on everything

X_train_wrong, X_test_wrong, y_train_w, y_test_w = train_test_split(
    X_scaled_wrong, y, test_size=0.3, random_state=42
)

# RIGHT: Split first, then fit scaler only on train
X_train_right, X_test_right, y_train_r, y_test_r = train_test_split(
    X_simple, y, test_size=0.3, random_state=42
)

scaler_right = StandardScaler()
X_train_right_scaled = scaler_right.fit_transform(X_train_right)  # Fit only on train
X_test_right_scaled = scaler_right.transform(X_test_right)  # Transform test

# Train models
lr_wrong = LogisticRegression().fit(X_train_wrong, y_train_w)
lr_right = LogisticRegression().fit(X_train_right_scaled, y_train_r)

print("=== Scaling Pipeline Comparison ===")
print(f"\nWRONG (fit on all): Test AUC = {roc_auc_score(y_test_w, lr_wrong.predict_proba(X_test_wrong)[:, 1]):.3f}")
print(f"RIGHT (fit on train): Test AUC = {roc_auc_score(y_test_r, lr_right.predict_proba(X_test_right_scaled)[:, 1]):.3f}")
print("\n💡 In this case the difference is small, but on smaller datasets it matters more.")
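
One way to make the "fit on train only" rule hard to break is to put the scaler and model in a scikit-learn `Pipeline`, so that cross-validation refits the scaler inside each fold. The sketch below uses hypothetical synthetic data (`X_pipe`, `y_pipe`) rather than the notebook's `df`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: four numeric features on very different scales
X_pipe = rng.normal(size=(300, 4)) * np.array([1, 10, 100, 1000])
y_pipe = (X_pipe[:, 0] + X_pipe[:, 1] / 10 + rng.normal(size=300) > 0).astype(int)

# The pipeline refits the scaler on the training portion of every fold
pipe = make_pipeline(StandardScaler(), LogisticRegression())
cv_auc = cross_val_score(pipe, X_pipe, y_pipe, cv=5, scoring='roc_auc')

print(f"CV AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")
```

The same pipeline object can later be fit on the full training set and applied to test data with a single `predict_proba` call.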

## Part 8: TODO - Stakeholder Summary

Write a brief explanation for a PM about:
1. Why you transformed certain features
2. What leakage is and how you avoided it
3. How feature engineering improved the model

### Your Summary:

*Write your explanation here...*

---

## Key Takeaways

1. **Transform skewed data** with log: helps linear models capture patterns
2. **Scale features**: essential for linear models, optional for trees
3. **Encode categoricals** thoughtfully: one-hot for low cardinality, target encoding for high cardinality
4. **Avoid leakage**: only use information available at prediction time
5. **Fit transforms on train only**: apply to test without refitting

### Next Steps
- Explore the interactive playground for visual transformations
- Complete the quiz to test your understanding