# Module 12: Feature Engineering Mastery

**Estimated Time**: 75 minutes

## Learning Objectives

By the end of this module, you will:
- **Understand** why feature engineering is crucial for ML success
- **Master** techniques for transforming numerical and categorical features
- **Extract** meaningful features from datetime data
- **Create** polynomial and interaction features
- **Apply** feature selection methods to improve model performance
- **Analyze** feature importance to understand model decisions
- **Build** automated feature engineering pipelines
- **Practice** end-to-end feature engineering on real datasets

## Prerequisites

- Modules 00-11 completed (especially pandas, scikit-learn)
- Understanding of basic ML concepts
- Familiarity with supervised learning algorithms

## What is Feature Engineering?

> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." – **Andrew Ng**

**Feature engineering** is the process of using domain knowledge to create features (input variables) that make machine learning algorithms work better. It's often the difference between a mediocre model and a winning solution.

### Why It Matters

- Can improve model accuracy substantially, though the size of the gain depends on the data and the model
- Often more impactful than algorithm choice
- Requires creativity and domain understanding
- Key skill that separates good from great data scientists

### The Feature Engineering Process

```
Raw Data → Feature Creation → Feature Transformation → Feature Selection → Model Training
```

Let's master each step!

---

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    LabelEncoder,
    OneHotEncoder,
    PolynomialFeatures,
)
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

# Set style for better visualizations
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

print("✓ All libraries loaded successfully!")
print("✓ Ready for feature engineering!")

## 1. Introduction to Feature Engineering

Feature engineering is the art and science of transforming raw data into features that better represent the underlying problem to the predictive models.

### Types of Features

1. **Numerical Features**
   - Continuous: height, weight, price
   - Discrete: count of items, number of clicks
   
2. **Categorical Features**
   - Nominal: color, city, product category
   - Ordinal: education level, satisfaction rating
   
3. **Datetime Features**
   - Timestamps, dates, time periods
   - Can extract: year, month, day, hour, day_of_week, etc.
   
4. **Text Features**
   - Descriptions, reviews, comments
   - Requires special processing (covered in NLP module)

### Common Feature Engineering Techniques

| Technique | Description | When to Use |
|-----------|-------------|-------------|
| **Scaling** | Normalize feature ranges | Linear and distance-based models; tree-based models generally don't need it |
| **Encoding** | Convert categorical to numerical | Most ML models require numerical input |
| **Binning** | Group continuous values into bins | Create categorical from numerical |
| **Transformation** | Log, sqrt, power transforms | Handle skewed distributions |
| **Interaction** | Combine multiple features | Capture feature relationships |
| **Polynomial** | Create higher-degree features | Capture non-linear patterns |

### Load the Dataset

We'll use the feature engineering dataset created in `data_advanced/`.

In [None]:
# Load the feature engineering dataset
df = pd.read_csv("../../data_advanced/feature_engineering.csv")

print(f"Dataset loaded: {df.shape[0]} rows × {df.shape[1]} columns")
print("\n" + "=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)

# Display first few rows
display(df.head(10))

# Data types and info
print("\n" + "=" * 60)
print("DATA TYPES")
print("=" * 60)
print(df.dtypes)

# Basic statistics
print("\n" + "=" * 60)
print("NUMERICAL FEATURES - STATISTICS")
print("=" * 60)
display(df.describe())

# Target variable distribution
print("\n" + "=" * 60)
print("TARGET VARIABLE (loan_approved)")
print("=" * 60)
print(df["loan_approved"].value_counts())
print(f"\nApproval Rate: {df['loan_approved'].mean():.2%}")

## 2. Numerical Feature Engineering

Numerical features often require transformation to improve model performance.

### Key Techniques for Numerical Features

1. **Scaling/Normalization**
   - **StandardScaler**: Mean=0, Std=1 (works best when features are roughly Gaussian)
   - **MinMaxScaler**: Scale to [0, 1] range
   - **RobustScaler**: Resistant to outliers (uses median, IQR)

2. **Transformations**
   - **Log transform**: For right-skewed data
   - **Square root**: For moderate skewness
   - **Box-Cox**: Power transform with a fitted exponent (requires strictly positive values)

3. **Binning/Discretization**
   - Convert continuous → categorical
   - Useful for age groups, income brackets, etc.

4. **Creating Derived Features**
   - Ratios: income_per_dependent = income / num_dependents
   - Differences: experience_gap = age - education_years - 18
   - Aggregations: total, average, etc.
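
As a small illustration of the Box-Cox transform listed above, the sketch below applies scikit-learn's `PowerTransformer` to synthetic, strictly positive, right-skewed values. The data is randomly generated to stand in for an income-like column; it is not the course dataset, and the parameter choices are assumptions made for the example.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Synthetic right-skewed, strictly positive values (stand-in for an income column)
synthetic_income = rng.lognormal(mean=10.5, sigma=0.8, size=2000).reshape(-1, 1)

# Box-Cox needs strictly positive input; method="yeo-johnson" also accepts zeros/negatives
box_cox = PowerTransformer(method="box-cox", standardize=True)
income_bc = box_cox.fit_transform(synthetic_income)

skew_before = skew(synthetic_income.ravel())
skew_after = skew(income_bc.ravel())
fitted_lambda = box_cox.lambdas_[0]  # exponent estimated from the data

print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")
print(f"Fitted lambda:   {fitted_lambda:.3f}")
```

A fitted lambda near 0 means the transform behaves much like a log transform, which is what one would expect for log-normal data.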

### When to Use Which Scaler?

| Scaler | Use When | Example |
|--------|----------|---------|
| **StandardScaler** | Normal distribution | Heights, weights |
| **MinMaxScaler** | Bounded range needed | Neural networks, image pixels |
| **RobustScaler** | Outliers present | Financial data, real estate |
| **No scaling** | Tree-based models | Random Forest, XGBoost |
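
To make the outlier row of this table concrete, here is a minimal sketch on five hand-picked values, one of which is an extreme outlier. The numbers are invented for illustration. It compares how much of the scaled range the four ordinary values keep under `StandardScaler` versus `RobustScaler`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four ordinary values and one extreme outlier (illustrative data)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard_scaled = StandardScaler().fit_transform(values).ravel()
robust_scaled = RobustScaler().fit_transform(values).ravel()  # centers on median, scales by IQR

# How far apart the ordinary values (1 and 4) end up after scaling
inlier_spread_standard = standard_scaled[3] - standard_scaled[0]
inlier_spread_robust = robust_scaled[3] - robust_scaled[0]

print("StandardScaler:", np.round(standard_scaled, 3))
print("RobustScaler:  ", np.round(robust_scaled, 3))
print(f"Inlier spread (standard): {inlier_spread_standard:.3f}")
print(f"Inlier spread (robust):   {inlier_spread_robust:.3f}")
```

With `StandardScaler` the outlier inflates the mean and standard deviation, so the ordinary values are squeezed together; `RobustScaler` keeps them well separated because the median and IQR are barely affected by a single extreme value.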

Let's apply these techniques!

In [None]:
# Numerical Feature Engineering Examples

# 1. SCALING - Compare different scalers
print("=" * 60)
print("1. FEATURE SCALING COMPARISON")
print("=" * 60)

# Select numerical features
numerical_features = ["age", "income", "education_years", "experience_years"]
sample_data = df[numerical_features].head()

print("\nOriginal Values:")
display(sample_data)

# Standard Scaler
scaler_standard = StandardScaler()
scaled_standard = pd.DataFrame(
    scaler_standard.fit_transform(sample_data), columns=[f"{col}_std" for col in numerical_features]
)

print("\nStandardScaler (mean=0, std=1):")
display(scaled_standard)

# MinMax Scaler
scaler_minmax = MinMaxScaler()
scaled_minmax = pd.DataFrame(
    scaler_minmax.fit_transform(sample_data),
    columns=[f"{col}_minmax" for col in numerical_features],
)

print("\nMinMaxScaler (range=[0,1]):")
display(scaled_minmax)

# 2. TRANSFORMATIONS - Handle skewed data
print("\n" + "=" * 60)
print("2. HANDLING SKEWED DISTRIBUTIONS")
print("=" * 60)

# Check skewness of income
print(f"\nIncome Skewness: {df['income'].skew():.2f}")
print("(Skewness > 1 or < -1 indicates high skewness)")

# Log transformation for right-skewed data
df["income_log"] = np.log1p(df["income"])  # log1p = log(1 + x) to handle zeros

# Visualize before and after
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(df["income"], bins=30, edgecolor="black", alpha=0.7)
axes[0].set_title(f'Original Income\nSkewness: {df["income"].skew():.2f}')
axes[0].set_xlabel("Income ($)")
axes[0].set_ylabel("Frequency")

axes[1].hist(df["income_log"], bins=30, edgecolor="black", alpha=0.7, color="green")
axes[1].set_title(f'Log-Transformed Income\nSkewness: {df["income_log"].skew():.2f}')
axes[1].set_xlabel("Log(Income)")
axes[1].set_ylabel("Frequency")

plt.tight_layout()
plt.show()

# 3. BINNING - Create categorical from numerical
print("\n" + "=" * 60)
print("3. BINNING NUMERICAL FEATURES")
print("=" * 60)

# Create age groups
df["age_group"] = pd.cut(
    df["age"], bins=[0, 25, 35, 50, 100], labels=["Young", "Early Career", "Mid Career", "Senior"]
)

print("\nAge Group Distribution:")
print(df["age_group"].value_counts().sort_index())

# 4. DERIVED FEATURES - Create new features
print("\n" + "=" * 60)
print("4. CREATING DERIVED FEATURES")
print("=" * 60)

# Income per household member (the +1 counts the applicant and avoids division by zero)
df["income_per_dependent"] = df["income"] / (df["num_dependents"] + 1)

# Years of experience per education year (efficiency metric)
df["experience_efficiency"] = df["experience_years"] / (df["education_years"] + 1)

# Age when started working
df["work_start_age"] = df["age"] - df["experience_years"]

print("\nNew Derived Features Created:")
print("✓ income_per_dependent")
print("✓ experience_efficiency")
print("✓ work_start_age")

print("\nSample of new features:")
display(
    df[
        [
            "income",
            "num_dependents",
            "income_per_dependent",
            "experience_years",
            "education_years",
            "experience_efficiency",
        ]
    ].head()
)

print("\n✓ Numerical feature engineering complete!")

## 3. Categorical Encoding Strategies

Most machine learning algorithms require numerical input. We need to convert categorical variables into numbers.

### Common Encoding Methods

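1. **Label Encoding**
   - Assigns each category a unique integer
   - Best for: Ordinal data (Low < Medium < High), provided the integers follow the real order
   - Warning: `LabelEncoder` numbers categories alphabetically, so it implies an order that nominal data does not have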

2. **One-Hot Encoding**
   - Creates binary column for each category
   - Best for: Nominal data (no natural order)
   - Warning: Can create too many features (high cardinality)

3. **Target Encoding (Mean Encoding)**
   - Replaces category with mean of target variable
   - Best for: High cardinality features
   - Warning: Leaks the target and can overfit (compute the means on training folds only)

4. **Frequency Encoding**
   - Replaces category with its frequency/count
   - Best for: When frequency matters
   
5. **Binary Encoding**
   - Converts to binary representation
   - Best for: High cardinality with memory constraints
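
The overfitting risk of target encoding comes from each row seeing its own target value. One common mitigation is out-of-fold encoding, sketched below under stated assumptions: the helper name `out_of_fold_target_encode` and the toy data are hypothetical and written for this example, not part of any library or the course dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_target_encode(frame, cat_col, target_col, n_splits=5, seed=42):
    """Encode each row with category means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=frame.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kfold.split(frame):
        fit_part = frame.iloc[fit_idx]
        category_means = fit_part.groupby(cat_col)[target_col].mean()
        # Categories unseen in the fitting folds fall back to the fold's overall mean
        mapped = (
            frame.iloc[enc_idx][cat_col]
            .map(category_means)
            .fillna(fit_part[target_col].mean())
        )
        encoded.iloc[enc_idx] = mapped.to_numpy()
    return encoded


# Toy data (hypothetical values)
toy = pd.DataFrame(
    {
        "city": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
        "loan_approved": [1, 1, 0, 0, 0, 1, 1, 1, 1, 0],
    }
)
toy["city_te_oof"] = out_of_fold_target_encode(toy, "city", "loan_approved")
print(toy)
```

Because every encoded value is computed without the row's own label, the feature carries less leaked information than a plain per-category mean over the whole dataset.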

### Choosing the Right Encoding

| Scenario | Recommended Encoding |
|----------|---------------------|
| Ordinal data (Education: HS < Bachelor < Master) | **Label/Ordinal Encoding** with an explicit category order |
| Nominal data (Color: Red, Blue, Green) | **One-Hot Encoding** |
| High cardinality (1000+ unique cities) | **Target/Frequency Encoding** |
| Tree-based models | **Label/Target Encoding** |
| Linear models | **One-Hot Encoding** |

Let's see examples of each!

In [None]:
# Categorical Encoding Examples

print("=" * 60)
print("CATEGORICAL FEATURES IN OUR DATASET")
print("=" * 60)

categorical_features = ["city", "job_category"]
print(f"\nCategorical columns: {categorical_features}\n")

for col in categorical_features:
    print(f"{col}:")
    print(df[col].value_counts())
    print()

# 1. LABEL ENCODING
print("=" * 60)
print("1. LABEL ENCODING")
print("=" * 60)

# Apply label encoding to job_category
le = LabelEncoder()
df["job_category_label"] = le.fit_transform(df["job_category"])

print("\nOriginal vs Label Encoded:")
comparison = (
    df[["job_category", "job_category_label"]].drop_duplicates().sort_values("job_category_label")
)
display(comparison)

print("\n⚠️  Note: Numbers don't imply order (Education=0 doesn't mean it's 'less than' Finance=1)")

# 2. ONE-HOT ENCODING
print("\n" + "=" * 60)
print("2. ONE-HOT ENCODING")
print("=" * 60)

# Create dummy variables for city
city_dummies = pd.get_dummies(df["city"], prefix="city")

print(f"\nOriginal feature: 1 column (city)")
print(f"One-hot encoded: {city_dummies.shape[1]} columns")
print(f"\nNew columns created:")
print(city_dummies.columns.tolist())

print("\nSample of one-hot encoded data:")
display(pd.concat([df[["city"]].head(), city_dummies.head()], axis=1))

print("\n💡 Insight: Each city gets its own binary column")

# 3. FREQUENCY ENCODING
print("\n" + "=" * 60)
print("3. FREQUENCY ENCODING")
print("=" * 60)

# Calculate frequency for each city
city_counts = df["city"].value_counts()
df["city_frequency"] = df["city"].map(city_counts)

print("\nCity Frequency Mapping:")
print(city_counts)

print("\nSample of frequency encoded data:")
display(df[["city", "city_frequency"]].head(10))

# 4. TARGET ENCODING
print("\n" + "=" * 60)
print("4. TARGET ENCODING (Mean Encoding)")
print("=" * 60)

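# Calculate mean of target variable for each city
# Caveat: this uses every row's own target (leakage); in practice, fit the
# mapping on training folds only and apply it to held-out rows
city_target_mean = df.groupby("city")["loan_approved"].mean()
df["city_target_encoded"] = df["city"].map(city_target_mean)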

print("\nTarget Encoding Mapping:")
print("(Mean loan approval rate per city)")
for city, mean_approval in city_target_mean.items():
    print(f"{city:15s}: {mean_approval:.2%}")

print("\nSample of target encoded data:")
display(df[["city", "loan_approved", "city_target_encoded"]].head(10))

print("\n💡 Cities with higher approval rates get higher encoded values")

# 5. COMPARISON OF ALL ENCODINGS
print("\n" + "=" * 60)
print("5. ENCODING COMPARISON SUMMARY")
print("=" * 60)

# Create comparison DataFrame
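comparison_df = pd.DataFrame(
    {
        "Original": df["city"].head(8),
        # Fit a separate encoder on the full city column so the codes are consistent
        # and the job_category encoder `le` is not overwritten
        "Label_Encoded": LabelEncoder().fit(df["city"]).transform(df["city"].head(8)),
        "Frequency": df["city_frequency"].head(8),
        "Target_Encoded": df["city_target_encoded"].head(8).round(3),
    }
)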

# Add one-hot columns
for col in city_dummies.columns:
    comparison_df[col] = city_dummies[col].head(8).values

print("\nSide-by-side comparison:")
display(comparison_df)

print("\n✓ Categorical encoding strategies demonstrated!")

## 4. Datetime Feature Extraction

Datetime features contain rich information that can be extracted into multiple useful features.

### What Can We Extract from Dates?

From a single datetime column, you can create:

1. **Temporal Components**
   - Year, Month, Day, Hour, Minute, Second
   - Day of week, Day of year, Week of year
   - Quarter, Semester

2. **Calendar Flags and Categories**
   - Is weekend?, Is holiday?, Is business hour?
   - Season (Spring, Summer, Fall, Winter)
   - Beginning/End of month

3. **Time-Based Features**
   - Time since reference date
   - Age, tenure, days until event
   - Time between events

4. **Cyclical Encoding**
   - Convert cyclical features (month, day_of_week) to sin/cos
   - Preserves cyclical nature: December (12) is close to January (1)
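
The claim that sin/cos encoding keeps December close to January can be checked numerically. The sketch below is a small illustration with helper names (`encode_month`, `cyclical_distance`) invented for this example. It compares distances between months in the encoded space with the gap between the raw month numbers.

```python
import numpy as np


def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])


def cyclical_distance(month_a, month_b):
    """Euclidean distance between two months in (sin, cos) space."""
    return float(np.linalg.norm(encode_month(month_a) - encode_month(month_b)))


dec_to_jan = cyclical_distance(12, 1)  # adjacent months
dec_to_jun = cyclical_distance(12, 6)  # opposite sides of the year
raw_gap_dec_jan = abs(12 - 1)          # what a model sees with the raw month number

print(f"Raw gap Dec-Jan:           {raw_gap_dec_jan}")
print(f"Encoded distance Dec-Jan:  {dec_to_jan:.3f}")
print(f"Encoded distance Dec-Jun:  {dec_to_jun:.3f}")
```

In the raw encoding December and January are 11 units apart, the largest possible gap, while in the encoded space they are as close as any other pair of adjacent months.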

### Example: Customer Registration Date

Let's create a sample datetime feature and extract useful information.

In [None]:
# Datetime Feature Extraction Examples

# Create sample registration dates for our dataset
np.random.seed(42)
base_date = pd.to_datetime("2020-01-01")
random_days = np.random.randint(0, 1460, size=len(df))  # 4 years of dates
df["registration_date"] = base_date + pd.to_timedelta(random_days, unit="D")

print("=" * 60)
print("DATETIME FEATURE EXTRACTION")
print("=" * 60)

print("\nOriginal datetime column:")
print(df["registration_date"].head(10))

# 1. BASIC TEMPORAL COMPONENTS
print("\n" + "=" * 60)
print("1. EXTRACTING TEMPORAL COMPONENTS")
print("=" * 60)

df["reg_year"] = df["registration_date"].dt.year
df["reg_month"] = df["registration_date"].dt.month
df["reg_day"] = df["registration_date"].dt.day
df["reg_day_of_week"] = df["registration_date"].dt.dayofweek  # Monday=0, Sunday=6
df["reg_day_name"] = df["registration_date"].dt.day_name()
df["reg_quarter"] = df["registration_date"].dt.quarter
df["reg_week_of_year"] = df["registration_date"].dt.isocalendar().week

print("\nExtracted features:")
temporal_features = df[
    [
        "registration_date",
        "reg_year",
        "reg_month",
        "reg_day",
        "reg_day_of_week",
        "reg_day_name",
        "reg_quarter",
    ]
].head(10)
display(temporal_features)

# 2. BOOLEAN FLAGS
print("\n" + "=" * 60)
print("2. CREATING BOOLEAN FLAGS")
print("=" * 60)

df["is_weekend"] = (df["reg_day_of_week"] >= 5).astype(int)
df["is_month_start"] = df["registration_date"].dt.is_month_start.astype(int)
df["is_month_end"] = df["registration_date"].dt.is_month_end.astype(int)
df["is_quarter_start"] = df["registration_date"].dt.is_quarter_start.astype(int)


# Determine season
def get_season(month):
    if month in [12, 1, 2]:
        return "Winter"
    elif month in [3, 4, 5]:
        return "Spring"
    elif month in [6, 7, 8]:
        return "Summer"
    else:
        return "Fall"


df["season"] = df["reg_month"].apply(get_season)

print("\nBoolean flags created:")
flag_sample = df[
    ["registration_date", "is_weekend", "is_month_start", "is_month_end", "season"]
].head(10)
display(flag_sample)

# 3. TIME-BASED FEATURES
print("\n" + "=" * 60)
print("3. TIME-BASED CALCULATIONS")
print("=" * 60)

# Days since registration (tenure)
reference_date = pd.to_datetime("2024-01-01")
df["days_since_registration"] = (reference_date - df["registration_date"]).dt.days
df["years_since_registration"] = df["days_since_registration"] / 365.25

print(f"\nReference date: {reference_date.date()}")
print("\nTenure calculation:")
tenure_sample = df[
    ["registration_date", "days_since_registration", "years_since_registration"]
].head(10)
display(tenure_sample)

# 4. CYCLICAL ENCODING
print("\n" + "=" * 60)
print("4. CYCLICAL ENCODING (Sin/Cos Transformation)")
print("=" * 60)

# Encode month cyclically
df["month_sin"] = np.sin(2 * np.pi * df["reg_month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["reg_month"] / 12)

# Encode day of week cyclically
df["day_of_week_sin"] = np.sin(2 * np.pi * df["reg_day_of_week"] / 7)
df["day_of_week_cos"] = np.cos(2 * np.pi * df["reg_day_of_week"] / 7)

print("\nCyclical encoding preserves the circular nature:")
print("December (12) and January (1) are now mathematically close!\n")

cyclical_sample = (
    df[["reg_month", "month_sin", "month_cos"]].drop_duplicates().sort_values("reg_month")
)
display(cyclical_sample)

# Visualize cyclical encoding
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Month encoding
months = np.arange(1, 13)
month_sin = np.sin(2 * np.pi * months / 12)
month_cos = np.cos(2 * np.pi * months / 12)

axes[0].plot(months, month_sin, "o-", label="sin(month)", linewidth=2)
axes[0].plot(months, month_cos, "s-", label="cos(month)", linewidth=2)
axes[0].set_xlabel("Month")
axes[0].set_ylabel("Encoded Value")
axes[0].set_title("Cyclical Encoding of Months")
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_xticks(months)

# Polar plot showing cyclical nature
theta = 2 * np.pi * months / 12
axes[1].remove()  # drop the Cartesian axes so the polar axes do not overlap it
axes[1] = fig.add_subplot(1, 2, 2, projection="polar")
axes[1].plot(theta, np.ones_like(theta), "o-", markersize=10, linewidth=2)
for i, month in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
):
    axes[1].text(theta[i], 1.1, month, ha="center", va="center")
axes[1].set_title("Months as Cyclical Feature\n(December is close to January)", pad=20)

plt.tight_layout()
plt.show()

print("\n✓ Datetime feature extraction complete!")
print(
    f"✓ Created {sum(df.columns.str.contains('reg_|month_|day_of_week_|is_|season|days_since|years_since'))} new features from 1 datetime column!"
)

## 5. Polynomial and Interaction Features

Linear models can only capture linear relationships. Polynomial and interaction features help capture non-linear patterns.

### Polynomial Features

Transform features into higher-degree polynomials:
- **Degree 2**: x, x², x*y, y, y²
- **Degree 3**: x, x², x³, x*y, x²*y, x*y², y, y², y³

**When to use:**
- Linear models (Linear Regression, Logistic Regression)
- When you suspect non-linear relationships
- Can significantly improve model performance

**Warning:**
- Creates many features (can cause overfitting)
- Increases computational cost
- Use feature selection after creating polynomials
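
How quickly the feature count grows can be worked out before fitting anything. With n input features and degree d, a full polynomial expansion without the bias column has C(n + d, d) - 1 terms. The sketch below checks this against `PolynomialFeatures`; the helper `polynomial_feature_count` is a name introduced for this example.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def polynomial_feature_count(n_features, degree):
    """Number of polynomial terms up to `degree`, excluding the bias column."""
    return comb(n_features + degree, degree) - 1


# A single placeholder row is enough to let scikit-learn count output features
placeholder = np.zeros((1, 3))
full = PolynomialFeatures(degree=2, include_bias=False).fit(placeholder)
pairs_only = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
).fit(placeholder)

print(f"3 features, degree 2 (formula):      {polynomial_feature_count(3, 2)}")
print(f"3 features, degree 2 (scikit-learn): {full.n_output_features_}")
print(f"3 features, interactions only:       {pairs_only.n_output_features_}")
print(f"20 features, degree 3 (formula):     {polynomial_feature_count(20, 3)}")
```

Setting `interaction_only=True` drops the pure powers (such as x²) and keeps only products of distinct features, which is one way to limit the growth.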

### Interaction Features

Multiply pairs of features to capture their combined effect:
- income * education_years
- age * num_dependents

**Examples of useful interactions:**
- Price × Quantity = Total Value
- Hours Worked × Hourly Rate = Earnings
- Bedroom Count × Square Footage = Spaciousness Score

Let's create these features!

In [None]:
# Polynomial and Interaction Features

print("=" * 60)
print("POLYNOMIAL & INTERACTION FEATURES")
print("=" * 60)

# Select a few numerical features for demonstration
base_features = ["age", "income", "education_years"]
X_sample = df[base_features].head(5)

print("\nOriginal features:")
print(f"Shape: {X_sample.shape}")
display(X_sample)

# 1. POLYNOMIAL FEATURES (Degree 2)
print("\n" + "=" * 60)
print("1. CREATING POLYNOMIAL FEATURES (Degree 2)")
print("=" * 60)

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_sample)

# Get feature names
poly_feature_names = poly.get_feature_names_out(base_features)

print(f"\nOriginal features: {len(base_features)}")
print(f"Polynomial features (degree=2): {len(poly_feature_names)}")
print(f"\nNew features created:")
for i, name in enumerate(poly_feature_names):
    print(f"  {i+1}. {name}")

print("\nPolynomial features (first 3 rows):")
poly_df = pd.DataFrame(X_poly, columns=poly_feature_names)
display(poly_df.head(3))

print("\n💡 Notice:")
print("   - Original features: age, income, education_years")
print("   - Added squares: age², income², education_years²")
print("   - Added interactions: age×income, age×education_years, income×education_years")

# 2. MANUAL INTERACTION FEATURES
print("\n" + "=" * 60)
print("2. CREATING CUSTOM INTERACTION FEATURES")
print("=" * 60)

# Create meaningful interactions for our loan approval problem
df["income_education_interaction"] = df["income"] * df["education_years"]
df["age_dependents_interaction"] = df["age"] * df["num_dependents"]
df["income_experience_interaction"] = df["income"] * df["experience_years"]

print("\nCreated custom interactions:")
print("✓ income × education_years → income_education_interaction")
print("✓ age × num_dependents → age_dependents_interaction")
print("✓ income × experience_years → income_experience_interaction")

interaction_sample = df[
    [
        "income",
        "education_years",
        "income_education_interaction",
        "age",
        "num_dependents",
        "age_dependents_interaction",
    ]
].head()
display(interaction_sample)

# 3. DEMONSTRATE IMPACT ON MODEL PERFORMANCE
print("\n" + "=" * 60)
print("3. IMPACT OF POLYNOMIAL FEATURES ON MODEL PERFORMANCE")
print("=" * 60)

# Prepare data WITHOUT polynomial features
features_simple = ["age", "income", "education_years", "experience_years", "num_dependents"]
X_simple = df[features_simple].fillna(df[features_simple].mean())
y = df["loan_approved"]

# Train/test split
X_train_simple, X_test_simple, y_train, y_test = train_test_split(
    X_simple, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_simple_scaled = scaler.fit_transform(X_train_simple)
X_test_simple_scaled = scaler.transform(X_test_simple)

# Train logistic regression WITHOUT polynomial features
lr_simple = LogisticRegression(random_state=42, max_iter=1000)
lr_simple.fit(X_train_simple_scaled, y_train)
score_simple = lr_simple.score(X_test_simple_scaled, y_test)

print(f"\n📊 Logistic Regression WITHOUT polynomial features:")
print(f"   Accuracy: {score_simple:.4f}")

# Prepare data WITH polynomial features
poly_transform = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly_transform.fit_transform(X_train_simple_scaled)
X_test_poly = poly_transform.transform(X_test_simple_scaled)

# Train logistic regression WITH polynomial features
lr_poly = LogisticRegression(random_state=42, max_iter=1000)
lr_poly.fit(X_train_poly, y_train)
score_poly = lr_poly.score(X_test_poly, y_test)

print(f"\n📊 Logistic Regression WITH polynomial features:")
print(f"   Features: {X_train_simple.shape[1]} → {X_train_poly.shape[1]}")
print(f"   Accuracy: {score_poly:.4f}")
print(
    f"   Improvement: {(score_poly - score_simple):.4f} ({((score_poly - score_simple)/score_simple)*100:.1f}%)"
)

# Visualize comparison
fig, ax = plt.subplots(figsize=(10, 6))
models = ["Without\nPolynomial Features", "With\nPolynomial Features"]
accuracies = [score_simple, score_poly]
colors = ["lightblue", "lightgreen"]

bars = ax.bar(models, accuracies, color=colors, edgecolor="black", linewidth=2, alpha=0.7)

# Add value labels on bars
for bar, acc in zip(bars, accuracies):
    height = bar.get_height()
    ax.text(
        bar.get_x() + bar.get_width() / 2.0,
        height,
        f"{acc:.4f}",
        ha="center",
        va="bottom",
        fontsize=14,
        fontweight="bold",
    )

ax.set_ylabel("Accuracy", fontsize=12)
ax.set_title("Impact of Polynomial Features on Model Performance", fontsize=14, fontweight="bold")
ax.set_ylim(0, 1)
ax.grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.show()

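print("\n✓ Polynomial and interaction features demonstrated!")
# Report the outcome instead of assuming polynomial features always help
if score_poly > score_simple:
    print("✓ Polynomial features improved test accuracy on this split")
else:
    print("Polynomial features did not improve test accuracy on this split")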

## 6. Feature Selection Methods

Not all features are useful. Some are redundant, some are irrelevant, and some may even hurt model performance.

### Why Feature Selection?

1. **Reduces overfitting** - Less redundant data means less opportunity to make decisions based on noise
2. **Improves accuracy** - Less misleading data means better model performance
3. **Reduces training time** - Fewer features = faster computation
4. **Improves interpretability** - Simpler models are easier to explain

### Feature Selection Methods

1. **Filter Methods**
   - Statistical tests (correlation, chi-square, ANOVA)
   - SelectKBest, SelectPercentile
   - Fast but don't consider feature interactions

2. **Wrapper Methods**
   - Recursive Feature Elimination (RFE)
   - Forward/Backward selection
   - Slow but consider feature interactions

3. **Embedded Methods**
   - Lasso (L1) regularization
   - Tree-based feature importance
   - Built into model training
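
As an illustration of the embedded approach with L1 regularization, the sketch below fits an L1-penalized linear model on synthetic data and lets `SelectFromModel` keep the features with non-zero coefficients. The data comes from `make_classification`, and the choice of `LinearSVC` and of `C=0.05` are assumptions made for the example; in practice the penalty strength would be tuned.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data: 10 features, of which only 3 carry signal
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)  # L1 penalties are scale-sensitive

# The L1 penalty pushes coefficients of unhelpful features to exactly zero
l1_model = LinearSVC(penalty="l1", dual=False, C=0.05, max_iter=5000, random_state=0)
selector = SelectFromModel(l1_model).fit(X_demo, y_demo)

kept_mask = selector.get_support()
X_reduced = selector.transform(X_demo)

print(f"Features kept: {kept_mask.sum()} of {X_demo.shape[1]}")
print(f"Kept feature indices: {kept_mask.nonzero()[0].tolist()}")
```

A smaller `C` means a stronger penalty and fewer surviving features, so selection and model fitting happen in the same training step.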

### Comparison

| Method | Speed | Accuracy | Considers Interactions |
|--------|-------|----------|------------------------|
| **Filter** | ⚡ Fast | Good | ❌ No |
| **Wrapper** | 🐌 Slow | Best | ✅ Yes |
| **Embedded** | 🚀 Medium | Great | ✅ Yes |

Let's apply each method!

In [None]:
# Feature Selection Methods - Comprehensive Examples

# Prepare data
print("=" * 60)
print("FEATURE SELECTION METHODS COMPARISON")
print("=" * 60)

feature_cols = [
    "age",
    "income",
    "education_years",
    "experience_years",
    "num_dependents",
    "income_per_dependent",
    "experience_efficiency",
    "work_start_age",
]

X_full = df[feature_cols].fillna(df[feature_cols].mean())
y = df["loan_approved"]
X_train, X_test, y_train, y_test = train_test_split(X_full, y, test_size=0.2, random_state=42)

print(f"\nTotal features available: {X_train.shape[1]}")
print(f"Training samples: {X_train.shape[0]}")

# METHOD 1: SelectKBest (Filter)
print("\n" + "=" * 60)
print("METHOD 1: SelectKBest (Filter Method)")
print("=" * 60)

selector_kbest = SelectKBest(score_func=f_classif, k=5)
X_train_kbest = selector_kbest.fit_transform(X_train, y_train)
selected_kbest = X_full.columns[selector_kbest.get_support()].tolist()

print(f"\nSelected {len(selected_kbest)} features: {selected_kbest}")

# METHOD 2: RFE (Wrapper)
print("\n" + "=" * 60)
print("METHOD 2: RFE (Wrapper Method)")
print("=" * 60)

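# Caveat: RFE ranks features by coefficient magnitude, so unscaled inputs
# (as used here) can skew the ranking; scale features first in real projects
selector_rfe = RFE(LogisticRegression(random_state=42, max_iter=1000), n_features_to_select=5)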
X_train_rfe = selector_rfe.fit_transform(X_train, y_train)
selected_rfe = X_full.columns[selector_rfe.get_support()].tolist()

print(f"\nSelected {len(selected_rfe)} features: {selected_rfe}")

# METHOD 3: Random Forest Importance (Embedded)
print("\n" + "=" * 60)
print("METHOD 3: Random Forest Feature Importance (Embedded)")
print("=" * 60)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

importances = pd.DataFrame(
    {"Feature": feature_cols, "Importance": rf.feature_importances_}
).sort_values("Importance", ascending=False)

print("\nFeature Importances:")
display(importances)

# Select top 5 features
top_5_features = importances.head(5)["Feature"].tolist()
print(f"\nTop 5 features: {top_5_features}")

# COMPARE PERFORMANCE
print("\n" + "=" * 60)
print("PERFORMANCE COMPARISON")
print("=" * 60)

results = []

# Baseline: All features
lr_all = LogisticRegression(random_state=42, max_iter=1000)
lr_all.fit(X_train, y_train)
results.append(("All Features (8)", lr_all.score(X_test, y_test), 8))

# SelectKBest
lr_kb = LogisticRegression(random_state=42, max_iter=1000)
lr_kb.fit(X_train_kbest, y_train)
results.append(("SelectKBest (5)", lr_kb.score(selector_kbest.transform(X_test), y_test), 5))

# RFE
lr_rfe = LogisticRegression(random_state=42, max_iter=1000)
lr_rfe.fit(X_train_rfe, y_train)
results.append(("RFE (5)", lr_rfe.score(selector_rfe.transform(X_test), y_test), 5))

# RF top features
X_train_rf = X_train[top_5_features]
X_test_rf = X_test[top_5_features]
lr_rf = LogisticRegression(random_state=42, max_iter=1000)
lr_rf.fit(X_train_rf, y_train)
results.append(("RF Importance (5)", lr_rf.score(X_test_rf, y_test), 5))

results_df = pd.DataFrame(results, columns=["Method", "Accuracy", "Num Features"])
print("\n" + results_df.to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
colors = ["#3498db", "#e74c3c", "#2ecc71", "#f39c12"]
bars = ax.bar(
    results_df["Method"],
    results_df["Accuracy"],
    color=colors,
    alpha=0.7,
    edgecolor="black",
    linewidth=2,
)

for bar, acc in zip(bars, results_df["Accuracy"]):
    height = bar.get_height()
    ax.text(
        bar.get_x() + bar.get_width() / 2.0,
        height,
        f"{acc:.4f}",
        ha="center",
        va="bottom",
        fontsize=11,
        fontweight="bold",
    )

ax.set_ylabel("Accuracy", fontsize=12)
ax.set_title("Feature Selection Methods - Performance Comparison", fontsize=14, fontweight="bold")
ax.set_ylim(0, 1)
ax.axhline(
    y=results_df["Accuracy"].iloc[0], color="gray", linestyle="--", alpha=0.5, label="Baseline"
)
ax.legend()
ax.grid(axis="y", alpha=0.3)

plt.xticks(rotation=15, ha="right")
plt.tight_layout()
plt.show()

print("\n✓ Feature selection complete!")
print(f"💡 Best method: {results_df.loc[results_df['Accuracy'].idxmax(), 'Method']}")

## 7. Feature Importance Analysis

Understanding which features contribute most to predictions helps with:
- **Model interpretation** - Explain decisions to stakeholders
- **Feature selection** - Focus on what matters
- **Domain insights** - Learn about the problem
- **Debugging** - Identify data quality issues

### Methods for Feature Importance

1. **Tree-based models** - Built-in feature_importances_
2. **Permutation importance** - Shuffle feature and measure impact
3. **SHAP values** - Game-theory based explanations (covered in advanced modules)
4. **Coefficients** - For linear models
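
Permutation importance, listed above, can be sketched in a few lines with `sklearn.inspection.permutation_importance`. The example below uses synthetic data made up for this illustration: one feature fully determines the label and the other is pure noise, so shuffling the first should hurt test accuracy far more than shuffling the second.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: `signal` determines the label, `noise` is unrelated to it
signal = rng.normal(size=600)
noise = rng.normal(size=600)
X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in accuracy
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

for name, mean_drop in zip(["signal", "noise"], result.importances_mean):
    print(f"{name:8s} mean accuracy drop: {mean_drop:.3f}")
```

Unlike the built-in `feature_importances_`, this measure is computed on held-out data and works with any fitted estimator, at the cost of repeated scoring.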

Let's visualize feature importance from our Random Forest model!

In [None]:
# Feature Importance Visualization

print("=" * 60)
print("FEATURE IMPORTANCE ANALYSIS")
print("=" * 60)

# Get feature importances from our trained Random Forest
feature_importance_df = pd.DataFrame(
    {"Feature": feature_cols, "Importance": rf.feature_importances_}
).sort_values(
    "Importance", ascending=True
)  # Ascending for horizontal bar chart

print("\nFeature Importance Rankings:")
for idx, row in feature_importance_df.sort_values("Importance", ascending=False).iterrows():
    bar = "█" * int(row["Importance"] * 100)
    print(f"{row['Feature']:30s} {row['Importance']:.4f} {bar}")

# Create comprehensive visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Horizontal bar chart
axes[0, 0].barh(
    feature_importance_df["Feature"],
    feature_importance_df["Importance"],
    color="steelblue",
    edgecolor="black",
    alpha=0.8,
)
axes[0, 0].set_xlabel("Importance", fontweight="bold")
axes[0, 0].set_title("Feature Importance - Horizontal View", fontweight="bold", fontsize=12)
axes[0, 0].grid(axis="x", alpha=0.3)

# 2. Vertical bar chart
sorted_features = feature_importance_df.sort_values("Importance", ascending=False)
colors_gradient = plt.cm.RdYlGn(sorted_features["Importance"] / sorted_features["Importance"].max())
axes[0, 1].bar(
    range(len(sorted_features)),
    sorted_features["Importance"],
    color=colors_gradient,
    edgecolor="black",
    alpha=0.8,
)
axes[0, 1].set_xticks(range(len(sorted_features)))
axes[0, 1].set_xticklabels(sorted_features["Feature"], rotation=45, ha="right")
axes[0, 1].set_ylabel("Importance", fontweight="bold")
axes[0, 1].set_title("Feature Importance - Ranked", fontweight="bold", fontsize=12)
axes[0, 1].grid(axis="y", alpha=0.3)

# 3. Pie chart
top_6 = feature_importance_df.sort_values("Importance", ascending=False).head(6)
other_importance = (
    feature_importance_df.sort_values("Importance", ascending=False).iloc[6:]["Importance"].sum()
)
pie_data = list(top_6["Importance"]) + [other_importance]
pie_labels = list(top_6["Feature"]) + ["Other Features"]
colors_pie = plt.cm.Set3(range(len(pie_labels)))

axes[1, 0].pie(
    pie_data,
    labels=pie_labels,
    autopct="%1.1f%%",
    startangle=90,
    colors=colors_pie,
    textprops={"fontsize": 10, "fontweight": "bold"},
)
axes[1, 0].set_title("Feature Importance Distribution", fontweight="bold", fontsize=12)

# 4. Cumulative importance
cumulative_importance = feature_importance_df.sort_values("Importance", ascending=False)[
    "Importance"
].cumsum()
axes[1, 1].plot(
    range(1, len(cumulative_importance) + 1),
    cumulative_importance,
    marker="o",
    linewidth=2,
    markersize=8,
    color="darkgreen",
)
axes[1, 1].axhline(y=0.8, color="red", linestyle="--", label="80% threshold", linewidth=2)
axes[1, 1].axhline(y=0.9, color="orange", linestyle="--", label="90% threshold", linewidth=2)
axes[1, 1].set_xlabel("Number of Features", fontweight="bold")
axes[1, 1].set_ylabel("Cumulative Importance", fontweight="bold")
axes[1, 1].set_title("Cumulative Feature Importance", fontweight="bold", fontsize=12)
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)
axes[1, 1].set_xticks(range(1, len(cumulative_importance) + 1))

plt.tight_layout()
plt.show()

# Calculate how many features are needed to reach 80% / 90% cumulative importance
cumsum = feature_importance_df.sort_values("Importance", ascending=False)["Importance"].cumsum()
# Count the features strictly below the threshold, then add the one that crosses it
features_for_80 = min((cumsum < 0.80).sum() + 1, len(cumsum))
features_for_90 = min((cumsum < 0.90).sum() + 1, len(cumsum))

print("\n💡 Insights:")
print(
    f"   • Top feature: {feature_importance_df.iloc[-1]['Feature']} ({feature_importance_df.iloc[-1]['Importance']:.4f})"
)
print(f"   • {features_for_80} features explain 80% of importance")
print(f"   • {features_for_90} features explain 90% of importance")
print("\n✓ Feature importance analysis complete!")

## 8. Automated Feature Engineering

Manual feature engineering is powerful but time-consuming. Automated tools can help discover features you might miss!

### Tools for Automated Feature Engineering

1. **Featuretools** - Deep feature synthesis
2. **tsfresh** - Time series features
3. **AutoFeat** - Linear model-based
4. **ydata-profiling** (formerly pandas-profiling) - Data exploration reports

### Simple Automation Example

While we won't use external libraries here, we can create simple automation functions!

In [None]:
# Automated Feature Engineering - Simple Example


def auto_create_features(dataframe, numerical_cols):
    """
    Automatically create common feature transformations.

    Parameters:
    - dataframe: Input DataFrame
    - numerical_cols: List of numerical column names

    Returns:
    - DataFrame with new features
    """
    df_auto = dataframe.copy()

    print("Generating automated features...")
    features_created = 0

    # 1. Pairwise ratios
    for i, col1 in enumerate(numerical_cols):
        for col2 in numerical_cols[i + 1 :]:
            feat_name = f"{col1}_div_{col2}"
            # +1 avoids division by zero, assuming col2 is non-negative
            df_auto[feat_name] = df_auto[col1] / (df_auto[col2] + 1)
            features_created += 1

    # 2. Pairwise products
    for i, col1 in enumerate(numerical_cols):
        for col2 in numerical_cols[i + 1 :]:
            feat_name = f"{col1}_times_{col2}"
            df_auto[feat_name] = df_auto[col1] * df_auto[col2]
            features_created += 1

    # 3. Squares
    for col in numerical_cols:
        df_auto[f"{col}_squared"] = df_auto[col] ** 2
        features_created += 1

    # 4. Square roots
    for col in numerical_cols:
        df_auto[f"{col}_sqrt"] = np.sqrt(np.abs(df_auto[col]))
        features_created += 1

    print(f"✓ Created {features_created} new features automatically!")
    return df_auto


# Apply automated feature engineering
print("=" * 60)
print("AUTOMATED FEATURE ENGINEERING")
print("=" * 60)

numerical_cols_subset = ["age", "income", "education_years"]
print(f"\nStarting with {len(numerical_cols_subset)} numerical features")
print(f"Original shape: {df.shape}")

df_automated = auto_create_features(df, numerical_cols_subset)

print(f"New shape: {df_automated.shape}")
print(f"Features added: {df_automated.shape[1] - df.shape[1]}")

# Show sample of new features
new_cols = [col for col in df_automated.columns if col not in df.columns]
print(f"\nSample of {min(10, len(new_cols))} automated features:")
for i, col in enumerate(new_cols[:10], 1):
    print(f"  {i}. {col}")

print("\n💡 In practice, use libraries like:")
print("   • Featuretools for deep feature synthesis")
print("   • tsfresh for time series")
print("   • AutoFeat for automatic feature engineering")

print("\n✓ Automated feature engineering demonstrated!")

## 9. Practical Feature Engineering Pipeline

Let's put it all together into a complete, production-ready pipeline!

### Best Practices for Feature Engineering Pipelines

1. **Make it reproducible** - Same input → Same output
2. **Use fit/transform pattern** - Prevent data leakage
3. **Document everything** - Future you will thank you
4. **Version your features** - Track what works
5. **Monitor feature distributions** - Detect drift in production
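
The fit/transform pattern is easiest to enforce when every learned preprocessing step lives inside one estimator. The sketch below is one possible layout, not this module's pipeline: the `demo_ft` frame, its column names and the target rule are made up for illustration. A `ColumnTransformer` imputes and scales numeric columns and one-hot encodes a categorical column, so medians, means and categories are learned from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: two numeric columns, one categorical column
rng_ft = np.random.default_rng(42)
n_rows = 300
demo_ft = pd.DataFrame(
    {
        "income": rng_ft.normal(50_000, 15_000, n_rows),
        "age": rng_ft.integers(21, 65, n_rows).astype(float),
        "city": rng_ft.choice(["A", "B", "C"], n_rows),
    }
)
# Made-up target rule: noisy threshold on income
target_ft = (demo_ft["income"] + rng_ft.normal(0, 5_000, n_rows) > 50_000).astype(int)
# Inject a few missing values so the imputer has work to do
demo_ft.loc[demo_ft.index[::25], "income"] = np.nan

numeric_cols_ft = ["income", "age"]
categorical_cols_ft = ["city"]

preprocess_ft = ColumnTransformer(
    [
        (
            "num",
            Pipeline([("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]),
            numeric_cols_ft,
        ),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols_ft),
    ]
)

pipeline_ft = Pipeline(
    [("preprocess", preprocess_ft), ("classifier", LogisticRegression(max_iter=1000))]
)

X_tr_ft, X_te_ft, y_tr_ft, y_te_ft = train_test_split(
    demo_ft, target_ft, test_size=0.2, random_state=42, stratify=target_ft
)

# fit() learns imputation, scaling and encoding from training rows only
pipeline_ft.fit(X_tr_ft, y_tr_ft)
accuracy_ft = pipeline_ft.score(X_te_ft, y_te_ft)
print(f"Test accuracy: {accuracy_ft:.3f}")
```

Because the encoder uses `handle_unknown="ignore"`, a category that appears only at prediction time is encoded as all zeros instead of raising an error.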

### Complete End-to-End Pipeline

We'll build a workflow that:
1. Creates derived features
2. Encodes categorical variables
3. Scales numerical features
4. Selects the best features
5. Trains the model
6. Evaluates it on held-out data

In [None]:
# Complete Feature Engineering Pipeline

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression  # used by the classifier step
from sklearn.pipeline import Pipeline

print("=" * 60)
print("COMPLETE FEATURE ENGINEERING PIPELINE")
print("=" * 60)

# Reload fresh data
df_pipeline = pd.read_csv("../../data_advanced/feature_engineering.csv")

print("\nPipeline Steps:")
print("  1. Create derived features")
print("  2. Encode categorical variables")
print("  3. Scale numerical features")
print("  4. Select best features")
print("  5. Train model")

# Step 1: Create derived features
print("\n" + "=" * 60)
print("STEP 1: Feature Creation")
print("=" * 60)

df_pipeline["income_per_dependent"] = df_pipeline["income"] / (df_pipeline["num_dependents"] + 1)
df_pipeline["experience_efficiency"] = df_pipeline["experience_years"] / (
    df_pipeline["education_years"] + 1
)
df_pipeline["income_education"] = df_pipeline["income"] * df_pipeline["education_years"]

print("✓ Created 3 derived features")

# Step 2: Encode categorical
print("\n" + "=" * 60)
print("STEP 2: Categorical Encoding")
print("=" * 60)

from sklearn.preprocessing import LabelEncoder

le_city = LabelEncoder()
le_job = LabelEncoder()

df_pipeline["city_encoded"] = le_city.fit_transform(df_pipeline["city"])
df_pipeline["job_encoded"] = le_job.fit_transform(df_pipeline["job_category"])

print("✓ Encoded 2 categorical features")

# Prepare the feature matrix and target
numerical_features_final = [
    "age",
    "income",
    "education_years",
    "experience_years",
    "num_dependents",
    "income_per_dependent",
    "experience_efficiency",
    "income_education",
    "city_encoded",
    "job_encoded",
]

X = df_pipeline[numerical_features_final]
y = df_pipeline["loan_approved"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3-5: Build the pipeline (scale, select, classify)
print("\n" + "=" * 60)
print("STEP 3-5: Build sklearn Pipeline")
print("=" * 60)

pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("feature_selection", SelectKBest(f_classif, k=7)),
        ("classifier", LogisticRegression(random_state=42, max_iter=1000)),
    ]
)

print("Pipeline created:")
print("  • StandardScaler")
print("  • SelectKBest (k=7)")
print("  • LogisticRegression")

# Step 6: Train and evaluate
print("\n" + "=" * 60)
print("STEP 6: Train and Evaluate")
print("=" * 60)

# Fit pipeline
pipeline.fit(X_train, y_train)

# Evaluate
train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)

gap = train_score - test_score
print(f"\nTraining Accuracy: {train_score:.4f}")
print(f"Testing Accuracy:  {test_score:.4f}")
print(f"Difference:        {abs(gap):.4f}")

# Only a training score that exceeds the test score suggests overfitting
if gap < 0.05:
    print("\n✓ Good generalization (train - test < 5 points)")
else:
    print("\n⚠️  Possible overfitting (train - test >= 5 points)")

# Show selected features
selected_mask = pipeline.named_steps["feature_selection"].get_support()
selected_features = [f for f, selected in zip(numerical_features_final, selected_mask) if selected]

print(f"\nSelected {len(selected_features)} features:")
for feat in selected_features:
    print(f"  ✓ {feat}")

# Visualize pipeline performance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion matrix
from sklearn.metrics import confusion_matrix, classification_report

y_pred = pipeline.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

axes[0].imshow(cm, cmap="Blues", alpha=0.7)
axes[0].set_title("Confusion Matrix", fontweight="bold")
axes[0].set_xlabel("Predicted")
axes[0].set_ylabel("Actual")

for i in range(2):
    for j in range(2):
        axes[0].text(j, i, str(cm[i, j]), ha="center", va="center", fontsize=20, fontweight="bold")

axes[0].set_xticks([0, 1])
axes[0].set_yticks([0, 1])
axes[0].set_xticklabels(["Not Approved", "Approved"])
axes[0].set_yticklabels(["Not Approved", "Approved"])

# Train vs Test performance
scores = [train_score, test_score]
labels = ["Training", "Testing"]
colors = ["#3498db", "#2ecc71"]

bars = axes[1].bar(labels, scores, color=colors, alpha=0.7, edgecolor="black", linewidth=2)
axes[1].set_ylabel("Accuracy")
axes[1].set_title("Training vs Testing Performance", fontweight="bold")
axes[1].set_ylim(0, 1)
axes[1].grid(axis="y", alpha=0.3)

for bar, score in zip(bars, scores):
    height = bar.get_height()
    axes[1].text(
        bar.get_x() + bar.get_width() / 2.0,
        height,
        f"{score:.4f}",
        ha="center",
        va="bottom",
        fontsize=12,
        fontweight="bold",
    )

plt.tight_layout()
plt.show()

# Print classification report
print("\n" + "=" * 60)
print("DETAILED CLASSIFICATION REPORT")
print("=" * 60)
print(classification_report(y_test, y_pred, target_names=["Not Approved", "Approved"]))

print("\n✓ Complete pipeline successfully built and tested!")
print("\n💡 This pipeline can be saved and reused:")
print("   import joblib")
print("   joblib.dump(pipeline, 'loan_approval_pipeline.pkl')")

## 10. Hands-On Exercises

Practice what you've learned with these challenges!

### Exercise 1: Create Custom Features
Load the housing dataset and create 5 meaningful derived features.

### Exercise 2: Encoding Challenge
Apply all 4 encoding methods (Label, One-Hot, Frequency, Target) to a categorical feature and compare results.

### Exercise 3: Feature Selection
Use SelectKBest, RFE, and Random Forest importance to select features. Which method gives best performance?

### Exercise 4: Complete Pipeline
Build an end-to-end pipeline for the customer_data.csv dataset including:
- Feature creation
- Encoding
- Scaling
- Feature selection
- Model training

Ready to practice? Complete the exercises below!

In [None]:
# Exercise templates - fill these in yourself!

print("=" * 60)
print("EXERCISES - Complete These to Master Feature Engineering!")
print("=" * 60)

# EXERCISE 1: Create Custom Features
print("\nExercise 1: Create Custom Features")
print("-" * 60)
print("TODO: Load ../data/housing_prices.csv")
print("TODO: Create 5 derived features (ratios, products, etc.)")
print("TODO: Visualize the new features")
print("\n# Your code here:")
print()

# EXERCISE 2: Encoding Challenge
print("\nExercise 2: Apply All Encoding Methods")
print("-" * 60)
print("TODO: Choose a categorical feature")
print("TODO: Apply Label, One-Hot, Frequency, and Target encoding")
print("TODO: Train models with each and compare performance")
print("\n# Your code here:")
print()

# EXERCISE 3: Feature Selection Comparison
print("\nExercise 3: Feature Selection Comparison")
print("-" * 60)
print("TODO: Apply SelectKBest, RFE, and RF importance")
print("TODO: Compare which features each method selects")
print("TODO: Evaluate model performance with each")
print("\n# Your code here:")
print()

# EXERCISE 4: Complete Pipeline
print("\nExercise 4: Build End-to-End Pipeline")
print("-" * 60)
print("TODO: Load ../data/customer_data.csv")
print("TODO: Create a complete sklearn Pipeline with:")
print("      - Feature engineering")
print("      - Encoding")
print("      - Scaling")
print("      - Feature selection")
print("      - Model training")
print("\n# Your code here:")
print()

# BONUS CHALLENGE
print("\n" + "=" * 60)
print("BONUS CHALLENGE")
print("=" * 60)
print("Create an automated feature engineering function that:")
print("  1. Detects feature types automatically")
print("  2. Applies appropriate transformations")
print("  3. Selects the best features")
print("  4. Returns a trained model")
print("\n# Your code here:")
print()

print("\n" + "=" * 60)
print("💡 TIPS:")
print("=" * 60)
print("  • Start simple, then add complexity")
print("  • Always validate on a test set")
print("  • Document your feature creation logic")
print("  • Use cross-validation for robust evaluation")
print("  • Compare against a baseline")
print("=" * 60)

## 11. Key Takeaways & Next Steps

Congratulations! You've mastered feature engineering - one of the most impactful skills in data science!

### What You've Learned

#### 1. **Numerical Feature Engineering**
- ✓ Scaling methods (Standard, MinMax, Robust)
- ✓ Transformations (log, sqrt, Box-Cox) for skewed data
- ✓ Binning continuous features
- ✓ Creating derived features (ratios, differences)

#### 2. **Categorical Encoding**
- ✓ Label Encoding for ordinal data
- ✓ One-Hot Encoding for nominal data
- ✓ Frequency Encoding for high cardinality
- ✓ Target Encoding with cross-validation
- ✓ When to use each method

#### 3. **Datetime Features**
- ✓ Extracting temporal components (year, month, day, etc.)
- ✓ Creating boolean flags (weekend, month_start, etc.)
- ✓ Time-based calculations (tenure, days_since)
- ✓ Cyclical encoding with sin/cos

#### 4. **Advanced Techniques**
- ✓ Polynomial features for non-linear relationships
- ✓ Interaction features to capture combined effects
- ✓ Impact on model performance

#### 5. **Feature Selection**
- ✓ Filter methods (SelectKBest) - fast, statistical
- ✓ Wrapper methods (RFE) - slow, thorough
- ✓ Embedded methods (RF importance, Lasso) - balanced
- ✓ Comparing methods empirically

#### 6. **Feature Importance**
- ✓ Tree-based feature_importances_
- ✓ Coefficient analysis for linear models
- ✓ Cumulative importance analysis
- ✓ Visualization techniques

#### 7. **Production Pipelines**
- ✓ sklearn Pipeline for reproducibility
- ✓ fit/transform pattern to prevent leakage
- ✓ End-to-end automation
- ✓ Model serialization

### Key Insights

> **"Feature engineering is often more important than the choice of algorithm."**

- Good features can make a simple model outperform a complex one
- Domain knowledge is your superpower
- Always validate on held-out test data
- Document your feature engineering decisions
- Version your features for reproducibility

### Common Pitfalls to Avoid

1. **Data Leakage** - Never use test data information in training
2. **Overfitting** - Too many features can hurt generalization
3. **Forgetting to Scale** - Regularized linear models and distance-based models are sensitive to feature scale
4. **Ignoring Domain Knowledge** - Best features come from understanding the problem
5. **Not Documenting** - Future you needs to know what you did
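
Pitfall 1 is easy to reproduce. The sketch below uses pure noise with balanced labels that carry no signal, so an honest estimate should sit near chance (0.5). The data and all names ending in `_leak` are illustrative assumptions. Selecting features on all rows before cross-validation lets the selector see the validation labels, while wrapping selection in a `Pipeline` refits it inside each fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng_leak = np.random.default_rng(0)
X_leak = rng_leak.normal(size=(100, 2000))  # 2000 pure-noise features
y_leak = np.array([0, 1] * 50)              # balanced labels unrelated to X_leak

# Leaky: the selector sees every row, including future validation folds
X_selected_leak = SelectKBest(f_classif, k=20).fit_transform(X_leak, y_leak)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected_leak, y_leak, cv=5
).mean()

# Honest: selection is part of the pipeline, so it is refit on each training fold
honest_pipeline_leak = Pipeline(
    [
        ("select", SelectKBest(f_classif, k=20)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)
honest_score = cross_val_score(honest_pipeline_leak, X_leak, y_leak, cv=5).mean()

print(f"Selection outside CV: {leaky_score:.3f}")
print(f"Selection inside CV:  {honest_score:.3f}")
```

The leaky estimate typically lands well above chance even though there is nothing to learn, which is why any step that learns from the data belongs inside the cross-validated pipeline.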

### When to Use What

| Scenario | Recommended Approach |
|----------|---------------------|
| Linear models | One-hot encoding + scaling + polynomial features |
| Tree models | Label/target encoding + feature interactions |
| High cardinality | Target/frequency encoding or embeddings |
| Skewed distribution | Log/Box-Cox transformation |
| Cyclical features | Sin/cos encoding |
| Too many features | SelectKBest or RFE |
| Need interpretability | L1 regularization or tree importance |
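
Two rows of the table can be checked numerically. This short sketch uses made-up values (the hour list and the log-normal sample are assumptions). Sin/cos encoding places hour 23 next to hour 0, and `np.log1p` reduces the skew of a long-tailed sample.

```python
import numpy as np
import pandas as pd

# Cyclical encoding: map hours onto the unit circle
hours_cyc = np.array([0, 12, 23])
angles_cyc = 2 * np.pi * hours_cyc / 24
points_cyc = np.column_stack([np.sin(angles_cyc), np.cos(angles_cyc)])

dist_0_12 = np.linalg.norm(points_cyc[0] - points_cyc[1])  # opposite sides of the clock
dist_0_23 = np.linalg.norm(points_cyc[0] - points_cyc[2])  # one hour apart
print(f"distance(0h, 12h) = {dist_0_12:.3f}, distance(0h, 23h) = {dist_0_23:.3f}")

# Skewed distribution: compare skewness before and after log1p
rng_skew = np.random.default_rng(42)
income_skew = pd.Series(rng_skew.lognormal(mean=10.5, sigma=1.0, size=1000))
raw_skew = income_skew.skew()
log_skew = np.log1p(income_skew).skew()
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

With the raw hour as a single number, 23 and 0 would be 23 units apart. On the circle they are neighbours, which is the property that distance-based and linear models need.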

### Real-World Applications

Feature engineering is crucial in:
- **Finance**: Credit scoring, fraud detection
- **Marketing**: Customer segmentation, churn prediction
- **Healthcare**: Disease prediction, patient risk scores
- **E-commerce**: Recommendation systems, demand forecasting
- **Manufacturing**: Predictive maintenance, quality control

### Next Steps

#### Continue Your Learning
1. **Module 13: Model Selection & Hyperparameter Tuning**
   - Grid Search, Random Search, Bayesian Optimization
   - Cross-validation strategies
   - Model comparison frameworks

2. **Module 14: Ensemble Methods**
   - XGBoost, LightGBM, CatBoost
   - Stacking and blending
   - Kaggle competition techniques

3. **Practice Projects**
   - Kaggle competitions (Titanic, House Prices)
   - Real-world datasets from your domain
   - Build and deploy your own models

#### Resources for Deep Dive
- **Books**:
  - "Feature Engineering for Machine Learning" by Alice Zheng
  - "Feature Engineering Handbook" (online)
- **Libraries**:
  - Featuretools (automated feature engineering)
  - category_encoders (advanced encoding methods)
  - SHAP (model interpretation)
- **Competitions**:
  - Kaggle - Learn from winning solutions
  - DrivenData - Social good competitions

### Recommended Practice

Spend **2-3 hours** on these:
1. Complete all 4 exercises above
2. Apply feature engineering to your own dataset
3. Create a reusable feature engineering template
4. Document your feature engineering pipeline

### Final Wisdom

> "Data scientists spend 80% of their time on data preparation and feature engineering. Master this, and you've mastered the job."

Feature engineering is both **art and science**:
- **Science**: Statistical methods, algorithms, validation
- **Art**: Creativity, domain knowledge, intuition

Keep experimenting, keep learning, and most importantly - **have fun with data**!

---

### Module Complete! 🎉

**Total time invested**: ~75 minutes
**Skills gained**: Production-ready feature engineering
**Confidence level**: Intermediate → Advanced

**Next Module**: `13_model_selection.ipynb` - Take your models to the next level!

---

*Built with Claude Code | Module 12: Feature Engineering Mastery*