# Module 00: Introduction to Feature Engineering

**Difficulty**: ⭐ Beginner  
**Estimated Time**: 45 minutes  
**Prerequisites**: Basic ML understanding (Module 05.1), Pandas basics

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain what feature engineering is and why it's critical for ML success
2. Understand the feature engineering workflow and when to apply it
3. Identify different types of features and transformations
4. Recognize common pitfalls (data leakage, overfitting)
5. Demonstrate measurable impact of feature engineering on model performance

## 1. What is Feature Engineering?

**Feature Engineering** is the process of transforming raw data into features that better represent the underlying patterns to predictive models.

### Why It Matters

According to Andrew Ng:
> "Applied machine learning is basically feature engineering."

**Real-world impact**:
- Can improve model accuracy by roughly 5-20% or more, depending on the dataset and model
- Often more effective than switching algorithms
- A recurring ingredient in winning Kaggle solutions
- Can make simpler models perform as well as complex ones

### Example: Predicting House Prices

**Raw features**:
- `length = 50 feet`
- `width = 30 feet`
- `price = $300,000`

**Engineered feature**:
- `area = length × width = 1,500 sq ft`

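The engineered `area` feature is much more predictive of price than length and width separately: price scales with the product of the two, and a linear model cannot build that product from the individual columns on its own!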

## 2. Setup

Let's import libraries and create a simple dataset to demonstrate feature engineering impact.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# ML libraries
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

print("✓ Setup complete!")

## 3. Demonstrating Feature Engineering Impact

Let's create a synthetic dataset to show how a single engineered feature can measurably improve model performance.

In [None]:
# Create synthetic house data
n_samples = 500

# Generate raw features
house_data = pd.DataFrame({
    'length_feet': np.random.uniform(20, 80, n_samples),
    'width_feet': np.random.uniform(15, 60, n_samples),
    'bedrooms': np.random.randint(1, 6, n_samples),
    'age_years': np.random.randint(0, 50, n_samples),
})

# True price depends on area (length × width) and other factors
# This simulates real-world where the underlying pattern involves interactions
house_data['area_sqft'] = house_data['length_feet'] * house_data['width_feet']
house_data['price'] = (
    200 * house_data['area_sqft'] +  # $200 per sq ft
    10000 * house_data['bedrooms'] +  # $10k per bedroom
    -500 * house_data['age_years'] +   # Depreciation
    np.random.normal(0, 50000, n_samples)  # Random noise
)

print(f"Created dataset with {len(house_data)} houses")
print(f"\nFirst few rows:")
house_data.head()

### Model Performance WITHOUT Feature Engineering

In [None]:
# Use only raw features (length, width, bedrooms, age)
# We're intentionally NOT using the area we calculated
raw_features = ['length_feet', 'width_feet', 'bedrooms', 'age_years']
X_raw = house_data[raw_features]
y = house_data['price']

# Split data
X_raw_train, X_raw_test, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=42
)

# Train model
model_raw = LinearRegression()
model_raw.fit(X_raw_train, y_train)

# Evaluate
y_pred_raw = model_raw.predict(X_raw_test)
rmse_raw = np.sqrt(mean_squared_error(y_test, y_pred_raw))
r2_raw = r2_score(y_test, y_pred_raw)

print("Model Performance WITHOUT Feature Engineering:")
print(f"RMSE: ${rmse_raw:,.0f}")
print(f"R² Score: {r2_raw:.3f}")

### Model Performance WITH Feature Engineering

In [None]:
# Now include our engineered feature: area
# This better represents what actually affects house price
engineered_features = ['area_sqft', 'bedrooms', 'age_years']
X_engineered = house_data[engineered_features]

# Split data (same random state for fair comparison)
X_eng_train, X_eng_test, y_train, y_test = train_test_split(
    X_engineered, y, test_size=0.2, random_state=42
)

# Train model
model_engineered = LinearRegression()
model_engineered.fit(X_eng_train, y_train)

# Evaluate
y_pred_eng = model_engineered.predict(X_eng_test)
rmse_eng = np.sqrt(mean_squared_error(y_test, y_pred_eng))
r2_eng = r2_score(y_test, y_pred_eng)

print("Model Performance WITH Feature Engineering:")
print(f"RMSE: ${rmse_eng:,.0f}")
print(f"R² Score: {r2_eng:.3f}")

### Compare Results

In [None]:
# Calculate improvement
rmse_improvement = (rmse_raw - rmse_eng) / rmse_raw * 100
# Relative R² change is only meaningful when r2_raw is positive,
# so we also report the absolute difference
r2_improvement = (r2_eng - r2_raw) / r2_raw * 100
r2_gain = r2_eng - r2_raw

print("="*60)
print("FEATURE ENGINEERING IMPACT")
print("="*60)
print(f"\nRMSE Reduction: {rmse_improvement:.1f}%")
print(f"R² Improvement: {r2_improvement:.1f}% (absolute gain: {r2_gain:.3f})")
print(f"\nBy creating just ONE engineered feature (area),")
print(f"we reduced test RMSE by {rmse_improvement:.0f}%!")
print("="*60)

In [None]:
# Visualize the difference
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Without feature engineering
axes[0].scatter(y_test, y_pred_raw, alpha=0.5)
axes[0].plot([y_test.min(), y_test.max()], 
             [y_test.min(), y_test.max()], 
             'r--', lw=2, label='Perfect Prediction')
axes[0].set_xlabel('Actual Price')
axes[0].set_ylabel('Predicted Price')
axes[0].set_title(f'Without Feature Engineering\nR² = {r2_raw:.3f}')
axes[0].legend()

# Plot 2: With feature engineering
axes[1].scatter(y_test, y_pred_eng, alpha=0.5, color='green')
axes[1].plot([y_test.min(), y_test.max()], 
             [y_test.min(), y_test.max()], 
             'r--', lw=2, label='Perfect Prediction')
axes[1].set_xlabel('Actual Price')
axes[1].set_ylabel('Predicted Price')
axes[1].set_title(f'With Feature Engineering\nR² = {r2_eng:.3f}')
axes[1].legend()

plt.tight_layout()
plt.show()

print("Notice how predictions are much closer to the red line (perfect predictions)")
print("when we use the engineered 'area' feature!")

## 4. Types of Feature Engineering

Feature engineering encompasses many techniques:

### 4.1 Handling Missing Data
- **Imputation**: Fill missing values with the mean, median, or a model-based estimate (e.g., k-nearest neighbors)
- **Indicator features**: Create `is_missing` flags
- **Deletion**: Remove rows or columns (use cautiously)
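
A minimal sketch of the first two techniques, assuming a small hypothetical DataFrame (the values are made up for illustration) and scikit-learn's `SimpleImputer`. In a real project the imputer would be fit on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
df = pd.DataFrame({
    "age_years": [10.0, np.nan, 30.0, 40.0],
    "bedrooms": [2.0, 3.0, np.nan, 3.0],
})

# Indicator features: record where values were missing BEFORE filling them
indicators = df.isna().astype(int).add_suffix("_is_missing")

# Imputation: replace each missing value with its column median
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Combine the filled values with the missingness flags
result = pd.concat([imputed, indicators], axis=1)
print(result)
```

Keeping the indicator columns lets a model learn from the fact that a value was missing, which plain imputation would hide.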

### 4.2 Encoding Categorical Variables
- **One-hot encoding**: Convert categories to binary columns
- **Ordinal encoding**: Map ordered categories to numbers
- **Target encoding**: Use target statistics per category
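
The three encodings can be sketched with plain pandas. The column names (`neighborhood`, `condition`) and values below are hypothetical, and the target-encoding step is deliberately simplified: in practice the per-category means should come from the training split only, otherwise the target leaks into the feature.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "neighborhood": ["north", "south", "north", "east"],
    "condition": ["poor", "good", "fair", "good"],
    "price": [200_000, 310_000, 240_000, 280_000],
})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["neighborhood"], prefix="neighborhood", dtype=int)

# Ordinal encoding: explicit mapping that preserves the category order
condition_order = {"poor": 0, "fair": 1, "good": 2}
df["condition_encoded"] = df["condition"].map(condition_order)

# Target encoding: replace each category with the mean target for that category
# (simplified: compute these means on training data only in a real workflow)
neighborhood_means = df.groupby("neighborhood")["price"].mean()
df["neighborhood_target_enc"] = df["neighborhood"].map(neighborhood_means)

print(one_hot)
print(df[["condition_encoded", "neighborhood_target_enc"]])
```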

### 4.3 Feature Scaling
- **Normalization**: Min-max scale to the [0, 1] range
- **Standardization**: Transform to mean=0, std=1
- **Robust scaling**: Center on the median and scale by the interquartile range, so outliers have less influence
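
To compare the three scalers, here is a small sketch using scikit-learn's `MinMaxScaler`, `StandardScaler`, and `RobustScaler` on a made-up feature that contains one large outlier. The scalers are fit on the whole array only because this is a toy illustration; with real data they should be fit on the training split.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one large outlier (100.0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_minmax = MinMaxScaler().fit_transform(X)      # rescales to the [0, 1] range
X_standard = StandardScaler().fit_transform(X)  # mean 0, standard deviation 1
X_robust = RobustScaler().fit_transform(X)      # subtracts median, divides by IQR

print("Min-max: ", X_minmax.ravel().round(3))
print("Standard:", X_standard.ravel().round(3))
print("Robust:  ", X_robust.ravel().round(3))
```

With min-max scaling the outlier squeezes the four ordinary values into a narrow band near 0, while robust scaling keeps them evenly spread around 0.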

### 4.4 Creating New Features
- **Interactions**: Multiply features (like our `area` example)
- **Polynomials**: Create squared, cubed terms
- **Binning**: Convert continuous to categorical
- **Domain-specific**: Use domain knowledge to derive features (e.g., square feet per bedroom for houses)

### 4.5 Feature Selection
- **Filter methods**: Correlation, chi-square
- **Wrapper methods**: Recursive feature elimination
- **Embedded methods**: L1 regularization, tree importance
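
One way to see the three families side by side is a sketch on synthetic data where only the first two of five columns carry signal. The data, the choice of `k=2`, and the Lasso `alpha=0.1` are all assumptions made for illustration, not recommended defaults.

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: only columns 0 and 1 drive the target, columns 2-4 are noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Filter method: rank features by a univariate F-test and keep the top 2
filter_mask = SelectKBest(score_func=f_regression, k=2).fit(X, y).get_support()

# Wrapper method: recursively drop the weakest feature of a fitted model
wrapper_mask = RFE(LinearRegression(), n_features_to_select=2).fit(X, y).get_support()

# Embedded method: L1 regularization shrinks unhelpful coefficients to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
embedded_mask = lasso.coef_ != 0

print("Filter:  ", filter_mask)
print("Wrapper: ", wrapper_mask)
print("Embedded:", embedded_mask)
```

On data this clean all three approaches agree; on real data they often disagree, which is one reason to validate any selection with cross-validation.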

## 5. The Feature Engineering Workflow

```
1. Understand the Problem
   ↓
2. Exploratory Data Analysis (EDA)
   ↓
3. Handle Missing Data
   ↓
4. Encode Categorical Variables
   ↓
5. Create New Features
   ↓
6. Scale/Transform Features
   ↓
7. Select Important Features
   ↓
8. Build Pipeline & Evaluate
   ↓
9. Iterate (repeat steps 3-8)
```

**Critical Rules**:
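- ✅ **Always** split data BEFORE fitting any transformation that learns from the data (scalers, imputers, encoders); row-wise calculations such as `area = length × width` use no information from other rows and are safe to compute earlier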
- ✅ Fit transformations on training data only
- ✅ Apply same transformations to test data
- ❌ **Never** use test data statistics in feature engineering (causes data leakage!)
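
These rules are easier to follow when the transformations live inside a scikit-learn pipeline. The sketch below assumes synthetic data and a simple scaler-plus-regression pipeline; it is one possible way to structure the workflow, not the only one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(300, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=5.0, size=300)

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Bundle the transformation with the model. During cross-validation the
#    scaler is refit on each training fold, so held-out folds never
#    influence the scaling statistics.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="r2")

# 3. Fit on the full training set, then apply the same fitted steps to the test set
pipeline.fit(X_train, y_train)
test_r2 = pipeline.score(X_test, y_test)

print(f"CV R² (mean of 5 folds): {cv_scores.mean():.3f}")
print(f"Test R²: {test_r2:.3f}")
```

Because `fit` and `score` are called on the pipeline as a whole, there is no opportunity to accidentally call `fit_transform` on test data.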

## 6. Common Pitfalls and How to Avoid Them

### 6.1 Data Leakage

**What**: Information from the test set leaks into the training process, so test scores look better than real-world performance will be

**Example of BAD practice**:

In [None]:
# ❌ BAD: Scaling before splitting
# This causes data leakage!
from sklearn.preprocessing import StandardScaler

# Don't do this:
# X_scaled = StandardScaler().fit_transform(X)  # Uses ALL data including test
# X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

print("❌ Scaling before splitting causes data leakage!")
print("The scaler sees test data, giving unrealistic performance.")

**Example of GOOD practice**:

In [None]:
# ✅ GOOD: Split first, then scale
from sklearn.preprocessing import StandardScaler

# Create sample data
X_sample = house_data[['length_feet', 'width_feet', 'bedrooms']]
y_sample = house_data['price']

# 1. Split first
X_train_sample, X_test_sample, y_train_sample, y_test_sample = train_test_split(
    X_sample, y_sample, test_size=0.2, random_state=42
)

# 2. Fit scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_sample)

# 3. Apply same transformation to test data
X_test_scaled = scaler.transform(X_test_sample)  # Note: transform, not fit_transform!

print("✅ CORRECT: Split data, then fit on train, transform both")
print(f"Training set shape: {X_train_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")

### 6.2 Target Leakage

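**What**: Features that carry information about the target but wouldn't be available at prediction time

**Example**: Predicting hospital readmission using the number of days spent in hospital during the readmission itself
- That value only exists AFTER the readmission has happened, so the model cannot have it when it makes the prediction!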
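
A typical symptom of target leakage is a score that looks too good to be true. The sketch below uses entirely synthetic data and assumes the leaky feature is the length of the readmission stay itself (the names `prior_visits` and `readmission_days` are hypothetical); it compares cross-validated accuracy with and without that feature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000

# Legitimate feature: known at prediction time, moderately related to the outcome
prior_visits = rng.poisson(2, size=n)
probability = 1 / (1 + np.exp(-(0.5 * prior_visits - 1.5)))
readmitted = rng.binomial(1, probability)

# Leaky feature: only non-zero for patients who WERE readmitted,
# so it encodes the answer and cannot exist at prediction time
readmission_days = readmitted * rng.integers(1, 10, size=n)

X_legit = prior_visits.reshape(-1, 1)
X_leaky = np.column_stack([prior_visits, readmission_days])

model = LogisticRegression(max_iter=1000)
acc_legit = cross_val_score(model, X_legit, readmitted, cv=5).mean()
acc_leaky = cross_val_score(model, X_leaky, readmitted, cv=5).mean()

print(f"Accuracy with legitimate feature only: {acc_legit:.3f}")
print(f"Accuracy with leaky feature added:     {acc_leaky:.3f}")
```

Cross-validation does not catch this kind of leakage, because the leaky column is present in every fold. The check has to come from asking when each feature becomes known.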

### 6.3 Overfitting

**What**: Creating so many features that the model fits noise in the training data instead of patterns that generalize

**Prevention**:
- Use cross-validation
- Apply regularization
- Feature selection
- Keep features interpretable
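
The effect of too many features can be sketched with synthetic data: one real signal column plus 40 columns of pure noise, and only 60 rows. The sizes, the noise level, and the Ridge `alpha=10.0` are arbitrary assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples = 60

# One informative feature plus 40 features of pure noise
signal = rng.normal(size=(n_samples, 1))
y = 2.0 * signal[:, 0] + rng.normal(scale=1.0, size=n_samples)
noise_features = rng.normal(size=(n_samples, 40))
X_wide = np.hstack([signal, noise_features])

# Training score looks excellent because the model can fit the noise
train_r2 = LinearRegression().fit(X_wide, y).score(X_wide, y)

# Cross-validation reveals how poorly that fit generalizes
cv_r2_wide = cross_val_score(LinearRegression(), X_wide, y, cv=5, scoring="r2").mean()

# Two of the remedies listed above: regularization and feature selection
cv_r2_ridge = cross_val_score(Ridge(alpha=10.0), X_wide, y, cv=5, scoring="r2").mean()
cv_r2_signal = cross_val_score(LinearRegression(), signal, y, cv=5, scoring="r2").mean()

print(f"Train R² (41 features):       {train_r2:.3f}")
print(f"CV R² (41 features):          {cv_r2_wide:.3f}")
print(f"CV R² (41 features, Ridge):   {cv_r2_ridge:.3f}")
print(f"CV R² (signal feature only):  {cv_r2_signal:.3f}")
```

The gap between the training score and the cross-validated score is the warning sign to look for after adding a batch of new features.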

## 7. Exercise Section

Let's practice what we learned!

### Exercise 1: Create a Combined Feature

Create a dataset of rectangles and predict their perimeter. Engineer a new feature that would help the model.

In [None]:
# Exercise 1: Your code here

# Create rectangle dataset
rectangles = pd.DataFrame({
    'length': np.random.uniform(1, 20, 200),
    'width': np.random.uniform(1, 20, 200)
})

# True perimeter formula: 2 * (length + width)
rectangles['perimeter'] = 2 * (rectangles['length'] + rectangles['width'])

# TODO: Create an engineered feature that helps predict perimeter
# Hint: What mathematical relationship exists between length, width, and perimeter?

# Your code here:
# rectangles['engineered_feature'] = ???

# Test your feature
# (We'll provide solution after you try!)

In [None]:
# Solution to Exercise 1
# (Try the exercise above first!)

# The key insight: perimeter = 2*(length + width)
# So a good engineered feature is the sum of length and width
rectangles['length_plus_width'] = rectangles['length'] + rectangles['width']

# Compare models (names carry a _rect suffix so they don't overwrite
# y, X_engineered, and r2_eng from the house-price demo)
from sklearn.linear_model import LinearRegression

# Without engineered feature
X_rect_basic = rectangles[['length', 'width']]
y_rect = rectangles['perimeter']
model1 = LinearRegression().fit(X_rect_basic, y_rect)
r2_basic = model1.score(X_rect_basic, y_rect)

# With engineered feature
X_rect_eng = rectangles[['length_plus_width']]
model2 = LinearRegression().fit(X_rect_eng, y_rect)
r2_rect_eng = model2.score(X_rect_eng, y_rect)

print(f"R² without feature engineering: {r2_basic:.4f}")
print(f"R² with feature engineering: {r2_rect_eng:.4f}")
print("\nBoth should be perfect (1.0): perimeter is already linear in length and width,")
print("so a linear model needs no engineered feature here. Compare this with area,")
print("a product, which a linear model cannot learn from raw length and width.")

### Exercise 2: Identify Data Leakage

Which of the following would cause data leakage when predicting customer churn?

In [None]:
# Exercise 2: Identify which features cause data leakage

print("Scenario: Predicting if a customer will cancel their subscription (churn)")
print("\nWhich features would cause data leakage?\n")

features = {
    'A': 'Number of customer service calls in last month',
    'B': 'Cancellation date',
    'C': 'Monthly subscription cost',
    'D': 'Number of logins in last 7 days',
    'E': 'Reason for cancellation',
    'F': 'Customer age'
}

for key, feature in features.items():
    print(f"{key}. {feature}")

print("\nYour answer: (list letters that cause leakage)")
# Write your answer here as a comment
# answer = []

In [None]:
# Solution to Exercise 2

print("Features that cause data leakage:")
print("\nB. Cancellation date - Only known AFTER churn happens!")
print("E. Reason for cancellation - Only known AFTER churn happens!")
print("\nThese are examples of TARGET LEAKAGE - information only available")
print("after the event you're trying to predict.")
print("\nFeatures A, C, D, F are valid as long as they are measured BEFORE the prediction date.")

### Exercise 3: Feature Engineering Ideas

For each scenario, suggest 2-3 engineered features that might improve model performance.

In [None]:
# Exercise 3: Brainstorm engineered features

scenarios = {
    'Scenario 1': {
        'task': 'Predict credit card fraud',
        'raw_features': ['transaction_amount', 'merchant_name', 'transaction_time', 
                         'card_holder_location', 'merchant_location'],
        'your_ideas': [
            # Write your ideas here
            # Example: 'distance_from_home (using card_holder_location and merchant_location)'
        ]
    },
    'Scenario 2': {
        'task': 'Predict movie ratings',
        'raw_features': ['user_id', 'movie_id', 'release_year', 'genres', 'runtime_minutes'],
        'your_ideas': [
            # Write your ideas here
        ]
    }
}

print("Think about:")
print("- Interactions between features")
print("- Domain knowledge")
print("- Time-based features")
print("- Statistical aggregations\n")

for name, scenario in scenarios.items():
    print(f"{name}: {scenario['task']}")
    print(f"Raw features: {scenario['raw_features']}")
    print("Your engineered features:")
    print("(Write your ideas in the code above)\n")

In [None]:
# Solution to Exercise 3

print("Suggested Engineered Features:\n")

print("Scenario 1: Predict credit card fraud")
print("1. distance_between_locations - Geographic distance between card holder and merchant")
print("2. transaction_hour - Extract hour from transaction_time (fraud rates may vary by time of day)")
print("3. amount_vs_user_avg - How much larger than user's average transaction?")
print("4. is_foreign_country - Binary: merchant country != card holder country")
print("5. time_since_last_transaction - Rapid successive transactions = suspicious\n")

print("Scenario 2: Predict movie ratings")
print("1. movie_age - Current year minus release_year")
print("2. genre_count - Number of genres (multi-genre might appeal to more people)")
print("3. is_long_movie - Binary: runtime > 150 minutes")
print("4. user_avg_rating - User's average rating, computed from training data only (some users rate higher)")
print("5. genre_match_user_preference - Does genre match user's most-watched genres?")

## 8. Summary

### Key Takeaways

1. **Feature engineering often matters more than algorithm choice**
   - Can improve performance by roughly 5-20% or more, depending on the data and model
   - Can make simple models competitive with complex ones

2. **Core techniques** we'll cover in the upcoming modules:
   - Handling missing data (Module 01)
   - Encoding categorical variables (Module 02)
   - Feature scaling (Module 03)
   - Creating interactions and polynomials (Module 04)
   - Binning and discretization (Module 05)
   - Date/time features (Module 06)
   - Text features (Module 07)
   - Feature selection (Module 08)
   - Feature importance (Module 09)
   - Automated feature engineering (Module 10)

3. **Critical rules to prevent data leakage**:
   - Always split data BEFORE feature engineering
   - Fit transformations on training data only
   - Never use information unavailable at prediction time

4. **Feature engineering is iterative**:
   - Try ideas, measure impact, keep what works
   - Use domain knowledge
   - Learn from competitions and real projects

### What's Next?

**Module 01**: Handling Missing Data - Learn imputation strategies and when to use each

### Additional Resources

- [Kaggle Feature Engineering Course](https://www.kaggle.com/learn/feature-engineering)
- "Feature Engineering for Machine Learning" by Alice Zheng
- [Feature-engine library documentation](https://feature-engine.readthedocs.io/)

---

**Congratulations!** You've completed Module 00. You now understand:
- What feature engineering is and why it's critical
- The dramatic impact it can have on model performance
- Different types of feature engineering techniques
- How to avoid common pitfalls like data leakage

Ready to dive deeper? Let's move to **Module 01: Handling Missing Data**!