# Notebook 03: Feature Engineering

Feature engineering and selection applied to the preprocessed datasets from notebook 02, in preparation for model development.


## 1. Load Preprocessed Data and Feature Discovery Analysis

Load the preprocessed data and establish a baseline feature analysis.
Use correlation analysis against the target to identify promising feature combinations and selection strategies.

### 1.1 Dataset Import and Validation

Load the fully preprocessed datasets from notebook 02 and validate feature consistency.
Build the combined dataset used for feature engineering while tracking which rows come from train and which from test.

In [1]:
# Load required libraries for advanced feature engineering
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.feature_selection import VarianceThreshold
import warnings
warnings.filterwarnings('ignore')

# Load preprocessed datasets (updated file names from notebook 02)
df_train_clean = pd.read_csv('../data/processed/processed_train.csv')
df_test_clean = pd.read_csv('../data/processed/processed_test.csv')

print("Preprocessed dataset shapes:")
print(f"Train: {df_train_clean.shape}")
print(f"Test: {df_test_clean.shape}")

# Verify data quality from preprocessing
print(f"\nData quality verification:")
print(f"Train missing values: {df_train_clean.isnull().sum().sum()}")
print(f"Test missing values: {df_test_clean.isnull().sum().sum()}")

# Use pre-combined dataset; the source labels below assume train rows precede test rows
df_combined = pd.read_csv('../data/processed/processed_combined.csv')
df_combined['dataset_source'] = ['train']*len(df_train_clean) + ['test']*len(df_test_clean)
print(f"Combined dataset shape: {df_combined.shape}")

Preprocessed dataset shapes:
Train: (1458, 232)
Test: (1459, 230)

Data quality verification:
Train missing values: 0
Test missing values: 0
Combined dataset shape: (2917, 231)


Validation confirms 230+ features and zero missing values after preprocessing. Train has two more columns than test (232 vs 230), which is consistent with train carrying the target columns.
The combined dataset allows the same feature engineering steps to be applied to train and test rows.
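
The source labels in the cell above depend on the combined file holding all train rows followed by all test rows. A minimal sketch of a guard for that assumption is shown below; `label_sources` is a hypothetical helper, not part of the notebook, and the tiny frame stands in for `processed_combined.csv`.

```python
import pandas as pd

def label_sources(combined: pd.DataFrame, n_train: int, n_test: int) -> pd.DataFrame:
    """Attach 'dataset_source', assuming train rows precede test rows."""
    if len(combined) != n_train + n_test:
        raise ValueError(
            f"Row count mismatch: combined={len(combined)}, "
            f"train+test={n_train + n_test}"
        )
    labeled = combined.copy()
    labeled["dataset_source"] = ["train"] * n_train + ["test"] * n_test
    return labeled

# Synthetic stand-in for the combined dataset (values are illustrative only)
demo = pd.DataFrame({"Id": [1, 2, 3, 4, 5],
                     "GrLivArea": [1710, 1262, 1786, 896, 1329]})
labeled = label_sources(demo, n_train=3, n_test=2)
print(labeled["dataset_source"].value_counts().to_dict())
```

The row-count check only catches length mismatches; it cannot detect a combined file whose rows were shuffled, so the ordering assumption still needs to hold upstream.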

### 1.2 Baseline Feature Analysis and Correlation Discovery

Run a correlation analysis against the target to identify feature engineering opportunities.
Rank features by correlation to establish a baseline for the engineering steps that follow.

In [6]:
# Separate numerical features (most categoricals already one-hot encoded)
numerical_features = df_combined.select_dtypes(include=[np.number]).columns.tolist()
numerical_features = [col for col in numerical_features if col not in ['Id', 'SalePrice', 'SalePrice_log', 'dataset_source']]

# Few remaining categorical features (should be minimal after notebook 02)
categorical_features = df_combined.select_dtypes(include=['object', 'category']).columns.tolist()
if 'dataset_source' in categorical_features:
    categorical_features.remove('dataset_source')

print(f"Numerical features available: {len(numerical_features)}")
print(f"Categorical features remaining: {len(categorical_features)}")
print(f"Total features from preprocessing: {len(numerical_features) + len(categorical_features)}")

# Baseline correlation analysis with target (train data only)
target_col = 'SalePrice_log' if 'SalePrice_log' in df_train_clean.columns else 'SalePrice'
if target_col in df_train_clean.columns:
    # Keep numerical features present in the train data
    # (Id and the target columns were already excluded from numerical_features above)
    train_numerical_features = [col for col in numerical_features if col in df_train_clean.columns]

    baseline_correlations = df_train_clean[train_numerical_features + [target_col]].corr()[target_col].sort_values(ascending=False)

    print(f"\nTop 10 features correlated with {target_col}:")
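    print(baseline_correlations.drop(target_col).head(10))  # Exclude target by name

    # Note: tail() of a descending sort returns the most negative correlations,
    # not the features with correlations closest to zero
    print("\nLeast correlated features:")
    print(baseline_correlations.tail(5))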



Numerical features available: 229
Categorical features remaining: 0
Total features from preprocessing: 229

Top 10 features correlated with SalePrice_log:
OverallQual      0.821405
GrLivArea_log    0.737431
GrLivArea        0.725211
ExterQual        0.682226
GarageCars       0.681033
KitchenQual      0.669990
GarageArea       0.656129
TotalBsmtSF      0.647563
1stFlrSF         0.620500
BsmtQual         0.616897
Name: SalePrice_log, dtype: float64

Least correlated features:
GarageType_None     -0.322994
Foundation_CBlock   -0.337909
MSZoning_RM         -0.347453
MasVnrType_None     -0.388094
GarageType_Detchd   -0.388681
Name: SalePrice_log, dtype: float64


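Correlation analysis identifies OverallQual (0.821) and GrLivArea_log (0.737) as the strongest predictors, which sets the engineering priorities.
The features listed under "Least correlated" are in fact the most negatively correlated ones (the tail of a descending sort), led by GarageType_Detchd (-0.389) and MasVnrType_None (-0.388). These moderately negative garage-type and foundation indicators (Foundation_CBlock, -0.338) are candidates for combination; the truly weakest predictors are those with correlations near zero.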
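
Sorting by signed correlation and sorting by absolute correlation answer different questions. The sketch below uses synthetic data (the column names `strong_pos`, `strong_neg`, and `noise` are made up for illustration) to show how the weakest predictor and the most negative predictor can be separated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.normal(size=n)

# Synthetic features: one positively related, one negatively related, one unrelated
demo = pd.DataFrame({
    "strong_pos": target + rng.normal(scale=0.3, size=n),
    "strong_neg": -target + rng.normal(scale=0.3, size=n),
    "noise": rng.normal(size=n),
    "target": target,
})

corr = demo.corr()["target"].drop("target")

# Weakest predictor: smallest absolute correlation
weakest = corr.abs().sort_values().head(1)
# Most negative predictor: smallest signed correlation
most_negative = corr.sort_values().head(1)

print("Weakest:", weakest.index[0])
print("Most negative:", most_negative.index[0])
```

Applied to the notebook's `baseline_correlations`, the same `abs()` ranking would surface features with little linear relationship to the target, which are the more natural candidates for removal.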

---

## 2. Correlation-Driven Feature Combination Discovery

Calculate correlations for the individual components first, then create combinations and measure the improvement over the best individual component.
Test combinations systematically and document the correlation improvement for each one.

### 2.1 Individual Component Baseline Analysis

Calculate correlation for all individual features to establish baseline performance for combination comparison.
Identify top performers in each category (area, quality, bathroom) for targeted combination testing.

In [14]:
# Calculate individual feature correlations by category for baseline comparison

# Area-related features for combination testing
area_features = ['GrLivArea', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GarageArea', 
                'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'ScreenPorch',
                'MasVnrArea', 'LotArea', 'PoolArea', 'LowQualFinSF']

# Quality-related features for combination testing
quality_features = ['OverallQual', 'OverallCond', 'ExterQual', 'ExterCond', 
                   'BsmtQual', 'BsmtCond', 'HeatingQC', 'KitchenQual',
                   'FireplaceQu', 'GarageQual', 'GarageCond', 'PoolQC']
                   
# Bathroom and room features for combination testing
bath_room_features = ['FullBath', 'HalfBath', 'BsmtFullBath', 'BsmtHalfBath',
                     'BedroomAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars']

# Calculate baseline correlations for each category
print("INDIVIDUAL COMPONENT BASELINE CORRELATIONS")
print("=" * 50)

# Area features baseline
print("\nAREA FEATURES:")
area_correlations = {}
for feature in area_features:
    if feature in df_train_clean.columns:
        corr = df_train_clean[feature].corr(df_train_clean[target_col])
        area_correlations[feature] = corr
        print(f"{feature}: {corr:.3f}")

# Quality features baseline  
print("\nQUALITY FEATURES:")
quality_correlations = {}
for feature in quality_features:
    if feature in df_train_clean.columns:
        corr = df_train_clean[feature].corr(df_train_clean[target_col])
        quality_correlations[feature] = corr
        print(f"{feature}: {corr:.3f}")
        
# Bathroom/room features baseline
print("\nBATHROOM & ROOM FEATURES:")
bath_room_correlations = {}
for feature in bath_room_features:
    if feature in df_train_clean.columns:
        corr = df_train_clean[feature].corr(df_train_clean[target_col])
        bath_room_correlations[feature] = corr
        print(f"{feature}: {corr:.3f}")

# Identify top performers in each category
print("\nTOP PERFORMERS BY CATEGORY:")
print(f"Best Area Feature: {max(area_correlations, key=area_correlations.get)} ({max(area_correlations.values()):.3f})")
print(f"Best Quality Feature: {max(quality_correlations, key=quality_correlations.get)} ({max(quality_correlations.values()):.3f})")
print(f"Best Bath/Room Feature: {max(bath_room_correlations, key=bath_room_correlations.get)} ({max(bath_room_correlations.values()):.3f})")

INDIVIDUAL COMPONENT BASELINE CORRELATIONS

AREA FEATURES:
GrLivArea: 0.725
TotalBsmtSF: 0.648
1stFlrSF: 0.621
2ndFlrSF: 0.320
GarageArea: 0.656
WoodDeckSF: 0.334
OpenPorchSF: 0.325
EnclosedPorch: -0.149
ScreenPorch: 0.121
MasVnrArea: 0.431
LotArea: 0.261
PoolArea: 0.074
LowQualFinSF: -0.038

QUALITY FEATURES:
OverallQual: 0.821
OverallCond: -0.037
ExterQual: 0.682
ExterCond: 0.049
BsmtQual: 0.617
BsmtCond: 0.275
HeatingQC: 0.474
KitchenQual: 0.670
FireplaceQu: 0.547
GarageQual: 0.363
GarageCond: 0.357
PoolQC: 0.085

BATHROOM & ROOM FEATURES:
FullBath: 0.596
HalfBath: 0.314
BsmtFullBath: 0.237
BsmtHalfBath: -0.005
BedroomAbvGr: 0.209
TotRmsAbvGrd: 0.538
Fireplaces: 0.492
GarageCars: 0.681

TOP PERFORMERS BY CATEGORY:
Best Area Feature: GrLivArea (0.725)
Best Quality Feature: OverallQual (0.821)
Best Bath/Room Feature: GarageCars (0.681)


Baseline analysis shows OverallQual (0.821) as the strongest single feature, with GarageCars (0.681), GarageArea (0.656), and TotalBsmtSF (0.648) clustering around 0.65.
Quality features range from -0.037 (OverallCond) to 0.821, with four above 0.6 (OverallQual, ExterQual, KitchenQual, BsmtQual), while area features range from -0.149 (EnclosedPorch) to 0.725 (GrLivArea). The category leaders set the baselines that combinations must beat.
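
The three per-category loops in the cell above follow the same pattern, so they could be folded into one helper. The sketch below is one way to do that; `category_correlations` is a hypothetical name and the small frame is illustrative data, not values from the dataset.

```python
import pandas as pd

def category_correlations(df, features, target_col):
    """Return {feature: Pearson r with target} for the features present in df."""
    return {f: df[f].corr(df[target_col]) for f in features if f in df.columns}

# Illustrative data only
demo = pd.DataFrame({
    "GrLivArea": [900, 1200, 1500, 1800, 2400],
    "OverallCond": [5, 7, 5, 6, 5],
    "SalePrice_log": [11.5, 11.8, 12.0, 12.2, 12.6],
})

# 'PoolArea' is absent from demo and is silently skipped
area = category_correlations(demo, ["GrLivArea", "PoolArea"], "SalePrice_log")
best = max(area, key=area.get)
print(f"Best area feature: {best} ({area[best]:.3f})")
```

Returning a plain dict keeps the result compatible with the `max(..., key=....get)` idiom the notebook already uses to pick category leaders.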

### 2.2 Area Feature Combinations vs Individual Components

Test area feature additions and ratios against individual component correlations to measure improvement.
Focus on combinations whose correlation exceeds the best individual area feature (GrLivArea) by more than 5% in relative terms.

In [16]:
# Test area feature combinations against individual baselines
# Target: exceed GrLivArea baseline (0.725) by >5% improvement

area_combinations = {}
baseline_threshold = 0.725 * 1.05  # 5% improvement over best individual (GrLivArea)

print("AREA FEATURE COMBINATIONS VS INDIVIDUAL COMPONENTS")
print("=" * 55)
print(f"Baseline to beat: GrLivArea (0.725)")
print(f"Target threshold: {baseline_threshold:.3f} (+5% improvement)")
print()

# Test addition combinations
print("ADDITION COMBINATIONS:")
area_pairs = [
    ('TotalBsmtSF', '1stFlrSF'),
    ('GrLivArea', 'TotalBsmtSF'),
    ('GarageArea', 'TotalBsmtSF'),
    ('1stFlrSF', '2ndFlrSF'),
    ('MasVnrArea', 'TotalBsmtSF'),
    ('WoodDeckSF', 'OpenPorchSF'),
    ('GrLivArea', 'GarageArea')
]

for feat1, feat2 in area_pairs:
    if feat1 in df_train_clean.columns and feat2 in df_train_clean.columns:
        combination = df_train_clean[feat1] + df_train_clean[feat2]
        corr = combination.corr(df_train_clean[target_col])
        area_combinations[f"{feat1}_add_{feat2}"] = corr
        
        # Compare to individual components
        individual_best = max(area_correlations.get(feat1, 0), area_correlations.get(feat2, 0))
        improvement = (corr - individual_best) / individual_best * 100
        
        print(f"{feat1}_add_{feat2}: {corr:.3f} (vs best individual {individual_best:.3f}, {improvement:+.1f}%)")

# Test ratio combinations
print("\nRATIO COMBINATIONS:")
ratio_pairs = [
    ('GrLivArea', 'LotArea'),
    ('TotalBsmtSF', 'GrLivArea'),
    ('GarageArea', 'GarageCars'),
    ('1stFlrSF', 'TotalBsmtSF'),
    ('MasVnrArea', 'GrLivArea')
]

for feat1, feat2 in ratio_pairs:
    if feat1 in df_train_clean.columns and feat2 in df_train_clean.columns:
        # Avoid division by zero
        mask = df_train_clean[feat2] > 0
        if mask.sum() > 100:  # Ensure sufficient data
            ratio = df_train_clean.loc[mask, feat1] / df_train_clean.loc[mask, feat2]
            corr = ratio.corr(df_train_clean.loc[mask, target_col])
            area_combinations[f"{feat1}_ratio_{feat2}"] = corr
            
            # Compare to individual components
            individual_best = max(area_correlations.get(feat1, 0), area_correlations.get(feat2, 0))
            improvement = (corr - individual_best) / individual_best * 100
            
            print(f"{feat1}_ratio_{feat2}: {corr:.3f} (vs best individual {individual_best:.3f}, {improvement:+.1f}%)")

# Identify successful combinations
print("\nSUCCESSFUL AREA COMBINATIONS (>5% improvement):")
successful_area = {k: v for k, v in area_combinations.items() if v > baseline_threshold}
for name, corr in sorted(successful_area.items(), key=lambda x: x[1], reverse=True):
    improvement = (corr - 0.725) / 0.725 * 100
    print(f"{name}: {corr:.3f} ({improvement:+.1f}% vs GrLivArea)")

AREA FEATURE COMBINATIONS VS INDIVIDUAL COMPONENTS
Baseline to beat: GrLivArea (0.725)
Target threshold: 0.761 (+5% improvement)

ADDITION COMBINATIONS:
TotalBsmtSF_add_1stFlrSF: 0.668 (vs best individual 0.648, +3.2%)
GrLivArea_add_TotalBsmtSF: 0.821 (vs best individual 0.725, +13.2%)
GarageArea_add_TotalBsmtSF: 0.744 (vs best individual 0.656, +13.3%)
1stFlrSF_add_2ndFlrSF: 0.735 (vs best individual 0.621, +18.5%)
MasVnrArea_add_TotalBsmtSF: 0.685 (vs best individual 0.648, +5.8%)
WoodDeckSF_add_OpenPorchSF: 0.437 (vs best individual 0.334, +30.8%)
GrLivArea_add_GarageArea: 0.801 (vs best individual 0.725, +10.5%)

RATIO COMBINATIONS:
GrLivArea_ratio_LotArea: 0.001 (vs best individual 0.725, -99.8%)
TotalBsmtSF_ratio_GrLivArea: 0.031 (vs best individual 0.725, -95.7%)
GarageArea_ratio_GarageCars: -0.032 (vs best individual 0.656, -104.8%)
1stFlrSF_ratio_TotalBsmtSF: -0.105 (vs best individual 0.648, -116.2%)
MasVnrArea_ratio_GrLivArea: 0.299 (vs best individual 0.725, -58.8%)

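SUCCESSFUL AREA COMBINATIONS (>5% improvement):
GrLivArea_add_TotalBsmtSF: 0.821 (+13.2% vs GrLivArea)
GrLivArea_add_GarageArea: 0.801 (+10.5% vs GrLivArea)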

Two addition combinations clear the 0.761 threshold: GrLivArea_add_TotalBsmtSF (0.821, +13.2% over GrLivArea) and GrLivArea_add_GarageArea (0.801, +10.5%).
Ratio combinations correlate only weakly with the target (-0.105 to 0.299), which supports summing related floor areas rather than dividing them.
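
The improvement percentages and the 5% threshold are simple relative changes. The sketch below reproduces the arithmetic for GrLivArea_add_TotalBsmtSF from the rounded values in the output above. `relative_improvement` is a hypothetical helper; unlike the notebook's inline formula, it divides by the absolute value of the baseline and returns NaN for a zero baseline, which is an assumption added here to keep the sign meaningful when a component correlation is negative.

```python
def relative_improvement(combo_corr: float, best_component_corr: float) -> float:
    """Percent change of combo_corr relative to the best component correlation."""
    if best_component_corr == 0:
        return float("nan")  # undefined when the baseline is zero
    return (combo_corr - best_component_corr) / abs(best_component_corr) * 100

# Values taken from the rounded cell output: GrLivArea baseline and the best sum
threshold = 0.725 * 1.05
gain = relative_improvement(0.821, 0.725)
print(f"threshold={threshold:.3f}, gain={gain:+.1f}%")
# prints: threshold=0.761, gain=+13.2%
```

Because the notebook computes percentages from unrounded correlations, values recomputed from the rounded output can differ in the last digit.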

### 2.3 Quality Feature Combinations vs Individual Components

Create quality interaction features and quality-area multiplications with individual correlation comparison.
Document improvement percentages over individual quality and area features for validation.

In [18]:
# Test quality feature combinations against individual baselines
# Target: exceed OverallQual baseline (0.821) by >5% improvement

quality_combinations = {}
quality_baseline_threshold = 0.821 * 1.05  # 5% improvement over best individual (OverallQual)

print("QUALITY FEATURE COMBINATIONS VS INDIVIDUAL COMPONENTS")
print("=" * 55)
print(f"Baseline to beat: OverallQual (0.821)")
print(f"Target threshold: {quality_baseline_threshold:.3f} (+5% improvement)")
print()

# Test quality multiplication with area features
print("QUALITY × AREA COMBINATIONS:")
quality_area_pairs = [
    ('OverallQual', 'GrLivArea'),
    ('OverallQual', 'TotalBsmtSF'),
    ('ExterQual', 'GrLivArea'),
    ('KitchenQual', 'GrLivArea'),
    ('ExterQual', 'TotalBsmtSF'),
    ('BsmtQual', 'TotalBsmtSF'),
    ('KitchenQual', 'TotalBsmtSF')
]

for qual_feat, area_feat in quality_area_pairs:
    if qual_feat in df_train_clean.columns and area_feat in df_train_clean.columns:
        combination = df_train_clean[qual_feat] * df_train_clean[area_feat]
        corr = combination.corr(df_train_clean[target_col])
        quality_combinations[f"{qual_feat}_multiply_{area_feat}"] = corr

        # Compare to individual components
        qual_corr = quality_correlations.get(qual_feat, 0)
        area_corr = area_correlations.get(area_feat, 0)
        individual_best = max(qual_corr, area_corr)
        improvement = (corr - individual_best) / individual_best * 100

        print(f"{qual_feat}_multiply_{area_feat}: {corr:.3f} (vs best individual {individual_best:.3f}, {improvement:+.1f}%)")

# Test quality interaction combinations
print("\nQUALITY × QUALITY COMBINATIONS:")
quality_pairs = [
    ('OverallQual', 'ExterQual'),
    ('OverallQual', 'KitchenQual'),
    ('ExterQual', 'KitchenQual'),
    ('OverallQual', 'BsmtQual'),
    ('ExterQual', 'BsmtQual'),
    ('KitchenQual', 'BsmtQual')
]

for qual1, qual2 in quality_pairs:
    if qual1 in df_train_clean.columns and qual2 in df_train_clean.columns:
        combination = df_train_clean[qual1] * df_train_clean[qual2]
        corr = combination.corr(df_train_clean[target_col])
        quality_combinations[f"{qual1}_multiply_{qual2}"] = corr

        # Compare to individual components
        individual_best = max(quality_correlations.get(qual1, 0), quality_correlations.get(qual2, 0))
        improvement = (corr - individual_best) / individual_best * 100

        print(f"{qual1}_multiply_{qual2}: {corr:.3f} (vs best individual {individual_best:.3f}, {improvement:+.1f}%)")

# Test quality addition combinations
print("\nQUALITY + QUALITY COMBINATIONS:")
for qual1, qual2 in quality_pairs:
    if qual1 in df_train_clean.columns and qual2 in df_train_clean.columns:
        combination = df_train_clean[qual1] + df_train_clean[qual2]
        corr = combination.corr(df_train_clean[target_col])
        quality_combinations[f"{qual1}_add_{qual2}"] = corr

        # Compare to individual components
        individual_best = max(quality_correlations.get(qual1, 0), quality_correlations.get(qual2, 0))
        improvement = (corr - individual_best) / individual_best * 100

        print(f"{qual1}_add_{qual2}: {corr:.3f} (vs best individual {individual_best:.3f}, {improvement:+.1f}%)")

# Identify successful quality combinations
print("\nSUCCESSFUL QUALITY COMBINATIONS (>5% improvement):")
successful_quality = {k: v for k, v in quality_combinations.items() if v > quality_baseline_threshold}
for name, corr in sorted(successful_quality.items(), key=lambda x: x[1], reverse=True):
    improvement = (corr - 0.821) / 0.821 * 100
    print(f"{name}: {corr:.3f} ({improvement:+.1f}% vs OverallQual)")

QUALITY FEATURE COMBINATIONS VS INDIVIDUAL COMPONENTS
Baseline to beat: OverallQual (0.821)
Target threshold: 0.862 (+5% improvement)

QUALITY × AREA COMBINATIONS:
OverallQual_multiply_GrLivArea: 0.838 (vs best individual 0.821, +2.1%)
OverallQual_multiply_TotalBsmtSF: 0.786 (vs best individual 0.821, -4.3%)
ExterQual_multiply_GrLivArea: 0.819 (vs best individual 0.725, +12.9%)
KitchenQual_multiply_GrLivArea: 0.819 (vs best individual 0.725, +12.9%)
ExterQual_multiply_TotalBsmtSF: 0.741 (vs best individual 0.682, +8.6%)
BsmtQual_multiply_TotalBsmtSF: 0.744 (vs best individual 0.648, +15.0%)
KitchenQual_multiply_TotalBsmtSF: 0.751 (vs best individual 0.670, +12.1%)

QUALITY × QUALITY COMBINATIONS:
OverallQual_multiply_ExterQual: 0.812 (vs best individual 0.821, -1.1%)
OverallQual_multiply_KitchenQual: 0.819 (vs best individual 0.821, -0.3%)
ExterQual_multiply_KitchenQual: 0.731 (vs best individual 0.682, +7.1%)
OverallQual_multiply_BsmtQual: 0.807 (vs best individual 0.821, -1.8%)
Exter

No quality combination reaches the 0.862 threshold (5% above OverallQual), although several improve on their own components by up to 15.0% (BsmtQual_multiply_TotalBsmtSF).
Because OverallQual (0.821) sets such a high baseline, the best combination, OverallQual_multiply_GrLivArea (0.838), improves on it by only 2.1%.
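
The gap between "beats its own components" and "beats the category leader" can be made explicit by applying the 5% rule both ways. The sketch below does this for the quality × area results, using the rounded values printed in the cell output above; the per-pair rule is an alternative criterion shown for comparison, not the rule the notebook applies.

```python
# (combination correlation, best individual component correlation),
# copied from the rounded values in the cell output above
quality_area_results = {
    "OverallQual_multiply_GrLivArea": (0.838, 0.821),
    "OverallQual_multiply_TotalBsmtSF": (0.786, 0.821),
    "ExterQual_multiply_GrLivArea": (0.819, 0.725),
    "KitchenQual_multiply_GrLivArea": (0.819, 0.725),
    "ExterQual_multiply_TotalBsmtSF": (0.741, 0.682),
    "BsmtQual_multiply_TotalBsmtSF": (0.744, 0.648),
    "KitchenQual_multiply_TotalBsmtSF": (0.751, 0.670),
}

# Rule used in the notebook: beat the category leader (OverallQual) by 5%
category_threshold = 0.821 * 1.05
passes_category = [name for name, (corr, _) in quality_area_results.items()
                   if corr > category_threshold]

# Alternative rule (assumption): beat the pair's own best component by 5%
passes_per_pair = [name for name, (corr, best) in quality_area_results.items()
                   if corr > best * 1.05]

print(f"Pass category rule: {len(passes_category)}")
print(f"Pass per-pair rule: {len(passes_per_pair)}")
```

Which rule is appropriate depends on the goal: the category rule looks for a new best predictor, while the per-pair rule asks whether a combination adds signal beyond the features it replaces.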

### 2.4 Bathroom and Room Engineering vs Individual Components

Test bathroom combinations and room efficiency ratios against individual bathroom/room feature correlations.
Calculate improvement over best individual components to validate engineering decisions.

In [19]:
# Test bathroom and room combinations against individual baselines
# Target: exceed GarageCars baseline (0.681) by >5% improvement

bath_room_combinations = {}
bath_baseline_threshold = 0.681 * 1.05  # 5% improvement over best individual (GarageCars)

print("BATHROOM & ROOM COMBINATIONS VS INDIVIDUAL COMPONENTS")
print("=" * 55)
print(f"Baseline to beat: GarageCars (0.681)")
print(f"Target threshold: {bath_baseline_threshold:.3f} (+5% improvement)")
print()

# Test bathroom calculation combinations
print("BATHROOM COMBINATIONS:")
# Standard real estate bathroom calculation: FullBath + 0.5*HalfBath
if 'FullBath' in df_train_clean.columns and 'HalfBath' in df_train_clean.columns:
    total_baths_standard = df_train_clean['FullBath'] + 0.5 * df_train_clean['HalfBath']
    corr = total_baths_standard.corr(df_train_clean[target_col])
    bath_room_combinations['TotalBaths_Standard'] = corr
    individual_best = max(bath_room_correlations.get('FullBath', 0), bath_room_correlations.get('HalfBath', 0))
    improvement = (corr - individual_best) / individual_best * 100
    print(f"TotalBaths_Standard: {corr:.3f} (vs best individual {individual_best:.3f}, {improvement:+.1f}%)")

# Include basement bathrooms
if all(col in df_train_clean.columns for col in ['FullBath', 'HalfBath', 'BsmtFullBath', 'BsmtHalfBath']):
    total_baths_all = (df_train_clean['FullBath'] + df_train_clean['BsmtFullBath'] + 
                      0.5 * (df_train_clean['HalfBath'] + df_train_clean['BsmtHalfBath']))
    corr = total_baths_all.corr(df_train_clean[target_col])
    bath_room_combinations['TotalBaths_All'] = corr
    individual_best = max(bath_room_correlations.get('FullBath', 0), bath_room_correlations.get('BsmtFullBath', 0))
    improvement = (corr - individual_best) / individual_best * 100
    print(f"TotalBaths_All: {corr:.3f} (vs best individual {individual_best:.3f}, {improvement:+.1f}%)")

# Test room efficiency ratios
print("\nROOM EFFICIENCY COMBINATIONS:")
room_efficiency_pairs = [
    ('GrLivArea', 'TotRmsAbvGrd'),
    ('GrLivArea', 'BedroomAbvGr'),
    ('TotalBsmtSF', 'BedroomAbvGr'),
    ('GarageCars', 'GarageArea')
]

for area_feat, room_feat in room_efficiency_pairs:
    if area_feat in df_train_clean.columns and room_feat in df_train_clean.columns:
        # Calculate area per room (avoid division by zero)
        mask = df_train_clean[room_feat] > 0
        if mask.sum() > 100:
            efficiency = df_train_clean.loc[mask, area_feat] / df_train_clean.loc[mask, room_feat]
            corr = efficiency.corr(df_train_clean.loc[mask, target_col])
            bath_room_combinations[f"{area_feat}_per_{room_feat}"] = corr
            
            # Compare to individual components
            area_corr = area_correlations.get(area_feat, 0) if area_feat in area_correlations else bath_room_correlations.get(area_feat, 0)
            room_corr = bath_room_correlations.get(room_feat, 0)
            individual_best = max(area_corr, room_corr)
            improvement = (corr - individual_best) / individual_best * 100
            
            print(f"{area_feat}_per_{room_feat}: {corr:.3f} (vs best individual {individual_best:.3f}, {improvement:+.1f}%)")

# Test room count combinations
print("\nROOM COUNT COMBINATIONS:")
room_count_features = ['BedroomAbvGr', 'TotRmsAbvGrd', 'Fireplaces']
if all(col in df_train_clean.columns for col in room_count_features):
    total_rooms = df_train_clean['BedroomAbvGr'] + df_train_clean['TotRmsAbvGrd'] + df_train_clean['Fireplaces']
    corr = total_rooms.corr(df_train_clean[target_col])
    bath_room_combinations['TotalRooms_All'] = corr
    individual_best = max([bath_room_correlations.get(feat, 0) for feat in room_count_features])
    improvement = (corr - individual_best) / individual_best * 100
    print(f"TotalRooms_All: {corr:.3f} (vs best individual {individual_best:.3f}, {improvement:+.1f}%)")

# Test garage efficiency
if 'GarageArea' in df_train_clean.columns and 'GarageCars' in df_train_clean.columns:
    mask = df_train_clean['GarageCars'] > 0
    if mask.sum() > 100:
        garage_efficiency = df_train_clean.loc[mask, 'GarageArea'] / df_train_clean.loc[mask, 'GarageCars']
        corr = garage_efficiency.corr(df_train_clean.loc[mask, target_col])
        bath_room_combinations['GarageArea_per_Car'] = corr
        individual_best = max(area_correlations.get('GarageArea', 0), bath_room_correlations.get('GarageCars', 0))
        improvement = (corr - individual_best) / individual_best * 100
        print(f"GarageArea_per_Car: {corr:.3f} (vs best individual {individual_best:.3f}, {improvement:+.1f}%)")

# Identify successful bathroom/room combinations
print("\nSUCCESSFUL BATH/ROOM COMBINATIONS (>5% improvement):")
successful_bath_room = {k: v for k, v in bath_room_combinations.items() if v > bath_baseline_threshold}
for name, corr in sorted(successful_bath_room.items(), key=lambda x: x[1], reverse=True):
    improvement = (corr - 0.681) / 0.681 * 100
    print(f"{name}: {corr:.3f} ({improvement:+.1f}% vs GarageCars)")

BATHROOM & ROOM COMBINATIONS VS INDIVIDUAL COMPONENTS
Baseline to beat: GarageCars (0.681)
Target threshold: 0.715 (+5% improvement)

BATHROOM COMBINATIONS:
TotalBaths_Standard: 0.641 (vs best individual 0.596, +7.6%)
TotalBaths_All: 0.677 (vs best individual 0.596, +13.6%)

ROOM EFFICIENCY COMBINATIONS:
GrLivArea_per_TotRmsAbvGrd: 0.569 (vs best individual 0.725, -21.5%)
GrLivArea_per_BedroomAbvGr: 0.511 (vs best individual 0.725, -29.6%)
TotalBsmtSF_per_BedroomAbvGr: 0.385 (vs best individual 0.648, -40.6%)
GarageCars_per_GarageArea: -0.017 (vs best individual 0.681, -102.5%)

ROOM COUNT COMBINATIONS:
TotalRooms_All: 0.542 (vs best individual 0.538, +0.7%)
GarageArea_per_Car: -0.032 (vs best individual 0.681, -104.7%)

SUCCESSFUL BATH/ROOM COMBINATIONS (>5% improvement):


Neither bathroom combination reaches the 0.715 threshold (5% above GarageCars), although TotalBaths_All (0.677) improves on FullBath by 13.6%.
Area-per-room ratios correlate positively (0.385 to 0.569) but fall well below their area components, and the two garage ratios are near zero (-0.017 and -0.032), so ratio-based features add little here.
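
The ratio tests above drop rows with a zero denominator before computing correlations, which is fine for evaluation but would leave gaps if a ratio were kept as a feature. The sketch below shows one way to build a ratio for every row; `safe_ratio` and its `fill` default of 0.0 are assumptions for illustration, and the three rows are made-up values.

```python
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series, fill: float = 0.0) -> pd.Series:
    """Divide two Series, returning `fill` wherever the denominator is zero."""
    # where() keeps non-zero denominators and turns zeros into NaN
    denom = denominator.where(denominator != 0)
    # Note: fillna also replaces NaN that came from missing inputs
    return (numerator / denom).fillna(fill)

# Illustrative rows: the middle property has no garage
demo = pd.DataFrame({"GarageArea": [548, 0, 608], "GarageCars": [2, 0, 2]})
demo["GarageArea_per_Car"] = safe_ratio(demo["GarageArea"], demo["GarageCars"])
print(demo["GarageArea_per_Car"].tolist())
# prints: [274.0, 0.0, 304.0]
```

Filling with 0 treats "no garage" as zero area per car; a separate indicator column may be preferable if the model should distinguish that case.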

### 2.5 Final Selection Based on Improvement Analysis

Select combinations that provide meaningful improvement over individual components.
Create final engineered feature set with documented correlation improvements and redundancy analysis.

In [21]:
# Compile all successful combinations from sections 2.2-2.4
print("FINAL FEATURE ENGINEERING SELECTION SUMMARY")
print("=" * 50)

# Collect all successful combinations (>5% improvement over individual components)
final_engineered_features = {}

# Area combinations that exceeded baseline
print("SELECTED AREA COMBINATIONS:")
if 'area_combinations' in locals():
    for name, corr in area_combinations.items():
        if corr > 0.761:  # 5% improvement over GrLivArea (0.725)
            final_engineered_features[name] = corr
            improvement = (corr - 0.725) / 0.725 * 100
            print(f"  {name}: {corr:.3f} ({improvement:+.1f}% vs GrLivArea)")

# Quality combinations with meaningful improvements (relaxed threshold due to high baseline)
print("\nSELECTED QUALITY COMBINATIONS:")
if 'quality_combinations' in locals():
    # Relaxed threshold: no combination reached the 5% target (0.862), so keep only the
    # best-performing quality combination (value hardcoded from the rounded output in 2.3)
    quality_threshold = 0.838
    for name, corr in quality_combinations.items():
        if corr >= quality_threshold:
            final_engineered_features[name] = corr
            improvement = (corr - 0.821) / 0.821 * 100
            print(f"  {name}: {corr:.3f} ({improvement:+.1f}% vs OverallQual)")

# Bathroom combinations with validated improvements
print("\nSELECTED BATHROOM COMBINATIONS:")
if 'bath_room_combinations' in locals():
    # Select bathroom combinations that show meaningful improvement over components
    bathroom_threshold = 0.65  # Meaningful threshold for bathroom features
    for name, corr in bath_room_combinations.items():
        if corr >= bathroom_threshold and 'Baths' in name:
            final_engineered_features[name] = corr
            improvement = (corr - 0.596) / 0.596 * 100  # vs FullBath baseline
            print(f"  {name}: {corr:.3f} ({improvement:+.1f}% vs FullBath)")

# Engineering strategy analysis
print(f"\nENGINEERING STRATEGY ANALYSIS:")
print(f"Total engineered features selected: {len(final_engineered_features)}")

# Categorize by strategy type
addition_features = [name for name in final_engineered_features.keys() if '_add_' in name]
multiply_features = [name for name in final_engineered_features.keys() if '_multiply_' in name]
standard_features = [name for name in final_engineered_features.keys() if 'Standard' in name or 'All' in name]

print(f"Addition strategy successes: {len(addition_features)}")
print(f"Multiplication strategy successes: {len(multiply_features)}")
print(f"Standard formula successes: {len(standard_features)}")

# Create final feature recommendations
print(f"\nFINAL FEATURE RECOMMENDATIONS:")
print(f"Top 5 engineered features by correlation:")
sorted_features = sorted(final_engineered_features.items(), key=lambda x: x[1], reverse=True)
for i, (name, corr) in enumerate(sorted_features[:5], 1):
    # Determine baseline comparison
    if 'add_' in name and 'GrLivArea' in name:
        baseline = 0.725
        baseline_name = "GrLivArea"
    elif 'multiply_' in name and 'OverallQual' in name:
        baseline = 0.821
        baseline_name = "OverallQual"
    elif 'Baths' in name:
        baseline = 0.596
        baseline_name = "FullBath"
    else:
        baseline = 0.681
        baseline_name = "GarageCars"
    
    improvement = (corr - baseline) / baseline * 100
    print(f"  {i}. {name}: {corr:.3f} ({improvement:+.1f}% vs {baseline_name})")


FINAL FEATURE ENGINEERING SELECTION SUMMARY
SELECTED AREA COMBINATIONS:
  GrLivArea_add_TotalBsmtSF: 0.821 (+13.2% vs GrLivArea)
  GrLivArea_add_GarageArea: 0.801 (+10.5% vs GrLivArea)

SELECTED QUALITY COMBINATIONS:
  OverallQual_multiply_GrLivArea: 0.838 (+2.1% vs OverallQual)

SELECTED BATHROOM COMBINATIONS:
  TotalBaths_All: 0.677 (+13.5% vs FullBath)

ENGINEERING STRATEGY ANALYSIS:
Total engineered features selected: 4
Addition strategy successes: 2
Multiplication strategy successes: 1
Standard formula successes: 1

FINAL FEATURE RECOMMENDATIONS:
Top 5 engineered features by correlation:
  1. OverallQual_multiply_GrLivArea: 0.838 (+2.1% vs OverallQual)
  2. GrLivArea_add_TotalBsmtSF: 0.821 (+13.2% vs GrLivArea)
  3. GrLivArea_add_GarageArea: 0.801 (+10.5% vs GrLivArea)
  4. TotalBaths_All: 0.677 (+13.5% vs FullBath)


The final selection contains 4 engineered features. OverallQual_multiply_GrLivArea (0.838) has the highest correlation, although it improves on OverallQual by only 2.1%.
Addition accounts for 2 of the selected features (both area sums), while multiplication and the bathroom formula contribute 1 each. The quality and bathroom features were admitted under relaxed thresholds (0.838 and 0.65) rather than the 5% rule.
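
The cells above evaluate each combination on a temporary Series without adding it to a DataFrame. A minimal sketch of materializing the four selected features as columns is shown below; `add_selected_features` is a hypothetical helper, the column names follow the keys printed in the selection summary, and the two rows are illustrative values.

```python
import pandas as pd

def add_selected_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the four selected combinations as new columns."""
    out = df.copy()
    out["GrLivArea_add_TotalBsmtSF"] = out["GrLivArea"] + out["TotalBsmtSF"]
    out["GrLivArea_add_GarageArea"] = out["GrLivArea"] + out["GarageArea"]
    out["OverallQual_multiply_GrLivArea"] = out["OverallQual"] * out["GrLivArea"]
    # Full baths count 1, half baths count 0.5, basement baths included
    out["TotalBaths_All"] = (
        out["FullBath"] + out["BsmtFullBath"]
        + 0.5 * (out["HalfBath"] + out["BsmtHalfBath"])
    )
    return out

# Illustrative rows only
demo = pd.DataFrame({
    "GrLivArea": [1710, 1262], "TotalBsmtSF": [856, 1262],
    "GarageArea": [548, 460], "OverallQual": [7, 6],
    "FullBath": [2, 2], "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0], "BsmtHalfBath": [0, 1],
})
engineered = add_selected_features(demo)
print(engineered[["GrLivArea_add_TotalBsmtSF", "TotalBaths_All"]])
```

Since two of the selected features share GrLivArea as a component, calling `.corr()` on the new columns of the full dataset would be a quick way to check how redundant they are with each other.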

---

## 3. Age-Based and Temporal Feature Engineering

Calculate property age as YrSold minus YearBuilt, so that each property is aged relative to its own sale year, and derive garage and remodel ages the same way.
Handle missing garage years and dates that fall after the sale year before computing the ages.
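
The age calculation can be sketched on a few made-up rows, including one property whose build and remodel years fall after its sale year. Clipping `RemodAge` at zero and the `BuiltAfterSale` indicator are assumptions added for illustration; only the clipped `PropertyAge` is taken directly from the approach described here.

```python
import pandas as pd

# Illustrative rows; the second one is "built" and "remodeled" after the sale
demo = pd.DataFrame({
    "YrSold": [2008, 2007, 2010],
    "YearBuilt": [2003, 2008, 1976],
    "YearRemodAdd": [2003, 2009, 1990],
})

# Age at time of sale, with negative ages clipped to 0
demo["PropertyAge"] = (demo["YrSold"] - demo["YearBuilt"]).clip(lower=0)
demo["RemodAge"] = (demo["YrSold"] - demo["YearRemodAdd"]).clip(lower=0)

# Hypothetical flag that preserves the information lost by clipping
demo["BuiltAfterSale"] = (demo["YearBuilt"] > demo["YrSold"]).astype(int)

print(demo[["PropertyAge", "RemodAge", "BuiltAfterSale"]])
```

Using the sale year rather than a fixed reference year means two houses built in the same year can have different ages, which matches what a buyer saw at the time of the transaction.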


In [24]:
# Age-based temporal feature engineering with sale year reference
print("AGE-BASED AND TEMPORAL FEATURE ENGINEERING")
print("=" * 45)

# Clean garage year data - fix known data entry error
if 'GarageYrBlt' in df_combined.columns:
    # Fix 2207 to 2007 (data entry error)
    df_combined['GarageYrBlt'] = df_combined['GarageYrBlt'].replace(2207, 2007)
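    # Fill missing garage years with 0 (assign back rather than using chained inplace fillna,
    # which may not update df_combined under copy-on-write)
    df_combined['GarageYrBlt'] = df_combined['GarageYrBlt'].fillna(0)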

# Investigate date anomalies before creating age features
print("\nDATE ANOMALY INVESTIGATION:")
print("=" * 30)

# Check for properties with future construction dates
if all(col in df_combined.columns for col in ['YrSold', 'YearBuilt']):
    future_built = df_combined[df_combined['YearBuilt'] > df_combined['YrSold']]
    print(f"Properties built after sale: {len(future_built)}")
    if len(future_built) > 0:
        print("Sample cases (YrSold, YearBuilt):")
        for idx, row in future_built[['YrSold', 'YearBuilt']].head(5).iterrows():
            print(f"  Sale: {row['YrSold']}, Built: {row['YearBuilt']}")

# Check for properties with future remodel dates
if all(col in df_combined.columns for col in ['YrSold', 'YearRemodAdd']):
    future_remod = df_combined[df_combined['YearRemodAdd'] > df_combined['YrSold']]
    print(f"Properties remodeled after sale: {len(future_remod)}")
    if len(future_remod) > 0:
        print("Sample cases (YrSold, YearRemodAdd):")
        for idx, row in future_remod[['YrSold', 'YearRemodAdd']].head(3).iterrows():
            print(f"  Sale: {row['YrSold']}, Remodel: {row['YearRemodAdd']}")

# Analyze missing garage year patterns
if 'GarageYrBlt' in df_combined.columns:
    missing_garage_year = df_combined[df_combined['GarageYrBlt'] == 0]
    print(f"Properties with missing garage year: {len(missing_garage_year)}")
    if len(missing_garage_year) > 0:
        # Count missing-year rows that still report garage area or car capacity
        has_garage_features = (missing_garage_year['GarageArea'] > 0) | (missing_garage_year['GarageCars'] > 0)
        print(f"Missing year properties with garage features: {has_garage_features.sum()}")

print("\n" + "=" * 45)

# Create age features using sale year as reference (property-specific aging)
age_features = {}

# Property age at time of sale (handle future construction)
if all(col in df_combined.columns for col in ['YrSold', 'YearBuilt']):
    df_combined['PropertyAge'] = df_combined['YrSold'] - df_combined['YearBuilt']
    # Clip negative ages to 0 (future construction treated as 0 age)
    df_combined['PropertyAge'] = df_combined['PropertyAge'].clip(lower=0)
    age_features['PropertyAge'] = df_combined['PropertyAge']
    print(f"Property Age created: {df_combined['PropertyAge'].min():.0f} to {df_combined['PropertyAge'].max():.0f} years")

# Garage age at time of sale (missing garage years fall back to property age)
if all(col in df_combined.columns for col in ['YrSold', 'GarageYrBlt']):
    df_combined['GarageAge'] = df_combined['YrSold'] - df_combined['GarageYrBlt']
    # For missing garage years (GarageYrBlt=0), set garage age = property age
    mask_no_garage_year = df_combined['GarageYrBlt'] == 0
    df_combined.loc[mask_no_garage_year, 'GarageAge'] = df_combined.loc[mask_no_garage_year, 'PropertyAge']
    # Clip any remaining negative ages
    df_combined['GarageAge'] = df_combined['GarageAge'].clip(lower=0)
    age_features['GarageAge'] = df_combined['GarageAge']
    print(f"Garage Age created: {df_combined['GarageAge'].min():.0f} to {df_combined['GarageAge'].max():.0f} years")

# Remodel age at time of sale (handle future remodels)
if all(col in df_combined.columns for col in ['YrSold', 'YearRemodAdd']):
    df_combined['RemodAge'] = df_combined['YrSold'] - df_combined['YearRemodAdd']
    # Clip negative ages to 0 (future remodels treated as 0 age)
    df_combined['RemodAge'] = df_combined['RemodAge'].clip(lower=0)
    age_features['RemodAge'] = df_combined['RemodAge']
    print(f"Remodel Age created: {df_combined['RemodAge'].min():.0f} to {df_combined['RemodAge'].max():.0f} years")

# Remodel indicator
if all(col in df_combined.columns for col in ['YearBuilt', 'YearRemodAdd']):
    df_combined['IsRemodeled'] = (df_combined['YearRemodAdd'] != df_combined['YearBuilt']).astype(int)
    age_features['IsRemodeled'] = df_combined['IsRemodeled']
    remodel_pct = (df_combined['IsRemodeled'] == 1).mean() * 100
    print(f"Remodel indicator created: {remodel_pct:.1f}% of properties remodeled")

# Garage efficiency feature is skipped: it divides by zero for garages used for storage

# Test age feature correlations with target (train data only)
print(f"\nAGE FEATURE CORRELATIONS WITH TARGET:")
train_mask = df_combined['dataset_source'] == 'train'
target_data = df_train_clean[target_col] if 'df_train_clean' in locals() else None

if target_data is not None:
    age_correlations = {}
    for feature_name, feature_data in age_features.items():
        train_feature = feature_data[train_mask]
        if len(train_feature) == len(target_data):
            corr = train_feature.corr(target_data)
            age_correlations[feature_name] = corr
            print(f"{feature_name}: {corr:.3f}")

    # Compare to Section 2 baseline features
    baseline_features = {
        'OverallQual': 0.821,
        'GrLivArea': 0.725,
        'GarageCars': 0.681
    }

    print(f"\nCOMPARISON TO SECTION 2 BASELINE FEATURES:")
    exceeds_baseline = False
    for age_name, age_corr in sorted(age_correlations.items(), key=lambda x: abs(x[1]), reverse=True):
        for baseline_name, baseline_corr in baseline_features.items():
            if abs(age_corr) > abs(baseline_corr):
                improvement = (abs(age_corr) - abs(baseline_corr)) / abs(baseline_corr) * 100
                print(f"{age_name} ({age_corr:.3f}) exceeds {baseline_name} ({baseline_corr:.3f}) by {improvement:+.1f}%")
                exceeds_baseline = True
                break

    if not exceeds_baseline:
        print("No age features exceed baseline correlation thresholds")
        print("Age features provide supplementary predictive value but not dominant performance")

AGE-BASED AND TEMPORAL FEATURE ENGINEERING

DATE ANOMALY INVESTIGATION:
Properties built after sale: 1
Sample cases (YrSold, YearBuilt):
  Sale: 2007, Built: 2008
Properties remodeled after sale: 2
Sample cases (YrSold, YearRemodAdd):
  Sale: 2007, Remodel: 2008
  Sale: 2007, Remodel: 2009
Properties with missing garage year: 158
Missing year properties with garage features: 0

Property Age created: 0 to 136 years
Garage Age created: 0 to 136 years
Remodel Age created: 0 to 60 years
Remodel indicator created: 46.6% of properties remodeled

AGE FEATURE CORRELATIONS WITH TARGET:
PropertyAge: -0.588
GarageAge: -0.570
RemodAge: -0.569
IsRemodeled: -0.074

COMPARISON TO SECTION 2 BASELINE FEATURES:
No age features exceed baseline correlation thresholds
Age features provide supplementary predictive value but not dominant performance
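
The correlation loop in the cell above passes two Series to `Series.corr`, which aligns them on index labels before computing the coefficient. The length check in the loop guards against mismatched row counts but not mismatched labels; the printed correlations suggest the train rows of `df_combined` and `df_train_clean` do share labels here. A minimal sketch on toy data (hypothetical values and variable names) of how disjoint labels would behave, with a position-based alternative:

```python
import numpy as np
import pandas as pd

# Toy feature and target with the same length but disjoint index labels
feature = pd.Series([1.0, 2.0, 3.0, 4.0], index=[10, 11, 12, 13])
target = pd.Series([2.0, 4.0, 6.0, 8.5], index=[0, 1, 2, 3])

# Series.corr joins on labels first; with no shared labels nothing is left to compare
label_corr = feature.corr(target)

# Position-based correlation ignores labels and pairs rows by order
positional_corr = np.corrcoef(feature.to_numpy(), target.to_numpy())[0, 1]

print(f"label-aligned: {label_corr}, positional: {positional_corr:.3f}")
```

If the combined frame were ever reordered or re-indexed, converting both sides with `.to_numpy()` (or calling `reset_index(drop=True)` on each) would be one way to keep the row pairing explicit, provided the row order is known to match.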
