# Notebook 04: Model Development & Optimization

This notebook covers advanced model development, hyperparameter tuning, and performance optimization using the feature-engineered dataset from Notebook 03.

## Objectives:
1. Load and verify feature-engineered data from Notebook 03
2. Develop baseline models for comparison
3. Implement advanced regression models (Random Forest, XGBoost, etc.)
4. Perform hyperparameter optimization
5. Build ensembles and stacked models
6. Evaluate and select the final model

In [3]:
# =============================================================================
# 1. Data Loading and Import from Notebook 03
# =============================================================================

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Machine Learning Libraries
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

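# Load the processed features and target written by Notebook 03
X_train = pd.read_csv('../data/processed/X_train_final.csv')
X_test = pd.read_csv('../data/processed/X_test_final.csv')
target_data = pd.read_csv('../data/processed/y_train_final.csv')

# Later cells use `y_train` as a 1-D Series.
# Assumption: the target values sit in the last column of y_train_final.csv.
y_train = target_data.iloc[:, -1]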

print("\n" + "="*60)
print("Data Import Summary:")
print(f"Features available: {X_train.shape[1]}")
print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")
print("Target variable: Log-transformed SalePrice")
print("="*60)


Data Import Summary:
Features available: 191
Training samples: 1458
Test samples: 1459
Target variable: Log-transformed SalePrice
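
Because the target is log-transformed, predictions and RMSE values in this notebook are on the log scale. The following is a minimal sketch of converting back to prices, assuming the transform in Notebook 03 was the natural log (`np.log`); if it was `np.log1p`, the inverse would be `np.expm1` instead. The price and error values are illustrative only, not taken from the dataset.

```python
import numpy as np

# Hypothetical sale price, used only to illustrate the round trip
price = 200_000.0

# Forward transform (assumed to match Notebook 03) and its inverse
log_price = np.log(price)
recovered = np.exp(log_price)

# An RMSE on the log scale reads roughly as a relative error in price:
# a log-scale error of 0.12 corresponds to a multiplicative factor of exp(0.12)
log_rmse = 0.12
relative_error = np.exp(log_rmse) - 1.0

print(f"log price: {log_price:.4f}")
print(f"recovered price: {recovered:,.0f}")
print(f"approx. relative error for log RMSE {log_rmse}: {relative_error:.1%}")
```

Reading the log-scale RMSE as an approximate percentage error makes it easier to compare models in terms of price.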


In [4]:
# =============================================================================
# 2. Data Verification and Quality Check
# =============================================================================

print("\n2. Data Verification and Quality Check")
print("="*60)

# Check for missing values
print("2.1 Missing Values Check:")
train_missing = X_train.isnull().sum()
test_missing = X_test.isnull().sum()

if train_missing.sum() > 0:
    print(f"❌ Training data has {train_missing.sum()} missing values")
    print("Features with missing values:")
    print(train_missing[train_missing > 0])
else:
    print("✓ No missing values in training data")

if test_missing.sum() > 0:
    print(f"❌ Test data has {test_missing.sum()} missing values") 
    print("Features with missing values:")
    print(test_missing[test_missing > 0])
else:
    print("✓ No missing values in test data")

# Check data types and feature consistency
print(f"\n2.2 Feature Consistency Check:")
print(f"Training features: {X_train.shape[1]}")
print(f"Test features: {X_test.shape[1]}")

if X_train.shape[1] == X_test.shape[1]:
    print("✓ Feature count matches between train and test")
else:
    print("❌ Feature count mismatch between train and test!")

# Check if column names match
train_cols = set(X_train.columns)
test_cols = set(X_test.columns)

if train_cols == test_cols:
    print("✓ Feature names match between train and test")
else:
    print("❌ Feature names don't match!")
    print(f"Only in train: {train_cols - test_cols}")
    print(f"Only in test: {test_cols - train_cols}")

# Check target variable
print(f"\n2.3 Target Variable Analysis:")
print(f"Target shape: {y_train.shape}")
print(f"Target range: {y_train.min():.3f} to {y_train.max():.3f}")
print(f"Target mean: {y_train.mean():.3f}")
print(f"Target std: {y_train.std():.3f}")
print(f"Target skewness: {stats.skew(y_train):.3f}")

# Display feature types
print(f"\n2.4 Feature Types Distribution:")
print(X_train.dtypes.value_counts())

# Show first few rows to verify data looks correct
print(f"\n2.5 Sample Data Preview:")
print("First 3 rows of training features:")
display(X_train.head(3))

print("\nFirst 3 target values:")
print(y_train.head(3).values)

print("\n" + "="*60)
print("Data verification complete!")
print("="*60)


2. Data Verification and Quality Check
2.1 Missing Values Check:
✓ No missing values in training data
✓ No missing values in test data

2.2 Feature Consistency Check:
Training features: 191
Test features: 191
✓ Feature count matches between train and test
✓ Feature names match between train and test

2.3 Target Variable Analysis:
Target shape: (1458,)
Target range: 10.460 to 13.534
Target mean: 12.024
Target std: 0.400
Target skewness: 0.121

2.4 Feature Types Distribution:
int64      167
float64     24
Name: count, dtype: int64

2.5 Sample Data Preview:
First 3 rows of training features:


Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,LotShape,LandContour,Utilities,LandSlope,OverallQual,OverallCond,MasVnrArea,ExterQual,ExterCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,BsmtUnfSF,HeatingQC,CentralAir,LowQualFinSF,GrLivArea,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageFinish,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscVal,GarageAge,HouseAge,YearsSinceRemod,BsmtFinSF,TotalFlrSF,TotalBaths,GarageAreaPerCar,MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Street_Pave,Alley_None,Alley_Pave,LotConfig_CulDSac,LotConfig_FR2,LotConfig_FR3,LotConfig_Inside,Neighborhood_Blueste,Neighborhood_BrDale,Neighborhood_BrkSide,Neighborhood_ClearCr,Neighborhood_CollgCr,Neighborhood_Crawfor,Neighborhood_Edwards,Neighborhood_Gilbert,Neighborhood_IDOTRR,Neighborhood_MeadowV,Neighborhood_Mitchel,Neighborhood_NAmes,Neighborhood_NPkVill,Neighborhood_NWAmes,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,Condition1_Feedr,Condition1_Norm,Condition1_PosA,Condition1_PosN,Condition1_RRAe,Condition1_RRAn,Condition1_RRNe,Condition1_RRNn,Condition2_Feedr,Condition2_Norm,Condition2_PosA,Condition2_PosN,Condition2_RRAe,Condition2_RRAn,Condition2_RRNn,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,HouseStyle_1.5Unf,HouseStyle_1Story,HouseStyle_2.5Fin,HouseStyle_2.5Unf,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,RoofStyle_Gable,RoofStyle_Gambrel,RoofStyle_Hip,RoofStyle_Mansard,RoofStyle_Shed,RoofMatl_Membran,RoofMatl_Metal,RoofMatl_Roll,RoofMatl_Tar&Grv,RoofMatl_WdShake,RoofMatl_WdShngl,Exterior1st_AsphShn,Exterior1st_Brk Cmn,Exterior1st_BrkFace,Exterior1st_CBlock,Exterior1st_CmentBd,Exterior1st_HdBoard,Exterior1st_ImStucc,Exterior1st_MetalSd,Exterior1st_Plywood,Exterior1st_Stone,Exterior1st_Stucco,Exterior1st_VinylSd,Exterior1st_Wd Sdng,Exterior1st_Wd Shng,Exterior2nd_AsphShn,Exterior2nd_Brk Cmn,Exterior2nd_BrkFace,Exterior2nd_CBlock,Exterior2nd_CmentBd,Exterior2nd_HdBoard,Exterior2nd_ImStucc,Exterior2nd_MetalSd,Exterior2nd_Other,Exterior2nd_Plywood,Exterior2nd_Stone,Exterior2nd_Stucco,Exterior2nd_VinylSd,Exterior2nd_Wd Sdng,Exterior2nd_Wd Shng,MasVnrType_BrkFace,MasVnrType_None,MasVnrType_Stone,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW,Heating_Wall,Electrical_FuseF,Electrical_FuseP,Electrical_Mix,Electrical_SBrkr,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageType_None,MiscFeature_None,MiscFeature_Othr,MiscFeature_Shed,MiscFeature_TenC,SaleType_CWD,SaleType_Con,SaleType_ConLD,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,1,4.110874,4.189655,9.04204,4,4,4,3,7,1.791759,5.283204,4,3,4,3,1,6,1,5.01728,5,2,0.0,7.444833,3,0.693147,4,2.197225,8,0.0,0,2,3,3,3,0.0,4.127134,0.0,0.0,0.0,0.0,0,0,0.0,1.791759,1.791759,5,6.561031,7.444833,3.5,274.0,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
1,2,3.044522,4.394449,9.169623,4,4,4,3,6,2.197225,0.0,3,3,4,3,4,5,1,5.652489,5,2,0.0,7.141245,3,0.693147,3,1.94591,8,0.693147,3,2,3,3,3,5.700444,0.0,0.0,0.0,0.0,0.0,0,0,0.0,3.465736,3.465736,31,6.886532,7.141245,2.5,230.0,0,0,1,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
2,3,4.110874,4.234107,9.328212,3,4,4,3,7,1.791759,5.09375,4,3,4,3,2,6,1,6.075346,5,2,0.0,7.488294,3,0.693147,4,1.94591,8,0.693147,3,2,3,3,3,0.0,3.7612,0.0,0.0,0.0,0.0,0,0,0.0,2.079442,2.079442,6,6.188264,7.488294,3.5,304.0,0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0



First 3 target values:
[12.24769432 12.10901093 12.31716669]

Data verification complete!
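
The name check in this cell compares column sets, so it passes even when the two frames hold the same columns in a different order. scikit-learn expects features in the same order at fit and predict time, so it can be worth aligning the test frame explicitly. A minimal sketch on toy frames (the `align_columns` helper and the toy data are hypothetical, not part of this notebook):

```python
import pandas as pd

def align_columns(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Return `test` with its columns reordered to match `train`.

    Raises ValueError if the two frames do not hold the same column names.
    """
    missing = set(train.columns) - set(test.columns)
    extra = set(test.columns) - set(train.columns)
    if missing or extra:
        raise ValueError(f"Column mismatch. Missing: {missing}, extra: {extra}")
    # Indexing with the training column Index enforces the training order
    return test[train.columns]

# Toy frames with the same columns in a different order
train_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
test_demo = pd.DataFrame({"c": [7], "a": [8], "b": [9]})

aligned = align_columns(train_demo, test_demo)
print(list(aligned.columns))
```

In this notebook the equivalent call would be something like `X_test = align_columns(X_train, X_test)`, applied before any model is fitted.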


In [5]:
# =============================================================================
# 3. Model Setup and Evaluation Framework
# =============================================================================

print("\n3. Model Setup and Evaluation Framework")
print("="*60)

# Define evaluation metrics
def evaluate_model(model, X_train, y_train, cv_folds=5, model_name="Model"):
    """
    Evaluate a model using cross-validation
    Returns mean and std of RMSE scores
    """
    # Cross-validation scores (negative RMSE, so we negate to get positive RMSE)
    cv_scores = cross_val_score(model, X_train, y_train, 
                               cv=cv_folds, 
                               scoring='neg_root_mean_squared_error',
                               n_jobs=-1)
    
    rmse_scores = -cv_scores  # Convert to positive RMSE
    
    print(f"\n{model_name} Cross-Validation Results:")
    print(f"  RMSE: {rmse_scores.mean():.4f} (+/- {rmse_scores.std() * 2:.4f})")
    print(f"  Individual scores: {rmse_scores}")
    
    return rmse_scores.mean(), rmse_scores.std()

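# Setup cross-validation strategy
# Note: the model cells below pass the integer `cv_folds` to evaluate_model, which
# gives consecutive, unshuffled folds. Pass `kfold` instead to use this shuffled splitter.
cv_folds = 5
kfold = KFold(n_splits=cv_folds, shuffle=True, random_state=42)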

print(f"✓ Cross-validation setup: {cv_folds}-fold CV")
print("✓ Evaluation metric: Root Mean Squared Error (RMSE)")
print("✓ Target variable: Log-transformed SalePrice")

# Initialize results storage
model_results = {}

print("\n" + "="*60)
print("Model evaluation framework ready!")
print("="*60)


3. Model Setup and Evaluation Framework
✓ Cross-validation setup: 5-fold CV
✓ Evaluation metric: Root Mean Squared Error (RMSE)
✓ Target variable: Log-transformed SalePrice

Model evaluation framework ready!
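
`evaluate_model` forwards its `cv_folds` argument to `cross_val_score`, which accepts either an integer or a splitter object. The sketch below contrasts the two on synthetic data; the data from `make_regression` and the Ridge settings are placeholders, not this notebook's dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X_train / y_train
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=5.0, random_state=0)
model = Ridge(alpha=1.0)

# Integer cv: consecutive, unshuffled folds for a regressor
scores_int = -cross_val_score(model, X_demo, y_demo, cv=5,
                              scoring="neg_root_mean_squared_error")

# Splitter object: shuffled folds with a fixed seed for reproducibility
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kfold = -cross_val_score(model, X_demo, y_demo, cv=kfold_demo,
                                scoring="neg_root_mean_squared_error")

print(f"Unshuffled: {scores_int.mean():.4f} (std {scores_int.std():.4f})")
print(f"Shuffled:   {scores_kfold.mean():.4f} (std {scores_kfold.std():.4f})")
```

Shuffling matters most when rows are stored in a meaningful order (for example by date or neighborhood), since consecutive folds would then differ systematically.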


In [6]:
# =============================================================================
# 4. Baseline Models
# =============================================================================

print("\n4. Baseline Models Development")
print("="*60)

print("Testing baseline regression models to establish performance benchmarks...")

# 4.1 Linear Regression
print("\n4.1 Linear Regression (OLS)")
lr_model = LinearRegression()
lr_mean, lr_std = evaluate_model(lr_model, X_train, y_train, cv_folds, "Linear Regression")
model_results['Linear Regression'] = {'mean': lr_mean, 'std': lr_std}

# 4.2 Ridge Regression (with basic regularization)
print("\n4.2 Ridge Regression")
ridge_model = Ridge(alpha=1.0, random_state=42)
ridge_mean, ridge_std = evaluate_model(ridge_model, X_train, y_train, cv_folds, "Ridge Regression")
model_results['Ridge Regression'] = {'mean': ridge_mean, 'std': ridge_std}

# 4.3 Lasso Regression
print("\n4.3 Lasso Regression")
lasso_model = Lasso(alpha=0.001, random_state=42, max_iter=10000)
lasso_mean, lasso_std = evaluate_model(lasso_model, X_train, y_train, cv_folds, "Lasso Regression")
model_results['Lasso Regression'] = {'mean': lasso_mean, 'std': lasso_std}

# 4.4 Elastic Net
print("\n4.4 Elastic Net")
elastic_model = ElasticNet(alpha=0.001, l1_ratio=0.5, random_state=42, max_iter=10000)
elastic_mean, elastic_std = evaluate_model(elastic_model, X_train, y_train, cv_folds, "Elastic Net")
model_results['Elastic Net'] = {'mean': elastic_mean, 'std': elastic_std}

# Summary of baseline results
print(f"\n{'='*60}")
print("BASELINE MODELS SUMMARY:")
print(f"{'='*60}")
print(f"{'Model':<20} {'RMSE':<10} {'Std':<10}")
print("-" * 40)

for model_name, results in model_results.items():
    print(f"{model_name:<20} {results['mean']:<10.4f} {results['std']:<10.4f}")

# Find best baseline model
best_baseline = min(model_results.items(), key=lambda x: x[1]['mean'])
print(f"\n🏆 Best Baseline Model: {best_baseline[0]}")
print(f"   RMSE: {best_baseline[1]['mean']:.4f} (std: {best_baseline[1]['std']:.4f})")

print(f"\n{'='*60}")
print("Baseline models complete! Ready for advanced models...")
print(f"{'='*60}")
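
Ridge, Lasso, and Elastic Net penalize coefficient size, so their fit depends on feature scale. The models in this cell are fitted on the features as loaded, and the imported `StandardScaler` is not applied here. One way to try scaling without leaking fold statistics is a pipeline, sketched below on synthetic data (the data, the inflated column, and the alpha value are placeholders):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X_train / y_train
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=5.0, random_state=0)
# Inflate one column so the features are on very different scales
X_demo[:, 0] *= 1000.0

# The scaler is refit on the training part of every fold, so no statistics
# from the held-out fold reach the model
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

scores = -cross_val_score(scaled_ridge, X_demo, y_demo, cv=5,
                          scoring="neg_root_mean_squared_error")
print(f"Scaled Ridge RMSE: {scores.mean():.4f} (std {scores.std():.4f})")
```

Since a pipeline is itself an estimator, it could be passed to `evaluate_model` in place of the bare model.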


4. Baseline Models Development
Testing baseline regression models to establish performance benchmarks...

4.1 Linear Regression (OLS)

Linear Regression Cross-Validation Results:
  RMSE: 0.1245 (+/- 0.0144)
  Individual scores: [0.11261191 0.12669513 0.13513737 0.12363186 0.12462509]

4.2 Ridge Regression

Ridge Regression Cross-Validation Results:
  RMSE: 0.1203 (+/- 0.0125)
  Individual scores: [0.10995805 0.12305629 0.1272663  0.11651827 0.12465893]

4.3 Lasso Regression

Lasso Regression Cross-Validation Results:
  RMSE: 0.1192 (+/- 0.0095)
  Individual scores: [0.11299789 0.1202319  0.12515204 0.11442778 0.12319979]

4.4 Elastic Net

Elastic Net Cross-Validation Results:
  RMSE: 0.1180 (+/- 0.0088)
  Individual scores: [0.11283696 0.11939618 0.12315057 0.11279639 0.12167377]

BASELINE MODELS SUMMARY:
Model                RMSE       Std       
----------------------------------------
Linear Regression    0.1245     0.0072    
Ridge Regression     0.1203     0.0063    
Lasso Regres