# Notebook 04: Model Training and Comparison

This notebook trains regression models on the feature-engineered datasets from notebook 03 and compares them under a common evaluation protocol.
The comparison drives algorithm selection and subsequent performance optimization.

---

## 1. Load Preprocessed Data and Baseline Establishment

Note: this section depends on Section 7 of notebook 03.

Load the feature-engineered datasets and establish the baseline metrics that later models are compared against.
Check that the data and the feature count match the outputs of the notebook 03 preprocessing pipeline.

### 1.1 Dataset Import and Validation

Load the feature-engineered train and test sets produced by notebook 03.
Confirm that the 275 features and the SalePrice_log target are present and that no values are missing; the train/validation split is created in section 1.2.

In [10]:
# Load required libraries for model development and evaluation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
import warnings
warnings.filterwarnings('ignore')

# Load feature-engineered datasets from notebook 03
df_train_engineered = pd.read_csv('../data/processed/train_feature_engineered.csv')
df_test_engineered = pd.read_csv('../data/processed/test_feature_engineered.csv')

print("Feature-engineered dataset shapes:")
print(f"Train: {df_train_engineered.shape}")
print(f"Test: {df_test_engineered.shape}")

# Verify data quality and consistency from feature engineering
print(f"\nData quality verification:")
print(f"Train missing values: {df_train_engineered.isnull().sum().sum()}")
print(f"Test missing values: {df_test_engineered.isnull().sum().sum()}")

# Feature count validation against notebook 03 expectations
feature_cols = [col for col in df_train_engineered.columns 
                if col not in ['SalePrice', 'SalePrice_log', 'Id']]
print(f"\nFeature validation:")
print(f"Features available: {len(feature_cols)}")
print(f"Expected from NB03: 275 features (excluding Id)")
print(f"Feature count validation: {'PASSED' if len(feature_cols) == 275 else 'REVIEW'}")

# Verify target variables from notebook 03 preprocessing
target_cols = [col for col in df_train_engineered.columns if 'SalePrice' in col]
print(f"Target variables available: {target_cols}")

Feature-engineered dataset shapes:
Train: (1458, 278)
Test: (1459, 276)

Data quality verification:
Train missing values: 0
Test missing values: 0

Feature validation:
Features available: 275
Expected from NB03: 275 features (excluding Id)
Feature count validation: PASSED
Target variables available: ['SalePrice', 'SalePrice_log']


Loading confirms 275 engineered features, zero missing values in both the train set (1,458 rows) and the test set (1,459 rows), and both target columns (SalePrice and SalePrice_log).
With the preprocessing outputs validated, the feature matrices can be built for model development.
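The checks above compare a feature count against an expected number. A slightly stricter check, sketched below under assumptions, also confirms that the train and test frames expose the same feature columns. The helper `check_feature_alignment` and the toy column names (`GrLivArea`, `OverallQual`) are illustrative only and are not taken from notebook 03.

```python
import pandas as pd

# Columns that are identifiers or targets rather than model inputs
NON_FEATURE_COLS = ("Id", "SalePrice", "SalePrice_log")


def check_feature_alignment(train, test, non_feature_cols=NON_FEATURE_COLS):
    """Return (feature_cols, problems); an empty problems list means the frames align."""
    feature_cols = [c for c in train.columns if c not in non_feature_cols]
    problems = []

    # Every training feature must also exist in the test frame
    missing_in_test = [c for c in feature_cols if c not in test.columns]
    if missing_in_test:
        problems.append(f"missing in test: {missing_in_test}")

    # The test frame should not carry features the model never saw
    extra_in_test = [c for c in test.columns
                     if c not in feature_cols and c not in non_feature_cols]
    if extra_in_test:
        problems.append(f"unexpected in test: {extra_in_test}")

    # Missing values in either frame would break most estimators
    shared = [c for c in feature_cols if c in test.columns]
    for name, frame in (("train", train), ("test", test)):
        n_missing = int(frame[shared].isnull().sum().sum())
        if n_missing:
            problems.append(f"{name} has {n_missing} missing values")

    return feature_cols, problems


# Toy frames standing in for the real CSVs (hypothetical values)
toy_train = pd.DataFrame({"Id": [1, 2], "GrLivArea": [1500, 1800],
                          "OverallQual": [6, 7], "SalePrice": [200000, 250000],
                          "SalePrice_log": [12.21, 12.43]})
toy_test = pd.DataFrame({"Id": [3, 4], "GrLivArea": [1600, None],
                         "OverallQual": [5, 8]})

features, issues = check_feature_alignment(toy_train, toy_test)
print(features)
print(issues)
```

On the real data, the same function could be called with `df_train_engineered` and `df_test_engineered`; an empty `issues` list would correspond to the PASSED result printed above.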

### 1.2 Baseline Performance Establishment and Target Preparation

Establish a baseline by predicting the training mean for every validation row; any useful model must beat this score.
Create an 80/20 train/validation split with SalePrice_log as the primary target, following the notebook 03 optimization results.
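As a cross-check on the mean-prediction baseline, scikit-learn's `DummyRegressor` yields the same predictions as filling an array with the training mean. The sketch below uses synthetic data (the array `y_log` is a stand-in, not the real target) and also shows that exponentiating a mean of logs gives the geometric mean, which is never larger than the arithmetic mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a log-transformed price target (assumed values)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y_log = 12.0 + 0.4 * rng.normal(size=200)

X_tr, X_va, y_tr, y_va = train_test_split(X, y_log, test_size=0.2, random_state=42)

# Baseline via DummyRegressor: always predicts the training mean
dummy = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
dummy_rmse = np.sqrt(mean_squared_error(y_va, dummy.predict(X_va)))

# Same baseline written by hand
manual_pred = np.full(len(y_va), y_tr.mean())
manual_rmse = np.sqrt(mean_squared_error(y_va, manual_pred))

# exp(mean of logs) is the geometric mean on the original scale
geometric_mean = np.exp(y_tr.mean())
arithmetic_mean = np.exp(y_tr).mean()

print(f"DummyRegressor RMSE: {dummy_rmse:.4f}")
print(f"Manual RMSE:         {manual_rmse:.4f}")
print(f"Geometric mean <= arithmetic mean: {geometric_mean <= arithmetic_mean}")
```

The two RMSE values agree, so either form can serve as the baseline; the hand-written version keeps the dependency list shorter, while `DummyRegressor` fits into `cross_val_score` like any other estimator.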

In [11]:
# Separate features and targets (use both original and log-transformed)
X_train = df_train_engineered[feature_cols]
y_train_original = df_train_engineered['SalePrice']
y_train_log = df_train_engineered['SalePrice_log']  # Primary target from NB03
X_test = df_test_engineered[feature_cols]

print(f"Feature matrix shapes:")
print(f"X_train: {X_train.shape}")
print(f"X_test: {X_test.shape}")
print(f"y_train_original: {y_train_original.shape}")
print(f"y_train_log: {y_train_log.shape} (primary target - pre-optimized)")

# Create train/validation split for model development (consistent seed)
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_train, y_train_log, test_size=0.2, random_state=42
)

print(f"\nTrain/validation split:")
print(f"X_train_split: {X_train_split.shape}")
print(f"X_val_split: {X_val_split.shape}")
print(f"y_train_split: {y_train_split.shape}")
print(f"y_val_split: {y_val_split.shape}")

# Baseline performance using simple mean prediction
baseline_pred_log = np.full(len(y_val_split), y_train_split.mean())
baseline_rmse_log = np.sqrt(mean_squared_error(y_val_split, baseline_pred_log))

# Convert predictions and targets to original scale for interpretable baseline
baseline_pred_original = np.exp(baseline_pred_log)
y_val_original = np.exp(y_val_split)
baseline_rmse_original = np.sqrt(mean_squared_error(y_val_original, baseline_pred_original))

print(f"\nBaseline performance (mean prediction):")
print(f"Baseline RMSE (log scale): {baseline_rmse_log:.4f}")
print(f"Baseline RMSE (original scale): {baseline_rmse_original:,.0f}")
print(f"Mean log target: {y_train_split.mean():.4f}")
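# Note: exp(mean of logs) is the geometric mean on the original scale,
# which is lower than the arithmetic mean of SalePrice
print(f"Mean original target: {np.exp(y_train_split.mean()):,.0f}")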

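# Scale features for consistency. The scaler is fit on the whole training split
# before cross-validation, so all folds share its statistics; random forests
# are insensitive to feature scaling, so this does not affect the RMSE below.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_split)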

# RandomForest baseline (matching notebook 03 exactly)
rf_baseline = RandomForestRegressor(n_estimators=50, random_state=42, n_jobs=-1)
rf_cv_scores = cross_val_score(rf_baseline, X_train_scaled, y_train_split, cv=3,
                              scoring='neg_mean_squared_error', n_jobs=-1)
rf_rmse_log = np.sqrt(-rf_cv_scores.mean())

print(f"\nRandomForest baseline (reproducing NB03):")
print(f"RandomForest RMSE (log scale): {rf_rmse_log:.4f}")
print(f"Performance range: {baseline_rmse_log:.2f} (mean) to {rf_rmse_log:.4f} (RandomForest)")

Feature matrix shapes:
X_train: (1458, 275)
X_test: (1459, 275)
y_train_original: (1458,)
y_train_log: (1458,) (primary target - pre-optimized)

Train/validation split:
X_train_split: (1166, 275)
X_val_split: (292, 275)
y_train_split: (1166,)
y_val_split: (292,)

Baseline performance (mean prediction):
Baseline RMSE (log scale): 0.4106
Baseline RMSE (original scale): 75,775
Mean log target: 12.0234
Mean original target: 166,602

RandomForest baseline (reproducing NB03):
RandomForest RMSE (log scale): 0.1396
Performance range: 0.41 (mean) to 0.1396 (RandomForest)
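For scale-sensitive models such as Ridge, one common alternative to scaling up front is to place the scaler inside a pipeline so that it is refit on each cross-validation training fold. The sketch below is an illustration on synthetic data from `make_regression`, not the method used in this notebook or in notebook 03; the `alpha` value is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the housing features
X_demo, y_demo = make_regression(n_samples=300, n_features=20,
                                 noise=10.0, random_state=42)

# The scaler is fit only on the training part of each fold
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=3,
                            scoring="neg_mean_squared_error")

# Two ways to summarize: RMSE per fold, or sqrt of the mean MSE
fold_rmse = np.sqrt(-cv_scores)
pooled_rmse = np.sqrt(-cv_scores.mean())

print(f"Per-fold RMSE: {np.round(fold_rmse, 3)}")
print(f"sqrt(mean MSE): {pooled_rmse:.3f}")
```

The cell above reports the square root of the mean MSE across folds. That figure is always at least as large as the mean of the per-fold RMSE values, so the two summaries should not be mixed when comparing models.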
