# Modeling

## Phase 3 — Model Training and Evaluation

**Objective:**
Train and evaluate machine learning models using a clean and
leakage-free preprocessing pipeline.
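
As a minimal sketch of what "leakage-free" means here: when preprocessing lives inside a `Pipeline`, cross-validation refits it on the training part of each fold, so statistics from the held-out part never reach the model. The example uses synthetic data and a `StandardScaler` as a stand-in; both are assumptions, since the project's `build_preprocessor` is not shown in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the housing data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),   # stand-in for the real preprocessor
    ("model", Ridge(alpha=1.0)),
])

# return_estimator=True keeps the pipeline fitted on each training split.
cv_result = cross_validate(demo_pipeline, X_demo, y_demo, cv=5, return_estimator=True)

# Each fold's scaler learned its mean from that fold's training rows only,
# so it differs from the mean computed on the full dataset.
full_mean = X_demo.mean(axis=0)
fold_means = [est.named_steps["scaler"].mean_ for est in cv_result["estimator"]]
for fold, fold_mean in enumerate(fold_means):
    print(fold, np.round(fold_mean - full_mean, 3))
```

Fitting the scaler on the full dataset before cross-validation would give every fold the same mean, including information from its own validation rows.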

The modeling process follows a progressive strategy:
1. Establish a simple baseline
2. Train more expressive models
3. Compare performance using a consistent metric


In [1]:
import sys
import os

# Add project root to Python path
project_root = os.path.abspath("..")
if project_root not in sys.path:
    sys.path.append(project_root)

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline


In [3]:
from src.preprocessing import build_preprocessor
from src.temporal import add_temporal_features

train_df = pd.read_csv("../data/raw/train.csv")
train_df = add_temporal_features(train_df)

X = train_df.drop(columns=["SalePrice"])
y = train_df["SalePrice"]

preprocessor = build_preprocessor()

In [4]:
model = Ridge(alpha=1.0)

pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", model)
])

In [5]:
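# scikit-learn maximizes scores, so RMSE is returned negated; flip the sign
rmse_scores = -cross_val_score(
    pipeline,
    X,
    y,
    scoring="neg_root_mean_squared_error",
    cv=5
)

# Mean and standard deviation of RMSE across the 5 folds (in SalePrice units)
rmse_scores.mean(), rmse_scores.std()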

(np.float64(34891.49186789454), np.float64(5388.131182573366))

The baseline Ridge regression reaches a mean cross-validated RMSE of about 34,900
(standard deviation about 5,400 across the 5 folds), in the units of `SalePrice`.
This result serves as the benchmark for judging whether more complex models
or target transformations bring meaningful improvements.


## Step 2 — Target Transformation

The target variable (`SalePrice`) shows strong right skewness.
A logarithmic transformation may stabilize variance and improve
model performance, especially for linear models.
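
A quick way to check this kind of claim is to compare the skewness of a series before and after `np.log1p`. The sketch below uses a synthetic log-normal series as a stand-in for `SalePrice` (an assumption; the real column is not inspected here).

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (assumption: log-normal stand-in for SalePrice).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=2000))

# Sample skewness: clearly positive for a right-skewed series, near 0 when symmetric.
skew_raw = prices.skew()
skew_log = np.log1p(prices).skew()

print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

On the real data, the same two calls on `y` and `np.log1p(y)` would show how much of the skew the transformation removes.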


In [7]:
# log1p(y) = log(1 + y); the RMSE below is measured on this log scale, not in SalePrice units
y_log = np.log1p(y)

rmse_log_scores = -cross_val_score(
    pipeline,
    X,
    y_log,
    scoring="neg_root_mean_squared_error",
    cv=5
)

rmse_log_scores.mean(), rmse_log_scores.std()

(np.float64(0.1531466470193084), np.float64(0.023793450452665142))

### Technical Decision

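On the log scale, the mean RMSE is about 0.153 with a standard deviation of about 0.024.
These values are not directly comparable to the baseline RMSE of about 34,900, which is
measured in `SalePrice` units. Relative to the mean, the fold-to-fold spread is about 15%
in both cases, so the transformation is kept for its effect on the skewed target rather
than for a measured gain in stability.
Subsequent models are therefore trained on the log-transformed target.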
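
To compare a log-target model with a raw-target model in the same units, one option is scikit-learn's `TransformedTargetRegressor`, which fits on the transformed target and inverts the transformation at prediction time. The sketch below is an illustration, not the notebook's method: the data is synthetic, and a bare `Ridge` stands in for the full pipeline.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with a positive, right-skewed target (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
linear_part = X_demo @ np.array([0.5, -0.3, 0.2, 0.1])
y_demo = np.exp(11.0 + linear_part + rng.normal(scale=0.1, size=300))

# Fits Ridge on log1p(y), then applies expm1 to predictions,
# so scoring happens in the original units of y.
log_target_model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,
    inverse_func=np.expm1,
)

rmse_raw_target = -cross_val_score(
    Ridge(alpha=1.0), X_demo, y_demo,
    scoring="neg_root_mean_squared_error", cv=5,
)
rmse_log_target = -cross_val_score(
    log_target_model, X_demo, y_demo,
    scoring="neg_root_mean_squared_error", cv=5,
)

# Both arrays are in the units of y_demo and can be compared directly.
print(rmse_raw_target.mean(), rmse_log_target.mean())
```

In the notebook, passing `pipeline` as `regressor` and the untransformed `y` as the target would give a dollar-scale RMSE for the log-target model.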


In [10]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(
    n_estimators=300,
    random_state=42,
    n_jobs=-1
)

rf_pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", rf_model)
])

rf_rmse_scores = -cross_val_score(
    rf_pipeline,
    X,
    y_log,
    scoring="neg_root_mean_squared_error",
    cv=5
)

rf_rmse_scores.mean(), rf_rmse_scores.std()

(np.float64(0.1470402649405938), np.float64(0.00879567968512972))

### Model Comparison Note

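Random Forest captures non-linear relationships that linear models cannot.
On the log target it lowers the mean RMSE from about 0.153 (Ridge) to about 0.147,
and the standard deviation across folds drops from about 0.024 to about 0.009.
Both figures matter, because model selection considers not only average performance
but also stability across folds.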


In [11]:
from sklearn.ensemble import GradientBoostingRegressor

gb_model = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,
    random_state=42
)

gb_pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", gb_model)
])

gb_rmse_scores = -cross_val_score(
    gb_pipeline,
    X,
    y_log,
    scoring="neg_root_mean_squared_error",
    cv=5
)

gb_rmse_scores.mean(), gb_rmse_scores.std()

(np.float64(0.13434665869663368), np.float64(0.011343703866994494))
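
The three log-target results can be collected into one table for a side-by-side view. The sketch below copies the means and standard deviations from the cell outputs above, rounded to four decimals. The `relative_spread` column (standard deviation divided by mean) is an illustrative stability measure, not one the notebook defines.

```python
import pandas as pd

# Mean and standard deviation of 5-fold RMSE on log1p(SalePrice),
# copied from the cell outputs above and rounded to four decimals.
summary = pd.DataFrame(
    {
        "model": ["Ridge", "RandomForest", "GradientBoosting"],
        "rmse_mean": [0.1531, 0.1470, 0.1343],
        "rmse_std": [0.0238, 0.0088, 0.0113],
    }
).set_index("model")

# Relative spread: fold-to-fold standard deviation as a fraction of the mean.
summary["relative_spread"] = summary["rmse_std"] / summary["rmse_mean"]

# Rank by mean RMSE (lower is better).
summary = summary.sort_values("rmse_mean")
print(summary.round(4))
```

On these numbers, Gradient Boosting has the lowest mean RMSE, while Random Forest has the smallest spread across folds.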