# Modeling

## Phase 3: Model Training and Evaluation

**Objective:**
Train and evaluate machine learning models using a clean and
leakage-free preprocessing pipeline.

The modeling process follows a progressive strategy:
1. Establish a simple baseline
2. Train more expressive models
3. Compare performance using a consistent metric
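
The three steps above can be sketched as a small comparison loop. This is a minimal illustration on synthetic data, not the notebook's actual experiment: `X_demo`, `y_demo`, and the candidate list are assumptions chosen for the example. The point is that every model is scored with the same folds and the same metric.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: the real notebook uses train.csv instead).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)

# One shared splitter so every model is evaluated on exactly the same folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical candidates, ordered from simplest to most expressive.
candidates = {
    "mean_baseline": DummyRegressor(strategy="mean"),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, estimator in candidates.items():
    # scikit-learn returns negated RMSE, so flip the sign back.
    scores = -cross_val_score(
        estimator, X_demo, y_demo,
        scoring="neg_root_mean_squared_error", cv=cv,
    )
    results[name] = (scores.mean(), scores.std())

for name, (mean_rmse, std_rmse) in results.items():
    print(f"{name}: RMSE = {mean_rmse:.2f} +/- {std_rmse:.2f}")
```

Fixing the splitter with a `random_state` keeps the comparison fair: differences in RMSE then come from the models rather than from different fold assignments.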


In [1]:
import sys
import os

# Add project root to Python path
project_root = os.path.abspath("..")
if project_root not in sys.path:
    sys.path.append(project_root)

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline


In [3]:
from src.preprocessing import build_preprocessor
from src.temporal import add_temporal_features

train_df = pd.read_csv("../data/raw/train.csv")
train_df = add_temporal_features(train_df)

X = train_df.drop(columns=["SalePrice"])
y = train_df["SalePrice"]

preprocessor = build_preprocessor()
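
`build_preprocessor` is defined in `src/preprocessing.py`, which is not shown in this notebook. As a rough idea of what such a function might contain, here is a hypothetical stand-in built on `ColumnTransformer`. The function name `build_preprocessor_sketch`, the imputation strategies, and the demo column names are all assumptions, not the project's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor_sketch():
    """Hypothetical stand-in for src.preprocessing.build_preprocessor."""
    numeric = Pipeline(steps=[
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline(steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
        # Categories unseen during fit are encoded as all zeros.
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer(transformers=[
        ("num", numeric, make_column_selector(dtype_include=np.number)),
        ("cat", categorical, make_column_selector(dtype_exclude=np.number)),
    ])


# Tiny illustrative frame; the column names are made up for this example.
demo = pd.DataFrame({
    "area": [50.0, np.nan, 120.0, 80.0],
    "quality": ["good", "fair", np.nan, "good"],
})
transformed = build_preprocessor_sketch().fit_transform(demo)
print(transformed.shape)
```

Because the transformer is returned unfitted, it can be placed inside a `Pipeline` and refit on the training portion of each cross-validation fold, which is what keeps imputation and scaling statistics from leaking across folds.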

In [4]:
model = Ridge(alpha=1.0)

pipeline = Pipeline(steps=[
    ("preprocessing", preprocessor),
    ("model", model)
])

In [5]:
rmse_scores = -cross_val_score(
    pipeline,
    X,
    y,
    scoring="neg_root_mean_squared_error",
    cv=5
)

rmse_scores.mean(), rmse_scores.std()

(np.float64(34891.49186789454), np.float64(5388.131182573366))

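The baseline Ridge regression reaches a mean cross-validated RMSE of about 34,900
(standard deviation of about 5,400 across the 5 folds), in the same units as `SalePrice`.
This result serves as the benchmark for judging whether more complex models
or target transformations bring meaningful improvements.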
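
One way to test a target transformation against this kind of benchmark is scikit-learn's `TransformedTargetRegressor`, which fits on a transformed target and maps predictions back, so the RMSE stays in the original units. The sketch below uses synthetic right-skewed data as an assumption; in the notebook, the `regressor` argument would presumably be the full `pipeline` and the data would be `X` and `y`. Whether a log transform helps on the real data is something only the actual run can show.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic right-skewed target (assumption: stands in for a price-like column).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
coefs = np.array([0.5, -0.3, 0.2, 0.0, 0.1])
y_demo = np.exp(5.0 + X_demo @ coefs + rng.normal(scale=0.1, size=300))

cv = KFold(n_splits=5, shuffle=True, random_state=0)

plain = Ridge(alpha=1.0)
# Fit on log1p(y) and invert predictions with expm1 before scoring.
log_target = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0), func=np.log1p, inverse_func=np.expm1
)


def cv_rmse(estimator):
    """Return mean and std of the RMSE over the shared folds."""
    scores = -cross_val_score(
        estimator, X_demo, y_demo,
        scoring="neg_root_mean_squared_error", cv=cv,
    )
    return scores.mean(), scores.std()


plain_rmse = cv_rmse(plain)
log_rmse = cv_rmse(log_target)
print(f"plain target: {plain_rmse[0]:.2f} +/- {plain_rmse[1]:.2f}")
print(f"log target:   {log_rmse[0]:.2f} +/- {log_rmse[1]:.2f}")
```

Both variants are scored on the same folds and in the original target units, so the two RMSE values are directly comparable with the baseline figure.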
