# 04 — Tree-Based Models

Objective:
Evaluate non-linear models to capture interactions and complex pricing behavior
in used car valuation.

## 1. Imports

In [10]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

import category_encoders as ce

import warnings
warnings.filterwarnings("ignore")

## 2. Loading the Data

In [2]:
df = pd.read_csv("../data/processed/vehicles_feature_audited.csv")
df_small = df.sample(n=120000, random_state=42)

X = df_small.drop(columns=["price", "log_price"])
y = df_small["log_price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape)

(96000, 14)


## 3. Define Column Groups

In [3]:
numerical_cols = ["year", "odometer"]

target_encode_cols = ["model", "region"]

onehot_cols = [
    "manufacturer", "condition", "cylinders", "fuel",
    "title_status", "transmission", "drive",
    "size", "type", "paint_color"
]

## 4. Tree Preprocessing

Tree models split on feature thresholds, so monotonic rescaling does not change the splits and no scaling step is needed.
We retain:
- Target encoding for high-cardinality features
- OneHot encoding for low-cardinality features
- Numerical features passed through unchanged
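
The claim that trees do not need scaling can be checked on a toy example. The sketch below uses only the standard library and made-up odometer and log-price values; `best_split` is a hypothetical helper written for this illustration, not a scikit-learn function. It finds the error-minimizing split on a single feature and compares the resulting row partition before and after standardizing that feature.

```python
from statistics import mean, pstdev


def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return the set of row indices sent left by the SSE-minimizing split on x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_cost, best_left = float("inf"), None
    for k in range(1, len(order)):
        # Skip positions where neighbouring values tie: no threshold separates them.
        if x[order[k - 1]] == x[order[k]]:
            continue
        left, right = order[:k], order[k:]
        cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
        if cost < best_cost:
            best_cost, best_left = cost, frozenset(left)
    return best_left


# Toy data (illustrative values only, not taken from the dataset).
odometer = [12_000, 35_000, 60_000, 90_000, 140_000, 210_000]
log_price = [10.3, 10.1, 9.8, 9.4, 8.9, 8.2]

# Standardize the feature: a strictly increasing transformation.
mu, sigma = mean(odometer), pstdev(odometer)
odometer_scaled = [(v - mu) / sigma for v in odometer]

raw_partition = best_split(odometer, log_price)
scaled_partition = best_split(odometer_scaled, log_price)
print("Same partition after scaling:", raw_partition == scaled_partition)
```

Because standardization preserves the ordering of the values, the candidate partitions are identical and only the numeric threshold changes.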

In [4]:
tree_preprocessor = ColumnTransformer(
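    transformers=[
        # Low-cardinality categoricals: one indicator column per category.
        # With drop="first", an unseen category is encoded as all zeros,
        # i.e. the same as the dropped first category.
        ("onehot",
         OneHotEncoder(drop="first", handle_unknown="ignore"),
         onehot_cols),

        # High-cardinality categoricals: replace each category with a
        # smoothed mean of the target, learned when the pipeline is fit.
        ("target",
         ce.TargetEncoder(cols=target_encode_cols, smoothing=10),
         target_encode_cols),

        # Numerical features are passed through unscaled.
        ("num",
         "passthrough",
         numerical_cols)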
    ]
)
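
To illustrate why smoothing matters for rare categories, here is a minimal sketch of target encoding with shrinkage toward the global mean. It deliberately uses a simple additive (m-estimate) formula as an assumption for illustration; `category_encoders.TargetEncoder` uses its own weighting scheme, so the numbers will not match the library. The function name and toy values are hypothetical.

```python
from collections import defaultdict


def fit_smoothed_target_encoding(categories, targets, m=10.0):
    """Map each category to a mean target shrunk toward the global mean.

    Assumption: additive (m-estimate) shrinkage,
    (n * category_mean + m * global_mean) / (n + m).
    This is not the formula used by category_encoders.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    encoding = {
        c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts
    }
    return encoding, global_mean


# Toy training fold (illustrative values only).
models = ["civic", "civic", "civic", "civic", "f-150", "f-150", "rare_model"]
log_prices = [9.0, 9.2, 9.1, 8.9, 10.0, 10.2, 11.5]

encoding, global_mean = fit_smoothed_target_encoding(models, log_prices, m=10.0)

# Categories unseen during fitting fall back to the global mean.
encoded_new = [encoding.get(c, global_mean) for c in ["civic", "unseen_model"]]
print(encoding)
print(encoded_new)
```

A category seen only once (`rare_model`) is pulled most of the way back to the global mean, which limits how much a single expensive listing can influence its encoding. Fitting the encoder inside the pipeline means it is refit on the training portion of each CV fold.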

## 5. Random Forest (Lightweight Baseline)

To quickly evaluate whether non-linear modeling improves performance, we use:
- 50 trees
- Max depth = 20
- 3-fold CV

In [5]:
rf_model = Pipeline(steps=[
    ("preprocessing", tree_preprocessor),
    ("regressor", RandomForestRegressor(
        n_estimators=50,
        max_depth=20,
        random_state=42,
        n_jobs=4
    ))
])

In [6]:
# Cross-validation
kf = KFold(n_splits=3, shuffle=True, random_state=42)

rf_cv_rmse = -cross_val_score(
    rf_model,
    X_train,
    y_train,
    cv=kf,
    scoring="neg_root_mean_squared_error"
)

print("Random Forest CV RMSE (log):", rf_cv_rmse.mean())
print("Per fold:", rf_cv_rmse)
print("Std Dev:", rf_cv_rmse.std())



Random Forest CV RMSE (log): 0.8349978398880267
Per fold: [0.82671401 0.8319483  0.84633121]
Std Dev: 0.008293907958951335
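
An RMSE on the log scale can be hard to read directly. The short calculation below converts it into an approximate multiplicative error on price. It assumes `log_price` is the natural logarithm of `price` (the notebook does not show how the column was computed), and the helper name is hypothetical.

```python
import math


def multiplicative_error(rmse_log):
    """Convert an RMSE on natural-log prices into a price ratio."""
    return math.exp(rmse_log)


rf_cv_rmse_log = 0.835  # Random Forest mean CV RMSE reported above (rounded)
factor = multiplicative_error(rf_cv_rmse_log)
print(f"A one-RMSE error corresponds to a factor of about {factor:.2f} in price")
```

Under that assumption, a log RMSE of 0.835 means predictions that are off by one RMSE differ from the true price by a factor of roughly 2.3, in either direction.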




## 6. LightGBM Baseline (Boosted Trees)

We evaluate Gradient Boosting using LightGBM.

### Motivation for Gradient Boosting (LightGBM)

Linear models underperformed (CV RMSE ≈ 0.995), which points to high bias: they cannot capture non-linear relationships.

Random Forest improved on this substantially (CV RMSE ≈ 0.835), which is consistent with:
- Non-linear structure in pricing
- Strong feature interactions
- Tree-based models being better suited to this problem

However, Random Forest:
- Averages independently trained trees
- Reduces variance but does little to reduce bias

Gradient Boosting builds trees sequentially, where each new tree corrects previous residual errors. 
This typically leads to stronger performance on structured/tabular data.
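
The sequential residual-fitting idea can be sketched in a few lines of plain Python. This is a toy illustration for squared error using one-split stumps on a single feature, with made-up data; it is not how LightGBM is implemented, and all function names are hypothetical. It assumes the feature has at least two distinct values.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump; return (threshold, left_value, right_value)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # exclude the max so the right side is non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        cost = sum((r - left_value) ** 2 for r in left) + sum(
            (r - right_value) ** 2 for r in right
        )
        if best is None or cost < best[0]:
            best = (cost, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [
            p + learning_rate * predict_stump(stump, xi) for p, xi in zip(preds, x)
        ]
        rmse = (sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)) ** 0.5
        history.append(rmse)
    return base, stumps, history


# Toy data: vehicle age vs. log price (illustrative values only).
age = [1, 2, 3, 5, 8, 10, 12, 15, 20, 25]
log_price = [10.4, 10.3, 10.1, 9.9, 9.5, 9.2, 9.0, 8.6, 8.1, 7.9]

base, stumps, history = boost(age, log_price)
print(f"Train RMSE after 1 round:   {history[0]:.4f}")
print(f"Train RMSE after 50 rounds: {history[-1]:.4f}")
```

Each round shrinks the training residuals a little, which is why boosting can keep reducing bias as rounds are added, and also why training error alone says little about generalization.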

LightGBM is chosen because:
- It is optimized for large datasets
- It uses histogram-based splitting (faster training)
- It handles high-dimensional features efficiently
- It is widely used in industry for tabular ML problems
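
As a rough illustration of histogram-based splitting, the sketch below buckets a feature into a fixed number of bins, accumulates per-bin target sums and counts in one pass, and then evaluates candidate splits only at bin boundaries. Equal-width bins and the simple variance-reduction score are simplifying assumptions; LightGBM's actual binning and gain computation differ in detail, and `histogram_split` is a hypothetical name.

```python
def histogram_split(x, y, n_bins=8):
    """Find the best split among bin boundaries using per-bin sums and counts.

    Simplified assumptions: equal-width bins and a variance-reduction score.
    """
    lo, hi = min(x), max(x)
    if hi == lo:
        return None  # a constant feature cannot be split
    width = (hi - lo) / n_bins
    bin_sum = [0.0] * n_bins
    bin_count = [0] * n_bins
    # One pass over the data builds the histogram.
    for xi, yi in zip(x, y):
        b = min(int((xi - lo) / width), n_bins - 1)
        bin_sum[b] += yi
        bin_count[b] += 1

    total_sum, total_count = sum(bin_sum), sum(bin_count)
    best_score, best_threshold = float("-inf"), None
    left_sum, left_count = 0.0, 0
    # Only n_bins - 1 candidate splits are scored, regardless of the row count.
    for b in range(n_bins - 1):
        left_sum += bin_sum[b]
        left_count += bin_count[b]
        right_sum, right_count = total_sum - left_sum, total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        score = left_sum ** 2 / left_count + right_sum ** 2 / right_count
        if score > best_score:
            best_score, best_threshold = score, lo + (b + 1) * width
    return best_threshold


# Toy data: log price drops sharply past 100 (thousand miles); illustrative only.
odometer_k = list(range(0, 200, 5))
log_price = [10.0 if v < 100 else 8.0 for v in odometer_k]

threshold = histogram_split(odometer_k, log_price)
print("Chosen threshold:", threshold)
```

The speed-up comes from scoring a handful of bin boundaries instead of every distinct feature value, at the cost of a slightly coarser threshold.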

Objective:
Evaluate whether boosting further reduces log RMSE compared to Random Forest.

In [11]:
import lightgbm as lgb

lgb_model = Pipeline(steps=[
    ("preprocessing", tree_preprocessor),
    ("regressor", lgb.LGBMRegressor(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=-1,
        random_state=42,
        n_jobs=4,
        verbosity=-1
    ))
]) 

In [12]:
# Cross-validation
kf = KFold(n_splits=3, shuffle=True, random_state=42)

lgb_cv_rmse = -cross_val_score(
    lgb_model,
    X_train,
    y_train,
    cv=kf,
    scoring="neg_root_mean_squared_error"
)

print("LightGBM CV RMSE (log):", lgb_cv_rmse.mean())
print("Per fold:", lgb_cv_rmse)
print("Std Dev:", lgb_cv_rmse.std())

LightGBM CV RMSE (log): 0.8277661239659446
Per fold: [0.81881676 0.83256868 0.83191293]
Std Dev: 0.006333814631087257


## 7. LightGBM Hyperparameter Tuning (Small Grid)

Objective:
Refine model complexity and learning dynamics to determine whether 
further bias reduction is possible beyond the baseline LightGBM model (CV RMSE ≈ 0.828).

We tune only high-impact parameters:

- num_leaves → controls tree complexity
- learning_rate → controls step size of boosting
- n_estimators → number of boosting rounds

We use a small, controlled grid to limit compute and the risk of overfitting to the validation folds.
Tuning is performed on the training split (96,000 rows) of the 120k-row sample.

In [13]:
from itertools import product

param_grid = {
    "num_leaves": [31, 63],
    "learning_rate": [0.1, 0.05],
    "n_estimators": [200, 400]
}

kf = KFold(n_splits=3, shuffle=True, random_state=42)

results = []

for num_leaves, lr, n_est in product(
        param_grid["num_leaves"],
        param_grid["learning_rate"],
        param_grid["n_estimators"]):

    model = Pipeline(steps=[
        ("preprocessing", tree_preprocessor),
        ("regressor", lgb.LGBMRegressor(
            num_leaves=num_leaves,
            learning_rate=lr,
            n_estimators=n_est,
            max_depth=-1,
            random_state=42,
            n_jobs=4,
            verbosity=-1
        ))
    ])

    cv_rmse = -cross_val_score(
        model,
        X_train,
        y_train,
        cv=kf,
        scoring="neg_root_mean_squared_error"
    )

    mean_rmse = cv_rmse.mean()

    print(f"Leaves={num_leaves}, LR={lr}, Est={n_est} → RMSE={mean_rmse:.4f}")

    results.append((num_leaves, lr, n_est, mean_rmse))

Leaves=31, LR=0.1, Est=200 → RMSE=0.8278
Leaves=31, LR=0.1, Est=400 → RMSE=0.8115
Leaves=31, LR=0.05, Est=200 → RMSE=0.8421
Leaves=31, LR=0.05, Est=400 → RMSE=0.8258
Leaves=63, LR=0.1, Est=200 → RMSE=0.8129
Leaves=63, LR=0.1, Est=400 → RMSE=0.7978
Leaves=63, LR=0.05, Est=200 → RMSE=0.8253
Leaves=63, LR=0.05, Est=400 → RMSE=0.8108
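
The loop stores each configuration in `results` but does not rank them. A small sketch of how the best configuration could be selected is shown below, using the mean CV RMSE values copied from the output above (rounded to four decimals as printed).

```python
# Grid-search results copied from the printed output:
# (num_leaves, learning_rate, n_estimators, mean CV RMSE)
results = [
    (31, 0.1, 200, 0.8278),
    (31, 0.1, 400, 0.8115),
    (31, 0.05, 200, 0.8421),
    (31, 0.05, 400, 0.8258),
    (63, 0.1, 200, 0.8129),
    (63, 0.1, 400, 0.7978),
    (63, 0.05, 200, 0.8253),
    (63, 0.05, 400, 0.8108),
]

# Rank configurations from lowest to highest mean CV RMSE.
ranked = sorted(results, key=lambda r: r[3])
for num_leaves, lr, n_est, rmse in ranked:
    print(f"Leaves={num_leaves}, LR={lr}, Est={n_est}, RMSE={rmse:.4f}")

best = ranked[0]
print("Best configuration:", best)
```

The best result sits at the upper edge of the grid for both `num_leaves` and `n_estimators`, so larger values may still improve the score.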


## 8. Overfitting Diagnostic (Train vs CV Error)

We evaluate whether the best LightGBM configuration
is overfitting by comparing training RMSE with cross-validation RMSE.

In [15]:
best_model = Pipeline(steps=[
    ("preprocessing", tree_preprocessor),
    ("regressor", lgb.LGBMRegressor(
        num_leaves=63,
        learning_rate=0.1,
        n_estimators=400,
        max_depth=-1,
        random_state=42,
        n_jobs=4,
        verbosity=-1
    ))
])

In [16]:
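from sklearn.metrics import mean_squared_error

# Mean CV RMSE of the best grid-search configuration
# (num_leaves=63, learning_rate=0.1, n_estimators=400), copied from the tuning output.
best_cv_rmse = 0.7978

best_model.fit(X_train, y_train)

train_preds = best_model.predict(X_train)
train_rmse = np.sqrt(mean_squared_error(y_train, train_preds))

print("Train RMSE (log):", train_rmse)
print("CV RMSE (log):", best_cv_rmse)
print("Gap:", best_cv_rmse - train_rmse)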

Train RMSE (log): 0.5585516976445868
CV RMSE (log): 0.7978
Gap: 0.23924830235541317


## 9. Regularized LightGBM (Variance Control)

We add regularization to reduce the overfitting observed in the tuned model (train RMSE ≈ 0.559 vs. CV RMSE ≈ 0.798).

In [17]:
regularized_model = Pipeline(steps=[
    ("preprocessing", tree_preprocessor),
    ("regressor", lgb.LGBMRegressor(
        num_leaves=63,
        learning_rate=0.1,
        n_estimators=400,
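        min_child_samples=50,   # require at least 50 rows per leaf
        # Note: LightGBM applies row subsampling only when subsample_freq > 0
        # (default 0), so subsample=0.8 has no effect as configured here.
        subsample=0.8,
        colsample_bytree=0.8,   # sample 80% of features per tree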
        random_state=42,
        n_jobs=4,
        verbosity=-1
    ))
])

kf = KFold(n_splits=3, shuffle=True, random_state=42)

reg_cv_rmse = -cross_val_score(
    regularized_model,
    X_train,
    y_train,
    cv=kf,
    scoring="neg_root_mean_squared_error"
)

print("Regularized LightGBM CV RMSE (log):", reg_cv_rmse.mean())
print("Std Dev:", reg_cv_rmse.std())

Regularized LightGBM CV RMSE (log): 0.8045971295183284
Std Dev: 0.005204331268911649
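
To put the change in context, the difference between the regularized and tuned models can be compared with the fold-to-fold spread. The numbers below are copied (rounded) from the outputs above. With only three folds this is a rough sanity check, not a significance test.

```python
tuned_cv_rmse = 0.7978        # best grid-search configuration
regularized_cv_rmse = 0.8046  # regularized model, mean over 3 folds
regularized_cv_std = 0.0052   # regularized model, std over 3 folds

delta = regularized_cv_rmse - tuned_cv_rmse
ratio = delta / regularized_cv_std

print(f"Increase in CV RMSE: {delta:.4f}")
print(f"Increase relative to fold std: {ratio:.2f}")
```

The increase is on the order of one fold-level standard deviation, so the two configurations are close. A per-fold comparison on the same splits would be more informative than comparing means.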


## 10. Model Selection Decision

We compared multiple LightGBM configurations using 3-fold CV on the training split of the 120k-row sample.

Best configuration:

- num_leaves = 63
- learning_rate = 0.1
- n_estimators = 400

Cross-Validation RMSE (log) ≈ 0.798

Observations:
- Increasing model capacity reduced bias and improved validation performance.
- The regularized variant (Section 9) slightly worsened validation RMSE (≈ 0.805 vs. ≈ 0.798).
- Although the model shows a noticeable train–validation gap, 
  validation performance remains strongest for this configuration.

Decision:
Select this LightGBM configuration as the final candidate model.

Next Step:
Train this configuration on the full dataset, excluding the rows held out as the test set to avoid leakage, and evaluate once on that held-out test set.