# Step 4 – Tree ensembles


## 4A ― Imports & data reload (code)


In [1]:
# Step 4 – Tree ensembles
import pandas as pd
import numpy as np
from pathlib import Path
import joblib

from sklearn.model_selection import RepeatedKFold, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Same dataset & fitted preprocessor
DATA      = Path("../data/asteroids_clean.csv")
PREPROC_P = Path("../data/preprocess.pkl")

df         = pd.read_csv(DATA)
preprocess = joblib.load(PREPROC_P)


Load the clean data and the *fitted* `preprocess` object saved in Step 2,
so the tree models reuse the same transformations.  We import
`RandomForestRegressor` and `GradientBoostingRegressor` plus the
cross-validation helpers.
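
As a minimal sketch of why reloading works, the round trip below saves and restores a fitted transformer with `joblib`. The `StandardScaler`, the toy array, and the file name `toy_preprocess.pkl` are stand-ins chosen for illustration; the real `preprocess` object is assumed to behave the same way.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real preprocessor (assumption, not the Step 2 object)
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler().fit(X_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_preprocess.pkl"  # hypothetical file name
    joblib.dump(scaler, path)
    restored = joblib.load(path)

# The learned statistics survive the round trip, so transform() needs no refit
same_mean = np.allclose(restored.mean_, scaler.mean_)
same_output = np.allclose(restored.transform(X_toy), scaler.transform(X_toy))
print(same_mean, same_output)
```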


## 4B ― Rebuild X/y and split once (code)

In [2]:
TARGET = "diameter"
DROP_ALWAYS = ["Unnamed: 0", "GM", "G", "IR", "extent",
               "UB", "BV", "spec_B", "spec_T", "name", "per_y"]

X = df.drop(columns=[TARGET] + DROP_ALWAYS, errors="ignore").copy()
y = df[TARGET].copy()
X["condition_code"] = X["condition_code"].astype("object")

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=RANDOM_STATE
)


Mirror the exact preprocessing decisions from Step 2 so the data lines
up with `preprocess`.  The split stays identical (random_state=42) for
apples-to-apples comparisons.
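
A quick way to convince yourself that the split is reproducible is to run it twice with the same seed. The sketch below uses a small made-up frame (`X_toy`, `y_toy`), not the asteroid data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for X and y
X_toy = pd.DataFrame({"a": np.arange(50), "b": np.arange(50) * 2.0})
y_toy = pd.Series(np.arange(50) * 0.5, name="target")

# Two independent calls with the same random_state
split_1 = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42)
split_2 = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42)

# Element 1 of each result is the validation feature frame
same_val_rows = split_1[1].index.equals(split_2[1].index)
val_size = len(split_1[1])
print(same_val_rows, val_size)
```

The same rows land in the validation set both times, which is what makes scores comparable across notebooks as long as the input frame has the same rows in the same order.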

## 4C ― Random Forest with minimal tuning (code)

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.base import clone

rf_base = RandomForestRegressor(
    n_estimators=300,
    max_depth=None,
    min_samples_leaf=2,
    n_jobs=-1,
    random_state=RANDOM_STATE
)

# clone() returns an *unfitted* copy with the same settings; the pipeline
# refits it on X_train and leaves the loaded `preprocess` untouched
rf_pipeline = Pipeline([
    ("prep", clone(preprocess)),
    ("rf",   rf_base)
])

rf_pipeline.fit(X_train, y_train)
y_pred_rf = rf_pipeline.predict(X_val)

def metrics(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(((y_true - y_pred) ** 2).mean())
    r2  = r2_score(y_true, y_pred)
    return {"MAE": mae, "RMSE": rmse, "R²": r2}

scores_rf = metrics(y_val, y_pred_rf)
scores_rf


{'MAE': 0.6666082256443399,
 'RMSE': np.float64(3.064663596902431),
 'R²': 0.7911703824817187}
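
The pipeline above wraps `preprocess` in `clone(...)`. A small sketch of what `sklearn.base.clone` does, using a toy `StandardScaler` as an assumed stand-in for the real preprocessor: the copy keeps the constructor parameters but drops everything learned during `fit`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler

# Toy fitted transformer (assumption: stands in for `preprocess`)
fitted = StandardScaler().fit(np.array([[0.0], [10.0]]))
fresh = clone(fitted)

fitted_has_state = hasattr(fitted, "mean_")  # learned attribute exists
fresh_has_state = hasattr(fresh, "mean_")    # clone has not been fitted
same_params = fitted.get_params() == fresh.get_params()
print(fitted_has_state, fresh_has_state, same_params)
```

Because the clone is unfitted, `rf_pipeline.fit(X_train, y_train)` learns the preprocessing statistics from the training split only.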

## 4D ― Gradient-Boosting with coarse hyper-search (code)

In [7]:
gb = GradientBoostingRegressor(random_state=RANDOM_STATE)

param_dist = {
    "gb__n_estimators":  [200, 400, 600],
    "gb__learning_rate": [0.03, 0.05, 0.1],
    "gb__max_depth":     [2, 3, 4],
    "gb__subsample":     [0.6, 0.8, 1.0]
}

gb_pipeline = Pipeline([
    ("prep", preprocess),
    ("gb",   gb)
])

cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=RANDOM_STATE)

search = RandomizedSearchCV(
    gb_pipeline,
    param_distributions=param_dist,
    n_iter=20,
    scoring="neg_root_mean_squared_error",
    cv=cv,
    n_jobs=-1,
    random_state=RANDOM_STATE,
    verbose=0
)

search.fit(X_train, y_train)
print("Best params:", search.best_params_)
best_gb = search.best_estimator_

y_pred_gb = best_gb.predict(X_val)
scores_gb = metrics(y_val, y_pred_gb)
scores_gb


Best params: {'gb__subsample': 0.8, 'gb__n_estimators': 600, 'gb__max_depth': 2, 'gb__learning_rate': 0.1}


{'MAE': 0.6705381806817967,
 'RMSE': np.float64(2.7757099648405865),
 'R²': 0.8286931841473344}

**GradientBoostingRegressor** captures non-linearities via additive trees.  
We run a *RandomizedSearchCV* that samples 20 of the 81 possible
combinations and scores each with 10 CV fits (5-fold, repeated twice)
over key knobs:

| Hyper-param | Effect |
|-------------|--------|
| `n_estimators` / `learning_rate` | a lower learning rate shrinks each tree's contribution and usually needs more trees |
| `max_depth` | tree complexity |
| `subsample` | stochastic boosting for extra regularisation |

The best model is refit on the full training split and evaluated on the
same validation set.
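
To see the size of this search before launching it, the budget can be counted without fitting anything. This is a sketch that reuses the grid values and CV settings from the cell above; the variable names `grid_size`, `sampled`, and `total_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler, RepeatedKFold

param_dist = {
    "gb__n_estimators":  [200, 400, 600],
    "gb__learning_rate": [0.03, 0.05, 0.1],
    "gb__max_depth":     [2, 3, 4],
    "gb__subsample":     [0.6, 0.8, 1.0],
}

# Size of the full grid: 3 * 3 * 3 * 3 combinations
grid_size = len(ParameterGrid(param_dist))

# The candidates a randomized search with n_iter=20 would draw
sampled = list(ParameterSampler(param_dist, n_iter=20, random_state=42))

# 5 folds repeated twice gives 10 train/validation splits per candidate
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
n_splits = cv.get_n_splits()

total_fits = len(sampled) * n_splits  # excludes the final refit on all training data
print(grid_size, len(sampled), n_splits, total_fits)
```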


## 4E ― Compare all models so far (code)

In [9]:
# ╔══════════════════════════════════════════════════════════════╗
# ║  Re-establish baseline scores (Dummy + LinearRegression)     ║
# ╚══════════════════════════════════════════════════════════════╝
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.base import clone

# ----- Dummy (median) -----
dummy_pipe = Pipeline([
    ("prep", clone(preprocess)),        # unfitted copy, refit on X_train
    ("reg",  DummyRegressor(strategy="median"))
])
dummy_pipe.fit(X_train, y_train)
y_pred_dummy = dummy_pipe.predict(X_val)
scores_dummy = metrics(y_val, y_pred_dummy)

# ----- LinearRegression -----
lin_pipe = Pipeline([
    ("prep", clone(preprocess)),
    ("reg",  LinearRegression())
])
lin_pipe.fit(X_train, y_train)
y_pred_lin = lin_pipe.predict(X_val)
scores_lin = metrics(y_val, y_pred_lin)


In [10]:
import pandas as pd

results = pd.DataFrame(
    [scores_dummy, scores_lin, scores_rf, scores_gb],
    index=["Dummy", "LinearReg", "RandomForest", "GradBoost"]
).round(3)

results


| Model | MAE | RMSE | R² |
|-------|-----|------|----|
| Dummy | 2.656 | 6.864 | -0.047 |
| LinearReg | 2.227 | 9.304 | -0.925 |
| RandomForest | 0.667 | 3.065 | 0.791 |
| GradBoost | 0.671 | 2.776 | 0.829 |


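Put every model’s MAE / RMSE / R² side by side.  
In this run:

* **RandomForest** → MAE drops from 2.656 (Dummy) to 0.667 and RMSE from 6.864 to 3.065; R² turns positive (0.791).  
* **GradBoost**   → edges out RF on RMSE (2.776) and R² (0.829) after tuning, with a near-identical MAE (0.671).  
* **LinearReg**   → beats Dummy on MAE (2.227) but not on RMSE (9.304) or R² (-0.925), which suggests a few very large errors.

If either tree model *fails* to beat the Dummy baseline, double-check
that `preprocess` is the version saved in Step 2 and that target/leak
issues aren’t creeping in.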
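
MAE and RMSE can rank models differently because RMSE squares the errors before averaging. The toy numbers below are invented for illustration: model B is perfect except for one large miss, so it wins on MAE but loses on RMSE and R².

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

def toy_metrics(y_true, y_pred):
    """Same three metrics as the notebook's metrics() helper."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = float(np.sqrt(((y_true - y_pred) ** 2).mean()))
    r2 = r2_score(y_true, y_pred)
    return {"MAE": mae, "RMSE": rmse, "R2": r2}

y_true = np.arange(1.0, 11.0)   # hypothetical targets 1..10

pred_a = y_true + 2.0           # model A: off by 2 everywhere
pred_b = y_true.copy()          # model B: perfect ...
pred_b[-1] += 15.0              # ... except one miss of 15

scores_a = toy_metrics(y_true, pred_a)
scores_b = toy_metrics(y_true, pred_b)
print(scores_a)
print(scores_b)
```

A negative R² means the squared error is larger than that of always predicting the mean of the true values.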


## 4F ― Save the best model & commit (code)

In [11]:
joblib.dump(best_gb, "../data/model_gradboost.pkl")


['../data/model_gradboost.pkl']

Persist the tuned GradientBoost model so future notebooks (or a web API)
can load it quickly:

```python
model = joblib.load("../data/model_gradboost.pkl")
```


In [None]:
# The notebook runs from notebooks/, so paths are relative to that directory.
# The repo-root-relative path 'notebooks/04_tree_models.ipynb' fails with
# "fatal: pathspec ... did not match any files".
!git add 04_tree_models.ipynb ../data/model_gradboost.pkl
!git commit -m "Step 4: RandomForest + tuned GradientBoost with results table"
!git push
