# Thesis Documentation for `04_tree_models.ipynb`

This document provides a detailed methodological justification for the steps in the `04_tree_models.ipynb` notebook. The focus of this stage is to train, tune, and evaluate advanced tree-based ensemble models and compare their performance against the established baselines.

## 4-A & 4-B: Workspace Initialization and Data Recreation

In [1]:
# 4-A: Imports & data reload
import pandas as pd
import numpy as np
from pathlib import Path
import joblib
from sklearn.model_selection import RepeatedKFold, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

DATA      = Path("../data/asteroids_clean.csv")
PREPROC_P = Path("../data/preprocess.pkl")

df         = pd.read_csv(DATA)
preprocess = joblib.load(PREPROC_P)

# 4-B: Rebuild X/y and split once
TARGET = "diameter"
DROP_ALWAYS = ["Unnamed: 0", "GM", "G", "IR", "extent",
               "UB", "BV", "spec_B", "spec_T", "name", "per_y"]

X = df.drop(columns=[TARGET] + DROP_ALWAYS, errors="ignore").copy()
y = df[TARGET].copy()
X["condition_code"] = X["condition_code"].astype("object")

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=RANDOM_STATE
)

# Thesis Justification
## Objective:
To prepare the workspace for training advanced models, ensuring complete consistency with the previous stages of the project.

## Methodology:
The environment is initialized by importing the necessary libraries, including the ensemble model classes (`RandomForestRegressor`, `GradientBoostingRegressor`) and hyperparameter tuning tools (`RandomizedSearchCV`). The cleaned data and the fitted `preprocess` object are loaded, and the data is partitioned into identical training and validation sets using the established `RANDOM_STATE`.

## Justification:
This setup is crucial for maintaining the scientific validity of the model comparison. Every model sees the same preprocessing definition and the same train/validation partition (fixed by `random_state`), so differences in validation scores can be attributed to the model algorithm and its hyperparameters rather than to a different split or a different feature treatment.

Load the clean data and the *fitted* `preprocess` object so we can bolt
tree models on top.  We import RandomForest & GradientBoosting plus
cross-validation helpers.


Mirror the exact preprocessing decisions from Step 2 so the data lines
up with `preprocess`.  The split stays identical (random_state=42) for
apples-to-apples comparisons.
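
The reproducibility claim above can be checked directly. The following is a minimal sketch on a toy frame (the column `a` and its values are hypothetical stand-ins, not the asteroid data): two independent calls to `train_test_split` with the same seed select the same validation rows, provided the input rows and their order are unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42

# Toy frame standing in for the cleaned table (illustrative values only).
toy = pd.DataFrame({"a": np.arange(50), "diameter": np.arange(50) * 0.5})
X_toy, y_toy = toy.drop(columns=["diameter"]), toy["diameter"]

# Two independent calls with the same seed, as if made in two notebooks.
split_1 = train_test_split(X_toy, y_toy, test_size=0.20, random_state=RANDOM_STATE)
split_2 = train_test_split(X_toy, y_toy, test_size=0.20, random_state=RANDOM_STATE)

# Element 1 of each result is the validation feature frame.
same_val_rows = split_1[1].index.equals(split_2[1].index)
print("Validation rows identical:", same_val_rows)
print("Validation size:", len(split_1[1]))
```

The guarantee only holds while the CSV keeps the same rows in the same order; if an upstream cleaning step changes, the seed alone no longer reproduces the partition.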

## 4-C: Random Forest Regressor with Minimal Tuning

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.base import clone

rf_base = RandomForestRegressor(
    n_estimators=300,
    max_depth=None,
    min_samples_leaf=2,
    n_jobs=-1,
    random_state=RANDOM_STATE
)

# clone() returns an *unfitted* copy of the loaded preprocessor (same settings,
# no learned state); it is refit on X_train when the pipeline is fitted
rf_pipeline = Pipeline([
    ("prep", clone(preprocess)),
    ("rf",   rf_base)
])

rf_pipeline.fit(X_train, y_train)
y_pred_rf = rf_pipeline.predict(X_val)

def metrics(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(((y_true - y_pred) ** 2).mean())
    r2  = r2_score(y_true, y_pred)
    return {"MAE": mae, "RMSE": rmse, "R²": r2}

scores_rf = metrics(y_val, y_pred_rf)
scores_rf


{'MAE': 0.6666082256443401,
 'RMSE': np.float64(3.064663596902431),
 'R²': 0.7911703824817187}

# Thesis Justification
## Objective:
To evaluate the performance of a Random Forest model, a powerful ensemble method capable of capturing non-linear relationships and feature interactions.

## Methodology:
A `RandomForestRegressor` is instantiated with a small set of hand-picked hyperparameters (no search). The model is placed into a `Pipeline` together with an unfitted `clone` of the loaded preprocessor. The pipeline is trained on the training set, and its performance is evaluated on the validation set.

## Justification:

 - Model Choice: Linear models assume an additive, linear relationship between features and the target. The physics of asteroids may involve complex, non-linear interactions. Random Forest, an ensemble of decision trees, is an excellent next step as it makes no such assumptions and can model these intricate patterns effectively.

 - Hyperparameters: The chosen parameters (`n_estimators=300`, `min_samples_leaf=2`) represent a sensible starting point, creating a reasonably large and regularized forest without extensive tuning. `n_jobs=-1` is used to parallelize training and accelerate the process.

 - `clone(preprocess)`: Using `clone` is a critical best practice. It creates a fresh, unfitted copy of the preprocessing pipeline structure, which is then fitted inside the main `rf_pipeline` when `.fit()` is called. This ensures that the model can be treated as a single, self-contained object, which is essential for cross-validation and hyperparameter tuning.
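
To illustrate what `clone` does, here is a minimal sketch that uses a `StandardScaler` as a stand-in for the loaded `preprocess` object (the real preprocessor is assumed to be a more complex transformer, and `X_demo` is made-up data). The clone keeps the configuration but drops everything learned during fitting.

```python
import numpy as np
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0], [2.0], [3.0]])

fitted = StandardScaler().fit(X_demo)  # stand-in for the loaded `preprocess`
fresh = clone(fitted)                  # same hyperparameters, no learned state

# Fitted estimators expose learned attributes with a trailing underscore.
fitted_has_state = hasattr(fitted, "mean_")
fresh_has_state = hasattr(fresh, "mean_")
same_config = fresh.get_params() == fitted.get_params()

print(fitted_has_state, fresh_has_state, same_config)
```

Because the clone is refit inside the pipeline, the preprocessing statistics are learned only from the data passed to `.fit()`, which is what keeps held-out folds untouched during cross-validation.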

## 4-D: GradientBoostingRegressor with Hyperparameter Tuning

In [3]:
gb = GradientBoostingRegressor(random_state=RANDOM_STATE)

param_dist = {
    "gb__n_estimators":  [200, 400, 600],
    "gb__learning_rate": [0.03, 0.05, 0.1],
    "gb__max_depth":     [2, 3, 4],
    "gb__subsample":     [0.6, 0.8, 1.0]
}

gb_pipeline = Pipeline([
    ("prep", preprocess),   # the search clones the whole pipeline before each fit
    ("gb",   gb)
])

cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=RANDOM_STATE)

search = RandomizedSearchCV(
    gb_pipeline,
    param_distributions=param_dist,
    n_iter=20,
    scoring="neg_root_mean_squared_error",
    cv=cv,
    n_jobs=-1,
    random_state=RANDOM_STATE,
    verbose=0
)

search.fit(X_train, y_train)
print("Best params:", search.best_params_)
best_gb = search.best_estimator_

y_pred_gb = best_gb.predict(X_val)
scores_gb = metrics(y_val, y_pred_gb)
scores_gb

Best params: {'gb__subsample': 0.8, 'gb__n_estimators': 600, 'gb__max_depth': 2, 'gb__learning_rate': 0.1}


{'MAE': 0.6705381806817967,
 'RMSE': np.float64(2.7757099648405865),
 'R²': 0.8286931841473344}

# Thesis Justification
## Objective:
To train a Gradient Boosting model and systematically search for an optimal set of hyperparameters to maximize its predictive performance.

## Methodology:
A `GradientBoostingRegressor` is combined with the preprocessor in a pipeline. A `RandomizedSearchCV` is configured to explore a distribution of key hyperparameters using a robust cross-validation strategy. The best-performing model from this search is then evaluated on the hold-out validation set.

## Justification:

 - Model Choice: Gradient Boosting is another widely used ensemble method. Unlike Random Forest, which builds its trees independently and averages them, Gradient Boosting builds trees sequentially: each new tree is fitted to the residual errors of the ensemble built so far, and its contribution is scaled by the learning rate. This often leads to higher predictive accuracy on tabular data, making it a logical model to test.

 - `RandomizedSearchCV` vs. `GridSearchCV`: For a hyperparameter space of this size (3 × 3 × 3 × 3 = 81 combinations), an exhaustive `GridSearchCV` would need 81 × 10 = 810 cross-validation fits under the CV scheme described in the next point, each training up to 600 sequential trees. `RandomizedSearchCV` samples a fixed number of combinations instead (`n_iter=20`, i.e. 200 fits), roughly a quarter of the cost. Randomized search frequently finds configurations that perform comparably to the grid optimum, particularly when only a few hyperparameters strongly affect the score, although sampling 20 of 81 candidates does not guarantee that the best grid point was visited.

 - Cross-Validation (`RepeatedKFold`): Standard k-fold cross-validation can have high variance depending on how the folds are split. `RepeatedKFold` (with 5 splits and 2 repeats) mitigates this by running the k-fold process multiple times with different random shuffles. This provides a more stable and reliable estimate of a model's true performance during the search.

 - Scoring Metric: `neg_root_mean_squared_error` is chosen as the scoring metric for the search. It is the negative of RMSE, used because scikit-learn's search functions are designed to maximize a score. Maximizing negative RMSE is equivalent to minimizing RMSE.
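
The sign convention of the scoring metric can be seen in a small sketch. The data and the `DummyRegressor` below are illustrative assumptions, not part of the notebook; the point is only that scikit-learn reports this scorer as a non-positive number and that negating it recovers the RMSE.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, used only to exercise the scorer.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
y_demo = rng.normal(size=60)

scores = cross_val_score(
    DummyRegressor(strategy="mean"),
    X_demo,
    y_demo,
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)

# Scores are negated RMSE values, so flipping the sign gives RMSE per fold.
rmse_per_fold = -scores
print("Scores (higher is better):", scores.round(3))
print("RMSE per fold:", rmse_per_fold.round(3))
```

The same convention applies to the search object: `search.best_score_` is reported on the negated scale, so `-search.best_score_` is the mean cross-validated RMSE of the selected configuration.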

**GradientBoostingRegressor** captures non-linearities via additive trees.  
We run a *RandomizedSearchCV* (20 sampled combinations × 10 CV fits each, from 5 folds × 2 repeats) over key knobs:

| Hyper-param | Effect |
|-------------|--------|
| `n_estimators` / `learning_rate` | trade-off bias vs variance |
| `max_depth` | tree complexity |
| `subsample` | stochastic boosting for extra regularisation |

The best model is evaluated on the same validation set.
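
The search budget described above can be verified with a short sketch. It rebuilds the parameter lists and CV object from the cell in this section and counts fits; `N_ITER` mirrors `n_iter=20`, and the final refit of the best configuration on the full training set is not included in the totals.

```python
from sklearn.model_selection import ParameterGrid, RepeatedKFold

param_dist = {
    "gb__n_estimators":  [200, 400, 600],
    "gb__learning_rate": [0.03, 0.05, 0.1],
    "gb__max_depth":     [2, 3, 4],
    "gb__subsample":     [0.6, 0.8, 1.0],
}
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
N_ITER = 20  # number of candidates sampled by the randomized search

n_grid = len(ParameterGrid(param_dist))  # size of the exhaustive grid
fits_per_candidate = cv.get_n_splits()   # n_splits * n_repeats

grid_fits = n_grid * fits_per_candidate
random_fits = N_ITER * fits_per_candidate

print(f"Grid: {n_grid} candidates -> {grid_fits} fits")
print(f"Randomized: {N_ITER} candidates -> {random_fits} fits")
```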


## 4-E & 4-F: Final Comparison and Model Persistence

### 4-E: Re-establish baseline scores

In [4]:
# ╔══════════════════════════════════════════════════════════════╗
# ║  Re-establish baseline scores (Dummy + LinearRegression)     ║
# ╚══════════════════════════════════════════════════════════════╝
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.base import clone

# ----- Dummy (median) -----
dummy_pipe = Pipeline([
    ("prep", clone(preprocess)),        # unfitted copy, refit on X_train below
    ("reg",  DummyRegressor(strategy="median"))
])
dummy_pipe.fit(X_train, y_train)
y_pred_dummy = dummy_pipe.predict(X_val)
scores_dummy = metrics(y_val, y_pred_dummy)

# ----- LinearRegression -----
lin_pipe = Pipeline([
    ("prep", clone(preprocess)),
    ("reg",  LinearRegression())
])
lin_pipe.fit(X_train, y_train)
y_pred_lin = lin_pipe.predict(X_val)
scores_lin = metrics(y_val, y_pred_lin)


### Create comparison table

In [5]:
import pandas as pd

results = pd.DataFrame(
    [scores_dummy, scores_lin, scores_rf, scores_gb],
    index=["Dummy", "LinearReg", "RandomForest", "GradBoost"]
).round(3)

results


| Model | MAE | RMSE | R² |
|-------|-----|------|----|
| Dummy | 2.656 | 6.864 | -0.047 |
| LinearReg | 2.227 | 9.304 | -0.925 |
| RandomForest | 0.667 | 3.065 | 0.791 |
| GradBoost | 0.671 | 2.776 | 0.829 |


Put every model’s MAE / RMSE / R² side-by-side.  
Typical pattern you should see:

* **RandomForest** → big drop in both MAE & RMSE, R² positive.  
* **GradBoost**   → often edges out RF after tuning.

If either tree model *fails* to beat the Dummy baseline, double-check
that `preprocess` is the *fitted* version and that target/leak issues
aren’t creeping in.
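
The negative R² of the Dummy baseline is expected rather than a sign of a bug: a constant prediction can never beat the validation mean in squared error, so its R² is at most zero, and it is strictly negative whenever the training median differs from the validation mean. The sketch below shows this on synthetic data; the log-normal target is an assumption chosen to mimic a right-skewed quantity and is not drawn from the asteroid table.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic right-skewed target (illustrative assumption only).
rng = np.random.default_rng(42)
y_tr = rng.lognormal(mean=1.0, sigma=1.0, size=500)
y_va = rng.lognormal(mean=1.0, sigma=1.0, size=200)

# DummyRegressor ignores the features, so placeholders are enough.
X_tr = np.zeros((500, 1))
X_va = np.zeros((200, 1))

dummy = DummyRegressor(strategy="median").fit(X_tr, y_tr)
pred = dummy.predict(X_va)

r2_dummy = r2_score(y_va, pred)
mae_dummy = mean_absolute_error(y_va, pred)
print(f"Median baseline: R2 = {r2_dummy:.3f}, MAE = {mae_dummy:.3f}")
```

On a skewed target the median sits well below the mean, which is good for MAE but costs squared error, so a median baseline is a fair reference for MAE and a slightly pessimistic one for RMSE and R².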


### 4-F: Save the best model & commit

In [6]:
import joblib, pathlib
joblib.dump(best_gb, pathlib.Path("../data/model_gradboost.pkl"))


['../data/model_gradboost.pkl']

# Thesis Justification
## Objective:
To create a final, comprehensive comparison of all evaluated models and to save the best-performing model for interpretation and future use.

## Methodology:
The baseline models are re-run within the notebook to ensure a fair comparison. The scores from all four models (`Dummy`, `LinearRegression`, `RandomForest`, `GradientBoosting`) are compiled into a single pandas DataFrame. The best model from the search (`best_gb`) is then serialized and saved to a file using `joblib`.

## Justification:

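 - Final Comparison Table: The results table provides the evidence for model selection. Both tree-based models improve sharply on the baselines (MAE of about 0.67 versus 2.2 to 2.7, R² of 0.79 to 0.83 versus negative values), which strongly indicates non-linear relationships and feature interactions in the data that the ensemble methods capture. `LinearRegression` achieves a lower MAE than the Dummy baseline but a higher RMSE (9.304 versus 6.864), so its error is dominated by a few large misses. Between the two ensembles, the tuned `GradientBoostingRegressor` has the lower RMSE (2.776 versus 3.065) and the higher R² (0.829 versus 0.791), while MAE is essentially tied (0.671 versus 0.667). It is selected on the basis of RMSE, the metric optimized during the search.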

 - Model Persistence: Saving the `best_gb` object is the final step of the modeling phase. This object is not just a model; it is a complete, fitted pipeline that encapsulates all preprocessing and prediction logic. This single file can now be loaded in the final notebook for model interpretation (e.g., feature importance, SHAP analysis) and could be deployed in a production environment to make predictions on new asteroid data.
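
A minimal sketch of the save-and-reload round trip follows. It uses a small stand-in pipeline and a temporary file (`model_demo.pkl` is a hypothetical name, and the `StandardScaler` step is an assumption in place of the real preprocessor) to show that a reloaded pipeline reproduces the original predictions without any separate preprocessing step.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a non-linear signal, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 3))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.1, size=80)

# Stand-in for `best_gb`: preprocessing and model in one fitted object.
pipe = Pipeline([
    ("prep", StandardScaler()),
    ("gb", GradientBoostingRegressor(n_estimators=20, random_state=0)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_demo.pkl"  # hypothetical file name
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline accepts raw features and returns the same predictions.
same_predictions = np.allclose(pipe.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

Pickled estimators are generally only safe to load with the scikit-learn version that created them, so recording the library versions alongside the file is a sensible precaution.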

In [None]:
!git add notebooks/04_tree_models.ipynb data/model_gradboost.pkl
!git commit -m "Step 4: RandomForest + tuned GradientBoost with results table"
!git push
