# Final Model Selection – CatBoost, XGBoost, XGBoost RF (MC-CV, Optuna)

This notebook runs **Monte Carlo Cross-Validation (MC-CV)** to select the final model on a tabular feature set, comparing:
- **CatBoost** (gradient boosting with native handling of categorical features)
- **XGBoost** (boosted trees)
- **XGBoost RF mode** (random forest-style XGBoost)

It is designed to work with the `7_final_model/outputs/.../*_train_final_features.csv` tables built from
`model_data`, BupaR, and DTW features.

**Key steps implemented in this notebook:**
- MC-CV performance comparison across **CatBoost**, **XGBoost**, and **XGBoost RF mode**
- Selection of the best base model by **mean Recall** (from MC-CV)
- **Optuna** hyperparameter tuning for the selected model (within the 2016–2018 training window)
- **Temporal probability calibration**: train on 2016–2017, calibrate on 2018 using isotonic regression (Brier score reported)
- Final model fitting on the training window and export of:
  - Uncalibrated tuned model (`.joblib`)
  - Calibrated model (`.joblib`)
  - Native model formats (`.cbm` / `.json` for CatBoost, `.json` booster for XGBoost)
- All artifacts are saved locally and to S3 under `gold/final_model/cohort_name=.../age_band=.../event_year=train/models/`.
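
As a rough illustration of the MC-CV step listed above, the sketch below draws repeated stratified 80/20 splits with scikit-learn's `StratifiedShuffleSplit` on synthetic labels. The toy target and the `mc_cv_splits` helper are assumptions made for this example and are not part of the pipeline.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit


def mc_cv_splits(y, n_splits=5, train_size=0.8, seed=1997):
    """Yield (train_idx, test_idx) pairs for Monte Carlo cross-validation."""
    sss = StratifiedShuffleSplit(
        n_splits=n_splits, train_size=train_size, random_state=seed
    )
    # Only y drives the stratification, so a placeholder X is enough here.
    yield from sss.split(np.zeros((len(y), 1)), y)


# Toy binary target with a 10% positive rate (illustrative only)
y_toy = np.array([1] * 10 + [0] * 90)
splits = list(mc_cv_splits(y_toy))

for train_idx, test_idx in splits:
    # Each split is an independent random draw that preserves the class ratio
    print(len(train_idx), len(test_idx), y_toy[test_idx].mean())
```

Unlike k-fold CV, each split is drawn independently, so the held-out sets of different splits can overlap.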

## 1. Configure Cohort and Load Final Feature Table

In [None]:
import os
import sys
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Ensure project root is importable so we can reuse shared helpers
PROJECT_ROOT = Path(os.getenv("PROJECT_ROOT", ".")).resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

# Optional: plug in cohort/age-band here
COHORT_NAME = "opioid_ed"      # e.g. "opioid_ed" or "non_opioid_ed"
AGE_BAND = "0-12"              # e.g. "0-12", "13-24", "65-74"
SPLITS = 200                    # MC-CV splits for final model
TRAIN_PROP = 0.8                # train proportion per split
RANDOM_SEED = 1997

AGE_BAND_FNAME = AGE_BAND.replace("-", "_")  # filesystem-safe form, e.g. "0_12"

final_features_path = (
    PROJECT_ROOT
    / "7_final_model"
    / "outputs"
    / COHORT_NAME
    / AGE_BAND_FNAME
    / f"{COHORT_NAME}_{AGE_BAND_FNAME}_train_final_features.csv"
)

print(f"Project root: {PROJECT_ROOT}")
print(f"Final features: {final_features_path}")
if not final_features_path.exists():
    raise FileNotFoundError(f"Final feature table not found: {final_features_path}")

full_df = pd.read_csv(final_features_path)
print(f"Loaded final feature table: {full_df.shape[0]} rows, {full_df.shape[1]} columns")

In [None]:
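# Split into X/y and run basic diagnostics

if "target" not in full_df.columns:
    raise ValueError("Expected a 'target' column in final features table")

# Assumption: identifier and date columns are row keys, not predictors.
# They stay in full_df (needed later for the temporal split) but not in X.
ID_COLS = ["mi_person_key", "event_date", "event_year"]
X = full_df.drop(columns=["target", *ID_COLS], errors="ignore")
y = full_df["target"].astype(int)

print("Target distribution:")
print(y.value_counts(normalize=True))

# Identify categorical vs numeric features (simple heuristic: object dtype = categorical)
cat_cols = [c for c in X.columns if X[c].dtype == "object"]
num_cols = [c for c in X.columns if c not in cat_cols]

# Cast categoricals to pandas "category": CatBoost reads them via cat_features,
# and XGBoost accepts them when enable_categorical=True (it rejects object columns).
X = X.astype({c: "category" for c in cat_cols})

print(f"Categorical features: {len(cat_cols)}")
print(f"Numeric features: {len(num_cols)}")

# Basic sanity check
if len(np.unique(y)) != 2:
    raise ValueError("Final model currently assumes binary target")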

Rows: 454,453
Columns: 107
$ event_date            <dttm> 2017-03-07, 2017-03-07, 2017-03-10, 2017-03-10,…
$ event_year            <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, …
$ mi_person_key         <chr> "1000000185", "1000000185", "1000000185", "10000…
$ drug_name             <chr> "tarka", "atorvastatin_calcium", "azithromycin",…
$ target                <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ pattern_1             <chr> "0", "9efb483a", "0", "0", "0", "0"…

## 2. Define Models (CatBoost, XGBoost, XGBoost RF)

In [None]:

from catboost import CatBoostClassifier
from xgboost import XGBClassifier
import optuna

RANDOM_STATE = RANDOM_SEED

MODEL_PARAMS = {
    "catboost": {
        "iterations": 1000,
        "learning_rate": 0.1,
        "depth": 6,
        "loss_function": "Logloss",
        "eval_metric": "Logloss",
        "verbose": False,
        "random_seed": RANDOM_STATE,
    },
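    "xgboost": {
        "n_estimators": 500,
        "max_depth": 6,
        "learning_rate": 0.05,
        "subsample": 0.9,
        "colsample_bytree": 0.8,
        "random_state": RANDOM_STATE,
        "n_jobs": -1,
        "objective": "binary:logistic",
        "tree_method": "hist",
        "enable_categorical": True,  # accept pandas "category" columns
    },
    "xgboost_rf": {
        "n_estimators": 500,
        "max_depth": 6,
        "learning_rate": 0.05,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        "random_state": RANDOM_STATE,
        "n_jobs": -1,
        "objective": "binary:logistic",
        "tree_method": "hist",
        "enable_categorical": True,  # accept pandas "category" columns
        # NOTE: with num_parallel_tree=1 the trees are still fit sequentially
        # (boosting); the RF flavour comes only from row/column subsampling.
        "num_parallel_tree": 1,
    },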
}

def make_model(name: str):
    if name == "catboost":
        return CatBoostClassifier(**MODEL_PARAMS[name])
    elif name in ("xgboost", "xgboost_rf"):
        return XGBClassifier(**MODEL_PARAMS[name])
    else:
        raise ValueError(f"Unknown model: {name}")
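
For reference, XGBoost's documented random-forest setup grows many trees inside a single boosting round instead of one tree per round. The dictionary below is a hedged sketch of such a configuration for `XGBClassifier`; the name `XGB_RF_SKETCH`, the forest size of 500, and the `total_trees` helper are illustrative assumptions, not values used by this notebook.

```python
# Hypothetical RF-style settings for xgboost.XGBClassifier (not the notebook's values)
XGB_RF_SKETCH = {
    "n_estimators": 1,         # a single boosting round ...
    "num_parallel_tree": 500,  # ... that grows the whole forest at once
    "learning_rate": 1.0,      # no shrinkage, since there is only one round
    "subsample": 0.8,          # row subsampling per tree
    "colsample_bynode": 0.8,   # column subsampling per split
    "objective": "binary:logistic",
}


def total_trees(params: dict) -> int:
    """Trees grown by a config: boosting rounds times trees per round."""
    return params["n_estimators"] * params.get("num_parallel_tree", 1)


# Same tree count, different training regime: one parallel forest ...
print(total_trees(XGB_RF_SKETCH))
# ... versus 500 trees fit one after another (boosting)
print(total_trees({"n_estimators": 500, "num_parallel_tree": 1}))
```

Both configurations grow 500 trees, but only the first builds them independently in one round, which is what distinguishes a forest from a boosted ensemble.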


$learning_rate
[1] 0.2864326

$depth
[1] 6

$colsample_bylevel
[1] 0.7341253

$min_data_in_leaf
[1] 70

$l2_leaf_reg
[1] 1.72146

$iterations
[1] 1000

$grow_policy
[1] "Lossguide"

$boosting_type
[1] "Plain"

$bootstrap_type
[1] "MVS"

$early_stopping_rounds
[1] 50

$eval_metric
[1] "Logloss"

$random_seed
[1] 1997

$verbose
[1] 0

$age_band
[1] "65-74"



## 3. MC-CV Helper – Run One Split

In [None]:

from sklearn.metrics import roc_auc_score, log_loss, recall_score


def run_single_split(model_name: str, X, y, train_idx, test_idx):
    """Fit one model on one MC-CV split; return ROC AUC, log loss, and recall at a 0.5 cut-off."""
    model = make_model(model_name)

    X_train = X.iloc[train_idx]
    y_train = y.iloc[train_idx]
    X_test = X.iloc[test_idx]
    y_test = y.iloc[test_idx]

    if model_name == "catboost":
        # CatBoost can handle categorical indices directly
        cat_indices = [X.columns.get_loc(c) for c in cat_cols]
        model.set_params(cat_features=cat_indices)

    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = (y_prob >= 0.5).astype(int)

    return {
        "roc_auc": roc_auc_score(y_test, y_prob),
        "logloss": log_loss(y_test, y_prob),
        "recall": recall_score(y_test, y_pred),
    }
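
Recall in `run_single_split` depends on the fixed 0.5 cut-off applied to the predicted probabilities. The sketch below uses made-up labels and scores to show how recall moves with that threshold; `recall_at_threshold` is a hypothetical helper written for this illustration only.

```python
import numpy as np


def recall_at_threshold(y_true, y_prob, threshold=0.5):
    """Share of true positives whose predicted probability reaches the threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    positives = y_true == 1
    if positives.sum() == 0:
        return 0.0  # recall is undefined without positives; report 0 by convention
    return float((y_pred[positives] == 1).mean())


# Toy labels and scores (illustrative only)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.4, 0.2, 0.6, 0.3, 0.1, 0.1, 0.05]

for t in (0.5, 0.3, 0.1):
    print(f"threshold={t}: recall={recall_at_threshold(y_true, y_prob, t):.3f}")
```

Because lowering the threshold can only add predicted positives, recall never decreases as the cut-off drops, so comparing models by recall is only meaningful at a shared threshold.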

List of 14
 $ learning_rate        : num 0.286
 $ depth                : num 6
 $ colsample_bylevel    : num 0.734
 $ min_data_in_leaf     : num 70
 $ l2_leaf_reg          : num 1.72
 $ iterations           : num 1000
 $ grow_policy          : chr "Lossguide"
 $ boosting_type        : chr "Plain"
 $ bootstrap_type       : chr "MVS"
 $ early_stopping_rounds: num 50
 $ eval_metric          : chr "Logloss"
 $ random_seed          : num 1997
 $ verbose              : num 0
 $ loss_function        : chr "Logloss"


In [15]:
# Drop unnecessary columns
drop_cols <- c("mi_person_key", "event_date", "event_year", "group_id", "__index_level_0__")
train_data <- train_df %>% select(-all_of(drop_cols))
test_data  <- test_df %>% select(-all_of(drop_cols))

# Step 1: Coerce character columns to factors
char_cols <- names(train_data)[sapply(train_data, is.character)]
train_data[char_cols] <- lapply(train_data[char_cols], as.factor)
test_data[char_cols]  <- lapply(test_data[char_cols], as.factor)

# Step 2: Identify categorical columns (factors)
categorical_cols <- names(train_data)[sapply(train_data, is.factor)]

# Step 3: Coerce all non-categorical features to numeric
numeric_cols <- setdiff(names(train_data), c("target", categorical_cols))
train_data[numeric_cols] <- lapply(train_data[numeric_cols], function(x) as.numeric(as.character(x)))
test_data[numeric_cols]  <- lapply(test_data[numeric_cols], function(x) as.numeric(as.character(x)))


## 4. Run Monte Carlo Cross-Validation (All Three Models)

In [None]:
from collections import defaultdict

results = defaultdict(list)

sss = StratifiedShuffleSplit(
    n_splits=SPLITS,
    train_size=TRAIN_PROP,
    random_state=RANDOM_SEED,
)

for split_idx, (train_idx, test_idx) in enumerate(sss.split(X, y), start=1):
    print(f"Split {split_idx}/{SPLITS}")
    for model_name in ("catboost", "xgboost", "xgboost_rf"):
        metrics = run_single_split(model_name, X, y, train_idx, test_idx)
        metrics["split"] = split_idx
        metrics["model"] = model_name
        results[model_name].append(metrics)

mc_cv_df = pd.concat(
    [pd.DataFrame(v) for v in results.values()],
    ignore_index=True,
)

mc_cv_summary = (
    mc_cv_df
    # Select columns with a list; tuple-style selection fails on pandas >= 2.0
    .groupby("model")[["roc_auc", "logloss", "recall"]]
    .agg(["mean", "std"])
)

mc_cv_summary
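
When reading a per-model summary like this, it can help to look at the spread across splits alongside the mean. The sketch below works on a small made-up results table (the `toy` frame and its values are assumptions, not pipeline output) and adds a naive standard error of the mean.

```python
import pandas as pd

# Made-up per-split recall values for two hypothetical models
toy = pd.DataFrame(
    {
        "model": ["a", "a", "a", "b", "b", "b"],
        "recall": [0.60, 0.70, 0.80, 0.50, 0.50, 0.50],
    }
)

summary = toy.groupby("model")["recall"].agg(["mean", "std", "count"])
# Naive standard error: sample std divided by sqrt(number of splits)
summary["sem"] = summary["std"] / summary["count"] ** 0.5
print(summary)
```

MC-CV splits reuse the same rows, so they are not independent and this standard error is likely to understate the true uncertainty; it is best treated as a rough guide.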

## 5. Choose Best Model, Run Optuna Tuning, and Fit Final Model

In [None]:
# 1) Choose best base model by mean recall from MC-CV (you can also use logloss)

best_model_name = (
    mc_cv_df
    .groupby("model")["recall"]
    .mean()
    .sort_values(ascending=False)
    .index[0]
)

print(f"Best base model by mean recall: {best_model_name}")

# 2) Define Optuna objective for that best model
from sklearn.model_selection import StratifiedKFold


def optuna_objective(trial):
    if best_model_name == "catboost":
        params = {
            "iterations": trial.suggest_int("iterations", 400, 1200),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "depth": trial.suggest_int("depth", 4, 8),
            "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1e-2, 10.0, log=True),
            "loss_function": "Logloss",
            "eval_metric": "Logloss",
            "verbose": False,
            "random_seed": RANDOM_STATE,
        }
        model = CatBoostClassifier(**params)
        cat_indices = [X.columns.get_loc(c) for c in cat_cols]
        model.set_params(cat_features=cat_indices)
    else:
        # XGBoost / XGBoost RF share the search space except for row subsampling.
        # "subsample" is suggested once: a second suggest_* call with the same name
        # in one trial returns the first value and ignores the new range.
        subsample_range = (0.5, 0.9) if best_model_name == "xgboost_rf" else (0.6, 1.0)
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 200, 800),
            "max_depth": trial.suggest_int("max_depth", 3, 8),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "subsample": trial.suggest_float("subsample", *subsample_range),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
            "random_state": RANDOM_STATE,
            "n_jobs": -1,
            "objective": "binary:logistic",
            "tree_method": "hist",
            "enable_categorical": True,  # accept pandas "category" columns
        }
        model = XGBClassifier(**params)

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
    recalls = []

    for train_idx, val_idx in cv.split(X, y):
        X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
        model.fit(X_tr, y_tr)
        y_prob = model.predict_proba(X_val)[:, 1]
        y_pred = (y_prob >= 0.5).astype(int)
        recalls.append(recall_score(y_val, y_pred))

    return float(np.mean(recalls))


# 3) Run Optuna study to tune the selected model
N_TRIALS = 30  # adjust as needed for EC2 resources
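# Seed the sampler so the search is reproducible across runs
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=RANDOM_STATE),
)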
study.optimize(optuna_objective, n_trials=N_TRIALS)

print("Best Optuna trial:", study.best_trial.number)
print("Best Optuna recall:", study.best_trial.value)
print("Best Optuna params:", study.best_trial.params)

# 4) Fit final model on full data with tuned hyperparameters
if best_model_name == "catboost":
    tuned_params = {
        **MODEL_PARAMS["catboost"],
        **study.best_trial.params,
        "random_seed": RANDOM_STATE,
        "loss_function": "Logloss",
        "eval_metric": "Logloss",
        "verbose": False,
    }
    final_model = CatBoostClassifier(**tuned_params)
    cat_indices = [X.columns.get_loc(c) for c in cat_cols]
    final_model.set_params(cat_features=cat_indices)
else:
    base = MODEL_PARAMS["xgboost"] if best_model_name == "xgboost" else MODEL_PARAMS["xgboost_rf"]
    tuned_params = {**base, **study.best_trial.params}
    tuned_params["random_state"] = RANDOM_STATE
    final_model = XGBClassifier(**tuned_params)

final_model.fit(X, y)

print("Final tuned model fitted on full dataset.")

# ------------------------------------------------------------------
# 5a. Probability calibration using a temporally ordered calibration split
#      (e.g., train on 2016–2017, calibrate on 2018; test remains 2019)
# ------------------------------------------------------------------
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss
import duckdb

# Use model_data to recover each patient's latest event_year in the training window
model_data_path = (
    PROJECT_ROOT
    / "model_data"
    / f"cohort_name={COHORT_NAME}"
    / f"age_band={AGE_BAND}"
    / "model_events.parquet"
)

if not model_data_path.exists():
    raise FileNotFoundError(f"model_data parquet not found for temporal calibration: {model_data_path}")

con = duckdb.connect()
patient_years_df = con.execute(
    f"""
    SELECT mi_person_key, MAX(event_year) AS max_event_year
    FROM read_parquet('{model_data_path}')
    WHERE event_year BETWEEN 2016 AND 2018
    GROUP BY mi_person_key
    """
).df()
con.close()

# Merge year info onto the feature table (full_df should still contain mi_person_key)
if "mi_person_key" not in full_df.columns:
    raise ValueError("Expected 'mi_person_key' column in final features for temporal split")

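# patient_years_df has one row per mi_person_key; validate guards against row duplication
full_with_year = full_df.merge(
    patient_years_df, on="mi_person_key", how="left", validate="many_to_one"
)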

# Define temporal train vs calibration: <=2017 for train, 2018 for calibration
train_mask = full_with_year["max_event_year"].isin([2016, 2017])
cal_mask = full_with_year["max_event_year"] == 2018

if not train_mask.any() or not cal_mask.any():
    raise ValueError("Temporal calibration split failed: need patients in both train (<=2017) and cal (2018) groups")

X_train_cal = X[train_mask]
y_train_cal = y[train_mask]
X_cal = X[cal_mask]
y_cal = y[cal_mask]

print(f"Temporal calibration: {X_train_cal.shape[0]} train patients (<=2017), {X_cal.shape[0]} calib patients (2018)")

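# Fit a clone on the temporal training subset so that final_model keeps its
# full-window fit, then calibrate the clone on X_cal/y_cal.
from sklearn.base import clone

cal_base_model = clone(final_model)
cal_base_model.fit(X_train_cal, y_train_cal)
# cv="prefit" leaves the base model untouched; scikit-learn 1.6+ deprecates it
# in favor of wrapping the fitted model in FrozenEstimator.
calibrated_model = CalibratedClassifierCV(cal_base_model, method="isotonic", cv="prefit")
calibrated_model.fit(X_cal, y_cal)

# Brier score on the calibration set itself: a sanity check only, since the
# isotonic map was fit on these same rows and the score is therefore optimistic.
cal_probs = calibrated_model.predict_proba(X_cal)[:, 1]
cal_brier = brier_score_loss(y_cal, cal_probs)
print(f"Calibration Brier score (lower is better): {cal_brier:.4f}")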

                             feature_name   importance
drug_name                       drug_name 8.855378e+01
pattern_1                       pattern_1 4.650160e+00
pattern_5                       pattern_5 1.448110e+00
pattern_2_lift             pattern_2_lift 1.368222e+00
pattern_1_lift             pattern_1_lift 1.049881e+00
pattern_2_support       pattern_2_support 5.429555e-01
pattern_1_certainty   pattern_1_certainty 4.954519e-01
pattern_2_confidence pattern_2_confidence 4.233358e-01
pattern_1_support       pattern_1_support 4.021109e-01
pattern_8                       pattern_8 2.468538e-01
pattern_2_certainty   pattern_2_certainty 2.304825e-01
pattern_2                       pattern_2 1.524628e-01
pattern_4                       pattern_4 1.423138e-01
pattern_18                     pattern_18 1.378250e-01
pattern_7                       pattern_7 9.985262e-02
pattern_16                     pattern_16 3.856357e-02
pattern_11                     pattern_11 7.674536e-03
pattern_9 
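
To make the calibration step more concrete, the sketch below fits scikit-learn's `IsotonicRegression` (the monotone mapping behind `method="isotonic"`) to deliberately miscalibrated synthetic scores and compares Brier scores before and after. The data-generating process, with scores exactly twice the true probability, is an assumption chosen purely for illustration.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(1997)

# Synthetic calibration set: true event probabilities and sampled outcomes
p_true = rng.uniform(0.0, 0.5, size=2000)
y_cal_toy = rng.binomial(1, p_true)

# Miscalibrated raw scores: systematically twice too high (illustrative assumption)
raw_scores = np.clip(2.0 * p_true, 0.0, 1.0)

# Learn a monotone map from raw score to observed event rate
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_scores, y_cal_toy)
calibrated_scores = iso.predict(raw_scores)

brier_raw = brier_score_loss(y_cal_toy, raw_scores)
brier_cal = brier_score_loss(y_cal_toy, calibrated_scores)
print(f"Brier raw: {brier_raw:.4f}, calibrated: {brier_cal:.4f}")
```

On the rows used to fit the mapping, the calibrated Brier score cannot exceed the raw one, which is why an in-sample comparison like this serves only as a sanity check; a held-out year gives a fairer estimate.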

## 6. Save Final Model + Calibrated Model Artifacts (Local + S3 Gold Folder)

In [None]:
import joblib

from helpers_1997_13.constants import S3_BUCKET
from helpers_1997_13.common_imports import s3_client

output_dir = (
    PROJECT_ROOT
    / "7_final_model"
    / "outputs"
    / COHORT_NAME
    / AGE_BAND_FNAME
)
output_dir.mkdir(parents=True, exist_ok=True)

# Local artifact paths
mc_cv_path = output_dir / f"{COHORT_NAME}_{AGE_BAND_FNAME}_mc_cv_results.csv"
model_path = output_dir / f"{COHORT_NAME}_{AGE_BAND_FNAME}_final_model_{best_model_name}.joblib"
calib_model_path = output_dir / f"{COHORT_NAME}_{AGE_BAND_FNAME}_final_model_{best_model_name}_calibrated.joblib"

mc_cv_df.to_csv(mc_cv_path, index=False)
joblib.dump(final_model, model_path)
joblib.dump(calibrated_model, calib_model_path)

print(f"Saved MC-CV results to: {mc_cv_path}")
print(f"Saved final model (joblib) to: {model_path}")
print(f"Saved calibrated model (joblib) to: {calib_model_path}")


# ------------------------------------------------------------------
# Export model in native formats (CatBoost/XGBoost) and upload to S3
# ------------------------------------------------------------------

event_year = "train"  # training window (2016–2018)
s3_prefix_models = (
    f"gold/final_model/cohort_name={COHORT_NAME}/"
    f"age_band={AGE_BAND_FNAME}/event_year={event_year}/models"
)


def upload_to_s3(local_path):
    key = f"{s3_prefix_models}/{local_path.name}"
    s3_client.upload_file(str(local_path), S3_BUCKET, key)
    print(f"Uploaded to s3://{S3_BUCKET}/{key}")


# Always upload MC-CV results and both joblib models
upload_to_s3(mc_cv_path)
upload_to_s3(model_path)
upload_to_s3(calib_model_path)

# CatBoost: save .cbm and JSON
if best_model_name == "catboost":
    cat_cbm_path = output_dir / f"{COHORT_NAME}_{AGE_BAND_FNAME}_final_model_catboost.cbm"
    cat_json_path = output_dir / f"{COHORT_NAME}_{AGE_BAND_FNAME}_final_model_catboost.json"

    final_model.save_model(str(cat_cbm_path))
    final_model.save_model(str(cat_json_path), format="json")

    print(f"Saved CatBoost CBM to: {cat_cbm_path}")
    print(f"Saved CatBoost JSON to: {cat_json_path}")

    upload_to_s3(cat_cbm_path)
    upload_to_s3(cat_json_path)

# XGBoost / XGBoost RF: save JSON booster
if best_model_name in ("xgboost", "xgboost_rf"):
    xgb_json_path = output_dir / f"{COHORT_NAME}_{AGE_BAND_FNAME}_final_model_{best_model_name}.json"
    final_model.get_booster().save_model(str(xgb_json_path))
    print(f"Saved XGBoost JSON to: {xgb_json_path}")
    upload_to_s3(xgb_json_path)
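
The S3 key layout used by `upload_to_s3` can be checked without touching S3. The sketch below rebuilds the key as a pure function; `build_model_key` and the example path are hypothetical and only mirror the prefix pattern defined in the cell above.

```python
from pathlib import Path


def build_model_key(cohort_name, age_band_fname, local_path, event_year="train"):
    """Return the S3 object key for a local artifact (hypothetical helper)."""
    prefix = (
        f"gold/final_model/cohort_name={cohort_name}/"
        f"age_band={age_band_fname}/event_year={event_year}/models"
    )
    # Only the file name is kept; local directories do not leak into the key
    return f"{prefix}/{Path(local_path).name}"


key = build_model_key(
    "opioid_ed", "0_12", "/tmp/opioid_ed_0_12_final_model_catboost.cbm"
)
print(key)
```

Keeping key construction separate from the upload call makes it straightforward to unit-test the partition layout before any artifact is written.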

In [None]:
# (Optional) place for custom export / additional diagnostics if needed.
# For now, all relevant artifacts are saved by the previous cell.

✓ Uploaded catboost_model_r.cbm to s3://pgxdatalake/catboost_models/non_opioid_ed/age_band=65-74/catboost_model_r.cbm

✓ Uploaded catboost_model_info_r.json to s3://pgxdatalake/catboost_models/non_opioid_ed/age_band=65-74/catboost_model_info_r.json

✓ Uploaded catboost_model_info_r.parquet to s3://pgxdatalake/catboost_models/non_opioid_ed/age_band=65-74/catboost_model_info_r.parquet




Model Information:
✓ All outputs saved to local path: /home/pgx3874/pgx-datasets/catboost_analysis/catboost_models/ed_non_opioid/cohort6 
✓ All outputs uploaded to: s3://pgxdatalake/catboost_models/non_opioid_ed/age_band=65-74


# Back to Main Pipeline
[Return to Main Pipeline](../pgx_cohort_pipeline.ipynb)