Context:

1. What AutoGluon actually is

AutoGluon:
- is an open-source AutoML framework, not a proprietary or novel machine-learning model;
- uses multi-level stacking built exclusively on out-of-fold (OOF) predictions, which keeps training and validation data strictly separated.

How it works:

- Level 0 (base models):
    - Multiple diverse models are trained independently, for example:
        - XGBoost variants (XGB_A, XGB_B)
        - LightGBM variants (LGB_A)
        - CatBoost variants (CAT_A)
        - Neural networks (NN)
        - Random Forests (RF)
        - Linear models
        - kNN
        - etc.

Each model generates out-of-fold predictions via k-fold cross-validation: every training row is predicted by a fold model that was not fit on that row.

- Level 1 (meta-model):
    - A meta-learner is trained on the OOF predictions produced by the Level-0 models (AutoGluon's stacker models also receive the original features alongside these predictions).
    - At no point does it see predictions made on data the base models were trained on.

- Level 2 (optional):
    - In some configurations, AutoGluon stacks a further layer on top of the stacked models, which can reduce variance and improve robustness.
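
The Level-0 / Level-1 flow described above can be sketched in a few lines of NumPy. This is an illustrative toy, not AutoGluon's implementation: the ridge and k-NN base learners, the synthetic data, and every helper name (`fit_ridge`, `fit_knn`, `oof_predictions`) are assumptions made for the example, and for brevity the meta-learner uses only the OOF columns.

```python
import numpy as np

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept; returns a predict function."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return lambda X_new: (X_new - x_mean) @ w + y_mean

def fit_knn(X, y, k=5):
    """Brute-force k-nearest-neighbours regressor; returns a predict function."""
    def predict(X_new):
        dist = ((X_new[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        nearest = np.argsort(dist, axis=1)[:, :k]
        return y[nearest].mean(axis=1)
    return predict

def oof_predictions(fit_fn, X, y, folds):
    """Predict each row with a model that did not see that row during training."""
    oof = np.full(len(y), np.nan)
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False
        model = fit_fn(X[train_mask], y[train_mask])
        oof[val_idx] = model(X[val_idx])
    return oof

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + np.sin(2 * X[:, 1]) + 0.1 * rng.normal(size=200)

# 5 disjoint validation folds covering every row exactly once
folds = np.array_split(rng.permutation(len(y)), 5)

base_learners = {
    "ridge": lambda X_tr, y_tr: fit_ridge(X_tr, y_tr, alpha=1.0),
    "knn": lambda X_tr, y_tr: fit_knn(X_tr, y_tr, k=5),
}

# Level 0: one column of OOF predictions per base model
Z = np.column_stack([oof_predictions(f, X, y, folds) for f in base_learners.values()])

# Level 1: the meta-learner is fit only on OOF predictions
meta = fit_ridge(Z, y, alpha=1e-3)
stacked = meta(Z)

oof_rmse = {name: rmse(y, Z[:, j]) for j, name in enumerate(base_learners)}
# Note: this score is in-sample for the meta-learner, so it is slightly optimistic
stacked_rmse = rmse(y, stacked)
print(oof_rmse, stacked_rmse)
```

Because the columns of `Z` are out-of-fold, the meta-learner is fit on predictions that resemble what the base models will produce on unseen data, rather than on their overfit in-sample predictions.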
 
2. Who launched AutoGluon and why

AutoGluon was originally developed and released by researchers and engineers at Amazon Web Services (AWS) and open-sourced in 2019.

Its original motivation was practical rather than academic:
- reduce the cost of deploying strong ML baselines in production,
- minimize human bias in model selection,
- automate best practices that expert ML engineers apply manually.

3. Why it became widely adopted

AutoGluon gained traction because it consistently demonstrated strong performance on tabular prediction problems, especially in settings where:
- model selection uncertainty is high,
- multiple algorithms perform similarly,
- ensemble effects dominate single-model gains.

It became particularly popular in:
- industrial ML pipelines for tabular data,
- rapid prototyping and benchmarking,
- competitive data science environments such as Kaggle.

Its success comes from three structural strengths:
- Systematic exploration of many model families
- Strict out-of-fold stacking, preventing data leakage
- Robustness-driven optimization, favoring stability over noisy peak scores

Reference: AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data
(Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, Alexander Smola), https://arxiv.org/pdf/2003.06505

In [1]:
import warnings
warnings.filterwarnings("ignore")

from pathlib import Path
from datetime import datetime
import numpy as np
import pandas as pd

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from autogluon.tabular import TabularPredictor

# Settings

OUT_DIR = Path("models_autogluon_best")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Set to True if the target y was transformed with log1p during preprocessing
USE_LOG1P = True

# Time budget (seconds)
# 300 = 5 min | 900 = 15 min | 3600 = 1h | 10800 = 3h | 21600 = 6h
TIME_LIMIT = 21600

# Preset
PRESET = "extreme_quality"       # use "extreme_quality" for the best results
VERBOSITY = 2

# Stability
REFIT_FULL = False  # switch to True only once the run is bug-free and time_limit > 1h
SAVE_BAG_FOLDS = True

# If XGBoost causes crashes, keep it excluded
EXCLUDED_MODEL_TYPES = ["XGBoost"]   # safe default


# Data Directory

def resolve_data_dir() -> Path:
    candidates = []

    cwd = Path.cwd()
    candidates += [
        cwd / "data" / "preprocessed",
        cwd / "data",
    ]

    base_dir = cwd.parent
    candidates += [
        base_dir / "M2_Applied_ML" / "data" / "preprocessed",
        base_dir / "M2_Applied_ML" / "data",
    ]

    try:
        script_dir = Path(__file__).resolve().parent
        candidates += [
            script_dir / "data" / "preprocessed",
            script_dir / "data",
            script_dir.parent / "data" / "preprocessed",
            script_dir.parent / "data",
        ]
    except NameError:
        pass

    required = [
        "X_train_ready.csv",
        "X_test_ready.csv",
        "y_train_ready.csv",
        "y_test_ready.csv",
    ]

    for d in candidates:
        if d.exists() and all((d / f).exists() for f in required):
            return d

    raise FileNotFoundError(
        "Could not find data directory. Tried:\n" + "\n".join(map(str, candidates))
    )


# Data 
def load_data(data_dir: Path):
    X_train = pd.read_csv(data_dir / "X_train_ready.csv")
    X_test  = pd.read_csv(data_dir / "X_test_ready.csv")
    y_train = pd.read_csv(data_dir / "y_train_ready.csv").values.ravel()
    y_test  = pd.read_csv(data_dir / "y_test_ready.csv").values.ravel()

    # Drop ID-like columns if present
    for col in ["Order", "PID"]:
        X_train.drop(columns=[col], errors="ignore", inplace=True)
        X_test.drop(columns=[col], errors="ignore", inplace=True)

    return (
        X_train.reset_index(drop=True),
        X_test.reset_index(drop=True),
        np.asarray(y_train),
        np.asarray(y_test),
    )

# Metrics

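def evaluate(y_true, y_pred, model_name):
    # R² is computed on the scale the model was trained on (log1p scale if USE_LOG1P)
    r2 = r2_score(y_true, y_pred)

    # Undo the log1p transform so RMSE and MAE are in the target's original units
    if USE_LOG1P:
        y_true_u = np.expm1(y_true)
        y_pred_u = np.expm1(y_pred)
    else:
        y_true_u = y_true
        y_pred_u = y_pred

    rmse = np.sqrt(mean_squared_error(y_true_u, y_pred_u))
    mae  = mean_absolute_error(y_true_u, y_pred_u)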

    return {
        "model": model_name,
        "r2_test": r2,
        "rmse_test": rmse,
        "mae_test": mae,
    }

DATA_DIR = resolve_data_dir()
print(f"[Data] Using directory: {DATA_DIR}")

X_train, X_test, y_train, y_test = load_data(DATA_DIR)
print(f"[Data] Train shape: {X_train.shape} | Test shape: {X_test.shape}")

# Prepare AutoGluon training frame
train_ag = X_train.copy()
train_ag["target"] = y_train

run_id = datetime.now().strftime("%Y%m%d_%H%M%S")
ag_path = OUT_DIR / f"autogluon_{PRESET}_{TIME_LIMIT}s_{run_id}"

# Train

predictor = TabularPredictor(
    label="target",
    eval_metric="root_mean_squared_error",
    path=str(ag_path),
).fit(
    train_data=train_ag,
    presets=PRESET,
    time_limit=TIME_LIMIT,
    verbosity=VERBOSITY,
    refit_full=REFIT_FULL,
    save_bag_folds=SAVE_BAG_FOLDS,
    set_best_to_refit_full=False,
    excluded_model_types=EXCLUDED_MODEL_TYPES,
)

# Evaluation (TEST)

pred_test = predictor.predict(X_test).values
results = evaluate(y_test, pred_test, f"AutoGluon ({PRESET}, {TIME_LIMIT}s)")

print("\n===== AUTOGLUON HOLDOUT RESULTS =====")
print(results)
print("\nSaved AutoGluon model directory:", ag_path)

# SAVE ARTIFACTS

# Save test predictions
pred_path = OUT_DIR / f"pred_test_{run_id}.csv"
pd.DataFrame({"prediction": pred_test}).to_csv(pred_path, index=False)

# Save run summary (metrics + config)
summary_path = OUT_DIR / f"run_summary_{run_id}.csv"
pd.DataFrame([{
    **results,
    "preset": PRESET,
    "time_limit_s": TIME_LIMIT,
    "use_log1p": USE_LOG1P,
    "excluded_models": ",".join(EXCLUDED_MODEL_TYPES),
    "train_shape": str(X_train.shape),
    "test_shape": str(X_test.shape),
    "model_dir": str(ag_path),
}]).to_csv(summary_path, index=False)

print("Saved predictions:", pred_path)
print("Saved run summary:", summary_path)

print("\nAutoGluon model safely trained and saved.")

# reload the model
# from autogluon.tabular import TabularPredictor
# predictor = TabularPredictor.load("models_autogluon_best/autogluon_...")

[Data] Using directory: C:\Users\eric-\Desktop\Data sciences\M2_Applied_ML\data\preprocessed


Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.5.0
Python Version:     3.12.3
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.26200
CPU Count:          32
Pytorch Version:    2.9.1+cpu
CUDA Version:       CUDA is not available
Memory Avail:       11.46 GB / 31.19 GB (36.8%)
Disk Space Avail:   446.20 GB / 853.99 GB (52.2%)
Presets specified: ['extreme_quality']
Using hyperparameters preset: hyperparameters='zeroshot_2025_12_18_gpu'
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=1


[Data] Train shape: (2344, 159) | Test shape: (586, 159)


Beginning AutoGluon training ... Time limit = 21600s
AutoGluon will save models to "C:\Users\eric-\Desktop\Data sciences\M2_Applied_ML\models_autogluon_best\autogluon_extreme_quality_21600s_20260117_021515"
Train Data Rows:    2344
Train Data Columns: 159
Label Column:       target
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
	Label info (max, min, mean, stddev): (13.534474352733596, 9.456418894572888, 12.01181, 0.40122)
	If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type:       regression
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    11819.92 MB
	Train Data (Original)  Memory Usage: 2.84 MB (0.0% of available mem


===== AUTOGLUON HOLDOUT RESULTS =====
{'model': 'AutoGluon (extreme_quality, 21600s)', 'r2_test': 0.9471284730434566, 'rmse_test': 22231.116212459343, 'mae_test': 13102.377846363046}

Saved AutoGluon model directory: models_autogluon_best\autogluon_extreme_quality_21600s_20260117_021515
Saved predictions: models_autogluon_best\pred_test_20260117_021515.csv
Saved run summary: models_autogluon_best\run_summary_20260117_021515.csv

✅ DONE — AutoGluon model safely trained and saved.
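
The holdout results above mix two scales: `evaluate` computes `r2_test` on the log1p-transformed target, while `rmse_test` and `mae_test` are computed after `np.expm1`, in the target's original units. A minimal sketch of that round trip, using made-up values (not the notebook's data) and hypothetical variable names:

```python
import numpy as np

# Hypothetical original-scale targets and predictions (illustrative values only)
y_true_raw = np.array([120_000.0, 185_000.0, 240_000.0, 310_000.0])
y_pred_raw = np.array([125_000.0, 180_000.0, 250_000.0, 300_000.0])

# What the model works with when the target was transformed with log1p
y_true_log = np.log1p(y_true_raw)
y_pred_log = np.log1p(y_pred_raw)

# RMSE on the log scale: unit-free, roughly a relative error
rmse_log = float(np.sqrt(np.mean((y_true_log - y_pred_log) ** 2)))

# RMSE after inverting with expm1: expressed in the target's original units
rmse_raw = float(np.sqrt(np.mean((np.expm1(y_true_log) - np.expm1(y_pred_log)) ** 2)))

print(f"log-scale RMSE:      {rmse_log:.4f}")
print(f"original-scale RMSE: {rmse_raw:,.0f}")
```

The two numbers are not comparable with each other, which is why the validation scores AutoGluon reports during training (log scale) look so different from `rmse_test` in the results dictionary.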


In [7]:
# Store the summary under its own name so the test predictions in pred_test are not overwritten
fit_summary = predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                      model  score_val              eval_metric  pred_time_val      fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0       WeightedEnsemble_L2  -0.115017  root_mean_squared_error      12.460689  13746.634718                0.000997           0.139493            2       True         13
1       LightGBM_r37_BAG_L1  -0.116802  root_mean_squared_error       1.244145     78.795318                1.244145          78.795318            1       True          7
2       LightGBM_r57_BAG_L1  -0.118980  root_mean_squared_error       0.773706    267.150397                0.773706         267.150397            1       True         11
3   LightGBMPrep_r21_BAG_L1  -0.119477  root_mean_squared_error       0.356112     28.628946                0.356112          28.628946            1       True          5
4        CatBoost_c1_BAG_L1  -0.119682  root_mean_squared_error       0.155383    1

In [6]:
lbard = predictor.leaderboard()
lbard.sort_values(by='score_val', ascending=False)

                      model  score_val              eval_metric  pred_time_val      fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0       WeightedEnsemble_L2  -0.115017  root_mean_squared_error      12.460689  13746.634718                0.000997           0.139493            2       True         13
1       LightGBM_r37_BAG_L1  -0.116802  root_mean_squared_error       1.244145     78.795318                1.244145          78.795318            1       True          7
2       LightGBM_r57_BAG_L1  -0.118980  root_mean_squared_error       0.773706    267.150397                0.773706         267.150397            1       True         11
3   LightGBMPrep_r21_BAG_L1  -0.119477  root_mean_squared_error       0.356112     28.628946                0.356112          28.628946            1       True          5
4        CatBoost_c1_BAG_L1  -0.119682  root_mean_squared_error       0.155383    105.976634                0.155383         105.976634            1       True          1
5          TabM_r184_BAG_L1  -0.119778  root_mean_squared_error       5.477163   4085.446244                5.477163        4085.446244            1       True          9
6   LightGBMPrep_r41_BAG_L1  -0.119903  root_mean_squared_error       1.646469    276.769654                1.646469         276.769654            1       True          2
7           TabM_r69_BAG_L1  -0.120212  root_mean_squared_error       5.226889   9447.648083                5.226889        9447.648083            1       True          6
8       LightGBM_r73_BAG_L1  -0.120827  root_mean_squared_error       0.574180     59.387189                0.574180          59.387189            1       True          4
9      LightGBM_r162_BAG_L1  -0.121969  root_mean_squared_error       0.434371    248.403986                0.434371         248.403986            1       True         10
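
AutoGluon reports metrics in higher-is-better form, so `score_val` in the leaderboard is the negated RMSE, here measured on the log1p-transformed target. A small sketch for reading these values, using two scores copied from the table; the "relative error" reading is only a rough approximation that assumes a log1p target with large values, and the variable names are hypothetical:

```python
import math

# Two score_val entries copied from the leaderboard (negated log-scale RMSE)
scores = {
    "WeightedEnsemble_L2": -0.115017,
    "LightGBM_r37_BAG_L1": -0.116802,
}

for name, score_val in scores.items():
    rmse_log = -score_val  # flip the sign to recover the usual RMSE
    # For a log1p target, expm1(rmse_log) approximates a typical relative error
    approx_rel_err = math.expm1(rmse_log)
    print(f"{name}: RMSE(log)={rmse_log:.6f}, ~{approx_rel_err:.1%} relative error")

# Higher (closer to zero) is better, so the best model is the maximum
best = max(scores, key=scores.get)
gain = scores[best] - min(scores.values())
print(f"best: {best}, margin over the other model: {gain:.6f}")
```

Read this way, the weighted ensemble improves on the best single bagged model by a small margin, consistent with a leaderboard where many models perform similarly.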
