**To train neural networks faster, you need to enable GPUs for the notebook:**
* Navigate to Edit→Notebook Settings
* select GPU from the Hardware Accelerator drop-down

# Setup

## Installation

In [None]:
!pip install pytabkit
!pip install openml

## Getting a dataset

In [None]:
import openml
from sklearn.model_selection import train_test_split

task = openml.tasks.get_task(359946, download_splits=False) # pol dataset
dataset = openml.datasets.get_dataset(task.dataset_id, download_data=False)
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format='dataframe',
    target=task.target_name
)
# X, _, y, _ = train_test_split(X, y, train_size=0.1, random_state=0)  # subsample data

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Using RealMLP

In [3]:
%%time
from pytabkit.models.sklearn.sklearn_interfaces import RealMLP_TD_Classifier
from sklearn.metrics import root_mean_squared_error

model = RealMLP_TD_Classifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE of RealMLP: {rmse}")

RMSE of RealMLP: 3.1390869160739507
CPU times: user 1min 24s, sys: 2.33 s, total: 1min 26s
Wall time: 1min 39s


## With bagging
It is possible to do bagging (ensembling of models on 5-fold cross-validation) simply by passing `n_cv=5` to the constructor. Note that it doesn't take 5x as long because of vectorized training.

In [4]:
%%time
from pytabkit.models.sklearn.sklearn_interfaces import RealMLP_TD_Classifier
from sklearn.metrics import root_mean_squared_error

model = RealMLP_TD_Classifier(n_cv=5)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE of RealMLP with bagging: {rmse}")

RMSE of RealMLP with bagging: 2.9542286077192244
CPU times: user 1min 16s, sys: 729 ms, total: 1min 17s
Wall time: 1min 19s


## With hyperparameter optimization
It is possible to do hyperparameter optimization directly inside a sklearn interface by using the `RealMLP_HPO_Regressor` interface.
This is also available for classification, and for other models, for instance `LGBM_HPO_Classifier` or `LGBM_HPO_TPE_Classifier` (to use the Tree-structured Parzen Estimator algorithm).

In [8]:
%%time
from pytabkit.models.sklearn.sklearn_interfaces import RealMLP_HPO_Regressor

n_hyperopt_steps = 3 # small number for demonstration purposes
model = RealMLP_HPO_Regressor(n_hyperopt_steps=n_hyperopt_steps)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE of RealMLP with {n_hyperopt_steps} steps HPO: {rmse}")

RMSE of RealMLP with 3 steps HPO: 2.435140371322632
CPU times: user 3min 39s, sys: 2.12 s, total: 3min 41s
Wall time: 3min 45s


# Using improved default for tree based models

`TD` stands for *tuned defaults*, which are the improved default we propose. `D` stands for *defaults*, which are the libraries defaults.

In [6]:
%%time
from pytabkit.models.sklearn.sklearn_interfaces import CatBoost_TD_Regressor, CatBoost_D_Regressor, LGBM_TD_Regressor, LGBM_D_Regressor, XGB_TD_Regressor, XGB_D_Regressor

for model in [CatBoost_TD_Regressor(), CatBoost_D_Regressor(), LGBM_TD_Regressor(), LGBM_D_Regressor(), XGB_TD_Regressor(), XGB_D_Regressor()]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = root_mean_squared_error(y_test, y_pred)
    print(f"RMSE of {model.__class__.__name__}: {rmse}")

RMSE of CatBoost_TD_Regressor: 4.254329681396484
RMSE of CatBoost_D_Regressor: 5.49345064163208
RMSE of LGBM_TD_Regressor: 4.418639183044434
RMSE of LGBM_D_Regressor: 5.085862159729004
RMSE of XGB_TD_Regressor: 4.645600318908691
RMSE of XGB_D_Regressor: 5.538084983825684
CPU times: user 46 s, sys: 1.22 s, total: 47.2 s
Wall time: 30.4 s


# Ensembling tuned defaults of tree-based methods and RealMLP: a very good baseline

In [7]:
%%time
from pytabkit.models.sklearn.sklearn_interfaces import Ensemble_TD_Regressor

model = Ensemble_TD_Regressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE of Ensemble_TD_Regressor: {rmse}")

RMSE of Ensemble_TD_Regressor: 2.7520666122436523
CPU times: user 2min 4s, sys: 1.49 s, total: 2min 6s
Wall time: 1min 46s
