# Setup

**To train neural networks faster, you need to enable GPUs for the notebook:**
* Navigate to Edit→Notebook Settings
* Select GPU from the Hardware Accelerator drop-down
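
After changing the runtime, a quick check along the following lines can confirm that a GPU is visible. This is a minimal sketch that assumes PyTorch is the deep-learning backend in use; `gpu_available` is a hypothetical helper name, not part of any library. If PyTorch is not installed yet, the helper simply reports `False`.

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # assumption: PyTorch is the backend used for training
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


print(f"GPU available: {gpu_available()}")
```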

## Installation

In [None]:
!pip install pytabkit
!pip install openml

## Getting a dataset

In [None]:
import openml
from sklearn.model_selection import train_test_split
import numpy as np

task = openml.tasks.get_task(361113) # covertype dataset
dataset = openml.datasets.get_dataset(task.dataset_id, download_data=False)
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format='dataframe',
    target=task.target_name
)
# Restrict to 15K samples for demonstration purposes (seeded so the subsample is reproducible)
rng = np.random.default_rng(0)
index = rng.choice(len(X), 15_000, replace=False)
X = X.iloc[index]
y = y.iloc[index]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Using RealMLP

In [None]:
%%time
from pytabkit import RealMLP_TD_Classifier
from sklearn.metrics import accuracy_score

model = RealMLP_TD_Classifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy of RealMLP: {acc}")

Accuracy of RealMLP: 0.8770666666666667
CPU times: user 1min 11s, sys: 192 ms, total: 1min 11s
Wall time: 1min 11s


## With bagging
Bagging (here, ensembling the models trained on the folds of 5-fold cross-validation) is enabled simply by passing `n_cv=5` to the constructor. Thanks to vectorized training, this does not take 5x as long: in the runs shown here, the wall time stays at a little over one minute.

In [None]:
%%time
from pytabkit import RealMLP_TD_Classifier
from sklearn.metrics import accuracy_score

model = RealMLP_TD_Classifier(n_cv=5)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy of RealMLP with bagging: {acc}")

Accuracy of RealMLP with bagging: 0.8930666666666667
CPU times: user 1min 8s, sys: 180 ms, total: 1min 9s
Wall time: 1min 8s
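
To make the idea behind `n_cv=5` more concrete, here is a rough sketch of ensembling over cross-validation folds written with plain scikit-learn. It is an illustration under assumptions, not pytabkit's implementation: the data is synthetic, `LogisticRegression` is a stand-in for RealMLP, the held-out part of each fold is ignored, and the models are trained one after another rather than in a vectorized way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in data (not the covertype subsample used above)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Train one model per fold, each on the training part of that fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_models = []
for train_idx, _val_idx in cv.split(X_tr, y_tr):
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_tr[train_idx], y_tr[train_idx])
    fold_models.append(fold_model)

# Ensemble by averaging the predicted class probabilities of the five models
avg_proba = np.mean([m.predict_proba(X_te) for m in fold_models], axis=0)
y_pred_bag = fold_models[0].classes_[avg_proba.argmax(axis=1)]
print(f"Accuracy of the 5-model ensemble: {np.mean(y_pred_bag == y_te):.3f}")
```

Each of the five models sees a slightly different 80% of the training data, and averaging their probabilities tends to reduce variance compared to a single model.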


## With hyperparameter optimization
Hyperparameter optimization is available directly through the scikit-learn interface, for example via `RealMLP_HPO_Classifier` (used below) or `RealMLP_HPO_Regressor` for regression.
It is also available for other models, for instance `LGBM_HPO_Classifier` or `LGBM_HPO_TPE_Classifier` (which uses the Tree-structured Parzen Estimator algorithm).

In [None]:
%%time
from pytabkit import RealMLP_HPO_Classifier
from sklearn.metrics import accuracy_score

n_hyperopt_steps = 3 # small number for demonstration purposes
model = RealMLP_HPO_Classifier(n_hyperopt_steps=n_hyperopt_steps)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy of RealMLP with {n_hyperopt_steps} steps HPO: {acc}")

Accuracy of RealMLP with 3 steps HPO: 0.8605333333333334
CPU times: user 2min 27s, sys: 442 ms, total: 2min 28s
Wall time: 2min 28s
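
For intuition about what a small number of HPO steps means, the following sketch runs a generic random search with a validation split. It is not pytabkit's internal procedure: the data is synthetic, `LogisticRegression` and its `C` parameter are stand-ins for the real model and search space, and `n_steps` plays the role of `n_hyperopt_steps`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split into a fitting part and a validation part
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_fit, X_val, y_fit, y_val = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.default_rng(0)
n_steps = 3  # small number, mirroring the demonstration above
best_acc, best_C = -1.0, None
for _ in range(n_steps):
    # Sample the regularization strength on a log scale in [1e-3, 1e2]
    C = float(10 ** rng.uniform(-3, 2))
    candidate = LogisticRegression(C=C, max_iter=1000).fit(X_fit, y_fit)
    val_acc = accuracy_score(y_val, candidate.predict(X_val))
    # Keep the configuration with the best validation accuracy so far
    if val_acc > best_acc:
        best_acc, best_C = val_acc, C

print(f"Best C after {n_steps} steps: {best_C:.4g} (validation accuracy {best_acc:.3f})")
```

With so few steps the search explores very little of the space. In the run above, three HPO steps reached an accuracy of 0.8605 compared to 0.8771 for the tuned defaults, so a larger budget is likely needed before HPO pays off.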


# Using improved defaults for tree-based models

`TD` stands for *tuned defaults*, which are the improved defaults we propose. `D` stands for *defaults*, which are the libraries' own defaults.

In [None]:
%%time
from pytabkit import (CatBoost_TD_Classifier, CatBoost_D_Classifier, LGBM_TD_Classifier,
                      LGBM_D_Classifier, XGB_TD_Classifier, XGB_D_Classifier)
from sklearn.metrics import accuracy_score

# Fit each tuned-default (TD) model and its library-default (D) counterpart
models = [CatBoost_TD_Classifier(), CatBoost_D_Classifier(), LGBM_TD_Classifier(),
          LGBM_D_Classifier(), XGB_TD_Classifier(), XGB_D_Classifier()]
for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Accuracy of {model.__class__.__name__}: {acc}")


Accuracy of CatBoost_TD_Classifier: 0.8685333333333334
Accuracy of CatBoost_D_Classifier: 0.8464
Accuracy of LGBM_TD_Classifier: 0.8602666666666666
Accuracy of LGBM_D_Classifier: 0.8344
Accuracy of XGB_TD_Classifier: 0.8544
Accuracy of XGB_D_Classifier: 0.8472
CPU times: user 1min 55s, sys: 44.3 s, total: 2min 40s
Wall time: 24 s
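
To read the TD versus D comparison more easily, the accuracies printed above can be turned into per-library gains. The numbers in this small sketch are copied from that output; the dictionary layout is just an illustrative choice.

```python
# Accuracies copied from the output above (single run on the 15K subsample)
accuracies = {
    "CatBoost": {"TD": 0.8685333333333334, "D": 0.8464},
    "LGBM": {"TD": 0.8602666666666666, "D": 0.8344},
    "XGB": {"TD": 0.8544, "D": 0.8472},
}

# Gain of the tuned defaults over the library defaults, per library
gains = {name: acc["TD"] - acc["D"] for name, acc in accuracies.items()}
for name, gain in gains.items():
    print(f"{name}: TD - D = {100 * gain:+.2f} percentage points")
```

On this run, the tuned defaults improve accuracy by roughly 2.2 (CatBoost), 2.6 (LGBM) and 0.7 (XGB) percentage points. These figures come from a single split, so they should be read as indicative only.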


# Ensembling tuned defaults of tree-based methods and RealMLP: a very good baseline

In [None]:
%%time
from pytabkit import Ensemble_TD_Classifier
from sklearn.metrics import accuracy_score

model = Ensemble_TD_Classifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy of Ensemble_TD_Classifier: {acc}")

Accuracy of Ensemble_TD_Classifier: 0.8834666666666666
CPU times: user 2min 34s, sys: 38 s, total: 3min 12s
Wall time: 1min 30s
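
How `Ensemble_TD_Classifier` combines its members is not detailed here. One common scheme for ensembling heterogeneous models is soft voting, that is, averaging the predicted class probabilities. The sketch below illustrates that scheme only, under clear assumptions: the data is synthetic and the three scikit-learn classifiers are stand-ins for the tree-based models and RealMLP.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the covertype subsample used above)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Stand-in ensemble members of different model families
members = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=0),
    KNeighborsClassifier(),
]

probas = []
for member in members:
    member.fit(X_tr, y_tr)
    probas.append(member.predict_proba(X_te))
    member_acc = np.mean(member.predict(X_te) == y_te)
    print(f"Accuracy of {member.__class__.__name__}: {member_acc:.3f}")

# Soft voting: average the class probabilities, then take the most likely class
ens_proba = np.mean(probas, axis=0)
ens_pred = members[0].classes_[ens_proba.argmax(axis=1)]
print(f"Accuracy of the soft-voting ensemble: {np.mean(ens_pred == y_te):.3f}")
```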
