# Customized model base

For researchers and model base developers, the basic need is to compare their own models with the existing benchmarks in `tabensemb`. In this part, we build a model base within the framework, assuming that we want to integrate `TabNet` ([from the dreamquark-ai team](https://github.com/dreamquark-ai/tabnet)) into `tabensemb` for regression tasks (`pytorch_tabular` and `pytorch_widedeep` have already done so).

**Remark**: For `PyTorch`-based models, we have implemented most requirements of the framework so that users can integrate `torch.nn.Module`s more conveniently. Check "Customized `PyTorch`-based model base" for details.

## Example: Implement TabNet as a model base

In [1]:
import tabensemb
import numpy as np
import torch

prefix = "../../../../"
tabensemb.setting["default_output_path"] = prefix + "output"
tabensemb.setting["default_config_path"] = prefix + "configs"
tabensemb.setting["default_data_path"] = prefix + "data"

All model bases inherit `AbstractModel` and implement methods within the class. If necessary methods are not implemented, `NotImplementedError` will be raised during usage.

In [2]:
from tabensemb.model import AbstractModel
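
The contract described above can be pictured with a stand-in base class. The sketch below is not `AbstractModel` itself: `ToyAbstractModel`, `IncompleteBase`, `CompleteBase`, `describe`, and `try_describe` are hypothetical names used only for illustration, and only the two naming methods are mimicked. It assumes, as stated above, that a missing method surfaces as `NotImplementedError` when it is first used, not when the class is defined.

```python
class ToyAbstractModel:
    """Stand-in for a base class whose required methods raise by default."""

    def _get_program_name(self):
        raise NotImplementedError

    def _get_model_names(self):
        raise NotImplementedError

    def describe(self):
        # Calls both required methods, so a missing one fails here.
        return f"{self._get_program_name()}: {self._get_model_names()}"


class IncompleteBase(ToyAbstractModel):
    # `_get_model_names` is deliberately left out.
    def _get_program_name(self):
        return "Toy"


class CompleteBase(ToyAbstractModel):
    def _get_program_name(self):
        return "Toy"

    def _get_model_names(self):
        return ["ModelA"]


def try_describe(model_base):
    """Return the description, or a marker if a required method is missing."""
    try:
        return model_base.describe()
    except NotImplementedError:
        return "missing method"


print(try_describe(IncompleteBase()))  # missing method
print(try_describe(CompleteBase()))    # Toy: ['ModelA']
```

Defining `IncompleteBase` succeeds without complaint; the problem only shows up once `describe` is called, which is why it is worth exercising a new model base early.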

We use [`scikit-optimize`](https://github.com/scikit-optimize/scikit-optimize) for Bayesian hyperparameter optimization, so its search-space classes are imported.

In [3]:
from skopt.space import Integer, Real, Categorical

First, we define the initialization of the model base. Always remember to pass all args and kwargs to `__init__` of `AbstractModel`. We do not discuss classification tasks here, but they are straightforward.

```python
class TabNet(AbstractModel):
    def __init__(self, *args, **kwargs):
        super(TabNet, self).__init__(*args, **kwargs)
        if self.trainer.datamodule.task != "regression":
            raise Exception("We only discuss regression tasks here.")
```

We should define the name of the model base, and all available models in the model base.

```python
    def _get_program_name(self):
        return "TabNet"

    def _get_model_names(self):
        return ["TabNet"]
```

For each model in the model base, the program requests the initial hyperparameters of the model and their search spaces. They are defined as follows:

```python
    def _space(self, model_name):
        return [
            Integer(low=4, high=64, prior="uniform", name="n_d", dtype=int),  # 8
            Integer(low=4, high=64, prior="uniform", name="n_a", dtype=int),  # 8
            Integer(low=3, high=10, prior="uniform", name="n_steps", dtype=int),  # 3
            Real(low=1.0, high=2.0, prior="uniform", name="gamma"),  # 1.3
            Integer(
                low=1, high=5, prior="uniform", name="n_independent", dtype=int
            ),  # 2
            Integer(low=1, high=5, prior="uniform", name="n_shared", dtype=int),  # 2
        ] + self.trainer.SPACE

    def _initial_values(self, model_name):
        return {
            "n_d": 8,
            "n_a": 8,
            "n_steps": 3,
            "gamma": 1.3,
            "n_independent": 2,
            "n_shared": 2,
            "lr": self.trainer.args["lr"],
            "weight_decay": self.trainer.args["weight_decay"],
            "batch_size": self.trainer.args["batch_size"],
        }
```
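
A quick consistency check between the two methods can catch typos early. The sketch below assumes that the initial values serve as the starting point of the search and should therefore lie inside the bounds. Plain `(low, high)` tuples stand in for the `skopt` dimensions, `out_of_bounds` is a hypothetical helper, and the trainer-level entries from `self.trainer.SPACE` are omitted because their bounds are not shown here.

```python
# (low, high) pairs copied from the `_space` snippet above.
bounds = {
    "n_d": (4, 64),
    "n_a": (4, 64),
    "n_steps": (3, 10),
    "gamma": (1.0, 2.0),
    "n_independent": (1, 5),
    "n_shared": (1, 5),
}

# Model-specific entries copied from the `_initial_values` snippet above.
initial_values = {
    "n_d": 8,
    "n_a": 8,
    "n_steps": 3,
    "gamma": 1.3,
    "n_independent": 2,
    "n_shared": 2,
}


def out_of_bounds(bounds, initial_values):
    """Return the names whose initial value is missing or outside [low, high]."""
    bad = []
    for name, (low, high) in bounds.items():
        value = initial_values.get(name)
        if value is None or not (low <= value <= high):
            bad.append(name)
    return bad


print(out_of_bounds(bounds, initial_values))  # []
```

An empty list means every model-specific hyperparameter has an initial value inside its search range.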

Before training, each model base processes the dataset in its own way. Since the testing set cannot be accessed in the training stage, two separate methods are defined: one for the dataset held by the `Trainer` and one for datasets supplied later.

`_train_data_preprocess` returns the processed dataset for the given `Trainer`, which provides all the training information and data required. In this example, `X_train/X_val/X_test` represent the training/validation/testing sets, and `y_train/y_val/y_test` represent the corresponding labels.

```python
    def _train_data_preprocess(self, model_name):
        data = self.trainer.datamodule
        cont_feature_names = self.trainer.cont_feature_names
        X_train = data.X_train[cont_feature_names].values.astype(np.float32)
        X_val = data.X_val[cont_feature_names].values.astype(np.float32)
        X_test = data.X_test[cont_feature_names].values.astype(np.float32)
        y_train = data.y_train.astype(np.float32)
        y_val = data.y_val.astype(np.float32)
        y_test = data.y_test.astype(np.float32)

        return {
            "X_train": X_train,
            "y_train": y_train,
            "X_val": X_val,
            "y_val": y_val,
            "X_test": X_test,
            "y_test": y_test,
        }
```

Correspondingly, `_data_preprocess` will process an upcoming new dataset, including tabular data `df` containing continuous features and categorical features, and unstacked derived data `derived_data` (multi-modal data or something else depending on the configuration introduced in "Using data functionalities"). The returned value should have the same structure as the `X_test` returned in `_train_data_preprocess`.

```python
    def _data_preprocess(self, df, derived_data, model_name):
        return df[self.trainer.cont_feature_names].values.astype(np.float32)
```

**Remark**: The tabular dataset has gone through all processing stages defined in the `DataModule` inside the trainer **except scaling**. Call `self.trainer.datamodule.data_transform(df, scaler_only=True)` to scale it using the trained scaler if no scaling stage is defined in the model base.

The program will pass a selected set of hyperparameters as `kwargs` to initialize a model, train a model, and predict using the model. The returned `model` will be stored locally and reloaded for evaluation and inference, so make sure it contains all information needed to make predictions.

```python
    def _new_model(self, model_name, verbose, **kwargs):
        from pytorch_tabnet.tab_model import TabNetRegressor

        # Architecture hyperparameters are forwarded to TabNet itself.
        arch_keys = ["n_d", "n_a", "n_steps", "gamma", "n_independent", "n_shared"]
        # Only these keys are passed to the optimizer. `batch_size` and
        # `original_batch_size` also appear in `kwargs` but must not end up
        # in `optimizer_params`; `batch_size` is used in `_train_single_model`.
        optim_keys = ["lr", "weight_decay"]
        params = {key: kwargs[key] for key in arch_keys}
        optim_params = {key: kwargs[key] for key in optim_keys}

        model = TabNetRegressor(
            verbose=20 if verbose else 0, optimizer_params=optim_params
        )

        model.set_params(**params)
        return model
```
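
The routing of `kwargs` in `_new_model` can be exercised on its own, without TabNet installed. The sketch below uses a hypothetical helper `split_kwargs` with explicit key lists, so that extra keys such as `original_batch_size` do not leak into the optimizer arguments. The `lr`, `weight_decay`, and `batch_size` values are illustrative and not taken from any configuration.

```python
ARCH_KEYS = ("n_d", "n_a", "n_steps", "gamma", "n_independent", "n_shared")
OPTIM_KEYS = ("lr", "weight_decay")


def split_kwargs(**kwargs):
    """Route hyperparameters to the network, the optimizer, and the data loader."""
    arch = {key: kwargs[key] for key in ARCH_KEYS if key in kwargs}
    optim = {key: kwargs[key] for key in OPTIM_KEYS if key in kwargs}
    batch_size = int(kwargs.get("batch_size", 32))
    return arch, optim, batch_size


arch, optim, batch_size = split_kwargs(
    n_d=8,
    n_a=8,
    n_steps=3,
    gamma=1.3,
    n_independent=2,
    n_shared=2,
    lr=1e-3,  # illustrative value
    weight_decay=1e-9,  # illustrative value
    batch_size=64,  # illustrative value
    original_batch_size=64,  # extra key that must not reach the optimizer
)
print(sorted(optim))  # ['lr', 'weight_decay']
print(batch_size)     # 64
```

A catch-all `else` branch would instead send every unrecognized key to the optimizer, which is why the key lists are spelled out.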

**Remark**: `kwargs` contains all keys defined in `_initial_values`. If a parameter named `batch_size` is included, `kwargs` also contains a key named `original_batch_size`. The values of `batch_size` and `original_batch_size` may differ if the program finds that the requested batch size would produce tiny mini-batches. The threshold is defined by `self.limit_batch_size` (defaults to 6). A tiny batch might break some models, so it is better to use the modified `batch_size` value.
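
To see why a tiny mini-batch can appear, consider the size of the final batch when the last partial batch is kept. The sketch below is only an illustration of the problem, not the rule `tabensemb` applies when it adjusts `batch_size`: `last_batch_size` and `has_tiny_batch` are hypothetical helpers, and the strict `<` comparison against the limit is an assumption. The sample count 153 is the training-set size of the sample dataset used later in this part.

```python
def last_batch_size(n_samples, batch_size):
    """Size of the final mini-batch when the last partial batch is kept."""
    remainder = n_samples % batch_size
    return batch_size if remainder == 0 else remainder


def has_tiny_batch(n_samples, batch_size, limit=6):
    """True if the final mini-batch holds fewer than `limit` samples."""
    return last_batch_size(n_samples, batch_size) < limit


print(last_batch_size(153, 32), has_tiny_batch(153, 32))  # 25 False
print(last_batch_size(153, 76), has_tiny_batch(153, 76))  # 1 True
```

With 153 samples, a batch size of 76 leaves a final batch of a single sample, which is the kind of batch that layers such as batch normalization cannot handle in training mode.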

The framework passes `X_train`, `y_train`, `X_val`, and `y_val` from `_train_data_preprocess` to the following `_train_single_model` method, along with some other arguments describing the current training stage. `epoch` is the number of epochs to train the model. `warm_start=True` means the passed model is already trained and should be fine-tuned on a new dataset. `in_bayes_opt=True` means that the passed `kwargs` were selected by a Bayesian hyperparameter optimization step, and a simplified training routine is needed to reduce optimization time.

```python
    def _train_single_model(
        self,
        model,
        epoch,
        X_train,
        y_train,
        X_val,
        y_val,
        verbose,
        warm_start,
        in_bayes_opt,
        **kwargs,
    ):
        eval_set = [(X_val, y_val)]

        model.fit(
            X_train,
            y_train,
            eval_set=eval_set,
            max_epochs=epoch if not in_bayes_opt else self.trainer.args["bayes_epoch"],
            patience=self.trainer.args["patience"],
            loss_fn=torch.nn.MSELoss(),
            eval_metric=["mse"],
            batch_size=int(kwargs["batch_size"]),
            warm_start=warm_start,
            drop_last=False,
        )
```
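
The interplay of `max_epochs` and `patience` in the call above can be sketched without any deep learning library. The helpers `resolve_max_epochs` and `epochs_until_stop` are hypothetical, the early-stopping rule is a generic one (stop after `patience` consecutive epochs without improvement) that has not been checked against the `pytorch_tabnet` callback, and the losses and settings are made up.

```python
def resolve_max_epochs(epoch, in_bayes_opt, bayes_epoch):
    """Use the shorter Bayesian-optimization budget when in that stage."""
    return bayes_epoch if in_bayes_opt else epoch


def epochs_until_stop(val_losses, max_epochs, patience):
    """Count epochs run before early stopping or the epoch limit ends training."""
    best = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for loss in val_losses[:max_epochs]:
        epochs_run += 1
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run


losses = [5.0, 4.0, 3.0, 3.5, 3.2, 3.1, 2.0]  # made-up validation losses
full = resolve_max_epochs(epoch=100, in_bayes_opt=False, bayes_epoch=3)
short = resolve_max_epochs(epoch=100, in_bayes_opt=True, bayes_epoch=3)
print(epochs_until_stop(losses, full, patience=2))   # 5
print(epochs_until_stop(losses, short, patience=2))  # 3
```

In the normal stage, training ends through early stopping; during Bayesian optimization, the reduced epoch budget ends it first.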

To evaluate or use the model, `_pred_single_model` is defined. It receives `X_test` as processed by `_train_data_preprocess` or `_data_preprocess`.

```python
    def _pred_single_model(self, model, X_test, verbose, **kwargs):
        return model.predict(X_test).reshape(-1, 1)
```
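
The `reshape(-1, 1)` call makes the output shape independent of whether the underlying model returns a flat array or a column. A minimal sketch with `numpy` only, where `as_column` is a hypothetical helper and the prediction values are made up:

```python
import numpy as np


def as_column(predictions):
    """Return predictions as a 2-D array of shape (n_samples, 1)."""
    return np.asarray(predictions).reshape(-1, 1)


flat = np.array([0.5, 1.5, 2.5], dtype=np.float32)
print(as_column(flat).shape)             # (3, 1)
print(as_column(as_column(flat)).shape)  # (3, 1)
```

Applying the reshape to an array that is already a column leaves it unchanged, so the call is safe in both cases for single-target regression.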

The full code is as follows:

In [4]:
class TabNet(AbstractModel):
    def __init__(self, *args, **kwargs):
        super(TabNet, self).__init__(*args, **kwargs)
        if self.trainer.datamodule.task != "regression":
            raise Exception("We only discuss regression tasks here.")

    def _get_program_name(self):
        return "TabNet"

    def _get_model_names(self):
        return ["TabNet"]

    def _space(self, model_name):
        return [
            Integer(low=4, high=16, prior="uniform", name="n_d", dtype=int),  # 8
            Integer(low=4, high=16, prior="uniform", name="n_a", dtype=int),  # 8
            Integer(low=1, high=6, prior="uniform", name="n_steps", dtype=int),  # 3
            Real(low=1.0, high=1.5, prior="uniform", name="gamma"),  # 1.3
            Integer(
                low=1, high=4, prior="uniform", name="n_independent", dtype=int
            ),  # 2
            Integer(low=1, high=4, prior="uniform", name="n_shared", dtype=int),  # 2
        ] + self.trainer.SPACE

    def _initial_values(self, model_name):
        return {
            "n_d": 8,
            "n_a": 8,
            "n_steps": 3,
            "gamma": 1.3,
            "n_independent": 2,
            "n_shared": 2,
            "lr": self.trainer.args["lr"],
            "weight_decay": self.trainer.args["weight_decay"],
            "batch_size": self.trainer.args["batch_size"],
        }

    def _train_data_preprocess(self, model_name):
        data = self.trainer.datamodule
        cont_feature_names = self.trainer.cont_feature_names
        X_train = data.X_train[cont_feature_names].values.astype(np.float32)
        X_val = data.X_val[cont_feature_names].values.astype(np.float32)
        X_test = data.X_test[cont_feature_names].values.astype(np.float32)
        y_train = data.y_train.astype(np.float32)
        y_val = data.y_val.astype(np.float32)
        y_test = data.y_test.astype(np.float32)

        return {
            "X_train": X_train,
            "y_train": y_train,
            "X_val": X_val,
            "y_val": y_val,
            "X_test": X_test,
            "y_test": y_test,
        }

    def _data_preprocess(self, df, derived_data, model_name):
        return df[self.trainer.cont_feature_names].values.astype(np.float32)

    def _new_model(self, model_name, verbose, **kwargs):
        from pytorch_tabnet.tab_model import TabNetRegressor

        TabNetRegressor.device_name = "cpu"
        model = TabNetRegressor(
            verbose=20 if verbose else 0,
            optimizer_params={
                "lr": kwargs["lr"],
                "weight_decay": kwargs["weight_decay"],
            },
        )

        model.set_params(
            n_d=kwargs["n_d"],
            n_a=kwargs["n_a"],
            n_steps=kwargs["n_steps"],
            gamma=kwargs["gamma"],
            n_independent=kwargs["n_independent"],
            n_shared=kwargs["n_shared"],
        )
        return model

    def _train_single_model(
            self,
            model,
            epoch,
            X_train,
            y_train,
            X_val,
            y_val,
            verbose,
            warm_start,
            in_bayes_opt,
            **kwargs,
    ):
        eval_set = [(X_val, y_val)]

        model.fit(
            X_train,
            y_train,
            eval_set=eval_set,
            max_epochs=epoch if not in_bayes_opt else self.trainer.args["bayes_epoch"],
            patience=self.trainer.args["patience"],
            loss_fn=torch.nn.MSELoss(),
            eval_metric=["mse"],
            batch_size=int(kwargs["batch_size"]),
            warm_start=warm_start,
            drop_last=False,
        )

    def _pred_single_model(self, model, X_test, verbose, **kwargs):
        return model.predict(X_test).reshape(-1, 1)

We can compare the model with the TabNet implementations in the other two model bases. Note that they perform differently because of different training routines and randomization.

In [5]:
from tabensemb.trainer import Trainer
from tabensemb.model import PytorchTabular, WideDeep

trainer = Trainer(device="cpu")
trainer.load_config("sample")
trainer.load_data()
trainer.add_modelbases(
    [PytorchTabular(trainer, model_subset=["TabNet"]), WideDeep(trainer, model_subset=["TabNet"]), TabNet(trainer)])
trainer.train(stderr_to_stdout=True)

Project will be saved to ../../../../output/sample/2023-08-03-20-56-54-0_sample
Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-08-03-20-56-54-0_sample (data.csv and tabular_data.csv).

-------------Run PytorchTabular-------------

Training TabNet
Global seed set to 42
2023-08-03 20:56:54,716 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-08-03 20:56:54,717 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-08-03 20:56:54,730 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: TabNetModel
2023-08-03 20:56:54,749 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-08-03 20:56:54,794 - {pytorch_tabular.tabular_model:582} - INFO - Training Started

  | Name             | Type           | Params
----

In [6]:
trainer.get_leaderboard()

PytorchTabular metrics
TabNet 1/1
WideDeep metrics
TabNet 1/1
TabNet metrics
TabNet 1/1
Trainer saved. To load the trainer, run trainer = load_trainer(path='../../../../output/sample/2023-08-03-20-56-54-0_sample/trainer.pkl')


|  | Program | Model | Training RMSE | Training MSE | Training MAE | Training MAPE | Training R2 | Training MEDIAN_ABSOLUTE_ERROR | Training EXPLAINED_VARIANCE_SCORE | Testing RMSE | ... | Testing R2 | Testing MEDIAN_ABSOLUTE_ERROR | Testing EXPLAINED_VARIANCE_SCORE | Validation RMSE | Validation MSE | Validation MAE | Validation MAPE | Validation R2 | Validation MEDIAN_ABSOLUTE_ERROR | Validation EXPLAINED_VARIANCE_SCORE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | TabNet | TabNet | 172.02546 | 29592.758915 | 137.818516 | 1.024958 | 0.101657 | 116.960863 | 0.107024 | 165.081366 | ... | 0.080482 | 115.368511 | 0.085757 | 145.030899 | 21033.961806 | 115.993113 | 0.950409 | 0.050059 | 84.36516 | 0.053165 |
| 1 | WideDeep | TabNet | 170.028963 | 28909.848366 | 136.487055 | 0.993 | 0.122388 | 113.84225 | 0.126011 | 166.971009 | ... | 0.059311 | 112.697731 | 0.065377 | 144.846749 | 20980.580583 | 117.191995 | 1.011844 | 0.05247 | 91.948082 | 0.054091 |
| 2 | PytorchTabular | TabNet | 169.574964 | 28755.66856 | 135.228814 | 0.977328 | 0.127068 | 109.420505 | 0.127485 | 169.437693 | ... | 0.031311 | 114.269955 | 0.044738 | 143.152965 | 20492.771338 | 118.205225 | 1.03269 | 0.074501 | 95.991233 | 0.074509 |
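
The leaderboard rows can be re-ranked by any metric. The sketch below does this with plain dictionaries instead of the returned table, using only the Testing RMSE values reported above; `rank_by` is a hypothetical helper.

```python
# Testing RMSE values copied from the leaderboard above.
rows = [
    {"Program": "TabNet", "Model": "TabNet", "Testing RMSE": 165.081366},
    {"Program": "WideDeep", "Model": "TabNet", "Testing RMSE": 166.971009},
    {"Program": "PytorchTabular", "Model": "TabNet", "Testing RMSE": 169.437693},
]


def rank_by(rows, metric, ascending=True):
    """Return the rows sorted by the given metric."""
    return sorted(rows, key=lambda row: row[metric], reverse=not ascending)


ranked = rank_by(rows, "Testing RMSE")
best = ranked[0]
spread = ranked[-1]["Testing RMSE"] - ranked[0]["Testing RMSE"]
print(best["Program"])   # TabNet
print(round(spread, 3))  # 4.356
```

The three implementations differ by about 4.4 in Testing RMSE on this run, which is small relative to the RMSE itself and consistent with the note about different training routines and randomization.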


## More customizations

As the base class of all model bases, `AbstractModel` splits its key functionality into separate methods so that developers can override almost all of them for custom use cases. Some "high-level" ones are introduced here. For "low-level" (fundamental) ones, interested readers may refer to the source code and API docs of `AbstractModel`.