# TabNet CODE

### Introduction to TabNet
---
* Ghost Batch Normalization (GBN)

* Sparsemax


### Code sources
---
* TabNet Torch code: https://ichi.pro/ko/pytorcheseo-tabnet-guhyeon-277727554318969
* Sparsemax code: https://github.com/gokceneraslan/SparseMax.torch
* paper: https://arxiv.org/pdf/1908.07442v4.pdf
* pytorch-TabNet1: https://pypi.org/project/pytorch-tabnet/
* pytorch-TabNet2: https://wsshin.tistory.com/5
* pytorch-TabNet-Regressor: https://www.kaggle.com/rapela/tps-02-21-tabnet-regressor


### Parameter Tuning
---
* Source: https://dreamquark-ai.github.io/tabnet/generated_docs/README.html#model-parameters

#### Model parameters

1. n_d: int(default = 8)
    - Width of the decision prediction layer. Bigger values give the model more capacity, at the risk of overfitting. Values typically range from 8 to 64.
<br/> 
2. n_a: int(default = 8)
    - Width of the attention embedding for each mask. According to the paper, n_d = n_a is usually a good choice.
<br/> 
3. n_steps: int(default = 3)
    - Number of steps in the architecture (usually between 3 and 10)
<br/> 
4. gamma: float (default = 1.3)
    - Coefficient for feature reuse in the masks. A value close to 1 makes mask selection least correlated between layers. Values range from 1.0 to 2.0.
<br/> 
5. cat_idxs: list of int (default=[] - Mandatory for embeddings)
    - List of categorical features indices.
<br/> 
6. cat_dims: list of int (default=[] - Mandatory for embeddings)
    - List of the number of modalities (unique values) of each categorical feature. /!\ Modalities not seen during training cannot be predicted.
<br/> 
7. cat_emb_dim : list of int (optional)
    - List of embedding sizes, one per categorical feature (default=1).
<br/> 
8. n_independent : int (default=2)
    - Number of independent Gated Linear Units layers at each step. Usual values range from 1 to 5.
<br/> 
9. n_shared : int (default=2)
    - Number of shared Gated Linear Units at each step. Usual values range from 1 to 5.
<br/> 
10. epsilon : float (default 1e-15)
    - Should be left untouched
<br/> 
11. seed : int (default=0)
    - Random seed for reproducibility
<br/> 
12. momentum : float
    - Momentum for batch normalization, typically ranges from 0.01 to 0.4 (default=0.02)
<br/> 
13. clip_value : float (default None)
    - If a float is given this will clip the gradient at clip_value.
<br/> 
14. lambda_sparse : float (default = 1e-3)
    - This is the extra sparsity loss coefficient as proposed in the original paper. The bigger this coefficient is, the sparser your model will be in terms of feature selection. Depending on the difficulty of your problem, reducing this value could help.
<br/>
15. optimizer_fn : torch.optim (default=torch.optim.Adam)
    - Pytorch optimizer function
<br/>     
16. optimizer_params: dict (default=dict(lr=2e-2))
    - Parameters compatible with optimizer_fn, used to initialize the optimizer. Since Adam is the default optimizer, this defines the initial learning rate used for training. As mentioned in the original paper, a large initial learning rate of 0.02 with decay is a good option.
<br/> 
17. scheduler_fn : torch.optim.lr_scheduler (default=None)
    - Pytorch Scheduler to change learning rates during training.
<br/> 
18. model_name : str (default = 'DreamQuarkTabNet')
    - Name of the model used for saving to disk. You can customize this to easily retrieve and reuse your trained models.
<br/> 
19. saving_path : str (default = './')
    - Path defining where to save models.
<br/> 
20. verbose : int (default=1)
    - Verbosity for notebook plots: set to 1 to see every epoch, 0 to disable output.
<br/> 
21. device_name : str (default='auto') 
    - 'cpu' for CPU training, 'gpu' for GPU training, 'auto' to automatically detect a GPU.
<br/> 
22. mask_type: str (default='sparsemax') 
    - Either "sparsemax" or "entmax": the masking function used for selecting features.
<br/> 
<br/> 
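To make the `mask_type` option more concrete, here is a minimal pure-Python sketch of the sparsemax projection. It is an illustration only, not the implementation used by pytorch-tabnet; the function name `sparsemax` and the list-based interface are assumptions made for this example.

```python
def sparsemax(scores):
    """Project a list of scores onto the probability simplex.

    Unlike softmax, the result can contain exact zeros, which is what
    makes the resulting feature-selection masks sparse.
    """
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    support_size, support_sum = 0, 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        # The k largest scores stay in the support while this holds.
        if 1.0 + k * value > cumulative:
            support_size, support_sum = k, cumulative
    # Threshold subtracted from every score before clipping at zero.
    threshold = (support_sum - 1.0) / support_size
    return [max(s - threshold, 0.0) for s in scores]


print(sparsemax([2.0, 1.0, 0.1]))  # -> [1.0, 0.0, 0.0]
print(sparsemax([1.0, 0.8, 0.1]))  # roughly [0.6, 0.4, 0.0]
```

The outputs always sum to 1, and scores that fall far enough below the largest one are set to exactly zero rather than to a small positive value.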

#### Fit parameters
1. X_train : np.array
<br/> 
2. y_train : np.array
<br/> 
3. eval_set: list of tuple
<br/> 
4. eval_name: list of str
<br/> 
5. eval_metric : list of str
<br/> 
6. max_epochs : int (default = 200)
<br/> 
7. patience : int (default = 15)
<br/> 
8. weights : int or dict (default=0)
    - Only for TabNetClassifier. Sampling parameter:
        - 0: no sampling
        - 1: automated sampling with inverse class occurrences
        - dict: keys are classes, values are weights for each class
<br/> 
9. loss_fn : torch.loss or list of torch.loss
    - Loss function for training (defaults to MSE for regression and cross-entropy for classification). When using TabNetMultiTaskClassifier, you can pass a list with one loss per task; each task is then assigned its own loss function.
<br/> 
10. batch_size : int (default=1024)
    - Number of examples per batch, large batch sizes are recommended
<br/>     
11. virtual_batch_size : int (default=128)
    - Size of the mini-batches used for "Ghost Batch Normalization". /!\ virtual_batch_size should divide batch_size.
<br/> 
12. num_workers : int (default=0)
    - Number of workers used in torch.utils.data.DataLoader
<br/> 
13. drop_last : bool (default=False)
    - Whether to drop last batch if not complete during training
<br/> 
14. callbacks : list of callback function
    - List of custom callbacks
<br/> 
15. pretraining_ratio : float
      - /!\ TabNetPretrainer Only : Percentage of input features to mask during pretraining.
      - Should be between 0 and 1. The bigger the harder the reconstruction task is.
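
The relationship between `batch_size` and `virtual_batch_size` described above can be sketched in plain Python. This is a simplified illustration of the Ghost Batch Normalization idea for a single feature column: it leaves out the learnable scale and shift and the running statistics of a real batch-norm layer. The function name `ghost_batch_norm` and the strict divisibility check are assumptions of this sketch, mirroring the warning in the parameter list.

```python
from statistics import fmean, pvariance


def ghost_batch_norm(values, virtual_batch_size, eps=1e-5):
    """Normalize one feature column chunk by chunk.

    Each virtual batch is normalized with its own mean and variance
    instead of the statistics of the full batch.
    """
    if len(values) % virtual_batch_size != 0:
        raise ValueError("virtual_batch_size should divide batch_size")
    normalized = []
    for start in range(0, len(values), virtual_batch_size):
        chunk = values[start:start + virtual_batch_size]
        mean = fmean(chunk)
        var = pvariance(chunk, mu=mean)  # biased variance over the chunk
        normalized.extend((v - mean) / (var + eps) ** 0.5 for v in chunk)
    return normalized


# A batch of 8 values split into 2 virtual batches of 4.
batch = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
print(ghost_batch_norm(batch, virtual_batch_size=4))
```

With the notebook's `BATCH_SIZE = 512` and the default `virtual_batch_size = 128`, each training batch would be split into 4 virtual batches in the same way.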

In [2]:
import warnings

warnings.filterwarnings( 'ignore' )

In [10]:
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

import torch
from torch import nn
from pytorch_tabnet.tab_model import TabNetRegressor

from tqdm.notebook import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold

from sklearn.metrics import mean_squared_error

import random

import optuna 
from optuna import Trial, visualization
from optuna.samplers import TPESampler
SEED = 42

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def seed_everything(seed_value):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)

    if torch.cuda.is_available(): 
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

seed_everything(SEED)


train_data = pd.read_csv("train.csv") 
test_data = pd.read_csv("test.csv")

x_data = train_data.loc[:, 'f0':'f99']
y_data = train_data.loc[:, 'loss']

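# Hold out 30% of the rows, then split that hold-out evenly into
# validation and test sets (70% train / 15% validation / 15% test).
x_train, x_test, y_train, y_test = train_test_split(x_data,
                                                    y_data,
                                                    test_size=0.3,
                                                    shuffle=True,       # shuffle rows before splitting
                                                    random_state=SEED)  # fixed seed for a reproducible split
x_val, x_test, y_val, y_test = train_test_split(x_test,
                                                y_test,
                                                test_size=0.5,
                                                shuffle=True,
                                                random_state=SEED)

y_train = y_train.values
y_val = y_val.values


# Label-encode string-typed columns. The encoder is fitted on all three
# splits so that every category in validation and test is known to it.
x_train, x_val, x_test = x_train.copy(), x_val.copy(), x_test.copy()
for c in x_train.columns:
    if x_train[c].dtype == 'object':
        lbl = LabelEncoder()
        lbl.fit(list(x_train[c].values) + list(x_val[c].values) + list(x_test[c].values))

        x_train[c] = lbl.transform(x_train[c].values)
        x_val[c] = lbl.transform(x_val[c].values)
        x_test[c] = lbl.transform(x_test[c].values)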

columns = x_test.columns
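
The two chained `train_test_split` calls above (`test_size=0.3`, then `test_size=0.5` on the hold-out) can be checked with a little arithmetic. The sketch below only works with fractions; the row count `n_rows` is a hypothetical value chosen for illustration, not a figure taken from the dataset.

```python
holdout_fraction = 0.3        # test_size of the first train_test_split call
test_share_of_holdout = 0.5   # test_size of the second call

fractions = {
    "train": 1 - holdout_fraction,
    "val": holdout_fraction * (1 - test_share_of_holdout),
    "test": holdout_fraction * test_share_of_holdout,
}

n_rows = 250_000  # hypothetical row count, for illustration only
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.0%} of rows, about {round(n_rows * fraction)} rows")
```

So the model is trained on 70% of the rows, with 15% each reserved for validation and testing.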

In [17]:
def Optuna_TabNet(trial):
    """Build a TabNetRegressor from hyperparameters sampled by Optuna."""
    # Sampled so it is stored with the trial for the K-fold stage;
    # it does not affect the score computed in objective().
    NUM_FOLDS = trial.suggest_categorical("NUM_FOLDS", [3, 5, 7])
    N_D = trial.suggest_categorical("N_D", [8, 16, 32, 64])
    N_A = N_D  # the paper suggests n_d = n_a
    N_STEPS = trial.suggest_int("N_STEPS", 3, 9)
    GAMMA = trial.suggest_float("GAMMA", 1.0, 2.0)
    LAMBDA_SPARSE = trial.suggest_float("LAMBDA_SPARSE", 0, 1e-2)
    # SGD optimizer settings: learning rate, weight decay, momentum
    OPT_LR = trial.suggest_categorical('OPT_LR', [5e-2, 1e-2, 1e-3, 1e-4])
    OPT_WEIGHT_DECAY = trial.suggest_categorical('OPT_WEIGHT_DECAY', [1e-8, 1e-6, 1e-5, 1e-4, 1e-3])
    OPT_MOMENTUM = trial.suggest_float("OPT_MOMENTUM", 0.01, 0.4)
    MASK_TYPE = trial.suggest_categorical('MASK_TYPE', ["sparsemax", "entmax"])

    # ReduceLROnPlateau settings (fixed, not tuned)
    SCHEDULER_MIN_LR = 1e-6
    SCHEDULER_FACTOR = 0.9

    tabnet_params = dict(n_d=N_D, n_a=N_A, n_steps=N_STEPS, gamma=GAMMA,
                         lambda_sparse=LAMBDA_SPARSE,
                         optimizer_fn=torch.optim.SGD,
                         optimizer_params=dict(lr=OPT_LR,
                                               weight_decay=OPT_WEIGHT_DECAY,
                                               momentum=OPT_MOMENTUM),
                         mask_type=MASK_TYPE,
                         scheduler_params=dict(mode="min",
                                               patience=20,
                                               min_lr=SCHEDULER_MIN_LR,
                                               factor=SCHEDULER_FACTOR,),
                         scheduler_fn=torch.optim.lr_scheduler.ReduceLROnPlateau,
                         verbose=10,
                         seed=SEED
                         )
    print(tabnet_params)

    return TabNetRegressor(**tabnet_params)

def objective(trial):
    """Train one TabNet configuration and return its MSE on the held-out split."""
    MAX_EPOCH = trial.suggest_categorical("MAX_EPOCH", [1000, 3000, 5000])
    BATCH_SIZE = 512

    # TabNet expects NumPy arrays and 2-D regression targets.
    train_df = x_train[columns].to_numpy()
    train_target = y_train.reshape(-1, 1)

    val_df = x_val[columns].to_numpy()
    val_target = y_val.reshape(-1, 1)

    model = Optuna_TabNet(trial)
    model.fit(X_train=train_df,
              y_train=train_target,
              eval_set=[(val_df, val_target)],
              eval_name=["val"],
              eval_metric=['mse'],
              max_epochs=MAX_EPOCH,
              patience=20,
              batch_size=BATCH_SIZE,
              drop_last=False)

    # Note: scoring on x_test means the held-out test split guides the search.
    score = mean_squared_error(y_test, model.predict(x_test.to_numpy()))
    print(score)
    return score
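
The scheduler settings used above (`factor=0.9`, `min_lr=1e-6`) imply a long decay. `ReduceLROnPlateau` multiplies the learning rate by `factor` on each reduction and never goes below `min_lr`, so the number of reductions needed to reach the floor can be counted with a short loop. The helper name `reductions_to_floor` is hypothetical and exists only for this sketch.

```python
def reductions_to_floor(initial_lr, factor=0.9, min_lr=1e-6):
    """Count how many plateau reductions bring the learning rate down to min_lr."""
    lr, steps = initial_lr, 0
    while lr > min_lr:
        lr = max(lr * factor, min_lr)  # same update rule as ReduceLROnPlateau
        steps += 1
    return steps


# Candidate learning rates from the OPT_LR search space.
for lr in [5e-2, 1e-2, 1e-3, 1e-4]:
    print(lr, reductions_to_floor(lr))
```

Since each reduction also needs the validation metric to stall for longer than the scheduler's `patience`, the floor is unlikely to be reached in practice, and a smaller `factor` may be worth trying if faster decay is wanted.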



In [None]:
study = optuna.create_study(direction='minimize',
                            sampler=TPESampler(seed=SEED))  # seeded for reproducible sampling
study.optimize(objective, n_trials=50)
print('Best trial: score {},\nparams {}'.format(study.best_trial.value,study.best_trial.params))

best_param = study.best_trial.params

[I 2021-08-22 16:37:03,614] A new study created in memory with name: no-name-260bd75e-9adc-4e4f-bef3-204909ce1b50


{'n_d': 64, 'n_a': 64, 'n_steps': 5, 'gamma': 1.7142678654254453, 'lambda_sparse': 0.0055511417357764995, 'optimizer_fn': <class 'torch.optim.sgd.SGD'>, 'optimizer_params': {'lr': 0.001, 'weight_decay': 0.0001, 'momentum': 0.35292327146569685}, 'mask_type': 'entmax', 'scheduler_params': {'mode': 'min', 'patience': 20, 'min_lr': 1e-06, 'factor': 0.9}, 'scheduler_fn': <class 'torch.optim.lr_scheduler.ReduceLROnPlateau'>, 'verbose': 10, 'seed': 42}
Device used : cpu
epoch 0  | loss: 78.35551| val_mse: 82.1103 |  0:01:08s


In [None]:
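# Rebuild the tuned settings from the best Optuna trial.
NUM_FOLDS = best_param["NUM_FOLDS"]
MAX_EPOCH = best_param["MAX_EPOCH"]
BATCH_SIZE = 512
tabnet_params = dict(n_d=best_param["N_D"], n_a=best_param["N_D"],
                     n_steps=best_param["N_STEPS"], gamma=best_param["GAMMA"],
                     lambda_sparse=best_param["LAMBDA_SPARSE"],
                     optimizer_fn=torch.optim.SGD,
                     optimizer_params=dict(lr=best_param["OPT_LR"],
                                           weight_decay=best_param["OPT_WEIGHT_DECAY"],
                                           momentum=best_param["OPT_MOMENTUM"]),
                     mask_type=best_param["MASK_TYPE"],
                     scheduler_params=dict(mode="min", patience=20, min_lr=1e-6, factor=0.9),
                     scheduler_fn=torch.optim.lr_scheduler.ReduceLROnPlateau,
                     verbose=10, seed=SEED)

train_oof = np.zeros(len(x_train))  # out-of-fold predictions for the training rows
test_preds = 0

kf = KFold(n_splits=NUM_FOLDS, shuffle=True, random_state=SEED)

for f, (train_ind, val_ind) in tqdm(enumerate(kf.split(x_train, y_train)), total=NUM_FOLDS):

    print(f'Fold {f}')
    train_df = x_train.iloc[train_ind][columns].to_numpy()
    val_df = x_train.iloc[val_ind][columns].to_numpy()

    # TabNetRegressor expects 2-D targets of shape (n_samples, n_outputs).
    train_target = y_train[train_ind].reshape(-1, 1)
    val_target = y_train[val_ind].reshape(-1, 1)

    print(train_df.shape, train_target.shape)
    print(val_df.shape, val_target.shape)

    model = TabNetRegressor(**tabnet_params)

    model.fit(X_train=train_df,
              y_train=train_target,
              eval_set=[(val_df, val_target)],
              eval_name=["val"],
              eval_metric=['mse'],
              max_epochs=MAX_EPOCH,
              patience=20, batch_size=BATCH_SIZE,
              drop_last=False)

    temp_oof = model.predict(val_df)
    train_oof[val_ind] = temp_oof.reshape(-1)
    print(np.sqrt(mean_squared_error(val_target, temp_oof)))  # fold RMSE

    # Average the held-out test predictions over the folds.
    temp_test = model.predict(x_test.to_numpy())
    test_preds += temp_test / NUM_FOLDS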

Index(['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10',
       'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20',
       'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30',
       'f31', 'f32', 'f33', 'f34', 'f35', 'f36', 'f37', 'f38', 'f39', 'f40',
       'f41', 'f42', 'f43', 'f44', 'f45', 'f46', 'f47', 'f48', 'f49', 'f50',
       'f51', 'f52', 'f53', 'f54', 'f55', 'f56', 'f57', 'f58', 'f59', 'f60',
       'f61', 'f62', 'f63', 'f64', 'f65', 'f66', 'f67', 'f68', 'f69', 'f70',
       'f71', 'f72', 'f73', 'f74', 'f75', 'f76', 'f77', 'f78', 'f79', 'f80',
       'f81', 'f82', 'f83', 'f84', 'f85', 'f86', 'f87', 'f88', 'f89', 'f90',
       'f91', 'f92', 'f93', 'f94', 'f95', 'f96', 'f97', 'f98', 'f99'],
      dtype='object')
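
The bookkeeping in the K-fold cell above (out-of-fold predictions for the training rows, fold-averaged predictions for the test rows) can be seen in isolation with a toy example. Everything here is a stand-in chosen for illustration: `kfold_indices` is a hypothetical helper that does not reproduce scikit-learn's exact fold assignment, and `mean_predictor` replaces TabNet with a model that always predicts the mean of its training targets.

```python
import random


def kfold_indices(n_samples, n_splits, seed):
    """Shuffle the indices once and deal them into n_splits validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]


def mean_predictor(train_targets):
    """Stand-in model: always predicts the mean of its training targets."""
    mean = sum(train_targets) / len(train_targets)
    return lambda rows: [mean] * len(rows)


targets = [float(v) for v in range(10)]
test_rows = [None] * 3   # features are irrelevant for the stand-in model
num_folds = 5

oof = [0.0] * len(targets)           # one out-of-fold prediction per training row
test_preds = [0.0] * len(test_rows)  # running average over the folds
for val_idx in kfold_indices(len(targets), num_folds, seed=42):
    held_out = set(val_idx)
    train_targets = [t for i, t in enumerate(targets) if i not in held_out]
    predict = mean_predictor(train_targets)
    for i, pred in zip(val_idx, predict(val_idx)):
        oof[i] = pred                # each row is predicted by a model that never saw it
    test_preds = [acc + pred / num_folds
                  for acc, pred in zip(test_preds, predict(test_rows))]

print(oof)
print(test_preds)
```

Each training row receives exactly one prediction, from the fold in which it was held out, while every test row ends up with the mean of the `num_folds` fold models' predictions.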


In [None]:
# XG-Boost Score: 61.34741620778698

In [22]:
print('#### fold #########',np.sqrt(mean_squared_error(y_test, test_preds)),mean_squared_error(y_test, test_preds))

#### fold ######### 7.893137848830042 62.30162510063334
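
The two numbers printed above are the RMSE and the MSE of the same predictions, so one is the square root of the other. The short check below verifies that and compares the result with the XGBoost score noted earlier. It assumes that score is also an MSE, which the notebook does not state.

```python
import math

tabnet_mse = 62.30162510063334     # MSE printed by the cell above
tabnet_rmse = math.sqrt(tabnet_mse)
xgboost_score = 61.34741620778698  # score noted earlier; assumed here to be an MSE

print(f"TabNet RMSE: {tabnet_rmse:.4f}")                         # TabNet RMSE: 7.8931
print(f"MSE gap vs. XGBoost: {tabnet_mse - xgboost_score:.4f}")  # MSE gap vs. XGBoost: 0.9542
```

Under that assumption, the tuned TabNet ends up slightly behind the XGBoost baseline on this split.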


In [24]:
np.save('TabNet_ytest.npy', test_preds)

In [20]:
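from pandas import DataFrame

# The original cell left the 'loss' values blank and ended with
# "ValueError: All arrays must be of the same length", which pandas raises
# when 'loss' does not hold exactly one value per id.
# Assumption: the submission needs one prediction per row of test.csv, made
# here with the model fitted in the last fold.
submission_preds = model.predict(test_data[columns].to_numpy()).reshape(-1)

raw_data = {'id': range(250000, 250000 + len(submission_preds)),
            'loss': submission_preds}

data = DataFrame(raw_data)
data.set_index('id', inplace=True)
print(data)
data.to_csv("submission.csv", mode='w')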