# <p style="background-color:#33ff99; font-family:newtimeroman; font-size:170%; text-align:center">[TPS 03-21] TabNet Classifier 🔥🔥 </p>

Hello Kagglers, in this notebook we will use the TabNet classifier to solve the TPS Playground problem! I added some credits at the end of this notebook :D

* This notebook is an updated version of my previous work (TabNet Regression) for TPS 02-21.

> For this competition, you will be predicting a binary target based on a number of feature columns given in the data. All of the feature columns, cat0 - cat18 are categorical, and the feature columns cont0 - cont10 are continuous.


## TabNet: Attentive Interpretable Tabular Learning

"We propose a novel high-performance and interpretable canonical deep tabular data learning architecture, TabNet. TabNet uses sequential attention to choose which features to reason from at each decision step, enabling interpretability and more efficient learning as the learning capacity is used for the most salient features. We demonstrate that TabNet outperforms other neural network and decision tree variants on a wide range of non-performance-saturated tabular datasets and yields interpretable feature attributions plus insights into the global model behavior. Finally, for the first time to our knowledge, we demonstrate self-supervised learning for tabular data, significantly improving performance with unsupervised representation learning when unlabeled data is abundant."

[TabNet Paper](https://arxiv.org/abs/1908.07442)
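
To make the "sequential attention" idea a bit more concrete: the paper builds each step's feature mask with sparsemax, which, unlike softmax, can give some features a weight of exactly zero. The sketch below is a plain NumPy version of sparsemax for a single score vector. The function name and the example scores are made up for illustration, and this is not the pytorch-tabnet implementation (this notebook also sets `MASK_TYPE = "entmax"`, a related sparse transform).

```python
import numpy as np

def sparsemax(scores):
    """Project a 1-D score vector onto the probability simplex (sparsemax)."""
    z = np.asarray(scores, dtype=float)
    z_sorted = np.sort(z)[::-1]               # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    in_support = 1 + k * z_sorted > cumsum    # which sorted entries stay non-zero
    k_max = k[in_support].max()               # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max     # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)

# Hypothetical attention scores for four features
mask = sparsemax([2.0, 1.5, 0.1, -1.0])
print(mask)  # [0.75 0.25 0.   0.  ]
```

The two weakest features get a weight of exactly zero, which is what makes the masks readable as feature selections.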



## Install PyTorch TabNet

In [None]:
!pip install pytorch-tabnet


## Parameters


In [None]:
FLAG_LOCAL = False # True: run on a local Jupyter server, False: run in a Kaggle notebook :)

KAGGLE_PATH = "/kaggle/input/tabular-playground-series-mar-2021/"
LOCAL_PATH = "/home/rapela/Downloads/kaggle/tps_mar/input/"

PATH = (LOCAL_PATH if FLAG_LOCAL else KAGGLE_PATH)

TRAIN_PATH = PATH + "train.csv"
TEST_PATH = PATH + "test.csv"
SUBMISSION = PATH + "sample_submission.csv"
SUBMISSION_OUTPUT = "submission.csv"

print(TRAIN_PATH)
print(TEST_PATH)
print(SUBMISSION)

NUM_FOLDS = 2
SEED = 42


## TabNet Parameters

In [None]:
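# TabNet parameters
MAX_EPOCH = 500            # upper bound on training epochs (early stopping may end sooner)
N_D = 2                    # width of the decision (prediction) layer
N_A = 2                    # width of the attention embedding
N_STEPS = 3                # number of sequential decision steps
GAMMA = 1.3                # feature-reuse coefficient for the masks (close to 1 = least reuse across steps)
LAMBDA_SPARSE = 0          # weight of the sparsity regularization term
OPT_LR = 1e-2
OPT_WEIGHT_DECAY = 1e-5
OPT_MOMENTUM = 0.9         # unused below: Adam takes no momentum argument
MASK_TYPE = "entmax"       # "sparsemax" or "entmax"
SCHEDULER_MIN_LR = 1e-6    # only used by the commented-out ReduceLROnPlateau option
SCHEDULER_FACTOR = 0.9     # only used by the commented-out ReduceLROnPlateau option
DEVICE_NAME = "cuda"

BATCH_SIZE = 1024*4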


## Import Libraries

In [None]:
import torch
from torch import nn
from pytorch_tabnet.tab_model import TabNetClassifier
from pytorch_tabnet.pretraining import TabNetPretrainer

from tqdm.notebook import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold

from sklearn.metrics import roc_auc_score

import numpy as np
import pandas as pd 

import os
import random
import sys
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def seed_everything(seed_value):
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    os.environ['PYTHONHASHSEED'] = str(seed_value)
    
    if torch.cuda.is_available(): 
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        
seed_everything(SEED)


## Import Data

In [None]:
train = pd.read_csv(TRAIN_PATH, index_col='id')
test = pd.read_csv(TEST_PATH, index_col='id')
submission = pd.read_csv(SUBMISSION, index_col='id')
target = train.pop("target")
target = target.values

In [None]:
train.shape

In [None]:
train.head()

In [None]:
test.shape

In [None]:
test.head()

## Label Encoding


In [None]:
types = train.dtypes

categorical_columns = []
categorical_dims = {}
for col in train.columns:
    if types[col] == 'object':
        # Fit on train and test together so every category seen in either set gets a code
        lbl = LabelEncoder()
        lbl.fit(list(train[col].values) + list(test[col].values))

        train[col] = lbl.transform(train[col].values)
        test[col] = lbl.transform(test[col].values)

        # Remember which columns are categorical and how many distinct values each has
        categorical_columns.append(col)
        categorical_dims[col] = len(lbl.classes_)

In [None]:
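features = list(train.columns)

# Positions of the categorical columns within the feature list
cat_idxs = [i for i, f in enumerate(features) if f in categorical_columns]

# Number of distinct values for each categorical column, in the same order
cat_dims = [categorical_dims[f] for f in features if f in categorical_columns]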
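
To show what `cat_idxs` and `cat_dims` end up holding, here is a toy version of the two cells above in plain Python. The column names and values are invented, and the encoding (sorted unique values mapped to integers) only mimics what `LabelEncoder` does. In pytorch-tabnet these two lists are what the `cat_idxs` and `cat_dims` constructor arguments expect. Note that `tabnet_params` in this notebook does not pass them, so the encoded categories go into the model as ordinary numeric columns.

```python
# Toy table: two categorical columns and one continuous column (invented values)
toy = {
    "cat0": ["A", "B", "A", "C"],
    "cont0": [0.1, 0.7, 0.3, 0.9],
    "cat1": ["x", "y", "x", "x"],
}

categorical_columns = []
categorical_dims = {}
for col, values in toy.items():
    if isinstance(values[0], str):
        classes = sorted(set(values))                  # plays the role of LabelEncoder.classes_
        code = {label: i for i, label in enumerate(classes)}
        toy[col] = [code[v] for v in values]           # replace labels by integer codes
        categorical_columns.append(col)
        categorical_dims[col] = len(classes)

features = list(toy)
cat_idxs = [i for i, f in enumerate(features) if f in categorical_columns]
cat_dims = [categorical_dims[f] for f in features if f in categorical_columns]

print(cat_idxs)  # [0, 2]
print(cat_dims)  # [3, 2]
```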

In [None]:
train.head()

In [None]:
test.head()

In [None]:
columns = test.columns


## Create TabNet Params Dictionary

In [None]:
tabnet_params = dict(n_d=N_D, n_a=N_A, n_steps=N_STEPS, gamma=GAMMA,
                    lambda_sparse=LAMBDA_SPARSE, 
                    optimizer_fn=torch.optim.Adam,
                    optimizer_params=dict(lr=OPT_LR, weight_decay=OPT_WEIGHT_DECAY),  # momentum is not an Adam argument
                    mask_type=MASK_TYPE,
                     
                    scheduler_fn=torch.optim.lr_scheduler.OneCycleLR,
                    scheduler_params={
                                    "is_batch_level": True,
                                    "max_lr": 5e-2,
                                    "steps_per_epoch":int(train.shape[0] / BATCH_SIZE) + 1,
                                    "epochs": MAX_EPOCH
                    },
                     
                    #scheduler_fn=torch.optim.lr_scheduler.ReduceLROnPlateau,
                    #scheduler_params=dict(mode="min",
                    #                   patience=200,
                    #                   min_lr=SCHEDULER_MIN_LR,
                    #                   factor=SCHEDULER_FACTOR,),
                    verbose=10,
                    device_name=DEVICE_NAME,
                    seed=SEED
                    )
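
One detail worth checking with `OneCycleLR`: its schedule is laid out over `epochs * steps_per_epoch` steps, and PyTorch raises an error if the scheduler is stepped more often than that. The expression in `scheduler_params` counts batches over the full training set, while each fold trains on only part of it, so it over-counts. That is safe, but the learning rate then moves through the cycle more slowly than the fold size would suggest. Below is a rough sketch of a per-fold count, assuming `KFold`-style splits and `drop_last=False`. The helper name and the row count are illustrative and not taken from the competition data.

```python
import math

def steps_per_epoch_for_fold(n_rows, n_folds, batch_size):
    """Batches per epoch for the largest K-fold training split (drop_last=False)."""
    largest_train_split = n_rows - n_rows // n_folds  # the smallest validation fold is held out
    return math.ceil(largest_train_split / batch_size)

n_rows = 300_000  # illustrative row count
full_set_steps = int(n_rows / (1024 * 4)) + 1  # expression used in this notebook
fold_steps = steps_per_epoch_for_fold(n_rows, n_folds=2, batch_size=1024 * 4)
print(full_set_steps, fold_steps)  # 74 37
```

Using the largest training split keeps the count from ever being too small, which is the case that would trigger the scheduler error.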

## Run Kfold with TabNet Classifier

In [None]:
# train = train[0:3000]
# test = test[0:3000]
# target = target[0:3000]

In [None]:
print(train.shape)

In [None]:
print(test.shape)

In [None]:
print(target.shape)

## Unsupervised Pre-training

In [None]:
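# Self-supervised pretraining on train + test features (no labels needed).
# Note: when this model is later passed via from_unsupervised, pytorch-tabnet
# appears to copy the pretrainer's architecture settings (n_d, n_a, n_steps, ...)
# onto the classifier, so the defaults used here may replace N_D, N_A and N_STEPS.
unsupervised_model = TabNetPretrainer(
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    mask_type='entmax',
    device_name=DEVICE_NAME,
)

# Stack train and test rows once instead of going through Python lists
X_unsup = np.vstack([train.values, test.values])

unsupervised_model.fit(
    X_train=X_unsup,
    eval_set=[X_unsup],  # evaluated on the same rows it trains on
    patience=100,
    max_epochs=500,
    batch_size=2048, virtual_batch_size=128,
    num_workers=0,
    drop_last=False,
    pretraining_ratio=0.9,
)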
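
For intuition about `pretraining_ratio`: during pretraining a random subset of the input cells is hidden and the network is trained to reconstruct them. The NumPy sketch below only illustrates that masking idea under simplifying assumptions (independent Bernoulli masking per cell, plain squared error on the hidden cells, a placeholder instead of a network). The actual masking and loss in pytorch-tabnet may differ in their details.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))          # toy batch: 8 rows, 5 features
pretraining_ratio = 0.9              # probability that a cell is hidden

hidden = rng.binomial(1, pretraining_ratio, size=X.shape)  # 1 = hidden from the encoder
X_masked = X * (1 - hidden)          # hidden cells are zeroed out

# Placeholder "reconstruction": a real pretrainer would predict this from X_masked
reconstruction = np.zeros_like(X)

# The error is measured only on the cells that were hidden
n_hidden = max(int(hidden.sum()), 1)
loss = float((((reconstruction - X) ** 2) * hidden).sum() / n_hidden)
print(f"hidden cells: {int(hidden.sum())} of {X.size}, toy loss: {loss:.3f}")
```

With a ratio of 0.9 most cells are hidden, so the encoder sees very little of each row; lower values make the reconstruction task easier.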

In [None]:
train_oof = np.zeros((len(train)))
test_preds = 0

kf = KFold(n_splits=NUM_FOLDS, shuffle=True, random_state=SEED)

for f, (train_ind, val_ind) in tqdm(enumerate(kf.split(train, target))):

    print(f'Fold {f}')
    train_df, val_df = train.iloc[train_ind][columns], train.iloc[val_ind][columns]

    train_target, val_target = target[train_ind], target[val_ind]

    # target is already 1-D, so only the feature frames need converting
    train_df = train_df.to_numpy()
    val_df = val_df.to_numpy()
    

    model = TabNetClassifier(**tabnet_params)

    model.fit(X_train=train_df,
          y_train=train_target,
          eval_set=[(val_df, val_target)],
          eval_name = ["val"],
          eval_metric = ['auc'],
          max_epochs=MAX_EPOCH,
          patience=20, 
          batch_size=BATCH_SIZE, virtual_batch_size=128,
          num_workers=1, 
          drop_last=False,
          from_unsupervised=unsupervised_model    
         )

    temp_oof = model.predict_proba(val_df)[:,1]
    train_oof[val_ind] = temp_oof.reshape(-1)
    temp_test = model.predict_proba(test.to_numpy())[:, 1]

    test_preds += temp_test/NUM_FOLDS     
    
    print(f'Fold {f}: {roc_auc_score(val_target, temp_oof)}')        
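
The loop above follows the usual out-of-fold (OOF) pattern: each row's validation prediction comes from a model that never saw that row, and test predictions are averaged over the folds. The toy sketch below reproduces that pattern with NumPy only. The data, the stand-in "model" (`toy_predict_proba`) and the pairwise AUC helper are all made up for illustration. In the notebook itself, an overall OOF score could be obtained after the loop with `roc_auc_score(target, train_oof)`.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def toy_predict_proba(X_fit, y_fit, X_new):
    """Hypothetical stand-in for a fitted classifier: score along the class-mean difference."""
    w = X_fit[y_fit == 1].mean(axis=0) - X_fit[y_fit == 0].mean(axis=0)
    return 1.0 / (1.0 + np.exp(-(X_new @ w)))

rng = np.random.default_rng(42)
n_train, n_test, n_folds = 200, 50, 2
X = rng.normal(size=(n_train, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=n_train) > 0).astype(int)
X_test = rng.normal(size=(n_test, 3))

# Shuffle the row indices and cut them into n_folds validation folds
folds = np.array_split(rng.permutation(n_train), n_folds)

oof = np.zeros(n_train)
test_preds = np.zeros(n_test)
for val_idx in folds:
    train_idx = np.setdiff1d(np.arange(n_train), val_idx)
    # Validation rows are predicted by a model fitted on the other fold(s) only
    oof[val_idx] = toy_predict_proba(X[train_idx], y[train_idx], X[val_idx])
    # Each fold contributes an equal share to the final test prediction
    test_preds += toy_predict_proba(X[train_idx], y[train_idx], X_test) / n_folds

overall_auc = pairwise_auc(y, oof)
print(f"Overall OOF AUC: {overall_auc:.4f}")
```

A single score over all OOF predictions is often a steadier estimate than the individual fold scores, especially with only two folds.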


## Submit your output csv

In [None]:
submission['target'] = test_preds
submission.to_csv(SUBMISSION_OUTPUT)

## Credits

* [TabNet Paper](https://arxiv.org/pdf/1908.07442.pdf)
* [TabNet PyTorch GitHub](https://github.com/dreamquark-ai/tabnet)
* [Kaggle Notebook TabNet Regressor](https://www.kaggle.com/optimo/tabnetregressor-2-0-train-infer?scriptVersionId=44853427)
* [Tunguz CV Notebook](https://www.kaggle.com/tunguz/tps-02-21-feature-importance-with-xgboost-and-shap)
* [Rank Gauss](https://www.kaggle.com/tottenham/10-fold-simple-dnn-with-rank-gauss)

If this was useful to you, please leave a comment! Your feedback is really important.