### TabNet with PyTorch

![Tabnet](https://github.com/titu1994/tf-TabNet/raw/master/images/TabNet.png?raw=true)

![Performance evaluation metrics](https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbgIGZp%2FbtqPsEsWvjd%2FzXRjwK6QPrzlPVj5FkAlG1%2Fimg.png)

### 1. Import Libraries

In [47]:
import os
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from pytorch_tabnet.tab_model import TabNetClassifier

### 2. Load Data

In [48]:
data = pd.read_csv('./data_eda02_final.csv')
data.drop('Unnamed: 0', axis=1, inplace=True)

In [49]:
target = 'label'
if "Set" not in data.columns:
    data["Set"] = np.random.choice(["train", "valid", "test"], p =[.6,.2,.2], size=(data.shape[0],))

train_indices = data[data.Set=='train'].index
valid_indices = data[data.Set=='valid'].index
test_indices = data[data.Set=='test'].index

In [50]:
train_indices.shape, valid_indices.shape, test_indices.shape

((747,), (243,), (249,))
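
The `np.random.choice` call above is unseeded, so the split sizes change on every run. A minimal sketch of a reproducible 60/20/20 split follows; the seed value and the `rng`, `split`, and `*_idx` names are assumptions for illustration, and the row count is simply the sum of the three sizes printed above.

```python
import numpy as np

# Hypothetical seed; any fixed integer makes the split repeatable.
rng = np.random.default_rng(seed=0)
n_rows = 1239  # 747 + 243 + 249, the sizes shown in the output above

# Draw one split label per row with the same 60/20/20 proportions.
split = rng.choice(["train", "valid", "test"], p=[0.6, 0.2, 0.2], size=n_rows)

# Positional indices of the rows that fall in each split.
train_idx = np.flatnonzero(split == "train")
valid_idx = np.flatnonzero(split == "valid")
test_idx = np.flatnonzero(split == "test")

print(len(train_idx), len(valid_idx), len(test_idx))
```

Because the labels are still drawn at random, the three sizes only approximate 60/20/20, but they are identical from run to run.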

### 3. Data Preprocessing

In [51]:
nunique = data.nunique()
types = data.dtypes

In [52]:
categorical_columns = []
categorical_dims =  {}
for col in data.columns:
    if types[col] == 'object' or nunique[col] < 200:
        print(col, data[col].nunique())
        l_enc = LabelEncoder()
        data[col] = data[col].fillna("VV_likely")
        data[col] = l_enc.fit_transform(data[col].values)
        categorical_columns.append(col)
        categorical_dims[col] = len(l_enc.classes_)
    else:
        data[col] = data[col].fillna(data.loc[train_indices, col].mean())

url_len 88
url_num_hyphens_dom 5
url_path_len 83
url_domain_len 46
url_num_dots 4
url_ip_present 2
html_num_tags('script') 14
html_num_tags('div') 43
html_num_tags('form') 2
html_num_tags('a') 17
label 2
Set 3
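
To make the label-encoding step concrete, here is a small sketch on a toy column. NumPy's `np.unique(..., return_inverse=True)` is used as a stand-in for `LabelEncoder`: both map the sorted unique values to integer codes. The toy values are invented for illustration.

```python
import numpy as np

# Toy column, not taken from the dataset.
values = np.array(["b", "a", "c", "a"])

# classes: sorted unique values; codes: index of each value in classes.
classes, codes = np.unique(values, return_inverse=True)

print(classes)       # ['a' 'b' 'c']
print(codes)         # [1 0 2 0]
print(len(classes))  # 3, the value stored in categorical_dims for this column
```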


In [53]:
# Collect the indices and cardinalities of the categorical variables for the categorical embeddings.
unused_feat = ['Set']
features = [ col for col in data.columns if col not in unused_feat+[target]]
cat_idxs = [ i for i, f in enumerate(features) if f in categorical_columns]
cat_dims = [ categorical_dims[f] for i, f in enumerate(features) if f in categorical_columns]
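
The two list comprehensions above can be traced on a toy feature list. In this sketch the column `price` and the cardinalities are hypothetical; only the logic mirrors the cell above.

```python
# Hypothetical feature list and cardinalities, for illustration only.
features = ["url_len", "price", "url_ip_present"]
categorical_columns = ["url_len", "url_ip_present"]
categorical_dims = {"url_len": 88, "url_ip_present": 2}

# Position of each categorical feature within the feature matrix.
cat_idxs = [i for i, f in enumerate(features) if f in categorical_columns]
# Number of distinct values of each categorical feature, in the same order.
cat_dims = [categorical_dims[f] for f in features if f in categorical_columns]

print(cat_idxs)  # [0, 2]
print(cat_dims)  # [88, 2]
```

The two lists must stay aligned: the i-th entry of `cat_dims` describes the column at position `cat_idxs[i]`.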

In [54]:
x_train = data[features].values[train_indices]
y_train = data[target].values[train_indices]

x_valid = data[features].values[valid_indices]
y_valid = data[target].values[valid_indices]


x_test = data[features].values[test_indices]
y_test = data[target].values[test_indices]

In [55]:
x_train.shape, y_train.shape, x_valid.shape, y_valid.shape , x_test.shape, y_test.shape

((747, 11), (747,), (243, 11), (243,), (249, 11), (249,))

- Unusually, TabNet does not support one-hot-encoded targets. The labels must always be passed as a 1D array.
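
As a small illustration of the 1D target format, the sketch below converts toy one-hot targets back to a 1D array of class indices with `argmax`. The array values are made up.

```python
import numpy as np

# Toy one-hot targets with shape (3, 2).
y_onehot = np.array([[1, 0], [0, 1], [0, 1]])

# The column index of the 1 in each row is the class label.
y_1d = y_onehot.argmax(axis=1)

print(y_1d)        # [0 1 1]
print(y_1d.shape)  # (3,)
```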

In [38]:
# y_train2 = y_train.copy()

In [42]:
# def to_categorical(y, num_classes):
#     return np.eye(num_classes, dtype='int64')[y]

In [43]:
# y_train = to_categorical(y_train, 2)
# y_valid = to_categorical(y_valid, 2)
# y_test = to_categorical(y_test, 2)

In [44]:
# x_train.shape, y_train.shape

((728, 11), (728, 2))

### 4. Define the Model

In [56]:
clf = TabNetClassifier(cat_idxs=cat_idxs,
                       cat_dims=cat_dims,
                       cat_emb_dim=10,
                       optimizer_fn=torch.optim.Adam,
                       optimizer_params=dict(lr=1e-2),
                       scheduler_params={"step_size":50,
                                         "gamma":0.9},
                       scheduler_fn=torch.optim.lr_scheduler.StepLR,
                       mask_type='sparsemax'  # options: "sparsemax" or "entmax"
                       )
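
To see what `step_size=50` and `gamma=0.9` mean for the learning rate, here is a sketch of the closed-form `StepLR` schedule in plain Python. It assumes the scheduler is stepped once per epoch; `step_lr` is a helper name introduced only for this illustration.

```python
base_lr, step_size, gamma = 1e-2, 50, 0.9

def step_lr(epoch):
    """Learning rate at a given epoch: decays by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

# Inspect the rate just before and after the single decay step.
for epoch in (0, 49, 50, 99):
    print(epoch, step_lr(epoch))
```

Under that assumption, a 100-epoch run sees only one decay, from 0.01 to roughly 0.009 at epoch 50.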



### 5. Train / Valid

In [57]:
max_epochs = 100

clf.fit(
    X_train=x_train, y_train=y_train,
    eval_set=[(x_train, y_train), (x_valid, y_valid)],
    eval_name=['train', 'valid'],
    eval_metric=['accuracy'],
    max_epochs=max_epochs , patience=20,
    batch_size=32, virtual_batch_size=32,
    num_workers=0,
    weights=1,
    drop_last=False,
)

epoch 0  | loss: 0.84976 | train_accuracy: 0.4257  | valid_accuracy: 0.42387 |  0:00:01s
epoch 1  | loss: 0.62073 | train_accuracy: 0.45248 | valid_accuracy: 0.46091 |  0:00:02s
epoch 2  | loss: 0.58613 | train_accuracy: 0.60643 | valid_accuracy: 0.63786 |  0:00:03s
epoch 3  | loss: 0.51021 | train_accuracy: 0.68809 | valid_accuracy: 0.67078 |  0:00:04s
epoch 4  | loss: 0.47345 | train_accuracy: 0.81392 | valid_accuracy: 0.80658 |  0:00:06s
epoch 5  | loss: 0.43385 | train_accuracy: 0.82999 | valid_accuracy: 0.82305 |  0:00:07s
epoch 6  | loss: 0.3882  | train_accuracy: 0.83802 | valid_accuracy: 0.79835 |  0:00:08s
epoch 7  | loss: 0.3964  | train_accuracy: 0.84739 | valid_accuracy: 0.82305 |  0:00:09s
epoch 8  | loss: 0.33466 | train_accuracy: 0.85274 | valid_accuracy: 0.84362 |  0:00:10s
epoch 9  | loss: 0.38383 | train_accuracy: 0.85274 | valid_accuracy: 0.83951 |  0:00:11s
epoch 10 | loss: 0.32294 | train_accuracy: 0.84873 | valid_accuracy: 0.83539 |  0:00:12s
epoch 11 | loss: 0.31



In [59]:
y_pred = clf.predict_proba(x_test)

In [61]:
print(y_pred[:, 1])

[9.9923301e-01 9.9765909e-01 1.4161796e-04 3.7828060e-05 9.9777621e-01
 9.9972457e-01 4.2075064e-04 2.5769461e-05 9.9962628e-01 9.9943405e-01
 1.4161796e-04 2.2435504e-05 4.3341020e-04 9.7619087e-01 4.0245627e-04
 1.9030641e-04 9.9976605e-01 5.9406314e-04 9.9988234e-01 9.5683783e-03
 9.9918920e-01 9.9980205e-01 1.4176948e-03 9.9965048e-01 9.9993920e-01
 4.6789457e-04 1.0003853e-04 9.9992073e-01 4.5881131e-01 9.3665816e-02
 9.9990141e-01 9.9980205e-01 9.9995041e-01 1.4161796e-04 9.9865049e-01
 9.9998868e-01 1.0565481e-04 9.9967813e-01 1.5920599e-04 1.6833850e-03
 9.9916077e-01 9.9980205e-01 1.3001354e-03 9.9945325e-01 1.4161796e-04
 9.9974293e-01 8.4898595e-05 3.4454644e-01 6.8571814e-04 9.9814141e-01
 9.9987400e-01 9.9774396e-01 9.9952543e-01 3.5465183e-04 9.9791962e-01
 1.4176948e-03 3.0510084e-05 4.3454310e-01 1.8628116e-04 9.8415208e-01
 2.0110032e-03 9.9992585e-01 9.9891210e-01 9.9971241e-01 4.2542517e-03
 9.9945325e-01 2.6435205e-03 9.9898440e-01 1.6036807e-02 1.4567998e-01
 9.998

In [62]:
print(y_test)

[1 1 0 0 1 1 0 0 1 1 0 0 0 1 0 0 1 0 1 0 1 1 0 1 1 0 0 1 1 0 1 1 1 0 1 1 0
 1 0 0 1 1 0 1 0 1 0 0 0 1 1 1 1 0 1 0 0 0 0 1 0 1 1 1 0 1 0 1 0 0 1 1 1 1
 0 1 0 1 0 1 0 1 1 0 1 1 0 1 1 1 1 0 1 1 1 0 0 0 1 0 0 1 0 1 0 0 1 0 1 0 1
 0 0 1 1 1 0 1 1 1 0 0 0 1 0 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 0 1 0 0 1 1 0 0
 1 0 1 0 0 0 1 1 1 1 1 0 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 0
 1 1 0 1 1 0 0 0 1 1 0 0 0 1 1 0 1 1 1 0 0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 1 0
 0 0 0 1 1 1 1 1 0 0 1 1 1 1 0 1 0 1 0 0 0 0 1 0 1 0 0]


### 6. Evaluate

In [69]:
from sklearn.metrics import roc_auc_score

print(roc_auc_score(y_test, y_pred[:, 1]))

0.9746610504146374
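
For intuition about the number above: ROC AUC is the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force over all positive/negative pairs on toy data; it illustrates the definition and is not how `roc_auc_score` is implemented.

```python
import numpy as np

# Toy labels and scores, invented for illustration.
y_true = np.array([1, 0, 1, 0])
score = np.array([0.9, 0.5, 0.4, 0.3])

pos = score[y_true == 1]  # scores of positive samples
neg = score[y_true == 0]  # scores of negative samples

# Compare every positive score with every negative score.
diff = pos[:, None] - neg[None, :]
auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(auc)  # 0.75
```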


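- `accuracy_score` raises an error here, most likely because `y_pred` holds probabilities from `predict_proba` rather than class labels. Following a suggestion found on Stack Overflow, `roc_auc_score` is used instead.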
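
If accuracy is still wanted, one option is to turn the probabilities into class labels first. The sketch below does this with `argmax` on toy stand-ins for `y_test` and the `predict_proba` output; the arrays are invented, and accuracy is computed directly with NumPy.

```python
import numpy as np

# Toy stand-ins for y_test and clf.predict_proba(x_test).
y_true = np.array([1, 0, 1, 0])
proba = np.array([[0.1, 0.9], [0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])

# Pick the most probable class in each row.
y_label = proba.argmax(axis=1)

# Fraction of predictions that match the true labels.
accuracy = (y_label == y_true).mean()

print(y_label)   # [1 0 0 0]
print(accuracy)  # 0.75
```

Once the predictions are discrete labels like `y_label`, they can also be passed to `accuracy_score` without the error.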