![](https://github.com/SauravMaheshkar/Tabular-Playground-Series-May-2021/blob/main/assets/Banner.png?raw=true)

# Table of Contents

1. [Packages 📦 and Basic Setup](#basic)
2. [Pre-Processing 👎🏻 -> 👍](#pre)
3. [The Model 👷‍♀️](#model)
4. [Training 💪🏻](#train)

<a id = 'basic'></a>
<h1 style="background-color:black;color:white;padding:10px; height: 50px;"> <center>Packages 📦 and Basic Setup</center> </h1>

Initially introduced in the paper [**"TabNet: Attentive Interpretable Tabular Learning"**](https://arxiv.org/pdf/1908.07442.pdf), TabNet is a deep learning architecture for tabular data that aims for both high performance and interpretability. It uses sequential attention to choose which features to reason from at each decision step, which makes its decisions easier to interpret and lets it spend its learning capacity on the most salient features.
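To make the feature-selection idea concrete, here is a minimal NumPy sketch of two decision steps. It assumes the mask comes from sparsemax applied to per-feature scores scaled by a prior, and that the prior is updated as `prior * (gamma - mask)`, which is how the paper describes it. The scores are random placeholders, not the output of a learned attentive transformer, and the names `sparsemax`, `prior` and `masks` are chosen only for this illustration.

```python
import numpy as np

def sparsemax(z):
    """Project a score vector onto the probability simplex.

    Unlike softmax, the result can contain exact zeros,
    so some features are dropped entirely.
    """
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum   # entries that stay non-zero
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max  # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)

rng = np.random.default_rng(0)
n_features, gamma = 6, 1.5   # gamma > 1 lets a feature be reused in later steps
prior = np.ones(n_features)  # every feature is fully available at the start
masks = []

for step in range(2):
    scores = rng.normal(size=n_features)  # placeholder for learned scores
    mask = sparsemax(prior * scores)      # sparse weights that sum to 1
    prior = prior * (gamma - mask)        # down-weight features already used
    masks.append(mask)
    print(f"step {step}: mask = {np.round(mask, 3)}")
```

Each mask sums to 1 and typically has several exact zeros, which is what makes the per-step feature choice readable. This notebook later sets `mask_type = 'entmax'`, a related sparse mapping; sparsemax is used here only because it is short to write out.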

![](https://github.com/SauravMaheshkar/Tabular-Playground-Series-May-2021/blob/main/assets/tabnet.png?raw=true)

A PyTorch implementation of TabNet has been made available by the team at [dreamquark-ai](https://github.com/dreamquark-ai/tabnet). We can install the package with `pip`.

```
pip install pytorch-tabnet
```

This kernel aims to be a starter notebook for you to add your own pre-processing / parameters.

In [None]:
%%capture
!pip install pytorch-tabnet

In [None]:
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from pytorch_tabnet.tab_model import TabNetClassifier
from pytorch_tabnet.pretraining import TabNetPretrainer

train = pd.read_csv("../input/tabular-playground-series-may-2021/train.csv")

<a id = 'pre'></a>
<h1 style="background-color:black;color:white;padding:10px; height: 50px;"> <center>Pre-Processing 👎🏻 -> 👍</center> </h1>

Upon a closer look, we realize that most features in this dataset are skewed. Thus, we standardize each feature to zero mean and unit variance before training.

<center><img src="https://github.com/SauravMaheshkar/Tabular-Playground-Series-May-2021/blob/main/assets/feature_distribution.png?raw=true"></center>

> Image taken from [TPS-May: Categorical EDA](https://www.kaggle.com/subinium/tps-may-categorical-eda)
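One way to quantify skew, rather than reading it off a plot, is pandas' `DataFrame.skew()`. The sketch below runs it on synthetic columns that stand in for the competition features (the column names and distributions are assumptions made for illustration). A positive value indicates a long right tail, a negative value a long left tail, and a value near zero a roughly symmetric column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for real features (assumption, not competition data)
demo = pd.DataFrame({
    "count_like": rng.poisson(0.5, size=10_000),  # mostly zeros, long right tail
    "symmetric": rng.normal(size=10_000),         # roughly symmetric
})

skewness = demo.skew()
print(skewness)
```

On the real data the same call would be `train[feature_cols].skew()`, with `feature_cols` being the list of feature column names.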

In [None]:
# Standardization: rescale each feature to zero mean and unit variance
feature_cols = [f'feature_{i}' for i in range(50)]
train[feature_cols] = (train[feature_cols] - train[feature_cols].mean()) / train[feature_cols].std()

We'll split the dataset **80-10-10** into training, validation and test sets respectively. Rows are assigned at random, so the proportions are approximate.

In [None]:
# Train, Test, Validation Split
target = 'target'
if "Set" not in train.columns:
    train["Set"] = np.random.choice(["train", "valid", "test"], p =[.8, .1, .1], size=(train.shape[0],))

train_indices = train[train.Set=="train"].index
valid_indices = train[train.Set=="valid"].index
test_indices = train[train.Set=="test"].index
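Because the assignment is random, the split changes on every run unless the generator is seeded. The sketch below shows a seeded version of the same idea and checks the resulting shares. The seed and the row count are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, fixed for reproducibility
n_rows = 100_000                 # hypothetical number of rows

labels = rng.choice(["train", "valid", "test"], p=[0.8, 0.1, 0.1], size=n_rows)

# Share of rows that landed in each set
names, counts = np.unique(labels, return_counts=True)
shares = dict(zip(names, counts / n_rows))
print(shares)
```

The shares come out close to, but not exactly, 0.8 / 0.1 / 0.1. If exact or stratified proportions matter, `sklearn.model_selection.train_test_split` with `stratify` is a common alternative.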

* Label-encode `object` columns, and fill NaN values in the remaining columns with the mean of the training split.

In [None]:
nunique = train.nunique()
types = train.dtypes

categorical_columns = []
categorical_dims =  {}
for col in train.columns:
    if types[col] == 'object':
        print(col, train[col].nunique())
        l_enc = LabelEncoder()
        train[col] = l_enc.fit_transform(train[col].values)
        categorical_columns.append(col)
        categorical_dims[col] = len(l_enc.classes_)
    else:
        train[col] = train[col].fillna(train.loc[train_indices, col].mean())
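As a small illustration of mean imputation done column by column with statistics taken from the training rows only, here is a toy frame. The frame, its column names and the `train_rows` index are made up for this example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 5.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})
train_rows = [0, 1, 2]  # hypothetical training-split index; row 3 is held out

for col in toy.columns:
    # Mean of this column over the training rows only (NaNs are skipped)
    col_mean = toy.loc[train_rows, col].mean()
    # Assign back to the single column so other columns are left untouched
    toy[col] = toy[col].fillna(col_mean)

print(toy)
```

Each gap is filled with its own column's training mean (2.0 for `a`, 15.0 for `b`), and the held-out row does not influence either value.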

<a id='model'></a>
<h1 style="background-color:black;color:white;padding:10px; height: 50px;"> <center>The Model 👷‍♀️</center> </h1>

In [None]:
# Columns not to use ('id' is a row identifier, not a predictive feature)
unused_feat = ['Set', 'id']

# Features to Use
features = [ col for col in train.columns if col not in unused_feat+[target]] 

X_train = train[features].values[train_indices]
y_train = train[target].values[train_indices]

X_valid = train[features].values[valid_indices]
y_valid = train[target].values[valid_indices]

X_test = train[features].values[test_indices]
y_test = train[target].values[test_indices]

In [None]:
# Basic model parameters
max_epochs = 30
batch_size = 1024
opt = torch.optim.Adam # Optimizer
opt_params = dict(lr=1e-3)
sch = torch.optim.lr_scheduler.StepLR # LR Scheduler
sch_params = {"step_size":10, "gamma":0.9}
mask = 'entmax'
workers = 2 # For torch DataLoader
sample_type = 1 # For automated sampling with inverse class occurrences 
virtual_batch = 128 # Size of the mini batches used for "Ghost Batch Normalization"
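To see what these settings imply, the arithmetic can be worked out in plain Python. This sketch assumes the scheduler is stepped once per epoch, and it reimplements the `StepLR` decay rule by hand rather than calling PyTorch. Note that `gamma` here is the scheduler's decay factor, not the TabNet `gamma` passed to the classifier.

```python
base_lr, step_size, gamma = 1e-3, 10, 0.9
max_epochs, batch_size, virtual_batch = 30, 1024, 128

def step_lr(epoch):
    """Learning rate in effect during `epoch`, assuming one scheduler step per epoch."""
    return base_lr * gamma ** (epoch // step_size)

schedule = [step_lr(epoch) for epoch in range(max_epochs)]
print(sorted(set(schedule), reverse=True))  # the distinct learning rates used

# Number of virtual ("ghost") batches that one full batch is split into
ghost_batches = batch_size // virtual_batch
print(ghost_batches)
```

With 30 epochs and a step size of 10, the learning rate only takes three values, ending at 81% of its starting value, so the schedule is quite gentle. Each batch of 1024 rows is normalized as 8 virtual batches of 128.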

The paper highlights a semi-supervised pre-training method, which is available via the `TabNetPretrainer` class. We'll use it to pre-train a model, then use that model as an unsupervised prior to boost the performance of the TabNet classifier.

In [None]:
unsupervised_model = TabNetPretrainer(
    optimizer_fn = opt,
    optimizer_params = opt_params,
    mask_type = mask)

In [None]:
clf = TabNetClassifier(gamma = 1.5,
                       lambda_sparse = 1e-4,
                       optimizer_fn = opt,
                       optimizer_params = opt_params,
                       scheduler_fn = sch,
                       scheduler_params = sch_params,
                       mask_type = mask)

<a id = 'train'></a>
<h1 style="background-color:black;color:white;padding:10px; height: 50px;"> <center>Training 💪🏻</center> </h1>

In [None]:
unsupervised_model.fit(
    X_train=X_train,
    eval_set=[X_valid],
    pretraining_ratio=0.8)
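The idea behind `pretraining_ratio` can be sketched in a few lines of NumPy: hide a random fraction of the feature values and score a reconstruction only on the hidden entries. This is a simplified illustration, not the library's implementation. The data is random, the "reconstruction" is a trivial all-zeros guess, and the plain squared error stands in for whatever loss the library actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
pretraining_ratio = 0.8          # fraction of entries to hide
X = rng.normal(size=(1000, 50))  # placeholder for the standardized features

# True marks an entry that is hidden from the model
mask = rng.random(X.shape) < pretraining_ratio
X_masked = np.where(mask, 0.0, X)

# Stand-in for the model's output: a trivial guess of zero everywhere
X_hat = np.zeros_like(X)

# Reconstruction error is measured on the hidden entries only
masked_error = ((X_hat - X) ** 2)[mask].mean()
print(f"hidden fraction: {mask.mean():.3f}, masked error: {masked_error:.3f}")
```

A ratio of 0.8 is aggressive: the model sees only about a fifth of each row and has to infer the rest. Lower values make the pre-training task easier and are worth trying as a tuning parameter.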

In [None]:
clf.fit(X_train=X_train, 
    y_train=y_train,
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_name=['train', 'val'],
    eval_metric=["logloss", 'balanced_accuracy'],
    max_epochs=max_epochs , patience=15,
    batch_size=batch_size,
    virtual_batch_size=virtual_batch,
    num_workers=workers,
    weights=sample_type,
    drop_last=False,
    from_unsupervised=unsupervised_model)
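The `weights=1` setting is described above as sampling with inverse class occurrences. A rough sketch of that idea, using toy labels and names invented for this example, is to give every row a weight of one over its class count, so that each class is drawn equally often overall.

```python
import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])  # toy, imbalanced labels

classes, counts = np.unique(y, return_counts=True)
class_weight = 1.0 / counts                        # rarer class, larger weight
sample_weight = class_weight[np.searchsorted(classes, y)]
sampling_prob = sample_weight / sample_weight.sum()

# Total probability of drawing a row from each class
per_class = np.array([sampling_prob[y == c].sum() for c in classes])
print(per_class)
```

Every class ends up with the same total sampling probability, regardless of how many rows it has. Whether this helps depends on the metric: it tends to favor `balanced_accuracy`, but it can hurt `logloss` because the predicted probabilities no longer reflect the true class frequencies.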

In [None]:
# plot losses
plt.plot(clf.history['loss'])

In [None]:
# plot train and validation log loss
plt.plot(clf.history['train_logloss'], label='train')
plt.plot(clf.history['val_logloss'], label='val')
plt.legend()
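For reading these curves it helps to know what the metric measures. Below is a minimal NumPy version of multi-class log loss, written for illustration (the function name and the clipping constant are choices made here, not taken from the library).

```python
import numpy as np

def multiclass_logloss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    proba = np.clip(proba, eps, 1 - eps)                 # avoid log(0)
    proba = proba / proba.sum(axis=1, keepdims=True)     # re-normalize rows
    return -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))

y_true = np.array([0, 1, 3])
uniform = np.full((3, 4), 0.25)  # a model that guesses all 4 classes equally

baseline = multiclass_logloss(y_true, uniform)
print(baseline)
```

A uniform guess over four classes scores ln(4), about 1.386, so a validation log loss near that value means the model has learned little beyond chance. A model that predicts the class frequencies does somewhat better on imbalanced data, which makes it a more demanding baseline.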

<h1 style="background-color:black;color:white;padding:10px; height: 50px;"> <center>Submission</center> </h1>

In [None]:
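test = pd.read_csv('../input/tabular-playground-series-may-2021/test.csv')

# Standardize the test features with the same statistics that were used for `train`.
# `train` was rescaled in place, so the statistics are recomputed from the raw file.
raw_train = pd.read_csv("../input/tabular-playground-series-may-2021/train.csv")
feature_cols = [f'feature_{i}' for i in range(50)]
test[feature_cols] = (test[feature_cols] - raw_train[feature_cols].mean()) / raw_train[feature_cols].std()
test_ds = test[features].values

sample_submission = pd.read_csv('../input/tabular-playground-series-may-2021/sample_submission.csv')
sample_submission[['Class_1','Class_2', 'Class_3', 'Class_4']] = clf.predict_proba(test_ds)
sample_submission.to_csv('tabnet_submission.csv',index = False)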

In [None]:
sample_submission.head()