# About this notebook

This notebook is a simple pipeline using pytorch-tabnet (https://github.com/dreamquark-ai/tabnet) following the original paper's parameters (https://arxiv.org/abs/1908.07442).

It pretrains on the test set (unsupervised) and then runs standard 5-fold cross-validation, ensembling the folds by majority vote.

Preprocessing is minimal: the single class 5 row is removed, out-of-range Aspect and Hillshade values are corrected, and constant columns are ignored. Feature engineering is limited to a few simple numerical features.

This is just a very basic starting pipeline.


In [None]:
# Install pytorch-tabnet
!pip install pytorch-tabnet

In [None]:
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pytorch_tabnet.tab_model import TabNetClassifier
from pytorch_tabnet.pretraining import TabNetPretrainer
import torch
from sklearn.metrics import roc_auc_score, accuracy_score
import phik
from phik.report import plot_correlation_matrix
from phik import report

import copy

In [None]:
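def reduce_memory_usage(df):
    """Downcast each numeric column, in place, to the smallest dtype that holds its range.

    Object columns become 'category'. Note that float16 keeps only about
    3 significant decimal digits.
    """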
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != 'object':
            c_min = df[col].min()
            c_max = df[col].max()
            
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    pass
        else:
            df[col] = df[col].astype('category')
    
    return df

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/train.csv')
reduce_memory_usage(train)
test = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/test.csv')
reduce_memory_usage(test)
# Cover_Type 5 has a single row in train, so drop it
train = train[train.Cover_Type!=5].reset_index(drop=True)
# Merge class 4 into class 3, so the model never predicts class 4
train.loc[train.Cover_Type == 4, 'Cover_Type'] = 3
print(train.shape)
print(test.shape)

In [None]:
train.Cover_Type.value_counts()

**EDA**
* Aspect is the compass direction that a terrain slope faces, expressed here in degrees. Every value from 0 to 359 is present, but some values are greater than 359 and some are smaller than 0. It is better to bring all values in this column into the range [0, 359]. Since all values lie in the range (-360, 720), adding 360 to angles smaller than 0 and subtracting 360 from angles greater than 359 is enough.
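
The same wrap-around can also be written as a single modulo, since Python and NumPy return a non-negative remainder for a positive divisor. A minimal sketch on a toy Series (the sample values are made up for illustration):

```python
import pandas as pd

# Toy aspect values, some of them outside [0, 359] (made-up data)
aspect = pd.Series([-33, 0, 45, 359, 360, 405], dtype="int16")

# x % 360 maps any integer angle into [0, 359]; for values in (-360, 720)
# this matches "add 360 if negative, subtract 360 if above 359"
wrapped = aspect % 360
print(wrapped.tolist())  # [327, 0, 45, 359, 0, 45]
```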

In [None]:
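# Use .loc instead of chained indexing so the assignment reliably modifies the DataFrame
train.loc[train["Aspect"] < 0, "Aspect"] += 360
train.loc[train["Aspect"] > 359, "Aspect"] -= 360

test.loc[test["Aspect"] < 0, "Aspect"] += 360
test.loc[test["Aspect"] > 359, "Aspect"] -= 360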

**A quick Google search about Hillshade leads to the following result:**

Hillshading computes surface illumination as values from 0 to 255 based on a given compass direction to the sun (azimuth) and a certain altitude above the horizon (altitude). Hillshades are often used to produce maps that are visually appealing.

Thus, hillshade is a 3D representation of a terrain. It gives insight into the terrain's form by measuring the luminosity of each patch when a light source is cast at a particular angle.


In both the train and test datasets, some rows have hillshade values above 255 or below 0. These are presumably recording errors and should be replaced with an appropriate value. Values below 0 plausibly correspond to the darkest shade, so replacing them with 0 should be fine. Similarly, values above 255 can be assumed to correspond to the brightest shade, so 255 is a reasonable replacement.
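
The clamping described above can be sketched with `Series.clip`. The values below are made up for illustration:

```python
import pandas as pd

# Made-up hillshade values, two of them outside the valid [0, 255] range
hillshade = pd.Series([-12, 0, 130, 255, 270], dtype="int16")

# Values below 0 become 0, values above 255 become 255, the rest are unchanged
clamped = hillshade.clip(lower=0, upper=255)
print(clamped.tolist())  # [0, 0, 130, 255, 255]
```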

In [None]:
# Clamp every hillshade column to the valid [0, 255] range
for col in ["Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm"]:
    train[col] = train[col].clip(lower=0, upper=255)
    test[col] = test[col].clip(lower=0, upper=255)

In [None]:
# a = train.nunique().reset_index(drop=False).rename(columns={"index": "feat_name", 0: "count"})
# # drop columns with a single value
# drop_cols = ["Id"] + list(a[a["count"] < 2 ].feat_name)
# target = ["Cover_Type"]

In [None]:
# # categorical features are columns with small modalities
# cat_features = [col for col in list(a[a["count"] < 10 ].feat_name) if col not in drop_cols+target]
# num_features = [col for col in train.columns if col not in drop_cols+target+cat_features]

# features = cat_features + num_features

**Phik (𝜙k) correlation for FE** 

Exploring numerical features using phik correlation

In [None]:
# phik_overview_num_features = train[target+num_features].phik_matrix()
# phik_overview_num_features.round(2)

Add numerical features built from those that correlate with the target

In [None]:
features_Hillshade = ['Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']
features_for_generation = ['Elevation',
#                            'Aspect',
#                            'Slope',
                           'Horizontal_Distance_To_Hydrology',
                           'Vertical_Distance_To_Hydrology',
                           'Horizontal_Distance_To_Roadways',
#                            'Hillshade_9am',
#                            'Hillshade_Noon',
#                            'Hillshade_3pm',
                           'Horizontal_Distance_To_Fire_Points']
# soil_features = [x for x in train.columns if x.startswith("Soil_Type")]
wilderness_features = [x for x in train.columns if x.startswith("Wilderness_Area")]

def add_num_Feature(X):
    # Thanks @mpwolke : https://www.kaggle.com/mpwolke/tooezy-where-are-you-no-camping-here
#     X["Soil_Count"] = X[soil_features].apply(sum, axis=1)

#     # Thanks @yannbarthelemy : https://www.kaggle.com/yannbarthelemy/tps-december-first-simple-feature-engineering
    X["Wilderness_Area_Count"] = X[wilderness_features].sum(axis=1)
    X["Hillshade_mean"] = X[features_Hillshade].mean(axis=1)
    X['amp_Hillshade'] = X[features_Hillshade].max(axis=1) - X[features_Hillshade].min(axis=1)
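    for featr in features_for_generation:
        # Cast to int64 first: columns downcast to int16 would overflow when squared or cubed
        base = X[featr].astype(np.int64)
        X[featr+'^2'] = base**2
        X[featr+'^3'] = base**3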
#         X[featr+'log'] = np.log(X[featr])
#         for ft in features_for_generation:            
#             X[featr+'_'+ft] = X[featr] * X[ft]
    return X.copy()
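
One thing to keep in mind when taking powers of downcast columns: NumPy integer arithmetic keeps the narrow dtype and wraps around silently. A small sketch with a made-up elevation value, assuming the column is stored as int16:

```python
import numpy as np
import pandas as pd

elev = pd.Series([3000], dtype=np.int16)  # made-up value that fits int16

wrapped = elev ** 2                  # result stays int16 and overflows
exact = elev.astype(np.int64) ** 2   # widen first, then square

print(wrapped.dtype, exact.dtype)
print(exact.iloc[0])  # 9000000
```

Casting to a wider integer (or to float) before the power avoids the wrap-around, at the cost of more memory.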

In [None]:
add_num_Feature(train)
reduce_memory_usage(train)

In [None]:
add_num_Feature(test)
reduce_memory_usage(test)

Exploring categorical features using phik correlation

In [None]:
# phik_overview_cat_features = train[target+cat_features].phik_matrix()
# phik_overview_cat_features.round(2)

In [None]:
# train['Wilderness_Area4+Soil_Type39'] = train['Wilderness_Area4']*train['Soil_Type39']
# test['Wilderness_Area4+Soil_Type39'] = test['Wilderness_Area4']*test['Soil_Type39']

Add categorical features built from those that correlate with the target

In [None]:
# cat_features_for_generation = ['Wilderness_Area1',
#                                'Wilderness_Area3',
#                                'Wilderness_Area4',
#                                'Soil_Type10',
#                                'Soil_Type39']

# def add_cat_Feature(X):
#     for featr in cat_features_for_generation:        
#         for ft in cat_features_for_generation:
#             if ft != featr:
#                 X[featr+'_'+ft] = (X[featr] * X[ft])
#     return X.copy()

In [None]:
# add_cat_Feature(train)
# reduce_memory_usage(train)

In [None]:
# add_cat_Feature(test)
# reduce_memory_usage(test)

In [None]:
a = train.nunique().reset_index(drop=False).rename(columns={"index": "feat_name", 0: "count"})
# drop columns with a single value
drop_cols = ["Id"] + list(a[a["count"] < 2 ].feat_name)
target = ["Cover_Type"]

In [None]:
# categorical features are columns with small modalities
cat_features = [col for col in list(a[a["count"] < 10 ].feat_name) if col not in drop_cols+target]
num_features = [col for col in train.columns if col not in drop_cols+target+cat_features]

features = cat_features + num_features

In [None]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
train[num_features] = scaler.fit_transform(train[num_features])
test[num_features] = scaler.transform(test[num_features])

In [None]:
# Label-encode the categorical columns; cat_idxs and cat_dims are passed to TabNet below for its embeddings

from sklearn.preprocessing import LabelEncoder

categorical_columns = []
categorical_dims =  {}
for col in cat_features:
    l_enc = LabelEncoder()
    train[col] = train[col].fillna("VV_likely")
    train[col] = l_enc.fit_transform(train[col].values)
    categorical_columns.append(col)
    categorical_dims[col] = len(l_enc.classes_)
    
    test[col] = l_enc.transform(test[col].values)
    
cat_idxs = [i for i, f in enumerate(features) if f in cat_features]
cat_dims = [categorical_dims[f] for i, f in enumerate(features) if f in cat_features]

X_test = test[features].values

In [None]:
BS = 1024*16
VBS = BS 
max_epochs=55

tabnet_params = {"n_d" : 64,
                 "n_a" : 64,
                 "n_steps" : 5,
                 "gamma" : 1.5,
                 "n_independent" : 2,
                 "n_shared" : 2,
                 "cat_idxs" : cat_idxs,
                 "cat_dims" : cat_dims,
                 "cat_emb_dim" : 1,
                 "lambda_sparse" : 1e-4,
                 "momentum" : 0.3,
                 "clip_value" : 2.,
                 "optimizer_fn" : torch.optim.Adam,
                 "optimizer_params" :dict(lr=2e-2),}


params = copy.deepcopy(tabnet_params)
params["scheduler_fn"]=torch.optim.lr_scheduler.StepLR
params["scheduler_params"]={"is_batch_level":False,
                            "gamma":0.95,
                            "step_size": 1,}

In [None]:
# Unsupervised pretraining on the test set, validated on the first 100,000 training rows

X_unsup_valid = train[features].values[:100000]
params = tabnet_params.copy()

unsupervised_model = TabNetPretrainer(**params)

unsupervised_model.fit(
    X_train=X_test,
    eval_set=[X_unsup_valid],
    pretraining_ratio=0.8,
    max_epochs=25,
    patience=13,
    batch_size=1024*4,
    virtual_batch_size=1024*4
)


In [None]:

# Split for cross validation or single validation
from sklearn.model_selection import StratifiedKFold

N_SPLITS=5
skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=0)

cv_preds = np.zeros((X_test.shape[0], N_SPLITS))

fold_idx=0
for train_idx, val_idx in skf.split(train, train[target]):

    # Create the numpy datasets

    X_train = train.loc[train_idx, features].values
    Y_train = train.loc[train_idx, target].values.reshape(-1)

    X_val = train.loc[val_idx, features].values
    Y_val = train.loc[val_idx, target].values.reshape(-1)

    # Train a tabnet classifier

    params = copy.deepcopy(tabnet_params)

    # The scheduling scheme is the only part that differs from the original paper,
    # but the dataset is not exactly the same either

    # params["scheduler_fn"]=torch.optim.lr_scheduler.StepLR
    # params["scheduler_params"]={"is_batch_level":False,
    #                             "gamma":0.95,
    #                             "step_size": 5,}
    params["scheduler_fn"]=torch.optim.lr_scheduler.OneCycleLR
    params["scheduler_params"]={"is_batch_level":True,
                                "max_lr":5e-2,
                                "steps_per_epoch":int(X_train.shape[0] / BS),
                                "epochs":max_epochs}

    clf = TabNetClassifier(**params)

    clf.fit(
        X_train,
        Y_train,
        eval_set=[(X_train, Y_train), (X_val, Y_val)],
        eval_name=['train', 'valid'],
        eval_metric=['accuracy'],
        max_epochs=max_epochs,
        patience=20,
        drop_last=True,
        batch_size=BS,
        virtual_batch_size=VBS,
    #     weights=1,
        from_unsupervised=unsupervised_model
    )
    
    preds = clf.predict(X_test)
    cv_preds[:, fold_idx] = preds
    fold_idx+=1
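
The `steps_per_epoch` value matters because `OneCycleLR` is stepped once per batch here (`is_batch_level=True`) and raises an error if it is stepped more often than `epochs * steps_per_epoch`. With `drop_last=True` the incomplete final batch is skipped, so the number of batches per epoch is the floor of the division. A small arithmetic sketch, where the fold size is a made-up number and the `_demo` names are only for illustration:

```python
BS_demo = 1024 * 16        # same batch size as in the notebook
n_rows = 3_200_000         # hypothetical number of training rows in one fold
max_epochs_demo = 55

# Batches per epoch when the incomplete last batch is dropped
steps_per_epoch = n_rows // BS_demo
# Total number of scheduler steps over the whole training run
total_steps = steps_per_epoch * max_epochs_demo

print(steps_per_epoch, total_steps)  # 195 10725
```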

In [None]:
# Voting ensembling

from scipy import stats
final_res, _ = stats.mode(cv_preds, axis=1)
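
A NumPy-only sketch of the same majority vote, on made-up predictions for four test rows and five folds. The helper `majority_vote` is hypothetical and only illustrates the idea:

```python
import numpy as np

# Made-up class predictions: 4 test rows x 5 folds
cv_preds_demo = np.array([
    [1, 1, 2, 1, 2],
    [2, 2, 2, 3, 3],
    [7, 6, 7, 7, 6],
    [1, 2, 3, 1, 2],   # tie between classes 1 and 2
])

def majority_vote(row):
    # np.unique returns sorted labels, and argmax picks the first maximum,
    # so ties resolve to the smallest label
    labels, counts = np.unique(row, return_counts=True)
    return labels[np.argmax(counts)]

voted = np.apply_along_axis(majority_vote, 1, cv_preds_demo)
print(voted.tolist())  # [1, 2, 7, 1]
```

Hard voting like this discards the per-class probabilities. Averaging the outputs of `predict_proba` across folds would be an alternative, though it is not what this notebook does.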

In [None]:
df_submission = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/sample_submission.csv')
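# Flatten first: depending on the SciPy version, stats.mode returns shape (n, 1) or (n,)
df_submission['Cover_Type'] = np.ravel(final_res).astype(int)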
# df_submission.loc[df_submission.Cover_Type == 4, 'Cover_Type'] = 3
df_submission.to_csv('submission3.csv',index=False)

In [None]:
df_submission.Cover_Type.value_counts()