<h1 style='color:white; background:#663399; border:0'><center> TPS-Aug: EDA, Baselines (XGB, Keras NN)</center></h1>

![](https://storage.googleapis.com/kaggle-competitions/kaggle/26480/logos/header.png?t=2021-04-09-00-57-05)

<a id="section-start"></a>

The goal of these competitions is to provide a fun tabular dataset that is approachable for anyone. They are a good fit for people looking for something in between the Titanic Getting Started competition and a Featured competition. If you're an established competitions master or grandmaster, these probably won't be much of a challenge for you. We encourage you to avoid saturating the leaderboard.

The dataset used for this competition is synthetic, but based on a real dataset and generated using a CTGAN. The original dataset deals with calculating the loss associated with loan defaults. Although the features are anonymized, they have properties relating to real-world features.

For this competition, you will be predicting a `target loss` based on a number of feature columns given in the data. The ground truth loss is integer valued, although predictions can be continuous.

### See also my previous TPS notebooks:
- [TPS-July: EDA, Baseline Analysis (XGBRegressor)](https://www.kaggle.com/maksymshkliarevskyi/tps-july-eda-baseline-analysis-xgbregressor)
- [TPS-Jun: starting point (EDA, Baseline, CV)](https://www.kaggle.com/maksymshkliarevskyi/tps-jun-starting-point-eda-baseline-cv)

In [None]:
!pip install /kaggle/input/adjusttext
!pip install /kaggle/input/bioinfokit

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

# for feature importance study
import eli5
from eli5.sklearn import PermutationImportance
from pdpbox import pdp
import shap

# for PCA
from bioinfokit.visuz import cluster
from sklearn.decomposition import PCA

# ML
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor
import os
import tensorflow as tf
import keras
from tensorflow.keras import models, layers, callbacks
import tensorflow_addons as tfa
from keras.utils.vis_utils import plot_model
from keras import backend as K
import gc

# Reproducibility
def set_seed(seed = 0):
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    print('*** --- Set seed "%i" --- ***' %seed)

# Custom theme
plt.style.use('fivethirtyeight')

figure = {'dpi': '200'}
font = {'family': 'serif'}
grid = {'linestyle': ':', 'alpha': .9}
axes = {'titlecolor': 'black', 'titlesize': 20, 'titleweight': 'bold',
        'labelsize': 12, 'labelweight': 'bold'}

plt.rc('font', **font)
plt.rc('figure', **figure)
plt.rc('grid', **grid)
plt.rc('axes', **axes)

my_colors = ['#DC143C', '#FF1493', '#FF7F50', '#FFD700', '#32CD32', 
             '#4ddbff', '#1E90FF', '#663399', '#708090']

caption = "© maksymshkliarevskyi"

# Show our custom palette
sns.palplot(sns.color_palette(my_colors))
plt.title('Custom palette')
plt.text(6.9, 0.75, caption, size = 8)
plt.show()

First, let's load the data and take a look at basic statistics.

In [None]:
train = pd.read_csv('../input/tabular-playground-series-aug-2021/train.csv', 
                    index_col = 0)
test = pd.read_csv('../input/tabular-playground-series-aug-2021/test.csv', 
                   index_col = 0)
ss = pd.read_csv('../input/tabular-playground-series-aug-2021/sample_submission.csv')

<h2 style='color:white; background:#663399; border:0'><center>EDA</center></h2>

[**Back to the start**](#section-start)

In [None]:
train.describe().T.style.background_gradient(subset = ['count'], cmap = 'viridis') \
    .bar(subset = ['mean', '50%'], color = my_colors[6]) \
    .bar(subset = ['std'], color = my_colors[0])

In [None]:
test.describe().T.style.background_gradient(subset = ['count'], cmap = 'viridis') \
    .bar(subset = ['mean', '50%'], color = my_colors[6]) \
    .bar(subset = ['std'], color = my_colors[0])

Wow, the f60 feature has very large values.
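
One quick way to confirm this kind of observation is to rank the columns by their range, using the output of `describe()`. The sketch below runs on a small synthetic frame rather than the competition data, so the column names and values in `demo` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training frame (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'f0': rng.normal(0, 1, 1000),
    'f1': rng.uniform(0, 10, 1000),
    'f2': rng.uniform(0, 1e6, 1000),   # deliberately on a much larger scale
})

# Rank features by the spread between their min and max
stats = demo.describe().T
stats['range'] = stats['max'] - stats['min']
largest = stats['range'].sort_values(ascending = False)
print(largest.head())
```

Features that stand out here are natural candidates for scaling or a log transform before they are fed to a neural network.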

In [None]:
dtypes = train.dtypes.value_counts().reset_index()

plt.figure(figsize = (12, 1))
plt.title('Data types\n')
plt.barh(str(dtypes.iloc[0, 0]), dtypes.iloc[0, 1],
         label = str(dtypes.iloc[0, 0]), color = my_colors[4])
plt.legend(loc = 'upper center', ncol = 3, fontsize = 13,
           bbox_to_anchor = (0.5, 1.45), frameon = False)
plt.yticks('')
plt.text(85, -0.9, caption, size = 8)
plt.show()

We have 250000 training and 150000 test observations. All our data is in float32 format.

Before we continue, let's pull the target feature into a separate variable.

In [None]:
loss = train.loss

train.drop(['loss'], axis = 1, inplace = True)

It's important to see if our data has missing values.

In [None]:
# Concatenate train and test datasets
all_data = pd.concat([train, test], axis = 0)

# columns with missing values
cols_with_na = all_data.isna().sum()[all_data.isna().sum() > 0].sort_values(ascending = False)
cols_with_na

As in the previous competitions, our data has no missing values. Now, let's look at the feature distributions.

In [None]:
print('Train data')
fig = plt.figure(figsize = (15, 120))
for idx, i in enumerate(train.columns):
    fig.add_subplot(int(np.ceil(len(train.columns)/4)), 4, idx+1)  # subplot grid size must be an int
    train.iloc[:, idx].hist(bins = 20)
    plt.title(i)
plt.text(10, -8000, caption, size = 12)
plt.show()

In [None]:
fig = plt.figure(figsize = (15, 5))
loss.hist(bins = 20)
plt.title('loss')
plt.text(37, -12000, caption, size = 12)
plt.show()

In [None]:
print('Test data')
fig = plt.figure(figsize = (15, 120))
for idx, i in enumerate(test.columns):
    fig.add_subplot(int(np.ceil(len(test.columns)/4)), 4, idx+1)  # subplot grid size must be an int
    test.iloc[:, idx].hist(bins = 20)
    plt.title(i)
plt.text(10, -8000, caption, size = 12)
plt.show()

We should also look at the correlation between features.

In [None]:
corr = train.corr()
mask = np.triu(np.ones_like(corr, dtype = bool))

plt.figure(figsize = (15, 15))
plt.title('Correlation matrix (train)')
sns.heatmap(corr, mask = mask, cmap = 'Spectral_r', linewidths = .5)
plt.text(106, 106, caption, size = 8)
plt.show()

In [None]:
corr = test.corr()
mask = np.triu(np.ones_like(corr, dtype = bool))

plt.figure(figsize = (15, 15))
plt.title('Correlation matrix (test)')
sns.heatmap(corr, mask = mask, cmap = 'Spectral_r', linewidths = .5)
plt.text(106, 106, caption, size = 8)
plt.show()

All features are weakly correlated.
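
A heatmap with this many features is hard to read precisely, so it can help to list the strongest pairs numerically. The sketch below does that on a synthetic frame (the `demo` columns are hypothetical, with one correlated pair planted on purpose); the same lines should work on `train.corr()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size = 2000)
demo = pd.DataFrame({
    'f0': base,
    'f1': base + rng.normal(scale = 0.1, size = 2000),  # planted strong correlation
    'f2': rng.normal(size = 2000),
    'f3': rng.normal(size = 2000),
})

abs_corr = demo.corr().abs()
# Keep only the strict upper triangle so each pair appears once
upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype = bool), k = 1))
top_pairs = upper.stack().dropna().sort_values(ascending = False)
print(top_pairs.head(3))
```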

<h2 style='color:white; background:#663399; border:0'><center>XGBRegressor Baseline</center></h2>

[**Back to the start**](#section-start)

First, we'll train a very simple baseline XGBRegressor.

In [None]:
# Create data sets for training (85%) and validation (15%)
X_train, X_valid, y_train, y_valid = train_test_split(train, loss, 
                                                      test_size = 0.15,
                                                      random_state = 0)

In [None]:
def root_mean_squared_error(y_true, y_pred):
    from sklearn.metrics import mean_squared_error
    from math import sqrt
    return sqrt(mean_squared_error(y_true, y_pred))
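
For reference, the helper above computes the usual root mean squared error over $n$ validation rows, where $y_i$ is the true loss and $\hat{y}_i$ is the prediction:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$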

In [None]:
# The basic model
params = {'random_state': 0,
          'predictor': 'gpu_predictor',
          'tree_method': 'gpu_hist',
          'eval_metric': 'rmse'}  # regression task: RMSE, not logloss

model = XGBRegressor(**params)

model.fit(X_train, y_train, verbose = False)

preds = model.predict(X_valid)
print('Valid RMSE of the basic model: {}'.format(root_mean_squared_error(y_valid, preds)))
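
A single 85/15 split gives only one, possibly noisy, estimate of the validation RMSE. `KFold` is imported above but not used yet; the sketch below shows how it could average the score over several folds. It runs on synthetic data and uses scikit-learn's `LinearRegression` as a stand-in for `XGBRegressor` so that it works without a GPU. Both of those choices are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (not the competition data)
rng = np.random.default_rng(0)
X = rng.normal(size = (500, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale = 0.5, size = 500)

kf = KFold(n_splits = 5, shuffle = True, random_state = 0)
fold_rmse = []
for train_idx, valid_idx in kf.split(X):
    fold_model = LinearRegression()          # stand-in for XGBRegressor(**params)
    fold_model.fit(X[train_idx], y[train_idx])
    fold_preds = fold_model.predict(X[valid_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y[valid_idx], fold_preds)))

print('CV RMSE: {:.4f} +/- {:.4f}'.format(np.mean(fold_rmse), np.std(fold_rmse)))
```

The spread across folds gives a rough idea of how much of a difference between two models is just split noise.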

In [None]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

shap.summary_plot(shap_values, X_valid)

<h2 style='color:white; background:#663399; border:0'><center>Keras NN Baseline</center></h2>

[**Back to the start**](#section-start)

In this step, we'll train our baseline Keras NN. We'll use the residual network from this excellent notebook: [tabular residual network](https://www.kaggle.com/oxzplvifi/tabular-residual-network).

Let's define the architecture.

In [None]:
def root_mean_squared_error(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true))) 

def create_model(shape = train.shape[1]):
    inputs = layers.Input(shape = (shape,))  # shape must be a tuple

    hidden = layers.Dropout(0.25)(inputs)
    hidden = tfa.layers.WeightNormalization(layers.Dense(units = 128, activation = 'relu'))(hidden)

    output = layers.Dropout(0.25)(layers.Concatenate()([inputs, hidden]))
    output = tfa.layers.WeightNormalization(layers.Dense(units = 64, activation = 'relu'))(output) 

    output = layers.Dropout(0.25)(layers.Concatenate()([inputs, hidden, output]))
    output = tfa.layers.WeightNormalization(layers.Dense(units = 32, activation = 'relu'))(output) 
    output = layers.Dense(1)(output)

    model = keras.Model(inputs = inputs, outputs = output, name = "res_nn_model")

    model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.003),
                  loss = root_mean_squared_error)
    return model
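
Because every `Dense` layer reads the concatenation of the raw inputs and all earlier hidden outputs, the input width grows from layer to layer. The small bookkeeping sketch below traces those widths; `n_features = 100` is an assumption for illustration (the notebook itself uses `train.shape[1]`).

```python
# Width bookkeeping for the dense skip connections in create_model().
# n_features = 100 is an assumed value, not read from the data.
n_features = 100
units = [128, 64, 32]

fan_in = []
seen = n_features              # width of everything concatenated so far
for u in units:
    fan_in.append(seen)        # this Dense layer reads all earlier outputs
    seen += u                  # its own output joins the next concatenation

for u, f in zip(units, fan_in):
    print('Dense({:>3}) reads {:>3} inputs'.format(u, f))
```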

Before starting the training process, let's visualize our architecture.

In [None]:
model = create_model()

plot_model(model, to_file = 'model_plot.png', show_shapes = True, show_layer_names = True)

In [None]:
scaler = StandardScaler()
train = scaler.fit_transform(train)
test = scaler.transform(test)

# Create data sets for training (85%) and validation (15%)
X_train, X_valid, y_train, y_valid = train_test_split(train, np.float32(loss), 
                                                      test_size = 0.15,
                                                      random_state = 0)

In [None]:
set_seed()

model = create_model()

early_stopping = callbacks.EarlyStopping(
    patience = 7,
    min_delta = 0.00001,
    restore_best_weights = True,
    monitor = "val_loss")
plateau = callbacks.ReduceLROnPlateau(
    factor = 0.6,                                     
    patience = 3,                                   
    min_delta = 0.00001,
    verbose = 0) 

history = model.fit(X_train, y_train,
                    batch_size = 64,
                    epochs = 30,
                    validation_data = (X_valid, y_valid),
                    callbacks = [early_stopping, plateau],
                    verbose = 0)

min_loss = round(np.min(history.history['val_loss']), 5)
min_loss_epoch = np.argmin(history.history['val_loss']) + 1
print('Best valid RMSE at the epoch #{} : {}'.format(min_loss_epoch, min_loss))

fig, ax = plt.subplots(figsize = (20, 4))
sns.lineplot(x = history.epoch, y = history.history['loss'])
sns.lineplot(x = history.epoch, y = history.history['val_loss'])
ax.set_title('Learning Curve (Loss) (Best value: {})'.format(min_loss))
ax.set_ylabel('Loss')
ax.set_xlabel('Epoch')
ax.legend(['train', 'valid'], loc='best')
plt.show()

del X_train, X_valid, y_train, y_valid, train
gc.collect()
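
To get a feel for how the plateau callback interacts with early stopping, here is a simplified pure-Python simulation of the learning-rate schedule with the same `factor` and `patience` as above. It is a sketch of the idea (mode `min`, no cooldown, no lower bound on the rate), not the exact Keras implementation, and the validation losses are hypothetical.

```python
def simulate_plateau(val_losses, lr = 0.003, factor = 0.6, patience = 3, min_delta = 1e-5):
    """Simplified ReduceLROnPlateau logic: mode='min', no cooldown, min_lr=0."""
    best = float('inf')
    wait = 0
    lrs = []
    for loss_value in val_losses:
        if loss_value < best - min_delta:
            best = loss_value       # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:    # stalled for `patience` epochs in a row
                lr *= factor
                wait = 0
        lrs.append(lr)              # rate in effect after this epoch's check
    return lrs

# Hypothetical validation losses, not taken from a real run
fake_val_losses = [8.0, 7.9, 7.9, 7.9, 7.9, 7.85, 7.85, 7.85, 7.85]
lrs = simulate_plateau(fake_val_losses)
for epoch, lr_value in enumerate(lrs, start = 1):
    print(epoch, round(lr_value, 6))
```

With `patience = 3` for the plateau and `patience = 7` for early stopping, the rate can be cut about twice during a stall before training stops.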

The NN's loss is slightly lower than the XGBRegressor's. However, both are very simple trial models, and there is still a lot of work to do on selecting optimal parameters.

<a id="section-8"></a>
<h2 style='color:white; background:#008294; border:0'><center>Test prediction</center></h2>

[**Back to the start**](#section-start)

Let's take a look at our predictions and prepare them for submission.

In [None]:
ss.loss = model.predict(test, batch_size = 8)
ss

In [None]:
ss.to_csv('submission.csv', index = False)

<h2 style='color:white; background:#663399; border:0'><center>WORK IN PROGRESS...</center></h2>

[**Back to the start**](#section-start)