<h1 style='color:white; background:#1E90FF; border:0'><center>TPS-Sep: all for start (EDA,XGB+CatBoost Baseline)</center></h1>

![](https://storage.googleapis.com/kaggle-competitions/kaggle/26480/logos/header.png?t=2021-04-09-00-57-05)

<a id="section-start"></a>

The goal of these competitions is to provide a fun tabular dataset that is approachable for anyone. These competitions will be great for people looking for something in between the Titanic Getting Started competition and a Featured competition. If you're an established competitions master or grandmaster, these probably won't be much of a challenge for you. We encourage you to avoid saturating the leaderboard.

The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using a CTGAN. The original dataset deals with calculating the loss associated with loan defaults. Although the features are anonymized, they have properties relating to real-world features.

For this competition, you will predict whether a customer made a claim upon an insurance policy. The ground truth claim is binary valued, but a prediction may be any number from `0.0` to `1.0`, representing the probability of a claim. The features in this dataset have been anonymized and may contain missing values.

### See also my previous TPS works:
- [TPS-Jun: starting point (EDA, Baseline, CV)](https://www.kaggle.com/maksymshkliarevskyi/tps-jun-starting-point-eda-baseline-cv)
- [TPS-July: EDA, Baseline Analysis (XGBRegressor)](https://www.kaggle.com/maksymshkliarevskyi/tps-july-eda-baseline-analysis-xgbregressor)
- [TPS-Aug: EDA, Baselines (XGB, Keras NN)](https://www.kaggle.com/maksymshkliarevskyi/tps-aug-eda-baselines-xgb-keras-nn)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

import warnings
warnings.filterwarnings('ignore')

# for feature importance study
import eli5
from eli5.sklearn import PermutationImportance
from pdpbox import pdp
import shap

# ML
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, RobustScaler
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn import metrics
import os
import gc

# Reproducibility
# (TensorFlow is neither imported nor used in this notebook, so only
# NumPy and the Python hash seed are fixed here)
def set_seed(seed = 0):
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    print('*** --- Set seed "%i" --- ***' %seed)

# Custom theme
plt.style.use('fivethirtyeight')

figure = {'dpi': '100'}
font = {'family': 'serif'}
grid = {'linestyle': ':', 'alpha': .9}
axes = {'titlecolor': 'black', 'titlesize': 20, 'titleweight': 'bold',
        'labelsize': 12, 'labelweight': 'bold'}

plt.rc('font', **font)
plt.rc('figure', **figure)
plt.rc('grid', **grid)
plt.rc('axes', **axes)

my_colors = ['#DC143C', '#FF1493', '#FF7F50', '#FFD700', '#32CD32', 
             '#4ddbff', '#1E90FF', '#663399', '#708090']

caption = "© maksymshkliarevskyi"

# Show our custom palette
sns.palplot(sns.color_palette(my_colors))
plt.title('Custom palette')
plt.text(6.9, 0.75, caption, size = 8)
plt.show()

<h2 style='color:white; background:#1E90FF; border:0'><center>Required functions</center></h2>

[**Back to the start**](#section-start)

In [None]:
def model_imp_viz(model, columns, bias = 0.01):
    imp = pd.DataFrame({'importance': model.feature_importances_,
                        'features': columns}).sort_values('importance', 
                                                          ascending = False)
    fig, ax = plt.subplots(figsize = (10, 40))
    plt.title('Feature importances', size = 15, fontweight = 'bold', fontfamily = 'serif')

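    # "viridis_r" is the reversed viridis palette (a list, not a one-shot iterator)
    sns.barplot(x = imp.importance, y = imp.features, edgecolor = 'black',
                palette = sns.color_palette("viridis_r", len(imp.features)))

    # Hide the top and right borders
    for i in ['top', 'right']:
        ax.spines[i].set_visible(False)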

    rects = ax.patches
    labels = imp.importance
    for rect, label in zip(rects, labels):
        x_value = rect.get_width() + bias
        y_value = rect.get_y() + rect.get_height() / 2

        ax.text(x_value, y_value, round(label, 4), fontsize = 9, color = 'black',
                 ha = 'center', va = 'center')
    ax.set_xlabel('Importance', fontweight = 'bold', fontfamily = 'serif')
    ax.set_ylabel('Features', fontweight = 'bold', fontfamily = 'serif')
    plt.show()

First, let's load the data and take a look at basic statistics.

In [None]:
train = pd.read_csv('../input/tabular-playground-series-sep-2021/train.csv', 
                    index_col = 0)
test = pd.read_csv('../input/tabular-playground-series-sep-2021/test.csv', 
                   index_col = 0)
ss = pd.read_csv('../input/tabular-playground-series-sep-2021/sample_solution.csv')

<h2 style='color:white; background:#1E90FF; border:0'><center>EDA</center></h2>

[**Back to the start**](#section-start)

In [None]:
train.describe().T.style.background_gradient(subset = ['count'], cmap = 'viridis') \
    .bar(subset = ['mean', '50%'], color = my_colors[6]) \
    .bar(subset = ['std'], color = my_colors[0])

In [None]:
test.describe().T.style.background_gradient(subset = ['count'], cmap = 'viridis') \
    .bar(subset = ['mean', '50%'], color = my_colors[6]) \
    .bar(subset = ['std'], color = my_colors[0])

In [None]:
dtypes = train.dtypes.value_counts().reset_index()

plt.figure(figsize = (12, 1))
plt.title('Data types\n')
plt.barh(str(dtypes.iloc[0, 0]), dtypes.iloc[0, 1],
         label = str(dtypes.iloc[0, 0]), color = my_colors[4])
plt.barh(str(dtypes.iloc[0, 0]), dtypes.iloc[1, 1], 
         left = dtypes.iloc[0, 1], 
         label = str(dtypes.iloc[1, 0]), color = my_colors[5])
plt.legend(loc = 'upper center', ncol = 3, fontsize = 13,
           bbox_to_anchor = (0.5, 1.45), frameon = False)
plt.yticks([])
plt.text(110, -0.9, caption, size = 8)
plt.show()

We have  training and  test observations. All features are stored as floats (`float64` with the default `pd.read_csv` settings used above). The target feature has an `int` format.

Before we continue, let's pull the target feature into a separate variable.

In [None]:
claim = train.claim

train.drop(['claim'], axis = 1, inplace = True)

It's important to see if our data has missing values.

In [None]:
# Concatenate train and test datasets
all_data = pd.concat([train, test], axis = 0)

# columns with missing values
cols_with_na = all_data.isna().sum()[all_data.isna().sum() > 0].sort_values(ascending = False)
cols_with_na

Let's create a variable with the number of missing values in each row, and then look at the claim rate for each value of this new feature.

In [None]:
train_na = train.isnull().sum(1)
test_na = test.isnull().sum(1)
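To make the idea concrete, here is a minimal sketch on a toy frame. The column names, values and the `toy_claim` target are made up for illustration; only the `isnull().sum(axis=1)` pattern is taken from the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the anonymized features
toy = pd.DataFrame({
    'f1': [0.1, np.nan, 0.3, np.nan],
    'f2': [1.0, 2.0, np.nan, np.nan],
    'f3': [5.0, 6.0, 7.0, np.nan],
})
toy_claim = pd.Series([0, 1, 0, 1])  # hypothetical binary target

# Count missing values across columns (axis=1) for every row
toy_na = toy.isnull().sum(axis=1)
print(toy_na.tolist())  # [0, 1, 1, 3]

# Claim rate for each value of the missing-count feature
toy_rate = toy_claim.groupby(toy_na).mean()
print(toy_rate.to_dict())  # {0: 0.0, 1: 0.5, 3: 1.0}
```

The counts have to be computed before any imputation, otherwise the information about which rows had gaps is lost.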

In [None]:
# 'mean' of a binary target is the claim rate, 'count' is the group size
missing = pd.DataFrame({'na': train_na, 
                        'claim': claim}).groupby('na')['claim'].agg(['mean', 'count'])
# Convert the claim rate to percent
missing['mean'] = missing['mean'] * 100

missing['mean'].plot.bar(figsize=(15, 5), 
                         color = my_colors[0], 
                         legend = None,
                         width = 0.7)
plt.title('Claim rate by missing features')
plt.xticks(rotation=0)
plt.xlabel('Number of missing features')
plt.ylabel('Claim rate, %')
plt.show()

missing['count'].plot.bar(figsize=(15, 5), 
                          color = my_colors[2], 
                          legend = None,
                          width = 0.7)
plt.title('Number of observations by missing features')
plt.xticks(rotation=0)
plt.xlabel('Number of missing features')
plt.ylabel('Number of observations')
plt.show()

Wow, some groups have a low claim rate. Looks like this new feature can be used for prediction.

Now, let's look at the feature distributions.

In [None]:
print('Train data')
fig = plt.figure(figsize = (20, 140))
for idx, i in enumerate(train.columns):
    # add_subplot expects integer grid sizes, so cast the result of np.ceil
    fig.add_subplot(int(np.ceil(len(train.columns)/4)), 4, idx+1)
    train.iloc[:, idx].hist(bins = 20)
    plt.title(i)
plt.text(9, -20000, caption, size = 12)
plt.show()

In [None]:
print('Test data')
fig = plt.figure(figsize = (20, 140))
for idx, i in enumerate(test.columns):
    # add_subplot expects integer grid sizes, so cast the result of np.ceil
    fig.add_subplot(int(np.ceil(len(test.columns)/4)), 4, idx+1)
    test.iloc[:, idx].hist(bins = 20)
    plt.title(i)
plt.text(9, -15000, caption, size = 12)
plt.show()

We should also look at the correlations between features.

In [None]:
corr = train.corr()
mask = np.triu(np.ones_like(corr, dtype = bool))

plt.figure(figsize = (15, 15))
plt.title('Correlation matrix')
sns.heatmap(corr, mask = mask, cmap = 'Spectral_r', linewidths = .5)
plt.text(124, 124, caption, size = 8)
plt.show()

In [None]:
corr = test.corr()
mask = np.triu(np.ones_like(corr, dtype = bool))

plt.figure(figsize = (15, 15))
plt.title('Correlation matrix')
sns.heatmap(corr, mask = mask, cmap = 'Spectral_r', linewidths = .5)
plt.text(124, 124, caption, size = 8)
plt.show()

All features are weakly correlated.
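A heatmap is hard to read with this many features, so one number can back up the visual impression: the largest absolute pairwise correlation. The sketch below runs on random toy data (an assumption, not the competition data); on the real frame `toy` would be replaced by `train`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for the feature matrix: 4 independent random columns
toy = pd.DataFrame(rng.normal(size=(1000, 4)),
                   columns=['f1', 'f2', 'f3', 'f4'])

corr_abs = toy.corr().abs().to_numpy()

# Indices of the strict upper triangle, so each pair is counted once
# and the diagonal (always 1.0) is excluded
upper_idx = np.triu_indices_from(corr_abs, k=1)
pair_corrs = corr_abs[upper_idx]

max_abs_corr = pair_corrs.max()
print('Pairs checked:', len(pair_corrs))
print('Largest |correlation|: {:.3f}'.format(max_abs_corr))
```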

<h2 style='color:white; background:#1E90FF; border:0'><center>XGB Baseline</center></h2>

[**Back to the start**](#section-start)

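In this step, we'll train a simple baseline XGBClassifier model. There are missing values in our data, so let's fill them with the column means. The means are computed on the training data and reused for the test data. We also add the missing-value count as a new feature, `na`.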

In [None]:
train['na'] = train_na
test['na'] = test_na

column_name = train.columns

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
train = imputer.fit_transform(train)
test = imputer.transform(test)
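The cell above fits the imputer on `train` and only applies it to `test`. The same idea can be sketched in plain pandas. This is an illustration on toy values, not a replacement for `SimpleImputer`; all names starting with `toy_` are hypothetical.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({'f1': [1.0, np.nan, 3.0],
                          'f2': [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({'f1': [np.nan, 5.0],
                         'f2': [np.nan, np.nan]})

# Means are computed on the training rows only (NaN is skipped by default)
train_means = toy_train.mean()

toy_train_filled = toy_train.fillna(train_means)
# The test set reuses the training means, it never contributes its own
toy_test_filled = toy_test.fillna(train_means)

print(train_means.to_dict())  # {'f1': 2.0, 'f2': 15.0}
print(toy_test_filled)
```

Reusing the training statistics keeps the preprocessing identical for both sets, which matters when a test column is entirely missing, as `f2` is here.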

In [None]:
# Create data sets for training (80%) and validation (20%)
X_train, X_valid, y_train, y_valid = train_test_split(train, claim, 
                                                      test_size = 0.2,
                                                      random_state = 0)

In [None]:
# The basic model
params = {'random_state': 0,
          'predictor': 'gpu_predictor',
          'tree_method': 'gpu_hist',
          'eval_metric': 'auc'}

model = XGBClassifier(**params)

model.fit(X_train, y_train, verbose = False)

preds = model.predict_proba(X_valid)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_valid, preds)
print('Valid AUC: ', metrics.auc(fpr, tpr))
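As a reminder of what the AUC measures: it is the share of (positive, negative) pairs in which the positive example gets the higher score. The helper below, `pairwise_auc`, is a hypothetical name written for this illustration. It builds every pair explicitly, so it is only suitable for toy inputs and is not what `metrics.roc_curve` does internally.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the share of correctly ranked (positive, negative) pairs.

    Ties between a positive and a negative score count as half a pair.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Matrix of score differences for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = pairwise_auc(y_toy, p_toy)
print(auc_toy)  # 0.75: three of the four pairs are ranked correctly
```

Because only the ordering of the scores matters, the AUC does not depend on any classification threshold.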

In [None]:
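# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (scikit-learn >= 1.0) replaces it
metrics.ConfusionMatrixDisplay.from_estimator(model, X_valid, y_valid,
                                              cmap = 'inferno')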
plt.title('Confusion matrix')
plt.grid(False)
plt.show()

In [None]:
model_imp_viz(model, column_name.values, bias = 0.04)

The `na` feature we created turns out to be very important!

In [None]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

shap.summary_plot(shap_values, X_valid)

Let's try a slightly more complex model.

In [None]:
# A more regularized model with many shallow trees and early stopping
XGB_params = {'n_estimators': 10000,
              'max_depth': 2,
              'colsample_bytree': 0.30,
              'learning_rate': 0.09,
              'reg_lambda': 18,
              'reg_alpha': 18,
              'random_state': 0,
              'predictor': 'gpu_predictor',
              'tree_method': 'gpu_hist',
              'eval_metric': 'auc'}

model = XGBClassifier(**XGB_params)

model.fit(X_train, y_train,
          eval_set = [(X_valid, y_valid)],
          early_stopping_rounds = 300,
          verbose = 1000)

preds = model.predict_proba(X_valid)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_valid, preds)
print('Valid AUC: ', metrics.auc(fpr, tpr))

In [None]:
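# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (scikit-learn >= 1.0) replaces it
metrics.ConfusionMatrixDisplay.from_estimator(model, X_valid, y_valid,
                                              cmap = 'inferno')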
plt.title('Confusion matrix')
plt.grid(False)
plt.show()

In [None]:
model_imp_viz(model, column_name.values, bias = 0.04)

In [None]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

shap.summary_plot(shap_values, X_valid)

Also, let's try a CatBoostClassifier.

In [None]:
CATB_params = {'n_estimators': 10000,
               'max_depth': 2,
               'learning_rate': 0.09,
               'grow_policy': 'SymmetricTree',
               'task_type': 'GPU',
               'eval_metric': 'AUC',
               'random_state': 0}
    
model = CatBoostClassifier(**CATB_params)
model.fit(X_train, y_train,  
          eval_set=[(X_valid, y_valid)],
          early_stopping_rounds = 400,
          verbose = 1000)

preds = model.predict_proba(X_valid)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(y_valid, preds)
print('Valid AUC: ', metrics.auc(fpr, tpr))

In [None]:
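# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (scikit-learn >= 1.0) replaces it
metrics.ConfusionMatrixDisplay.from_estimator(model, X_valid, y_valid,
                                              cmap = 'inferno')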
plt.title('Confusion matrix')
plt.grid(False)
plt.show()

In [None]:
model_imp_viz(model, column_name.values, bias = 3)

In [None]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

shap.summary_plot(shap_values, X_valid)

<h2 style='color:white; background:#1E90FF; border:0'><center>Test prediction</center></h2>

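[**Back to the start**](#section-start)

Let's make 5-fold CV predictions with XGBClassifier and CatBoostClassifier. Each model contributes the mean of its fold predictions with a weight of 0.5, so the submission is an equal-weight blend of the two.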

In [None]:
FOLDS = 5
ss.claim = np.zeros(len(ss.claim))
metric = []
kfold = KFold(n_splits = FOLDS, random_state = 0, shuffle = True)
i = 1
for train_idx, test_idx in kfold.split(train):
    # KFold returns positional indices, so index the target with .iloc
    X_train, y_train = train[train_idx, :], claim.iloc[train_idx]
    X_test, y_test = train[test_idx, :], claim.iloc[test_idx]

    model = XGBClassifier(**XGB_params)
    model.fit(X_train, y_train,
              eval_set=[(X_test, y_test)],
              early_stopping_rounds = 300,
              verbose = 0)
    
    preds = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = metrics.roc_curve(y_test, preds)
    AUC = metrics.auc(fpr, tpr)    
    print('[FOLD #{}] Validation AUC: {:.5f}'.format(i, AUC))

    ss.claim += model.predict_proba(test)[:, 1] / FOLDS * 0.5
    metric.append(AUC)
    i += 1
print('*'*50)
print('[ALL FOLDS] Mean Validation AUC: {:.5f}'.format(np.mean(metric)))

In [None]:
metric = []
kfold = KFold(n_splits = FOLDS, random_state = 0, shuffle = True)
i = 1
for train_idx, test_idx in kfold.split(train):
    # KFold returns positional indices, so index the target with .iloc
    X_train, y_train = train[train_idx, :], claim.iloc[train_idx]
    X_test, y_test = train[test_idx, :], claim.iloc[test_idx]

    model = CatBoostClassifier(**CATB_params)
    model.fit(X_train, y_train,
              eval_set = [(X_test, y_test)],
              early_stopping_rounds = 400,
              verbose = 0)
    
    preds = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = metrics.roc_curve(y_test, preds)
    AUC = metrics.auc(fpr, tpr)    
    print('[FOLD #{}] Validation AUC: {:.5f}'.format(i, AUC))

    ss.claim += model.predict_proba(test)[:, 1] / FOLDS * 0.5
    metric.append(AUC)
    i += 1
print('*'*50)
print('[ALL FOLDS] Mean Validation AUC: {:.5f}'.format(np.mean(metric)))
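Both loops add `predict_proba(test)[:, 1] / FOLDS * 0.5` to `ss.claim`. The sketch below uses random numbers in place of real model outputs (an assumption made purely for illustration) to check that this accumulation equals the average of the two models' fold-averaged predictions.

```python
import numpy as np

FOLDS = 5
rng = np.random.default_rng(0)
# Hypothetical per-fold test predictions for two models, 3 test rows each
preds_a = rng.uniform(size=(FOLDS, 3))
preds_b = rng.uniform(size=(FOLDS, 3))

# Accumulate exactly as the two CV loops do
blend = np.zeros(3)
for fold_pred in preds_a:
    blend += fold_pred / FOLDS * 0.5
for fold_pred in preds_b:
    blend += fold_pred / FOLDS * 0.5

# Closed form: equal-weight average of the two fold-averaged predictions
expected = 0.5 * preds_a.mean(axis=0) + 0.5 * preds_b.mean(axis=0)
print(np.allclose(blend, expected))  # True
```

Since the weights sum to 1, the blended values stay within the range of the individual predictions and remain valid probabilities.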

In [None]:
ss.to_csv('submission.csv', index = False)
ss

<h2 style='color:white; background:#1E90FF; border:0'><center>WORK IN PROGRESS...</center></h2>

[**Back to the start**](#section-start)