<h1 style='color:white; background:#008294; border:0'><center>TPS-Jun: starting point (EDA, Baseline)</center></h1>

![](https://storage.googleapis.com/kaggle-competitions/kaggle/26480/logos/header.png?t=2021-04-09-00-57-05)

<a id="section-start"></a>

The goal of these competitions is to provide a fun tabular dataset that is approachable for anyone. They are a good fit for people looking for something between the Titanic Getting Started competition and a Featured competition.

The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using a CTGAN. The original dataset deals with predicting the category of an eCommerce product given various attributes of the listing. Although the features are anonymized, they have properties relating to real-world features.

<h3 style='color:white; background:#008294; border:0'><center>Table of contents:</center></h3>


1. [**Loading libraries and data**](#section-1) <br>
2. [**EDA**](#section-2) <br>
3. [**Baseline**](#section-3) <br>
 3.1. [Simple model](#section-4) <br>
 3.2. [Feature importance](#section-5) <br>
 3.3. [Principal Component Analysis](#section-6) <br>
 3.4. [Test prediction](#section-7) <br>

<a id="section-1"></a>
<h1 style='color:white; background:#008294; border:0'><center>1. Loading libraries and data</center></h1>

[**Back to the table of contents**](#section-start)

In [None]:
!pip install /kaggle/input/adjusttext
!pip install /kaggle/input/bioinfokit

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.metrics import log_loss

import warnings
warnings.filterwarnings('ignore')

# for feature importance study
import eli5
from eli5.sklearn import PermutationImportance
from pdpbox import pdp
import shap

# for PCA
from bioinfokit.visuz import cluster
from sklearn.decomposition import PCA

# Custom theme
plt.style.use('fivethirtyeight')

figure = {'dpi': '200'}
font = {'family': 'serif'}
grid = {'linestyle': ':', 'alpha': .9}
axes = {'titlecolor': 'black', 'titlesize': 20, 'titleweight': 'bold',
        'labelsize': 12, 'labelweight': 'bold'}

plt.rc('font', **font)
plt.rc('figure', **figure)
plt.rc('grid', **grid)
plt.rc('axes', **axes)

my_colors = ['#DC143C', '#FF1493', '#FF7F50', '#FFD700', '#32CD32', 
             '#00FFFF', '#1E90FF', '#663399', '#708090']

caption = "© maksymshkliarevskyi"

# Show our custom palette
sns.palplot(sns.color_palette(my_colors))
plt.title('Custom palette')
plt.text(6.9, 0.75, caption, size = 8)
plt.show()

This competition's dataset is similar to the Tabular Playground Series - May 2021 dataset, but with more observations, more features, and more class labels.

In [None]:
train = pd.read_csv('../input/tabular-playground-series-jun-2021/train.csv', 
                    index_col = 0)
test = pd.read_csv('../input/tabular-playground-series-jun-2021/test.csv', 
                   index_col = 0)
ss = pd.read_csv('../input/tabular-playground-series-jun-2021/sample_submission.csv', 
                 index_col = 0)

y = train.target
target_names = np.sort(train.target.unique())
# Encode class names as integers 0..8 (recent XGBoost versions expect labels starting at 0)
target = LabelEncoder().fit_transform(train.target)
train.drop(['target'], axis = 1, inplace = True)
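
As a small aside, here is a minimal sketch of what LabelEncoder does with labels in the 'Class_N' format. The toy labels and the `toy_` names are made up for illustration and are not part of the competition data: the encoder sorts the unique labels and assigns integer codes starting at 0.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels in the same 'Class_N' format as the competition target
toy_labels = np.array(['Class_3', 'Class_1', 'Class_9', 'Class_1'])

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_labels)    # codes follow sorted label order

print(toy_encoder.classes_)                          # unique labels, sorted
print(toy_codes)                                     # position of each label in classes_
print(toy_encoder.inverse_transform(toy_codes))      # back to the original strings
```

The inverse transform is handy when predicted class indices need to be mapped back to class names.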

<a id="section-2"></a>
<h1 style='color:white; background:#008294; border:0'><center>2. EDA</center></h1>

[**Back to the table of contents**](#section-start)



Let's see if columns in train and test datasets are the same.
The result must be zero.

In [None]:
sum(train.columns != test.columns)

Good sign! Let's take a look at the summary statistics of our data. We'll use the built-in describe method with some visual styling.

### Train data

In [None]:
train.describe().T.style.background_gradient(subset = ['count'], cmap = 'viridis') \
    .bar(subset = ['mean', '50%'], color = my_colors[6]) \
    .bar(subset = ['std'], color = my_colors[0])

### Test data

In [None]:
test.describe().T.style.background_gradient(subset = ['count'], cmap = 'viridis') \
    .bar(subset = ['mean', '50%'], color = my_colors[6]) \
    .bar(subset = ['std'], color = my_colors[0])

In [None]:
dtypes = train.dtypes.value_counts().reset_index()

plt.figure(figsize = (12, 1))
plt.title('Data types\n')
plt.barh(str(dtypes.iloc[0, 0]), dtypes.iloc[0, 1],
         label = str(dtypes.iloc[0, 0]), color = my_colors[4])
plt.legend(loc = 'upper center', ncol = 3, fontsize = 13,
           bbox_to_anchor = (0.5, 1.45), frameon = False)
plt.yticks([])
plt.text(65, -0.9, caption, size = 8)
plt.show()

It's also important to see if our data has missing values.

In [None]:
# Concatenate train and test datasets
all_data = pd.concat([train, test], axis = 0)

# columns with missing values
cols_with_na = all_data.isna().sum()[all_data.isna().sum() > 0].sort_values(ascending = False)
cols_with_na

As before, our data has no missing values. Now, let's look at the feature distributions.

In [None]:
fig = plt.figure(figsize = (20, 70))
for i in range(len(train.columns)):
    fig.add_subplot(int(np.ceil(len(train.columns)/5)), 5, i+1)
    all_data.iloc[:, i].hist(bins = 20)
    plt.title('feature_{}'.format(i))
plt.text(25, -50000, caption, size = 12)
plt.show()

In [None]:
plt.figure(figsize = (12, 4))
plt.title('Target feature')
sns.countplot(x = y, edgecolor = 'black', 
              palette = sns.color_palette(my_colors))
plt.xlabel('')
plt.text(7.5, -8000, caption, size = 8)
plt.show()

We should also look at the correlation between features.

In [None]:
corr = train.corr()
mask = np.triu(np.ones_like(corr, dtype = bool))

plt.figure(figsize = (15, 15))
plt.title('Correlation matrix')
sns.heatmap(corr, mask = mask, cmap = 'Spectral_r', linewidths = .5)
plt.text(77, 83, caption, size = 8)
plt.show()

All features are weakly correlated.
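
A heatmap is easy to misread, so a claim like this can also be checked with a number. Below is a minimal sketch on a made-up DataFrame; the helper name `max_abs_offdiag_corr` and the toy columns are assumptions for illustration. It returns the largest absolute pairwise correlation and the pair of columns that produces it.

```python
import numpy as np
import pandas as pd

def max_abs_offdiag_corr(df):
    """Return the largest absolute pairwise correlation and its column pair."""
    corr_abs = df.corr().abs()
    # Keep only the strict upper triangle so each pair is counted once
    keep = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
    pairs = corr_abs.where(keep).stack()
    return pairs.max(), pairs.idxmax()

rng = np.random.default_rng(0)
base = rng.normal(size=1000)
toy = pd.DataFrame({
    'a': base,
    'b': base + rng.normal(scale=0.1, size=1000),  # strongly correlated with 'a'
    'c': rng.normal(size=1000),                    # independent noise
})

value, pair = max_abs_offdiag_corr(toy)
print(pair, round(value, 3))
```

Calling the same helper on `train` would give a single number to back up the visual impression from the heatmap.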

<a id="section-3"></a>
<h1 style='color:white; background:#008294; border:0'><center>3. Baseline</center></h1>

[**Back to the table of contents**](#section-start)

Our plan is as follows:
- first, we'll train a very simple basic XGBClassifier,
- then we'll look at the importance of features and build some interesting visualizations,
- and then we'll make a PCA.

<a id="section-4"></a>
<h2 style='color:white; background:#008294; border:0'><center>3.1. Simple model</center></h2>

[**Back to the table of contents**](#section-start)

First, we'll split our data into train and validation sets.

In [None]:
# Create data sets for training (80%) and validation (20%)
X_train, X_valid, y_train, y_valid = train_test_split(train, target, 
                                                      test_size = 0.2,
                                                      random_state = 0)

In [None]:
# The basic model
params = {'random_state': 0,
          'predictor': 'gpu_predictor',
          'tree_method': 'gpu_hist',
          'eval_metric': 'mlogloss'}

model = XGBClassifier(**params)

model.fit(X_train, y_train, verbose = False)

preds = model.predict_proba(X_valid)
print('Valid log_loss of the basic model: {}'.format(log_loss(y_valid, preds)))
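
As a reminder of what this metric measures, here is a minimal sketch of multiclass log loss computed by hand on made-up probabilities (not the competition data). It is the negative mean of the log probability that the model assigned to the true class of each sample, and it should agree with sklearn's log_loss.

```python
import numpy as np
from sklearn.metrics import log_loss

# Three samples, three classes; each row of probabilities sums to 1
toy_true = np.array([0, 2, 1])
toy_proba = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6],
                      [0.2, 0.5, 0.3]])

# Probability assigned to the true class of each sample
p_true = toy_proba[np.arange(len(toy_true)), toy_true]
manual_loss = -np.mean(np.log(p_true))

sklearn_loss = log_loss(toy_true, toy_proba, labels=[0, 1, 2])
print(manual_loss, sklearn_loss)
```

Confident wrong predictions are punished heavily, because the log of a probability close to zero is a large negative number.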

<a id="section-5"></a>
<h2 style='color:white; background:#008294; border:0'><center>3.2. Feature importance</center></h2>

[**Back to the table of contents**](#section-start)

Now, we'll look at the permutation importance of the features.

In [None]:
pi = PermutationImportance(model, random_state = 0).fit(X_valid, y_valid)
eli5.show_weights(pi, feature_names = X_valid.columns.tolist())
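
For intuition, the idea behind permutation importance can be sketched by hand. This is a toy illustration under stated assumptions, not what eli5 runs internally: it uses synthetic data, a LogisticRegression instead of XGBoost, and the increase in log loss as the importance (eli5 scores with the estimator's own scorer unless told otherwise). All `toy_` names are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(2000, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)     # only column 0 carries signal

toy_model = LogisticRegression().fit(X_toy[:1500], y_toy[:1500])
X_hold, y_hold = X_toy[1500:], y_toy[1500:]
base_loss = log_loss(y_hold, toy_model.predict_proba(X_hold), labels=[0, 1])

toy_importances = []
for j in range(X_hold.shape[1]):
    X_perm = X_hold.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link between column j and y
    perm_loss = log_loss(y_hold, toy_model.predict_proba(X_perm), labels=[0, 1])
    toy_importances.append(perm_loss - base_loss)  # how much worse the model became

print(np.round(toy_importances, 3))
```

Shuffling the informative column hurts the model a lot, while shuffling a noise column changes almost nothing. The eli5 table for our XGBoost model is read the same way.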

Almost all features have approximately the same importance for the model. The most important are 'feature_43', 'feature_56' and 'feature_21'. Let's take a closer look and visualize them with the Partial Dependence Plot ([pdpbox library](https://pdpbox.readthedocs.io/en/latest/)).

In [None]:
pdp_f = pdp.pdp_isolate(model = model, dataset = X_valid, 
                        model_features = X_valid.columns.tolist(),
                        feature = 'feature_43')
pdp.pdp_plot(pdp_f, 'feature_43')
plt.text(125, -0.12, caption, size = 8)
plt.show()

In [None]:
pdp_f = pdp.pdp_isolate(model = model, dataset = X_valid, 
                        model_features = X_valid.columns.tolist(),
                        feature = 'feature_56')
pdp.pdp_plot(pdp_f, 'feature_56')
plt.text(90, -0.1, caption, size = 8)
plt.show()

In [None]:
pdp_f = pdp.pdp_isolate(model = model, dataset = X_valid, 
                        model_features = X_valid.columns.tolist(),
                        feature = 'feature_21')
pdp.pdp_plot(pdp_f, 'feature_21')
plt.text(70, -0.07, caption, size = 8)
plt.show()

As we can see, in the context of classes, these features have different effects. An increase in the values of these features leads to an increase in the prediction probability for some classes and a decrease in others.
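
To make it clearer what such a plot is based on, here is a rough sketch of a partial dependence computation done by hand. It rests on assumptions: synthetic data, a LogisticRegression stand-in for the model, and a hypothetical helper `partial_dependence`; pdpbox adds its own grid selection and plotting on top of this idea. For each grid value, the feature is forced to that value for every row and the predicted class probabilities are averaged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1500, 2))
y_toy = np.digitize(X_toy[:, 0], [-0.5, 0.5])   # 3 classes driven by column 0

toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

def partial_dependence(model, X, feature, grid_values):
    """Average class probabilities with one feature forced to each grid value."""
    curves = []
    for value in grid_values:
        X_mod = X.copy()
        X_mod[:, feature] = value                 # same value for every row
        curves.append(model.predict_proba(X_mod).mean(axis=0))
    return np.array(curves)                       # shape: (len(grid_values), n_classes)

grid_values = np.linspace(-2, 2, 5)
pd_curves = partial_dependence(toy_model, X_toy, feature=0, grid_values=grid_values)
print(np.round(pd_curves, 2))
```

Each column of the result is one class curve. In this toy setup the first class loses probability as the feature grows and the last class gains it, which is the same kind of opposite behavior across classes that we see in the plots.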

Let's also take a look at the relationship between the two most important features.

In [None]:
features = ['feature_43', 'feature_56']
inter = pdp.pdp_interact(model = model, dataset = X_valid,
                         model_features = X_valid.columns.tolist(),
                         features = features)
pdp.pdp_interact_plot(inter, feature_names = features,
                      plot_type = 'contour')
plt.text(0.55, 0.118, caption, size = 8)
plt.show()

The nature of the interaction varies greatly from class to class. For some classes, the predicted probability increases with large values of both features; for others, it decreases.

In [None]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

for i in range(len(target_names)):
    plt.title('Class_{}'.format(i+1))
    shap.summary_plot(shap_values[i], X_valid)

The contribution of various features to prediction varies greatly, but we can see a completely logical pattern: large feature values most often lead to a stronger change in the prediction probability for some classes.

<a id="section-6"></a>
<h2 style='color:white; background:#008294; border:0'><center>3.3. Principal Component Analysis</center></h2>

[**Back to the table of contents**](#section-start)

Now, let's do a principal component analysis. We'll look at the correlation between components and features, as well as study the proportion of variance for the first components. We'll use the bioinfokit package to visualize the PCA results.

In [None]:
pca = PCA()
pca_out = pca.fit(train)

In [None]:
comp = pca_out.components_
num_pc = comp.shape[0]  # number of components (avoids the deprecated n_features_ attribute)
pc_list = ["PC" + str(i) for i in list(range(1, num_pc + 1))]
comp_df = pd.DataFrame.from_dict(dict(zip(pc_list, comp)))
comp_df['variable'] = train.columns.values
comp_df = comp_df.set_index('variable')

comp_df.head(10).style.background_gradient(cmap = 'viridis')

In [None]:
cluster.screeplot(obj = [pc_list[:20], pca_out.explained_variance_ratio_[:20]], 
                  show = True, dim = (16, 5), axlabelfontsize = 13)

In my opinion, for a start, we can try to use the first five components as additional features.
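
Picking the number of components by eye from a scree plot is a judgment call. One common alternative, sketched below on synthetic data, is to look at the cumulative explained variance and keep the smallest number of components that reaches a chosen threshold. The 0.95 threshold and all `toy_` names are assumptions for illustration, not values tuned for this competition.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 10 features driven by 2 latent factors plus small noise
latent = rng.normal(size=(500, 2))
X_toy = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(500, 10))

toy_pca = PCA().fit(X_toy)
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(np.round(cumulative[:4], 3), n_keep)
```

The same two lines (`np.cumsum` and `np.searchsorted`) can be applied to `pca_out.explained_variance_ratio_` to see how many components a given threshold would keep for our data.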

In [None]:
# PCA loadings plots
# 2D
cluster.pcaplot(x = comp[0], y = comp[1], 
                labels = range(num_pc), 
                var1 = round(pca_out.explained_variance_ratio_[0]*100, 2),
                var2 = round(pca_out.explained_variance_ratio_[1]*100, 2),
                show = True, dim = (10, 8), axlabelfontsize = 13)

# 3D
cluster.pcaplot(x = comp[0], y = comp[1], z = comp[2],  
                labels = range(num_pc), 
                var1 = round(pca_out.explained_variance_ratio_[0]*100, 2), 
                var2 = round(pca_out.explained_variance_ratio_[1]*100, 2), 
                var3 = round(pca_out.explained_variance_ratio_[2]*100, 2),
                show = True, dim = (14, 10), axlabelfontsize = 13)

Some features are clearly different from others. This will also need to be taken into account in the subsequent modeling.

Let's add the components as new features and train the base model again.

In [None]:
# Reuse the index of train so that the concatenation below aligns row by row
X_pca = pd.DataFrame(pca.transform(train), columns = pc_list, index = train.index)
train_new = pd.concat([train, X_pca.iloc[:, :5]], axis = 1)

# Create data sets for training (80%) and validation (20%)
X_train, X_valid, y_train, y_valid = train_test_split(train_new, target, 
                                                      test_size = 0.2,
                                                      random_state = 0)

In [None]:
# The basic model
params = {'random_state': 0,
          'predictor': 'gpu_predictor',
          'tree_method': 'gpu_hist',
          'eval_metric': 'mlogloss'}

model = XGBClassifier(**params)

model.fit(X_train, y_train, verbose = False)

preds = model.predict_proba(X_valid)
print('Valid log_loss of the basic model: {}'.format(log_loss(y_valid, preds)))

This had practically no effect on the log loss of our model, but remember that this is a very basic, almost untuned model. For a starting point, I think this is enough.

In the future, we will expand our data analysis and experiment with different models and their parameters.

<a id="section-7"></a>
<h2 style='color:white; background:#008294; border:0'><center>3.4. Test prediction</center></h2>

[**Back to the table of contents**](#section-start)

Now let's train the model with cross-validation and some parameters and then predict the test data.

In [None]:
FOLDS = 7

ss.iloc[:, :] = np.zeros((len(test), 9))

# The model is trained on train_new, so test needs the same five PCA features
test_pca = pd.DataFrame(pca.transform(test), columns = pc_list, index = test.index)
test_new = pd.concat([test, test_pca.iloc[:, :5]], axis = 1)

params = {'n_estimators': 400,
          'max_depth': 6,
          'min_child_weight': 3,
          'learning_rate': 0.03,
          'subsample': 0.7,
          'random_state': 0,
          'predictor': 'gpu_predictor',
          'tree_method': 'gpu_hist',
          'eval_metric': 'mlogloss'}

kfold = KFold(n_splits = FOLDS, random_state = 0, shuffle = True)
i = 1
for train_idx, test_idx in kfold.split(train_new):
    print('Training {} fold...'.format(i))
    X_train, y_train = train_new.iloc[train_idx, :], target[train_idx]
    X_valid, y_valid = train_new.iloc[test_idx, :], target[test_idx]
    
    model = XGBClassifier(**params)
    model.fit(X_train, y_train, verbose = False)
    preds = model.predict_proba(X_valid)
    print('Valid log_loss for {} fold: {}'.format(i, log_loss(y_valid, preds)))
    i += 1
    # Average the test predictions over all folds
    ss.iloc[:, :] += model.predict_proba(test_new) / FOLDS
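
The averaging step in the loop above can be sketched in isolation. This is a toy version under stated assumptions: synthetic data, a LogisticRegression instead of XGBoost, and made-up `toy_` names. Each fold model adds an equal share of its predicted probabilities, so the rows of the final matrix still sum to 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(600, 4))
y_toy = np.digitize(X_toy[:, 0], [-0.5, 0.5])    # 3 synthetic classes
X_toy_test = rng.normal(size=(50, 4))

n_folds = 5
avg_proba = np.zeros((len(X_toy_test), 3))

splitter = KFold(n_splits=n_folds, shuffle=True, random_state=0)
for fit_idx, _ in splitter.split(X_toy):
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_toy[fit_idx], y_toy[fit_idx])
    # Each fold contributes an equal share of the final prediction
    avg_proba += fold_model.predict_proba(X_toy_test) / n_folds

print(avg_proba[:3].round(3))
```

Because every fold model sees a slightly different subset of the training data, the averaged prediction is usually a bit more stable than the prediction of any single fold model.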

In [None]:
ss

In [None]:
ss.to_csv('submission.csv', index = True)

<h2 style='color:white; background:#008294; border:0'><center>WORK IN PROGRESS...</center></h2>

[**Back to the table of contents**](#section-start)