# TPS-06 Solution Assortment EDA, Optuna, ensemble🍭

## Motivation

The Tabular Playground Series has become a regular competition that is released on the first of every month. In previous competitions, a brilliant and wonderful variety of analyses and solutions have been published. However, since there are quite a lot of methods being shared, some people may not know where to start.

In this notebook, I hope to share easy-to-follow examples of the methods used so far, including the following:

- Overview (stats, missing values, zeros)
- EDA (distribution, correlation, PCA, UMAP)
- Optuna
- Boosting classifiers
- Blending

## Contents

- [Load Data & Libraries](#1)
- [Data Overview](#2)
    - Stats
    - Missings
    - Zeros
    - Metric
- [EDA](#3)
    - kdeplot for all features
    - Correlations with heatmap
    - Interactions of all features
    - PCA Result
    - Umap Result
- [Hyperparameter tuning with optuna](#4)
    - Simple example
    - CatBoostClassifier
    - HistGradientBoostingClassifier
    - XGBoost
- [Train & Inference](#5)
    - CatBoostClassifier
    - HistGradientBoostingClassifier
    - XGBoost
    - Feature importance
    - Blending

<a id='1'></a>
# <div class="alert alert-block alert-success">Load Data & Libraries</div>

First, I'll load data and libraries. Additionally, I'll do some preprocessing.

### Load data and libraries

In [None]:
!pip install dataprep

In [None]:
# Standard library
import math
import random

# 3rd party library
from catboost import CatBoostClassifier
from catboost import Pool
from dataprep import eda
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import numpy as np 
import optuna
from optuna import create_study, logging
from optuna.pruners import MedianPruner
import pandas as pd
import plotly.express as px
import seaborn as sns
import sklearn
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, KFold, GroupKFold, StratifiedKFold
from sklearn.experimental import enable_hist_gradient_boosting
from optuna.integration import XGBoostPruningCallback
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder
import umap
import xgboost as xgb

In [None]:
# Fix seed

def fix_seed(seed):
    # random
    random.seed(seed)
    # Numpy
    np.random.seed(seed)

SEED = 46
fix_seed(SEED)

In [None]:
# Load csv data of this competition.

DATA = "../input/tabular-playground-series-jun-2021"
train = pd.read_csv(DATA + "/train.csv")
test = pd.read_csv(DATA + "/test.csv")

In [None]:
# Remove the ID column as it is in the way.

train = train.drop('id', axis=1)
test = test.drop('id', axis=1)

In [None]:
train.head()

In [None]:
test.head()

<a id='2'></a>
# <div class="alert alert-block alert-success">Data Overview</div>

Let's look at the summary statistics, missing values, and zeros.

The outputs are long and hard to scan, so those whose results are largely predictable are hidden. If you are interested, please expand them and have a look.

In [None]:
train.describe().T.style.bar(subset=['mean'], color='#20c8f2')\
                   .background_gradient(subset=['std'], cmap='YlGn')

All values are integers.

In [None]:
train.info()

There are no missing values.

In [None]:
pd.DataFrame(train.isna().sum()/len(train), columns=["missing_rate"])\
                        .style.bar(subset=['missing_rate'], color='#20c8f2')

Many columns consist largely of zeros. A few columns, however, contain no zeros at all.

In [None]:
pd.DataFrame((train==0).sum()/len(train), columns=["zero_rate"])\
    .style.bar(subset=['zero_rate'], color='#20c8f2')

In [None]:
test.describe().T.style.bar(subset=['mean'], color='#20c8f2')\
                 .background_gradient(subset=['std'], cmap='YlGn')

All values are integers.

In [None]:
test.info()

There are no missing values.

In [None]:
pd.DataFrame(test.isna().sum()/len(test), columns=["missing_rate"])\
                        .style.bar(subset=['missing_rate'], color='#20c8f2')

Many columns consist largely of zeros. A few columns, however, contain no zeros at all.

In [None]:
pd.DataFrame((test==0).sum()/len(test), columns=["zero_rate"])\
    .style.bar(subset=['zero_rate'], color='#20c8f2')

## Metric

The metric used for evaluation is multi-class logarithmic loss.

$$
   logloss = -\frac{1}{N}\sum^{N}_{i=1}\sum^{M}_{j=1}y_{ij}\log(p_{ij})
$$

Here N is the number of rows, M is the number of class labels, i is the index of the row and j is the index of the class. y_ij is 1 if row i belongs to class j and 0 otherwise, and p_ij is the predicted probability that row i belongs to class j.

We will use the classifier's predict_proba() output as the probability of each class. I also wrote my own evaluation function because it is handy for computing the CV score.

In [None]:
def multiclass_log_loss(y_pred, y_true):
    score = sum([math.log(pred[label]) for pred, label in zip(y_pred, y_true)])
    return - score / len(y_true)
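Because y_ij is 1 only for the true class, only one term per row survives in the inner sum, which is why the function above just looks up `pred[label]`. Here is a small worked check of that idea. The toy labels and probabilities are made up for illustration.

```python
import math

def multiclass_log_loss(y_pred, y_true):
    """Mean negative log-probability assigned to the true class of each row."""
    score = sum(math.log(pred[label]) for pred, label in zip(y_pred, y_true))
    return -score / len(y_true)

# Hypothetical toy data: 3 rows, 3 classes, each row of probabilities sums to 1.
y_true_demo = [0, 2, 1]
y_pred_demo = [
    [0.7, 0.2, 0.1],  # true class 0 gets p = 0.7
    [0.1, 0.3, 0.6],  # true class 2 gets p = 0.6
    [0.2, 0.5, 0.3],  # true class 1 gets p = 0.5
]

loss = multiclass_log_loss(y_pred_demo, y_true_demo)

# The same value written out by hand: -(ln 0.7 + ln 0.6 + ln 0.5) / 3
by_hand = -(math.log(0.7) + math.log(0.6) + math.log(0.5)) / 3
print(loss, by_hand)
```

Note that `math.log` raises a `ValueError` if a model ever predicts exactly 0 for the true class, so clipping the probabilities to a small positive value first may be worth considering.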

# <div class="alert alert-block alert-success">Preprocessing</div>

The preprocessing needed for the later steps is minimal but necessary.

In [None]:
# Separate X and y.

feature_cols = [col for col in train.columns if col != "target"]
target_cat = train["target"]
df = train.drop("target", axis=1)

In [None]:
df.head()

Since the target is a Series of strings, such as "Class_6", we will convert it to integers using label encoding so that it can be fed into the model.

We can easily do this with [sklearn.preprocessing.LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

In [None]:
# Apply label encoding because the targets are strings.
le = LabelEncoder()
target = le.fit_transform(target_cat)

In [None]:
print("-"*30)
print("Before label encoding, ")
print(target_cat[:10])
print("-"*30)
print("After label encoding, ")
print(target[:10])
print("-"*30)
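To make the encoding less of a black box, here is a rough pure-Python sketch of the idea: the sorted unique labels define the integer codes. The toy labels and the names `labels_demo`, `classes_demo` and `to_code` are made up for illustration, and this is not LabelEncoder's actual implementation.

```python
# Hypothetical toy labels in the same "Class_k" format as the target column.
labels_demo = ["Class_6", "Class_2", "Class_9", "Class_2", "Class_1"]

# Sorted unique labels define the integer codes (position in the sorted list).
classes_demo = sorted(set(labels_demo))
to_code = {name: code for code, name in enumerate(classes_demo)}

encoded_demo = [to_code[name] for name in labels_demo]
# Decoding is just a lookup in the sorted class list.
decoded_demo = [classes_demo[code] for code in encoded_demo]

print(classes_demo)
print(encoded_demo)
```

With all nine classes present, this ordering maps Class_1 to 0 through Class_9 to 8, so column k of a predict_proba() output corresponds to Class_(k+1).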

<a id='3'></a>
# <div class="alert alert-block alert-success">EDA</div>

Let's look at some standard visualizations.

## kdeplot for all features

Compare the distribution of the train and test data. All features seem to have a similar distribution.

In [None]:
# I referred to https://www.kaggle.com/subinium/tps-may-categorical-eda

plt.style.use("Solarize_Light2")
print("Orange is train, and blue is test data.")

fig, axes = plt.subplots(19, 4, figsize=(15, 30), gridspec_kw=dict(wspace=0.3, hspace=0.6))
for col, ax in zip(feature_cols, axes.flatten()):
    
    sns.kdeplot(x=df[col], ax=ax, alpha=0.5, fill=True, linewidth=0.6, color='orange')
    sns.kdeplot(x=test[col], ax=ax, alpha=0.1, fill=True, linewidth=0.6)

## Countplot for target of train data

The target in the training data seems to be quite imbalanced.

In [None]:
plt.figure(figsize=(13, 8))
g = sns.countplot(x=target_cat, order=[f"Class_{i}" for i in range(1, 10)])
g.tick_params(labelsize=14)
g.set_xlabel("target",fontsize=20)
g.set_ylabel("Count",fontsize=20)
g.set_title("Count plot for target of train data",fontsize=25)
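If you prefer numbers to a plot, the class shares can also be computed directly. This is a small sketch on made-up labels. In the notebook you would pass `target_cat` instead of the hypothetical `target_demo`.

```python
from collections import Counter

# Hypothetical toy target; in the notebook this would be target_cat.
target_demo = ["Class_6"] * 5 + ["Class_8"] * 3 + ["Class_2"] * 2

counts = Counter(target_demo)
total = sum(counts.values())

# Share of each class, from most to least frequent.
shares = {name: n / total for name, n in counts.most_common()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

When the shares are very uneven, a stratified splitter such as StratifiedKFold (already imported above) is one option for keeping them similar across folds.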

## Correlations with heatmap

Correlations between features are low overall. Hover over a cell to see the exact value!

In [None]:
def extract_tril_without_diagonal(corr_matrix):
    return np.tril(corr_matrix) - np.triu(np.tril(corr_matrix))

fig = px.imshow(extract_tril_without_diagonal(df.corr().values),
                x=feature_cols, y=feature_cols, width=700, height=700)
fig.update_layout(title='Correlation between features')
fig.show()
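The helper above keeps only the part of the matrix below the diagonal, so each pair of features appears once. As a quick check on a made-up 3x3 matrix, it appears to give the same result as NumPy's `np.tril` with `k=-1`:

```python
import numpy as np

def extract_tril_without_diagonal(corr_matrix):
    # Lower triangle minus its own diagonal = strictly lower triangle.
    return np.tril(corr_matrix) - np.triu(np.tril(corr_matrix))

# Hypothetical symmetric correlation matrix for three features.
corr_demo = np.array([[1.0, 0.2, -0.1],
                      [0.2, 1.0, 0.4],
                      [-0.1, 0.4, 1.0]])

masked = extract_tril_without_diagonal(corr_demo)
strict_lower = np.tril(corr_demo, k=-1)  # k=-1 drops the diagonal as well
print(masked)
```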

## Interactions of all features

You can see scatter plots between all pairs of features. Choose two features from the pull-down menus and check their interaction.

In [None]:
eda.create_report(df,display=["Interactions"])

## PCA Result

Because of the large number of features, you may want to use PCA to reduce the number of dimensions. We will also check the cumulative explained variance ratio. With roughly 54 components, about 95% of the variance is explained.

In [None]:
# I referred to https://www.kaggle.com/kushal1506/deciding-n-components-in-pca

pca = PCA().fit(df)

plt.rcParams["figure.figsize"] = (15,6)

fig, ax = plt.subplots()
xi = np.arange(1, len(feature_cols)+1, step=1)
y = np.cumsum(pca.explained_variance_ratio_)

plt.ylim(0.0,1.1)
plt.plot(xi, y, marker='o', linestyle='--', color='b')

plt.xlabel('Number of Components')
plt.xticks(np.arange(0, len(feature_cols), step=2))

plt.title('The number of components needed to explain variance')

plt.axhline(y=0.95, color='r', linestyle='-')
plt.text(0.5, 0.85, '95% cut-off threshold', color = 'red', fontsize=16)

ax.grid(axis='x')
plt.show()
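Instead of reading the number of components off the plot, it can also be computed from the cumulative sum. The ratios below are made up for illustration. In the notebook you would use `pca.explained_variance_ratio_` instead of the hypothetical `ratios_demo`.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted in decreasing order like
# pca.explained_variance_ratio_; they sum to 1.
ratios_demo = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
cumulative = np.cumsum(ratios_demo)

threshold = 0.95
# Index of the first cumulative value that reaches the threshold, as a count.
n_components = int(np.argmax(cumulative >= threshold)) + 1
print(n_components, cumulative)
```

As far as I know, scikit-learn can also do this selection itself when `n_components` is given as a fraction, e.g. `PCA(n_components=0.95)`.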

## Umap Result

We will also check the result of dropping the data into two dimensions with Umap. The data seems to be quite mixed up.

In [None]:
reducer = umap.UMAP()
embedding = reducer.fit_transform(df)

In [None]:
plt.scatter(
    embedding[:, 0],
    embedding[:, 1],
    c=target,
    s=1,
    alpha=0.5)
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of train data', fontsize=15)

We also ran UMAP on the data after reducing its dimension with PCA, but the picture did not change much.

In [None]:
pca = PCA(n_components=54).fit(df)
df_pca = pca.transform(df)

reducer = umap.UMAP()
embedding = reducer.fit_transform(df_pca)

plt.scatter(
    embedding[:, 0],
    embedding[:, 1],
    c=target,
    s=1,
    alpha=0.5)
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of train data after PCA', fontsize=15)

<a id='4'></a>
# <div class="alert alert-block alert-success">Hyperparameter tuning with optuna</div>

We will use [CatBoost](https://catboost.ai/) for the model. This is because I felt that CatBoost performed well in the competition in May. 

We will see how to tune the hyperparameters using [Optuna](https://optuna.org/).

With Optuna, you can efficiently search for good hyperparameter values with a small number of trials.

### Simple example

Let's look at a simple optimization example for the following quadratic function.
$$
z = (x -2 )^2 + (y - 3)^2
$$

In [None]:
# Adapted from the code examples at https://optuna.org/

def quadratic_function(x, y):
    """Calculate quadratic_function (x -2 )^2 + (y - 3)^2
    """
    return (x - 2) ** 2 + (y - 3)**2

def objective(trial):
    # suggest_float replaces suggest_uniform, which is deprecated in recent Optuna versions.
    x = trial.suggest_float('x', -10, 10)
    y = trial.suggest_float('y', -10, 10)
    return quadratic_function(x, y)

study = optuna.create_study()
study.optimize(objective, n_trials=100)

In [None]:
study.best_params
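To get a feeling for what a study does, here is a naive random-search baseline in pure Python. It is not how Optuna works internally (its default sampler is smarter than uniform sampling). It only mirrors the loop of "suggest values, evaluate, remember the best". The function `random_search` is a hypothetical helper.

```python
import random

def quadratic_function(x, y):
    return (x - 2) ** 2 + (y - 3) ** 2

def random_search(n_trials, seed=0):
    """Naive baseline: sample uniformly and keep the best point seen."""
    rng = random.Random(seed)
    best_point, best_value = None, float("inf")
    for _ in range(n_trials):
        x = rng.uniform(-10, 10)
        y = rng.uniform(-10, 10)
        value = quadratic_function(x, y)
        if value < best_value:
            best_point, best_value = (x, y), value
    return best_point, best_value

best_point, best_value = random_search(n_trials=2000)
print(best_point, best_value)
```

The true minimum is at (2, 3) with value 0, so the closer the best value is to 0, the better the search did.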

In [None]:
x = np.arange(-1.0, 5.0, 0.1)
y = np.arange(0., 5.0, 0.1)
X, Y = np.meshgrid(x, y)
Z = quadratic_function(X, Y)

best_x = study.best_params["x"]
best_y = study.best_params["y"]
best_z = quadratic_function(best_x, best_y)

fig = plt.figure()
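# In recent matplotlib versions Axes3D(fig) is no longer attached to the figure automatically.
ax = fig.add_subplot(projection='3d')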
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("f(x, y)")
ax.plot_wireframe(X, Y, Z, alpha=0.1)
ax.scatter3D([best_x], [best_y], [best_z],  c='Red', s=100);
plt.show()

The red dot is the minimum point explored by Optuna. The next optimization we do is more difficult than this one, but it does the same thing: we'll search for the hyperparameter that minimizes the CV score.

In [None]:
display(optuna.visualization.plot_optimization_history(study))
display(optuna.visualization.plot_slice(study))
display(optuna.visualization.plot_parallel_coordinate(study))

<div class="alert alert-block alert-warning">Note: Optuna is easy and friendly, but if this section feels too difficult, you can skip it. Come back here when you need it someday!</div>



We'll specify the number of trials.

In [None]:
# ↓↓↓ You can change the number of trials. To save time, I set a small number here.

trials_catboost = 5
trials_histgradientboost = 5
trials_xgboost = 5

### For CatBoostClassifier

Train the model with the hyperparameter values assigned by Optuna and calculate the validation score. In this case, we will use kfold to split the train data into three parts and return the CV score to optuna.

In [None]:
# I referred to a great notebook that wrote a pipeline to optimize XGBoost training with Optuna,
# but I can't find it in my list of upvoted notebooks...
# If I find it, I'll note the URL.

def train_and_val_catboost(df, target, params, n_splits=3):
    """Calculate and return validation score averaged of CatBoostClassifier n_splits times tried with kfold.
    """
    test_preds = None
    train_mertics = 0
    val_mertics = 0 
    
    kf = KFold(n_splits = n_splits , shuffle = True , random_state = 42)
    for fold, (tr_index , val_index) in enumerate(kf.split(df.values , target)):
        print("-" * 50)
        print(f"Fold {fold + 1}")
    
        x_train,x_val = df.values[tr_index] , df.values[val_index]
        y_train,y_val = target[tr_index] , target[val_index]
        
        train_dataset = Pool(data=x_train,
                     label=y_train)
        eval_data = Pool(data=x_val,
                     label=y_val)
    
        model = CatBoostClassifier(**params)
        model.fit(train_dataset, eval_set = eval_data, verbose = 100)
    
        train_preds = model.predict_proba(x_train)
        train_mertics += multiclass_log_loss(train_preds, y_train)
        print("Training Metric : " , multiclass_log_loss(train_preds, y_train))
    
        val_preds = model.predict_proba(x_val)
        val_mertics += multiclass_log_loss(val_preds, y_val)
        print("Validation Metric : " , multiclass_log_loss(val_preds, y_val))
    
        if test_preds is None:
            test_preds = model.predict_proba(test.values)
        else:
            test_preds += model.predict_proba(test.values)

    print("-" * 50)
    print("Average Training Metric : " , train_mertics / n_splits)
    print("Average Validation Metric : " , val_mertics / n_splits)

    return val_mertics / n_splits
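For readers new to k-fold CV, here is a rough pure-Python sketch of how the indices get split. It does not reproduce scikit-learn's exact fold assignment. The helper `kfold_indices` is hypothetical and only illustrates that every row is used for validation exactly once.

```python
import random

def kfold_indices(n_samples, n_splits, seed=42):
    """Yield (train_idx, val_idx) pairs from shuffled, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds, like dealing cards.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=10, n_splits=3))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"Fold {fold + 1}: train={sorted(train_idx)} val={sorted(val_idx)}")
```

The score returned to Optuna is then simply the mean of the per-fold validation scores.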

Specify the hyperparameters to explore and their ranges. Note whether each parameter is a real number or an integer.

In [None]:
def objective_catboost(trial, df, target, params=None):
    """ Set the parameters to optimize and how they are sampled
    """
    # Copy so that trials running in parallel do not share (and overwrite) one dict.
    params = dict(params or {})

    # Tuning target
    params['max_depth'] = trial.suggest_int('max_depth', 2, 8)
    params['n_estimators'] = trial.suggest_int('n_estimators', 500, 1500)
    params['bagging_temperature'] = trial.suggest_float('bagging_temperature', 0.5, 10)
    params['learning_rate'] = trial.suggest_float('learning_rate', 0.01, 0.15)

    return train_and_val_catboost(df, target, params, n_splits=3)
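The difference between integer and real-valued parameters can be illustrated with plain random sampling. The helpers below are hypothetical stand-ins, not Optuna's samplers. The log-uniform variant is an extra that the notebook does not use, although Optuna offers something similar through `suggest_float(..., log=True)`.

```python
import math
import random

rng = random.Random(0)

def sample_int(low, high):
    """Integer in [low, high], both ends included (e.g. a tree depth)."""
    return rng.randint(low, high)

def sample_uniform(low, high):
    """Real number drawn uniformly from [low, high]."""
    return rng.uniform(low, high)

def sample_log_uniform(low, high):
    """Real number whose logarithm is uniform; spreads draws across magnitudes."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

params_demo = {
    "max_depth": sample_int(2, 8),
    "bagging_temperature": sample_uniform(0.5, 10),
    "learning_rate": sample_log_uniform(0.01, 0.15),
}
print(params_demo)
```

A depth of 3.7 makes no sense, so it has to be sampled as an integer, while a learning rate can take any real value in its range.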

Set up and run the optimization.

In [None]:
def execute_optimization(study_name, df, target, trials,
                                   params=dict(), direction='minimize'):
    """ Execute optimization for objective_catboost
    """
    logging.set_verbosity(logging.ERROR)
    
    ## We use pruner to skip trials that are NOT fruitful
    pruner = MedianPruner(n_warmup_steps=5)
    
    study = create_study(direction=direction,
                         study_name=study_name,
                         storage=f'sqlite:///optuna_{study_name}.db',
                         load_if_exists=False,
                         pruner=pruner)

    study.optimize(lambda trial: objective_catboost(trial, df, target, params),
                   n_trials=trials,
                   n_jobs=-1)
    
    
    print("STUDY NAME: ", study_name)
    print('------------------------------------------------')
    print("EVALUATION METRIC: ", multiclass_log_loss.__name__)
    print('------------------------------------------------')
    print("BEST CV SCORE", study.best_value)
    print('------------------------------------------------')
    print(f"OPTIMAL PARAMS: ", study.best_params)
    print('------------------------------------------------')
    print("BEST TRIAL", study.best_trial)
    print('------------------------------------------------')
    
    
    return study.best_params, study
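The idea behind a median pruner can be sketched in a few lines. This is a simplified illustration for a metric that is minimized, not Optuna's implementation. The function `should_prune` and the `history` layout are made up for this example.

```python
from statistics import median

def should_prune(step, value, history, n_warmup_steps=5):
    """Simplified median rule for a metric that is minimized.

    history maps a step to the values earlier trials reported at that step.
    """
    previous = history.get(step, [])
    # Never prune during warm-up or when there is nothing to compare against.
    if step < n_warmup_steps or not previous:
        return False
    # Stop the trial if it is doing worse than the median of earlier trials.
    return value > median(previous)

history_demo = {10: [1.80, 1.76, 1.79]}
print(should_prune(10, 1.90, history_demo))        # worse than the median 1.79
print(should_prune(10, 1.75, history_demo))        # better than the median
print(should_prune(3, 1.90, {3: [1.0, 1.1]}))      # still in warm-up
```

As far as I understand, a pruner can only act when the objective reports intermediate values (for example through `trial.report` or a pruning callback). The CatBoost objective above returns a single final score, so the pruner probably has little to do there.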

Let's execute optimization.

In [None]:
params_catboost, study_catboost = execute_optimization("catboost_tuning", df, target, trials_catboost)

Here are the best parameters.

In [None]:
print(params_catboost)

The search results can be visualized.

In [None]:
display(optuna.visualization.plot_optimization_history(study_catboost))
display(optuna.visualization.plot_slice(study_catboost))
display(optuna.visualization.plot_parallel_coordinate(study_catboost))

### For HistGradientBoostingClassifier

We'll do a similar tuning for the [HistGradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html), which runs fast and scores reasonably well on this competition's data.

In [None]:
def train_and_val_histgradientboost(df, target, params, n_splits=3):
    """Calculate and return validation score  of HistGradientBoostingClassifier averaged n_splits times tried with kfold.
    """
    test_preds = None
    train_mertics = 0
    val_mertics = 0
    
    kf = KFold(n_splits = n_splits , shuffle = True , random_state = 42)
    for fold, (tr_index , val_index) in enumerate(kf.split(df.values , target)):
        print("-" * 50)
        print(f"Fold {fold + 1}")
    
        x_train,x_val = df.values[tr_index] , df.values[val_index]
        y_train,y_val = target[tr_index] , target[val_index]
    
        model = HistGradientBoostingClassifier(**params)
        model.fit(x_train, y_train)
    
        train_preds = model.predict_proba(x_train)
        train_mertics += multiclass_log_loss(train_preds, y_train)
        print("Training Metric : " , multiclass_log_loss(train_preds, y_train))
    
        val_preds = model.predict_proba(x_val)
        val_mertics += multiclass_log_loss(val_preds, y_val)
        print("Validation Metric : " , multiclass_log_loss(val_preds, y_val))
    
        if test_preds is None:
            test_preds = model.predict_proba(test.values)
        else:
            test_preds += model.predict_proba(test.values)

    print("-" * 50)
    print("Average Training Metric : " , train_mertics / n_splits)
    print("Average Validation Metric : " , val_mertics / n_splits)

    return val_mertics / n_splits


def objective_histgradientboost(trial, df, target, params=None):
    """ Set the parameters to optimize and how they are sampled
    """
    # Copy so that trials running in parallel do not share (and overwrite) one dict.
    params = dict(params or {})

    # Tuning target
    params['max_depth'] = trial.suggest_int('max_depth', 2, 8)
    params['l2_regularization'] = trial.suggest_float('l2_regularization', 0, 1)
    params['learning_rate'] = trial.suggest_float('learning_rate', 0.05, 0.5)

    return train_and_val_histgradientboost(df, target, params, n_splits=3)


def execute_optimization(study_name, df, target, trials,
                                   params=dict(), direction='minimize'):
    """ Execute optimization for objective_histgradientboost
    """
    
    logging.set_verbosity(logging.ERROR)
    
    ## We use pruner to skip trials that are NOT fruitful
    pruner = MedianPruner(n_warmup_steps=5)
    
    study = create_study(direction=direction,
                         study_name=study_name,
                         storage=f'sqlite:///optuna_{study_name}.db',
                         load_if_exists=False,
                         pruner=pruner)

    study.optimize(lambda trial: objective_histgradientboost(trial, df, target, params),
                   n_trials=trials,
                   n_jobs=-1)
    
    
    print("STUDY NAME: ", study_name)
    print('------------------------------------------------')
    print("EVALUATION METRIC: ", multiclass_log_loss.__name__)
    print('------------------------------------------------')
    print("BEST CV SCORE", study.best_value)
    print('------------------------------------------------')
    print(f"OPTIMAL PARAMS: ", study.best_params)
    print('------------------------------------------------')
    print("BEST TRIAL", study.best_trial)
    print('------------------------------------------------')
    
    
    return study.best_params, study

In [None]:
params_histgradientboost, study_histgradientboost = execute_optimization("histgradientboost_tuning", df, target, trials_histgradientboost)

In [None]:
print(params_histgradientboost)

In [None]:
display(optuna.visualization.plot_optimization_history(study_histgradientboost))
display(optuna.visualization.plot_slice(study_histgradientboost))
display(optuna.visualization.plot_parallel_coordinate(study_histgradientboost))

### For XGBoost

In [None]:
def train_and_val_xgboost(trial, df, target, params, n_splits=3):
    """Calculate and return the mean validation mlogloss of XGBoost from stratified n_splits-fold CV.
    """
    dtrain = xgb.DMatrix(df, label=target)

    # num_boost_round is an argument of xgb.cv, not a booster parameter.
    params = dict(params)
    num_boost_round = params.pop('num_boost_round')

    # The trial is passed in so that the callback can report to Optuna and prune.
    pruning_callback = XGBoostPruningCallback(trial, "test-mlogloss")
    cv_scores = xgb.cv(params, dtrain, num_boost_round=num_boost_round,
                       nfold=n_splits,
                       stratified=True,
                       metrics = "mlogloss",
                       early_stopping_rounds=50,
                       callbacks=[pruning_callback],
                       seed=0)

    return cv_scores['test-mlogloss-mean'].values[-1]


def objective_xgboost(trial, df, target, params=None):
    """ Set the parameters to optimize and how they are sampled
    """
    # Copy so that trials running in parallel do not share (and overwrite) one dict.
    params = dict(params or {})

    # Multi-class setup: class probabilities over the 9 target classes.
    params['objective'] = 'multi:softprob'
    params['num_class'] = 9

    # Tuning target
    params['max_depth'] = trial.suggest_int('max_depth', 2, 10)
    params['learning_rate'] = trial.suggest_float('learning_rate', 0, 0.1)
    params['num_boost_round'] = trial.suggest_int('num_boost_round', 100, 1000)

    return train_and_val_xgboost(trial, df, target, params, n_splits=3)


def execute_xgboost(study_name, df, target, trials,
                                   params=dict(), direction='minimize'):
    """ Execute optimization for objective_xgboost
    """

    logging.set_verbosity(logging.ERROR)

    ## We use pruner to skip trials that are NOT fruitful
    pruner = MedianPruner(n_warmup_steps=5)

    study = create_study(direction=direction,
                         study_name=study_name,
                         storage=f'sqlite:///optuna_{study_name}.db',
                         load_if_exists=False,
                         pruner=pruner)

    study.optimize(lambda trial: objective_xgboost(trial, df, target, params),
                   n_trials=trials,
                   n_jobs=-1)


    print("STUDY NAME: ", study_name)
    print('------------------------------------------------')
    print("EVALUATION METRIC: ", "mlogloss")
    print('------------------------------------------------')
    print("BEST CV SCORE", study.best_value)
    print('------------------------------------------------')
    print("OPTIMAL PARAMS: ", study.best_params)
    print('------------------------------------------------')
    print("BEST TRIAL", study.best_trial)
    print('------------------------------------------------')


    return study.best_params, study

In [None]:
# Call execute_xgboost here; execute_optimization would run the HistGradientBoosting objective.
params_xgboost, study_xgboost = execute_xgboost("xgboost_tuning", df, target, trials_xgboost)

In [None]:
print(params_xgboost)

In [None]:
display(optuna.visualization.plot_optimization_history(study_xgboost))
display(optuna.visualization.plot_slice(study_xgboost))
display(optuna.visualization.plot_parallel_coordinate(study_xgboost))

<a id='5'></a>
# <div class="alert alert-block alert-success">Train & Inference</div>

Using the tuned parameters, try to create the data to be submitted.

### <span style="color: orange; ">↓↓↓ I replace the parameters with those found using a larger number of trials in another version of this notebook. If you want to use your own tuned values, comment out this cell.</span>

In [None]:
# See https://www.kaggle.com/nayuts/tps-06-solution-assortment-eda-optuna-ensemble?scriptVersionId=65714768

params_catboost = {'bagging_temperature': 1.5451428810065613, 'learning_rate': 0.04814888472822457, 'max_depth': 5, 'n_estimators': 1483}
params_histgradientboost = {'l2_regularization': 0.5629424804207567, 'learning_rate': 0.05065982344408913, 'max_depth': 6}
# Note: these keys match the HistGradientBoostingClassifier search space rather than the XGBoost one.
# `l2_regularization` is not an XGBoost parameter name (XGBoost's L2 penalty is `reg_lambda`).
params_xgboost = {'l2_regularization': 0.591214850198673, 'learning_rate': 0.4895203779149179, 'max_depth': 3}

## CatBoostClassifier

### Train & Inference

In [None]:
test_preds_catboost = None
train_metric = 0
val_metric = 0
n_splits = 7

feature_importances = pd.DataFrame()
feature_importances['feature'] = test.columns

kf = KFold(n_splits = n_splits , shuffle = True , random_state = 42)
for fold, (tr_index , val_index) in enumerate(kf.split(df.values , target)):
    
    print("-" * 50)
    print(f"Fold {fold + 1}")
    
    x_train,x_val = df.values[tr_index] , df.values[val_index]
    y_train,y_val = target[tr_index] , target[val_index]
        
    eval_set = [(x_val, y_val)]
    
    model = CatBoostClassifier(**params_catboost)
    model.fit(x_train, y_train, eval_set = eval_set, verbose = 100)
    
    train_preds = model.predict_proba(x_train)
    train_metric += multiclass_log_loss(train_preds, y_train)
    print("Training Metric : " , multiclass_log_loss(train_preds, y_train))
    
    feature_importances[f'fold_{fold}'] = model.feature_importances_
    
    val_preds = model.predict_proba(x_val)
    val_metric += multiclass_log_loss(val_preds, y_val)
    print("Validation Metric : " , multiclass_log_loss(val_preds, y_val))
    
    if test_preds_catboost is None:
        test_preds_catboost = model.predict_proba(test.values)
    else:
        test_preds_catboost += model.predict_proba(test.values)

print("-" * 50)
print("Average Training Metric : " , train_metric / n_splits)
print("Average Validation Metric : " , val_metric / n_splits)

test_preds_catboost /= n_splits
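The accumulate-then-divide pattern above is a plain average of the per-fold probabilities. A quick sketch with random stand-in predictions shows that the averaged rows still sum to 1. The function `fake_predict_proba` is a hypothetical replacement for `model.predict_proba`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_splits_demo, n_rows, n_classes = 3, 4, 9

def fake_predict_proba():
    """Stand-in for model.predict_proba: random rows that sum to 1."""
    raw = rng.random((n_rows, n_classes))
    return raw / raw.sum(axis=1, keepdims=True)

test_preds_demo = None
for _ in range(n_splits_demo):
    fold_preds = fake_predict_proba()
    if test_preds_demo is None:
        test_preds_demo = fold_preds
    else:
        test_preds_demo = test_preds_demo + fold_preds

# Divide by the number of folds to turn the sum into a mean.
test_preds_demo = test_preds_demo / n_splits_demo
print(test_preds_demo.sum(axis=1))
```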

In [None]:
# Exporting results
sub_catboost = pd.read_csv(DATA + "/sample_submission.csv")

sub_catboost['Class_1']=test_preds_catboost[:,0]
sub_catboost['Class_2']=test_preds_catboost[:,1]
sub_catboost['Class_3']=test_preds_catboost[:,2]
sub_catboost['Class_4']=test_preds_catboost[:,3]
sub_catboost['Class_5']=test_preds_catboost[:,4]
sub_catboost['Class_6']=test_preds_catboost[:,5]
sub_catboost['Class_7']=test_preds_catboost[:,6]
sub_catboost['Class_8']=test_preds_catboost[:,7]
sub_catboost['Class_9']=test_preds_catboost[:,8]

sub_catboost.to_csv("CatBoost.csv",index=False)

## Feature importance

In methods such as LightGBM and CatBoost, feature importance is often available. Although not completely reliable, it can be used to check which features matter and, in some cases, to drop features when there are too many.

In this case, the implementation is a bit more involved because we ensemble over k folds, but the point is simply that the trained model's feature_importances_ holds the values, so we extract and visualize them.

In [None]:
# I referred to https://www.kaggle.com/gogo827jz/catboost-baseline-with-feature-importance

# Calculate the average feature importance for each feature
feature_importances['average'] = feature_importances[[f'fold_{fold}' for fold in range(n_splits)]].mean(axis=1)
feature_importances.to_csv('feature_importances_catboost.csv')
feature_importances.sort_values(by='average', ascending=False).head()

In [None]:
# Plot the feature importances with min/max/average using seaborn
feature_importances_flatten = pd.DataFrame()
for i in range(1, len(feature_importances.columns)-1):
    col = ['feature', feature_importances.columns.values[i]]
    feature_importances_flatten = pd.concat([feature_importances_flatten, feature_importances[col].rename(columns={f'fold_{i-1}': 'importance'})], axis=0)

plt.figure(figsize=(16, 16))
sns.barplot(data=feature_importances_flatten.sort_values(by='importance', ascending=False), x='importance', y='feature')
plt.title(f'Feature Importances over {n_splits} folds of CatBoostClassifier')  
plt.savefig("feature_importances_catboost.png")

## HistGradientBoostingClassifier

### Train & Inference

In [None]:
test_preds_histgradientboost = None
train_metric = 0
val_metric = 0
n_splits = 7

kf = KFold(n_splits = n_splits , shuffle = True , random_state = 46)
for fold, (tr_index , val_index) in enumerate(kf.split(df.values , target)):
    
    print("-" * 50)
    print(f"Fold {fold + 1}")
    
    x_train,x_val = df.values[tr_index] , df.values[val_index]
    y_train,y_val = target[tr_index] , target[val_index]
        
    eval_set = [(x_val, y_val)]
    
    model = HistGradientBoostingClassifier(**params_histgradientboost)
    model.fit(x_train, y_train)
    
    train_preds = model.predict_proba(x_train)
    train_metric += multiclass_log_loss(train_preds, y_train)
    print("Training Metric : " , multiclass_log_loss(train_preds, y_train))
    
    val_preds = model.predict_proba(x_val)
    val_metric += multiclass_log_loss(val_preds, y_val)
    print("Validation Metric : " , multiclass_log_loss(val_preds, y_val))
    
    if test_preds_histgradientboost is None:
        test_preds_histgradientboost = model.predict_proba(test.values)
    else:
        test_preds_histgradientboost += model.predict_proba(test.values)

print("-" * 50)
print("Average Training Metric : " , train_metric / n_splits)
print("Average Validation Metric : " , val_metric / n_splits)

test_preds_histgradientboost /= n_splits

In [None]:
sub_histgradientboost = pd.read_csv(DATA + "/sample_submission.csv")

sub_histgradientboost['Class_1']=test_preds_histgradientboost[:,0]
sub_histgradientboost['Class_2']=test_preds_histgradientboost[:,1]
sub_histgradientboost['Class_3']=test_preds_histgradientboost[:,2]
sub_histgradientboost['Class_4']=test_preds_histgradientboost[:,3]
sub_histgradientboost['Class_5']=test_preds_histgradientboost[:,4]
sub_histgradientboost['Class_6']=test_preds_histgradientboost[:,5]
sub_histgradientboost['Class_7']=test_preds_histgradientboost[:,6]
sub_histgradientboost['Class_8']=test_preds_histgradientboost[:,7]
sub_histgradientboost['Class_9']=test_preds_histgradientboost[:,8]

sub_histgradientboost.to_csv("HistGradientBoost.csv",index=False)

Unfortunately, HistGradientBoostingClassifier does not seem to expose feature importances yet. (The [scikit-learn gradient boosting regression example](https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regression.html) plots them for GradientBoostingRegressor, not for the histogram-based estimators.)

In [None]:
print(sklearn.__version__)

## XGBoost

### Train & Inference

In [None]:
test_preds_xgboost = None
train_metric = 0
val_metric = 0
n_splits = 7

feature_importances = pd.DataFrame()
feature_importances['feature'] = test.columns

kf = KFold(n_splits = n_splits , shuffle = True , random_state = 46)
for fold, (tr_index , val_index) in enumerate(kf.split(df.values , target)):
    
    print("-" * 50)
    print(f"Fold {fold + 1}")
    
    x_train,x_val = df.values[tr_index] , df.values[val_index]
    y_train,y_val = target[tr_index] , target[val_index]
        
    eval_set = [(x_val, y_val)]
    
    model = xgb.XGBClassifier(**params_xgboost, random_state=46 , n_jobs=-1)
    model.fit(x_train, y_train)
    
    train_preds = model.predict_proba(x_train)
    train_metric += multiclass_log_loss(train_preds, y_train)
    print("Training Metric : " , multiclass_log_loss(train_preds, y_train))
    
    feature_importances[f'fold_{fold}'] = model.feature_importances_
    
    val_preds = model.predict_proba(x_val)
    val_metric += multiclass_log_loss(val_preds, y_val)
    print("Validation Metric : " , multiclass_log_loss(val_preds, y_val))
    
    if test_preds_xgboost is None:
        test_preds_xgboost = model.predict_proba(test.values)
    else:
        test_preds_xgboost += model.predict_proba(test.values)

print("-" * 50)
print("Average Training Metric : " , train_metric / n_splits)
print("Average Validation Metric : " , val_metric / n_splits)

test_preds_xgboost /= n_splits

In [None]:
# Exporting results

sub_xgboost = pd.read_csv(DATA + "/sample_submission.csv")

sub_xgboost['Class_1']=test_preds_xgboost[:,0]
sub_xgboost['Class_2']=test_preds_xgboost[:,1]
sub_xgboost['Class_3']=test_preds_xgboost[:,2]
sub_xgboost['Class_4']=test_preds_xgboost[:,3]
sub_xgboost['Class_5']=test_preds_xgboost[:,4]
sub_xgboost['Class_6']=test_preds_xgboost[:,5]
sub_xgboost['Class_7']=test_preds_xgboost[:,6]
sub_xgboost['Class_8']=test_preds_xgboost[:,7]
sub_xgboost['Class_9']=test_preds_xgboost[:,8]

sub_xgboost.to_csv("XGBoost.csv",index=False)

## Feature importance

In [None]:
# I referred to https://www.kaggle.com/gogo827jz/catboost-baseline-with-feature-importance

# Calculate the average feature importance for each feature
feature_importances['average'] = feature_importances[[f'fold_{fold}' for fold in range(n_splits)]].mean(axis=1)
feature_importances.to_csv('feature_importances_xgboost.csv')
feature_importances.sort_values(by='average', ascending=False).head()

In [None]:
# Plot the feature importances with min/max/average using seaborn
feature_importances_flatten = pd.DataFrame()
for i in range(1, len(feature_importances.columns)-1):
    col = ['feature', feature_importances.columns.values[i]]
    feature_importances_flatten = pd.concat([feature_importances_flatten, feature_importances[col].rename(columns={f'fold_{i-1}': 'importance'})], axis=0)

plt.figure(figsize=(16, 16))
sns.barplot(data=feature_importances_flatten.sort_values(by='importance', ascending=False), x='importance', y='feature')
plt.title(f'Feature Importances over {n_splits} folds of XGBoost')  
plt.savefig("feature_importances_xgboost.png")

## Blending

In this case, we made predictions with three models, and averaging their results may improve the score. A weighted average is easy to do, so let's see how to do it.

In [None]:
# Zero matrix to hold the values
test_preds_blended =  np.zeros_like(test_preds_catboost, dtype="float64")

# Weights for mixing results
weights = {"catboost": 0.4,
           "histgradientboost": 0.3,
           "xgboost": 0.3}

# Ensemble targets
preds = [test_preds_catboost, test_preds_histgradientboost, test_preds_xgboost]

In [None]:
# Ensemble

for pred, weight in zip(preds, weights.values()):
    test_preds_blended += pred * weight
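One thing to watch with this pattern is that the weights and predictions are paired by position. Here is a small variant on made-up numbers that looks weights up by model name and divides by the total weight, so the blended rows remain probabilities even if the weights do not sum to 1. The names `preds_demo` and `weights_demo` are hypothetical.

```python
import numpy as np

# Hypothetical per-model probabilities for 2 rows and 3 classes.
preds_demo = {
    "catboost": np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]),
    "histgradientboost": np.array([[0.5, 0.4, 0.1], [0.1, 0.6, 0.3]]),
    "xgboost": np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]]),
}
weights_demo = {"catboost": 0.4, "histgradientboost": 0.3, "xgboost": 0.3}

# Pair each prediction with its weight by name, then normalize by the total weight.
total_weight = sum(weights_demo.values())
blended_demo = sum(weights_demo[name] * preds_demo[name] for name in preds_demo) / total_weight
print(blended_demo)
```

For the first row this gives 0.4 * 0.6 + 0.3 * 0.5 + 0.3 * 0.7 = 0.60 for the first class.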

In [None]:
# Exporting results

sub_blended = pd.read_csv(DATA + "/sample_submission.csv")

sub_blended['Class_1']=test_preds_blended[:,0]
sub_blended['Class_2']=test_preds_blended[:,1]
sub_blended['Class_3']=test_preds_blended[:,2]
sub_blended['Class_4']=test_preds_blended[:,3]
sub_blended['Class_5']=test_preds_blended[:,4]
sub_blended['Class_6']=test_preds_blended[:,5]
sub_blended['Class_7']=test_preds_blended[:,6]
sub_blended['Class_8']=test_preds_blended[:,7]
sub_blended['Class_9']=test_preds_blended[:,8]

sub_blended.to_csv("Blended.csv",index=False)

In [None]:
sub_blended.head()