# Table of Contents
<a id="table-of-contents"></a>
- [1 Introduction](#1)
- [2 Preparations](#2)
- [3 Datasets Overview](#3)
    - [3.1 Train dataset](#3.1)
    - [3.2 Test dataset](#3.2)
    - [3.3 Submission](#3.3)
- [4 Features](#4)
    - [4.1 Unique values](#4.1)
    - [4.2 Distribution](#4.2)
- [5 Target](#5)
    - [5.1 Distribution](#5.1)
    - [5.2 Features Distribution by Target](#5.2)
- [6 Base Model](#6)
    - [6.1 Regression](#6.1)
        - [6.1.1 Linear Regression](#6.1.1)
        - [6.1.2 XGBoost Regressor](#6.1.2)
        - [6.1.3 LGBM Regressor](#6.1.3)
        - [6.1.4 Catboost Regressor](#6.1.4)
    - [6.2 Classification](#6.2)
        - [6.2.1 XGBoost Classifier](#6.2.1)
        - [6.2.2 LGBM Classifier](#6.2.2)
        - [6.2.3 Catboost Classifier](#6.2.3)

[back to top](#table-of-contents)
<a id="1"></a>
# 1 Introduction

Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new in their data science journey. In the past, Kaggle has launched many Playground competitions that are more approachable than Featured competitions, and thus more beginner-friendly.

The goal of these competitions is to provide a fun, but less challenging, tabular dataset. These competitions will be great for people looking for something in between the Titanic Getting Started competition and a Featured competition.

The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using a CTGAN. The original dataset deals with calculating the loss associated with loan defaults. Although the features are anonymized, they have properties relating to real-world features.

Submissions are scored on the root mean squared error (RMSE), defined as:

$$ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \hat{y_{i}})^{2}} $$

where $\hat{y_{i}}$ is the predicted value, $y_{i}$ is the ground truth value, and $n$ is the number of rows in the test data.
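
As a quick illustration of the formula above, here is a minimal sketch of RMSE in NumPy. The function name `rmse` and the toy values are assumptions made for illustration; they are not part of the competition data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # square the residuals, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# toy values (hypothetical): residuals are -1, 0 and 2
example_score = rmse([0, 2, 5], [1, 2, 3])
print(example_score)
```

Because the residuals are squared before averaging, a few large misses weigh more heavily than many small ones.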

[back to top](#table-of-contents)
<a id="2"></a>
# 2 Preparations
This section loads the packages and data used in the analysis. The packages are mainly for data manipulation, data visualization and modeling. Two datasets are used in the analysis: the train dataset, which is used to fit models, and the test dataset, on which those models make predictions. The sample submission file shows participants the expected submission format.

In [None]:
# import packages
import os
import joblib
import numpy as np
import pandas as pd
import warnings

import matplotlib
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns

# setting up options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('float_format', '{:f}'.format)
warnings.filterwarnings('ignore')

# import datasets
train_df = pd.read_csv('../input/tabular-playground-series-aug-2021/train.csv')
test_df = pd.read_csv('../input/tabular-playground-series-aug-2021/test.csv')
submission = pd.read_csv('../input/tabular-playground-series-aug-2021/sample_submission.csv')

# convert columns whose values are all whole numbers to integer type
# (an element-wise check avoids positive and negative remainders cancelling in a sum)
for col in train_df.columns:
    if (train_df[col] == train_df[col].astype('int')).all():
        train_df[col] = train_df[col].astype('int')

for col in test_df.columns:
    if (test_df[col] == test_df[col].astype('int')).all():
        test_df[col] = test_df[col].astype('int')

[back to top](#table-of-contents)
<a id="3"></a>
# 3 Datasets Overview
The intent of this overview is to get a feel for the data and its structure in the train, test and submission files. The overview of the train and test datasets includes a quick analysis of missing values and basic statistics, while the sample submission is loaded to see the expected submission format.

<a id="3.1"></a>
## 3.1 Train dataset
As stated before, the train dataset is mainly used to train predictive models, as the target variable is available in this set. This dataset is also used to explore the data itself, including finding relations between each predictor and the target variable.

**Observations:**
- **target**
    - The `loss` column is the target variable, which is only available in the `train` dataset.
    - Interestingly, the `loss` column is of `int64` type. **Is this a classification task with a regression evaluation metric?**
- **features**
    - There are `100` features, named `f0` to `f99`.
    - The `train` dataset contains `250,000` observations and `102` columns, without any missing values.
    - Apart from the target, only `id`, `f1`, `f16`, `f27`, `f55`, `f60`, and `f86` are of `int64` type; the other features are `float64`.
    - The means of `f31`, `f36`, `f46`, and `f78` are quite close to the mean of the target variable.
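
The last observation can be checked numerically rather than by reading the statistics table. The sketch below is one possible way to do it; the helper name `features_near_target_mean` and the `rel_tol` threshold are assumptions chosen for illustration, and the toy frame stands in for `train_df`.

```python
import pandas as pd

def features_near_target_mean(df, target='loss', rel_tol=0.25, exclude=('id',)):
    """List features whose mean lies within rel_tol (relative) of the target mean."""
    means = df.drop(columns=list(exclude), errors='ignore').mean()
    target_mean = means[target]
    # relative gap between each feature mean and the target mean
    gap = (means.drop(target) - target_mean).abs() / abs(target_mean)
    return gap[gap <= rel_tol].sort_values()

# toy data (hypothetical), standing in for train_df
toy = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'f0': [0.1, 0.2, 0.3, 0.4],
    'f1': [5.0, 7.0, 6.0, 8.0],
    'loss': [4, 8, 6, 10],
})
close = features_near_target_mean(toy)
print(close)
```

A similar mean says nothing about predictive power on its own, so this is only a screening step.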

### 3.1.1 Quick view
Below are the first 5 rows of the train dataset:

In [None]:
train_df.head()

The dimensions and the number of missing values in the train dataset are shown below:

In [None]:
print(f'Number of rows: {train_df.shape[0]};  Number of columns: {train_df.shape[1]}; No of missing values: {sum(train_df.isna().sum())}')

### 3.1.2 Data types
Except for the `id`, `f1`, `f16`, `f27`, `f55`, `f60`, `f86` and `loss` columns, which are of `int64` type, all columns are `float64`.

In [None]:
train_df.dtypes

### 3.1.3 Basic statistics
Below are the basic statistics for each variable: `count`, `mean`, `standard deviation`, `minimum`, `1st quartile`, `median`, `3rd quartile` and `maximum`.

In [None]:
train_df.describe()

[back to top](#table-of-contents)
<a id="3.2"></a>
## 3.2 Test dataset
The test dataset is used to make predictions with the previously trained model. Exploring this dataset is also needed to see how the data is structured and, in particular, how similar it is to the train dataset.

**Observations:**

The feature columns in the `test` dataset are similar to those in `train`, with details as follows:
- There are `100` features, named `f0` to `f99`.
- The `test` dataset contains `150,000` observations and `101` columns, without any missing values.
- Only `id`, `f1`, `f16`, `f27`, `f55`, `f60` and `f86` are of `int64` type; the other features are `float64`.
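
The consistency between the two datasets can also be verified programmatically. The following is a minimal sketch, assuming a hypothetical helper `compare_schemas` and small toy frames in place of `train_df` and `test_df`.

```python
import pandas as pd

def compare_schemas(train, test, target='loss'):
    """Report column and dtype differences between train and test, ignoring the target."""
    train_cols = set(train.columns) - {target}
    test_cols = set(test.columns)
    shared = sorted(train_cols & test_cols)
    return {
        'only_in_train': sorted(train_cols - test_cols),
        'only_in_test': sorted(test_cols - train_cols),
        'dtype_mismatch': [c for c in shared if train[c].dtype != test[c].dtype],
    }

# toy frames (hypothetical): f1 is integer in train but float in test
toy_train = pd.DataFrame({'id': [0, 1], 'f0': [0.5, 1.5], 'f1': [1, 2], 'loss': [0, 3]})
toy_test = pd.DataFrame({'id': [2, 3], 'f0': [0.7, 0.9], 'f1': [1.0, 2.5]})
report = compare_schemas(toy_train, toy_test)
print(report)
```

On the real data, three empty lists would confirm the observations above.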

### 3.2.1 Quick view
Below are the first 5 rows of the test dataset:

In [None]:
test_df.head()

In [None]:
print(f'Number of rows: {test_df.shape[0]};  Number of columns: {test_df.shape[1]}; No of missing values: {sum(test_df.isna().sum())}')

### 3.2.2 Data types
Except for the `id`, `f1`, `f16`, `f27`, `f55`, `f60` and `f86` columns, which are of `int64` type, all columns are `float64`, which is consistent with the train dataset. The test dataset has no `loss` column.

In [None]:
test_df.dtypes

[back to top](#table-of-contents)
<a id="3.3"></a>
## 3.3 Submission
The submission file is expected to have `id` and `loss` columns.
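
A file in that shape could be assembled as in the sketch below. The helper `make_submission`, the ids and the predicted values are all hypothetical; an in-memory buffer is used instead of a real output path.

```python
import io
import numpy as np
import pandas as pd

def make_submission(ids, predictions):
    """Build a two-column frame in the expected `id`, `loss` layout."""
    ids = np.asarray(ids)
    predictions = np.asarray(predictions, dtype=float)
    if ids.shape != predictions.shape:
        raise ValueError('ids and predictions must have the same length')
    return pd.DataFrame({'id': ids, 'loss': predictions})

# toy ids and predictions (hypothetical)
toy_submission = make_submission([250000, 250001], [6.8, 7.1])

# write without the pandas index so only `id` and `loss` appear in the file
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` matters here: otherwise pandas writes an extra unnamed column.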

Below are the first 5 rows of the submission file:

In [None]:
submission.head()

[back to top](#table-of-contents)
<a id="4"></a>
# 4 Features
There are `100` features available for building a prediction model. The analysis starts by looking at the number of unique values in the integer features, which are `f1`, `f16`, `f27`, `f55`, `f60` and `f86`.

<a id="4.1"></a>
## 4.1 Unique values
This section counts the unique values of each integer feature and their share of the total observations, in both the train and test datasets.

### 4.1.1 Preparation
Prepare the train and test datasets for data analysis and visualization.

In [None]:
integer_features = ['f1', 'f16', 'f27', 'f55', 'f60', 'f86']
unique_values_train = pd.DataFrame(train_df[integer_features].nunique())
unique_values_train = unique_values_train.reset_index(drop=False)
unique_values_train.columns = ['Features', 'Count']

unique_values_percent_train = pd.DataFrame(train_df[integer_features].nunique()/train_df.shape[0])
unique_values_percent_train = unique_values_percent_train.reset_index(drop=False)
unique_values_percent_train.columns = ['Features', 'Count']

unique_values_test = pd.DataFrame(test_df[integer_features].nunique())
unique_values_test = unique_values_test.reset_index(drop=False)
unique_values_test.columns = ['Features', 'Count']

unique_values_percent_test = pd.DataFrame(test_df[integer_features].nunique()/test_df.shape[0])
unique_values_percent_test = unique_values_percent_test.reset_index(drop=False)
unique_values_percent_test.columns = ['Features', 'Count']

### 4.1.2 Individual features

Count the unique values in each feature and compare the results for the train and test datasets to see how the two datasets differ.

**Observations:**
- It seems `f1` and `f86` can be treated as categorical features, as their number of unique values is small compared with the total number of observations, which shows in their percentage of the total observations.
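
The count and percentage tables built in 4.1.1 can also be produced by a single helper. This is a compact sketch, not the code used for the plots; the name `cardinality_summary`, the `Share` column and the toy frame are assumptions.

```python
import pandas as pd

def cardinality_summary(df, columns):
    """Unique-value count per column and its share of the number of rows."""
    n_unique = df[columns].nunique()
    return pd.DataFrame({
        'Features': n_unique.index,
        'Count': n_unique.values,
        'Share': n_unique.values / len(df),
    })

# toy frame (hypothetical): f1 repeats values, f16 is unique on every row
toy = pd.DataFrame({'f1': [1, 1, 2, 2], 'f16': [10, 11, 12, 13]})
summary = cardinality_summary(toy, ['f1', 'f16'])
print(summary)
```

Calling it once on `train_df` and once on `test_df` would give directly comparable tables. A share close to 1 means nearly every row has its own value, while a share close to 0 hints at a categorical feature.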

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(6, 4), facecolor='#f6f5f5')
gs = fig.add_gridspec(2, 2)
gs.update(wspace=0.4, hspace=0.5)

background_color = "#f6f5f5"
sns.set_palette(['#ffd514']*6)

ax0 = fig.add_subplot(gs[0, 0])
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)
ax0.set_facecolor(background_color)
ax0_sns = sns.barplot(ax=ax0, y=unique_values_train['Features'], x=unique_values_train['Count'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax0_sns.set_xlabel("Unique Values",fontsize=4, weight='bold')
ax0_sns.set_ylabel("Features",fontsize=4, weight='bold')
ax0_sns.tick_params(labelsize=4, width=0.5, length=1.5)
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0.text(0, -1.5, 'Unique Values - Train Dataset', fontsize=6, ha='left', va='top', weight='bold')
ax0.text(0, -1, 'f1 and f86 can be considered as classification features', fontsize=4, ha='left', va='top')
ax0.get_xaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
# data label
for p in ax0.patches:
    value = f'{p.get_width():,.0f}'
    x = p.get_x() + p.get_width() + 6000
    y = p.get_y() + p.get_height() / 2 
    ax0.text(x, y, value, ha='left', va='center', fontsize=4, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))
    
ax1 = fig.add_subplot(gs[0, 1])
for s in ["right", "top"]:
    ax1.spines[s].set_visible(False)
ax1.set_facecolor(background_color)
ax1_sns = sns.barplot(ax=ax1, y=unique_values_percent_train['Features'], x=unique_values_percent_train['Count'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax1_sns.set_xlabel("Percentage Unique Values",fontsize=4, weight='bold')
ax1_sns.set_ylabel("Features",fontsize=4, weight='bold')
ax1_sns.tick_params(labelsize=4, width=0.5, length=1.5)
ax1_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax1_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax1.text(0, -1.5, 'Percentage Unique Values - Train Dataset', fontsize=6, ha='left', va='top', weight='bold')
ax1.text(0, -1, 'f1 and f86 can be considered as classification features', fontsize=4, ha='left', va='top')
# data label
for p in ax1.patches:
    value = f'{p.get_width():.2f}'
    x = p.get_x() + p.get_width() + 0.03
    y = p.get_y() + p.get_height() / 2 
    ax1.text(x, y, value, ha='left', va='center', fontsize=4, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))

background_color = "#f6f5f5"
sns.set_palette(['#ff355d']*6)
    
ax3 = fig.add_subplot(gs[1, 0])
for s in ["right", "top"]:
    ax3.spines[s].set_visible(False)
ax3.set_facecolor(background_color)
ax3_sns = sns.barplot(ax=ax3, y=unique_values_test['Features'], x=unique_values_test['Count'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax3_sns.set_xlabel("Unique Values",fontsize=4, weight='bold')
ax3_sns.set_ylabel("Features",fontsize=4, weight='bold')
ax3_sns.tick_params(labelsize=4, width=0.5, length=1.5)
ax3_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax3_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax3.text(0, -1.5, 'Unique Values - Test Dataset', fontsize=6, ha='left', va='top', weight='bold')
ax3.text(0, -1, 'Test dataset is quite similar with train dataset', fontsize=4, ha='left', va='top')
ax3.get_xaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
# data label
for p in ax3.patches:
    value = f'{p.get_width():,.0f}'
    x = p.get_x() + p.get_width() + 6000
    y = p.get_y() + p.get_height() / 2 
    ax3.text(x, y, value, ha='left', va='center', fontsize=4, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))
    
ax4 = fig.add_subplot(gs[1, 1])
for s in ["right", "top"]:
    ax4.spines[s].set_visible(False)
ax4.set_facecolor(background_color)
ax4_sns = sns.barplot(ax=ax4, y=unique_values_percent_test['Features'], x=unique_values_percent_test['Count'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax4_sns.set_xlabel("Percentage Unique Values",fontsize=4, weight='bold')
ax4_sns.set_ylabel("Features",fontsize=4, weight='bold')
ax4_sns.tick_params(labelsize=4, width=0.5, length=1.5)
ax4_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax4_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax4.text(0, -1.5, 'Percentage Unique Values - Test Dataset', fontsize=6, ha='left', va='top', weight='bold')
ax4.text(0, -1, 'Test dataset is quite similar with train dataset', fontsize=4, ha='left', va='top')
# data label
for p in ax4.patches:
    value = f'{p.get_width():.2f}'
    x = p.get_x() + p.get_width() + 0.03
    y = p.get_y() + p.get_height() / 2 
    ax4.text(x, y, value, ha='left', va='center', fontsize=4, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))

[back to top](#table-of-contents)
<a id="4.2"></a>
## 4.2 Distribution
This section shows the distribution of each feature in the train and test datasets. As there are 100 features, they are split into sections of 25 features each. `Yellow` represents the train dataset and `pink` represents the test dataset.

**Observations:**
- The distributions of all features are nearly identical between the train and test datasets.
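
The visual comparison can be complemented with a number. One option, which is an addition here and not something the notebook itself computes, is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical cumulative distribution functions. The sketch below implements it in plain NumPy; `ks_statistic` and the synthetic samples are assumptions for illustration.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two samples (0 to 1)."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # the gap can only change at observed points, so evaluate both CDFs there
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side='right') / a.size
    cdf_b = np.searchsorted(b, grid, side='right') / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# synthetic samples (hypothetical), standing in for one train and one test column
rng = np.random.default_rng(0)
same_source = ks_statistic(rng.normal(size=2000), rng.normal(size=2000))
shifted = ks_statistic(rng.normal(size=2000), rng.normal(loc=1.0, size=2000))
print(same_source, shifted)
```

Values near 0 indicate closely matching distributions, while values near 1 indicate almost no overlap.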

### 4.2.1 Features f0 - f24

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

run_no = 0
for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

ax0.text(-0.2, 1.4, 'Features Distribution', fontsize=10, fontweight='bold')
ax0.text(-0.2, 1.3, 'f0-f24', fontsize=6, fontweight='light')        

features = list(train_df.columns[1:26])

background_color = "#f6f5f5"

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

plt.show()

### 4.2.2 Features f25 - f49

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

run_no = 0
for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

ax0.text(-4, 0.165, 'Features Distribution ', fontsize=10, fontweight='bold')
ax0.text(-4, 0.15, 'f25-f49', fontsize=6, fontweight='light')

features = list(train_df.columns[26:51])

background_color = "#f6f5f5"

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

plt.show()

### 4.2.3 Features f50 - f74

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

run_no = 0
for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

ax0.text(-1, 1.8, 'Features Distribution ', fontsize=10, fontweight='bold')
ax0.text(-1, 1.65, 'f50-f74', fontsize=6, fontweight='light')

features = list(train_df.columns[51:76])

background_color = "#f6f5f5"

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

plt.show()

### 4.2.4 Features f75 - f99

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

run_no = 0
for row in range(0, 5):
    for col in range(0, 5):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

ax0.text(-15000, 0.00033, 'Features Distribution ', fontsize=10, fontweight='bold')
ax0.text(-15000, 0.00030, 'f75-f99', fontsize=6, fontweight='light')

features = list(train_df.columns[76:101])

background_color = "#f6f5f5"

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=train_df[col], zorder=2, alpha=1, linewidth=1, color='#ffd514')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

run_no = 0
for col in features:
    sns.kdeplot(ax=locals()["ax"+str(run_no)], x=test_df[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize=4, fontweight='bold')
    locals()["ax"+str(run_no)].tick_params(labelsize=4, width=0.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1

plt.show()

[back to top](#table-of-contents)
<a id="5"></a>
# 5 Target
Although the metric used to evaluate the model is `RMSE`, it is tempting to look at the target variable from a classification point of view. It may be possible to use a multi-class classification model to make the predictions, even though they are scored with `RMSE`.
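
If a classifier is used, its output still has to become a single number per row. Under squared error, the point prediction that minimizes the expected error for a given predicted distribution is its probability-weighted mean, not the most likely class. The sketch below illustrates the difference; `expected_loss` and the toy probabilities are assumptions, and the classifiers in section 6.2 may handle this differently.

```python
import numpy as np

def expected_loss(class_probabilities, class_values):
    """Probability-weighted mean of the class values, one prediction per row."""
    proba = np.asarray(class_probabilities, dtype=float)
    values = np.asarray(class_values, dtype=float)
    return proba @ values

# toy probabilities (hypothetical) over three classes with values 0, 1 and 2
toy_proba = np.array([[0.6, 0.3, 0.1],
                      [0.1, 0.2, 0.7]])
toy_values = np.array([0, 1, 2])

point_predictions = expected_loss(toy_proba, toy_values)
hard_labels = toy_values[toy_proba.argmax(axis=1)]
print(point_predictions, hard_labels)
```

With a real model, `toy_proba` would come from something like `predict_proba`, with columns ordered to match the class values.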

<a id="5.1"></a>
## 5.1 Distribution
The target variable ranges from `0` to `42`, for a total of `43` distinct values. Let's look at how often each value occurs in the `train` dataset.

**Observations:**
- `0` is the most frequent value in the train dataset, at about `24%` of observations.
- Values `0` to `10` combined account for `75.5%` of observations, meaning most `loss` values are `<=10`.
- Each value above `12` accounts for less than `2%` of observations.
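
Percentages like these can be read off a share table with a running total. A minimal sketch, assuming a hypothetical helper `target_shares` and a toy series in place of `train_df['loss']`:

```python
import pandas as pd

def target_shares(target):
    """Percentage of rows per target value, plus the cumulative percentage."""
    share = target.value_counts(normalize=True).sort_index() * 100
    return pd.DataFrame({'Share': share, 'Cumulative': share.cumsum()})

# toy target (hypothetical): 8 rows with values 0, 1, 2 and 5
toy_loss = pd.Series([0, 0, 1, 2, 2, 2, 5, 0])
shares = target_shares(toy_loss)
print(shares)
```

Sorting by the target value before `cumsum` is what makes the `Cumulative` column answer questions of the form "what share of rows has `loss <= k`".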

In [None]:
unique_values_target = pd.DataFrame(train_df['loss'].value_counts())
unique_values_target = unique_values_target.reset_index(drop=False)
unique_values_target.columns = ['Value', 'Count']

unique_values_percentage_target = pd.DataFrame(train_df['loss'].value_counts()/train_df.shape[0] * 100)
unique_values_percentage_target = unique_values_percentage_target.reset_index(drop=False)
unique_values_percentage_target.columns = ['Value', 'Count']

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(6, 4), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 2)
gs.update(wspace=0.4, hspace=0.5)

background_color = "#f6f5f5"
sns.set_palette(['#ffd514']*43)

ax0 = fig.add_subplot(gs[0, 0])
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)
ax0.set_facecolor(background_color)
ax0_sns = sns.barplot(ax=ax0, y=unique_values_target['Value'], x=unique_values_target['Count'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax0_sns.set_xlabel("No. of Occurrences",fontsize=4, weight='bold')
ax0_sns.set_ylabel("Unique Value",fontsize=4, weight='bold')
ax0_sns.tick_params(labelsize=4, width=0.5, length=1.5)
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0.text(0, -2.8, 'Target Variable', fontsize=6, ha='left', va='top', weight='bold')
ax0.text(0, -1.5, 'Zero is dominating the target variable with 60,144 occurrences', fontsize=4, ha='left', va='top')
ax0.get_xaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
# data label
for p in ax0.patches:
    value = f'{p.get_width():,.0f}'
    x = p.get_x() + p.get_width() + 800
    y = p.get_y() + p.get_height() / 2 
    ax0.text(x, y, value, ha='left', va='center', fontsize=2, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))
    
sns.set_palette(['#ff355d']*43)

ax0 = fig.add_subplot(gs[0, 1])
for s in ["right", "top"]:
    ax0.spines[s].set_visible(False)
ax0.set_facecolor(background_color)
ax0_sns = sns.barplot(ax=ax0, y=unique_values_percentage_target['Value'], x=unique_values_percentage_target['Count'], 
                      zorder=2, linewidth=0, orient='h', saturation=1, alpha=1)
ax0_sns.set_xlabel("Percentage",fontsize=4, weight='bold')
ax0_sns.set_ylabel("Unique Value",fontsize=4, weight='bold')
ax0_sns.tick_params(labelsize=4, width=0.5, length=1.5)
ax0_sns.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0_sns.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0.text(0, -2.8, 'Percentage Target Variable', fontsize=6, ha='left', va='top', weight='bold')
ax0.text(0, -1.5, 'Zero is dominating the target variable with 24.1% of occurrences', fontsize=4, ha='left', va='top')
ax0.get_xaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
# data label
for p in ax0.patches:
    value = f'{p.get_width():,.1f}'
    x = p.get_x() + p.get_width() + 0.5
    y = p.get_y() + p.get_height() / 2 
    ax0.text(x, y, value, ha='left', va='center', fontsize=2, 
            bbox=dict(facecolor='none', edgecolor='black', boxstyle='round', linewidth=0.3))


<a id="5.2"></a>
## 5.2 Features Distribution by Target
This section checks whether any feature shows a distinct distribution when grouped by the target variable. The distributions are shown using violin plots.

**Observations:**
- For target values `0` to `10` (which cover 75.5% of the observations), all features have relatively similar distributions.
- Most features change their distribution for target values between `39` and `42`.
- For features `f59`, `f77`, `f81`, `f86` and `f93`, the distribution tails at target value `0` are quite different from those at other values.
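
The quartile lines drawn inside the violins can also be tabulated, which makes shifts between target values easier to compare. This is a sketch under assumptions: the helper `quartiles_by_target` is hypothetical and the toy frame stands in for `train_df`.

```python
import pandas as pd

def quartiles_by_target(df, feature, target='loss'):
    """Quartiles of one feature for each target value, one row per target value."""
    return (df.groupby(target)[feature]
              .quantile([0.25, 0.5, 0.75])
              .unstack())

# toy data (hypothetical): the feature is clearly larger when loss is 1
toy = pd.DataFrame({
    'f0': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    'loss': [0, 0, 0, 1, 1, 1],
})
quartiles = quartiles_by_target(toy, 'f0')
print(quartiles)
```

Rows with few observations, such as the highest target values, give noisy quartiles, so differences there should be read with care.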

### 5.2.1 Features f0 - f24

In [None]:
features = list(train_df.columns[1:26])

plt.rcParams['figure.dpi'] = 400
fig = plt.figure(figsize=(10,160), facecolor='#f6f5f5')
gs = fig.add_gridspec(100, 1)
gs.update(wspace=0, hspace=0.5)

background_color = "#f6f5f5"
sns.set_palette(['#fcd12a']*43)

run_no = 0
for row in range(0, 25):
    for col in range(0, 1):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0
for col in features:
    sns.violinplot(ax=locals()["ax"+str(run_no)], y=train_df[col], x=train_df['loss'], 
                     saturation=1, linewidth=0.3, zorder=1, inner='quartile')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
    locals()["ax"+str(run_no)].set_axisbelow(True) 
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize = 8, fontdict=dict(weight='bold'))
    locals()["ax"+str(run_no)].tick_params(labelsize=5, width=0.5, length=1.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1
    
ax0.text(0, 1.8, 'Features Distribution', fontsize=10, fontweight='bold')
ax0.text(0, 1.5, 'Showing features f0 - f24 distribution on train dataset relative to target variable values', fontsize=8)

plt.show()

### 5.2.2 Features f25 - f49

In [None]:
features = list(train_df.columns[26:51])

plt.rcParams['figure.dpi'] = 400
fig = plt.figure(figsize=(10,160), facecolor='#f6f5f5')
gs = fig.add_gridspec(100, 1)
gs.update(wspace=0, hspace=0.5)

background_color = "#f6f5f5"
sns.set_palette(['#fcd12a']*43)

run_no = 0
for row in range(0, 25):
    for col in range(0, 1):
        locals()["ax"+str(run_no)] = fig.add_subplot(gs[row, col])
        locals()["ax"+str(run_no)].set_facecolor(background_color)
        for s in ["top","right"]:
            locals()["ax"+str(run_no)].spines[s].set_visible(False)
        run_no += 1

run_no = 0
for col in features:
    sns.violinplot(ax=locals()["ax"+str(run_no)], y=train_df[col], x=train_df['loss'], 
                     saturation=1, linewidth=0.3, zorder=1, inner='quartile')
    locals()["ax"+str(run_no)].grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
    locals()["ax"+str(run_no)].grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
    locals()["ax"+str(run_no)].set_axisbelow(True) 
    locals()["ax"+str(run_no)].set_ylabel('')
    locals()["ax"+str(run_no)].set_xlabel(col, fontsize = 8, fontdict=dict(weight='bold'))
    locals()["ax"+str(run_no)].tick_params(labelsize=5, width=0.5, length=1.5)
    locals()["ax"+str(run_no)].xaxis.offsetText.set_fontsize(4)
    locals()["ax"+str(run_no)].yaxis.offsetText.set_fontsize(4)
    run_no += 1
    
ax0.text(-0.5, 32, 'Features Distribution', fontsize=10, fontweight='bold')
ax0.text(-0.5, 27, 'Showing features f25 - f49 distribution on train dataset relative to target variable values', fontsize=8)

plt.show()

### 5.2.3 Features f50 - f74

In [None]:
features = list(train_df.columns[51:76])

plt.rcParams['figure.dpi'] = 400
fig = plt.figure(figsize=(10,160), facecolor='#f6f5f5')
gs = fig.add_gridspec(100, 1)
gs.update(wspace=0, hspace=0.5)

background_color = "#f6f5f5"
sns.set_palette(['#fcd12a']*43)

# Create one stacked axis per feature.
axes = []
for row in range(len(features)):
    ax = fig.add_subplot(gs[row, 0])
    ax.set_facecolor(background_color)
    for s in ["top","right"]:
        ax.spines[s].set_visible(False)
    axes.append(ax)

# Draw each feature's distribution against the target values.
for ax, col in zip(axes, features):
    sns.violinplot(ax=ax, y=train_df[col], x=train_df['loss'], 
                   saturation=1, linewidth=0.3, zorder=1, inner='quartile')
    ax.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
    ax.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
    ax.set_axisbelow(True)
    ax.set_ylabel('')
    ax.set_xlabel(col, fontsize=8, fontdict=dict(weight='bold'))
    ax.tick_params(labelsize=5, width=0.5, length=1.5)
    ax.xaxis.offsetText.set_fontsize(4)
    ax.yaxis.offsetText.set_fontsize(4)

# Title and subtitle, positioned in the first axis' data coordinates.
axes[0].text(-0.6, 9, 'Features Distribution', fontsize=10, fontweight='bold')
axes[0].text(-0.6, 7.8, 'Showing features f50 - f74 distribution on train dataset relative to target variable values', fontsize=8)

plt.show()

### 5.2.4 Features f75 - f99

In [None]:
features = list(train_df.columns[76:101])

plt.rcParams['figure.dpi'] = 400
fig = plt.figure(figsize=(10,160), facecolor='#f6f5f5')
gs = fig.add_gridspec(100, 1)
gs.update(wspace=0, hspace=0.5)

background_color = "#f6f5f5"
sns.set_palette(['#fcd12a']*43)

# Create one stacked axis per feature.
axes = []
for row in range(len(features)):
    ax = fig.add_subplot(gs[row, 0])
    ax.set_facecolor(background_color)
    for s in ["top","right"]:
        ax.spines[s].set_visible(False)
    axes.append(ax)

# Draw each feature's distribution against the target values.
for ax, col in zip(axes, features):
    sns.violinplot(ax=ax, y=train_df[col], x=train_df['loss'], 
                   saturation=1, linewidth=0.3, zorder=1, inner='quartile')
    ax.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.7)
    ax.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.7)
    ax.set_axisbelow(True)
    ax.set_ylabel('')
    ax.set_xlabel(col, fontsize=8, fontdict=dict(weight='bold'))
    ax.tick_params(labelsize=5, width=0.5, length=1.5)
    ax.xaxis.offsetText.set_fontsize(4)
    ax.yaxis.offsetText.set_fontsize(4)

# Title and subtitle, positioned in the first axis' data coordinates.
axes[0].text(-0.6, 67000, 'Features Distribution', fontsize=10, fontweight='bold')
axes[0].text(-0.6, 57000, 'Showing features f75 - f99 distribution on train dataset relative to target variable values', fontsize=8)

plt.show()

[back to top](#table-of-contents)
<a id="6"></a>
# 6 Base Model
This section evaluates baseline models in two groups: regression and classification. Each model is evaluated with five-fold cross-validation and default hyperparameters (no tuning). *(The packages used are imported in the code cell below.)*

**Observations:**
- `Classification` models perform worse than `Regression` models.
- `Catboost` outperforms the other models in both `regression` and `classification`.
- The best `classification` model cannot beat the best `regression` model:
    - Because the evaluation metric is RMSE, a `classification` model has to be exactly right more often than a `regression` model to reach the same score. A `classification` model can only predict whole class labels: if it predicts `11` while the actual value is `12`, the gap is `1`, whereas a `regression` model can predict `11.5` and halve that gap.
    - The best regression model is `Catboost Regressor`, with an OOF RMSE of `7.85`.
    - The best classification model is again Catboost (`Catboost Classifier`), with an OOF RMSE of `10.24`.
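
The RMSE argument above can be illustrated with a toy calculation. This is only a sketch on made-up numbers, not competition data: `y_true` is a hypothetical group of rows that a model cannot tell apart, the "classifier" is assumed to answer with the most frequent label, and the "regressor" is assumed to answer with the mean.

```python
import math
from statistics import mean, mode


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


# Hypothetical integer losses for rows the model cannot distinguish.
y_true = [0, 0, 0, 1, 4, 12]

# A classifier can only output one of the observed labels,
# for example the most frequent one.
class_pred = [mode(y_true)] * len(y_true)

# A regressor can output any real number; the mean is the
# constant prediction that minimises RMSE.
reg_pred = [mean(y_true)] * len(y_true)

rmse_class = rmse(y_true, class_pred)
rmse_reg = rmse(y_true, reg_pred)
print(f"classifier-style RMSE: {rmse_class:.3f}")
print(f"regressor-style RMSE:  {rmse_reg:.3f}")
```

Under these assumptions the fractional prediction always scores at least as well as any single label, which is consistent with the gap between the two model families reported here.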


In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor, XGBClassifier 
from catboost import CatBoostRegressor, CatBoostClassifier
from lightgbm import LGBMRegressor, LGBMClassifier

folds = 5
features = list(train_df.columns[1:101])

[back to top](#table-of-contents)
<a id="6.1"></a>
## 6.1 Regression
Models that will be evaluated are `Linear Regression`, `XGBoost Regressor`, `LGBM Regressor` and `Catboost Regressor`.

**Observations:**
- `Catboost Regressor` performs best, with an OOF RMSE of `7.85`.
- `LGBM Regressor` is in second place, with an OOF RMSE of `7.86`.
- `Linear Regression` beats `XGBoost Regressor`, with an OOF RMSE of `7.89`.
- `XGBoost Regressor` is in last place, with an OOF RMSE of `7.93`.
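
Every model in this section is scored with the same out-of-fold (OOF) loop, so the pattern can be sketched once as a helper. The function name `oof_rmse`, its `make_model` argument, and the synthetic `X_demo`/`y_demo` data are assumptions made for illustration; they are not part of the notebook. RMSE is computed as the square root of `mean_squared_error` so the sketch does not depend on the `squared` argument.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold


def oof_rmse(make_model, X, y, n_splits=5, seed=42):
    """Fit a fresh model per fold and return (OOF predictions, OOF RMSE)."""
    X, y = np.asarray(X), np.asarray(y)
    oof = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Stratifying on y works here because the target takes few integer values.
    for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx] = model.predict(X[valid_idx])
        fold_rmse = np.sqrt(mean_squared_error(y[valid_idx], oof[valid_idx]))
        print(f"Fold {fold} RMSE: {fold_rmse:.4f}")
    return oof, np.sqrt(mean_squared_error(y, oof))


# Synthetic stand-in data (not the competition data): integer target in 0..5.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = np.clip(np.round(2 + X_demo[:, 0] + rng.normal(size=500)), 0, 5).astype(int)

oof_pred, score = oof_rmse(LinearRegression, X_demo, y_demo)
print(f"OOF RMSE: {score:.4f}")
```

Passing a factory such as `LinearRegression` rather than a fitted instance keeps each fold independent, because a new model is built for every split.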

[back to top](#table-of-contents)
<a id="6.1.1"></a>
### 6.1.1 Linear Regression

In [None]:
train_oof = np.zeros((250000,))
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(train_df[features], train_df['loss'])):
    X_train, X_valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    y_train = X_train['loss']
    y_valid = X_valid['loss']
    X_train = X_train.drop('loss', axis=1)
    X_valid = X_valid.drop('loss', axis=1)
    
    model = LinearRegression()

    model =  model.fit(X_train, y_train)
    temp_oof = model.predict(X_valid)
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} RMSE: ', mean_squared_error(y_valid, temp_oof, squared=False))
    
print('OOF RMSE: ', mean_squared_error(train_df['loss'], train_oof, squared=False))

[back to top](#table-of-contents)
<a id="6.1.2"></a>
### 6.1.2 XGBoost Regressor

In [None]:
train_oof = np.zeros((250000,))
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(train_df[features], train_df['loss'])):
    X_train, X_valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    y_train = X_train['loss']
    y_valid = X_valid['loss']
    X_train = X_train.drop('loss', axis=1)
    X_valid = X_valid.drop('loss', axis=1)
    
    model = XGBRegressor(random_state=42, tree_method='gpu_hist')

    model =  model.fit(X_train, y_train, verbose=False)
    temp_oof = model.predict(X_valid)
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} RMSE: ', mean_squared_error(y_valid, temp_oof, squared=False))
    
print('OOF RMSE: ', mean_squared_error(train_df['loss'], train_oof, squared=False))

[back to top](#table-of-contents)
<a id="6.1.3"></a>
### 6.1.3 LGBM Regressor

In [None]:
train_oof = np.zeros((250000,))
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(train_df[features], train_df['loss'])):
    X_train, X_valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    y_train = X_train['loss']
    y_valid = X_valid['loss']
    X_train = X_train.drop('loss', axis=1)
    X_valid = X_valid.drop('loss', axis=1)
    
    model = LGBMRegressor(random_state=42)

    model =  model.fit(X_train, y_train, verbose=False)
    temp_oof = model.predict(X_valid)
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} RMSE: ', mean_squared_error(y_valid, temp_oof, squared=False))
    
print('OOF RMSE: ', mean_squared_error(train_df['loss'], train_oof, squared=False))

[back to top](#table-of-contents)
<a id="6.1.4"></a>
### 6.1.4 Catboost Regressor

In [None]:
train_oof = np.zeros((250000,))
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(train_df[features], train_df['loss'])):
    X_train, X_valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    y_train = X_train['loss']
    y_valid = X_valid['loss']
    X_train = X_train.drop('loss', axis=1)
    X_valid = X_valid.drop('loss', axis=1)
    
    model = CatBoostRegressor(random_state=42)

    model =  model.fit(X_train, y_train, verbose=False)
    temp_oof = model.predict(X_valid)
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} RMSE: ', mean_squared_error(y_valid, temp_oof, squared=False))
    
print('OOF RMSE: ', mean_squared_error(train_df['loss'], train_oof, squared=False))

[back to top](#table-of-contents)
<a id="6.2"></a>
## 6.2 Classification
Treating the target as a set of classes is counterintuitive for a regression problem, but it is tried here for comparison: each distinct `loss` value is treated as a class. The models used are `XGBoost Classifier`, `LGBM Classifier` and `Catboost Classifier`.

**Observations:**
- `Catboost Classifier` performs best, with an OOF RMSE of `10.24`.
- `XGBoost Classifier` is in second place, with an OOF RMSE of `10.25`.
- `LGBM Classifier` is in last place, with an OOF RMSE of `12.62`.

[back to top](#table-of-contents)
<a id="6.2.1"></a>
### 6.2.1 XGBoost Classifier

In [None]:
train_oof = np.zeros((250000,))
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(train_df[features], train_df['loss'])):
    X_train, X_valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    y_train = X_train['loss']
    y_valid = X_valid['loss']
    X_train = X_train.drop('loss', axis=1)
    X_valid = X_valid.drop('loss', axis=1)

    model = XGBClassifier(random_state=42, tree_method='gpu_hist', 
                          verbosity=0, use_label_encoder=False)

    model =  model.fit(X_train, y_train, verbose=0)
    temp_oof = pd.DataFrame(model.predict(X_valid))[0]
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} RMSE: ', mean_squared_error(y_valid, temp_oof, squared=False))
    
print('OOF RMSE: ', mean_squared_error(train_df['loss'], train_oof, squared=False))

[back to top](#table-of-contents)
<a id="6.2.2"></a>
### 6.2.2 LGBM Classifier

In [None]:
train_oof = np.zeros((250000,))
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(train_df[features], train_df['loss'])):
    X_train, X_valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    y_train = X_train['loss']
    y_valid = X_valid['loss']
    X_train = X_train.drop('loss', axis=1)
    X_valid = X_valid.drop('loss', axis=1)

    model = LGBMClassifier(
        random_state=42
    )

    model =  model.fit(X_train, y_train, verbose=0)
    temp_oof = model.predict(X_valid)
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} RMSE: ', mean_squared_error(y_valid, temp_oof, squared=False))
    
print('OOF RMSE: ', mean_squared_error(train_df['loss'], train_oof, squared=False))

[back to top](#table-of-contents)
<a id="6.2.3"></a>
### 6.2.3 Catboost Classifier

In [None]:
train_oof = np.zeros((250000,))
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(train_df[features], train_df['loss'])):
    X_train, X_valid = train_df.iloc[train_idx], train_df.iloc[valid_idx]
    y_train = X_train['loss']
    y_valid = X_valid['loss']
    X_train = X_train.drop('loss', axis=1)
    X_valid = X_valid.drop('loss', axis=1)

    model = CatBoostClassifier(
        random_state=42,
        task_type="GPU"
    )

    model =  model.fit(X_train, y_train, verbose=0)
    temp_oof = pd.DataFrame(model.predict(X_valid))[0]
    train_oof[valid_idx] = temp_oof
    print(f'Fold {fold} RMSE: ', mean_squared_error(y_valid, temp_oof, squared=False))
    
print('OOF RMSE: ', mean_squared_error(train_df['loss'], train_oof, squared=False))