# <p style="padding: 15px; background-color: #010D36; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #F2F2F0; letter-spacing: 2px; text-align: center; border-radius: 8px;">ICR - Identifying Age-Related Conditions</p>

In [1]:
import os
import shutil
import subprocess
from collections import defaultdict
from copy import copy
from functools import partial
from itertools import product
from pathlib import Path

# Third-party libraries.
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
import scipy.stats as stats

from colorama import Fore
from colorama import Style
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
from scipy.stats import gaussian_kde
from scipy.stats import probplot
from IPython.core.display import HTML
from plotly.subplots import make_subplots

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.compose import make_column_selector
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import mutual_info_classif
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.metrics import brier_score_loss
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import StandardScaler

ON_KAGGLE = os.getenv("KAGGLE_KERNEL_RUN_TYPE") is not None

# Colorama settings.
CLR = (Style.BRIGHT + Fore.BLACK) if ON_KAGGLE else (Style.BRIGHT + Fore.WHITE)
RED = Style.BRIGHT + Fore.RED
BLUE = Style.BRIGHT + Fore.BLUE
CYAN = Style.BRIGHT + Fore.CYAN
RESET = Style.RESET_ALL

FONT_COLOR = "#010D36"
BACKGROUND_COLOR = "#F6F5F5"

CELL_HOVER = {  # for row hover use <tr> instead of <td>
    "selector": "td:hover",
    "props": "background-color: #F6F5F5",
}
TEXT_HIGHLIGHT = {
    "selector": "td",
    "props": "color: #FF2079; font-weight: bold",
}
INDEX_NAMES = {
    "selector": ".index_name",
    "props": "font-style: italic; background-color: #010D36; color: #F2F2F0;",
}
HEADERS = {
    "selector": "th:not(.index_name)",
    "props": "font-style: italic; background-color: #010D36; color: #F2F2F0;",
}
DF_STYLE = (INDEX_NAMES, HEADERS, TEXT_HIGHLIGHT)

# Utility functions.
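def download_dataset_from_kaggle(user, dataset, directory):
    """Download a Kaggle dataset archive and unpack it into `directory`."""
    command = "kaggle datasets download -d "
    directory = Path(directory)
    filepath = directory / (dataset + ".zip")

    if not filepath.is_file():
        # Fail loudly if the Kaggle CLI call does not succeed.
        subprocess.run((command + user + "/" + dataset).split(), check=True)
        directory.mkdir(parents=True, exist_ok=True)
        # Unpack the archive into the requested directory and keep the zip there.
        shutil.unpack_archive(dataset + ".zip", directory)
        shutil.move(dataset + ".zip", directory)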


def download_competition_from_kaggle(competition):
    command = "kaggle competitions download -c "
    filepath = Path("data/" + competition + ".zip")

    if not filepath.is_file():
        subprocess.run((command + competition).split())
        Path("data").mkdir(parents=True, exist_ok=True)
        shutil.unpack_archive(competition + ".zip", "data")
        shutil.move(competition + ".zip", "data")


# Html `code` block highlight.
HTML(
    """
<style>
code {
    background: rgba(58, 90, 129, 0.5) !important;
    border-radius: 4px !important;
    color: #f2f2f0 !important;
}
</style>
"""
)




<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
">
    <b>Competition Description</b> 📜
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 20px;
    margin-right: 20px;
    margin-bottom: 20px;
">
    <i>The goal of this competition is to predict if a person has any of three medical conditions. You are being asked to predict if the person has one or more of any of the three medical conditions (Class $1$), or none of the three medical conditions (Class $0$). You will create a model trained on measurements of health characteristics.<br><br>
    To determine if someone has these medical conditions requires a long and intrusive process to collect information from patients. With predictive models, we can shorten this process and keep patient details private by collecting key characteristics relative to the conditions, then encoding these characteristics.<br><br>
    Your work will help researchers discover the relationship between measurements of certain characteristics and potential patient conditions.</i>
</p>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color:#f2f2f0;
">
    <b>Context and Task</b> 🕵
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 20px;
    margin-right: 20px;
    margin-bottom: 20px;
">
    <i>They say age is just a number but a whole host of health issues come with aging. From heart disease and dementia to hearing loss and arthritis, aging is a risk factor for numerous diseases and complications. The growing field of bioinformatics includes research into interventions that can help slow and reverse biological aging and prevent major age-related ailments. Data science could have a role to play in developing new methods to solve problems with diverse data, even if the number of samples is small.<br><br>
    Currently, models like XGBoost and random forest are used to predict medical conditions yet the models' performance is not good enough. Dealing with critical problems where lives are on the line, models need to make correct predictions reliably and consistently between different cases.<br><br>
    Founded in 2015, competition host InVitro Cell Research, LLC (ICR) is a privately funded company focused on regenerative and preventive personalized medicine. Their offices and labs in the greater New York City area offer state-of-the-art research space. InVitro Cell Research's Scientists are what set them apart, helping guide and defining their mission of researching how to repair aging people fast.<br><br>
    <b>In this competition, you’ll work with measurements of health characteristic data to solve critical problems in bioinformatics. Based on minimal training, you’ll create a model to predict if a person has any of three medical conditions, with an aim to improve on existing methods.</b><br><br>
    You could help advance the growing field of bioinformatics and explore new methods to solve complex problems with diverse data.</i>
</p>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
">
    <b>This Notebook Covers</b> 📔
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-bottom: 20px;
"> 
    <li>A quick look at the dataset.</li>
    <li>Basic relations in numerical features.</li>
    <li>Pair plots and kernel density estimation.</li>
    <li>Probability plots and example transformations.</li>
    <li>Semi-constant features.</li>
    <li>A look at the categorical variable.</li>
    <li>Dimensionality reduction with t-SNE.</li>
    <li>Feature importance problem and permutation tests.</li>
    <li>A look at the greeks metadata.</li>
    <li>Possible preprocessing pipeline.</li>
    <li>Balanced learning with a LightGBM &amp; XGBoost ensemble.</li>
</ul>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
">
    <b>See More Here</b> 📈
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 20px;
    margin-right: 20px;
    margin-bottom: 20px;
">
    <a href="https://www.kaggle.com/competitions/icr-identify-age-related-conditions/overview" style="color: #01CBEE;"><b>ICR - Identifying Age-Related Conditions</b></a>
</p>
</blockquote>

# <p style="padding: 15px; background-color: #010D36; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Quick Overview</p>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Notes</b> 📜
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>Let's get started with a short dataset overview.</li>
</ul>
</blockquote>

In [2]:
competition = "icr-identify-age-related-conditions"

if not ON_KAGGLE:
    download_competition_from_kaggle(competition)
    train_path = "data/train.csv"
    test_path = "data/test.csv"
    greeks_path = "data/greeks.csv"
else:
    train_path = f"/kaggle/input/{competition}/train.csv"
    test_path = f"/kaggle/input/{competition}/test.csv"
    greeks_path = f"/kaggle/input/{competition}/greeks.csv"

train = pd.read_csv(train_path, index_col="Id").rename(columns=str.strip)
test = pd.read_csv(test_path, index_col="Id").rename(columns=str.strip)
greeks = pd.read_csv(greeks_path, index_col="Id").rename(columns=str.strip)


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>General Remarks</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 20px;
    margin-right: 20px;
    margin-bottom: 20px;
">
    <b>In the original description, we read that:</b><br><br>
    <i>The competition data comprises over fifty anonymized health characteristics linked to three age-related conditions. Your goal is to predict whether a subject has or has not been diagnosed with one of these conditions - a binary classification problem.<br><br>
    Note that this is a Code Competition, in which the actual test set is hidden. In this version, we give some sample data in the correct format to help you author your solutions. When your submission is scored, this example test data will be replaced with the full test set. There are about $400$ rows in the full test set.</i>
</p>

<p style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 20px;
    margin-right: 20px;
    margin-bottom: 20px;
">
    <b>Moreover, we know that:</b>
</p>

<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li><b>train.csv</b> - <i>The training set.</i></li>
    <ul style="
        font-size: 16px;
        font-family: 'JetBrains Mono';
        color: #f2f2f0;
        margin-right: 8px;
    ">
        <li><code>Id</code> - <i>Unique identifier for each observation.</i></li>
        <li><code>AB-GL</code> - <i>Fifty-six anonymized health characteristics. All are numeric except for EJ, which is categorical.</i></li>
        <li><code>Class</code> - <i>A binary target: $1$ indicates the subject has been diagnosed with one of the three conditions, $0$ indicates they have not.</i></li>
    </ul>
    <li><b>test.csv</b> - <i>The test set. Your goal is to predict the probability that a subject in this set belongs to each of the two classes.</i></li>
    <li><b>greeks.csv</b> - <i>Supplemental metadata, only available for the training set.</i></li>
    <ul style="
        font-size: 16px;
        font-family: 'JetBrains Mono';
        color: #f2f2f0;
        margin-right: 8px;
    ">
        <li><code>Alpha</code> - <i>Identifies the type of age-related condition, if present.</i></li>
        <ul style="
            font-size: 16px;
            font-family: 'JetBrains Mono';
            color: #f2f2f0;
            margin-right: 8px;
        ">
            <li><code>A</code> - <i>No age-related condition. Corresponds to class $0$.</i></li>
            <li><code>B</code>, <code>D</code>, <code>G</code> - <i>The three age-related conditions. Correspond to class $1$.</i></li>
        </ul>
        <li><code>Beta</code>, <code>Gamma</code>, <code>Delta</code> - <i>Three experimental characteristics.</i></li>
        <li><code>Epsilon</code> - <i>The date the data for this subject was collected. Note that all of the data in the test set was collected after the training set was collected.</i></li>
    </ul>
</ul>
</blockquote>
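
The link between `Alpha` in greeks.csv and `Class` in train.csv described above can be checked directly. Below is a minimal sketch on a small hand-made frame; the rows and the names `toy_greeks`, `toy_train`, and `alpha_to_class` are invented for illustration and are not taken from the competition files.

```python
import pandas as pd

# Hypothetical toy rows mimicking greeks.csv / train.csv (not real data).
toy_greeks = pd.DataFrame(
    {"Alpha": ["A", "B", "A", "D", "G"]},
    index=pd.Index(["id1", "id2", "id3", "id4", "id5"], name="Id"),
)
toy_train = pd.DataFrame({"Class": [0, 1, 0, 1, 1]}, index=toy_greeks.index)

# "A" means no condition (class 0); "B", "D", "G" are the three conditions (class 1).
alpha_to_class = {"A": 0, "B": 1, "D": 1, "G": 1}
derived_class = toy_greeks["Alpha"].map(alpha_to_class)

# True when the metadata and the target agree for every subject.
labels_agree = bool((derived_class == toy_train["Class"]).all())
print(labels_agree)
```

On the real data, the same mapping applied to `greeks["Alpha"]` and compared with `train["Class"]` would be a quick sanity check that both files are aligned on `Id`.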

In [3]:
train.head().style.set_table_styles(DF_STYLE).format(precision=3)


Id,AB,AF,AH,AM,AR,AX,AY,AZ,BC,BD,BN,BP,BQ,BR,BZ,CB,CC,CD,CF,CH,CL,CR,CS,CU,CW,DA,DE,DF,DH,DI,DL,DN,DU,DV,DY,EB,EE,EG,EH,EJ,EL,EP,EU,FC,FD,FE,FI,FL,FR,FS,GB,GE,GF,GH,GI,GL,Class
000ff2bfdfe9,0.209,3109.033,85.2,22.394,8.139,0.7,0.026,9.812,5.556,4126.587,22.598,175.639,152.708,823.928,257.432,47.223,0.563,23.388,4.852,0.023,1.05,0.069,13.784,1.302,36.206,69.083,295.571,0.239,0.284,89.246,84.317,29.657,5.311,1.743,23.188,7.294,1.987,1433.167,0.949,B,30.879,78.527,3.828,13.395,10.265,9028.292,3.583,7.298,1.739,0.095,11.339,72.611,2003.81,22.136,69.835,0.12,1
007255e47698,0.145,978.764,85.2,36.969,8.139,3.632,0.026,13.518,1.23,5496.928,19.421,155.868,14.755,51.217,257.432,30.284,0.485,50.628,6.085,0.031,1.114,1.118,28.311,1.357,37.477,70.798,178.553,0.239,0.363,110.582,75.745,37.532,0.006,1.743,17.222,4.926,0.859,1111.287,0.003,A,109.125,95.415,52.26,17.176,0.297,6785.003,10.359,0.173,0.497,0.569,9.293,72.611,27981.563,29.135,32.132,21.978,0
013f2bd269f5,0.47,2635.107,85.2,32.361,8.139,6.733,0.026,12.825,1.23,5135.78,26.483,128.989,219.32,482.142,257.432,32.564,0.496,85.955,5.376,0.036,1.05,0.7,39.365,1.01,21.46,70.82,321.427,0.239,0.21,120.056,65.47,28.053,1.29,1.743,36.861,7.814,8.147,1494.076,0.377,B,109.125,78.527,5.391,224.207,8.745,8338.906,11.627,7.71,0.976,1.199,37.078,88.609,13676.958,28.023,35.193,0.197,0
043ac50845d5,0.252,3819.652,120.202,77.112,8.139,3.685,0.026,11.054,1.23,4169.677,23.658,237.282,11.05,661.519,257.432,15.202,0.718,88.159,2.348,0.029,1.4,0.636,41.117,0.723,21.53,47.276,196.608,0.239,0.292,139.825,71.571,24.355,2.655,1.743,52.004,7.386,3.813,15691.552,0.614,B,31.674,78.527,31.323,59.302,7.884,10965.766,14.852,6.122,0.497,0.284,18.53,82.417,2094.262,39.949,90.493,0.156,0
044fb8a146ec,0.38,3733.048,85.2,14.104,8.139,3.942,0.055,3.397,102.152,5728.734,24.011,324.546,149.717,6074.859,257.432,82.213,0.536,72.644,30.538,0.025,1.05,0.693,31.725,0.828,34.415,74.065,200.178,0.239,0.208,97.92,52.839,26.02,1.145,1.743,9.065,7.351,3.491,1403.656,0.164,B,109.125,91.995,51.141,29.103,4.275,16198.05,13.667,8.153,48.501,0.122,16.409,146.11,8524.371,45.381,36.263,0.097,1


In [4]:
train.info(verbose=False)


<class 'pandas.core.frame.DataFrame'>
Index: 617 entries, 000ff2bfdfe9 to ffcca4ded3bb
Columns: 57 entries, AB to Class
dtypes: float64(55), int64(1), object(1)
memory usage: 279.6+ KB


In [5]:
greeks.head().style.set_table_styles(DF_STYLE)


Id,Alpha,Beta,Gamma,Delta,Epsilon
000ff2bfdfe9,B,C,G,D,3/19/2019
007255e47698,A,C,M,B,Unknown
013f2bd269f5,A,C,M,B,Unknown
043ac50845d5,A,C,M,B,Unknown
044fb8a146ec,D,B,F,B,3/25/2020


In [6]:
greeks.info(verbose=False)


<class 'pandas.core.frame.DataFrame'>
Index: 617 entries, 000ff2bfdfe9 to ffcca4ded3bb
Columns: 5 entries, Alpha to Epsilon
dtypes: object(5)
memory usage: 28.9+ KB


In [7]:
missing_values_cols = train.isna().sum()[train.isna().sum() > 0].index.to_list()

print(CLR + "Training Dataset Missing Values\n")

for feature in missing_values_cols:
    print(
        (CLR + feature) + "\t",
        (RED + str(train[feature].isna().sum())) + "\t",
        (RED + f"{train[feature].isna().sum() / len(train):.1%}" + RESET) + "\t",
        (RED + f"{train[feature].dtype}"),
    )


Training Dataset Missing Values

BQ	60	9.7%	float64
CB	2	0.3%	float64
CC	3	0.5%	float64
DU	1	0.2%	float64
EL	60	9.7%	float64
FC	1	0.2%	float64
FL	1	0.2%	float64
FS	2	0.3%	float64
GL	1	0.2%	float64


In [8]:
print(
    CLR + "Training Dataset Duplicated Rows:",
    RED + f"{train.drop('Class', axis=1).duplicated().sum()}",
)


Training Dataset Duplicated Rows: 0


In [9]:
fig = px.pie(
    train.assign(ClassMap=train.Class.map({0: "Class 0", 1: "Class 1"})),
    names="ClassMap",
    height=540,
    width=840,
    hole=0.65,
    title="Target Overview - Class",
    color_discrete_sequence=["#010D36", "#FF2079"],
)
fig.update_layout(
    font_color=FONT_COLOR,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    showlegend=False,
)
fig.add_annotation(
    x=0.5,
    y=0.5,
    align="center",
    xref="paper",
    yref="paper",
    showarrow=False,
    font_size=22,
    text="Class<br>Imbalance",
)
fig.update_traces(
    hovertemplate=None,
    textposition="outside",
    texttemplate="%{label}<br>%{value} - %{percent}",
    textfont_size=16,
    rotation=-20,
    marker_line_width=25,
    marker_line_color=BACKGROUND_COLOR,
)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>The training dataset is small, containing $617$ samples. Nevertheless, it has $57$ columns: $56$ health characteristics (attributes) plus the binary target.</li>
    <li>These features are anonymized; all we know is that they are specific medical characteristics.</li>
    <li>We also have additional data in greeks.csv, which we will look at later, especially the <code>Epsilon</code> attribute.</li>
    <li>Nine numeric features contain missing values. Most of them miss only $1$ to $3$ values, but two features, <code>BQ</code> and <code>EL</code>, have $60$ missing values each (about $9.7$% of the rows).</li>
    <li>Lastly, the target is clearly imbalanced: $83$% (no age-related conditions) to $17$% (at least one age-related condition).</li>
</ul>
</blockquote>
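
One common way to account for the imbalance noted above is to weight samples inversely to class frequency. The sketch below uses a synthetic target with roughly the same proportions; `toy_target` and the other names are assumptions made for illustration. The weight formula, n_samples / (n_classes * class_count), is the one scikit-learn documents for `class_weight="balanced"`.

```python
import pandas as pd

# Hypothetical toy target with roughly the imbalance noted above (not real data).
toy_target = pd.Series([0] * 83 + [1] * 17, name="Class")

# Share of each class.
class_share = toy_target.value_counts(normalize=True).sort_index()

# "Balanced" weights: n_samples / (n_classes * class_count).
class_counts = toy_target.value_counts().sort_index()
class_weight = len(toy_target) / (len(class_counts) * class_counts)

# Per-sample weights; each class then contributes the same total weight.
sample_weight = toy_target.map(class_weight)

print(class_share.to_dict())
print(class_weight.round(3).to_dict())
```

Such per-sample weights can typically be passed to a classifier's `fit()` through its `sample_weight` argument; whether that helps here is something to verify with cross-validation.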

# <p style="padding: 15px; background-color: #010D36; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Basic Relations in Numerical Features</p>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Notes</b> 📜
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>Let's focus on an elementary description of numerical features. Firstly, let's see the numerical summary. Then, we will get to the correlation matrix and finally create hierarchical clustering based on Pearson correlations.</li>
</ul>
</blockquote>

In [10]:
numeric_descr = (
    train.drop("Class", axis=1)
    .describe(percentiles=[0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99])
    .drop("count")
    .T.rename(columns=str.title)
)

numeric_descr.style.set_table_styles(DF_STYLE).format(precision=3)


Feature,Mean,Std,Min,1%,5%,25%,50%,75%,95%,99%,Max
AB,0.477,0.468,0.081,0.12,0.153,0.252,0.355,0.56,1.079,2.165,6.162
AF,3502.013,2300.323,192.593,192.593,1018.985,2197.345,3120.319,4361.637,6957.807,10377.994,28688.188
AH,118.625,127.839,85.2,85.2,85.2,85.2,85.2,113.74,209.993,541.429,1910.123
AM,38.969,69.728,3.178,5.186,7.153,12.27,20.533,39.14,111.939,410.512,630.518
AR,10.128,10.519,8.139,8.139,8.139,8.139,8.139,8.139,17.12,34.467,178.944
AX,5.546,2.552,0.7,1.035,2.87,4.128,5.032,6.432,9.247,13.169,38.271
AY,0.06,0.417,0.026,0.026,0.026,0.026,0.026,0.037,0.124,0.214,10.316
AZ,10.566,4.351,3.397,3.397,3.397,8.13,10.461,12.97,16.862,22.914,38.972
BC,8.053,65.167,1.23,1.23,1.23,1.23,1.23,5.081,11.997,50.66,1463.693
BD,5350.389,3021.327,1693.624,2221.15,3041.643,4155.703,4997.961,6035.886,7955.458,10131.207,53060.599


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>At first glance, it's hard to focus on specific values here. However, comparing the Q1-Q3 range with the upper percentiles and the maximum shows that many of these distributions have long right tails, which will probably call for a transformation such as a logarithm.</li>
</ul>
</blockquote>
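
To illustrate why a logarithmic transformation helps with long right tails, here is a minimal sketch on a synthetic log-normal feature. The data and the names `toy_feature` and `toy_feature_log` are invented; the competition features may behave differently.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed feature (log-normal); not taken from the competition data.
rng = np.random.default_rng(0)
toy_feature = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))

# log1p keeps zeros finite and compresses the long right tail.
toy_feature_log = np.log1p(toy_feature)

# Sample skewness before and after the transformation.
skew_before = toy_feature.skew()
skew_after = toy_feature_log.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to zero after the transformation indicates a more symmetric distribution, which is what many linear and distance-based models prefer.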

In [11]:
color_map = [[0.0, "#01CBEE"], [0.5, "#010D36"], [1.0, "#FF2079"]]

pearson_corr = (
    train.drop("Class", axis=1).corr(numeric_only=True, method="pearson").round(2)
)
mask = np.triu(np.ones_like(pearson_corr, dtype=bool))
lower_triangular_corr = (
    pearson_corr.mask(mask)
    .dropna(axis="index", how="all")
    .dropna(axis="columns", how="all")
)

heatmap = go.Heatmap(
    z=lower_triangular_corr,
    x=lower_triangular_corr.columns,
    y=lower_triangular_corr.index,
    text=lower_triangular_corr.fillna(""),
    texttemplate="%{text}",
    xgap=1,
    ygap=1,
    showscale=True,
    colorscale=color_map,
    colorbar_len=1.02,
    hoverinfo="none",
)
fig = go.Figure(heatmap)
fig.update_layout(
    font_color=FONT_COLOR,
    title="Correlation Matrix (Pearson) - Lower Triangular",
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    width=840,
    height=840,
    xaxis_showgrid=False,
    yaxis_showgrid=False,
    yaxis_autorange="reversed",
)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>Here we have several highly correlated features, like <code>BZ</code> vs <code>BC</code> ($0.91$) or <code>DV</code> vs <code>CL</code> ($0.95$). Such strong linear correlations suggest that some features are redundant and could be dropped. Remember that you can zoom in on this matrix to explore specific relations; otherwise, the values are too small to read.</li>
</ul>
</blockquote>
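
If we did decide to drop redundant features, one simple approach is to scan the upper triangle of the absolute correlation matrix and remove one feature from each highly correlated pair. This is only a sketch on synthetic data; the column names and the `threshold` value are assumptions, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is almost a copy of "a", "c" is independent (hypothetical names).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame(
    {
        "a": a,
        "b": a + rng.normal(scale=0.1, size=500),
        "c": rng.normal(size=500),
    }
)

threshold = 0.9  # Assumed cut-off; tune it for the problem at hand.
abs_corr = toy.corr(method="pearson").abs()

# Keep only the strict upper triangle so each pair is inspected once.
upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype=bool), k=1))

# Drop the second feature of every pair whose correlation exceeds the threshold.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop)
```

Which member of a pair to keep is arbitrary in this sketch; in practice, it is worth checking which one relates more strongly to the target before dropping anything.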

In [12]:
dissimilarity = 1 - np.abs(pearson_corr)

fig = ff.create_dendrogram(
    dissimilarity,
    labels=pearson_corr.columns,
    orientation="left",
    colorscale=px.colors.sequential.YlGnBu_r,
    # squareform() converts the square dissimilarity matrix to condensed (1D) form.
    # The lambda ignores `x` and always clusters the precomputed dissimilarity.
    linkagefun=lambda x: linkage(squareform(dissimilarity), method="complete"),
)
fig.update_layout(
    font_color=FONT_COLOR,
    title="Hierarchical Clustering using Correlation Matrix (Pearson)",
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    height=1340,
    width=840,
    yaxis=dict(
        showline=False,
        title="Feature",
        ticks="",
    ),
    xaxis=dict(
        showline=False,
        title="Distance",
        ticks="",
        range=[-0.05, 1.05],
    ),
)
fig.update_traces(line_width=1.5)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>Okay, here we need to make something clear. Having the correlation matrix, we used it for hierarchical clustering, which is an alternative to the K-Means algorithm. Its dendrogram lets us see how the grouping changes with the number of clusters, instead of fixing that number in advance.</li>
    <li>However, relying on a correlation matrix to perform hierarchical clustering requires an additional step. Clustering methods operate on dissimilarities between variables, while correlation measures similarity. We can define dissimilarity as $\text{dissimilarity} = 1 - |\text{correlation}|$. And basically, that's all. We passed the dissimilarity to the <code>linkage()</code> function from the <code>scipy</code> module and got the clustering results.</li>
    <li>Moreover, we should remember that we rely on the <b>Pearson</b> correlation. It measures linear dependency, and it's computed on actual values. However, we could have used, for example, the <b>Spearman</b> correlation, which is based on ranks and measures monotonic relations.</li>
    <li>Additionally, we chose the <code>complete</code> method in the <code>linkage()</code> function; a different method may give different results.</li>
    <li>As you can see, the smallest distances are between <code>BZ</code> - <code>BC</code>, <code>DV</code> - <code>CL</code>, and <code>EH</code> - <code>FD</code>.</li>
</ul>
</blockquote>
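
The steps described above (correlation, dissimilarity, condensed form, linkage) can be sketched without the plotting layer. The example below uses synthetic data with two obvious groups and additionally cuts the tree into flat clusters with `fcluster()`. The column names and the cut distance of `0.5` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Synthetic data with two correlated groups: (x1, x2) and (y1, y2). Hypothetical names.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = rng.normal(size=500)
toy = pd.DataFrame(
    {
        "x1": x,
        "x2": x + rng.normal(scale=0.2, size=500),
        "y1": y,
        "y2": -y + rng.normal(scale=0.2, size=500),  # negatively correlated with y1
    }
)

corr = toy.corr(method="pearson")  # method="spearman" would use ranks instead
dissimilarity = 1 - corr.abs()

# linkage() expects the condensed (1D) form of the distance matrix.
condensed = squareform(dissimilarity.to_numpy(), checks=False)
tree = linkage(condensed, method="complete")

# Cut the tree at distance 0.5 to obtain flat cluster labels.
labels = fcluster(tree, t=0.5, criterion="distance")
clusters = dict(zip(toy.columns, labels))
print(clusters)
```

Because the dissimilarity uses the absolute correlation, `y1` and `y2` end up in the same cluster even though they are negatively correlated.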

# <p style="padding: 15px; background-color: #010D36; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Pair Plots &amp; Kernel Density Estimation</p>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Notes</b> 📜
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>In this section, we will focus on exploring distributions in a general manner. Firstly, we will depict some pair plots of strongly correlated features, and then we will see the probability density of these variables by target value.</li>
    <li>Let's define two small utility functions. The former is responsible for the KDE calculations, and the latter provides an appropriate axes arrangement.</li>
</ul>
</blockquote>

In [13]:
def get_kde_estimation(data_series):
    kde = gaussian_kde(data_series.dropna())
    kde_range = np.linspace(
        data_series.min() - data_series.max() * 0.1,
        data_series.max() + data_series.max() * 0.1,
        len(data_series),
    )
    estimated_values = kde.evaluate(kde_range)
    estimated_values_cum = np.cumsum(estimated_values)
    estimated_values_cum /= estimated_values_cum.max()
    return kde_range, estimated_values, estimated_values_cum


def get_n_rows_axes(n_features, n_cols=5, n_rows=None):
    # Derive the number of rows from the feature count unless given explicitly.
    if n_rows is None:
        n_rows = int(np.ceil(n_features / n_cols))
    current_col = range(1, n_cols + 1)
    current_row = range(1, n_rows + 1)
    return n_rows, list(product(current_row, current_col))


In [14]:
corr_threshold = 0.7

highest_abs_corr = (
    lower_triangular_corr.abs()
    .unstack()
    .sort_values(ascending=False)  # type: ignore
    .rename("Absolute Pearson Correlation")
)

highest_abs_corr = (
    highest_abs_corr[highest_abs_corr > corr_threshold]
    .to_frame()
    .reset_index(names=["Feature 1", "Feature 2"])
)

highest_corr_combinations = highest_abs_corr[["Feature 1", "Feature 2"]].to_numpy()
highest_abs_corr.style.set_table_styles(DF_STYLE).format(precision=2)


,Feature 1,Feature 2,Absolute Pearson Correlation
0,EH,FD,0.97
1,CL,DV,0.95
2,BC,BZ,0.91
3,DU,EH,0.85
4,AR,DV,0.82
5,DU,FD,0.81
6,CS,EP,0.79
7,BC,BD,0.75
8,AR,CL,0.75
9,AR,EP,0.75


In [15]:
n_cols = 3
n_rows, axes = get_n_rows_axes(len(highest_corr_combinations), n_cols=n_cols)

fig = make_subplots(
    rows=n_rows,
    cols=n_cols,
    horizontal_spacing=0.1,
    vertical_spacing=0.06,
)

show_legend = True

for k, ((current_row, current_col), (feature1, feature2)) in enumerate(
    zip(axes, highest_corr_combinations)
):
    if k > 0:
        show_legend = False

    fig.add_scatter(
        x=train.query("Class == 0")[feature1],
        y=train.query("Class == 0")[feature2],
        mode="markers",
        name="Class 0",
        marker=dict(color="#010D36", size=3, symbol="diamond", opacity=0.5),
        legendgroup="Class 0",
        showlegend=show_legend,
        row=current_row,
        col=current_col,
    )
    fig.add_scatter(
        x=train.query("Class == 1")[feature1],
        y=train.query("Class == 1")[feature2],
        mode="markers",
        name="Class 1",
        marker=dict(color="#FF2079", size=2, symbol="circle", opacity=0.5),
        legendgroup="Class 1",
        showlegend=show_legend,
        row=current_row,
        col=current_col,
    )
    fig.update_xaxes(
        type="log",
        title_text=feature1,
        title_font_size=9,
        title_font_family="Arial Black",
        tickfont_size=7,
        row=current_row,
        col=current_col,
    )
    fig.update_yaxes(
        type="log",
        title_text=feature2,
        title_font_size=9,
        title_font_family="Arial Black",
        tickfont_size=7,
        row=current_row,
        col=current_col,
    )

fig.update_annotations(font_size=14)
fig.update_layout(
    font_color=FONT_COLOR,
    title="Highest Pearson Correlations - Pair Plots<br>Double Logarithmic Scale",
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    width=840,
    height=1140,
    legend=dict(
        orientation="h",
        yanchor="bottom",
        xanchor="right",
        y=1.01,
        x=1,
        itemsizing="constant",
    ),
)

fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>This dataset has too many features to show every pair plot, so I chose only the most correlated pairs.</li>
    <li>The highest correlation is between <code>EH</code> and <code>FD</code> ($0.97$), and it is clearly visible in the plot. Moreover, values associated with Class $0$ are shifted towards higher values. You can explore this by toggling a group on and off in the legend. A similar situation occurs for <code>DU</code> - <code>EH</code> and <code>DU</code> - <code>FD</code>. Unfortunately, we don't know what these abbreviations mean.</li>
    <li>We can also see that many different values of one feature correspond to a single value of the other. This may pose a minor problem for machine learning algorithms. The pattern appears in each of the relationships above.</li>
</ul>
</blockquote>
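
The pair selection used for these plots can be sketched on synthetic data. This is a minimal illustration, not the notebook's exact code: the frame `demo`, its column names, and the way the lower-triangular matrix is built are assumptions, since the construction of `lower_triangular_corr` is not shown in this part of the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training frame; the column names are made up.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "f1": base,
        "f2": base + rng.normal(scale=0.1, size=200),  # strongly tied to f1
        "f3": rng.normal(size=200),  # independent noise
    }
)

corr = demo.corr(method="pearson")

# Keep only the strictly lower triangle so that each pair appears once
# and the diagonal (self-correlation) is excluded.
lower_mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
lower_corr = corr.where(lower_mask)

top_pairs = (
    lower_corr.abs()
    .unstack()
    .dropna()
    .sort_values(ascending=False)
    .rename("Absolute Pearson Correlation")
)
top_pairs = top_pairs[top_pairs > 0.7]
print(top_pairs)
```

Masking the upper triangle before unstacking avoids listing every pair twice, which is presumably why the notebook works with a lower-triangular matrix as well.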

In [16]:
numeric_data = train.select_dtypes("number")
numeric_cols = numeric_data.drop("Class", axis=1).columns.tolist()

n_cols = 5
n_rows, axes = get_n_rows_axes(len(numeric_cols))

fig1 = make_subplots(
    rows=n_rows,
    cols=n_cols,
    y_title="Probability Density",
    horizontal_spacing=0.06,
    vertical_spacing=0.04,
)
fig2 = copy(fig1)

show_legend = True

for k, ((current_row, current_col), feature) in enumerate(zip(axes, numeric_cols)):
    if k > 0:
        show_legend = False

    for target, color in zip((0, 1), ("#010D36", "#FF2079")):
        kde_range, estimated_values, estimated_values_cum = get_kde_estimation(
            numeric_data.query(f"Class == {target}")[feature]
        )

        for fig, kde_values in zip(  # type: ignore
            (fig1, fig2), (estimated_values, estimated_values_cum)
        ):
            fig.add_scatter(
                x=kde_range,
                y=kde_values,
                line=dict(dash="solid", color=color, width=1),
                fill="tozeroy",
                name=f"Class {target}",
                legendgroup=f"Class {target}",
                showlegend=show_legend,
                row=current_row,
                col=current_col,
            )
            fig.update_yaxes(
                tickfont_size=7,
                row=current_row,
                col=current_col,
            )
            fig.update_xaxes(
                title_text=feature,
                title_font_size=9,
                title_font_family="Arial Black",
                tickfont_size=7,
                row=current_row,
                col=current_col,
            )

title1 = "Numerical Features - Kernel Density Estimation"
title2 = "Numerical Features - Cumulative Kernel Density Estimation"

for fig, title in zip((fig1, fig2), (title1, title2)):
    fig.update_annotations(font_size=14)
    fig.update_layout(
        font_color=FONT_COLOR,
        title=title,
        title_font_size=18,
        plot_bgcolor=BACKGROUND_COLOR,
        paper_bgcolor=BACKGROUND_COLOR,
        width=840,
        height=1340,
        legend=dict(
            orientation="h",
            yanchor="bottom",
            xanchor="right",
            y=1.01,
            x=1,
        ),
    )

fig1.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>You can activate and deactivate distributions for a certain class by clicking on the legend.</li>
    <li>Here we have a variety of distributions: some variables probably fit a normal distribution relatively well (<code>BN</code>, <code>CU</code>, <code>GH</code>), while others have long or extremely long tails, like <code>AR</code>, <code>AY</code>, <code>BR</code>, <code>BZ</code>, etc. There are even bimodal distributions (<code>CW</code>, <code>EL</code> and <code>GL</code>).</li>
    <li>The cumulative plots below give a better view of the differences between classes.</li>
</ul>
</blockquote>

In [17]:
fig2.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>The cumulative KDE reveals long tails of varying extent. Depending on the variable, the long tail comes from values associated with Class $0$ in some cases and from values linked to Class $1$ in others. There are also cases where the two distributions overlap.</li>
</ul>
</blockquote>
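
For readers who want to reproduce the cumulative curve in isolation, here is a minimal sketch of the idea on a synthetic long-tailed sample. The sample, the grid size, and the variable names are assumptions chosen for illustration. The running sum is rescaled by its maximum, as in the helper used above, so it approximates a cumulative distribution only up to the resolution of the grid.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=300)  # long right tail

kde = gaussian_kde(sample)
grid = np.linspace(sample.min(), sample.max(), 500)
density = kde.evaluate(grid)

# Running sum of the density, rescaled so that the last value is exactly 1.
cumulative = np.cumsum(density)
cumulative /= cumulative.max()

# Grid value at which half of the estimated mass has accumulated.
approx_median = grid[np.searchsorted(cumulative, 0.5)]
print(f"approximate median from the cumulative KDE: {approx_median:.2f}")
```

A curve that reaches values close to 1 early and then flattens indicates that only a few observations sit in the tail.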

# <p style="padding: 15px; background-color: #010D36; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Probability Plots &amp; Transformations</p>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Notes</b> 📜
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>This section explores so-called probability plots, a graphical technique for assessing whether a variable follows a specific distribution, here the normal one. On such a plot, samples that follow a normal distribution lie along a straight diagonal line.</li>
    <li>Some machine learning models assume that a variable follows a normal distribution. Probability plots help decide which transformation of a given variable improves its fit to that distribution.</li>
    <li>Let's start with the original values and see the results.</li>
</ul>
</blockquote>
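
Before applying this to the real features, a small sketch on synthetic data shows what `probplot` returns and how the R-squared value separates a normal sample from a skewed one. Both samples and the dictionary names are assumptions made for the example.

```python
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(0)
samples = {
    "normal": rng.normal(loc=5.0, scale=2.0, size=500),
    "log-normal": rng.lognormal(mean=0.0, sigma=1.0, size=500),
}

r_squared = {}
for label, sample in samples.items():
    # osm: theoretical normal quantiles, osr: ordered sample values,
    # followed by the least-squares line through them and its correlation r.
    (osm, osr), (slope, intercept, r) = probplot(sample, dist="norm")
    r_squared[label] = r * r
    print(f"{label:>10}: R^2 = {r_squared[label]:.3f}")
```

The closer the ordered values follow the fitted line, the closer R-squared gets to 1, which is the criterion used in the rest of this section.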

In [18]:
fig = make_subplots(
    rows=n_rows,
    cols=n_cols,
    y_title="Observed Values",
    x_title="Theoretical Quantiles",
    horizontal_spacing=0.06,
    vertical_spacing=0.04,
)
fig.update_annotations(font_size=14)

for (row, col), feature in zip(axes, numeric_cols):
    (osm, osr), (slope, intercept, R) = probplot(train[feature].dropna(), rvalue=True)
    x_theory = np.array([osm[0], osm[-1]])
    y_theory = intercept + slope * x_theory
    R2 = f"R\u00b2 = {R * R:.2f}"
    fig.add_scatter(x=osm, y=osr, mode="markers", row=row, col=col, name=feature)
    fig.add_scatter(x=x_theory, y=y_theory, mode="lines", row=row, col=col)
    fig.add_annotation(
        x=-1.25,
        y=osr[-1] * 0.75,
        text=R2,
        showarrow=False,
        row=row,
        col=col,
        font_size=9,
    )
    fig.update_yaxes(tickfont_size=7, row=row, col=col)
    fig.update_xaxes(
        title_text=feature,
        title_font_size=9,
        title_font_family="Arial Black",
        tickfont_size=7,
        row=row,
        col=col,
    )

fig.update_layout(
    font_color=FONT_COLOR,
    title="Numerical Features - Probability Plots against Normal Distribution",
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    showlegend=False,
    width=840,
    height=1340,
)
fig.update_traces(
    marker=dict(size=1, symbol="x-thin", line=dict(width=2, color="#010D36")),
    line_color="#FF2079",
)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>As you can see, some variables fit a normal distribution well, which shows up as a high coefficient of determination (R-squared) and samples spread evenly around the straight line. Examples are <code>DN</code> and <code>BN</code>.</li>
    <li>Nevertheless, many features do not fit a normal distribution. We can improve the fit with specific transformations:</li>
    <ul style="
        font-size: 16px;
        font-family: 'JetBrains Mono';
        color: #f2f2f0;
        margin-right: 8px;
    ">
        <li><b>Log Transformation</b> - generally works well with right-skewed data. Requires strictly positive numbers.</li>
        <li><b>Square Root Transformation</b> - similar to the log transformation, but with a milder effect. Requires non-negative numbers.</li>
        <li><b>Square Transformation</b> - helps reduce left skew.</li>
        <li><b>Reciprocal Transformation</b> - sometimes used when data is skewed or has obvious outliers. Not defined at zero.</li>
        <li><b>Box-Cox Transformation</b> - used when data is skewed or has outliers. Requires strictly positive numbers.</li>
        <li><b>Yeo-Johnson Transformation</b> - variation of Box-Cox transformation, but without restrictions concerning numbers.</li>
    </ul>
    <li>Let's check these transformations (all but the square one) for our variables. We simply use the <code>probplot()</code> function to get the R-squared coefficient for each transformation.</li>
</ul>
</blockquote>
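
The domain restrictions listed above are easy to check directly. The sketch below uses a small made-up array containing a negative value and a zero. It is only an illustration of how SciPy and NumPy react, not part of the notebook's pipeline.

```python
import numpy as np
from scipy import stats

values = np.array([-2.0, 0.0, 0.5, 1.0, 3.0, 10.0, 50.0])

# Box-Cox is only defined for strictly positive data, so SciPy rejects this input.
try:
    stats.boxcox(values)
    boxcox_ok = True
except ValueError:
    boxcox_ok = False

# Yeo-Johnson accepts zeros and negative values.
transformed, fitted_lambda = stats.yeojohnson(values)

# The plain logarithm breaks down for non-positive values as well.
with np.errstate(divide="ignore", invalid="ignore"):
    log_values = np.log(values)  # nan for -2.0, -inf for 0.0

print("Box-Cox accepted the data:", boxcox_ok)
print("Yeo-Johnson output is finite:", bool(np.all(np.isfinite(transformed))))
```

The features in this dataset appear to be positive, which is why the loop below can apply the log, reciprocal, and Box-Cox transformations without guards.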

In [19]:
r2_scores = defaultdict(tuple)

for feature in numeric_cols:
    orig = train[feature].dropna()
    _, (*_, R_orig) = probplot(orig, rvalue=True)
    _, (*_, R_log) = probplot(np.log(orig), rvalue=True)
    _, (*_, R_sqrt) = probplot(np.sqrt(orig), rvalue=True)
    _, (*_, R_reci) = probplot(np.reciprocal(orig), rvalue=True)
    _, (*_, R_boxcox) = probplot(stats.boxcox(orig)[0], rvalue=True)
    _, (*_, R_yeojohn) = probplot(stats.yeojohnson(orig)[0], rvalue=True)
    r2_scores[feature] = (
        R_orig * R_orig,
        R_log * R_log,
        R_sqrt * R_sqrt,
        R_reci * R_reci,
        R_boxcox * R_boxcox,
        R_yeojohn * R_yeojohn,
    )

r2_scores = pd.DataFrame(
    r2_scores, index=("Original", "Log", "Sqrt", "Reciprocal", "BoxCox", "YeoJohnson")
).T

r2_scores["Winner"] = r2_scores.idxmax(axis=1)
r2_scores.style.set_table_styles(DF_STYLE).format(precision=3)


Feature,Original,Log,Sqrt,Reciprocal,BoxCox,YeoJohnson,Winner
AB,0.537,0.976,0.82,0.92,0.998,0.991,BoxCox
AF,0.761,0.872,0.945,0.344,0.955,0.955,YeoJohnson
AH,0.238,0.568,0.416,0.678,0.686,0.686,YeoJohnson
AM,0.383,0.959,0.716,0.903,0.997,0.996,BoxCox
AR,0.158,0.422,0.299,0.505,0.515,0.515,YeoJohnson
AX,0.745,0.918,0.912,0.489,0.938,0.95,YeoJohnson
AY,0.039,0.573,0.232,0.642,0.634,0.627,Reciprocal
AZ,0.942,0.903,0.953,0.722,0.957,0.958,YeoJohnson
BC,0.058,0.74,0.308,0.723,0.739,0.745,YeoJohnson
BD,0.412,0.924,0.73,0.918,0.962,0.962,YeoJohnson


In [20]:
no_transform_cols = r2_scores.query("Winner == 'Original'").index
log_transform_cols = r2_scores.query("Winner == 'Log'").index
sqrt_transform_cols = r2_scores.query("Winner == 'Sqrt'").index
reciprocal_transform_cols = r2_scores.query("Winner == 'Reciprocal'").index
boxcox_transform_cols = r2_scores.query("Winner == 'BoxCox'").index
yeojohnson_transform_cols = r2_scores.query("Winner == 'YeoJohnson'").index


In [21]:
AB_orig = train.AB.dropna()
(osm, osr), (slope, intercept, R) = probplot(AB_orig, rvalue=True)
x_theory = np.array([osm[0], osm[-1]])
y_theory = intercept + slope * x_theory

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=["Probability Plot against Normal Distribution", "Histogram"],
)

fig.add_scatter(x=osm, y=osr, mode="markers", row=1, col=1, name="AB")
fig.add_scatter(x=x_theory, y=y_theory, mode="lines", row=1, col=1)
fig.add_annotation(
    x=-1.25,
    y=osr[-1] * 0.4,
    text=f"R\u00b2 = {R * R:.3f}",
    showarrow=False,
    row=1,
    col=1,
)
fig.update_yaxes(title_text="Observed Values", row=1, col=1)
fig.update_xaxes(title_text="Theoretical Quantiles", row=1, col=1)
fig.update_traces(
    marker=dict(size=1, symbol="x-thin", line=dict(width=2, color="#010D36")),
    line_color="#FF2079",
)

fig.add_histogram(
    x=AB_orig,
    marker_color="#010D36",
    opacity=0.75,
    name="AB",
    row=1,
    col=2,
)
fig.update_yaxes(title_text="Count", row=1, col=2)
fig.update_xaxes(title_text="AB", row=1, col=2)

fig.update_layout(
    font_color=FONT_COLOR,
    title="AB Feature - Original",
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    showlegend=False,
    width=840,
    height=440,
    bargap=0.2,
)

fig.update_annotations(font_size=14)
fig.show()


In [22]:
AB_transformed = stats.boxcox(train.AB.dropna())[0]
(osm, osr), (slope, intercept, R) = probplot(AB_transformed, rvalue=True)
x_theory = np.array([osm[0], osm[-1]])
y_theory = intercept + slope * x_theory

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=["Probability Plot against Normal Distribution", "Histogram"],
)

fig.add_scatter(x=osm, y=osr, mode="markers", row=1, col=1, name="BoxCox(AB)")
fig.add_scatter(x=x_theory, y=y_theory, mode="lines", row=1, col=1)
fig.add_annotation(
    x=-1.25,
    y=osr[-1] * 0.4,
    text=f"R\u00b2 = {R * R:.3f}",
    showarrow=False,
    row=1,
    col=1,
)
fig.update_yaxes(title_text="Observed Values", row=1, col=1)
fig.update_xaxes(title_text="Theoretical Quantiles", row=1, col=1)
fig.update_traces(
    marker=dict(size=1, symbol="x-thin", line=dict(width=2, color="#010D36")),
    line_color="#FF2079",
)

fig.add_histogram(
    x=AB_transformed,
    marker_color="#010D36",
    opacity=0.75,
    name="BoxCox(AB)",
    row=1,
    col=2,
)
fig.update_yaxes(title_text="Count", row=1, col=2)
fig.update_xaxes(title_text="BoxCox(AB)", row=1, col=2)

fig.update_layout(
    font_color=FONT_COLOR,
    title="AB Feature - Box-Cox Transformation",
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    showlegend=False,
    width=840,
    height=440,
    bargap=0.2,
)

fig.update_annotations(font_size=14)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>As you can see above, the Box-Cox transformation works very well for the <code>AB</code> variable (R-squared of $0.998$).</li>
    <li>Most likely we will end up working with tree-based models, but models like <code>SVC</code> sometimes perform very well, and appropriate transformations are crucial for them.</li>
    <li>Let's take a closer look at the values we obtained.</li>
</ul>
</blockquote>
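
When such a transformation feeds a model like `SVC`, the Box-Cox parameter is usually estimated on the training data and then reused on unseen data. A minimal sketch under that assumption, with synthetic samples standing in for a real feature:

```python
import numpy as np
from scipy import special, stats

rng = np.random.default_rng(0)
train_values = rng.lognormal(mean=0.0, sigma=1.0, size=400)
new_values = rng.lognormal(mean=0.0, sigma=1.0, size=100)

# Estimate lambda on the training sample only.
train_transformed, fitted_lambda = stats.boxcox(train_values)

# Passing lmbda applies the transformation with that fixed value
# instead of re-estimating it on the new sample.
new_transformed = stats.boxcox(new_values, lmbda=fitted_lambda)

print(f"lambda fitted on the training sample: {fitted_lambda:.3f}")
```

scikit-learn's `PowerTransformer`, which the preprocessing pipeline later in this notebook uses, follows the same fit-then-transform pattern.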

In [23]:
r2_scores.describe().T.drop("count", axis=1).rename(
    columns=str.title
).style.set_table_styles(DF_STYLE).format(precision=3)


Transformation,Mean,Std,Min,25%,50%,75%,Max
Original,0.527,0.321,0.023,0.209,0.531,0.82,0.982
Log,0.841,0.17,0.177,0.827,0.903,0.949,0.998
Sqrt,0.722,0.249,0.132,0.588,0.789,0.941,0.992
Reciprocal,0.653,0.217,0.11,0.504,0.678,0.832,0.972
BoxCox,0.879,0.155,0.254,0.84,0.941,0.984,0.998
YeoJohnson,0.882,0.157,0.254,0.843,0.95,0.985,0.998


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>As you can see, the Yeo-Johnson transformation wins in most cases. However, some simple transformations, like the log, also do well. There is also one feature for which none of the transformations helps: <code>CW</code>.</li>
</ul>
</blockquote>

# <p style="padding: 15px; background-color: #010D36; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Semi-Constant Features</p>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Notes</b> 📜
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>Something looks suspicious about the variables with especially poor transformation results. Let's take a closer look at them: <code>AR</code>, <code>AY</code>, <code>BZ</code>, <code>DF</code>, and <code>DV</code>.</li>
</ul>
</blockquote>

In [24]:
problematic_variables = train[["AR", "AY", "BZ", "DF", "DV"]]
problematic_variables.head(10).style.set_table_styles(DF_STYLE)


Id,AR,AY,BZ,DF,DV
000ff2bfdfe9,8.138688,0.025578,257.432377,0.23868,1.74307
007255e47698,8.138688,0.025578,257.432377,0.23868,1.74307
013f2bd269f5,8.138688,0.025578,257.432377,0.23868,1.74307
043ac50845d5,8.138688,0.025578,257.432377,0.23868,1.74307
044fb8a146ec,8.138688,0.05481,257.432377,0.23868,1.74307
04517a3c90bd,8.138688,0.025578,257.432377,0.23868,1.74307
049232ca8356,15.31248,0.025578,257.432377,1.318005,1.74307
057287f2da6d,8.138688,0.025578,257.432377,0.23868,1.74307
0594b00fb30a,8.138688,0.025578,257.432377,0.23868,1.74307
05f2bc0155cd,8.138688,0.025578,257.432377,0.23868,1.74307


In [25]:
problematic_variables.info()


<class 'pandas.core.frame.DataFrame'>
Index: 617 entries, 000ff2bfdfe9 to ffcca4ded3bb
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AR      617 non-null    float64
 1   AY      617 non-null    float64
 2   BZ      617 non-null    float64
 3   DF      617 non-null    float64
 4   DV      617 non-null    float64
dtypes: float64(5)
memory usage: 28.9+ KB


In [26]:
problematic_variables_descr = numeric_descr.loc[problematic_variables.columns]
problematic_variables_descr.style.set_table_styles(DF_STYLE).format(precision=3)


Feature,Mean,Std,Min,1%,5%,25%,50%,75%,95%,99%,Max
AR,10.128,10.519,8.139,8.139,8.139,8.139,8.139,8.139,17.12,34.467,178.944
AY,0.06,0.417,0.026,0.026,0.026,0.026,0.026,0.037,0.124,0.214,10.316
BZ,550.633,2076.371,257.432,257.432,257.432,257.432,257.432,257.432,1516.082,2983.909,50092.459
DF,0.634,1.912,0.239,0.239,0.239,0.239,0.239,0.239,2.038,5.745,37.895
DV,1.925,1.485,1.743,1.743,1.743,1.743,1.743,1.743,2.053,6.494,25.193


In [27]:
problematic_variables_vs_class = problematic_variables.join(train.Class)
duplicated_rows = problematic_variables_vs_class.duplicated(
    subset=["AR", "AY", "BZ", "DF", "DV"]
)

duplicates = problematic_variables_vs_class[duplicated_rows]
no_duplicates = problematic_variables_vs_class[~duplicated_rows]

print(
    CLR
    + "Ratio of duplicated / not duplicated rows in ['AR', 'AY', 'BZ', 'DF', 'DV'] subset:\n"
)
print(CLR + "Duplicated rows:".ljust(20), RED + f"{len(duplicates)}")
print(CLR + "Not duplicated rows:".ljust(20), RED + f"{len(no_duplicates)}\n")

print(CLR + "Class balance when ['AR', 'AY', 'BZ', 'DF', 'DV'] are duplicated:\n")
for key, value in duplicates.Class.value_counts(normalize=True).to_dict().items():
    print(CLR + f"Class {key}:", RED + f"{value:.1%}")

print(CLR + "\nClass balance when ['AR', 'AY', 'BZ', 'DF', 'DV'] are not duplicated:\n")
for key, value in no_duplicates.Class.value_counts(normalize=True).to_dict().items():
    print(CLR + f"Class {key}:", RED + f"{value:.1%}")


[1m[30mRatio of duplicated / not duplicated rows in ['AR', 'AY', 'BZ', 'DF', 'DV'] subset:

[1m[30mDuplicated rows:     [1m[31m280
[1m[30mNot duplicated rows: [1m[31m337

[1m[30mClass balance when ['AR', 'AY', 'BZ', 'DF', 'DV'] are duplicated:

[1m[30mClass 0: [1m[31m91.8%
[1m[30mClass 1: [1m[31m8.2%
[1m[30m
Class balance when ['AR', 'AY', 'BZ', 'DF', 'DV'] are not duplicated:

[1m[30mClass 0: [1m[31m74.8%
[1m[30mClass 1: [1m[31m25.2%


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>Each of these variables takes a single value in most rows and has a handful of extreme outliers, which explains why the transformations perform poorly. Some of them (<code>AR</code>, <code>BZ</code> and <code>DV</code>) correlate strongly with other features, but that correlation is hard to interpret when almost the whole distribution sits at one value. We could probably drop these attributes.</li>
    <li>The question remains what to do with <code>AY</code> and <code>DF</code>, which do not correlate as strongly with any other feature: <code>AY</code> correlates with <code>EP</code> ($0.52$), and <code>DF</code> with <code>AR</code> ($0.35$) and <code>EU</code> ($0.30$).</li>
    <li>It is worth checking for other semi-constant variables, which we should probably binarize. Let's treat a feature as semi-constant when its minimum and median are equal.</li>
</ul>
</blockquote>
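
The "minimum equals median" rule can be tried out on a toy frame first. In this sketch the columns `floor` and `spread` are invented, and the lowercase `min` and `50%` labels come straight from `describe()`, whereas the notebook's `numeric_descr` uses title-cased column names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 200

# "floor" repeats its minimum in roughly 70% of rows; "spread" is continuous.
floor = np.where(
    rng.random(n_rows) < 0.7,
    8.0,
    8.0 + rng.exponential(5.0, size=n_rows),
)
demo = pd.DataFrame({"floor": floor, "spread": rng.normal(10.0, 2.0, size=n_rows)})

summary = demo.describe().T
is_semi_constant = np.isclose(summary["min"], summary["50%"])
semi_constant = summary.index[is_semi_constant].tolist()

share_at_min = (demo["floor"] == demo["floor"].min()).mean()
print(semi_constant, f"share of rows at the minimum: {share_at_min:.0%}")
```

The rule flags a feature as soon as at least half of its values sit at the minimum, so it is a fairly loose definition of "semi-constant".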

In [28]:
semi_constant_mask = np.isclose(numeric_descr["Min"], numeric_descr["50%"])
semi_constant_descr = numeric_descr[semi_constant_mask]
semi_constant_descr.style.set_table_styles(DF_STYLE).format(precision=3)


Feature,Mean,Std,Min,1%,5%,25%,50%,75%,95%,99%,Max
AH,118.625,127.839,85.2,85.2,85.2,85.2,85.2,113.74,209.993,541.429,1910.123
AR,10.128,10.519,8.139,8.139,8.139,8.139,8.139,8.139,17.12,34.467,178.944
AY,0.06,0.417,0.026,0.026,0.026,0.026,0.026,0.037,0.124,0.214,10.316
BC,8.053,65.167,1.23,1.23,1.23,1.23,1.23,5.081,11.997,50.66,1463.693
BZ,550.633,2076.371,257.432,257.432,257.432,257.432,257.432,257.432,1516.082,2983.909,50092.459
CL,1.404,1.922,1.05,1.05,1.05,1.05,1.05,1.228,1.889,5.686,31.688
DF,0.634,1.912,0.239,0.239,0.239,0.239,0.239,0.239,2.038,5.745,37.895
DV,1.925,1.485,1.743,1.743,1.743,1.743,1.743,1.743,2.053,6.494,25.193
EP,105.061,68.446,78.527,78.527,78.527,78.527,78.527,112.767,184.079,308.474,1063.595
GE,131.715,144.182,72.611,72.611,72.611,72.611,72.611,127.592,335.351,799.046,1497.352


In [29]:
semi_constant_features_corr = (
    train[np.r_[semi_constant_descr.index, ["Class"]]]
    .corr(method="pearson")["Class"]
    .to_dict()
)

print(CLR + "Semi-constant features - Pearson correlation with Class:\n")
for feature, corr_with_class in semi_constant_features_corr.items():
    print((CLR + feature + ":") + "\t", (RED + f"{corr_with_class:+.3f}"))


[1m[30mSemi-constant features - Pearson correlation with Class:

[1m[30mAH:	 [1m[31m+0.045
[1m[30mAR:	 [1m[31m+0.064
[1m[30mAY:	 [1m[31m+0.082
[1m[30mBC:	 [1m[31m+0.156
[1m[30mBZ:	 [1m[31m+0.112
[1m[30mCL:	 [1m[31m+0.017
[1m[30mDF:	 [1m[31m+0.064
[1m[30mDV:	 [1m[31m+0.015
[1m[30mEP:	 [1m[31m-0.068
[1m[30mGE:	 [1m[31m-0.071
[1m[30mClass:	 [1m[31m+1.000


In [30]:
# Let's save these features with their median thresholds.
semi_const_cols_thresholds = semi_constant_descr["50%"].to_dict()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>The weak correlations with <code>Class</code> suggest that binarization should not do much harm. It may even be reasonable to drop these features altogether.</li>
</ul>
</blockquote>

# <p style="padding: 15px; background-color: #010D36; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">EJ - The Only Categorical Variable</p>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Notes</b> 📜
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>In the whole dataset, there is only one categorical feature - <code>EJ</code>. Let's focus on this.</li>
</ul>
</blockquote>

In [31]:
sunburst_df = train.copy()
sunburst_df.Class = sunburst_df.Class.map({0: "Class 0", 1: "Class 1"})
sunburst_df.EJ = sunburst_df.EJ.map({"A": "EJ - A", "B": "EJ - B"})

fig = px.sunburst(
    sunburst_df,
    title="Class (Binary Target) vs EJ (Categorical)",
    path=["EJ", "Class"],
    color_discrete_sequence=["#010D36", "#FF2079"],
    height=640,
    width=640,
)
fig.update_traces(
    insidetextorientation="horizontal",
    texttemplate="%{label}<br>%{value} - %{percentParent}",
    marker_line_width=5,
    marker_line_color=BACKGROUND_COLOR,
)
fig.update_layout(
    font_color=FONT_COLOR,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
)
fig.show()


In [32]:
EJ_pivot = (
    train.pivot_table(
        values="Class",
        index="EJ",
        aggfunc=["mean", "sum", "count"],
        margins=True,
        margins_name="Total",
    )
    .rename(
        columns={
            "mean": "Class 1 Fraction",
            "sum": "Class 1 Count",
            "count": "Samples",
        }
    )
    .droplevel(level=1, axis="columns")
)

EJ_pivot.style.set_table_styles(DF_STYLE)


EJ,Class 1 Fraction,Class 1 Count,Samples
A,0.126126,28,222
B,0.202532,80,395
Total,0.175041,108,617


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
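    <li>Nothing looks suspicious here. Class $1$ accounts for about $12.6\%$ of the samples with <code>EJ</code> = A and about $20.3\%$ of those with <code>EJ</code> = B.</li>
</ul>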
</blockquote>
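
If a more formal check is wanted, the counts from the pivot table can be passed to a chi-square test of independence. This is an optional sketch rather than part of the original analysis. The contingency table is rebuilt by hand from the `Class 1 Count` and `Samples` columns.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: EJ = A, EJ = B. Columns: Class 0 count, Class 1 count.
# Class 0 counts are Samples minus Class 1 Count from the pivot table.
observed = np.array(
    [
        [222 - 28, 28],
        [395 - 80, 80],
    ]
)

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
```

A small p-value here would point to an association between `EJ` and the target, which concerns predictive signal rather than data quality.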

# <p style="padding: 15px; background-color: #010D36; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Dimensionality Reduction with t-SNE</p>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Notes</b> 📜
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>The t-SNE algorithm is a popular tool for reducing dimensionality and visualizing datasets. It tries to keep similar samples close together in the low-dimensional embedding. We will apply it to this dataset to see whether any clusters or other interesting structure appear in 2D and 3D.</li>
    <li>First, we apply simple preprocessing together with the transformations explored in the previous section on probability plots. Binarization is not included yet.</li>
</ul>
</blockquote>
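
As a warm-up, here is a minimal t-SNE run on synthetic data. The two Gaussian groups, the perplexity, and the other settings are assumptions picked for a quick demonstration and are not necessarily the values used for the real dataset.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic groups in 10 dimensions with different means.
group_a = rng.normal(loc=0.0, scale=1.0, size=(60, 10))
group_b = rng.normal(loc=4.0, scale=1.0, size=(60, 10))
features = StandardScaler().fit_transform(np.vstack([group_a, group_b]))
labels = np.array([0] * 60 + [1] * 60)

# Perplexity must be smaller than the number of samples.
embedding = TSNE(
    n_components=2,
    perplexity=20,
    init="pca",
    random_state=0,
).fit_transform(features)

print(embedding.shape)
```

The embedding has one row per sample and can be plotted directly, colored by `labels`. Distances between far-apart groups in a t-SNE plot should be read with caution, since the method mainly preserves local neighborhoods.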

In [33]:
casual_preprocess = make_pipeline(
    make_column_transformer(
        (
            StandardScaler(),
            no_transform_cols.to_list(),
        ),
        (
            make_pipeline(
                FunctionTransformer(func=np.log, feature_names_out="one-to-one"),
                StandardScaler(),
            ),
            log_transform_cols.to_list(),
        ),
        (
            make_pipeline(
                FunctionTransformer(func=np.reciprocal, feature_names_out="one-to-one"),
                StandardScaler(),
            ),
            reciprocal_transform_cols.to_list(),
        ),
        (
            PowerTransformer(method="box-cox", standardize=True),
            boxcox_transform_cols.to_list(),
        ),
        (
            PowerTransformer(method="yeo-johnson", standardize=True),
            yeojohnson_transform_cols.to_list(),
        ),
        (
            make_pipeline(
                SimpleImputer(strategy="most_frequent"),
                OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
            ),
            make_column_selector(dtype_include=object),  # type: ignore
        ),
        remainder="drop",
        verbose_feature_names_out=False,
    ),
    KNNImputer(n_neighbors=10, weights="distance"),
)


In [34]:
X = train.drop("Class", axis=1)
y = train.Class

X_processed = casual_preprocess.fit_transform(X)
X_processed_frame = pd.DataFrame(
    X_processed,
    columns=casual_preprocess.get_feature_names_out(),
    index=X.index,
)
X_processed_frame.head().style.set_table_styles(DF_STYLE).format(precision=3)


Id,CW,DU,EL,EU,FD,FS,AY,BZ,DF,EP,GE,AB,AM,BN,BQ,CB,CL,DV,GI,GL,AF,AH,AR,AX,AZ,BC,BD,BP,BR,CC,CD,CF,CH,CR,CS,CU,DA,DE,DH,DI,DL,DN,DY,EB,EE,EG,EH,FC,FE,FI,FL,FR,GB,GF,GH,EJ
000ff2bfdfe9,0.618,1.406,-0.808,-1.194,1.295,-0.847,0.6,0.475,0.519,0.756,0.737,-0.999,0.087,0.29,0.859,0.189,-0.718,-0.253,0.728,-0.848,-0.011,-0.701,-0.497,-3.893,-0.095,1.114,-0.579,-0.244,0.314,-0.512,-2.769,-0.665,-0.613,-3.55,-3.012,-0.073,0.911,-0.067,-0.717,-0.944,-0.336,0.509,0.022,-0.201,-0.419,-0.021,1.919,-2.015,0.335,-2.683,0.913,0.95,-1.15,-0.913,-0.969,1.0
007255e47698,0.705,-1.073,0.915,0.756,-1.095,0.88,0.6,0.475,0.519,-0.099,0.737,-1.77,0.652,-0.635,-1.209,-0.358,-0.012,-0.253,-0.327,1.167,-1.514,-0.701,-0.497,-0.873,0.722,-0.838,0.343,-0.586,-2.413,-1.036,-1.033,-0.403,0.321,1.371,-0.623,0.033,0.98,-0.769,0.045,-0.368,-0.655,1.342,-0.392,-1.476,-1.518,-0.489,-0.976,-1.438,-0.127,0.151,-1.133,-1.352,-1.587,1.048,-0.143,0.0
013f2bd269f5,-0.39,0.895,0.915,-0.939,1.186,1.599,0.6,0.475,0.519,0.756,0.116,0.434,0.511,1.567,1.249,-0.264,-0.718,-0.253,-0.214,-0.652,-0.269,-0.701,-0.497,0.674,0.576,-0.838,0.133,-1.186,-0.242,-0.957,0.14,-0.548,0.767,-0.056,0.401,-0.667,0.981,0.049,-1.509,-0.16,-1.05,0.32,0.758,-0.006,1.936,0.055,1.251,1.813,0.211,0.572,0.951,-0.091,1.559,0.421,-0.265,1.0
043ac50845d5,-0.385,1.156,-0.773,0.374,1.117,0.212,0.6,0.475,0.519,0.756,0.328,-0.638,1.333,0.623,-1.419,-1.359,1.443,-0.253,1.132,-0.745,0.334,1.215,-0.497,-0.838,0.19,-0.838,-0.544,0.509,0.085,0.33,0.196,-1.416,0.068,-0.32,0.533,-1.317,-0.07,-0.635,-0.634,0.21,-0.813,-0.151,1.383,-0.165,0.624,4.297,1.668,0.633,0.622,1.557,0.787,-1.352,-0.046,-0.887,0.902,1.0
044fb8a146ec,0.495,0.852,0.915,0.74,0.704,-0.605,-1.263,0.475,0.519,0.049,-0.994,0.092,-0.521,0.736,0.838,0.779,-0.718,-0.253,-0.177,-0.935,0.294,-0.701,-0.497,-0.677,-1.871,1.94,0.468,1.151,2.518,-0.683,-0.231,1.721,-0.355,-0.084,-0.264,-1.071,1.108,-0.61,-1.54,-0.689,-1.56,0.067,-1.149,-0.178,0.476,-0.059,0.41,-0.407,1.14,1.208,0.991,3.091,-0.322,0.049,1.356,1.0


In [35]:
tsne_2D = TSNE(n_components=2, n_jobs=-1, random_state=42, perplexity=10)
tsne_3D = TSNE(n_components=3, n_jobs=-1, random_state=42, perplexity=10)

X_2D = pd.DataFrame(
    tsne_2D.fit_transform(X_processed), columns=["dim1", "dim2"], index=X.index
).join(y.astype(str))

X_3D = pd.DataFrame(
    tsne_3D.fit_transform(X_processed), columns=["dim1", "dim2", "dim3"], index=X.index
).join(y.astype(str))


In [36]:
fig = px.scatter(
    X_2D.reset_index(),
    x="dim1",
    y="dim2",
    symbol="Class",
    symbol_sequence=["diamond", "circle"],
    color="Class",
    color_discrete_sequence=["#010D36", "#FF2079"],
    category_orders={"Class": ("0", "1")},
    hover_data=["Id"],
    opacity=0.6,
    height=840,
    width=840,
    title="Training Dataset - 2D Projection with t-SNE",
)
fig.update_layout(
    font_color=FONT_COLOR,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    legend=dict(
        orientation="h",
        yanchor="bottom",
        xanchor="right",
        y=1.05,
        x=1,
        title="Class",
        itemsizing="constant",
    ),
)
fig.update_traces(marker_size=6)
fig.show()


In [37]:
fig = px.scatter_3d(
    X_3D.reset_index(),
    x="dim1",
    y="dim2",
    z="dim3",
    symbol="Class",
    symbol_sequence=["diamond", "circle"],
    color="Class",
    color_discrete_sequence=["#010D36", "#FF2079"],
    category_orders={"Class": ("0", "1")},
    hover_data=["Id"],
    opacity=0.6,
    height=840,
    width=840,
    title="Training Dataset - 3D Projection with t-SNE",
)
fig.update_layout(
    font_color=FONT_COLOR,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    legend=dict(
        orientation="h",
        yanchor="bottom",
        xanchor="right",
        y=1.05,
        x=1,
        title="Class",
        itemsizing="constant",
    ),
)
fig.update_traces(marker_size=3)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>As you can see, there is something interesting here: a cluster associated with Class $1$. It appears in both the 2D and the 3D projection. I also checked several different <code>perplexity</code> values and random seeds, and in each case a smaller or bigger cluster of this kind shows up, so it is probably not an artefact of one particular run.</li>
    <li>On the other hand, many samples from the two classes overlap.</li>
</ul>
</blockquote>
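The robustness check mentioned above (several perplexity values and random seeds) can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are synthetic blobs from `make_blobs`, not the competition dataset, and `neighbour_purity` is a hypothetical helper introduced here to put a number on how well the classes stay together in each embedding.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in data (assumption): two well-separated blobs in 10 dimensions.
X_demo, y_demo = make_blobs(
    n_samples=120, centers=2, n_features=10, cluster_std=1.0, random_state=0
)


def neighbour_purity(embedding, labels):
    """Fraction of points whose nearest neighbour in the embedding shares their label."""
    nn = NearestNeighbors(n_neighbors=2).fit(embedding)
    _, idx = nn.kneighbors(embedding)
    # Column 0 is the point itself, column 1 is its nearest neighbour.
    return float(np.mean(labels[idx[:, 1]] == labels))


# Repeat the projection for several perplexities and seeds and compare the scores.
purity = {}
for perplexity in (5, 30):
    for seed in (0, 1):
        embedding = TSNE(
            n_components=2, perplexity=perplexity, random_state=seed
        ).fit_transform(X_demo)
        purity[(perplexity, seed)] = neighbour_purity(embedding, y_demo)

for key, value in purity.items():
    print(key, round(value, 3))
```

If a cluster is a real property of the data, such a score should stay similar across settings. If it changes a lot from run to run, the structure is more likely an artefact of a single embedding.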

# <p style="padding: 15px; background-color: #010D36; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Feature Importance &amp; Permutation Tests</p>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Notes</b> 📜
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>In this section, we will tackle the general feature importance problem. We check several different methods to assess which variables are essential in the decision process.</li>
    <li>First, let's take three very different methods, <code>LinearDiscriminantAnalysis</code>, <code>LGBMClassifier</code> and <code>mutual_info_classif()</code>, and see what we get.</li>
</ul>
</blockquote>

In [38]:
lda_pipeline = make_pipeline(
    casual_preprocess,
    LinearDiscriminantAnalysis(),
).fit(X, y)
lda_info = np.abs(lda_pipeline[-1].scalings_.ravel())
lda_info = lda_info / lda_info.sum()  # Normalise to 1 to compare with other methods.

lgbm_pipeline = make_pipeline(
    casual_preprocess,
    LGBMClassifier(random_state=42, is_unbalance=True),
).fit(X, y)
lgbm_info = lgbm_pipeline[-1].feature_importances_
lgbm_info = lgbm_info / lgbm_info.sum()

mutual_info = mutual_info_classif(
    X=casual_preprocess.fit_transform(X), y=y, random_state=42
)
mutual_info = mutual_info / np.sum(mutual_info)

importances = pd.DataFrame(
    [lda_info, lgbm_info, mutual_info],
    columns=lda_pipeline[0].get_feature_names_out(),
    index=["LDA", "LGBM", "MI"],
).T

importances[:10].style.set_table_styles(DF_STYLE).format(precision=4)


Feature,LDA,LGBM,MI
CW,0.0021,0.0101,0.0095
DU,0.1249,0.0949,0.07
EL,0.0035,0.0108,0.0119
EU,0.0005,0.0101,0.0169
FD,0.1061,0.0087,0.0324
FS,0.0017,0.0087,0.0057
AY,0.0023,0.0028,0.0103
BZ,0.0022,0.0007,0.0039
DF,0.0013,0.0007,0.018
EP,0.0116,0.0136,0.004


In [39]:
importances_melted_frame = (
    importances.melt(
        var_name="Method",
        value_name="Importance",
        ignore_index=False,
    )
    .reset_index()
    .rename(columns={"index": "Feature"})
    .round(4)
)

fig = px.bar(
    importances_melted_frame,
    x="Importance",
    y="Feature",
    color="Importance",
    facet_col="Method",
    facet_col_spacing=0.07,
    height=940,
    width=840,
    color_continuous_scale=color_map,
    title="Normalised Feature Importances (Three Different Default Methods)",
)
fig.update_annotations(font_size=14)
fig.update_yaxes(
    matches=None,
    showticklabels=True,
    categoryorder="total ascending",
    tickfont_size=8,
)
fig.update_xaxes(matches=None)
fig.update_traces(width=0.7)
fig.update_layout(
    font_color=FONT_COLOR,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    coloraxis_colorbar=dict(
        orientation="h",
        title_side="bottom",
        yanchor="bottom",
        xanchor="center",
        title=None,
        y=-0.2,
        x=0.5,
    ),
)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>As you can see, different methods give different results. Those obtained from LGBM and Mutual Information are similar, whereas LDA gives an utterly different outcome. In that method, the importance measure is based on the absolute values of the discriminant weights.</li>
    <li>The first thing to notice is that the <code>DU</code> variable ranks very high in each method (in two of them, it wins). Moreover, <code>GL</code>, which wins in the LDA method, also ranks high in the other two. More interestingly, LDA says that <code>EJ</code> (the only categorical variable) is the fifth most important feature in the dataset, while LGBM says that it is useless. In the Mutual Information method, <code>EJ</code> is somewhere around the middle.</li>
    <li>Probably not all variables will be needed in the final model. We can explore this with a more sophisticated method based on held-out (out-of-fold) data: a so-called permutation test, which shows how sensitive the balanced log loss is to permuting the values of a given feature.</li>
</ul>
</blockquote>
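To make the LDA-based measure more concrete, here is a minimal sketch of how absolute discriminant weights can be turned into normalised importance shares. The data are synthetic (an assumption for illustration): only the first column differs between the classes, so it should receive the largest share.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic data (assumption): four unit-variance features, only column 0 is shifted by class.
y_demo = rng.integers(0, 2, size=300)
X_demo = rng.normal(size=(300, 4))
X_demo[:, 0] += 2.0 * y_demo

lda = LinearDiscriminantAnalysis().fit(X_demo, y_demo)

# For a binary problem, scalings_ has one column: the weights of the discriminant axis.
weights = np.abs(lda.scalings_.ravel())
share = weights / weights.sum()  # Normalise so that the shares sum to one.

print(share.round(3))
```

Note that such weights are only comparable between features when the features are on a similar scale, which is why the pipeline standardises them first.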

In [40]:
def balanced_log_loss(y_true, y_pred, **kwargs):
    """Competition evaluation metric - balanced logarithmic loss.
    The overall effect is such that each class is roughly equally
    important for the final score."""
    N0, N1 = np.bincount(y_true)

    y0 = np.where(y_true == 0, 1, 0)
    y1 = np.where(y_true == 1, 1, 0)

    eps = kwargs.get("eps", 1e-15)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    p0 = np.log(1 - y_pred)
    p1 = np.log(y_pred)

    return -(1 / N0 * np.sum(y0 * p0) + 1 / N1 * np.sum(y1 * p1)) * 0.5
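As a quick sanity check of what "each class is roughly equally important" means, the sketch below evaluates the same formula on a hypothetical imbalanced target. The function `balanced_log_loss_demo` is a compact re-statement written for this example only, and the 90/10 class split is an assumption.

```python
import numpy as np


def balanced_log_loss_demo(y_true, y_pred, eps=1e-15):
    """Average of the per-class mean log losses (same formula as above)."""
    y_true = np.asarray(y_true)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    n0, n1 = np.bincount(y_true, minlength=2)
    loss0 = -np.sum(np.log(1 - y_pred[y_true == 0])) / n0
    loss1 = -np.sum(np.log(y_pred[y_true == 1])) / n1
    return 0.5 * (loss0 + loss1)


# Hypothetical imbalanced target: 90 negatives and 10 positives.
y_demo = np.array([0] * 90 + [1] * 10)

# Predicting 0.5 for everyone gives ln(2), regardless of the imbalance.
uninformed = balanced_log_loss_demo(y_demo, np.full(100, 0.5))  # ≈ 0.693

# Predicting the class prior (0.1) is penalised, because the rare class weighs as much.
prior_only = balanced_log_loss_demo(y_demo, np.full(100, 0.1))  # ≈ 1.204

print(round(uninformed, 3), round(prior_only, 3))
```

Under the ordinary log loss, predicting the class prior would beat a constant 0.5. Under the balanced version it does not, because errors on the minority class are weighted up.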


In [41]:
n_bags = 10
n_folds = 5

np.random.seed(42)
seeds = np.random.randint(0, 19937, size=n_bags)


In [42]:
original_loglosses = []
permutation_loglosses = pd.DataFrame()

forest = RandomForestClassifier(
    class_weight="balanced", criterion="log_loss", random_state=42
)
svc = SVC(class_weight="balanced", probability=True, random_state=42)
lgbm = LGBMClassifier(is_unbalance=True, random_state=42)

for classifier in (forest, svc, lgbm):
    y_proba_original = np.zeros_like(y, dtype=np.float64)
    y_proba_shuffled = defaultdict(partial(np.zeros_like, y, dtype=np.float64))

    for seed in seeds:
        skfold = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        classifier.set_params(random_state=seed)

        for train_ids, valid_ids in skfold.split(X, y):
            X_train, y_train = X.iloc[train_ids], y.iloc[train_ids]
            X_valid, y_valid = X.iloc[valid_ids], y.iloc[valid_ids]

            X_train = casual_preprocess.fit_transform(X_train)
            X_valid = casual_preprocess.transform(X_valid)

            classifier.fit(X_train, y_train)
            y_proba_original[valid_ids] += classifier.predict_proba(X_valid)[:, 1]

            # Permute one feature at a time in the validation fold and predict again.
            for i, feature in enumerate(casual_preprocess.get_feature_names_out()):
                X_shuffled = X_valid.copy()
                X_shuffled[:, i] = np.random.permutation(X_shuffled[:, i])  # type: ignore
                y_proba_shuffled[feature][valid_ids] += classifier.predict_proba(
                    X_shuffled
                )[:, 1]

    classifier_name = classifier.__class__.__name__
    feature_names = y_proba_shuffled.keys()

    # Average the accumulated out-of-fold probabilities over the repetitions before scoring.
    original_loglosses.append(balanced_log_loss(y, y_proba_original / n_bags))
    permutation_loglosses[classifier_name] = pd.Series(
        [
            balanced_log_loss(y, y_proba_shuffled[feature] / n_bags)
            for feature in feature_names
        ],
        index=feature_names,
    )


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>The code above requires some clarification. We examine how shuffling the values of a single feature in the validation subset affects the balanced logarithmic loss. We take three distinct classifiers and evaluate each of them with stratified 5-fold cross-validation: in every iteration, the model is trained on one part of the data and evaluated on the held-out part, so we end up with predictions for the entire dataset. To make the outcome more reliable, the whole procedure is repeated ten times with different random seeds, the predictions are averaged, and only then is the balanced logarithmic loss computed.</li>
    <li>In parallel, for every feature in turn, we shuffle its values in the validation subset, predict on the modified data, and accumulate the results in a separate dictionary. If a variable is significant, the balanced log loss should get worse after shuffling, and if it is truly relevant, each classifier should show that.</li>
</ul>
</blockquote>
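The procedure described above can be condensed into a small, self-contained sketch. It is only an illustration under stated assumptions: the data are synthetic (only the first column drives the target), a single `LogisticRegression` stands in for the three classifiers, there is one repetition instead of ten, and the plain `log_loss` from scikit-learn replaces the balanced variant.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Synthetic data (assumption): four features, only column 0 carries signal.
X_demo = rng.normal(size=(400, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

proba_original = np.zeros(len(y_demo))
proba_shuffled = np.zeros((X_demo.shape[1], len(y_demo)))

skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_ids, valid_ids in skfold.split(X_demo, y_demo):
    model = LogisticRegression().fit(X_demo[train_ids], y_demo[train_ids])

    # Out-of-fold predictions on the untouched validation data.
    proba_original[valid_ids] = model.predict_proba(X_demo[valid_ids])[:, 1]

    # Shuffle one column at a time in the validation fold and predict again.
    for j in range(X_demo.shape[1]):
        X_perm = X_demo[valid_ids].copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        proba_shuffled[j, valid_ids] = model.predict_proba(X_perm)[:, 1]

baseline = log_loss(y_demo, proba_original)
loss_increase = np.array([log_loss(y_demo, p) for p in proba_shuffled]) - baseline

print(loss_increase.round(3))
```

The informative column should show by far the largest increase in loss, while the noise columns should stay close to zero.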

In [43]:
permutation_results_melted = (
    permutation_loglosses.melt(
        var_name="Method",
        value_name="Balanced Log Loss",
        ignore_index=False,
    )
    .reset_index()
    .rename(columns={"index": "Feature"})
    .round(4)
)

fig = px.bar(
    permutation_results_melted,
    x="Balanced Log Loss",
    y="Feature",
    color="Balanced Log Loss",
    facet_col="Method",
    facet_col_spacing=0.07,
    height=940,
    width=840,
    color_continuous_scale=color_map,
    title="Permutation Test Results - Balanced Log Loss when Permuting Samples<br>"
    "in Certain Features (Averaged over Stratified 5-Fold and 10 Different Seeds)",
)
fig.update_annotations(font_size=14)
fig.update_traces(width=0.7)
fig.update_xaxes(matches=None)
fig.update_yaxes(
    matches=None,
    showticklabels=True,
    categoryorder="total ascending",
    tickfont_size=8,
)
fig.update_layout(
    font_color=FONT_COLOR,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    coloraxis_colorbar=dict(
        orientation="h",
        title_side="bottom",
        yanchor="bottom",
        xanchor="center",
        title=None,
        y=-0.2,
        x=0.5,
    ),
    margin_t=120,
)
for original_logloss, max_logloss, col in zip(
    original_loglosses, permutation_loglosses.max().tolist(), (1, 2, 3)
):
    fig.add_vline(
        x=original_logloss,
        line_width=2,
        line_dash="dash",
        line_color="#FF2079",
        col=col,
    )
    fig.add_vrect(
        x0=original_logloss,
        x1=max_logloss,
        line_width=0,
        fillcolor="#FF2079",
        opacity=0.2,
        col=col,
    )
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>All three models agree: the <code>DU</code> variable has the highest influence on predictions. Other important features include, for example, <code>AB</code> and <code>BQ</code>.</li>
    <li>Given the above, we should probably include a feature selection step in the final pipeline. There are many feature selection methods, so we need to explore them.</li>
</ul>
</blockquote>

# <p style="padding: 15px; background-color: #010D36; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Look at Greeks</p>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Notes</b> 📜
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>In this section, we will have a quick look at the <code>greeks</code> metadata.</li>
    <li>Let's get started with a parallel coordinates plot. Since the attributes plotted here are all categorical, we will check which categories tend to occur together.</li>
</ul>
</blockquote>

In [44]:
greeks = greeks.join(train.Class)
greeks_cats = greeks[["Alpha", "Beta", "Gamma", "Delta"]].astype("category")
greeks_codes = greeks_cats.apply(lambda x: x.cat.codes)


In [45]:
fig = go.Figure(
    go.Parcoords(
        dimensions=[
            dict(
                label="Beta",
                values=greeks_codes.Beta,
                tickvals=np.unique(greeks_codes.Beta),
                ticktext=greeks_cats.Beta.cat.categories,
            ),
            dict(
                label="Gamma",
                values=greeks_codes.Gamma,
                tickvals=np.unique(greeks_codes.Gamma),
                ticktext=greeks_cats.Gamma.cat.categories,
            ),
            dict(
                label="Delta",
                values=greeks_codes.Delta,
                tickvals=np.unique(greeks_codes.Delta),
                ticktext=greeks_cats.Delta.cat.categories,
            ),
            dict(
                label="Alpha",
                values=greeks_codes.Alpha,
                tickvals=np.unique(greeks_codes.Alpha),
                ticktext=greeks_cats.Alpha.cat.categories,
            ),
            dict(
                label="Class",
                values=greeks.Class,
                tickvals=np.unique(greeks.Class),
            ),
        ],
        line=dict(
            color=greeks.Class,
            colorscale=color_map,
            showscale=True,
            colorbar=dict(
                title="Class",
                orientation="h",
                title_side="bottom",
                yanchor="bottom",
                xanchor="center",
                y=-0.35,
                x=0.5,
                nticks=2,
            ),
        ),
    )
)

fig.update_layout(
    font_color=FONT_COLOR,
    title="Greeks - Parallel Coordinates",
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    height=540,
    width=840,
)
fig.update_traces(
    labelfont=dict(family="Arial Black", size=10),
    tickfont=dict(family="Arial Black", size=10),
    selector=dict(type="parcoords"),
)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>As you can see, category "A" of the <code>Beta</code> characteristic is associated with people who have age-related conditions. The other two, "B" and "C", are mixed. A more interesting situation occurs in the <code>Gamma</code> variable: it has eight categories, six of which are associated with age-related conditions and two of which are not. Moving on to the <code>Delta</code> feature, the situation is mixed in all categories.</li>
    <li>Let's look at a pivot table for these features. We will compare <code>Beta</code> and <code>Delta</code> against <code>Class</code>, since the situation is diverse there.</li>
</ul>
</blockquote>

In [46]:
pivot = (
    greeks.pivot_table(
        values="Class",
        index=["Beta", "Delta"],
        aggfunc=["mean", "sum", "count"],
        margins=True,
        margins_name="Total",
    )
    .rename(
        columns={
            "mean": "Class 1 Fraction",
            "sum": "Class 1 Count",
            "count": "Samples",
        }
    )
    .droplevel(level=1, axis="columns")
)

pivot.style.set_table_styles(DF_STYLE)


Beta,Delta,Class 1 Fraction,Class 1 Count,Samples
A,A,1.0,8,8
B,A,0.227273,15,66
B,B,0.286765,39,136
C,A,0.0,0,1
C,B,0.046875,15,320
C,C,0.3125,20,64
C,D,0.5,11,22
Total,,0.175041,108,617


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>We have only eight people with <code>Beta</code> - "A" and <code>Delta</code> - "A" indicators, and all of them had age-related conditions. Moreover, there is quite a big group with <code>Beta</code> - "C" and <code>Delta</code> - "B" values, where the share of the positive class is very low: fewer than $5$% of the samples are positive ($15$ out of $320$).</li>
    <li>We could explore this further, but let's focus on the <code>Epsilon</code> attribute. It is the only time-related variable and gives the date of data collection, so it is worth looking at the <code>Class</code> trend over time.</li>
</ul>
</blockquote>

In [47]:
rolling_mean_class = (
    greeks[["Epsilon", "Class"]]
    .assign(Epsilon=pd.to_datetime(greeks.Epsilon, errors="coerce"))
    .dropna()
    .sort_values(by="Epsilon")
    .rolling(window="365D", on="Epsilon")
    .mean()
)

fig = px.line(
    rolling_mean_class,
    x="Epsilon",
    y="Class",
    height=540,
    width=840,
    color_discrete_sequence=["#010D36"],
    symbol_sequence=["x"],
    line_shape="spline",
    markers=True,
    title="Class Trend - Rolling Mean over 365 Days",
)
fig.update_layout(
    font_color=FONT_COLOR,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
)
fig.update_traces(marker=dict(size=6, color="#FF2079", opacity=0.7))
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>Most of the samples were collected between the end of $2018$ and autumn $2020$. We observe a notable decreasing trend in the share of the positive class among samples collected from the end of $2018$ to the beginning of $2019$. This feature looks like a powerful predictive variable. However, it is available only for the training set, so including time in the learning process may be risky.</li>
</ul>
</blockquote>
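The trend plot above relies on a time-based rolling window, which behaves differently from a fixed-size one. A tiny sketch with made-up dates (an assumption, not the real <code>Epsilon</code> values) shows how each point averages only the rows that fall within the preceding 365 days:

```python
import pandas as pd

# Made-up collection dates and targets (assumption for illustration).
demo = pd.DataFrame(
    {
        "Epsilon": pd.to_datetime(
            ["2019-01-01", "2019-06-01", "2020-01-15", "2020-03-01"]
        ),
        "Class": [1, 0, 0, 1],
    }
)

# For each row, average Class over all rows within the previous 365 days (inclusive of the row).
rolled = demo.sort_values(by="Epsilon").rolling(window="365D", on="Epsilon").mean()

print(rolled)
```

The first row averages only itself, the second averages the first two rows, and by the third row the earliest sample has already dropped out of the window. With irregularly spaced dates, the number of samples behind each point therefore varies, which is worth keeping in mind when reading the curve.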

In [48]:
greeks["Epsilon Availability"] = (greeks.Epsilon != "Unknown").map(
    {True: "Epsilon Available", False: "Epsilon Missing"}
)

fig = px.sunburst(
    greeks.assign(Class=greeks.Class.map({0: "Class 0", 1: "Class 1"})),
    title="Class vs Epsilon Availability",
    path=["Epsilon Availability", "Class"],
    color_discrete_sequence=["#010D36", "#FF2079"],
    height=640,
    width=640,
)
fig.update_traces(
    insidetextorientation="horizontal",
    texttemplate="%{label}<br>%{value} - %{percentParent}",
    marker_line_width=5,
    marker_line_color=BACKGROUND_COLOR,
)
fig.update_layout(
    font_color=FONT_COLOR,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>An unknown time of sample collection occurs only together with the negative class, so machine learning models could pick up on that association, which is not what we want. Additionally, accurate imputation is probably not possible here.</li>
</ul>
</blockquote>

In [49]:
df = greeks[["Epsilon", "Class"]].copy()  # Copy to avoid SettingWithCopyWarning.
df["Epsilon"] = (
    pd.to_datetime(df.Epsilon, errors="coerce")
    .apply(pd.Timestamp.toordinal)
    .replace(1, np.nan)
    .transform(lambda x: (x - x.min()) / (x.max() - x.min()))
)
df = df.dropna()

epsilon = df.Epsilon.to_numpy()[:, np.newaxis]
target = df.Class.to_numpy()

mutual_info = mutual_info_classif(epsilon, target, random_state=42)
f_stat, p_value = f_classif(epsilon, target)


In [50]:
print(CLR + "Mutual Information: ", RED + f"{mutual_info[0]:.2f}")
print(
    CLR + "ANOVA Test - F-statistic: ",
    RED + f"{f_stat[0]:.2f}",
)
print(
    CLR + "ANOVA Test - p-value associated with the F-statistic: ",
    RED + f"{p_value[0]:.2e}",
)


[1m[30mMutual Information:  [1m[31m0.18
[1m[30mANOVA Test - F-statistic:  [1m[31m25.89
[1m[30mANOVA Test - p-value associated with the F-statistic:  [1m[31m5.24e-07


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>Mutual information is a measure of the dependence between two random variables that are sampled simultaneously. Intuitively, it says how much knowing one variable tells us about the other. Mutual information is equal to zero if and only if the two variables are statistically independent. We got a value of around $0.18$. For reference, <code>DU</code>, the top feature in the earlier mutual information comparison, scored $0.07$, but that figure was normalised to sum to one over all features, so the two values are not directly comparable.</li>
    <li>In the ANOVA test, the null hypothesis is that the feature has the same mean in every class, i.e. that there is no relationship between the feature and the target variable. The p-value we got is extremely small, which is strong evidence against the null hypothesis. Typically, a p-value smaller than $0.05$ is considered statistically significant.</li>
    <li>To summarize, <code>Epsilon</code> has a statistically significant relationship with <code>Class</code>.</li>
</ul>
</blockquote>
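To get a feeling for the two statistics discussed above, the following sketch compares a feature that depends on the target with one that does not. Everything here is synthetic and assumed for illustration: the sample size, the noise level, and the variable names are not taken from the dataset.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic binary target and two features (assumption for illustration).
target_demo = rng.integers(0, 2, size=500)
dependent = target_demo + rng.normal(scale=0.5, size=500)  # Mean shifts with the class.
independent = rng.normal(size=500)  # Unrelated to the class.
features_demo = np.column_stack([dependent, independent])

# Mutual information estimates (in nats) and the ANOVA F-test, one value per column.
mi_demo = mutual_info_classif(features_demo, target_demo, random_state=0)
f_demo, p_demo = f_classif(features_demo, target_demo)

print("MI:     ", mi_demo.round(3))
print("F-stat: ", f_demo.round(2))
print("p-value:", p_demo)
```

The dependent feature should get a clearly larger mutual information estimate and a tiny p-value, while the estimates for the independent one should be close to zero and its p-value unremarkable.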

# <p style="padding: 15px; background-color: #010D36; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Preprocessing Pipeline</p>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Notes</b> 📜
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>In the preprocessing pipeline, we apply the transformations found earlier and binarize the semi-constant variables. Missing values in the continuous features are filled with KNN imputation.</li>
</ul>
</blockquote>
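As a small illustration of the binarization step for a semi-constant variable, the sketch below imputes a missing value with the median and then thresholds the column. The values and the threshold of <code>1.0</code> are hypothetical and chosen only for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Binarizer

# Hypothetical semi-constant column: mostly one value, a few larger ones, one missing.
column = np.array([[0.0], [0.0], [np.nan], [0.0], [3.2], [0.0], [7.5]])

# Median imputation first, then values strictly above the threshold become 1, the rest 0.
binarize = make_pipeline(
    SimpleImputer(strategy="median"),
    Binarizer(threshold=1.0),  # Hypothetical threshold.
)
flags = binarize.fit_transform(column).ravel()

print(flags)
```

The missing entry is filled with the median (here the dominant value), so it ends up in the same bin as the majority of samples.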

In [51]:
semi_const_cols = semi_const_cols_thresholds.keys()

# We don't have square root transformations.
no_transform_cols = no_transform_cols.drop(semi_const_cols, errors="ignore")
log_transform_cols = log_transform_cols.drop(semi_const_cols, errors="ignore")
reciprocal_transform_cols = reciprocal_transform_cols.drop(semi_const_cols, errors="ignore")
boxcox_transform_cols = boxcox_transform_cols.drop(semi_const_cols, errors="ignore")
yeojohnson_transform_cols = yeojohnson_transform_cols.drop(semi_const_cols, errors="ignore")

preliminary_preprocess = make_pipeline(
    make_column_transformer(
        (
            StandardScaler(),
            no_transform_cols.to_list(),
        ),
        (
            make_pipeline(
                FunctionTransformer(func=np.log, feature_names_out="one-to-one"),
                StandardScaler(),
            ),
            log_transform_cols.to_list(),
        ),
        (
            make_pipeline(
                FunctionTransformer(func=np.reciprocal, feature_names_out="one-to-one"),
                StandardScaler(),
            ),
            reciprocal_transform_cols.to_list(),
        ),
        (
            PowerTransformer(method="box-cox", standardize=True),
            boxcox_transform_cols.to_list(),
        ),
        (
            PowerTransformer(method="yeo-johnson", standardize=True),
            yeojohnson_transform_cols.to_list(),
        ),
        (
            make_pipeline(
                SimpleImputer(strategy="most_frequent"),
                OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
            ),
            make_column_selector(dtype_include=object),  # type: ignore
        ),
        *[
            (
                make_pipeline(
                    SimpleImputer(strategy="median"),
                    Binarizer(threshold=thresh),
                ),
                [col],
            )
            for col, thresh in semi_const_cols_thresholds.items()
        ],
        remainder="drop",
        verbose_feature_names_out=False,
    ),
    KNNImputer(n_neighbors=10, weights="distance"),
).set_output(transform="pandas")


In [52]:
X_preliminary = preliminary_preprocess.fit_transform(train.drop("Class", axis=1))

# No infinite or missing values should remain after preprocessing.
assert np.isfinite(X_preliminary.to_numpy()).all()
assert not X_preliminary.isna().to_numpy().any()

X_preliminary.head().style.set_table_styles(DF_STYLE).format(precision=3)


Id,CW,DU,EL,EU,FD,FS,AB,AM,BN,BQ,CB,GI,GL,AF,AX,AZ,BD,BP,BR,CC,CD,CF,CH,CR,CS,CU,DA,DE,DH,DI,DL,DN,DY,EB,EE,EG,EH,FC,FE,FI,FL,FR,GB,GF,GH,EJ,AH,AR,AY,BC,BZ,CL,DF,DV,EP,GE
000ff2bfdfe9,0.618,1.406,-0.808,-1.194,1.295,-0.847,-0.999,0.087,0.29,0.859,0.189,0.728,-0.848,-0.011,-3.893,-0.095,-0.579,-0.244,0.314,-0.512,-2.769,-0.665,-0.613,-3.55,-3.012,-0.073,0.911,-0.067,-0.717,-0.944,-0.336,0.509,0.022,-0.201,-0.419,-0.021,1.919,-2.015,0.335,-2.683,0.913,0.95,-1.15,-0.913,-0.969,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
007255e47698,0.705,-1.073,0.915,0.756,-1.095,0.88,-1.77,0.652,-0.635,-1.209,-0.358,-0.327,1.167,-1.514,-0.873,0.722,0.343,-0.586,-2.413,-1.036,-1.033,-0.403,0.321,1.371,-0.623,0.033,0.98,-0.769,0.045,-0.368,-0.655,1.342,-0.392,-1.476,-1.518,-0.489,-0.976,-1.438,-0.127,0.151,-1.133,-1.352,-1.587,1.048,-0.143,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
013f2bd269f5,-0.39,0.895,0.915,-0.939,1.186,1.599,0.434,0.511,1.567,1.249,-0.264,-0.214,-0.652,-0.269,0.674,0.576,0.133,-1.186,-0.242,-0.957,0.14,-0.548,0.767,-0.056,0.401,-0.667,0.981,0.049,-1.509,-0.16,-1.05,0.32,0.758,-0.006,1.936,0.055,1.251,1.813,0.211,0.572,0.951,-0.091,1.559,0.421,-0.265,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
043ac50845d5,-0.385,1.156,-0.773,0.374,1.117,0.212,-0.638,1.333,0.623,-1.419,-1.359,1.132,-0.745,0.334,-0.838,0.19,-0.544,0.509,0.085,0.33,0.196,-1.416,0.068,-0.32,0.533,-1.317,-0.07,-0.635,-0.634,0.21,-0.813,-0.151,1.383,-0.165,0.624,4.297,1.668,0.633,0.622,1.557,0.787,-1.352,-0.046,-0.887,0.902,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
044fb8a146ec,0.495,0.852,0.915,0.74,0.704,-0.605,0.092,-0.521,0.736,0.838,0.779,-0.177,-0.935,0.294,-0.677,-1.871,0.468,1.151,2.518,-0.683,-0.231,1.721,-0.355,-0.084,-0.264,-1.071,1.108,-0.61,-1.54,-0.689,-1.56,0.067,-1.149,-0.178,0.476,-0.059,0.41,-0.407,1.14,1.208,0.991,3.091,-0.322,0.049,1.356,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0


In [53]:
print(
    CLR + "Training dataset shape before preprocessing:",
    RED + f"{train.drop('Class', axis=1).shape}",
)
print(
    CLR + "Training dataset shape after preprocessing: ",
    RED + f"{X_preliminary.shape}",
)


Training dataset shape before preprocessing: (617, 56)
Training dataset shape after preprocessing:  (617, 56)


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
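    <li>The pipeline runs without errors: the output contains no missing or infinite values, and the number of columns ($56$) is unchanged.</li>
</ul>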
</blockquote>

# <p style="padding: 15px; background-color: #010D36; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Balanced Learning with LGBM &amp; XGB Ensemble</p>

<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Notes</b> 📜
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>This section builds the LightGBM and XGBoost ensemble. To handle the imbalanced dataset, we conduct balanced learning with an undersampling strategy - see: <a href="https://www.kaggle.com/competitions/icr-identify-age-related-conditions/discussion/412507" style="color: #01CBEE;"><b>How To Balance Training And Boost CV and LB Score!</b></a> and <a href="https://www.kaggle.com/code/cdeotte/rapids-cuml-svc-baseline-lb-0-27-cv-0-35?scriptVersionId=130753675" style="color: #01CBEE;"><b>RAPIDS cuML SVC Baseline</b></a>.</li>
    <li>Before we move on to the learning loop, we need several simple utility functions. The first, <code>get_undersampling_fraction()</code>, computes the fraction of negative-class samples to drop so that the training subset is perfectly balanced. The second, <code>assert_balanced_learning()</code>, verifies that the training subset is actually balanced (it is optional). The third, <code>get_sample_weights()</code>, provides sample weights in case we decide to depart from perfect balance (it is not used here so far). The last, <code>perform_proba_postprocessing()</code>, is experimental. It performs postprocessing on predicted probabilities.</li>
</ul>
</blockquote>
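
The arithmetic behind the undersampling fraction is easy to check by hand. The sketch below uses a hypothetical toy target with 8 negatives and 2 positives. It also shows inverse-frequency sample weights as the alternative to dropping rows. All names here are illustrative.

```python
import numpy as np
import pandas as pd

# Toy imbalanced target: 8 negatives and 2 positives (hypothetical counts).
y_toy = pd.Series([0] * 8 + [1] * 2)
n_neg, n_pos = np.bincount(y_toy)

# Fraction of negatives to drop so that both classes end up the same size.
drop_frac = 1 - n_pos / n_neg

to_drop = y_toy[y_toy == 0].sample(frac=drop_frac, random_state=0).index
y_balanced = y_toy.drop(to_drop)

print(f"Fraction of negatives to drop: {drop_frac:.2f}")
print("Class counts after undersampling:", y_balanced.value_counts().to_dict())

# Alternative to dropping rows: weight each class inversely to its frequency,
# so that both classes carry the same total weight.
w_neg = (n_neg + n_pos) / n_neg
w_pos = (n_neg + n_pos) / n_pos
sample_weights = np.where(y_toy == 1, w_pos, w_neg)
```

With these counts, $1 - 2/8 = 0.75$ of the negatives are dropped, which leaves two samples per class.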

In [54]:
def get_undersampling_fraction(y_true):
    N0, N1 = np.bincount(y_true)
    return 1 - N1 / N0


def assert_balanced_learning(y_train, n_samples_tol=1):
    N0, N1 = np.bincount(y_train)
    assert np.isclose(N0, N1, atol=n_samples_tol)


def get_sample_weights(y_true, weights=None):
    """Pass `weights` tuple as `(weight_class_0, weight_class_1)`
    if you want to use custom weights."""
    N0, N1 = np.bincount(y_true)
    y0, y1 = np.unique(y_true)

    if weights:
        w0, w1 = weights
        return np.where(y_true == y1, w1, w0)

    w0 = (N0 + N1) / N0
    w1 = (N0 + N1) / N1

    return np.where(y_true == y1, w1, w0)


def perform_proba_postprocessing(
    y_proba,
    rounding=True,
    rounding_prec=4,
    boosting=True,
    boosting_coef=0.8,
    shifting=True,
    shifting_map=None,
):
    """Experimental postprocessing. It will most likely do nothing or make the scores worse."""

    def my_ceil(x, prec=rounding_prec):
        return np.true_divide(np.ceil(x * 10**prec), 10**prec)

    def my_floor(x, prec=rounding_prec):
        return np.true_divide(np.floor(x * 10**prec), 10**prec)

    proba = y_proba.copy()

    if rounding:
        proba = np.where(proba > 0.5, my_floor(proba), my_ceil(proba))

    if boosting:
        odds = boosting_coef * proba / (1 - proba)
        proba = odds / (1 + odds)

    if shifting:
        if not shifting_map:
            shifting_map = {"low": (0.01, 0.02), "high": (0.99, 0.98)}
        low_shift_from, low_shift_to = shifting_map.get("low", (0.01, 0.02))
        high_shift_from, high_shift_to = shifting_map.get("high", (0.99, 0.98))
        proba[proba < low_shift_from] = low_shift_to
        proba[proba > high_shift_from] = high_shift_to

    return proba
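
The odds-scaling and shifting steps of the postprocessing can be traced on a few numbers. This is a simplified sketch of those two steps only, with hypothetical input probabilities; `scale_odds` is an illustrative helper, not part of the notebook.

```python
import numpy as np


def scale_odds(proba, coef=0.8):
    """Multiply the odds p / (1 - p) by `coef` and convert back to a probability."""
    odds = coef * proba / (1 - proba)
    return odds / (1 + odds)


proba_toy = np.array([0.005, 0.10, 0.50, 0.90, 0.995])

# A coefficient below 1 pulls every probability towards the negative class.
scaled = scale_odds(proba_toy)

# Replace overconfident values with slightly safer ones.
shifted = scaled.copy()
shifted[shifted < 0.01] = 0.02
shifted[shifted > 0.99] = 0.98

for before, after in zip(proba_toy, shifted):
    print(f"{before:.3f} -> {after:.3f}")
```

Note that a probability of $0.5$ becomes $0.8 / 1.8 \approx 0.444$, so the decision threshold effectively moves as well.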


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Notes</b> 📜
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>During training, we use two tree-based models: LightGBM and XGBoost. We run $10$-fold cross-validation and repeat the whole process with $20$ different seeds to get a more reliable outcome.</li>
</ul>
</blockquote>
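
The repeated cross-validation scheme can be sketched in a few lines. The example below is a simplified illustration: it uses synthetic data from `make_classification` and a `LogisticRegression` as a stand-in model, with fewer bags and folds than the notebook, and no undersampling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data (stand-in for the real training set).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

n_bags_toy, n_folds_toy = 3, 5
seeds_toy = np.random.default_rng(42).integers(0, 10_000, size=n_bags_toy)

# Every sample lands in exactly one validation fold per bag,
# so it collects one out-of-fold prediction per bag.
oof_sum = np.zeros(len(y_toy), dtype=np.float64)

for seed in seeds_toy:
    skf = StratifiedKFold(n_splits=n_folds_toy, shuffle=True, random_state=int(seed))
    for train_idx, valid_idx in skf.split(X_toy, y_toy):
        model = LogisticRegression(max_iter=1000)
        model.fit(X_toy[train_idx], y_toy[train_idx])
        oof_sum[valid_idx] += model.predict_proba(X_toy[valid_idx])[:, 1]

# Average over bags to get the final out-of-fold probability per sample.
oof_avg = oof_sum / n_bags_toy
```

Averaging over differently seeded splits reduces the dependence of the CV estimate on one particular fold assignment.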

In [55]:
n_bags = 20
n_folds = 10

np.random.seed(42)
seeds = np.random.randint(0, 19937, size=n_bags)

X = train.drop("Class", axis=1)
y = train.Class

lgbm_params = {
    "max_depth": 4,
    "num_leaves": 9,
    "min_child_samples": 17,
    "n_estimators": 200,
    "learning_rate": 0.15,
    "colsample_bytree": 0.4,
    "min_split_gain": 1e-4,
    "reg_alpha": 1e-2,
    "reg_lambda": 5e-3,
}

xgb_params = {
    "max_depth": 2,
    "n_estimators": 200,
    "learning_rate": 0.4,
    "subsample": 0.6,
    "min_child_weight": 0.1,
    "max_delta_step": 0.35,
    "colsample_bytree": 0.3,
    "colsample_bylevel": 0.7,
    "min_split_loss": 1e-4,
    "reg_alpha": 2e-3,
    "reg_lambda": 6e-2,
}

svc_params = {
    "probability": True,
    "C": 3,
}


In [56]:
undersampling_frac = get_undersampling_fraction(y)
y_proba = np.zeros_like(y, dtype=np.float64)
classifiers = defaultdict(object)

for bag, seed in enumerate(seeds):
    skfold = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)

    for fold, (train_ids, valid_ids) in enumerate(skfold.split(X, y)):
        y_train_full = y.iloc[train_ids]
        to_undersample_ids = (
            y_train_full[y_train_full == 0]
            .sample(frac=undersampling_frac, random_state=seed)
            .index.to_numpy()
        )
        # StratifiedKFold yields positional indices, but `y` is indexed by sample IDs, so we map the IDs back to positions.
        to_undersample_ids = [y.index.get_loc(idx) for idx in to_undersample_ids]
        train_ids = np.setdiff1d(train_ids, to_undersample_ids)

        X_train, y_train = X.iloc[train_ids], y.iloc[train_ids]
        X_valid, y_valid = X.iloc[valid_ids], y.iloc[valid_ids]

        assert_balanced_learning(y_train)

        current_ensemble = make_pipeline(
            preliminary_preprocess,
            VotingClassifier(
                [
                    ("lgbm", LGBMClassifier(random_state=seed, **lgbm_params)),
                    ("xgb", XGBClassifier(random_state=seed, **xgb_params)),
                    ("svc", SVC(random_state=seed, **svc_params)),
                ],
                voting="soft",
                weights=(0.45, 0.45, 0.10),
            ),
        ).fit(X_train, y_train)

        y_proba[valid_ids] += current_ensemble.predict_proba(X_valid)[:, 1]
        classifiers[f"Voting Bag: {bag} Fold: {fold}"] = current_ensemble

y_proba_averaged = y_proba / n_bags


In [57]:
print(CLR + "Balanced Log Loss:", RED + f"{balanced_log_loss(y, y_proba_averaged):.5f}")
print(CLR + "Brier Score Loss: ", RED + f"{brier_score_loss(y, y_proba_averaged):.5f}")


Balanced Log Loss: 0.22175
Brier Score Loss:  0.06773


In [58]:
y_proba_postprocessed = perform_proba_postprocessing(y_proba_averaged)
print(
    CLR + "Postprocessed Balanced Log Loss:",
    RED + f"{balanced_log_loss(y, y_proba_postprocessed):.5f}",
)
print(
    CLR + "Postprocessed Brier Score Loss: ",
    RED + f"{brier_score_loss(y, y_proba_postprocessed):.5f}",
)


Postprocessed Balanced Log Loss: 0.21832
Postprocessed Brier Score Loss:  0.06201


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>I didn't mention it before, but I decided to add <code>SVC</code> to the ensemble with a low weight. Why? On its own, <code>SVC</code> scores worse than LGBM and XGB, i.e. around $0.29$. However, <code>SVC</code> doesn't make mistakes as severe as theirs, because it doesn't predict probabilities as extreme as those close to $0$ or $1$. This means the <code>SVC</code> component can be treated as a regularizer for the tree-based ensemble.</li>
    <li>As far as I checked, adding <code>SVC</code> improves the local CV, but the improvement isn't reflected on the LB.</li>
    <li>Additionally, as you can see, postprocessing helps on the training dataset (balanced log loss drops from $0.22175$ to $0.21832$) but may be risky on the test set.</li>
</ul>
</blockquote>
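
The regularizing effect of a low-weight, less extreme model can be shown with simple arithmetic. The probabilities below are made up for a single hypothetical sample whose true class is 0. Only the voting weights match the ensemble above.

```python
import numpy as np

# Hypothetical positive-class probabilities for one sample with true class 0.
p_lgbm, p_xgb, p_svc = 0.999, 0.998, 0.70
weights = np.array([0.45, 0.45, 0.10])  # soft-voting weights used above

p_trees_only = np.average([p_lgbm, p_xgb])
p_with_svc = np.average([p_lgbm, p_xgb, p_svc], weights=weights)

# Log loss contribution of a negative sample is -log(1 - p).
loss_trees_only = -np.log(1 - p_trees_only)
loss_with_svc = -np.log(1 - p_with_svc)

print(f"Trees only: p = {p_trees_only:.4f}, loss = {loss_trees_only:.2f}")
print(f"With SVC:   p = {p_with_svc:.4f}, loss = {loss_with_svc:.2f}")
```

Even a $10\%$ weight on a moderate prediction moves the blended probability noticeably away from $1$, which roughly halves the penalty for this confident mistake.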

In [59]:
y_proba_frame = pd.DataFrame(
    {
        "Sample Integer Index": np.arange(0, len(y)),
        "Positive Class Probability": y_proba_averaged,
        "Class": y.values.astype(str),
    },
    index=y.index,
)

fig = px.scatter(
    y_proba_frame.reset_index(),
    x="Positive Class Probability",
    y="Sample Integer Index",
    symbol="Class",
    symbol_sequence=["diamond", "circle"],
    color="Class",
    color_discrete_sequence=["#010D36", "#FF2079"],
    category_orders={"Class": ("0", "1")},
    hover_data=["Id"],
    opacity=0.6,
    height=540,
    width=840,
    title="Training Dataset - Out of Fold Predictions",
)
fig.update_layout(
    font_color=FONT_COLOR,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    legend=dict(
        orientation="h",
        yanchor="bottom",
        xanchor="right",
        y=1.05,
        x=1,
        title="Class",
        itemsizing="constant",
    ),
    xaxis_range=[-0.02, 1.02],
)
fig.update_traces(marker_size=6)
fig.show()


In [60]:
fatal_mistake_ids = (
    (y_proba_frame["Positive Class Probability"] < 0.05)
    & (y_proba_frame["Class"] == "1")
) | (
    (y_proba_frame["Positive Class Probability"] > 0.95)
    & (y_proba_frame["Class"] == "0")
)

y_proba_frame[fatal_mistake_ids].style.set_table_styles(DF_STYLE).format(precision=3)


Id,Sample Integer Index,Positive Class Probability,Class
2901ef1394b9,102,0.994,0
7416fea10b6b,292,0.985,0
cf5439add02c,509,0.006,1


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>As you can see, the model made several terrible mistakes. Especially painful are the samples with integer indices $102$, $292$ and $509$. The model is virtually certain about its predictions here, and balanced log loss punishes that severely. I've tried to examine these samples with different methods, such as Isolation Forest, Local Outlier Factor, and checking which percentiles their values fall into for the most important features. However, it seems they are not outliers after all, or I didn't look closely enough.</li>
    <li>Now it's obvious why the competition says that XGBoost or Random Forest are not sufficient. The mistakes made on these three samples are too costly.</li>
    <li>Let's also have a look at other metrics.</li>
</ul>
</blockquote>
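
To see how hard a single confident mistake is punished, here is a minimal sketch of a balanced log loss. It assumes the metric is the average of the per-class mean log losses, so both classes carry equal weight; `balanced_log_loss_sketch` is an illustrative stand-in and not necessarily identical to the `balanced_log_loss` function used in this notebook. The toy labels and probabilities are hypothetical.

```python
import numpy as np


def balanced_log_loss_sketch(y_true, p_pos, eps=1e-15):
    """Average of the mean log loss computed separately for each class."""
    y_true = np.asarray(y_true)
    p_pos = np.clip(np.asarray(p_pos, dtype=np.float64), eps, 1 - eps)
    loss_neg = -np.mean(np.log(1 - p_pos[y_true == 0]))
    loss_pos = -np.mean(np.log(p_pos[y_true == 1]))
    return (loss_neg + loss_pos) / 2


# Toy target: 8 negatives, 2 positives, all predicted reasonably well.
y_toy = np.array([0] * 8 + [1] * 2)
p_good = np.where(y_toy == 1, 0.9, 0.1)

# Same predictions, but one positive sample gets a near-zero probability.
p_one_mistake = p_good.copy()
p_one_mistake[-1] = 0.006

loss_good = balanced_log_loss_sketch(y_toy, p_good)
loss_mistake = balanced_log_loss_sketch(y_toy, p_one_mistake)
print(f"All good:    {loss_good:.3f}")
print(f"One mistake: {loss_mistake:.3f}")
```

Because the positive class is small, one overconfident error on a positive sample dominates that class's mean and therefore the whole score.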

In [61]:
scores = {}
predictions = np.where(y_proba_averaged > 0.5, 1, 0)

scores["Accuracy"] = accuracy_score(y, predictions)
scores["Precision"] = precision_score(y, predictions)
scores["Recall"] = recall_score(y, predictions)
scores["Specificity"] = recall_score(y, predictions, pos_label=0)
scores["F1"] = f1_score(y, predictions)
scores["ROC-AUC"] = roc_auc_score(y, y_proba_averaged)
metrics_for_bar = pd.DataFrame(scores, index=["Value"]).T

scores["ConfusionMatrix"] = confusion_matrix(y, predictions)
scores["FPR-TPR-Threshold"] = roc_curve(y, y_proba_averaged)

In [62]:
fig = px.bar(
    metrics_for_bar,
    text_auto=".2f",
    labels={"value": "Value", "index": "Metric"},
    title="LightGBM & XGBoost Ensemble - Metrics Summary",
    color_discrete_sequence=["#010D36"],
    height=540,
    width=840,
    opacity=0.8,
    orientation="h",
)
fig.update_layout(
    font_color=FONT_COLOR,
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    bargap=0.5,
    showlegend=False,
    yaxis_categoryorder="total ascending",
)
fig.show()


In [63]:
fig = make_subplots(rows=1, cols=2)

fig.add_scatter(
    x=scores["FPR-TPR-Threshold"][0],
    y=scores["FPR-TPR-Threshold"][1],
    name="Ensemble",
    mode="markers+lines",
    line_color="#010D36",
    marker=dict(size=6, color="#FF2079", opacity=0.7, symbol="x"),
    showlegend=False,
    row=1,
    col=1,
)
fig.add_scatter(
    x=[0, 1],
    y=[0, 1],
    name="Dummy Classifier",
    mode="lines",
    line=dict(dash="longdash", color="#010D36"),
    showlegend=False,
    row=1,
    col=1,
)
fig.update_yaxes(
    scaleanchor="x",
    scaleratio=1,
    range=(-0.01, 1.01),
    title="True Positive Rate (Recall)",
    row=1,
    col=1,
)
fig.update_xaxes(
    scaleanchor="y",
    scaleratio=1,
    range=(-0.01, 1.01),
    title="False Positive Rate (Fall-Out)",
    row=1,
    col=1,
)

fig.add_heatmap(
    z=scores["ConfusionMatrix"],
    x=["Class 0", "Class 1"],
    y=["Class 0", "Class 1"],
    name="ConfusionMatrix",
    text=scores["ConfusionMatrix"],
    texttemplate="%{text}",
    xgap=20,
    ygap=20,
    showscale=True,
    colorscale=[[0.0, "#010D36"], [1.0, "#FF2079"]],
    row=1,
    col=2,
)
fig.update_yaxes(
    title="True Label",
    autorange="reversed",
    tickangle=-90,
    row=1,
    col=2,
)
fig.update_xaxes(
    title="Predicted Label",
    row=1,
    col=2,
)
fig.update_layout(
    font_color=FONT_COLOR,
    title="LightGBM & XGBoost Ensemble - ROC Curve & Confusion Matrix",
    title_font_size=18,
    plot_bgcolor=BACKGROUND_COLOR,
    paper_bgcolor=BACKGROUND_COLOR,
    height=480,
    width=840,
)
fig.show()


<p style="
    font-size: 20px;
    font-family: 'JetBrains Mono';
    color: #3E3F4C;
    border-bottom: 3px solid #01CBEE;
">
    <b>Observations</b> 📔
</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>The first thing that catches the eye is the poor precision of the model, which means that a large share of the samples predicted as positive are actually negative. On the other hand, the recall is high, which means the model rarely misclassifies a positive sample as negative. Such results are probably caused by undersampling, where we haven't made full use of the negative class.</li>
</ul>
</blockquote>
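
The relationship between these metrics and the confusion matrix can be checked on a tiny example. The labels and predictions below are hypothetical and chosen to mimic the pattern described above: every positive is found, but at the cost of several false positives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions: 8 negatives, 2 positives.
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# For binary labels, ravel() yields TN, FP, FN, TP in this order.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)    # share of predicted positives that are correct
recall = tp / (tp + fn)       # share of actual positives that were found
specificity = tn / (tn + fp)  # share of actual negatives that were found

print(f"Precision: {precision:.3f}, Recall: {recall:.3f}, Specificity: {specificity:.3f}")
```

Here the recall is perfect while the precision is only $0.4$, because three of the five positive predictions are false alarms.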

In [64]:
# Dummy protection for an empty test dataset.
if np.all(np.isclose(test.select_dtypes("number").sum(), 0)):
    test_numeric_cols = test.select_dtypes("number").columns
    test[test_numeric_cols] += 1e-9

test_ids = test.index
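# Use an explicit float accumulator; `zeros_like` would inherit the dtype of the ID index.
y_test = np.zeros(len(test_ids), dtype=np.float64)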

for classifier in classifiers.values():
    # Each classifier contains preprocessing, so we pass raw test dataset.
    y_test += classifier.predict_proba(test)[:, 1]

y_test_averaged = y_test / len(classifiers)

submission = pd.DataFrame(
    {
        "Id": test_ids,
        "class_0": 1 - y_test_averaged,
        "class_1": y_test_averaged,
    }
).set_index("Id")

submission.to_csv("submission.csv")
submission.head().style.set_table_styles(DF_STYLE)


Id,class_0,class_1
00eed32682bb,0.539577,0.460423
010ebe33f668,0.539577,0.460423
02fa521e1838,0.539577,0.460423
040e15f562a2,0.539577,0.460423
046e85c7cc7f,0.539577,0.460423


# <p style="padding: 15px; background-color: #010D36; font-family: 'JetBrains Mono'; font-weight: bold; font-size: 100%; color: #f2f2f0; letter-spacing: 2px; text-align: center; border-radius: 8px;">Summary</p>

<blockquote style="
    margin-right: auto; 
    margin-left: auto; 
    background-color: #010D36; 
    padding: 15px; 
    border-radius: 8px;
    border-left: none;
">
<ul style="
    font-size: 16px;
    font-family: 'JetBrains Mono';
    color: #f2f2f0;
    margin-left: 8px;
    margin-right: 8px;
    margin-top: 4px; 
    margin-bottom: 4px;
">
    <li>Since the dataset is imbalanced, we've built an ensemble of LightGBM and XGBoost classifiers (plus a low-weight <code>SVC</code>) using an undersampling strategy. To get more reliable results, we repeated the training process $20$ times with different seeds. The local CV result for the training dataset is $0.22$, which is not great but not terrible. Slightly more worrying is that we got an LB score of $0.16$, which indicates a gap between the training and test sets.</li>
    <li>I browsed through several posts by other people, and it looks like the CV score often isn't reflected on the LB. On the other hand, some of them have nicely matching scores. Remember that we used only one strategy, undersampling. It seems that class weighting works better. I checked that in my private notebook and got better CV results, at the level of $0.20$. Moreover, these results do not include feature selection, which may significantly impact the final score. So this step should be the second stage of our preprocessing pipeline.</li>
    <li>The <code>Epsilon</code> variable looks attractive to include in the training process, but on the whole, it may be risky to do so. We don't really know anything about it, and we don't know what the trend will be in the future.</li>
    <li>The purpose of this notebook was to provide you with a pleasant overview of the available datasets. I hope you didn't get bored. If you have any questions or notice something wrong, let me know in the comments.</li>
    <li>If you like my work, please upvote. I appreciate that. I also encourage you to check my other notebooks. I try to work on different things so I don't get bored.</li>
</ul>
</blockquote>
