<a id="overview"></a>
# Overview 🧐
<img src="https://i.imgur.com/wC22WgL.jpeg" width="600"><br>
We will predict mortality from heart failure using the 12 features in the data set. Such a model could help hospitals assess the severity of patients with cardiovascular diseases (CVDs).<br>

I have also run a similar analysis in R, if you would like to take a look: [Heart Failure❣️ EDA & Prediction with R (91.5%acc)](https://www.kaggle.com/snowpea8/heart-failure-eda-prediction-with-r-91-5-acc).<br>

In this notebook, we first explore and visualize the data to gain insights. We then split the data into a training set and a test set, train several machine learning models on the training set, and evaluate each model with cross-validation. Finally, we ensemble the models to improve accuracy.

# Table of contents 📖
* [Overview 🧐](#overview)
* [Setup 💻](#setup)
* [Load CSV data 📃](#load)
* [Explore CSV data 📊](#explore)
    * [Distribution of the binary features](#binary)
    * [Distribution of the numeric features](#numeric)
* [Data preprocessing 🧹](#preprocessing)
* [Train models and make predictions 💭](#models)
    * [LightGBM 🌳](#gbm)
    * [XGBoost 🌳](#xgb)
    * [CatBoost 🌳](#cat)
    * [Random forest 🌳](#rf)
    * [Extremely randomized trees 🌳](#ert)
    * [Linear model 📈](#lm)
    * [Deep learning 🧠](#dl)
* [Simple ensemble 🤝](#ensemble)

<a id="setup"></a>
# Setup 💻
All random seeds are fixed at 0.<br>

In [None]:
import os
import random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.graph_objects as go
from mlxtend.plotting import plot_confusion_matrix

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler
import tensorflow as tf

import optuna
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier

def seed_everything(seed=2020):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
seed_everything(0)

sns.set_style("whitegrid")
palette_ro = ["#ee2f35", "#fa7211", "#fbd600", "#75c731", "#1fb86e", "#0488cf", "#7b44ab"]

ROOT = "../input/heart-failure-clinical-data"

<a id="load"></a>
# Load CSV data 📃

In [None]:
df = pd.read_csv(ROOT + "/heart_failure_clinical_records_dataset.csv")

print("Data shape: ", df.shape)
df.head()

Dataset from: [Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5)

* `age` - Age of the patient (years)
* `anaemia` - Decrease of red blood cells or hemoglobin (boolean) (0:`False`, 1:`True`)
* `creatinine_phosphokinase` - Level of the CPK enzyme in the blood (mcg/L). CPK plays an important role in the energy metabolism of muscle cells.
* `diabetes` - Whether the patient has diabetes (boolean) (0:`False`, 1:`True`)
* `ejection_fraction` - Percentage of blood leaving the heart at each contraction, i.e. the volume ejected per beat divided by the left-ventricular volume when the heart is dilated (percentage). Emphasized in the original paper.
* `high_blood_pressure` - Whether the patient has hypertension (boolean) (0:`False`, 1:`True`)
* `platelets` - Platelets in the blood (kiloplatelets/mL)
* `serum_creatinine` - Level of creatinine in the blood serum (mg/dL). Creatinine is excreted through the glomeruli of the kidneys. Emphasized in the original paper.
* `serum_sodium` - Level of sodium in the blood serum (mEq/L)
* `sex` - Woman or man (binary) (0: Woman, 1: Man)
* `smoking` - Whether the patient smokes (boolean) (0:`False`, 1:`True`)
* `time` - Follow-up period (days)
* `DEATH_EVENT` - Whether the patient died during the follow-up period (boolean) (0:`False`, 1:`True`)

<a id="explore"></a>
# Explore CSV data 📊

In [None]:
df.isnull().sum()

<a id="binary"></a>
## Distribution of the binary features

In [None]:
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, figsize=(16, 12))

sns.countplot(x="anaemia", ax=ax1, data=df,
              palette=palette_ro[3::-3], alpha=0.9)
sns.countplot(x="diabetes", ax=ax2, data=df,
              palette=palette_ro[3::-3], alpha=0.9)
sns.countplot(x="high_blood_pressure", ax=ax3, data=df,
              palette=palette_ro[3::-3], alpha=0.9)
sns.countplot(x="sex", ax=ax4, data=df,
              palette=palette_ro[2::3], alpha=0.9)
sns.countplot(x="smoking", ax=ax5, data=df,
              palette=palette_ro[3::-3], alpha=0.9)
sns.countplot(x="DEATH_EVENT", ax=ax6, data=df,
              palette=palette_ro[1::5], alpha=0.9)
fig.suptitle("Distribution of the binary features and DEATH_EVENT", fontsize=18);

Insights:

* The target variable is imbalanced: the two classes are not split 1:1.
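The imbalance noted above can also be put into numbers. The sketch below uses a small made-up frame (`toy`, an assumption standing in for the notebook's `df`) and reports the count and share of each class plus the majority-to-minority ratio.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; only the target column matters here.
toy = pd.DataFrame({"DEATH_EVENT": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]})

# Absolute and relative class frequencies, ordered by class label.
counts = toy["DEATH_EVENT"].value_counts().sort_index()
shares = toy["DEATH_EVENT"].value_counts(normalize=True).sort_index()

balance = pd.DataFrame({"count": counts, "share": shares})
print(balance)

# Majority-to-minority ratio; 1.0 would mean perfectly balanced classes.
imbalance_ratio = counts.max() / counts.min()
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

Running the same two `value_counts` calls on the real `df["DEATH_EVENT"]` column should give the figures behind the last panel of the count plot.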

In [None]:
bin_features = ["anaemia", "diabetes", "high_blood_pressure", "sex", "smoking"]

df_d = pd.DataFrame(columns=[0, 1, "value"])
for col in bin_features:
    for u in df[col].unique():
        key = col + "_" + str(u)
        # Percentage of each DEATH_EVENT class among patients with this feature value
        rates = df.loc[df[col] == u, "DEATH_EVENT"].value_counts(normalize=True) * 100
        label = "0 (False)" if u == 0 else "1 (True)"
        # Add the whole row at once instead of relying on chained assignment
        df_d.loc[key] = [rates.get(0, 0.0), rates.get(1, 0.0), label]

df_d = df_d.reindex(index=["anaemia_0", "anaemia_1", "diabetes_0", "diabetes_1", "high_blood_pressure_0", "high_blood_pressure_1",
                           "sex_0", "sex_1", "smoking_0", "smoking_1"])
df_d.at["sex_0", "value"] = "0 (Female)"
df_d.at["sex_1", "value"] = "1 (Male)"

fig = go.Figure(data=[
    go.Bar(y=[["anaemia", "anaemia","diabetes","diabetes","high_blood_pressure","high_blood_pressure","sex","sex","smoking","smoking"], list(df_d["value"])],
           x=df_d[0], name="DEATH_EVENT = 0<br>(survived)", orientation='h', marker=dict(color=palette_ro[1])),
    go.Bar(y=[["anaemia", "anaemia","diabetes","diabetes","high_blood_pressure","high_blood_pressure","sex","sex","smoking","smoking"], list(df_d["value"])],
           x=df_d[1], name="DEATH_EVENT = 1<br>(dead)", orientation='h', marker=dict(color=palette_ro[6]))
])
fig.update_layout(barmode="stack",
                  title="Percentage of DEATH_EVENT per binary features")
fig.update_yaxes(autorange="reversed")
fig.show(config={"displayModeBar": False})

Insights:

* For `diabetes`, `sex`, and `smoking`, the distribution of the target variable differs little between the two groups.
* For `anaemia` and `high_blood_pressure`, the distributions differ somewhat, but the plot alone does not tell us whether the difference is statistically significant.
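One way to judge whether such a difference is significant is a chi-square test of independence on the 2x2 table of feature value against `DEATH_EVENT`. The sketch below is not part of the original analysis: the helper `chi2_2x2` and the counts in `table` are hypothetical, and the test is the plain Pearson version without continuity correction.

```python
import math

def chi2_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts, NOT taken from the dataset:
# rows = feature value (0, 1), columns = DEATH_EVENT (0, 1).
table = [[120, 50], [83, 46]]
stat, p = chi2_2x2(table)
print(f"chi2 = {stat:.3f}, p = {p:.3f}")
```

With the notebook's data, a table of this shape could be built with `pd.crosstab(df["anaemia"], df["DEATH_EVENT"])`. A small p-value would suggest that the feature and the outcome are not independent.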

<a id="numeric"></a>
## Distribution of the numeric features

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

range_bin_width = np.arange(df["age"].min(), df["age"].max()+5, 5)

sns.distplot(df["age"], ax=ax1, bins=range_bin_width, color=palette_ro[5])
sns.distplot(df[df["DEATH_EVENT"]==0].age, label="DEATH_EVENT=0", ax=ax2, bins=range_bin_width, color=palette_ro[1])
sns.distplot(df[df["DEATH_EVENT"]==1].age, label="DEATH_EVENT=1", ax=ax2, bins=range_bin_width, color=palette_ro[6])
ax1.set_title("age distribution", fontsize=16);
ax2.set_title("Relationship between age and DEATH_EVENT", fontsize=16)

ax1.axvline(x=df["age"].median(), color=palette_ro[5], linestyle="--", alpha=0.5)
ax2.axvline(x=df[df["DEATH_EVENT"]==0].age.median(), color=palette_ro[1], linestyle="--", alpha=0.5)
ax2.axvline(x=df[df["DEATH_EVENT"]==1].age.median(), color=palette_ro[6], linestyle="--", alpha=0.5)

ax1.annotate("Min: {:.0f}".format(df["age"].min()), xy=(df["age"].min(), 0.010), 
             xytext=(df["age"].min()-7, 0.015),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.2"))
ax1.annotate("Max: {:.0f}".format(df["age"].max()), xy=(df["age"].max(), 0.005), 
             xytext=(df["age"].max(), 0.008),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=-0.2"))
ax1.annotate("Med: {:.0f}".format(df["age"].median()), xy=(df["age"].median(), 0.032), 
             xytext=(df["age"].median()-8, 0.035),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.25"))

ax2.annotate("Survived\nMed: {:.0f}".format(df[df["DEATH_EVENT"]==0].age.median()), xy=(df[df["DEATH_EVENT"]==0].age.median(), 0.033), 
             xytext=(df[df["DEATH_EVENT"]==0].age.median()-18, 0.035),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=-0.25"))
ax2.annotate("Dead\nMed: {:.0f}".format(df[df["DEATH_EVENT"]==1].age.median()), xy=(df[df["DEATH_EVENT"]==1].age.median(), 0.026), 
             xytext=(df[df["DEATH_EVENT"]==1].age.median()+7, 0.029),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.25"))
ax2.legend();

Insights:

* Patients are most numerous around age 60, and the counts fall off on either side in a roughly bell-shaped pattern.
* The age distribution differs by outcome: younger patients are less likely to die, and the two density curves cross at just under 70, above which the density for deaths is the higher one.

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

sns.distplot(df["creatinine_phosphokinase"], ax=ax1, color=palette_ro[5])
sns.distplot(df[df["DEATH_EVENT"]==0].creatinine_phosphokinase, label="DEATH_EVENT=0", ax=ax2, hist=None, color=palette_ro[1])
sns.distplot(df[df["DEATH_EVENT"]==1].creatinine_phosphokinase, label="DEATH_EVENT=1", ax=ax2, hist=None, color=palette_ro[6])
ax1.set_title("creatinine_phosphokinase distribution", fontsize=16);
ax2.set_title("Relationship between creatinine_phosphokinase and DEATH_EVENT", fontsize=16)

ax1.axvline(x=df["creatinine_phosphokinase"].median(), color=palette_ro[5], linestyle="--", alpha=0.5)
ax2.axvline(x=df[df["DEATH_EVENT"]==0].creatinine_phosphokinase.median(), color=palette_ro[1], linestyle="--", alpha=0.5)
ax2.axvline(x=df[df["DEATH_EVENT"]==1].creatinine_phosphokinase.median(), color=palette_ro[6], linestyle="--", alpha=0.5)

ax1.annotate("Min: {:,}".format(df["creatinine_phosphokinase"].min()), xy=(df["creatinine_phosphokinase"].min(), 0.00085), 
             xytext=(df["creatinine_phosphokinase"].min()-700, 0.0010),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.2"))
ax1.annotate("Max: {:,}".format(df["creatinine_phosphokinase"].max()), xy=(df["creatinine_phosphokinase"].max(), 0.00005), 
             xytext=(df["creatinine_phosphokinase"].max()-500, 0.0002),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.2"))
ax1.annotate("Med: {:.0f}".format(df["creatinine_phosphokinase"].median()), xy=(df["creatinine_phosphokinase"].median(), 0.0014), 
             xytext=(df["creatinine_phosphokinase"].median()+500, 0.0015),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.25"))

ax2.annotate("Survived\nMed: {:.0f}".format(df[df["DEATH_EVENT"]==0].creatinine_phosphokinase.median()), xy=(df[df["DEATH_EVENT"]==0].creatinine_phosphokinase.median(), 0.00145), 
             xytext=(df[df["DEATH_EVENT"]==0].creatinine_phosphokinase.median()+600, 0.00145),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.25"))
ax2.annotate("Dead\nMed: {:.0f}".format(df[df["DEATH_EVENT"]==1].creatinine_phosphokinase.median()), xy=(df[df["DEATH_EVENT"]==1].creatinine_phosphokinase.median(), 0.00135), 
             xytext=(df[df["DEATH_EVENT"]==1].creatinine_phosphokinase.median()+700, 0.00125),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=-0.25"))
ax2.legend();

Insights:

* The distribution is heavily right-skewed, with the highest value more than 30 times the median.
* Split by target variable, the medians are almost identical, although the shapes of the distributions differ somewhat.
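The degree of skew described above can be summarized with two numbers: the ratio of the maximum to the median, and the sample skewness. The values in this sketch are made up to mimic a long right tail; they are not the dataset's values.

```python
import pandas as pd

# Hypothetical right-skewed values standing in for a column such as creatinine_phosphokinase.
s = pd.Series([60, 110, 180, 250, 250, 320, 580, 1200, 7800], name="toy_cpk")

ratio = s.max() / s.median()   # how far the largest value sits above the median
skewness = s.skew()            # sample skewness; a positive value means a long right tail

print(f"median = {s.median():.0f}, max = {s.max():.0f}, max/median = {ratio:.1f}")
print(f"skewness = {skewness:.2f}")
```

A mean far above the median, as in this toy series, is another quick sign of a right-skewed distribution.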

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

range_bin_width = np.arange(df["ejection_fraction"].min(), df["ejection_fraction"].max()+1, 1)

sns.distplot(df["ejection_fraction"], ax=ax1, bins=range_bin_width, color=palette_ro[5])
sns.distplot(df[df["DEATH_EVENT"]==0].ejection_fraction, label="DEATH_EVENT=0", ax=ax2, bins=range_bin_width, color=palette_ro[1])
sns.distplot(df[df["DEATH_EVENT"]==1].ejection_fraction, label="DEATH_EVENT=1", ax=ax2, bins=range_bin_width, color=palette_ro[6])
ax1.set_title("ejection_fraction distribution", fontsize=16);
ax2.set_title("Relationship between ejection_fraction and DEATH_EVENT", fontsize=16)

ax1.axvline(x=df["ejection_fraction"].median(), color=palette_ro[5], linestyle="--", alpha=0.5)
ax2.axvline(x=df[df["DEATH_EVENT"]==0].ejection_fraction.median(), color=palette_ro[1], linestyle="--", alpha=0.5)
ax2.axvline(x=df[df["DEATH_EVENT"]==1].ejection_fraction.median(), color=palette_ro[6], linestyle="--", alpha=0.5)

ax1.annotate("Min: {:,}".format(df["ejection_fraction"].min()), xy=(df["ejection_fraction"].min(), 0.005), 
             xytext=(df["ejection_fraction"].min()-5, 0.022),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.2"))
ax1.annotate("Max: {:,}".format(df["ejection_fraction"].max()), xy=(df["ejection_fraction"].max(), 0.001), 
             xytext=(df["ejection_fraction"].max(), 0.022),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=-0.2"))
ax1.annotate("Med: {:.0f}".format(df["ejection_fraction"].median()), xy=(df["ejection_fraction"].median(), 0.041), 
             xytext=(df["ejection_fraction"].median()+5, 0.074),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.25"))

ax2.annotate("Survived\nMed: {:.0f}".format(df[df["DEATH_EVENT"]==0].ejection_fraction.median()), xy=(df[df["DEATH_EVENT"]==0].ejection_fraction.median(), 0.051), 
             xytext=(df[df["DEATH_EVENT"]==0].ejection_fraction.median()+5, 0.091),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.25"))
ax2.annotate("Dead\nMed: {:.0f}".format(df[df["DEATH_EVENT"]==1].ejection_fraction.median()), xy=(df[df["DEATH_EVENT"]==1].ejection_fraction.median(), 0.03), 
             xytext=(df[df["DEATH_EVENT"]==1].ejection_fraction.median()-18, 0.04),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=-0.25"))
ax2.legend();

Insights:

* The values are discrete rather than continuous, with a first peak near 38 and a second peak near 60.
* Split by target variable, both the shape of the distribution and the median differ considerably. Survivors are concentrated near the two peaks, whereas the values for patients who died are concentrated around 30 and fall off slowly from there.

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

sns.distplot(df["platelets"], ax=ax1, color=palette_ro[5])
sns.distplot(df[df["DEATH_EVENT"]==0].platelets, label="DEATH_EVENT=0", ax=ax2, hist=None, color=palette_ro[1])
sns.distplot(df[df["DEATH_EVENT"]==1].platelets, label="DEATH_EVENT=1", ax=ax2, hist=None, color=palette_ro[6])
ax1.set_title("platelets distribution", fontsize=16);
ax2.set_title("Relationship between platelets and DEATH_EVENT", fontsize=16)

ax1.axvline(x=df["platelets"].median(), color=palette_ro[5], linestyle="--", alpha=0.5)
ax2.axvline(x=df[df["DEATH_EVENT"]==0].platelets.median(), color=palette_ro[1], linestyle="--", alpha=0.5)
ax2.axvline(x=df[df["DEATH_EVENT"]==1].platelets.median(), color=palette_ro[6], linestyle="--", alpha=0.5)

ax1.annotate("Min: {:,.0f}".format(df["platelets"].min()), xy=(df["platelets"].min(), 2e-7), 
             xytext=(df["platelets"].min()-50000, 7e-7),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.2"))
ax1.annotate("Max: {:,.0f}".format(df["platelets"].max()), xy=(df["platelets"].max(), 1e-7), 
             xytext=(df["platelets"].max()-30000, 7e-7),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=-0.2"))
ax1.annotate("Med: {:,.0f}".format(df["platelets"].median()), xy=(df["platelets"].median(), 5.9e-6), 
             xytext=(df["platelets"].median()+25000, 5.5e-6),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.25"))

ax2.annotate("Survived\nMed: {:,.0f}".format(df[df["DEATH_EVENT"]==0].platelets.median()), xy=(df[df["DEATH_EVENT"]==0].platelets.median(), 6.2e-6), 
             xytext=(df[df["DEATH_EVENT"]==0].platelets.median()+50000, 5.5e-6),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.25"))
ax2.annotate("Dead\nMed: {:,.0f}".format(df[df["DEATH_EVENT"]==1].platelets.median()), xy=(df[df["DEATH_EVENT"]==1].platelets.median(), 4.5e-6), 
             xytext=(df[df["DEATH_EVENT"]==1].platelets.median()-200000, 5.2e-6),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=-0.25"))
ax2.legend();

Insights:

* The distribution is roughly symmetric and close to bell-shaped.
* Split by target variable, the medians differ little. Survivors have slightly higher platelet counts, and their values cluster more tightly around the median.

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

range_bin_width = np.arange(df["serum_creatinine"].min(), df["serum_creatinine"].max()+0.25, 0.25)

sns.distplot(df["serum_creatinine"], ax=ax1, bins=range_bin_width, color=palette_ro[5])
sns.distplot(df[df["DEATH_EVENT"]==0].serum_creatinine, label="DEATH_EVENT=0", ax=ax2, bins=range_bin_width, color=palette_ro[1])
sns.distplot(df[df["DEATH_EVENT"]==1].serum_creatinine, label="DEATH_EVENT=1", ax=ax2, bins=range_bin_width, color=palette_ro[6])
ax1.set_title("serum_creatinine distribution", fontsize=16);
ax2.set_title("Relationship between serum_creatinine and DEATH_EVENT", fontsize=16)

ax1.axvline(x=df["serum_creatinine"].median(), color=palette_ro[5], linestyle="--", alpha=0.5)
ax2.axvline(x=df[df["DEATH_EVENT"]==0].serum_creatinine.median(), color=palette_ro[1], linestyle="--", alpha=0.5)
ax2.axvline(x=df[df["DEATH_EVENT"]==1].serum_creatinine.median(), color=palette_ro[6], linestyle="--", alpha=0.5)

ax1.annotate("Min: {:.1f}".format(df["serum_creatinine"].min()), xy=(df["serum_creatinine"].min(), 0.31), 
             xytext=(df["serum_creatinine"].min()-0.7, 0.5),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.2"))
ax1.annotate("Max: {:.1f}".format(df["serum_creatinine"].max()), xy=(df["serum_creatinine"].max(), 0.05), 
             xytext=(df["serum_creatinine"].max()-0.2, 0.25),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=-0.2"))
ax1.annotate("Med: {:.1f}".format(df["serum_creatinine"].median()), xy=(df["serum_creatinine"].median(), 1.22), 
             xytext=(df["serum_creatinine"].median()+0.5, 1.3),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.25"))

ax2.annotate("Survived\nMed: {:.1f}".format(df[df["DEATH_EVENT"]==0].serum_creatinine.median()), xy=(df[df["DEATH_EVENT"]==0].serum_creatinine.median(), 1.47), 
             xytext=(df[df["DEATH_EVENT"]==0].serum_creatinine.median()-1.3, 1.5),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=-0.25"))
ax2.annotate("Dead\nMed: {:.1f}".format(df[df["DEATH_EVENT"]==1].serum_creatinine.median()), xy=(df[df["DEATH_EVENT"]==1].serum_creatinine.median(), 0.62), 
             xytext=(df[df["DEATH_EVENT"]==1].serum_creatinine.median()+0.4, 0.7),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.25"))
ax2.legend();

Insights:

* The distribution is heavily right-skewed, with rare cases exceeding four times the median.
* Split by target variable, the shapes of the distributions differ considerably. For survivors the values cluster around the median, whereas for patients who died the values often exceed 1.5.

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

range_bin_width = np.arange(df["serum_sodium"].min(), df["serum_sodium"].max()+1, 1)

sns.distplot(df["serum_sodium"], ax=ax1, bins=range_bin_width, color=palette_ro[5])
sns.distplot(df[df["DEATH_EVENT"]==0].serum_sodium, label="DEATH_EVENT=0", ax=ax2, bins=range_bin_width, color=palette_ro[1])
sns.distplot(df[df["DEATH_EVENT"]==1].serum_sodium, label="DEATH_EVENT=1", ax=ax2, bins=range_bin_width, color=palette_ro[6])
ax1.set_title("serum_sodium distribution", fontsize=16);
ax2.set_title("Relationship between serum_sodium and DEATH_EVENT", fontsize=16)

ax1.axvline(x=df["serum_sodium"].median(), color=palette_ro[5], linestyle="--", alpha=0.5)
ax2.axvline(x=df[df["DEATH_EVENT"]==0].serum_sodium.median(), color=palette_ro[1], linestyle="--", alpha=0.5)
ax2.axvline(x=df[df["DEATH_EVENT"]==1].serum_sodium.median(), color=palette_ro[6], linestyle="--", alpha=0.5)

ax1.annotate("Min: {:.0f}".format(df["serum_sodium"].min()), xy=(df["serum_sodium"].min(), 0.005), 
             xytext=(df["serum_sodium"].min()-3, 0.015),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.2"))
ax1.annotate("Max: {:.0f}".format(df["serum_sodium"].max()), xy=(df["serum_sodium"].max(), 0.005), 
             xytext=(df["serum_sodium"].max(), 0.015),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=-0.2"))
ax1.annotate("Med: {:.0f}".format(df["serum_sodium"].median()), xy=(df["serum_sodium"].median(), 0.103), 
             xytext=(df["serum_sodium"].median()-6, 0.115),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=-0.25"))

ax2.annotate("Survived\nMed: {:.0f}".format(df[df["DEATH_EVENT"]==0].serum_sodium.median()), xy=(df[df["DEATH_EVENT"]==0].serum_sodium.median(), 0.117), 
             xytext=(df[df["DEATH_EVENT"]==0].serum_sodium.median()+5, 0.135),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.25"))
ax2.annotate("Dead\nMed: {:.0f}".format(df[df["DEATH_EVENT"]==1].serum_sodium.median()), xy=(df[df["DEATH_EVENT"]==1].serum_sodium.median(), 0.09), 
             xytext=(df[df["DEATH_EVENT"]==1].serum_sodium.median()-5.5, 0.11),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=-0.25"))
ax2.legend();

Insights:

* The distribution is roughly symmetric and close to bell-shaped. No value exceeds 148, but a few fall below 125.
* Split by target variable, the median and the distribution differ somewhat. Survivors' values cluster around the median, while the values for patients who died are lower and more dispersed.

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

range_bin_width = np.arange(df["time"].min(), df["time"].max()+10, 10)

sns.distplot(df["time"], ax=ax1, bins=range_bin_width, color=palette_ro[5])
sns.distplot(df[df["DEATH_EVENT"]==0].time, label="DEATH_EVENT=0", ax=ax2, bins=range_bin_width, color=palette_ro[1])
sns.distplot(df[df["DEATH_EVENT"]==1].time, label="DEATH_EVENT=1", ax=ax2, bins=range_bin_width, color=palette_ro[6])
ax1.set_title("time distribution", fontsize=16);
ax2.set_title("Relationship between time and DEATH_EVENT", fontsize=16)

ax1.axvline(x=df["time"].median(), color=palette_ro[5], linestyle="--", alpha=0.5)
ax2.axvline(x=df[df["DEATH_EVENT"]==0].time.median(), color=palette_ro[1], linestyle="--", alpha=0.5)
ax2.axvline(x=df[df["DEATH_EVENT"]==1].time.median(), color=palette_ro[6], linestyle="--", alpha=0.5)

ax1.annotate("Min: {:.0f}".format(df["time"].min()), xy=(df["time"].min(), 0.0021), 
             xytext=(df["time"].min()-30, 0.0032),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.2"))
ax1.annotate("Max: {:.0f}".format(df["time"].max()), xy=(df["time"].max(), 0.001), 
             xytext=(df["time"].max(), 0.0017),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.2"))
ax1.annotate("Med: {:.0f}".format(df["time"].median()), xy=(df["time"].median(), 0.0041), 
             xytext=(df["time"].median()+8, 0.0052),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=-0.25"))

ax2.annotate("Survived\nMed: {:.0f}".format(df[df["DEATH_EVENT"]==0].time.median()), xy=(df[df["DEATH_EVENT"]==0].time.median(), 0.0035), 
             xytext=(df[df["DEATH_EVENT"]==0].time.median()-40, 0.007),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=0.25"))
ax2.annotate("Dead\nMed: {:.0f}".format(df[df["DEATH_EVENT"]==1].time.median()), xy=(df[df["DEATH_EVENT"]==1].time.median(), 0.0082), 
             xytext=(df[df["DEATH_EVENT"]==1].time.median()+7, 0.0105),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->", facecolor='slategray', edgecolor='slategray',
                             connectionstyle="arc3, rad=-0.25"))
ax2.legend();

Insights:

* The distribution of the follow-up period is spread out with no large peaks, and there are small peaks around 90 and 200.
<br>　<font color="RoyalBlue">経過観察期間の分布には大きな山は無くばらけていて、90付近と200付近に小さな山がある。</font>
* Split by the target variable, the medians and distributions differ clearly. Survivors tend to have a long follow-up period with two peaks in the distribution, while patients who died tend to have a short follow-up period, with the distribution falling off gradually from a large peak around 30 days.
<br>　<font color="RoyalBlue">目的変数別に見ると、中央値や分布に明確な差がある。生存者は経過観察期間が長く分布に２つの山があるが、死亡者は経過観察期間が短い傾向にあり、30日付近の大きな山から緩やかに減少していく。</font>

The figure below is based on a [scatterplot from the paper](https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/figures/3).

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 8))

sns.scatterplot(x=df["serum_creatinine"], y=df["ejection_fraction"], ax=ax,
                palette=[palette_ro[1], palette_ro[6]], hue=df["DEATH_EVENT"])
ax.plot([0.9, 5.3], [13, 80.0], color="gray", ls="--")

fig.suptitle("Relationship between serum_creatinine and ejection_fraction against DEATH_EVENT", fontsize=18);

> This plot shows a clear distinction between alive patients and dead patients, that we highlighted by manually inserting a black straight line.

<font color="RoyalBlue">この図は、生存した患者と死亡した患者の明確な違いを示しており、手動で黒い直線を挿入して強調しています。</font>
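To make the dividing line concrete, here is a minimal sketch that checks on which side of it a patient falls. The end points are copied from the `ax.plot(...)` call in the plotting cell above; the helper name `side_of_line` and the two example patients are made up for illustration, and this is not a classifier taken from the paper.

```python
# End points of the dashed line, copied from the ax.plot(...) call above.
X0, Y0 = 0.9, 13.0   # (serum_creatinine, ejection_fraction)
X1, Y1 = 5.3, 80.0

# Slope of the line: change in ejection_fraction per unit of serum_creatinine.
SLOPE = (Y1 - Y0) / (X1 - X0)


def side_of_line(serum_creatinine, ejection_fraction):
    """Hypothetical helper: report whether a point lies above or below the line."""
    line_ef = Y0 + SLOPE * (serum_creatinine - X0)
    return "above" if ejection_fraction > line_ef else "below"


# Two made-up patients, one on each side of the line.
print(side_of_line(1.0, 38))
print(side_of_line(3.0, 25))
```

Counting how many patients of each `DEATH_EVENT` class fall on each side would be one way to quantify how well the line separates them.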

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 8))

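# Compute the correlation matrix once and reuse it for the plot and the mask.
corr = df.corr()
# The built-in bool replaces np.bool, which was removed in NumPy 1.24.
sns.heatmap(corr, ax=ax, vmax=1, vmin=-1, center=0,
            annot=True, fmt=".2f",
            cmap=sns.diverging_palette(220, 10, as_cmap=True),
            mask=np.triu(np.ones_like(corr, dtype=bool)))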

fig.suptitle("Diagonal correlation matrix", fontsize=18);

Insights:

* The explanatory variables that can be said to be significantly correlated with the target variable are, in descending order of absolute correlation, `time`, `serum_creatinine`, `ejection_fraction`, `age`, and `serum_sodium`.
<br>　<font color="RoyalBlue">目的変数に対して有意に相関があると言える説明変数は、相関（絶対値）が高い順に time, serum_creatinine, ejection_fraction, age, serum_sodium の５つである。</font>
* The correlation between explanatory variables is not very high.
<br>　<font color="RoyalBlue">説明変数同士の相関はそれほど高くない。</font>
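Ranking features by the strength of their correlation with the target can be sketched as follows. This is a minimal pure-Python illustration on made-up columns (`feat_a`, `feat_b`, `feat_c` are hypothetical, not columns of the data set); it sorts by the absolute value of the Pearson coefficient so that strong negative correlations rank as high as strong positive ones.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Made-up binary target and three made-up feature columns.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "feat_a": [8, 7, 6, 5, 4, 3, 2, 1],  # strongly negative relationship
    "feat_b": [1, 2, 1, 2, 2, 3, 2, 3],  # moderately positive relationship
    "feat_c": [1, 2, 1, 2, 1, 2, 1, 2],  # no linear relationship
}

# Sort by absolute correlation, strongest first.
ranked = sorted(
    ((name, pearson(values, target)) for name, values in columns.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)

for name, r in ranked:
    print(f"{name}: {r:+.3f}")
```

With the real data, the same ranking could presumably be read off the `DEATH_EVENT` column of `df.corr()`.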

<a id="preprocessing"></a>
# Data preprocessing 🧹
Models that are not based on decision trees need feature scaling. Here we use standardization, which rescales each numeric feature to a mean of 0 and a standard deviation of 1. There are no categorical variables that need encoding (the binary features are already 0/1), so one-hot encoding is unnecessary.<br>
<font color="RoyalBlue">決定木をベースにしたモデル以外では、特徴量のスケーリングを行う必要があります。今回は、標準化（各特徴量の平均を０、標準偏差を１に変換）を行っておきましょう。今回はカテゴリ変数が無いので、ワンホットエンコーディングなどは必要ありません。</font>
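As a small worked example of what standardization does, the sketch below applies z = (x - mean) / std to a made-up list of ages using only the standard library. It uses the population standard deviation, which matches `StandardScaler`'s behaviour. The helper names and the numbers are assumptions for illustration. The sketch also takes the mean and standard deviation from the training values only and reuses them for the test values, a common variation that keeps test rows from influencing the scaling statistics.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Return the mean and population standard deviation of one feature."""
    return fmean(values), pstdev(values)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(v - mean) / std for v in values]


# Made-up ages, split into a training part and a test part.
ages_train = [40, 50, 60, 70, 80]
ages_test = [45, 75]

# Statistics come from the training values only and are reused for the test values.
mean, std = fit_standardizer(ages_train)
train_scaled = standardize(ages_train, mean, std)
test_scaled = standardize(ages_test, mean, std)

print([round(z, 3) for z in train_scaled])
print([round(z, 3) for z in test_scaled])
```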

In [None]:
num_features = ["age", "creatinine_phosphokinase", "ejection_fraction", "platelets", "serum_creatinine", "serum_sodium", "time"]
# Names of the standardized columns: the original name plus an "_s" suffix.
num_features_s = [name + "_s" for name in num_features]

sc = StandardScaler()
df[num_features_s] = sc.fit_transform(df[num_features])
df.head()

The `train_test_split()` function splits the data into a training set and a test set. Passing `y` as the `stratify` argument keeps the class ratio of `y` the same in both sets.<br>
<font color="RoyalBlue">train_test_split() 関数で訓練用セットとテスト用セットを分割します。このとき、引数 stratify に y を指定することで、y の割合を保ったままデータセットを分割することができます。</font>
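The idea behind a stratified split can be sketched in a few lines of pure Python. This is a simplified stand-in for what `stratify` achieves, not scikit-learn's actual implementation: it splits each class separately so that both parts keep roughly the same class ratio. The function name `stratified_split` and the 70/30 toy labels are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), splitting every class with the same ratio."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Made-up labels: 70 rows of class 0 and 30 rows of class 1.
labels = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=0)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both parts keep the 70:30 class ratio of the toy labels, which is the property the `stratify` argument is used for here.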

In [None]:
X = df.copy()
y = X["DEATH_EVENT"]
X = X.drop(["DEATH_EVENT"], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0)
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
X_train.head()

<a id="models"></a>
# Train models and make predictions 💭
Now, let's create some models and check their performance measures. We will also tune each model's hyperparameters using Optuna.<br>
Common performance measures for classifiers are as follows.<br>
<font color="RoyalBlue">では、いくつかのモデルを作成し、性能指標を確認していきましょう。また、optuna を使ってモデルの最適化も行います。<br>
分類器の性能指標には以下のようなものがあります。</font>

> Referenced from Hands-On Machine Learning with Scikit-Learn and TensorFlow (Aurélien Géron, 2017).
* accuracy - the ratio of correct predictions
<br>　<font color="RoyalBlue">正解率 - 正しい予測の割合</font>
* confusion matrix - counting the number of times instances of class A are classified as class B
<br>　<font color="RoyalBlue">混同行列 - クラスＡのインスタンスがクラスＢに分類された回数を数える</font>
* precision - the accuracy of the positive predictions
<br>　<font color="RoyalBlue">適合率 - 陽性の予測の正解率（陽性であると予測したうち、当たっていた率）</font>
* recall (sensitivity, true positive rate: TPR) - the ratio of positive instances that are correctly detected by the classifier
<br>　<font color="RoyalBlue">再現率（感度、真陽性率）- 分類器が正しく分類した陽性インスタンスの割合（本当に陽性であるケースのうち、陽性だと判定できた率）</font>
* F1 score - the harmonic mean of precision and recall
<br>　<font color="RoyalBlue">F1 スコア（F 値） - 適合率と再現率の調和平均（算術平均に比べ、調和平均は低い値にそうでない値よりもずっと大きな重みを置く）</font>
* AUC - the area under the ROC curve (plotting the true positive rate (another name for recall) against the false positive rate)
<br>　<font color="RoyalBlue">AUC - ROC 曲線（偽陽性率に対する真陽性率（再現率）をプロットした曲線）の下の面積</font><br>

In this notebook, we will look at each model's accuracy, F1 score, and confusion matrix.<br>
<font color="RoyalBlue">このノートブックでは、正解率、F1 スコア、そして混同行列を見ていきます。</font>
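To show how these measures relate to each other, here is a minimal sketch that computes them from scratch for a made-up set of labels and predictions. The helper name `binary_metrics` and the toy data are assumptions for illustration; the confusion matrix uses the same layout as scikit-learn's `confusion_matrix` (rows are true classes, columns are predicted classes).

```python
def binary_metrics(y_true, y_pred):
    """Compute basic classification measures for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up example: 4 positive and 6 negative patients.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = binary_metrics(y_true_toy, y_pred_toy)
for name, value in metrics.items():
    print(name, value)
```

In this toy case one positive is missed and one negative is flagged, so precision and recall are both 3/4 and the F1 score equals them, while accuracy is 8/10.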

<a id="gbm"></a>
## LightGBM 🌳

In [None]:
features = ["age", "anaemia", "creatinine_phosphokinase", "diabetes", "ejection_fraction", "high_blood_pressure", "platelets",
            "serum_creatinine", "serum_sodium", "sex", "smoking", "time"]
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
imp = np.zeros((NFOLD, len(features)))
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = lgb.LGBMClassifier(objective="binary",
                             metric="binary_logloss")
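    # Early stopping and log silencing go through callbacks (LightGBM >= 3.3);
    # the fit() keywords early_stopping_rounds/verbose were removed in LightGBM 4.0.
    clf.fit(X_tr, y_tr, eval_set=[(X_va, y_va)],
            callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False),
                       lgb.log_evaluation(period=0)])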
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)
    imp[fold_id] = clf.feature_importances_

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test[features], num_iteration=clf.best_iteration_)
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))

print(f"\nOut-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")

feature_imp = pd.DataFrame(sorted(zip(np.mean(imp, axis=0), features), reverse=True), columns=["values", "features"])

fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.barplot(x="values", y="features", data=feature_imp, palette="Blues_r")
plt.title("Feature importance of default LightGBM", fontsize=18);

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of default LightGBM", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), [False, True], fontsize=16)
plt.yticks(np.arange(2), [False, True], fontsize=16);

In [None]:
features = ["ejection_fraction", "serum_creatinine", "time"]

def objective(trial):
    skf = StratifiedKFold(n_splits=NFOLD)
    oof = np.zeros((len(X_train), ))

    for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
        # print(f"FOLD {fold_id+1}")  # silenced: would print once per fold in every trial
        X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
        y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]

        clf = lgb.LGBMClassifier(objective="binary",
                                 metric="binary_logloss",
                                 colsample_bytree = trial.suggest_uniform('colsample_bytree', 0.4, 1.0),
                                 learning_rate = trial.suggest_uniform("learning_rate", 1e-8, 1.0),
                                 max_depth = trial.suggest_int("max_depth", 2, 32),
                                 min_child_samples = trial.suggest_int("min_child_samples", 3, 500),
                                 min_child_weight = trial.suggest_loguniform("min_child_weight", 1e-4, 1e+1),
                                 n_estimators = trial.suggest_int("n_estimators", 20, 200),
                                 num_leaves = trial.suggest_int("num_leaves", 2, 512),
                                 reg_alpha = trial.suggest_loguniform("reg_alpha", 1e-8, 10.0),
                                 reg_lambda = trial.suggest_loguniform("reg_lambda", 1e-8, 10.0),
                                 subsample = trial.suggest_uniform("subsample", 0.4, 1.0),
                                 subsample_freq = trial.suggest_int("subsample_freq", 0, 7),
                                )
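        # Early stopping via callbacks (LightGBM >= 3.3); the fit() keywords
        # early_stopping_rounds/verbose were removed in LightGBM 4.0.
        clf.fit(X_tr, y_tr, eval_set=[(X_va, y_va)],
                callbacks=[lgb.early_stopping(stopping_rounds=20, verbose=False),
                           lgb.log_evaluation(period=0)])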
        oof[va_idx] = clf.predict(X_va)
        
    score = accuracy_score(y_train, oof)
    return score

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=100)

In [None]:
study.best_params

In [None]:
optuna.importance.get_param_importances(study)

In [None]:
fig = optuna.visualization.plot_param_importances(study)
fig.show(config={"displayModeBar": False})

In [None]:
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
imp = np.zeros((NFOLD, len(features)))
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = lgb.LGBMClassifier(objective="binary",
                             metric="binary_logloss",
                             **study.best_params)
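    # Early stopping via callbacks (LightGBM >= 3.3); the fit() keywords
    # early_stopping_rounds/verbose were removed in LightGBM 4.0.
    clf.fit(X_tr, y_tr, eval_set=[(X_va, y_va)],
            callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False),
                       lgb.log_evaluation(period=0)])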
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)
    imp[fold_id] = clf.feature_importances_

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test[features], num_iteration=clf.best_iteration_)
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))
y_pred_gbm = np.mean(y_preds, axis=1)

print(f"\nOut-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")

feature_imp = pd.DataFrame(sorted(zip(np.mean(imp, axis=0), features), reverse=True), columns=["values", "features"])

fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.barplot(x="values", y="features", data=feature_imp, palette="Blues_r")
plt.title("Feature importance of optimized LightGBM", fontsize=18);

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of optimized LightGBM", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), [False, True], fontsize=16)
plt.yticks(np.arange(2), [False, True], fontsize=16);

<a id="xgb"></a>
## XGBoost 🌳

In [None]:
features = ["age", "anaemia", "creatinine_phosphokinase", "diabetes", "ejection_fraction", "high_blood_pressure", "platelets",
            "serum_creatinine", "serum_sodium", "sex", "smoking", "time"]
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
imp = np.zeros((NFOLD, len(features)))
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = xgb.XGBClassifier()
    clf.fit(X_tr, y_tr)
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)
    imp[fold_id] = clf.feature_importances_

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test[features])
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")

feature_imp = pd.DataFrame(sorted(zip(np.mean(imp, axis=0), features), reverse=True), columns=["values", "features"])

fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.barplot(x="values", y="features", data=feature_imp, palette="Blues_r")
plt.title("Feature importance of default XGBoost", fontsize=18);

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of default XGBoost", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), [False, True], fontsize=16)
plt.yticks(np.arange(2), [False, True], fontsize=16);

In [None]:
features = ["ejection_fraction", "serum_creatinine", "time"]

def objective(trial):
    skf = StratifiedKFold(n_splits=NFOLD)
    # Only the out-of-fold predictions are needed to score a trial.
    oof = np.zeros((len(X_train), ))

    for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
        # print(f"FOLD {fold_id+1}")
        X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
        y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]

        clf = xgb.XGBClassifier(n_estimators = trial.suggest_int("n_estimators", 20, 200),
                                max_depth = trial.suggest_int("max_depth", 2, 32),
                                learning_rate = trial.suggest_uniform("learning_rate", 1e-8, 1.0),
                                min_child_weight = trial.suggest_loguniform("min_child_weight", 1e-4, 1e+1),
                                subsample = trial.suggest_uniform("subsample", 0.4, 1.0),
                                colsample_bytree = trial.suggest_uniform("colsample_bytree", 0.4, 1.0),
                                reg_alpha = trial.suggest_loguniform("reg_alpha", 1e-8, 10.0),
                                reg_lambda = trial.suggest_loguniform("reg_lambda", 1e-8, 10.0),
                                scale_pos_weight = trial.suggest_int("scale_pos_weight", 1, 100)
                                )
        clf.fit(X_tr, y_tr)
        oof[va_idx] = clf.predict(X_va)
        
    score = accuracy_score(y_train, oof)
    return score

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=100)

In [None]:
study.best_params

In [None]:
optuna.importance.get_param_importances(study)

In [None]:
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
imp = np.zeros((NFOLD, len(features)))
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = xgb.XGBClassifier(**study.best_params)
    clf.fit(X_tr, y_tr)
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)
    imp[fold_id] = clf.feature_importances_

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test[features])
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))
y_pred_xgb = np.mean(y_preds, axis=1)

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")

feature_imp = pd.DataFrame(sorted(zip(np.mean(imp, axis=0), features), reverse=True), columns=["values", "features"])

fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.barplot(x="values", y="features", data=feature_imp, palette="Blues_r")
plt.title("Feature importance of optimized XGBoost", fontsize=18);

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of optimized XGBoost", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), [False, True], fontsize=16)
plt.yticks(np.arange(2), [False, True], fontsize=16);

<a id="cat"></a>
## CatBoost 🌳

In [None]:
features = ["age", "anaemia", "creatinine_phosphokinase", "diabetes", "ejection_fraction", "high_blood_pressure", "platelets",
            "serum_creatinine", "serum_sodium", "sex", "smoking", "time"]
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
imp = np.zeros((NFOLD, len(features)))
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = CatBoostClassifier(loss_function="Logloss")
    clf.fit(X_tr, y_tr, eval_set = [(X_va, y_va)],
            early_stopping_rounds=10,
            verbose=False)
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)
    imp[fold_id] = clf.get_feature_importance()

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test[features])
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")

feature_imp = pd.DataFrame(sorted(zip(np.mean(imp, axis=0), features), reverse=True), columns=["values", "features"])

fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.barplot(x="values", y="features", data=feature_imp, palette="Blues_r")
plt.title("Feature importance of default CatBoost", fontsize=18);

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of default CatBoost", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), [False, True], fontsize=16)
plt.yticks(np.arange(2), [False, True], fontsize=16);

In [None]:
features = ["ejection_fraction", "serum_creatinine", "time"]

def objective(trial):
    skf = StratifiedKFold(n_splits=NFOLD)
    # Only the out-of-fold predictions are needed to score a trial.
    oof = np.zeros((len(X_train), ))

    for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
        # print(f"FOLD {fold_id+1}")
        X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
        y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]

        clf = CatBoostClassifier(loss_function="Logloss",
                                 iterations = trial.suggest_int("iterations", 1000, 6000),
                                 learning_rate = trial.suggest_uniform("learning_rate", 1e-4, 1e-1),
                                 l2_leaf_reg = trial.suggest_loguniform("l2_leaf_reg", 1e-8, 10.0),
                                 # bagging_temperature = trial.suggest_loguniform("bagging_temperature", 1e-8, 100.0),
                                 subsample = trial.suggest_uniform("subsample", 0.4, 1.0),
                                 # random_strength = trial.suggest_loguniform("random_strength", 1e-8, 100.0),
                                 depth = trial.suggest_int("depth", 2, 16),
                                 min_data_in_leaf = trial.suggest_int("min_data_in_leaf", 1, 200),
                                 # od_type = trial.suggest_categorical("od_type", ["IncToDec", "Iter"]),
                                 # od_wait = trial.suggest_int("od_wait", 10, 50)
                                )
        clf.fit(X_tr, y_tr, eval_set = [(X_va, y_va)],
                early_stopping_rounds=10,
                verbose=False)
        oof[va_idx] = clf.predict(X_va)
        
    score = accuracy_score(y_train, oof)
    return score

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=80)

In [None]:
study.best_params

In [None]:
optuna.importance.get_param_importances(study)

In [None]:
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
imp = np.zeros((NFOLD, len(features)))
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = CatBoostClassifier(loss_function="Logloss",
                             **study.best_params)
    clf.fit(X_tr, y_tr, eval_set = [(X_va, y_va)],
            early_stopping_rounds=10,
            verbose=False)
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)
    imp[fold_id] = clf.get_feature_importance()

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test[features])
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))
y_pred_cat = np.mean(y_preds, axis=1)

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")

feature_imp = pd.DataFrame(sorted(zip(np.mean(imp, axis=0), features), reverse=True), columns=["values", "features"])

fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.barplot(x="values", y="features", data=feature_imp, palette="Blues_r")
plt.title("Feature importance of optimized CatBoost", fontsize=18);

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of optimized CatBoost", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), [False, True], fontsize=16)
plt.yticks(np.arange(2), [False, True], fontsize=16);

<a id="rf"></a>
## Random forest 🌳

In [None]:
features = ["age", "anaemia", "creatinine_phosphokinase", "diabetes", "ejection_fraction", "high_blood_pressure", "platelets",
            "serum_creatinine", "serum_sodium", "sex", "smoking", "time"]
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
imp = np.zeros((NFOLD, len(features)))
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = RandomForestClassifier(random_state=0)
    clf.fit(X_tr, y_tr)
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)
    imp[fold_id] = clf.feature_importances_

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test[features])
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")

feature_imp = pd.DataFrame(sorted(zip(np.mean(imp, axis=0), features), reverse=True), columns=["values", "features"])

fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.barplot(x="values", y="features", data=feature_imp, palette="Blues_r")
plt.title("Feature importance of default Random forest", fontsize=18);

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of default Random forest", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), [False, True], fontsize=16)
plt.yticks(np.arange(2), [False, True], fontsize=16);

In [None]:
features = ["ejection_fraction", "serum_creatinine", "time"]

def objective(trial):
    skf = StratifiedKFold(n_splits=NFOLD)
    # Only the out-of-fold predictions are needed to score a trial.
    oof = np.zeros((len(X_train), ))

    for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
        # print(f"FOLD {fold_id+1}")
        X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
        y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]

        clf = RandomForestClassifier(random_state=0,
                                     n_estimators = trial.suggest_int("n_estimators", 20, 200),
                                     max_depth = trial.suggest_int("max_depth", 2, 32),
                                     min_samples_split = trial.suggest_int("min_samples_split", 2, 16),
                                     min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 16))
        clf.fit(X_tr, y_tr)
        oof[va_idx] = clf.predict(X_va)
        
    score = accuracy_score(y_train, oof)
    return score

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)

In [None]:
study.best_params

In [None]:
optuna.importance.get_param_importances(study)

In [None]:
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
imp = np.zeros((NFOLD, len(features)))
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = RandomForestClassifier(random_state=0,
                                 **study.best_params)
    clf.fit(X_tr, y_tr)
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)
    imp[fold_id] = clf.feature_importances_

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test[features])
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))
y_pred_rf = np.mean(y_preds, axis=1)

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")

feature_imp = pd.DataFrame(sorted(zip(np.mean(imp, axis=0), features), reverse=True), columns=["values", "features"])

fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.barplot(x="values", y="features", data=feature_imp, palette="Blues_r")
plt.title("Feature importance of optimized Random forest", fontsize=18);

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of optimized Random forest", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), [False, True], fontsize=16)
plt.yticks(np.arange(2), [False, True], fontsize=16);

<a id="ert"></a>
## Extremely randomized trees 🌳

In [None]:
features = ["age", "anaemia", "creatinine_phosphokinase", "diabetes", "ejection_fraction", "high_blood_pressure", "platelets",
            "serum_creatinine", "serum_sodium", "sex", "smoking", "time"]
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
imp = np.zeros((NFOLD, len(features)))
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = ExtraTreesClassifier(random_state=0)
    clf.fit(X_tr, y_tr)
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)
    imp[fold_id] = clf.feature_importances_

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test[features])
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")

feature_imp = pd.DataFrame(sorted(zip(np.mean(imp, axis=0), features), reverse=True), columns=["values", "features"])

fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.barplot(x="values", y="features", data=feature_imp, palette="Blues_r")
plt.title("Feature importance of default Extremely randomized trees", fontsize=18);

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of default Extremely randomized trees", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), [False, True], fontsize=16)
plt.yticks(np.arange(2), [False, True], fontsize=16);

In [None]:
features = ["age", "ejection_fraction", "serum_creatinine", "time"]

def objective(trial):
    skf = StratifiedKFold(n_splits=NFOLD)
    # Only the out-of-fold predictions are needed to score a trial.
    oof = np.zeros((len(X_train), ))

    for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
        # print(f"FOLD {fold_id+1}")
        X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
        y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]

        # Tune the same estimator that is trained afterwards (ExtraTreesClassifier).
        clf = ExtraTreesClassifier(random_state=0,
                                   n_estimators = trial.suggest_int("n_estimators", 20, 200),
                                   max_depth = trial.suggest_int("max_depth", 2, 32),
                                   min_samples_split = trial.suggest_int("min_samples_split", 2, 16),
                                   min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 16))
        clf.fit(X_tr, y_tr)
        oof[va_idx] = clf.predict(X_va)
        
    score = accuracy_score(y_train, oof)
    return score

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)

In [None]:
study.best_params

In [None]:
optuna.importance.get_param_importances(study)

In [None]:
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
imp = np.zeros((NFOLD, len(features)))
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = ExtraTreesClassifier(random_state=0,
                               **study.best_params)
    clf.fit(X_tr, y_tr)
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)
    imp[fold_id] = clf.feature_importances_

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test[features])
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))
y_pred_ert = np.mean(y_preds, axis=1)

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")

feature_imp = pd.DataFrame(sorted(zip(np.mean(imp, axis=0), features), reverse=True), columns=["values", "features"])

fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.barplot(x="values", y="features", data=feature_imp, palette="Blues_r")
plt.title("Feature importance of optimized Extremely randomized trees", fontsize=18);

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["false", "true"])

plt.title("Confusion Matrix of optimized Extremely randomized trees", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), [False, True], fontsize=16)
plt.yticks(np.arange(2), [False, True], fontsize=16);

<a id="lm"></a>
## Linear model 📈

In [None]:
features = ["age_s", "anaemia", "creatinine_phosphokinase_s", "diabetes", "ejection_fraction_s", "high_blood_pressure", "platelets_s",
            "serum_creatinine_s", "serum_sodium_s", "sex", "smoking", "time_s"]
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = LogisticRegression(random_state=0)
    clf.fit(X_tr, y_tr)
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test[features])
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["false", "true"])

plt.title("Confusion Matrix of default linear model", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), [False, True], fontsize=16)
plt.yticks(np.arange(2), [False, True], fontsize=16);

In [None]:
features = ["age_s", "creatinine_phosphokinase_s", "ejection_fraction_s", "serum_creatinine_s", "serum_sodium_s", "time_s"]
NFOLD = 10

def objective(trial):
    skf = StratifiedKFold(n_splits=NFOLD)
    models = []
    imp = np.zeros((NFOLD, len(features)))
    oof = np.zeros((len(X_train), ))
    y_preds = np.zeros((len(X_test), NFOLD))

    for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
        # print(f"FOLD {fold_id+1}")
        X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
        y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]

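        # suggest_float replaces the deprecated suggest_uniform.
        # Note: intercept_scaling is only used by the "liblinear" solver.
        clf = LogisticRegression(random_state=0,
                                 C = trial.suggest_float("C", 0.1, 10.0),
                                 intercept_scaling = trial.suggest_float("intercept_scaling", 0.1, 2.0),
                                 max_iter = trial.suggest_int("max_iter", 100, 1000)
                                 )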
        clf.fit(X_tr, y_tr)
        oof[va_idx] = clf.predict(X_va)
        
    score = accuracy_score(y_train, oof)
    return score

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=20)

In [None]:
study.best_params

In [None]:
optuna.importance.get_param_importances(study)

In [None]:
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = LogisticRegression(random_state=0,
                             **study.best_params)
    clf.fit(X_tr, y_tr)
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test[features])
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))
y_pred_lm = np.mean(y_preds, axis=1)

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["false", "true"])

plt.title("Confusion Matrix of optimized linear model", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), [False, True], fontsize=16)
plt.yticks(np.arange(2), [False, True], fontsize=16);

<a id="dl"></a>
## Deep learning 🧠
Because training takes a long time, I tuned the hyperparameters manually instead of using Optuna.<br>
<font color="RoyalBlue">学習に時間がかかったため、optuna ではなく手動でハイパーパラメータを調整しています。</font>
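
The cells below train one network per fold, average the fold models' test-set outputs, and label a patient as a death event when the mean exceeds 0.5. A minimal sketch of that averaging step in plain Python, so it runs without the notebook's state; the function name `average_and_threshold` and the toy scores are illustrative assumptions, not values from this notebook:

```python
def average_and_threshold(fold_outputs, threshold=0.5):
    """Average per-fold outputs for each sample, then binarize the mean.

    fold_outputs: one list per fold, each holding one score per test sample.
    """
    n_folds = len(fold_outputs)
    n_samples = len(fold_outputs[0])
    means = [sum(fold[i] for fold in fold_outputs) / n_folds for i in range(n_samples)]
    labels = [int(m > threshold) for m in means]
    return means, labels

# Toy example: 3 folds, 4 test samples (made-up scores).
toy_outputs = [
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.1, 0.4, 0.5],
    [0.7, 0.3, 0.7, 0.3],
]
means, labels = average_and_threshold(toy_outputs)
print(labels)
```

Averaging before thresholding lets a confident fold outweigh an uncertain one, which a per-fold majority vote would not do.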

In [None]:
features = ["age_s", "anaemia", "creatinine_phosphokinase_s", "diabetes", "ejection_fraction_s", "high_blood_pressure", "platelets_s",
            "serum_creatinine_s", "serum_sodium_s", "sex", "smoking", "time_s"]
NFOLD = 10
seed_everything(0)

BATCH_SIZE = 32

skf = StratifiedKFold(n_splits=NFOLD)
oof = np.zeros((len(X_train), 1))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(len(features), )),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Dense(128, activation="relu"),
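        tf.keras.layers.Dense(1, activation="sigmoid")  # sigmoid output so binary_crossentropy receives probabilities
    ])
    
    model.compile(loss="binary_crossentropy",
                  optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # `lr` is deprecated
                  metrics=["accuracy"])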
    
    model.fit(X_tr, y_tr,
              validation_data=(X_va, y_va),
              epochs=100, batch_size=BATCH_SIZE,
              verbose=0)
    
    oof[va_idx] = model.predict(X_va, batch_size=BATCH_SIZE, verbose=0)
    # Store this fold's test predictions in its own column; the columns are averaged below
    y_preds[:, fold_id] = model.predict(X_test[features], batch_size=BATCH_SIZE, verbose=0).ravel()

oof = (np.mean(oof, axis=1) > 0.5).astype(int)
y_pred = (np.mean(y_preds, axis=1) > 0.5).astype(int)

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["false", "true"])

plt.title("Confusion Matrix of deep learning", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), [False, True], fontsize=16)
plt.yticks(np.arange(2), [False, True], fontsize=16);

In [None]:
features = ["ejection_fraction_s", "serum_creatinine_s", "serum_sodium_s", "time_s"]
NFOLD = 10
seed_everything(0)

BATCH_SIZE = 32

skf = StratifiedKFold(n_splits=NFOLD)
oof = np.zeros((len(X_train), 1))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train[features].iloc[tr_idx], X_train[features].iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="elu", input_shape=(len(features), )),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Dense(64, activation="elu"),
        tf.keras.layers.Dropout(0.1),
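        tf.keras.layers.Dense(1, activation="sigmoid")  # sigmoid output so binary_crossentropy receives probabilities
    ])
    
    model.compile(loss="binary_crossentropy",
                  optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # `lr` is deprecated
                  metrics=["accuracy"])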
    
    model.fit(X_tr, y_tr,
              validation_data=(X_va, y_va),
              epochs=102, batch_size=BATCH_SIZE,
              verbose=0)
    
    oof[va_idx] = model.predict(X_va, batch_size=BATCH_SIZE, verbose=0)
    # Store this fold's test predictions in its own column; the columns are averaged below
    y_preds[:, fold_id] = model.predict(X_test[features], batch_size=BATCH_SIZE, verbose=0).ravel()

oof = (np.mean(oof, axis=1) > 0.5).astype(int)
y_pred = (np.mean(y_preds, axis=1) > 0.5).astype(int)
y_pred_dl = y_pred

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["false", "true"])

plt.title("Confusion Matrix of optimized deep learning", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), [False, True], fontsize=16)
plt.yticks(np.arange(2), [False, True], fontsize=16);

<a id="ensemble"></a>
# Simple ensemble 🤝
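The cell below forms a weighted vote: the models' test-set predictions are summed, with XGBoost and extremely randomized trees counted twice, and a patient is labeled positive when the weighted sum exceeds 3.0. If every prediction lies in [0, 1], the largest possible sum is 9. A minimal sketch of that vote in plain Python; the helper name `weighted_vote` and the toy votes are illustrative assumptions, not taken from the notebook:

```python
def weighted_vote(predictions, weights, threshold):
    """Sum weighted per-model predictions for each sample and binarize.

    predictions: dict of model name -> list of per-sample values in [0, 1].
    weights:     dict of model name -> vote weight.
    """
    names = list(predictions)
    n_samples = len(predictions[names[0]])
    totals = [
        sum(weights[name] * predictions[name][i] for name in names)
        for i in range(n_samples)
    ]
    return totals, [int(t > threshold) for t in totals]

# Weights mirror the ensemble cell; the votes are made-up toy values.
weights = {"gbm": 1, "xgb": 2, "cat": 1, "rf": 1, "ert": 2, "lm": 1, "dl": 1}
toy_votes = {
    "gbm": [1.0, 0.0, 0.0],
    "xgb": [1.0, 0.0, 1.0],
    "cat": [1.0, 0.0, 0.0],
    "rf":  [1.0, 0.0, 0.0],
    "ert": [1.0, 0.0, 1.0],
    "lm":  [1.0, 0.0, 0.0],
    "dl":  [1.0, 1.0, 0.0],
}
totals, labels = weighted_vote(toy_votes, weights, threshold=3.0)
print(totals)
print(labels)
```

In the third toy sample, the two double-weighted models alone reach a sum of 4.0 and flag the patient. A threshold of 3.0 is therefore more permissive than a strict majority of the total weight (4.5).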

In [None]:
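# Weighted vote: XGBoost and extremely randomized trees count twice (total weight 9)
y_pred_em = y_pred_gbm + y_pred_xgb*2 + y_pred_cat + y_pred_rf + y_pred_ert*2 + y_pred_lm + y_pred_dl
# Label a patient positive when the weighted sum exceeds 3.0
y_pred_em = (y_pred_em > 3.0).astype(int)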

print(f"Test accuracy:        {accuracy_score(y_test, y_pred_em)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred_em)}")

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred_em), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["false", "true"])

plt.title("Confusion Matrix of the ensembled model", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1-score={:0.4f}".format(accuracy_score(y_test, y_pred_em), f1_score(y_test, y_pred_em)), fontsize=14)
plt.xticks(np.arange(2), [False, True], fontsize=16)
plt.yticks(np.arange(2), [False, True], fontsize=16);

We were able to get better accuracy by using the ensemble model. Thanks so much for reading!<br>
<font color="RoyalBlue">複数のモデルをアンサンブルすることでより良い精度を出すことができました。ここまで読んでくださりどうもありがとうございました！</font>