<a id="overview"></a>
# Overview 🧐
<img src="https://i.imgur.com/HVZezzb.jpg" width="600"><br>
In this notebook, we predict whether a breast mass is benign or malignant from the 30 features in the dataset. Such a prediction can help in diagnosing patients with suspected breast cancer.

A similar analysis in R is available in [Breast Cancer🦀EDA & FA / PCA with R (98.2% acc)](https://www.kaggle.com/snowpea8/breast-cancer-eda-fa-pca-with-r-98-2-acc), if you would like to take a look at it.

We first explore and visualize the data to gain insights. Then we split the data into a training set and a test set and use the training set to train several machine learning models, evaluating each model with cross-validation along the way. Finally, we combine the models in an ensemble to try to improve accuracy.

# Table of contents 📖
* [Overview 🧐](#overview)
* [Setup 💻](#setup)
* [Load CSV data 📃](#load)
* [Explore CSV data 📊](#explore)
* [Data preprocessing 🧹](#preprocessing)
* [Train models and make predictions 💭](#models)
    * [LightGBM 🌳](#gbm)
    * [Extremely randomized trees 🌳](#ert)
    * [Linear model 📈](#lm)
* [Simple ensemble 🤝](#ensemble)

<a id="setup"></a>
# Setup 💻
All random seeds are fixed at 0.

In [None]:
import os
import random
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from mlxtend.plotting import plot_confusion_matrix

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler

import lightgbm as lgb

def seed_everything(seed=2020):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    # tf.random.set_seed(seed)
seed_everything(0)

sns.set_style("whitegrid")
palette_ro = ["#ee2f35", "#fa7211", "#fbd600", "#75c731", "#1fb86e", "#0488cf", "#7b44ab"]

ROOT = "../input/breast-cancer-wisconsin-data"

<a id="load"></a>
# Load CSV data 📃

In [None]:
df = pd.read_csv(ROOT + "/data.csv")

print("Data shape: ", df.shape)
df.head()

In [None]:
df.info()

Dataset from: [Breast Cancer Wisconsin (Diagnostic) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)

* `id` - ID number
* `diagnosis` - Diagnosis, the target variable (`M`: malignant, `B`: benign)

The following features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image. For each of the 10 attributes below, the dataset stores three values: the mean (`mean`), the standard error (`se`), and the worst value (`worst`), for a total of 30 features.

* `radius` - mean of distances from center to points on the perimeter
* `texture` - standard deviation of gray-scale values
* `perimeter`
* `area`
* `smoothness` - local variation in radius lengths
* `compactness` - perimeter^2 / area - 1.0
* `concavity` - severity of concave portions of the contour
* `concave points` - number of concave portions of the contour
* `symmetry`
* `fractal dimension` - "coastline approximation" - 1; a measure of complexity that grows as the contour becomes more complex
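
As a minimal sketch of how the 30 feature columns are organized, the snippet below rebuilds the column names from the 10 attributes and 3 statistics. The spellings (a space in `concave points`, an underscore in `fractal_dimension`) follow the column names used later in this notebook; the helper name `feature_columns` is hypothetical and not part of the original analysis.

```python
# The 10 attributes measured for each cell nucleus, spelled as in the CSV columns.
ATTRIBUTES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Each attribute is summarized three ways: mean, standard error, and worst value.
STATISTICS = ["mean", "se", "worst"]


def feature_columns(statistic=None):
    """Return feature column names, optionally restricted to one statistic."""
    stats = STATISTICS if statistic is None else [statistic]
    # Statistic-major order: all *_mean columns, then *_se, then *_worst.
    return [f"{attr}_{stat}" for stat in stats for attr in ATTRIBUTES]


print(len(feature_columns()))        # 10 attributes x 3 statistics
print(feature_columns("worst")[:3])
```

A list like `feature_columns("se")` can be handy for selecting one group of columns at a time, for example `df[feature_columns("se")]`, instead of relying on positional slices.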

<a id="explore"></a>
# Explore CSV data 📊
Acknowledgements: [Feature Selection and Data Visualization](https://www.kaggle.com/kanncaa1/feature-selection-and-data-visualization)

In [None]:
df.isnull().sum()

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
sns.countplot(x="diagnosis", ax=ax, data=df, palette=palette_ro[6::-5], alpha=0.9)

ax.annotate(len(df[df["diagnosis"]=="M"]), xy=(-0.05, len(df[df["diagnosis"]=="M"])+5),
            size=16, color=palette_ro[6])
ax.annotate(len(df[df["diagnosis"]=="B"]), xy=(0.95, len(df[df["diagnosis"]=="B"])+5),
            size=16, color=palette_ro[1])

fig.suptitle("Distribution of diagnosis", fontsize=18);

In [None]:
scaler = StandardScaler()
columns = df.columns.drop(["id", "Unnamed: 32", "diagnosis"])

data_s = pd.DataFrame(scaler.fit_transform(df[columns]), columns=columns)
data_s = pd.concat([df["diagnosis"], data_s.iloc[:, 0:10]], axis=1)
data_s = pd.melt(data_s, id_vars="diagnosis", var_name="features", value_name="value")

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(20, 16))
sns.violinplot(x="features", y="value", hue="diagnosis", ax=ax1,
               data=data_s, palette=palette_ro[6::-5], split=True,
               scale="count", inner="quartile")

sns.swarmplot(x="features", y="value", hue="diagnosis", ax=ax2,
              data=data_s, palette=palette_ro[6::-5])

fig.suptitle("Mean values distribution", fontsize=18);

In [None]:
data_s = pd.DataFrame(scaler.fit_transform(df[columns]), columns=columns)
data_s = pd.concat([df["diagnosis"], data_s.iloc[:, 10:20]], axis=1)
data_s = pd.melt(data_s, id_vars="diagnosis", var_name="features", value_name="value")

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(20, 16))
sns.violinplot(x="features", y="value", hue="diagnosis", ax=ax1,
               data=data_s, palette=palette_ro[6::-5], split=True,
               scale="count", inner="quartile")

sns.swarmplot(x="features", y="value", hue="diagnosis", ax=ax2,
              data=data_s, palette=palette_ro[6::-5])

fig.suptitle("Standard error values distribution", fontsize=18);

In [None]:
data_s = pd.DataFrame(scaler.fit_transform(df[columns]), columns=columns)
data_s = pd.concat([df["diagnosis"], data_s.iloc[:, 20:30]], axis=1)
data_s = pd.melt(data_s, id_vars="diagnosis", var_name="features", value_name="value")

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(20, 16))
sns.violinplot(x="features", y="value", hue="diagnosis", ax=ax1,
               data=data_s, palette=palette_ro[6::-5], split=True,
               scale="count", inner="quartile")

sns.swarmplot(x="features", y="value", hue="diagnosis", ax=ax2,
              data=data_s, palette=palette_ro[6::-5])

fig.suptitle("Worst values distribution", fontsize=18);

In [None]:
df_c = df.reindex(columns=["radius_mean", "radius_se", "radius_worst", "texture_mean", "texture_se", "texture_worst",
                           "perimeter_mean", "perimeter_se", "perimeter_worst", "area_mean", "area_se", "area_worst",
                           "smoothness_mean", "smoothness_se", "smoothness_worst", "compactness_mean", "compactness_se", "compactness_worst",
                           "concavity_mean", "concavity_se", "concavity_worst", "concave points_mean", "concave points_se", "concave points_worst",
                           "symmetry_mean", "symmetry_se", "symmetry_worst", "fractal_dimension_mean", "fractal_dimension_se", "fractal_dimension_worst",
                           "diagnosis"])
df_c = df_c.replace({"M":1, "B":0})

print("Correlation coefficient against diagnosis")
df_c.corr().sort_values("diagnosis", ascending=False)["diagnosis"]

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(18, 12))

sns.heatmap(df_c.corr(), ax=ax, vmax=1, vmin=-1, center=0,
            annot=True, fmt=".2f",
            cmap=sns.diverging_palette(220, 10, as_cmap=True),
            mask=np.triu(np.ones_like(df_c.corr(), dtype=bool)))  # builtin bool; the np.bool alias was removed in NumPy 1.24

_, labels = plt.yticks()
labels[30].set_color(palette_ro[0])

fig.suptitle("Diagonal correlation matrix", fontsize=18);

Many of the features in this dataset are highly correlated with each other, so feature selection is very important.
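
To make the heatmap easier to act on, highly correlated pairs can also be listed programmatically. The following is a rough sketch on synthetic data, not part of the original analysis: the function name `correlated_pairs`, the 0.9 threshold, and the toy columns are all assumptions chosen for illustration.

```python
import numpy as np


def correlated_pairs(data, names, threshold=0.9):
    """List feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = np.corrcoef(data, rowvar=False)  # columns are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 3)))
    return pairs


# Synthetic stand-in for the real data: perimeter is almost a function of radius.
rng = np.random.default_rng(0)
radius = rng.normal(14.0, 3.5, size=200)
perimeter = 2 * np.pi * radius + rng.normal(0.0, 1.0, size=200)
texture = rng.normal(19.0, 4.0, size=200)

toy = np.column_stack([radius, perimeter, texture])
pairs = correlated_pairs(toy, ["radius", "perimeter", "texture"])
print(pairs)
```

With the notebook's data, a call such as `correlated_pairs(df[columns].to_numpy(), list(columns))` should work the same way, assuming `columns` holds the 30 feature names as defined in the earlier scaling cell.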

<a id="preprocessing"></a>
# Data preprocessing 🧹

In [None]:
X = df.copy()
y = X["diagnosis"].replace({"M":1, "B":0})
X = X.drop(["id", "Unnamed: 32", "diagnosis"], axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, stratify=y, random_state=0)
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
X_train.head()

<a id="models"></a>
# Train models and make predictions 💭
Now, let's create some models and check their performance. Common performance measures for classifiers are as follows.

> Definitions adapted from Hands-On Machine Learning with Scikit-Learn and TensorFlow (Aurélien Géron, 2017).
* accuracy - the ratio of correct predictions
* confusion matrix - counts of the number of times instances of class A are classified as class B
* precision - the accuracy of the positive predictions (of the cases predicted positive, the fraction that are actually positive)
* recall (sensitivity, true positive rate: TPR) - the ratio of positive instances that are correctly detected by the classifier (of the cases that are actually positive, the fraction predicted positive)
* F1 score - the harmonic mean of precision and recall (compared with the arithmetic mean, the harmonic mean gives much more weight to low values)
* AUC - the area under the ROC curve, which plots the true positive rate (another name for recall) against the false positive rate

In this notebook, we will look at the accuracy, F1 score, and confusion matrix of each model, and also print the recall on the test set.
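
The definitions above can be made concrete with a small worked example. This is a sketch using made-up confusion-matrix counts (not results from this notebook), with malignant treated as the positive class; the function name `classification_metrics` is hypothetical, and the counts are assumed to be nonzero so that no division by zero occurs.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted positives, how many are correct
    recall = tp / (tp + fn)      # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts, with malignant as the positive class.
metrics = classification_metrics(tp=50, fp=5, fn=10, tn=100)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In a diagnostic setting, a false negative (a malignant mass predicted as benign) is usually the costlier mistake, which is one reason to watch recall alongside accuracy.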

<a id="gbm"></a>
## LightGBM 🌳
First, let's try a prediction with all the features using LightGBM.
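
The training cells in this notebook average the hard 0/1 predictions of the ten fold models and round the result with `np.rint`, which amounts to a majority vote. Here is a small sketch with made-up votes. One detail worth knowing is that `np.rint` rounds halves to the nearest even number, so an exact 5-5 tie resolves to 0 (benign).

```python
import numpy as np

# Rows are test samples, columns are the 0/1 predictions of 10 fold models (made up).
y_preds = np.array([
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # 7 of 10 vote malignant
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],  # 2 of 10 vote malignant
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # exact 5-5 tie
])

vote_share = y_preds.mean(axis=1)   # fraction of models voting malignant
y_pred = np.rint(vote_share)        # majority vote; 0.5 rounds to 0 (half to even)

print(vote_share)
print(y_pred)
```

If ties should count as malignant instead, a comparison such as `vote_share >= 0.5` would be one way to express that choice explicitly.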

In [None]:
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
imp = np.zeros((NFOLD, len(X_train.columns)))
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train.iloc[tr_idx], X_train.iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = lgb.LGBMClassifier(objective="binary",
                             metric="binary_logloss")
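    # Early stopping via callbacks; the early_stopping_rounds/verbose fit keywords were removed in LightGBM 4.0.
    clf.fit(X_tr, y_tr, eval_set=[(X_va, y_va)],
            callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False),
                       lgb.log_evaluation(period=0)])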
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)
    imp[fold_id] = clf.feature_importances_

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test, num_iteration=clf.best_iteration_)
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))

print(f"\nOut-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")
print(f"Test recall:          {recall_score(y_test, y_pred)}")

feature_imp = pd.DataFrame(sorted(zip(np.mean(imp, axis=0), X_train.columns), reverse=True), columns=["values", "features"])

fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.barplot(x="values", y="features", data=feature_imp, palette="Blues_r")
plt.title("Feature importance of default LightGBM", fontsize=18);

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of default LightGBM", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1-score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), ["Benign", "Malignant"], fontsize=16)
plt.yticks(np.arange(2), ["Benign", "Malignant"], fontsize=16);

Next, we narrow down the features based on the EDA and the feature importances, keeping features that meet the following criteria.
* High correlation coefficient with the target variable
* Little overlap between the benign and malignant distributions
* High feature importance
* Features are as independent of each other as possible (to reduce multicollinearity)
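
Two of the criteria above (correlation with the target and independence between features) can be approximated automatically. The sketch below is a simple greedy heuristic on synthetic data; it is not the procedure used in this notebook, where the features are chosen by hand, and the function name `greedy_select` and the 0.9 cutoff are assumptions.

```python
import numpy as np


def greedy_select(X, y, names, max_corr=0.9):
    """Keep features most correlated with y, skipping near-duplicates of kept ones."""
    n = X.shape[1]
    target_corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n)])
    feature_corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in np.argsort(-target_corr):  # strongest relationship with y first
        if all(feature_corr[j, k] < max_corr for k in kept):
            kept.append(int(j))
    return [names[j] for j in kept]


# Synthetic example: "a" and "b" carry nearly the same signal, "c" is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
a = y + rng.normal(0.0, 0.5, size=300)
b = a + rng.normal(0.0, 0.05, size=300)
c = rng.normal(0.0, 1.0, size=300)

selected = greedy_select(np.column_stack([a, b, c]), y, ["a", "b", "c"])
print(selected)
```

Only one of `a` and `b` survives because they are nearly identical. Note that the heuristic does not filter on target correlation by itself, so the noise column `c` is kept; in practice it would be combined with the other criteria, such as feature importance.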

In [None]:
drop_features1 = ["radius_mean", "radius_se", "radius_worst", "texture_mean", "texture_se",
                  "perimeter_mean", "perimeter_se", "area_mean", "area_worst",
                  "smoothness_mean", "smoothness_se", "compactness_mean", "compactness_se", "compactness_worst",
                  "concavity_mean", "concavity_se", "concavity_worst", "concave points_worst",
                  "symmetry_mean", "symmetry_se", "fractal_dimension_mean", "fractal_dimension_se"]
X_1 = X.drop(drop_features1, axis=1)

fig, ax = plt.subplots(1, 1, figsize=(12, 8))
sns.heatmap(pd.concat([X_1, y], axis=1).corr(), ax=ax, vmax=1, vmin=-1, center=0,
            annot=True, fmt=".2f",
            cmap=sns.diverging_palette(220, 10, as_cmap=True),
            mask=np.triu(np.ones_like(pd.concat([X_1, y], axis=1).corr(), dtype=bool)))  # builtin bool; np.bool was removed in NumPy 1.24

_, labels = plt.yticks()
labels[8].set_color(palette_ro[0])

fig.suptitle("Diagonal correlation matrix", fontsize=18);

X_train, X_test, y_train, y_test = train_test_split(X_1, y, test_size=0.3, shuffle=True, stratify=y, random_state=0)
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)

In [None]:
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
imp = np.zeros((NFOLD, len(X_train.columns)))
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train.iloc[tr_idx], X_train.iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = lgb.LGBMClassifier(objective="binary",
                             metric="binary_logloss",
                             min_child_samples=10,
                             reg_alpha=0.1)
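    # Early stopping via callbacks; the early_stopping_rounds/verbose fit keywords were removed in LightGBM 4.0.
    clf.fit(X_tr, y_tr, eval_set=[(X_va, y_va)],
            callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False),
                       lgb.log_evaluation(period=0)])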
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)
    imp[fold_id] = clf.feature_importances_

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test, num_iteration=clf.best_iteration_)
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))
y_pred_gbm = np.mean(y_preds, axis=1)

print(f"\nOut-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")
print(f"Test recall:          {recall_score(y_test, y_pred)}")

feature_imp = pd.DataFrame(sorted(zip(np.mean(imp, axis=0), X_train.columns), reverse=True), columns=["values", "features"])

fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.barplot(x="values", y="features", data=feature_imp, palette="Blues_r")
plt.title("Feature importance of optimized LightGBM", fontsize=18);

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of optimized LightGBM", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), ["Benign", "Malignant"], fontsize=16)
plt.yticks(np.arange(2), ["Benign", "Malignant"], fontsize=16);

<a id="ert"></a>
## Extremely randomized trees 🌳
We also try extremely randomized trees (`ExtraTreesClassifier`).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, stratify=y, random_state=0)
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)

In [None]:
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
imp = np.zeros((NFOLD, len(X_train.columns)))
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train.iloc[tr_idx], X_train.iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = ExtraTreesClassifier(random_state=0)
    clf.fit(X_tr, y_tr)
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)
    imp[fold_id] = clf.feature_importances_

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test)
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")
print(f"Test recall:          {recall_score(y_test, y_pred)}")

feature_imp = pd.DataFrame(sorted(zip(np.mean(imp, axis=0), X_train.columns), reverse=True), columns=["values", "features"])

fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.barplot(x="values", y="features", data=feature_imp, palette="Blues_r")
plt.title("Feature importance of default Extremely randomized trees", fontsize=18);

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of default Extremely randomized trees", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), ["Benign", "Malignant"], fontsize=16)
plt.yticks(np.arange(2), ["Benign", "Malignant"], fontsize=16);

We apply the same feature selection procedure as before.

In [None]:
drop_features2 = ["radius_mean", "radius_se", "radius_worst", "texture_mean", "texture_se",
                  "perimeter_mean", "perimeter_se", "area_mean", "area_worst",
                  "smoothness_mean", "smoothness_se", "compactness_mean", "compactness_se", "compactness_worst",
                  "concavity_mean",  "concavity_worst", "concave points_mean", "concave points_se",
                  "symmetry_mean", "symmetry_se", "fractal_dimension_mean", "fractal_dimension_se"]
X_2 = X.drop(drop_features2, axis=1)

fig, ax = plt.subplots(1, 1, figsize=(12, 8))
sns.heatmap(pd.concat([X_2, y], axis=1).corr(), ax=ax, vmax=1, vmin=-1, center=0,
            annot=True, fmt=".2f",
            cmap=sns.diverging_palette(220, 10, as_cmap=True),
            mask=np.triu(np.ones_like(pd.concat([X_2, y], axis=1).corr(), dtype=bool)))  # builtin bool; np.bool was removed in NumPy 1.24

_, labels = plt.yticks()
labels[8].set_color(palette_ro[0])

fig.suptitle("Diagonal correlation matrix", fontsize=18);

X_train, X_test, y_train, y_test = train_test_split(X_2, y, test_size=0.3, shuffle=True, stratify=y, random_state=0)
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)

In [None]:
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
imp = np.zeros((NFOLD, len(X_train.columns)))
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train.iloc[tr_idx], X_train.iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = ExtraTreesClassifier(random_state=0,
                               n_estimators=200,
                               min_samples_split=5)
    clf.fit(X_tr, y_tr)
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)
    imp[fold_id] = clf.feature_importances_

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test)
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))
y_pred_ert = np.mean(y_preds, axis=1)

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")
print(f"Test recall:          {recall_score(y_test, y_pred)}")

feature_imp = pd.DataFrame(sorted(zip(np.mean(imp, axis=0), X_train.columns), reverse=True), columns=["values", "features"])

fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.barplot(x="values", y="features", data=feature_imp, palette="Blues_r")
plt.title("Feature importance of optimized Extremely randomized trees", fontsize=18);

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of optimized Extremely randomized trees", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), ["Benign", "Malignant"], fontsize=16)
plt.yticks(np.arange(2), ["Benign", "Malignant"], fontsize=16);

<a id="lm"></a>
## Linear model 📈
To add diversity to the ensemble, we also try a linear model (logistic regression), which does not rely on decision trees.
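
Linear models are sensitive to feature scale, so the features are standardized first. The cell that follows fits the scaler on all rows before splitting; a common alternative, sketched here with NumPy on made-up data, is to compute the mean and standard deviation from the training rows only and reuse them for the test rows, so that no test-set statistics leak into training. The function name `standardize_with_train_stats` is hypothetical.

```python
import numpy as np


def standardize_with_train_stats(X_train, X_test):
    """Scale both splits using the mean and std of the training split only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (X_train - mean) / std, (X_test - mean) / std


rng = np.random.default_rng(0)
X = rng.normal(loc=[14.0, 650.0], scale=[3.5, 350.0], size=(100, 2))  # made-up values
X_train, X_test = X[:70], X[70:]

X_train_s, X_test_s = standardize_with_train_stats(X_train, X_test)
print(X_train_s.mean(axis=0).round(6))  # approximately 0 for each column
print(X_train_s.std(axis=0).round(6))   # approximately 1 for each column
```

With scikit-learn, the same idea can be expressed by calling `StandardScaler().fit(X_train)` and then `transform` on both splits. On a dataset of this size the practical difference is likely small, but it keeps the test set strictly unseen.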

In [None]:
scaler = StandardScaler()
X_s = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

X_train, X_test, y_train, y_test = train_test_split(X_s, y, test_size=0.3, shuffle=True, stratify=y, random_state=0)
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)

In [None]:
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train.iloc[tr_idx], X_train.iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = LogisticRegression(random_state=0)
    clf.fit(X_tr, y_tr)
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test)
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")
print(f"Test recall:          {recall_score(y_test, y_pred)}")

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of default linear model", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), ["Benign", "Malignant"], fontsize=16)
plt.yticks(np.arange(2), ["Benign", "Malignant"], fontsize=16);

We apply feature selection again.

In [None]:
drop_features3 = ["radius_mean", "radius_se", "radius_worst", "texture_mean", "texture_se",
                  "perimeter_mean", "perimeter_se", "area_mean", "area_worst",
                  "smoothness_mean", "smoothness_se", "compactness_mean", "compactness_se", 
                  "concavity_se", "concavity_worst", "concave points_mean", "concave points_se",
                  "symmetry_mean", "symmetry_se", "fractal_dimension_mean", "fractal_dimension_se", "fractal_dimension_worst"]
X_3 = X_s.drop(drop_features3, axis=1)

fig, ax = plt.subplots(1, 1, figsize=(12, 8))
sns.heatmap(pd.concat([X_3, y], axis=1).corr(), ax=ax, vmax=1, vmin=-1, center=0,
            annot=True, fmt=".2f",
            cmap=sns.diverging_palette(220, 10, as_cmap=True),
            mask=np.triu(np.ones_like(pd.concat([X_3, y], axis=1).corr(), dtype=bool)))  # builtin bool; np.bool was removed in NumPy 1.24

_, labels = plt.yticks()
labels[8].set_color(palette_ro[0])

fig.suptitle("Diagonal correlation matrix", fontsize=18);

X_train, X_test, y_train, y_test = train_test_split(X_3, y, test_size=0.3, shuffle=True, stratify=y, random_state=0)
X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)

In [None]:
NFOLD = 10

skf = StratifiedKFold(n_splits=NFOLD)
models = []
oof = np.zeros((len(X_train), ))
y_preds = np.zeros((len(X_test), NFOLD))

for fold_id, (tr_idx, va_idx) in enumerate(skf.split(X_train, y_train)):
    # print(f"FOLD {fold_id+1}")
    X_tr, X_va = X_train.iloc[tr_idx], X_train.iloc[va_idx]
    y_tr, y_va = y_train.iloc[tr_idx], y_train.iloc[va_idx]
    
    clf = LogisticRegression(random_state=0)
    clf.fit(X_tr, y_tr)
    oof[va_idx] = clf.predict(X_va)
    models.append(clf)

for fold_id, clf in enumerate(models):
    pred_ = clf.predict(X_test)
    y_preds[:, fold_id] = pred_
y_pred = np.rint(np.mean(y_preds, axis=1))
y_pred_lm = np.mean(y_preds, axis=1)

print(f"Out-of-fold accuracy: {accuracy_score(y_train, oof)}")
print(f"Out-of-fold F1 score: {f1_score(y_train, oof)}")
print(f"Test accuracy:        {accuracy_score(y_test, y_pred)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred)}")
print(f"Test recall:          {recall_score(y_test, y_pred)}")

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of optimized linear model", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1 score={:0.4f}".format(accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)), fontsize=14)
plt.xticks(np.arange(2), ["Benign", "Malignant"], fontsize=16)
plt.yticks(np.arange(2), ["Benign", "Malignant"], fontsize=16);

<a id="ensemble"></a>
# Simple ensemble 🤝
To improve accuracy, we combine the predictions of the three models in a weighted vote.
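
The cell that follows weights the two tree-based models by 2 and the linear model by 1, then predicts malignant when the weighted score exceeds 3 (out of a maximum of 5). The sketch below restates that rule for a single sample so its behavior is easy to inspect; the function name `weighted_vote` is hypothetical, and each input is assumed to be the share of fold models voting malignant, between 0 and 1.

```python
def weighted_vote(share_gbm, share_ert, share_lm, threshold=3.0):
    """Combine per-model malignant vote shares (each in [0, 1]) with weights 2, 2, 1."""
    score = 2 * share_gbm + 2 * share_ert + 1 * share_lm  # ranges from 0 to 5
    return int(score > threshold)


# Both tree models say malignant, the linear model disagrees: 2 + 2 + 0 = 4 > 3.
print(weighted_vote(1.0, 1.0, 0.0))
# The tree models disagree and the linear model says malignant: 2 + 0 + 1 = 3, not > 3.
print(weighted_vote(1.0, 0.0, 1.0))
# Only the linear model says malignant: 0 + 0 + 1 = 1.
print(weighted_vote(0.0, 0.0, 1.0))
```

When every fold model within a family agrees, this rule follows the two tree-based models whenever they agree with each other, and falls back to benign when they disagree, since the score can then reach at most 3. The linear model only affects the outcome when the fold votes are fractional.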

In [None]:
y_pred_em = y_pred_gbm*2 +  y_pred_ert*2 + y_pred_lm
y_pred_em = (y_pred_em > 3).astype(int)

print(f"Test accuracy:        {accuracy_score(y_test, y_pred_em)}")
print(f"Test F1 score:        {f1_score(y_test, y_pred_em)}")
print(f"Test recall:          {recall_score(y_test, y_pred_em)}")

In [None]:
fig, ax = plot_confusion_matrix(confusion_matrix(y_test, y_pred_em), figsize=(12,8), hide_ticks=True, colorbar=True, class_names=["true", "false"])

plt.title("Confusion Matrix of the ensembled model", fontsize=18)
plt.ylabel("True label", fontsize=14)
plt.xlabel("Predicted label\naccuracy={:0.4f}, F1-score={:0.4f}".format(accuracy_score(y_test, y_pred_em), f1_score(y_test, y_pred_em)), fontsize=14)
plt.xticks(np.arange(2), ["Benign", "Malignant"], fontsize=16)
plt.yticks(np.arange(2), ["Benign", "Malignant"], fontsize=16);

We were able to get better accuracy by combining the models in an ensemble. Thanks so much for reading!