# Sprint 1: Machine Learning Flow

## 2. Machine Learning Flow

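Using Kaggle's Home Credit Default Risk competition as the subject, we learn the practical flow of machine learning. In particular, the goal is to carry out proper validation and build a model with high generalization performance.  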
Home Credit Default Risk | Kaggle

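## [Problem 1] Cross-validation
During the pre-study period, we split off validation data up front and validated by computing metric values on it (the hold-out method). However, the measured accuracy changes depending on how the data is split. In practice, cross-validation is used instead: the data is split several times, and training and validation are run on each split. scikit-learn provides the KFold class for making these repeated splits.

Write and run code that performs cross-validation with the KFold class on the baseline model created in the pre-study assignment.  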
sklearn.model_selection.KFold — scikit-learn 0.21.3 documentation  
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold
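
To make the splitting behaviour concrete, here is a minimal sketch of how KFold divides the sample indices. The six-row array `X_toy` is a hypothetical toy input, not part of the Home Credit data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 6 samples with a single feature
X_toy = np.arange(6).reshape(-1, 1)

# Without shuffling, each fold is a contiguous block of indices
kf_toy = KFold(n_splits=3, shuffle=False)
folds = [(train_idx.tolist(), valid_idx.tolist())
         for train_idx, valid_idx in kf_toy.split(X_toy)]

for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
# train: [2, 3, 4, 5] valid: [0, 1]
# train: [0, 1, 4, 5] valid: [2, 3]
# train: [0, 1, 2, 3] valid: [4, 5]
```

Every sample is used for validation exactly once, which is why the averaged score depends less on one particular split than a single hold-out score does.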

### 1.1.1 (Preparation) Preprocessing

In [1]:
# Import the basic libraries
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 30)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import scipy.stats
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from tqdm import tqdm
from sklearn.metrics import f1_score
from sklearn import datasets
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
import warnings  # Suppress warnings that do not affect execution
warnings.filterwarnings('ignore')

# Helper function to display DataFrames neatly
import IPython
def display(*dfs, head=True):
    for df in dfs:
        IPython.display.display(df.head() if head else df)

In [2]:
# Load the training and test data
app_test= pd.read_csv("sample_dataset/home-credit-default-risk/application_test.csv")
app_train = pd.read_csv("sample_dataset/home-credit-default-risk/application_train.csv")

In [3]:
# Label encoding (converts values to numbers without adding columns)
le = LabelEncoder()
le_count = 0
for col in app_train:
    # Only handle object-typed columns.
    if app_train[col].dtype == "object":
        # Only handle columns with two or fewer categories.
        if len(list(app_train[col].unique())) <= 2:
            # Fit on the training data.
            le.fit(app_train[col])
            # Transform both the training and test data.
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])
            le_count += 1

# One-hot encoding (values are only 0 and 1; one column is created per category)
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

In [4]:
# Keep copies of the original data
train0 = app_train.copy()
test0 = app_test.copy()

# Target values
train_labels = app_train['TARGET']

# ID values
train_id = app_train["SK_ID_CURR"]
test_id = app_test["SK_ID_CURR"]

In [5]:
# Align app_train and app_test, keeping only the columns present in both (inner join).
train1, test1 = train0.align(test0, join = 'inner', axis = 1)

print('Training Features shape: ', train1.shape)
print('Testing Features shape: ', test1.shape)

Training Features shape:  (307511, 239)
Testing Features shape:  (48744, 239)


In [6]:
# Feature list
features = list(train1.columns)

In [7]:
# Fill missing values with the median
imputer = SimpleImputer(strategy = 'median')
imputer.fit(train1)
train2 = imputer.transform(train1)
test2 = imputer.transform(test1)

In [8]:
# Normalize to a minimum of 0 and a maximum of 1
scaler = MinMaxScaler(feature_range = (0, 1))
scaler.fit(train2)
train3 = scaler.transform(train2)
test3 = scaler.transform(test2)

In [9]:
# Master copy of the training data
train_df = pd.DataFrame(train3)
train_df.columns = features

# Master copy of the test data
test_df = pd.DataFrame(test3)
test_df.columns = features
train_df.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,...,FONDKAPREMONT_MODE_org spec account,FONDKAPREMONT_MODE_reg oper account,FONDKAPREMONT_MODE_reg oper spec account,HOUSETYPE_MODE_block of flats,HOUSETYPE_MODE_specific housing,HOUSETYPE_MODE_terraced house,WALLSMATERIAL_MODE_Block,WALLSMATERIAL_MODE_Mixed,WALLSMATERIAL_MODE_Monolithic,WALLSMATERIAL_MODE_Others,WALLSMATERIAL_MODE_Panel,"WALLSMATERIAL_MODE_Stone, brick",WALLSMATERIAL_MODE_Wooden,EMERGENCYSTATE_MODE_No,EMERGENCYSTATE_MODE_Yes
0,0.0,0.0,0.0,1.0,0.0,0.001512,0.090287,0.090032,0.077441,0.256321,0.888839,0.045086,0.85214,0.705433,0.098901,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,3e-06,0.0,0.0,0.0,0.0,0.002089,0.311736,0.132924,0.271605,0.045016,0.477114,0.043648,0.951929,0.959566,0.098901,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,6e-06,1.0,1.0,1.0,0.0,0.000358,0.022472,0.020025,0.023569,0.134897,0.348534,0.046161,0.827335,0.648326,0.285714,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.1e-05,0.0,0.0,1.0,0.0,0.000935,0.066837,0.109477,0.063973,0.107023,0.350846,0.038817,0.601451,0.661387,0.098901,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.4e-05,0.0,0.0,1.0,0.0,0.000819,0.116854,0.078975,0.117845,0.39288,0.298591,0.03882,0.825268,0.519522,0.098901,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 1.2.1 (Preparation) Creating the baseline model

In [23]:
# Build the data to use
X_train = train_df[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3", "DAYS_EMPLOYED"]]
X_train = X_train[:6000]
Y_train = train_labels[:6000]

x_train, x_test, y_train, y_test = train_test_split(X_train.values, Y_train.values, test_size=0.2, random_state=0)
print("x_train",x_train.shape, "x_test", x_test.shape, "y_train", y_train.shape, "y_test", y_test.shape)

x_train (4800, 4) x_test (1200, 4) y_train (4800,) y_test (1200,)


### 1.3.1 (Answer) Running cross-validation

In [24]:
# Cross-validation settings (random_state is omitted because it only applies when shuffle=True)
kf = KFold(n_splits = 3, shuffle = False)

# Placeholders for the out-of-fold (OOF) predictions
oof_train= np.zeros((x_train.shape[0],))
oof_test = np.zeros((x_test.shape[0],))
oof_test_skf = np.zeros((3, x_test.shape[0]))

# Empty list for the scores
auc_scores = []

# Run cross-validation
for train_index, test_index in kf.split(x_train, y_train):
    print("train_index",train_index[:5], "test_index", test_index[:5])

    # Split with KFold
    x_tr, y_tr = x_train[train_index], y_train[train_index]
    x_te, y_te = x_train[test_index], y_train[test_index]
    
    # Train a random forest
    rfc = RandomForestClassifier(max_depth=2).fit(x_tr, y_tr)
    rfc_proba = rfc.predict_proba(x_te)[:, 1]
    
    # Compute the AUC score
    score = roc_auc_score(y_te, rfc_proba)
    auc_scores.append(score)

    # Report the AUC
    print("AUC:",score)

train_index [1600 1601 1602 1603 1604] test_index [0 1 2 3 4]
AUC: 0.7364028066249682
train_index [0 1 2 3 4] test_index [1600 1601 1602 1603 1604]
AUC: 0.7322990460778
train_index [0 1 2 3 4] test_index [3200 3201 3202 3203 3204]
AUC: 0.7157058320620924


In [25]:
print("AUC score per fold",auc_scores)
print("Mean AUC score",np.mean(auc_scores))

AUC score per fold [0.7364028066249682, 0.7322990460778, 0.7157058320620924]
Mean AUC score 0.7281358949216202


### 1.3.2 (Answer) Running cross-validation, simplified

In [26]:
from sklearn.model_selection import cross_val_score, KFold

kf2 = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(max_depth=2), x_train, y_train, cv=kf2)
print(scores)
print ("mean score", np.mean(scores))

[0.9275   0.925    0.924375]
mean score 0.925625


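### 1.3.3 Summary  
When no particular evaluation procedure is required, cross_val_score simplifies the code. Note that cross_val_score uses the classifier's default metric (accuracy) unless `scoring` is specified, so the values in 1.3.2 are accuracy rather than AUC and cannot be compared directly with those in 1.3.1.  
Even so, it seems better to get used to the unsimplified loop, since it is also needed for learning stacking.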
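
As a hedged illustration of the point about metrics, the sketch below runs cross_val_score twice on synthetic data: once with the default metric and once with `scoring="roc_auc"`. The data from `make_classification` and the names `X_demo`, `y_demo`, `acc_demo` and `auc_demo` are assumptions made for this example; they are not the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, moderately imbalanced data (illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.8, 0.2], random_state=0,
)

cv_demo = KFold(n_splits=3, shuffle=True, random_state=0)
model_demo = RandomForestClassifier(max_depth=2, random_state=0)

# Default scoring for a classifier is accuracy
acc_demo = cross_val_score(model_demo, X_demo, y_demo, cv=cv_demo)
# Passing scoring="roc_auc" evaluates each fold with AUC instead
auc_demo = cross_val_score(model_demo, X_demo, y_demo, cv=cv_demo, scoring="roc_auc")

print("accuracy per fold:", acc_demo)
print("AUC per fold:", auc_demo)
```

On imbalanced targets, accuracy can look high even when the ranking quality measured by AUC is modest, so it helps to state the metric explicitly.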

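## [Problem 2] Grid search
So far we have used the classifiers with their default settings, without touching their parameters. The parameters themselves will be covered in detail in later Sprints. As a premise of machine learning, parameters must be chosen to suit the situation. Searching for the best parameters is called parameter tuning. A simple way to automate parameter tuning to some extent is grid search.

Write code that performs a grid search using scikit-learn's GridSearchCV, and tune some parameters of the baseline model. Refer to the official documentation of the method you used to decide which parameters to tune.  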
sklearn.model_selection.GridSearchCV — scikit-learn 0.21.3 documentation  
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

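The GridSearchCV class takes the model, the search range, and the number of cross-validation folds as arguments. Because it includes cross-validation, there is no need to run the KFold class separately when using it.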
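
The three arguments described above can be sketched as follows. This is a minimal example on synthetic data; the grid values, the choice of `scoring="roc_auc"`, and the names `X_demo`, `y_demo` and `search_demo` are assumptions for illustration rather than the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (illustration only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Search range: 3 x 2 = 6 parameter combinations
param_grid_demo = {"max_depth": [1, 2, 3], "n_estimators": [5, 10]}

# Model, search range, and number of folds; cv=3 runs 3-fold CV internally
search_demo = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid_demo,
    cv=3,
    scoring="roc_auc",
)
search_demo.fit(X_demo, y_demo)

print("best params:", search_demo.best_params_)
print("best mean CV AUC:", search_demo.best_score_)
```

`best_score_` is the mean cross-validated score of the best combination, measured with whatever `scoring` was given.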

### 2.1.1 (Answer) Grid search

In [27]:
# Random forest parameters
rfc_param = {RandomForestClassifier(): {"n_estimators":[i for i in range(2,5)],
                                       "criterion":["gini", "entropy"],
                                       "max_depth":[i for i in range(1,30)]
                                      }}
max_score = 0

# Run the random forest grid search
for model, param in tqdm(rfc_param.items()):
    clf = GridSearchCV(model, param, cv=5).fit(x_train, y_train)
    proba = clf.predict_proba(x_test)[:, 1]
    score = roc_auc_score(y_test, proba)
    
    if max_score < score:
        max_score = score
        best_param = clf.best_params_

print("best AUC:",max_score)
print("best params:",best_param)

100%|██████████| 1/1 [00:21<00:00, 21.16s/it]

best AUC: 0.6596053047857651
best params: {'criterion': 'entropy', 'max_depth': 4, 'n_estimators': 4}





### 2.1.2 (Answer) Grid search, simplified

In [29]:
# Train the random forest
rfc = RandomForestClassifier().fit(x_train, y_train)

# Random forest parameters
rfc_param = {"n_estimators":[i for i in range(2,5)],
             "criterion":["gini", "entropy"],
             "max_depth":[i for i in range(1,30)]
             }

# Run the grid search
grid_search = GridSearchCV(rfc, rfc_param, cv=5).fit(x_train, y_train)

# Get the best parameters
print(grid_search.best_params_)

print(grid_search.best_estimator_)

{'criterion': 'gini', 'max_depth': 1, 'n_estimators': 2}
RandomForestClassifier(max_depth=1, n_estimators=2)


In [16]:
rfc2 =RandomForestClassifier(criterion='entropy', max_depth=5, n_estimators=4).fit(x_train, y_train)
proba1 = rfc2.predict_proba(x_test)[:, 1]
score1 = roc_auc_score(y_test, proba1)

print("AUC:",score1)

AUC: 0.7177403467297083


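### 2.2.1 Summary  
The simplified method is more useful in terms of processing speed, but the first method still has merit in one respect: it can be extended to search over classifiers as well as parameters.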
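
One way the search over classifiers could look is sketched below, using a list of (estimator, grid) pairs. The second estimator (LogisticRegression), the grids, the synthetic data and every `*_demo` name are assumptions added for illustration; the notebook itself only searches a random forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data (illustration only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Each candidate pairs an estimator with its own parameter grid
candidates_demo = [
    (RandomForestClassifier(random_state=0), {"max_depth": [1, 3]}),
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}),
]

best_score_demo, best_model_demo = -1.0, None
for estimator, grid in candidates_demo:
    search = GridSearchCV(estimator, grid, cv=3, scoring="roc_auc").fit(X_demo, y_demo)
    # Keep the candidate with the highest mean cross-validated AUC
    if search.best_score_ > best_score_demo:
        best_score_demo, best_model_demo = search.best_score_, search.best_estimator_

print(type(best_model_demo).__name__, best_score_demo)
```

Comparing candidates on the cross-validated score, rather than on the held-out test split, keeps the test data free for a final check.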

## [Problem 3] Survey of Kaggle Notebooks  
Find a variety of ideas in Kaggle Notebooks and list them.

### 3.1.1 (Answer)
- LightGBM: building a baseline (used in week 4) https://www.codexa.net/lightgbm-beginner/
- Feature Importance: building a baseline (used in week 4)  
https://mathmatical22.xyz/2020/04/12/%E3%80%90%E5%88%9D%E5%BF%83%E8%80%85%E5%90%91%E3%81%91%E3%80%91%E7%89%B9%E5%BE%B4%E9%87%8F%E9%87%8D%E8%A6%81%E5%BA%A6%E3%81%AE%E7%AE%97%E5%87%BA-lightgbm-%E3%80%90python%E3%80%91%E3%80%90%E6%A9%9F/
- Permutation Importance  https://qiita.com/kenmatsu4/items/c49059f78c2b6fed0929
- XGBoost, random search, Bayesian optimization   https://www.codexa.net/hyperparameter-tuning-python/
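
As a rough illustration of the Permutation Importance item in the list above, the sketch below uses scikit-learn's `permutation_importance` on synthetic data. The linked article may use a different library; the data and the names `X_pi`, `y_pi`, `model_pi` and `result_pi` are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, of which 2 are informative (illustration only)
X_pi, y_pi = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_tr_pi, X_va_pi, y_tr_pi, y_va_pi = train_test_split(
    X_pi, y_pi, test_size=0.3, random_state=0
)

model_pi = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr_pi, y_tr_pi)

# Shuffle one column at a time on the validation data and measure the AUC drop
result_pi = permutation_importance(
    model_pi, X_va_pi, y_va_pi, scoring="roc_auc", n_repeats=5, random_state=0
)

for i in result_pi.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result_pi.importances_mean[i]:.4f}")
```

Because the importance is measured on held-out data, a feature that the model merely memorised tends to score near zero.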

### 3.2.1 (Background)
- Scaler    https://helve-python.hatenablog.jp/entry/scikitlearn-scale-conversion
- stratify  https://www.haya-programming.com/entry/2019/06/23/205121
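
For the stratify item above, here is a minimal sketch of what the `stratify` argument of `train_test_split` does on an imbalanced target. The toy arrays `X_imb` and `y_imb` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 90 negatives and 10 positives
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(-1, 1)

# stratify=y_imb keeps the 10% positive rate in both splits
X_tr_s, X_va_s, y_tr_s, y_va_s = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=0
)

print("positives in train:", y_tr_s.sum(), "of", len(y_tr_s))  # 8 of 80
print("positives in valid:", y_va_s.sum(), "of", len(y_va_s))  # 2 of 20
```

Without stratification the number of positives in the validation split varies with the random seed, which makes metrics such as AUC noisier on small samples.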

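## [Problem 4] Building a model with high generalization performance
Combine the ideas found in Problem 3 with your own to build a model with high generalization performance.  
As part of the process, summarize in a table what you did and how much the cross-validation results changed.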

### 4.1.1 (Background) Embedded method (measuring feature importance with a random forest)

In [30]:
xtrain, xtest, ytrain, ytest = train_test_split(train_df, train_labels, test_size=0.2, random_state=3)

# Train the random forest
rfc3 = RandomForestClassifier(n_estimators=20, random_state=4).fit(xtrain, ytrain)
rfc3_pred = rfc3.predict(xtrain)

# Extract the feature importances
feature_importance = rfc3.feature_importances_

# Display the importances sorted in ascending order
feature_importance = pd.Series(feature_importance, index=features)
values = feature_importance.sort_values()
values

FLAG_DOCUMENT_12                0.000000e+00
NAME_INCOME_TYPE_Businessman    0.000000e+00
FLAG_DOCUMENT_4                 0.000000e+00
FLAG_MOBIL                      0.000000e+00
FLAG_DOCUMENT_10                2.810749e-08
                                    ...     
DAYS_REGISTRATION               3.150815e-02
DAYS_ID_PUBLISH                 3.156294e-02
DAYS_BIRTH                      3.255876e-02
EXT_SOURCE_3                    4.770294e-02
EXT_SOURCE_2                    4.819231e-02
Length: 239, dtype: float64
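
A sorted importance Series like the one above can be used to pick a feature subset. The sketch below shows one simple way to do it, taking the top k names. The Series values and feature names here are hypothetical, not the values printed above.

```python
import pandas as pd

# Hypothetical importance values (illustration only)
importance_demo = pd.Series(
    {"feat_a": 0.05, "feat_b": 0.40, "feat_c": 0.00, "feat_d": 0.30, "feat_e": 0.25}
)

# Take the names of the k most important features
k = 3
top_features = importance_demo.sort_values(ascending=False).head(k).index.tolist()
print(top_features)  # ['feat_b', 'feat_d', 'feat_e']
```

The resulting list can be passed to a DataFrame column selection such as `train_df[top_features]`. How many features to keep is a judgment call that cross-validation can help settle.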

### 4.2.1 (Answer) XGBoost, baseline model

In [31]:
import xgboost as xgb

params = {'metric':'error',
          'objective':'binary:logistic',
          'n_estimators':50000,
          'booster': 'gbtree',
          'learning_rate':0.01,
          'min_child_weight':1,
          'max_depth':5,
          'random_state':0,
          'colsample_bytree':1,
          'subsample':1,
         }

cls = xgb.XGBClassifier()
cls.set_params(**params)
cls.fit(x_train,
        y_train,
        early_stopping_rounds=50,
        eval_set=[(x_test, y_test)],
        eval_metric='error',
        verbose=1)

Parameters: { metric } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[0]	validation_0-error:0.09083
Will train until validation_0-error hasn't improved in 50 rounds.
[1]	validation_0-error:0.09083
[2]	validation_0-error:0.09333
[3]	validation_0-error:0.09333
[4]	validation_0-error:0.09083
[5]	validation_0-error:0.09083
[6]	validation_0-error:0.09167
[7]	validation_0-error:0.09167
[8]	validation_0-error:0.09167
[9]	validation_0-error:0.09083
[10]	validation_0-error:0.09083
[11]	validation_0-error:0.08917
[12]	validation_0-error:0.09083
[13]	validation_0-error:0.08917
[14]	validation_0-error:0.08917
[15]	validation_0-error:0.08917
[16]	validation_0-error:0.08917
[17]	validation_0-error:0.08917
[18]	validation_0-error:0.08917
[19]	validation_0-error:0.08917
[20]	validation_0-erro

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.01, max_delta_step=0, max_depth=5, metric='error',
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=50000, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [32]:
xgb_proba = cls.predict_proba(x_test)[:,1]
xgb_score = roc_auc_score(y_test, xgb_proba)
print("AUC:" ,xgb_score)

AUC: 0.7355131636326324


### 4.2.2 (Answer) XGBoost, grid search

In [39]:
cv_params = {'metric':['error'],
             'objective':['binary:logistic'],
             'n_estimators':[50000],
             'random_state':[0],
             'booster': ['gbtree'],
             'learning_rate':[0.01],
             'min_child_weight':[1,5],
             'max_depth':[1,3],
             'colsample_bytree':[0.5,1.0],
             'subsample':[0.5,1.0]
            }

# Run the grid search
xgb1 = xgb.XGBClassifier()
xgb1_grid = GridSearchCV(xgb1, cv_params, cv=KFold(5, random_state=0), scoring='accuracy', iid=False)
xgb1_grid.fit(x_train, 
              y_train, 
              early_stopping_rounds=50,
              eval_set=[(x_test, y_test)],
              eval_metric='error',
              verbose=0)

Parameters: { metric } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


GridSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=False),
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, gamma=None,
                                     gpu_id=None, importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, mono...
                                     scale_pos_weight=None, subsample=None,
                                     tree_method=None, validate_parameters=None,
                                     verbosity=None),
             iid=False,
             param_grid={'booster': ['gbtree'], 'colsample_bytree': [0.5, 1.0],
     

In [40]:
# Get the best parameters
print(xgb1_grid.best_params_)
print(xgb1_grid.best_score_)

{'booster': 'gbtree', 'colsample_bytree': 0.5, 'learning_rate': 0.01, 'max_depth': 1, 'metric': 'error', 'min_child_weight': 1, 'n_estimators': 50000, 'objective': 'binary:logistic', 'random_state': 0, 'subsample': 0.5}
0.9256249999999999


In [41]:
xgb1_proba = xgb1_grid.best_estimator_.predict_proba(x_test)[:,1]
xgb1_score = roc_auc_score(y_test, xgb1_proba)
print("AUC:" ,xgb1_score)

AUC: 0.5704226556421065


### 4.2.3 (Answer) XGBoost, random search

In [42]:
cv_params = {'metric':['error'],
             'objective':['binary:logistic'],
             'n_estimators':[50000],
             'random_state':[0],
             'boosting_type': ['gbdt'],
             'learning_rate':[0.01],
             'min_child_weight':[1,2,3,4,5,6,7,8,9,10],
             'max_depth':[1,2,3,4,5,6,7,8,9,10],
             'colsample_bytree':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0],
             'subsample':[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0],
            }
 
xgb2 = xgb.XGBClassifier()
xgb2_rdn = RandomizedSearchCV(xgb2,
                              cv_params,
                              cv=KFold(5, random_state=0),
                              random_state=0,
                              n_iter=30,
                              iid=False,
                              scoring='accuracy')
xgb2_rdn.fit(x_train,
            y_train,
            early_stopping_rounds=50,
            eval_set=[(x_test, y_test)],
            eval_metric='error',
            verbose=0)

Parameters: { boosting_type, metric } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


RandomizedSearchCV(cv=KFold(n_splits=5, random_state=0, shuffle=False),
                   estimator=XGBClassifier(base_score=None, booster=None,
                                           colsample_bylevel=None,
                                           colsample_bynode=None,
                                           colsample_bytree=None, gamma=None,
                                           gpu_id=None, importance_type='gain',
                                           interaction_constraints=None,
                                           learning_rate=None,
                                           max_delta_step=None, max_depth=None,
                                           min_child_weight=None, missing=na...
                   param_distributions={'boosting_type': ['gbdt'],
                                        'colsample_bytree': [0.1, 0.2, 0.3, 0.4,
                                                             0.5, 0.6, 0.7, 0.8,
                                      

In [46]:
print(xgb2_rdn.best_params_)
print(xgb2_rdn.best_score_)

{'subsample': 0.3, 'random_state': 0, 'objective': 'binary:logistic', 'n_estimators': 50000, 'min_child_weight': 4, 'metric': 'error', 'max_depth': 8, 'learning_rate': 0.01, 'colsample_bytree': 0.3, 'boosting_type': 'gbdt'}
0.9256249999999999


In [45]:
xgb2_proba = xgb2_rdn.best_estimator_.predict_proba(x_test)[:,1]
xgb2_score = roc_auc_score(y_test, xgb2_proba)
print("AUC:" ,xgb2_score)

AUC: 0.5


### 4.2.4 (Answer) Bayesian optimization

In [47]:
from bayes_opt import BayesianOptimization

def xgb_evaluate(min_child_weight, subsample, colsample_bytree, max_depth):
    params = {'metric': 'error',
              'objective':'binary:logistic',
              'n_estimators':50000,
              'random_state':0,
              'boosting_type':'gbdt',
              'learning_rate':0.01,              
              'min_child_weight': int(min_child_weight),
              'max_depth': int(max_depth),
              'colsample_bytree': colsample_bytree,
              'subsample': subsample,
             }
    
    xgb3 = xgb.XGBClassifier()
    xgb3.set_params(**params)
    xgb3.fit(x_train,
             y_train,
             early_stopping_rounds=50,
             eval_set=[(x_test, y_test)],
             eval_metric='error',
             verbose=0)
    
    xgb3_proba = xgb3.predict_proba(x_test)[:,1]
    xgb3_score = roc_auc_score(y_test, xgb3_proba)
    return xgb3_score

In [48]:
xgb_bo = BayesianOptimization(xgb_evaluate, 
                              {'min_child_weight': (1,20),
                               'subsample': (.1,1),
                               'colsample_bytree': (.1,1),
                               'max_depth': (1,50)},
                              random_state=0)

In [49]:
xgb_bo.maximize(init_points=15, n_iter=10, acq='ei')

|   iter    |  target   | colsam... | max_depth | min_ch... | subsample |
-------------------------------------------------------------------------
Parameters: { boosting_type, metric } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


|  1        |  0.5773   |  0.5939   |  36.04    |  12.45    |  0.5904   |
|  2        |  0.5417   |  0.4813   |  32.65    |  9.314    |  0.9026   |
Parameters: { 

In [50]:
optimized_params = xgb_bo.max['params']
optimized_params['max_depth'] = int(optimized_params['max_depth'])
optimized_params

{'colsample_bytree': 0.8007420193266123,
 'max_depth': 27,
 'min_child_weight': 2.378138608765206,
 'subsample': 0.7340315830201506}

In [51]:
fixed_params = {'metric':'error',
                'objective':'binary:logistic',
                'n_estimators':50000,
                'random_state':0,
                'booster': 'gbtree',
                'learning_rate':0.01}

xgb4 = xgb.XGBClassifier()
xgb4.set_params(**fixed_params, **optimized_params)
xgb4.fit(x_train,
         y_train,
         early_stopping_rounds=50,
         eval_set=[(x_test, y_test)],
         eval_metric='error',
         verbose=0)

Parameters: { metric } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8007420193266123, gamma=0,
              gpu_id=-1, importance_type='gain', interaction_constraints='',
              learning_rate=0.01, max_delta_step=0, max_depth=27,
              metric='error', min_child_weight=2.378138608765206, missing=nan,
              monotone_constraints='()', n_estimators=50000, n_jobs=0,
              num_parallel_tree=1, random_state=0, reg_alpha=0, reg_lambda=1,
              scale_pos_weight=1, subsample=0.7340315830201506,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [52]:
xgb4_proba = xgb4.predict_proba(x_test)[:,1]
xgb4_score = roc_auc_score(y_test, xgb4_proba)
print("AUC:" ,xgb4_score)

AUC: 0.7126104094877341
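
Problem 4 asks for the results to be summarized in a table. One possible way to collect the AUC values printed in 4.2.1 through 4.2.4 is sketched below. Note that these are AUC values on the held-out `x_test` split, not cross-validation means, and that the column names are assumptions chosen for this example.

```python
import pandas as pd

# Holdout AUC values copied from the outputs of sections 4.2.1 to 4.2.4
results = pd.DataFrame(
    [
        ("XGBoost baseline (4.2.1)", 0.7355131636326324),
        ("XGBoost + grid search (4.2.2)", 0.5704226556421065),
        ("XGBoost + random search (4.2.3)", 0.5),
        ("XGBoost + Bayesian optimization (4.2.4)", 0.7126104094877341),
    ],
    columns=["method", "holdout_auc"],
)

# Difference from the baseline in row 0, then sort best-first
results["diff_vs_baseline"] = results["holdout_auc"] - results.loc[0, "holdout_auc"]
results = results.sort_values("holdout_auc", ascending=False).reset_index(drop=True)

print(results.round(4).to_string(index=False))
```

In this run none of the tuned variants beat the baseline on the held-out split. The grid and random searches selected parameters with `scoring='accuracy'`, which may partly explain their low AUC on this imbalanced target.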


## [Problem 5] Selecting the final model
Choose the model you ultimately consider best, submit its predictions to Kaggle, and check the score. Describe which ideas you adopted and what score they achieved.

### 5.1.1 (Verification)

In [56]:
Xtrain = train_df[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3", "DAYS_EMPLOYED"]]
test = test_df[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3", "DAYS_EMPLOYED"]]

xtrain, xtest, ytrain, ytest = train_test_split(Xtrain.values, train_labels.values, test_size=0.2, random_state=0)
print("xtrain",xtrain.shape, "xtest", xtest.shape, "ytrain", ytrain.shape, "ytest", ytest.shape)

xtrain (246008, 4) xtest (61503, 4) ytrain (246008,) ytest (61503,)


In [65]:
def xgb_evaluate1(min_child_weight, subsample, colsample_bytree, max_depth):
    params = {'metric': 'error',
              'objective':'binary:logistic',
              'n_estimators':50000,
              'random_state':0,
              'boosting_type':'gbdt',
              'learning_rate':0.01,              
              'min_child_weight': int(min_child_weight),
              'max_depth': int(max_depth),
              'colsample_bytree': colsample_bytree,
              'subsample': subsample,
             }
    
    xgb5 = xgb.XGBClassifier()
    xgb5.set_params(**params)
    xgb5.fit(xtrain,
             ytrain,
             early_stopping_rounds=50,
             eval_set=[(xtest, ytest)],
             eval_metric='error',
             verbose=0)
    
    xgb5_proba = xgb5.predict_proba(xtest)[:,1]
    xgb5_score = roc_auc_score(ytest, xgb5_proba)
    return xgb5_score

In [66]:
xgb_bo2 = BayesianOptimization(xgb_evaluate1, 
                              {'min_child_weight': (1,20),
                               'subsample': (.1,1),
                               'colsample_bytree': (.1,1),
                               'max_depth': (1,50)},
                              random_state=0)

In [67]:
xgb_bo2.maximize(init_points=15, n_iter=10, acq='ei')

|   iter    |  target   | colsam... | max_depth | min_ch... | subsample |
-------------------------------------------------------------------------
Parameters: { boosting_type, metric } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


|  1        |  0.715    |  0.5939   |  36.04    |  12.45    |  0.5904   |
|  2        |  0.5958   |  0.4813   |  32.65    |  9.314    |  0.9026   |
Parameters: { 

In [68]:
optimized_params1 = xgb_bo2.max['params']
optimized_params1['max_depth'] = int(optimized_params1['max_depth'])
optimized_params1

{'colsample_bytree': 0.8272205372615543,
 'max_depth': 38,
 'min_child_weight': 15.469921343824002,
 'subsample': 0.6549304535809041}

In [80]:
fixed_params = {'metric':'error',
                'objective':'binary:logistic',
                'random_state':0,
                'booster': 'gbtree',
                'learning_rate':0.01}

xgb6 = xgb.XGBClassifier()
xgb6.set_params(**fixed_params, **optimized_params1)
xgb6.fit(xtrain,ytrain)
xgb6_proba = xgb6.predict_proba(test)[:,1]

Parameters: { metric } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




In [83]:
submit = app_test[["SK_ID_CURR"]].copy()  # copy to avoid assigning into a view of app_test
submit['TARGET'] = xgb6_proba
submit.head()

   SK_ID_CURR    TARGET
0      100001  0.215406
1      100005  0.224137
2      100013  0.205077
3      100028  0.212541
4      100038  0.241963


In [85]:
submit.to_csv('xgb6_baseline.csv', index = False)

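### 5.1.2 (Answer) Summary  
The Kaggle score was 0.71849.  
This is about 0.03 higher than the 0.69026 scored by a random forest with untuned hyperparameters.
I was able to work through a set of well-known methods: selecting features with LightGBM, which is used frequently on Kaggle (the baseline created in week 4),  
followed by XGBoost and Bayesian optimization.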