## Bayesian optimization of LightGBM with Optuna

- Uses an Optuna study
- Code adapted from the [Optuna GitHub example](https://github.com/optuna/optuna-examples/blob/main/lightgbm/lightgbm_integration.py)

In [10]:
"""
Optuna example that demonstrates a pruner for LightGBM.
In this example, we optimize the validation accuracy of cancer detection using LightGBM.
We optimize both the choice of booster model and their hyperparameters. Throughout
training of models, a pruner observes intermediate results and stop unpromising trials.
You can run this example as follows:
    $ python lightgbm_integration.py
"""

'\nOptuna example that demonstrates a pruner for LightGBM.\nIn this example, we optimize the validation accuracy of cancer detection using LightGBM.\nWe optimize both the choice of booster model and their hyperparameters. Throughout\ntraining of models, a pruner observes intermediate results and stop unpromising trials.\nYou can run this example as follows:\n    $ python lightgbm_integration.py\n'

In [11]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pickle
import math
import urllib.request
import time

import scipy
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import mean_squared_error, r2_score

import lightgbm as lgb

In [12]:
import optuna
import optuna.integration.lightgbm as lgb_tune
from lightgbm import early_stopping
from lightgbm import log_evaluation
from sklearn.model_selection import KFold

In [32]:
df = pd.read_csv('..//Data-science//data//boston.csv')

X = df[['INDUS', 'RM', 'TAX', 'PTRATIO', 'LSTAT']]
y = df['house prices']

# Standardize each feature (z-score)
X = pd.DataFrame(scipy.stats.zscore(X),index=X.index, columns=X.columns)

In [44]:
# Split into training, validation, and test data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2, shuffle=True, random_state=123)
X_train,X_valid,y_train,y_valid = train_test_split(X_train,y_train,test_size=0.2, shuffle=True, random_state=123)

# Convert to ndarray for LightGBM
X_train = X_train.values
X_valid = X_valid.values
X_test = X_test.values

y_train = y_train.values
y_valid = y_valid.values
y_test = y_test.values
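
The two chained `train_test_split` calls above amount to roughly a 64/16/20 split into training, validation, and test data. A minimal sketch of the arithmetic follows, assuming scikit-learn's rule of rounding the held-out fraction up. The input size of 506 rows is an assumption (the row count is not stated in this notebook); it is chosen because it reproduces the 323 training rows reported in the final training log. `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_rows, test_size=0.2, valid_size=0.2):
    """Return (n_train, n_valid, n_test) for two chained hold-out splits."""
    # First split: hold out the test data (fraction rounded up)
    n_test = math.ceil(test_size * n_rows)
    n_rest = n_rows - n_test
    # Second split: hold out the validation data from what remains
    n_valid = math.ceil(valid_size * n_rest)
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test


sizes = split_sizes(506)  # 506 rows is an assumption, see above
print(sizes)  # (323, 81, 102)
```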


In [43]:
# Convert the data to LightGBM Datasets
trains = lgb.Dataset(X_train, label=y_train, free_raw_data=False)
valids = lgb.Dataset(X_valid, label=y_valid, free_raw_data=False)

### About the hyperparameters
#### Use suggest_float or suggest_int to set each parameter's search range (minimum, maximum)
#### suggest_int searches integer values only
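
To make the search ranges concrete, here is a dependency-free sketch of what a log-scaled float range and an inclusive integer range mean. It is not Optuna's sampler (the default sampler adapts to earlier trials, while this draws independently); `sample_log_uniform` and `sample_int` are hypothetical helpers that only illustrate the ranges used for `lambda_l1` and `num_leaves`.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    value = math.exp(rng.uniform(math.log(low), math.log(high)))
    return min(max(value, low), high)  # guard against floating-point overshoot


def sample_int(rng, low, high):
    """Draw an integer from the inclusive range [low, high]."""
    return rng.randint(low, high)


rng = random.Random(0)
lambda_l1_draws = [sample_log_uniform(rng, 1e-8, 10.0) for _ in range(1000)]
num_leaves_draws = [sample_int(rng, 2, 256) for _ in range(1000)]

# With log scaling, about half of the draws fall below the geometric midpoint
midpoint = math.sqrt(1e-8 * 10.0)
share_below = sum(v < midpoint for v in lambda_l1_draws) / len(lambda_l1_draws)
print(f"share of draws below {midpoint:.1e}: {share_below:.2f}")
```

A log-scaled range suits regularization strengths such as `lambda_l1`, where values spanning many orders of magnitude are plausible.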

In [35]:
def objective(trial):

    # Hyperparameters: fixed settings plus search ranges
    param = {
        "objective": "regression",
        "metric": "rmse",
        "verbosity": -1,
        "boosting_type": "gbdt",
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }

    # Train
    gbm = lgb.train(param, trains)

    # Predict on the validation data
    preds = gbm.predict(X_valid)
    # Compute the validation RMSE
    rmse = np.sqrt(mean_squared_error(y_valid, preds))

    return rmse
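
The objective returns the validation RMSE, which is why the study is created with `direction="minimize"`: a lower value means predictions closer to the targets. As a small dependency-free sketch of the quantity being returned (the numbers are made-up toy values, not taken from the data set):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values for illustration only: the errors are 2, 0 and 1
example = rmse([20.0, 30.0, 25.0], [22.0, 30.0, 24.0])
print(example)
```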


In [39]:
if __name__ == "__main__":
    # Optimize with Optuna
    study = optuna.create_study(direction="minimize")
    # To limit the search by number of trials
    study.optimize(objective, n_trials=1000)
    # To limit the search by time (in seconds)
    #study.optimize(objective, timeout=60)

    print("Number of finished trials: {}".format(len(study.trials)))

    print("Best trial:")

    print("  Value: {}".format(study.best_trial.value))

    print("  Params: ")

    tuned_params = {}
    for key, value in study.best_trial.params.items():
        #print("    {}: {}".format(key, value))
        tuned_params[key] = value

    print(tuned_params)
    
    #print(study.best_trial.value)

[I 2021-12-29 17:33:27,528] A new study created in memory with name: no-name-bfba6c04-4434-4107-afb0-77e15b3afe5a
[I 2021-12-29 17:33:27,673] Trial 0 finished with value: 6.005932946104612 and parameters: {'lambda_l1': 2.5896660014843876e-07, 'lambda_l2': 0.6839056915417362, 'num_leaves': 239, 'feature_fraction': 0.6850250448643385, 'bagging_fraction': 0.8296000316748228, 'bagging_freq': 6, 'min_child_samples': 61}. Best is trial 0 with value: 6.005932946104612.
[I 2021-12-29 17:33:27,802] Trial 1 finished with value: 6.6848775435130054 and parameters: {'lambda_l1': 0.0007053949846991518, 'lambda_l2': 9.688976037773749e-05, 'num_leaves': 55, 'feature_fraction': 0.8509881360287996, 'bagging_fraction': 0.7096794105632367, 'bagging_freq': 6, 'min_child_samples': 74}. Best is trial 0 with value: 6.005932946104612.
[I 2021-12-29 17:33:27,853] Trial 2 finished with value: 5.161678992196041 and parameters: {'lambda_l1': 1.470657623350737e-07, 'l

Number of finished trials: 1000
Best trial:
  Value: 3.917011523033931
  Params: 
{'lambda_l1': 0.6174205905201046, 'lambda_l2': 0.0001615654956991066, 'num_leaves': 124, 'feature_fraction': 0.8123750056883934, 'bagging_fraction': 0.562320282799111, 'bagging_freq': 7, 'min_child_samples': 7}


In [41]:
tuned_params = tuned_params

In [48]:
tuned_params

{'lambda_l1': 0.6174205905201046,
 'lambda_l2': 0.0001615654956991066,
 'num_leaves': 124,
 'feature_fraction': 0.8123750056883934,
 'bagging_fraction': 0.562320282799111,
 'bagging_freq': 7,
 'min_child_samples': 7}
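
The dictionary above holds only the values suggested during the trial. The fixed settings from `objective` (`objective`, `metric`, `verbosity`, `boosting_type`) are not part of `study.best_trial.params`, so a model trained from `tuned_params` alone falls back to LightGBM's defaults for them. One option, shown here as a sketch rather than as what this notebook does, is to merge both dictionaries before retraining. The names `fixed_params`, `best_params`, and `final_params` are hypothetical.

```python
# Fixed settings copied from the objective function
fixed_params = {
    "objective": "regression",
    "metric": "rmse",
    "verbosity": -1,
    "boosting_type": "gbdt",
}

# Best values as printed above (stand-in for study.best_trial.params)
best_params = {
    "lambda_l1": 0.6174205905201046,
    "lambda_l2": 0.0001615654956991066,
    "num_leaves": 124,
    "feature_fraction": 0.8123750056883934,
    "bagging_fraction": 0.562320282799111,
    "bagging_freq": 7,
    "min_child_samples": 7,
}

# Later keys win on conflict, so tuned values would override fixed ones
final_params = {**fixed_params, **best_params}
print(sorted(final_params))
```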

In [49]:
valid_scores = []
rmse_list = []
r2_list = []
tuned_models = []
kf = KFold(n_splits=5, shuffle=False)

for fold, (train_index, valid_index) in enumerate(kf.split(X_train)):

    X_tr, X_val = X_train[train_index], X_train[valid_index]
    y_tr, y_val = y_train[train_index], y_train[valid_index]

    # data takes the feature matrix, label takes the target values
    trains = lgb.Dataset(data=X_tr, label=y_tr, feature_name='auto')
    # feature_name='auto' picks up column names only when data is a DataFrame (here it is an ndarray)
    evals = lgb.Dataset(data=X_val, label=y_val, feature_name='auto')

    model = lgb.train(
        tuned_params,
        trains,
        valid_sets=evals,
        num_boost_round=50,
        # Callbacks replace the deprecated early_stopping_rounds / verbose_eval arguments;
        # period=0 disables per-iteration logging
        callbacks=[early_stopping(stopping_rounds=10), log_evaluation(period=0)],
    )

    pred = model.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, pred))
    r2 = r2_score(y_val, pred)
    rmse_list.append(rmse)
    r2_list.append(r2)
    valid_scores.append(pred)

    tuned_models.append(model)



You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 210
[LightGBM] [Info] Number of data points in the train set: 258, number of used features: 5
[LightGBM] [Info] Start training from score 22.277907
Training until validation scores don't improve for 10 rounds
Did not meet early stopping. Best iteration is:
[43]	valid_0's l2: 7.48894
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 206
[LightGBM] [Info] Number of data points in the train set: 258, number of used features: 5
[LightGBM] [Info] Start training from score 22.241473
Training until validation scores don't improve for 10 rounds
Did not meet early stopping. Best iteration is:
[44]	valid_0's l2: 10.5049
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 204
[LightGBM] [Info] Number of data points in the train set: 258, number of used features: 5
[LightGBM] [Info] Start training from score 21.969768
Training until validati




Early stopping, best iteration is:
[29]	valid_0's l2: 6.16443
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 207
[LightGBM] [Info] Number of data points in the train set: 259, number of used features: 5
[LightGBM] [Info] Start training from score 22.127413
Training until validation scores don't improve for 10 rounds
Did not meet early stopping. Best iteration is:
[50]	valid_0's l2: 12.234
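
The training-set sizes in the log (258 for the early folds, 259 for the last) follow from how unshuffled k-fold splitting cuts 323 rows into five contiguous blocks. Below is a dependency-free sketch of that rule, assuming the scikit-learn convention that the first `n_samples % n_splits` folds get one extra row. `contiguous_folds` is a hypothetical helper, not the library implementation.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (train_indices, valid_indices) for unshuffled k-fold splitting."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional row
        size = base + 1 if fold < extra else base
        stop = start + size
        valid = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, valid
        start = stop


train_sizes = [len(train) for train, _ in contiguous_folds(323, 5)]
print(train_sizes)  # [258, 258, 258, 259, 259]
```

Because `shuffle=False`, each validation fold is a contiguous slice of `X_train`; the rows were already shuffled once by `train_test_split`.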




In [50]:
print(f'rmse: {np.mean(rmse_list)}')
print(f'r2: {np.mean(r2_list)}')

rmse: 3.2918542178979577
r2: 0.853939630748312


In [51]:
# Note: `model` here is the model trained on the last CV fold only
pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)

print(f'r2: {r2}')
print(f'rmse: {rmse}')

r2: 0.6971544512263478
rmse: 5.0056123772359875
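
The test score above comes from `model`, which after the loop refers to the model of the last fold only. One common alternative, which this notebook does not use, is to average the predictions of all models collected in `tuned_models`. A minimal dependency-free sketch follows; `ConstantModel` is a hypothetical stand-in for a trained booster so that the snippet runs on its own, and `average_predictions` is a hypothetical helper.

```python
class ConstantModel:
    """Hypothetical stand-in for a trained model: predicts one fixed value per row."""

    def __init__(self, value):
        self.value = value

    def predict(self, rows):
        return [self.value for _ in rows]


def average_predictions(models, rows):
    """Average the per-row predictions of several models."""
    per_model = [m.predict(rows) for m in models]
    # zip(*...) groups the predictions of all models for the same row
    return [sum(col) / len(col) for col in zip(*per_model)]


toy_models = [ConstantModel(1.0), ConstantModel(3.0)]
toy_rows = [[0.0], [0.0]]
averaged = average_predictions(toy_models, toy_rows)
print(averaged)  # [2.0, 2.0]
```

With the notebook's objects, the call would be `average_predictions(tuned_models, X_test)`.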


In [52]:
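# data takes the feature matrix, label takes the target values
trains = lgb.Dataset(data=X_train, label=y_train, feature_name='auto')
# feature_name='auto' picks up column names only when data is a DataFrame (here it is an ndarray)
evals = lgb.Dataset(data=X_valid, label=y_valid, feature_name='auto')

model_best = lgb.train(
    tuned_params,
    trains,
    valid_sets=evals,
    num_boost_round=50,
    # Callbacks replace the deprecated early_stopping_rounds / verbose_eval arguments;
    # period=0 disables per-iteration logging
    callbacks=[early_stopping(stopping_rounds=10), log_evaluation(period=0)],
)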

pred = model_best.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)



You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 236
[LightGBM] [Info] Number of data points in the train set: 323, number of used features: 5
[LightGBM] [Info] Start training from score 22.090712
Training until validation scores don't improve for 10 rounds
Did not meet early stopping. Best iteration is:
[50]	valid_0's l2: 15.8436


In [47]:
print(f'rmse: {rmse}')
print(f'r2: {r2}')

rmse: 4.857573945117617
r2: 0.7148025724093248
