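## XGBoost

---
### Parameter tuning
* General parameters
Choose whether boosting uses trees or a linear model, among other global settings.
* Booster parameters
The parameters that apply depend on which booster was selected.
* Learning task parameters
Define the learning scenario, such as the objective and the evaluation metric.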
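
The three groups above can be kept apart in code and merged only when the model is configured. The sketch below is illustrative: the dictionary names and every value are assumptions chosen for the example, not recommended settings.

```python
# Hypothetical values, chosen only to show which group each key belongs to.
general_params = {"booster": "gbtree", "verbosity": 1}
booster_params = {"learning_rate": 0.1, "max_depth": 5, "subsample": 0.8}
task_params = {"objective": "reg:squarederror", "eval_metric": "rmse", "seed": 0}

# XGBoost takes one flat set of parameters, so the groups are merged before use.
params = {**general_params, **booster_params, **task_params}
print(params)
```

Keeping the groups separate makes it easier to see which settings have to change together when the booster type is switched.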

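* General parameters
    - booster [default = gbtree]
        Selects the booster type.
        The options are the tree-based model (gbtree), the linear model (gblinear), and dart.
    - n_jobs
        Number of parallel threads used to run XGBoost.
    - verbosity [default = 1]
        Valid values are 0 (silent), 1 (warning), 2 (info), and 3 (debug).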
        
        
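* Booster parameters
    - Parameters of the gbtree booster
        - learning_rate [ default: 0.3, alias: eta ]
            The higher the learning rate, the easier it is to overfit.
        - n_estimators [ default: 100 ]
            Number of weak learners to build.
            When learning_rate is low, n_estimators must be raised; otherwise the model underfits.
        - max_depth [ default: 6 ]
            Maximum depth of a tree.
            It needs to be set sensibly; values between 3 and 10 are typical.
            The higher max_depth is, the more complex the model becomes and the easier it is to overfit.
        - min_child_weight [ default: 1 ]
            Minimum sum of instance weights (hessian) required in a child node.
            Higher values help prevent overfitting.
        - gamma [ default: 0 ]
            Minimum loss reduction required to split a leaf node further.
            A split is made only when the loss decreases by more than this value.
            Higher values help prevent overfitting.
        - subsample [ default: 1 ]
            Fraction of the training rows sampled for each weak learner.
            Values between 0.5 and 1 are typical.
            Lower values help prevent overfitting.
        - colsample_bytree [ default: 1 ]
            Fraction of the features sampled when building each tree.
            Values between 0.5 and 1 are typical.
            Lower values help prevent overfitting.
        - lambda [default = 1, alias: reg_lambda]
            L2 regularization term on the weights.
            Worth considering when the number of features is large.
            Larger values reduce overfitting.
        - alpha [default = 0, alias: reg_alpha]
            L1 regularization term on the weights.
            Worth considering when the number of features is large.
            Larger values reduce overfitting.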
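
To see how gamma, lambda, and min_child_weight interact, the sketch below evaluates the standard XGBoost split gain for a single candidate split. It is a simplified illustration, assuming alpha = 0 and taking the gradient and hessian sums of each child as given; the function names are hypothetical and are not part of the xgboost API.

```python
def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score G^2 / (H + lambda) of one leaf."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a leaf into two children, minus gamma."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


def should_split(g_left, h_left, g_right, h_right,
                 reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Reject the split if a child is too light or the gain is not positive."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0


# Hypothetical child statistics: two rows per child with squared-error loss (h = 1 per row).
print(split_gain(-4.0, 2.0, 4.0, 2.0))                          # lambda = 1, gamma = 0
print(split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=10.0))         # larger lambda, smaller gain
print(should_split(-4.0, 2.0, 4.0, 2.0, gamma=6.0))             # gamma exceeds the gain
print(should_split(-4.0, 2.0, 4.0, 2.0, min_child_weight=3.0))  # children too light
```

Raising any of the three parameters makes the same candidate split less likely to be accepted, which is why all of them act as overfitting controls.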

* Learning task parameters
    - objective [ default: reg:squarederror ]
        - reg:squarederror [regression with squared loss]
        - binary:logistic (logistic regression for binary classification)
        [returns the predicted probability rather than the class]
        - multi:softmax
        Multiclass classification with softmax; it returns the predicted class rather than a probability. num_class must also be set.
        - multi:softprob
        Returns the predicted probability of belonging to each class.
        - count:poisson (Poisson regression for count data), among many others.
    - eval_metric: the metric used to evaluate the model.
    A default is assigned according to the chosen objective.
        - rmse: root mean square error
        - mae: mean absolute error
        - logloss: negative log-likelihood
        - error: binary classification error rate (0.5 threshold)
        - merror: multiclass classification error rate
        - mlogloss: multiclass logloss
        - auc: area under the curve
        map (mean average precision) and others are also available; choose the metric that suits the data.
    - seed [ default: 0 ]
        Fixes the random seed so that results are reproducible.
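
The difference between the classification objectives is easiest to see on raw scores. The sketch below is plain Python, not xgboost code, and the raw scores are made-up values: it shows the probability that binary:logistic reports, the per-class probabilities of multi:softprob, and the class index of multi:softmax.

```python
import math


def sigmoid(margin):
    """Map a raw score to a probability, as binary:logistic does."""
    return 1.0 / (1.0 + math.exp(-margin))


def softmax(margins):
    """Map one raw score per class to probabilities that sum to 1."""
    highest = max(margins)
    exps = [math.exp(m - highest) for m in margins]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


raw_binary = 0.0             # hypothetical raw score for one row
raw_multi = [2.0, 1.0, 0.1]  # hypothetical raw scores for three classes

prob_binary = sigmoid(raw_binary)              # binary:logistic -> probability
softprob = softmax(raw_multi)                  # multi:softprob -> probability per class
softmax_label = softprob.index(max(softprob))  # multi:softmax -> class index only

print(prob_binary, softprob, softmax_label)
```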

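* Settings that need careful tuning

    - booster type
    - eval_metric (evaluation metric) / objective (objective function)
    - eta
    - L1 regularization (alpha). An L1 penalty grows linearly with the size of a weight, so it reacts less strongly to very large values than the squared L2 penalty, not more.
    - L2 regularization (lambda)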

* Settings to adjust to prevent overfitting

    - Lower the learning rate → n_estimators must be raised to compensate
    - Lower max_depth
    - Raise min_child_weight
    - Raise gamma
    - Lower subsample and colsample_bytree
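
The link between a lower learning rate and a higher n_estimators can be made concrete with a rough calculation. The sketch below rests on a strong simplifying assumption: each tree fits the current residual perfectly and is then scaled by the learning rate, so the residual shrinks by a factor of (1 - learning_rate) per round. Real trees are not perfect fits, so the numbers only indicate the order of magnitude. `rounds_needed` is a hypothetical helper.

```python
import math


def rounds_needed(learning_rate, target_fraction=0.01):
    """Rounds until the residual falls to target_fraction of its initial size,
    under the idealized assumption that it shrinks by (1 - learning_rate) each round."""
    return math.ceil(math.log(target_fraction) / math.log(1.0 - learning_rate))


for eta in (0.3, 0.1, 0.01):
    print(eta, rounds_needed(eta))
```

Under this assumption the required number of rounds grows roughly in inverse proportion to the learning rate, which matches the rule of thumb in the list above.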


---
### References
* https://www.kaggle.com/lifesailor/xgboost
* https://wooono.tistory.com/97


In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

import xgboost as xgb
from xgboost import XGBRegressor
from xgboost import plot_importance

import optuna 
from optuna import Trial, visualization
from optuna.samplers import TPESampler

# autogluon

%matplotlib inline

In [2]:
import warnings

warnings.filterwarnings( 'ignore' )

In [3]:
train_data = pd.read_csv("train.csv") 
test_data = pd.read_csv("test.csv")

In [4]:
x_data = train_data.loc[:, 'f0':'f99']
y_data = train_data.loc[:, 'loss']

In [5]:
from sklearn.model_selection import train_test_split

In [9]:
x_train, x_test, y_train, y_test = train_test_split(x_data,
                                                    y_data,
                                                    test_size=0.2,    # hold out 20% of the data for testing
                                                                      # the remaining 80% is used for training
                                                    shuffle=True,     # shuffle before splitting
                                                    random_state=20)  # fixed seed so the split is reproducible

In [7]:
import xgboost as xgb

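# Declare the model
my_model = xgb.XGBRegressor(
    learning_rate=0.1,
    max_depth=5,
    n_estimators=100)

# Train the model
my_model.fit(x_train, y_train, verbose=False)

# Predict; store the result in y_pred so the true labels in y_test are not overwritten
y_pred = my_model.predict(x_test)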

Parameters: { verbose } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




KeyboardInterrupt: 

In [24]:
import xgboost as xgb

# Declare the model
my_model = xgb.XGBRegressor(
    n_estimators=3520,
    max_depth=11,
    min_child_weight=231,
    gamma=2,
    colsample_bytree=0.7,
    reg_lambda=0.014950936465569798,
    reg_alpha=0.28520156840812494,  # scikit-learn style name for alpha
    subsample=0.6,
    learning_rate=0.01,
)

# Train the model
my_model.fit(x_train, y_train, verbose=False)

# Predict
y_pred = my_model.predict(x_test)

In [25]:
# RMSE on the held-out split; scikit-learn expects (y_true, y_pred)
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(rms)

7.797709619810067


In [20]:
def objectiveXGB(trial: Trial, x_data, y_data, test):
    # `test` is kept for compatibility with the caller but is not used here.
    param = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 10000),
        'max_depth': trial.suggest_int('max_depth', 5, 10),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300),
        'gamma': trial.suggest_int('gamma', 1, 3),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.3, 0.1, 0.05, 0.01]),
        # suggest_float replaces the deprecated suggest_discrete_uniform / suggest_loguniform
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0, step=0.1),
        'lambda': trial.suggest_float('lambda', 1e-3, 10.0, log=True),
        'alpha': trial.suggest_float('alpha', 1e-3, 10.0, log=True),
        'subsample': trial.suggest_categorical('subsample', [0.6, 0.7, 0.8, 1.0]),
        'random_state': 42
    }
    # First split: 80% for training, 20% held out
    x_train, x_eval, y_train, y_eval = train_test_split(x_data,
                                                        y_data,
                                                        test_size=0.2,
                                                        shuffle=True,     # shuffle before splitting
                                                        random_state=20)  # fixed seed for reproducibility

    # Second split of the held-out part: 80% for early stopping, 20% for scoring
    x_eval, x_test, y_eval, y_test = train_test_split(x_eval,
                                                      y_eval,
                                                      test_size=0.2,
                                                      shuffle=True,
                                                      random_state=77)

    model = xgb.XGBRegressor(**param)
    # Note: from XGBoost 2.0 on, early_stopping_rounds is passed to the constructor, not to fit().
    xgb_model = model.fit(x_train, y_train, early_stopping_rounds=50, eval_set=[(x_eval, y_eval)])
    # The returned score is the MSE (not the RMSE) on the scoring split.
    score = mean_squared_error(y_test, xgb_model.predict(x_test))

    return score
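
The `early_stopping_rounds=50` argument means that training stops once the validation metric has not improved for 50 consecutive rounds. The sketch below reproduces that rule in plain Python on a made-up list of validation scores, with a patience of 3 to keep the trace short; `best_round_with_early_stopping` is a hypothetical helper, not an xgboost function.

```python
def best_round_with_early_stopping(scores, patience):
    """Return (best_round, last_round_run) for a lower-is-better metric."""
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(scores):
        if score < best_score:
            # New best: remember it and keep training.
            best_score = score
            best_round = round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop here.
            return best_round, round_idx
    return best_round, len(scores) - 1


# Hypothetical validation RMSE per boosting round.
val_rmse = [9.88, 9.70, 9.54, 9.60, 9.58, 9.57, 9.56]
print(best_round_with_early_stopping(val_rmse, patience=3))
```

With early stopping in place, a large upper bound for n_estimators mostly costs search time rather than accuracy, because training ends once the validation score stalls.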

In [None]:
study = optuna.create_study(direction='minimize',
                            sampler=TPESampler())
study.optimize(lambda trial : objectiveXGB(trial, x_data,  y_data, x_test), n_trials=50)
print('Best trial: score {},\nparams {}'.format(study.best_trial.value,study.best_trial.params))

best_param = study.best_trial.params

[I 2021-08-10 00:28:00,203] A new study created in memory with name: no-name-40923849-1b23-44ae-82ae-acd2a71b6626


[0]	validation_0-rmse:9.88326
[1]	validation_0-rmse:9.70315
[2]	validation_0-rmse:9.53804
[3]	validation_0-rmse:9.38651
[4]	validation_0-rmse:9.24764
[5]	validation_0-rmse:9.11990
[6]	validation_0-rmse:9.00325
[7]	validation_0-rmse:8.89750
[8]	validation_0-rmse:8.79987
[9]	validation_0-rmse:8.71051
[10]	validation_0-rmse:8.62991
[11]	validation_0-rmse:8.55651
[12]	validation_0-rmse:8.48891
[13]	validation_0-rmse:8.42825
[14]	validation_0-rmse:8.37276
[15]	validation_0-rmse:8.32208
[16]	validation_0-rmse:8.27597
[17]	validation_0-rmse:8.23464
[18]	validation_0-rmse:8.19679
[19]	validation_0-rmse:8.16307
[20]	validation_0-rmse:8.13231
[21]	validation_0-rmse:8.10427
[22]	validation_0-rmse:8.07871
[23]	validation_0-rmse:8.05609
[24]	validation_0-rmse:8.03516
[25]	validation_0-rmse:8.01615
[26]	validation_0-rmse:7.99874
[27]	validation_0-rmse:7.98344
[28]	validation_0-rmse:7.96977
[29]	validation_0-rmse:7.95678
[30]	validation_0-rmse:7.94562
[31]	validation_0-rmse:7.93543
[32]	validation_0-

[260]	validation_0-rmse:7.80170
[261]	validation_0-rmse:7.80164
[262]	validation_0-rmse:7.80180
[263]	validation_0-rmse:7.80174
[264]	validation_0-rmse:7.80157
[265]	validation_0-rmse:7.80154
[266]	validation_0-rmse:7.80147
[267]	validation_0-rmse:7.80140
[268]	validation_0-rmse:7.80130
[269]	validation_0-rmse:7.80135
[270]	validation_0-rmse:7.80133
[271]	validation_0-rmse:7.80120
[272]	validation_0-rmse:7.80112
[273]	validation_0-rmse:7.80102
[274]	validation_0-rmse:7.80098
[275]	validation_0-rmse:7.80093
[276]	validation_0-rmse:7.80095
[277]	validation_0-rmse:7.80124
[278]	validation_0-rmse:7.80096
[279]	validation_0-rmse:7.80103
[280]	validation_0-rmse:7.80089
[281]	validation_0-rmse:7.80104
[282]	validation_0-rmse:7.80088
[283]	validation_0-rmse:7.80096
[284]	validation_0-rmse:7.80073
[285]	validation_0-rmse:7.80061
[286]	validation_0-rmse:7.80046
[287]	validation_0-rmse:7.80040
[288]	validation_0-rmse:7.80057
[289]	validation_0-rmse:7.80037
[290]	validation_0-rmse:7.80028
[291]	va

[I 2021-08-10 00:31:34,089] Trial 0 finished with value: 61.52010932963598 and parameters: {'n_estimators': 1464, 'max_depth': 7, 'min_child_weight': 95, 'gamma': 1, 'learning_rate': 0.05, 'colsample_bytree': 0.9, 'lambda': 0.00416487544647258, 'alpha': 0.0867180624129271, 'subsample': 0.6}. Best is trial 0 with value: 61.52010932963598.


[0]	validation_0-rmse:9.88332
[1]	validation_0-rmse:9.70375
[2]	validation_0-rmse:9.53913
[3]	validation_0-rmse:9.38851
[4]	validation_0-rmse:9.24996
[5]	validation_0-rmse:9.12283
[6]	validation_0-rmse:9.00636
[7]	validation_0-rmse:8.90019
[8]	validation_0-rmse:8.80340
[9]	validation_0-rmse:8.71460
[10]	validation_0-rmse:8.63413
[11]	validation_0-rmse:8.56132
[12]	validation_0-rmse:8.49412
[13]	validation_0-rmse:8.43359
[14]	validation_0-rmse:8.37833
[15]	validation_0-rmse:8.32795
[16]	validation_0-rmse:8.28233
[17]	validation_0-rmse:8.24107
[18]	validation_0-rmse:8.20371
[19]	validation_0-rmse:8.16978
[20]	validation_0-rmse:8.13903
[21]	validation_0-rmse:8.11111
[22]	validation_0-rmse:8.08570
[23]	validation_0-rmse:8.06292
[24]	validation_0-rmse:8.04238
[25]	validation_0-rmse:8.02346
[26]	validation_0-rmse:8.00637
[27]	validation_0-rmse:7.99098
[28]	validation_0-rmse:7.97720
[29]	validation_0-rmse:7.96460
[30]	validation_0-rmse:7.95323
[31]	validation_0-rmse:7.94297
[32]	validation_0-

[260]	validation_0-rmse:7.80166
[261]	validation_0-rmse:7.80166
[262]	validation_0-rmse:7.80162
[263]	validation_0-rmse:7.80148
[264]	validation_0-rmse:7.80124
[265]	validation_0-rmse:7.80114
[266]	validation_0-rmse:7.80114
[267]	validation_0-rmse:7.80096
[268]	validation_0-rmse:7.80085
[269]	validation_0-rmse:7.80082
[270]	validation_0-rmse:7.80063
[271]	validation_0-rmse:7.80048
[272]	validation_0-rmse:7.80062
[273]	validation_0-rmse:7.80043
[274]	validation_0-rmse:7.80014
[275]	validation_0-rmse:7.80018
[276]	validation_0-rmse:7.80016
[277]	validation_0-rmse:7.80011
[278]	validation_0-rmse:7.80002
[279]	validation_0-rmse:7.79996
[280]	validation_0-rmse:7.79987
[281]	validation_0-rmse:7.79960
[282]	validation_0-rmse:7.79951
[283]	validation_0-rmse:7.79940
[284]	validation_0-rmse:7.79922
[285]	validation_0-rmse:7.79905
[286]	validation_0-rmse:7.79904
[287]	validation_0-rmse:7.79894
[288]	validation_0-rmse:7.79884
[289]	validation_0-rmse:7.79884
[290]	validation_0-rmse:7.79860
[291]	va

[517]	validation_0-rmse:7.79207
[518]	validation_0-rmse:7.79209
[519]	validation_0-rmse:7.79216
[520]	validation_0-rmse:7.79210
[521]	validation_0-rmse:7.79206
[522]	validation_0-rmse:7.79211
[523]	validation_0-rmse:7.79204
[524]	validation_0-rmse:7.79210
[525]	validation_0-rmse:7.79197
[526]	validation_0-rmse:7.79178
[527]	validation_0-rmse:7.79168
[528]	validation_0-rmse:7.79172
[529]	validation_0-rmse:7.79172
[530]	validation_0-rmse:7.79164
[531]	validation_0-rmse:7.79160
[532]	validation_0-rmse:7.79158
[533]	validation_0-rmse:7.79146
[534]	validation_0-rmse:7.79150
[535]	validation_0-rmse:7.79150
[536]	validation_0-rmse:7.79151
[537]	validation_0-rmse:7.79153
[538]	validation_0-rmse:7.79151
[539]	validation_0-rmse:7.79158
[540]	validation_0-rmse:7.79162
[541]	validation_0-rmse:7.79156
[542]	validation_0-rmse:7.79166
[543]	validation_0-rmse:7.79159
[544]	validation_0-rmse:7.79161
[545]	validation_0-rmse:7.79150
[546]	validation_0-rmse:7.79158
[547]	validation_0-rmse:7.79161
[548]	va

[I 2021-08-10 00:35:09,054] Trial 1 finished with value: 61.576598746172884 and parameters: {'n_estimators': 8296, 'max_depth': 5, 'min_child_weight': 204, 'gamma': 3, 'learning_rate': 0.05, 'colsample_bytree': 0.6, 'lambda': 1.6407589060735495, 'alpha': 0.0013150929449043552, 'subsample': 0.7}. Best is trial 0 with value: 61.52010932963598.


[0]	validation_0-rmse:10.03908
[1]	validation_0-rmse:10.00020
[2]	validation_0-rmse:9.96192
[3]	validation_0-rmse:9.92424
[4]	validation_0-rmse:9.88712
[5]	validation_0-rmse:9.85064
[6]	validation_0-rmse:9.81470
[7]	validation_0-rmse:9.77942
[8]	validation_0-rmse:9.74470
[9]	validation_0-rmse:9.71054
[10]	validation_0-rmse:9.67699
[11]	validation_0-rmse:9.64392
[12]	validation_0-rmse:9.61143
[13]	validation_0-rmse:9.57948
[14]	validation_0-rmse:9.54808
[15]	validation_0-rmse:9.51722
[16]	validation_0-rmse:9.48680
[17]	validation_0-rmse:9.45694
[18]	validation_0-rmse:9.42757
[19]	validation_0-rmse:9.39869
[20]	validation_0-rmse:9.37026
[21]	validation_0-rmse:9.34234
[22]	validation_0-rmse:9.31496
[23]	validation_0-rmse:9.28800
[24]	validation_0-rmse:9.26151
[25]	validation_0-rmse:9.23545
[26]	validation_0-rmse:9.20984
[27]	validation_0-rmse:9.18469
[28]	validation_0-rmse:9.15996
[29]	validation_0-rmse:9.13562
[30]	validation_0-rmse:9.11175
[31]	validation_0-rmse:9.08831
[32]	validation_

[260]	validation_0-rmse:7.84854
[261]	validation_0-rmse:7.84819
[262]	validation_0-rmse:7.84783
[263]	validation_0-rmse:7.84751
[264]	validation_0-rmse:7.84719
[265]	validation_0-rmse:7.84684
[266]	validation_0-rmse:7.84652
[267]	validation_0-rmse:7.84626
[268]	validation_0-rmse:7.84596
[269]	validation_0-rmse:7.84563
[270]	validation_0-rmse:7.84534
[271]	validation_0-rmse:7.84504
[272]	validation_0-rmse:7.84475
[273]	validation_0-rmse:7.84451
[274]	validation_0-rmse:7.84425
[275]	validation_0-rmse:7.84396
[276]	validation_0-rmse:7.84369
[277]	validation_0-rmse:7.84341
[278]	validation_0-rmse:7.84317
[279]	validation_0-rmse:7.84293
[280]	validation_0-rmse:7.84271
[281]	validation_0-rmse:7.84246
[282]	validation_0-rmse:7.84219
[283]	validation_0-rmse:7.84194
[284]	validation_0-rmse:7.84168
[285]	validation_0-rmse:7.84149
[286]	validation_0-rmse:7.84127
[287]	validation_0-rmse:7.84108
[288]	validation_0-rmse:7.84083
[289]	validation_0-rmse:7.84061
[290]	validation_0-rmse:7.84040
[291]	va

[517]	validation_0-rmse:7.81733
[518]	validation_0-rmse:7.81726
[519]	validation_0-rmse:7.81719
[520]	validation_0-rmse:7.81710
[521]	validation_0-rmse:7.81706
[522]	validation_0-rmse:7.81701
[523]	validation_0-rmse:7.81695
[524]	validation_0-rmse:7.81687
[525]	validation_0-rmse:7.81679
[526]	validation_0-rmse:7.81669
[527]	validation_0-rmse:7.81663
[528]	validation_0-rmse:7.81652
[529]	validation_0-rmse:7.81650
[530]	validation_0-rmse:7.81646
[531]	validation_0-rmse:7.81640
[532]	validation_0-rmse:7.81632
[533]	validation_0-rmse:7.81634
[534]	validation_0-rmse:7.81630
[535]	validation_0-rmse:7.81626
[536]	validation_0-rmse:7.81621
[537]	validation_0-rmse:7.81615
[538]	validation_0-rmse:7.81610
[539]	validation_0-rmse:7.81602
[540]	validation_0-rmse:7.81598
[541]	validation_0-rmse:7.81596
[542]	validation_0-rmse:7.81589
[543]	validation_0-rmse:7.81583
[544]	validation_0-rmse:7.81584
[545]	validation_0-rmse:7.81580
[546]	validation_0-rmse:7.81573
[547]	validation_0-rmse:7.81564
[548]	va

[774]	validation_0-rmse:7.80557
[775]	validation_0-rmse:7.80555
[776]	validation_0-rmse:7.80556
[777]	validation_0-rmse:7.80552
[778]	validation_0-rmse:7.80551
[779]	validation_0-rmse:7.80546
[780]	validation_0-rmse:7.80540
[781]	validation_0-rmse:7.80537
[782]	validation_0-rmse:7.80534
[783]	validation_0-rmse:7.80534
[784]	validation_0-rmse:7.80532
[785]	validation_0-rmse:7.80527
[786]	validation_0-rmse:7.80530
[787]	validation_0-rmse:7.80527
[788]	validation_0-rmse:7.80521
[789]	validation_0-rmse:7.80518
[790]	validation_0-rmse:7.80518
[791]	validation_0-rmse:7.80511
[792]	validation_0-rmse:7.80506
[793]	validation_0-rmse:7.80504
[794]	validation_0-rmse:7.80501
[795]	validation_0-rmse:7.80499
[796]	validation_0-rmse:7.80498
[797]	validation_0-rmse:7.80492
[798]	validation_0-rmse:7.80486
[799]	validation_0-rmse:7.80480
[800]	validation_0-rmse:7.80475
[801]	validation_0-rmse:7.80474
[802]	validation_0-rmse:7.80469
[803]	validation_0-rmse:7.80463
[804]	validation_0-rmse:7.80461
[805]	va

[1030]	validation_0-rmse:7.79863
[1031]	validation_0-rmse:7.79862
[1032]	validation_0-rmse:7.79863
[1033]	validation_0-rmse:7.79864
[1034]	validation_0-rmse:7.79859
[1035]	validation_0-rmse:7.79857
[1036]	validation_0-rmse:7.79854
[1037]	validation_0-rmse:7.79850
[1038]	validation_0-rmse:7.79848
[1039]	validation_0-rmse:7.79846
[1040]	validation_0-rmse:7.79844
[1041]	validation_0-rmse:7.79843
[1042]	validation_0-rmse:7.79844
[1043]	validation_0-rmse:7.79840
[1044]	validation_0-rmse:7.79838
[1045]	validation_0-rmse:7.79835
[1046]	validation_0-rmse:7.79833
[1047]	validation_0-rmse:7.79833
[1048]	validation_0-rmse:7.79832
[1049]	validation_0-rmse:7.79834
[1050]	validation_0-rmse:7.79832
[1051]	validation_0-rmse:7.79832
[1052]	validation_0-rmse:7.79831
[1053]	validation_0-rmse:7.79828
[1054]	validation_0-rmse:7.79823
[1055]	validation_0-rmse:7.79823
[1056]	validation_0-rmse:7.79823
[1057]	validation_0-rmse:7.79823
[1058]	validation_0-rmse:7.79820
[1059]	validation_0-rmse:7.79816
[1060]	val

In [None]:
print(best_param)
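
The trial values logged above are MSE values, because objectiveXGB returns `mean_squared_error`, whereas the per-round validation lines report RMSE. A small conversion puts both on the same scale; the number below is copied from the trial 0 log line, and the variable names are illustrative.

```python
import math

trial_0_mse = 61.52010932963598  # value logged for trial 0
trial_0_rmse = math.sqrt(trial_0_mse)
print(round(trial_0_rmse, 4))
```

Note that this score comes from the small scoring split inside objectiveXGB, so it is not directly comparable with an RMSE computed on a different hold-out set.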

In [None]:
optuna.visualization.plot_optimization_history(study)

In [None]:
optuna.visualization.plot_param_importances(study)