### Binary Classification and Monitoring with the XGBoost Library

* XGBoost is one of the most widely used gradient boosting implementations and works well on large datasets and complex feature sets

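### What you will learn
 * Preparing the data and libraries
 * Implementing an XGBoost model
 * Checking the score on the validation data
 * Monitoring the scores on the training and validation data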

### Preparing the data and libraries

In [1]:
import numpy as np
import pandas as pd

# train_x is the training data, train_y is the target variable, and test_x is the test data.
# They are kept as a pandas DataFrame and Series (NumPy arrays are sometimes used instead).

train = pd.read_csv('../input/sample-data/train_preprocessed.csv')
train_x = train.drop(['target'], axis=1)
train_y = train['target']
test_x = pd.read_csv('../input/sample-data/test_preprocessed.csv')

In [2]:
train_x.shape, train_y.shape, test_x.shape

((10000, 28), (10000,), (10000, 28))

In [3]:
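# Split the training data into a training set and a validation (evaluation) set.
# Only the first of the four KFold splits is used here, as a simple hold-out split.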
from sklearn.model_selection import KFold

kf = KFold(n_splits=4, shuffle=True, random_state=71)
tr_idx, va_idx = list(kf.split(train_x))[0]
tr_x, va_x = train_x.iloc[tr_idx], train_x.iloc[va_idx]
tr_y, va_y = train_y.iloc[tr_idx], train_y.iloc[va_idx]
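To make the split above concrete, here is a minimal sketch on a hypothetical 12-row toy array (`X` is not part of the original data). It assumes the same `KFold` settings as the cell above and shows that each of the four splits holds out a quarter of the rows, and that the validation folds together cover every row exactly once. On the 10,000-row dataset this corresponds to 7,500 training rows and 2,500 validation rows for the first split.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data standing in for train_x (12 rows, 1 column)
X = np.arange(12).reshape(12, 1)

kf = KFold(n_splits=4, shuffle=True, random_state=71)
folds = list(kf.split(X))

# Each split is a pair of integer index arrays: (training rows, validation rows)
for i, (tr_idx, va_idx) in enumerate(folds):
    print(f'fold {i}: train={len(tr_idx)} rows, valid={len(va_idx)} rows')

# The validation folds are disjoint and together cover all rows
all_valid = np.sort(np.concatenate([va for _, va in folds]))
```

Taking `folds[0]`, as the notebook does, gives a single hold-out split; looping over all four folds would give a full cross-validation instead.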

### Implementing the XGBoost model

In [7]:
import xgboost as xgb
from sklearn.metrics import log_loss

# Convert the features (inputs) and the target variable into XGBoost's DMatrix data structure
dtrain = xgb.DMatrix(tr_x, label=tr_y)
dvalid = xgb.DMatrix(va_x, label=va_y)
dtest = xgb.DMatrix(test_x)

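# Set the hyperparameters
# Note: recent XGBoost releases ignore 'silent' (see the warning in the output); 'verbosity' replaced it.
params = {'objective': 'binary:logistic', 'silent': 1, 'random_state': 71}
num_round = 50

# Run the training
# The validation data is also passed to the model so that we can monitor how the scores change as training proceeds
# The watchlist holds the training and validation data to be evaluated each round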
watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
model = xgb.train(params, dtrain, num_round, evals=watchlist)

Parameters: { "silent" } are not used.

[0]	train-logloss:0.54088	eval-logloss:0.55003
[1]	train-logloss:0.45269	eval-logloss:0.47182
[2]	train-logloss:0.39482	eval-logloss:0.42026
[3]	train-logloss:0.35198	eval-logloss:0.38520
[4]	train-logloss:0.32021	eval-logloss:0.36150
[5]	train-logloss:0.29673	eval-logloss:0.34463
[6]	train-logloss:0.27610	eval-logloss:0.32900
[7]	train-logloss:0.25886	eval-logloss:0.31670
[8]	train-logloss:0.24363	eval-logloss:0.30775
[9]	train-logloss:0.23153	eval-logloss:0.30093
[10]	train-logloss:0.22016	eval-logloss:0.29413
[11]	train-logloss:0.20963	eval-logloss:0.28528
[12]	train-logloss:0.19951	eval-logloss:0.27912
[13]	train-logloss:0.19324	eval-logloss:0.27642
[14]	train-logloss:0.18547	eval-logloss:0.27154
[15]	train-logloss:0.17474	eval-logloss:0.26516
[16]	train-logloss:0.16900	eval-logloss:0.26089
[17]	train-logloss:0.16323	eval-logloss:0.25849
[18]	train-logloss:0.15950	eval-logloss:0.25691
[19]	train-logloss:0.15637	eval-logloss:0.25511
[20]	train

In [8]:
# Check the score on the validation data
va_pred = model.predict(dvalid)
score = log_loss(va_y, va_pred)
print(f'logloss: {score:.4f}')

# Prediction: the output is the probability of the positive class (1), not a hard 0/1 label
pred = model.predict(dtest)
pred[0:10]

logloss: 0.2223


array([2.0640090e-01, 2.4007071e-02, 3.8863444e-03, 9.1364473e-04,
       3.0064746e-03, 4.6773311e-01, 9.6380478e-01, 7.5712699e-01,
       1.5055874e-01, 6.6642754e-02], dtype=float32)
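As a reminder of what the logloss score measures, here is a minimal standard-library sketch of binary log loss, the mean of -[y·log(p) + (1-y)·log(1-p)] over all rows. The function `binary_log_loss` and the four example labels and probabilities are hypothetical, chosen only for illustration; the last line shows one way to turn probabilities into 0/1 labels with an assumed threshold of 0.5.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Hypothetical labels and predicted probabilities of the positive class
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.6, 0.4]

loss = binary_log_loss(y_true, y_prob)
print(f'logloss: {loss:.4f}')

# Convert probabilities to hard labels with an assumed threshold of 0.5
labels = [int(p >= 0.5) for p in y_prob]
```

Confident, correct predictions (0.9 for a positive) contribute little to the loss, while hesitant ones (0.6) contribute more, which is why logloss rewards well-calibrated probabilities rather than only correct labels.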

In [9]:
# -----------------------------------
# Monitor the scores on the training and validation data
# -----------------------------------
# Monitor with logloss and set early_stopping_rounds to 20 rounds

params = {'objective': 'binary:logistic', 
          'silent': 1, 'random_state': 71,
          'eval_metric': 'logloss'}

num_round = 500
watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
model = xgb.train(params, dtrain, num_round,
                  evals=watchlist,
                  early_stopping_rounds=20)

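# Predict using the optimal number of trees
# (ntree_limit / best_ntree_limit were removed in XGBoost 2.0; iteration_range replaces them)
pred = model.predict(dtest, iteration_range=(0, model.best_iteration + 1))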
pred[0:10]

Parameters: { "silent" } are not used.

[0]	train-logloss:0.54088	eval-logloss:0.55003
[1]	train-logloss:0.45269	eval-logloss:0.47182
[2]	train-logloss:0.39482	eval-logloss:0.42026
[3]	train-logloss:0.35198	eval-logloss:0.38520
[4]	train-logloss:0.32021	eval-logloss:0.36150
[5]	train-logloss:0.29673	eval-logloss:0.34463
[6]	train-logloss:0.27610	eval-logloss:0.32900
[7]	train-logloss:0.25886	eval-logloss:0.31670
[8]	train-logloss:0.24363	eval-logloss:0.30775
[9]	train-logloss:0.23153	eval-logloss:0.30093
[10]	train-logloss:0.22016	eval-logloss:0.29413
[11]	train-logloss:0.20963	eval-logloss:0.28528
[12]	train-logloss:0.19951	eval-logloss:0.27912
[13]	train-logloss:0.19324	eval-logloss:0.27642
[14]	train-logloss:0.18547	eval-logloss:0.27154
[15]	train-logloss:0.17474	eval-logloss:0.26516
[16]	train-logloss:0.16900	eval-logloss:0.26089
[17]	train-logloss:0.16323	eval-logloss:0.25849
[18]	train-logloss:0.15950	eval-logloss:0.25691
[19]	train-logloss:0.15637	eval-logloss:0.25511
[20]	train



array([2.0979232e-01, 1.1668997e-02, 2.1548264e-03, 2.1623065e-04,
       1.0843581e-03, 5.2126783e-01, 9.5567137e-01, 8.0202556e-01,
       1.8773586e-01, 3.7838366e-02], dtype=float32)
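The early stopping rule used above can be illustrated with a small standard-library sketch. This is not XGBoost's internal code, only an illustration of the rule under the assumption that the metric is minimized: training stops once the validation score has failed to improve for `patience` consecutive rounds. The function `early_stopping_round` and the score list are hypothetical.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, stop_round) for a metric that should be minimized.

    stop_round is None if the patience is never exhausted.
    """
    best_round, best_score = 0, float('inf')
    for i, score in enumerate(scores):
        if score < best_score:
            # New best validation score: remember the round
            best_round, best_score = i, score
        elif i - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here
            return best_round, i
    return best_round, None

# Hypothetical validation logloss per round
eval_scores = [0.55, 0.47, 0.42, 0.40, 0.41, 0.43, 0.42, 0.44]
best_round, stop_round = early_stopping_round(eval_scores, patience=3)
print(best_round, stop_round)

# Number of trees to use at prediction time, as in iteration_range=(0, best_round + 1)
n_trees = best_round + 1
```

With `early_stopping_rounds=20`, the model therefore contains up to 20 trees built after the best round, which is why prediction is restricted to the trees up to and including the best iteration.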