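## XGBoost & LightGBM (2016-2017)
- A fusion of traditional machine learning algorithms
 + Linear regression with ridge and lasso: regularization to prevent overfitting
 + The core algorithm of decision trees
 + Gradient descent
 + Boosting
- Drawback: a very large number of parameters
- Why are they so widely used?
 + Training speed
 + Performance
 + The best model is one that trains fast and still performs well (compared with earlier algorithms)
- Python
 + Java, C, C++
 + C, C++ / R `data.table` package
- Developed by large companies
 + First option: distribute it ourselves -> Python Wrapper API
 + R has a wide variety of machine learning frameworks
 + Python machine learning = Scikit-Learn, so a Scikit-Learn API was developed to make the library easy to use from Scikit-Learn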
 
 Reference: https://xgboost.readthedocs.io/en/stable/python/python_intro.html
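
The ingredients listed above (tree splits, gradient descent on a loss, boosting) can be illustrated with a minimal pure-Python sketch. It is not how XGBoost or LightGBM are implemented: it assumes a single feature, one-split trees (stumps), squared-error loss, and no regularization, and the names `fit_stump` and `gradient_boost` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error on the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue  # a split must leave samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    # Assumes xs holds at least two distinct values, so a valid split exists.
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boosting: each new stump is fitted to the errors left by the previous ones."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Gradient-descent step, scaled by the learning rate.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Toy data (hypothetical): a step from about 1 to about 3
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

predict = gradient_boost(xs, ys)
mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(baseline_mse, boosted_mse)
```

The learning rate plays the role of `eta` in the cells below: a smaller value needs more rounds but takes more cautious steps.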

 


# XGBoost (Scikit-Learn API & Python Wrapper)

### Python Wrapper
- X_train, y_train


In [None]:
import xgboost as xgb 
from sklearn.model_selection import train_test_split
import seaborn as sns 

# Load the data
titanic = sns.load_dataset('titanic')
# titanic.info()

# X: independent variables, y: dependent variable
X = titanic[['pclass', 'parch', 'fare']]
y = titanic['survived']

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    stratify = y, 
                                                    test_size = 0.3, 
                                                    random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((623, 3), (268, 3), (623,), (268,))
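
The shapes above follow from the 30% test share. As a small worked check, assuming scikit-learn's rule that the test share is rounded up and the remainder goes to training (the 891 total is just 623 + 268 from the output):

```python
import math

n_samples = 623 + 268   # total rows, taken from the shapes printed above
test_size = 0.3

n_test = math.ceil(test_size * n_samples)   # test share is rounded up
n_train = n_samples - n_test                # the remainder is used for training

print(n_train, n_test)  # 623 268
```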

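- This is the key step: the Python Wrapper does not take DataFrames directly, so the features and labels are wrapped in `DMatrix` objects.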


In [None]:
dtrain = xgb.DMatrix(data = X_train, label = y_train)
dtest = xgb.DMatrix(data=X_test, label=y_test)

print(dtrain)


<xgboost.core.DMatrix object at 0x7fc937b31ad0>


In [None]:
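# max_depth: depth of each tree
# eta: learning rate
# The number of trees is set with num_boost_round in xgb.train();
# n_estimators is the Scikit-Learn API name and is not a native parameter.

params = {
    'max_depth' : 3, 
    'eta' : 0.1, 
    'objective' : 'binary:logistic',
    'eval_metric' : 'error'   # classification error, as shown in the log below
}
num_rounds = 400

w_list = [(dtrain, 'train'), (dtest, 'test')]
xgb_ml = xgb.train(params = params, 
                   dtrain = dtrain, 
                   num_boost_round = num_rounds, 
                   early_stopping_rounds = 100, 
                   evals = w_list)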

[0]	train-error:0.260032	test-error:0.302239
Multiple eval metrics have been passed: 'test-error' will be used for early stopping.

Will train until test-error hasn't improved in 100 rounds.
[1]	train-error:0.260032	test-error:0.302239
[2]	train-error:0.260032	test-error:0.302239
[3]	train-error:0.260032	test-error:0.302239
[4]	train-error:0.260032	test-error:0.302239
[5]	train-error:0.260032	test-error:0.302239
[6]	train-error:0.260032	test-error:0.302239
[7]	train-error:0.260032	test-error:0.302239
[8]	train-error:0.260032	test-error:0.302239
[9]	train-error:0.260032	test-error:0.302239
[10]	train-error:0.260032	test-error:0.302239
[11]	train-error:0.260032	test-error:0.302239
[12]	train-error:0.260032	test-error:0.302239
[13]	train-error:0.247191	test-error:0.298507
[14]	train-error:0.247191	test-error:0.298507
[15]	train-error:0.248796	test-error:0.302239
[16]	train-error:0.248796	test-error:0.302239
[17]	train-error:0.248796	test-error:0.302239
[18]	train-error:0.248796	test-error
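
The message "Will train until test-error hasn't improved in 100 rounds" describes a patience rule. The sketch below replays that rule on the test-error values of rounds 0 to 17 from the log above (round 18 is cut off, so it is left out). `stop_with_patience` is a hypothetical helper, not an XGBoost function.

```python
def stop_with_patience(errors, patience):
    """Return (last_round_run, best_round) for a list of per-round validation errors."""
    best_round = 0
    best_error = errors[0]
    for i, err in enumerate(errors):
        if err < best_error:
            best_error, best_round = err, i      # strict improvement resets the counter
        elif i - best_round >= patience:
            return i, best_round                 # no improvement for `patience` rounds
    return len(errors) - 1, best_round           # ran out of rounds without stopping


# test-error for rounds 0..17, copied from the log above
test_error = [0.302239] * 13 + [0.298507] * 2 + [0.302239] * 3

print(stop_with_patience(test_error, patience=100))  # (17, 13): still training
print(stop_with_patience(test_error, patience=3))    # (3, 0): stops on the flat start
```

With a patience of 3 the run would have stopped on the flat first rounds and missed the improvement at round 13, which is one reason to keep the patience generous when the learning rate is small.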

In [None]:
# Evaluation
from sklearn.metrics import accuracy_score
pred_probs = xgb_ml.predict(dtest)   # probabilities, because objective is binary:logistic
y_pred = [1 if x > 0.5 else 0 for x in pred_probs]

# Accuracy of the predicted labels against the true labels
accuracy_score(y_test, y_pred)

0.6865671641791045
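
With `objective = 'binary:logistic'` the booster returns probabilities, which is why the cell above thresholds them at 0.5 before scoring. A minimal sketch of that chain (raw score, sigmoid, threshold, accuracy) in plain Python; the raw scores and true labels are made-up values for illustration only.

```python
import math


def sigmoid(margin):
    """Map a raw score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


raw_margins = [-2.0, -0.1, 0.0, 0.3, 1.5]   # hypothetical raw scores
y_true = [0, 0, 1, 1, 1]                    # hypothetical true labels

probs = [sigmoid(m) for m in raw_margins]
labels = [1 if p > 0.5 else 0 for p in probs]   # same rule as in the cell above

# Accuracy is the share of positions where prediction and truth agree
accuracy = sum(t == p for t, p in zip(y_true, labels)) / len(y_true)

print(labels)    # [0, 0, 0, 1, 1]
print(accuracy)  # 0.8
```

A probability of exactly 0.5 (raw score 0) is labelled 0 here because the comparison is strict.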

### Scikit-Learn API


In [None]:
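from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier # Scikit-Learn API

# dt = DecisionTreeClassifier()
# n_estimators is the number of boosting rounds (num_boost_round in the Python Wrapper).
# eval_metric is set on the estimator; recent XGBoost releases no longer accept it in fit().
xgb_model = XGBClassifier(objective = 'binary:logistic', 
                          n_estimators=100, 
                          max_depth=3, 
                          learning_rate = 0.1, 
                          eval_metric = 'error',
                          random_state=42)

w_list = [(X_train, y_train), (X_test, y_test)]

xgb_model.fit(X_train, y_train, eval_set = w_list, verbose=True)

y_probas = xgb_model.predict_proba(X_test)
# Column 1 holds the predicted probability of the positive class
y_pred = [1 if x > 0.5 else 0 for x in y_probas[:, 1]]

# Accuracy of the predicted labels against the true labels
accuracy_score(y_test, y_pred)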


[0]	validation_0-error:0.260032	validation_1-error:0.302239
[1]	validation_0-error:0.260032	validation_1-error:0.302239
[2]	validation_0-error:0.260032	validation_1-error:0.302239
[3]	validation_0-error:0.260032	validation_1-error:0.302239
[4]	validation_0-error:0.260032	validation_1-error:0.302239
[5]	validation_0-error:0.260032	validation_1-error:0.302239
[6]	validation_0-error:0.260032	validation_1-error:0.302239
[7]	validation_0-error:0.260032	validation_1-error:0.302239
[8]	validation_0-error:0.260032	validation_1-error:0.302239
[9]	validation_0-error:0.260032	validation_1-error:0.302239
[10]	validation_0-error:0.260032	validation_1-error:0.302239
[11]	validation_0-error:0.260032	validation_1-error:0.302239
[12]	validation_0-error:0.260032	validation_1-error:0.302239
[13]	validation_0-error:0.247191	validation_1-error:0.298507
[14]	validation_0-error:0.247191	validation_1-error:0.298507
[15]	validation_0-error:0.248796	validation_1-error:0.302239
[16]	validation_0-error:0.248796	v

0.6865671641791045

- p. 275 = Scikit-Learn API
- Compare the Scikit-Learn API with the Python Wrapper
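
One way to compare the two styles is to line up their argument names. The sketch below converts the Scikit-Learn style arguments used in this notebook into what `xgb.train()` expects. `to_native` and `RENAMES` are hypothetical helpers written for this note, and the table covers only the arguments that appear above.

```python
# Scikit-Learn API name -> Python Wrapper (native) parameter name
RENAMES = {
    "learning_rate": "eta",
    "random_state": "seed",
}


def to_native(sk_params):
    """Split Scikit-Learn style arguments into (params dict, num_boost_round)."""
    native = {}
    num_boost_round = None
    for name, value in sk_params.items():
        if name == "n_estimators":
            # In the Python Wrapper the number of trees is an argument of
            # xgb.train(), not an entry in the params dict.
            num_boost_round = value
        else:
            native[RENAMES.get(name, name)] = value
    return native, num_boost_round


sk_params = {
    "objective": "binary:logistic",
    "n_estimators": 100,
    "max_depth": 3,
    "learning_rate": 0.1,
    "random_state": 42,
}

params, num_boost_round = to_native(sk_params)
print(params)
print(num_boost_round)
```

The other visible difference is the data container: the Python Wrapper trains on `DMatrix` objects, while the Scikit-Learn API takes `X_train` and `y_train` directly.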

# LightGBM Python Wrapper & LightGBM Scikit-Learn API

## LightGBM Python Wrapper

Reference: https://lightgbm.readthedocs.io/en/latest/Parameters.html

In [None]:
import lightgbm as lgb 
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score
import seaborn as sns 

# titanic dataset
titanic = sns.load_dataset('titanic')

X = titanic[['pclass', 'parch', 'fare']]
y = titanic['survived']

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.3, random_state=42)

# Similar to the XGBoost code: the data is wrapped in Dataset objects.
dtrain = lgb.Dataset(data = X_train, label = y_train)
dtest = lgb.Dataset(data = X_test, label = y_test, reference = dtrain)

# n_estimators and num_boost_round are aliases of the same setting,
# so the number of rounds is given once, as an argument of lgb.train().
params = {'max_depth':3,
          'learning_rate': 0.1,
          'objective':'binary',
          'metric' : 'binary_error', 
          'verbose' : 1} 

w_list = [dtrain, dtest]
# Assumes LightGBM >= 3.3, where early stopping and per-round logging are set through callbacks.
lgb_ml = lgb.train(params=params, train_set = dtrain,
                   num_boost_round = 400, valid_sets = w_list,
                   callbacks = [lgb.early_stopping(stopping_rounds = 100),
                                lgb.log_evaluation(period = 1)])

pred_probs = lgb_ml.predict(X_test)
y_pred=[1 if x > 0.5 else 0 for x in pred_probs]

# Accuracy of the predicted labels against the true labels
accuracy_score(y_test, y_pred)



[1]	training's binary_error: 0.383628	valid_1's binary_error: 0.384328
Training until validation scores don't improve for 100 rounds.
[2]	training's binary_error: 0.383628	valid_1's binary_error: 0.384328
[3]	training's binary_error: 0.354735	valid_1's binary_error: 0.369403
[4]	training's binary_error: 0.29695	valid_1's binary_error: 0.354478
[5]	training's binary_error: 0.272873	valid_1's binary_error: 0.33209
[6]	training's binary_error: 0.272873	valid_1's binary_error: 0.33209
[7]	training's binary_error: 0.269663	valid_1's binary_error: 0.317164
[8]	training's binary_error: 0.269663	valid_1's binary_error: 0.317164
[9]	training's binary_error: 0.264848	valid_1's binary_error: 0.309701
[10]	training's binary_error: 0.269663	valid_1's binary_error: 0.309701
[11]	training's binary_error: 0.264848	valid_1's binary_error: 0.309701
[12]	training's binary_error: 0.264848	valid_1's binary_error: 0.309701
[13]	training's binary_error: 0.264848	valid_1's binary_error: 0.309701
[14]	training

0.6940298507462687
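
The `binary_error` values in the log and the final accuracy are two views of the same count: error is the share of misclassified rows, accuracy the share of correct ones. A small worked check using the accuracy printed above and the 268 test rows from the split:

```python
n_test = 268                      # test rows, from the split shapes
accuracy = 0.6940298507462687     # value printed above

n_correct = round(accuracy * n_test)   # number of correctly classified rows
n_wrong = n_test - n_correct
binary_error = n_wrong / n_test        # equals 1 - accuracy

print(n_correct, n_wrong)       # 186 82
print(round(binary_error, 6))   # 0.30597
```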

## LightGBM Scikit-Learn API방식

In [None]:
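from lightgbm import LGBMClassifier, early_stopping, log_evaluation
from sklearn.metrics import accuracy_score

# model 
# n_estimators is the Scikit-Learn name for the number of boosting rounds
# (num_boost_round in the Python Wrapper), so only one of the two is set.
model = LGBMClassifier(objective = 'binary', 
                       metric = 'binary_error',
                       n_estimators=400, 
                       learning_rate=0.1, 
                       max_depth=3, 
                       random_state = 32)
# Assumes LightGBM >= 3.3, where early stopping and per-round logging are set through callbacks.
model.fit(X_train, 
          y_train, 
          eval_set = [(X_train, y_train), (X_test, y_test)], 
          callbacks = [early_stopping(stopping_rounds = 100),
                       log_evaluation(period = 1)])
y_probas = model.predict_proba(X_test) 
y_pred=[1 if x > 0.5 else 0 for x in y_probas[:, 1]] # predicted labels (0 or 1)

# Accuracy of the predicted labels against the true labels
accuracy_score(y_test, y_pred)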


[1]	training's binary_error: 0.383628	valid_1's binary_error: 0.384328
Training until validation scores don't improve for 100 rounds.
[2]	training's binary_error: 0.383628	valid_1's binary_error: 0.384328
[3]	training's binary_error: 0.354735	valid_1's binary_error: 0.369403
[4]	training's binary_error: 0.29695	valid_1's binary_error: 0.354478
[5]	training's binary_error: 0.272873	valid_1's binary_error: 0.33209
[6]	training's binary_error: 0.272873	valid_1's binary_error: 0.33209
[7]	training's binary_error: 0.269663	valid_1's binary_error: 0.317164
[8]	training's binary_error: 0.269663	valid_1's binary_error: 0.317164
[9]	training's binary_error: 0.264848	valid_1's binary_error: 0.309701
[10]	training's binary_error: 0.269663	valid_1's binary_error: 0.309701
[11]	training's binary_error: 0.264848	valid_1's binary_error: 0.309701
[12]	training's binary_error: 0.264848	valid_1's binary_error: 0.309701
[13]	training's binary_error: 0.264848	valid_1's binary_error: 0.309701
[14]	training

0.6940298507462687