### Training and Prediction with the LightGBM Library

* LightGBM is a gradient boosting framework developed by Microsoft that supports fast, high-performance training on large datasets.

### Loading the Data and Preparing Libraries

In [1]:
import numpy as np
import pandas as pd

# train_x is the training data, train_y is the target variable, test_x is the test data.
# They are kept as pandas DataFrame / Series objects (NumPy arrays are sometimes used instead).

train = pd.read_csv('../input/sample-data/train_preprocessed.csv')
train_x = train.drop(['target'], axis=1)
train_y = train['target']
test_x = pd.read_csv('../input/sample-data/test_preprocessed.csv')

# Split the training data into a training set and a validation set.
from sklearn.model_selection import KFold

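# Build 4 shuffled folds and use only the first one:
# 3/4 of the rows go to training, 1/4 to validation.
kf = KFold(n_splits=4, shuffle=True, random_state=71)
tr_idx, va_idx = list(kf.split(train_x))[0]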
tr_x, va_x = train_x.iloc[tr_idx], train_x.iloc[va_idx]
tr_y, va_y = train_y.iloc[tr_idx], train_y.iloc[va_idx]

In [2]:
tr_x.shape, tr_y.shape, va_x.shape, va_y.shape

((7500, 28), (7500,), (2500, 28), (2500,))
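
The shapes above follow from taking only the first of the four folds: each fold holds out a quarter of the rows for validation. A minimal sketch with synthetic data shows the split sizes for every fold. The array `X_demo` is a hypothetical stand-in for `train_x`, assuming 10,000 rows and 28 columns as the printed shapes suggest.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-in for train_x: 10,000 rows, 28 feature columns
X_demo = np.zeros((10000, 28))

kf_demo = KFold(n_splits=4, shuffle=True, random_state=71)

# Collect (training size, validation size) for each of the 4 folds
fold_sizes = [(len(tr), len(va)) for tr, va in kf_demo.split(X_demo)]
print(fold_sizes)
# [(7500, 2500), (7500, 2500), (7500, 2500), (7500, 2500)]
```

Looping over all four folds instead of indexing `[0]` would give a full cross-validation estimate at roughly four times the training cost.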

### Implementing a Binary Classification Model with LightGBM

In [4]:
# -----------------------------------
# LightGBM implementation
# -----------------------------------
import lightgbm as lgb
from sklearn.metrics import log_loss

# Convert the features and target variable into LightGBM's Dataset structure
lgb_train = lgb.Dataset(tr_x, tr_y)
lgb_eval = lgb.Dataset(va_x, va_y)

# Set the hyperparameters
params = {'objective': 'binary', 'seed': 71, 
          'verbose': 0, 'metrics': 'binary_logloss'}
num_round = 100

# Run training
# Specify the categorical variables
# Also pass the validation data so the score can be monitored as training progresses
categorical_features = ['product', 'medical_info_b2', 'medical_info_b3']
model = lgb.train(params, lgb_train, num_boost_round=num_round,
                  categorical_feature=categorical_features,
                  valid_names=['train', 'valid'], valid_sets=[lgb_train, lgb_eval])

New categorical_feature is ['medical_info_b2', 'medical_info_b3', 'product']


You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[1]	train's binary_logloss: 0.454286	valid's binary_logloss: 0.4654
[2]	train's binary_logloss: 0.429348	valid's binary_logloss: 0.443537
[3]	train's binary_logloss: 0.409269	valid's binary_logloss: 0.425588
[4]	train's binary_logloss: 0.393109	valid's binary_logloss: 0.411213
[5]	train's binary_logloss: 0.379351	valid's binary_logloss: 0.399341
[6]	train's binary_logloss: 0.366138	valid's binary_logloss: 0.389055
[7]	train's binary_logloss: 0.35417	valid's binary_logloss: 0.378254
[8]	train's binary_logloss: 0.343782	valid's binary_logloss: 0.370131
[9]	train's binary_logloss: 0.334283	valid's binary_logloss: 0.362036
[10]	train's binary_logloss: 0.324802	valid's binary_logloss: 0.353452
[11]	train's binary_logloss: 0.316592	valid's binary_logloss: 0.346904
[12]	train's binary_logloss: 0.308484	valid's binary_logloss: 0.340248
[13]	train's binary_logloss: 0.301468	

In [6]:
# Check the score on the validation data
va_pred = model.predict(va_x)
score = log_loss(va_y, va_pred)
print(f'logloss: {score:.4f}')

# Predict on the test data
pred = model.predict(test_x)
pred[0:10]

logloss: 0.2161


array([0.24319712, 0.04309432, 0.0089008 , 0.00154677, 0.00283472,
       0.22837672, 0.52074278, 0.70167458, 0.67261649, 0.1007279 ])
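
Because the objective is `binary`, `model.predict` returns the probability of the positive class, and `log_loss` scores those probabilities directly. As a rough illustration of what the metric computes, the sketch below evaluates the binary log loss by hand and compares it with scikit-learn. The labels and probabilities are hypothetical values, not taken from the notebook's data.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical labels and predicted probabilities (illustration only)
y_true = np.array([1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.1, 0.6, 0.4])

# Binary log loss: mean of -[y * log(p) + (1 - y) * log(1 - p)]
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(f'manual: {manual:.4f}, sklearn: {log_loss(y_true, y_prob):.4f}')
```

Confident wrong predictions are penalized heavily by this metric, so a lower value indicates probabilities that sit closer to the true labels.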