## Baseline Model

* The data has been anonymized and replaced with numeric values, so no separate preprocessing is required.

In [1]:
import pandas as pd

train = pd.read_csv("train.csv")
train_label = train['target']
train_id = train['id']
del train['target'], train['id']

test = pd.read_csv("test.csv")
test_id = test['id']
del test['id']

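### Feature Engineering
Because nothing is disclosed about what each variable means, it is hard to pick a direction; the exploratory data analysis serves as the guide.
1. A `missing` variable that counts the missing values per driver (it can indirectly encode information such as when the driver started driving or where the record came from)
2. The sum of the binary variables (extracts higher-order information from interactions between variables)
3. Target-encoded derived variables (for a single variable, the mean of the target for each unique value becomes a new variable)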
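
Item 3 can be illustrated on toy data. The sketch below is a minimal example under stated assumptions, not the notebook's exact code: the frames `fit_part` and `apply_part`, the column `cat`, and all values are made up for illustration. It computes the mean target per unique value on one set of rows and maps it onto a different set, with unseen values falling back to 0 (the same fallback as `map_dic.get(x, 0)` in the cross-validation loop).

```python
import pandas as pd

# Hypothetical toy data: `cat` stands in for a categorical feature.
fit_part = pd.DataFrame({"cat": [0, 0, 1, 1, 1, 2],
                         "target": [1, 0, 0, 0, 1, 1]})
apply_part = pd.DataFrame({"cat": [0, 1, 2, 3]})  # value 3 never appears in fit_part

# Mean of the target for each unique value, computed on the fitting rows only.
mapping = fit_part.groupby("cat")["target"].mean()

# Map the means onto the other rows; values unseen during fitting become 0.
apply_part["cat_target_enc"] = apply_part["cat"].map(mapping).fillna(0.0)
print(apply_part)
```

Computing the mapping on one set of rows and applying it to another is what keeps the validation target out of the encoded feature.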

In [2]:
# 1. Count the occurrences of -1, which marks a missing value
train['missing'] = (train==-1).sum(axis=1).astype(float)
test['missing'] = (test==-1).sum(axis=1).astype(float)

# 2. Sum of the binary variables
bin_features = [c for c in train.columns if 'bin' in c]
train['bin_sum'] = train[bin_features].sum(axis=1)
test['bin_sum'] = test[bin_features].sum(axis=1)

# 3. Target encoding is done inside the cross-validation loop to prevent data leakage

features = ['ps_ind_06_bin', 'ps_ind_07_bin', 'ps_ind_08_bin', 'ps_ind_09_bin', 'ps_ind_12_bin', 'ps_ind_16_bin', 'ps_ind_17_bin', 'ps_ind_18_bin', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_car_01_cat', 'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat', 'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat', 'ps_car_11_cat', 'ps_ind_01', 'ps_ind_03', 'ps_ind_15', 'ps_car_11']

### Model Training
* LightGBM
* 5-Fold StratifiedKFold
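
Stratified splitting matters when positives are rare, since a plain random split can leave folds with very different positive rates. The sketch below uses made-up labels (10 positives out of 100 rows, an assumption for illustration only) to show that `StratifiedKFold` keeps the positive rate the same in every validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 10 positives, 90 negatives.
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((len(y), 1))  # feature values do not affect how the split is made

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = []
for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    rate = y[valid_idx].mean()  # share of positives in this validation fold
    fold_rates.append(rate)
    print(f"fold {fold}: {len(valid_idx)} validation rows, positive rate {rate:.2f}")
```

Each validation fold receives 2 positives and 18 negatives, so every fold score is computed on the same class balance.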

In [3]:
# Model configuration
num_boost_round = 10000
params = {"objective": "binary",
          "boosting_type": "gbdt",
          "learning_rate": 0.1,
          "num_leaves": 15,
          "max_bin": 256,
          "feature_fraction": 0.6,
          "verbosity": 0,
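          "drop_rate": 0.1,   # only used by boosting_type="dart"; ignored with "gbdt"
          "is_unbalance": False,
          "max_drop": 50,     # only used by boosting_type="dart"; ignored with "gbdt"
          "min_child_samples": 10,
          "min_child_weight": 150,
          "min_split_gain": 0,
          "subsample": 0.9,   # takes effect only when subsample_freq (bagging_freq) > 0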
          "seed": 2018
}

In [7]:
import numpy as np
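# Gini coefficient
def gini(actual, pred):
    assert len(actual) == len(pred)
    # Columns: actual label, prediction, original row index
    data = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    # Sort by prediction in descending order, breaking ties by original row order
    data = data[np.lexsort((data[:, 2], -1 * data[:, 1]))]
    total_losses = data[:, 0].sum()
    gini_sum = data[:, 0].cumsum().sum() / total_losses
    gini_sum -= (len(actual) + 1) / 2.
    return gini_sum / len(actual)

# Normalized Gini: the model's Gini divided by the Gini of a perfect ranking
def Gini(actual, pred):
    return gini(actual, pred) / gini(actual, actual)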
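
A small worked example may help make the metric concrete. The sketch below re-implements the same computation under the hypothetical name `normalized_gini` and adds a helper `auc` (also an assumption, not part of the notebook) to check the known identity normalized Gini = 2 * AUC - 1, which holds for binary labels when no prediction scores are tied. The labels and scores are made up.

```python
import numpy as np

def gini(actual, pred):
    # Sort rows by prediction (descending), ties broken by original order.
    data = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    data = data[np.lexsort((data[:, 2], -1 * data[:, 1]))]
    gini_sum = data[:, 0].cumsum().sum() / data[:, 0].sum()
    gini_sum -= (len(actual) + 1) / 2.0
    return gini_sum / len(actual)

def normalized_gini(actual, pred):
    # Scale by the Gini of a perfect ranking so the best possible score is 1.
    return gini(actual, pred) / gini(actual, actual)

def auc(actual, pred):
    # Fraction of (positive, negative) pairs where the positive scores higher.
    # Assumes binary labels and no tied scores.
    actual, pred = np.asarray(actual), np.asarray(pred)
    pos, neg = pred[actual == 1], pred[actual == 0]
    return (pos[:, None] > neg[None, :]).mean()

actual = np.array([1, 0, 0, 1, 0, 0])
pred = np.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.1])

print(normalized_gini(actual, pred))   # 0.75 (up to floating-point rounding)
print(2 * auc(actual, pred) - 1)       # 0.75
```

Here 7 of the 8 positive/negative pairs are ordered correctly, so AUC is 0.875 and both expressions give 0.75.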

In [8]:
from sklearn.model_selection import StratifiedKFold
import lightgbm as lgbm
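import os
# Allow duplicate OpenMP runtimes to be loaded; a common workaround for the
# "OMP: Error #15" crash when LightGBM is loaded alongside another OpenMP library
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'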

NFOLDS = 5
kfold = StratifiedKFold(n_splits=NFOLDS, shuffle=True, random_state=218)
kf = kfold.split(train, train_label)

cv_train = np.zeros(len(train_label))
cv_pred = np.zeros(len(test_id))    
best_trees = []
fold_scores = []
def evalerror(preds, d):
    # Custom LightGBM metric: returns (metric name, value, is_higher_better)
    labels = d.get_label()
    return 'gini', Gini(labels, preds), True

for i, (train_fold, validate) in enumerate(kf):
    # Split into training and validation data (copies, so new columns can be added safely)
    X_train, X_validate = train.iloc[train_fold, :].copy(), train.iloc[validate, :].copy()
    label_train, label_validate = train_label.iloc[train_fold], train_label.iloc[validate]
    
    # Target encoding
    for feature in features:
        # Mean of the target for each unique value of the feature, computed on the training fold only
        map_dic = pd.DataFrame([X_train[feature], label_train]).T.groupby(feature).agg('mean')
        map_dic = map_dic.to_dict()['target']
        # Map the means onto the training, validation, and test data (unseen values get 0)
        X_train[feature + '_target_enc'] = X_train[feature].apply(lambda x: map_dic.get(x, 0))
        X_validate[feature + '_target_enc'] = X_validate[feature].apply(lambda x: map_dic.get(x, 0))
        test[feature + '_target_enc'] = test[feature].apply(lambda x: map_dic.get(x, 0))
    
    dtrain = lgbm.Dataset(X_train, label_train)
    dvalid = lgbm.Dataset(X_validate, label_validate, reference=dtrain)
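    # Train on the training fold and find the best number of trees from the
    # normalized Gini score that evalerror() reports on the validation fold.
    # Logging and early stopping are passed as callbacks (available in LightGBM 3.3 and later).
    bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=[dvalid], feval=evalerror,
                     callbacks=[lgbm.early_stopping(stopping_rounds=100), lgbm.log_evaluation(period=100)])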
    best_trees.append(bst.best_iteration)
    # Add the test-set predictions to cv_pred
    cv_pred += bst.predict(test, num_iteration=bst.best_iteration)
    cv_train[validate] += bst.predict(X_validate)

    # Print the evaluation score on the validation data
    score = Gini(label_validate, cv_train[validate])
    print(score)
    fold_scores.append(score)

Training until validation scores don't improve for 100 rounds.
[100]	valid_0's binary_logloss: 0.151606	valid_0's gini: 0.287939
[200]	valid_0's binary_logloss: 0.151608	valid_0's gini: 0.287492
Early stopping, best iteration is:
[132]	valid_0's binary_logloss: 0.151591	valid_0's gini: 0.287982
0.2879823369169921
Training until validation scores don't improve for 100 rounds.
[100]	valid_0's binary_logloss: 0.152424	valid_0's gini: 0.264158
Early stopping, best iteration is:
[82]	valid_0's binary_logloss: 0.1524	valid_0's gini: 0.265387
0.2653866946078528
Training until validation scores don't improve for 100 rounds.
[100]	valid_0's binary_logloss: 0.152227	valid_0's gini: 0.276015
[200]	valid_0's binary_logloss: 0.152186	valid_0's gini: 0.277444
[300]	valid_0's binary_logloss: 0.152259	valid_0's gini: 0.274823
Early stopping, best iteration is:
[202]	valid_0's binary_logloss: 0.152175	valid_0's gini: 0.277855
0.2778554545445486
Training until validation scores don't improve for 100 rou

In [9]:
cv_pred /= NFOLDS

# Print the overall and per-fold cross-validation scores
print("cv score:")
print(Gini(train_label, cv_train))
print(fold_scores)
print(best_trees, np.mean(best_trees))

# Save the predictions for the test data
pd.DataFrame({'id': test_id, 'target': cv_pred}).to_csv('lgbm_baseline.csv', index=False)

cv score:
0.2794294491958753
[0.2879823369169921, 0.2653866946078528, 0.2778554545445486, 0.28034318566195054, 0.28623330121679297]
[132, 82, 202, 145, 62] 124.6
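
The roles of `cv_train` and `cv_pred` can be shown with a toy sketch. Everything below is hypothetical (3 folds, 6 training rows, 2 test rows, and constant "predictions" standing in for fitted models); it only illustrates why the accumulated test predictions are divided by the number of folds while the out-of-fold array is not.

```python
import numpy as np

n_folds = 3
n_train, n_test = 6, 2
# Hypothetical validation indices; each training row is validated exactly once.
valid_indices = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

oof = np.zeros(n_train)        # plays the role of cv_train
test_pred = np.zeros(n_test)   # plays the role of cv_pred

for fold, valid_idx in enumerate(valid_indices):
    # Stand-in for a fitted model: fold k always predicts 0.1 * (k + 1).
    fold_value = 0.1 * (fold + 1)
    oof[valid_idx] = fold_value   # filled by the one model that did not train on these rows
    test_pred += fold_value       # every fold's model predicts the whole test set

test_pred /= n_folds              # average the per-fold test predictions

print(oof)
print(test_pred)
```

Each training row receives exactly one out-of-fold prediction, whereas each test row receives one prediction per fold, hence the division.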
