현재 진행중인 대출등급 예측의 목표를 다시 생각해보자.

객관적 자료를 배제하고, '내가' 생각했을 때 등급 산정 시 가장 중요한 특성 5가지만 생각해보면

연간소득, 총상환원금과 이자 그리고 연체와 관련된 부분이다.

따라서 연간소득, 총상환원금, 총상환이자, 최근2년간연체횟수, 총연체금액, 연체계좌수 6개 컬럼만 이용하여 모델링을 시도해보자

In [2]:
import pandas as pd
import numpy as np
import warnings
from sklearn.preprocessing import LabelEncoder,StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
warnings.filterwarnings('ignore')

In [2]:
train = pd.read_csv('./train_new.csv')
train.drop(['ID', '대출기간', '대출금액', '근로기간', '주택소유상태', '부채_대비_소득_비율', '총계좌수', '대출목적'], axis=1, inplace=True)
train.head()

Unnamed: 0,연간소득,최근_2년간_연체_횟수,총상환원금,총상환이자,총연체금액,연체계좌수,대출등급
0,72000000,0,0,0.0,0.0,0.0,C
1,130800000,0,373572,234060.0,0.0,0.0,B
2,96000000,0,928644,151944.0,0.0,0.0,A
3,132000000,0,325824,153108.0,0.0,0.0,C
4,84000000,0,240216,55428.0,0.0,0.0,A


In [5]:
le = LabelEncoder()

train['대출등급'] = le.fit_transform(train['대출등급'])

X = train.iloc[:, :-1]
y = train.iloc[:, -1]

ss = StandardScaler()

X_ss = ss.fit_transform(X)

X_train, X_val, y_train, y_val = train_test_split(X_ss, y, test_size=0.3, stratify=y, random_state=42)

In [3]:
def grid_search(model, params, random=False):
    clf = model
    if not random:
        grid = GridSearchCV(clf, params,
                                scoring='f1_macro', cv=5,
                                n_jobs=-1)
    else:
        grid = RandomizedSearchCV(clf, params, n_iter=10,
                                scoring='f1_macro', cv=5,
                                n_jobs=-1, random_state=42)
        
    grid.fit(X_train, y_train)
    
    
    best_model = grid.best_estimator_
    
    best_params = grid.best_params_
    print("최상의 매개변수:", best_params)
    
    best_score = grid.best_score_
    print("훈련 점수: {:.3f}".format(best_score))
    
    y_pred = best_model.predict(X_val)
    macro_f1_val = f1_score(y_val, y_pred, average='macro')
    print('테스트 세트 점수: {:.3f}'.format(macro_f1_val))

In [5]:
params = {'min_samples_leaf':[18,19,20,21,22],
          'min_impurity_decrease':[0.0],
          'max_features':['auto',0.6,0.61,0.62,0.63,0.64,0.65,0.66,0.67,0.68,0.69,0.70],
          'max_depth':[None,11,12,13,14,15,16,17,18],
          'class_weight' : [None, 'balanced']}

grid_search(DecisionTreeClassifier(random_state=42), params, random=True)

최상의 매개변수: {'min_samples_leaf': 19, 'min_impurity_decrease': 0.0, 'max_features': 0.68, 'max_depth': 17, 'class_weight': None}
훈련 점수: 0.668
테스트 세트 점수: 0.706


최상의 매개변수: {'min_samples_leaf': 19, 'min_impurity_decrease': 0.0, 'max_features': 0.68, 'max_depth': 17, 'class_weight': None}
훈련 점수: 0.668
테스트 세트 점수: 0.706

In [6]:
params = {'min_samples_leaf':[1,3,5],
          'min_impurity_decrease':[0.0],
          'max_features':[0.71,0.73,0.75],
          'max_depth':[91,93,95,97,99],
          'n_estimators' : [750,770,790,810,830,850]}

grid_search(RandomForestClassifier(n_jobs=-1, random_state=42), params, random=True)

최상의 매개변수: {'n_estimators': 770, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.0, 'max_features': 0.73, 'max_depth': 95}
훈련 점수: 0.774
테스트 세트 점수: 0.776


최상의 매개변수: {'n_estimators': 700, 'min_samples_leaf': 20, 'min_impurity_decrease': 0.0, 'max_features': 0.5, 'max_depth': 80}
훈련 점수: 0.638
테스트 세트 점수: 0.650

48분 소요로 후행 작업 중단

최상의 매개변수: {'n_estimators': 770, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.0, 'max_features': 0.73, 'max_depth': 95}
훈련 점수: 0.774
테스트 세트 점수: 0.776

In [7]:
params = {'n_estimators' : [150,200,250],
          'learning_rate' : [0.01,0.05,0.1],
          'max_depth' : [90,100,110],
          'objective' : ['multi:softmax']}

grid_search(XGBClassifier(random_state=42, n_jobs=-1), params, random=True)

최상의 매개변수: {'objective': 'multi:softmax', 'n_estimators': 250, 'max_depth': 110, 'learning_rate': 0.01}
훈련 점수: 0.748
테스트 세트 점수: 0.750


최상의 매개변수: {'objective': 'multi:softmax', 'n_estimators': 200, 'max_depth': 100, 'learning_rate': 0.01}
훈련 점수: 0.750
테스트 세트 점수: 0.749

20분 - 성능변화 크게 없음

최상의 매개변수: {'objective': 'multi:softmax', 'n_estimators': 250, 'max_depth': 110, 'learning_rate': 0.01}
훈련 점수: 0.748
테스트 세트 점수: 0.750

In [8]:
params = {'n_estimators' : [450,500,550,600],
          'learning_rate' : [0.01,0.05,0.1,0.2,0.3,0.4],
          'max_depth' : [75,100,125,150],
          'num_leaves' : [20,25,30]}

grid_search(LGBMClassifier(objective='multiclass', random_state=42, n_jobs=-1), params, random=True)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001863 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 818
[LightGBM] [Info] Number of data points in the train set: 63435, number of used features: 6
[LightGBM] [Info] Start training from score -1.744243
[LightGBM] [Info] Start training from score -1.208106
[LightGBM] [Info] Start training from score -1.248814
[LightGBM] [Info] Start training from score -1.982449
[LightGBM] [Info] Start training from score -2.564256
[LightGBM] [Info] Start training from score -3.885346
[LightGBM] [Info] Start training from score -5.433754
최상의 매개변수: {'num_leaves': 30, 'n_estimators': 500, 'max_depth': 125, 'learning_rate': 0.05}
훈련 점수: 0.743
테스트 세트 점수: 0.737


최상의 매개변수: {'num_leaves': 25, 'n_estimators': 500, 'max_depth': 100, 'learning_rate': 0.1}
훈련 점수: 0.738
테스트 세트 점수: 0.736

8분 - 성능 변화 크게 없음

최상의 매개변수: {'num_leaves': 30, 'n_estimators': 500, 'max_depth': 125, 'learning_rate': 0.05}
훈련 점수: 0.743
테스트 세트 점수: 0.737

In [6]:
X_ss = pd.DataFrame(X_ss, columns=X.columns)

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

f1_macro_scores = []

def skf_score(model):
    for train_idx, valid_idx in skf.split(X_ss, y):
        X_train = X_ss.iloc[train_idx]
        X_val = y.iloc[train_idx]

        y_train = X_ss.iloc[valid_idx]
        y_val = y.iloc[valid_idx]

        model.fit(X_train, X_val)

        pred = model.predict(y_train)

        f1_macro = f1_score(y_val, pred, average='macro')
        f1_macro_scores.append(f1_macro)
    
    average_f1_macro = np.mean(f1_macro_scores)

    print("Average F1-macro score:", average_f1_macro)
    try:
        if model.feature_importances_.any():
            feature_importances = model.feature_importances_
            print("\n",'-'*10,'특성중요도','-'*10)
            for feature, importance in zip(X_ss.columns, feature_importances):
                print(f"{feature}: {importance}")
    except:
        None

In [10]:
skf_score(DecisionTreeClassifier(min_samples_leaf=19, min_impurity_decrease=0, 
                                 max_features=0.68, max_depth=17, class_weight=None, random_state=42))

Average F1-macro score: 0.7056080563919831

 ---------- 특성중요도 ----------
연간소득: 0.031204710863173352
최근_2년간_연체_횟수: 0.0018175221538471197
총상환원금: 0.5644763422936642
총상환이자: 0.4025014246893154
총연체금액: 0.0
연체계좌수: 0.0


In [11]:
skf_score(RandomForestClassifier(n_estimators=770, min_samples_leaf=1, min_impurity_decrease=0, 
                                 max_features=0.73, max_depth=95, random_state=42, n_jobs=-1))

Average F1-macro score: 0.747356297190938

 ---------- 특성중요도 ----------
연간소득: 0.09318574941085862
최근_2년간_연체_횟수: 0.014293254197159409
총상환원금: 0.4798765292562635
총상환이자: 0.4110978759495366
총연체금액: 0.0007683069149552781
연체계좌수: 0.0007782842712264112


In [12]:
skf_score(XGBClassifier(objective='multi:softmax', n_estimators=250, max_depth=110, 
          learning_rate=0.01, n_jobs=-1, random_state=42))

Average F1-macro score: 0.7524633595010309

 ---------- 특성중요도 ----------
연간소득: 0.05038720369338989
최근_2년간_연체_횟수: 0.04464619606733322
총상환원금: 0.39258119463920593
총상환이자: 0.402171790599823
총연체금액: 0.06731744110584259
연체계좌수: 0.04289612919092178


In [13]:
skf_score(LGBMClassifier(objective='multiclass', num_leaves=30, n_estimators=500, max_depth=125, 
               learning_rate=0.05, n_jobs=-1, random_state=42))

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004202 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 822
[LightGBM] [Info] Number of data points in the train set: 81559, number of used features: 6
[LightGBM] [Info] Start training from score -1.744289
[LightGBM] [Info] Start training from score -1.208056
[LightGBM] [Info] Start training from score -1.248847
[LightGBM] [Info] Start training from score -1.982382
[LightGBM] [Info] Start training from score -2.564275
[LightGBM] [Info] Start training from score -3.885514
[LightGBM] [Info] Start training from score -5.434151
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001353 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 825
[LightGBM] [Info] Number of data points in the train set: 81559, number of u

In [14]:
skf_score(LGBMClassifier(objective='multiclass', num_leaves=41, n_estimators=1500, max_depth=130, 
               learning_rate=0.024, n_jobs=-1, random_state=42))

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003890 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 822
[LightGBM] [Info] Number of data points in the train set: 81559, number of used features: 6
[LightGBM] [Info] Start training from score -1.744289
[LightGBM] [Info] Start training from score -1.208056
[LightGBM] [Info] Start training from score -1.248847
[LightGBM] [Info] Start training from score -1.982382
[LightGBM] [Info] Start training from score -2.564275
[LightGBM] [Info] Start training from score -3.885514
[LightGBM] [Info] Start training from score -5.434151
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003328 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 825
[LightGBM] [Info] Number of data points in the train set: 81559, number of used features: 6
[LightGBM] [Info] Start training from score -1.7

XGB를 제외하고 총상환원금과 총상환이자의 특성 중요도가 90% 이상을 차지했다.

총상환원금과 총상환이자 컬럼이 중요한건 알겠는데.. 이걸 어떻게 활용하면 좋을까?

model|k-Fold|Sk-Fold
-|-|-
DecisionTree Classifier|0.706|0.7056080563919831
RandomForest Classifier|0.776|0.747356297190938
XGBoost Classifier|0.750|0.7524633595010309
Light GBM Classifier|0.737|0.752895857025268 --> 최고 성능

-----

### 파생변수 생성

### 앞서 특성 중요도를 살펴보았을 때 총상환원금과 총상환이자 2개의 컬럼의 중요도가 매우 높게 나타났다. 

### 이 두가지 컬럼을 이용하여 파생변수를 생성해보자

* 상호 작용 변수 생성

상호 작용 변수를 만드는 것은 모델에 미치는 다양한 변수 간의 상호 작용을 고려하여 모델의 품질을 향상시킬 수 있는 중요한 전략이다.

상호 작용 변수는 두 변수 간의 곱셈이나 나눗셈 등을 통해 생성될 수 있는데, 예를 들면 변수 A와 변수 B의 상호 작용은 A * B와 같이 표현된다.

그러나 상호 작용 변수를 생성하는 데 있어서 주의할 점은 모든 가능한 조합을 만들 필요는 없다는 것. 

특히, 도메인 지식이나 특정 가정에 기반하여 어떤 변수 간의 상호 작용이 중요한지를 결정하는 것이 좋다. 

무작정 모든 변수의 조합을 만들면 모델이 복잡해지고 과적합의 위험이 있기 때문

예를 들어, LinearRegression 모델에서는 주요 예측 변수들 간의 상호 작용을 추가하여 모델의 설명력을 높일 수 있다. 하지만 모델의 유연성을 고려하여 필요한 상호 작용만을 추가하는 것이 매우 중요하다.

1. 총상환원금 + 총상환이자
2. 총상환원금 - 총상환이자
3. 총상환원금 * 총상환이자
4. 총상환원금^2
5. 총상환이자^2
6. sqrt(총상환원금)
7. sqrt(총상환이자)

총 7가지 파생변수를 생성해서 모델링 해본 뒤 과적합의 위험이 보이면 차원을 축소하는 걸로 진행해보자

(log 값도 추가하려 했으나 무한대 값을 처리하는 방법을 찾지 못해 일단 7개만 진행)

++ 연체에 대한 컬럼도 영향력이 작으니 과감하게 삭제

In [15]:
train = pd.read_csv('./train_new.csv')
train.drop(['ID', '대출기간', '대출금액', '근로기간', '주택소유상태', '최근_2년간_연체_횟수', '부채_대비_소득_비율', '총계좌수', '대출목적', '총연체금액', '연체계좌수'], axis=1, inplace=True)
train.head()

Unnamed: 0,연간소득,총상환원금,총상환이자,대출등급
0,72000000,0,0.0,C
1,130800000,373572,234060.0,B
2,96000000,928644,151944.0,A
3,132000000,325824,153108.0,C
4,84000000,240216,55428.0,A


In [16]:
train['총상환원금+총상환이자'] = train['총상환원금'] + train['총상환이자']
train['총상환원금-총상환이자'] = train['총상환원금'] - train['총상환이자']
train['총상환원금*총상환이자'] = train['총상환원금'] * train['총상환이자']
train['총상환원금^2'] = train['총상환원금']**2
train['총상환이자^2'] = train['총상환이자']**2
train['sqrt(총상환원금)'] = np.sqrt(train['총상환원금'])
train['sqrt(총상환이자)'] = np.sqrt(train['총상환이자'])

In [17]:
train.head()

Unnamed: 0,연간소득,총상환원금,총상환이자,대출등급,총상환원금+총상환이자,총상환원금-총상환이자,총상환원금*총상환이자,총상환원금^2,총상환이자^2,sqrt(총상환원금),sqrt(총상환이자)
0,72000000,0,0.0,C,0.0,0.0,0.0,0,0.0,0.0,0.0
1,130800000,373572,234060.0,B,607632.0,139512.0,87438260000.0,139556039184,54784080000.0,611.205366,483.797478
2,96000000,928644,151944.0,A,1080588.0,776700.0,141101900000.0,862379678736,23086980000.0,963.661766,389.799949
3,132000000,325824,153108.0,C,478932.0,172716.0,49886260000.0,106161278976,23442060000.0,570.809951,391.290174
4,84000000,240216,55428.0,A,295644.0,184788.0,13314690000.0,57703726656,3072263000.0,490.118353,235.431519


In [18]:
train.fillna(0, inplace=True)

In [19]:
train.describe()

Unnamed: 0,연간소득,총상환원금,총상환이자,총상환원금+총상환이자,총상환원금-총상환이자,총상환원금*총상환이자,총상환원금^2,총상환이자^2,sqrt(총상환원금),sqrt(총상환이자)
count,90622.0,90622.0,90622.0,90622.0,90622.0,90622.0,90622.0,90622.0,90622.0,90622.0
mean,96026550.0,832663.2,435076.2,1267739.0,397587.1,550804500000.0,1779663000000.0,386587300000.0,813.354733,580.451919
std,101809100.0,1042279.0,444182.6,1288687.0,952142.4,1187096000000.0,19552290000000.0,901466800000.0,413.665583,313.293517
min,6432000.0,0.0,0.0,0.0,-4032960.0,0.0,0.0,0.0,0.0,0.0
25%,58800000.0,312642.0,137847.0,495246.0,37644.0,47772830000.0,97745020000.0,19001800000.0,559.143989,371.277524
50%,81600000.0,604728.0,293418.0,959328.0,232452.0,187382800000.0,365696000000.0,86094120000.0,777.642591,541.680718
75%,114000000.0,1067496.0,579408.0,1705812.0,564465.0,591202900000.0,1139548000000.0,335713600000.0,1033.19698,761.188544
max,10800000000.0,41955940.0,5653416.0,42337840.0,41574040.0,77054590000000.0,1760301000000000.0,31961110000000.0,6477.340195,2377.691317


In [20]:
X = train.loc[:, train.columns != '대출등급']
y = train.loc[:, ['대출등급']]

In [21]:
from sklearn.preprocessing import StandardScaler, LabelEncoder

le = LabelEncoder()
ss = StandardScaler()

y = le.fit_transform(y)
X_ss = ss.fit_transform(X)

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X_ss, y, test_size=0.3, stratify=y, random_state=42)

In [23]:
params = {'min_samples_leaf':[18,19,20,21,22],
          'min_impurity_decrease':[0.0],
          'max_features':['auto',0.6,0.61,0.62,0.63,0.64,0.65,0.66,0.67,0.68,0.69,0.70],
          'max_depth':[None,11,12,13,14,15,16,17,18],
          'class_weight' : [None, 'balanced']}

grid_search(DecisionTreeClassifier(random_state=42), params, random=True)

최상의 매개변수: {'min_samples_leaf': 19, 'min_impurity_decrease': 0.0, 'max_features': 0.68, 'max_depth': 17, 'class_weight': None}
훈련 점수: 0.739
테스트 세트 점수: 0.743


In [24]:
params = {'min_samples_leaf':[1,3,5],
          'min_impurity_decrease':[0.0],
          'max_features':[0.71,0.73,0.75],
          'max_depth':[91,93,95,97,99],
          'n_estimators' : [750,770,790,810,830,850]}

grid_search(RandomForestClassifier(n_jobs=-1, random_state=42), params, random=True)

최상의 매개변수: {'n_estimators': 770, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.0, 'max_features': 0.73, 'max_depth': 95}
훈련 점수: 0.794
테스트 세트 점수: 0.787


0.689

150분.. 소요되어 후행 작업 중단

최상의 매개변수: {'n_estimators': 770, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.0, 'max_features': 0.73, 'max_depth': 95}
훈련 점수: 0.794
테스트 세트 점수: 0.787

In [25]:
params = {'n_estimators' : [150,175,200,225,250],
          'learning_rate' : [0.01,0.025,0.05,0.075,0.1],
          'max_depth' : [90,95,100,105,110],
          'objective' : ['multi:softmax']}

grid_search(XGBClassifier(random_state=42, n_jobs=-1), params, random=True)

최상의 매개변수: {'objective': 'multi:softmax', 'n_estimators': 250, 'max_depth': 90, 'learning_rate': 0.01}
훈련 점수: 0.768
테스트 세트 점수: 0.770


최상의 매개변수: {'objective': 'multi:softmax', 'n_estimators': 200, 'max_depth': 100, 'learning_rate': 0.01}
훈련 점수: 0.768
테스트 세트 점수: 0.767

33분 - 성능변화 크게 없음

최상의 매개변수: {'objective': 'multi:softmax', 'n_estimators': 250, 'max_depth': 90, 'learning_rate': 0.01}
훈련 점수: 0.768
테스트 세트 점수: 0.770

In [26]:
params = {'n_estimators' : [200,250,300,350,400],
          'learning_rate' : [0.01,0.025,0.05,0.075,0.1,0.125],
          'max_depth' : [5,7,10,13,15],
          'num_leaves' : [35,37,40,42,45]}

grid_search(LGBMClassifier(objective='multiclass', random_state=42, n_jobs=-1), params, random=True)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002202 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2550
[LightGBM] [Info] Number of data points in the train set: 63435, number of used features: 10
[LightGBM] [Info] Start training from score -1.744243
[LightGBM] [Info] Start training from score -1.208106
[LightGBM] [Info] Start training from score -1.248814
[LightGBM] [Info] Start training from score -1.982449
[LightGBM] [Info] Start training from score -2.564256
[LightGBM] [Info] Start training from score -3.885346
[LightGBM] [Info] Start training from score -5.433754
최상의 매개변수: {'num_leaves': 37, 'n_estimators': 300, 'max_depth': 13, 'learning_rate': 0.05}
훈련 점수: 0.771
테스트 세트 점수: 0.762


최상의 매개변수: {'num_leaves': 40, 'n_estimators': 300, 'max_depth': 10, 'learning_rate': 0.05}
훈련 점수: 0.771
테스트 세트 점수: 0.760

10분 - 성능변화 크게 없음

최상의 매개변수: {'num_leaves': 37, 'n_estimators': 300, 'max_depth': 13, 'learning_rate': 0.05}
훈련 점수: 0.771
테스트 세트 점수: 0.762

In [27]:
X_ss = pd.DataFrame(X_ss, columns=X.columns)
y = pd.DataFrame(y)

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

f1_macro_scores = []

def skf_score(model):
    for train_idx, valid_idx in skf.split(X_ss, y):
        X_train = X_ss.iloc[train_idx]
        X_val = y.iloc[train_idx]

        y_train = X_ss.iloc[valid_idx]
        y_val = y.iloc[valid_idx]

        model.fit(X_train, X_val)

        pred = model.predict(y_train)

        f1_macro = f1_score(y_val, pred, average='macro')
        f1_macro_scores.append(f1_macro)
    
    average_f1_macro = np.mean(f1_macro_scores)

    print("Average F1-macro score:", average_f1_macro)
    try:
        if model.feature_importances_.any():
            feature_importances = model.feature_importances_
            print("\n",'-'*10,'특성중요도','-'*10)
            for feature, importance in zip(X_ss.columns, feature_importances):
                print(f"{feature}: {importance}")
    except:
        None

In [28]:
skf_score(DecisionTreeClassifier(min_samples_leaf=19, min_impurity_decrease=0, 
                                 max_features=0.68, max_depth=17, class_weight=None, random_state=42))

Average F1-macro score: 0.7528012477225275

 ---------- 특성중요도 ----------
연간소득: 0.021138299394517394
총상환원금: 0.08188699028073927
총상환이자: 0.08633119073387403
총상환원금+총상환이자: 0.017892359226226506
총상환원금-총상환이자: 0.4859162479557219
총상환원금*총상환이자: 0.017575779932279025
총상환원금^2: 0.04432519658485146
총상환이자^2: 0.1161858224054798
sqrt(총상환원금): 0.06547769610864976
sqrt(총상환이자): 0.06327041737766058


In [29]:
skf_score(RandomForestClassifier(n_estimators=770, min_samples_leaf=1, min_impurity_decrease=0, 
                                 max_features=0.73, max_depth=95, random_state=42, n_jobs=-1))

Average F1-macro score: 0.7764259953077468

 ---------- 특성중요도 ----------
연간소득: 0.06241124126952324
총상환원금: 0.06494779137516525
총상환이자: 0.09512233978565837
총상환원금+총상환이자: 0.035440518664591615
총상환원금-총상환이자: 0.38431213855745666
총상환원금*총상환이자: 0.03356839455607797
총상환원금^2: 0.06596673787971329
총상환이자^2: 0.09566470653395252
sqrt(총상환원금): 0.06631108900440655
sqrt(총상환이자): 0.09625504237345472


In [30]:
skf_score(XGBClassifier(objective='multi:softmax', n_estimators=250, max_depth=90, 
          learning_rate=0.01, n_jobs=-1, random_state=42)) 

Average F1-macro score: 0.7772975157775472

 ---------- 특성중요도 ----------
연간소득: 0.03212502598762512
총상환원금: 0.10362794995307922
총상환이자: 0.25557741522789
총상환원금+총상환이자: 0.08019416779279709
총상환원금-총상환이자: 0.463946133852005
총상환원금*총상환이자: 0.06452930718660355
총상환원금^2: 0.0
총상환이자^2: 0.0
sqrt(총상환원금): 0.0
sqrt(총상환이자): 0.0


In [31]:
skf_score(LGBMClassifier(objective='multiclass', num_leaves=37, n_estimators=300, max_depth=13, 
               learning_rate=0.05, n_jobs=-1, random_state=42))

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004788 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2550
[LightGBM] [Info] Number of data points in the train set: 81559, number of used features: 10
[LightGBM] [Info] Start training from score -1.744289
[LightGBM] [Info] Start training from score -1.208056
[LightGBM] [Info] Start training from score -1.248847
[LightGBM] [Info] Start training from score -1.982382
[LightGBM] [Info] Start training from score -2.564275
[LightGBM] [Info] Start training from score -3.885514
[LightGBM] [Info] Start training from score -5.434151
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002912 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2550
[LightGBM] [Info] Number of data points in the train set: 81559, number of used features: 10
[LightGBM] [Info] Start training from score 

전체적으로 "총상환원금-총상환이자" 의 특성 중요도가 가장 높게 나타났다.

이는 상대적으로 낮은 대출등급에서 총상환원금보다 총상환이자가 더 많은 경우가 있기 때문에 음수가 생성되는데, 양/음수로 나뉘는 구간이 명확하기 때문이지 않을까 싶음

model|k-Fold|Sk-Fold
-|-|-
DecisionTree Classifier|0.743|0.7528012477225275
RandomForest Classifier|0.776|0.7764259953077468
XGBoost Classifier|0.777|0.7772975157775472
Light GBM Classifier|0.778|0.7777161437855237 --> 최고 성능

-----

### 앞에서 여러가지 모델링을 돌렸을 때 XGBoost에서 특이하게 대출기간에 대한 중요도가 엄청 크게 나타났다.

### 이번에는 대출기간과 다른 컬럼을 이용하여 파생변수를 생성해서 성능을 확인해보자

1. 대출금액*대출기간
2. 대출금액/대출기간
3. 총상환원금*대출기간
4. 총상환이자*대출기간
5. 총상환원금/대출기간
6. 총상환이자/대출기간

총 6가지 파생변수를 생성

대출금액, 대출기간, 총상환원금, 총상환이자 컬럼을 포함하여 10개의 Feature로 XGBoost 성능 확인

In [7]:
train = pd.read_csv('./train_new.csv')
train.drop(['ID', '근로기간', '연간소득', '주택소유상태', '최근_2년간_연체_횟수', '부채_대비_소득_비율', '총계좌수', '대출목적', '총연체금액', '연체계좌수'], axis=1, inplace=True)
train['대출금액*대출기간'] = train['대출금액'] * train['대출기간']
train['대출금액/대출기간'] = train['대출금액'] / train['대출기간']
train['총상환원금*대출기간'] = train['총상환원금'] * train['대출기간']
train['총상환이자*대출기간'] = train['총상환이자'] * train['대출기간']
train['총상환원금/대출기간'] = train['총상환원금'] / train['대출기간']
train['총상환이자/대출기간'] = train['총상환이자'] / train['대출기간']
train.loc[:, train.columns != '대출등급'].tail()

Unnamed: 0,대출금액,대출기간,총상환원금,총상환이자,대출금액*대출기간,대출금액/대출기간,총상환원금*대출기간,총상환이자*대출기간,총상환원금/대출기간,총상환이자/대출기간
90617,14400000,3,974580,492168.0,43200000,4800000.0,2923740,1476504.0,324860.0,164056.0
90618,28800000,5,583728,855084.0,144000000,5760000.0,2918640,4275420.0,116745.6,171016.8
90619,14400000,3,1489128,241236.0,43200000,4800000.0,4467384,723708.0,496376.0,80412.0
90620,15600000,3,1378368,818076.0,46800000,5200000.0,4135104,2454228.0,459456.0,272692.0
90621,8640000,3,596148,274956.0,25920000,2880000.0,1788444,824868.0,198716.0,91652.0


In [8]:
train.describe()

Unnamed: 0,대출금액,대출기간,총상환원금,총상환이자,대출금액*대출기간,대출금액/대출기간,총상환원금*대출기간,총상환이자*대출기간,총상환원금/대출기간,총상환이자/대출기간
count,90622.0,90622.0,90622.0,90622.0,90622.0,90622.0,90622.0,90622.0,90622.0,90622.0
mean,18567160.0,3.676282,832663.2,435076.2,72148060.0,5092617.0,2967748.0,1771490.0,246237.2,113941.3
std,10351190.0,0.946159,1042279.0,444182.6,49775470.0,2864262.0,3904647.0,2121681.0,318418.1,107150.0
min,1200000.0,3.0,0.0,0.0,3600000.0,400000.0,0.0,0.0,0.0,0.0
25%,10800000.0,3.0,312642.0,137847.0,32400000.0,3000000.0,1131960.0,428616.0,84310.8,41121.6
50%,17280000.0,3.0,604728.0,293418.0,60480000.0,4400000.0,2198700.0,999936.0,166384.8,83828.0
75%,24480000.0,5.0,1067496.0,579408.0,100800000.0,6400000.0,3977040.0,2303235.0,301232.0,152704.0
max,42000000.0,5.0,41955940.0,5653416.0,210000000.0,14000000.0,198969500.0,28267080.0,13985310.0,1349580.0


In [9]:
le = LabelEncoder()

train['대출등급'] = le.fit_transform(train['대출등급'])

X = train.loc[:, train.columns != '대출등급']
y = train.loc[:, '대출등급']

ss = StandardScaler()

X_ss = ss.fit_transform(X)

X_train, X_val, y_train, y_val = train_test_split(X_ss, y, test_size=0.3, stratify=y, random_state=42)

In [35]:
params = {'min_samples_leaf':[18,19,20,21,22],
          'min_impurity_decrease':[0.0],
          'max_features':['auto',0.6,0.61,0.62,0.63,0.64,0.65,0.66,0.67,0.68,0.69,0.70],
          'max_depth':[None,11,12,13,14,15,16,17,18],
          'class_weight' : [None, 'balanced']}

grid_search(DecisionTreeClassifier(random_state=42), params, random=True)

최상의 매개변수: {'min_samples_leaf': 19, 'min_impurity_decrease': 0.0, 'max_features': 0.68, 'max_depth': 17, 'class_weight': None}
훈련 점수: 0.757
테스트 세트 점수: 0.773


In [36]:
params = {'min_samples_leaf':[1,3,5],
          'min_impurity_decrease':[0.0],
          'max_features':[0.71,0.73,0.75],
          'max_depth':[91,93,95,97,99],
          'n_estimators' : [750,770,790,810,830,850]}

grid_search(RandomForestClassifier(n_jobs=-1, random_state=42), params, random=True)

최상의 매개변수: {'n_estimators': 770, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.0, 'max_features': 0.73, 'max_depth': 95}
훈련 점수: 0.831
테스트 세트 점수: 0.830


92분소요... 후행 작업 중단

최상의 매개변수: {'n_estimators': 770, 'min_samples_leaf': 1, 'min_impurity_decrease': 0.0, 'max_features': 0.73, 'max_depth': 95}
훈련 점수: 0.831
테스트 세트 점수: 0.830

In [37]:
params = {'n_estimators' : [500,700,850,1000],
          'learning_rate' : [0.01,0.05,0.1,0.2,0.3,0.4],
          'max_depth' : [5,10,15,20,25,30,35,40],
          'objective' : ['multi:softmax']}

grid_search(XGBClassifier(random_state=42, n_jobs=-1), params, random=True)

최상의 매개변수: {'objective': 'multi:softmax', 'n_estimators': 500, 'max_depth': 15, 'learning_rate': 0.3}
훈련 점수: 0.830
테스트 세트 점수: 0.830


최상의 매개변수: {'objective': 'multi:softmax', 'n_estimators': 500, 'max_depth': 10, 'learning_rate': 0.2}
훈련 점수: 0.832
테스트 세트 점수: 0.831

55분 소요.. 성능변화 크게 없음

최상의 매개변수: {'objective': 'multi:softmax', 'n_estimators': 500, 'max_depth': 15, 'learning_rate': 0.3}
훈련 점수: 0.830
테스트 세트 점수: 0.830

In [62]:
params = {'n_estimators' : [400,500,600,800,1000],
          'learning_rate' : [0.001,0.005,0.01,0.05,0.1,0.2],
          'max_depth' : [5,10,15,20,25,30],
          'num_leaves' : [20,25,30,35,40]}

grid_search(LGBMClassifier(objective='multiclass', random_state=42, n_jobs=-1), params, random=True)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.006639 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2298
[LightGBM] [Info] Number of data points in the train set: 63435, number of used features: 10
[LightGBM] [Info] Start training from score -1.744243
[LightGBM] [Info] Start training from score -1.208106
[LightGBM] [Info] Start training from score -1.248814
[LightGBM] [Info] Start training from score -1.982449
[LightGBM] [Info] Start training from score -2.564256
[LightGBM] [Info] Start training from score -3.885346
[LightGBM] [Info] Start training from score -5.433754
최상의 매개변수: {'num_leaves': 25, 'n_estimators': 500, 'max_depth': 10, 'learning_rate': 0.1}
훈련 점수: 0.821
테스트 세트 점수: 0.819


최상의 매개변수: {'num_leaves': 25, 'n_estimators': 500, 'max_depth': 10, 'learning_rate': 0.1}
훈련 점수: 0.821
테스트 세트 점수: 0.819

80분 소요... 후행 작업 중단

In [10]:
X_ss = pd.DataFrame(X_ss, columns=X.columns)
y = pd.DataFrame(y)

In [45]:
skf_score(DecisionTreeClassifier(min_samples_leaf=19, min_impurity_decrease=0, 
                                 max_features=0.68, max_depth=17, class_weight=None, random_state=42))

Average F1-macro score: 0.7803377111430769

 ---------- 특성중요도 ----------
대출기간: 0.020414716670254046
총상환원금: 0.058643291827882964
총상환이자: 0.11157762561254556
총상환원금*대출기간: 0.06780954164453884
총상환이자*대출기간: 0.11871771357899652
총상환원금/대출기간: 0.05714950767932168
총상환이자/대출기간: 0.11910432089473721
총상환원금-총상환이자: 0.4465832820917232


In [11]:
skf_score(RandomForestClassifier(n_estimators=770, min_samples_leaf=1, min_impurity_decrease=0, 
                                 max_features=0.73, max_depth=95, random_state=42, n_jobs=-1))

Average F1-macro score: 0.8454329888401269

 ---------- 특성중요도 ----------
대출금액: 0.02905751461041064
대출기간: 0.025428606055366614
총상환원금: 0.1479504335018489
총상환이자: 0.13901697289703976
대출금액*대출기간: 0.029655404271213263
대출금액/대출기간: 0.02966126849103156
총상환원금*대출기간: 0.1718265561943464
총상환이자*대출기간: 0.1398371559565516
총상환원금/대출기간: 0.14731120258071137
총상환이자/대출기간: 0.1402548854414799


In [43]:
skf_score(XGBClassifier(objective='multi:softmax', n_estimators=500, max_depth=10, 
          learning_rate=0.2, n_jobs=-1, random_state=42))

Average F1-macro score: 0.8344399032633336

 ---------- 특성중요도 ----------
대출금액: 0.006457563955336809
대출기간: 0.7597364783287048
총상환원금: 0.025837700814008713
총상환이자: 0.02248673141002655
대출금액*대출기간: 0.004607027862221003
대출금액/대출기간: 0.004678612109273672
총상환원금*대출기간: 0.03731198608875275
총상환이자*대출기간: 0.055219151079654694
총상환원금/대출기간: 0.03926330432295799
총상환이자/대출기간: 0.04440142214298248


In [64]:
skf_score(LGBMClassifier(objective='multiclass', num_leaves=25, n_estimators=500, max_depth=10, 
               learning_rate=0.1, n_jobs=-1, random_state=42))

[LightGBM] [Info] Total Bins 2298
[LightGBM] [Info] Number of data points in the train set: 81559, number of used features: 10
[LightGBM] [Info] Start training from score -1.744289
[LightGBM] [Info] Start training from score -1.208056
[LightGBM] [Info] Start training from score -1.248847
[LightGBM] [Info] Start training from score -1.982382
[LightGBM] [Info] Start training from score -2.564275
[LightGBM] [Info] Start training from score -3.885514
[LightGBM] [Info] Start training from score -5.434151
[LightGBM] [Info] Total Bins 2298
[LightGBM] [Info] Number of data points in the train set: 81559, number of used features: 10
[LightGBM] [Info] Start training from score -1.744289
[LightGBM] [Info] Start training from score -1.208056
[LightGBM] [Info] Start training from score -1.248847
[LightGBM] [Info] Start training from score -1.982382
[LightGBM] [Info] Start training from score -2.564275
[LightGBM] [Info] Start training from score -3.884917
[LightGBM] [Info] Start training from score 

model|k-Fold|Sk-Fold
-|-|-
DecisionTree Classifier|0.773|0.7803377111430769
RandomForest Classifier|0.830|0.8454329888401269 --> 최고 성능
XGBoost Classifier|0.830|0.8344399032633336 
Light GBM Classifier|0.819|0.8050474063923861

확실히 XGB에서 대출기간에서 영향을 많이 받는 모습을 볼 수 있음 (특성 중요도 76%)

-----

### 이번에는 위 모델링 결과에서 영향력이 작은 대출금액, 대출금액*대출기간, 대출금액/대출기간 3개 컬럼을 제거하고

### 앞서 높은 특성중요도를 보였던 총상환원금 - 총상환이자 컬럼을 추가하여 재모델링 (총 8개 컬럼 사용)

In [12]:
train = pd.read_csv('./train_new.csv')
train.drop(['ID', '대출금액', '근로기간', '연간소득', '주택소유상태', '최근_2년간_연체_횟수', '부채_대비_소득_비율', '총계좌수', '대출목적', '총연체금액', '연체계좌수'], axis=1, inplace=True)
train['총상환원금*대출기간'] = train['총상환원금'] * train['대출기간']
train['총상환이자*대출기간'] = train['총상환이자'] * train['대출기간']
train['총상환원금/대출기간'] = train['총상환원금'] / train['대출기간']
train['총상환이자/대출기간'] = train['총상환이자'] / train['대출기간']
train['총상환원금-총상환이자'] = train['총상환원금'] - train['총상환이자']
train.head()

Unnamed: 0,대출기간,총상환원금,총상환이자,대출등급,총상환원금*대출기간,총상환이자*대출기간,총상환원금/대출기간,총상환이자/대출기간,총상환원금-총상환이자
0,3,0,0.0,C,0,0.0,0.0,0.0,0.0
1,5,373572,234060.0,B,1867860,1170300.0,74714.4,46812.0,139512.0
2,3,928644,151944.0,A,2785932,455832.0,309548.0,50648.0,776700.0
3,3,325824,153108.0,C,977472,459324.0,108608.0,51036.0,172716.0
4,3,240216,55428.0,A,720648,166284.0,80072.0,18476.0,184788.0


In [48]:
train.describe()

Unnamed: 0,대출기간,총상환원금,총상환이자,총상환원금*대출기간,총상환이자*대출기간,총상환원금/대출기간,총상환이자/대출기간,총상환원금-총상환이자
count,90622.0,90622.0,90622.0,90622.0,90622.0,90622.0,90622.0,90622.0
mean,3.676282,832663.2,435076.2,2967748.0,1771490.0,246237.2,113941.3,397587.1
std,0.946159,1042279.0,444182.6,3904647.0,2121681.0,318418.1,107150.0,952142.4
min,3.0,0.0,0.0,0.0,0.0,0.0,0.0,-4032960.0
25%,3.0,312642.0,137847.0,1131960.0,428616.0,84310.8,41121.6,37644.0
50%,3.0,604728.0,293418.0,2198700.0,999936.0,166384.8,83828.0,232452.0
75%,5.0,1067496.0,579408.0,3977040.0,2303235.0,301232.0,152704.0,564465.0
max,5.0,41955940.0,5653416.0,198969500.0,28267080.0,13985310.0,1349580.0,41574040.0


In [13]:
le = LabelEncoder()

train['대출등급'] = le.fit_transform(train['대출등급'])

X = train.loc[:, train.columns != '대출등급']
y = train.loc[:, '대출등급']

ss = StandardScaler()

X_ss = ss.fit_transform(X)

X_train, X_val, y_train, y_val = train_test_split(X_ss, y, test_size=0.3, stratify=y, random_state=42)

In [41]:
params = {'min_samples_leaf':[18,19,20,21,22],
          'min_impurity_decrease':[0.0],
          'max_features':['auto',0.6,0.61,0.62,0.63,0.64,0.65,0.66,0.67,0.68,0.69,0.70],
          'max_depth':[None,11,12,13,14,15,16,17,18],
          'class_weight' : [None, 'balanced']}

grid_search(DecisionTreeClassifier(random_state=42), params, random=True)

최상의 매개변수: {'min_samples_leaf': 18, 'min_impurity_decrease': 0.0, 'max_features': 0.65, 'max_depth': 15, 'class_weight': None}
훈련 점수: 0.781
테스트 세트 점수: 0.768


In [42]:
params = {'min_samples_leaf':[1,3,5],
          'min_impurity_decrease':[0.0],
          'max_features':[0.71,0.73,0.75],
          'max_depth':[91,93,95,97,99],
          'n_estimators' : [750,770,790,810,830,850]}

grid_search(RandomForestClassifier(n_jobs=-1, random_state=42), params, random=True)

최상의 매개변수: {'n_estimators': 850, 'min_samples_leaf': 3, 'min_impurity_decrease': 0.0, 'max_features': 0.73, 'max_depth': 95}
훈련 점수: 0.831
테스트 세트 점수: 0.829


95분.. 후행 작업 중단

최상의 매개변수: {'n_estimators': 850, 'min_samples_leaf': 3, 'min_impurity_decrease': 0.0, 'max_features': 0.73, 'max_depth': 95}
훈련 점수: 0.831
테스트 세트 점수: 0.829

In [43]:
params = {'n_estimators' : [150,300,500,1000],
          'learning_rate' : [0.001,0.01,0.1],
          'max_depth' : [50,100,150],
          'objective' : ['multi:softmax']}

grid_search(XGBClassifier(random_state=42, n_jobs=-1), params, random=True)

최상의 매개변수: {'objective': 'multi:softmax', 'n_estimators': 150, 'max_depth': 100, 'learning_rate': 0.01}
훈련 점수: 0.817
테스트 세트 점수: 0.813


XGB 튜닝 이력 (1번당 평균 10~15분 소요)

첫 번쨰 랜덤서치 사용 이후 그리드서치..를 사용하려 했으나 역시나 시간이 1시간반 이상 소요되어 랜덤서치로 쭉 진행

최상의 매개변수: {'objective': 'multi:softmax', 'n_estimators': 200, 'max_depth': 100, 'learning_rate': 0.01}
훈련 점수: 0.815
테스트 세트 점수: 0.814

최상의 매개변수: {'objective': 'multi:softmax', 'n_estimators': 150, 'max_depth': 90, 'learning_rate': 0.01}
훈련 점수: 0.817
테스트 세트 점수: 0.813

최상의 매개변수: {'objective': 'multi:softmax', 'n_estimators': 130, 'max_depth': 100, 'learning_rate': 0.01}
훈련 점수: 0.816
테스트 세트 점수: 0.813

세번정도 미세 튜닝결과 성능이 크게 나아지는 부분이 없어서 2번째 결과를 사용

혹시 몰라 큰 값도 잡아봤는데, 두번째 결과가 여전히 최상의 매개변수로 출력됨

In [14]:
params = {'n_estimators' : [450,500,550,600],
          'learning_rate' : [0.01,0.05,0.1,0.2,0.3,0.4],
          'max_depth' : [75,100,125,150],
          'num_leaves' : [20,25,30]}

grid_search(LGBMClassifier(objective='multiclass', random_state=42, n_jobs=-1), params, random=True)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000677 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1788
[LightGBM] [Info] Number of data points in the train set: 63435, number of used features: 8
[LightGBM] [Info] Start training from score -1.744243
[LightGBM] [Info] Start training from score -1.208106
[LightGBM] [Info] Start training from score -1.248814
[LightGBM] [Info] Start training from score -1.982449
[LightGBM] [Info] Start training from score -2.564256
[LightGBM] [Info] Start training from score -3.885346
[LightGBM] [Info] Start training from score -5.433754
최상의 매개변수: {'num_leaves': 30, 'n_estimators': 500, 'max_depth': 125, 'learning_rate': 0.05}
훈련 점수: 0.809
테스트 세트 점수: 0.799


In [15]:
X_ss = pd.DataFrame(X_ss, columns=X.columns)
y = pd.DataFrame(y)

In [52]:
skf_score(DecisionTreeClassifier(min_samples_leaf=18, min_impurity_decrease=0, 
                                 max_features=0.65, max_depth=15, class_weight=None, random_state=42))

Average F1-macro score: 0.7825190595974667

 ---------- 특성중요도 ----------
대출기간: 0.03252951874349846
총상환원금: 0.07039780582122465
총상환이자: 0.1384651286046061
총상환원금*대출기간: 0.04919493517081665
총상환이자*대출기간: 0.17311708469403161
총상환원금/대출기간: 0.04320628644937883
총상환이자/대출기간: 0.05408595922536836
총상환원금-총상환이자: 0.4390032812910753


In [16]:
skf_score(RandomForestClassifier(n_estimators=850, min_samples_leaf=3, min_impurity_decrease=0, 
                                 max_features=0.73, max_depth=95, random_state=42, n_jobs=-1))

Average F1-macro score: 0.8410873545863643

 ---------- 특성중요도 ----------
대출기간: 0.024334394715308655
총상환원금: 0.07112992841695093
총상환이자: 0.10338090169824544
총상환원금*대출기간: 0.08933986188326175
총상환이자*대출기간: 0.11699873373477396
총상환원금/대출기간: 0.06998909211801188
총상환이자/대출기간: 0.12577413064420231
총상환원금-총상환이자: 0.3990529567892452


In [18]:
skf_score(XGBClassifier(objective='multi:softmax', n_estimators=150, max_depth=100, 
          learning_rate=0.01, n_jobs=-1, random_state=42))

Average F1-macro score: 0.8204524829028639

 ---------- 특성중요도 ----------
대출기간: 0.3718532621860504
총상환원금: 0.025181423872709274
총상환이자: 0.03701609745621681
총상환원금*대출기간: 0.0464862585067749
총상환이자*대출기간: 0.10007234662771225
총상환원금/대출기간: 0.05068105086684227
총상환이자/대출기간: 0.15467771887779236
총상환원금-총상환이자: 0.21403177082538605


In [17]:
skf_score(LGBMClassifier(objective='multiclass', num_leaves=30, n_estimators=500, max_depth=125, 
               learning_rate=0.05, n_jobs=-1 , random_state=42))

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003089 seconds.
You can set `force_col_wise=true` to remove the overhead.


[LightGBM] [Info] Total Bins 1788
[LightGBM] [Info] Number of data points in the train set: 81559, number of used features: 8
[LightGBM] [Info] Start training from score -1.744289
[LightGBM] [Info] Start training from score -1.208056
[LightGBM] [Info] Start training from score -1.248847
[LightGBM] [Info] Start training from score -1.982382
[LightGBM] [Info] Start training from score -2.564275
[LightGBM] [Info] Start training from score -3.885514
[LightGBM] [Info] Start training from score -5.434151
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000930 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1788
[LightGBM] [Info] Number of data points in the train set: 81559, number of used features: 8
[LightGBM] [Info] Start training from score -1.744289
[LightGBM] [Info] Start training from score -1.208056
[LightGBM] [Info] Start training fr

model|k-Fold|Sk-Fold
-|-|-
DecisionTree Classifier|0.768|0.7825190595974667
RandomForest Classifier|0.829|0.8410873545863643 --> 최고 성능
XGBoost Classifier|0.813|0.8204524829028639
Light GBM Classifier|0.799|0.8332265359229514