### 세 번째 모델링 : 근로기간의 모든 값을 0~11로 대치 후 모델링

근로기간|값
-|-
< 1 year|0
1 year|1
2 years|2
3 years|3
4 years|4
5 years|5
6 years|6
7 years|7
8 years|8
9 years|9
10+ years|10
Unknown|11

In [1]:
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings(action='ignore')

In [3]:
train_raw = pd.read_csv('train.csv')
train = train_raw.copy()

In [4]:
train.drop('ID', axis=1, inplace=True)

In [5]:
train = train[train['주택소유상태'] != 'ANY']

In [6]:
val = dict({
    ' 36 months' : 3,
    ' 60 months' : 5
})

train['대출기간'] = train['대출기간'].replace(val)

train['대출기간'].value_counts()

대출기간
3    64478
5    31815
Name: count, dtype: int64

In [7]:
train['근로기간'] = train['근로기간'].apply(lambda x: '< 1 year' if '<1 year' in x else x)
train['근로기간'] = train['근로기간'].apply(lambda x: '1 year' if '1 years' in x else x)
train['근로기간'] = train['근로기간'].apply(lambda x: '3 years' if '3' in x else x)
train['근로기간'] = train['근로기간'].apply(lambda x: '10+ years' if '10+years' in x else x)

duration=dict({'10+ years':10,
 '9 years':9,
 '8 years':8,
 '7 years':7,
 '6 years':6,
 '5 years':5,
 '4 years':4,
 '3 years':3,
 '2 years':2,
 '1 year':1,
 '< 1 year':0,
 'Unknown' : 11})

train['근로기간'] = train['근로기간'].replace(duration)

In [8]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

train['주택소유상태'] = le.fit_transform(train['주택소유상태'])
train['대출목적'] = le.fit_transform(train['대출목적'])
train['대출등급'] = le.fit_transform(train['대출등급'])

In [9]:
X = train.drop(['대출등급'], axis=1)
y = train['대출등급']

X.head()

Unnamed: 0,대출금액,대출기간,근로기간,주택소유상태,연간소득,부채_대비_소득_비율,총계좌수,대출목적,최근_2년간_연체_횟수,총상환원금,총상환이자,총연체금액,연체계좌수
0,12480000,3,6,2,72000000,18.9,15,1,0,0,0.0,0.0,0.0
1,14400000,5,10,0,130800000,22.33,21,10,0,373572,234060.0,0.0,0.0
2,12000000,3,5,0,96000000,8.6,14,1,0,928644,151944.0,0.0,0.0
3,14400000,3,8,0,132000000,15.09,15,1,0,325824,153108.0,0.0,0.0
4,18000000,5,11,2,71736000,25.39,19,8,0,228540,148956.0,0.0,0.0


In [17]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

X_ss = ss.fit_transform(X)

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X_ss, y, test_size=0.3, stratify=y, random_state=42)

In [19]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import f1_score

In [20]:
def grid_search(model, params, random=False):
    clf = model
    if not random:
        grid = GridSearchCV(clf, params,
                                scoring='f1_macro', cv=5,
                                n_jobs=-1)
    else:
        grid = RandomizedSearchCV(clf, params, n_iter=10,
                                scoring='f1_macro', cv=5,
                                n_jobs=-1, random_state=42)
        
    grid.fit(X_train, y_train)
    
    
    best_model = grid.best_estimator_
    
    best_params = grid.best_params_
    print("최상의 매개변수:", best_params)
    
    best_score = grid.best_score_
    print("훈련 점수: {:.3f}".format(best_score))
    
    y_pred = best_model.predict(X_val)
    macro_f1_val = f1_score(y_val, y_pred, average='macro')
    print('테스트 세트 점수: {:.3f}'.format(macro_f1_val))

In [14]:
params = {'min_samples_leaf':[18,19,20,21,22],
          'min_impurity_decrease':[0.0],
          'max_features':['auto',0.6,0.61,0.62,0.63,0.64,0.65,0.66,0.67,0.68,0.69,0.70],
          'max_depth':[None,11,12,13,14,15,16,17,18],
          'class_weight' : [None, 'balanced']}

grid_search(DecisionTreeClassifier(random_state=42), params, random=True)

최상의 매개변수: {'min_samples_leaf': 19, 'min_impurity_decrease': 0.0, 'max_features': 0.68, 'max_depth': 17, 'class_weight': None}
훈련 점수: 0.677
테스트 세트 점수: 0.666


In [21]:
params = {'min_samples_leaf':[20],
          'min_impurity_decrease':[0],
          'max_features':[0.65,0.6,0.55],
          'max_depth':[85,90,95],
          'n_estimators' : [730,750,770]}

grid_search(RandomForestClassifier(n_jobs=-1, random_state=42), params, random=True)

최상의 매개변수: {'n_estimators': 730, 'min_samples_leaf': 20, 'min_impurity_decrease': 0, 'max_features': 0.65, 'max_depth': 90}
훈련 점수: 0.685
테스트 세트 점수: 0.702


In [22]:
params = {'n_estimators' : [130,150,170],
          'learning_rate' : [0.18,0.2,0.22],
          'max_depth' : [33,34,35,36,37],
          'objective' : ['multi:softmax']}

grid_search(XGBClassifier(random_state=42, n_jobs=-1), params, random=True)

최상의 매개변수: {'objective': 'multi:softmax', 'n_estimators': 170, 'max_depth': 34, 'learning_rate': 0.22}
훈련 점수: 0.793
테스트 세트 점수: 0.794


In [23]:
params = {'n_estimators' : [340,350,360],
          'learning_rate' : [0.05,0.06,0.07,0.08,0.09],
          'max_depth' : [8,9,10,11,12],
          'num_leaves' : [43,45,47]}

grid_search(LGBMClassifier(objective='multiclass', random_state=42, n_jobs=-1), params, random=True)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001430 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1457
[LightGBM] [Info] Number of data points in the train set: 67405, number of used features: 13
[LightGBM] [Info] Start training from score -1.747717
[LightGBM] [Info] Start training from score -1.206424
[LightGBM] [Info] Start training from score -1.248802
[LightGBM] [Info] Start training from score -1.975557
[LightGBM] [Info] Start training from score -2.572111
[LightGBM] [Info] Start training from score -3.897369
[LightGBM] [Info] Start training from score -5.434895
최상의 매개변수: {'num_leaves': 43, 'n_estimators': 340, 'max_depth': 11, 'learning_rate': 0.09}
훈련 점수: 0.787
테스트 세트 점수: 0.794


In [24]:
X_ss = pd.DataFrame(X_ss, columns=X.columns)

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

f1_macro_scores = []

def skf_score(model):
    for train_idx, valid_idx in skf.split(X_ss, y):
        X_train = X_ss.iloc[train_idx]
        X_val = y.iloc[train_idx]

        y_train = X_ss.iloc[valid_idx]
        y_val = y.iloc[valid_idx]

        model.fit(X_train, X_val)

        pred = model.predict(y_train)

        f1_macro = f1_score(y_val, pred, average='macro')
        f1_macro_scores.append(f1_macro)
    
    average_f1_macro = np.mean(f1_macro_scores)

    print("Average F1-macro score:", average_f1_macro)
    try:
        if model.feature_importances_.any():
            feature_importances = model.feature_importances_
            print("\n",'-'*10,'특성중요도','-'*10)
            for feature, importance in zip(X_ss.columns, feature_importances):
                print(f"{feature}: {importance}")
    except:
        None

In [16]:
skf_score(DecisionTreeClassifier(min_samples_leaf=18, min_impurity_decrease=0, max_features=0.65,
                                 max_depth=None, class_weight=None, random_state=42))

Average F1-macro score: 0.7010763737633329

 ---------- 특성중요도 ----------
대출금액: 0.07002911851969708
대출기간: 0.03529504123307088
근로기간: 0.002055386402501382
주택소유상태: 0.0012106893667268236
연간소득: 0.010836897382718041
부채_대비_소득_비율: 0.015174850697332296
총계좌수: 0.0042522523421826205
대출목적: 0.0020056258950848444
최근_2년간_연체_횟수: 0.0005408129480661561
총상환원금: 0.5019585673877407
총상환이자: 0.35664075782487914
총연체금액: 0.0
연체계좌수: 0.0


In [25]:
skf_score(RandomForestClassifier(n_estimators=730, min_samples_leaf=20, min_impurity_decrease=0, 
                                 max_features=0.65, max_depth=90, random_state=42, n_jobs=-1))

Average F1-macro score: 0.7337106491739855

 ---------- 특성중요도 ----------
대출금액: 0.05095479784400546
대출기간: 0.045537395660334774
근로기간: 0.0028482726556627327
주택소유상태: 0.0015965295833664122
연간소득: 0.01777375572126978
부채_대비_소득_비율: 0.011142237162281757
총계좌수: 0.0053632482242000614
대출목적: 0.004074013377095862
최근_2년간_연체_횟수: 0.0012883103737419248
총상환원금: 0.45381591399719595
총상환이자: 0.40560552540084527
총연체금액: 0.0
연체계좌수: 0.0


In [26]:
skf_score(XGBClassifier(objective='multi:softmax', n_estimators=170, max_depth=34, 
          learning_rate=0.22, n_jobs=-1, random_state=42))

Average F1-macro score: 0.7697288586617749

 ---------- 특성중요도 ----------
대출금액: 0.030072171241044998
대출기간: 0.5025891065597534
근로기간: 0.013120010495185852
주택소유상태: 0.012862917967140675
연간소득: 0.019176431000232697
부채_대비_소득_비율: 0.01340008620172739
총계좌수: 0.01359183806926012
대출목적: 0.018117578700184822
최근_2년간_연체_횟수: 0.01852087490260601
총상환원금: 0.15769290924072266
총상환이자: 0.17012466490268707
총연체금액: 0.016124052926898003
연체계좌수: 0.014607292599976063


In [27]:
skf_score(LGBMClassifier(objective='multiclass', num_leaves=43, n_estimators=340, max_depth=11, 
               learning_rate=0.09, n_jobs=-1, random_state=42))

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003381 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1463
[LightGBM] [Info] Number of data points in the train set: 86663, number of used features: 13
[LightGBM] [Info] Start training from score -1.747730
[LightGBM] [Info] Start training from score -1.206395
[LightGBM] [Info] Start training from score -1.248807
[LightGBM] [Info] Start training from score -1.975538
[LightGBM] [Info] Start training from score -2.572234
[LightGBM] [Info] Start training from score -3.897282
[LightGBM] [Info] Start training from score -5.434888


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002801 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1470
[LightGBM] [Info] Number of data points in the train set: 86663, number of used features: 13
[LightGBM] [Info] Start training from score -1.747663
[LightGBM] [Info] Start training from score -1.206395
[LightGBM] [Info] Start training from score -1.248807
[LightGBM] [Info] Start training from score -1.975622
[LightGBM] [Info] Start training from score -2.572234
[LightGBM] [Info] Start training from score -3.897282
[LightGBM] [Info] Start training from score -5.434888
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001863 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1467
[LightGBM] [Info] Number of data points in the train set: 86663, number o

model|k-Fold|Sk-Fold
-|-|-
DecisionTree Classifier|0.666|0.7010763737633329
RandomForest Classifier|0.702|0.7337106491739855
XGBoost Classifier|0.794|0.7697288586617749
Light GBM Classifier|0.794|0.7796600062893523 --> 최고 성능