# Customer Loan Default Classification
---
#### Dataset size
- train : 100,000 rows
- test : 35,816 rows
---
#### Input and output
- input : 75 features describing the customer's financial status
- output : whether the customer defaults
    - 0 = repaid
    - 1 = default / bankruptcy
---
#### features
- **int_rate** : Interest rate of the loan the applicant received
- **annual_inc** : Annual income
- **dti** : Debt-to-income ratio
- **delinq_2yrs** : Number of delinquencies on lines of credit in the last 2 years
- **inq_last_6mths** : Number of inquiries into the applicant's credit during the last 6 months
- **pub_rec** : Number of bankruptcies listed in the public record
- **revol_bal** : Total credit revolving balance
- **total_acc** : Total number of credit card accounts in the applicant's history (num_total_cc_accounts)
- **collections_12_mths_ex_med** : Number of collections in the last 12 months, excluding medical collections (num_collections_last_12m)
- **acc_now_delinq** : Number of accounts on which the borrower is now delinquent
- **tot_coll_amt** : Total amount the applicant has ever had against them in collections (total_collection_amount_ever)
- **tot_cur_bal** : Total current balance of all accounts
- **chargeoff_within_12_mths** : Number of charge-offs within the last 12 months at time of application for the secondary applicant
- **delinq_amnt** : Delinquency amount
- **tax_liens** : Number of tax liens
- **emp_length1** ~ 12 : Number of years in the job
- **home_ownership1** ~ 6 : Ownership status of the applicant's residence
- **verification_status1** ~ 3 : Type of verification of the joint income (verification_income_joint)
- **purpose1** ~ 14 : Purpose of the loan
- **initial_list_status1** ~ 2 : Initial listing status of the loan
- **mths_since_last_delinq1** ~ 11 : Months since the last delinquency
- **funded_amnt** : Funded amount
- **funded_amnt_inv** : Amount funded by investors
- **total_rec_late_fee** : Late fees received to date
- **term1** : Number of payments on the loan; values are in months and can be either 36 or 60
- **open_acc** : Number of open credit lines in the borrower's credit file
- **installment** : Monthly payment owed by the borrower if the loan originates
- **revol_util** : Revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit
- **out_prncp** : Remaining outstanding principal for the total amount funded
- **out_prncp_inv** : Remaining outstanding principal for the portion of the total amount funded by investors
- **total_rec_int** : Interest received to date
- **fico_range_low** : Lower boundary of the range the borrower's FICO score (a type of credit score) at loan origination belongs to
- **fico_range_high** : Upper boundary of the range the borrower's FICO score (a type of credit score) at loan origination belongs to
- **depvar** : Whether the customer defaulted (dependent variable)
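
Counting the entries in the list above gives 27 unnumbered features plus 48 numbered columns, which matches the stated 75 inputs. The following is a minimal sketch of that bookkeeping; the names `scalar_features`, `numbered_groups`, and `expected_columns` are illustrative and do not appear in the original notebook.

```python
# Unnumbered features, as named in the feature list (depvar is the target, so it is excluded).
scalar_features = [
    "int_rate", "annual_inc", "dti", "delinq_2yrs", "inq_last_6mths",
    "pub_rec", "revol_bal", "total_acc", "collections_12_mths_ex_med",
    "acc_now_delinq", "tot_coll_amt", "tot_cur_bal",
    "chargeoff_within_12_mths", "delinq_amnt", "tax_liens",
    "funded_amnt", "funded_amnt_inv", "total_rec_late_fee", "term1",
    "open_acc", "installment", "revol_util", "out_prncp", "out_prncp_inv",
    "total_rec_int", "fico_range_low", "fico_range_high",
]

# Numbered column groups: prefix -> number of columns (e.g. emp_length1 ~ 12).
numbered_groups = {
    "emp_length": 12,
    "home_ownership": 6,
    "verification_status": 3,
    "purpose": 14,
    "initial_list_status": 2,
    "mths_since_last_delinq": 11,
}

# Expand each prefix into its numbered column names.
numbered_columns = [
    f"{prefix}{i}"
    for prefix, count in numbered_groups.items()
    for i in range(1, count + 1)
]

expected_columns = scalar_features + numbered_columns
print(len(scalar_features), len(numbered_columns), len(expected_columns))
```

Once the data is loaded, comparing `set(expected_columns)` against the training columns without `depvar` would flag any missing or unexpected column. The comparison should use sets because the column order in the file differs from the order of this list.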

---
# Load required data
---

In [2]:
# Libraries for data handling
import numpy as np
import pandas as pd

# Libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Libraries for machine learning
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import sklearn.metrics as metrics
from sklearn.linear_model import LogisticRegression

In [3]:
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")

In [4]:
train.iloc[:, :-1].columns

Index(['int_rate', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths',
       'pub_rec', 'revol_bal', 'total_acc', 'collections_12_mths_ex_med',
       'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal',
       'chargeoff_within_12_mths', 'delinq_amnt', 'tax_liens', 'emp_length1',
       'emp_length2', 'emp_length3', 'emp_length4', 'emp_length5',
       'emp_length6', 'emp_length7', 'emp_length8', 'emp_length9',
       'emp_length10', 'emp_length11', 'emp_length12', 'home_ownership1',
       'home_ownership2', 'home_ownership3', 'home_ownership4',
       'home_ownership5', 'home_ownership6', 'verification_status1',
       'verification_status2', 'verification_status3', 'purpose1', 'purpose2',
       'purpose3', 'purpose4', 'purpose5', 'purpose6', 'purpose7', 'purpose8',
       'purpose9', 'purpose10', 'purpose11', 'purpose12', 'purpose13',
       'purpose14', 'initial_list_status1', 'initial_list_status2',
       'mths_since_last_delinq1', 'mths_since_last_delinq2',
       'mths_since_l

In [5]:
# StandardScaler
from sklearn.preprocessing import StandardScaler

scale_data = train.iloc[:, :-1]
scaler = StandardScaler()
scaler.fit(scale_data)
scaler_data_scaled = scaler.transform(scale_data)
scaler_data_scaled = pd.DataFrame(scaler_data_scaled)
scaler_data_scaled.columns = scale_data.columns
print(scaler_data_scaled)

       int_rate  annual_inc       dti  delinq_2yrs  inq_last_6mths   pub_rec  \
0     -1.081759   -0.714584  1.268927    -0.379778        0.347801 -0.356360   
1     -0.020849    0.075520 -1.627778    -0.379778        0.347801  1.155435   
2     -0.020849   -0.486927  0.611611    -0.379778        2.448553 -0.356360   
3      0.131028    0.343352 -0.266790     4.040097        1.398177 -0.356360   
4     -0.087854   -0.594059  0.804170    -0.379778        0.347801  2.667231   
...         ...         ...       ...          ...             ...       ...   
99995  1.002092   -0.125353 -0.100381    -0.379778        2.448553  1.155435   
99996 -0.934349   -0.125353 -1.858373    -0.379778       -0.702576 -0.356360   
99997  0.090826   -0.379794  1.617197    -0.379778        0.347801 -0.356360   
99998  1.801683   -0.580668 -1.662248    -0.379778        0.347801 -0.356360   
99999  0.649199    0.678142  1.761022    -0.379778       -0.702576 -0.356360   

       revol_bal  total_acc  collection
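
The cell above fits the scaler on all training rows before any cross-validation split takes place. One possible alternative, sketched below on synthetic data, is to place `StandardScaler` inside a scikit-learn pipeline so that each fold fits the scaler only on its own training part. `LogisticRegression` is a stand-in estimator, and `X_demo` / `y_demo` are made-up data, not the notebook's model or dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the real loan data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=5.0, size=200) > 50.0).astype(int)

# The scaler is refit inside every fold because it is part of the pipeline.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One macro-F1 score per fold.
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="f1_macro")
print(scores.mean())
```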

In [10]:
scaler_data_scaled['depvar'] = train['depvar']
scale_data = scaler_data_scaled
scale_data.shape

(100000, 76)

In [11]:
X = scale_data.drop(['depvar'], axis=1)
y = scale_data['depvar']
print(X.shape, y.shape)

(100000, 75) (100000,)


In [12]:
# Model evaluation
def get_clf_eval(y_answer, y_pred):
    """Print classification metrics and return the macro-averaged F1 score."""
    acc = metrics.accuracy_score(y_answer, y_pred)
    prec = metrics.precision_score(y_answer, y_pred)
    recall = metrics.recall_score(y_answer, y_pred)
    # Note: AUC is computed from hard 0/1 predictions here, not from probabilities
    AUC = metrics.roc_auc_score(y_answer, y_pred)
    F1 = metrics.f1_score(y_answer, y_pred, average="macro")
    confus_met = metrics.confusion_matrix(y_answer, y_pred)

    print("========================")
    print("Accuracy : {:.6f}".format(acc))
    print("Precision : {:.6f}".format(prec))
    print("Recall : {:.6f}".format(recall))
    print("AUC : {:.6f}".format(AUC))

    print(" ** F1 : {:.6f} **".format(F1))

    print("====confusion_matrix====\n{}".format(confus_met))
    print("========================")

    return F1

In [13]:
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
models = []

for i, (train_index, val_index) in enumerate(skf.split(X, y)):
    print(i)
    X_train, X_val = X.iloc[train_index], X.iloc[val_index]
    y_train, y_val = y.iloc[train_index], y.iloc[val_index]
    # Create a fresh model per fold; reusing a single instance would make every
    # entry in `models` refer to the same (last-fitted) object
    model = XGBClassifier()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    score = get_clf_eval(y_val, y_pred)  # macro F1 for this fold
    fold_scores.append(score)
    models.append(model)

best_fold_index = fold_scores.index(max(fold_scores))
best_model = models[best_fold_index]

print(f"Best fold: {best_fold_index}")
print(f"Best fold score: {fold_scores[best_fold_index]}")
print(f"Mean macro F1: {sum(fold_scores) / len(fold_scores)}")

0
Accuracy : 0.751150
Precision : 0.659006
Recall : 0.488715
AUC : 0.683299
 ** F1 : 0.693776 **
====confusion_matrix====
[[11840  1647]
 [ 3330  3183]]
1
Accuracy : 0.754450
Precision : 0.666944
Recall : 0.491557
AUC : 0.686495
 ** F1 : 0.697385 **
====confusion_matrix====
[[11887  1599]
 [ 3312  3202]]
2
Accuracy : 0.751550
Precision : 0.658918
Recall : 0.491710
AUC : 0.684384
 ** F1 : 0.694788 **
====confusion_matrix====
[[11828  1658]
 [ 3311  3203]]
3
Accuracy : 0.748950
Precision : 0.652503
Recall : 0.490329
AUC : 0.682099
 ** F1 : 0.692147 **
====confusion_matrix====
[[11785  1701]
 [ 3320  3194]]
4
Accuracy : 0.753350
Precision : 0.660834
Recall : 0.498618
AUC : 0.687504
 ** F1 : 0.697861 **
====confusion_matrix====
[[11819  1667]
 [ 3266  3248]]
Best fold: 4
Best fold score: 0.6978605479677173
Mean macro F1: 0.6951912804191134
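
The AUC values printed above are computed from hard 0/1 predictions. In that case the ROC curve has a single interior point, and the area reduces to the mean of the recall on each class. For fold 0, (0.488715 + 11840/13487) / 2 ≈ 0.6833, which agrees with the printed value. A ranking-based AUC needs predicted probabilities instead. The sketch below contrasts the two on synthetic data; `LogisticRegression` and the `*_demo` arrays are stand-ins, not the notebook's model or data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data (assumption: not the real loan data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=500) > 0).astype(int)

# Simple holdout split: first 400 rows for training, last 100 for validation.
clf = LogisticRegression().fit(X_demo[:400], y_demo[:400])
X_val_demo, y_val_demo = X_demo[400:], y_demo[400:]

# AUC from hard labels (what get_clf_eval reports) vs. AUC from probabilities.
auc_from_labels = roc_auc_score(y_val_demo, clf.predict(X_val_demo))
auc_from_proba = roc_auc_score(y_val_demo, clf.predict_proba(X_val_demo)[:, 1])
print(auc_from_labels, auc_from_proba)
```

With the notebook's model, the probability-based variant would presumably use `model.predict_proba(X_val)[:, 1]` as the second argument to `roc_auc_score`.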


In [14]:
# Drop the first column of the test set (assumed here to be an ID column)
X_test = test.iloc[:, 1:]
# Reuse the scaler fitted on the training features; do not refit it on test data
transformed_X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
print(transformed_X_test.shape)

(35816, 75)
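
The scaler was fitted on the training features, so the test columns should have the same names in the same order. A small guard such as the one below can make a mismatch fail early with a readable message. The helper name `check_same_columns` and the demo frames are hypothetical and not part of the original notebook.

```python
import pandas as pd


def check_same_columns(train_features: pd.DataFrame, test_features: pd.DataFrame) -> None:
    """Raise ValueError if the two frames differ in column names or column order."""
    train_cols = list(train_features.columns)
    test_cols = list(test_features.columns)
    if train_cols == test_cols:
        return
    missing = sorted(set(train_cols) - set(test_cols))
    unexpected = sorted(set(test_cols) - set(train_cols))
    # If nothing is missing or unexpected, only the order differs.
    order_only = not missing and not unexpected
    raise ValueError(
        f"Column mismatch: missing={missing}, unexpected={unexpected}, "
        f"order_only={order_only}"
    )


# Tiny made-up frames for illustration.
demo_train = pd.DataFrame({"int_rate": [0.1], "dti": [12.0]})
demo_test = pd.DataFrame({"int_rate": [0.2], "dti": [8.0]})
check_same_columns(demo_train, demo_test)  # identical columns, so no error
```

In this notebook the call would be along the lines of `check_same_columns(train.iloc[:, :-1], X_test)`, placed before `scaler.transform(X_test)`.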


In [15]:
test_pred = best_model.predict(transformed_X_test)

submission = pd.read_csv('../data/sample_submission.csv')
submission['answer'] = test_pred
submission.to_csv('submission.csv', index=False)
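
The submission above uses only the single best-scoring fold model. A different technique, shown here purely as a sketch and not as the notebook's method, is to average the predicted probabilities of all fold models and threshold the mean. The example runs on synthetic data with `LogisticRegression` as a stand-in estimator; `fold_models`, `X_new`, and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption: not the real loan data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)
X_new = rng.normal(size=(10, 4))  # plays the role of the test set

# Fit one model per fold, each on that fold's training part.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_models = []
for train_idx, _ in skf.split(X_demo, y_demo):
    fold_models.append(LogisticRegression().fit(X_demo[train_idx], y_demo[train_idx]))

# Average the positive-class probability over all fold models, then threshold.
mean_proba = np.mean([m.predict_proba(X_new)[:, 1] for m in fold_models], axis=0)
ensemble_pred = (mean_proba >= 0.5).astype(int)
print(ensemble_pred)
```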