Project Overview

This project addresses credit default risk: using historical loan application data, it predicts whether an applicant will be able to repay a loan.

To predict repayment, the project compares four algorithms (KNN, Logistic Regression, Decision Tree, and the widely used XGBoost) and finds that XGBoost gives the best predictive performance.

In [78]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))      # list all files under the input directory

# Exploring the Variables

In [31]:
credit_df = pd.read_csv('/kaggle/input/credit-risk-dataset/credit_risk_dataset.csv')

In [32]:
credit_df.shape

In [33]:
credit_df.info()    

**Variable descriptions**
* person_age:                  age
* person_income:               annual income
* person_home_ownership:       home ownership 
* person_emp_length:           employment length (in years)
* loan_intent:                 loan intent 
* loan_grade:                  loan grade 
* loan_amnt:                   loan amount 
* loan_int_rate:               interest rate
* loan_status:                 loan status (0: non-default, 1: default)
* loan_percent_income:         percent income
* cb_person_default_on_file:   historical default
* cb_person_cred_hist_length:  credit history length


In [34]:
credit_df['person_home_ownership'].value_counts()

In [35]:
credit_df['loan_intent'].value_counts()

In [36]:
credit_df['loan_grade'].value_counts()   # A ~ G

In [37]:
credit_df['cb_person_default_on_file'].value_counts()  # no/yes

In [38]:
credit_df['loan_status'].value_counts()  # 0 (non-default) or 1 (default)

In [39]:
credit_df.describe()  # check for outliers

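1. The maximum value of the person_age column looks like an outlier.
2. The maximum value of the person_emp_length column looks like an outlier.
3. The maximum value of the person_income column needs a closer look to decide whether it is an outlier.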


In [41]:
df_num = credit_df.select_dtypes(['float', 'int'])
df_hist = df_num.drop(['loan_status'], axis=1)

In [42]:
# Plot the distribution of each x variable, excluding the y variable (loan_status)

plt.figure(figsize=(12,16))
for i, column in enumerate(df_hist.columns):
    plt.subplot(4, 2, i + 1)  # 4x2 grid; avoids the three-digit shorthand, which breaks past 9 panels
    sns.histplot(df_hist[column], color='forestgreen', stat='density')
    sns.kdeplot(df_hist[column], color='indianred')
    plt.title(column +' distribution',fontsize=14)
    plt.ylabel('Density', fontsize=12)  # histplot uses stat='density', so the axis shows density
    plt.xlabel(column, fontsize=12)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)

plt.subplots_adjust(top=0.92, bottom=0.08, left=0.10, right=0.95, hspace=0.35, wspace=0.35)
plt.show()

In [43]:
credit_df['person_income'].sort_values(ascending=False)  

In [44]:
credit_df['person_emp_length'].sort_values(ascending=False) 

1. Outliers with person_age > 100 will be removed.
2. The outlier at the maximum of person_emp_length will be removed.
3. The outlier at the maximum of person_income will be removed.
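
The three removal rules above can be tried on a small made-up frame before touching the real data. This is only a sketch: the `toy` DataFrame and its values are invented for illustration, and the thresholds simply mirror the rules in the list (age over 100, the single largest income, the single largest employment length).

```python
import pandas as pd

# Hypothetical toy data, not taken from credit_risk_dataset.csv
toy = pd.DataFrame({
    "person_age": [22, 35, 144, 41, 29],
    "person_income": [40_000, 6_000_000, 55_000, 72_000, 38_000],
    "person_emp_length": [1.0, 5.0, 3.0, 123.0, 2.0],
})

# One boolean mask per rule, each computed on the unfiltered frame
keep_age = toy["person_age"] <= 100
keep_income = toy["person_income"] < toy["person_income"].max()
keep_emp = toy["person_emp_length"] < toy["person_emp_length"].max()

# Keep only the rows that pass all three rules
toy_clean = toy[keep_age & keep_income & keep_emp]
print(f"rows before: {len(toy)}, rows after: {len(toy_clean)}")
```

Combining the masks in one step evaluates every maximum on the original rows. Filtering one column at a time instead recomputes each maximum on the already-reduced frame, which can drop a different set of rows when one record is extreme in more than one column.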

In [45]:
nan_per = credit_df.isnull().sum() / credit_df.shape[0]*100  # percentage of missing values per column
nan_per.round(2)

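The person_emp_length and loan_int_rate columns contain missing values, accounting for 2.75% and 9.56% of the rows respectively.

To choose replacement values, check the mode and the mean of each column that has missing values.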

In [19]:
print('person_emp_length mode = {}'.format(credit_df['person_emp_length'].mode()[0]))
print('person_emp_length mean = {}'.format(credit_df['person_emp_length'].mean()))
print('loan_int_rate mode = {}'.format(credit_df['loan_int_rate'].mode()[0]))
print('loan_int_rate mean = {}'.format(credit_df['loan_int_rate'].mean()))

Missing values in person_emp_length (employment length) will be replaced with the mode, and missing values in loan_int_rate (interest rate) with the mean.
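
The difference between the two fill strategies is easiest to see on a handful of values. This is a minimal sketch on invented Series (`emp_length` and `int_rate` are hypothetical stand-ins, not the real columns):

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns, not taken from the dataset
emp_length = pd.Series([0.0, 0.0, 2.0, 5.0, np.nan, 0.0])
int_rate = pd.Series([10.0, 12.0, np.nan, 14.0])

# Mode imputation: fill with the most frequent value (the first one if tied)
emp_filled = emp_length.fillna(emp_length.mode()[0])

# Mean imputation: fill with the average of the non-missing values
rate_filled = int_rate.fillna(int_rate.mean())

print(emp_filled.tolist())
print(rate_filled.tolist())
```

Both `mode()` and `mean()` skip missing values by default, so the fill value is computed only from the observed entries.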

# Data Preprocessing

### Handling missing values

In [46]:
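# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which is deprecated in recent pandas and may not modify credit_df
credit_df['person_emp_length'] = credit_df['person_emp_length'].fillna(credit_df['person_emp_length'].mode()[0])
credit_df['loan_int_rate'] = credit_df['loan_int_rate'].fillna(credit_df['loan_int_rate'].mean())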

In [47]:
credit_df.isnull().sum()

### Handling outliers

In [48]:
credit_df = credit_df[credit_df['person_age']<=100]
credit_df = credit_df[credit_df['person_income'] < credit_df['person_income'].max()]
credit_df = credit_df[credit_df['person_emp_length'] < credit_df['person_emp_length'].max()]

In [49]:
credit_df.shape

In [50]:
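# numeric_only=True skips the string columns, which make corr() raise an error in pandas 2.x
corr = credit_df.corr(numeric_only=True)['loan_status'].sort_values(ascending=False)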
corr

loan_percent_income, loan_int_rate: positively correlated with the dependent variable loan_status; the larger the value, the higher the credit risk.

person_income: negatively correlated with loan_status; the larger the value, the lower the credit risk.
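
The sign of each coefficient is what carries this interpretation. The sketch below uses an invented six-row frame (the values are hypothetical, chosen only so that the signs are easy to see) to show a positive and a negative Pearson correlation with a binary target:

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    "loan_percent_income": [0.05, 0.10, 0.30, 0.45, 0.08, 0.50],
    "person_income": [90_000, 70_000, 30_000, 25_000, 80_000, 20_000],
    "loan_status": [0, 0, 1, 1, 0, 1],
})

# Pearson correlation of every feature with the binary target
corr = toy.corr()["loan_status"].drop("loan_status")
print(corr.sort_values(ascending=False))
```

A coefficient describes a linear association only. It does not show that a higher loan-to-income ratio causes default, and it can miss non-linear relationships.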

### Encoding the data

In [51]:
credit_df_encoded = pd.get_dummies(credit_df, drop_first=True)  # one_hot encoding
credit_df_encoded.shape

In [52]:
credit_df_encoded.info()
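
To see what `drop_first=True` does, it helps to encode a tiny frame. In this sketch the `toy` DataFrame is invented; only the column names are borrowed from the dataset:

```python
import pandas as pd

# Hypothetical toy data reusing two column names from the dataset
toy = pd.DataFrame({
    "loan_amnt": [1000, 5000, 3000],
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE"],
})

# Numeric columns pass through unchanged; each string column becomes
# one indicator column per category, minus the first category
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

The dropped category (the first in sorted order, here `MORTGAGE`) becomes the baseline: a row with all indicators off belongs to it. This removes the redundant column that would otherwise make the indicators perfectly collinear, which matters mostly for linear models such as logistic regression.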

### Preparing the train and test sets

In [53]:
X = credit_df_encoded.drop(['loan_status'], axis=1)
y = credit_df_encoded['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

### Building the models

In [54]:
# Define a helper that fits a model and prints its evaluation report before building the models
def model_assess(model, name='Default'):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name, '\n', classification_report(y_test, pred))

In [55]:
# KNN
knn = KNeighborsClassifier()
model_assess(knn, name='KNN')

# Logistic Regression
lg = LogisticRegression(random_state=42)
model_assess(lg, name = 'Logistic Regression')

# DecisionTree Classifier
dr = DecisionTreeClassifier(random_state=42)
model_assess(dr, name = 'DecisionTree Classifier')

# XGBoost
xgb = XGBClassifier(objective="binary:logistic", random_state=42)
model_assess(xgb, name = 'XGBClassifier')

In [66]:
# AUC
fig = plt.figure(figsize=(8,5))
plt.plot([0,1], [0,1], 'r--')

# KNN
knn_preds_proba = knn.predict_proba(X_test)
knn_default_proba = knn_preds_proba[:, 1]
fpr, tpr, threshold = roc_curve(y_test, knn_default_proba)
knn_score = roc_auc_score(y_test, knn_default_proba)
plt.plot(fpr, tpr, label=f'KNN, AUC = {str(round(knn_score, 3))}')

# Logistic Regression
lg_preds_proba = lg.predict_proba(X_test)
lg_default_proba = lg_preds_proba[:, 1]
fpr, tpr, threshold = roc_curve(y_test, lg_default_proba)
lg_score = roc_auc_score(y_test, lg_default_proba)
plt.plot(fpr, tpr, label=f'Logistic Regression, AUC = {str(round(lg_score,3))}')

# DecisionTree Classifier
dr_preds_proba = dr.predict_proba(X_test)
dr_default_proba = dr_preds_proba[:, 1]
fpr, tpr, threshold = roc_curve(y_test, dr_default_proba)
dr_score = roc_auc_score(y_test, dr_default_proba)
plt.plot(fpr, tpr, label=f'DecisionTree Classifier, AUC = {str(round(dr_score,3))}')

# XGBoost
xgb_preds_proba = xgb.predict_proba(X_test)
xgb_default_proba = xgb_preds_proba[:, 1]
fpr, tpr, threshold = roc_curve(y_test, xgb_default_proba)
xgb_score = roc_auc_score(y_test, xgb_default_proba)
plt.plot(fpr, tpr, label = f'XGBoost, AUC = {str(round(xgb_score, 3))}')
plt.ylabel("True Positive Rate", fontsize=12)
plt.xlabel("False Positive Rate", fontsize=12)
plt.title("ROC CURVE")
plt.legend()
plt.show()

In [76]:
feature_importance = pd.DataFrame({'feature': X_test.columns, 'importance': xgb.feature_importances_})
feature_importance_sorted = feature_importance.sort_values(['importance'], ascending=False)





In [77]:
sns.set(context='paper', style='ticks', font='sans-serif',
       font_scale=1.2, color_codes=True, rc=None)
plt.figure(figsize=(8,5))  # plt.figure, not plt.plot, sets the figure size
sns.barplot(data=feature_importance_sorted[:10], y='feature', 
           x = 'importance', palette='Blues_d')
plt.title('feature importance',fontsize = 14)
plt.xlabel('importance', fontsize=12)
plt.ylabel('feature', fontsize=12)
plt.show()


With an AUROC of 0.954, the XGBoost model performs best, ahead of the KNN, LogisticRegression, and DecisionTree classifiers.
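
As a reading aid for the AUROC figure, the score can be understood as the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. The sketch below checks this on invented labels and scores (`y_true` and `y_score` are hypothetical, not model output from this notebook):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = default) and predicted default probabilities
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every (positive, negative) pair; a tie counts as half a win
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(pairwise_auc, roc_auc_score(y_true, y_score))
```

Because the metric depends only on how the scores rank the samples, it is unaffected by the classification threshold, unlike the precision and recall values printed by `classification_report`.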