<a href="https://colab.research.google.com/github/Jake-LJH/default-prediction/blob/master/cc_model_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Set Information

This research examined customers' default payments in Taiwan and compared the accuracy of the predicted probability of default across six data mining methods. From a risk management perspective, an accurate estimate of the probability of default is more valuable than a binary classification of clients as credible or not credible. Because the real probability of default is unknown, the study introduced the "Sorting Smoothing Method" to estimate it.

With the real probability of default as the response variable (Y), and the predictive probability of default as the independent variable (X), the simple linear regression result (Y = A + BX) shows that the forecasting model produced by artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero, and regression coefficient (B) to one. Therefore, among the six data mining techniques, artificial neural network is the only one that can accurately estimate the real probability of default.
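
The regression check described above can be sketched as follows. This is a minimal illustration on synthetic numbers, not the study's data: `predicted` and `actual` are hypothetical arrays standing in for a model's predicted probabilities (X) and the smoothed "real" probabilities (Y).

```python
import numpy as np

# Synthetic stand-ins (assumption): predicted probabilities X and "real"
# probabilities Y from a roughly well-calibrated model plus small noise.
rng = np.random.default_rng(0)
predicted = rng.uniform(0.0, 1.0, size=500)
actual = np.clip(predicted + rng.normal(0.0, 0.02, size=500), 0.0, 1.0)

# Fit Y = A + B*X by ordinary least squares (polyfit returns slope first).
B, A = np.polyfit(predicted, actual, deg=1)

# Coefficient of determination: R^2 = 1 - SS_res / SS_tot.
residuals = actual - (A + B * predicted)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"A = {A:.3f}, B = {B:.3f}, R^2 = {r_squared:.3f}")
```

A model whose predicted probabilities track the real ones closely gives A near zero, B near one, and a high R^2, which is the criterion the study applies.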

Link: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#


## Attribute Information

This research employed a binary variable, default payment (Yes = <code>1</code>, No = <code>0</code>), as the response variable. 

There are 25 variables:

* ID: ID of each client
* LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
* SEX: Gender (1=male, 2=female)
* EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
* MARRIAGE: Marital status (1=married, 2=single, 3=others)
* AGE: Age in years
* PAY_0: Repayment status in September, 2005, on the following scale:
  * -2 = Balance paid in full and no transactions this period (we may refer to this credit card account as having been 'inactive' this period)
  * -1 = Balance paid in full, but account has a positive balance at end of period due to recent transactions for which payment has not yet come due
  * 0 = Customer paid the minimum due amount, but not the entire balance; that is, the customer paid enough for the account to remain in good standing, but did revolve a balance
  * 1 to 9 = Payment delayed by that many months (1 = one month, 2 = two months, ..., 8 = eight months, 9 = nine months and above)

* PAY_2: Repayment status in August, 2005 (scale same as above)
* PAY_3: Repayment status in July, 2005 (scale same as above)
* PAY_4: Repayment status in June, 2005 (scale same as above)
* PAY_5: Repayment status in May, 2005 (scale same as above)
* PAY_6: Repayment status in April, 2005 (scale same as above)
* BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
* BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
* BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
* BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
* BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
* BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
* PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
* PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
* PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
* PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
* PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
* PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
* default payment next month: Default payment (1=yes, 0=no)
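
As a small illustration of the coding scheme listed above, the sketch below maps the integer codes to labels with pandas. The `sample` frame and the `MONTHS_LATE` column are hypothetical and exist only for this example; they are not part of the dataset.

```python
import pandas as pd

# Hypothetical sample rows using the codes documented above.
sample = pd.DataFrame({
    "SEX": [1, 2, 2],
    "EDUCATION": [1, 5, 6],
    "MARRIAGE": [1, 2, 3],
    "PAY_0": [-2, 0, 3],
})

sex_labels = {1: "male", 2: "female"}
education_labels = {1: "graduate school", 2: "university", 3: "high school",
                    4: "others", 5: "unknown", 6: "unknown"}
marriage_labels = {1: "married", 2: "single", 3: "others"}

decoded = sample.assign(
    SEX=sample["SEX"].map(sex_labels),
    EDUCATION=sample["EDUCATION"].map(education_labels),
    MARRIAGE=sample["MARRIAGE"].map(marriage_labels),
    # Positive PAY_* values are months of delay; -2, -1 and 0 mean no delay.
    MONTHS_LATE=sample["PAY_0"].clip(lower=0),
)
print(decoded)
```

Note that EDUCATION codes 5 and 6 both decode to "unknown", so they could reasonably be merged into a single category before modelling.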

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

# Metrics for validation and evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

# ROC curve: plot_roc_curve was removed in scikit-learn 1.2, so alias its
# replacement (available since scikit-learn 1.0) under the old name
from sklearn.metrics import RocCurveDisplay
plot_roc_curve = RocCurveDisplay.from_estimator

# Data Exploration

In [2]:
dataset = pd.read_excel('/content/drive/MyDrive/credit card default data/default of credit card clients.xls',skiprows=1,index_col=0)
dataset.head()

In [None]:
print("Shape of dataset: " + str(dataset.shape))
print('*' * 40)
dataset.info()  # info() prints its summary itself and returns None

In [None]:
dataset.describe()

In [None]:
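# Assumption: the ratio column (not created in the cells shown) is BILL_AMT1 / LIMIT_BAL
bill_to_limit = dataset['BILL_AMT1'] / dataset['LIMIT_BAL']
bill_to_limit.plot(kind='box')
plt.title('Ratio of Bill Amount in September vs Credit Limit')
plt.show()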


# Data Engineering

In [None]:
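# df2 is not defined in the cells shown; assuming it is a working copy of dataset
df2 = dataset.copy()
X = df2.drop('default payment next month', axis=1)
y = df2['default payment next month']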

In [None]:
X_train, x_test, Y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=42)
X_train.head()

In [None]:
# Note: SEX, EDUCATION, MARRIAGE and PAY_* are coded categories but are scaled as numeric here
numeric_cols = ['LIMIT_BAL', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
                'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'SEX', 'EDUCATION', 'MARRIAGE',
                'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

print(numeric_cols)

In [None]:
numeric_transformers = Pipeline(steps=[
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformers, numeric_cols)
    ]
)
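
The preprocessor above scales every column, including coded categories such as SEX and MARRIAGE. One possible alternative, shown here only as a sketch on a tiny hypothetical frame (`demo`, `continuous_cols`, `categorical_cols` and `demo_preprocessor` are illustrative names, not part of the notebook), is to scale the continuous columns and one-hot encode the coded ones.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical frame with two continuous and two coded columns.
demo = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000, 50000],
    "AGE": [24, 26, 34, 37],
    "SEX": [2, 2, 1, 1],
    "MARRIAGE": [1, 2, 2, 3],
})

continuous_cols = ["LIMIT_BAL", "AGE"]
categorical_cols = ["SEX", "MARRIAGE"]

demo_preprocessor = ColumnTransformer(transformers=[
    # Standardize continuous columns to zero mean and unit variance.
    ("num", Pipeline(steps=[("scaler", StandardScaler())]), continuous_cols),
    # handle_unknown="ignore" encodes categories unseen during fit as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

transformed = demo_preprocessor.fit_transform(demo)
print(transformed.shape)
```

Whether this helps depends on the classifier: tree-based models are largely insensitive to the encoding, while linear models and k-nearest neighbours treat scaled category codes as ordered distances.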

# Handling Class Imbalance using undersampling

In [None]:
# visualize the target variable (recent seaborn versions require the x= keyword)
g = sns.countplot(x=df2['default payment next month'])
g.set_xticks([0, 1])
g.set_xticklabels(['Not Default', 'Default'])
plt.show()

In [None]:
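# class counts, looked up by label rather than relying on value_counts() ordering
class_counts = df2['default payment next month'].value_counts()
class_count_0, class_count_1 = class_counts.loc[0], class_counts.loc[1]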

# Separate class
class_0 = df2[df2['default payment next month'] == 0]
class_1 = df2[df2['default payment next month'] == 1]

# print the shape of the class
print('class 0:', class_0.shape)
print('class 1:', class_1.shape)

In [None]:
# Undersample the non-default class to the size of the default class
# (random_state fixed so the sample is reproducible)
class_0_under = class_0.sample(class_count_1, random_state=42)

# Concatenate the equalized default and non-default data
test_under = pd.concat([class_0_under, class_1], axis=0)

print("Class counts after undersampling:")
print(test_under['default payment next month'].value_counts())

# Plot the counts after undersampling
test_under['default payment next month'].value_counts().plot(kind='bar', title='count (target)')
plt.show()

In [None]:
X_undersample = test_under.drop('default payment next month', axis=1)
y_undersample = test_under['default payment next month']


In [None]:
from sklearn.model_selection import train_test_split
X_train_undersample, x_test_undersample, Y_train_undersample, y_test_undersample = train_test_split(X_undersample, y_undersample, test_size = 0.20, random_state=42)
X_train_undersample.head()
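
Because the undersampled frame is drawn from all of `df2`, rows in the earlier `x_test` split can also end up in `X_train_undersample`. One way to avoid that overlap, sketched here on synthetic data under the assumption that a stratified hold-out is acceptable, is to split first and undersample only the training portion. All names in the block (`toy`, `train_df`, `test_df`, `train_balanced`) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (assumption): the
# positive rate of about 22% is an arbitrary choice for illustration.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "target": (rng.uniform(size=1000) < 0.22).astype(int),
})

# 1) Hold out a test set first, keeping the original class ratio.
train_df, test_df = train_test_split(
    toy, test_size=0.20, stratify=toy["target"], random_state=42
)

# 2) Undersample the majority class in the training portion only.
train_pos = train_df[train_df["target"] == 1]
train_neg = train_df[train_df["target"] == 0].sample(len(train_pos), random_state=42)
train_balanced = pd.concat([train_neg, train_pos], axis=0)

print(train_balanced["target"].value_counts())
shared = train_balanced.index.intersection(test_df.index)
print("rows shared with the test set:", len(shared))
```

With this ordering the test set keeps the real class imbalance and shares no rows with the balanced training data, so scores on it are not inflated by leakage.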

In [None]:
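# random_state is fixed on the stochastic models so results are reproducible
classifiers = [
    GaussianNB(),
    KNeighborsClassifier(),
    LinearSVC(random_state=1),
    LogisticRegression(random_state=1),
    RandomForestClassifier(random_state=1),
    DecisionTreeClassifier(random_state=1),
    XGBClassifier(random_state=1),
    BernoulliNB(),
]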

In [None]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

cols = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC']
params = []
scores = []
for clf in classifiers:
  name = clf.__class__.__name__
  pipeline = Pipeline(
      steps=[
          ('preprocessor', preprocessor),
          ('classifier', clf)
      ]
  )
  # Fit the model
  pipeline.fit(X_train_undersample, Y_train_undersample)

  # Mean accuracy of the classifier on the held-out undersampled data
  score = pipeline.score(x_test_undersample, y_test_undersample)
  print("%s score : %.3f" % (name, score))

  y_pred_undersample = pipeline.predict(x_test_undersample)

  # ROC AUC needs a continuous score: use probabilities where available,
  # otherwise fall back to decision_function (e.g. LinearSVC)
  if hasattr(pipeline, 'predict_proba'):
    y_score = pipeline.predict_proba(x_test_undersample)[:, 1]
  else:
    y_score = pipeline.decision_function(x_test_undersample)

  roc = roc_auc_score(y_test_undersample, y_score)
  acc = accuracy_score(y_test_undersample, y_pred_undersample)
  prec = precision_score(y_test_undersample, y_pred_undersample)
  rec = recall_score(y_test_undersample, y_pred_undersample)
  f1 = f1_score(y_test_undersample, y_pred_undersample)

  score = [name, acc, prec, rec, f1, roc]
  scores.append(score)
  params.append([pipeline, score, x_test_undersample, y_test_undersample, name])

# Build the summary table once, after all classifiers have been evaluated
scores_df = pd.DataFrame(scores, columns=cols)



In [None]:
#Scores
print(scores_df)

In [None]:
#Choose the best model and create a pipeline
rf_clf = RandomForestClassifier()
final_pipeline = Pipeline(
    steps = [
             ('preprocessor', preprocessor),
             ('classifier', rf_clf)
    ]
)
final_pipeline

In [None]:
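rf_model = final_pipeline.fit(X_train_undersample, Y_train_undersample)
y_pred_undersample = rf_model.predict(x_test_undersample)

# Confusion matrix on the undersampled test set
cm = confusion_matrix(y_test_undersample, y_pred_undersample)
sns.heatmap(cm, annot=True, cmap="Blues", fmt=".0f")
plt.show()

# ROC curve on the same undersampled test set (the full test set is evaluated in a later cell)
roc = plot_roc_curve(rf_model, x_test_undersample, y_test_undersample)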

In [None]:
report = classification_report(y_test_undersample, y_pred_undersample)
print("Report : \n{}".format(report))

In [None]:

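# Evaluate on the full (imbalanced) test set. Caution: x_test comes from a split of
# df2 made before undersampling, so it can share rows with X_train_undersample.
y_pred = rf_model.predict(x_test)

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, cmap="Blues", fmt=".0f")
plt.show()

roc = plot_roc_curve(rf_model, x_test, y_test)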

In [None]:
report = classification_report(y_test, y_pred)
print("Report : \n{}".format(report))
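
For readers who want to see how the numbers in the classification report relate to the confusion matrix, here is a minimal worked example on hand-made labels. The arrays `y_true` and `y_hat` are hypothetical and unrelated to the credit card data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = default, 0 = no default.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 1])

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted defaults that were real
recall = tp / (tp + fn)     # share of real defaults that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In a default-prediction setting, recall on the default class is often the figure to watch, since a missed default usually costs more than a false alarm.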

# Feature Ranking

In [None]:
def plot_most_important_features(feat_imp, method='MDI', 
                                 n_features=10, bottom=False):
    '''
    Function for plotting the top/bottom x features in terms of their importance.
    
    Parameters
    ----------
    feat_imp : pd.Series
        A pd.Series with calculated feature importances
    method : str
        A string representing the method of calculating the importances.
        Used for the title of the plot.
    n_features : int
        Number of top/bottom features to plot
    bottom : boolean
        Indicates if the plot should contain the bottom feature importances.
    
    Returns
    -------
    ax : matplotlib.axes._subplots.AxesSubplot
        Ax containing the plot
    '''
    
    if bottom:
        indicator = 'Bottom'
        feat_imp = feat_imp.sort_values(ascending=True)
    else:
        indicator = 'Top'
        feat_imp = feat_imp.sort_values(ascending=False)
        
    ax = feat_imp.head(n_features).plot.barh()
    ax.invert_yaxis()
    ax.set(title=('Feature importance - '
                  f'{method} ({indicator} {n_features})'), 
           xlabel='Importance', 
           ylabel='Feature')
    
    return ax

In [None]:
feat_names = list(numeric_cols)
# best_model is not defined in the cells shown; using the fitted rf_model pipeline
rf_classifier = rf_model.named_steps['classifier']
rf_feat_imp = pd.DataFrame(rf_classifier.feature_importances_,
                           index=feat_names,
                           columns=['mdi'])
rf_feat_imp = rf_feat_imp.sort_values('mdi', ascending=False)
rf_feat_imp['cumul_importance_mdi'] = np.cumsum(rf_feat_imp.mdi)

plot_most_important_features(rf_feat_imp.mdi, 
                             method='MDI')

plt.tight_layout()
plt.show()
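
MDI importances are computed on the training data and can favour high-cardinality features, so it may be worth cross-checking them against permutation importance on held-out data. The sketch below does this on synthetic data. It is a swapped-in comparison rather than something the notebook itself computes, and every name in it (`X_demo`, `forest`, `comparison`, and so on) is hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 6 features, the first 3 of which are informative.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
demo_names = [f"f{i}" for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# MDI: impurity-based importances stored on the fitted forest.
mdi = pd.Series(forest.feature_importances_, index=demo_names)

# Permutation importance: mean drop in test score when a feature is shuffled.
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
perm_imp = pd.Series(perm.importances_mean, index=demo_names)

comparison = pd.DataFrame({"mdi": mdi, "permutation": perm_imp})
print(comparison.sort_values("permutation", ascending=False))
```

The resulting `perm_imp` series could be passed to `plot_most_important_features` with `method='Permutation'` to produce a plot comparable to the MDI one.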