# Credit Card Default Prediction Project

Based on the "Default of Credit Card Clients" dataset from the UCI Machine Learning Repository.

The original paper working with this dataset is: Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.
<br>__[Link to original paper](https://bradzzz.gitbooks.io/ga-seattle-dsi/content/dsi/dsi_05_classification_databases/2.1-lesson/assets/datasets/DefaultCreditCardClients_yeh_2009.pdf)__
__[Link to UCI dataset page](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)__

### Dataset Description
* Data consists of 30 000 points and 24 attributes

### Project Outline
Data preparation and exploration -> Hyperparameter tuning of ML models -> Combination into a final model

## Import : Data and Libraries
### Library Imports

In [1]:
# Imports
%matplotlib inline
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost
import scipy.cluster.hierarchy as sch
sns.set_style("dark")
sns.set_context("paper")

from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from imblearn.pipeline import Pipeline
from sklearn import svm, metrics, preprocessing
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D


### Import and pre-processing of dataset 
(Preprocessing: transforming the data into a format that the ML models can read.)

In [2]:
# data imports

### EDIT FILEPATH IF NECESSARY
root = '.'
data_dir = '/DataFiles/'

# form filepaths
data_path = root + data_dir
train_file = data_path + 'CreditCard_train.csv'
test_file = data_path + 'CreditCard_test.csv'

# load
_df_train = pd.read_csv(train_file, index_col=0, header=1).rename(columns={'PAY_0':'PAY_1', 'default payment next month':'DEFAULT'})
_df_test = pd.read_csv(test_file, index_col=0, header=1).rename(columns={'PAY_0':'PAY_1', 'default payment next month':'DEFAULT'})

# create copy df for handling
df_train = _df_train.copy()
df_test = _df_test.copy()

### Data Checking

In [3]:
df_train.describe()

statistic,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_1,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,DEFAULT
count,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,...,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0
mean,165495.986667,1.62825,1.847417,1.55725,35.380458,-0.003125,-0.1235,-0.15475,-0.211667,-0.252917,...,42368.188417,40000.682542,38563.710625,5542.912917,5815.336,4969.266,4743.480042,4783.486042,5189.399042,0.22375
std,129128.744855,0.483282,0.780007,0.52208,9.27105,1.123425,1.20058,1.204033,1.166549,1.136993,...,63070.680934,60345.012766,59155.759799,15068.576072,20797.03,16095.61434,14883.26999,15270.405279,17630.37199,0.416765
min,10000.0,1.0,0.0,0.0,21.0,-2.0,-2.0,-2.0,-2.0,-2.0,...,-170000.0,-81334.0,-339603.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,50000.0,1.0,1.0,1.0,28.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,2340.0,1740.0,1234.75,1000.0,800.0,379.0,279.75,244.0,60.75,0.0
50%,140000.0,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,0.0,...,18940.5,18107.5,17036.0,2100.0,2000.0,1702.5,1500.0,1500.0,1500.0,0.0
75%,240000.0,2.0,2.0,2.0,41.0,0.0,0.0,0.0,0.0,0.0,...,52188.5,49746.5,48796.25,5000.0,5000.0,4347.25,4000.0,4005.0,4000.0,0.0
max,1000000.0,2.0,6.0,3.0,79.0,8.0,8.0,8.0,8.0,8.0,...,891586.0,927171.0,961664.0,505000.0,1684259.0,896040.0,497000.0,417990.0,528666.0,1.0


In [4]:
# columns were already renamed on load (PAY_0 -> PAY_1, target -> DEFAULT)
# every column except the last one is a feature; the last column is the label
features = list(df_train.columns)[:-1]
label = df_train.columns[-1]

y_train = df_train[label]
X_train = df_train[features]

y_test = df_test[label]
X_test = df_test[features]


__Comment__ : All features are integer-valued and can therefore be fed directly to the ML models. There are no null values (spot-checked; every column has the same count of 24,000). Values in the `SEX`, `EDUCATION` <br>
Further checks and exploration of the data are optional; see .DataExploration.

## Data Pipeline
* includes sampling (SMOTE) and scaling; feature transformation is left as future work

In [5]:
from imblearn.over_sampling import SMOTE

X_train_, y_train_ = SMOTE(random_state=3).fit_resample(X=X_train, y=y_train)
scaler = StandardScaler()
X_train_ = scaler.fit_transform(X_train_)
X_test = scaler.transform(X_test)

### Benchmarking some standard ML models

The following ML models are considered:
* XGBoost, AdaBoost, gradient boosting (GBRT), logistic regression and support vector machines


## Hyperparameter tuning of ML models

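The hyperparameter tuning framework consists of a tuner (hyperopt), an optimization space (model-dependent), and an objective function (model-dependent).
The spaces and objective functions are imported from the local modules `Models_spaces` and `Models_objectives`.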
### ML models to be optimized

In [6]:
from sklearn.model_selection import train_test_split
X_train_, X_validation, y_train_, y_validation = train_test_split(X_train_, y_train_, test_size = 0.25, random_state = 0)


In [7]:
import pickle

# the with-block closes the file automatically, so no explicit close is needed
with open('objs.pkl', 'wb') as f:
    pickle.dump([X_train_, y_train_, X_validation, y_validation], f)

In [12]:
# Optimizer
from hyperopt import Trials, fmin, tpe

# Model hyperparameter space
from Models_spaces import space_xgb, space_ada, space_gbrt, space_log, space_svm

# Model objective function
from Models_objectives import objective_xgb, objective_ada, objective_gbrt, objective_log, objective_svm

In [10]:
print(space_svm)




{'C': <hyperopt.pyll.base.Apply object at 0x7f9493292e20>, 'kernel': <hyperopt.pyll.base.Apply object at 0x7f9493292f10>, 'degree': <hyperopt.pyll.base.Apply object at 0x7f94932990a0>, 'seed': 0}


### Tuning

For tuning, the training data is first split into a training set and a validation set.



In [13]:
trials = Trials()

best_hyperparams = fmin(fn = objective_xgb,
                        space = space_xgb,
                        algo = tpe.suggest,
                        max_evals = 50,
                        trials = trials)


print("The best hyperparameters are : ","\n")
print(best_hyperparams)

  0%|          | 0/50 [00:00<?, ?trial/s, best loss=?]


job exception: 'float' object cannot be interpreted as an integer



TypeError: 'float' object cannot be interpreted as an integer
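
A `TypeError` of this kind is typical when a hyperparameter that must be an integer (a tree depth or a number of estimators, for example) reaches the model as a float; hyperopt's `hp.quniform`, for instance, returns floats. Since `Models_spaces` and `Models_objectives` are not shown here, this diagnosis is an assumption. A minimal sketch of a helper that casts selected parameters before the model is built, where the helper name and the parameter keys are hypothetical:

```python
def cast_integer_params(params, integer_keys):
    """Return a copy of params with the listed keys cast to int."""
    casted = dict(params)
    for key in integer_keys:
        if key in casted:
            # round first so that values such as 5.999999 become 6, not 5
            casted[key] = int(round(casted[key]))
    return casted

# hypothetical sample as a tuner might produce it
sampled = {"max_depth": 6.0, "n_estimators": 180.0, "gamma": 0.5}
clean = cast_integer_params(sampled, integer_keys=("max_depth", "n_estimators"))
print(clean)  # {'max_depth': 6, 'n_estimators': 180, 'gamma': 0.5}
```

Calling such a helper at the top of the objective function leaves the search space itself untouched.
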

In [14]:
trials = Trials()

best_hyperparams = fmin(fn = objective_ada,
                        space = space_ada,
                        algo = tpe.suggest,
                        max_evals = 50,
                        trials = trials)

print("The best hyperparameters are : ","\n")
print(best_hyperparams)


  0%|          | 0/50 [00:00<?, ?trial/s, best loss=?]


job exception: __init__() got an unexpected keyword argument 'algorithm'



TypeError: __init__() got an unexpected keyword argument 'algorithm'
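
This `TypeError` means that a sampled parameter named `algorithm` was passed to a constructor that does not accept it; which constructor that is cannot be told from the output alone. One hedged way to debug it is to compare the sampled keys against the parameters an estimator exposes through scikit-learn's `get_params()`. The helper name below is hypothetical, and `DecisionTreeClassifier` is used only as an example of an estimator without an `algorithm` parameter:

```python
from sklearn.tree import DecisionTreeClassifier

def split_supported_params(estimator_cls, params):
    """Split params into those the estimator accepts and those it does not."""
    accepted = set(estimator_cls().get_params().keys())
    supported = {k: v for k, v in params.items() if k in accepted}
    unsupported = sorted(k for k in params if k not in accepted)
    return supported, unsupported

# hypothetical sample containing a key the estimator does not know
sampled = {"max_depth": 3, "algorithm": "SAMME"}
supported, unsupported = split_supported_params(DecisionTreeClassifier, sampled)
print(supported, unsupported)
```
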

In [None]:
trials = Trials()

best_hyperparams = fmin(fn = objective_gbrt,
                        space = space_gbrt,
                        algo = tpe.suggest,
                        max_evals = 50,
                        trials = trials)

print("The best hyperparameters are : ","\n")
print(best_hyperparams)

In [None]:
trials = Trials()

best_hyperparams = fmin(fn = objective_log,
                        space = space_log,
                        algo = tpe.suggest,
                        max_evals = 50,
                        trials = trials)

print("The best hyperparameters are : ","\n")
print(best_hyperparams)

In [None]:
trials = Trials()

best_hyperparams = fmin(fn = objective_svm,
                        space = space_svm,
                        algo = tpe.suggest,
                        max_evals = 50,
                        trials = trials)

print("The best hyperparameters are : ","\n")
print(best_hyperparams)

In [None]:
X_test1 = method0.transform(X_test)
X_test1 = pd.DataFrame(X_test1, columns=features)

In [None]:
y_predicted = xgb_clf.predict(X_test1)

In [None]:
np.mean(y_predicted==y_test)

In [None]:
model = xgb_clf

In [None]:
xgboost.plot_importance(model)
plt.title("xgboost.plot_importance(model)")
plt.show()

In [None]:
xgboost.plot_importance(model, importance_type="cover")
plt.title('xgboost.plot_importance(model, importance_type="cover")')
plt.show()

In [None]:
xgboost.plot_importance(model, importance_type="gain")
plt.title('xgboost.plot_importance(model, importance_type="gain")')
plt.show()

## Performance at given percentages
### Robustness

Rather than simply classifying clients as expected to default or not, it is more meaningful to quantify the risk, i.e. to estimate a probability of default.

To estimate the real probability, the Smooth Sorting Method can be used: the data are sorted by predicted probability, and the real probability at each point is estimated as the mean of the actual outcomes of its neighboring points.

__Smooth Sorting Method__ from the original paper (Yeh, I. C., & Lien, C. H. (2009)): 

$$\text{P}_i = \frac{\sum_{j=-n}^{n}\text{Y}_{i-j}}{2n+1}$$

where $\text{P}_i$ is the estimated real probability of default, $\text{Y}_{i}$ is the binary variable of default (1) or non-default (0), and $n$ is the number of neighboring data points on each side used for smoothing.<br>
The Smooth Sorting Method is applied to data sorted from the lowest to the highest predicted probability of default.
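
The formula can be sketched in NumPy as a centred moving average over the sorted labels. This is an illustrative implementation under the assumption that only positions with a full window of $2n+1$ points are kept; the function name `smooth_sorting` and the toy arrays are hypothetical.

```python
import numpy as np

def smooth_sorting(y_true, y_score, n):
    """Estimate the real default probability with a centred window of 2n+1 points.

    Returns the sorted scores and the smoothed actual labels for the positions
    that have a full window (the first and last n positions are dropped).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    # sort the actual labels by increasing predicted score
    order = np.argsort(y_score, kind="stable")
    y_true_sorted = y_true[order]
    # a uniform kernel of length 2n+1 turns the convolution into a moving average
    window = np.ones(2 * n + 1) / (2 * n + 1)
    p_est = np.convolve(y_true_sorted, window, mode="valid")
    return y_score[order][n:len(y_score) - n], p_est

# toy example with n = 1, i.e. windows of three points
demo_scores, demo_p = smooth_sorting(
    y_true=[0, 0, 1, 0, 1, 1],
    y_score=[0.1, 0.2, 0.6, 0.3, 0.8, 0.9],
    n=1,
)
print(demo_scores)
print(demo_p)
```

Plotting the second returned array against the first shows how well the predicted scores track the estimated real probability.
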

This is interesting to look at because lenders adopt different risk strategies.

We have the lists `y_predicted` and `y_test`.

In [None]:
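# regress the 0/1 default label so that the predictions act as a default score
xgb_reg = xgboost.XGBRegressor(eta=0.3, gamma=0.5)
# X_test was standardized above, so X_train is passed through the same fitted scaler
xgb_reg.fit(scaler.transform(X_train), y_train)
y_predicted = xgb_reg.predict(X_test)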

In [None]:
y_predicted

In [None]:
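# order the test points from lowest to highest predicted default score
sorted_index = np.argsort(y_predicted)

# convert the label Series to a NumPy array so it can be indexed by position
y_test_sorted = y_test.to_numpy()[sorted_index]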

In [None]:
y_test_sorted

In [None]:
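y_avg = []
n = 200
# Smooth Sorting Method: average the 2n+1 actual labels centred on each position
for counter in range(n, len(y_test_sorted) - n):
    intermediate_val = np.mean(y_test_sorted[counter - n:counter + n + 1])
    y_avg.append(intermediate_val)

# sort the predictions first, then drop the n points at each end that lack a full window
y_predicted_sorted = np.sort(y_predicted)[n:len(y_predicted) - n]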

In [None]:
plt.plot(y_avg)
plt.show()

In [None]:
# x: sorted predicted scores (ends trimmed), y: estimated real probability
plt.plot(np.sort(y_predicted)[n:len(y_predicted) - n], y_avg)
plt.grid(True)
plt.ylim([0,1])
plt.show()

In [None]:
np.shape(y_predicted[n:len(y_predicted)-n])

In [None]:
sorted(y_predicted)

In [None]:
n

In [None]:
# sorted predictions aligned with y_avg (n points trimmed at each end)
y_predicted_selected = np.sort(y_predicted)[n:len(y_predicted) - n]

In [None]:
from sklearn.metrics import r2_score
print(r2_score(y_avg,y_predicted_selected))

In [None]:
len(y_predicted)-n

