# Credit Card Default Prediction Project

Based on the *default of credit card clients* dataset from the UCI Machine Learning Repository.

The original paper working with this dataset is: Yeh, I. C., & Lien, C. H. (2009). The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 36(2), 2473-2480.

* __[Link to original paper](https://bradzzz.gitbooks.io/ga-seattle-dsi/content/dsi/dsi_05_classification_databases/2.1-lesson/assets/datasets/DefaultCreditCardClients_yeh_2009.pdf)__

* __[Link to UCI dataset page](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)__

### Dataset Description
* The data consists of 30 000 data points, each with 23 features and 1 label


### Project Outline
Data preparation and exploration -> hyperparameter tuning of ML models -> combination into a final model

## Import : Data and Libraries
### Library Imports

In [None]:
# Imports
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
import copy
import pickle

from matplotlib.widgets import Slider, Button, RadioButtons

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score, roc_auc_score, r2_score
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Optimizer
from hyperopt import Trials, fmin, tpe

from evaluation_helper import performance_metrics


# Model hyperparameter space to be optimized
from hyperparameters_spaces import ada_loss_functions, ada_space, log_space, svm_space, svm_kernels, \
    svm_kernel_degrees, xgb_space, INT_KEYS

# Model objective function builder
from hyperopt_objective import build_objective_func

sns.set_style("dark")
sns.set_context("paper")


### Import and pre-processing of dataset
(Pre-processing: transforming the data into a format that ML models can read)

#### Data Importing

In [None]:
# load data internally
_df_train = pd.read_csv("DataFiles/CreditCard_train.csv", index_col=0, header=1)
_df_test = pd.read_csv("DataFiles/CreditCard_test.csv", index_col=0, header=1)

# create external df for handling
df_train = _df_train.copy()
df_test = _df_test.copy()


#### Train data head and description

In [None]:
df_train.head()

In [None]:
df_train.describe()


### Pandas DataFrame processing

In [None]:
# renaming columns for consistency and simplicity
df_train = df_train.rename(columns={'PAY_0':'PAY_1', 'default payment next month':'DEFAULT'})
df_test = df_test.rename(columns={'PAY_0':'PAY_1', 'default payment next month':'DEFAULT'})

label = df_train.columns[-1] # = `DEFAULT`
features = list(df_train.columns)[:-1]

Formatting X, y as `np.ndarray`

In [None]:

y_train = df_train[label].to_numpy()
X_train = df_train[features].to_numpy()

y_test = df_test[label].to_numpy()
X_test = df_test[features].to_numpy()


__Comment__ : All the data types are integers and thus workable for ML models. There are no null values (checked via `describe()`: all features have the same count). The values of `SEX` and `EDUCATION` have a specified range; however, some observed values are not contained in it.


## Data Pipeline
* Includes scaling and sampling (future work: feature transformation)


The dataset is scaled for the computational efficiency of ML model operations (fitting, prediction).

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Exporting training data

### Benchmarking some standard ML models

Checking the ML models
* AdaBoost regressor, logistic regression and support vector machines


## Hyperparameter tuning of ML models

The data is saved in a pickle file and opened again in the model objectives. (It is not yet clear whether it can instead be passed as input to the objective of each model; left for version 2.)


The hyperparameter tuning framework consists of a tuner (hyperopt), an optimization space (model dependent), and an objective function (model dependent).
The spaces and the objective function builder are imported.
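How the three pieces interact can be illustrated without the imported modules. The sketch below is a dependency-free stand-in and not the notebook's actual code: it swaps hyperopt's TPE for plain random search, and `toy_space`, `toy_objective` and `random_search` are hypothetical names. It only shows how a tuner repeatedly samples from a space and keeps the candidate with the lowest loss.

```python
import random

# Hypothetical search space: each entry maps a hyperparameter name to a sampler.
toy_space = {
    "learning_rate": lambda rng: rng.uniform(0.01, 1.0),
    "n_estimators": lambda rng: rng.randint(10, 200),
}

def toy_objective(params: dict) -> float:
    """Stand-in loss; a real objective would fit a model and return e.g. 1 - validation score."""
    return (params["learning_rate"] - 0.3) ** 2 + (params["n_estimators"] - 100) ** 2 / 1e4

def random_search(objective, space: dict, max_evals: int, seed: int = 0) -> tuple:
    """Sample `max_evals` candidates and return the (params, loss) pair with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        candidate = {name: sample(rng) for name, sample in space.items()}
        loss = objective(candidate)
        if loss < best_loss:
            best_params, best_loss = candidate, loss
    return best_params, best_loss

best_params, best_loss = random_search(toy_objective, toy_space, max_evals=50)
print(best_params, best_loss)
```

With hyperopt, the role of `random_search` is played by `fmin`, and the space is built from hyperopt's own distribution objects instead of plain samplers.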

### ML models to be optimized



### Tuning

For tuning, we first split off part of the training data as a validation set.
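A hold-out split of this kind can be sketched as follows. The imported objective builder is not shown in this notebook, so this is only an assumption about what the `train_validation_split` method does; `holdout_split` and the toy arrays are hypothetical. scikit-learn's `train_test_split` offers the same functionality.

```python
import numpy as np

def holdout_split(X: np.ndarray, y: np.ndarray, validation_ratio: float = 0.25, random_state: int = 0):
    """Shuffle the row indices once and hold out a `validation_ratio` share for validation."""
    rng = np.random.default_rng(random_state)
    shuffled = rng.permutation(len(X))
    n_validation = int(round(len(X) * validation_ratio))
    validation_idx, train_idx = shuffled[:n_validation], shuffled[n_validation:]
    return X[train_idx], X[validation_idx], y[train_idx], y[validation_idx]

# Toy data: 8 rows, 2 features (illustrative values only)
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.array([0, 1, 0, 0, 1, 0, 1, 0])

X_fit, X_val, y_fit, y_val = holdout_split(X_demo, y_demo, validation_ratio=0.25)
print(X_fit.shape, X_val.shape)
```

Fixing `random_state` keeps the split identical across hyperparameter evaluations, so that scores of different candidates are comparable.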


In [None]:
def build_hyperopt_fitted_model(algorithmn, X_train : np.ndarray, y_train : np.ndarray, hyperparameters_space : dict,
                                hyperparameters_choices : dict, tuning_method : str = 'train_validation_split',
                                tuning_value : float = 0.25, tuning_measure : str = 'accuracy tuning',
                                max_evals : int = 1, random_state : int = 0, **fitting_setting):
    """
    Fits a ML model on the training data with hyperparameters tuned by Hyperopt using a selected cross validation
    method.

    :param algorithmn: ML algorithm for building model
    :param hyperparameters_space: possible hyperparameters values
    :param hyperparameters_choices: dictionary with 'key' as the choice and 'values' as options
    :param X_train: X training data
    :param y_train: y training data labels
    :param tuning_method: method for tuning hyperparameters, either 'train_validation_split' or 'KFold'
    :param tuning_value: tuning method value (validation ratio for a split, number of folds for 'KFold')
    :param tuning_measure: measure of score/performance on the validation set, either 'roc auc tuning' or 'accuracy tuning'
    :param max_evals: maximum number of evaluations for hyperparameters searching
    :param random_state: random_state
    :param fitting_setting: additional settings for fitting the model (e.g. sample weight)

    :return: fitted ML model
    """

    # create the trials object that records every evaluation
    trials = Trials()

    tuning_objective = build_objective_func(algorithmn=algorithmn, X_train=X_train, y_train=y_train,
                                            tuning_method=tuning_method, tuning_value=tuning_value,
                                            tuning_measure=tuning_measure, random_state=random_state, **fitting_setting)

    print(f'Starting hyperparameter search with {algorithmn}, {tuning_method}, {tuning_value}\n')
    model_best_hyperparams = fmin(fn = tuning_objective,
                            space = hyperparameters_space,
                            algo = tpe.suggest,
                            max_evals = max_evals,
                            trials = trials)

    # some hyperparameters must be integers (e.g. max_depth for decision trees),
    # but fmin returns them as floats, so they are cast back here
    if isinstance(INT_KEYS, list):
        for key in INT_KEYS:
            if key in model_best_hyperparams:
                model_best_hyperparams[key] = int(model_best_hyperparams[key])

    # when a hyperparameter is a choice, `model_best_hyperparams` holds the index of the selected option;
    # `hyperparameters_choices` maps each such hyperparameter name to its list of options
    if isinstance(hyperparameters_choices, dict):
        for choice, choice_option in hyperparameters_choices.items():
            if choice in model_best_hyperparams: # additional safety net
                model_best_hyperparams[choice] = choice_option[model_best_hyperparams[choice]]

    model = algorithmn(**model_best_hyperparams) # the algorithm instantiated with the best hyperparameters is the model
    model.fit(X_train, y_train, **fitting_setting)

    return model
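The two post-processing steps inside the function (casting integer hyperparameters and mapping choice indices back to options) can be shown in isolation. This is a minimal sketch with hypothetical names (`decode_best_hyperparameters`, `raw_best`) and made-up values; it mirrors the logic above but is not part of the imported modules.

```python
def decode_best_hyperparameters(best: dict, int_keys: list, choices: dict) -> dict:
    """Cast integer-valued keys back to int and replace choice indices by the chosen option."""
    decoded = dict(best)  # work on a copy, leave the search result untouched
    for key in int_keys:
        if key in decoded:
            decoded[key] = int(decoded[key])
    for key, options in (choices or {}).items():
        if key in decoded:
            decoded[key] = options[decoded[key]]
    return decoded

# Hypothetical search result: floats for numeric values, an index for the 'loss' choice
raw_best = {"n_estimators": 120.0, "learning_rate": 0.35, "loss": 2}

decoded = decode_best_hyperparameters(
    raw_best,
    int_keys=["n_estimators", "max_depth"],
    choices={"loss": ["linear", "square", "exponential"]},
)
print(decoded)  # {'n_estimators': 120, 'learning_rate': 0.35, 'loss': 'exponential'}
```

hyperopt also provides `space_eval(space, best)`, which resolves choice indices against the search space and may make the manual mapping unnecessary.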


## Performance at given percentages
### robustness

Quantifying the risk is more meaningful than simply classifying clients as expected to default or not expected to default, i.e. a probability of default is more useful than a binary label.

The real probability is unknown, but it can be estimated with the Smooth Sorting Method, which takes the mean of the observed labels over neighboring points.

__Smooth Sorting Method__ from the original paper (Yeh, I. C., & Lien, C. H. (2009)):

$$\text{P}_i = \frac{\sum_{j=-n}^{n}\text{Y}_{i-j}}{2n+1}$$

where $\text{P}_i$ is the estimated real probability of default, $\text{Y}_{i}$ is the binary variable of default (1) or non-default (0), and $n$ is the number of neighboring data points on each side used for smoothing.<br>
The Smooth Sorting Method is applied to sorted data, ordered from the lowest to the highest predicted probability of default.
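As a small worked example of the formula, the sketch below applies the smoothing to eight made-up points with $n = 1$, i.e. a window of $2n+1 = 3$ labels. The function name `smooth_sorted_probability` and the toy arrays are hypothetical; the moving average is computed with `np.convolve`, which is one of several equivalent ways to do it.

```python
import numpy as np

def smooth_sorted_probability(y_real: np.ndarray, y_predicted: np.ndarray, n: int) -> tuple:
    """Return (sorted predictions, SSM estimate) for the points that have a full 2n+1 window."""
    order = np.argsort(y_predicted)              # lowest to highest predicted probability
    y_sorted = y_real[order].astype(float)
    window = np.ones(2 * n + 1) / (2 * n + 1)    # equal weights over 2n+1 neighbours
    p_real = np.convolve(y_sorted, window, mode="valid")
    return y_predicted[order][n:len(y_real) - n], p_real

# Toy labels and predictions (illustrative values only)
y_real_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1])
y_pred_demo = np.array([0.1, 0.7, 0.3, 0.2, 0.9, 0.6, 0.4, 0.8])

pred_sorted, p_real = smooth_sorted_probability(y_real_demo, y_pred_demo, n=1)
print(pred_sorted)
print(p_real)
```

After sorting, the labels read 0, 0, 0, 0, 1, 1, 1, 1, so the smoothed estimate rises from 0 through 1/3 and 2/3 to 1 as the predicted probability increases. The first and last $n$ points are dropped because they lack a full window.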

This is interesting to look at because lenders adopt different risk strategies.
(Open question: for this we may consider the performance at 20% and 80%.)

In [None]:
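def SSM_real_probability(y_real : np.ndarray, y_predicted : np.ndarray, n : int, plot : bool = True):
    """
    Estimates the real probability of default with the Smooth Sorting Method (SSM) and returns the
    R^2 score of the predictions against that estimate.

    :param y_real: binary labels, default (1) or non-default (0)
    :param y_predicted: predicted probabilities of default
    :param n: number of neighbours on each side used for smoothing
    :param plot: whether to plot the SSM estimate against the predicted probability
    """
    # sort from lowest to highest predicted probability of default
    sorted_index = np.argsort(y_predicted)
    y_real_sorted = y_real[sorted_index]
    y_predicted_sorted = y_predicted[sorted_index]

    # mean over the 2n+1 labels centred on each point; the first and last n points lack a full window
    real_probability = np.array([np.mean(y_real_sorted[counter-n:counter+n+1])
                                 for counter in range(n, len(y_real)-n)])
    y_predicted_selected = y_predicted_sorted[n:len(y_real)-n]

    # score the predictions against the SSM estimate rather than against the raw binary labels
    r2 = r2_score(real_probability, y_predicted_selected)

    if plot:
        plt.plot(y_predicted_selected, real_probability)
        plt.xlim([0,1])
        plt.xlabel('Predicted probability')
        plt.ylabel('Real probability using SSM')
        plt.annotate(f'$R^2 = {r2:.3f}$', (0.05, 0.95))
        plt.grid(True, which='both')
        plt.show()

    print(f'r2 score : {r2}\n')
    return r2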


In [None]:
algorithmns_list = [AdaBoostRegressor,
                    LogisticRegression,
                    xgb.XGBRegressor,
                    SVC
                    ]

# imported
hyperparameter_space_list = [ada_space,
                             log_space,
                             xgb_space,
                             svm_space
                             ]

hyperparameter_choices_list = [{'loss' : ada_loss_functions},
                               None,
                               None,
                               {'kernel' : svm_kernels, 'degree' : svm_kernel_degrees}
                               ]

Dictionary ordering:
model_name (e.g. `ada`) -> tuning_method (`train_validation_split`, `train_validation_split_randomized`, `KFold`) -> tuning_value -> tuning_measure -> max_evals -> score (accuracy, roc auc, r2)
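The nesting can be made concrete with a small skeleton. The dictionary below is hypothetical: it follows the ordering stated above, uses `None` as a placeholder instead of real scores, and `lookup` is an illustrative helper rather than part of the notebook.

```python
# Hypothetical skeleton with the same nesting; scores are left as None placeholders
performances_demo = {
    "ada": {
        "train_validation_split": {
            0.25: {
                "accuracy tuning": {
                    30: {"test accuracy": None, "test roc auc": None, "test r2": None},
                },
            },
        },
        "KFold": {
            4: {
                "accuracy tuning": {
                    30: {"test accuracy": None, "test roc auc": None, "test r2": None},
                },
            },
        },
    },
}

def lookup(results: dict, *keys):
    """Walk the nested dictionary one key at a time."""
    node = results
    for key in keys:
        node = node[key]
    return node

entry = lookup(performances_demo, "ada", "KFold", 4, "accuracy tuning", 30)
print(sorted(entry))
```

Note that the tuning value is a fraction (e.g. `0.25`) under `train_validation_split` but a fold count (e.g. `4`) under `KFold`, so the same tuning value cannot be used as a key for both methods.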

In [None]:
# tuning_measures_list = ['accuracy tuning', 'roc auc tuning']
tuning_measures_list = ['accuracy tuning']
tuning_methods_dict = {'train_validation_split' : {}, 'KFold' : {}}

# model performance dictionary building
model_ref_names = ['ada', 'log', 'xgb', 'svm']
selected_models = [model_ref_names[0]]
MODELS_PERFORMANCES = {}
for model_name in selected_models:
    MODELS_PERFORMANCES[model_name] = copy.deepcopy(tuning_methods_dict)

print(MODELS_PERFORMANCES)

Set the variable `debug = True` for a short run (a single evaluation and a single split ratio); leave it at `False` for a long run.

In [None]:
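# initialization
ratio_range = range(20,30+1,5) # validation ratios in percent
KFold_range = [4]
random_state = 0
n = 50 # number of neighbours on each side for the Smooth Sorting Method

# a single setting of 30 evaluations; use range(25,35+1,10) to compare several
max_evals_range = range(30, 31, 10)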

debug = False
if debug:
    max_evals_range = range(1,1+1,1)
    ratio_range = range(20,20+1,5)

tuning_value_range_list = [ratio_range, KFold_range]


In [None]:
import time
start_time = time.time()

tuning_measure_performance = {}
performance_at_num_iterations = {}


# zip stops at the shortest input, so only the models registered in MODELS_PERFORMANCES are run
for algorithmn, hyperparameters_space, hyperparameters_choices, model_name in zip(algorithmns_list,
                                                                                  hyperparameter_space_list,
                                                                                  hyperparameter_choices_list,
                                                                                  MODELS_PERFORMANCES):
    for tuning_method_iterator, tuning_value_range in zip(tuning_methods_dict, tuning_value_range_list):
        for tuning_value_iterator in tuning_value_range:
            if tuning_method_iterator in ('train_validation_split', 'train_validation_split_randomized'):
                tuning_value_iterator = tuning_value_iterator/100 # percent -> fraction
            for max_evals_iterator in max_evals_range:
                for tuning_measure_iterator in tuning_measures_list:
                    model = build_hyperopt_fitted_model(algorithmn=algorithmn, X_train= X_train, y_train= y_train,
                                                        hyperparameters_space = hyperparameters_space,
                                                        hyperparameters_choices = hyperparameters_choices,
                                                        tuning_method = tuning_method_iterator,
                                                        tuning_measure = tuning_measure_iterator, tuning_value = tuning_value_iterator,
                                                        max_evals = max_evals_iterator,
                                                        random_state=random_state)

                    y_pred = model.predict(X_test)

                    performance = {'test accuracy': performance_metrics(y_test, y_pred>0.5, confusion_matrix=False),
                                   'test roc auc': roc_auc_score(y_test,y_pred, average = 'macro'),
                                   'test r2': SSM_real_probability(y_test,y_pred, n=n, plot=False)}

                    # stored as a plain dict so that scores can be looked up by name, e.g. [max_evals]['test accuracy']
                    performance_at_num_iterations[max_evals_iterator] = copy.deepcopy(performance)
                    tuning_measure_performance[tuning_measure_iterator] = copy.deepcopy(performance_at_num_iterations)

                    assert isinstance(tuning_measure_iterator, str)
                MODELS_PERFORMANCES[model_name][tuning_method_iterator][tuning_value_iterator] = copy.deepcopy(tuning_measure_performance)

# the search can be stopped after 20 evaluations and the parameters saved, which almost halves the run time;
# the hyperparameters of the log model also do not change beyond that point

end_time = time.time()
print(end_time-start_time)

In [None]:
MODELS_PERFORMANCES

In [None]:
with open('results2.pkl', 'wb') as f:
    pickle.dump(MODELS_PERFORMANCES, f)

print(MODELS_PERFORMANCES)

Model performance of ada : `MODELS_PERFORMANCES['ada']['train_validation_split']`

Plotting the performance on accuracy under different tuning strategies

In [None]:
def select_plot_data(tuning_method_list : list, tuning_method_value : float, max_evals : int = 25, performance_score : str = 'test accuracy', tuning_measures_list : list = tuning_measures_list) -> tuple:
    score_list = []
    for tuning_method_iterator in tuning_method_list:
        for tuning_measure_name in tuning_measures_list:
            score_list.append(MODELS_PERFORMANCES[model_name][tuning_method_iterator][tuning_method_value][tuning_measure_name][max_evals][performance_score])

    accuracy_tuned_scores = score_list[::2]
    roc_auc_tuned_scores = score_list[1::2]

    return accuracy_tuned_scores, roc_auc_tuned_scores



ratio_0 = 0.25
max_evals_0 = max_evals_range[0] # must be a value that was actually evaluated

tuning_method_name = 'train_validation_split'
tuning_measure_name = tuning_measures_list[0] # = 'accuracy tuning'

performance_scores = ['test accuracy' , 'test roc auc', 'test r2']
selected_performance_score = performance_scores[0]

print(MODELS_PERFORMANCES['ada'][tuning_method_name][0.25][tuning_measure_name])

selected_data = select_plot_data(tuning_method_list=list(tuning_methods_dict), tuning_method_value=ratio_0, max_evals=max_evals_0, performance_score=selected_performance_score)
accuracy_tuned_score_list, roc_auc_tuned_score_list = selected_data

fig, ax = plt.subplots()
plt.subplots_adjust(left=0.25, bottom=0.20)

xticks_variables = list(tuning_methods_dict)

# The x position of bars
bar_width = 0.25
r1 = np.arange(len(xticks_variables))
r2 = [x + bar_width for x in r1]
positions = [r1, r2]

bars1 = ax.bar(r1, accuracy_tuned_score_list, color ='b', width = bar_width, label='accuracy tuned')
bars2 = ax.bar(r2, roc_auc_tuned_score_list, color ='g', width = bar_width, label='roc auc tuned')

# bars2.remove()
bars_list = [bars1, bars2]

ax.legend()

plt.ylim([0.7, 0.84])
plt.xticks(r1+bar_width/2, xticks_variables)
plt.xlabel('Tuning method')
plt.ylabel(selected_performance_score)


axcolor = 'lightgoldenrodyellow'
ax_ratio = plt.axes([0.15, 0.2, 0.03, 0.675], facecolor=axcolor)
s_ratio = Slider(ax=ax_ratio, label='ratio', valmin=ratio_range[0]/100, valmax=ratio_range[-1]/100, valinit=ratio_0, valstep=ratio_range.step/100, orientation='vertical')

ax_max_evals = plt.axes([0.05, 0.2, 0.03, 0.675], facecolor=axcolor)
s_max_evals = Slider(ax=ax_max_evals, label='max_evals', valmin=max_evals_range[0], valmax=max_evals_range[-1], valinit=max_evals_0, valstep=max_evals_range.step, orientation='vertical')

def bar_plot(y_label, data1, data2):

    for bars in bars_list:
        bars.remove()

    bars_list[0] = ax.bar(r1, data1, color='b', width=bar_width, label='accuracy tuned')
    bars_list[1] = ax.bar(r2, data2, color='g', width=bar_width, label='roc auc tuned')

    plt.ylabel(y_label)


def update(val):
    ratio = round(s_ratio.val*100)/100
    max_evals = s_max_evals.val
    updated_data = select_plot_data(tuning_method_list=list(tuning_methods_dict), tuning_method_value=ratio,  max_evals=max_evals, performance_score=selected_performance_score)
    selected_accuracy_tuned_score_list, selected_roc_auc_tuned_score_list = updated_data
    bar_plot(selected_performance_score, selected_accuracy_tuned_score_list, selected_roc_auc_tuned_score_list)
    fig.canvas.draw_idle()

s_ratio.on_changed(update)
s_max_evals.on_changed(update)

plt.ioff()

with open('train_val.fig.pickle', 'wb') as f:
    pickle.dump(fig, f)
plt.show()






In [None]:
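from sklearn.metrics import RocCurveDisplay

show_roc_curve = True

if show_roc_curve:
    # `plot_roc_curve` was removed in scikit-learn 1.2; `RocCurveDisplay` (available since 1.0) replaces it.
    # `from_predictions` also works for the regressors used here, which have no `predict_proba`.
    RocCurveDisplay.from_predictions(y_test, model.predict(X_test))
    plt.show()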


The search selects the best hyperparameters among those evaluated (not necessarily the best in the whole space), i.e. the ones that perform best on the validation set after fitting on the training set.
In this limited analysis, these hyperparameters are taken to be the ones that generalize best.
Put differently, they are the hyperparameters we select to be tested.
Some ML models may show large fluctuations in performance on the validation set, which may indicate that the hyperparameters have 'overfitted' the validation set. This will show on the test set.
We can decide to train further on the validation set; however, it is interesting to see how the performance of the model changes depending on whether the validation set is used for training or not.

The models are:
`xgb_reg`, `ada_reg`, `gbrt_reg`, `log_reg`, and `svm_reg`

Now we can test these models on the test set, followed by checking the 'strength' of each model's predictions with the Smooth Sorting Method as proposed in the original paper by Yeh and Lien.
