# Predicting Credit Card Applications 

## Introduction

This notebook accompanies the talk I gave in February 2020 to Warwick Finance Society titled 'Python and a Piece of Paper'. The presentation (which provides background on this notebook) can be found in the talk's [GitHub repository](https://github.com/THargreaves/python-and-a-piece-of-paper). If you enjoyed the talk and/or appreciate this live notebook to try the code yourself, please do give the repository a star (GitHub's equivalent of a like).

You can run the code by selecting a cell of the notebook and either using `Ctrl-Enter` or clicking the play icon next to the cell. Each cell depends on the last, so make sure they are run in order. If you want to undo the effect of a cell, select it and use `Ctrl-F8` or select `Runtime > Run Before` in the top menu to run all cells up to (but not including) itself.


If there are any issues, feel free to reach out to me on [LinkedIn](https://www.linkedin.com/in/tim-hargreaves/). For more interesting Data Science projects, check out [my blog](https://www.ttested.com/).

**Enjoy!**

_Note: The code in this notebook is slightly different from that in the presentation. This is because I wrote the code and slides using my work laptop, which has outdated versions of certain packages (blame IT!), whereas Google Colab uses the latest versions. It was easier to change the code here than in my slides, so that's how it will have to be._

## Setup

Here we import various Python packages and modules to offer additional functionality to our code without any added effort.

In [0]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.model_selection import KFold, cross_val_score

from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential
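# note: this wrapper has been removed from newer TensorFlow releases; the
# separate scikeras package provides a similar KerasClassifier
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier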

## Import

We download the dataset from the UCI Machine Learning repository and import it into Python. We do a bit of minor housekeeping too, such as removing the ZIP code column (too much fuss to process) and splitting the predictors and labels.

The original dataset and some information about it can be found [on the UCI website](https://archive.ics.uci.edu/ml/datasets/credit+approval). The probable column names are taken from [this blog post](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html).

In [0]:
crx = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/' +
                  'credit-screening/crx.data', header=None)

# add probable column names
crx.columns = [
    'sex', 'age', 'debt', 'married', 'bank_customer', 'education', 'ethnicity', 
    'year_employed', 'prior_default', 'employed', 'credit_score',
    'drivers_license', 'citizen', 'zip_code', 'income', 'approved',
]

# remove unhelpful features
crx = crx.drop(['zip_code'], axis=1)

# extract label
lab = (crx['approved'] == '+').to_numpy()
crx = crx.drop(['approved'], axis=1)

You can have a look at the imported dataset by adding a code block, typing `crx` and clicking the play button.

## Cleaning

### Impute Missing Values

Most machine learning models can't handle missing values by themselves, so we have to deal with them beforehand. One approach is to throw out any rows or columns that contain missing values, but this wastes a lot of perfectly good data. A better approach is to replace missing numeric values with column means and missing string values with the most common category.

We also add new columns recording which predictors each person was missing data for. Perhaps some missing data is caused by the applicant trying to hide information that makes them look bad?

In [0]:
# missing values are saved as '?'
crx = crx.replace('?', np.nan)
# fix column datatypes
crx['age'] = crx['age'].astype('float')
# keep track of which columns are missing data
for name, values in crx.items():
    is_na = values.isna()
    if is_na.any():
        crx[f'{name}_is_na'] = is_na
# replace missing numeric values with column means
crx = crx.fillna(crx.mean(numeric_only=True))
# replace missing string values with column modes
for name, values in crx.items():
    if not pd.api.types.is_numeric_dtype(values):
        crx[name] = values.fillna(values.value_counts().index[0])

You can use `crx.iloc[[206, 248], :]` before and after running this code block to see the effect of imputation.
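If you'd like to see the idea on something smaller first, here is a minimal sketch of the same three steps (flag, fill with mean, fill with mode). The `toy` table and its values are made up for illustration and are not rows from the real dataset.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, not rows from the real dataset
toy = pd.DataFrame({
    'age': [20.0, np.nan, 40.0],
    'married': ['u', 'u', np.nan],
})

# flag which values were missing before filling them in
for name in list(toy.columns):
    is_na = toy[name].isna()
    if is_na.any():
        toy[f'{name}_is_na'] = is_na

# numeric column: fill with the column mean (missing values are skipped)
toy['age'] = toy['age'].fillna(toy['age'].mean())
# string column: fill with the most common category
toy['married'] = toy['married'].fillna(toy['married'].mode()[0])

print(toy)
```

The `_is_na` columns are created before any filling happens; otherwise there would be nothing left to flag.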

### Encode Non-Numeric Features

Machine learning algorithms prefer their input as numbers rather than text. We therefore change the representation of text columns using a technique called [one-hot encoding](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f).

In [0]:
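# split numeric and non-numeric features
numeric = crx.select_dtypes(include=['number', 'bool'])
string = crx.select_dtypes(exclude=['number', 'bool'])
# one-hot encode the text columns (the encoder returns a sparse matrix)
enc = OneHotEncoder()
string_enc = enc.fit_transform(string)
# combine everything into a single dense array of floats
crx_enc = np.hstack((numeric.to_numpy(dtype=float), string_enc.toarray()))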

The output of this code block is an object called `crx_enc`. You can print this in the same way you did for `crx` although the result will now just be a block of numbers without column names. Not as pretty, I know, but it makes Python happy.
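To get a feel for what one-hot encoding produces, here is a by-hand sketch using only NumPy. The `colours` column is invented for the example, and this is just an illustration of the idea rather than how `OneHotEncoder` works internally.

```python
import numpy as np

# hypothetical category column, not taken from the real dataset
colours = np.array(['red', 'green', 'red', 'blue'])

# sorted unique categories and, for each row, the index of its category
categories, codes = np.unique(colours, return_inverse=True)

# one row per observation, one column per category, a single 1 in each row
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(one_hot)
```

Each text value becomes a row of zeros with a single one marking its category, so a column with three distinct values turns into three numeric columns.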

### Normalise Inputs

We scale the values in each column so that they lie between zero and one. This prevents machine learning models from unfairly favouring predictors that have a large range.

In [0]:
scal = MinMaxScaler()
crx_enc = scal.fit_transform(crx_enc)

To see the impact of this change, use `plt.hist(crx_enc[:, 2])` before and after scaling.
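The scaling itself is just "subtract the column minimum, divide by the column range". A minimal sketch of that formula on a made-up matrix `x` (assuming no column is constant, which would mean dividing by zero here):

```python
import numpy as np

# hypothetical feature matrix: column 0 has a small range, column 1 a large one
x = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# per-column minimum and maximum
col_min = x.min(axis=0)
col_max = x.max(axis=0)

# rescale each column to the interval [0, 1]
x_scaled = (x - col_min) / (col_max - col_min)

print(x_scaled)
```

After scaling, both columns cover the same range even though the second started out a hundred times larger.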

## Build Model

We build a model by sequentially stacking layers of _neurons_. We then compile the model by telling it how it should 'learn'.

In [0]:
def build_model():
    model = Sequential([
        Dense(16, activation='relu', 
              input_shape=(crx_enc.shape[1],)),
        Dense(8, activation='relu'),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', 
                  metrics=['accuracy'])
    return model
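If 'stacking layers' feels a bit abstract, the sketch below passes a batch of data through three layers of the same sizes (16, 8, 1) using plain NumPy. The weights are random and untrained, and the input width `n_features` is a made-up value, so the outputs are meaningless as predictions; the point is only to show how each layer feeds the next.

```python
import numpy as np

def relu(z):
    # keep positive values, replace negatives with zero
    return np.maximum(z, 0.0)

def sigmoid(z):
    # squash any number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Pass a batch through (weights, bias, activation) layers in order."""
    out = x
    for weights, bias, activation in layers:
        out = activation(out @ weights + bias)
    return out

rng = np.random.default_rng(0)
n_features = 5  # hypothetical input width
sizes = [n_features, 16, 8, 1]
activations = [relu, relu, sigmoid]

# random, untrained weights and zero biases for each layer
layers = [
    (rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out), act)
    for n_in, n_out, act in zip(sizes[:-1], sizes[1:], activations)
]

batch = rng.random((4, n_features))  # four made-up applicants
probs = forward(batch, layers)       # one value between 0 and 1 per applicant
n_params = sum(w.size + b.size for w, b, _ in layers)

print(probs.shape, n_params)
```

The final sigmoid is what lets us read the output as a probability of approval; fitting the model means adjusting the weights so those probabilities match the labels.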

## Train Model

### Fit Model

We _fit_ the model to our data. This is essentially the time in which we teach the model. Don't worry if this block takes a few minutes to run—learning takes time!

In [0]:
nn = KerasClassifier(build_fn=build_model, epochs=10, batch_size=32, verbose=0)
kfold = KFold(n_splits=30)
results = cross_val_score(nn, crx_enc, lab, cv=kfold)

If the code is taking a _really_ long time to run (10 minutes or more) then it may be the case that the Python kernel has died. To fix this, use `Ctrl+M .` to restart the runtime. You will then have to run the code blocks from the start.
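The fitting cell uses 30-fold cross-validation: the data is cut into 30 chunks, and each chunk takes a turn as the test set while the model trains on the rest. Here is a rough sketch of how such a split can be produced. The function `kfold_indices` is my own illustrative helper, not part of scikit-learn, and it assumes consecutive, unshuffled folds with any leftover rows going to the earliest folds.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # the first `extra` folds get one additional row each
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# a tiny example: 10 rows split into 3 folds
folds = list(kfold_indices(n_samples=10, n_splits=3))
for train, test in folds:
    print(len(train), test)
```

Every row ends up in a test set exactly once, which is why the model has to be trained once per fold and why the cell can take a while.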

### Evaluate Performance

We evaluate our model by asking it to make predictions for data that it has **never seen before**. This way we can be sure that the model is actually learning and not just memorising the examples it's seen.

In [19]:
acc = results.mean()
print(f"Model accuracy is {round(acc * 100, 2)}%")

Model accuracy is 84.06%


You may not obtain the exact same accuracy as in the presentation due to the random nature of the algorithms used and hardware differences. It should be in the same ballpark though. Judging by other models built on this dataset, ~85% is about as good as you can get without more predictors/observations/inside knowledge. Not bad for a quick bodge-of-a-solution!
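Since the accuracy wobbles from run to run, it can be handy to report how much it varies between folds alongside the mean. A small sketch, using made-up scores in `fold_scores` as a stand-in for the real `results` array:

```python
import numpy as np

# hypothetical per-fold accuracies standing in for `results`
fold_scores = np.array([0.80, 0.90, 0.85, 0.75, 0.95])

# average accuracy and its spread across folds
mean_acc = fold_scores.mean()
std_acc = fold_scores.std()

print(f"Model accuracy is {mean_acc * 100:.2f}% (+/- {std_acc * 100:.2f}%)")
```

In the notebook itself you could swap `fold_scores` for `results` to get the same summary for the real model.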