## 1. Credit card applications
<p>Many people want to get a credit card but consequently not everyone is going to get one and that is because every bank in the world receives <em>tons</em> of applications for credit cards, so inevitably some of them probaly won't be approved. There is an  approval evaluation performed on these applications on the basis of many criteria like income, debts etc because banks need to assess the chances that an owner of a credit card may not manage to repay the debit. Nowdays the whole proccess of a evaluation is automated, due to the huge amount of data, and this is where tools like machine learning models come in handy.</p>
<p><img src="https://assets.datacamp.com/production/project_558/img/credit_card.jpg" alt="Credit card being held in hand"></p>
<p>So here we are going to build a model for credit card approval, using the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository. With that in mind:</p>
<ul>
<li>First things first, we'll load our database and have a first inspection of its characteristics.</li>
<li>Since it is a real world dataset we will notice that it has its own "flaws". We are faced with many different features (as for their range and type), lots of missing values etc etc.</li>
<li>These "flaws" create the need for the data to be preproccessed. We need to tidy things up cause "good data = valid predictions V.S  bad data =  non-valid predictions". Of course our machine learning model (actually our  employer, but I'll leave this "one" out for now) will appreciate us only if we aim for the "(good = valid)" pair.</li>
<li>After cleaning up, we'll make some exploration and try to derive some first intuitions.</li>
<li>Finally, we will build a machine learning model to make prediction on whether or not, an application for credit card would be approved.</li>
</ul>
<p>So let's grab our data and have a first look. Take notice that we are dealing with sensitive/confidential data so the provider of this data has already taken care of providing anonymity to the features, which is somehting that complies with GDPR regulations. That means that we are going to find that  "this" feature(s) (e.g feature named "g.1") might influence the prediction but we won't know if "this" feature is income or debt or something else, this is something that the bank that collected the data knows and it is confidential.</p>

In [95]:
# Import pandas
import pandas as pd

# Load dataset
cc_apps = pd.read_csv("datasets/cc_approvals.data", header=None)

# Inspect data
cc_apps.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


## 2. Inspecting the applications
<p>As we can see there are a lot of weird symbols/columns-titles, due to the aforementioned anonymity, but <a href="http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html">this blog</a> gives us a pretty good overview of the probable features. The probable features in a typical credit card application are <code>Gender</code>, <code>Age</code>, <code>Debt</code>, <code>Married</code>, <code>BankCustomer</code>, <code>EducationLevel</code>, <code>Ethnicity</code>, <code>YearsEmployed</code>, <code>PriorDefault</code>, <code>Employed</code>, <code>CreditScore</code>, <code>DriversLicense</code>, <code>Citizen</code>, <code>ZipCode</code>, <code>Income</code> and finally the <code>ApprovalStatus</code></p>
<p>This is an interesting dataset because it combines both numerical and categorical features, though this fact means at the same time the preprocess of data is rather compelling or we won't be able to come to valid conclusions. Before doing that let's at first have a very first look the summary statictics of the data.</p>

In [97]:
# Print summary statistics
cc_apps_description =  cc_apps.describe()
print(cc_apps_description)

print("\n")

# Print DataFrame information
cc_apps_info = cc_apps.info()
print(cc_apps_info)

print("\n")

# Inspect missing values in the dataset
cc_apps.tail(17)

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
0     690 non-null object
1     690 non-null object
2     690 non-null float64
3     690 non-null object
4     690 non-null object
5     690 non-null object
6     690 non-null object
7     690 non-null float64
8     690 non-null object
9     690 non-null object
10    690 non-null int64
11    690 non-null object
12    690 non-null object
13    690 non-null object
14    690 non-null int64

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-
680,b,19.5,0.29,u,g,k,v,0.29,f,f,0,f,g,280,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,0,f,g,176,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,0,t,g,140,2,-


## 3. Handling the missing values (the '?' case)
<p>We can now optically validate that our data contains lot of different features:</p>
<ul>
<li>Both nuerical and non-numerical features (specifically data that are of <code>float64</code>, <code>int64</code> and <code>object</code> types). Specifically, the features 2, 7, 10 and 14 contain numeric values (of types float64, float64, int64 and int64 respectively) and all the other features contain non-numeric values.</li>
<li>We can also see diversity in the ranges of the features. Some features have a value range of 0 - 28, some have a range of 2 - 67, and some have a range of 1017 - 100000. This does not prevent us form getting a general notion from generic statistics (like <code>mean</code>, <code>max</code>, and <code>min</code>) about the features that have numerical values. </li>
<li>Something that does create problems is the missing values (here for example is the '?' in the first row, first column) of the output of the <code>cc_apps.tail(17)</code> command.</li>
</ul>
<p>We are going to temporarily replace the '?' with the more computer-friendly "NaN" values, so we can proceed in how to deal with them.</p>

In [99]:
# Import numpy
import numpy as np

# Inspect missing values in the dataset
print(cc_apps.tail(17))

# Replace the '?'s with NaN
cc_apps = cc_apps.replace('?', np.nan)
print('\n')

# Inspect the missing values again
print(cc_apps.tail(17))

    0      1       2  3  4   5   6      7  8  9   10 11 12     13   14 15
673  ?  29.50   2.000  y  p   e   h  2.000  f  f   0  f  g  00256   17  -
674  a  37.33   2.500  u  g   i   h  0.210  f  f   0  f  g  00260  246  -
675  a  41.58   1.040  u  g  aa   v  0.665  f  f   0  f  g  00240  237  -
676  a  30.58  10.665  u  g   q   h  0.085  f  t  12  t  g  00129    3  -
677  b  19.42   7.250  u  g   m   v  0.040  f  t   1  f  g  00100    1  -
678  a  17.92  10.210  u  g  ff  ff  0.000  f  f   0  f  g  00000   50  -
679  a  20.08   1.250  u  g   c   v  0.000  f  f   0  f  g  00000    0  -
680  b  19.50   0.290  u  g   k   v  0.290  f  f   0  f  g  00280  364  -
681  b  27.83   1.000  y  p   d   h  3.000  f  f   0  f  g  00176  537  -
682  b  17.08   3.290  u  g   i   v  0.335  f  f   0  t  g  00140    2  -
683  b  36.42   0.750  y  p   d   v  0.585  f  f   0  f  g  00240    3  -
684  b  40.58   3.290  u  g   m   v  3.500  f  f   0  t  s  00400    0  -
685  b  21.08  10.085  y  p   e   h  1

## 4. Handling the missing values (impute the NaNs)
<p>So as we can see, in the output right above, we replaced the question mark with NaN and consequently this means this replacement took place in the dataset as a whole.</p>
<p>The whole trouble around the NaN values is that we can't simply ignore them because they can lead to misspredictions but can't either just "use" them because some predicting models don't know what to do with them.</p>
<p>Our work around with these kind of situations is to impute the missing values. Though there are many ways to impute NaNs, we are going to go with a strategy called mean imputation, which essentially it replaces NaN values with the mean of the feature where the NaN was spotted and is being applied on our numerical features (here columns 2, 7, 10 and 14)</p>

In [101]:
# Impute the missing values with mean imputation
cc_apps.fillna(cc_apps.mean(), inplace=True)

# Count the number of NaNs in the dataset to verify
print(cc_apps.isnull().values.sum())

67


## 5. Handling the missing values (Non numerical imputation)
<p>There are still some missing values to be imputed for columns 0, 1, 3, 4, 5, 6 and 13. All of these columns contain non-numeric data and we won't be able to impute them using the mean as a filling value. We are forced to proceed in a different way. </p>
<p>We are going to impute these missing values with the most frequent values as present in the respective columns. The following <a href="https://pbpython.com/categorical-encoding.html">link</a> is a good tutorial for when it comes to imputing missing values for categorical data in general.</p>

In [103]:
# Fill null values with the most frequent (index[0])value for each column
for col in cc_apps.columns:
    if cc_apps[col].dtype == 'object':
        cc_apps = cc_apps.fillna(cc_apps[col].value_counts().index[0])
        
# Check again for null values
cc_apps.isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64

## 6. Preprocessing the data (part i)
<p>The above output shows what we no more have any missing values.</p>
<p>We will now make some data preprocessing that will make our ML model more reliable. We are going to take the next steps as follows:</p>
<ol>
<li>Convert non-numerical data into numerical</li>
<li>Split the data into train and test sets</li>
<li>Convert data into having a uniform range</li>
</ol>
<p>First, we will be converting all the non-numeric values into numeric ones. We do this because not only it results in a faster computation but also many machine learning models (like XGBoost) (and especially the ones developed using scikit-learn) require the data to be in a strictly numeric format. We will do this by using a technique called <a href="http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html">label encoding</a>.</p>

In [105]:
from sklearn import preprocessing

# Instantiate LabelEncoder
le = preprocessing.LabelEncoder()

# Iterate cc_apps and if type is object, tranform it with LabelEncoder
for col in cc_apps.columns:
    if cc_apps[col].dtype == 'object':
        le.fit(cc_apps[col].values)
        cc_apps[col] = le.transform(cc_apps[col])

        
# Check if successfully converted objects into ints
import pandas.api.types as ptypes

assert all(ptypes.is_numeric_dtype(cc_apps[col]) for col in cc_apps.columns)
print(cc_apps.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
0     690 non-null int64
1     690 non-null int64
2     690 non-null float64
3     690 non-null int64
4     690 non-null int64
5     690 non-null int64
6     690 non-null int64
7     690 non-null float64
8     690 non-null int64
9     690 non-null int64
10    690 non-null int64
11    690 non-null int64
12    690 non-null int64
13    690 non-null int64
14    690 non-null int64
15    690 non-null int64
dtypes: float64(2), int64(14)
memory usage: 86.3 KB
None


## 7. Splitting the dataset into train and test sets
<p>We did not get any Assertion Error messages in the above block so the transformation was succefull.</p>
<p>Now, into our step number 2 we will split our data into train set and test set (setting <code>test_size</code> and <code>random_state</code> for achieving the same data partitioning each time) to prepare our data for two different phases of machine learning modeling: training and testing. This is crucial because we  don't want any data from the soon to be tested set/sample to intervene with the data that we use for creating our model, as this would produce case specific/biased results that are not reliable scalable.</p>
<p>We will also perform some basic <em>feature selection</em> meaning we will drop features such as <code>DriversLicense</code> and <code>ZipCode</code> as they do not significally contribute as the other features in the dataset for predicting credit card approvals. This is a "healthy" Data Science practice that can lead in generating better predictive models.</p>

In [107]:
from sklearn.model_selection import train_test_split

# Drop the features 11 and 13 and convert the DataFrame to a NumPy array
# Note: Use 'DataFrame.to_numpy()' in pandas version 0.24 and above
cc_apps.drop(columns=[11, 13], inplace=True)
cc_arr = cc_apps.values

# Seperate features and labels and split into train and test sets
X, y = cc_arr[:,1:13], cc_arr[:,13]
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.33, 
                                                    random_state=42)


## 8. Preprocessing the data (part ii)
<p>Now that we have our train and test sets we march into our step 3 of data preprocessing, that is rescaling the data.</p>
<p> For example we have the feature <code>CreditScore</code> which is a person's creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. The thing is that this feature has a range of "crazy" values that mask the insight it can provide us, so we are going to rescale it in a 0-1 range.</p>
<p>That way we can easily and intuitively understand that a (maximum) value of 1 and a (minimum) value of 0 tells us that a person has the highest and lowest creditworthiness respectively. Of course in real life such extreme values are not the main case so it is also easier for us to get a grasp of where a person resides in this "spectrum" of credit evaluation.</p>

In [109]:
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler()
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.fit_transform(X_test)


## 9. Fitting a logistic regression model to the train set
<p>Essentially, predicting if a credit card application will be approved or not is a <a href="https://en.wikipedia.org/wiki/Statistical_classification">classification</a> task. <a href="http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.names">According to UCI</a>, our dataset contains more instances that correspond to "Denied" status than instances corresponding to "Approved" status. Specifically, out of 690 instances, there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got approved.</p>
<p>This gives us a benchmark. A good machine learning model should be able to accurately predict the status of the applications with respect to these statistics.</p>
<p>Now we are faced with the problem of picking an efficient model. One of the most significant factors is the notion <em>if any of the features corellate to each other, affecting the prediction</em>. At this point we will trust our intuition/common sense that there is indeed a correlation. We could validate this assumption by statistically computing the correlation though in the context of this specific showcasing it is not mandatory. Generalized linear models perform well in cases of existing corellation. Let's start our machine learning modeling with a Logistic Regression model (a generalized linear model). If you want to learn more about Logistic Regression follow<a href="https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc"> this link</a> which contains a detailed explaination</p>

In [111]:
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier and fit model
logreg = LogisticRegression()
logreg.fit(rescaledX_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

## 10. Making predictions and evaluating performance
<p>We will now evaluate our model on the test set with respect to <a href="https://developers.google.com/machine-learning/crash-course/classification/accuracy">classification accuracy</a>. But we will also take a look the model's <a href="http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/">confusion matrix</a>. In the case of predicting credit card applications, it is equally important to see if our machine learning model is able to predict the approval status of the applications as denied that originally got denied. If our model is not performing well in this aspect, then it might end up approving the application that should have been approved. The confusion matrix helps us to view our model's performance from these aspects.  </p>

In [113]:
from sklearn.metrics import confusion_matrix

# Predict instances from the test set and print its score
y_pred = logreg.predict(rescaledX_test)
print("The accuracy score for Logistic Regression is:",
        logreg.score(rescaledX_test, y_test))

# Print the confusion matrix of the model
print(confusion_matrix(y_test, y_pred))

The accuracy score for Logistic Regression is: 0.8377192982456141
[[92 11]
 [26 99]]


## 11. Grid searching and making the model perform better
<p>Our model was pretty good! It was able to yield an accuracy score of almost 84%.</p>
<p>For the confusion matrix, the first element of the of the first row of the confusion matrix denotes the true negatives meaning the number of negative instances (denied applications) predicted by the model correctly. And the last element of the second row of the confusion matrix denotes the true positives meaning the number of positive instances (approved applications) predicted by the model correctly.</p>
<p>Let's see if we can do better. We can perform a <a href="https://machinelearningmastery.com/how-to-tune-algorithm-parameters-with-scikit-learn/">grid search</a> of the model parameters to improve the model's ability to predict credit card approvals.</p>
<p><a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">scikit-learn's implementation of logistic regression</a> consists of different hyperparameters but we will grid search over the following two:</p>
<ul>
<li>tol</li>
<li>max_iter</li>
</ul>

In [115]:
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict([('tol', tol),('max_iter', max_iter)])
print(param_grid)

{'tol': [0.01, 0.001, 0.0001], 'max_iter': [100, 150, 200]}


## 12. Finding the best performing model
<p>We have defined the grid of hyperparameter values and converted them into a single dictionary format which <code>GridSearchCV()</code> expects as one of its parameters. Now, we will begin the grid search to see which values perform best.</p>
<p>We will instantiate <code>GridSearchCV()</code> with our earlier <code>logreg</code> model with all the data we have. Instead of passing train and test sets separately, we will supply <code>X</code> (scaled version) and <code>y</code>. We will also instruct <code>GridSearchCV()</code> to perform a <a href="https://www.dataschool.io/machine-learning-with-scikit-learn/">cross-validation</a> of five folds.</p>
<p>We'll end our credit card approval model by storing the best-achieved score and the respective best parameters.</p>

In [117]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, cv=5, param_grid=param_grid)
# Use scaler to rescale X and assign it to rescaledX
rescaledX = scaler.fit_transform(X)

# Fit data to grid_model and summarize results
grid_model_result = grid_model.fit(rescaledX, y)
best_params = grid_model.best_params_
best_score = grid_model_result.best_score_


Best score achieved is: 0.853623 using {'tol': 0.01, 'max_iter': 100}


## 13. Conclusion
<p>To sum it up: Firstly we imported our data and created a dataframe out of it. Then we performed some basic data cleaning by replacing some irregular characters with NaN values.</p>
<p>After that, we imputed these missing values. In the case of numerical values we replaced them with the mean value of the specific feature and in the case of categorical value with the most frequent value. The next step was to do some data preproccesing, before training our model, by tranforming the categorical features into integers and rescale them to a 0-1 basis.</p>
<p>Finally we trained our model using a <em>Logistic Regression</em> model that initialy gave a performance of almost 84%, which we managed to improve it up to a 85.36% through <em> Hyperparameter Optimization</em> using a 5-fold cross validation that tweaked the number of maximum iterations and the tolerance degree.</p>
<p><img src="https://upgradedpoints.com/wp-content/uploads/2018/03/Instant-Approval-Credit-Cards.jpg"></p>