<h1 align="center"> Predicting Credit Card Approvals using ML Techniques </h1>
<br>
<p>This notebook presents a customized version of a DataCamp project in which an automatic credit card approval predictor is built using machine learning techniques.</p><br>

Topics: `Data Manipulation`   `Machine Learning`   `Importing & Cleaning Data`   `Applied Finance`

<br>


## Credit card applications
<p>Owing to the volume of credit card applications received by commercial banks and the length of time a manual credit analysis would take, machine learning techniques can be used to automate the prediction of credit card approvals. For this purpose, a machine learning model will be built using the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository.</p>
<p><img src="https://assets.datacamp.com/production/project_558/img/credit_card.jpg" align="center" width="500" height="600" alt="Credit card being held in hand"></p>
<p>The structure of this notebook is organized as follows:</p>
<ul>
<li>Loading and inspection of the dataset.</li>
<li>Data preprocessing prior to machine learning model predictions.</li>
<li>Exploratory data analysis.</li>
<li>Building a machine learning model to predict credit card approval for individual applications.</li>
</ul>

## 1. Loading the dataset

<p>First, the <code>pandas</code> package is imported to load and view the dataset.</p>

In [51]:
# Ignoring FutureWarning before importing pandas for better aesthetics
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Importing the required package
import pandas as pd

# Loading the dataset
cc_apps = pd.read_csv("cc_approvals.data", header=None)

# Inspecting the first 5 rows of the dataset
cc_apps.head()

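| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |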


## 2. Inspecting the applications

<p>The features of this dataset have been anonymized for confidentiality, so the column names are not given. According to <a href="http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html">this blog</a>, the typical features in a credit card application are <b>Gender</b>, <b>Age</b>, <b>Debt</b>, <b>Married</b>, <b>BankCustomer</b>, <b>EducationLevel</b>, <b>Ethnicity</b>, <b>YearsEmployed</b>, <b>PriorDefault</b>, <b>Employed</b>, <b>CreditScore</b>, <b>DriversLicense</b>, <b>Citizen</b>, <b>ZipCode</b> and <b>Income</b>, plus the label <b>ApprovalStatus</b>.</p>
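
<p>As an optional aid to readability, the integer column labels could be mapped to the names listed above. The following is a minimal sketch under that assumption: the name list is taken from the feature list in the text, not from the raw file, and the single row is copied from the <code>head()</code> output as a stand-in for the full dataset. The rest of this notebook keeps the integer labels.</p>

```python
import pandas as pd

# Assumed column names, taken from the feature list in the text (not stored in the raw file)
column_names = [
    "Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus",
]

# One row copied from the head() output, used as a stand-in for the full dataset
sample = pd.DataFrame(
    [["b", "30.83", 0.0, "u", "g", "w", "v", 1.25, "t", "t", 1, "f", "g", "202", 0, "+"]]
)

# Map the integer column labels 0..15 to the readable names
named = sample.rename(columns=dict(enumerate(column_names)))
print(named.loc[0, ["Age", "CreditScore", "ApprovalStatus"]].tolist())
```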

<p>To get more information about the DataFrame, the <code>describe()</code>, <code>info()</code> and <code>tail()</code> methods were applied:</p>

In [52]:
# Printing summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

print('\n')

# Printing DataFrame information (info() prints directly and returns None)
cc_apps.info()

print('\n')

# Inspecting the last 17 rows to look for missing values
print(cc_apps.tail(17))

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 no

<p>Note that the dataset contains missing values as well as both numerical and non-numerical features, which makes preprocessing necessary.</p>

## 3. Splitting the dataset into train and test sets
<p>Before preprocessing the data, the dataset is split into a train set and a test set for the training and testing phases of machine learning modeling. Splitting first is preferred because it keeps information from the test set out of the preprocessing steps.</p>

<p>In addition, a simple feature selection was made: the features <b>DriversLicense</b> and <b>ZipCode</b>, considered unnecessary for this task, were dropped.</p>

In [53]:
# Importing train_test_split
from sklearn.model_selection import train_test_split

# Dropping the features DriversLicense and ZipCode
cc_apps = cc_apps.drop(columns=[11,13])

# Split into train and test sets
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size=0.33, random_state=42)

## 4. Handling the missing values (part I)
<p>As identified from inspecting the DataFrame, the dataset contains both numeric and non-numeric data (of the <code>float64</code>, <code>int64</code> and <code>object</code> types): features 2, 7, 10 and 14 contain numeric values, while all the other features are stored as non-numeric values.</p>

<p>Missing values in the dataset are labeled as question marks. As a temporary step that makes further missing value treatment easier, the question marks were replaced by NaN. For this, the <code>numpy</code> library was used.</p>

In [54]:
# Importing numpy
import numpy as np

# Replacing the '?'s with NaN in both sets
cc_apps_train = cc_apps_train.replace('?', np.nan)
cc_apps_test = cc_apps_test.replace('?', np.nan)
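
<p>An alternative, shown here only as a sketch on toy data, is to let <code>pd.read_csv</code> treat the question marks as missing values while loading, via its <code>na_values</code> parameter. The two-column input below is a made-up stand-in for <code>cc_approvals.data</code>. Note that with this approach a column containing only numbers and question marks is parsed as <code>float64</code> rather than <code>object</code>, which would change how later steps treat it.</p>

```python
import io
import pandas as pd

# Toy stand-in for cc_approvals.data: two columns, '?' marks a missing value
raw = io.StringIO("b,30.83\na,?\n?,24.50\n")

# na_values tells the parser to treat '?' as NaN while reading
toy = pd.read_csv(raw, header=None, na_values="?")

# The second column is now numeric, and each column has one missing value
print(toy.dtypes)
print(toy.isna().sum())
```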

## 5. Handling the missing values (part II)
<p>Since many models, such as Linear Discriminant Analysis (LDA), cannot handle missing values implicitly, a strategy called <i>mean imputation</i> was implemented for the numeric columns. Imputing is preferable to ignoring the missing values, which could degrade the performance of a machine learning model. The means are computed from the train set and applied to both sets.</p>

In [55]:
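# Imputing the missing values of the numeric columns with the train-set means.
# The result is assigned back: fillna(inplace=True) on a .loc selection only
# modifies a temporary copy and leaves the original DataFrame unchanged.
train_means = cc_apps_train.mean(numeric_only=True)
cc_apps_train = cc_apps_train.fillna(train_means)
cc_apps_test = cc_apps_test.fillna(train_means)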

# Counting and printing the number of NaNs in the datasets to verify
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())

0     8
1     5
2     0
3     6
4     6
5     7
6     7
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
0     4
1     7
2     0
3     0
4     0
5     2
6     2
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
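
<p>The idea of imputing both sets with statistics learned from the train set can be illustrated on a toy example. This is a minimal sketch; the column names <code>debt</code> and <code>grade</code> and all values are made up for illustration.</p>

```python
import numpy as np
import pandas as pd

# Made-up toy data: one numeric and one non-numeric column, each with gaps
train = pd.DataFrame({"debt": [1.0, np.nan, 3.0], "grade": ["a", "b", None]})
test = pd.DataFrame({"debt": [np.nan, 10.0], "grade": ["a", None]})

# Column means computed from the train set only; non-numeric columns are skipped
train_means = train.mean(numeric_only=True)

# fillna with a Series fills each column with the value stored under its label
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

# The numeric gaps are filled with the train mean; the non-numeric gaps remain
print(test_filled)
```

<p>The non-numeric column is untouched by this step, which matches the output above: only the <code>object</code> columns still report missing values.</p>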


## 6. Handling the missing values (part III)
<p>For the columns containing non-numeric data types, the missing values were imputed with the most frequent value of each column. This is <a href="https://www.datacamp.com/community/tutorials/categorical-data">good practice</a> when it comes to imputing missing values for categorical data in general.</p>

In [56]:
# Iterate over each column of cc_apps_train
for col in cc_apps_train.columns:
    # Check if the column is of object type
    if cc_apps_train[col].dtype == 'object':
        # Most frequent value of this column in the train set
        most_frequent = cc_apps_train[col].value_counts().index[0]
        # Impute this column only, in both sets
        cc_apps_train[col] = cc_apps_train[col].fillna(most_frequent)
        cc_apps_test[col] = cc_apps_test[col].fillna(most_frequent)

# Counting and printing the number of NaNs in both sets combined to verify
print(pd.concat([cc_apps_train, cc_apps_test]).isna().sum())

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
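
<p>Most-frequent-value imputation works column by column: each column gets its own fill value, learned from the train set. A minimal sketch on made-up data (the column names and values below are illustrative only):</p>

```python
import pandas as pd

# Made-up categorical toy data with missing entries
train = pd.DataFrame({"married": ["u", "u", "y", None], "citizen": ["g", None, "g", "s"]})
test = pd.DataFrame({"married": [None, "y"], "citizen": ["s", None]})

for col in train.columns:
    # mode() returns the most frequent value(s); take the first one
    most_frequent = train[col].mode()[0]
    # Fill only this column, using the value learned from the train set
    train[col] = train[col].fillna(most_frequent)
    test[col] = test[col].fillna(most_frequent)

print(train)
print(test)
```

<p>Filling the whole DataFrame with a single column's most frequent value would instead write that one value into every column that has gaps, which is rarely what is intended.</p>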


## 7. Preprocessing the data (part I)
<p>Before building the machine learning model, two minor but essential preprocessing steps remain:</p>
<ol>
<li>Convert the non-numeric data into numeric.</li>
<li>Scale the feature values to a uniform range.</li>
</ol>
<p>Many machine learning models (especially those implemented in scikit-learn) require the data to be in a strictly numeric format. This conversion can be performed using the <code>get_dummies()</code> function from pandas.</p>
<p>An alternative would be the technique called <i>label encoding.</i></p>

In [57]:
# Converting the categorical features in the train and test sets independently
cc_apps_train = pd.get_dummies(cc_apps_train)
cc_apps_test = pd.get_dummies(cc_apps_test)

# Reindexing the columns of the test set aligning with the train set
cc_apps_test = cc_apps_test.reindex(columns=cc_apps_train.columns, fill_value=0)
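
<p>The alignment step matters when a category appears in only one of the two sets. The sketch below uses made-up data with a hypothetical <code>status</code> label column. It also separates the label before encoding, since <code>pd.get_dummies()</code> applied to a whole DataFrame one-hot encodes every non-numeric column, including a non-numeric label.</p>

```python
import pandas as pd

# Made-up toy data; 'status' is a hypothetical label column
train = pd.DataFrame({"debt": [0.5, 1.5, 4.0], "citizen": ["g", "s", "g"], "status": ["+", "-", "+"]})
test = pd.DataFrame({"debt": [2.0, 3.0], "citizen": ["g", "p"], "status": ["-", "+"]})

# Separate the label first so it is never one-hot encoded into the features
y_train = (train["status"] == "+").astype(int)
y_test = (test["status"] == "+").astype(int)

# Encode the features of each set independently
X_train = pd.get_dummies(train.drop(columns="status"))
X_test = pd.get_dummies(test.drop(columns="status"))

# Align test with train: 'citizen_p' is dropped, 'citizen_s' is added as all zeros
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
print(list(X_test.columns))
```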

## 8. Preprocessing the data (part II)
<p>When features span very different ranges, as observed in this dataset, features with large values can dominate the model fit, while changes in small-valued features have comparatively little influence. The final preprocessing step is therefore to scale the features before fitting the model to the data. Using the <b>CreditScore</b> as an example, the higher this number, the more financially trustworthy a person is considered to be. After rescaling all the values to a range of 0-1, a <b>CreditScore</b> of 1 is the highest.</p>

In [58]:
# Importing MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Segregating features and labels into separate variables
X_train, y_train = cc_apps_train.iloc[:, :-1].values, cc_apps_train.iloc[:, [-1]].values
X_test, y_test = cc_apps_test.iloc[:, :-1].values, cc_apps_test.iloc[:, [-1]].values

# Instantiating MinMaxScaler, fitting it on X_train only and reusing it for X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
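
<p>Min-max scaling maps each value to <code>(x - min) / (max - min)</code>, where the minimum and maximum are learned per feature. The following sketch on made-up numbers fits the scaler on the train data only and compares the result with the formula computed by hand.</p>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data
X_train = np.array([[0.0], [5.0], [10.0]])
X_test = np.array([[2.5], [20.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_train = scaler.fit_transform(X_train)  # learns min=0 and max=10 from the train data
scaled_test = scaler.transform(X_test)        # reuses the train min and max

# Same result by hand: (x - min) / (max - min), with train statistics
train_min, train_max = X_train.min(axis=0), X_train.max(axis=0)
manual = (X_test - train_min) / (train_max - train_min)
print(scaled_test.ravel(), manual.ravel())
```

<p>Test values outside the range seen during training can fall outside 0-1 after scaling. That is expected when the scaler is fit on the train set only.</p>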

## 9. Fitting a logistic regression model to the train set
<p>Fundamentally, predicting credit card application approval is a <a href="https://en.wikipedia.org/wiki/Statistical_classification">classification</a> task. According to UCI, of the 690 instances in the dataset, 383 (55.5%) are applications that were denied and 307 (44.5%) are applications that were approved. A good machine learning model should predict the status of the applications accurately for both classes, in line with these proportions.</p>
<p>Generalized linear models are generally considered to perform well when the features are correlated with each other. Therefore, a Logistic Regression model (a generalized linear model) was applied first.</p>

In [80]:
# Importing LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiating a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fitting logreg to the train set
logreg.fit(rescaledX_train, y_train.ravel())

LogisticRegression()

## 10. Making predictions and evaluating performance
<p>When predicting credit card applications, it is important to see whether the machine learning model is equally capable of predicting the approved and the denied status, in line with the frequency of these labels in the original dataset. If the model does not perform well in this respect, it might end up approving an application that should have been denied. The confusion matrix helps to view the model's performance from these aspects.</p>
<p>In the <a href="https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/">confusion matrix</a>, the first element of the first row denotes the true negatives, that is, the number of negative instances predicted correctly by the model. The last element of the second row denotes the true positives, that is, the number of positive instances predicted correctly. Note that the label used in this notebook is the last dummy column, <code>15_-</code>, so the positive class corresponds to denied applications and the negative class to approved ones.</p>
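
<p>The layout of the confusion matrix can be made concrete with a toy example. The labels below are made up. For a binary problem, scikit-learn orders the matrix as <code>[[TN, FP], [FN, TP]]</code>, so <code>ravel()</code> unpacks the four counts in that order.</p>

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for a binary problem
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / cm.sum()

print(cm)
print(tn, fp, fn, tp, accuracy)
```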

In [81]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test, y_test))

# Print the confusion matrix of the logreg model
print(confusion_matrix(y_test, y_pred))

Accuracy of logistic regression classifier:  1.0
[[103   0]
 [  0 125]]


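<b>The model reaches an accuracy of 100% on the test set.</b> A perfect score should be treated with caution, though: <code>pd.get_dummies()</code> was applied to the whole DataFrame, so the label column 15 was one-hot encoded as well. The last column, <code>15_-</code>, serves as the target, while its complement <code>15_+</code> remains among the features and reveals the label to the model. The reported accuracy therefore reflects this leakage rather than genuine predictive performance.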
<br>
<br>
<br>

## 11. Grid searching and making the model perform better
<p>When a model does not reach a satisfactory score, performing a <a href="https://machinelearningmastery.com/how-to-tune-algorithm-parameters-with-scikit-learn/">grid search</a> over the model parameters is one way to improve its ability to predict credit card approvals. <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">Scikit-learn's implementation of logistic regression</a> exposes several hyperparameters, such as <code>tol</code> and <code>max_iter</code>.</p>

In [82]:
# Importing GridSearchCV
from sklearn.model_selection import GridSearchCV

# Defining the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Creating a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol = tol, max_iter = max_iter)
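
<p>To see what such a grid expands to, scikit-learn's <code>ParameterGrid</code> can enumerate every combination. This is a small sketch using the same values as above: three values for each of the two hyperparameters give nine candidate settings, each of which a grid search evaluates once per cross-validation fold.</p>

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined above
param_grid = dict(tol=[0.01, 0.001, 0.0001], max_iter=[100, 150, 200])

# Expand the grid into the list of all hyperparameter combinations
combos = list(ParameterGrid(param_grid))

print(len(combos))
for combo in combos:
    print(combo)
```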

## 12. Finding the best performing model
<p>The grid of hyperparameter values has been defined and converted into a single dictionary, the format which <code>GridSearchCV()</code> expects as one of its parameters. To find out which values perform best, the grid search is started by instantiating <code>GridSearchCV()</code> with the earlier <code>logreg</code> and performing five-fold <a href="https://www.dataschool.io/machine-learning-with-scikit-learn/">cross-validation</a>.</p>

<p>Finally, this notebook will end by storing the best-achieved score and the respective best parameters.</p>

In [83]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Fit grid_model to the data
grid_model_result = grid_model.fit(rescaledX_train, y_train.ravel())

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

# Extract the best model and evaluate it on the test set
best_model = grid_model_result.best_estimator_
print("Accuracy of logistic regression classifier: ", best_model.score(rescaledX_test, y_test))

Best: 1.000000 using {'max_iter': 100, 'tol': 0.01}
Accuracy of logistic regression classifier:  1.0
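
<p>One possible refinement, sketched here on synthetic data rather than on the credit card dataset, is to wrap the scaler and the classifier in a scikit-learn <code>Pipeline</code> and pass that to <code>GridSearchCV</code>. The scaler is then refit on the training part of every cross-validation fold, so no statistics from the validation part enter the preprocessing. The step names <code>scaler</code> and <code>logreg</code> and the synthetic data are assumptions made for this illustration.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; not the credit card dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Scaling and classification chained into a single estimator
pipe = Pipeline([("scaler", MinMaxScaler()), ("logreg", LogisticRegression())])

# The 'logreg__' prefix routes each parameter to the logistic regression step
param_grid = {"logreg__tol": [0.01, 0.001, 0.0001], "logreg__max_iter": [100, 150, 200]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

# Best parameters found and accuracy of the refit best pipeline on the test set
print(grid.best_params_)
print(grid.score(X_test, y_test))
```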
