# Credit Card Application
Commercial banks receive a lot of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report, for example. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.


In [1]:
#importing some dependencies
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [2]:
#Reading the file into the pandas dataframe
card = pd.read_csv('crx.data', header=None)
card.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [3]:
#Describing the card data
card.describe()

Unnamed: 0,2,7,10,14
count,690.0,690.0,690.0,690.0
mean,4.758725,2.223406,2.4,1017.385507
std,4.978163,3.346513,4.86294,5210.102598
min,0.0,0.0,0.0,0.0
25%,1.0,0.165,0.0,0.0
50%,2.75,1.0,0.0,5.0
75%,7.2075,2.625,3.0,395.5
max,28.0,28.5,67.0,100000.0


In [4]:
card.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


In [5]:
card.tail(17)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-
680,b,19.5,0.29,u,g,k,v,0.29,f,f,0,f,g,280,364,-
681,b,27.83,1.0,y,p,d,h,3.0,f,f,0,f,g,176,537,-
682,b,17.08,3.29,u,g,i,v,0.335,f,f,0,t,g,140,2,-


# Handling Missing Values
Missing values can affect the performance of a machine learning model heavily. While ignoring the missing values our machine learning model may miss out on information about the dataset that may be useful for its training. 
The dataset has missing values, which we will also take care of in this task. The missing values in the dataset are labeled with '?', which can be seen in the last cell's output. We first temporarily replace these missing value question marks with NaN using the ".replace method", setting the inplace value to 'True' and using np.nan.

Another method employed is the **Mean Manipulation** method, using **.fillna()**. This method is preferrably used on Numeric columns. For the Non-numeric columns we Iterate over each column of cc_apps using a for loop. Checking if the data-type of the column is of object type by using the "**dtypes keyword**". Using the "**fillna()**" method, imputing the column's missing values with the most frequent value of that column with the **value_counts() method** and **index attribute** and assign it to cc_apps.
Finally, we verify if there are any more missing values in the dataset that are left to be imputed by printing the total number of NaNs in each column.

In [6]:
# Replace the ? with NaN
card.replace('?', np.nan, inplace=True)

# Checking for the missing values again
print(card.tail(17))

      0      1       2  3  4   5   6      7  8  9   10 11 12     13   14 15
673  NaN  29.50   2.000  y  p   e   h  2.000  f  f   0  f  g  00256   17  -
674    a  37.33   2.500  u  g   i   h  0.210  f  f   0  f  g  00260  246  -
675    a  41.58   1.040  u  g  aa   v  0.665  f  f   0  f  g  00240  237  -
676    a  30.58  10.665  u  g   q   h  0.085  f  t  12  t  g  00129    3  -
677    b  19.42   7.250  u  g   m   v  0.040  f  t   1  f  g  00100    1  -
678    a  17.92  10.210  u  g  ff  ff  0.000  f  f   0  f  g  00000   50  -
679    a  20.08   1.250  u  g   c   v  0.000  f  f   0  f  g  00000    0  -
680    b  19.50   0.290  u  g   k   v  0.290  f  f   0  f  g  00280  364  -
681    b  27.83   1.000  y  p   d   h  3.000  f  f   0  f  g  00176  537  -
682    b  17.08   3.290  u  g   i   v  0.335  f  f   0  t  g  00140    2  -
683    b  36.42   0.750  y  p   d   v  0.585  f  f   0  f  g  00240    3  -
684    b  40.58   3.290  u  g   m   v  3.500  f  f   0  t  s  00400    0  -
685    b  21

## Handling the missing values (Numeric)
We replaced all the question marks with NaNs. This is going to help us in the next missing value treatment that we are going to perform. Ignoring missing values can affect the performance of a machine learning model heavily. While ignoring the missing values our machine learning model may miss out on information about the dataset that may be useful for its training. Then, there are many models which cannot handle missing values implicitly such as LDA. So, to avoid this problem, we are going to impute the missing values with a strategy called mean imputation.  As your dataset contains both numeric and non-numeric data, for this task you will only impute the missing values (NaNs) present in the columns having numeric data-types (columns 2, 7, 10 and 14).
Pandas provides the fillna() function for replacing missing values with a specific value. For example, we can use fillna() to replace missing values with the mean value for each column. Thus, mean imputation is only useful for numeric columns.

In [7]:
# Impute the missing values with mean imputation
card.fillna(card.mean(), inplace=True)

# Count the number of NaNs in the dataset to verify
card.isna().sum()

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

## Handling the missing values (Non Numeric)
We have successfully taken care of the missing values present in the numeric columns. There are still some missing values to be imputed for columns 0, 1, 3, 4, 5, 6 and 13. All of these columns contain non-numeric data and this why the mean imputation strategy would not work here. This needs a different treatment. We are going to impute these missing values with the most frequent values as present in the respective columns. This is good practice when it comes to imputing missing values for categorical data in general.

The column names of a pandas DataFrame can be accessed using columns '**df[cols]**' attribute. The dtypes attribute provides the data type. In this part, object is the data type that you should be concerned about. The value_counts() method returns the frequency distribution of each value in the column, and the index attribute can then be used to get the most frequent value.

In [8]:
# Iterate over each column of cc_apps
for cols in card:
    # Check if the column is of object type
    if card[cols].dtypes == 'object':
        # Impute with the most frequent value
        card.fillna(card[cols].value_counts().index[0], inplace=True)

# Count the number of NaNs in the dataset and print the counts to verify
card.isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64

## Preprocessing the data (Encoding)
We are going to divide these remaining preprocessing steps into three main tasks:
- Convert the non-numeric data into numeric.
- Split the data into train and test sets.
- Scale the feature values to a uniform range.
First, we will be converting all the non-numeric values into numeric ones. We do this because not only it results in a faster computation but also many machine learning models (like XGBoost) (and especially the ones developed using scikit-learn) require the data to be in a strictly numeric format. We will do this by using a technique called label encoding.
The values of each column a pandas DataFrame can be accessed using columns and values attributes consecutively. The dtypes attribute provides the data type. In this part, object is the data type that we should be concerned about.

In [9]:
#Instantiate LabelEncoder() into a variable le.
le = LabelEncoder()
# Iterate over all the values of each column and extract their dtypes
for cols in card.columns.values:
    # Compare if the dtype is object
    if card[cols].dtypes=='object':
    # Use LabelEncoder to do the numeric transformation
        card[cols]=le.fit_transform(card[cols])

In [10]:
card.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    int32  
 1   1       690 non-null    int32  
 2   2       690 non-null    float64
 3   3       690 non-null    int32  
 4   4       690 non-null    int32  
 5   5       690 non-null    int32  
 6   6       690 non-null    int32  
 7   7       690 non-null    float64
 8   8       690 non-null    int32  
 9   9       690 non-null    int32  
 10  10      690 non-null    int64  
 11  11      690 non-null    int32  
 12  12      690 non-null    int32  
 13  13      690 non-null    int32  
 14  14      690 non-null    int64  
 15  15      690 non-null    int32  
dtypes: float64(2), int32(12), int64(2)
memory usage: 54.0 KB


In [11]:
card.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,1,156,0.0,2,1,13,8,1.25,1,1,1,0,0,68,0,0
1,0,328,4.46,2,1,11,4,3.04,1,1,6,0,0,11,560,0
2,0,89,0.5,2,1,11,4,1.5,1,0,0,0,0,96,824,0
3,1,125,1.54,2,1,13,8,3.75,1,1,5,1,0,31,3,0
4,1,43,5.625,2,1,13,8,1.71,1,0,0,0,2,37,0,0


## Splitting the dataset into train and test sets
Now, we will split our data into train set and test set to prepare our data for two different phases of machine learning **Modeling**: training and testing. Ideally, no information from the test data should be used to scale the training data or should be used to direct the training process of a machine learning model. Hence, we first split the data and then apply the scaling.
Also, features like DriversLicense and ZipCode are not as important as the other features in the dataset for predicting credit card approvals. We should drop them to design our machine learning model with the best set of features. In Data Science literature, this is often referred to as **Feature Selection**.
Note, setting random_state ensures the dataset is split with same sets of instances every time the code is run.

In [12]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Drop the features 11 and 13
card = card.drop([11, 13], axis=1)

# Segregate features and labels into separate variables
X = card.drop(15, axis=1)
y = card[15]

print(X.head())
y.head()

# convert the DataFrame to a NumPy array
X = X.values
y = y.values

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                y,
                                test_size=0.33,
                                random_state=42)

   0    1      2   3   4   5   6     7   8   9   10  12   14
0   1  156  0.000   2   1  13   8  1.25   1   1   1   0    0
1   0  328  4.460   2   1  11   4  3.04   1   1   6   0  560
2   0   89  0.500   2   1  11   4  1.50   1   0   0   0  824
3   1  125  1.540   2   1  13   8  3.75   1   1   5   0    3
4   1   43  5.625   2   1  13   8  1.71   1   0   0   2    0


In [13]:
X_test = pd.DataFrame(X_test)
print(X_test)

       0      1      2    3    4     5    6       7    8    9    10   11  \
0    0.0  349.0   1.50  2.0  1.0   6.0  3.0   0.000  0.0  1.0   2.0  0.0   
1    0.0  271.0   4.00  2.0  1.0   8.0  5.0   0.000  1.0  0.0   0.0  0.0   
2    1.0   41.0   0.00  2.0  1.0   4.0  8.0   0.500  0.0  0.0   0.0  0.0   
3    1.0  277.0   6.50  2.0  1.0   2.0  8.0   1.000  0.0  0.0   0.0  0.0   
4    1.0   34.0   0.00  3.0  3.0  10.0  1.0   0.000  0.0  0.0   0.0  2.0   
..   ...    ...    ...  ...  ...   ...  ...     ...  ...  ...   ...  ...   
223  0.0   50.0   0.50  3.0  3.0   5.0  2.0   1.000  0.0  0.0   0.0  0.0   
224  0.0  326.0  21.00  2.0  1.0   7.0  1.0  10.000  1.0  1.0  13.0  0.0   
225  1.0  205.0   0.42  3.0  3.0  13.0  8.0   0.290  0.0  0.0   0.0  0.0   
226  1.0  171.0   2.50  2.0  1.0   2.0  8.0   1.250  0.0  0.0   0.0  0.0   
227  1.0   33.0   1.75  3.0  3.0   2.0  8.0   2.335  0.0  0.0   0.0  0.0   

         12  
0     105.0  
1     960.0  
2       0.0  
3     228.0  
4       1.0  
.. 

## Preprocessing the data (Scaling)
The data is now split into two separate sets - train and test sets respectively. We are only left with one final preprocessing step of scaling before we can fit a machine learning model to the data.
Now, let's try to understand what these scaled values mean in the real world. Let's use CreditScore as an example. The credit score of a person is their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. So, a CreditScore of 1 is the highest since we are rescaling all the values to the range of 0-1.
Thus, When a dataset has varying ranges as in this credit card approvals dataset, one a small change in a particular feature may not have a significant effect on the other feature, which can cause a lot of problems when predictive modeling. Scaling helps to reduce the ranges and thus  effect change.

In [14]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler 

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)


In [15]:
# rescaledX_test = pd.DataFrame(rescaledX_test)
# print(rescaledX_test)

## Fitting a logistic regression model to the train set
Essentially, predicting if a credit card application will be approved or not is a classification task. According to UCI, our dataset contains more instances that correspond to "Denied" status than instances corresponding to "Approved" status. Specifically, out of 690 instances, there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got approved.

This gives us a benchmark. A good machine learning model should be able to accurately predict the status of the applications with respect to these statistics.

Which model should we pick? A question to ask is: are the features that affect the credit card approval decision process correlated with each other? Although we can measure correlation, that is outside the scope of this notebook, so we'll rely on our intuition that they indeed are correlated for now. Because of this correlation, we'll take advantage of the fact that generalized linear models perform well in these cases. Let's start our machine learning modeling with a Logistic Regression model (a generalized linear model).

In [16]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
reg = LogisticRegression()

# Fit logreg to the train set
reg.fit(rescaledX_train, y_train)

LogisticRegression()

## Making predictions and evaluating performance
We will now evaluate our model on the test set with respect to classification accuracy. But we will also take a look the model's confusion matrix. In the case of predicting credit card applications, it is equally important to see if our machine learning model is able to predict the approval status of the applications as denied that originally got denied. If our model is not performing well in this aspect, then it might end up approving the application that should have been approved. The confusion matrix helps us to view our model's performance from these aspects.

Confusion matrix is a Classification Metrics, used in classification problems in Machine Learning. The Confusion matrix is one of the evaluation metrics available to evaluate how well the algorithm is performing. We need them to evaluate algorithms as they check the performance of the classifier that is used in the algorithm. The classifiers help to build models e.g. Logistic Regression algorithm will use a Logistic Regression Classifier in Sklearn Python Library.
So we have;

TN FP,

FN TP, 

in the matrix result and it translates the classifier's performance as;
The Classifier’s Performance;
- It denotes the true negatives meaning the number of negative instances (denied applications) predicted by the model correctly(TN).
- It denotes the number of negative instances predicted by our model wrongly(FN).
- It denotes the true positives meaning the number of positive instances (approved applications) predicted by the model correctly(TP).
- It denotes the number of postive instances predicted by our model wrongly(this is not good, in the financial field)(FP).


In [17]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = reg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", reg.score(rescaledX_test, y_test))

# Print the confusion matrix of the logreg model
print(confusion_matrix(y_test, y_pred))

Accuracy of logistic regression classifier:  0.8421052631578947
[[94  9]
 [27 98]]


## Hyper parameter Tuning

### Grid searching and making the model perform better
Our model was pretty good! It was able to yield an accuracy score of just over 84%.

For the confusion matrix, the first element of the of the first row of the confusion matrix denotes the true negatives meaning the number of negative instances (denied applications) predicted by the model correctly. And the last element of the second row of the confusion matrix denotes the true positives meaning the number of positive instances (approved applications) predicted by the model correctly.

Let's see if we can do better. We can perform a grid search of the model parameters to improve the model's ability to predict credit card approvals.
scikit-learn's implementation of logistic regression consists of different hyperparameters but we will grid search over the following two:
- tol
- max_iter

GridSearchCV takes a dictionary that describes the parameters that could be tried on a model to train it. The grid of parameters is defined as a dictionary, where the keys are the parameters and the values are the settings to be tested.
1. estimator: Pass the model instance for which you want to check the hyperparameters.
2. params_grid: the dictionary object that holds the hyperparameters you want to try
3. scoring: evaluation metric that you want to use, you can simply pass a valid string/ object of evaluation metric
4. cv: number of cross-validation you have to try for each selected set of hyperparameters
5. verbose: you can set it to 1 to get the detailed print out while you fit the data to GridSearchCV
6. n_jobs: number of processes you wish to run in parallel for this task if it -1 it will use all available processors.

In [20]:
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol = tol, max_iter = max_iter)

## Finding the best performing model
We have defined the grid of hyperparameter values and converted them into a single dictionary format which GridSearchCV() expects as one of its parameters. Now, we will begin the grid search to see which values perform best.

We will instantiate GridSearchCV() with our earlier logreg model with all the data we have. Instead of passing train and test sets separately, we will supply X (scaled version) and y. We will also instruct GridSearchCV() to perform a cross-validation of five folds.
We'll end the notebook by storing the best-achieved score and the respective best parameters.
While building this credit card predictor, we tackled some of the most widely-known preprocessing steps such as scaling, label encoding, and missing value imputation. We finished with some machine learning to predict if a person's application for a credit card would get approved or not given some information about that person.

Grid searching is a process of finding an optimal set of values for the parameters of a certain machine learning model. This is often known as hyperparameter optimization which is an active area of research. Note that, here we have used the word parameters and hyperparameters interchangeably, but they are not exactly the same.

In [19]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=reg, param_grid=param_grid, cv=5)

# Use scaler to rescale X and assign it to rescaledX
rescaledX = scaler.fit_transform(X)

# Fit data to grid_model
grid_model_result = grid_model.fit(rescaledX, y)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

Best: 0.850725 using {'max_iter': 100, 'tol': 0.01}
