<p>In this notebook I'm goning to use data from the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository to practice a real use case of machine learning by building an automatic credit card approval predictor</p>
<p>By building this credit card predictor, I want to practice and showcase some of the most common preprocessing steps such as <strong>scaling</strong>, <strong>label encoding</strong>, and <strong>missing value imputation</strong>. At the end building a <strong>machine learning model</strong> to predict if a person's application for a credit card would get approved or not given some information about that person.</p>

# 1. Loading and viewing the dataset

In [21]:
# Import pandas
import pandas as pd

# Load dataset
cc_apps = pd.read_csv("datasets/cc_approvals.data", header=None)

# Inspect data
print(cc_apps.head())

  0      1      2  3  4  5  6     7  8  9   10 11 12     13   14 15
0  b  30.83  0.000  u  g  w  v  1.25  t  t   1  f  g  00202    0  +
1  a  58.67  4.460  u  g  q  h  3.04  t  t   6  f  g  00043  560  +
2  a  24.50  0.500  u  g  q  h  1.50  t  f   0  f  g  00280  824  +
3  b  27.83  1.540  u  g  w  v  3.75  t  t   5  t  g  00100    3  +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0  f  s  00120    0  +


<p>By loading and viewing the dataset. We find that the dataset has anonymized the feature names.</p>
<p>On the website in the <b>Data Set Information</b> part. We can read that <i>This file concerns credit card applications. All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.</i> </p>

# 2. Checking if there are dataset issues that need to be fixed

In [22]:
# Summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<p>Notice that the dataset contains values from several ranges</p>

In [23]:
# DataFrame information
cc_apps_info = cc_apps.info()
print(cc_apps_info)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
0     690 non-null object
1     690 non-null object
2     690 non-null float64
3     690 non-null object
4     690 non-null object
5     690 non-null object
6     690 non-null object
7     690 non-null float64
8     690 non-null object
9     690 non-null object
10    690 non-null int64
11    690 non-null object
12    690 non-null object
13    690 non-null object
14    690 non-null int64
15    690 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 86.3+ KB
None


<p>The dataset has a mixture of both numerical and non-numerical features. The features 2, 7, 10 and 14 contain numeric values (of types float64, float64, int64 and int64 respectively) and all the other features contain non-numeric values.</p>

In [24]:
# Inspecting missing values in the dataset
print(cc_apps.tail(17))

    0      1       2  3  4   5   6      7  8  9   10 11 12     13   14 15
673  ?  29.50   2.000  y  p   e   h  2.000  f  f   0  f  g  00256   17  -
674  a  37.33   2.500  u  g   i   h  0.210  f  f   0  f  g  00260  246  -
675  a  41.58   1.040  u  g  aa   v  0.665  f  f   0  f  g  00240  237  -
676  a  30.58  10.665  u  g   q   h  0.085  f  t  12  t  g  00129    3  -
677  b  19.42   7.250  u  g   m   v  0.040  f  t   1  f  g  00100    1  -
678  a  17.92  10.210  u  g  ff  ff  0.000  f  f   0  f  g  00000   50  -
679  a  20.08   1.250  u  g   c   v  0.000  f  f   0  f  g  00000    0  -
680  b  19.50   0.290  u  g   k   v  0.290  f  f   0  f  g  00280  364  -
681  b  27.83   1.000  y  p   d   h  3.000  f  f   0  f  g  00176  537  -
682  b  17.08   3.290  u  g   i   v  0.335  f  f   0  t  g  00140    2  -
683  b  36.42   0.750  y  p   d   v  0.585  f  f   0  f  g  00240    3  -
684  b  40.58   3.290  u  g   m   v  3.500  f  f   0  t  s  00400    0  -
685  b  21.08  10.085  y  p   e   h  1

<p>The dataset has missing values. The missing entries in the dataset are labeled with '?'</p>

## 3. Handling the missing values

In [25]:
# Import numpy
import numpy as np

# Replacing the '?'s with NaN
cc_apps = cc_apps.replace('?', np.nan)

# Inspecting the missing values again
print(cc_apps.tail(17))

      0      1       2  3  4   5   6      7  8  9   10 11 12     13   14 15
673  NaN  29.50   2.000  y  p   e   h  2.000  f  f   0  f  g  00256   17  -
674    a  37.33   2.500  u  g   i   h  0.210  f  f   0  f  g  00260  246  -
675    a  41.58   1.040  u  g  aa   v  0.665  f  f   0  f  g  00240  237  -
676    a  30.58  10.665  u  g   q   h  0.085  f  t  12  t  g  00129    3  -
677    b  19.42   7.250  u  g   m   v  0.040  f  t   1  f  g  00100    1  -
678    a  17.92  10.210  u  g  ff  ff  0.000  f  f   0  f  g  00000   50  -
679    a  20.08   1.250  u  g   c   v  0.000  f  f   0  f  g  00000    0  -
680    b  19.50   0.290  u  g   k   v  0.290  f  f   0  f  g  00280  364  -
681    b  27.83   1.000  y  p   d   h  3.000  f  f   0  f  g  00176  537  -
682    b  17.08   3.290  u  g   i   v  0.335  f  f   0  t  g  00140    2  -
683    b  36.42   0.750  y  p   d   v  0.585  f  f   0  f  g  00240    3  -
684    b  40.58   3.290  u  g   m   v  3.500  f  f   0  t  s  00400    0  -
685    b  21

<p>Imputing the missing values with mean imputation</p>

In [26]:
# Replacing missing values present in the numeric columns
cc_apps.fillna(cc_apps.mean(), inplace=True)

# Count of the number of NaNs in the dataset
cc_apps.isna().sum().sum()

67

<p>For the columns that contain non-numeric data I'm going to impute the missing values with the most frequent values as present in the respective columns<p/>

In [27]:
for col in cc_apps.columns:
    # Check if the column is of object type
    if cc_apps[col].dtype == 'object':
        # Impute with the most frequent value
        cc_apps[col] = cc_apps[col].fillna(cc_apps[col].value_counts().max())

# Count of the number of NaNs in the dataset
cc_apps.isna().sum().sum()

0

# 4. Preprocessing the data

## 4.1 Converting non-numeric values into numeric
<p>I'll do this by using the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html">label encoding</a> technique which encode labels with a value between 0 and n_classes-1</p>

In [28]:
# Import LabelEncoder
from sklearn import preprocessing

# Instantiate LabelEncoder
le = preprocessing.LabelEncoder()

# Iterate over all the values of each column and extract their dtypes
for col in cc_apps.columns.values:
    # Compare if the dtype is object
    if cc_apps[col].dtypes=='object':
    # Use LabelEncoder to do the numeric transformation
        cc_apps[col]=le.fit_transform(list(cc_apps[col]))

In [29]:
print(cc_apps.dtypes)

0       int64
1       int64
2     float64
3       int64
4       int64
5       int64
6       int64
7     float64
8       int64
9       int64
10      int64
11      int64
12      int64
13      int64
14      int64
15      int64
dtype: object


In [30]:
cc_apps.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,2,156,0.0,2,1,13,8,1.25,1,1,1,0,0,68,0,0
1,1,328,4.46,2,1,11,4,3.04,1,1,6,0,0,11,560,0
2,1,89,0.5,2,1,11,4,1.5,1,0,0,0,0,96,824,0
3,2,125,1.54,2,1,13,8,3.75,1,1,5,1,0,31,3,0
4,2,43,5.625,2,1,13,8,1.71,1,0,0,0,2,37,0,0


## 4.2 Splitting the dataset into train and test sets

<p>Dividing the data into training sets and test sets to prepare the data for modeling. Information from the test data should not be used to scale the training data or be used to drive the training process of a machine learning model. Therefore, I first divide the data and then apply the scaling<p/>

In [31]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# convert the DataFrame to a NumPy array
cc_apps = cc_apps.values

# Segregate features and labels into separate variables
X,y = cc_apps[:,0:14] , cc_apps[:,15]

# # Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## 4.3 Scaling the feature values to a uniform range

In [32]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)

# 5. Fitting a logistic regression model to the train set

<p>Predicting if a credit card application will be approved or not is a classification task. <a href="http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/">According to the source of the data UCI</a>, the dataset contains more instances corresponding to the "Denied" status than the instances corresponding to the "Approved" status. Specifically, of 690 cases, there are 383 (55.5%) requests that were denied and 307 (44.5%) requests that were approved.</p>
<p>The model should be able to accurately predict the status of the applications with respect to these statistics.</p>
<p>Starting to model with Logistic Regression</p>

In [33]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression(solver="lbfgs")

# Fit logreg to the train set
logreg.fit(rescaledX_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

# 6. Making predictions and evaluating performance

<p>In the case of predicting credit card applications, it is equally important to see if our model is able to predict the approval status of the applications as well as the denied that originally got denied.</p>

<p>Evaluating the model on the test set with respect to classification accuracy<p/>

In [34]:
# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test, y_test))

Accuracy of logistic regression classifier:  0.8333333333333334


<p>Evaluating the model on the test set with respect to confusion matrix</p>

In [35]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Print the confusion matrix of the logreg model
confusion_matrix(y_test, y_pred)

array([[92, 11],
       [27, 98]], dtype=int64)

<p>The first element of the of the first row of the confusion matrix denotes the true negatives meaning the number of negative instances (denied applications) predicted by the model correctly. And the last element of the second row of the confusion matrix denotes the true positives meaning the number of positive instances (approved applications) predicted by the model correctly.</p>

# 7. Improving the model

<p>Performing a grid search of the model parameters "tol" and "max_iter" to improve the model's ability to predict credit card approvals.</p>

In [16]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Defining the grid of values
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol=tol, max_iter=max_iter)
param_grid

{'tol': [0.01, 0.001, 0.0001], 'max_iter': [100, 150, 200]}

In [17]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Use scaler to rescale X and assign it to rescaledX
rescaledX = scaler.fit_transform(X)

# Fit data to grid_model
grid_model_result = grid_model.fit(rescaledX, y)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

Best: 0.863636 using {'max_iter': 100, 'tol': 0.01}


<p>The model is good and it was able to yield an accuracy score of almost 84%.</p>