# Credit Card Approvals Prediction
 An automatic credit card approval predictor using machine learning techniques in Python.

We'll use the Credit Card Approval dataset from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/credit+approval

Since this data is confidential, the contributor of the dataset has anonymized the feature names:

In [84]:
import pandas as pd

cc_apps = pd.read_csv('datasets/cc_approvals.data', header=None)

#The first five rows:
cc_apps.head()

   0      1      2  3  4  5  6     7  8  9  10  11  12   13   14  15
0  b  30.83    0.0  u  g  w  v  1.25  t  t   1   f   g  202    0   +
1  a  58.67   4.46  u  g  q  h  3.04  t  t   6   f   g   43  560   +
2  a   24.5    0.5  u  g  q  h   1.5  t  f   0   f   g  280  824   +
3  b  27.83   1.54  u  g  w  v  3.75  t  t   5   t   g  100    3   +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0   f   s  120    0   +


The last column (15) is our target variable: Approval Status.
Additionally, a correlation analysis (outside of this report) showed that two of the features, columns 11 and 13, are not important for predicting the Approval Status. We'll drop them:

In [85]:
# Drop the features 11 and 13
cc_apps = cc_apps.drop([11, 13], axis=1)

# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

print('\n')

# Print DataFrame information (info() prints its report directly and returns None)
cc_apps.info()

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 no

# Preprocessing

We'll reserve 33% of our data for the final testing of accuracy:

In [86]:
from sklearn.model_selection import train_test_split

# Split into train and test sets
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size=0.33, random_state=42)
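
To make the size of the hold-out set concrete, here is a minimal sketch on a hypothetical stand-in frame (`demo` and its columns are invented for illustration) with the same number of rows as `cc_apps`. It only checks how many rows a 33% split produces and that the two parts do not overlap.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same number of rows as cc_apps (690)
demo = pd.DataFrame({"x": np.arange(690), "y": np.arange(690) % 2})

demo_train, demo_test = train_test_split(demo, test_size=0.33, random_state=42)

# With a float test_size, scikit-learn rounds the test set size up:
# ceil(0.33 * 690) = 228 test rows, and the remaining 462 rows go to train
print(len(demo_train), len(demo_test))  # prints: 462 228

# No row ends up in both parts
print(set(demo_train.index).isdisjoint(demo_test.index))  # prints: True
```

The 228 test rows match the total of the confusion matrix reported at the end of this notebook (103 + 125).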

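Some of the values in the dataset are missing and are marked with '?'. They occur only in columns that pandas read as `object`; this includes column 1, which holds numeric values such as 30.83 but was parsed as text. We'll impute the missing values with the most frequent value of the respective column, computed on the train set only so that no information from the test set is used.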

In [87]:
import numpy as np

# Replace the '?'s with NaN in the train and test sets
cc_apps_train = cc_apps_train.replace('?', np.nan)
cc_apps_test = cc_apps_test.replace('?', np.nan)

# Iterate over each column of cc_apps_train
for col in cc_apps_train.columns:
    # Check if the column is of object type
    if cc_apps_train[col].dtypes == 'object':
        # Most frequent value of this column in the train set
        most_frequent = cc_apps_train[col].value_counts().index[0]
        # Impute this column only (not the whole frame), in both sets
        cc_apps_train[col] = cc_apps_train[col].fillna(most_frequent)
        cc_apps_test[col] = cc_apps_test[col].fillna(most_frequent)


# Count the number of missing values
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
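
The imputation idea can be checked on a small example. The sketch below uses hypothetical toy frames (`toy_train`, `toy_test` and their values are invented for illustration) and fills each non-numeric column with that column's most frequent training value, so a fill value from one column never ends up in another. It selects columns with `select_dtypes` rather than comparing against `'object'`; that is a choice made here, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing entries in the text columns only
toy_train = pd.DataFrame({
    "a": ["x", "x", np.nan, "y"],
    "b": ["p", np.nan, "q", "q"],
    "n": [1.0, 2.0, 3.0, 4.0],
})
toy_test = pd.DataFrame({
    "a": [np.nan, "y"],
    "b": [np.nan, "p"],
    "n": [5.0, 6.0],
})

# Loop over the non-numeric columns only
for col in toy_train.select_dtypes(exclude="number").columns:
    # Most frequent value of this column, taken from the train set
    most_frequent = toy_train[col].mode().iloc[0]
    # Fill this column in both sets with the train value
    toy_train[col] = toy_train[col].fillna(most_frequent)
    toy_test[col] = toy_test[col].fillna(most_frequent)

print(toy_train)
print(toy_test)
```

Column `a` is filled with `x` and column `b` with `q`, in both frames, while the numeric column `n` is left untouched.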


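Now we'll convert all the non-numeric values into numeric ones by one-hot encoding every `object` column. This includes column 1, whose numeric values are stored as text, and the target column 15, which becomes the two indicator columns `15_+` and `15_-`: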

In [88]:
# Convert the categorical features in the train and test sets independently
cc_apps_train = pd.get_dummies(cc_apps_train)
cc_apps_test = pd.get_dummies(cc_apps_test)

# Reindex the columns of the test set aligning with the train set
cc_apps_test = cc_apps_test.reindex(columns=cc_apps_train.columns, fill_value=0)

cc_apps_train.head()

         2     7  10    14  0_a  0_b  1_13.75  1_15.83  1_15.92  1_16.00  ...  6_z  8_f  8_t  9_f  9_t  12_g  12_p  12_s  15_+  15_-
382    2.5   4.5   0   456    1    0        0        0        0        0  ...    0    1    0    1    0     1     0     0     0     1
137   2.75  4.25   6     0    0    1        0        0        0        0  ...    0    0    1    0    1     1     0     0     1     0
346    1.5  0.25   0   122    0    1        0        0        0        0  ...    0    1    0    1    0     1     0     0     0     1
326  1.085  0.04   0   179    0    1        0        0        0        0  ...    0    1    0    1    0     1     0     0     0     1
33   5.125   5.0   0  4000    1    0        0        0        0        0  ...    0    0    1    1    0     1     0     0     1     0
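
Encoding the two sets independently and then reindexing has a side effect that is easy to miss. The sketch below, on hypothetical toy frames (all names and values are invented for illustration), shows what happens to a category that appears only in the test set and to one that appears only in the train set.

```python
import pandas as pd

# Hypothetical toy data: "green" occurs only in test, "blue" only in train
toy_train = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})
toy_test = pd.DataFrame({"color": ["green", "red"], "size": [4, 5]})

# One-hot encode each set on its own
train_enc = pd.get_dummies(toy_train)
test_enc = pd.get_dummies(toy_test)

# Align the test columns to the train columns:
# columns missing in test are added and filled with 0,
# columns unknown to train are dropped
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))  # ['size', 'color_blue', 'color_red']
print(test_enc)
```

The `color_green` column is dropped, so the first test row has zeros in every `color_*` indicator, and `color_blue` is added to the test set as a column of zeros. The model therefore sees an unseen category as "none of the known ones".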


Next, we'll scale the feature values to a uniform range:

In [98]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

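# Segregate features and labels into separate variables.
# The last column is the '15_-' indicator and is used as the label; every other
# column, including the '15_+' indicator, stays in the feature matrix.
X_train, y_train = cc_apps_train.iloc[:, :-1].values, cc_apps_train.iloc[:, [-1]].values
X_test, y_test = cc_apps_test.iloc[:, :-1].values, cc_apps_test.iloc[:, [-1]].values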

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
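
The scaler is fitted on the train set and only applied to the test set. A minimal sketch with hypothetical one-column arrays (the values are invented for illustration) shows what this means in practice: the test set is mapped with the minimum and maximum learned from train, so its values are not guaranteed to fall inside the 0 to 1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature: train spans 0..10, test contains a larger value
train_col = np.array([[0.0], [5.0], [10.0]])
test_col = np.array([[2.5], [20.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_train = scaler.fit_transform(train_col)  # learns min=0 and max=10 from train
scaled_test = scaler.transform(test_col)        # reuses the train min and max

print(scaled_train.ravel())  # values 0.0, 0.5, 1.0
print(scaled_test.ravel())   # values 0.25, 2.0 (the second is outside [0, 1])
```

Fitting on train only keeps information about the test set out of the preprocessing, at the cost of occasional out-of-range values.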

# Machine Learning and evaluation

We'll use a Logistic Regression model, grid searching over a few parameters (`tol` and `max_iter`) with 5-fold cross-validation:

In [118]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]
param_grid = {'tol': tol, 'max_iter': max_iter}

# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Fit grid_model to the data (ravel() flattens the label column to a 1-D array)
grid_model_result = grid_model.fit(rescaledX_train, y_train.ravel())

# Summarize results (best_score_ is the mean cross-validated accuracy over the 5 folds of the train set)
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best result from a grid search is %f (percent of accuracy on the train set), using the following parameters: %s." % (best_score * 100, best_params))

# Extract the best model and evaluate it on the test set
best_model = grid_model_result.best_estimator_
print("Accuracy of the model on the test set (%): ", best_model.score(rescaledX_test, y_test) * 100) 

# Predict instances from the test set and store it
y_pred = best_model.predict(rescaledX_test)

# Print the confusion matrix of the logreg model
from sklearn.metrics import confusion_matrix
print('Confusion matrix (test set):')
confusion_matrix(y_test, y_pred)

Best result from a grid search is 100.000000 (percent of accuracy on the train set), using the following parameters: {'max_iter': 100, 'tol': 0.01}.
Accuracy of the model on the test set (%):  100.0
Confusion matrix (test set):


array([[103,   0],
       [  0, 125]], dtype=int64)
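
For reading a binary confusion matrix, the sketch below uses hypothetical labels (`y_true` and `y_hat` are invented for illustration) and unpacks the four cells. In scikit-learn, rows are true classes and columns are predicted classes, both in sorted label order. In this notebook the label is the `15_-` indicator, so class 1 corresponds to a rejected application.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_hat)

# For labels {0, 1} the flattened order is: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()

print(cm)
print(tn, fp, fn, tp)  # prints: 2 1 1 2
print(accuracy)
```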

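As we can see above, the confusion matrix shows 0 false negatives and 0 false positives on the test set, i.e. 100% accuracy. This result should not be read as evidence that the model can replace human evaluation, though. `pd.get_dummies` encoded the target column 15 into the two indicators `15_+` and `15_-`, and `iloc[:, :-1]` removed only `15_-` from the features, so `15_+` (the exact complement of the label) was still among the inputs. The perfect score therefore reflects target leakage, and the model's real predictive accuracy has to be measured with both target indicators excluded from the features.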
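
Because `pd.get_dummies` was applied to the whole frame, the target column 15 became two indicator columns, and slicing off only the last column leaves the other one in the feature matrix. The sketch below reproduces this on a hypothetical toy frame (`toy`, `income` and `status` are invented names) and shows one possible way to keep every target indicator out of the features. It is an illustration under these assumptions, not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy data with a '+'/'-' target, as in column 15
toy = pd.DataFrame({
    "income": [10, 200, 30, 400],
    "status": ["+", "-", "+", "-"],
})
encoded = pd.get_dummies(toy)  # columns: income, status_+, status_-

# Slicing off only the last column leaves 'status_+' among the features
leaky_features = encoded.iloc[:, :-1]
print(list(leaky_features.columns))  # ['income', 'status_+']

# Select the target indicators by name and drop all of them from the features
target_cols = [c for c in encoded.columns if c.startswith("status_")]
X = encoded.drop(columns=target_cols)
y = encoded["status_-"].astype(int)

print(list(X.columns))  # ['income']
print(y.tolist())       # [0, 1, 0, 1]
```

An alternative with the same effect is to separate the target column from the frame before calling `pd.get_dummies`, so that it is never encoded together with the features.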