# Credit Card Approval Predictor

In this notebook, I build an automatic credit card approval predictor with machine learning, using a small subset of the credit card applications a bank receives.

In [25]:
#data imports
import pandas as pd
import numpy as np

#visualization imports
import matplotlib.pyplot as plt
import plotly.express as px

#ML imports
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

In [26]:
#load the data
cc_apps = pd.read_csv('cc_approvals.data', header=None)
cc_apps.head()

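|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |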


## 1. Explore and Clean Up Data

In [27]:
#explore data
cc_apps.describe()

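|       | 2 | 7 | 10 | 14 |
|-------|---|---|----|----|
| count | 690.0 | 690.0 | 690.0 | 690.0 |
| mean  | 4.758725 | 2.223406 | 2.4 | 1017.385507 |
| std   | 4.978163 | 3.346513 | 4.86294 | 5210.102598 |
| min   | 0.0 | 0.0 | 0.0 | 0.0 |
| 25%   | 1.0 | 0.165 | 0.0 | 0.0 |
| 50%   | 2.75 | 1.0 | 0.0 | 5.0 |
| 75%   | 7.2075 | 2.625 | 3.0 | 395.5 |
| max   | 28.0 | 28.5 | 67.0 | 100000.0 |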


In [28]:
cc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


In [29]:
cc_apps.columns

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype='int64')

In [30]:
#drop columns 11 and 13 as they are not necessary for this task
cc_apps.drop([11,13], axis=1, inplace=True)

## 2. Split Dataset into Test and Training Data

In [31]:
# Split into train and test sets
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size=0.33, random_state=42)

## 3. Handle Missing Values

In [32]:
# Replace the '?'s with NaN in the train and test sets
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
cc_apps_train_nans_replaced = cc_apps_train.replace("?", np.nan)
cc_apps_test_nans_replaced = cc_apps_test.replace("?", np.nan)
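
Before imputing, it can help to confirm how many values the replacement actually marked as missing. The following is a minimal sketch on a hypothetical three-row frame (`toy` and its values are made up for illustration and are not taken from `cc_approvals.data`):

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for cc_apps; "?" marks a missing entry
toy = pd.DataFrame({
    0: ["b", "?", "a"],
    1: ["30.83", "58.67", "?"],
    2: [0.0, 4.46, 0.5],
})

# Same replacement as in the notebook cell
toy_nans = toy.replace("?", np.nan)

# Count the missing entries per column after the replacement
missing_per_column = toy_nans.isna().sum()
print(missing_per_column)
```

Running the same `isna().sum()` call on `cc_apps_train_nans_replaced` would show which columns need imputation.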

In [33]:
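#fill missing numeric values with the training-set mean
#(the test set reuses the training mean so that no test statistics leak into preprocessing)
train_means = cc_apps_train_nans_replaced.mean(numeric_only=True)
cc_apps_train_imputed = cc_apps_train_nans_replaced.fillna(train_means)
cc_apps_test_imputed = cc_apps_test_nans_replaced.fillna(train_means)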

In [34]:
#handle missing values in object-type columns by filling each column with its own most common value
#(np.object was removed from NumPy, so the built-in object is used for the dtype check)

for column_name, dtype in cc_apps_train_imputed.dtypes.items():
    if dtype == object:
        #most frequent value in the training column, reused for the test column
        most_frequent = cc_apps_train_imputed[column_name].value_counts().index[0]
        cc_apps_train_imputed[column_name] = cc_apps_train_imputed[column_name].fillna(
            most_frequent
        )
        cc_apps_test_imputed[column_name] = cc_apps_test_imputed[column_name].fillna(
            most_frequent
        )
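
As an alternative to a hand-written loop, scikit-learn's `SimpleImputer` with `strategy="most_frequent"` can learn the most common value from the training data and apply it to the test data. This is a sketch of that swapped-in technique, not the approach used in this notebook; the `train` and `test` frames and the `kind` column are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with missing entries
train = pd.DataFrame({"kind": ["u", "u", "y", np.nan]}, dtype=object)
test = pd.DataFrame({"kind": [np.nan, "y"]}, dtype=object)

# Learn the most frequent value on the training data only
imputer = SimpleImputer(strategy="most_frequent")
train_filled = imputer.fit_transform(train)

# Apply the value learned from training to the test data
test_filled = imputer.transform(test)
print(test_filled)
```

Fitting on the training frame and only transforming the test frame keeps the test set out of the statistics used for imputation.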

## 4. Preprocess The Data

In [35]:
#one-hot encode training and test dataset
cc_apps_test_cat_encoding = pd.get_dummies(cc_apps_test_imputed)
cc_apps_train_cat_encoding = pd.get_dummies(cc_apps_train_imputed)

In [36]:
#align the test columns with the training columns; dummy columns missing from the test set are filled with 0
cc_apps_test_cat_encoding = cc_apps_test_cat_encoding.reindex(columns=cc_apps_train_cat_encoding.columns, fill_value=0)
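
The reason for the reindex step is that one-hot encoding the two sets separately can produce different columns when a category appears in only one of them. A small sketch with hypothetical frames (`train`, `test`, and the `kind` and `amount` columns are illustrative only):

```python
import pandas as pd

# Hypothetical data: "l" and "y" appear only in train, "z" only in test
train = pd.DataFrame({"kind": ["u", "y", "l"], "amount": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"kind": ["u", "z"], "amount": [4.0, 5.0]})

train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

# Force the test frame into the training layout:
# columns unseen in test are added as 0, columns unseen in train are dropped
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(list(test_aligned.columns))
```

After alignment both frames have the same columns in the same order, which a fitted scikit-learn model requires.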

## 5. Segregating Features and Labels & Feature Rescaling

In [37]:
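#segregate test and train data
#NOTE: iloc[:, -1] and iloc[:, [-1]] both select the last column (a one-hot
#encoded label column), so X and y hold the same values here
X_train, y_train = (
    cc_apps_train_cat_encoding.iloc[:, -1].values,
    cc_apps_train_cat_encoding.iloc[:, [-1]].values,
)
X_test, y_test = (
    cc_apps_test_cat_encoding.iloc[:,-1].values,
    cc_apps_test_cat_encoding.iloc[:,[-1]].values,
)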

In [38]:
#check the shapes: the features are 1-D here and need a 2-D shape before scaling
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((462,), (228,), (462, 1), (228, 1))
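
The shapes above show that `X_train` holds a single column. If the intent is to use every encoded column except the label as features, one possible split is sketched below. It assumes the label dummies are named `15_+` and `15_-`, which is how `pd.get_dummies` would name the categories of column 15; the `encoded` frame and its values are hypothetical:

```python
import pandas as pd

# Hypothetical encoded frame; "15_+" and "15_-" mimic the label dummy columns
encoded = pd.DataFrame({
    "2": [0.0, 4.46, 0.5, 1.54],
    "8_f": [0, 0, 1, 0],
    "8_t": [1, 1, 0, 1],
    "15_+": [1, 1, 0, 1],
    "15_-": [0, 0, 1, 0],
})

# Every column derived from the label must stay out of the feature matrix
label_columns = [c for c in encoded.columns if str(c).startswith("15_")]

X = encoded.drop(columns=label_columns).values  # 2-D feature matrix
y = encoded["15_+"].values                      # 1-D label vector (1 = approved)
print(X.shape, y.shape)
```

With a split like this the feature matrix is already 2-D, so no reshape is needed before scaling.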

In [39]:
#reshape features
X_train = X_train.reshape(-1,1)
X_test = X_test.reshape(-1,1)

In [48]:
#rescale training and testing features
scaler = MinMaxScaler(feature_range=(0,1))
rescaled_Xtrain = scaler.fit_transform(X_train)
rescaled_Xtest = scaler.transform(X_test)
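
`MinMaxScaler` maps each feature to `(x - min) / (max - min)`, where `min` and `max` come from the data passed to `fit`. Because the scaler is fitted on the training set only, test values outside the training range can land outside `[0, 1]`. A minimal sketch with hypothetical arrays:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data; training range is [0, 10]
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy = np.array([[2.5], [12.0]])

scaler_toy = MinMaxScaler(feature_range=(0, 1))
scaled_train = scaler_toy.fit_transform(X_train_toy)  # learns min=0, max=10
scaled_test = scaler_toy.transform(X_test_toy)        # reuses the training min and max

print(scaled_train.ravel())
print(scaled_test.ravel())
```

Here 12.0 lies above the training maximum, so its scaled value exceeds 1.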

## 6. Logistic Regression Model

In [41]:
#instantiate logistic regression classifier
logreg = LogisticRegression()

In [42]:
#fit logreg on train set (ravel() passes the labels as a 1-D array, which scikit-learn expects)
logreg.fit(rescaled_Xtrain, y_train.ravel())


In [43]:
#make predictions on scaled variable
y_pred = logreg.predict(rescaled_Xtest)

In [44]:
#evaluate logreg classifier (confusion_matrix expects the true labels first, then the predictions)
confusion_matrix(y_test, y_pred)

array([[103,   0],
       [  0, 125]])

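The confusion matrix shows no false positives or false negatives. A perfect result like this is more likely a sign of target leakage than of an ideal model: in step 5, `X_train` and `y_train` are both sliced from the last column (`iloc[:, -1]`), so the classifier is trained on the label itself. Once the features exclude the label columns, precision, recall, and F1-score are worth checking alongside the confusion matrix, especially on real-world data that is not this clean.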
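
Precision, recall, and F1-score are available as functions in `sklearn.metrics`. The sketch below uses hypothetical toy labels rather than this notebook's predictions, so the numbers it prints say nothing about the credit card model:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 2 true positives, 1 false negative, 1 false positive, 2 true negatives
y_true_toy = [1, 1, 1, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 0, 1]

precision = precision_score(y_true_toy, y_pred_toy)  # TP / (TP + FP)
recall = recall_score(y_true_toy, y_pred_toy)        # TP / (TP + FN)
f1 = f1_score(y_true_toy, y_pred_toy)                # harmonic mean of the two

print(precision, recall, f1)
```

In the notebook, the same calls would take `y_test.ravel()` and `y_pred` as arguments.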

## 7. Hyperparameter Check & Model Performance

In [45]:
#initialise the tol and max_iter values for the parameter grid
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]
param_grid = dict(tol=tol, max_iter=max_iter)

In [46]:
#instantiate a hyperparameter tuning function with 5-fold cross validation
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)
grid_model

In [49]:
#fit the grid search on the train set (labels passed as a 1-D array)
grid_model_result = grid_model.fit(rescaled_Xtrain, y_train.ravel())

In [50]:
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

Best: 1.000000 using {'max_iter': 100, 'tol': 0.01}


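Among the hyperparameter combinations tested, the best cross-validated score, 1.000000, was reported for max_iter=100 and tol=0.01. Because 1.0 is the maximum possible score, other combinations may tie with it, so this result says little about which setting is preferable.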
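
To see whether several combinations share the top score, the per-combination results stored in `cv_results_` can be inspected. This is a minimal sketch on synthetic data from `make_classification` (the data and the names `X_toy`, `y_toy`, and `search` are assumptions for illustration, not the notebook's variables):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the credit card applications
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid_toy = {"tol": [0.01, 0.001, 0.0001], "max_iter": [100, 150, 200]}
search = GridSearchCV(LogisticRegression(), param_grid=param_grid_toy, cv=5)
search.fit(X_toy, y_toy)

# One row per parameter combination, with its mean cross-validated score and rank
results = pd.DataFrame(search.cv_results_)[
    ["param_max_iter", "param_tol", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

Several rows with rank 1 would mean the grid search could not distinguish between those settings.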

In [52]:
# Extract the best model and evaluate it on the test set
best_model = grid_model_result.best_estimator_
print(
    "Accuracy of logistic regression classifier: ",
    best_model.score(rescaled_Xtest, y_test),
)

Accuracy of logistic regression classifier:  1.0
