![Credit card being held in hand](credit_card.jpg)

Commercial banks receive _a lot_ of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report, for example. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

### The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. This dataset has been loaded as a `pandas` DataFrame called `cc_apps`. The last column in the dataset is the target value.

In [14]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

# Load the dataset
cc_apps = pd.read_csv("cc_approvals.data", header=None) 

# Seeing first few rows
cc_apps.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,g,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,g,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,g,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,g,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,s,0,+


# **Step 1: Inspecting our Data**
The first step before analyzing any dataset, is to inspect our data. We need to learn more about our data. We can see the type of data we have, our summary statistics, and check the missing values

In [15]:
# Getting data types
print(cc_apps.info())

# Looking at all 17 rows to see if we are missing data
print(cc_apps.tail(17))

# Getting summary stats
cc_apps.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    int64  
 13  13      690 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 75.6+ KB
None
    0      1       2  3  4   5   6      7  8  9   10 11   12 13
673  ?  29.50   2.000  y  p   e   h  2.000  f  f   0  g   17  -
674  a  37.33   2.500  u  g   i   h  0.210  f  f   0  g  246  -
675  a  41.58   1.040  u  g  aa   v  0.665

Unnamed: 0,2,7,10,12
count,690.0,690.0,690.0,690.0
mean,4.758725,2.223406,2.4,1017.385507
std,4.978163,3.346513,4.86294,5210.102598
min,0.0,0.0,0.0,0.0
25%,1.0,0.165,0.0,0.0
50%,2.75,1.0,0.0,5.0
75%,7.2075,2.625,3.0,395.5
max,28.0,28.5,67.0,100000.0


# **Step 2: Splitting Data**

After inspecting our dataset, we can finally split our data to create a training and testing dataset. We will need these different datasets to create our machine learning model.

In [16]:
# Importing splitting method
from sklearn.model_selection import train_test_split

# Splitting into train and test
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size = .33, random_state = 42)

# **Step 3: Handling Missing Values**

Now, after we have split our data we can finally handle our issues: the missing data. As we saw when we looked at our data, these missing values are labled with "?". The first step we can do to combat this is changing the ? to NaN values, then we can replace the Na values with the mean.  

In [17]:
# Filling in missing values for NaN
cc_apps_train= cc_apps_train.replace('?', np.NaN)
cc_apps_test= cc_apps_test.replace('?', np.NaN)

# Checking to see if "?" was removed
print(cc_apps_train.tail(17))
print(cc_apps_test.tail(17))

# Replacing NaNs with means
cc_apps_train.fillna(cc_apps_train.mean(), inplace=True)
cc_apps_test.fillna(cc_apps_train.mean(), inplace=True)

# Count the number of NaNs in the datasets and print the counts to verify
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum()) 

# Checking to see what other missing variables we have
print(cc_apps_train.info())
print(cc_apps_test.info())

    0      1       2    3    4    5    6       7  8  9   10 11    12 13
130  b  67.75   5.500    u    g    e    z  13.000  t  t   1  g     0  +
663  b  32.08   4.000    y    p   cc    v   1.500  f  f   0  g     0  -
99   a  28.50   1.000    u    g    q    v   1.000  t  t   2  g   500  -
372  a  45.00   4.585    u    g    k    h   1.000  f  f   0  s     0  -
87   b  25.67   2.210    y    p   aa    v   4.000  t  f   0  g     0  -
458  b  36.17   5.500    u    g    i   bb   5.000  f  f   0  g   687  -
330  b  20.42   0.000  NaN  NaN  NaN  NaN   0.000  f  f   0  p     0  -
214  b  26.67   2.710    y    p   cc    v   5.250  t  t   1  g     0  +
466  b  31.08   3.085    u    g    c    v   2.500  f  t   2  g    41  -
121  b  25.67  12.500    u    g   cc    v   1.210  t  t  67  g   258  +
614  a  38.33   4.415    u    g    c    v   0.125  f  f   0  g     0  -
20   b  25.00  11.250    u    g    c    v   2.500  t  t  17  g  1208  +
71   b  34.83   4.000    u    g    d   bb  12.500  t  f   0  g  

As seen in our data,we still have a number of missing values. This happens to be because some of these variables are not numerical. We can combat this by first changing all the NaNs to a numerical value, by just replacing it with the most frequest value.

In [18]:
# Iterate over each column of cc_apps_train
for col in cc_apps_train.columns:
    # Check if the column is of object type
    if cc_apps_train[col].dtypes == 'object':
        # Impute with the most frequent value
        cc_apps_train = cc_apps_train.fillna(cc_apps_train[col].value_counts().index[0])
        cc_apps_test = cc_apps_test.fillna(cc_apps_train[col].value_counts().index[0])

# Count the number of NaNs in the dataset and print the counts to verify
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())

# checking data types
print(cc_apps_train.info())
print(cc_apps_test.info())

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
dtype: int64
0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 462 entries, 382 to 102
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       462 non-null    object 
 1   1       462 non-null    object 
 2   2       462 non-null    float64
 3   3       462 non-null    object 
 4   4       462 non-null    object 
 5   5       462 non-null    object 
 6   6       462 non-null    object 
 7   7       462 non-null    float64
 8   8       462 non-null    object 
 9   9       462 non-null    object 
 10  10      462 non-null    int64  
 11  11      462 non-null    object 
 12  12      462 non-null    int64  
 13  13      462 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usa

# **Step 4: Preprocessing Data**

Now, we solved our missing values proble. However, as seen above, our data is still not all numeric. Before we can build our machine learning model, we need to preprocess our data. We need to change it to numeric

In [19]:
# Convert non-numeric data in  train and test
cc_apps_train = pd.get_dummies(cc_apps_train)
cc_apps_test = pd.get_dummies(cc_apps_test)

# Reindex the columns of the test set aligning with the train set
cc_apps_test = cc_apps_test.reindex(columns=cc_apps_train.columns, fill_value=0)

# checking data types
print(cc_apps_train.info())
print(cc_apps_test.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 462 entries, 382 to 102
Columns: 334 entries, 2 to 13_-
dtypes: float64(2), int64(2), uint8(330)
memory usage: 166.9 KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 228 entries, 286 to 311
Columns: 334 entries, 2 to 13_-
dtypes: float64(2), int64(179), uint8(153)
memory usage: 358.3 KB
None


Now, we have successfully changed all our data to numeric! We can finally start creating our model! To do so, we will need to rescale our data to improve our prediction accuracy.

In [20]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Creating X train and y train 
X_train, y_train = cc_apps_train.iloc[:, :-1].values, cc_apps_train.iloc[:, [-1]].values
X_test, y_test = cc_apps_test.iloc[:, :-1].values, cc_apps_test.iloc[:, [-1]].values

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)

# **Step 5: Instantiate and fitting Our Model**

Because we finally preprocessed and cleaned our data, we can build our model.

In [21]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier
logreg = LogisticRegression()

# Fit to rain 
logreg.fit(rescaledX_train,y_train)

In [22]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test,y_test))

# Print the confusion matrix of the logreg model
confusion_matrix(y_test,y_pred)

Accuracy of logistic regression classifier:  1.0


array([[103,   0],
       [  0, 125]])

# **Step 6: Grid Search**

Now, our next step is we can perform a grid search. A grid search is often used to explore all possible combinations of different hyperparameters within a specified range. In the end, this leads to an improved accuracy and better model performance because it tunes the paramaters itself. 

In [23]:
#Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001 ,0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are the corresponding values
param_grid = dict(tol=tol, max_iter=max_iter)

In [24]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Fit grid_model to the data
grid_model_result = grid_model.fit(rescaledX_train, y_train)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

# Extract the best model and evaluate it on the test set
best_model = grid_model_result.best_estimator_
print("Accuracy of logistic regression classifier: ", best_model.score(rescaledX_test,y_test))


Best: 1.000000 using {'max_iter': 100, 'tol': 0.01}
Accuracy of logistic regression classifier:  1.0


Here, we see that our accuracy for the model is 1. The output is both showing the best hyperparameters and indicating that, with these settings, the model achieved perfect accuracy. Max_iter 100 references the maximum number of iterations the logistic regression model will run to find the optimal solution.
tol 0.01 references the tolerance level for convergence; the model stops when the change between iterations is smaller than this threshold.