![Credit card being held in hand](credit_card.jpg)

Commercial banks receive _a lot_ of credit card applications. Many are rejected for reasons such as high loan balances, low income, or too many inquiries on an applicant's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and many commercial banks now do so. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

### The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. This dataset has been loaded as a `pandas` DataFrame called `cc_apps`. The last column in the dataset is the target value.

## Exploratory Data Analysis

In [14]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Load the dataset
cc_apps = pd.read_csv("cc_approvals.data", header=None)
# Preview the first rows to get familiar with the dataset
print(cc_apps.head())
# Print each column's unique values to spot missing-value markers that must be dropped or imputed
for col in cc_apps.columns:
    print(cc_apps[col].unique())


  0      1      2  3  4  5  6     7  8  9   10 11   12 13
0  b  30.83  0.000  u  g  w  v  1.25  t  t   1  g    0  +
1  a  58.67  4.460  u  g  q  h  3.04  t  t   6  g  560  +
2  a  24.50  0.500  u  g  q  h  1.50  t  f   0  g  824  +
3  b  27.83  1.540  u  g  w  v  3.75  t  t   5  g    3  +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0  s    0  +
['b' 'a' '?']
['30.83' '58.67' '24.50' '27.83' '20.17' '32.08' '33.17' '22.92' '54.42'
 '42.50' '22.08' '29.92' '38.25' '48.08' '45.83' '36.67' '28.25' '23.25'
 '21.83' '19.17' '25.00' '47.75' '27.42' '41.17' '15.83' '47.00' '56.58'
 '57.42' '42.08' '29.25' '42.00' '49.50' '36.75' '22.58' '27.25' '23.00'
 '27.75' '54.58' '34.17' '28.92' '29.67' '39.58' '56.42' '54.33' '41.00'
 '31.92' '41.50' '23.92' '25.75' '26.00' '37.42' '34.92' '34.25' '23.33'
 '23.17' '44.33' '35.17' '43.25' '56.75' '31.67' '23.42' '20.42' '26.67'
 '36.00' '25.50' '19.42' '32.33' '34.83' '38.58' '44.25' '44.83' '20.67'
 '34.08' '21.67' '21.50' '49.58' '27.67' '39.83' '?' '
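
Printing every unique value works but is verbose. A more compact check is to count the `?` placeholders per column. The sketch below uses a small made-up DataFrame (`toy` is hypothetical, not the real `cc_apps`) and assumes, as the output above suggests, that missing entries are encoded as the string `?`.

```python
import pandas as pd

# Hypothetical miniature of the applications table; '?' marks a missing entry.
toy = pd.DataFrame({
    0: ["b", "a", "?", "b"],
    1: ["30.83", "?", "24.50", "?"],
    2: [0.0, 4.46, 0.5, 1.54],
})

# Element-wise comparison gives a boolean frame; summing counts the True values per column.
missing_counts = (toy == "?").sum()
print(missing_counts)
```

The same expression, `(cc_apps == "?").sum()`, should work on the full dataset. Note that a single `?` is enough for pandas to read an otherwise numeric column as strings, which is why the values of column 1 are printed in quotes.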

## Preprocessing Data

In [15]:
# Replace the '?'s with NaN in the dataset (np.nan; the np.NaN alias was removed in NumPy 2.0)
cc_apps_nans_replaced = cc_apps.replace("?", np.nan)
# Create a copy of the NaN replacement DataFrame
cc_apps_imputed = cc_apps_nans_replaced.copy()
# Iterate over each column of cc_apps_imputed and impute the most frequent value for object data types and the mean for numeric data types
for col in cc_apps_imputed.columns:
    # Check if the column is of object type (categorical)
    if cc_apps_imputed[col].dtypes == "object":
        # Impute with the most frequent value, i.e. the mode.
        # mode() returns a Series, so take its first entry to get a scalar fill value
        cc_apps_imputed[col] = cc_apps_imputed[col].fillna(
            cc_apps_imputed[col].mode().iloc[0]
        )
    else:
        # Impute numeric columns with the column mean
        cc_apps_imputed[col] = cc_apps_imputed[col].fillna(cc_apps_imputed[col].mean())
# Dummify the categorical features (because scikit-learn takes only numeric values)
# drop_first=True drops one dummy column per feature, because it is fully determined by the remaining ones
cc_apps_encoded = pd.get_dummies(cc_apps_imputed, drop_first=True)
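
To make the imputation and encoding steps concrete, here is a minimal sketch on a made-up two-column frame (`toy`, `kind`, and `amount` are hypothetical names). It checks for numeric columns with `pd.api.types.is_numeric_dtype` rather than comparing against `"object"`, which is an assumption made here for illustration. It also shows why a scalar has to be taken from `mode()`: passing the Series that `mode()` returns would align on the index and fill at most the first row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column.
toy = pd.DataFrame({
    "kind": ["u", "y", np.nan, "u", np.nan],
    "amount": [1.0, np.nan, 3.0, 5.0, 3.0],
})

imputed = toy.copy()
for col in imputed.columns:
    if pd.api.types.is_numeric_dtype(imputed[col]):
        # Numeric column: fill gaps with the column mean.
        imputed[col] = imputed[col].fillna(imputed[col].mean())
    else:
        # Categorical column: mode() returns a Series, so take its first value.
        imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

# One-hot encode; drop_first=True removes the redundant first level of each feature.
encoded = pd.get_dummies(imputed, drop_first=True)
print(encoded.columns.tolist())
```

With two levels (`u`, `y`) and `drop_first=True`, only one dummy column is kept: a row that is not `y` must be `u`.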

## Subsetting, Splitting and Scaling

In [16]:
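# Extract the features (every column except the last)
X = cc_apps_encoded.iloc[:, :-1].values
# Extract the last column as a 1-D target array, the shape scikit-learn estimators expect
y = cc_apps_encoded.iloc[:, -1].values
# Stratified train/test split keeps the class ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12, stratify=y
)
# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)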
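
The reason for calling `fit_transform` on the training set but only `transform` on the test set is that the test data must be scaled with the training set's mean and standard deviation. A small sketch with made-up arrays (the `_demo` names are hypothetical) checks this against a manual computation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: three training rows and one test row, two features each.
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_demo = np.array([[4.0, 40.0]])

scaler_demo = StandardScaler()
X_train_demo_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean and std from the training rows
X_test_demo_scaled = scaler_demo.transform(X_test_demo)        # reuses those training statistics

# Manual equivalent: subtract the training mean, divide by the training (population) std.
manual = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)
print(np.allclose(X_test_demo_scaled, manual))
```

Fitting the scaler on the test set as well would leak information about the test data into preprocessing.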

## Train Model

In [17]:
# Instantiate the model and fit it on the scaled training set
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)
# Predict on the training set
y_train_pred = log_reg.predict(X_train_scaled)
# Print the confusion matrix of the model on the training set (rows: true labels, columns: predicted labels)
print(confusion_matrix(y_train, y_train_pred))


[[226   4]
 [  5 282]]
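
The summary metrics implied by this matrix can be worked out by hand. The sketch below copies the four counts from the output above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels, so the layout is `[[TN, FP], [FN, TP]]` with class 1 treated as the positive class.

```python
# Training-set confusion matrix copied from the output above.
matrix = [[226, 4],
          [5, 282]]

(tn, fp), (fn, tp) = matrix
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all samples classified correctly
precision = tp / (tp + fp)     # of the samples predicted as class 1, the share that truly are class 1
recall = tp / (tp + fn)        # of the true class-1 samples, the share the model found

print(f"samples={total} accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

These are training-set figures, so they describe how well the model fits data it has already seen, not how well it generalizes.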


## Hyperparameter Tuning (GridSearchCV with Cross-Validation)

In [18]:
# Parameter grid (dictionary) over two hyperparameters: tol (tolerance) and max_iter (maximum iterations)
# Both control the solver's iterative optimization
# tol is the tolerance of the stopping criterion: the solver stops once the criterion falls below tol
# max_iter caps the number of solver iterations
# Total CV fits = (number of parameter combinations) * (k folds) = 3 * 3 * 5 = 45
param_grid = {"tol": [0.01, 0.001, 0.0001], "max_iter": [100, 150, 200]}
kf = KFold(n_splits=5, shuffle=True, random_state=12)
# Instantiate GridSearchCV with the required parameters
grid_cv = GridSearchCV(estimator=log_reg, param_grid=param_grid, cv=kf)
grid_cv_res = grid_cv.fit(X_train_scaled, y_train)
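
The amount of work the grid search performs follows directly from the grid and the number of folds. As a minimal sketch, the combinations can be enumerated with `itertools.product` (`combinations`, `n_folds`, and `cv_fits` are names introduced here for illustration):

```python
from itertools import product

param_grid = {"tol": [0.01, 0.001, 0.0001], "max_iter": [100, 150, 200]}
n_folds = 5

# Every combination of one value per hyperparameter (Cartesian product of the value lists).
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# Each combination is trained and scored once per fold.
cv_fits = len(combinations) * n_folds
print(len(combinations), cv_fits)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit at the end, training the best combination on the whole training set.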

## Best scoring model

In [19]:
# Summarize the grid search: best mean cross-validated accuracy and the parameters that achieved it
best_train_score, best_train_params = grid_cv_res.best_score_, grid_cv_res.best_params_
print("Best: %f using %s" % (best_train_score, best_train_params))
# Extract the best model (refit on the full training set) and evaluate its accuracy on the test set
best_model = grid_cv_res.best_estimator_
best_score = best_model.score(X_test_scaled, y_test)
print(best_score)

Best: 0.779388 using {'max_iter': 100, 'tol': 0.01}
0.791907514450867
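
The "Best" figure is a mean over the cross-validation folds, not a score on the full training set. The sketch below illustrates this on synthetic data from `make_classification` (the data, the reduced grid, and the `_demo` names are assumptions made for illustration, not the credit card dataset) by recomputing `best_score_` from the per-fold entries of `cv_results_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data, used only to keep the example self-contained.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=12)
grid_demo = GridSearchCV(LogisticRegression(), {"tol": [0.01, 0.0001]}, cv=kf_demo)
grid_demo.fit(X_demo, y_demo)

# Per-fold validation accuracy of the best parameter combination.
fold_scores = [
    grid_demo.cv_results_[f"split{i}_test_score"][grid_demo.best_index_]
    for i in range(5)
]
mean_of_folds = float(np.mean(fold_scores))
print(mean_of_folds, grid_demo.best_score_)
```

In the run above, the cross-validated accuracy (about 0.78) and the test accuracy (about 0.79) are close to each other, which suggests the cross-validation estimate is a reasonable guide to performance on unseen applications.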
