![Credit card being held in hand](credit_card.jpg)

Commercial banks receive a high volume of credit card applications, and many are declined for reasons such as high loan balances, low income, or too many inquiries on the applicant's credit report. Reviewing these applications by hand is tedious, error-prone, and slow. Machine learning can automate the task, a practice that is now widespread among commercial banks. In this workbook, you'll build an automated credit card approval predictor using machine learning techniques, much as real banks do.

### The Data

The dataset is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository and describes the credit card applications received by a bank. It is loaded into a `pandas` DataFrame named `cc_apps`, and the last column holds the target value.

### Reading the data

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

# Load the dataset
cc_apps = pd.read_csv("cc_approvals.data", header=None)
cc_apps.head()

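|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | g | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | g | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | + |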


We use `cc_apps.info()` to get a concise summary of the DataFrame: the number of entries, the number of non-null values in each column, and the data type of each column.

In [None]:
# Check for null values and the dtype of each column
cc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    int64  
 13  13      690 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 75.6+ KB
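
A count of 690 non-null values in every column does not rule out missing data: `info()` only counts real nulls, so a placeholder string such as `?` is treated as an ordinary value. The sketch below illustrates this on a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only
toy = pd.DataFrame({
    "a": ["b", "?", "a", "b"],
    "b": ["30.83", "58.67", "?", "27.83"],
    "c": [0.0, 4.46, 0.5, 1.54],
})

# isna() finds nothing, because "?" is an ordinary string
na_counts = toy.isna().sum()

# Counting the placeholder explicitly reveals the hidden gaps
placeholder_counts = toy.isin(["?"]).sum()

print(na_counts.to_dict())
print(placeholder_counts.to_dict())
```

Running the same `isin(["?"])` check on `cc_apps` would show which columns need imputation.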


### Data Preprocessing
The following code handles missing values, which this dataset marks with `?`. It replaces each `?` with `NaN`, then imputes by column type: the most frequent value for categorical (`object`) columns and the mean for numeric columns.

In [None]:
# Replace the '?' placeholders with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
cc_apps_nans_replaced = cc_apps.replace("?", np.nan)

# Create a copy of the NaN replacement DataFrame
cc_apps_imputed = cc_apps_nans_replaced.copy()

# Iterate over each column of cc_apps_nans_replaced and impute the most frequent value for object data types and the mean for numeric data types
for col in cc_apps_imputed.columns:
    # Check if the column is of object type
    if cc_apps_imputed[col].dtypes == "object":
        # Impute with the most frequent value
        cc_apps_imputed[col] = cc_apps_imputed[col].fillna(
            cc_apps_imputed[col].value_counts().index[0]
        )
    else:
        cc_apps_imputed[col] = cc_apps_imputed[col].fillna(cc_apps_imputed[col].mean())

The resulting DataFrame `cc_apps_encoded` will have the original numerical columns along with new columns representing the one-hot encoded categorical variables. Each category in a categorical variable is represented by a binary column (0 or 1) indicating its presence or absence. This encoding is commonly used to preprocess data for machine learning models that require numerical input.

In [None]:
# Dummify the categorical features
cc_apps_encoded = pd.get_dummies(cc_apps_imputed, drop_first=True)
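
To see what `drop_first=True` does, here is a minimal sketch on a made-up frame (the column names `employed`, `income`, and `status` are hypothetical). For each categorical column, the first category in sorted order is dropped and the remaining categories become indicator columns. For a two-valued target of `+` and `-`, `+` sorts first, so the surviving column flags the `-` rows.

```python
import pandas as pd

# Hypothetical toy frame; names and values are invented for illustration
toy = pd.DataFrame({
    "employed": ["t", "f", "t"],
    "income": [0, 560, 824],
    "status": ["+", "-", "+"],
})

# Numeric columns pass through; each categorical column loses its first category
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

If the full dataset's target behaves the same way, a value of 1 in the last encoded column would mean a `-` outcome, which is worth keeping in mind when reading the confusion matrix later.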

The preprocessed data is then divided into two parts: `X`, the DataFrame without the labels, and `y`, the column containing the labels.

In [None]:
# Get X (features without the label) and y (labels) for modeling
X = cc_apps_encoded.iloc[:, :-1].values
# Select y as a 1-D array, which is the shape scikit-learn estimators expect
y = cc_apps_encoded.iloc[:, -1].values

### Splitting the data
The `train_test_split` function from the `sklearn.model_selection` module divides `X` and `y` into `X_train`, `X_test`, `y_train`, and `y_test`. We hold out 20% of the data for testing; adjust `test_size` as needed.

In [None]:
# Split the data into 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
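
As a quick check of the split sizes, the sketch below applies the same `test_size` and `random_state` to placeholder arrays with 690 rows (the arrays `X_demo` and `y_demo` are stand-ins, not the real features). With 690 rows, a 20% test split holds 138 rows, which matches the total of the confusion matrix printed later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (assumption)
X_demo = np.zeros((690, 3))
y_demo = np.arange(690) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Row counts of the training and test partitions
print(X_tr.shape[0], X_te.shape[0])
```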

### Scaling the data
The `StandardScaler` class from the `sklearn.preprocessing` module standardizes features by removing the mean and scaling to unit variance. The scaler is fit on the training set only and then applied unchanged to the test set.

In [None]:
#Use Standard Scaler to scale the values
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
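
The difference between `fit_transform` and `transform` matters here: the test set is scaled with the mean and standard deviation learned from the training set, not its own. A minimal sketch with made-up numbers (`train` and `test` are hypothetical arrays):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train
test_scaled = demo_scaler.transform(test)        # reuses the training statistics

# The training column now has mean 0 and unit variance
print(train_scaled.mean(), train_scaled.std())

# The test value is expressed relative to the training mean and std
print(demo_scaler.mean_, demo_scaler.scale_, test_scaled)
```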

### Hyperparameter Tuning
We create a parameter grid so that `GridSearchCV` can test each combination of hyperparameter values and report the one that performs best under cross-validation.

In [None]:
# Candidate values for the Logistic Regression hyperparameters
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 200, 150]
param_grid = {"tol": tol, "max_iter": max_iter}

We fit `GridSearchCV` on the scaled training data to find the best parameters.

In [None]:
# Create the base Logistic Regression estimator for the search
logreg = LogisticRegression()

# Create a GridSearchCV object to find the best parameters
search_grid = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)
search_grid.fit(X_train_scaled, y_train)

# Print the best parameters
print(search_grid.best_params_)

{'max_iter': 100, 'tol': 0.01}
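
One possible variation, shown here only as a sketch, is to put the scaler and the model in a scikit-learn `Pipeline` so that scaling is refit inside each cross-validation fold, and to read the tuned model from `best_estimator_` instead of copying the parameters by hand. The data below is synthetic (`make_classification` is a stand-in, not the credit card data), and the step names `scaler` and `logreg` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the credit card dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Scaling and the model are tuned together as one estimator
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression()),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid_demo = {
    "logreg__tol": [0.01, 0.001, 0.0001],
    "logreg__max_iter": [100, 150, 200],
}

search_demo = GridSearchCV(pipe, param_grid=param_grid_demo, cv=5)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)

# With the default refit=True, best_estimator_ is already refit on all the data passed to fit
best_model = search_demo.best_estimator_
```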


### Training Logistic Regression Model
We train the logistic regression model with the hyperparameter values found in the previous step, then print a confusion matrix to check its performance on the test set.

In [None]:
# Create a LogisticRegression instance with the best parameters
log_reg = LogisticRegression(max_iter=100, tol=0.01)

# Fit the model
log_reg.fit(X_train_scaled, y_train)

# Use the model to predict labels for the test set
y_pred_new = log_reg.predict(X_test_scaled)

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred_new))

[[52 18]
 [12 56]]


### Logistic Regression Model Score

In [2]:
# Score from the best fit model
best_score = log_reg.score(X_test_scaled,y_test)
print(f'The score of the model is {round(best_score*100,2)}')

The score of the model is 78.26
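
The score reported by `score` is plain accuracy, and it can be recovered from the confusion matrix printed earlier. The sketch below also derives precision and recall, assuming scikit-learn's layout (rows are true labels, columns are predicted labels) and treating label 1 as the positive class; which outcome label 1 stands for depends on how the target was encoded.

```python
import numpy as np

# Confusion matrix copied from the notebook output
cm = np.array([[52, 18],
               [12, 56]])

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy:  {accuracy * 100:.2f}")  # 78.26, matching the score above
print(f"precision: {precision * 100:.2f}")
print(f"recall:    {recall * 100:.2f}")
```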
