<a href="https://colab.research.google.com/github/Aparna-6309663/Predicting-CreditCard-Approval/blob/main/creditcard_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Credit card being held in hand](credit_card.jpg)

Commercial banks receive _a lot_ of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report, for example. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

### The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. This dataset has been loaded as a `pandas` DataFrame called `cc_apps`. The last column in the dataset is the target value.

In [79]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV



In [80]:
from google.colab import files
uploaded = files.upload()

!mkdir -p datasets

# Move the uploaded file to the datasets directory
!mv cc_approvals.data datasets/


Saving cc_approvals.data to cc_approvals.data


In [81]:
!ls datasets/

cc_approvals.data


### Load the data and see the first 5 rows

In [82]:
cc_approval = pd.read_csv('datasets/cc_approvals.data', header=None)
cc_approval.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


In [83]:
print(cc_approval.shape)

(690, 16)


### Information and data types of the credit approval data

In [84]:
cc_approval.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


In [85]:
cc_approval.describe()

Unnamed: 0,2,7,10,14
count,690.0,690.0,690.0,690.0
mean,4.758725,2.223406,2.4,1017.385507
std,4.978163,3.346513,4.86294,5210.102598
min,0.0,0.0,0.0,0.0
25%,1.0,0.165,0.0,0.0
50%,2.75,1.0,0.0,5.0
75%,7.2075,2.625,3.0,395.5
max,28.0,28.5,67.0,100000.0


<p>The output may appear a bit confusing at its first sight, but let's try to figure out the most important features of a credit card application. The features of this dataset have been anonymized to protect the privacy, but <a href="http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html">this blog</a> gives us a pretty good overview of the probable features. The probable features in a typical credit card application are <code>Gender</code>, <code>Age</code>, <code>Debt</code>, <code>Married</code>, <code>BankCustomer</code>, <code>EducationLevel</code>, <code>Ethnicity</code>, <code>YearsEmployed</code>, <code>PriorDefault</code>, <code>Employed</code>, <code>CreditScore</code>, <code>DriversLicense</code>, <code>Citizen</code>, <code>ZipCode</code>, <code>Income</code> and finally the <code>ApprovalStatus</code>. This gives us a pretty good starting point, and we can map these features with respect to the columns in the output.   </p>
# <p>As we can see from our first glance at the data, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing, but before we do that, let's learn about the dataset a bit more to see if there are other dataset issues that need to be fixed.</p>


### Seeing the last 20 records of data

In [86]:
cc_approval.tail(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
670,b,47.17,5.835,u,g,w,v,5.5,f,f,0,f,g,465,150,-
671,b,25.83,12.835,u,g,cc,v,0.5,f,f,0,f,g,0,2,-
672,a,50.25,0.835,u,g,aa,v,0.5,f,f,0,t,g,240,117,-
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-


## Hadling Missing data

### From the above last records, observed there are "?" values in the data. So, eplace "?" to NAN. Which will come under missing values.

In [87]:
cc_approval = cc_approval.replace('?', np.nan)

### Now the data is recording the missing values as NaN

In [88]:
cc_approval.tail(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
670,b,47.17,5.835,u,g,w,v,5.5,f,f,0,f,g,465,150,-
671,b,25.83,12.835,u,g,cc,v,0.5,f,f,0,f,g,0,2,-
672,a,50.25,0.835,u,g,aa,v,0.5,f,f,0,t,g,240,117,-
673,,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-


### Let's see does the data having any missing values

In [89]:
cc_approval.isnull().any()

0      True
1      True
2     False
3      True
4      True
5      True
6      True
7     False
8     False
9     False
10    False
11    False
12    False
13     True
14    False
15    False
dtype: bool

In [93]:
# Columns 2, 7, 10, and 14 have numerical values which is important to check the missing values.
cc_approval.loc[[2,7,10,14]].fillna(np.mean, inplace=True)

#Count the NaNs in the dataset
print(cc_approval.isnull().sum())


0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64


### There are still some missing values to be imputed for columns 0, 1, 3, 4, 5, 6 and 13. All of these columns contain non-numeric data. We are going to impute these missing values with the most frequent values as present in the respective columns.

In [94]:
for col in list(cc_approval):

# Check the columns of the data type is object
  if cc_approval[col].dtype == 'object':

#Impute the missing values with most frequency value as present in the respective columns
    cc_approval[col].fillna(cc_approval[col].value_counts().index[0], inplace=True)

#Count the number of Nans
print(cc_approval.isnull().sum())



0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64


## Pre-Processing the Data

### 1. First, we will change all the non-numeric values to numeric ones. This is important because it makes the computation faster and many machine learning models, like XGBoost and those from scikit-learn, need the data to be in numeric form. We will do this using a method called label encoding.

In [95]:
#import LabelEncoder from sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Iterate over all the values of each column and extract their data types
for col in list(cc_approval):

#Check the column that having object data type
  if cc_approval[col].dtype == 'object':

#By using labelencoder, transforming from categorical to numerical
    cc_approval[col] = le.fit_transform(cc_approval[col])

### 2. Splitting the dataset into train and test

In [77]:
print(cc_approval.shape)

(690, 14)


In [110]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Drop the features 11 and 13 and convert the DataFrame to a NumPy array
cc_approval = cc_approval.drop([11,13], axis=1)
cc_approval = cc_approval.values

# Segregate features and labels into separate variables
X,y = cc_approval[:,0:12] , cc_approval[:,13]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                y,
                                test_size=0.33,
                                random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(462, 12) (462,)
(228, 12) (228,)


### 3. Scaling before performing the Machine Learning model

###Now, let's try to understand what these scaled values mean in the real world. Let's use CreditScore as an example. The credit score of a person is their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. So, a CreditScore of 1 is the highest since we're rescaling all the values to the range of 0-1.

In [111]:
#import MInMaxScaler from sklearn. preprocessing
from sklearn.preprocessing import MinMaxScaler

#Initilize Scaler by using MinMaxScaler with the feature_range
scaler = MinMaxScaler(feature_range=(0,1))

#Using rescale on X_train, X_test
X_train_rescaled = scaler.fit_transform(X_train)
X_test_rescaled = scaler.fit_transform(X_test)


print(X_train_rescaled.shape)
print(X_test_rescaled.shape)

(462, 12)
(228, 12)


## Fitting a Model to the train set

### A good machine learning model should accurately predict the approval status in line with these statistics.

### So, which model should we choose? One important question is whether the features affecting the credit card approval decision are correlated with each other. Although measuring this correlation is beyond the scope of this notebook, we'll assume that they are correlated based on intuition. Given this assumption, we know that generalized linear models perform well in such cases. Therefore, we'll start our machine learning modeling with a Logistic Regression model, which is a type of generalized linear model.

In [118]:
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg_classifier = LogisticRegression(solver='lbfgs') # Use solver instead of solve

# Fit logreg to the train set
logreg_classifier.fit(X_train_rescaled, y_train)

## Making Predictions & Evaluating Performance

In [119]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Use logreg to predict instances from the test set
y_pred = logreg_classifier.predict(X_test_rescaled)

print("Accuracy of Logistic Regression Classifier : ", accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))


Accuracy of Logistic Regression Classifier :  0.8421052631578947
[[94  9]
 [27 98]]


## Applying Grid Search and make the performance better

### Models performance was good and the accuracy is 84%
### Let's see if we can do better. We can perform a grid search of the model parameters to improve the model's ability to predict credit card approvals.

### scikit-learn's implementation of logistic regression consists of different hyperparameters but we will grid search over the following two:

* tol
* max_iter

In [120]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01,0.001,0.0001]
max_iter = [100,150,200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol = tol, max_iter = max_iter)

### We have defined the grid of hyperparameter values and put them into a single dictionary format, which is what GridSearchCV() expects.

## Finding the best Performing Model

### Now, we will start the grid search to find which values perform best.

### We'll set up GridSearchCV() with our Logistic Regression model and use all our data. Instead of passing train and test sets separately, we will provide the scaled version of X and y. We will also tell GridSearchCV() to perform five-fold cross-validation.

### At the end, we will save the best score achieved and the corresponding best parameters.

In [121]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg_classifier, param_grid=param_grid, cv=5)

# Use scaler to rescale X and assign it to rescaledX
rescaledX = scaler.fit_transform(X_train)

# Fit data to grid_model
grid_model_result = grid_model.fit(rescaledX, y_train)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

Best: 0.857153 using {'max_iter': 100, 'tol': 0.01}


### By using GridSearchGV() to perform cross-validations, we improved the accuracy of our model by about 3%.