### CREDIT CARD APPROVAL PREDICTION

In this project we will be predicting whether a credit card will get approved or not, based on the previous data and this projects thus aims to help financial sectors who issue credit card to people to help judge them better whether to grant the person credit card or not. This model can also be used for the approval purpose of other types of financial cards as well with few modifications

For this project we will be using this [Credit Card Approval DataSet](http://archive.ics.uci.edu/ml/datasets/credit+approval) from UCI Machine Learning Repository.

In [16]:
# Import pandas
import pandas as pd

# Load dataset
data = pd.read_csv('/content/cc_approvals.data', header = None)

# Inspect data
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


Since we have now loaded the data, we will now like to get some insights from the data as well. The features that are provided in this dataset are "**Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income, ApprovalStatus**" in order



In [17]:
# Print DataFrame information
data_info = data.info()
print(data_info)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB
None


In [5]:
# summary statistics
data_description = data.describe()
print(data_description)

# this is giving us the infos of the numerical columns only, which have the data type as int or float

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


In [25]:
#now we will be checking if our dataset has any missing values for features or not

import numpy as np

#since our missing values in data is marked as '?', we will replace it with NaN
data = data.replace('?', np.nan)

#now printing the total missing values in each feature
print(data.isnull().sum())

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64


Now since we have find out that we have missing values in our dataset. Now we need to replace the missing values. For numerical values we will replace them with the mean of the column

In [27]:
# This will take care of the numerical missing datas by replacing them with mean
data.fillna(data.mean(), inplace=True)

#but here we have no missing numerical data, hence this will have no effect

# Count the number of NaNs in the dataset to verify
print('Total NaN: ' + str(data.isnull().values.sum()))
data.isnull().sum()

Total NaN: 67


0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

Now we need to take care of the missing values of categorical data.

In [29]:
#iterating over each column
for col in data:
    # Check if the column is of object type
    if data[col].dtypes == 'object':
        # Impute with the most frequent value, value_count() is an array whose index[0] has most 
        data = data.fillna(data[col].value_counts().index[0])

# Verifying whether missing value is there or not
print('Total missing values:' + str(data.isnull().values.sum()))
print('Missing values in each column:')
data.isnull().sum()

Total missing values:0
Missing values in each column:


0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64

Now since we have handled the missing data, we need to convert the data from object type to numerical data type so that we can perform operation on them.

In [30]:
#Import LabelEncoder
from sklearn.preprocessing import LabelEncoder
# Instantiate LabelEncoder
le = LabelEncoder()
# Iterate over all the values of each column and extract their dtypes
for col in data:
    # Compare if the dtype is object
    if data[col].dtypes =='object':
    # Use LabelEncoder to do the numeric transformation
        le.fit(data[col])
        data[col]=le.transform(data[col])
#  information of the new dataframe
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    int64  
 1   1       690 non-null    int64  
 2   2       690 non-null    float64
 3   3       690 non-null    int64  
 4   4       690 non-null    int64  
 5   5       690 non-null    int64  
 6   6       690 non-null    int64  
 7   7       690 non-null    float64
 8   8       690 non-null    int64  
 9   9       690 non-null    int64  
 10  10      690 non-null    int64  
 11  11      690 non-null    int64  
 12  12      690 non-null    int64  
 13  13      690 non-null    int64  
 14  14      690 non-null    int64  
 15  15      690 non-null    int64  
dtypes: float64(2), int64(14)
memory usage: 86.4 KB


So see that we have successfully converted the object type to int type. 

Lets drop those features which are not useful. See that we have features like DriversLicense and ZipCode are not as important as the other features in the dataset for predicting credit card approvals. We should drop them to design our machine learning model with the best set of features. This is known as ***Feature Selection***


In [31]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Drop the features 11 and 13 and convert the DataFrame to a NumPy array
data = data.drop([11, 13], axis=1)
print(cc_apps.head())
data = data.values


  0      1      2  3  4  5  6     7  8  9   10 11 12     13   14 15
0  b  30.83  0.000  u  g  w  v  1.25  t  t   1  f  g  00202    0  +
1  a  58.67  4.460  u  g  q  h  3.04  t  t   6  f  g  00043  560  +
2  a  24.50  0.500  u  g  q  h  1.50  t  f   0  f  g  00280  824  +
3  b  27.83  1.540  u  g  w  v  3.75  t  t   5  t  g  00100    3  +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0  f  s  00120    0  +


Now we will split the train and the test features

In [34]:
# Segregate features and labels into separate variables
X,y = data[:,0:12] , data[:,13]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33,random_state=0)

Now we will need to scale the features. 

In [37]:
from sklearn.preprocessing import StandardScaler as SC  #we are using standardization

sc_X=SC()

rescaledX_train=sc_X.fit_transform(X_train)
rescaledX_test=sc_X.fit_transform(X_test)


array([[ 0.64252941, -1.03990675, -0.92262137, ..., -0.88531564,
        -0.4715825 , -0.30375709],
       [ 0.64252941,  0.36783248, -0.18667959, ...,  1.12954065,
         0.8454277 , -0.30375709],
       [-1.556349  ,  0.23521937, -0.65562402, ..., -0.88531564,
        -0.4715825 , -0.30375709],
       ...,
       [-1.556349  , -1.19292188, -0.84203672, ..., -0.88531564,
        -0.4715825 , -0.30375709],
       [-1.556349  , -0.83588657, -0.52649439, ...,  1.12954065,
         0.8454277 , -0.30375709],
       [ 0.64252941,  0.86768192, -0.3323145 , ..., -0.88531564,
        -0.4715825 ,  3.38928965]])

Now that we have dealt with all the things that are need to make our data ideal for fitting a model into it, we must find a perfect model to do so.

Note that the problem in our hand is a classification problem. That is we have to deal with the yes/no question. 

In [39]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression
# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(rescaledX_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Now we will see how our data is performing on the test data set. We will see the performance using a confusion matrix.

[ TP   FP ]

[ FN   TN ]     


In [40]:
from sklearn.metrics import confusion_matrix
# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test, y_test))

# Print the confusion matrix of the logreg model
print('Confusion matrix: \n ', confusion_matrix(y_test, y_pred))


Accuracy of logistic regression classifier:  0.8464912280701754
Confusion matrix: 
  [[ 86  13]
 [ 22 107]]


Now we have built and tested our model for credit card approval. Now we will like to improve the accuracy of this model. For this we will take the help of ***GridSearchCV***. Logisitic Regression implemented in sklearn has several [hyperparameters](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) and apart from them we will like to tune



1.   tol
2.   max_iter



In [41]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

#declaring the grid values

tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol= tol, max_iter= max_iter)

We have defined the grid of hyperparameter values and converted them into a single dictionary format which GridSearchCV() expects as one of its parameters. Now, we will begin the grid search to see which values perform best.

We will instantiate GridSearchCV() with our earlier logreg model with all the data we have. Instead of passing train and test sets separately, we will supply X (scaled version) and y. We will also instruct GridSearchCV() to perform a cross-validation of 10 folds.



In [43]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=10)

# Use scaler to rescale X and assign it to rescaledX
rescaledX = sc_X.fit_transform(X)

# Fit data to grid_model
grid_model_result = grid_model.fit(rescaledX, y)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

Best: 0.856522 using {'max_iter': 100, 'tol': 0.01}
