# 1. Credit card applications
Commercial banks receive a lot of applications for credit cards. Many of them are rejected for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming.

In this notebook, we will build an automated credit card approval predictor using machine learning techniques, just like real banks do. We will use the Credit Card Approval dataset from the UCI Machine Learning Repository (archive.ics.uci.edu/ml/datasets/credit+approval).

The notebook is structured as follows:
1. First, we load the dataset and take a look at it.
2. We will see that the dataset has a mixture of numerical and non-numerical features, that it contains values from different ranges, and that it contains a number of missing entries.
3. We preprocess the data so that the machine learning model can make good predictions.
4. Once the data is in good shape, we also do some exploratory analysis.
5. Finally, we build a machine learning model that predicts whether an individual's application for a credit card will be accepted.

In [5]:
#Pandas library is needed
import pandas as pd

#the dataset is loaded
cc_apps = pd.read_csv(r'C:\Users\MacBook\Desktop\Python Jupyter notebook\crx.data',header=None)

print(cc_apps.head())

  0      1      2  3  4  5  6     7  8  9   10 11 12     13   14 15
0  b  30.83  0.000  u  g  w  v  1.25  t  t   1  f  g  00202    0  +
1  a  58.67  4.460  u  g  q  h  3.04  t  t   6  f  g  00043  560  +
2  a  24.50  0.500  u  g  q  h  1.50  t  f   0  f  g  00280  824  +
3  b  27.83  1.540  u  g  w  v  3.75  t  t   5  t  g  00100    3  +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0  f  s  00120    0  +


# 2. Inspecting the applications

The data seems a bit confusing at first sight, but according to a blog, the features are as follows:
Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income, and finally ApprovalStatus.
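As a minimal sketch, the feature names listed above could be attached while loading the file. The `COLUMN_NAMES` list and the in-memory sample are assumptions made for illustration: the two rows are copied from the `head()` output, and the rest of this notebook keeps the default integer column labels instead.

```python
import io
import pandas as pd

# Hypothetical names, taken from the feature list above (the raw file has no header row).
COLUMN_NAMES = [
    "Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus",
]

# Two rows copied from the head() output, standing in for crx.data.
sample = io.StringIO(
    "b,30.83,0,u,g,w,v,1.25,t,t,01,f,g,00202,0,+\n"
    "a,58.67,4.46,u,g,q,h,3.04,t,t,06,f,g,00043,560,+\n"
)

# header=None tells pandas the first line is data; names supplies the labels.
named = pd.read_csv(sample, header=None, names=COLUMN_NAMES)
print(named[["Age", "CreditScore", "Income", "ApprovalStatus"]])
```

With named columns, later steps could refer to `"DriversLicense"` and `"ZipCode"` rather than to positions 11 and 13.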


In [6]:
#Print summary statistics
print(cc_apps.describe())

print('\n') #Blank space

#Print DataFrame information
print(cc_apps.info())

print('\n')

#Inspect missing values in the dataset
print(cc_apps.tail(17))

#It is possible to see that the data set has a mixture of numerical and non-numerical features. 
#This can be fixed with preprocessing methods, but first we need to see if there are other issues.

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 no

# 3. Splitting the dataset into training and testing sets

Ideally, no information from the test data should be used to preprocess the training data or to guide the training of a machine learning model. Hence, we first split the data and then preprocess it.

Additionally, features like DriversLicense and ZipCode are not as important as the other features for predicting credit card approvals. Therefore, we drop them so that the model is built on a better set of features. This is a simple form of feature selection.

In [7]:
#Dropping features DriversLicense and ZipCode -->[11,13]
cc_apps = cc_apps.drop([11,13],axis=1)

#Import train_test_split
from sklearn.model_selection import train_test_split

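#Split into train and test sets
#Note: an integer test_size is an absolute number of rows, so this holds out 33 applications; a float such as 0.33 would hold out a fraction instead
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size=33, random_state=42)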
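The meaning of `test_size` depends on its type, which is easy to overlook. The sketch below uses a toy frame of 200 rows (an assumption, not the credit card data) to show the difference between an integer and a float value.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 200 rows standing in for cc_apps.
toy = pd.DataFrame({"x": range(200)})

# Integer test_size: an absolute number of test rows.
train_int, test_int = train_test_split(toy, test_size=33, random_state=42)

# Float test_size: a fraction of all rows.
train_frac, test_frac = train_test_split(toy, test_size=0.25, random_state=42)

print(len(test_int), len(test_frac))
```

With the 690 applications in this dataset, `test_size=33` therefore keeps only 33 rows for testing, which matches the 33 predictions in the confusion matrix later in the notebook.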

# 4. Handling the missing values (part 1)
The dataset has missing values, which are labeled with '?'. We will replace them with NaN.

In [8]:
#import numpy
import numpy as np

#Replace the '?'s with NaN in the train and test sets
#(np.nan is used because the np.NaN alias was removed in NumPy 2.0)
cc_apps_train = cc_apps_train.replace(to_replace='?', value=np.nan)
cc_apps_test = cc_apps_test.replace(to_replace='?', value=np.nan)
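An alternative worth knowing is to let `read_csv` treat '?' as missing at load time. The sketch below compares both options on a tiny in-memory sample (the three rows are made up for illustration). With `na_values`, a column such as Age is parsed as a float instead of staying non-numeric.

```python
import io
import numpy as np
import pandas as pd

# Made-up sample: column 1 is numeric apart from one '?' marker.
raw = io.StringIO("b,30.83,u\na,?,u\n?,24.50,y\n")

# Option 1: replace the marker after loading, as in the cell above.
# Column 1 was read as text, so it stays non-numeric after the replacement.
df_replace = pd.read_csv(raw, header=None).replace("?", np.nan)

# Option 2: declare the marker when reading, so column 1 is parsed as float.
raw.seek(0)
df_na = pd.read_csv(raw, header=None, na_values="?")

print(df_replace.dtypes)
print(df_na.dtypes)
```

This notebook keeps option 1, which is why column 1 shows up among the non-numeric columns in the next step.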

# 5. Handling the missing values (part 2)
Why are missing values so important? Ignoring them can heavily affect the performance of a machine learning model: the model may miss out on information in the dataset that would be useful for training. In addition, many estimators, including scikit-learn's LogisticRegression used later in this notebook, do not accept NaN inputs.

To avoid this, we are going to impute the missing values with a strategy called mean imputation.

In [10]:
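#Impute numeric columns with the training-set means
#(numeric_only=True skips the object columns, which recent pandas versions no longer drop silently)
train_means = cc_apps_train.mean(numeric_only=True)
cc_apps_train = cc_apps_train.fillna(train_means)
cc_apps_test = cc_apps_test.fillna(train_means)

print(cc_apps_train.isna().sum())
print(cc_apps_test.isna().sum())

#As can be seen, there are still missing values in columns 0, 1, 3, 4, 5 and 6. These columns are stored as non-numeric (object) data,
#which is why mean imputation does not work for them.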


0     11
1     10
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
12     0
14     0
15     0
dtype: int64
0     1
1     2
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
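To make the mechanics of mean imputation concrete, here is a small sketch on toy data (the frames and column names are assumptions for illustration). The means are computed on the training set only and then reused for the test set, in line with the principle from section 3.

```python
import numpy as np
import pandas as pd

# Toy frames: one numeric column and one non-numeric column, each with a gap.
train = pd.DataFrame({"debt": [1.0, np.nan, 3.0], "married": ["u", None, "y"]})
test = pd.DataFrame({"debt": [np.nan, 10.0], "married": ["u", "u"]})

# Means come from the training set only, restricted to numeric columns.
train_means = train.mean(numeric_only=True)

# fillna with a Series fills each column with the value stored under its label.
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)  # reuse the training statistics

print(train_filled)
print(test_filled)
```

The gap in the non-numeric `married` column is left untouched, which mirrors the remaining missing values reported above.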




# 6. Handling the missing values (part 3)
We are going to impute the missing non-numerical values with the most frequent value in the respective column. This is a common practice for imputing missing values in categorical data.

In [14]:
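#Iterate over each column of cc_apps_train
for col in cc_apps_train.columns:
    #Only object (non-numeric) columns still contain missing values
    if cc_apps_train[col].dtype == 'object':
        #Most frequent value, computed on the training set before filling
        most_frequent = cc_apps_train[col].value_counts().index[0]
        #Fill this column only; DataFrame.fillna with a scalar would fill every column
        cc_apps_train[col] = cc_apps_train[col].fillna(most_frequent)
        cc_apps_test[col] = cc_apps_test[col].fillna(most_frequent)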
        
print(cc_apps_train.isna().sum())
print(cc_apps_test.isna().sum())      



0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
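The same most-frequent-value strategy is also available in scikit-learn as `SimpleImputer(strategy="most_frequent")`. The sketch below is an alternative to the loop above, not the method this notebook uses, and the toy frames and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical data with gaps; dtype=object keeps the values as plain Python strings.
train = pd.DataFrame(
    {"married": ["u", "u", np.nan, "y"], "citizen": ["g", np.nan, "g", "s"]},
    dtype=object,
)
test = pd.DataFrame(
    {"married": [np.nan, "y"], "citizen": ["s", np.nan]},
    dtype=object,
)

# The imputer learns the most frequent value per column from the training set...
imputer = SimpleImputer(strategy="most_frequent")
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

# ...and applies those same values to the test set.
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(train_filled)
print(test_filled)
```

Fitting on the training set and only transforming the test set keeps the test data out of the imputation statistics.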


# 7. Preprocessing the data (part 1)
Now that the missing values have been handled, we can proceed to the preprocessing stage.

First, we will convert all the non-numeric values into numeric ones. We do this because it not only results in faster computation, but many machine learning models (XGBoost, for example, and especially those in scikit-learn) also require the data to be in a strictly numeric format. We will do this using the get_dummies() function from pandas.

In [15]:
# Convert the categorical features in the train and test sets independently
cc_apps_train = pd.get_dummies(cc_apps_train)
cc_apps_test = pd.get_dummies(cc_apps_test)

# Reindex the columns of the test set aligning with the train set
cc_apps_test = cc_apps_test.reindex(columns=cc_apps_train.columns, fill_value=0)
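The reindexing step matters because the train and test sets may not contain the same categories. The sketch below illustrates this on toy data (the frames and the category 'z' are made up): a category that appears only in the test set is dropped, and a category that appears only in the training set is added to the test set as a column of zeros.

```python
import pandas as pd

# Toy frames; 'z' occurs only in the test set, 'h' and 'bb' only in the training set.
train = pd.DataFrame({"debt": [0.0, 4.46, 0.5], "ethnicity": ["v", "h", "bb"]})
test = pd.DataFrame({"debt": [1.54, 5.625], "ethnicity": ["v", "z"]})

# One-hot encode each set independently.
train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# Align the test columns with the training columns.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(test_aligned)
```

After alignment both sets have identical columns in the same order, which the model requires at prediction time.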

# 8. Preprocessing the data (part 2)
Now, we are left with one final preprocessing step, scaling, before we can fit a machine learning model to the data.

Let's try to understand what the scaled values mean in the real world, using CreditScore as an example. A person's credit score reflects their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. Since we are rescaling all values to the range 0-1, a scaled CreditScore of 1 corresponds to the highest score in the training data.

In [16]:
# Segregate features and labels into separate variables
# (the label is taken as a 1-D array, which is the shape scikit-learn expects for y)
X_train, y_train = cc_apps_train.iloc[:,:-1].values, cc_apps_train.iloc[:,-1].values
X_test, y_test = cc_apps_test.iloc[:,:-1].values, cc_apps_test.iloc[:,-1].values

# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range = (0,1))
rescaledX_train = scaler.fit_transform(X_train, y_train)
rescaledX_test = scaler.transform(X_test)
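One thing to watch in the cells above is that `get_dummies()` is applied to the whole frame, label included, so column 15 is itself split into two dummy columns and one of them stays among the features. A possible way around this, sketched below on toy data, is to separate the label before encoding. The toy values and the choice of 1 for approved are assumptions for illustration, not the encoding used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame using the notebook's integer column labels; 15 is the approval label.
train = pd.DataFrame(
    {8: ["t", "f", "t", "f"], 10: [1, 0, 6, 3], 15: ["+", "-", "+", "-"]}
)

# Separate the label first, so get_dummies never sees it.
y_train = (train[15] == "+").astype(int).to_numpy()  # assumed encoding: 1 = approved

# Encode only the features.
X_train = pd.get_dummies(train.drop(columns=[15]))

# Convert to a float array and rescale every feature to the range 0-1.
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X_train.astype(float).to_numpy())

print(list(X_train.columns))
print(X_scaled)
```

With this layout, no column derived from the label remains in the feature matrix.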

# 9. Fitting a logistic regression model to the train set
Predicting if a credit card application will be approved or not is a classification task.

Although we could measure how strongly the features are correlated with one another, that is outside the scope of this notebook, so for now we will rely on our intuition that they are indeed correlated. Generalized linear models tend to perform well in such cases, so let's start our machine learning modeling with a logistic regression model (a generalized linear model).

In [17]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(rescaledX_train, y_train)



LogisticRegression()

# 10. Making predictions and evaluating performance
But how well does our model perform?

We will now evaluate our model on the test set with respect to classification accuracy, and we will also take a look at the model's confusion matrix. When predicting credit card applications, it is important that the model is equally capable of predicting approved and denied statuses, in line with the frequency of these labels in the original dataset. If the model does not perform well in this respect, it might end up approving an application that should have been denied. The confusion matrix lets us view the model's performance from this angle.

In [18]:
#Use our model to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

#Import confusion matrix
from sklearn.metrics import confusion_matrix

#Getting the accuracy score of the model
print('Accuracy of logistic regression classifier: ',logreg.score(rescaledX_test,y_test))

print('\n')
#Printing the confusion matrix
print(confusion_matrix(y_test,y_pred))

Accuracy of logistic regression classifier:  1.0


[[13  0]
 [ 0 20]]
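For a binary problem, the four cells of the confusion matrix can be unpacked by name, which makes costly mistakes easier to spot. The sketch below uses made-up labels and assumes the encoding 1 = approved, 0 = denied, so a false positive is a denied application that the model approves.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (assumed encoding: 1 = approved, 0 = denied).
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes, both in the order given by labels.
matrix = confusion_matrix(y_true, y_hat, labels=[0, 1])
tn, fp, fn, tp = matrix.ravel()

accuracy = (tn + tp) / matrix.sum()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
print("Accuracy:", accuracy)
```

Passing `labels` explicitly fixes the row and column order, so the reading of each cell does not depend on how the classes happen to sort.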


# 11. Grid searching and making the model perform better
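Our model reached an accuracy score of 100% on the test set. A perfect score deserves some caution here: get_dummies() in step 7 also one-hot encodes the label column (15) into two complementary columns, and the feature matrix built in step 8 keeps one of them, so the model can read the answer directly from its inputs.

In the confusion matrix, the first element of the first row is the number of true negatives, that is, instances of the negative class (label 0) that the model predicted correctly. The last element of the second row is the number of true positives, that is, instances of the positive class (label 1) that the model predicted correctly. Which class stands for denied and which for approved applications depends on how the label was encoded. Here the label is the last dummy column from step 7, which appears to be the indicator for '-' (denied), so it is worth checking this before reading the matrix.

But what can be done when the score is not perfect? We can perform a grid search over the model's hyperparameters to improve its ability to predict credit card approvals.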

scikit-learn's implementation of logistic regression has several hyperparameters, but we will grid search over the following two:

tol
max_iter

In [19]:
from sklearn.model_selection import GridSearchCV

#Define grid values for these two hyperparameters
tol = [0.01,0.001,0.0001]
max_iter = [100,150,200]
param_grid = dict({'tol':tol,'max_iter':max_iter})

#Initialize GridSearch with the required parameters
grid_model = GridSearchCV(estimator=logreg,param_grid = param_grid,cv=5)

#Fit grid model
grid_model_result = grid_model.fit(rescaledX_train,y_train)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

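# Extract the best model found by the grid search
best_model = grid_model_result.best_estimator_
# Note: this prints the estimator itself, not a score; best_model.score(rescaledX_test, y_test) would give its test accuracy
print("Accuracy of logistic regression classifier: ", best_model)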


Best: 1.000000 using {'max_iter': 100, 'tol': 0.01}
Accuracy of logistic regression classifier:  LogisticRegression(tol=0.01)
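The cell above prints the best estimator rather than a test score. A minimal sketch of the full loop, from grid search to evaluation on held-out data, is given below. It runs on synthetic data from `make_classification` (an assumption, used so the example is self-contained), so its numbers say nothing about the credit card dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the rescaled features.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Same two hyperparameters and values as in the grid above.
param_grid = {"tol": [0.01, 0.001, 0.0001], "max_iter": [100, 150, 200]}

# 5-fold cross-validation on the training set only.
grid = GridSearchCV(LogisticRegression(), param_grid=param_grid, cv=5)
grid.fit(X_tr, y_tr)

# best_score_ is the mean cross-validated accuracy; score() evaluates on unseen data.
test_accuracy = grid.best_estimator_.score(X_te, y_te)
print("Best CV accuracy:", grid.best_score_)
print("Best parameters:", grid.best_params_)
print("Test accuracy:", test_accuracy)
```

Reporting the cross-validated score and the test score side by side helps to show whether the tuned model generalizes beyond the data it was tuned on.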
