## 1. Credit card applications

<p>Commercial banks receive a lot of applications for credit cards. Many of them get rejected for various reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this notebook, I will build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.</p>

<p><img src="https://assets.datacamp.com/production/project_558/img/credit_card.jpg" alt="Credit card being held in hand"></p>
<p>We'll use the <a href="http://archive.ics.uci.edu/ml/datasets/credit+approval">Credit Card Approval dataset</a> from the UCI Machine Learning Repository.

<p>First, loading and viewing the dataset. Since this data is confidential, the contributor of the dataset has anonymized the feature names.</p>

In [1]:
# Import pandas
import pandas as pd

# Load dataset
col_names = ['Gender' , 'Age' , 'Debt' , 'Married' , 'BankCustomer' , 'EducationLevel' , 'Ethinicity' , 'YearsEmployed','PriorDefault' , 'Employed' , 'CreditScore' , 'DriversLicense' , 'Citizen' , 'Zipcode' , 'Income' , 'ApprovalStatus']
cc_df = pd.read_csv('C:\\Users\\nisha\\Desktop\\crx.data' , header = None , names = col_names , sep = ',')

# Inspect data
cc_df.head()

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethinicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,Zipcode,Income,ApprovalStatus
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


## 2. Inspecting the applications  

From the first glance at our data, we can see the dataset has a mixture of numerical and non-numerical features.

In [2]:
# Inspecting the no of rows and columns in the dataset
cc_df.shape

(690, 16)

In [3]:
# Print summary statistics
cc_df.describe()

Unnamed: 0,Debt,YearsEmployed,CreditScore,Income
count,690.0,690.0,690.0,690.0
mean,4.758725,2.223406,2.4,1017.385507
std,4.978163,3.346513,4.86294,5210.102598
min,0.0,0.0,0.0,0.0
25%,1.0,0.165,0.0,0.0
50%,2.75,1.0,0.0,5.0
75%,7.2075,2.625,3.0,395.5
max,28.0,28.5,67.0,100000.0


In [4]:
# Print DataFrame information
cc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
Gender            690 non-null object
Age               690 non-null object
Debt              690 non-null float64
Married           690 non-null object
BankCustomer      690 non-null object
EducationLevel    690 non-null object
Ethinicity        690 non-null object
YearsEmployed     690 non-null float64
PriorDefault      690 non-null object
Employed          690 non-null object
CreditScore       690 non-null int64
DriversLicense    690 non-null object
Citizen           690 non-null object
Zipcode           690 non-null object
Income            690 non-null int64
ApprovalStatus    690 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 86.3+ KB


In [5]:
# Inspecting missing values in the Gender column of the dataset
cc_df[cc_df.Gender == '?']

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethinicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,Zipcode,Income,ApprovalStatus
248,?,24.5,12.75,u,g,c,bb,4.75,t,t,2,f,g,73,444,+
327,?,40.83,3.5,u,g,i,bb,0.5,f,f,0,f,s,1160,0,-
346,?,32.25,1.5,u,g,c,v,0.25,f,f,0,t,g,372,122,-
374,?,28.17,0.585,u,g,aa,v,0.04,f,f,0,f,g,260,1004,-
453,?,29.75,0.665,u,g,w,v,0.25,f,f,0,t,g,300,0,-
479,?,26.5,2.71,y,p,?,?,0.085,f,f,0,f,s,80,0,-
489,?,45.33,1.0,u,g,q,v,0.125,f,f,0,t,g,263,0,-
520,?,20.42,7.5,u,g,k,v,1.5,t,t,1,f,g,160,234,+
598,?,20.08,0.125,u,g,q,v,1.0,f,t,1,f,g,240,768,+
601,?,42.25,1.75,y,p,?,?,0.0,f,f,0,t,g,150,1,-


## 3. Handling the missing values (part I)
We've uncovered some issues that will affect the performance of our machine learning model(s) if they go unchanged:

**Numeric/Non-Numeric column** : Our dataset contains both numeric and non-numeric data. Specifically, the features Debt,YearsEmployed,CreditScore,Income contain numeric values (of types float64, float64, int64 and int64 respectively) and all the other features contain non-numeric values. 

**Range Issue** : The dataset also contains values from several ranges. YearsEmployed feature has a value range of 0 - 28, CreditScore feature has range of 2 - 67, and Income feature has a range of 1017 - 100000. Apart from these, we can get useful statistical information (like mean, max, and min) about the features that have numerical values.

**Missing Values** : Finally, the dataset has missing values, which we'll take care of next. The missing values in the dataset are labeled with '?', which can be seen in the last cell's output.Now, let's temporarily replace these missing value question marks with NaN.

In [6]:
# Importing numpy
import numpy as np

# Storing the row no's where gender is missing in a list
missing_values_index = list(cc_df[cc_df.Gender == '?'].index)

# Replace the '?'s with NaN
cc_df = cc_df.replace('?' , np.NaN)

# Inspect the missing values after replacement
cc_df.iloc[missing_values_index,:]

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethinicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,Zipcode,Income,ApprovalStatus
248,,24.5,12.75,u,g,c,bb,4.75,t,t,2,f,g,73,444,+
327,,40.83,3.5,u,g,i,bb,0.5,f,f,0,f,s,1160,0,-
346,,32.25,1.5,u,g,c,v,0.25,f,f,0,t,g,372,122,-
374,,28.17,0.585,u,g,aa,v,0.04,f,f,0,f,g,260,1004,-
453,,29.75,0.665,u,g,w,v,0.25,f,f,0,t,g,300,0,-
479,,26.5,2.71,y,p,,,0.085,f,f,0,f,s,80,0,-
489,,45.33,1.0,u,g,q,v,0.125,f,f,0,t,g,263,0,-
520,,20.42,7.5,u,g,k,v,1.5,t,t,1,f,g,160,234,+
598,,20.08,0.125,u,g,q,v,1.0,f,t,1,f,g,240,768,+
601,,42.25,1.75,y,p,,,0.0,f,f,0,t,g,150,1,-


## 4. Handling the missing values (part II)
We replaced all the question marks with NaNs.

An important question that gets raised here is why are we giving so much importance to missing values? Why cannot missing values be just ignored? 

Ignoring missing values can affect the performance of a machine learning model heavily.While ignoring the missing values our machine learning model may miss out on information about the dataset that may be useful for its training. 
Then, there are many models which cannot handle missing values implicitly such as LDA.

So, to avoid this problem, we are going to impute the missing values with a strategy called mean imputation.

In [7]:
# Checking which columns contain more missing values using heatmap
import seaborn as sns
sns.heatmap(cc_df.isnull() , yticklabels=False , cbar=False)

<matplotlib.axes._subplots.AxesSubplot at 0x1f80e97ad68>

In [8]:
# Impute the missing values with mean imputation
cc_df.fillna(cc_df.mean(), inplace=True)

# Count the number of NaNs in the dataset to verify
pd.DataFrame(cc_df.isnull().sum(),columns = ['Null count'])

Unnamed: 0,Null count
Gender,12
Age,12
Debt,0
Married,6
BankCustomer,6
EducationLevel,9
Ethinicity,9
YearsEmployed,0
PriorDefault,0
Employed,0


## 5. Handling the missing values (part III)
We have successfully taken care of the missing values present in the numeric columns. There are still some missing values to be imputed for non-numeric columns. Mean imputation strategy would not work for non-numeric data.This needs a different treatment.

We are going to impute missing values in non-numeric columns with the most frequent values as present in the respective columns. This is good practice when it comes to imputing missing values for categorical data in general.

In [9]:
# Iterate over each column of cc_df
for col in cc_df.columns:
    # Check if the column is of object type
    if cc_df[col].dtypes == 'object':
        # Impute with the most frequent value
        cc_df.fillna(cc_df[col].mode().values[0] , inplace = True)

# Count the number of NaNs in the dataset and print the counts to verify
pd.DataFrame(cc_df.isnull().sum(),columns = ['Null count'])

Unnamed: 0,Null count
Gender,0
Age,0
Debt,0
Married,0
BankCustomer,0
EducationLevel,0
Ethinicity,0
YearsEmployed,0
PriorDefault,0
Employed,0


## 6. Preprocessing the data
The missing values are now successfully handled.

There is still some minor but essential data preprocessing needed before we proceed towards building our machine learning model. 

We are going to divide these remaining preprocessing steps into three main tasks:

1.Convert the non-numeric data into numeric.  
2.Split the data into train and test sets.  
3.Scale the feature values to a uniform range.

First, we will be converting all the non-numeric values into numeric ones. We do this because not only it results in a faster computation but also many machine learning models (like XGBoost) (and especially the ones developed using scikit-learn) require the data to be in a strictly numeric format. We will do this by using a technique called label encoding.

In [10]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
le = LabelEncoder()

# Iterate over all the values of each column and extract their dtypes
for col in cc_df.columns:
    # Compare if the dtype is object
    if cc_df[col].dtypes == 'object':
    # Use LabelEncoder to do the numeric transformation
        cc_df[col]=le.fit_transform(cc_df[col])

In [11]:
# Checking if all the non-numeric columns have been label encoded or not
cc_df.head()

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethinicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,Zipcode,Income,ApprovalStatus
0,1,156,0.0,2,1,13,8,1.25,1,1,1,0,0,68,0,0
1,0,328,4.46,2,1,11,4,3.04,1,1,6,0,0,11,560,0
2,0,89,0.5,2,1,11,4,1.5,1,0,0,0,0,96,824,0
3,1,125,1.54,2,1,13,8,3.75,1,1,5,1,0,31,3,0
4,1,43,5.625,2,1,13,8,1.71,1,0,0,0,2,37,0,0


In [12]:
# Checking if the data is balanced or imbalanced

# Percentage of applications that did not got approved.
print('% of applications that did not got approved : ' , len(cc_df[cc_df['ApprovalStatus'] == 0])/cc_df.shape[0]*100)

# Percentage of applications that did not got approved.
print('% of applications that got approved : ' , len(cc_df[cc_df['ApprovalStatus'] == 1])/cc_df.shape[0]*100)

% of applications that did not got approved :  44.492753623188406
% of applications that got approved :  55.507246376811594


Since there is not much difference between the no of approved and not approved applications so the data is **balanced**.

## 7. Splitting the dataset into train and test sets
We have successfully converted all the non-numeric values to numeric ones.

Now, we will split our data into train set and test set to prepare our data for two different phases of machine learning modeling: training and testing. Ideally, no information from the test data should be used to scale the training data or should be used to direct the training process of a machine learning model. Hence, we first split the data and then apply the scaling.

Also, features like DriversLicense and ZipCode are not as important as the other features in the dataset for predicting credit card approvals. We should drop them to design our machine learning model with the best set of features. In Data Science literature, this is often referred to as feature selection.

In [13]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Drop the features DriversLicense and Zipcode and convert the DataFrame to a NumPy array
cc_df = cc_df.drop(['DriversLicense', 'Zipcode'], axis=1)

# Segregate features and labels into separate variables
X,y = cc_df.iloc[:,0:len(cc_df.columns)-1] , cc_df.iloc[:,len(cc_df.columns)-1]

# Split into train and test sets
X_train, X_test, y_train,y_test = train_test_split(X,y,test_size=0.33,random_state=42)

## 8. Scaling the data

The data is now split into two separate sets - train and test sets respectively. 

We are only left with one final preprocessing step of scaling before we can fit a machine learning model to the data.

Rescaling all the values to the range of 0-1.

In [14]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0,1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.fit_transform(X_test)

In [15]:
pd.DataFrame(rescaledX_train).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,0.249284,0.094931,1.0,1.0,0.5,0.111111,0.225,0.0,0.0,0.0,0.0,0.00456
1,1.0,0.52149,0.104424,0.666667,0.333333,0.714286,0.888889,0.2125,1.0,1.0,0.089552,0.0,0.0
2,1.0,0.487106,0.056958,0.666667,0.333333,0.142857,0.888889,0.0125,0.0,0.0,0.0,0.0,0.00122
3,1.0,0.424069,0.0412,1.0,1.0,0.142857,0.888889,0.002,0.0,0.0,0.0,0.0,0.00179
4,0.0,0.60745,0.194608,0.666667,0.333333,0.357143,0.888889,0.25,1.0,0.0,0.0,0.0,0.04


## 9. Fitting a logistic regression model to the training set¶

As calculated above , our dataset contains more instances that correspond to "Denied" status than instances corresponding to "Approved" status. Specifically, out of 690 instances, there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got approved.This gives us a benchmark.

A good machine learning model should be able to accurately predict the status of the applications and should beat the benchmark model with respect to these statistics.

Which model should we pick? A question to ask is: are the features that affect the credit card approval decision process correlated with each other? Although we can measure correlation, here we'll rely on our intuition that they indeed are correlated for now. Because of this correlation, we'll take advantage of the fact that generalized linear models perform well in these cases. Let's start our machine learning modeling with a Logistic Regression model (a generalized linear model).

In [16]:
# Import Logistic Regression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression(solver = 'liblinear')

# Fit logreg to the train set
logreg.fit(rescaledX_train , y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

## 10. Making predictions and evaluating performance
But how well does our model perform?

We will now evaluate our model on the test set with respect to classification accuracy. But we will also take a look the model's confusion matrix. In the case of predicting credit card applications, it is equally important to see if our machine learning model is able to predict the approval status of the applications as denied that originally got denied. If our model is not performing well in this aspect, then it might end up approving the application that should have been approved. The confusion matrix helps us to view our model's performance from these aspects.

In [17]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
from sklearn.metrics import accuracy_score
print("Accuracy of logistic regression classifier: ", accuracy_score(y_test,y_pred))

# Print the confusion matrix of the logreg model
confusion_matrix(y_test,y_pred)

Accuracy of logistic regression classifier:  0.8377192982456141


array([[92, 11],
       [26, 99]], dtype=int64)

## 11. Grid searching and making the model perform better
Our model is pretty good! It is able to yield an accuracy score of almost 84%.

For the confusion matrix, the first element of the of the first row of the confusion matrix denotes the true negatives meaning the number of negative instances (denied applications) predicted by the model correctly. And the last element of the second row of the confusion matrix denotes the true positives meaning the number of positive instances (approved applications) predicted by the model correctly.

Let's see if we can do better. We can perform a grid search of the model parameters to improve the model's ability to predict credit card approvals.

scikit-learn's implementation of logistic regression consists of different hyperparameters but we will grid search over the following two:

**tol** for tolerance and **max_iter** for maximum no of iterations**

In [18]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01,0.001,0.0001]
max_iter = [100,150,200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict({'tol' : tol , 'max_iter' : max_iter})

## 12. Finding the best performing model
We have defined the grid of hyperparameter values and converted them into a single dictionary format which GridSearchCV() expects as one of its parameters. Now, we will begin the grid search to see which values perform best.

We will instantiate GridSearchCV() with our earlier logreg model with all the data we have. Instead of passing train and test sets separately, we will supply X (scaled version) and y. We will also instruct GridSearchCV() to perform a cross-validation of five folds.

GridSearchCV method will yield the best-achieved score and the respective best parameters.

While building this credit card predictor, we tackled some of the most widely-known preprocessing steps such as scaling, label encoding, and missing value imputation. We finished with some machine learning to predict if a person's application for a credit card would get approved or not given some information about that person.

In [19]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator = logreg , param_grid = param_grid , cv = 5)

# Use scaler to rescale X and assign it to rescaledX
rescaledX = scaler.fit_transform(X)

# Fit data to grid_model
grid_model_result = grid_model.fit(rescaledX, y)

# Summarize results
best_score, best_params = grid_model_result.best_score_ , grid_model_result.best_params_
print("Best: %f using %s" % (best_score , best_params))

Best: 0.853623 using {'max_iter': 100, 'tol': 0.01}
