# Predicting credit card approvals

In this notebook, I will build an automatic credit card approval predictor using a dataset from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/credit+approval).

In [97]:
import pandas as pd

cc_apps = pd.read_csv("cc_approvals.data", header = None)
cc_apps.head()

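| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |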


This output is quite confusing because, for privacy reasons, the feature names have been concealed. I will use [this blog](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html) as a reference and map the features as follows: Gender, Age, Debt, Married, BankCustomer, EducationLevel, Ethnicity, YearsEmployed, PriorDefault, Employed, CreditScore, DriversLicense, Citizen, ZipCode, Income and finally ApprovalStatus.


In [98]:
cc_apps = cc_apps.rename({0:'Gender',1:'Age',2:'Debt',3:'Married',4:'BankCustomer',5:'EducationLevel',6:'Ethnicity',7:'YearsEmployed',8:'PriorDefault',9:'Employed',10:'CreditScore',11:'DriversLicense',12:'Citizen',13:'ZipCode',14:'Income',15:'ApprovalStatus'}, axis=1)
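The same mapping could also be applied while reading the file, which avoids the separate rename step. The following is a minimal sketch, assuming the column order listed above; it uses an in-memory copy of the first two rows shown earlier as a stand-in for `cc_approvals.data`, and the names `column_names`, `raw` and `sample` are only illustrative.

```python
import io
import pandas as pd

# Column names in file order, following the mapping used in this notebook
column_names = [
    "Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus",
]

# Two rows copied from the head() output; stand-in for the real file
raw = io.StringIO(
    "b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+\n"
    "a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+\n"
)

# header=None says the file has no header row; names= supplies the labels
sample = pd.read_csv(raw, header=None, names=column_names)
print(sample.columns.tolist())
```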


The dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing, but before that, it would be good to learn about the dataset a bit more to see if there are other dataset issues that need to be fixed.

In [99]:
print(cc_apps.describe())

print('\n')

print(cc_apps.info())

print('\n')

cc_apps.tail(20)

             Debt  YearsEmployed  CreditScore         Income
count  690.000000     690.000000    690.00000     690.000000
mean     4.758725       2.223406      2.40000    1017.385507
std      4.978163       3.346513      4.86294    5210.102598
min      0.000000       0.000000      0.00000       0.000000
25%      1.000000       0.165000      0.00000       0.000000
50%      2.750000       1.000000      0.00000       5.000000
75%      7.207500       2.625000      3.00000     395.500000
max     28.000000      28.500000     67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Gender          690 non-null    object 
 1   Age             690 non-null    object 
 2   Debt            690 non-null    float64
 3   Married         690 non-null    object 
 4   BankCustomer    690 non-null    object 
 5   EducationLevel  690 non-

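| | Gender | Age | Debt | Married | BankCustomer | EducationLevel | Ethnicity | YearsEmployed | PriorDefault | Employed | CreditScore | DriversLicense | Citizen | ZipCode | Income | ApprovalStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 670 | b | 47.17 | 5.835 | u | g | w | v | 5.5 | f | f | 0 | f | g | 465 | 150 | - |
| 671 | b | 25.83 | 12.835 | u | g | cc | v | 0.5 | f | f | 0 | f | g | 0 | 2 | - |
| 672 | a | 50.25 | 0.835 | u | g | aa | v | 0.5 | f | f | 0 | t | g | 240 | 117 | - |
| 673 | ? | 29.5 | 2.0 | y | p | e | h | 2.0 | f | f | 0 | f | g | 256 | 17 | - |
| 674 | a | 37.33 | 2.5 | u | g | i | h | 0.21 | f | f | 0 | f | g | 260 | 246 | - |
| 675 | a | 41.58 | 1.04 | u | g | aa | v | 0.665 | f | f | 0 | f | g | 240 | 237 | - |
| 676 | a | 30.58 | 10.665 | u | g | q | h | 0.085 | f | t | 12 | t | g | 129 | 3 | - |
| 677 | b | 19.42 | 7.25 | u | g | m | v | 0.04 | f | t | 1 | f | g | 100 | 1 | - |
| 678 | a | 17.92 | 10.21 | u | g | ff | ff | 0.0 | f | f | 0 | f | g | 0 | 50 | - |
| 679 | a | 20.08 | 1.25 | u | g | c | v | 0.0 | f | f | 0 | f | g | 0 | 0 | - |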


From this brief examination of the data, a few things stand out:
- There are both numerical and non-numerical values
- There appear to be some missing values
- The features have very different ranges

I will deal with these issues in the following steps. First, though, I will drop two columns that are not very relevant to the predictions, DriversLicense and ZipCode, and then split the dataset into training and testing sets.

In [100]:
from sklearn.model_selection import train_test_split

cc_apps = cc_apps.drop(columns = ['DriversLicense','ZipCode'])


cc_apps_train, cc_apps_test = train_test_split(cc_apps,test_size= 0.3, random_state= 42)

Now I will start dealing with the missing values. After checking out the last 20 rows of the dataframe, I noticed a question mark in one cell: this is most probably a missing value. I will replace all the question marks in the dataframe with NaN.

In [101]:
import numpy as np

cc_apps_train = cc_apps_train.replace('?',np.nan)
cc_apps_test = cc_apps_test.replace('?', np.nan)

cc_apps_train.tail(10)

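| | Gender | Age | Debt | Married | BankCustomer | EducationLevel | Ethnicity | YearsEmployed | PriorDefault | Employed | CreditScore | Citizen | Income | ApprovalStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 214 | b | 26.67 | 2.71 | y | p | cc | v | 5.25 | t | t | 1 | g | 0 | + |
| 466 | b | 31.08 | 3.085 | u | g | c | v | 2.5 | f | t | 2 | g | 41 | - |
| 121 | b | 25.67 | 12.5 | u | g | cc | v | 1.21 | t | t | 67 | g | 258 | + |
| 614 | a | 38.33 | 4.415 | u | g | c | v | 0.125 | f | f | 0 | g | 0 | - |
| 20 | b | 25.0 | 11.25 | u | g | c | v | 2.5 | t | t | 17 | g | 1208 | + |
| 71 | b | 34.83 | 4.0 | u | g | d | bb | 12.5 | t | f | 0 | g | 0 | - |
| 106 | b | 28.75 | 1.165 | u | g | k | v | 0.5 | t | f | 0 | s | 0 | - |
| 270 | b | 37.58 | 0.0 | NaN | NaN | NaN | NaN | 0.0 | f | f | 0 | p | 0 | + |
| 435 | b | 19.0 | 0.0 | y | p | ff | ff | 0.0 | f | t | 4 | g | 1 | - |
| 102 | b | 18.67 | 5.0 | u | g | q | v | 0.375 | t | t | 2 | g | 38 | - |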
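The `info()` output earlier lists Age as `object`, which is what pandas does when an otherwise numeric column contains a placeholder string such as `?`. As an optional extra step (not something this notebook does), the column could be converted to a numeric type once the placeholders are NaN. A minimal sketch on a made-up toy frame; the name `toy` and its values are only illustrative:

```python
import numpy as np
import pandas as pd

# Toy frame: Age arrives as strings because of the '?' placeholder
toy = pd.DataFrame({"Age": ["30.83", "?", "24.5"], "Gender": ["b", "a", "?"]})

# Step 1: turn the placeholder into a real missing value
toy = toy.replace("?", np.nan)

# Step 2: convert the numeric-looking column; NaN stays NaN
toy["Age"] = pd.to_numeric(toy["Age"])

print(toy.dtypes)
print(toy.isnull().sum())
```

With Age stored as a float column, it would be picked up by numeric summaries such as `describe()` and by mean imputation.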


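With the question marks converted to NaN, I will now use mean imputation to replace the missing values in each numeric column with that column's mean. The mean is computed on the training set and reused for the test set, so no information from the test set is used.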

In [102]:
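# Compute the means of the numeric columns on the training set only
train_means = cc_apps_train.mean(numeric_only=True)
cc_apps_train = cc_apps_train.fillna(train_means)
cc_apps_test = cc_apps_test.fillna(train_means)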

print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())

Gender            9
Age               6
Debt              0
Married           6
BankCustomer      6
EducationLevel    7
Ethnicity         7
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
Citizen           0
Income            0
ApprovalStatus    0
dtype: int64
Gender            3
Age               6
Debt              0
Married           0
BankCustomer      0
EducationLevel    2
Ethnicity         2
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
Citizen           0
Income            0
ApprovalStatus    0
dtype: int64
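The counts above show that only numeric columns are affected by mean imputation; object columns keep their missing values. A minimal sketch of the same idea on made-up toy frames (the names `train`, `test` and `train_means` are only illustrative), with the means taken from the training frame and reused for the test frame:

```python
import numpy as np
import pandas as pd

# Toy frames with one numeric and one object column
train = pd.DataFrame({"Debt": [1.0, np.nan, 3.0], "Married": ["u", None, "y"]})
test = pd.DataFrame({"Debt": [np.nan, 10.0], "Married": ["u", None]})

# numeric_only=True restricts the mean to numeric columns (here: Debt)
train_means = train.mean(numeric_only=True)

# fillna with a Series fills each column using the entry with the same label
train = train.fillna(train_means)
test = test.fillna(train_means)

print(train)
print(test)
```

The missing Debt values are filled with the training mean, while Married is left untouched because it has no entry in `train_means`.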




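The dataset contains object-type columns as well as numeric ones, and mean imputation does not apply to them, which is why the counts above still show missing values. Age is among them, since it was read as an object column because of the question marks. For these columns I will replace each missing value with the most frequent value of its own column in the training set, using the following for loop.

In [103]:
for col in cc_apps_train:
    # Impute object-type columns with the most frequent training value of that column
    if cc_apps_train[col].dtypes == 'object':
        most_frequent = cc_apps_train[col].value_counts().index[0]
        cc_apps_train[col] = cc_apps_train[col].fillna(most_frequent)
        cc_apps_test[col] = cc_apps_test[col].fillna(most_frequent)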
        
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())

Gender            0
Age               0
Debt              0
Married           0
BankCustomer      0
EducationLevel    0
Ethnicity         0
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
Citizen           0
Income            0
ApprovalStatus    0
dtype: int64
Gender            0
Age               0
Debt              0
Married           0
BankCustomer      0
EducationLevel    0
Ethnicity         0
YearsEmployed     0
PriorDefault      0
Employed          0
CreditScore       0
Citizen           0
Income            0
ApprovalStatus    0
dtype: int64
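The same per-column, most-frequent imputation can also be expressed with scikit-learn's `SimpleImputer`. This is only a sketch of an alternative, not the approach used in this notebook, and the toy frames `train_cat` and `test_cat` are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical frames; dtype=object keeps the strings and NaN together
train_cat = pd.DataFrame(
    {"Married": ["u", "u", np.nan, "y"], "Citizen": ["g", np.nan, "g", "s"]},
    dtype=object,
)
test_cat = pd.DataFrame(
    {"Married": [np.nan, "y"], "Citizen": ["s", np.nan]},
    dtype=object,
)

# Learn the most frequent value of each column from the training frame only
imputer = SimpleImputer(strategy="most_frequent")
train_filled = pd.DataFrame(imputer.fit_transform(train_cat), columns=train_cat.columns)

# Reuse the learned values on the test frame
test_filled = pd.DataFrame(imputer.transform(test_cat), columns=test_cat.columns)

print(test_filled)
```

Each column is filled with its own most frequent training value, so a missing Married entry can only become a Married category.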


Now that the missing values are dealt with, I will move on to another crucial preprocessing step: one-hot encoding. This converts each categorical variable into a set of indicator columns, one per category. It is needed here because scikit-learn estimators such as logistic regression expect numeric input.

In [104]:
cc_apps_train = pd.get_dummies(cc_apps_train)
cc_apps_test = pd.get_dummies(cc_apps_test)

#Reindex the columns of the test set aligning with the train set
cc_apps_test = cc_apps_test.reindex(columns=cc_apps_train.columns, fill_value=0)
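To see why the reindex step matters, here is a minimal sketch on made-up toy frames in which the test set happens to contain fewer categories than the training set. The frames and their values are assumptions for illustration only.

```python
import pandas as pd

# The training frame has three Citizen categories, the test frame only one
train = pd.DataFrame({"Citizen": ["g", "s", "p"], "Debt": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"Citizen": ["g", "g"], "Debt": [4.0, 5.0]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# Before alignment the test frame is missing Citizen_p and Citizen_s
print(test_enc.columns.tolist())

# Align to the training columns; categories unseen in test become all-zero columns
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc.columns.tolist())
```

Without the alignment, the two feature matrices would have different numbers of columns and a model fitted on one could not be applied to the other.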



The last preprocessing step left is scaling the data. To do this, I will import MinMaxScaler from scikit-learn. Before I rescale the data, I will separate the training and test sets into features and labels using iloc.

In [105]:
from sklearn.preprocessing import MinMaxScaler

# Segregate features and labels into separate variables.
# After get_dummies, the last column is ApprovalStatus_-, so a label of 1 means "not approved".
X_train, y_train = cc_apps_train.iloc[:, :-1].values, cc_apps_train.iloc[:,[-1]].values
X_test, y_test = cc_apps_test.iloc[:, :-1].values, cc_apps_test.iloc[:,[-1]].values

# Instantiate MinMaxScaler, fit it on X_train only, and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0,1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
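For each feature, MinMaxScaler computes (x - min) / (max - min) using the minimum and maximum seen during fitting. A minimal sketch with made-up numbers, showing that the training data ends up in [0, 1] while test values outside the training range can fall outside that interval:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature; training range is 0 to 10
X_train_demo = np.array([[0.0], [5.0], [10.0]])
# The second test value lies outside the training range
X_test_demo = np.array([[2.5], [20.0]])

scaler_demo = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler_demo.fit_transform(X_train_demo)  # learns min=0, max=10
test_scaled = scaler_demo.transform(X_test_demo)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

This is why the scaler is fitted on the training features and only applied to the test features.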

Now I can fit the data to a model. I will try the Logistic Regression model.

In [107]:
from sklearn.linear_model import LogisticRegression


logreg = LogisticRegression()


logreg.fit(rescaledX_train, y_train.ravel())


LogisticRegression()

Now it is time to see how well the model performs. There are many ways to do this; in the following code, I will first use the predict method to predict instances from the test set, then the score method to compute the accuracy against the actual test labels, and finally the confusion matrix to see whether there are any false negatives or false positives.

In [108]:
from sklearn.metrics import confusion_matrix


# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test, y_test))

confusion_matrix(y_test, y_pred)


Accuracy of logistic regression classifier:  1.0


array([[ 97,   0],
       [  0, 110]])
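In scikit-learn's confusion matrix, rows are the actual classes and columns are the predicted classes, so for a binary problem the four cells can be unpacked by name. A minimal sketch with made-up labels (`y_true_demo` and `y_hat_demo` are illustrative, not the notebook's data):

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions for a binary problem
y_true_demo = [0, 0, 1, 1, 1]
y_hat_demo = [0, 1, 1, 1, 0]

# ravel() flattens [[tn, fp], [fn, tp]] row by row
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_hat_demo).ravel()

print("true negatives:", tn)
print("false positives:", fp)
print("false negatives:", fn)
print("true positives:", tp)
```

Read this way, the matrix above has 97 true negatives, 110 true positives and zeros in both off-diagonal cells.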

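The classifier scored a perfect 1.0 on the test set, with no false positives or false negatives. A result this clean should be treated with suspicion rather than celebrated, and here there is a concrete reason for it: `pd.get_dummies` was applied to the whole dataframe, so `ApprovalStatus` was itself encoded into `ApprovalStatus_+` and `ApprovalStatus_-`. Taking `iloc[:, :-1]` as the features leaves `ApprovalStatus_+` in `X`, while `ApprovalStatus_-` is used as the label, so the model can read the answer directly from one of its inputs. The scores in the rest of this notebook are affected in the same way. On real problems there is usually some margin for error, and that is where cross validation, hyperparameter tuning and model selection become key parts of the machine learning process. Even though my classifier already looks "perfect", I will still go through these steps in the following lines of code.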
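Because `pd.get_dummies` on the full dataframe also encodes `ApprovalStatus`, it is worth checking which columns end up in the feature matrix. The sketch below uses a made-up toy frame to show the check, along with one possible way to keep the label out of the features by splitting it off before encoding. Here `y` marks approvals as 1, which is an assumption of this sketch and the opposite of the notebook's label column.

```python
import pandas as pd

# Toy frame with one numeric feature, one categorical feature and the label
toy = pd.DataFrame({
    "Debt": [0.0, 4.46, 0.5, 1.54],
    "Citizen": ["g", "g", "s", "g"],
    "ApprovalStatus": ["+", "+", "-", "-"],
})

# Encoding the whole frame turns the label into two columns
leaky = pd.get_dummies(toy)
leaky_features = leaky.iloc[:, :-1]
print(leaky_features.columns.tolist())  # ApprovalStatus_+ is still in here

# Alternative: separate the label first, then encode only the features
y = (toy["ApprovalStatus"] == "+").astype(int)
X = pd.get_dummies(toy.drop(columns="ApprovalStatus"))
print(X.columns.tolist())
```

With the label removed before encoding, no column derived from `ApprovalStatus` can reach the model as an input.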

In [126]:
from sklearn.model_selection import GridSearchCV


# Define the grid of values for tol and max_iter. These are 2 logistic regression parameters
tol = [0.01,0.001,0.0001]
max_iter = [100,150,200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol = tol, max_iter = max_iter)

# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Fit grid_model to the data
grid_model_result = grid_model.fit(rescaledX_train, y_train.ravel())

# Show results
grid_model_result.cv_results_

{'mean_fit_time': array([0.01143537, 0.00696478, 0.00658374, 0.0051652 , 0.00561504,
        0.00623384, 0.0043323 , 0.00496011, 0.0057611 ]),
 'std_fit_time': array([0.00375577, 0.00145817, 0.00079925, 0.00128923, 0.0015134 ,
        0.00097861, 0.00062261, 0.00066348, 0.00081432]),
 'mean_score_time': array([0.00035625, 0.00024977, 0.00016999, 0.00017085, 0.00022717,
        0.00015311, 0.00015559, 0.00015163, 0.00015635]),
 'std_score_time': array([7.35360202e-05, 9.02124488e-05, 3.22350383e-06, 2.35793926e-05,
        1.35525677e-04, 1.62684651e-06, 5.27891255e-06, 1.47742590e-06,
        3.51503706e-06]),
 'param_max_iter': masked_array(data=[100, 100, 100, 150, 150, 150, 200, 200, 200],
              mask=[False, False, False, False, False, False, False, False,
                    False],
        fill_value='?',
             dtype=object),
 'param_tol': masked_array(data=[0.01, 0.001, 0.0001, 0.01, 0.001, 0.0001, 0.01, 0.001,
                    0.0001],
              mask=[False

The grid search returned all of these results. However, in this format they are not very easy to analyze, so I will now convert them to a dataframe.

In [127]:
df = pd.DataFrame(grid_model_result.cv_results_)
df

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_iter | param_tol | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.011435 | 0.003756 | 0.000356 | 7.4e-05 | 100 | 0.01 | {'max_iter': 100, 'tol': 0.01} | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1 |
| 1 | 0.006965 | 0.001458 | 0.00025 | 9e-05 | 100 | 0.001 | {'max_iter': 100, 'tol': 0.001} | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1 |
| 2 | 0.006584 | 0.000799 | 0.00017 | 3e-06 | 100 | 0.0001 | {'max_iter': 100, 'tol': 0.0001} | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1 |
| 3 | 0.005165 | 0.001289 | 0.000171 | 2.4e-05 | 150 | 0.01 | {'max_iter': 150, 'tol': 0.01} | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1 |
| 4 | 0.005615 | 0.001513 | 0.000227 | 0.000136 | 150 | 0.001 | {'max_iter': 150, 'tol': 0.001} | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1 |
| 5 | 0.006234 | 0.000979 | 0.000153 | 2e-06 | 150 | 0.0001 | {'max_iter': 150, 'tol': 0.0001} | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1 |
| 6 | 0.004332 | 0.000623 | 0.000156 | 5e-06 | 200 | 0.01 | {'max_iter': 200, 'tol': 0.01} | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1 |
| 7 | 0.00496 | 0.000663 | 0.000152 | 1e-06 | 200 | 0.001 | {'max_iter': 200, 'tol': 0.001} | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1 |
| 8 | 0.005761 | 0.000814 | 0.000156 | 4e-06 | 200 | 0.0001 | {'max_iter': 200, 'tol': 0.0001} | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1 |


Now I will just select the columns I am interested in: the two parameters and the mean test score.

In [130]:
df_cut = df[['param_max_iter','param_tol','mean_test_score']]
df_cut

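| | param_max_iter | param_tol | mean_test_score |
|---|---|---|---|
| 0 | 100 | 0.01 | 1.0 |
| 1 | 100 | 0.001 | 1.0 |
| 2 | 100 | 0.0001 | 1.0 |
| 3 | 150 | 0.01 | 1.0 |
| 4 | 150 | 0.001 | 1.0 |
| 5 | 150 | 0.0001 | 1.0 |
| 6 | 200 | 0.01 | 1.0 |
| 7 | 200 | 0.001 | 1.0 |
| 8 | 200 | 0.0001 | 1.0 |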
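When only the winning combination is of interest, a fitted `GridSearchCV` object also exposes it directly through `best_params_` and `best_score_`. A minimal sketch on synthetic data from `make_classification`, used here as a stand-in for the credit card features, so the printed values will not match this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Same grid as above: 3 values of tol times 3 values of max_iter
search = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={"tol": [0.01, 0.001, 0.0001], "max_iter": [100, 150, 200]},
    cv=5,
)
search.fit(X_demo, y_demo)

# Best combination and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)
```

When several combinations tie, as they do in the table above, `best_params_` reports only one of them, so the full table is still worth a look.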


At this point, I have a clear dataframe where I can compare the different parameter combinations and the mean score for each combination! This is some cross validation magic! In this case, it doesn't really make a difference since all the scores are 1; however, in a scenario where I wanted to find which parameter combination improves my score, this process could turn out to be very useful.
If hyperparameter tuning doesn't solve my problems, it might be because I am not using the most suitable model. But how do I choose and compare different models? This is what I will do in the following cells.

In [132]:
# I will define 4 different models (each model with different parameters) in a dictionary

from sklearn.naive_bayes import GaussianNB
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

model_params = {
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params' : {
            'C': [1,10,20],
            'kernel': ['rbf','linear']
        }  
    },
    'naive_bayes': {
        'model': GaussianNB(),
        'params': {'var_smoothing': np.logspace(0,-9, num=100)}
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10]
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),
        'params': {
            'C': [1,5,10]
        }
    }
}


In [136]:
# After this, I will loop through the dictionary, create a GridSearch classifier for each model and parameters, then append the best scores and best parameters as a dictionary to the empty list 'scores'


scores = []

for model_name, mp in model_params.items():
    clf = GridSearchCV(mp['model'],mp['params'],cv=5,return_train_score=False)
    clf.fit(rescaledX_train, y_train.ravel())
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_parameters': clf.best_params_
    })

df_models = pd.DataFrame(scores, columns=['model','best_score','best_parameters'])
df_models

| | model | best_score | best_parameters |
|---|---|---|---|
| 0 | svm | 1.0 | {'C': 1, 'kernel': 'linear'} |
| 1 | naive_bayes | 0.989583 | {'var_smoothing': 8.111308307896872e-05} |
| 2 | random_forest | 0.96686 | {'n_estimators': 10} |
| 3 | logistic_regression | 1.0 | {'C': 1} |


From the df_models dataframe it seems clear that the models with the best cross-validated score are SVM and logistic regression, both at 1.0. The best parameters for SVM are C=1 and a linear kernel. The best parameter for logistic regression is C=1.
If I had previously chosen random forest as my model and wasn't happy with its performance, this comparison between models would have been very helpful.
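The scores compared above are cross-validation scores on the training set, computed after the scaler had already been fitted on all of the training data. One possible refinement, sketched here on synthetic data rather than the credit card dataset, is to put the scaler and the model in a `Pipeline` so that the scaler is refitted inside each fold, and then to check the selected model once on the held-out test set. The step names `scale` and `clf` are arbitrary choices for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data, split the same way as in this notebook
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Scaling is part of the estimator, so each CV fold fits its own scaler
pipe = Pipeline([("scale", MinMaxScaler()), ("clf", SVC(gamma="auto"))])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"clf__C": [1, 10, 20], "clf__kernel": ["rbf", "linear"]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

# score() uses the best pipeline, refitted on the full training split
test_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print(test_accuracy)
```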
