# Credit Card Applications
Building an automatic credit card approval predictor using machine learning techniques using [credit card approval dataset](http://archive.ics.uci.edu/ml/datasets/credit+approval) from UCI Machine Learning Repository. The structure of this notebook is as follows:

1. Load and inspect dataset.
2. Handle missing entries.
3. Preprocess the data.
4. Build and apply machine learning model to predict credit card approval outcomes.

## 1. Load and inspect dataset

In [43]:
import pandas as pd
import numpy as np

# Load and inspect data
cc_apps_df = pd.read_csv('../datasets/credit_card.data')
cc_apps_df.head()

Unnamed: 0,b,30.83,0,u,g,w,v,1.25,t,t.1,01,f,g.1,00202,0.1,+
0,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
1,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
2,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
3,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+
4,b,32.08,4.0,u,g,m,v,2.5,t,f,0,t,g,360,0,+


### 1.1. Change column names and inspect dataset
Change column names into human readable format based on [this blog](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html). The column names will be replaced as follows: *gender, age, debt, married, bank_customer, education_level, ethnicity, years_employed, prior_default, employed, credit_score, drivers_license, citizen, zip_code, income, and approval_status*.

In [44]:
# Change column names 
cc_apps_df.columns = ['gender', 'age', 'debt', 'married', 'bank_customer', 'education_level', 'ethnicity', 
                      'years_employed', 'prior_default', 'employed', 'credit_score', 'drivers_license', 'citizen', 
                      'zip_code', 'income', 'approval_status']


# Summary statistics
cc_apps_desc = cc_apps_df.describe()
print(cc_apps_desc)

print('\n')

# Data information
cc_apps_info = cc_apps_df.info()
print(cc_apps_info)

print('\n')

# Inspect misisng valus in the dataset
cc_apps_df.tail(17)

             debt  years_employed  credit_score         income
count  689.000000      689.000000    689.000000     689.000000
mean     4.765631        2.224819      2.402032    1018.862119
std      4.978470        3.348739      4.866180    5213.743149
min      0.000000        0.000000      0.000000       0.000000
25%      1.000000        0.165000      0.000000       0.000000
50%      2.750000        1.000000      0.000000       5.000000
75%      7.250000        2.625000      3.000000     396.000000
max     28.000000       28.500000     67.000000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 689 entries, 0 to 688
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   gender           689 non-null    object 
 1   age              689 non-null    object 
 2   debt             689 non-null    float64
 3   married          689 non-null    object 
 4   bank_customer    689 non-null    object 
 5  

Unnamed: 0,gender,age,debt,married,bank_customer,education_level,ethnicity,years_employed,prior_default,employed,credit_score,drivers_license,citizen,zip_code,income,approval_status
672,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
673,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
674,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
675,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
676,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
677,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
678,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-
679,b,19.5,0.29,u,g,k,v,0.29,f,f,0,f,g,280,364,-
680,b,27.83,1.0,y,p,d,h,3.0,f,f,0,f,g,176,537,-
681,b,17.08,3.29,u,g,i,v,0.335,f,f,0,t,g,140,2,-


In [45]:
# Remove ZipCode and DriversLicense columns as that would not affect credit card approval
cc_apps_df = cc_apps_df.drop(['zip_code', 'drivers_license'], axis=1)

#### Findings:
* Have missing value labeled as '?'.
* Dataset contains both numeric and non-numeric data.
* Numeric columns have different ranges (ex: 0-28, 0-67, 0-100000).

## 2. Handle missing entries and preprocess the dataset

### 2.1 Handling missing values in numeric columns

In [46]:
# Replace '?' on gender column with np.nan.
cc_apps_df = cc_apps_df.replace('?', np.nan)

# Impute the missing values with mean imputation
cc_apps_df.fillna(cc_apps_df.mean(), inplace=True)

# Count the number of NaNs in the dataset to verify
cc_apps_df.isna().sum()

gender             12
age                12
debt                0
married             6
bank_customer       6
education_level     9
ethnicity           9
years_employed      0
prior_default       0
employed            0
credit_score        0
citizen             0
income              0
approval_status     0
dtype: int64

### 2.2 Handling missing values in categorical columns

In [47]:
# Impute missing values with the most frequent values as present in the respective columns

# Iterate over each column of cc_apps_Df
for col in cc_apps_df.columns:
    # Check if the column is of object type
    if cc_apps_df[col].dtypes == 'object':
        # Impute with the most frequent value
        cc_apps_df = cc_apps_df.fillna(cc_apps_df[col].value_counts().index[0])

# Count the number of NaNs in the dataset and print the counts to verify
cc_apps_df.isna().sum()

gender             0
age                0
debt               0
married            0
bank_customer      0
education_level    0
ethnicity          0
years_employed     0
prior_default      0
employed           0
credit_score       0
citizen            0
income             0
approval_status    0
dtype: int64

## 3. Preprocessing the data
* Convert the non-numeric data into numeric utilizing LabelEncoder.
* Scale the feature values to a uniform range.

### 3.1 Convert the non-numeric data into numeric using LabelEncoder

In [48]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
le = LabelEncoder()

# Iterate over all the values of each column and extract their dtypes
for col in cc_apps_df.columns:
    # Compare if the dtype is object
    if cc_apps_df[col].dtypes == 'object':
    # Use LabelEncoder to do the numeric transformation
        cc_apps_df[col] = le.fit_transform(cc_apps_df[col])

cc_apps_df.head()

Unnamed: 0,gender,age,debt,married,bank_customer,education_level,ethnicity,years_employed,prior_default,employed,credit_score,citizen,income,approval_status
0,0,327,4.46,2,1,11,4,3.04,1,1,6,0,560,0
1,0,89,0.5,2,1,11,4,1.5,1,0,0,0,824,0
2,1,125,1.54,2,1,13,8,3.75,1,1,5,0,3,0
3,1,43,5.625,2,1,13,8,1.71,1,0,0,2,0,0
4,1,167,4.0,2,1,10,8,2.5,1,0,0,0,0,0


### 3.2 Scale the feature values to a uniform range (0-1)

In [49]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Convert to DataFrame to a NumPy array
cc_apps_df = cc_apps_df.values

# Segregate features and labels into separate variables
X,y = cc_apps_df[:,0:13] , cc_apps_df[:,13]

# Instantiate MinMaxScaler and use it to rescale
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)

## 4. Build and apply machine learning
* Train test split
* Fitting onto Logistic Regression 
* Making predictions and evaluating performance
* Grid searching and making the model perform better

### 4.1 Train Test Split

In [52]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(rescaledX,
                                y,
                                test_size=0.33,
                                random_state=42)

### 4.2 Fitting a Logistic Regression

In [54]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(X_train, y_train);

### 4.3 Evaluating performance

In [55]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(X_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(X_test, y_test))

# Print the confusion matrix of the logreg model
print(confusion_matrix(y_test, y_pred))

Accuracy of logistic regression classifier:  0.868421052631579
[[ 95   5]
 [ 25 103]]


### 4.4 Grid searching and making the model perform better

In [56]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001 ,0.0001] # tolerance for the stopping criteria
max_iter = [100, 150, 200] # Maximum number of iterations taken for the solvers to converge

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol=tol, max_iter=max_iter)

In [57]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Fit data to grid_model
grid_model_result = grid_model.fit(rescaledX, y)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

Best: 0.850640 using {'max_iter': 100, 'tol': 0.01}
