# 1. Introduction

Many banks have adopted machine learning algorithms as part of their operations. It takes in multiple features such as age, gender, income, and credit score. It uses this data to make a prediction on whether to approve someone for a credit card. Given enough quality data, machine learning algorithms can spot trends otherwise not seen by the human eye and make a call on an application. The following notebook will focus on creating a logistic regression algorithm to make these predictions.

## Resources
> 1. The data is hosted on UCI's machine learning repository: [Link](http://archive.ics.uci.edu/ml/datasets/credit+approval)
> 2. Ryan Kuhn has written a well documented [analysis of credit approval data](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html)
> 3. Code inspired by Sayak Paul's [Project on DataCamp](https://www.datacamp.com/projects/558)

## Import libraries

In [1]:
# Math & data libraries
import numpy as np
import pandas as pd

# Set the output of pandas dataframes to display up to 100 columns
pd.set_option('display.max_columns', 100)
#pd.set_option('display.max_rows', 100)

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

# Machine learning
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Misc
import tqdm as tqdm
import copy

## Loading the data using pandas

In [2]:
# Set unique path
path = ''

# Load data
cc_df = pd.read_csv(path + 'crx.data', header=None)

print(f'The dataset has {cc_df.shape[0]} rows and {cc_df.shape[1]} columns.')
cc_df.head()

The dataset has 690 rows and 16 columns.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


# 2. Investigating the data

## Summary statistics & other info

Pandas describe() function provides summary statistics for numerical features.

In [3]:
cc_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
2,690.0,4.758725,4.978163,0.0,1.0,2.75,7.2075,28.0
7,690.0,2.223406,3.346513,0.0,0.165,1.0,2.625,28.5
10,690.0,2.4,4.86294,0.0,0.0,0.0,3.0,67.0
14,690.0,1017.385507,5210.102598,0.0,0.0,5.0,395.5,100000.0


The info() function provides additional feature info. The objects seem to be categorical data and floats/ints are numerical.

In [4]:
cc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
0     690 non-null object
1     690 non-null object
2     690 non-null float64
3     690 non-null object
4     690 non-null object
5     690 non-null object
6     690 non-null object
7     690 non-null float64
8     690 non-null object
9     690 non-null object
10    690 non-null int64
11    690 non-null object
12    690 non-null object
13    690 non-null object
14    690 non-null int64
15    690 non-null object
dtypes: float64(2), int64(2), object(12)
memory usage: 86.3+ KB


# 3. Preprocessing data

## Dealing with missing data
Machine learning algorithms thrive on high quality data. Missing data negatively impacts the performance of an algorithm and many can't process them on their own.

3 possible ways to address missing data:
> 1. __Remove__ records with null values
> 2. __Impute__ null values using mean, median, etc.
> 3. __Convert__ null values into useful information

The missing values are inputted as '?'. The first task is to convert these values to NaN.

In [5]:
cc_df = cc_df.replace('?', np.NaN)   # replaces each value in df with '?' to NaN

Second task is to visualize the amount of missing data.

In [6]:
def percent_NA(data):
    """
    Returns a pandas dataframe denoting the total number of NA values and the percentage of NA values in each column.
    The column names are noted on the index.
    
    Parameters
    ----------
    data: dataframe
    """
    # pandas series denoting features and the sum of their null values
    null_sum = data.isnull().sum()

    # instantiate columns for missing data
    total = null_sum.sort_values(ascending=False)
    percent = ( ((null_sum / len(data.index))*100).round(2) ).sort_values(ascending=False)
    
    # concatenate along the columns to create the complete dataframe
    df_NA = pd.concat([total, percent], axis=1, keys=['Number of NA', 'Percent NA'])
    
    # drop rows that don't have any missing data; omit if you want to keep all rows
    df_NA = df_NA[ (df_NA.T != 0).any() ]
    
    return df_NA

In [7]:
# Call function above
cc_NA = percent_NA(cc_df)

# Count number of missing data
NA_sum = cc_NA['Number of NA'].sum()

print(f'There are {NA_sum} missing data spread among {cc_NA.shape[0]} features.')
cc_NA

There are 67 missing data spread among 7 features.


Unnamed: 0,Number of NA,Percent NA
13,13,1.88
1,12,1.74
0,12,1.74
6,9,1.3
5,9,1.3
4,6,0.87
3,6,0.87


### Fix data types

There are some values that are listed as categorical (in the form of strings) that will need to be converted to numerical before we handle the missing data.

We can go ahead and convert 1 & 13 to numerical variables (see documentation on [UCI's page for feature info](http://archive.ics.uci.edu/ml/dataset))

In [8]:
cc_df[1] = pd.to_numeric(cc_df[1])
cc_df[13] = pd.to_numeric(cc_df[13])

In [9]:
# Columns with missing data
cols_NA = cc_NA.index.tolist()

# Display dataframe column info with missing data
cc_df[cols_NA].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 7 columns):
13    677 non-null float64
1     678 non-null float64
0     678 non-null object
6     681 non-null object
5     681 non-null object
4     684 non-null object
3     684 non-null object
dtypes: float64(2), object(5)
memory usage: 37.8+ KB


Columns 1 & 13 have been fixed successfully!

### Imputing missing values
*Numerical values*: null values will be imputed with the mean of the column it's in.

*Categorical values*: null values will be imputed with the highest counted value in the column.

In [10]:
def impute_data(data):
    # impute the mean for numerical data
    data[1] = data[1].fillna( data[1].mean() )
    data[13] = data[13].fillna( data[13].mean() )
    
    # impute categorical data with the highest counted value in a column
    for col in data:
        # check if the column is an object (non-numerical)
        if data[col].dtypes == 'object':
            # store the categorical with the highest count
            max_cat_var = data[col].value_counts(ascending=False).index[0]
            # impute
            data[col] = data[col].fillna( max_cat_var ) 
    
    return data

In [11]:
cc_df = impute_data(cc_df)

### Let's see if there are any missing data left:

In [12]:
cc_NA = percent_NA(cc_df)
cc_NA

Unnamed: 0,Number of NA,Percent NA


It looks like there are no more missing data!

## Encoding categorical variables
The algorithm of choice is logistic regression. The data needs to be in a special numerical format provided by scikit-learn's LabelEncoder() function

In [13]:
def encode_vars(data):
    # Instantiate LabelEncoder
    le = preprocessing.LabelEncoder()
    
    # Encode categorical variables
    for col in data:
        if data[col].dtypes == 'object':
            data[col] = le.fit_transform(data[col])
            
    return data

In [14]:
cc_df = encode_vars(cc_df)
cc_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,1,30.83,0.0,1,0,12,7,1.25,1,1,1,0,0,202.0,0,0
1,0,58.67,4.46,1,0,10,3,3.04,1,1,6,0,0,43.0,560,0
2,0,24.5,0.5,1,0,10,3,1.5,1,0,0,0,0,280.0,824,0
3,1,27.83,1.54,1,0,12,7,3.75,1,1,5,1,0,100.0,3,0
4,1,20.17,5.625,1,0,12,7,1.71,1,0,0,0,2,120.0,0,0


## Dropping unwanted features

### Attribute information

> 0. Male
> 1. Age
> 2. Debt
> 3. Married
> 4. Bank Customer
> 5. Education Level
> 6. Ethnicity
> 7. Years Employed
> 8. Prior Default
> 9. Employed
> 10. Credit Score
> 11. Driver's License
> 12. Citizen
> 13. Zip Code
> 14. Income
> 15. Approved

Please see [Ryan Kuhn's page for attribute documentation](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html)

Columns 11 & 13 (Driver's License & Zip Code) won't be too useful for the algorithm so we will be dropping them.

In [15]:
cc_df = cc_df.drop(labels=[11, 13], axis=1)
cc_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,12,14,15
0,1,30.83,0.0,1,0,12,7,1.25,1,1,1,0,0,0
1,0,58.67,4.46,1,0,10,3,3.04,1,1,6,0,560,0
2,0,24.5,0.5,1,0,10,3,1.5,1,0,0,0,824,0
3,1,27.83,1.54,1,0,12,7,3.75,1,1,5,0,3,0
4,1,20.17,5.625,1,0,12,7,1.71,1,0,0,2,0,0


## Split the data into training and testing sets
The training set will contain 85% of the data and the remaining 15% will go to the testing set.

In [16]:
# Split features and target column to predict
X = cc_df.iloc[:, 0:13]
y = cc_df[15]

# Run train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=7)

## Rescale the values from 0-1
Scikit-learn's MinMaxScaler can do this. X_train and X_test will be rescaled

In [17]:
# Instantiate MinMaxScaler
scaler = preprocessing.MinMaxScaler()

# Rescale X_train & X_test
rescaledX_train = pd.DataFrame(scaler.fit_transform(X_train))
rescaledX_test = pd.DataFrame(scaler.fit_transform(X_test))

rescaledX_train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,1.0,0.310827,0.047679,0.5,0.0,0.461538,0.0,0.004386,0.0,0.0,0.0,0.0,0.045
1,0.0,0.057594,0.321429,0.5,0.0,0.0,0.875,0.048246,1.0,0.0,0.0,0.0,0.0
2,1.0,0.146617,0.098214,0.5,0.0,0.384615,0.25,0.157895,0.0,0.0,0.0,0.0,0.00025
3,0.0,0.131579,0.392857,1.0,1.0,0.769231,0.875,0.105263,1.0,0.0,0.0,0.0,0.0
4,1.0,0.388421,0.178571,0.5,0.0,0.384615,0.25,0.0,0.0,1.0,0.029851,0.0,1e-05


# 4. Logistic Regression Model
The logistic regression algorithm will aid in predicting whether a credit card application will be approved given our data.

## Fitting model

In [18]:
# Instantiate model
log_reg = LogisticRegression(random_state=7, solver='lbfgs')

# Fit the logistic regression to the data
log_reg.fit(rescaledX_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=7, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Make predictions with the fitted model
Once we make the predictions, the accuracy of the model will be evaluated on the testing set. A confusion matrix will be visualized to evaluate the accuracy

[Confusion matrix guide](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/):

> Rows (top to bottom): Actual: __No__, Actual: __Yes__

> Columns (left to right): Predicted: __No__, Predicted: __Yes__

In [19]:
# The score function prints the accuracy of the fitted model applied to the testing set
print(f'Accuracy of logistic regression classifier on test set: {log_reg.score(rescaledX_test, y_test)}')

# Make predictions on the testing set. For the following confusion matrix.
y_pred = log_reg.predict(X_test)

# Call confusion_matrix
confusion_matrix(y_test, y_pred)

Accuracy of logistic regression classifier on test set: 0.8653846153846154


array([[45,  2],
       [49,  8]], dtype=int64)

## Use grid search to optimize model parameters

In [20]:
# Define grid search values
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Define the parameter grid to search over
param_grid = {'tol': tol,
              'max_iter': max_iter}

In [25]:
# Instantiate GridSearchCV. The cv parameter is cross-validation with k=5 folds
grid_model = GridSearchCV(estimator=log_reg, param_grid=param_grid, cv=5)

# Rescale the features stored in X
rescaledX = pd.DataFrame(scaler.fit_transform(X))

# Fit the data to the grid model
grid_model_result = grid_model.fit(rescaledX, y)

# Summarize results
print(f'The best parameters to use are: {grid_model_result.best_params_}')
print(f'Model accuracy: {grid_model_result.best_score_}')

The best parameters to use are: {'max_iter': 100, 'tol': 0.01}
Model accuracy: 0.8507246376811595


## Re-run the model with the optimized parameters

In [28]:
# Instantiate model with new parameters
log_reg = LogisticRegression(random_state=7, solver='lbfgs',
                             max_iter=100, tol=0.01)

# Fit the logistic regression to the data
log_reg.fit(rescaledX_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=7, solver='lbfgs', tol=0.01, verbose=0,
                   warm_start=False)

In [29]:
# The score function prints the accuracy of the fitted model applied to the testing set
print(f'Accuracy of logistic regression classifier on test set: {log_reg.score(rescaledX_test, y_test)}')

# Make predictions on the testing set. For the following confusion matrix.
y_pred = log_reg.predict(X_test)

# Call confusion_matrix
confusion_matrix(y_test, y_pred)

Accuracy of logistic regression classifier on test set: 0.8653846153846154


array([[45,  2],
       [49,  8]], dtype=int64)

It's not a substantial increase in accuracy, but an increase nonetheless!