# Logistic Regression – Credit Risk Task

Build a logistic regression model in Python to determine whether a customer should be given a loan.
--> My interpretation of this task is that I need to build a logistic regression model which predicts whether a customer will default on a loan (i.e., 1 or "Y") or not (i.e., 0 or "N"). If the customer is predicted to default, the loan should not be provided.

## Approach

1. Data exploration and cleaning
2. Model building and development
3. Evaluation of the model
4. Future notes

## Data exploration and cleaning
Here, I first looked at what data is available and cleaned it by removing some of it (e.g., unnecessary columns, rows with NA values) and recoding some of it (e.g., mapping text values to numeric categories).

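The "Default" column shows whether the applicant has defaulted on a loan, so it is the outcome/target variable. Since quite a few columns can serve as predictors for "Default," I checked the multicollinearity assumption (i.e., that predictors are not strongly correlated with each other) by calculating Variance Inflation Factors (VIF). A VIF greater than 10 is generally considered high. In the output below only the first value (61.2, printed against Age) exceeds 10. However, that output was produced by indexing the design matrix from column 0, which is the constant added by add_constant, so the 61.2 is most likely the VIF of the constant rather than of Age, and each remaining value belongs to the predictor one row above its label. Read that way, every predictor shown has a VIF below 10 and passes the check, while no VIF was computed for Cred_length.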

In [1]:
# import relevant libraries
import numpy as np
import math
import pandas as pd

from itertools import combinations

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

In [2]:
#load the table from a local CSV file
df = pd.read_csv('/Users/hasegawa.k./Desktop/CreditRiskData.csv')
df.head()
df = df.drop(columns=['Id', 'Amount'])

In [3]:
#remove any null data
df.info()
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             32581 non-null  int64  
 1   Income          32581 non-null  int64  
 2   Home            32581 non-null  object 
 3   Emp_length      31686 non-null  float64
 4   Intent          32581 non-null  object 
 5   Rate            29465 non-null  float64
 6   Percent_income  32581 non-null  float64
 7   Default         32581 non-null  object 
 8   Cred_length     32581 non-null  int64  
dtypes: float64(3), int64(3), object(3)
memory usage: 2.2+ MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28638 entries, 0 to 32580
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             28638 non-null  int64  
 1   Income          28638 non-null  int64  
 2   Home            28638 non-null  object 
 3

In [4]:
#recode data into categorical
df=df.replace({'Home':{'OTHER':0,'MORTGAGE':1,'OWN':2,'RENT':3,},
 'Intent':{'PERSONAL':0, 'EDUCATION':1, 'MEDICAL':2, 'VENTURE':3, 'HOMEIMPROVEMENT':4,
 'DEBTCONSOLIDATION':5}, 'Default':{'Y':1,'N':0}})
print(df.Home.unique())
print(df.Intent.unique())
print(df.Default.unique())

df = df.astype({'Home': 'category', 'Intent': 'category', 'Default': 'category'})
df.info()

[3 2 1 0]
[0 1 2 3 4 5]
[1 0]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28638 entries, 0 to 32580
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   Age             28638 non-null  int64   
 1   Income          28638 non-null  int64   
 2   Home            28638 non-null  category
 3   Emp_length      28638 non-null  float64 
 4   Intent          28638 non-null  category
 5   Rate            28638 non-null  float64 
 6   Percent_income  28638 non-null  float64 
 7   Default         28638 non-null  category
 8   Cred_length     28638 non-null  int64   
dtypes: category(3), float64(3), int64(3)
memory usage: 1.6 MB


In [5]:
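#checking multicollinearity
dfparameters = df.loc[:, df.columns != 'Default']
dfpara_const = add_constant(dfparameters)
# column 0 of dfpara_const is the constant, so predictor i sits at index i + 1
# (the output recorded below came from indexing from 0, so its labels are shifted by one row)
vif = pd.Series([variance_inflation_factor(dfpara_const.values, i + 1) for i in range(dfparameters.shape[1])],
                index = dfparameters.columns)
print("Variance Inflation Factors:")
print(vif)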

Variance Inflation Factors:
Age               61.221437
Income             3.940589
Home               1.147777
Emp_length         1.115218
Intent             1.092236
Rate               1.001430
Percent_income     1.034721
Cred_length        1.091116
dtype: float64
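
As a cross-check of what a VIF measures, here is a minimal sketch that computes it directly from its definition, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on all the others plus an intercept. It uses NumPy only and synthetic data; the function name `vif_table` and the columns `a`, `b`, `c` are illustrative assumptions and not part of the notebook.

```python
import numpy as np

def vif_table(X, names):
    """Return {name: VIF} for every column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n_rows = X.shape[0]
    result = {}
    for j, name in enumerate(names):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # The intercept is added here, so it is never treated as a predictor itself.
        design = np.column_stack([np.ones(n_rows), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ beta
        r_squared = 1.0 - residuals.var() / target.var()
        result[name] = 1.0 / (1.0 - r_squared)
    return result

# Synthetic data: b is almost a copy of a, c is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)

vifs = vif_table(np.column_stack([a, b, c]), ["a", "b", "c"])
for name, value in vifs.items():
    print(f"{name}: {value:.2f}")
```

In this toy data the two nearly identical columns get a VIF far above 10, while the independent column stays close to 1. Because the helper adds the intercept internally, there is no constant column to skip when labelling the results.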


## Model building and development

The logistic regression model from the scikit-learn package is used. Since there are multiple predictors, every non-empty combination of predictors is tested to find the best-performing model. The purpose of the model is to correctly identify whether someone will default on a loan from the relevant predictors, so the model with the highest accuracy on the test set is defined as the most optimised model here.

In [6]:
# list of parameter combinations
columns = list(dfparameters.columns)
combination = []
for items in range(len(columns) + 1):
    for subset in combinations(columns, items):
        combination.append(list(subset))
# drop the empty combination (no predictors) generated when items == 0
combination.pop(0)

[]
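
To get a feel for the size of this exhaustive search, the sketch below rebuilds the list of subsets from the eight predictor names shown in the df.info() output and counts them. Starting the subset size at 1 skips the empty combination, so no pop is needed. The variable name `subsets` is an illustrative assumption.

```python
from itertools import combinations

# Predictor names as listed in the df.info() output (Default is the target).
columns = ["Age", "Income", "Home", "Emp_length",
           "Intent", "Rate", "Percent_income", "Cred_length"]

# Start at size 1 so the empty subset is never generated.
subsets = [list(subset)
           for size in range(1, len(columns) + 1)
           for subset in combinations(columns, size)]

# Each predictor is either in or out, minus the empty set: 2**n - 1 models to fit.
print(len(subsets), 2 ** len(columns) - 1)
```

The count doubles with every extra predictor, which is why this approach is only practical for a small number of columns.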

In [7]:
# building logistic regression model and choosing the model with highest accuracy
highest_accuracy = 0
best_parameters = None
best_model = None
best_cm = None
best_precision = None
best_recall = None
y = df.Default

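for parameters in combination:
    x = df[parameters]
    # same random_state for every combination, so all models share one train/test split
    x_train, x_test, y_train, y_test = train_test_split(x,y, random_state=123, test_size = 0.2)
    log_reg = LogisticRegression().fit(x_train,y_train)
    # predict() already returns class labels (0 or 1), so no rounding is needed
    prediction = log_reg.predict(x_test)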
    accuracy = accuracy_score(y_test, prediction)
    cm = confusion_matrix(y_test, prediction)
    if accuracy > highest_accuracy:
        highest_accuracy = accuracy
        best_parameters = parameters
        best_model = log_reg
        best_cm = cm
        best_precision = precision_score(y_test, prediction)
        best_recall = recall_score(y_test, prediction)

  _warn_prf(average, modifier, msg_start, len(result))


## Evaluation of the model

After testing all combinations of predictors, the combination with the highest accuracy (0.82) was "Home," "Emp_length," "Rate," "Percent_income," and "Cred_length." Each predictor is defined below:
* Home: Home ownership status (Other, Mortgage, Own, Rent).
* Emp_length: Employment length in years.
* Rate: Interest rate on the loan.
* Percent_income: Loan amount as a fraction (0 to 1) of income.
* Cred_length: Length of the applicant's credit history.

Although 3 of the 5 predictors change the odds of default by only a few percent per unit, the other 2 appear to have a much larger effect. The first is Rate, with an odds ratio of 1.75, which means the odds of default increase by 75% when Rate increases by 1, holding the other predictors fixed. The second is Percent_income, with an odds ratio of 0.49, which means the odds of default decrease by 51% when Percent_income increases by 1. Since Percent_income ranges from 0 to 1, an increase of 1 spans its whole range, so the effect of a realistic change is much smaller. It is understandable that these two predictors have especially large odds ratios, as both relate directly to the affordability of the loan. I was initially surprised that Income was not selected as a predictor, but Percent_income already incorporates income.
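
The arithmetic behind these odds ratios can be sketched as follows. The coefficients are copied from the model output in this notebook; the helper name `percent_change_in_odds` and the 0.1 step for Percent_income are illustrative assumptions.

```python
import math

# Coefficients copied from the printed model output.
coefficients = {
    "Rate": 0.560333556137986,
    "Percent_income": -0.7185167381161531,
}

def percent_change_in_odds(coefficient, step=1.0):
    """Percent change in the odds of default for an increase of `step` units."""
    return (math.exp(coefficient * step) - 1.0) * 100.0

# Odds ratio for a one-unit increase is exp(coefficient).
odds_ratio_rate = math.exp(coefficients["Rate"])
odds_ratio_percent_income = math.exp(coefficients["Percent_income"])

# Percent_income lives on a 0-to-1 scale, so a 0.1 step is a more realistic change.
change_rate = percent_change_in_odds(coefficients["Rate"])
change_pi_full = percent_change_in_odds(coefficients["Percent_income"])
change_pi_tenth = percent_change_in_odds(coefficients["Percent_income"], step=0.1)

print(f"Rate: odds ratio {odds_ratio_rate:.2f}, {change_rate:+.0f}% per unit")
print(f"Percent_income: odds ratio {odds_ratio_percent_income:.2f}, "
      f"{change_pi_full:+.0f}% per unit, {change_pi_tenth:+.1f}% per 0.1")
```

Scaling the step shows that the size of an odds ratio depends on the units of the predictor, so odds ratios of predictors measured on different scales are not directly comparable.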

The model has good accuracy (0.82) but low precision (0.53) and very low recall (0.30). Low precision means that many of the predicted defaults were wrong (false positives), and low recall means that most actual defaults were missed (false negatives). Therefore, if the company cares more about avoiding loans to risky customers than about turning away safe customers, the most optimised model should be chosen on recall (or on precision and recall together) rather than accuracy. However, when I instead defined the most optimised model as the one with the highest precision or the highest recall, almost the same model with very similar evaluation scores (accuracy, precision, recall) was selected, which suggests that the choice of selection metric makes little difference to the outcome here.
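
These three scores can be reproduced by hand from the confusion matrix reported for the best model, which also makes it easy to compare accuracy against a trivial baseline. The sketch below uses the scikit-learn layout for labels 0 and 1 (rows are actual classes, columns are predicted classes); the variable names are illustrative.

```python
# Confusion matrix copied from the model output: [[TN, FP], [FN, TP]].
tn, fp = 4370, 284
fn, tp = 753, 321
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted defaults that really defaulted
recall = tp / (tp + fn)      # share of actual defaults that were caught

# Baseline: a model that always predicts "no default" gets every actual 0 right.
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"always-predict-0 baseline accuracy={baseline_accuracy:.4f}")
```

On this test split the always-predict-0 baseline already reaches 0.8125, so an accuracy of 0.82 is only a small improvement, which is one more reason to look at precision and recall when the classes are imbalanced.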

In [8]:
#result of the model
coef = best_model.coef_.flatten()
coef = coef.tolist()
print("PARAMETERS")
print("Model Intercept: ",best_model.intercept_)
for i in range(len(coef)):
    print(best_parameters[i], " - Coefficient: ", coef[i], ", Odds ratio: ", math.exp(coef[i]))
print("----")
print("EVALUATION")
print("Accuracy: ", highest_accuracy)
print("Precision: ", best_precision)
print("Recall: ", best_recall)
print("Confusion matrix: ", best_cm)

PARAMETERS
Model Intercept:  [-8.48235981]
Home  - Coefficient:  0.03677368742666623 , Odds ratio:  1.0374582044322354
Emp_length  - Coefficient:  -0.005209745829347639 , Odds ratio:  0.9948038013604403
Rate  - Coefficient:  0.560333556137986 , Odds ratio:  1.7512565452546534
Percent_income  - Coefficient:  -0.7185167381161531 , Odds ratio:  0.4874747727364457
Cred_length  - Coefficient:  -0.0005904512344334527 , Odds ratio:  0.9994097230475933
----
EVALUATION
Accuracy:  0.8189594972067039
Precision:  0.5305785123966942
Recall:  0.2988826815642458
Confusion matrix:  [[4370  284]
 [ 753  321]]


In [13]:
#PRECISION: building logistic regression model and choosing the model with highest precision
highest_accuracy_p = 0
best_parameters_p = None
best_model_p = None
best_cm_p = None
best_precision_p = 0
best_recall_p = 0

for parameters in combination:
    x = df[parameters]
    x_train, x_test, y_train, y_test = train_test_split(x,y, random_state=123, test_size = 0.2)
    log_reg = LogisticRegression().fit(x_train,y_train)
    pred = log_reg.predict(x_test)
    prediction = list(map(round, pred))
    precision = precision_score(y_test, prediction)
    if precision > best_precision_p:
        highest_accuracy_p = accuracy_score(y_test, prediction)
        best_parameters_p = parameters
        best_model_p = log_reg
        best_cm_p = confusion_matrix(y_test, prediction)
        best_precision_p = precision
        best_recall_p = recall_score(y_test, prediction)
        
#result of the model
coef_p = best_model_p.coef_.flatten()
print("PARAMETERS")
print("Model Intercept: ",best_model_p.intercept_)
for i in range(len(coef_p)):
    print(best_parameters_p[i], " - Coefficient: ", coef_p[i], ", Odds ratio: ", math.exp(coef_p[i]))
print("----")
print("EVALUATION")
print("Accuracy: ", highest_accuracy_p)
print("Precision: ", best_precision_p)
print("Recall: ", best_recall_p)
print("Confusion matrix: ", best_cm_p)

  _warn_prf(average, modifier, msg_start, len(result))

PARAMETERS
Model Intercept:  [-8.48235981]
Home  - Coefficient:  0.03677368742666623 , Odds ratio:  1.0374582044322354
Emp_length  - Coefficient:  -0.005209745829347639 , Odds ratio:  0.9948038013604403
Rate  - Coefficient:  0.560333556137986 , Odds ratio:  1.7512565452546534
Percent_income  - Coefficient:  -0.7185167381161531 , Odds ratio:  0.4874747727364457
Cred_length  - Coefficient:  -0.0005904512344334527 , Odds ratio:  0.9994097230475933
----
EVALUATION
Accuracy:  0.8189594972067039
Precision:  0.5305785123966942
Recall:  0.2988826815642458
Confusion matrix:  [[4370  284]
 [ 753  321]]



In [14]:
#RECALL: building logistic regression model and choosing the model with highest recall
highest_accuracy_r = 0
best_parameters_r = None
best_model_r = None
best_cm_r = None
best_precision_r = 0
best_recall_r = 0

for parameters in combination:
    x = df[parameters]
    x_train, x_test, y_train, y_test = train_test_split(x,y, random_state=123, test_size = 0.2)
    log_reg = LogisticRegression().fit(x_train,y_train)
    pred = log_reg.predict(x_test)
    prediction = list(map(round, pred))
    recall = recall_score(y_test, prediction)
    if recall > best_recall_r:
        highest_accuracy_r = accuracy_score(y_test, prediction)
        best_parameters_r = parameters
        best_model_r = log_reg
        best_cm_r = confusion_matrix(y_test, prediction)
        best_precision_r = precision_score(y_test, prediction)
        best_recall_r = recall
        
#result of the model
coef_r = best_model_r.coef_.flatten()
print("PARAMETERS")
print("Model Intercept: ",best_model_r.intercept_)
for i in range(len(coef_r)):
    print(best_parameters_r[i], " - Coefficient: ", coef_r[i], ", Odds ratio: ", math.exp(coef_r[i]))
print("----")
print("EVALUATION")
print("Accuracy: ", highest_accuracy_r)
print("Precision: ", best_precision_r)
print("Recall: ", best_recall_r)
print("Confusion matrix: ", best_cm_r)

PARAMETERS
Model Intercept:  [-8.47208653]
Home  - Coefficient:  0.036702247807311315 , Odds ratio:  1.037384091460347
Emp_length  - Coefficient:  -0.005188803617034683 , Odds ratio:  0.9948246349710084
Intent  - Coefficient:  -0.004756931144814877 , Odds ratio:  0.995254365133173
Rate  - Coefficient:  0.5603927090495823 , Odds ratio:  1.7513601402421997
Percent_income  - Coefficient:  -0.7183385580052793 , Odds ratio:  0.4875616387841712
Cred_length  - Coefficient:  -0.0005643484761208888 , Odds ratio:  0.999435810738528
----
EVALUATION
Accuracy:  0.8186103351955307
Precision:  0.5287356321839081
Recall:  0.29981378026070765
Confusion matrix:  [[4367  287]
 [ 752  322]]


## Future notes

Here are some issues which can be addressed in the future:
* There is another Python package for logistic regression called statsmodels, which I tested but which took too long to run. However, an advantage of this package is that it can run other assumption checks (e.g., influential outliers, checked using Cook's distance) and can produce a model summary that includes McFadden's pseudo R-squared and p-values. These extra elements would be a great source of evaluation for further improving the model.
* Other types of models could be used, such as a multiple linear model, treating the default outcome as a probability rather than a binary value to see the "likelihood" of default. (Logistic regression itself also estimates this probability; scikit-learn exposes it through predict_proba.)
* For simplicity, I defined the most optimised model as the one with the best accuracy, because I was not able to find packages that can compare all candidate models on multiple criteria. Perhaps defining the most optimised model differently (e.g., by how much variance the model explains) would change the selected model.
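
On the "likelihood" point above, here is a minimal sketch of how a fitted logistic regression turns predictor values into a probability of default, and how the decision threshold changes who gets flagged. The intercept and coefficients are copied from the accuracy-selected model's output; the applicant values and the 0.3 threshold are hypothetical, chosen only for illustration.

```python
import math

# Intercept and coefficients copied from the printed model output.
intercept = -8.48235981
coefficients = {
    "Home": 0.03677368742666623,
    "Emp_length": -0.005209745829347639,
    "Rate": 0.560333556137986,
    "Percent_income": -0.7185167381161531,
    "Cred_length": -0.0005904512344334527,
}

# Hypothetical applicant (Home=3 is the code used for RENT in this notebook).
applicant = {"Home": 3, "Emp_length": 5, "Rate": 15.0,
             "Percent_income": 0.3, "Cred_length": 4}

# Linear predictor (log-odds), then the logistic function to get a probability.
log_odds = intercept + sum(coefficients[name] * applicant[name] for name in coefficients)
probability = 1.0 / (1.0 + math.exp(-log_odds))

# The default 0.5 cut-off versus a stricter, hypothetical 0.3 cut-off.
flag_at_half = probability >= 0.5
flag_at_third = probability >= 0.3

print(f"P(default) = {probability:.3f}")
print(f"flagged at 0.5: {flag_at_half}, flagged at 0.3: {flag_at_third}")
```

Lowering the threshold flags more applicants as risky, which generally raises recall at the cost of precision; this may be a simpler lever than changing the set of predictors.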
