# Building a Logistic Regression Model for Predicting Probability of Default (PD)

For further analysis, the preprocessed datasets were saved as separate files. From this step on, it is not necessary to rerun the whole project from scratch; the datasets saved in the previous step can simply be loaded.



In [9]:
#import relevant libraries
import sklearn
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os 

part_1 = pd.read_csv('train_inputs_part_1.csv')
part_2 = pd.read_csv('train_inputs_part_2.csv')
part_3 = pd.read_csv('train_inputs_part_3.csv')
part_4 = pd.read_csv('train_inputs_part_4.csv')

train_inputs = pd.concat([part_1, part_2, part_3, part_4])

pd.options.display.max_rows = 50
pd.options.display.max_columns = None

train_targets = pd.read_csv('train_targets.csv')





In [12]:
shape = [train_inputs.shape, 
         train_targets.shape]
shape

[(347181, 266), (347181, 2)]

# Logistic Regression intuition 

In [19]:
# The dummy variables were generated starting with grade_A, so grade_A is the first dummy column.
# get_loc returns the positional index of that column in the train_inputs dataframe.

baseline_dummy = train_inputs.columns.get_loc("grade_A")

# Positional index of the last column in train_inputs
last_dummy = train_inputs.shape[1]-1



181

['home_ownership_OWN',
 'home_ownership_RENT',
 'verif_status_Not Verified',
 'verif_status_Source Verified',
 'verif_status_Verified',
 'loan_status_Charged Off',
 'loan_status_Current',
 'loan_status_Default',
 'loan_status_Does not meet the credit policy. Status:Charged Off',
 'loan_status_Does not meet the credit policy. Status:Fully Paid',
 'loan_status_Fully Paid',
 'loan_status_In Grace Period',
 'loan_status_Late (16-30 days)',
 'loan_status_Late (31-120 days)',
 'purpose_car',
 'purpose_credit_card',
 'purpose_debt_consolidation',
 'purpose_educational',
 'purpose_home_improvement',
 'purpose_house',
 'purpose_major_purchase',
 'purpose_medical',
 'purpose_moving',
 'purpose_other',
 'purpose_renewable_energy',
 'purpose_small_business',
 'purpose_vacation',
 'purpose_wedding',
 'addr_state_AK',
 'addr_state_AL',
 'addr_state_AR',
 'addr_state_AZ',
 'addr_state_CA',
 'addr_state_CO',
 'addr_state_CT',
 'addr_state_DC',
 'addr_state_DE',
 'addr_state_FL',
 'addr_state_GA',
 'ad

In [31]:
'''This code builds a list of the dummy-variable column names, wrapping each name in quotes and 
appending a comma so that the result can be pasted into Python code as list items'''
dummy_vars = ["'" + name + "'," for name in train_inputs.iloc[:,baseline_dummy:last_dummy].columns]

# dummy_vars should list the column names from 'grade_A' through 'income_120k-130k'.
# Note: iloc slicing excludes the end position, so the column at position last_dummy is not included.
dummy_vars
len(dummy_vars)

# Write the formatted names to a text file, one per line
with open('column-names.txt', 'w') as f:
    for name in dummy_vars:
        f.write(name + '\n')
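
The slice used above relies on positional indexing, where the end position is excluded. The following is a minimal sketch on a hypothetical toy frame (the column names are placeholders, not the real layout of `train_inputs`) showing the difference between slicing up to the last position and slicing through the end.

```python
import pandas as pd

# Hypothetical toy frame; only the column layout matters here.
toy = pd.DataFrame(columns=["loan_amnt", "grade_A", "grade_B", "income_120k-130k"])

first = toy.columns.get_loc("grade_A")  # positional index of the first dummy
last = toy.shape[1] - 1                 # positional index of the final column

# iloc excludes the end position, so the final column is left out here.
up_to_last = toy.iloc[:, first:last].columns.tolist()

# An open-ended slice runs through the final column.
through_last = toy.iloc[:, first:].columns.tolist()
```

If the final column is meant to be part of the list, the open-ended slice (or `last + 1` as the end position) would be one way to include it.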






## Dummy Variable Trap

The dummy variable trap is a common issue in regression analysis. It occurs when one of the dummy variables can be expressed as a linear combination of the others (together with the intercept). The result is perfect multicollinearity, which means the coefficients cannot be uniquely estimated or meaningfully interpreted.
Consider the example of education level with three categories: "No Higher Education," "College Graduated," and "Earned Graduate Master or More."

If we created one binary variable for each of the three categories and included all of them alongside an intercept, we would fall into the dummy variable trap: for every observation the three variables sum to 1, so "No Higher Education" can be derived as 1 minus the sum of the other two variables.

To avoid the dummy variable trap, we leave one category out of the regression model as the reference category. For example, we could leave out the "No Higher Education" category and include two dummy variables: one for "College Graduated" and one for "Earned Graduate Master or More."

The interpretation of the coefficients for these variables would then be as follows:

The coefficient for "College Graduated" would represent the effect of having a college degree compared to having no higher education, holding the other predictors in the model constant.
The coefficient for "Earned Graduate Master or More" would represent the effect of having a graduate degree compared to having no higher education, holding the other predictors in the model constant.
The omitted category, "No Higher Education," would be the reference category, and the coefficients for the other two categories would be interpreted relative to this category.
By including only two of the three dummy variables, we avoid the dummy variable trap and can still obtain an estimate of the effect of each included category relative to the reference.
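
The trap can be checked numerically. The sketch below uses a hypothetical six-row education column (the data and level names are invented for illustration) and compares the rank of the design matrix with and without a reference category removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical feature with three levels.
education = pd.Series(
    ["none", "college", "graduate", "college", "none", "graduate"],
    name="education",
)

# One dummy per level; in every row the three dummies sum to 1.
full = pd.get_dummies(education, dtype=float)

# Intercept plus all three dummies: four columns, one of them redundant.
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])

# Drop "none" so that it serves as the reference category.
reduced = full.drop(columns="none")
X_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)        # lower than the column count
rank_reduced = np.linalg.matrix_rank(X_reduced)  # equal to the column count
```

A design matrix whose rank is below its number of columns is exactly the perfect multicollinearity described above. After dropping the reference level, the matrix has full column rank.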

In [None]:
"""This step returns dummies_ref_categ dataframe with only dummy variables, and removes one variable from each class as a 
reference category to avoid dummy variable trap as discussed above"""

dummies_ref_categ = train_inputs.loc[:, ['grade_A',
'grade_B',
'grade_C',
'grade_D',
'grade_E',
'grade_F',
'home_ownership_ANY',
'home_ownership_MORTGAGE',
'home_ownership_NONE',
'home_ownership_OWN',
'home_ownership_RENT',
'verif_status_Not Verified',
'verif_status_Source Verified',
'loan_status_Current',
'loan_status_Default',
'loan_status_Does not meet the credit policy. Status:Charged Off',
'loan_status_Does not meet the credit policy. Status:Fully Paid',
'loan_status_Fully Paid',
'loan_status_In Grace Period',
'loan_status_Late (16-30 days)',
'loan_status_Late (31-120 days)',
'purpose_car',
'purpose_credit_card',
'purpose_debt_consolidation',
'purpose_educational',
'purpose_home_improvement',
'purpose_house',
'purpose_major_purchase',
'purpose_medical',
'purpose_moving',
'purpose_renewable_energy',
'purpose_small_business',
'purpose_vacation',
'purpose_wedding',
'initial_list_status_w',
'st_group_TX',
'st_group_FL',
'st_group_NY',
'st_group_CA',
'st_group_NM_MD_NC_LA_MD',
'st_group_MI_NJ_VA',
'st_group_KY_MN_NA_IN_OH',
'st_group_RI_OR_GA_WA',
'st_group_SD_ID',
'st_group_MS_MT',
'st_group_IL_CT_CO',
'st_group_VT_SC',
'st_group_KS',
'term:36',
'months_since_issued:115',
'months_since_issued:124',
'months_since_issued:133',
'months_since_issued:142',
'months_since_issued:151',
'months_since_issued:160',
'months_since_issued:169',
'months_since_issued:178',
'months_since_issued:187',
'int_rate_classes_(5.399, 7.484]',
'int_rate_classes_(7.484, 9.548]',
'int_rate_classes_(9.548, 11.612]',
'int_rate_classes_(11.612, 13.676]',
'int_rate_classes_(13.676, 15.74]',
'int_rate_classes_(15.74, 17.804]',
'int_rate_classes_(17.804, 19.868]',
'int_rate_classes_(19.868, 21.932]',
'int_rate_classes_(21.932, 23.996]',
'income_<0k',
'income_0k-10k',
'income_10k-20k',
'income_20k-30k',
'income_30k-40k',
'income_40k-50k',
'income_50k-60k',
'income_60k-70k',
'income_70k-80k',
'income_80k-90k',
'income_90k-100k',
'income_100k-110k',
'income_110k-120k']]

In [None]:
ref_categories = ['grade_G',
                  'verif_status_Verified',
                  'loan_status_Charged Off',
                  'purpose_other',
                  'home_own_none_other_any_combined',
                  'initial_list_status_f',
                  'st_group_OK_TN_AZ_DE_AR_UT',
                  'term:60',
                  'months_since_issued:106',
                  'int_rate_classes_(23.996, 26.06]',
                  'income_120k-130k',
                  ]
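
As an alternative to typing out every retained column by hand, the reference categories could be dropped from the frame directly. This is only a sketch on a hypothetical toy frame; `toy_inputs` and `toy_ref_categories` are placeholder names, and whether each name in `ref_categories` actually exists in `train_inputs` is something to verify first.

```python
import pandas as pd

# Hypothetical stand-in for train_inputs with two dummy-encoded features.
toy_inputs = pd.DataFrame({
    "grade_A": [1, 0, 0],
    "grade_B": [0, 1, 0],
    "grade_G": [0, 0, 1],
    "term:36": [1, 0, 1],
    "term:60": [0, 1, 0],
})

# Placeholder reference list; "purpose_other" is deliberately absent from the frame.
toy_ref_categories = ["grade_G", "term:60", "purpose_other"]

# Names listed as reference categories that do not exist as columns.
missing = [c for c in toy_ref_categories if c not in toy_inputs.columns]

# errors="ignore" skips absent names instead of raising a KeyError.
toy_without_ref = toy_inputs.drop(columns=toy_ref_categories, errors="ignore")
```

Inspecting `missing` before dropping helps catch reference names that are misspelled or that were never created as columns.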


# Coefficient interpretations for Logistic Regression model
See the first video of chapter 6.
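
As a general reminder of how dummy-variable coefficients are read in logistic regression: the exponential of a coefficient is the odds ratio of the outcome coded as 1, relative to the omitted reference category. The coefficient values below are invented purely for illustration and are not results from this model.

```python
import math

# Hypothetical coefficients for two grade dummies, with grade_G as the
# omitted reference category. These numbers are made up.
coef_grade_a = 0.7
coef_grade_b = 0.4

# Odds ratio of grade_A relative to the reference category grade_G.
odds_ratio_a_vs_ref = math.exp(coef_grade_a)

# Two included categories are compared through the difference of their coefficients.
odds_ratio_a_vs_b = math.exp(coef_grade_a - coef_grade_b)
```

An odds ratio above 1 means the category has higher odds of the outcome than the comparison category, and a value below 1 means lower odds.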

