# Linear Probability Model

The purpose of this program is to regress a mortgage approval variable against race, ethnicity, gender, and other control variables found in HMDA data, using the model below.

$P(Approval = 1 \mid Race/Ethnicity/Sex, \chi_j) = \beta_0 + \lambda_j \cdot Race/Ethnicity/Sex + \beta_j \cdot \chi_j + \mu$

Where $\lambda_j$ are the coefficients on the variables of interest, $\beta_j$ are the coefficients on the control variables, $\chi_j$ are the control variables, $\beta_0$ is the intercept, and $\mu$ is the error term.
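
To make the interpretation of $\lambda_j$ concrete, here is a minimal sketch on synthetic data (not HMDA; the group labels and approval probabilities are assumptions chosen for illustration). With a single 0/1 indicator and no controls, the linear probability model reproduces the group approval rates: the intercept is the baseline group's rate and the coefficient is the difference in rates.

```python
import numpy as np

# Synthetic data (illustrative only, not HMDA): a 0/1 group dummy and a 0/1 outcome.
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=500)            # hypothetical indicator, 1 = group of interest
p_approve = np.where(group == 1, 0.80, 0.90)    # assumed true approval probabilities
approved = (rng.random(500) < p_approve).astype(float)

# Linear probability model: regress the 0/1 outcome on an intercept and the dummy.
X = np.column_stack([np.ones_like(approved), group])
beta, *_ = np.linalg.lstsq(X, approved, rcond=None)

# With a single dummy regressor, OLS reproduces the group means exactly.
baseline_rate = approved[group == 0].mean()
gap = approved[group == 1].mean() - baseline_rate
print(f"intercept={beta[0]:.4f}  baseline approval rate={baseline_rate:.4f}")
print(f"lambda   ={beta[1]:.4f}  difference in rates   ={gap:.4f}")
```

Once control variables are added, the coefficient is instead the difference in approval probability holding those controls fixed.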

Variables of Interest
- White
- Black
- Asian
- Other
- Multi-Race Interactions
- Hispanic
- Non-Hispanic
- Hispanic and Race Interactions
- Male 
- Female

Control Variables
- Income (log)
- Loan to Value ratio
- Debt to Income ratio
- Loan Amount (log)
- Pre-Approval indicators
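
The "(log)" controls appear to be natural logarithms. As a quick check, the sketch below takes the `Income` and `Loan_Amount` values from the first sample row displayed later in this notebook (185.0 and 255000.0) and compares their natural logs with the `Log_Income` and `Log_Loan_Amount` values listed in that same row (5.220356 and 12.449019).

```python
import math

# Values copied from the first sample row displayed later in this notebook.
income = 185.0           # Income as listed in that row
loan_amount = 255000.0   # Loan_Amount as listed in that row

log_income = math.log(income)             # natural log
log_loan_amount = math.log(loan_amount)   # natural log

# Compare against the Log_Income and Log_Loan_Amount columns of the same row.
print(round(log_income, 6), round(log_loan_amount, 6))
```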

Fixed Effects (possibly include)
- Lender
- Region indicators by census tract or county
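
One way to think about a lender or region fixed effect: including a dummy per group gives the same slope estimates as subtracting each group's mean from every variable first (the within transformation). The sketch below checks this on synthetic data; the lender ids, effect sizes, and the true slope of 2.0 are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_groups, per_group = 20, 30
lender = np.repeat(np.arange(n_groups), per_group)            # hypothetical lender ids
lender_effect = rng.normal(0.0, 1.0, n_groups)[lender]        # unobserved lender-level effect
x = rng.normal(0.0, 1.0, lender.size) + 0.5 * lender_effect   # regressor correlated with the effect
y = 2.0 * x + lender_effect + rng.normal(0.0, 0.1, lender.size)

# (a) Dummy-variable approach: one indicator column per lender (no separate intercept).
dummies = (lender[:, None] == np.arange(n_groups)[None, :]).astype(float)
beta_dummy, *_ = np.linalg.lstsq(np.column_stack([x, dummies]), y, rcond=None)

# (b) Within transformation: subtract each lender's mean from x and y, then regress.
def demean(v, groups, n):
    means = np.bincount(groups, weights=v, minlength=n) / np.bincount(groups, minlength=n)
    return v - means[groups]

x_w, y_w = demean(x, lender, n_groups), demean(y, lender, n_groups)
beta_within = (x_w @ y_w) / (x_w @ x_w)

print(beta_dummy[0], beta_within)   # the two slope estimates coincide
```

The demeaning route avoids building a very wide dummy matrix, which matters when there are many lenders or tracts.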

Variables omitted from the model to prevent perfect collinearity.
- White
- Non-Hispanic
- Male
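
The reason one category per group has to be dropped can be seen from the rank of the design matrix. In the toy example below (made-up values, not HMDA rows), the Male and Female indicators sum to the intercept column, so keeping all three columns leaves the matrix rank deficient; dropping Male restores full column rank.

```python
import numpy as np

# Toy Sex column (illustrative values only).
sex = np.array(["Male", "Female", "Female", "Male", "Female"])
male = (sex == "Male").astype(float)
female = (sex == "Female").astype(float)
intercept = np.ones(sex.size)

X_all = np.column_stack([intercept, male, female])   # intercept + every category
X_drop = np.column_stack([intercept, female])        # Male omitted as the baseline

# Rank versus number of columns for each design matrix.
print(np.linalg.matrix_rank(X_all), X_all.shape[1])
print(np.linalg.matrix_rank(X_drop), X_drop.shape[1])
```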

Filters
- Loan Purpose
- Occupancy Type

Clustered Standard errors
- by Lender
- by Region
- by County
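
For reference, a minimal sketch of what clustering does to the covariance estimate, written with NumPy on synthetic data. The function name `cluster_robust_cov` and the county ids are hypothetical, and no small-sample correction is applied, so the numbers would not be expected to match a library's output exactly.

```python
import numpy as np

def cluster_robust_cov(X, resid, groups):
    """Sandwich estimator (X'X)^-1 [sum_g X_g' u_g u_g' X_g] (X'X)^-1, without small-sample correction."""
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        score = X[groups == g].T @ resid[groups == g]   # sum of x_i * u_i within cluster g
        meat += np.outer(score, score)
    return bread @ meat @ bread

# Synthetic example: 300 observations spread over 15 hypothetical counties.
rng = np.random.default_rng(2)
n = 300
county = rng.integers(0, 15, size=n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.9, -0.05]) + rng.normal(scale=0.3, size=n)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

cov_cluster = cluster_robust_cov(X, resid, county)
cov_hc0 = cluster_robust_cov(X, resid, np.arange(n))   # each observation as its own cluster
print(np.sqrt(np.diag(cov_cluster)))                   # clustered standard errors
```

Treating every observation as its own cluster collapses the formula to the usual heteroskedasticity-robust (HC0) estimator, which is a handy sanity check.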

Other regressions to run that will use similar controls.
- Simplified Model (just variables of interest)
- Restricted Model
- Interest Rates
- Denial Rates
- Fixed Effects Model
- Years other than 2019

In [1]:
import pandas as pd
import numpy as np
from linearmodels import PanelOLS
import statsmodels.api as sm
from statsmodels.formula.api import ols

# np.set_printoptions(precision=3, suppress=True)

#This will allow all columns to be displayed when reviewing the data.
pd.options.display.max_columns = None

In [2]:
'''
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing

print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
tf.test.is_built_with_cuda()
print(tf.version.VERSION)
import sys
print(sys.version)
gpu = len(tf.config.list_physical_devices('GPU'))>0
print("GPU is", "available" if gpu else "NOT AVAILABLE")
'''


## Load in and manipulate dataset.

The cells below prepare the dataset before it is passed to the regression function.

In [3]:
# Load in HMDA Data
HMDA_clean_file_location = r'2019 HMDA Clean IL SAMPLE.csv'
HMDA_clean_0 = pd.read_csv(HMDA_clean_file_location)
HMDA_clean_0

Unnamed: 0,Year,Lender_LEI,State,County_Code,Census_Tract,Approved,Denied,Race,Ethnicity,Sex,Income,Log_Income,Loan_Amount,Log_Loan_Amount,LTV,Loan_Type,DTI_Ratio,preapproval,Occupancy_Type
0,2019,5Z1UQ1CWY0DQ3KJWDQ07,IL,17197.0,1.719788e+10,1,0,0_White,0_Not Hispanic,Female,185.0,5.220356,255000.0,12.449019,80.000,Conventional,20%-<30%,2,1
1,2019,OTQ7L99FG3H1GQVQBT56,IL,17201.0,1.720100e+10,1,0,0_White,0_Not Hispanic,0_Male,47.0,3.850148,125000.0,11.736069,101.180,Conventional,40,2,1
2,2019,549300U3721PJGQZYY68,IL,17197.0,1.719788e+10,1,0,0_White,0_Not Hispanic,Female,100.0,4.605170,155000.0,11.951180,80.000,Conventional,0%-20%,2,1
3,2019,5493003GQDUH26DNNH17,IL,17163.0,1.716350e+10,1,0,0_White,0_Not Hispanic,0_Male,60.0,4.094345,215000.0,12.278393,100.000,VA,50%-60%,2,1
4,2019,549300UHEEV73TKCZY62,IL,17097.0,1.709786e+10,1,0,0_White,Hispanic,0_Male,57.0,4.043051,145000.0,11.884489,80.000,Conventional,42,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2272,2019,549300O7SGM8FH65GQ47,IL,17019.0,1.701901e+10,1,0,0_White,0_Not Hispanic,Female,63.0,4.143135,255000.0,12.449019,90.000,Conventional,30%-<36%,2,1
2273,2019,C5654JQHZUHN0772B561,IL,17097.0,1.709786e+10,1,0,0_White,0_Not Hispanic,0_Male,206.0,5.327876,305000.0,12.628067,100.000,VA,46,2,1
2274,2019,549300O7SGM8FH65GQ47,IL,17115.0,1.711500e+10,1,0,0_White,0_Not Hispanic,Female,72.0,4.276666,95000.0,11.461632,89.921,Conventional,30%-<36%,2,1
2275,2019,549300U6DW7DX671T306,IL,17007.0,1.700701e+10,1,0,0_White,Hispanic,Female,202.0,5.308268,115000.0,11.652687,85.000,Conventional,43,2,3


In [4]:
#HMDA_clean.columns

### Check for further cleaning

In [5]:
#HMDA_clean.info()

In [6]:
#Clean df
HMDA_clean_1 = HMDA_clean_0.copy()
HMDA_clean_1 = HMDA_clean_1.dropna()
HMDA_clean_1['Census_Tract'] = HMDA_clean_1['Census_Tract'].apply(str)
#HMDA_clean.info()
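
A side note on the `Census_Tract` conversion: if the column is stored as a float, calling `str` on it keeps a trailing `.0`. The categories are still distinct, so the regression is unaffected, but the labels are harder to read. The sketch below shows one possible alternative, assuming the tract is an 11-digit code; `tract_label` and the example value are hypothetical.

```python
def tract_label(value: float) -> str:
    """Format a numeric census tract as an 11-digit string (state + county + tract)."""
    return f"{int(value):011d}"

raw = 17197880105.0        # hypothetical tract value stored as a float
print(str(raw))            # plain str() keeps the trailing ".0"
print(tract_label(raw))    # zero-padded integer-style label
```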

The cell below keeps only loans for a principal residence. Second residences and investment properties are omitted.

In [7]:
# Occupancy_Type codes: "Principal Residence" = 1, "Second Residence" = 2, "Investment Property" = 3.
HMDA_clean = HMDA_clean_1[HMDA_clean_1["Occupancy_Type"] == 1]

# Approval on Race/Ethnicity/Sex interactions only.

In [8]:
No_Controls_Model = ols("Approved ~ Race*Ethnicity*Sex", data = HMDA_clean).fit(cov_type = 'Cluster', cov_kwds = {'groups': HMDA_clean['County_Code']})
No_Controls_Model.summary()



0,1,2,3
Dep. Variable:,Approved,R-squared:,0.017
Model:,OLS,Adj. R-squared:,0.011
Method:,Least Squares,F-statistic:,89.07
Date:,"Fri, 03 Jun 2022",Prob (F-statistic):,8.400000000000001e-154
Time:,14:33:46,Log-Likelihood:,-339.39
No. Observations:,2153,AIC:,706.8
Df Residuals:,2139,BIC:,786.2
Df Model:,13,,
Covariance Type:,Cluster,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.9310,0.011,86.544,0.000,0.910,0.952
Race[T.Asian],-0.0012,0.027,-0.043,0.965,-0.054,0.052
Race[T.Black],-0.1128,0.019,-5.972,0.000,-0.150,-0.076
Race[T.Other],-0.1310,0.186,-0.704,0.482,-0.496,0.234
Ethnicity[T.Hispanic],-0.0515,0.015,-3.430,0.001,-0.081,-0.022
Sex[T.Female],-0.0079,0.013,-0.609,0.543,-0.033,0.018
Race[T.Asian]:Ethnicity[T.Hispanic],0.1217,0.030,4.068,0.000,0.063,0.180
Race[T.Black]:Ethnicity[T.Hispanic],0.2334,0.026,8.907,0.000,0.182,0.285
Race[T.Other]:Ethnicity[T.Hispanic],0.2515,0.188,1.341,0.180,-0.116,0.619

0,1,2,3
Omnibus:,1201.577,Durbin-Watson:,1.921
Prob(Omnibus):,0.0,Jarque-Bera (JB):,6163.802
Skew:,-2.807,Prob(JB):,0.0
Kurtosis:,9.099,Cond. No.,inf
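
As a worked reading of the interaction terms, the sketch below adds up the coefficients from the summary table above for two applicant profiles. Male is the omitted Sex category, so no Sex term enters for male applicants. The infinite condition number suggests some interaction columns are collinear or empty, so this is an illustration of the arithmetic rather than a firm estimate.

```python
# Coefficients copied from the summary table above (baseline: White, Not Hispanic, Male).
coef = {
    "Intercept": 0.9310,
    "Race[T.Black]": -0.1128,
    "Ethnicity[T.Hispanic]": -0.0515,
    "Race[T.Black]:Ethnicity[T.Hispanic]": 0.2334,
}

# Fitted approval probability for a Black, Hispanic, male applicant:
# intercept + both main effects + their interaction.
p_black_hispanic_male = sum(coef.values())

# Same calculation for a Black, non-Hispanic, male applicant: intercept + race effect only.
p_black_nonhispanic_male = coef["Intercept"] + coef["Race[T.Black]"]

print(round(p_black_hispanic_male, 4), round(p_black_nonhispanic_male, 4))
```

The first profile sums to slightly more than 1, which a linear probability model permits because fitted values are not constrained to the unit interval.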


# Model 1 - Indicators Only

In [9]:
#omit ['White', 'Not Hispanic', 'Male','DTI_less_than_20']
#don't forget to add census tract, lei, and relationships
Model_1 = ols("Approved ~ Race + Ethnicity + Sex\
          + LTV + DTI_Ratio + Lender_LEI + Census_Tract\
          + Log_Income + Log_Loan_Amount", data = HMDA_clean).fit(cov_type = 'Cluster', cov_kwds = {'groups': HMDA_clean['County_Code']})
Model_1.summary()



0,1,2,3
Dep. Variable:,Approved,R-squared:,0.821
Model:,OLS,Adj. R-squared:,0.333
Method:,Least Squares,F-statistic:,3398000000.0
Date:,"Fri, 03 Jun 2022",Prob (F-statistic):,0.0
Time:,14:33:55,Log-Likelihood:,1497.4
No. Observations:,2153,AIC:,159.1
Df Residuals:,576,BIC:,9108.0
Df Model:,1576,,
Covariance Type:,Cluster,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.9813,0.336,2.916,0.004,0.322,1.641
Race[T.Asian],0.0224,0.055,0.404,0.686,-0.086,0.131
Race[T.Black],-0.0820,0.081,-1.012,0.312,-0.241,0.077
Race[T.Other],0.0065,0.056,0.116,0.908,-0.103,0.116
Ethnicity[T.Hispanic],0.0038,0.048,0.079,0.937,-0.091,0.099
Sex[T.Female],-0.0151,0.047,-0.322,0.747,-0.107,0.077
DTI_Ratio[T.20%-<30%],-0.0402,0.073,-0.552,0.581,-0.183,0.103
DTI_Ratio[T.30%-<36%],0.0227,0.085,0.266,0.790,-0.145,0.190
DTI_Ratio[T.36],0.0321,0.097,0.330,0.741,-0.158,0.223

0,1,2,3
Omnibus:,505.057,Durbin-Watson:,2.003
Prob(Omnibus):,0.0,Jarque-Bera (JB):,7457.498
Skew:,-0.688,Prob(JB):,0.0
Kurtosis:,12.013,Cond. No.,3.78e+18
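
The gap between R-squared and adjusted R-squared in this summary follows from the number of estimated parameters: the lender and census tract indicators use up most of the degrees of freedom. A short check using the figures reported above (R-squared is rounded in the summary, so the result only approximately matches the reported 0.333):

```python
# Figures copied from the Model 1 summary above.
n_obs = 2153
df_resid = 576       # observations minus estimated parameters
r_squared = 0.821    # rounded to three decimals in the summary

# Adjusted R-squared penalizes the fit for the degrees of freedom used.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid
print(round(adj_r_squared, 3))
```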


# Model 2 - Race/Ethnicity/Sex Interactions

In [10]:
Model_2 = ols("Approved ~ Race*Ethnicity*Sex\
          + LTV + DTI_Ratio + Lender_LEI + Census_Tract\
          + Log_Income + Log_Loan_Amount", data = HMDA_clean).fit(cov_type = 'Cluster', cov_kwds = {'groups': HMDA_clean['County_Code']})
Model_2.summary()



0,1,2,3
Dep. Variable:,Approved,R-squared:,0.822
Model:,OLS,Adj. R-squared:,0.327
Method:,Least Squares,F-statistic:,27920000000.0
Date:,"Fri, 03 Jun 2022",Prob (F-statistic):,0.0
Time:,14:33:58,Log-Likelihood:,1501.6
No. Observations:,2153,AIC:,164.7
Df Residuals:,569,BIC:,9153.0
Df Model:,1583,,
Covariance Type:,Cluster,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,1.0088,0.327,3.089,0.002,0.369,1.649
Race[T.Asian],-0.0035,0.060,-0.059,0.953,-0.121,0.114
Race[T.Black],-0.0665,0.115,-0.578,0.563,-0.292,0.159
Race[T.Other],0.0549,0.121,0.454,0.650,-0.182,0.292
Ethnicity[T.Hispanic],-0.0008,0.075,-0.010,0.992,-0.148,0.146
Sex[T.Female],-0.0201,0.052,-0.389,0.697,-0.121,0.081
DTI_Ratio[T.20%-<30%],-0.0376,0.070,-0.534,0.594,-0.175,0.100
DTI_Ratio[T.30%-<36%],0.0254,0.081,0.315,0.752,-0.133,0.184
DTI_Ratio[T.36],0.0345,0.095,0.366,0.715,-0.151,0.220

0,1,2,3
Omnibus:,503.231,Durbin-Watson:,2.005
Prob(Omnibus):,0.0,Jarque-Bera (JB):,7357.474
Skew:,-0.687,Prob(JB):,0.0
Kurtosis:,11.951,Cond. No.,3.86e+18


# Model 3 - DTI/LTV Interactions

In [11]:
Model_3 = ols("Approved ~ Race + Ethnicity + Sex\
          + LTV*DTI_Ratio + Lender_LEI + Census_Tract\
          + Log_Income + Log_Loan_Amount", data = HMDA_clean).fit(cov_type = 'Cluster', cov_kwds = {'groups': HMDA_clean['County_Code']})
Model_3.summary()



0,1,2,3
Dep. Variable:,Approved,R-squared:,0.835
Model:,OLS,Adj. R-squared:,0.362
Method:,Least Squares,F-statistic:,-195700000000.0
Date:,"Fri, 03 Jun 2022",Prob (F-statistic):,1.0
Time:,14:34:01,Log-Likelihood:,1579.3
No. Observations:,2153,AIC:,31.37
Df Residuals:,558,BIC:,9082.0
Df Model:,1594,,
Covariance Type:,Cluster,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.8885,0.436,2.037,0.042,0.033,1.743
Race[T.Asian],0.0043,0.059,0.072,0.943,-0.111,0.120
Race[T.Black],-0.0782,0.076,-1.034,0.301,-0.227,0.070
Race[T.Other],-0.0073,0.041,-0.180,0.857,-0.087,0.072
Ethnicity[T.Hispanic],0.0003,0.051,0.006,0.996,-0.099,0.100
Sex[T.Female],-0.0137,0.041,-0.331,0.740,-0.095,0.067
DTI_Ratio[T.20%-<30%],0.1007,0.338,0.298,0.766,-0.562,0.763
DTI_Ratio[T.30%-<36%],0.1393,0.279,0.498,0.618,-0.409,0.687
DTI_Ratio[T.36],-0.0429,0.470,-0.091,0.927,-0.965,0.879

0,1,2,3
Omnibus:,456.374,Durbin-Watson:,1.988
Prob(Omnibus):,0.0,Jarque-Bera (JB):,7386.985
Skew:,-0.54,Prob(JB):,0.0
Kurtosis:,12.01,Cond. No.,3.7e+18


# Model 4 - Lender and Census_Tract Interaction

In [None]:
Model_4 = ols("Approved ~ Race + Ethnicity + Sex\
          + LTV + DTI_Ratio + Lender_LEI*Census_Tract\
          + Log_Income + Log_Loan_Amount", data = HMDA_clean).fit(cov_type = 'Cluster', cov_kwds = {'groups': HMDA_clean['County_Code']})
Model_4.summary()

# Model 5 - All Interactions (minus LEI/Census_Tract) and Control Variables

In [12]:
Model_5 = ols("Approved ~ Race*Ethnicity*Sex\
          + LTV*DTI_Ratio + Lender_LEI + Census_Tract\
          + Log_Income + Log_Loan_Amount\
          + C(preapproval) + C(Loan_Type)", data = HMDA_clean).fit(cov_type = 'Cluster', cov_kwds = {'groups': HMDA_clean['County_Code']})
Model_5.summary()



0,1,2,3
Dep. Variable:,Approved,R-squared:,0.839
Model:,OLS,Adj. R-squared:,0.367
Method:,Least Squares,F-statistic:,-438700000000.0
Date:,"Fri, 03 Jun 2022",Prob (F-statistic):,1.0
Time:,14:34:03,Log-Likelihood:,1608.5
No. Observations:,2153,AIC:,-4.956
Df Residuals:,547,BIC:,9108.0
Df Model:,1605,,
Covariance Type:,Cluster,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,1.1296,0.380,2.976,0.003,0.386,1.874
Race[T.Asian],-0.0373,0.066,-0.561,0.575,-0.168,0.093
Race[T.Black],-0.0546,0.092,-0.594,0.553,-0.235,0.126
Race[T.Other],-0.0132,0.107,-0.123,0.902,-0.223,0.197
Ethnicity[T.Hispanic],-0.0003,0.074,-0.004,0.997,-0.145,0.145
Sex[T.Female],-0.0132,0.043,-0.304,0.761,-0.098,0.072
DTI_Ratio[T.20%-<30%],0.1172,0.330,0.355,0.722,-0.529,0.764
DTI_Ratio[T.30%-<36%],0.1412,0.292,0.483,0.629,-0.432,0.714
DTI_Ratio[T.36],-0.0082,0.490,-0.017,0.987,-0.968,0.951

0,1,2,3
Omnibus:,443.784,Durbin-Watson:,2.003
Prob(Omnibus):,0.0,Jarque-Bera (JB):,7001.728
Skew:,-0.517,Prob(JB):,0.0
Kurtosis:,11.774,Cond. No.,3.77e+18


# Denied Model 1

In [13]:
Denied_1 = ols("Denied ~ Race + Ethnicity + Sex\
          + LTV + DTI_Ratio + Lender_LEI + Census_Tract\
          + Log_Income + Log_Loan_Amount", data = HMDA_clean).fit(cov_type = 'Cluster', cov_kwds = {'groups': HMDA_clean['County_Code']})
Denied_1.summary()



0,1,2,3
Dep. Variable:,Denied,R-squared:,0.821
Model:,OLS,Adj. R-squared:,0.333
Method:,Least Squares,F-statistic:,-18850000000.0
Date:,"Fri, 03 Jun 2022",Prob (F-statistic):,1.0
Time:,14:34:06,Log-Likelihood:,1497.4
No. Observations:,2153,AIC:,159.1
Df Residuals:,576,BIC:,9108.0
Df Model:,1576,,
Covariance Type:,Cluster,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,0.0187,0.336,0.056,0.956,-0.641,0.678
Race[T.Asian],-0.0224,0.055,-0.404,0.686,-0.131,0.086
Race[T.Black],0.0820,0.081,1.012,0.312,-0.077,0.241
Race[T.Other],-0.0065,0.056,-0.116,0.908,-0.116,0.103
Ethnicity[T.Hispanic],-0.0038,0.048,-0.079,0.937,-0.099,0.091
Sex[T.Female],0.0151,0.047,0.322,0.747,-0.077,0.107
DTI_Ratio[T.20%-<30%],0.0402,0.073,0.552,0.581,-0.103,0.183
DTI_Ratio[T.30%-<36%],-0.0227,0.085,-0.266,0.790,-0.190,0.145
DTI_Ratio[T.36],-0.0321,0.097,-0.330,0.741,-0.223,0.158

0,1,2,3
Omnibus:,505.057,Durbin-Watson:,2.003
Prob(Omnibus):,0.0,Jarque-Bera (JB):,7457.498
Skew:,0.688,Prob(JB):,0.0
Kurtosis:,12.013,Cond. No.,3.78e+18
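
The Denied coefficients above are the Model 1 coefficients with the sign flipped, and the two intercepts sum to one. That is what OLS produces whenever the outcome is exactly one minus another outcome and the regressors are the same. A minimal sketch on synthetic data, under the assumption that every application is either approved or denied:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
# Intercept, a hypothetical 0/1 indicator, and a hypothetical continuous control.
X = np.column_stack([np.ones(n), rng.integers(0, 2, size=n), rng.normal(size=n)])
approved = (rng.random(n) < 0.85).astype(float)
denied = 1.0 - approved      # assumption: every application is either approved or denied

b_app, *_ = np.linalg.lstsq(X, approved, rcond=None)
b_den, *_ = np.linalg.lstsq(X, denied, rcond=None)

print(b_app)   # approval coefficients
print(b_den)   # intercept equals 1 - b_app[0]; slopes equal -b_app[1:]
```

Under that assumption the Denied regressions carry the same information as the Approved ones, including identical standard errors.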


# Denied Model 2 - Add other interactions and control variables.

In [14]:
Denied_2 = ols("Denied ~ Race*Ethnicity*Sex\
          + LTV*DTI_Ratio + Lender_LEI + Census_Tract\
          + Log_Income + Log_Loan_Amount\
          + C(preapproval) + C(Loan_Type)", data = HMDA_clean).fit(cov_type = 'Cluster', cov_kwds = {'groups': HMDA_clean['County_Code']})
Denied_2.summary()



0,1,2,3
Dep. Variable:,Denied,R-squared:,0.839
Model:,OLS,Adj. R-squared:,0.367
Method:,Least Squares,F-statistic:,76240000000.0
Date:,"Fri, 03 Jun 2022",Prob (F-statistic):,0.0
Time:,14:34:09,Log-Likelihood:,1608.5
No. Observations:,2153,AIC:,-4.956
Df Residuals:,547,BIC:,9108.0
Df Model:,1605,,
Covariance Type:,Cluster,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-0.1296,0.380,-0.341,0.733,-0.874,0.614
Race[T.Asian],0.0373,0.066,0.561,0.575,-0.093,0.168
Race[T.Black],0.0546,0.092,0.594,0.553,-0.126,0.235
Race[T.Other],0.0132,0.107,0.123,0.902,-0.197,0.223
Ethnicity[T.Hispanic],0.0003,0.074,0.004,0.997,-0.145,0.145
Sex[T.Female],0.0132,0.043,0.304,0.761,-0.072,0.098
DTI_Ratio[T.20%-<30%],-0.1172,0.330,-0.355,0.722,-0.764,0.529
DTI_Ratio[T.30%-<36%],-0.1412,0.292,-0.483,0.629,-0.714,0.432
DTI_Ratio[T.36],0.0082,0.490,0.017,0.987,-0.951,0.968

0,1,2,3
Omnibus:,443.784,Durbin-Watson:,2.003
Prob(Omnibus):,0.0,Jarque-Bera (JB):,7001.728
Skew:,0.517,Prob(JB):,0.0
Kurtosis:,11.774,Cond. No.,3.77e+18
