# loanapp.dta

Peng Chenxi

2021101425

### Section 1. Introduction

The topic is racial discrimination in lending markets. The main focus of my economic analysis is to investigate whether racial discrimination exists in lending markets. The dataset used is "loanapp.dta", which contains information on mortgage loan applications in Boston, 1994.

The key variables are the dependent variable approve, the independent variable of interest white, and the control variables male, married, sch, hrat and obrat.

I first describe the data. I then use a logit model to estimate the relationship between approve and white, because it is designed for binary response data, allows the response probability to be nonlinear in the regressors, keeps the estimated probabilities within the unit interval, and, unlike the linear probability model, builds the heteroskedasticity inherent in a binary outcome into its likelihood.
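As a small illustration of the bounded-probability property mentioned above, the sketch below maps a few values of the linear index through the logistic function. The index values are hypothetical and chosen only for illustration; `logistic` is a helper name introduced here.

```python
import math

def logistic(z: float) -> float:
    """Logistic CDF, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical values of the linear index x'beta (illustration only)
index_values = [-6.0, -1.5, 0.0, 1.5, 6.0]
probabilities = [logistic(z) for z in index_values]

for z, p in zip(index_values, probabilities):
    print(f"index = {z:5.1f} -> P(approve = 1) = {p:.4f}")
```

However extreme the index, the implied probability stays strictly between 0 and 1, which a linear probability model cannot guarantee.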





## Use Python and Stata Together for Sections 2-4

In [1]:
pip install pystata

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install stata_setup

Note: you may need to restart the kernel to use updated packages.


In [3]:
# Set up the Python-Stata interaction
import stata_setup
import sys
# Raw string so the backslashes in the Windows path are not read as escape sequences
sys.path.append(r'E:\stata17\utilities')
from pystata import config
config.init('mp')


  ___  ____  ____  ____  ____ ©
 /__    /   ____/   /   ____/      17.0
___/   /   /___/   /   /___/       MP—Parallel Edition

 Statistics and Data Science       Copyright 1985-2021 StataCorp LLC
                                   StataCorp
                                   4905 Lakeway Drive
                                   College Station, Texas 77845 USA
                                   800-STATA-PC        https://www.stata.com
                                   979-696-4600        stata@stata.com

Stata license: Single-user 8-core , expiring  1 Jan 2025
Serial number: 501709301094
  Licensed to: 公众号【马克数据网】
               

Notes:
      1. Unicode is supported; see help unicode_advice.
      2. More than 2 billion observations are allowed; see help obs_advice.
      3. Maximum number of variables is set to 5,000; see help set_maxvar.


### Section 2. Data cleaning

In [4]:
%%stata

* Import dataset
sysuse loanapp.dta, clear


. 
. * Import dataset
. sysuse loanapp.dta, clear

. 


In [5]:
%%stata 

* Clean the dataset by dropping observations with missing values in the variables mentioned in the introduction
drop if missing(married) | missing(male) | missing(approve)| missing(white)|missing(sch)|missing(hrat)|missing(obrat)

* Save a cleaned dataset
save loanappV1.dta, replace


. 
. * Clean the dataset by droping missing value in some variables mentioned in i
> ntroduction 
. drop if missing(married) | missing(male) | missing(approve)| missing(white)|m
> issing(sch)|missing(hrat)|missing(obrat)
(18 observations deleted)

. 
. * Save a cleaned dataset
. save loanappV1.dta, replace
file loanappV1.dta saved

. 
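The same cleaning step can be sketched on the Python side with pandas. The toy rows below are hypothetical stand-ins for loanapp.dta, and `raw`, `cleaned` and `key_vars` are names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

key_vars = ["approve", "white", "married", "male", "sch", "hrat", "obrat"]

# Hypothetical toy rows standing in for loanapp.dta (not the real data)
raw = pd.DataFrame({
    "approve": [1, 0, 1, 1],
    "white":   [1, 0, 1, 0],
    "married": [1, np.nan, 0, 1],
    "male":    [1, 1, np.nan, 0],
    "sch":     [1, 0, 1, 1],
    "hrat":    [25.0, 30.0, 22.5, 28.0],
    "obrat":   [33.0, 40.0, 31.0, 36.5],
})

# Keep only rows with no missing value in any key variable
cleaned = raw.dropna(subset=key_vars)
n_dropped = len(raw) - len(cleaned)
print(f"({n_dropped} observations deleted)")
```

Restricting `dropna` to `subset=key_vars` mirrors the Stata condition: rows are dropped only when one of the listed variables is missing, not when any other column is.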


### Section 3. Descriptive Analysis

In [6]:
%%stata

* Describe key variables
describe approve white married male sch hrat obrat

* Summarize key variables

summarize white married male sch hrat obrat if approve == 0
summarize white married male sch hrat obrat if approve == 1



. 
. * Describe key variables
. describe approve white married male sch hrat obrat

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
approve         byte    %9.0g                 =1 if action == 1 or 2
white           byte    %9.0g                 =1 if applicant white
married         byte    %9.0g                 =1 if applicant married
male            byte    %9.0g                 =1 if applicant male
sch             byte    %9.0g                 =1 if > 12 years schooling
hrat            float   %9.0g                 housing exp, % total inc
obrat           float   %9.0g                 other oblgs, % total inc

. 
. * Summarize key variables
. 
. summarize white married male sch hrat obrat if approve == 0

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
  

Here is my interpretation of the empirical findings:
1. Most applicants are white, and most loan applications are approved.
2. Comparing the two summaries, an application is more likely to be approved when the applicant is white, married, male, or more highly educated, or has a lower housing expenditure or lower other debt obligations.
3. Gender and housing expenditure may have little influence on the probability of approval.
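The comparison of the two `summarize` outputs amounts to comparing covariate means between denied and approved applications. A minimal pandas sketch of that comparison follows; the data frame `toy` is hypothetical and only shows the mechanics.

```python
import pandas as pd

# Hypothetical toy data, used only to show the mechanics (not loanapp.dta)
toy = pd.DataFrame({
    "approve": [1, 1, 1, 0, 0, 1],
    "white":   [1, 1, 0, 0, 1, 1],
    "obrat":   [30.0, 28.0, 35.0, 42.0, 38.0, 31.0],
})

# Mean of each covariate within the denied (0) and approved (1) groups
group_means = toy.groupby("approve")[["white", "obrat"]].mean()
print(group_means)
```

A higher mean of `white` and a lower mean of `obrat` in the approved group would correspond to the pattern described in point 2.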


### Section 4. Econometric Analysis

First consider the logit model. 

$$\Pr(approve_i=1|\boldsymbol{X}_i)=\dfrac{e^{\beta_0 + \beta_1 white_i +\beta_2 married_i +\beta_3 male_i +\beta_4 sch_i +\beta_5 hrat_i +\beta_6 obrat_i}}{1+ e^{\beta_0 + \beta_1 white_i +\beta_2 married_i +\beta_3 male_i +\beta_4 sch_i +\beta_5 hrat_i +\beta_6 obrat_i}} $$

#### Code manually

I use SciPy and NumPy to code the log-likelihood function and a suitable optimization routine to find the MLE. The code below is based on the lecture notes of Teacher Zi Zhong.
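For reference, the objective can be written out explicitly. This is the standard binary logit log-likelihood; $\Lambda$ and $\mu$ are shorthand introduced here, and $\boldsymbol{X}_i'\boldsymbol{\beta}$ denotes the linear index $\beta_0 + \beta_1 white_i + \dots + \beta_6 obrat_i$ from the model above.

$$\ell(\boldsymbol{\beta})=\sum_{i=1}^{n}\Big[approve_i\,\log\Lambda(\boldsymbol{X}_i'\boldsymbol{\beta})+(1-approve_i)\,\log\big(1-\Lambda(\boldsymbol{X}_i'\boldsymbol{\beta})\big)\Big],\qquad \Lambda(z)=\dfrac{e^{z}}{1+e^{z}}$$

The code below uses an ordered-logit likelihood. With only two outcomes there is a single cut point $\mu$, and the ordered form reduces to the binary one:

$$\Pr(approve_i=1|\boldsymbol{X}_i)=1-\Lambda(\mu-\boldsymbol{X}_i'\boldsymbol{\beta})=\Lambda(\boldsymbol{X}_i'\boldsymbol{\beta}-\mu)$$

Because the regressors already include a constant, only the difference between that constant's coefficient and $\mu$ is identified, and this difference plays the role of $\beta_0$.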

In [7]:
import scipy
import numpy as np
import pandas as pd

import statsmodels.api as sm
from scipy.optimize import minimize
from scipy import stats

# Import the cleaned dataset

dataset = pd.read_stata('loanappV1.dta')

# Extract predictor variables and response variable
X_pd = dataset[['white','married','male','sch','hrat','obrat']]  # Specify the column names for the predictor variables
y_pd = dataset['approve']

# Add constant to independent variables
X_pd = sm.add_constant(X_pd)

# Convert data to numpy arrays
X = X_pd.to_numpy()
y = y_pd.to_numpy()

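# Negative log-likelihood of an ordered probit/logit model, to be minimized for MLE
# (with a binary y there is a single cut point, so it reduces to the binary model)
def ordinal_II(theta, y, x, model):
    # the number of outcome categories m
    m = len(np.unique(y))
    
    # the number of columns of x
    if x.ndim==1: k=1
    if x.ndim!=1: k=np.shape(x)[1]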
    
    # linear combination
    beta = theta[0:k]
    if k==1: 
        BX = x * beta
    else:
        BX = x @ beta
        
    # mu parameter (vectorization)
    mu = theta[k:]
    mu = np.append(np.append(-np.inf,mu),np.inf)
    
    # Objective function computation
    output = 0
    for j in range(m):
        if model=="oprobit":
            output = output + np.sum((y==j)*np.log(stats.norm.cdf(mu[j+1]-BX)-stats.norm.cdf(mu[j]-BX)+1e-20))
        if model=="ologit":
            output = output + np.sum((y==j)*np.log(scipy.special.expit(mu[j+1]-BX)-scipy.special.expit(mu[j]-BX)+1e-20))
    return -output


# Defining the initial values for optimizing a solution equation
# a NumPy array that represents the initial guess or 
# starting point for the optimization algorithm
initialvalue = np.array([7.195,2.4,1.3,0.4,-0.1,0.2,0.02,-0.07]) 
logit_result = minimize(ordinal_II, initialvalue, args=(y,X,'ologit'))

# Collect the SciPy estimates of the logit model in a report table
logit_report = np.zeros((8,6), dtype=float)

logit_report[:,0] = logit_result.x

# Standard errors from the BFGS approximation of the inverse Hessian
logit_se = np.sqrt(np.diag(logit_result.hess_inv))
logit_report[:,1] = logit_se

# z-statistic
logit_z = logit_result.x/logit_se
logit_report[:,2] = logit_z.round(2)

# two-sided p-value
logit_report[:,3] = ((1-stats.norm.cdf(np.abs(logit_z)))*2).round(4)

# 95% CI
logit_report[:,4] = (logit_result.x - 1.96*logit_se).round(4)
logit_report[:,5] = (logit_result.x + 1.96*logit_se).round(4)

# make dataframe
# theta[0] multiplies the constant column and theta[7] is the cut point;
# only their difference (const - cut1) is identified, and it equals the logit intercept
logit_report = pd.DataFrame(logit_report, index=['const','white','married','male','sch','hrat','obrat','cut1'], columns=['Parameter','Std. Err.','z-stat','p-value','Lower CI','Upper CI'])

print('Estimates and hypothesis testing statistics are shown below:')
display(logit_report)




Estimates and hypothesis testing statistics are shown below:


| | Parameter | Std. Err. | z-stat | p-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| const | 4.74969 | 0.457231 | 10.39 | 0.0 | 3.8535 | 5.6459 |
| white | 1.32032 | 0.172883 | 7.64 | 0.0 | 0.9815 | 1.6592 |
| married | 0.407608 | 0.165092 | 2.47 | 0.0135 | 0.084 | 0.7312 |
| male | -0.128241 | 0.193973 | -0.66 | 0.5085 | -0.5084 | 0.2519 |
| sch | 0.154409 | 0.160134 | 0.96 | 0.3349 | -0.1595 | 0.4683 |
| hrat | 0.021976 | 0.011728 | 1.87 | 0.061 | -0.001 | 0.045 |
| obrat | -0.06664 | 0.01036 | -6.43 | 0.0 | -0.0869 | -0.0463 |
| cut1 | 2.375305 | 0.160042 | 14.84 | 0.0 | 2.0616 | 2.689 |
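The standard errors in this table come from the BFGS `hess_inv`, which is only an approximation of the inverse Hessian, and they differ somewhat from the Stata values reported later. A possible cross-check is to fit the binary logit directly and compute standard errors from the analytic Hessian $X'\,\mathrm{diag}(p(1-p))\,X$. The sketch below does this on simulated data with known coefficients; `X_sim`, `y_sim`, `beta_true`, `neg_loglik` and `neg_score` are names made up for this illustration, not part of the original notes.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)

# Simulated data with known coefficients (purely illustrative)
n = 5000
beta_true = np.array([0.5, 1.0, -1.0])
X_sim = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y_sim = (rng.random(n) < expit(X_sim @ beta_true)).astype(float)

def neg_loglik(beta, y, X):
    """Negative binary-logit log-likelihood, using log(1 + e^z) = logaddexp(0, z)."""
    z = X @ beta
    return -np.sum(y * z - np.logaddexp(0.0, z))

def neg_score(beta, y, X):
    """Gradient of neg_loglik with respect to beta."""
    return -X.T @ (y - expit(X @ beta))

fit = minimize(neg_loglik, np.zeros(3), args=(y_sim, X_sim),
               jac=neg_score, method="BFGS")

# Analytic Hessian of the negative log-likelihood: X' diag(p(1-p)) X
p_hat = expit(X_sim @ fit.x)
hessian = X_sim.T @ (X_sim * (p_hat * (1.0 - p_hat))[:, None])
se = np.sqrt(np.diag(np.linalg.inv(hessian)))

for name, b, s in zip(["const", "x1", "x2"], fit.x, se):
    print(f"{name:>5}: estimate = {b: .4f}, std. err. = {s:.4f}")
```

Applied to the loan data, the same two lines that build `hessian` and `se` could replace the `hess_inv`-based standard errors, provided the model is written with a single intercept rather than a constant plus a cut point.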


In [8]:
import sympy

# Define symbolic variable
b0 = sympy.Symbol('beta_0')
b1 = sympy.Symbol('beta_1')
b2 = sympy.Symbol('beta_2')
b3 = sympy.Symbol('beta_3')
b4 = sympy.Symbol('beta_4')
b5 = sympy.Symbol('beta_5')
b6 = sympy.Symbol('beta_6')

p1 = sympy.Symbol('P(approve_{i}=1)')
white = sympy.Symbol('white_i')
married = sympy.Symbol('married_i')
male = sympy.Symbol('male_i')
sch = sympy.Symbol('sch_i')
hrat = sympy.Symbol('hrat_i')
obrat = sympy.Symbol('obrat_i')

# linear function
z = b0 + b1 * white+ b2 * married + b3*male + b4*sch  + b5*hrat + b6*obrat

# logit model
p1 = sympy.E**z / (1+sympy.E**z)

# calculate the approval probabilities at the sample means of the other covariates
means = X_pd.mean()
p1_at_means = p1.subs(b0,2.3744).subs(b1,1.3203).subs(b2,0.4076).subs(b3,-0.1282).subs(b4,0.1544).subs(b5,0.0219).subs(b6,-0.0666).subs(married,means['married']).subs(male,means['male']).subs(sch,means['sch']).subs(hrat,means['hrat']).subs(obrat,means['obrat'])
white1_me = p1_at_means.subs(white,1)
white0_me = p1_at_means.subs(white,0)

# effect of white at the covariate means (marginal effect at the means, MEM, rather than the AME)
print(white1_me-white0_me)

0.174460305852619
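The number above evaluates the effect of white once, at the sample means of the other covariates. An average marginal effect instead computes the effect for every applicant and then averages. The sketch below contrasts the two using the coefficient values substituted in the SymPy cell; the four applicants are hypothetical rows, not observations from loanapp.dta, and `effect_of_white` is a helper introduced here.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Coefficients as substituted in the SymPy cell
b0, b_white = 2.3744, 1.3203
b_other = {"married": 0.4076, "male": -0.1282, "sch": 0.1544,
           "hrat": 0.0219, "obrat": -0.0666}

# Hypothetical applicants (not rows of loanapp.dta)
applicants = [
    {"married": 1, "male": 1, "sch": 1, "hrat": 22.0, "obrat": 30.0},
    {"married": 0, "male": 1, "sch": 0, "hrat": 30.0, "obrat": 45.0},
    {"married": 1, "male": 0, "sch": 1, "hrat": 26.0, "obrat": 36.0},
    {"married": 0, "male": 0, "sch": 1, "hrat": 35.0, "obrat": 55.0},
]

def effect_of_white(row):
    """P(approve=1 | white=1) - P(approve=1 | white=0), other covariates fixed."""
    base = b0 + sum(b_other[k] * row[k] for k in b_other)
    return logistic(base + b_white) - logistic(base)

# AME: average the individual effects
ame = sum(effect_of_white(r) for r in applicants) / len(applicants)

# MEM: evaluate the effect once, at the covariate means
mean_row = {k: sum(r[k] for r in applicants) / len(applicants) for k in b_other}
mem = effect_of_white(mean_row)

print(f"AME = {ame:.4f}, MEM = {mem:.4f}")
```

Because the logistic function is nonlinear, the two quantities generally differ, which is one reason the SymPy result need not match the Stata average exactly.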


#### Code with Stata

In [9]:
%%stata 

* run logit model
logit approve white married male sch hrat obrat

* get AME of white
replace white=1
predict pre1_approve

replace white=0
predict pre0_approve

gen dif_white = pre1_approve-pre0_approve

sum dif_white


. 
. * run logit model
. logit approve white married male sch hrat obrat

Iteration 0:   log likelihood = -737.97933  
Iteration 1:   log likelihood =  -680.6166  
Iteration 2:   log likelihood = -669.30063  
Iteration 3:   log likelihood = -669.25917  
Iteration 4:   log likelihood = -669.25917  

Logistic regression                                     Number of obs =  1,971
                                                        LR chi2(6)    = 137.44
                                                        Prob > chi2   = 0.0000
Log likelihood = -669.25917                             Pseudo R2     = 0.0931

------------------------------------------------------------------------------
     approve | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       white |    1.32032    .157367     8.39   0.000     1.011886    1.628754
     married |   .4076083   .1559157     2.61   0.009     .1020191

The empirical findings:
1. Controlling for the other variables, the probability that an application is approved is on average 17.6 percentage points higher for white applicants than for non-white applicants, and this difference is statistically significant.
2. Gender, schooling, and housing expenditure are not statistically significant.
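To make the significance statements concrete, the sketch below recomputes the z-statistics and two-sided p-values from the coefficients and standard errors printed in the Stata table, and converts the white coefficient into an odds ratio. The odds-ratio reading is an additional interpretation offered here, not part of the original analysis, and `wald_z` and `two_sided_p` are helper names introduced for the illustration.

```python
import math

# Values copied from the Stata logit table
coef_white, se_white = 1.32032, 0.157367
coef_married, se_married = 0.4076083, 0.1559157

def wald_z(coef, se):
    """Wald z-statistic for H0: coefficient = 0."""
    return coef / se

def two_sided_p(z):
    """Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2.0))

z_white = wald_z(coef_white, se_white)
z_married = wald_z(coef_married, se_married)
p_white = two_sided_p(z_white)
p_married = two_sided_p(z_married)

# exp(coefficient) is the multiplicative change in the odds of approval
odds_ratio_white = math.exp(coef_white)

print(f"white:   z = {z_white:.2f}, p = {p_white:.3g}, odds ratio = {odds_ratio_white:.2f}")
print(f"married: z = {z_married:.2f}, p = {p_married:.3f}")
```

The recomputed z-statistics agree with the `z` column that Stata prints for white and married.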

### Section 5. Conclusions

1. After controlling for other factors, the evidence indicates that racial discrimination does exist in lending markets.
2. Gender, schooling, and housing expenditure have little influence on the probability that an application is approved.