# Exercise 12: Solutions

When the outcome variable is binary, one can estimate a linear probability model or a discrete choice model such as probit or logit. The linear probability model is simple to estimate and use, but it has some drawbacks.

The two most important disadvantages are that the fitted probabilities can be less than zero or greater than one and the partial effect of any explanatory variable (appearing in level form) is constant. These limitations of the LPM can be overcome by using more sophisticated binary response models. Probit and logit regression are nonlinear regression models specifically designed for binary dependent variables.
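Both drawbacks are easy to see on a small simulated example. The sketch below uses synthetic data (the variable names `toy`, `x`, `y` and the data-generating coefficients are assumptions made for illustration, not taken from the state-aid data) and fits a linear probability model by OLS.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical, not the state-aid data set):
# a binary outcome whose true probability follows a logistic curve in x.
rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(-2.0 + 1.5 * x)))
y = rng.binomial(1, p_true)
toy = pd.DataFrame({'y': y, 'x': x})

# Linear probability model: plain OLS on the binary outcome
lpm = smf.ols('y ~ x', data=toy).fit()
fitted = lpm.predict()

# Drawback 1: some fitted "probabilities" fall outside the unit interval
share_outside = np.mean((fitted < 0) | (fitted > 1))
print(f'share of LPM fitted values outside [0, 1]: {share_outside:.3f}')

# Drawback 2: the partial effect of x is one slope, identical for every observation
print(f"constant partial effect of x: {lpm.params['x']:.3f}")
```

In this setup the out-of-range fitted values occur for small `x`, where the straight line has to dip below zero to follow the rare outcome.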

Because a regression with a binary dependent variable $Y$ models the probability that $Y = 1$, it makes sense to adopt a nonlinear formulation that forces the predicted values to be between 0 and 1.
Because cumulative probability distribution functions (c.d.f.’s) produce probabilities between 0 and 1, they are used in logit and probit regressions.
Probit regression uses the standard normal c.d.f.
Logit regression, also called logistic regression, uses the logistic c.d.f.
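
For reference, the two models can be written as follows. The notation ($\beta_0, \dots, \beta_k$ for the coefficients and $X_1, \dots, X_k$ for the regressors) is an assumption made here for illustration. The probit model is

$$
\Pr(Y = 1 \mid X_1, \dots, X_k) = \Phi(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k),
\qquad
\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt ,
$$

and the logit model is

$$
\Pr(Y = 1 \mid X_1, \dots, X_k) = F(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k),
\qquad
F(z) = \frac{1}{1 + e^{-z}} .
$$

In both cases the partial effect of a continuous regressor $X_j$ is the density evaluated at the index times $\beta_j$, for example $\phi(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k)\,\beta_j$ in the probit model, so it varies across observations instead of being constant as in the linear probability model.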

Another problem of the linear probability model is that its errors are heteroskedastic by construction. However, heteroskedasticity-robust standard errors can be used to correct inference for this.
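
A minimal sketch of how robust standard errors can be requested in statsmodels, again on synthetic data (the data frame `toy` and the choice of the `HC1` estimator are assumptions for illustration; the models estimated in this exercise use the default covariance):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): P(y = 1 | x) = 0.1 + 0.8 x, so the error
# variance p(1 - p) changes with x by construction.
rng = np.random.default_rng(1)
n = 2_000
x = rng.uniform(0, 1, size=n)
y = rng.binomial(1, 0.1 + 0.8 * x)
toy = pd.DataFrame({'y': y, 'x': x})

model = smf.ols('y ~ x', data=toy)
res_default = model.fit()                # classical (homoskedastic) standard errors
res_robust = model.fit(cov_type='HC1')   # heteroskedasticity-robust standard errors

# The coefficients are identical; only the standard errors differ
print(pd.DataFrame({'coef': res_default.params,
                    'se_default': res_default.bse,
                    'se_robust': res_robust.bse}))
```

Only the covariance estimator changes, so the point estimates stay the same and the correction affects the standard errors, test statistics and confidence intervals.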

As long as we are interested only in average partial effects (APEs), not in individual predictions or individual partial effects, and as long as not too many probabilities are close to 0 or 1, the linear probability model often works well enough.
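
The following sketch illustrates this on synthetic data by comparing the slope of a linear probability model with the average partial effect from a logit model. The data-generating process and all variable names are assumptions for illustration. With a normally distributed regressor and probabilities that stay away from 0 and 1, the two numbers should be close.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): moderate probabilities, one normal regressor
rng = np.random.default_rng(2)
n = 5_000
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(-0.5 + 1.0 * x)))
y = rng.binomial(1, p_true)
toy = pd.DataFrame({'y': y, 'x': x})

lpm = smf.ols('y ~ x', data=toy).fit()
logit = smf.logit('y ~ x', data=toy).fit(disp=0)

# Average partial effect: the derivative of P(y = 1 | x) with respect to x,
# averaged over all observations in the sample
ape_logit = logit.get_margeff(at='overall', method='dydx').margeff[0]
ape_lpm = lpm.params['x']

print(f'APE of x, logit: {ape_logit:.3f}')
print(f'slope of x, LPM: {ape_lpm:.3f}')
```

The agreement concerns the average effect only. Individual fitted probabilities and individual partial effects can still differ considerably between the two models.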

The parameter estimates from the linear probability model make intuitive sense: a firm with higher liabilities and lower profits is more likely to receive state aid.
However, the parameter estimates are very small. This also makes sense, as most firms do not receive state aid regardless of changes in their profits and liabilities.


In [7]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf


df = pd.read_stata('data/state_aid.dta')

# estimate models:
reg_lin = smf.ols(formula='aid ~ ln_fixed_liabilities + ln_current_liabilities + ln_profit', data=df)
results_linear = reg_lin.fit()
print(f'results_linear.summary(): \n{results_linear.summary()}\n')




results_linear.summary(): 
                            OLS Regression Results                            
Dep. Variable:                    aid   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     112.8
Date:                Wed, 15 Sep 2021   Prob (F-statistic):           4.82e-73
Time:                        20:38:06   Log-Likelihood:             4.8828e+06
No. Observations:             1355480   AIC:                        -9.766e+06
Df Residuals:                 1355476   BIC:                        -9.766e+06
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
I

We now add more variables to the model and estimate it with probit and logit in addition to the linear probability model. The results are again as expected: large firms and public firms are more likely to receive state aid, and firms with a lower Altman z-score also have a higher probability of receiving state aid.

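The logit and probit models predict individual outcomes much better than the linear probability model. The highest predicted probability of receiving state aid is only 0.001 in the linear probability model, whereas it is 0.89 in the probit model and 0.67 in the logit model. The linear probability model also produces negative fitted probabilities (its minimum is about -0.002), which the probit and logit models rule out.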

In [6]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf


df = pd.read_stata('data/state_aid.dta')

# common specification for all three models
formula = ('aid ~ ln_fixed_liabilities + ln_current_liabilities + ln_profit'
           ' + liquidity_ratio + solvency_ratio + rev_emp + ln_age'
           ' + sub_gov_exp + real_gdp_per_cap + unempl_3y'
           ' + C(altman_kat) + C(public) + C(size) + C(nace_1dig)')

# estimate models: linear probability model (OLS), logit and probit
reg_lin = smf.ols(formula=formula, data=df)
results_linear = reg_lin.fit()
print(f'results_linear.summary(): \n{results_linear.summary()}\n')

reg_logit = smf.logit(formula=formula, data=df)
results_logit = reg_logit.fit()
print(f'results_logit.summary(): \n{results_logit.summary()}\n')

reg_probit = smf.probit(formula=formula, data=df)
results_probit = reg_probit.fit(disp=0)
print(f'results_probit.summary(): \n{results_probit.summary()}\n')


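# compute in-sample predicted probabilities
# (predict() without arguments uses the estimation sample)
xb_linear = results_linear.predict()
xb_logit = results_logit.predict()
xb_probit = results_probit.predict()

# summary statistics of the predicted probabilities of each model
predictions = pd.DataFrame({'linear_pred': xb_linear,
                            'probit_pred': xb_probit,
                            'logit_pref': xb_logit})
predictions.describe()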








results_linear.summary(): 
                            OLS Regression Results                            
Dep. Variable:                    aid   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     20.41
Date:                Wed, 15 Sep 2021   Prob (F-statistic):           2.33e-88
Time:                        20:36:23   Log-Likelihood:             4.0740e+06
No. Observations:             1151472   AIC:                        -8.148e+06
Df Residuals:                 1151447   BIC:                        -8.148e+06
Df Model:                          24                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
I

Optimization terminated successfully.
         Current function value: 0.000408
         Iterations 18
results_logit.summary(): 
                           Logit Regression Results                           
Dep. Variable:                    aid   No. Observations:              1151472
Model:                          Logit   Df Residuals:                  1151447
Method:                           MLE   Df Model:                           24
Date:                Wed, 15 Sep 2021   Pseudo R-squ.:                  0.2455
Time:                        20:37:11   Log-Likelihood:                -469.32
converged:                       True   LL-Null:                       -622.07
                                        LLR p-value:                 1.310e-50
                             coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                -11.2804      1.692     -6.667  

| | linear_pred | probit_pred | logit_pref |
|---|---|---|---|
| count | 1151472.0 | 1151472.0 | 1151472.0 |
| mean | 4.950186e-05 | 4.916672e-05 | 4.950186e-05 |
| std | 0.0001450707 | 0.0009200915 | 0.000889687 |
| min | -0.001849135 | 0.0 | 0.0 |
| 25% | -4.19556e-05 | 4.564047e-07 | 1.239785e-06 |
| 50% | 2.096984e-05 | 2.908795e-06 | 4.995123e-06 |
| 75% | 0.0001063841 | 1.486239e-05 | 1.875073e-05 |
| max | 0.001292647 | 0.8869479 | 0.6691788 |
