In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

# 14.32 Problem Set 6

## Problem 4

### Part A

First we load the Wooldridge smoking dataset from Boston College.

In [2]:
df = pd.read_stata('smoke.dta')

We add a column of ones so that the regressions below include an intercept.

In [3]:
df['_cons'] = 1

Next, we create a binary dependent variable for smoking at least one cigarette per day.

In [4]:
df['smoke'] = (df['cigs'] > 0).astype(np.float32)

Using this dependent variable, we can run probit regression analysis.

In [5]:
probit_model = sm.Probit(df['smoke'], df[['_cons', 'cigpric', 'age', 
                                          'agesq', 'lincome', 'educ']])
probit_results = probit_model.fit()
print(probit_results.summary())

Optimization terminated successfully.
         Current function value: 0.636103
         Iterations 5
                          Probit Regression Results                           
Dep. Variable:                  smoke   No. Observations:                  807
Model:                         Probit   Df Residuals:                      801
Method:                           MLE   Df Model:                            5
Date:                Thu, 09 May 2019   Pseudo R-squ.:                 0.04497
Time:                        15:56:06   Log-Likelihood:                -513.33
converged:                       True   LL-Null:                       -537.51
                                        LLR p-value:                 3.026e-09
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
_cons         -0.0029      0.840     -0.003      0.997      -1.649       1.644
cigpric       -0.0066      0.

These coefficients are difficult to interpret. Therefore, we consider the marginal effects of the variables.
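
For reference, `get_margeff()` with its default settings (method `dydx`, at `overall`, as shown in the output header) reports the average marginal effect. The notation here is not used elsewhere in this notebook and is introduced only for illustration: $x_i$ is the regressor vector for observation $i$, $\beta$ is the coefficient vector, $\Phi$ and $\phi$ are the standard normal CDF and density, and $n$ is the sample size. For the probit model $P(\text{smoke}_i = 1 \mid x_i) = \Phi(x_i'\beta)$, so

$$
\frac{\partial P(\text{smoke}_i = 1 \mid x_i)}{\partial x_{ij}} = \phi(x_i'\beta)\,\beta_j,
\qquad
\widehat{\text{AME}}_j = \hat\beta_j \cdot \frac{1}{n}\sum_{i=1}^{n} \phi(x_i'\hat\beta).
$$

The logit case has the same form with $\Phi$ replaced by the logistic CDF $\Lambda(z) = 1/(1 + e^{-z})$ and $\phi$ replaced by $\Lambda(z)\,(1 - \Lambda(z))$. Note that `age` and `agesq` enter as separate columns, so the reported effect of `age` holds `agesq` fixed rather than tracing the full quadratic.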

In [6]:
probit_marg = probit_results.get_margeff()
print(probit_marg.summary())

       Probit Marginal Effects       
Dep. Variable:                  smoke
Method:                          dydx
At:                           overall
                dy/dx    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
cigpric       -0.0024      0.004     -0.681      0.496      -0.009       0.004
age            0.0228      0.006      3.879      0.000       0.011       0.034
agesq         -0.0003   6.52e-05     -4.563      0.000      -0.000      -0.000
lincome        0.0067      0.025      0.263      0.792      -0.043       0.057
educ          -0.0306      0.006     -5.281      0.000      -0.042      -0.019


We can repeat the same process for logit regression analysis.

In [7]:
logit_model = sm.Logit(df['smoke'], df[['_cons', 'cigpric', 'age', 
                                        'agesq', 'lincome', 'educ']])
logit_results = logit_model.fit()
print(logit_results.summary())

Optimization terminated successfully.
         Current function value: 0.636258
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                  smoke   No. Observations:                  807
Model:                          Logit   Df Residuals:                      801
Method:                           MLE   Df Model:                            5
Date:                Thu, 09 May 2019   Pseudo R-squ.:                 0.04474
Time:                        15:56:06   Log-Likelihood:                -513.46
converged:                       True   LL-Null:                       -537.51
                                        LLR p-value:                 3.403e-09
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
_cons         -0.0294      1.370     -0.021      0.983      -2.715       2.656
cigpric       -0.0109      0.

Once again, we consider the marginal effects of the variables.

In [8]:
logit_marg = logit_results.get_margeff()
print(logit_marg.summary())

        Logit Marginal Effects       
Dep. Variable:                  smoke
Method:                          dydx
At:                           overall
                dy/dx    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
cigpric       -0.0024      0.004     -0.689      0.491      -0.009       0.004
age            0.0234      0.006      3.866      0.000       0.012       0.035
agesq         -0.0003   6.82e-05     -4.491      0.000      -0.000      -0.000
lincome        0.0063      0.026      0.244      0.807      -0.044       0.056
educ          -0.0304      0.006     -5.211      0.000      -0.042      -0.019


We can also estimate a linear probability model by OLS, using heteroskedasticity-robust (HC1) standard errors, and compare the results.

In [9]:
ols_model = sm.OLS(df['smoke'], df[['_cons', 'cigpric', 'age', 'agesq', 
                                    'lincome', 'educ']])
ols_results = ols_model.fit(cov_type='HC1', use_t=True)
print(ols_results.summary())

                            OLS Regression Results                            
Dep. Variable:                  smoke   R-squared:                       0.054
Model:                            OLS   Adj. R-squared:                  0.048
Method:                 Least Squares   F-statistic:                     12.50
Date:                Thu, 09 May 2019   Prob (F-statistic):           1.07e-11
Time:                        15:56:06   Log-Likelihood:                -540.96
No. Observations:                 807   AIC:                             1094.
Df Residuals:                     801   BIC:                             1122.
Df Model:                           5                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons          0.4995      0.302      1.652      0.0

### Part B

We can compare the coefficients between the probit and logit regressions. They are printed again below for convenience:

In [10]:
print(probit_results.params)

_cons     -0.002851
cigpric   -0.006570
age        0.062623
agesq     -0.000818
lincome    0.018448
educ      -0.084260
dtype: float64


In [11]:
print(logit_results.params)

_cons     -0.029363
cigpric   -0.010861
age        0.104903
agesq     -0.001371
lincome    0.027996
educ      -0.136240
dtype: float64


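The two sets of coefficients differ noticeably in magnitude, although every coefficient has the same sign in both models. This is expected: the probit link is the standard normal CDF, whose underlying error has variance 1, while the logit link is the logistic CDF, whose underlying error has variance $\pi^2/3$. The coefficients are therefore measured on different scales, and the logit slopes here are roughly 1.5 to 1.7 times the probit slopes. The raw coefficients are not directly comparable across the two models, even though both fit the data almost equally well (log-likelihoods of -513.33 and -513.46).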
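
As a quick check on how far apart the two sets of slopes are, the sketch below divides each logit slope by the matching probit slope. The numbers are copied by hand from the printed output above; the dictionary names are introduced here for illustration only. A commonly quoted rule of thumb puts this ratio at about 1.6.

```python
# Slope coefficients copied from the printed probit and logit output above.
probit_params = {
    "cigpric": -0.006570,
    "age": 0.062623,
    "agesq": -0.000818,
    "lincome": 0.018448,
    "educ": -0.084260,
}
logit_params = {
    "cigpric": -0.010861,
    "age": 0.104903,
    "agesq": -0.001371,
    "lincome": 0.027996,
    "educ": -0.136240,
}

# Ratio of the logit slope to the probit slope for each regressor.
# The constant is left out because it is estimated very imprecisely in both models.
ratios = {name: logit_params[name] / probit_params[name] for name in probit_params}

for name, ratio in ratios.items():
    print(f"{name}\t{ratio:.2f}")
```

In the notebook itself the same comparison could be made directly as `logit_results.params / probit_results.params`, since both are pandas Series with the same index.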

### Part C

We can also compare the marginal effects of the three models. They are printed again below for convenience:

In [12]:
for p, m in zip(['cigpric', 'age', 'agesq', 'lincome', 'educ'], 
                probit_marg.margeff):
    print(f"{p}\t{m}")

cigpric	-0.0023888482409308355
age	0.022769849064765608
agesq	-0.00029756022601105996
lincome	0.006707633681726004
educ	-0.030637071481779486


In [13]:
for p, m in zip(['cigpric', 'age', 'agesq', 'lincome', 'educ'], 
                logit_marg.margeff):
    print(f"{p}\t{m}")

cigpric	-0.002425635732371646
age	0.023428323130750477
agesq	-0.0003062247647835375
lincome	0.006252407079763507
educ	-0.03042690694491135


In [14]:
print(ols_results.params)

_cons      0.499525
cigpric   -0.002256
age        0.019985
agesq     -0.000261
lincome    0.008319
educ      -0.029317
dtype: float64


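The probit and logit marginal effects are very similar, differing by well under 10 percent for every regressor. This is because the two models share the same structure, a linear index passed through an S-shaped CDF, and differ only in the link function. The average marginal effect rescales each coefficient by the average density of the link evaluated at the fitted index, which undoes the difference in coefficient scale seen in Part B.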
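
One way to see why the marginal effects line up is to back out the factor that converts each coefficient into its average marginal effect. For both models that factor is the sample average of the link density at the fitted index, so it should be the same for every regressor within a model. The sketch below checks this using numbers copied by hand from the output above; the helper `implied_scale` and the dictionary names are hypothetical.

```python
# Coefficients and average marginal effects copied from the printed output above.
probit_coef = {"cigpric": -0.006570, "age": 0.062623, "agesq": -0.000818,
               "lincome": 0.018448, "educ": -0.084260}
probit_ame = {"cigpric": -0.0023888482409308355, "age": 0.022769849064765608,
              "agesq": -0.00029756022601105996, "lincome": 0.006707633681726004,
              "educ": -0.030637071481779486}

logit_coef = {"cigpric": -0.010861, "age": 0.104903, "agesq": -0.001371,
              "lincome": 0.027996, "educ": -0.136240}
logit_ame = {"cigpric": -0.002425635732371646, "age": 0.023428323130750477,
             "agesq": -0.0003062247647835375, "lincome": 0.006252407079763507,
             "educ": -0.03042690694491135}


def implied_scale(ame, coef):
    """Return the ratio of marginal effect to coefficient for each regressor."""
    return {name: ame[name] / coef[name] for name in coef}


probit_scale = implied_scale(probit_ame, probit_coef)
logit_scale = implied_scale(logit_ame, logit_coef)

for name in probit_coef:
    print(f"{name}\tprobit {probit_scale[name]:.4f}\tlogit {logit_scale[name]:.4f}")
```

Within each model the ratios agree up to the rounding of the printed coefficients. Both sit below the peak of the corresponding density (about 0.399 for the standard normal and 0.25 for the logistic), as an average density must.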

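On the other hand, the OLS coefficients, which are themselves the marginal effects of the linear probability model, differ slightly: they are somewhat smaller in magnitude for every regressor except `lincome`. The linear probability model imposes a constant marginal effect for each regressor and does not restrict fitted probabilities to the range $(0, 1)$, whereas the probit and logit marginal effects vary with the regressors and are averaged over the sample. Nonetheless, all three models are fit to the same data, so the estimated effects have the same signs and similar magnitudes.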
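
To illustrate the range restriction, here is a minimal sketch on hypothetical toy data (not the smoking dataset). A one-regressor linear probability model is fit with the closed-form OLS formulas, and its fitted values are compared with the same linear index passed through the logistic CDF. All names and data values are made up for illustration.

```python
import math

# Hypothetical toy data: x plays the role of years of education,
# and y equals 1 for the eight lowest values of x and 0 otherwise.
x = list(range(21))
y = [1.0 if xi < 8 else 0.0 for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS slope and intercept for a single regressor.
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Fitted "probabilities" from the linear probability model.
lpm_fitted = [intercept + slope * xi for xi in x]


def logistic_cdf(z):
    """Logistic CDF, the link function used by the logit model."""
    return 1.0 / (1.0 + math.exp(-z))


# Passing a linear index through a CDF keeps every value strictly inside (0, 1).
# This only transforms the OLS index; it is not a fitted logit model.
squashed = [logistic_cdf(v) for v in lpm_fitted]

print(f"LPM fitted range:      {min(lpm_fitted):.3f} to {max(lpm_fitted):.3f}")
print(f"CDF-transformed range: {min(squashed):.3f} to {max(squashed):.3f}")
```

On the actual data, a comparable check would be to inspect the minimum and maximum of `ols_results.predict()`.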