In [1]:
import numpy as np
from statsmodels.regression.linear_model import OLS
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.tools import add_constant
import scipy as sp
import pandas as pd

# 14.32 Problem Set 3

## Computational Exercise

Loading the data from the Excel document.

In [2]:
df = pd.read_excel("housingprices.xls")

We add a constant column to run OLS with a constant.

In [3]:
df['_cons'] = 1

### Problem 1

Here we run an ordinary least squares regression of $price$ on $crime$, $nox$, $dist$, $radial$, and $proptax$.

In [4]:
p1_res = OLS(df['price'], df[['_cons', 'crime', 'nox', 'dist', 'radial', 'proptax']]).fit()
print(p1_res.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.305
Model:                            OLS   Adj. R-squared:                  0.298
Method:                 Least Squares   F-statistic:                     43.80
Date:                Fri, 15 Mar 2019   Prob (F-statistic):           1.84e-37
Time:                        03:03:02   Log-Likelihood:                -5244.3
No. Observations:                 506   AIC:                         1.050e+04
Df Residuals:                     500   BIC:                         1.053e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons         5.2e+04   3517.287     14.784      0.0

From this regression, the $R^2$ value is $0.305$.

In [5]:
p1_res.rsquared

0.3046098301139186

For my own peace of mind, I compute the estimator by hand.

In [6]:
X = df[['_cons', 'crime', 'nox', 'dist', 'radial', 'proptax']].values
Y = df['price'].values.reshape(-1, 1)

In [7]:
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
beta_hat

array([[51998.11457491],
       [ -253.44560808],
       [-2914.91619368],
       [-1008.87358186],
       [  406.73164601],
       [ -304.93889934]])

I also compute the standard errors by hand.

In [8]:
SSR = np.sum((Y - X @ beta_hat) ** 2)
sigma_hat = np.sqrt(SSR / (len(df) - X.shape[1]))  # divide by n - k degrees of freedom
se = np.sqrt(np.diag(sigma_hat ** 2 * np.linalg.inv(X.T @ X)))[:, np.newaxis]
se

array([[3517.2866418 ],
       [  51.52851133],
       [ 527.91177856],
       [ 256.85684351],
       [  99.27946592],
       [  52.3308545 ]])
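
For reference, the two hand computations above correspond to the usual textbook formulas under homoskedasticity. Here $n = 506$ is the number of observations and $k = 6$ is the number of regressors including the constant; the notation $\hat{\sigma}^2$ and $\widehat{\mathrm{se}}$ is mine, not from the problem set.

$$
\hat{\beta} = (X'X)^{-1} X'Y, \qquad
\hat{\sigma}^2 = \frac{SSR}{n - k}, \qquad
\widehat{\mathrm{se}}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[(X'X)^{-1}\right]_{jj}}
$$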

### Problem 2

Now we can run the Breusch-Pagan test for heteroskedasticity.

In [9]:
p2a_res = het_breuschpagan(p1_res.resid, df[['_cons', 'crime', 'nox', 'dist', 'radial', 'proptax']])
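
To make the mechanics of the test concrete, here is a minimal sketch of the LM statistic in its studentized $n R^2$ form, computed by hand on synthetic data. The data, the helper `ols_resid`, and all variable names are assumptions for illustration only; this is not the housing data and not the library's own code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
# Error standard deviation grows with x, so the data are heteroskedastic by construction.
y = 1.0 + 2.0 * x + rng.normal(scale=x, size=n)

def ols_resid(y, X):
    """Residuals from an OLS regression of y on X (hypothetical helper)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Step 1: residuals from the original regression.
u = ols_resid(y, X)
# Step 2: auxiliary regression of the squared residuals on the same regressors.
u2 = u ** 2
e = ols_resid(u2, X)
r2_aux = 1.0 - (e @ e) / np.sum((u2 - u2.mean()) ** 2)
# Step 3: LM statistic, compared against a chi-squared distribution
# with (number of regressors excluding the constant) degrees of freedom.
lm_stat = n * r2_aux
print(lm_stat)
```

A large `lm_stat` relative to the chi-squared critical value is evidence against homoskedasticity.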

From this test, we can observe the $F$-statistic's $p$-value.

In [10]:
lm, lm_pval, fval, f_pval = p2a_res
f_pval

0.011432859124892414

Since this is less than $\alpha = 0.05$, we reject the null hypothesis of homoskedasticity. Now we can re-run ordinary least squares with White's heteroskedasticity-robust standard errors.

In [11]:
p2b_res = OLS(df['price'], df[['_cons', 'crime', 'nox', 'dist', 'radial', 'proptax']]).fit(cov_type='HC1', use_t=True)
print(p2b_res.summary())

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.305
Model:                            OLS   Adj. R-squared:                  0.298
Method:                 Least Squares   F-statistic:                     57.86
Date:                Fri, 15 Mar 2019   Prob (F-statistic):           1.82e-47
Time:                        03:03:02   Log-Likelihood:                -5244.3
No. Observations:                 506   AIC:                         1.050e+04
Df Residuals:                     500   BIC:                         1.053e+04
Df Model:                           5                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons         5.2e+04   3048.103     17.059      0.0

Note that the coefficient estimates have not changed, only the standard errors. In particular, all but one of the standard errors went *down*; the exception is $crime$.

Once again, I compute the White heteroskedasticity-robust standard errors by hand to confirm the results.

In [12]:
Sigma_hat = np.diag(((Y - X @ beta_hat) ** 2).reshape(-1))
se = np.sqrt(np.diag(np.linalg.inv(X.T @ X) @ (X.T @ Sigma_hat @ X) @ np.linalg.inv(X.T @ X)))[:, np.newaxis]
se

array([[3029.9769754 ],
       [  54.33429643],
       [ 423.51735529],
       [ 237.65981767],
       [  92.30803164],
       [  49.73499428]])
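
Written out, the hand computation above is the sandwich formula from White (1980), where $\hat{u}_i$ is the $i$-th OLS residual and $x_i'$ is the $i$-th row of $X$ (this notation is mine). The standard errors are the square roots of the diagonal entries.

$$
\widehat{\mathrm{Var}}_{HC0}(\hat{\beta}) = (X'X)^{-1} \left( \sum_{i=1}^{n} \hat{u}_i^2 \, x_i x_i' \right) (X'X)^{-1}
$$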

Note the slight differences between the hand-computed results and the results from `statsmodels`. They come from using a different version of White's standard error formula. `'HC1'` refers to a later version from 1985 that applies a degrees-of-freedom correction, scaling the variance estimate by $n / (n - k)$, while the formula we use in class is the original from 1980. If I use `'HC0'`, the results match the hand-computed ones.
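
As a quick arithmetic check, the sketch below rescales the hand-computed standard error of the constant by $\sqrt{n / (n - k)}$, which is how HC1 differs from HC0. The numbers are copied from the outputs above; the variable names are mine.

```python
import math

n, k = 506, 6                 # observations and regressors (including the constant)
hc0_se_cons = 3029.9769754    # hand-computed standard error of _cons from above

# HC1 multiplies the HC0 variance by n / (n - k),
# so standard errors scale by the square root of that factor.
scale = math.sqrt(n / (n - k))
hc1_se_cons = hc0_se_cons * scale
print(round(hc1_se_cons, 3))
```

The result agrees, to the printed precision, with the `_cons` standard error of 3048.103 in the `'HC1'` summary.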

### Problem 3

We can run an $F$-test on the joint null hypothesis that neither $proptax$ nor $radial$ affects the price.

In [13]:
f_test = p2b_res.f_test('(proptax = 0), (radial = 0)')

From this test, we can examine the $p$-value.

In [14]:
float(np.squeeze(f_test.pvalue))  # np.asscalar was removed from recent NumPy releases

1.5712984661012053e-08

Since the $p$-value is far less than $\alpha = 0.05$, we reject the joint null hypothesis that neither $proptax$ nor $radial$ affects the price.
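
For reference, a robust $F$-test of this kind is usually written as a Wald statistic. In the sketch below, $R$ is the $2 \times 6$ matrix that selects the $radial$ and $proptax$ coefficients, $q = 2$ is the number of restrictions, and $\hat{V}$ is the robust covariance matrix of $\hat{\beta}$; this notation is mine, and I am assuming the library compares the statistic with an $F_{q,\,n-k}$ distribution because `use_t=True` was set.

$$
F = \frac{(R\hat{\beta})' \left[ R \hat{V} R' \right]^{-1} (R\hat{\beta})}{q}
$$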

### Problem 4

We first add a new column $lprice$, the natural logarithm of $price$, to the dataframe.

In [15]:
df['lprice'] = np.log(df['price'])

We can re-run ordinary least squares with $lprice$ as the dependent variable. Since we found evidence of heteroskedasticity, we keep the robust standard errors.

In [16]:
p4_res = OLS(df['lprice'], df[['_cons', 'crime', 'nox', 'dist', 'radial', 'proptax']]).fit(cov_type='HC1', use_t=True)
print(p4_res.summary())

                            OLS Regression Results                            
Dep. Variable:                 lprice   R-squared:                       0.436
Model:                            OLS   Adj. R-squared:                  0.431
Method:                 Least Squares   F-statistic:                     68.20
Date:                Fri, 15 Mar 2019   Prob (F-statistic):           2.72e-54
Time:                        03:03:02   Log-Likelihood:                -120.39
No. Observations:                 506   AIC:                             252.8
Df Residuals:                     500   BIC:                             278.1
Df Model:                           5                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons         11.2021      0.119     94.308      0.0

The $R^2$ value for this regression is $0.436$.

In [17]:
p4_res.rsquared

0.43628441740062684

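This $R^2$ is higher than the $0.305$ from Problem 1, which suggests a better fit. The comparison is only suggestive, however, because the two regressions have different dependent variables ($price$ and $lprice$), so their $R^2$ values measure the share of variance explained in different quantities.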
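
One caveat when comparing these numbers is that the two $R^2$ values are measured on different scales. A rough way to put them on the same footing is to map the fitted log values back to the price scale and take the squared correlation with the actual price. The sketch below does this on synthetic data; the data-generating process and the helpers `fitted` and `r2` are assumptions for illustration, not part of the problem set.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = rng.uniform(0.0, 2.0, size=n)
X = np.column_stack([np.ones(n), x])
# Synthetic positive "price" drawn from a log-linear model (illustrative assumption).
price = np.exp(10.0 + 0.5 * x + rng.normal(scale=0.3, size=n))

def fitted(y, X):
    """OLS fitted values of y on X (hypothetical helper)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

def r2(y, y_hat):
    """Usual R-squared: 1 - SSR / SST (hypothetical helper)."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# R-squared of each model on its own scale.
r2_level = r2(price, fitted(price, X))
log_fit = fitted(np.log(price), X)
r2_log = r2(np.log(price), log_fit)

# Comparable measure for the log model: squared correlation between
# the actual price and the back-transformed fitted values.
r2_log_on_price_scale = np.corrcoef(price, np.exp(log_fit))[0, 1] ** 2
print(r2_level, r2_log, r2_log_on_price_scale)
```

Comparing `r2_level` with `r2_log_on_price_scale` puts both models on the price scale, whereas `r2_log` describes variation in the log of price.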