# Multiple Regression Analysis: OLS Asymptotics

In [19]:
import pandas as pd
import numpy as np
import wooldridge
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from scipy.stats import chi2

In [20]:
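def lm_test(data, dependent, control_vars, test_vars):
    """LM (n * R-squared) test that the coefficients on test_vars are jointly zero."""
    if not test_vars:
        raise ValueError("test_vars must contain at least one variable")

    # Create formula right-hand sides
    controls = ' + '.join(control_vars)
    unrestricted = ' + '.join(list(control_vars) + list(test_vars))

    # Fit the restricted model (controls only) and save its residuals
    restricted_model = smf.ols(f"{dependent} ~ {controls}", data=data).fit()
    aux_data = data.assign(lm_resid=restricted_model.resid)

    # Auxiliary regression: restricted residuals on all regressors
    aux_model = smf.ols(f"lm_resid ~ {unrestricted}", data=aux_data).fit()

    # Compute LM statistic: n times the auxiliary R-squared
    n = aux_model.nobs
    q = len(test_vars)  # number of restrictions
    lm_stat = n * aux_model.rsquared
    p_value = chi2.sf(lm_stat, df=q)  # upper-tail chi-square probability

    return lm_stat, p_value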

## Examples

In [2]:
bwght = wooldridge.data('bwght')
crime1 = wooldridge.data('crime1')

### 5.1 Housing Prices and Distances From an Incinerator

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$$
$y$: house price, $x_1$: distance from incinerator, $x_2$: quality of house 

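We expect $\beta_1 > 0$ and $\beta_2 > 0$. If $x_2$ is omitted and is positively correlated with $x_1$, the OLS estimator of $\beta_1$ from the regression of $y$ on $x_1$ alone is inconsistent: it has a positive asymptotic bias, so it overstates the effect of distance even in large samples.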
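
The inconsistency can be illustrated with a small simulation. This is only a sketch under assumed values: the parameters `beta1 = 1`, `beta2 = 2`, and `delta1 = 0.5`, the normal errors, and the helper names are all hypothetical and are not taken from any housing data. It uses only the standard library so it runs without the notebook's packages. With $x_2 = \delta_1 x_1 + \text{noise}$, the slope from regressing $y$ on $x_1$ alone should settle near $\beta_1 + \beta_2 \delta_1$ rather than $\beta_1$ as $n$ grows.

```python
import random


def simple_ols_slope(x, y):
    """Slope from regressing y on x with an intercept: Cov(x, y) / Var(x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    return cov_xy / var_x


def simulate_short_regression(n, beta1=1.0, beta2=2.0, delta1=0.5, seed=0):
    """Simulate y = beta1*x1 + beta2*x2 + u, then regress y on x1 only."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    # x2 = delta1 * x1 + noise, so Cov(x1, x2) / Var(x1) = delta1 (assumed design)
    x2 = [delta1 * a + rng.gauss(0, 1) for a in x1]
    y = [beta1 * a + beta2 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    return simple_ols_slope(x1, y)


# Probability limit of the short-regression slope: beta1 + beta2 * delta1
plim_slope = 1.0 + 2.0 * 0.5

for n in (100, 10_000, 100_000):
    print(n, round(simulate_short_regression(n), 3), "plim:", plim_slope)
```

In this setup a larger sample tightens the estimate around the wrong value (2 instead of 1), which is what inconsistency means here: more data does not remove the bias.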

### 5.2 Standard Errors in a Birth Weight Equation

In [3]:
wooldridge.data('bwght', description= True)

name of dataset: bwght
no of variables: 14
no of observations: 1388

+----------+--------------------------------+
| variable | label                          |
+----------+--------------------------------+
| faminc   | 1988 family income, $1000s     |
| cigtax   | cig. tax in home state, 1988   |
| cigprice | cig. price in home state, 1988 |
| bwght    | birth weight, ounces           |
| fatheduc | father's yrs of educ           |
| motheduc | mother's yrs of educ           |
| parity   | birth order of child           |
| male     | =1 if male child               |
| white    | =1 if white                    |
| cigs     | cigs smked per day while preg  |
| lbwght   | log of bwght                   |
| bwghtlbs | birth weight, pounds           |
| packs    | packs smked per day while preg |
| lfaminc  | log(faminc)                    |
+----------+--------------------------------+

J. Mullahy (1997), “Instrumental-Variable Estimation of Count Data
Models: Applications to Models of C

In [4]:
model02 = smf.ols('lbwght ~ cigs + lfaminc', data = bwght).fit()
model02.summary2().tables[1].iloc[:, :2]

| | Coef. | Std.Err. |
|---|---|---|
| Intercept | 4.718594 | 0.018244 |
| cigs | -0.004082 | 0.000858 |
| lfaminc | 0.016266 | 0.005583 |


In [5]:
df = bwght.iloc[:len(bwght)//2]
model021 = smf.ols('lbwght ~ cigs + lfaminc', data = df).fit()
model021.summary2().tables[1].iloc[:, :2]

| | Coef. | Std.Err. |
|---|---|---|
| Intercept | 4.705583 | 0.027053 |
| cigs | -0.004637 | 0.001332 |
| lfaminc | 0.019404 | 0.008188 |


In [6]:
len(bwght), len(df)

(1388, 694)

Using half the dataset, the standard error for `cigs` is $0.00133$; using the whole dataset it falls to $0.00086$. The ratio of the standard errors is $0.000858 / 0.001332 \approx 0.644$, which is close to, though not exactly equal to, the approximation $\sqrt{694/1388} \approx 0.707$ implied by standard errors shrinking at the rate $1/\sqrt{n}$:

$$ \dfrac{SE_{new}}{SE_{old}} \approx \sqrt{\dfrac{n_{old}}{n_{new}}} $$
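
The comparison can be reproduced with a few lines of arithmetic. This is a minimal check using only the numbers printed in the regression outputs and sample sizes above; the variable names are illustrative.

```python
import math

# Values copied from the two regression outputs and the sample sizes above
n_full, n_half = 1388, 694
se_cigs_full, se_cigs_half = 0.000858, 0.001332

# Observed ratio of standard errors (full sample relative to half sample)
observed_ratio = se_cigs_full / se_cigs_half

# Ratio predicted by the 1/sqrt(n) rule of thumb
predicted_ratio = math.sqrt(n_half / n_full)

print(f"observed ratio:  {observed_ratio:.3f}")   # 0.644
print(f"predicted ratio: {predicted_ratio:.3f}")  # 0.707
```

The gap between the two numbers is a reminder that the rule is an approximation: the two samples also differ in their estimated error variance and in the sample variation of `cigs`.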

### 5.3 Economic Model of Crime 

In [7]:
wooldridge.data('crime1', description= True)

name of dataset: crime1
no of variables: 16
no of observations: 2725

+----------+---------------------------------+
| variable | label                           |
+----------+---------------------------------+
| narr86   | # times arrested, 1986          |
| nfarr86  | # felony arrests, 1986          |
| nparr86  | # property crme arr., 1986      |
| pcnv     | proportion of prior convictions |
| avgsen   | avg sentence length, mos.       |
| tottime  | time in prison since 18 (mos.)  |
| ptime86  | mos. in prison during 1986      |
| qemp86   | # quarters employed, 1986       |
| inc86    | legal income, 1986, $100s       |
| durat    | recent unemp duration           |
| black    | =1 if black                     |
| hispan   | =1 if Hispanic                  |
| born60   | =1 if born in 1960              |
| pcnvsq   | pcnv^2                          |
| pt86sq   | ptime86^2                       |
| inc86sq  | inc86^2                         |
+----------+---------------------------------+

In [8]:
model03 = smf.ols('narr86 ~ pcnv + avgsen + tottime + ptime86 + qemp86', data = crime1).fit()
model03.summary2().tables[1].iloc[:, :2]

| | Coef. | Std.Err. |
|---|---|---|
| Intercept | 0.706061 | 0.033152 |
| pcnv | -0.151225 | 0.040855 |
| avgsen | -0.007049 | 0.012412 |
| tottime | 0.012095 | 0.009577 |
| ptime86 | -0.039259 | 0.008917 |
| qemp86 | -0.103091 | 0.010397 |


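To test whether $avgsen$ and $tottime$ have no effect once the other variables are controlled for, i.e. $H_0: \beta_{avgsen} = 0, \ \beta_{tottime} = 0$, we use the LM statistic.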
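
For reference, the standard $n$-$R^2$ form of the LM test can be sketched as follows. The symbols $\tilde{u}$ (residuals from the restricted model) and $R^2_{\tilde{u}}$ (the R-squared of the auxiliary regression) are notation introduced here, not taken from the notebook. First regress $narr86$ on $pcnv$, $ptime86$, and $qemp86$ and save the residuals $\tilde{u}$. Then regress $\tilde{u}$ on all five regressors and record $R^2_{\tilde{u}}$. Under $H_0$, the statistic is asymptotically chi-square with one degree of freedom per excluded variable:

$$ LM = n \, R^2_{\tilde{u}} \ \overset{a}{\sim} \ \chi^2_{q}, \qquad q = 2 $$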

In [10]:
# Restricted model: omit avgsen and tottime, then save the residuals
model032 = smf.ols('narr86 ~ pcnv + ptime86 + qemp86', data=crime1).fit()
residuals = model032.resid

In [13]:
# Auxiliary regression: restricted residuals on all five regressors
df = crime1.copy()
df['residuals'] = residuals
model033 = smf.ols('residuals ~ pcnv + ptime86 + qemp86 + avgsen + tottime', data=df).fit()
rsquared = model033.rsquared

In [14]:
# LM statistic: n times the R-squared of the auxiliary regression
LM = rsquared * len(df)
LM

np.float64(4.070729461071595)

The 10% critical value of the chi-square distribution with 2 degrees of freedom is 4.61. Since $LM = 4.07 < 4.61$, we fail to reject the null hypothesis at the 10% level.

In [18]:
pval = 1 - chi2.cdf(LM, df=2)
pval

np.float64(0.13063282803267184)
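
Both the critical value and the p-value can be cross-checked by hand. With exactly 2 degrees of freedom, the chi-square upper-tail probability has the closed form $P(X > x) = e^{-x/2}$, so no statistical library is needed. The sketch below relies on that special case (it does not generalize to other degrees of freedom) and hard-codes the LM value reported above.

```python
import math

LM = 4.070729461071595  # LM statistic reported above
alpha = 0.10            # significance level

# For 2 degrees of freedom: P(X > x) = exp(-x / 2)
critical_value = -2 * math.log(alpha)  # solves exp(-c / 2) = alpha
p_value = math.exp(-LM / 2)

print(f"10% critical value: {critical_value:.2f}")  # 4.61
print(f"p-value: {p_value:.4f}")                    # 0.1306
print("reject H0 at 10%:", LM > critical_value)     # False
```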

The same test can be run with the `lm_test` helper defined at the top of the notebook, passing the controls and the two variables under test:

In [21]:
lm_test(crime1, 'narr86', ['pcnv', 'ptime86', 'qemp86'], ['avgsen', 'tottime'])