## Regression Inference and OLS Asymptotics

In this notebook, we study and demonstrate the use of *Python* to perform **statistical inference** on our regression models. We also explore **asymptotic theory** and see how it allows us to relax some of the assumptions needed to derive the sampling distribution of estimators if the sample size is large enough.

**Topics:**

1. The *t* Test
2. Confidence Intervals
3. Linear Restrictions: *F* Tests
4. Simulation Exercises
5. LM Test

Section 4.1 of Wooldridge (2019) adds assumption MLR.6 (normal distribution of the error term) to the previous assumptions MLR.1 through MLR.5. Together, these assumptions constitute the classical linear model (CLM).

The main additional result we get from this assumption is stated in Theorem 4.1: The OLS parameter estimators are normally distributed (conditional on the regressors $x_1, x_2, \dots, x_k$). The benefit of this result is that it allows us to do statistical inference similar to the approaches discussed for the simple estimator of the mean of a normally distributed random variable.

### 1. The *t* Test

After the sign and magnitude of the estimated parameters, empirical research typically pays most attention to the results of the *t* tests discussed in this section.

#### General Setup

An important type of hypotheses we are often interested in is of the form

$$H_0: \beta_j = a_j$$

where $a_j$ is some given number, very often $a_j = 0$. For the most common case of two-tailed tests, the alternative hypothesis is

$$H_1: \beta_j \neq a_j$$

and for one-tailed tests it is either one of

$$H_1: \beta_j > a_j$$

or 

$$H_1: \beta_j < a_j$$

These hypotheses can be conveniently tested using a *t* test which is based on the test statistic

$$t = \frac{\hat{\beta}_j - a_j}{se(\hat{\beta}_j)}$$

If $H_0$ is in fact true and the CLM assumptions hold, then this statistic has a *t* distribution with *n - k - 1* degrees of freedom.

#### Standard Case

Very often, we want to test whether there is any relation at all between the dependent variable *y* and a regressor $x_j$ and do not want to impose a sign on the partial effect *a priori*. This is a mission for the standard two-sided *t* test with the hypothetical value $a_j$ = 0, so

$$H_0: \beta_j = 0$$

$$H_1: \beta_j \neq 0$$

$$t_{\hat{\beta}_j} = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)}$$

The subscript on the *t* statistic indicates that this is **the** *t* value for $\hat{\beta}_j$ for this frequent version of the test. Under $H_0$, it has the *t* distribution with *n - k - 1* degrees of freedom, implying that the probability that $|t_{\hat{\beta}_j}| > c$ is equal to $\alpha$ if *c* is the $1 - \frac{\alpha}{2}$ quantile of this distribution. If $\alpha$ is our significance level (e.g. $\alpha = 5\%$), then we reject $H_0$ if $|t_{\hat{\beta}_j}| > c$ in our sample. For the typical significance level $\alpha = 5\%$, the critical value *c* will be around 2 for reasonably large degrees of freedom and approaches the counterpart of 1.96 from the standard normal distribution in very large samples.
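
As a quick illustration of this convergence, the following sketch computes the two-sided 5% critical value for a few degrees of freedom and compares it with the standard normal quantile. The degrees of freedom are arbitrary values chosen for the illustration; they are not taken from any data set.

```python
import numpy as np
import scipy.stats as stats

alpha = 0.05
# Hypothetical degrees of freedom, chosen only for illustration
dfs = np.array([5, 30, 137, 1000, 100000])

# Two-sided critical values: the 1 - alpha/2 quantile of each t distribution
cv_t = stats.t.ppf(1 - alpha / 2, dfs)
# Limiting value from the standard normal distribution
cv_norm = stats.norm.ppf(1 - alpha / 2)

for df, c in zip(dfs, cv_t):
    print(f'd.f. = {int(df):6d}: c = {c:.4f}')
print(f'Normal limit:  c = {cv_norm:.4f}')
```

The critical values shrink monotonically towards the normal quantile as the degrees of freedom grow.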

The *p* value indicates the smallest value of the significance level $\alpha$ for which we would still reject $H_0$ using our sample. So it is the probability for a random variable *T* with the respective *t* distribution that |*T*| > $|t_{\hat{\beta}_j}|$ where $t_{\hat{\beta}_j}$ is the value of the *t* statistic in our particular sample. In our two-tailed test, it can be calculated as

$$p_{\hat{\beta}_j} = 2 \cdot F_{t_{n-k-1}}(-|t_{\hat{\beta}_j}|)$$

where $F_{t_{n-k-1}} (\cdot)$ is the CDF of the *t* distribution with *n - k - 1* degrees of freedom. If our software provides us with the relevant *p* values, they are easy to use: We reject $H_0$ if $p_{\hat{\beta}_j} \leq \alpha$.
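
The decision rules based on the critical value and on the *p* value can be sketched in a small helper. The function name `t_test_two_sided` and the numbers passed to it are hypothetical and serve only to illustrate the formulas; the helper also accepts a general hypothesized value $a_j$.

```python
import scipy.stats as stats

def t_test_two_sided(b_hat, se, df, a=0.0, alpha=0.05):
    """Two-sided t test of H0: beta_j = a; returns (t, critical value, p value)."""
    t_stat = (b_hat - a) / se
    # Critical value: 1 - alpha/2 quantile of the t distribution
    c = stats.t.ppf(1 - alpha / 2, df)
    # p value: twice the CDF evaluated at -|t|
    p_val = 2 * stats.t.cdf(-abs(t_stat), df)
    return t_stat, c, p_val

# Hypothetical estimate, standard error and degrees of freedom
t_stat, c, p_val = t_test_two_sided(b_hat=0.8, se=0.3, df=137)
print(f't = {t_stat:.3f}, c = {c:.3f}, p = {p_val:.4f}')

# Both decision rules lead to the same conclusion
reject_by_cv = abs(t_stat) > c
reject_by_p = p_val <= 0.05
print(f'Reject H0 (critical value): {reject_by_cv}')
print(f'Reject H0 (p value):        {reject_by_p}')
```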

Since this standard case of a *t* test is so common, **statsmodels** provides us with the relevant *t* and *p* values directly in the **summary** of the estimation results we already saw in the previous notebooks. The regression table includes for all regressors and the intercept:

- parameter estimates and standard errors
- test statistics $t_{\hat{\beta}_j}$ in column **t**
- respective *p* values $p_{\hat{\beta}_j}$ in the column **P>|t|**
- respective 95% confidence interval in columns **[0.025** and **0.975]**

#### Wooldridge, Example 4.3: Determinants of College GPA

We have repeatedly used the data set *GPA1* in the previous notebooks. This example uses three regressors and estimates a regression model of the form

$$colGPA = \beta_0 + \beta_1 \cdot hsGPA + \beta_2 \cdot ACT + \beta_3 \cdot skipped + u$$

For the critical values of the *t* test, using the normal approximation instead of the exact *t* distribution with *n - k - 1* = 137 d.f. doesn't make much of a difference:

In [1]:
# Import Modules
import scipy.stats as stats
import numpy as np

In [2]:
# CV for alpha = 5% and 1% using the t distribution with 137 d.f.
alpha = np.array([0.05, 0.01])
cv_t = stats.t.ppf(1 - alpha / 2, 137)
print(f'Critical Values by t Distribution: {cv_t}\n')

Critical Values by t Distribution: [1.97743121 2.61219198]



In [3]:
# CV for alpha = 5% and 1% using the normal approximation
cv_n = stats.norm.ppf(1 - alpha / 2)
print(f'Critical Values by Normal Distribution: {cv_n}\n')

Critical Values by Normal Distribution: [1.95996398 2.5758293 ]



We present the standard **summary**, which directly contains all the information needed to test the hypotheses for all parameters. The *t* statistics for all coefficients except $\beta_2$ are larger in absolute value than the critical value *c* = 2.61 (or *c* = 2.58 using the normal approximation) for $\alpha$ = 1%. So for these coefficients we would reject $H_0$ at all usual significance levels. By construction, we draw the same conclusions from the *p* values.

In order to confirm that **statsmodels** is using exactly the formulas of Wooldridge (2019), we next reconstruct the *t* and *p* values manually. We extract the coefficients (**params**) and standard errors (**bse**) from the regression results, and simply apply the $t_{\hat{\beta}_j}$ and $p_{\hat{\beta}_j}$ equations.

In [4]:
# Import Modules
import wooldridge as woo
import statsmodels.formula.api as smf
import scipy.stats as stats
import numpy as np

In [None]:
# Import data set gpa1
gpa1 = woo.dataWoo('gpa1')

In [None]:
# Store and display results:
reg = smf.ols(formula = 'colGPA ~ hsGPA + ACT + skipped', data = gpa1)
results = reg.fit()
print(f'Regression Summary: \n{results.summary()}\n')

In [None]:
# Manually confirm the formulas:

# Extract coefficients and SE
b = results.params
se = results.bse

# Reproduce t statistic
tstat = b / se
print(f't Statistics: \n{tstat}\n')

# Reproduce p value
pval = 2 * stats.t.cdf(-abs(tstat), 137)
print(f'P Value: \n{pval}\n')

#### Other Hypotheses

For a one-tailed test, the critical value *c* of the *t* test and the *p* values have to be adjusted appropriately. Wooldridge (2019) provides a general discussion in Section 4.2. For testing the null hypothesis $H_0: \beta_j = a_j$, the tests for the three common alternative hypotheses are summarized in the following table, where $t$ is the general test statistic defined above.

**One- and Two-tailed *t* Tests for $H_0: \beta_j = a_j$**

|  $H_1$:  |  $\beta_j \neq a_j$  |  $\beta_j > a_j$  |  $\beta_j < a_j$  |
|  :---:  |  :---:  |  :---:  |  :---:  |
|  *c* = quantile  |  $1 - \frac{\alpha}{2}$  |  $1 - \alpha$  |  $1 - \alpha$  |
|  Reject $H_0$ if  |  $\lvert t \rvert > c$  |  $t > c$  |  $t < -c$  |
|  *p* value  |  $2 \cdot F_{t_{n - k - 1}}(-\lvert t \rvert)$  |  $F_{t_{n - k - 1}}(-t)$  |  $F_{t_{n - k - 1}}(t)$  |

Given the standard regression output including the *p* value for two-sided tests $p_{\hat{\beta}_j}$, we can easily do one-sided *t* tests for the null hypothesis $H_0: \beta_j = 0$ in two steps:

1. Is $\hat{\beta}_j$ positive (if $H_1: \beta_j > 0$) or negative (if $H_1: \beta_j < 0$)?
   - No -> Do not reject $H_0$ since this cannot be evidence against $H_0$.
   - Yes -> The relevant *p* value is half of the reported $p_{\hat{\beta}_j}$.
2. Reject $H_0$ if $p = \frac{1}{2} p_{\hat{\beta}_j} < \alpha$.
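
The relation between one-sided and two-sided *p* values can be checked numerically. The following is a minimal sketch; the helper `one_sided_p` and the values of the *t* statistic and the degrees of freedom are hypothetical and chosen only for illustration.

```python
import scipy.stats as stats

def one_sided_p(t_stat, df, alternative):
    """p value for H0: beta_j = a_j against a one-sided alternative."""
    if alternative == 'greater':   # H1: beta_j > a_j
        return stats.t.cdf(-t_stat, df)
    if alternative == 'less':      # H1: beta_j < a_j
        return stats.t.cdf(t_stat, df)
    raise ValueError("alternative must be 'greater' or 'less'")

# Hypothetical t statistic and degrees of freedom
t_stat, df = 1.8, 60

p_two = 2 * stats.t.cdf(-abs(t_stat), df)
p_greater = one_sided_p(t_stat, df, 'greater')
p_less = one_sided_p(t_stat, df, 'less')

print(f'Two-sided p value:     {p_two:.4f}')
print(f'p value, H1 "greater": {p_greater:.4f}')
print(f'p value, H1 "less":    {p_less:.4f}')
```

Because the *t* statistic is positive here, the *p* value for the alternative "greater" is exactly half of the two-sided *p* value, while the *p* value for the alternative "less" is far above any usual significance level.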

#### Wooldridge, Example 4.1: Hourly Wage Equation

We have already estimated the wage equation

$$log(wage) = \beta_0 + \beta_1 \cdot educ + \beta_2 \cdot exper + \beta_3 \cdot tenure + u$$

Now we are ready to test $H_0: \beta_2 = 0$ against $H_1: \beta_2 > 0$. For the critical values of the *t* test, using the normal approximation instead of the exact *t* distribution with *n - k - 1* = 522 d.f. doesn't make any relevant difference:

In [None]:
# Import Modules
import scipy.stats as stats
import numpy as np

In [None]:
# CV for alpha = 5% and 1% using the t distribution with 522 d.f.
alpha = np.array([0.05, 0.01])
cv_t = stats.t.ppf(1 - alpha, 522)
print(f'Critical Values by t Distribution: {cv_t}\n')

In [None]:
# CV for alpha = 5% and 1% using the normal approximation
cv_n = stats.norm.ppf(1 - alpha)
print(f'Critical Values by Normal Distribution: {cv_n}\n')

In this example, we show the standard regression output. The reported *t* statistic for the parameter of *exper* is $t_{\hat{\beta}_2}$ = 2.391, which is larger than the critical value *c* = 2.33 for the significance level $\alpha$ = 1%, so we reject $H_0$. By construction, we get the same answer from looking at the *p* value. As always, the reported $p_{\hat{\beta}_j}$ value is for a two-sided test, so we have to divide it by 2. The resulting value is $p = \frac{0.017}{2} = 0.0085 < 0.01$, so we reject $H_0$ using an $\alpha$ = 1% significance level.

In [None]:
# Import Modules
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

In [None]:
# Import data set 'wage1'
wage1 = woo.dataWoo('wage1')

In [None]:
# Construct the regression model and print the model summary
reg = smf.ols(formula = 'np.log(wage) ~ educ + exper + tenure', data = wage1)
results = reg.fit()
print(f'Regression Summary Output: \n{results.summary()}\n')

### 2. Confidence Intervals

We have already looked at confidence intervals (CI) for the mean of a normally distributed random variable in the previous notebook. CI for the regression parameters are equally easy to construct and closely related to the *t* test. Wooldridge (2019, Section 4.3) provides a succinct discussion. The 95% confidence interval for parameter $\beta_j$ is simply

$$\hat{\beta}_j \pm c \cdot se(\hat{\beta}_j)$$

where *c* is the same critical value for the two-sided *t* test using a significance level $\alpha$ = 5%. Wooldridge (2019) shows examples of how to manually construct these CI.
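
A minimal sketch of this formula is shown below. The helper `conf_interval` and the values of the estimate, its standard error, and the degrees of freedom are hypothetical and serve only to illustrate the calculation.

```python
import scipy.stats as stats

def conf_interval(b_hat, se, df, level=0.95):
    """Confidence interval b_hat +/- c * se based on the t distribution."""
    alpha = 1 - level
    # Same critical value as in the two-sided t test
    c = stats.t.ppf(1 - alpha / 2, df)
    return b_hat - c * se, b_hat + c * se

# Hypothetical estimate, standard error and degrees of freedom
b_hat, se, df = 1.2, 0.4, 50

lo95, hi95 = conf_interval(b_hat, se, df, level=0.95)
lo99, hi99 = conf_interval(b_hat, se, df, level=0.99)

print(f'95% CI: [{lo95:.4f}, {hi95:.4f}]')
print(f'99% CI: [{lo99:.4f}, {hi99:.4f}]')
```

The 99% interval is wider than the 95% interval. Since zero lies outside the 95% interval in this hypothetical case, the two-sided *t* test of $H_0: \beta_j = 0$ rejects at the 5% level, which illustrates the close link between CI and *t* tests.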

**statsmodels** provides the 95% confidence intervals for all parameters in the regression table. If you use the method **conf_int()** on the object with the regression results, you can compute CI for other significance levels. The example below demonstrates the procedure.

#### Wooldridge, Example 4.8: Model of R&D Expenditures

We study the relationship between the R&D expenditures of a firm, its size, and the profit margin for a sample of 32 firms in the chemical industry. The regression equation is

$$log(rd) = \beta_0 +\beta_1 \cdot log(sales) + \beta_2 \cdot profmarg + u$$

Here, we present the regression results as well as the 95% and 99% CI. See Wooldridge (2019) for the manual calculation of the CI and comments on the results.

In [None]:
# Import Modules
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

In [5]:
# Import data set 'rdchem'
rdchem = woo.dataWoo('rdchem')

In [None]:
# OLS regression
reg = smf.ols(formula = 'np.log(rd) ~ np.log(sales) + profmarg', data = rdchem)
results = reg.fit()
print(f'Regression Summary Output: \n{results.summary()}\n')

In [None]:
# 95% and 99% Confidence Interval:
ci95 = results.conf_int(0.05)
ci99 = results.conf_int(0.01)

print(f'95% Confidence Interval: {ci95}\n')
print(f'99% Confidence Interval: {ci99}\n')

### 3. Linear Restrictions: *F* Tests

Wooldridge (2019, Sections 4.4 and 4.5) discusses more general tests than those for null hypotheses about an individual parameter. They can involve one or more hypotheses, each involving one or more population parameters in a linear fashion.

We follow the illustrative example of Wooldridge (2019, Section 4.5) and analyze major league baseball players' salaries using the data set *MLB1* and the regression model

$$log(salary) = \beta_0 + \beta_1 \cdot years + \beta_2 \cdot gamesyr + \beta_3 \cdot bavg + \beta_4 \cdot hrunsyr + \beta_5 \cdot rbisyr + u$$

We want to test whether the performance measures batting average (*bavg*), home runs per year (*hrunsyr*), and runs batted in per year (*rbisyr*) have an impact on the salary once we control for the number of years as an active player (*years*) and the number of games played per year (*gamesyr*). So we state our null hypothesis as $H_0: \beta_3 = \beta_4 = \beta_5 = 0$ versus $H_1: H_0$ is false, i.e. at least one of the performance measures matters.

The test statistic of the *F* test is based on the relative difference between the sum of squared residuals in the general (unrestricted) model and a restricted model in which the hypotheses are imposed, $SSR_{ur}$ and $SSR_{r}$, respectively. In our example, the restricted model is one in which *bavg*, *hrunsyr*, and *rbisyr* are excluded as regressors. If both models involve the same dependent variable, it can also be written in terms of the coefficient of determination in the unrestricted and the restricted model, $R_{ur}^2$ and $R_{r}^2$, respectively:

$$F = \frac{SSR_{r} - SSR_{ur}}{SSR_{ur}} \cdot \frac{n - k - 1}{q} = \frac{R_{ur}^2 - R_{r}^2}{1 - R_{ur}^2} \cdot \frac{n - k - 1}{q}$$

where *q* is the number of restrictions (in our example, *q* = 3). Intuitively, if the null hypothesis is correct, then imposing it as a restriction will not lead to a significant drop in the model fit and the *F* test statistic should be relatively small. It can be shown that under the CLM assumptions and the null hypothesis, the statistic has an *F* distribution with numerator degrees of freedom equal to *q* and denominator degrees of freedom equal to *n - k - 1*. We reject $H_0$ if *F* > *c*, where the critical value *c* is the 1 - $\alpha$ quantile of the relevant $F_{q, n-k-1}$ distribution. In our example, *n* = 353, *k* = 5, *q* = 3. So with $\alpha$ = 1%, the critical value is 3.84 and can be calculated using the **f.ppf()** function in **scipy.stats** as

``` Python
f.ppf(1 - 0.01, 3, 347)
```

Here, we show the calculation for this example. The result is *F* = 9.55 > 3.84, so we clearly reject $H_0$. We also calculate the *p* value for this test. It is $p = 4.47 \cdot 10^{-06} = 0.00000447$, so we reject $H_0$ for any reasonable significance level.

In [6]:
# Import Modules
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf
import scipy.stats as stats

In [7]:
# Import data set "mlb1"
mlb1 = woo.dataWoo('mlb1')
n = mlb1.shape[0]

In [8]:
# Unrestricted OLS Regression
reg_ur = smf.ols(
    formula = 'np.log(salary) ~ years + gamesyr + bavg + hrunsyr + rbisyr',
    data = mlb1)
fit_ur = reg_ur.fit()
r2_ur = fit_ur.rsquared
print(f'R Square for Unrestricted Model: {r2_ur}\n')

R Square for Unrestricted Model: 0.6278028485187443



In [9]:
# Restricted OLS Regression
reg_r = smf.ols(
    formula = 'np.log(salary) ~ years + gamesyr',
    data = mlb1)
fit_r = reg_r.fit()
r2_r = fit_r.rsquared
print(f'R Square for Restricted Model: {r2_r}\n')

R Square for Restricted Model: 0.5970716339066895



In [10]:
# F statistic
fstat = (r2_ur - r2_r) / (1 - r2_ur) * (n - 6) / 3
print(f'F Statistic: {fstat}\n')

F Statistic: 9.55025352195195



In [11]:
# Critical Value for alpha = 1%
cv = stats.f.ppf(1 - 0.01, 3, 347)
print(f'Critical Value at 1% Significant Level: {cv}\n')

Critical Value at 1% Significant Level: 3.838520048496057



In [12]:
# p value = 1 - cdf of the appropriate F distribution
fpval = 1 - stats.f.cdf(fstat, 3, 347)
print(f'p-value for the F Test: {fpval}\n')

p-value for the F Test: 4.473708139829391e-06
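
The manual calculation above uses the $R^2$ form of the statistic. As a cross-check, the following sketch computes the *F* statistic both from the sums of squared residuals and from the $R^2$ values, with $1 - R_{ur}^2$ in the denominator of the $R^2$ form as in the code above. It uses simulated data, not the *MLB1* data set; the sample size, the coefficients, and the helper `ssr_and_r2` are assumptions made for this illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(12345)
n, k, q = 200, 3, 2  # hypothetical sample size, regressors, restrictions

# Simulated data: only the first regressor matters, so H0 is true
X = rng.normal(size=(n, k))
y = 1.0 + 0.5 * X[:, 0] + rng.normal(size=n)

def ssr_and_r2(y, X):
    """OLS with intercept via least squares; returns (SSR, R^2)."""
    Z = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    ssr = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    return ssr, 1 - ssr / sst

ssr_ur, r2_ur = ssr_and_r2(y, X)        # unrestricted model
ssr_r, r2_r = ssr_and_r2(y, X[:, :1])   # restricted: last q regressors dropped

# F statistic in SSR form and in R^2 form
f_ssr = (ssr_r - ssr_ur) / ssr_ur * (n - k - 1) / q
f_r2 = (r2_ur - r2_r) / (1 - r2_ur) * (n - k - 1) / q
p_val = stats.f.sf(f_ssr, q, n - k - 1)

print(f'F (SSR form): {f_ssr:.6f}')
print(f'F (R^2 form): {f_r2:.6f}')
print(f'p value:      {p_val:.4f}')
```

Both forms agree up to floating-point error because the two models share the same dependent variable and therefore the same total sum of squares.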



It should not be surprising that there is a more convenient way to do this. The module **statsmodels** provides a method **f_test()** which is well suited for these kinds of tests. Given the object with regression results, for example **results**, an *F* test is conducted with

``` Python
hypotheses = ['var_name1 = 0', 'var_name2 = 0']
ftest = results.f_test(hypotheses)
```

where **hypotheses** collects the null hypotheses to be tested. It is a list of length *q* where each restriction is described as a text in which the variable name takes the place of its parameter. In our example, $H_0$ is that the three parameters of *bavg*, *hrunsyr*, and *rbisyr* are all equal to zero, which translates as **hypotheses = ['bavg = 0', 'hrunsyr = 0', 'rbisyr = 0']**. We implement this for the same test as in the manual calculations above, and it results in exactly the same *F* statistic and *p* value.

In [13]:
# Import Modules
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

In [14]:
# Import data set "mlb1"
mlb1 = woo.dataWoo('mlb1')

In [15]:
# OLS Regression
reg = smf.ols(
    formula = 'np.log(salary) ~ years + gamesyr + bavg + hrunsyr + rbisyr',
    data = mlb1)
results = reg.fit()

In [16]:
# Automate F test
hypotheses = ['bavg = 0', 'hrunsyr = 0', 'rbisyr = 0']
ftest = results.f_test(hypotheses)
fstat = ftest.statistic[0][0]
fpval = ftest.pvalue

print(f'F Statistic: {fstat}\n')
print(f'p value for the F Test: {fpval}\n')

F Statistic: 9.55025352195186

p value for the F Test: 4.4737081398390565e-06



This function can also be used to test more complicated null hypotheses. For example, suppose a sports reporter claims that the batting average plays no role and that the number of home runs has twice the impact of the number of runs batted in. This translates (using variable names instead of numbers as subscripts) as $H_0: \beta_{bavg} = 0, \beta_{hrunsyr} = 2 \cdot \beta_{rbisyr}$. For *Python* we translate it as **hypotheses = ['bavg = 0', 'hrunsyr = 2 * rbisyr']**. The output shows the results of this test. The *p* value is 0.6, so we cannot reject $H_0$.

In [17]:
# Automate F test
hypotheses = ['bavg = 0', 'hrunsyr = 2 * rbisyr']
ftest = results.f_test(hypotheses)
fstat = ftest.statistic[0][0]
fpval = ftest.pvalue

print(f'F Statistic: {fstat}\n')
print(f'p value for the F Test: {fpval}\n')

F Statistic: 0.5117822576247739

p value for the F Test: 0.5998780329146338



The most important and most straightforward *F* test is the one for **overall significance**. The null hypothesis is that all parameters except for the constant are equal to zero. If this null hypothesis holds, the regressors do not have any joint explanatory power for *y*. The results of such a test are automatically included in the upper part of the **summary** output as **F-statistic** (*F* statistic) and **Prob(F-statistic)** (*p* value).

******

Asymptotic theory allows us to relax some assumptions needed to derive the sampling distribution of estimators if the sample size is large enough. For running a regression in a software package, it does not matter whether we rely on stronger assumptions or on asymptotic arguments. So we don't have to learn anything new regarding the implementation. 

Instead, we aim to improve our intuition regarding the workings of asymptotics by looking at some simulation exercises. We also briefly discuss the implementation of the regression-based Lagrange multiplier (LM) test presented by Wooldridge (2019, Section 5.2).

### 4. Simulation Exercises

In the previous notebook, we already used Monte Carlo simulation methods to study the mean and variance of OLS estimators under the assumptions SLR.1 through SLR.5. Here, we will conduct similar experiments but will look at the whole sampling distribution of OLS estimators. Remember that the sampling distribution is important since confidence intervals, *t* and *F* tests and other tools of inference rely on it.

Theorem 4.1 of Wooldridge (2019) gives the normal distribution of the OLS estimators (conditional on the regressors) based on assumptions MLR.1 through MLR.6. In contrast, Theorem 5.2 states that *asymptotically*, the distribution is normal by assumptions MLR.1 through MLR.5 only. Assumption MLR.6 - the normal distribution of the error terms - is not required if the sample is large enough to justify asymptotic arguments.

In other words: In small samples, the parameter estimates have a normal sampling distribution only if

- the error terms are normally distributed and
- we condition on the regressors

To see how this works out in practice, we set up a series of simulation experiments. The first case simulates a model consistent with MLR.1 through MLR.6 and keeps the regressors fixed. Theory suggests that the sampling distribution of $\hat{\beta}$ is normal, independent of the sample size. The second case simulates a violation of assumption MLR.6. Normality of $\hat{\beta}$ only holds asymptotically, so for small sample sizes we suspect a violation. Finally, we will look more closely into what "conditional on the regressors" means and simulate a (very plausible) violation of this in the last case.

#### Normally Distributed Error Terms (Case 1)

Here, we draw 10,000 samples of a given size (which has to be stored in the variable *n* beforehand) from a population that is consistent with assumptions MLR.1 through MLR.6. The error terms are specified to be standard normal. The slope estimate $\hat{\beta}_1$ is stored for each of the generated samples in the array **b1**.

In [18]:
# Import Modules
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import scipy.stats as stats

In [19]:
# Set the random seed
np.random.seed(835742)

In [20]:
# Set sample size and number of simulations
n = 100
r = 10000

In [21]:
# Set true parameters
beta0 = 1
beta1 = 0.5
sx = 1
ex = 4

In [22]:
# Initialize b1 to store results later
b1 = np.empty(r)

In [23]:
# draw a sample of x, fixed over replications
x = stats.norm.rvs(ex, sx, size = n)

In [24]:
# Repeat the simulation r times
for i in range(r):
    # draw a sample of u (std. normal)
    u = stats.norm.rvs(0, 1, size = n)
    y = beta0 + beta1 * x + u
    df = pd.DataFrame({"y": y, "x": x})
    
    # estimate conditional OLS
    reg = smf.ols(formula = 'y ~ x', data = df)
    results = reg.fit()
    b1[i] = results.params['x']

The code was run for different sample sizes. The density estimate together with the corresponding normal density are shown in Figure 5.1. Not surprisingly, all distributions look very similar to the normal distribution - that is what Theorem 4.1 predicted. Note that the fact that the sampling variance decreases as *n* rises is only obvious if we pay attention to the different scales of the axes.

**Figure 5.1:** Density of $\hat{\beta}_1$ with Different Sample Sizes: Normal Error Terms

|  |  |
|  :---:  |  :---:  |
|  n = 5  |  n = 10  |
|  ![alt](images/MCSim-olsasy-norm-n5.png)  |  ![alt](images/MCSim-olsasy-norm-n10.png)  |
|  n = 100  |  n = 1000  |
|  ![alt](images/MCSim-olsasy-norm-n100.png)  |  ![alt](images/MCSim-olsasy-norm-n1000.png)  |

#### Non-Normal Error Terms (Case 2)

The next step is to simulate a violation of assumption MLR.6. In order to implement a rather drastic violation of the normality assumption, we implement a "standardized" $\chi^2$ distribution with one degree of freedom. More specifically, let *v* be distributed as $\chi_{[1]}^2$. Because this distribution has a mean of 1 and a variance of 2, the error term $u = \frac{v - 1}{\sqrt{2}}$ has a mean of 0 and a variance of 1. This simplifies the comparison to the exercise with the standard normal errors above. Figure 5.2 plots the density functions of the standard normal distribution used above and the "standardized" $\chi^2$ distribution. Both have a mean of 0 and a variance of 1 but very different shapes. The only line of code we changed compared to the previous case is the sampling of *u*, where we replace drawing from a standard normal distribution using **u = stats.norm.rvs(0, 1, size = n)** with sampling from the standardized $\chi_{[1]}^2$ distribution with

``` Python
u = (stats.chi2.rvs(1, size = n) - 1) / np.sqrt(2)
```

**Figure 5.2:** Density Functions of the Simulated Error Terms
![alt](images/MCSim-olsasy-stdchisq.png)
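
The moments of the standardized error term can be verified numerically. The following sketch takes the theoretical mean and variance of the $\chi_{[1]}^2$ distribution from **scipy.stats** and checks the sample moments of a large number of simulated draws. The number of draws and the seed are arbitrary choices for this illustration.

```python
import numpy as np
import scipy.stats as stats

# Theoretical mean and variance of the chi-squared distribution with 1 d.f.
mean_v, var_v = stats.chi2.stats(1, moments='mv')
mean_v, var_v = float(mean_v), float(var_v)
print(f'Theoretical mean: {mean_v}, variance: {var_v}')

# Simulate standardized draws (seed and number of draws chosen arbitrarily)
rng = np.random.default_rng(2024)
v = rng.chisquare(1, size=1_000_000)
u = (v - mean_v) / np.sqrt(var_v)

print(f'Sample mean of u:     {u.mean():.4f}')
print(f'Sample variance of u: {u.var():.4f}')
print(f'Sample skewness of u: {stats.skew(u):.4f}')
```

The sample mean and variance are close to 0 and 1, while the clearly positive skewness reflects the strongly asymmetric shape of the distribution.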

For each of the same sample sizes used above, we again estimate the slope parameter for 10,000 samples. The densities of $\hat{\beta}_1$ are plotted in Figure 5.3 together with the respective normal distributions with the corresponding variances. For the small sample sizes, the deviation from the normal distribution is strong. Note that the dashed normal distributions have the same mean and variance. The main difference is the kurtosis which is larger than 8 in the simulations for n = 5 compared to the normal distribution for which the kurtosis is equal to 3.

For larger sample sizes, the sampling distribution of $\hat{\beta}_1$ converges to the normal distribution. For *n* = 100, the difference is much smaller but still discernible. For *n* = 1,000, it cannot be detected anymore in our simulation exercise. How large the sample needs to be depends, among other things, on the severity of the violation of MLR.6. If the distribution of the error terms is not as extremely non-normal as in our simulation, smaller sample sizes like the rule of thumb *n* = 30 might suffice for valid asymptotics.

**Figure 5.3:** Density of $\hat{\beta}_1$ with Different Sample Sizes: Non-Normal Error Terms

|  |  |
|  :---:  |  :---:  |
|  n = 5  |  n = 10  |
|  ![alt](images/MCSim-olsasy-chisq-n5.png)  |  ![alt](images/MCSim-olsasy-chisq-n10.png)  |
|  n = 100  |  n = 1000  |
|  ![alt](images/MCSim-olsasy-chisq-n100.png)  |  ![alt](images/MCSim-olsasy-chisq-n1000.png)  |

In [25]:
# Import Modules
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import scipy.stats as stats

In [26]:
# Set the random seed
np.random.seed(835742)

In [27]:
# Set sample size and number of simulations
n = 100
r = 10000

In [28]:
# Set true parameters
beta0 = 1
beta1 = 0.5
sx = 1
ex = 4

In [29]:
# Initialize b1 to store results later
b1 = np.empty(r)

In [30]:
# draw a sample of x, fixed over replications
x = stats.norm.rvs(ex, sx, size = n)

In [31]:
# Repeat the simulation r times
for i in range(r):
    # draw a sample of u (standardized chi-squared with 1 d.f.)
    u = (stats.chi2.rvs(1, size = n) - 1) / np.sqrt(2)
    y = beta0 + beta1 * x + u
    df = pd.DataFrame({"y": y, "x": x})
    
    # estimate conditional OLS
    reg = smf.ols(formula = 'y ~ x', data = df)
    results = reg.fit()
    b1[i] = results.params['x']

#### Unconditioning on the Regressors (Case 3)

There is a more subtle difference between the finite-sample results regarding the variance (Theorem 3.2) and distribution (Theorem 4.1) on the one hand and the corresponding asymptotic results (Theorem 5.2) on the other. The former results describe the sampling distribution "conditional on the sample values of the independent variables". This implies that as we draw different samples, the values of the regressors $x_1, x_2, \dots, x_k$ remain the same and only the error terms and dependent variables change.

In our previous simulation exercises, this is implemented by making the random draws of *x* outside of the simulation loop. This is a realistic description of how data is generated only in some simple experiments: The experimenter chooses the regressors for the sample, conducts the experiment and measures the dependent variable.

In most applications we are concerned with, this is an unrealistic description of how we obtain our data. If we draw a sample of individuals, both their dependent and independent variables differ across samples. In these cases, the distribution "conditional on the sample values of the independent variables" can only serve as an approximation of the actual distribution with varying regressors. For large samples, this distinction is irrelevant and the asymptotic distribution is the same. 

Let's see how this plays out in an example. The code in this example differs from case 1 only by moving the generation of the regressors into the loop in which the 10,000 samples are generated. This is inconsistent with Theorem 4.1, so for small samples, we don't know the distribution of $\hat{\beta}_1$. Theorem 5.2 is applicable, so for (very) large samples, we know that the estimator is approximately normally distributed.

Figure 5.4 shows the distribution of the 10,000 estimates generated for *n* = 5, 10, 100, and 1,000. As we expected from theory, the distribution is (close to) normal for large samples. For small samples, it deviates quite a bit. The kurtosis is 8.7 for a sample size of *n* = 5 which is far away from the kurtosis of 3 of a normal distribution.

**Figure 5.4:** Density of $\hat{\beta}_1$ with Different Sample Sizes: Varying Regressors

|  |  |
|  :---:  |  :---:  |
|  n = 5  |  n = 10  |
|  ![alt](images/MCSim-olsasy-uncond-n5.png)  |  ![alt](images/MCSim-olsasy-uncond-n10.png)  |
|  n = 100  |  n = 1000  |
|  ![alt](images/MCSim-olsasy-uncond-n100.png)  |  ![alt](images/MCSim-olsasy-uncond-n1000.png)  |

In [32]:
# Import Modules
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import scipy.stats as stats

In [33]:
# Set the random seed
np.random.seed(835742)

In [34]:
# Set sample size and number of simulations
n = 100
r = 10000

In [35]:
# Set true parameters
beta0 = 1
beta1 = 0.5
sx = 1
ex = 4

In [36]:
# Initialize b1 to store results later
b1 = np.empty(r)

In [37]:
# Repeat the simulation r times
for i in range(r):
    # draw a sample of x, varying over replications
    x = stats.norm.rvs(ex, sx, size = n)

    # draw a sample of u (std. normal)
    u = stats.norm.rvs(0, 1, size = n)
    y = beta0 + beta1 * x + u
    df = pd.DataFrame({"y": y, "x": x})

    # estimate OLS (regressors are no longer fixed across replications)
    reg = smf.ols(formula = 'y ~ x', data = df)
    results = reg.fit()
    b1[i] = results.params['x']
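
To compare the three cases numerically rather than visually, the kurtosis of the simulated slope estimates can be computed for each case and sample size. The following is a sketch under several assumptions: the helper `simulate_slopes` is hypothetical, it uses the closed-form expression for the simple regression slope instead of **statsmodels**, it draws random numbers with a NumPy generator, and it uses only 2,000 replications to keep memory use low. The resulting numbers will therefore not match the figures exactly.

```python
import numpy as np
import scipy.stats as stats

def simulate_slopes(n, r, errors='normal', fixed_x=True, seed=0,
                    beta1=0.5, ex=4.0, sx=1.0):
    """Return r simulated OLS slope estimates for sample size n."""
    rng = np.random.default_rng(seed)
    if fixed_x:
        # One draw of x, reused in every replication (Cases 1 and 2)
        x = rng.normal(ex, sx, size=(1, n))
    else:
        # A fresh draw of x for every replication (Case 3)
        x = rng.normal(ex, sx, size=(r, n))
    if errors == 'normal':
        u = rng.normal(0, 1, size=(r, n))
    else:
        # Standardized chi-squared(1) errors: mean 0, variance 1
        u = (rng.chisquare(1, size=(r, n)) - 1) / np.sqrt(2)
    # With y = beta0 + beta1 * x + u, the OLS slope in each sample equals
    # beta1 + sum((x - xbar) * u) / sum((x - xbar)^2)
    xd = x - x.mean(axis=1, keepdims=True)
    return beta1 + (xd * u).sum(axis=1) / (xd ** 2).sum(axis=1)

cases = {
    'Case 1': dict(errors='normal', fixed_x=True),
    'Case 2': dict(errors='chi2', fixed_x=True),
    'Case 3': dict(errors='normal', fixed_x=False),
}

kurt = {}
for label, opts in cases.items():
    for n in (5, 100, 1000):
        b1_sim = simulate_slopes(n, r=2000, seed=835742, **opts)
        # fisher=False returns the kurtosis on the scale where normal = 3
        kurt[(label, n)] = stats.kurtosis(b1_sim, fisher=False)
        print(f'{label}, n = {n:4d}: kurtosis = {kurt[(label, n)]:.2f}')
```

In Case 1 the kurtosis should stay close to 3 for every sample size, whereas in Cases 2 and 3 it should be clearly above 3 for *n* = 5 and approach 3 as *n* grows.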

### 5. LM Test

As an alternative to the *F* tests discussed previously, *LM* tests for the same sort of hypotheses can be very useful with large samples. In the linear regression setup, the test statistic is 

$$LM = n \cdot R_{\tilde{u}}^2$$

where *n* is the sample size and $R_{\tilde{u}}^2$ is the usual $R^2$ statistic in a regression of the residual $\tilde{u}$ from the restricted model on the unrestricted set of regressors. Under the null hypothesis, it is asymptotically distributed as $\chi_{q}^2$ with *q* denoting the number of restrictions. Details are given in Wooldridge (2019, Section 5.2).

The implementation in **statsmodels** is straightforward if we remember that the residuals can be obtained with the **resid** attribute.
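
Before turning to real data, the steps of the test can be sketched on simulated data using only **numpy** and **scipy**. The data-generating process, the sample size, and the helper `ols_resid` are assumptions made for this illustration; the null hypothesis is true by construction because the extra regressors do not enter the model for *y*.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
n, q = 500, 2  # hypothetical sample size and number of restrictions

X_r = rng.normal(size=(n, 2))      # regressors kept under H0
X_extra = rng.normal(size=(n, q))  # regressors excluded under H0
y = 1.0 + X_r @ np.array([0.5, -0.3]) + rng.normal(size=n)

def ols_resid(y, X):
    """Residuals from an OLS regression of y on X plus an intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ coef

# 1. Residuals from the restricted model
u_tilde = ols_resid(y, X_r)

# 2. Regress these residuals on the unrestricted set of regressors
e = ols_resid(u_tilde, np.column_stack([X_r, X_extra]))
r2_u = 1 - (e @ e) / np.sum((u_tilde - u_tilde.mean()) ** 2)

# 3. LM statistic and p value from the chi-squared distribution with q d.f.
LM = n * r2_u
p_val = stats.chi2.sf(LM, q)

print(f'R^2 of the auxiliary regression: {r2_u:.6f}')
print(f'LM statistic: {LM:.4f}')
print(f'p value: {p_val:.4f}')
```

Here `stats.chi2.sf(LM, q)` is the survival function, which equals one minus the CDF.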

#### Wooldridge, Example 5.3: Economic Model of Crime

We analyze the same data on the number of arrests as in the crime example. The unrestricted regression model is

$$narr86 = \beta_0 + \beta_1pcnv + \beta_2avgsen + \beta_3tottime + \beta_4ptime86 + \beta_5qemp86 + u$$

The dependent variable *narr86* reflects the number of times a man was arrested and is explained by the proportion of prior arrests that led to a conviction (*pcnv*), previous average sentences (*avgsen*), the time spent in prison before 1986 (*tottime*), the number of months in prison in 1986 (*ptime86*), and the number of quarters employed in 1986 (*qemp86*).

The joint null hypothesis is 

$$H_0: \beta_2 = \beta_3 = 0$$

so the restricted set of regressors excludes *avgsen* and *tottime*. In this example, we show an implementation of this *LM* test. The restricted model is estimated and its residuals **utilde =** $\tilde{u}$ are calculated. They are regressed on the unrestricted set of regressors. The $R^2$ from this regression is 0.001494, so the *LM* test statistic is calculated to be around *LM* = 0.001494 $\cdot$ 2725 = 4.071. This is smaller than the critical value of 4.61 for a significance level of $\alpha$ = 10%, so we do not reject the null hypothesis. We can also easily calculate the *p* value using the $\chi^2$ CDF **chi2.cdf()**. It turns out to be 0.1306.

The same hypothesis can be tested with an *F* test using the command **f_test()**. In this example, it delivers the same *p* value up to three digits.

In [38]:
# Import Modules
import wooldridge as woo
import statsmodels.formula.api as smf
import scipy.stats as stats

In [39]:
# Import the crime1 data set
crime1 = woo.dataWoo('crime1')

In [40]:
# 1. Estimate restricted model
reg_r = smf.ols(formula = 'narr86 ~ pcnv + ptime86 + qemp86', data = crime1)
fit_r = reg_r.fit()
r2_r = fit_r.rsquared
print(f'R^2 for Restricted Model: {r2_r}\n')

R^2 for Restricted Model: 0.04132330770123094



In [42]:
# 2. Regression of residuals from restricted model
crime1['utilde'] = fit_r.resid
reg_LM = smf.ols(formula = 'utilde ~ pcnv + ptime86 + qemp86 + avgsen + tottime', data = crime1)
fit_LM = reg_LM.fit()
r2_LM = fit_LM.rsquared
print(f'R^2 for Residual from Restricted Model: {r2_LM}\n')

R^2 for Residual from Restricted Model: 0.0014938456737877415



In [43]:
# 3. calculation of LM test statistic
LM = r2_LM * fit_LM.nobs
print(f'LM Test Statistic: {LM}\n')

LM Test Statistic: 4.070729461071595



In [44]:
# 4. Critical value from chi-squared distribution, alpha = 10%
cv = stats.chi2.ppf(1 - 0.10, 2)
print(f'Critical Value at 10% Significance Level: {cv}\n')

Critical Value at 10% Significance Level: 4.605170185988092



In [45]:
# 5. p value (alternative to critical value)
pval = 1 - stats.chi2.cdf(LM, 2)
print(f'p value for the Test: {pval}\n')

p value for the Test: 0.13063282803267184



In [47]:
# 6. Compare to F-test
reg = smf.ols(formula = 'narr86 ~ pcnv + ptime86 + qemp86 + avgsen + tottime', data = crime1)
results = reg.fit()
hypotheses = ['avgsen = 0', 'tottime = 0']
ftest = results.f_test(hypotheses)
fstat = ftest.statistic[0][0]
fpval = ftest.pvalue
print(f'F Statistic: {fstat}\n')
print(f'p value: {fpval}\n')

F Statistic: 2.033921558435096

p value: 0.13102048172760739

