## Regression Inference and OLS Asymptotics

In this notebook, we study and demonstrate the use of *Python* to perform **statistical inference** on our regression models. We also explore **asymptotic theory** and see how it allows us to relax some of the assumptions needed to derive the sampling distribution of estimators if the sample size is large enough.

**Topics:**

1. The *t* Test
2. Confidence Intervals
3. Linear Restrictions: *F* Tests
4. Simulation Exercises
5. LM Test

Section 4.1 of Wooldridge (2019) adds assumption MLR.6 (normal distribution of the error term) to the previous assumptions MLR.1 through MLR.5. Together, these assumptions constitute the classical linear model (CLM).

The main additional result we get from this assumption is stated in Theorem 4.1: The OLS parameter estimators are normally distributed (conditional on the regressors $x_1, x_2, \dots, x_k$). The benefit of this result is that it allows us to do statistical inference similar to the approaches discussed for the simple estimator of the mean of a normally distributed random variable.

### 1. The *t* Test

After the sign and magnitude of the estimated parameters, empirical research typically pays most attention to the results of the *t* tests discussed in this section.

#### General Setup

An important type of hypotheses we are often interested in is of the form

$$H_0: \beta_j = a_j$$

where $a_j$ is some given number, very often $a_j = 0$. For the most common case of two-tailed tests, the alternative hypothesis is

$$H_1: \beta_j \neq a_j$$

and for one-tailed tests it is either one of

$$H_1: \beta_j > a_j$$

or 

$$H_1: \beta_j < a_j$$

These hypotheses can be conveniently tested using a *t* test which is based on the test statistic

$$t = \frac{\hat{\beta}_j - a_j}{se(\hat{\beta}_j)}$$

If $H_0$ is in fact true and the CLM assumptions hold, then this statistic has a *t* distribution with *n - k - 1* degrees of freedom.
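As a minimal sketch of this formula, the snippet below computes the *t* statistic for a hypothesized value $a_j$ and compares it with the two-sided critical value. The estimate, standard error, and degrees of freedom are made-up numbers for illustration only, and `t_statistic` is a hypothetical helper name, not part of any library.

```python
import scipy.stats as stats

def t_statistic(b_hat, se, a=0.0):
    """t statistic for H0: beta_j = a."""
    return (b_hat - a) / se

# Hypothetical values (not taken from any data set in this notebook)
b_hat, se, df = 0.75, 0.1, 60

# Test H0: beta_j = 1 against the two-sided alternative at alpha = 5%
t_val = t_statistic(b_hat, se, a=1.0)
c = stats.t.ppf(1 - 0.05 / 2, df)

print(f't Statistic: {t_val:.3f}')
print(f'Critical Value: {c:.3f}')
print(f'Reject H0: {abs(t_val) > c}')
```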

#### Standard Case

Very often, we want to test whether there is any relation at all between the dependent variable *y* and a regressor $x_j$ and do not want to impose a sign on the partial effect *a priori*. This is a job for the standard two-sided *t* test with the hypothetical value $a_j = 0$, so

$$H_0: \beta_j = 0$$

$$H_1: \beta_j \neq 0$$

$$t_{\hat{\beta}_j} = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)}$$

The subscript on the *t* statistic indicates that this is **the** *t* value for $\hat{\beta}_j$ for this frequent version of the test. Under $H_0$, it has the *t* distribution with *n - k - 1* degrees of freedom, implying that the probability that $|t_{\hat{\beta}_j}| > c$ is equal to $\alpha$ if *c* is the $1 - \frac{\alpha}{2}$ quantile of this distribution. If $\alpha$ is our significance level (e.g., $\alpha = 5\%$), then we reject $H_0$ if $|t_{\hat{\beta}_j}| > c$ in our sample. For the typical significance level $\alpha = 5\%$, the critical value *c* will be around 2 for reasonably large degrees of freedom and approaches its counterpart of 1.96 from the standard normal distribution in very large samples.

The *p* value indicates the smallest value of the significance level $\alpha$ for which we would still reject $H_0$ using our sample. So it is the probability for a random variable *T* with the respective *t* distribution that |*T*| > $|t_{\hat{\beta}_j}|$ where $t_{\hat{\beta}_j}$ is the value of the *t* statistic in our particular sample. In our two-tailed test, it can be calculated as

$$p_{\hat{\beta}_j} = 2 \cdot F_{t_{n-k-1}} (-|t_{\hat{\beta}_j}|)$$

where $F_{t_{n-k-1}} (\cdot)$ is the CDF of the *t* distribution with *n - k - 1* degrees of freedom. If our software provides us with the relevant *p* values, they are easy to use: We reject $H_0$ if $p_{\hat{\beta}_j} \leq \alpha$.
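The following sketch illustrates that the critical-value rule and the *p* value rule lead to the same decision. The *t* statistic is a made-up number, and `two_sided_p` is a hypothetical helper that evaluates the CDF of the *t* distribution at $-|t|$ and doubles the result.

```python
import scipy.stats as stats

def two_sided_p(t_val, df):
    """Two-sided p value: 2 * F(-|t|), with F the CDF of the t distribution."""
    return 2 * stats.t.cdf(-abs(t_val), df)

# Hypothetical t statistic, degrees of freedom, and significance level
t_val, df, alpha = 2.2, 137, 0.05

p = two_sided_p(t_val, df)
c = stats.t.ppf(1 - alpha / 2, df)

# Both rules lead to the same decision
reject_by_cv = abs(t_val) > c
reject_by_p = p <= alpha

print(f'p Value: {p:.4f}')
print(f'Reject by Critical Value: {reject_by_cv}')
print(f'Reject by p Value: {reject_by_p}')
```

A *t* statistic exactly equal to the critical value *c* would give a *p* value of exactly $\alpha$, which is why the two rules always agree.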

Since this standard case of a *t* test is so common, **statsmodels** provides us with the relevant *t* and *p* values directly in the **summary** of the estimation results we already saw in the previous notebooks. The regression table includes for all regressors and the intercept:

- parameter estimates and standard errors
- test statistics $t_{\hat{\beta}_j}$ in column **t**
- respective *p* values $p_{\hat{\beta}_j}$ in the column **P>|t|**
- respective 95% confidence intervals in the columns **[0.025** and **0.975]**

#### Wooldridge, Example 4.3: Determinants of College GPA

We have repeatedly used the data set *GPA1* in the previous notebooks. This example uses three regressors and estimates a regression model of the form

$$colGPA = \beta_0 + \beta_1 \cdot hsGPA + \beta_2 \cdot ACT + \beta_3 \cdot skipped + u$$

For the critical values of the *t* test, using the normal approximation instead of the exact *t* distribution with *n - k - 1* = 137 d.f. doesn't make much of a difference:

In [1]:
# Import Modules
import scipy.stats as stats
import numpy as np

In [2]:
# CV for alpha = 5% and 1% using the t distribution with 137 d.f.
alpha = np.array([0.05, 0.01])
cv_t = stats.t.ppf(1 - alpha / 2, 137)
print(f'Critical Values by t Distribution: {cv_t}\n')

Critical Values by t Distribution: [1.97743121 2.61219198]



In [3]:
# CV for alpha = 5% and 1% using the normal approximation
cv_n = stats.norm.ppf(1 - alpha / 2)
print(f'Critical Values by Normal Distribution: {cv_n}\n')

Critical Values by Normal Distribution: [1.95996398 2.5758293 ]



We present the standard **summary**, which directly contains all the information needed to test the hypotheses for all parameters. The *t* statistics for all coefficients except $\beta_2$ are larger in absolute value than the critical value *c* = 2.61 (or *c* = 2.58 using the normal approximation) for $\alpha$ = 1%. So for these coefficients we would reject $H_0$ at all usual significance levels. By construction, we draw the same conclusions from the *p* values.

To confirm that **statsmodels** uses exactly the formulas of Wooldridge (2019), we next reconstruct the *t* and *p* values manually. We extract the coefficients (**params**) and standard errors (**bse**) from the regression results and simply apply the $t_{\hat{\beta}_j}$ and $p_{\hat{\beta}_j}$ equations.

In [None]:
# Import Modules
import wooldridge as woo
import numpy as np
import scipy.stats as stats
import statsmodels.formula.api as smf

In [None]:
# Import data set gpa1
gpa1 = woo.dataWoo('gpa1')

In [None]:
# Store and display results:
reg = smf.ols(formula = 'colGPA ~ hsGPA + ACT + skipped', data = gpa1)
results = reg.fit()
print(f'Regression Summary: \n{results.summary()}\n')

In [None]:
# Manually confirm the formulas:

# Extract coefficients and SE
b = results.params
se = results.bse

# Reproduce t statistic
tstat = b / se
print(f't Statistics: \n{tstat}\n')

# Reproduce p value
pval = 2 * stats.t.cdf(-abs(tstat), 137)
print(f'P Value: \n{pval}\n')

#### Other Hypotheses

For a one-tailed test, the critical value *c* of the *t* test and the *p* values have to be adjusted appropriately. Wooldridge (2019) provides a general discussion in Section 4.2. For testing the null hypothesis $H_0: \beta_j = a_j$, the tests for the three common alternative hypotheses are summarized in the following table.

**One- and Two-tailed *t* Tests for $H_0: \beta_j = a_j$**

|  $H_1$:  |  $\beta_j \neq a_j$  |  $\beta_j > a_j$  |  $\beta_j < a_j$  |
|  :---:  |  :---:  |  :---:  |  :---:  |
|  *c* = quantile  |  $1 - \frac{\alpha}{2}$  |  $1 - \alpha$  |  $\alpha$  |
|  Reject $H_0$ if  |  $\lvert t_{\hat{\beta}_j} \rvert > c$  |  $t_{\hat{\beta}_j} > c$  |  $t_{\hat{\beta}_j} < c$  |
|  *p* value  |  $2 \cdot F_{t_{n - k - 1}} (-\lvert t_{\hat{\beta}_j} \rvert)$  |  $F_{t_{n - k - 1}} (-t_{\hat{\beta}_j})$  |  $F_{t_{n - k - 1}} (t_{\hat{\beta}_j})$  |
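The three alternatives can be sketched as one small function. This is an illustration under stated assumptions: `t_test_p_value` is a hypothetical helper name, the *t* statistic and degrees of freedom are made up, and the left-tailed case evaluates the CDF at $t_{\hat{\beta}_j}$ while the right-tailed case evaluates it at $-t_{\hat{\beta}_j}$.

```python
import scipy.stats as stats

def t_test_p_value(t_val, df, alternative='two-sided'):
    """p value of a t test for one of the three common alternatives."""
    if alternative == 'two-sided':   # H1: beta_j != a_j
        return 2 * stats.t.cdf(-abs(t_val), df)
    if alternative == 'greater':     # H1: beta_j > a_j
        return stats.t.cdf(-t_val, df)
    if alternative == 'less':        # H1: beta_j < a_j
        return stats.t.cdf(t_val, df)
    raise ValueError(f'unknown alternative: {alternative}')

# Hypothetical t statistic and degrees of freedom
t_val, df = 1.8, 100
for alt in ('two-sided', 'greater', 'less'):
    print(f'{alt}: p = {t_test_p_value(t_val, df, alt):.4f}')
```

With a positive *t* statistic, the right-tailed *p* value is half of the two-sided one, and the left-tailed *p* value is one minus the right-tailed one.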

Given the standard regression output including the *p* value for two-sided tests $p_{\hat{\beta}_j}$, we can easily do one-sided *t* tests for the null hypothesis $H_0: \beta_j = 0$ in two steps:

1. Is $\hat{\beta}_j$ positive (if $H_1: \beta_j > 0$) or negative (if $H_1: \beta_j < 0$)?
   - No -> Do not reject $H_0$ since this cannot be evidence against $H_0$.
   - Yes -> The relevant *p* value is half of the reported $p_{\hat{\beta}_j}$.
2. Reject $H_0$ if $p = \frac{1}{2} p_{\hat{\beta}_j} < \alpha$.
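The two-step procedure can be sketched in plain Python. The function name `one_sided_reject` and the input numbers are hypothetical; in practice the estimate and the two-sided *p* value would be read from the regression table.

```python
def one_sided_reject(b_hat, p_two_sided, alpha, alternative):
    """One-sided test of H0: beta_j = 0 based on a reported two-sided p value.

    alternative is 'greater' (H1: beta_j > 0) or 'less' (H1: beta_j < 0).
    """
    # Step 1: does the estimate have the sign claimed by H1?
    sign_ok = b_hat > 0 if alternative == 'greater' else b_hat < 0
    if not sign_ok:
        return False  # no evidence against H0 in the direction of H1
    # Step 2: halve the two-sided p value and compare it with alpha
    return p_two_sided / 2 < alpha

# Hypothetical estimate and two-sided p value
print(one_sided_reject(0.5, 0.06, 0.05, 'greater'))  # p = 0.03 < 0.05
print(one_sided_reject(0.5, 0.06, 0.05, 'less'))     # wrong sign for H1
```

Note that the two-sided test would not reject $H_0$ at the 5% level here (0.06 > 0.05), while the right-tailed test does.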

#### Wooldridge, Example 4.1: Hourly Wage Equation

We have already estimated the wage equation

$$log(wage) = \beta_0 + \beta_1 \cdot educ + \beta_2 \cdot exper + \beta_3 \cdot tenure + u$$

Now we are ready to test $H_0: \beta_2 = 0$ against $H_1: \beta_2 > 0$. For the critical values of the *t* test, using the normal approximation instead of the exact *t* distribution with *n - k - 1* = 522 d.f. doesn't make any relevant difference:

In [None]:
# Import Modules
import scipy.stats as stats
import numpy as np

In [None]:
# CV for alpha = 5% and 1% using the t distribution with 522 d.f.
alpha = np.array([0.05, 0.01])
cv_t = stats.t.ppf(1 - alpha, 522)
print(f'Critical Values by t Distribution: {cv_t}\n')

In [None]:
# CV for alpha = 5% and 1% using the normal approximation
cv_n = stats.norm.ppf(1 - alpha)
print(f'Critical Values by Normal Distribution: {cv_n}\n')

In this example, we show the standard regression output. The reported *t* statistic for the parameter of *exper* is $t_{\hat{\beta}_2}$ = 2.391, which is larger than the critical value *c* = 2.33 for the significance level $\alpha$ = 1%, so we reject $H_0$. By construction, we get the same answer from looking at the *p* value. As always, the reported $p_{\hat{\beta}_j}$ value is for a two-sided test, so we have to divide it by 2. The resulting value is $p = \frac{0.017}{2} = 0.0085 < 0.01$, so we reject $H_0$ using an $\alpha$ = 1% significance level.

In [None]:
# Import Modules
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

In [None]:
# Import data set 'wage1'
wage1 = woo.dataWoo('wage1')

In [None]:
# Construct the regression model and print the model summary
reg = smf.ols(formula = 'np.log(wage) ~ educ + exper + tenure', data = wage1)
results = reg.fit()
print(f'Regression Summary Output: \n{results.summary()}\n')

### 2. Confidence Intervals

We have already looked at confidence intervals (CI) for the mean of a normally distributed random variable in the previous notebook. CIs for the regression parameters are equally easy to construct and closely related to the *t* test. Wooldridge (2019, Section 4.3) provides a succinct discussion. The 95% confidence interval for parameter $\beta_j$ is simply

$$\hat{\beta}_j \pm c \cdot se(\hat{\beta}_j)$$

where *c* is the same critical value for the two-sided *t* test using a significance level $\alpha$ = 5%. Wooldridge (2019) shows examples of how to manually construct these CI.
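A minimal sketch of the manual construction follows. The estimate, standard error, and degrees of freedom are made-up numbers, and `conf_interval` is a hypothetical helper rather than a library function.

```python
import scipy.stats as stats

def conf_interval(b_hat, se, df, level=0.95):
    """Confidence interval b_hat +/- c * se with c from the t distribution."""
    alpha = 1 - level
    c = stats.t.ppf(1 - alpha / 2, df)
    return b_hat - c * se, b_hat + c * se

# Hypothetical estimate, standard error, and degrees of freedom
b_hat, se, df = 1.0, 0.25, 29

lo95, hi95 = conf_interval(b_hat, se, df, 0.95)
lo99, hi99 = conf_interval(b_hat, se, df, 0.99)

print(f'95% CI: [{lo95:.3f}, {hi95:.3f}]')
print(f'99% CI: [{lo99:.3f}, {hi99:.3f}]')
```

A higher confidence level uses a larger critical value, so the 99% interval is wider than the 95% interval.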

**statsmodels** provides the 95% confidence intervals for all parameters in the regression table. Calling the method **conf_int()** on the object with the regression results and passing a different significance level gives CIs for other confidence levels. The example below demonstrates the procedure.

#### Wooldridge, Example 4.8: Model of R&D Expenditures

We study the relationship between the R&D expenditures of a firm, its size, and the profit margin for a sample of 32 firms in the chemical industry. The regression equation is

$$log(rd) = \beta_0 +\beta_1 \cdot log(sales) + \beta_2 \cdot profmarg + u$$

Here, we present the regression results as well as the 95% and 99% CI. See Wooldridge (2019) for the manual calculation of the CI and comments on the results.

In [None]:
# Import Modules
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

In [None]:
# Import data set 'rdchem'
rdchem = woo.dataWoo('rdchem')

In [None]:
# OLS regression
reg = smf.ols(formula = 'np.log(rd) ~ np.log(sales) + profmarg', data = rdchem)
results = reg.fit()
print(f'Regression Summary Output: \n{results.summary()}\n')

In [None]:
# 95% and 99% Confidence Interval:
ci95 = results.conf_int(0.05)
ci99 = results.conf_int(0.01)

print(f'95% Confidence Interval: {ci95}\n')
print(f'99% Confidence Interval: {ci99}\n')

### 3. Linear Restrictions: *F* Tests

