## Multiple Regression Analysis

Running a multiple regression in *Python* is as straightforward as running a simple regression using the **ols()** command in **statsmodels**. The following example shows how it is done in code. In a later section, we open the black box and replicate the main calculations using matrix algebra. That section is not required for the rest of the notebook, so readers who prefer to keep black boxes closed can skip it.

We will also discuss the interpretation of regression results and the prevalent omitted variable problem. Finally, we will cover standard errors and multicollinearity for multiple regression at the end of this notebook.

### Multiple Regression in Practice

Consider the population regression model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_k x_k + u$$

and suppose the data set **sample** contains the variables **y, x1, x2, x3** with the respective data of our sample. We estimate the model parameters by OLS using the commands

``` Python
reg = smf.ols(formula = 'y ~ x1 + x2 + x3', data = sample)
results = reg.fit()
```

The tilde "**~**" again separates the dependent variable from the regressors, which are now separated by "**+**" signs. We can add options as before. The constant is again added automatically unless it is explicitly suppressed using **'y ~ 0 + x1 + x2 + x3 + ...'**.
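
As a minimal sketch of how the formula handles the constant, the following cell fits the same model with and without an intercept. The data are synthetic and generated only for illustration (the names `df`, `with_const` and `no_const` are not part of the original examples); the `0 +` term in the formula is what drops the constant.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only (not one of the Wooldridge data sets)
rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.normal(size=50), 'x2': rng.normal(size=50)})
df['y'] = 1.0 + 2.0 * df['x1'] - 0.5 * df['x2'] + rng.normal(size=50)

# Default: the constant is added automatically and reported as 'Intercept'
with_const = smf.ols(formula='y ~ x1 + x2', data=df).fit()

# '0 +' in the formula suppresses the constant
no_const = smf.ols(formula='y ~ 0 + x1 + x2', data=df).fit()

print(f'With constant:    {list(with_const.params.index)}')
print(f'Without constant: {list(no_const.params.index)}')
```

Only the first fit reports an `Intercept` entry in its parameter vector.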

We are already familiar with the workings of **smf.ols()** and **fit()**: The first command creates an object that contains all relevant information, and the estimation is performed in a second step. The estimation results are stored in the variable **results** using the code **results = reg.fit()**. We can use this variable for further analyses. For a typical regression output including a coefficient table, call **results.summary()**. Further analyses involving residuals, fitted values and the like work exactly as presented in the previous notebook.

The output of **summary()** includes parameter estimates, standard errors according to Theorem 3.2 of Wooldridge (2019), the coefficient of determination $R^2$, and many more useful results that we cannot interpret until we have worked through the next notebook.

#### Wooldridge, Example 3.1: Determinants of College GPA

This example from Wooldridge (2019) relates the college GPA (*colGPA*) to the high school GPA (*hsGPA*) and the achievement test score (*ACT*) for a sample of 141 students. The OLS regression function is

$$\widehat{colGPA} = 1.286 + 0.453 \cdot hsGPA + 0.0094 \cdot ACT$$

In [1]:
# Import modules
import wooldridge as woo
import statsmodels.formula.api as smf

In [2]:
# Import gpa1 data set
gpa1 = woo.dataWoo('gpa1')

In [3]:
# Create the model and print the summary output
reg = smf.ols(formula = 'colGPA ~ hsGPA + ACT', data = gpa1)
results = reg.fit()
print(f'Regression Summary Output: \n{results.summary()}\n')

Regression Summary Output: 
                            OLS Regression Results                            
Dep. Variable:                 colGPA   R-squared:                       0.176
Model:                            OLS   Adj. R-squared:                  0.164
Method:                 Least Squares   F-statistic:                     14.78
Date:                Sun, 09 May 2021   Prob (F-statistic):           1.53e-06
Time:                        21:38:29   Log-Likelihood:                -46.573
No. Observations:                 141   AIC:                             99.15
Df Residuals:                     138   BIC:                             108.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.2863   

#### Wooldridge, Example 3.4: Determinants of College GPA

For the regression run in Example 3.1, the output reports $R^2$ = 0.176, so about 17.6% of the variance in college GPA is explained by the two regressors.
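
To make the "share of variance explained" reading concrete, here is a small sketch that computes $R^2 = 1 - SSR/SST$ by hand. It uses synthetic data rather than the GPA sample, and all variable names are chosen for this illustration only.

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)

# OLS fit with a constant
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
u_hat = y - y_hat

# R-squared = 1 - (sum of squared residuals) / (total sum of squares)
SSR = np.sum(u_hat ** 2)
SST = np.sum((y - y.mean()) ** 2)
R2 = 1 - SSR / SST

# With a constant in the model, R-squared also equals the squared
# correlation between y and the fitted values
R2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2

print(f'R2: {R2:.4f}, squared correlation: {R2_corr:.4f}')
```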

#### Examples 3.2, 3.3, 3.5, 3.6: Further Multiple Regression Examples

To get a feeling for the methods and results, we present the analyses, including the full regression tables, of the mentioned examples from Wooldridge (2019). See Wooldridge (2019) for descriptions of the data sets and variables and for comments on the results.

In [None]:
# Import modules
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

In [None]:
# Example 3.2

# Import wage1 data set
wage1 = woo.dataWoo('wage1')

# Build the OLS regression model and print the summary output
reg = smf.ols(formula = 'np.log(wage) ~ educ + exper + tenure', data = wage1)
results = reg.fit()
print(f'Regression Summary: \n{results.summary()}\n')

In [None]:
# Example 3.3

# Import 401k data set
k401k = woo.dataWoo('401k')

# Build the OLS regression model and print the summary output
reg = smf.ols(formula = 'prate ~ mrate + age', data = k401k)
results = reg.fit()
print(f'Regression Summary: \n{results.summary()}\n')

In [None]:
# Example 3.5a

# Import crime1 data set
crime1 = woo.dataWoo('crime1')

# Build the OLS regression model and print the summary output
reg = smf.ols(formula = 'narr86 ~ pcnv + ptime86 + qemp86', data = crime1)
results = reg.fit()
print(f'Regression Summary: \n{results.summary()}\n')

In [None]:
# Example 3.5b

# Import crime1 data set
crime1 = woo.dataWoo('crime1')

# Build the OLS regression model and print the summary output
reg = smf.ols(formula = 'narr86 ~ pcnv + avgsen + ptime86 + qemp86', data = crime1)
results = reg.fit()
print(f'Regression Summary: \n{results.summary()}\n')

In [None]:
# Example 3.6

# Import wage1 data set
wage1 = woo.dataWoo('wage1')

# Build the OLS regression model and print the summary output
reg = smf.ols(formula = 'np.log(wage) ~ educ', data = wage1)
results = reg.fit()
print(f'Regression Summary: \n{results.summary()}\n')

### OLS in Matrix Form

For applying regression methods to empirical problems, we do not actually need to know the formulas our software uses. In multiple regression, we need to resort to matrix algebra to find an explicit expression for the OLS parameter estimates. Wooldridge (2019) defers this discussion to Appendix E, and we follow the notation used there. Going through this material is not required for applying multiple regression to real-world problems but is useful for a deeper understanding of the methods and their black-box implementations in software packages. In the following chapters, we will rely on the comfort of the canned routine **fit()**, so this section may be skipped.

In matrix form, we store the regressors in an $n \times (k + 1)$ matrix **X** which has a column for each regressor plus a column of ones for the constant. The sample values of the dependent variable are stored in an $n \times 1$ column vector **y**. Wooldridge (2019) derives the OLS estimator $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3, ..., \hat{\beta}_k)'$ to be

$$\hat{\beta} = (X'X)^{-1}X'y$$

This equation involves three matrix operations which we know how to implement in *Python*:

- Transpose: The expression $X'$ is **X.T** in **numpy**
- Matrix multiplication: The expression $X'X$ is translated as **X.T @ X**
- Inverse: $(X'X)^{-1}$ is written as **np.linalg.inv(X.T @ X)**

So we can collect everything and translate the matrix equation into the somewhat unsightly expression

``` Python
b = np.linalg.inv(X.T @ X) @ X.T @ y
```
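
The explicit inverse mirrors the textbook formula. As a hedged aside, the same estimates can be obtained without forming the inverse, either by solving the normal equations $(X'X)\hat{\beta} = X'y$ with **np.linalg.solve()** or by calling **np.linalg.lstsq()** on **X** directly. The sketch below checks this on synthetic data; the data and the names `b_inv`, `b_solve` and `b_lstsq` are assumptions made for this illustration.

```python
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(2)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=n)

# Textbook formula with an explicit inverse
b_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Same estimates by solving the normal equations (X'X) b = X'y
b_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares routine that works on X directly
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f'inv vs. solve agree: {np.allclose(b_inv, b_solve)}')
print(f'inv vs. lstsq agree: {np.allclose(b_inv, b_lstsq)}')
```

For the small, well-conditioned problems in this notebook the three approaches are interchangeable; the variants without an explicit inverse are generally preferred when regressors are nearly collinear.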

The vector of residuals can be manually calculated as 

$$\hat{u} = y - X\hat{\beta}$$

or translated into the **numpy** matrix language

``` Python
u_hat = y - X @ b
```

The formula for the estimated variance of the error term is

$$\hat{\sigma}^2 = \frac{1}{n - k - 1} \hat{u}' \hat{u}$$

which is equivalent to

``` Python
sigsq_hat = (u_hat.T @ u_hat) / (n - k - 1)
```

The standard error of the regression (SER) is its square root $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$. The estimated OLS variance-covariance matrix according to Wooldridge (2019, Theorem E.2) is then

$$\widehat{Var}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1}$$

``` Python
Vb_hat = sigsq_hat * np.linalg.inv(X.T @ X)
```

Finally, the standard errors of the parameter estimates are the square roots of the main diagonal of $\widehat{Var}(\hat{\beta})$, which can be expressed in **numpy** as

``` Python
se = np.sqrt(np.diagonal(Vb_hat))
```

The example below implements this for the GPA regression from Example 3.1. Comparing the results to those of the built-in function, it is reassuring that we get exactly the same numbers for the parameter estimates and the standard errors of the coefficients. We also demonstrate another way of generating **y** and **X** using the module **patsy**. It includes the command **dmatrices()**, which conveniently creates the matrices from formula syntax.

In [None]:
# Import modules
import wooldridge as woo
import numpy as np
import pandas as pd
import patsy as pt

In [None]:
# Import gpa1 data set
gpa1 = woo.dataWoo('gpa1')

In [None]:
# Determine sample size and number of regressors
n = len(gpa1)
k = 2

In [None]:
# Extract dependent variable y, 'colGPA'
y = gpa1['colGPA']

# Extract independent variables X and add a column of ones
X = pd.DataFrame({'const': 1, 'hsGPA': gpa1['hsGPA'], 'ACT': gpa1['ACT']})

# Alternative with patsy:
# y2, X2 = pt.dmatrices('colGPA ~ hsGPA + ACT', data = gpa1, return_type = 'dataframe')

# Print the first rows of X
print(f'First Rows of X: \n{X.head()}\n')

In [None]:
# Parameter estimates
X = np.array(X)
y = np.array(y).reshape(n, 1) # Create a column vector (n x 1)
b = np.linalg.inv(X.T @ X) @ X.T @ y
print(f'Estimated Parameters (Betas): \n{b}\n')

In [None]:
# Residuals, estimated variance of u and SER
u_hat = y - X @ b
sigsq_hat = (u_hat.T @ u_hat) / (n - k - 1)
SER = np.sqrt(sigsq_hat)
print(f'SER: {SER}\n')

In [None]:
# Estimated variance of the parameter estimators and SE
Vbeta_hat = sigsq_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diagonal(Vbeta_hat))
print(f'Standard Errors of the Estimated Parameters: \n{se}\n')

### Ceteris Paribus Interpretation and Omitted Variable Bias

The parameters in a multiple regression can be interpreted as partial effects. In a general model with *k* regressors, the estimated slope parameter $\hat{\beta}_j$ associated with the variable $x_j$ is the change in $\hat{y}$ as $x_j$ increases by one unit while *the other variables are held fixed*.

Wooldridge (2019) discusses this interpretation in Section 3.2 and offers a useful formula for interpreting the difference between simple regression results and the *ceteris paribus* interpretation of multiple regression: Consider a regression with two explanatory variables:

(3.1)
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$$

The parameter $\hat{\beta}_1$ is the estimated effect of increasing $x_1$ by one unit while keeping $x_2$ fixed. In contrast, consider the simple regression including only $x_1$ as a regressor:

(3.2)
$$\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1$$

The parameter $\tilde{\beta}_1$ is the estimated effect of increasing $x_1$ by one unit (and NOT keeping $x_2$ fixed). It can be related to $\hat{\beta}_1$ using the formula

(3.3)
$$\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1$$

where $\tilde{\delta}_1$ is the slope parameter of the linear regression of $x_2$ on $x_1$:

(3.4)
$$x_2 = \tilde{\delta}_0 + \tilde{\delta}_1 x_1$$

This equation is actually quite intuitive: As $x_1$ increases by one unit,

- Predicted *y* directly increases by $\hat{\beta}_1$ units (*ceteris paribus* effect)
- Predicted $x_2$ increases by $\tilde{\delta}_1$ units
- Each of these $\tilde{\delta}_1$ units leads to an increase of predicted *y* by $\hat{\beta}_2$ units, giving a total indirect effect of $\tilde{\delta}_1 \hat{\beta}_2$
- The overall effect $\tilde{\beta}_1$ is the sum of the direct and indirect effects
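
Equation 3.3 is an algebraic identity, so it holds exactly in any sample and not only on average. The following sketch checks this on synthetic data with correlated regressors; the helper `ols()` and all data-generating values are hypothetical and introduced only for this illustration.

```python
import numpy as np

def ols(y, *regressors):
    """Return OLS coefficients [constant, slopes...] for 1-D arrays."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data for illustration only: x2 is correlated with x1
rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 + 0.2 * x1 + 0.8 * x2 + rng.normal(size=n)

b_hat = ols(y, x1, x2)   # multiple regression of y on x1 and x2
b_tilde = ols(y, x1)     # simple regression of y on x1 (x2 omitted)
d_tilde = ols(x2, x1)    # regression of x2 on x1

# Right-hand side of Equation 3.3: direct plus indirect effect
rhs = b_hat[1] + b_hat[2] * d_tilde[1]

print(f'Simple regression slope:  {b_tilde[1]:.6f}')
print(f'Direct + indirect effect: {rhs:.6f}')
```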

We revisit Example 3.1 to see whether we can demonstrate this relationship in *Python*. First, we repeat the regression of the college GPA (*colGPA*) on the achievement test score (*ACT*) and the high school GPA (*hsGPA*). Here *ACT* plays the role of $x_1$ and *hsGPA* that of $x_2$. We study the *ceteris paribus* effect of *ACT* on *colGPA*, which has an estimated value of $\hat{\beta}_1$ = 0.0094. The estimated effect of *hsGPA* is $\hat{\beta}_2$ = 0.453. The slope parameter of the regression corresponding to Equation 3.4 is $\tilde{\delta}_1$ = 0.0389. Plugging these values into Equation 3.3 gives a total effect of $\tilde{\beta}_1$ = 0.0271, which is exactly what the simple regression at the end of the output delivers.



In [None]:
# Import modules
import wooldridge as woo
import numpy as np
import statsmodels.formula.api as smf

In [None]:
# Import gpa1 data set
gpa1 = woo.dataWoo('gpa1')

In [None]:
# Create the model and print the summary output
reg = smf.ols(formula = 'colGPA ~ hsGPA + ACT', data = gpa1)
results = reg.fit()
b = results.params
print(f'Estimated Parameters (betas): \n{b}\n')

In [None]:
# Relation between regressors
reg_delta = smf.ols(formula = 'hsGPA ~ ACT', data = gpa1)
results_delta = reg_delta.fit()
delta_tilde = results_delta.params
print(f'Estimated Parameters (delta): \n{delta_tilde}\n')

In [None]:
# Omitted variables formula for b1_tilde
b1_tilde = b['ACT'] + b['hsGPA'] * delta_tilde['ACT']
print(f'Omitted Variable Effect: {b1_tilde}\n')

In [None]:
# Actual regression with hsGPA omitted
reg_om = smf.ols(formula = 'colGPA ~ ACT', data = gpa1)
results_om = reg_om.fit()
b_om = results_om.params
print(f'Estimated Parameters of ACT: \n{b_om}\n')

In this example, the indirect effect is actually stronger than the direct effect. *ACT* predicts *colGPA* mainly because it is related to *hsGPA* which in turn is strongly related to *colGPA*.

These relations hold for the estimates from a given sample. In Section 3.3, Wooldridge (2019) discusses how to apply the same sort of arguments to the OLS estimators, which are random variables varying over different samples. Omitting relevant regressors causes bias if we are interested in estimating partial effects. In practice, it is difficult to include *all* relevant regressors, which makes omitted variables a prevalent problem. It is important enough to have motivated a vast amount of methodological and applied research. More advanced techniques like instrumental variables or panel data methods try to solve the problem in cases where we cannot add all relevant regressors, for example because they are unobservable. We will come back to this in later sections.
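
The sampling argument can be illustrated with a small simulation. The sketch below draws many synthetic samples from a known model, estimates the slope of $x_1$ once with and once without $x_2$, and compares the averages with the true values. The parameter values, sample size and number of replications are arbitrary assumptions for this illustration.

```python
import numpy as np

# Assumed population parameters for the simulation (illustration only)
rng = np.random.default_rng(5)
n, reps = 200, 2000
beta1, beta2, delta1 = 0.2, 0.8, 0.6

long_slopes = np.empty(reps)    # slope of x1 when x2 is included
short_slopes = np.empty(reps)   # slope of x1 when x2 is omitted

for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = delta1 * x1 + rng.normal(size=n)
    y = 1.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)

    X_long = np.column_stack([np.ones(n), x1, x2])
    X_short = X_long[:, :2]
    long_slopes[r] = np.linalg.lstsq(X_long, y, rcond=None)[0][1]
    short_slopes[r] = np.linalg.lstsq(X_short, y, rcond=None)[0][1]

print(f'True beta1:                  {beta1}')
print(f'Mean slope with x2 included: {long_slopes.mean():.4f}')
print(f'Mean slope with x2 omitted:  {short_slopes.mean():.4f}')
print(f'beta1 + beta2 * delta1:      {beta1 + beta2 * delta1:.4f}')
```

Across replications, the estimator that includes $x_2$ centers on the true $\beta_1$, while the one that omits it centers on $\beta_1 + \beta_2 \delta_1$.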

### Standard Errors, Multicollinearity, and VIF

