# Regression Models with Multiple Regressors

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

import scipy.stats as stats

import statsmodels.api as sm
import statsmodels.formula.api as smf

## Omitted Variable Bias

The analysis of the relationship between test scores and class size in Chapters 4 and 5 has a major flaw: we ignored other determinants of the dependent variable (test score) that correlate with the regressor (class size). Recall that influences on the dependent variable which are not captured by the model are collected in the error term, which we have so far assumed to be uncorrelated with the regressor. This assumption is violated if we exclude determinants of the dependent variable that vary with the regressor. The omission can bias the estimator: the mean of the OLS estimator's sampling distribution no longer equals the true parameter value. In our example, we would therefore, on average, misestimate the causal effect on test scores of a unit change in the student-teacher ratio. This issue is called omitted variable bias (OVB) and is summarized in Key Concept 6.1.
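The mechanism can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the test score data: the coefficients (1, 2, 3), the 0.8 link between `x1` and `x2`, and the helper `ols` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data-generating process (all numbers are made up):
# x2 is correlated with x1 and also affects y.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

beta_short = ols(y, x1)      # x2 omitted: its effect ends up in the error term
beta_long = ols(y, x1, x2)   # x2 included

print("slope on x1, x2 omitted: ", beta_short[1])
print("slope on x1, x2 included:", beta_long[1])
```

Under these assumptions the slope from the short regression should settle near 2 + 3 · 0.8 = 4.4 rather than the true value 2, while the long regression recovers a value close to 2.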

![title](images/chapter6/img1.jpg)

![title](images/chapter6/img2.png)

## Multiple regression model

![title](images/chapter6/img3.png)

We want to minimize $$ \sum_{i=1}^n (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i} - \dots -  b_k X_{ki})^2 \tag{6.5} $$
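As a rough illustration of this minimization problem, the sketch below computes the least-squares coefficients with NumPy and checks that moving one coefficient away from the solution increases the sum of squared residuals. The data are synthetic and the function name `ssr` is an assumption made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 2

# Synthetic regressors and outcome (coefficients are arbitrary)
X = rng.normal(size=(n, k))
Y = 0.5 + X @ np.array([1.5, -2.0]) + rng.normal(size=n)

def ssr(b, X, Y):
    """Sum of squared residuals for b = (b_0, b_1, ..., b_k)."""
    return np.sum((Y - b[0] - X @ b[1:]) ** 2)

# Add a column of ones for the intercept and solve the least-squares problem
X_design = np.column_stack([np.ones(n), X])
b_hat, *_ = np.linalg.lstsq(X_design, Y, rcond=None)

# Perturbing any coefficient away from b_hat raises the objective
b_perturbed = b_hat + np.array([0.0, 0.1, 0.0])
print("SSR at OLS solution:", ssr(b_hat, X, Y))
print("SSR after perturbation:", ssr(b_perturbed, X, Y))

# At the minimum, the residuals are orthogonal to every column of X_design
print(X_design.T @ (Y - X_design @ b_hat))
```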

SER = np.sqrt(SSR / (n - k - 1))                # standard error of the regression   
Rsq = 1 - SSR / TSS                             # R^2   
adj_Rsq = 1 - (n - 1) / (n - k - 1) * SSR / TSS # adj. R^2   

As already mentioned, $\bar{R}^2$ may be used to quantify how well a model fits the data. However, it is rarely a good idea to maximize these measures by stuffing the model with regressors, and no serious study does so. Instead, it is more useful to include regressors that improve the estimation of the causal effect of interest, which is not assessed by means of the $\bar{R}^2$ of the model. The issue of variable selection is covered in Chapter 8.
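The following sketch computes the three measures of fit on synthetic data and shows what happens when pure-noise regressors are added. The helper `fit_measures` and the data are assumptions for illustration; they are not part of the test score analysis.

```python
import numpy as np

def fit_measures(Y, X):
    """Return (SER, R^2, adjusted R^2) for an OLS fit of Y on X plus intercept."""
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(Xd, Y, rcond=None)
    resid = Y - Xd @ b
    SSR = np.sum(resid ** 2)              # sum of squared residuals
    TSS = np.sum((Y - Y.mean()) ** 2)     # total sum of squares
    SER = np.sqrt(SSR / (n - k - 1))
    Rsq = 1 - SSR / TSS
    adj_Rsq = 1 - (n - 1) / (n - k - 1) * SSR / TSS
    return SER, Rsq, adj_Rsq

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=(n, 1))                   # one relevant regressor
Y = 1.0 + 2.0 * x[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, 5))               # five irrelevant regressors

small = fit_measures(Y, x)
big = fit_measures(Y, np.column_stack([x, noise]))

print("1 regressor:  SER=%.3f  R2=%.4f  adj. R2=%.4f" % small)
print("6 regressors: SER=%.3f  R2=%.4f  adj. R2=%.4f" % big)
```

$R^2$ cannot decrease when regressors are added, even irrelevant ones, whereas $\bar{R}^2$ applies a penalty through the factor $(n-1)/(n-k-1)$ and so need not rise.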

## Imperfect Multicollinearity

If $X_1$ and $X_2$ are highly correlated, OLS struggles to estimate $\beta_1$ precisely.   
That is, although $\hat{\beta}_1$ remains a consistent and unbiased estimator of $\beta_1$, its variance is large because $X_2$ is included in the model.   
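A small Monte Carlo experiment makes this visible. The sketch below is an illustration under assumed values (sample size 100, 500 replications, coefficients 1, 2, 3, unit-variance regressors with correlation `rho`); the function name `simulate_beta1` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 500
beta = np.array([1.0, 2.0, 3.0])  # hypothetical intercept, beta_1, beta_2

def simulate_beta1(rho):
    """Estimate beta_1 in `reps` samples where corr(X1, X2) = rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    estimates = np.empty(reps)
    for r in range(reps):
        X = rng.multivariate_normal(np.zeros(2), cov, size=n)
        Xd = np.column_stack([np.ones(n), X])
        Y = Xd @ beta + rng.normal(size=n)
        b, *_ = np.linalg.lstsq(Xd, Y, rcond=None)
        estimates[r] = b[1]
    return estimates

results = {rho: simulate_beta1(rho) for rho in (0.0, 0.5, 0.95)}
for rho, est in results.items():
    print(f"rho={rho:4.2f}  mean={est.mean():.3f}  var={est.var(ddof=1):.4f}")
```

In this setup the average estimate should stay close to 2 for every value of `rho`, while the variance of the estimates grows as the correlation between the regressors approaches 1.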

https://www.econometrics-with-r.org/6-4-ols-assumptions-in-multiple-regression.html

In [8]:
CAS = pd.read_csv('https://raw.githubusercontent.com/ejvanholm/DataProjects/master/CASchools.csv', index_col = 0)

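# Student-teacher ratio and the average of the reading and math scores
CAS['STR'] = CAS['students']/CAS['teachers']
CAS['score'] = (CAS['read'] + CAS['math'])/2

# Randomly assigned categorical variable, one label per row
CAS['direction'] = np.random.choice(["West", "North", "South", "East"], size=len(CAS))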

In [13]:
formula = "score ~ STR + english + C(direction)"
lm_model = smf.ols(formula, data = CAS).fit()



In [14]:
lm_model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | score | No. Observations: | 420.0 |
| Model: | GLM | Df Residuals: | 414.0 |
| Model Family: | Gamma | Df Model: | 5.0 |
| Link Function: | inverse_power | Scale: | 0.00048397 |
| Method: | IRLS | Log-Likelihood: | -1712.6 |
| Date: | Wed, 08 Sep 2021 | Deviance: | 0.20024 |
| Time: | 00:08:10 | Pearson chi2: | 0.2 |
| No. Iterations: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.0015 | 1.77e-05 | 82.695 | 0.000 | 0.001 | 0.001 |
| C(direction)[T.North] | -3.497e-06 | 4.78e-06 | -0.731 | 0.465 | -1.29e-05 | 5.88e-06 |
| C(direction)[T.South] | -6.147e-06 | 4.74e-06 | -1.298 | 0.194 | -1.54e-05 | 3.14e-06 |
| C(direction)[T.West] | -1.857e-06 | 4.82e-06 | -0.385 | 0.700 | -1.13e-05 | 7.59e-06 |
| STR | 2.409e-06 | 8.88e-07 | 2.714 | 0.007 | 6.69e-07 | 4.15e-06 |
| english | 1.569e-06 | 9.42e-08 | 16.657 | 0.000 | 1.38e-06 | 1.75e-06 |
