## 6.1 Omitted Variable Bias

- Our previous analyses of the relationship between test score and class size were flawed in that we ignored other determinants of the dependent variable (test score) that correlate with the regressor (class size).
- Remember that influences on the dependent variable which are not captured by the model are collected in the error term, which we so far assumed to be uncorrelated with the regressor. 
- However, this assumption is violated if we exclude determinants of the dependent variable which vary with the regressor. 
- This can bias the estimator: the mean of the OLS estimator’s sampling distribution is then no longer equal to the true value of the coefficient.
- In our example, we therefore misestimate the average causal effect on test scores of a unit change in the student-teacher ratio. This is called omitted variable bias (OVB).
- Omitted variable bias is the bias in the OLS estimator that arises when the regressor, $X$, is correlated with an omitted variable. 
- For omitted variable bias to occur, two conditions must be fulfilled:
 1. $X$ is correlated with the omitted variable.
 2. The omitted variable is a determinant of the dependent variable $Y$.
- Together, 1. and 2. result in a violation of the first OLS assumption $E(u_i\vert X_i) = 0$.
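
The direction and size of the bias can be made explicit. The following is a sketch of the standard large-sample result, stated without proof; the symbols $\rho_{Xu}$, $\sigma_u$ and $\sigma_X$ are not defined in the text above and are introduced here as assumptions: $\rho_{Xu}$ is the correlation between $X_i$ and $u_i$, and $\sigma_u$, $\sigma_X$ are the standard deviations of the error term and the regressor.

\begin{equation}
\hat\beta_1 \xrightarrow{p} \beta_1 + \rho_{Xu} \frac{\sigma_u}{\sigma_X}
\end{equation}

Two things follow from this expression. The bias does not vanish as the sample grows, so $\hat\beta_1$ is inconsistent whenever $\rho_{Xu} \neq 0$. And the sign of the bias is the sign of $\rho_{Xu}$: an omitted variable that lowers $Y$ and is positively correlated with $X$ makes $\rho_{Xu}$ negative and pushes $\hat\beta_1$ downward.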

- In the example of test score and class size, it is easy to come up with variables that may cause such a bias, if omitted from the model. 
- A highly relevant variable could be the percentage of English learners in the school district: it is plausible that the ability to speak, read and write English is an important factor for successful learning.
- Therefore, students that are still learning English are likely to perform worse in tests than native speakers. 
- Also, it is conceivable that the share of English learning students is bigger in school districts where class sizes are relatively large: think of poor urban districts where a lot of immigrants live.
- As such, the OLS estimate $\hat\beta_1$ suggests that small classes improve test scores, but the effect of small classes is likely overestimated because it also captures the effect of having fewer English learners.
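
The mechanism described above can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not on the CASchools data: the variable names `x`, `z`, `y`, the helper `ols_slope`, and all coefficient values are made up for illustration. Here `z` plays the role of the omitted variable (like the share of English learners), `x` the role of the included regressor (like the student-teacher ratio), and the true effect of `x` on `y` is set to $-1$. With these assumed values the slope of the regression that omits `z` should settle near $-1.8$ rather than $-1$.

```julia
using Random, Statistics, LinearAlgebra

# Slope of a simple regression of y on x (with intercept): cov(x, y) / var(x)
ols_slope(x, y) = cov(x, y) / var(x)

rng = MersenneTwister(42)   # fixed seed so the sketch is reproducible
n = 10_000

# Synthetic data (assumed values, NOT the CASchools data)
z = randn(rng, n)                             # omitted variable
x = 0.5 .* z .+ randn(rng, n)                 # condition 1: x is correlated with z
y = -1.0 .* x .- 2.0 .* z .+ randn(rng, n)    # condition 2: z is a determinant of y

# Short regression: z is omitted, so part of its effect is attributed to x
b_short = ols_slope(x, y)

# Long regression: include z through the design matrix [1 x z]
X = hcat(ones(n), x, z)
b_long = (X \ y)[2]                           # least-squares coefficient on x

println("true effect of x:       -1.0")
println("short regression slope: ", round(b_short, digits = 3))
println("long regression slope:  ", round(b_long, digits = 3))
```

Because `z` lowers `y` and is positively correlated with `x`, the short regression makes the effect of `x` look more negative than it is, which mirrors the argument about class size and English learners.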

In [3]:
using FixedEffects #FixedEffectModels requires FixedEffects as a dependency
using FixedEffectModels #we use FixedEffectModels to create regression models
using CSV #we use the CSV package to load the data
using DataFrames #we use the DataFrames package as the data is stored as an object of type "DataFrame"
using Plots #we use the Plots package for generating plots

data = CSV.read("/mnt/juliabox/Econometrics With Julia/Datasets/CASchools.csv") #load the data into the workspace and store it in the variable "data"

data.student_teacher_ratio = data.students ./ data.teachers #add a new column "student_teacher_ratio" to the data, ./ is used to broadcast the division operator between arrays (or in this case, columns)
data.score = (data.read .+ data.math) ./ 2 #add a new column "score" to the data

reg_mod = reg( #initialise a FixedEffectModel and define it as reg_mod
                        data, #pass the DataFrame 'data' as the dataset to be used in reg_mod
                        @model(score ~ student_teacher_ratio) #pass the regression formula consisting of the dependent variable 'score' and the exogenous variable 'student_teacher_ratio'
)

                                 Linear Model                                 
Number of obs:                     420   Degrees of freedom:                  2
R2:                              0.051   R2 Adjusted:                     0.049
F Statistic:                   22.5751   p-value:                         0.000
                       Estimate Std.Error  t value Pr(>|t|) Lower 95% Upper 95%
student_teacher_ratio  -2.27981  0.479826 -4.75133    0.000  -3.22298  -1.33664
(Intercept)             698.933   9.46749  73.8245    0.000   680.323   717.543


-------------------------------------------------------------------------------


- When we add the percentage of English learners $(PctEL)$ to the regression model, the coefficient $\hat\beta_2$ estimates the effect of $PctEL$ on the dependent variable, holding $STR$ constant.
- The estimate $\hat\beta_1$ on $STR$ rises from about $-2.28$ in the simple regression to about $-1.10$, which indicates that the earlier estimate was biased downward (a negative bias).

\begin{equation}
TestScore = \beta_0 + \beta_1 \times STR + \beta_2 \times PctEL + u
\end{equation}

In [2]:
using FixedEffects #FixedEffectModels requires FixedEffects as a dependency
using FixedEffectModels #we use FixedEffectModels to create regression models
using CSV #we use the CSV package to load the data
using DataFrames #we use the DataFrames package as the data is stored as an object of type "DataFrame"
using Plots #we use the Plots package for generating plots

data = CSV.read("/mnt/juliabox/Econometrics With Julia/Datasets/CASchools.csv") #load the data into the workspace and store it in the variable "data"

data.student_teacher_ratio = data.students ./ data.teachers #add a new column "student_teacher_ratio" to the data, ./ is used to broadcast the division operator between arrays (or in this case, columns)
data.score = (data.read .+ data.math) ./ 2 #add a new column "score" to the data

reg_mod = reg( #initialise a FixedEffectModel and define it as reg_mod
                        data, #pass the DataFrame 'data' as the dataset to be used in reg_mod
                        @model(score ~ student_teacher_ratio + english) #pass the regression formula consisting of the dependent variable 'score' and the two regressors 'student_teacher_ratio' and 'english'
)

                                  Linear Model                                  
Number of obs:                      420  Degrees of freedom:                   3
R2:                               0.426  R2 Adjusted:                      0.424
F Statistic:                    8.69304  p-value:                          0.000
                        Estimate Std.Error  t value Pr(>|t|) Lower 95% Upper 95%
student_teacher_ratio    -1.1013  0.380278 -2.89603    0.004   -1.8488 -0.353794
english                -0.649777 0.0393425 -16.5159    0.000 -0.727111 -0.572442
(Intercept)              686.032   7.41131  92.5656    0.000   671.464     700.6


--------------------------------------------------------------------------------
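
The two regression outputs above can be linked by a short calculation. In any sample, the OLS estimates satisfy the identity below exactly, where $\hat\delta_1$ denotes the slope from an auxiliary regression of $PctEL$ on $STR$. That auxiliary regression is not run in this section, so $\hat\delta_1$ and the superscripts $short$ and $long$ are notation assumed here for illustration.

\begin{equation}
\hat\beta_1^{short} = \hat\beta_1^{long} + \hat\beta_2 \times \hat\delta_1
\end{equation}

Plugging in the reported estimates gives the value of $\hat\delta_1$ implied by the two outputs:

\begin{equation}
\hat\delta_1 = \frac{\hat\beta_1^{short} - \hat\beta_1^{long}}{\hat\beta_2} = \frac{-2.27981 - (-1.1013)}{-0.649777} \approx 1.81
\end{equation}

A positive $\hat\delta_1$ means that districts with larger classes tend to have a higher share of English learners. Combined with the negative $\hat\beta_2$, this accounts for the negative bias of roughly $-1.18$ in the simple regression.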
