# Omitted Variable Bias
**By Jayden Nyamiaka**

In this notebook, we will work with an Excel file containing artificially generated data from a known conditional expectation function (artificial-corr-data.xlsx).

We will analyze three artificially created models in a controlled setting and purposefully exclude a variable from our regression in order to understand and estimate omitted variable bias. 

In [1]:
## Import necessary packages
import pandas as pd
import statsmodels.formula.api as smf

In [2]:
## Import data
filename = "artificial-corr-data.xlsx"
df = pd.read_excel(filename)

## View sample of the data
print(df.head())

        y1      y2      y3     x1     x2    x3
0  337.040  50.329  -4.271  94.45  49.39  2.10
1  297.643  42.025  -1.395  82.66  43.53  1.67
2  172.656  22.318  -5.242  26.52  30.24  1.06
3  203.846  36.770 -12.630  49.02  33.56  1.90
4  429.881  33.724   6.944  68.37  78.40  1.03


In [3]:
# Part 1
# The (artificially created) model of the world is
#                    y1 = 1.3x1 + 4.5x2 + error

# View correlation between x1 and x2
print("Correlation table between all variables:")
print(df.corr())

# Estimate the model y1 = b0 + b1*x1 + error using OLS with robust standard errors
reg1 = smf.ols(formula="y1 ~ x1", data=df).fit(cov_type='HC3')
print(reg1.summary())

Correlation table between all variables:
          y1        y2        y3        x1        x2        x3
y1  1.000000  0.723885  0.462326  0.291180  0.957684  0.353009
y2  0.723885  1.000000  0.530262  0.724498  0.537554  0.590823
y3  0.462326  0.530262  1.000000  0.833572  0.230993 -0.370743
x1  0.291180  0.724498  0.833572  1.000000  0.013460  0.000401
x2  0.957684  0.537554  0.230993  0.013460  1.000000  0.369034
x3  0.353009  0.590823 -0.370743  0.000401  0.369034  1.000000
                            OLS Regression Results                            
Dep. Variable:                     y1   R-squared:                       0.085
Model:                            OLS   Adj. R-squared:                  0.085
Method:                 Least Squares   F-statistic:                     426.8
Date:                Sun, 28 Apr 2024   Prob (F-statistic):           9.13e-91
Time:                        16:34:56   Log-Likelihood:                -29065.
No. Observations:                4626   AIC:

### Question 1: What is the correlation between x1 and x2?
From the correlation table, we see the correlation between x1 and x2 is 0.013460: positive, but very close to zero.

### Question 2: You're thinking about estimating the model: y1 = b0 + b1*x1 + error; that is, you are omitting x2. Before estimating, do you think b1 will be biased? Why or why not? If yes, will b1 be an overestimate or an underestimate?
Yes, omitting x2 will make b1 biased. Since x2 is positively (if only weakly) correlated with x1 and also strongly positively correlated with y1, omitting x2 means part of the positive effect of x2 on y1 will be attributed to x1. This will make it seem like x1 has a larger positive effect on y1 than it actually does, so b1 will be an overestimate. Because the correlation between x1 and x2 is so close to zero, we expect the bias to be small.
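The reasoning above matches the usual omitted variable bias algebra: the slope from the short regression equals the slope from the full regression plus the coefficient on the omitted variable times the slope from regressing the omitted variable on the included one. Below is a minimal sketch of that identity on synthetic data, not on the notebook's Excel file. The seed, sample size, distributions, the 0.5 dependence of x2 on x1, and the helper `ols` are all assumptions made for illustration; only the coefficients 1.3 and 4.5 are borrowed from Part 1.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed (assumption)
n = 5000                        # arbitrary sample size (assumption)

# Synthetic regressors: x2 depends positively on x1 (illustrative choice)
x1 = rng.normal(50, 20, n)
x2 = 0.5 * x1 + rng.normal(30, 10, n)
y = 1.3 * x1 + 4.5 * x2 + rng.normal(0, 5, n)


def ols(y, *columns):
    """Hypothetical helper: OLS coefficients [intercept, slopes...]."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


short = ols(y, x1)        # omits x2
long_ = ols(y, x1, x2)    # includes x2
delta = ols(x2, x1)[1]    # slope of the omitted regressor on the included one

# In-sample identity: short slope = long slope + (coefficient on x2) * delta
implied = long_[1] + long_[2] * delta
print("short-regression slope on x1:", short[1])
print("slope implied by the identity:", implied)
```

Because x2 moves with x1 in this synthetic setup much more strongly than in our data, the overestimate here is large; with a correlation near zero the same identity predicts only a small bias.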

### Question 3: Estimate the model: y1 = b0 + b1*x1 + error using OLS with robust standard errors. Was your hypothesis in Question 2 correct?
Regression with model y1 = b0 + b1*x1 + error results in b0 = 222.2364 and b1 = 1.3594. The actual model of the world is y1 = 1.3x1 + 4.5x2 + error where 1.3 is the matching coefficient for b1. Thus, since b1 = 1.3594 > 1.3, b1 is an overestimate, and our hypothesis is correct.
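Since the true coefficients are known in this controlled setting, we can also back out how large the bias is and what it implies about the relationship between x1 and x2. The sketch below only does arithmetic on the numbers reported above; it assumes the bias takes the form (coefficient on x2) times (slope of x2 on x1) and ignores sampling error, so the implied slope is a rough figure rather than an estimate from the data.

```python
# Numbers taken from the regression output and the known model above
b1_hat = 1.3594   # estimated slope on x1 when x2 is omitted
beta1 = 1.3       # true coefficient on x1
beta2 = 4.5       # true coefficient on the omitted x2

# Size of the overestimate
bias = b1_hat - beta1

# Slope of x2 on x1 implied by bias = beta2 * slope (ignores sampling noise)
implied_slope = bias / beta2

print("bias in b1:", round(bias, 4))
print("implied slope of x2 on x1:", round(implied_slope, 4))
```

The bias is small relative to the coefficient itself, which is consistent with the near-zero correlation between x1 and x2 seen in the correlation table.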

In [4]:
# Part 2
# The (artificially created) model of the world is
#                    y2 = 0.2x1 + 0.1x2 + 13x3 + error

# View correlation between x2 and x3
print("Correlation table between all variables:")
print(df.corr())

# Estimate the model y2 = b0+ b1*x1 + b2*x2 + error using OLS with robust standard errors
reg2 = smf.ols(formula="y2 ~ x1 + x2", data=df).fit(cov_type='HC3')
print(reg2.summary())

Correlation table between all variables:
          y1        y2        y3        x1        x2        x3
y1  1.000000  0.723885  0.462326  0.291180  0.957684  0.353009
y2  0.723885  1.000000  0.530262  0.724498  0.537554  0.590823
y3  0.462326  0.530262  1.000000  0.833572  0.230993 -0.370743
x1  0.291180  0.724498  0.833572  1.000000  0.013460  0.000401
x2  0.957684  0.537554  0.230993  0.013460  1.000000  0.369034
x3  0.353009  0.590823 -0.370743  0.000401  0.369034  1.000000
                            OLS Regression Results                            
Dep. Variable:                     y2   R-squared:                       0.804
Model:                            OLS   Adj. R-squared:                  0.803
Method:                 Least Squares   F-statistic:                     8773.
Date:                Sun, 28 Apr 2024   Prob (F-statistic):               0.00
Time:                        16:34:56   Log-Likelihood:                -12455.
No. Observations:                4626   AIC:

### Question 4: What is the correlation between x2 and x3?
From the correlation table, we see the correlation between x2 and x3 is 0.369034, revealing a moderate positive correlation.

### Question 5: You're thinking about estimating the model: y2 = b0 + b1*x1 + b2*x2 + error; that is, you are omitting x3. Before estimating, do you think b2 will be biased? Why or why not? If yes, will b2 be an overestimate or an underestimate?
Yes, omitting x3 will make b2 biased. First, we can recognize that the correlation between x3 and x1 is negligible (0.000401), so b1 should be largely unaffected. Then, since x3 is positively correlated with x2 and also positively correlated with y2, omitting x3 means part of the positive effect of x3 on y2 will be attributed to x2. This will make it seem like x2 has a larger positive effect on y2 than it actually does, so b2 will be an overestimate.

### Question 6: Estimate the model: y2 = b0 + b1*x1 + b2*x2 + error using OLS with robust standard errors. Was your hypothesis in Question 5 correct?
Regression with model y2 = b0 + b1*x1 + b2*x2 + error results in b0 = 21.0889, b1 = 0.1994, and b2 = 0.1486. The actual model of the world is y2 = 0.2x1 + 0.1x2 + 13x3 + error, where 0.1 is the matching coefficient for b2. Thus, since b2 = 0.1486 > 0.1, b2 is an overestimate, and our hypothesis is correct.
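With two included regressors, the same logic applies, except that the relevant link is the coefficient on x2 in an auxiliary regression of the omitted x3 on both x1 and x2. The sketch below checks this on synthetic data rather than on our Excel file. The seed, sample size, distributions, the 0.01 dependence of x3 on x2, and the helper `ols` are assumptions for illustration; only the coefficients 0.2, 0.1, and 13 come from Part 2.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed (assumption)
n = 5000                        # arbitrary sample size (assumption)

# Synthetic regressors: x3 is related to x2 but not to x1 (illustrative choice)
x1 = rng.normal(60, 20, n)
x2 = rng.normal(45, 15, n)
x3 = 0.01 * x2 + rng.normal(1.0, 0.4, n)
y = 0.2 * x1 + 0.1 * x2 + 13 * x3 + rng.normal(0, 2, n)


def ols(y, *columns):
    """Hypothetical helper: OLS coefficients [intercept, slopes...]."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


short = ols(y, x1, x2)        # omits x3
long_ = ols(y, x1, x2, x3)    # includes x3
aux = ols(x3, x1, x2)         # auxiliary regression: [intercept, g1, g2]

# In-sample identity for each included regressor:
# short coefficient = long coefficient + (coefficient on x3) * auxiliary slope
implied_b1 = long_[1] + long_[3] * aux[1]
implied_b2 = long_[2] + long_[3] * aux[2]

print("b1 short vs implied:", short[1], implied_b1)
print("b2 short vs implied:", short[2], implied_b2)
```

In this setup b1 stays close to 0.2 because x3 is unrelated to x1, while b2 absorbs part of the effect of x3, which is the pattern we see in the regression output above.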

In [5]:
# Part 3
# The (artificially created) model of the world is
#                    y3 = 0.2x1 + 0.1x2 - 13x3 + error

# View correlation between x2 and x3
print("Correlation table between all variables:")
print(df.corr())

# Estimate the model y3 = b0+ b1*x1 + b2*x2 + error using OLS with robust standard errors
reg3 = smf.ols(formula="y3 ~ x1 + x2", data=df).fit(cov_type='HC3')
print(reg3.summary())

Correlation table between all variables:
          y1        y2        y3        x1        x2        x3
y1  1.000000  0.723885  0.462326  0.291180  0.957684  0.353009
y2  0.723885  1.000000  0.530262  0.724498  0.537554  0.590823
y3  0.462326  0.530262  1.000000  0.833572  0.230993 -0.370743
x1  0.291180  0.724498  0.833572  1.000000  0.013460  0.000401
x2  0.957684  0.537554  0.230993  0.013460  1.000000  0.369034
x3  0.353009  0.590823 -0.370743  0.000401  0.369034  1.000000
                            OLS Regression Results                            
Dep. Variable:                     y3   R-squared:                       0.743
Model:                            OLS   Adj. R-squared:                  0.743
Method:                 Least Squares   F-statistic:                     6383.
Date:                Sun, 28 Apr 2024   Prob (F-statistic):               0.00
Time:                        16:34:56   Log-Likelihood:                -12423.
No. Observations:                4626   AIC:

### Question 7: What is the correlation between x2 and x3?
From the correlation table, we see the correlation between x2 and x3 is 0.369034, revealing a moderate positive correlation.

### Question 8: You're thinking about estimating the model: y3 = b0 + b1*x1 + b2*x2 + error; that is, you are omitting x3. Before estimating, do you think b2 will be biased? Why or why not? If yes, will b2 be an overestimate or an underestimate?
Yes, omitting x3 will make b2 biased. First, we can recognize that the correlation between x3 and x1 is negligible (0.000401), so b1 should be largely unaffected. Then, since x3 is positively correlated with x2 and also negatively correlated with y3, omitting x3 means part of the negative effect of x3 on y3 will be attributed to x2. This will make it seem like x2 has a smaller (more negative) effect on y3 than it actually does, so b2 will be an underestimate.

### Question 9: Estimate the model: y3 = b0 + b1*x1 + b2*x2 + error using OLS with robust standard errors. Was your hypothesis in Question 8 correct?
Regression with model y3 = b0 + b1*x1 + b2*x2 + error results in b0 = -21.1525, b1 = 0.2005, and b2 = 0.0538. The actual model of the world is y3 = 0.2x1 + 0.1x2 - 13x3 + error, where 0.1 is the matching coefficient for b2. Thus, since b2 = 0.0538 < 0.1, b2 is an underestimate, and our hypothesis is correct.
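Across all three parts, the direction of the bias follows one sign rule: the bias has the sign of (true coefficient on the omitted variable) times (association between the omitted and the included variable). The sketch below encodes that rule and applies it to the known coefficients and the correlations from the table. The function name `ovb_direction` is hypothetical, and using the simple correlation as the measure of association is an assumption; with more than one included regressor the partial association is what matters, which the simple correlation should approximate here because x1 is nearly uncorrelated with both x2 and x3.

```python
def ovb_direction(beta_omitted: float, association: float) -> str:
    """Hypothetical helper: direction of the bias from its sign rule."""
    product = beta_omitted * association
    if product > 0:
        return "overestimate"
    if product < 0:
        return "underestimate"
    return "no bias"


# (true coefficient on the omitted variable, correlation from the table)
cases = {
    "Part 1: b1, omitting x2": (4.5, 0.013460),
    "Part 2: b2, omitting x3": (13.0, 0.369034),
    "Part 3: b2, omitting x3": (-13.0, 0.369034),
}

results = {name: ovb_direction(beta, corr) for name, (beta, corr) in cases.items()}
for name, direction in results.items():
    print(f"{name}: {direction}")
```

Parts 2 and 3 share the same regressors and differ only in the sign of the coefficient on x3, which is why the bias in b2 flips from an overestimate to an underestimate.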