# Population Regression vs OLS Estimation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm  
import yfinance as yf

np.random.seed(10)

In [2]:
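# Use the yfinance API to download adjusted close prices.
# auto_adjust=False keeps the 'Adj Close' column (newer yfinance versions may adjust prices by default and drop it).
df = yf.download("SPY EEM OIL", start="2010-12-30", end="2021-07-31", auto_adjust=False)['Adj Close']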
df.dropna(inplace=True)
rets = df.pct_change().dropna()

[*********************100%***********************]  3 of 3 completed


# Theoretical Model

#### Assume we have the following model for a portfolio
$$r^p_t = \alpha + \boldsymbol{x}_t'\boldsymbol{\beta} + \epsilon_t $$
#### Simulate this "true" model for a sample of size $T$

In [3]:
T = rets.shape[0]
N = rets.shape[1]

eps_vol = .01
eps = pd.DataFrame(np.random.normal(loc=0, scale=eps_vol, size=T), index=rets.index, columns=['epsilon'])

betas = {'SPY':.5, 'OIL':.25, 'EEM':.25}
X = rets[list(betas.keys())]
alpha = .001

port = pd.DataFrame(alpha + X @ list(betas.values()) + eps['epsilon'], index=rets.index, columns=['port'])

## Estimate the Theoretical Model with OLS

$$r^p_t = a + \boldsymbol{x}_t'\boldsymbol{b} + e_t $$

In [4]:
mod = sm.OLS(port, sm.add_constant(X)).fit()
e = pd.DataFrame(mod.resid,columns=['e'])
mod.summary()

Dep. Variable:,port,R-squared:,0.587
Model:,OLS,Adj. R-squared:,0.587
Method:,Least Squares,F-statistic:,1223.0
Date:,"Fri, 06 Aug 2021",Prob (F-statistic):,0.0
Time:,19:02:05,Log-Likelihood:,8306.0
No. Observations:,2585,AIC:,-16600.0
Df Residuals:,2581,BIC:,-16580.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0009,0.000,4.684,0.000,0.001,0.001
SPY,0.4436,0.031,14.522,0.000,0.384,0.504
OIL,0.2661,0.009,29.326,0.000,0.248,0.284
EEM,0.2582,0.024,10.969,0.000,0.212,0.304

Omnibus:,0.191,Durbin-Watson:,2.044
Prob(Omnibus):,0.909,Jarque-Bera (JB):,0.22
Skew:,-0.019,Prob(JB):,0.896
Kurtosis:,2.976,Cond. No.,191.0


### Examine the correlations among the regressors, the target, the fit, and the residuals

In [5]:
data = pd.concat([X,port,eps,e],axis=1)
data['port_fit'] = mod.predict(sm.add_constant(X))
data.corr().style.format('{:.2%}')

,SPY,OIL,EEM,port,epsilon,e,port_fit
SPY,100.00%,34.65%,80.98%,64.61%,-3.99%,0.00%,84.32%
OIL,34.65%,100.00%,33.92%,58.77%,1.97%,0.00%,76.70%
EEM,80.98%,33.92%,100.00%,62.84%,-2.61%,0.00%,82.01%
port,64.61%,58.77%,62.84%,100.00%,63.06%,64.25%,76.62%
epsilon,-3.99%,1.97%,-2.61%,63.06%,100.00%,99.85%,-1.44%
e,0.00%,0.00%,0.00%,64.25%,99.85%,100.00%,0.00%
port_fit,84.32%,76.70%,82.01%,76.62%,-1.44%,0.00%,100.00%


## Main lessons

- The population residual is uncorrelated with the regressors in population, but in any given sample it will have at least small, non-zero correlations.
- The estimated model forces the sample residuals to have zero correlation with the regressors in-sample.
- Thus, $\epsilon$ and $e$ differ at least a little. (Above, they are 99.85% correlated.)
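
These lessons can be reproduced without market data. The following is a minimal sketch, assuming hypothetical Gaussian regressors in place of the SPY/OIL/EEM returns; the seed, sample size, and the helper `corr_with_columns` are illustrative and not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000

# Hypothetical regressors standing in for the asset returns
X = rng.normal(scale=0.01, size=(T, 3))
beta = np.array([0.5, 0.25, 0.25])
alpha = 0.001
eps = rng.normal(scale=0.01, size=T)      # true (population) error
y = alpha + X @ beta + eps

# OLS with an intercept
Xc = np.column_stack([np.ones(T), X])
coef, *_ = np.linalg.lstsq(Xc, y, rcond=None)
e = y - Xc @ coef                         # sample residual

def corr_with_columns(v, M):
    """Correlation of vector v with each column of M."""
    return np.array([np.corrcoef(v, M[:, j])[0, 1] for j in range(M.shape[1])])

corr_eps = corr_with_columns(eps, X)      # small but non-zero
corr_e = corr_with_columns(e, X)          # zero up to floating-point error
corr_eps_e = np.corrcoef(eps, e)[0, 1]    # close to, but below, one

print("corr(eps, X):", corr_eps)
print("corr(e, X):  ", corr_e)
print("corr(eps, e):", corr_eps_e)
```

The sample residual is orthogonal to every regressor by construction, while the true error is only approximately so.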

# Regression with Omitted Variables

### If we omit a regressor
Suppose we label one regressor as $z$, writing
$$\boldsymbol{x} = \begin{bmatrix}\boldsymbol{\check{x}}\\ z\end{bmatrix},\quad \boldsymbol{\beta} = \begin{bmatrix}\boldsymbol{\check{\beta}}\\ \beta^z\end{bmatrix}, \quad \boldsymbol{b} = \begin{bmatrix}\boldsymbol{\check{b}}\\ b^z\end{bmatrix}$$
If we omit $z$, then $\check{x}$ are the remaining regressors.

Then
$$r^p_t = \alpha + \boldsymbol{\check{x}}_t'\boldsymbol{\check{\beta}} + z_t\beta^z + \epsilon_t $$
$$= \alpha + \boldsymbol{\check{x}}_t'\boldsymbol{\check{\beta}} + \underbrace{z_t\beta^z + \epsilon_t}_{\upsilon_t}$$
$$= \alpha + \boldsymbol{\check{x}}_t'\boldsymbol{\check{\beta}} + \upsilon_t$$
where $\upsilon_t$ is the population error with $z$ omitted. Note that even though $\epsilon_t$ is uncorrelated with $\boldsymbol{x}$, $\upsilon_t$ is correlated with $\boldsymbol{\check{x}}$ whenever $z$ is correlated with $\boldsymbol{\check{x}}$ and $\beta^z \ne 0$, even in the population.
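
To make the source of that correlation explicit, here is a short worked step. It assumes, as in the model above, that $\epsilon_t$ is uncorrelated with both $\boldsymbol{\check{x}}_t$ and $z_t$ in the population:
$$\text{cov}(\boldsymbol{\check{x}}_t, \upsilon_t) = \text{cov}(\boldsymbol{\check{x}}_t,\; z_t\beta^z + \epsilon_t) = \beta^z\,\text{cov}(\boldsymbol{\check{x}}_t, z_t)$$
So the omitted-model error is correlated with the remaining regressors only through their covariance with $z$, scaled by $\beta^z$.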

If we go ahead and estimate $r^p$ omitting the regressor, OLS will return $\boldsymbol{\ddot{b}}$, a biased estimate of $\check{\beta}$. Note that this will not be the same as the subset of OLS estimates from the full model, $\check{b}$.

Thus,
$$r^p_t = \ddot{a} + \boldsymbol{\check{x}}_t'\boldsymbol{\ddot{b}} + \ddot{e}_t$$
By construction, OLS ensures $\ddot{e}$ is uncorrelated with $\check{x}$ even though $\upsilon_t$ is correlated with $\check{x}$.
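
The link between $\ddot{b}$ and $\check{b}$ is an exact in-sample identity: the short-regression coefficients equal the full-model coefficients plus $b^z$ times the coefficients from an auxiliary regression of $z$ on the retained regressors. The sketch below checks this on simulated data. The variables `x1`, `x2`, `z`, the helper `ols`, and all parameter values are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 5000

# Hypothetical data: z is correlated with the first retained regressor
x1 = rng.normal(size=T)
x2 = rng.normal(size=T)
z = 0.8 * x1 + 0.6 * rng.normal(size=T)
y = 0.001 + 0.5 * x1 + 0.25 * x2 + 0.25 * z + rng.normal(size=T)

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

ones = np.ones(T)
X_short = np.column_stack([ones, x1, x2])   # z omitted
X_full = np.column_stack([X_short, z])

b_full = ols(X_full, y)     # [a, b_check (2 entries), b_z]
b_short = ols(X_short, y)   # [a_ddot, b_ddot (2 entries)]
delta = ols(X_short, z)     # auxiliary regression of z on retained regressors

# Short coefficients implied by the full fit and the auxiliary regression
b_implied = b_full[:3] + delta * b_full[3]

print("b_short:  ", b_short)
print("b_implied:", b_implied)
```

With these assumed parameters, the slope on `x1` in the short regression lands near 0.5 + 0.25 × 0.8 = 0.7 rather than 0.5.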

## Investigate empirically
- Omit one of the true population variables.
- Re-estimate the regression, omitting the regressor.
- How do the estimates change?
- How do the sample residuals change?

In [6]:
list_omit = ['EEM']
Xomit = X.drop(columns=list_omit)
mod_omit = sm.OLS(port, sm.add_constant(Xomit)).fit()
e_omit = pd.DataFrame(mod_omit.resid,columns=['e omit'])
mod_omit.summary()

Dep. Variable:,port,R-squared:,0.568
Model:,OLS,Adj. R-squared:,0.568
Method:,Least Squares,F-statistic:,1697.0
Date:,"Fri, 06 Aug 2021",Prob (F-statistic):,0.0
Time:,19:02:06,Log-Likelihood:,8247.1
No. Observations:,2585,AIC:,-16490.0
Df Residuals:,2582,BIC:,-16470.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0008,0.000,4.043,0.000,0.000,0.001
SPY,0.7065,0.019,36.460,0.000,0.669,0.744
OIL,0.2767,0.009,29.985,0.000,0.259,0.295

Omnibus:,0.314,Durbin-Watson:,2.05
Prob(Omnibus):,0.855,Jarque-Bera (JB):,0.3
Skew:,0.026,Prob(JB):,0.861
Kurtosis:,3.005,Cond. No.,101.0


## Differences between $\boldsymbol{\check{b}}$ and $\boldsymbol{\ddot{b}}$

Note that omitting the regressor substantially changes the OLS estimates of the remaining regressors to $\ddot{b}$. These are biased estimates of the full model betas, $\check{\beta}$, and thus they will not be close to $\check{b}$.

### Examine the correlation between the omitted model's population error and the included regressors.
$\text{corr}(\check{x},\upsilon)\ne 0$
and this is the source of the bias.

In [7]:
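# Population error of the omitted model: epsilon plus each omitted regressor times its beta
data['epsilon_omit'] = eps['epsilon'] + X[list_omit] @ [betas[k] for k in list_omit]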
data['e_omit'] = e_omit
data['port_fit_omit'] = mod_omit.predict(sm.add_constant(Xomit))

data.corr().style.format('{:.2%}')

,SPY,OIL,EEM,port,epsilon,e,port_fit,epsilon_omit,e_omit,port_fit_omit
SPY,100.00%,34.65%,80.98%,64.61%,-3.99%,0.00%,84.32%,23.72%,-0.00%,85.73%
OIL,34.65%,100.00%,33.92%,58.77%,1.97%,0.00%,76.70%,13.39%,-0.00%,77.99%
EEM,80.98%,33.92%,100.00%,62.84%,-2.61%,0.00%,82.01%,31.50%,12.31%,72.65%
port,64.61%,58.77%,62.84%,100.00%,63.06%,64.25%,76.62%,81.22%,65.74%,75.36%
epsilon,-3.99%,1.97%,-2.61%,63.06%,100.00%,99.85%,-1.44%,94.06%,97.75%,-1.59%
e,0.00%,0.00%,0.00%,64.25%,99.85%,100.00%,0.00%,94.80%,97.75%,0.00%
port_fit,84.32%,76.70%,82.01%,76.62%,-1.44%,0.00%,100.00%,26.50%,3.82%,98.35%
epsilon_omit,23.72%,13.39%,31.50%,81.22%,94.06%,94.80%,26.50%,100.00%,96.99%,23.18%
e_omit,-0.00%,-0.00%,12.31%,65.74%,97.75%,97.75%,3.82%,96.99%,100.00%,-0.00%
port_fit_omit,85.73%,77.99%,72.65%,75.36%,-1.59%,0.00%,98.35%,23.18%,-0.00%,100.00%


## Empirical observations

- Given that the remaining regressors, $\check{x}$, are highly correlated with the omitted variable, $z$, the regression's fit (R-squared) is nearly as high (0.568 versus 0.587).
- The beta of the omitted variable, $\beta^z$, is mostly absorbed by the OLS estimates on the remaining regressors, $\ddot{b}$.

### The residuals
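- The new sample residual, $\ddot{e}$, is forced to have zero correlation with the remaining regressors, but NOT with the omitted one (12.31% with EEM above).
- The population error of the omitted model, $\upsilon$, has substantial correlation with the remaining regressors (23.72% with SPY and 13.39% with OIL above).
- The sample fit of the omitted model therefore shows no correlation between $\check{x}$ and $\ddot{e}$, even though $\check{x}$ and $\upsilon$ are correlated.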

# Check for Bias

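The OLS formula tells us that, in any given sample, the estimator differs from the actual beta by:
$$\boldsymbol{b} - \boldsymbol{\beta} = (X'X)^{-1}X'\epsilon$$
The bias is the expected value of this gap. The gap itself is simply the coefficient vector from regressing the true epsilon on the regressors. For the omitted-variable model, the relevant regressors are $\check{x}$ and the relevant error is $\upsilon$.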
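
This formula can be checked directly on simulated data. The sketch below assumes hypothetical Gaussian regressors and no intercept; the sample size and number of repetitions are arbitrary. The gap between the OLS estimate and the true beta equals $(X'X)^{-1}X'\epsilon$ exactly in each sample, and it averages out to roughly zero across repeated draws when $\epsilon$ is independent of $X$.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000

X = rng.normal(size=(T, 2))          # hypothetical regressors
beta = np.array([0.5, 0.25])         # true coefficients
eps = rng.normal(scale=0.5, size=T)  # true error, independent of X
y = X @ beta + eps

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                # OLS estimate
gap = XtX_inv @ X.T @ eps            # regression of the true error on X

print("b - beta:", b - beta)
print("gap:     ", gap)

# Repeat with fresh errors: the average gap approximates the bias
gaps = np.array([XtX_inv @ X.T @ rng.normal(scale=0.5, size=T)
                 for _ in range(500)])
mean_gap = gaps.mean(axis=0)
print("mean gap over 500 draws:", mean_gap)
```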

### In real applications, we will *never* know the true epsilon, so we can't check this. But here, we have started with a known (simulated) model, so we can investigate the epsilon.

### Bias for the omitted variable case:

In [8]:
mod_epsomit = sm.OLS(data['epsilon_omit'],Xomit).fit()
mod_epsomit.summary()

Dep. Variable:,epsilon_omit,R-squared (uncentered):,0.059
Model:,OLS,Adj. R-squared (uncentered):,0.058
Method:,Least Squares,F-statistic:,80.94
Date:,"Fri, 06 Aug 2021",Prob (F-statistic):,8.01e-35
Time:,19:02:06,Log-Likelihood:,8246.5
No. Observations:,2585,AIC:,-16490.0
Df Residuals:,2583,BIC:,-16480.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
SPY,0.2053,0.019,10.613,0.000,0.167,0.243
OIL,0.0269,0.009,2.916,0.004,0.009,0.045

Omnibus:,0.314,Durbin-Watson:,2.05
Prob(Omnibus):,0.855,Jarque-Bera (JB):,0.3
Skew:,0.026,Prob(JB):,0.86
Kurtosis:,3.004,Cond. No.,2.31


## The bias is seen in the estimated coefficients.

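Sure enough, we see large t-stats (this bias is real, not just noise), and the coefficients almost perfectly explain the difference between $\ddot{b}$ and $\check{\beta}$: 0.7065 - 0.5 = 0.2065 for SPY and 0.2767 - 0.25 = 0.0267 for OIL, versus 0.2053 and 0.0269 here.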


### Let's double-check that there is no bias in the full model.

In [9]:
mod_eps = sm.OLS(data['epsilon'],Xomit).fit()
mod_eps.summary()

Dep. Variable:,epsilon,R-squared (uncentered):,0.003
Model:,OLS,Adj. R-squared (uncentered):,0.002
Method:,Least Squares,F-statistic:,3.815
Date:,"Fri, 06 Aug 2021",Prob (F-statistic):,0.0222
Time:,19:02:06,Log-Likelihood:,8305.8
No. Observations:,2585,AIC:,-16610.0
Df Residuals:,2583,BIC:,-16600.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
SPY,-0.0487,0.019,-2.573,0.010,-0.086,-0.012
OIL,0.0165,0.009,1.831,0.067,-0.001,0.034

Omnibus:,0.168,Durbin-Watson:,2.044
Prob(Omnibus):,0.919,Jarque-Bera (JB):,0.196
Skew:,-0.018,Prob(JB):,0.907
Kurtosis:,2.976,Cond. No.,2.31


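### As expected, we see little sign of bias.

The OLS estimates here are not exactly zero due to sample noise. The OIL coefficient (0.0165) is not statistically significant, and its confidence interval includes 0. The SPY coefficient (-0.0487, t = -2.57) is significant at the 5% level in this particular sample, but it is roughly a quarter the size of the 0.2053 found in the omitted-variable case. Since $\epsilon$ was simulated independently of the regressors, it reflects sampling noise rather than bias. Note that this regression uses only SPY and OIL (`Xomit`), without a constant.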

# Do we **care** about omitted variable bias?

- Above, we have a *true* model where the portfolio depends on SPY, OIL, and EEM.

- But the omitted model on just SPY and OIL gets almost as high an R-squared.

In what situations would it matter to know that EEM is part of the linear model?

- This matters when predicting the effect of an exogenous change in one regressor. Then the causation matters.
- This doesn't matter when the regressors move randomly and a highly correlated proxy delivers most of the same statistical information.

In finance, we are usually in the second case--it doesn't matter.

In applied economics, public policy, etc. they are often in the first situation--they want to be able to attribute a causal effect so that if some intervention forces a change in $x$ they can be confident it will impact as expected.

Imagine that we somehow **force** a change in SPY. Would we expect the portfolio to increase by about 0.71 per unit, as indicated by the omitted variable regression? No. We would expect it to increase by 0.5, the true effect. The usual channel of SPY impacting EEM is broken due to the exogenous change in SPY.

But this is unrealistic: we will never have an exogenous change in SPY. Rather, we will observe changes in SPY, in which case we will still have the correlated impact via EEM. Thus, if we know SPY goes up by 1 and know nothing about EEM, we would expect a change of about 0.71 in the portfolio. If we know SPY goes up by 1 and we **also** know EEM does not change, then we would expect a change of 0.50 in the portfolio.

### As quants, we will rarely care about the causal impact, so omitted variable regressions will be fine, since for random variation they are optimal predictions.
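
The prediction claim can be illustrated with a small out-of-sample experiment. This is a sketch on simulated data, not the notebook's returns: `x` plays the role of the observed regressor, `z` the omitted one, and the function `simulate` and all coefficients are hypothetical. When only `x` is observed, the short-regression slope predicts better than the true causal slope of 0.5.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(T):
    """Draw (x, z, y) where z co-moves with x and y depends on both."""
    x = rng.normal(size=T)
    z = 0.8 * x + 0.6 * rng.normal(size=T)
    y = 0.5 * x + 0.25 * z + 0.5 * rng.normal(size=T)
    return x, z, y

x_tr, z_tr, y_tr = simulate(5000)   # estimation sample
x_te, z_te, y_te = simulate(5000)   # fresh sample for evaluation

# Short regression of y on x alone (no intercept; all means are zero here)
b_short = (x_tr @ y_tr) / (x_tr @ x_tr)

# Predict y from x only, using either the short slope or the causal slope
mse_short = np.mean((y_te - b_short * x_te) ** 2)
mse_causal = np.mean((y_te - 0.5 * x_te) ** 2)

print("short slope:", b_short)
print("MSE using short slope: ", mse_short)
print("MSE using causal slope:", mse_causal)
```

The short slope picks up the part of `z` that moves with `x`, which is exactly the information a forecaster wants when `z` is unobserved.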

# Omitted Variable Bias may be better than Multicollinearity

## Suppose we have a theoretical model that now depends on a few extra regressors:

In [17]:
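# Use the yfinance API to download adjusted close prices.
# auto_adjust=False keeps the 'Adj Close' column (newer yfinance versions may adjust prices by default and drop it).
df = yf.download("SPY EEM OIL HYG IXUS EFA QQQ VTV IVV VOO", start="2010-12-30", end="2021-07-31", auto_adjust=False)['Adj Close']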
df.dropna(inplace=True)

rets = df.pct_change().dropna()

[*********************100%***********************]  10 of 10 completed


In [18]:
T = rets.shape[0]
N = rets.shape[1]

eps_vol = .02
eps = pd.DataFrame(np.random.normal(loc=0, scale=eps_vol, size=T), index=rets.index, columns=['epsilon'])

betas = {'SPY':.2, 'OIL':.2, 'EEM':.1, 'HYG':.1, 'EFA':.1, 'IXUS':.1, 'QQQ':.1, 'VTV':.1}
X = rets[list(betas.keys())]
alpha = .001

port_multi = pd.DataFrame(alpha + X @ list(betas.values()) + eps['epsilon'], index=rets.index, columns=['port'])

In [19]:
mod = sm.OLS(port_multi, sm.add_constant(X)).fit()
e = pd.DataFrame(mod.resid,columns=['e'])
mod.summary()

Dep. Variable:,port,R-squared:,0.207
Model:,OLS,Adj. R-squared:,0.204
Method:,Least Squares,F-statistic:,71.45
Date:,"Fri, 06 Aug 2021",Prob (F-statistic):,9.66e-105
Time:,19:02:57,Log-Likelihood:,5528.2
No. Observations:,2204,AIC:,-11040.0
Df Residuals:,2195,BIC:,-10990.0
Df Model:,8,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0022,0.000,5.260,0.000,0.001,0.003
SPY,0.1165,0.395,0.295,0.768,-0.658,0.891
OIL,0.1836,0.020,9.186,0.000,0.144,0.223
EEM,0.1186,0.085,1.390,0.165,-0.049,0.286
HYG,0.2657,0.136,1.960,0.050,-8.21e-05,0.531
EFA,0.3266,0.200,1.634,0.102,-0.065,0.719
IXUS,-0.3576,0.255,-1.404,0.161,-0.857,0.142
QQQ,0.2063,0.153,1.351,0.177,-0.093,0.506
VTV,0.2074,0.245,0.845,0.398,-0.274,0.689

Omnibus:,0.991,Durbin-Watson:,2.002
Prob(Omnibus):,0.609,Jarque-Bera (JB):,0.914
Skew:,-0.015,Prob(JB):,0.633
Kurtosis:,3.095,Cond. No.,1140.0


In [21]:
list_omit = ['HYG','EEM','EFA','IXUS','QQQ','VTV']
Xomit = X.drop(columns=list_omit)
mod_omit = sm.OLS(port_multi, sm.add_constant(Xomit)).fit()
e_omit = pd.DataFrame(mod_omit.resid,columns=['e omit'])
mod_omit.summary()

Dep. Variable:,port,R-squared:,0.203
Model:,OLS,Adj. R-squared:,0.202
Method:,Least Squares,F-statistic:,280.0
Date:,"Fri, 06 Aug 2021",Prob (F-statistic):,4.71e-109
Time:,19:03:08,Log-Likelihood:,5523.0
No. Observations:,2204,AIC:,-11040.0
Df Residuals:,2201,BIC:,-11020.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0022,0.000,5.224,0.000,0.001,0.003
SPY,0.7287,0.043,16.920,0.000,0.644,0.813
OIL,0.1874,0.019,9.666,0.000,0.149,0.225

Omnibus:,0.504,Durbin-Watson:,2.001
Prob(Omnibus):,0.777,Jarque-Bera (JB):,0.423
Skew:,-0.003,Prob(JB):,0.809
Kurtosis:,3.068,Cond. No.,104.0


In [22]:
print(mod.condition_number)
print(mod_omit.condition_number)

1141.554269533006
103.8101893094178
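
The jump in condition number goes hand in hand with the much larger standard errors in the full model. The sketch below reproduces that pattern on simulated data, assuming one hypothetical regressor `x` and a near-copy `x_twin`; the helper `fit` and all scales are illustrative. It uses `np.linalg.cond`, the ratio of the largest to smallest singular value of the design matrix, which is assumed here to be comparable to the `condition_number` that statsmodels reports.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 2000

x = rng.normal(scale=0.01, size=T)
x_twin = x + rng.normal(scale=0.001, size=T)   # nearly a copy of x
y = 0.001 + 0.5 * x + 0.5 * x_twin + rng.normal(scale=0.02, size=T)

def fit(X, y):
    """OLS coefficients and classical (nonrobust) standard errors."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, se

ones = np.ones(T)
X_full = np.column_stack([ones, x, x_twin])    # both collinear regressors
X_short = np.column_stack([ones, x])           # twin omitted

coef_full, se_full = fit(X_full, y)
coef_short, se_short = fit(X_short, y)
cond_full = np.linalg.cond(X_full)
cond_short = np.linalg.cond(X_short)

print("full:  coef", coef_full, "se", se_full, "cond", cond_full)
print("short: coef", coef_short, "se", se_short, "cond", cond_short)
```

The short model's slope is biased for the individual beta but tightly estimated, while the full model's slopes are unbiased but very noisy.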


## Another Example

Suppose we have a portfolio of SPY, IVV, and VOO.

Is it helpful to correctly specify the model, or to omit IVV and VOO?

In [24]:
T = rets.shape[0]
N = rets.shape[1]

eps_vol = .02
eps = pd.DataFrame(np.random.normal(loc=0, scale=eps_vol, size=T), index=rets.index, columns=['epsilon'])

betas = {'SPY':.4, 'IVV':.3, 'VOO':.3}
X = rets[list(betas.keys())]
alpha = .001

port_multi = pd.DataFrame(alpha + X @ list(betas.values()) + eps['epsilon'], index=rets.index, columns=['port'])

In [25]:
mod = sm.OLS(port_multi, sm.add_constant(X)).fit()
e = pd.DataFrame(mod.resid,columns=['e'])
mod.summary()

Dep. Variable:,port,R-squared:,0.228
Model:,OLS,Adj. R-squared:,0.227
Method:,Least Squares,F-statistic:,216.9
Date:,"Fri, 06 Aug 2021",Prob (F-statistic):,2.80e-123
Time:,19:03:37,Log-Likelihood:,5522.3
No. Observations:,2204,AIC:,-11040.0
Df Residuals:,2200,BIC:,-11010.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0014,0.000,3.312,0.001,0.001,0.002
SPY,0.8020,0.897,0.894,0.371,-0.956,2.560
IVV,0.8572,1.047,0.819,0.413,-1.195,2.910
VOO,-0.6325,0.972,-0.651,0.515,-2.538,1.273

Omnibus:,5.108,Durbin-Watson:,2.054
Prob(Omnibus):,0.078,Jarque-Bera (JB):,5.16
Skew:,-0.116,Prob(JB):,0.0758
Kurtosis:,2.954,Cond. No.,3080.0


In [28]:
list_omit = ['IVV','VOO']
Xomit = X.drop(columns=list_omit)
mod_omit = sm.OLS(port_multi, sm.add_constant(Xomit)).fit()
e_omit = pd.DataFrame(mod_omit.resid,columns=['e omit'])
mod_omit.summary()

Dep. Variable:,port,R-squared:,0.228
Model:,OLS,Adj. R-squared:,0.228
Method:,Least Squares,F-statistic:,650.5
Date:,"Fri, 06 Aug 2021",Prob (F-statistic):,6.11e-126
Time:,19:03:44,Log-Likelihood:,5521.9
No. Observations:,2204,AIC:,-11040.0
Df Residuals:,2202,BIC:,-11030.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,0.0014,0.000,3.307,0.001,0.001,0.002
SPY,1.0308,0.040,25.505,0.000,0.952,1.110

Omnibus:,5.048,Durbin-Watson:,2.053
Prob(Omnibus):,0.08,Jarque-Bera (JB):,5.097
Skew:,-0.116,Prob(JB):,0.0782
Kurtosis:,2.957,Cond. No.,96.0


In [29]:
print(mod.condition_number)
print(mod_omit.condition_number)

3083.0437778741248
95.99938727690872
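
In the full SPY/IVV/VOO fit, the three slopes are individually very imprecise, yet they sum to about 1.03 (0.8020 + 0.8572 - 0.6325), close to the single SPY slope of 1.0308. With near-duplicate regressors, the data pin down the sum of the betas well even when the split among them is poorly identified. The sketch below shows this on simulated data, assuming three hypothetical funds that track one index with small tracking noise; none of the numbers come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 2000

base = rng.normal(scale=0.01, size=T)
# Three hypothetical funds tracking the same index with tiny tracking noise
funds = np.column_stack([base + rng.normal(scale=0.0003, size=T)
                         for _ in range(3)])
beta = np.array([0.4, 0.3, 0.3])
y = 0.001 + funds @ beta + rng.normal(scale=0.02, size=T)

X = np.column_stack([np.ones(T), funds])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
sigma2 = resid @ resid / (T - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)       # classical coefficient covariance

w = np.array([0.0, 1.0, 1.0, 1.0])          # selects the sum of the three slopes
sum_coef = w @ coef
se_sum = np.sqrt(w @ cov @ w)               # standard error of the sum
se_each = np.sqrt(np.diag(cov))[1:]         # standard errors of each slope

print("slopes:", coef[1:], "se:", se_each)
print("sum of slopes:", sum_coef, "se:", se_sum)
```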


In [58]:
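# Start from a copy so the in-place += below cannot modify eps['epsilon'] through a shared buffer.
eps_omit = eps['epsilon'].values.copy()
for i in list_omit:
    eps_omit += X[i].values * betas[i]
eps_omit = pd.DataFrame(eps_omit,index=eps.index,columns=['epsilon_omit'])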

In [61]:
pd.concat([X['SPY'],port_multi,eps,e,eps_omit,e_omit],axis=1).corr().style.format('{:.2%}')

,SPY,port,epsilon,e,epsilon_omit,e omit
SPY,100.00%,47.75%,79.11%,0.00%,79.11%,-0.00%
port,47.75%,100.00%,91.47%,87.85%,91.47%,87.86%
epsilon,79.11%,91.47%,100.00%,61.09%,100.00%,61.11%
e,0.00%,87.85%,61.09%,100.00%,61.09%,99.98%
epsilon_omit,79.11%,91.47%,100.00%,61.09%,100.00%,61.11%
e omit,-0.00%,87.86%,61.11%,99.98%,61.11%,100.00%
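
A caution on reading this table: `epsilon` and `epsilon_omit` are 100.00% correlated, and `epsilon` is 79.11% correlated with SPY. That is what one would see if an in-place `+=` on an array taken from `.values` also changed the original `epsilon` column. Whether `.values` shares memory with the DataFrame depends on the pandas version and dtype, so this is a possible explanation rather than a confirmed diagnosis. A numpy-only sketch of the aliasing and of the copy that avoids it, with hypothetical numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
snapshot = np.array([0.01, -0.02, 0.03])   # reference values for comparison

# Case 1: accumulate in place on a view that shares memory with the original
original = snapshot.copy()
shared = original.view()
shared += 0.3 * x
aliased_changed = not np.allclose(original, snapshot)   # original was modified

# Case 2: accumulate in place on an explicit copy
original2 = snapshot.copy()
safe = original2.copy()
safe += 0.3 * x
copy_preserved = np.allclose(original2, snapshot)       # original is intact

print("view shares memory:", np.shares_memory(original, shared))
print("original changed through the view:", aliased_changed)
print("original preserved with a copy:", copy_preserved)
```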
