In [3]:
import pandas as pd

import statsmodels.formula.api as smf
import statsmodels.api as sm
import statsmodels.stats.api as sms
import linearmodels.iv.model as lm

from scipy import stats

from IPython.display import display, HTML
display(HTML("<style>.container {width:85%;}</style>"))

# Endogeneity
<p style="font-size:16px">
    Endogeneity arises when an independent variable (predictor) is correlated with the error term of a regression model. The OLS coefficient estimates are then biased and inconsistent.
</p>

| OLS Model Assumption                                                                                                                                                              | Implication of Violation                                                                                                       | Graphical Test                 | Test                   |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|-------------------------------|------------------------|
| <b>All independent variables are uncorrelated with the error term:</b><br>If an independent variable is correlated with the error term, the independent variable can be used to predict the error term, which contradicts the idea that the error term represents unpredictable random error.<br><b>This assumption is also referred to as exogeneity.</b> When this type of correlation exists, there is <b>endogeneity.</b><br> | Violating this assumption biases the coefficient estimate and makes it inconsistent.<br>When an independent variable correlates with the error term, OLS incorrectly attributes to the independent variable some of the variation in the dependent variable that is actually due to the error term. | No reliable graphical check: OLS residuals are uncorrelated with the regressors by construction, so residual plots cannot reveal endogeneity | Wu-Hausman (Durbin-Wu-Hausman) test, Wooldridge regression test (both require instruments) |
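
The consequence described in the table can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the Mroz data used later; the names `x_exog`, `x_endog`, `ols_slope` and all coefficient values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
true_beta = 1.0

u = rng.normal(size=n)                    # structural error term
x_exog = rng.normal(size=n)               # independent of u
x_endog = rng.normal(size=n) + 0.5 * u    # correlated with u by construction


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


slope_exog = ols_slope(x_exog, true_beta * x_exog + u)
slope_endog = ols_slope(x_endog, true_beta * x_endog + u)

# Probability limit under endogeneity: beta + cov(x, u) / var(x)
plim_endog = true_beta + 0.5 / (1.0 + 0.25)

print(f"exogenous x : {slope_exog:.3f}")
print(f"endogenous x: {slope_endog:.3f} (plim {plim_endog:.3f})")
```

With an exogenous regressor the slope settles near the true value of 1, while the endogenous regressor converges to a different number no matter how large the sample is.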


## Causes of endogeneity
| Cause of Endogeneity              | Description                                                                                                                                                                               |
|-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Simultaneity                      | The dependent and independent variables are determined jointly, so each one affects the other and the direction of causality cannot be separated. |
| Omitted Variables                 | Relevant variables that are left out of the model and are correlated with an included regressor confound the relationship between the variables of interest. For example, not accounting for inflation when analyzing the impact of government spending on GDP. |
| Measurement Errors                | Errors in the observed values of the variables can introduce bias in the estimates. For instance, self-reported income in a survey may be misreported, which distorts the estimated relationship between income and health outcomes. |
| Reverse Causality                 | The causal relationship between the dependent and independent variables runs in the opposite direction from the one assumed. For example, high crime rates might lead to more police presence, rather than police presence driving crime. |
| Endogenous Grouping               | Endogeneity can arise when individuals or subjects are grouped based on an endogenous characteristic, leading to biased estimates. For instance, studying the effect of education on income using data from a specific university. |
| Sample Selection Bias             | Non-random sample selection can introduce bias in the estimated relationships. For example, employees who voluntarily participate in a training program may not represent the overall workforce, so their job performance is a biased measure of the program's effect. |
| Spurious Correlation              | When two variables appear to be related but the correlation is due to chance or a common underlying factor, the estimated relationship may be spurious. For example, ice cream sales and drowning incidents are correlated because both rise in hot summer weather. |
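
The omitted-variables row can be made concrete with a simulation. The sketch below uses synthetic data; the variable names (`ability`, `educ`, `wage`) and the coefficients are hypothetical and were chosen only so that the textbook bias formula, coefficient on the omitted variable times cov(x, omitted) / var(x), can be checked numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

ability = rng.normal(size=n)                   # omitted confounder (hypothetical)
educ = 0.8 * ability + rng.normal(size=n)      # regressor correlated with it
wage = 1.0 * educ + 2.0 * ability + rng.normal(size=n)


def ols(y, *columns):
    """OLS coefficients of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.lstsq(X, y, rcond=None)[0]


short_coef = ols(wage, educ)[1]            # ability left out
long_coef = ols(wage, educ, ability)[1]    # ability included

# Omitted-variable bias: beta_ability * cov(educ, ability) / var(educ)
expected_bias = 2.0 * 0.8 / (0.8**2 + 1.0)

print(f"with ability   : {long_coef:.3f}")
print(f"without ability: {short_coef:.3f} (expected {1.0 + expected_bias:.3f})")
```

Including the confounder recovers the true coefficient of 1; leaving it out shifts the estimate by approximately the amount the formula predicts.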


## Data

In [5]:
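# Columns used in the models below
select_columns = ['lwage', 'exper', 'expersq', 'educ',  'age', 'kidslt6', 'kidsge6', 'motheduc', 'fatheduc', 'huseduc']
df = (
    pd.read_stata('http://fmwww.bc.edu/ec-p/data/wooldridge/mroz.dta')
    .filter(select_columns)
    .dropna()  # removes rows with a missing value, e.g. women with no observed wage
    # Bins are right-closed, so educ == 5 would fall outside the first bin (NaN)
    .assign(educgr = lambda X: pd.cut(X["educ"], bins = [5,11,13, 18], labels=('Diploma','Degree','Masters'), ordered=True))
)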

## Model

In [6]:
olsModel = smf.ols(formula = 'lwage ~ exper + expersq + educ', data=df).fit()
print(olsModel.summary())

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.157
Model:                            OLS   Adj. R-squared:                  0.151
Method:                 Least Squares   F-statistic:                     26.29
Date:                Sat, 29 Jul 2023   Prob (F-statistic):           1.30e-15
Time:                        16:31:41   Log-Likelihood:                -431.60
No. Observations:                 428   AIC:                             871.2
Df Residuals:                     424   BIC:                             887.4
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.5220      0.199     -2.628      0.0

## Detecting endogeneity
| Approach                            | Description                                                                                                                                                                                             |
|-------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Theoretical Understanding           | Examine the theoretical framework and relationships between variables to identify potential issues of reverse causality, omitted variables, or simultaneous determination.                              |
| Scatter Plots and Visual Inspection | Plot the dependent variable against each independent variable to spot nonlinear associations or outliers that point to misspecification. A plot cannot show correlation with the unobserved error term, so it can only prompt further investigation. |
| Correlation Analysis                | The error term is unobserved, and OLS residuals are uncorrelated with the regressors by construction, so this correlation cannot be computed directly. Correlations between a regressor and candidate omitted variables or instruments can still flag likely endogeneity. |
| Durbin-Wu-Hausman Test              | Compare estimates from a standard OLS regression and an instrumental variables (IV) regression; a statistically significant difference between the two indicates endogeneity. Requires valid instruments. |
| Hausman Test                        | The general form of the same idea: compare an estimator that is efficient under exogeneity (OLS) with one that remains consistent under endogeneity (IV); rejecting the null hypothesis of exogeneity suggests endogeneity. |
| Residual Analysis                   | Examine the residuals from the OLS regression for patterns that indicate model misspecification, such as an omitted nonlinear term. Heteroscedasticity by itself does not imply endogeneity. |
| Qualitative Information             | Gather insights from experts or stakeholders to identify potential sources of endogeneity that might not be apparent from the data alone.                                                               |
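
One caveat about correlation and residual checks deserves a short demonstration: OLS forces its residuals to be orthogonal to the regressors, so the residuals carry no trace of endogeneity. The following is a sketch on synthetic data, where the true error `u` is known only because it is simulated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

u = rng.normal(size=n)                 # true error, unobservable in real data
x = rng.normal(size=n) + 0.5 * u       # endogenous by construction
y = 1.0 * x + u

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

corr_with_error = np.corrcoef(x, u)[0, 1]       # visible only in a simulation
corr_with_resid = np.corrcoef(x, resid)[0, 1]   # what real data can show

print(f"corr(x, true error)   : {corr_with_error:.3f}")
print(f"corr(x, OLS residuals): {corr_with_resid:.1e}")
```

The regressor is clearly correlated with the true error, yet its correlation with the residuals is zero up to floating-point error. This is why the formal tests below rely on instruments rather than on residual diagnostics.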


In [16]:
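dependent = "lwage"
exog = ["exper", "expersq"]                    # exogenous regressors
endog = ["educ"]                               # regressor treated as endogenous
instrs = ["motheduc", "fatheduc", "huseduc"]   # excluded instruments


# Adds a "const" column. It is not listed in exog, so the model below is
# estimated without an intercept, unlike the OLS model above.
data = sm.add_constant(df, prepend=False)

ivModel = lm.IV2SLS(
    dependent=data[dependent],
    exog=data[exog],
    endog=data[endog],
    instruments=data[instrs]
)

# Fit by two-stage least squares; ivModel now refers to the fitted results
ivModel = ivModel.fit(cov_type="homoskedastic", debiased=True)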

In [17]:
print(ivModel.summary)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                  lwage   R-squared:                      0.7688
Estimator:                    IV-2SLS   Adj. R-squared:                 0.7671
No. Observations:                 428   F-statistic:                    458.39
Date:                Sat, Jul 29 2023   P-value (F-stat)                0.0000
Time:                        16:41:57   Distribution:                 F(3,425)
Cov. Estimator:         homoskedastic                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exper          0.0397     0.0133     2.9854     0.0030      0.0136      0.0659
expersq       -0.0008     0.0004    -1.9611     0.05

### wooldridge_regression
| Test                          | Description |
|-------------------------------|-------------------------------------------------------------------------------------------------------------------------------|
| <b>wooldridge_regression</b>  | Regression-based test of exogeneity proposed by Wooldridge (1995). Each endogenous variable is first regressed on the exogenous variables and the instruments. The residuals from these first-stage regressions are then added as extra regressors to the original equation, and a Wald test checks whether their coefficients are jointly zero. Because the test uses the covariance estimator chosen for the model, it can be made robust to heteroskedasticity. The results include the test statistic and the p-value for the Wald test. |
| Decision | Reject H0 (all endogenous variables are exogenous) if the p-value is less than the significance level (5%). |
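
The mechanics of a regression-based exogeneity test can be sketched by hand. The example below runs on synthetic data with one instrument and classical standard errors; it is not a reproduction of the linearmodels implementation, and the helper `ols` and all coefficients are assumptions made for illustration.

```python
import numpy as np


def ols(y, X):
    """Return coefficients, classical standard errors, and residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se, resid


rng = np.random.default_rng(3)
n = 5_000
z = rng.normal(size=n)                # instrument
v = rng.normal(size=n)                # first-stage error
x = 0.8 * z + v                       # endogenous regressor
u = 0.6 * v + rng.normal(size=n)      # structural error shares v with x
y = 1.0 + 0.5 * x + u

const = np.ones(n)
# First stage: endogenous variable on the instrument, keep the residuals
_, _, v_hat = ols(x, np.column_stack([const, z]))
# Augmented regression: original equation plus the first-stage residuals
beta, se, _ = ols(y, np.column_stack([const, x, v_hat]))

coef_x, coef_vhat = beta[1], beta[2]
t_stat = coef_vhat / se[2]            # H0: coefficient on v_hat is zero

print(f"coefficient on x      : {coef_x:.3f}")
print(f"coefficient on v_hat  : {coef_vhat:.3f} (t = {t_stat:.1f})")
```

A coefficient on the first-stage residuals that differs significantly from zero is evidence against exogeneity. As a side effect, the coefficient on `x` in the augmented regression equals the 2SLS estimate.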


In [21]:
ivModel.wooldridge_regression

Wooldridge's regression test of exogeneity
H0: Endogenous variables are exogenous
Statistic: 4.9118
P-value: 0.0267
Distributed: chi2(1)
WaldTestStatistic, id: 0x7fe48852f460

### Hausman test
| Test          | Description|
|---------------|------------|
| Hausman Test  | The Hausman test assesses whether a regressor is endogenous. It helps to decide between the ordinary least squares (OLS) estimator, which is efficient when all regressors are exogenous, and the instrumental variables (IV) estimator, which remains consistent in the presence of endogeneity.|
|               |The test statistic is built from the difference between the OLS and IV coefficient estimates, scaled by the difference between their variance-covariance matrices. |
|Hypothesis    |The null hypothesis is that all endogenous variables are exogenous.|
| Decision      | If the test statistic is greater than the critical value of its reference distribution (chi-squared for the classical Hausman statistic, F for the Wu-Hausman variant reported below), the null hypothesis of exogeneity is rejected, indicating that the IV estimator is preferred.|
|               |Reject H0 if the p-value is less than the significance level (5%). |
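
For a single endogenous regressor, the comparison of the two estimators reduces to a scalar formula. The sketch below computes a classical chi-squared Hausman contrast on synthetic data, using the IV residual variance for both variance estimates (one of several possible choices). It is an illustration only: `wu_hausman()` reports an F-form statistic, so the numbers are not comparable.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
z = rng.normal(size=n)                # instrument
v = rng.normal(size=n)
x = 0.8 * z + v                       # endogenous regressor
u = 0.6 * v + rng.normal(size=n)      # error correlated with x through v
y = 1.0 + 0.5 * x + u

# Demean so the intercept drops out of the slope formulas
xd, yd, zd = x - x.mean(), y - y.mean(), z - z.mean()

b_ols = (xd @ yd) / (xd @ xd)
b_iv = (zd @ yd) / (zd @ xd)

# Error variance estimated from the IV residuals (consistent under H0 and H1)
resid_iv = yd - b_iv * xd
sigma2 = resid_iv @ resid_iv / (n - 2)

var_ols = sigma2 / (xd @ xd)
var_iv = sigma2 * (zd @ zd) / (zd @ xd) ** 2

hausman = (b_iv - b_ols) ** 2 / (var_iv - var_ols)
CHI2_1_CRIT_5PCT = 3.841              # 5% critical value of chi2(1)
reject_exogeneity = hausman > CHI2_1_CRIT_5PCT

print(f"OLS {b_ols:.3f}, IV {b_iv:.3f}, Hausman {hausman:.1f}")
```

Because the same variance estimate is used for both estimators, the denominator is non-negative by the Cauchy-Schwarz inequality, which avoids the negative statistics that can otherwise occur in finite samples.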

In [25]:
ivModel.wu_hausman()

Wu-Hausman test of exogeneity
H0: All endogenous variables are exogenous
Statistic: 4.9144
P-value: 0.0272
Distributed: F(1,424)
WaldTestStatistic, id: 0x7fe4b1fa7100

## Determining valid Instruments

### Requirements for valid Instruments
| Requirement           | Description                                                                                     | Check in Data                |
|-----------------------|-------------------------------------------------------------------------------------------------|------------------------------|
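| Relevance             | The instrument should be correlated with the endogenous variable, after controlling for the exogenous regressors. | Correlation; first-stage F-statistic on the instruments |
| Exogeneity           | The instrument should be uncorrelated with the error term in the main regression equation.            | Overidentification Test |
|                       |                                                                                                 |Exogeneity cannot be tested directly, because the error term is unobserved and the residuals of the first-stage regression (endogenous variable regressed on the instruments) are uncorrelated with the instruments by construction. When there are more instruments than endogenous variables, you can regress the 2SLS residuals on all instruments (Sargan test). If the fit is low or statistically insignificant, the result is consistent with exogenous instruments.                |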
| Exclusion Restrictions| The instrument should only affect the dependent variable through the endogenous variable.     | Theoretical Understanding    |
| Sufficient Variation  | The instrument should have enough variability in the sample.                                 | Visual Inspection or Summary Statistics |
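
The relevance requirement is usually checked with the first-stage F-statistic on the instruments. Below is a minimal sketch on synthetic data with no additional exogenous regressors, so the restricted model is intercept-only; the helper name `first_stage_f` and the coefficients are assumptions. A common rule of thumb treats a first-stage F below about 10 as a sign of weak instruments.

```python
import numpy as np


def first_stage_f(x, instruments):
    """F-statistic for H0: all instrument coefficients are zero in the first stage."""
    n = len(x)
    Z = np.column_stack([np.ones(n), instruments])
    resid_u = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    rss_u = resid_u @ resid_u                # unrestricted: intercept + instruments
    rss_r = np.sum((x - x.mean()) ** 2)      # restricted: intercept only
    q = Z.shape[1] - 1                       # number of instruments tested
    return ((rss_r - rss_u) / q) / (rss_u / (n - Z.shape[1]))


rng = np.random.default_rng(5)
n = 2_000
Z = rng.normal(size=(n, 2))                  # two candidate instruments
v = rng.normal(size=n)
x_strong = Z @ np.array([0.5, 0.5]) + v      # instruments explain x well
x_weak = Z @ np.array([0.01, 0.01]) + v      # instruments barely move x

f_strong = first_stage_f(x_strong, Z)
f_weak = first_stage_f(x_weak, Z)

print(f"first-stage F, strong instruments: {f_strong:.1f}")
print(f"first-stage F, weak instruments  : {f_weak:.1f}")
```

When the model also contains exogenous regressors, they belong in both the restricted and the unrestricted first-stage regressions, so that the F-statistic tests only the excluded instruments.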


### Sargan-Hansen (Sargan) test for overidentification
<p style="font-size:16px">
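    Overidentification refers to a situation where there are more instrumental variables than endogenous variables, that is, more instruments than are strictly needed to identify the parameters. The surplus instruments make it possible to test instrument validity.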
</p>

| Test        | Description |
|-------------|-------------|
| Sargan test | The Sargan test checks whether the overidentifying restrictions hold, that is, whether the instruments are jointly uncorrelated with the error term of the main equation. The 2SLS residuals are regressed on all exogenous variables and instruments. The test statistic is the number of observations times the R-squared of that regression; under the null hypothesis it follows a chi-squared distribution with degrees of freedom equal to the number of instruments minus the number of endogenous variables (3 - 1 = 2 here). |
| Hypothesis  | The null hypothesis of the Sargan test is that the instruments are valid, meaning that they are uncorrelated with the error term of the main equation.  |
| Decision    | If the p-value associated with the test is greater than the chosen significance level (e.g., 0.05), you fail to reject the null hypothesis: the data provide no evidence against instrument validity. This does not prove validity, because the test assumes that at least as many instruments as endogenous variables are valid. |
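
The Sargan statistic can be computed by hand as n times the R-squared from regressing the 2SLS residuals on the instrument set. The sketch below uses synthetic data with one endogenous regressor and three instruments (so two degrees of freedom, as in the output that follows); the function `sargan_statistic` and the data-generating values are assumptions for illustration.

```python
import numpy as np


def sargan_statistic(y, x, instruments):
    """n * R^2 from regressing 2SLS residuals on the full instrument set."""
    n = len(y)
    const = np.ones(n)
    X = np.column_stack([const, x])               # structural regressors
    Z = np.column_stack([const, instruments])     # intercept + excluded instruments
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]    # first-stage fitted values
    beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]     # 2SLS coefficients
    resid = y - X @ beta                                # residuals use the actual x
    fitted = Z @ np.linalg.lstsq(Z, resid, rcond=None)[0]
    r2 = (fitted @ fitted) / (resid @ resid)            # residuals have mean zero
    return n * r2


rng = np.random.default_rng(6)
n = 5_000
Z = rng.normal(size=(n, 3))
v = rng.normal(size=n)
x = Z @ np.array([0.5, 0.5, 0.5]) + v
u = 0.6 * v + rng.normal(size=n)

y_valid = 1.0 + 0.5 * x + u               # all instruments excluded from y
y_invalid = y_valid + 0.5 * Z[:, 2]       # third instrument affects y directly

stat_valid = sargan_statistic(y_valid, x, Z)
stat_invalid = sargan_statistic(y_invalid, x, Z)
CHI2_2_CRIT_5PCT = 5.991                  # 5% critical value of chi2(2)

print(f"valid instruments  : {stat_valid:.2f}")
print(f"invalid instrument : {stat_invalid:.2f} (critical value {CHI2_2_CRIT_5PCT})")
```

When one instrument enters the outcome equation directly, the statistic far exceeds the critical value. With valid instruments the statistic is a draw from a chi-squared distribution with two degrees of freedom, so it will still exceed the critical value in about 5% of samples.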

In [22]:
ivModel.sargan

Sargan's test of overidentification
H0: The model is not overidentified.
Statistic: 1.1711
P-value: 0.5568
Distributed: chi2(2)
WaldTestStatistic, id: 0x7fe4886e4a90