# Sample Selection in regression models


### Econometrics B (ØkB)

(Wooldridge Ch. 18)

Bertel Schjerning

Department of Economics, University of Copenhagen


### Outline

**Lectures on sample selection**
- Sample selection in regression models
- Exclusion restrictions
- Likelihood models
- Nonparametric bounds: Set vs point identification in the sample selection model

### Sample selection due to sample design
- Sample selection based on an explanatory variable
	- Example: only persons aged 30-50 years are interviewed.
- Truncation:
	- Sample selected based on the values of the dependent variable.
	- Example: we want to explain wealth, but only poor people are sampled.

### Sample selection due to behavior
- *Non-response on survey questions*
- *Attrition*: People drop out of the sample over time
- *Incidental truncation*: The dependent variable is unobserved because of the outcome of another variable (classical example: we only observe wages for those who work)

#### When is sample selection a problem?
Problems arise when the selected sample is a non-random draw from the population of interest.

### Sample selection framework for regression models
Consider the linear regression model

$$	y_{1}=x_{1}\beta _{1}+u_{1},\quad \quad E( u_{1}|x_{1}) =0 $$

**The sample selection problem**
- $y_{1}$ or $x_{1}$ or both are unobserved when some selection indicator $s=0$
- If we run a regression on the selected sample, we effectively condition on $s=1$

**Implication**
- We will have to work with the regression function $E(y_{1}|x_{1},s=1) $
- We need to condition on $s=1$, but the object of interest is $E( y_{1}|x_{1}) $

### Sample selection framework for regression models


<img src="img/sampleselection.png" width="900">

### Sample selection: 5 cases
**Equation of interest:**
$$
	y_{1}=x_{1}\beta _{1}+u_{1},\quad \quad E( u_{1}|x_{1}) =0
$$

We consider **5 types of sample selection:**
1. $s$ is a function of $x_{1}$ only
1. $s$ is independent of $x_{1}$ and $u_{1}$
1. $s=1( a_{1}<y_{1}<a_{2}) $ (truncation)
1. $s=1( x\delta_{2}+v_{2}>0) $ (discrete response selection with dependence between $u_{1}$ and $v_{2}$)

1. $y_{2}=\max (0,x\delta_{2}+v_{2}) $ and $s=1(y_{2}>0)$ (Tobit selection with dependence between $u_{1}$ and $v_{2}$ - implies more structure)


### Case 1: Selection on a regressor
**Assume selection is a deterministic function of $x_{1}$ only**

$$ s=h\left( x_{1}\right) $$

Regression model on the selected sample

$$ E(y_{1}|x_{1},s=1) =x_{1}\beta _{1}+E(u_{1}|x_{1},s=1) $$

- Since $s$ is a deterministic function of $x_{1}$, $s$ does not give us more information than $x_{1}$

$$	E(u_{1}|x_{1},s=1) =E(u_{1}|x_{1}) $$

- Assuming exogenous explanatory variables $E\left( u_{1}|x_{1}\right)=0$

\begin{eqnarray*}
	E(y_{1}|x_{1},s=1)  &=&x_{1}\beta _{1}+E(u_{1}|x_{1}) \\
	&=&x_{1}\beta _{1}
\end{eqnarray*}

### Case 1: We can ignore selection when selection is a deterministic function of $x_1$
- Selection rule: deterministic function of $x_1$

$$ s=h(x_{1}) $$

- We can consistently estimate $\beta_{1}$ (as well as $E\left(y_{1}|x_{1}\right) $) using OLS on the selected sample, since

$$ E\left( y_{1}|x_{1},s=1\right) =x_{1}\beta _{1}$$

- **Selection on a regressor is not a problem**

### Case 2: Selection independent of x and u

Assume selection is independent of $x_{1}$ and $u_{1}$

\begin{eqnarray*}
E( y_{1}|x_{1},s=1)  &=&x_{1}\beta _{1}+E(u_{1}|x_{1},s=1) \\
	&=&x_{1}\beta _{1}+E( u_{1}|x_{1}) \\
	&=&x_{1}\beta _{1}
\end{eqnarray*}

$\to$ we can consistently estimate $\beta_{1}$ (as well as $E(y_{1}|x_{1}) $) using OLS on the selected sample, since

$$	E(y_{1}|x_{1},s=1) =x_{1}\beta _{1} $$

**$\to$ Random sample selection is not a problem**	
-  Example: Sometimes we (are forced to) draw a *random* subsample of our dataset. The result shows that estimating on the subsample still gives consistent estimates.
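
Cases 1 and 2 can be checked with a small simulation. The sketch below is only an illustration: the parameter values, the cutoff `0.5` in the rule $s=h(x_1)$ and the 25% subsampling rate are all assumptions made for the example, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta_true = np.array([1.0, 2.0])     # illustrative intercept and slope (assumed)

x = rng.normal(size=n)
u = rng.normal(size=n)               # E(u | x) = 0 by construction
X = np.column_stack([np.ones(n), x])
y = X @ beta_true + u


def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


s_case1 = x > 0.5                    # Case 1: s = h(x1), a function of x1 only
s_case2 = rng.random(n) < 0.25       # Case 2: s drawn independently of (x1, u1)

b_full = ols(X, y)
b_case1 = ols(X[s_case1], y[s_case1])
b_case2 = ols(X[s_case2], y[s_case2])

print("full sample:      ", b_full)
print("selection on x1:  ", b_case1)
print("random subsample: ", b_case2)
```

In both selected samples the OLS estimates should be close to the full-sample estimates, although they are less precise because fewer observations are used.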
	

### Case 3: Truncated regression: Selection on the response variable (selection on $y_{1}$)

Suppose that the selection rule is
$$ s=\mathbb{1}[a_{1} < y_{1} < a_{2}] $$

- $\left( y_{1},x_{1}\right) $ only observed if $s=1$.
- $a_{1}$ and $a_{2}$ are *known* constants with $a_{2}>a_{1}$.

**We do not need to have truncation from both sides**
- Truncation only from below corresponds to $a_{2}=\infty $
- Truncation only from above corresponds to $a_{1}=-\infty $.

### Case 3: Selection can't be ignored
**Object of interest:** As usual we are interested in estimating $\beta_{1}$ in the population regression, but the sample is selected on $y_{1}$

\begin{eqnarray*}
E\left( y_{1}|x_{1}\right) &=&x_{1}\beta _{1} \\
s &=&1[ a_{1} < y_{1} < a_{2}] 
\end{eqnarray*}

- Conditional moment restrictions are not sufficient for identification
- Need to specify full conditional distribution of $y_{1}$ given $x_{1}$
- Need to estimate using CMLE

**Assumption:** The conditional cdf of $y_{1}$ given $x_{1}$ is $F(c|x_{1})=P(y_{1}\leq c|x_{1}) $, where $c$ is the usual dummy argument.

- Can't work with the distribution of $y_{1}|x_{1}$ directly
- Need to derive the distribution conditional on selection

### Digression: conditional density
Recall, the formula for conditional probability
$$
	P\left(A|B\right) =\frac{P\left( A\cap B\right) }{P\left( B\right) }
$$ 
Applying this with $A=\{y_{1}\leq c\}$ and $B=\{s=1\}$, conditional on $x_{1}$, we can write for $a_{1} < c < a_{2}$

\begin{eqnarray*}
P\left( y_{1}\leq c|x_{1},s=1\right)  &=&\frac{P\left( y_{1}\leq c,s=1|x_{1}\right) }{P\left( s=1|x_{1}\right) }\\
&=&\frac{P\left( a_{1} < y_{1} \leq c|x_{1}\right) }{P\left(a_{1} < y_{1} < a_{2}|x_{1}\right) } \\
&=&\frac{F\left( c|x_{1}\right) -F\left( a_{1}|x_{1}\right) }{F\left(a_{2}|x_{1}\right) -F\left( a_{1}|x_{1}\right) }
\end{eqnarray*}

To obtain the density, simply differentiate with respect to $c$
$$
	f\left( c|x_{1},s=1\right) =\frac{f\left( c|x_{1}\right) }{F\left(a_{2}|x_{1}\right) -F\left( a_{1}|x_{1}\right) }
$$
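
As a sanity check, the truncated cdf and density can be compared with simulated draws. This is a minimal sketch assuming a normal distribution for $y_{1}|x_{1}$ at one fixed value of $x_{1}$; the numbers `mu`, `sigma`, `a1`, `a2` and `c` are hypothetical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu, sigma = 1.0, 2.0      # stand-ins for x1*beta1 and sigma at a fixed x1 (assumed)
a1, a2 = 0.0, 3.0         # known truncation points (assumed values)
c = 1.5                   # evaluation point with a1 < c < a2


def F(t):
    """Untruncated conditional cdf F(t | x1)."""
    return norm.cdf(t, loc=mu, scale=sigma)


# Simulate the population and keep only draws with s = 1
y = rng.normal(mu, sigma, size=500_000)
selected = (y > a1) & (y < a2)

cdf_formula = (F(c) - F(a1)) / (F(a2) - F(a1))
cdf_simulated = np.mean(y[selected] <= c)

# The truncated density should integrate to one over (a1, a2)
grid = np.linspace(a1, a2, 20_001)
dens = norm.pdf(grid, loc=mu, scale=sigma) / (F(a2) - F(a1))
area = np.sum((dens[1:] + dens[:-1]) / 2 * np.diff(grid))  # trapezoid rule

print(cdf_formula, cdf_simulated, area)
```

The division by $F(a_{2}|x_{1})-F(a_{1}|x_{1})$ is exactly what makes the area under the truncated density equal to one.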

### Likelihood function for truncated regression (selection on $y_1$)

Plugging $y_{1i}$ into this density and taking logs, we arrive at the log-likelihood function
\begin{eqnarray*}
\ln L(\beta_{1} ,\sigma)  	&=&\frac{1}{N}\sum_{i=1}^{N}s_{i}\ln f(y_{1i}|x_{1i},s_{i}=1) \\
				&=&\frac{1}{N}\sum_{i=1}^{N}s_{i}\ln \left( \frac{f\left( y_{1i}|x_{1i}\right) }{F\left( a_{2}|x_{1i}\right) -F\left( a_{1}|x_{1i}\right) }\right) 
\end{eqnarray*}

If $y_{1}|x_{1}\sim N\left( x_{1}\beta _{1},\sigma ^{2}\right) $ and the truncation points $a_{1}$ and $a_{2}$ are constants that do not depend on $x_{1}$, we have the **truncated** Tobit model.
$$
\ln L\left( \beta_{1} ,\sigma \right) =\frac{1}{N}\sum_{i=1}^{N}s_{i}\ln \frac{\frac{1}{\sigma }\phi \left( \frac{y_{1i}-x_{1i}\beta_{1}}{\sigma }\right) }{\Phi \left( \frac{a_{2}-x_{1i}\beta_{1}}{\sigma }\right) -\Phi\left( \frac{a_{1}-x_{1i}\beta_{1}}{\sigma }\right) }
$$
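
To see what this likelihood buys us, the following sketch simulates data truncated from below at zero and compares OLS on the truncated sample with the conditional MLE under normality, i.e. with density $\frac{1}{\sigma}\phi\left((y_{1i}-x_{1i}\beta_{1})/\sigma\right)$ divided by the probability of $a_{1}<y_{1i}<a_{2}$. All parameter values are illustrative assumptions, and parameterizing the likelihood in $\ln\sigma$ is just one convenient way to keep $\sigma$ positive.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 50_000
beta_true = np.array([1.0, 2.0])   # illustrative values (assumed)
sigma_true = 1.5
a1, a2 = 0.0, np.inf               # truncation from below at zero only

x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ beta_true + sigma_true * rng.normal(size=n)

s = (y > a1) & (y < a2)            # only rows with s = 1 are observed
Xs, ys = X[s], y[s]


def neg_loglik(theta):
    """Negative average truncated-normal log-likelihood, theta = (beta, log sigma)."""
    beta, sigma = theta[:-1], np.exp(theta[-1])
    xb = Xs @ beta
    log_f = norm.logpdf((ys - xb) / sigma) - np.log(sigma)
    prob_s = norm.cdf((a2 - xb) / sigma) - norm.cdf((a1 - xb) / sigma)
    return -np.mean(log_f - np.log(np.clip(prob_s, 1e-300, None)))


# OLS on the truncated sample: inconsistent, but a convenient starting value
b_ols = np.linalg.lstsq(Xs, ys, rcond=None)[0]
start = np.append(b_ols, np.log(ys.std()))

res = minimize(neg_loglik, start, method="BFGS")
b_mle, sigma_mle = res.x[:-1], np.exp(res.x[-1])

print("OLS on truncated sample:", b_ols)
print("truncated-normal MLE:   ", b_mle, sigma_mle)
```

In this design the OLS slope is attenuated, while the MLE should land close to the assumed true values, provided the normality assumption holds.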

### Case 3: Truncated vs. censored Tobit 

**Censored Tobit:** we observe covariates $x_{1}$ for all people

**Truncated Tobit:** we do not.

- If the data allow us to use the censored Tobit, we should use it, since it exploits all available information

- As in censored regression, heteroscedasticity and non-normality have severe consequences

### Case 4: Incidental truncation (Probit selection)
**Structural equation**
$$
y_{1}=x_{1}\beta _{1}+u_{1},\ E\left( u_{1}|x_{1}\right) =0
$$

**Reduced form probit selection equation**

$$
s=1\left( x\delta_{2}+v_{2}>0\right) 
\quad \quad v_{2}\sim N\left( 0,1\right)
\quad \quad x=\left( x_{1},x_{2}\right)
$$
- $\left( u_{1},v_{2}\right) $ is independent of $x$

- We only observe $y_{1}$ when $s=1,$ but $x$ is always observed

- Selection equation is a probit equation $P\left(s=1|x\right) =\Phi\left(x\delta_{2}\right)$ 



- **Assume further that $u_{1},v_{2}$ are mutually mean dependent**,  
$E\left( u_{1}|v_{2}\right) =\gamma _{1}v_{2}$
 
 $\to$ This form of dependence rules out many joint distributions

 $\to$ but not the bivariate normal distribution.

 $\to$ less restrictive than assuming that $u_{1}$ and $v_{2}$ are bivariate normal.

###  Case 4: Regression on the selected sample

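To derive $E\left( y_{1}|x,s=1\right) $ we first condition on $v_{2}$ as well. Since $s$ is a deterministic function of $\left( x,v_{2}\right) $, conditioning on $s=1$ adds no information beyond $\left( x,v_{2}\right) $

$$
E\left( y_{1}|x,s=1,v_{2}\right) 
=x_{1}\beta _{1}+E\left( u_{1}|x,v_{2}\right) 
$$

Because $\left( u_{1},v_{2}\right) $ is independent of $x$ we can write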

\begin{eqnarray*}
E\left( y_{1}|x,s=1,v_{2}\right)  &=&x_{1}\beta _{1}+E\left(
u_{1}|v_{2}\right) 
\\
&=&x_{1}\beta _{1}+\gamma _{1}v_{2}
\end{eqnarray*}

where we have used the assumption that $E\left( u_{1}|v_{2}\right) =\gamma_{1}v_{2}$

**By LIE we obtain the conditional regression function**

\begin{eqnarray*}
E\left( y_{1}|x,s=1\right)  &=&x_{1}\beta _{1}+E\left( \gamma_{1}v_{2}|x,s=1\right) \\
&=&x_{1}\beta _{1}+\gamma _{1}E\left( v_{2}|x,v_{2}>-x\delta_{2}\right) 
\end{eqnarray*}

The regression function depends on $E\left( v_{2}|x,v_{2}>-x\delta_{2}\right)$, which is the mean of a truncated normal variable

### Case 4: When is selection due to incidental truncation an issue?
- If $u_{1}$ and $v_{2}$ are uncorrelated, then $\gamma_{1}=0$ and

\begin{eqnarray*}
E\left( y_{1}|x,s=1\right)  &=&x_{1}\beta _{1}+\gamma _{1}E\left(
v_{2}|x,v_{2}>-x\delta_{2}\right) \\
&=&x_{1}\beta _{1}
\end{eqnarray*}

- There is no selection problem and we can just use OLS on the selected sample

### Case 4: Incidental truncation (Probit selection)
- If $u_{1}$ and $v_{2}$ are correlated, then $\gamma_{1}\neq 0$ and

$$
E\left(y_{1}|x,s=1\right) 
=x_{1}\beta_{1}+\gamma_{1}E\left(v_{2}|x,v_{2}>-x\delta_{2}\right) 
$$

- OLS on the selected sample is inconsistent: the selection term $\gamma_{1}E\left( v_{2}|x,v_{2}>-x\delta_{2}\right)$ is omitted, and $E\left( v_{2}|x,v_{2}>-x\delta_{2}\right) >0$

- $E\left( v_{2}|x,v_{2}>-x\delta_{2}\right)$ acts as an omitted variable that is correlated with $x$

- Since $v_{2}\sim N\left( 0,1\right)$, $E\left( v_{2}|x,v_{2}>-x\delta_{2}\right) $ is just the mean of a truncated standard normal.

### Digression: Expected value of a truncated normal

First note that when $v_{2}\sim N\left( 0,1\right)$ 

\begin{eqnarray*}
\phi \left( v_{2}\right)  &=&\frac{1}{\sqrt{2\pi }}\exp \left(-\frac{v_{2}^{2}}{2}\right) \\
\phi ^{\prime }\left( v_{2}\right) 
&=&-v_{2}\frac{1}{\sqrt{2\pi }}\exp \left( -\frac{v_{2}^{2}}{2}\right) 
=-v_{2}\phi \left( v_{2}\right) 
\end{eqnarray*}

**Expected value of a truncated standard normal**

\begin{eqnarray*}
E\left( v_{2}|v_{2}>-x\delta_{2}\right) 
&=&\frac{\int_{-x\delta_{2}}^{\infty }v_{2}\phi \left( v_{2}\right) dv_{2}}{
1-\Phi \left( -x\delta_{2}\right) }
=\frac{-\int_{-x\delta_{2}}^{\infty }\phi ^{\prime }\left( v_{2}\right)
dv_{2}}{1-\Phi \left( -x\delta_{2}\right) }\\
&=&\frac{-\left[ \phi \left( \infty \right) -\phi \left( -x\delta_{2}\right) \right] }{1-\Phi \left( -x\delta_{2}\right) }
=\frac{\phi \left( -x\delta_{2}\right) }{1-\Phi \left( -x\delta_{2}\right) }
=\frac{\phi \left( x\delta_{2}\right) }{\Phi \left( x\delta_{2}\right) }
\end{eqnarray*}

- We have used that $\phi \left( \infty \right) =0$ and that the normal distribution is symmetric, such that $\phi \left( x\delta_{2}\right) =\phi \left( -x\delta_{2}\right)$ and $\Phi \left( x\delta_{2}\right) =1-\Phi \left( -x\delta_{2}\right)$ 

- The density must be divided by the probability of not being truncated, $1-\Phi \left( -x\delta_{2}\right)$, so that the truncated density integrates to $1$.
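
The closed form $\phi \left( x\delta_{2}\right) /\Phi \left( x\delta_{2}\right)$ is easy to verify by simulation. In the sketch below the three index values are hypothetical stand-ins for $x\delta_{2}$, and the function name `inverse_mills` is our own label.

```python
import numpy as np
from scipy.stats import norm


def inverse_mills(z):
    """phi(z) / Phi(z): the mean of a standard normal truncated from below at -z."""
    return norm.pdf(z) / norm.cdf(z)


rng = np.random.default_rng(4)
v2 = rng.standard_normal(1_000_000)

results = {}
for index in (-1.0, 0.0, 1.5):                 # hypothetical values of x*delta2
    simulated = v2[v2 > -index].mean()         # sample analogue of E(v2 | v2 > -index)
    results[index] = (simulated, inverse_mills(index))
    print(index, results[index])
```

The simulated and analytical means should agree closely, and both shrink towards zero as the index grows, since less of the distribution is truncated away.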


### Case 4: Collecting terms
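\begin{eqnarray}
E\left( y_{1}|x,s=1\right)  &=&x_{1}\beta_{1} +\gamma_{1}\frac{\phi \left(
x\delta_{2}\right) }{\Phi \left( x\delta_{2}\right) }  \notag \\
&=&x_{1}\beta_{1} +\gamma_{1}\lambda \left( x\delta_{2}\right) 
\end{eqnarray}

where $\lambda \left( z\right) =\phi \left( z\right) /\Phi \left( z\right)$ is the inverse Mills ratio.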


**Sample selection as an omitted variable problem**

- When regressing $y_{1}$ on $x_{1}$ using the selected sample, we omit the term $\lambda \left( x\delta_{2}\right) $ 
- Heckman suggests: estimate $\delta_{2}$, generate $\lambda \left( x\hat{\delta}_{2}\right)$ and include it as a regressor

### Case 4: Heckman's two-step sample selection procedure

**Heckit procedure, Heckman (1979):**
1. Probit of $s_{i}$ on $x_{i}$ using all observations $\to \hat{\delta}_{2}$ and $\hat{\lambda}_{i}=\lambda \left( x_{i}\hat{\delta}_{2}\right) $
2. OLS of $y_{1i}$ on $x_{1i}$ and the generated regressor $\hat{\lambda}_{i}$ using the selected sample ($s_{i}=1$) $\to \hat{\beta}_{1},\hat{\gamma}_{1}$
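
The two steps can be sketched on simulated data as follows. This is an illustration under assumed parameter values, with one excluded variable `x2` and jointly normal errors. The probit is estimated by minimizing the negative log-likelihood with a generic optimizer rather than a dedicated probit routine, which is a choice made only to keep the example self-contained.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 50_000
beta1 = np.array([1.0, 2.0])             # illustrative values (assumed)
delta2 = np.array([0.5, 1.0, 1.0])
gamma1 = 0.8

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                  # excluded from the outcome equation
v2 = rng.normal(size=n)
u1 = gamma1 * v2 + rng.normal(size=n)    # so that E(u1 | v2) = gamma1 * v2

X1 = np.column_stack([np.ones(n), x1])
X = np.column_stack([np.ones(n), x1, x2])
y1 = X1 @ beta1 + u1                     # in real data only observed when s = 1
s = X @ delta2 + v2 > 0


def probit_negll(d):
    """Negative average probit log-likelihood of s on X."""
    xb = X @ d
    return -np.mean(np.where(s, norm.logcdf(xb), norm.logcdf(-xb)))


# Step 1: probit on all observations, then the inverse Mills ratio
delta_hat = minimize(probit_negll, np.zeros(3), method="BFGS").x
index = X @ delta_hat
lam_hat = norm.pdf(index) / norm.cdf(index)

# Step 2: OLS of y1 on x1 and lambda_hat, selected sample only
Z = np.column_stack([X1, lam_hat])[s]
coef_heckit = np.linalg.lstsq(Z, y1[s], rcond=None)[0]   # (beta1_hat, gamma1_hat)
coef_naive = np.linalg.lstsq(X1[s], y1[s], rcond=None)[0]

print("naive OLS on selected sample:", coef_naive)
print("Heckit second step:          ", coef_heckit)
```

The naive slope is biased downwards in this design, because $\lambda \left( x\delta_{2}\right)$ is decreasing in $x_{1}$ and $\gamma_{1}>0$. The OLS standard errors from the second step are not valid when $\gamma_{1}\neq 0$, so they are deliberately not computed here.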

### Test for selectivity bias following Heckman's two-step procedure

When $\gamma_{1}\neq 0$ *asymptotic inference is complicated* for two reasons
1. Heteroscedasticity: $Var\left( y_{1}|x,s=1\right)$ is not constant. It depends on $x$ and differs from $Var\left( y_{1}|x\right)$.

2. $\hat{\lambda}_{i}$ is a generated regressor.

- **Non-standard inference when $\gamma_{1} \neq 0$:**
a covariance matrix of the form $A_{0}^{-1}B_{0}A_{0}^{-1}$ is robust to heteroscedasticity, but it does not solve the generated regressor problem
- **But we can still test for selectivity bias, i.e. $H_{0}:\gamma_{1}=0$**
(using the usual t-statistic based on OLS standard errors)
- **Why?** Under $H_{0}$ we have $\gamma_{1}=0$, so $Var\left(y_{1}|x,s=1\right) =Var\left( y_{1}|x\right) =Var\left(u_{1}\right) $ and the estimation error in $\hat{\delta}_{2}$ does not affect the limiting distribution of $\hat{\gamma}_{1}$



### How to correct standard errors?
We can correct for the generated regressor problem
- Using **asymptotic theory for two-step M-estimators** (see 12.5.2) - brain intensive
- **Bootstrapping the standard errors** - computer intensive, but easy to do

**Alternatively use MLE approach (Partial MLE)**
- Stronger assumptions: Need to specify joint distribution of $u_{1}$ and $v_{2}$

### Heteroscedasticity due to selection

Showing that $$Var\left(y_{1}|x,s=1\right) \neq Var\left(y_{1}|x\right) $$ would take us too far.

But we can get some intuition from the following example: 

- Suppose some values of $x$ imply low wages. 
- These $x$'s will tend to also imply a lower probability of working 
- $\to$ more truncation 
- $\to$ a lower variance of the error term in a sample of workers. 
- $\to$ variance depends on $x$
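
The intuition can be illustrated numerically. The sketch below assumes jointly normal errors with a hypothetical $\gamma_{1}=0.8$ and compares the variance of $u_{1}$ among the selected for three hypothetical values of the index $x\delta_{2}$.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
gamma1 = 0.8                                  # illustrative value (assumed)

v2 = rng.standard_normal(n)
u1 = gamma1 * v2 + rng.standard_normal(n)     # E(u1 | v2) = gamma1 * v2

var_by_index = {}
for index in (-1.0, 0.0, 2.0):                # hypothetical values of x*delta2
    selected = v2 > -index                    # s = 1 at this value of the index
    var_by_index[index] = u1[selected].var()
    print(f"index {index:+.1f}: share selected {selected.mean():.2f}, "
          f"Var(u1 | s=1) = {var_by_index[index]:.3f}")

print("unconditional Var(u1):", u1.var())
```

A low index means a low probability of selection, hence more truncation and a smaller error variance among those selected. Since the index depends on $x$, so does the variance.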

### Case 5: Tobit Selection: The model
**Structural equation**

$$
y_{1}=x_{1}\beta _{1}+u_{1},\  \ E\left( u_{1}|x_{1}\right) =0
$$

**Selection rule**
\begin{eqnarray*}
s &=&1\left( y_{2}>0\right) \\
y_{2} &=&\max \left( 0,x\delta_{2}+v_{2}\right) \text{, }
\end{eqnarray*}

**Same distributional assumptions as for probit selection**
- $x=\left( x_{1},x_{2}\right)$
- $\left( u_{1},v_{2}\right) $ is independent of $x$
- $v_{2}\sim N\left( 0,1\right) $.
- $u_{1},v_{2}$ are mutually dependent with $E\left( u_{1}|v_{2}\right)=\gamma_{1}v_{2}$

### Case 5: Tobit selection - two-step estimation procedure

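As before, conditioning on $v_{2}$ as well,
$$
E\left( y_{1}|x,v_{2},s=1\right) =x_{1}\beta _{1}+\gamma_{1}v_{2}
$$
The difference from probit selection is that $v_{2}=y_{2}-x\delta_{2}$ is observed whenever $y_{2}>0$.

**This suggests the two-step estimation procedure**
1. Estimate $\delta_{2}$ by the censored Tobit of $y_{2i}$ on $x_{i}$ and compute the residuals $\hat{v}_{2i}=y_{2i}-x_{i}\hat{\delta}_{2}$ for observations with $y_{2i}>0$
1. Using observations with $y_{2i}>0$, estimate $\beta _{1}$ and $\gamma_{1}$ by the OLS regression of $y_{1i}$ on $x_{1i}$ and $\hat{v}_{2i}$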
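
A minimal simulated sketch of this procedure follows. All parameter values are assumptions. To match the normalization $v_{2}\sim N(0,1)$ used above, the Tobit likelihood fixes the error standard deviation at one, whereas a standard Tobit would estimate it as well. The design deliberately uses $x=x_{1}$, i.e. no exclusion restriction.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(7)
n = 50_000
beta1 = np.array([1.0, 2.0])              # illustrative values (assumed)
delta2 = np.array([0.5, 1.0])
gamma1 = 0.8

x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])     # x = x1: no excluded variable
v2 = rng.normal(size=n)
u1 = gamma1 * v2 + rng.normal(size=n)     # E(u1 | v2) = gamma1 * v2

y2 = np.maximum(0.0, X @ delta2 + v2)     # censored selection variable
s = y2 > 0
y1 = X @ beta1 + u1                       # in real data only observed when s = 1


def tobit_negll(d):
    """Negative average censored-Tobit log-likelihood with unit error variance."""
    xb = X @ d
    ll = np.where(s, norm.logpdf(y2 - xb), norm.logcdf(-xb))
    return -np.mean(ll)


# Step 1: Tobit of y2 on x, residuals for uncensored observations
delta_hat = minimize(tobit_negll, np.zeros(2), method="BFGS").x
v2_hat = y2 - X @ delta_hat               # only meaningful where y2 > 0

# Step 2: OLS of y1 on x1 and v2_hat using observations with y2 > 0
Z = np.column_stack([X, v2_hat])[s]
coef_two_step = np.linalg.lstsq(Z, y1[s], rcond=None)[0]  # (beta1_hat, gamma1_hat)
coef_naive = np.linalg.lstsq(X[s], y1[s], rcond=None)[0]

print("naive OLS on selected sample:", coef_naive)
print("two-step estimates:          ", coef_two_step)
```

Even without an excluded variable, the residual $\hat{v}_{2i}$ varies separately from $x_{1i}$ because $y_{2i}$ is observed, so the second step is not plagued by near collinearity.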

### Case 5: A few remarks on Tobit selection

**Test for selectivity bias: $H_{0}:\gamma_{1}=0$**
- use t-statistic based on OLS standard errors (valid under the null of no selection).

- we can estimate the residuals $\hat{v}_{2}$ directly

  $\to$ we do not have to compute the inverse Mills ratio

- **Exclusion restriction not required:** allowing $x_{1}=x$ causes no problem, since $v_{2}$ always has separate variation due to the variation in $y_{2}$

The vast majority of empirical studies use probit selection rather than Tobit selection.

### Exclusion restrictions (required for probit selection)

**Exclusion restriction**: 
- $x_{2}$ is assumed to affect $s$, but not $y_{1}$.
- crucial for identification

In principle, $\beta_{1}$ is identified in Heckman's sample selection model without an exclusion restriction
- BUT only because of the nonlinearity of the inverse Mills ratio.
- $\to$ we do NOT want to heavily rely on a parametric assumption 
- $\to$ we should NOT estimate Heckman's sample selection model without an exclusion restriction
- Often the inverse Mills ratio is close to linear over a large part of its range, so that $\lambda \left( x\hat{\delta}_{2}\right)$ is nearly collinear with $x_{1}$ when $x=x_{1}$ (multicollinearity)
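
How close to linear is the inverse Mills ratio? The sketch below fits a straight line to $\lambda(z)$ over a hypothetical range of index values, $z\in[-1,1]$, and reports the $R^{2}$ of that fit. The range is an assumption chosen for illustration, and the result will differ for other ranges.

```python
import numpy as np
from scipy.stats import norm

z = np.linspace(-1.0, 1.0, 201)          # hypothetical range of the index x*delta2
lam = norm.pdf(z) / norm.cdf(z)          # inverse Mills ratio

# Best linear approximation of lambda(z) by a constant and z
Z = np.column_stack([np.ones_like(z), z])
coef = np.linalg.lstsq(Z, lam, rcond=None)[0]
resid = lam - Z @ coef
r_squared = 1.0 - resid.var() / lam.var()

print("linear fit (intercept, slope):", coef)
print("R^2 of the linear fit:", r_squared)
```

An $R^{2}$ close to one means that, without an excluded variable, almost all variation in $\lambda \left( x\hat{\delta}_{2}\right)$ is already spanned by $x_{1}$, which leaves very little independent variation to identify $\gamma_{1}$.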

# Exclusion restrictions
 

### Limitation: Hard to find exclusion restrictions

- I cannot think of any exclusion restriction that is not subject to relevant critique
- Huge limitation of this approach
- Finding exclusion restrictions is an "art"

**Not possible to test the validity of the exclusion restriction**
- Why not simply test whether the instrument is significant in the structural equation?
- This is NOT a validation of the exclusion restriction, since we are estimating on the selected sample
- Even if it is insignificant, we cannot be sure that it has no effect in the population equation.

### Example: Wage equation for women

- The classical exclusion restriction: Existence of **small children**
- **Identifying assumption**: having small children has an effect on women's participation decision, but not on their wage offer


- **Critique of exclusion restriction:** 
	- Small children can imply that the woman is restricted to working in jobs with fewer hours and lower wage
	- Having children is an intertemporal decision: future births may depend on current wage offers.

### Example: Wage equation for women

Alternative exclusion restriction:
- Indicator for guaranteed **access to day care** (in Danish: 'pasningsgaranti') or the degree of subsidized day care
- Simonsen (2008) considers female labor supply and day care provision. 

**Critique of exclusion restriction:**
- Tiebout model: Most productive women may choose to live where they can be sure to send their children to day care.
- Availability of day care is correlated with both labor supply and wages

### Example: Wage equation for women

Another classic exclusion restriction: *Husband's income*

**Identifying assumption:** Husband's income reduces the wife's labor supply, but has no effect on her wages.

**Critique of exclusion restriction:**
- Poor exclusion restriction if there is positive assortative matching in the marriage market (highly productive men and highly productive women match).

### Example: Exclusion restrictions in dynamic wage equation

**Vella and Verbeek (1998):** Wage equation with dynamic labor supply

**Exclusion restriction:**
- lagged labor market participation

**Identifying assumption:** Wage equation is static, but the participation decision is dynamic

**Critique of exclusion restriction:**
- Suppose a worker did not participate in the previous period due to stress.
- $\Rightarrow $ smaller probability of participating this year due to state dependence
- or perhaps because the worker still cannot work due to stress

 If the worker returns: less likely to take up a stressful job (smaller workload $\Rightarrow$ lower wage)


 **Hence, lagged participation may affect the current wage**

# Concluding remarks

## Concluding remarks
