In [31]:
%matplotlib inline

import statsmodels.formula.api as smf
from functools import partial
import pandas as pd
import numpy as np

from auxiliary import *

np.random.seed(123)

# Self-selection, heterogeneity, and causal graphs

**Overview**

* Introduction

* Nonignorability and selection on the unobservables revisited

* Selection on the unobservables and the utility of additional posttreatment measures of the outcome

* Causal graphs for complex patterns of self-selection and heterogeneity

* Conclusion


**Alternatives to back-door identification**

The next chapters deal with:

* instrumental variables
* front-door identification with causal mechanisms
* conditioning estimators using pretreatment variables

Why do we need to consider alternatives?

$\rightarrow$ selection on unobservables / nonignorability of treatment

What makes an unobservable?

* simple confounding: a stable unobserved common cause of the treatment and the outcome variable

* subtle confounding: direct self-selection into the treatment based on accurate perceptions of the individual-level treatment effect

Selection on unobservables as a combination of two features:

* treatment effect heterogeneity
* self-selection
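
How the two features interact can be sketched with a small simulation. This is a minimal illustration under assumed parameters, not part of the chapter's example: the function name `simulate_selection_on_gains`, the effect distribution, and the decision rule (take the treatment whenever the individual gain is positive) are all hypothetical choices.

```python
import numpy as np


def simulate_selection_on_gains(num_agents=100_000, seed=0):
    """Compare ATE, ATT, and the naive difference in means (hypothetical setup)."""
    rng = np.random.default_rng(seed)

    # Treatment effect heterogeneity: individual gains vary around a mean of one.
    delta = rng.normal(loc=1.0, scale=1.0, size=num_agents)
    y_0 = rng.normal(size=num_agents)
    y_1 = y_0 + delta

    # Self-selection: agents know their own gain and enroll only if it is positive.
    d = (delta > 0).astype(int)
    y = d * y_1 + (1 - d) * y_0

    ate = delta.mean()
    att = delta[d == 1].mean()
    naive = y[d == 1].mean() - y[d == 0].mean()
    return ate, att, naive


ate, att, naive = simulate_selection_on_gains()
print("ATE {:5.3f}   ATT {:5.3f}   Naive {:5.3f}".format(ate, att, naive))
```

In this sketch the baseline outcome is unrelated to the treatment decision, so the naive comparison tracks the effect among the treated rather than the average effect. The gap between the two arises only because heterogeneity and self-selection are both present.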

## Nonignorability and selection on the unobservables

### Selection on observables 

<img src="material/figure-4-8.png" height=500 width=500 />

### Selection on unobservables

<img src="material/figure-4-9.png" height=500 width=500 />


## Selection on the unobservables and the utility of additional posttreatment measures of the outcome

We proceed in two steps:

* assess identification for given directed graphs

* examine structure of directed graph itself

<img src="material/figure-8-1.png" height=500 width=500 />

* **U**: unobserved motivation to learn, differences in home environment, and anticipation of the causal effect itself.

We cannot identify the causal effect of $D$ on $Y_{10}$ in subfigure (a), but we can in subfigure (b). However, at what cost?

<img src="material/figure-8-2.png" height=500 width=500 />

Once we revisit the economic implications of the imposed graph, back-door adjustment by $Y_{10}$ is ineffective again. In fact, $Y_{10}$ is now a collider variable, and conditioning on it induces a noncausal dependence.
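
Why conditioning on a collider is harmful can be seen in a generic simulation. The sketch below does not reproduce the graph in the figure. The variables `cause_a`, `cause_b`, and `collider` and the helper `ols_slopes` are hypothetical and only illustrate the mechanism: two independent variables become dependent once we condition on a common effect.

```python
import numpy as np

rng = np.random.default_rng(0)
num_obs = 100_000

# Two independent causes and a collider that both of them affect.
cause_a = rng.normal(size=num_obs)
cause_b = rng.normal(size=num_obs)
collider = cause_a + cause_b + rng.normal(size=num_obs)


def ols_slopes(y, *regressors):
    """Return the OLS slope coefficients of y on the regressors (intercept dropped)."""
    design = np.column_stack([np.ones_like(y), *regressors])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[1:]


slope_unconditional = ols_slopes(cause_a, cause_b)[0]
slope_conditional = ols_slopes(cause_a, cause_b, collider)[0]

print("Slope on cause_b without conditioning: {:6.3f}".format(slope_unconditional))
print("Slope on cause_b given the collider:   {:6.3f}".format(slope_conditional))
```

Without the collider in the regression, the slope is close to zero, as it should be for independent variables. Adding the collider as a regressor produces a clearly negative slope even though neither cause affects the other.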

## Panel Data Demonstration

This example serves two purposes: it shows that conventional strategies do not recover the underlying causal effect, and it illustrates how we model self-selection in the data-generating process.

In [85]:
num_agents = 100

In [87]:
def get_propensity_score(o, u):
    """Get the propensity score."""
    level = -3.8 + o + u 
    return np.exp(level) / (1 + np.exp(level))


def get_treatment_status(o, u):
    # Following the causal graph, the treatment indicator is only a function 
    # of the background characteristics O and the unobservable U.
    p = get_propensity_score(o, u)
    return np.random.choice([1, 0], p=[p, 1 - p])
   
def get_covariates():
    """Draw the covariates O, U, X, and E for a single agent."""
    o = np.random.normal()
    e = np.random.normal()
    
    # Based on the graph, the variables X and U both depend on O.
    x = o + np.random.normal()
    u = o + np.random.normal()
        
    return o, u, x, e
    
    
def get_potential_outcomes(grade, o, u, x, e, scenario=0, selection=False):
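    """Get the potential outcomes (y_0, y_1) for the requested grade.

    Scenario:
        0: without supplementary dependence on E
        1: with supplementary dependence on E
    """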
    assert scenario in range(2)
    
    if scenario == 0:
        y_0 = list()
        y_0.append(100 + o + u + x + np.random.normal())
        y_0.append(101 + o + u + x + np.random.normal())
        y_0.append(102 + o + u + x + np.random.normal())
    elif scenario == 1:
        y_0 = list()
        y_0.append(100 + o + u + x + e + np.random.normal())
        y_0.append(101 + o + u + x + e + np.random.normal())
        y_0.append(102 + o + u + x + e + np.random.normal())
    else:
        raise NotImplementedError
    
    # Sampling treatment effects
    delta_1 = np.random.normal(loc=10, scale=1)
    
    if selection:
        delta_2 = np.random.normal(loc=u)
    else:
        delta_2 = np.random.normal()
    
    y_1 = list()
    y_1.append(y_0[0] + delta_1 + delta_2)
    y_1.append(y_0[1] + (1 + delta_1) + delta_2)
    y_1.append(y_0[2] + (2 + delta_1) + delta_2)
        
    idx = grade - 10

    return y_0[idx], y_1[idx]
        
def get_sample_panel_demonstration(num_agents=1000, scenario=0, selection=False, seed=123):
    
    columns = ['Y', 'D', 'O', 'U', 'X', 'E', 'Y_1', 'Y_0']
    index = list()
    for i in range(num_agents):
        for j in [10, 11, 12]:
            index.append((i, j))
    index = pd.MultiIndex.from_tuples(index, names=('Identifier', 'Grade'))
    df = pd.DataFrame(columns=columns, index=index)

    np.random.seed(seed)
    for i in range(num_agents):

        o, u, x, e = get_covariates()
        d = get_treatment_status(o, u)
        for grade in [10, 11, 12]:
            y_0, y_1 = get_potential_outcomes(grade, o, u, x, e, scenario, selection)
            y = d * y_1 + (1 - d) * y_0
            df.loc[(i, grade), :] = [y, d, o, u, x, e, y_1, y_0]

    # The aliases np.float and np.int were removed in NumPy 1.24,
    # so we use the built-in types instead.
    df = df.astype(float)
    df = df.astype({'D': int})

    return df
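
The data-generating process implemented by the functions above can be summarized as follows. The notation ($Y^0_g$, $Y^1_g$, $\varepsilon_g$, $\delta_{1,g}$, $\delta_{2,g}$) is introduced here for illustration only and does not appear in the code. All numbers are read off the function bodies.

$$
\begin{aligned}
O, E &\sim N(0, 1), \qquad X = O + \nu_X, \qquad U = O + \nu_U, \qquad \nu_X, \nu_U \sim N(0, 1), \\
\Pr(D = 1 \mid O, U) &= \frac{\exp(-3.8 + O + U)}{1 + \exp(-3.8 + O + U)}, \\
Y^0_g &= 100 + (g - 10) + O + U + X + \mathbb{1}[\text{scenario} = 1] \, E + \varepsilon_g, \\
Y^1_g &= Y^0_g + (g - 10) + \delta_{1,g} + \delta_{2,g},
\end{aligned}
$$

for grades $g \in \{10, 11, 12\}$, with $\varepsilon_g \sim N(0, 1)$ and $\delta_{1,g} \sim N(10, 1)$. The second effect component is $\delta_{2,g} \sim N(U, 1)$ if `selection` is `True` and $\delta_{2,g} \sim N(0, 1)$ otherwise. The shocks and effect components are drawn anew for each grade. Since $U$ has mean zero, the implied average treatment effect in grade $g$ is $10 + (g - 10) = g$ in both cases. With selection, however, the individual effect rises with $U$, and so does the probability of treatment.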

In [81]:
num_agents, scenario, selection = 100, 0, False
df = get_sample_panel_demonstration(num_agents, scenario, selection)
df.head()

| Identifier | Grade | Y | D | O | U | X | E | Y_1 | Y_0 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 98.345538 | 0 | -1.085631 | -0.802652 | -0.088285 | -1.506295 | 108.705587 | 98.345538 |
| 0 | 11 | 98.575991 | 0 | -1.085631 | -0.802652 | -0.088285 | -1.506295 | 108.813428 | 98.575991 |
| 0 | 12 | 100.149958 | 0 | -1.085631 | -0.802652 | -0.088285 | -1.506295 | 113.055897 | 100.149958 |
| 1 | 10 | 103.505368 | 0 | 0.522742 | 1.247658 | 0.988387 | 1.495827 | 112.983792 | 103.505368 |
| 1 | 11 | 102.897032 | 0 | 0.522742 | 1.247658 | 0.988387 | 1.495827 | 109.32691 | 102.897032 |


What is the average treatment effect and how does it depend on the presence of selection?

In [82]:
num_agents, scenario, selection = 1000, 0, False

# Using partial, we freeze the arguments of the function
# that do not change during the analysis.
simulate_sample = partial(get_sample_panel_demonstration, num_agents, scenario)

for selection in [False, True]:
    print(' Selection {:}'.format(selection))
    df = simulate_sample(selection)
    for grade in [10, 12]:
        subset = df.loc[(slice(None), grade), :]
        stat = (subset['Y_1'] - subset['Y_0']).mean()
        print(" Grade {:}:  ATE {:5.3f}".format(*[grade, stat]))
    print('\n')

 Selection False
 Grade 10:  ATE 10.072
 Grade 12:  ATE 12.020


 Selection True
 Grade 10:  ATE 10.020
 Grade 12:  ATE 11.968




In [83]:
for selection in [False, True]:
    print(' Selection {:}'.format(selection))
    df = simulate_sample(selection)
    for grade in [10, 12]:
        subset = df.loc[(slice(None), grade), :]
            
        treated = subset['D'] == 1
        control = subset['D'] == 0

        stat = list()
        stat.append((subset['Y_1'][treated] - subset['Y_0'][treated]).mean())
        stat.append((subset['Y_1'][control] - subset['Y_0'][control]).mean())
        print(" Grade {:}:  ATT {:5.3f}   ATC {:5.3f}".format(grade, *stat))
    print('\n')

 Selection False
 Grade 10:  ATT 9.919   ATC 10.083
 Grade 12:  ATT 11.626   ATC 12.050


 Selection True
 Grade 10:  ATT 11.609   ATC 9.900
 Grade 12:  ATT 13.317   ATC 11.867




We now run some illustrative regressions that confirm our fear: regression-based estimation does not recover the underlying causal effects.

In [84]:
for grade in [10, 12]:
    for model in ['Y ~ D', 'Y ~ D + X + O']:
        subset = df.loc[(slice(None), grade), :]
        rslt = smf.ols(formula=model, data=subset).fit()
        stat = rslt.params['D']
        print('Grade: {}  Model: {:}'.format(*[grade, model]))
        print('   Estimated Treatment Effect: {:5.3f}\n'.format(stat))

Grade: 10  Model: Y ~ D
   Estimated Treatment Effect: 15.647

Grade: 10  Model: Y ~ D + X + O
   Estimated Treatment Effect: 12.283

Grade: 12  Model: Y ~ D
   Estimated Treatment Effect: 17.528

Grade: 12  Model: Y ~ D + X + O
   Estimated Treatment Effect: 14.190
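
To see where the gap between such estimates and the average treatment effect comes from, the naive difference in means (the coefficient in `Y ~ D`) can be split into the average treatment effect, a baseline bias, and a differential treatment effect bias. The sketch below is an illustration under assumptions: the helper `decompose_naive_estimate` and the stand-in data `toy` are hypothetical and only reuse the column names `D`, `Y_0`, and `Y_1` from the simulated panel.

```python
import numpy as np
import pandas as pd


def decompose_naive_estimate(df):
    """Split the naive difference in means into the ATE and two bias terms."""
    treated, control = df[df["D"] == 1], df[df["D"] == 0]
    share_treated = df["D"].mean()

    ate = (df["Y_1"] - df["Y_0"]).mean()
    att = (treated["Y_1"] - treated["Y_0"]).mean()
    atc = (control["Y_1"] - control["Y_0"]).mean()

    # Observed outcomes: Y_1 for the treated, Y_0 for the controls.
    naive = treated["Y_1"].mean() - control["Y_0"].mean()
    baseline_bias = treated["Y_0"].mean() - control["Y_0"].mean()
    effect_bias = (1 - share_treated) * (att - atc)

    return {"naive": naive, "ate": ate,
            "baseline_bias": baseline_bias, "effect_bias": effect_bias}


# Hypothetical stand-in data in which an unobservable u drives the treatment
# decision, the baseline outcome, and the size of the treatment effect.
rng = np.random.default_rng(0)
n = 5_000
u = rng.normal(size=n)
d = rng.binomial(1, 1 / (1 + np.exp(-u)))
y_0 = 100 + u + rng.normal(size=n)
y_1 = y_0 + 10 + u + rng.normal(size=n)
toy = pd.DataFrame({"D": d, "Y_0": y_0, "Y_1": y_1})

parts = decompose_naive_estimate(toy)
for key, value in parts.items():
    print("{:<15}{:8.3f}".format(key, value))
```

The three components add up to the naive estimate exactly in any sample. Applied to one grade of the simulated panel, for example `decompose_naive_estimate(df.loc[(slice(None), 10), :])`, the same helper would show how much of the regression coefficient is due to each source of bias.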



## Causal graphs for complex patterns of self-selection

We want to make sure that complex patterns of self-selection can be represented by directed graphs.

### Separate graphs for separate latent classes

**Groups**

* $G=1$, selection of schools mainly for lifestyle reasons, proximity to home and taste for school cultures

* $G=2$, selection of schools to maximize expected achievement



<img src="material/figure-8-3.png" height=500 width=500 />

Which economic mechanisms are represented by each of the arrows? Why would we expect them to differ across the two groups?


### A single graph that represents all latent classes

<img src="material/figure-8-4.png" height=500 width=500 />


<img src="material/figure-8-5.png" height=500 width=500 />
