# Example Code: Estimating the Causal Effect of Grad School on Income

Code authored by: Shawhin Talebi <br>

Causal Effects via Regression: blog coming soon <br>
DoWhy Library: https://microsoft.github.io/dowhy/ <br>
Data from: https://archive.ics.uci.edu/ml/datasets/census+income

### Import modules

In [1]:
import pickle

import econml
import dowhy

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

### Load data

In [2]:
# Load the pickled DataFrame and cast every column to int (0/1 flags and age)
with open("df_causal_effects.p", "rb") as f:
    df = pickle.load(f).astype(int)

### Define causal model

In [3]:
model = dowhy.CausalModel(
    data=df,
    treatment="hasGraduateDegree",
    outcome="greaterThan50k",
    common_causes="age",
)

#### Linear Regression
First, we estimate the effect with linear regression, adjusting for age as the only common cause.

In [4]:
estimand = model.identify_effect(proceed_when_unidentifiable=True)

LR_estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")

linear_regression
{'control_value': 0, 'treatment_value': 1, 'test_significance': None, 'evaluate_effect_strength': False, 'confidence_intervals': False, 'target_units': 'ate', 'effect_modifiers': []}


In [5]:
print(LR_estimate)

*** Causal Estimate ***

## Identified estimand
Estimand type: nonparametric-ate

### Estimand : 1
Estimand name: backdoor
Estimand expression:
         d                                 
────────────────────(E[greaterThan50k|age])
d[hasGraduateDegree]                       
Estimand assumption 1, Unconfoundedness: If U→{hasGraduateDegree} and U→greaterThan50k then P(greaterThan50k|hasGraduateDegree,age,U) = P(greaterThan50k|hasGraduateDegree,age)

## Realized estimand
b: greaterThan50k~hasGraduateDegree+age
Target units: ate

## Estimate
Mean value: 0.2976051357032903



#### Double Machine Learning

Next, we try Double ML, which is overkill for this simple example, especially since the treatment and outcome variables only take the values 0 or 1.

Note that all the models used in the DML process are linear regressions in this example; more sophisticated models can be used for more complex problems.

In [6]:
DML_estimate = model.estimate_effect(
    estimand,
    method_name="backdoor.econml.dml.DML",
    method_params={
        "init_params": {
            "model_y": LinearRegression(),
            "model_t": LinearRegression(),
            "model_final": LinearRegression(),
        },
        "fit_params": {},
    },
)

The final model has a nonzero intercept for at least one outcome; it will be subtracted, but consider fitting a model without an intercept if possible.


In [7]:
print(DML_estimate)

*** Causal Estimate ***

## Identified estimand
Estimand type: nonparametric-ate

### Estimand : 1
Estimand name: backdoor
Estimand expression:
         d                                 
────────────────────(E[greaterThan50k|age])
d[hasGraduateDegree]                       
Estimand assumption 1, Unconfoundedness: If U→{hasGraduateDegree} and U→greaterThan50k then P(greaterThan50k|hasGraduateDegree,age,U) = P(greaterThan50k|hasGraduateDegree,age)

## Realized estimand
b: greaterThan50k~hasGraduateDegree+age | 
Target units: ate

## Estimate
Mean value: 0.29773849929894924
Effect estimates: [0.2977385 0.2977385 0.2977385 ... 0.2977385 0.2977385 0.2977385]
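
The Double ML estimate (0.2977) is nearly identical to the linear regression estimate (0.2976). That is what we would expect when every stage is linear: removing the part of the treatment and the outcome explained by age, then regressing residual on residual, recovers the same coefficient as the single regression (the Frisch-Waugh-Lovell result), up to small differences from cross-fitting. Below is a minimal sketch of that residual-on-residual procedure on synthetic data. The two-fold split, the helper `fit_predict_linear`, and the data-generating numbers are assumptions made for illustration; this is not EconML's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Synthetic data (assumed): age confounds treatment t and outcome y; true effect is 0.30.
age = rng.integers(18, 80, size=n).astype(float)
t = (rng.random(n) < 0.05 + 0.004 * (age - 18)).astype(float)
y = (rng.random(n) < 0.10 + 0.30 * t + 0.004 * (age - 18)).astype(float)


def fit_predict_linear(x_train, target_train, x_test):
    """Hypothetical helper: fit target ~ 1 + x on the training fold, predict on the test fold."""
    A = np.column_stack([np.ones_like(x_train), x_train])
    beta, *_ = np.linalg.lstsq(A, target_train, rcond=None)
    return beta[0] + beta[1] * x_test


# Two-fold cross-fitting: nuisance models are fit on one half
# and used to residualize the other half.
folds = rng.permutation(n) % 2
y_res = np.empty(n)
t_res = np.empty(n)
for k in (0, 1):
    train, test = folds != k, folds == k
    y_res[test] = y[test] - fit_predict_linear(age[train], y[train], age[test])
    t_res[test] = t[test] - fit_predict_linear(age[train], t[train], age[test])

# Final stage: regress outcome residuals on treatment residuals (no intercept).
theta_dml = (t_res @ y_res) / (t_res @ t_res)

# Single regression y ~ 1 + t + age for comparison.
X = np.column_stack([np.ones(n), t, age])
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(f"DML-style estimate: {theta_dml:.4f}")
print(f"OLS coefficient:    {theta_ols:.4f}")
```

The payoff of Double ML shows up when the nuisance models for the outcome and treatment are flexible learners rather than linear regressions, which is not the case in this example.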



#### X-learner
Finally, we try the X-learner, using decision trees for its sub-models.

In [8]:
Xlearner_estimate = model.estimate_effect(
    estimand,
    method_name="backdoor.econml.metalearners.XLearner",
    method_params={
        "init_params": {
            "models": DecisionTreeRegressor(),
        },
        "fit_params": {},
    },
)

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


In [9]:
print(Xlearner_estimate)

*** Causal Estimate ***

## Identified estimand
Estimand type: nonparametric-ate

### Estimand : 1
Estimand name: backdoor
Estimand expression:
         d                                 
────────────────────(E[greaterThan50k|age])
d[hasGraduateDegree]                       
Estimand assumption 1, Unconfoundedness: If U→{hasGraduateDegree} and U→greaterThan50k then P(greaterThan50k|hasGraduateDegree,age,U) = P(greaterThan50k|hasGraduateDegree,age)

## Realized estimand
b: greaterThan50k~hasGraduateDegree+age
Target units: ate

## Estimate
Mean value: 0.20232049389358914
Effect estimates: [ 0.31037666  0.21099013  0.36363636 ...  0.16049383 -0.00342775
  0.2008029 ]
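
Unlike the two previous estimators, the X-learner returns a different effect estimate for each unit. The sketch below walks through its three stages on synthetic data. To keep it dependency-free, the sub-models are per-age group means (the hypothetical helper `fit_group_means`), which is an assumption standing in for the decision trees used above. The data-generating numbers are also assumed, and this is not EconML's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40_000

# Synthetic data (assumed): age confounds treatment t and outcome y; true effect is 0.30.
age = rng.integers(18, 80, size=n)
t = (rng.random(n) < 0.05 + 0.004 * (age - 18)).astype(int)
y = (rng.random(n) < 0.10 + 0.30 * t + 0.004 * (age - 18)).astype(float)


def fit_group_means(x, target):
    """Hypothetical sub-model: predict the mean of `target` within each distinct value of x."""
    overall = target.mean()  # fallback for values of x not seen during fitting
    table = {v: target[x == v].mean() for v in np.unique(x)}
    return lambda x_new: np.array([table.get(v, overall) for v in x_new])


treated, control = t == 1, t == 0

# Stage 1: a separate outcome model for each treatment arm.
mu0 = fit_group_means(age[control], y[control])
mu1 = fit_group_means(age[treated], y[treated])

# Stage 2: impute unit-level effects with the other arm's model, then model them.
d1 = y[treated] - mu0(age[treated])   # treated: observed minus predicted control outcome
d0 = mu1(age[control]) - y[control]   # control: predicted treated outcome minus observed
tau1 = fit_group_means(age[treated], d1)
tau0 = fit_group_means(age[control], d0)

# Stage 3: blend the two effect models with the propensity score g(x) = P(T=1 | x).
g = fit_group_means(age, t.astype(float))(age)
cate = g * tau0(age) + (1 - g) * tau1(age)
ate_x = cate.mean()

print(f"X-learner-style ATE: {ate_x:.3f}")
```

With per-age group means as the sub-models, the two stage-2 models agree at every age, so the propensity weighting has no effect in this sketch. The blend matters when the sub-models smooth across covariate values and the two arms are fit with different amounts of data.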

