# Example Code: Estimating the Causal Effect of Grad School on Income

Code authored by: Shawhin Talebi <br>

Causal Effects via Regression: https://medium.com/towards-data-science/causal-effects-via-regression-28cb58a2fffc <br>
DoWhy Library: https://microsoft.github.io/dowhy/ <br>
Data from: https://archive.ics.uci.edu/ml/datasets/census+income

In [1]:
#https://medium.com/towards-data-science/causal-effects-via-regression-28cb58a2fffc

### Import modules

In [2]:
import pickle
import pandas as pd

import econml
import dowhy

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

### Load data

In [10]:
# read_pickle accepts a path directly, so no file handle is left open
df = pd.read_pickle("df_causal_effects.p").astype(int)
df

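| | age | hasGraduateDegree | greaterThan50k |
|---|---|---|---|
| 0 | 39 | 0 | 0 |
| 1 | 50 | 0 | 0 |
| 2 | 38 | 0 | 0 |
| 3 | 53 | 0 | 0 |
| 5 | 37 | 1 | 0 |
| ... | ... | ... | ... |
| 32556 | 27 | 0 | 0 |
| 32557 | 40 | 0 | 1 |
| 32558 | 58 | 0 | 0 |
| 32559 | 22 | 0 | 0 |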


### Define causal model

In [11]:
model = dowhy.CausalModel(
        data = df,
        treatment= "hasGraduateDegree",
        outcome= "greaterThan50k",
        common_causes="age",
        )

#### Linear Regression
First, we try linear regression, adjusting for `age` as the common cause.

In [12]:
estimand = model.identify_effect(proceed_when_unidentifiable=True)

LR_estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")

linear_regression
{'control_value': 0, 'treatment_value': 1, 'test_significance': None, 'evaluate_effect_strength': False, 'confidence_intervals': False, 'target_units': 'ate', 'effect_modifiers': []}




In [13]:
print(LR_estimate)

*** Causal Estimate ***

## Identified estimand
Estimand type: nonparametric-ate

### Estimand : 1
Estimand name: backdoor
Estimand expression:
         d                                 
────────────────────(E[greaterThan50k|age])
d[hasGraduateDegree]                       
Estimand assumption 1, Unconfoundedness: If U→{hasGraduateDegree} and U→greaterThan50k then P(greaterThan50k|hasGraduateDegree,age,U) = P(greaterThan50k|hasGraduateDegree,age)

## Realized estimand
b: greaterThan50k~hasGraduateDegree+age
Target units: ate

## Estimate
Mean value: 0.2976051357033005



#### Double Machine Learning

Next, we try Double Machine Learning (DML), which is overkill for this simple example, especially since the treatment and outcome variables only take values of 0 or 1.

Note that all the models used in the DML process here are linear regressions; more sophisticated models can be used for more complex problems.

In [14]:
DML_estimate = model.estimate_effect(estimand, 
                                     method_name="backdoor.econml.dml.DML",
                                     method_params={"init_params":{
                                         'model_y':LinearRegression(),
                                         'model_t':LinearRegression(),
                                         'model_final':LinearRegression()
                                                                  },
                                                   "fit_params":{}
                                              })

Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)
The final model has a nonzero intercept for at least one outcome; it will be subtracted, but consider fitting a model without an intercept if possible.


In [15]:
print(DML_estimate)

*** Causal Estimate ***

## Identified estimand
Estimand type: nonparametric-ate

### Estimand : 1
Estimand name: backdoor
Estimand expression:
         d                                 
────────────────────(E[greaterThan50k|age])
d[hasGraduateDegree]                       
Estimand assumption 1, Unconfoundedness: If U→{hasGraduateDegree} and U→greaterThan50k then P(greaterThan50k|hasGraduateDegree,age,U) = P(greaterThan50k|hasGraduateDegree,age)

## Realized estimand
b: greaterThan50k~hasGraduateDegree+age | 
Target units: ate

## Estimate
Mean value: 0.297637229589215
Effect estimates: [0.29763723 0.29763723 0.29763723 ... 0.29763723 0.29763723 0.29763723]
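
The partialling-out idea behind the DML estimate above can be sketched by hand: predict the outcome and the treatment from the confounder, then regress the outcome residuals on the treatment residuals. This is a simplified illustration under stated assumptions, not EconML's implementation: it uses synthetic data (not the census data), two-fold cross-fitting as an arbitrary choice, and a hypothetical helper `fit_linear` for the least-squares sub-models.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares fit with an intercept; returns a predict function."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda x_new: np.column_stack([np.ones(len(x_new)), x_new]) @ coef

rng = np.random.default_rng(1)
n = 20_000

# Synthetic stand-in data with age as the only confounder.
age = rng.integers(18, 80, size=n).astype(float)
treatment = rng.binomial(1, 0.05 + 0.004 * (age - 18))
true_effect = 0.30  # assumed effect, chosen for this illustration
outcome = rng.binomial(1, 0.10 + 0.003 * (age - 18) + true_effect * treatment)

# Cross-fitting: each half is residualized with models fit on the other half.
folds = np.array_split(rng.permutation(n), 2)
y_res = np.empty(n)
t_res = np.empty(n)
for k in range(2):
    held_out, train = folds[k], folds[1 - k]
    model_y = fit_linear(age[train], outcome[train])    # E[outcome | age]
    model_t = fit_linear(age[train], treatment[train])  # E[treatment | age]
    y_res[held_out] = outcome[held_out] - model_y(age[held_out])
    t_res[held_out] = treatment[held_out] - model_t(age[held_out])

# Final stage: slope of outcome residuals on treatment residuals (no intercept).
dml_ate = (t_res @ y_res) / (t_res @ t_res)
print(f"DML-style ATE: {dml_ate:.3f}")
```

With linear sub-models and a single constant effect, this lands close to the plain regression estimate, which is consistent with the two notebook results above being nearly identical.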



#### X-learner
Finally, we try the X-learner, using decision trees for the sub-models.

In [16]:
Xlearner_estimate = model.estimate_effect(estimand,
                                method_name="backdoor.econml.metalearners.XLearner",
                                method_params={"init_params":{
                                                    'models': DecisionTreeRegressor()
                                                    },
                                               "fit_params":{}
                                              })

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().


In [17]:
print(Xlearner_estimate)

*** Causal Estimate ***

## Identified estimand
Estimand type: nonparametric-ate

### Estimand : 1
Estimand name: backdoor
Estimand expression:
         d                                 
────────────────────(E[greaterThan50k|age])
d[hasGraduateDegree]                       
Estimand assumption 1, Unconfoundedness: If U→{hasGraduateDegree} and U→greaterThan50k then P(greaterThan50k|hasGraduateDegree,age,U) = P(greaterThan50k|hasGraduateDegree,age)

## Realized estimand
b: greaterThan50k~hasGraduateDegree+age
Target units: ate

## Estimate
Mean value: 0.20232049378002753
Effect estimates: [ 0.31037666  0.21099013  0.36363636 ...  0.16049383 -0.00342775
  0.2008029 ]
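
For readers curious what an X-learner does with its sub-models, the steps can be sketched by hand as follows. This is a rough illustration, not EconML's code, and several choices are assumptions made for the sketch: the data are synthetic (not the census data), the trees are regularized with `max_depth` and `min_samples_leaf` (the notebook uses an unrestricted `DecisionTreeRegressor()`), and the propensity model is a logistic regression. The helper `make_tree` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
n = 20_000

# Synthetic stand-in data with age as the only covariate.
age = rng.integers(18, 80, size=n)
X = age.reshape(-1, 1).astype(float)
t = rng.binomial(1, 0.05 + 0.004 * (age - 18))
true_effect = 0.30  # assumed effect, chosen for this illustration
y = rng.binomial(1, 0.10 + 0.003 * (age - 18) + true_effect * t).astype(float)

treated, control = t == 1, t == 0

def make_tree():
    # Regularization values are arbitrary choices for this sketch.
    return DecisionTreeRegressor(max_depth=4, min_samples_leaf=50, random_state=0)

# Step 1: separate outcome models for the control and treated groups.
mu0 = make_tree().fit(X[control], y[control])
mu1 = make_tree().fit(X[treated], y[treated])

# Step 2: imputed individual effects, using the other group's model.
d1 = y[treated] - mu0.predict(X[treated])
d0 = mu1.predict(X[control]) - y[control]

# Step 3: effect models fit to the imputed effects.
tau1 = make_tree().fit(X[treated], d1)
tau0 = make_tree().fit(X[control], d0)

# Step 4: blend the two effect models, weighted by the propensity score.
g = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
cate = g * tau0.predict(X) + (1 - g) * tau1.predict(X)
xl_ate = cate.mean()
print(f"X-learner-style ATE: {xl_ate:.3f}")
```

Unlike the two regression-based sketches, this produces a separate effect estimate per unit (`cate`), which is why the notebook's X-learner output above shows varying values in `Effect estimates`.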

