# "Generic Machine Learning" 
> "Code for generic ML" 

- toc:false
- branch: master
- badges: true
- comments: true
- author: Mun Fai Chan
- categories: [fastpages, jupyter]

This notebook provides code for the method in Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments by Victor Chernozhukov, Mert Demirer, Esther Duflo, and Iván Fernández-Val.

https://arxiv.org/abs/1712.04802

### References 
https://github.com/arnaudfrn/MLheterogeneity/blob/dev/src/vb_heterogeneity_FE.R

Author of notebook : Mun Fai Chan

## Data
In this notebook, I analyse data from Dana Burde and Leigh L. Linden's paper, Bringing Education to Afghan Girls: A Randomized Controlled Trial of Village-Based Schools.

The paper can be found here: https://www.jstor.org/stable/43189440?seq=1#metadata_info_tab_contents


<div class="alert alert-block alert-warning">
    
### Developments specific to this data
1. Use small inference t statistic - is this even a valid tool or is this RCT just not very well done? 
2. Cluster standard errors (e.g. wild cluster bootstrap)
3. Read about bootstrap and how this may be incorporated into this code
 </div>

<div class="alert alert-block alert-info">

### Future Developments for Code 

#### High Priority 
1. ~Add in fixed effects~ Done (Check for errors)  
2. Use clustered standard errors
3. ~Calculate propensity score for a bigger set of controls. Use a different package for calculating propensity score / use my own code.~ Not needed if I believe that randomisation is done correctly. 

#### Medium Priority 
1. Fix problem of breaks being too close to one another (ValueError: bins must increase monotonically.)

2. Hyperparameter tuning on ML estimators and figure out the need to do it - e.g. increasing accuracy of nuisance parameters

#### Low Priority 
1. Convert pandas dataframes to LaTeX tables.

#### Long term developments for code 
1. Publish as Python package 
2. Create a website for better documentation

### Other developments for research 
1. Analysis of results for Afghan education dataset (provided fixed effects and standard errors are sorted out) 
2. Randomization checks 
3. Monte Carlo simulation to test veracity and robustness of code
</div>

### Other code 

In this notebook, I have removed the code for data simulation, childcare dataset and some preliminary code for hyperparameter tuning. 

Refer to these in Generic ML 7. 

In [3]:
import import_ipynb
from Generic_ML_script_2 import *
import matplotlib.pyplot as plt

importing Jupyter notebook from Generic_ML_script_2.ipynb


###  Initialisation 

In [1]:
iterations = 100
k = 5 # number of groups for heterogeneity analysis
alpha = 0.05 # significance level 

## Afghan Dataset

In [4]:
df = pd.read_stata("~/OneDrive - London School of Economics/LSE/Year 3/EC331/Afghan/afghanistan_anonymized_data.dta")

In [5]:
df.head(); df.shape

(1804, 40)

In [None]:
## Remove missing observations
#df.isnull().sum()
#df.dropna(inplace=True)
#df.shape

# We remove missing observations later

In [None]:
household = pd.read_stata("~/OneDrive - London School of Economics/LSE/Year 3/EC331/Afghan/HH_data.dta")

In [None]:
household.head()

#### Initialise treatment, outcome, controls

In [6]:
treatment = "treatment"
outcome = "f07_formal_school" # enrollment in fall 2007 

controls = ["f07_heads_child_cnt", "f07_girl_cnt", "f07_age_cnt", 
            "f07_duration_village_cnt", "f07_age_head_cnt", "f07_yrs_ed_head_cnt", 
           "f07_jeribs_cnt", "f07_num_sheep_cnt", 
           "f07_farsi_cnt", "f07_tajik_cnt", "f07_farmer_cnt", "f07_num_ppl_hh_cnt", 
           "f07_nearest_scl"]            

fixed_effects = "clustercode" # = None otherwise

### Stata Controls

global f07_child_controls "f07_heads_child_cnt f07_girl_cnt f07_age_cnt";

global f07_hh_controls "f07_duration_village_cnt f07_farsi_cnt f07_tajik_cnt f07_farmer_cnt f07_age_head_cnt f07_yrs_ed_head_cnt f07_num_ppl_hh_cnt f07_jeribs_cnt f07_num_sheep_cnt f07_nearest_scl"; 

## Fixed Effects

Fixed effects control for unobservables that are shared within a unit. The assumption is that, within a unit, baseline observed and unobserved characteristics are the same for the treated and control groups.

In this context, we assume that observables and unobservables are common within each village group, since randomisation was conducted at that level. We therefore include a dummy variable for each village group, dropping one reference category to avoid perfect multicollinearity.

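As a rough illustration of the dummy-variable construction described above, the sketch below builds one indicator per village group and drops the first as the reference category. It uses `pd.get_dummies` with `drop_first=True` as a stand-in; the notebook's own `create_states` helper lives in `Generic_ML_script_2` and may be implemented differently. The toy data frame is made up.

```python
import pandas as pd

# Hypothetical toy data: six children in three village groups
toy = pd.DataFrame({"clustercode": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]})
toy["clustercode"] = toy["clustercode"].astype("category")

# One 0/1 column per group, minus the first (reference) category
dummies = pd.get_dummies(toy["clustercode"], drop_first=True, dtype=int)
print(dummies)
```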
In [7]:
df['clustercode'] = df['clustercode'].astype('category')
states = create_states(df, fixed_effects)

In [8]:
states.head()

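| clustercode | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 12.0 | 13.0 | 14.0 | 15.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |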


We have 11 units (village groups), of which 5 are treated and 6 are controls.

Hence, the propensity score is 5/11, assuming that randomisation was conducted properly. Table 2 of the original paper supports this assumption.

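To make the arithmetic explicit, here is a minimal sketch of the propensity score as the treated share of village groups. The assignment vector is hypothetical (the first 5 of 11 groups treated) and only mirrors the counts stated above. Note that the share of treated children can differ from this number if groups differ in size.

```python
import pandas as pd

# Hypothetical assignment: 11 village groups, the first 5 treated
groups = pd.DataFrame({
    "clustercode": range(1, 12),
    "treatment": [1] * 5 + [0] * 6,
})

# With group-level randomisation, the propensity score is the treated share of groups
propscore = groups["treatment"].mean()
print(round(propscore, 4))  # 0.4545
```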
In [9]:
cols_to_add = []
cols_to_add.append(treatment)
cols_to_add.append(outcome)
cols_to_add.append(fixed_effects)
cols_to_add.extend(controls)

df2 = df[cols_to_add].copy()  # explicit copy avoids pandas' SettingWithCopyWarning
# df2.join(ps.propscore)
df2["propscore"] = 5/11
df2 = df2.join(states)
df2.head()


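| | treatment | f07_formal_school | clustercode | f07_heads_child_cnt | f07_girl_cnt | f07_age_cnt | f07_duration_village_cnt | f07_age_head_cnt | f07_yrs_ed_head_cnt | f07_jeribs_cnt | ... | 2.0 | 3.0 | 4.0 | 5.0 | 6.0 | 7.0 | 12.0 | 13.0 | 14.0 | 15.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 5.0 | 1.0 | 1.0 | 7.0 | 35.0 | 30.0 | 6.0 | 0.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1.0 | 1.0 | 5.0 | 1.0 | 1.0 | 9.0 | 35.0 | 30.0 | 6.0 | 0.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1.0 | 1.0 | 5.0 | 1.0 | 1.0 | 11.0 | 35.0 | 35.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1.0 | 1.0 | 5.0 | 1.0 | 0.0 | 8.0 | 15.0 | 40.0 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1.0 | 1.0 | 5.0 | 1.0 | 1.0 | 8.0 | 15.0 | 40.0 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |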


In [10]:
df2.isnull().sum()
df2.dropna(inplace = True)
df2.shape

(1562, 27)

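Following the paper, we prefer the ML method with the largest lambdas (Λ for BLP and Λ̄ for GATES): larger values indicate that the ML proxy captures more of the treatment effect heterogeneity. This assumes the last two elements returned by `Generic_ML_single` follow the paper's definitions.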

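For reference, the paper's two selection criteria can be written as Λ = |β₂|² Var(S(Z)) for BLP and Λ̄ = Σₖ γₖ² P(Gₖ) for GATES. The sketch below computes both from made-up numbers. How `Generic_ML_single` computes `summary[-2]` and `summary[-1]` is not shown in this notebook, so the function names and inputs here are assumptions for illustration only.

```python
import numpy as np

def lambda_blp(beta2, s):
    """Lambda = beta2^2 * Var(S), with S the ML proxy for the CATE."""
    return beta2 ** 2 * np.var(s)

def lambda_gates(gammas, group_probs):
    """Lambda-bar = sum_k gamma_k^2 * P(G_k)."""
    gammas = np.asarray(gammas, dtype=float)
    group_probs = np.asarray(group_probs, dtype=float)
    return float(np.sum(gammas ** 2 * group_probs))

s_toy = np.array([0.0, 1.0, 2.0, 3.0])   # made-up proxy values
lam1 = lambda_blp(0.5, s_toy)             # hypothetical HET coefficient of 0.5
lam2 = lambda_gates([0.2, 0.4, 0.6, 0.8, 1.0], [0.2] * 5)
```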
In [11]:
ML_models = ["random_forest", "SVM", "gradient_boost", "neural_net", "ElasticNet"]

for x in ML_models: 
    summary = Generic_ML_single(df2, treatment, outcome, controls, 10, x, 0.05, 5, fixed_effects) 
    print(f"{x}: Lambda1: {summary[-2]} Lambda2: {summary[-1]}")

random_forest: Lambda1: 0.024145708042275543 Lambda2: 0.05903495428620018
SVM: Lambda1: 0.01834523160250017 Lambda2: 0.04158840381381542
gradient_boost: Lambda1: 0.06585517110914439 Lambda2: 0.04358765310984141


ValueError: Bin edges must be unique: array([-0.56793712,  0.43601593,  0.43601593,  0.43601593,  0.54707158,
        1.03804422]).
You can drop duplicate edges by setting the 'duplicates' kwarg

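The error above arises when many predicted treatment effects are tied, so several quantile edges coincide. One possible workaround, sketched here on made-up scores, is to rank the scores with ties broken by order of appearance and cut the ranks instead. The helper `quantile_groups` is hypothetical and not part of `Generic_ML_script_2`. Passing `duplicates="drop"` to `pd.qcut` is the alternative the error message suggests, but it returns fewer than `k` groups.

```python
import numpy as np
import pandas as pd

def quantile_groups(scores, k=5):
    """Assign each score to one of k equally sized groups, breaking ties by order."""
    ranks = pd.Series(scores).rank(method="first")  # ranks are unique even with ties
    return pd.qcut(ranks, q=k, labels=False)

# Made-up predictions with heavy ties, similar in spirit to the failing case
scores = np.array([0.436] * 12 + [-0.568, 0.547, 1.038, 0.2, 0.9, 0.1, 0.3, 0.7])
groups = quantile_groups(scores, k=5)
print(groups.value_counts().sort_index())
```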
In [12]:
res = Generic_ML_single(df2, treatment, outcome, controls, 50, "random_forest", 0.05, 5) 

In [13]:
## BLP 
res[0].round(3)

| | ATE | HET |
|---|---|---|
| coeff | 0.458 | 0.931 |
| se | 0.03 | 0.182 |
| pvalue | 0.0 | 0.0 |
| lower bound | 0.399 | 0.566 |
| upper bound | 0.518 | 1.275 |

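To clarify what the ATE and HET columns measure, here is a minimal sketch of the BLP regression from the paper on simulated data: the outcome is regressed on a baseline proxy, on (D - p), and on (D - p)(S - mean S), with weights 1/(p(1 - p)). The coefficient on (D - p) is the ATE and the coefficient on the interaction is HET. Everything here is simulated, the proxies are set equal to the truth for simplicity, and this is not the code inside `Generic_ML_single`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 5 / 11                      # p: known propensity score
z = rng.normal(size=n)                   # a single simulated covariate
d = rng.binomial(1, p, size=n)           # randomised treatment
s0 = 1.0 + z                             # simulated true CATE, so the true ATE is 1
y = 0.5 * z + d * s0 + rng.normal(size=n)

b = 0.5 * z                              # stand-in for the ML baseline proxy B(Z)
s = s0                                   # stand-in for the ML CATE proxy S(Z)

X = np.column_stack([np.ones(n), b, d - p, (d - p) * (s - s.mean())])
w = np.full(n, 1.0 / (p * (1.0 - p)))    # weights; constant because p is constant
sw = np.sqrt(w)

# Weighted least squares via rescaling rows by sqrt(w)
coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
ate_hat, het_hat = coef[2], coef[3]
print(round(ate_hat, 2), round(het_hat, 2))
```

With a perfect proxy, HET should be close to 1. A HET near 0 means either that there is no heterogeneity or that the proxy fails to pick it up.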

In [14]:
## GATES
res[1].round(3)

| | least affected(20.0%) | most affected(80.0%) | most - least affected |
|---|---|---|---|
| coeff | 0.253 | 0.695 | 0.442 |
| se | 0.069 | 0.07 | 0.098 |
| pvalue | 0.001 | 0.0 | 0.0 |
| lower bound | 0.117 | 0.56 | 0.25 |
| upper bound | 0.386 | 0.834 | 0.634 |

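The GATES table reports average treatment effects within groups sorted by the ML proxy. A minimal sketch of the idea on simulated data: cut the proxy into `k` quantile groups and regress the outcome on a baseline proxy and on (D - p) interacted with each group indicator. The weights are omitted because they are constant when p is constant. The data and the proxies are simulated assumptions, not the notebook's implementation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, p, k = 5000, 5 / 11, 5
z = rng.normal(size=n)
d = rng.binomial(1, p, size=n)
s0 = 1.0 + z                              # simulated true CATE
y = 0.5 * z + d * s0 + rng.normal(size=n)

b, s = 0.5 * z, s0                        # stand-ins for the ML proxies B(Z) and S(Z)
group = np.asarray(pd.qcut(s, q=k, labels=False))  # 0 = least, k - 1 = most affected

# Design matrix: intercept, baseline proxy, and (D - p) times each group indicator
indicators = (group[:, None] == np.arange(k)).astype(float)
X = np.column_stack([np.ones(n), b, (d - p)[:, None] * indicators])
gammas = np.linalg.lstsq(X, y, rcond=None)[0][2:]

print(np.round(gammas, 2))                # one estimated effect per group
print(round(gammas[-1] - gammas[0], 2))   # most minus least affected
```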

In [15]:
## CLAN
res[2].round(3)

| | coeff | se | pvalue | lower bound | upper bound |
|---|---|---|---|---|---|
| Least affected (f07_heads_child_cnt) | 0.917 | 0.022 | 0.0 | 0.87 | 0.965 |
| Most affected (f07_heads_child_cnt) | 0.928 | 0.023 | 0.0 | 0.88 | 0.973 |
| Most - Least affected (f07_heads_child_cnt) | 0.011 | 0.064 | 1.0 | -0.115 | 0.136 |
| Least affected (f07_girl_cnt) | 0.366 | 0.036 | 0.0 | 0.296 | 0.435 |
| Most affected (f07_girl_cnt) | 0.863 | 0.038 | 0.0 | 0.794 | 0.932 |
| Most - Least affected (f07_girl_cnt) | 0.497 | 0.052 | 0.0 | 0.395 | 0.599 |
| Least affected (f07_age_cnt) | 7.395 | 0.131 | 0.0 | 7.135 | 7.664 |
| Most affected (f07_age_cnt) | 8.505 | 0.134 | 0.0 | 8.206 | 8.831 |
| Most - Least affected (f07_age_cnt) | 1.11 | 0.187 | 0.0 | 0.744 | 1.477 |
| Least affected (f07_duration_village_cnt) | 30.178 | 1.241 | 0.0 | 27.722 | 32.629 |

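CLAN compares average covariate values between the most and least affected groups. The sketch below shows the comparison for a single simulated covariate, using the difference in means and an unequal-variance standard error. The data are made up and the column names are hypothetical. The notebook's `CLAN` function may compute its statistics differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
girl = rng.binomial(1, 0.5, size=n)
s = 0.3 + 0.4 * girl + rng.normal(scale=0.2, size=n)   # simulated CATE proxy S(Z)

toy = pd.DataFrame({"girl": girl, "s": s})
toy["group"] = pd.qcut(toy["s"], q=5, labels=False)    # 0 = least, 4 = most affected

least = toy.loc[toy["group"] == 0, "girl"]
most = toy.loc[toy["group"] == 4, "girl"]

# Difference in the share of girls, with an unequal-variance standard error
diff = most.mean() - least.mean()
se = np.sqrt(most.var(ddof=1) / len(most) + least.var(ddof=1) / len(least))
print(round(diff, 3), round(se, 3))
```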

We observe heterogeneity in fall 2007 enrollment by gender (girls), age and distance to the nearest school.

The ATE estimate is also consistent with the estimate in the original paper.

###### In particular, the most affected group has an average treatment effect of 0.695; its members are more likely to be girls, are older on average, and live further away from school.

### Girls vs Boys 

Let me repeat this for girls and boys separately to see if we get results similar to the paper's.

In [16]:
df2_girls = df2[df2["f07_girl_cnt"] ==1]
df2_boys = df2[df2["f07_girl_cnt"] ==0]

In [17]:
controls2 = ["f07_heads_child_cnt", "f07_age_cnt", 
            "f07_duration_village_cnt", "f07_age_head_cnt", "f07_yrs_ed_head_cnt", 
           "f07_jeribs_cnt", "f07_num_sheep_cnt", 
           "f07_farsi_cnt", "f07_tajik_cnt", "f07_farmer_cnt", "f07_num_ppl_hh_cnt", 
           "f07_nearest_scl"] # dropped control for gender

Generic_ML_single(df2_girls, treatment, outcome, controls2, 50, "SVM", 0.05, 5, fixed_effects)

[                      ATE       HET
 coeff        5.435377e-01  0.370371
 se           4.273026e-02  0.261548
 pvalue       2.660688e-30  0.311720
 lower bound  4.604671e-01 -0.141616
 upper bound  6.274403e-01  0.906550,
              least affected(20.0%)  most affected(80.0%)  \
 coeff                        0.468                 0.631   
 se                           0.102                 0.098   
 pvalue                       0.000                 0.000   
 lower bound                  0.268                 0.446   
 upper bound                  0.664                 0.816   
 
              most - least affected  
 coeff                        0.163  
 se                           0.146  
 pvalue                       0.499  
 lower bound                 -0.124  
 upper bound                  0.450  ,
                                                    coeff     se  pvalue  \
 Least affected (f07_heads_child_cnt)               0.945  0.031   0.000   
 Most affected (f07_heads_ch

Among girls, CLAN suggests further heterogeneity by the age of the household head, but there is little theoretical reason to expect it, and the HET coefficient (0.370, p = 0.312) is not significant. The ATE for girls is 0.544. Splitting the data in two shrinks the sample behind each estimate and therefore reduces power, so these subgroup results should be read with caution.

We no longer see the heterogeneity by age and distance to school that appeared in the full-sample analysis and that was consistent with the explanation in the original paper.

In [18]:
Generic_ML_single(df2_boys, treatment, outcome, controls2, 50, "SVM", 0.05, 5)

[                      ATE       HET
 coeff        4.071911e-01  0.266605
 se           4.406001e-02  0.336564
 pvalue       2.151484e-18  0.857074
 lower bound  3.214719e-01 -0.380143
 upper bound  4.923889e-01  0.929532,
              least affected(20.0%)  most affected(80.0%)  \
 coeff                        0.364                 0.479   
 se                           0.111                 0.104   
 pvalue                       0.003                 0.000   
 lower bound                  0.144                 0.273   
 upper bound                  0.585                 0.684   
 
              most - least affected  
 coeff                        0.115  
 se                           0.163  
 pvalue                       0.608  
 lower bound                 -0.205  
 upper bound                  0.434  ,
                                                    coeff     se  pvalue  \
 Least affected (f07_heads_child_cnt)               0.964  0.031   0.000   
 Most affected (f07_heads_ch

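We observe little heterogeneity among boys: the HET coefficient is not significant (p = 0.857), and neither is the most-minus-least-affected difference of 0.115 (p = 0.608).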

## Outcome: Fall 2007 test scores


In [31]:
outcome2 = "f07_both_norma_total" # test scores

controls = ["f07_heads_child_cnt", "f07_girl_cnt", "f07_age_cnt", 
            "f07_duration_village_cnt", "f07_age_head_cnt", "f07_yrs_ed_head_cnt", 
           "f07_jeribs_cnt", "f07_num_sheep_cnt", 
           "f07_farsi_cnt", "f07_tajik_cnt", "f07_farmer_cnt", "f07_num_ppl_hh_cnt", 
           "f07_nearest_scl"]  

cols_to_add = []
cols_to_add.append(treatment)
cols_to_add.append(outcome2)
cols_to_add.extend(controls)
cols_to_add.append(fixed_effects)

df3 = df[cols_to_add]
df3 = df3.join(states)
# df.join(ps.propscore)
df3.loc[:,"propscore"] = 5/11
df3.dropna(inplace = True)
df3.shape

(1445, 27)

In [32]:
df3.columns

Index([               'treatment',     'f07_both_norma_total',
            'f07_heads_child_cnt',             'f07_girl_cnt',
                    'f07_age_cnt', 'f07_duration_village_cnt',
               'f07_age_head_cnt',      'f07_yrs_ed_head_cnt',
                 'f07_jeribs_cnt',        'f07_num_sheep_cnt',
                  'f07_farsi_cnt',            'f07_tajik_cnt',
                 'f07_farmer_cnt',       'f07_num_ppl_hh_cnt',
                'f07_nearest_scl',              'clustercode',
                              2.0,                        3.0,
                              4.0,                        5.0,
                              6.0,                        7.0,
                             12.0,                       13.0,
                             14.0,                       15.0,
                      'propscore'],
      dtype='object')

In [33]:
res2 = Generic_ML_single(df3, treatment, outcome2, controls, 60, "random_forest", 0.05, 5, fixed_effects)

In [34]:
res2[0].round(3)

| | ATE | HET |
|---|---|---|
| coeff | 0.59 | 0.509 |
| se | 0.064 | 0.15 |
| pvalue | 0.0 | 0.001 |
| lower bound | 0.465 | 0.226 |
| upper bound | 0.714 | 0.8 |


In [36]:
res2[1].round(3)

| | least affected(20.0%) | most affected(80.0%) | most - least affected |
|---|---|---|---|
| coeff | 0.316 | 0.959 | 0.644 |
| se | 0.138 | 0.149 | 0.203 |
| pvalue | 0.037 | 0.0 | 0.004 |
| lower bound | 0.055 | 0.671 | 0.246 |
| upper bound | 0.583 | 1.247 | 1.042 |


In [38]:
res2[2].round(3)

| | coeff | se | pvalue | lower bound | upper bound |
|---|---|---|---|---|---|
| Least affected (f07_heads_child_cnt) | 0.913 | 0.022 | 0.0 | 0.872 | 0.955 |
| Most affected (f07_heads_child_cnt) | 0.942 | 0.023 | 0.0 | 0.891 | 0.988 |
| Most - Least affected (f07_heads_child_cnt) | 0.029 | 0.064 | 0.757 | -0.096 | 0.154 |
| Least affected (f07_girl_cnt) | 0.211 | 0.034 | 0.0 | 0.147 | 0.276 |
| Most affected (f07_girl_cnt) | 0.826 | 0.037 | 0.0 | 0.752 | 0.899 |
| Most - Least affected (f07_girl_cnt) | 0.614 | 0.05 | 0.0 | 0.516 | 0.712 |
| Least affected (f07_age_cnt) | 8.516 | 0.123 | 0.0 | 8.271 | 8.778 |
| Most affected (f07_age_cnt) | 8.815 | 0.133 | 0.0 | 8.57 | 9.074 |
| Most - Least affected (f07_age_cnt) | 0.298 | 0.18 | 0.0 | -0.055 | 0.652 |
| Least affected (f07_duration_village_cnt) | 28.817 | 1.28 | 0.0 | 26.275 | 31.3 |


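Here, we see heterogeneity by gender (girls) and, more weakly, by age: the confidence interval for the age difference, [-0.055, 0.652], includes zero.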

In [35]:
res3 = Generic_ML_single(df3, treatment, outcome2, controls, 100, "SVM", 0.05, 5, fixed_effects)

In [37]:
res3[0].round(3)  # SVM does not detect heterogeneity: HET is negative and not significant

| | ATE | HET |
|---|---|---|
| coeff | 0.576 | -0.178 |
| se | 0.074 | 0.181 |
| pvalue | 0.0 | 0.619 |
| lower bound | 0.433 | -0.542 |
| upper bound | 0.719 | 0.17 |


## Outcome: Spring 2008 Enrollment

In [41]:
outcome3 = "s08_formal_school" # enrollment in Spring 2008

controls_07 = ["f07_heads_child_cnt", "f07_girl_cnt", "f07_age_cnt", 
            "f07_duration_village_cnt", "f07_age_head_cnt", "f07_yrs_ed_head_cnt", 
           "f07_jeribs_cnt", "f07_num_sheep_cnt", 
           "f07_farsi_cnt", "f07_tajik_cnt", "f07_farmer_cnt", "f07_num_ppl_hh_cnt", 
           "f07_nearest_scl"]  

controls_08 = ["f07_heads_child_cnt", "f07_girl_cnt", "s08_age_cnt", 
            "s08_duration_village_cnt", "s08_age_head_cnt", "s08_yrs_ed_head_cnt", 
           "s08_jeribs_cnt", "s08_num_sheep_cnt", 
           "s08_farsi_cnt", "s08_tajik_cnt", "s08_farmer_cnt", "s08_num_ppl_hh_cnt", 
           "s08_nearest_scl"] 

# Can try combining both - not sure if useful or increasing variance

cols_to_add = []
cols_to_add.append(treatment)
cols_to_add.append(outcome3)
cols_to_add.extend(controls_08)
cols_to_add.append(fixed_effects)

df4 = df[cols_to_add]
df4 = df4.join(states)
# df.join(ps.propscore)
df4.loc[:,"propscore"] = 5/11
df4.dropna(inplace = True)
df4.shape

(1307, 27)

In [42]:
Generic_ML_single(df4, treatment, outcome3, controls_08, 60, "random_forest", 0.05, 5, fixed_effects)

[                      ATE           HET
 coeff        5.088386e-01  9.974679e-01
 se           3.218103e-02  1.938083e-01
 pvalue       3.769574e-47  5.765371e-07
 lower bound  4.450589e-01  6.344451e-01
 upper bound  5.728257e-01  1.398634e+00,
              least affected(20.0%)  most affected(80.0%)  \
 coeff                        0.277                 0.765   
 se                           0.073                 0.073   
 pvalue                       0.000                 0.000   
 lower bound                  0.135                 0.622   
 upper bound                  0.417                 0.910   
 
              most - least affected  
 coeff                        0.488  
 se                           0.104  
 pvalue                       0.000  
 lower bound                  0.284  
 upper bound                  0.692  ,
                                                    coeff     se  pvalue  \
 Least affected (f07_heads_child_cnt)               0.916  0.024   0.000   
 Mos

## Outcome: Spring 2008 Test Scores

In [43]:
outcome4 = "s08_both_norma_total" # test scores in Spring 2008

controls_07 = ["f07_heads_child_cnt", "f07_girl_cnt", "f07_age_cnt", 
            "f07_duration_village_cnt", "f07_age_head_cnt", "f07_yrs_ed_head_cnt", 
           "f07_jeribs_cnt", "f07_num_sheep_cnt", 
           "f07_farsi_cnt", "f07_tajik_cnt", "f07_farmer_cnt", "f07_num_ppl_hh_cnt", 
           "f07_nearest_scl"]  

controls_08 = ["f07_heads_child_cnt", "f07_girl_cnt", "s08_age_cnt", 
            "s08_duration_village_cnt", "s08_age_head_cnt", "s08_yrs_ed_head_cnt", 
           "s08_jeribs_cnt", "s08_num_sheep_cnt", 
           "s08_farsi_cnt", "s08_tajik_cnt", "s08_farmer_cnt", "s08_num_ppl_hh_cnt", 
           "s08_nearest_scl"] 

# Can try combining both - not sure if useful or increasing variance

cols_to_add = []
cols_to_add.append(treatment)
cols_to_add.append(outcome4)
cols_to_add.extend(controls_08)
cols_to_add.append(fixed_effects)

df5 = df[cols_to_add]
df5 = df5.join(states)
# df.join(ps.propscore)
df5.loc[:,"propscore"] = 5/11
df5.dropna(inplace = True)
df5.shape

(1247, 27)

In [44]:
Generic_ML_single(df5, treatment, outcome4, controls_08, 60, "random_forest", 0.05, 5, fixed_effects)

[                      ATE       HET
 coeff        7.101287e-01  0.456645
 se           7.150828e-02  0.197962
 pvalue       6.489352e-21  0.039888
 lower bound  5.706007e-01  0.069270
 upper bound  8.487038e-01  0.848259,
              least affected(20.0%)  most affected(80.0%)  \
 coeff                        0.475                 0.959   
 se                           0.157                 0.165   
 pvalue                       0.005                 0.000   
 lower bound                  0.168                 0.645   
 upper bound                  0.784                 1.270   
 
              most - least affected  
 coeff                        0.485  
 se                           0.228  
 pvalue                       0.059  
 lower bound                  0.038  
 upper bound                  0.931  ,
                                                    coeff     se  pvalue  \
 Least affected (f07_heads_child_cnt)               0.915  0.023   0.000   
 Most affected (f07_heads_ch

## Draft: Editing functions for fixed effects

Including fixed effects amounts to running the same regression with additional controls: one dummy per unit, with the first dummy dropped. Everywhere that I run a regression with controls, I should therefore include the fixed effects as well.

These areas are: 
1. ML estimator - need to include fixed effects
2. CLAN - not sure if I have to?

Additionally, I want to include clustered standard errors. 

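As a starting point for the clustered standard errors mentioned above, here is a minimal sketch of the cluster-robust sandwich estimator (CR0, with no small-sample correction) in plain NumPy. The function name and the simulated data are assumptions, and nothing here comes from `Generic_ML_script_2`. With only 11 village groups, asymptotic cluster-robust errors may be unreliable, which is why the to-do list mentions the wild cluster bootstrap.

```python
import numpy as np

def cluster_robust_se(X, y, clusters):
    """OLS coefficients with cluster-robust (CR0) standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        idx = clusters == g
        score = X[idx].T @ resid[idx]        # sum of x_i * u_i within cluster g
        meat += np.outer(score, score)
    cov = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))

# Simulated example: 11 groups, treatment assigned at the group level
rng = np.random.default_rng(0)
n_clusters, n_per = 11, 100
clusters = np.repeat(np.arange(n_clusters), n_per)
treat = (clusters < 5).astype(float)                  # 5 treated groups, 6 controls
shock = rng.normal(size=n_clusters)[clusters]         # shock shared within a group
y = 0.5 * treat + shock + rng.normal(size=clusters.size)
X = np.column_stack([np.ones_like(treat), treat])

beta, se_cluster = cluster_robust_se(X, y, clusters)
# Treating every row as its own cluster gives heteroskedasticity-robust (HC0) errors
_, se_iid = cluster_robust_se(X, y, np.arange(clusters.size))
print(np.round(se_cluster, 3), np.round(se_iid, 3))
```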
In [None]:
fixed_effects = None
fixed_effects

In [None]:
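def ML_estimator(main, aux, model, treatment, outcome, controls, fixed_effects=None):
    # Draft: fall back to the estimator without fixed effects when none are given
    if fixed_effects is None:
        return ML_estimator_nofe(main, aux, model, treatment, outcome, controls)
    raise NotImplementedError("fixed-effects branch not written yet")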

In [None]:
## ML estimator
# Initialization
main, aux = sklearn.model_selection.train_test_split(df3, train_size = 5/11)

In [None]:
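# Features: treatment, controls and the fixed-effect dummies
cols = [treatment] + controls + list(states.columns)
aux0 = aux[aux[treatment] == 0]
aux1 = aux[aux[treatment] == 1]
# df3 was built with outcome2 (test scores), so use it as the target here
X_aux0 = aux0.loc[:, cols]
y_aux0 = aux0[outcome2]
X_aux1 = aux1.loc[:, cols]
y_aux1 = aux1[outcome2]

X_main = main.loc[:, cols]
y_main = main[outcome2]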

In [None]:
main2 = random_forest(main, X_aux0, y_aux0, X_main, X_aux1, y_aux1)

In [None]:
CLAN(main2, treatment, controls, alpha)