# INFO371 Problem Set: Estimating Causal Effects with BA and CS methods

## Instructions

* Please write clearly! Answer each question so that the results are still readable even if the code chunks are removed from your document.
* Discussing the solutions and getting help is all right, but you have to solve the problems on your own. Do not copy-paste from others!
* Make sure you show your work!

---

## Introduction

For this assignment, you will be using data from the Progresa program, a government social assistance program in Mexico. This program, as well as the details of its impact, is described in the paper "School subsidies for the poor: evaluating the Mexican Progresa poverty program" by Paul Schultz. Please familiarize yourself with the PROGRESA program before beginning this problem set, so you have a rough sense of where the data come from and how they were generated. If you proceed into the problem set without understanding Progresa or the data, it will be very difficult!

The goal of this problem set is to make you familiar with the simple
estimators that you are learning in class (cross-sectional and before-after), and to use them to measure the impact of Progresa on secondary school enrollment rates.
Your task is to estimate the impact of Progresa subsidies on school attendance.  Note: this means estimating the causal effect.

## Background - About the PROGRESA program
In the 1990s, the Mexican government decided to improve the school attendance
of poor rural children by introducing a cash subsidy to families.
However, families were only able to claim the money if a) they
were considered poor, and b) their
children attended school.  Most importantly in the current
context, the subsidy was introduced in a randomized manner, where
initially only certain villages were eligible for subsidies.  In this
problem set we analyze the time period in which the subsidies
essentially formed a randomized
control trial.

The timeline of the program was:

* Baseline survey conducted in 1997
* Intervention: subsidies for _poor households_ in _treatment villages_ begin in 1998; wave 1 data collected in 1998
* Wave 2 data collected in 1999
* Evaluation ends in 2000, at which point all villages become eligible for the subsidy

Note that:
* the Progresa program was only available for poor families, so in the analysis below we only consider poor households
* The central variable here is _sc_, the dummy variable that indicates whether the child attended school

When you are ready, download the progresa-sample.csv data from Canvas. The dataset is from actual data collected to evaluate the impact of the Progresa program. In this file, each row corresponds to an observation taken for a given child for a given year. There are two years of data (1997 and 1998), and just under 40,000 children who are
surveyed in both years. The table below describes the variables
in the dataset: 


|Variable Name | Description| 
|:--- | :--- |
|year | year in which data is collected |
|sex | male = 1|
|indig | indigenous = 1|
|dist\_sec| nearest distance to a secondary school|
|sc | enrolled in school in year of survey (=1) |
|grc | grade enrolled |
|fam\_n | family size|
|min\_dist | min distance to an urban center|
|dist\_cap | min distance to the capital|
|poor | poor = "pobre", not poor = "no pobre"|
|progresa | treatment = "basal", control = "0"|
|hohedu | years of schooling of head of household|
|hohwag | monthly wages of head of household|
|welfare\_index| welfare index used to classify poor|
|hohsex | gender of head of household (male=1)|
|hohage | age of head of household|
|age | years old|
|folnum | individual id|
|village | village id|
|sc97 | enrolled in school in 1997 (=1)|

---
You can also view some summary statistics about the dataset, generated by the code below: 

In [4]:
import pandas as pd

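# Load the data; each row is one child-year observation.
prog_df = pd.read_csv('progresa-sample.csv')
# Drop the year and identifier columns, whose summary statistics are not meaningful.
prog_sub = prog_df.drop(columns=['year', 'folnum', 'village'])
# Mean and standard deviation of each numeric variable, sorted by variable name.
prog_sub.describe().loc[['mean','std']].T.sort_index()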

|Variable | mean | std|
|:--- | :--- | :--- |
|age | 11.36646 | 3.167744|
|dist\_cap | 147.674452 | 76.063134|
|dist\_sec | 2.41891 | 2.234109|
|fam\_n | 7.215715 | 2.3529|
|grc | 3.963537 | 2.499063|
|grc97 | 3.705372 | 2.572387|
|hohage | 44.436717 | 11.620372|
|hohedu | 2.768104 | 2.656106|
|hohsex | 0.925185 | 0.263095|
|hohwag | 586.985312 | 788.133664|



## Graphical Exploration (20 pts)
Before we get into regression, it is worthwhile to have a visual image of the data.

1. Load the data (note - this is already done for you in the cell above. You may use that dataframe if you would like). How many cases do we have?  How many different villages?  How many cases are poor and in progresa villages?



2. (4pt) Compute the average schooling rate of poor households by village (you can use village id as the [grouping](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) variable) for 1997 and 1998. Compare it between progresa and non-progresa villages in 1997 and 1998. (Here just report the averages; you'll do a graphical comparison of the distributions below.) 

    _Note: this asks you to compare the schooling rate __by village__, i.e. you need a single number (avg schooling rate) for each village.  Thereafter, you should compare __averages of village averages__._


3. (4pt) Display the distribution of village-average schooling rates before the program (1997), separately for progresa/non-progresa villages. Mark the sample average rate (separately for progresa/non-progresa villages) on the figure. Attempt to overlay the two density estimates.


4. (4pt) Repeat for the program year (1998)


5. Comment on the results.  Do the distributions look similar?  Do you see the schooling rate in progresa villages increasing over that of the control villages?

In [None]:
# code goes here. 
# NOTE -- you may need to add multiple cells to answer all of the questions.
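
The "averages of village averages" idea from question 2 can be sketched on a toy table. The data below are made up and only borrow the column names from the assignment file; the year coding (1997/1998) is an assumption, so check how `year` is actually coded in `progresa-sample.csv` before reusing the filter.

```python
import pandas as pd

# Hypothetical toy data (NOT the real Progresa sample), same column names.
toy = pd.DataFrame({
    "village":  [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "year":     [1997, 1997, 1998, 1998] * 3,   # assumed year coding
    "poor":     ["pobre"] * 12,
    "progresa": ["basal"] * 4 + ["0"] * 8,
    "sc":       [1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0],
})

# Basic counts: number of cases and number of distinct villages.
print(len(toy), toy["village"].nunique())

# Step 1: one schooling rate per village and year (poor households only).
poor = toy[toy["poor"] == "pobre"]
village_avg = (
    poor.groupby(["progresa", "year", "village"])["sc"]
    .mean()
    .rename("village_sc")
    .reset_index()
)

# Step 2: average the village-level rates within each group and year.
avg_of_avgs = (
    village_avg.groupby(["progresa", "year"])["village_sc"]
    .mean()
    .unstack("year")
)
print(avg_of_avgs)
```

Averaging in two steps weights every village equally, which is generally not the same as pooling all children when villages differ in size.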

## Measuring impact

Next, we measure the impact of Progresa.  We do it in two ways: first
using the cross-sectional estimator, and thereafter using the before-after estimator.
We implement each estimator in three ways:

* a simple table of averages
* simple regression where we only introduce control/treatment group (or time in case of before-after estimator)
* multiple regression
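
As a compact reference, the two estimators described in the subsections below can be written as differences of group means of _sc_ among poor households. The notation ($\hat{\tau}$, $\overline{sc}$, $D_i$) is introduced here only for illustration and is not taken from the lecture notes.

$$
\hat{\tau}_{CS} = \overline{sc}_{\text{progresa},\,1998} - \overline{sc}_{\text{control},\,1998},
\qquad
\hat{\tau}_{BA} = \overline{sc}_{\text{progresa},\,1998} - \overline{sc}_{\text{progresa},\,1997}
$$

In the simple regression version, $sc_i = \beta_0 + \beta_1 D_i + \varepsilon_i$ with a single dummy $D_i$ (treatment status for CS, the after-period indicator for BA), the OLS estimate $\hat{\beta}_1$ equals the corresponding difference in means.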

### Cross-sectional (CS) estimator (40pts)
The CS estimator compares data for the treated (poor in progresa villages) and non-treated controls (poor in non-progresa villages) after the treatment (i.e. 1998).  We start with a simple table.  

1.  What is the identifying assumption behind this CS estimator?  Do you think it is satisfied here?  Explain!
  
  Hint: see [lecture notes](https://otoomet.bitbucket.io/machineLearning.pdf/) Ch 5.5.1 "Counterfactual and Identifying Assumption" and 5.5.2 "A Few Popular Estimators".


2. (3pt) Why do we look at only poor households, and only year 1998?


3. (4pt) Compute the average schooling rate (variable _sc_) for treated and non-treated controls after the program.  Compare these means.  How big of an effect do you find?  
  
  
4. (5pt) Based on this number, can you claim progresa was effective (i.e. it increased schooling rate)?  Interpret the number (in terms of percentage points increase or decrease).

In [None]:
# code goes here 
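
A minimal sketch of the "table of averages" version of the CS estimator, using made-up toy data rather than the real file. The column names and the string codes `"pobre"`, `"basal"` and `"0"` follow the variable table above; the year coding is an assumption.

```python
import pandas as pd

# Hypothetical toy data with the assignment's column names; values are made up.
toy = pd.DataFrame({
    "year":     [1998] * 8 + [1997] * 2,          # assumed year coding
    "poor":     ["pobre"] * 6 + ["no pobre"] * 2 + ["pobre"] * 2,
    "progresa": ["basal", "basal", "basal", "0", "0", "0",
                 "basal", "0", "basal", "0"],
    "sc":       [1, 1, 0, 1, 0, 0, 1, 1, 0, 1],
})

# Keep only poor households observed after the program started.
cs = toy[(toy["poor"] == "pobre") & (toy["year"] == 1998)]

# Mean enrollment by treatment status, and their difference.
means = cs.groupby("progresa")["sc"].mean()
cs_effect = means["basal"] - means["0"]

print(means)
print(f"CS estimate: {cs_effect:.3f} ({100 * cs_effect:.1f} percentage points)")
```

Because _sc_ is a 0/1 variable, its group mean is an enrollment rate, so the difference reads directly in percentage points once multiplied by 100.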

Reading the result from the table is an easy and intuitive approach, but it provides neither standard errors nor any measure of statistical significance.  It is also
hard to include other relevant characteristics that may influence the
effect size.  Linear regression helps here.

5. (5pt) Implement the CS estimator using linear regression: regress the outcome after treatment on the treatment indicator.  Do not include any other controls (except the intercept). 

      If you know how to do it then go ahead in your own way.  But if you
      need a little help then you can follow these steps:

      1. Ensure you are only comparing the relevant groups: the control group that was not treated, and the treatment group that was actually treated.
      2. Create a dummy variable _T_ that tells if someone is in the treatment or control group.
      3. Regress the outcome on _T_.
  

6. (3pt) Compare the results.  You should get exactly the same number as when just comparing the group means.


7. (2pt) Is the effect statistically significant? 

In [None]:
#code goes here
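
The steps above (build a dummy _T_, regress the outcome on it) can be sketched with plain NumPy on made-up numbers. This is an illustration under classical homoskedastic OLS assumptions, not a prescribed solution; in practice a regression package such as statsmodels (`smf.ols("sc ~ T", data=...)`) reports the same coefficient together with standard errors and p-values.

```python
import numpy as np

# Hypothetical outcomes for five treated and five control children (made up).
sc = np.array([1, 1, 0, 1, 1, 0, 0, 1, 0, 0], dtype=float)
T = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)

# Design matrix: intercept column plus the treatment dummy.
X = np.column_stack([np.ones_like(T), T])

# OLS coefficients: beta[0] is the intercept, beta[1] the coefficient on T.
beta, *_ = np.linalg.lstsq(X, sc, rcond=None)

# Classical (homoskedastic) standard errors and t-statistics.
resid = sc - X @ beta
dof = len(sc) - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stat = beta / se

# The slope should coincide with the raw difference in group means.
diff_in_means = sc[T == 1].mean() - sc[T == 0].mean()
print(f"slope = {beta[1]:.3f}, difference in means = {diff_in_means:.3f}")
print(f"t-statistic on T = {t_stat[1]:.2f}")
```

The intercept is the control-group mean and the slope is the treated-minus-control gap, which is why the regression reproduces the table of averages.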

So far we ignored the other relevant covariates.  If the experiment was conducted correctly, those should not matter.  But if randomization was imperfect, it may not be the case.

8. (5pt) Estimate the multiple regression model.  Include all covariates, such as education, family size and whatever else you consider relevant for the current case.


9. (5pt) Compare the results.  Do other covariates substantially change the results?


In [None]:
#code goes here
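
To see why covariates "should not matter" under proper randomization, here is a small simulation sketch. Everything in it is hypothetical: the sample size, the assumed true effect of 0.10, and the way `hohedu` and `fam_n` enter the enrollment probability are invented for illustration and say nothing about the real Progresa data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Randomly assigned treatment, independent of the covariates by construction.
T = rng.integers(0, 2, size=n).astype(float)
hohedu = rng.integers(0, 10, size=n).astype(float)   # hypothetical schooling years
fam_n = rng.integers(2, 12, size=n).astype(float)    # hypothetical family size

# Assumed data-generating process: true treatment effect is 0.10.
p = 0.55 + 0.10 * T + 0.02 * hohedu - 0.01 * fam_n
sc = (rng.random(n) < p).astype(float)


def ols(X, y):
    """Return OLS coefficients for design matrix X and outcome y."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


ones = np.ones(n)
b_simple = ols(np.column_stack([ones, T]), sc)
b_multi = ols(np.column_stack([ones, T, hohedu, fam_n]), sc)

print(f"simple regression, coefficient on T:   {b_simple[1]:.3f}")
print(f"multiple regression, coefficient on T: {b_multi[1]:.3f}")
```

In this simulation the two coefficients on _T_ stay close because treatment is independent of the covariates; a large gap between them in real data would hint at imperfect randomization.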

### Before-After Estimator (40pts) -- (5pt each, except question 5)
Instead of comparing treatment and control villages in 1998, we can also compare just the treatment villages after (1998) and before (1997) the program was introduced.  We follow steps fairly similar to those above.

1. (3pt) What is the identifying assumption behind this estimator? Do you think it is fulfilled?  Explain!


2. (3pt) Why do we have to select only progresa villages and only poor for this task?


3. (4pt) Compute the average schooling rate (variable _sc_) for the poor in the treated villages before and after the program. Compare these means.  How big of an effect do you find?  
  

4. (5pt) Based on this number, can you claim progresa was effective (i.e. it increased schooling rate)?  Interpret the number (in terms of percentage points increase or decrease).

In [None]:
# code goes here

Next, do the same with linear regression:

5. (5pt) Implement the BA estimator using linear regression: regress the outcome for the treated group on the
after-program indicator.  Do not include any other controls (except the
intercept). 
  
     If you know how to do it then go ahead in your own way.  But if you need a little help then you can follow these steps:
      1. Ensure you are only comparing the relevant groups: the control
        group is before and the treatment group is after the policy was
        implemented. 
      2. Create a dummy variable __After__ that tells if we are
        looking at the period where the policy is already in place.
      3. Regress the outcome on __After__.
    
    
6. (2pt) Compare the results.  You should get exactly the same number as when just comparing the group means.


7. (3pt) Is the effect statistically significant?  

In [None]:
#code goes here 
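
A minimal sketch of the before-after steps on made-up data: keep the poor in treatment villages, build the __After__ dummy, then compare the difference in means with the slope of a one-variable least-squares fit. The toy values and the year coding are assumptions; `np.polyfit` is used here only as a quick way to get the slope and does not report standard errors.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the assignment's column names; values are made up.
toy = pd.DataFrame({
    "year":     [1997] * 5 + [1998] * 5 + [1997, 1998],   # assumed year coding
    "poor":     ["pobre"] * 12,
    "progresa": ["basal"] * 10 + ["0", "0"],
    "sc":       [1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1],
})

# Only poor households in progresa villages enter the BA comparison.
ba = toy[(toy["poor"] == "pobre") & (toy["progresa"] == "basal")].copy()

# After = 1 for the program year, 0 for the baseline year.
ba["After"] = (ba["year"] == 1998).astype(int)

# Table-of-averages version.
means = ba.groupby("After")["sc"].mean()
ba_effect = means.loc[1] - means.loc[0]

# Regression version: slope of sc on After (degree-1 least-squares fit).
slope, intercept = np.polyfit(ba["After"], ba["sc"], deg=1)

print(f"difference in means = {ba_effect:.3f}, regression slope = {slope:.3f}")
```

Note that this comparison attributes the whole 1997 to 1998 change to the program, so any general trend in enrollment would be folded into the estimate.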


So far we ignored other relevant covariates.  If the identifying
assumptions were correct, those should not matter.  But if not, this
may not be the case.

8. (5pt) Estimate the multiple regression model.  Include all covariates, such as education, family size and whatever else you consider relevant for the current case.


9. (5pt) Compare the results.  Do other covariates substantially change the results?


10. Comment on the identifying assumptions behind the CS and BA models. Which one do you find more convincing?

In [None]:
# code goes here

### Finally

Tell us how much time (in hours) you spent on this!
_(PS. Feel free to add other feedback)_. 