In [1]:
from src.models.linreg import LinReg
from src.displays.display_linear import display_models

import pandas as pd

In [2]:
df = pd.read_csv('../data/kenyan_assist.csv')
df

Unnamed: 0,id,time,education,hhh_sex,mem_tot,emp_now,nutrition
0,101043,0,0,1,3,1,4
1,101203,0,0,1,6,0,0
2,101227,0,0,1,10,1,2
3,101108,0,0,1,9,0,0
4,101239,0,0,1,8,1,3
...,...,...,...,...,...,...,...
1939,209006,1,1,1,4,0,3
1940,209019,1,0,1,3,1,2
1941,209013,1,1,1,3,1,3
1942,209010,1,1,1,4,1,5


In [3]:
data_dictionary = {
    'id': 'household id',
    'time': 'time period, pre or post treatment',
    'education': 'Binary indicator of whether the respondent has completed secondary school',
    'hhh_sex': 'Household Head sex',
    'mem_tot': 'total number of household members',
    'emp_now': 'In the last 7 days, did you do any work for pay, do any kind of business?',
    'nutrition': 'nutrition score, higher values show better nutrition'
}

The dataset consists of simulated data motivated by https://microdata.worldbank.org/index.php/catalog/4210/study-description. The study is a randomized controlled trial of a nutrition intervention in Kenya, and the data is simulated to have similar properties to the real data. The outcome variable is a standardized nutrition score. The treatment variable is a binary indicator of whether the household received cash assistance worth about 12 USD. The study conducted both pre- and post-treatment surveys, collecting information on household composition, employment and nutrition outcomes. Unfortunately, for this example we have lost the data for the control group, so we only have access to the treatment group's data. To still study any potential effects, we will use the causal inference technique of pre/post analysis to work around the missing control group. Let's look at the assumptions behind the technique.

## Key Assumptions of Pre-Post Analysis

Pre-Post Analysis relies on several key assumptions for its validity. These include:

1. **No External Influences:** The assumption that no external factors other than the treatment influence the outcome variable between the pre- and post-treatment periods.
2. **Consistency of Treatment Effect:** The assumption that the treatment effect is consistent across subjects and over time.

## Econometric Model Explanation

The model equation is given by:

$$
Y_{it} = \alpha + bT_{it} + dX_{it} + \varepsilon_{it}
$$

where:
- $t = 0$ represents the baseline (pre-treatment) period,
- $t = 1$ represents the post-treatment period.

In this model, the treatment effect is estimated as:

$$
E[Y_{i1}] - E[Y_{i0}]
$$

which can be broken down into:

$$
b + d(E[X_{i1}] - E[X_{i0}])
$$

Here, $b$ is the true effect, and $d(E[X_{i1}] - E[X_{i0}])$ captures the change between the two periods in the average level of all other factors, observed or not.
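
The breakdown follows from taking expectations of the model in each period. The sketch below assumes $T_{i1} = 1$ and $T_{i0} = 0$ (every household in this sample is treated after the baseline) and $E[\varepsilon_{it}] = 0$ in both periods; neither assumption is stated explicitly above.

$$
\begin{aligned}
E[Y_{i1}] &= \alpha + b + dE[X_{i1}] \\
E[Y_{i0}] &= \alpha + dE[X_{i0}] \\
E[Y_{i1}] - E[Y_{i0}] &= b + d(E[X_{i1}] - E[X_{i0}])
\end{aligned}
$$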

To derive causality and accurately measure the true effect, we require:

$$
E[X_{i1}] = E[X_{i0}]
$$

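This implies that the average levels of all other variables must remain stable over time. If they drift, the pre/post difference mixes the true effect $b$ with the effect of that drift.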
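
A small simulation can make the requirement concrete. The sketch below generates data from the model above with hypothetical parameter values (`alpha`, `b`, `d` are made up for illustration, not estimated from the Kenyan data) and compares the pre/post gap when the covariate mean is stable with the gap when it drifts by one unit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                    # large sample so sample means sit close to expectations
alpha, b, d = 1.0, 2.0, 0.5    # hypothetical parameters chosen for illustration only


def pre_post_gap(x_shift):
    """Simulate Y_it = alpha + b*T_it + d*X_it + eps_it and return mean(Y_1) - mean(Y_0)."""
    x0 = rng.normal(3.0, 1.0, n)              # covariate at baseline
    x1 = rng.normal(3.0 + x_shift, 1.0, n)    # covariate after treatment
    y0 = alpha + b * 0 + d * x0 + rng.normal(0.0, 1.0, n)
    y1 = alpha + b * 1 + d * x1 + rng.normal(0.0, 1.0, n)
    return y1.mean() - y0.mean()


stable = pre_post_gap(x_shift=0.0)   # E[X_1] = E[X_0]: gap should be close to b
drift = pre_post_gap(x_shift=1.0)    # E[X_1] - E[X_0] = 1: gap should be close to b + d

print(f"stable covariates:   {stable:.3f}")
print(f"drifting covariates: {drift:.3f}")
```

With stable covariates the gap should land near `b`; with a one-unit drift it should land near `b + d`, which is the bias term from the decomposition.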




In [5]:
"""Randomization Check of covariates over time"""
education_check = LinReg(df=df, outcome="education", independent=["time"], standard_error_type='hc0')
hhh_sex_check = LinReg(df=df, outcome="hhh_sex", independent=["time"], standard_error_type='hc0')
mem_tot_check = LinReg(df=df, outcome="mem_tot", independent=["time"], standard_error_type='hc0')
display_models([education_check,hhh_sex_check,mem_tot_check])

From the randomization check above, our first assumption of no external influences over time appears to hold, at least for the covariates we observe: there is no statistically significant change in the covariates over time. Additionally, because the covariates show no statistically significant change over time, we will assume that the treatment effect is consistent over time, which is our second assumption. Thus, we can proceed with the pre/post analysis. Let's look at the naive regressions first.
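
For readers without access to the project's `LinReg` class, the check can be sketched in plain NumPy. This is an illustration on synthetic data, not the notebook's implementation: it assumes `standard_error_type='hc0'` refers to White's heteroskedasticity-robust (HC0) estimator, and the helper `ols_hc0` and the simulated covariate are hypothetical.

```python
import numpy as np


def ols_hc0(y, X):
    """Return OLS coefficients and HC0 (White) standard errors.

    X must already contain a constant column.
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # "Meat" of the sandwich: sum over i of e_i^2 * x_i x_i'
    meat = X.T @ (X * resid[:, None] ** 2)
    cov = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))


rng = np.random.default_rng(1)
n = 2000
time = np.repeat([0.0, 1.0], n // 2)                  # first half pre, second half post
covariate = rng.binomial(1, 0.3, n).astype(float)     # same distribution in both periods

X = np.column_stack([np.ones(n), time])
beta, se = ols_hc0(covariate, X)
t_stat = beta[1] / se[1]

print(f"change over time: {beta[1]:.4f}, HC0 s.e.: {se[1]:.4f}, t: {t_stat:.2f}")
```

A coefficient on `time` that is small relative to its standard error is what "no statistically significant change" means here. Note that such a check can only speak to covariates that were measured.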

In [7]:
"""Naive regressions"""
nutrition_naive = LinReg(df=df, outcome="nutrition", independent=["time"], standard_error_type='hc0')
employment_naive = LinReg(df=df, outcome="emp_now", independent=["time"], standard_error_type='hc0')

display_models([nutrition_naive,employment_naive])

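In the above regressions we can see that the cash transfers seem to have been very successful. The nutrition score increased by 2.467 units and employment increased by 0.148, both highly statistically significant results. Because `emp_now` is a binary indicator, the employment coefficient corresponds to an increase of about 14.8 percentage points in the share of respondents who worked in the last 7 days.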
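
The interpretation of the naive coefficients rests on a simple fact: when the only regressor is a 0/1 indicator, the OLS slope equals the difference in group means. A minimal sketch on hypothetical toy data (ten made-up observations, not drawn from the dataset):

```python
import numpy as np

# Hypothetical toy data: an employment indicator for 5 households pre and 5 post
time = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
emp = np.array([0, 1, 0, 0, 1, 1, 1, 0, 1, 1])

# OLS of emp on a constant and time
X = np.column_stack([np.ones(len(time)), time]).astype(float)
intercept, slope = np.linalg.lstsq(X, emp.astype(float), rcond=None)[0]

# Direct comparison of post and pre employment shares
diff_in_means = emp[time == 1].mean() - emp[time == 0].mean()

print(f"intercept (pre share): {intercept:.2f}")
print(f"slope on time:         {slope:.2f}")
print(f"post minus pre share:  {diff_in_means:.2f}")
```

The intercept recovers the pre-treatment share and the slope recovers the change in that share, which is why the coefficient on `time` in a regression of `emp_now` reads as a change in the employment rate.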

In [8]:
"""Regression with covariates"""

nutrition_full = LinReg(df=df, outcome="nutrition", independent=["time", 
                                                                 "education",
                                                                 "hhh_sex",
                                                                 "mem_tot"], standard_error_type='hc0')
employment_full = LinReg(df=df, outcome="emp_now", independent=["time",
                                                                "education",
                                                                "hhh_sex",
                                                                "mem_tot"], standard_error_type='hc0')

display_models([nutrition_full,employment_full])

In the regressions with covariates we again see that the cash transfers seem to have been very successful. The nutrition score increased by 2.467 units and employment increased by 0.148 units, both highly statistically significant results. Additionally, the estimated treatment effects do not change when we adjust for our controls. This is a good sign that our controls are not absorbing the treatment effect. Overall, we see that despite losing our control group data we can still leverage techniques to analyse this assistance program.
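
Why the estimates barely move once controls are added can be illustrated with a simulation. The sketch below uses synthetic data with a hypothetical true effect of 2.0 and a made-up household-size covariate whose distribution is the same in both periods; it is not a re-analysis of the Kenyan data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
time = np.repeat([0.0, 1.0], n // 2)
mem_tot = rng.poisson(5, n).astype(float)    # hypothetical covariate, balanced across periods
y = 1.0 + 2.0 * time - 0.1 * mem_tot + rng.normal(0.0, 1.0, n)   # hypothetical true effect: 2.0


def coef_on_time(columns):
    """Fit OLS with a constant plus the given columns; return the coefficient on time."""
    X = np.column_stack([np.ones(n)] + columns)
    return np.linalg.lstsq(X, y, rcond=None)[0][1]


naive = coef_on_time([time])
adjusted = coef_on_time([time, mem_tot])

print(f"naive:    {naive:.3f}")
print(f"adjusted: {adjusted:.3f}")
```

When a covariate is balanced over time, including it should leave the coefficient on `time` essentially unchanged. A large shift after adding controls would instead suggest that covariate drift was contaminating the naive estimate.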