In [1]:
from src.models.linreg import LinReg
from src.displays.display_linear import display_models

import pandas as pd

## Example with Real Data

In [2]:
"""Load the data"""
df = pd.read_csv('../data/housing_health_mock.csv')
print(f'The data has {df.shape[0]} rows and {df.shape[1]} columns.')
df.head()


The data has 1616 rows and 6 columns.


   assist  age  educ  female   housing    mhealth
0       0   21     1       1  0.387853  -0.348691
1       0   54     1       1  0.274780   1.262136
2       0   33     1       1  0.600904  -0.026526
3       0   40     1       1  0.312471  -0.026526
4       0   75     0       0  1.426678   0.939971


The data shows the results of a randomized controlled trial of a government assistance program. Pre-assistance covariates (age, education, and gender) were measured first, and the population was then split into a treatment group and a control group. The treatment group received the government assistance and the control group did not. The dataset has two outcome variables, housing quality and mental health score, both scaled to have mean zero and standard deviation one. The data is simulated, but the results are based on real data.
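Because both outcomes are scaled to mean zero and standard deviation one, regression coefficients on them can be read in standard-deviation units. As a minimal sketch of that scaling, assuming a population standard deviation was used (the source does not say which) and using made-up raw scores rather than values from the dataset:

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std dev."""
    mean = fmean(values)
    sd = pstdev(values)  # assumption: population (not sample) standard deviation
    return [(v - mean) / sd for v in values]


# Hypothetical raw scores, for illustration only
raw_scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_scores = standardize(raw_scores)

print(f"mean = {fmean(z_scores):.3f}, sd = {pstdev(z_scores):.3f}")
```

After this transformation, a coefficient of 0.2 on a treatment indicator means the treated group scores 0.2 standard deviations higher on average.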


In [3]:
"""Do a randomization check"""
age_check = LinReg(df=df, outcome="age", independent=["assist"], standard_error_type='hc0')
educ_check = LinReg(df=df, outcome="educ", independent=["assist"], standard_error_type='hc0')
gender_check = LinReg(df=df, outcome="female", independent=["assist"], standard_error_type='hc0')
display_models([age_check, educ_check, gender_check])


In [4]:
age_check.summary()

In [5]:
age_check.summary(content_type='static')

This quick analysis shows no statistically significant difference in mean age or education between the treatment and control groups. However, we do see a difference in the gender breakdown: the assist group has a slightly lower percentage of men, about 5% of a standard deviation. Let's take a deeper look at this gender difference.
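For a single binary regressor such as `assist`, the OLS slope equals the treated-minus-control difference in means, and the HC0 variance reduces to each arm's sum of squared residuals divided by the squared arm size. The sketch below computes both by hand on a toy sample. `balance_check` is a hypothetical helper, not part of `LinReg`, and it uses a normal approximation for the p-value, so it may differ slightly from what `LinReg` reports.

```python
from math import sqrt
from statistics import NormalDist, fmean


def balance_check(values, treated):
    """Difference in means (treated minus control) with an HC0 standard error."""
    y1 = [v for v, t in zip(values, treated) if t == 1]
    y0 = [v for v, t in zip(values, treated) if t == 0]
    m1, m0 = fmean(y1), fmean(y0)
    diff = m1 - m0
    # HC0 variance: within-arm sum of squared residuals over n_arm squared
    variance = (sum((v - m1) ** 2 for v in y1) / len(y1) ** 2
                + sum((v - m0) ** 2 for v in y0) / len(y0) ** 2)
    se = sqrt(variance)
    # Two-sided p-value under a normal approximation (an assumption)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    return diff, se, p_value


# Toy covariate values and treatment flags, for illustration only
toy_values = [30, 40, 50, 60]
toy_treated = [0, 0, 1, 1]
diff, se, p_value = balance_check(toy_values, toy_treated)
print(f"difference = {diff:.2f}, HC0 se = {se:.2f}, p = {p_value:.5f}")
```

A balance check passes when the difference is small relative to its standard error, which is what the regressions of each covariate on `assist` are testing.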


In [5]:
female = df[df['female'] == 1]
print(f'The data has {female.shape[0]} rows and {female.shape[1]} columns.')
female.head()

The data has 788 rows and 6 columns.


   assist  age  educ  female   housing    mhealth
0       0   21     1       1  0.387853  -0.348691
1       0   54     1       1  0.274780   1.262136
2       0   33     1       1  0.600904  -0.026526
3       0   40     1       1  0.312471  -0.026526
5       1   51     1       1  1.068612  -0.993022


In [6]:
female_age_check = LinReg(df=female, outcome="age", independent=["assist"], standard_error_type='robust')
female_educ_check = LinReg(df=female, outcome="educ", independent=["assist"], standard_error_type='robust')
display_models([female_age_check, female_educ_check])

These results show no statistically significant differences between the treatment and control groups among women, which is consistent with successful randomization. Now let's look at the outcomes.


In [7]:
housing_assist = LinReg(df=df, outcome="housing", independent=["assist"], standard_error_type='robust')
health_assist = LinReg(df=df, outcome="mhealth", independent=["assist"], standard_error_type='robust')
display_models([housing_assist, health_assist])

The results above show a statistically significant difference in housing quality between the treatment and control groups: the treatment group has higher housing quality. This is good news, as it suggests that the government assistance program was successful to some degree. The effect is modest in size, about 0.2 standard deviations. However, the impact on mental health is not statistically significant, so we have no evidence that the program affected mental health. Let's see whether these results hold for women as well, given the imbalance we saw above.
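Statistical significance and effect size are separate questions, and a confidence interval shows both at once. The sketch below builds a normal-approximation interval around the 0.2 coefficient mentioned above. The standard error of 0.05 is a made-up placeholder, not a value from the model output, and `confidence_interval` is a hypothetical helper.

```python
from statistics import NormalDist


def confidence_interval(coef, se, level=0.95):
    """Normal-approximation confidence interval for a regression coefficient."""
    critical = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return coef - critical * se, coef + critical * se


coef = 0.2              # housing effect in standard deviations (from the text)
hypothetical_se = 0.05  # placeholder standard error, NOT from the results
low, high = confidence_interval(coef, hypothetical_se)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

An interval that excludes zero corresponds to a significant effect at that level, while its width shows how precisely the size of the effect is pinned down.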


In [9]:
female_housing_assist = LinReg(df=female, outcome="housing", independent=["assist"], standard_error_type='robust')
female_health_assist = LinReg(df=female, outcome="mhealth", independent=["assist"], standard_error_type='robust')
display_models([female_housing_assist, female_health_assist])

These results show that there does not seem to be any difference between treatment and control in mental health scores. However, as we saw above, assistance leads to better housing outcomes. Finally, let's add our control covariates.

In [12]:
housing_full = LinReg(df=df, outcome="housing",
                      independent=["assist", "educ", "age"],
                      standard_error_type='robust')

health_full = LinReg(df=df, outcome="mhealth",
                     independent=["assist", "educ", "age"],
                     standard_error_type='robust')

display_models([
    housing_assist,
    health_assist,
    housing_full,
    health_full,
])

From the results above we see that even when controlling for age and education, the treatment group still has statistically significantly higher housing quality, while the difference in mental health scores remains statistically insignificant. We therefore have no evidence that the government assistance program had an impact on mental health. This is a good example of how randomization can be used to determine whether a program is effective and how large its impact is. In this case, the program was effective at improving housing quality, but not mental health. This suggests that the program should be continued, but modified to improve mental health outcomes.
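Adding controls changes the treatment coefficient only to the extent that a covariate is imbalanced across arms, which is why the estimate is stable under successful randomization. The sketch below illustrates this with the closed-form OLS solution for one treatment indicator plus a single covariate (the models above use two, `educ` and `age`). The data are constructed so the true treatment effect is 0.5, and all names and numbers are illustrative.

```python
from statistics import fmean


def cross(a, b):
    """Centered sum of cross products: sum((a_i - mean(a)) * (b_i - mean(b)))."""
    ma, mb = fmean(a), fmean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))


def unadjusted_effect(y, d):
    """OLS slope of y on the treatment indicator d alone."""
    return cross(d, y) / cross(d, d)


def adjusted_effect(y, d, x):
    """OLS coefficient on d when y is regressed on d and one covariate x."""
    s_dd, s_xx, s_dx = cross(d, d), cross(x, x), cross(d, x)
    s_dy, s_xy = cross(d, y), cross(x, y)
    return (s_xx * s_dy - s_dx * s_xy) / (s_dd * s_xx - s_dx ** 2)


def outcome(d, x):
    """Noise-free outcome with a true treatment effect of 0.5."""
    return [1 + 0.5 * di + 2 * xi for di, xi in zip(d, x)]


d = [0, 0, 0, 0, 1, 1, 1, 1]
x_balanced = [1, 2, 3, 4, 1, 2, 3, 4]    # same covariate values in both arms
x_imbalanced = [1, 2, 3, 4, 2, 3, 4, 5]  # treated arm shifted up by one

y_bal = outcome(d, x_balanced)
y_imb = outcome(d, x_imbalanced)

print(unadjusted_effect(y_bal, d), adjusted_effect(y_bal, d, x_balanced))
print(unadjusted_effect(y_imb, d), adjusted_effect(y_imb, d, x_imbalanced))
```

With a balanced covariate both estimates recover 0.5. With the imbalanced covariate the unadjusted estimate absorbs the covariate's effect and only the adjusted one recovers 0.5.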