Linear regression is not reproducible #27

Closed
mgralle opened this issue Dec 28, 2018 · 13 comments

@mgralle
Contributor

mgralle commented Dec 28, 2018

#!/usr/bin/env python3

# -*- coding: utf-8 -*-

"""
Created on Thu Dec 27 11:24:48 2018

@author: mgralle

Debugging script for the dowhy package, using the Lalonde data example.

Repetition of estimation using propensity score matching or weighting gives reproducible values, as expected. However, repetition of estimation using linear regression gives different values.
"""

#To simplify debugging, I obtained the Lalonde data as described on the DoWhy
#page and wrote it to a CSV file:

#from rpy2.robjects import r as R
#%load_ext rpy2.ipython
##%R install.packages("Matching")
#%R library(Matching)
#%R data(lalonde)
#%R -o lalonde
#lfile = open("lalonde.csv","w")
#lalonde.to_csv(lfile,index=False)
#lfile.close()

import pandas as pd
lalonde=pd.read_csv("lalonde.csv")

print("Lalonde data frame:")
print(lalonde.describe())

from dowhy.do_why import CausalModel

#1. Propensity score weighting

model=CausalModel(
        data = lalonde,
        treatment='treat',
        outcome='re78',
        common_causes='nodegr+black+hisp+age+educ+married'.split('+'))
identified_estimand = model.identify_effect()

psw_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_weighting")
print("\n(1) Causal Estimate from PS weighting is " + str(psw_estimate.value))

psw_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_weighting")
print("\n(2) Causal Estimate from PS weighting is " + str(psw_estimate.value))

#2. Propensity score matching
psm_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_matching")
print("\n(1) Causal estimate from PS matching is " + str(psm_estimate.value))

psm_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.propensity_score_matching")
print("\n(2) Causal estimate from PS matching is " + str(psm_estimate.value))

#3. Linear regression
linear_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression",
        test_significance=True)
print("\n(1) Causal estimate from linear regression is " + str(linear_estimate.value))

linear_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression",
        test_significance=True)
print("\n(2) Causal estimate from linear regression is " + str(linear_estimate.value))

# Recreate model from scratch for linear regression

model=CausalModel(
        data = lalonde,
        treatment='treat',
        outcome='re78',
        common_causes='nodegr+black+hisp+age+educ+married'.split('+'))

identified_estimand = model.identify_effect()

linear_estimate = model.estimate_effect(identified_estimand,
        method_name="backdoor.linear_regression",
        test_significance=True)
print("\n(3) Causal estimate from linear regression is " + str(linear_estimate.value))

print("\nLalonde Data frame hasn't changed:")
print(lalonde.describe())

@amit-sharma
Member

Thanks for raising this, @mgralle! I am trying to understand the source of the error. When I run this script on my local machine, I obtain the same value each time (1671.13) for linear regression.
Can you share samples of the different estimate values that you got?

@mgralle
Contributor Author

mgralle commented Jan 21, 2019

Thanks for paying attention to this problem!
When running my script on Python 3.7 in Spyder 3.3.1 in Anaconda, I get the following estimated values from linear regression:
(1) Causal estimate from linear regression is 1671.1308841235102
(2) Causal estimate from linear regression is 937.2998778128333
(3) Causal estimate from linear regression is -333.2514015372243

Re-running the entire script, I get:
(1) Causal estimate from linear regression is 1671.1308841235102
(2) Causal estimate from linear regression is -782.0656222357087
(3) Causal estimate from linear regression is -861.948525208911

After exiting Spyder and re-entering, I get:
(1) Causal estimate from linear regression is 1671.1308841235111
(2) Causal estimate from linear regression is 741.7360515239643
(3) Causal estimate from linear regression is -74.26662536330865

Hope that helps!
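
For anyone who wants to check for this behaviour on their own machine, here is a minimal sketch of a repeat-and-compare helper. The names `repeated_estimates` and `is_reproducible` are hypothetical and not part of dowhy, and the relative tolerance of 1e-9 is an assumption, chosen so that last-digit floating-point differences between sessions (such as 1671.1308841235102 vs 1671.1308841235111 above) still count as equal.

```python
import math
from typing import Callable, List


def repeated_estimates(estimate_fn: Callable[[], float], n_runs: int = 3) -> List[float]:
    """Call estimate_fn n_runs times and collect the returned values."""
    return [float(estimate_fn()) for _ in range(n_runs)]


def is_reproducible(values: List[float], rel_tol: float = 1e-9) -> bool:
    """Return True if every value agrees with the first within rel_tol."""
    if not values:
        raise ValueError("need at least one value to compare")
    first = values[0]
    return all(math.isclose(first, v, rel_tol=rel_tol) for v in values[1:])


# Values copied from the first run reported above: they disagree, so this is False.
first_run = [1671.1308841235102, 937.2998778128333, -333.2514015372243]
print(is_reproducible(first_run))
```

With the script from the first comment, `estimate_fn` could be something like `lambda: model.estimate_effect(identified_estimand, method_name="backdoor.linear_regression").value`.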

@akelleh
Contributor

akelleh commented Feb 22, 2019

Unable to reproduce on Python 3.6. I'm making a fresh install of Python 3.7 to test your script there.

@akelleh
Contributor

akelleh commented Feb 22, 2019

On a fresh install of Python 3.7, I get:

(1) Causal estimate from linear regression is 1671.130884123515
(2) Causal estimate from linear regression is 1671.130884123515
(3) Causal estimate from linear regression is 1671.130884123515

So I can't reproduce it. @mgralle, can you provide minimal steps to reproduce in a fresh venv?

@mgralle
Contributor Author

mgralle commented Feb 23, 2019 via email

@akelleh
Contributor

akelleh commented Feb 23, 2019

Perfect! Thanks so much.

@amit-sharma
Member

Thanks @mgralle for spending the time. Let us know when you have an update.

@mgralle
Contributor Author

mgralle commented Mar 1, 2019 via email

@mgralle
Contributor Author

mgralle commented Mar 1, 2019 via email

@mgralle
Contributor Author

mgralle commented Mar 1, 2019 via email

@akelleh
Contributor

akelleh commented Mar 6, 2019

Great! It sounds like the solution might be to state that dowhy requires anaconda>5.3.1 and to pin the package versions in requirements.txt to ones that are tested to work. The alternative would be to dig deep into the details of which packages are failing, and that's a lot of time that could be spent on higher priorities (other bug fixes, adding features, documentation, tech debt). Do you agree?
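
Before pinning anything, it may help to compare the package versions in a working and a failing environment. The following is only a sketch: `installed_versions` is a hypothetical helper, the package list is a guess at what might be relevant rather than a confirmed list of dowhy dependencies, and `importlib.metadata` assumes Python 3.8 or newer.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed list of packages to inspect; adjust to the actual dependencies.
    candidates = ["dowhy", "numpy", "scipy", "pandas", "scikit-learn", "statsmodels"]
    for name, version in installed_versions(candidates).items():
        # Print in requirements.txt style so the output can be diffed or pinned.
        print(f"{name}=={version}" if version else f"# {name}: not installed")
```

Running this in both environments and diffing the output would show which versions differ, and the lines from the working environment could serve as a starting point for pinned entries in requirements.txt.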

@mgralle
Contributor Author

mgralle commented Mar 6, 2019 via email

@mgralle
Contributor Author

mgralle commented Mar 6, 2019

Yes, in fact I agree that it's not worth digging into the details of the packages. My hunch is that there is a problem with the downgrading of some packages forced by anaconda=5.3.1. In any case, for myself I won't use anaconda and spyder anymore since they seem to introduce unnecessary complications.

@mgralle mgralle closed this as completed Mar 6, 2019