Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repeated use restricting to e(sample) leads to different results #203

Closed
jorpppp opened this issue May 14, 2024 · 9 comments · Fixed by #205
Closed

Repeated use restricting to e(sample) leads to different results #203

jorpppp opened this issue May 14, 2024 · 9 comments · Fixed by #205
Assignees

Comments

@jorpppp
Copy link
Collaborator

jorpppp commented May 14, 2024

Consider this example

clear all
webuse nlswork
by idcode (year): gen time=_n
xtset idcode time
by idcode (time): gen union2 = sum(union)
replace union2 = 1 if union2 > 1
xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
, pol(union2) w(3) cluster(idcode) impute(nuchange)

gen s = e(sample)

xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure if s ///
, pol(union2) w(3) cluster(idcode) impute(nuchange)

The second run over the estimation sample from the first run leads to different results. This probably has to do with the impute option.

I am going to label this as a bug for now because it looks like it.

@jorpppp jorpppp added the bug Something isn't working label May 14, 2024
@jorpppp jorpppp self-assigned this May 14, 2024
@jorpppp
Copy link
Collaborator Author

jorpppp commented May 14, 2024

Note that repeated saving of imputed variables with different suffixes is not working:

clear all
webuse nlswork
by idcode (year): gen time=_n
xtset idcode time
by idcode (time): gen union2 = sum(union)
replace union2 = 1 if union2 > 1
xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
, pol(union2) w(3) cluster(idcode) impute(nuchange, saveimp)

ren union2_imputed union2_imputed1
gen s = e(sample)

xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure if s ///
, pol(union2) w(3) cluster(idcode) impute(nuchange, saveimp)

returns an error message.

This is because we are confirming vars with unab, so I'm going to fix that first.

@jorpppp
Copy link
Collaborator Author

jorpppp commented May 14, 2024

@Constantino-Carreto-Romero note the issue above #203 (comment) and the fix in 548b1a6. I think this is also an issue for #177 that we just closed, because it means that when xtevent drops variables to replace them, it may be dropping more variables that it should.

Can you please open a new issue to check all confirms to make sure this doesn't happen? Note that for the fix in #177 you may need to move the check for replacing to a later place in the code, when the code already knows the new kvars and can confirm them individually.

@jorpppp
Copy link
Collaborator Author

jorpppp commented May 14, 2024

After 548b1a6 the above code works: #203 (comment)

and it shows that the imputed variable is different across runs, so the issue has to do with the imputation.

@jorpppp
Copy link
Collaborator Author

jorpppp commented May 14, 2024

Actually the issue may be deeper since it shows up without any imputation:

clear all
webuse nlswork
by idcode (year): gen time=_n
xtset idcode time
by idcode (time): gen union2 = sum(union)
replace union2 = 1 if union2 > 1
xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
, pol(union2) w(3) cluster(idcode)
est sto m1

xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure if s ///
, pol(union2) w(3) cluster(idcode) 
est sto m2


est tab m1 m2

@jorpppp
Copy link
Collaborator Author

jorpppp commented May 14, 2024

This is not an imputation issue, and I don't think this is a bug. This code checks the estimation samples for the two runs:

clear all
webuse nlswork
by idcode (year): gen time=_n
xtset idcode time
by idcode (time): gen union2 = sum(union)
replace union2 = 1 if union2 > 1
xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
, pol(union2) w(3) cluster(idcode) savek(a)
est sto m1


* ren union2_imputed union2_imputed1
gen s1 = e(sample)

xtevent ln_w age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure if s1 ///
, pol(union2) w(3) cluster(idcode) 
est sto m2

gen s2 = e(sample)

* ren union2_imputed union2_imputed2

* assert union2_imputed1 == union2_imputed2

est tab m1 m2

tab s1 s2
tab idcode if s1 != s2

This shows that the samples differ between runs and it lists some ids for where this happens. These ids are the same that are excluded in the second run because of ambiguous event times.

image

After that if we check the sample for missing values for one of the ids:

li idcode time union2 tenure age ttl_exp s1 s2 if idcode == 13

we get

image

Note that tenure is missing precisely when the policy variable, union2, changes from zero to one. So in the first run, this observation is excluded because of missing tenure. Now, the second run that is restricted to the estimation sample of the first run. Since that observation is not in the sample for the second run, the observations for id=13 have ambiguous event times so the entire unit is excluded. The same happens for the other ids in the sample.

The same happens with the other ids with the sample.

I'll now check the imputation case.

@jorpppp
Copy link
Collaborator Author

jorpppp commented May 14, 2024

It's a good idea to return the list of excluded units due to ambiguous event times so they can be checked. I added this in 3a84461 and 7afc04c

After doing this I checked if the excluded units due to ambiguous event time when using impute(nuchange) (a larger set than without imputation) had the same issue of missing tenure when the policy activates, and they do. So the issue is the same overall.

In summary I think this is not a bug.

@Constantino-Carreto-Romero could you read the posts on this issue and let me know if you agree? If so we can start a PR to add these changes.

@Constantino-Carreto-Romero
Copy link
Collaborator

Constantino-Carreto-Romero commented May 14, 2024

It's a good idea to return the list of excluded units due to ambiguous event times so they can be checked. I added this in 3a84461 and 7afc04c

After doing this I checked if the excluded units due to ambiguous event time when using impute(nuchange) (a larger set than without imputation) had the same issue of missing tenure when the policy activates, and they do. So the issue is the same overall.

In summary I think this is not a bug.

@Constantino-Carreto-Romero could you read the posts on this issue and let me know if you agree? If so we can start a PR to add these changes.

@jorpppp
I replicated the examples throughout the issue and I agree it should not be considered as a bug. In the first run, to create the event-time dummies, the program focuses on a larger sample, but then estimates restricting to non-missing observations in the varlist. If the user wants to reestimate using e(sample) he/she might not be able to run the same estimation because the program won’t be able to define event time for some units that lost key observations in the previous run. Maybe we could add a second sample function, one that focuses on the sample used when creating the event-time dummies.

@jorpppp
Copy link
Collaborator Author

jorpppp commented May 15, 2024

@Constantino-Carreto-Romero Yes, this makes me think that we just need to set the sample to the one used when creating the event-time dummies. That way the results will replicate using e(sample), and at the end of the day, that is the sample that xtevent is using as a whole, even if _eventols or _eventiv use smaller samples. I'll implement.

@jorpppp
Copy link
Collaborator Author

jorpppp commented May 15, 2024

Summary: In this issue we addressed an issue with xtevent giving different results when ran twice, using the e(sample) from the first run in the second run.
Thread continues in #205

jorpppp added a commit that referenced this issue May 15, 2024
jorpppp added a commit that referenced this issue May 15, 2024
…#205)

* #203 confirm imputed variable with confirm, exact instead of unab

* #203 Return list of excluded units due to ambiguous event time

* #203 Modify help file

* #203 Set sample to larger sample used to create event-time dummies

* #203 #205 Add seed to test file
jorpppp added a commit that referenced this issue May 15, 2024
#207)

* #203 confirm imputed variable with confirm, exact instead of unab

* #203 Return list of excluded units due to ambiguous event time

* #203 Modify help file

* #203 Set sample to larger sample used to create event-time dummies

* #203 #205 Add seed to test file
jorpppp added a commit that referenced this issue May 15, 2024
…… ( (#208)

* #203 repeated use restricting to e(sample) leads to different results… (#207)

* #203 confirm imputed variable with confirm, exact instead of unab

* #203 Return list of excluded units due to ambiguous event time

* #203 Modify help file

* #203 Set sample to larger sample used to create event-time dummies

* #203 #205 Add seed to test file

* #206 Change endpoint dummy var labels
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants