Fix race conditions in regression tests #899

Closed
jlippuner opened this Issue Sep 28, 2013 · 6 comments

Comments

@jlippuner

jlippuner commented Sep 28, 2013

I have now seen several times (with different versions of the code) that the test results are not consistent between runs.

For example, I just compiled 93c88b4b77581452530c03fd2bcfd775cb0d5ea0 with GCC 4.8.1 and the MPI parcel port.

So I did make and then I did make -j 8 tests and got

The following tests FAILED:
          4 - tests.regressions.actions.plain_action_dataflow_move_semantics (Failed)

Then I did make tests and got

The following tests FAILED:
          4 - tests.regressions.actions.plain_action_dataflow_move_semantics (Failed)
         30 - tests.unit.agas.local_address_rebind (Failed)

Then I did make -j 8 tests and all tests passed.

Then I did make -j 3 tests and all tests passed again.

No other instances of HPX were running while I was running these tests and I didn't do anything else in between running the tests (except compile HPX in a different ssh session in a different build directory).

Has anybody else observed behavior like this? This was a release build, and the failed tests did not output any useful information about why they failed.
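
For what it's worth, a minimal way to dig further (assuming the same CMake/CTest build tree that produced the summaries above) is to re-run just the failing test from the build directory with its full output shown; the test name below is taken from the failure summary:

    # re-run only the failing test, printing its output even when it passes
    ctest -R tests.regressions.actions.plain_action_dataflow_move_semantics -V

With a release build this may still not say much, but it at least separates the test from everything else that make tests runs.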

@ghost ghost assigned hkaiser Sep 28, 2013

@brycelelbach


Member

brycelelbach commented Oct 5, 2013

This happens due to parallelism-related bugs. Many of the tests are designed to uncover pathological race conditions, so seeing some of them fail or succeed intermittently is, in a way, good: it indicates that the tests are catching exactly the sort of race conditions they were written to expose.

make -j N tests does not affect the number of cores that the tests are run on: each test specifies the number of cores and localities to use (ideally it would specify a fraction of the total available cores).
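
As an illustration of these being independent knobs (the binary name and path below are made up for the example and will depend on the build tree; --hpx:threads is the standard HPX command-line option for the number of worker threads):

    # parallelizes the build of the tests, not the cores each test runs on:
    make -j 8 tests

    # the core count of an individual test is decided by its own command line:
    ./bin/plain_action_dataflow_move_semantics_test --hpx:threads=4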

@hkaiser


Member

hkaiser commented Oct 5, 2013

Well, I think we should leave that open to remind us to fix those race conditions in the first place.

@hkaiser hkaiser reopened this Oct 5, 2013

@brycelelbach


Member

brycelelbach commented Oct 9, 2013

I'd rather open up specific tickets for specific failures...

@jlippuner


jlippuner commented Nov 15, 2013

I have noticed inconsistent behavior in the following tests (not a complete list) in 0.9.7:
tests.regressions.lcos.future_hang_on_get_629
tests.unit.threads.thread
tests.regressions.lcos.after_588
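
A quick way to get a rough failure rate for one of these (assuming the same CTest setup as above; the --timeout is there because the first test's name refers to a possible hang) is to run it in a loop:

    # stress a single intermittent test and count failures
    for i in $(seq 1 50); do
        ctest -R tests.regressions.lcos.future_hang_on_get_629 --timeout 60 --output-on-failure \
            || echo "run $i failed"
    done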

@hkaiser


Member

hkaiser commented Nov 15, 2013

Yes, the first two seem to be genuine race conditions, most likely in the tests themselves.

The last one is known to fail and is just one way in which a particular problem manifests itself (the problem is well understood by now, but we have not found a solution yet) - see #987. The issues #993 and #1007 are probably related, and we know that #1010 and #1014 have to be fixed before this can be resolved.

Thanks for sharing this information, though!

@hkaiser hkaiser closed this Mar 25, 2014

@hkaiser hkaiser reopened this Mar 25, 2014

@hkaiser hkaiser modified the milestones: 0.9.9, 0.9.8 Mar 25, 2014

@hkaiser


Member

hkaiser commented Oct 30, 2014

We have not seen these effects for a long time. I'll go ahead and close this ticket. Please re-open if the problem persists.

@hkaiser hkaiser closed this Oct 30, 2014
