
WIP: Fix throttle test #2994

Closed
wants to merge 22 commits
Conversation

msimberg
Contributor

@msimberg msimberg commented Nov 7, 2017

This addresses #2955 by unreverting #2962 and fixing some additional lockups.

Still to do:

  • Add more thorough tests
  • ...?

This is still in progress, but I'm opening it already for comments and to let the tests run.

@hkaiser
Member

hkaiser commented Nov 7, 2017

@sithhell would you have the time to look at this?

@hkaiser
Member

hkaiser commented Nov 7, 2017

@msimberg thanks for your work on this!

@msimberg
Contributor Author

@sithhell @hkaiser
Small update: since the throttle test and the CI checks pass, I don't intend to change anything more here, assuming you are happy with the changes. If bigger changes are needed, I would take care of them when working on suspension of threads/the runtime, as that might reveal additional problems.

A few things that could be taken care of in separate PRs (if considered useful):

  • Removing PUs dynamically only works with work stealing enabled (removing the PU on which a thread is currently running requires that thread's work to be stolen). Some way of pushing work to other PUs could get around this.
  • Removing a PU can still hang when there are multiple NUMA domains, as cross-NUMA stealing is only done by the first PU on each domain.
  • The current throttle test would fail or behave incorrectly if global_thread_num != virt_core + thread_offset. remove_processing_unit could return the global_thread_num of the removed PU so that one can add it back under the same global_thread_num (see the sketch after this list). However, I'm not sure whether that situation ever actually occurs.
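
To make the last bullet concrete, here is a rough sketch; the names and signatures are hypothetical illustrations, not the actual HPX API:

```cpp
// Hypothetical sketch only: these names and signatures are illustrations,
// not the actual HPX API. The idea is to have the removal function return
// the global thread number of the removed PU so the test can re-add exactly
// that PU instead of assuming global_thread_num == virt_core + thread_offset.
#include <cstddef>

struct pool_sketch
{
    std::size_t thread_offset = 0;

    // Returns the global thread number that was bound to virt_core.
    std::size_t remove_processing_unit(std::size_t virt_core)
    {
        // In general this mapping need not be a simple offset; this is just
        // a placeholder for whatever the pool actually stores.
        std::size_t global_thread_num = thread_offset + virt_core;
        // ... actual removal would happen here ...
        return global_thread_num;
    }

    void add_processing_unit(std::size_t virt_core, std::size_t global_thread_num)
    {
        // ... re-add the PU under the previously returned global number ...
        (void) virt_core;
        (void) global_thread_num;
    }
};

int main()
{
    pool_sketch pool;
    std::size_t removed = pool.remove_processing_unit(0);
    pool.add_processing_unit(0, removed);   // same global_thread_num as before
}
```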

@sithhell
Member

After testing this branch, I found several issues, which can be solved with the attached patch:
thread_queue.txt

@hkaiser
Member

hkaiser commented Nov 16, 2017

@sithhell please create a PR against this branch for your patch.

@msimberg
Contributor Author

Update: what @sithhell found is that some tests were failing because my changes made the scheduling loop's termination criteria too relaxed (e.g. timed_thread_pool_executor_test seems to fail because it creates suspended threads and the scheduling loop exits too early). We have no fix so far, so I'll continue looking into this (@sithhell, correct me if you've found a way around it in the meantime).
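
To illustrate the failure mode (names are assumptions, not the actual code): a criterion that ignores suspended threads lets the loop exit too early, roughly like this:

```cpp
// Illustrative sketch only (names are assumptions, not the HPX code): if the
// relaxed exit check looks only at runnable work, a pool whose last runnable
// thread finishes can stop its scheduling loop even though a suspended thread
// (e.g. one with a wakeup time in the future, as created by
// timed_thread_pool_executor_test) still has to run on it later.
#include <cstdint>

bool may_exit_relaxed(std::int64_t runnable_count, std::int64_t suspended_count)
{
    (void) suspended_count;       // problem: suspended threads are ignored,
    return runnable_count == 0;   // so the loop can exit too early
}
```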

@hkaiser
Member

hkaiser commented Nov 20, 2017

Could you please explain why relaxing the exit criteria for the scheduler is necessary in order to 'fix' the throttling problem?

@msimberg
Contributor Author

@hkaiser it has to do with stealing. The particular case I was seeing is:

  • hpx_main is scheduled to run on PU 0
  • hpx_main tries to remove PU 0
  • hpx_main gets stolen to another PU
  • the scheduling loop of PU 0 does not exit, because hpx_main is still running and still in its thread map (current master requires terminated_items_count_ == 0 and thread_map_count_ == 0 before exiting)
  • hpx_main/remove_processing_unit is stuck waiting for the scheduling loop to end

I imagine this would happen in more general cases as well. Essentially, the scheduling loop waits for work that will never be scheduled on that particular PU (unless the PU steals it back, which it won't do while stopping). However, as the failing tests show, simply relaxing the exit criteria is not a good solution.
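
For reference, a minimal sketch of the exit check on current master, borrowing the counter names used above (not the actual HPX code):

```cpp
// Minimal sketch (not the actual HPX code) of the exit condition on current
// master. The loop on a PU may only exit once both counters reach zero; a
// thread that was stolen away (such as hpx_main here) still counts against
// thread_map_count_ on the PU that created it, so that PU's loop never exits
// while the thread is alive.
#include <atomic>
#include <cstdint>

struct scheduling_loop_state
{
    std::atomic<std::int64_t> thread_map_count_{0};        // threads created on this PU
    std::atomic<std::int64_t> terminated_items_count_{0};  // terminated, not yet recycled

    bool may_exit() const
    {
        return thread_map_count_.load(std::memory_order_acquire) == 0 &&
               terminated_items_count_.load(std::memory_order_acquire) == 0;
    }
};
```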

@msimberg
Contributor Author

msimberg commented Dec 4, 2017

Closing as #3046 replaces it.

@msimberg msimberg closed this Dec 4, 2017