Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix compilation with dynamic bitset for CPU masks #3566

Merged
merged 1 commit into from Mar 15, 2019

Conversation

msimberg
Copy link
Contributor

@msimberg msimberg commented Nov 27, 2018

This is related to #3482. It fixes compilation with a dynamic bitset for the CPU masks. The default is unchanged.

It changes the default cpu mask to dynamic_bitset. If this doesn't have a performance impact I think we should remove the other options and always use a dynamic bitset.

As far as I can tell there are two places where this might have a performance impact:

  • local_queue_scheduler: when numa_sensitive != 0 this scheduler operates on the cpu mask in get_next_thread. However, this could be done the same way as local_priority_queue_scheduler which does all numa sensitive work in on_start_thread only.
  • shared_priority_queue_scheduler: this one uses fixed size arrays with HPX_HAVE_MAX_CPU_COUNT and HPX_HAVE_MAX_NUMA_DOMAIN_COUNT. I'd expect vectors to work just as well since we're not dynamically allocating them in a tight loop, but I haven't checked. Update: The shared priority scheduler doesn't work with this option and gives warnings to the user.

@hkaiser
Copy link
Member

hkaiser commented Nov 27, 2018

On Knight's Landing architectures (might not be representative, but anyways...) we have seen a massive speedup in the scheduler from letting the compiler vectorize (using AVX512) the statically sized bitmasks. We should verify that dynamic_bitset will be properly vectorized (i.e. all cores up to 512 are handled by a single operation).

@msimberg
Copy link
Contributor Author

That's a good point. My hope is it wouldn't even need to be vectorized for good performance if the masks are never used in tight loops. Is that an unreasonable restriction?

@hkaiser
Copy link
Member

hkaiser commented Nov 27, 2018

@msimberg I fully agree that we should do some thorough performance analysis before removing the static stuff.

@msimberg msimberg force-pushed the dynamic-bitset-default branch 6 times, most recently from d472197 to ea11416 Compare November 30, 2018 14:40
@msimberg
Copy link
Contributor Author

msimberg commented Dec 7, 2018

I did a quick test and it seems like simple things like bitwise and do vectorize nicely, while something like any doesn't (not sure how much sense that makes). As I said earlier my intention would be to remove all uses of cpu masks in critical sections in the local queue scheduler. That leaves the question: @hkaiser in what kind of hot loops did you or would you like to use cpu masks? If there's a clear use case I'd go ahead and benchmark, otherwise I won't bother.

@hkaiser
Copy link
Member

hkaiser commented Dec 7, 2018

I'd like to test on KNL (or any other AVX512 platform) before removing the static bitmaps. In general however, I agree with your assessment.

@msimberg
Copy link
Contributor Author

msimberg commented Dec 7, 2018

Sure, completely understand. I would also not completely remove them, just change the default. (Edit: I realized now that I first said I would remove the others if there's no slowdown, but there's no harm in keeping them as long as they're still tested.)

@msimberg msimberg changed the title WIP: Use dynamic bitset by default for cpu mask Use dynamic bitset by default for cpu mask Dec 11, 2018
@msimberg
Copy link
Contributor Author

Is there interest in having this in? If not I'd like to clean it up so that the dynamic bitset option at least works (but I wouldn't make it the default), or completely remove it if it's not going to be used or tested by anyone in any case. Don't want to leave this hanging around.

@hkaiser
Copy link
Member

hkaiser commented Feb 20, 2019

Is there interest in having this in? If not I'd like to clean it up so that the dynamic bitset option at least works (but I wouldn't make it the default), or completely remove it if it's not going to be used or tested by anyone in any case. Don't want to leave this hanging around.

Let's get this in. I'd feel better if we left the other options in place for now, however.

@msimberg
Copy link
Contributor Author

All right, I'll get it cleaned up.

@msimberg msimberg changed the title Use dynamic bitset by default for cpu mask WIP: Use dynamic bitset by default for cpu mask Feb 20, 2019
@msimberg
Copy link
Contributor Author

I changed one of the pycicle builders to use the dynamic bitset.

@msimberg msimberg force-pushed the dynamic-bitset-default branch 3 times, most recently from eff5e51 to 6f41a08 Compare February 27, 2019 09:54
@msimberg msimberg changed the title WIP: Use dynamic bitset by default for cpu mask Fix compilation with dynamic bitset for CPU masks Mar 12, 2019
@msimberg
Copy link
Contributor Author

This should be ready to go now.

@msimberg msimberg added this to the 1.3.0 milestone Mar 14, 2019
Copy link
Member

@hkaiser hkaiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go ahead with this. Thanks!

@msimberg msimberg merged commit 83c1a06 into STEllAR-GROUP:master Mar 15, 2019
@msimberg msimberg deleted the dynamic-bitset-default branch March 15, 2019 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants