Fix compilation with dynamic bitset for CPU masks #3566
On Knights Landing architectures (might not be representative, but anyway...) we have seen a massive speedup in the scheduler from letting the compiler vectorize (using AVX512) the statically sized bitmasks. We should verify that dynamic_bitset will be properly vectorized (i.e. all cores up to 512 are handled by a single operation).
That's a good point. My hope is that it wouldn't even need to be vectorized for good performance if the masks are never used in tight loops. Is that an unreasonable restriction?
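To make the vectorization question concrete, here is a minimal sketch of the two shapes being compared. It is not HPX code: `static_mask`, `dynamic_mask`, `and_static` and `and_dynamic` are hypothetical names, and `dynamic_mask` is a simplified stand-in for a runtime-sized bitset built on `std::vector<std::uint64_t>`, not `boost::dynamic_bitset` itself.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Statically sized mask: the width is a compile-time constant, so the
// compiler knows exactly how many machine words every operation touches.
using static_mask = std::bitset<512>;

// Hypothetical stand-in for a dynamically sized mask (an assumption for
// illustration only): 64-bit words whose count is chosen at runtime.
struct dynamic_mask
{
    std::vector<std::uint64_t> words;

    explicit dynamic_mask(std::size_t num_bits)
      : words((num_bits + 63) / 64, 0)
    {
    }

    void set(std::size_t i)
    {
        words[i / 64] |= std::uint64_t(1) << (i % 64);
    }

    bool test(std::size_t i) const
    {
        return ((words[i / 64] >> (i % 64)) & 1u) != 0;
    }
};

// Bitwise AND on the fixed-size mask.
static_mask and_static(static_mask const& a, static_mask const& b)
{
    return a & b;
}

// Bitwise AND on the runtime-sized mask: a plain word-by-word loop whose
// trip count is only known at runtime.
dynamic_mask and_dynamic(dynamic_mask const& a, dynamic_mask const& b)
{
    assert(a.words.size() == b.words.size());
    dynamic_mask r(a.words.size() * 64);
    for (std::size_t i = 0; i != a.words.size(); ++i)
        r.words[i] = a.words[i] & b.words[i];
    return r;
}
```

Whether either function compiles down to a single 512-bit operation depends on the compiler, flags and target. One way to check is to build with optimization enabled for the target architecture (for example `-O3 -march=native` with GCC or Clang) and inspect the generated assembly.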
1b8bc6c to a729c30
@msimberg I fully agree that we should do some thorough performance analysis before removing the static stuff.
d472197 to ea11416
I did a quick test and it seems like simple things like bitwise AND do vectorize nicely, while something like
I'd like to test on KNL (or any other AVX512 platform) before removing the static bitmaps. In general, however, I agree with your assessment.
Sure, completely understand. I would also not completely remove them, just change the default. (Edit: I realized now that I first said I would remove the others if there's no slowdown, but there's no harm in keeping them as long as they're still tested.)
Is there interest in having this in? If not, I'd like to clean it up so that the dynamic bitset option at least works (but I wouldn't make it the default), or completely remove it if it's not going to be used or tested by anyone in any case. I don't want to leave this hanging around.
Let's get this in. I'd feel better if we left the other options in place for now, however.
All right, I'll get it cleaned up.
ea11416 to 0973222
I changed one of the pycicle builders to use the dynamic bitset.
eff5e51 to 6f41a08
6f41a08 to e96bb99
This should be ready to go now.
Let's go ahead with this. Thanks!
This is related to #3482. It fixes compilation with a dynamic bitset for the CPU masks. The default is unchanged.
~~It changes the default CPU mask to `dynamic_bitset`. If this doesn't have a performance impact I think we should remove the other options and always use a dynamic bitset.~~

As far as I can tell there are two places where this might have a performance impact:

- `local_queue_scheduler`: when `numa_sensitive != 0` this scheduler operates on the CPU mask in `get_next_thread`. However, this could be done the same way as `local_priority_queue_scheduler`, which does all NUMA-sensitive work in `on_start_thread` only.
- `shared_priority_queue_scheduler`: this one uses fixed-size arrays with `HPX_HAVE_MAX_CPU_COUNT` and `HPX_HAVE_MAX_NUMA_DOMAIN_COUNT`. I'd expect vectors to work just as well since we're not dynamically allocating them in a tight loop, but I haven't checked. Update: The shared priority scheduler doesn't work with this option and gives warnings to the user.
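The idea behind both points, doing the mask work once at thread start and sizing containers at runtime instead of by a compile-time maximum, can be sketched roughly as follows. This is an illustration only, not the HPX implementation: `sketch_scheduler`, `cpu_mask` and `steal_candidates` are hypothetical names, and the meaning of the mask (bit `i` set means worker `i` shares the NUMA domain) is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dynamically sized CPU mask.
using cpu_mask = std::vector<bool>;

// Hypothetical scheduler skeleton; the member names echo the discussion
// above but this is not HPX code.
class sketch_scheduler
{
public:
    // Containers are sized once at construction from a runtime value
    // rather than from a compile-time maximum CPU count.
    explicit sketch_scheduler(std::size_t num_workers)
      : victims_(num_workers)
    {
    }

    // Called once per worker at startup: walk the mask here and cache
    // the indices of the other workers whose bit is set.
    void on_start_thread(std::size_t worker, cpu_mask const& numa_mask)
    {
        assert(worker < victims_.size());
        assert(numa_mask.size() == victims_.size());

        std::vector<std::size_t>& list = victims_[worker];
        list.clear();
        for (std::size_t i = 0; i != numa_mask.size(); ++i)
        {
            if (i != worker && numa_mask[i])
                list.push_back(i);
        }
    }

    // Hot path: reads only the cached indices and never touches the mask,
    // so the mask representation does not matter here.
    std::vector<std::size_t> const& steal_candidates(
        std::size_t worker) const
    {
        return victims_[worker];
    }

private:
    std::vector<std::vector<std::size_t>> victims_;
};
```

With this shape, the cost of a dynamically sized mask is paid once per worker at startup, and the only allocation happens outside any tight loop.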