
Adding thread local allocator and use it for future shared states #3406

Closed
hkaiser wants to merge 19 commits into master from thread_local_allocator

4 participants
@hkaiser (Member) commented Aug 10, 2018

This helps reduce the overheads caused by the allocation of future shared states.

This is currently WIP, but it can already be used for experimentation.


 return hpx::traits::future_access<future<result_type>>::create(
-    std::move(p));
+    p.release(), false);

@biddisco (Contributor) commented Aug 10, 2018

Is the only reason for going to the trouble of creating a unique_ptr with an allocator_deleter and then releasing it right after, to ensure that if an exception is thrown, all is cleared up?

@hkaiser (Author, Member) replied Aug 10, 2018

@biddisco yes, that is correct.
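
For context, the idiom under discussion looks roughly like this (a minimal, self-contained sketch of the pattern, not the actual HPX code; `deallocate_only` and `create_with` are illustrative names standing in for HPX's allocator_deleter and shared-state machinery):

```cpp
#include <memory>
#include <utility>

// Deleter that only returns storage to the allocator; used as an
// exception guard while the object is being constructed.
template <typename Allocator>
struct deallocate_only
{
    Allocator* alloc;
    void operator()(typename std::allocator_traits<Allocator>::pointer p) const
    {
        std::allocator_traits<Allocator>::deallocate(*alloc, p, 1);
    }
};

template <typename T, typename Allocator, typename... Ts>
T* create_with(Allocator alloc, Ts&&... ts)
{
    using traits = std::allocator_traits<Allocator>;
    std::unique_ptr<T, deallocate_only<Allocator>> p(
        traits::allocate(alloc, 1), deallocate_only<Allocator>{&alloc});

    // If this constructor throws, ~unique_ptr hands the memory back to
    // the allocator and nothing leaks.
    traits::construct(alloc, p.get(), std::forward<Ts>(ts)...);

    // Construction succeeded: release() disarms the guard and ownership
    // passes to the caller (in the PR, the future's shared state).
    return p.release();
}
```

Releasing right before handing the pointer to `future_access<...>::create(p.release(), false)` avoids double ownership while still cleaning up if construction throws.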

hkaiser force-pushed the thread_local_allocator branch from 42f1259 to ade140e on Aug 14, 2018

@biddisco (Contributor) commented Aug 17, 2018

Since this allocator is doing essentially what jemalloc/tcmalloc do, why not just create an arena in jemalloc for future shared states and use that?

@hkaiser (Author, Member) commented Aug 17, 2018

> Since this allocator is doing essentially what jemalloc/tcmalloc do, why not just create an arena in jemalloc for future shared states and use that?

@biddisco I didn't know this was possible; I'll investigate. From what I can see, however, I don't think tcmalloc/jemalloc can match the performance of this allocator. I'll try to collect some hard data on this.

@biddisco (Contributor) commented Aug 17, 2018

What you gain by using (for example) jemalloc is some memory garbage collection thrown in: if you keep allocating new pages of memory and then don't need them at some point, it will do some cleanup.

This might not be a big deal if you are only allocating small shared-state instances, but it's a shame to see the code base grow in this way. For task stacks we also need a thread-local allocator, and maybe we should just use the one you have included. I'd be very interested in seeing a comparison of this allocator against a dedicated jemalloc arena of the same size (meaning an arena created for objects of size N bytes).

@hkaiser (Author, Member) commented Aug 17, 2018

> What you gain by using (for example) jemalloc is some memory garbage collection thrown in: if you keep allocating new pages of memory and then don't need them at some point, it will do some cleanup.

The allocator proposed here does that as well. Overall I agree, however: I'm all for ditching this PR if we can reuse functionality from jemalloc (or similar) and still get decent performance.

> I'd be very interested in seeing a comparison of this allocator against a dedicated jemalloc arena of the same size (meaning an arena created for objects of size N bytes).

Good thinking. I'll prepare some benchmarks.

@hkaiser (Author, Member) commented Aug 17, 2018

@biddisco would you be able to point me to the documentation that describes using special jemalloc arena allocators?
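
For reference, jemalloc's non-standard API supports exactly this: a dedicated arena can be created via mallctl and targeted via mallocx flags. A minimal sketch, assuming an unprefixed jemalloc 4.x or later (older releases spell the control name "arenas.extend" instead of "arenas.create"):

```cpp
#include <jemalloc/jemalloc.h>

#include <cstddef>
#include <stdexcept>

// Create a fresh arena and return its index.
unsigned create_arena()
{
    unsigned arena = 0;
    std::size_t sz = sizeof(arena);
    if (mallctl("arenas.create", &arena, &sz, nullptr, 0) != 0)
        throw std::runtime_error("could not create jemalloc arena");
    return arena;
}

// Route an allocation to the given arena; bypassing the thread cache
// ensures the arena is actually exercised when benchmarking.
void* arena_alloc(unsigned arena, std::size_t size)
{
    return mallocx(size, MALLOCX_ARENA(arena) | MALLOCX_TCACHE_NONE);
}

void arena_free(void* p)
{
    dallocx(p, MALLOCX_TCACHE_NONE);
}
```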

@sithhell (Member) commented Aug 24, 2018

Something else worth considering: https://accu.org/index.php/journals/2533

@sithhell (Member) left a review comment

I would have expected an allocator that is completely free of synchronization primitives. That is, if data is going to get deleted on a thread that doesn't own the data, we could just add it to our freelist and not give it back to the original one.

void cleanup();

friend HPX_EXPORT void* thread_alloc(std::size_t size);
friend HPX_EXPORT void thread_free(void* addr);

@sithhell (Member) commented Aug 27, 2018

HPX_EXPORT is not needed here.

@hkaiser (Author, Member) replied Aug 27, 2018

We need it for MSVC :/

}

// FIXME: should we align this on page boundaries?
typename std::aligned_storage<PAGE_SIZE>::type data;

@sithhell (Member) commented Aug 27, 2018

I think this should definitely be aligned at page boundaries. Some compilers might have difficulties with over-alignment, though.

@hkaiser (Author, Member) replied Aug 27, 2018

Aligning the pages at page boundaries does not guarantee the alignment of the blocks inside them in any way. For that reason I'm not so sure about this anymore.
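
As an aside, page-aligned dynamic storage can be obtained without relying on over-aligned types, which sidesteps the compiler-support concern. A sketch assuming C++17 aligned allocation and a 4 KiB page (real code should query the page size at runtime):

```cpp
#include <cstddef>
#include <new>

constexpr std::size_t page_size = 4096;   // assumed; query the OS in real code

// C++17 aligned operator new returns storage aligned to the requested
// boundary, independent of the compiler's handling of over-aligned types.
void* allocate_page()
{
    return ::operator new(page_size, std::align_val_t(page_size));
}

void free_page(void* p)
{
    ::operator delete(p, std::align_val_t(page_size));
}
```

Note that, as pointed out above, this only aligns the page itself; blocks carved out of the page end up aligned only if the block size divides the page size.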

@hkaiser (Author, Member) commented Aug 27, 2018

> I would have expected an allocator that is completely free of synchronization primitives. That is, if data is going to get deleted on a thread that doesn't own the data, we could just add it to our freelist and not give it back to the original one.

I'm not sure this scheme would allow us to track the number of blocks allocated in a page (in order to know when a page can be freed). At the very least it would require making the block counter atomic (again), which turned out to have a significant impact...
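
To make the trade-off concrete, here is a minimal sketch of the synchronization-free scheme described above (illustrative only; it deliberately omits the per-page block counting raised in the reply, which is precisely the part that would need atomics):

```cpp
#include <cstddef>
#include <new>

// Intrusive node overlaid on freed blocks.
struct free_block { free_block* next; };

constexpr std::size_t block_size = 64;    // assumed fixed block size >= sizeof(free_block)

// Each thread owns its own list; no locks, no atomics.
thread_local free_block* free_list = nullptr;

void* thread_alloc()
{
    if (free_block* b = free_list)
    {
        free_list = b->next;              // reuse a block from this thread's list
        return b;
    }
    return ::operator new(block_size);    // fall back to fresh storage
}

void thread_free(void* p)
{
    // Cross-thread frees are simply adopted: whichever thread frees a
    // block keeps it on its own list instead of returning it to the
    // allocating thread.
    auto* b = static_cast<free_block*>(p);
    b->next = free_list;
    free_list = b;
}
```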

hkaiser force-pushed the thread_local_allocator branch 2 times, most recently from 0ae147a to a27f4a9, on Sep 13, 2018

hkaiser added some commits Aug 10, 2018

Cleaning up thread_allocator
- using atomic_count
- adding compressed_ptr, saves 2 bytes per allocation

hkaiser force-pushed the thread_local_allocator branch from a27f4a9 to 3f3a533 on Sep 14, 2018

hkaiser added some commits Sep 14, 2018

hkaiser force-pushed the thread_local_allocator branch from 418ad7a to 6bec6ba on Sep 14, 2018

hkaiser added some commits Sep 14, 2018

Fixing inspect problems
- more minor tweaks

sithhell and others added some commits Sep 15, 2018

hkaiser force-pushed the thread_local_allocator branch from c00c89a to 5e592f5 on Sep 16, 2018

Modifying allocator test to print timing information, adding std::allocator as a reference

- tweaking allocator itself
- fixing failing header compilation tests

hkaiser force-pushed the thread_local_allocator branch from 5e592f5 to e419cae on Sep 16, 2018

Fixing test failures
- flyby: fix some clang warnings

hkaiser force-pushed the thread_local_allocator branch from dd593d8 to b6983f1 on Sep 17, 2018

@sithhell (Member) commented Sep 17, 2018

Here is the data running the test on a two socket machine with 8 cores per socket:
https://docs.google.com/spreadsheets/d/18NYIeEuQ46MfsGhZn0smCXLXdGp7nMkmur-Z54r5db8/edit?usp=sharing

The std::allocator is backed by jemalloc. While we get an improvement from 1 to 4 cores, we can observe that the current implementation doesn't scale as well as the jemalloc implementation. I think the potential is clear here. I couldn't observe any performance regressions with other benchmarks.

// request the reference to the AGAS client only once
naming::resolver_client& get_agas_client()
{
    static naming::resolver_client& client = naming::get_agas_client();

@msimberg (Contributor) commented Sep 17, 2018

@hkaiser This doesn't play well with the tests that initialize the runtime multiple times (the ones that are failing on CircleCI). Does this change make a big difference?

@hkaiser (Author, Member) replied Sep 17, 2018

@msimberg Good point! I forgot about the multiple-init scenarios. I'll roll this change back. This could also explain the hangs. Thanks!
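
For anyone unfamiliar with the failure mode: a function-local static binds on first use and survives a runtime restart, so the cached reference goes stale. A self-contained toy (not HPX code) demonstrating the problem:

```cpp
#include <cassert>

struct runtime { int id; };

runtime* current = nullptr;              // stands in for the active runtime

runtime& get_runtime() { return *current; }

runtime& get_runtime_cached()
{
    static runtime& r = get_runtime();   // bound once, on the first call
    return r;
}

int main()
{
    runtime first{1};
    current = &first;
    assert(&get_runtime_cached() == &first);   // fine after the first init

    runtime second{2};                   // re-initialize the "runtime"
    current = &second;
    assert(&get_runtime_cached() == &first);   // stale: still the old instance
}
```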

@hkaiser (Author, Member) commented Sep 17, 2018

> The std::allocator is backed by jemalloc. While we get an improvement from 1 to 4 cores, we can observe that the current implementation doesn't scale as well as the jemalloc implementation. I think the potential is clear here. I couldn't observe any performance regressions with other benchmarks.

@sithhell I'm not sure where our allocator loses. Any idea?

hkaiser added some commits Sep 17, 2018

@hkaiser (Author, Member) commented Sep 17, 2018

@sithhell please try again. I have fixed the test to not try to allocate blocks larger than can be handled by the thread local allocator. Those large allocations had skewed the data.

@sithhell (Member) commented Sep 18, 2018

Here is the graph with the updated test:
https://docs.google.com/spreadsheets/d/e/2PACX-1vTiupxfK5Hgb7pGmIEKnCjzaq-fw_GimM-roV9NJOyEcYHGlbEQ6O9tDLsd-uzp3IdVfMm92coaWeMu/pubchart?oid=1063489052&format=interactive

The situation didn't change much, so I don't think the large allocations were what skewed the results. Using this new allocator should always be at least as fast as the system allocator (in this case jemalloc), and I don't know what causes the small regression here. The Phylanx application reported good speedups, though, so I think this PR is a proper basis for future optimizations.

@msimberg (Contributor) commented Sep 19, 2018

Could this eventually also be used for thread data/stack allocation, or is there something fundamentally incompatible for that purpose (e.g. the maximum block size seems quite small)?

@sithhell (Member) commented Sep 19, 2018

It can't be used for the stacks. The main difference is that we allocate the stacks lazily (only an mmap, without the pages actually being physically backed). thread_data is a different story.
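
For illustration, lazy stack allocation on POSIX looks roughly like this (a sketch; anonymous mappings are demand-paged, so physical memory is committed only when the stack is first touched, which a fixed-block thread-local allocator cannot reproduce):

```cpp
#include <sys/mman.h>

#include <cstddef>

// Reserve address space for a stack; no physical pages are committed
// until the memory is actually written to (demand paging).
void* allocate_stack(std::size_t size)
{
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

void free_stack(void* p, std::size_t size)
{
    munmap(p, size);
}
```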

@hkaiser (Author, Member) commented Sep 28, 2018

This can be closed now, as part of the changes have been incorporated into master (see #3465). The allocator itself is not needed, as it does not improve the picture over jemalloc. I will create a separate PR enabling the use of jemalloc on Windows.

hkaiser closed this on Sep 28, 2018

hkaiser deleted the thread_local_allocator branch on Sep 28, 2018
