PR - Ticket 51072 - improve autotune defaults #4126
Comments
Comment from firstyear (@Firstyear) at 2020-05-07 06:48:14 It's worth noting that, even on basic tests, this was significantly faster on my machine: roughly a 25% speedup.
Comment from tbordaz (@tbordaz) at 2020-05-07 10:50:58 Is 'threads' the final nsslapd-threadnumber? The fix looks good, but I have a doubt. The default was 30. Unless workers are very slow or doing very long jobs, I would expect 30 workers to be enough for most loads with rapid operations. By any chance, do you have searchrate/modrate numbers that show #workers > 50 or 100 are beneficial?
Comment from firstyear (@Firstyear) at 2020-05-07 12:37:01 @tbordaz it sets the number of threads based on how many CPU hardware threads are presented by the OS. So if you have a 4-core machine, it's 4; if you have 256, it's 256. If you have an 8-core machine with hyperthreading, it would be 16. The <512 there is a cap that we don't exceed, not a minimum.
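For illustration, here is a minimal sketch of the rule described above: detect the hardware thread count the OS presents and cap it at 512. The function and constant names are hypothetical, not the actual 389-ds-base code.

```c
/* A minimal sketch of the tuning rule described above; names here are
 * illustrative, not taken from the 389-ds-base sources. */
#include <stdio.h>
#include <unistd.h>

#define MAX_AUTOTUNE_THREADS 512

static long
autotune_thread_number(void)
{
    /* Hardware threads the OS presents (cores x SMT). */
    long hw_threads = sysconf(_SC_NPROCESSORS_ONLN);
    if (hw_threads < 1) {
        hw_threads = 1;
    }
    /* 512 is a cap we never exceed, not a minimum. */
    return (hw_threads < MAX_AUTOTUNE_THREADS) ? hw_threads
                                               : MAX_AUTOTUNE_THREADS;
}

int
main(void)
{
    printf("thread number would be tuned to %ld\n", autotune_thread_number());
    return 0;
}
```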
Comment from vashirov (@vashirov) at 2020-05-07 13:08:44 Another data point: search rate on a 4-CPU machine, 1 client with 1-10 threads:
Comment from firstyear (@Firstyear) at 2020-05-07 13:46:29 @vashirov wow, those numbers are stunning. I think the scaling at the high end (10 client threads) is more important than the low numbers, since we need to consider high concurrency as a key workload for us.
Comment from lkrispen (@elkris) at 2020-05-07 16:25:05 Yes, it looks good at first sight, but you need to see what happens under a mixed load; I think write operations can easily block 4 threads and delay all binds and searches. We should not optimize for one specific load pattern.
Comment from firstyear (@Firstyear) at 2020-05-08 01:32:18 I think this may not be true going forward: with lmdb and a concurrent cache design we can only have a single active writer, which means we could distinguish between read operations and write operations in the thread pool to guarantee that bind/read is always separate. The flip side is that we could also have many, many readers stalling writers, causing them to stall too. That said, I still agree that @vashirov can do some more of his excellent load testing to check this patch :)
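As a hypothetical sketch of the read/write split speculated about above (none of these names are from 389-ds-base or lmdb; this only illustrates the routing idea):

```c
/* Hypothetical sketch: route write operations to a dedicated pool so reads
 * never queue behind lmdb's single active write transaction. */
#include <stdio.h>

typedef enum { OP_BIND, OP_SEARCH, OP_MOD, OP_ADD, OP_DEL } op_type_t;

/* Returns 1 if the operation must take the single lmdb write transaction. */
static int
op_is_write(op_type_t t)
{
    return t == OP_MOD || t == OP_ADD || t == OP_DEL;
}

static const char *
pool_for_op(op_type_t t)
{
    /* Reads scale across many threads; writes serialise on one. */
    return op_is_write(t) ? "write pool (single active writer)"
                          : "read pool (scales with cores)";
}

int
main(void)
{
    printf("SRCH -> %s\n", pool_for_op(OP_SEARCH));
    printf("MOD  -> %s\n", pool_for_op(OP_MOD));
    return 0;
}
```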
Comment from vashirov (@vashirov) at 2020-05-12 09:12:52 Here are the test results with 1-30 client threads and 4, 8, 16, and 24 worker threads. This is with only this patch applied, no other tunings.
Comment from tbordaz (@tbordaz) at 2020-05-12 10:15:16 Thanks for all these runs! @vashirov, for the mixed load, is it a sync operation? Is it accounting MOD+SEARCH as one operation? It looks like the search rate is hidden by the mod rate. For MODs, at the moment we cannot really conclude a benefit of higher or lower workers. For searches, there is still an unexpected, significant negative impact from #workers.
Comment from vashirov (@vashirov) at 2020-05-12 10:26:14 It was a SRCH followed by the MOD. I will add another test with async SRCH and MOD.
Comment from firstyear (@Firstyear) at 2020-05-13 06:45:54 My analysis is that search seems to improve with more threads, but something causes contention that leads to the loss - so fewer threads mean less contention, which is why the CPU-matched thread count gives better search throughput and latency. In the mixed workload it appears our writes heavily impact the searches, so our write path is likely preventing search performance from improving. Regardless, it didn't make things worse, so I'm of course in favour of this change :)
Comment from mreynolds (@mreynolds389) at 2020-05-13 15:50:29 IMHO LGTM; like William said, it's not hurting the numbers. It's definitely an improvement, and if we need to fine-tune at a later date, so be it.
Comment from mreynolds (@mreynolds389) at 2020-05-13 15:58:09 Actually, there is something I'd like to see tested with this change: a machine with more CPUs/cores. We tested a 4-core machine and setting the thread number to 4 was great, but what about a 16-core system with varying worker threads? Do we see the same improvement if we set the thread number to 16 vs. 32, 8, or 4? @vashirov - would it be hard to reserve a system with this hardware and run one more round of tests?
Comment from mreynolds (@mreynolds389) at 2020-06-02 18:29:11 When I did an investigation for another potential customer a few years ago, I also saw that setting the thread number to the number of cores gave the best performance. I think this is definitely an improvement over what we had. Ack.
Comment from firstyear (@Firstyear) at 2020-06-03 01:13:21 rebased onto 5eacf45e7caa50de2721f85d7fbee58767bcb8f0
Comment from firstyear (@Firstyear) at 2020-06-03 01:15:22 rebased onto 9a06935
Comment from firstyear (@Firstyear) at 2020-06-03 01:16:00 Pull-Request has been merged by Firstyear
Patch
Cloned from Pagure Pull-Request: https://pagure.io/389-ds-base/pull-request/51073
Bug Description: We have learnt that the CPU autotuning is too aggressive, potentially
decreasing throughput due to the overhead of context switching and lock contention, and
that our memory tuning is not aggressive enough, at only 10% of system memory.
Additionally, in containers we have access to defined memory limits and reservations,
so we can choose to be even more aggressive in our tuning choices.
Fix Description: Change the thread tuning to match the number of threads available on
the system. Change the memory tuning to 25% of system memory by default. Finally, add
an environment variable for containers, DS_MEMORY_PERCENTAGE, that allows more
aggressive tuning to be set. Later this could be given a higher default value.
Resolves: #4125
Author: William Brown william@blackhats.net.au
Review by: ???