
PR - Ticket 51072 - improve autotune defaults #4126

Closed
389-ds-bot opened this issue Sep 13, 2020 · 18 comments
Labels: merged, pr

Comments


389-ds-bot commented Sep 13, 2020

Cloned from Pagure Pull-Request: https://pagure.io/389-ds-base/pull-request/51073


Bug Description: we have learnt that the CPU autotuning is too aggressive, potentially
decreasing throughput due to overhead in context switching and lock contention, and
that our memory tuning is not aggressive enough, at only 10% of the system memory.
Additionally, in containers we have access to memory limits and reservations, so we
can choose to be even more aggressive in our selection.

Fix Description: Change thread tuning to match the number of threads available on
the system. Change memory tuning to 25% of system memory by default. Finally, add
an environment variable, DS_MEMORY_PERCENTAGE, to containers, allowing more
aggressive tuning to be set. Later this could be set to a higher default value.
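
For illustration only, a minimal sketch of the tuning described above (this is not the actual 389-ds-base code; the function name, the 1-90 clamp, and the use of sysconf are assumptions for the example). It defaults the memory target to 25% of system memory and lets DS_MEMORY_PERCENTAGE override the percentage in containers:

/*
 * Illustrative sketch only: not the 389-ds-base implementation.
 * Default to 25% of system memory (was 10%), with a container
 * override via the DS_MEMORY_PERCENTAGE environment variable.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>

static uint64_t
autotune_memory_target(void)
{
    /* Total system memory in bytes (ignores cgroup limits for brevity). */
    uint64_t pages = (uint64_t)sysconf(_SC_PHYS_PAGES);
    uint64_t page_size = (uint64_t)sysconf(_SC_PAGE_SIZE);
    uint64_t total = pages * page_size;

    /* New default: 25% of system memory. */
    long percent = 25;

    /* Container override, clamped to a sane range (assumed bounds). */
    const char *env = getenv("DS_MEMORY_PERCENTAGE");
    if (env != NULL) {
        long p = strtol(env, NULL, 10);
        if (p >= 1 && p <= 90) {
            percent = p;
        }
    }

    return (total / 100) * (uint64_t)percent;
}

int
main(void)
{
    printf("autotuned memory target: %llu bytes\n",
           (unsigned long long)autotune_memory_target());
    return 0;
}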

Resolves: #4125

Author: William Brown william@blackhats.net.au

Review by: ???

389-ds-bot added the merged and pr labels on Sep 13, 2020

Comment from firstyear (@Firstyear) at 2020-05-07 06:48:14

It's worth noting that even on basic tests, this was significantly faster on my machine:

before:
35 passed, 3 skipped, 150 warnings in 239.14s (0:03:59)
after:
35 passed, 3 skipped, 150 warnings in 173.03s (0:02:53)

That's roughly a 25% speedup.
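
(As a quick check of the figures above: 239.14 s down to 173.03 s is a reduction of (239.14 - 173.03) / 239.14 ≈ 27.6% in wall-clock time, so "roughly 25%" is, if anything, slightly conservative.)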


Comment from tbordaz (@tbordaz) at 2020-05-07 10:50:58

Is 'threads' the final nsslapd-threadnumber?
If yes, it looks very high.

The fix looks good but I have a doubt. The default was 30. Unless workers are very slow or doing very long jobs, I would expect 30 workers to be enough for most loads with rapid operations. By any chance do you have searchrate/modrate numbers that show #workers > 50 or 100 are beneficial?


Comment from firstyear (@Firstyear) at 2020-05-07 12:37:01

@tbordaz it sets the number of threads based on how many CPU hardware threads are presented by the OS. So if you have a 4-core machine, it's 4. If it's 256, it's 256. If you have an 8-core machine with hyperthreads, it would be 16.

The <512 there is a cap that we don't exceed. Not a minimum.
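
For readers skimming the thread, a minimal sketch of that CPU-matched tuning (not the actual patch code; the function name and the sysconf call are assumptions, with the 512 ceiling mentioned above):

/*
 * Illustrative sketch only: worker thread autotuning as described above.
 * Use the number of online CPU hardware threads reported by the OS,
 * capped at 512 (a ceiling, not a minimum).
 */
#include <stdio.h>
#include <unistd.h>

static long
autotune_threads(void)
{
    long cpus = sysconf(_SC_NPROCESSORS_ONLN);
    if (cpus < 1) {
        cpus = 1;    /* fall back to a single worker if detection fails */
    }
    if (cpus > 512) {
        cpus = 512;  /* the <512 cap, never a minimum */
    }
    return cpus;
}

int
main(void)
{
    printf("autotuned worker threads: %ld\n", autotune_threads());
    return 0;
}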


Comment from vashirov (@vashirov) at 2020-05-07 13:08:44

Another data point: search rate on a 4 CPU machine, 1 client with 1-10 threads:

Client threads   4 workers    24 workers
 1                5837.800     6160.100
 2                8549.308    11939.750
 3                7687.200    17349.333
 4                9962.475     7191.825
 5               16807.538     8386.949
 6               17051.641     9447.744
 7               17788.385     9990.026
 8               18838.359    10417.077
 9               22419.949    10889.000
10               23950.950    11561.769


Comment from firstyear (@Firstyear) at 2020-05-07 13:46:29

@vashirov wow, those numbers are stunning. I think the scaling at the high end (10 client threads) is more important than the low numbers, since we need to consider high concurrency as a key workload for us.


Comment from lkrispen (@elkris) at 2020-05-07 16:25:05

Yes, it looks good at first sight. But you need to see what happens with a mixed load; I think write operations can easily block 4 threads and delay all binds and searches.

We should not optimize for one specific load pattern.


Comment from firstyear (@Firstyear) at 2020-05-08 01:32:18

I think that this may not be true going forward, as with LMDB on a concurrent cache design we can only have a single active writer, which means we could distinguish between read operations and write operations in the thread pool, to guarantee that bind/read is always separate.

The flip side of this is that we could also have many, many readers stalling writers, causing them to stall too.

Saying this, I still agree that @vashirov can do some more of his excellent load testing to check this patch :)


Comment from tbordaz (@tbordaz) at 2020-05-12 10:15:16

Thanks for all these runs !

@vashirov for the mixed load, is it a sync operation? Is it accounting MOD+SEARCH as one operation? It looks like the search rate is hidden by the mod rate.

For MODs, at the moment we cannot really conclude whether higher or lower worker counts are a benefit. For searches, there is still an unexpected, significant negative impact from increasing #workers.


Comment from vashirov (@vashirov) at 2020-05-12 10:26:14

@vashirov for the mixed load, is it a sync operation? Is it accounting MOD+SEARCH as one operation? It looks like the search rate is hidden by the mod rate.

It was a SRCH followed by the MOD:

[12/May/2020:08:23:26.169046247 +0200] conn=1868 op=3715 SRCH base="dc=example,dc=com" scope=2 filter="(uid=user.8306)" attrs="givenName sn mail"
[12/May/2020:08:23:26.169258473 +0200] conn=1868 op=3715 RESULT err=0 tag=101 nentries=1 etime=0.000289852
[12/May/2020:08:23:26.169500575 +0200] conn=1868 op=3716 MOD dn="uid=user.8306,ou=People,dc=example,dc=com"
[12/May/2020:08:23:26.181633127 +0200] conn=1868 op=3716 RESULT err=0 tag=103 nentries=0 etime=0.012312424

I will add another test with async SRCH and MOD.


Comment from firstyear (@Firstyear) at 2020-05-13 06:45:54

My analysis of what this shows is that search seems to improve with more threads, but something causes contention that leads to the loss - so fewer threads means less contention, which is why the CPU-matched thread count gives better search throughput and latency. It appears that in the mixed workload our writes heavily impact the searches, so I think our write path is likely preventing further search performance improvement. Regardless, it didn't make things worse, so I'm of course in favour of this change :)


Comment from mreynolds (@mreynolds389) at 2020-05-13 15:50:29

IMHO LGTM; like William said, it's not hurting the numbers. It's definitely an improvement, and if we need to fine-tune at a later date, so be it.


Comment from mreynolds (@mreynolds389) at 2020-05-13 15:58:09

Actually, there is something I'd like to see tested with this change: a machine with more CPUs/cores.

So we tested a 4-core machine and setting the thread number to 4 was great, but what about a 16-core system with varying worker threads? Do we see the same improvement if we set the thread number to 16 vs 32 or 8 or 4?

@vashirov - would it be hard to reserve a system with this hardware and run one more round of tests?


Comment from mreynolds (@mreynolds389) at 2020-06-02 18:29:11

When I did an investigation for another potential customer a few years ago, I also saw that setting the thread number to the number of cores gave the best performance. I think this is definitely an improvement over what we had. Ack.


Comment from firstyear (@Firstyear) at 2020-06-03 01:13:21

rebased onto 5eacf45e7caa50de2721f85d7fbee58767bcb8f0


Comment from firstyear (@Firstyear) at 2020-06-03 01:15:22

rebased onto 9a06935


Comment from firstyear (@Firstyear) at 2020-06-03 01:16:00

Pull-Request has been merged by Firstyear


Patch
51073.patch
