[scudo] Enable "Delayed release to OS" feature for Android #65942
Conversation
Instead of immediately releasing any mapped memory back to the OS (via overlapping mmaps when MTE is enabled, madvise when not), the memory is retained and only given back after a configurable interval (5000 ms for now).

This change gives a slight performance increase in two ways:

* When MTE is enabled, mappings are made non-accessible with one mprotect instead of two mmaps.
* If the mapping is reused within the specified time interval, the contents are retained. This reduces the number of page faults.

Signed-off-by: Anders Dellien <anders.dellien@arm.com>
I think you're referring to the secondary allocator's cache with MTE enabled. I'm not sure what the performance gain is here; could you share in what cases you see the improvement, and some numbers? BTW, Android has already enabled this feature and uses a 1 s interval. Or are you suggesting a longer interval?
As Chia-hung mentions, this really only changes the secondary interval default from 0 to 1000 ms. If you look at the configs (allocator_config.h), you can see the explicit interval values. That's really where you should be making any changes. Although, if you are working on Android, we are using a custom config on top of tree, so you would need to modify that instead of the upstream config file. In addition, on Android all apps default to setting the secondary interval to 1000 ms anyway, so this is only going to affect native processes. So this seems like a very limited change that isn't likely to make much of a difference.
Anders has been working on MTE performance, specifically with GeekBench multicore numbers. This patch shows a pretty significant improvement on MTE, which I assume is due to the decreased mmap-sem lock contention in the kernel, because on MTE we go from two syscalls to one syscall per free. It does sound like the right place might be the Android-specific configuration though (sorry I missed that earlier). Chia-Hung and Christopher, do you think you can help Anders with evaluating the system-level impact of the change? I think we know it's good for perf, at the cost of some RSS, but I don't think he (or I for that matter) knows the best way to actually measure the negative RSS impact.
Unfortunately, this change should be a nop for running Geekbench. Unless the config file was also modified, the max is still going to be 1000 ms coming from the config information, and it should be 1000 ms without this change. Did this change get mixed in with other changes? Also, there have been a number of changes in Scudo recently that improve the multi-core benchmark performance, so comparing a snapshot from a while ago against the current TOT could mislead you. We would be happy to help with an evaluation of this though.
Hi, thanks for looking at this. You are right that this only changes the behavior of the secondary allocator; the primary will still use 1000 ms. However, that change does make a difference, as any freed memory blocks will retain their contents instead of being "over-mapped", i.e. we will make the range inaccessible with a single mprotect instead of mapping over it with two mmaps.
This is not a general improvement; it only makes a difference for applications or benchmarks that make heavy use of the secondary allocator, e.g. the SQLite and PDF Rendering sub-tests in Geekbench. All my tests have been stand-alone binaries run directly from the shell (mostly Geekbench). If Android apps already override this option, then I agree the impact will be smaller.
As for performance data, I have only looked at the multi-threaded workloads in Geekbench. For 8 threads, I get about 4% improvement in SQLite, 1% in PDF Rendering, and possibly smaller gains (< 1%) for other workloads that use the secondary allocator (e.g. HTML5), but those gains are within the noise level and so harder to see.
Are the binaries you run the way Geekbench is expected to be run? As far as I know, everyone doing benchmarking runs it through the app. From a long time ago, I vaguely recall using the adb shell am command to run parts of the tests, which runs them through the app, not as a stand-alone executable.
As far as I know, Geekbench is available both as an app and as a stand-alone binary. But I am not sure which is more commonly used. Regarding the app-specific Scudo configurations - could you please let me know where to find these? |
Circling back on this: the way that we run GB (and so do others on the device-side team) is to use a dynamically-linked gb executable that runs with the device's libc/libm/etc. Given that we see a major source of overhead from the secondary, I think it might be worth changing the Android secondary allocator config. I don't think it's worth going whole hog and creating an API to control the interval for primary/secondary individually (and then plumbing it through to some …). I'm not exactly sure how to measure the increased PSS though. Any ideas?
Before you go through with this, what is the benefit of this change? If there is nothing on the system that would benefit from it, I don't think this is a good idea. The problem is that it's very difficult to tell what the RSS increase will be. We can measure some of it, but because this is memory coming from the secondary, it can be a large increase in RSS for a single process; for example, it could be as bad as > 10 MB for a single process.

I do have a proposal specific to Android that makes all adb shell spawned processes set the decay timer to the max. I also have a way to make sure that any command-line tools spawned from an app get the same decay timer setting. I think this would be a better choice for this solution. I've already got the code that implements this and have tested it, and it appears to work. I should have CLs up soon, once I convince myself this is not going to be a problem.
Same opinion as @cferris1000. Just want to mention that #66717 can be an alternative to consider, which only impacts the MTE path.
So I had a look at measuring the results myself of a 5-second release-to-OS delay, using the following patch to …
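The patch itself was elided above; a hypothetical one-line illustration of what such a change could look like (the field name, file, and original value are assumptions, not the actual diff):

```cpp
// In the Android allocator config (hypothetical field name), raise the
// secondary cache's default release-to-OS interval from 1000 ms to 5000 ms:
static const s32 SecondaryCacheDefaultReleaseToOsIntervalMs = 5000; // was 1000
```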
Then I measured using Geekbench 5, using …
What's even more interesting is that disabling the secondary cache entirely produced some pretty significant speedups, but only on the midcores. This is only across two runs (rather than the 31 for the larger data set above).