[scudo] Enable "Delayed release to OS" feature for Android #65942
Conversation
Instead of immediately releasing any mapped memory back to the OS (via overlapping mmaps when MTE is enabled, madvise when not), the memory is retained and only given back after a configurable interval (5000 ms for now).

This change gives a slight performance increase in two ways:

* When MTE is enabled, mappings are made non-accessible with one mprotect instead of two mmaps.
* If the mapping is reused within the specified time interval, the contents are retained. This reduces the number of page faults.

Signed-off-by: Anders Dellien <anders.dellien@arm.com>
I think you're referring to the secondary allocator's cache with MTE enabled. I'm not sure what the performance gain is here; could you share in what cases you see the improvement, and some numbers? BTW, Android has already enabled this feature and uses a 1 s interval. Or are you suggesting a longer interval?
As Chia-hung mentions, this really only changes the secondary interval default from 0 to 1000 ms. If you look at the configs (allocator_config.h), you can see the explicit interval values. That's really where you should be making any changes. Although, if you are working on Android, we are using a custom config on top of tree, so you would need to modify that instead of the upstream config file. In addition, on Android all apps default to setting the secondary interval to 1000 ms anyway, so this is only going to affect native processes. So this seems like a very limited change that isn't likely to make much of a difference.
Anders has been working on MTE performance, specifically with GeekBench multicore numbers. This patch shows a pretty significant improvement on MTE, which I assume is due to the decreased mmap-sem lock contention in the kernel, because on MTE we go from two syscalls to one syscall per free. It does sound like the right place might be the Android-specific configuration though (sorry I missed that earlier). Chia-Hung and Christopher, do you think you can help Anders with evaluating the system-level impact of the change? I think we know it's good for perf, at the cost of some RSS, but I don't think he (or I for that matter) knows the best way to actually measure the negative RSS impact.
Unfortunately, this change should be a nop for running Geekbench. Unless the config file was also modified, the max is still going to be 1000 ms coming from the config information, and it should be 1000 ms without this change. Did this change get mixed in with other changes? Also, there have been a number of changes in Scudo recently that improve the multi-core benchmark performance, so comparing a snapshot from a while ago against the current TOT could mislead you. We would be happy to help with an evaluation of this though.
Hi, thanks for looking at this. You are right that this only changes the behavior of the secondary allocator; the primary will still use 1000 ms. However, that change does make a difference, as any freed memory blocks will retain their contents instead of being "over-mapped", i.e. we will make the range inaccessible with a single mprotect instead of mapping over it with two mmaps.
This is not a general improvement; it only makes a difference for applications or benchmarks that make heavy use of the secondary allocator, e.g. the SQLite and PDF Rendering sub-tests in Geekbench. All my tests have been stand-alone binaries run directly from the shell (mostly Geekbench). If Android apps already override this option, then I agree the impact will be smaller.
As for performance data, I have only looked at the multi-threaded workloads in Geekbench. For 8 threads, I get about 4% improvement in SQLite, 1% in PDF Rendering, and possibly smaller gains (< 1%) for other workloads that use the secondary allocator (e.g. HTML5), but those gains are within the noise level and so harder to see.
Are the binaries you run the way Geekbench is expected to be run? As far as I know, everyone doing benchmarking runs it through the app. From a long time ago, I vaguely recall using the adb shell am command to run parts of the tests, which runs them through the app, not as a stand-alone executable.
As far as I know, Geekbench is available both as an app and as a stand-alone binary. But I am not sure which is more commonly used. Regarding the app-specific Scudo configurations - could you please let me know where to find these? |
Circling back on this: the way that we run GB (and so do others on the device-side team) is to use a dynamically-linked gb executable that runs with the device's libc/libm/etc. Given that we see a major source of overhead from the secondary, I think it might be worth changing the Android secondary allocator config. I don't think it's worth going whole hog and creating an API to control the interval for primary/secondary individually (and then plumbing it through to some …). I'm not exactly sure how to measure the increased PSS though. Any ideas?
Before you go through with this, what is the benefit of this change? If there is nothing on the system that would benefit from it, I don't think this is a good idea. The problem is that it's very difficult to tell what the RSS increase will be. We can measure some of it, but because this is memory coming from the secondary, it can be a large increase in RSS for a single process; for example, it could be as bad as > 10 MB for a single process.

I do have a proposal specific to Android that makes all adb shell spawned processes set the decay timer to the max. I also have a way to make sure that any command-line tools spawned from an app get the same decay timer setting. I think this would be a better choice for this solution. I've already got the code that implements this and have tested it, and it appears to work. I should have CLs up soon, once I convince myself this is not going to be a problem.
Same opinion as @cferris1000. Just want to mention that #66717 can be an alternative to consider, which only impacts the MTE path.
So I had a look at measuring the results myself of a 5-second release-to-OS delay, using the following patch to …
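The patch itself was elided above; a hypothetical one-line illustration of what such a change could look like (the field name, file, and original value are assumptions, not the actual diff):

```cpp
// In the Android allocator config (hypothetical field name), raise the
// secondary cache's default release-to-OS interval from 1000 ms to 5000 ms:
static const s32 SecondaryCacheDefaultReleaseToOsIntervalMs = 5000; // was 1000
```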
Then I measured using Geekbench 5, using …
What's even more interesting is that disabling the secondary cache entirely produced some pretty significant speedups, but only on the midcores. This is only across two runs (rather than the 31 for the larger data set above).