Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

~lib/array/Array#push is unnecessarily garbage-y #1798

Closed
surma opened this issue Apr 11, 2021 · 13 comments · Fixed by #1841
Closed

~lib/array/Array#push is unnecessarily garbage-y #1798

surma opened this issue Apr 11, 2021 · 13 comments · Fixed by #1841

Comments

@surma
Copy link
Contributor

surma commented Apr 11, 2021

In a benchmark I discovered that AssemblyScript is significantly slower than JavaScript when using --runtime incremental when filling Arrays.

Language Runtime Average vs JS
JavaScript   233.68ms 1.0x
AssemblyScript stub 345.35ms 1.5x
AssemblyScript minimal 354.60ms 1.5x
AssemblyScript incremental 18758.50ms 80.3x

A d8 trace revealed the following profile:

[Bottom up (heavy) profile]:
  ticks parent  name
  18670   96.1%  /usr/lib/system/libsystem_platform.dylib
  13530   72.5%    Function: *~lib/rt/itcms/__renew
  13530  100.0%      Function: *~lib/array/ensureSize
  13530  100.0%        Function: *~lib/array/Array#push
  13530  100.0%          Function: *binaryheap_optimized/BinaryHeap#push
  13530  100.0%            Function: *binaryheap_optimized/push
   5119   27.4%    Function: *~lib/rt/itcms/__new
   5119  100.0%      Function: *~lib/rt/itcms/__renew
   5119  100.0%        Function: *~lib/array/ensureSize
   5119  100.0%          Function: *~lib/array/Array#push
   5119  100.0%            Function: *binaryheap_optimized/BinaryHeap#push

which led me to investigate the Array#push() implementation, which callsensureSize(). ensureSize() will create a new buffer with exactly one slot more if the previous buffer was full. This means that once you ran out of capacity once, every subsequent call to push() will create a new buffer copy all the data to the new buffer and create work for the garbage collector as the old buffer is now collectible.

Both Rust and Go amortize that work by doubling the capacity every time the buffer is exhausted. I’m sure C++ does the same but I couldn’t get through the stdlib code 😅

In an experiment, replacing the ~lib/array/Array with a CustomArray<T> that has exponential buffer growth, the performance problems go away:

Language Variant Runtime Average vs JS
JavaScript     233.68ms 1.0x
AssemblyScript customarray stub 329.43ms 1.4x
AssemblyScript customarray minimal 329.23ms 1.4x
AssemblyScript customarray incremental 335.35ms 1.4x
AssemblyScript ~lib stub 345.35ms 1.5x
AssemblyScript ~lib minimal 354.60ms 1.5x
AssemblyScript ~lib incremental 18758.50ms 80.3x

I would like to PR in a fix, but wanted to check first if there are any concerns about adopting a similar approach for ASC’s Array<T>?

@MaxGraey
Copy link
Member

That's well known problem with incremental runtime and constantly grown array (via serial of push for example).
cc @dcodeIO

@dcodeIO
Copy link
Member

dcodeIO commented Apr 11, 2021

That's indeed odd, yeah. Looking at the code again, ensureSize kinda relies on __renew to pre-alloc to the next FL, but __realloc actually only pre-allocs to the next SL, leading to a ton of unnecessary allocations. Need to check in more detail, but that may well have been wrong for a long time already, even predating the GC rework.

@surma
Copy link
Contributor Author

surma commented Apr 14, 2021

I don’t know what FL or SL are 🙈

What I don’t think either of you have answered: Are you opposed to adjusting the Array<T> impl. to grow exponentially? One of the V8 folks told me that it doesn’t have to be 2x, just a factor >1 of the previous capacity. It seems to be fairly common, even recommended practice.

@dcodeIO
Copy link
Member

dcodeIO commented Apr 14, 2021

It's a regression that it currently doesn't do this (anymore, not sure when exactly I messed this up), so absolutely, go for it. If we can tweak the factor a bit, the better. The only place where we may not want arrays to grow like this is when setting Array#length I think, which also uses ensureSize but there it doesn't make sense to grow by a factor.

One nice-to-have aspect there would be to also take block size of TLSF into account, so instead of growing unconditionally when length is exceeded, arrays would only need to grow when the TLSF block cannot be extended anymore. For instance, when a block's payload is just one byte large as would be the case for an array of just one i8 element, it can grow up to a certain limit (by just modifying the block) until the block is actually exhausted instead of reallocating.

FL and SL are the first and second level lists of TLSF, indicating block size. The payload size of a block is somewhere in between where the previous list would end and the next list would start, so that's where we can just modify the block to consider a larger payload while not needing to alloc a new one. Current size of the block is in mmInfo & ~3.

@MaxGraey
Copy link
Member

MaxGraey commented Apr 15, 2021

Could be just easily fix from:

var newLength = length + 1;

to

var newLength = min(16, length << 1);

here:
https://github.com/AssemblyScript/assemblyscript/blob/master/std/assembly/array.ts#L206

@surma
Copy link
Contributor Author

surma commented Apr 15, 2021

Is it that simple? 😅 I think that would still cause a reallocation on every push, just with a bigger buffer size so that half of it is unused.

@carlopi
Copy link

carlopi commented Apr 16, 2021

Something like:

function biggerPowerOf2(N)
{
    N = N - 1;
    N |= N >> 1;
    N |= N >> 2;
    N |= N >> 4;
    N |= N >> 8;
    N |= N >> 16;
    N |= N >> 32;  //redundant already?
    return N+1;
}
**** as noted by @MaxGraey: right shifts!

ensures (given an unsigned integer) a power of 2 bigger or equal of the starting value is returned.
Then it's just a matter of calling
usize = biggerPowerOf2(usize); inside ensureSize to force the dimension to be a power of 2.
Not sure what an equivalent bit-twiddling trick could be for choosing arbitrary multipliers, but for starters 2x seems a safe solution.

Note that calling only on entrance (=push codepath) solves the problem only partially, since also the [] path has to be checked, and seems more solid to do inside ensureSize.

@MaxGraey
Copy link
Member

MaxGraey commented Apr 16, 2021

if you need nextPowOf2 it could be implemented as n == 0 ? 0 : 1 << (32 - clz(n - 1)). Also biggerPowerOf2 should have right shifts instead left shifts.

Problem with growing to next pow of 2 is much more significant non-used memory footprint

@carlopi
Copy link

carlopi commented Apr 16, 2021

Thanks for pointing out the mistake.
I imagine there is plenty of literature to be researched, but given a primitive to find the highest set bit there are plenty of constant time solutions to grow less than 2x.
Eg.

  1. find highest set bit (calling this number N).
  2. either go to N * 2 or to N * 1.414 (by checking with a single comparison against N * 1.414).
    Here the multiplier becomes sqrt(2) = 1.414, and with a similar trick smaller multiplier could be used.

@MaxGraey
Copy link
Member

I guess we could test / bench 3 solutions:

  1. glowing by factor 2
  2. glowing by next power of 2
  3. hybrid approach. Use (2) until 2 ** 16 and use (1) which more economic after this threshold

@dcodeIO
Copy link
Member

dcodeIO commented Apr 16, 2021

Note that the bit math above (relying on clz) is about what the memory manager is already doing for FLs and SLs, so that's where my suggestion comes in to build something that cooperates with the MM :)

// === The TLSF (Two-Level Segregate Fit) memory allocator ===
// see: http://www.gii.upv.es/tlsf/
// - `ffs(x)` is equivalent to `ctz(x)` with x != 0
// - `fls(x)` is equivalent to `sizeof(x) * 8 - clz(x) - 1`

@MaxGraey
Copy link
Member

MaxGraey commented Apr 16, 2021

I figured out that nextPowOf2(n + 1) and n << 1 is the same if n0 (initial value) already have power of two value like 16. So much more interesting use exponential glow + grow by factor of 2. Something like this:

function log2(n: u32): u32 {
  return 31 - clz(n | 1);
}

function computeGrowFactor(len: u32): u32 {
  if (len < (1 << 16)) {
    return len * (1 + log2(len));
  } else {
    return min(len << 1, 1 << 30);
  }
}

In this case it takes only 4 grows (reallocs) between [0, 65536] and 15 grows between [0, 67108864]

@MaxGraey
Copy link
Member

MaxGraey commented Apr 21, 2021

It seems I found the best empirical expansion (logarithmic):

function computeGrowFactor(len: u32): u32 {
  return len * (1 << (2 - ((log2(len - 8) - 8) >>> 4)));
}

which produce this series - (length value, grow factor):

8 8
64 8
512 8
2048 4
8192 4
32768 4
131072 4
524288 4
2097152 4
8388608 4
33554432 4
67108864 2
134217728 2
268435456 2
536870912 2
1073741824 2
2147483648 2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants