WIP. More Efficient MemoryAllocator #1660
Conversation
: base(IntPtr.Zero, true)
{
    this.SetHandle(Marshal.AllocHGlobal(size));
    GC.AddMemoryPressure(this.byteCount = size);
Should this be done? This would lead to faster gen2 collections, and the object itself would behave as if it were a managed allocation, which I assume is not what we want for large chunks of data.
The Microsoft docs state that "The AddMemoryPressure and RemoveMemoryPressure methods improve performance only for types that exclusively depend on finalizers to release the unmanaged resources". That is not the case here: the SafeHandle is only used to prevent a memory leak when the user forgets to call Dispose, which is very unlikely to be caused by library code (this should be tested somehow, to be honest).
Otherwise this would lead to memory throttling, as with the current implementation.
That advice is awkwardly worded. In the case where a user fails to Dispose an object backed by an unmanaged buffer, it does rely on the SafeHandle's finalizer to release the memory, so it is advantageous to advise the GC that it might be able to reclaim the memory by doing its GC thing. This should only result in more frequent gen2 GCs if the memory limits are being reached, which should only happen if the unmanaged buffer is long-lived or if the system is actually low on memory. For properly Disposed ephemeral buffers, the GC will be aware of the added memory pressure but will also see it freed quickly, so no harm.
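For context, a minimal sketch of the pattern being discussed, assuming a SafeHandle-derived handle like the one in the diff (names are illustrative, not the PR's exact type): the allocation pairs AllocHGlobal with AddMemoryPressure, and the release path, whether it runs from Dispose or from the SafeHandle's finalizer, pairs FreeHGlobal with RemoveMemoryPressure.

```csharp
using System;
using System.Runtime.InteropServices;

// Illustrative sketch only; not the PR's actual UnmanagedBuffer<T> handle.
internal sealed class UnmanagedBufferHandle : SafeHandle
{
    private readonly int byteCount;

    public UnmanagedBufferHandle(int size)
        : base(IntPtr.Zero, ownsHandle: true)
    {
        this.SetHandle(Marshal.AllocHGlobal(size));
        this.byteCount = size;

        // Tell the GC about the unmanaged allocation so it can account for it
        // if the handle ends up being reclaimed via the finalizer.
        GC.AddMemoryPressure(size);
    }

    public override bool IsInvalid => this.handle == IntPtr.Zero;

    // Runs from Dispose(), or from the SafeHandle critical finalizer if the
    // owner leaked the buffer.
    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(this.handle);
        GC.RemoveMemoryPressure(this.byteCount);
        return true;
    }
}
```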
I'm not sure these buffers can be called 'ephemeral'; 2MB is a lot (at least it looks like a lot for something temporary), which should mean a lot of execution time spent on this memory. The primary idea I had while writing this was that even if the user forgets to dispose memory backed by a handle, it'll be freed by a gen0/gen1 collection or a full gen2 collection when an OutOfMemoryException is about to kick in.
While I'm not so sure about this now, I'm still concerned that this might be overkill because of the freachable queue. At the very least I'd want to benchmark this on something big, like your parallel bee heads demo.
Unmanaged buffers will only be used for allocations that are over the pooled size limit. They should be used once and then released, hence 'ephemeral'. The question of whether GC.AddMemoryPressure is appropriate ultimately comes down to "is there any way a GC could free this memory?". The answer here is yes, but only if the memory has been leaked because someone didn't dispose. Unfortunately there's no way to know whether someone is going to leak; you can only tell after they've done it, when your finalizer runs.
What the MS docs should have said is something along the lines of "don't tell the GC about memory it couldn't possibly reclaim". If the allocation and free were always 100% deterministic, there would be no point in telling the GC about them. But since we can't know whether the GC could reclaim the memory, it's better to tell it about the allocation than not to.
Once again, thanks for the deep dive! I've never looked at the GC from the "memory it can or cannot reclaim" point of view.
/// Does not directly use any hardware intrinsics, nor does it incur branching.
/// </summary>
/// <param name="value">The value.</param>
private static int Log2SoftwareFallback(uint value)
The Jpeg encoder uses similar logic for internal operations; this could potentially be placed somewhere in the Numerics.cs class for reuse.
P.S.
Jpeg needs to calculate the minimum bit size of a given number, which can be calculated via the Log2DeBruijn fallback logic on non-intrinsic hardware. Not really a game changer, but it would be nice to have these under a single class & single test fixture.
That uses BitOperations.LeadingZeroCount, does it not? Numerics.MinimumBitsToStore16 is awkwardly named, actually. 16 what?
If the Lzcnt intrinsic is supported, yes. But the fallback code currently uses a LuT of 255 values, with at most one possible branch if the value exceeds 255, so 16 bits is the maximum value bit size this method can safely calculate. A DeBruijn sequence should be a bit faster and would eliminate the 16-bit maximum value constraint. The code would be a bit different; that table is what can be shared between the jpeg & pool code. I'll open a PR with the naming fix & algorithms after finishing my current open PR for testing.
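For reference, the DeBruijn-based fallback being referred to looks like the following; the table and multiplier mirror the software fallback used by BitOperations.Log2 in the BCL, while the minimum-bits helper on top is a hypothetical sketch of the shared jpeg/pool usage:

```csharp
using System;

internal static class BitHelpers
{
    private static ReadOnlySpan<byte> Log2DeBruijn => new byte[32]
    {
        00, 09, 01, 10, 13, 21, 02, 29,
        11, 14, 16, 18, 22, 25, 03, 30,
        08, 12, 20, 28, 15, 17, 24, 07,
        19, 27, 23, 06, 26, 05, 04, 31
    };

    // Branch-free floor(log2(value)); returns 0 when value is 0.
    private static int Log2SoftwareFallback(uint value)
    {
        // Smear the highest set bit downwards so value becomes 2^n - 1.
        value |= value >> 01;
        value |= value >> 02;
        value |= value >> 04;
        value |= value >> 08;
        value |= value >> 16;

        // Multiply by the DeBruijn-style constant and use the top 5 bits as an index.
        return Log2DeBruijn[(int)((value * 0x07C4ACDDu) >> 27)];
    }

    // Hypothetical helper without the 16-bit input limit of the LUT approach
    // (treats 0 as needing 1 bit; adjust the zero convention as required).
    public static int MinimumBitsToStore(uint value) => Log2SoftwareFallback(value) + 1;
}
```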
Your maths is better than mine! 😀
Cool I’ll leave that to you then. Thanks!
No problem! I'll try to do it today so you can use it in this PR before merging.
Lovely, thanks!
@saucecontrol I had a go at profiling. Master: (profiler screenshot omitted). Here's your original sample: (profiler screenshot omitted).
/// </summary>
private ArrayPool<byte> largeArrayPool;
private const int DefaultMaxArraysPerBucket = 16;
Since we utilize only one bucket now, an intensive parallel load will defer to the unmanaged stuff almost all the time. I would try benchmarking with different bucket counts.
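If it helps frame the experiment, the two knobs in play here are the maximum pooled array length and the number of arrays retained per bucket; the built-in ArrayPool<byte>.Create exposes the same pair, so a quick sketch of the configurations to compare (illustrative values only, not the PR's actual settings) could look like:

```csharp
using System.Buffers;

// Illustrative benchmark configurations; the PR's pool is the custom
// GCAwareConfigurableArrayPool<T>, but the knobs being tweaked are the same.
ArrayPool<byte> baseline = ArrayPool<byte>.Create(maxArrayLength: 2 * 1024 * 1024, maxArraysPerBucket: 16);
ArrayPool<byte> deeperBuckets = ArrayPool<byte>.Create(maxArrayLength: 2 * 1024 * 1024, maxArraysPerBucket: 64);
```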
I'm gonna need help writing those tests and improving others to ensure they work in CI.
ThrowInvalidAllocationException<T>(length);
// For anything greater than our pool limit defer to unmanaged memory
// to prevent LOH fragmentation.
memory = new UnmanagedBuffer<T>(length);
Would be great to see some metrics, how many times do we go here VS to the pool.
I can help with the EventSource stuff.
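For what it's worth, the EventSource side of this can be very small; a hypothetical sketch (not the PR's actual diagnostics type) that counts pool rents vs unmanaged allocations would be along these lines, and the counts can then be collected out-of-process with dotnet-trace or in-process with an EventListener:

```csharp
using System.Diagnostics.Tracing;

// Hypothetical allocation-tracking EventSource; fire one event from each code path.
[EventSource(Name = "SixLabors-ImageSharp-MemoryAllocation")]
internal sealed class AllocationEventSource : EventSource
{
    public static readonly AllocationEventSource Log = new AllocationEventSource();

    [Event(1, Level = EventLevel.Informational)]
    public void PoolRent(int lengthInBytes) => this.WriteEvent(1, lengthInBytes);

    [Event(2, Level = EventLevel.Informational)]
    public void UnmanagedAllocation(int lengthInBytes) => this.WriteEvent(2, lengthInBytes);
}
```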
I know the BMP codec hits this a few times. I saw that when I made a mistake in the UnmanagedBuffer<T> type. Hopefully my EventSource stuff works ok.
In order to have a fair comparison, you need to benchmark the old and the new allocator on the same PC. You can make a copy of the old allocator class in a different namespace for simplicity.
I would try to tweak different parameters and see how the results change.
/// </summary>
private ArrayPool<byte> normalArrayPool;
internal const int DefaultMaxArrayLengthInBytes = 2 * SharedPoolThresholdInBytes;
This is also a parameter we can tweak!
Yep. Time is against me now though, it's nearly 3am.
The code is still pooling only 16*2 = 32MB; that's a tiny part of the 1.7GB peak. What we see here must be the result of building discontiguous buffers out of 2MB unmanaged memory blocks. @JimBobSquarePants can we run an experiment trying to pool significantly more (128MB, 512MB, 1024MB)? If we get better results with unmanaged buffers, we may consider dropping the large pool entirely, though I still have concerns about the reliability of SafeHandle finalizers. AFAIK a single unhandled exception in any user finalizer will prevent the finalizer queue from finishing, leading to actual memory leaks. @saucecontrol thoughts?
Marking this as ready to review. There's a lot of testing and configuration to be done but I think the functionality is pretty much where we need it to be.
We still need to figure out how much value there is in pooling; if it has no real effect, that's a game changer for the current implementation and API shape. We also need metrics on throughput; the new memory access patterns may have an impact on cache utilization.
I'll leave the memory profiling to the more proficient. I need to take a break; I'm struggling with jet lag.
TBH it's also a very difficult time for me now. There's plenty of work left IMO; I recommend slowing down and treating this as a marathon rather than a sprint, there's no good in rushing it. I'll see what I can do on Saturday.
@antonfirsov Correct. I was attempting to explain why the profile of the initial PR showed much lower total VirtualAlloc numbers despite much higher peak and baseline memory. The switch to unmanaged makes every buffer allocation an actual allocation, giving a higher total VirtualAlloc. But those allocations are released immediately, keeping the instantaneous committed memory lower. When falling back to unmanaged allocations, keeping the 2MB discontiguous buffer strategy will be a negative perf-wise in that
That's true, but only because an unhandled exception in a finalizer will crash the process. 😆 Your finalizers in particular will only be calling
@antonfirsov I need to better understand what we want here. As I see it, the larger the pool, the more aggressively we will have to trim it, as it will end up holding onto memory again. With our new logic we'd end up adding 1024MB arrays to that pool and almost nothing would go to unmanaged memory.
@saucecontrol Actually, it's pretty simple as I recall. I think if we simply adjusted
Yeah, but that would mean larger managed buffer requests too, which would then always fail, no? Would be a quick way to test whether going all unmanaged would be ok, though. One more note on perf around that... In addition to the fact Correction:
Ah no. If the requested amount is larger than the pool maximum of 2MB, or if the pool is exhausted, then we defer to unmanaged buffers.
(See ImageSharp/src/ImageSharp/Memory/Allocators/Internals/GCAwareConfigurableArrayPool{T}.cs, lines 107 to 157 at a22b0cb, and ImageSharp/src/ImageSharp/Memory/Allocators/ArrayPoolMemoryAllocator.cs, lines 133 to 152 at a22b0cb.)
Yeah, that's what I meant. What I was picturing was something that requested 2MB (or whatever you pick for your max managed chunk size) at a time from the managed pool, and then when that returns null, request all the rest in one unmanaged allocation. That would require moving that abstraction out of the allocator, though.
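To make the shape of that idea concrete, here is a rough sketch under the assumptions above (hypothetical names, not the PR's MemoryGroup API): rent fixed-size blocks from the managed pool while it cooperates, then take whatever is left as a single, exactly-sized unmanaged allocation.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

internal static class MemoryGroupSketch
{
    public static List<IMemoryOwner<T>> Allocate<T>(
        long totalLength,
        int pooledBlockLength,
        Func<int, IMemoryOwner<T>?> rentPooled,        // returns null when the pool is exhausted
        Func<long, IMemoryOwner<T>> allocateUnmanaged) // sized exactly to the request
        where T : struct
    {
        var buffers = new List<IMemoryOwner<T>>();
        long remaining = totalLength;

        // Fill with pooled blocks while the pool still has capacity.
        while (remaining >= pooledBlockLength)
        {
            IMemoryOwner<T>? pooled = rentPooled(pooledBlockLength);
            if (pooled is null)
            {
                break; // pool exhausted: stop chunking
            }

            buffers.Add(pooled);
            remaining -= pooledBlockLength;
        }

        if (remaining > 0)
        {
            // Whatever is left (a small tail, or everything once the pool gave up)
            // becomes one exactly-sized allocation instead of more 2MB chunks.
            buffers.Add(allocateUnmanaged(remaining));
        }

        return buffers;
    }
}
```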
Yeah, that gets a bit iffy. I'd really like to keep everything there. Here's what happens if I change the contiguous length to 24MB: (profiler screenshot omitted). Looks like CPU takes a hit there.
Ouch. Yeah, 24MB is too chunky. You'd allocate 48MB when 25MB is requested, which shows in your total VirtualAlloc number jumping way up again. That's why it would be better for the MemoryGroup to know it's making an unmanaged 'rental' so it can size it exactly to what it needs. There's a balance there somewhere. Unfortunately it'll be a lot of testing.
I think we can expose the required properties via the allocator interface easily enough. We're breaking it anyway.
bool lockTaken = false, allocateBuffer = false;
try
{
    this.spinLock.Enter(ref lockTaken);

    if (this.index < buffers.Length)
    {
        buffer = buffers[this.index];
        buffers[this.index++] = null;
        allocateBuffer = buffer == null;
    }
}
Calculating whether the buffer was null can be done outside the lock scope: if (buffer == null) { }
Suggested change, from:

bool lockTaken = false, allocateBuffer = false;
try
{
    this.spinLock.Enter(ref lockTaken);
    if (this.index < buffers.Length)
    {
        buffer = buffers[this.index];
        buffers[this.index++] = null;
        allocateBuffer = buffer == null;
    }
}

to:

bool lockTaken = false;
try
{
    this.spinLock.Enter(ref lockTaken);
    if (this.index < buffers.Length)
    {
        buffer = buffers[this.index];
        buffers[this.index++] = null;
    }
}
Note: I don't know if you can suggest changes from different places, so do not commit this; line 353 must be changed to if (buffer == null) for this to work.
@JimBobSquarePants hm, I may be wrong of course, but we can easily do that outside of the lock scope, no? That comparison is nothing, but with spinlocks it's better to be fast than asleep :P
// for that slot, in which case we should do so now.
if (allocateBuffer)
{
    if (this.index == 0)
This is not thread safe; the this.index variable is shared state that must be altered only inside the locked scope.
Also, this.index would never be 0 in this piece of code: if the bucket is full (i.e. no buffers are rented), this.index would be equal to zero, but after exactly one rent inside the locked scope this.index is incremented, so at the if (this.index == 0) line the index would always be >= 1.
This can be fixed like this:
bool lockTaken = false;
int takenIndex = -1;
try
{
    this.spinLock.Enter(ref lockTaken);

    if (this.index < buffers.Length)
    {
        buffer = buffers[this.index];
        buffers[this.index] = null;
        takenIndex = this.index++;
    }
}
finally
{
    if (lockTaken)
    {
        this.spinLock.Exit(false);
    }
}

if (buffer == null)
{
    if (takenIndex == 0)
    {
        // Stash the time the first item was added.
        this.firstItemMS = (uint)Environment.TickCount;
    }

    // ...
Good catch. It's also in the wrong method. Will update.
I definitely broke something in NET FX when I removed
Don't get me wrong, this is amazing work; it's great to see proof that unmanaged allocations can reduce memory peaks by more than 50%. However, this is a massive change in the core engine, and the current code and the status of validation are still very far from a state I would consider mergeable.
Here is the list of major problems we need to solve:

- As a start, we need to set up systematic benchmarking, which can give us the following metrics for any configuration in a quantitative/comparable manner, enabling data-driven decision making to shape the design and fine-tune the final perf parameters (a rough harness sketch follows this comment):
  - Peak memory usage
  - Average memory usage
  - CPU utilization
  - Throughput (~time taken to process all the bee heads)

  We should be able to put these values into a table so we can easily compare benchmark results for different parameter sets.
- With the current tuning, there is almost no pooling happening when it comes to large buffers, meaning that the current PR behavior does not justify the presence of the complex array pooling & trimming logic and the ArrayPoolMemoryAllocator name / API shape. Intuition tells us that we need some pooling, but ATM we have no idea how much. We need to see how the allocator works with pooling disabled, plus a list of more aggressive pooling setups.

  > With our new logic we'd end up adding 1024MB arrays to that pool and almost nothing would go to unmanaged memory.

  I did not propose 1024 as a final value. It is an extreme config in general, but not for the bee heads benchmark, where the memory usage seems to fluctuate around 1.2GB. I'm also more curious about less aggressive, more reasonable pooling settings, but I see no reason for not getting all the data points that help us understand the whole picture and make better decisions.
- As pointed out in WIP. More Efficient MemoryAllocator #1660 (comment), we need different sizes for the contiguous blocks coming from pooled arrays VS unmanaged buffers. There are several ways to implement this; probably the most straightforward thing is to move the code in MemoryGroup.Allocate to a virtual method on MemoryAllocator that could be overridden, and use ideas from WIP. More Efficient MemoryAllocator #1660 (comment) in the override.
- Using the ArrayPool<T> abstraction and the entire GCAwareConfigurableArrayPool<T> class to pool 2MB buffers doesn't bring us value, since we are utilizing only one bucket of that pool. We should implement a custom pool class dedicated to the concern of pooling uniform arrays. It should be relatively easy to refactor it from GCAwareConfigurableArrayPool<T>.Bucket.
- We should periodically trim the pool by a certain factor even when there is low memory pressure. This would address the concern of retaining memory unnecessarily (pointed out in General discussion about memory management #1590 and other user complaints), and also enable us to pool much more when there is high load.
- We need extensive test coverage to validate our assumptions regarding the utilization of the pools and unmanaged buffers. We also need to test trimming. Unfortunately, it would be too expensive to do it with regular Xunit runs, but we can define local-only tests that deliver & validate the logs/metrics proving the trimming is happening in the way we expect.
I understand this is an enormous amount of work, but that was always the case for the previous PRs refactoring the memory management engine. I don't see a reason to lower our quality criteria in this case and omit any of the points above, especially since we are about to introduce a breaking change and we want to prevent future breaking changes. Personally, I want to start working on point 1, but even that alone may take several evenings, which makes it very hard to give an ETA. If we feel like the allocator work may block V1.1 for too long, we should focus on #1597 first, since it will bring an even bigger improvement with a lower development cost.
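As referenced in point 1, here is a rough measurement harness sketch (hypothetical, not part of this PR) that records throughput, CPU time and peak working set for a given workload, so the numbers can be dropped into a comparison table per allocator configuration:

```csharp
using System;
using System.Diagnostics;

internal static class AllocatorBenchmark
{
    public static void Measure(string label, Action workload)
    {
        Process process = Process.GetCurrentProcess();
        TimeSpan cpuBefore = process.TotalProcessorTime;
        Stopwatch stopwatch = Stopwatch.StartNew();

        workload();

        stopwatch.Stop();
        process.Refresh();
        TimeSpan cpuUsed = process.TotalProcessorTime - cpuBefore;
        long peakWorkingSetMb = process.PeakWorkingSet64 / (1024 * 1024);

        // Note: PeakWorkingSet64 is a per-process peak, so run one
        // configuration per process for a fair comparison.
        Console.WriteLine(
            $"{label}: {stopwatch.Elapsed.TotalSeconds:F1}s wall, " +
            $"{cpuUsed.TotalSeconds:F1}s CPU, peak {peakWorkingSetMb} MB");
    }
}
```

Average memory usage over time would still need a sampling loop or an external profiler, but this covers the peak/CPU/throughput columns.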
@antonfirsov I actually agree with all of the above. I'm running before I can walk without the relevant experience, and suffering as a result. #1597 should immediately benefit us for V1.1. What I'm actually going to do is close this and instead introduce a few smaller PRs to do some sanitation work which will allow the allocator changes to be made more easily.
I spent some time today trying to figure out how to get the desired metrics out of .ETL files, but I realized there is no easy way that is worth the effort. This should not block systematic comparison, but will make it even more of a chore 😞
Description
A breaking rewrite (due to the removal of unused configuration options) of ArrayPoolMemoryAllocator modelled on the design described at #1596 (comment) by @antonfirsov:
- Uses ArrayPool<byte>.Shared for buffers <= 1MB.
- Uses a Gen2GcCallback, as in TlsOverPerCoreLockedStacksArrayPool, to trim the pool under pressure.
- Buffer2D<T> will use 2MB for each discontiguous (did you know that was a Scottish word?) chunk.

Things to consider
- Done! For Allocate<T> only we should consider using unmanaged memory on supported platforms for > 2MB allocations, using the approach described by @saucecontrol in More efficient MemoryAllocator #1596 (comment).

Issues
- Fixed, thanks @brianpopow!! Several Tiff encoding tests fail with the lower buffer chunk threshold set. This is due to TiffBaseColorWriter<TPixel> calling Buffer2D<T>.GetSingleSpan(). @brianpopow I'll need you or @IldarKhayrutdinov to help me there as I don't know enough about the format to do a fix. (ImageSharp/src/ImageSharp/Formats/Tiff/Writers/TiffBaseColorWriter{TPixel}.cs, lines 82 to 84 at 381dff8.)
- Fixed! We have a failing test for ResizeKernelMap due to the use of Buffer2D<T>.GetSingleMemory(). @antonfirsov I'll need help there. (ImageSharp/src/ImageSharp/Processing/Processors/Transforms/Resize/ResizeKernelMap.cs, lines 54 to 55 at 381dff8.)