
More efficient MemoryAllocator #1596

Closed · antonfirsov opened this issue Apr 13, 2021 · 53 comments · Fixed by #1730

Comments

@antonfirsov
Member

One way to address the concerns in #1590 is to come up with a new MemoryAllocator that meets the following two requirements even under very high load:
(A) Steady allocation patterns, instead of GC fluctuations
(B) Lower amount of memory being retained, at least "after some time"

Some ideas to explore:

  1. Allocate native memory over a certain threshold (~1MB), deferring the memory management to the OS
    1.1 Consider pooling unmanaged memory, especially if we implement the next point.
  2. Go discontiguous, and build all large buffers from fixed-size blocks of memory, similarly to RecyclableMemoryStream. It's worth checking how this works with both pooled arrays and unmanaged memory.
  3. Have a threshold on the retained memory, and release some of the pooled arrays (or pooled unmanaged buffers) when we grow over it.
  4. Mentioning an idea I really hate for the sake of completeness: have a synchronization mechanism around large buffer acquisition, similarly to System.Drawing (WIC?). To do this properly, memory allocation should become an asynchronous operation, otherwise SpinLocks will just add more fuel to the fire.

Point 1 seems to be very simple to prototype. We need an allocator that uses Marshal.AllocHGlobal over some threshold and an ArrayPool below it, and then see how the memory timeline compares to ArrayPoolMemoryAllocator with the bee heads MemoryStress benchmark.
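A minimal sketch of what that prototype could look like (the names and the 1MB threshold are illustrative assumptions, not ImageSharp's actual allocator API):

using System;
using System.Buffers;
using System.Runtime.InteropServices;

internal static class ThresholdAllocatorSketch
{
    // Illustrative cut-off: anything at or above this goes to the OS instead of the GC.
    private const int UnmanagedThreshold = 1024 * 1024;

    public static (IntPtr Unmanaged, byte[] Pooled) Allocate(int byteCount)
    {
        if (byteCount >= UnmanagedThreshold)
        {
            // Large buffer: let the OS manage it, the GC never sees the allocation.
            return (Marshal.AllocHGlobal(byteCount), null);
        }

        // Small buffer: keep using the shared array pool.
        return (IntPtr.Zero, ArrayPool<byte>.Shared.Rent(byteCount));
    }

    public static void Free(IntPtr unmanaged, byte[] pooled)
    {
        if (unmanaged != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(unmanaged);
        }
        else if (pooled != null)
        {
            ArrayPool<byte>.Shared.Return(pooled);
        }
    }
}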

@saucecontrol any thoughts or further ideas? (Especially on point 2.)

@saucecontrol
Contributor

Idea 2 is interesting. I wonder, though, if you can manage discontiguous buffers, is it much of a stretch from there to go fully streamed?

For the record, I also hate idea 4, but that may be necessary if you really do have to materialize large images and care at all about 32-bit hosts. The machine I ran my tests on was only a quad core as well. I wonder what would happen on a 16+ core machine... I reckon even in 64-bit you'd start paging.

@antonfirsov
Member Author

I wonder, though, if you can manage discontiguous buffers, is it much of a stretch from there to go fully streamed?

Implementing the ability to have discontiguous buffers was relatively easy, since our processing code usually goes row by row. We just had to upgrade Buffer2D.GetRowSpan(y) so it can take the row from different buffers (a buffer boundary always falls between rows). Going fully streamed needs much bigger architectural changes; I would rather start with the "decode into target size" work first, and see where we can get from there.
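Roughly, the idea looks like this (a simplified sketch, not the actual Buffer2D<T> implementation; it assumes every block holds a whole number of rows):

using System;

internal sealed class BlockedBuffer2D<T>
{
    private readonly Memory<T>[] blocks;   // equally sized blocks backing the image
    private readonly int rowsPerBlock;     // invariant: each block length is a multiple of Width

    public int Width { get; }

    public BlockedBuffer2D(Memory<T>[] blocks, int width, int rowsPerBlock)
    {
        this.blocks = blocks;
        this.Width = width;
        this.rowsPerBlock = rowsPerBlock;
    }

    public Span<T> GetRowSpan(int y)
    {
        int blockIndex = y / this.rowsPerBlock;    // which block the row lives in
        int rowInBlock = y % this.rowsPerBlock;    // row offset within that block
        return this.blocks[blockIndex].Span.Slice(rowInBlock * this.Width, this.Width);
    }
}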

Currently we have a very high limit for images being split into different buffers; what I'm considering is to check what happens if we aggressively lower that limit. I'm unable to assess whether it would make things worse or better.

but that may be necessary if you really do have to materialize large images and care at all about 32-bit hosts.

That's why I want to avoid materializing them :) Will open an issue for tracking "decode into target size" stuff tomorrow.

The machine I ran my tests on was only a quad core as well. I wonder what would happen on a 16+ core machine...

I was wondering: why are you using the basic Parallel.For overload that scales parallelism with the machine's CPU count? Is this a way of modeling typical server-side multi-threaded stress?

@saucecontrol
Contributor

I'm unable to assess whether it would make things worse or better.

I don't think it would make things worse. If I understand right, your current mode of processing is to apply each transform to the entire image before moving on to the next? If so, you're kind of thrashing the cache at each stage, so it wouldn't matter whether the memory is contiguous or not. I'm not sure it would make things better either, though. You might want to look at your LOH fragmentation under those high-memory conditions and see whether being able to stuff a smaller buffer into an LOH hole would gain you anything. Switching to unmanaged buffers may help a bit there, but heap fragmentation can be a problem in unmanaged memory as well.

Of course being able to aggregate small buffers in place of a single larger one would be of benefit, provided you don't need them at the same time. I assume if that works for RecyclableMemoryStream, it should work for ImageSharp 🤷‍♂️

Is this a way of modeling typical server-side multi-threaded stress?

I must admit I didn't put that much thought into it. But yeah, the point was to make sure there were enough threads running to check whether the libraries were capable of true linear scale-up. As long as there are free processors to hide work on (as Vips does today -- and ImageSharp did before your windowed convolution implementation), you can't be sure. If you think about a high-load web server scenario, you would assume the managed thread pool would fully occupy all processors, so I want to know how a library will perform in that environment -- and to be sure that a large number of image processing operations doesn't make for a self-DoS.

BTW, I suspect the anomalous CPU measurements I got for Vips running under the profiler might have actually come from an oversubscription problem. Could be the CPU overhead of the profiler was just enough to throw the whole system into chaos and drop the speed by several multiples. I plan on doing some followup testing to see if I can nail that down.

@saucecontrol
Contributor

A couple of other interesting stress cases you might want to look at:

  1. 3400+ frame GIF: http://i.imgur.com/Xv4UKYL.gif
  2. Lots of good giant JPEGs available on the ESO site, like: https://www.eso.org/public/usa/images/eso1719a/

@antonfirsov
Member Author

antonfirsov commented Apr 16, 2021

The reason I'm thinking about "discontiguous memory by default" is not cache coherence, but the fact that it may enable more efficient pooling, reducing the GC pressure (or OS allocation traffic in the case of unmanaged buffers). With large images it is very likely that the requested buffer will outgrow the pool, deferring to GC/OS allocation of new giant arrays of non-uniform sizes, which would probably result in heap fragmentation even if we are using unmanaged memory.

I'm thinking of a solution that works as follows:

  • Under 1MB we use a standard ArrayPool (can be ArrayPool<byte>.Shared)
  • Over that, we utilize a custom pool that provides uniform 2MB buffers. We then build the large discontiguous buffers, aka MemoryGroups, out of these 2MB sub-buffers (see the sketch after this list).
    • This pool might be built of LOH arrays or unmanaged buffers. (Both approaches seem to have pros and cons.)
    • Obviously there is an upper limit for the retained/pooled memory
  • If we outgrow the discontiguous pool, we may consider deferring new allocations to unmanaged memory, so at least we don't stress the GC
  • There is logic that periodically shrinks the pool so the solution works well with ImageSharp.Web (which typically places much lower load on the core library after a warmup period, so it doesn't make sense to retain memory in the pools)
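A hedged sketch of the block-based shape described above (RentGroup and rentUniformBlock are names assumed for illustration, not ImageSharp API):

using System;

internal static class UniformBlockSketch
{
    private const int BlockSize = 2 * 1024 * 1024; // the uniform 2MB block size discussed above

    // Builds the backing storage for totalBytes out of uniform blocks rented from the custom pool.
    public static byte[][] RentGroup(long totalBytes, Func<byte[]> rentUniformBlock)
    {
        int blockCount = (int)((totalBytes + BlockSize - 1) / BlockSize); // round up to whole blocks
        var blocks = new byte[blockCount][];
        for (int i = 0; i < blocks.Length; i++)
        {
            blocks[i] = rentUniformBlock(); // every block is exactly BlockSize bytes, so the pool stays uniform
        }

        return blocks;
    }
}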

@saucecontrol any thoughts on this particular plan? I wish I could pull in a GC expert to assess it ...

@saucecontrol
Contributor

That plan looks sound to me, but I'm no GC expert. I do expect that inability to grab a contiguous large block of memory due to LOH (or process heap) fragmentation is at play in the 32-bit OOM scenario, but it would be good to verify that before diving in to the work if possible.

It seems like if you end up doing the work to completely manage the allocations, you may as well go straight to unmanaged memory so that when you free something you don't have to wait for the GC to comply and then do its LOH compaction and all that. Plus, you'll probably always have to deal with some allocation requests that are over whatever pooling threshold you set, and you'll want those to go straight to unmanaged memory so they are as short-lived as possible.

@antonfirsov
Member Author

you may as well go straight to unmanaged memory so that when you free something you don't have to wait for the GC to comply and then do its LOH compaction and all that

The problem is that I don't really know when to free up memory. With LOH + the Gen2Callback trick I can at least rely on GC heuristics for this. (Otherwise it feels like implementing my own GC.)

@JimBobSquarePants
Member

@antonfirsov
Member Author

@JimBobSquarePants it just allocates / returns unmanaged memory without pooling. Maybe that's a good enough (or even preferable) approach with unmanaged memory, not sure.

My main concern about unmanaged memory is that I would be hesitant to use it as a default, because of the users who may implicitly depend on not disposing Image. An update would introduce a reliability / security issue for them instead of their existing performance issue.

@saucecontrol
Contributor

Instead of holding the raw IntPtr as a member of the MemoryManager<T>, it should hold a SafeHandle wrapping it. Something like:

internal sealed class SafeHGlobalHandle : SafeHandle
{
    private readonly int byteCount;

    public override bool IsInvalid => handle == IntPtr.Zero;

    public SafeHGlobalHandle(int size) : base(IntPtr.Zero, true)
    {
        SetHandle(Marshal.AllocHGlobal(size));
        GC.AddMemoryPressure(byteCount = size);
    }

    protected override bool ReleaseHandle()
    {
        if (IsInvalid)
            return false;

        Marshal.FreeHGlobal(handle);
        GC.RemoveMemoryPressure(byteCount);
        handle = IntPtr.Zero;

        return true;
    }
}

Since SafeHandle has a finalizer, it won't leak, but you can call Dispose to free it immediately if the MemoryManager<T> is Disposed properly.

It might be worth implementing something like that for any allocations larger than your max pool buffer size just because you know those will happen and will have a large GC cost. It's also a lot easier to trial than the segmented buffer idea.
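For illustration, a MemoryManager<byte> built on top of that handle could look roughly like this (a sketch under the assumptions above, not ImageSharp's implementation):

using System;
using System.Buffers;

internal sealed unsafe class UnmanagedByteMemoryManager : MemoryManager<byte>
{
    private readonly SafeHGlobalHandle handle;
    private readonly int length;

    public UnmanagedByteMemoryManager(int length)
    {
        this.length = length;
        this.handle = new SafeHGlobalHandle(length);
    }

    public override Span<byte> GetSpan()
        => new Span<byte>((void*)this.handle.DangerousGetHandle(), this.length);

    // Unmanaged memory never moves, so pinning is just handing out the pointer.
    public override MemoryHandle Pin(int elementIndex = 0)
        => new MemoryHandle((byte*)this.handle.DangerousGetHandle() + elementIndex);

    public override void Unpin()
    {
        // Nothing to do: the memory was never pinned by the GC.
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            this.handle.Dispose(); // frees immediately; the SafeHandle finalizer is the backstop
        }
    }
}

The Memory property inherited from MemoryManager<byte> then exposes the unmanaged block without the caller having to know where it lives.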

@saucecontrol
Contributor

FYI dotnet/runtime#52098 is tracking some possible scenarios to improve in ArrayPool/GC interaction. I think Jan's scenarios 1 and 3 (and maybe 2 as well) are applicable to ImageSharp.

@antonfirsov
Member Author

antonfirsov commented May 3, 2021

There are so many things I hate about ArrayPool, especially ArrayPool<T>.Shared, which is library-level public shared global state, and as such something that shouldn't exist at all IMO.

All the magic ArrayPool.Shared is doing should be a runtime/GC implementation detail hidden behind standard new T[] allocations. (Edit: OK, it was mostly ranting; this approach would require an API like GC.UnsafeReleaseArray(...), but AFAIK array pooling is much less common in Java for example, so I assume the GC does a better job there.)

@br3aker
Contributor

br3aker commented May 11, 2021

Hi everybody!
Have a couple of thoughts on this project I'd like to share. Thanks for such a good topic on memory performance.

First of all, the current state of pooling already supports discontiguous buffers. It's just hidden behind the MemoryGroup<T>.Allocate(...) call:

First it checks whether a single stride can fit in a single contiguous buffer:

int blockCapacityInElements = allocator.GetBufferCapacityInBytes() / ElementSize;

Then it does strange things:

int numberOfAlignedSegments = blockCapacityInElements / bufferAlignment;
int bufferLength = numberOfAlignedSegments * bufferAlignment;
if (totalLength > 0 && totalLength < bufferLength)
{
    bufferLength = (int)totalLength;
}

int sizeOfLastBuffer = (int)(totalLength % bufferLength);
long bufferCount = totalLength / bufferLength;

Details aside - this alloc call either throws or pools a single contiguous buffer each time. So here's the first problem: most of the time the pool is stressed by huge arrays from the buckets at the higher end of the indices while the low indices sit empty, which makes the maxArraysPerBucket config value completely irrelevant.
Edit: this can't be fixed by simply rewriting the MemoryGroup<T>.Allocate(...) call; at least the resizing processor assumes that the entire underlying Buffer2D of pixel data is a single contiguous buffer and throws a specific exception. I haven't checked other processors, but I guess it might also be a problem somewhere else.

Edit: I made a huge mistake in the calculations and about the hardcoded upper bound of the pool max size (it's hex, not dec); this still applies, but only for certain pool setups, see the post below.
But there's a far more complicated problem: pooling doesn't work in certain situations. This is solely due to the current implementation of the pool in .NET, which silently clamps the maximum size of the array:

const int MinimumArrayLength = 0x10, MaximumArrayLength = 0x40000000;
if (maxArrayLength > MaximumArrayLength)
{
    maxArrayLength = MaximumArrayLength;
}

Which leads to unexpected LOH allocations when working with images up to ~40mb of pixel data, and that isn't very rare:
an rgb24 1080p jpeg is 24 * 1920 * 1080 = ~49mb of decoded pixel data.
The bee heads test completely ignores pooling because it consists of rgb24 JPEGs with at least 3000x3000 pixels.

Parallel.ForEach over the bee heads can skyrocket memory usage (I've been counting gigabytes of working set during execution) with frequent Gen2 collections.

@saucecontrol
Contributor

saucecontrol commented May 11, 2021

You've got a couple of translation errors in your analysis. First, the hard-coded MaximumArrayLength = 0x40000000 isn't 40MB -- it's hex, so that's 1GB. Second, RGB24 is 24 bits, or 3 bytes, per pixel. So a decoded 1920x1080 RGB image is 6MB.

But you're right about the LOH allocations. The default memory pool is limited to 24MB per array, so images over ~8 megapixels will cause allocations, and there are definitely images in the bee heads set that exceed that limit.

internal const int DefaultMaxPooledBufferSizeInBytes = 24 * 1024 * 1024;

The challenge is in striking a balance between avoiding GC allocations and the high steady-state memory level that results when large arrays get pooled.

@br3aker
Contributor

br3aker commented May 11, 2021

Oh gosh, you're right, math was not on my side tonight, thanks for pointing that out.

The worst part about the built-in .NET pool is that it was developed for the general use case. Exponential growth of the internal buffer sizes would waste a lot of space in certain conditions. Empty buckets are not a big deal, but they just sit there for no reason.
A pixel buffer consisting of several unmanaged chunks is a much better choice, plus it allows custom alignment and more reliable memory deallocation.

Although GC.AddMemoryPressure looks counterproductive in this particular situation - we don't want to affect the GC at all while using the pool.

@br3aker
Contributor

br3aker commented May 11, 2021

IMO all allocations should be divided into 2 categories: pixel data, which outlives any other allocation type in the context of this library, and everything else. The main memory consumers are pixel buffers, which cause massive hits both on the GC heaps and on a theoretical custom memory pool based on unmanaged memory - it would be much harder to support large buffers while also allowing small buffers of integers/small structs to be pooled.

This way we can delegate those small allocations to the GC (thanks to its generational design this won't really affect performance) while concentrating on storing pixel buffers with proper alignment.

Another interesting concept mentioned by @antonfirsov - async/await for memory pooling. This approach could help in environments where memory is crucial, but it might be really tricky not to fall into a deadlock when there isn't enough memory for any of the awaiters.
Anyway, I don't think it's a good solution just to avoid locks in the pooling logic - some cases simply don't want to work with TPL, and calling something like allocator.GetMemoryAsync(size).Result looks ugly and performs even worse than it looks.
Asynchronous pooling could use a separate memory pool so it won't interfere with synchronous pooling. And I still don't think it's worth spending time implementing this at the moment; better to concentrate on the basic pool implementation. (A rough sketch of the gating idea follows, just for completeness.)
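If it ever came to that, the acquisition gate could look roughly like this (names and the concurrency limit are illustrative assumptions, not a proposal for the actual allocator API):

using System.Threading;
using System.Threading.Tasks;

internal static class LargeBufferGate
{
    // Illustrative limit: at most 4 "large" rentals in flight at once.
    private static readonly SemaphoreSlim gate = new SemaphoreSlim(4);

    public static async Task<byte[]> RentLargeAsync(int byteCount)
    {
        await gate.WaitAsync().ConfigureAwait(false);
        try
        {
            return new byte[byteCount]; // stand-in for the real pool rental
        }
        catch
        {
            gate.Release(); // only release on failure; success holds the slot until Return
            throw;
        }
    }

    public static void Return(byte[] buffer) => gate.Release();
}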

@antonfirsov
Member Author

antonfirsov commented May 11, 2021

@br3aker reacting to #1596 (comment) first:

this alloc call either throws or pools a single contiguous buffer each time.

at least the resizing processor assumes that the entire underlying Buffer2D of pixel data is a single contiguous buffer and throws a specific exception

If any of these are true, it's a bug. You should be able to limit the maximum contiguous buffer size by using the following full ArrayPoolMemoryAllocator constructor, and setting bufferCapacityInBytes:

/// <summary>
/// Initializes a new instance of the <see cref="ArrayPoolMemoryAllocator"/> class.
/// </summary>
/// <param name="maxPoolSizeInBytes">The maximum size of pooled arrays. Arrays over the threshold are always going to be allocated.</param>
/// <param name="poolSelectorThresholdInBytes">The threshold to pool arrays in <see cref="largeArrayPool"/> which has fewer buckets for memory safety.</param>
/// <param name="maxArraysPerBucketLargePool">Max arrays per bucket for the large array pool.</param>
/// <param name="maxArraysPerBucketNormalPool">Max arrays per bucket for the normal array pool.</param>
/// <param name="bufferCapacityInBytes">The length of the largest contiguous buffer that can be handled by this allocator instance.</param>
public ArrayPoolMemoryAllocator(
    int maxPoolSizeInBytes,
    int poolSelectorThresholdInBytes,
    int maxArraysPerBucketLargePool,
    int maxArraysPerBucketNormalPool,
    int bufferCapacityInBytes)

You can pass the default values for the rest of the parameters -- except the bucket sizes, which you need to increase significantly to make sure the pool is not drained.

Note that the default for bufferCapacityInBytes is a very high value today (the equivalent of ~130 megapixels of Rgba32), so this may explain why you see contiguous buffers without customization.
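For example (the values below are purely illustrative and would need tuning for a real workload):

Configuration.Default.MemoryAllocator = new ArrayPoolMemoryAllocator(
    maxPoolSizeInBytes: 24 * 1024 * 1024,
    poolSelectorThresholdInBytes: 8 * 1024 * 1024,
    maxArraysPerBucketLargePool: 16,   // raised so the large pool isn't drained under load
    maxArraysPerBucketNormalPool: 32,
    bufferCapacityInBytes: 4 * 1024 * 1024); // cap contiguous buffers at 4MB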

I believe lowering the capacity + setting a higher bucket size should help with the biggest "pooling doesn't work in certain situations" problem. If you have a repro proving this fails on the load+resize+save path (= a contiguous buffer is allocated or an exception is thrown), can you please open a new separate issue for that?

@antonfirsov
Member Author

antonfirsov commented May 11, 2021

Regarding the rest of the comments:

I believe all mentioned issues can be addressed either by a very basic unmanaged allocator or by the "smart" allocator described in #1596 (comment). None of those would use the BCL ArrayPool<T> for large buffers, and therefore won't suffer from the asymmetric bucket usage problem.

Main memory consumers are pixel buffers

I wish this was the case, but it is not true: #1597. This is our top-priority memory issue now, affecting the most common load+resize+save path. Even if we manage to fully optimize it, the "fancy" (blur, sharpen etc.) processors may still keep allocating all kinds of large Width * Height sized Buffer2D<T>-s. I don't see a conceptual difference between the pixel buffers backing Image<T> and other internal buffers.

@br3aker
Contributor

br3aker commented May 11, 2021

Oh yeah, I haven't actually checked all the processors. What I meant is that the allocator shouldn't be bothered with small allocations like these:

this.frequenciesMemoryOwner = memoryAllocator.Allocate<short>(elements);
this.frequenciesMemoryHandle = this.frequenciesMemoryOwner.Memory.Pin();
this.Frequencies = (short*)this.frequenciesMemoryHandle.Pointer;
this.lengthsMemoryOwner = memoryAllocator.AllocateManagedByteBuffer(elements);
this.lengthsMemoryHandle = this.lengthsMemoryOwner.Memory.Pin();
this.Length = (byte*)this.lengthsMemoryHandle.Pointer;
this.codesMemoryOwner = memoryAllocator.Allocate<short>(elements);
this.codesMemoryHandle = this.codesMemoryOwner.Memory.Pin();
this.codes = (short*)this.codesMemoryHandle.Pointer;

Those buffers are not actually big:

this.literalTree = new Tree(memoryAllocator, LiteralNumber, 257, 15);
this.distTree = new Tree(memoryAllocator, DistanceNumber, 1, 15);
this.blTree = new Tree(memoryAllocator, BitLengthNumber, 4, 7);

// The number of literal codes.
private const int LiteralNumber = 286;
// Number of distance codes
private const int DistanceNumber = 30;
// Number of codes used to transfer bit lengths
private const int BitLengthNumber = 19;

Without proper investigation I thought several places had this kind of small allocation, but I was wrong; that's the only place I could find with an allocator.Allocate call for small buffers (compared to other allocations). My bad, sorry for the noise.

Regarding #1596 (comment): I changed the bucket size prior to experimenting with non-contiguous buffers, sorry for not mentioning that. Will try to reproduce and write up a proper issue/comment today.

@hexawyz

hexawyz commented May 12, 2021

Hello,

I saw this issue pop in my twitter feed and I was curious about it, I hope you'll forgive me for butting in 👀

I didn't see it mentioned, but at least in the scenario of very large pixel buffers, an option could be to rely on MemoryMappedFile for allocating unmanaged memory in the form of raw memory pages.
As an alternative to Marshal.AllocHGlobal, this would avoid fragmenting the native heap (used by AllocHGlobal) and the LOH, at least in a 64-bit process. In a 32-bit process, address space fragmentation could still be a problem.

Similarly to AllocHGlobal/FreeHGlobal, the memory you get from MemoryMappedFile is allocated and deallocated immediately.
However, the memory pages used by a memory-mapped file should be freed immediately, whereas a heap/LOH-based approach could keep the pages in memory (e.g. because they are in the middle of the heap or still partially used).

⚠ One possible drawback could be a performance hit associated with allocating/deallocating the memory pages for the memory-mapped file (this would need to be benchmarked), but this should still be worthwhile for very large buffers, as a heap-based mechanism would also need to allocate new pages.

Of course, this way of allocating memory could also be used with pooling and discontiguous buffers. What it does at the core is give you more control over how much physical memory is allocated, or at least reserved in your address space.
It could be worthwhile if you were to choose to pool your own unmanaged buffers.

For the sake of completeness, you can find a rough example of usage of MemoryMappedFile to allocate private memory here.
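A rough sketch of the idea (not tied to ImageSharp's allocator abstraction): CreateNew with a null name gives anonymous, pagefile-backed memory whose pages are returned to the OS as soon as the view and the file are disposed.

using System.IO.MemoryMappedFiles;

internal static class MappedMemorySketch
{
    public static unsafe void Example()
    {
        // Anonymous (unnamed) map: works cross-platform and is backed by the pagefile/swap.
        using var mmf = MemoryMappedFile.CreateNew(mapName: null, capacity: 64L * 1024 * 1024);
        using var view = mmf.CreateViewAccessor();

        byte* ptr = null;
        view.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
        try
        {
            // ... use ptr as the backing store for a large pixel buffer ...
        }
        finally
        {
            view.SafeMemoryMappedViewHandle.ReleasePointer();
        }
    }
}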

@JimBobSquarePants
Member

Regarding discontiguous buffers: if we went that route we would likely not be able to pin anything. There are a couple of places where we currently do so.

@JimBobSquarePants
Member

@kkokosa @Maoni0 My sincerest apologies for the name drop but since you are both absolute experts in memory management and the GC it would be wonderful if we could have your opinion on the best design approach.

@antonfirsov
Member Author

@JimBobSquarePants in the current design MemoryAllocator.Allocate<T> will always allocate a contiguous buffer, while GetBufferCapacityInBytes works as a hint for building large (potentially discontiguous) 2D buffers. We never want to pin the contents of a Buffer2D; we should always process it row span by row span. These buffers are responsible for the majority of our allocations, and we need to find a good strategy for managing them. This shouldn't prevent us from enforcing contiguous buffers in special cases.

@GoldenCrystal using memory-mapped files is a good idea, but it's the most complex option of all, needs a lot of research, and has to be implemented with care. If possible, I would prefer to go with managed LOH arrays, because anything unmanaged is a breaking change (an application leaking IDisposable Image-s would, after updating to the new version of the library, introduce a security risk by actually starting to leak unmanaged memory).

@antonfirsov
Member Author

In case anyone chooses to chime in but finds the thread TL;DR:

I'm thinking about a custom pool of uniform (~2MB) LOH arrays that can be used to build larger discontiguous buffers backing ImageSharp's Image<TPixel> and other very large internal temporary "2D" buffers.

Q1: Is this a viable strategy to fight fragmentation and OOMs in memory constrained environments, or are there some GC behaviors that make this inefficient?
Q2: What is the best way to detect high memory pressure and release the pooled buffers so GC can clean them up? (Preferably cross platform, or at least something that doesn't need .NET 5/6)

@JimBobSquarePants
Member

@JimBobSquarePants in the current design MemoryAllocator.Allocate<T> will always allocate a contiguous buffer, while GetBufferCapacityInBytes works as a hint for building large (potentially discontiguous) 2D buffers. We never want to pin the contents of a Buffer2D; we should always process it row span by row span. These buffers are responsible for the majority of our allocations, and we need to find a good strategy for managing them. This shouldn't prevent us from enforcing contiguous buffers in special cases.

Yep but this had me concerned.

Go discontiguous, and build all large buffers from fixed-size blocks of memory, similarly to RecyclableMemoryStream.

Regarding GetBufferCapacityInBytes: ArrayPoolMemoryAllocator currently uses this as more than a hint and throws an InvalidMemoryOperationException if the requested 1D buffer is larger than BufferCapacityInBytes, which I'm not currently a fan of.

@br3aker
Contributor

br3aker commented Jun 12, 2021

Q1: Is this a viable strategy to fight fragmentation and OOMs in memory constrained environments, or are there some GC behaviors that make this inefficient?

Honestly, the current problems with OutOfMemory exceptions look like a bug rather than real memory management problems. I can't believe that even in parallel tests 8 (or even 16) cores can consume up to 8 GB at their peak on such a simple task (source - the bee heads demo by @saucecontrol).

Q2: What is the best way to detect high memory pressure and release the pooled buffers so GC can clean them up? (Preferably cross platform, or at least something that doesn't need .NET 5/6)

One of the simplest yet most 'controllable' things you can do from the user's perspective (especially good for bulk processing):

using (var context = new Context()) 
{
    using Image img = Image.Load(..., context);
    using Image resized = img.Resize(..., context);
    resized.Save(..., context);
}

Unfortunately, this would lead to a lot of public API changes/additions, but it allows manually controlling when processing is over and resources can be freed.

The problem with LOH allocations is that the LOH can't actually trigger GC as reliably as non-LOH allocations do. We have 2 rather rare scenarios of LOH collection:
1. Low available memory for non-LOH allocations -> full GC collection cycle, freeing all memory
2. Low available memory to allocate a new LOH buffer for pooling -> full GC collection cycle, as gen2 triggers both gen0 and gen1

(1) won't be caused by pooling if we use LOH buffers
(2) the pool won't allocate new buffers after some time - it would either be using its buffers 100% for some processing or would have a lot of free buffers

What I wanted to state is that LOH collection is a really rare occasion unless triggered manually, which is not something library code can reliably control.

Yes, we can try to force a gen2 GC collection on pool buffer return:

// somewhere in the pool class
public void Return(byte[] buffer) {
    // buffer return logic

    usedBuffersCount--;

    // Cast to double: plain integer division would truncate to 0 and always trigger the collect.
    if ((double)usedBuffersCount / allocatedBuffersCount < somePercentage) {
        GC.Collect(2);
    }
}

But this should be done with care as gen2 collection could cause a significant performance hit.

@JimBobSquarePants
Member

Yet another problem with the current implementation - a race condition. Neither Configuration.MemoryAllocator nor ArrayPoolMemoryAllocator locks in the rent/return code, which can lead to many different entities renting the same buffer.

@br3aker They're the same thing if set which is the default behavior. Locking during Rent/Return occurs in the underlying ConfigurableArrayPool<T>.

I agree it would be much better to allow trimming, but without completely reimplementing TlsOverPerCoreLockedStacksArrayPool<T> that would be impossible.

@br3aker
Contributor

br3aker commented Jun 13, 2021

@JimBobSquarePants I must be blind, thanks for clarifying about locks, didn't see that _lock.Enter(...) call at first glance.

Yep, the new solution would require an entirely new pool implementation, which I'm willing to try in the upcoming week(s). I'll open a draft if anything works out.

@antonfirsov
Member Author

@br3aker on .NET Core there is a trick to detect if a gen2 GC is happening:

https://github.com/dotnet/runtime/blob/01b7e73cd378145264a7cb7a09365b41ed42b240/src/libraries/System.Private.CoreLib/src/System/Gen2GcCallback.cs

What I would do is trim some of the retained buffers on every gen2 GC (e.g. 1/2 or 3/4 or similar). I think this would handle the cases you pointed out in #1596 (comment) without complex statistics.
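Roughly, the trick is a small unreferenced object whose finalizer runs during gen2 collections (once the object itself has been promoted to gen2) and keeps re-registering itself. A sketch, assuming a hypothetical pool type with a Trim method:

using System;

internal interface ITrimmablePool
{
    // Hypothetical: drops roughly the given fraction of retained buffers.
    void Trim(double fraction);
}

internal sealed class Gen2TrimCallback
{
    private readonly WeakReference<ITrimmablePool> poolRef;

    private Gen2TrimCallback(ITrimmablePool pool)
        => this.poolRef = new WeakReference<ITrimmablePool>(pool);

    public static void Register(ITrimmablePool pool) => _ = new Gen2TrimCallback(pool);

    ~Gen2TrimCallback()
    {
        if (this.poolRef.TryGetTarget(out ITrimmablePool pool))
        {
            pool.Trim(0.5); // e.g. release half of the retained buffers
            GC.ReRegisterForFinalize(this); // survive to fire again on the next gen2 GC
        }

        // If the pool has been collected, let this callback die with it.
    }
}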

Note that the contextual batching API you referred to in #1596 (comment) already exists: you can create and pass around a Configuration with its own MemoryAllocator and call configuration.MemoryAllocator.ReleaseRetainedResources(). We need something that works by default.

@br3aker
Contributor

br3aker commented Jun 13, 2021

@antonfirsov yeah, I understand about "better memory handling by default"; the contextual batching API idea was more of a feature around the new built-in allocator. The current API is a bit misleading with creating a new allocator and then manually calling ReleaseRetainedResources(), but that's a discussion for another day; sorry for the confusion in this already complicated topic.

@JimBobSquarePants
Member

Over that, we utilize a custom pool that provides uniform 2MB buffers. We then build the large discontiguous buffers aka. MemoryGroups out of these 2MB subbuffers.

@antonfirsov Would the 2MB uniform limit have any effect upon our ability to work with wrapped buffers, e.g. SwapOrCopy? My memory is hazy there.

@antonfirsov
Member Author

SwapOrCopy is now implemented on the MemoryGroup level, so it should work fine:

public static bool SwapOrCopyContent(MemoryGroup<T> target, MemoryGroup<T> source)
{
    if (source is Owned ownedSrc && ownedSrc.Swappable &&
        target is Owned ownedTarget && ownedTarget.Swappable)
    {
        Owned.SwapContents(ownedTarget, ownedSrc);
        return true;
    }
    else
    {
        if (target.TotalLength != source.TotalLength)
        {
            throw new InvalidMemoryOperationException(
                "Trying to copy/swap incompatible buffers. This is most likely caused by applying an unsupported processor to wrapped-memory images.");
        }

        source.CopyTo(target);
        return false;
    }
}

@saucecontrol
Contributor

I agree it would be much better to allow trimming but without completely reimplementing TlsOverPerCoreLockedStacksArrayPool that would be impossible.

This would seem to be the way in the short term. The way it trims in response to Gen2GcCallback is hacky but clever and well-tested. You'd want to simplify it to not hold a bucket per size per thread since we're talking about larger buffers, though. Maybe some hybrid between the BCL's TlsOverPerCoreLockedStacksArrayPool and ConfigurableArrayPool.

I think regardless, you'll end up with times you need to allocate something over the pool limit, and falling back to unmanaged memory (with a SafeHandle to protect against failure to Dispose) would be a big improvement there. Those extra large allocations living until the next full GC can be a killer.

@antonfirsov
Member Author

GC.GetGCMemoryInfo() doesn't look like a reliable thing: dotnet/runtime#55126

We should not use it; instead, we should always trim the pool by a certain percentage on every Gen2 GC.

@JimBobSquarePants
Member

Subscribed! Very curious to see what the issue is.

@Sergio0694
Member

Very late to the party (super interesting conversation by the way!), just thought I'd throw this out there too as it seems somewhat relevant to the new issue Anton opened, and potentially useful?

dotnet/runtime#53895

This is essentially meant to be a more reliable and less hacky way to get memory pressure based callbacks than what is achievable today by using Gen2Callback that Clint mentioned 😄

@antonfirsov
Member Author

@Sergio0694 yeah, seen that, cool stuff, thanks a lot for pushing it!

@antonfirsov
Member Author

antonfirsov commented Jul 24, 2021

I made some progress with a prototype on a branch. As a next step, I need to start some benchmarking to determine optimal parameters; however, I'm not sure what pool sizes are still considered "sane".

It can be anything from a few megabytes up to several gigabytes on many-core systems. Note that dotnet/runtime#55621 went very aggressive for ArrayPool<T>.Shared.

Maybe we could define it as a percentage of TotalAvailableMemoryBytes? (However it won't work well automatically with stuff like docker's --memory switch.)
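A hedged sketch of what that could mean in code (the fraction is arbitrary and purely illustrative):

// Derive the pool cap from the machine/container memory limit, e.g. one tenth of it.
long totalAvailable = GC.GetGCMemoryInfo().TotalAvailableMemoryBytes;
long maxPoolSizeInBytes = totalAvailable / 10; // illustrative fraction, would need tuning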

@JimBobSquarePants
Member

Wow, the limit has been removed entirely!

So what's your current plan then? An admittedly very quick look at your branch suggests what seems to be 3 pools now?

  1. ArrayPool<byte>.Shared <= 1MB
  2. UniformByteArrayPool - Which doesn't appear to do any cleanup? <= 48 MB
  3. UnmanagedMemoryAllocator > 48MB

Isn't using the UniformByteArrayPool in this manner going to lead to the same issue we are seeing now, where we hold on to memory long after it is useful?

I feel like the DefaultMemoryAllocator is currently offering too many options in its constructor and that users will find them confusing (I find the two contiguous options confusing and I know how the library works). If someone wants to wrap memory or cares about contiguous buffer length perhaps they should simply switch to the SimpleGcMemoryAllocator or roll their own?

@antonfirsov
Member Author

@JimBobSquarePants it's still in a prototype state. Trimming is not yet implemented, first I want to figure out the optimal parameter sizes for the pool size and other things. Current WIP default is 256 MB, 48 MB is too low for stress cases.

I think we should hide the implementing type (DefaultMemoryAllocator or whatever) and make the public API implementation-agnostic this time:

public class MemoryAllocator
{
    public static MemoryAllocator CreateDefault(
        int maxContiguousBlockSizeInBytes = DefaultContiguousBlockSizeInBytes,
        int totalPoolSizeInBytes = DefaultTotalPoolSizeInBytes);
}

We can introduce utilities to help interop use cases, which we will likely break by scaling down the default contiguous buffer size (see #1307 and #1258):

public static class SizeCalculator
{
    public static int GetImageSizeOf<T>(int width, int height);
}

// Customize the allocator for GPU interop use cases:

Configuration.Default.MemoryAllocator = MemoryAllocator.CreateDefault(
    maxContiguousBlockSizeInBytes: SizeCalculator.GetImageSizeOf<Rgba32>(4096, 4096));

@JimBobSquarePants
Member

Yep. I like all these ideas!

@antonfirsov
Member Author

@JimBobSquarePants @saucecontrol I was already at the final steps of polishing my implementation when I discovered an issue with the UniformByteArrayPool approach. It looks like the GC is unable to decommit its LOH pages, despite trimming UniformByteArrayPool a couple of times (or even clearing it entirely with a call to .ReleaseRetainedResources()), if the pages are even slightly interleaved with other LOH allocations (typically coming from ArrayPool.Shared).

The following graph shows the committed memory for an experiment which attempts to free all the memory in the end:

[graph: committed memory timeline of the experiment]

For comparison here is the same experiment, with a slight alteration: setting GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce before GC.Collect(), showing that after compacting the LOH the memory is decommitted:

[graph: committed memory timeline with LOH compaction; memory is decommitted at the end]

What to do now?

We could do a better job by getting rid of the GC for large buffers entirely and refactoring UniformByteArrayPool into a pool of unmanaged buffers, but this could possibly prolong the PR by a couple of weeks.

To deliver something to users in the meanwhile, we could also consider PR-ing an unmanaged allocator without any pooling. From what I see now in large-scale stress benchmarks, UniformByteArrayPool adds about 5% to the throughput in comparison to a non-pooling unmanaged allocator, but a non-pooling unmanaged allocator already has 9% better throughput than our current ArrayPoolMemoryAllocator, and much better memory characteristics.

@JimBobSquarePants
Member

Oh wow! Interesting and slightly disappointing result. I'm glad you've shown your level of thoroughness. I would have likely missed something like this.

A few more weeks is no problem. It gives us time to review and merge the WebP code and fix a few more issues. Plus I'm trying to focus on Fonts and Drawing right now to get them closer to RC status.

@antonfirsov
Member Author

@jaddie there is a WIP PR #1730, and an API proposal in #1739, discussions shifted there. The required change is so fundamental, that we decided to do a major version bump, so our next release containing the memory fix will be ImageSharp 2.0.

The memory PR is expected to be finished within 1 or 2 months (not that much work left, but I'm doing it in my free time); however, we also want to include WebP in that release, which may prolong things too. Hopefully we will still be able to deliver 2.0 before the end of 2021.

If you are considering helping: I need users with production-like environments to test and validate #1730. Let me know if you are interested.

@cshung

cshung commented Oct 29, 2021

@antonfirsov With regard to the issue that LOH cannot decommit memory, have you tried the GCSettings.LargeObjectHeapCompactionMode Property?

@antonfirsov
Member Author

antonfirsov commented Oct 29, 2021

Yes, the second graph in #1596 (comment) was made with GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce set at the end.

The problem is that I'm not sure if it's a good practice to touch that property from library code. The compaction comes with extra GC cost, unexpected by the user, and we don't know when / how often to compact. I'm not excluding that it can be a lesser evil in our case than going unmanaged and sacrificing the simplicity of our pixel manipulation APIs as a result (see #1739), but I can't commit to such a decision without consulting a GC expert.

@cshung if you think you have some time to chat about this, let me know, I would really appreciate the help!

@cshung

cshung commented Nov 4, 2021

The problem is that I'm not sure if it's a good practice to touch that property from library code. The compaction comes with extra GC cost, unexpected by the user, and we don't know when / how often to compact.

Not compacting the LOH automatically by default is a bit sad. Historically, we never compacted the LOH; some customers depended on that fact and assumed LOH-allocated objects are pinned, so we cannot automatically compact the LOH without some kind of user opt-in, otherwise we might break some users. The GCSettings.LargeObjectHeapCompactionMode property seemed to solve most people's pressing issue, and therefore we were not looking into it.

Those days are long gone; now we can automatically compact the LOH with minimal configuration. Starting with .NET 6, you can specify the GCConserveMemory setting. It is a number ranging from 0 to 9, indicating how much you want the GC to work to conserve memory instead of giving the best possible speed.

Unfortunately, the setting is not documented yet; it will be. For the time being, we can explore what the setting could do by looking at the code here(*).

If you search for compacting LOH due to GCConserveMem setting, you should find the relevant logic for deciding LOH compaction. As an overview, the implementation checks the fragmentation ratio (i.e. what percentage of memory is wasted as free space). If it is larger than a threshold derived from the GCConserveMemory setting, then it will turn on LOH compaction.

This should alleviate the need to set that property - whether or not to compact the LOH is best left for the GC to decide.

(*) Whatever we actually do in the code is an implementation detail that is subject to change; we do need the flexibility to avoid painting ourselves into a corner, like what happened with the LOH not compacting.

@cshung

cshung commented Nov 4, 2021

@cshung if you think you have some time to chat about this, let me know, I would really appreciate the help!

I am more than happy to reach out to developers like you who care about garbage collection performance. My personal goal is to understand the needs and ideally come up with benchmarks that are representative to work on. What is the best way to reach you?

@antonfirsov
Member Author

@cshung thanks a lot for the answers!

Starting with .NET 6, you can specify the GCConserveMemory settings.

From our perspective, the problem with an opt-in setting is that we can't configure it on behalf of our users. Even if we promoted it in our documentation, most users would still miss it and complain about poor scalability compared to unmanaged libraries like Skia(Sharp). It's much better if the library "just works" without any extra configuration, thanks to good defaults.

For now I have decided to go ahead with switching to unmanaged memory, for two reasons:

  1. The compaction issues discussed above
  2. Unmanaged memory has better characteristics when allocations are overflowing the pooling threshold, and the images are disposed immediately. (this instead of this)

However, thinking in longer terms, this feels wrong to me. Ideally, a managed library should be able to meet all of its requirements using managed memory only. It would be cool to switch back to the GC in a future version. I wonder if ImageSharp is some sort of special animal here, or whether there are other memory-heavy libraries or apps facing similar issues.

My personal goal is to understand the needs and ideally come up with benchmarks that are representative to work on. What is the best way to reach you?

That's great to hear, we can chat on Teams I think.
