WIP. More Efficient MemoryAllocator #1660
Conversation
: base(IntPtr.Zero, true)
{
    this.SetHandle(Marshal.AllocHGlobal(size));
    GC.AddMemoryPressure(this.byteCount = size);
Should this be done? This would lead to faster gen2 collections, and the object itself would behave as if it were a managed allocation, which I assume is not what we want for large chunks of data.
The Microsoft docs state that "The AddMemoryPressure and RemoveMemoryPressure methods improve performance only for types that exclusively depend on finalizers to release the unmanaged resources". That is not the case here: the SafeHandle is only used to prevent a memory leak when the user forgets to call Dispose, which is very unlikely to be caused by library code (this should be tested somehow, to be honest).
Otherwise this would lead to memory throttling, as with the current implementation.
That advice is awkwardly worded. In the case where a user fails to Dispose an object backed by an unmanaged buffer, it does rely on the SafeHandle's finalizer to release the memory, so it is advantageous to advise the GC that it might be able to reclaim the memory by doing its GC thing. This should only result in more frequent gen2 GCs if the memory limits are being reached, which should only happen if the unmanaged buffer is long-lived or if the system is actually low on memory. For properly Disposed ephemeral buffers, the GC will be aware of the added memory pressure but will also see it freed quickly, so no harm.
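For context, a minimal sketch of the pattern being discussed, assuming a SafeHandle-derived handle like the one in the diff (names are illustrative, not the PR's exact type): the allocation pairs AllocHGlobal with AddMemoryPressure, and the release path, whether it runs from Dispose or from the SafeHandle's finalizer, pairs FreeHGlobal with RemoveMemoryPressure.

```csharp
using System;
using System.Runtime.InteropServices;

// Illustrative sketch only; not the PR's actual UnmanagedBuffer<T> handle.
internal sealed class UnmanagedBufferHandle : SafeHandle
{
    private readonly int byteCount;

    public UnmanagedBufferHandle(int size)
        : base(IntPtr.Zero, ownsHandle: true)
    {
        this.SetHandle(Marshal.AllocHGlobal(size));
        this.byteCount = size;

        // Tell the GC about the unmanaged allocation so it can account for it
        // if the handle ends up being reclaimed via the finalizer.
        GC.AddMemoryPressure(size);
    }

    public override bool IsInvalid => this.handle == IntPtr.Zero;

    // Runs from Dispose(), or from the SafeHandle critical finalizer if the
    // owner leaked the buffer.
    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(this.handle);
        GC.RemoveMemoryPressure(this.byteCount);
        return true;
    }
}
```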
I'm not sure these buffers can be called 'ephemeral'; 2MB is a lot (at least it looks like a lot for something temporary), which should mean a lot of execution time spent on this memory. The primary idea I had while writing this was that even if the user forgets to dispose memory backed by a handle, it'll be freed by a gen0/gen1 collection or a full gen2 collection when an OutOfMemoryException is about to kick in.
While I'm not so sure about this now, I'm still concerned that this might be overkill because of the freachable queue. At the very least I'd want to benchmark this on something big, like your parallel bee heads demo.
Unmanaged buffers will only be used for allocations that are over the pooled size limit. They should be used once and then released, hence 'ephemeral'. The question of whether GC.AddMemoryPressure is appropriate ultimately comes down to "is there any way a GC could free this memory?". The answer here is yes, but only if the memory has been leaked because someone didn't dispose. Unfortunately there's no way to know whether someone is going to leak; you can only tell after they've done it, when your finalizer runs.
What the MS docs should have said is something along the lines of "don't tell the GC about memory it couldn't possibly reclaim". If the allocation and free were always 100% deterministic, there would be no point in telling the GC about them. But since we can't know whether the GC could reclaim the memory, it's better to tell it about the allocation than not to.
Once again, thanks for the deep dive! I've never looked at the GC from the "memory it can or cannot reclaim" point of view.
/// Does not directly use any hardware intrinsics, nor does it incur branching.
/// </summary>
/// <param name="value">The value.</param>
private static int Log2SoftwareFallback(uint value)
The Jpeg encoder uses similar logic for internal operations; this could potentially be placed somewhere in the Numerics.cs class for reuse.
P.S.
Jpeg needs to calculate the minimum bit size of a given number, which can be calculated via the Log2DeBruijn fallback logic on non-intrinsic hardware. Not really a game changer, but it would be nice to have these under a single class & single test fixture.
That uses BitOperations.LeadingZeroCount, does it not? Numerics.MinimumBitsToStore16 is awkwardly named, actually. 16 what?
If the Lzcnt intrinsic is supported, yes. But the fallback code currently uses a LuT of 255 values, with at most one possible branch if the value exceeds 255, so 16 bits is the maximum value bit size this method can safely calculate. A DeBruijn sequence should be a bit faster and would eliminate the 16-bit maximum value constraint. The code would be a bit different; that table is what can be shared between the jpeg & pool code. I'll open a PR with the naming fix & algorithms after finishing my current open PR for testing.
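For reference, the DeBruijn-based fallback being referred to looks like the following; the table and multiplier mirror the software fallback used by BitOperations.Log2 in the BCL, while the minimum-bits helper on top is a hypothetical sketch of the shared jpeg/pool usage:

```csharp
using System;

internal static class BitHelpers
{
    private static ReadOnlySpan<byte> Log2DeBruijn => new byte[32]
    {
        00, 09, 01, 10, 13, 21, 02, 29,
        11, 14, 16, 18, 22, 25, 03, 30,
        08, 12, 20, 28, 15, 17, 24, 07,
        19, 27, 23, 06, 26, 05, 04, 31
    };

    // Branch-free floor(log2(value)); returns 0 when value is 0.
    private static int Log2SoftwareFallback(uint value)
    {
        // Smear the highest set bit downwards so value becomes 2^n - 1.
        value |= value >> 01;
        value |= value >> 02;
        value |= value >> 04;
        value |= value >> 08;
        value |= value >> 16;

        // Multiply by the DeBruijn-style constant and use the top 5 bits as an index.
        return Log2DeBruijn[(int)((value * 0x07C4ACDDu) >> 27)];
    }

    // Hypothetical helper without the 16-bit input limit of the LUT approach
    // (treats 0 as needing 1 bit; adjust the zero convention as required).
    public static int MinimumBitsToStore(uint value) => Log2SoftwareFallback(value) + 1;
}
```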
Your maths is better than mine! 😀
Cool I’ll leave that to you then. Thanks!
No problem! I'll try to do it today so you can use it in this PR before merging.
Lovely, thanks!
@saucecontrol I had a go at profiling. Master: (profiler screenshot omitted). Here's your original sample: (profiler screenshot omitted).
/// </summary>
private ArrayPool<byte> largeArrayPool;
private const int DefaultMaxArraysPerBucket = 16;
Since we utilize only one bucket now, an intensive parallel load will defer to the unmanaged stuff almost all the time. I would try benchmarking with different bucket counts.
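If it helps frame the experiment, the two knobs in play here are the maximum pooled array length and the number of arrays retained per bucket; the built-in ArrayPool<byte>.Create exposes the same pair, so a quick sketch of the configurations to compare (illustrative values only, not the PR's actual settings) could look like:

```csharp
using System.Buffers;

// Illustrative benchmark configurations; the PR's pool is the custom
// GCAwareConfigurableArrayPool<T>, but the knobs being tweaked are the same.
ArrayPool<byte> baseline = ArrayPool<byte>.Create(maxArrayLength: 2 * 1024 * 1024, maxArraysPerBucket: 16);
ArrayPool<byte> deeperBuckets = ArrayPool<byte>.Create(maxArrayLength: 2 * 1024 * 1024, maxArraysPerBucket: 64);
```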
I'm gonna need help writing those tests and improving others to ensure they work in CI.
ThrowInvalidAllocationException<T>(length);
// For anything greater than our pool limit defer to unmanaged memory
// to prevent LOH fragmentation.
memory = new UnmanagedBuffer<T>(length);
Would be great to see some metrics, how many times do we go here VS to the pool.
I can help with the EventSource stuff.
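For what it's worth, the EventSource side of this can be very small; a hypothetical sketch (not the PR's actual diagnostics type) that counts pool rents vs unmanaged allocations would be along these lines, and the counts can then be collected out-of-process with dotnet-trace or in-process with an EventListener:

```csharp
using System.Diagnostics.Tracing;

// Hypothetical allocation-tracking EventSource; fire one event from each code path.
[EventSource(Name = "SixLabors-ImageSharp-MemoryAllocation")]
internal sealed class AllocationEventSource : EventSource
{
    public static readonly AllocationEventSource Log = new AllocationEventSource();

    [Event(1, Level = EventLevel.Informational)]
    public void PoolRent(int lengthInBytes) => this.WriteEvent(1, lengthInBytes);

    [Event(2, Level = EventLevel.Informational)]
    public void UnmanagedAllocation(int lengthInBytes) => this.WriteEvent(2, lengthInBytes);
}
```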
I know the BMP codec hits this a few times. I saw that when I made a mistake in the UnmanagedBuffer<T> type. Hopefully my EventSource stuff works ok.
In order to have a fair comparison, you need to benchmark the old and the new allocator on the same PC. You can make a copy of the old allocator class in a different namespace for simplicity.
I would try to tweak different parameters and see how the results change.
/// </summary>
private ArrayPool<byte> normalArrayPool;
internal const int DefaultMaxArrayLengthInBytes = 2 * SharedPoolThresholdInBytes;
This is also a parameter we can tweak!
Yep. Time is against me now though, it's nearly 3am.
The code is still pooling only 16*2 = 32MB; that's a tiny part of the 1.7GB peak. What we see here must be the result of building discontiguous buffers out of 2MB unmanaged memory blocks. @JimBobSquarePants can we run an experiment trying to pool significantly more (128MB, 512MB, 1024MB)? If we get better results with unmanaged buffers, we may consider dropping the large pool entirely, though I still have concerns about the reliability of SafeHandle finalizers. AFAIK a single unhandled exception in any user finalizer will prevent the finalizer queue from finishing, leading to actual memory leaks. @saucecontrol thoughts?
Marking this as ready to review. There's a lot of testing and configuration to be done but I think the functionality is pretty much where we need it to be.
We still need to figure out how much value there is in pooling; if it has no real effect, that's a game changer for the current implementation and API shape. We also need metrics on throughput; the new memory access patterns may have an impact on cache utilization.
I'll leave the memory profiling to the more proficient. I need to take a break; I'm struggling with jet lag.
TBH it's also a very difficult time for me now. There's plenty of work left IMO; I recommend slowing down and treating this as a marathon rather than a sprint, there's no good in rushing it. I'll see what I can do on Saturday.
@antonfirsov Correct. I was attempting to explain why the profile of the initial PR showed much lower total VirtualAlloc numbers despite much higher peak and baseline memory. The switch to unmanaged makes every buffer allocation an actual allocation, giving a higher total VirtualAlloc. But those allocations are released immediately, keeping the instantaneous committed memory lower. When falling back to unmanaged allocations, keeping the 2MB discontiguous buffer strategy will be a negative perf-wise in that
That's true, but only because an unhandled exception in a finalizer will crash the process. 😆 Your finalizers in particular will only be calling
@antonfirsov I need to better understand what we want here. As I see it, the larger the pool, the more aggressively we will have to trim it, as it will end up holding onto memory again. With our new logic we'd end up adding 1024MB arrays to that pool and almost nothing would go to unmanaged memory.
@saucecontrol Actually, it's pretty simple as I recall. I think if we simply adjusted
Yeah, but that would mean larger managed buffer requests too, which would then always fail, no? Would be a quick way to test whether going all unmanaged would be ok, though. One more note on perf around that... In addition to the fact Correction:
Ah no. If the requested amount is larger than the pool maximum of 2MB, or if the pool is exhausted, then we defer to unmanaged buffers.
(See ImageSharp/src/ImageSharp/Memory/Allocators/Internals/GCAwareConfigurableArrayPool{T}.cs, lines 107 to 157 at a22b0cb, and ImageSharp/src/ImageSharp/Memory/Allocators/ArrayPoolMemoryAllocator.cs, lines 133 to 152 at a22b0cb.)
Yeah, that's what I meant. What I was picturing was something that requested 2MB (or whatever you pick for your max managed chunk size) at a time from the managed pool, and then when that returns null, request all the rest in one unmanaged allocation. That would require moving that abstraction out of the allocator, though.
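To make the shape of that idea concrete, here is a rough sketch under the assumptions above (hypothetical names, not the PR's MemoryGroup API): rent fixed-size blocks from the managed pool while it cooperates, then take whatever is left as a single, exactly-sized unmanaged allocation.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

internal static class MemoryGroupSketch
{
    public static List<IMemoryOwner<T>> Allocate<T>(
        long totalLength,
        int pooledBlockLength,
        Func<int, IMemoryOwner<T>?> rentPooled,        // returns null when the pool is exhausted
        Func<long, IMemoryOwner<T>> allocateUnmanaged) // sized exactly to the request
        where T : struct
    {
        var buffers = new List<IMemoryOwner<T>>();
        long remaining = totalLength;

        // Fill with pooled blocks while the pool still has capacity.
        while (remaining >= pooledBlockLength)
        {
            IMemoryOwner<T>? pooled = rentPooled(pooledBlockLength);
            if (pooled is null)
            {
                break; // pool exhausted: stop chunking
            }

            buffers.Add(pooled);
            remaining -= pooledBlockLength;
        }

        if (remaining > 0)
        {
            // Whatever is left (a small tail, or everything once the pool gave up)
            // becomes one exactly-sized allocation instead of more 2MB chunks.
            buffers.Add(allocateUnmanaged(remaining));
        }

        return buffers;
    }
}
```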
Yeah, that gets a bit iffy. I'd really like to keep everything there. Here's what happens if I change the contiguous length to 24MB: (profiler screenshot omitted). Looks like CPU takes a hit there.
Ouch. Yeah, 24MB is too chunky. You'd allocate 48MB when 25MB is requested, which shows in your total VirtualAlloc number jumping way up again. That's why it would be better for the MemoryGroup to know it's making an unmanaged 'rental' so it can size it exactly to what it needs. There's a balance there somewhere. Unfortunately it'll be a lot of testing.
I think we can expose the required properties via the allocator interface easily enough. We're breaking it anyway.
bool lockTaken = false, allocateBuffer = false;
try
{
    this.spinLock.Enter(ref lockTaken);

    if (this.index < buffers.Length)
    {
        buffer = buffers[this.index];
        buffers[this.index++] = null;
        allocateBuffer = buffer == null;
    }
}
Calculating whether the buffer was null can be done outside the lock scope: if (buffer == null) { }
Suggested change, from:

bool lockTaken = false, allocateBuffer = false;
try
{
    this.spinLock.Enter(ref lockTaken);
    if (this.index < buffers.Length)
    {
        buffer = buffers[this.index];
        buffers[this.index++] = null;
        allocateBuffer = buffer == null;
    }
}

to:

bool lockTaken = false;
try
{
    this.spinLock.Enter(ref lockTaken);
    if (this.index < buffers.Length)
    {
        buffer = buffers[this.index];
        buffers[this.index++] = null;
    }
}
Note: I don't know if you can suggest changes from different places, so do not commit this; line 353 must be changed to if (buffer == null) for this to work.
@JimBobSquarePants hm, I may be wrong of course, but we can easily do that outside of the lock scope, no? That comparison is nothing, but with spinlocks it's better to be fast than asleep :P
// for that slot, in which case we should do so now.
if (allocateBuffer)
{
    if (this.index == 0)
This is not thread safe; the this.index variable is shared state that must be altered only inside the locked scope.
Also, this.index would never be 0 in this piece of code: if the bucket is full (i.e. no buffers are rented), this.index would be equal to zero, but after exactly one rent inside the locked scope this.index is incremented, so at the if (this.index == 0) line the index would always be >= 1.
This can be fixed like this:
bool lockTaken = false;
int takenIndex = -1;
try
{
    this.spinLock.Enter(ref lockTaken);

    if (this.index < buffers.Length)
    {
        buffer = buffers[this.index];
        buffers[this.index] = null;
        takenIndex = this.index++;
    }
}
finally
{
    if (lockTaken)
    {
        this.spinLock.Exit(false);
    }
}

if (buffer == null)
{
    if (takenIndex == 0)
    {
        // Stash the time the first item was added.
        this.firstItemMS = (uint)Environment.TickCount;
    }

    // ...
Good catch. It's also in the wrong method. Will update.
I definitely broke something in NET FX when I removed
Don't get me wrong, this is amazing work; it's great to see proof that unmanaged allocations can reduce memory peaks by more than 50%. However, this is a massive change in the core engine, and the current code and the status of validation are still very far from a state I would consider mergeable.
Here is the list of major problems we need to solve:

- As a start, we need to set up systematic benchmarking, which can give us the following metrics for any configuration in a quantitative/comparable manner, enabling data-driven decision making to shape the design and fine-tune the final perf parameters (a rough harness sketch follows this comment):
  - Peak memory usage
  - Average memory usage
  - CPU utilization
  - Throughput (~time taken to process all the bee heads)

  We should be able to put these values into a table so we can easily compare benchmark results for different parameter sets.
- With the current tuning, there is almost no pooling happening when it comes to large buffers, meaning that the current PR behavior does not justify the presence of the complex array pooling & trimming logic and the ArrayPoolMemoryAllocator name / API shape. Intuition tells us that we need some pooling, but ATM we have no idea how much. We need to see how the allocator works with pooling disabled, plus a list of more aggressive pooling setups.

  > With our new logic we'd end up adding 1024MB arrays to that pool and almost nothing would go to unmanaged memory.

  I did not propose 1024 as a final value. It is an extreme config in general, but not for the bee heads benchmark, where the memory usage seems to fluctuate around 1.2GB. I'm also more curious about less aggressive, more reasonable pooling settings, but I see no reason for not getting all the data points that help us understand the whole picture and make better decisions.
- As pointed out in WIP. More Efficient MemoryAllocator #1660 (comment), we need different sizes for the contiguous blocks coming from pooled arrays VS unmanaged buffers. There are several ways to implement this; probably the most straightforward thing is to move the code in MemoryGroup.Allocate to a virtual method on MemoryAllocator that could be overridden, and use ideas from WIP. More Efficient MemoryAllocator #1660 (comment) in the override.
- Using the ArrayPool<T> abstraction and the entire GCAwareConfigurableArrayPool<T> class to pool 2MB buffers doesn't bring us value, since we are utilizing only one bucket of that pool. We should implement a custom pool class dedicated to the concern of pooling uniform arrays. It should be relatively easy to refactor it from GCAwareConfigurableArrayPool<T>.Bucket.
- We should periodically trim the pool by a certain factor even when there is low memory pressure. This would address the concern of retaining memory unnecessarily (pointed out in General discussion about memory management #1590 and other user complaints), and also enable us to pool much more when there is high load.
- We need extensive test coverage to validate our assumptions regarding the utilization of the pools and unmanaged buffers. We also need to test trimming. Unfortunately, it would be too expensive to do it with regular Xunit runs, but we can define local-only tests that deliver & validate the logs/metrics proving the trimming is happening in the way we expect.
I understand this is an enormous amount of work, but that was always the case for the previous PRs refactoring the memory management engine. I don't see a reason to lower our quality criteria in this case and omit any of the points above, especially since we are about to introduce a breaking change and we want to prevent future breaking changes. Personally, I want to start working on point 1, but even that alone may take several evenings, which makes it very hard to give an ETA. If we feel like the allocator work may block V1.1 for too long, we should focus on #1597 first, since it will bring an even bigger improvement with a lower development cost.
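As referenced in point 1, here is a rough measurement harness sketch (hypothetical, not part of this PR) that records throughput, CPU time and peak working set for a given workload, so the numbers can be dropped into a comparison table per allocator configuration:

```csharp
using System;
using System.Diagnostics;

internal static class AllocatorBenchmark
{
    public static void Measure(string label, Action workload)
    {
        Process process = Process.GetCurrentProcess();
        TimeSpan cpuBefore = process.TotalProcessorTime;
        Stopwatch stopwatch = Stopwatch.StartNew();

        workload();

        stopwatch.Stop();
        process.Refresh();
        TimeSpan cpuUsed = process.TotalProcessorTime - cpuBefore;
        long peakWorkingSetMb = process.PeakWorkingSet64 / (1024 * 1024);

        // Note: PeakWorkingSet64 is a per-process peak, so run one
        // configuration per process for a fair comparison.
        Console.WriteLine(
            $"{label}: {stopwatch.Elapsed.TotalSeconds:F1}s wall, " +
            $"{cpuUsed.TotalSeconds:F1}s CPU, peak {peakWorkingSetMb} MB");
    }
}
```

Average memory usage over time would still need a sampling loop or an external profiler, but this covers the peak/CPU/throughput columns.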
@antonfirsov I actually agree with all of the above. I'm running before I can walk without the relevant experience, and suffering as a result. #1597 should immediately benefit us for V1.1. What I'm actually going to do is close this and instead introduce a few smaller PRs to do some sanitation work which will allow the allocator changes to be made more easily.
I spent some time today trying to figure out how to get the desired metrics out of .ETL files, but I realized there is no easy way that is worth the effort. This should not block systematic comparison, but will make it even more of a chore 😞
Description
A breaking rewrite (due to the removal of unused configuration options) of ArrayPoolMemoryAllocator modelled on the design described at #1596 (comment) by @antonfirsov:
- Uses ArrayPool<byte>.Shared for buffers <= 1MB.
- Uses a Gen2GcCallback, as in TlsOverPerCoreLockedStacksArrayPool, to trim the pool under pressure.
- Buffer2D<T> will use 2MB for each discontiguous (did you know that was a Scottish word?) chunk.

Things to consider
- Done! For Allocate<T> only we should consider using unmanaged memory on supported platforms for > 2MB allocations, using the approach described by @saucecontrol in More efficient MemoryAllocator #1596 (comment).

Issues
- Fixed, thanks @brianpopow!! Several Tiff encoding tests fail with the lower buffer chunk threshold set. This is due to TiffBaseColorWriter<TPixel> calling Buffer2D<T>.GetSingleSpan(). @brianpopow I'll need you or @IldarKhayrutdinov to help me there as I don't know enough about the format to do a fix. (ImageSharp/src/ImageSharp/Formats/Tiff/Writers/TiffBaseColorWriter{TPixel}.cs, lines 82 to 84 at 381dff8.)
- Fixed! We have a failing test for ResizeKernelMap due to the use of Buffer2D<T>.GetSingleMemory(). @antonfirsov I'll need help there. (ImageSharp/src/ImageSharp/Processing/Processors/Transforms/Resize/ResizeKernelMap.cs, lines 54 to 55 at 381dff8.)