Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PathGradientBrush fails to render a complex polygon #117

Closed
chris-steema opened this issue Dec 18, 2020 · 20 comments · Fixed by #128
Closed

PathGradientBrush fails to render a complex polygon #117

chris-steema opened this issue Dec 18, 2020 · 20 comments · Fixed by #128

Comments

@chris-steema
Copy link

chris-steema commented Dec 18, 2020

SixLabors.ImageSharp v1.0.2
SixLabors.ImageSharp.Drawing v1.0.0-beta11
Running in Visual Studio 2019 v.16.8.3 on Windows 10 v.2004 in .NET 5.0.1-servicing.20575.16

Using the SixLabors.ImageSharp.Drawing.Processing.FillPathExtensions's Fill method, I can render a complex polygon using a SolidBrush, but using a PathGradientBrush the program doesn't finish executing and 'hangs' with high memory usage.

Code to reproduce:

        private static void Main(string[] args)
        {
            static void GradientPolygon()
            {
                static void DoComplexPolygon(IImageProcessingContext g, RectangleF rect)
                {
                    static PointF PointFromCircle(RectangleF rectBounds, double degrees)
                    {
                        static void RectCenter(RectangleF r, out float x, out float y)
                        {
                            x = (r.Left + r.Right) / 2.0f;
                            y = (r.Top + r.Bottom) / 2.0f;
                        }

                        RectCenter(rectBounds, out var xCenter, out var yCenter);
                        var radians = (Math.PI * degrees) / 180.0f;
                        var radiusH = rectBounds.Width / 2.0f;
                        var radiusV = rectBounds.Height / 2.0f;

                        var tmpX = xCenter + radiusH * (float)Math.Cos(radians);
                        var tmpY = yCenter - radiusV * (float)Math.Sin(radians);

                        return new PointF(tmpX, tmpY);
                    }

                    var iPoints = new List<PointF>();

                    for (double i = 0; i <= 360; i += 0.1)
                    {
                        iPoints.Add(PointFromCircle(rect, i));
                    }
                    var widthReduction = 50;

                    rect.Inflate(-widthReduction, -widthReduction);

                    for (double i = 360; i >= 0; i -= 0.1)
                    {
                        iPoints.Add(PointFromCircle(rect, i));
                    }

                    var path = new SixLabors.ImageSharp.Drawing.Polygon(new LinearLineSegment(iPoints.ToArray()));

                    //g.Fill(new SolidBrush(Color.Red), path); //this works
                    g.Fill(new PathGradientBrush(iPoints.ToArray(), new Color[] { Color.Red, Color.White, Color.Blue }), path); //this does not complete
                }

                var bitmap = new Image<Rgba32>(800, 600);

                bitmap.Mutate(x =>
                {
                    x.Clear(Color.White);
                    DoComplexPolygon(x, new RectangleF(137, 50, 526, 526));
                });

                var path = $@"{AppDomain.CurrentDomain.BaseDirectory}/Polygon{DateTime.Now.Ticks}.png";

                bitmap.SaveAsPng(path);

                new Process
                {
                    StartInfo = new ProcessStartInfo(path)
                    {
                        UseShellExecute = true
                    }
                }.Start();
            }

            GradientPolygon();
        }
    }
@tannergooding
Copy link
Contributor

tannergooding commented Apr 10, 2021

It's not that this doesn't complete, its that it is very slow. iPoints contains 7200 entries, if you cut this down to 72 points instead, it completes in under 5 seconds (under 30 if the debugger is attached).

I'm pulling a trace to see if there is any obvious hot spots here

@tannergooding
Copy link
Contributor

For reference, here it is rendered with all 7200 points:
image

@JimBobSquarePants
Copy link
Member

Thanks for having a look into this @tannergooding really appreciate it! I’d like to review our internal API for brushes to see if we can optimise them. As I recall everything is calculated per pixel (maybe unavoidable).

@tannergooding
Copy link
Contributor

A trace shows the following:
image

Essentially most of the time (almost 50%) is spent in the call to allocator.Allocate<PointF> here: https://github.com/SixLabors/ImageSharp.Drawing/blob/master/src/ImageSharp.Drawing/Processing/PathGradientBrush.cs#L166

The allocation graph for live objects ends up looking like this:
image

@tannergooding
Copy link
Contributor

tannergooding commented Apr 10, 2021

In this scenario, almost all allocations are for 2 points (so 16 bytes).

ArrayPoolMemoryAllocator itself looks to be doing some basic math, pulling an entry from the array pool, and then allocating a Buffer<T> to hold the data, with the buffer being freed after FindIntersection completes by the using.

This is being done for every edge (all 7200 of them) for every pixel (all 480,000) of them, so its quite a lot of very short lived, non-reused allocations (which wrap the pooled allocations).

@tannergooding
Copy link
Contributor

This is the kind of scenario that would be greatly improved by an alloca like API, one that is able to either stackalloc or return an array, depending on the number of bytes required.
I imagine it would also be improved by pooling Buffer<T> instances, so the short lived allocations don't pile up so high (it's currently 7200 Buffer<T> allocations per pixel).

Manually doing an alloca like scenario, which is basically (rather than using (IMemoryOwner<PointF> memory = allocator.Allocate<PointF>(this.path.MaxIntersections))):

IMemoryOwner<PointF> memory = null;

try
{
    var maxIntersections = this.path.MaxIntersections;
    Span<PointF> buffer;

    if (maxIntersections < (1024 / sizeof(PointF)))
    {
        PointF* points = stackalloc PointF[maxIntersections];
        buffer = new Span<PointF>(points, maxIntersections);
    }
    else
    {
        memory = allocator.Allocate<PointF>(this.path.MaxIntersections);
        buffer = memory.Memory.Span;
    }

    // code
}
finally
{
    memory?.Dispose();
}

Cuts the execution time (under the profiler) for 720 points from ~57 seconds to ~22 seconds.

@tannergooding
Copy link
Contributor

Doing something like pre-building an IList<IMemoryOwner<PointF>> (one per edge) in PathGradientBrush.Apply provides similar benefits, it cuts the time down to about ~21 seconds (under attached profiler) for this specific scenario.

There are likely pro's and cons to both approaches here. The root cause is the number of allocations being done and each fix "resolves" that by reducing the number of allocations required (either by removing them entirely or by moving them higher up and allowing sharing).

@antonfirsov
Copy link
Member

antonfirsov commented Apr 10, 2021

I'm not familiar with this code, and chiming in for just a second, but probably the buffer could be reused across all FindIntersection calls (taken as Span<PointF> parameter) . Hopefully it can be pre-allocated (or pre-stack-allocated), after taking the maximum of all edge.path.MaxIntersections to get the maximum size.

@JimBobSquarePants
Copy link
Member

I’ll have a look at all this ASAP. Can’t focus just now though as have oral shingles. Happy to follow whatever lead you both come up with.

@tannergooding
Copy link
Contributor

but probably the buffer could be reused across all FindIntersection calls

Right, that's roughly what I referred to by pre-building an IList<IMemoryOwner<PointF>>, but just with a single reused IMemoryOwner rather than a reused set of them.

I think the principle of the change is that BrushApplicator could have some virtual method that allows derived types to provide a shared IMemoryOwner instance for use within Apply. This allows you to allocate once for the entire scanline, rather than once per pixel (reducing Buffer<T> allocations to 600 in this case, rather than 7200 * 800 * 600 😄).

This looks to be achievable today without extending BrushApplicator but would require derived types, like PathGradientBrushApplicator, to override Apply. If BrushApplicator were to be extended to allow reuse, it looks like there will need to be another generic type parameter since the T in the IMemoryOwner<T> might differ from applicator to applicator (unless it is always non-existent or PointF). It could alternatively just be a generic object that the base applicator logic passes through, which would still be fairly efficient and ultimately the most customizable.

@antonfirsov
Copy link
Member

antonfirsov commented Apr 12, 2021

BrushApplicator is a contextual object itself, created/disposed on a per-operation basis. The subclasses can store their temporary buffers (of whatever type) as members.

It seems to me that a PointF buffer of size this.edges.Max(e => e.path.MaxIntersections) can be allocated right in the PathGradientBrushApplicator constructor. We can then pass it's Span<PointF> down to the edge.FindIntersection calls.

@tannergooding
Copy link
Contributor

That seems reasonable and avoids the need for BrushApplicator to be concerned with any of this.

It does mean that the allocation potentially persists for much longer than needed, however, and is dependent on Dispose being called for it to be freed

@antonfirsov
Copy link
Member

The pixel-agnostic IBrush-es are a persistent objects, serving as factories for the pixel-specific BrushApplicator<T>s, which have a lifetime scoped to the image processing operation under execution (eg. filling image with a brush). The applicator is disposed and abandoned afterwards. This is a common pattern we use allowing us to maintain the lifetime of temporary resources easily.

@JimBobSquarePants
Copy link
Member

A chink in the plan. BrushApplicator.Apply() is called in parallel. We'd have to override this virtual method and do the allocation there. This starts to get messy though as we can't pass any parameters T[x, y] which is where the current allocation takes place.

I think we should have a deeper look at our underlying API here.

@antonfirsov
Copy link
Member

I think we should either remove the parallelism, or somehow find a solution that eliminates synchronization concerns by design, by creating an isolated context per-thread. (For example we can probably create a separate BrushApplicator in each thread.)

The hard thing is that this would require extending the ParallelRowIterator stuff.

@JimBobSquarePants
Copy link
Member

JimBobSquarePants commented Apr 19, 2021

I've been looking at T[x, y] in BrushApplicator and it's internal and only ever called by it's own Apply method which is threadsafe and designed to be overridable. I think we could remove it, make Apply abstract and move any custom logic to individual Apply implementations. That would allow requesting the buffer once per parallel iteration.

public virtual void Apply(Span<float> scanline, int x, int y)
{
MemoryAllocator memoryAllocator = this.Configuration.MemoryAllocator;
using (IMemoryOwner<float> amountBuffer = memoryAllocator.Allocate<float>(scanline.Length))
using (IMemoryOwner<TPixel> overlay = memoryAllocator.Allocate<TPixel>(scanline.Length))
{
Span<float> amountSpan = amountBuffer.Memory.Span;
Span<TPixel> overlaySpan = overlay.Memory.Span;
float blendPercentage = this.Options.BlendPercentage;
if (blendPercentage < 1)
{
for (int i = 0; i < scanline.Length; i++)
{
amountSpan[i] = scanline[i] * blendPercentage;
overlaySpan[i] = this[x + i, y];
}
}
else
{
for (int i = 0; i < scanline.Length; i++)
{
amountSpan[i] = scanline[i];
overlaySpan[i] = this[x + i, y];
}
}
Span<TPixel> destinationRow = this.Target.GetPixelRowSpan(y).Slice(x, scanline.Length);
this.Blender.Blend(this.Configuration, destinationRow, destinationRow, overlaySpan, amountSpan);
}
}

I've also noticed we request a buffer from the ArrayPool for ever call to InternalPath.FindIntersections. This is called per pixel and could use a single span.

public int FindIntersections(PointF start, PointF end, Span<PointF> buffer, IntersectionRule intersectionRule)
{
PointOrientation[] orientations = ArrayPool<PointOrientation>.Shared.Rent(buffer.Length);
try
{
Span<PointOrientation> orientationsSpan = orientations.AsSpan(0, buffer.Length);
var position = this.FindIntersectionsWithOrientation(start, end, buffer, orientationsSpan);
var activeBuffer = buffer.Slice(0, position);
var activeOrientationsSpan = orientationsSpan.Slice(0, position);

ComplexPolygon.FindIntersections does the same however I haven't found a useage yet outside ofthe tests.

public int FindIntersections(PointF start, PointF end, Span<PointF> buffer, IntersectionRule intersectionRule)
{
this.EnsureInternalPathsInitalized();
int totalAdded = 0;
InternalPath.PointOrientation[] orientations = ArrayPool<InternalPath.PointOrientation>.Shared.Rent(buffer.Length); // the largest number of intersections of any sub path of the set is the max size with need for this buffer.
Span<InternalPath.PointOrientation> orientationsSpan = orientations;

@antonfirsov
Copy link
Member

I've been looking at T[x, y] in BrushApplicator and it's internal and only ever called by it's own Apply method which is threadsafe and designed to be overridable. I think we could remove it, make Apply abstract and move any custom logic to individual Apply implementations. That would allow requesting the buffer once per parallel iteration.

  • 1 for overriding Apply, since it would definitely optimize things (for example not calling the virtual this[x,y] per-pixel.
    But I think currently there is no way to pass the buffer (or any other context) to Apply, which is per-row.

I would consider (upon overriding Apply) trying ThreadLocal<IMemoryOwner<PointF>> as an applicator member (SixLabors/ImageSharp#1611 (comment)) (accessing it in the overriden Apply for lowest overhead).

I've also noticed we request a buffer from the ArrayPool for ever call to InternalPath.FindIntersections. This is called per pixel and could use a single span.

+1 for that. Note that FindIntersections should not be public and can be handled as an implementation detail.

@tannergooding
Copy link
Contributor

In one of my prototypes, I had just extended this[x, y] to take an additional parameter for any additional data a given applicator wants to pass down.

It was basically:

internal abstract TPixel this[int x, int y, IDisposable data] { get; }

internal abstract IDisposable GetDataForApply();

Apply then just did `

     using (IMemoryOwner<float> amountBuffer = memoryAllocator.Allocate<float>(scanline.Length)) 
     using (IMemoryOwner<TPixel> overlay = memoryAllocator.Allocate<TPixel>(scanline.Length)) 
     using (IDisposable data = GetDataForApply()) 
    {
        // rest of logic
    }

Applicators that didn't do need it could just return null. Applicators that did need it could allocate whatever data they need (in this case IMemoryOwner<PointF>) and that would be passed through to this[x, y, data].
Given the methods are all internal or private, this is safe and you won't expect/receive any unexpected types or values. It likewise avoids the need to customize Apply when 99% of the logic stays the same (you really just want to pass down one or more large enough buffers).
Likewise, given the Buffer<T> is the actual type being returned, there isn't a risk of boxing or other overhead here.

@JimBobSquarePants
Copy link
Member

Interesting. Thanks!

I'm going to do some experimentation here to see if I can clean up the API a little based upon this conversation.

@JimBobSquarePants
Copy link
Member

There's a bit of me that wonders whether we should add the following method to BrushApplicator<T>

public abstract void Apply(Span<float> scanline, int x, in RowInterval rows);

That would allow us to reduce allocations from per-scanline to per interval in the fill processor. Might be overkill though. I'll focus on optimizing the per-scanline operation first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants