
Performance: Refactors query prefetch mechanism #4361

Conversation

@kevin-montrose (Contributor) commented Mar 20, 2024

Description

Reworks ParallelPrefetch.PrefetchInParallelAsync to reduce allocations.

This came out of profiling an application and discovering that this method was allocating approximately as many bytes of Task[] as the whole application was creating in byte[] for IO. This is because Task.WhenAny(...) is (a) used in a loop and (b) makes a defensive copy of the passed Tasks.
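To make the problem concrete, here is a minimal sketch of the kind of loop being replaced. It is not the SDK's actual code: IPrefetcher is simplified to a hypothetical parameterless PrefetchAsync(), and the real interface and signatures differ.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Simplified stand-in for the SDK's IPrefetcher (the real interface takes
// additional parameters such as a trace and a cancellation token).
interface IPrefetcher
{
    Task PrefetchAsync();
}

static class OldStylePrefetchSketch
{
    public static async Task PrefetchInParallelAsync(IEnumerable<IPrefetcher> prefetchers, int maxConcurrency)
    {
        // Note: the enumerator is never disposed, matching the old code.
        IEnumerator<IPrefetcher> enumerator = prefetchers.GetEnumerator();
        List<Task> running = new List<Task>();

        // Start up to maxConcurrency prefetches.
        while (running.Count < maxConcurrency && enumerator.MoveNext())
        {
            running.Add(enumerator.Current.PrefetchAsync());
        }

        while (running.Count > 0)
        {
            // Task.WhenAny copies its argument into a fresh defensive Task[]
            // on every call - one array allocation per completed prefetch.
            Task completed = await Task.WhenAny(running);
            running.Remove(completed);
            await completed; // surface any exception

            // At most one replacement is started per pass through the loop.
            if (enumerator.MoveNext())
            {
                running.Add(enumerator.Current.PrefetchAsync());
            }
        }
    }
}
```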

This version is substantially more complicated, and accordingly there are a lot of tests in this PR (code coverage is 100% of lines and blocks). Special attention was paid to exception and cancellation cases.

Improvements

Greatly Reduced Allocations

In my benchmarking, allocations were reduced anywhere from 30% to 99%, depending on the total number of IPrefetchers used.

More benchmarking discussion is at the bottom of this PR.

Special Casing For maxConcurrency

When maxConcurrency == 0 we do no work, more efficiently than the current code.
When maxConcurrency == 1 we devolve to a plain foreach, which is just about ideal. A sketch of both fast paths follows.
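A hedged sketch of these two fast paths, reusing the simplified IPrefetcher from the earlier sketch (names are illustrative, not the PR's exact code):

```csharp
public static async Task PrefetchInParallelAsync(IEnumerable<IPrefetcher> prefetchers, int maxConcurrency)
{
    if (maxConcurrency == 0)
    {
        return; // nothing may run concurrently, so nothing runs: no enumerator, no tasks
    }

    if (maxConcurrency == 1)
    {
        // Strictly sequential: no task tracking, no arrays, one await at a time.
        foreach (IPrefetcher prefetcher in prefetchers)
        {
            await prefetcher.PrefetchAsync();
        }
        return;
    }

    // maxConcurrency > 1 falls through to the batched paths outlined below.
}
```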

Special Casing When Only 1 IPrefetcher

We accept an IEnumerable<IPrefetcher>, but when it will only ever yield one IPrefetcher, a lot of the work (even in the old code) is pointless. The new code detects this case (generically; it doesn't look for specific types) and devolves into a single await, as sketched below.
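Continuing the same sketch, the detection just probes the enumerator for a second item before committing to any concurrency machinery:

```csharp
using IEnumerator<IPrefetcher> enumerator = prefetchers.GetEnumerator();

if (!enumerator.MoveNext())
{
    return; // empty sequence: nothing to prefetch
}

IPrefetcher first = enumerator.Current;

if (!enumerator.MoveNext())
{
    await first.PrefetchAsync(); // exactly one prefetcher: a single await suffices
    return;
}

// Two or more prefetchers: fall through to the concurrent path,
// remembering that `first` and `enumerator.Current` still need to run.
```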

Prompter Starting Of Next Task

Old code starts at most one task per pass through the while loop, so if multiple Tasks are sitting there completed, there's a fair amount of work done before they are all replaced with active Tasks.

New code has the completed Task start its replacement, which should keep us closer to maxConcurrency active Tasks.
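One way to express that idea (a sketch only; the PR's actual coordination of the shared enumerator is more elaborate than the simple lock assumed here):

```csharp
// Each worker pulls its next IPrefetcher as soon as it finishes the current
// one, so a completed slot is refilled immediately instead of waiting for a
// central loop to notice. The shared enumerator is not thread-safe, hence
// the lock.
static async Task WorkerAsync(IEnumerator<IPrefetcher> enumerator, object enumeratorLock)
{
    while (true)
    {
        IPrefetcher next;
        lock (enumeratorLock)
        {
            if (!enumerator.MoveNext())
            {
                return; // no more work; this slot goes idle
            }
            next = enumerator.Current;
        }

        await next.PrefetchAsync(); // then loop around and grab the next one
    }
}
```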

IEnumerator<IPrefetcher> Disposed

Small nit, but the old code doesn't dispose the IEnumerator<IPrefetcher>. While unlikely to cause trouble, this can put more load on the finalizer thread or potentially leak resources.
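The fix is ordinary deterministic disposal, e.g.:

```csharp
IEnumerator<IPrefetcher> enumerator = prefetchers.GetEnumerator();
try
{
    // ... drain the enumerator and run the prefetchers ...
}
finally
{
    enumerator.Dispose(); // previously left to the finalizer, if it ran at all
}
```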

Outline

  • There are 4 paths through the method now
    • maxConcurrency == 0 just returns
    • maxConcurrency == 1 is just a foreach
    • maxConcurrency <= BatchSize is more complicated
      • Up to BatchSize IPrefetchers are loaded into a rented array
      • Tasks are then started for each of those IPrefetchers
        • These Tasks grab and start the next IPrefetcher from the IEnumerator<IPrefetcher> when they finish with their current one
      • Every started Task is then awaited in order
    • maxConcurrency > BatchSize reuses a lot of the above case, but is still more complicated
      • Up to BatchSize IPrefetchers are loaded and started as above
        • Also as above, the Tasks grab and start the next IPrefetcher when they finish with one
      • But we continue trying to start new batches of IPrefetchers (up to maxConcurrency) while there are active Tasks
      • All of this is tracked in a pseudo-linked-list of rented object[], which is awaited in turn once maxConcurrency is reached (or the IEnumerator<IPrefetcher> finishes)

We distinguish between the two maxConcurrency > 1 cases to avoid allocating very large arrays, and to make sure we start some prefetches fairly promptly even when maxConcurrency is very large. BatchSize is, somewhat arbitrarily, 512 - any value > 1 and < 8,192 would be valid.
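Putting those pieces together, here is a rough sketch of the maxConcurrency <= BatchSize path. Everything here is illustrative: the names are invented, the lock stands in for whatever coordination the PR actually uses, and the cancellation and exception-handling subtleties the tests cover are omitted.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Threading.Tasks;

static class BatchedPrefetchSketch
{
    private const int BatchSize = 512;

    public static async Task PrefetchBatchAsync(IEnumerable<IPrefetcher> prefetchers, int maxConcurrency)
    {
        using IEnumerator<IPrefetcher> enumerator = prefetchers.GetEnumerator();
        object enumeratorLock = new object();

        // Rent the task slots instead of allocating a fresh array.
        Task[] rented = ArrayPool<Task>.Shared.Rent(Math.Min(maxConcurrency, BatchSize));
        int started = 0;
        try
        {
            // Start up to maxConcurrency self-refilling workers.
            while (started < maxConcurrency)
            {
                IPrefetcher next;
                lock (enumeratorLock)
                {
                    if (!enumerator.MoveNext())
                    {
                        break;
                    }
                    next = enumerator.Current;
                }
                rented[started++] = WorkerAsync(next, enumerator, enumeratorLock);
            }

            // Await every started task in order; faults surface here.
            for (int i = 0; i < started; i++)
            {
                await rented[i];
            }
        }
        finally
        {
            ArrayPool<Task>.Shared.Return(rented, clearArray: true);
        }
    }

    private static async Task WorkerAsync(IPrefetcher current, IEnumerator<IPrefetcher> enumerator, object enumeratorLock)
    {
        while (true)
        {
            await current.PrefetchAsync();
            lock (enumeratorLock)
            {
                if (!enumerator.MoveNext())
                {
                    return; // enumerator drained; this slot goes idle
                }
                current = enumerator.Current;
            }
        }
    }
}
```

The maxConcurrency > BatchSize path extends this by chaining additional rented batches (the pseudo-linked-list of object[] mentioned above) rather than renting one enormous array.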

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Sort of a bug I guess? Current code allocates a lot more than you'd expect.

Benchmarking

Standard caveats about micro-benchmarking apply, but I did some benchmarking to see how this stacks up versus the old code.

TL;DR - across the board improvements in allocations, with no wall clock regressions in what I believe is the common case. There are some narrow, less common cases, where small wall clock regressions are observed.

I consider the primary case here when the IPrefetcher actually goes async, and takes some non-trivial time to do its work. My expectation is that the two versions of the code should have about the same wall-clock time when # tasks > maxConcurrency, with the new code edging out old as # tasks increases.

That said, I did also test the synchronous completion case, and the "goes async, but then completes immediately" cases to make sure performance wasn't terrible.

In all cases I expect the new code to perform fewer allocations than the old.
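For concreteness, the three synthetic prefetchers described below look roughly like this (hypothetical stand-ins built on the simplified interface from the first sketch, not the gist's exact benchmark code):

```csharp
// Goes genuinely async and takes non-trivial time (the primary case).
class DelayPrefetcher : IPrefetcher
{
    public async Task PrefetchAsync() => await Task.Delay(1);
}

// Completes synchronously; the async machinery never runs.
class SynchronousPrefetcher : IPrefetcher
{
    public Task PrefetchAsync() => Task.CompletedTask;
}

// Forces the async completion machinery but finishes almost immediately.
class YieldPrefetcher : IPrefetcher
{
    public async Task PrefetchAsync() => await Task.Yield();
}
```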

Summarizing some results (the full data is in the previous link)...

Here's old vs new on .NET 6 where the IPrefetcher is just an await Task.Delay(1) (< 1 is an improvement):
[chart: old/new ratios for wall clock and allocations, .NET 6, await Task.Delay(1) prefetcher]
As expected, wall clock time is basically unaffected (the delay dominates) but allocations are improved across the board. The benefits of improved replacement Task starting logic are visible at the very extreme ends of max concurrency and prefetcher counts.

Again, but this time the IPrefetcher just does a return default; so everything completes synchronously:
[chart: old/new ratios, synchronous-completion prefetcher]
We see here that between 2 and 8 tasks there are configurations with wall clock regressions. I could try to improve that, but I believe the "all synchronous completions" case is fantastically rare, so it's not worth the extra code complications.

And finally, IPrefetcher is just await Task.Yield(); so everything completes almost immediately but forces all the async completion machinery to run:
[chart: old/new ratios, await Task.Yield() prefetcher]
Similarly, between 4 and 8 tasks there are some wall clock regressions. While more realistic than the "all synchronous" case, I think this would still be pretty rare - most IPrefetchers should be doing real work after some asynchronous operation.

Since we target netstandard, I also benchmarked under .NET Framework 4.6.2, and the results are basically the same:
[chart: old/new ratios, .NET Framework 4.6.2]

@kevin-montrose (Contributor, Author) commented:

@microsoft-github-policy-service agree company="Microsoft"

@kevin-montrose (Contributor, Author) commented:

> You have nice performance charts for this implementation in the description. If you wrote some code for benchmarking this, it may be good to add it to the Performance Tests project

The benchmarks are off in a gist (also linked in the description), but they take a long time to run (I just ran 'em overnight during development), so I don't think it'd make much sense to check them into anything that runs regularly. Since I didn't intend to run them regularly, I assembled the charts by hand, so there's nothing to save there.

neildsh previously approved these changes Mar 29, 2024
@ealsur (Member) commented Mar 29, 2024

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@ealsur changed the title from "Parallel prefetch rework" to "Performance: Refactors query prefetch mechanism" Mar 29, 2024
ealsur previously approved these changes Mar 29, 2024
@kevin-montrose dismissed stale reviews from neildsh and ealsur via b5d4f07 Mar 29, 2024 20:37
@neildsh (Contributor) commented Mar 30, 2024

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@sboshra (Contributor) left a comment:

:shipit:

@neildsh added the QUERY and auto-merge labels Apr 1, 2024
@neildsh (Contributor) commented Apr 1, 2024

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

microsoft-github-policy-service bot merged commit e04ce51 into Azure:master Apr 1, 2024
21 checks passed