Performance: Refactors query prefetch mechanism #4361
Conversation
…ons, but we also can't be substantially slower to start all tasks
…d swallow exceptions
…sks and buffers that could occur in that some places
…ot allocating more in the non-test cases, but found a field to reuse; needs benchmarking
The benchmarks are off in a gist (also linked in the description), but they take a loooong time to run (I just ran 'em overnight during development) so I don't think it'd make much sense to check them into anything that is regularly run. Since I didn't intend to run them regularly, I assembled the charting by hand - so there's nothing to save there.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Merged commit e04ce51 into Azure:master
Description
Reworks `ParallelPrefetch.PrefetchInParallelAsync` to reduce allocations. This came out of profiling an application and discovering that this method was allocating approximately as many bytes worth of `Task[]` as the whole application was creating in `byte[]` for IO. This is because `Task.WhenAny(...)` is (a) used in a loop and (b) makes a defensive copy of the passed `Task`s.

This version is substantially more complicated, and accordingly there are a lot of tests in this PR (code coverage is 100% of lines and blocks). Special attention was paid to exception and cancellation cases.
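To make the allocation problem concrete, here's a hedged sketch (not the SDK's actual code; `PrefetchAsync` is a hypothetical stand-in) of the `Task.WhenAny`-in-a-loop pattern being replaced. Each `Task.WhenAny` call defensively copies the running set into a fresh `Task[]`, so draining n prefetchers at concurrency c allocates on the order of n arrays of length c:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class WhenAnyLoopSketch
{
    // Hypothetical stand-in for an IPrefetcher's work.
    static async Task PrefetchAsync(int i) => await Task.Yield();

    static async Task Main()
    {
        const int maxConcurrency = 4;
        IEnumerable<int> work = Enumerable.Range(0, 100);

        var running = new List<Task>();
        int whenAnyCalls = 0;

        foreach (int item in work)
        {
            running.Add(PrefetchAsync(item));
            if (running.Count == maxConcurrency)
            {
                // Task.WhenAny copies `running` into a new Task[] on EVERY
                // iteration - this is the allocation hot spot the PR removes.
                Task done = await Task.WhenAny(running);
                whenAnyCalls++;
                running.Remove(done);
            }
        }
        await Task.WhenAll(running);

        // Roughly one Task[maxConcurrency] copy per completed prefetcher.
        Console.WriteLine($"WhenAny calls: {whenAnyCalls}");
    }
}
```

Running this prints `WhenAny calls: 97` - nearly one defensive array copy per item processed.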
Improvements
Greatly Reduced Allocations
In my benchmarking, allocations fell by anywhere from 30% to 99% depending on the total number of `IPrefetcher`s used. More benchmarking discussion is at the bottom of this PR.
Special Casing For `maxConcurrency`
When `maxConcurrency == 0` we do no work, more efficiently than the current code. When `maxConcurrency == 1` we devolve to a `foreach`, which is just about ideal.

Special Casing When Only 1 `IPrefetcher`
We accept an `IEnumerable<IPrefetcher>`, but when that is only going to yield one `IPrefetcher`, a lot of work (even with the old code) is pointless. New code detects this case (generically, it doesn't look for specific types) and devolves into a single `await`.

Prompter Starting Of Next Task
Old code starts at most one task per pass through the while loop, so if multiple `Task`s are sitting there completed, there's a fair amount of work done before they are all replaced with active `Task`s. New code has the completed `Task` start its replacement, which should keep us closer to `maxConcurrency` active `Task`s.

`IEnumerator<IPrefetcher>` Disposed
Small nit, but the old code doesn't dispose the `IEnumerator<IPrefetcher>`. While unlikely, this can put more load on the finalizer thread or potentially leak resources.

Outline
- `maxConcurrency == 0` just returns
- `maxConcurrency == 1` is just a `foreach`
- `maxConcurrency <= BatchSize` is more complicated:
  - `BatchSize` `IPrefetcher`s are loaded into a rented array
  - `Task`s are then started for each of those `IPrefetcher`s
  - `Task`s grab and start the next `IPrefetcher` from the `IEnumerator<IPrefetcher>` when they finish with their last one
  - Each `Task` is then awaited in order
- `maxConcurrency > BatchSize` reuses a lot of the above case, but is still more complicated:
  - `BatchSize` `IPrefetcher`s are loaded and started as above
  - `Task`s grab and start the next `IPrefetcher` when they finish with one
  - We keep starting more `IPrefetcher`s (up to `maxConcurrency`) while there are active `Task`s
  - Started `Task`s are tracked in an `object[]`, which is awaited in turn once `maxConcurrency` is reached (or the `IEnumerator<T>` finishes)

We distinguish between the two `maxConcurrency > 1` cases to avoid allocating very large arrays, and to make sure we start some prefetches fairly promptly even when `maxConcurrency` is very large. `BatchSize` is, somewhat arbitrarily, `512` - any value `> 1` and `< 8,192` would be valid.
would be valid.Type of change
Sort of a bug I guess? Current code allocates a lot more than you'd expect.
Benchmarking
Standard caveats about micro-benchmarking apply, but I did some benchmarking to see how this stacks up versus the old code.
TL;DR - across the board improvements in allocations, with no wall clock regressions in what I believe is the common case. There are some narrow, less common cases, where small wall clock regressions are observed.
I consider the primary case here to be when the `IPrefetcher` actually goes async and takes some non-trivial time to do its work. My expectation is that the two versions of the code should have about the same wall clock time when `# tasks > maxConcurrency`, with the new code edging out the old as `# tasks` increases.

That said, I did also test the synchronous completion case, and the "goes async, but then completes immediately" case, to make sure performance wasn't terrible.
In all cases I expect the new code to perform fewer allocations than the old.
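For concreteness, the three benchmark `IPrefetcher` bodies discussed below (delay, synchronous completion, bare yield) might look roughly like this - the interface shape shown here is an assumption for illustration, since the real `IPrefetcher` is internal to the SDK:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Assumed shape of the internal interface, for illustration only.
interface IPrefetcher
{
    ValueTask PrefetchAsync(CancellationToken token);
}

// Primary case: goes async and takes non-trivial time.
class DelayPrefetcher : IPrefetcher
{
    public async ValueTask PrefetchAsync(CancellationToken token)
        => await Task.Delay(1, token);
}

// "All synchronous completions" case: never goes async.
class SynchronousPrefetcher : IPrefetcher
{
    public ValueTask PrefetchAsync(CancellationToken token) => default;
}

// "Goes async, but completes immediately" case: forces the async
// completion machinery to run without doing any real work.
class YieldPrefetcher : IPrefetcher
{
    public async ValueTask PrefetchAsync(CancellationToken token)
        => await Task.Yield();
}

class Demo
{
    static async Task Main()
    {
        CancellationToken ct = CancellationToken.None;
        await new DelayPrefetcher().PrefetchAsync(ct);
        await new SynchronousPrefetcher().PrefetchAsync(ct);
        await new YieldPrefetcher().PrefetchAsync(ct);
        Console.WriteLine("ok");
    }
}
```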
Summarizing some results (the full data is in the previous link)...
Here's old vs. new on .NET 6, where the `IPrefetcher` is just an `await Task.Delay(1)` (`< 1` is an improvement):

As expected, wall clock time is basically unaffected (the delay dominates) but allocations are improved across the board. The benefits of the improved replacement-`Task` starting logic are visible at the very extreme ends of max concurrency and prefetcher counts.

Again, but this time the `IPrefetcher` just `return default;`s, so everything completes synchronously:

We see here that between 2 and 8 tasks there are configurations with wall clock regressions. I could try and improve that, but I believe "all synchronous completions" is fantastically rare, so it's not worth the extra code complications.
And finally, the `IPrefetcher` is just `await Task.Yield();`, so everything completes almost immediately but forces all the async completion machinery to run:

Similarly, between 4 and 8 tasks there are some wall clock regressions. While more realistic than the "all synchronous" case, I think this would still be pretty rare - most `IPrefetcher`s should be doing real work after some asynchronous operation.

Since we target netstandard, I also benchmarked under a Framework version (4.6.2) and the results are basically the same: