
Performance: Refactors query prefetch mechanism #4361

Conversation

@kevin-montrose (Contributor) commented Mar 20, 2024

Description

Reworks ParallelPrefetch.PrefetchInParallelAsync to reduce allocations.

This came out of profiling an application and discovering that this method was allocating approximately as many bytes of Task[] as the whole application was creating in byte[] for IO. This is because Task.WhenAny(...) is (a) used in a loop and (b) makes a defensive copy of the passed Tasks.
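To make the problem concrete, here is a minimal sketch of the kind of loop being replaced. It is not the SDK's actual code: IPrefetcher is simplified to a hypothetical parameterless PrefetchAsync(), and the real interface and signatures differ.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Simplified stand-in for the SDK's IPrefetcher (the real interface takes
// additional parameters such as a trace and a cancellation token).
interface IPrefetcher
{
    Task PrefetchAsync();
}

static class OldStylePrefetchSketch
{
    public static async Task PrefetchInParallelAsync(IEnumerable<IPrefetcher> prefetchers, int maxConcurrency)
    {
        // Note: the enumerator is never disposed, matching the old code.
        IEnumerator<IPrefetcher> enumerator = prefetchers.GetEnumerator();
        List<Task> running = new List<Task>();

        // Start up to maxConcurrency prefetches.
        while (running.Count < maxConcurrency && enumerator.MoveNext())
        {
            running.Add(enumerator.Current.PrefetchAsync());
        }

        while (running.Count > 0)
        {
            // Task.WhenAny copies its argument into a fresh defensive Task[]
            // on every call - one array allocation per completed prefetch.
            Task completed = await Task.WhenAny(running);
            running.Remove(completed);
            await completed; // surface any exception

            // At most one replacement is started per pass through the loop.
            if (enumerator.MoveNext())
            {
                running.Add(enumerator.Current.PrefetchAsync());
            }
        }
    }
}
```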

This version is substantially more complicated, and accordingly there are a lot of tests in this PR (code coverage is 100% of lines and blocks). Special attention was paid to exception and cancellation cases.

Improvements

Greatly Reduced Allocations

In my benchmarking, allocations were reduced anywhere from 30% to 99%, depending on the total number of IPrefetchers used.

More benchmarking discussion is at the bottom of this PR.

Special Casing For maxConcurrency

When maxConcurrency == 0 we do no work, more efficiently than the current code.
When maxConcurrency == 1 we devolve to a plain foreach, which is just about ideal. A sketch of both fast paths follows.
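A hedged sketch of these two fast paths, reusing the simplified IPrefetcher from the earlier sketch (names are illustrative, not the PR's exact code):

```csharp
public static async Task PrefetchInParallelAsync(IEnumerable<IPrefetcher> prefetchers, int maxConcurrency)
{
    if (maxConcurrency == 0)
    {
        return; // nothing may run concurrently, so nothing runs: no enumerator, no tasks
    }

    if (maxConcurrency == 1)
    {
        // Strictly sequential: no task tracking, no arrays, one await at a time.
        foreach (IPrefetcher prefetcher in prefetchers)
        {
            await prefetcher.PrefetchAsync();
        }
        return;
    }

    // maxConcurrency > 1 falls through to the batched paths outlined below.
}
```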

Special Casing When Only 1 IPrefetcher

We accept an IEnumerable<IPrefetcher>, but when it will only ever yield one IPrefetcher, a lot of the work (even in the old code) is pointless. The new code detects this case (generically; it doesn't look for specific types) and devolves into a single await, as sketched below.
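Continuing the same sketch, the detection just probes the enumerator for a second item before committing to any concurrency machinery:

```csharp
using IEnumerator<IPrefetcher> enumerator = prefetchers.GetEnumerator();

if (!enumerator.MoveNext())
{
    return; // empty sequence: nothing to prefetch
}

IPrefetcher first = enumerator.Current;

if (!enumerator.MoveNext())
{
    await first.PrefetchAsync(); // exactly one prefetcher: a single await suffices
    return;
}

// Two or more prefetchers: fall through to the concurrent path,
// remembering that `first` and `enumerator.Current` still need to run.
```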

Prompter Starting Of Next Task

Old code starts at most one task per pass through the while loop, so if multiple Tasks are sitting there completed, there's a fair amount of work done before they are all replaced with active Tasks.

New code has the completed Task start its replacement, which should keep us closer to maxConcurrency active Tasks.
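One way to express that idea (a sketch only; the PR's actual coordination of the shared enumerator is more elaborate than the simple lock assumed here):

```csharp
// Each worker pulls its next IPrefetcher as soon as it finishes the current
// one, so a completed slot is refilled immediately instead of waiting for a
// central loop to notice. The shared enumerator is not thread-safe, hence
// the lock.
static async Task WorkerAsync(IEnumerator<IPrefetcher> enumerator, object enumeratorLock)
{
    while (true)
    {
        IPrefetcher next;
        lock (enumeratorLock)
        {
            if (!enumerator.MoveNext())
            {
                return; // no more work; this slot goes idle
            }
            next = enumerator.Current;
        }

        await next.PrefetchAsync(); // then loop around and grab the next one
    }
}
```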

IEnumerator<IPrefetcher> Disposed

Small nit, but the old code doesn't dispose the IEnumerator<IPrefetcher>. While unlikely to cause trouble, this can put more load on the finalizer thread or potentially leak resources.
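The fix is ordinary deterministic disposal, e.g.:

```csharp
IEnumerator<IPrefetcher> enumerator = prefetchers.GetEnumerator();
try
{
    // ... drain the enumerator and run the prefetchers ...
}
finally
{
    enumerator.Dispose(); // previously left to the finalizer, if it ran at all
}
```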

Outline

  • There are 4 paths through the method now
    • maxConcurrency == 0 just returns
    • maxConcurrency == 1 is just a foreach
    • maxConcurrency <= BatchSize is more complicated
      • Up to BatchSize IPrefetchers are loaded into a rented array
      • Tasks are then started for each of those IPrefetchers
        • These Tasks grab and start the next IPrefetcher from the IEnumerator<IPrefetcher> when they finish with their current one
      • Every started Task is then awaited in order
    • maxConcurrency > BatchSize reuses a lot of the above case, but is still more complicated
      • Up to BatchSize IPrefetchers are loaded and started as above
        • Also as above, the Tasks grab and start the next IPrefetcher when they finish with one
      • But we continue trying to start new batches of IPrefetchers (up to maxConcurrency) while there are active Tasks
      • All of this is tracked in a pseudo-linked-list of rented object[], which is awaited in turn once maxConcurrency is reached (or the IEnumerator<IPrefetcher> finishes)

We distinguish between the two maxConcurrency > 1 cases to avoid allocating very large arrays, and to make sure we start some prefetches fairly promptly even when maxConcurrency is very large. BatchSize is, somewhat arbitrarily, 512 - any value > 1 and < 8,192 would be valid.
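Putting those pieces together, here is a rough sketch of the maxConcurrency <= BatchSize path. Everything here is illustrative: the names are invented, the lock stands in for whatever coordination the PR actually uses, and the cancellation and exception-handling subtleties the tests cover are omitted.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Threading.Tasks;

static class BatchedPrefetchSketch
{
    private const int BatchSize = 512;

    public static async Task PrefetchBatchAsync(IEnumerable<IPrefetcher> prefetchers, int maxConcurrency)
    {
        using IEnumerator<IPrefetcher> enumerator = prefetchers.GetEnumerator();
        object enumeratorLock = new object();

        // Rent the task slots instead of allocating a fresh array.
        Task[] rented = ArrayPool<Task>.Shared.Rent(Math.Min(maxConcurrency, BatchSize));
        int started = 0;
        try
        {
            // Start up to maxConcurrency self-refilling workers.
            while (started < maxConcurrency)
            {
                IPrefetcher next;
                lock (enumeratorLock)
                {
                    if (!enumerator.MoveNext())
                    {
                        break;
                    }
                    next = enumerator.Current;
                }
                rented[started++] = WorkerAsync(next, enumerator, enumeratorLock);
            }

            // Await every started task in order; faults surface here.
            for (int i = 0; i < started; i++)
            {
                await rented[i];
            }
        }
        finally
        {
            ArrayPool<Task>.Shared.Return(rented, clearArray: true);
        }
    }

    private static async Task WorkerAsync(IPrefetcher current, IEnumerator<IPrefetcher> enumerator, object enumeratorLock)
    {
        while (true)
        {
            await current.PrefetchAsync();
            lock (enumeratorLock)
            {
                if (!enumerator.MoveNext())
                {
                    return; // enumerator drained; this slot goes idle
                }
                current = enumerator.Current;
            }
        }
    }
}
```

The maxConcurrency > BatchSize path extends this by chaining additional rented batches (the pseudo-linked-list of object[] mentioned above) rather than renting one enormous array.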

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Sort of a bug I guess? Current code allocates a lot more than you'd expect.

Benchmarking

Standard caveats about micro-benchmarking apply, but I did some benchmarking to see how this stacks up versus the old code.

TL;DR - across the board improvements in allocations, with no wall clock regressions in what I believe is the common case. There are some narrow, less common cases, where small wall clock regressions are observed.

I consider the primary case here when the IPrefetcher actually goes async, and takes some non-trivial time to do its work. My expectation is that the two versions of the code should have about the same wall-clock time when # tasks > maxConcurrency, with the new code edging out old as # tasks increases.

That said, I did also test the synchronous completion case, and the "goes async, but then completes immediately" cases to make sure performance wasn't terrible.

In all cases I expect the new code to perform fewer allocations than the old.
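For concreteness, the three synthetic prefetchers described below look roughly like this (hypothetical stand-ins built on the simplified interface from the first sketch, not the gist's exact benchmark code):

```csharp
// Goes genuinely async and takes non-trivial time (the primary case).
class DelayPrefetcher : IPrefetcher
{
    public async Task PrefetchAsync() => await Task.Delay(1);
}

// Completes synchronously; the async machinery never runs.
class SynchronousPrefetcher : IPrefetcher
{
    public Task PrefetchAsync() => Task.CompletedTask;
}

// Forces the async completion machinery but finishes almost immediately.
class YieldPrefetcher : IPrefetcher
{
    public async Task PrefetchAsync() => await Task.Yield();
}
```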

Summarizing some results (the full data is in the previous link)...

Here's old vs new on .NET 6 where the IPrefetcher is just an await Task.Delay(1) (< 1 is an improvement):
[chart: old/new ratios for wall clock and allocations, .NET 6, await Task.Delay(1) prefetcher]
As expected, wall clock time is basically unaffected (the delay dominates) but allocations are improved across the board. The benefits of improved replacement Task starting logic are visible at the very extreme ends of max concurrency and prefetcher counts.

Again, but this time the IPrefetcher just does a return default; so everything completes synchronously:
[chart: old/new ratios, synchronous-completion prefetcher]
We see here that between 2 and 8 tasks there are configurations with wall clock regressions. I could try to improve that, but I believe the "all synchronous completions" case is fantastically rare, so it's not worth the extra code complications.

And finally, IPrefetcher is just await Task.Yield(); so everything completes almost immediately but forces all the async completion machinery to run:
[chart: old/new ratios, await Task.Yield() prefetcher]
Similarly, between 4 and 8 tasks there are some wall clock regressions. While more realistic than the "all synchronous" case, I think this would still be pretty rare - most IPrefetchers should be doing real work after some asynchronous operation.

Since we target netstandard, I also benchmarked under .NET Framework 4.6.2, and the results are basically the same:
[chart: old/new ratios, .NET Framework 4.6.2]

@kevin-montrose (Contributor, Author) commented:

@microsoft-github-policy-service agree company="Microsoft"

@kevin-montrose (Contributor, Author) commented:

> You have nice performance charts for this implementation in the description. If you wrote some code for benchmarking this, it may be good to add it to the Performance Tests project

The benchmarks are off in a gist (also linked in the description), but they take a long time to run (I just ran 'em overnight during development), so I don't think it'd make much sense to check them into anything that runs regularly. Since I didn't intend to run them regularly, I assembled the charts by hand, so there's nothing to save there.

neildsh previously approved these changes Mar 29, 2024
@ealsur (Member) commented Mar 29, 2024

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@ealsur changed the title from "Parallel prefetch rework" to "Performance: Refactors query prefetch mechanism" Mar 29, 2024
ealsur previously approved these changes Mar 29, 2024
@kevin-montrose dismissed stale reviews from neildsh and ealsur via b5d4f07 Mar 29, 2024 20:37
@neildsh (Contributor) commented Mar 30, 2024

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

@sboshra (Contributor) left a comment:

:shipit:

@neildsh added the QUERY and auto-merge labels Apr 1, 2024
@neildsh (Contributor) commented Apr 1, 2024

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

microsoft-github-policy-service bot merged commit e04ce51 into Azure:master Apr 1, 2024
21 checks passed