New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RS5 horrible performance regression in `ScrollConsoleScreenBuffer` API. #279

Open
awson opened this Issue Oct 11, 2018 · 16 comments

Comments

Projects
None yet
6 participants
@awson

awson commented Oct 11, 2018

  • Windows build number: 10.0.17763.1

The description below refers to x64 versions of OS and application.

Repro:
Install Far Manager and set "Height" in "Screen Buffer Size" to 9999 for it. Then try to run anything, e.g. dir in it's command line. There will be almost a second pause before and after dir runs.

The cuplrit is this call of ScrollConsoleScreenBuffer function.

In my particular case the ScrollRectangle is set to {0,0,170,9938}, ClipRectangle is null ptr and DestinationOrigin is set to {0,-1}. This single call lasts almost a second.

Legacy mode does not have this problem.

Also my impression is that this API was already rather slow even before RS5, but not so much horribly slow.

@miniksa miniksa self-assigned this Oct 11, 2018

@miniksa

This comment has been minimized.

Member

miniksa commented Oct 11, 2018

Filed internally as MSFT:19270439 and assigned to myself.

@alabuzhev

This comment has been minimized.

alabuzhev commented Oct 11, 2018

Hi,
I've created a minimal project to reproduce the issue (attached).
It happens only if you scroll the invisible area (i.e. above the window rectangle, see the comments in the code).

On my machine every such call of ScrollConsoleScreenBuffer takes about 7 ms in legacy mode and 450 ms in default mode (65 times slower!).

ScrollBufferBug.zip

@miniksa

This comment has been minimized.

Member

miniksa commented Oct 11, 2018

Thank you for that. It should be very helpful when I get a chance to investigate this bug to have the minimal repro already reduced.

@miniksa

This comment has been minimized.

Member

miniksa commented Oct 12, 2018

OK. I am looking at this today. It looks like there is some excessive heap allocation occurring down this path that is slowing down the operation.

I've made one change already today that brought this down from (on my machine) ~193ms per call to ~82ms per call.

I have a few more call sites to look at. Your minimal repro project is super helpful!

@miniksa

This comment has been minimized.

Member

miniksa commented Oct 15, 2018

Still looking at this, just thought I'd give a little more insight.

~3-7ms = console v1 code

NOTE: The top number is generated with code out of the full Windows build. The Windows build provides extensive optimization including profile guiding to final binaries which is above and beyond what I can tightly loop on my development box. It's not a 100% fair comparison to the below, but it's a good reference point.
NOTE: The below numbers are generated out of Visual Studio 2017 build locally with release optimizations on. I'm getting all sorts of MSVC optimizations, but I'm missing profile guiding and whatever other special sauces the performance teams in Windows provide as a part of our complete nightly build.

~20ms = console v2 from approximately October 2017
~20ms = console v2 from approximately March 2018
~193ms = console v2 from approximately last few months

Now I've been working on some fixes and I've got:
~82ms = console v2 after consolidating a ton of individual heap allocations into a single batch heap allocation
~60ms = console v2 after removing unnecessary interim heap allocations from some copy steps
~40ms = console v2 if I take out the ability for copies to process variable length strings (UTF-8)

Somewhere in these bottom two, I'm starting to get into the territory of the "rep movs" commands being generated in the optimizer which I believe is about as fast as I can get for copying a big block of data from point A to point B.

I want to set expectations here: it is very, very unlikely we'll get back to that 3-7ms number for several reasons.

  1. The v2 console has a lot more state in its buffer than the v1 console to support all the sorts of enhancements we've made. This means more data per cell and more to copy every time that a region is scrolled. This includes colors going up from 8 bits (foreground/background 16 color) to 48 bits (foreground/background full RGB color) and additional flags and fields for supporting things like reflow wrapping. In RS5, we realized we weren't always copying these things with full fidelity when this API call was happening and we corrected it to copy them too.
  2. The v2 console has started to include things that have reduced the locality of the data including restructuring we've done to make progress to supporting UTF-8 which can have variable lengths of text and isn't a simple 1:1 character to cell (or 1:2 for wide) like the v1 console used to assume. We have to spend more time and take some trade-offs here to check and test things to be flexible enough to try to support this. We're trying to optimize for the common case where variable lengths aren't being used, but it will cost us at least something to make progress. In RS5, we changed the cells to start to be able to understand fully flexible N:M relationships between strings and cells for UTF-8.
  3. The v2 console is using structures, containers, and calling patterns that support guarded or checked accesses to memory locations to prevent overruns/underruns and other sorts of exploits. It takes more time to process these and we'll not be going back to any of the more unchecked sorts of buffers present in the legacy console. In RS5, we spent a chunk of work on moving unsafe/unchecked operations to secured operations.

The combination of these cost us a lot of performance in the RS5 release, it appears. But on further reflection on Friday and today, it looks like a more careful implementation of some of these changes can buy us back much of the performance while maintaining the benefits we desired.

My goal here is to spend another day or two looking at this and trying to correct this down to the 20-60ms range (for my PC and environment). While it won't entirely pay back the regressed speed, I feel it will strike a more sensible balance between the additional powers of the new v2 console with what you expect out of the ScrollConsoleScreenBuffer API.

@alabuzhev

This comment has been minimized.

alabuzhev commented Oct 15, 2018

Thank you for this comprehensive explanation.

Maybe performance could be improved further by copying less?

Note that scrolling the whole buffer is still as fast as it's always been, and scrolling even one line less is suddenly much slower.
It feels in case of the whole buffer it all comes down to advancing its pointer and appending a bit of data to the end, otherwise that huge block is copied somewhere.

So, when the scrollable block is (much) larger than the fixed one (as in our case - 9950 lines vs 50) - maybe it would make sense to copy the fixed block somewhere instead, move the large scrollable block with that fast-full-buffer algorithm above and copy the small fixed block back where it should be?

(I have no idea how all that is implemented of course, only speculating)

@miniksa

This comment has been minimized.

Member

miniksa commented Oct 15, 2018

Yes, I will try to look at reducing the total amount copied next.

While it is really bad at copying and the times are magnified, it is easier to see the differences in everything I am trying to make copying itself faster.

Once I've optimized that as far as I can, I'll try to spend a bit of time stepping up a level and reducing the total amount copied. I still don't believe I'll hit 3-7ms, but we'll see. I'll make another update when I know more.

@miniksa

This comment has been minimized.

Member

miniksa commented Oct 17, 2018

OK. I got it down to 40-50ms on its own for copying while maintaining the ability for it to process UTF-8 variable length data and that's about as low as I can go.

Now I've stepped up a level and I'm making an optimization to avoid copying.
Right now I have it as:

  • If your source rectangle spans the entire width of the screen...
  • And if the destination rectangle spans the entire width of the screen...
  • And if the movement delta is only in the Y direction...

Then I go into the backing buffer and I manipulate the pointers to the rows in the buffer to shuffle things around in lieu of copying anything, let the fill step do its business, and go on with life.

On my machine, it gets the given sample code down to 0ms per call (20 calls takes 2-3ms).
Which handily beats the v1 console taking 7ms per call (20 calls takes ~140-150ms).

If you ask for the scroll rectangle to be less than the width of the screen or you start moving things in the X dimension... it's going to go back to the copy path as it has to move contents around. But I think this optimization will seriously help with what you're trying to accomplish using the ScrollConsoleScreenBuffer API.

Please let me know if you think differently or have any further input on this plan.

@alabuzhev

This comment has been minimized.

alabuzhev commented Oct 18, 2018

That sounds awesome. Thank you!

@alabuzhev

This comment has been minimized.

alabuzhev commented Oct 18, 2018

Which handily beats the v1 console

Would the same optimisation be applied to v1?
(this is not a request, just curious)

@zadjii-msft

This comment has been minimized.

Member

zadjii-msft commented Oct 18, 2018

Would the same optimisation be applied to v1?

Nope, the v1 is a generally parked codebase. Outside of security issues, we're not touching it at all. It's our last line of back-compat support. We try not to ever regress anything in the v2 console, but worst-case, the v1 will always act exactly as it always has.

@miniksa

This comment has been minimized.

Member

miniksa commented Oct 22, 2018

Good news!

This got merged today.

It should be out to Insiders in 3-4 weeks.

@tavrez

This comment has been minimized.

tavrez commented Nov 14, 2018

Will we get this fixed on 1809?

@zadjii-msft

This comment has been minimized.

Member

zadjii-msft commented Nov 14, 2018

@tavrez Nope.

@DarthSpock

This comment has been minimized.

DarthSpock commented Nov 15, 2018

@zadjii-msft Due to how much of a difference this makes to utilizing the Windows Console, shouldn't that be justification to submit for a backport of this code change? Or is there a dependency that exists that would take more than trivial effort to remove for backporting?

@miniksa

This comment has been minimized.

Member

miniksa commented Nov 15, 2018

Someone has to prove that this is costing millions to billions of dollars of lost productivity or revenue to outweigh the risks of shipping the fix to hundreds of millions of Windows machines and potentially breaking something.

Our team generally finds it pretty hard to prove that against the developer audience given that they're only a small portion of the total installed market of Windows machines.

Our only backport successes really come from corporations with massive addressable market (like OEMs shipping PCs) who complain that this is fouling up their manufacturing line (or something of that ilk). Otherwise, our management typically says that the risks don't outweigh the benefits.

It's also costly in terms of time, effort, and testing for us to validate a modification to a released OS. We have a mindbogglingly massive amount of automated machinery dedicated to processing and validating the things that we check in while developing the current OS builds. But it's a special costly ask to spin up some to all of those activities to validate backported fixes. We do it all the time for Patch Tuesday, but in those patches, they only pass through the minimum number of fixes required to maximize the restoration of productivity/security/revenue/etc. because every additional fix adds additional complexity and additional risk.

So from our little team working hard to make developers happy, we virtually never make the cut for servicing. We're sorry, but we hope you can understand. It's just the reality of the situation to say "nope" when people ask for a backport. In our team's ideal world, you would all be running the latest console bits everywhere everytime we make a change. But that's just not how it is today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment