Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TownDirectoryWindow::OnPaint() triggers large (>50ms) redraw times #7905

Closed
floodious opened this issue Jan 5, 2020 · 25 comments
Closed

TownDirectoryWindow::OnPaint() triggers large (>50ms) redraw times #7905

floodious opened this issue Jan 5, 2020 · 25 comments

Comments

@floodious
Copy link

@floodious floodious commented Jan 5, 2020

Version of OpenTTD

Revisions to date;

Expected result

OnPaint() should never trigger "heavy lifting" which is not required to repaint the display. It should do the minimal amount of work needed to fill a dirty rect and nothing more.

Actual result

In this case TownDirectoryWindow::OnPaint() triggers a rebuild and resort of the town directory list which should normally be triggered only from UI or a game-time tick based interval timer.

Steps to reproduce

Open the FPS window and observe the sporadic spikes in excess of 30 ms during resort of large town directory lists. Such cases occur where very high populations in very large numbers of towns (>1k) frequently set flags for rebuilds of the list due to changes to towns. These flags then trigger the heavy-lifting rebuild/resort process from OnPaint().

Solution

towngui.cpp
void OnPaint() override
{
// OnPaint should never trigger heavy-lifting directly
//if (this->towns.NeedRebuild()) this->BuildSortTownList();
// redraw the window using the existing list, whether populated or not
this->DrawWidgets();
}

@JGRennison
Copy link
Contributor

@JGRennison JGRennison commented Jan 5, 2020

Your solution as posted breaks the scenario editor, as towns can be removed, resulting in a use-after-free.

NeedRebuild() indicates that the list contents is not valid and cannot be used. This cannot be ignored.
Calling ForceResort()/ForceRebuild() in fewer circumstances, where possible depending on the current sort mode, may be more practical.

Alternatively, sorting of the the list of towns (and various other similar lists) by name can be made significantly less slow by not regenerating the name string(s) on each comparison.

@floodious
Copy link
Author

@floodious floodious commented Jan 5, 2020

The code doing the free would be responsible to ensure any reference of the object was also destroyed. If not, this is another severe pre-existing bug with unhandled dangling references.

@JGRennison
Copy link
Contributor

@JGRennison JGRennison commented Jan 5, 2020

Town::PostDestructor calls InvalidateWindowData(WC_TOWN_DIRECTORY, 0, 0) which calls this->towns.ForceRebuild().
BuildSortTownList() must then be called before the towns list is read again. This is what the this->towns.NeedRebuild() check is for.
This type of mechanism is used throughout the UI.

@floodious
Copy link
Author

@floodious floodious commented Jan 5, 2020

It's flawed. The free must not occur until after all dangling references are cleared up.

@floodious
Copy link
Author

@floodious floodious commented Jan 5, 2020

Often times you see reference counters used in cases like this and the free occurs only upon references == 0.

@LordAro
Copy link
Member

@LordAro LordAro commented Jan 5, 2020

As long as the dangling references are not used after a free, I don't really see an issue.

Regardless, it's a much wider "issue" that would require significant rework of all of the GUI. I suspect that it is not worth it to "fix" generally, but it is certainly not relevant here.

@floodious
Copy link
Author

@floodious floodious commented Jan 5, 2020

It's unrelated to this issue, I agree. Dangling references are a severe bug that will lead to problems with threading in the future however. As far as how to ensure this work-around is handled more appropriately without heavy lifting occurring in the OnPaint() function, I'm not sure.

@JGRennison
Copy link
Contributor

@JGRennison JGRennison commented Jan 5, 2020

As the UI and simulation are both run in the same thread, it doesn't actually matter whether the heavy lifting is done within the OnPaint() call or elsewhere.
Running the UI and simulation in different threads is simply not viable, for a long list of reasons.

@floodious
Copy link
Author

@floodious floodious commented Jan 5, 2020

You're ignoring the core issue and brushing it away by claiming to "solve" other problems such as high processing cost associated with doing actual work...

Please remember, the issue is that work not directly associated with OnPaint() such as cleaning up applying work-arounds to other buggy code that left dangling references in an extremely inefficient way is not appropriate, period.

@floodious
Copy link
Author

@floodious floodious commented Jan 5, 2020

A super hacky kludge workaround (all workarounds/fallthroughs/etc are inherently bug-prone kludges) would be to store a list of deleted towns, or change the flag from "rebuild" to "check_references".

The loop would then not rebuild or resort the list, but rather go through the process of ensuring every entry in the list is valid. This will definitely be at least as efficient as repopulating the list, and will eliminate the sorting overhead.

@floodious
Copy link
Author

@floodious floodious commented Jan 5, 2020

I'm just posting the obvious now... things we should all already know.

The nasty centralized solution is to have the towns object (does one exist?) maintain a list of registered reference holders. Any reference holder would first need to register, then it would have access to the town database.

The database before deleting an entry could iterate the list of registered reference holders and notify each one that the item had been discarded (not deleted, but references must be discarded!) Upon completing this iterative notification the entry itself could be safely deleted.

Note that this is easier than it seems... since the only objects that would need to use this new accessor (town database) would be those who need it. In this case solely the town directory windows could be registered.

As more code is updated and freed of dangling references, bugs and workarounds, the fixed objects could utilize the new system. This would greatly ease eliminating the old system as references to the old accessors could be tracked.

This sounds to me like a half hour job. Only parts at play are Town::PostDestructor and TownDirectoryWindow.

@JGRennison
Copy link
Contributor

@JGRennison JGRennison commented Jan 6, 2020

A super hacky kludge workaround (all workarounds/fallthroughs/etc are inherently bug-prone kludges) would be to store a list of deleted towns, or change the flag from "rebuild" to "check_references".

There isn't any need for such workarounds or kludges.
It is not valid to store a list of freed pointers for comparison as they can be reused by subsequent allocations.
Getting the complete list of town pointers is negligibly cheap, they are already stored in a Pool (i.e. an array).

towns object (does one exist?)

Take a look at the various src/core/pool* files.

Once ForceRebuild() is called the contents of the window's town list is invalid and can't be used for comparison or dereferencing. If you are worried about dangling pointers, you could add a truncate call to this method, fill it with it with poison values, override operator [] to throw, etc.

The expensive/slow parts are name sorting and to a lesser extent name filtering. Have a look with a profiler.

Only parts at play are Town::PostDestructor and TownDirectoryWindow.

Unfortunately not, the remaining parts of Town::PostDestructor are doing the same for NewGRF object instances and town indices in the map array, these are persistent.
There is also more of this sort of stuff in Town::~Town().

This sounds to me like a half hour job

I can guarantee you that it is not a half hour job.

@floodious
Copy link
Author

@floodious floodious commented Jan 6, 2020

Yes I know roughly the structure, but the accessors aren't limited in every case due to the old c-style code. So while there may be a Town object, I'd be a fool to assume there aren't getTownPopulation() globals too.

If I couldn't add a list of window pointers to Town and loop through it on each deleted town while commenting out InvalidateWindowData(WC_TOWN_DIRECTORY, 0, 0) in a half hour, I'd be surprised.

I'd spend most of the time actually handling the removal of a specific index in the TownDirectoryWindow handler, plus whatever dispatcher method I added since there currently is no notify(handle, data) or similar system.

I mean a list of deleted towns like an array. For example you're deleting towns 5, 81 and 213, so you pass TownIndex discarded[] { 5, 81, 213, }; to foreach (reference_holder as h) { h->notifiy(discard, discarded); }

The only aim is to eliminate the possibility you described of a reference to a town becoming invalid in the scenario editor, to eliminate this specific single instance of the if (this->towns.NeedRebuild()) this->BuildSortTownList(); call.

@floodious
Copy link
Author

@floodious floodious commented Jan 6, 2020

The only parts at play from Town:: would be the places where a town is actually deleted. If that has many spaghetti-code instances, they would need to be replaced with a call like Town::delete_town(index) to centralize it. Before actually deleting the town, the notification would need to be passed out to all entries in the reference holder list.

@floodious
Copy link
Author

@floodious floodious commented Jan 6, 2020

The database is TownPool in town.h and it doesn't have a specialization, so there is no town object, only the per-instance town data structure.

So I'd need to create the specialized object by inheriting (perhaps?) TownPool, and fixup all references to that object that were broken by the changes. Then the mechanism for deleting an object from the pool would need to have the notify list and dispatcher added, then the target object TownDirectoryWindow would need to register itself upon creation, and deregister upon deletion.

@floodious
Copy link
Author

@floodious floodious commented Jan 6, 2020

So Pool handles deletion via std::vector::erase and such, and therefore has no ability to dispatch notifications. FreeItem might be used for such, but due to the use of generic templates the notification mechanism would need to be added in a generic way. So both ~Pool and FreeItem would need to be fixed to add proper notification to do so in the best way, which would multiply the complexity and risk associated with such a tiny change exponentially.

This is the opposite of good C++.

@JGRennison
Copy link
Contributor

@JGRennison JGRennison commented Jan 6, 2020

Also, getting back to the excessive resort/redraw times, for reference my earlier solutions are here, I could look into forward-porting them in due course: JGRennison/OpenTTD-patches@f6b9395 / JGRennison/OpenTTD-patches@b91ee6f

@floodious
Copy link
Author

@floodious floodious commented Jan 6, 2020

That doesn't solve the core issue, it just band-aids over it. While optimization and fixes in their own isolated systems are good... the real problem is the spaghetti-code and nasty workarounds resulting from bad coding practice building upon bad coding practice and "that doesn't matter" attitudes where such issues are brushed aside.

@floodious
Copy link
Author

@floodious floodious commented Jan 6, 2020

For example, I know I'm rambling on... but look at it this way.

Description

You have your cost of the heavy lifting operation, "C". We know this happens on a regular schedule according to the time T which we're make a unit by frequency of occurrence.

Now we also have many other processes that get tied in to the cost C because of the linkage with invalidation of the list. The list is not in fact invalid since removing a single item from a pool does not change the ID of every other item... the ID remains the same. Just that single ID which was removed becomes invalidated.

So due to this linkage associated with invalidation we also have the regular process of painting, P. Additional processes that cause changes in towns cause P to sporadically trigger C at some ratio X.

Then we have the real extremely rare UI triggered occurrence the link was established to handle: the user deleting a town which we'll call D and we know is extremely tiny, infinitesimal or even zero in many cases.

Cost analysis

So we get:
T = 1/100 ticks
P = 1 frame
X = probability this tick triggered a change in a town
D = rare one-off instance the user pressed a button to delete a town
total cost = CT + CPX + CD.

T is constant and should happen at an exact ratio to game-time ticks.
P happens (I assume) every frame, which is also fixed to game-time ticks, once every tick.
X is proportionate to the number of towns times size of each town times growth and probability of change... in other words BASED UPON THE STATE OF THE GAME-WORLD.

Do you see the issue? Regardless of the value of C, the user experiences CPX far in excess of CT as they play in larger or older (heavily developed) worlds. The user is directly responsible for CD, which is a UI cost and not relevant or problematic as an issue here.

Conclusion

Your improvements might reduce C, which will improve the latency associated with the user deleting a town. It will also decrease the cost of the regular timed cost CT at interval T. These costs however are a tiny fraction of the costs associated with CPX!

In optimization you need to look at the real impact of your effort. If you're looking at shaving some portion of a 1/100 fraction, your efforts are useless because no matter how good you do you'll never get better than 1% improvement.

You need to focus on the BIG fractions first. Sometimes it isn't possible to eliminate these parts, but they will be where the smallest effort pays off in the biggest way.

You must look at the ratio between the reason for the link CD which is a rare one-off UI interaction, versus the regular sporadic cost CPX that accumulates over time. If we can eliminate CPX, even making CD significantly larger is well worth it when you sum the total time spent over a period of game-play. The user in the scenario editor triggered CD maybe a hundred times. Then while playing the scenario for thousands of hours spread over thousands of users, CPX occurs MILLIONS OF TIMES.

Hint: 1e9 > 100.

@JGRennison
Copy link
Contributor

@JGRennison JGRennison commented Jan 6, 2020

I cannot find the profiling files from when I did this originally unfortunately, but measuring it now, the reduction in C for name and rating sorting on a large map with the above change applied varies between approximately 7.5x and 11x.
This is sufficient that the sorting delay is no longer annoying.

All of these costs only apply when the window is open, which is a very small fraction of total game play time for most players. Cost C additionally occurs once each time the window is opened.

I would assert that the list being sorted and rendered correctly at all times is more important than any performance saving. Otherwise you could just set X to 0.
That said, reducing the effective value of X by means of filtering change events based on the sort mode looks like it ought to be straightforward, I will look into that in due course.

@floodious
Copy link
Author

@floodious floodious commented Jan 6, 2020

I agree the overall optimization makes a huge difference. My point being simply that even given 1/11th, zero is still less. So if it's possible to cut out a large chunk of the total cost entirely, thereby reducing it to zero, that's a very worthwhile aim as well.

My argument is often CPX (CX, since P = 1?) isn't needed at all. Unless I'm severely mistaken here, most instances where the invalidation flag is set that triggers the rebuild or resort are not cases where the list is actually invalid or even requires re-sorting... rather the only example I have of a valid case is the one you gave of deleting a town.

I think glx (?) was checking in to whether those instances were better represented with a dedicated re-sort flag as suggested by Samu in the IRC channel. A re-sort doesn't need to be triggered from the OnPaint() function since the existing list is still perfectly valid. Running here in debug, no assertions are triggered with my suggested removal of the function call, so it would appear either the use-after-free doesn't trigger access violation or the assertion is missing, or it isn't actually a problem to begin with.

That's what I'm looking at... how to eliminate that portion of the cost altogether. You're absolutely right that these situations only come up with the directory windows open and it's certainly likely that they're rarely used... nonetheless having them function without useless overhead would be best. Your optimization of the sort is essential there, but it would in my opinion be a mistake to overlook eliminating redundant or needless costs as well.

@floodious
Copy link
Author

@floodious floodious commented Jan 6, 2020

#7906 mostly fixes this. The issue remains of the work-around in OnPaint() cleaning up dangling references, but this is only important from a broad architectural perspective. An example is where the graphics renderer is threaded in the future, triggering a rebuild/resort of the list will lag the frame rate in the instant when a town is deleted.

The idea of the system to handle registration for reference holders and creating a notification system for messages when deleting an item or clearing a pool should be implemented. This is a variation of the singleton pattern which can be improved and made a member of a "world" object which would own the windows to establish a proper hierarchy. If no other instances of proper application of list invalidation exist, elimination of this single use-case would allow removal of the NeedRebuild() function as well as the flag to prevent bugs in the future from misapplication of these flags.

@floodious floodious closed this Jan 6, 2020
@LordAro
Copy link
Member

@LordAro LordAro commented Jan 6, 2020

An example is where the graphics renderer is threaded in the future

The thing is, this is almost certainly never going to happen. There are far too many things that depend on the GUI updating in lock-step with the game engine. Yes, really. NewGRF animations use the current animation frame to decide on the next. We can't change or fix that.

But, in the mythical case where someone is creating a threaded graphics renderer, every single bit of graphical update code would need to be touched, and then a solution to this single OnPaint issue would be found.

Arguing about architectural issues is all well and good, but when the issue is never going to happen it's all just a pointless waste of time. It's like making sure there's space to fit an ejector seat in a car just in case someone decides to fit wings to it and make it a plane one day.

@floodious
Copy link
Author

@floodious floodious commented Jan 6, 2020

Given the minimal effort required to make the system function correctly by cleaning up after itself and not leaving dangling references (solely when the user deletes a town!), the only consideration is the overhead of checking whether the std::vector of notification relays is empty. In cases where it's empty this should be similar to checking a bool if (false) { ; }

In other cases it would clean up the code and put it in proper working order, allowing removal of the hack NeedRebuild() function, and putting the code on a path to being correctly implemented.

The single biggest reason such a thing will never happen is if those involved in the project are never willing to take such a simple first step toward it.

@floodious
Copy link
Author

@floodious floodious commented Jan 6, 2020

I acknowledge that such aims are strongly resisted in projects where there is fear of breaking something due to existing flaws and interdependencies (spaghetti-code) in the code. I am not suggesting such a change be made for the next release. The critical bug has already been fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.