
Change: [Linkgraph] Allow job threads to be aborted early when clearing schedule #8416

Merged · 1 commit · Dec 24, 2020

Conversation

@JGRennison (Contributor) commented Dec 22, 2020

Motivation / Problem

Long-running link graph jobs in separate threads continue running until completion even when the game state is cleared, i.e. when abandoning the game, loading a replacement game, or exiting the game.

This causes the game and user interface to be unnecessarily blocked while joining the job thread, even though the result will be discarded.

Description

When link graph jobs are cleared due to abandoning the game or exiting, flag the job as aborted.

The link graph job running in a separate thread checks the aborted flag periodically and terminates processing early if set.

This reduces the delay at game abandon or exit if a long-running job would otherwise still be running.
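The cooperative-abort pattern described above can be sketched roughly as follows. This is an illustrative standalone sketch, not the actual OpenTTD code; the class and member names (`JobSketch`, `Abort`, `PassesDone`) are made up for the example.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

/* Hypothetical sketch of a cooperatively abortable job thread;
 * names do not match the real OpenTTD sources. */
class JobSketch {
public:
	void Spawn()
	{
		this->thread = std::thread([this] { this->Run(); });
	}

	/* Called from the main thread when the game state is cleared. */
	void Abort()
	{
		this->aborted.store(true, std::memory_order_relaxed);
	}

	void Join()
	{
		if (this->thread.joinable()) this->thread.join();
	}

	int PassesDone() const { return this->passes_done.load(); }

private:
	void Run()
	{
		for (int pass = 0; pass < 1000; ++pass) {
			/* Periodic check: terminate early, the result will be discarded anyway. */
			if (this->aborted.load(std::memory_order_relaxed)) return;
			/* ... one unit of expensive link graph work, simulated by a short sleep ... */
			std::this_thread::sleep_for(std::chrono::milliseconds(1));
			this->passes_done.fetch_add(1);
		}
	}

	std::atomic<bool> aborted{false};
	std::atomic<int> passes_done{0};
	std::thread thread;
};
```

Because the flag is only polled at check points, the join delay is bounded by the time between checks rather than by the remaining runtime of the whole job.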

Limitations

N/A

Checklist for review

PR justification: #7081 (review)

@TrueBrain (Member) commented Dec 24, 2020

I did a quick test of this PR, where I add a CSleep(30000) in the LinkGraphSchedule::Run(), as with the other PR.

This does not work for me as advertised; "Abandon Game", "Exit Game", and "Load Game" still hang for 30 seconds before the game continues.

Looking at the code, that somewhat makes sense, so I moved the CSleep into FlowMapper::Run(). Still, this hangs my client for 30 seconds in all cases.

Am I right to assume this happens because this PR assumes only the MCF can take a lot of time? And not for example the flowmapper? And even for the MCF you assume a single iteration is not the problem, but the amount of iterations it is doing?

To be clear, not saying you should fix my cases, just trying to understand what this does and doesn't fix. Mainly as the PR description suggests a broader scope than the code does ;)
I was also wondering why you don't just terminate the thread, instead of asking the thread to terminate as quickly as it can? (again, honest question, I know very little about this code, so all I can do is ask stupid questions :D)

Tnx!

@JGRennison (Contributor, Author) commented Dec 24, 2020

A single large sleep is by nature not interruptible by this sort of mechanism.

I was also wondering why you don't just terminate the thread, instead of asking the thread to terminate as quickly as it can? (again, honest question, I know very little about this code, so all I can do is ask stupid questions :D)

This is highly unportable and has a high risk of putting the whole process in a bad state.

For example, if the thread is holding a lock or some other resource when you send it a kill signal, that lock could be left locked, permanently wedging the whole process. You would also have a high risk of the LinkGraphJob destructor exploding when you try to clean up from the main thread side, unless you decided to leak it on purpose. There are ways around this, such as pthread_cleanup, but they are extremely fragile. I would not recommend such measures.
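The lock hazard described above is the key reason cooperative cancellation is preferred. A minimal sketch (hypothetical names, not OpenTTD code): because the worker returns normally when it sees the abort flag, the `std::lock_guard` destructor runs and releases the mutex. A thread forcibly killed mid-execution would never run that destructor, leaving the mutex locked forever.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

/* Illustrative sketch only. With cooperative abort, a normal return lets
 * RAII release all held resources; an asynchronous kill would not. */
static std::mutex shared_state_lock;
static std::atomic<bool> abort_requested{false};

void Worker()
{
	std::lock_guard<std::mutex> lock(shared_state_lock);
	while (!abort_requested.load(std::memory_order_relaxed)) {
		/* ... work on the shared state ... */
		std::this_thread::yield();
	}
	/* Normal return: 'lock' releases shared_state_lock here. */
}
```

After setting `abort_requested` and joining the thread, the main thread can still acquire `shared_state_lock`; had the worker been killed while holding it, every later attempt to take the lock would deadlock.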

Am I right to assume this happens because this PR assumes only the MCF can take a lot of time? And not for example the flowmapper? And even for the MCF you assume a single iteration is not the problem, but the amount of iterations it is doing?

The first MCF pass is by far the most expensive phase.
There is a trade-off between the number of abort checks and the delay before the next abort check is reached.
In general there is no "correct" number of abort check points, but the checks as included gave acceptable latency in my own testing, such that more fine-grained checking was unnecessary.
The testing was done several years ago, but the link graph has not meaningfully changed since then.

@TrueBrain (Member) left a comment

Perfectly clear explanation :D Thank you!
For future reference, that would have been perfect to mention in "Limitations" 😄 But I fully understand it now, so this seems more than fine by me!

tldr; doesn't solve all cases, but it does solve the most important ones. If we ever find others that are lagging, we can deal with those when they become apparent :D

@michicc michicc merged commit 94d629d into OpenTTD:master Dec 24, 2020
9 checks passed