Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Old PQS coroutines are never stopped, causing perf degradation over time #85

Closed
gotmachine opened this issue Aug 21, 2022 · 4 comments
Closed
Labels
kspBug Identified KSP issue kspPerformance Possible performance improvement in KSP

Comments

@gotmachine
Copy link
Contributor

Quoting @Gameslinx :

I found a fairly big problem yesterday relating to the PQS.UpdateSphere() coroutine starting without stopping the old one.
There should be at maximum two instances of the coroutine running at once. One for the ocean, and one for the terrain
However, whenever the PQS is restarted or set active, the coroutine is started again but if there was one running before this, it isn't stopped
So as an example, going from the Space Center to Kerbin will spawn an additional 2, making 4 of them running at once
Quicksaving and quickloading on the runway spawns 2 more each time, so I managed to rack up 14 of them running at once
image

Likely cause :

PQS.ResetSphere() is calling StopCoroutine("UpdateSphere").
This is the overload that takes a string arg for a named coroutine, but that only work if the coroutine was started using the matching string overload, which isn't the case, the coroutine is started in PQS.StartSphere() with the IEnumerator arg overload : StartCoroutine(UpdateSphere())

@gotmachine gotmachine added kspBug Identified KSP issue kspPerformance Possible performance improvement in KSP labels Aug 21, 2022
@gotmachine
Copy link
Contributor Author

gotmachine commented Aug 21, 2022

So, this is quite difficult to fix, but with some messing around it's at least possible to avoid all PQS.UpdateSphere() coroutine duplication cases. Eliminating non-duplicated but unneeded coroutines that are still running on other bodies than the main body is a lot more problematic, but they are eventually collected on scene switches.

Main fix involves changing the StartCoroutine() calls from the IEnumerator overload to the string overload in PQS.StartSphere() and PQS.ResetAndWait() methods. This allow the stock StopCoroutine() calls in ResetSphere() and ForceStart() to actually work, and allow us to place additional StopCoroutine() calls.

However, simply fixing the stock bug isn't enough because the stock PQS lifecycle isn't designed at all to ensure only one coroutine exists at a given time. Notably (but not limited to), it seems there is no guaranteed code path at all that could clean up the old coroutines when the PQS is restarted while it was already the active PQS. This notably mean that on scene switches, if the main body doesn't change, a new coroutine is created without stopping the old one. Trying to aggressively kill the coroutines (either from within the coroutine, or externally) when a PQS is not the main body anymore occasionally result in killing the actually needed coroutine. Such solutions seems too risky, there are too many corner cases and delays / order shenanigans in the PQS execution flow.

A poor man alternative is to always call StopCoroutine() prior to the StartCoroutine() call. This ensure there is no duplication happening, it cover the vast majority of cases and should be quite safe. There are still some occasional cases of the previous main body PQS coroutine being kept alive after leaving its SOI, but those will be eventually cleaned up on the next scene switch and PQS.UpdateSphere() has the good taste to skip a good chunk of the processing when the PQS isn't active, so the perf impact is relatively limited.

gotmachine added a commit that referenced this issue Aug 21, 2022
- New performance / KSP bugfix patch : [PQSCoroutineLeak](#85) (issue discovered by @Gameslinx)
- Fixed the ToolbarShowHide patch partially failing due to an ambigious match exception when searching the no args `ApplicationLauncher.ShouldItHide()` method overload.
@JonnyOThan
Copy link
Contributor

Hmm, could you inject code into the coroutine itself that detects when it's not needed anymore and ends it?

@gotmachine
Copy link
Contributor Author

gotmachine commented Aug 22, 2022

That's what I meant with "Trying to aggressively kill the coroutines (either from within the coroutine, or externally) when a PQS is not the main body anymore occasionally result in killing the actually needed coroutine."

I haven't found a reliable way to "detect when it's not needed anymore". If I try to self-kill them when the PQS isn't supposed to be active anymore, there are a bunch of corner cases where if the PQS is reactivated latter, it doesn't go through a code path spawning a new coroutine. I don't remember everything I tried, but trying to self-kill the coroutine also face the issue that the coroutine is spawned before the PQS stuff is fully initialized, so you can also end up self-killing the newly spawned coroutine.

Everything I tried didn't work consistently so I gave up because without investigating too long, I have a feeling this would be a pretty deep rabbit hole.

In fact, I think that the main bug causing those coroutines to never be stopped has been around since forever without any functional side effect (although there are definitely occasional "PQS crash, restarting" entries popping up in the logs), so this was hiding the fact that there is a hole in the PQS lifecycle management . Maybe it's possible to patch that hole but I'm not willing to put my hands in this mess.

The committed patch takes care of the "leak" with minimal intervention, it solves the main problem of unrecoverable perf degradation over time. With the patch, worst case scenario (visiting multiple SOIs without a scene switch or reload in between) you have an extra update function consuming ~0.3 ms lying around. And this is not systematic, I'm not even sure it's reproducible without aggressively teleporting from one SOI to another like I was doing for testing.

@NathanKell
Copy link
Contributor

This has shipped.

gotmachine added a commit that referenced this issue Jan 30, 2023
- New performance / KSP bugfix patch : [PQSCoroutineLeak](#85) (issue discovered by @Gameslinx)
- Fixed the ToolbarShowHide patch partially failing due to an ambigious match exception when searching the no args `ApplicationLauncher.ShouldItHide()` method overload.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kspBug Identified KSP issue kspPerformance Possible performance improvement in KSP
Development

No branches or pull requests

3 participants