Notify connection disconnect during error handling of new connection … #7463
Notify connection disconnect during error handling of new connection from a
worker when there is already a known (but actually disconnected) connection to the same worker.
Additionally, it looks like under this circumstance notifyDisconnect may be called a second time, when the connection is actually closed by the network, so this PR replaces the delivered callback subscriptions with an empty subscription object.
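The swap described above can be sketched in a few lines. This is a hypothetical illustration, not buildbot's actual code: `FakeConn` and `Subscriptions` are toy stand-ins for the real connection and subscription objects. The key idea is that the subscription list is replaced with a fresh, empty object before the callbacks fire, so a second notifyDisconnect (e.g. when the socket actually closes later) finds nothing left to deliver.

```python
class Subscriptions:
    """Minimal stand-in for a list of disconnect-notification callbacks."""
    def __init__(self):
        self.callbacks = []

    def subscribe(self, cb):
        self.callbacks.append(cb)

    def deliver(self):
        for cb in self.callbacks:
            cb()


class FakeConn:
    """Toy connection object; not the real buildbot Connection class."""
    def __init__(self):
        self._disconnect_subs = Subscriptions()

    def notify_on_disconnect(self, cb):
        self._disconnect_subs.subscribe(cb)

    def notify_disconnected(self):
        subs = self._disconnect_subs
        # Replace the delivered subscriptions with an empty object BEFORE
        # firing, so a later second call (the actual network-level close)
        # delivers nothing instead of notifying disconnect twice.
        self._disconnect_subs = Subscriptions()
        subs.deliver()


if __name__ == "__main__":
    fired = []
    conn = FakeConn()
    conn.notify_on_disconnect(lambda: fired.append("disconnected"))
    conn.notify_disconnected()  # error-handling path for the stale connection
    conn.notify_disconnected()  # actual network close: now a no-op
    print(fired)  # ['disconnected']
```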
As a just-in-case measure, the KeepAlive call also skips sending a KeepAlive if the Conn is no longer associated with a Worker, or if the Worker is connected through a different Conn object.
This would otherwise lead to a state where the worker "knows" it is "connected" while the manager equally "knows" the worker is "offline" (because it is not attached to any builders), all while the low-level connection maintenance kept sending and receiving keep-alives, thereby keeping the unattached connection open. (#4678)
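The keep-alive guard can be sketched as follows. Again, `Worker` and `Conn` here are illustrative toy classes, not buildbot's: the point is only that a Conn sends a keep-alive solely when it is still the connection the worker considers active, so a stale Conn stops keeping itself alive and the dead connection is allowed to expire.

```python
class Worker:
    """Toy worker holding a reference to its currently active connection."""
    def __init__(self):
        self.conn = None


class Conn:
    """Toy connection that guards its own keep-alives."""
    def __init__(self, worker):
        self.worker = worker
        self.sent = 0

    def maybe_send_keepalive(self):
        # Skip the keep-alive if this Conn is no longer associated with a
        # Worker, or the Worker is now connected through a different Conn.
        if self.worker is None or self.worker.conn is not self:
            return False
        self.sent += 1  # stand-in for actually sending the keep-alive
        return True


if __name__ == "__main__":
    w = Worker()
    old = Conn(w)
    w.conn = old
    print(old.maybe_send_keepalive())  # True: old is the active connection
    new = Conn(w)
    w.conn = new  # worker reconnects through a new Conn
    print(old.maybe_send_keepalive())  # False: stale Conn stays silent
    print(new.maybe_send_keepalive())  # True: new Conn keeps itself alive
```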
Additional info:
This patch may not be everything needed to fix the underlying issue.
We currently run the newConnection part as a HotPatch in our system (without the pb.py patch, which would require patching the main buildbot package), and as far as we can tell that did not work in the one recent post-patch case where we had a network hiccup between a group of workers and the manager and could observe the recovery process in real time. The reconnection succeeded: the old connection was shut down, and the new one attached and went green. However, 10 minutes later the new connections were detached as another connection apparently showed up, with the result that the workers were "offline" from the manager's point of view but online from the workers' point of view. Recovering from this situation means either restarting the worker buildbot process on each affected worker, or restarting the manager process. It is possible that we have had similar hiccups since the patch was added, especially considering how frequently the problem showed up before the patch was applied; the hiccup mentioned above involved several minutes of downtime, not just the couple of seconds of the "normal" hiccups.
I suspect that a "belt-and-suspenders" approach would be for the keep-alive system to verify that the currently active connection to a worker is attached to ALL builders it should be attached to before sending the keep-alive, and if necessary to re-attach or shut down the connection so that the worker can reconnect properly.
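The proposed check could look something like this. The helper below is hypothetical (it is not part of this PR or of buildbot); it only shows the decision the keep-alive loop would make given the set of builders a connection should serve versus the set it is actually attached to.

```python
def keepalive_action(expected_builders, attached_builders):
    """Decide what the keep-alive loop should do for this connection.

    expected_builders: builders this worker's connection should serve.
    attached_builders: builders the connection is actually attached to.
    """
    missing = set(expected_builders) - set(attached_builders)
    if not missing:
        return "send-keepalive"
    # A partially attached connection is the inconsistent state described
    # above: the worker thinks it is online while the manager, seeing no
    # attached builders, treats it as offline.
    return "reattach-or-disconnect"


if __name__ == "__main__":
    print(keepalive_action({"b1", "b2"}, {"b1", "b2"}))  # send-keepalive
    print(keepalive_action({"b1", "b2"}, {"b1"}))        # reattach-or-disconnect
```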
This problem has been nearly impossible to reproduce in unit tests; the only way I was able to reproduce it was by inserting an assert on values in AbstractWorker.detached(). I was unable to create a relevant test case using the PB connection due to some kind of issue related to the Twisted reactor system, and the null connection did not actually exhibit the problem the test was supposed to recreate, while the PB connection did.
The tests used were something like this, based on the latent_worker tests:
Contributor Checklist:
- Release note added in the newsfragments directory (and read the README.txt in that directory)