Notify connection disconnect during error handling of new connection … #7463
Notify connection disconnect during error handling of new connection from a
worker when there is already a known (but actually disconnected) connection to the same worker.
Additionally, it looks like under this circumstance notifyDisconnect may be called a second time, when the connection is actually closed by the network, so this PR replaces the delivered callback subscriptions with an empty subscription object.
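The swap described above can be sketched in a few lines. This is a hypothetical illustration, not buildbot's actual code: `FakeConn` and `Subscriptions` are toy stand-ins for the real connection and subscription objects. The key idea is that the subscription list is replaced with a fresh, empty object before the callbacks fire, so a second notifyDisconnect (e.g. when the socket actually closes later) finds nothing left to deliver.

```python
class Subscriptions:
    """Minimal stand-in for a list of disconnect-notification callbacks."""
    def __init__(self):
        self.callbacks = []

    def subscribe(self, cb):
        self.callbacks.append(cb)

    def deliver(self):
        for cb in self.callbacks:
            cb()


class FakeConn:
    """Toy connection object; not the real buildbot Connection class."""
    def __init__(self):
        self._disconnect_subs = Subscriptions()

    def notify_on_disconnect(self, cb):
        self._disconnect_subs.subscribe(cb)

    def notify_disconnected(self):
        subs = self._disconnect_subs
        # Replace the delivered subscriptions with an empty object BEFORE
        # firing, so a later second call (the actual network-level close)
        # delivers nothing instead of notifying disconnect twice.
        self._disconnect_subs = Subscriptions()
        subs.deliver()


if __name__ == "__main__":
    fired = []
    conn = FakeConn()
    conn.notify_on_disconnect(lambda: fired.append("disconnected"))
    conn.notify_disconnected()  # error-handling path for the stale connection
    conn.notify_disconnected()  # actual network close: now a no-op
    print(fired)  # ['disconnected']
```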
As a just-in-case measure, the KeepAlive call also skips sending a KeepAlive if the Conn is no longer associated with a Worker, or if the Worker is connected through a different Conn object.
This would otherwise lead to a state where the worker "knows" it is "connected" while the manager equally "knows" the worker is "offline" (because it is not attached to any builders), all while the low-level connection maintenance kept sending and receiving keep-alives, thereby keeping the unattached connection open. (#4678)
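The keep-alive guard can be sketched as follows. Again, `Worker` and `Conn` here are illustrative toy classes, not buildbot's: the point is only that a Conn sends a keep-alive solely when it is still the connection the worker considers active, so a stale Conn stops keeping itself alive and the dead connection is allowed to expire.

```python
class Worker:
    """Toy worker holding a reference to its currently active connection."""
    def __init__(self):
        self.conn = None


class Conn:
    """Toy connection that guards its own keep-alives."""
    def __init__(self, worker):
        self.worker = worker
        self.sent = 0

    def maybe_send_keepalive(self):
        # Skip the keep-alive if this Conn is no longer associated with a
        # Worker, or the Worker is now connected through a different Conn.
        if self.worker is None or self.worker.conn is not self:
            return False
        self.sent += 1  # stand-in for actually sending the keep-alive
        return True


if __name__ == "__main__":
    w = Worker()
    old = Conn(w)
    w.conn = old
    print(old.maybe_send_keepalive())  # True: old is the active connection
    new = Conn(w)
    w.conn = new  # worker reconnects through a new Conn
    print(old.maybe_send_keepalive())  # False: stale Conn stays silent
    print(new.maybe_send_keepalive())  # True: new Conn keeps itself alive
```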
Additional info:
This patch may not be everything needed to fix the underlying issue.
We currently run the newConnection part as a HotPatch in our system (without the pb.py patch, which would require patching the main buildbot package), and as far as we can tell that did not work in the one recent post-patch case where we had a network hiccup between a group of workers and the manager and could observe the recovery process in real time. The reconnection succeeded: the old connection was shut down, and the new one attached and went green. However, 10 minutes later the new connections were detached as another connection apparently showed up, with the result that the workers were "offline" from the manager's point of view but online from the workers' point of view. Recovering from this situation means either restarting the worker buildbot process on each affected worker, or restarting the manager process. It is possible that we have had similar hiccups since the patch was added, especially considering how frequently the problem showed up before the patch was applied; the hiccup mentioned above involved several minutes of downtime, not just the couple of seconds of the "normal" hiccups.
I suspect that a "belt-and-suspenders" approach would be for the keep-alive system to verify that the currently active connection to a worker is attached to ALL builders it should be attached to before sending the keep-alive, and if necessary to re-attach or shut down the connection so that the worker can reconnect properly.
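The proposed check could look something like this. The helper below is hypothetical (it is not part of this PR or of buildbot); it only shows the decision the keep-alive loop would make given the set of builders a connection should serve versus the set it is actually attached to.

```python
def keepalive_action(expected_builders, attached_builders):
    """Decide what the keep-alive loop should do for this connection.

    expected_builders: builders this worker's connection should serve.
    attached_builders: builders the connection is actually attached to.
    """
    missing = set(expected_builders) - set(attached_builders)
    if not missing:
        return "send-keepalive"
    # A partially attached connection is the inconsistent state described
    # above: the worker thinks it is online while the manager, seeing no
    # attached builders, treats it as offline.
    return "reattach-or-disconnect"


if __name__ == "__main__":
    print(keepalive_action({"b1", "b2"}, {"b1", "b2"}))  # send-keepalive
    print(keepalive_action({"b1", "b2"}, {"b1"}))        # reattach-or-disconnect
```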
This problem has been nearly impossible to reproduce in unit tests; the only way I was able to reproduce it was by inserting an assert on values in AbstractWorker.detached(). I was unable to create a relevant test case using the PB connection due to some kind of issue related to the Twisted reactor system, and the null connection did not actually exhibit the problem the test was supposed to recreate, while the PB connection did.
The tests used were something like this, based on the latent_worker tests:
Contributor Checklist:
- Release note added in the newsfragments directory (and read the README.txt in that directory)