Notify connection disconnect during error handling of new connection … #7463

Open

YngveNPettersen wants to merge 3 commits into master from yngve/handle-new-connection-20240302
Conversation

@YngveNPettersen (Contributor) commented Mar 2, 2024

Notify connection disconnect during error handling of a new connection from a worker when there is already a known (but actually disconnected) connection to the same worker.

Additionally, it looks like notifyDisconnect may be called a second time under this circumstance, when the connection is actually closed by the network, so this PR also replaces the already-delivered callback subscriptions with an empty subscription object.
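For illustration, here is a minimal sketch of that idea, assuming buildbot's SubscriptionPoint from buildbot.util.subscription; this is not the exact diff in the PR:

from buildbot.util import subscription

class Connection:
    def __init__(self, name):
        self.name = name
        self._disconnectSubs = subscription.SubscriptionPoint(f"disconnections from {name}")

    def notifyOnDisconnect(self, cb):
        return self._disconnectSubs.subscribe(cb)

    def notifyDisconnected(self):
        # Swap in a fresh subscription point before delivering, so a second
        # notifyDisconnected() finds an empty subscription object and cannot
        # re-fire the old callbacks against the newer connection's state.
        delivered = self._disconnectSubs
        self._disconnectSubs = subscription.SubscriptionPoint(f"disconnections from {self.name}")
        delivered.deliver()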

As a just-in-case measure, the KeepAlive call also skips sending a keepalive if the Conn is no longer associated with a Worker, or if the Worker is connected through a different Conn object.
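A minimal sketch of that guard; the method and attribute names here are assumptions, not the exact pb.py change:

def doKeepalive(self):
    # Skip the keepalive when this Conn has been detached from its Worker,
    # or when the Worker is now served by a different Conn object.
    worker = getattr(self, 'worker', None)
    if worker is None or worker.conn is not self:
        return None
    return self.mind.callRemote('print', message="keepalive")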

Without this, the worker could end up "knowing" it was "connected" while the manager equally "knew" that the worker was "offline" (because it was not attached to any builders), all the while the low-level connection maintenance functionality kept sending and receiving keep-alives, thus keeping the unattached connection open. (#4678)

Additional info:

This patch may not be everything needed to fix the underlying issue.

We currently have the newConnection part as a hot patch in our system (without the pb.py change, which would require patching the main buildbot package), and as far as we can tell it did not help in the one recent post-patch case where we had a network hiccup between a group of workers and the manager and could observe the recovery process in real time. The reconnection succeeded and the old connection was shut down; the new one attached and went green. However, 10 minutes later the new connections were detached, apparently because yet another new connection showed up, with the result that the workers were "offline" from the manager's point of view, but online according to the workers. Recovering from this situation means either restarting the worker buildbot process on each affected worker, or restarting the manager process. It is possible that we have had similar hiccups since the patch was added, especially considering how frequently the problem showed up before the patch was applied; the hiccup mentioned here involved several minutes of downtime, not just the couple of seconds of the "normal" hiccups.

I suspect that a "belt-and-suspenders" approach would be for the keep-alive system to verify that the currently active connection to a worker is attached to ALL the builders it should be attached to before sending the keep-alive, and, if necessary, to re-attach or shut down the connection so that the worker can reconnect properly. A rough sketch of what that check might look like follows.
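This sketch is entirely hypothetical; getBuildersForWorker and the attachment bookkeeping shown are assumptions rather than the PR's actual code:

from twisted.internet import defer

@defer.inlineCallbacks
def verify_attachments_before_keepalive(worker):
    # Hypothetical belt-and-suspenders check, run before each keepalive:
    # compare the builders this worker is configured for against those
    # actually attached right now.
    expected = {b.name for b in worker.botmaster.getBuildersForWorker(worker.name)}
    attached = set(worker.workerforbuilders)  # AbstractWorker keeps a dict keyed by builder name
    if expected - attached and worker.conn is not None:
        # Re-attaching may be possible, but dropping the connection is the
        # simpler recovery: the worker then reconnects and attaches cleanly.
        yield worker.conn.remotePrint("manager state out of sync; disconnecting")
        worker.conn.loseConnection()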

This problem has been nearly impossible to reproduce in unit tests; the only way I was able to reproduce it was by inserting an assert on values in AbstractWorker.detached(). I was unable to create a relevant test case using the PB connection due to some kind of issue related to the Twisted reactor system, and the null connection did not actually exhibit the problem the test was supposed to recreate, while the PB connection did.
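For reference, the reproduction aid was roughly of this shape; the exact values asserted are not reproduced here:

def detached(self):
    # Temporary debugging aid inserted into AbstractWorker.detached():
    # fail loudly if we are detaching while the bookkeeping still looks
    # "connected" (the inconsistent state this PR is about).
    assert self.conn is not None, "detached() called with no active conn"
    # ... original detached() body follows ...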

The tests used were something like this, based on the latent_worker tests:

# Imports for the snippet below, assumed from buildbot's test suite
# (module paths approximate; TimeoutableTestCase is defined locally in
# buildbot's test_worker_latent.py rather than imported):
from twisted.internet import defer
from twisted.python import log
from twisted.spread import pb

from buildbot.config import BuilderConfig
from buildbot.config.master import MasterConfig
from buildbot.interfaces import LatentWorkerSubstantiatiationCancelled
from buildbot.process.factory import BuildFactory
from buildbot.test.fake.worker import WorkerController
from buildbot.test.util.integration import RunFakeMasterTestCase


class SeverConnectionReconnect(TimeoutableTestCase, RunFakeMasterTestCase):

    def tearDown(self):
        # Flush the errors logged when the master's stop cancels the builds.
        self.flushLoggedErrors(LatentWorkerSubstantiatiationCancelled)
        super().tearDown()

    @defer.inlineCallbacks
    def create_single_worker_config(self, controller_kwargs=None):
        if not controller_kwargs:
            controller_kwargs = {}

        controller = WorkerController(self, 'local', **controller_kwargs)
        config_dict = {
            'builders': [
                BuilderConfig(name="testy",
                              workernames=["local"],
                              factory=BuildFactory(),
                              ),
            ],
            'workers': [controller.worker],
            'protocols': {'null': {}},
            # Disable checks about missing scheduler.
            'multiMaster': True,
        }
        yield self.setup_master(config_dict)
        builder_id = yield self.master.data.updates.findBuilderId('testy')

        return controller, builder_id

    @defer.inlineCallbacks
    def reconfig_workers_new_controller(self, controller_kwargs=None):
        if not controller_kwargs:
            controller_kwargs = {}

        controller = WorkerController(self, 'local', **controller_kwargs)
        config_dict = {
            'workers': [controller.worker],
            'protocols': {'null': {}},
            #'multiMaster': True
        }
        config = MasterConfig.loadFromDict(config_dict, '<dict>')
        yield self.master.workers.reconfigServiceWithBuildbotConfig(config)
        
        return controller

    @defer.inlineCallbacks
    def test_sever_connection_then_reconnect_after_timeout(self):
        controller1, builder_id = yield self.create_single_worker_config(
            controller_kwargs={"build_wait_timeout": 1})

        yield self.create_build_request([builder_id])

        yield controller1.connect_worker()
        self.assertIsNotNone(controller1.worker.conn)
        self.reactor.advance(1)

        self.assertGreater(len(controller1.worker.workerforbuilders), 0)
        self.assertIsNotNone(controller1.worker.conn)
        log.msg("Connected worker 1")
        # sever connection just before insubstantiation and lose it after TCP
        # times out
        #with patchForDelay('buildbot.worker.base.AbstractWorker.disconnect') as delay:
        self.reactor.advance(1)
        controller1.sever_connection()
        log.msg("Severed connection to worker 1")
        # retire_conn()/close_retired_conn() appear to be local test helpers
        # added for this scenario, not part of the stock WorkerController.
        controller1.worker.retire_conn()

        self.reactor.advance(100)
        controller1.worker.close_retired_conn()
        yield controller1.disconnect_worker()
        log.msg("Disconnected worker 1")

        self.reactor.advance(1)
        yield controller1.connect_worker()
        log.msg("Connected worker 2")
        self.assertIsNotNone(controller1.worker.conn)
        self.reactor.advance(1)
        self.assertGreater(len(controller1.worker.workerforbuilders), 0)
        self.assertIsNotNone(controller1.worker.conn)

        yield self.create_build_request([builder_id])
        yield controller1.disconnect_worker()
        log.msg("Disconnected worker 2")
        self.flushLoggedErrors(pb.PBConnectionLost)

    @defer.inlineCallbacks
    def test_sever_connection_then_reconnect_immediately(self):
        controller1, builder_id = yield self.create_single_worker_config(
            controller_kwargs={"build_wait_timeout": 1})

        yield self.create_build_request([builder_id])
        yield controller1.connect_worker()
        log.msg("Connected worker 1")
        self.reactor.advance(1)
        
        self.assertGreater(len(controller1.worker.workerforbuilders), 0)

        # sever connection just before insubstantiation and lose it after TCP
        # times out
        #with patchForDelay('buildbot.worker.base.AbstractWorker.disconnect') as delay:
        controller1.sever_connection()
        #    delay.fire()
        log.msg("Severed worker 1")
        self.reactor.advance(1)

        yield self.create_build_request([builder_id])
        log.msg("Starting worker 2")
        self.reactor.advance(1)
        # connect_new_worker()/disconnect_new_worker() appear to be local test
        # helpers for driving a second connection, not stock WorkerController API.
        yield controller1.connect_new_worker()
        self.reactor.advance(1)
        log.msg("Connected worker 2")
        self.reactor.advance(10)
        self.assertGreater(len(controller1.worker.workerforbuilders), 0)
        log.msg("step 2")
        self.reactor.advance(100)
        log.msg("step 3")
        yield controller1.disconnect_worker(False)
        log.msg("Closed worker 1")
        self.reactor.advance(1)
        
        self.assertGreater(len(controller1.worker.workerforbuilders), 0)
        self.reactor.advance(1)
        log.msg("Disconnecting worker 2")
        controller1.disconnect_new_worker()
        self.reactor.advance(1)
        # controller1.remote_worker2.parent = None

        log.msg("Disconnected worker 2")
        self.flushLoggedErrors(pb.PBConnectionLost)
        log.msg("step 4")

Contributor Checklist:

  • I have updated the unit tests. One added for the double notifyDisconnect; none for the main issue (see description)
  • I have created a file in the newsfragments directory (and read the README.txt in that directory)
  • I have updated the appropriate documentation

…from a

worker when there is already a known (but actually disconnected) connection
to the same worker.

This would otherwise cause a case of the worker "knowing" it was "connected",
while the manager equally "knew" that the worker was "offline" because it was
not attached to any builders, all the while the low-level connection maintenance
functionality kept sending and receiving keep-alives, thus keeping the unattached
connection open. (buildbot#4678)
… with a worker (buildbot#4678)

This is just-in-case logic for the case where the connection is somehow
detached from all Builders but still left active. (buildbot#4678)

…once, by replacing the current subscription object with a new one
after delivering the notifications. A second call could mess up the
newer connection's state. (buildbot#4678)
@YngveNPettersen force-pushed the yngve/handle-new-connection-20240302 branch from 49a32f4 to f9ffe15 on April 13, 2024