Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for race condition: Allocation of Deleting GameServers Possible #367

Merged

Conversation

markmandel
Copy link
Member

If an allocation occurred during a Fleet scale down, or during a update of a Fleet, it was entirely possible for those parallel delete operations to be applied to a GameServer that was being allocated at the same time.

This is mainly because the client-go informer cache is lazily consistent, but also because there was nothing preventing a Delete() of a GameServer from happening while allocating a specific GameServer.

To solve this, there are two strategies implemented here:

  1. Share the allocationLock sync.Mutex between controllers such that allocations cannot occur while GameServer Deletes for Fleet resizing/update are happening and vice versa.
  2. use cache.WaitForCacheSync to bring the cluster informer up to date, to remove the chance for non-updated information about GameServers to be
    acted upon.

The shared lock is a quite broad approach - down the line, we could refine this to being per Fleet, or per GameServerSet, if we find this to be a bottleneck, but the priority here was to get something working that resolvesthe issue, and we can optimise as needed from here.

There are also e2e tests specifically designed for catching these race conditions as well.

If an allocation occurred during a Fleet scale down, or during a update of a
Fleet, it was entirely possible for those parallel delete operations to
be applied to a GameServer that was being allocated at the same time.

This is mainly because the client-go informer cache is lazily consistent, but
also because there was nothing preventing a `Delete()` of a `GameServer` from
happening while allocating a specific `GameServer`.

To solve this, there are two strategies implemented here:

1. Share the `allocationLock` `sync.Mutex` between controllers such that
allocations cannot occur while `GameServer` Deletes for Fleet resizing/update
are happening and vice versa.
2. use `cache.WaitForCacheSync` to bring the cluster informer up to date,
to remove the chance for non-updated information about `GameServers` to be
acted upon.

The shared lock is a quite broad approach - down the line, we could refine this
to being per `Fleet`, or per `GameServerSet`, if we find this to be a
bottleneck, but the priority here was to get something working that resolves
the issue, and we can optimise as needed from here.

There are also e2e tests specifically designed for catching these race
conditions as well.
@markmandel markmandel added kind/bug These are bugs. area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc labels Sep 28, 2018
@markmandel markmandel added this to the 0.5.0 milestone Sep 28, 2018
@agones-bot
Copy link
Collaborator

Build Succeeded 👏

Build Id: 0c0d9f54-7cb9-471d-aee0-4df518bfc905

The following development artifacts have been built, and will exist for the next 30 days:

(experimental) To install this version:

  • git fetch https://github.com/GoogleCloudPlatform/agones.git pull/367/head:pr_367 && git checkout pr_367
  • helm install install/helm/agones --namespace agones-system --name agones --set agones.image.tag=0.5.0-e424295

Copy link
Collaborator

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !

@cyriltovena cyriltovena merged commit e6818e3 into googleforgames:master Sep 28, 2018
@markmandel markmandel deleted the bug/race-allocate-on-delete branch September 28, 2018 16:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/user-experience Pertaining to developers trying to use Agones, e.g. SDK, installation, etc kind/bug These are bugs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants