Support Multiple Cluster spanning the globe #301

Open
markmandel opened this Issue Jul 20, 2018 · 19 comments

@markmandel
Collaborator

markmandel commented Jul 20, 2018

Agones should make it easy to support multiple Agones clusters around the world, and provide tools for streamlining these processes.

I see this falling under two main feature sets:

Cluster Registries

  • Be able to add/remove clusters from a single registry
  • Be aware of the health of a registry
  • That registry should be queryable in some way as well.

There should be existing work, most likely coming out of sig-multicluster, that we can take advantage of - as many applications have this need.
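
To make the three bullet points above concrete, here is a minimal in-memory sketch in Go. It is only an illustration: the `Cluster` and `Registry` types, their fields, and the example endpoints are hypothetical, and it reads "health" as the health of each registered cluster. A real solution would more likely build on whatever sig-multicluster provides.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Cluster is a hypothetical registry entry; the fields are illustrative only.
type Cluster struct {
	Name     string
	Endpoint string
	Healthy  bool
}

// Registry is an in-memory stand-in for whatever store would back this.
type Registry struct {
	mu       sync.RWMutex
	clusters map[string]Cluster
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{clusters: make(map[string]Cluster)}
}

// Add registers a cluster, replacing any existing entry with the same name.
func (r *Registry) Add(c Cluster) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.clusters[c.Name] = c
}

// Remove deletes a cluster by name; removing an unknown name is a no-op.
func (r *Registry) Remove(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.clusters, name)
}

// SetHealth records the latest health observation and reports
// whether the named cluster exists.
func (r *Registry) SetHealth(name string, healthy bool) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	c, ok := r.clusters[name]
	if !ok {
		return false
	}
	c.Healthy = healthy
	r.clusters[name] = c
	return true
}

// Healthy returns the names of all healthy clusters, sorted for stable output.
func (r *Registry) Healthy() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var names []string
	for name, c := range r.clusters {
		if c.Healthy {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	reg := NewRegistry()
	reg.Add(Cluster{Name: "us-east", Endpoint: "https://us-east.example.com", Healthy: true})
	reg.Add(Cluster{Name: "eu-west", Endpoint: "https://eu-west.example.com", Healthy: true})
	reg.SetHealth("eu-west", false)
	fmt.Println(reg.Healthy())
}
```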

Research

Standard Ping/Latency tooling

We should integrate some tooling into each cluster, along with accompanying tooling that can be incorporated into game clients, so that determining ping time to clusters is an out-of-the-box solution (or as close to out of the box as we can make it).

@victor-prodan

Contributor

victor-prodan commented Jul 21, 2018

I think it would be useful to be able to define spillover rules - when one cluster is full to redirect allocation requests to another one.
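
A rough sketch of what a spillover rule could look like, in Go. Everything here is hypothetical (`ClusterCapacity`, `Allocate`, and the idea that free capacity is known up front); it only shows the "try the next cluster when this one is full" behaviour, not how Agones allocation works.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoCapacity is returned when every cluster in the spillover chain is full.
var ErrNoCapacity = errors.New("no cluster has a free game server")

// ClusterCapacity is a hypothetical view of one cluster's free game servers.
type ClusterCapacity struct {
	Name string
	Free int
}

// Allocate tries clusters in order and spills over to the next when one is full.
// It decrements Free on the chosen cluster and returns that cluster's name.
func Allocate(chain []*ClusterCapacity) (string, error) {
	for _, c := range chain {
		if c.Free > 0 {
			c.Free--
			return c.Name, nil
		}
	}
	return "", ErrNoCapacity
}

func main() {
	chain := []*ClusterCapacity{
		{Name: "eu-west", Free: 1},
		{Name: "eu-central", Free: 2},
	}
	// The fourth request finds every cluster full and gets ErrNoCapacity.
	for i := 0; i < 4; i++ {
		name, err := Allocate(chain)
		fmt.Println(name, err)
	}
}
```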

@markmandel

Collaborator

markmandel commented Jul 27, 2018

Ooh, that's an interesting idea. Kind of like a global fleet allocation load balancer. Nice 👍

@markmandel

Collaborator

markmandel commented Sep 6, 2018

We will also need to consider how to deploy and manage fleets across multiple clusters around the globe. It isn't just a "deploy to all" - but gradual rollout / specific region first, etc.

Also, should multicluster have some kind of "cost" / "priority" value? i.e. if you run your own datacenter, you only want to burst into the cloud when necessary. How can you show that? (Can clusters have labels and annotations? That could well be an easy answer)
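
If clusters can carry labels, the "priority" idea might look something like the Go sketch below. The label key `example.com/allocation-priority` and the `LabeledCluster` type are made up for illustration; nothing in Agones or Kubernetes defines them. Lower numbers are preferred, and clusters without a valid label sort last.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strconv"
)

// PriorityLabel is a made-up label key used only for this sketch.
const PriorityLabel = "example.com/allocation-priority"

// LabeledCluster carries Kubernetes-style string labels.
type LabeledCluster struct {
	Name   string
	Labels map[string]string
}

// priority parses the label; clusters without a valid label sort last.
func priority(c LabeledCluster) int {
	p, err := strconv.Atoi(c.Labels[PriorityLabel])
	if err != nil {
		return math.MaxInt32
	}
	return p
}

// ByPriority returns a copy ordered from most preferred (lowest number) to least.
func ByPriority(clusters []LabeledCluster) []LabeledCluster {
	out := append([]LabeledCluster(nil), clusters...)
	sort.SliceStable(out, func(i, j int) bool {
		return priority(out[i]) < priority(out[j])
	})
	return out
}

func main() {
	clusters := []LabeledCluster{
		{Name: "cloud-burst", Labels: map[string]string{PriorityLabel: "10"}},
		{Name: "own-datacenter", Labels: map[string]string{PriorityLabel: "0"}},
	}
	// The own datacenter is tried first; the cloud cluster is only a fallback.
	for _, c := range ByPriority(clusters) {
		fmt.Println(c.Name)
	}
}
```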

@Maxpain177

Contributor

Maxpain177 commented Oct 7, 2018

+1. I'm looking forward to it

@Kuqd

Collaborator

Kuqd commented Oct 11, 2018

@markmandel Question: is cluster creation out of scope? I personally think so; this also makes it easier to handle multiple clusters by using a single means of connection, the kubeconfig/service account. We should, however, add a tool that helps register a kubeconfig for each provider.

@markmandel

Collaborator

markmandel commented Oct 11, 2018

@Kuqd yeah, I would agree that cluster creation is out of scope - but yes, we need some kind of tool for managing kubeconfig -- I wonder if cluster-registry will help?

@Kuqd

Collaborator

Kuqd commented Oct 11, 2018

No, it's something else: basically endpoints and auth info (a bearer token from a service account).

However, looking at the code base, there is nothing other than the CRD and the generated code for it, so this is very early. At the end of the day you need a rest.Config.

@Kuqd

Collaborator

Kuqd commented Oct 30, 2018

@markmandel

Collaborator

markmandel commented Oct 30, 2018

@Kuqd although nobody uses Federation anymore - it's pretty much been determined to be a failed experiment.

Although maybe there are pieces we can lift out of it that are useful.

@Oleksii-Terekhov

Oleksii-Terekhov commented Nov 14, 2018

We are also interested in multi-cluster Agones.
But in our project, selecting the best cluster is the matchmaker's job: it weighs minimal ping for all users, costs, and similar factors to maximize UX and profit.
So any "logic to select a cluster for a GameServer" must be switchable - we want to explicitly select fleet + cluster in the FleetAllocation manifest...
Maybe, as a side effect, there should be a quick and trustworthy metrics API in the Agones controller about current load (free/allocated/total/unhealthy/available GameServers...)
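
As an illustration of the kind of matchmaker-side decision described above, here is a small Go sketch that picks the cluster with the lowest worst-case ping across all players in a match. The `PlayerPings` type and the "minimize the worst RTT" rule are assumptions chosen for the example; a real matchmaker would also weigh cost and load.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// PlayerPings maps cluster name to one player's measured round trip time.
type PlayerPings map[string]time.Duration

// BestCluster returns the cluster whose worst player RTT is lowest,
// considering only clusters that every player has measured.
// Ties are broken by cluster name so the result is deterministic.
func BestCluster(players []PlayerPings) (string, bool) {
	if len(players) == 0 {
		return "", false
	}
	names := make([]string, 0, len(players[0]))
	for name := range players[0] {
		names = append(names, name)
	}
	sort.Strings(names)

	best, bestWorst, found := "", time.Duration(0), false
	for _, name := range names {
		worst, complete := time.Duration(0), true
		for _, p := range players {
			rtt, ok := p[name]
			if !ok {
				complete = false
				break
			}
			if rtt > worst {
				worst = rtt
			}
		}
		if complete && (!found || worst < bestWorst) {
			best, bestWorst, found = name, worst, true
		}
	}
	return best, found
}

func main() {
	players := []PlayerPings{
		{"us-east": 40 * time.Millisecond, "eu-west": 120 * time.Millisecond},
		{"us-east": 150 * time.Millisecond, "eu-west": 90 * time.Millisecond},
	}
	name, ok := BestCluster(players)
	fmt.Println(name, ok)
}
```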

@markmandel markmandel self-assigned this Nov 14, 2018

@markmandel

Collaborator

markmandel commented Nov 14, 2018

I've started researching this as the next big ticket item to tackle.

What I'm trying to decide is which item should be tackled first.

Maybe having a registry of clusters, and a standard way to ping each of them for round trip time? Would that be a good starting point?

@EricFortin

Collaborator

EricFortin commented Nov 15, 2018

I am not convinced about the registry. Unless we want to offer that as a separate project, where would this live? As stated before, allocation strategy is bound to change a lot based on multiple criteria. So if we only deliver a simple registry, I feel most people already have some form of configuration service where they could store that list.

That being said, having a setup to ping each cluster so we can send this data to our matchmaking service is definitely worth it. It also requires client code so we need to think about how we will deliver this since this code will need to run on consoles too.

@markmandel

Collaborator

markmandel commented Nov 15, 2018

@EricFortin agreed. Doing the research, it looks like the cluster-registry project and also federation-v2 (linked above) will (at least eventually) take care of registering and tracking multiple clusters, as well as being able to deploy Agones CRDs, such as fleets across them - so while that is still alpha now, it's coming, and we should lean on that work (I'll likely start playing with both soon, just to get a feel).

Regarding a "pinger" - I have a couple of questions about this (probably because my knowledge here is a bit thinner):

  1. Is there any prior art we can leverage here? Is there a standard way of determining RTT, or an existing open source project we can leverage? (I had a hunt, but couldn't find anything - do we send some UDP packets and echo them back - maybe with an epoch timestamp?)
  2. Client side code - I'm thinking that we need to do both a C# and a C++ SDK for this, but also define it as a standard. That way, if someone can't use the supported client code, it can be integrated relatively easily (I hope) / can be developed in phases.

Does that make sense?

^ this should likely also be its own ticket at this point, it seems.
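
To make the "UDP packets with an epoch timestamp" idea from question 1 a bit more concrete, here is one possible payload layout sketched in Go. The 12-byte format (sequence number plus send time) is invented for this example and is not a standard. Because the server would only echo the bytes back, both timestamps come from the client's clock, so no clock sync is needed.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"time"
)

// probeSize is the length of the hypothetical probe payload:
// a 4-byte sequence number followed by an 8-byte send time.
const probeSize = 12

// EncodeProbe packs a sequence number and the client's send time
// (Unix nanoseconds) in big-endian byte order.
func EncodeProbe(seq uint32, sentUnixNano int64) []byte {
	buf := make([]byte, probeSize)
	binary.BigEndian.PutUint32(buf[0:4], seq)
	binary.BigEndian.PutUint64(buf[4:12], uint64(sentUnixNano))
	return buf
}

// DecodeProbe unpacks an echoed payload into its sequence number and send time.
func DecodeProbe(buf []byte) (seq uint32, sentUnixNano int64, err error) {
	if len(buf) != probeSize {
		return 0, 0, errors.New("unexpected probe length")
	}
	seq = binary.BigEndian.Uint32(buf[0:4])
	sentUnixNano = int64(binary.BigEndian.Uint64(buf[4:12]))
	return seq, sentUnixNano, nil
}

// RTT is the time between sending the probe and receiving its echo.
// It relies on the wall clock, so a clock adjustment mid-probe would skew it.
func RTT(echo []byte, receivedUnixNano int64) (time.Duration, error) {
	_, sent, err := DecodeProbe(echo)
	if err != nil {
		return 0, err
	}
	return time.Duration(receivedUnixNano - sent), nil
}

func main() {
	probe := EncodeProbe(1, time.Now().UnixNano())
	// A real client would send probe over UDP and read the echo;
	// here the same bytes are reused directly.
	rtt, err := RTT(probe, time.Now().UnixNano())
	fmt.Println(rtt, err)
}
```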

@victor-prodan

Contributor

victor-prodan commented Nov 16, 2018

  1. Client side code - I'm thinking that we both need to do a C# and a C++ sdk for this

Do you mean game client code? If yes, I don't think it's necessary. I think that each game has its own way of pinging an HTTP (or maybe UDP) endpoint, so we just need a way to create those endpoints.

Here is how Amazon is doing it: https://www.cloudping.info/.
I imagine a similar thing for Agones.

@markmandel

Collaborator

markmandel commented Nov 16, 2018

Also similar to: http://www.gcping.com/ 😄

Is an HTTP endpoint good enough, or does it have to be a UDP endpoint? (or maybe both?)

Here's an interesting question - should this be behind a load balancer? An LB will mean that we can do redundancy, and scale up the pinger (if it's needed), but does an LB introduce a layer that otherwise wouldn't be there?

@victor-prodan

Contributor

victor-prodan commented Nov 17, 2018

An LB is needed, yes, and this is why it sounds like an independent service hosted in the same cluster. Something like this already exists, maybe?

About UDP... Yes, it might be helpful to detect packet loss, for example.

It also depends on the protocol used by the game... If it's based on something like WebSocket, then they wouldn't need UDP.

HTTP is easier to implement on both sides and it's a must. UDP is nice to have.

@markmandel

Collaborator

markmandel commented Nov 17, 2018

That makes lots of sense.

For the HTTP request - I'm assuming this would need to return an "ok" and HTTP 200, and the client would want to track the round trip time themselves.
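
A minimal Go sketch of that HTTP exchange, under the assumptions stated above: the server answers 200 with the body "ok", and the client times the round trip itself. The handler and function names are hypothetical, and `httptest` stands in for a real deployment so the example runs locally.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// pingHandler replies 200 with the body "ok".
func pingHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	io.WriteString(w, "ok")
}

// MeasureHTTP times one GET round trip; the client owns the clock.
func MeasureHTTP(client *http.Client, url string) (time.Duration, error) {
	start := time.Now()
	resp, err := client.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, err
	}
	if resp.StatusCode != http.StatusOK || string(body) != "ok" {
		return 0, fmt.Errorf("unexpected ping response: %d %q", resp.StatusCode, body)
	}
	return time.Since(start), nil
}

// localPing starts a throwaway local server and measures one round trip to it.
func localPing() (time.Duration, error) {
	srv := httptest.NewServer(http.HandlerFunc(pingHandler))
	defer srv.Close()
	return MeasureHTTP(srv.Client(), srv.URL)
}

func main() {
	rtt, err := localPing()
	if err != nil {
		fmt.Println("ping failed:", err)
		return
	}
	fmt.Println("round trip:", rtt)
}
```

Note that the first request on a fresh connection also pays for TCP setup, so a client would probably want to take several samples over a kept-alive connection rather than trust a single measurement.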

For UDP - would it return an empty packet back to the sender? Does it need to contain any information, like an id/hash? (or echo what has been sent) Since it's async, I would assume this would be required - unless there is a better way.
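
One way the "echo what has been sent" option could work, sketched in Go: the client sends a random id, the server echoes the datagram unchanged, and the client ignores any reply that does not carry its id. The function names and the 8-byte id are assumptions for the example, and it runs against a loopback server only.

```go
package main

import (
	"bytes"
	"crypto/rand"
	"fmt"
	"net"
	"time"
)

// serveEcho sends every datagram straight back to its sender until conn is closed.
func serveEcho(conn net.PacketConn) {
	buf := make([]byte, 1024)
	for {
		n, addr, err := conn.ReadFrom(buf)
		if err != nil {
			return // the connection was closed
		}
		conn.WriteTo(buf[:n], addr)
	}
}

// PingUDP sends a random 8-byte id and waits for the matching echo.
// Replies that do not match the id (late or stray packets) are ignored.
func PingUDP(addr string, timeout time.Duration) (time.Duration, error) {
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return 0, err
	}
	defer conn.Close()

	id := make([]byte, 8)
	if _, err := rand.Read(id); err != nil {
		return 0, err
	}

	start := time.Now()
	if err := conn.SetDeadline(start.Add(timeout)); err != nil {
		return 0, err
	}
	if _, err := conn.Write(id); err != nil {
		return 0, err
	}

	reply := make([]byte, 64)
	for {
		n, err := conn.Read(reply)
		if err != nil {
			return 0, err // includes a timeout, which a client could count as loss
		}
		if bytes.Equal(reply[:n], id) {
			return time.Since(start), nil
		}
	}
}

// localUDPPing runs the echo server on loopback and pings it once.
func localUDPPing() (time.Duration, error) {
	server, err := net.ListenPacket("udp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer server.Close()
	go serveEcho(server)
	return PingUDP(server.LocalAddr().String(), time.Second)
}

func main() {
	rtt, err := localUDPPing()
	fmt.Println(rtt, err)
}
```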

Another fun question - do we have any concerns about someone using this for a reflection DDOS attack of some kind? (spoofing the "from" address, to forward the attack to another location? Sounds like we should rate limit these requests).

@victor-prodan

Contributor

victor-prodan commented Nov 19, 2018

For the HTTP request - I'm assuming this would need to return an "ok" and HTTP 200, and the client would want to track the round trip time themselves.

👍

For UDP - would it return an empty packet back to the sender? Does it need to contain any information, like an id/hash?

UDP is tricky, as a production game would use its own protocol. Maybe it would be better to let the user supply the binary? They would also be responsible for any throttling in this case.

@EricFortin

Collaborator

EricFortin commented Nov 19, 2018

@markmandel

Another fun question - do we have any concerns about someone using this for a reflection DDOS attack of some kind? (spoofing the "from" address, to forward the attack to another location? Sounds like we should rate limit these requests).

For a reflection attack to be effective, you usually want to provoke a bigger response than the request you sent. If we simply echo back the packet we received, there is not much to gain from hitting us instead of the target directly.

Rate limiting is still a good thing though.
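
For what rate limiting might look like, here is a small per-source token bucket sketched in Go. The `Limiter` type and the numbers are hypothetical. Since a spoofed packet carries the victim's address as its source, keying the bucket by source address caps how much echo traffic any one address can receive.

```go
package main

import (
	"fmt"
	"time"
)

// bucket is one source's token bucket.
type bucket struct {
	tokens float64
	last   time.Duration
}

// Limiter allows each source a burst of packets, refilled at a steady rate.
// It is a sketch: it is not safe for concurrent use, and it never evicts
// idle sources, so the map grows with the number of distinct addresses.
type Limiter struct {
	rate    float64 // tokens added per second
	burst   float64 // bucket capacity
	buckets map[string]*bucket
}

// NewLimiter creates a limiter with the given refill rate and burst size.
func NewLimiter(perSecond, burst float64) *Limiter {
	return &Limiter{rate: perSecond, burst: burst, buckets: make(map[string]*bucket)}
}

// Allow reports whether a packet from source should be echoed.
// now is a monotonic reading, such as time.Since(serverStart).
func (l *Limiter) Allow(source string, now time.Duration) bool {
	b, ok := l.buckets[source]
	if !ok {
		b = &bucket{tokens: l.burst, last: now}
		l.buckets[source] = b
	}
	// Refill for the elapsed time, capped at the burst size.
	b.tokens += (now - b.last).Seconds() * l.rate
	if b.tokens > l.burst {
		b.tokens = l.burst
	}
	b.last = now
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	start := time.Now()
	l := NewLimiter(5, 10)
	// The loop runs far faster than the refill rate, so once the burst
	// of 10 is spent the remaining packets are dropped.
	for i := 0; i < 12; i++ {
		fmt.Println(i, l.Allow("203.0.113.7", time.Since(start)))
	}
}
```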

markmandel added a commit to markmandel/agones that referenced this issue Dec 1, 2018

Pinger service for Multiple Cluster Latency Measurement.
Context: GoogleCloudPlatform#301

This creates a simple HTTP endpoint and/or a rate limited UDP echo service
to be able to easily do RTT latency tests from game clients, to multiple
Agones installs.

markmandel added a commit to markmandel/agones that referenced this issue Dec 5, 2018

Pinger service for Multiple Cluster Latency Measurement.
Context: GoogleCloudPlatform#301

This creates a simple HTTP endpoint and/or a rate limited UDP echo service
to be able to easily do RTT latency tests from game clients, to multiple
Agones installs.

markmandel added a commit that referenced this issue Dec 6, 2018

Pinger service for Multiple Cluster Latency Measurement.
Context: #301

This creates a simple HTTP endpoint and/or a rate limited UDP echo service
to be able to easily do RTT latency tests from game clients, to multiple
Agones installs.