Initializing Client with Multiple Brokers #7

Closed
eapache opened this issue Aug 12, 2013 · 5 comments · Fixed by #13

Comments

eapache commented Aug 12, 2013

This has come up in the back of my mind a few times but I've always held off because I'm not sure what sane default behaviours are in all cases. Right now you specify a single host:port to construct a client and it pulls metadata on the rest of the cluster from that. It makes sense to be able to specify a list of brokers, so that if part of the cluster is offline the client can still connect.

eapache commented Aug 13, 2013

Or does it make more sense for the caller to just loop, creating clients until one works - then it can decide on its own error behaviour...
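For comparison, the caller-side loop would look roughly like this, assuming the current single-broker constructor takes an id, host, and port (the exact signature is a guess, and it needs "net" and "strconv" imported):

var client *Client
var err error
for _, broker := range []string{"kafka1:9092", "kafka2:9092"} {
  host, portstr, splitErr := net.SplitHostPort(broker)
  if splitErr != nil {
    continue
  }
  port, _ := strconv.Atoi(portstr)
  client, err = NewClient("my-client", host, port) // assumed shape of the current constructor
  if err == nil {
    break
  }
}
if err != nil {
  // the caller decides its own error behaviour here
}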

burke commented Aug 13, 2013

I'd probably shoot for something like this:

// imports: "strconv", "strings" (plus whatever the retry logic needs)
func NewClient(id string, brokers ...string) (*Client, error) {
  for _, broker := range brokers {
    parts := strings.SplitN(broker, ":", 2)
    host, portstr := parts[0], parts[1]
    port, err := strconv.Atoi(portstr)
    if err != nil {
      return nil, err
    }
    // the retry logic you described above, using host and port
    _, _ = host, port
  }
  return nil, nil // or an error if no broker could be reached
}

So the effective change is that NewClient is invoked as NewClient("my-client", "localhost:8341") instead of with the host and port separate, and becomes variadic. Does that sound reasonable? I can whip up a PR if you're okay with that approach.

eapache commented Aug 13, 2013

This is making me think of another potential issue: if the client cannot connect to all of the listed brokers in Client.update(), it just errors out. This potentially leaks broker connections and isn't necessarily the right thing to do.

This is a surprisingly hard problem :/
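One generic way to avoid that leak is to close whatever was opened before returning the error; a sketch with hypothetical broker/openBroker stand-ins (not the actual types in this repo):

import "net"

// broker and openBroker are hypothetical stand-ins for whatever the client
// actually tracks per connection.
type broker struct{ conn net.Conn }

func openBroker(addr string) (*broker, error) {
  c, err := net.Dial("tcp", addr)
  if err != nil {
    return nil, err
  }
  return &broker{conn: c}, nil
}

func (b *broker) Close() error { return b.conn.Close() }

// connectAll opens every address; on any failure it closes the connections it
// already opened before returning the error, so nothing leaks.
func connectAll(addrs []string) ([]*broker, error) {
  opened := make([]*broker, 0, len(addrs))
  for _, addr := range addrs {
    b, err := openBroker(addr)
    if err != nil {
      for _, prev := range opened {
        prev.Close()
      }
      return nil, err
    }
    opened = append(opened, b)
  }
  return opened, nil
}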

eapache commented Aug 14, 2013

As I mentioned in #4, it probably makes sense for the Client, Producer and Consumer constructors to take some sort of Config structs so we can add parameters in a backwards-compatible way. In that case I'm not sure if NewClient(id string, config *ClientConfig, brokers ...string) or NewClient(id string, brokers []string, config *ClientConfig) makes more sense. I like the order of the latter, but would prefer not to have to wrap everything in a slice...

eapache commented Aug 14, 2013

Additional concern: as per the comment near the end of NewClient, broker metadata requests do not seem to be required to return all the brokers they know about (it is not documented which ones they do return, but I suspect it is only the subset of brokers that lead a partition whose data is also in the response). This means that even once we connect to a broker and retrieve metadata, we should not discard the other n brokers that the user has provided us with - we may need them later if they were not returned in the metadata and all of the ones that were returned go away.

Secondly, if the user gives us two addresses, the first of which is down, we should not block for two minutes (the TCP connection timeout) waiting for the first one before we try the second - we should launch async connections to all of them and use the first that works (or return an error if none of them work).
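A minimal sketch of the parallel-dial pattern with the standard library (a general illustration using net.DialTimeout, not code from this repo):

import (
  "errors"
  "net"
  "time"
)

// dialAny dials every address concurrently and returns the first connection
// that succeeds, or an error once every attempt has failed.
func dialAny(addrs []string, timeout time.Duration) (net.Conn, error) {
  type result struct {
    conn net.Conn
    err  error
  }
  results := make(chan result, len(addrs)) // buffered so late goroutines never block

  for _, addr := range addrs {
    go func(addr string) {
      conn, err := net.DialTimeout("tcp", addr, timeout)
      results <- result{conn, err}
    }(addr)
  }

  for range addrs {
    r := <-results
    if r.err == nil {
      // any connections that succeed after this one still need to be closed
      // by whoever drains the rest of the channel; omitted for brevity
      return r.conn, nil
    }
  }
  return nil, errors.New("none of the given brokers were reachable")
}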

eapache pushed a commit that referenced this issue Aug 15, 2013
Fixes #7

This does basically the simplest thing that works. It has some flaws, but most
importantly it gets the API in place and is functional. We can improve it later
without breaking API compatibility.

Remaining sub-optimal behaviour:
 * If the user provides three addresses, only the last of which is valid, we
   wait for the first two connections to completely fail (which may take several
   minutes depending on the TCP timeout) before we try the third. Trying them
   all in parallel and using the first valid one would be better.
 * Once we've had an address fail, we discard it forever. Over the lifetime of a
   long-running cluster, all of the nodes may be down at one point or another,
   meaning that eventually we may 'run out' of nodes and abort. We should
   probably just mark such addresses as recently-failed and retry them after
   some time has passed. Only if all nodes have *recently* failed should we abort.
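As a rough sketch, the recently-failed bookkeeping described above could look something like this (all names are hypothetical, not from the codebase; only "time" is needed):

import "time"

// recentFailures remembers when each broker address last failed, so the client
// only gives up once every known address has failed within the cooldown window.
type recentFailures struct {
  cooldown time.Duration
  failedAt map[string]time.Time
}

func (r *recentFailures) markFailed(addr string) {
  r.failedAt[addr] = time.Now()
}

// usable reports whether an address has never failed, or failed long enough
// ago that it is worth retrying.
func (r *recentFailures) usable(addr string) bool {
  t, ok := r.failedAt[addr]
  return !ok || time.Since(t) > r.cooldown
}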