Scrap the connection test in the client metadata update #277

Merged
eapache merged 1 commit into master from scrap-client-connection-test on Feb 17, 2015

Conversation

@eapache (Contributor) commented Feb 17, 2015

Previously, if the cluster metadata was giving us back a broker which we
suspected was unavailable (since it was already in our 'dead' set) then we would
wait for the connection, and mark it as unavailable if the connection failed
(otherwise, we simply do what the cluster tells us and let the
producers/consumers deal with the connection errors). This was handy since it
let us back off nicely if a broker crashed and came back, retrying metadata
until the cluster had caught up and moved the leader to a broker that was up.

I'm now of the opinion this was more trouble than it's worth, so scrap it. Among
other things:
 - it does network IO while holding an important mutex, which is a bad pattern
   to begin with (#263)
 - it can mask real network errors behind "LeaderNotAvailable" (#272)
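
To make the first point concrete, here is a minimal, hypothetical Go sketch of the shape of the pattern being scrapped; the types and names are illustrative stand-ins, not the client's actual code:

```go
package main

import (
	"net"
	"sync"
	"time"
)

// Hypothetical sketch only -- not the real client types -- showing the shape
// of what is being scrapped: a connection test performed while the metadata
// lock is held, versus simply trusting what the cluster reported.
type client struct {
	lock    sync.Mutex
	brokers map[string]bool // addr -> present in current metadata
	dead    map[string]bool // addr -> previously seen to fail
}

// updateBrokerOld mirrors the old behaviour: network IO under the lock (#263),
// with a failed dial surfacing as the leader being unavailable (#272).
func (c *client) updateBrokerOld(addr string) {
	c.lock.Lock()
	defer c.lock.Unlock()

	if c.dead[addr] {
		conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		if err != nil {
			return // broker stays out of the set; callers see LeaderNotAvailable
		}
		conn.Close()
	}
	c.brokers[addr] = true
}

// updateBrokerNew mirrors the new behaviour: record what the cluster told us
// and let the producer/consumer hit the real connection error if it is down.
func (c *client) updateBrokerNew(addr string) {
	c.lock.Lock()
	defer c.lock.Unlock()

	c.brokers[addr] = true
}

func main() {
	c := &client{brokers: map[string]bool{}, dead: map[string]bool{}}
	c.updateBrokerNew("localhost:9092")
}
```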

The unfortunate side-effect of scrapping it is that in the producer and consumer
we are more likely to fail if we don't wait long enough for the cluster to fail
over leadership. The real solution if that occurs is to wait longer in the
correct spot (`RetryBackoff` in the producer, currently hard-coded to 10 seconds
in the consumer) instead of this hack.
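
As a rough illustration of "waiting longer in the correct spot" on the producer side: only the `RetryBackoff` name comes from the text above; the surrounding struct and value below are made up for the example, not the real producer config.

```go
package main

import "time"

// Illustrative only: this struct is a stand-in, not the real producer config;
// only the RetryBackoff field name is taken from the text above.
type ProducerConfig struct {
	RetryBackoff time.Duration // wait between retries after a leader change
}

func main() {
	cfg := ProducerConfig{
		// Give the cluster more time to move leadership to a live broker
		// before the producer retries (and, eventually, gives up).
		RetryBackoff: 30 * time.Second,
	}
	_ = cfg
}
```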

@Shopify/kafka

cc @luck02

@wvanbergen (Contributor) commented Feb 17, 2015

This makes sense to me.

Should we DRY up the retry logic between the consumer and producer, or at least reuse the same config variable for it?

@eapache (Contributor, Author) commented Feb 17, 2015

The consumer is much simpler: if it fails, it just goes back to sleep and then tries again (forever), whereas the producer is more finicky since it must eventually give up. The only reason the consumer sleeps at all is to avoid spinning, so I didn't make it configurable and didn't put much thought into it.
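
Roughly, the difference looks like this (a hypothetical sketch; `fetchLeader`, the constants, and the loop shapes are stand-ins, not the real consumer/producer code):

```go
package main

import (
	"errors"
	"time"
)

// Hypothetical stand-in for a metadata/leader lookup that can fail.
var errNoLeader = errors.New("leader not available")

func fetchLeader() error { return errNoLeader }

// Consumer style: sleep only to avoid spinning, then try again forever.
func consumerRetry() {
	for {
		if err := fetchLeader(); err == nil {
			return
		}
		time.Sleep(10 * time.Second) // the hard-coded consumer backoff
	}
}

// Producer style: back off between attempts, but give up after a bounded
// number of retries so the messages can be returned to the caller as errors.
func producerRetry(maxRetries int, backoff time.Duration) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = fetchLeader(); err == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return err
}

func main() {
	_ = producerRetry(3, 250*time.Millisecond)
}
```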

@wvanbergen (Contributor) commented Feb 17, 2015

If it's easy to move it to a config variable, I say do it; if not, :shipit: as is. 👍

@eapache (Contributor, Author) commented Feb 17, 2015

Eh, I think we could tackle that as part of #261 if anyone ever cares enough to take a crack at it.

eapache added a commit that referenced this pull request Feb 17, 2015
Scrap the connection test in the client metadata update
@eapache eapache merged commit bb9687f into master Feb 17, 2015
@eapache eapache deleted the scrap-client-connection-test branch February 17, 2015 20:06