Concurrent metadata refresh #172
Conversation
seems to have an issue in the Go 1.1 build. Investigating...
unable to reliably replicate locally - are you guys aware of any issues with …
Yes, it seems like you got caught in our functional test heisenbug which strikes every now and then :(
The tests are passing now. If you have any idea what could cause this bug (most likely due to concurrency), please let us know.
// Public channels
Get chan *fanoutResult
// optional channel to notify outside when clean up is complete
Done chan struct{}
Why do these channels have to be public? This seems like an internal thing.
Can you add some godoc to explain what these struct members are for? The name Get is not very enlightening to me.
@eapache thoughts?
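For illustration, godoc along these lines would answer both review comments. This is only a sketch: the enclosing type names (fanout, fanoutResult's fields) are assumptions, since the excerpt shows just the two channel fields. If nothing outside the package reads these channels, lowercasing them to get and done would also address the visibility question.

```go
package sarama

// fanoutResult carries the outcome of a single metadata attempt; the
// field layout here is assumed, not taken from the PR.
type fanoutResult struct {
	metadata *MetadataResponse // sarama's existing metadata response type
	err      error
}

// fanout coordinates concurrent metadata requests against several brokers.
type fanout struct {
	// Get delivers the first usable result of the fan-out to the caller
	// waiting on a metadata refresh.
	Get chan *fanoutResult

	// Done, if non-nil, is closed once all broker connections opened by
	// the fan-out have been cleaned up, so outside code can observe it.
	Done chan struct{}
}
```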
I'm not sure I like this approach in general. To quote old me in #15: "connecting to all the brokers just to pick one and drop the others is gross". It's also inefficient, and could cause issues on clusters with a large number of brokers. Originally in that issue I'd suggested that we should wait a small amount before trying the next broker, but I now think that's basically over-complicated. The much better solution to achieve that behaviour is just to reduce the connect timeout on the brokers to something reasonable for LANs (if it takes more than 2 seconds to connect over a LAN, something is probably wrong).

@crxpandion why did you decide to implement this? What problem precisely were you trying to solve?
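The timeout suggestion is easy to express in code. The sketch below uses present-day sarama configuration names (config.Net.DialTimeout), which postdate this thread, and placeholder broker addresses:

```go
package main

import (
	"log"
	"time"

	"github.com/IBM/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	// Keep the TCP connect timeout short so a dead broker is skipped
	// quickly ("reasonable for LANs", per the comment above).
	cfg.Net.DialTimeout = 2 * time.Second

	// broker-1/broker-2 are placeholder addresses.
	client, err := sarama.NewClient([]string{"broker-1:9092", "broker-2:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
}
```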
@eapache - I started down this path when testing sarama's ability to weather broker failures. I noticed that even with smaller timeouts, as you suggest, the metadata refresh can take a long time, particularly in large clusters. That delay is not only frustrating when running lots of integration-type tests, but could be detrimental in a production system. As for the inefficiency - I can see how it's not computationally efficient, but it seems we should mainly be concerned with getting the cluster metadata quickly, so that the user of this library can get on with their computation. The problems with large clusters (20-100s of brokers) could be solved by throttling the number of outbound attempts. Overall I think that abstracting the approach to fetching metadata (regardless of strategy) out of the refreshMetadata function, which mainly contains retry logic, does a lot to make this section of code less gross. Anyways, I appreciate the feedback.

@wvanbergen re the functional test heisenbug: I think there are at least two modes of failure that I see locally. The one that is easy to diagnose is the result of OOM killers in the VM (I am not sure how Travis CI works, so this might not be the issue we saw above), and judging from recent commits it seems you guys are aware of that one. An alternate approach that might make the memory increase more reliable would be to reduce the number of brokers to 3. Judging from stack traces taken when metadata is refreshed, I think there is at least one other failure mode besides the OOM killer, but I haven't been able to isolate it yet - I might have time later this week to look into it again if you guys want.
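The fan-out being debated looks roughly like the sketch below. The names firstMetadata and the fetch callback are invented for illustration (sarama's actual refresh path is more involved); MetadataResponse refers to sarama's existing type.

```go
package sarama

// firstMetadata dials every broker concurrently and returns the first
// successful response; if every attempt fails, the last error is
// returned, mirroring the behaviour described in the PR summary.
func firstMetadata(addrs []string, fetch func(addr string) (*MetadataResponse, error)) (*MetadataResponse, error) {
	type attempt struct {
		meta *MetadataResponse
		err  error
	}
	// Buffered so that losing goroutines can send their result and exit
	// even after the caller has returned.
	results := make(chan attempt, len(addrs))

	for _, addr := range addrs {
		go func(addr string) {
			meta, err := fetch(addr)
			results <- attempt{meta, err}
		}(addr)
	}

	var lastErr error
	for range addrs {
		r := <-results
		if r.err == nil {
			return r.meta, nil // first success wins
		}
		lastErr = r.err
	}
	return nil, lastErr
}
```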
OK, so I'm not opposed to the general concept of trying multiple brokers concurrently for metadata, but:
Given the above, here are a few suggestions:
Sure, both of those suggestions make sense - I'll work to amend my patch. I am coming down with a cold so it might take me a few days. :( I feel that the "thundering herd" problem is often solved by random jitter in the delay before attempting a refresh. What are your thoughts on this strategy? Obviously we'd want to use it in conjunction with fanout throttling.
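The jitter being proposed is only a few lines; here is a sketch with made-up constants:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredDelay returns base plus a uniformly random extra of up to
// jitter (which must be > 0), so clients that all saw the same failure
// don't retry in lockstep.
func jitteredDelay(base, jitter time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(jitter)))
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(jitteredDelay(250*time.Millisecond, 250*time.Millisecond))
	}
}
```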
I don't have a problem with jitter to avoid the thundering herd - I'm not sure it's necessary for the first connection of a refresh (clusters should be OK as long as each client makes only one request, not …).

Let us know when you've got a new version of the patch ready.
@eapache I added a throttle into the fanout. The default is 1 connection attempt at a time. The logic is a tad more complicated.
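A buffered channel used as a semaphore is the usual way to cap in-flight attempts in Go. The sketch below uses invented names (the PR's real implementation differs); with a limit of 1, the default mentioned above, the fan-out degrades to sequential tries.

```go
package sarama

import "sync"

// throttledFanout runs fetch against every address, but allows at most
// limit attempts to be in flight at once.
func throttledFanout(addrs []string, limit int, fetch func(addr string) error) {
	sem := make(chan struct{}, limit) // channel capacity bounds concurrency
	var wg sync.WaitGroup

	for _, addr := range addrs {
		wg.Add(1)
		go func(addr string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when this attempt ends
			_ = fetch(addr)          // success/early-exit handling omitted
		}(addr)
	}
	wg.Wait()
}
```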
} else {
	for _, broker := range client.brokers {
		Logger.Printf("Fanning out %v\n", broker)
		b := broker
this intermediate variable seems unnecessary?
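One possible answer, assuming the elided loop body launches a goroutine per broker: before Go 1.22, for-range reused a single loop variable across iterations, so goroutines capturing broker directly could all observe the last element; the copy sidesteps that. A self-contained demonstration with placeholder addresses:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	brokers := []string{"a:9092", "b:9092", "c:9092"}
	var wg sync.WaitGroup
	for _, broker := range brokers {
		b := broker // per-iteration copy, safe to capture (pre-Go 1.22)
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Without the copy, pre-1.22 Go could print the last
			// broker three times here.
			fmt.Println(b)
		}()
	}
	wg.Wait()
}
```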
57161ad to 42ee2da (compare)
I rebased on …. The way seedBrokers are handled definitely complicates the code. Is there a reason for the logical separation between user-supplied broker addresses and brokers found via metadata requests?
Per the comment in client.go:
There are also some bits documented in my initial post on #15.
@crxpandion do you plan to revisit this?
The PR as it stands fulfills what was requested - however, I feel that it does not help alleviate the root issues with this section of the code. Therefore I'm inclined to stash this until the client gets sorted out further. Thoughts?
I'm not sure how much more change there is really going to be to the client at this point. The current design certainly has issues, but as per the fairly extensive conversation in the other tickets there doesn't seem to be a magic bullet that doesn't introduce other issues. I wouldn't hold your breath waiting for that to happen.

The client does need to be worked on before this can be tackled properly, but once that is done this will probably have to be rewritten anyways, so there's not much point in leaving the PR open. It can/should be mined for ideas at some point.
Changes broker metadata to refresh concurrently rather than sequentially. If all brokers error out, the last call to `Broker.GetMetadata(...)` that returned an error is exposed and the client enters the resurrection retry loop. This also makes users of `sarama.Client` (e.g. consumer/producer) solely responsible for disconnecting brokers that time out or misbehave. Related to #15.