This repository has been archived by the owner on Mar 24, 2021. It is now read-only.

Be more tolerant with unresponsive brokers #347

Closed
yungchin opened this issue Nov 11, 2015 · 5 comments

Comments

@yungchin
Contributor

This is a half-formed thought, but it appears @rduplain in #189 has been battling with https://issues.apache.org/jira/browse/KAFKA-1387 - the odd problem there is that a broker may be listed in metadata, but will refuse to speak to us upon connecting.

Currently, I think Cluster._update_brokers() stumbles when that happens. What should happen instead is that we just continue, leaving the broker in a disconnected state, because we only really have a problem once it turns out that the broker in question is a Partition.leader - which it probably won't be, so we wouldn't care that it remains disconnected.
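To make the idea concrete, here's a minimal, hypothetical sketch of that tolerant update loop. The `Broker` and `Cluster` classes below are toy stand-ins, not pykafka's actual implementation; the point is just that a connection failure marks the broker as disconnected instead of aborting the whole metadata update.

```python
class Broker:
    """Toy stand-in for a broker connection (not pykafka's Broker)."""
    def __init__(self, broker_id, host, port):
        self.id, self.host, self.port = broker_id, host, port
        self.connected = False

    def connect(self):
        # Simulated connect; a real client would open a socket here.
        if self.host == "unreachable":
            raise ConnectionError("broker refused connection")
        self.connected = True


class Cluster:
    def __init__(self):
        self.brokers = {}

    def update_brokers(self, broker_metadata):
        """Register every broker from metadata, tolerating ones that
        refuse to connect (cf. KAFKA-1387). They stay registered but
        disconnected, which only matters if one turns out to be a
        partition leader."""
        for broker_id, (host, port) in broker_metadata.items():
            broker = Broker(broker_id, host, port)
            try:
                broker.connect()
            except ConnectionError:
                pass  # keep the disconnected broker; don't bail out
            self.brokers[broker_id] = broker
```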

@yungchin
Contributor Author

Just collecting more thoughts here, because I can't fully gauge the extent of the problem yet: I wonder if it would make sense to move away from storing Broker instances on Partition. If we'd instead store only broker-ids and look up the real Broker instance only when accessed, we would automatically be more immune to failed brokers that may feature in replica-listings but aren't leaders on any of the partitions we care to access.

(This is also generally a safer setup, and would have avoided a bug @emmett9001 discovered the other week, where a Partition may hold an old, disconnected Broker instance while a new instance for the same broker-id was correctly available in Cluster.brokers).
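As a rough illustration of the id-based lookup (again a hypothetical sketch, not pykafka's real classes): the Partition holds only the leader's broker-id and resolves the live instance on access, so a stale, disconnected Broker object can never stay pinned to it.

```python
class Cluster:
    """Toy stand-in: just holds the current id -> Broker mapping."""
    def __init__(self):
        self.brokers = {}


class Broker:
    """Toy stand-in for a broker connection object."""
    def __init__(self, broker_id):
        self.id = broker_id


class Partition:
    """Stores only the leader's broker-id, not a Broker instance."""
    def __init__(self, cluster, leader_id):
        self._cluster = cluster
        self._leader_id = leader_id

    @property
    def leader(self):
        # Resolved at access time: always returns whatever instance
        # Cluster.brokers currently holds, never a cached stale one.
        return self._cluster.brokers[self._leader_id]
```

If Cluster replaces the Broker instance for an id (say, after a reconnect), every Partition referencing that id picks up the new instance for free.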

@yungchin
Contributor Author

All this is also related, but not exactly identical, to the issues in #338.

@emmettbutler
Contributor

#358 is related. I'm not super excited about moving away from storing Broker instances on Partition, since if I understand the issue correctly, it can be solved with a much simpler changeset that doesn't run the risk of introducing lots of new bugs.

@emmettbutler
Contributor

This issue is old enough and similar enough and unclear enough that I'm going to close it. Please reopen if you think that's not appropriate.

@yungchin
Contributor Author

Yep, agree that we don't need to move on this right now: I don't think I've seen any related issues opened lately, and the proposed changes would, like you said, be a bit of a risk.

Quick recap in case we revisit this in future (and so that I can remove the "hazy" label ;)):

  • This would be a resilience enhancement, addressing the situation where Kafka responses may contain broker IDs for brokers that are not currently reachable by the client (this happens with KAFKA-1387, but the state could probably arise intermittently in general).
  • The first part of the fix is that Cluster updates should not bail out if any brokers listed in a metadata response turn out unresponsive.
  • A consequence of that change would be that Cluster.brokers may not have a Broker instance for every broker ID that we may encounter in partition metadata. But that's fine if we store broker IDs rather than Broker instances on Partition instances - this is the second part.
