This repository has been archived by the owner on Mar 24, 2021. It is now read-only.

Be more tolerant with unresponsive brokers #347

Closed
yungchin opened this issue Nov 11, 2015 · 5 comments

Comments

@yungchin
Contributor

This is a half-formed thought, but it appears @rduplain in #189 has been battling with https://issues.apache.org/jira/browse/KAFKA-1387 - the odd problem there is that a broker may be listed in metadata, but will refuse to speak to us upon connecting.

Currently, I think Cluster._update_brokers() stumbles when that happens. What should happen instead is that we just continue, leaving the broker in a disconnected state, because we only really have a problem once it turns out that the broker in question is a Partition.leader - which it probably won't be, so we wouldn't care that it remains disconnected.
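To make the idea concrete, here's a minimal, hypothetical sketch of that tolerant update loop. The `Broker` and `Cluster` classes below are toy stand-ins, not pykafka's actual implementation; the point is just that a connection failure marks the broker as disconnected instead of aborting the whole metadata update.

```python
class Broker:
    """Toy stand-in for a broker connection (not pykafka's Broker)."""
    def __init__(self, broker_id, host, port):
        self.id, self.host, self.port = broker_id, host, port
        self.connected = False

    def connect(self):
        # Simulated connect; a real client would open a socket here.
        if self.host == "unreachable":
            raise ConnectionError("broker refused connection")
        self.connected = True


class Cluster:
    def __init__(self):
        self.brokers = {}

    def update_brokers(self, broker_metadata):
        """Register every broker from metadata, tolerating ones that
        refuse to connect (cf. KAFKA-1387). They stay registered but
        disconnected, which only matters if one turns out to be a
        partition leader."""
        for broker_id, (host, port) in broker_metadata.items():
            broker = Broker(broker_id, host, port)
            try:
                broker.connect()
            except ConnectionError:
                pass  # keep the disconnected broker; don't bail out
            self.brokers[broker_id] = broker
```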

@yungchin
Contributor Author

Just collecting more thoughts here, because I can't fully gauge the extent of the problem yet: I wonder if it would make sense to move away from storing Broker instances on Partition. If we'd instead store only broker-ids and look up the real Broker instance only when accessed, we would automatically be more immune to failed brokers that may feature in replica-listings but aren't leaders on any of the partitions we care to access.

(This is also generally a safer setup, and would have avoided a bug @emmett9001 discovered the other week, where a Partition may hold an old, disconnected Broker instance while a new instance for the same broker-id was correctly available in Cluster.brokers).
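As a rough illustration of the id-based lookup (again a hypothetical sketch, not pykafka's real classes): the Partition holds only the leader's broker-id and resolves the live instance on access, so a stale, disconnected Broker object can never stay pinned to it.

```python
class Cluster:
    """Toy stand-in: just holds the current id -> Broker mapping."""
    def __init__(self):
        self.brokers = {}


class Broker:
    """Toy stand-in for a broker connection object."""
    def __init__(self, broker_id):
        self.id = broker_id


class Partition:
    """Stores only the leader's broker-id, not a Broker instance."""
    def __init__(self, cluster, leader_id):
        self._cluster = cluster
        self._leader_id = leader_id

    @property
    def leader(self):
        # Resolved at access time: always returns whatever instance
        # Cluster.brokers currently holds, never a cached stale one.
        return self._cluster.brokers[self._leader_id]
```

If Cluster replaces the Broker instance for an id (say, after a reconnect), every Partition referencing that id picks up the new instance for free.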

@yungchin
Contributor Author

All this is also related, but not exactly identical, to the issues in #338.

@emmettbutler
Contributor

#358 is related. I'm not super excited about moving away from storing Broker instances on Partition, since if I understand the issue correctly, it can be solved with a much simpler changeset that doesn't run the risk of introducing lots of new bugs.

@emmettbutler
Contributor

This issue is old enough and similar enough and unclear enough that I'm going to close it. Please reopen if you think that's not appropriate.

@yungchin
Contributor Author

Yep, agree that we don't need to move on this right now: I don't think I've seen any related issues opened lately, and the proposed changes would, like you said, be a bit of a risk.

Quick recap in case we revisit this in future (and so that I can remove the "hazy" label ;)):

  • This would be a resilience enhancement, addressing the situation where Kafka responses may contain broker IDs for brokers that are not currently reachable by the client (this happens with KAFKA-1387, but the state could probably arise intermittently in general).
  • The first part of the fix is that Cluster updates should not bail out if any brokers listed in a metadata response turn out unresponsive.
  • A consequence of that change would be that Cluster.brokers may not have a Broker instance for every broker ID that we may encounter in partition metadata. But that's fine if we store broker IDs rather than Broker instances on Partition instances - this is the second part.
