Producer hangs when Retry.Max is 0 #294

Closed
eapache opened this issue Feb 23, 2015 · 10 comments
@eapache
Contributor

eapache commented Feb 23, 2015

(This is also poor behaviour in other configurations, in that it can "waste" some of the configured retries, but it is otherwise not a problem: it doesn't break ordering or anything.)

The leaderDispatcher function only knows to check for a new leader when it receives a message with a new retry value, but there are at least two cases (possibly more?) where the flusher fails without ever sending a retriable message back to the leaderDispatcher:

  • all of the messages are at their retry limit, and get returned to the user instead
  • there is a topic/partition led by that broker, but with no messages in the current batch, when the broker connection dies

In both of the above cases, we can end up with a leaderDispatcher whose broker is toast, but it doesn't know that and so never tries to get a new one.

In the normal case, the next message (one not at the retry limit) will get spun around anyway and kick things off the way it should. This is less than ideal, since we waste a retry for the message(s) in question, but not very severe: by default we have three retries configured and typically only need one.

However, in the case where someone sets MaxRetries to 0, every subsequent message is immediately returned to the user and the producer ends up stuck.
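For reference, the triggering configuration looks roughly like this with the current config names (the 2015 API called the setting MaxRetries; this is a sketch of the scenario, not a tested repro):

```go
// Sketch only: config field names are the post-rename (Aug 2015) ones.
config := sarama.NewConfig()
config.Producer.Retry.Max = 0 // failed messages go straight back to the user

// If the cached leader then dies, subsequent sends can hang instead of
// erroring out, because no retried message ever reaches the leaderDispatcher.
producer, err := sarama.NewAsyncProducer([]string{"localhost:9092"}, config)
```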

@eapache
Contributor Author

eapache commented Feb 23, 2015

The following test exercises this issue (it fails while the bug is present); it should be copied into producer_test.go when this is fixed.

edit: moved into test suite with the appropriate Skip call

@wvanbergen
Contributor

I feel the solution is to have a way to actively invalidate (remove) metadata as soon as we encounter a problem, so the subsequent message has no choice but to request fresh metadata. Maybe we should invalidate all cached metadata for a broker in disconnectBroker?

@eapache
Contributor Author

eapache commented Feb 25, 2015

The problem is not actually in the client - the metadata in the client is correctly invalidated in this case. The problem is that the leaderDispatcher keeps a local "cache" of its brokerProducer (via the local output variable), and it relies on a "retried" message to come back in order to invalidate that cache.
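This is not Sarama's actual code, but the shape of the problem can be sketched in a few lines. The message type, highWatermark counter, and broker names below are illustrative stand-ins for the dispatcher's internals: a retried message is the only event that invalidates the locally cached output, so with zero retries configured the cache is never refreshed.

```go
package main

import "fmt"

// message carries a retry count, mirroring how the dispatcher decides
// when to refresh its cached output (names are illustrative).
type message struct {
	value   string
	retries int
}

// dispatch mimics the leaderDispatcher's local cache: it only refreshes
// its brokerProducer when it sees a message with a higher retry count.
func dispatch(input []message) (refreshes int) {
	highWatermark := 0
	output := "broker-1" // locally cached brokerProducer
	for _, msg := range input {
		if msg.retries > highWatermark {
			// A retried message is the ONLY trigger to refetch the leader.
			highWatermark = msg.retries
			output = "broker-2" // pretend we fetched a fresh leader
			refreshes++
		}
		_ = output // "send" to whatever broker is cached, dead or not
	}
	return refreshes
}

func main() {
	// With Retry.Max > 0, a failed message comes back with retries=1
	// and the cache refreshes.
	fmt.Println(dispatch([]message{{"a", 0}, {"a", 1}})) // 1

	// With Retry.Max == 0, the failed message is returned to the user
	// instead, so no retried message ever arrives: the cache is never
	// invalidated even though the broker is dead.
	fmt.Println(dispatch([]message{{"a", 0}, {"b", 0}})) // 0
}
```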

@eapache
Contributor Author

eapache commented Feb 25, 2015

Once #300 happens there are several easy ways to solve this. One of them is to fetch the brokerProducer on every iteration of the leaderDispatcher (which would require tweaking how we reference-count those slightly). Another is to send a dummy "trigger" message (using flags) around on failure to all the affected leaderDispatchers, even if they didn't have a message in the batch.
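The second option can be sketched as follows; the trigger flag, pMessage type, and fetchLeader helper are all hypothetical stand-ins, not Sarama's real internals:

```go
package main

import "fmt"

// A sketch of the "dummy trigger message" idea: on a broker failure, the
// flusher sends a flagged, payload-free message to every affected
// leaderDispatcher, even those with no real messages in the failed batch.
type flagSet int8

const trigger flagSet = 1 // hypothetical flag meaning "invalidate your leader"

type pMessage struct {
	flags flagSet
	value string
}

// route models one iteration of the dispatcher loop: a trigger message
// drops the cached brokerProducer so the next fetch gets a fresh leader.
func route(msg pMessage, cachedLeader *string) string {
	if msg.flags&trigger != 0 {
		*cachedLeader = "" // invalidate the local cache
		return ""          // dummy messages carry no payload to send
	}
	if *cachedLeader == "" {
		*cachedLeader = fetchLeader()
	}
	return *cachedLeader
}

func fetchLeader() string { return "broker-2" } // stand-in for a metadata lookup

func main() {
	leader := "broker-1" // cached, but this broker is actually dead
	route(pMessage{flags: trigger}, &leader)
	fmt.Println(route(pMessage{value: "hello"}, &leader)) // broker-2
}
```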

eapache added a commit that referenced this issue Mar 10, 2015
Skip it until the bug is fixed, but at least it will keep up with the API
changes now. Before it was inline in the ticket, and was falling behind.
eapache added a commit that referenced this issue Mar 10, 2015
@eapache eapache changed the title from "Producer hangs when MaxRetries is 0" to "Producer hangs when Retries.Max is 0" Aug 13, 2015
@eapache eapache changed the title from "Producer hangs when Retries.Max is 0" to "Producer hangs when Retry.Max is 0" Aug 13, 2015
@wkuranowski

Hello. So it looks like my issue #509 is a duplicate of this bug. Do you have any plan to fix that issue?

Now I am trying to configure the built-in retry mechanism, but I have a question. Are you retrying network and circuit-breaker errors? I am using a round-robin partitioner, and it would be great if you could automatically reassign the partition in case of a circuit-breaker or network error.

@eapache
Contributor Author

eapache commented Aug 13, 2015

Do you have any plan to fix that issue?

Yes, it's just slow going, as it mostly depends on the refactoring in #300. That is almost done, however.

Are you retrying network and circuit breaker errors?

The producer retries only errors resulting from actual Produce requests. Errors from metadata requests and similar are not retried in the producer because they are already retried at the client level. Errors from circuit-breakers are not retried, because that would defeat the entire purpose of having the circuit-breaker.

@wkuranowski

Errors from metadata requests and similar are not retried in the producer because they are already retried at the client level.

And what about network errors when producing a message? Is this the same case?

Errors from circuit-breakers are not retried, because that would defeat the entire purpose of having the circuit-breaker.

You can automatically assign an available partition when using the random or round-robin partitioner. Ordering is not important for those messages.

@eapache
Contributor Author

eapache commented Aug 13, 2015

And what about network errors when producing a message?

Network errors when producing a message are retried.

You can automatically assign an available partition when using the random or round-robin partitioner. Ordering is not important for those messages.

Hmm, yes that would be a good improvement. Perhaps file a separate enhancement request to track this idea.

@varun06
Contributor

varun06 commented Feb 20, 2019

Does anyone know if this is still an issue? Can we close it now? @bai @eapache

@bai
Contributor

bai commented Feb 20, 2019

Hmmm, doesn't seem so — I'm going to close it but feel free to reopen if you think otherwise 🙌

@bai bai closed this as completed Feb 20, 2019