
Consumer stops consuming after broker transport failure #548

Closed
Giska opened this issue Dec 21, 2018 · 14 comments
Labels: stale

Comments


Giska commented Dec 21, 2018

Hi,

We are encountering a problem with consumers that stop delivering new messages to the 'data' listener.
This seemingly happens after a broker becomes temporarily unavailable (broker transport failure), but only rarely. We have observed it on several different consumers on different topics with similar configurations, seemingly at random (most of the time the consumers resume operation after a broken broker connection).

The consumer is still synchronized with its consumer group (which consists of a single consumer for one topic of 5 partitions), and the high offsets increase as new messages arrive on the partitions, but the consumer lag keeps growing and the messages are seemingly never consumed.

We observed the following sequence of events, after which every partition of the topic stopped being consumed:

  • This 'event.error' seems to indicate the beginning of the problem: Error: broker transport failure

  • After this, no stats are logged again, although they were being logged every second before that.

  • 10 seconds after the error, the consumer stops fetching from every partition of the topic, with these two log events emitted for each partition:

{ severity: 7, fac: 'FETCH' } [thrd:BROKER_IP:9092/0]: BROKER_IP:9092/0: Topic TOPIC_NAME [3] in state active at offset 39611 (10/10 msgs, 0/40960 kb queued, opv 6) is not fetchable: queued.min.messages exceeded

{ severity: 7, fac: 'FETCHADD' } [thrd:BROKER_IP:9092/0]: BROKER_IP:9092/0: Removed TOPIC_NAME [3] from fetch list (0 entries, opv 6)

  • This happens at a time when no new messages are available (these partitions receive infrequent messages that arrive at set times in this test environment), and the 'data' listener does not receive any messages, so it is not clear to us why the queue would be full.

Probably linked to #182.
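For reference, this is roughly how we attach the listeners that produced the output above; the broker list, group id and 'debug' value here are illustrative, not our exact settings:

const Kafka = require('node-rdkafka');

// Illustrative diagnostics setup (placeholder broker list and group id).
const consumer = new Kafka.KafkaConsumer({
  'metadata.broker.list': 'BROKER_IP:9092',
  'group.id': 'example-group',
  'statistics.interval.ms': 1000, // emits 'event.stats' every second
  'debug': 'fetch'                // emits the FETCH / FETCHADD lines via 'event.log'
}, {});

consumer.on('event.error', (err) => console.error('event.error:', err));
consumer.on('event.stats', (stats) => console.log('stats:', stats));
consumer.on('event.log', (log) => console.log({ severity: log.severity, fac: log.fac }, log.message));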

Environment Information

  • OS: Debian Stretch
  • Node Version: 8.11.0
  • node-rdkafka version: 2.4.2

Consumer configuration

// messageMaxBytes is assumed to match 'message.max.bytes' below
const messageMaxBytes = 150 * 1024 * 1024; // 150 MB

const consumerConfig = {
  'api.version.request': true,
  'message.max.bytes': 150 * 1024 * 1024, // 150 MB
  'receive.message.max.bytes': messageMaxBytes * 1.3,
  // Logging
  'log.connection.close': true,
  'statistics.interval.ms': 1000,
  // Consumer-specific rdkafka settings
  'group.id': group_id, // application-specific consumer group name
  'auto.commit.interval.ms': 2000,
  'enable.auto.commit': true,
  'enable.auto.offset.store': true,
  'enable.partition.eof': false,
  'fetch.wait.max.ms': 100,
  'fetch.min.bytes': 1,
  'fetch.message.max.bytes': 20 * 1024 * 1024, // 20 MB
  'fetch.error.backoff.ms': 0,
  'heartbeat.interval.ms': 1000,
  'queued.min.messages': 10,
  'queued.max.messages.kbytes': Math.floor(40 * 1024), // 40 MB
  'session.timeout.ms': 7000,
};
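
The consumer itself runs in flowing mode, roughly like this (the topic name is a placeholder, and the broker list is omitted from the snippet above):

const Kafka = require('node-rdkafka');

// Flowing-mode consumption using the configuration above.
const consumer = new Kafka.KafkaConsumer(consumerConfig, {});

consumer.connect();

consumer.on('ready', () => {
  consumer.subscribe(['TOPIC_NAME']);
  consumer.consume(); // no arguments: messages are pushed to the 'data' listener
});

consumer.on('data', (message) => {
  console.log(`partition ${message.partition}, offset ${message.offset}`);
});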
@bobzsj87

Same behaviour and the same "broker transport failure" error. The consumer stops, and we can see the topic lag it causes. We have to restart the whole thing.

@carlessistare

@webmakersteve just pinging here too, since this problem is tracked across multiple issues and, in my opinion, it's pretty critical: recovering from it in prod environments is not easy.


ivan83 commented Apr 12, 2019

@webmakersteve +1
This issue has been popping up in our prod environment since we started using this connector.
Most of the time the connector recovers, but every once in a while it becomes unresponsive.
So each day we have at least one consumer stopping at a random time of day.


mvtm-dn commented Apr 15, 2019

@carlessistare IMHO there is a bug in librdkafka. My observations suggest that the consuming thread stops inside the library. An indirect sign of this is the "solution" in issue #222.

@smaheshw

Same issue on our side. Has anybody found a working solution for this? It is now extremely critical for our project.

@aakashkharche04

We are also facing the same issue. Is there any fix for it?

@RaajBadra

I'm also facing the same issue. It is critical and has to be fixed. Is there a working solution?

@danielAnguloG

Hello, is there any update on this issue, or a possible workaround?


cravi24 commented Aug 5, 2019

We are also facing the same issue. Should we switch to non-flowing mode for the time being, until a fix is available?
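
For context, non-flowing mode would look roughly like this (broker list, topic name and poll settings are purely illustrative):

const Kafka = require('node-rdkafka');

// Non-flowing mode sketch: poll bounded batches instead of streaming via consume().
const consumer = new Kafka.KafkaConsumer({
  'group.id': 'example-group',
  'metadata.broker.list': 'BROKER_IP:9092'
}, {});

consumer.connect();

consumer.on('ready', () => {
  consumer.subscribe(['TOPIC_NAME']);
  setInterval(() => {
    consumer.consume(10, (err, messages) => { // fetch at most 10 messages per poll
      if (err) return console.error(err);
      messages.forEach((m) => console.log(m.partition, m.offset));
    });
  }, 1000);
});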

@NeoyeElf

Is there any progress on this?

@edenhill

Check the librdkafka release notes; it might be time to upgrade the librdkafka version bundled with node-rdkafka.
https://github.com/edenhill/librdkafka/releases
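
You can check which librdkafka version your node-rdkafka build bundles with something like this (assuming the librdkafkaVersion and features exports):

const Kafka = require('node-rdkafka');

// Compare this against the librdkafka release notes linked above.
console.log('librdkafka version:', Kafka.librdkafkaVersion);
console.log('features:', Kafka.features);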


funduck commented Aug 28, 2019

Had the same issue. First I added a restart to my app every N minutes, then I switched to another lib, which is quite good for consuming messages but slow for producing. I compared them here.
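
The periodic restart was roughly the following; createConsumer is a placeholder for our own factory that returns a connected, subscribed KafkaConsumer:

// Crude watchdog workaround: tear the consumer down and rebuild it every N minutes.
const RESTART_INTERVAL_MS = 15 * 60 * 1000; // N = 15 here, purely illustrative

let consumer = createConsumer(); // hypothetical factory, not part of node-rdkafka

setInterval(() => {
  const old = consumer;
  old.disconnect(() => {
    consumer = createConsumer();
  });
}, RESTART_INTERVAL_MS);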


stale bot commented Nov 26, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale bot added the stale label on Nov 26, 2019
stale bot closed this as completed on Dec 3, 2019

l8on commented Aug 6, 2020

We are noticing a similar issue. It seems like an update to the version of librdkafka that is used by this module might be worth a try. Is there anything the community can do to help move that along?
