Long-lived pykafka consumer occasionally freezes. #189
@rduplain are you using the BalancedConsumer or SimpleConsumer? Also, any logs from around the time the consumer freezes?
I heard about this issue through @amontalenti. The questions you're asking, @msukmanowsky, are exactly what I'd like to see answered to help address the problem. I opened the issue because it didn't look like one already existed. This is now an open call for anyone with details to post them to this thread.
Gotcha! We'll try to contribute what we've seen around the times our BalancedConsumer does this. I think @kbourgoin or @emmett9001 may have some historical logs (though I don't think they tell much). Pretty sure this issue is isolated to the BalancedConsumer and likely comes down to threading issues; I don't think the SimpleConsumer suffers from any problems here.
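For readers landing here without context, the two consumer types being discussed are set up roughly like this. This is a minimal sketch only; the broker and ZooKeeper addresses and the topic/group names are placeholders:

```python
from pykafka import KafkaClient

client = KafkaClient(hosts="127.0.0.1:9092")   # placeholder broker address
topic = client.topics[b"test.topic"]            # placeholder topic

# SimpleConsumer reads partitions directly, with no cross-process coordination.
simple = topic.get_simple_consumer(consumer_group=b"mygroup")

# BalancedConsumer coordinates through ZooKeeper to split partitions across
# group members, which is the rebalancing machinery implicated in this issue.
balanced = topic.get_balanced_consumer(
    consumer_group=b"mygroup",
    zookeeper_connect="127.0.0.1:2181",         # placeholder ZooKeeper address
)
```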
As noted in comments elsewhere, I mentioned to @kbourgoin seeing lots of freezes in the test suite - always on tests involving a BalancedConsumer. I need to figure out how to get nose to give me more info, as --nologcapture doesn't seem to do the trick.
Ok, just caught log output from a stalled test. The test in question is this one, and the log output is in a gist here. The test was run like so:
Is this a rebalancing loop? Looks like it rebalanced at least ten times in quick succession.
Could that be because the topic/partitions have only just been created (in test setup) and so the brokers are still settling? I've updated the gist with sections of the server logs, starting from 21:52 (so about 20 seconds before the test starts).
Now that I have the test runner going, I am seeing behavior similar to what @yungchin describes:
I am running this, as modeled after
The test suite is stalling at:
For my environment, I am on Ubuntu 14.04 64-bit with a source-compiled Python 2.7.10, and I haven't yet given any special consideration to my Java installation:
Interesting, the tests have never shown this stalling behavior for me.
As discussed in #189 this test would previously stall randomly, waiting forever until you'd hit ctrl-c. The change demonstrates that the problem is test-suite related: we put a timeout on the consumer, and as a result we get a test error whenever None (== timed out) is returned from consume(), instead of the stalling.

In log output, we see one of two things whenever the test errors out. Either we get the wrong consumer offset straight away: DEBUG - pykafka.simpleconsumer - Set offset for partition 2 to 0 (where offset -1 is expected, on a fresh topic), or INFO - pykafka.simpleconsumer - Resetting offsets for 1 partitions (which, by default settings, will jump to OffsetType.LATEST, i.e. past the freshly produced message we want to get). Neither of these should occur if the test harness hands us a newly created topic, as we expect.

On a side note, this resolves #179 because we now test with a binary string, and that works (when it doesn't stall).

Signed-off-by: Yung-Chin Oei <yungchin@yungchin.nl>
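For reference, the timeout approach described above boils down to something like this (a sketch; the 5-second value is arbitrary, and `topic` stands in for the test harness's freshly created topic):

```python
# consumer_timeout_ms makes consume() return None after the interval
# instead of blocking indefinitely when no message arrives.
consumer = topic.get_simple_consumer(consumer_timeout_ms=5000)

message = consumer.consume()
if message is None:
    raise AssertionError("timed out waiting for the freshly produced message")
```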
It could be related to whether you keep the test cluster up between tests. I usually follow the recipe laid out in

That is also to say, I think I may have confused this bug report by suggesting that the test stalling is related to the original report at all. I've just pushed a test fix for the stalling thing.
Possibly related to #260
I am experiencing the same behaviour on a staging environment where I am testing pykafka with three balanced consumers connected to a single broker. After a while the consumers will just stop consuming messages; however, in the log output I can see that they keep rebalancing and they keep committing their current offsets.

Here is an excerpt of the debug log, where I have indicated the point after which no more messages will be consumed by one particular consumer: https://gist.github.com/aeneaswiener/aa38495c7e7d9ff3d74c

Also, below you can see the progress of the consumer group in light blue, and I have indicated the point where all consumers of the topic have stopped consuming messages.

Finally, looking at the Kafka broker log output after the consumers have stopped, I can see the following error message:
This keeps recurring until I manually shut down the consumer processes.
Hi, I'd just like to comment that this exact thing happened to us once after running BalancedConsumers for about a month or two. Restarting the consumers made them continue consuming fine. This occurred a while ago, so I don't have any relevant logs. If it happens again, I'll capture some info and post it here.
I can also confirm that restarting the consumers fixes the issue. For us the issue happens reproducibly after about 10 minutes or so in our staging environment, where we are consuming messages at a rate of thousands per second. @ottomata did you also see the above error log message in the Kafka broker? ( |
Interesting. We only ever see this behavior in our Travis and local test environments - we run PyKafka consumers in production at Parse.ly without seeing this. It's possible that the issue is invisible to us because we use Storm to manage the automatic restarting of consumers, so maybe we can look into that.
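Outside of Storm, a crude supervisor loop along these lines can paper over the freeze by rebuilding the consumer whenever consumption stalls. This is only a sketch: the timeouts and stall threshold are arbitrary, and `handle` is a hypothetical message-processing callback:

```python
import logging

from pykafka import KafkaClient

log = logging.getLogger(__name__)


def consume_with_restarts(hosts, zk_hosts, topic_name, group, handle):
    """Rebuild the BalancedConsumer whenever it appears to have frozen."""
    client = KafkaClient(hosts=hosts)
    topic = client.topics[topic_name]
    while True:
        consumer = topic.get_balanced_consumer(
            consumer_group=group,
            zookeeper_connect=zk_hosts,
            consumer_timeout_ms=10000,  # consume() returns None instead of blocking
        )
        idle_polls = 0
        try:
            while idle_polls < 6:  # ~1 minute without messages: assume it's wedged
                message = consumer.consume()
                if message is None:
                    idle_polls += 1
                    continue
                idle_polls = 0
                handle(message)
        finally:
            consumer.stop()
        log.warning("consumer looked stalled; restarting it")
```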
@aeneaswiener could you maybe repeat that with additional debug logging - something like
The thing that strikes me in your logging output is that at some point it mentions fetching for partition 15, but 15 doesn't show up again further down the line, even though you're steadily consuming. Then it alternates between 19 and 18 for a bit until finally it only fetches for 18. So I wonder if we're leaking partition locks somehow, even if I can't see how from just staring at the code.
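The exact snippet isn't preserved above, but turning up pykafka's debug output generally looks something like this (the logger names are just the standard module paths):

```python
import logging

# Make sure DEBUG records are actually emitted somewhere.
logging.basicConfig(level=logging.DEBUG)

# Or narrow it to the consumer modules if package-wide DEBUG is too noisy.
logging.getLogger("pykafka.balancedconsumer").setLevel(logging.DEBUG)
logging.getLogger("pykafka.simpleconsumer").setLevel(logging.DEBUG)
```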
+1 for this issue. I am also seeing the PyKafka Balanced consumer stop consuming after a period of time. I've seen this anywhere from 30 minutes to 4 hours after establishing the connection. I'm still seeing this message:
but the consumer offsets are no longer changing.
Here is the traceback that I saw come across right before the consumer stopped consuming:
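One way to confirm from outside the process that the group's committed offsets really have stopped advancing, as described above, is a check along these lines (a sketch against pykafka's offset APIs; broker address, topic, and group names are placeholders):

```python
from pykafka import KafkaClient

client = KafkaClient(hosts="127.0.0.1:9092")     # placeholder broker address
topic = client.topics[b"my.topic"]                # placeholder topic

# Newest offset per partition, straight from the brokers.
head = {pid: res.offset[0]
        for pid, res in topic.latest_available_offsets().items()}

# Offsets last committed by the (possibly stalled) consumer group.
checker = topic.get_simple_consumer(consumer_group=b"mygroup")
committed = {pid: res.offset for pid, res in checker.fetch_offsets()}

# Run this periodically: if "committed" stops moving while "latest" keeps
# growing, the consumer group has stalled.
for pid in sorted(head):
    print("partition %d: latest %d, committed %s"
          % (pid, head[pid], committed.get(pid)))
```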
@jasonrhaas if I'm reading that traceback correctly, I believe what's happening is that we get a zookeeper notification, presumably because one of the brokers has become unresponsive (maybe intermittently), and it happens to be the broker that was the coordinator for our consumer group's offsets, so when we try to write our offsets (which we want to do before restarting the consumer) we hit a socket timeout. (About that - can I ask, are you on Linux, or some BSD? Just curious, because I thought sockets didn't have any timeout by default.)

So actually @emmett9001 is working on handling such errors in #327 right now (only so far we only expected

@emmett9001 In addition, I'm wondering if
@yungchin Zookeeper, Kafka, and the consumer client that is running Storm are all running Linux CentOS 6.5. The fact that I'm seeing error messages from Kazoo points towards losing a zookeeper connection, although I don't seem to have this problem with all the other consumers I have running.
Thanks @jasonrhaas. Yes, the error is raised on a thread spawned by kazoo, but I believe the bug is in pykafka code. The traceback you posted comes out of a callback that we register with kazoo, so that we are notified when there are broker availability changes (and I suspect that in your case, it triggers because a broker became temporarily unresponsive - and long enough for zookeeper to notice it). The bug in pykafka would then be that while we receive this notification we still try to write offsets to the unresponsive broker. (In surmising all this, I am going by the logging excerpt you posted, which only has tracebacks containing

@emmett9001 has just merged #331 to master, which doesn't fix the problem yet, but improves the situation by surfacing this exception from the callback to the main thread, so now you'd see it when you call

I'll also make a note on #327 about the fact that you encounter
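In caller code, handling the newly surfaced exception might look something like this. A sketch only: the recovery policy of simply rebuilding the consumer is my own assumption, not anything prescribed by #331, and the group/ZooKeeper values are placeholders:

```python
from pykafka.exceptions import SocketDisconnectedError


def consume_or_rebuild(topic, consumer, group, zk_hosts):
    """Consume one message; on a dropped broker socket, rebuild the consumer."""
    try:
        return consumer.consume(), consumer
    except SocketDisconnectedError:
        # The offset commit/fetch hit an unresponsive broker; start fresh
        # rather than leaving the consumer wedged.
        consumer.stop()
        consumer = topic.get_balanced_consumer(
            consumer_group=group,          # e.g. b"mygroup"
            zookeeper_connect=zk_hosts,    # e.g. "127.0.0.1:2181"
        )
        return None, consumer
```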
So running this again, and tailing the logs the whole time, the cause of death now seems to be this:
It runs smoothly for about 15-20 minutes, and then this happens. The topic I'm consuming has 3 partitions, and I'm using a Balanced consumer with :p of 2 in streamparse to read the topic. So if I look at the lag it might look like this:
Note: I was running a Kafkacat consumer on the same topic at the same time to make sure there weren't any issues on the producer/topic side. My Kafkacat consumer kept going after the PyKafka consumer stopped consuming messages.
There are three different values of
The value in @jasonrhaas's most recent comment is
It looks like we need to look for
#341 has been merged and I'll be putting it into a
I'd like to add that this may be the same problem that I saw. We've been doing some rolling broker restarts this week, and I regularly see SocketDisconnectedError followed by stopped consumption (or stopped offset commits? Can't tell).
I'm pretty sure this issue encompasses several that have since been solved, and it's hard to pick apart the remaining problems. I'm closing this issue in favor of updated ones, specifically #347. If anyone objects and feels this issue should stay open, feel free to respond/reopen.
Reporting this bug as I've heard about it through @amontalenti. I don't have any information myself, as I'm new to the project, but I want to post this issue to collect reports from others.