Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NullPointerException caused zkClient to close socket with zookeeper and kafka broker not functioning #71

Open
chaloulo opened this issue Jun 18, 2018 · 6 comments

Comments

@chaloulo
Copy link

Hello,

We use kafka version 0.10.2.1 which uses zkclient 0.10 (zkclient-0.10.jar). We use a setup of 3 kafka brokers and 3 zookeeper nodes. Both kafka and zookeeper run in docker containers. We have seen a few occasions where one kafka broker remains running, but there is no connection with zookeeper (netstat reports no socket connected to port 2181). This is probably caused by a transient network issue or transient DNS resolution issue, however it seems that the transient issue becomes permanent and the broker never recovers. Instead the broker remains out of sync with the other 2 brokers. I can recover it only by restarting the broker.
I can see the following from the broker logs (unfortunately no debug logs are available):

2018-06-16T00:12:14.357000000Z org.apache.zookeeper.ClientCnxn$SendThread: Client session timed out, have not heard from server in 26679ms for sessionid 0x16405b171770003
2018-06-16T00:12:30.795000000Z org.apache.zookeeper.ClientCnxn$SendThread: Unable to reconnect to ZooKeeper service, session 0x16405b171770003 has expired
2018-06-16T00:12:33.811000000Z org.I0Itec.zkclient.ZkEventThread: Could not establish session with zookeeperorg.I0Itec.zkclient.exception.ZkException: Unable to connect to zookeeper.tas01.local:2181/eventqueue
org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:72)
org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1289)
org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:721)
org.I0Itec.zkclient.ZkClient.process(ZkClient.java:632)
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)
2018-06-16T00:12:33.816000000Z org.I0Itec.zkclient.ZkEventThread: Could not establish session with zookeeperorg.I0Itec.zkclient.exception.ZkException: Unable to connect to zookeeper.tas01.local:2181/eventqueue
org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:72)
org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1289)
org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:721)
org.I0Itec.zkclient.ZkClient.process(ZkClient.java:632)
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)
2018-06-16T00:12:33.817000000Z org.I0Itec.zkclient.ZkEventThread: Error handling event ZkEvent[Children of /config/changes changed sent to kafka.common.ZkNodeChangeNotificationListener$NodeChangeListener$@75be6c7a]java.lang.NullPointerException
org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:115)
org.I0Itec.zkclient.ZkClient$5.call(ZkClient.java:703)
org.I0Itec.zkclient.ZkClient$5.call(ZkClient.java:700)
org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:991)
org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:700)
org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:709)
org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:842)
org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:72)
2018-06-16T00:12:33.818000000Z org.I0Itec.zkclient.ZkEventThread: Error handling event ZkEvent[Data of /controller changed sent to kafka.server.ZookeeperLeaderElector$LeaderChangeListener@57f56c84]java.lang.NullPointerException
org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:115)
org.I0Itec.zkclient.ZkClient$5.call(ZkClient.java:703)
org.I0Itec.zkclient.ZkClient$5.call(ZkClient.java:700)
org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:991)
org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:700)
org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:820)
org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:72)

I checked the last log ("Error handling event ..."). Looking at the code it seems that the NullPointerException is caught here: link
However after the exception is caught the while loops seems to continue, and no more events arrive, possibly. There does not seem to be any re-connection attempts or even a process restart. This assumption matches with what I saw in the particular environment. The kafka broker remained running but there was no connection with zookeeper (no connected socket reported by netstat) and the broker did not respond to any events. Also the broker was out of sync for all topics/partitions. A broker restart fixed the problem.
Is the above assumption valid, or did I miss something? Do you think this is a bug of the zkClient code? How can we ensure that the broker will reconnect to zookeeper?

I found a similar issue already reported: KAFKA-2182. This issue appears to be resolved, which is puzzling.

Best regards,

Klearchos

@chaloulo chaloulo changed the title NullPointerException caused zkClient to close socker with zookeeper and kafka broker not functioning NullPointerException caused zkClient to close socket with zookeeper and kafka broker not functioning Jun 18, 2018
@jzillmann
Copy link
Collaborator

Looking at the logs and the code i think this is what happens:

  • There is an expire event (delivered to ZkClient#processStateChanged())
  • There is exactly one attempt to reconnect (thats how the code is), which fails as well
  • The reconnect calls close() on the ZkConnection which sets it's _zk instance to null
  • All NullPointerException happen on the close ZkConnection

So much to the analysis...
In case the one reconnect would have been successful, there would be no problem...

So there are many things to improve here...

  • Thinking about more reconnect options (like don't give up after the first try)
  • Also what should happen with calls on the closed ZkConnection... Could they be freezed until the the connection is back again ? At least the error message could be improved...

Anyone who wants to work on that stuff ?

@harshach
Copy link
Contributor

harshach commented Jul 3, 2018

@jzillmann I am interested in taking this up.

@jzillmann
Copy link
Collaborator

@harshach So what would you need for that ?

@harshach
Copy link
Contributor

@jzillmann working on a patch will open a PR. Thanks.

@harshach
Copy link
Contributor

@jzillmann opened a PR that resolves this issue when there is DNS or cached ip issue happens.
#76
Can you please review and we are also looking for a release. Thanks.

@aaafenghao
Copy link

I have a similar problem. Has it been solved

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants