New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix connect concurrency, can cause connection nodes to close #6964
Fix connect concurrency, can cause connection nodes to close #6964
Conversation
try { | ||
nodeChannels.close(); | ||
} finally { | ||
logger.debug("disconnected from [{}]", node); | ||
logger.debug("disconnected from [{}] due to explicit disconnect call", node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should move this above the nodeChannels.close() so it will be logged before an eventual consequence.
Left two minor comments. LGTM. |
Looking at the connect code, if 2 threads at the same time try and connect to a node, and both enter sequentially the connectLock code block, the second one would try and put the connection in the map, and close the replaced channels, which will cause the existing connection to close as well (since it removes the node from the connectedNodes map) To fix this, simply make sure we properly check the existence of the connection within the connectionLock block, so there won't be concurrent connections going on. While doing this, also went over all the mutation code that handles disconnections, and made sure they are properly done only within a connection lock. closes elastic#6964
Looking at the connect code, if 2 threads at the same time try and connect to a node, and both enter sequentially the connectLock code block, the second one would try and put the connection in the map, and close the replaced channels, which will cause the existing connection to close as well (since it removes the node from the connectedNodes map) To fix this, simply make sure we properly check the existence of the connection within the connectionLock block, so there won't be concurrent connections going on. While doing this, also went over all the mutation code that handles disconnections, and made sure they are properly done only within a connection lock. closes #6964
Would this issue cause intermittent exceptions with the following messages? org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes were available We are running ES 1.3.2 and seeing intermittent cases where we get these 2 exceptions on the JVMs that run the transport client even though there is no load on the cluster or any sort of networking issues between the JVM running the transport client and cluster. |
@andreasch I think it might, if you only have one node configured in your transport client? also if you turn on debugging logging it should see some complaints about node not connected. |
@bleskes "Node not connected" seems to be the root exception message for all of our exceptions with the message "org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes are available" |
Ok. sounds plausible then. But as these things go - it's hard to be sure.. |
Looking at the connect code, if 2 threads at the same time try and connect to a node, and both enter sequentially the connectLock code block, the second one would try and put the connection in the map, and close the replaced channels, which will cause the existing connection to close as well (since it removes the node from the connectedNodes map)
To fix this, simply make sure we properly check the existence of the connection within the connectionLock block, so there won't be concurrent connections going on.
While doing this, also went over all the mutation code that handles disconnections, and made sure they are properly done only within a connection lock.