Repository-hdfs plugin not always closing tcp connexions #220

jubagarie · 2014-06-19T10:53:47Z

I'm using the "repository-hdfs" plugin to store snapshots on HDFS with Elasticsearch 1.1.1. It seems that Elasticsearch doesn't properly close the TCP connections after a snapshot is created.

The result for me was "too many open files" errors in the Elasticsearch logs. Using the "lsof" command, I found more than 50k TCP connections in the CLOSE_WAIT state, and as many open file descriptors.

costin · 2014-06-19T11:07:29Z

What version of the repository-hdfs plugin are you using? Any information on where the TCP connections point to? Also what version of hadoop are you using?

Thanks

jubagarie · 2014-06-19T12:19:24Z

I'm using:

hadoop 2.0.0 (cdh 4.1.2)
repository-hdfs 2.0.0-light
Debian Wheezy

The TCP connections point to the nodes of my Hadoop cluster. Here is an extract of the "lsof" output for the Elasticsearch process:

java    68073 elasticsearch 6747u  IPv4           15823974       0t0      TCP es08:57227->cdh4worker04:50010 (CLOSE_WAIT)
java    68073 elasticsearch 6748u  IPv4           15824908       0t0      TCP es08:57651->cdh4worker05:50010 (CLOSE_WAIT)
java    68073 elasticsearch 6749u  IPv4           15818656       0t0      TCP es08:54883->cdh4worker12:50010 (CLOSE_WAIT)

I noticed that even a few hours after the last snapshot, the connections are still not closed.
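
For anyone who wants to track how the leak grows, here is a minimal sketch that tallies CLOSE_WAIT connections per remote endpoint from "lsof" output shaped like the extract above. The function name `count_close_wait` and the regular expression are illustrative assumptions, not part of the plugin or of Hadoop; port 50010 in the sample is the one shown in the extract.

```python
import re
from collections import Counter

# Matches the NAME column of an lsof TCP line, for example:
#   TCP es08:57227->cdh4worker04:50010 (CLOSE_WAIT)
# Groups: local host, local port, remote host, remote port, TCP state.
_TCP_RE = re.compile(r"TCP\s+(\S+):(\d+)->(\S+):(\d+)\s+\((\w+)\)")


def count_close_wait(lsof_output):
    """Count CLOSE_WAIT connections, keyed by the remote 'host:port'."""
    counts = Counter()
    for line in lsof_output.splitlines():
        match = _TCP_RE.search(line)
        if match and match.group(5) == "CLOSE_WAIT":
            counts[match.group(3) + ":" + match.group(4)] += 1
    return counts


# The three lines quoted in the comment above.
SAMPLE = """\
java 68073 elasticsearch 6747u IPv4 15823974 0t0 TCP es08:57227->cdh4worker04:50010 (CLOSE_WAIT)
java 68073 elasticsearch 6748u IPv4 15824908 0t0 TCP es08:57651->cdh4worker05:50010 (CLOSE_WAIT)
java 68073 elasticsearch 6749u IPv4 15818656 0t0 TCP es08:54883->cdh4worker12:50010 (CLOSE_WAIT)
"""

if __name__ == "__main__":
    for remote, total in count_close_wait(SAMPLE).most_common():
        print(remote, total)
```

Feeding it the real output of "lsof" for the Elasticsearch process at regular intervals should show whether the count keeps growing after each snapshot.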

Thanks for your help

costin · 2014-06-19T13:36:57Z

I've done some quick searches and it looks like this is likely caused by Hadoop itself. For example, see this thread and this issue.
I'll try to rework the code so that the filesystem instance is disposed and created per action but if the connections are leaking, this will not help much...
http://archive.cloudera.com/cdh4/cdh/4/mr1-0.20.2+1215.releasenotes.html

costin · 2014-06-19T13:46:18Z

@jubagarie Can you try a quick fix? Hope it forces hadoop to close the connections. After you do the backup, can you try unregistering the repository and see whether it has any effect on the number of connections opened?
This causes the plugin to close the underlying HDFS FS which is the only possible fix we can apply.

Let me know how it goes - thanks!

jubagarie · 2014-06-19T14:41:55Z

Unfortunately, unregistering the repository and registering it again doesn't affect the number of connections.

I can also add that my server version of hadoop-hdfs is CDH 4.1.2, but my Elasticsearch servers are running CDH 4.6.0. Both of them seem to include the patch created for the issue you found.

Thanks

costin · 2014-06-19T14:53:20Z

Hmm, I'm afraid I'm not sure what else can be done on this front. Can you confirm in the ES logs that the file-system is created again when you register it again after unregistering it?
As I've mentioned the only thing we can do is close the FS faster than we are currently doing it. We are not opening or closing connections manually, rather we are just clients of Hadoop's FileSystem interface...
I'm wondering whether using concurrent_streams of 1 has any impact on the number of opened connections and thus potentially slows the growth down...
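
The unregister/re-register cycle suggested earlier, with concurrent_streams lowered to 1, could be scripted roughly as follows. This is a sketch under assumptions: the Elasticsearch address, the HDFS URI and the path are placeholders, and the `uri` and `path` setting names are what I believe the plugin README uses, so check them against your plugin version. The repository name `hdfs` is the one that appears in the logs in this thread.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a node reachable locally
REPO = "hdfs"                     # repository name seen in the logs


def unregister_request(base_url=ES_URL, repo=REPO):
    """Build the DELETE request that unregisters the snapshot repository."""
    url = base_url + "/_snapshot/" + repo
    return urllib.request.Request(url, method="DELETE")


def register_request(uri, path, concurrent_streams=1,
                     base_url=ES_URL, repo=REPO):
    """Build the PUT request that registers the repository again."""
    body = {
        "type": "hdfs",
        "settings": {
            "uri": uri,    # assumed setting name, e.g. hdfs://<host>:<port>/
            "path": path,  # assumed setting name
            "concurrent_streams": concurrent_streams,
        },
    }
    return urllib.request.Request(
        base_url + "/_snapshot/" + repo,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def send(request):
    """Send a prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling `send(unregister_request())` followed by `send(register_request("hdfs://namenode:8020/", "elasticsearch/snapshots"))` (both values hypothetical) performs the cycle; counting the CLOSE_WAIT connections before and after shows whether it had any effect.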

The only other thing I can think of is to restart the node, which is clearly not ideal...

jubagarie · 2014-06-20T09:32:38Z

Here are the ES logs from when I unregister and re-register the repository, so that part seems fine:

[2014-06-19 16:17:07,732][INFO ][repositories             ] [Turner Century] delete repository [hdfs]
[2014-06-19 16:31:56,739][INFO ][repositories             ] [Turner Century] put repository [hdfs]

I will dig on the Hadoop side.

Thanks for your time!

costin · 2014-06-20T11:03:02Z

As a work-around, you could potentially add some firewall rules to kill connections in CLOSE_WAIT state, idle for more than X minutes.

costin · 2014-08-14T13:00:44Z

Closing this as won't fix since there's not much we can do, unfortunately...

costin · 2014-12-02T15:43:55Z

Since this bug in Hadoop seems to keep occurring, some pointers in the docs on how to try to fix it would help.

bflad · 2015-01-15T00:40:04Z

Hmmm. As a point of reference, Elasticsearch 1.1.1 + CDH 5.2.1 (Hadoop 2.5.x) here, and I don't see any CLOSE_WAIT connections in lsof after taking HDFS snapshots.

Aside: is there a recommended way to set up the light plugin with Hadoop jars (like CDH) at startup? Is ES_CLASSPATH the way to go? It might be worth a mention in the documentation, especially since with the files provided for CentOS/RHEL it's not obvious: the variable is not in the sysconfig file or the init script. I found it in the Elasticsearch shell script itself.

bflad · 2015-01-15T00:42:33Z

Oh wait, totally lying, the connections were on the master nodes:

java      13452 elasticsearch  211u     IPv6            7964297      0t0        TCP esmaster01.example.com:58187->cdh02.example.com:50010 (CLOSE_WAIT)
java      13452 elasticsearch  212u     IPv6            7964298      0t0        TCP esmaster01.example.com:52732->cdh03.example.com:50010 (CLOSE_WAIT)
java      13452 elasticsearch  213u     IPv6            7964299      0t0        TCP esmaster01.example.com:57888->cdh01.example.com:50010 (CLOSE_WAIT)

So yes, still an issue!

costin · 2015-01-15T12:27:01Z

@bflad Sorry to hear that.
As for the classpath, ES_CLASSPATH definitely works. I'll update the readme to make this clearer.


costin · 2015-06-17T21:45:15Z

@bflad @jubagarie you might want to try 2.1.0.rc1 since it improves the creation and closing of the FileSystem.

costin · 2015-10-29T12:02:53Z

As there hasn't been any update, I'm closing the issue.

costin added v2.0.1 labels Jun 19, 2014

costin closed this as completed Aug 14, 2014

costin added the wontfix label Aug 14, 2014

costin reopened this Dec 2, 2014

costin added a commit that referenced this issue Jan 19, 2015

[DOC] Point out ES_CLASSPATH env variable

362e0b0

relates #220

costin mentioned this issue Jul 1, 2015

CLOSE_WAIT with hdfs repository plugin #492

Closed

costin closed this as completed Oct 29, 2015
