
SimpleHttpConnectionManager problem with elasticsearch-hadoop-2.1.2.jar on Spark #618

Closed
vichargrave opened this issue Nov 28, 2015 · 3 comments

vichargrave commented Nov 28, 2015

I'm seeing an issue when using pyspark with Elasticsearch 1.7.0 and elasticsearch-hadoop-2.1.2.jar on Spark 1.5.1 (all running on my OS X Yosemite system). I run the simple program shown below (from the article at http://qbox.io/blog/elasticsearch-in-apache-spark-python). After the print(es_rdd.first()) statement is executed, pyspark just hangs:

Using Python version 2.7.10 (default, Oct 19 2015 18:31:17)
SparkContext available as sc, HiveContext available as sqlContext.

>>> es_rdd = sc.newAPIHadoopRDD(
...     inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
...     keyClass="org.apache.hadoop.io.NullWritable",
...     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
...     conf={ "es.resource" : "titanic/passenger" })
15/11/27 18:16:41 WARN EsInputFormat: Cannot determine task id...
>>> print(es_rdd.first())
15/11/27 18:16:50 WARN EsInputFormat: Cannot determine task id...
15/11/27 18:16:51 WARN SimpleHttpConnectionManager: SimpleHttpConnectionManager being used incorrectly.  Be sure that HttpMethod.releaseConnection() is always called and that only one thread and/or method is using this connection manager at a time.

When I stop Elasticsearch, I get the following output:

15/11/27 18:33:32 ERROR NetworkClient: Node [10.0.0.2:9200] failed (The server 10.0.0.2 failed to respond with a valid HTTP response); no other nodes left - aborting...
15/11/27 18:33:32 WARN NewHadoopRDD: Exception in RecordReader.close()
org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[10.0.0.2:9200]]
    at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:142)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:329)
    at org.elasticsearch.hadoop.rest.RestClient.executeNotFoundAllowed(RestClient.java:337)
    at org.elasticsearch.hadoop.rest.RestClient.deleteScroll(RestClient.java:403)
    at org.elasticsearch.hadoop.rest.ScrollQuery.close(ScrollQuery.java:70)
    at org.elasticsearch.hadoop.mr.EsInputFormat$ShardRecordReader.close(EsInputFormat.java:262)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.org$apache$spark$rdd$NewHadoopRDD$$anon$$close(NewHadoopRDD.scala:190)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$3.apply(NewHadoopRDD.scala:156)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1$$anonfun$3.apply(NewHadoopRDD.scala:156)
    at org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:60)
    at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:79)
    at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:77)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77)
    at org.apache.spark.scheduler.Task.run(Task.scala:90)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
(u'892', {u'fare': u'7.8292', u'name': u'Kelly, Mr. James', u'embarked': u'Q', u'age': u'34.5', u'parch': u'0', u'pclass': u'3', u'sex': u'male', u'ticket': u'330911', u'passengerid': u'892', u'sibsp': u'0', u'cabin': None})

Note that 10.0.0.2 is the IP address of my Mac. At any rate, I end up getting the expected output (the last line above) after a series of error messages. When I use elasticsearch-hadoop-2.1.0.jar instead of 2.1.2, I do not see this problem and the program runs without error.

Is this an incompatibility between elasticsearch-hadoop-2.1.2.jar, Elasticsearch 1.7.0, and Spark 1.5.1?
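
For reference, the shell session above can also be run as a standalone script when comparing the two connector versions; here is a minimal sketch (the file name and jar path are hypothetical, and it assumes Elasticsearch is reachable on 10.0.0.2:9200):

    # es_read_test.py -- standalone version of the shell session above.
    # Submitted with, e.g.: spark-submit --jars elasticsearch-hadoop-2.1.2.jar es_read_test.py
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("es-read-test")
    sc = SparkContext(conf=conf)

    es_rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf={
            "es.resource": "titanic/passenger",
            # Pinning the node and port explicitly avoids relying on defaults:
            "es.nodes": "10.0.0.2",
            "es.port": "9200",
        })

    print(es_rdd.first())
    sc.stop()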

larghir commented Dec 22, 2015

I am getting a similar error:
ERROR NetworkClient: Node [<...>:9300] failed (The server <...> failed to respond with a valid HTTP response); no other nodes left - aborting...
Exception in thread "main" org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[<...>:9300]]

<...> stands for the actual IP

I am using Spark 1.5.1, Elasticsearch 1.7.1, and elasticsearch-spark_2.11 version 2.1.1.

The port is open and I am able to connect to it with a different client.
Any hints appreciated.
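
One thing worth double-checking in a setup like this: elasticsearch-hadoop talks to Elasticsearch over its HTTP interface (port 9200 by default), not the 9300 transport port used by the native Java client, so a quick probe of the REST endpoint confirms the connector can actually reach it. A minimal sketch in Python 2 (substitute the real host for the placeholder):

    import urllib2

    # Hit the Elasticsearch REST endpoint that elasticsearch-hadoop uses;
    # a JSON banner with the cluster name and version indicates success.
    # Replace <es-host> with the actual IP; 9200 is the connector's default port.
    response = urllib2.urlopen("http://<es-host>:9200/", timeout=5)
    print(response.read())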

costin (Member) commented Jan 8, 2016

I'm pretty sure you bumped into the same issue as described in #591. This has been fixed in master and will be included in the upcoming 2.2 rc1. Can you please check it out once it is released and, if it's not working, reopen the issue?

Thanks,

vichargrave (Author) commented:

OK, will do. Thanks.

On 1/8/16, Costin Leau notifications@github.com wrote:

Closed #618.


