Hive loading data into ES error: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed #606

Closed
vanjor opened this issue Nov 12, 2015 · 5 comments

vanjor commented Nov 12, 2015

Thanks for @costin's advice; I've opened a separate issue.

I am using Hive 1.2.1 to load data from Hive into ES.
I was able to load billions of records into ES through Hive, but when I try to update that data in ES, the job fails after several hours.
In short: loading data into an empty index through Hive works fine, but updating that large-scale ES data through Hive fails partway through.

@costin, I also read your other post: https://discuss.elastic.co/t/spark-es-batch-write-retry-count-negative-value-is-ignored/25436/2

es.batch.write.retry.count should work. Note that the connector has two types of retries: …

I have no idea which type I encountered: network hiccups or document rejections. Also, if I set it to a negative number, will that prevent the job from stopping midway?

Under the same conditions, a full update of the index fails while recreating the index works. Is it that updating the index costs more ES resources?

The detailed errors follow.

The Hadoop job throws this error and stops:

org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[10.1.23.134:9200, es.op.koudai.com:9200, 10.1.23.132:9200, 10.1.23.131:9200, 10.1.23.130:9200, 10.1.23.133:9200]]

2015-11-06 15:40:55,230 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [Read timed out] failed (10.1.23.133:9200); no other nodes left - aborting...
2015-11-06 15:40:55,259 FATAL [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"user_id":"492923825","is_register":null,"register_time":null"}
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:518)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[10.1.23.134:9200, es.op.koudai.com:9200, 10.1.23.132:9200, 10.1.23.131:9200, 10.1.23.130:9200, 10.1.23.133:9200]] 
    at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:142)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:317)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:313)
    at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:150)
    at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:209)
    at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:232)
    at org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:185)
    at org.elasticsearch.hadoop.rest.RestRepository.writeProcessedToIndex(RestRepository.java:164)
    at org.elasticsearch.hadoop.hive.EsHiveOutputFormat$EsHiveRecordWriter.write(EsHiveOutputFormat.java:63)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:753)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:162)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:508)
    ... 9 more

2015-11-06 15:40:55,259 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: 3 finished. closing... 
2015-11-06 15:40:55,259 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: RECORDS_IN:879999
2015-11-06 15:40:55,259 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: DESERIALIZE_ERRORS:0

My mapping config is:

CREATE EXTERNAL TABLE es.buyer_es (
  user_id  string,
  is_register int,
  register_time  string,
  xxx
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'xxx/buyers','es.nodes'= 'xxx',
              'es.port'= '9200','es.mapping.id' = 'user_id','es.index.auto.create' = 'true',
              'es.batch.size.entries' = '1000','es.batch.write.retry.count' = '10000','es.batch.write.retry.wait' = '10s',
              'es.batch.write.refresh' = 'false','es.nodes.discovery' = 'true','es.nodes.client.only' = 'false'
             );

My insert/update script is:

INSERT OVERWRITE TABLE es.buyer_es 
select * from xxx

costin commented Nov 12, 2015

I did some minor formatting on your post.

Thanks for the detailed post.
From what you describe, it looks like your cluster is getting overloaded when performing the update. Note that in ES an update is actually two operations: a document delete followed by a document index. In 2.0 things have been optimized (and will continue to be) so that the update doesn't simply happen blindly; rather, ES checks whether the docs are the same, and if nothing has changed the doc stays in place.

In your case, while indexing runs successfully, the update is more expensive.
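
For illustration only (this option is not discussed in the thread itself): ES-Hadoop controls how documents are written through the es.write.operation setting. The table above uses the default 'index' operation, so rewriting existing ids triggers exactly the delete-then-index cycle described here; an 'upsert' table would send update requests instead. A minimal sketch with a hypothetical table name:

-- Sketch only: an upsert variant of the table (hypothetical name buyer_es_upsert).
-- es.mapping.id names the document id; es.write.operation switches from the
-- default 'index' (a full rewrite of the doc) to 'upsert'.
CREATE EXTERNAL TABLE es.buyer_es_upsert (
  user_id       string,
  is_register   int,
  register_time string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource'        = 'xxx/buyers',
  'es.nodes'           = 'xxx',
  'es.port'            = '9200',
  'es.mapping.id'      = 'user_id',
  'es.write.operation' = 'upsert'
);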

You shouldn't really care about what type of failure you get; it doesn't change the behaviour. As for a negative retry.count, that means ES-Hadoop will keep on retrying.

Which is typically not what you want. Why?

Because ES is lagging behind, and instead of understanding why and fixing it, the job keeps sending more and more data, which makes ES even more overloaded. This can easily cause the processing nodes to freeze and appear dead while the JVM is busy GC'ing.

A retry count of 10K is simply way, way too high - use 3 or 5, maybe 10, but more than that means ignoring the underlying issue. A wait of 10s is also quite high - a bulk request should take 1-2s; more than that is a sign you should send less data to ES or have more/beefier (better) nodes.
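
Concretely, and only as a sketch of the direction suggested here (the values are illustrative, not tested recommendations for this cluster), the batch/retry properties would be moved back toward the defaults:

-- Sketch: bring the retry count back near the default so persistent overload
-- surfaces as a failure instead of being retried away for hours.
ALTER TABLE es.buyer_es SET TBLPROPERTIES (
  'es.batch.write.retry.count' = '3',
  'es.batch.size.entries'      = '1000'
);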

So I suggest reverting to the defaults and monitoring your ES cluster - how does it behave? Do you see a lot of GCs? Are there a lot of rejections?
How many tasks do you have on the Hive server? This also matters, since it's not uncommon for people to have hundreds of tasks hitting an ES cluster of only 3 nodes... which clearly doesn't work or scale.

P.S. By the way, what version of ES-Hadoop are you using?

vanjor commented Nov 12, 2015

Thanks for your reply:)

Doesn't ES maintain a traffic-control mechanism? I thought it was OK to use 'es.batch.write.retry.count' = '10000' and 'es.batch.write.retry.wait' = '10s': when ES is busy, the client would automatically slow down its requests because of the wait between retries, and with this config it should take 10000 * 10s ≈ 27 hours for the job to fail. Yet my job failed after 4 hours. Why? Do all tasks under the same MapReduce job (or index) share the retry counters?

I am rerunning the experiment with the default config to update 10+ billion records, using 115 tasks against 5 ES nodes with 16-core CPUs.
The result: the job still failed for the same reason after less than 4 hours. From the Marvel monitor, the cluster seems to be OK.
My ES version is 1.7.1 and the ES-Hadoop version is 2.1.1.

I also configured the index as follows to maximize indexing throughput:

    "index" : {
        "number_of_shards" : 5,
        "number_of_replicas" : 0,
        "refresh_interval": -1 
    }

costin commented Nov 15, 2015

ES does maintain traffic control: when overloaded, it starts rejecting documents, and ES-Hadoop waits a bit and then retries only the failed docs. However, by asking for 10,000 retries one basically disregards that pushback and keeps retrying over and over again, rendering the pushback void.

Note that under load a JVM can start GC'ing a lot, which effectively means the node is frozen, not responding to any network calls, and thus can be interpreted as dead. That is likely the case here: you overload the cluster, keep pushing, the nodes start GC'ing and the clients assume they have dropped off the network.

115 tasks against 5 ES nodes is simply way too much. CPU is not the only parameter you should take into account; memory is just as important, and so is disk (SSDs are what you are looking for).
I recommend monitoring your ES cluster closely, in particular the IO and memory usage, and reading the docs (including this page) and the webinars on performance.

As indicated above, reducing the number of tasks to something more like 1-3x the number of shards (so around 15) and increasing the batch size in small steps (1.5x) is likely to yield much better results and, more importantly, allow the job to complete successfully.
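
As a rough sketch of that advice (the property names and values below are illustrative assumptions; how you cap the task count depends on the Hive/Hadoop version, input format and file sizes):

-- Aim for roughly 1-3x the shard count (about 15 concurrent tasks for 5 shards)
-- by enlarging the input splits so fewer mappers hit the cluster at once.
SET mapreduce.input.fileinputformat.split.minsize=2147483648;
SET mapreduce.input.fileinputformat.split.maxsize=2147483648;

-- Grow the bulk size gradually (~1.5x per run, e.g. 1000 -> 1500 -> 2250)
-- while keeping the retry settings at their defaults.
ALTER TABLE es.buyer_es SET TBLPROPERTIES ('es.batch.size.entries' = '1500');

INSERT OVERWRITE TABLE es.buyer_es
SELECT * FROM xxx;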

@KrishnaShah123

@vanjor Could you please tell me how you were able to load billions of records with those queries? When I use the same ones, the process doesn't start for me.

jbaiera commented Apr 19, 2018

@KrishnaShah123 Please avoid petitioning specific users for help on old issues on GitHub. In the future, we ask that you keep these kinds of questions to the forums. We reserve GitHub for tracking confirmed bugs and feature planning.
