Hive loading data into ES error: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed #606

Closed
vanjor opened this issue Nov 12, 2015 · 5 comments

vanjor commented Nov 12, 2015

Thanks for @costin's advice; I've opened a separate issue.

I am using Hive 1.2.1 to load data from Hive into ES.
I was able to load billions of records into ES through Hive, but when I try to update that data in ES, the job fails after several hours.
In short: loading data into an empty index through Hive works fine, but updating that large-scale ES data through Hive fails partway through.

@costin, I also read your other post: https://discuss.elastic.co/t/spark-es-batch-write-retry-count-negative-value-is-ignored/25436/2

es.batch.write.retry.count should work. Note that the connector has two types of retries: …

I have no idea which type I encountered: network hiccups or document rejections. Also, if I set it to a negative number, will that prevent the job from stopping midway?

Under the same conditions, a full update of the index fails while recreating the index works. Is it that updating the index costs more ES resources?

The detailed errors follow.

The Hadoop job throws this error and stops:

org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[10.1.23.134:9200, es.op.koudai.com:9200, 10.1.23.132:9200, 10.1.23.131:9200, 10.1.23.130:9200, 10.1.23.133:9200]]

2015-11-06 15:40:55,230 ERROR [main] org.elasticsearch.hadoop.rest.NetworkClient: Node [Read timed out] failed (10.1.23.133:9200); no other nodes left - aborting...
2015-11-06 15:40:55,259 FATAL [main] org.apache.hadoop.hive.ql.exec.mr.ExecMapper: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"user_id":"492923825","is_register":null,"register_time":null"}
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:518)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[10.1.23.134:9200, es.op.koudai.com:9200, 10.1.23.132:9200, 10.1.23.131:9200, 10.1.23.130:9200, 10.1.23.133:9200]] 
    at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:142)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:317)
    at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:313)
    at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:150)
    at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:209)
    at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:232)
    at org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:185)
    at org.elasticsearch.hadoop.rest.RestRepository.writeProcessedToIndex(RestRepository.java:164)
    at org.elasticsearch.hadoop.hive.EsHiveOutputFormat$EsHiveRecordWriter.write(EsHiveOutputFormat.java:63)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:753)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:162)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:508)
    ... 9 more

2015-11-06 15:40:55,259 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: 3 finished. closing... 
2015-11-06 15:40:55,259 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: RECORDS_IN:879999
2015-11-06 15:40:55,259 INFO [main] org.apache.hadoop.hive.ql.exec.MapOperator: DESERIALIZE_ERRORS:0

My mapping config is:

CREATE EXTERNAL TABLE es.buyer_es (
  user_id  string,
  is_register int,
  register_time  string,
  xxx
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'xxx/buyers','es.nodes'= 'xxx',
              'es.port'= '9200','es.mapping.id' = 'user_id','es.index.auto.create' = 'true',
              'es.batch.size.entries' = '1000','es.batch.write.retry.count' = '10000','es.batch.write.retry.wait' = '10s',
              'es.batch.write.refresh' = 'false','es.nodes.discovery' = 'true','es.nodes.client.only' = 'false'
             );

My insert/update script is:

INSERT OVERWRITE TABLE es.buyer_es 
select * from xxx

costin commented Nov 12, 2015

I did some minor formatting on your post.

Thanks for the detailed post.
From what you describe, it looks like your cluster is getting overloaded when performing the update. Note that in ES an update is actually two operations: a document delete followed by a document index. In 2.0 things have been optimized (and will continue to be) so that the update doesn't simply happen blindly; rather, ES checks whether the docs are the same, and if nothing has changed the doc stays in place.

In your case, while indexing runs successfully, the update is more expensive.
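
For illustration only (this option is not discussed in the thread itself): ES-Hadoop controls how documents are written through the es.write.operation setting. The table above uses the default 'index' operation, so rewriting existing ids triggers exactly the delete-then-index cycle described here; an 'upsert' table would send update requests instead. A minimal sketch with a hypothetical table name:

-- Sketch only: an upsert variant of the table (hypothetical name buyer_es_upsert).
-- es.mapping.id names the document id; es.write.operation switches from the
-- default 'index' (a full rewrite of the doc) to 'upsert'.
CREATE EXTERNAL TABLE es.buyer_es_upsert (
  user_id       string,
  is_register   int,
  register_time string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource'        = 'xxx/buyers',
  'es.nodes'           = 'xxx',
  'es.port'            = '9200',
  'es.mapping.id'      = 'user_id',
  'es.write.operation' = 'upsert'
);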

You shouldn't really care about what type of failure you get; it doesn't change the behaviour. As for a negative retry.count, that means ES-Hadoop will keep on retrying.

Which is typically not what you want. Why?

Because ES is lagging behind, and instead of understanding why and fixing it, the job keeps sending more and more data, which makes ES even more overloaded. This can easily cause the processing nodes to freeze and appear dead while the JVM is busy GC'ing.

A retry count of 10K is simply way, way too high - use 3 or 5, maybe 10, but more than that means ignoring the underlying issue. A wait of 10s is also quite high - a bulk request should take 1-2s; more than that is a sign you should send less data to ES or have more/beefier (better) nodes.
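
Concretely, and only as a sketch of the direction suggested here (the values are illustrative, not tested recommendations for this cluster), the batch/retry properties would be moved back toward the defaults:

-- Sketch: bring the retry count back near the default so persistent overload
-- surfaces as a failure instead of being retried away for hours.
ALTER TABLE es.buyer_es SET TBLPROPERTIES (
  'es.batch.write.retry.count' = '3',
  'es.batch.size.entries'      = '1000'
);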

So I suggest reverting to the defaults and monitoring your ES cluster - how does it behave? Do you see a lot of GCs? Are there a lot of rejections?
How many tasks do you have on the Hive server? This also matters, since it's not uncommon for people to have hundreds of tasks hitting an ES cluster of only 3 nodes... which clearly doesn't work or scale.

P.S. By the way, what version of ES-Hadoop are you using?

vanjor commented Nov 12, 2015

Thanks for your reply:)

Doesn't ES maintain a traffic-control mechanism? I thought it was OK to use 'es.batch.write.retry.count' = '10000' and 'es.batch.write.retry.wait' = '10s': when ES is busy, the client would automatically slow down its requests because of the wait between retries, and with this config it should take 10000 * 10s ≈ 27 hours for the job to fail. Yet my job failed after 4 hours. Why? Do all tasks under the same MapReduce job (or index) share the retry counters?

I am rerunning the experiment with the default config to update 10+ billion records, using 115 tasks against 5 ES nodes with 16-core CPUs.
The result: the job still failed for the same reason after less than 4 hours. From the Marvel monitor, the cluster seems to be OK.
My ES version is 1.7.1 and the ES-Hadoop version is 2.1.1.

I also configured the index as follows to maximize indexing throughput:

    "index" : {
        "number_of_shards" : 5,
        "number_of_replicas" : 0,
        "refresh_interval": -1 
    }

costin commented Nov 15, 2015

ES does maintain traffic control: when overloaded, it starts rejecting documents, and ES-Hadoop waits a bit and then retries only the failed docs. However, by asking for 10,000 retries one basically disregards that pushback and keeps retrying over and over again, rendering the pushback void.

Note that under load a JVM can start GC'ing a lot, which effectively means the node is frozen, not responding to any network calls, and thus can be interpreted as dead. That is likely the case here: you overload the cluster, keep pushing, the nodes start GC'ing and the clients assume they have dropped off the network.

115 tasks against 5 ES nodes is simply way too much. CPU is not the only parameter you should take into account; memory is just as important, and so is disk (SSDs are what you are looking for).
I recommend monitoring your ES cluster closely, in particular the IO and memory usage, and reading the docs (including this page) and the webinars on performance.

As indicated above, reducing the number of tasks to something more like 1-3x the number of shards (so around 15) and increasing the batch size in small steps (1.5x) is likely to yield much better results and, more importantly, allow the job to complete successfully.
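
As a rough sketch of that advice (the property names and values below are illustrative assumptions; how you cap the task count depends on the Hive/Hadoop version, input format and file sizes):

-- Aim for roughly 1-3x the shard count (about 15 concurrent tasks for 5 shards)
-- by enlarging the input splits so fewer mappers hit the cluster at once.
SET mapreduce.input.fileinputformat.split.minsize=2147483648;
SET mapreduce.input.fileinputformat.split.maxsize=2147483648;

-- Grow the bulk size gradually (~1.5x per run, e.g. 1000 -> 1500 -> 2250)
-- while keeping the retry settings at their defaults.
ALTER TABLE es.buyer_es SET TBLPROPERTIES ('es.batch.size.entries' = '1500');

INSERT OVERWRITE TABLE es.buyer_es
SELECT * FROM xxx;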

@KrishnaShah123

@vanjor Could you please tell me how you were able to load billions of records with those queries? When I use the same ones, the process doesn't start for me.

jbaiera commented Apr 19, 2018

@KrishnaShah123 Please avoid petitioning specific users for help on old issues on GitHub. In the future, we ask that you keep these kinds of questions to the forums. We reserve GitHub for tracking confirmed bugs and feature planning.
