
Elasticsearch overload when handling many connections #10291

Closed
zolaahihi opened this issue Mar 27, 2015 · 9 comments

Comments

@zolaahihi

Hi all

I have installed Elasticsearch (ES) version 1.5 on our server (a single server).
I set ulimit to ~64K (max open files).
Our server: 4 virtual CPUs, 7.5 GB RAM (Amazon EC2)

After I start ES, many users can search and get responses as expected. But after 30 seconds to 1 minute, ES no longer responds to connections from the client (a PHP library that just wraps the curl command).

I get the PID of ES and run: ls /proc/ES_PID/fd | wc -l
The count increases every second (by roughly 2 connections), the ES process uses ~100% CPU, and about 60% of RAM is free.
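Roughly how I watch it, something like this (just a sketch; the pgrep pattern assumes the default 1.x bootstrap class name, and ES_PID is simply the Elasticsearch process id):

ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
while true; do
  # print time, open file descriptor count, and CPU% of the ES process once per second
  echo "$(date +%T) fds=$(ls /proc/$ES_PID/fd | wc -l) cpu=$(ps -o %cpu= -p $ES_PID)"
  sleep 1
done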

I checked the log; here are some entries:
[2015-03-27 07:57:33,512][DEBUG][http.netty ] [Impossible Man] Caught exception while handling client http traffic, closing connection [id: 0xbd4fa29e, /10.0.0.166:43963 :> /10.0.0.230:9200]
java.nio.channels.ClosedChannelException
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:128)
at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink.handleAcceptedSocket(NioServerSocketPipelineSink.java:99)
at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:36)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
at org.elasticsearch.common.netty.channel.Channels.write(Channels.java:725)
at org.elasticsearch.common.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:71)
at org.elasticsearch.common.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:59)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:784)
at org.elasticsearch.http.netty.pipelining.HttpPipeliningHandler.handleDownstream(HttpPipeliningHandler.java:87)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
at org.elasticsearch.http.netty.NettyHttpChannel.sendResponse(NettyHttpChannel.java:199)
at org.elasticsearch.rest.action.support.RestResponseListener.processResponse(RestResponseListener.java:43)
at org.elasticsearch.rest.action.support.RestActionListener.onResponse(RestActionListener.java:49)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$2.doRun(TransportSearchQueryThenFetchAction.java:149)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-03-27 07:57:38,286][DEBUG][monitor.jvm ] [Impossible Man] [gc][old][207][165] duration [2.7s], collections [1]/[2.8s], total [2.7s]/[5.1m], memory [811.9mb]->[836mb]/[990.7mb], all_pools {[young] [120.7mb]->[144.9mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [691.2mb]->[691.2mb]/[691.2mb]}

Here are some of the configurations I added:
script.groovy.sandbox.enabled: true

bootstrap.mlockall: true

threadpool.index.type: fixed
threadpool.index.size: 4
threadpool.index.queue_size: 400
threadpool.search.queue_size: 1000
threadpool.search.type: cached
threadpool.bulk.type: fixed
threadpool.bulk.size: 4 # availableProcessors
threadpool.bulk.queue_size: 1000

And the remaining configurations are default.
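As a sanity check, the node info API should echo these thread pool settings back (assuming ES is listening on the default localhost:9200), e.g.:

curl 'localhost:9200/_nodes/thread_pool?pretty'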

With that many connections, version 0.9x still worked normally. I had to upgrade because of a critical Elasticsearch security issue :(

Could you help me solve this problem? :(

@zolaahihi
Author

When the ES service started, I ran: curl 'localhost:9200/_cluster/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'

And I get the result:
"open_file_descriptors" : {
"min" : 211,
"max" : 211,
"avg" : 211
}

Why is open_file_descriptors so large? :(
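For comparison, the per-node process stats and the kernel-side limit can be checked like this (localhost:9200 assumed; ES_PID is a placeholder for the real process id):

curl 'localhost:9200/_nodes/stats/process?pretty'
grep 'Max open files' /proc/ES_PID/limits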

@zolaahihi
Author

When I switch back to version 0.9.x, here is the output of "mpstat -P ALL 1":

09:27:39 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
09:27:40 all 0,25 0,00 0,00 0,00 0,00 0,00 0,00 0,00 99,75
09:27:41 all 0,50 0,00 0,00 0,00 0,00 0,00 0,00 0,00 99,50
09:27:42 all 0,63 0,00 0,00 0,13 0,00 0,00 0,00 0,00 99,25
09:27:43 all 0,25 0,00 0,13 0,13 0,00 0,00 0,00 0,00 99,50
09:27:44 all 0,13 0,00 0,00 0,00 0,00 0,00 0,00 0,00 99,87
09:27:45 all 1,88 0,00 0,13 0,50 0,00 0,00 0,00 0,00 97,49

Does anyone think the problem is that version 1.5 uses this much CPU?
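To attribute the CPU to the ES process itself rather than the whole box, something like this should work (pidstat comes from the same sysstat package as mpstat; the pgrep pattern is again an assumption based on the default bootstrap class):

pidstat -u -p $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) 1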

@clintongormley

Hiya

That network error occurs when the client network connection disconnects, perhaps because you set a request timeout?

It looks like your Elasticsearch is under severe memory pressure. You've given it less than one GB of heap, and it needs more to cope with your load.

Also, you say you had to upgrade ES because of security, but then you have reenabled dynamic scripting, so you are allowing anybody with access to your box to run whatever scripts they want, defeating the point of the upgrade.

The number of open file descriptors you have is not large at all. Elasticsearch can easily use many more file handles than you are currently using.

Also check (look at GET /_nodes) to ensure that mlockall is being applied correctly.
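For example (assuming the default localhost:9200 binding), each node's process section should report "mlockall" : true:

curl 'localhost:9200/_nodes/process?pretty'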

@clintongormley

You can also check what is using all the CPU with the hot threads API, but I think you'll find that it is mostly garbage collection.
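For example:

curl 'localhost:9200/_nodes/hot_threads'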

@zolaahihi
Author

Hi clintongormley

What is the best heap size to assign? My server has 4 GB of RAM (maybe 8 GB).

Currently the mlockall value is true; is that OK? And what is the hot threads API?

Thank you very much.

@zolaahihi reopened this Mar 29, 2015
@zolaahihi
Author

Hi guys,

Here are some variables from the init script:

es_heap_size: 2G
es_heap_newsize:
es_min_mem: 512M
es_max_mem: 1G
max_locked_memory: unlimited

I set ES_HEAP_SIZE to 2G, and the Elasticsearch server ran normally for about 3 hours; after that, ES stopped responding to requests again (the same issue I reported above).
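(For reference, with the 1.x RPM/DEB packages the heap is normally set in the service environment file rather than the init script itself; the exact path depends on the distro, so treat these as assumptions:)

# /etc/sysconfig/elasticsearch (RHEL/CentOS) or /etc/default/elasticsearch (Debian/Ubuntu)
ES_HEAP_SIZE=2g
MAX_LOCKED_MEMORY=unlimited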

Here is the latest log:

[root@ip-10-0-0-230 quydo]# tailf /var/log/elasticsearch/elasticsearch.log
[2015-03-30 04:47:41,217][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6235][112] duration [3.5s], collections [1]/[4.2s], total [3.5s]/[1.7m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [394.5kb]->[270.9kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:47:47,218][INFO ][monitor.jvm ] [Karolina Dean] [gc][old][6236][113] duration [5.5s], collections [1]/[6s], total [5.5s]/[1.8m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [270.9kb]->[27.3mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:47:53,708][INFO ][monitor.jvm ] [Karolina Dean] [gc][old][6238][114] duration [5.4s], collections [1]/[5.4s], total [5.4s]/[1.9m], memory [1.7gb]->[1.6gb]/[1.9gb], all_pools {[young] [31.9mb]->[167.3kb]/[266.2mb]}{[survivor] [29mb]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:47:58,135][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6239][115] duration [3.4s], collections [1]/[4.4s], total [3.4s]/[2m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [167.3kb]->[15.2mb]/[266.2mb]}{[survivor] [0b]->[12.1mb]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:48:02,313][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6240][116] duration [3.6s], collections [1]/[4.1s], total [3.6s]/[2m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [15.2mb]->[1005kb]/[266.2mb]}{[survivor] [12.1mb]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:48:05,667][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6241][117] duration [3.2s], collections [1]/[3.3s], total [3.2s]/[2.1m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [1005kb]->[372.4kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:48:09,417][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6242][118] duration [3.6s], collections [1]/[3.7s], total [3.6s]/[2.2m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [372.4kb]->[45.1kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:48:13,156][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6243][119] duration [3.6s], collections [1]/[3.7s], total [3.6s]/[2.2m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [45.1kb]->[410.4kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:48:16,903][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6244][120] duration [3.6s], collections [1]/[3.7s], total [3.6s]/[2.3m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [410.4kb]->[501.1kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:48:20,369][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6245][121] duration [3.3s], collections [1]/[3.4s], total [3.3s]/[2.3m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [501.1kb]->[422.1kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:48:23,694][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6246][122] duration [3.2s], collections [1]/[3.3s], total [3.2s]/[2.4m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [422.1kb]->[322.5kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}

What should I do to solve this problem?

@clintongormley

It looks like you need more memory. Your heap is full.
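One quick way to confirm (default localhost:9200 assumed) is to compare heap_used against heap_max in the JVM stats:

curl 'localhost:9200/_nodes/stats/jvm?pretty'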

@splashx

splashx commented Jul 31, 2015

@clintongormley if the heap is full, shouldn't it spill to disk? I thought it would, as long as index.store.type is not set to memory.

@clintongormley

@splashx that would simply kill performance
