
Elasticsearch overload when handling many connections #10291

Closed
zolaahihi opened this issue Mar 27, 2015 · 9 comments

Comments

@zolaahihi

Hi all

I have installed Elasticsearch (ES) version 1.5 on our server (a single server).
I set ulimit to ~64K (max open files).
Our server: 4 virtual CPUs, 7.5 GB RAM (Amazon EC2)

After I start ES, many users can search and get responses as expected. But after 30 seconds to 1 minute, ES no longer responds to connections from the client (a PHP library that just wraps the curl command).

I get the PID of ES and run: ls /proc/ES_PID/fd | wc -l
The count increases every second (by roughly 2 connections), the ES process uses ~100% CPU, and about 60% of RAM is free.
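Roughly how I watch it, something like this (just a sketch; the pgrep pattern assumes the default 1.x bootstrap class name, and ES_PID is simply the Elasticsearch process id):

ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
while true; do
  # print time, open file descriptor count, and CPU% of the ES process once per second
  echo "$(date +%T) fds=$(ls /proc/$ES_PID/fd | wc -l) cpu=$(ps -o %cpu= -p $ES_PID)"
  sleep 1
done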

I checked the log; here are some entries:
[2015-03-27 07:57:33,512][DEBUG][http.netty ] [Impossible Man] Caught exception while handling client http traffic, closing connection [id: 0xbd4fa29e, /10.0.0.166:43963 :> /10.0.0.230:9200]
java.nio.channels.ClosedChannelException
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:128)
at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink.handleAcceptedSocket(NioServerSocketPipelineSink.java:99)
at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:36)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779)
at org.elasticsearch.common.netty.channel.Channels.write(Channels.java:725)
at org.elasticsearch.common.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:71)
at org.elasticsearch.common.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:59)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:784)
at org.elasticsearch.http.netty.pipelining.HttpPipeliningHandler.handleDownstream(HttpPipeliningHandler.java:87)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
at org.elasticsearch.http.netty.NettyHttpChannel.sendResponse(NettyHttpChannel.java:199)
at org.elasticsearch.rest.action.support.RestResponseListener.processResponse(RestResponseListener.java:43)
at org.elasticsearch.rest.action.support.RestActionListener.onResponse(RestActionListener.java:49)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction$2.doRun(TransportSearchQueryThenFetchAction.java:149)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-03-27 07:57:38,286][DEBUG][monitor.jvm ] [Impossible Man] [gc][old][207][165] duration [2.7s], collections [1]/[2.8s], total [2.7s]/[5.1m], memory [811.9mb]->[836mb]/[990.7mb], all_pools {[young] [120.7mb]->[144.9mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [691.2mb]->[691.2mb]/[691.2mb]}

Here are some of the configurations I added:
script.groovy.sandbox.enabled: true

bootstrap.mlockall: true

threadpool.index.type: fixed
threadpool.index.size: 4
threadpool.index.queue_size: 400
threadpool.search.queue_size: 1000
threadpool.search.type: cached
threadpool.bulk.type: fixed
threadpool.bulk.size: 4 # availableProcessors
threadpool.bulk.queue_size: 1000

And the remaining configurations are default.
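As a sanity check, the node info API should echo these thread pool settings back (assuming ES is listening on the default localhost:9200), e.g.:

curl 'localhost:9200/_nodes/thread_pool?pretty'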

With that many connections, version 0.9x still worked normally. I had to upgrade because of a critical Elasticsearch security issue :(

Could you help me solve this problem? :(

@zolaahihi
Author

When the ES service started, I ran: curl 'localhost:9200/_cluster/stats?network=true&transport=true&http=true&thread_pool=true&indices=false&pretty=true'

And I get the result:
"open_file_descriptors" : {
"min" : 211,
"max" : 211,
"avg" : 211
}

Why is open_file_descriptors so large? :(
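For comparison, the per-node process stats and the kernel-side limit can be checked like this (localhost:9200 assumed; ES_PID is a placeholder for the real process id):

curl 'localhost:9200/_nodes/stats/process?pretty'
grep 'Max open files' /proc/ES_PID/limits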

@zolaahihi
Author

When I switch back to version 0.9.x, here is the output of "mpstat -P ALL 1":

09:27:39 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
09:27:40 all 0,25 0,00 0,00 0,00 0,00 0,00 0,00 0,00 99,75
09:27:41 all 0,50 0,00 0,00 0,00 0,00 0,00 0,00 0,00 99,50
09:27:42 all 0,63 0,00 0,00 0,13 0,00 0,00 0,00 0,00 99,25
09:27:43 all 0,25 0,00 0,13 0,13 0,00 0,00 0,00 0,00 99,50
09:27:44 all 0,13 0,00 0,00 0,00 0,00 0,00 0,00 0,00 99,87
09:27:45 all 1,88 0,00 0,13 0,50 0,00 0,00 0,00 0,00 97,49

Does anyone think the problem is that version 1.5 uses this much CPU?
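To attribute the CPU to the ES process itself rather than the whole box, something like this should work (pidstat comes from the same sysstat package as mpstat; the pgrep pattern is again an assumption based on the default bootstrap class):

pidstat -u -p $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) 1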

@clintongormley

Hiya

That network error occurs when the client network connection disconnects, perhaps because you set a request timeout?

It looks like your Elasticsearch is under severe memory pressure. You've given it less than one GB of heap, and it needs more to cope with your load.

Also, you say you had to upgrade ES because of security, but then you have reenabled dynamic scripting, so you are allowing anybody with access to your box to run whatever scripts they want, defeating the point of the upgrade.

The number of open file descriptors you have is not large at all. Elasticsearch can easily use many more file handles than you are currently using.

Also check (look at GET /_nodes) to ensure that mlockall is being applied correctly.
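For example (assuming the default localhost:9200 binding), each node's process section should report "mlockall" : true:

curl 'localhost:9200/_nodes/process?pretty'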

@clintongormley

You can also check what is using all the CPU with the hot threads API, but I think you'll find that it is mostly garbage collection.
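For example:

curl 'localhost:9200/_nodes/hot_threads'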

@zolaahihi
Author

Hi clintongormley

What is the best heap size to assign? My server has 4 GB of RAM (maybe 8 GB).

Currently the mlockall value is true; is that OK? And what is the hot threads API?

Thank you very much.

@zolaahihi reopened this Mar 29, 2015
@zolaahihi
Author

Hi guys,

Here are some variables from the init script:

es_heap_size: 2G
es_heap_newsize:
es_min_mem: 512M
es_max_mem: 1G
max_locked_memory: unlimited

I set ES_HEAP_SIZE to 2G, and the Elasticsearch server ran normally for about 3 hours; after that, ES stopped responding to requests again (the same issue I reported above).
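(For reference, with the 1.x RPM/DEB packages the heap is normally set in the service environment file rather than the init script itself; the exact path depends on the distro, so treat these as assumptions:)

# /etc/sysconfig/elasticsearch (RHEL/CentOS) or /etc/default/elasticsearch (Debian/Ubuntu)
ES_HEAP_SIZE=2g
MAX_LOCKED_MEMORY=unlimited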

Here is the latest log:

[root@ip-10-0-0-230 quydo]# tailf /var/log/elasticsearch/elasticsearch.log
[2015-03-30 04:47:41,217][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6235][112] duration [3.5s], collections [1]/[4.2s], total [3.5s]/[1.7m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [394.5kb]->[270.9kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:47:47,218][INFO ][monitor.jvm ] [Karolina Dean] [gc][old][6236][113] duration [5.5s], collections [1]/[6s], total [5.5s]/[1.8m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [270.9kb]->[27.3mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:47:53,708][INFO ][monitor.jvm ] [Karolina Dean] [gc][old][6238][114] duration [5.4s], collections [1]/[5.4s], total [5.4s]/[1.9m], memory [1.7gb]->[1.6gb]/[1.9gb], all_pools {[young] [31.9mb]->[167.3kb]/[266.2mb]}{[survivor] [29mb]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:47:58,135][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6239][115] duration [3.4s], collections [1]/[4.4s], total [3.4s]/[2m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [167.3kb]->[15.2mb]/[266.2mb]}{[survivor] [0b]->[12.1mb]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:48:02,313][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6240][116] duration [3.6s], collections [1]/[4.1s], total [3.6s]/[2m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [15.2mb]->[1005kb]/[266.2mb]}{[survivor] [12.1mb]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:48:05,667][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6241][117] duration [3.2s], collections [1]/[3.3s], total [3.2s]/[2.1m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [1005kb]->[372.4kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:48:09,417][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6242][118] duration [3.6s], collections [1]/[3.7s], total [3.6s]/[2.2m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [372.4kb]->[45.1kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:48:13,156][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6243][119] duration [3.6s], collections [1]/[3.7s], total [3.6s]/[2.2m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [45.1kb]->[410.4kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:48:16,903][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6244][120] duration [3.6s], collections [1]/[3.7s], total [3.6s]/[2.3m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [410.4kb]->[501.1kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:48:20,369][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6245][121] duration [3.3s], collections [1]/[3.4s], total [3.3s]/[2.3m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [501.1kb]->[422.1kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}
[2015-03-30 04:48:23,694][DEBUG][monitor.jvm ] [Karolina Dean] [gc][old][6246][122] duration [3.2s], collections [1]/[3.3s], total [3.2s]/[2.4m], memory [1.6gb]->[1.6gb]/[1.9gb], all_pools {[young] [422.1kb]->[322.5kb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [1.6gb]->[1.6gb]/[1.6gb]}

What should I do to solve this problem?

@clintongormley

It looks like you need more memory. Your heap is full.
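One quick way to confirm (default localhost:9200 assumed) is to compare heap_used against heap_max in the JVM stats:

curl 'localhost:9200/_nodes/stats/jvm?pretty'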

@splashx

splashx commented Jul 31, 2015

@clintongormley if the heap is full, shouldn't it spill to disk? I thought it would, as long as index.store.type is not set to memory.

@clintongormley

@splashx that would simply kill performance
