request limit breaker does not calculate estimated_size_in_bytes correctly, which causes all aggregations to fail #26943
Comments
@xzer I tried to reproduce this, but without any luck. Can you elaborate, perhaps with the type of data (an example document and mapping would be great) and the query/aggregations that you are running?
I met the same issue, please help.
@liud I'm still hoping for a better reproduction of this; do you have a working reproduction?
There were no other query requests when I started Kibana, yet estimated_size only increased. The only traffic was Kibana's own periodic requests:

```
HEAD / HTTP/1.1
HTTP/1.1 200 OK

GET /_nodes?filter_path=nodes..version%2Cnodes..http.publish_address%2Cnodes.*.ip HTTP/1.1
HTTP/1.1 200 OK
{"nodes":{"zfJglW84TbmHwld2BeJ0dw":{"ip":"10.133.8.72","version":"5.6.1","http":{"publish_address":"10.133.8.72:8200"}}}}

GET /_nodes/_local?filter_path=nodes.*.settings.tribe HTTP/1.1
HTTP/1.1 200 OK
{}

POST /_mget HTTP/1.1
{"docs":[{"_index":".kibana","_type":"config","_id":"5.6.1"}]}
HTTP/1.1 200 OK
{"docs":[{"_index":".kibana","_type":"config","_id":"5.6.1","_version":2,"found":true,"_source":{"buildNum":15533,"defaultIndex":"AV9SX9txRDfVRTQGun0h"}}]}

GET /_cluster/health/.kibana?timeout=5s HTTP/1.1
HTTP/1.1 200 OK
{"cluster_name":"es5-qdcc-test-cluster","status":"yellow","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":1,"active_shards":1,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":1,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":50.0}

GET /.kibana/_mappings HTTP/1.1
HTTP/1.1 200 OK
{".kibana":{"mappings":{"url":{"dynamic":"strict","properties":{"accessCount":{"type":"long"},"accessDate":{"type":"date"},"createDate":{"type":"date"},"url":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":2048}}}}},"timelion-sheet":{"dynamic":"strict","properties":{"description":{"type":"text"},"hits":{"type":"integer"},"kibanaSavedObjectMeta":{"properties":{"searchSourceJSON":{"type":"text"}}},"timelion_chart_height":{"type":"integer"},"timelion_columns":{"type":"integer"},"timelion_interval":{"type":"keyword"},"timelion_other_interval":{"type":"keyword"},"timelion_rows":{"type":"integer"},"timelion_sheet":{"type":"text"},"title":{"type":"text"},"version":{"type":"integer"}}},"default":{"dynamic":"strict"},"config":{"dynamic":"true","properties":{"buildNum":{"type":"keyword"},"defaultIndex":{"type":"text","fields":{"keyword":{"type":"keyword","ignore_above":256}}}}},"search":{"dynamic":"strict","properties":{"columns":{"type":"keyword"},"description":{"type":"text"},"hits":{"type":"integer"},"kibanaSavedObjectMeta":{"properties":{"searchSourceJSON":{"type":"text"}}},"sort":{"type":"keyword"},"title":{"type":"text"},"version":{"type":"integer"}}},"visualization":{"dynamic":"strict","properties":{"description":{"type":"text"},"kibanaSavedObjectMeta":{"properties":{"searchSourceJSON":{"type":"text"}}},"savedSearchId":{"type":"keyword"},"title":{"type":"text"},"uiStateJSON":{"type":"text"},"version":{"type":"integer"},"visState":{"type":"text"}}},"dashboard":{"dynamic":"strict","properties":{"description":{"type":"text"},"hits":{"type":"integer"},"kibanaSavedObjectMeta":{"properties":{"searchSourceJSON":{"type":"text"}}},"optionsJSON":{"type":"text"},"panelsJSON":{"type":"text"},"refreshInterval":{"properties":{"display":{"type":"keyword"},"pause":{"type":"boolean"},"section":{"type":"integer"},"value":{"type":"integer"}}},"timeFrom":{"type":"keyword"},"timeRestore":{"type":"boolean"},"timeTo":{"type":"keyword"},"title":{"type":"text"},"uiStateJSON":{"type":"text"},"version":{"type":"integer"}}},"index-pattern":{"dynamic":"strict","properties":{"fieldFormatMap":{"type":"text"},"fields":{"type":"text"},"intervalName":{"type":"keyword"},"notExpandable":{"type":"boolean"},"sourceFilters":{"type":"text"},"timeFieldName":{"type":"keyword"},"title":{"type":"text"}}},"server":{"dynamic":"strict","properties":{"uuid":{"type":"keyword"}}}}}}

POST /.kibana/_search?size=1000&from=0 HTTP/1.1
{"version":true,"query":{"bool":{"must":[{"match_all":{}}],"filter":[{"bool":{"should":[{"term":{"_type":"config"}},{"term":{"type":"config"}}]}}]}},"sort":[{"buildNum":{"order":"desc","unmapped_type":"keyword"}},{"config.buildNum":{"order":"desc","unmapped_type":"keyword"}}]}
HTTP/1.1 200 OK
{"took":1,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":1,"max_score":null,"hits":[{"_index":".kibana","_type":"config","_id":"5.6.1","_version":2,"_score":null,"_source":{"buildNum":15533,"defaultIndex":"AV9SX9txRDfVRTQGun0h"},"sort":["15533",null]}]}}
```

The same cycle of requests then repeated verbatim.
ES log:

```
[2017-10-30T19:54:47,167][INFO ][o.e.p.r.a.ACL ] ^[[36mALLOWED by '{ block=kibana_base, match=true }' req={ ID:2016606753-1505744276#2982213, TYP:MainRequest, CGR:N/A, USR:kibana, BRS:false, ACT:cluster:monitor/
```
@dakrone I am sorry for replying so late; I was concentrating on our production release last month. First, I have to apologize: we finally found that the breaker count leak was caused by a plugin created by another team for some special aggregation processing. Still, I have a question about the request limit breaker. As I described in my initial report, according to the documentation the request limit breaker should be counted per request, but we found it is calculated against a global counter, which is obviously not per request. So which is wrong, the documentation or the implementation?
@xzer the request breaker is a global breaker for all requests in the system: the bytes are counted for a particular request, but the limit is global. Going to close this now since it was caused by a plugin.
@dakrone from the official documentation, we understood that the breaker is per-request, not global. So I believe the documentation needs to be modified to clarify how the breaker actually works.
Also, what is the difference between "indices.breaker.request.limit" and "network.breaker.inflight_requests.limit"? According to the source, both are counted globally in the same way.
@xzer for the difference between the request and inflight_requests breakers, I'm adding a bit more info about them here: https://github.com/elastic/elasticsearch/pull/27116/files#diff-c35a409b177f40a4be365a147eefa2f9R45 As for what I meant by "per-request" versus "global": the 60% limit is global, and each request increments the shared amount. Say you have 3 concurrent requests using 10%, 8%, and 5%; globally that is 23% usage, even though the actual "use" is per-request (it is released when the request finishes).
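The accounting model described above can be sketched as follows (a hypothetical illustration, not the actual Elasticsearch `CircuitBreaker` implementation): a single global limit, with each request adding its bytes to a shared counter and releasing them when it completes.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a "per-request bytes, global limit" breaker.
class SharedRequestBreaker {
    private final long limitBytes;
    private final AtomicLong used = new AtomicLong();

    SharedRequestBreaker(long limitBytes) { this.limitBytes = limitBytes; }

    // Called while a request runs; trips if the *global* total exceeds the limit.
    void addEstimate(long bytes) {
        long newUsed = used.addAndGet(bytes);
        if (newUsed > limitBytes) {
            used.addAndGet(-bytes); // roll back this request's bytes before failing
            throw new IllegalStateException("circuit breaker tripped at " + newUsed + " bytes");
        }
    }

    // Called when a request finishes; its bytes leave the global counter.
    void release(long bytes) { used.addAndGet(-bytes); }

    long used() { return used.get(); }
}

public class BreakerDemo {
    public static void main(String[] args) {
        // A 100-byte "global" limit; three concurrent requests using 10, 8, and 5 bytes.
        SharedRequestBreaker breaker = new SharedRequestBreaker(100);
        breaker.addEstimate(10);
        breaker.addEstimate(8);
        breaker.addEstimate(5);
        System.out.println(breaker.used()); // global usage is the sum: 23
        breaker.release(10);                // first request completes
        System.out.println(breaker.used()); // 13
    }
}
```

A request that would push the shared total over the limit fails, even if its own usage is small, because the limit is enforced against the global sum.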
@dakrone thanks for the information; now I understand the difference between them. But I still wish the documentation were updated to explain what is per-request and what is global, as you explained here. The current description really misleads readers.
Sure, I will work on a PR to explain them better, thanks for the suggestion!
Elasticsearch version (`bin/elasticsearch --version`): 5.4.3

Plugins installed: []

JVM version (`java -version`): 1.8.0_131

OS version (`uname -a` if on a Unix-like system): Debian 3.16.43-2

Description of the problem including expected versus actual behavior:
We have a 6-node cluster and noticed that the estimated_size_in_bytes of the request breaker keeps increasing, which causes all aggregations to fail once estimated_size_in_bytes reaches the configured limit.

We have a 31g heap with a 24g old generation configured, and the request limit configured as 2g. After the limit breaker tripped, raising the limit to 4g dynamically did, of course, revive our queries. But after we increased the limit to 4g, we also noticed that estimated_size_in_bytes grew extremely slowly, staying almost at the original 2g level.

Another piece of related information: we also configured our request cache to 2g. At first we suspected that cached results never decrement the breaker's estimated size because the resources may not be released, but after we cleared the request cache, the estimated size of the request limit breaker remained without decreasing. After clearing the cache we increased the limit dynamically, and then, as described above, estimated_size_in_bytes grew slowly.

(Additional info: after a 3-hour test, estimated_size_in_bytes is increasing again and has now passed 2.5g.)
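For reference, the watching and adjusting described above can be done with standard APIs: node stats expose each breaker's estimated_size_in_bytes, and the request breaker limit is a dynamic cluster setting (a sketch; the 4gb value mirrors the change described, adjust for your cluster):

```
GET /_nodes/stats/breaker

PUT /_cluster/settings
{
  "transient": {
    "indices.breaker.request.limit": "4gb"
  }
}
```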
The following graph showed how the counter and the caches grew (query cache and request cache were not collected initially).

We found an old issue with almost the same symptoms. According to the description in that issue, there may have been an OOM, but we did not find any OOM in our log files:
#14065
We also did some digging in the source. By step-by-step debugging, we confirmed the behavior at the following location:
https://github.com/elastic/elasticsearch/blob/v5.4.3/core/src/main/java/org/elasticsearch/common/breaker/ChildMemoryCircuitBreaker.java#L155
currentUsed was zero the first time we stopped at the breakpoint after starting the first query. Then, after the first query finished and we ran a second one, currentUsed was no longer zero the first time we stopped for the second query.
Also, even though we doubt our own finding and believe we must be missing something in the source, we noticed that the request limit breaker is retrieved from a global registry, so it does not seem to act per-request as described in the documentation.
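A minimal sketch (hypothetical code, not the actual Elasticsearch class or the plugin that was later found responsible) of how an unreleased addition to a shared breaker counter produces exactly this symptom, with currentUsed already nonzero when the next request starts:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal model of the shared counter behind a request breaker (hypothetical).
class BreakerCounter {
    final AtomicLong used = new AtomicLong();
    long add(long bytes)     { return used.addAndGet(bytes); }
    long release(long bytes) { return used.addAndGet(-bytes); }
}

public class LeakDemo {
    public static void main(String[] args) {
        BreakerCounter breaker = new BreakerCounter();

        // Well-behaved request: add is paired with release, counter returns to 0.
        breaker.add(1024);
        breaker.release(1024);
        System.out.println(breaker.used.get()); // 0

        // Leaky consumer: adds on every request but never releases.
        for (int i = 0; i < 3; i++) {
            breaker.add(1024); // missing release(1024)
        }
        // The counter is now nonzero before the next request even begins,
        // so the global estimate only grows until the limit trips.
        System.out.println(breaker.used.get()); // 3072
    }
}
```

Because the counter is shared, a single leaky consumer eventually pushes every request over the limit, which matches the "all aggregations fail" symptom.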
Steps to reproduce: