Description
There have been multiple incidents where the ES cluster hung after executing a query from the SQL plugin, and for those we recommended that the DI team use JSON search requests instead.
However, today the COMPASS ES cluster went down after executing the search query below:
{"from": 0,"size": 0,"query": {"bool": {"must": {"bool": {"must": [{"match": {"id": {"query": "8663","type": "phrase"}}},{"match": {"cltp": {"query": "encounters","type": "phrase"}}}]}}}},"_source": {"includes": ["COUNT"],"excludes": []},"aggregations": {"eid": {"terms": {"field": "eid","size": 200},"aggregations": {"COUNT(DISTINCT eid)": {"cardinality": {"field": "eid","precision_threshold": 40000}}}}}}
When we checked the cluster logs, we found that GC was running continuously on two nodes, which left them in a stale state and caused them to drop out of the cluster. The cluster is configured with a master quorum size of 3, so after the loss of 2 nodes only 2 live master nodes remained, which brought the complete cluster down. GC was running continuously on the two nodes because it was not able to reduce heap usage below 75%, and the Elasticsearch Java process is configured to initiate GC at 75% heap, so it went into an endless loop.
Below are the supporting logs from the cluster.
DN134:
[2017-11-10 08:27:25,898][INFO ][monitor.jvm ] [10.0.1.134] [gc][old][1076][3] duration [7.4s], collections [1]/[7.4s], total [7.4s]/[16.9s], memory [31.5gb]->[31.8gb]/[31.8gb], all_pools {[young] [614.5mb]->[865.3mb]/[865.3mb]}{[survivor] [0b]->[107.4mb]/[108.1mb]}{[old] [30.9gb]->[30.9gb]/[30.9gb]}
[2017-11-10 08:28:17,615][INFO ][monitor.jvm ] [10.0.1.134] [gc][old][1077][18] duration [1.2m], collections [15]/[1.2m], total [1.2m]/[1.5m], memory [31.8gb]->[31.8gb]/[31.8gb], all_pools {[young] [865.3mb]->[865.3mb]/[865.3mb]}{[survivor] [107.4mb]->[108mb]/[108.1mb]}{[old] [30.9gb]->[30.9gb]/[30.9gb]}
[2017-11-10 08:28:49,877][INFO ][monitor.jvm ] [10.0.1.134] [gc][old][1080][26] duration [12.1s], collections [2]/[12.1s], total [12.1s]/[2.1m], memory [31.8gb]->[31.8gb]/[31.8gb], all_pools {[young] [865.3mb]->[865.3mb]/[865.3mb]}{[survivor] [108mb]->[108mb]/[108.1mb]}{[old] [30.9gb]->[30.9gb]/[30.9gb]}
DN95:
[2017-11-10 08:26:36,566][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1074][32] duration [2.3s], collections [1]/[2.4s], total [2.3s]/[11.2s], memory [19.6gb]->[20.4gb]/[31.8gb], all_pools {[young] [17.5mb]->[8.7mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [19.5gb]->[20.3gb]/[30.9gb]}
[2017-11-10 08:26:38,016][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1075][33] duration [1.3s], collections [1]/[1.4s], total [1.3s]/[12.5s], memory [20.4gb]->[21.3gb]/[31.8gb], all_pools {[young] [8.7mb]->[26mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [20.3gb]->[21.2gb]/[30.9gb]}
[2017-11-10 08:26:39,693][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1076][34] duration [1.5s], collections [1]/[1.6s], total [1.5s]/[14.1s], memory [21.3gb]->[22.1gb]/[31.8gb], all_pools {[young] [26mb]->[28.4mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [21.2gb]->[22gb]/[30.9gb]}
[2017-11-10 08:26:41,217][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1077][35] duration [1.4s], collections [1]/[1.5s], total [1.4s]/[15.6s], memory [22.1gb]->[23gb]/[31.8gb], all_pools {[young] [28.4mb]->[27.4mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [22gb]->[22.8gb]/[30.9gb]}
[2017-11-10 08:26:42,990][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1078][36] duration [1.6s], collections [1]/[1.7s], total [1.6s]/[17.2s], memory [23gb]->[23.8gb]/[31.8gb], all_pools {[young] [27.4mb]->[17.4mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [22.8gb]->[23.7gb]/[30.9gb]}
[2017-11-10 08:26:45,167][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1079][37] duration [2s], collections [1]/[2.1s], total [2s]/[19.3s], memory [23.8gb]->[24.7gb]/[31.8gb], all_pools {[young] [17.4mb]->[34.6mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [23.7gb]->[24.5gb]/[30.9gb]}
[2017-11-10 08:26:47,178][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1080][38] duration [1.9s], collections [1]/[2s], total [1.9s]/[21.2s], memory [24.7gb]->[25.5gb]/[31.8gb], all_pools {[young] [34.6mb]->[35.2mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [24.5gb]->[25.4gb]/[30.9gb]}
[2017-11-10 08:26:48,714][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1081][39] duration [1.4s], collections [1]/[1.5s], total [1.4s]/[22.7s], memory [25.5gb]->[26.3gb]/[31.8gb], all_pools {[young] [35.2mb]->[25.9mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [25.4gb]->[26.2gb]/[30.9gb]}
[2017-11-10 08:26:50,547][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1082][40] duration [1.7s], collections [1]/[1.8s], total [1.7s]/[24.4s], memory [26.3gb]->[27.2gb]/[31.8gb], all_pools {[young] [25.9mb]->[27.5mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [26.2gb]->[27gb]/[30.9gb]}
[2017-11-10 08:26:52,066][WARN ][monitor.jvm ] [10.0.1.95] [gc][young][1083][41] duration [1.4s], collections [1]/[1.5s], total [1.4s]/[25.9s], memory [27.2gb]->[28gb]/[31.8gb], all_pools {[young] [27.5mb]->[26.1mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [27gb]->[27.9gb]/[30.9gb]}
[2017-11-10 08:26:53,067][INFO ][monitor.jvm ] [10.0.1.95] [gc][young][1084][42] duration [879ms], collections [1]/[1s], total [879ms]/[26.7s], memory [28gb]->[29.3gb]/[31.8gb], all_pools {[young] [26.1mb]->[472.2mb]/[865.3mb]}{[survivor] [108.1mb]->[108.1mb]/[108.1mb]}{[old] [27.9gb]->[28.7gb]/[30.9gb]}
[2017-11-10 08:27:04,654][WARN ][monitor.jvm ] [10.0.1.95] [gc][old][1086][2] duration [10.3s], collections [1]/[10.4s], total [10.3s]/[10.4s], memory [30.5gb]->[31.3gb]/[31.8gb], all_pools {[young] [25.9mb]->[450.2mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [30.4gb]->[30.9gb]/[30.9gb]}
[2017-11-10 08:27:12,423][INFO ][monitor.jvm ] [10.0.1.95] [gc][old][1087][3] duration [7.7s], collections [1]/[7.7s], total [7.7s]/[18.1s], memory [31.3gb]->[31.8gb]/[31.8gb], all_pools {[young] [450.2mb]->[865.3mb]/[865.3mb]}{[survivor] [0b]->[105.6mb]/[108.1mb]}{[old] [30.9gb]->[30.9gb]/[30.9gb]}
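To make the pattern in these logs easier to check, here is a minimal Python sketch that extracts the old-generation pool from a `monitor.jvm` line and compares its occupancy after collection with the 75% threshold mentioned above. The helper names (`parse_size`, `old_gen_occupancy`) and the regular expressions are illustrative assumptions based only on the log format shown here, not part of Elasticsearch.

```python
import re

# Threshold taken from the description above (GC is initiated at 75% heap).
GC_THRESHOLD = 0.75

_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}
_SIZE = re.compile(r"^([\d.]+)(b|kb|mb|gb)$")
# Matches the trailing pool entry, e.g. {[old] [30.9gb]->[30.9gb]/[30.9gb]}
_OLD_POOL = re.compile(r"\{\[old\] \[([^\]]+)\]->\[([^\]]+)\]/\[([^\]]+)\]\}")


def parse_size(text):
    """Convert a size such as '30.9gb' or '0b' to a number of bytes."""
    match = _SIZE.match(text)
    if match is None:
        raise ValueError("unrecognised size: %r" % text)
    return float(match.group(1)) * _UNITS[match.group(2)]


def old_gen_occupancy(log_line):
    """Return old-pool usage after the collection as a fraction of its maximum."""
    match = _OLD_POOL.search(log_line)
    if match is None:
        raise ValueError("no [old] pool found in log line")
    after = parse_size(match.group(2))
    maximum = parse_size(match.group(3))
    return after / maximum


# Tail of one of the DN134 lines quoted above.
DN134_LINE = (
    "[gc][old][1077][18] duration [1.2m], collections [15]/[1.2m], "
    "memory [31.8gb]->[31.8gb]/[31.8gb], all_pools "
    "{[young] [865.3mb]->[865.3mb]/[865.3mb]}"
    "{[survivor] [107.4mb]->[108mb]/[108.1mb]}"
    "{[old] [30.9gb]->[30.9gb]/[30.9gb]}"
)

if old_gen_occupancy(DN134_LINE) > GC_THRESHOLD:
    print("old generation still above the GC threshold after collection")
```

On the DN134 lines the old pool does not shrink at all after a collection, which is consistent with the endless GC loop described above.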
After bringing the cluster back up by restarting it, we fired the search request again, this time with one modification: we added the parameter "execution_hint": "global_ordinals_hash" to the terms aggregation. With that change it executed in 6.8 seconds.
{"from": 0,"size": 0,"query": {"bool": {"must": {"bool": {"must": [{"match": {"id": {"query": "8663","type": "phrase"}}},{"match": {"cltp": {"query": "encounters","type": "phrase"}}}]}}}},"_source": {"includes": ["COUNT"],"excludes": []},"aggregations": {"eid": {"terms": {"field": "eid","execution_hint": "global_ordinals_hash","size": 200},"aggregations": {"COUNT(DISTINCT eid)": {"cardinality": {"field": "eid","precision_threshold": 40000}}}}}}
We want the REST query that Elastic SQL sends to Elasticsearch to include the parameter "execution_hint": "global_ordinals_hash". Kindly assist us in doing so.
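For illustration, the change we are asking for amounts to the following transformation of the generated request body. This is only a Python sketch of the intended result, assuming the JSON body produced by the SQL plugin is available as a dictionary; the function name `add_execution_hint` is hypothetical and is not part of the plugin.

```python
import copy


def add_execution_hint(body, hint="global_ordinals_hash"):
    """Return a copy of a search body with `execution_hint` set on every
    terms aggregation that does not already specify one."""
    patched = copy.deepcopy(body)

    def visit(aggs):
        for agg in aggs.values():
            if not isinstance(agg, dict):
                continue
            terms = agg.get("terms")
            if isinstance(terms, dict):
                # Keep an explicit hint if the request already has one.
                terms.setdefault("execution_hint", hint)
            # Recurse into sub-aggregations under either spelling.
            for key in ("aggregations", "aggs"):
                if isinstance(agg.get(key), dict):
                    visit(agg[key])

    for key in ("aggregations", "aggs"):
        if isinstance(patched.get(key), dict):
            visit(patched[key])
    return patched


# Aggregation part of the original request from this issue.
original = {
    "from": 0,
    "size": 0,
    "aggregations": {
        "eid": {
            "terms": {"field": "eid", "size": 200},
            "aggregations": {
                "COUNT(DISTINCT eid)": {
                    "cardinality": {"field": "eid", "precision_threshold": 40000}
                }
            },
        }
    },
}

patched = add_execution_hint(original)
```

The patched body matches the modified request shown above: only the `terms` aggregation gains the hint, and the nested `cardinality` aggregation is left unchanged.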