NullPointerException on /_cat/indices when cluster RED #26942

Closed
jeancornic opened this issue Oct 10, 2017 · 4 comments
Labels: >bug, :Data Management/CAT APIs, :Distributed/Distributed, good first issue

jeancornic commented Oct 10, 2017

Elasticsearch version: Version: 5.6.1, Build: 667b497/2017-09-14T19:22:05.189Z

Plugins installed: []

JVM version: 1.8.0_144

OS version (uname -a if on a Unix-like system):
Linux NodeB 4.4.0-1013-aws #22-Ubuntu SMP Fri Mar 31 15:41:31 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

We have a cluster with 6 nodes.
Probably due to network flakiness, some nodes started to lose connection with each other for several minutes.
At least two nodes, NodeB and NodeA, lost connection with the master.

The cluster went red, and stayed red even after all the nodes rejoined the cluster.

[2017-10-06T01:42:07,068][INFO ][o.e.c.r.a.AllocationService] [NodeC] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[index_1][1]] ...]).
[2017-10-06T01:35:35,929][INFO ][o.e.c.r.a.AllocationService] [NodeC] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[index_2][0]] ...]).
[2017-10-06T01:21:20,590][INFO ][o.e.c.r.a.AllocationService] [NodeC] Cluster health status changed from [YELLOW] to [RED] (reason: [{NodeB}{ItIC8RvWQk-K9xv4EtUH-g}{doC-pFtIRpmn-PaywU9AFg}{IpNodeB}{IpNodeB:9300}{availability_zone=us-east-1b, tag=histo} failed to ping, tried [3] times, each with maximum [30s] timeout]).
[2017-10-06T01:20:50,553][INFO ][o.e.c.r.a.AllocationService] [NodeC] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{NodeA}{tAZvKePkTqiqP44v4g4L7g}{areeqGx9RiO_q3_vip_fYA}{IpNodeA}{IpNodeA:9300}{availability_zone=us-east-1c, tag=histo} failed to ping, tried [3] times, each with maximum [30s] timeout]).

When executing /_cat/indices, the request failed with:

[2017-10-06T01:33:27,125][WARN ][r.suppressed             ] path: /_cat/indices, params: {}
java.lang.NullPointerException: null
        at org.elasticsearch.rest.action.cat.RestIndicesAction.buildTable(RestIndicesAction.java:368) ~[elasticsearch-5.6.1.jar:5.6.1]
        at org.elasticsearch.rest.action.cat.RestIndicesAction$1$1$1.buildResponse(RestIndicesAction.java:116) ~[elasticsearch-5.6.1.jar:5.6.1]
        at org.elasticsearch.rest.action.cat.RestIndicesAction$1$1$1.buildResponse(RestIndicesAction.java:113) ~[elasticsearch-5.6.1.jar:5.6.1]
        at org.elasticsearch.rest.action.RestResponseListener.processResponse(RestResponseListener.java:37) ~[elasticsearch-5.6.1.jar:5.6.1]
        at org.elasticsearch.rest.action.RestActionListener.onResponse(RestActionListener.java:47) [elasticsearch-5.6.1.jar:5.6.1]

The offending line is RestIndicesAction.java#L368.

A restart of NodeA at 01:35:16 fixed the issue.

Some relevant logs (not exhaustive):

[2017-10-06T01:35:56,904][WARN ][o.e.c.a.s.ShardStateAction] [NodeC] [index_1][2] received shard failed for shard id [[index_1][2]], allocation id [hLAZTvTCRWGJ_vBnpc5xbg], primary term [4], message [mark copy as stale]
[2017-10-06T01:35:35,981][WARN ][o.e.c.a.s.ShardStateAction] [NodeC] [index_2][0] received shard failed for shard id [[index_2][0]], allocation id [IvPPBlcRRnSQUA43s9v0qw], primary term [4], message [mark copy as stale]
[2017-10-06T01:35:10,053][WARN ][o.e.c.a.s.ShardStateAction] [NodeC] [index_2][1] received shard failed for shard id [[index_2][1]], allocation id [xRch14rfR_OvfQHYvPul-g], primary term [2], message [mark copy as stale]
[2017-10-06T01:35:09,840][WARN ][o.e.c.a.s.ShardStateAction] [NodeC] [index_2][1] received shard failed for shard id [[index_2][1]], allocation id [7JpkTXDnSR-Z54p3t9dlTQ], primary term [1], message [failed to perform indices:data/write/bulk[s] on replica [index_2][1], node[0dPW5AaBR--KS7JRNB32yA], [R], s[STARTED], a[id=7JpkTXDnSR-Z54p3t9dlTQ]], failure [RemoteTransportException[[NodeC][IpNodeC:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[active primary shard cannot be a replication target before relocation hand off [index_2][1], node[0dPW5AaBR--KS7JRNB32yA], [P], s[STARTED], a[id=7JpkTXDnSR-Z54p3t9dlTQ], state is [STARTED]]; ]
[2017-10-06T01:35:09,840][WARN ][o.e.c.a.s.ShardStateAction] [NodeC] [index_2][1] received shard failed for shard id [[index_2][1]], allocation id [7JpkTXDnSR-Z54p3t9dlTQ], primary term [1], message [failed to perform indices:data/write/bulk[s] on replica [index_2][1], node[0dPW5AaBR--KS7JRNB32yA], [R], s[STARTED], a[id=7JpkTXDnSR-Z54p3t9dlTQ]], failure [RemoteTransportException[[NodeC][IpNodeC:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[active primary shard cannot be a replication target before relocation hand off [index_2][1], node[0dPW5AaBR--KS7JRNB32yA], [P], s[STARTED], a[id=7JpkTXDnSR-Z54p3t9dlTQ], state is [STARTED]]; ]
[2017-10-06T01:35:09,894][WARN ][o.e.a.b.TransportShardBulkAction] [NodeF] [[index_2][1]] failed to perform indices:data/write/bulk[s] on replica [index_2][1], node[0dPW5AaBR--KS7JRNB32yA], [R], s[STARTED], a[id=7JpkTXDnSR-Z54p3t9dlTQ]
[2017-10-06T01:35:04,107][WARN ][o.e.a.b.TransportShardBulkAction] [NodeD] [[index_2][0]] failed to perform indices:data/write/bulk[s] on replica [index_2][0], node[l-TN-YQMThO8V_srAwknTg], [R], s[STARTED], a[id=IvPPBlcRRnSQUA43s9v0qw]
[2017-10-06T01:21:20,553][WARN ][o.e.d.z.PublishClusterStateAction] [NodeC] timed out waiting for all nodes to process published state [423] (timeout [30s], pending nodes: [{NodeD}{of6-ePXOT6uGk5TDKS1h-A}{IGu1YUCSRNiPOUgcq8HClw}{IpNodeD}{IpNodeD:9300}{availability_zone=us-east-1c, tag=fresh}, {NodeB}{ItIC8RvWQk-K9xv4EtUH-g}{doC-pFtIRpmn-PaywU9AFg}{IpNodeB}{IpNodeB:9300}{availability_zone=us-east-1b, tag=histo}, {NodeE}{_2uc635bS66TcqHVXjWpLA}{SzgLC8b0SpegMwaKLkPhgA}{IpNodeE}{IpNodeE:9300}{availability_zone=us-east-1a, tag=histo}])
[2017-10-06T01:21:20,594][INFO ][o.e.c.s.ClusterService ] [NodeF] removed {{NodeB}{ItIC8RvWQk-K9xv4EtUH-g}{doC-pFtIRpmn-PaywU9AFg}{IpNodeB}{IpNodeB:9300}{availability_zone=us-east-1b, tag=histo},}, reason: zen-disco-receive(from master [master {NodeC}{0dPW5AaBR--KS7JRNB32yA}{bvYMHcw-QZ6xTN8SMaaMHw}{IpNodeC}{IpNodeC:9300}{availability_zone=us-east-1b, tag=fresh} committed version [424]])
[2017-10-06T01:21:20,579][WARN ][o.e.c.s.ClusterService ] [NodeC] cluster state update task [zen-disco-node-failed({NodeA}{tAZvKePkTqiqP44v4g4L7g}{areeqGx9RiO_q3_vip_fYA}{IpNodeA}{IpNodeA:9300}{availability_zone=us-east-1c, tag=histo}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{NodeA}{tAZvKePkTqiqP44v4g4L7g}{areeqGx9RiO_q3_vip_fYA}{IpNodeA}{IpNodeA:9300}{availability_zone=us-east-1c, tag=histo} failed to ping, tried [3] times, each with maximum [30s] timeout]] took [30s] above the warn threshold of 30s
jeancornic (Author) commented:

Is this expected behaviour? Would dedicated master nodes help against this kind of issue?

It's hard to say how we could reproduce it, unfortunately. Maybe you guys have a better idea.
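
For context, a dedicated master-eligible node in 5.x is configured roughly as follows. This is a sketch with illustrative values, not taken from the report; minimum_master_nodes must be derived from your own master-eligible node count.

    # elasticsearch.yml for a dedicated master-eligible node (5.x settings)
    node.master: true      # eligible to be elected master
    node.data: false       # holds no shard data
    node.ingest: false     # runs no ingest pipelines
    # Quorum to avoid split brain: (master_eligible_nodes / 2) + 1;
    # with three dedicated masters this is 2.
    discovery.zen.minimum_master_nodes: 2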

dnhatn commented Oct 10, 2017

This happens while a primary shard is being relocated. I think we can reproduce it as follows (see the sketch after the list):

  1. Have 2+ nodes running
  2. Create an index with number_of_replicas >= 1, number_of_shards = 1
  3. Continuously execute GET /_cat/indices
  4. Shut down a node that contains the primary shard
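
A minimal sketch of steps 2–3 against a local cluster (the index name and settings here are illustrative, not from the report):

    PUT /test-index
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      }
    }

    GET /_cat/indices

Loop the GET /_cat/indices call (e.g. from a shell) while stopping the node that holds the primary of test-index; the request should intermittently fail with the NullPointerException above.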


dnhatn commented Oct 10, 2017

@jasontedor, can I start working on this?

dnhatn added :Cluster, good first issue labels on Oct 10, 2017
jasontedor (Member) commented:

@dnhatn Please do.

dnhatn added >bug, :Data Management/CAT APIs labels on Oct 10, 2017
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Oct 10, 2017
When a node which contains the primary shard is unavailable, the primary
stats (and the total stats) of an `IndexStats` will be empty for a short
moment (while the primary shard is being relocated). However, we assume
that these stats are always non-empty when handling `_cat/indices` in
RestIndicesAction. This commit checks the content of these stats before
accessing them.

Closes elastic#26942
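
For reference, the guard described above boils down to the pattern sketched below. This is a simplified, hypothetical illustration, not the committed diff; it assumes only the 5.6-era IndexStats#getPrimaries()/IndexStats#getTotal() accessors, and the helper class and method names are invented here.

    // Hypothetical helper sketching the defensive pattern: never dereference
    // primary/total stats without a null check, since both can be missing
    // while the node holding the primary shard is unavailable.
    import org.elasticsearch.action.admin.indices.stats.CommonStats;
    import org.elasticsearch.action.admin.indices.stats.IndexStats;

    final class IndexStatsGuard {

        // Primary-shard stats, or null when no primary stats were collected.
        static CommonStats primariesOrNull(IndexStats indexStats) {
            return indexStats == null ? null : indexStats.getPrimaries();
        }

        // Total (primary + replica) stats, or null when unavailable.
        static CommonStats totalOrNull(IndexStats indexStats) {
            return indexStats == null ? null : indexStats.getTotal();
        }
    }

A caller in RestIndicesAction.buildTable can then emit an empty cell when these return null instead of dereferencing blindly.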
dnhatn added the review label on Oct 10, 2017
dnhatn added further commits that referenced this issue on Oct 10, 2017, each with the same message as above.
clintongormley added :Distributed/Distributed and removed :Cluster labels on Feb 13, 2018
rahulanishetty pushed a commit to rahulanishetty/elasticsearch that referenced this issue on Jul 26, 2018, with the same message as above.