NullPointerException on /_cat/indices when cluster RED #26942

Closed
jeancornic opened this issue Oct 10, 2017 · 4 comments
Labels: >bug, :Data Management/CAT APIs, :Distributed/Distributed, good first issue

jeancornic commented Oct 10, 2017

Elasticsearch version: Version: 5.6.1, Build: 667b497/2017-09-14T19:22:05.189Z

Plugins installed: []

JVM version: 1.8.0_144

OS version (uname -a if on a Unix-like system):
Linux NodeB 4.4.0-1013-aws #22-Ubuntu SMP Fri Mar 31 15:41:31 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

We have a cluster with 6 nodes.
Probably due to network flakiness, some nodes started to lose connection with each other for several minutes.
At least two nodes, NodeB and NodeA, lost connection with the master.

The cluster went red, and stayed red even after all the nodes rejoined the cluster.

[2017-10-06T01:42:07,068][INFO ][o.e.c.r.a.AllocationService] [NodeC] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[index_1][1]] ...]).
[2017-10-06T01:35:35,929][INFO ][o.e.c.r.a.AllocationService] [NodeC] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[index_2][0]] ...]).
[2017-10-06T01:21:20,590][INFO ][o.e.c.r.a.AllocationService] [NodeC] Cluster health status changed from [YELLOW] to [RED] (reason: [{NodeB}{ItIC8RvWQk-K9xv4EtUH-g}{doC-pFtIRpmn-PaywU9AFg}{IpNodeB}{IpNodeB:9300}{availability_zone=us-east-1b, tag=histo} failed to ping, tried [3] times, each with maximum [30s] timeout]).
[2017-10-06T01:20:50,553][INFO ][o.e.c.r.a.AllocationService] [NodeC] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{NodeA}{tAZvKePkTqiqP44v4g4L7g}{areeqGx9RiO_q3_vip_fYA}{IpNodeA}{IpNodeA:9300}{availability_zone=us-east-1c, tag=histo} failed to ping, tried [3] times, each with maximum [30s] timeout]).

When executing /_cat/indices, the request failed with:

[2017-10-06T01:33:27,125][WARN ][r.suppressed             ] path: /_cat/indices, params: {}
java.lang.NullPointerException: null
        at org.elasticsearch.rest.action.cat.RestIndicesAction.buildTable(RestIndicesAction.java:368) ~[elasticsearch-5.6.1.jar:5.6.1]
        at org.elasticsearch.rest.action.cat.RestIndicesAction$1$1$1.buildResponse(RestIndicesAction.java:116) ~[elasticsearch-5.6.1.jar:5.6.1]
        at org.elasticsearch.rest.action.cat.RestIndicesAction$1$1$1.buildResponse(RestIndicesAction.java:113) ~[elasticsearch-5.6.1.jar:5.6.1]
        at org.elasticsearch.rest.action.RestResponseListener.processResponse(RestResponseListener.java:37) ~[elasticsearch-5.6.1.jar:5.6.1]
        at org.elasticsearch.rest.action.RestActionListener.onResponse(RestActionListener.java:47) [elasticsearch-5.6.1.jar:5.6.1]

The offending line is RestIndicesAction.java#L368.

A restart of NodeA at 01:35:16 fixed the issue.

Some relevant logs (not exhaustive):

[2017-10-06T01:35:56,904][WARN ][o.e.c.a.s.ShardStateAction] [NodeC] [index_1][2] received shard failed for shard id [[index_1][2]], allocation id [hLAZTvTCRWGJ_vBnpc5xbg], primary term [4], message [mark copy as stale]
[2017-10-06T01:35:35,981][WARN ][o.e.c.a.s.ShardStateAction] [NodeC] [index_2][0] received shard failed for shard id [[index_2][0]], allocation id [IvPPBlcRRnSQUA43s9v0qw], primary term [4], message [mark copy as stale]
[2017-10-06T01:35:10,053][WARN ][o.e.c.a.s.ShardStateAction] [NodeC] [index_2][1] received shard failed for shard id [[index_2][1]], allocation id [xRch14rfR_OvfQHYvPul-g], primary term [2], message [mark copy as stale]
[2017-10-06T01:35:09,840][WARN ][o.e.c.a.s.ShardStateAction] [NodeC] [index_2][1] received shard failed for shard id [[index_2][1]], allocation id [7JpkTXDnSR-Z54p3t9dlTQ], primary term [1], message [failed to perform indices:data/write/bulk[s] on replica [index_2][1], node[0dPW5AaBR--KS7JRNB32yA], [R], s[STARTED], a[id=7JpkTXDnSR-Z54p3t9dlTQ]], failure [RemoteTransportException[[NodeC][IpNodeC:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[active primary shard cannot be a replication target before relocation hand off [index_2][1], node[0dPW5AaBR--KS7JRNB32yA], [P], s[STARTED], a[id=7JpkTXDnSR-Z54p3t9dlTQ], state is [STARTED]]; ]
[2017-10-06T01:35:09,840][WARN ][o.e.c.a.s.ShardStateAction] [NodeC] [index_2][1] received shard failed for shard id [[index_2][1]], allocation id [7JpkTXDnSR-Z54p3t9dlTQ], primary term [1], message [failed to perform indices:data/write/bulk[s] on replica [index_2][1], node[0dPW5AaBR--KS7JRNB32yA], [R], s[STARTED], a[id=7JpkTXDnSR-Z54p3t9dlTQ]], failure [RemoteTransportException[[NodeC][IpNodeC:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[active primary shard cannot be a replication target before relocation hand off [index_2][1], node[0dPW5AaBR--KS7JRNB32yA], [P], s[STARTED], a[id=7JpkTXDnSR-Z54p3t9dlTQ], state is [STARTED]]; ]
[2017-10-06T01:35:09,894][WARN ][o.e.a.b.TransportShardBulkAction] [NodeF] [[index_2][1]] failed to perform indices:data/write/bulk[s] on replica [index_2][1], node[0dPW5AaBR--KS7JRNB32yA], [R], s[STARTED], a[id=7JpkTXDnSR-Z54p3t9dlTQ]
[2017-10-06T01:35:04,107][WARN ][o.e.a.b.TransportShardBulkAction] [NodeD] [[index_2][0]] failed to perform indices:data/write/bulk[s] on replica [index_2][0], node[l-TN-YQMThO8V_srAwknTg], [R], s[STARTED], a[id=IvPPBlcRRnSQUA43s9v0qw]
[2017-10-06T01:21:20,553][WARN ][o.e.d.z.PublishClusterStateAction] [NodeC] timed out waiting for all nodes to process published state [423] (timeout [30s], pending nodes: [{NodeD}{of6-ePXOT6uGk5TDKS1h-A}{IGu1YUCSRNiPOUgcq8HClw}{IpNodeD}{IpNodeD:9300}{availability_zone=us-east-1c, tag=fresh}, {NodeB}{ItIC8RvWQk-K9xv4EtUH-g}{doC-pFtIRpmn-PaywU9AFg}{IpNodeB}{IpNodeB:9300}{availability_zone=us-east-1b, tag=histo}, {NodeE}{_2uc635bS66TcqHVXjWpLA}{SzgLC8b0SpegMwaKLkPhgA}{IpNodeE}{IpNodeE:9300}{availability_zone=us-east-1a, tag=histo}])
[2017-10-06T01:21:20,594][INFO ][o.e.c.s.ClusterService ] [NodeF] removed {{NodeB}{ItIC8RvWQk-K9xv4EtUH-g}{doC-pFtIRpmn-PaywU9AFg}{IpNodeB}{IpNodeB:9300}{availability_zone=us-east-1b, tag=histo},}, reason: zen-disco-receive(from master [master {NodeC}{0dPW5AaBR--KS7JRNB32yA}{bvYMHcw-QZ6xTN8SMaaMHw}{IpNodeC}{IpNodeC:9300}{availability_zone=us-east-1b, tag=fresh} committed version [424]])
[2017-10-06T01:21:20,579][WARN ][o.e.c.s.ClusterService ] [NodeC] cluster state update task [zen-disco-node-failed({NodeA}{tAZvKePkTqiqP44v4g4L7g}{areeqGx9RiO_q3_vip_fYA}{IpNodeA}{IpNodeA:9300}{availability_zone=us-east-1c, tag=histo}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{NodeA}{tAZvKePkTqiqP44v4g4L7g}{areeqGx9RiO_q3_vip_fYA}{IpNodeA}{IpNodeA:9300}{availability_zone=us-east-1c, tag=histo} failed to ping, tried [3] times, each with maximum [30s] timeout]] took [30s] above the warn threshold of 30s
jeancornic (Author) commented:

Is this expected behaviour? Would dedicated master nodes help against this kind of issue?

It's hard to say how we could reproduce it, unfortunately. Maybe you guys have a better idea.
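
For context, a dedicated master-eligible node in 5.x is configured roughly as follows. This is a sketch with illustrative values, not taken from the report; minimum_master_nodes must be derived from your own master-eligible node count.

    # elasticsearch.yml for a dedicated master-eligible node (5.x settings)
    node.master: true      # eligible to be elected master
    node.data: false       # holds no shard data
    node.ingest: false     # runs no ingest pipelines
    # Quorum to avoid split brain: (master_eligible_nodes / 2) + 1;
    # with three dedicated masters this is 2.
    discovery.zen.minimum_master_nodes: 2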

dnhatn commented Oct 10, 2017

This happens while a primary shard is being relocated. I think we can reproduce it as follows (see the sketch after the list):

  1. Have 2+ nodes running
  2. Create an index with number_of_replicas >= 1, number_of_shards = 1
  3. Continuously execute GET /_cat/indices
  4. Shut down a node that contains the primary shard
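
A minimal sketch of steps 2–3 against a local cluster (the index name and settings here are illustrative, not from the report):

    PUT /test-index
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      }
    }

    GET /_cat/indices

Loop the GET /_cat/indices call (e.g. from a shell) while stopping the node that holds the primary of test-index; the request should intermittently fail with the NullPointerException above.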


dnhatn commented Oct 10, 2017

@jasontedor, can I start working on this?

dnhatn added :Cluster, good first issue labels on Oct 10, 2017
jasontedor (Member) commented:

@dnhatn Please do.

dnhatn added >bug, :Data Management/CAT APIs labels on Oct 10, 2017
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Oct 10, 2017
When a node which contains the primary shard is unavailable, the primary
stats (and the total stats) of an `IndexStats` will be empty for a short
moment (while the primary shard is being relocated). However, we assume
that these stats are always non-empty when handling `_cat/indices` in
RestIndicesAction. This commit checks the content of these stats before
accessing them.

Closes elastic#26942
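
For reference, the guard described above boils down to the pattern sketched below. This is a simplified, hypothetical illustration, not the committed diff; it assumes only the 5.6-era IndexStats#getPrimaries()/IndexStats#getTotal() accessors, and the helper class and method names are invented here.

    // Hypothetical helper sketching the defensive pattern: never dereference
    // primary/total stats without a null check, since both can be missing
    // while the node holding the primary shard is unavailable.
    import org.elasticsearch.action.admin.indices.stats.CommonStats;
    import org.elasticsearch.action.admin.indices.stats.IndexStats;

    final class IndexStatsGuard {

        // Primary-shard stats, or null when no primary stats were collected.
        static CommonStats primariesOrNull(IndexStats indexStats) {
            return indexStats == null ? null : indexStats.getPrimaries();
        }

        // Total (primary + replica) stats, or null when unavailable.
        static CommonStats totalOrNull(IndexStats indexStats) {
            return indexStats == null ? null : indexStats.getTotal();
        }
    }

A caller in RestIndicesAction.buildTable can then emit an empty cell when these return null instead of dereferencing blindly.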
dnhatn added the review label on Oct 10, 2017
dnhatn added further commits that referenced this issue on Oct 10, 2017, each with the same message as above.
clintongormley added :Distributed/Distributed and removed :Cluster labels on Feb 13, 2018
rahulanishetty pushed a commit to rahulanishetty/elasticsearch that referenced this issue on Jul 26, 2018, with the same message as above.