NullPointerException when multiple zones are used #16967

Closed
ghorkov opened this issue Mar 5, 2016 · 4 comments
Assignees
Labels
>bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

ghorkov commented Mar 5, 2016

Elasticsearch version: 2.1.2
JVM version: 1.8.0_74
OS version: Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
GCE Discovery is not working when multiple zones are used. I followed the steps here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/cloud-gce-usage-discovery-zones.html
I tested my configuration with a single zone in the configuration file: GCE discovery works and the nodes communicate correctly. As soon as I add a second zone, the nodes stop communicating with each other.

I expected to be able to configure all nodes to communicate with each other across all Google Cloud zones. If I need a new node in a different zone, I could clone my existing instance template, move it to the new zone, and communication between nodes would happen automatically across all zones.

Steps to reproduce:

  1. Install Elasticsearch 2.1.2 and the GCE plugin on 2 Linux instances in Google Cloud
  2. Configure one of the instances to use multiple zones

Provide logs (if relevant):

```
[2016-03-04 13:25:35,730][WARN ][discovery.gce ] [Firebrand] Exception caught during discovery java.lang.NullPointerException : null
[2016-03-04 13:25:35,732][TRACE][discovery.gce ] [Firebrand] Exception caught during discovery
java.lang.NullPointerException
    at com.google.common.collect.Iterables$3.transform(Iterables.java:512)
    at com.google.common.collect.Iterables$3.transform(Iterables.java:509)
    at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
    at com.google.common.collect.Iterators$5.hasNext(Iterators.java:548)
    at org.elasticsearch.common.util.CollectionUtils.iterableAsArrayList(CollectionUtils.java:390)
    at org.elasticsearch.cloud.gce.GceComputeServiceImpl.instances(GceComputeServiceImpl.java:97)
    at org.elasticsearch.discovery.gce.GceUnicastHostsProvider.buildDynamicNodes(GceUnicastHostsProvider.java:123)
    at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.sendPings(UnicastZenPing.java:335)
    at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.ping(UnicastZenPing.java:240)
    at org.elasticsearch.discovery.zen.ping.ZenPingService.ping(ZenPingService.java:106)
    at org.elasticsearch.discovery.zen.ping.ZenPingService.pingAndWait(ZenPingService.java:84)
    at org.elasticsearch.discovery.zen.ZenDiscovery.findMaster(ZenDiscovery.java:879)
    at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:335)
    at org.elasticsearch.discovery.zen.ZenDiscovery.access$5000(ZenDiscovery.java:75)
    at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1236)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2016-03-04 13:25:35,743][DEBUG][discovery.gce ] [Firebrand] 0 node(s) added
```

My configuration yml file:

```
cluster.name: elastic
network.host: 0.0.0.0
http.port: 9201
http.cors.enabled : true
http.cors.allow-origin : /.*/
transport.tcp.port: 9301
cloud:
  gce:
      project_id: app
      zone: ["asia-east1-a", "us-central1-a"]
discovery:
      type: gce
shield.enabled: false
readonlyrest:
    enable: true
    response_if_req_forbidden: Not found
    access_control_rules:

    - name: Accept only requests with api keys
      type: allow
      api_keys: [XXX]
      methods: [GET,POST,PUT,DELETE,OPTIONS]
```

By the way: I've also configured the metadata es_port=9301

GCE discovery only works if the zone property is changed to one of the following:

```
zone: ["asia-east1-a"]
zone: asia-east1-a
```

dadoonet commented Mar 5, 2016

I looked at the source code, and I think this could happen if you have no instances at all running in one of the zones you listed. Is that your case?
If I'm right, we need to handle this case properly instead of throwing an NPE.

Could you confirm that please?

For the record, we have a test which tries settings with 2 zones: https://github.com/elastic/elasticsearch/blob/2.1/plugins/cloud-gce/src/test/java/org/elasticsearch/discovery/gce/GceDiscoverySettingsTests.java#L72.

@dadoonet dadoonet self-assigned this Mar 5, 2016
@dadoonet dadoonet removed the help wanted adoptme label Mar 5, 2016

ghorkov commented Mar 5, 2016

Thank you dadoonet,

That is correct. At the moment I don't have any instances running in the second zone. This is my current setup:

zone: asia-east1-a has node1 & node2
zone: us-central1-a doesn't have any active nodes

The problem is that when a second zone is added to the configuration file, communication between the nodes stops. I'm using autoscaling, so depending on traffic I may need a new node in a different zone, or, if traffic is low, the nodes in a certain zone can be shut down, leaving that zone empty.


dadoonet commented Mar 6, 2016

Thanks for confirming. Definitely something we need to fix.

dadoonet added a commit to dadoonet/elasticsearch that referenced this issue Jun 30, 2016
When GCE region is empty we get back from the API something like:

```
{
  "id": "dummy"
}
```

instead of:

```
{
  "id": "dummy",
  "items":[ ]
}
```

This generates a NPE when we aggregate all the lists into a single one.

Closes elastic#16967.
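
For illustration, the failure mode in the commit message can be sketched as follows. This is a minimal, self-contained example and not the actual patch: `ZoneResponse` and `collectInstances` are hypothetical names standing in for the per-zone API response and the aggregation step, and the instance names are taken from the setup described earlier in the thread. The idea is simply to treat a missing `items` field as an empty list before merging.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class EmptyZoneSketch {

    /** Hypothetical stand-in for one zone's API response. */
    static final class ZoneResponse {
        final String id;
        final List<String> items; // null when the zone is empty, mirroring the missing "items" field

        ZoneResponse(String id, List<String> items) {
            this.id = id;
            this.items = items;
        }
    }

    /** Flattens per-zone responses into one list, treating a missing items field as empty. */
    static List<String> collectInstances(List<ZoneResponse> responses) {
        List<String> all = new ArrayList<>();
        for (ZoneResponse response : responses) {
            // Without this null check, an empty zone would trigger a NullPointerException here.
            List<String> items = response.items == null
                    ? Collections.<String>emptyList()
                    : response.items;
            all.addAll(items);
        }
        return all;
    }

    public static void main(String[] args) {
        List<ZoneResponse> responses = Arrays.asList(
                new ZoneResponse("asia-east1-a", Arrays.asList("node1", "node2")),
                new ZoneResponse("us-central1-a", null)); // empty zone: no "items" field
        System.out.println(collectInstances(responses)); // prints [node1, node2]
    }
}
```

With the guard in place, a configured zone that currently has no instances contributes nothing to the result instead of aborting discovery for every zone.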
@dadoonet

@ghorkov I was able to reproduce it and came up with fixes for the 5.x and 2.x versions.

@clintongormley clintongormley added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs and removed :Plugin Cloud GCE labels Feb 14, 2018