Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More graceful handling of failure in DescribeInstances (ec2-discovery) #28452

Closed
azsolinsky opened this issue Jan 31, 2018 · 10 comments
Closed
Assignees
Labels
:Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure >enhancement help wanted adoptme

Comments

@azsolinsky
Copy link

azsolinsky commented Jan 31, 2018

If the DescribeInstances call fails from the EC2 Discovery plugin for any reason, the code just returns an empty list of nodes. This is bad because the code currently caches it until the refresh interval expires. This is bad because the code uses the empty list of nodes immediately, and will try to make the call again on the next get, which potentially doesn't include any retry back-off.

https://github.com/elastic/elasticsearch/blob/139deb535a58de87c602888a121b2791bcd22df2/plugins/discovery-ec2/src/main/java/org/elasticsearch/discovery/ec2/AwsEc2UnicastHostsProvider.java#L106:L119

~~With the default refresh of 10s this is sometimes not catastrophic; however, if throttling is happening a lot it can potentially cause the masters to not be able to communicate with one another and lead to cluster instability. ~~

Also, with this bug, increasing the refresh interval is dangerous because the empty results list is cached until the refresh interval expires. The code should probably not return empty list if it is being throttled and continue to use the list from the last successful call, or possibly retry more with exponential back-off for throttling exceptions.

@ywelsch
Copy link
Contributor

ywelsch commented Feb 6, 2018

@azsolinsky How did you determine for this to be the cause? Looking at DiscoNodesCache, which is the class that does the caching, it seems to account for the empty list. If the returned list is empty, no caching should be happening. The only possible fault I can find with the DiscoNodesCache class is that the empty variable might not be concurrency-safe.

@azsolinsky
Copy link
Author

azsolinsky commented Feb 7, 2018

Ok, how did you determine no caching is happening if it returns an empty list?

https://github.com/elastic/elasticsearch/blob/139deb535a58de87c602888a121b2791bcd22df2/plugins/discovery-ec2/src/main/java/org/elasticsearch/discovery/ec2/AwsEc2UnicastHostsProvider.java#L239:L243

https://github.com/elastic/elasticsearch/blob/139deb535a58de87c602888a121b2791bcd22df2/server/src/main/java/org/elasticsearch/common/util/SingleObjectCache.java#L54:L64

If empty list is returned the SIngleObjectCache will certainly cache it; only if the response were null would it not be cached; the code doesn't distinguish between empty list and and non-empty list; why do you think it will not cache the empty list?

(EDIT: I see now that the needsRefresh() was overridden, thus making it refresh on the next call and preventing it from effectively being cached -- I missed this in my initial inspection of the code)

@rjernst
Copy link
Member

rjernst commented Feb 7, 2018

@azsolinsky The empty variable is set if the returned list is empty, which means needsRefresh() will return true.

@azsolinsky
Copy link
Author

azsolinsky commented Feb 8, 2018 via email

@azsolinsky
Copy link
Author

azsolinsky commented Feb 8, 2018 via email

@rjernst
Copy link
Member

rjernst commented Feb 8, 2018

I'm not saying the current behavior is good or correct, just that it will not "cache the empty list" as was originally stated in this issue.

@azsolinsky azsolinsky changed the title EC2 Discovery Plugin Should Not Cache Empty Results due to Throttling of DescribeIntstances Call EC2 Discovery Plugin should handle throttling of DescribeIntstances Call more gracefully, and not return an empty list Feb 8, 2018
@azsolinsky
Copy link
Author

Ok, your point is taken. I took for granted that needsRefresh was being overridden, and thought it would not refresh on a second call. I updated the title. However, I think the behavior during throttling still needs to be addressed to protect the service from spikes calling the describe instance call API in the case of throttling.

@rjernst rjernst added the help wanted adoptme label Feb 9, 2018
@rjernst
Copy link
Member

rjernst commented Feb 9, 2018

@azsolinsky I've marked this as an adoptme. Do you have any interesting in working on a PR?

@azsolinsky
Copy link
Author

I have interest, but I'm not sure about my availability; I'll have to get back on it in the near future.

@clintongormley clintongormley added :Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure and removed :Plugin Discovery EC2 labels Feb 13, 2018
@dadoonet dadoonet removed their assignment Mar 9, 2018
@DaveCTurner DaveCTurner changed the title EC2 Discovery Plugin should handle throttling of DescribeIntstances Call more gracefully, and not return an empty list More graceful handling of failure in DescribeInstances (ec2-discovery) Mar 14, 2018
@albertzaharovits albertzaharovits self-assigned this Mar 27, 2018
@henningandersen
Copy link
Contributor

@azsolinsky thanks for your interest in improving elasticsearch. We discussed this internally and do not currently intend to work on this in the foreseeable future. I will therefore close this issue. If you still have an interest in improving this, feel free to open a pull request for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Discovery-Plugins Anything related to our integration plugins with EC2, GCP and Azure >enhancement help wanted adoptme
Projects
None yet
Development

No branches or pull requests

8 participants