Elasticsearch 5.0: _nodes/stats?all throws a 400 #21410
At a glance, it looks related to #21266 (see https://github.com/elastic/elasticsearch/blob/5.0/core/src/main/java/org/elasticsearch/rest/action/admin/cluster/RestNodesStatsAction.java#L64).
By default, node stats returns all stats. You can explicitly request all stats by hitting /_nodes/stats/_all or /_nodes/stats?metric=_all.
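For concreteness, the 5.x-era forms can be sketched like this (the localhost:9200 address is a placeholder, not from this thread):

```python
# Placeholder cluster address -- adjust for your deployment.
BASE = "http://localhost:9200"

# On 5.x, node stats returns every stat by default, so all of these
# request the same full payload:
equivalent = [
    BASE + "/_nodes/stats",
    BASE + "/_nodes/stats/_all",
    BASE + "/_nodes/stats?metric=_all",
]

# The pre-5.0 form is now rejected with an HTTP 400
# ("contains unrecognized parameter: [all]"):
rejected = BASE + "/_nodes/stats?all"

for url in equivalent:
    print("GET", url)
```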
Actually (and this goes waaay back to pre-0.90), stats used to return just a selection of stats, and …
@jasontedor Was this the case in all previous versions? Monitoring systems that hit these APIs need to span many versions, and every breaking change like this breaks our monitoring system. And anyone else's spanning many versions. Why break this? Why was it necessary? Just ignore the param if the behavior would be the same.
It's like this in 1.x and 2.x as well.
It was the case in 1.x and 2.x for sure, I just checked the code.
It didn't do anything; it was silently ignored.
This is why we are getting rid of silently ignoring things. All this time you thought that parameter was doing something, and it wasn't.
It was doing something in old versions. If you think no one is still using 0.90, that's just way off base. Unfortunately, that's reality. And they're using some of our stuff to monitor it. So now, for no good reason, instead of maintaining minimal code in Elastic to ignore it, every monitoring system has to maintain a code fork and version awareness to adapt to this. That's not an efficient use of time for anyone. This is happening with every release, and none of them were necessary. Please, quit breaking things for very minor cleanups. It's just shifting a lot of work onto everyone else.
So we're just supposed to keep all of our mistakes forever? Sorry @NickCraver but no. Strict parsing of query string parameters is a really important step forwards, and we're not going to keep cruft forever. We try hard not to break things. Sometimes we do it unintentionally without realising that something will be an issue, for which I apologise, but it is untenable to keep everything the same forever. It would make our code so creaky and complicated we wouldn't be able to touch it without fear of breaking things.
That's right, very old versions, and versions that we do not support anymore.
This change was made almost three years ago in bb27516. The related issue is #4057. The addition of strict REST parsing was a major, not minor, step forward for Elasticsearch. The related PR is #20722.
No. But not acknowledging that the APIs which need to span versions are special and should have a much higher bar here shows a large disconnect from what consumes these APIs. I never said everything should work forever. I am speaking about a very narrow scope of APIs here. Instead of you maintaining support, now everyone else has to deal with it. And all monitoring systems have to have a code fork here to handle it. Is that really the best use of human time? Also note: this isn't ancient history. It was documented to work an hour ago.
Note there is no "strict REST parsing" for the metrics: the API gives an (unexpected) response on an undefined metric "foobar" and it is not rejected as an invalid request.
@jprante Thanks for reporting this, I have a fix for this and will make a formal PR shortly. |
@jasontedor You shouldn't make another breaking change like this until the next major version. Can we assume such a break there would be 6.0 or beyond?
It was a documentation bug and now it's fixed. |
It's consumers that are using the stats endpoints with invalid metrics that are broken, but they are going undetected because we are lenient here. The change would be a bug fix, and I'm definitely not holding it until a major version. It's tempting to even include it in a bug fix release, but it's definitely going into 5.1.0.
@jasontedor Honest question, do you just not care about breaking users? Does Elastic in general not care a bit about semantic versioning? I'm honestly trying to gauge what the intent is here. From a consumer point of view this is both hostile and toxic. As a maintainer of a project people consume, this is just all around bad. There's nothing harmful about these endpoints right now. They are experiencing the same behavior from the day they deployed on 5.0. If they upgrade to 5.1.0 with a change like this, you're breaking them. Why? What good does that do? There's good reason breaking changes are reserved for major versions. Most of the software world understands this.
The fact that we can now strictly parse parameters is a huge improvement for users, since it allows us to be properly consistent and predictable with our APIs. It's something, as with anything in writing software, I personally wish I had done from the get-go. On the other hand, I do appreciate the implications when it comes to backward compatibility. I would like to hear how you use it, to make sure it is not broken already, and whether this fix actually fixes things for you. Our goal is not to break backward compatibility, but I want to understand and appreciate the balance between a wrong invocation of an API, due to mis-design on our end, that gets fixed, vs. really breaking correct existing behavior. @NickCraver can you share more info?
Hey @NickCraver, looking into it: we did make a breaking change, and I view strict URL parsing as important. Regardless of the docs, which we missed (and I apologize for that), the code still rejects the all parameter.
P.S. I am in an EU TZ, so I may not respond quickly moving forward; please don't read my being unresponsive as a "no comment".
@kimchy So context here: there is no functionality fix here; the documentation was retroactively changed to "fix" a breaking change. That breaking change was in a major release (good thing), though IMO unnecessary (a distinct issue). It should be added to the breaking-changes docs regardless. There's now another change proposed in #21417 that's breaking outside a major version change (bad). That's an issue in itself, and just bad behavior for any project.

The documentation fix is appreciated, but what are we fixing here? The strict parameter parsing is good, but would adding this as a recognized parameter and maintaining behavior be a bad thing? Then when a monitoring system hits these routes, it doesn't need version-specific switching. When things like this break, I'm not only fixing our monitoring systems (both Bosun and Opserver); everyone else is doing this too. And that's bad, because this break wasn't really necessary. No one's asking to revert strict parameter parsing; the ask is much smaller in scope: recognize this parameter. It was allowed, and documented to be allowed, past the 5.x release (the live docs still have it).

More globally, the monitoring APIs in general are the thing most likely to need many-version support at once. Any changes that aren't really necessary here are painful, multiplied by the number of systems hitting them. As a user, what's the benefit here? How many lines of code would it take in Elasticsearch to make this work? The current version would need to just ignore it. Is that a big ask? What's happening right now is we've gone from a route that worked fine to throwing a 400, for... what? How much code did we save? Was it worth the pain caused to users?

This is broken. There's no fix for it. Users are on 5.0.0 and I have to support them. The ship has sailed, and either monitoring systems specifically don't support this exact version (if the behavior did go back), or we already have to have this fork in our code. For probably a decade.
But what's the bigger issue here? That breaking changes are proposed for 5.1.0 or (even worse) 5.0.x in #21417 is very bad. These should absolutely not be allowed. If I have no idea whether my stuff will break in the most minor of releases, then I won't upgrade. That's bad for everyone. There are really good reasons that semantic versioning exists, and this is a primary example. Please, don't let this happen. Elastic has to take breaking changes seriously if they want people to be on a version anywhere near current. If users can't trust an upgrade, then they won't. Things like this (if deployed) destroy such trust, instead of building it. I honestly have no idea if Elastic (in general) strives to maintain semantic versioning, but even if that's not the case, can we at least get on the same page with breaking changes? As a developer and a sysadmin, I'm telling you: this is extremely important.
@NickCraver I think we crossed responses. Based on my understanding, 5.0.0 is not a unique version from a code perspective; we just missed the docs. If it were a unique version from a code-behavior perspective, I 1000% agree with you. But it is not. From 5.0.0 onwards, ?all will get rejected.
@kimchy Are you only seeing one of the two issues here? #21417 is not about the all parameter; it's a new breaking change (strict metric checking) proposed for 5.1.0.
The ES team has been extremely aggressive with breaking changes, and it has been very painful to update my apps to accommodate them. I really hope that you got the cleanup done that you wanted to do, and that the bar for breaking changes goes up drastically going forward. Otherwise I'm afraid you are going to fragment and alienate your customers.
Hi @NickCraver. There is a tension here between forcing existing users to adjust to breaking changes and saving new users from hours of debugging when the parameters that they try just don't work. We have had so many issues opened because our APIs silently ignored unknown parameters. We have largely fixed this: we now tell you when you're doing something wrong, and we provide suggestions for which parameters you probably wanted to use. This is a huge step forward in usability.

Regarding the strict checking of metrics in #21417: we have formally embraced strict REST parsing in 5.0. Shouting loud and shouting early saves users so much time and heartache; the benefits are clear. If we find APIs that still accept unknown parameters, it is considered a bug which should be fixed, not a breaking change. If we continue silently ignoring unknown metrics, we're punishing all of those users who don't understand why their requests aren't working correctly. We can add backwards compatibility and deprecation logging for the all parameter.
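The "suggestions for which parameters you probably wanted" behavior can be illustrated with a small sketch (hypothetical Python, not Elasticsearch's actual Java implementation; the parameter set is illustrative):

```python
import difflib

# Hypothetical set of parameters an endpoint recognizes.
RECOGNIZED = {"metric", "timeout", "level", "human", "pretty"}

def check_params(params):
    """Reject unrecognized query parameters, suggesting a close match."""
    for name in params:
        if name not in RECOGNIZED:
            msg = "request contains unrecognized parameter: [%s]" % name
            close = difflib.get_close_matches(name, RECOGNIZED, n=1)
            if close:
                msg += ", did you mean [%s]?" % close[0]
            raise ValueError(msg)

check_params({"metric": "_all"})    # accepted silently
# check_params({"metrics": "_all"}) would raise:
#   ValueError: ...unrecognized parameter: [metrics], did you mean [metric]?
```

A misspelled parameter fails loudly with a hint, instead of being silently ignored as in pre-5.0 releases.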
@ejsmith We're trying really hard to balance these two camps: not hurting existing users and making things better for new users. This is not an easy choice. We don't make breaking changes lightly, and when we do, it is with the intention of moving to more stable APIs. The breaking changes between 1.x and 2.0 were hard for everybody; 2.x to 5.0 should be a much easier experience. We don't always get it right, but please be aware that a lot of soul-searching goes into these decisions.
These are orthogonal issues, and not at all mutually exclusive. Any change can be both, as is the example here. That doesn't change the fact that breaking changes should be reserved for major versions. Yet repeatedly, with this project, they are not. This contributes greatly to people not upgrading. It's one of the reasons we don't upgrade often here.
I'm sorry, but no, that's not how this is coming across. The original comment here was "somebody talk me down from putting this in a 5.0.x release" (I hope that's exact; if not, it's very close) with an emoji on it. I really wish I had the tab open with a screenshot (maybe someone else still does). If you want to carefully consider changes and what goes into things, that's fine, but saying that happened here is bull. The fact that comment was edited shows it's recognized.

Elasticsearch is your project. Do what you want with it. But don't assume this has zero impact. If I have to weigh heavily testing every release for breaks because Elastic doesn't care about semantic versioning and is therefore shifting the burden to me (and every other user), then it's worth it once again to spend equal time evaluating other options altogether. Would we move to Solr here? I have no idea, but it's absolutely worth investigating again, because the time factor went from low and known to completely unknown and significant. When we upgrade Stack Overflow to a new bugfix release, will it break production? Because you're "helping" me with these breaking changes? Well, now we have no confidence in that.

The point of versioning and semantic versioning is that, as a user, I know what I am getting. If it's an x.0.0 upgrade, I know there will be breaks. If there are breaks hiding in build or minor releases, then I a) have to test far more heavily for every change, and b) now expect surprises when we deploy to production. That's unhealthy, at best. Semantic versioning exists for extremely good reasons. I'm really not seeing that anyone at Elastic gets that in this issue. Instead, the comments and actions here indicate the exact opposite.
I agree that was a poorly considered comment which doesn't reflect the thought process here; hence the edit. And your comments are appreciated and heard. They have, for instance, made me reconsider my decision to backport a change to the CAT API to 5.1 - it's now been reverted and will be in 6.0 only.
See: elastic/elasticsearch#21410

The all=true parameter is no longer supported, and causes the HTTP GET to return:

$ curl 'http://ny2-laa-005:9200/_stats?all=true'
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"request [/_stats] contains unrecognized parameter: [all]"}],"type":"illegal_argument_exception","reason":"request [/_stats] contains unrecognized parameter: [all]"},"status":400}

This causes es2graphite.py to throw this stack trace and continuously retry:

2016-12-12 09:31:08,006 [MainThread] [INFO ] 2016-12-12 09:31:06: GET http://ny2-laa-005:9200/_stats?all=true
2016-12-12 09:31:08,352 [MainThread] [ERROR ]
Traceback (most recent call last):
  File "./es2graphite.py", line 328, in <module>
    get_metrics()
  File "./es2graphite.py", line 34, in timed
    result = method(*args, **kw)
  File "./es2graphite.py", line 274, in get_metrics
    indices_stats_data = urllib2.urlopen(indices_stats_url).read()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 437, in open
    response = meth(req, response)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 400: Bad Request
Fixes broken HTTP GET for index statistics. Should be backwards compatible, because the all=true parameter has not been used by Elasticsearch for a long time. See: elastic/elasticsearch#21410
Refer to the issue discussed here: elastic/elasticsearch#21410. As an alternative it was suggested to use "/_nodes/stats", "/_nodes/stats/_all", or "/_nodes/stats?metric=_all". For backward compatibility back to 0.90, using no flags should work, as "/_nodes/stats" returns all stats by default and is consistent across all versions so far.
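That version-agnostic approach can be sketched for a monitoring client (the helper names and address are illustrative, not from es2graphite or any other tool):

```python
import json
import urllib.request

def node_stats_url(base):
    """Build the node-stats URL with no query flags.

    Plain /_nodes/stats returns all stats by default on every version
    from 0.90 through 5.x, so no per-version switching is needed.
    """
    return base.rstrip("/") + "/_nodes/stats"

def fetch_node_stats(base):
    # Network call -- requires a reachable cluster at `base`.
    with urllib.request.urlopen(node_stats_url(base)) as resp:
        return json.load(resp)

print(node_stats_url("http://localhost:9200/"))
# -> http://localhost:9200/_nodes/stats
```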
I get something like that.
Elasticsearch version: 5.0.0 (253032b)
Plugins installed: none
JVM version: 1.8.0_102
OS version: Linux 4.4.20-moby (vanilla 5.0 Docker container)
Description of the problem including expected versus actual behavior:
Calling _nodes/stats?all should return all stats populated, as it did in previous versions and as is still documented to work in the Node Stats documentation. Instead, it returns a 400.
Steps to reproduce:
Hit /_nodes/stats?all, e.g. http://localhost:9200/_nodes/stats?all

Can we restore this functionality, please? Given the lack of a documentation update, I'm hoping this was an unintentional and completely unnecessary breaking change.