Elasticsearch 5.0: _nodes/stats?all throws a 400 #21410

NickCraver · 2016-11-08T18:33:39Z

Elasticsearch version: 5.0.0 (253032b)
Plugins installed: none
JVM version: 1.8.0_102
OS version: Linux 4.4.20-moby (vanilla 5.0 Docker container)

Description of the problem including expected versus actual behavior:
Calling _nodes/stats?all should return all stats populated, as it did in previous versions and is still documented to work as in the Node Stats documentation:

The all flag can be set to return all the stats.

Instead, it returns a 400:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "request [/_nodes/stats] contains unrecognized parameter: [all]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "request [/_nodes/stats] contains unrecognized parameter: [all]"
  },
  "status" : 400
}

Steps to reproduce:

Navigate to /_nodes/stats?all, e.g. http://localhost:9200/_nodes/stats?all
Get a 400
Go to nearest corner
Assume fetal position
Cry a little

Can we restore this functionality please? Given the lack of a documentation update, I'm hoping this was an unintentional and completely unnecessary breaking change.

The text was updated successfully, but these errors were encountered:

nlowe · 2016-11-08T18:41:54Z

At a glance, it looks related to #21266 (see https://github.com/elastic/elasticsearch/blob/5.0/core/src/main/java/org/elasticsearch/rest/action/admin/cluster/RestNodesStatsAction.java#L64)

jasontedor · 2016-11-08T18:44:45Z

By default, node stats returns all stats. You can explicitly request all stats by hitting /_nodes/stats/_all or /_nodes/stats?metric=_all. Requesting /_nodes/stats?all never did anything, it was just silently ignored in the past. I'll clarify the docs.

clintongormley · 2016-11-08T18:47:25Z

Requesting /_nodes/stats?all never did anything, it was just silently ignored in the past

Actually, (and this goes waaay back to pre 0.90), stats used to just return a selection of stats, and all used to have a purpose. But that is ancient history

NickCraver · 2016-11-08T18:48:23Z

@jasontedor Was this the case in all previous versions? Monitoring systems that hit the APIs need to span many versions, and every breaking change like this (like changing hostname to host) create real unnecessary pain points.

This breaks our monitoring system. And anyone else's spanning many versions. Why break this? Why was it necessary? Just ignore the param if the behavior would be the same.

jasontedor · 2016-11-08T18:51:17Z

Actually, (and this goes waaay back to pre 0.90), stats used to just return a selection of stats, and all used to have a purpose. But that is ancient history

It's like this in 1.x and 2.x as well, ?all did nothing, it was just silently ignored.

jasontedor · 2016-11-08T18:52:35Z

Was this the case in all previous versions?

It was the case in 1.x and 2.x for sure, I just checked the code.

Why break this? Why was it necessary?

It didn't do anything, it was silently ignored.

Just ignore the param if the behavior would be the same.

This is why we are getting rid of silently ignoring things. All this time you thought that that parameter was doing something and it wasn't.

jasontedor · 2016-11-08T18:54:03Z

I pushed the following commits:

master: aec09a7
5.x: f10e26a
5.0: b3df73f

NickCraver · 2016-11-08T18:55:19Z

This is why we are getting rid of silently ignoring things. All this time you thought that that parameter was doing something and it wasn't.

It was doing something in old versions. If you think no one is still using 0.90, that's just way off base. Unfortunately, that's reality. And they're using some of our stuff to monitor it. So now, for no good reason, instead of maintaining minimal code in Elastic to ignore it, every monitoring system has to maintain a code fork and version awareness to adapt to this. That's not an efficient use of time for anyone.

This is happening with every release, and none of them were necessary. Please, quit breaking things for very minor cleanups. It's just shifting a lot of work to n consumers.

clintongormley · 2016-11-08T19:02:27Z

So we're just supposed to keep all of our mistakes forever? Sorry @NickCraver but no. Strict parsing of query string parameters is a really important step forwards, and we're not going to keep cruft forever. We try hard not to break things. Sometimes we do it unintentionally without realising that something will be an issue, for which I apologise, but it is untenable to keep everything the same forever. It would make our code so creaky and complicated we wouldn't be able to touch it without fear of breaking things.

jasontedor · 2016-11-08T19:03:08Z

It was doing something in old versions. If you think no one is still using 0.90, that's just way off base.

That's right, very old versions, and versions that we do not support anymore.

Please, quit breaking things for very minor cleanups.

This change was made almost three years ago in bb27516. The related issue is #4057.

The addition of strict REST parsing was a major, nor minor, step forward for Elasticsearch. The related PR is #20722.

NickCraver · 2016-11-08T19:11:35Z

So we're just supposed to keep all of our mistakes forever?

No. But not acknowledging the APIs that need to span versions are more special and should have a much higher bar here shows a large disconnect with what consumes these APIs. I never said everything should work forever. I am speaking about a very narrow scope of APIs here.

Instead of you maintaining support, now everyone else has to deal with it. And all monitoring systems have to have a code fork here to handle it. Is that really the best use of human time?

Also note: this isn't ancient history. It was documented to work an hour ago.

jprante · 2016-11-08T19:14:39Z

Note there is no "strict REST parsing" for the _nodes endpoint.

 curl '0:9200/_nodes/stats/foobar'

gives a (unexpected) response on an undefined metric "foobar" and is not rejected as invalid request.

jasontedor · 2016-11-08T19:17:05Z

@jprante Strict REST parsing (#20722) was about rejecting unrecognized URL parameters, what you're referring to is a separate issue and should be addressed.

jasontedor · 2016-11-08T19:37:36Z

@jprante Thanks for reporting this, I have a fix for this and will make a formal PR shortly.

NickCraver · 2016-11-08T19:39:29Z

@jasontedor You shouldn't make another breaking change like this until the next major version. Can we assume such a break there would be 6.0 or beyond?

jasontedor · 2016-11-08T21:45:02Z

Also note: this isn't ancient history. It was documented to work an hour ago.

It was a documentation bug and now it's fixed.

jasontedor · 2016-11-08T22:13:27Z

You shouldn't make another breaking change like this until the next major version. Can we assume such a break there would be 6.0 or beyond?

It's consumers that are using the stats endpoints with invalid metrics that are broken but are going undetected because we are lenient here. The change would be a bug fix, and I'm definitely not holding it until a major version. It's tempting to even include it in a bug fix release, but it's definitely going into 5.1.0.

jasontedor · 2016-11-08T22:13:41Z

Thanks for reporting @jprante, I opened #21417.

NickCraver · 2016-11-08T22:28:23Z

@jasontedor Honest question, do you just not care about breaking users? Does Elastic in general not care a bit about semantic versioning? I'm honestly trying to gauge what the intent is here. From a consumer point of view this is both hostile and toxic. As a maintainer of a project people consume, this is just all around bad.

There's nothing harmful about these endpoints right now. They are experiencing the same behavior from the day they deployed on 5.0. If they upgrade to 5.1.0 with a change like this, you're breaking them. Why? What good does that do? There's good reason breaking changes are reserved for major versions. Most of the software world understands this.

kimchy · 2016-11-08T23:26:02Z

the fact we can now strictly parse parameters is a huge improvement to users, since it allows us to be properly consistent and predictable with our APIs. Something, as anything with writing software, I personally wish I should have done with the get go.

On the other hand, I do appreciate the implications when it comes to backward comp. I would like to hear how you use it, to make sure it is not broken already, and actually, this fix causes things to be fixed? Our goal it not to break backward comp. but I want to understand and appreciate the balance between wrong invocation of API due to mis-design on end that gets fixed, vs. really breaking correct existing behavior? @NickCraver can you share more info?

kimchy · 2016-11-08T23:47:19Z

hey @NickCraver , looking into it, we did make a braking change, and made ?all a breaking change from 5.0.0 forward, we just missed the documentation aspect, which we fixed. Strict URL parsing is important for resiliency in our codebase, so regardless, you would need to change 5.x code to support it.

I view strict URL parsing important, and regardless of the docs, which we missed, and I apologize for that, the code still rejects the ?all flag, from 5.0.0 onwards, which is good. And any monitoring tool working against 5.x versions need to adjust to it.

p.s. I am on a EU TZ, so can't respond soonish moving forward, please don't view me being in-responsives as a non comment.

NickCraver · 2016-11-08T23:47:49Z

@kimchy So context here: there is no functionality fix here, the documentation was retroactively changed to "fix" a breaking change. That breaking change was in a major release (good thing), though IMO unnecessary (distinct issue). It should be added to breaking change docs regardless. There's now another change proposed in #21417 that's breaking not in a major version change (bad). That's an issue in itself, and just bad behavior for any project.

The documentation fix is appreciated, but what are we fixing here? The strict parameter parsing is good, but would adding this as a recognized parameter and maintaining behavior be a bad thing? Then when a monitoring system hits these routes, it's not version switched per version. When things like this break, I'm not only fixing our monitoring system (both Bosun and Opserver), everyone else is also doing this. And that's bad, because this break wasn't really necessary.

No one's asking that we revert strict parameter parsing, the ask if much smaller in scope: recognize this parameter. It was allowed and documented to be allowed past the 5.x release (the live docs still have it).

More globally, the monitoring APIs in general are the thing most likely to need many-version support at once. Any changes that aren't really necessary here are painful, multiplied by the number of systems hitting them. As a user, what's the benefit here? How many lines of code would it take in elasticsearch to make this work? The current version would need to just ignore it. Is that a big ask? What's happening right now is we've gone from a route that worked fine to throwing a 400, for...what? How much code did we save? Was it worth the pain caused to users?

This is broken. There's no fix for it. Users are on 5.0.0 and I have to support them. The ship has sailed and either monitoring systems specifically don't support this exact version (if the behavior did go back), or we already have to have this fork in our code. For probably a decade.

But what's the bigger issue here? Breaking changes are proposed for a 5.1.0 or (even worse) a 5.0.x in #21417 is very bad. These should absolutely not be allowed. If I have no idea if my stuff will break in the most minor of releases, then I won't upgrade. That's bad for everyone. There are really good reasons that semantic versioning exists and this is a primary example. Please, don't let this happen. Elastic has to take breaking changes seriously if they want people to be on a version anywhere near current. If users can't trust an upgrade, then they won't. Things like this (if deployed) destroy such trust, instead of building it.

I honestly have no idea if Elastic (in general) strives to maintain semantic versioning, but even if that's not the case, can we at least get on the same page with breaking changes? As a developer and a sysadmin, I'm telling you: this is extremely important.

jasontedor · 2016-11-08T23:48:09Z

I also backported the docs fixes to 2.4 via a025c9c and 1.7 via 6d95de2.

kimchy · 2016-11-08T23:51:24Z

@NickCraver I think we crossed responses, based on my understanding, 5.0.0 is not a unique version from a code perspective, we just missed docs. If it was a unique version from a code behavior perspective, I 1000% agree with you. But it is not. From 5.0.0 onwards, ?all will get rejected.

NickCraver · 2016-11-08T23:54:13Z

@kimchy Are you only seeing one of the two issues here? #21417 is not about ?all. It's another breaking change, pitched for both a minor or even a "bugfix" release.

The ?all is unsolvable in itself (as mentioned above) because 5.0.0 shipped at all. I still disagree that, given the nature of this API set, it should have been broken at all. That's more of a discussion on breaking monitoring APIs in general - which should happen. Which APIs should be extremely long-term stable? (Without very compelling reasons to break them)

ejsmith · 2016-11-09T01:17:22Z

The ES team has been extremely aggressive with breaking changes and it has been very painful to update my apps to accommodate these changes. I really hope that you guys got the cleanup done that you wanted to do and that the bar for breaking changes goes up drastically going forward. Otherwise I'm afraid you are going to fragment and alienate your customers.

clintongormley · 2016-11-09T11:30:38Z

Hi @NickCraver

There is a tension here between forcing existing users to adjust to breaking changes and saving new users from hours of debugging when the parameters that they try just don't work. We have had so many issues opened because our APIs silently ignored unknown parameters. We have largely fixed this - we now tell you when you're doing something wrong, and we provide suggestions for which parameters you probably wanted to use. This is a huge step forward in usability.

The all flag was removed in 1.0, along with the clear and various metric flags. The bug was that we didn't remove all the documentation about the all flag, but it hasn't worked since 1.0. I understand your point about how ignoring the all flag silently in 5.0 wouldn't make a difference, but why is the all flag different from the other flags? Maybe you are only using the all flag but there may be others using the other flags, which also haven't worked since 1.0. If we silently ignore those too, we leave users guessing about why things aren't working the way they assume instead of telling them immediately what the problem is.

Regarding the strict checking of metrics in #21417. We have formally embraced strict REST parsing in 5.0 - shouting loud and shouting early saves users so much time and heartache, the benefits are clear. If we find APIs that still accept unknown parameters, it is considered a bug which should be fixed, not a breaking change.

If we continue silently ignoring unknown metrics, we're punishing all of those users who don't understand why their requests aren't working correctly. We can add backwards compatibility and deprecation logging for the percolator metric, which is no longer available, but how does it help users to ignore unknown metrics? Most users are probably unaware that they have a problem and would be grateful for being told what the problem is. While we could delay this change for 6.0, it would mean another year where users are left in the dark.

The ES team has been extremely aggressive with breaking changes and it has been very painful to update my apps to accommodate these changes. I really hope that you guys got the cleanup done that you wanted to do and that the bar for breaking changes goes up drastically going forward. Otherwise I'm afraid you are going to fragment and alienate your customers.

@ejsmith We're trying hard really hard to balance these two camps: not hurting existing users and making things better for new users. This is not an easy choice. We don't make breaking changes lightly and when we do, it is with the intention of moving to more stable APIs. The breaking changes between 1.x and 2.0 were hard for everybody, 2.x to 5.0 should be a much easier experience. We don't always get it right but please be aware that a lot of soul searching goes into these decisions.

NickCraver · 2016-11-09T12:34:53Z

it is considered a bug which should be fixed, not a breaking change

These are orthogonal issues, and not at all mutually exclusive. Any change can be both, as is the example here. That doesn't change the fact that breaking changes should be reserved for major versions. Yet repeatedly, with this project, they are not. This contributes greatly to people not upgrading. It's one of the reasons we don't upgrade often here.

This is not an easy choice. We don't make breaking changes lightly

I'm sorry but not that's not how this is coming across. The original comment here was "somebody talk me down from putting this in a 5.0.x release" (I hope that's exact, if not it's very close) with an emoji on it. I really wish I had the tab open with a screenshot (maybe someone else still does). If you want to carefully consider changes and what goes into things that's fine, but saying that happened here is bull. The fact that comment was edited shows it's recognized.

Elasticsearch is your project. Do what you want with it. But don't assume this has zero impact. If I have to weigh heavily testing every release for breaks because Elastic doesn't care about semantic versioning and is therefore shifting the burden to me (and every other user), then it's worth it one again to spend equal time evaluating other options all together. Would we move to Solr here? I have no idea, but it's absolutely worth investigating again, because the time factor went from low and known to completely unknown and significant. When we upgrade Stack Overflow to a new bugfix release, will it break production? Because you're "helping" me with these breaking changes? Well now we have no confidence in that.

The point of versioning and semantic versioning is that, as a user, I know what I am getting. If it's a x.0.0 upgrade, I know there will be breaks. If there are breaks hiding in build or minor releases, then I a) have to test far more heavily for every change, and b) now expect surprises when we deploy production. That's unhealthy, at best. Semantic versioning exists for extremely good reasons. I'm really not seeing that anyone at Elastic gets that in this issue. Instead, the comments and actions here indicate the exact opposite.

clintongormley · 2016-11-09T12:57:49Z

I'm sorry but not that's not how this is coming across. The original comment here was "somebody talk me down from putting this in a 5.0.x release" (I hope that's exact, if not it's very close) with an emoji on it.

I agree that was a poorly considered comment which doesn't reflect the thought process here. Hence the edit.

And your comments are appreciated and heard. They have, for instance, made me reconsider my decision to backport a change to the CAT api to 5.1 - it's now been reverted and will be in 6.0 only.

See: elastic/elasticsearch#21410 The all=true parameter is no longer supported, and causes the HTTP GET to return: $ curl 'http://ny2-laa-005:9200/_stats?all=true' {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"request [/_stats] contains unrecognized parameter: [all]"}],"type":"illegal_argument_exception","reason":"request [/_stats] contains unrecognized parameter: [all]"},"status":400} This causes the es2graphite.py to throw this stack trace and continuously retry: 2016-12-12 09:31:08,006 [MainThread] [INFO ] 2016-12-12 09:31:06: GET http://ny2-laa-005:9200/_stats?all=true 2016-12-12 09:31:08,352 [MainThread] [ERROR ] Traceback+%28most+recent+call+last%29%3A%0A++File+%22.%2Fes2graphite.py%22%2C+line+328%2C+in+%3Cmodule%3E%0A++++get_metrics%28%29%0A++File+%22.%2Fes2graphite.py%22%2C+line+34%2C+in+timed%0A++++result+%3D+method%28%2Aargs%2C+%2A%2Akw%29%0A++File+%22.%2Fes2graphite.py%22%2C+line+274%2C+in+get_metrics%0A++++indices_stats_data+%3D+urllib2.urlopen%28indices_stats_url%29.read%28%29%0A++File+%22%2FSystem%2FLibrary%2FFrameworks%2FPython.framework%2FVersions%2F2.7%2Flib%2Fpython2.7%2Furllib2.py%22%2C+line+154%2C+in+urlopen%0A++++return+opener.open%28url%2C+data%2C+timeout%29%0A++File+%22%2FSystem%2FLibrary%2FFrameworks%2FPython.framework%2FVersions%2F2.7%2Flib%2Fpython2.7%2Furllib2.py%22%2C+line+437%2C+in+open%0A++++response+%3D+meth%28req%2C+response%29%0A++File+%22%2FSystem%2FLibrary%2FFrameworks%2FPython.framework%2FVersions%2F2.7%2Flib%2Fpython2.7%2Furllib2.py%22%2C+line+550%2C+in+http_response%0A++++%27http%27%2C+request%2C+response%2C+code%2C+msg%2C+hdrs%29%0A++File+%22%2FSystem%2FLibrary%2FFrameworks%2FPython.framework%2FVersions%2F2.7%2Flib%2Fpython2.7%2Furllib2.py%22%2C+line+475%2C+in+error%0A++++return+self._call_chain%28%2Aargs%29%0A++File+%22%2FSystem%2FLibrary%2FFrameworks%2FPython.framework%2FVersions%2F2.7%2Flib%2Fpython2.7%2Furllib2.py%22%2C+line+409%2C+in+_call_chain%0A++++result+%3D+func%28%2Aargs%29%0A++File+%22%2FSystem%2FLibrary%2FFrameworks%2FPython.framework%2FVersions%2F2.7%2Flib%2Fpython2.7%2Furllib2.py%22%2C+line+558%2C+in+http_error_default%0A++++raise+HTTPError%28req.get_full_url%28%29%2C+code%2C+msg%2C+hdrs%2C+fp%29%0AHTTPError%3A+HTTP+Error+400%3A+Bad+Request%0A

Fixes broken HTTP GET for index statistics. Should be backwards compatible because the all=true parameter has not been used by elasticsearch for a long time. See: elastic/elasticsearch#21410

Refer to issue discussed here elastic/elasticsearch#21410. As an alternative it was suggested to use "/_nodes/stats" or "/_nodes/stats/_all" or "/_nodes/stats?metric=_all". For backward compatibility till 0.90 using no flags should work as the "/_nodes/stats" returns all stats by default and is consistent across all versions so far.

him5 · 2019-09-07T23:28:05Z

I get something like that.

clintongormley added >docs General docs changes :Data Management/Stats Statistics tracking and retrieval APIs labels Nov 8, 2016

NickCraver mentioned this issue Nov 8, 2016

Elasticsearch 5.0 Compatibility opserver/Opserver#221

Closed

jasontedor closed this as completed Nov 8, 2016

jasontedor mentioned this issue Nov 8, 2016

Remove lenient stats parsing #21417

Merged

jasontedor mentioned this issue Nov 15, 2016

Remove lenient stats parsing 5.x #21576

Merged

asterr mentioned this issue Dec 12, 2016

Fix indices_stats_url for ElasticSearch 5.0 mattweber/es2graphite#21

Open

mayankgoyal mentioned this issue Oct 6, 2017

_nodes/stats?all is broken in new versions of ElasticSearch NewRelic-Python-Plugins/newrelic-python-agent#18

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Elasticsearch 5.0: _nodes/stats?all throws a 400 #21410

Elasticsearch 5.0: _nodes/stats?all throws a 400 #21410

NickCraver commented Nov 8, 2016

nlowe commented Nov 8, 2016 •

edited

jasontedor commented Nov 8, 2016

clintongormley commented Nov 8, 2016

NickCraver commented Nov 8, 2016

jasontedor commented Nov 8, 2016

jasontedor commented Nov 8, 2016

jasontedor commented Nov 8, 2016

NickCraver commented Nov 8, 2016

clintongormley commented Nov 8, 2016

jasontedor commented Nov 8, 2016 •

edited

NickCraver commented Nov 8, 2016 •

edited

jprante commented Nov 8, 2016

jasontedor commented Nov 8, 2016 •

edited

jasontedor commented Nov 8, 2016

NickCraver commented Nov 8, 2016

jasontedor commented Nov 8, 2016 •

edited

jasontedor commented Nov 8, 2016

jasontedor commented Nov 8, 2016

NickCraver commented Nov 8, 2016

kimchy commented Nov 8, 2016

kimchy commented Nov 8, 2016

NickCraver commented Nov 8, 2016

jasontedor commented Nov 8, 2016

kimchy commented Nov 8, 2016

NickCraver commented Nov 8, 2016 •

edited

ejsmith commented Nov 9, 2016

clintongormley commented Nov 9, 2016

NickCraver commented Nov 9, 2016

clintongormley commented Nov 9, 2016

him5 commented Sep 7, 2019

Elasticsearch 5.0: _nodes/stats?all throws a 400 #21410

Elasticsearch 5.0: _nodes/stats?all throws a 400 #21410

Comments

NickCraver commented Nov 8, 2016

nlowe commented Nov 8, 2016 • edited

jasontedor commented Nov 8, 2016

clintongormley commented Nov 8, 2016

NickCraver commented Nov 8, 2016

jasontedor commented Nov 8, 2016

jasontedor commented Nov 8, 2016

jasontedor commented Nov 8, 2016

NickCraver commented Nov 8, 2016

clintongormley commented Nov 8, 2016

jasontedor commented Nov 8, 2016 • edited

NickCraver commented Nov 8, 2016 • edited

jprante commented Nov 8, 2016

jasontedor commented Nov 8, 2016 • edited

jasontedor commented Nov 8, 2016

NickCraver commented Nov 8, 2016

jasontedor commented Nov 8, 2016 • edited

jasontedor commented Nov 8, 2016

jasontedor commented Nov 8, 2016

NickCraver commented Nov 8, 2016

kimchy commented Nov 8, 2016

kimchy commented Nov 8, 2016

NickCraver commented Nov 8, 2016

jasontedor commented Nov 8, 2016

kimchy commented Nov 8, 2016

NickCraver commented Nov 8, 2016 • edited

ejsmith commented Nov 9, 2016

clintongormley commented Nov 9, 2016

NickCraver commented Nov 9, 2016

clintongormley commented Nov 9, 2016

him5 commented Sep 7, 2019

nlowe commented Nov 8, 2016 •

edited

jasontedor commented Nov 8, 2016 •

edited

NickCraver commented Nov 8, 2016 •

edited

jasontedor commented Nov 8, 2016 •

edited

jasontedor commented Nov 8, 2016 •

edited

NickCraver commented Nov 8, 2016 •

edited