Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elasticsearch 5.0: _nodes/stats?all throws a 400 #21410

Closed
NickCraver opened this issue Nov 8, 2016 · 30 comments
Closed

Elasticsearch 5.0: _nodes/stats?all throws a 400 #21410

NickCraver opened this issue Nov 8, 2016 · 30 comments
Labels
:Data Management/Stats Statistics tracking and retrieval APIs >docs General docs changes

Comments

@NickCraver
Copy link

Elasticsearch version: 5.0.0 (253032b)
Plugins installed: none
JVM version: 1.8.0_102
OS version: Linux 4.4.20-moby (vanilla 5.0 Docker container)

Description of the problem including expected versus actual behavior:
Calling _nodes/stats?all should return all stats populated, as it did in previous versions and is still documented to work as in the Node Stats documentation:

The all flag can be set to return all the stats.

Instead, it returns a 400:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "request [/_nodes/stats] contains unrecognized parameter: [all]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "request [/_nodes/stats] contains unrecognized parameter: [all]"
  },
  "status" : 400
}

Steps to reproduce:

  1. Navigate to /_nodes/stats?all, e.g. http://localhost:9200/_nodes/stats?all
  2. Get a 400
  3. Go to nearest corner
  4. Assume fetal position
  5. Cry a little

Can we restore this functionality please? Given the lack of a documentation update, I'm hoping this was an unintentional and completely unnecessary breaking change.

@nlowe
Copy link

nlowe commented Nov 8, 2016

@jasontedor
Copy link
Member

By default, node stats returns all stats. You can explicitly request all stats by hitting /_nodes/stats/_all or /_nodes/stats?metric=_all. Requesting /_nodes/stats?all never did anything, it was just silently ignored in the past. I'll clarify the docs.

@clintongormley
Copy link

Requesting /_nodes/stats?all never did anything, it was just silently ignored in the past

Actually, (and this goes waaay back to pre 0.90), stats used to just return a selection of stats, and all used to have a purpose. But that is ancient history

@clintongormley clintongormley added >docs General docs changes :Data Management/Stats Statistics tracking and retrieval APIs labels Nov 8, 2016
@NickCraver
Copy link
Author

@jasontedor Was this the case in all previous versions? Monitoring systems that hit the APIs need to span many versions, and every breaking change like this (like changing hostname to host) create real unnecessary pain points.

This breaks our monitoring system. And anyone else's spanning many versions. Why break this? Why was it necessary? Just ignore the param if the behavior would be the same.

@jasontedor
Copy link
Member

Actually, (and this goes waaay back to pre 0.90), stats used to just return a selection of stats, and all used to have a purpose. But that is ancient history

It's like this in 1.x and 2.x as well, ?all did nothing, it was just silently ignored.

@jasontedor
Copy link
Member

Was this the case in all previous versions?

It was the case in 1.x and 2.x for sure, I just checked the code.

Why break this? Why was it necessary?

It didn't do anything, it was silently ignored.

Just ignore the param if the behavior would be the same.

This is why we are getting rid of silently ignoring things. All this time you thought that that parameter was doing something and it wasn't.

@jasontedor
Copy link
Member

I pushed the following commits:

@NickCraver
Copy link
Author

This is why we are getting rid of silently ignoring things. All this time you thought that that parameter was doing something and it wasn't.

It was doing something in old versions. If you think no one is still using 0.90, that's just way off base. Unfortunately, that's reality. And they're using some of our stuff to monitor it. So now, for no good reason, instead of maintaining minimal code in Elastic to ignore it, every monitoring system has to maintain a code fork and version awareness to adapt to this. That's not an efficient use of time for anyone.

This is happening with every release, and none of them were necessary. Please, quit breaking things for very minor cleanups. It's just shifting a lot of work to n consumers.

@clintongormley
Copy link

So we're just supposed to keep all of our mistakes forever? Sorry @NickCraver but no. Strict parsing of query string parameters is a really important step forwards, and we're not going to keep cruft forever. We try hard not to break things. Sometimes we do it unintentionally without realising that something will be an issue, for which I apologise, but it is untenable to keep everything the same forever. It would make our code so creaky and complicated we wouldn't be able to touch it without fear of breaking things.

@jasontedor
Copy link
Member

jasontedor commented Nov 8, 2016

It was doing something in old versions. If you think no one is still using 0.90, that's just way off base.

That's right, very old versions, and versions that we do not support anymore.

Please, quit breaking things for very minor cleanups.

This change was made almost three years ago in bb27516. The related issue is #4057.

The addition of strict REST parsing was a major, nor minor, step forward for Elasticsearch. The related PR is #20722.

@NickCraver
Copy link
Author

NickCraver commented Nov 8, 2016

So we're just supposed to keep all of our mistakes forever?

No. But not acknowledging the APIs that need to span versions are more special and should have a much higher bar here shows a large disconnect with what consumes these APIs. I never said everything should work forever. I am speaking about a very narrow scope of APIs here.

Instead of you maintaining support, now everyone else has to deal with it. And all monitoring systems have to have a code fork here to handle it. Is that really the best use of human time?

Also note: this isn't ancient history. It was documented to work an hour ago.

@jprante
Copy link
Contributor

jprante commented Nov 8, 2016

Note there is no "strict REST parsing" for the _nodes endpoint.

 curl '0:9200/_nodes/stats/foobar'

gives a (unexpected) response on an undefined metric "foobar" and is not rejected as invalid request.

@jasontedor
Copy link
Member

jasontedor commented Nov 8, 2016

@jprante Strict REST parsing (#20722) was about rejecting unrecognized URL parameters, what you're referring to is a separate issue and should be addressed.

@jasontedor
Copy link
Member

@jprante Thanks for reporting this, I have a fix for this and will make a formal PR shortly.

@NickCraver
Copy link
Author

@jasontedor You shouldn't make another breaking change like this until the next major version. Can we assume such a break there would be 6.0 or beyond?

@jasontedor
Copy link
Member

jasontedor commented Nov 8, 2016

Also note: this isn't ancient history. It was documented to work an hour ago.

It was a documentation bug and now it's fixed.

@jasontedor
Copy link
Member

You shouldn't make another breaking change like this until the next major version. Can we assume such a break there would be 6.0 or beyond?

It's consumers that are using the stats endpoints with invalid metrics that are broken but are going undetected because we are lenient here. The change would be a bug fix, and I'm definitely not holding it until a major version. It's tempting to even include it in a bug fix release, but it's definitely going into 5.1.0.

@jasontedor
Copy link
Member

Thanks for reporting @jprante, I opened #21417.

@NickCraver
Copy link
Author

@jasontedor Honest question, do you just not care about breaking users? Does Elastic in general not care a bit about semantic versioning? I'm honestly trying to gauge what the intent is here. From a consumer point of view this is both hostile and toxic. As a maintainer of a project people consume, this is just all around bad.

There's nothing harmful about these endpoints right now. They are experiencing the same behavior from the day they deployed on 5.0. If they upgrade to 5.1.0 with a change like this, you're breaking them. Why? What good does that do? There's good reason breaking changes are reserved for major versions. Most of the software world understands this.

@kimchy
Copy link
Member

kimchy commented Nov 8, 2016

the fact we can now strictly parse parameters is a huge improvement to users, since it allows us to be properly consistent and predictable with our APIs. Something, as anything with writing software, I personally wish I should have done with the get go.

On the other hand, I do appreciate the implications when it comes to backward comp. I would like to hear how you use it, to make sure it is not broken already, and actually, this fix causes things to be fixed? Our goal it not to break backward comp. but I want to understand and appreciate the balance between wrong invocation of API due to mis-design on end that gets fixed, vs. really breaking correct existing behavior? @NickCraver can you share more info?

@kimchy
Copy link
Member

kimchy commented Nov 8, 2016

hey @NickCraver , looking into it, we did make a braking change, and made ?all a breaking change from 5.0.0 forward, we just missed the documentation aspect, which we fixed. Strict URL parsing is important for resiliency in our codebase, so regardless, you would need to change 5.x code to support it.

I view strict URL parsing important, and regardless of the docs, which we missed, and I apologize for that, the code still rejects the ?all flag, from 5.0.0 onwards, which is good. And any monitoring tool working against 5.x versions need to adjust to it.

p.s. I am on a EU TZ, so can't respond soonish moving forward, please don't view me being in-responsives as a non comment.

@NickCraver
Copy link
Author

@kimchy So context here: there is no functionality fix here, the documentation was retroactively changed to "fix" a breaking change. That breaking change was in a major release (good thing), though IMO unnecessary (distinct issue). It should be added to breaking change docs regardless. There's now another change proposed in #21417 that's breaking not in a major version change (bad). That's an issue in itself, and just bad behavior for any project.

The documentation fix is appreciated, but what are we fixing here? The strict parameter parsing is good, but would adding this as a recognized parameter and maintaining behavior be a bad thing? Then when a monitoring system hits these routes, it's not version switched per version. When things like this break, I'm not only fixing our monitoring system (both Bosun and Opserver), everyone else is also doing this. And that's bad, because this break wasn't really necessary.

No one's asking that we revert strict parameter parsing, the ask if much smaller in scope: recognize this parameter. It was allowed and documented to be allowed past the 5.x release (the live docs still have it).

More globally, the monitoring APIs in general are the thing most likely to need many-version support at once. Any changes that aren't really necessary here are painful, multiplied by the number of systems hitting them. As a user, what's the benefit here? How many lines of code would it take in elasticsearch to make this work? The current version would need to just ignore it. Is that a big ask? What's happening right now is we've gone from a route that worked fine to throwing a 400, for...what? How much code did we save? Was it worth the pain caused to users?

This is broken. There's no fix for it. Users are on 5.0.0 and I have to support them. The ship has sailed and either monitoring systems specifically don't support this exact version (if the behavior did go back), or we already have to have this fork in our code. For probably a decade.

But what's the bigger issue here? Breaking changes are proposed for a 5.1.0 or (even worse) a 5.0.x in #21417 is very bad. These should absolutely not be allowed. If I have no idea if my stuff will break in the most minor of releases, then I won't upgrade. That's bad for everyone. There are really good reasons that semantic versioning exists and this is a primary example. Please, don't let this happen. Elastic has to take breaking changes seriously if they want people to be on a version anywhere near current. If users can't trust an upgrade, then they won't. Things like this (if deployed) destroy such trust, instead of building it.

I honestly have no idea if Elastic (in general) strives to maintain semantic versioning, but even if that's not the case, can we at least get on the same page with breaking changes? As a developer and a sysadmin, I'm telling you: this is extremely important.

@jasontedor
Copy link
Member

I also backported the docs fixes to 2.4 via a025c9c and 1.7 via 6d95de2.

@kimchy
Copy link
Member

kimchy commented Nov 8, 2016

@NickCraver I think we crossed responses, based on my understanding, 5.0.0 is not a unique version from a code perspective, we just missed docs. If it was a unique version from a code behavior perspective, I 1000% agree with you. But it is not. From 5.0.0 onwards, ?all will get rejected.

@NickCraver
Copy link
Author

NickCraver commented Nov 8, 2016

@kimchy Are you only seeing one of the two issues here? #21417 is not about ?all. It's another breaking change, pitched for both a minor or even a "bugfix" release.

The ?all is unsolvable in itself (as mentioned above) because 5.0.0 shipped at all. I still disagree that, given the nature of this API set, it should have been broken at all. That's more of a discussion on breaking monitoring APIs in general - which should happen. Which APIs should be extremely long-term stable? (Without very compelling reasons to break them)

@ejsmith
Copy link

ejsmith commented Nov 9, 2016

The ES team has been extremely aggressive with breaking changes and it has been very painful to update my apps to accommodate these changes. I really hope that you guys got the cleanup done that you wanted to do and that the bar for breaking changes goes up drastically going forward. Otherwise I'm afraid you are going to fragment and alienate your customers.

@clintongormley
Copy link

Hi @NickCraver

There is a tension here between forcing existing users to adjust to breaking changes and saving new users from hours of debugging when the parameters that they try just don't work. We have had so many issues opened because our APIs silently ignored unknown parameters. We have largely fixed this - we now tell you when you're doing something wrong, and we provide suggestions for which parameters you probably wanted to use. This is a huge step forward in usability.

The all flag was removed in 1.0, along with the clear and various metric flags. The bug was that we didn't remove all the documentation about the all flag, but it hasn't worked since 1.0. I understand your point about how ignoring the all flag silently in 5.0 wouldn't make a difference, but why is the all flag different from the other flags? Maybe you are only using the all flag but there may be others using the other flags, which also haven't worked since 1.0. If we silently ignore those too, we leave users guessing about why things aren't working the way they assume instead of telling them immediately what the problem is.

Regarding the strict checking of metrics in #21417. We have formally embraced strict REST parsing in 5.0 - shouting loud and shouting early saves users so much time and heartache, the benefits are clear. If we find APIs that still accept unknown parameters, it is considered a bug which should be fixed, not a breaking change.

If we continue silently ignoring unknown metrics, we're punishing all of those users who don't understand why their requests aren't working correctly. We can add backwards compatibility and deprecation logging for the percolator metric, which is no longer available, but how does it help users to ignore unknown metrics? Most users are probably unaware that they have a problem and would be grateful for being told what the problem is. While we could delay this change for 6.0, it would mean another year where users are left in the dark.

The ES team has been extremely aggressive with breaking changes and it has been very painful to update my apps to accommodate these changes. I really hope that you guys got the cleanup done that you wanted to do and that the bar for breaking changes goes up drastically going forward. Otherwise I'm afraid you are going to fragment and alienate your customers.

@ejsmith We're trying hard really hard to balance these two camps: not hurting existing users and making things better for new users. This is not an easy choice. We don't make breaking changes lightly and when we do, it is with the intention of moving to more stable APIs. The breaking changes between 1.x and 2.0 were hard for everybody, 2.x to 5.0 should be a much easier experience. We don't always get it right but please be aware that a lot of soul searching goes into these decisions.

@NickCraver
Copy link
Author

it is considered a bug which should be fixed, not a breaking change

These are orthogonal issues, and not at all mutually exclusive. Any change can be both, as is the example here. That doesn't change the fact that breaking changes should be reserved for major versions. Yet repeatedly, with this project, they are not. This contributes greatly to people not upgrading. It's one of the reasons we don't upgrade often here.

This is not an easy choice. We don't make breaking changes lightly

I'm sorry but not that's not how this is coming across. The original comment here was "somebody talk me down from putting this in a 5.0.x release" (I hope that's exact, if not it's very close) with an emoji on it. I really wish I had the tab open with a screenshot (maybe someone else still does). If you want to carefully consider changes and what goes into things that's fine, but saying that happened here is bull. The fact that comment was edited shows it's recognized.

Elasticsearch is your project. Do what you want with it. But don't assume this has zero impact. If I have to weigh heavily testing every release for breaks because Elastic doesn't care about semantic versioning and is therefore shifting the burden to me (and every other user), then it's worth it one again to spend equal time evaluating other options all together. Would we move to Solr here? I have no idea, but it's absolutely worth investigating again, because the time factor went from low and known to completely unknown and significant. When we upgrade Stack Overflow to a new bugfix release, will it break production? Because you're "helping" me with these breaking changes? Well now we have no confidence in that.

The point of versioning and semantic versioning is that, as a user, I know what I am getting. If it's a x.0.0 upgrade, I know there will be breaks. If there are breaks hiding in build or minor releases, then I a) have to test far more heavily for every change, and b) now expect surprises when we deploy production. That's unhealthy, at best. Semantic versioning exists for extremely good reasons. I'm really not seeing that anyone at Elastic gets that in this issue. Instead, the comments and actions here indicate the exact opposite.

@clintongormley
Copy link

I'm sorry but not that's not how this is coming across. The original comment here was "somebody talk me down from putting this in a 5.0.x release" (I hope that's exact, if not it's very close) with an emoji on it.

I agree that was a poorly considered comment which doesn't reflect the thought process here. Hence the edit.

And your comments are appreciated and heard. They have, for instance, made me reconsider my decision to backport a change to the CAT api to 5.1 - it's now been reverted and will be in 6.0 only.

asterr pushed a commit to asterr/es2graphite that referenced this issue Dec 12, 2016
See: elastic/elasticsearch#21410

The all=true parameter is no longer supported, and causes the HTTP GET to return:

$ curl 'http://ny2-laa-005:9200/_stats?all=true'
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"request [/_stats] contains unrecognized parameter: [all]"}],"type":"illegal_argument_exception","reason":"request [/_stats] contains unrecognized parameter: [all]"},"status":400}

This causes the es2graphite.py to throw this stack trace and continuously retry:

2016-12-12 09:31:08,006 [MainThread] [INFO    ]  2016-12-12 09:31:06: GET http://ny2-laa-005:9200/_stats?all=true
2016-12-12 09:31:08,352 [MainThread] [ERROR   ]  Traceback+%28most+recent+call+last%29%3A%0A++File+%22.%2Fes2graphite.py%22%2C+line+328%2C+in+%3Cmodule%3E%0A++++get_metrics%28%29%0A++File+%22.%2Fes2graphite.py%22%2C+line+34%2C+in+timed%0A++++result+%3D+method%28%2Aargs%2C+%2A%2Akw%29%0A++File+%22.%2Fes2graphite.py%22%2C+line+274%2C+in+get_metrics%0A++++indices_stats_data+%3D+urllib2.urlopen%28indices_stats_url%29.read%28%29%0A++File+%22%2FSystem%2FLibrary%2FFrameworks%2FPython.framework%2FVersions%2F2.7%2Flib%2Fpython2.7%2Furllib2.py%22%2C+line+154%2C+in+urlopen%0A++++return+opener.open%28url%2C+data%2C+timeout%29%0A++File+%22%2FSystem%2FLibrary%2FFrameworks%2FPython.framework%2FVersions%2F2.7%2Flib%2Fpython2.7%2Furllib2.py%22%2C+line+437%2C+in+open%0A++++response+%3D+meth%28req%2C+response%29%0A++File+%22%2FSystem%2FLibrary%2FFrameworks%2FPython.framework%2FVersions%2F2.7%2Flib%2Fpython2.7%2Furllib2.py%22%2C+line+550%2C+in+http_response%0A++++%27http%27%2C+request%2C+response%2C+code%2C+msg%2C+hdrs%29%0A++File+%22%2FSystem%2FLibrary%2FFrameworks%2FPython.framework%2FVersions%2F2.7%2Flib%2Fpython2.7%2Furllib2.py%22%2C+line+475%2C+in+error%0A++++return+self._call_chain%28%2Aargs%29%0A++File+%22%2FSystem%2FLibrary%2FFrameworks%2FPython.framework%2FVersions%2F2.7%2Flib%2Fpython2.7%2Furllib2.py%22%2C+line+409%2C+in+_call_chain%0A++++result+%3D+func%28%2Aargs%29%0A++File+%22%2FSystem%2FLibrary%2FFrameworks%2FPython.framework%2FVersions%2F2.7%2Flib%2Fpython2.7%2Furllib2.py%22%2C+line+558%2C+in+http_error_default%0A++++raise+HTTPError%28req.get_full_url%28%29%2C+code%2C+msg%2C+hdrs%2C+fp%29%0AHTTPError%3A+HTTP+Error+400%3A+Bad+Request%0A
asterr pushed a commit to asterr/es2graphite that referenced this issue Dec 12, 2016
Fixes broken HTTP GET for index statistics.

Should be backwards compatible because the all=true parameter has not been used by elasticsearch for a long time.
See: elastic/elasticsearch#21410
mayankgoyal added a commit to mayankgoyal/newrelic-python-agent that referenced this issue Oct 6, 2017
Refer to issue discussed here elastic/elasticsearch#21410. As an alternative it was suggested to use "/_nodes/stats" or "/_nodes/stats/_all" or "/_nodes/stats?metric=_all". For backward compatibility till 0.90 using no flags should work as the "/_nodes/stats" returns all stats by default and is consistent across all versions so far.
@him5
Copy link

him5 commented Sep 7, 2019

I get something like that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Stats Statistics tracking and retrieval APIs >docs General docs changes
Projects
None yet
Development

No branches or pull requests

8 participants