versorted should follow SemVer rules #61

Closed
SethMMorton opened this issue Jul 1, 2018 · 12 comments

@SethMMorton
Owner

SethMMorton commented Jul 1, 2018

Minimum, Complete, Verifiable Example

In [1]: import natsort

In [2]: a = ['1.0.0-alpha', '1.0.0-alpha.1', '1.0.0-alpha.beta', '1.0.0-beta', '1.0.0-beta.2', '1.0.0-beta.11', '1.0.0-rc.1', '1.0.0']

In [3]: natsort.versorted(a)
Out[3]: 
['1.0.0',
 '1.0.0-alpha',
 '1.0.0-alpha.1',
 '1.0.0-alpha.beta',
 '1.0.0-beta',
 '1.0.0-beta.2',
 '1.0.0-beta.11',
 '1.0.0-rc.1']

In [4]: natsort.__version__
Out[4]: '5.3.2'

According to https://semver.org/, 1.0.0-alpha < 1.0.0-alpha.1 < 1.0.0-alpha.beta < 1.0.0-beta < 1.0.0-beta.2 < 1.0.0-beta.11 < 1.0.0-rc.1 < 1.0.0. natsort puts 1.0.0 in the wrong place: first instead of last.
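
For reference, the precedence rules from semver.org can be written as a plain sort key. This is only a minimal sketch: `semver_key` is a hypothetical helper, not part of natsort, and it assumes every input is a well-formed semantic version with nothing before or after it.

```python
def semver_key(version):
    """Sort key following SemVer precedence (hypothetical helper, not natsort API)."""
    # Build metadata (after "+") does not affect precedence.
    version = version.partition("+")[0]
    core, _, prerelease = version.partition("-")
    major, minor, patch = (int(part) for part in core.split("."))
    if not prerelease:
        # A release outranks any pre-release of the same MAJOR.MINOR.PATCH.
        return (major, minor, patch, 1, ())
    identifiers = tuple(
        # Numeric identifiers compare as integers and rank below alphanumeric ones.
        (0, int(ident), "") if ident.isdecimal() else (1, 0, ident)
        for ident in prerelease.split(".")
    )
    return (major, minor, patch, 0, identifiers)


versions = ["1.0.0", "1.0.0-rc.1", "1.0.0-beta.11", "1.0.0-alpha", "1.0.0-beta.2"]
ordered = sorted(versions, key=semver_key)
```

The trailing `1`/`0` flag is what moves the plain release after its pre-releases; a shorter identifier tuple sorts before a longer one with the same leading identifiers, which matches the rule that `1.0.0-alpha` precedes `1.0.0-alpha.1`.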

Error message, Traceback, Desired behavior, Suggestion, Request, or Question

There is a useful hack to make this work, but that should not be needed for a function called versorted. It should handle this out-of-the-box.

This would be a breaking change, and might require updating the natsort major version.

@SethMMorton
Owner Author

I'm not particularly interested in implementing this myself. Any takers?

@thebigmunch

thebigmunch commented Nov 7, 2018

I'm somewhat interested in this. But there's a much easier solution for SemVer specifically using the semver package:

>>> import semver
>>> from natsort import natsorted
>>> a = ['1.0.0-alpha', '1.0.0-alpha.1', '1.0.0-alpha.beta', '1.0.0-beta', '1.0.0-beta.2', '1.0.0-beta.11', '1.0.0-rc.1', '1.0.0']
>>> natsorted(a, key=semver.parse_version_info)
[
    '1.0.0-alpha',
    '1.0.0-alpha.1',
    '1.0.0-alpha.beta',
    '1.0.0-beta',
    '1.0.0-beta.2',
    '1.0.0-beta.11',
    '1.0.0-rc.1',
    '1.0.0',
]

Perhaps this should just be documented instead? There are many different versioning systems, so you'd likely end up making the API and/or code ugly trying to specifically support them all (or even just the most popular) from within natsort.

It could also be supported by creating an algorithm for SemVer and others if needed. Users would still need to specify the algorithm in the call, but natsort would handle any package imports, etc.

Edit: Should have used semver.parse_version_info instead of semver.parse.

@SethMMorton
Owner Author

SethMMorton commented Nov 7, 2018

I like this idea, but to be successful I think it needs to handle input that contains versions (e.g. package names with versions), like the below list

a = [
    "package-1.0.0.tar.gz",
    "package-1.0.0-alpha.tar.gz",
    "package-1.0.0-rc.gz",
    "package-1.0.0-alpha.1.tar.gz",
    "package-1.0.0-beta.tar.gz",
]

Can the semver package handle this? (The documentation's pretty light, so it's not immediately obvious to me if it does.)

Alternatively, users could just be advised to remove "package-" and ".tar.gz" from their input as part of the key.

@thebigmunch

The semver package only handles version strings.

So, it might be possible to find semantic version strings within package and file names (if algorithm specified) to help determine the sorting key in some way. The only tricky part might be separating extensions from the end of the version string in some cases. The semver package has a regular expression that might be modified a bit to get anything preceding and following the version string as well as the version string. It really depends on how generalized you want to get. Should it be limited to only things that look like package and file names? Should it be able to support any string as long as there is a semantic version string in it?

@thebigmunch

Yeah, I don't think there's going to be a reliable way to separate the file extension from dotted pre-release or build sections short of whitelisting extensions.
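
To make the ambiguity concrete, here is a minimal sketch. The `SEMVER` pattern is a simplified stand-in written for this example (not the semver package's own regular expression), and `KNOWN_EXTENSIONS` is a hypothetical whitelist.

```python
import re

# Simplified SemVer pattern (assumption: no build metadata, no leading-zero checks).
SEMVER = re.compile(r"(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z-]+(?:\.[0-9A-Za-z-]+)*))?")

# Hypothetical whitelist; longer extensions must be listed before shorter ones.
KNOWN_EXTENSIONS = (".tar.gz", ".zip", ".gz")


def strip_extension(name):
    """Remove the first whitelisted extension that the name ends with."""
    for ext in KNOWN_EXTENSIONS:
        if name.endswith(ext):
            return name[: -len(ext)]
    return name


# Without the whitelist, "tar" and "gz" are valid pre-release identifiers,
# so the pattern happily swallows the extension.
raw = SEMVER.fullmatch("1.0.0-alpha.1.tar.gz")
ambiguous_prerelease = raw.group(4)

# With the whitelist applied first, only the real pre-release part remains.
clean = SEMVER.fullmatch(strip_extension("1.0.0-alpha.1.tar.gz"))
prerelease = clean.group(4)
```

Both parses are valid under the grammar, which is why nothing short of outside knowledge about extensions can choose between them.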

@SethMMorton
Owner Author

Yeah, that's why I had lost interest in implementing 😄

I think this can be done using a factory function given to the user so that they can make a custom key. I'll give it some thought and respond later today with an idea of what I am thinking.

@SethMMorton
Owner Author

SethMMorton commented Nov 8, 2018

What if natsort provided a key-generation function for semver that optionally accepted a regular expression that matches possible suffixes (like file extensions)? This way, the user defines where the semantic version ends. (Instead of a key-generation function, if this were implemented as part of versorted then that function could just take an extra parameter for the possible suffixes.)
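
A rough sketch of what such a key factory might look like. Everything here is hypothetical: `make_semver_key` and its `suffix_pattern` parameter are names invented for illustration, no such function exists in natsort, and the sketch assumes the first MAJOR.MINOR.PATCH run in the string is the version.

```python
import re

_CORE = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def make_semver_key(suffix_pattern=None):
    """Build a sort key; `suffix_pattern` marks where the version ends (hypothetical API)."""
    suffix_re = re.compile(f"(?:{suffix_pattern})$") if suffix_pattern else None

    def key(text):
        if suffix_re is not None:
            # Drop the user-declared suffix so it is not read as a pre-release.
            text = suffix_re.sub("", text, count=1)
        match = _CORE.search(text)
        if match is None:
            # No version found: fall back to comparing the raw text.
            return (text, (), 1, ())
        prefix, rest = text[: match.start()], text[match.end():]
        core = tuple(int(group) for group in match.groups())
        if not rest.startswith("-"):
            return (prefix, core, 1, ())  # a release outranks its pre-releases
        identifiers = tuple(
            (0, int(ident), "") if ident.isdecimal() else (1, 0, ident)
            for ident in rest[1:].split(".")
        )
        return (prefix, core, 0, identifiers)

    return key


names = [
    "package-1.0.0.tar.gz",
    "package-1.0.0-alpha.tar.gz",
    "package-1.0.0-rc.gz",
    "package-1.0.0-alpha.1.tar.gz",
    "package-1.0.0-beta.tar.gz",
]
ordered = sorted(names, key=make_semver_key(r"\.tar\.gz|\.gz"))
```

Sorting on `(prefix, core, release flag, identifiers)` keeps different packages grouped together and only then applies SemVer precedence, which is one possible answer to the mixed-input question, not the only one.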

This is a really hard problem. I think that if it is implemented with known limitations, and those limitations are documented clearly, it will be a win.

@thebigmunch

I think that this is really a bigger change. If versorted is going to be taken out of deprecation and made the canonical way of sorting version strings, package names with version strings, and file names with version strings (which it should be for what is being proposed), this is a major, breaking change. Not only would the deprecation be undone, the semantics of versorted would be changed. Also, it should then support at least the version scheme natsort currently (mostly) supports, SemVer, and CalVer from the start. This leads to some more questions:

  • Should sorting by version be taken out of natsorted in favor of using versorted?
  • Should natsorted be made to support other version schemes instead?
  • Should this workaround be implemented in versorted for the default version scheme?
  • How to/Should versorted and/or natsorted handle sorting mixed input?
    • Strings without versions and strings with versions.
    • Version strings, package names with version strings, file names with version strings.
  • How many/what variations of package/file names to support? Leave it to the user in some way?

I'm sure I've probably forgotten some of the questions/ideas I came up with last night in bed about your idea. But here are my thoughts on these:

  • I think versorted should at least be strongly encouraged for version string sorting rather than using natsorted directly for all version schemes, if not having sorting by version scheme be limited to versorted.
  • I think the workaround for the default version scheme should be implemented in some way with versorted.
  • I haven't thought about mixed input enough yet to have a solid opinion. I'm leaning towards supporting the 2nd case, but not the 1st. And possibly making the 2nd case configurable, so it could be done by just the version or by prefix->version->suffix.
  • I'd have to look at a more exhaustive list of package naming for programming languages, etc to have a good idea of what is possible and necessary.

Note: Restrictions in my opinions generally include a 'when a version algorithm or the versorted function is used' caveat. But, if I'm not mistaken, the currently supported version algorithm is on by default in natsorted, correct?


So, here's different questions I have: are there people who actually want file names sorted by versioning rather than as file names are sorted by <insert OS/file manager>? Is this a problem we should be worrying about? My gut feeling says that people are looking for OS/file manager sorting for file names, at least the other case would be quite rare.

I also think we're conflating many different ideas/features into one here. I think the idea of supporting version-based sorting on anything other than version strings is a separate idea from supporting sorting strings with versions based on a specific versioning scheme. Frankly, the current version scheme support isn't technically sorting by the version scheme anyway, hence the documented workaround. I think supporting sorting of version strings based on a version scheme through the use of algorithms is what should be done right now. Maybe it should still be done by taking versorted out of mothballs. I think this could even include versioned package names (which is a more likely case for versioning-based sorting) but not file names or arbitrary strings. If there's really a strong desire for sorting versioned file names in the future, it will be brought up and discussed at that time. I don't think we need to swallow the whole thing at once (and maybe not at all).

@SethMMorton
Owner Author

I think you have many good points. There's a lot to sift through - apologies if I miss something you felt is important.


I think that many of the points you made can be addressed if I give some history of natsort. When I originally released natsort, the default algorithm for sorting was using signed floats instead of unsigned ints. At the time this was my major use case so I made it the default.

In retrospect, this was a terrible idea. I had many issues filed where natsort did not give results meeting users' expectations. Out of fear of breaking backwards compatibility, instead of changing the default to what most people actually want and expect, I added the number_type and signed keyword options (this was before alg was available), and users could get their expected behavior with number_type=int, signed=False, or just number_type=None.

In retrospect, this was a terrible idea. Discoverability of this was low, and it is a lot to type. Again, instead of changing the default behavior to what people want 99% of the time, I decided to make it easier to use that algorithm by providing a function called versorted, because at the time I believed that the only reason you wouldn't want to sort by signed floats was to sort strings with versions in them.

In retrospect, this was a terrible idea. Now there was a function with a name that implies that it treated version numbers specially in some manner, when in fact it was just using a run-of-the-mill algorithm that just happens to work for most version numbers.

So, in natsort version 4 I made the default use unsigned integers instead of signed floats. Finally, a good idea. The only problem was that now there was this crusty old function versorted that I couldn't remove for backwards-compatibility reasons.
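
The difference between the two defaults is easy to see on a tiny example. The following is a simplified sketch of the idea, not natsort's actual implementation; the regular expressions and the `split_key` helper are assumptions made for illustration.

```python
import re

# Two definitions of "a number" (simplified; natsort's real patterns are richer).
SIGNED_FLOAT = re.compile(r"([-+]?\d+(?:\.\d+)?)")
UNSIGNED_INT = re.compile(r"(\d+)")


def split_key(text, number_re, convert):
    """Split text into alternating (text, number, text, ...) pieces."""
    # re.split with one capturing group puts the matched numbers at odd indices.
    parts = number_re.split(text)
    parts[1::2] = [convert(piece) for piece in parts[1::2]]
    return tuple(parts)


versions = ["1.9.0", "1.10.0"]

# Reading "1.10" as the float 1.1 places 1.10.0 before 1.9.0.
as_floats = sorted(versions, key=lambda v: split_key(v, SIGNED_FLOAT, float))

# Reading each digit run as an integer gives 9 < 10, the order most users expect.
as_ints = sorted(versions, key=lambda v: split_key(v, UNSIGNED_INT, int))
```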

I don't really like the presence of versorted because it doesn't actually comprehend versions. It is ultra-misleading. The reason I created this issue was that if there is a versorted function, it probably should actually comprehend version numbers. Otherwise, it should be removed in the next major release.

Every other function within the natsort package can handle any type of input given to it - it just returns different results depending on which function was called. This is the reason I was not excited about making versorted only handle version strings without anything before or after the version itself - it would not behave like the rest of the functions in the natsort suite.


Should sorting by version be taken out of natsorted in favor of using versorted?

natsorted has never actually comprehended versions. It separates out the numbers in a string then passes that result to sorted. Sorting versions cannot be taken out of natsorted because being able to sort most versions is just a natural consequence of this mechanism.
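
The mechanism can be approximated in a few lines. This is a simplified sketch of the idea rather than natsort's real code, and `natural_key` is a hypothetical name; it is enough to show why a bare release ends up ahead of its pre-releases.

```python
import re


def natural_key(text):
    """Split out digit runs and compare them as integers (simplified sketch)."""
    parts = re.split(r"(\d+)", text)
    # Odd positions hold the digit runs; convert them so 2 sorts before 11.
    parts[1::2] = [int(piece) for piece in parts[1::2]]
    return tuple(parts)


a = ['1.0.0-alpha', '1.0.0-alpha.1', '1.0.0-alpha.beta', '1.0.0-beta',
     '1.0.0-beta.2', '1.0.0-beta.11', '1.0.0-rc.1', '1.0.0']
ordered = sorted(a, key=natural_key)
```

The key for '1.0.0' ends in an empty string where the key for '1.0.0-alpha' has '-alpha', and the empty string sorts first. Nothing in the key knows that a pre-release suffix should lower precedence, so the release lands at the front.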

Should this workaround be implemented in versorted for the default version scheme?

I don't think so, because that only works if what a user is sorting is only the version, and if that is the limitation then semver.parse_version_info would do everything the workaround does.

How many/what variations of package/file names to support? Leave it to the user in some way?

I think this is getting a bit too specific. I really don't like the idea of tailoring the algorithm to assume the input data conforms to a particular "shape". Many of the problems I faced early on with this library were because I made assumptions about how the input data looked. So, rather than supporting packages/file names, the way I want to approach the problem is handling arbitrary input where the definition of the number is a version rather than a signed/unsigned float/int.

I think the workaround for the default version scheme should be implemented in some way with versorted.

I think that an optimal solution to finding versions in an arbitrary string would not need a workaround in order to give the correct results.

So, here's different questions I have: are there people who actually want file names sorted by versioning rather than as file names are sorted by <insert OS/file manager>? Is this a problem we should be worrying about? My gut feeling says that people are looking for OS/file manager sorting for file names, at least the other case would be quite rare.

This. I think these are the correct types of questions to be asking.

Consider that you have a folder of distributions of a package, e.g. "foo-1.0.0.zip", "foo-2.0.0.zip", etc. And you want to present them to a user to indicate the available packages they can use, starting from the latest. In this case the sorting would be on more than just the version.

Did natsorted work for me as-is? Yes. Do I think that people really need SemVer support for this? Maybe. No one has asked yet, so maybe it's not worth it.

Perhaps the whole idea of supporting SemVer natively and completely is me looking for a problem where there isn't one. Your suggestion of just using semver.parse_version_info as a key to natsorted would probably be a fine solution for most cases, and in that case no change would need to be made to natsort, just maybe an additional section in the documentation. It could probably even replace the workaround you mentioned because it handles cases the workaround does not.

@thebigmunch

thebigmunch commented Nov 10, 2018

Just some quick clarifications and conclusion.

Should sorting by version be taken out of natsorted in favor of using versorted?

natsorted has never actually comprehended versions. It separates out the numbers in a string then passes that result to sorted. Sorting versions cannot be taken out of natsorted because being able to sort most versions is just a natural consequence of this mechanism.

Technically, as shown in the existence of that workaround, natsorted doesn't actually sort versions correctly by coincidence or otherwise. It only works properly when all versions are nothing but numbers (and separators).

Should this workaround be implemented in versorted for the default version scheme?

I don't think so, because that only works if what a user is sorting is only the version, and if that is the limitation then semver.parse_version_info would do everything the workaround does.
snip
Perhaps the whole idea of supporting SemVer natively and completely is me looking for a problem where there isn't one. Your suggestion of just using semver.parse_version_info as a key to natsorted would probably be a fine solution for most cases, and in that case no change would need to be made to natsort, just maybe an additional section in the documentation. It could probably even replace the workaround you mentioned because it handles cases the workaround does not.

The versions in the workaround example are not valid semantic versions, so it couldn't replace the workaround for non-SemVer version strings. I'm not sure that workaround works properly for all semantic versions (or how many cases it actually does solve). I really do (and always did) think this should be a documented example using semver.parse_version_info as the key. That being said, I always enjoy thinking out and discussing things like this, and I find it helpful when someone else does so with me when I'm working on API/usability ideas and issues.

@SethMMorton
Owner Author

Regarding your first point, I think we are both in agreement. There is no handling within natsort at all for versions. It just happens to work for versions following MAJOR.MINOR.PATCH, which was the whole point of that convention in the first place. For the vast majority of cases this is enough, which is where the statement "being able to sort most versions is just a natural consequence of this mechanism" came from, emphasis on most.

The real issue is that this is not called out explicitly in the documentation. I will make sure to do that.

As for your second point, I hadn't given it too much thought. The real problem is (as you pointed out in an earlier comment) that there are simply too many version number conventions to be able to reliably handle all of them. The best case scenario is to show users examples of how to handle various version schemes (like the workaround or semver.parse_version_info) and then keep the API general.

To avoid confusion, in the next major release I think versorted should simply be removed from the API.

SethMMorton added a commit that referenced this issue Nov 15, 2018
The documentation used to give the impression that natsort comprehended
versions in a meaningful way. Hopefully that fantasy has been dispelled.

This is in response to the discussion in #61.
@SethMMorton
Owner Author

Resolution:

  • Add more info in documentation about what version sorting will work out-of-the-box and what will not
  • Direct users to use third-party modules to handle specific versioning schemes.
  • Handle anything more complicated when it arrives.

@thebigmunch Thanks for the discussion!
