DIR sorting should respect NLS settings #68
Comments
This sounds right. An example of such a collating table for CP 850 can be found here: https://github.com/SvarDOS/edrdos/blob/9751c114b84df883956fa289a0142dfe54b57854/drdos/country.asm#L2037 It is interesting to see that this seems to be a case-insensitive ordering. |
While the FreeDOS one is case-sensitive: https://github.com/FDOS/country/blob/a170a5508430cd861754b9064d7e1a081d8b3101/country.asm#L3768 |
Thanks for your input, Bernd. I will look into implementing this in SvarCOM soon, it's relatively easy and I'm halfway there already. Another, more annoying subject is that localcfg has no support for this collation business. |
I tested on MS-DOS 6.22 + 'trunk' SvarCOM. Results are a bit strange. At first, I created dirs ä, a, ö, o, u, ü, ß, s using MS COMMAND.COM.
No COUNTRY line (= EN-US) + MS COMMAND.COM: No COUNTRY line (= EN-US) + SvarCOM:
md s ß u ü o ö a ä Switched to MS COMMAND.COM |
A quick theory: is this because Ü and U have the same weight in your country.sys table? In the same manner, Ä might have the same weight as A. In that case, the order between the two is random, and it is the letter that comes after that will decide the order of the files. More importantly: do you get different results with MS COMMAND.COM?
That is unlikely, really. Are you sure you rebooted between each of your tests? There is no way SvarCOM could invent the proper order. All it does is ask the kernel for the "current country/codepage sorting order". Unless you are testing on some German version of MS-DOS, which comes with the default collation set to German?
If your results are reproducible, could you provide me with a boot floppy that has your exact NLS environment? I could then have a closer look at what exactly happens.
Now that I think of it, the behavior you describe does make sense to me. Independently of the "country" (1, 49, 33, or any other), since the currently selected codepage is able to display "Ü" I'd expect it to be always sorted like "U". This is to say that maybe the "country" does not mean anything, the collating table is probably tied only to the codepage. |
If you'd be keen on doing more tests, I think you could try replacing In any case, I like the behavior you describe more than having the "en-US" sort being stupid about European glyphs. :) |
Dunno. I haven't looked at the table so far, and I'm also completely new to collation.
Is it really random or does it depend on the order of creation on disk?
Do you mean any randomness in the order?
Yes.
I tested all this on a German version of MS-DOS, but why would it work correctly then with MS COMMAND.COM? Now, I repeated one of those tests on an English version of MS-DOS 6.22. Same result. CONFIG.SYS:
AUTOEXEC.BAT:
No change. Also no change after replacing mov bx, 0xffff with mov bx, 437. |
It is sorted via quicksort, so the entries are shuffled around quite a bit. I'm not sure the on-disk order is always preserved when two entries have the same weight, so I'd rather say "undefined behavior".
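To illustrate the tie case discussed above, here is a minimal sketch. It assumes a hypothetical collate table in which U, Ü and ü share a single weight; this is not the table returned by any real kernel. With equal weights the comparison returns 0, so a non-stable sort is free to emit the two names in either order. Adding the raw bytes as a secondary key is one way to make the result deterministic; that tie-break is an assumption of this sketch, not something SvarCOM is stated to do.

```python
# Hypothetical 256-entry collate table (NOT dumped from a real kernel):
# identity mapping, except ASCII lowercase folds to uppercase and
# the bytes for U-umlaut (0x9A = Ü, 0x81 = ü in CP437/CP850) weigh like 'U'.
def build_table() -> bytes:
    table = list(range(256))
    for b in range(ord("a"), ord("z") + 1):
        table[b] = b - 32
    table[0x9A] = ord("U")
    table[0x81] = ord("U")
    return bytes(table)


TABLE = build_table()


def weights(name: bytes) -> bytes:
    """Translate every byte of a file name into its collation weight."""
    return bytes(TABLE[b] for b in name)


def compare(a: bytes, b: bytes) -> int:
    """Three-way comparison on weights only; 0 means 'tie'."""
    wa, wb = weights(a), weights(b)
    return (wa > wb) - (wa < wb)


def sort_key(name: bytes):
    """Weights first, raw bytes second, so ties get a fixed order."""
    return (weights(name), name)


names = [b"\x9a", b"U", b"S"]
ordered = sorted(names, key=sort_key)
```

With the secondary key, the result no longer depends on the input order, whichever sorting algorithm is used.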
Well, there isn't much more I could do then... I suppose this could be due to some hardcoded rule. I do not see a problem with having the sort rely on NLS all the time (as long as NLS is available, that is), and at least it makes for a consistent sorting experience across languages. Unless you have some other ideas, I will check later today that the NLS sorting also behaves well in Polish and Russian, and call this a feature.
I've set up an MS-DOS 6.0 VM (had to borrow the COUNTRY.SYS and EGA3.CPI from MS-DOS 6.22, though) and tested the collate sort order for CP852 and CP866: both behave the same with SvarCOM and MS COMMAND.COM when the COUNTRY is set to 048 and 007, respectively. For example: All good. But when the COUNTRY is NOT set, then things go south. MS COMMAND.COM orders files according to ASCII, which is not linguistically correct but fair enough given the circumstances: SvarCOM, on the other hand, lists files in an order that makes no sense: The above order is not ASCII, not alphabetic, and it's also not the order of files on disk. It is interesting to note that this order is the same for both PL and RU codepages. Noticing this, I changed my configuration and set So my working theory (speculation) is that when COUNTRY is not set or set to its default value (1), then the kernel falls back to a collate table designed for CP437. I do not know the rationale for this behavior; maybe there is a reason for it, or maybe it is a bug. Whatever the cause, it appears that NLS sorting should be disabled for "COUNTRY is 001" after all.
r1744 performs NLS sorting only when COUNTRY > 1. This, I think, mimics what MS COMMAND.COM does, and also avoids ending up with a wild sort order for non-437 languages when COUNTRY is not configured (because when COUNTRY is not configured, the kernel assumes COUNTRY=1 and proposes a CP437 collate table). I am not entirely convinced this is a good approach, because after all a missing COUNTRY is a configuration error that the user should fix, and besides - I really liked the elegant CP437 sorting being applied to U.S.... but if in doubt, it is probably safer to monkey whatever MS did 40 years ago. @bttrx This should make the sort order work as you initially expected. Can you confirm?
Interesting findings! Have you tried checking the table size for being exactly 256? Currently there is a <= 256. Maybe the table contains simply "uninitialized" garbage. I am currently also on this topic but from an EDR kernel perspective. For EDR the case-insensitive standard collation is set by default even without a COUNTRY line in CONFIG.SYS. Would be interesting to see which table the MS-DOS kernel returns in the "default" case. |
There may be a combination of country=1 and code page=850. The EDR country.sys contains this combination. In this case collating table is that of CP 850. |
Yes I did, the kernel always advertises the table as 256 bytes. But even if it was less, it would be no issue because then SvarCOM relies on ASCII sorting for whatever is not covered by the collate table.
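The fallback described above can be sketched as follows. This is only an illustration under stated assumptions: `collate_key` and the 128-entry `short_table` are hypothetical names and data made up for this example, not SvarCOM code or a real kernel table. Bytes that the table covers are translated to their weight, and anything beyond the end of the table keeps its raw byte value, which amounts to ASCII-style ordering for those characters.

```python
def collate_key(name: bytes, table: bytes) -> bytes:
    """Map each byte through the collate table.

    Bytes that fall beyond the end of the table are left untouched,
    so they sort by their raw value.
    """
    return bytes(table[b] if b < len(table) else b for b in name)


# Hypothetical 128-entry table that only folds ASCII lowercase to uppercase.
short_table = bytes(
    b - 32 if ord("a") <= b <= ord("z") else b for b in range(128)
)

# 0x8E is 'Ä' in CP437; it is outside the short table.
files = [b"b.txt", b"A.TXT", b"\x8e.TXT", b"a.txt"]
ordered = sorted(files, key=lambda n: collate_key(n, short_table))
```

Here `A.TXT` and `a.txt` get the same key, while the name starting with byte 0x8E is ordered by its raw value and lands after all ASCII names.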
It is basically a "common sense CP437" sorting that is case-insensitive, for example i = I = ï = î = ì = í. But here it is, I dumped it for you :)
The issue here is that MS-DOS returns "something" (the above collate table) for combinations that do not exist, like country=1 and page=866, which makes it difficult to trust anything when country is 1, as it's a default value... |
Thanks :-) This indeed looks like a valid collating table. For reference, these are the FreeDOS country.sys values for 437. I am bad at comparing, but it looks like the tables are equal. (I posted the FreeDOS one because the EDR one is in hex :-P)
and this is what MS-DOS returns for COUNTRY=1 / CP=850. (indeed, a different set)
But again, this table is NOT used by command.com for the combination COUNTRY=1 / CP=850. Instead, ASCII sort is applied (just like for the combination COUNTRY=1 / CP=437). But as soon as I switch to COUNTRY=33 / CP=850, the above sort table is not only proposed by the kernel, but also applied by command.com. So while the kernel seems to have some options for nice COUNTRY=1 sorting, MS COMMAND.COM prefers to ignore them.
Do you want SvarCOM to be bug-for-bug compatible? :-D |
As an additional data point: 4DOS does not seem to respect country and code page at all for sorting. I just tried it with my current SvarDOS install. But it outputs its listing in lowercase by default, which fails on German umlauts :-)
No, and this is why at first I was happy to keep NLS sorting for US codepages, despite Robert's complaints. :) I'm not sure what to do on this, and would need to make more tests to compare how it works with the FreeDOS and EDR kernels. But for now, having no certainty I preferred to opt for following MS's cautious choice so I can push SvarCOM 2024.2 out. Then there will always be time to reconsider options.
SvarDOS might not be a good test candidate, as it comes with a very limited COUNTRY.SYS, with no collation tables and no upcase tables. Maybe that's the reason 4DOS fails on the umlauts? |
It is SvarDOS using EDR and its COUNTRY.SYS I am running. I have not looked into the 4DOS source yet. But my assumption is that it simply does not make use of the INT21,65xx functions (at least for sorting). Regarding the conversion to lower case, which leads to something like I noticed that the EDR country.sys has upcase conversion tables but no lower case tables. MS-DOS country.sys seems to have some lower case tables since 6.22 according to RBIL, but incomplete. This makes conversion to lower case harder than conversion to upper case, I think. Perhaps one can convert the upper case table to a lower case table? Should be possible if the mapping is bijective. |
Better play safe 👍 |
It is not, because due to the space limitation of a single codepage, not all glyphs are available in both upper and lower case. For example, in CP437 there is the French "è" but not its uppercase version, so the upcase conversion is "è -> E". The same situation occurs with many other glyphs.
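A small sketch of why inverting the upcase table does not work. The `UPCASE` excerpt below is illustrative only: it contains the "è -> E" mapping mentioned above plus a few assumed entries, and is not a dump of a real COUNTRY.SYS table. The helper names are hypothetical too. Python's built-in `cp437` codec is used to check which glyphs exist in the codepage.

```python
from collections import defaultdict

# Illustrative excerpt of an upcase mapping (assumed entries, not a real dump).
UPCASE = {"e": "E", "è": "E", "a": "A", "ä": "Ä"}


def in_cp437(ch: str) -> bool:
    """True if the character has a code point in CP437."""
    try:
        ch.encode("cp437")
        return True
    except UnicodeEncodeError:
        return False


def invert(upcase: dict) -> dict:
    """Return a mapping of uppercase glyph -> list of lowercase candidates."""
    inverse = defaultdict(list)
    for lower, upper in upcase.items():
        inverse[upper].append(lower)
    return dict(inverse)


def is_bijective(upcase: dict) -> bool:
    """An upcase table can be inverted only if no two inputs share an output."""
    return all(len(lowers) == 1 for lowers in invert(upcase).values())


ambiguous = {u: l for u, l in invert(UPCASE).items() if len(l) > 1}
```

In this excerpt `ambiguous` contains "E", which has two lowercase candidates ("e" and "è"), so a lowercase table cannot be derived mechanically from the upcase table alone.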
Checking for COUNTRY=1 and ignoring NLS sorting is a no-go after all, because the FreeDOS kernel returns an error "invalid function number" to the call INT 21h/AX=6501h (and that's the call I need to discover the current COUNTRY). Hence the "if country==1 then ignore NLS" hack is not only ugly, but not possible anyway with SvarDOS' current default kernel. I will therefore remove this hack and we will have to live with the fact that DIR collation will be very weird for users that set a non-437 codepage but forget to set a proper COUNTRY setting. |
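The two policies discussed in this thread can be contrasted in a short sketch. Everything here is hypothetical: the function names, and the convention that a failed kernel query is represented by `None`, are assumptions made for illustration and do not reflect SvarCOM's actual code.

```python
from typing import Optional


def r1744_use_nls(country: Optional[int]) -> bool:
    """Rule described for r1744: NLS sorting only when COUNTRY > 1.

    `country` is None when the kernel could not report it, e.g. when the
    query call returns an error (hypothetical convention for this sketch).
    """
    if country is None:
        # The rule cannot be evaluated at all: the country is unknown.
        raise LookupError("country code unavailable")
    return country > 1


def unconditional_use_nls(collate_table: Optional[bytes]) -> bool:
    """Alternative policy: use the collate table whenever the kernel offers one."""
    return collate_table is not None
```

The first rule depends on a value that some kernels do not provide, while the second only needs the collate table itself.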
…bugz#68) git-svn-id: svn://svn.svardos.org/svardos@1744 911cea91-c70f-4353-bd03-772f58fe8c9d
Closing this. For the time being, I do not see any better approach than applying NLS sorting unconditionally. I believe it is the most elegant solution, even though it differs from MS-DOS's behavior.
This follows #11
@bttrx writes: