DIR sorting should respect NLS settings #68

Closed
mateuszviste opened this issue Feb 17, 2024 · 28 comments
Labels
SvarCOM The SvarDOS command interpreter

Comments

@mateuszviste
Collaborator

This follows #11

@bttrx writes:

I found a problem with DIR /O:N: Sort order of names depends on the language, but SvarCOM DIR doesn't take this into account currently. Just comparing ASCII values isn't enough for sorting properly. ;-)

Example:

   md a
   md b
   md ä
   dir /o:n on MS-DOS 5.0 with COUNTRY=049,850,C:\DOS\COUNTRY.SYS shows dirs in this order: A -> Ä -> B
   dir /o:n on MS-DOS 5.0 with same settings + 'trunk' SvarCOM shows dirs in this order: A -> B -> Ä

I think, Int 21/AX=6506h is your friend here, although I never used this function myself.

@mateuszviste mateuszviste added the SvarCOM The SvarDOS command interpreter label Feb 17, 2024
@boeckmann
Collaborator

I think, Int 21/AX=6506h is your friend here, although I never used this function myself.

This sounds right. An example of such a collating table for CP 850 can be found here: https://github.com/SvarDOS/edrdos/blob/9751c114b84df883956fa289a0142dfe54b57854/drdos/country.asm#L2037

It is interesting to see that this seems to be a case-insensitive ordering.

@boeckmann
Collaborator

It is interesting to see that this seems to be a case-insensitive ordering.

While the FreeDOS one is case-sensitive: https://github.com/FDOS/country/blob/a170a5508430cd861754b9064d7e1a081d8b3101/country.asm#L3768

@mateuszviste
Collaborator Author

Thanks for your input, Bernd. I will look into implementing this in SvarCOM soon, it's relatively easy and I'm halfway there already.
Worth noting that for FreeDOS the collation does not matter much, since FreeCOM does not support NLS sorting anyway :-P (it uses a simple strcmp() call)
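
To make the difference concrete, here is a minimal sketch in portable C that contrasts a plain strcmp() with a comparison that goes through a 256-byte weight table. The function names fill_demo_collate() and collate_cmp() are hypothetical and are not taken from SvarCOM or FreeCOM. The few umlaut weights are copied from the MS-DOS table dump that appears later in this thread; the rest of the table is a simplification.

```c
#include <assert.h>
#include <string.h>

/* Fill a 256-entry weight table: identity, a-z folded onto A-Z, plus a few
 * CP437/CP850 umlaut weights copied from the dump later in this thread.
 * Illustrative only, this is not a complete collate table. */
void fill_demo_collate(unsigned char tbl[256]) {
  int i;
  assert(tbl != NULL);
  for (i = 0; i < 256; i++) tbl[i] = (unsigned char)i;
  for (i = 'a'; i <= 'z'; i++) tbl[i] = (unsigned char)(i - 'a' + 'A');
  tbl[0x84] = 'A'; tbl[0x8E] = 'A'; /* a-umlaut, A-umlaut */
  tbl[0x94] = 'O'; tbl[0x99] = 'O'; /* o-umlaut, O-umlaut */
  tbl[0x81] = 'U'; tbl[0x9A] = 'U'; /* u-umlaut, U-umlaut */
  tbl[0xE1] = 'S';                  /* sharp s */
}

/* Compare two strings by collate weight instead of raw byte value.
 * Returns <0, 0 or >0 like strcmp(). */
int collate_cmp(const unsigned char *tbl, const char *a, const char *b) {
  assert(tbl != NULL && a != NULL && b != NULL);
  while (*a != '\0' && *b != '\0') {
    unsigned char wa = tbl[(unsigned char)*a];
    unsigned char wb = tbl[(unsigned char)*b];
    if (wa != wb) return (int)wa - (int)wb;
    a++;
    b++;
  }
  /* the shorter string sorts first */
  return (*a != '\0') - (*b != '\0');
}
```

With such a table, "ä" (byte 0x84) sorts before "b", whereas strcmp() puts it after "b" because 0x84 is greater than 0x62. Note that "a" and "ä" compare as equal here, which matters for the tie discussion further down.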

Another, more annoying subject is that localcfg has no support for this collation business.

@mateuszviste
Collaborator Author

Committed in r1743

It seems to work, but I haven't tested it very much to be honest, as I do not use COUNTRY.SYS myself.
@bttrx any chance you could check if this works alright on your setup with all these weird German letters of yours?

@bttrx
Collaborator

bttrx commented Feb 18, 2024

I tested on MS-DOS 6.22 + 'trunk' SvarCOM. Results are a bit strange.

At first, I created dirs ä, a, ö, o, u, ü, ß, s using MS COMMAND.COM.

COUNTRY=049,850,C:\DOS\COUNTRY.SYS + MS COMMAND.COM.
DIR /O:N order: Ä, A, Ö, O, ß, S, U, Ü
Why does Ü come after U, but Ä before A? I created these in a different order.
In a new dir I did:
md a, md ä, dir /on -> aä
md ä, md a, dir /on -> äa
md u, md ü, dir /on -> uü
md ü, md u, dir /on -> üu

COUNTRY=049,850,C:\DOS\COUNTRY.SYS + SvarCOM.
DIR /O:N order: A, Ä, O, Ö, ß, S, Ü, U
In another new dir I did:
md a, md ä, dir /on -> aä
md ä, md a, dir /on -> äa
md u, md ü, dir /on -> üu (!)
md ü, md u, dir /on -> uü (!)

No COUNTRY line (= EN-US) + MS COMMAND.COM:
DIR /O:N order: A, O, S, U, Ä, Ö, Ü, ß
This is the expected order for EN-US.

No COUNTRY line (= EN-US) + SvarCOM:
DIR /O:N order: A, Ä, O, Ö, ß, S, Ü, U
That's unexpected, because it's the COUNTRY=049 order!

COUNTRY=001,437,C:\DOS\COUNTRY.SYS (= EN-US) + SvarCOM:
DIR /O:N order: A, Ä, O, Ö, ß, S, Ü, U
That's unexpected again, because it's the COUNTRY=049 order!

md s ß u ü o ö a ä
DIR /O:N order: A, Ä, Ö, O, S, ß, Ü, U (?)
removed all dirs and created in the same order -> same result

Switched to MS COMMAND.COM
DIR /O:N order: A, O, S, U, Ä, Ö, Ü, ß
This is again the expected order for EN-US.

@bttrx bttrx reopened this Feb 18, 2024
@mateuszviste
Collaborator Author

mateuszviste commented Feb 18, 2024

Why does Ü come after U, but Ä before A? I created these in a different order.

A quick theory: is this because Ü and U have the same weight in your country.sys table? In the same manner, Ä might have the same weight as A. In that case, the order between the two is arbitrary, and it is the letter that comes after that decides the order of the files.

more importantly: do you have different results with MS command.com ?

No COUNTRY line (= EN-US) + SvarCOM:
DIR /O:N order: A, Ä, O, Ö, ß, S, Ü, U
That's unexpected, because it's the COUNTRY=049 order!

That is unlikely, really. Are you sure you performed reboots between each of your tests? There is no way SvarCOM could invent the proper order. All it does is ask the kernel for "current country/codepage sorting order". Unless you test on some German version of MS-DOS, which comes with the default collation set to German?

@mateuszviste
Collaborator Author

If your results are reproducible, could you maybe provide me with a boot floppy that has your exact NLS environment? I could then have a closer look at what exactly happens.

@mateuszviste
Collaborator Author

Now that I think of it, the behavior you describe does make sense to me. Independently of the "country" (1, 49, 33, or any other), since the currently selected codepage is able to display "Ü" I'd expect it to be always sorted like "U".

This is to say that maybe the "country" does not mean anything, the collating table is probably tied only to the codepage.

@mateuszviste
Collaborator Author

If you'd be keen on doing more tests, I think you could try replacing mov dx, 0xffff from r1743 by an actual country value (1, 49...) and see if it changes anything. 0xffff is supposed to mean "current country", but maybe there is some different behavior if it is given explicitly.
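
For readers unfamiliar with the call, here is a hedged sketch of how the reply could be decoded. As discussed in this thread, DX carries the country and BX the code page, with 0xffff meaning "current". The reply layout assumed below (an info ID byte of 06h followed by a far pointer to a table that starts with a 16-bit length) is how the interface is commonly documented, for example in RBIL; treat it as an assumption rather than as SvarCOM's actual code. The sketch is portable C that works on a byte array instead of real DOS memory, and all names in it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded form of the 5-byte reply buffer (assumed layout). */
struct collate_ptr {
  uint16_t off;
  uint16_t seg;
};

/* Returns 0 on success, -1 if the info ID byte is not 06h. */
int parse_collate_reply(const uint8_t reply[5], struct collate_ptr *out) {
  assert(reply != NULL && out != NULL);
  if (reply[0] != 0x06) return -1;
  out->off = (uint16_t)(reply[1] | (reply[2] << 8)); /* little-endian offset */
  out->seg = (uint16_t)(reply[3] | (reply[4] << 8)); /* little-endian segment */
  return 0;
}

/* Convert a real-mode segment:offset pair to a linear address. */
uint32_t far_to_linear(struct collate_ptr p) {
  return ((uint32_t)p.seg << 4) + p.off;
}

/* Read the 16-bit table length stored in front of the weights.
 * 'mem' stands in for conventional memory in this sketch. */
uint16_t collate_table_len(const uint8_t *mem, uint32_t linear) {
  assert(mem != NULL);
  return (uint16_t)(mem[linear] | (mem[linear + 1] << 8));
}
```

In a real DOS program the pointer would be dereferenced directly as a far pointer; the linear-address helper only exists so the logic can be exercised outside of DOS.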

In any case, I like the behavior you describe more than having the "en-US" sort being stupid about European glyphs. :)

@bttrx
Collaborator

bttrx commented Feb 18, 2024

Why does Ü come after U, but Ä before A? I created these in a different order.

a quick theory: is this because Ü and U have the same weight in your country.sys table?

Dunno. Didn't have a look at the table so far and I'm also new to collation at all.

In the same manner, Ä might have the same weight as A. In that case, the order between the two is arbitrary, and it is the letter that comes after that decides the order of the files.

Is it really random or does it depend on the order of creation on disk?

more importantly: do you have different results with MS command.com ?

Do you mean any randomness in the order?
No, didn't notice any randomness.

No COUNTRY line (= EN-US) + SvarCOM:
DIR /O:N order: A, Ä, O, Ö, ß, S, Ü, U
That's unexpected, because it's the COUNTRY=049 order!

That is unlikely, really. Are you sure you performed reboots between each of your tests?

Yes.

There is no way SvarCOM could invent the proper order.. All it does is ask the kernel for "current country's sorting order". Unless you test on some German version of MSDOS, which comes with the default collation set to German?

I tested all this on a German version of MS-DOS, but why would it work correctly then with MS COMMAND.COM?

Now, I repeated one of those tests on an English version of MS-DOS 6.22. Same result.
MS COMMAND.COM dir /on -> AOUÄÖÜ
'trunk' SvarCOM dir /on -> ÄAÖOÜU

CONFIG.SYS:

[MENU]
MENUITEM=MSCOM,MS COMMAND.COM
MENUITEM=SVARCOM,SvarCOM

[MSCOM]
SHELL=C:\COMMAND.COM /P

[SVARCOM]
SHELL=C:\SVARCOM.COM /E:512 /P

[COMMON]
SWITCHES=/F
DEVICE=C:\DOS\SETVER.EXE
DEVICE=C:\DOS\HIMEM.SYS /TESTMEM:OFF /V
DOS=HIGH
FILES=30

AUTOEXEC.BAT:

C:\DOS\SMARTDRV.EXE /X
@ECHO OFF
PROMPT $p$g
PATH C:\DOS
SET TEMP=C:\DOS
EIDL.COM

@bttrx
Collaborator

bttrx commented Feb 18, 2024

I think you could try replacing mov dx, 0xffff from r1743 by an actual country value (1, 49...) and see if it changes anything.

No change. Also no change after replacing mov bx, 0xffff with mov bx, 437.

@mateuszviste
Collaborator Author

Is it really random or does it depend on the order of creation on disk?

It is sorted via quicksort, so the entries are shuffled around quite a bit. I'm not sure the on-disk order is always preserved when two names have equal weights, so I'd rather say "undefined behavior".
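
If a deterministic order for equal-weight names were wanted, one possible approach (an assumption for illustration, not what SvarCOM does) is to remember each entry's on-disk position and use it as a tie-breaker in the comparison callback. The struct and function names below are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct entry {
  const char *name;  /* file name in the active code page */
  size_t disk_index; /* position in the directory as read from disk */
};

/* qsort() has no context argument, so the table is passed via a static. */
static const unsigned char *g_tbl;

int entry_cmp(const void *pa, const void *pb) {
  const struct entry *a = pa;
  const struct entry *b = pb;
  const unsigned char *x = (const unsigned char *)a->name;
  const unsigned char *y = (const unsigned char *)b->name;
  while (*x != '\0' && *y != '\0') {
    if (g_tbl[*x] != g_tbl[*y]) return (int)g_tbl[*x] - (int)g_tbl[*y];
    x++;
    y++;
  }
  if (*x != '\0' || *y != '\0') return (*x != '\0') - (*y != '\0');
  /* equal weights all the way: fall back to the on-disk order */
  return (a->disk_index > b->disk_index) - (a->disk_index < b->disk_index);
}

void sort_entries(struct entry *list, size_t n, const unsigned char *tbl) {
  assert(list != NULL && tbl != NULL);
  g_tbl = tbl;
  qsort(list, n, sizeof list[0], entry_cmp);
}
```

Because the comparison never reports two distinct entries as equal, the result no longer depends on how the sort algorithm moves elements around.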

No change. Also no change after replacing mov bx, 0xffff with mov bx, 437.

Well, there isn't much more I could do then... I suppose this could be due to some hardcoded rule
if country=1 then do not bother with NLS and just rely on fast ASCII order.

I do not see a problem having the sort rely on NLS all the time (as long as NLS is available, that is), and at least it makes for a consistent sorting experience across languages.

Unless you have some other ideas, I will check later today that the NLS sorting behaves well also in Polish and Russian and call this a feature.

@mateuszviste
Collaborator Author

I will check later today that the NLS sorting behaves well also in Polish and Russian and call this a feature.

I've set up an MS-DOS 6.0 VM (had to borrow the COUNTRY.SYS and EGA3.CPI from MS-DOS 6.22, though) and tested the collate sort order for CP852 and CP866: both behave the same with SvarCOM and MS COMMAND.COM when the COUNTRY is set to 048 and 007, respectively. For example:

image

image

All good.

But when the COUNTRY is NOT set, then things go south. MS COMMAND.COM orders files according to ASCII, which is not linguistically correct but fair enough given the circumstances:

image

image

SvarCOM, on the other hand, lists files in an order that makes no sense:

image

image

The above order is not ASCII, not alphabetic, and it's also not the order of files on disk.
It does not seem to be a SvarCOM bug, because SvarCOM really does receive such collate table from the kernel, and the INT21h/AX=6506h call does not fail (CF is clear). Weird.

It is interesting to note that this order is the same for both PL and RU codepages. Noticing this, I changed my configuration and set COUNTRY=001,437,.... And guess what: the order is still exactly the same!

So my working theory (speculation) is that when COUNTRY is not set or set to its default value (1), then the kernel falls back to a collate table designed for CP437. I do not know what the rationale for this behavior is; maybe there is a reason for it, or maybe it is a bug. Whatever the cause, it appears that NLS sorting should be disabled for "COUNTRY is 001" after all.

@mateuszviste
Collaborator Author

r1744 performs NLS sorting only when COUNTRY > 1.

This, I think, mimics what MS COMMAND does, and also avoids ending up with a wild sort order for non-437 languages when COUNTRY is not configured (because when COUNTRY is not configured, the kernel assumes COUNTRY=1 and proposes a CP437 collate table).

I am not entirely convinced this is a good approach, because after all a missing COUNTRY is a configuration error that the user should fix, and besides - I really liked the elegant CP437 sorting being applied to U.S.... but if in doubt, it is probably safer to monkey whatever MS did 40 years ago.

@bttrx This should make the sort order work as you initially expected. Do you confirm?

@boeckmann
Collaborator

Interesting findings! Have you tried checking the table size for being exactly 256? Currently there is a <= 256. Maybe the table contains simply "uninitialized" garbage. I am currently also on this topic but from an EDR kernel perspective. For EDR the case-insensitive standard collation is set by default even without a COUNTRY line in CONFIG.SYS. Would be interesting to see which table the MS-DOS kernel returns in the "default" case.

@boeckmann
Collaborator

r1744 performs NLS sorting only when COUNTRY > 1

There may be a combination of country=1 and code page=850. The EDR country.sys contains this combination. In this case the collating table is that of CP 850.

@mateuszviste
Collaborator Author

Have you tried checking the table size for being exactly 256?

Yes I did, the kernel always advertises the table as 256 bytes. But even if it was less, it would be no issue because then SvarCOM relies on ASCII sorting for whatever is not covered by the collate table.
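
The fallback described above can be sketched as follows: copy whatever the kernel advertises into a full 256-entry lookup and leave every remaining code point with its own byte value as weight, which amounts to ASCII order for the uncovered range. The name expand_collate() is hypothetical, and clamping lengths above 256 is a choice made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>

/* Build a full 256-entry lookup from a kernel table of 'len' weights.
 * Code points not covered by the kernel table keep their own value. */
void expand_collate(unsigned char dst[256], const unsigned char *src,
                    size_t len) {
  size_t i;
  assert(dst != NULL && (src != NULL || len == 0));
  if (len > 256) len = 256; /* sketch-only safety clamp */
  for (i = 0; i < len; i++) dst[i] = src[i];
  for (; i < 256; i++) dst[i] = (unsigned char)i;
}
```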

Would be interesting to see which table the MS-DOS kernel returns in the "default" case.

It is basically a "common sense CP437" sorting that is case-insensitive, for example i = I = ï = î = ì = í. But here it is, I dumped it for you :)

 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015
 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031
 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047
 048 049 050 051 052 053 054 055 056 057 058 059 060 061 062 063
 064 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079
 080 081 082 083 084 085 086 087 088 089 090 091 092 093 094 095
 096 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079
 080 081 082 083 084 085 086 087 088 089 090 123 124 125 126 127
 067 085 069 065 065 065 065 067 069 069 069 073 073 073 065 065
 069 065 065 079 079 079 085 085 089 079 085 036 036 036 036 036
 065 073 079 085 078 078 166 167 063 169 170 171 172 033 034 034
 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207
 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
 224 083 226 227 228 229 230 231 232 233 234 235 236 237 238 239
 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255

There may be a combination of country=1 and code page=850. The EDR country.sys contains this combination. In this case the collating table is that of CP 850.

The issue here is that MS-DOS returns "something" (the above collate table) for combinations that do not exist, like country=1 and page=866, which makes it difficult to trust anything when country is 1, as it's a default value...
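
The claim that the dumped table is case-insensitive can be checked mechanically. The sketch below rebuilds the table from the dump above (identity, a-z folded onto A-Z, the three rows for code points 0x80 to 0xAF copied verbatim, and 0xE1 mapped to 83) and tests whether a set of characters shares one weight. The function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Weights for code points 0x80..0xAF, copied from the dump above. */
static const unsigned char high_rows[48] = {
  67, 85, 69, 65, 65, 65, 65, 67, 69, 69, 69, 73, 73, 73, 65, 65,
  69, 65, 65, 79, 79, 79, 85, 85, 89, 79, 85, 36, 36, 36, 36, 36,
  65, 73, 79, 85, 78, 78, 166, 167, 63, 169, 170, 171, 172, 33, 34, 34
};

/* Rebuild the 256-byte table shown in the dump. */
void build_msdos_default_collate(unsigned char tbl[256]) {
  int i;
  assert(tbl != NULL);
  for (i = 0; i < 256; i++) tbl[i] = (unsigned char)i;
  for (i = 'a'; i <= 'z'; i++) tbl[i] = (unsigned char)(i - 'a' + 'A');
  memcpy(&tbl[0x80], high_rows, sizeof high_rows);
  tbl[0xE1] = 83; /* sharp s sorts like 'S' */
}

/* Returns 1 if every character in 'chars' has the same weight. */
int same_weight(const unsigned char *tbl, const char *chars) {
  const unsigned char *p = (const unsigned char *)chars;
  assert(tbl != NULL && chars != NULL);
  if (*p == '\0') return 1;
  for (; p[1] != '\0'; p++) {
    if (tbl[p[0]] != tbl[p[1]]) return 0;
  }
  return 1;
}

/* Returns 1 if a-z and A-Z map to identical weights. */
int is_case_insensitive(const unsigned char *tbl) {
  int i;
  assert(tbl != NULL);
  for (i = 'a'; i <= 'z'; i++) {
    if (tbl[i] != tbl[i - 'a' + 'A']) return 0;
  }
  return 1;
}
```

Passing the CP437 bytes for i, I, ï, î, ì and í (0x69, 0x49, 0x8B, 0x8C, 0x8D, 0xA1) to same_weight() confirms the i = I = ï = î = ì = í example, and the same holds for U and Ü, which explains the ties observed earlier in the thread.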

@boeckmann
Collaborator

But here it is, I dumped it for you :)

Thanks :-) Looks indeed like a valid collating table. For reference, these are the FreeDOS country.sys values for 437. I am bad at comparing, but it looks like the tables are equal. (posted the FreeDOS one because the EDR one is in hex :-P)

db   0,   1,   2,   3,   4,   5,   6,   7
db   8,   9,  10,  11,  12,  13,  14,  15
db  16,  17,  18,  19,  20,  21,  22,  23
db  24,  25,  26,  27,  28,  29,  30,  31
db  32,  33,  34,  35,  36,  37,  38,  39
db  40,  41,  42,  43,  44,  45,  46,  47
db  48,  49,  50,  51,  52,  53,  54,  55
db  56,  57,  58,  59,  60,  61,  62,  63
db  64,  65,  66,  67,  68,  69,  70,  71
db  72,  73,  74,  75,  76,  77,  78,  79
db  80,  81,  82,  83,  84,  85,  86,  87
db  88,  89,  90,  91,  92,  93,  94,  95
db  96,  65,  66,  67,  68,  69,  70,  71
db  72,  73,  74,  75,  76,  77,  78,  79
db  80,  81,  82,  83,  84,  85,  86,  87
db  88,  89,  90, 123, 124, 125, 126, 127
db  67,  85,  69,  65,  65,  65,  65,  67
db  69,  69,  69,  73,  73,  73,  65,  65
db  69,  65,  65,  79,  79,  79,  85,  85
db  89,  79,  85,  36,  36,  36,  36,  36
db  65,  73,  79,  85,  78,  78, 166, 167
db  63, 169, 170, 171, 172,  33,  34,  34
db 176, 177, 178, 179, 180, 181, 182, 183
db 184, 185, 186, 187, 188, 189, 190, 191
db 192, 193, 194, 195, 196, 197, 198, 199
db 200, 201, 202, 203, 204, 205, 206, 207
db 208, 209, 210, 211, 212, 213, 214, 215
db 216, 217, 218, 219, 220, 221, 222, 223
db 224,  83, 226, 227, 228, 229, 230, 231
db 232, 233, 234, 235, 236, 237, 238, 239
db 240, 241, 242, 243, 244, 245, 246, 247
db 248, 249, 250, 251, 252, 253, 254, 255

@mateuszviste
Collaborator Author

and this is what MS-DOS returns for COUNTRY=1 / CP=850. (indeed, a different set)

 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015
 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031
 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047
 048 049 050 051 052 053 054 055 056 057 058 059 060 061 062 063
 064 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079
 080 081 082 083 084 085 086 087 088 089 090 091 092 093 094 095
 096 065 066 067 068 069 070 071 072 073 074 075 076 077 078 079
 080 081 082 083 084 085 086 087 088 089 090 123 124 125 126 127
 067 085 069 065 065 065 065 067 069 069 069 073 073 073 065 065
 069 065 065 079 079 079 085 085 089 079 085 079 036 079 158 036
 065 073 079 085 078 078 166 167 063 169 170 171 172 033 034 034
 176 177 178 179 180 065 065 065 184 185 186 187 188 036 036 191
 192 193 194 195 196 197 065 065 200 201 202 203 204 205 206 036
 068 068 069 069 069 073 073 073 073 217 218 219 220 221 073 223
 079 083 079 079 079 079 230 232 232 085 085 085 089 089 238 239
 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255

But again, this table is NOT used by command.com for the combination COUNTRY=1 / CP=850. Instead, ASCII sort is applied (just like for the combination COUNTRY=1 / CP=437). But as soon as I switch to COUNTRY=33 / CP=850, the above sort table is not only proposed by the kernel, but also applied by command.com.

So while the kernel seems to have some options for nice COUNTRY=1 sorting, MS COMMAND.COM prefers ignoring them.
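
Comparing such dumps by eye is error-prone, so a small helper can do it instead. The sketch below counts differing entries between two weight arrays and reports the first mismatch. To keep it short, only the row for code points 0x90 to 0x9F is included, copied from the default dump and from the COUNTRY=1 / CP=850 dump in this thread; the function and array names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count entries that differ between two weight arrays of length n.
 * If 'first' is not NULL and a difference exists, store its index. */
size_t collate_diff(const unsigned char *a, const unsigned char *b, size_t n,
                    size_t *first) {
  size_t i, count = 0;
  assert(a != NULL && b != NULL);
  for (i = 0; i < n; i++) {
    if (a[i] != b[i]) {
      if (count == 0 && first != NULL) *first = i;
      count++;
    }
  }
  return count;
}

/* Weights for code points 0x90..0x9F, copied from the two dumps. */
const unsigned char row90_default[16] = {
  69, 65, 65, 79, 79, 79, 85, 85, 89, 79, 85, 36, 36, 36, 36, 36
};
const unsigned char row90_cp850[16] = {
  69, 65, 65, 79, 79, 79, 85, 85, 89, 79, 85, 79, 36, 79, 158, 36
};
```

For this row the helper reports three differing entries, the first at offset 11 (code point 0x9B). The same function, run over all 256 entries, could also be used to verify whether the FreeDOS CP437 table and the MS-DOS default table are really identical.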

@boeckmann
Collaborator

So while the kernel seems to have some options for nice COUNTRY=1 sorting, MS COMMAND.COM prefers ignoring them.

Do you want SvarCOM to be bug-for-bug compatible? :-D

@boeckmann
Collaborator

MS COMMAND.COM prefers ignoring them

As an additional data point: 4DOS does not seem to respect country and code page at all for sorting. Just tried it with my current SvarDOS install. But it outputs its listing in lowercase by default, which fails on German umlauts :-)

@mateuszviste
Collaborator Author

So while the kernel seems to have some options for nice COUNTRY=1 sorting, MS COMMAND.COM prefers ignoring them.

Do you want SvarCOM to be bug-for-bug compatible? :-D

No, and this is why at first I was happy to keep NLS sorting for US codepages, despite Robert's complaints. :)
But then I made more tests and I realized that the (MS) kernel does not fail the INT21h/AX=6506h call when it does not have a proper collate table, and instead provides this "CP437" table as a fallback. Combined with the fact that COUNTRY=1 is used both for US and for "country unknown" situations, it is very easy to end up with a totally messed up sorting. Which is probably (I assume) the reason that MS COMMAND prefers to flatly ignore anything with country=1...

I'm not sure what to do on this, and would need to make more tests to compare how it works with the FreeDOS and EDR kernels. But for now, having no certainty I preferred to opt for following MS's cautious choice so I can push SvarCOM 2024.2 out. Then there will always be time to reconsider options.

4DOS does not seem to respect country and code page at all for sorting. Just tried it with my current SvarDOS

SvarDOS might not be a good test candidate, as it comes with a very limited COUNTRY.SYS, with no collation tables and no upcase tables. Maybe that's the reason 4DOS fails on the umlauts?

@boeckmann
Collaborator

SvarDOS might not be a good test candidate, as it comes with a very limited COUNTRY.SYS, with no collation tables and no upcase tables. Maybe that's the reason 4DOS fails on the umlauts?

It is SvarDOS using EDR and its COUNTRY.SYS I am running. I have not looked into the 4DOS source yet. But my assumption is that it simply does not make use of the INT21,65xx functions (at least for sorting).

Regarding the conversion to lower case, which leads to something like abÄd.txt in 4DOS dir output, I think it simply does the non NLS-enabled standard case conversion.

I noticed that the EDR country.sys has upcase conversion tables but no lower case tables. MS-DOS country.sys seems to have some lower case tables since 6.22 according to RBIL, but incomplete. This makes conversion to lower case harder than conversion to upper case, I think. Perhaps one can convert the upper case table to a lower case table? Should be possible if the mapping is bijective.

@boeckmann
Collaborator

I'm not sure what to do on this, and would need to make more tests to compare how it works with the FreeDOS and EDR kernels. But for now, having no certainty I preferred to opt for following MS's cautious choice so I can push SvarCOM 2024.2 out. Then there will always be time to reconsider options.

Better play safe 👍

@mateuszviste
Collaborator Author

Perhaps one can convert the upper case table to a lower case table? Should be possible if the mapping is bijective.

It is not: due to the space limitation of a single codepage, not all glyphs are available in both upper and lower case. For example, CP437 has the French "è" but not its upcase version, so the upcase conversion is "è -> E". The same situation happens with many other glyphs.
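
The problem shows up as soon as one tries to invert an upcase table. In the sketch below, invert_upcase() (a hypothetical helper, not part of any DOS kernel or COUNTRY.SYS tool) builds a lowercase table by reversing the upcase mapping and counts how many uppercase code points have more than one lowercase source. Which candidate wins in such a case is an arbitrary choice of this sketch: the lowest code point.

```c
#include <assert.h>
#include <string.h>

/* Invert an upcase table into a lowercase table.
 * Returns the number of uppercase code points that are the target of more
 * than one lowercase source, i.e. where the inversion is ambiguous. */
int invert_upcase(const unsigned char upper[256], unsigned char lower[256]) {
  int preimages[256];
  int i, ambiguous = 0;
  assert(upper != NULL && lower != NULL);
  memset(preimages, 0, sizeof preimages);
  for (i = 0; i < 256; i++) lower[i] = (unsigned char)i;
  for (i = 0; i < 256; i++) {
    unsigned char u = upper[i];
    if (u == i) continue; /* character is not changed by upcasing */
    if (preimages[u]++ == 0) lower[u] = (unsigned char)i; /* first one wins */
  }
  for (i = 0; i < 256; i++) {
    if (preimages[i] > 1) ambiguous++;
  }
  return ambiguous;
}
```

With an upcase table that maps both "e" (0x65) and "è" (0x8A) to "E", the function reports one ambiguous entry: lowercasing "E" can only give back one of the two, so a name containing "è" would not survive an upper/lower round trip.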

@mateuszviste
Collaborator Author

mateuszviste commented Feb 19, 2024

Checking for COUNTRY=1 and ignoring NLS sorting is a no-go after all, because the FreeDOS kernel returns an error "invalid function number" to the call INT 21h/AX=6501h (and that's the call I need to discover the current COUNTRY).

Hence the "if country==1 then ignore NLS" hack is not only ugly, but not possible anyway with SvarDOS' current default kernel. I will therefore remove this hack and we will have to live with the fact that DIR collation will be very weird for users that set a non-437 codepage but forget to set a proper COUNTRY setting.

@mateuszviste
Collaborator Author

PS. when compiled with "-DDIR_DUMPNLSCOLLATE", SvarCOM will show a dump of the NLS collate table on screen, on top of every DIR output. It is one line to uncomment in the makefile.

image

roytam1 pushed a commit to roytam1/SvarDOS that referenced this issue Feb 20, 2024
…bugz#68)

git-svn-id: svn://svn.svardos.org/svardos@1744 911cea91-c70f-4353-bd03-772f58fe8c9d
@mateuszviste
Collaborator Author

Closing this; for the time being I do not see any better approach than applying NLS sorting unconditionally. I believe it is the most elegant solution, even though it differs from MS-DOS' behavior.
