mcview: file interpreted as latin1 instead of utf8 #3783
Comments
Broke somewhere between 4.8.16 and 4.8.17. Running git bisect...
[4d65a73] is the first bad commit
mcview: refactoring of mcview_get_utf().
How much do you remember of this change? As far as I can see, your only intent was to swap the last (out) parameter and the return value, and otherwise you meant to leave the behavior unaltered. Am I right?
I'm not Andrew, but I think that this was indeed his only intent. I can't yet see what's wrong here, though, and I need to run to cook for tomorrow... :-/
I couldn't spot the bug either just by simply looking... will need to dig deeper... but not now :)
Replying to egmont:
Yes, you are. I wanted to unify the function prototypes: mcview_get_utf() with the other mcview_get_*() functions.
Cool, thanks. I like this intent, the new interface is definitely nicer. I'll try to find where it went wrong (unless someone is faster than me :))
... so, if only there were enough unit tests for this function... ;-) ...
Rather than (or in addition to) unit tests, it's the docs for a really nontrivial case that's missing ;)
4.8.16:
src/viewer/datasource.c, mcview_get_utf(), the last "if (res < 0)" branch contains:
str is a signed char* containing the lone invalid UTF-8 byte, e.g. in case of the copyright symbol it's -87. It's then assigned to the signed integer (/me wonders why not gunichar, nevermind) and then returned, so it's still -87.
Then in src/viewer/ascii.c mcview_display_line(), after the "Nonprintable, or lonely spacing mark" comment it is detected as nonprintable and hence replaced with a dot.
4.8.17:
That line was modified to:
which interprets the lone invalid UTF-8 byte as unsigned 169 and assigns it to the Unicode character; semantically, this is a Latin-1 -> Unicode conversion. The rest goes on as if this was read as a valid 1-byte character.
The pattern of denoting and carrying invalid UTF-8 bytes as negative numbers is really weird to me. I'm inclined to say that the old design worked "accidentally". At least I don't recall ever thinking about it during my viewer rewrite.
Removing the explicit cast fixes the bug, since it makes the new code equivalent to the old one. It's good enough for now I guess. Plus, I really think this weird design should be documented :)
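To make the signed-versus-unsigned difference concrete, here is a minimal sketch in C. It is not taken from the mc sources: the helper names `lone_byte_as_signed()`, `lone_byte_as_unsigned()` and `looks_invalid()` are hypothetical, and the "negative means invalid" check is only an assumption standing in for whatever mcview_display_line() really tests.

```c
/* Hypothetical illustration only; none of these helpers exist in mc. */

/* 4.8.16-like behavior: the byte is read through a signed char and widened
 * to int, so 0xA9 stays negative (-87 on the usual two's complement targets).
 * The explicit (signed char) keeps the sketch consistent on platforms where
 * plain char is unsigned. */
static int
lone_byte_as_signed (const char *str)
{
    return (signed char) str[0];
}

/* 4.8.17-like behavior: an explicit unsigned cast turns the same byte into
 * 169, which is exactly the Latin-1 -> Unicode mapping for U+00A9. */
static int
lone_byte_as_unsigned (const char *str)
{
    return (unsigned char) str[0];
}

/* Assumed stand-in for the display code: a negative value is treated as
 * "not a printable character" and would be shown as a dot. */
static int
looks_invalid (int ch)
{
    return ch < 0;
}
```

With the single byte `"\xA9"` as input, the first helper yields a value that `looks_invalid()` rejects, while the second yields 169, which passes for a valid character. This matches the observation that dropping the cast restores the old behavior.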
Important
This issue was migrated from Trac:
egmont (@egmontkob)
In a fully UTF-8 environment, create this file which contains some Latin-1 characters:
Verify that indeed simply sending out the file's contents to the terminal results in replacement symbols, as expected:
Now execute
and notice that the file's contents are displayed according to Latin-1, that is:
Press Alt-E to confirm that the codeset is indeed UTF-8.
This is incorrect: if UTF-8 is chosen, replacement symbols should be shown instead of the copyright and plus-minus signs.
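As a side note on why replacement symbols are the expected result: in Latin-1 the copyright sign is byte 0xA9 and the plus-minus sign is byte 0xB1, and both fall in the UTF-8 continuation range 0x80..0xBF, so neither can start a UTF-8 sequence. A minimal sketch in C of that check follows; `utf8_lead_len()` is a hypothetical helper written for this illustration, not an mc or GLib function.

```c
/* Hypothetical helper: return the length of the UTF-8 sequence that a byte
 * would start, or 0 if the byte cannot start a sequence at all. */
static int
utf8_lead_len (unsigned char b)
{
    if (b < 0x80)
        return 1;               /* ASCII */
    if (b < 0xC2)
        return 0;               /* continuation byte, or overlong lead 0xC0/0xC1 */
    if (b < 0xE0)
        return 2;
    if (b < 0xF0)
        return 3;
    if (b < 0xF5)
        return 4;
    return 0;                   /* 0xF5..0xFF never appear in UTF-8 */
}
```

Both `utf8_lead_len (0xA9)` and `utf8_lead_len (0xB1)` return 0, so a viewer in UTF-8 mode has no valid character to show for these bytes on their own.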
mcedit, as well as the standard panels (if filenames contain such bytes) don't suffer from this bug.
Note
Original attachments:
egmont (@egmontkob) on May 6, 2017 at 22:39 UTC