Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mcview: file interpreted as latin1 instead of utf8 #3783

Closed
mc-butler opened this issue Mar 7, 2017 · 16 comments
Closed

mcview: file interpreted as latin1 instead of utf8 #3783

mc-butler opened this issue Mar 7, 2017 · 16 comments
Assignees
Labels
area: mcview mcview, the built-in text editor prio: medium Has the potential to affect progress ver: 4.8.17 Reproducible in version 4.8.17
Milestone

Comments

@mc-butler
Copy link

Important

This issue was migrated from Trac:

Origin https://midnight-commander.org/ticket/3783
Reporter egmont (@egmontkob)

In a fully UTF-8 environment, create this file which contains some Latin-1 characters:

echo $'copyright \xA9 plusminus \xB1' > latin1.txt

Verify that indeed simply sending out the file's contents to the terminal results in replacement symbols, as expected:

$ cat latin1.txt
copyright � plusminus �
$ cat latin1.txt | iconv --from latin1 --to utf8
copyright © plusminus ±

Now execute

mcview latin1.txt

and notice that the file's contents are displayed according to Latin-1, that is:

copyright © plusminus ±

Press Alt-E to confirm that the codeset is indeed UTF-8.

This is incorrect, if UTF-8 is chosen then replacement symbols should be shown instead of the copyright and plus-minus signs.

mcedit, as well as the standard panels (if filenames contain such bytes) don't suffer from this bug.

Note

Original attachments:

@mc-butler
Copy link
Author

Changed by egmont (@egmontkob) on Mar 7, 2017 at 20:22 UTC (comment 1)

Broke somewhere between 4.8.16 and 4.8.17. Running git bisect...

@mc-butler
Copy link
Author

Changed by egmont (@egmontkob) on Mar 7, 2017 at 20:40 UTC (comment 2)

[4d65a73] is the first bad commit

mcview: refactoring of mcview_get_utf().

@mc-butler
Copy link
Author

Changed by egmont (@egmontkob) on Mar 7, 2017 at 20:45 UTC (comment 3)

@aborodin:

How much do you remember this change? As far as I can see, I believe your only intent was to swap the last (out) parameter and the return value, other than that you meant to leave the behavior unaltered, am I right?

@mc-butler
Copy link
Author

Changed by zaytsev (@zyv) on Mar 7, 2017 at 20:58 UTC (comment 4)

I'm not Andrew, but I think that this was indeed his only intent, only I can't yet see what's wrong here and I need to run to cook for tomorrow... :-/

@mc-butler
Copy link
Author

Changed by egmont (@egmontkob) on Mar 7, 2017 at 20:59 UTC (comment 5)

I couldn't spot the bug either just by simply looking... will need to dig deeper... but not now :)

@mc-butler
Copy link
Author

Changed by andrew_b (@aborodin) on Mar 8, 2017 at 4:36 UTC (comment 3.6)

Replying to egmont:

How much do you remember this change? As far as I can see, I believe your only intent was to swap the last (out) parameter and the return value, other than that you meant to leave the behavior unaltered, am I right?

Yes, you are. I wanted to make unification of function prototypes: mcview_get_utf() with other mcview_get_*().

@mc-butler
Copy link
Author

Changed by egmont (@egmontkob) on Mar 8, 2017 at 7:14 UTC (comment 7)

Cool, thanks. I like this intent, the new interface is definitely nicer. I'll try to find where it went wrong (unless someone is faster than me :))

@mc-butler
Copy link
Author

Changed by zaytsev (@zyv) on Mar 8, 2017 at 21:05 UTC (comment 8)

... so, if only there were enough unit tests for this function... ;-) ...

@mc-butler
Copy link
Author

Changed by egmont (@egmontkob) on Mar 9, 2017 at 0:27 UTC (comment 9)

Rather than (or in addition to) unit tests, it's the docs for a really nontrivial case that's missing ;)

4.8.16:

src/viewer/datasource.c, mcview_get_utf(), the last "if (res < 0)" branch contains:

ch = *str;

str is a signed char* containing the lone invalid UTF-8 byte, e.g. in case of the copyright symbol it's -87. It's then assigned to the signed integer (/me wonders why not gunichar, nevermind) and then returned, so it's still -87.

Then in src/viewer/ascii.c mcview_display_line(), after the "Nonprintable, or lonely spacing mark" comment it is detected as nonprintable and hence replaced with a dot.

4.8.17:

That line was modified to:

*ch = (unsigned char) (*str);

which interprets the lone invalid UTF-8 byte as unsigned 169 and is assigned to the Unicode character, this is semantically the Latin-1 -> Unicode conversion. The rest goes on as if this was read as a valid 1-byte character.

The pattern of denoting and carrying invalid UTF-8 bytes as negative numbers is really weird to me. I'm inclined to say that the old design worked "accidentally". At least I don't recall ever thinking about it during my viewer rewrite.

Removing the explicit cast fixes the bug, since it makes the new code equivalent to the old one. It's good enough for now I guess. Plus, I really think this weird design should be documented :)

@mc-butler
Copy link
Author

Changed by egmont (@egmontkob) on May 6, 2017 at 22:39 UTC

@mc-butler
Copy link
Author

Changed by egmont (@egmontkob) on May 6, 2017 at 22:40 UTC (comment 10)

  • Milestone changed from Future Releases to 4.8.20

@mc-butler
Copy link
Author

Changed by andrew_b (@aborodin) on May 7, 2017 at 17:56 UTC (comment 11)

  • Votes set to andrew_b
  • Owner set to andrew_b
  • Branch state changed from no branch to on review
  • Status changed from new to accepted
  • Version changed from 4.8.19 to 4.8.17

Branch: 3783_mcview_invalid_utf8
[5e238fa]

@mc-butler
Copy link
Author

Changed by andrew_b (@aborodin) on May 17, 2017 at 8:28 UTC (comment 12)

  • Branch state changed from on review to approved

@mc-butler
Copy link
Author

Changed by andrew_b (@aborodin) on May 17, 2017 at 8:30 UTC (comment 13)

  • Branch state changed from approved to merged
  • Resolution set to fixed
  • Status changed from accepted to testing
  • Votes changed from andrew_b to committed-master

Merged to master: [ed0ed4a].

@mc-butler
Copy link
Author

Changed by andrew_b (@aborodin) on May 17, 2017 at 8:31 UTC (comment 14)

  • Status changed from testing to closed

@mc-butler
Copy link
Author

Changed by zaytsev (@zyv) on Jul 26, 2017 at 5:24 UTC (comment 15)

Ticket #3844 has been marked as a duplicate of this ticket.

@mc-butler mc-butler marked this as a duplicate of #3844 Feb 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: mcview mcview, the built-in text editor prio: medium Has the potential to affect progress ver: 4.8.17 Reproducible in version 4.8.17
Development

No branches or pull requests

2 participants