New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

first try fixing utf8 issues #52

Closed
wants to merge 2 commits into
base: master
from

Conversation

Projects
None yet
4 participants
@ambs
Contributor

ambs commented Aug 19, 2016

This is a first try fixing a utf8 problem from #37.
This is not really a pull request, but a request for comments. This solves the issue, but will only work for perl >= 5.8.1

What happens is that when Perl reads the Makefile.PL, with utf8, it will store strings as utf8. Then, you will not be able to write them as bytes, unless you perform the right conversion.

btw, this is a Pull Request Challenge PR.

@karenetheridge

This comment has been minimized.

Show comment
Hide comment
@karenetheridge

karenetheridge Aug 19, 2016

Member

use utf8 is not necessary to call functions in the utf8 namespace -- its only purpose is to signal that the file should be parsed as utf8-encoded bytes.

utf8::is_utf8 does not do what you think it does. It only checks the internal representation of the perl string; it will not inform you whether the string is unicode octets vs. utf8-encoded bytes.

This code makes no sense:

my $test_string = 'Alberto Simões <ambs@cpan.org>';
utf8::decode($test_string);

utf8::decode on a string that is already in unicode characters will result in mojibake -- if it doesn't error out on encountering bytes that cannot exist in utf8.

Member

karenetheridge commented Aug 19, 2016

use utf8 is not necessary to call functions in the utf8 namespace -- its only purpose is to signal that the file should be parsed as utf8-encoded bytes.

utf8::is_utf8 does not do what you think it does. It only checks the internal representation of the perl string; it will not inform you whether the string is unicode octets vs. utf8-encoded bytes.

This code makes no sense:

my $test_string = 'Alberto Simões <ambs@cpan.org>';
utf8::decode($test_string);

utf8::decode on a string that is already in unicode characters will result in mojibake -- if it doesn't error out on encountering bytes that cannot exist in utf8.

@ambs

This comment has been minimized.

Show comment
Hide comment
@ambs

ambs Aug 20, 2016

Contributor

On 19/08/16 21:59, Karen Etheridge wrote:

This code makes no sense:

|my $test_string = 'Alberto Simões ambs@cpan.org';
utf8::decode($test_string); |

|utf8::decode| on a string that is already in unicode characters will
result in mojibake -- if it doesn't error out on encountering bytes that
cannot exist in utf8.

But it is not in utf8.
Note that the test script doesn't use utf8.

And you can't make it use utf8, or the Module::Install generated will be
wrote with bad utf8 encoding.

Contributor

ambs commented Aug 20, 2016

On 19/08/16 21:59, Karen Etheridge wrote:

This code makes no sense:

|my $test_string = 'Alberto Simões ambs@cpan.org';
utf8::decode($test_string); |

|utf8::decode| on a string that is already in unicode characters will
result in mojibake -- if it doesn't error out on encountering bytes that
cannot exist in utf8.

But it is not in utf8.
Note that the test script doesn't use utf8.

And you can't make it use utf8, or the Module::Install generated will be
wrote with bad utf8 encoding.

@ambs

This comment has been minimized.

Show comment
Hide comment
@ambs

ambs Aug 20, 2016

Contributor

Dear @karenetheridge, looks more understandable this way?

Contributor

ambs commented Aug 20, 2016

Dear @karenetheridge, looks more understandable this way?

@ambs

This comment has been minimized.

Show comment
Hide comment
@ambs

ambs Aug 20, 2016

Contributor

Also, added some code to check Perl version.
I think this could be a really solution now.

Contributor

ambs commented Aug 20, 2016

Also, added some code to check Perl version.
I think this could be a really solution now.

@Leont

This comment has been minimized.

Show comment
Hide comment
@Leont

Leont Aug 22, 2016

Member

utf8::is_utf8 does not do what you think it does. It only checks the internal representation of the perl string; it will not inform you whether the string is unicode octets vs. utf8-encoded bytes.

Indeed, it is incorrect.

Member

Leont commented Aug 22, 2016

utf8::is_utf8 does not do what you think it does. It only checks the internal representation of the perl string; it will not inform you whether the string is unicode octets vs. utf8-encoded bytes.

Indeed, it is incorrect.

@miyagawa

This comment has been minimized.

Show comment
Hide comment
@miyagawa

miyagawa Aug 22, 2016

Member

The correct fix would be just call utf8::encode without checking the utf8 flag. It would break existing Makefile.PL that sets utf-8 byte string, but then they can fix it by adding use utf8; in there.

utf8 was shipped with 5.6.0 and there's no problem requiring that, given Module::Install's lowest supported perl is also 5.6.0.

Member

miyagawa commented Aug 22, 2016

The correct fix would be just call utf8::encode without checking the utf8 flag. It would break existing Makefile.PL that sets utf-8 byte string, but then they can fix it by adding use utf8; in there.

utf8 was shipped with 5.6.0 and there's no problem requiring that, given Module::Install's lowest supported perl is also 5.6.0.

@miyagawa

This comment has been minimized.

Show comment
Hide comment
@miyagawa

miyagawa Aug 22, 2016

Member

btw #37 sounds like a different problem then - it should decode when reading the =head1 AUTHOR section with author_from directive, when combined with the utf8::encode added like this.

Member

miyagawa commented Aug 22, 2016

btw #37 sounds like a different problem then - it should decode when reading the =head1 AUTHOR section with author_from directive, when combined with the utf8::encode added like this.

@ambs

This comment has been minimized.

Show comment
Hide comment
@ambs

ambs Aug 22, 2016

Contributor

@miyagawa, removing the utf8 check makes t/20_authors_with_special_characters.t fail.
Will try to understand how that can be fixed. Probably, changing that test... to use utf8

Contributor

ambs commented Aug 22, 2016

@miyagawa, removing the utf8 check makes t/20_authors_with_special_characters.t fail.
Will try to understand how that can be fixed. Probably, changing that test... to use utf8

@miyagawa

This comment has been minimized.

Show comment
Hide comment
@miyagawa

miyagawa Aug 22, 2016

Member

Probably, changing that test... to use utf8

Haven't looked at the test but that sounds right. This can't essentially be done without a breaking change...

Member

miyagawa commented Aug 22, 2016

Probably, changing that test... to use utf8

Haven't looked at the test but that sounds right. This can't essentially be done without a breaking change...

@Leont

This comment has been minimized.

Show comment
Hide comment
@Leont

Leont Aug 22, 2016

Member

The only sane way to use unicode in perl is to handle the data consistently as either bytes or characters. Anything else will set everything on fire.

And while I generally consider threating strings as characters the more sensible option, the path of least resistance here is probably threating everything as bytes.

Member

Leont commented Aug 22, 2016

The only sane way to use unicode in perl is to handle the data consistently as either bytes or characters. Anything else will set everything on fire.

And while I generally consider threating strings as characters the more sensible option, the path of least resistance here is probably threating everything as bytes.

@ambs

This comment has been minimized.

Show comment
Hide comment
@ambs

ambs Aug 22, 2016

Contributor

How would this change affect M::I users? I am never sure what the encode/decode stuff will affect the string.

Contributor

ambs commented Aug 22, 2016

How would this change affect M::I users? I am never sure what the encode/decode stuff will affect the string.

@ambs

This comment has been minimized.

Show comment
Hide comment
@ambs

ambs Aug 22, 2016

Contributor

@Leont you say that as your name doesn't have accented characters 😺

Contributor

ambs commented Aug 22, 2016

@Leont you say that as your name doesn't have accented characters 😺

@karenetheridge

This comment has been minimized.

Show comment
Hide comment
@karenetheridge

karenetheridge Aug 22, 2016

Member

We've discussed in CPAN-Meta tickets that it's more important to avoid breaking installations than to corrupt an author or contributor's name. However, different rules can apply in MI since the code is bundled with the distribution, so if breakage occurs, it's all on the author side, and they are in a position to fix their Makefile.PL before shipping.

Given this, I think it would be acceptable to die with "invalid character encoding" errors and force the author to fix their code (especially if it's easy to do so, and we have something in the documentation explaining how) when they upgrade to the latest MI.

...especially since we don't really want authors to continue to use MI anyway.. :)

Member

karenetheridge commented Aug 22, 2016

We've discussed in CPAN-Meta tickets that it's more important to avoid breaking installations than to corrupt an author or contributor's name. However, different rules can apply in MI since the code is bundled with the distribution, so if breakage occurs, it's all on the author side, and they are in a position to fix their Makefile.PL before shipping.

Given this, I think it would be acceptable to die with "invalid character encoding" errors and force the author to fix their code (especially if it's easy to do so, and we have something in the documentation explaining how) when they upgrade to the latest MI.

...especially since we don't really want authors to continue to use MI anyway.. :)

@Leont

This comment has been minimized.

Show comment
Hide comment
@Leont

Leont Aug 22, 2016

Member

To summarize the current situation:

  1. The file gets read a bytes
  2. Module::Install/YAML::Tiny interprets that characters and returns characters.
  3. The file gets written to a binary handle, so it gets silently downgraded to latin1 if needed.

These mismatches cause mojibake.

The first suggested solution was essentially "sometimes we interpret the output as bytes, sometimes we don't", which is a path to madness.

The second suggested solution is to treat all files (on input and on output) as UTF-8. Which will actually work as long as those conditions are true, but I bet it often isn't.

The sensible solution is a combination of:

  1. read input correctly, e.g. decode input data from pod if =encoding <whatever> is set.
  2. write output correctly, e.g. encode the YAML string to bytes.
Member

Leont commented Aug 22, 2016

To summarize the current situation:

  1. The file gets read a bytes
  2. Module::Install/YAML::Tiny interprets that characters and returns characters.
  3. The file gets written to a binary handle, so it gets silently downgraded to latin1 if needed.

These mismatches cause mojibake.

The first suggested solution was essentially "sometimes we interpret the output as bytes, sometimes we don't", which is a path to madness.

The second suggested solution is to treat all files (on input and on output) as UTF-8. Which will actually work as long as those conditions are true, but I bet it often isn't.

The sensible solution is a combination of:

  1. read input correctly, e.g. decode input data from pod if =encoding <whatever> is set.
  2. write output correctly, e.g. encode the YAML string to bytes.
@miyagawa

This comment has been minimized.

Show comment
Hide comment
@miyagawa

miyagawa Aug 22, 2016

Member

The sensible solution is a combination of:

Yes, that was my suggestion. I forgot that the author names are often read from file with all_from, rather I was thinking about literals in Makefile.PL first.

Member

miyagawa commented Aug 22, 2016

The sensible solution is a combination of:

Yes, that was my suggestion. I forgot that the author names are often read from file with all_from, rather I was thinking about literals in Makefile.PL first.

@ambs

This comment has been minimized.

Show comment
Hide comment
@ambs

ambs Aug 22, 2016

Contributor

If I read it correctly, for M::I the solution is keep the code "quiet" as it was?

Contributor

ambs commented Aug 22, 2016

If I read it correctly, for M::I the solution is keep the code "quiet" as it was?

@Leont

This comment has been minimized.

Show comment
Hide comment
@Leont

Leont Aug 22, 2016

Member

If I read it correctly, for M::I the solution is keep the code "quiet" as it was?

  1. In Module::Install::Metadata::authors_from (and possibly other functions) the data should be appropriately decoded before being stored in the object.
  2. In Module::Install::Admin::Metadata::write_meta the data should be encoded before writing it.

This should fix #37 and probably other similar issues with characters outside the latin1 range, without risking any breakage elsewhere involving these read/write functions.

Member

Leont commented Aug 22, 2016

If I read it correctly, for M::I the solution is keep the code "quiet" as it was?

  1. In Module::Install::Metadata::authors_from (and possibly other functions) the data should be appropriately decoded before being stored in the object.
  2. In Module::Install::Admin::Metadata::write_meta the data should be encoded before writing it.

This should fix #37 and probably other similar issues with characters outside the latin1 range, without risking any breakage elsewhere involving these read/write functions.

@ambs

This comment has been minimized.

Show comment
Hide comment
@ambs

ambs Aug 22, 2016

Contributor

OK, will try to see if I understood it.

Contributor

ambs commented Aug 22, 2016

OK, will try to see if I understood it.

@ambs

This comment has been minimized.

Show comment
Hide comment
@ambs

ambs Aug 22, 2016

Contributor

Something like #55?
I can't understand why with two 'encode' it works, but with one encode and one decode it doesn't.
Almost quitting this task 😞 hate unicode!

Contributor

ambs commented Aug 22, 2016

Something like #55?
I can't understand why with two 'encode' it works, but with one encode and one decode it doesn't.
Almost quitting this task 😞 hate unicode!

@ambs ambs closed this Aug 23, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment