handling encodings #2

karenetheridge · 2014-03-16T07:45:53Z

Here is my (very early in the morning) assessment about what we should do regarding scanning the file and encodings (related issue: https://rt.cpan.org/Ticket/Display.html?id=78434)

We need different interfaces for new-from-string, new-from-file, and new-from-handle. new-from-string will assume decoded characters. new-from-file takes an encoding parameter, and uses binmode(":encoding($encoding)") on the handle. new-from-handle assumes that the caller has already set encoding layers appropriately.
new-from-file may not have been provided an encoding parameter. In this case we start reading assuming ASCII. If we encounter a 'use utf8;' directive, we back up and start over with the correct encoding applied. We will also flag a warning (error?) if we encounter a 'use utf8' directive when we were explicitly passed an encoding parameter other than 'utf8' (UTF-8?).
Pod can have a different encoding than the code. When we start reading pod without an encoding already supplied (can we assume that new-from-string is decoded and therefore exempt ourselves from this step?) and we encounter an '=encoding ...' directive, we decode the extracted pod lines as they are read. Test cases need to consider corpus examples where the code is in a different encoding than the (interleaved) pod (can this happen in any situation where the code is not ASCII, which is utf8-compatible?)
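
The three entry points proposed above could look roughly like the following. This is only a sketch of the division of responsibility: the helper names (lines_from_string, lines_from_file, lines_from_handle) are hypothetical and are not part of the module's interface, and the no-encoding case simply reads raw bytes rather than implementing the "start over" logic.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; none of these names exist in the module.

# new-from-handle: the caller has already applied any encoding layers.
sub lines_from_handle {
    my ($fh) = @_;
    my @lines = <$fh>;
    return \@lines;
}

# new-from-file: open the file and apply the requested encoding layer.
# Without an encoding parameter the handle is left as raw bytes.
sub lines_from_file {
    my ($path, $encoding) = @_;
    open(my $fh, '<', $path) or die "cannot open $path: $!";
    binmode($fh, defined($encoding) ? ":encoding($encoding)" : ':raw');
    return lines_from_handle($fh);
}

# new-from-string: the string is assumed to hold decoded characters already,
# so it is only split into lines (keeping the trailing newlines).
sub lines_from_string {
    my ($string) = @_;
    return [ split /(?<=\n)/, $string ];
}
```

With this split, only lines_from_file ever needs to know an encoding name; the other two trust their caller.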

Am I missing anything?

karenetheridge · 2014-03-16T07:51:50Z

Note that at the moment, the current interfaces are new_from_file, new_from_handle, and new_from_module (which uses $PERL5LIB and assumes the usual module-to-filename semantics), and they are totally ignorant of encodings. Tools such as Dist::Zilla that are encoding-aware have difficulty working with these interfaces: the only one they can really use is new_from_handle, setting the encoding layer themselves, but this is not convenient if the file is in memory already (as scalarrefs can't be opened on wide characters). So in this case, ->new_from_string($file->decoded_content) would be nicest.
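
To illustrate the in-memory problem: on recent perls, opening a handle on a scalar that contains code points above 0xFF is expected to fail, so a caller holding decoded content has to re-encode it first. The sketch below shows that workaround; handle_for_decoded_string is a hypothetical helper name, and the exact behaviour of the direct open may differ on older perls.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Decoded source text containing a character above 0xFF (U+263A).
my $decoded = "my \$smiley = '\x{263A}';\n1;\n";

# Direct attempt: in-memory handles model files of bytes, so this open is
# expected to fail on recent perls (the result is reported, not assumed).
my $direct_ok = do {
    no warnings 'utf8';
    open(my $direct, '<', \$decoded) ? 1 : 0;
};
print "direct open ", ($direct_ok ? "succeeded" : "failed"), "\n";

# Workaround (hypothetical helper): encode to UTF-8 bytes, then open an
# in-memory handle with a matching decoding layer.
sub handle_for_decoded_string {
    my ($string) = @_;
    my $bytes = encode('UTF-8', $string);
    open(my $fh, '<:encoding(UTF-8)', \$bytes)
        or die "cannot open in-memory handle: $!";
    return $fh;
}

my $fh    = handle_for_decoded_string($decoded);
my @lines = <$fh>;
print "read ", scalar(@lines), " lines through the re-encoded handle\n";
```

The round trip through bytes is exactly the inconvenience a new_from_string interface would remove.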

Also - we probably don't need the back-up-and-start-reading-again logic, as (needs to be verified?) Perl doesn't even do this, and we should be able to assume that anything before use utf8 or =encoding utf8 is valid ascii. (Does this also hold for UTF-16?)

dagolden · 2014-03-16T09:23:18Z

On Sun, Mar 16, 2014 at 8:45 AM, Karen Etheridge
notifications@github.com wrote:

> We need different interfaces for new-from-string, new-from-file, and
> new-from-handle. new-from-string will assume decoded characters.
> new-from-file takes an encoding parameter, and uses
> binmode(":encoding($encoding)") on the handle. new-from-handle assumes that
> the caller has already set encoding layers appropriately.

I agree except for from-file, which I comment on later.

> new-from-file may not have been provided an encoding parameter. In
> this case we start reading assuming ASCII. If we encounter a 'use utf8;'
> directive, we back up and start over with the correct encoding applied. We
> will also flag a warning (error?) if we encounter a 'use utf8' directive
> when we were explicitly passed an encoding parameter other than
> 'utf8' (UTF-8?).

How about slurping it raw fully and then working with the resulting data
rather than "backing up"? I also wonder whether we can reliably detect
"use utf8;" without implementing a subset of perl's parsing rules. E.g.:

# don't use utf8 because blah blah blah
$foo = <<HERE;
always use utf8 when opening handles
HERE

We also have to consider a pod encoding directive in absence of "use utf8"
as indicative.
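
The detection problem can be made concrete with two regex-based checks. Both function names below are hypothetical and neither pattern is proposed as a solution: the loose one is fooled by the example above, and the stricter line-anchored one is still fooled by a heredoc whose body happens to contain a 'use utf8;' line.

```perl
use strict;
use warnings;

# Loose check: the phrase anywhere in the text, including comments and heredocs.
sub mentions_use_utf8 {
    my ($source) = @_;
    return $source =~ /\buse\s+utf8\b/ ? 1 : 0;
}

# Stricter check: only a 'use utf8;' statement at the start of a line counts.
sub has_use_utf8_statement {
    my ($source) = @_;
    return $source =~ /^[ \t]*use\s+utf8\s*;/m ? 1 : 0;
}

# The example from the comment above: no real 'use utf8;' statement.
my $example = <<'END_SOURCE';
# don't use utf8 because blah blah blah
$foo = <<HERE;
always use utf8 when opening handles
HERE
END_SOURCE

# A heredoc body that looks exactly like the statement.
my $tricky = <<'END_SOURCE';
$foo = <<HERE;
use utf8;
HERE
END_SOURCE

printf "loose on example:  %d\n", mentions_use_utf8($example);       # 1 (false positive)
printf "strict on example: %d\n", has_use_utf8_statement($example);  # 0
printf "strict on tricky:  %d\n", has_use_utf8_statement($tricky);   # 1 (false positive)
```

Telling the heredoc case apart from a real statement requires tracking heredoc terminators, which is the "subset of perl's parsing rules" concern.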

> Pod can have a different encoding than the code. When we start reading
> pod without an encoding already supplied (can we assume that
> new-from-string is decoded and therefore exempt ourselves from this step?)
> and we encounter an '=encoding ...' directive, we decode the extracted pod
> lines as they are read. Test cases need to consider corpus examples where
> the code is in a different encoding than the (interleaved) pod (can this
> happen in any situation where the code is not ASCII, which is
> utf8-compatible?)

I suspect that in practice having intentionally different encoding in
code and pod portions is unlikely. I would go further and say that p5p
should make it an error.

rjbs · 2014-03-18T02:12:24Z

I agree with the last bit (which is not to say I disagree with anything else): we should document that =encoding is provided so that a Pod parser can be told the file's encoding, and that it should be the file's encoding, not an encoding applied only to the Pod sections. That'd just be crazy.
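
Under that reading, where =encoding names the encoding of the whole file, the slurp-then-decode approach suggested earlier becomes short. The following is a minimal sketch with hypothetical function names; it assumes the raw bytes are in an ASCII-compatible encoding (so the directive itself can be matched before decoding) and leaves files without a directive undecoded.

```perl
use strict;
use warnings;
use Encode ();

# Return the name given by the first '=encoding' directive, or undef.
# The match runs on raw bytes, which assumes an ASCII-compatible encoding.
sub pod_encoding {
    my ($bytes) = @_;
    return $bytes =~ /^=encoding[ \t]+(\S+)/m ? $1 : undef;
}

# Decode the whole file (code and Pod alike) with the declared encoding.
# Without a directive, the bytes are returned unchanged.
sub decode_whole_file {
    my ($bytes) = @_;
    my $encoding = pod_encoding($bytes);
    return $bytes unless defined $encoding;
    return Encode::decode($encoding, $bytes);
}

# Raw bytes as they would come from slurping a file with a ':raw' layer.
my $raw = "my \$s = 'na\xc3\xafve';\n\n=encoding utf8\n\n=head1 NAME\n\ncaf\xc3\xa9\n\n=cut\n";
my $text = decode_whole_file($raw);
printf "declared encoding: %s, %d bytes became %d characters\n",
    pod_encoding($raw), length($raw), length($text);
```

Note that the code before the directive is decoded too, which is the point of treating =encoding as a statement about the file rather than about the Pod sections.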

SineSwiper · 2014-03-28T03:45:18Z

Various meta pragmas "use utf8" indirectly, so trying to parse through Perl code to see if it's actually using it seems plagued with pitfalls. Using something like Encode::Guess might be a better bet, but user-specified encoding trumps all of that. If a user says it's Latin1, then we should consider it to be Latin1, and not question that decision unless Perl just plain can't read it.
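
A sketch of that precedence, using the core Encode and Encode::Guess modules: an explicit encoding is applied strictly and never second-guessed, and guessing only happens when nothing was specified. The function name decode_source is hypothetical, and with its default suspects Encode::Guess only distinguishes ASCII, UTF-8, and BOM-marked UTF-16/32, so something like Latin-1 still has to be stated by the user.

```perl
use strict;
use warnings;
use Encode ();
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper: decode raw source bytes into characters.
sub decode_source {
    my ($bytes, $encoding) = @_;
    return '' unless defined $bytes and length $bytes;

    # A user-specified encoding trumps everything; fail loudly if the
    # bytes cannot be read that way.
    if (defined $encoding) {
        return Encode::decode($encoding, $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    }

    # Otherwise guess. guess_encoding() returns an encoding object on
    # success and a plain error string on failure.
    my $guess = guess_encoding($bytes);
    die "cannot guess encoding: $guess\n" unless ref $guess;
    return $guess->decode($bytes);
}

# Usage: UTF-8 bytes are guessed, Latin-1 bytes need the explicit name.
my $from_utf8   = decode_source("caf\xc3\xa9\n");
my $from_latin1 = decode_source("caf\xe9\n", 'iso-8859-1');
print "both decode to the same text\n" if $from_utf8 eq $from_latin1;
```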

karenetheridge · 2014-03-28T15:29:27Z

On Thu, Mar 27, 2014 at 08:45:18PM -0700, Brendan Byrd wrote:

> Various meta pragmas "use utf8" indirectly, so trying to parse through Perl code to see if it's actually using it seems plagued with pitfalls. Using something like Encode::Guess might be a better bet, but user-specified encoding trumps all of that. If a user says it's Latin1, then we should consider it to be Latin1, and not question that decision unless Perl just plain can't read it.

Yes, I think this comes back to "simplicity trumps all", especially since
this module needs to be 1. core only, and 2. fast in the most common cases.

karenetheridge self-assigned this Mar 20, 2015

karenetheridge added the enhancement label Mar 20, 2015
