
Add check that "$!" is correctly interpreted as UTF-8

We sometimes need to know if an error message is UTF-8 or not.
Previously we checked that it is syntactically valid UTF-8, and that the
LC_MESSAGES locale is UTF-8.  But some systems, notably Windows, do not
have LC_MESSAGES.  For those, this commit adds a different, semantic,
check that the text of the message when interpreted as UTF-8 is all in
the same Unicode script.  This is not foolproof, unlike the LC_MESSAGES
check, but it's better than what we have now for such systems.  It
likely is foolproof for non-Latin locales, as any message will have a
bunch of characters in that locale's script, and no ASCII Latin ones.
For a Latin locale, ASCII letters could be intermixed with the
non-ASCII UTF-8 ones, causing potential ambiguity.
khwilliamson committed Jan 4, 2018
1 parent ee5191f commit c4004986bd8a62c2ed10d9aedba9c0f87e3eb35c
Showing with 9 additions and 3 deletions.
  1. +9 −3 mg.c
12 mg.c
@@ -818,9 +818,9 @@ S_fixup_errno_string(pTHX_ SV* sv)
  * avoid as many possible backward compatibility issues as possible, we
  * don't turn on the flag unless we have to. So the flag stays off for
  * an entirely invariant string. We assume that if the string looks
- * like UTF-8, it really is UTF-8: "text in any other encoding that
- * uses bytes with the high bit set is extremely unlikely to pass a
- * UTF-8 validity test"
+ * like UTF-8 in a single script, it really is UTF-8: "text in any
+ * other encoding that uses bytes with the high bit set is extremely
+ * unlikely to pass a UTF-8 validity test"
  * (http://en.wikipedia.org/wiki/Charset_detection). There is a
  * potential that we will get it wrong however, especially on short
  * error message text, so do an additional check. */
@@ -831,6 +831,12 @@ S_fixup_errno_string(pTHX_ SV* sv)
         && _is_cur_LC_category_utf8(LC_MESSAGES)
+#else   /* If can't check directly, at least can see if script is consistent,
+           under UTF-8, which gives us an extra measure of confidence. */
+        && isSCRIPT_RUN((const U8 *) SvPVX_const(sv), (U8 *) SvEND(sv),
+                        TRUE, /* Means assume UTF-8 */
+                        NULL)
 #endif
     ) {
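
The idea behind the new #else branch can be illustrated outside of Perl. What follows is a minimal standalone sketch, not the code behind isSCRIPT_RUN: the names script_of, decode_utf8 and looks_like_single_script_utf8 are hypothetical, and the script table is an assumption that covers only rough Latin, Greek and Cyrillic ranges and lumps everything else into one bucket. It accepts a byte string only if it is non-ASCII, well-formed UTF-8, and all of its letters fall in one script.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, deliberately tiny script table. Real Unicode script data
 * is far larger; SCR_OTHER lumps every unlisted script together. */
typedef enum { SCR_COMMON, SCR_LATIN, SCR_GREEK, SCR_CYRILLIC, SCR_OTHER } script_t;

static script_t script_of(uint32_t cp)
{
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')) return SCR_LATIN;
    /* Remaining ASCII, Latin-1 symbols, and the x / division signs. */
    if (cp < 0xC0 || cp == 0xD7 || cp == 0xF7) return SCR_COMMON;
    if (cp <= 0x024F) return SCR_LATIN;
    if (cp >= 0x0370 && cp <= 0x03FF) return SCR_GREEK;
    if (cp >= 0x0400 && cp <= 0x04FF) return SCR_CYRILLIC;
    return SCR_OTHER;
}

/* Decode one UTF-8 sequence starting at s (end is one past the last byte).
 * Returns the sequence length in bytes, or 0 if it is malformed. */
static size_t decode_utf8(const unsigned char *s, const unsigned char *end,
                          uint32_t *cp)
{
    size_t len, i;
    uint32_t c = s[0], min;

    if (c < 0x80)                { *cp = c; return 1; }
    else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; min = 0x10000; }
    else return 0;                          /* stray continuation or bad lead */

    if ((size_t)(end - s) < len) return 0;  /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlongs, surrogates, and values above U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}

/* True only if the n bytes at str contain at least one non-ASCII character,
 * are well-formed UTF-8, and every non-Common character is in one script. */
bool looks_like_single_script_utf8(const char *str, size_t n)
{
    const unsigned char *s, *end;
    script_t seen = SCR_COMMON;
    bool high_bit = false;

    assert(str != NULL);
    s = (const unsigned char *) str;
    end = s + n;

    while (s < end) {
        uint32_t cp;
        script_t sc;
        size_t len = decode_utf8(s, end, &cp);

        if (len == 0) return false;         /* not valid UTF-8 */
        if (cp >= 0x80) high_bit = true;
        sc = script_of(cp);
        if (sc != SCR_COMMON) {
            if (seen == SCR_COMMON) seen = sc;
            else if (sc != seen) return false;  /* scripts are mixed */
        }
        s += len;
    }
    return high_bit;                        /* pure ASCII: nothing to flag */
}
```

In this sketch the bytes "a\xCE\xB1" are valid UTF-8 but mix a Latin letter with a Greek one, so they are rejected, whereas "caf\xC3\xA9" is accepted. The accepted case also shows the Latin ambiguity mentioned in the commit message: read as ISO-8859-1, those same bytes are still all Latin letters and symbols, so a script test alone cannot tell the two readings apart.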
