locale.c: Revamp fallback detection of UTF-8 locales
This commit continues the process started in the previous few commits to
improve the detection of whether a locale is UTF-8 or not when the
platform doesn't have the more modern tools available.

Previously, the code examined various texts in a locale, such as the
names of the days of the week, to see whether they were legal UTF-8.  If
any contained variant (non-ASCII) bytes, and all of them were legal
UTF-8, it assumed the locale was UTF-8.  If none contained variant bytes
(as in American English), it fell back to looking at the locale's name.
This approach yields both false negatives and false positives.
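
The earlier heuristic can be sketched roughly as follows.  This is only
an illustration, not the code in locale.c: the function name
is_wellformed_utf8 and its variant counter are assumptions made for the
example.  It checks that a byte string is syntactically valid UTF-8 and
counts how many variant (non-ASCII) characters it contains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if s[0..len-1] is well-formed UTF-8.
 * On success, *variants (if non-NULL) receives the number of non-ASCII
 * characters seen. */
static bool is_wellformed_utf8(const unsigned char *s, size_t len,
                               size_t *variants)
{
    size_t i = 0, n_variants = 0;

    assert(s != NULL);
    while (i < len) {
        unsigned char b = s[i];
        size_t need;            /* continuation bytes required */
        uint32_t cp, min;       /* decoded value; smallest non-overlong value */

        if (b < 0x80) { i++; continue; }                /* ASCII */
        if      ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return false;      /* stray continuation or invalid lead byte */

        if (len - i <= need) return false;              /* truncated */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* Reject overlongs, surrogates, and values above U+10FFFF. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        n_variants++;
        i += need + 1;
    }
    if (variants) *variants = n_variants;
    return true;
}
```

Under the old rule, a locale whose texts all pass this check with at
least one variant would be treated as UTF-8, while a count of zero
would defer to the locale's name.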

This commit adds the constraint that all the texts need to be in the
same script when interpreted as UTF-8, which essentially rules out
false positives when the script isn't Latin.  With Latin, it isn't so
clear cut: the text can mix ASCII Latin letters with variant
(non-ASCII) sequences, and those bytes could belong to some single-byte
Latin locale and just coincidentally happen to be syntactically valid
UTF-8.  Because UTF-8 is so structured, the odds of such a coincidence
go down as the number of variants in a row goes up.  A coincidence also
isn't likely with ISO 8859-1, as the bytes that could be legal
continuations in UTF-8 are almost entirely controls or punctuation
there.  But in other locales in the 8859 series, some legal
continuation bytes are letters that could be part of a month name, say.

As an example of the issues, in 8859-2 one could have \xC6 (C with
acute) followed by \xB1 (a with ogonek), which as UTF-8 would be
U+01B1: LATIN CAPITAL LETTER UPSILON.  That is still a Latin-script
character, so the script check alone would not flag it.  However,
something like \xCD (I with acute) followed by \xB3 (l with stroke)
yields U+0373: GREEK SMALL LETTER ARCHAIC SAMPI, and the script check
added by this commit would catch that.  In non-Latin texts, the only
permissible ASCII characters would be punctuation, and you aren't going
to have many of those in the LC_TIME strings, and certainly not in a
row.  Instead, those strings will consist of at least several variant
characters in a row, and the odds of those coincidentally being both
syntactically valid UTF-8 and semantically in the same script are
exceedingly low.
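
The two byte pairs above can be worked through with a small sketch.
The names decode_2byte and coarse_script_of are hypothetical, and the
script test uses Unicode block ranges as a rough stand-in for real
script data, so treat it as an illustration of the idea rather than the
check this commit implements.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical coarse tags; real code would consult Unicode script data. */
typedef enum { SCRIPT_LATIN, SCRIPT_GREEK, SCRIPT_OTHER } coarse_script;

/* Decode a two-byte UTF-8 sequence.  Returns 0 if the bytes do not form
 * a well-formed two-byte sequence. */
static uint32_t decode_2byte(unsigned char lead, unsigned char cont)
{
    if (lead < 0xC2 || lead > 0xDF) return 0;   /* 0xC0 and 0xC1 are overlong */
    if ((cont & 0xC0) != 0x80) return 0;        /* not a continuation byte */
    return ((uint32_t)(lead & 0x1F) << 6) | (uint32_t)(cont & 0x3F);
}

/* Approximate the script from block ranges only (an assumption made for
 * brevity; blocks and scripts do not coincide exactly). */
static coarse_script coarse_script_of(uint32_t cp)
{
    if ((cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z'))
        return SCRIPT_LATIN;
    if (cp >= 0x00C0 && cp <= 0x024F)           /* Latin-1 letters, Ext-A/B */
        return SCRIPT_LATIN;
    if (cp >= 0x0370 && cp <= 0x03FF)           /* Greek and Coptic block */
        return SCRIPT_GREEK;
    return SCRIPT_OTHER;
}
```

With these helpers, decode_2byte(0xC6, 0xB1) gives 0x01B1, which the
coarse test treats as Latin, and decode_2byte(0xCD, 0xB3) gives 0x0373,
which it treats as Greek.  Only the second would be inconsistent with
surrounding Latin text.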

To catch Latin UTF-8 locales, this commit adds a list of the distinct
variants found so far.  If there are even just several of these, the
odds of the syntax being coincidentally UTF-8 greatly diminish.  The
number needed to conclude that the locale is UTF-8 is easily tweakable
at compile time.
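
One way to picture the distinct-variant list is the sketch below.  The
macro name DISTINCT_VARIANTS_NEEDED, its default of 3, the fixed
capacity, and the variant_set type are all assumptions made for
illustration.  The commit message only says that "several" are needed
and that the number can be changed at compile time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef DISTINCT_VARIANTS_NEEDED    /* hypothetical name; override with -D */
#  define DISTINCT_VARIANTS_NEEDED 3
#endif
#define MAX_TRACKED 16              /* arbitrary capacity for the sketch */

typedef struct {
    uint32_t seen[MAX_TRACKED];     /* distinct non-ASCII code points so far */
    size_t   count;
} variant_set;

/* Record cp if it is non-ASCII and has not been recorded already. */
static void variant_set_add(variant_set *vs, uint32_t cp)
{
    assert(vs != NULL);
    if (cp < 0x80) return;                      /* ASCII is not a variant */
    for (size_t i = 0; i < vs->count; i++)
        if (vs->seen[i] == cp) return;          /* already known */
    if (vs->count < MAX_TRACKED)
        vs->seen[vs->count++] = cp;
}

/* True once enough distinct variants have been seen. */
static bool enough_variants(const variant_set *vs)
{
    assert(vs != NULL);
    return vs->count >= DISTINCT_VARIANTS_NEEDED;
}
```

Counting distinct code points rather than occurrences means a single
accented letter repeated across many strings does not by itself reach
the threshold.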

The problem remains for English and other Latin-script languages in
which accented characters are rare.  For those, the locale name is
still examined to see whether it contains "UTF-8".  Note that previous
commits have guaranteed that if the locale has a non-ASCII currency
symbol that is recognized by Unicode, such as the Euro or Pound
Sterling, the locale will correctly be recognized as UTF-8.
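
The name-based fallback amounts to a substring test along these lines.
This is a minimal sketch with a hypothetical function name.  It matches
only the exact spelling "UTF-8", so it would miss spellings such as
"utf8" that a more forgiving implementation might accept.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: does the locale name contain the text "UTF-8"? */
static bool name_mentions_utf8(const char *locale_name)
{
    assert(locale_name != NULL);
    return strstr(locale_name, "UTF-8") != NULL;
}
```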
khwilliamson committed Jan 18, 2018
1 parent 8983edc commit 8b4104f
Showing 4 changed files with 351 additions and 136 deletions.
6 changes: 4 additions & 2 deletions embed.fnc
@@ -1336,7 +1336,7 @@ Ap |void |regdump |NN const regexp* r
ApM |SV* |regclass_swash |NULLOK const regexp *prog \
|NN const struct regnode *node|bool doinit \
|NULLOK SV **listsvp|NULLOK SV **altsvp
-#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_PERL_C) || defined(PERL_IN_UTF8_C)
+#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_PERL_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_LOCALE_C)
EXpR |SV* |_new_invlist_C_array|NN const UV* const list
EXMp |bool |_invlistEQ |NN SV* const a|NN SV* const b|const bool complement_b
#endif
@@ -1734,12 +1734,14 @@ EXp |SV* |_core_swash_init|NN const char* pkg|NN const char* name \
|NN SV* listsv|I32 minbits|I32 none \
|NULLOK SV* invlist|NULLOK U8* const flags_p
#endif
+#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_LOCALE_C)
+EXpMRn |SSize_t|_invlist_search |NN SV* const invlist|const UV cp
+#endif
#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_UTF8_C)
EiMRn |UV* |invlist_array |NN SV* const invlist
EiMRn |bool* |get_invlist_offset_addr|NN SV* invlist
EiMRn |UV |_invlist_len |NN SV* const invlist
EMiRn |bool |_invlist_contains_cp|NN SV* const invlist|const UV cp
-EXpMRn |SSize_t|_invlist_search |NN SV* const invlist|const UV cp
EXMpR |SV* |_get_swash_invlist|NN SV* const swash
EXMpR |HV* |_swash_inversion_hash |NN SV* const swash
#endif
6 changes: 4 additions & 2 deletions embed.h
@@ -1068,7 +1068,7 @@
# if defined(PERL_IN_REGCOMP_C) || defined (PERL_IN_DUMP_C)
#define _invlist_dump(a,b,c,d) Perl__invlist_dump(aTHX_ a,b,c,d)
# endif
-# if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_PERL_C) || defined(PERL_IN_UTF8_C)
+# if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_PERL_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_LOCALE_C)
#define _invlistEQ(a,b,c) Perl__invlistEQ(aTHX_ a,b,c)
#define _new_invlist_C_array(a) Perl__new_invlist_C_array(aTHX_ a)
# endif
@@ -1084,11 +1084,13 @@
#define _get_swash_invlist(a) Perl__get_swash_invlist(aTHX_ a)
#define _invlist_contains_cp S__invlist_contains_cp
#define _invlist_len S__invlist_len
-#define _invlist_search Perl__invlist_search
#define _swash_inversion_hash(a) Perl__swash_inversion_hash(aTHX_ a)
#define get_invlist_offset_addr S_get_invlist_offset_addr
#define invlist_array S_invlist_array
# endif
+# if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_LOCALE_C)
+#define _invlist_search Perl__invlist_search
+# endif
# if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_TOKE_C)
#define _core_swash_init(a,b,c,d,e,f,g) Perl__core_swash_init(aTHX_ a,b,c,d,e,f,g)
# endif
