locale.c: Revamp fallback detection of UTF-8 locales

This commit continues the process started in the previous few commits to improve the detection of whether a locale is UTF-8 when the platform lacks the more modern tools for finding out.

The old approach examined various texts in a locale, such as the days of the week, to see whether they were legal UTF-8. If any contained non-ASCII characters, and all of those were legal UTF-8, the locale was assumed to be UTF-8. If none did (as in American English), the locale's name was examined instead. This yields both false negatives and false positives.

This commit adds the constraint that all the texts must be in the same script when interpreted as UTF-8, which essentially rules out false positives when the script isn't Latin.

With Latin, it isn't so clear cut, as the text can intermix ASCII Latin letters with variant (non-ASCII) byte sequences. Those sequences could be characters in some single-byte Latin locale that just coincidentally happen to be syntactically valid UTF-8. Because of the structure of UTF-8, the odds of such a coincidence go down with increasing numbers of variants in a row. A coincidence also isn't likely with ISO 8859-1, as the bytes that could be legal continuations in UTF-8 are almost entirely controls or punctuation. But in other locales in the 8859 series, some legal continuation bytes could be part of a month name, say.

As an example of the issues, in 8859-2 one could have \xC6 (C with acute) followed by \xB1 (a with ogonek), which in UTF-8 would be U+01B1: LATIN CAPITAL LETTER UPSILON. That pair passes the script check, because the result is still Latin. However, something like \xCD (I with acute) followed by \xB3 (l with stroke) yields U+0373: GREEK SMALL LETTER ARCHAIC SAMPI, and the script check added by this commit would catch that.

In non-Latin texts, the only permissible ASCII characters are punctuation, and there aren't going to be many of those in the LC_TIME strings, and certainly not several in a row. Instead those strings will consist of at least several variant characters in a row, and the odds of those coincidentally being both syntactically valid UTF-8 and semantically in the same script are exceedingly low.

To catch Latin UTF-8 locales, this commit keeps a list of the distinct variants found so far. If there are even just several of these, the odds of the syntax being coincidentally UTF-8 greatly diminish. The number needed to conclude that the locale is UTF-8 is easily tweakable at compile time.

The problem remains for English and other Latin-script languages in which accented characters are rare. For those, the locale name is still examined to see whether it contains "UTF-8".

Note that previous commits have guaranteed that if the locale has a non-ASCII currency symbol that is recognized by Unicode, such as the Euro or Pound Sterling, the locale will be correctly recognized.
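
The byte arithmetic and the distinct-variant count described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the code in locale.c: the names decode_2byte, scan_text, looks_like_utf8, MIN_DISTINCT_VARIANTS and MAX_TRACKED are all hypothetical, the threshold of 3 is an arbitrary stand-in for the compile-time tunable, only one- and two-byte sequences are handled, and the same-script check is left out because it needs Unicode script tables.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the commit message names neither value. */
#define MIN_DISTINCT_VARIANTS 3   /* variants needed to conclude "UTF-8" */
#define MAX_TRACKED 32            /* capacity of the distinct-variant list */

/* Decode a well-formed 2-byte UTF-8 sequence at s.
 * Returns the code point, or 0 if s does not start such a sequence. */
static unsigned long
decode_2byte(const unsigned char *s)
{
    assert(s != NULL);
    if (s[0] < 0xC2 || s[0] > 0xDF)     /* not a legal 2-byte start byte */
        return 0;
    if ((s[1] & 0xC0) != 0x80)          /* not a continuation byte */
        return 0;
    return ((unsigned long) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
}

/* Scan one NUL-terminated text, adding each new non-ASCII code point to
 * seen[].  Returns 0 if every non-ASCII byte is part of a well-formed
 * 2-byte sequence, -1 otherwise.  Simplification: longer sequences are
 * rejected rather than decoded. */
static int
scan_text(const char *text, unsigned long *seen, size_t *n_seen)
{
    const unsigned char *s = (const unsigned char *) text;

    assert(text != NULL && seen != NULL && n_seen != NULL);
    while (*s) {
        unsigned long cp;
        size_t i;

        if (*s < 0x80) {                /* ASCII: same in every encoding */
            s++;
            continue;
        }
        cp = decode_2byte(s);
        if (cp == 0)
            return -1;                  /* syntactically not UTF-8 */
        for (i = 0; i < *n_seen; i++)   /* already on the list? */
            if (seen[i] == cp)
                break;
        if (i == *n_seen && *n_seen < MAX_TRACKED)
            seen[(*n_seen)++] = cp;
        s += 2;
    }
    return 0;
}

/* Returns 1 if the texts look like UTF-8, 0 if they cannot be UTF-8, and
 * -1 if there are too few distinct variants to decide (the case where
 * the commit falls back to examining the locale's name). */
static int
looks_like_utf8(const char *const *texts, size_t n_texts)
{
    unsigned long seen[MAX_TRACKED];
    size_t n_seen = 0;
    size_t i;

    for (i = 0; i < n_texts; i++)
        if (scan_text(texts[i], seen, &n_seen) != 0)
            return 0;
    return n_seen >= MIN_DISTINCT_VARIANTS ? 1 : -1;
}
```

With these definitions, decode_2byte applied to the bytes \xC6 \xB1 gives 0x1B1 and applied to \xCD \xB3 gives 0x373, matching the two 8859-2 examples above. A list of all-ASCII day names makes looks_like_utf8 return -1, which is the undecided case.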
Perl · Jan 18, 2018 · 8b4104f
1 parent 8983edc
commit 8b4104f
Showing 4 changed files with 351 additions and 136 deletions.
diff --git a/embed.fnc b/embed.fnc
@@ -1336,7 +1336,7 @@ Ap	|void	|regdump	|NN const regexp* r
 ApM	|SV*	|regclass_swash	|NULLOK const regexp *prog \
 				|NN const struct regnode *node|bool doinit \
 				|NULLOK SV **listsvp|NULLOK SV **altsvp
-#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_PERL_C) || defined(PERL_IN_UTF8_C)
+#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_PERL_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_LOCALE_C)
 EXpR	|SV*	|_new_invlist_C_array|NN const UV* const list
 EXMp	|bool	|_invlistEQ	|NN SV* const a|NN SV* const b|const bool complement_b
 #endif
@@ -1734,12 +1734,14 @@ EXp	|SV*	|_core_swash_init|NN const char* pkg|NN const char* name \
 		|NN SV* listsv|I32 minbits|I32 none \
 		|NULLOK SV* invlist|NULLOK U8* const flags_p
 #endif
+#if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_LOCALE_C)
+EXpMRn	|SSize_t|_invlist_search	|NN SV* const invlist|const UV cp
+#endif
 #if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_UTF8_C)
 EiMRn	|UV*	|invlist_array	|NN SV* const invlist
 EiMRn	|bool*	|get_invlist_offset_addr|NN SV* invlist
 EiMRn	|UV	|_invlist_len	|NN SV* const invlist
 EMiRn	|bool	|_invlist_contains_cp|NN SV* const invlist|const UV cp
-EXpMRn	|SSize_t|_invlist_search	|NN SV* const invlist|const UV cp
 EXMpR	|SV*	|_get_swash_invlist|NN SV* const swash
 EXMpR	|HV*	|_swash_inversion_hash	|NN SV* const swash
 #endif

diff --git a/embed.h b/embed.h
@@ -1068,7 +1068,7 @@
 #  if defined(PERL_IN_REGCOMP_C) || defined (PERL_IN_DUMP_C)
 #define _invlist_dump(a,b,c,d)	Perl__invlist_dump(aTHX_ a,b,c,d)
 #  endif
-#  if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_PERL_C) || defined(PERL_IN_UTF8_C)
+#  if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_PERL_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_LOCALE_C)
 #define _invlistEQ(a,b,c)	Perl__invlistEQ(aTHX_ a,b,c)
 #define _new_invlist_C_array(a)	Perl__new_invlist_C_array(aTHX_ a)
 #  endif
@@ -1084,11 +1084,13 @@
 #define _get_swash_invlist(a)	Perl__get_swash_invlist(aTHX_ a)
 #define _invlist_contains_cp	S__invlist_contains_cp
 #define _invlist_len		S__invlist_len
-#define _invlist_search		Perl__invlist_search
 #define _swash_inversion_hash(a)	Perl__swash_inversion_hash(aTHX_ a)
 #define get_invlist_offset_addr	S_get_invlist_offset_addr
 #define invlist_array		S_invlist_array
 #  endif
+#  if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_LOCALE_C)
+#define _invlist_search		Perl__invlist_search
+#  endif
 #  if defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_UTF8_C) || defined(PERL_IN_TOKE_C)
 #define _core_swash_init(a,b,c,d,e,f,g)	Perl__core_swash_init(aTHX_ a,b,c,d,e,f,g)
 #  endif