Commit
locale.c: Revamp fallback detection of UTF-8 locales
This commit continues the process started in the previous few commits to improve the detection of whether a locale is UTF-8 or not when the platform doesn't have the more modern tools available.

What was done before was to examine various texts in a locale, like the days of the week, and see whether they are legal UTF-8. If any contained non-ASCII characters, and all of those were legal UTF-8, it assumed that UTF-8 was needed. If there weren't any (as in American English), it looked at the locale's name. This approach yields both false negatives and false positives.

This commit adds the constraint that all the texts need to be in the same script when interpreted as UTF-8, which essentially rules out false positives when the script isn't Latin. With Latin, it isn't so clear cut, as the text can intermix ASCII Latin letters with variant (non-ASCII) sequences. Those sequences could really be UTF-8, or they could come from some single-byte Latin locale and just coincidentally happen to be syntactically valid UTF-8. Because of the structure of UTF-8, the odds of a coincidence go down with increasing numbers of variants in a row. A coincidence also isn't likely with ISO 8859-1, as the bytes that could be legal continuations in UTF-8 are almost entirely controls or punctuation. But in other locales in the 8859 series, there are some legal continuations that could be part of a month name, say.

As an example of the issues, in 8859-2, one could have \xC6 (C with acute) followed by \xB1 (a with ogonek), which in UTF-8 would be U+01B1: LATIN CAPITAL LETTER UPSILON. That is still a Latin-script character, so a script check alone would not catch it. However, something like \xCD (I with acute) followed by \xB3 (l with stroke) yields U+0373: GREEK SMALL LETTER ARCHAIC SAMPI, and the script check added by this commit would catch that.

In non-Latin texts, the only permissible ASCII characters would be punctuation, and you aren't going to have many of those in the LC_TIME strings, and certainly not in a row. Instead those strings will consist of at least several variant characters in a row, and the odds of those coincidentally being syntactically valid UTF-8 and semantically in the same script are exceedingly low.

To catch Latin UTF-8 locales, this commit adds a list of the distinct variants found so far. If there are even just several of these, the odds of the syntax being coincidentally UTF-8 greatly diminish. The number needed to conclude that the locale is UTF-8 is easily tweakable at compile time.

The problem remains for English and other Latin-script languages that have rare accented characters. In that case, the name is still examined for containing "UTF-8".

Note that previous commits have guaranteed that if the locale has a non-ASCII currency symbol that is recognized by Unicode, such as the Euro or Pound Sterling, that will be correctly recognized.
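
The distinct-variant heuristic can be illustrated with a minimal sketch. This is not the code in locale.c: the names decode_utf8, count_distinct_variants, looks_like_utf8, MIN_DISTINCT_VARIANTS and MAX_TRACKED are all hypothetical, the threshold value of 3 is an assumption, and the same-script check is left out because it needs Unicode script data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical compile-time threshold; the value 3 is an assumption. */
#ifndef MIN_DISTINCT_VARIANTS
#  define MIN_DISTINCT_VARIANTS 3
#endif

#define MAX_TRACKED 64  /* hypothetical cap on remembered variants */

/* Decode one UTF-8 character from s, with len bytes available.
 * Returns the number of bytes consumed, or 0 if the input is malformed
 * (bad start byte, bad continuation, overlong, surrogate, too large). */
static size_t
decode_utf8(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t need, i;
    uint32_t min;

    assert(cp != NULL);
    if (len == 0) return 0;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; min = 0x80;    *cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; min = 0x800;   *cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; min = 0x10000; *cp = s[0] & 0x07; }
    else return 0;

    if (len < need) return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF))
        return 0;
    return need;
}

/* Returns -1 if any text is not syntactically valid UTF-8; otherwise the
 * number of distinct non-ASCII code points seen (at most MAX_TRACKED). */
static int
count_distinct_variants(const char *const *texts, size_t n_texts)
{
    uint32_t seen[MAX_TRACKED];
    int n_seen = 0;
    size_t t;

    for (t = 0; t < n_texts; t++) {
        const unsigned char *s = (const unsigned char *) texts[t];
        size_t len = strlen(texts[t]);

        while (len > 0) {
            uint32_t cp;
            size_t used = decode_utf8(s, len, &cp);
            int i;

            if (used == 0) return -1;
            s += used;
            len -= used;
            if (cp < 0x80) continue;        /* ASCCI-range characters are not variants */

            for (i = 0; i < n_seen; i++)
                if (seen[i] == cp) break;
            if (i == n_seen && n_seen < MAX_TRACKED)
                seen[n_seen++] = cp;        /* first sighting of this variant */
        }
    }
    return n_seen;
}

/* Nonzero if enough distinct variants were seen to guess "UTF-8". */
static int
looks_like_utf8(const char *const *texts, size_t n_texts)
{
    return count_distinct_variants(texts, n_texts) >= MIN_DISTINCT_VARIANTS;
}
```

Feeding decode_utf8 the two byte pairs from the 8859-2 example gives U+01B1 for \xC6\xB1 and U+0373 for \xCD\xB3, which is why only the second pair is caught by a script check. A caller would pass the LC_TIME strings to looks_like_utf8 and fall back to examining the locale name when it returns 0.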