Implement Stage 3 proposal Intl.Locale #5675
Interesting, looks like ICU 55 doesn't minimize
Also see comments of #5674 with @jefgen. ICU canonicalization code does a lot that Intl doesn't really care about, but its handling of UTS35/RFC5646-style canonicalization is a bit all over the place. I believe SpiderMonkey implements its own locale processing, and I am not sure what JSC does. I looked into doing the locale processing manually and it wouldn't be too difficult (at least, not on top of all of the processing that we already do in the abstract operations), so I don't know if it's worth getting the code into ICU if the only people worrying about it are Intl implementers.
const LANG_TAG_RE = new RegExp(`^${LANG_TAG}$`, 'i'); // [1] language; [2] script; [3] region; [4] variants; [5] extensions;
let unicodeExtensionsEnd;
for (unicodeExtensionsEnd = unicodeExtensionStart + 1; unicodeExtensionsEnd < extensionParts.length && extensionParts[unicodeExtensionsEnd].length > 1; unicodeExtensionsEnd++) {
// do nothing, we just want k to equal the index of the next element whose length is 1
What are you referring to with k? I don't see any variables with that name (which is nice, by the way).
oops, renamed the variable without updating the comment.
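For anyone skimming the loop above, here is a minimal sketch of the same scan as a standalone function. The name findExtensionEnd and the sample tag are made up for illustration and are not part of this PR; the only thing it relies on is that a singleton subtag is exactly one character long.

```javascript
// Hypothetical helper (not from the PR): given the subtags of the extension
// portion of a tag and the index of a singleton such as "u", return the index
// one past the last subtag that belongs to that extension.
function findExtensionEnd(extensionParts, singletonIndex) {
    let end = singletonIndex + 1;
    // Any subtag longer than one character still belongs to the current
    // extension; the next one-character subtag starts a new extension.
    while (end < extensionParts.length && extensionParts[end].length > 1) {
        end++;
    }
    return end;
}

const parts = "u-ca-gregory-nu-latn-x-foo".split("-");
const start = parts.indexOf("u");
const end = findExtensionEnd(parts, start);
console.log(parts.slice(start, end).join("-")); // u-ca-gregory-nu-latn
```

The scan stops either at the next singleton or at the end of the array, so a trailing unicode extension with nothing after it is handled by the bounds check.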
Maybe we should share some notes across implementations about where it is or isn't possible to reuse ICU's locale processing; c
v8 is about to make a switch to the ICU API from regexes to validate input tags. There are two groups of issues:
- ICU's handling of grandfathered tags and deprecated region/language codes is outdated. This is a data issue (the latest version of the IANA language tag registry should be used). I've filed a series of bugs against ICU. I'm assigned to them and I do have patches (that have been applied to v8's copy of ICU since this spring).
- I also discovered a couple of bugs and put up a PR against ICU (it's approved, but not yet merged).
With the above two issues resolved, v8 will make a switch to ICU with a couple of extra pre/post-processing steps.
I spoke too early. What I wrote above is mainly structure-validation and canonicalization. Min/max also work fine.
Last I had checked V8 (not sure which version) it was still converting unicode attributes to unicode keywords with value = "yes", and converting en-GB-oed to en-GB-x-oed rather than en-GB-oxendict. Chakra does this as well as a direct result of allowing ICU to do the canonicalization, so I assumed V8 used ICU in the same way that we did. In other words, I thought both V8 and Chakra suffered primarily from data issues because of ICU, not structure issues.
Thank you for reminding me of the 'yes' issue. That's still outstanding. It's https://unicode-org.atlassian.net/browse/ICU-13730.
In the case of en-GB-oed, it's fixed at least in getCanonicalLocales(), along with a number of cases arising from data. ICU upstream hasn't been fixed, though. [1]
var loc1=Intl.getCanonicalLocales("en-GB-oed")
undefined
loc1
["en-GB-oxendict"]
Let me check Intl.Locale.
[1] https://unicode-org.atlassian.net/browse/ICU-13721
13719, 13720, 13723, 13726 are other bugs about the data update. Perhaps I'd better consolidate them all into one and make a PR (the v8/Chromium patch is https://cs.chromium.org/chromium/src/third_party/icu/patches/locid_map.patch).
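To repeat this kind of check in another engine, something like the following sketch may help. The helper names attempt and canonicalForms are hypothetical. What each call returns (or whether it throws) depends on the engine and on the ICU/CLDR data it ships with, so no output is claimed for the grandfathered tag.

```javascript
// Hypothetical helper: run fn and report the error name instead of throwing,
// so results from different engines can be compared side by side.
function attempt(fn) {
    try {
        return fn();
    } catch (e) {
        return e.name;
    }
}

// Canonicalize one tag through both entry points discussed in this thread.
function canonicalForms(tag) {
    return {
        viaGetCanonicalLocales: attempt(() => Intl.getCanonicalLocales(tag)[0]),
        viaLocale: attempt(() => new Intl.Locale(tag).toString()),
    };
}

for (const tag of ["en-gb-oed", "EN-us", "zh-hans-cn"]) {
    console.log(tag, canonicalForms(tag));
}
```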
Intl.Locale is also fixed (as it should be):
$ d8 --harmony-locale
d8> new Intl.Locale("en-gb-oed").toString()
"en-GB-oxendict"
As for the structural validity, my PR for https://unicode-org.atlassian.net/browse/ICU-20098 was just merged to the ICU tot. When I replaced custom regular expressions for BCP 47 structural validity in v8 with ICU uloc_forLanguageTag/uloc_toLanguageTag with the above PR applied locally, at least there's no regression and one failing test begins to pass.
Ah, that's great! The Chromium patch + ICU PR covers all of the cases I tested that ICU didn't handle correctly except for three (plus the -yes issue):
- The UTS35 Likely Subtags algorithm notes that sh -> sr_Latn and mo -> ro_MD, but in my testing ICU didn't handle that. The UTS page is unclear about where that data comes from -- it's not in the subtag registry, and I don't know enough about the CLDR layout to find it there, either.
- und-Arab-AF maximizes to ar-Arab-AF, which seems wrong -- Wikipedia says the two primary languages of Afghanistan are Dari (prs) and Pashto (ps), not Arabic (ar). I am not sure if the Arabic script bit of the tag is causing the language to be Arabic, but the above Likely Subtags algorithm says und-Arab-AF should be maximized to fa-Arab-AF (fa == Persian), which also seems more reasonable since at least Dari is a Persian dialect.
- The UTS35 page also mentions that when the script is Zzzz or the region is ZZ, it should be removed from the tag entirely, but ICU seems to accept it.
I can file ICU issues for any/all three if they're actually incorrect behavior.
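A small sketch for probing the three cases above in any engine that ships Intl.Locale. The helper name likelySubtags is hypothetical, and the results depend on the engine's ICU and CLDR data, so the loop only prints what the host does rather than asserting what is correct.

```javascript
// Hypothetical helper: show the canonical, maximized and minimized forms the
// host produces for a tag.
function likelySubtags(tag) {
    const locale = new Intl.Locale(tag);
    return {
        canonical: locale.toString(),
        maximized: locale.maximize().toString(),
        minimized: locale.minimize().toString(),
    };
}

// The tags questioned above: deprecated languages, und-Arab-AF, and the
// "unknown" script/region codes Zzzz and ZZ.
for (const tag of ["sh", "mo", "und-Arab-AF", "und-Zzzz-ZZ"]) {
    console.log(tag, likelySubtags(tag));
}
```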
I see you've written some tests here; can you submit them upstream or put a license on them that'll allow me to?
Does the MIT license that we use preclude that? I asked some people internally and it wasn't clear, but I will keep asking around. Additionally, I only looked at the test262 cases that failed during development, but from what I can tell everything in these tests that isn't in test262 came from my interpretation of UTS35's Likely Subtags section. I personally would want to resolve some of the questions and commented out bits of the tests I wrote before porting them to test262.
Will take a look
// the UTS35 example says the maximized version should be fa-Arab-AF?
test("und-Arab-AF", "und-Arab-AF", "und-Arab-AF", "ar-Arab-AF");

// Chakra performs incorrect canonicalization, so the following cases don't pass.
"Chakra": make it clearer that this is actually because of ICU's logic.
@@ -257,6 +257,7 @@ RT_ERROR_MSG(JSERR_MissingCurrencyCode, 5123, "", "Currency code was not specifi
RT_ERROR_MSG(JSERR_InvalidDate, 5124, "", "Invalid Date", kjstRangeError, 0)
RT_ERROR_MSG(JSERR_IntlNotAvailable, 5125, "", "Intl is not available.", kjstTypeError, 0)
RT_ERROR_MSG(JSERR_IntlNotImplemented, 5126, "", "Intl operation '%s' is not implemented.", kjstTypeError, 0)
RT_ERROR_MSG(JSERR_InvalidPrivateOrGrandfatheredTag, 5127, "", "The arguments provided to Intl.Locale form an invalid privateuse or grandfathered language tag", kjstRangeError, 0)
"invalid privateuse or grandfathered language tag"
nit: This language confuses me, but it might make sense with the algo, or we can fix the message later -- so meh
Any suggestion? From the spec, I would suggest it's fairly clear -- there are specific cases where the spec says "if tag matches the grandfathered production or the privateuse production, throw a RangeError".
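To make the wording concrete, here is a rough sketch of what "matches the grandfathered production or the privateuse production" means for a whole tag. This is not the check in this PR: the function name and error text are hypothetical, and the tag list and privateuse pattern are taken from the RFC 5646 grammar.

```javascript
// Irregular and regular grandfathered tags as listed in the RFC 5646 ABNF.
const GRANDFATHERED = new Set([
    "en-GB-oed", "i-ami", "i-bnn", "i-default", "i-enochian", "i-hak",
    "i-klingon", "i-lux", "i-mingo", "i-navajo", "i-pwn", "i-tao", "i-tay",
    "i-tsu", "sgn-BE-FR", "sgn-BE-NL", "sgn-CH-DE",
    "art-lojban", "cel-gaulish", "no-bok", "no-nyn", "zh-guoyu", "zh-hakka",
    "zh-min", "zh-min-nan", "zh-xiang",
].map((tag) => tag.toLowerCase()));

// privateuse = "x" 1*("-" (1*8alphanum)); anchored so it only matches tags
// that consist of nothing but a private-use sequence.
const PRIVATEUSE_RE = /^x(?:-[a-z0-9]{1,8})+$/i;

// Hypothetical check: throw when the entire tag is grandfathered or privateuse.
function rejectPrivateOrGrandfathered(tag) {
    if (GRANDFATHERED.has(tag.toLowerCase()) || PRIVATEUSE_RE.test(tag)) {
        throw new RangeError(`'${tag}' is a privateuse or grandfathered language tag`);
    }
    return tag;
}

console.log(rejectPrivateOrGrandfathered("en-x-foo")); // en-x-foo
```

A private-use sequence at the end of an ordinary tag (en-x-foo) passes, because only a tag that is entirely private use matches the production.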
@@ -122,6 +122,16 @@
concat(array, ...els) { return callInstanceFunc(platform.builtInJavascriptArrayEntryConcat, array, ...els); },
filter(array, func) { return callInstanceFunc(platform.builtInJavascriptArrayEntryFilter, array, func); },
unique(array) { return _.filter(array, (v, i) => _.arrayIndexOf(array, v) === i); },
any(array, func) {
any(array, func) {
@jdalton FYI lodash method reimplementation :P
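The body of any is cut off in the hunk above, so the following is only a guess at the intended semantics (true as soon as one element satisfies the predicate, like lodash's some), written as plain JavaScript rather than with the callInstanceFunc/platform helpers the real file uses.

```javascript
// Illustrative stand-in, not the PR's code: returns true as soon as func
// returns a truthy value for some element, false if it never does.
function any(array, func) {
    for (let i = 0; i < array.length; i++) {
        if (func(array[i], i, array)) {
            return true;
        }
    }
    return false;
}

console.log(any(["u", "ca", "gregory"], (part) => part.length === 1)); // true
```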
const scriptREString = `\\b(?:${ALPHA}{4})\\b`; // script = 4ALPHA
const extlangREString = `\\b(?:${ALPHA}{3}\\b(?:-${ALPHA}{3}){0,2})\\b`; // extlang = 3ALPHA *2("-" 3ALPHA)

const languageREString = '\\b(?:' + // language =
nit: indentation
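As a quick way to sanity-check the two fragment strings quoted above in isolation, here is a sketch that compiles them into anchored, case-insensitive patterns. The definition of ALPHA is not shown in this hunk, so "[a-z]" is an assumption, and SCRIPT_RE and EXTLANG_RE are names invented for this example.

```javascript
const ALPHA = "[a-z]"; // assumption: the PR's actual ALPHA definition is not shown here

// The two fragments exactly as they appear in the diff.
const scriptREString = `\\b(?:${ALPHA}{4})\\b`; // script = 4ALPHA
const extlangREString = `\\b(?:${ALPHA}{3}\\b(?:-${ALPHA}{3}){0,2})\\b`; // extlang = 3ALPHA *2("-" 3ALPHA)

// Hypothetical standalone patterns for testing one subtag group at a time.
const SCRIPT_RE = new RegExp(`^${scriptREString}$`, "i");
const EXTLANG_RE = new RegExp(`^${extlangREString}$`, "i");

console.log(SCRIPT_RE.test("Latn"));     // true
console.log(SCRIPT_RE.test("Latin"));    // false
console.log(EXTLANG_RE.test("yue-hak")); // true
console.log(EXTLANG_RE.test("yueh"));    // false
```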
There are still a few things to do, but I wanted to get this out the door before heading home for the day:
langtagToParts and parseLangtag
This currently passes 160/184 Intl.Locale spec tests. The 24 tests that do not pass are all cases where we shell out to ICU, which does non-Intl-spec-compliant behavior (mostly related to the differences between platform.normalizeLanguageTag and CanonicalizeLanguageTag).
/cc @littledan