[RegExp] case-insensitive matching misses characters #517

Closed
goyakin opened this Issue Mar 10, 2016 · 5 comments

Projects

None yet

6 participants

@goyakin
goyakin commented Mar 10, 2016

When we generate the character casing equivalence sets for RegExp using UnicodeData.txt, we start from lowercase letters and then add the characters that they map to. However, this misses non-letter characters. For example, '\u0345', which is in the "Mn" (Mark, Nonspacing) category, should match '\u0399' when the case-insensitive flag is passed (even without the "unicode" flag), but we fail to do so.

When this is fixed, the following test should pass:

var didMatch = /\u0345/i.test('\u0399');
WScript.Echo(didMatch);
@goyakin goyakin added the Bug label Mar 10, 2016
@abchatra abchatra added this to the 1.2 milestone Mar 10, 2016
@goyakin goyakin was assigned by abchatra Mar 10, 2016
@abchatra abchatra removed this from the 1.2 milestone Mar 24, 2016
@bterlson
Member

We are also not properly handling canonicalization with i and u flags. /\W/iu.test("S") and /\W/iu.test("K") should return true.

@mathiasbynens

Since that last comment tends to confuse people, here’s an explanation of why /\W/iu.test('S') and /\W/iu.test('K') should return true: https://mathiasbynens.be/notes/es6-unicode-regex#impact-i

@goyakin
goyakin commented Apr 7, 2016

Possible spec change regarding \W: tc39/ecma262#512

@goyakin goyakin removed their assignment Apr 27, 2016
@akroshg akroshg was assigned by dilijev Jun 21, 2016
@dilijev dilijev referenced this issue in mathiasbynens/regexpu-core Jun 21, 2016
Closed

Add missing `iu-mappings` #7

@dilijev
Member
dilijev commented Jun 21, 2016

@mathiasbynens wrote in #1136:

Chakra returns false for each of these tests, while V8 and JavaScriptCore return true for all of them:

/\u01F1/i.test('\u01F3')
/\u0345/i.test('\u03B9')
/\u037F/i.test('\u03F3')
/\u0528/i.test('\u0529')
/\u052A/i.test('\u052B')
/\u052C/i.test('\u052D')
/\u052E/i.test('\u052F')
/\uA698/i.test('\uA699')
/\uA69A/i.test('\uA69B')
/\uA796/i.test('\uA797')
/\uA798/i.test('\uA799')
/\uA79A/i.test('\uA79B')
/\uA79C/i.test('\uA79D')
/\uA79E/i.test('\uA79F')
/\uA7AB/i.test('\u025C')
/\uA7AC/i.test('\u0261')
/\uA7AD/i.test('\u026C')
/\uA7B0/i.test('\u029E')
/\uA7B1/i.test('\u0287')

https://mathiasbynens.be/demo/regex-i

(See also: similar SpiderMonkey bug.)

@mathiasbynens
mathiasbynens commented Jun 23, 2016 edited

/\W/iu.test("S") and /\W/iu.test("K") should return true.

This is no longer true now that tc39/ecma262#525 has landed.

The other bugs mentioned in this thread remain.

@dilijev dilijev assigned dilijev and unassigned akroshg Nov 11, 2016
@dilijev dilijev added this to the 1.4 milestone Nov 11, 2016
@dilijev dilijev changed the title from RegExp case-insensitive matching misses characters to [RegExp] case-insensitive matching misses characters Nov 11, 2016
@dilijev dilijev modified the milestone: 1.4.0, 1.4.1 Dec 20, 2016
@dilijev dilijev added a commit to dilijev/ChakraCore that referenced this issue Jan 12, 2017
@dilijev dilijev Add provided test case from #517. cb736ab
@dilijev dilijev added Fixed and removed Fixed labels Jan 13, 2017
@chakrabot chakrabot pushed a commit that closed this issue Jan 14, 2017
@dilijev dilijev [MERGE #2356 @dilijev] Update CaseInsensitive table from hybrid of (U…
…nicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517

Merge pull request #2356 from dilijev:unicase

Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support.

Fixes #517

Note: The current standard wants Unicode 9.0 but it might be too risky to update that far in a stabilization branch. Opened #2367 to track this work item.

The table was generated in the past but then was (mostly) manually edited to include various optimizations and to fix bugs over the years. To make sure we got a complete update, I wrote a tool to generate the table.

## CaseInsensitive mapping generator tool
PR: dilijev#3
Source: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive

From this tool I was able to see and apply the differences from the current implementation to the correct implementation. In order to keep the change as small as possible, I used the diff as a reference for what needed changing and left out non-essential diffs. Additionally, the tool generates a suite of tests to track regressions against the update and ensure that the implementation does what is expected. I took some key tests from that suite and created the test file contained in this PR.

# Overview of Changes

I have staged the changes to hopefully make this easier to review. Here's an overview.

NOTE: The individual commits list or summarize the relevant lines of UnicodeData.txt where applicable.

First, I normalized the existing table to a reasonable format (same as the output of the tool) to make the later commits more clear. This involves fixing the casing and sorting deltas on each line in ascending order.

3d0f37f

Next, I fixed a few bugs with the current table that were preventing some cases from being matched correctly.

abb5d91
4894d24
25049de

Added new codepoints:

f197902 - GREEK LETTER YOT
af2d083 - Cyrillic
cba5439 - Cherokee
6c25a51 - Latin extensions

Other tests and fixes:

cb736ab - Add test cases from #517 to ensure those issues are fixed.
fbfb953 - 0x0345 and case-insensitive equivalent characters with or without /u flag.
dc3e750 - Case-insensitive matching for Cherokee only with /u. [1]
d96eed5 - All other Unicode 8.0 cases of case-insensitive matching only with /u. [1] Added generated tests.

[1] These were with a focus on compat with v8 as determined by running the full regression test suite I generated against node-6.9.4-LTS and node-7.4.0 (latest), and double-checking a handful of tests against the latest stable Chrome (v 55).

# Test Coverage

* **Regex test run successful!** `Summary: E:\d\RegexTestCollateral had 151147 tests; 0 failures`
* Internal and slow tests pass.
* Note that PRs are merged with the target branch before running Jenkins checks so attempting to run slow tests on this PR would result in failures as per #2316 -- but running them locally on this branch, the tests pass.

# Reviewers

@tcare @bterlson @boingoing @Cellule - Thank you for your assistance with this change and for your code reviews.
7565382
@chakrabot chakrabot closed this in 7565382 Jan 14, 2017
@chakrabot chakrabot pushed a commit that referenced this issue Jan 14, 2017
@dilijev dilijev [1.4>master] [MERGE #2356 @dilijev] Update CaseInsensitive table from…
… hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support. Fixes #517

Merge pull request #2356 from dilijev:unicase

Update CaseInsensitive table from hybrid of (Unicode 6.3 and later, up to 8.0) to full Unicode 8.0 support.

Fixes #517

Note: The current standard wants Unicode 9.0 but it might be too risky to update that far in a stabilization branch. Opened #2367 to track this work item.

The table was generated in the past but then was (mostly) manually edited to include various optimizations and to fix bugs over the years. To make sure we got a complete update, I wrote a tool to generate the table.

## CaseInsensitive mapping generator tool
PR: dilijev#3
Source: https://github.com/dilijev/ChakraCore/tree/CaseInsensitive/tools/Unicode/CaseInsensitive

From this tool I was able to see and apply the differences from the current implementation to the correct implementation. In order to keep the change as small as possible, I used the diff as a reference for what needed changing and left out non-essential diffs. Additionally, the tool generates a suite of tests to track regressions against the update and ensure that the implementation does what is expected. I took some key tests from that suite and created the test file contained in this PR.

# Overview of Changes

I have staged the changes to hopefully make this easier to review. Here's an overview.

NOTE: The individual commits list or summarize the relevant lines of UnicodeData.txt where applicable.

First, I normalized the existing table to a reasonable format (same as the output of the tool) to make the later commits more clear. This involves fixing the casing and sorting deltas on each line in ascending order.

3d0f37f

Next, I fixed a few bugs with the current table that were preventing some cases from being matched correctly.

abb5d91
4894d24
25049de

Added new codepoints:

f197902 - GREEK LETTER YOT
af2d083 - Cyrillic
cba5439 - Cherokee
6c25a51 - Latin extensions

Other tests and fixes:

cb736ab - Add test cases from #517 to ensure those issues are fixed.
fbfb953 - 0x0345 and case-insensitive equivalent characters with or without /u flag.
dc3e750 - Case-insensitive matching for Cherokee only with /u. [1]
d96eed5 - All other Unicode 8.0 cases of case-insensitive matching only with /u. [1] Added generated tests.

[1] These were with a focus on compat with v8 as determined by running the full regression test suite I generated against node-6.9.4-LTS and node-7.4.0 (latest), and double-checking a handful of tests against the latest stable Chrome (v 55).

# Test Coverage

* **Regex test run successful!** `Summary: E:\d\RegexTestCollateral had 151147 tests; 0 failures`
* Internal and slow tests pass.
* Note that PRs are merged with the target branch before running Jenkins checks so attempting to run slow tests on this PR would result in failures as per #2316 -- but running them locally on this branch, the tests pass.

# Reviewers

@tcare @bterlson @boingoing @Cellule - Thank you for your assistance with this change and for your code reviews.
6dbbd79
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment