diff --git a/doc/html/pcre2_set_compile_extra_options.html b/doc/html/pcre2_set_compile_extra_options.html index 09ba34ad9..cb62022a2 100644 --- a/doc/html/pcre2_set_compile_extra_options.html +++ b/doc/html/pcre2_set_compile_extra_options.html @@ -43,8 +43,10 @@

pcre2_set_compile_extra_options man page

PCRE2_EXTRA_ESCAPED_CR_IS_LF Interpret \r as \n PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines PCRE2_EXTRA_MATCH_WORD Pattern matches "words" + PCRE2_EXTRA_NEVER_CALLOUT Disallow callouts in pattern PCRE2_EXTRA_NO_BS0 Disallow \0 (but not \00 or \000) PCRE2_EXTRA_PYTHON_OCTAL Use Python rules for octal + PCRE2_EXTRA_TURKISH_CASING Use Turkish I case folding There is a complete description of the PCRE2 native API in the pcre2api diff --git a/doc/html/pcre2api.html b/doc/html/pcre2api.html index df439c4c4..bcffaa77f 100644 --- a/doc/html/pcre2api.html +++ b/doc/html/pcre2api.html @@ -1697,12 +1697,21 @@

pcre2api man page

changed within a pattern by a (?i) option setting. If either PCRE2_UTF or PCRE2_UCP is set, Unicode properties are used for all characters with more than one other case, and for all characters whose code points are greater than -U+007F. Note that there are two ASCII characters, K and S, that, in addition to +U+007F. +

+

+Note that there are two ASCII characters, K and S, that, in addition to their lower case ASCII equivalents, are case-equivalent with U+212A (Kelvin sign) and U+017F (long S) respectively. If you do not want this case equivalence, you can suppress it by setting PCRE2_EXTRA_CASELESS_RESTRICT.

+One language family, Turkish and Azeri, has its own case-insensitivity rules, +which can be selected by setting PCRE2_EXTRA_TURKISH_CASING. This alters the +behaviour of the 'i', 'I', U+0130 (capital I with dot above), and U+0131 +(small dotless i) characters. +

+

For lower valued characters with only one other case, a lookup table is used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is used for all code points less than 256, and higher code points (available only in @@ -2037,9 +2046,16 @@

pcre2api man page

upper/lower casing operations, even when PCRE2_UTF is not set. This makes it possible to process strings in the 16-bit UCS-2 code. This option is available only if PCRE2 has been compiled with Unicode support (which is the default). -The PCRE2_EXTRA_CASELESS_RESTRICT option (see below) restricts caseless +

+

+The PCRE2_EXTRA_CASELESS_RESTRICT option (see above) restricts caseless matching such that ASCII characters match only ASCII characters and non-ASCII -characters match only non-ASCII characters. +characters match only non-ASCII characters. The PCRE2_EXTRA_TURKISH_CASING option +(see above) alters the matching of the 'i' characters to follow their behaviour +in Turkish and Azeri languages. For further details on +PCRE2_EXTRA_CASELESS_RESTRICT and PCRE2_EXTRA_TURKISH_CASING, see the +pcre2unicode +page.

   PCRE2_UNGREEDY
 
@@ -2176,7 +2192,8 @@

pcre2api man page

ASCII letter K is case-equivalent to U+212a (Kelvin sign). This option disables recognition of case-equivalences that cross the ASCII/non-ASCII boundary. In a caseless match, both characters must either be ASCII or non-ASCII. The option -can be changed with a pattern by the (?r) option setting. +can be changed within a pattern by the (*CASELESS_RESTRICT) or (?r) option +settings.
   PCRE2_EXTRA_ESCAPED_CR_IS_LF
 
@@ -2223,6 +2240,14 @@

pcre2api man page

returning PCRE2_ERROR_CALLOUT_CALLER_DISABLED. This is useful if the application knows that a callout will not be provided to pcre2_match(), so that callouts in the pattern are not silently ignored. +
+  PCRE2_EXTRA_TURKISH_CASING
+
+This option alters case-equivalence of the 'i' letters to follow the +alphabet used by Turkish and Azeri languages. The option can be changed within +a pattern by the (*TURKISH_CASING) start-of-pattern setting. Either the UTF or +UCP option must be set. In the 8-bit library, UTF must be set. This option +cannot be combined with PCRE2_EXTRA_CASELESS_RESTRICT.
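The constraints in this paragraph can be sketched as a small check. This is a minimal Python model, assuming the checks run in the order in which they appear in the pcre2_compile.c hunk of this patch (no Unicode mode first, then the 8-bit library without UTF, then the conflict with PCRE2_EXTRA_CASELESS_RESTRICT). The function name and its parameters are hypothetical and are not part of the PCRE2 API; the numeric values are the PCRE2_ERROR_* codes that this patch adds to pcre2.h.

```python
# Error codes added to pcre2.h by this patch.
PCRE2_ERROR_EXTRA_CASING_REQUIRES_UNICODE = 204
PCRE2_ERROR_TURKISH_CASING_REQUIRES_UTF = 205
PCRE2_ERROR_EXTRA_CASING_INCOMPATIBLE = 206


def check_turkish_casing(utf, ucp, caseless_restrict, code_unit_width=8):
    """Hypothetical helper (not a PCRE2 function).

    Return 0 if PCRE2_EXTRA_TURKISH_CASING could be used with these
    settings, otherwise the error code the compile step would report.
    """
    # Neither UTF nor UCP: no Unicode case information is available.
    if not utf and not ucp:
        return PCRE2_ERROR_EXTRA_CASING_REQUIRES_UNICODE
    # In the 8-bit library UCP alone is not enough; UTF is required.
    if code_unit_width == 8 and not utf:
        return PCRE2_ERROR_TURKISH_CASING_REQUIRES_UTF
    # The two extra casing options are mutually exclusive.
    if caseless_restrict:
        return PCRE2_ERROR_EXTRA_CASING_INCOMPATIBLE
    return 0
```

For example, `check_turkish_casing(utf=False, ucp=True, caseless_restrict=False, code_unit_width=16)` returns 0, while the same call with `code_unit_width=8` returns 205.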


JUST-IN-TIME (JIT) COMPILATION

diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html index 667620fe8..20da7e04f 100644 --- a/doc/html/pcre2pattern.html +++ b/doc/html/pcre2pattern.html @@ -302,7 +302,10 @@

pcre2pattern man page

equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F (long S) respectively when either PCRE2_UTF or PCRE2_UCP is set, unless the PCRE2_EXTRA_CASELESS_RESTRICT option is in force (either passed to -pcre2_compile() or set by (?r) within the pattern). +pcre2_compile() or set by (*CASELESS_RESTRICT) or (?r) within the +pattern). If the PCRE2_EXTRA_TURKISH_CASING option is in force (either passed +to pcre2_compile() or set by (*TURKISH_CASING) within the pattern), then +the 'i' letters are matched according to Turkish and Azeri languages.

The power of regular expressions comes from the ability to include wild cards, diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html index 537408228..649301c7d 100644 --- a/doc/html/pcre2syntax.html +++ b/doc/html/pcre2syntax.html @@ -436,17 +436,19 @@

pcre2syntax man page

of the newline or \R sequences or options with similar syntax. More than one of them may appear. For the first three, d is a decimal number.
-  (*LIMIT_DEPTH=d) set the backtracking limit to d
-  (*LIMIT_HEAP=d)  set the heap size limit to d * 1024 bytes
-  (*LIMIT_MATCH=d) set the match limit to d
-  (*NOTEMPTY)      set PCRE2_NOTEMPTY when matching
-  (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
-  (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
+  (*CASELESS_RESTRICT) set PCRE2_EXTRA_CASELESS_RESTRICT when matching
+  (*LIMIT_DEPTH=d)     set the backtracking limit to d
+  (*LIMIT_HEAP=d)      set the heap size limit to d * 1024 bytes
+  (*LIMIT_MATCH=d)     set the match limit to d
+  (*NOTEMPTY)          set PCRE2_NOTEMPTY when matching
+  (*NOTEMPTY_ATSTART)  set PCRE2_NOTEMPTY_ATSTART when matching
+  (*NO_AUTO_POSSESS)   no auto-possessification (PCRE2_NO_AUTO_POSSESS)
   (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
-  (*NO_JIT)       disable JIT optimization
-  (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
-  (*UTF)          set appropriate UTF mode for the library in use
-  (*UCP)          set PCRE2_UCP (use Unicode properties for \d etc)
+  (*NO_JIT)            disable JIT optimization
+  (*NO_START_OPT)      no start-match optimization (PCRE2_NO_START_OPTIMIZE)
+  (*TURKISH_CASING)    set PCRE2_EXTRA_TURKISH_CASING when matching
+  (*UTF)               set appropriate UTF mode for the library in use
+  (*UCP)               set PCRE2_UCP (use Unicode properties for \d etc)
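How the two new start-of-pattern settings in the list above could be collected is sketched below. This is a simplified Python model, not the pso_list scan in pcre2_compile.c: it recognises only (*CASELESS_RESTRICT) and (*TURKISH_CASING), ignores every other setting in the list, and does not reject the combination of the two. The helper name is hypothetical.

```python
# Start-of-pattern settings added by this patch, mapped to the extra
# option each one sets.
XOPTION_SETTINGS = {
    "(*CASELESS_RESTRICT)": "PCRE2_EXTRA_CASELESS_RESTRICT",
    "(*TURKISH_CASING)": "PCRE2_EXTRA_TURKISH_CASING",
}


def split_leading_xoptions(pattern):
    """Hypothetical helper (not a PCRE2 function).

    Peel recognised settings off the very start of the pattern and
    return (set of option names, remaining pattern text).
    """
    enabled = set()
    rest = pattern
    matched = True
    while matched:
        matched = False
        for text, name in XOPTION_SETTINGS.items():
            if rest.startswith(text):
                enabled.add(name)
                rest = rest[len(text):]
                matched = True
    return enabled, rest
```

A setting that is not at the start of the pattern is left untouched, which mirrors the rule that these items are recognised only at the very start of a pattern.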
 
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of the limits set by the caller of pcre2_match() or pcre2_dfa_match(), diff --git a/doc/html/pcre2test.html b/doc/html/pcre2test.html index ebe1c594d..c36230397 100644 --- a/doc/html/pcre2test.html +++ b/doc/html/pcre2test.html @@ -673,6 +673,7 @@

pcre2test man page

no_start_optimize set PCRE2_NO_START_OPTIMIZE no_utf_check set PCRE2_NO_UTF_CHECK python_octal set PCRE2_EXTRA_PYTHON_OCTAL + turkish_casing set PCRE2_EXTRA_TURKISH_CASING ucp set PCRE2_UCP ungreedy set PCRE2_UNGREEDY use_offset_limit set PCRE2_USE_OFFSET_LIMIT diff --git a/doc/html/pcre2unicode.html b/doc/html/pcre2unicode.html index 1f6911a03..9a58848ed 100644 --- a/doc/html/pcre2unicode.html +++ b/doc/html/pcre2unicode.html @@ -157,6 +157,35 @@

pcre2unicode man page

counterparts can be disabled by setting the PCRE2_EXTRA_CASELESS_RESTRICT option. When this is set, all characters in a case equivalence must either be ASCII or non-ASCII; there can be no mixing. +
+    Without PCRE2_EXTRA_CASELESS_RESTRICT:
+      'k' = 'K' = U+212A (Kelvin sign)
+      's' = 'S' = U+017F (long S)
+    With PCRE2_EXTRA_CASELESS_RESTRICT:
+      'k' = 'K'
+      U+212A (Kelvin sign)  only case-equivalent to itself
+      's' = 'S'
+      U+017F (long S)       only case-equivalent to itself
+
+

+

+One language family, Turkish and Azeri, has its own case-insensitivity rules, +which can be selected by setting PCRE2_EXTRA_TURKISH_CASING. This alters the +behaviour of the 'i', 'I', U+0130 (capital I with dot above), and U+0131 +(small dotless i) characters. +

+    Without PCRE2_EXTRA_TURKISH_CASING:
+      'i' = 'I'
+      U+0130 (capital I with dot above)  only case-equivalent to itself
+      U+0131 (small dotless i)           only case-equivalent to itself
+    With PCRE2_EXTRA_TURKISH_CASING:
+      'i' = U+0130 (capital I with dot above)
+      U+0131 (small dotless i) = 'I'
+
+

+

+It is not allowed to specify both PCRE2_EXTRA_CASELESS_RESTRICT and +PCRE2_EXTRA_TURKISH_CASING together.
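The two tables above can be expressed as a toy model. The Python sketch below covers only the characters named in those tables (k, s, i and their partners); any other character is reported as equivalent only to itself, which is a limitation of the sketch and not a statement about PCRE2. The function names are hypothetical and this is not how PCRE2 stores its caseless sets.

```python
KELVIN_SIGN = "\u212A"
LONG_S = "\u017F"
CAPITAL_I_WITH_DOT = "\u0130"
SMALL_DOTLESS_I = "\u0131"


def equivalence_class(ch, turkish=False, restrict=False):
    """Hypothetical helper: return the set of characters that are
    case-equivalent to ch under the given option combination."""
    if turkish and restrict:
        # The documentation states the two options cannot be combined.
        raise ValueError("TURKISH_CASING and CASELESS_RESTRICT are incompatible")

    classes = [{"k", "K", KELVIN_SIGN}, {"s", "S", LONG_S}]
    if restrict:
        # Keep only the ASCII members; the non-ASCII partners then fall
        # through to the "only equivalent to itself" case below.
        classes = [{c for c in cls if ord(c) < 128} for cls in classes]

    if turkish:
        classes += [{"i", CAPITAL_I_WITH_DOT}, {"I", SMALL_DOTLESS_I}]
    else:
        classes.append({"i", "I"})

    for cls in classes:
        if ch in cls:
            return frozenset(cls)
    return frozenset({ch})


def caseless_equal(a, b, turkish=False, restrict=False):
    """True if a and b are case-equivalent in this toy model."""
    return b in equivalence_class(a, turkish=turkish, restrict=restrict)
```

With `turkish=True`, `caseless_equal("i", "I", turkish=True)` is false and `caseless_equal("i", "\u0130", turkish=True)` is true, matching the second table.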

From release 10.45 the Unicode letter properties Lu (upper case), Ll (lower diff --git a/doc/pcre2_set_compile_extra_options.3 b/doc/pcre2_set_compile_extra_options.3 index 490d6a0a1..114479e6a 100644 --- a/doc/pcre2_set_compile_extra_options.3 +++ b/doc/pcre2_set_compile_extra_options.3 @@ -43,8 +43,10 @@ options are: PCRE2_EXTRA_ESCAPED_CR_IS_LF Interpret \er as \en PCRE2_EXTRA_MATCH_LINE Pattern matches whole lines PCRE2_EXTRA_MATCH_WORD Pattern matches "words" + PCRE2_EXTRA_NEVER_CALLOUT Disallow callouts in pattern PCRE2_EXTRA_NO_BS0 Disallow \e0 (but not \e00 or \e000) PCRE2_EXTRA_PYTHON_OCTAL Use Python rules for octal + PCRE2_EXTRA_TURKISH_CASING Use Turkish I case folding .sp There is a complete description of the PCRE2 native API in the .\" HREF diff --git a/doc/pcre2api.3 b/doc/pcre2api.3 index beab4107f..55e325256 100644 --- a/doc/pcre2api.3 +++ b/doc/pcre2api.3 @@ -1633,11 +1633,18 @@ letters in the subject. It is equivalent to Perl's /i option, and it can be changed within a pattern by a (?i) option setting. If either PCRE2_UTF or PCRE2_UCP is set, Unicode properties are used for all characters with more than one other case, and for all characters whose code points are greater than -U+007F. Note that there are two ASCII characters, K and S, that, in addition to +U+007F. +.P +Note that there are two ASCII characters, K and S, that, in addition to their lower case ASCII equivalents, are case-equivalent with U+212A (Kelvin sign) and U+017F (long S) respectively. If you do not want this case equivalence, you can suppress it by setting PCRE2_EXTRA_CASELESS_RESTRICT. .P +One language family, Turkish and Azeri, has its own case-insensitivity rules, +which can be selected by setting PCRE2_EXTRA_TURKISH_CASING. This alters the +behaviour of the 'i', 'I', U+0130 (capital I with dot above), and U+0131 +(small dotless i) characters. +.P For lower valued characters with only one other case, a lookup table is used for speed. 
When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is used for all code points less than 256, and higher code points (available only in @@ -1986,9 +1993,17 @@ The second effect of PCRE2_UCP is to force the use of Unicode properties for upper/lower casing operations, even when PCRE2_UTF is not set. This makes it possible to process strings in the 16-bit UCS-2 code. This option is available only if PCRE2 has been compiled with Unicode support (which is the default). -The PCRE2_EXTRA_CASELESS_RESTRICT option (see below) restricts caseless +.P +The PCRE2_EXTRA_CASELESS_RESTRICT option (see above) restricts caseless matching such that ASCII characters match only ASCII characters and non-ASCII -characters match only non-ASCII characters. +characters match only non-ASCII characters. The PCRE2_EXTRA_TURKISH_CASING option +(see above) alters the matching of the 'i' characters to follow their behaviour +in Turkish and Azeri languages. For further details on +PCRE2_EXTRA_CASELESS_RESTRICT and PCRE2_EXTRA_TURKISH_CASING, see the +.\" HREF +\fBpcre2unicode\fP +.\" +page. .sp PCRE2_UNGREEDY .sp @@ -2128,7 +2143,8 @@ characters. The ASCII letter S is case-equivalent to U+017f (long S) and the ASCII letter K is case-equivalent to U+212a (Kelvin sign). This option disables recognition of case-equivalences that cross the ASCII/non-ASCII boundary. In a caseless match, both characters must either be ASCII or non-ASCII. The option -can be changed with a pattern by the (?r) option setting. +can be changed within a pattern by the (*CASELESS_RESTRICT) or (?r) option +settings. .sp PCRE2_EXTRA_ESCAPED_CR_IS_LF .sp @@ -2177,6 +2193,14 @@ If this option is set, PCRE2 treats callouts in the pattern as a syntax error, returning PCRE2_ERROR_CALLOUT_CALLER_DISABLED. This is useful if the application knows that a callout will not be provided to \fBpcre2_match()\fP, so that callouts in the pattern are not silently ignored. 
+.sp + PCRE2_EXTRA_TURKISH_CASING +.sp +This option alters case-equivalence of the 'i' letters to follow the +alphabet used by Turkish and Azeri languages. The option can be changed within +a pattern by the (*TURKISH_CASING) start-of-pattern setting. Either the UTF or +UCP option must be set. In the 8-bit library, UTF must be set. This option +cannot be combined with PCRE2_EXTRA_CASELESS_RESTRICT. . . .\" HTML diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3 index 11dddd9a9..424e66689 100644 --- a/doc/pcre2pattern.3 +++ b/doc/pcre2pattern.3 @@ -278,7 +278,10 @@ ASCII characters, K and S, that, in addition to their lower case ASCII equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F (long S) respectively when either PCRE2_UTF or PCRE2_UCP is set, unless the PCRE2_EXTRA_CASELESS_RESTRICT option is in force (either passed to -\fBpcre2_compile()\fP or set by (?r) within the pattern). +\fBpcre2_compile()\fP or set by (*CASELESS_RESTRICT) or (?r) within the +pattern). If the PCRE2_EXTRA_TURKISH_CASING option is in force (either passed +to \fBpcre2_compile()\fP or set by (*TURKISH_CASING) within the pattern), then +the 'i' letters are matched according to Turkish and Azeri languages. .P The power of regular expressions comes from the ability to include wild cards, character classes, alternatives, and repetitions in the pattern. These are diff --git a/doc/pcre2syntax.3 b/doc/pcre2syntax.3 index d3a772030..eda042b49 100644 --- a/doc/pcre2syntax.3 +++ b/doc/pcre2syntax.3 @@ -411,17 +411,19 @@ The following are recognized only at the very start of a pattern or after one of the newline or \eR sequences or options with similar syntax. More than one of them may appear. For the first three, d is a decimal number.
.sp - (*LIMIT_DEPTH=d) set the backtracking limit to d - (*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes - (*LIMIT_MATCH=d) set the match limit to d - (*NOTEMPTY) set PCRE2_NOTEMPTY when matching - (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching - (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS) + (*CASELESS_RESTRICT) set PCRE2_EXTRA_CASELESS_RESTRICT when matching + (*LIMIT_DEPTH=d) set the backtracking limit to d + (*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes + (*LIMIT_MATCH=d) set the match limit to d + (*NOTEMPTY) set PCRE2_NOTEMPTY when matching + (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching + (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS) (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR) - (*NO_JIT) disable JIT optimization - (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE) - (*UTF) set appropriate UTF mode for the library in use - (*UCP) set PCRE2_UCP (use Unicode properties for \ed etc) + (*NO_JIT) disable JIT optimization + (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE) + (*TURKISH_CASING) set PCRE2_EXTRA_TURKISH_CASING when matching + (*UTF) set appropriate UTF mode for the library in use + (*UCP) set PCRE2_UCP (use Unicode properties for \ed etc) .sp Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of the limits set by the caller of \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP, diff --git a/doc/pcre2test.1 b/doc/pcre2test.1 index 9567005b0..b5a343016 100644 --- a/doc/pcre2test.1 +++ b/doc/pcre2test.1 @@ -628,6 +628,7 @@ for a description of the effects of these options. 
no_start_optimize set PCRE2_NO_START_OPTIMIZE no_utf_check set PCRE2_NO_UTF_CHECK python_octal set PCRE2_EXTRA_PYTHON_OCTAL + turkish_casing set PCRE2_EXTRA_TURKISH_CASING ucp set PCRE2_UCP ungreedy set PCRE2_UNGREEDY use_offset_limit set PCRE2_USE_OFFSET_LIMIT diff --git a/doc/pcre2unicode.3 b/doc/pcre2unicode.3 index ef54e845e..0dbc07817 100644 --- a/doc/pcre2unicode.3 +++ b/doc/pcre2unicode.3 @@ -147,6 +147,31 @@ Recognition of these non-ASCII characters as case-equivalent to their ASCII counterparts can be disabled by setting the PCRE2_EXTRA_CASELESS_RESTRICT option. When this is set, all characters in a case equivalence must either be ASCII or non-ASCII; there can be no mixing. +.sp + Without PCRE2_EXTRA_CASELESS_RESTRICT: + 'k' = 'K' = U+212A (Kelvin sign) + 's' = 'S' = U+017F (long S) + With PCRE2_EXTRA_CASELESS_RESTRICT: + 'k' = 'K' + U+212A (Kelvin sign) only case-equivalent to itself + 's' = 'S' + U+017F (long S) only case-equivalent to itself +.P +One language family, Turkish and Azeri, has its own case-insensitivity rules, +which can be selected by setting PCRE2_EXTRA_TURKISH_CASING. This alters the +behaviour of the 'i', 'I', U+0130 (capital I with dot above), and U+0131 +(small dotless i) characters. +.sp + Without PCRE2_EXTRA_TURKISH_CASING: + 'i' = 'I' + U+0130 (capital I with dot above) only case-equivalent to itself + U+0131 (small dotless i) only case-equivalent to itself + With PCRE2_EXTRA_TURKISH_CASING: + 'i' = U+0130 (capital I with dot above) + U+0131 (small dotless i) = 'I' +.P +It is not allowed to specify both PCRE2_EXTRA_CASELESS_RESTRICT and +PCRE2_EXTRA_TURKISH_CASING together. 
.P From release 10.45 the Unicode letter properties Lu (upper case), Ll (lower case), and Lt (title case) are all treated as Lc (cased letter) when caseless diff --git a/maint/GenerateUcd.py b/maint/GenerateUcd.py index 83f692bcf..63f063442 100755 --- a/maint/GenerateUcd.py +++ b/maint/GenerateUcd.py @@ -737,6 +737,12 @@ def write_bitsets(list, item_size): if x > 127 and x + other_case[x] < 128: other_case[x] = 0 +# Append a couple of extra caseless sets (unreferenced by the record objects) +# to hold the optional Turkish case equivalences. +turkish_dotted_i_index = offset +caseless_sets.append([0x69, 0x0130]) +caseless_sets.append([0x49, 0x0131]) + # Combine all the tables table, records = combine_tables(script, category, break_props, @@ -855,6 +861,17 @@ def write_bitsets(list, item_size): f.write(' NOTACHAR,\n') f.write('};\n\n') +# --- Output the indices of the Turkish caseless character sets --- + +f.write("""\ +/* This is the index, within ucd_caseless_sets, of the additional +Turkish case-equivalences. The dotted I ones are this offset; the +dotless I are +3 from here. */ + +const uint32_t PRIV(ucd_turkish_dotted_i_caseset) = %d; + +""" % (turkish_dotted_i_index)) + # --- Other tables are not needed by pcre2test --- f.write("""\ @@ -867,7 +884,7 @@ def write_bitsets(list, item_size): # --- Output the nocase sets --- f.write("""\ -/* This table contains character ranges, where the characters in the range has +/* This table contains character ranges, where the characters in the range have no other case. Both start and end values are excluded from the range. 
*/ const uint32_t PRIV(ucd_nocase_ranges)[] = { @@ -880,7 +897,7 @@ def write_bitsets(list, item_size): total = 0 for c in range(1, MAX_UNICODE): - if other_case[c] != 0: + if other_case[c] != 0 or c in [0x0130, 0x0131]: # add the two chars that gain casing in Turkish if c - range_start > expected_size: range_size = c - range_start - 1 f.write(' 0x%04x, 0x%04x, /* %d */\n' % (range_start, c, range_size)) @@ -980,6 +997,6 @@ def write_bitsets(list, item_size): /* End of pcre2_ucd.c */ """) -f.close +f.close() # End diff --git a/src/pcre2.h.generic b/src/pcre2.h.generic index 5edad7df5..7b9184085 100644 --- a/src/pcre2.h.generic +++ b/src/pcre2.h.generic @@ -162,6 +162,7 @@ D is inspected during pcre2_dfa_match() execution #define PCRE2_EXTRA_PYTHON_OCTAL 0x00002000u /* C */ #define PCRE2_EXTRA_NO_BS0 0x00004000u /* C */ #define PCRE2_EXTRA_NEVER_CALLOUT 0x00008000u /* C */ +#define PCRE2_EXTRA_TURKISH_CASING 0x00010000u /* C */ /* These are for pcre2_jit_compile(). */ @@ -328,6 +329,9 @@ pcre2_pattern_convert(). */ #define PCRE2_ERROR_PATTERN_COMPILED_SIZE_TOO_BIG 201 #define PCRE2_ERROR_OVERSIZE_PYTHON_OCTAL 202 #define PCRE2_ERROR_CALLOUT_CALLER_DISABLED 203 +#define PCRE2_ERROR_EXTRA_CASING_REQUIRES_UNICODE 204 +#define PCRE2_ERROR_TURKISH_CASING_REQUIRES_UTF 205 +#define PCRE2_ERROR_EXTRA_CASING_INCOMPATIBLE 206 /* "Expected" matching error codes: no match and partial match. */ diff --git a/src/pcre2.h.in b/src/pcre2.h.in index 28d60b253..1ed873843 100644 --- a/src/pcre2.h.in +++ b/src/pcre2.h.in @@ -162,6 +162,7 @@ D is inspected during pcre2_dfa_match() execution #define PCRE2_EXTRA_PYTHON_OCTAL 0x00002000u /* C */ #define PCRE2_EXTRA_NO_BS0 0x00004000u /* C */ #define PCRE2_EXTRA_NEVER_CALLOUT 0x00008000u /* C */ +#define PCRE2_EXTRA_TURKISH_CASING 0x00010000u /* C */ /* These are for pcre2_jit_compile(). */ @@ -328,6 +329,9 @@ pcre2_pattern_convert(). 
*/ #define PCRE2_ERROR_PATTERN_COMPILED_SIZE_TOO_BIG 201 #define PCRE2_ERROR_OVERSIZE_PYTHON_OCTAL 202 #define PCRE2_ERROR_CALLOUT_CALLER_DISABLED 203 +#define PCRE2_ERROR_EXTRA_CASING_REQUIRES_UNICODE 204 +#define PCRE2_ERROR_TURKISH_CASING_REQUIRES_UTF 205 +#define PCRE2_ERROR_EXTRA_CASING_INCOMPATIBLE 206 /* "Expected" matching error codes: no match and partial match. */ diff --git a/src/pcre2_compile.c b/src/pcre2_compile.c index 08f27088b..bb834dc8d 100644 --- a/src/pcre2_compile.c +++ b/src/pcre2_compile.c @@ -666,7 +666,8 @@ are allowed. */ PCRE2_NO_DOTSTAR_ANCHOR|PCRE2_UCP|PCRE2_UNGREEDY) #define PUBLIC_LITERAL_COMPILE_EXTRA_OPTIONS \ - (PCRE2_EXTRA_MATCH_LINE|PCRE2_EXTRA_MATCH_WORD|PCRE2_EXTRA_CASELESS_RESTRICT) + (PCRE2_EXTRA_MATCH_LINE|PCRE2_EXTRA_MATCH_WORD| \ + PCRE2_EXTRA_CASELESS_RESTRICT|PCRE2_EXTRA_TURKISH_CASING) #define PUBLIC_COMPILE_EXTRA_OPTIONS \ (PUBLIC_LITERAL_COMPILE_EXTRA_OPTIONS| \ @@ -683,6 +684,7 @@ compatibility, (*UTFn) is supported in the relevant libraries, but (*UTF) is generic and always supported. */ enum { PSO_OPT, /* Value is an option bit */ + PSO_XOPT, /* Value is an xoption bit */ PSO_FLG, /* Value is a flag bit */ PSO_NL, /* Value is a newline type */ PSO_BSR, /* Value is a \R type */ @@ -711,6 +713,8 @@ static const pso pso_list[] = { { STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR, 18, PSO_OPTMZ, PCRE2_OPTIM_DOTSTAR_ANCHOR }, { STRING_NO_JIT_RIGHTPAR, 7, PSO_FLG, PCRE2_NOJIT }, { STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPTMZ, PCRE2_OPTIM_START_OPTIMIZE }, + { STRING_CASELESS_RESTRICT_RIGHTPAR, 18, PSO_XOPT, PCRE2_EXTRA_CASELESS_RESTRICT }, + { STRING_TURKISH_CASING_RIGHTPAR, 15, PSO_XOPT, PCRE2_EXTRA_TURKISH_CASING }, { STRING_LIMIT_HEAP_EQ, 11, PSO_LIMH, 0 }, { STRING_LIMIT_MATCH_EQ, 12, PSO_LIMM, 0 }, { STRING_LIMIT_DEPTH_EQ, 12, PSO_LIMD, 0 }, @@ -2835,8 +2839,8 @@ be quantified. */ /* Here's the actual function. 
*/ -static int parse_regex(PCRE2_SPTR ptr, uint32_t options, BOOL *has_lookbehind, - compile_block *cb) +static int parse_regex(PCRE2_SPTR ptr, uint32_t options, uint32_t xoptions, + BOOL *has_lookbehind, compile_block *cb) { uint32_t c; uint32_t delimiter; @@ -2851,7 +2855,6 @@ uint32_t *this_parsed_item = NULL; uint32_t *prev_parsed_item = NULL; uint32_t meta_quantifier = 0; uint32_t add_after_mark = 0; -uint32_t xoptions = cb->cx->extra_options; uint16_t nest_depth = 0; int after_manual_callout = 0; int expect_cond_assert = 0; @@ -5293,7 +5296,7 @@ restriction is in force). Sometimes we can just extend the original range. */ if ((options & PCRE2_CASELESS) != 0) { -#ifndef SUPPORT_UNICODE +#ifdef SUPPORT_UNICODE if ((options & (PCRE2_UTF|PCRE2_UCP)) == 0) #endif /* SUPPORT_UNICODE */ /* Not UTF mode */ @@ -5736,8 +5739,12 @@ for (;; pptr++) /* ===================================================================*/ /* Empty character classes are allowed if PCRE2_ALLOW_EMPTY_CLASS is set. Otherwise, an initial ']' is taken as a data character. When empty classes - are allowed, [] must always fail, so generate OP_FAIL, whereas [^] must - match any character, so generate OP_ALLANY. */ + are allowed, [] must generate an empty class - we have no dedicated opcode + to optimise the representation, but it's a rare case (the '(*FAIL)' + construct would be a clearer way for a pattern author to represent a + non-matching branch, but it does have different semantics to '[]' if both + are followed by a quantifier). The empty-negated [^] matches any character, + so is useful: generate OP_ALLANY for this. 
*/ case META_CLASS_EMPTY: case META_CLASS_EMPTY_NOT: @@ -5785,9 +5792,6 @@ for (;; pptr++) if (pptr[1] < META_END && pptr[2] == META_CLASS_END) { -#ifdef SUPPORT_UNICODE - uint32_t d; -#endif uint32_t c = pptr[1]; pptr += 2; /* Move on to class end */ @@ -5808,18 +5812,35 @@ for (;; pptr++) /* For caseless UTF or UCP mode, check whether this character has more than one other case. If so, generate a special OP_NOTPROP item instead of OP_NOTI. When restricted by PCRE2_EXTRA_CASELESS_RESTRICT, ignore any - caseless set that starts with an ASCII character. */ + caseless set that starts with an ASCII character. If the character is + affected by the special Turkish rules, hardcode the not-matching + characters using a caseset. */ #ifdef SUPPORT_UNICODE - if ((utf||ucp) && (options & PCRE2_CASELESS) != 0 && - (d = UCD_CASESET(c)) != 0 && - ((xoptions & PCRE2_EXTRA_CASELESS_RESTRICT) == 0 || - PRIV(ucd_caseless_sets)[d] > 127)) + if ((utf||ucp) && (options & PCRE2_CASELESS) != 0) { - *code++ = OP_NOTPROP; - *code++ = PT_CLIST; - *code++ = d; - break; /* We are finished with this class */ + uint32_t caseset; + + if ((xoptions & (PCRE2_EXTRA_TURKISH_CASING|PCRE2_EXTRA_CASELESS_RESTRICT)) == + PCRE2_EXTRA_TURKISH_CASING && + UCD_ANY_I(c)) + { + caseset = PRIV(ucd_turkish_dotted_i_caseset) + (UCD_DOTTED_I(c)? 0 : 3); + } + else if ((caseset = UCD_CASESET(c)) != 0 && + (xoptions & PCRE2_EXTRA_CASELESS_RESTRICT) != 0 && + PRIV(ucd_caseless_sets)[caseset] < 128) + { + caseset = 0; /* Ignore the caseless set if it's restricted. */ + } + + if (caseset != 0) + { + *code++ = OP_NOTPROP; + *code++ = PT_CLIST; + *code++ = caseset; + break; /* We are finished with this class */ + } } #endif /* Char has only one other (usable) case, or UCP not available */ @@ -5834,7 +5855,8 @@ for (;; pptr++) they are case partners. This can be optimized to generate a caseless single character match (which also sets first/required code units if relevant). 
When casing restrictions apply, ignore a caseless set if both characters - are ASCII. */ + are ASCII. When Turkish casing applies, an 'i' does not match its normal + Unicode "othercase". */ if (meta == META_CLASS && pptr[1] < META_END && pptr[2] < META_END && pptr[3] == META_CLASS_END) @@ -5842,9 +5864,12 @@ for (;; pptr++) uint32_t c = pptr[1]; #ifdef SUPPORT_UNICODE - if (UCD_CASESET(c) == 0 || - ((xoptions & PCRE2_EXTRA_CASELESS_RESTRICT) != 0 && - c < 128 && pptr[2] < 128)) + if ((UCD_CASESET(c) == 0 || + ((xoptions & PCRE2_EXTRA_CASELESS_RESTRICT) != 0 && + c < 128 && pptr[2] < 128)) && + !((xoptions & (PCRE2_EXTRA_TURKISH_CASING|PCRE2_EXTRA_CASELESS_RESTRICT)) == + PCRE2_EXTRA_TURKISH_CASING && + UCD_ANY_I(c))) #endif { uint32_t d; @@ -7189,8 +7214,10 @@ for (;; pptr++) PUT2INC(code, 0, index); PUT2INC(code, 0, count); if ((options & PCRE2_CASELESS) != 0) - *code++ = ((xoptions & PCRE2_EXTRA_CASELESS_RESTRICT) != 0)? - REFI_FLAG_CASELESS_RESTRICT : 0; + *code++ = (((xoptions & PCRE2_EXTRA_CASELESS_RESTRICT) != 0)? + REFI_FLAG_CASELESS_RESTRICT : 0) | + (((xoptions & PCRE2_EXTRA_TURKISH_CASING) != 0)? + REFI_FLAG_TURKISH_CASING : 0); } break; @@ -8146,8 +8173,10 @@ for (;; pptr++) *code++ = ((options & PCRE2_CASELESS) != 0)? OP_REFI : OP_REF; PUT2INC(code, 0, meta_arg); if ((options & PCRE2_CASELESS) != 0) - *code++ = ((xoptions & PCRE2_EXTRA_CASELESS_RESTRICT) != 0)? - REFI_FLAG_CASELESS_RESTRICT : 0; + *code++ = (((xoptions & PCRE2_EXTRA_CASELESS_RESTRICT) != 0)? + REFI_FLAG_CASELESS_RESTRICT : 0) | + (((xoptions & PCRE2_EXTRA_TURKISH_CASING) != 0)? + REFI_FLAG_TURKISH_CASING : 0); /* Update the map of back references, and keep the highest one. We could do this in parse_regex() for numerical back references, but not @@ -8343,15 +8372,28 @@ for (;; pptr++) /* For caseless UTF or UCP mode, check whether this character has more than one other case. If so, generate a special OP_PROP item instead of OP_CHARI. 
When casing restrictions apply, ignore caseless sets that start with an - ASCII character. */ + ASCII character. If the character is affected by the special Turkish rules, + hardcode the matching characters using a caseset. */ #ifdef SUPPORT_UNICODE if ((utf||ucp) && (options & PCRE2_CASELESS) != 0) { - uint32_t caseset = UCD_CASESET(meta); - if (caseset != 0 && - ((xoptions & PCRE2_EXTRA_CASELESS_RESTRICT) == 0 || - PRIV(ucd_caseless_sets)[caseset] > 127)) + uint32_t caseset; + + if ((xoptions & (PCRE2_EXTRA_TURKISH_CASING|PCRE2_EXTRA_CASELESS_RESTRICT)) == + PCRE2_EXTRA_TURKISH_CASING && + UCD_ANY_I(meta)) + { + caseset = PRIV(ucd_turkish_dotted_i_caseset) + (UCD_DOTTED_I(meta)? 0 : 3); + } + else if ((caseset = UCD_CASESET(meta)) != 0 && + (xoptions & PCRE2_EXTRA_CASELESS_RESTRICT) != 0 && + PRIV(ucd_caseless_sets)[caseset] < 128) + { + caseset = 0; /* Ignore the caseless set if it's restricted. */ + } + + if (caseset != 0) { *code++ = OP_PROP; *code++ = PT_CLIST; @@ -10269,6 +10311,7 @@ PCRE2_SIZE parsed_size_needed; /* Needed for parsed pattern */ uint32_t firstcuflags, reqcuflags; /* Type of first/req code unit */ uint32_t firstcu, reqcu; /* Value of first/req code unit */ uint32_t setflags = 0; /* NL and BSR set flags */ +uint32_t xoptions; /* Flags from context, modified */ uint32_t skipatstart; /* When checking (*UTF) etc */ uint32_t limit_heap = UINT32_MAX; @@ -10443,6 +10486,7 @@ non-zero-terminated patterns. 
*/ if (zero_terminated) VALGRIND_MAKE_MEM_NOACCESS(pattern + patlen, CU2BYTES(1)); #endif +xoptions = ccontext->extra_options; ptr = pattern; skipatstart = 0; @@ -10468,6 +10512,10 @@ if ((options & PCRE2_LITERAL) == 0) cb.external_options |= p->value; break; + case PSO_XOPT: + xoptions |= p->value; + break; + case PSO_FLG: setflags |= p->value; break; @@ -10591,6 +10639,31 @@ if (ucp && (cb.external_options & PCRE2_NEVER_UCP) != 0) goto HAD_EARLY_ERROR; } +/* PCRE2_EXTRA_TURKISH_CASING checks */ + +if ((xoptions & PCRE2_EXTRA_TURKISH_CASING) != 0) + { + if (!utf && !ucp) + { + errorcode = ERR104; + goto HAD_EARLY_ERROR; + } + +#if PCRE2_CODE_UNIT_WIDTH == 8 + if (!utf) + { + errorcode = ERR105; + goto HAD_EARLY_ERROR; + } +#endif + + if ((xoptions & PCRE2_EXTRA_CASELESS_RESTRICT) != 0) + { + errorcode = ERR106; + goto HAD_EARLY_ERROR; + } + } + /* Process the BSR setting. */ if (bsr == 0) bsr = ccontext->bsr_convention; @@ -10686,7 +10759,7 @@ cb.parsed_pattern_end = cb.parsed_pattern + parsed_size_needed + 1; /* Do the parsing scan. 
*/ -errorcode = parse_regex(ptr, cb.external_options, &has_lookbehind, &cb); +errorcode = parse_regex(ptr, cb.external_options, xoptions, &has_lookbehind, &cb); if (errorcode != 0) goto HAD_CB_ERROR; /* If there are any lookbehinds, scan the parsed pattern to figure out their @@ -10761,7 +10834,7 @@ pptr = cb.parsed_pattern; code = cworkspace; *code = OP_BRA; -(void)compile_regex(cb.external_options, ccontext->extra_options, &code, &pptr, +(void)compile_regex(cb.external_options, xoptions, &code, &pptr, &errorcode, 0, &firstcu, &firstcuflags, &reqcu, &reqcuflags, NULL, NULL, &cb, &length); @@ -10813,7 +10886,7 @@ re->blocksize = re_blocksize; re->magic_number = MAGIC_NUMBER; re->compile_options = options; re->overall_options = cb.external_options; -re->extra_options = ccontext->extra_options; +re->extra_options = xoptions; re->flags = PCRE2_CODE_UNIT_WIDTH/8 | cb.external_flags | setflags; re->limit_heap = limit_heap; re->limit_match = limit_match; @@ -10867,7 +10940,7 @@ of the function here. */ pptr = cb.parsed_pattern; code = (PCRE2_UCHAR *)codestart; *code = OP_BRA; -regexrc = compile_regex(re->overall_options, ccontext->extra_options, &code, +regexrc = compile_regex(re->overall_options, re->extra_options, &code, &pptr, &errorcode, 0, &firstcu, &firstcuflags, &reqcu, &reqcuflags, NULL, NULL, &cb, NULL); if (regexrc < 0) re->flags |= PCRE2_MATCH_EMPTY; @@ -11042,8 +11115,8 @@ if ((optim_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0) } /* The first code unit is > 128 in UTF or UCP mode, or > 255 otherwise. - In 8-bit UTF mode, codepoints in the range 128-255 are introductory code - points and cannot have another case, but if UCP is set they may do. */ + In 8-bit UTF mode, code units in the range 128-255 are introductory code + units and cannot have another case, but if UCP is set they may do. 
*/ #ifdef SUPPORT_UNICODE #if PCRE2_CODE_UNIT_WIDTH == 8 diff --git a/src/pcre2_compile.h b/src/pcre2_compile.h index d33ba845f..c09ffd12f 100644 --- a/src/pcre2_compile.h +++ b/src/pcre2_compile.h @@ -61,7 +61,7 @@ enum { ERR0 = COMPILE_ERROR_BASE, ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79, ERR80, ERR81, ERR82, ERR83, ERR84, ERR85, ERR86, ERR87, ERR88, ERR89, ERR90, ERR91, ERR92, ERR93, ERR94, ERR95, ERR96, ERR97, ERR98, ERR99, ERR100, - ERR101,ERR102,ERR103 }; + ERR101,ERR102,ERR103,ERR104,ERR105,ERR106 }; /* Code values for parsed patterns, which are stored in a vector of 32-bit unsigned ints. Values less than META_END are literal data values. The coding diff --git a/src/pcre2_compile_class.c b/src/pcre2_compile_class.c index bc9f8e0e3..080343b29 100644 --- a/src/pcre2_compile_class.c +++ b/src/pcre2_compile_class.c @@ -81,6 +81,7 @@ while (TRUE) #define PARSE_CLASS_UTF 0x1 #define PARSE_CLASS_CASELESS_UTF 0x2 #define PARSE_CLASS_RESTRICTED_UTF 0x4 +#define PARSE_CLASS_TURKISH_UTF 0x8 /* Get the range of nocase characters which includes the 'c' character passed as argument, or directly follows 'c'. */ @@ -145,10 +146,21 @@ while (c <= end) } /* Compute caseless set. */ - co = UCD_CASESET(c); - if (co != 0 && (!(options & PARSE_CLASS_RESTRICTED_UTF) - || PRIV(ucd_caseless_sets)[co] > 127)) + if ((options & (PARSE_CLASS_TURKISH_UTF|PARSE_CLASS_RESTRICTED_UTF)) == + PARSE_CLASS_TURKISH_UTF && + UCD_ANY_I(c)) + { + co = PRIV(ucd_turkish_dotted_i_caseset) + (UCD_DOTTED_I(c)? 0 : 3); + } + else if ((co = UCD_CASESET(c)) != 0 && + (options & PARSE_CLASS_RESTRICTED_UTF) != 0 && + PRIV(ucd_caseless_sets)[co] < 128) + { + co = 0; /* Ignore the caseless set if it's restricted. 
*/ + } + + if (co != 0) list = PRIV(ucd_caseless_sets) + co; else { @@ -447,6 +459,9 @@ if ((options & PCRE2_CASELESS) && (options & (PCRE2_UTF|PCRE2_UCP))) if (xoptions & PCRE2_EXTRA_CASELESS_RESTRICT) class_options |= PARSE_CLASS_RESTRICTED_UTF; + +if (xoptions & PCRE2_EXTRA_TURKISH_CASING) + class_options |= PARSE_CLASS_TURKISH_UTF; #endif /* Compute required space for the range. */ diff --git a/src/pcre2_error.c b/src/pcre2_error.c index f3b9490f4..196efc0a1 100644 --- a/src/pcre2_error.c +++ b/src/pcre2_error.c @@ -192,6 +192,10 @@ static const unsigned char compile_error_texts[] = "compiled pattern would be longer than the limit set by the application\0" "octal value given by \\ddd is greater than \\377 (forbidden by PCRE2_EXTRA_PYTHON_OCTAL)\0" "using callouts is disabled by the application\0" + "PCRE2_EXTRA_TURKISH_CASING require Unicode (UTF or UCP) mode\0" + /* 105 */ + "PCRE2_EXTRA_TURKISH_CASING requires UTF in 8-bit mode\0" + "PCRE2_EXTRA_TURKISH_CASING and PCRE2_EXTRA_CASELESS_RESTRICT are not compatible\0" ; /* Match-time and UTF error texts are in the same format. */ diff --git a/src/pcre2_internal.h b/src/pcre2_internal.h index 2ce93e036..e2e23a2ea 100644 --- a/src/pcre2_internal.h +++ b/src/pcre2_internal.h @@ -974,6 +974,8 @@ a positive value. */ #define STRING_NO_START_OPT_RIGHTPAR "NO_START_OPT)" #define STRING_NOTEMPTY_RIGHTPAR "NOTEMPTY)" #define STRING_NOTEMPTY_ATSTART_RIGHTPAR "NOTEMPTY_ATSTART)" +#define STRING_CASELESS_RESTRICT_RIGHTPAR "CASELESS_RESTRICT)" +#define STRING_TURKISH_CASING_RIGHTPAR "TURKISH_CASING)" #define STRING_LIMIT_HEAP_EQ "LIMIT_HEAP=" #define STRING_LIMIT_MATCH_EQ "LIMIT_MATCH=" #define STRING_LIMIT_DEPTH_EQ "LIMIT_DEPTH=" @@ -1277,6 +1279,8 @@ only. 
*/ #define STRING_NO_START_OPT_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS #define STRING_NOTEMPTY_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_RIGHT_PARENTHESIS #define STRING_NOTEMPTY_ATSTART_RIGHTPAR STR_N STR_O STR_T STR_E STR_M STR_P STR_T STR_Y STR_UNDERSCORE STR_A STR_T STR_S STR_T STR_A STR_R STR_T STR_RIGHT_PARENTHESIS +#define STRING_CASELESS_RESTRICT_RIGHTPAR STR_C STR_A STR_S STR_E STR_L STR_E STR_S STR_S STR_UNDERSCORE STR_R STR_E STR_S STR_T STR_R STR_I STR_C STR_T STR_RIGHT_PARENTHESIS +#define STRING_TURKISH_CASING_RIGHTPAR STR_T STR_U STR_R STR_K STR_I STR_S STR_H STR_UNDERSCORE STR_C STR_A STR_S STR_I STR_N STR_G STR_RIGHT_PARENTHESIS #define STRING_LIMIT_HEAP_EQ STR_L STR_I STR_M STR_I STR_T STR_UNDERSCORE STR_H STR_E STR_A STR_P STR_EQUALS_SIGN #define STRING_LIMIT_MATCH_EQ STR_L STR_I STR_M STR_I STR_T STR_UNDERSCORE STR_M STR_A STR_T STR_C STR_H STR_EQUALS_SIGN #define STRING_LIMIT_DEPTH_EQ STR_L STR_I STR_M STR_I STR_T STR_UNDERSCORE STR_D STR_E STR_P STR_T STR_H STR_EQUALS_SIGN @@ -1832,6 +1836,7 @@ in UTF-8 mode. The code that uses this table must know about such things. */ /* Constants used by OP_REFI and OP_DNREFI to control matching behaviour. */ #define REFI_FLAG_CASELESS_RESTRICT 0x1 +#define REFI_FLAG_TURKISH_CASING 0x2 /* ---------- Private structures that are mode-independent. ---------- */ @@ -1908,6 +1913,14 @@ typedef struct { #define UCD_SCRIPTX(ch) UCD_SCRIPTX_PROP(GET_UCD(ch)) #define UCD_BPROPS(ch) UCD_BPROPS_PROP(GET_UCD(ch)) #define UCD_BIDICLASS(ch) UCD_BIDICLASS_PROP(GET_UCD(ch)) +#define UCD_ANY_I(ch) \ + /* match any of the four characters 'i', 'I', U+0130, U+0131 */ \ + (((uint32_t)(ch) | 0x20u) == 0x69u || ((uint32_t)(ch) | 1u) == 0x0131u) +#define UCD_DOTTED_I(ch) \ + ((uint32_t)(ch) == 0x69u || (uint32_t)(ch) == 0x0130u) +#define UCD_FOLD_I_TURKISH(ch) \ + ((uint32_t)(ch) == 0x0130u ? 0x69u : \ + (uint32_t)(ch) == 0x49u ? 
0x0131u : (uint32_t)(ch)) /* The "scriptx" and bprops fields contain offsets into vectors of 32-bit words that form a bitmap representing a list of scripts or boolean properties. These @@ -1973,6 +1986,7 @@ extern const uint8_t PRIV(utf8_table4)[]; #define _pcre2_vspace_list PCRE2_SUFFIX(_pcre2_vspace_list_) #define _pcre2_ucd_boolprop_sets PCRE2_SUFFIX(_pcre2_ucd_boolprop_sets_) #define _pcre2_ucd_caseless_sets PCRE2_SUFFIX(_pcre2_ucd_caseless_sets_) +#define _pcre2_ucd_turkish_dotted_i_caseset PCRE2_SUFFIX(_pcre2_ucd_turkish_dotted_i_caseset_) #define _pcre2_ucd_nocase_ranges PCRE2_SUFFIX(_pcre2_ucd_nocase_ranges_) #define _pcre2_ucd_nocase_ranges_size PCRE2_SUFFIX(_pcre2_ucd_nocase_ranges_size_) #define _pcre2_ucd_digit_sets PCRE2_SUFFIX(_pcre2_ucd_digit_sets_) @@ -1999,6 +2013,7 @@ extern const uint32_t PRIV(hspace_list)[]; extern const uint32_t PRIV(vspace_list)[]; extern const uint32_t PRIV(ucd_boolprop_sets)[]; extern const uint32_t PRIV(ucd_caseless_sets)[]; +extern const uint32_t PRIV(ucd_turkish_dotted_i_caseset); extern const uint32_t PRIV(ucd_nocase_ranges)[]; extern const uint32_t PRIV(ucd_nocase_ranges_size); extern const uint32_t PRIV(ucd_digit_sets)[]; diff --git a/src/pcre2_jit_compile.c b/src/pcre2_jit_compile.c index 440507b92..70cfd685f 100644 --- a/src/pcre2_jit_compile.c +++ b/src/pcre2_jit_compile.c @@ -8226,9 +8226,10 @@ while (*cc != XCL_END) case PT_CLIST: other_cases = PRIV(ucd_caseless_sets) + cc[1]; - /* At least three characters are required. + /* At least two characters are required. Otherwise this case would be handled by the normal code path. */ - SLJIT_ASSERT(other_cases[0] != NOTACHAR && other_cases[1] != NOTACHAR && other_cases[2] != NOTACHAR); + SLJIT_ASSERT(other_cases[0] != NOTACHAR && other_cases[1] != NOTACHAR); + /* NOTACHAR is the unsigned maximum. */ SLJIT_ASSERT(other_cases[0] < other_cases[1] && other_cases[1] < other_cases[2]); /* Optimizing character pairs, if their difference is power of 2. 
*/ @@ -8247,6 +8248,8 @@ while (*cc != XCL_END) } else if (is_powerof2(other_cases[2] ^ other_cases[1])) { + SLJIT_ASSERT(other_cases[2] != NOTACHAR); + if (charoffset == 0) OP2(SLJIT_OR, TMP2, 0, TMP1, 0, SLJIT_IMM, other_cases[2] ^ other_cases[1]); else @@ -9428,6 +9431,8 @@ struct sljit_jump *nopartial; #if defined SUPPORT_UNICODE struct sljit_label *loop; struct sljit_label *caseless_loop; +struct sljit_jump *turkish_ascii_i = NULL; +struct sljit_jump *turkish_non_ascii_i = NULL; jump_list *no_match = NULL; int source_reg = COUNT_MATCH; int source_end_reg = ARGUMENTS; @@ -9450,7 +9455,7 @@ else OP1(SLJIT_MOV, TMP1, 0, SLJIT_MEM1(TMP2), 0); #if defined SUPPORT_UNICODE -if (common->utf && (*cc == OP_REFI || *cc == OP_DNREFI)) +if ((common->utf || common->ucp) && (*cc == OP_REFI || *cc == OP_DNREFI)) { SLJIT_ASSERT(common->iref_ptr != 0); @@ -9488,6 +9493,16 @@ if (common->utf && (*cc == OP_REFI || *cc == OP_DNREFI)) CMPTO(SLJIT_EQUAL, TMP1, 0, char1_reg, 0, loop); + if ((refi_flag & (REFI_FLAG_TURKISH_CASING|REFI_FLAG_CASELESS_RESTRICT)) == + REFI_FLAG_TURKISH_CASING) + { + OP2(SLJIT_OR, SLJIT_TMP_DEST_REG, 0, char1_reg, 0, SLJIT_IMM, 0x20); + turkish_ascii_i = CMP(SLJIT_EQUAL, SLJIT_TMP_DEST_REG, 0, SLJIT_IMM, 0x69); + + OP2(SLJIT_OR, SLJIT_TMP_DEST_REG, 0, char1_reg, 0, SLJIT_IMM, 0x1); + turkish_non_ascii_i = CMP(SLJIT_EQUAL, SLJIT_TMP_DEST_REG, 0, SLJIT_IMM, 0x131); + } + OP1(SLJIT_MOV, TMP3, 0, TMP1, 0); add_jump(compiler, &common->getucd, JUMP(SLJIT_FAST_CALL)); @@ -9503,12 +9518,13 @@ if (common->utf && (*cc == OP_REFI || *cc == OP_DNREFI)) OP2(SLJIT_ADD, TMP1, 0, TMP1, 0, TMP3, 0); CMPTO(SLJIT_EQUAL, TMP1, 0, char1_reg, 0, loop); - if (refi_flag & REFI_FLAG_CASELESS_RESTRICT) - add_jump(compiler, &no_match, CMP(SLJIT_LESS, char1_reg, 0, SLJIT_IMM, 128)); add_jump(compiler, &no_match, CMP(SLJIT_EQUAL, TMP2, 0, SLJIT_IMM, 0)); OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 2); OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, SLJIT_IMM, (sljit_sw)PRIV(ucd_caseless_sets)); + if 
(refi_flag & REFI_FLAG_CASELESS_RESTRICT) + add_jump(compiler, &no_match, CMP(SLJIT_LESS | SLJIT_32, SLJIT_MEM1(TMP2), 0, SLJIT_IMM, 128)); + caseless_loop = LABEL(); OP1(SLJIT_MOV_U32, TMP1, 0, SLJIT_MEM1(TMP2), 0); OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, SLJIT_IMM, sizeof(uint32_t)); @@ -9516,6 +9532,28 @@ if (common->utf && (*cc == OP_REFI || *cc == OP_DNREFI)) JUMPTO(SLJIT_EQUAL, loop); JUMPTO(SLJIT_LESS, caseless_loop); + if ((refi_flag & (REFI_FLAG_TURKISH_CASING|REFI_FLAG_CASELESS_RESTRICT)) == + REFI_FLAG_TURKISH_CASING) + { + add_jump(compiler, &no_match, JUMP(SLJIT_JUMP)); + JUMPHERE(turkish_ascii_i); + + OP2(SLJIT_LSHR, char1_reg, 0, char1_reg, 0, SLJIT_IMM, 5); + OP2(SLJIT_AND, char1_reg, 0, char1_reg, 0, SLJIT_IMM, 1); + OP2(SLJIT_XOR, char1_reg, 0, char1_reg, 0, SLJIT_IMM, 1); + OP2(SLJIT_ADD, char1_reg, 0, char1_reg, 0, SLJIT_IMM, 0x130); + CMPTO(SLJIT_EQUAL, TMP1, 0, char1_reg, 0, loop); + + add_jump(compiler, &no_match, JUMP(SLJIT_JUMP)); + JUMPHERE(turkish_non_ascii_i); + + OP2(SLJIT_AND, char1_reg, 0, char1_reg, 0, SLJIT_IMM, 1); + OP2(SLJIT_XOR, char1_reg, 0, char1_reg, 0, SLJIT_IMM, 1); + OP2(SLJIT_SHL, char1_reg, 0, char1_reg, 0, SLJIT_IMM, 5); + OP2(SLJIT_ADD, char1_reg, 0, char1_reg, 0, SLJIT_IMM, 0x49); + CMPTO(SLJIT_EQUAL, TMP1, 0, char1_reg, 0, loop); + } + set_jumps(no_match, LABEL()); if (common->mode == PCRE2_JIT_COMPLETE) JUMPHERE(partial); diff --git a/src/pcre2_match.c b/src/pcre2_match.c index bae7157d0..ce5d41be7 100644 --- a/src/pcre2_match.c +++ b/src/pcre2_match.c @@ -391,6 +391,7 @@ if (caseless) #if defined SUPPORT_UNICODE BOOL utf = (mb->poptions & PCRE2_UTF) != 0; BOOL caseless_restrict = (caseopts & REFI_FLAG_CASELESS_RESTRICT) != 0; + BOOL turkish_casing = !caseless_restrict && (caseopts & REFI_FLAG_TURKISH_CASING) != 0; if (utf || (mb->poptions & PCRE2_UCP) != 0) { @@ -422,8 +423,13 @@ if (caseless) d = *p++; } - ur = GET_UCD(d); - if (c != d && c != (uint32_t)((int)d + ur->other_case)) + if (turkish_casing && UCD_ANY_I(d)) + { 
+ c = UCD_FOLD_I_TURKISH(c); + d = UCD_FOLD_I_TURKISH(d); + if (c != d) return -1; /* No match */ + } + else if (c != d && c != (uint32_t)((int)d + (ur = GET_UCD(d))->other_case)) { const uint32_t *pp = PRIV(ucd_caseless_sets) + ur->caseset; diff --git a/src/pcre2_ucd.c b/src/pcre2_ucd.c index 1859ec600..4c5e5163b 100644 --- a/src/pcre2_ucd.c +++ b/src/pcre2_ucd.c @@ -142,14 +142,22 @@ const uint32_t PRIV(ucd_caseless_sets)[] = { 0x004b, 0x006b, 0x212a, NOTACHAR, 0x00c5, 0x00e5, 0x212b, NOTACHAR, 0x1c88, 0xa64a, 0xa64b, NOTACHAR, + 0x0069, 0x0130, NOTACHAR, + 0x0049, 0x0131, NOTACHAR, }; +/* This is the index, within ucd_caseless_sets, of the additional +Turkish case-equivalences. The dotted I ones are this offset; the +dotless I are +3 from here. */ + +const uint32_t PRIV(ucd_turkish_dotted_i_caseset) = 112; + /* When #included in pcre2test, we don't need the table of digit sets, nor the the large main UCD tables. */ #ifndef PCRE2_PCRE2TEST -/* This table contains character ranges, where the characters in the range has +/* This table contains character ranges, where the characters in the range have no other case. Both start and end values are excluded from the range. 
*/ const uint32_t PRIV(ucd_nocase_ranges)[] = { diff --git a/src/pcre2test.c b/src/pcre2test.c index 0593fc376..d8a966ffb 100644 --- a/src/pcre2test.c +++ b/src/pcre2test.c @@ -790,6 +790,7 @@ static modstruct modlist[] = { { "substitute_unknown_unset", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_UNKNOWN_UNSET, PO(control2) }, { "substitute_unset_empty", MOD_PND, MOD_CTL, CTL2_SUBSTITUTE_UNSET_EMPTY, PO(control2) }, { "tables", MOD_PAT, MOD_INT, 0, PO(tables_id) }, + { "turkish_casing", MOD_CTC, MOD_OPT, PCRE2_EXTRA_TURKISH_CASING, CO(extra_options) }, { "ucp", MOD_PATP, MOD_OPT, PCRE2_UCP, PO(options) }, { "ungreedy", MOD_PAT, MOD_OPT, PCRE2_UNGREEDY, PO(options) }, { "use_length", MOD_PAT, MOD_CTL, CTL_USE_LENGTH, PO(control) }, @@ -4394,7 +4395,7 @@ show_compile_extra_options(uint32_t options, const char *before, const char *after) { if (options == 0) fprintf(outfile, "%s %s", before, after); -else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s", +else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s", before, ((options & PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK) != 0) ? " allow_lookaround_bsk" : "", ((options & PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) != 0)? " allow_surrogate_escapes" : "", @@ -4412,6 +4413,7 @@ else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s", ((options & PCRE2_EXTRA_NEVER_CALLOUT) != 0)? " never_callout" : "", ((options & PCRE2_EXTRA_NO_BS0) != 0)? " no_bs0" : "", ((options & PCRE2_EXTRA_PYTHON_OCTAL) != 0)? " python_octal" : "", + ((options & PCRE2_EXTRA_TURKISH_CASING) != 0)? " turkish_casing" : "", after); } diff --git a/testdata/testinput10 b/testdata/testinput10 index 5e4e383aa..96b601ecd 100644 --- a/testdata/testinput10 +++ b/testdata/testinput10 @@ -663,6 +663,11 @@ /(..)(*scs:(1)ab$)/match_invalid_utf ab\x80cde +/(.) \1/i,ucp + i I + +/(.) 
\1/i,ucp,turkish_casing + # python_octal /\400/ diff --git a/testdata/testinput11 b/testdata/testinput11 index 61656e4d5..4591da4c0 100644 --- a/testdata/testinput11 +++ b/testdata/testinput11 @@ -390,4 +390,6 @@ /[\x00-\x2f\x11-\xff]*?!/B abcd!e +/i/turkish_casing + # End of testinput11 diff --git a/testdata/testinput12 b/testdata/testinput12 index 9cf702a65..608a48557 100644 --- a/testdata/testinput12 +++ b/testdata/testinput12 @@ -579,8 +579,75 @@ \= Expect no match \x{17f} +/(.) \1/i,ucp + i I + +/(.) \1/i,ucp,turkish_casing +\= Expect no match + i I + +/(.) \1/i,ucp + i I + \x{212a} k +\= Expect no match + i \x{0130} + \x{0131} I + +/(.) \1/i,ucp,turkish_casing + \x{212a} k + i \x{0130} + \x{0131} I +\= Expect no match + i I + +/(.) (?r:\1)/i,ucp,turkish_casing + i I +\= Expect no match + i \x{0130} + \x{0131} I + \x{212a} k + +/[a-z][^i]I/ucp,turkish_casing + bII + b\x{0130}I + b\x{0131}I +\= Expect no match + biI + +/[a-z][^i]I/i,ucp,turkish_casing + b\x{0131}I + bII +\= Expect no match + biI + b\x{0130}I + +/[a-z](?r:[^i])I/i,ucp,turkish_casing + b\x{0131}I + b\x{0130}I +\= Expect no match + bII + biI + +/b(?r:[\x{00FF}-\x{FFEE}])/i,ucp,turkish_casing + b\x{0130} + b\x{0131} + B\x{212a} +\= Expect no match + bi + bI + bk + # ---------------------------------------------------- +/b[\x{00FF}-\x{FFEE}]/ir + b\x{0130} + b\x{0131} + B\x{212a} +\= Expect no match + bi + bI + bk + # Quantifier after a literal that has the value of META_ACCEPT (not UTF). This # fails in 16-bit mode, but is OK for 32-bit. diff --git a/testdata/testinput5 b/testdata/testinput5 index 25bff66f1..64e04aa2f 100644 --- a/testdata/testinput5 +++ b/testdata/testinput5 @@ -2337,6 +2337,213 @@ # End caseless restrict tests +# TESTS for PCRE2_EXTRA_TURKISH_CASING - again, tests with and without. 
+ +/i/i,utf + i + I +\= Expect no match + \x{0130} + \x{0131} + +/i/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +/I/i,utf + i + I +\= Expect no match + \x{0130} + \x{0131} + +/I/i,utf,turkish_casing + I + \x{0131} +\= Expect no match + i + \x{0130} + +/\x{0130}/i,utf + \x{0130} +\= Expect no match + i + I + \x{0131} + +/\x{0130}/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +/\x{0131}/i,utf + \x{0131} +\= Expect no match + i + I + \x{0130} + +/\x{0131}/i,utf,turkish_casing + I + \x{0131} +\= Expect no match + i + \x{0130} + +/[i]/i,utf + i + I +\= Expect no match + \x{0130} + \x{0131} + +/[i]/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +/[^i]/i,utf + \x{0130} + \x{0131} +\= Expect no match + i + I + +/[^i]/i,utf,turkish_casing + I + \x{0131} +\= Expect no match + i + \x{0130} + +/[\x{0130}]/i,utf + \x{0130} +\= Expect no match + i + I + \x{0131} + +/[\x{0130}]/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +/[\x{0120}-\x{0130}]/i,utf + \x{0130} +\= Expect no match + i + I + \x{0131} + +/[\x{0120}-\x{0130}]/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +/[zi]/i,utf + i + I +\= Expect no match + \x{0130} + \x{0131} + +/[zi]/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +/[z\x{0130}]/i,utf + \x{0130} +\= Expect no match + i + I + \x{0131} + +/[z\x{0130}]/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +/[iI]/i,utf + i + I +\= Expect no match + \x{0130} + \x{0131} + +/[iI]/i,utf,turkish_casing + i + I + \x{0130} + \x{0131} + +/[i\x{0130}]/i,utf + i + I + \x{0130} +\= Expect no match + \x{0131} + +/[i\x{0130}]/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +/(.) \1/i,utf + i I +\= Expect no match + i \x{0130} + \x{0131} I + +/(*TURKISH_CASING)(.) \1/i,utf + i \x{0130} + \x{0131} I +\= Expect no match + i I + +/(.) 
\1/i,utf,turkish_casing + i \x{0130} + \x{0131} I +\= Expect no match + i I + +/i/i,utf,caseless_restrict,turkish_casing + +/i/i,turkish_casing + +/i/i,utf,caseless_restrict + i + +/i/i,ucp,caseless_restrict + i + +/b(?r:[\x{00FF}-\x{FFEE}])/i,utf,turkish_casing + b\x{0130} + b\x{0131} +\= Expect no match + bi + bI + bk + +# End Turkish casing tests + # TESTS for PCRE2_EXTRA_ASCII_xxx - again, tests with and without. # DIGITS diff --git a/testdata/testinput7 b/testdata/testinput7 index eba49cc27..d91ea2854 100644 --- a/testdata/testinput7 +++ b/testdata/testinput7 @@ -2297,6 +2297,163 @@ # End caseless restrict tests +# TESTS for PCRE2_EXTRA_TURKISH_CASING - again, tests with and without. + +/i/i,utf + i + I +\= Expect no match + \x{0130} + \x{0131} + +/i/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +/I/i,utf + i + I +\= Expect no match + \x{0130} + \x{0131} + +/I/i,utf,turkish_casing + I + \x{0131} +\= Expect no match + i + \x{0130} + +/\x{0130}/i,utf + \x{0130} +\= Expect no match + i + I + \x{0131} + +/\x{0130}/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +/\x{0131}/i,utf + \x{0131} +\= Expect no match + i + I + \x{0130} + +/\x{0131}/i,utf,turkish_casing + I + \x{0131} +\= Expect no match + i + \x{0130} + +/[i]/i,utf + i + I +\= Expect no match + \x{0130} + \x{0131} + +/[i]/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +/[\x{0130}]/i,utf + \x{0130} +\= Expect no match + i + I + \x{0131} + +/[\x{0130}]/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +/[\x{0120}-\x{0130}]/i,utf + \x{0130} +\= Expect no match + i + I + \x{0131} + +/[\x{0120}-\x{0130}]/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +/[zi]/i,utf + i + I +\= Expect no match + \x{0130} + \x{0131} + +/[zi]/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +/[z\x{0130}]/i,utf + \x{0130} +\= Expect no match + i + I + \x{0131} + 
+/[z\x{0130}]/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +/[iI]/i,utf + i + I +\= Expect no match + \x{0130} + \x{0131} + +/[iI]/i,utf,turkish_casing + i + I + \x{0130} + \x{0131} + +/[i\x{0130}]/i,utf + i + I + \x{0130} +\= Expect no match + \x{0131} + +/[i\x{0130}]/i,utf,turkish_casing + i + \x{0130} +\= Expect no match + I + \x{0131} + +# End Turkish casing tests + # TESTS for PCRE2_EXTRA_ASCII_xxx - again, tests with and without. # DIGITS diff --git a/testdata/testinput9 b/testdata/testinput9 index f2f50033f..017782bc3 100644 --- a/testdata/testinput9 +++ b/testdata/testinput9 @@ -271,4 +271,6 @@ /abc/substitute_extended,replace=>\o{012345}< abc +/i/turkish_casing + # End of testinput9 diff --git a/testdata/testoutput10 b/testdata/testoutput10 index d240f8e3e..1e6fa92b9 100644 --- a/testdata/testoutput10 +++ b/testdata/testoutput10 @@ -1939,6 +1939,14 @@ No match 0: ab 1: ab +/(.) \1/i,ucp + i I + 0: i I + 1: i + +/(.) \1/i,ucp,turkish_casing +Failed: error 205 at offset 0: PCRE2_EXTRA_TURKISH_CASING requires UTF in 8-bit mode + # python_octal /\400/ diff --git a/testdata/testoutput11-16 b/testdata/testoutput11-16 index 9f69cbd4b..8448e76ff 100644 --- a/testdata/testoutput11-16 +++ b/testdata/testoutput11-16 @@ -716,4 +716,7 @@ Failed: error 134 at offset 34: character code point value in \x{} or \o{} is to abcd!e 0: abcd! +/i/turkish_casing +Failed: error 204 at offset 0: PCRE2_EXTRA_TURKISH_CASING require Unicode (UTF or UCP) mode + # End of testinput11 diff --git a/testdata/testoutput11-32 b/testdata/testoutput11-32 index 2718f72f6..a27a63038 100644 --- a/testdata/testoutput11-32 +++ b/testdata/testoutput11-32 @@ -729,4 +729,7 @@ Subject length lower bound = 1 abcd!e 0: abcd! 
+/i/turkish_casing +Failed: error 204 at offset 0: PCRE2_EXTRA_TURKISH_CASING require Unicode (UTF or UCP) mode + # End of testinput11 diff --git a/testdata/testoutput12-16 b/testdata/testoutput12-16 index d4f37a128..cc411444a 100644 --- a/testdata/testoutput12-16 +++ b/testdata/testoutput12-16 @@ -1819,8 +1819,120 @@ No match \x{17f} No match +/(.) \1/i,ucp + i I + 0: i I + 1: i + +/(.) \1/i,ucp,turkish_casing +\= Expect no match + i I +No match + +/(.) \1/i,ucp + i I + 0: i I + 1: i + \x{212a} k + 0: \x{212a} k + 1: \x{212a} +\= Expect no match + i \x{0130} +No match + \x{0131} I +No match + +/(.) \1/i,ucp,turkish_casing + \x{212a} k + 0: \x{212a} k + 1: \x{212a} + i \x{0130} + 0: i \x{130} + 1: i + \x{0131} I + 0: \x{131} I + 1: \x{131} +\= Expect no match + i I +No match + +/(.) (?r:\1)/i,ucp,turkish_casing + i I + 0: i I + 1: i +\= Expect no match + i \x{0130} +No match + \x{0131} I +No match + \x{212a} k +No match + +/[a-z][^i]I/ucp,turkish_casing + bII + 0: bII + b\x{0130}I + 0: b\x{130}I + b\x{0131}I + 0: b\x{131}I +\= Expect no match + biI +No match + +/[a-z][^i]I/i,ucp,turkish_casing + b\x{0131}I + 0: b\x{131}I + bII + 0: bII +\= Expect no match + biI +No match + b\x{0130}I +No match + +/[a-z](?r:[^i])I/i,ucp,turkish_casing + b\x{0131}I + 0: b\x{131}I + b\x{0130}I + 0: b\x{130}I +\= Expect no match + bII +No match + biI +No match + +/b(?r:[\x{00FF}-\x{FFEE}])/i,ucp,turkish_casing + b\x{0130} + 0: b\x{130} + b\x{0131} + 0: b\x{131} + B\x{212a} + 0: B\x{212a} +\= Expect no match + bi +No match + bI +No match + bk +No match + # ---------------------------------------------------- +/b[\x{00FF}-\x{FFEE}]/ir + b\x{0130} + 0: b\x{130} + b\x{0131} + 0: b\x{131} + B\x{212a} + 0: B\x{212a} +\= Expect no match + bi +No match + bI +No match + bk +No match + # Quantifier after a literal that has the value of META_ACCEPT (not UTF). This # fails in 16-bit mode, but is OK for 32-bit. 
diff --git a/testdata/testoutput12-32 b/testdata/testoutput12-32 index c6eac8ae3..4beb9499e 100644 --- a/testdata/testoutput12-32 +++ b/testdata/testoutput12-32 @@ -1817,8 +1817,120 @@ No match \x{17f} No match +/(.) \1/i,ucp + i I + 0: i I + 1: i + +/(.) \1/i,ucp,turkish_casing +\= Expect no match + i I +No match + +/(.) \1/i,ucp + i I + 0: i I + 1: i + \x{212a} k + 0: \x{212a} k + 1: \x{212a} +\= Expect no match + i \x{0130} +No match + \x{0131} I +No match + +/(.) \1/i,ucp,turkish_casing + \x{212a} k + 0: \x{212a} k + 1: \x{212a} + i \x{0130} + 0: i \x{130} + 1: i + \x{0131} I + 0: \x{131} I + 1: \x{131} +\= Expect no match + i I +No match + +/(.) (?r:\1)/i,ucp,turkish_casing + i I + 0: i I + 1: i +\= Expect no match + i \x{0130} +No match + \x{0131} I +No match + \x{212a} k +No match + +/[a-z][^i]I/ucp,turkish_casing + bII + 0: bII + b\x{0130}I + 0: b\x{130}I + b\x{0131}I + 0: b\x{131}I +\= Expect no match + biI +No match + +/[a-z][^i]I/i,ucp,turkish_casing + b\x{0131}I + 0: b\x{131}I + bII + 0: bII +\= Expect no match + biI +No match + b\x{0130}I +No match + +/[a-z](?r:[^i])I/i,ucp,turkish_casing + b\x{0131}I + 0: b\x{131}I + b\x{0130}I + 0: b\x{130}I +\= Expect no match + bII +No match + biI +No match + +/b(?r:[\x{00FF}-\x{FFEE}])/i,ucp,turkish_casing + b\x{0130} + 0: b\x{130} + b\x{0131} + 0: b\x{131} + B\x{212a} + 0: B\x{212a} +\= Expect no match + bi +No match + bI +No match + bk +No match + # ---------------------------------------------------- +/b[\x{00FF}-\x{FFEE}]/ir + b\x{0130} + 0: b\x{130} + b\x{0131} + 0: b\x{131} + B\x{212a} + 0: B\x{212a} +\= Expect no match + bi +No match + bI +No match + bk +No match + # Quantifier after a literal that has the value of META_ACCEPT (not UTF). This # fails in 16-bit mode, but is OK for 32-bit. 
diff --git a/testdata/testoutput5 b/testdata/testoutput5 index bdcb1a619..1f1a79311 100644 --- a/testdata/testoutput5 +++ b/testdata/testoutput5 @@ -5312,6 +5312,332 @@ No match # End caseless restrict tests +# TESTS for PCRE2_EXTRA_TURKISH_CASING - again, tests with and without. + +/i/i,utf + i + 0: i + I + 0: I +\= Expect no match + \x{0130} +No match + \x{0131} +No match + +/i/i,utf,turkish_casing + i + 0: i + \x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +/I/i,utf + i + 0: i + I + 0: I +\= Expect no match + \x{0130} +No match + \x{0131} +No match + +/I/i,utf,turkish_casing + I + 0: I + \x{0131} + 0: \x{131} +\= Expect no match + i +No match + \x{0130} +No match + +/\x{0130}/i,utf + \x{0130} + 0: \x{130} +\= Expect no match + i +No match + I +No match + \x{0131} +No match + +/\x{0130}/i,utf,turkish_casing + i + 0: i + \x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +/\x{0131}/i,utf + \x{0131} + 0: \x{131} +\= Expect no match + i +No match + I +No match + \x{0130} +No match + +/\x{0131}/i,utf,turkish_casing + I + 0: I + \x{0131} + 0: \x{131} +\= Expect no match + i +No match + \x{0130} +No match + +/[i]/i,utf + i + 0: i + I + 0: I +\= Expect no match + \x{0130} +No match + \x{0131} +No match + +/[i]/i,utf,turkish_casing + i + 0: i + \x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +/[^i]/i,utf + \x{0130} + 0: \x{130} + \x{0131} + 0: \x{131} +\= Expect no match + i +No match + I +No match + +/[^i]/i,utf,turkish_casing + I + 0: I + \x{0131} + 0: \x{131} +\= Expect no match + i +No match + \x{0130} +No match + +/[\x{0130}]/i,utf + \x{0130} + 0: \x{130} +\= Expect no match + i +No match + I +No match + \x{0131} +No match + +/[\x{0130}]/i,utf,turkish_casing + i + 0: i + \x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +/[\x{0120}-\x{0130}]/i,utf + \x{0130} + 0: \x{130} +\= Expect no match + i +No match + I +No match + \x{0131} +No match + 
+/[\x{0120}-\x{0130}]/i,utf,turkish_casing + i + 0: i + \x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +/[zi]/i,utf + i + 0: i + I + 0: I +\= Expect no match + \x{0130} +No match + \x{0131} +No match + +/[zi]/i,utf,turkish_casing + i + 0: i + \x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +/[z\x{0130}]/i,utf + \x{0130} + 0: \x{130} +\= Expect no match + i +No match + I +No match + \x{0131} +No match + +/[z\x{0130}]/i,utf,turkish_casing + i + 0: i + \x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +/[iI]/i,utf + i + 0: i + I + 0: I +\= Expect no match + \x{0130} +No match + \x{0131} +No match + +/[iI]/i,utf,turkish_casing + i + 0: i + I + 0: I + \x{0130} + 0: \x{130} + \x{0131} + 0: \x{131} + +/[i\x{0130}]/i,utf + i + 0: i + I + 0: I + \x{0130} + 0: \x{130} +\= Expect no match + \x{0131} +No match + +/[i\x{0130}]/i,utf,turkish_casing + i + 0: i + \x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +/(.) \1/i,utf + i I + 0: i I + 1: i +\= Expect no match + i \x{0130} +No match + \x{0131} I +No match + +/(*TURKISH_CASING)(.) \1/i,utf + i \x{0130} + 0: i \x{130} + 1: i + \x{0131} I + 0: \x{131} I + 1: \x{131} +\= Expect no match + i I +No match + +/(.) 
\1/i,utf,turkish_casing + i \x{0130} + 0: i \x{130} + 1: i + \x{0131} I + 0: \x{131} I + 1: \x{131} +\= Expect no match + i I +No match + +/i/i,utf,caseless_restrict,turkish_casing +Failed: error 206 at offset 0: PCRE2_EXTRA_TURKISH_CASING and PCRE2_EXTRA_CASELESS_RESTRICT are not compatible + +/i/i,turkish_casing +Failed: error 204 at offset 0: PCRE2_EXTRA_TURKISH_CASING require Unicode (UTF or UCP) mode + +/i/i,utf,caseless_restrict + i + 0: i + +/i/i,ucp,caseless_restrict + i + 0: i + +/b(?r:[\x{00FF}-\x{FFEE}])/i,utf,turkish_casing + b\x{0130} + 0: b\x{130} + b\x{0131} + 0: b\x{131} +\= Expect no match + bi +No match + bI +No match + bk +No match + +# End Turkish casing tests + # TESTS for PCRE2_EXTRA_ASCII_xxx - again, tests with and without. # DIGITS diff --git a/testdata/testoutput7 b/testdata/testoutput7 index e1c45e559..eb27100d2 100644 --- a/testdata/testoutput7 +++ b/testdata/testoutput7 @@ -3889,6 +3889,251 @@ No match # End caseless restrict tests +# TESTS for PCRE2_EXTRA_TURKISH_CASING - again, tests with and without. 
+ +/i/i,utf + i + 0: i + I + 0: I +\= Expect no match + \x{0130} +No match + \x{0131} +No match + +/i/i,utf,turkish_casing + i + 0: i + \x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +/I/i,utf + i + 0: i + I + 0: I +\= Expect no match + \x{0130} +No match + \x{0131} +No match + +/I/i,utf,turkish_casing + I + 0: I + \x{0131} + 0: \x{131} +\= Expect no match + i +No match + \x{0130} +No match + +/\x{0130}/i,utf + \x{0130} + 0: \x{130} +\= Expect no match + i +No match + I +No match + \x{0131} +No match + +/\x{0130}/i,utf,turkish_casing + i + 0: i + \x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +/\x{0131}/i,utf + \x{0131} + 0: \x{131} +\= Expect no match + i +No match + I +No match + \x{0130} +No match + +/\x{0131}/i,utf,turkish_casing + I + 0: I + \x{0131} + 0: \x{131} +\= Expect no match + i +No match + \x{0130} +No match + +/[i]/i,utf + i + 0: i + I + 0: I +\= Expect no match + \x{0130} +No match + \x{0131} +No match + +/[i]/i,utf,turkish_casing + i + 0: i + \x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +/[\x{0130}]/i,utf + \x{0130} + 0: \x{130} +\= Expect no match + i +No match + I +No match + \x{0131} +No match + +/[\x{0130}]/i,utf,turkish_casing + i + 0: i + \x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +/[\x{0120}-\x{0130}]/i,utf + \x{0130} + 0: \x{130} +\= Expect no match + i +No match + I +No match + \x{0131} +No match + +/[\x{0120}-\x{0130}]/i,utf,turkish_casing + i + 0: i + \x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +/[zi]/i,utf + i + 0: i + I + 0: I +\= Expect no match + \x{0130} +No match + \x{0131} +No match + +/[zi]/i,utf,turkish_casing + i + 0: i + \x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +/[z\x{0130}]/i,utf + \x{0130} + 0: \x{130} +\= Expect no match + i +No match + I +No match + \x{0131} +No match + +/[z\x{0130}]/i,utf,turkish_casing + i + 0: i + 
\x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +/[iI]/i,utf + i + 0: i + I + 0: I +\= Expect no match + \x{0130} +No match + \x{0131} +No match + +/[iI]/i,utf,turkish_casing + i + 0: i + I + 0: I + \x{0130} + 0: \x{130} + \x{0131} + 0: \x{131} + +/[i\x{0130}]/i,utf + i + 0: i + I + 0: I + \x{0130} + 0: \x{130} +\= Expect no match + \x{0131} +No match + +/[i\x{0130}]/i,utf,turkish_casing + i + 0: i + \x{0130} + 0: \x{130} +\= Expect no match + I +No match + \x{0131} +No match + +# End Turkish casing tests + # TESTS for PCRE2_EXTRA_ASCII_xxx - again, tests with and without. # DIGITS diff --git a/testdata/testoutput9 b/testdata/testoutput9 index 8556c9e14..f83cb358e 100644 --- a/testdata/testoutput9 +++ b/testdata/testoutput9 @@ -381,4 +381,7 @@ Failed: error -57 at offset 5 in replacement: bad escape sequence in replacement abc Failed: error -57 at offset 10 in replacement: bad escape sequence in replacement string +/i/turkish_casing +Failed: error 204 at offset 0: PCRE2_EXTRA_TURKISH_CASING require Unicode (UTF or UCP) mode + # End of testinput9
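Reviewer note: the matching rule that the interpreter hunk in src/pcre2_match.c applies to back references can be sketched in isolation as follows. The two macros are copied from the src/pcre2_internal.h hunk of this patch. The helper `turkish_i_equal` and its -1 "not handled here" return value are assumptions made for illustration only and are not part of PCRE2. This is a minimal sketch of the folding step, not the library's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the src/pcre2_internal.h hunk of this patch. */
#define UCD_ANY_I(ch) \
  (((uint32_t)(ch) | 0x20u) == 0x69u || ((uint32_t)(ch) | 1u) == 0x0131u)
#define UCD_FOLD_I_TURKISH(ch) \
  ((uint32_t)(ch) == 0x0130u ? 0x69u : \
   (uint32_t)(ch) == 0x49u ? 0x0131u : (uint32_t)(ch))

/* Hypothetical helper, not in PCRE2. Compares c against d under the
   Turkish rules for the four I characters:
     returns  1 if they are case-equivalent,
     returns  0 if they are not,
     returns -1 if d is not one of 'i', 'I', U+0130, U+0131, in which
                case a real matcher would fall back to the ordinary
                UCD other_case and caseset lookups. */
static int turkish_i_equal(uint32_t c, uint32_t d)
{
  if (!UCD_ANY_I(d)) return -1;

  /* Fold U+0130 to 'i' and 'I' to U+0131, then compare directly. */
  return UCD_FOLD_I_TURKISH(c) == UCD_FOLD_I_TURKISH(d);
}
```

With these definitions, `turkish_i_equal('i', 0x0130)` and `turkish_i_equal(0x0131, 'I')` report equivalence while `turkish_i_equal('i', 'I')` does not, which agrees with the `(.) \1` expectations under `turkish_casing` in testdata/testinput5 and testdata/testinput12.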