File tidies for 10.43-RC1 release

PCRE2Project · Dec 28, 2023 · aadef0c · carenas · Jan 2, 2024 · aadef0c
1 parent 2bba84b
commit aadef0c

Showing 18 changed files with 450 additions and 379 deletions.
diff --git a/AUTHORS b/AUTHORS
@@ -8,7 +8,7 @@ Email domain:     gmail.com
 Retired from University of Cambridge Computing Service,
 Cambridge, England.
 
-Copyright (c) 1997-2022 University of Cambridge
+Copyright (c) 1997-2023 University of Cambridge
 All rights reserved
 
 
@@ -19,7 +19,7 @@ Written by:       Zoltan Herczeg
 Email local part: hzmester
 Email domain:     freemail.hu
 
-Copyright(c) 2010-2022 Zoltan Herczeg
+Copyright(c) 2010-2023 Zoltan Herczeg
 All rights reserved.
 
 
@@ -30,7 +30,7 @@ Written by:       Zoltan Herczeg
 Email local part: hzmester
 Email domain:     freemail.hu
 
-Copyright(c) 2009-2022 Zoltan Herczeg
+Copyright(c) 2009-2023 Zoltan Herczeg
 All rights reserved.
 
 ####
diff --git a/ChangeLog b/ChangeLog
@@ -5,8 +5,8 @@ Before the move to GitHub, this was the only record of changes to PCRE2. Now
 there is often more detail in the pull requests.
 
 
-Version 10.43 xx-xxx-202x
--------------------------
+Version 10.43 27-December-2023
+------------------------------
 
 1. The test program added by change 2 of 10.42 didn't work when the default
 newline setting didn't include \n as a newline. One test needed (*LF) to ensure

diff --git a/LICENCE b/LICENCE
@@ -26,7 +26,7 @@ Email domain:     gmail.com
 Retired from University of Cambridge Computing Service,
 Cambridge, England.
 
-Copyright (c) 1997-2022 University of Cambridge
+Copyright (c) 1997-2023 University of Cambridge
 All rights reserved.
 
 
@@ -37,7 +37,7 @@ Written by:       Zoltan Herczeg
 Email local part: hzmester
 Email domain:     freemail.hu
 
-Copyright(c) 2010-2022 Zoltan Herczeg
+Copyright(c) 2010-2023 Zoltan Herczeg
 All rights reserved.
 
 
@@ -48,7 +48,7 @@ Written by:       Zoltan Herczeg
 Email local part: hzmester
 Email domain:     freemail.hu
 
-Copyright(c) 2009-2022 Zoltan Herczeg
+Copyright(c) 2009-2023 Zoltan Herczeg
 All rights reserved.
 
 

diff --git a/NEWS b/NEWS
@@ -2,6 +2,52 @@ News about PCRE2 releases
 -------------------------
 
 
+Version 10.43 27-December-2023
+------------------------------
+
+There are quite a lot of changes in this release (see ChangeLog and git log for
+a list). Those that are not bugfixes or code tidies are:
+
+* A new function pcre2_get_match_data_heapframes_size() for finer heap control.
+
+* New option flags to restrict the interaction between ASCII and non-ASCII
+  characters for caseless matching and \d and friends. There are also new
+  pattern constructs to control these flags from within a pattern.
+
+* Upgrade to Unicode 15.0.0.
+
+* Treat a NULL pattern with zero length as an empty string.
+
+* Added support for limited-length variable-length lookbehind assertions, with
+  a default maximum length of 255 characters (same as Perl) but with a function
+  to adjust the limit.
+
+* Added LoongArch support to the JIT.
+
+* Perl changed the meaning of (for example) {,3} which was not previously
+  recognized as a quantifier. Now it means {0,3} and PCRE2 has also changed.
+  Note that {,} is still not a quantifier.
+
+* Following Perl, allow spaces and tabs after { and before } in all Perl-
+  compatible items that use braces, and also around commas in quantifiers. The
+  one exception in PCRE2 is \u{...}, which is from ECMAScript, not Perl, and
+  PCRE2 follows ECMAScript usage.
+
+* Changed the meaning of \w and its synonyms and derivatives (\b and \B) in UCP
+  mode to follow Perl. It now matches characters whose general categories are L
+  or N or whose particular categories are Mn (non-spacing mark) or Pc
+  (connector punctuation).
+
+* Changed the default meaning of [:xdigit:] in UCP mode to follow Perl. It now
+  matches the "fullwidth" versions of hex digits. PCRE2_EXTRA_ASCII_DIGIT can
+  be used to keep it ASCII only.
+
+* Make PCRE2_UCP the default in UTF mode in pcre2grep and add --no-ucp,
+  --case-restrict and --posix-digit.
+
+* Add --group-separator and --no-group-separator to pcre2grep.
+
+
 Version 10.42 11-December-2022
 ------------------------------
 

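The quantifier rules listed in the NEWS entry above ({,3} now meaning {0,3}, {,} still not a quantifier, optional spaces and tabs inside the braces and around the comma) can be sketched as a small parser. This is an illustration only, not PCRE2's own parsing code: the name `parse_quantifier` and the regular expression are assumptions made for the example, and it does not check that the minimum is no greater than the maximum.

```python
import re

# Hypothetical helper, not part of PCRE2: recognises brace quantifiers
# under the 10.43 rules described in NEWS.
_QUANT = re.compile(r"\{[ \t]*(\d*)[ \t]*(?:(,)[ \t]*(\d*)[ \t]*)?\}")


def parse_quantifier(text):
    """Return (min, max) for a brace quantifier, or None if text is not one.

    max is None when the quantifier is open-ended, as in {3,}.
    """
    m = _QUANT.fullmatch(text)
    if m is None:
        return None
    low, comma, high = m.group(1), m.group(2), m.group(3)
    if comma is None:
        if not low:
            return None  # "{}" is not a quantifier
        n = int(low)
        return (n, n)
    if not low and not high:
        return None  # "{,}" is still not a quantifier
    return (int(low) if low else 0, int(high) if high else None)
```

For example, `parse_quantifier("{,3}")` gives `(0, 3)`, while `parse_quantifier("{,}")` gives `None`, in which case the text would be treated as literal characters.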
diff --git a/configure.ac b/configure.ac
@@ -10,14 +10,14 @@ dnl be defined as -RC2, for example. For real releases, it should be empty.
 
 m4_define(pcre2_major, [10])
 m4_define(pcre2_minor, [43])
-m4_define(pcre2_prerelease, [-DEV])
-m4_define(pcre2_date, [2023-04-14])
+m4_define(pcre2_prerelease, [-RC1])
+m4_define(pcre2_date, [2023-12-27])
 
 # Libtool shared library interface versions (current:revision:age)
-m4_define(libpcre2_8_version,     [11:2:11])
-m4_define(libpcre2_16_version,    [11:2:11])
-m4_define(libpcre2_32_version,    [11:2:11])
-m4_define(libpcre2_posix_version, [3:4:0])
+m4_define(libpcre2_8_version,     [12:0:12])
+m4_define(libpcre2_16_version,    [12:0:12])
+m4_define(libpcre2_32_version,    [12:0:12])
+m4_define(libpcre2_posix_version, [3:5:0])
 
 # NOTE: The CMakeLists.txt file searches for the above variables in the first
 # 50 lines of this file. Please update that if the variables above are moved.

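The libtool triples changed in configure.ac above use the current:revision:age scheme. As a rough sketch of what the bump means in practice, the function below maps a triple to the version suffix that GNU libtool conventionally gives a shared library on Linux, namely (current - age).age.revision. The function name is an assumption made for this example, and other platforms use different naming rules.

```python
def linux_so_suffix(version):
    """Map a libtool 'current:revision:age' triple to a Linux .so suffix.

    Illustrative helper only; follows the usual GNU libtool convention
    on Linux of naming the file lib<name>.so.(current-age).(age).(revision).
    """
    current, revision, age = (int(part) for part in version.split(":"))
    if age > current:
        raise ValueError("age must not exceed current")
    return f"{current - age}.{age}.{revision}"
```

Under that convention, 11:2:11 maps to 0.11.2 and 12:0:12 maps to 0.12.0, so the leading number stays at 0 for the three main libraries: interfaces were added, none were removed. For the POSIX wrapper, 3:4:0 to 3:5:0 changes only the revision, which indicates code changes with no interface change.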
diff --git a/doc/html/pcre2grep.html b/doc/html/pcre2grep.html
@@ -71,15 +71,16 @@ <h1>pcre2grep man page</h1>
 <pre>
   pcre2grep some-pattern file1 - file3
 </pre>
-By default, input files are searched line by line. Each line that matches a
-pattern is copied to the standard output, and if there is more than one file,
-the file name is output at the start of each line, followed by a colon.
-However, there are options that can change how <b>pcre2grep</b> behaves. For
-example, the <b>-M</b> option makes it possible to search for strings that span
-line boundaries. What defines a line boundary is controlled by the <b>-N</b>
-(<b>--newline</b>) option. The <b>-h</b> and <b>-H</b> options control whether or
-not file names are shown, and the <b>-Z</b> option changes the file name
-terminator to a zero byte.
+By default, input files are searched line by line, so pattern assertions about
+the beginning and end of a subject string (^, $, \A, \Z, and \z) match at
+the beginning and end of each line. When a line matches a pattern, it is copied
+to the standard output, and if there is more than one file, the file name is
+output at the start of each line, followed by a colon. However, there are
+options that can change how <b>pcre2grep</b> behaves. For example, the <b>-M</b>
+option makes it possible to search for strings that span line boundaries. What
+defines a line boundary is controlled by the <b>-N</b> (<b>--newline</b>) option.
+The <b>-h</b> and <b>-H</b> options control whether or not file names are shown,
+and the <b>-Z</b> option changes the file name terminator to a zero byte.
 </P>
 <P>
 The amount of memory used for buffering files that are being scanned is
@@ -563,16 +564,24 @@ <h1>pcre2grep man page</h1>
 <P>
 <b>-M</b>, <b>--multiline</b>
 Allow patterns to match more than one line. When this option is set, the PCRE2
-library is called in "multiline" mode. This allows a matched string to extend
-past the end of a line and continue on one or more subsequent lines. Patterns
-used with <b>-M</b> may usefully contain literal newline characters and internal
-occurrences of ^ and $ characters. The output for a successful match may
-consist of more than one line. The first line is the line in which the match
-started, and the last line is the line in which the match ended. If the matched
-string ends with a newline sequence, the output ends at the end of that line.
-If <b>-v</b> is set, none of the lines in a multi-line match are output. Once a
-match has been handled, scanning restarts at the beginning of the line after
-the one in which the match ended.
+library is called in "multiline" mode, and a match is allowed to continue past
+the end of the initial line and onto one or more subsequent lines.
+<br>
+<br>
+Patterns used with <b>-M</b> may usefully contain literal newline characters and
+internal occurrences of ^ and $ characters, because in multiline mode these can
+match at internal newlines. Because <b>pcre2grep</b> is scanning multiple lines,
+the \Z and \z assertions match only at the end of the last line in the file.
+The \A assertion matches at the start of the first line of a match. This can
+be any line in the file; it is not anchored to the first line.
+<br>
+<br>
+The output for a successful match may consist of more than one line. The first
+line is the line in which the match started, and the last line is the line in
+which the match ended. If the matched string ends with a newline sequence, the
+output ends at the end of that line. If <b>-v</b> is set, none of the lines in a
+multi-line match are output. Once a match has been handled, scanning restarts
+at the beginning of the line after the one in which the match ended.
 <br>
 <br>
 The newline sequence that separates multiple lines must be matched as part of
@@ -1107,7 +1116,7 @@ <h1>pcre2grep man page</h1>
 </P>
 <br><a name="SEC16" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 20 November 2023
+Last updated: 22 December 2023
 <br>
 Copyright &copy; 1997-2023 University of Cambridge.
 <br>

diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
@@ -328,10 +328,10 @@ <h1>pcre2pattern man page</h1>
 Brace characters { and } are also used to enclose data for constructions such
 as \g{2} or \k{name}. In almost all uses of braces, space and/or horizontal
 tab characters that follow { or precede } are allowed and are ignored. In the
-case of quantifiers, they may also appear before or after the comma. The 
+case of quantifiers, they may also appear before or after the comma. The
 exception to this is \u{...} which is an ECMAScript compatibility feature
-that is recognized only when the PCRE2_EXTRA_ALT_BSUX option is set. ECMAScript 
-does not ignore such white space; it causes the item to be interpreted as 
+that is recognized only when the PCRE2_EXTRA_ALT_BSUX option is set. ECMAScript
+does not ignore such white space; it causes the item to be interpreted as
 literal.
 </P>
 <P>
@@ -472,7 +472,7 @@ <h1>pcre2pattern man page</h1>
 (carriage return) character.
 </P>
 <P>
-An error occurs if \c is not followed by a character whose ASCII code point 
+An error occurs if \c is not followed by a character whose ASCII code point
 is in the range 32 to 126. The precise effect of \cx is as follows: if x is a
 lower case letter, it is converted to upper case. Then bit 6 of the character
 (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is
@@ -694,8 +694,8 @@ <h1>pcre2pattern man page</h1>
   \s  any character that matches \p{Z} or \h or \v
   \w  any character that matches \p{L}, \p{N}, \p{Mn}, or \p{Pc}
 </pre>
-The addition of \p{Mn} (non-spacing mark) and the replacement of an explicit 
-test for underscore with a test for \p{Pc} (connector punctuation) happened in 
+The addition of \p{Mn} (non-spacing mark) and the replacement of an explicit
+test for underscore with a test for \p{Pc} (connector punctuation) happened in
 PCRE2 release 10.43. This brings PCRE2 into line with Perl.
 </P>
 <P>
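The \w definition in the hunk above (\p{L}, \p{N}, \p{Mn}, or \p{Pc}) can be approximated with Python's standard `unicodedata` module. This is a sketch of the category test only, not of PCRE2's implementation; the function name is made up for the example, and Python's bundled Unicode tables may be a different version from the Unicode 15.0.0 data that PCRE2 10.43 uses.

```python
import unicodedata


def is_ucp_word_char(ch):
    """Approximate \\w in UCP mode as described for PCRE2 10.43.

    True when the general category is L (letter) or N (number), or the
    particular category is Mn (non-spacing mark) or Pc (connector
    punctuation, which includes underscore).
    """
    category = unicodedata.category(ch)
    return category[0] in ("L", "N") or category in ("Mn", "Pc")
```

Underscore still matches because it has category Pc, and a combining acute accent (U+0301, category Mn) now counts as a word character, while a hyphen (category Pd) does not.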
@@ -1074,7 +1074,7 @@ <h1>pcre2pattern man page</h1>
 carriage return, and any other character that has the Z (separator) property.
 Xsp is the same as Xps; in PCRE1 it used to exclude vertical tab, for Perl
 compatibility, but Perl changed. Xwd matches the same characters as Xan, plus
-those that match Mn (non-spacing mark) or Pc (connector punctuation, which 
+those that match Mn (non-spacing mark) or Pc (connector punctuation, which
 includes underscore).
 </P>
 <P>
@@ -1586,7 +1586,7 @@ <h1>pcre2pattern man page</h1>
 </P>
 <P>
 The other POSIX classes are unchanged by PCRE2_UCP, and match only characters
-with code points less than 256. 
+with code points less than 256.
 </P>
 <P>
 There are two options that can be used to restrict the POSIX classes to ASCII
@@ -1613,8 +1613,8 @@ <h1>pcre2pattern man page</h1>
 <a href="#smallassertions">"Simple assertions"</a>
 above), and in a Perl-style pattern the preceding or following character
 normally shows which is wanted, without the need for the assertions that are
-used above in order to give exactly the POSIX behaviour. Note also that the 
-PCRE2_UCP option changes the meaning of \w (and therefore \b) by default, so 
+used above in order to give exactly the POSIX behaviour. Note also that the
+PCRE2_UCP option changes the meaning of \w (and therefore \b) by default, so
 it also affects these POSIX sequences.
 </P>
 <br><a name="SEC12" href="#TOC1">VERTICAL BAR</a><br>
@@ -1682,8 +1682,8 @@ <h1>pcre2pattern man page</h1>
 above, it sets (or unsets) all the ASCII options.
 </P>
 <P>
-PCRE2_EXTRA_ASCII_DIGIT has no additional effect when PCRE2_EXTRA_ASCII_POSIX 
-is set, but including it in (?aP) means that (?-aP) suppresses all ASCII 
+PCRE2_EXTRA_ASCII_DIGIT has no additional effect when PCRE2_EXTRA_ASCII_POSIX
+is set, but including it in (?aP) means that (?-aP) suppresses all ASCII
 restrictions for POSIX classes.
 </P>
 <P>
@@ -1993,7 +1993,7 @@ <h1>pcre2pattern man page</h1>
   X{,4} is interpreted as X{0,4}
 </pre>
 This is a change in behaviour that happened in Perl 5.34.0 and PCRE2 10.43. In
-earlier versions such a sequence was not interpreted as a quantifier. Other 
+earlier versions such a sequence was not interpreted as a quantifier. Other
 regular expression engines may behave either way.
 </P>
 <P>
@@ -2287,7 +2287,7 @@ <h1>pcre2pattern man page</h1>
 The sequence \g{-1} is a reference to the capture group whose number is one
 less than the number of the next group to be started, so in this example (where
 the next group would be numbered 3) it is equivalent to \2, and \g{-2} would
-be equivalent to \1. Note that if this construct is inside a capture group, 
+be equivalent to \1. Note that if this construct is inside a capture group,
 that group is included in the count, so in this example \g{-2} also refers to
 group 1:
 <pre>
@@ -2323,8 +2323,8 @@ <h1>pcre2pattern man page</h1>
 </P>
 <P>
 There are several different ways of writing backreferences to named capture
-groups. The .NET syntax is \k{name}, the Python syntax is (?P=name), and the 
-original Perl syntax is \k&#60;name&#62; or \k'name'. All of these are now supported 
+groups. The .NET syntax is \k{name}, the Python syntax is (?P=name), and the
+original Perl syntax is \k&#60;name&#62; or \k'name'. All of these are now supported
 by both Perl and PCRE2. Perl 5.10's unified backreference syntax, in which \g
 can be used for both numeric and named references, is also supported by PCRE2.
 We could rewrite the above example in any of the following ways:
@@ -2778,7 +2778,7 @@ <h1>pcre2pattern man page</h1>
 condition is true if a capture group of that number has previously matched. If
 there is more than one capture group with the same number (see the earlier
 <a href="#recursion">section about duplicate group numbers),</a>
-the condition is true if any of them have matched. An alternative notation, 
+the condition is true if any of them have matched. An alternative notation,
 which is a PCRE2 extension, not supported by Perl, is to precede the digits
 with a plus or minus sign. In this case, the group number is relative rather
 than absolute. The most recently opened capture group (which could be enclosing

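One of the pcre2pattern.html hunks above quotes the rule for \cx: the following character must have an ASCII code point in the range 32 to 126, a lower case letter is converted to upper case, and then bit 6 (hex 40) is inverted. A minimal sketch of that arithmetic follows; the function name is an assumption made for the example, and this is not PCRE2's code.

```python
def control_escape(x):
    """Return the code point that \\cx stands for, per the quoted rule."""
    if len(x) != 1:
        raise ValueError("expected a single character")
    code = ord(x)
    if not 32 <= code <= 126:
        raise ValueError("\\c must be followed by a printable ASCII character")
    if "a" <= x <= "z":
        code = ord(x.upper())  # lower case letters are upper-cased first
    return code ^ 0x40  # invert bit 6
```

So `control_escape("A")` is 0x01 and `control_escape("z")` is 0x1A, matching the statement that \cA to \cZ become hex 01 to hex 1A.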
diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html
@@ -408,8 +408,8 @@ <h1>pcre2syntax man page</h1>
   (?-...)         unset the given option(s)
   (?^)            unset imnrsx options
 </pre>
-(?aP) implies (?aT) as well, though this has no additional effect. However, it 
-means that (?-aP) is really (?-PT) which disables all ASCII restrictions for 
+(?aP) implies (?aT) as well, though this has no additional effect. However, it
+means that (?-aP) is really (?-PT) which disables all ASCII restrictions for
 POSIX classes.
 </P>
 <P>