Porting guide from pcre1 #51
Comments
I'm afraid you won't get one from me. After 7 years I have effectively forgotten PCRE1. I would have to go and read both sets of documentation - which anyone else could do just as well. Off the top of my head, the executive summary is: (1) There is no separate study() function; (2) Use pcre2_match() instead of pcre_exec(); (3) Call pcre2_match_data_create() before calling pcre2_match(); (4) Adjust the number and type of various arguments and options bits; (5) If using JIT, call pcre2_jit_compile() before pcre2_match(). For guidance, take a look at pcre2demo.c. |
As one of the upstream software authors who has received a debian bug to update my software (https://i3wm.org/) from pcre to pcre2, I would also appreciate a porting guide. In fact, I would have expected a porting guide and examples to be provided in the bug to begin with. Now a lot of people need to look separately at what’s necessary, when we could have avoided much of that duplicate effort with a little more preparation… One initial point of confusion to me is the |
Ah, yes, I had forgotten about that, which just goes to demonstrate the state of my memory. I'm sorry we didn't provide a porting guide at the time. I can't remember what happened in PCRE1, but in PCRE2, if you are just working with a single code unit width, you set PCRE2_CODE_UNIT_WIDTH and can then use generic function names such as pcre2_match() instead of using the width-specific names such as pcre2_match_16(). The pcre2demo program demonstrates and comments on this. |
Looks like one pcre→pcre2 porting example is i3/i3#4682 (comment) (unless we find any issues with it). Maybe this helps anyone else. |
I'm in the process of porting SWI-Prolog's support from PCRE1 to PCRE2. If you're interested in what I do, watch SWI-Prolog/packages-pcre#2 |
Another example: PHP
|
Hi
const pcre * const restrict rg = regex->regexCompiled[iPatternPrf];
int ovector[2*8];
unsigned int offset = 0;
const unsigned int len = PFSeq->Length;
int rc;
size_t count = 0;
while (offset < len && (rc = pcre_exec(rg, 0, CleanSeq, len, offset, 0, ovector, 8)) >= 0)
{
    for(int k = 0; k < rc; ++k)
    {
        if (count < nmatch) {
            Matches[2*count] = ovector[2*k];
            Matches[2*count+1] = ovector[2*k+1];
            ++count;
        }
        else {
            fprintf(stderr, "Warning: maximum number of matches reached for %s with %s\n",
                    prfs[iPatternPrf]->Identification, prfs[iPatternPrf]->Pattern);
        }
    }
    offset = ovector[1];
}
I have replaced it by
const pcre2_code * const restrict rg = regex->regexCompiled[iPatternPrf];
CLEANUP_PCRE2_MATCH_DATA_FREE pcre2_match_data *match_data = pcre2_match_data_create_from_pattern (rg, NULL);
int ovector[2*8];
unsigned int offset = 0;
const unsigned int len = PFSeq->Length;
int rc;
size_t count = 0;
while (offset < len && (rc = pcre2_match(rg, (PCRE2_SPTR)CleanSeq, PCRE2_ZERO_TERMINATED, offset, 0, match_data, NULL)) >= 0)
{
    for(int k = 0; k < rc; ++k)
    {
        if (count < nmatch) {
            Matches[2*count] = ovector[2*k];
            Matches[2*count+1] = ovector[2*k+1];
            ++count;
        }
        else {
            fprintf(stderr, "Warning: maximum number of matches reached for %s with %s\n",
                    prfs[iPatternPrf]->Identification, prfs[iPatternPrf]->Pattern);
        }
    }
    offset = ovector[1];
}
but I don't know really how to replace |
Maybe (I don't have my conversion fully working, so can't be sure that I've got it right either) |
Thanks @kamahen, I will look at |
I think I did it (https://github.com/sib-swiss/pftools3/tree/master/src/C/prg) The code contains both PCRE and PCRE2, users can switch between them with cmake |
My code is almost ready for review (maybe in a day or so). |
My migration code is now here, awaiting code inspection: It does not use pcre2_substitute(), but instead uses existing code for doing split, replace, and fold operations using the regexps. |
Anyone porting their codebase to PCRE2 who expects it to run on Ubuntu 20.04, note that there is a bug [1] in its package that has not been fixed yet (a patch is available in 10.36 but has not been backported). It can make pcre2_match() loop forever with JIT enabled if you use the options (PCRE2_UTF | PCRE2_MULTILINE | PCRE2_MATCH_INVALID_UTF). For it to trigger, the subject must contain some UTF characters and a 1-byte line terminator, and a workaround (which
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <assert.h>
int main(void)
{
    pcre2_code *code;
    pcre2_match_data *data;
    PCRE2_SPTR pattern = "^\\s\0";
    PCRE2_SPTR subject = " báz\0";
    int errorcode, ret;
    PCRE2_SIZE next;
    uint32_t options = PCRE2_MULTILINE |
                       PCRE2_UTF | PCRE2_MATCH_INVALID_UTF;
#ifdef WORKAROUND
    options |= PCRE2_NO_START_OPTIMIZE;
#endif
    code = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, options,
                         &errorcode, &next, NULL);
    assert(code != NULL);
    ret = pcre2_jit_compile(code, PCRE2_JIT_COMPLETE);
    assert(ret == 0);
    data = pcre2_match_data_create_from_pattern(code, NULL);
    assert (data != NULL);
    do {
        ret = pcre2_match(code, subject, PCRE2_ZERO_TERMINATED,
                          next, 0, data, NULL);
        if (ret > 0)
            next = pcre2_get_startchar(data) + 1; /* will force skipping over incomplete UTF-8 */
    } while (ret > 0);
    return 0;
}
PCRE2_MATCH_INVALID_UTF was introduced in 10.34, so a few other distributions besides Ubuntu might also be affected.
[1] https://bugs.launchpad.net/ubuntu/+source/pcre2/+bug/1965925 |
Here are my notes on migration (SWI-Prolog/packages-pcre@1ace340 with some small follow-on fixes, e.g. for systems without compile_jit). Note that this is for a general purpose wrapper, and that I separately did quite a bit of refactoring code to make the pcre1→pcre2 migration relatively easy. The main things that I changed (I've probably forgotten one or two things):
|
This list of differences might be helpful. |
Good idea. |
Hi! It is probably also worth mentioning that the apparent natural successor to the old libpcrecpp (which PCRE2 no longer provides) would be RE2 by Google, who also contributed the old libpcrecpp. RE2 has a very similar API, so the migration should be easier than switching to the C API. |
The libpcrecpp was an interface to pcre1. Engines have different features, and the so called dfa engines have a limited set of them. Hence moving to another engine is not trivial. |
Ah, right, sorry, I should have been more explicit. When I had to migrate some C++ projects recently, I first went with a switch to the PCRE2 C API, but then found out about RE2. After considering that the underlying engine was not an issue for that particular usage, I concluded that switching to that very similar API carried less risk of regressions or bugs. I think it would still be worth mentioning it as the natural C++ API successor, but with the explicit caveat that, depending on what regexes are expected to be used, RE2 might indeed not be a valid candidate. |
Hi,
I don't suppose there's any chance of a porting guide from pcre1 to pcre2, is there, please?
I know you want to be shot of pcre1; I've recently filed bugs against the outstanding packages in Debian which still build against pcre1, and there are a lot of responses of the form "is there any guidance on porting to pcre2?" I don't feel I have deep enough knowledge of the two libraries (especially the older one) to do so myself, but I think having something to point folk at might help in getting more of the remaining ~200(!) packages that still need old-pcre ported, which in turn will make it plausible for me to drop old-pcre from Debian...
Thanks :)