Porting guide from pcre1 #51

Open
MatthewVernon opened this issue Nov 18, 2021 · 19 comments

Comments

@MatthewVernon
Contributor

Hi,

I don't suppose there's any chance of a porting guide from pcre1 to pcre2, is there, please?

I know you want to be shot of pcre1; I've recently filed bugs against the outstanding packages in Debian which still build against pcre1, and there are a lot of responses of the form "is there any guidance on porting to pcre2?" I don't feel I have deep enough knowledge of the two libraries (especially the older one) to write one myself, but I think having something to point folk at might help in getting more of the remaining ~200(!) packages that still need old-pcre ported, which in turn will make it plausible for me to drop old-pcre from Debian...

Thanks :)

@PhilipHazel
Collaborator

I'm afraid you won't get one from me. After 7 years I have effectively forgotten PCRE1. I would have to go and read both sets of documentation - which anyone else could do just as well. Off the top of my head, the executive summary is: (1) There is no separate study() function; (2) Use pcre2_match() instead of pcre_exec(); (3) Call pcre2_match_data_create() before calling pcre2_match(); (4) Adjust the number and type of various arguments and options bits; (5) If using JIT, call pcre2_jit_compile() before pcre2_match(). For guidance, take a look at pcre2demo.c.

@stapelberg

As one of the upstream software authors who has received a debian bug to update my software (https://i3wm.org/) from pcre to pcre2, I would also appreciate a porting guide.

In fact, I would have expected a porting guide and examples to be provided in the bug to begin with. Now a lot of people need to look separately at what’s necessary, when we could have avoided much of that duplicate effort with a little more preparation…

One initial point of confusion to me is the PCRE2_CODE_UNIT_WIDTH macro, which seems to be a new concept in pcre2?

@PhilipHazel
Collaborator

Ah, yes, I had forgotten about that, which just goes to demonstrate the state of my memory. I'm sorry we didn't provide a porting guide at the time. I can't remember what happened in PCRE1, but in PCRE2, if you are just working with a single code unit width, you set PCRE2_CODE_UNIT_WIDTH and can then use generic function names such as pcre2_match() instead of using the width-specific names such as pcre2_match_16(). The pcre2demo program demonstrates and comments on this.

@stapelberg

Looks like one pcre→pcre2 porting example is i3/i3#4682 (comment) (unless we find any issues with it). Maybe this helps anyone else.

@kamahen

kamahen commented Nov 30, 2021

I'm in the process of porting SWI-Prolog's support from PCRE1 to PCRE2. If you're interested in what I do, watch SWI-Prolog/packages-pcre#2
But don't expect fast progress on this. ;)

@inkydragon

Another example: PHP

Backward Incompatible Changes
Internal library API has changed

  • The 'S' modifier has no effect, patterns are studied automatically. No real impact.
  • The 'X' modifier is the default behavior in PCRE2. The current patch reverts the behavior to the meaning of 'X' how it was in PCRE, but it might be better to go with the new behavior and have 'X' turned on by default. So currently no impact, too.
  • Some behavior change due to the newer Unicode engine was sighted. It's Unicode 10 in PCRE2 vs Unicode 7 in PCRE.
  • Some behavior change can be sighted with invalid patterns.

@smoretti

smoretti commented Jan 28, 2022

Hi
I have inherited some C code using PCRE. I think I managed most of the changes to PCRE2 with the help of your examples. But pcre_exec to pcre2_match is the trickiest for me, notably with pcre2_match_data compared to what was in PCRE.
Don't know if this is the right place to ask for help, but here is the code I have:

      const pcre * const restrict rg = regex->regexCompiled[iPatternPrf];
      int ovector[2*8];
      unsigned int offset = 0;
      const unsigned int len = PFSeq->Length;
      int rc;
      size_t count = 0;
      while (offset < len && (rc = pcre_exec(rg, 0, CleanSeq, len, offset, 0, ovector, 8)) >= 0)
      {
        for(int k = 0; k < rc; ++k)
        {
          if (count < nmatch) {
            Matches[2*count] = ovector[2*k];
            Matches[2*count+1] = ovector[2*k+1];
            ++count;
          }
          else {
            fprintf(stderr, "Warning: maximum number of matches reached for %s with %s\n",
              prfs[iPatternPrf]->Identification, prfs[iPatternPrf]->Pattern);
          }
        }
        offset = ovector[1];
      }

I have replaced it by

      const pcre2_code * const restrict rg = regex->regexCompiled[iPatternPrf];
      CLEANUP_PCRE2_MATCH_DATA_FREE pcre2_match_data *match_data = pcre2_match_data_create_from_pattern (rg, NULL);
      int ovector[2*8];
      unsigned int offset = 0;
      const unsigned int len = PFSeq->Length;
      int rc;
      size_t count = 0;
      while (offset < len && (rc = pcre2_match(rg, (PCRE2_SPTR)CleanSeq, PCRE2_ZERO_TERMINATED, offset, 0, match_data, NULL)) >= 0)
      {
        for(int k = 0; k < rc; ++k)
        {
          if (count < nmatch) {
            Matches[2*count] = ovector[2*k];
            Matches[2*count+1] = ovector[2*k+1];
            ++count;
          }
          else {
            fprintf(stderr, "Warning: maximum number of matches reached for %s with %s\n",
              prfs[iPatternPrf]->Identification, prfs[iPatternPrf]->Pattern);
          }
        }
        offset = ovector[1];
      }

but I don't know really how to replace ovector regarding match_data.

@kamahen

kamahen commented Jan 28, 2022

Maybe pcre2_get_ovector_pointer(match_data)?
Anyway, that's what pcre2demo.c uses.

(I don't have my conversion fully working, so can't be sure that I've got it right either)

@smoretti

Thanks @kamahen, I will look at pcre2demo.c.

@smoretti

I think I did it (https://github.com/sib-swiss/pftools3/tree/master/src/C/prg)

The code contains both PCRE and PCRE2, users can switch between them with cmake

@kamahen

kamahen commented Mar 24, 2022

My code is almost ready for review (maybe in a day or so).
If you want to look at it sooner: https://github.com/kamahen/packages-pcre/tree/pcre2_wip and diff with commit 8876f2a

@kamahen

kamahen commented Mar 25, 2022

My migration code is now here, awaiting code inspection:
SWI-Prolog/packages-pcre#11

It does not use pcre2_substitute(), but instead uses existing code for doing split, replace, and fold operations using the regexps.

@carenas
Contributor

carenas commented Mar 26, 2022

Anyone porting their codebase to PCRE2 who expects to run under Ubuntu 20.04 should note that there is a bug[1] in its package that has not been fixed yet (a patch is available in 10.36 but has not been backported). It can result in pcre2 looping forever in pcre2_match() with JIT enabled if you are using the following options: (PCRE2_UTF | PCRE2_MULTILINE | PCRE2_MATCH_INVALID_UTF)

For it to trigger, the subject must contain some UTF characters and a 1-byte line terminator. A workaround (which git is using) is shown in the example code below (when compiled with -DWORKAROUND)

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <assert.h>

int main(void)
{
	pcre2_code *code;
	pcre2_match_data *data;
	PCRE2_SPTR pattern = "^\\s\0";
	PCRE2_SPTR subject = " báz\0";
	int errorcode, ret;
	PCRE2_SIZE next;

	uint32_t options = PCRE2_MULTILINE |
	       		PCRE2_UTF | PCRE2_MATCH_INVALID_UTF;
#ifdef WORKAROUND
	options |= PCRE2_NO_START_OPTIMIZE;
#endif

	code = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, options,
				&errorcode, &next, NULL);
	assert(code != NULL);

	ret = pcre2_jit_compile(code, PCRE2_JIT_COMPLETE);
	assert(ret == 0);

	data = pcre2_match_data_create_from_pattern(code, NULL);
	assert (data != NULL);

	do {
		ret = pcre2_match(code, subject, PCRE2_ZERO_TERMINATED,
				next, 0, data, NULL);

		if (ret > 0)
			next = pcre2_get_startchar(data) + 1;    /* will force skipping over incomplete UTF-8 */
	} while (ret > 0);
	return 0;
}

PCRE2_MATCH_INVALID_UTF was introduced in 10.34, so outside of Ubuntu there might be a few other distributions affected as well.

[1] https://bugs.launchpad.net/ubuntu/+source/pcre2/+bug/1965925

@kamahen

kamahen commented Mar 27, 2022

Here are my notes on migration (SWI-Prolog/packages-pcre@1ace340 with some small follow-on fixes, e.g. for systems without compile_jit). Note that this is for a general purpose wrapper, and that I separately did quite a bit of refactoring code to make the pcre1→pcre2 migration relatively easy.

The main things that I changed (I've probably forgotten one or two things):

  • uint32_t is used almost everywhere
  • some fields are extended to 64 bits, e.g. pcre2_match()'s "start" and "length".
  • many more options and values for options e.g., newline_nul
  • additional flags:
    • jit options
    • compile context
    • bsr - pcre2_set_bsr()
    • newline - pcre2_set_newline()
  • pcre2_config() has a different way of returning strings
  • can get most option flags by calling pcre2_pattern_info()
  • note PCRE2_INFO_NEWLINE, PCRE2_INFO_BSR are separate from
    PCRE2_INFO_ARGOPTIONS/PCRE2_INFO_ALLOPTIONS
  • pcre2_compile() takes an optional compile context, created with
    pcre2_compile_context_create()
  • pcre2_match()'s ovector lives in the match data and is obtained with
    pcre2_get_ovector_pointer()
  • pcre2_match() additionally needs pcre2_match_data_create_from_pattern() and
    pcre2_match_data_free()

@abrudz

abrudz commented Nov 24, 2022

This list of differences might be helpful.

@kamahen

kamahen commented Nov 24, 2022

Good idea.
Please either open a new issue, requesting a change to the documentation; or submit a PR with the changed documentation.
(I'm unable to work on this right now and don't want your idea to get lost)

@guillemj

Hi!

Probably also worth mentioning that the apparent natural successor to the old libpcrecpp (which PCRE2 no longer provides) would be RE2 by Google, who also contributed the old libpcrecpp. RE2 has a very similar API, so the migration should be easier than switching to the C API.

@zherczeg
Collaborator

The libpcrecpp was an interface to pcre1. Engines have different features, and the so-called DFA engines have a limited set of them. Hence moving to another engine is not trivial.

@guillemj

> The libpcrecpp was an interface to pcre1. Engines have different features, and the so-called DFA engines have a limited set of them. Hence moving to another engine is not trivial.

Ah, right, sorry, I should have been more explicit. When I recently had to work out how to migrate some C++ projects, I first went with a switch to the PCRE2 C API, but then found out about RE2. After concluding that the underlying engine was not an issue for that particular usage, I found that switching to that very similar API implied less risk of regressions or bugs. I think it would still be worth mentioning it as the natural C++ API successor, but with the explicit caveat that, depending on what regexes are expected to be used, RE2 might indeed not be a valid candidate.
