Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify PCRE pattern before use #4517

Open
wants to merge 4 commits into
base: 2.15
from

Conversation

@ktomk
Copy link
Contributor

commented Aug 24, 2019

To give a detailed PregException message making any problems w/ the
pattern visible.

Add to all the Preg::...($pattern, ....) methods as getting details of any
pattern related errors later on is not easily possible b/c of multi-
matching w/ and w/o UTF-8 mode (see b29848e).

NOTE: Replacement parameters are not checked.

Related: #4406
Reference: b29848e

@ktomk

This comment has been minimized.

Copy link
Contributor Author

commented Aug 24, 2019

There is an issue w/ the check by fabbot.io, I've seen changes in the file that were explicitly made to \array_map (with leading slash) so I did it the same and the cs-fixer in the project itself did not suggest any changes either. So I'm a little bit puzzled what is wanted here.

I must admit, I was even before filing the PR b/c other global function in the file like preg_match or preg_last_error etc. aren't prefixed w/ a slash.

@julienfalque

This comment has been minimized.

Copy link
Member

commented Aug 25, 2019

Isn't this going to add too much overhead (verifying each pattern every time it is used)? IMO, covering code with tests should be enough to prevent wrong pattern usages.

src/Preg.php Outdated Show resolved Hide resolved
if (\is_array($pattern)) {
return \array_map(__METHOD__, $pattern);
}

This comment has been minimized.

Copy link
@kubawerlos

kubawerlos Aug 25, 2019

Contributor

Can we have an early return if we already verified the pattern to avoid the too much overhead issue mentioned by @julienfalque ? Something like:

if (isset(self::$verifiedPatterns[$pattern])) {
    return;
}

This comment has been minimized.

Copy link
@ktomk

ktomk Aug 26, 2019

Author Contributor

@kubawerlos We certainly are able to write that code. But I would like to mention that we should not.

I could now quote from Rob Pike 1989, however even more concretely from top of my head, the PCRE library already caches compiled patterns internally. At least this is what I remember from years ago when I wrote that CSS parser related lexer / parser thingy years ago you can still find on my Github page. It's PCRE regex based.,

/cc @julienfalque

This comment has been minimized.

Copy link
@kubawerlos

kubawerlos Aug 26, 2019

Contributor

But I would like to mention that we should not.

@ktomk and I still stand on my opinion that we should, please do quote from Rob Pike 1989 because I am eager to learn what is wrong with an early return, which we are using it this project a lot.

This comment has been minimized.

Copy link
@ktomk

ktomk Aug 26, 2019

Author Contributor

@kubawerlos Don't get me wrong, I'm for early reaturns very often, if not always, for the structure of a method. The verification calls are at the top of each of those methods and for the error case they throw. This means the exit is pretty early.

Your suggestion as I understand it is to introduce an in-memory caching layer in form of an array for performance reasons which have not been demonstrated to the best of my current knowledge.

I told in a comment below that I have not measured the change. I know this is sort of weak from my end. This is purely because I don't know where the perf suite of the project is to do these measurements and I'm eager to learn more here.

As you have asked for the quote, it's mainly the rules 1+2 from the complexity section that sprung to my mind, even if not a C programmer I think it's worth to read the whole document which is not very large, but here is a limited quote:

...
Complexity
Most programs are too complicated - that is, more complex than they need to be to solve their problems efficiently. Why? Mostly it's because of bad design, but I will skip that issue here because it's a big one. But programs are often complicated at the microscopic level, and that is something I can address here.
Rule 1. You can't tell where a program is going to spend its time. Bottlenecks occur in surprising places, so don't try to second guess and put in a speed hack until you've proven that's where the bottleneck is.
Rule 2. Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest.
...

From: Notes on Programming in C by Rob Pike; February 21, 1989

Even Pike write rules, please mind that this is in context of his writing. I do not want to say that these are rules in this project that I think should be followed, it is just that in my opinion this is generally good advice. And it's more generic.

In this case especially - as I already wrote elsewhere - I think that PCRE caches the compiled patterns already.

So what I'm trying to say is, that if there is a performance issue, we need to find out first to address it. So if you have more data at hand, I really would love to here about it and if this is an issue which needs addressing I'm the last who says to not address it. But let's see if this is an issue first of all.

This comment has been minimized.

Copy link
@kubawerlos

kubawerlos Aug 26, 2019

Contributor

This in-memory caching with single if would save a lot of calls for preg_ functions - and the less calls to preg_ function is worth to have it.

This comment has been minimized.

Copy link
@ktomk

ktomk Aug 26, 2019

Author Contributor

@kubawerlos Even if so, which difference would it make? Is this worth to consider at all?

This comment has been minimized.

Copy link
@kubawerlos

kubawerlos Aug 28, 2019

Contributor

I don't know, it would be great to see some benchmarks with no verify, with cached version and with non-cached version.

This comment has been minimized.

Copy link
@ktomk

ktomk Sep 6, 2019

Author Contributor

@kubawerlos Was afk for some days and then needed to be avail for others elsewhere first, so please excuse the feedback-less time. On topic, just let me know if you've found out. I don't know how things are measured in this project still.

src/Preg.php Outdated Show resolved Hide resolved
@ktomk

This comment has been minimized.

Copy link
Contributor Author

commented Aug 26, 2019

Isn't this going to add too much overhead (verifying each pattern every time it is used)? IMO, covering code with tests should be enough to prevent wrong pattern usages.

For the overhead: I've not measured. In the past this never has been an issue for me, but I'm eager to learn.

About tests: The pattern was wrong in a specific PHP version. It was not wrong earlier. Unit tests aren't effective at run-time (when testing I normally favor a better message as well).

/e: and after a further insight: Preg::removeUtf8Modifier will not work as expected if an invalid pattern is being provided. It already has problems with a valid one - in case of using pairing characters for pattern delimiters like (), [] or {} - but has even many more issues with an invalid pattern that is missing the ending delimiter in any other case. No stress.

@ktomk ktomk force-pushed the ktomk:verify-pcre-pattern-before-use branch from 352c7d0 to 0482f33 Aug 26, 2019

src/Preg.php Outdated Show resolved Hide resolved
src/Preg.php Outdated Show resolved Hide resolved
tests/PregTest.php Outdated Show resolved Hide resolved
tests/PregTest.php Outdated Show resolved Hide resolved

@ktomk ktomk force-pushed the ktomk:verify-pcre-pattern-before-use branch from 0482f33 to 2c62298 Aug 27, 2019

src/Preg.php Outdated Show resolved Hide resolved

@ktomk ktomk force-pushed the ktomk:verify-pcre-pattern-before-use branch 3 times, most recently from 0b865f5 to 0c165b5 Aug 27, 2019

@keradus

This comment has been minimized.

Copy link
Member

commented Aug 28, 2019

@ktomk , please, for a while, during the review, avoid doing the squashing.
it's really hard to see the changes you applied between my previous review and now, as everything is a single commit now :(

@ktomk ktomk force-pushed the ktomk:verify-pcre-pattern-before-use branch from 0c165b5 to a46df6d Aug 28, 2019

@ktomk

This comment has been minimized.

Copy link
Contributor Author

commented Aug 28, 2019

@keradus: If I only had read that earlier. I always rebase / amend locally after the build is fine and then push for the remote build resetting anything previous. I can leave fixup! etc. commits if that helps but was not aware earlier this was inconvenient. I'll do for future pushes.

@keradus keradus changed the base branch from 2.12 to 2.15 Sep 4, 2019

@ktomk ktomk force-pushed the ktomk:verify-pcre-pattern-before-use branch 4 times, most recently from 5b07305 to d434981 Sep 4, 2019

ktomk added 2 commits Aug 24, 2019
Verify PCRE pattern before use
To give a detailed PregException message making any problems w/ the
pattern visible.

Add to all the Preg::...($pattern, ....) methods as getting details of any
pattern related errors later on is not easily possible b/c of multi-
matching w/ and w/o UTF-8 mode (see b29848e).

NOTE: Replacement parameters are not checked.

Related: #4406
Reference: b29848e
squash! Verify PCRE pattern before use
(this restores the old behavior to not validate the pattern(s) before use
but only afterwards in case of an error)

(this has been updated to make the pr build portable accross different
error messages across all PHP versions [e.g. UTF-8])

Verify PCRE pattern on error

To give a detailed PregException message making any problems w/ the
pattern visible.

Add to all the Preg::...($pattern, ....) methods giving a more concrete
exception message in case any of the pattern(s) in use does not validate,
otherwise give old exception message (but with the change to signal the
actual method name, not the function name of the preg_* category).

NOTE: Replacement parameters are not checked.

Related: #4406
Reference: b29848e

@ktomk ktomk force-pushed the ktomk:verify-pcre-pattern-before-use branch 2 times, most recently from 503e8aa to 0120261 Sep 7, 2019

@ktomk

This comment has been minimized.

Copy link
Contributor Author

commented Sep 7, 2019

I have added some changes to this PR. The verification of the patterns is now only done if any of the preg_* operations failed regardless in which unicode/non-unicode mode.

In case verification fails, the PregException message has details of the error. Otherwise the PregException message now contains the method name of the Preg class method which caused the error (it had the name of the preg_* method previously) and the behavior is otherwise unchanged.

Then I added another change-set on top as the Travis build was giving me a hard time due to installation failures of the dev-tools (for example see build #18669. This change is marked with three dots at the beginning as I needed a couple of retries to compile it. Let me know what to do with it, e.g. if it should be kept in this PR or moved into a PR of it's own or what else.

@keradus

This comment has been minimized.

Copy link
Member

commented Sep 8, 2019

@ktomk,

As discussed on https://gitter.im/PHP-CS-Fixer/Lobby:

0120261 looks like great improvement !
please, extract it to separated PR (and drop it from current PR, perhaps), so we can have proper review etc. I must admit, i'm not so experienced with shell scripts, so probably i will have few questions about the changes.
also, consider adding the great description from commit message into the shell script file itself, so one can always come back to it easily

also, don't worry about earlier squashes, but now you see, while having atomic commits, it's also easier to extract sth to dedicated PR by simply cherry-picking


your description of changes for Preg error handling itself sounds good. I will take a look at code diff hopefully tomorrow

src/Preg.php Outdated Show resolved Hide resolved

@ktomk ktomk force-pushed the ktomk:verify-pcre-pattern-before-use branch from 0120261 to 2d23241 Sep 9, 2019

squash! Verify PCRE pattern before use
add types / doc comment love.

@ktomk ktomk force-pushed the ktomk:verify-pcre-pattern-before-use branch from 2d23241 to db6d6c9 Sep 9, 2019

@@ -45,7 +45,7 @@ public static function match($pattern, $subject, &$matches = null, $flags = 0, $
return $result;
}
throw new PregException('Error occurred when calling preg_match.', preg_last_error());
throw self::newPregException(preg_last_error(), __METHOD__, (array) $pattern);

This comment has been minimized.

Copy link
@julienfalque

julienfalque Sep 12, 2019

Member

Can we call preg_last_error() inside self::newPregException() directly?

This comment has been minimized.

Copy link
@ktomk

ktomk Sep 13, 2019

Author Contributor

We can and I think we also do. However, this call is necessary to obtain the original last error of the preg operation as testing a patterin inside that routine will overwrite it and this parameter is the default exception code which otherwise would be implicit only and at risk to get overwritten due to more preg operations in the subroutine.

Please let me know if this answers your question and if not, please help me to better understand the intend of your question.

src/Preg.php Outdated Show resolved Hide resolved
*
* @return PregException
*/
private static function newPregException($error, $method, array $patterns)

This comment has been minimized.

Copy link
@julienfalque

julienfalque Sep 12, 2019

Member

I would move this method to PregException and make it public so it can be used as a named constructor. WDYT?

This comment has been minimized.

Copy link
@ktomk

ktomk Sep 13, 2019

Author Contributor

I would not do that. The reason is that on first guide I use the rule to use as little visibility as necessary to get the job done.

As the error handling is also very specific in that routine (won't hurt it matures a bit first would be my guess) and PregException is a utility to Preg, I think for the moment it rests well there.

Albeit what you suggest I also suggested earlier in a comment and I think is possible. But the handling is that specific to Preg that I have stepped a bit back from it.

squash! Verify PCRE pattern before use
add types / doc comment love (2).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.