Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

preg_match_all for iframe is not reliable #320

Closed
ajoah opened this issue Oct 29, 2021 · 4 comments
Closed

preg_match_all for iframe is not reliable #320

ajoah opened this issue Oct 29, 2021 · 4 comments

Comments

@ajoah
Copy link
Contributor

ajoah commented Oct 29, 2021

Hi,

I found a strange issue on iframe parsing (for replace iframe src for a youtube video for example) :

$iframe_pattern = '/<(iframe)[^>].*?src=[\'"](http:\/\/|https:\/\/|\/\/)' . $url_pattern_iframes . '[\'"].*?>.*?<\/iframe>/is';
if ( preg_match_all( $iframe_pattern, $output, $matches, PREG_PATTERN_ORDER ) ) {

On one of my website, the youtube placeholder didn't work. Indeed, the preg_match_all return false.

After investigation, it seems that preg_match_all is not reliable on long strings : https://stackoverflow.com/a/27868983/1659617

Maybe you would have to optimize the regex or get just the iframes in first time then in the loop parse the content of the iframe

Code to reproduce :

<?php
$iframe_pattern = '/<(iframe)[^>].*?src=[\'"](http:\/\/|https:\/\/|\/\/)([\w.,;@?^=%&:()\/~+#!\- *]*?)[\'"].*?>.*?<\/iframe>/is';
$html = 'html content'
if ( preg_match_all( $iframe_pattern, $html, $matches, PREG_PATTERN_ORDER ) ) {
  echo 'ok';
}else{
  echo 'ko';
}

If you have a private way, i can send you my html code to reproduce the issue in PHP 7.4.
I am on Wordpress slack, username => ajoah

@rlankhorst
Copy link
Collaborator

Hi @ajoah,

Thanks for reporting! You can contact me on rogier(at)really-simple-ssl.com. Is at any kind of html, but just very long, or should it have some specific content?
If you can send me the html, I'll do some tests.

@rlankhorst
Copy link
Collaborator

Can you check if this branch works for you?
https://github.com/Really-Simple-Plugins/complianz-gdpr/compare/preg_match_all-catastrophic-backgracking?expand=1

With your html, I came across an issue when the iframe src does not contain an actual URL, but just "about:blank". When that happens, there is no match, but the regex pattern just continues to search until the end of the file. If the file is long enough, you run out of the limit. This could also slow down the process: instead of the usual 300-500 steps, it now takes 100000 steps to complete.

I realised I do not actually have to match so strict: the found source will be compared to the block list. One more "about:blank" in the compare list is a lot more efficient.

So the above fix just drops the URL matching. That is a result of trying to match as limited as possible, with negative results here.

Let me know if this helps!

@ajoah
Copy link
Contributor Author

ajoah commented Nov 2, 2021

Hi @rlankhorst,

Yes it works with this regex 👍

@rlankhorst
Copy link
Collaborator

@ajoah great, thanks for checking. This won't make the upcoming release, but will be included in the 6.0 update, coming soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants