Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

P304 as reference breaks the Wikidata schema #3274

Closed
olea opened this issue Oct 16, 2020 · 31 comments · Fixed by #3721
Closed

P304 as reference breaks the Wikidata schema #3274

olea opened this issue Oct 16, 2020 · 31 comments · Fixed by #3721
Assignees
Labels
Good First Issue Indicates issues suitable for newcomers to design or coding, providing a gentle introduction. Type: Bug Issues related to software defects or unexpected behavior, which require resolution. vulnerability Security vulnerability which needs fixing wikibase Related to wikidata/wikibase integration
Milestone

Comments

@olea
Copy link

olea commented Oct 16, 2020

Hi:

Here using OR v3.4 in Linux. Uploading data to Wikidata.

I found using the P304 brakes the schema: the UI doesn't provide any clear feedback and when looking at the logs I got this:

21:18:01.030 [                  command] Exception caught (15ms)
java.util.regex.PatternSyntaxException: Unknown inline modifier near index 6
(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-–](\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?(, ?(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-–](\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?)*
      ^
	at java.util.regex.Pattern.error(Pattern.java:1969)
	at java.util.regex.Pattern.group0(Pattern.java:2908)
	at java.util.regex.Pattern.sequence(Pattern.java:2065)
	at java.util.regex.Pattern.expr(Pattern.java:2010)
	at java.util.regex.Pattern.group0(Pattern.java:2919)
	at java.util.regex.Pattern.sequence(Pattern.java:2065)
	at java.util.regex.Pattern.expr(Pattern.java:2010)
	at java.util.regex.Pattern.compile(Pattern.java:1702)
	at java.util.regex.Pattern.<init>(Pattern.java:1352)
	at java.util.regex.Pattern.compile(Pattern.java:1028)
	at org.openrefine.wikidata.qa.scrutinizers.FormatScrutinizer.getPattern(FormatScrutinizer.java:68)
	at org.openrefine.wikidata.qa.scrutinizers.FormatScrutinizer.scrutinize(FormatScrutinizer.java:80)
	at org.openrefine.wikidata.qa.scrutinizers.SnakScrutinizer.scrutinizeSnakSet(SnakScrutinizer.java:71)
	at org.openrefine.wikidata.qa.scrutinizers.SnakScrutinizer.scrutinize(SnakScrutinizer.java:64)
	at org.openrefine.wikidata.qa.scrutinizers.StatementScrutinizer.scrutinize(StatementScrutinizer.java:36)
	at org.openrefine.wikidata.qa.EditInspector.inspect(EditInspector.java:107)
	at org.openrefine.wikidata.commands.PreviewWikibaseSchemaCommand.doPost(PreviewWikibaseSchemaCommand.java:92)
	at com.google.refine.RefineServlet.service(RefineServlet.java:187)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
	at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
	at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:81)
	at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:132)
	at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
	at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
	at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
	at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
	at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
	at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
	at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
	at org.mortbay.jetty.Server.handle(Server.java:326)
	at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
	at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:938)
	at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:755)
	at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
	at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
	at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I'm not sure if this happens with any value for P304. In my case the precise value was 69-91

An example of the reference I pretend is, exactly, the one for P31 at Q100407249.

Without using P304 everything seems to work as expected.

Hope this helps.

@olea olea added Type: Bug Issues related to software defects or unexpected behavior, which require resolution. Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators labels Oct 16, 2020
@wetneb wetneb added wikibase Related to wikidata/wikibase integration and removed Status: Pending Review Indicates that the issue or pull request is awaiting review by project maintainers or collaborators labels Oct 16, 2020
@wetneb
Copy link
Sponsor Member

wetneb commented Oct 16, 2020

Looking at the stack trace that's a clear bug: since these regular expressions are user-supplied, we should make sure that they are ignored if they are invalid. They should be ignored rather than reported to the user since the user is not responsible for maintaining Wikidata constraints.

The regular expression currently present on P304 is

(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-–](\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?(, ?(\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?)([-–](\d+(?[A-Za-z]+)?|[A-Za-z]+(?\d+(?[A-Za-z]+)?)?))?)*

which is incorrect in all regex variants known to https://regex101.com/

@wetneb
Copy link
Sponsor Member

wetneb commented Oct 16, 2020

Note that we should also evaluate whether there are security vulnerabilities here: there might be options in Pattern.compile to reject regular expressions whose compilation/evaluation would be too expensive (DoS-like attack).

@VDK
Copy link
Contributor

VDK commented Nov 5, 2020

I have the same issue though I was using page(s) as a qualifier. The preview tab states "Your schema is incomplete, fix it to see the preview."

@wetneb wetneb changed the title P304 as reference brakes the Wikidata schema P304 as reference breaks the Wikidata schema Nov 5, 2020
@wetneb wetneb added the vulnerability Security vulnerability which needs fixing label Nov 5, 2020
@wetneb wetneb self-assigned this Nov 5, 2020
@wetneb wetneb added this to the 3.5 milestone Nov 5, 2020
@wetneb
Copy link
Sponsor Member

wetneb commented Nov 5, 2020

@VDK sorry about this! You should be able to temporarily fix this by fixing the regular expression in the property constraint of "page(s)".

@olea
Copy link
Author

olea commented Nov 5, 2020

Just for the record:

The bug is reported at Wikidata P304 discussion page.

@verdy-p
Copy link

verdy-p commented Nov 6, 2020

Something was changed in the support of non-capturing groups (in the past the colon : was non always needed after (?, but now it is).

This may have been caused by a more recent change in the regexp engine used by Wikidata (now compatible with PCRE 7.7 or higher, and supporting the "PCRE v2" syntax, where the (? can be followed by more flags, notably since it supports now named subroutines and other newer features of PCRE v2).

I commented, and fixed that in the P304 talk page.

@wetneb
Copy link
Sponsor Member

wetneb commented Nov 6, 2020

Thanks! Let's leave this issue open since we can improve how such cases are handled in OpenRefine.

@verdy-p
Copy link

verdy-p commented Nov 7, 2020

Note: the bot that updates the constraint report has still not run. So the statistics are still not updated. So yues we can leave it open for now, until we get fresher statistics. For now we can still check using the "live" SPARQL request.

when I was pinged, I supposed this was caause by [-–] with the en-dash in the character class (I was wrong: so this test did not change the error; I then realized that the syntax (? ... ) which was accepted in the past for non*capturing groups is no longer valid without the colon (?: ... ). i think this occured after the upgrade of PCRE to support extended PCRE (aka "PCRE v2", even if this was supported in PCRE 7.7+ when it added more flags after (?, and even required at least one flag character in [:P'<] or longer "named" flags like (?(DEFINE) ... ): it was then impossible to delimit arbitrary regexps after the opening (? for groups, and the explicit colon became required, or could otherwise generate sometimes validation errors, while in the past it was not necessary and the "non-capturing group" semantic was implied by default.).

Anyway, this means that the regexp validator is still not correct in Wikidata: the error is detected a long time after, when the regexp will start being used to check actual data: even if the regexp is now invalid, it is still accepted "as is", and it seems that the validator jut merely checks that parentheses, braces or brackets are correctly paired, it does not check operators with more complex structure and their flags.

@wetneb
Copy link
Sponsor Member

wetneb commented Nov 7, 2020

@verdy-p thanks for the analysis! That might be worth reporting to the Wikidata team then.

@lydiapintscher
Copy link

Does one of you have a regular expression and a test case that shows what doesn't work? Sorry it's a bit hard for me to piece together what the problem is now.

adding @lucaswerkmeister and @addshore

@lucaswerkmeister
Copy link
Contributor

Anyway, this means that the regexp validator is still not correct in Wikidata: the error is detected a long time after, when the regexp will start being used to check actual data: even if the regexp is now invalid, it is still accepted "as is", and it seems that the validator jut merely checks that parentheses, braces or brackets are correctly paired, it does not check operators with more complex structure and their flags.

There is no regexp validator in Wikidata. format as a regular expression is a string property – for all Wikidata cares, you can put Malbolge code in there and it’ll still be saved. Different users can interpret the values differently – for example, according to the format constraints documentation, KrBot uses PCRE whereas WikibaseQualityConstraints effectively uses java.util.regex.

I’m guessing that OpenRefine should catch the error when parsing the regex, and proceed as if no pattern had been defined on the property (possibly after showing a warning to the user, if possible). Meanwhile, for P304 specifically I suppose there’s no harm, and some gain, in changing the regex on Wikidata from (?…) to (?:…).

(For what it’s worth, WikibaseQualityConstraints currently reports that (?'r'(?'p'(?'d'\d+)(?'w'[A-Za-z]+)?|\g'w'(?:\g'n'\g'w'?)?)(?:[-–]\g'p')?)(?:,\s?\g'r')* is not a valid regular expression.)

@wetneb
Copy link
Sponsor Member

wetneb commented Feb 18, 2021

Yes, there is clearly a bug to fix on our side. I am not sure which validation @verdy-p was referring to in this part:

Anyway, this means that the regexp validator is still not correct in Wikidata: the error is detected a long time after, when the regexp will start being used to check actual data: even if the regexp is now invalid, it is still accepted "as is", and it seems that the validator jut merely checks that parentheses, braces or brackets are correctly paired, it does not check operators with more complex structure and their flags.

If that validation is not in Wikibase itself then we need to figure out where this is coming from! (In that case very sorry for the false alarm, @lydiapintscher and @lucaswerkmeister ).

@ostephens
Copy link
Sponsor Member

@lucaswerkmeister I updated the P304 "format as a regular expression" property over a week ago to fix the problem so it's now
(?'r'(?'p'(?'d'\d+)(?'w'[A-Za-z]+)?|\g'w'(?:\g'd'\g'w'?)?)(?:[-–]\g'p')?)(?:,\s?\g'r')*

Is there any reason WikibaseQualityConstraints is still reporting on the older expression?

@lucaswerkmeister
Copy link
Contributor

I don’t follow – I still see a bunch of (?…) in that regex, presumably WBQC is complaining about those. Or is (?'…) special syntax? (I didn’t find anything like it in the Pattern javadoc.)

@ostephens
Copy link
Sponsor Member

Those are subroutine call groups which are supported in PCRE. You can see it validates on regex101

The correction I made was a typo in the expression (\g'n' instead of \g'd') which made it invalid whether subroutine call groups are supported or not

@verdy-p
Copy link

verdy-p commented Feb 19, 2021

I don’t follow – I still see a bunch of (?…) in that regex, presumably WBQC is complaining about those. Or is (?'…) special syntax? (I didn’t find anything like it in the Pattern javadoc.)

Have you read the talk page in P304: it is explained there, and yes it is a PCRE feature (named subroutines). The Pattern javadoc is about more limited patterns in Java, not PCRE patterns used in Wikidata (or in a PCRE extender library for Java).

Also care was taken to flag non-capturing groups, instead of using simple parentheses for alternates (they would create captures by default without the flags).

Yes there was a typo in the name 'n' (undefined) instead of 'd' (correctly defined) referenced by a subroutine call, but it was fixed later. Note that recently I added the acceptation of semicolons instead of just commas for lists of page numbers or page ranges)

And nobody ever tried to "ping" me on Wikidata, I would have seen the notification and could have solved it much sooner ! the regexp101 site does accept this PCRE syntax (once the 'n' was fixed into 'd', it was wrong in one of the two copies of regexps shown in properties: one for the base property was correct, the other for the constraint had the wrong subroutine name 'n' undefined instead of 'd' which was defined correctly)

@lucaswerkmeister
Copy link
Contributor

Well, if you want the regex to be understood by WikibaseQualityConstraints and OpenRefine, you’ll need to stick to the common subset of PCRE and java.util.regex.

@wetneb
Copy link
Sponsor Member

wetneb commented Feb 19, 2021

Which suggests that we should consider using a different regex engine whose features match Wikibase's better.

@verdy-p
Copy link

verdy-p commented Feb 19, 2021 via email

@verdy-p
Copy link

verdy-p commented Feb 19, 2021 via email

@lucaswerkmeister
Copy link
Contributor

But Wikibase uses PCRE (in fact it is the SPARQL which uses PCRE according
to the definition of SPARQL by the W3C, which indicates which version of
PCRE it matches).

Where do you get that from? The SPARQL 1.1 REGEX function is defined in terms of XPath fn:matches, using XPath and XQuery regular expression syntax, which in turn is based on XML Schema regular expressions. None of these standards mention PCRE, as far as I can tell; the latter two have a few mentions of Perl, but only of an informational nature (e.g. based on the established conventions of languages such as Perl).

But that’s kind of beside the point, since Blazegraph doesn’t implement XPath regexes – it uses java.util.regexPattern class. You can test this e.g. using the pattern \p{javaLowerCase} (Virtuoso confirms that’s not a valid SPARQL regex).

Note that I have also added '^...$' anchors to the regexp to avoid partial
matches: the regexp must match the full string to avoid false positives
(where there could be some number anywhere in the string).

The format is already interpreted to cover the entire string, at least by WikibaseQualityConstraints (and I believe by KrBot as well). Adding explicit anchors to the pattern is unnecessary.

Also note that Wikidata really HAS a validtor for regexps

Where? And if it exists (I’m pretty sure it doesn’t), why did it let me save (((\\\(\((\ as a format?

@verdy-p
Copy link

verdy-p commented Feb 19, 2021 via email

@emu-at
Copy link

emu-at commented Feb 26, 2021

So if I understand correctly, due to limitations in OpenRefine this problem can’t be fixed unless the project switches to a different regex engine? That’s really sobering … Is there any possibility to add an option in OpenRefine that circumvents the regex check altogether? I mean, it’s not really a problem for Wikidata per se because you can add all sorts of unchecked data through QuickStatements anyway …

@ostephens
Copy link
Sponsor Member

ostephens commented Feb 26, 2021

So if I understand correctly, due to limitations in OpenRefine this problem can’t be fixed unless the project switches to a different regex engine?

No I don't think that's the case. As @wetneb mentions right at the top of this thread:

Looking at the stack trace that's a clear bug: since these regular expressions are user-supplied, we should make sure that they are ignored if they are invalid. They should be ignored rather than reported to the user since the user is not responsible for maintaining Wikidata constraints.

While it might also be appropriate for OpenRefine to look at alternative approaches for regular expression parsing, I think the key thing in this case is for there to be a graceful failure when encountering regular expressions that can't be parsed for some reason. This should (in my opinion) include telling the user that the condition can't be checked, but not prevent the user from uploading the data in this case (note there is a difference here between data that definitely doesn't match a regular expression and data that cannot be checked against a regular expression - I'd suggest we are strict with the former and relaxed with the latter)

@wetneb wetneb added the Good First Issue Indicates issues suitable for newcomers to design or coding, providing a gentle introduction. label Feb 26, 2021
@wetneb
Copy link
Sponsor Member

wetneb commented Feb 26, 2021

This issue is really easy to solve as far as OpenRefine is concerned: just by ignoring regexes which our regex engine cannot parse.

@verdy-p
Copy link

verdy-p commented Feb 26, 2021 via email

@verdy-p
Copy link

verdy-p commented Feb 26, 2021 via email

@lucaswerkmeister
Copy link
Contributor

Wikidata has full support of PCRE in its implemented tools.

Which tools, specifically, are you talking about? Because, again, neither WikibaseQualityConstraints nor the Wikidata Query Service support PCRE.

@verdy-p
Copy link

verdy-p commented Feb 27, 2021 via email

@tfmorris
Copy link
Member

tfmorris commented Feb 27, 2021

Please move the Wikidata discussion someplace more appropriate. Let's keep this issue focused on the solution to handling illegal regular expressions in OpenRefine, which, as @wetneb said, is as simple as adding a try/catch block, thus the good first issue label . All this extraneous discussion is just going to confuse/discourage newcomers who are trying to figure out what they need to code for a pull request.

@verdy-p

This comment has been minimized.

tfmorris pushed a commit that referenced this issue Apr 4, 2021
* Ignore invalid regexes from Wikibase format constraints. Closes #3274.

* Add logging
gitonthescene pushed a commit to gitonthescene/OpenRefine that referenced this issue May 1, 2021
…3721)

* Ignore invalid regexes from Wikibase format constraints. Closes OpenRefine#3274.

* Add logging
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Good First Issue Indicates issues suitable for newcomers to design or coding, providing a gentle introduction. Type: Bug Issues related to software defects or unexpected behavior, which require resolution. vulnerability Security vulnerability which needs fixing wikibase Related to wikidata/wikibase integration
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants