P304 as reference breaks the Wikidata schema #3274
Comments
Looking at the stack trace, that's a clear bug: since these regular expressions are user-supplied, we should make sure that they are ignored if they are invalid. They should be ignored rather than reported to the user, since the user is not responsible for maintaining Wikidata constraints. The regular expression currently present on P304 is
which is incorrect in all regex variants known to https://regex101.com/ |
Note that we should also evaluate whether there are security vulnerabilities here: there might be options in |
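The original concern above is truncated, but one classic risk with user-supplied patterns is catastrophic backtracking. A common mitigation, shown here only as a rough sketch under that assumption (the wrapper class is hypothetical, not OpenRefine's actual code), is to give each match a time budget by wrapping the input in a deadline-aware CharSequence:

```java
import java.util.regex.Pattern;

// Sketch only: bound the time spent matching a user-supplied regex by wrapping
// the input in a CharSequence whose charAt() aborts once a deadline has passed,
// which defuses catastrophic backtracking on pathological patterns.
final class DeadlineCharSequence implements CharSequence {
    private final CharSequence delegate;
    private final long deadlineNanos;

    private DeadlineCharSequence(CharSequence delegate, long deadlineNanos) {
        this.delegate = delegate;
        this.deadlineNanos = deadlineNanos;
    }

    static DeadlineCharSequence wrap(CharSequence delegate, long timeoutMillis) {
        return new DeadlineCharSequence(delegate, System.nanoTime() + timeoutMillis * 1_000_000L);
    }

    @Override public char charAt(int index) {
        if (System.nanoTime() > deadlineNanos) {
            throw new IllegalStateException("regex evaluation exceeded its time budget");
        }
        return delegate.charAt(index);
    }

    @Override public int length() { return delegate.length(); }

    @Override public CharSequence subSequence(int start, int end) {
        return new DeadlineCharSequence(delegate.subSequence(start, end), deadlineNanos);
    }

    @Override public String toString() { return delegate.toString(); }

    public static void main(String[] args) {
        // "(a+)+$" is a classic catastrophic-backtracking pattern on this input.
        CharSequence input = wrap("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!", 200);
        try {
            System.out.println(Pattern.compile("(a+)+$").matcher(input).matches());
        } catch (IllegalStateException e) {
            System.out.println("match aborted: " + e.getMessage());
        }
    }
}
```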
I have the same issue though I was using page(s) as a qualifier. The preview tab states "Your schema is incomplete, fix it to see the preview." |
@VDK sorry about this! You should be able to work around this temporarily by fixing the regular expression in the property constraint of "page(s)". |
Just for the record: the bug is reported on the Wikidata P304 discussion page. |
Something was changed in the support of non-capturing groups (in the past the colon …). This may have been caused by a more recent change in the regexp engine used by Wikidata (now compatible with PCRE 7.7 or higher, and supporting the "PCRE v2" syntax, where the …). I commented on this, and fixed it, on the P304 talk page. |
Thanks! Let's leave this issue open since we can improve how such cases are handled in OpenRefine. |
Note: the bot that updates the constraint report has still not run, so the statistics are not yet updated. So yes, we can leave this open for now, until we get fresher statistics. In the meantime we can still check using the "live" SPARQL request. When I was pinged, I supposed this was caused by … Anyway, this means that the regexp validator in Wikidata is still not correct: the error is only detected long after, once the regexp starts being used to check actual data. Even if the regexp is invalid, it is still accepted "as is"; the validator seems to merely check that parentheses, braces and brackets are correctly paired, and does not check operators with more complex structure and their flags. |
@verdy-p thanks for the analysis! That might be worth reporting to the Wikidata team then. |
Does one of you have a regular expression and a test case that shows what doesn't work? Sorry, it's a bit hard for me to piece together what the problem is now. Adding @lucaswerkmeister and @addshore |
There is no regexp validator in Wikidata. "format as a regular expression" is a string property – for all Wikidata cares, you can put Malbolge code in there and it'll still be saved. Different users can interpret the values differently – for example, according to the format constraints documentation, KrBot uses PCRE whereas WikibaseQualityConstraints effectively uses java.util.regex. I'm guessing that OpenRefine should catch the error when parsing the regex and proceed as if no pattern had been defined on the property (possibly after showing a warning to the user, if possible). Meanwhile, for P304 specifically, I suppose there's no harm, and some gain, in changing the regex on Wikidata from … (For what it's worth, WikibaseQualityConstraints currently reports that …) |
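To make the flavour mismatch concrete (the real P304 constraint is not reproduced here; the pattern below is a made-up example using the PCRE-only subroutine syntax discussed later in this thread): regex101's PCRE flavour accepts such a pattern, while java.util.regex throws the exception that OpenRefine would need to catch.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class FlavourMismatchDemo {
    public static void main(String[] args) {
        // A PCRE (?(DEFINE)...) block plus \g'name' subroutine calls: valid for
        // PCRE-based consumers such as KrBot or regex101, but not java.util.regex.
        String pcreOnly = "(?(DEFINE)(?'d'[0-9]+))\\g'd'(?:-\\g'd')?";
        try {
            Pattern.compile(pcreOnly);
        } catch (PatternSyntaxException e) {
            // This is the error a java.util.regex-based consumer has to handle.
            System.out.println("Rejected by java.util.regex: " + e.getDescription());
        }
    }
}
```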
Yes, there is clearly a bug to fix on our side. I am not sure which validation @verdy-p was referring to in this part:
If that validation is not in Wikibase itself, then we need to figure out where it is coming from! (In that case, very sorry for the false alarm, @lydiapintscher and @lucaswerkmeister.) |
@lucaswerkmeister I updated the P304 "format as a regular expression" property over a week ago to fix the problem, so it's now … Is there any reason WikibaseQualityConstraints is still reporting on the older expression? |
I don’t follow – I still see a bunch of … |
Those are subroutine call groups, which are supported in PCRE. You can see it validates on regex101. The correction I made was a typo in the expression (\g'n' instead of \g'd') which made it invalid whether subroutine call groups are supported or not. |
Have you read the talk page of P304? It is explained there, and yes, it is a PCRE feature (named subroutines). The Pattern javadoc is about the more limited patterns in Java, not the PCRE patterns used in Wikidata (or in a PCRE extension library for Java). Care was also taken to flag non-capturing groups instead of using simple parentheses for alternates (which would create captures by default without the flags). Yes, there was a typo: the name 'n' (undefined) was referenced by a subroutine call instead of 'd' (correctly defined), but it was fixed later. Note that recently I also added acceptance of semicolons, not just commas, for lists of page numbers or page ranges. And nobody ever tried to "ping" me on Wikidata; I would have seen the notification and could have solved it much sooner! The regex101 site does accept this PCRE syntax (once the 'n' was fixed into 'd'; it was wrong in one of the two copies of the regexp shown in the properties: the one for the base property was correct, the other for the constraint used the undefined subroutine name 'n' instead of the correctly defined 'd'). |
Well, if you want the regex to be understood by WikibaseQualityConstraints and OpenRefine, you’ll need to stick to the common subset of PCRE and java.util.regex. |
Which suggests that we should consider using a different regex engine whose features match Wikibase's better. |
But Wikibase uses PCRE (in fact it is SPARQL which uses PCRE, according to the W3C definition of SPARQL, which indicates which version of PCRE it matches). Wikibase should use the same choice as SPARQL. And Wikibase and SPARQL both support "named capture groups" the way they were used and described in Wikidata's P304.
So the regex101 website shows that this syntax is valid, and it works. It was correct in Wikidata in one of the properties and in the description; only one Wikidata property used the wrong name 'n' instead of 'd' because of a typo.
Note that I have also added '^...$' anchors to the regexp to avoid partial matches: the regexp must match the full string to avoid false positives (where there could be some number anywhere in the string).
I also recently added acceptance of the semicolon, like the comma, for lists, even though I think we could avoid comma-separated lists by using separate properties (so that each range could carry its own "usage" or "applies to" qualifiers, instead of accepting annotations, often in parentheses after a page number, such as "(printed edition)" or "(digital edition)", or different paginations depending on the publishing format, or a re-edition with extra illustrations or a different layout changing the pagination; the regexp does not tolerate such annotations in free text, which would also need translation, whereas qualifiers on the property are more convenient and make this data more reusable in different presentation contexts or in languages needing other punctuation or ordering).
With these two additions, the number of violations has been reduced again (and it allowed me to see that some "pages" values contained dates or years, sometimes without even parentheses, so these were ambiguous in terms of data)...
|
Also note that Wikidata really HAS a validator for regexps; technically the reference to \g'n' was not syntactically incorrect, it was just not working because the name 'n' of the subroutine was not defined in the regexp itself before its use by "\g'n'".
Wikidata validates the syntax but does not check that named subroutines are defined. And in regex101, a subroutine call to an undefined subroutine name is still valid, but it matches nothing, as expected. You could use \g'dummy' to force a regexp not to match, for example in one of the alternatives, to exclude some forbidden prefixes or some prefixes that are kept "reserved" and not meant to be matched. However such a thing is not recommended; we should only reference subroutine names that are defined.
All of this was already in the comment I posted on the talk page of P304, describing that PCRE feature, giving enough hints about how to interpret it, and explaining why it was used (allowing a very compact and generalizable regexp).
But the recent addition of the '^...$' anchors made the regexp match faster (notably the first anchor), so that it can process many more Wikidata properties by avoiding many partial false positives and retries from other positions scanning the entire value (initially Wikidata used SPARQL with these '^...$' anchors implicitly added around the specified regexp; this is no longer the case).
|
Where do you get that from? The SPARQL 1.1 REGEX function (https://www.w3.org/TR/sparql11-query/#func-regex) is defined in terms of XPath fn:matches, using XPath and XQuery regular expression syntax (https://www.w3.org/TR/xpath-functions/#regex-syntax), which in turn is based on XML Schema regular expressions (https://www.w3.org/TR/xmlschema-2/#regexs). None of these standards mention PCRE, as far as I can tell; the latter two have a few mentions of Perl, but only of an informational nature (e.g. "based on the established conventions of languages such as Perl"). But that’s kind of beside the point, since Blazegraph doesn’t implement XPath regexes – it uses java.util.regex’ Pattern class. You can test this e.g. using the pattern \p{javaLowerCase} (https://query.wikidata.org/#SELECT%20%28REGEX%28%22a%22%2C%20%22%5C%5Cp%7BjavaLowerCase%7D%22%29%20AS%20%3Fq%29%20%7B%7D); Virtuoso confirms that’s not a valid SPARQL regex.
The format is already interpreted to cover the entire string, at least by WikibaseQualityConstraints (and I believe by KrBot as well). Adding explicit anchors to the pattern is unnecessary.
Where? And if it exists (I’m pretty sure it doesn’t), why did it let me save (((\\\(\((\ as a format (https://www.wikidata.org/w/index.php?title=Q4115189&diff=1365241122&oldid=1365048428)? |
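Decoding the query in the link above gives SELECT (REGEX("a", "\\p{javaLowerCase}") AS ?q) {}. The \p{javaLowerCase} property is specific to java.util.regex (it maps to Character.isLowerCase), so a true result from the query service points at a Java Pattern-based engine rather than XPath regexes. A quick local check (a sketch, not part of any of the tools discussed here):

```java
import java.util.regex.Pattern;

public class JavaLowerCaseProbe {
    public static void main(String[] args) {
        // Valid only in java.util.regex: \p{javaLowerCase} delegates to
        // Character.isLowerCase, so "a" matches and this prints true.
        System.out.println(Pattern.matches("\\p{javaLowerCase}", "a"));
    }
}
```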
Maybe it has been disabled for you, but I have a validation running that blocks such an insertion when submitting. At least it checks that enclosing delimiters are correctly paired. Maybe it does not run for everyone, or there's a setting.
As well, using unanchored regexps gives more false positives from partial matches (and slower execution of the parser, giving more timeout errors), so anchors are visibly needed. Wikidata should work on stabilizing the regexp engines it uses (including in the tools used in the documentation and in the existing scheduled parsers and reports).
|
So if I understand correctly, due to limitations in OpenRefine this problem can’t be fixed unless the project switches to a different regex engine? That’s really sobering … Is there any possibility to add an option in OpenRefine that circumvents the regex check altogether? I mean, it’s not really a problem for Wikidata per se because you can add all sorts of unchecked data through QuickStatements anyway … |
No I don't think that's the case. As @wetneb mentions right at the top of this thread:
While it might also be appropriate for OpenRefine to look at alternative approaches for regular expression parsing, I think the key thing in this case is for there to be a graceful failure when encountering regular expressions that can't be parsed for some reason. This should (in my opinion) include telling the user that the condition can't be checked, but not prevent the user from uploading the data in this case (note there is a difference here between data that definitely doesn't match a regular expression and data that cannot be checked against a regular expression - I'd suggest we are strict with the former and relaxed with the latter) |
This issue is really easy to solve as far as OpenRefine is concerned: we just ignore regexes which our regex engine cannot parse. |
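A minimal sketch of that approach, assuming the constraint is compiled with java.util.regex (the class and method names here are hypothetical, not OpenRefine's actual code):

```java
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper: compile a "format as a regular expression" constraint,
// treating an unparsable pattern as if no constraint were defined, so that an
// invalid user-maintained regex on Wikidata cannot break the schema or block an upload.
final class FormatConstraintCompiler {
    static Optional<Pattern> compile(String regex) {
        try {
            return Optional.of(Pattern.compile(regex));
        } catch (PatternSyntaxException e) {
            // Graceful failure: log it (and ideally surface a non-blocking warning
            // that the format check was skipped), then carry on without the check.
            System.err.println("Skipping invalid format constraint: " + e.getMessage());
            return Optional.empty();
        }
    }
}
```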
As far as I can tell, the only thing your engine does not support is the PCRE subroutine syntax, which is used here only as a "macro"-like facility (i.e. without recursion), so an equivalent is easy to produce (in fact the transform between the expanded form and the compressed form using subroutines is straightforward and fully described on the talk page). It was used to "compress" the regexp into a shorter form, which is then easier to maintain.
Supporting basic subroutines that are "macro-like" (i.e. not containing any self-recursion) is straightforward in a regexp engine, notably because I had also taken care of using non-capturing groups with "(?: ... )", so there are *no* side effects on the capture-group numbering, and there are likewise no "\1" ... "\9" references to numbered capturing groups in the search pattern (which are ALSO subroutines, except they are "named" implicitly with the numbers assigned to unflagged "( ... )" capture groups instead of symbolic names).
For some more complex regexps elsewhere (e.g. OpenRefine) it is necessary to define and use these subroutines, due to length constraints in Wikidata, but also because of side effects on capture-group numbering (it would force all these common subpatterns to flag internal non-capturing groups explicitly, and the regexp rapidly explodes in length, becoming longer than Wikidata's length constraint for a single property; such a long regexp then becomes difficult for humans to edit).
Named subroutines are a very nice addition in PCRE (and in some other engines, sometimes with an alternate syntax; note that PCRE also supports alternate equivalent syntaxes like "(?(DEFINE)...)" or other delimiters for the subroutine name, such as angle brackets). Wikidata has full support of PCRE in its implemented tools. Other external tools may need to see how they can "translate" between regexp engine dialects, but PCRE is the best documented and supported engine and one of the fastest, also with full support of Unicode (not limited to 7-bit or 8-bit character classes like many basic engines, which are not suitable at all for Wikidata).
|
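As a rough illustration of that macro-expansion idea (this is not the real P304 constraint, just a simplified page-range pattern built the same way): the sub-expression is kept once in a string constant and substituted wherever the PCRE version would call \g'd', producing a pattern that java.util.regex also accepts.

```java
import java.util.regex.Pattern;

public class ExpandedPageRangePattern {
    public static void main(String[] args) {
        // Simplified "page number" sub-pattern (Arabic digits or lowercase Roman
        // numerals), expanded in place instead of being defined as a PCRE subroutine.
        String d = "(?:[0-9]+|[ivxlcdm]+)";
        String range = "(?:" + d + "(?:-" + d + ")?)";
        // Anchored full-string match over a comma- or semicolon-separated list.
        Pattern p = Pattern.compile("^" + range + "(?:[,;] ?" + range + ")*$");

        System.out.println(p.matcher("69-91").matches());           // true
        System.out.println(p.matcher("12, 15-17; iv").matches());   // true
        System.out.println(p.matcher("p. 12 (printed)").matches()); // false
    }
}
```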
Note: is there a possibility for Wikidata to indicate which regexp syntax it requires (possibly with a version, for PCRE), and to allow providing alternate regexps in syntaxes demanding fewer features or using an alternate syntax, even if they don't validate the value as strictly as the preferred one used in Wikidata's own builtin tools, or are more memory-intensive when compiling, less performant while scanning, and more difficult for humans to maintain?
|
Which tools, specifically, are you talking about? Because, again, neither WikibaseQualityConstraints nor the Wikidata Query Service support PCRE. |
Well, all the validation tools and bots running and serving quality reports on Wikidata support these named subroutines (I've made multiple runs to show that they are effective and don't generate any additional errors when processing, nor false positives, nor the false negatives we would see if these subroutines were not working; and they DO have the same syntax as PCRE, even if PCRE is not the engine effectively used).
I've tested them with the existing constraint templates used at the top of talk pages for Wikidata elements, posting links to the scheduled generated reports, or showing links to run a SPARQL request (this works best with the "new" SPARQL link, which is much faster) and returning the results (cached for some hours) in a browsable datatable.
|
Please move the Wikidata discussion someplace more appropriate. Let's keep this issue focused on how to handle illegal regular expressions in OpenRefine, which, as @wetneb said, is as simple as adding a try/catch block, thus the … |
* Ignore invalid regexes from Wikibase format constraints. Closes #3274. * Add logging
Hi:
Here using OR v3.4 in Linux. Uploading data to Wikidata.
I found that using P304 breaks the schema: the UI doesn't provide any clear feedback, and when looking at the logs I got this:
I'm not sure if this happens with any value for P304. In my case the precise value was 69-91.
An example of the reference I intend is exactly the one for P31 at Q100407249.
Without using P304 everything seems to work as expected.
Hope this helps.