Polar-integer datatypes not verifying polarity of lexical value #1757

Closed
ajnelson-nist opened this issue Mar 14, 2022 · 4 comments · Fixed by #1773

Comments

@ajnelson-nist
Contributor

I recently encountered an issue using pySHACL to validate an xsd:positiveInteger and found it was accepting "0"^^xsd:positiveInteger.

From review of today's term.py and the influence of the members of _NUMERIC_LITERAL_TYPES, and from a test-patch (linking in a few moments, I need this Issue number first), this is currently an acceptable rdflib statement:

value = Literal("-1", datatype=XSD.positiveInteger)

What should this behavior be instead? A runtime error (something like raise ValueError) would be correct to me, but I appreciate it'd potentially be a jarring change for many applications.
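To illustrate why such a statement can pass unnoticed, here is a minimal sketch that assumes the lexical value is converted with a plain `int` call and nothing else. The `CONVERTERS` table and `to_python` function are hypothetical stand-ins for illustration, not rdflib's actual API.

```python
# Hypothetical stand-in for a datatype-to-converter table (illustrative names only).
CONVERTERS = {
    "xsd:positiveInteger": int,
    "xsd:negativeInteger": int,
}


def to_python(lexical: str, datatype: str) -> int:
    """Convert a lexical form using only the converter registered for the datatype."""
    return CONVERTERS[datatype](lexical)


value = to_python("-1", "xsd:positiveInteger")
print(value)  # -1: the conversion succeeds, so nothing flags the wrong polarity
```

Because `int` accepts any signed integer string, a conversion-only approach has no step at which the polarity of the value could be rejected.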

ajnelson-nist added a commit to ajnelson-nist/pySHACL that referenced this issue Mar 14, 2022
The third test in test_integers.py now constructs Literal objects for a
strict set-equivalence of expected failing path-value pairs.  However,
it's not clear whether these Literals should have been constructible
with rdflib, so this patch might revert significantly.

References:
* RDFLib/rdflib#1757

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
@ashleysommer
Contributor

ashleysommer commented Mar 15, 2022

@ajnelson-nist Thanks for opening this issue. However, this is a duplicate of my issue #848 from 2018, which is itself a duplicate of #737 from 2017.

There is not much we can do about this. In the linked data world there is nothing stopping you from saying:

ex:myObject ex:value "chicken"^^xsd:boolean

or (like your example)

ex:myDog ex:numLegs "-3"^^xsd:positiveInteger

There is no mandate on the RDF backend to ensure the lexical string matches and fits into the given data type.

I made a suggestion in #848 that RDFLib should at least run a test on some well known XSD datatypes to add an ill-formed property on the Literal when it is created. That way, PySHACL can simply consider a sh:datatype constraint to be non-conformant whenever an ill-formed Literal is used as the Value node. That would help on other PySHACL constraints too, where the SHACL Specification specifically states a "well-formed Literal" should be used.
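A rough sketch of that suggestion, assuming a flag computed once at creation time. `FlaggedLiteral`, `ill_formed`, and the `_POLARITY` table are hypothetical names chosen for illustration and are not part of RDFLib; whitespace normalization of the lexical form is deliberately left out.

```python
import re
from dataclasses import dataclass, field

# Lexical form of an XSD integer: optional sign followed by ASCII digits.
_INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")

# Hypothetical polarity predicates, keyed by datatype local name.
_POLARITY = {
    "positiveInteger": lambda n: n > 0,
    "nonNegativeInteger": lambda n: n >= 0,
    "negativeInteger": lambda n: n < 0,
    "nonPositiveInteger": lambda n: n <= 0,
}


def _is_well_formed(lexical: str, datatype: str) -> bool:
    check = _POLARITY.get(datatype)
    if check is None:
        return True  # datatype not covered by this sketch: express no opinion
    if _INTEGER_LEXICAL.fullmatch(lexical) is None:
        return False  # not an integer lexical form at all
    return check(int(lexical))


@dataclass(frozen=True)
class FlaggedLiteral:
    lexical: str
    datatype: str
    ill_formed: bool = field(init=False)

    def __post_init__(self) -> None:
        # The literal is still created; it only carries a marker for consumers.
        flag = not _is_well_formed(self.lexical, self.datatype)
        object.__setattr__(self, "ill_formed", flag)
```

A validator could then treat `FlaggedLiteral("-3", "positiveInteger").ill_formed` as grounds for a failed sh:datatype check without the literal ever being rejected at construction.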

@ashleysommer
Contributor

@nicholascar Do you have any opinions on this?

@nicholascar
Member

It's true that RDF/OWL is pretty weak on datatypes in general... but Python isn't! We could decide to enforce strict typing in RDFLib so that it can only produce values that both Python and RDF/OWL consider correct. This would be equivalent to how we enforce the use of defined namespace members with DefinedNamespace or ClosedNamespace: exceptions are thrown at the user, but there is a way around them if you really must. In the case of namespaces, the workaround is to not use the rdflib.namespace-provisioned DefinedNamespace class and instead build the term yourself from a string or IRI, e.g. URIRef("http://www.w3.org/2000/01/rdf-schema#NaughtyNick") for rdfs:NaughtyNick, which is not defined in the RDFS namespace.

So, we could throw a value error or similar, I think. What to do about parsing bad values though... Perhaps it's enough to prevent bad data generation (unless a deliberate workaround is used) and allow, but warn about, bad data parsing.

We could indicate that if you really must make the triple

ex:myDog ex:numLegs "-3"^^xsd:positiveInteger

you should parse it from text and expect the warning.
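The split between strict construction and lenient parsing could look roughly like the sketch below. All three function names are hypothetical, only xsd:positiveInteger is handled, and none of this reflects an existing RDFLib interface.

```python
import warnings


def check_positive_integer(lexical: str, *, strict: bool) -> int:
    """Return the integer value, rejecting or warning when it is below 1."""
    value = int(lexical)
    if value < 1:
        message = f"{lexical!r} is outside the value space of xsd:positiveInteger"
        if strict:
            raise ValueError(message)
        warnings.warn(message, UserWarning)
    return value


def construct_value(lexical: str) -> int:
    # Programmatic creation: bad data is refused outright.
    return check_positive_integer(lexical, strict=True)


def parse_value(lexical: str) -> int:
    # Reading existing data: bad data is kept, but the user is warned.
    return check_positive_integer(lexical, strict=False)
```

Under this assumption, `construct_value("-3")` raises `ValueError`, while `parse_value("-3")` returns `-3` and emits a `UserWarning`.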

We could poll the mailing list for an approach?

@ghost

ghost commented Mar 16, 2022

> So, we could throw a value error or similar, I think. What to do about parsing bad values though... Perhaps it's enough to prevent bad data generation (unless a deliberate workaround is used) and allow, but warn about, bad data parsing.

My 2c worth ... there are about 10 extant issues directly related to this, but we don't know to what extent externally sourced graphs are non-conformant with respect to values agreeing with their datatype declarations, so people could find themselves drowning in warnings. I totally agree with the suggestion of flagging a Literal as non-conformant, but I believe we should devolve actual enforcement to the user.

I had a dig into this a few months ago and produced these notes (based on the XSD reference):

URIRef(_XSD_PFX + "normalizedString"): None,  # The lexical space of xsd:normalizedString is unconstrained (any valid XML
                                              # character may be used). Its value space is the set of strings after whitespace
                                              # replacement—i.e., after any occurrence of #x9 (tab), #xA (linefeed), and
                                              # #xD (carriage return) have been replaced by an occurrence of #x20 (space)
                                              # without any whitespace collapsing.
URIRef(_XSD_PFX + "token"): None,             # The lexical and value spaces of xsd:token are the sets of all strings after
                                              # whitespace replacement and collapsing. Replacement turns each occurrence of
                                              # #x9 (tab), #xA (linefeed), or #xD (carriage return) into #x20 (space).
                                              # Collapsing replaces contiguous occurrences of spaces with a single space
                                              # and removes leading and trailing spaces.
URIRef(_XSD_PFX + "language"): None,
URIRef(_XSD_PFX + "boolean"): _parseBoolean,  # The value space of xsd:boolean is true and false. Its lexical space accepts true,
                                              # false, and also 1 (for true) and 0 (for false).
URIRef(_XSD_PFX + "decimal"): Decimal,        # decimal number of arbitrary precision
URIRef(_XSD_PFX + "integer"): long_type,      # arbitrarily large integer
URIRef(_XSD_PFX + "nonPositiveInteger"): int, # Maximum Inclusive: 0
URIRef(_XSD_PFX + "long"): long_type,         # Minimum Inclusive: -9223372036854775808 Maximum Inclusive: 9223372036854775807
URIRef(_XSD_PFX + "nonNegativeInteger"): int, # Minimum Inclusive: 0
URIRef(_XSD_PFX + "negativeInteger"): int,    # Maximum Inclusive: -1
URIRef(_XSD_PFX + "int"): long_type,          # Minimum Inclusive: -2147483648 Maximum Inclusive: 2147483647
URIRef(_XSD_PFX + "unsignedLong"): long_type, # Minimum Inclusive: 0 Maximum Inclusive: 18446744073709551615
URIRef(_XSD_PFX + "positiveInteger"): int,    # Minimum Inclusive: 1
URIRef(_XSD_PFX + "short"): int,              # Minimum Inclusive: -32768 Maximum Inclusive: 32767
URIRef(_XSD_PFX + "unsignedInt"): long_type,  # Minimum Inclusive: 0 Maximum Inclusive: 4294967295
URIRef(_XSD_PFX + "byte"): int,               # Minimum Inclusive: -128 Maximum Inclusive: 127
URIRef(_XSD_PFX + "unsignedShort"): int,      # Minimum Inclusive: 0 Maximum Inclusive: 65535
URIRef(_XSD_PFX + "unsignedByte"): int,       # Minimum Inclusive: 0 Maximum Inclusive: 255
URIRef(_XSD_PFX + "float"): float,            # An IEEE single-precision 32-bit floating-point number, the format is a
                                              # mantissa followed, optionally, by the character 'E' or 'e' followed by
                                              # an integer exponent, the following values are valid: INF (infinity), -INF
                                              # (negative infinity), and NaN (Not a Number); INF is considered to be
                                              # greater than all other values, while -INF is less than all other values
                                              # and the value NaN cannot be compared to any other values although it
                                              # equals itself.
URIRef(_XSD_PFX + "double"): float,           # An IEEE double-precision 64-bit floating-point number, the format is a
                                              # mantissa followed, optionally, by the character 'E' or 'e' followed by
                                              # an integer exponent, the following values are valid: INF (infinity), -INF
                                              # (negative infinity), and NaN (Not a Number); INF is considered to be
                                              # greater than all other values, while -INF is less than all other values
                                              # and the value NaN cannot be compared to any other values although it
                                              # equals itself.
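
The integer bounds in these notes could be turned into a lookup table that a user-side check consults. This is only a sketch: `BOUNDS` and `in_value_space` are hypothetical names, the limits are taken from the XSD datatype definitions, and the open-ended side of each polar type is represented as `None`.

```python
# Inclusive (minimum, maximum) per XSD integer datatype; None means unbounded.
BOUNDS = {
    "integer": (None, None),
    "nonPositiveInteger": (None, 0),
    "negativeInteger": (None, -1),
    "nonNegativeInteger": (0, None),
    "positiveInteger": (1, None),
    "long": (-2**63, 2**63 - 1),
    "int": (-2**31, 2**31 - 1),
    "short": (-2**15, 2**15 - 1),
    "byte": (-2**7, 2**7 - 1),
    "unsignedLong": (0, 2**64 - 1),
    "unsignedInt": (0, 2**32 - 1),
    "unsignedShort": (0, 2**16 - 1),
    "unsignedByte": (0, 2**8 - 1),
}


def in_value_space(value: int, datatype: str) -> bool:
    """Report whether value lies within the inclusive bounds of datatype."""
    low, high = BOUNDS[datatype]
    above_minimum = low is None or value >= low
    below_maximum = high is None or value <= high
    return above_minimum and below_maximum
```

Keeping the bounds as data rather than as per-type functions leaves the decision about what to do with an out-of-range value (flag, warn, or raise) to the caller.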
