Polar-integer datatypes not verifying polarity of lexical value #1757

ajnelson-nist · 2022-03-14T16:42:58Z

I recently encountered an issue using pySHACL to validate an xsd:positiveInteger and found it was accepting "0"^^xsd:positiveInteger.

From review of today's term.py and the influence of the members of _NUMERIC_LITERAL_TYPES, and from a test-patch (linking in a few moments, I need this Issue number first), this is currently an acceptable rdflib statement:

value = Literal("-1", datatype=XSD.positiveInteger)

What should this behavior be instead? A runtime error (something like raise ValueError) would be correct to me, but I appreciate it'd potentially be a jarring change for many applications.


The third test in test_integers.py now constructs Literal objects for a strict set-equivalence of expected failing path-value pairs. However, it's not clear whether these Literals should have been constructible with rdflib, so this patch might revert significantly. References: * RDFLib/rdflib#1757 Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>

ashleysommer · 2022-03-15T22:04:07Z

@ajnelson-nist Thanks for opening this issue; however, this is a duplicate of my issue #848 from 2018, which is itself a duplicate of #737 from 2017.

There is not much we can do about this. In the linked data world there is nothing stopping you from saying:

ex:myObject ex:value "chicken"^^xsd:boolean

or (like your example)

ex:myDog ex:numLegs "-3"^^xsd:positiveInteger

There is no mandate on the RDF backend to ensure the lexical string matches and fits into the given data type.

I made a suggestion in #848 that RDFLib should at least run a test on some well known XSD datatypes to add an ill-formed property on the Literal when it is created. That way, PySHACL can simply consider a sh:datatype constraint to be non-conformant whenever an ill-formed Literal is used as the Value node. That would help on other PySHACL constraints too, where the SHACL Specification specifically states a "well-formed Literal" should be used.
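The suggestion above could be sketched roughly as follows. This is a standalone illustration, not rdflib code: `SketchLiteral`, `WELL_FORMED_CHECKS`, and the `ill_formed` attribute are hypothetical names, and only two datatypes are covered. A literal with an unknown datatype is left unflagged, since nothing can be said about it.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict

XSD = "http://www.w3.org/2001/XMLSchema#"
_INTEGER_RE = re.compile(r"[+-]?[0-9]+")


def _is_boolean(lexical: str) -> bool:
    # Lexical space of xsd:boolean: true, false, 1, 0.
    return lexical in {"true", "false", "1", "0"}


def _is_positive_integer(lexical: str) -> bool:
    # Must look like an integer and be at least 1.
    return bool(_INTEGER_RE.fullmatch(lexical)) and int(lexical) >= 1


# Hypothetical registry of checks for a few well-known datatypes.
WELL_FORMED_CHECKS: Dict[str, Callable[[str], bool]] = {
    XSD + "boolean": _is_boolean,
    XSD + "positiveInteger": _is_positive_integer,
}


@dataclass(frozen=True)
class SketchLiteral:
    """Stand-in for a literal that records, but does not reject, bad values."""

    lexical: str
    datatype: str
    ill_formed: bool = field(init=False)

    def __post_init__(self) -> None:
        check = WELL_FORMED_CHECKS.get(self.datatype)
        flag = check is not None and not check(self.lexical)
        # The dataclass is frozen, so set the computed flag explicitly.
        object.__setattr__(self, "ill_formed", flag)


legs = SketchLiteral("-3", XSD + "positiveInteger")
print(legs.ill_formed)
```

A validator could then treat any value node with `ill_formed` set as failing a datatype constraint, without the literal itself ever raising.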

ashleysommer · 2022-03-16T00:36:08Z

@nicholascar Do you have any opinions on this?

nicholascar · 2022-03-16T03:13:40Z

It's true that RDF/OWL is pretty weak on datatypes in general... but Python isn't! We could decide to enforce strict typing in RDFlib so that it is only able to produce values that both Python and RDF/OWL think are correct. This would be the equivalent to us enforcing use of defined namespace members with our use of DefinedNamespace or ClosedNamespace: user exceptions are thrown but there is a way to get around the exceptions, if you really must. In the case of namespaces, this is to just not use the rdflib.namespace-provisioned DefinedNamespace class but to make your own with a string or IRI, e.g. URIRef("http://www.w3.org/2000/01/rdf-schema#NaughtyNick") for rdfs:NaughtyNick, which is not defined in the RDFS namespace.

So, we could throw a value error or similar, I think. What to do about parsing bad values though... Perhaps it's enough to prevent bad data generation (unless a deliberate workaround is used) and allow, but warn about, bad data parsing.

We could indicate that if you really must make the triple

ex:myDog ex:numLegs "-3"^^xsd:positiveInteger

you should parse it from text and expect the warning.
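A minimal sketch of that split, raising on programmatic construction but only warning when the value comes from parsed text, might look like the following. The function `make_positive_integer` and its `from_parser` flag are hypothetical and exist only for illustration; they are not part of rdflib.

```python
import re
import warnings

XSD_POSITIVE_INTEGER = "http://www.w3.org/2001/XMLSchema#positiveInteger"
_INTEGER_RE = re.compile(r"[+-]?[0-9]+")


def is_positive_integer(lexical):
    """True if the string is an integer lexical form with value >= 1."""
    return bool(_INTEGER_RE.fullmatch(lexical)) and int(lexical) >= 1


def make_positive_integer(lexical, *, from_parser=False):
    """Return a (lexical, datatype) pair for xsd:positiveInteger.

    Hypothetical policy: bad values raise ValueError when built in code,
    but only emit a warning when they come from parsed input.
    """
    if not is_positive_integer(lexical):
        message = f"{lexical!r} is not a valid xsd:positiveInteger"
        if from_parser:
            warnings.warn(message, UserWarning, stacklevel=2)
        else:
            raise ValueError(message)
    return (lexical, XSD_POSITIVE_INTEGER)


# Parsed data is kept, with a warning; direct construction is refused.
parsed = make_positive_integer("-3", from_parser=True)
try:
    make_positive_integer("-3")
except ValueError as error:
    print("rejected:", error)
```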

We could poll the mailing list for an approach?

ghost · 2022-03-16T09:04:42Z

So, we could throw a value error or similar, I think. What to do about parsing bad values though... Perhaps it's enough to prevent bad data generation (unless a deliberate workaround is used) and allow, but warn about, bad data parsing.

My 2c worth ... there are about 10 extant issues directly related to this but we don't know to what extent externally-sourced graphs are non-conformant in the respect of values agreeing with datatype declarations, so people could find themselves drowned in warnings. I totally agree with the suggestion of flagging a Literal as non-conformant but believe we should devolve actual enforcement to the user.

I had a dig into this a few months ago and produced these notes (based on the XSD reference):

URIRef(_XSD_PFX + "normalizedString"): None,  # The lexical space of xsd:normalizedString is unconstrained (any valid XML
                                              # character may be used). Its value space is the set of strings after whitespace
                                              # replacement—i.e., after any occurrence of #x9 (tab), #xA (linefeed), and
                                              # #xD (carriage return) have been replaced by an occurrence of #x20 (space)
                                              # without any whitespace collapsing.
URIRef(_XSD_PFX + "token"): None,             # The lexical and value spaces of xsd:token are the sets of all strings after
                                              # whitespace replacement (any occurrence of #x9 (tab), #xA (linefeed), or
                                              # #xD (carriage return) is replaced by an occurrence of #x20 (space))
                                              # and collapsing. Collapsing is when contiguous occurrences of spaces are replaced
                                              # by a single space, and leading and trailing spaces are removed.
URIRef(_XSD_PFX + "language"): None,
URIRef(_XSD_PFX + "boolean"): _parseBoolean,  # The value space of xsd:boolean is true and false. Its lexical space accepts true,
                                              # false, and also 1 (for true) and 0 (for false).
URIRef(_XSD_PFX + "decimal"): Decimal,        # decimal number of arbitrary precision
URIRef(_XSD_PFX + "integer"): long_type,      # arbitrarily large integer
URIRef(_XSD_PFX + "nonPositiveInteger"): int, # Maximum Inclusive: 0
URIRef(_XSD_PFX + "long"): long_type,         # Minimum Inclusive: -9223372036854775808 Maximum Inclusive: 9223372036854775807
URIRef(_XSD_PFX + "nonNegativeInteger"): int, # Minimum Inclusive: 0
URIRef(_XSD_PFX + "negativeInteger"): int,    # Maximum Inclusive: -1
URIRef(_XSD_PFX + "int"): long_type,          # Minimum Inclusive: -2147483648 Maximum Inclusive: 2147483647
URIRef(_XSD_PFX + "unsignedLong"): long_type, # Minimum Inclusive: 0 Maximum Inclusive: 18446744073709551615
URIRef(_XSD_PFX + "positiveInteger"): int,    # Minimum Inclusive: 1
URIRef(_XSD_PFX + "short"): int,              # Minimum Inclusive: -32768 Maximum Inclusive: 32767
URIRef(_XSD_PFX + "unsignedInt"): long_type,  # Minimum Inclusive: 0 Maximum Inclusive: 4294967295
URIRef(_XSD_PFX + "byte"): int,               # Minimum Inclusive: -128 Maximum Inclusive: 127
URIRef(_XSD_PFX + "unsignedShort"): int,      # Minimum Inclusive: 0 Maximum Inclusive: 65535
URIRef(_XSD_PFX + "unsignedByte"): int,       # Minimum Inclusive: 0 Maximum Inclusive: 255
URIRef(_XSD_PFX + "float"): float,            # An IEEE single-precision 32-bit floating-point number, the format is a
                                              # mantissa followed, optionally, by the character 'E' or 'e' followed by
                                              # an integer exponent, the following values are valid: INF (infinity), -INF
                                              # (negative infinity), and NaN (Not a Number); INF is considered to be
                                              # greater than all other values, while -INF is less than all other values
                                              # and the value NaN cannot be compared to any other values although it
                                              # equals itself.
URIRef(_XSD_PFX + "double"): float,           # An IEEE double-precision 64-bit floating-point number, the format is a
                                              # mantissa followed, optionally, by the character 'E' or 'e' followed by
                                              # an integer exponent, the following values are valid: INF (infinity), -INF
                                              # (negative infinity), and NaN (Not a Number); INF is considered to be
                                              # greater than all other values, while -INF is less than all other values
                                              # and the value NaN cannot be compared to any other values although it
                                              # equals itself.
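
For the integer family, bounds like these lend themselves to a small table-driven check. The following is only a sketch: `XSD_INTEGER_BOUNDS` and `in_value_space` are hypothetical names, the bounds are taken from the XSD definitions of these types, and the regular expression is a simplified stand-in for the integer lexical form.

```python
import re

_INTEGER_RE = re.compile(r"[+-]?[0-9]+")

# Hypothetical table: local name -> (min inclusive, max inclusive).
# None means the type is unbounded on that side.
XSD_INTEGER_BOUNDS = {
    "integer": (None, None),
    "nonPositiveInteger": (None, 0),
    "negativeInteger": (None, -1),
    "nonNegativeInteger": (0, None),
    "positiveInteger": (1, None),
    "long": (-2**63, 2**63 - 1),
    "int": (-2**31, 2**31 - 1),
    "short": (-2**15, 2**15 - 1),
    "byte": (-2**7, 2**7 - 1),
    "unsignedLong": (0, 2**64 - 1),
    "unsignedInt": (0, 2**32 - 1),
    "unsignedShort": (0, 2**16 - 1),
    "unsignedByte": (0, 2**8 - 1),
}


def in_value_space(lexical, local_name):
    """Check a lexical form against the bounds of an XSD integer type."""
    if not _INTEGER_RE.fullmatch(lexical):
        return False
    low, high = XSD_INTEGER_BOUNDS[local_name]
    value = int(lexical)
    return (low is None or value >= low) and (high is None or value <= high)


print(in_value_space("-1", "positiveInteger"))
print(in_value_space("255", "unsignedByte"))
```

Whether a failed check should raise, warn, or only set a flag is left to the caller, in line with devolving enforcement to the user.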

ajnelson-nist mentioned this issue Mar 14, 2022

Add tests to confirm detection of various integer types RDFLib/pySHACL#122

Merged

ashleysommer mentioned this issue Mar 24, 2022

Add ability to detect and mark ill-typed literals #1773

Merged

aucampia closed this as completed in #1773 Apr 15, 2022
