Casting literal with content to rdf:HTML datatype leads incorrectly to empty literal #2475

Closed
floresbakker opened this issue Jul 8, 2023 · 21 comments · Fixed by #2483 or #2490
Labels: bug (Something isn't working), concept: RDF Literal

Comments

@floresbakker

Casting a literal to rdf:HTML datatype leads incorrectly to an empty literal.

Example code

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select * where {
  bind(strdt("<body>", rdf:HTML) as ?tag1) # incorrectly disappearing literal

  bind("<body>" as ?tag2)                  # correctly appearing literal
}

Expected results for ?tag1 and ?tag2

"<body>"^^rdf:HTML
"<body>"

Instead we get:

""^^rdf:HTML
"<body>"


@WhiteGobo
Contributor

WhiteGobo commented Jul 8, 2023

Hm, it seems this is interpreted as a built-in type and then converted directly into an XML DOM object. The HTML is stored in self.value, but I have no idea where it is supposed to be read. But it's within args:

class Literal(Identifier, Node):
    ....
    def __repr__(self):
        args = [super().__repr__()]
        ...

https://github.com/RDFLib/rdflib/blob/0ea6ca579442219d67ffb1fc7313f05fd16d8d49/rdflib/term.py#L1610C1-L1613C38
But neither Identifier nor Node has a __repr__ method.

You can get hold of the HTML as a value, but it is in the form of an xml.dom.minidom.DocumentFragment.

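from rdflib import Literal

q = Literal("<body>", datatype="http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML")
# With html5lib installed, the parsed value is a DOM fragment rather than a string.
print(q.value)
# Output: <xml.dom.minidom.DocumentFragment object at 0x7f61f112b750>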

@floresbakker
Author

Ah, so it has not completely disappeared as it is still present somehow in the underlying data structure of RDFLib. Yet, it is unavailable within a SPARQL query, right? Do you think this is a rather fundamental and difficult issue to solve in RDFlib? It doesn't sound like an easily solvable bug unfortunately.

@WhiteGobo
Contributor

Hm, it should be possible to find the bug fairly quickly.

@WhiteGobo
Contributor

Hm, I think I will produce a hotfix just by overriding some of the init parts. After that I want to find out why Literals don't use the literal str that is given to them, but instead convert it via some obscure methods.

The bug is that a wrong object is created at this position:

https://github.com/RDFLib/rdflib/blob/0ea6ca579442219d67ffb1fc7313f05fd16d8d49/rdflib/term.py#L1641C1-L1648C22

Or something is wrong when that object is transformed back to a string at this position:

https://github.com/RDFLib/rdflib/blob/0ea6ca579442219d67ffb1fc7313f05fd16d8d49/rdflib/term.py#L1656C1-L1663C31

But I'm not knowledgeable enough about xml.dom to know how to fix this bug.

@WhiteGobo
Contributor

Also, rdflib.Literal does not seem to conform to the W3C spec: https://www.w3.org/TR/rdf11-concepts/#dfn-literal-term-equality
I'm just noting this here for documentation purposes.
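
For reference, the linked section defines term equality purely on the lexical form, the datatype IRI, and the language tag, compared character by character. A minimal sketch of that rule, using a hypothetical `TermLiteral` class that is not part of rdflib:

```python
from dataclasses import dataclass
from typing import Optional

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


@dataclass(frozen=True)
class TermLiteral:
    """Hypothetical stand-in for an RDF literal; not rdflib's Literal."""

    lexical_form: str
    datatype: str
    language: Optional[str] = None


def term_equal(a: TermLiteral, b: TermLiteral) -> bool:
    # RDF 1.1 term equality: all three components must match character by character.
    return (
        a.lexical_form == b.lexical_form
        and a.datatype == b.datatype
        and a.language == b.language
    )


one = TermLiteral("1", XSD_INTEGER)
padded_one = TermLiteral("01", XSD_INTEGER)

# Same integer value, but different lexical forms, so not the same RDF term.
print(term_equal(one, padded_one))  # False
print(term_equal(one, TermLiteral("1", XSD_INTEGER)))  # True
```

Under this rule, rewriting the lexical form during construction changes which term a literal is, which would explain why the change from `<body>` to an empty string is visible in query results.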

@WhiteGobo
Contributor

Because you asked how long it will take: there is now a hotfix available, but I don't know if and when it will be accepted. If you need it soon, you could install a self-patched version of rdflib from the fork used in my pull request above.

@floresbakker
Author

floresbakker commented Jul 8, 2023

Wow, impressive how fast this issue has been analysed and perhaps even fixed. Thank you for your hard work! Regarding how fast I need it: it is not about what I need, but more a community thing. I am the chair of the W3C community group Semantic HTML vocabulary https://www.w3.org/community/htmlvoc/ and maintain the associated GitHub repository https://github.com/floresbakker/htmlvoc. With the RDF-based vocabulary we can model any HTML document and generate an HTML document using only semantic web compliant technology (OWL, SHACL, SPARQL).

To demonstrate the vocabulary with an open source engine, I developed a super simple script that calls RDFlib and PyShacl and generates an HTML file based on its modeling in RDF. This issue prevents the vocabulary from being used and demonstrated, together with the other issue (Separator in group_concat function with explicit empty string incorrectly defaults to 'space' character #2473). So it would be awesome if it could be fixed in the official release of RDFlib, as that would make it possible to use and demonstrate our vocabulary with open source libraries.

The vocabulary is already in use at the Dutch Ministry of Finance using a commercial triple store & SPARQL and SHACL engine. To facilitate adoption of the vocabulary in the community, we would like to show several different infrastructures (commercial, open source) in which the vocabulary can be used without obstacles.

@aucampia
Member

aucampia commented Jul 8, 2023

Instead we get:

""^^rdf:HTML
"<body>"

I'm not entirely clear how you get this value, could you clarify a bit? Do you use serialize on the result? And if so, what format? I'm guessing maybe you are serializing TSV or CSV, but it would be helpful to be sure.

@floresbakker
Author

Well, the issue initially came to the surface when I ran PyShacl with some SHACL shapes from our HTML vocabulary (specifically the rule:Serialize_HTML_fragment_HTML_Element, see Github link above). There is a SHACL shape with a CONSTRUCT query that does some manipulation of strings in order to generate HTML code. I noticed that PyShacl/RDFlib could not generate the HTML code as I expected. It turns out that whenever we cast some string as rdf:HTML and add that as a literal to some subject-predicate-object, that literal does not 'exist' anymore. In our own semantic web solution (using both Virtuoso and Jena services) this would go fine. So I decided to analyse it.

Here are the results of running a simple SPARQL query:

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select * where {
  bind(strdt("<body>", rdf:HTML) as ?tag1)
  bind("<body>" as ?tag2)   
}

VIRTUOSO:

?tag1                  ?tag2
"<body>"^^rdf:HTML     "<body>"

JENA:

?tag1                  ?tag2
"<body>"^^rdf:HTML     "<body>"

RDFLIB:

?tag1                  ?tag2
""^^rdf:HTML           "<body>"

Here is a python script just to show the issue in RDFLib:

from rdflib import Graph
some_graph = Graph()
resultquery = some_graph.query('''

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select * where {
  bind(strdt("<body>", rdf:HTML) as ?tag1)
  bind("<body>" as ?tag2)   
}
''')   

for row in resultquery:
    print('tag1', row.tag1)
    print('tag2', row.tag2)

Leading to:

tag1 
tag2 <body>

@floresbakker
Author

Also interesting is:

from rdflib import Graph

some_graph = Graph()
resultquery = some_graph.query('''

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select * where {
  bind(strdt("<body>", rdf:HTML) as ?tag1)
  bind("<body>" as ?tag2)  
}
''')   

for row in resultquery:
    print(row)

Leading to:

(rdflib.term.Literal('<body>'), rdflib.term.Literal('', datatype=rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML')))

So, to answer your question better: I don't think serialisation matters here (but I could be very much wrong!). I think the handling of literals with the rdf:HTML datatype is faulty within RDFlib. I hope I have provided enough information to make the problem clear. Feel free to ask me more if needed :) happy to oblige.

@aucampia
Member

aucampia commented Jul 9, 2023

I think as a temporary workaround, you can set rdflib.NORMALIZE_LITERALS to False.

@aucampia
Member

aucampia commented Jul 9, 2023

I think this problem should only occur if you have html5lib installed, which you may not want to install.

@floresbakker
Author

floresbakker commented Jul 9, 2023

Thank you for your reply, workarounds are always appreciated :) However, I am calling PyShacl. I am guessing that RDFlib is called through PyShacl and I doubt I can parameterize PyShacl with this parameter?

pyshacl.validate(
        data_graph=serializable_graph,
        shacl_graph=html_vocabulary,
        data_graph_format="turtle",
        shacl_graph_format="turtle",
        advanced=True,
        inplace=True,
        inference=None,
        iterate_rules=False, 
        debug=False,
        )

Using the 'pip list' command I can confirm that I have installed html5lib. And I can only assume that others in the world may have installed this package. It would not be good that the adoption of a vocabulary would depend on whether someone has installed some other package or not. It is a good find, though, that this issue can be traced to having this package installed.

Personally I am also busy with parsing HTML and I think I need html5lib (together with BeautifulSoup), so it would be difficult to say goodbye to this package.

Thank you for your efforts, much obliged :)

@aucampia
Member

aucampia commented Jul 9, 2023

I am guessing that RDFlib is called through PyShacl and I doubt I can parameterize PyShacl with this parameter?

This is not really a parameter, just a module-level variable.

If you have Python code, this should work:

import rdflib
rdflib.NORMALIZE_LITERALS = False

You just need to set that before running your code that calls PyShacl.
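
Because the flag is a plain module attribute, the workaround can also be scoped to one block of code and restored afterwards. A minimal sketch, assuming a hypothetical helper `override_module_attr` (not part of rdflib or PyShacl) and using a stand-in module so that it runs without rdflib installed:

```python
import types
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def override_module_attr(module: types.ModuleType, name: str, value: Any) -> Iterator[None]:
    """Temporarily set module.<name> to value, restoring the old value on exit."""
    previous = getattr(module, name)
    setattr(module, name, value)
    try:
        yield
    finally:
        setattr(module, name, previous)


# Stand-in for the real module; with rdflib installed one would pass `rdflib` itself.
fake_rdflib = types.ModuleType("fake_rdflib")
fake_rdflib.NORMALIZE_LITERALS = True

with override_module_attr(fake_rdflib, "NORMALIZE_LITERALS", False):
    inside = fake_rdflib.NORMALIZE_LITERALS  # False while the block runs

after = fake_rdflib.NORMALIZE_LITERALS  # restored to True
print(inside, after)
```

With rdflib the call would be `override_module_attr(rdflib, "NORMALIZE_LITERALS", False)`. This only helps if the library reads the attribute each time a literal is created, which is an assumption here.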

Using the 'pip list' command I can confirm that I have installed html5lib. And I can only assume that others in the world may have installed this package. It would not be good that the adoption of a vocabulary would depend on whether someone has installed some other package or not.

The problem is that, by default, RDFLib tries to normalize the lexical forms of literals. To do this for HTML, it just takes the serialization of the parsed HTML. For <body>, however, that is an empty string, because the DocumentFragment that html5lib gives back is basically unpopulated. I'm uncertain why it is unpopulated; I think it really just does not like <body>.
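
The effect can be reproduced with the standard library alone. The sketch below assumes that normalization amounts to serializing the child nodes of the parsed fragment; `serialize_fragment` is a hypothetical helper, not rdflib's actual code:

```python
from xml.dom.minidom import Document, DocumentFragment


def serialize_fragment(fragment: DocumentFragment) -> str:
    """Concatenate the XML serialization of each child node of the fragment."""
    return "".join(child.toxml() for child in fragment.childNodes)


doc = Document()

# A fragment the parser left unpopulated, as reported for "<body>".
empty_fragment = doc.createDocumentFragment()

# A fragment holding a single childless element, such as a lone table element.
table_fragment = doc.createDocumentFragment()
table_fragment.appendChild(doc.createElement("table"))

print(repr(serialize_fragment(empty_fragment)))  # ''
print(repr(serialize_fragment(table_fragment)))  # '<table/>'
```

An unpopulated fragment has nothing to serialize, so the normalized lexical form comes out as the empty string regardless of what the input text was.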

The right behaviour here is probably to just mark it as an ill-typed literal and not associate any value with it, but it is not clear to me that html5lib gives any error indication for parsing <body>.

I would be interested to know if you have similar problems with something a bit more conventional than just <body>, but regardless this behaviour is not right.

In general though I would recommend using venvs for anything important, but certainly RDFLib should work with html5lib installed.

aucampia added the bug and concept: RDF Literal labels Jul 9, 2023
@floresbakker
Author

floresbakker commented Jul 9, 2023

Hi @aucampia,

Thank you again for your quick reply. Unfortunately adding "rdflib.NORMALIZE_LITERALS = False" to the script before calling PyShacl does not solve it, the literals remain 'non existing' as before. A pity, because it would have been a sweet workaround :) The only thing that works up until now is to skip datatyping the resulting string to rdf:HTML (using the STRDT keyword). That means fiddling with our proposed standard, and that is not okay.

There are other HTML tags that cause problems. Root (<html>), row (<tr>) and head (<head>) show the same behavior. Funnily enough, <table>, <a> and <b> behave differently: the literal becomes a self-closing tag even though the input did not contain one.

rdflib.term.Literal('<table/>', datatype=rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML'))

@WhiteGobo
Contributor

WhiteGobo commented Jul 10, 2023

Summary of what I write here:
html5lib is at fault. In the cases mentioned above it produces faulty output: at best a fragment with no child, at worst a fragment with a wrong child.

So it seems the bug is not in rdflib but in how html5lib parses fragments. Because html5lib.HTMLParser.parseFragment has only minimal docs, I've run some tests (I can share the program and results if wanted).

It seems that by default html5lib builds a valid HTML document and then returns only the contents of the body as the fragment. For example, <test> would become:

<?xml version="1.0" ?>
<html>
	<head/>
	<body><test/></body>
</html>

and only a <test/> node (as child of the fragment) would be returned.
So nodes like <body> will simply not return any children. Invalid fragments like <head attr=test>invalid<head/> can still return something, in this example a text node with "invalid".
I think <tr> does not return anything because it is not valid without an enclosing <table>. Things like <invalid will also produce an empty fragment.
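
A minimal sketch of the kind of test described above, assuming html5lib's DOM tree builder. `fragment_children` is a hypothetical helper, and it returns `None` when html5lib is not installed:

```python
from typing import List, Optional

try:
    import html5lib
except ImportError:  # html5lib is optional
    html5lib = None


def fragment_children(markup: str) -> Optional[List[str]]:
    """Return the node names of the parsed fragment's children, or None without html5lib."""
    if html5lib is None:
        return None
    parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
    fragment = parser.parseFragment(markup)
    return [child.nodeName for child in fragment.childNodes]


# Print what survives fragment parsing for a few of the inputs discussed here.
for markup in ["<body>", "<tr>", "<test>", "<p>text</p>"]:
    print(repr(markup), "->", fragment_children(markup))
```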

@WhiteGobo
Contributor

OK, I found out that you can read parse errors from html5lib.HTMLParser.errors after parsing. I can use this to cancel normalization within rdflib.Literal, so that it just uses the literal value as given if any errors were found.
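
The idea can be sketched as follows. This assumes html5lib records problems in `parser.errors` as described; `normalize_html_lexical` is a hypothetical helper and not the code from the pull request:

```python
try:
    import html5lib
except ImportError:  # html5lib is optional
    html5lib = None


def normalize_html_lexical(lexical: str) -> str:
    """Return a normalized lexical form, or the input unchanged if parsing is unreliable."""
    if html5lib is None:
        return lexical  # no parser available, nothing to normalize with
    parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
    fragment = parser.parseFragment(lexical)
    if parser.errors:
        return lexical  # parse errors were recorded: keep the original text
    return "".join(child.toxml() for child in fragment.childNodes)


print(repr(normalize_html_lexical("<body>")))
```

The design choice is to prefer the text the user supplied over a lossy round trip whenever the parser signals trouble.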

@aucampia
Member

I think this is not quite fixed. That PR does improve things, but I still think more is broken.

I will add some more tests to confirm.

Thank you again for your quick reply. Unfortunately adding "rdflib.NORMALIZE_LITERALS = False" to the script before calling PyShacl does not solve it, the literals remain 'non existing' as before.

If you could share the code for this, it would be helpful. It should work, and if it does not, that is a bug that we need to fix. I will also try to add tests for this myself.

@floresbakker
Author

floresbakker commented Jul 12, 2023

Hi aucampia,

the short version is this:

import pyshacl
import rdflib 
rdflib.NORMALIZE_LITERALS = False #see bug https://github.com/RDFLib/rdflib/issues/2475

pyshacl.validate(
        data_graph=serializable_graph,
        shacl_graph=html_vocabulary,
        data_graph_format="turtle",
        shacl_graph_format="turtle",
        advanced=True,
        inplace=True,
        inference=None,
        iterate_rules=False, 
        debug=False,
        )

The real script and data is much larger though. Note how we use the advanced features of SHACL and that the graph is edited in place due to the SHACL shapes with SPARQL rules (construct queries). Perhaps this is the reason that the parameter setting in the beginning does not work? I could investigate whether I can come up with a simple example in PyShacl with the parameter rdflib.NORMALIZE_LITERALS set to false, so that we can debug it.

Setting the parameter rdflib.NORMALIZE_LITERALS to false and just using RDFlib (and not PyShacl) works though.

Running only RDFLIB with the parameter Normalize Literals set to false:

import rdflib
rdflib.NORMALIZE_LITERALS = False #see bug https://github.com/RDFLib/rdflib/issues/2475


serializable_graph = rdflib.Graph()
resultquery = serializable_graph.query('''
    
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

select * where {
  bind(strdt("<body>", rdf:HTML) as ?tag1) # incorrectly disappearing literal
  bind("<body>" as ?tag2)                  # correctly appearing literal
}

''')   

for result in resultquery:
    print(result)

Result:

(rdflib.term.Literal('< body >', datatype=rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML')), rdflib.term.Literal('< body >'))

@WhiteGobo
Contributor

WhiteGobo commented Jul 12, 2023

@aucampia

By "not fixed", do you mean that things like <body>, <tr> or <head> won't be normalized and transformed to xml.dom.DocumentFragment? Yes, that issue remains, so perhaps this issue should not be closed, but it seems to me more like something that html5lib should tackle.

@aucampia
Member

By "not fixed", do you mean that things like <body>, <tr> or <head> won't be normalized and transformed to xml.dom.DocumentFragment? Yes, that issue remains, so perhaps this issue should not be closed, but it seems to me more like something that html5lib should tackle.

There are several layers of problems here. I think html5lib may have issues, but even if it were not for those, I don't think re-serialization (i.e. value-to-lexical mapping) is done correctly.

But I'm also not that sure about the rdf:HTML lexical-to-value mapping part of the RDF spec [ref].

I ran some tests with parse5, a node.js library purporting to be WHATWG html5 compliant, and it seems to be performing about as well in the parsing department, except it provides no way to detect errors.

The code and output for tests with parse5 and html5lib is here.

I'm not entirely sure it is worth trying to fix either. I will see what can be done, but I think we should consider eliminating support for rdf:HTML as a recognized datatype. That section is also not normative.

aucampia added a commit to aucampia/rdflib that referenced this issue Jul 16, 2023
Previously, without `html5lib` installed, literals with `rdf:HTML`
datatypes were treated as
[ill-typed](https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal),
even if they were not ill-typed.

With this change, if `html5lib` is not installed, literals with the
`rdf:HTML` datatype will not be treated as ill-typed, and will have
`None` as their `ill_typed` attribute value, which means that it is
unknown whether they are ill-typed or not.

This change also fixes the mapping from `rdf:HTML` literal values to
lexical forms.

Other changes:

- Add tests for `rdflib.NORMALIZE_LITERALS` to ensure it behaves
  correctly.

Related issues:

- Fixes <RDFLib#2475>
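
The three states described in this commit message can be sketched as below. `ill_typed_status` and `strict_parse` are hypothetical names used for illustration, not rdflib's implementation:

```python
from typing import Callable, Optional


def ill_typed_status(
    lexical: str, parse: Optional[Callable[[str], object]]
) -> Optional[bool]:
    """Return True (ill-typed), False (well-typed), or None (unknown)."""
    if parse is None:
        return None  # no parser installed, so validity cannot be checked
    try:
        parse(lexical)
    except ValueError:
        return True
    return False


def strict_parse(lexical: str) -> str:
    # Toy stand-in for a strict HTML parser: rejects a lone "<body>".
    if lexical.strip() == "<body>":
        raise ValueError("unexpected start tag")
    return lexical


print(ill_typed_status("<p>ok</p>", strict_parse))  # False
print(ill_typed_status("<body>", strict_parse))     # True
print(ill_typed_status("<body>", None))             # None
```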
aucampia added a commit that referenced this issue Jul 19, 2023