Casting literal with content to rdf:HTML datatype leads incorrectly to empty literal #2475
Comments
Mh it seems this is interpreted as a builtin type and then translated directly into an xml-thingy. The
https://github.com/RDFLib/rdflib/blob/0ea6ca579442219d67ffb1fc7313f05fd16d8d49/rdflib/term.py#L1610C1-L1613C38 You can get a hold on the
|
Ah, so it has not completely disappeared as it is still present somehow in the underlying data structure of RDFLib. Yet, it is unavailable within a SPARQL query, right? Do you think this is a rather fundamental and difficult issue to solve in RDFlib? It doesn't sound like an easily solvable bug unfortunately. |
Mh, it should be possible to find the bug fairly quickly. |
Mh, I think I will produce a hotfix just by overriding some of the init parts. And then I want to know, why The bug is that a wrong object is created at this position: Or something is wrong when that object is transformed back to a string at this position: But I'm not knowledgeable enough about |
Also, rdflib.Literal does not seem to conform to the W3C spec: https://www.w3.org/TR/rdf11-concepts/#dfn-literal-term-equality |
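For context on the spec reference: RDF 1.1 Concepts treats two literals as term-equal only if their lexical forms, datatype IRIs, and language tags (if any) all compare equal, character by character. Below is a minimal sketch of that rule in plain Python. The `LiteralTerm` class is hypothetical and is not how rdflib implements `Literal`; it only illustrates the comparison.

```python
from dataclasses import dataclass
from typing import Optional

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


@dataclass(frozen=True)
class LiteralTerm:
    """Hypothetical stand-in for an RDF literal (not rdflib's Literal)."""

    lexical_form: str
    datatype: str
    language: Optional[str] = None


# The generated __eq__ compares all three fields, which mirrors term equality.
one = LiteralTerm("1", XSD_INTEGER)
padded_one = LiteralTerm("01", XSD_INTEGER)

print(one == LiteralTerm("1", XSD_INTEGER))  # same lexical form and datatype
print(one == padded_one)  # same integer value, but a different lexical form
```

Under this rule, "1" and "01" typed as xsd:integer denote the same value but are different terms, so rewriting a lexical form changes which term ends up in the graph.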
Since you asked how long it will take: there is now a hotfix available, but I don't know if and when it will be accepted. If you need it soon, you may consider installing a self-patched version of rdflib from the fork of my pull request above. |
Wow, impressive how fast this issue has been analysed and perhaps even fixed. Thank you for your hard work! Regarding how fast I need it: it is not about what I need but more a community thing. I am the chair of the W3C community group Semantic HTML vocabulary https://www.w3.org/community/htmlvoc/ and maintain the associated GitHub repository https://github.com/floresbakker/htmlvoc. With the RDF-based vocabulary we can model any HTML document and generate an HTML document using only semantic web compliant technology (OWL, SHACL, SPARQL). To demonstrate the vocabulary with an open source engine, I developed a super simple script that calls RDFlib and PyShacl and generates an HTML file based on its modeling in RDF. This issue prevents the vocabulary from being used and demonstrated, together with the other issue (Separator in group_concat function with explicit empty string incorrectly defaults to 'space' character #2473). So it would just be awesome if it could be fixed in the official release of RDFlib, as it would make it possible to use and demonstrate our vocabulary using open source libraries. The vocabulary is already in use at the Dutch Ministry of Finance using a commercial triple store & SPARQL and SHACL engine. To facilitate adoption of the vocabulary in the community we would like to show several different infrastructures (commercial, open source) in which the vocabulary can be used without obstacles. |
I'm not entirely clear how you get this value, could you clarify a bit? Do you use serialize on the result? And if so, what format? I'm guessing maybe you are serializing TSV or CSV, but it would be helpful to be sure. |
Well, the issue initially came to the surface when I ran PyShacl with some SHACL shapes from our HTML vocabulary (specifically the rule:Serialize_HTML_fragment_HTML_Element, see Github link above). There is a SHACL shape with a CONSTRUCT query that does some manipulation of strings in order to generate HTML code. I noticed that PyShacl/RDFlib could not generate the HTML code as I expected. It turns out that whenever we cast some string as rdf:HTML and add that as a literal to some subject-predicate-object, that literal does not 'exist' anymore. In our own semantic web solution (using both Virtuoso and Jena services) this would go fine. So I decided to analyse it. Here are the results of running a simple SPARQL query:
VIRTUOSO:
JENA:
RDFLIB:
Here is a python script just to show the issue in RDFLib:
Leading to:
|
Also interesting is:
Leading to:
So answering your question better, I don't think serialisation is of importance here (but I could be very much wrong here!). I think the handling of literals with rdf:HTML datatype is faulty within RDFlib. I hope I have provided you with enough information so that the problem is made clear. Feel free to ask me more, if needed :) happy to oblige. |
I think as a temporary workaround, you can set |
I think this problem should only occur if you have html5lib installed, which you may not want to install. |
Thank you for your reply, workarounds are always appreciated :) However, I am calling PyShacl. I am guessing that RDFlib is called through PyShacl and I doubt I can parameterize PyShacl with this parameter?
Using the 'pip list' command I can confirm that I have html5lib installed. And I can only assume that others in the world may have installed this package too. It would not be good if the adoption of a vocabulary depended on whether someone has installed some other package or not, although it is a good find that this issue can be pinpointed to this package being installed. Personally I am also busy with parsing HTML and I think I need html5lib (together with BeautifulSoup), so it would be difficult to say goodbye to this package. Thank you for your efforts, much obliged :) |
This is not really a parameter, just a module-level variable. If you have Python code, this should work:
import rdflib
rdflib.NORMALIZE_LITERALS = False
You just need to set that before running your code that calls PyShacl.
The problem is that, by default, RDFLib tries to normalize literals' lexical values. To do this for HTML, it just takes the serialization of the parsed HTML; however, for The right behaviour here is probably to just mark it as an ill-typed literal and not associate any value with it, but it is not clear to me that html5lib gives any error indication for parsing I would be interested to know if you have similar problems with something a bit more conventional than just In general though I would recommend using venvs for anything important, but certainly RDFLib should work with html5lib installed. |
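As an illustration of the workaround described above, here is a minimal sketch that switches off normalization and then builds an `rdf:HTML` literal. The helper name `html_literal_without_normalization` and the `HAVE_RDFLIB` guard are assumptions added for the example; only `rdflib.NORMALIZE_LITERALS`, `rdflib.Literal`, and `rdflib.URIRef` come from rdflib itself.

```python
import importlib.util

# rdflib is a third-party package; only run the demo when it is installed.
HAVE_RDFLIB = importlib.util.find_spec("rdflib") is not None

RDF_HTML = "http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML"


def html_literal_without_normalization(text):
    """Build an rdf:HTML literal while lexical normalization is switched off."""
    import rdflib

    # Module-level switch from the comment above; set it before literals are created.
    rdflib.NORMALIZE_LITERALS = False
    return rdflib.Literal(text, datatype=rdflib.URIRef(RDF_HTML))


if HAVE_RDFLIB:
    literal = html_literal_without_normalization("<body>")
    print(repr(str(literal)))  # the lexical form as rdflib stored it
```

Note that the switch is process-wide, so it affects every literal created afterwards. As reported further down in this thread, it kept the lexical form intact when rdflib was used on its own, but did not help when rdflib was driven through PyShacl.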
Hi @aucampia, thank you again for your quick reply. Unfortunately, adding "rdflib.NORMALIZE_LITERALS = False" to the script before calling PyShacl does not solve it; the literals remain 'non-existent' as before. A pity, because it would have been a sweet workaround :) The only thing that works so far is to skip datatyping the resulting string (using the STRDT keyword) as rdf:HTML. That means fiddling with our proposed standard, and that is not okay. There are other tags in HTML that cause problems. Root ('< html >'), row ('< tr >') and head ('< head >') show the same behavior. Funnily enough, < table >, < a > and < b > show different behavior, where the literal becomes a closing tag even though none was specified.
|
Summary of what I write here: So it seems the bug is not in rdflib but in how html5lib parses fragments. Because So it seems that by default html5lib produces a valid HTML document and then returns only the body as the fragment, like
and only a |
Mh, OK, I found out that you can read out errors by reading |
This is not quite fixed I think, that PR does improve things, but I still think there is more broken. I will add some more tests to confirm.
If you could share the code for this it would be helpful, as it should work; if it does not, it is a bug that we need to fix. I will try to add tests for this myself also though. |
Hi aucampia, the short version is this:
The real script and data is much larger though. Note how we use the advanced features of SHACL and that the graph is edited in place due to the SHACL shapes with SPARQL rules (construct queries). Perhaps this is the reason that the parameter setting in the beginning does not work? I could investigate whether I can come up with a simple example in PyShacl with the parameter rdflib.NORMALIZE_LITERALS set to false, so that we can debug it. Setting the parameter rdflib.NORMALIZE_LITERALS to false and just using RDFlib (and not PyShacl) works though. Running only RDFLIB with the parameter Normalize Literals set to false:
Result: (rdflib.term.Literal('< body >', datatype=rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML')), rdflib.term.Literal('< body >')) |
By unfixed issues, do you mean things like |
There are several layers of problems here. I think html5lib may have issues, but even if it were not for those, I don't think re-serialization (i.e. value-to-lexical mapping) is done correctly. But I'm also not that sure about the rdf:HTML lexical-to-value mapping part of the RDF spec [ref]. I ran some tests with parse5, a node.js library purporting to be WHATWG HTML5 compliant, and it seems to be performing about as well in the parsing department, except it provides no way to detect errors. The code and output for tests with parse5 and html5lib is here. I'm not entirely sure if it is worth trying to fix it either. I will see what can be done, but I think we should consider eliminating the support for rdf:HTML as a recognized datatype. That section is also not normative. |
Previously, without `html5lib` installed, literals with the `rdf:HTML` datatype were treated as [ill-typed](https://www.w3.org/TR/rdf11-concepts/#section-Graph-Literal), even if they were not ill-typed. With this change, if `html5lib` is not installed, literals with the `rdf:HTML` datatype will not be treated as ill-typed, and will have `None` as their `ill_typed` attribute value, which means that it is unknown whether they are ill-typed or not. This change also fixes the mapping from `rdf:HTML` literal values to lexical forms.

Other changes:
- Add tests for `rdflib.NORMALIZE_LITERALS` to ensure it behaves correctly.

Related issues:
- Fixes <RDFLib#2475>
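The three-way outcome described in that change (ill-typed, well-typed, or unknown) can be sketched roughly as follows. This is an assumption-laden illustration, not rdflib's code: `ill_typed_status` is a hypothetical helper, and the built-in `int` stands in for a real datatype parser such as an HTML parser.

```python
from typing import Callable, Optional


def ill_typed_status(
    lexical_form: str, parser: Optional[Callable[[str], object]]
) -> Optional[bool]:
    """Return True/False when a parser can decide, None when no parser is available."""
    if parser is None:
        return None  # unknown: nothing available to check the lexical form
    try:
        parser(lexical_form)
    except ValueError:
        return True  # the parser rejected the lexical form
    return False


print(ill_typed_status("42", int))  # parser accepts the lexical form
print(ill_typed_status("forty-two", int))  # parser rejects the lexical form
print(ill_typed_status("<b>bold</b>", None))  # no parser installed
```

The point of the third state is that a missing optional dependency is not evidence that a literal is malformed, so the check reports "unknown" instead of guessing.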
Casting a literal to rdf:HTML datatype leads incorrectly to an empty literal.
Example code
Expected results for ?tag1 and ?tag2
Instead we get: