Replace long list of namespaces with list of prefixes used when using serialize. #1679
Comments
OK, confirming I can reproduce this problem like this:
# establish a minimal graph
from rdflib import Graph
g = Graph()
g.parse(
data='''
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
<a:> a <b:> ;
foaf:name "Nick" .
''',
format="turtle"
)
Normal JSON-LD serialization, which is fine:
print(g.serialize(format="json-ld"))
yields
BUT, if I set
Based on #1676 which was just lodged, it looks like the extra prefixes come through for other formats too, just not for Turtle or certain forms of JSON-LD.
So indeed RDFLib recently added all namespaces defined in the Namespaces module to graphs by default (see https://github.com/RDFLib/rdflib/blob/master/rdflib/namespace/__init__.py#L341); however, some formats seem to auto-clean away unused prefixes, but not all do.
Perhaps we should find the code that does the auto-clean and make it generally available? We could make use of it in some of our own serializers...
Yes, I think that's what's needed here. I worry though that it may not be one "piece" of code but rather some ongoing record that the Turtle/N3 serializer keeps of all namespaces seen as it operates, line by line, which it then uses to drop unseen ones from the namespaces list. If that's the case, it won't be transferable, and re-examining the graph for unused namespaces could be expensive. We could always reconsider the policy to include the 20 or so namespaces in the namespaces module by default...
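To make the "re-examine the graph" option concrete, here is a minimal sketch of such a scan. It is not the Turtle/N3 serializer's actual logic; `used_bindings` is a hypothetical helper that works on plain strings, and the sample bindings and IRIs are illustrative only.

```python
from typing import Dict, Iterable


def used_bindings(bindings: Dict[str, str], iris: Iterable[str]) -> Dict[str, str]:
    """Return only the prefix bindings whose namespace starts at least one IRI."""
    # Try longer namespaces first, so that a more specific namespace wins
    # over a shorter one that is also a prefix of the same IRI.
    by_length = sorted(bindings.items(), key=lambda item: len(item[1]), reverse=True)
    used: Dict[str, str] = {}
    for iri in iris:
        for prefix, namespace in by_length:
            # Skip empty namespaces, which would match every IRI.
            if namespace and iri.startswith(namespace):
                used[prefix] = namespace
                break
    return used


# Illustrative data modelled on the minimal graph above.
bindings = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}
iris = [
    "a:",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
    "b:",
    "http://xmlns.com/foaf/0.1/name",
]
print(used_bindings(bindings, iris))
```

With rdflib, the inputs could presumably be built from `g.namespaces()` and from the string form of every IRI term in the graph. The scan touches every term once, which is the cost worried about above.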
I distinctly recall making sure the JSON-LD serializer (especially the auto-compact) didn't add spurious prefixes a long time ago. Either some change deliberately altered that, or something underneath changed (and if so obviously didn't affect all serializations). Not looking for "blame", just thinking there might be something complex going on (e.g. routines on different levels working "against" each other).
Perhaps it's that the original no-unused-prefixes code worked fine before we added the specific binding in namespace/__init__.py#L341, which then overrides it in some way. So this is the different "levels" idea you mention. All my fault in trying to include more prefixes by default to make prettier namespaces rather than
Maybe it is possible to have the 'normalizeUri' function add prefixes whenever they are seen, instead of binding all by default?
Something like that, but it will be a bit tricky to have the namespace object keep track of all used IRIs efficiently (a callback to a dict of IRI namespaces I suppose, using normalizeUri, as you suggest, but for potentially every subject, predicate & object).
@nicholascar I noticed 'compute_qname' already binds a namespace when it is not in the graph namespaces yet. I don't know exactly when this is used, but I guess whenever a triple is added? More generally, whenever a triple is added or an RDF file is parsed, wouldn't that be the place to check whether the graph 'knows' the namespace (for s, p and o) and add new ones?
As an rdflib user this would be my preference, rather than post-processing to remove. I would like to be in complete control over namespaces (perhaps with the exception of the true core ones like rdf/rdfs/owl/xsd).
it's true these are not so pretty but maybe keeping things simplest is the best? Perhaps there could be some convenience methods to pre-populate namespaces with either the 20 or so prefixes, or even pre-populate from an existing prefix registry?
Indeed, this behaviour is triggered by the Graph initialization. It takes a NamespaceManager as argument and this NamespaceManager adds the list of namespaces. So, it is not a serialization problem. Serialize only writes what is in the Graph.
OK, I think the best bet is to wind back the adding of 20 or so namespaces! We can revert to adding just RDF, RDFS, OWL & XSD (and XML for RDF/XML) and then think of some new function or graph init flag to add in others, such as the 20 in I'll try and come up with a PR for this.
Thanks, I would appreciate that. Some SPARQL editors have a (reverse) lookup function that matches namespaces to prefix and vice versa using prefix.cc. I guess that we don't want RDFLib to have an external dependency like that, but what if RDFLib provides (reverse) prefix lookup internally? I mean, just so that when RDF is parsed or added, the used prefix/namespaces are added to the Graph automatically? It could use this context with all preferred namespace-prefix pairings for example.
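The "bind when seen" idea with an internal reverse lookup could look roughly like the sketch below. This is not RDFLib code: `LazyPrefixes` and its `see` method are hypothetical names, the table of preferred pairings is a tiny stand-in, and splitting an IRI after the last `#` or `/` is an assumed convention that does not hold for every vocabulary.

```python
from typing import Dict, Optional


class LazyPrefixes:
    """Hypothetical registry that binds a prefix only once its namespace is seen."""

    def __init__(self, known: Dict[str, str]) -> None:
        # Reverse lookup table: namespace IRI -> preferred prefix.
        self._by_namespace = {ns: prefix for prefix, ns in known.items()}
        # Bindings actually in use; this starts empty on purpose.
        self.bound: Dict[str, str] = {}

    def see(self, iri: str) -> Optional[str]:
        """Record an IRI; bind and return its prefix if the namespace is known."""
        # Split after the last '#' or '/' (assumed convention, see above).
        cut = max(iri.rfind("#"), iri.rfind("/")) + 1
        namespace = iri[:cut]
        prefix = self._by_namespace.get(namespace)
        if prefix is not None:
            self.bound[prefix] = namespace
        return prefix


# Stand-in for a table of preferred namespace-prefix pairings.
preferred = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "owl": "http://www.w3.org/2002/07/owl#",
}
prefixes = LazyPrefixes(preferred)
for term in ["a:", "b:", "http://xmlns.com/foaf/0.1/name"]:
    prefixes.see(term)
print(prefixes.bound)
```

The dictionary lookup keeps the per-term cost constant, which addresses the efficiency concern raised earlier, at the price of relying on the split heuristic.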
That sounds good! One thing about the XML namespace: it "shouldn't" really be defined as an Given that it is used in RDFLib XML processing code for attribute access I'm not proposing to remove it entirely. But it ought to be clarified that it is not to be used as an "RDF namespace" (i.e. a vocabulary/ontology prefix), and thus I don't think it should ever be bound in a graph (unless intentionally, for reasons I don't understand but are not up to me to question).
I'll chime in here too: last week I found that the latest version of RDFLib (with the extra prefixes in the Namespaces) caused PySHACL to start failing a test. I tracked it down to a conflicting namespace (the test defined the prefix to be X, but RDFLib defined the prefix to be Z) that caused the test to no longer pass. However, it did expose a problem with the test (a misspelled ontology URI) and broken logic in the test (it shouldn't have passed to begin with). That helped me fix the test. Anyway, I'm in favour of keeping the prefixes, and auto-removing them in the serializers.
I'm part way to solving this, see #1686
Another scenario where I would personally find it useful to have JSON-LD |
In case anyone ends up here confused, as I often do, I believe the current status is that this was fixed in rdflib 6.2.0, but then regressed in later versions. See #2103 for discussion.
I am wondering why the latest version of rdflib gives me a long list of namespaces when serializing a Graph to JSON-LD. It didn't use to be like that.
The return_format is 'json-ld'. The context in the result is:
This is where the graph and namespace bindings (including some custom ones) were created:
Further, in the custom class I add another namespace and triples. Here's a fragment:
The custom class is used to read some JSON from a backend system, interpret it, and generate RDF for the item. You could see this as a wrapper pattern.
As I wrote in comments to this issue, I had to create some custom function to remove unused prefixes from the context, but that code is not so dynamic:
and this is used here:
Now I just discovered that the context argument can be left out. This is probably because of recent improvements and integration of json-ld. Well done. But it still gives me the long list. Also, when omitting the context argument, the auto_compact=True argument finally gives me the short representation that I wanted, for example: "sdo:datePublished": "2006-02-19". This is not the case when using this: context = dict(self.graph.namespaces()). But, after all, I still get the long list of namespaces that aren't used.
Another discovery: when further reducing the number of arguments, I still get JSON-LD, but no context at all in the results.
To summarize this issue: I would like to get JSON-LD serialization including context, but with a minimal list of (used) prefixes/namespaces in response to this request:
I hope the provided examples will help.
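As a stopgap for the request above, the serialized JSON-LD can be post-processed to drop unused context entries. The sketch below is not the custom function mentioned earlier (that code is not shown here); `prune_context` is a hypothetical helper that assumes the output is a single JSON object with a dictionary-valued "@context", and it treats a term as used when it appears as a key or string value, either alone or followed by a colon.

```python
import json
from typing import Any, Iterator


def _strings(node: Any) -> Iterator[str]:
    """Yield every dict key and every string value found in a JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield key
            yield from _strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from _strings(item)
    elif isinstance(node, str):
        yield node


def prune_context(jsonld_text: str) -> str:
    """Drop @context entries whose term never appears in the document body."""
    doc = json.loads(jsonld_text)
    if not isinstance(doc, dict) or not isinstance(doc.get("@context"), dict):
        return jsonld_text  # nothing this sketch knows how to prune
    body = {key: value for key, value in doc.items() if key != "@context"}
    seen = set(_strings(body))

    def is_used(term: str) -> bool:
        # Keywords such as @vocab are always kept.
        if term.startswith("@"):
            return True
        return any(s == term or s.startswith(term + ":") for s in seen)

    doc["@context"] = {
        term: definition
        for term, definition in doc["@context"].items()
        if is_used(term)
    }
    return json.dumps(doc, indent=2)


# Illustrative input: a compacted document with two unused prefixes.
sample = json.dumps({
    "@context": {
        "foaf": "http://xmlns.com/foaf/0.1/",
        "owl": "http://www.w3.org/2002/07/owl#",
        "sdo": "https://schema.org/",
    },
    "@id": "a:",
    "@type": "b:",
    "foaf:name": "Nick",
})
print(prune_context(sample))
```

A literal value that happens to start with a bound prefix and a colon would keep that prefix in the context unnecessarily, which is harmless but worth knowing.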