Replace long list of namespaces with list of prefixes used when using serialize. #1679

wmelder · 2022-01-18T12:24:52Z

I am wondering why the latest version of rdflib gives me a long list of namespaces when serializing a Graph to JSON-LD. It didn't use to be like that.

            return self.graph.serialize(
                format=return_format,
                context=dict(self.graph.namespaces()),
                auto_compact=True
            )

The return_format is 'json-ld'. The context in the result is:

  "@context": {
    "brick": "https://brickschema.org/schema/Brick#",
    "csvw": "http://www.w3.org/ns/csvw#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcam": "http://purl.org/dc/dcam/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcmitype": "http://purl.org/dc/dcmitype/",
    "dcterms": "http://purl.org/dc/terms/",
    "doap": "http://usefulinc.com/ns/doap#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "gtaa": "http://data.beeldengeluid.nl/gtaa/",
    "non-gtaa": "http://data.beeldengeluid.nl/nongtaa/",
    "odrl": "http://www.w3.org/ns/odrl/2/",
    "org": "http://www.w3.org/ns/org#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "prof": "http://www.w3.org/ns/dx/prof/",
    "prov": "http://www.w3.org/ns/prov#",
    "qb": "http://purl.org/linked-data/cube#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "schema": "https://schema.org/",
    "sdo": "https://schema.org/",
    "sh": "http://www.w3.org/ns/shacl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "sosa": "http://www.w3.org/ns/sosa/",
    "ssn": "http://www.w3.org/ns/ssn/",
    "time": "http://www.w3.org/2006/time#",
    "vann": "http://purl.org/vocab/vann/",
    "void": "http://rdfs.org/ns/void#",
    "xml": "http://www.w3.org/XML/1998/namespace",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },

This is where the graph and namespace bindings (including some custom ones) were created:

        self.graph = Graph()
        self.graph.namespace_manager.bind("skos", SKOS)
        self.graph.namespace_manager.bind("gtaa", Namespace(self._model.GTAA_NAMESPACE))
        self.graph.namespace_manager.bind("non-gtaa", Namespace(self._model.NON_GTAA_NAMESPACE))

Further, in the custom class I add another namespace and triples. Here's a fragment:

        self.graph.namespace_manager.bind('sdo', Namespace(self._model.SCHEMA_DOT_ORG_NAMESPACE))
        # create a node for the record
        self.itemNode = URIRef(self.get_uri(concept_type, metadata["id"]))

        # get the RDF class URI for this type
        self.classUri = self._model.CLASS_URIS_FOR_DAAN_LEVELS[concept_type]

        # add the type
        self.graph.add((self.itemNode, RDF.type, URIRef(self.classUri)))

The custom Class is used to read some JSON from a backend system, interpret this and generate RDF for the item. You could see this as a wrapper pattern.

As I wrote in the comments on this issue, I had to create a custom function to remove unused prefixes from the context, but that code is not very dynamic, because the list of used prefixes is hard-coded:

    def remove_unused_prefixes(self):
        """ Clean up the long list of namespaces.
        """
        context = dict(self.graph.namespaces())
        used_prefixes = ['gtaa', 'non-gtaa', 'rdf', 'rdfs', 'sdo', 'skos', 'xml', 'xsd']
        return {p: context[p] for p in used_prefixes}

and this is used here:

            context_used = self.remove_unused_prefixes()
            return self.graph.serialize(
                format=return_format,
                context=context_used,
                auto_compact=True
            )
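A more dynamic variant of the helper above could derive the used prefixes from the IRIs that actually occur in the graph. The following is only a minimal sketch under stated assumptions: `used_prefixes` is a hypothetical name, not part of rdflib, and it works on plain strings so that it runs without any dependency. With rdflib, the inputs would presumably be the string forms of `graph.namespaces()` and of every URIRef (including literal datatypes) found in the triples; that wiring is left out here.

```python
from typing import Dict, Iterable


def used_prefixes(namespaces: Dict[str, str], iris: Iterable[str]) -> Dict[str, str]:
    """Return only the bindings whose namespace is the start of at least one IRI."""
    # Check longer namespaces first so the most specific binding wins
    # when one namespace IRI is a prefix of another.
    by_length = sorted(namespaces.items(), key=lambda item: len(item[1]), reverse=True)
    used: Dict[str, str] = {}
    for iri in iris:
        for prefix, namespace in by_length:
            if iri.startswith(namespace):
                used[prefix] = namespace
                break  # one binding per IRI is enough
    return dict(sorted(used.items()))


# Hypothetical sample data standing in for a graph's bindings and IRIs.
bindings = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
iris_in_graph = [
    "http://www.w3.org/2004/02/skos/core#Concept",
    "http://www.w3.org/2004/02/skos/core#prefLabel",
]
context_used = used_prefixes(bindings, iris_in_graph)
print(context_used)  # {'skos': 'http://www.w3.org/2004/02/skos/core#'}
```

If two prefixes are bound to the same namespace (as `schema` and `sdo` are in the context above), this sketch keeps whichever comes first in the input dict. It also walks every IRI once per call, so it trades some speed for not having to maintain a hard-coded list.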

Now I just discovered that the context argument can be left out. This is probably because of recent improvements and integration of json-ld. Well done. But it still gives me the long list. Also, when omitting the context argument, the auto_compact=True argument finally gives me the short representation that I wanted, for example: "sdo:datePublished": "2006-02-19". This is not the case when using context = dict(self.graph.namespaces()). But, after all, I still get the long list of namespaces that aren't used.

Another discovery: when further reducing the number of arguments, I still get JSON-LD, but no context at all in the results.

            return self.graph.serialize(
                format=return_format
            )

To summarize this issue: I would like to get JSON-LD serialization including context, but with a minimal list of (used) prefixes/namespaces in response to this request:

            return self.graph.serialize(
                format='json-ld',
                auto_compact=True
            )

I hope the provided examples help.


nicholascar · 2022-01-18T12:41:25Z

OK, confirming I can reproduce this problem like this:

# establish a minimal graph
from rdflib import Graph

g = Graph()
g.parse(
    data='''
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        <a:> a <b:> ;
             foaf:name "Nick" .
        ''',
    format="turtle"
)

Normal JSON-LD serialization is fine:

print(g.serialize(format="json-ld"))

yields

[
  {
    "@id": "a:",
    "@type": [
      "b:"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@value": "Nick"
      }
    ]
  }
]

BUT, if I set auto_compact=True, i.e. print(g.serialize(format="json-ld", auto_compact=True)) then I get:

{
  "@context": {
    "brick": "https://brickschema.org/schema/Brick#",
    "csvw": "http://www.w3.org/ns/csvw#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcam": "http://purl.org/dc/dcam/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dcmitype": "http://purl.org/dc/dcmitype/",
    "dcterms": "http://purl.org/dc/terms/",
    "doap": "http://usefulinc.com/ns/doap#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "geo": "http://www.opengis.net/ont/geosparql#",
    "odrl": "http://www.w3.org/ns/odrl/2/",
    "org": "http://www.w3.org/ns/org#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "prof": "http://www.w3.org/ns/dx/prof/",
    "prov": "http://www.w3.org/ns/prov#",
    "qb": "http://purl.org/linked-data/cube#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "schema": "https://schema.org/",
    "sh": "http://www.w3.org/ns/shacl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "sosa": "http://www.w3.org/ns/sosa/",
    "ssn": "http://www.w3.org/ns/ssn/",
    "time": "http://www.w3.org/2006/time#",
    "vann": "http://purl.org/vocab/vann/",
    "void": "http://rdfs.org/ns/void#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@id": "a:",
  "@type": "b:",
  "foaf:name": "Nick"
}

Process finished with exit code 0

Based on #1676 which was just lodged, it looks like the extra prefixes come through for other formats too, just not for Turtle or certain forms of JSON-LD.

nicholascar · 2022-01-18T12:44:14Z

So indeed RDFLib recently added all namespaces defined in the Namespaces module to graphs by default (see https://github.com/RDFLib/rdflib/blob/master/rdflib/namespace/__init__.py#L341). However, some formats seem to auto-clean away unused prefixes, but not all do.

hsolbrig · 2022-01-19T22:50:36Z

Perhaps we should find the code that does the auto-clean and make it generally available? We could make use of it in some of our own serializers...

nicholascar · 2022-01-19T22:56:17Z

Perhaps we should find the code that does the auto-clean

Yes, I think that's what's needed here. I worry, though, that it may not be one "piece" of code but rather an ongoing record that the Turtle/N3 serializer keeps of all namespaces seen as it operates, line by line, which it then uses to drop unseen ones from the namespaces list. If that's the case, it won't be transferable, and re-examining the graph for unused namespaces could be expensive.
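To make the "ongoing record" idea concrete, here is a rough sketch of how a serializer could note prefixes as a side effect of writing terms. It is not the actual Turtle/N3 serializer code; `PrefixTracker` and its methods are hypothetical names, and it operates on plain strings rather than rdflib terms.

```python
from typing import Dict


class PrefixTracker:
    """Compacts IRIs to prefix:local form and remembers which prefixes it used."""

    def __init__(self, namespaces: Dict[str, str]) -> None:
        # Longest namespace first, so the most specific binding is tried first.
        self._bindings = sorted(
            namespaces.items(), key=lambda item: len(item[1]), reverse=True
        )
        self._seen: Dict[str, str] = {}

    def compact(self, iri: str) -> str:
        for prefix, namespace in self._bindings:
            if iri.startswith(namespace):
                # Record the prefix as a side effect of writing the term.
                self._seen[prefix] = namespace
                return f"{prefix}:{iri[len(namespace):]}"
        return f"<{iri}>"  # no binding matched, so write the full IRI

    def used(self) -> Dict[str, str]:
        return dict(self._seen)


tracker = PrefixTracker({
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
})
print(tracker.compact("http://xmlns.com/foaf/0.1/name"))  # foaf:name
print(tracker.compact("a:"))  # <a:>
print(tracker.used())  # {'foaf': 'http://xmlns.com/foaf/0.1/'}
```

In this shape the bookkeeping is tied to the act of compacting each term, which illustrates why it may be hard to lift out of one serializer and reuse in another: the record only exists after the terms have been written, so prefix declarations have to be emitted last or the graph has to be walked twice.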

We could always reconsider the policy to include the 20 or so namespaces in the namespaces module by default...

niklasl · 2022-01-20T08:44:57Z

I distinctly recall making sure the JSON-LD serializer (especially the auto-compact) didn't add spurious prefixes a long time ago. Either some change deliberately altered that, or something underneath changed (and if so obviously didn't affect all serializations). Not looking for "blame", just thinking there might be something complex going on (e.g. routines on different levels working "against" each other).

nicholascar · 2022-01-20T12:18:42Z

just thinking there might be something complex going on

Perhaps the original no-unused-prefixes code worked fine before we added the specific binding at namespace/__init__.py#L341, which then overrides it in some way. So this fits the different "levels" idea you mention.

All my fault in trying to include more prefixes by default to make prettier namespaces rather than ns, ns2 etc...

wmelder · 2022-01-20T13:15:37Z

Maybe it is possible to have the 'normalizeUri' function add prefixes whenever they are seen, instead of binding all by default?

nicholascar · 2022-01-20T13:22:48Z

Something like that, but it will be a bit tricky to have the namespace object keep track of all used IRIs efficiently (a callback to a dict of IRI namespaces, I suppose, using normalizeUri, as you suggest, but for potentially every subject, predicate & object).

wmelder · 2022-01-20T13:33:01Z

@nicholascar I noticed 'compute_qname' already binds a namespace when it is not in the graph namespaces yet. I don't know exactly when this is used, but I guess whenever a triple is added? More generally, whenever a triple is added or an RDF file is parsed, wouldn't that be the place to check whether the graph 'knows' the namespace (for s, p and o) and add new ones?

cmungall · 2022-01-21T00:53:10Z

We could always reconsider the policy to include the 20 or so namespaces in the namespaces module by default...

As an rdflib user this would be my preference, rather than post-processing to remove. I would like to be in complete control over namespaces (perhaps with the exceptions of the true core ones like rdf/rdfs/owl/xsd).

default to make prettier namespaces rather than ns, ns2 etc

It's true these are not so pretty, but maybe keeping things simplest is the best? Perhaps there could be some convenience methods to pre-populate namespaces with either the 20 or so prefixes, or even pre-populate from an existing prefix registry?

wmelder · 2022-01-21T13:48:45Z

So indeed RDFLib recently added all namespaces defined in the Namespaces module to graphs by default (see https://github.com/RDFLib/rdflib/blob/master/rdflib/namespace/__init__.py#L341). However, some formats seem to auto-clean away unused prefixes, but not all do.

Indeed, this behaviour is triggered by the Graph initialization. It takes a NamespaceManager as argument, and this NamespaceManager adds the list of namespaces. So it is not a serialization problem: serialize only writes what is in the Graph.

nicholascar · 2022-01-22T01:05:54Z

OK, I think the best bet is to wind back the adding of 20 or so namespaces! We can revert to adding just RDF, RDFS, OWL & XSD (and XML for RDF/XML) and then think of some new function or graph init flag to add in others, such as the 20 in namespaces/.
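One way to picture such a function or init flag is a small policy switch that decides which bindings a new graph starts with. This is purely an illustrative sketch: `initial_bindings` and the policy names "none", "core" and "extended" are made up here and are not rdflib's API, the extended set is a shortened stand-in for the longer list, and the XML binding is left out to keep the example small.

```python
from typing import Dict

# The four namespaces named above as the proposed default.
CORE: Dict[str, str] = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}

# A shortened, hypothetical stand-in for the 20 or so extra namespaces.
EXTENDED: Dict[str, str] = {
    **CORE,
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def initial_bindings(policy: str = "core") -> Dict[str, str]:
    """Return the prefix bindings a new graph would start with under a policy."""
    choices = {"none": {}, "core": CORE, "extended": EXTENDED}
    if policy not in choices:
        raise ValueError(f"unknown policy: {policy!r}")
    return dict(choices[policy])  # copy so callers cannot mutate the defaults


print(sorted(initial_bindings("core")))  # ['owl', 'rdf', 'rdfs', 'xsd']
```

Defaulting to the small set keeps serializations short for users who never ask for more, while the larger set stays one argument away for those who prefer readable prefixes over generated ones.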

I'll try and come up with a PR for this.

wmelder · 2022-01-22T12:30:30Z

Thanks, I would appreciate that.

Some SPARQL editors have a (reverse) lookup function that matches namespaces to prefixes and vice versa using prefix.cc. I guess that we don't want RDFLib to have an external dependency like that, but what if RDFLib provides (reverse) prefix lookup internally? I mean, just so that when RDF is parsed or added, the used prefixes/namespaces are added to the Graph automatically? It could use this context with all preferred namespace-prefix pairings, for example.
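As a rough illustration of what an internal reverse lookup might look like, the sketch below binds a prefix only when an IRI from a known namespace is first seen. Everything in it is an assumption made for the example: `split_iri`, `LazyBindings` and the tiny registry are hypothetical, the registry stands in for whatever list of preferred pairings would be used, and splitting at the last '#' or '/' is a simplification that will not match every namespace.

```python
from typing import Dict, Tuple

# Hypothetical registry of preferred pairings, keyed by namespace for O(1) lookup.
REGISTRY: Dict[str, str] = {
    "http://xmlns.com/foaf/0.1/": "foaf",
    "http://www.w3.org/2004/02/skos/core#": "skos",
    "https://schema.org/": "sdo",
}


def split_iri(iri: str) -> Tuple[str, str]:
    """Split an IRI after its last '#' or '/', whichever comes later."""
    cut = max(iri.rfind("#"), iri.rfind("/")) + 1
    return iri[:cut], iri[cut:]


class LazyBindings:
    """Binds a prefix only once an IRI from its namespace has been seen."""

    def __init__(self, registry: Dict[str, str]) -> None:
        self._registry = registry
        self.bound: Dict[str, str] = {}

    def note(self, iri: str) -> None:
        namespace, _local = split_iri(iri)
        prefix = self._registry.get(namespace)
        if prefix is not None:
            # setdefault keeps the first binding if the prefix is already taken.
            self.bound.setdefault(prefix, namespace)


bindings = LazyBindings(REGISTRY)
for iri in ("https://schema.org/datePublished", "a:", "https://schema.org/name"):
    bindings.note(iri)
print(bindings.bound)  # {'sdo': 'https://schema.org/'}
```

Because the lookup is a single dict access per IRI, the per-term cost stays small, though it would still be paid for every subject, predicate and object that is added or parsed.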

niklasl · 2022-01-22T13:56:37Z

OK, I think the best bet is to wind back the adding of 20 or so namespaces! We can revert to adding just RDF, RDFS, OWL & XSD (and XML for RDF/XML) and then think of some new function or graph init flag to add in others, such as the 20 in namespaces/.

I'll try and come up with a PR for this.

That sounds good!

One thing about the XML namespace: it "shouldn't" really be defined as an rdflib.Namespace, and certainly not be bound in the Graph. It is never declared in RDF/XML documents, nor any XML. While xml:lang is certainly used in RDF/XML, it is just core XML syntax (implicitly declared and a reserved prefix).

Given that it is used in RDFLib XML processing code for attribute access I'm not proposing to remove it entirely. But it ought to be clarified that it is not to be used as an "RDF namespace" (i.e. a vocabulary/ontology prefix), and thus I don't think it should ever be bound in a graph (unless intentionally, for reasons I don't understand but are not up to me to question).

ashleysommer · 2022-01-23T22:37:12Z

I'll chime in here too. Last week I found that the latest version of RDFLib (with the extra prefixes in the Namespaces) caused a PySHACL test to start failing. I tracked it down to a conflicting namespace (the test defined the prefix to be X, but RDFLib defined it to be Z) that caused the test to no longer pass. However, it did expose a problem with the test (a misspelled ontology URI) and broken logic in the test (it shouldn't have passed to begin with). That helped me fix the test.

Anyway, I'm in favour of keeping the prefixes, and auto-removing them in the serializers.

nicholascar · 2022-01-23T23:58:02Z

I'm part way to solving this, see #1686

mgberg · 2022-06-01T16:01:24Z

Another scenario where I would personally find it useful to have JSON-LD auto_compact only show "used" prefixes (like the N3 serializer does) is when you are serializing one graph from a Dataset. The Dataset and Graphs obtained using the get_context or graph methods share a NamespaceManager, and one named graph will likely not contain usages of all the prefixes bound (automatically or manually) to the Dataset. Therefore, you can still end up with a large dictionary containing some unused prefixes for the context of a serialization of one named graph in a Dataset.

cmungall · 2023-08-08T16:09:41Z

In case anyone ends up here confused, as I often do: I believe the current status is that this was fixed in rdflib 6.2.0 but then regressed in later versions.

See #2103 for discussion

nicholascar added the labels "bug" (Something isn't working) and "serialization" (Related to serialization.) on Jan 18, 2022

nicholascar mentioned this issue Jan 18, 2022

Graph comes with a whole lot of unwanted prefixes #1676

Closed

cmungall mentioned this issue Jan 19, 2022

Fix for issue #3 hsolbrig/rdflib-shim#4

Open

nicholascar mentioned this issue Jan 23, 2022

Bind prefixes choices #1686

Merged

cmungall mentioned this issue Feb 11, 2022

rdflib 6.1.0 and 6.1.1 fail unit tests Harold-Solbrig/funowl#34

Closed

nicholascar closed this as completed in #1686 Apr 15, 2022

ajnelson-nist mentioned this issue Jun 1, 2022

Change dependency retrievals casework/CASE-Examples#75

Draft

wmelder mentioned this issue Aug 10, 2022

Fix compatibility with rdflib 6.2 beeldengeluid/beng-lod-server#271

Closed

cmungall mentioned this issue Mar 14, 2024

metamodel OWL has @prefix schema1: <http://schema.org/> . linkml/linkml#1933

Closed
