BREAKING CHANGE: Don't use `publicID` as the name for the default graph. #2406

aucampia · 2023-05-23T20:25:21Z

Summary of changes

When parsing data into a ConjunctiveGraph or Dataset, the triples in the
default graphs in the sources were loaded into a graph named publicID.

This behaviour has been changed, and now the triples from the default graph in
source RDF documents will be loaded into ConjunctiveGraph.default_context or
Dataset.default_context.

The publicID parameter to ConjunctiveGraph.parse and Dataset.parse
constructors will now only be used as the base URI for relative URI resolution.

Checklist

Checked that there aren't other open pull requests for
the same change.
Checked that all tests and type checking passes.
If the change adds new features or changes the RDFLib public API:
- Created an issue to discuss the change and get in-principle agreement.
- Considered adding an example in ./examples.
If the change has a potential impact on users of this project:
- Added or updated tests that fail without the change.
- Updated relevant documentation to avoid inaccuracies.
- Considered adding additional documentation.
Considered granting push permissions to the PR branch,
so maintainers can fix minor issues and keep your PR up to date.

aucampia · 2023-05-24T21:38:42Z

I marked this ready for review as I want to get feedback on the essential changes.

I will probably clean up the docstrings a bit still.

coveralls · 2023-05-24T21:41:56Z

Coverage: 90.854% (+0.003%) from 90.851% when pulling 528465c on aucampia:aucampia/20230522T2329-default_is_default into ad56044 on RDFLib:main.

ghost · 2023-05-25T17:04:02Z

test/test_conjunctivegraph/test_conjunctive_graph.py

I believe this is a supported use of publicID. The test succeeds with current main but fails with the change to Graph.parse:

def test_parse_graph_with_publicid_as_new_cg_subgraph(): from test.data import tarek, hates, cheese cg = ConjunctiveGraph() assert len(list(cg.contexts())) == 0 # Empty contexts exclded from listing # Add a statement to the default graph/context cg.add((tarek, hates, cheese)) assert len(cg) == 1 assert len(list(cg.contexts())) == 1 assert list(cg.contexts())[0] == cg.default_context # Parse ttl graph into subgraph identified with publicID cg.parse( data="<urn:example:tarek> <urn:example:likes> <urn:example:pizza> .", publicID="urn:x-rdflib:context-a", # Now ignored format="ttl", ) # New subgraph should be created assert len(list(cg.contexts())) == 2 # And the newly-created subgraph wth the publicID in the above turtle should exist assert URIRef("urn:x-rdflib:context-a") in [c.identifier for c in list(cg.contexts())]

You are right, the test you shared will work on main and fail on this PR branch, and this is because on main publicID is used as the identifier for the default graph in addition to the base for relative URI resolution in source documents, while in this PR branch, it is only used as the base for relative URI resolution.

Do note, this PR together with every issue it addresses is labelled as breaking change, so with that as baseline, with the issues that this PR tries to address, what we need to answer is:

How should parse work? This is a more critical question than how does parse work in main, given this is labelled as a breaking change, so that is clearly signalling something will be different from the main branch.

What should we do about the issues this tries to address?

If parse should work the way it works in main, we close the issues and move on as then they are not issues. If however the issues are valid, then parse should not work the way it works in main, we have to change how it works. And then the question is only what we should change it to, or even more pertinently, how do we make it better than it is.

I think the way it is working in main, i.e. using publicID as the name of the graph that triples inside the default graph is loaded into, is wrong. This is also not what the documentation of publicID suggests will happen.

rdflib/rdflib/graph.py

Lines 1416 to 1418 in ad56044

- ``publicID``: the logical URI to use as the document base. If None

specified the document location is used (at least in the case where

there is a document location).

Given that description, what I would expect is that the publicID is what relative URIs will be resolved against. And that is also what it is used for, and the only thing it should be used for, I think. And I don't see why the base for URI relative URI resolution should be the same as the graph name that default triples are being loaded into, which should not be named to begin with because as per the spec, has no name:

https://www.w3.org/TR/rdf11-concepts/#section-dataset

Exactly one default graph, being an RDF graph. The default graph does not have a name and may be empty.

Another way to think about this is, what is worse:

Not having the content of default graph in sources loaded into the default graph of Dataset/ConjunctiveGraph

Having to explicitly load something into a sub-graph if you don't want it in the default graph

I think 1 is worse.

All that being said, the long and short of it is: How should parse work and how do we address the issues at hand?

Also, if you think the code should work the way it does in main for loading Turtle into a ConjunctiveGraph, how should it work when you change ttl to trig in your code?

I think if it is difficult to find consensus on what {Dataset,ConjunctiveGraph}.parse(...) should do if the source does not support named graphs, the best option is to just refuse to parse sources that don't support named graphs into Datasets or ConjuctiveGraphs. So then the right behaviour for your example would be to raise an exception, as the operation cannot be defined in an unambiguous way.

It would be good to understand alternative ways how the issues this tries to address can be addressed from a public API perspective. If there are more elegant ways, we should try it, but I do think it is critical that we address the issues in one way or another.

I added some documentation and updated the description to make it clearer.

I agree with the new changes here. publicID is used to resolve relative IRIs (I actually never knew it was used as the name of the graph to load the source's default graph statements for ConjunctiveGraph/Dataset).

So what is the behaviour now with this change when loading source data in ttl format into a Dataset?

I actually never knew it was used as the name of the graph to load the source's default graph statements for ConjunctiveGraph/Dataset.

Me neither until I started to investigate the ramifications of the change - but as aucampia has noted, it is documented.

It's probably worth bearing in mind that the implementation of ConjunctiveGraph predates the W3's definition of Dataset. In RDFLib, a Graph object always has an identifier so I guess it was both unavoidable and natural for the Graph object that was bound to ConjunctiveGraph.default_context to be able to have a user-definable identifier (somewhere in the code/issues someone notes the difficulty of distinguishing a BNode-identified default graph from a BNode-identified named graph).

And this is probably why ConjunctiveGraph can behave a little unintuitively when passed an identifier:

cg = ConjunctiveGraph() assert cg.default_context.identifier == cg.identifier # False # AssertionError: assert BNode('Nd463bea483a649c6918c222b20ad8832') == BNode('N1feef746236f44cea47c83ed58f3e752') cg = ConjunctiveGraph(identifier=URIRef('urn:example:context-0')) assert cg.default_context.identifier == cg.identifier # True

Dataset is a different beast and benefits from a concrete assertion:

“Exactly one default graph, being an RDF graph. The default graph does not have a name.”

Unlike ConjunctiveGraph, Dataset.__init__() does not take an identifier param, a BNode is bound to the Dataset object’s identifier property but the (ahem, unnamed, sigh) default graph has a fixed, unchageable identifier property bound to <urn:x-rdflib:default>:

ds = Dataset() assert isinstance(ds.identifier, BNode) assert ds.default_context.identifier == URIRef('urn:x-rdflib:default')

(I s'pose it's technically user-modifiable by rebinding DATASET_DEFAULT_ID)

So what is the behaviour now with this change when loading source data in ttl format into a Dataset?

Current state-of-play is unchanged, source ttl is loaded into the Graph bound to default_context:

ds = Dataset() ds.parse(location=TEST_DATA_DIR / "variants" / "diverse_triples.ttl", format="ttl") assert len(ds) == 5 assert list(ds.contexts()) == [ds.default_context] assert len(ds.default_context) == 5

Also, if you think the code should work the way it does in main

I wouldn't aspire to hold an opinion. My intention was merely to bring the issue to the attention of reviewers in as clear and straightforward a manner as I could so that any potential impact on existing downstream code could be discussed and advice prepared.

@edmondchuc I never answered you explicitly on this question

So what is the behaviour now with this change when loading source data in ttl format into a Dataset?

The new behaviour is that it goes into the default graph.

rdflib/rdflib/graph.py

Lines 2214 to 2215 in 528465c

If the source is in a format that does not support named graphs it's triples

will be added to the default graph (i.e. `Dataset.default_context`).

When parsing data into a `ConjunctiveGraph` or `Dataset`, the triples in the default graphs in the sources were loaded into a graph named `publicID`. This behaviour has been changed, and now the triples from the default graph in source RDF documents will be loaded into `ConjunctiveGraph.default_context` or `Dataset.default_context`. The `publicID` parameter to `ConjunctiveGraph.parse` and `Dataset.parse` constructors will now only be used as the base URI for relative URI resolution.

ashleysommer · 2023-05-26T03:05:02Z

Moving a topic from the code review thread into the main Issue thread:
@aucampia

I think if it is difficult to find consensus on what {Dataset,ConjunctiveGraph}.parse(...) should do if the source does not support named graphs, the best option is to just refuse to parse sources that don't support named graphs into Datasets or ConjuctiveGraphs. So then the right behaviour for your example would be to raise an exception, as the operation cannot be defined in an unambiguous way.

This is an issue I have been facing in my PySHACL application. In the past I used to use Graph.parse() for parsing input files, but in order to support trig and nquads files, I switched to using Dataset.parse() for all input documents.

This gave arise to an issue when parsing ttl and ntriples files, where the Dataset has a default context graph (named DATASET_DEFAULT_GRAPH_ID) but that graph is empty, and the contents are placed into a new graph in the Dataset. This makes targeting the correct triples difficult because you need to determine which graph in the Dataset they are in.

For me personally the ideal behaviour would be:

If calling Dataset.parse() on a non-named-graph file (eg, turtle, ntriples) and
If the Dataset has a default context graph named DATASET_DEFAULT_GRAPH_ID and
If the default context graph is empty, has no triples in it
Then: remove the empty default graph, parse the contents into a new graph, and set that new graph as the default context on the Dataset.

I know that would introduce inconsistent behaviour when any of those conditions differ slightly.

ghost · 2023-05-26T10:57:56Z

.. an issue when parsing ttl and ntriples files, where the Dataset has a default context graph (named DATASET_DEFAULT_GRAPH_ID) but that graph is empty, and the contents are placed into a new graph in the Dataset.

Yeah, that's wrong, the contents of statements without a context (i.e. triples) shouldn't be placed in an arbitrary new graph, it's a behaviour inherited from a longstanding bug in ConjunctiveGraph which aucampia has addressed with this PR.

ConjunctiveGraph docstring clearly sets expectations for users:

All methods that add triples work against this default graph.

and parse is a method that adds triples.

For me personally the ideal behaviour would be:

1. If calling `Dataset.parse()` on a non-named-graph file (eg, turtle, ntriples) _and_

2. If the Dataset has a default context graph named `DATASET_DEFAULT_GRAPH_ID` _and_

3. If the default context graph is empty, has no triples in it
   Then: remove the empty default graph, parse the contents into a new graph, and set that new graph as the default context on the Dataset.

I know that would introduce inconsistent behaviour when any of those conditions differ slightly.

What we have atm is that parse adds to the default graph (identified by DATASET_DEFAULT_GRAPH_ID):

ds = Dataset()
ds.parse(location=TEST_DATA_DIR / "variants" / "diverse_triples.ttl", format="ttl")
assert len(ds.default_context) == 5
ds.parse(location=TEST_DATA_DIR / "variants" / "simple_triple.ttl", format="ttl")
assert len(ds.default_context) == 6

docs/upgrade6to7.rst

Fix a typo

aucampia · 2023-05-26T18:51:40Z

This gave arise to an issue when parsing ttl and ntriples files, where the Dataset has a default context graph (named DATASET_DEFAULT_GRAPH_ID) but that graph is empty, and the contents are placed into a new graph in the Dataset.

What I suggested in the "Upgrading 6 to 7" document I added here is to do:

    dataset = Dataset()
    dataset.get_context("example:graph_name").parse("http://example.com/source.ttl", format="turtle")

I guess if you don't know whether the source supports named graphs or not, this is not that helpful.

For me personally the ideal behaviour would be:

If calling Dataset.parse() on a non-named-graph file (eg, turtle, ntriples) and

If the Dataset has a default context graph named DATASET_DEFAULT_GRAPH_ID and

If the default context graph is empty, has no triples in it
Then: remove the empty default graph, parse the contents into a new graph, and set that new graph as the default context on the Dataset.

If the default context is empty, would it not be fine to just parse it directly into the empty default graph instead of parsing it into a new graph, and setting the new graph as the default graph?

Regarding the second point, I think we should try and make that a given - or at least we should not support cases where the default context has another name and try and prevent it, as per the spec the default graph is not named. I guess for now the special reserved name is okay, but I would say we should not allow changing the way the default graph is named as it is not clear in what sense that is still the default graph.

Just to understand this better, what would happen if the default context graph is not empy? Would the triples from the source then be added to it?

aucampia · 2023-05-27T16:44:52Z

I will add some tests in separate PRs for:

Relative URI resolution
Calling parse on a graph that already contains triples/quads.

aucampia · 2023-06-01T21:27:58Z

For me personally the ideal behaviour would be:

If calling Dataset.parse() on a non-named-graph file (eg, turtle, ntriples) and

If the Dataset has a default context graph named DATASET_DEFAULT_GRAPH_ID and

If the default context graph is empty, has no triples in it
Then: remove the empty default graph, parse the contents into a new graph, and set that new graph as the default context on the Dataset.

If the default context is empty, would it not be fine to just parse it directly into the empty default graph instead of parsing it into a new graph, and setting the new graph as the default graph?

Just as a final warning, with this change parse() does not clear any Graph in the dataset. I think in previous versions it cleared some graphs in the dataset, but not the default graph. parse should behave consistently, and in most cases it does not clear graphs, so I think the most reasonable thing to do is to not clear graphs.

However, that being said, if all conditions you give in your steps hold (i.e. "If the default context graph is empty, has no triples in it"), the behaviour with this PR will have the same outcome.

aucampia · 2023-06-01T21:31:45Z

I'm going to merge this by this weekend if there is no further feedback, I may include the additional tests for unrelated functionality first, but I am uncertain if I will have time and there are other changes that are urgent.

aucampia · 2023-06-08T19:39:55Z

I'm going to merge this by this weekend if there is no further feedback, I may include the additional tests for unrelated functionality first, but I am uncertain if I will have time and there are other changes that are urgent.

Adding tests for relative URIs will be easier with this, as I want to use the variants test infrastructure for it, and this has some changes there that eliminate the hacks related to the default graph not being the default graph.

If there are problems discovered with the tests that I will add for relative URIs and parsing multiple documents or one document multiple times I will fix it.

ghost · 2023-06-09T13:36:18Z

Just FTR, the use of publicID as destination for uncontextualised statements was a fix for issue #535, made in PR #536, sorry it took me so long to track it down.

aucampia · 2023-06-09T13:46:53Z

Just FTR, the use of publicID as destination for uncontextualised statements was a fix for issue #535, made in PR #536, sorry it took me so long to track it down.

Thanks for sharing @gjhiggins - @pchampin as you raised the original issue it may be worth getting your input here.

pchampin · 2023-07-20T15:50:01Z

Just FTR, the use of publicID as destination for uncontextualised statements was a fix for issue #535, made in PR #536, sorry it took me so long to track it down.

Thanks for sharing @gjhiggins - @pchampin as you raised the original issue it may be worth getting your input here.

Sorry for the late response. I am satisfied with this change. Reusing publicID as the name of the default graph was indeed a workaround (either default_context didn't exist at the time, or I missed it...); the new behaviour is cleaner.

`LOAD ... INTO GRAPH` stopped working correctly after the change to handling of the `publicID` `Graph.parse` parameter in RDFLib 7.0.0 (<RDFLib#2406>). This is because `LOAD` evaluation relied on `publicID` to select the graph name. So after <RDFLib#2406> data would be loaded into the default graph even if a named graph is specified. This change adds tests for `LOAD ... INTO GRAPH` and fixes the load evaluation.

`LOAD ... INTO GRAPH` stopped working correctly after the change to handling of the `publicID` `Graph.parse` parameter in RDFLib 7.0.0 (<RDFLib#2406>). This is because `LOAD` evaluation relied on `publicID` to select the graph name. So after <RDFLib#2406> data would be loaded into the default graph even if a named graph is specified. This change adds tests for `LOAD ... INTO GRAPH` and fixes the load evaluation. A consequence of this change is also that relative IRI lookup for graphs loaded with `LOAD ... INTO GRAPH` is now relative to the source document URI instead of the base URI of the graph being loaded into, which is more correct.

`LOAD ... INTO GRAPH` stopped working correctly after the change to handling of the `publicID` `Graph.parse` parameter in RDFLib 7.0.0 (<#2406>). This is because `LOAD` evaluation relied on `publicID` to select the graph name. So after <#2406> data would be loaded into the default graph even if a named graph is specified. This change adds tests for `LOAD ... INTO GRAPH` and fixes the load evaluation. A consequence of this change is also that relative IRI lookup for graphs loaded with `LOAD ... INTO GRAPH` is now relative to the source document URI instead of the base URI of the graph being loaded into, which is more correct.

aucampia requested a review from a team May 24, 2023 21:04

aucampia marked this pull request as ready for review May 24, 2023 21:38

ghost reviewed May 25, 2023

View reviewed changes

aucampia force-pushed the aucampia/20230522T2329-default_is_default branch from 0cf29b0 to db7313a Compare May 25, 2023 20:28

aucampia changed the title ~~Add triples from default graphs in sources to the {Dataset,ConjunctiveGraph}.default_context~~ BREAKING CHANGE: Don't use publicID as the name for the default graph. May 25, 2023

aucampia commented May 26, 2023

View reviewed changes

docs/upgrade6to7.rst Outdated Show resolved Hide resolved

Update docs/upgrade6to7.rst

528465c

Fix a typo

aucampia mentioned this pull request Jun 8, 2023

Mark ConjunctiveGraph as deprecated #2405

Open

aucampia merged commit 4b96e9d into RDFLib:main Jun 8, 2023
25 checks passed

aucampia mentioned this pull request Jun 21, 2023

The default graph should have no name. #2445

Open

This was referenced Jul 6, 2023

aucampia/20230706T0034 refresh literal tests aucampia/rdflib#98

Open

aucampia/20230706T2301 expected outcome aucampia/rdflib#99

Draft

aucampia mentioned this pull request Aug 28, 2023

fix: SPARQL LOAD ... INTO GRAPH handling #2554

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BREAKING CHANGE: Don't use `publicID` as the name for the default graph. #2406

BREAKING CHANGE: Don't use `publicID` as the name for the default graph. #2406

aucampia commented May 23, 2023 •

edited

aucampia commented May 24, 2023

coveralls commented May 24, 2023 •

edited

ghost May 25, 2023

aucampia May 25, 2023 •

edited

aucampia May 25, 2023 •

edited

aucampia May 25, 2023 •

edited

aucampia May 25, 2023 •

edited

aucampia May 25, 2023

edmondchuc May 26, 2023

ghost May 26, 2023

ghost May 26, 2023

aucampia Jun 1, 2023

ashleysommer commented May 26, 2023

ghost commented May 26, 2023

aucampia commented May 26, 2023

aucampia commented May 27, 2023 •

edited

aucampia commented Jun 1, 2023

aucampia commented Jun 1, 2023

aucampia commented Jun 8, 2023

ghost commented Jun 9, 2023

aucampia commented Jun 9, 2023

pchampin commented Jul 20, 2023

	- ``publicID``: the logical URI to use as the document base. If None
	specified the document location is used (at least in the case where
	there is a document location).

	If the source is in a format that does not support named graphs it's triples
	will be added to the default graph (i.e. `Dataset.default_context`).

BREAKING CHANGE: Don't use publicID as the name for the default graph. #2406

BREAKING CHANGE: Don't use publicID as the name for the default graph. #2406

Conversation

aucampia commented May 23, 2023 • edited

Summary of changes

Checklist

aucampia commented May 24, 2023

coveralls commented May 24, 2023 • edited

ghost May 25, 2023

Choose a reason for hiding this comment

aucampia May 25, 2023 • edited

Choose a reason for hiding this comment

aucampia May 25, 2023 • edited

Choose a reason for hiding this comment

aucampia May 25, 2023 • edited

Choose a reason for hiding this comment

aucampia May 25, 2023 • edited

Choose a reason for hiding this comment

aucampia May 25, 2023

Choose a reason for hiding this comment

edmondchuc May 26, 2023

Choose a reason for hiding this comment

ghost May 26, 2023

Choose a reason for hiding this comment

ghost May 26, 2023

Choose a reason for hiding this comment

aucampia Jun 1, 2023

Choose a reason for hiding this comment

ashleysommer commented May 26, 2023

ghost commented May 26, 2023

aucampia commented May 26, 2023

aucampia commented May 27, 2023 • edited

aucampia commented Jun 1, 2023

aucampia commented Jun 1, 2023

aucampia commented Jun 8, 2023

ghost commented Jun 9, 2023

aucampia commented Jun 9, 2023

pchampin commented Jul 20, 2023

BREAKING CHANGE: Don't use `publicID` as the name for the default graph. #2406

BREAKING CHANGE: Don't use `publicID` as the name for the default graph. #2406

aucampia commented May 23, 2023 •

edited

coveralls commented May 24, 2023 •

edited

aucampia May 25, 2023 •

edited

aucampia May 25, 2023 •

edited

aucampia May 25, 2023 •

edited

aucampia May 25, 2023 •

edited

aucampia commented May 27, 2023 •

edited