Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BREAKING CHANGE: Don't use publicID as the name for the default graph. #2406

Merged

Conversation

aucampia
Copy link
Member

@aucampia aucampia commented May 23, 2023

Summary of changes

When parsing data into a ConjunctiveGraph or Dataset, the triples in the
default graphs in the sources were loaded into a graph named publicID.

This behaviour has been changed, and now the triples from the default graph in
source RDF documents will be loaded into ConjunctiveGraph.default_context or
Dataset.default_context.

The publicID parameter to ConjunctiveGraph.parse and Dataset.parse
constructors will now only be used as the base URI for relative URI resolution.

Checklist

  • Checked that there aren't other open pull requests for
    the same change.
  • Checked that all tests and type checking passes.
  • If the change adds new features or changes the RDFLib public API:
    • Created an issue to discuss the change and get in-principle agreement.
    • Considered adding an example in ./examples.
  • If the change has a potential impact on users of this project:
    • Added or updated tests that fail without the change.
    • Updated relevant documentation to avoid inaccuracies.
    • Considered adding additional documentation.
  • Considered granting push permissions to the PR branch,
    so maintainers can fix minor issues and keep your PR up to date.

@aucampia aucampia requested a review from a team May 24, 2023 21:04
@aucampia
Copy link
Member Author

I marked this ready for review as I want to get feedback on the essential changes.

I will probably clean up the docstrings a bit still.

@coveralls
Copy link

coveralls commented May 24, 2023

Coverage Status

Coverage: 90.854% (+0.003%) from 90.851% when pulling 528465c on aucampia:aucampia/20230522T2329-default_is_default into ad56044 on RDFLib:main.

@aucampia aucampia added review wanted This indicates that the PR is ready for review fix Fixes an issue format: TriG Related to the TriG format. format: N-Quads Related to N-Quads format. concept: RDF dataset Relates to the RDF datasets concept. breaking change This involves or proposes breaking RDFLib's public API. and removed format: TriG Related to the TriG format. format: N-Quads Related to N-Quads format. labels May 24, 2023
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is a supported use of publicID. The test succeeds with current main but fails with the change to Graph.parse:

def test_parse_graph_with_publicid_as_new_cg_subgraph():
    from test.data import tarek, hates, cheese
    cg = ConjunctiveGraph()
    assert len(list(cg.contexts())) == 0  # Empty contexts exclded from listing
    # Add a statement to the default graph/context
    cg.add((tarek, hates, cheese))
    assert len(cg) == 1
    assert len(list(cg.contexts())) == 1
    assert list(cg.contexts())[0] == cg.default_context
    # Parse ttl graph into subgraph identified with publicID
    cg.parse(
        data="<urn:example:tarek> <urn:example:likes> <urn:example:pizza> .",
        publicID="urn:x-rdflib:context-a",  # Now ignored
        format="ttl",
    )
    # New subgraph should be created
    assert len(list(cg.contexts())) == 2
    # And the newly-created subgraph wth the publicID in the above turtle should exist
    assert URIRef("urn:x-rdflib:context-a") in [c.identifier for c in list(cg.contexts())]

Copy link
Member Author

@aucampia aucampia May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, the test you shared will work on main and fail on this PR branch, and this is because on main publicID is used as the identifier for the default graph in addition to the base for relative URI resolution in source documents, while in this PR branch, it is only used as the base for relative URI resolution.

Do note, this PR together with every issue it addresses is labelled as breaking change, so with that as baseline, with the issues that this PR tries to address, what we need to answer is:

  1. How should parse work? This is a more critical question than how does parse work in main, given this is labelled as a breaking change, so that is clearly signalling something will be different from the main branch.
  2. What should we do about the issues this tries to address?

If parse should work the way it works in main, we close the issues and move on as then they are not issues. If however the issues are valid, then parse should not work the way it works in main, we have to change how it works. And then the question is only what we should change it to, or even more pertinently, how do we make it better than it is.

I think the way it is working in main, i.e. using publicID as the name of the graph that triples inside the default graph is loaded into, is wrong. This is also not what the documentation of publicID suggests will happen.

rdflib/rdflib/graph.py

Lines 1416 to 1418 in ad56044

- ``publicID``: the logical URI to use as the document base. If None
specified the document location is used (at least in the case where
there is a document location).

Given that description, what I would expect is that the publicID is what relative URIs will be resolved against. And that is also what it is used for, and the only thing it should be used for, I think. And I don't see why the base for URI relative URI resolution should be the same as the graph name that default triples are being loaded into, which should not be named to begin with because as per the spec, has no name:

https://www.w3.org/TR/rdf11-concepts/#section-dataset

Exactly one default graph, being an RDF graph. The default graph does not have a name and may be empty.

Another way to think about this is, what is worse:

  1. Not having the content of default graph in sources loaded into the default graph of Dataset/ConjunctiveGraph
  2. Having to explicitly load something into a sub-graph if you don't want it in the default graph

I think 1 is worse.

Copy link
Member Author

@aucampia aucampia May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All that being said, the long and short of it is: How should parse work and how do we address the issues at hand?

Copy link
Member Author

@aucampia aucampia May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if you think the code should work the way it does in main for loading Turtle into a ConjunctiveGraph, how should it work when you change ttl to trig in your code?

I think if it is difficult to find consensus on what {Dataset,ConjunctiveGraph}.parse(...) should do if the source does not support named graphs, the best option is to just refuse to parse sources that don't support named graphs into Datasets or ConjuctiveGraphs. So then the right behaviour for your example would be to raise an exception, as the operation cannot be defined in an unambiguous way.

Copy link
Member Author

@aucampia aucampia May 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to understand alternative ways how the issues this tries to address can be addressed from a public API perspective. If there are more elegant ways, we should try it, but I do think it is critical that we address the issues in one way or another.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some documentation and updated the description to make it clearer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the new changes here. publicID is used to resolve relative IRIs (I actually never knew it was used as the name of the graph to load the source's default graph statements for ConjunctiveGraph/Dataset).

So what is the behaviour now with this change when loading source data in ttl format into a Dataset?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually never knew it was used as the name of the graph to load the source's default graph statements for ConjunctiveGraph/Dataset.

Me neither until I started to investigate the ramifications of the change - but as aucampia has noted, it is documented.

It's probably worth bearing in mind that the implementation of ConjunctiveGraph predates the W3's definition of Dataset. In RDFLib, a Graph object always has an identifier so I guess it was both unavoidable and natural for the Graph object that was bound to ConjunctiveGraph.default_context to be able to have a user-definable identifier (somewhere in the code/issues someone notes the difficulty of distinguishing a BNode-identified default graph from a BNode-identified named graph).

And this is probably why ConjunctiveGraph can behave a little unintuitively when passed an identifier:

cg = ConjunctiveGraph()
assert cg.default_context.identifier == cg.identifier  # False
# AssertionError: assert BNode('Nd463bea483a649c6918c222b20ad8832') == BNode('N1feef746236f44cea47c83ed58f3e752')

cg = ConjunctiveGraph(identifier=URIRef('urn:example:context-0'))
assert cg.default_context.identifier == cg.identifier  # True

Dataset is a different beast and benefits from a concrete assertion:

“Exactly one default graph, being an RDF graph. The default graph does not have a name.”

Unlike ConjunctiveGraph, Dataset.__init__() does not take an identifier param, a BNode is bound to the Dataset object’s identifier property but the (ahem, unnamed, sigh) default graph has a fixed, unchageable identifier property bound to <urn:x-rdflib:default>:

ds = Dataset()
assert isinstance(ds.identifier, BNode)
assert ds.default_context.identifier == URIRef('urn:x-rdflib:default')

(I s'pose it's technically user-modifiable by rebinding DATASET_DEFAULT_ID)

So what is the behaviour now with this change when loading source data in ttl format into a Dataset?

Current state-of-play is unchanged, source ttl is loaded into the Graph bound to default_context:

ds = Dataset()
ds.parse(location=TEST_DATA_DIR / "variants" / "diverse_triples.ttl", format="ttl")
assert len(ds) == 5
assert list(ds.contexts()) == [ds.default_context]
assert len(ds.default_context) == 5

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if you think the code should work the way it does in main

I wouldn't aspire to hold an opinion. My intention was merely to bring the issue to the attention of reviewers in as clear and straightforward a manner as I could so that any potential impact on existing downstream code could be discussed and advice prepared.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edmondchuc I never answered you explicitly on this question

So what is the behaviour now with this change when loading source data in ttl format into a Dataset?

The new behaviour is that it goes into the default graph.

rdflib/rdflib/graph.py

Lines 2214 to 2215 in 528465c

If the source is in a format that does not support named graphs it's triples
will be added to the default graph (i.e. `Dataset.default_context`).

When parsing data into a `ConjunctiveGraph` or `Dataset`, the triples in the
default graphs in the sources were loaded into a graph named `publicID`.

This behaviour has been changed, and now the triples from the default graph in
source RDF documents will be loaded into `ConjunctiveGraph.default_context` or
`Dataset.default_context`.

The `publicID` parameter to `ConjunctiveGraph.parse` and `Dataset.parse`
constructors will now only be used as the base URI for relative URI resolution.
@aucampia aucampia force-pushed the aucampia/20230522T2329-default_is_default branch from 0cf29b0 to db7313a Compare May 25, 2023 20:28
@aucampia aucampia changed the title Add triples from default graphs in sources to the {Dataset,ConjunctiveGraph}.default_context BREAKING CHANGE: Don't use publicID as the name for the default graph. May 25, 2023
@ashleysommer
Copy link
Contributor

Moving a topic from the code review thread into the main Issue thread:
@aucampia

I think if it is difficult to find consensus on what {Dataset,ConjunctiveGraph}.parse(...) should do if the source does not support named graphs, the best option is to just refuse to parse sources that don't support named graphs into Datasets or ConjuctiveGraphs. So then the right behaviour for your example would be to raise an exception, as the operation cannot be defined in an unambiguous way.

This is an issue I have been facing in my PySHACL application. In the past I used to use Graph.parse() for parsing input files, but in order to support trig and nquads files, I switched to using Dataset.parse() for all input documents.

This gave arise to an issue when parsing ttl and ntriples files, where the Dataset has a default context graph (named DATASET_DEFAULT_GRAPH_ID) but that graph is empty, and the contents are placed into a new graph in the Dataset. This makes targeting the correct triples difficult because you need to determine which graph in the Dataset they are in.

For me personally the ideal behaviour would be:

  1. If calling Dataset.parse() on a non-named-graph file (eg, turtle, ntriples) and
  2. If the Dataset has a default context graph named DATASET_DEFAULT_GRAPH_ID and
  3. If the default context graph is empty, has no triples in it
    Then: remove the empty default graph, parse the contents into a new graph, and set that new graph as the default context on the Dataset.

I know that would introduce inconsistent behaviour when any of those conditions differ slightly.

@ghost
Copy link

ghost commented May 26, 2023

.. an issue when parsing ttl and ntriples files, where the Dataset has a default context graph (named DATASET_DEFAULT_GRAPH_ID) but that graph is empty, and the contents are placed into a new graph in the Dataset.

Yeah, that's wrong, the contents of statements without a context (i.e. triples) shouldn't be placed in an arbitrary new graph, it's a behaviour inherited from a longstanding bug in ConjunctiveGraph which aucampia has addressed with this PR.

ConjunctiveGraph docstring clearly sets expectations for users:

All methods that add triples work against this default graph.

and parse is a method that adds triples.

For me personally the ideal behaviour would be:

1. If calling `Dataset.parse()` on a non-named-graph file (eg, turtle, ntriples) _and_

2. If the Dataset has a default context graph named `DATASET_DEFAULT_GRAPH_ID` _and_

3. If the default context graph is empty, has no triples in it
   Then: remove the empty default graph, parse the contents into a new graph, and set that new graph as the default context on the Dataset.

I know that would introduce inconsistent behaviour when any of those conditions differ slightly.

What we have atm is that parse adds to the default graph (identified by DATASET_DEFAULT_GRAPH_ID):

ds = Dataset()
ds.parse(location=TEST_DATA_DIR / "variants" / "diverse_triples.ttl", format="ttl")
assert len(ds.default_context) == 5
ds.parse(location=TEST_DATA_DIR / "variants" / "simple_triple.ttl", format="ttl")
assert len(ds.default_context) == 6

docs/upgrade6to7.rst Outdated Show resolved Hide resolved
@aucampia
Copy link
Member Author

This gave arise to an issue when parsing ttl and ntriples files, where the Dataset has a default context graph (named DATASET_DEFAULT_GRAPH_ID) but that graph is empty, and the contents are placed into a new graph in the Dataset.

What I suggested in the "Upgrading 6 to 7" document I added here is to do:

    dataset = Dataset()
    dataset.get_context("example:graph_name").parse("http://example.com/source.ttl", format="turtle")

I guess if you don't know whether the source supports named graphs or not, this is not that helpful.

For me personally the ideal behaviour would be:

  1. If calling Dataset.parse() on a non-named-graph file (eg, turtle, ntriples) and
  2. If the Dataset has a default context graph named DATASET_DEFAULT_GRAPH_ID and
  3. If the default context graph is empty, has no triples in it
    Then: remove the empty default graph, parse the contents into a new graph, and set that new graph as the default context on the Dataset.

If the default context is empty, would it not be fine to just parse it directly into the empty default graph instead of parsing it into a new graph, and setting the new graph as the default graph?

Regarding the second point, I think we should try and make that a given - or at least we should not support cases where the default context has another name and try and prevent it, as per the spec the default graph is not named. I guess for now the special reserved name is okay, but I would say we should not allow changing the way the default graph is named as it is not clear in what sense that is still the default graph.

Just to understand this better, what would happen if the default context graph is not empy? Would the triples from the source then be added to it?

@aucampia
Copy link
Member Author

aucampia commented May 27, 2023

I will add some tests in separate PRs for:

  • Relative URI resolution
  • Calling parse on a graph that already contains triples/quads.

@aucampia
Copy link
Member Author

aucampia commented Jun 1, 2023

For me personally the ideal behaviour would be:

  1. If calling Dataset.parse() on a non-named-graph file (eg, turtle, ntriples) and
  2. If the Dataset has a default context graph named DATASET_DEFAULT_GRAPH_ID and
  3. If the default context graph is empty, has no triples in it
    Then: remove the empty default graph, parse the contents into a new graph, and set that new graph as the default context on the Dataset.

If the default context is empty, would it not be fine to just parse it directly into the empty default graph instead of parsing it into a new graph, and setting the new graph as the default graph?

Just as a final warning, with this change parse() does not clear any Graph in the dataset. I think in previous versions it cleared some graphs in the dataset, but not the default graph. parse should behave consistently, and in most cases it does not clear graphs, so I think the most reasonable thing to do is to not clear graphs.

However, that being said, if all conditions you give in your steps hold (i.e. "If the default context graph is empty, has no triples in it"), the behaviour with this PR will have the same outcome.

@aucampia
Copy link
Member Author

aucampia commented Jun 1, 2023

I'm going to merge this by this weekend if there is no further feedback, I may include the additional tests for unrelated functionality first, but I am uncertain if I will have time and there are other changes that are urgent.

@aucampia
Copy link
Member Author

aucampia commented Jun 8, 2023

I'm going to merge this by this weekend if there is no further feedback, I may include the additional tests for unrelated functionality first, but I am uncertain if I will have time and there are other changes that are urgent.

Adding tests for relative URIs will be easier with this, as I want to use the variants test infrastructure for it, and this has some changes there that eliminate the hacks related to the default graph not being the default graph.

If there are problems discovered with the tests that I will add for relative URIs and parsing multiple documents or one document multiple times I will fix it.

@aucampia aucampia merged commit 4b96e9d into RDFLib:main Jun 8, 2023
25 checks passed
@ghost
Copy link

ghost commented Jun 9, 2023

Just FTR, the use of publicID as destination for uncontextualised statements was a fix for issue #535, made in PR #536, sorry it took me so long to track it down.

@aucampia
Copy link
Member Author

aucampia commented Jun 9, 2023

Just FTR, the use of publicID as destination for uncontextualised statements was a fix for issue #535, made in PR #536, sorry it took me so long to track it down.

Thanks for sharing @gjhiggins - @pchampin as you raised the original issue it may be worth getting your input here.

@pchampin
Copy link
Contributor

Just FTR, the use of publicID as destination for uncontextualised statements was a fix for issue #535, made in PR #536, sorry it took me so long to track it down.

Thanks for sharing @gjhiggins - @pchampin as you raised the original issue it may be worth getting your input here.

Sorry for the late response. I am satisfied with this change. Reusing publicID as the name of the default graph was indeed a workaround (either default_context didn't exist at the time, or I missed it...); the new behaviour is cleaner.

aucampia added a commit to aucampia/rdflib that referenced this pull request Aug 28, 2023
`LOAD ... INTO GRAPH` stopped working correctly after the change to
handling of the `publicID` `Graph.parse` parameter in RDFLib 7.0.0
(<RDFLib#2406>).

This is because `LOAD` evaluation relied on `publicID` to select the
graph name. So after <RDFLib#2406> data
would be loaded into the default graph even if a named graph is
specified.

This change adds tests for `LOAD ... INTO GRAPH` and fixes the load
evaluation.
aucampia added a commit to aucampia/rdflib that referenced this pull request Aug 28, 2023
`LOAD ... INTO GRAPH` stopped working correctly after the change to
handling of the `publicID` `Graph.parse` parameter in RDFLib 7.0.0
(<RDFLib#2406>).

This is because `LOAD` evaluation relied on `publicID` to select the
graph name. So after <RDFLib#2406> data
would be loaded into the default graph even if a named graph is
specified.

This change adds tests for `LOAD ... INTO GRAPH` and fixes the load
evaluation.
aucampia added a commit to aucampia/rdflib that referenced this pull request Aug 28, 2023
`LOAD ... INTO GRAPH` stopped working correctly after the change to
handling of the `publicID` `Graph.parse` parameter in RDFLib 7.0.0
(<RDFLib#2406>).

This is because `LOAD` evaluation relied on `publicID` to select the
graph name. So after <RDFLib#2406> data
would be loaded into the default graph even if a named graph is
specified.

This change adds tests for `LOAD ... INTO GRAPH` and fixes the load
evaluation.

A consequence of this change is also that relative IRI lookup for graphs
loaded with `LOAD ... INTO GRAPH` is now relative to the source document
URI instead of the base URI of the graph being loaded into, which is
more correct.
aucampia added a commit to aucampia/rdflib that referenced this pull request Aug 29, 2023
`LOAD ... INTO GRAPH` stopped working correctly after the change to
handling of the `publicID` `Graph.parse` parameter in RDFLib 7.0.0
(<RDFLib#2406>).

This is because `LOAD` evaluation relied on `publicID` to select the
graph name. So after <RDFLib#2406> data
would be loaded into the default graph even if a named graph is
specified.

This change adds tests for `LOAD ... INTO GRAPH` and fixes the load
evaluation.

A consequence of this change is also that relative IRI lookup for graphs
loaded with `LOAD ... INTO GRAPH` is now relative to the source document
URI instead of the base URI of the graph being loaded into, which is
more correct.
aucampia added a commit that referenced this pull request Aug 30, 2023
`LOAD ... INTO GRAPH` stopped working correctly after the change to
handling of the `publicID` `Graph.parse` parameter in RDFLib 7.0.0
(<#2406>).

This is because `LOAD` evaluation relied on `publicID` to select the
graph name. So after <#2406> data
would be loaded into the default graph even if a named graph is
specified.

This change adds tests for `LOAD ... INTO GRAPH` and fixes the load
evaluation.

A consequence of this change is also that relative IRI lookup for graphs
loaded with `LOAD ... INTO GRAPH` is now relative to the source document
URI instead of the base URI of the graph being loaded into, which is
more correct.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change This involves or proposes breaking RDFLib's public API. concept: RDF dataset Relates to the RDF datasets concept. fix Fixes an issue review wanted This indicates that the PR is ready for review
Projects
None yet
5 participants