Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove singly-used alias obviated by IdentifiedNode #1695

Merged
merged 6 commits into from
Mar 6, 2022

Conversation

ajnelson-nist
Copy link
Contributor

ContextNode did not appear anywhere else in the code base.

This PR has no associated Issue.

Proposed Changes

  • Remove ContextNode from /rdflib/graph.py.
  • Import IdentifiedNode into /rdflib/graph.py.

ContextNode did not appear anywhere else in the code base.

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
Copy link
Member

@aucampia aucampia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, change looks good, but I am a bit worried if someone has already used ContextNode, as this will affect them.

@ajnelson-nist
Copy link
Contributor Author

Thanks, change looks good, but I am a bit worried if someone has already used ContextNode, as this will affect them.

Understood. I didn't know that alias was hanging around in graph.py. Should this patch be merged into a next-major-release branch instead of master? If so, what's the name of that branch?

@aucampia
Copy link
Member

Understood. I didn't know that alias was hanging around in graph.py. Should this patch be merged into a next-major-release branch instead of master?

Think best that @nicholascar give some input here, I would be slightly more comfortable waiting for the major release, but not sure it is that critical.

If so, what's the name of that branch?

I think the current approach is to just leave the pull requests and keep rebasing/merging them until we are ready for them, if we have permission from the branch creator to make changes to it then it works best that way.

@ajnelson-nist
Copy link
Contributor Author

I'm happy to await input.

I have the "Allow edits" box checked, so hopefully you can make changes.

@nicholascar
Copy link
Member

ContextNode is only recent and appears in the 6.0.1+ releases. It's just extra typing - not a class or function - so I don't think it's a big deal to remove it.

We are planning a 7.0.0 release in the next few months, but the aims of that are to tidy up Dataset & ConjunctiveGraph, probably by deprecating the latter.

Perhaps it's sensible to issue a 6.2.0 for this level of change? We can merge into master for this: it's ok to have master a bit different to the latest release.

@aucampia
Copy link
Member

aucampia commented Jan 25, 2022

Perhaps it's sensible to issue a 6.2.0 for this level of change? We can merge into master for this: it's ok to have master a bit different to the latest release.

My concern is, if someone did this their code will break at runtime:

from  rdflib.graph import ContextNode

def foo(context_node: ContextNode) -> None:
    pass

I would say there is a low probability that this was done, but it is possible none the less.

I don't have problem with this going into master, I am a bit concerned, but not that much, if it causes any real problems then we can always issue a patch release to put it back, so I'm fine with putting it in master for 6.2.x.

@ghost
Copy link

ghost commented Jan 28, 2022

We are planning a 7.0.0 release in the next few months, but the aims of that are to tidy up Dataset & ConjunctiveGraph, probably by deprecating the latter.

I'd appreciate some early guidelines on the degree of type rigour that 7.0 is planned to impose on the API. atm, these tests succeed:

import pytest
from rdflib import XSD
from rdflib.graph import ContextNode, Graph
from rdflib.term import IdentifiedNode, Literal, Node, URIRef

def test_literal_as_graph_identifier():
    g = Graph(identifier=Literal("mygraph", datatype=XSD.string))

    assert isinstance(g.identifier, Node)

    with pytest.raises(AssertionError):
        assert isinstance(g.identifier, IdentifiedNode)

    with pytest.raises(AssertionError):
        assert type(g.identifier) is ContextNode

def test_string_as_graph_identifier():
    g = Graph(identifier="mygraph")

    assert isinstance(g.identifier, str)

    assert isinstance(g.identifier, IdentifiedNode)

    with pytest.raises(AssertionError):
        assert type(g.identifier) is ContextNode

rdflib/graph.py Outdated
ContextNode = Union[BNode, URIRef]
DatasetQuad = Tuple[Node, URIRef, Node, Optional[ContextNode]]
# Type aliase to make unpacking what's going on a little more human friendly
DatasetQuad = Tuple[Node, URIRef, Node, Optional[IdentifiedNode]]
Copy link

@ghost ghost Feb 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an issue with the extant typing of ContextNode which this change understandably doesn't address. The Graph identifier is (correctly) type-hinted as Optional[Union[Node, str]] and Optional[IdentifiedNode] doesn't capture the str component.

Meh, I forgot that Graph also accepts a Literal as an identifier, also not captured by IdentifiedNode.

An all-inclusive ContextNode would be Union[IdentifiedNode, Literal, str, None]

Copy link
Member

@nicholascar nicholascar Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this last point is correct: any of those things can be used for a context node, so we may have a case here to now remove ContextNode as I don't think it can be obviated by IdentifiedNode

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I certainly missed this use case ContextNode was supporting, apparently incompletely. I suppose the question now is, is ContextNode intended to be a symbol meant for export and re-consumption, or is it internal to the module? If internal, would it be better suited as being named _ContextNode?

Copy link

@ghost ghost Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it should be ContextNode[Any] 😄 ...

    g1 = Graph(identifier="⚠️")
    logger.debug(f"{g1.identifier}")

    g2 = Graph(identifier=Literal("foo"))
    logger.debug(f"{g2.identifier}")

    g3 = Graph(identifier=BNode())
    logger.debug(f"{g3.identifier}")

    g4 = Graph(identifier=URIRef("urn:example:context-1"))
    logger.debug(f"{g4.identifier}")

    g5 = Graph(identifier=FOAF.name)
    logger.debug(f"{g5.identifier}")

    g6 = Graph(identifier=Namespace("http://rdflib.net"))
    logger.debug(f"{g6.identifier}")

    g7 = Graph(identifier=Variable("xyzzy"))
    logger.debug(f"{g7.identifier}")

    # At least this is dereferenced, phew!
    g8 = Graph(identifier=g1)
    logger.debug(f"{g8.identifier}")
    # lol
    g9 = Graph(identifier=(tarek, likes, pizza))
    logger.debug(f"{g9.identifier}")
⚠️
foo
N487b9bd4146e44e8b039eadca3a36ec8
urn:example:context-1
http://xmlns.com/foaf/0.1/name
http://rdflib.net
xyzzy
N3b7d314ec941418d9d0f0f6301846796
(rdflib.term.URIRef('urn:example:tarek'), 
 rdflib.term.URIRef('urn:example:likes'), 
 rdflib.term.URIRef('urn:example:pizza'))
PASSED                                                                   [100%]

Anything that returns a str from repr is valid, so most of the successes are unsurprising. (For clarity, "tarek" etc are URIRefs)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meh, I forgot that Graph also accepts a Literal as an identifier, also not captured by IdentifiedNode.

It is important to note, this type alias is only used in one place, which is the return value for Dataset.__iter__ so it should be correct for that, and for this reason it probably should not be a type alias, or at the very least not one outside of the class. At the very least a type alias on module level should be used in more than one class/function in that module before it is clear that it makes sense for it to be an alias, as there may be cases where something looks like a ContextNode and needs to have Literal, str, Node and None - and other places where IdentifiedNode really is the most appropriate.

Also Literal is not in the value space of graph identifiers/names according to RDF 1.1:

An RDF dataset is a collection of RDF graphs, and comprises:

  • Exactly one default graph, being an RDF graph. The default graph does not have a name and may be empty.
  • Zero or more named graphs. Each named graph is a pair consisting of an IRI or a blank node (the graph name), and an RDF graph. Graph names are unique within an RDF dataset.

So if Graph accepts it as an identifier, it probably should not, and the right thing is to narrow the type there to match the correct return type for Dataset.__iter__ - if Dataset.__iter__ does return something that is contrary to the spec I think we should at least understand where things go awry also.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an issue with the extant typing of ContextNode which this change understandably doesn't address. The Graph identifier is (correctly) type-hinted as Optional[Union[Node, str]] and Optional[IdentifiedNode] doesn't capture the str component

as noted elsewhere, Graph will convert any identifier that is not a Node to a URIRef:

rdflib/rdflib/graph.py

Lines 342 to 345 in 4eba1a0

self.__identifier: Node
self.__identifier = identifier or BNode() # type: ignore[assignment]
if not isinstance(self.__identifier, Node):
self.__identifier = URIRef(self.__identifier) # type: ignore[unreachable]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it should be ContextNode[Any] smile ...

This may be the case for Graph constructor, but all of these values are converted to URIRef if they are not URIRefs:

$ pipx run --spec git+https://github.com/RDFLib/rdflib.git@master#egg=rdflib python
⚠️  python is already on your PATH and installed at /usr/bin/python. Downloading and running anyway.
Python 3.10.2 (main, Jan 17 2022, 00:00:00) [GCC 11.2.1 20211203 (Red Hat 11.2.1-7)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import rdflib
>>> rdflib.Graph(identifier=(1,2,3)).identifier
rdflib.term.URIRef('(1, 2, 3)')

And even though it does work in some sense (I'm actually not sure how valid it will be as a IRI, but maybe it is entirely valid), I still think there is an open question as to whether we want to advertise that anything should be passible into it, and certainly there are some things that do fail and it seems tuple works for reasons that were not quite intended:

>>> rdflib.Graph(identifier=3).identifier
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/iwana/.local/pipx/.cache/1500787fc0bbcf9/lib64/python3.10/site-packages/rdflib/graph.py", line 345, in __init__
    self.__identifier = URIRef(self.__identifier)  # type: ignore[unreachable]
  File "/home/iwana/.local/pipx/.cache/1500787fc0bbcf9/lib64/python3.10/site-packages/rdflib/term.py", line 273, in __new__
    if not _is_valid_uri(value):
  File "/home/iwana/.local/pipx/.cache/1500787fc0bbcf9/lib64/python3.10/site-packages/rdflib/term.py", line 96, in _is_valid_uri
    if c in uri:
TypeError: argument of type 'int' is not iterable

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And note that even though things don't throw an exception if someone does not adhere to type annotations, I think the point of type annotations should be to tell people what the correct type is to pass in, so that they can spot bugs in their code, if for some reason things still mostly work in one place when they pass in wrong values, it may still break down later on, and this is basically also declaring our contract with users of the library, if we say we will accept tuples as input to Graph identifier, just because they work, then we have to bear the consequences of this everywhere and maintain this, while if we say that we only accept strings, if someone then passes something else, and it works, well fine, but if something breaks further down the line at least we can tell them that they should be passing in a string instead of a tuple.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't even need to keep str, you are correct, all the sane values resolve to URIRef. I worked up a more thorough check:

import pytest
from rdflib import Namespace, FOAF
from rdflib.graph import Dataset, Graph
from rdflib.term import (
    BNode,
    Literal,
    RDFLibGenid,
    URIRef,
    Node,
    IdentifiedNode,
    Variable,
)

tarek = URIRef("urn:example:tarek")
likes = URIRef("urn:example:likes")
pizza = URIRef("urn:example:pizza")

def test_graph_identifier_as_string():
    g = Graph(identifier="xxx")
    assert type(g.identifier) is URIRef  # rdflib.term.URIRef('xxx')
    assert issubclass(type(g.identifier), Node)
    assert issubclass(type(g.identifier), IdentifiedNode)

def test_graph_identifier_as_literal():
    g = Graph(identifier=Literal("xxx"))
    assert type(g.identifier) is Literal  # rdflib.term.Literal('xxx')
    assert issubclass(type(g.identifier), Node)
    assert not issubclass(type(g.identifier), IdentifiedNode)

def test_graph_identifier_as_bnode():
    g = Graph(identifier=BNode())
    assert type(g.identifier) is BNode  # rdflib.term.BNode('Ndc8ac01941254b299a66c31a5b49cdd3')
    assert issubclass(type(g.identifier), Node)
    assert issubclass(type(g.identifier), IdentifiedNode)

def test_graph_identifier_as_namespace():
    g = Graph(identifier=(tarek, likes, pizza))
    assert type(g.identifier) is URIRef  # rdflib.term.URIRef("(rdflib.term.URIRef('urn:example:tarek'), 
                                         # rdflib.term.URIRef('urn:example:likes'),
                                         # rdflib.term.URIRef('urn:example:pizza'))")
    assert issubclass(type(g.identifier), Node)
    assert issubclass(type(g.identifier), IdentifiedNode)

def test_graph_identifier_as_graph():
    g = Graph()
    g = Graph(identifier=g)
    assert type(g.identifier) is BNode  # rdflib.term.BNode('N666b78c69a3d4544a079ab0919a84dda')
    assert issubclass(type(g.identifier), Node)
    assert issubclass(type(g.identifier), IdentifiedNode)

def test_graph_identifier_as_genid():
    g = Graph()
    g = Graph(identifier=RDFLibGenid("xxx"))
    assert type(g.identifier) is RDFLibGenid  # RDFLibGenid('xxx')
    assert issubclass(type(g.identifier), Node)
    assert issubclass(type(g.identifier), IdentifiedNode)

def test_graph_identifier_as_identifiednode():
    g = Graph()
    g = Graph(identifier=IdentifiedNode("xxx"))
    assert type(g.identifier) is IdentifiedNode   # 'xxx'
    assert issubclass(type(g.identifier), Node)
    assert issubclass(type(g.identifier), IdentifiedNode)

def test_graph_identifier_as_variable():
    g = Graph()
    g = Graph(identifier=Variable("x"))
    assert type(g.identifier) is Variable  # rdflib.term.Variable('x')
    assert issubclass(type(g.identifier), Node)
    assert not issubclass(type(g.identifier), IdentifiedNode)

@nicholascar
Copy link
Member

Sorry all, I've effectively ruined this PR be reintroducing ContextNode since the context identifier really can be more than an IdentifiedNode.

I don't really know where to go from here... @aucampia @ajnelson-nist can you please re-review?

@ajnelson-nist
Copy link
Contributor Author

My intent for the PR wasn't to eliminate ContextNode, but to remove a redundant implementation of IdentifiedNode. This PR could reasonably be renamed to "Remove or redefine singly-used implementation that is now redundant with IdentifiedNode". Please feel free to rename if you agree.

rdflib/graph.py Outdated
# Type aliases to make unpacking what's going on a little more human friendly
ContextNode = Union[BNode, URIRef]
# Type aliase to make unpacking what's going on a little more human friendly
ContextNode = Union[IdentifiedNode, Literal, str, None]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only place that ContextNode is used is in DatasetQuad, and there it is optional, so the None here is not needed, but it also does not hurt really.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we should re-evaluate the idea here, I would say at the very least we should consider making these private, i.e. _ContextNode, _DatasetQuad - if we use the same type aliases many times in many places we can look at making them public, but even then I would say they should rather go into rdflib.typing so we can reserve the names in here if we later on want to make them abstract base classes or real classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the problem here is, this is used in a single place, and in that place ContextNode will never be a string, but always be a Node I think:

rdflib/rdflib/graph.py

Lines 2036 to 2041 in 4eba1a0

def quads(self, quad):
for s, p, o, c in super(Dataset, self).quads(quad):
if c.identifier == self.default_context:
yield s, p, o, None
else:
yield s, p, o, c.identifier

Because if it is not a node it will be converted to URIRef here:

rdflib/rdflib/graph.py

Lines 342 to 345 in 4eba1a0

self.__identifier: Node
self.__identifier = identifier or BNode() # type: ignore[assignment]
if not isinstance(self.__identifier, Node):
self.__identifier = URIRef(self.__identifier) # type: ignore[unreachable]

And actually it should also never be anything other than the Default Graph (in this case identified by none), an IRI or a blank node:

https://www.w3.org/TR/rdf11-concepts/#section-dataset

An RDF dataset is a collection of RDF graphs, and comprises:

  • Exactly one default graph, being an RDF graph. The default graph does not have a name and may be empty.
  • Zero or more named graphs. Each named graph is a pair consisting of an IRI or a blank node (the graph name), and an RDF graph. Graph names are unique within an RDF dataset.

@aucampia
Copy link
Member

I think the best options here are:

  1. Remove ContextNode and DatasetQuad and change Dataset.__iter__ as follow:

    -    def __iter__(self) -> Generator[DatasetQuad, None, None]:
    +    def __iter__(self) -> Generator[Tuple[Node, URIRef, Node, Optional[Node]], None, None]:

    This captures the current reality of what is advertised in Graph (though URIRef is also another issue, we should probably narrow the types accepted by Graph.add to match this)

  2. Remove ContextNode and DatasetQuad and change Graph.__init__ as follow:

    diff --git a/rdflib/graph.py b/rdflib/graph.py
    index 6d1ed084..55005574 100644
    --- a/rdflib/graph.py
    +++ b/rdflib/graph.py
    @@ -325,7 +325,7 @@ class Graph(Node):
         def __init__(
             self,
             store: Union[Store, str] = "default",
    -        identifier: Optional[Union[Node, str]] = None,
    +        identifier: Optional[Union[IdentifiedNode, str]] = None,
             namespace_manager: Optional[NamespaceManager] = None,
             base: Optional[str] = None,
         ):

    And then change
    Dataset.__iter__ as follow:

    -    def __iter__(self) -> Generator[DatasetQuad, None, None]:
    +    def __iter__(self) -> Generator[Tuple[Node, URIRef, Node, Optional[IdentifiedNode]], None, None]:

I somewhat favor option 2, as I think this is the more correct thing to do. We don't want to advertise to people that they should be passing values in as graph identifiers that are not inline with RDF 1.1 - and if they still want to they can use type ignores.

Also change Graph/Context identifier annotations to match RDF spec.

Graph/Context identifiers should be URIRef or BNode - in RDFLib these
are now the only extant subclasses of IdentifiedNode, so I'm using that.
@aucampia
Copy link
Member

aucampia commented Feb 20, 2022

2. Remove ContextNode and DatasetQuad and change Graph.__init__ as follow:

I have made a PR to change this branch accordingly: ajnelson-nist#13

Again other options may make sense, but I think in all cases we should eliminate ContextNode and DatasetQuad.

@aucampia aucampia self-requested a review February 20, 2022 22:58
@aucampia aucampia closed this Feb 21, 2022
@aucampia aucampia reopened this Feb 21, 2022
@aucampia
Copy link
Member

Appologies for accidental closure,

I think part of suggestion 2 is being done:

…etQuad

Change `Dataset.__iter__` to match RDF spec
@ajnelson-nist
Copy link
Contributor Author

Thanks for the update, @aucampia !

My contributions to the rest of the conversations in this thread would basically await consequences of type signature review. So, I'm happy with whatever makes incremental progress for now.

If the type aliases are retained, I agree with making them underscore-prefixed private names.

Copy link
Member

@aucampia aucampia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redid review of latest iteration of this change, I think this is optimal for now, aliases would be great to simplify the type annotations but ideally they should be accompanied with a couple of uses and remain private unless they are in a different module. I think we will get there soon though.

rdflib/graph.py Outdated Show resolved Hide resolved
@aucampia aucampia added the review wanted This indicates that the PR is ready for review label Feb 23, 2022
@aucampia aucampia merged commit 6bdb576 into RDFLib:master Mar 6, 2022
@ajnelson-nist ajnelson-nist deleted the remove_obviated_alias branch March 7, 2022 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
review wanted This indicates that the PR is ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants