Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sorting Turtle output? #1890

Open
ajnelson-nist opened this issue May 6, 2022 · 9 comments · May be fixed by #1978
Open

Sorting Turtle output? #1890

ajnelson-nist opened this issue May 6, 2022 · 9 comments · May be fixed by #1978
Labels
enhancement New feature or request format: Turtle Related to the Turtle format.

Comments

@ajnelson-nist
Copy link
Contributor

Hi all, probably esp. @nicholascar,

PR 1425 added the ttl2 serializer, which since became longturtle. One of the goals was to reduce Git difference noise.

I unfortunately find that most of the noise in Git diffs comes from Turtle output not being sorted. This is an effect both on nodes with IRIs and with blank nodes that (if I'm guessing right) may be sorting based on their skolemized IDs.

Could an option to either of the Turtle serializers be added to sort the output, for those willing to pay the .serialize time cost? I personally like Git-tracking demonstration results, so in many of my uses I'm willing to pay the time cost of sorting graphs.

@aucampia
Copy link
Member

aucampia commented May 8, 2022

Would be happy with such an option, and it would be useful with other formats also. My only concern is that our current approach for adding options (i.e. using **kwargs) is a bit adhoc - it would be nice to consider an option that provides validation and type safety somehow.

@ghost
Copy link

ghost commented May 8, 2022

it would be nice to consider an option that provides validation

Assuming you mean validation of URIs, informed opinion on stackoverflow is that URI validation is best left to a dedicated library so that would suggest the necessity of acquiring a dependency but I guess that can be handled via the pip extras mechanism.

@aucampia
Copy link
Member

aucampia commented May 8, 2022

it would be nice to consider an option that provides validation

No I mean that provides a way to validate the options independently, maybe something like:

plugins.get("json-ld", Serializer).make_opts(...) -> JSONLDOptions

and then make it possible to pass serializer options into serialize

not entirely sure it is that critical though.

@ghost
Copy link

ghost commented May 8, 2022

No I mean that provides a way to validate the options independently.. and then make it possible to pass serializer options into serialize

Ah, so. Yes, that would address the needs of those wishing to exert more control over the JSONLD serializer.

@ajnelson-nist
Copy link
Contributor Author

Interesting. I had thought that typing for keyword arguments would be compatible with a **kwargs pass-through, but it seems at least with mypy that's not the case. Here are some notes in case others thought like I did.

The following program runs fine through Python, print 3. It passes mypy --strict review. But it includes what look like to me to be three type errors (f() for 4--6). Python 3.8.9, mypy 0.950.

#!/usr/bin/env python3

# This software was developed at the National Institute of Standards
# and Technology by employees of the Federal Government in the course
# of their official duties. Pursuant to title 17 Section 105 of the
# United States Code this software is not subject to copyright
# protection and is in the public domain. NIST assumes no
# responsibility whatsoever for its use by other parties, and makes
# no guarantees, expressed or implied, about its quality,
# reliability, or any other characteristic.
#
# We would appreciate acknowledgement if the software is used.

from typing import Any


def g(x: int, *args: Any, y: str = "", **kwargs: Any) -> None:
    if y == "a":
        print(x)


def f(x: int, *args: Any, **kwargs: Any) -> None:
    g(x, *args, **kwargs)


f(1)
f(2, True)
f(3, True, y="a")
f(4, True, y=True)
f(5, True, y=77)
f(6, True, y=b"c")

If another keyword parameter is added to f, mypy does test as I expected. So, turning f into this:

def f(x: int, *args: Any, z: str = "", **kwargs: Any) -> None:
    g(x, *args, **kwargs)

and adding this last call for f triggers a mypy failure as expected:

f(7, True, y=b"c", z=4444)

Summarizing for my original request: if the "front"-most serialize call (Graph.serialize?) could get a sort keyword argument, it looks more like the value can be type-tested. But if a sort keyword argument is only added into specific serializers, its input won't be type-checked.

Or, I'm totally missing something else with how the serializer plugin framework handles things. Please do note if that's so, I haven't reviewed that code in depth.

@aucampia aucampia added the format: Turtle Related to the Turtle format. label May 17, 2022
ajnelson-nist added a commit to ajnelson-nist/rdflib that referenced this issue Jun 1, 2022
This patch adds a test to start specifying what sorting Turtle output
would look like.  This is intended to start discussion about
expectations of blank node sorting, and to set an initial interface for
triggering sorted output with a propagated keyword argument in
`Graph.serialize()`.

This patch will fail CI, but should not fail for code-style reasons.
The new test script was reviewed with black, flake8, isort, and
mypy (--strict).

References:
* RDFLib#1890

Signed-off-by: Alex Nelson <alexander.nelson@nist.gov>
@ajnelson-nist ajnelson-nist linked a pull request Jun 1, 2022 that will close this issue
9 tasks
ajnelson-nist added a commit to ajnelson-nist/rdflib that referenced this issue Jun 1, 2022
@ajnelson-nist
Copy link
Contributor Author

Some work I'm doing would benefit from seeing this implemented. So, to resume this conversation, I've filed PR 1978.

I have a general question on the API for Graph.serialize and its intermediary functions down through the plugins folder: Is there a reason most, but not all, of the function signatures don't currently use the *args positional-argument parameter that delimits where keyword arguments start?

I've personally found it difficult to understand when arguments in functions are specified as positional with defaults, versus being keyword arguments, when there isn't a *args or just * used to draw a separating line in the list. I think it would help type system review if explicit *args were added to the .serialize() functions, to make clear that sort: bool = False is a with-default-value keyword argument.

In case the motivation of this question is unclear, Patch 2 in PR 1978 tries adding sort=False to Graph.serialize's signatures, downstream through the Turtle serializer.

The PR also leaves open some question of what would be expected of some sort-order matters. Blank node sorting seems to be a tricky topic that may involve recursion to solve. Prefixes that sort differently from expanded IRIs isn't in the unit test, but will also be worth clarifying with the hard-coded graph serialization. Feedback is welcome.

@aucampia aucampia added the enhancement New feature or request label Jul 29, 2022
@tgbugs
Copy link
Contributor

tgbugs commented Oct 18, 2022

@ajnelson-nist
Copy link
Contributor Author

@tgbugs Thank you for the reference! I see it is in a many-utilities repository. Is your vision that that code be something you continue to maintain in your code base, or instead to be something that integrates into, or overrides, the associated PR 1978? I'm fine with not reliving the whole "Hack, hack, oh we need a spec." experience, and I knew at least blank node handling was going to need some degree of style spec.

@tgbugs
Copy link
Contributor

tgbugs commented Dec 13, 2022

@ajnelson-nist The spec is stable, and there are a bunch of tests that I have written to ensure sane behavior. I intend to upstream as much of the code in ttlser as possible. The implementation that I have is almost completely configurable using CustomTurtleSerializer.sortkey and CustomTurtleSerializer.make_litsortkey. The only thing that is not configurable is that it currently always uses self.store.qname which means that it is not stable to changes to prefixes that are used. Being able to specify the function that runs before sortkey would make it completely flexible.

I can also say that there are many hidden gotchas that any new implementation of such functionality is likely to re-encounter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request format: Turtle Related to the Turtle format.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants