Preserve Pretty Turtle syntax after PATCH #1438

Open
timbl opened this issue Aug 25, 2022 · 17 comments
Labels
☀️ enhancement New feature or request

Comments

@timbl

timbl commented Aug 25, 2022

Environment

X-Powered-By: Community Solid Server

  • Server version: Output of community-solid-server --version for a
    global or npx community-solid-server --version for a local installation

4.0.1

  • Node.js version: Output of node -v
  • npm version: Output of npm -v

Description

When I am using CSS to host, say, a SolidOS tracker, the user can
edit the configuration using the form system. The configuration includes
things like ordered lists of states in Turtle ( ... ) list syntax.

When the user edits the file, with code which sends a PATCH command to change the list from one value to another,
the file ends up encoded using low-level rdf:first and rdf:rest syntax.

  • (The rdflib parser in mashlib does not recognize that as a Container - fixed)
  • The file ends up very hard for a developer to read.
  • The changes are a mess when the file is checked into Hg
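
For readers unfamiliar with the two encodings, here is a minimal sketch of what the compact ( ... ) form expands to. The function name, the `ex:` terms, and the blank node labels are made up for illustration; this is not code from CSS, N3.js, or rdflib.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def expand_list(subject, predicate, items):
    """Return the triples that `subject predicate ( items... ) .` stands for."""
    if not items:
        # An empty list is just a reference to rdf:nil.
        return [(subject, predicate, RDF + "nil")]
    # One blank node ("cell") per list item; labels are arbitrary.
    cells = [f"_:b{i}" for i in range(len(items))]
    triples = [(subject, predicate, cells[0])]
    for i, item in enumerate(items):
        nxt = cells[i + 1] if i + 1 < len(items) else RDF + "nil"
        triples.append((cells[i], RDF + "first", item))
        triples.append((cells[i], RDF + "rest", nxt))
    return triples


# Hypothetical tracker configuration with two states.
triples = expand_list("ex:tracker", "ex:states", ['"open"', '"closed"'])
for s, p, o in triples:
    print(s, p, o, ".")
```

A list of n items written on one line in compact form becomes 2n + 1 triples in the expanded form, which is where the growth in file size and diff noise comes from.
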
@timbl
Author

timbl commented Aug 25, 2022

@timbl
Author

timbl commented Aug 25, 2022

Related: Use relative URIs #1366

@joachimvh
Member

This will require changes in the serialization dependencies rdf-serialize and N3.js. One issue will be that all data in CSS is handled as a stream, meaning that the serializer can't know what the contents of a list are while it is serializing. Specifically in the case of PATCH we do load all data in memory, so perhaps something different can be done there, but we would need to do something that doesn't break with the rest of the architecture.

Pinging @rubensworks and @RubenVerborgh for input on the serializers.

@joachimvh joachimvh added the ☀️ enhancement New feature or request label Aug 26, 2022
@rubensworks
Contributor

Perhaps a (non-streaming) postprocessing mode could be added to N3.js indeed to convert lists to their compact representation. But perhaps this should be an optional or profile-based mode, as this is probably not something that's desired by default.

Alternatively, it may even be easier to implement support for the expanded list syntax in rdflib?

@RubenVerborgh
Member

RubenVerborgh commented Aug 26, 2022

@timbl In practice, this could be solved easily. CSS allows plugging in different parsers and serializers and even patchers, so we could just plug in rdflib.

However, the bigger picture, as @joachimvh mentions, is the streaming architecture that is key to high-performance RDF. Plugging in rdflib as default (or any parser/serializer algorithm that requires everything to be in memory) constrains the supported file size. And people have tried to handle multi-gigabyte files, successfully.

We could try to relax this requirement from a necessity (= mandated by Solid spec) to a nice-to-have (= developer usability):

  • The rdflib parser in mashlib does not recognize that as a Container

Can we fix this in Mashlib too? Since the Turtle list syntax does not have RDF-level model semantics, and some syntaxes do not have special list support, we would expect Mashlib to be syntax-invariant.

In any case, CSS implementing this would not be enough for Mashlib to rely on it; it would need to be a spec requirement. We can consider it for developer usability.

@bourgeoa

> @timbl In practice, this could be solved easily. CSS allows plugging in different parsers and serializers and even patchers, so we could just plug in rdflib.
>
> However, the bigger picture, as @joachimvh mentions, is the streaming architecture that is key to high-performance RDF. Plugging in rdflib as default (or any parser/serializer algorithm that requires everything to be in memory) constrains the supported file size. And people have tried to handle multi-gigabyte files, successfully.

If we only consider PATCH, I understood that it is memory constrained. Is that correct?

@bourgeoa

> Can we fix this in Mashlib too? Since the Turtle list syntax does not have RDF-level model semantics, and some syntaxes do not have special list support, we would expect Mashlib to be syntax-invariant.

Is this related to linkeddata/rdflib.js#567?

@RubenVerborgh
Member

> If we only consider PATCH, I understood that it is memory constrained. Is that correct?

At the moment; but streaming patch is possible in some cases.

> Is this related to linkeddata/rdflib.js#567?

No, there is no relation.

@timbl
Author

timbl commented Aug 30, 2022

@joachimvh says "One issue will be that all data in CSS is handled as a stream, meaning that the serializer can't know what the contents of a list are while it is serializing." I don't see that the rdf:first rdf:rest syntax is more stream friendly at all. As there is no guarantee that the first and rest parts of a list come in any order in an RDF document, if you are parsing them into Array-like objects you would have to keep a cache of the bits you have come across already, and match them up when they are both there. Either that or use random access queries to the store, which is not using a stream. Whereas parsing the Turtle ( .... ) syntax seems totally streamable.
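
A rough sketch of the caching described above, assuming well-formed lists (no cycles, every cell has exactly one rdf:first and one rdf:rest). The function name and the sample terms are hypothetical; terms are plain strings rather than objects from any RDF library.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FIRST, REST, NIL = RDF + "first", RDF + "rest", RDF + "nil"


def collect_lists(triple_stream):
    """Consume triples in any order and return {head_cell: [items...]}."""
    first = {}  # cell -> item   (the cache of bits seen so far)
    rest = {}   # cell -> next cell
    for s, p, o in triple_stream:
        if p == FIRST:
            first[s] = o
        elif p == REST:
            rest[s] = o
    # Only after the stream has ended is it known which cells are heads:
    # a head is a cell that no other cell points to via rdf:rest.
    tails = set(rest.values())
    lists = {}
    for head in (cell for cell in rest if cell not in tails):
        items, cell = [], head
        while cell != NIL:
            items.append(first[cell])
            cell = rest[cell]
        lists[head] = items
    return lists


# The cells of one two-item list, deliberately out of order.
shuffled = [
    ("_:b1", REST, NIL),
    ("_:b0", FIRST, '"open"'),
    ("_:b1", FIRST, '"closed"'),
    ("_:b0", REST, "_:b1"),
]
print(collect_lists(shuffled))
```

Both dictionaries grow with the number of list cells in the document, so this way of reading expanded lists is not constant-memory either.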

@timbl
Author

timbl commented Aug 30, 2022

But on a high level, regarding @RubenVerborgh's comment "However, the bigger picture, as @joachimvh mentions, is the streaming architecture that is key to high-performance RDF.":
Who's to say that the system should optimize performance over developer usability? You could argue that developer understanding of small files is more of a problem for Solid than performance with large ones.

My to-do list broke when I tried to tweak the config with CSS. The original pretty file from NSS (1.8k) became the ugly file stored by CSS (4.8k). It changed from something quite readable to something quite unreadable. I don't like that CSS is incrementally making my pod uglier and more verbose.

But much more importantly I don't like the idea that developers who start to use their own to-do lists as examples won't be able to see what is going on clearly.

@RubenVerborgh
Member

> My to-do list broke

But we should be clear: the root cause is that Mashlib makes assumptions that are not afforded by the Solid Protocol.

CSS is fulfilling a role of devil's advocate here: we implement the spec to the letter, and tend to not make accommodations to individual apps, because then we'd give everyone a false feeling of safety.

> I don't like the idea that developers who start to use their own to-do lists as examples won't be able to see what is going on clearly.

I don't like it either; but we really need to separate incorrect app breakage from developer usability.
Happy to investigate a syntax-preserving patch algorithm for small files, but apps MUST NOT depend on it as long as syntax is not part of the protocol.

> Whereas parsing the Turtle ( .... ) syntax seems totally streamable.

But serializing isn't. With streaming, we mean arbitrarily sized streams.
Given an unordered Stream<Quad> input, serializing the (…) syntax requires arbitrarily large buffers, breaking arbitrarily sized streams. And thus creating a possible DoS vector.
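
To make the buffering argument concrete, here is a simplified sketch. It assumes a serializer that writes ordinary triples immediately but has to hold every rdf:first / rdf:rest triple until it can be sure the list is complete. The function names are hypothetical, and the count is an underestimate, because the triple that points at the list head would have to be held as well.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FIRST, REST, NIL = RDF + "first", RDF + "rest", RDF + "nil"


def split_stream(quads):
    """Count triples written immediately vs. held until the stream ends."""
    written, held = 0, []
    for s, p, o in quads:
        if p in (FIRST, REST):
            held.append((s, p, o))  # list cell: its place in the list is unknown so far
        else:
            written += 1            # ordinary triple: write and forget
    return written, len(held)


def list_cells(n):
    """Yield the cells of an n-item list tail first, so the head arrives last."""
    for i in reversed(range(n)):
        nxt = f"_:b{i + 1}" if i + 1 < n else NIL
        yield (f"_:b{i}", REST, nxt)
        yield (f"_:b{i}", FIRST, f'"item {i}"')


for n in (10, 1000):
    written, held = split_stream(list_cells(n))
    print(n, "items ->", held, "triples held")
```

The held buffer grows linearly with the list length, whereas writing each triple as it arrives needs no buffer at all.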

> Who's to say that the system should optimize performance over developer usability?

The comparison is incomplete: NSS will crash badly on large files or lots of small files. A single 8GB file will definitely do it, but try things like <x_n> rdf:first "large_string". for increasing values of n and it will happen quite soon.

CSS, in contrast, does not crash on any of these because it serializes with constant memory.
It also happens to do so faster, and faster requests are also a part of developer usability (consider the speed difference between Mashlib on CSS and NSS).

> The original pretty file from NSS (1.8k) became the ugly file stored by CSS (4.8k)

For reference, @TallTed has made a similar argument here: solid/specification#342
He was asking to preserve the exact syntax of the incoming document (which some CSS back-ends do).

The feature request in the current issue is even stronger: to also support PATCH with minimal syntax changes.
Could do for files below a certain size, but apps that rely on it will still break on other spec-compliant servers.
(And if you run CSS 5.x, you'll see that at least prefixes get preserved via #1380.)

@RubenVerborgh changed the title from "Preserve Turtle Container syntx when PATCH file" to "Preserve Turtle syntax after PATCH" on Aug 31, 2022
@timbl
Author

timbl commented Aug 31, 2022

Agreed: mashlib (via rdflib) should recognize incoming rdf:first and rdf:rest.

@timbl
Author

timbl commented Nov 3, 2023

They do now. That was a distraction. The issue is the ugliness for the developer, and sometimes the user, and the inconsistency for the source code management (SCM) system.

Prompted mainly by this I wrote a new article
https://www.w3.org/DesignIssues/Pretty.html
specifically the bit on Source Code Control. My entire pod is checked into Hg, where diffs become huge and random.

I am concerned that if and when we switch solidcommunity.net to CSS, the illegibility of the files after PATCH will be a massive hurdle for developers. We lose the view source effect. We change RDF from a simple understandable language to an incomprehensible mess. From that point of view this issue should be classed as a bug not an enhancement.

If the data is all in memory then the speed argument doesn't hold: the sort can be fast. Let's add a serializer with the same algorithm as rdflib to the CSS stack. The output will still be a stream; only the input will be random access to the store, or a place to sort in memory.
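
A minimal sketch of what such an in-memory step could look like. This is not rdflib's actual algorithm and not an existing CSS component; it only assumes that all triples are available at once, that lists are well-formed, and that terms are already written as prefixed names. The function name and the `ex:` terms are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FIRST, REST, NIL = RDF + "first", RDF + "rest", RDF + "nil"


def pretty_lines(triples):
    """Serialize in-memory triples, writing RDF lists in object position as ( ... )."""
    first = {s: o for s, p, o in triples if p == FIRST}
    rest = {s: o for s, p, o in triples if p == REST}

    def render(term):
        if term == NIL:
            return "()"
        if term in first and term in rest:
            # The term is a list cell: walk the chain, rendering nested lists too.
            items, cell = [], term
            while cell != NIL:
                items.append(render(first[cell]))
                cell = rest[cell]
            return "( " + " ".join(items) + " )"
        return term

    lines = [
        f"{s} {p} {render(o)} ."
        for s, p, o in triples
        if p not in (FIRST, REST)  # cell triples are absorbed into the list
    ]
    # Sorting makes the output independent of the order the triples were stored in.
    return sorted(lines)


triples = [
    ("_:b1", REST, NIL),
    ("ex:tracker", "ex:states", "_:b0"),
    ("_:b0", FIRST, '"open"'),
    ("_:b1", FIRST, '"closed"'),
    ("_:b0", REST, "_:b1"),
    ("ex:tracker", "ex:name", '"Tasks"'),
]
for line in pretty_lines(triples):
    print(line)
```

Because the result is sorted, two runs over the same data give the same text regardless of the order in the store, which is the property that matters for version control.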

@timbl changed the title from "Preserve Turtle syntax after PATCH" to "Preserve Pretty Turtle syntax after PATCH" on Nov 5, 2023
@RubenVerborgh
Member

> From that point of view this issue should be classed as a bug not an enhancement.

@timbl I'm afraid this would lead us onto a rather slippery slope.

I'm in favor of pretty printing; it has a place when it comes to developer usability. The question is where that place is, and your suggestion is to make that place the Solid HTTP interface and the underlying storage, and mandate this as if it were a spec. While I understand this point of view, the caveats I discuss below make me conclude that an app is a better place.

1. No definition of syntax-preserving

First, there is a lack of proper definition. What exactly does preserving syntax mean? Exact tabs, spaces, comments, escape sequences for literals? rdflib.js isn't syntax-preserving: it similarly re-serializes its own output, but it wouldn't preserve the syntax of a file I wrote by hand or via N3.js. There's no standard or specification for Turtle syntax preservation, which means that we can't commit without disappointing a lot of people. Committing to this would open the door to loads of bug reports where someone's specific syntax feature isn't preserved, and they would all be right.

2. On-disk serialization is beyond the Solid Protocol

Second, your specific use case concerns the on-disk syntax and its presumed connection to HTTP. This is a very unique case, for which I have not encountered other users yet. The specific request here is to preserve the on-disk serialization such that it works well with version control systems. This implicitly assumes that the on-disk and over-HTTP representations are the same, which does not need to be the case, given that the Solid Protocol only governs the HTTP representation (and does not put any syntactical restrictions on it). Furthermore, there might not even be an on-disk file version, if the back-end is an RDF database. If the goal is minimal changes across on-disk versions, writing canonical RDF will yield even better results. Independently, we could then apply a Turtle pretty printer over the HTTP interface to reach the goal of being "view source" friendly, which also works with database backends.
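
As a rough illustration of the point about minimal on-disk changes, the sketch below writes one triple per line in sorted order and diffs two versions. It is only an approximation of canonical RDF: real canonicalization also has to label blank nodes deterministically, which this example avoids by not using any. The function name and the `ex:` terms are made up.

```python
import difflib


def canonical_lines(triples):
    """One line per triple, duplicates removed, in sorted order."""
    return sorted({f"{s} {p} {o} ." for s, p, o in triples})


before = [
    ("ex:task1", "ex:state", '"open"'),
    ("ex:task1", "ex:title", '"Write docs"'),
    ("ex:task2", "ex:state", '"open"'),
]
# The same data after a PATCH that changes one object, stored in a different order.
after = [
    ("ex:task2", "ex:state", '"closed"'),
    ("ex:task1", "ex:title", '"Write docs"'),
    ("ex:task1", "ex:state", '"open"'),
]

# Keep only added and removed lines, dropping the diff file headers and hunk markers.
diff = [
    line
    for line in difflib.unified_diff(
        canonical_lines(before), canonical_lines(after), lineterm="", n=0
    )
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print("\n".join(diff))
```

Only the changed triple shows up in the diff, even though the triples arrived in a different order.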

3. No consistent experience

Third, CSS implementing this on disk would not help with a consistent developer experience for the entire ecosystem. CSS deliberately tries to do the minimum so as not to create false expectations. If we were somehow to promise syntax-preserving PATCH (despite the lack of definition), it wouldn't be supported across different back-ends nor different servers. So developers would still get a bad experience. Plus, a syntax that is friendly to me may not be friendly to others.


Conclusion: should it be an app?

Summarizing, haphazardly committing to syntax constraints in CSS would introduce several additional assumptions into its corner of the Solid ecosystem (defining syntax preservation, linking on-disk storage and HTTP syntax…) that do not generalize to other more typical use cases. Unless we spec them, which would lead to other issues.

It seems much more in the Solid spirit of app/data separation to implement pretty printing of data as an app. That way, developers can use the app to see pretty Turtle of their preference with any Solid server or backend, with neither the app nor the server going beyond the mandate of the Solid Protocol.

@TallTed
Contributor

TallTed commented Nov 8, 2023

Consider...

View Source of HTML documents preserves line breaks, whitespace, etc., of the original document on the server. View Rendering of HTML documents may drop line breaks, fold whitespace, etc.

This is the kind of difference I've been talking about with regard to Turtle (and other RDF serialization) documents. If I view SOURCE, I expect to see the indentations, line breaks, comments, etc., preserved as uploaded. If I view RDF, I expect to see varying levels and kinds of pretty-printing of that data, without concern for the original document's indentations, line breaks, comments.

Now, it might be acceptable for a Solid server to take an uploaded document, parse it for RDF content, and put that RDF into a triple/quad store. Wherever possible, as a user, I would want that Solid server to give me the option to preserve the original document, whether or not it were (also) parsed for RDF content. Optimally, it would also be possible to choose whether to download/view the original document and/or whatever RDF prettification (or uglification) the server was built to deliver.

Yes, some limits are necessary to assure interop. Generally, the user should be given the option of what to get and/or store — complete with alerts like "getting/saving this serialization may lose inline formatting and/or comments that were in original data documents". Informed consent should be a guiding principle, even more than enforcing interop by limiting user choices.

@RubenVerborgh
Member

@TallTed Your comment seems to pertain to a different issue (solid/specification#342). This issue is about PATCH, the goal of which is—by definition—to not preserve everything as uploaded.

@TallTed
Contributor

TallTed commented Nov 9, 2023

This issue is titled Preserve Pretty Turtle syntax after PATCH, to which I think my previous comment pertains pretty directly.

To my mind, one goal of PATCH is exactly to preserve everything as (previously) uploaded except that which is being intentionally changed. Building a good PATCH "query" requires starting with a copy of the existing data and/or document, such that you can explicitly and exactly say, "change THIS to THAT". Changing, for instance, the object of one triple in a Turtle (or other serialization) document should not (need to) touch anything else in that document.

I don't demand (though I would certainly prefer) that such a single-object change itself be adjusted to maintain the pretty-print surrounding it; I do feel quite strongly that the line(s) preceding and following the line(s) containing that object not be un-prettified.
