
# beginning of draft of API protocol proposal #1

**Merged** — 16 commits merged into main on Dec 6, 2023

## Conversation

**@ships** commented Jun 14, 2023

**Status:** Work in progress.

Link to rendered MD file

**@ships** commented Jul 17, 2023

Note this discovered use of the index endpoint:

[Screenshot: 2023-07-17 at 15:45:32]

Sciety has noted that there is no pagination.

- add prose regarding handling "last known state"
- tabular formatting
- parameters as fields in the request body

- `from`: a sync head. If present, it must be a number; it can be a timestamp or, preferably, a serial. If omitted, start from the earliest known state. Conventionally, element(s) that match `from` are included. Timestamp implementations of `from` may have unspecified behavior.
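
For illustration, here is a minimal sketch of the inclusive `from` convention with a serial sync head; the dataset shape and field names are assumptions for this example, not part of the spec:

```python
# Minimal sketch of inclusive `from` semantics with serial sync heads.
# Dataset and field names are illustrative assumptions.
DATASET = [{"serial": n, "doc": f"docmap-{n}"} for n in range(1, 6)]

def entries_from(from_serial=None):
    if from_serial is None:
        # `from` omitted: start from the earliest known entry
        return DATASET
    # conventionally, the element that matches `from` is included
    return [e for e in DATASET if e["serial"] >= from_serial]

assert [e["serial"] for e in entries_from(3)] == [3, 4, 5]
```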
A reviewer asked:

Do you have examples of what you would recommend be used for `from`?

Why would you not recommend timestamps?

**@ships** replied:

The main reason not to use timestamps as sync heads is that doing so assumes the dataset is append-only. How do you represent that a new entry was added whose key date is in the distant past? This is doable with an event-centric datastore where you return events that are dataset deltas. Such use cases are still allowed by this spec, but we don't constrain to them.

Do you think it is preferable to constrain it like that? As I think about it today, this could be valuable so that all users can handle deletions in the same way.

Secondly, timestamps can be ambiguous in rare circumstances. Depending on the granularity of the timestamp, there is always some probability that more than one event occurred at that exact time. If so, how does the system handle, for example, pagination, if more elements happened at the given timestamp than fit in a page? The classic example is a query with page size (LIMIT) 10 and FROM <date>, in a system where for whatever reason only dates and NOT times are available, so everything published on a given day has the same timestamp of <date>-00:00:00; and on some date, there are 11 things published. In this case a timestamp-based approach cannot provide a consistent dataset with pages of 10.
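
To make the failure mode concrete, here is a small self-contained sketch (field names invented for illustration) showing timestamp-based pagination getting stuck when 11 records share one timestamp and the page size is 10:

```python
# 11 records all published at the same (date-only) timestamp.
records = [{"id": i, "published": "2023-07-17T00:00:00"} for i in range(11)]

def fetch_page(from_ts, limit=10):
    # Server side: up to `limit` records at or after `from_ts` (inclusive).
    matching = [r for r in records if r["published"] >= from_ts]
    return matching[:limit]

page1 = fetch_page("2023-07-17T00:00:00")
# The client resumes from the last timestamp it saw, but every record
# carries the same timestamp, so the next page repeats the first one
# and the 11th record is never reached.
page2 = fetch_page(page1[-1]["published"])
assert page1 == page2
```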

The reviewer replied:

> The main reason not to use timestamps as sync heads is that doing so assumes the dataset is append-only. How do you represent that a new entry was added whose key date is in the distant past? This is doable with an event-centric datastore where you return events that are dataset deltas. Such use cases are still allowed by this spec, but we don't constrain to them.

I believe it doesn't necessarily have to be event-based. Wouldn't a last-updated timestamp be sufficient? A DocMap that was previously included in a past time window could disappear from it in favour of a more recent time window.

> Do you think it is preferable to constrain it like that?

I don't have a strong preference, but it seems to be a common usage; most data sources I work with have some form of date or time filtering.

But I realize now I was thinking of it from the perspective of getting back to the API after a while, rather than of pagination.

For pagination itself, an offset is common but has some issues. A cursor seems more reliable.

So, for example, when I ingest data from, say, EuropePMC, I issue a search with a from date depending on my last run, but then iterate through the pages via a cursor.

Someone could implement a cursor as an offset (I believe bioRxiv actually does that).

Therefore it could be good to make that distinction explicit, via a from-last-updated-timestamp parameter and a cursor parameter (an empty value, no cursor, or * meaning from the start), as in the sketch below.
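
For example, the two parameters might combine like this in a request body (a sketch following the naming in this comment; nothing here is settled spec):

```python
# First request: filter by last-updated time, no cursor yet ("*" = start).
first_request = {
    "from-last-updated-timestamp": "2023-07-17T00:00:00Z",
    "cursor": "*",
}

# Subsequent request: same filter, plus the opaque cursor the server
# returned with the previous page (the cursor value is illustrative).
next_request = {
    "from-last-updated-timestamp": "2023-07-17T00:00:00Z",
    "cursor": "cur-abc123",
}
```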

> As I think about it today, this could be valuable so that all users can handle deletions in the same way.

How would you express deletions in the API? Few APIs seem to consider deletions. (Semantic Scholar handles some of that in their API for Incremental Diffs.)

> Secondly, timestamps can be ambiguous in rare circumstances. Depending on the granularity of the timestamp, there is always some probability that more than one event occurred at that exact time. If so, how does the system handle, for example, pagination, if more elements happened at the given timestamp than fit in a page? The classic example is a query with page size (LIMIT) 10 and FROM <date>, in a system where for whatever reason only dates and NOT times are available, so everything published on a given day has the same timestamp of <date>-00:00:00; and on some date, there are 11 things published. In this case a timestamp-based approach cannot provide a consistent dataset with pages of 10.

Depending on the API, I would probably call it with a from-last-updated-timestamp equal to the latest timestamp seen in previous ingestion calls. But that wouldn't work well for pagination, for the reason you mentioned; that is why a cursor for iteration would be useful.

When seeing the `next` parameter I was thinking of a from-last-updated-timestamp parameter; I believe you were thinking more of a cursor.

**@ships** replied:

Hi, sorry to take so long getting back to you on this. Thanks for being so engaged; it is very helpful! Your questions and suggestions have revealed some underlying ambiguity I need to resolve and clarify.

First of all, I like the `cursor` keyword you suggest.

Next, let's think about deletions. To handle deletions, it is necessary to express deltas to the dataset rather than the substance of the dataset itself, so responses are slightly more structured, with something like `DELETE { .. }` and/or `UPDATE { .. }` and/or `INSERT { .. }`.
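
A delta-shaped response might look something like the following sketch (the event kinds come from the paragraph above; all other field names are assumptions):

```python
index_response = {
    # opaque cursor to resume from after consuming these events
    "cursor": "evt-1042",
    "events": [
        {"INSERT": {"id": "docmap-1", "updatedAt": "2023-07-01T12:00:00Z"}},
        {"UPDATE": {"id": "docmap-2", "updatedAt": "2023-07-02T09:30:00Z"}},
        {"DELETE": {"id": "docmap-3"}},
    ],
}
```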

From this, it occurs to me that a use case that wants to avoid that pattern, while implementing something like from-last-updated-timestamp, reduces to using the `POST /search` specified above with `updatedAt` as a query term, including pagination, as long as the dataset is robust about providing `updatedAt` fields on any object that may be persistent or desirable to find in this way.
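
Concretely, such a reduction might look like this sketch of a `POST /search` body (only `query_terms[].match` appears in the table below; the `range` shape is an illustrative assumption):

```python
# Approximating from-last-updated-timestamp via search on updatedAt.
search_request = {
    "query_terms": [
        # match on the updatedAt field; the range shape is an assumption
        {"match": "updatedAt", "range": {"from": "2023-07-17T00:00:00Z"}},
    ],
}
# POST this body to /search and paginate via Link headers (see below).
```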

We are, for better or for worse, trying to stay faithful to the Linked Data W3C working group specifications, which means pagination is handled in its own way, using Link headers. This is lucky, because it takes away some responsibility that would otherwise have to be designed into the use of the from/cursor keywords.
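
As a sketch of what that looks like from the client side, assuming the server advertises `rel="next"` Link headers (RFC 8288), the response body is a JSON array, and the URL is hypothetical:

```python
import requests

url = "https://example.org/docmaps/v1/search"
body = {"query_terms": [{"match": "updatedAt"}]}  # illustrative query

while url:
    resp = requests.post(url, json=body)
    resp.raise_for_status()
    for docmap in resp.json():
        pass  # process each result here
    # `requests` parses the Link header into resp.links;
    # stop when the server offers no rel="next" page.
    url = resp.links.get("next", {}).get("url")
```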

So we have two cleanly separated concerns:

  1. Formal indexing, which supports deletions, replies with deltas (changes/events) rather than data, and is based on an event scheme that can always robustly support unambiguous cursors.
  2. Search, which gives the latest known data.

Both must support pagination, and either may support timestamps. However, the indexing endpoint would relate to the date of the change, whereas search is based on the date stated in the data.

Finally, note that a server which does not want to retain its event data can implement an approximation by reusing fields like `updatedAt` and directing clients to send timestamps in index requests; such a server will simply never reply with DELETE.
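
A rough sketch of that approximation, over a store that keeps only current rows with an `updatedAt` field (everything here is illustrative):

```python
STORE = [
    {"id": "docmap-1", "updatedAt": "2023-07-01T12:00:00Z"},
    {"id": "docmap-2", "updatedAt": "2023-08-02T09:30:00Z"},
]

def index(from_timestamp: str, limit: int = 100) -> dict:
    # Report every row updated at or after `from_timestamp` as an UPDATE;
    # with no event log, a DELETE can never be emitted.
    rows = sorted(
        (r for r in STORE if r["updatedAt"] >= from_timestamp),
        key=lambda r: r["updatedAt"],
    )[:limit]
    return {"events": [{"UPDATE": r} for r in rows]}
```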

I'll set about writing this more concretely this week, and will spike on an implementation next to see how it feels to work with.

| PATH | DETAILS |
| - | - |
| `query_terms` | MUST be an array of objects. |
| `query_terms[].match` | MUST be a full IRI or a JSON-LD shortcut present in the Docmaps JSON-LD context. |
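
For instance, a body satisfying these constraints might look like the following (assuming `doi` is a shortcut in the Docmaps JSON-LD context; the `value` field is an illustrative assumption, as only `match` is specified above):

```python
search_body = {
    "query_terms": [
        {"match": "doi", "value": "10.1234/example.5678"},
    ],
}
```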
A reviewer commented:

Shouldn't that allow any value valid for the given JSON path? E.g., we look up values by an internal manuscript ID (e.g. "12345") present in a nested field of the DocMap, not an IRI.

Another reviewer comment:

A DOI seems to be another example.

**@ships** merged commit 7a8c94b into main on Dec 6, 2023.