
# beginning of draft of API protocol proposal #1

**Merged** — 16 commits merged into main on Dec 6, 2023

## Conversation

**@ships** commented Jun 14, 2023

**Status:** Work in progress.

Link to rendered MD file

**@ships** commented Jul 17, 2023

Note this discovered use of the index endpoint:

[Screenshot: 2023-07-17 at 15:45:32]

Sciety has noted that there is no pagination.

- add prose regarding handling "last known state"
- tabular formatting
- parameters as fields in the request body

- `from`: a sync head. If present, it must be a number; it can be a timestamp or, preferably, a serial. If omitted, start from the earliest known state. Conventionally, element(s) that match `from` are included. Timestamp implementations of `from` may have unspecified behavior.
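
For illustration, here is a minimal sketch of the inclusive `from` convention with a serial sync head; the dataset shape and field names are assumptions for this example, not part of the spec:

```python
# Minimal sketch of inclusive `from` semantics with serial sync heads.
# Dataset and field names are illustrative assumptions.
DATASET = [{"serial": n, "doc": f"docmap-{n}"} for n in range(1, 6)]

def entries_from(from_serial=None):
    if from_serial is None:
        # `from` omitted: start from the earliest known entry
        return DATASET
    # conventionally, the element that matches `from` is included
    return [e for e in DATASET if e["serial"] >= from_serial]

assert [e["serial"] for e in entries_from(3)] == [3, 4, 5]
```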
A reviewer asked:

Do you have examples of what you would recommend be used for `from`?

Why would you not recommend timestamps?

**@ships** replied:

The main reason not to use timestamps as sync heads is that doing so assumes the dataset is append-only. How do you represent that a new entry was added whose key date is in the distant past? This is doable with an event-centric datastore where you return events that are dataset deltas. Such use cases are still allowed by this spec, but we don't constrain to them.

Do you think it is preferable to constrain it like that? As I think about it today, this could be valuable so that all users can handle deletions in the same way.

Secondly, timestamps can be ambiguous in rare circumstances. Depending on the granularity of the timestamp, there is always some probability that more than one event occurred at that exact time. If so, how does the system handle, for example, pagination, if more elements happened at the given timestamp than fit in a page? The classic example is a query with page size (LIMIT) 10 and FROM <date>, in a system where for whatever reason only dates and NOT times are available, so everything published on a given day has the same timestamp of <date>-00:00:00; and on some date, there are 11 things published. In this case a timestamp-based approach cannot provide a consistent dataset with pages of 10.
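
To make the failure mode concrete, here is a small self-contained sketch (field names invented for illustration) showing timestamp-based pagination getting stuck when 11 records share one timestamp and the page size is 10:

```python
# 11 records all published at the same (date-only) timestamp.
records = [{"id": i, "published": "2023-07-17T00:00:00"} for i in range(11)]

def fetch_page(from_ts, limit=10):
    # Server side: up to `limit` records at or after `from_ts` (inclusive).
    matching = [r for r in records if r["published"] >= from_ts]
    return matching[:limit]

page1 = fetch_page("2023-07-17T00:00:00")
# The client resumes from the last timestamp it saw, but every record
# carries the same timestamp, so the next page repeats the first one
# and the 11th record is never reached.
page2 = fetch_page(page1[-1]["published"])
assert page1 == page2
```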

The reviewer replied:

> The main reason not to use timestamps as sync heads is that doing so assumes the dataset is append-only. How do you represent that a new entry was added whose key date is in the distant past? This is doable with an event-centric datastore where you return events that are dataset deltas. Such use cases are still allowed by this spec, but we don't constrain to them.

I believe it doesn't necessarily have to be event-based. Wouldn't a last-updated timestamp be sufficient? A DocMap that was previously included in a past time window could disappear from it in favour of a more recent time window.

> Do you think it is preferable to constrain it like that?

I don't have a strong preference, but it seems to be a common usage; most data sources I work with have some form of date or time filtering.

But I realize now I was thinking of it from the perspective of getting back to the API after a while, rather than of pagination.

For pagination itself, an offset is common but has some issues. A cursor seems more reliable.

So, for example, when I ingest data from, say, EuropePMC, I issue a search with a from date depending on my last run, but then iterate through the pages via a cursor.

Someone could implement a cursor as an offset (I believe bioRxiv actually does that).

Therefore it could be good to make that distinction explicit, via a from-last-updated-timestamp parameter and a cursor parameter (an empty value, no cursor, or * meaning from the start), as in the sketch below.
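
For example, the two parameters might combine like this in a request body (a sketch following the naming in this comment; nothing here is settled spec):

```python
# First request: filter by last-updated time, no cursor yet ("*" = start).
first_request = {
    "from-last-updated-timestamp": "2023-07-17T00:00:00Z",
    "cursor": "*",
}

# Subsequent request: same filter, plus the opaque cursor the server
# returned with the previous page (the cursor value is illustrative).
next_request = {
    "from-last-updated-timestamp": "2023-07-17T00:00:00Z",
    "cursor": "cur-abc123",
}
```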

> As I think about it today, this could be valuable so that all users can handle deletions in the same way.

How would you express deletions in the API? Few APIs seem to consider deletions. (Semantic Scholar handles some of that in their API for Incremental Diffs.)

> Secondly, timestamps can be ambiguous in rare circumstances. Depending on the granularity of the timestamp, there is always some probability that more than one event occurred at that exact time. If so, how does the system handle, for example, pagination, if more elements happened at the given timestamp than fit in a page? The classic example is a query with page size (LIMIT) 10 and FROM <date>, in a system where for whatever reason only dates and NOT times are available, so everything published on a given day has the same timestamp of <date>-00:00:00; and on some date, there are 11 things published. In this case a timestamp-based approach cannot provide a consistent dataset with pages of 10.

Depending on the API, I would probably call it with a from-last-updated-timestamp equal to the latest timestamp seen in previous ingestion calls. But that wouldn't work well for pagination, for the reason you mentioned; that is why a cursor for iteration would be useful.

When seeing the `next` parameter I was thinking of a from-last-updated-timestamp parameter; I believe you were thinking more of a cursor.

**@ships** replied:

Hi, sorry to take so long getting back to you on this. Thanks for being so engaged; it is very helpful! Your questions and suggestions have revealed some underlying ambiguity I need to resolve and clarify.

First of all, I like the `cursor` keyword you suggest.

Next, let's think about deletions. To handle deletions, it is necessary to express deltas to the dataset rather than the substance of the dataset itself, so responses are slightly more structured, with something like `DELETE { .. }` and/or `UPDATE { .. }` and/or `INSERT { .. }`.
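
A delta-shaped response might look something like the following sketch (the event kinds come from the paragraph above; all other field names are assumptions):

```python
index_response = {
    # opaque cursor to resume from after consuming these events
    "cursor": "evt-1042",
    "events": [
        {"INSERT": {"id": "docmap-1", "updatedAt": "2023-07-01T12:00:00Z"}},
        {"UPDATE": {"id": "docmap-2", "updatedAt": "2023-07-02T09:30:00Z"}},
        {"DELETE": {"id": "docmap-3"}},
    ],
}
```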

From this, it occurs to me that a use case that wants to avoid that pattern, while implementing something like from-last-updated-timestamp, reduces to using the `POST /search` specified above with `updatedAt` as a query term, including pagination, as long as the dataset is robust about providing `updatedAt` fields on any object that may be persistent or desirable to find in this way.
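
Concretely, such a reduction might look like this sketch of a `POST /search` body (only `query_terms[].match` appears in the table below; the `range` shape is an illustrative assumption):

```python
# Approximating from-last-updated-timestamp via search on updatedAt.
search_request = {
    "query_terms": [
        # match on the updatedAt field; the range shape is an assumption
        {"match": "updatedAt", "range": {"from": "2023-07-17T00:00:00Z"}},
    ],
}
# POST this body to /search and paginate via Link headers (see below).
```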

We are, for better or for worse, trying to stay faithful to the Linked Data W3C working group specifications, which means pagination is handled in its own way, using Link headers. This is lucky, because it takes away some responsibility that would otherwise have to be designed into the use of the from/cursor keywords.
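
As a sketch of what that looks like from the client side, assuming the server advertises `rel="next"` Link headers (RFC 8288), the response body is a JSON array, and the URL is hypothetical:

```python
import requests

url = "https://example.org/docmaps/v1/search"
body = {"query_terms": [{"match": "updatedAt"}]}  # illustrative query

while url:
    resp = requests.post(url, json=body)
    resp.raise_for_status()
    for docmap in resp.json():
        pass  # process each result here
    # `requests` parses the Link header into resp.links;
    # stop when the server offers no rel="next" page.
    url = resp.links.get("next", {}).get("url")
```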

So we have two cleanly separated concerns:

  1. Formal indexing, which supports deletions, replies with deltas (changes/events) rather than data, and is based on an event scheme that can always robustly support unambiguous cursors.
  2. Search, which gives the latest known data.

Both must support pagination, and either may support timestamps. However, the indexing endpoint would relate to the date of the change, whereas search is based on the date stated in the data.

Finally, note that a server which does not want to retain its event data can implement an approximation by reusing fields like `updatedAt` and directing clients to send timestamps in index requests; such a server will simply never reply with DELETE.
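
A rough sketch of that approximation, over a store that keeps only current rows with an `updatedAt` field (everything here is illustrative):

```python
STORE = [
    {"id": "docmap-1", "updatedAt": "2023-07-01T12:00:00Z"},
    {"id": "docmap-2", "updatedAt": "2023-08-02T09:30:00Z"},
]

def index(from_timestamp: str, limit: int = 100) -> dict:
    # Report every row updated at or after `from_timestamp` as an UPDATE;
    # with no event log, a DELETE can never be emitted.
    rows = sorted(
        (r for r in STORE if r["updatedAt"] >= from_timestamp),
        key=lambda r: r["updatedAt"],
    )[:limit]
    return {"events": [{"UPDATE": r} for r in rows]}
```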

I'll set about writing this more concretely this week, and will spike on an implementation next to see how it feels to work with.

| PATH | DETAILS |
| - | - |
| `query_terms` | MUST be an array of objects. |
| `query_terms[].match` | MUST be a full IRI or a JSON-LD shortcut present in the Docmaps JSON-LD context. |
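
For instance, a body satisfying these constraints might look like the following (assuming `doi` is a shortcut in the Docmaps JSON-LD context; the `value` field is an illustrative assumption, as only `match` is specified above):

```python
search_body = {
    "query_terms": [
        {"match": "doi", "value": "10.1234/example.5678"},
    ],
}
```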
A reviewer commented:

Shouldn't that allow any value valid for the given JSON path? E.g., we look up values by an internal manuscript ID (e.g. "12345") present in a nested field of the DocMap, not an IRI.

Another reviewer comment:

A DOI seems to be another example.

**@ships** merged commit 7a8c94b into main on Dec 6, 2023.