Never expose volatile node ids in the API #183

Closed
jar398 opened this Issue Jun 25, 2015 · 30 comments

Projects

None yet

7 participants

@jar398
Member
jar398 commented Jun 25, 2015

@hlapp remarked on this problem when he reviewed the API a few months ago. I know we've talked about this before but I don't think it ever became a treemachine issue.

Neo4j node ids are not stable, and if one is squirreled away (e.g. in a bookmark or email message) and then used after an intervening tree update, it will go to a completely random place without any indication that there may be a problem. Because they are transient, neo4j node ids must be kept completely secret from the user, i.e. should never be exposed in the API.

I suggest the API should use, for references to exposed nodes, use an OTT id when a node has one, and a pair of OTT ids when it doesn't. The node designated by an id pair is the MRCA of the nodes having those two OTT ids (in whatever tree one happens to be looking at). The choice of the two OTT ids for the MRCA is arbitrary among all such pairs that have the same result. It is the case that every node will be covered by one or the other of these two cases, since all tips have OTT ids.

The user could still be surprised when the tree is updated, in the sense of landing on a node that subtends an unexpected tip or fails to subtend an expected tip, e.g. when there is a polytomy refinement, but the node in the later tree will still be very similar to the one in the earlier tree, and will certainly be preferable as a target to a completely randomly chosen node. There is no way to eliminate this 'jitter' completely, since internal nodes do not have any concisely expressed identity that is both useful and persistent, and there is no way in general to capture or predict a user's expectations. If the MRCA notation is documented then we can set expectations.

The syntax could be something like 1234+23456 or any other connecting character, and could be both interpreted and generated by both sides of the API. A node would have many 'ids' of this form.

See OpenTreeOfLife/feedback#63

@jar398 jar398 referenced this issue in OpenTreeOfLife/taxomachine Jun 25, 2015
Closed

Never expose node ids in the API #94

@josephwb
Member

I'll add this to the branch with service updates.

@kcranston
Member

We need to make sure that we communicate this change well to users of the APIs. Will we keep backwards compatibility of the existing behaviour? As concrete examples, the folks writing the rotl library are just about to publish, and the peyotl tutorial that @snacktavish wrote is being hosted by CIPRES. I don't want to break those implementations.

@jimallman
Member

@jar398's solution certainly makes sense to me, but I'll point out that the existing system was intended to fail gracefully. Since we have both tree_id and node_id, we should be able to load a previous tree (if possible) or to say "Sorry, that tree is no longer available."

This doesn't work as expected if we keep using the same tree id over multiple versions, which I believe was the case for a long time.

@jar398
Member
jar398 commented Jun 25, 2015

I don't see how the existing system can fail gracefully, unless you mean
that someone is free to check the tree id - which they are not required to
do now, and can still do under the proposal. Under the proposal, if they
can choose to ignore it (and clearly some do), they will get the
corresponding node in any tree, rather than a 'wrong tree' error; and
presumably that is what they wanted, unlike now, when they get a randomly
selected node.

And I don't know what you mean by "this doesn't work as expected". The
expectation we will set, which will be pretty clear from the syntax, is
that the new designators mean taxa or mrcas of two taxa, no matter what
tree is being browsed, no matter what the tree is called, no matter whether
there's versioning or not, etc.

This is a fairly small change as far as clients are concerned - they will
switch from using opaque, unstable ids to opaque stable ids (non-opaque if
they understand what the '+' means). But I guess it is incompatible in that
the new node designators are not integers. Because (and only because) of
the syntax change it might make sense to make up a new parameter/field name
such as node_designator. We can keep node_id for a transitional period
while clients adjust to using stable designators.

Or we could convert strings x+y to integers somehow, and reuse node_id.
That would be fully compatible with no change required of any client. It
would be a terrible abuse of integers, though. (even ids are 2 * ott id,
odd ids are bit-interleaving of the two ott ids, times 2, plus 1.)

I haven't thought about how this works with the URLs used by the UI; that's
a separate question. I would expect it to be completely transparent, just
tweaking the URL former to understand strings instead of integers. But
neo4j node ids certainly don't belong there either.

@hlapp
hlapp commented Jun 25, 2015

We need to make sure that we communicate this change well to users of the APIs. Will we keep backwards compatibility of the existing behaviour? As concrete examples, the folks writing the rotl library are just about to publish, and the peyotl tutorial that @snacktavish wrote is being hosted by CIPRES. I don't want to break those implementations.

Backwards compatibility is the bane of progress, and the reason why APIs should be versioned, and have an expectation of being sunset at some point in the future.

This is common fare. APIs of packages that a particular published R package depends on change all the time, and do sooner or later break packages, including published ones, if they are not maintained. I'm not sure why Open Tree should be expected to play by stricter rules than everyone else, in particular in science.

@kcranston
Member

I mostly want to make sure that we don't silently push this change, rather than officially versioning the API and communicating to users.

@jar398
Member
jar398 commented Jun 25, 2015
@mtholder
Member

mrca_of perhaps?
initialisms are often a bit cryptic, but MRCA is very common in this field.

@jar398
Member
jar398 commented Jun 25, 2015
@jimallman
Member

maybe node_locator or node_location?

@mtholder
Member

The name does not matter too much to me. FWIW: I don't really see what is wrong with:

mrca_of=2+4

meaning the node that is the MRCA of ott ID 2 and ott ID 4. While

mrca_of=2

meaning the node with ott ID 2 (which is also the MRCA of taxon 2, in some sense).

node_designator is fine and consistent w/ phylocode lingo, but it requires the user to know "designator" means "either 'id of' or 'mrca of' ". So it is not the most transparent name.

@kcranston
Member

+1 to mrca_of

@jar398
Member
jar398 commented Jun 25, 2015
@fmichonneau fmichonneau referenced this issue in ropensci/rotl Jul 4, 2015
Closed

drop all references to node_id(s) #41

@josephwb
Member
josephwb commented Aug 5, 2015

I may be missing something, but is there a reason we cannot have two potential arguments: ott_id or mrca_of (list of ott_ids)?

@jar398
Member
jar398 commented Aug 5, 2015

I think there should be a 'node id' abstraction that clients can use
without having to be aware of their syntax or semantics. That is, clients
that would be happy with opaque node ids, should be able to treat them as
such. The natural representation for such an id would be a string. It would
not make sense for a client that didn't care to be forced to understand
that sometimes they're strings and sometimes they're lists, and sometimes
they're used with one parameter name and sometimes with another. All that
most applications want (including the tree browser, which puts these things
in URLs) is a way to identify nodes.

There's no reason not to have more meaningful parameters for finding nodes
by mrca and so on, but that is a different application.

On Wed, Aug 5, 2015 at 1:46 PM, Joseph W. Brown notifications@github.com
wrote:

I may be missing something, but is there a reason we cannot have two
potential arguments: ott_id or mrca_of (list of ott_ids)?


Reply to this email directly or view it on GitHub
#183 (comment)
.

@josephwb
Member
josephwb commented Aug 5, 2015

I don't really understand; I guess we can discuss during next meeting.

I pushed treemachine services to devapi which get rid of node_id (replacing with mrca_of where appropriate). I will send out tests and documentation later to show how they work (documentation for the services can be looked at using curl calls at the moment).

@snacktavish
Member

Like @josephwb, I'm confused by this suggestion. How would the node id strings transfer across different synthetic trees? I support the "mrca_of" approach as the easiest to understand, and "mrca_of" a single ott_id doesn't bother me either.

@jar398
Member
jar398 commented Aug 5, 2015

The idea is to support data abstraction, i.e. that clients that don't need
to know the representation shouldn't have to. So the abstraction is that of
a 'stable node id' and to those clients it is just an opaque string. But in
fact it would encode one of two situations:

  1. a node that has an OTT id,
  2. a node that does not have an OTT id, but is instead located using a
    pair of OTT ids, and taking their MRCA.
    The string representation could be an mapping of the set ottid + (ottid x
    ottid) to strings. An obvious mapping would be an integer (string) for an
    OTT id, and a string of the form N+N (or N&N or N*N or N%N or ...) for the
    pair-of-ids case. But it could be any other representation from which the
    two kinds of information could be distinguished and recovered, e.g.
    odd/even with bit interleaving, making every 'stable node id' a sequence of
    digits. But there is no reason to go that far.

I really don't like this "mrca_of" business because it leaks representation
information to clients that shouldn't be bothered with it and forces them
to distinguish the two cases when they just don't care. To them, the stable
node id is just an opaque token.

Since this is an architectural change I wish you had put this out for
review before implementing it; treemachine is not the only system component
affected by it.

As for how the strings transfer across synthetic trees - well the id 123
would mean the node with ott id 123 in whatever tree was being referenced,
and 123+456 would mean the mrca of 123 and 456 in whatever tree was being
referenced. This ought to work OK a lot of the time; and as there is no way
to have perfect transfer between trees, since trees are incompatibly
different, "a lot of the time" is the best you can do, and therefore
adequate.

On Wed, Aug 5, 2015 at 4:39 PM, Emily Jane McTavish <
notifications@github.com> wrote:

Like @josephwb https://github.com/josephwb, I'm confused by this
suggestion. How would the node id strings transfer across different
synthetic trees? I support the "mrca_of" approach as the easiest to
understand, and "mrca_of" a single ott_id doesn't bother me either.


Reply to this email directly or view it on GitHub
#183 (comment)
.

@jimallman
Member

This ought to work OK a lot of the time; and as there is no way
to have perfect transfer between trees, since trees are incompatibly
different, "a lot of the time" is the best you can do, and therefore
adequate.

Since we can't guarantee the "same" node (by whatever criterion), I'd like node-id strings that are fairly transparently ottids or combinations, implying MRCA. These look good to me:

  • 123 is a simple ottid (a recognized taxon)
  • 123+456 or (123+456) is the MRCA of two+ taxa
  • (123) or perhaps 123+ is the "MRCA" of a single taxon (@snacktavish 's case above)

The parentheses seem like they could be cumbersome, so I lean toward the others.

@snacktavish
Member

That makes sense. Is the idea that nodes would be assigned a fixed id, e.g. the node that is the mrca of 123 and 456 may also be the mrca of 111 and 444. Would it just be arbitrarily assigned an id string using two of its descendants, or would both '123+456' and '111+444' map to that node? (This may be getting into phylocode territory)
Edit: Ah, just re-read first comment ending: "A node would have many 'ids' of this form." So, yes, both would identify node.

@josephwb
Member
josephwb commented Aug 5, 2015

I still don't understand. 😰

Forgive me for being naive, but:

  1. If a user wants a named node, they need to know the ottid
  2. If a user wants an unnamed node, they need to know at least 2 ottids of descendant taxa.

So, the user knows what situation they are in (they have to be). If that is the case (it might not; like I said, I do not understand the issue yet), then why would:

foo(node_identifier=[NNN or XXX+YYY]); // where the arg is parsed to determine meaning

be better than:

bar([ottid=NNN or mrca_of=XXX]); // where either arg can be passed (not both)

? Either way, the user has to provide the right number of identifiers. Right? As @mtholder mentioned, "mrca_of" is ubiquitous in our community, so I think it is appropriate for identifying an unnamed node. I feel like I am missing something important here...

In terms of implementation, the stuff on devapi was just something to play with and see what we like. It took nearly no time at all, so no problem if it gets pitched. The point was to get moving on this languishing issue.

Two more things that came up:

  1. If we get rid of node_ids, then it won't be possible to query unnamed nodes that do not appear in the synthetic tree. Imagine we have 50 trees supporting one rooting of mammals, and 50 others that support another. One resolution has the highest ranking tree, so it will be present in the synthetic tree. With the ability to query node_ids, it is possible to look at the conflicting resolution and see which trees support it. Maybe not important? But it came up and I thought I'd pass the idea along.
  2. We know that "mrca"s can differ when considering the synthetic tree vs. the taxonomy tree. Which would a user have in mind? Probably taxonomy, right? Do we need to consider both options?
@jar398
Member
jar398 commented Aug 6, 2015

Remember we're just talking about replacing the current unstable neo4j node
ids with something that has a chance of surviving a synthetic tree update.
Perhaps they should be called 'locators' instead of 'ids' since there is no
need and no purpose in trying to make them unique per node.

I would expect treemachine to be deterministic, for any invocation of the
server, in its choice of locator for a node. But redoing synthesis, or
perhaps even restarting the server, might lead to a different arbitrary
choice.

If we wanted uniqueness and stability, then yes, we might implement
phylocode, but that would be a major undertaking - we'd need a registry,
and some way to match nodes in the synthetic tree to "definitions" in the
registry. I'm just talking about a minimal enhancement that will allow us
to be able to use node locators appearing in comments, URLs, etc. to locate
nodes from one synthesis build to the next, which is something we
definitely do not have now.

On Wed, Aug 5, 2015 at 6:02 PM, Emily Jane McTavish <
notifications@github.com> wrote:

That makes sense. Is the idea that nodes would be assigned a fixed id,
e.g. the node that is the mrca of 123 and 456 may also be the mrca of 111
and 444. Would it just be arbitrarily assigned an id string using two of
its descendants, or would both '123+456' and '111+444' map to that node?
(This may be getting into phylocode territory)


Reply to this email directly or view it on GitHub
#183 (comment)
.

@jar398
Member
jar398 commented Aug 6, 2015

On Wed, Aug 5, 2015 at 7:44 PM, Joseph W. Brown notifications@github.com
wrote:

I still don't understand. [image: 😰]

Forgive me for being naive, but:

  1. If a user wants a named node, they need to know the ottid
  2. If a user wants an unnamed node, they need to know at least 2
    ottids of descendant taxa.

So, the user knows what situation they are in (they have to be). If
that is the case (it might not; like I said, I do not understand the issue
yet), then why would:

You are mistaken. The common cases are the same as the current use cases
for neo4j node ids. When someone uses a neo4j node id, that is because they
have been handed it by someone else, typically either an API call (e.g.
arguson) or something that has received a node id indirectly (e.g. the
feedback system). It is just an uninterpreted handle. So situation 1 is
unrelated to the current issue; we already have a way to do #1 and would
continue using it. Number 2 is untrue as well: currently if you want an
unnamed node, you provide a node id - an opaque token that treemachine
knows how to interpret. I'm talking about replacing one opaque token with
another. Yes, treemachine will need to know the two OTT ids to locate the
node (just as it now needs to know the mapping from neo4j node ids to
node), but the user doesn't need to know that. In software engineering,
this is known as data abstraction.
https://en.wikipedia.org/wiki/Abstract_data_type

foo(node_identifier=[NNN or XXX+YYY]); // where the arg is parsed to determine meaning

be better than:

bar([ottid=NNN or mrca_of=XXX]); // where either arg can be passed (not both)

? Either way, the user has to provide the right number of identifiers.
Right?

No, the user just has to provide an opaque string that can be interpreted
by some service that knows how to interpret it. If they have two OTT ids
(or more likely two node locators) and want their mrca, they would use a
service that knows how to handle that case. That is a different case.

Syntactically, I would set a requirement that these locators must be
useable within URIs (including CGI query strings) without escaping. That
would rule out use of =, +, %, and various other characters.

It may be that the XXX+YYY idea is causing so much confusion that maybe
these locators should be made to look syntactically opaque. This is easily
done, as I said before. Let the single OTT id case be written with a
terminal '1', and the double OTT id case be written with a terminal '2' and
the digits interleaved. So OTT id 123 would become node locator 1231, and
the mrca of OTT 123 and OTT 456 would be node locator 1425362. This would
be completely plug-compatible with the current node ids - we change certain
decimal strings with other decimal strings - so would require zero
cooperation from clients. Only treemachine would have to change.

I'm not sure we should go this far.

As @mtholder https://github.com/mtholder mentioned, "mrca_of" is
ubiquitous in our community, so I think it is appropriate for identifying
an unnamed node. I feel like I am missing something important here...

Yes, I think you missing the idea of data abstraction

In terms of implementation, the stuff on devapi was just something to play
with and see what we like. It took nearly no time at all, so no problem if
it gets pitched. The point was to get moving on this languishing issue.

OK. It sounded as if you had made an incompatible change. And we have a
history on this project of sticking with designs that have had no planning
or review.

Two more things that came up:

  1. If we get rid of node_ids, then it won't be possible to query
    unnamed nodes that do not appear in the synthetic tree. Imagine we have 50
    trees supporting one rooting of mammals, and 50 others that support
    another. One resolution has the highest ranking tree, so it will be present
    in the synthetic tree. With the ability to query node_ids, it is possible
    to look at the conflicting resolution and see which trees support it. Maybe
    not important? But it came up and I thought I'd pass the idea along.

And this facility is one we currently have? If so, then yes, we have a
different locator stability problem, and I hadn't thought about it. Perhaps
stable references to these things is a harder problem, and maybe it's even
intractable - in which case they should be tied to a particular tree, to
avoid accidental hits when these get saved across TAG generations. THe
current situation where you can obtain a node id, save it, attempt to reuse
it, and get something completely unrelated really has to be ended, one way
or another.

  1. We know that "mrca"s can differ when considering the synthetic tree
    vs. the taxonomy tree. Which would a user have in mind? Probably taxonomy,
    right? Do we need to consider both options?

This issue is about treemachine, so all decisions are relative to whatever
artifact(s) treemachine hosts. That would be the synthetic tree and/or TAG.
The same locators could be used with taxomachine, and then they would be
relative to one of the taxonomies that it hosts.

As I said, what the user has in mind is whatever they have in mind when
they use neo4j node ids currently. We have had users get confused about OTT
ids vs. neo4j ids, so perhaps we should have a syntactic way to distinguish
OTT ids from node locators.

The alternative to "sloppy" mrca-style locators that might take you to a
node near one someone might have intended, but not exactly to it, is "hard"
locators that are totally specific to a particular synthetic tree (or TAG)
instance, i.e. neo4j node id + synthesis id. Neither approach is "right";
they have different failure modes. "Hard" locators become unresolvable once
the underlying tree goes away (as when we advance to a new TAG /
synthesis). "Sloppy" locators will always be resolvable, but they may take
you to a slightly different node than someone might have wanted, if the
resolution or classification changes in that vicinity. Perhaps the "sloppy"
locator idea is not a good one. But either of these alternatives is
superior to the current system, where a single locator can have wildly
different meaning in different trees, and there is no check to make sure
that an id is only used with the tree it came from. The purpose of this
issue is not to introduce a new way of talking about mrcas, it's about
solving the current problem where node ids are not only unstable but can be
wrong in a way that is not detected. All discussion on this issue should be
directed at solving that problem.

Just exploring the design space, another option is to combine the two
approaches. A locator would consist of a "sloppy" mrca-style locator plus a
synthesis version id. If you try to use a locator with a synthesis other
than the one it came from, you would get a warning that the synthesis you
asked for isn't available any more, but that a corresponding node in the
current synthesis is such-and-such.


Reply to this email directly or view it on GitHub
#183 (comment)
.

@josephwb
Member
josephwb commented Aug 6, 2015

Should:
"the mrca of OTT 123 and OTT 456 would be node locator 1425362"
be:
"the mrca of OTT 123 and OTT 456 would be node locator 1234562"?

@josephwb
Member
josephwb commented Aug 6, 2015

Regardless, happy to do whatever people want.

@jimallman
Member

Just exploring the design space, another option is to combine the two approaches. A locator would consist of a "sloppy" mrca-style locator plus a synthesis version id. If you try to use a locator with a synthesis other than the one it came from, you would get a warning that the synthesis you asked for isn't available any more, but that a corresponding node in the current synthesis is such-and-such.

👍

@josephwb
Member

I started playing with the idea:

"Let the single OTT id case be written with a terminal '1', and the double OTT id case be written with a terminal '2' and the digits interleaved. So OTT id 123 would become node locator 1231, and the mrca of OTT 123 and OTT 456 would be node locator 1425362." [although I think the last thing should be "1234562"]

But don't we need some sort of delimiter? Are ottids not all the same length? If we need a delimiter, then which one? We have traditionally used '_', but I know people dislike that one.

@jar398
Member
jar398 commented Aug 11, 2015

you would prefill with zeroes as needed.

123 + 4 = 123 + 004 = 102034
456 + 78901 = 0456 + 78901 = 708495061

The encoding doesn't matter. That you can establish a 1-1 correspondence between integers and pairs of integers is well established (Georg Cantor, 1870s). We shouldn't be talking details of the encoding until we've settled on general requirements.

@mtholder mtholder referenced this issue in OpenTreeOfLife/opentree Dec 1, 2015
Closed

Syntax of synthetic tree node references #813

@jar398 jar398 changed the title from Never expose node ids in the API to Never expose volatile node ids in the API Jan 26, 2016
@jar398
Member
jar398 commented Feb 9, 2016

On hangout we agreed the ids would be chosen by propinquity and included in the tree input to tm-lite.

@josephwb
Member

This seems done.

@josephwb josephwb closed this Feb 17, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment