publish target definitions #15

Open
kcranston opened this Issue Oct 28, 2014 · 16 comments

kcranston commented Oct 28, 2014

The annotation DB will contain many different targets, and it would be interesting to be able to easily see / query those definitions. This could also potentially lead to re-use of definitions for the same clade - "you defined clade x as blah blah blah, and we also have these definitions that point to that same clade in the synthetic tree"

nfranz Oct 28, 2014

Hi all:

To me (probably not just me), the "Open" also means (in addition to Open
source, Open access), "Open (= indefinite) chain of revisions/updates of
phylogenetic hypotheses". More or less.

I'm clearly not close to the end of my personal journey here, but am
suggesting that the open-ended-ness can be brought out more clearly and
consistently if some ways of speaking are used.

I was curious to learn last week (only) that DarwinCore defines the term
"Taxon" (http://rs.tdwg.org/dwc/terms/index.htm#Taxon) as:

"The category of information pertaining to taxonomic names, taxon name
usages, or taxon concepts."

Category of information..? I thought it was more like this:
http://en.wikipedia.org/wiki/Taxon (= some natural entity
of..stuff/processes "out there").

What I am getting at is this: there are two often intersecting contexts
in which we use terms such as taxon, group, clade, monophylum, etc. The
contexts do often intersect, but have slightly different flavors or domains
of application. I'll call context 1 the "speak about taxa" context, and
context 2 the "*represent* taxon information" context.

Context 1 is I think how we usually talk to each other. There is an
assumption, either openly acknowledged or at least enacted (sometimes
hypocritically I suppose), that taxa are natural,
causally/evolutionarily sustained entities "out there", that we have some
epistemic access to their identity and boundaries, and generally it's ok to
talk and act that way. (even though we are still "reconstructing")

Context 2 is how I think we ought to translate the very legitimate but
abbreviated conventions of context 1 into an open system in the above
sense. In this second context we might want to be closer to the DwC notion
(however unintuitive when [e.g.] giving an evolutionary presentation).
Taking this a bit further, then, in that second context we may only wish to
speak about inter-subjective (human to human) mental representations and
their interconnections. "Taxon" becomes "perceived taxon", "clade" becomes
"inferred clade", etc.

Why all this? Because of the "same clade". And we all understand that
phrase, but this way of speaking falls into context 1, right? It might be
read to make claims about identity that do not hold at a more granular
level of information representation in the OT environment.

I obviously think there is some value in an exercise where one does
context 2 all the way. It's not that easy to get there, but presumably
easier to then scale back and converge more on context 1 again. In context
2, strictly speaking, "same" has a very narrow meaning (same identifier in
the database), and "taxon" or "clade" are not needed at all.

Accordingly, "same clade in the synthetic tree" (context 1) becomes:

"phylogenetically congruent clade hypotheses, published independently
(hence with different identifiers), that make non-conflicting contributions
to the topology of OT version X" (context 2).

Perhaps it is apparent how context 2 may inform annotation practices
that are ultimately of value.

Hopefully I also managed to express that ascribing a full reality to
taxa and not actually talking like that when it comes to information
representation are actually compatible notions.

Sorry if this was TL;DR.

Best, Nico

hlapp Oct 29, 2014

This is being revised in the round of changes approved by the executive at
TDWG 2014. -hilmar

On 10/28/14, 5:29 PM, Nico Franz wrote:

I was curious to learn last week (only) that DarwinCore defines the term
"Taxon" (http://rs.tdwg.org/dwc/terms/index.htm#Taxon) as:

"The category of information pertaining to taxonomic names, taxon name
usages, or taxon concepts."

Category of information..? I thought it was more like this:
http://en.wikipedia.org/wiki/Taxon (= some natural entity
of..stuff/processes "out there").

Hilmar Lapp -:- lappland.io

nfranz Oct 29, 2014

Thanks, Hilmar.

Incidentally there is a related, upcoming Berkeley "BIGCB" workshop. I
will try to document (blog/tweet) some of the talks and outcomes from this
3-day event.

http://taxonbytes.org/bigcb-workshop-at-uc-berkeley-tackling-the-taxon-concept-problem/

Nico

kcranston Nov 1, 2014

Hi Nico,
Thanks for the thoughts. I think - if I understand you correctly - that our idea of an annotation database separate from the synthetic tree supports the distinction between your contexts 1 and 2. When I say "same clade", I mean the node in a given version of the synthetic tree. The annotation targets (concepts) do not change, but their presence / absence / placement on the tree might change from version to version. So, for a particular synthetic tree, we can then ask "how many of these annotation targets map to the same node in the tree?". By being flexible with how people can define the targets, we allow different taxon concepts for different use cases.

Does that make sense, or am I missing something? @jar398 might also have some input here.
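
As a rough illustration of the "how many targets map to the same node" query, here is a minimal sketch. It assumes each synthetic tree version comes with a mapping from annotation target IDs to node IDs; the IDs and the `targets_by_node` helper are invented for this example and are not part of any Open Tree API.

```python
from collections import defaultdict

def targets_by_node(target_to_node):
    """Invert a target -> node mapping for one synthetic tree version.

    Targets that could not be placed in this version map to None and are skipped.
    """
    grouped = defaultdict(list)
    for target_id, node_id in target_to_node.items():
        if node_id is not None:
            grouped[node_id].append(target_id)
    return dict(grouped)

# Hypothetical placements of three annotation targets on one tree version.
placements_v1 = {
    "target:A": "node42",
    "target:B": "node42",
    "target:C": None,  # this target is absent from this tree version
}

# Keep only the nodes that more than one target maps to.
shared = {
    node: sorted(targets)
    for node, targets in targets_by_node(placements_v1).items()
    if len(targets) > 1
}
```

Because the targets themselves never change, the same query can be rerun against the mapping for each new tree version.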

mjy Nov 1, 2014

Can someone provide examples of target definitions that are not defined in the phylocode? It seems to me that this real life use case is precisely that which the phylocode seeks to address.

jar398 Nov 1, 2014

Example definition not defined by phylocode: anything that is
character-based. For example, you may want to annotate a higher taxon in
the taxonomy, e.g. Mammalia, under the assumption that (a) the
character-based definition is either well known or can be located in the
literature, and (b) it is a clade.

I know some people don't like to admit claims like this, but they are
common in biology (at least as hypotheses). But I think they qualify as an
example, which is what you requested.

mjy Nov 1, 2014

Some clarification: I was assuming (likely incorrectly) that target definitions were those being used in some computable manner, i.e. not asserted as in the nomenclatural pipeline. Since character data are not stored in OT, I assumed this wasn't an option.

If definitions are simply user-asserted annotations, then why worry about target definitions? Why not just make a generic "tagging" system and let users come up with sets of attributes (tags on tags) that they find useful?
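
To make the "tags on tags" idea concrete, here is a minimal sketch of what such a generic system might look like. Everything in it (the `Tag` and `TagStore` names, the ID scheme, the labels and values) is hypothetical; it only shows that a tag's subject can be either a tree node or another tag.

```python
from dataclasses import dataclass

@dataclass
class Tag:
    tag_id: str
    subject: str      # ID of the thing being tagged: a node ID or another tag's ID
    label: str
    value: str = ""

class TagStore:
    """In-memory store; a real system would persist tags somewhere."""

    def __init__(self):
        self._tags = {}

    def add(self, tag):
        self._tags[tag.tag_id] = tag
        return tag

    def tags_on(self, subject_id):
        """Return all tags whose subject is the given node or tag ID."""
        return [t for t in self._tags.values() if t.subject == subject_id]

store = TagStore()
# A user-asserted tag on a (hypothetical) node...
store.add(Tag("t1", "node42", "trait", "example value"))
# ...and a tag on that tag, recording where the assertion came from.
store.add(Tag("t2", "t1", "source", "example citation"))
```

Nothing here is computed from data: the store records whatever users assert, which is the trade-off raised in the comment above.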

jar398 Nov 1, 2014

I find the annotation enterprise epistemologically troublesome. It
would help me if I could see a list of use cases, i.e. actual claims
that one would want to express and store.

I think it's very important to distinguish real-life clades, the
physical / biological entities, from the information structures comprising
the synthetic tree. So I am with Nico in warning against saying
"clade" when you probably mean "node". If these aren't distinguished
then, among other things, there is no way to speak rationally about
how annotations are to be transferred from one tree to another.

To rigorously interpret a node N as a designator for a clade, you
would have to do the following:

  1. Consider the set T of tip nodes of the synthetic tree.
  2. Interpret each tip node t in T as a clade cl(t).
  3. Let tips(N) = the "descendent" tips of N in the tree.
  4. Consider the clades that (a) contain cl(t) for all t in tips(N), AND
    (b) do not contain any of the clades cl(t) for t in T-tips(N).
    There could be many such clades, or there could be none.
    By (a) and the nature of cladeness, these clades will all either be
    or contain the real-life MRCA of cl(t) for t in tips(N).
  5. Select one of these clades as cl(N) = the one that N designates.

An annotation (a biological claim) expressed in relation to node N
could be about any of these clades (4). The claim could even be vague on
this point, saying it's about one of these clades but it's not known
which. That's how a claim of an ancestral trait would be.
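
The tree-side part of steps 1 to 4 can be sketched in a few lines. This is only an illustration under stated assumptions: the tree is a plain dict from internal node to child list, and the function names are made up here. The real-life clades cl(t) are not computable, so the code only derives the two tip sets that constrain which clades N could designate.

```python
def tips(tree, node):
    """Return the set of tip nodes descended from `node`.

    `tree` maps each internal node to a list of its children; tips have no entry.
    """
    children = tree.get(node, [])
    if not children:
        return {node}
    result = set()
    for child in children:
        result |= tips(tree, child)
    return result

def designation_constraints(tree, root, node):
    """Split the full tip set T into (tips(N), T - tips(N)).

    Any clade designated by `node` must contain cl(t) for every tip in the
    first set (condition a) and none of the cl(t) for the second (condition b).
    """
    all_tips = tips(tree, root)
    included = tips(tree, node)
    return included, all_tips - included

# Toy tree: root has children N and c; N has children a and b.
toy = {"root": ["N", "c"], "N": ["a", "b"]}
must_contain, must_exclude = designation_constraints(toy, "root", "N")
```

Step 5, choosing one clade among the candidates, has no counterpart in the code: it is a biological decision, not something the tree structure determines.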

The hard part comes when we update the synthetic tree, i.e. we compute
a new tree2 different from tree1. There may or may not be a node in
tree2 that designates the same clade according to this formula, or the
same range of clades. It is possible to set up a correspondence between
trees with the property that equivalent nodes can consistently be taken to
designate the same clade.

The problem case where N 'becomes paraphyletic' when we go from tree1
to tree2 is pretty obvious - N is not DCC-5 equivalent to any node in
tree2. (This may or may not reflect N failing to designate a
clade; it could just be a loss of resolution.)
But another problem case is where a new node is 'inserted' as
a sibling of N, i.e. N aligns with N' and parent(N) aligns with M' and
M' is not parent(N'). Can you say whether a given N-annotation
applies to N' or to its parent? Probably not.

Re 4(b), you can't just say that the annotation is about the MRCA of
cl(t) for t in tips(N). The placement in tree2 of a tip that's not in tree1
in or out of N' could affect the support for the annotation. Similarly a
tip that's in tree1 could be missing from tree2, effectively retracting
any claim that it's in or not in cl(N'). So you have to pay attention to
all the tips, not just the ones under the node of interest.
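
One deliberately conservative way to respect "all the tips" is sketched below. It is not the DCC-5 equivalence mentioned above, which this thread does not define; it is a stricter stand-in invented for illustration. Two nodes are paired only if both trees have exactly the same tip set T and the nodes have identical descendant tips, so they agree on both what is included and what is excluded.

```python
def tip_sets(tree, root):
    """Map every node to the frozenset of tips below it (a tip maps to itself)."""
    sets = {}

    def visit(node):
        children = tree.get(node, [])
        if not children:
            sets[node] = frozenset([node])
        else:
            sets[node] = frozenset().union(*(visit(c) for c in children))
        return sets[node]

    visit(root)
    return sets

def strict_matches(tree1, root1, tree2, root2):
    """Pair tree1 nodes with tree2 nodes that have identical tip sets.

    Returns {} if the trees differ in their overall tip set, because an added
    or dropped tip changes what every node is claimed to include or exclude.
    """
    s1, s2 = tip_sets(tree1, root1), tip_sets(tree2, root2)
    if s1[root1] != s2[root2]:
        return {}
    index1, index2 = {}, {}
    for node, ts in s1.items():
        index1.setdefault(ts, []).append(node)
    for node, ts in s2.items():
        index2.setdefault(ts, []).append(node)
    # Only pair nodes when the tip set is unambiguous on both sides.
    return {
        nodes1[0]: index2[ts][0]
        for ts, nodes1 in index1.items()
        if len(nodes1) == 1 and len(index2.get(ts, [])) == 1
    }

tree1 = {"r": ["N", "c"], "N": ["a", "b"]}
tree2 = {"r2": ["M", "b"], "M": ["a", "c"]}       # same tips, different grouping
tree3 = {"r": ["N", "c", "d"], "N": ["a", "b"]}   # tree1 plus a new tip d

matches_1_2 = strict_matches(tree1, "r", tree2, "r2")
matches_1_3 = strict_matches(tree1, "r", tree3, "r")
```

In the first comparison N finds no partner because no node in `tree2` groups a and b. In the second nothing is paired at all, which is overly cautious but makes the dependence on the full tip set explicit.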

Now maybe I'm being rabid, and the annotations ought to be attached to
the synthetic tree somewhere convenient without such nit-picky regard
for logical soundness. If there is a reasoned argument for every
annotation (e.g. via a citation), anyone can just go look at the original
argument to figure out what it's saying exactly and whether it's
consistent with any particular tree hypothesis, should there be any
doubt. (The evidence for a claim - placed on the current synthetic tree -
might even include a particular older version of the synthetic tree.)
If the support and rationale are captured formally we even have a
chance of reasoning about the claim using tools.

Jonathan

nfranz Nov 1, 2014

Thank you, Karen (et al.).

I tend to think that the less abstract, the easier to understand and
ultimately converge. The below example is likely not perfectly suited (and
not "real") but maybe gets us closer.

Suppose that OT version 12 includes a section in its topology for
Senecio, a genus (concept) of the daisy family (concept). The purported
Senecio clade (genus, possibly some subgenera, then species) as shown in
that OT version is entirely "grounded" in a single phylogeny resource
(citation) submitted through PhyloGrafter. Say, "Pelser et al. 2007" are
the authors of that phylogeny reference.

So then at time 0, maybe the annotation database ought to be able to
say: "Version OT12 contains a clade Senecio with subsumed clades named ...,
and this information is referenced to Pelser et al. 2007. Make annotations
accordingly." Do I see three identifiers here or at least three groupings
of data bits that amount to the following functionality?

  1. for "Senecio" (the name),

  2. for Senecio as circumscribed by Pelser et al. 2007 (who presumably get
    an ID too) (the concept), and

  3. for 2. as integrated specifically into OT12?
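
The three groupings listed above could be modeled roughly as follows. This is a sketch under assumptions: the class names, field names, and ID strings are invented for illustration and do not describe the actual annotation database.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Name:
    """1. The name alone, e.g. the string "Senecio"."""
    name_id: str
    text: str

@dataclass(frozen=True)
class Concept:
    """2. The name as circumscribed by a particular reference."""
    concept_id: str
    name: Name
    according_to: str

    def label(self):
        return f"{self.name.text} according to {self.according_to}"

@dataclass(frozen=True)
class Placement:
    """3. A concept as integrated into one specific tree version."""
    placement_id: str
    concept: Concept
    tree_version: str

# Hypothetical identifiers for the Senecio example.
senecio = Name("name:1", "Senecio")
senecio_pelser = Concept("concept:1", senecio, "Pelser et al. 2007")
senecio_in_ot12 = Placement("placement:1", senecio_pelser, "OT12")
```

Keeping the three levels separate lets an annotation point at whichever one it is really about: the name, the circumscription, or its appearance in a given tree version.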

Versioning scenario A, time 1. OT13 gets published, incorporating no
changes at or under Senecio. Pelser et al. 2007 are still valid, and
exclusively so. Annotations happen. I suppose they happen over "Senecio as
circumscribed by Pelser et al. 2007 as integrated into OT13?". Maybe that
last part (update to OT13) is not needed. Either way, I would personally
prefer the cumbersome but maybe more clear context 2 speak here: OT12 and
OT13 congruently include a concept of Senecio as circumscribed by Pelser et
al. 2007. "Same clade" works well enough as a shorthand here though.

Versioning scenario B, time 1. OT13 gets published, but now part but not
all of Pelser et al.'s 2007 phylogenetic rendering as it pertains to the
concept of Senecio is replaced (at lower levels) with concepts published in
Watson et al. 2015.

I think at that point, both in terms of speaking, and for the purpose of
database and annotation structuring, we are increasingly in the context 2
realm, where mention of "taxon", "clade", "node" might be fruitfully
adjusted to "taxon concept" (which has a taxon concept label:
taxonomic/clade name + reference; see 2 above), "clade concept/hypothesis",
"node concept/definition", and so forth. All of these effectively carry
"according to's". At the database representation level, there is a fair bit
of granularity. Lots of linking may have to happen. Much of that could
likely be automated. Not all granularity and linking need be exposed very
obviously to every user.

Some of this is clearly academic. As I said, we humans tend to
understand each other well either way. But I think too that OT has an
opportunity to improve our community's syntactics and semantics as we
explore open-ended tree hypotheses assembly systems. Put differently, if an
arguably paradigmatic shift from a one-off culture of publication to an
open-ended but credit-aware system did not also force us to revise our
ways of speaking, wouldn't that be surprising? OT could be seen as
generating a new environment of phylogenetic hypothesis linking that was
not there before. Developing ways of speaking for that context need not
negate the value of simpler ways of speaking in the more traditional
contexts (like Pelser et al. 2007 in isolation), where provenance of
concepts is readily inferred ("this publication, duh") and taxon and clade
names map rather directly to a tree structure that is less of a composite
than higher-level sections of OT versions might likely be.

Hopefully a productive post, which is of great concern to me.

Nico

nfranz commented Nov 1, 2014

Thank you, Karen (et al.).

I tend to think that the less abstract, the easier to understand and
ultimately converge. The below example is likely not perfectly suited (and
not "real") but maybe gets us closer.

Suppose that OT version 12 includes a section in its topology for
Senecio, a genus (concept) of the daisy family (concept). The purported
Senecio clade (genus, possibly some subgenera, then species) as shown in
that OT version is entirely "grounded" in a single phylogeny resource
(citation) submitted through PhyloGrafter. Say, "Pelser et al. 2007" are
the authors of that phylogeny reference.

So then at time 0, maybe the annotation database ought to be able to
say: "Version OT12 contains a clade Senecio with subsumed clades named ...,
and this information is referenced to Pelser et al. 2007. Make annotations
accordingly." Do I see three identifiers here or at least three groupings
of data bits that amount to the following functionality?

  1. for "Senecio" (the name),

  2. for Senecio as circumscribed by Pelser et al. 2007 (who presumably get
    an ID too) (the concept), and

  3. for 2. as integrated specifically into OT12?

    Versioning scenario A, time 1. OT13 gets published, incorporating no
    changes at or under Senecio. Pelser et al. 2007 are still valid, and
    exclusively so. Annotations happen. I suppose they happen over "Senecio as
    circumscribed by Pelser et al. 2007 as integrated into OT13?". Maybe that
    last part (update to OT13) is not needed. Either way, I would personally
    prefer the cumbersome but maybe more clear context 2 speak here: OT12 and
    OT13 congruently include a concept of Senecio as circumscribed by Pelser et
    al. 2007. "Same clade" works well enough as a shorthand here though.

    Versioning scenario B, time 1. OT13 gets published, but now part but not
    all of Pelser et al.'s 2007 phylogenetic rendering as it pertains to the
    concept of Senecio is replaced (at lower levels) with concepts published in
    Watson et al. 2015.

    I think at that point, both in terms of speaking, and for the purpose of
    database and annotation structuring, we are increasingly in the context 2
    realm, where mention of "taxon", "clade", "node" might be fruitfully
    adjusted to "taxon concept" (which as a taxon concept label:
    taxonomic/clade name + reference; see 2 above), "clade concept/hypothesis",
    "node concept/definition", and so forth. All of these effectively carry
    "according to's". At the database representation level, there is a fair bit
    of granularity. Lots of linking may have to happen. Much of that could
    likely be automated. Not all granularity and linking need be exposed very
    obviously to every user.

    Some of this is clearly academic. As I said, we humans tend to
    understand each other well either way. But I think too that OT has an
    opportunity to improve our community's syntactics and semantics as we
    explore open-ended tree hypotheses assembly systems. Put differently, if an
    arguably paradigmatic shift from a one-off culture of publication to an
    open-ended but credit-aware system did not also force us to revise our
    ways of speaking, wouldn't that be surprising? OT could be seen as
    generating a new environment of phylogenetic hypothesis linking that was
    not there before. Developing ways of speaking for that context need not
    negate the value of simpler ways of speaking in the more traditional
    contexts (like Pelser et al. 2007 in isolation), where provenance of
    concepts is readily inferred ("this publication, duh") and taxon and clade
    names map rather directly to a tree structure that is less of a composite
    than higher-level sections of OT versions might likely be.

    Hopefully a productive post, which is of great concern to me.

Nico

On Sat, Nov 1, 2014 at 2:09 AM, Karen Cranston notifications@github.com
wrote:

Hi Nico,
Thanks for the thoughts. I think - if I understand you correctly - that
our idea of an annotation database separate from the synthetic tree
supports the distinction between your concepts 1 and 2. When I say "same
clade", I mean the node in a given version of the synthetic tree. The
annotation targets (concepts) do not change, but their presence / absence /
placement on the tree might from version to version. So, for a particular
synthetic tree, we can then ask "how many of these annotation targets map
to the same node in the tree?". By being flexible with how people can
define the targets, we allow different taxon concepts for different use
cases.

Does that make sense, or am I missing something? @jar398
https://github.com/jar398 might also have some input here.
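
A minimal sketch of the question "how many of these annotation targets map to the same node in the tree?", assuming each target is defined by a set of specifier tips and resolved as their most recent common ancestor. The tree, the node names, and the target names are hypothetical placeholders, not the project's actual data model.

```python
from collections import defaultdict

# Hypothetical synthetic tree version, stored as child -> parent links.
PARENT = {
    "A": "n1", "B": "n1", "C": "n2", "n1": "n2", "D": "root", "n2": "root",
}

def ancestors(node, parent):
    """Return the path from node up to the root, node first."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def mrca(tips, parent):
    """Most recent common ancestor of the given tips."""
    paths = [ancestors(t, parent) for t in tips]
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    # Walking up from one tip, the first shared node is the deepest one.
    return next(n for n in paths[0] if n in common)

# Hypothetical annotation targets, each defined by specifier tips.
TARGETS = {
    "target_x": ["A", "C"],
    "target_y": ["B", "C"],
    "target_z": ["A", "B"],
}

def group_targets_by_node(targets, parent):
    """Map each node of this tree version to the targets that resolve to it."""
    groups = defaultdict(list)
    for name, tips in targets.items():
        groups[mrca(tips, parent)].append(name)
    return dict(groups)

groups = group_targets_by_node(TARGETS, PARENT)
print(groups)
```

The target definitions never change here; only the `PARENT` mapping would be swapped out for a new tree version, and the grouping recomputed.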



@mjy

mjy Nov 1, 2014

" It would help me if I could see a list of use cases, i.e. actual claims that one would want to express and store." <- Yes, this is likely the only way to really resolve this.

IMO there are 2 worlds. 1) clades are asserted to exist vs. 2) clades are calculated from data. IMO OT is completely within 1.

When you seek to persist clades/nodes across OTs, then you must ask, are you in world 1), or world 2)?

For argument's sake I claim that unless you begin to calculate on data (or, more broadly, annotations on nodes), you will always be in 1). Regardless of who states what about concept T at time X, if you can't recalculate based on the data, you're stuck with an assertion. This was the basis of my original observation in this thread, i.e. what then can you do that is not doable as defined in the phylocode (or maybe the phylocode doesn't work, but let's assume it does)?

If OT agrees to happily exist in world 1) (which is just fine), then there are many things that are easily done without overcomplicating things. Entomologists don't think to themselves, "I've got this nagging doubt, humans just might be insects!". For many (all?) practical purposes they never, ever have to do this. They do real work, on a daily basis, without ever worrying about the definition of insects expanding to include humans. If they are doing an insect phylogeny they also don't have to worry about birds, lizards, fish, or squirrels. This suggests to me that there are real clades, that can be represented as nodes, in the OT, and that these can persist across versions. Do we need a robust logical framework for this level of assertion/claim? I love the idea, but maybe it's over-engineering at some level.

M

@nfranz

nfranz Nov 2, 2014

Thanks, Matt.

I think the point about the right amount of engineering is always well
taken. (and my views in this thread are myopic on this one issue of a much
larger undertaking)

Whether clade concepts are asserted (with data being "somewhere else")
or inferred from data (provided and analyzed "right there in the same
system") - one could still ask in each case how provenance might be
tracked, at coarse or fine levels of granularity. Provenance can apply to
direct evidence as well:
http://onlinelibrary.wiley.com/doi/10.1111/j.1095-8312.2007.00847.x/abstract

I think (but may well be ignorant) the following example is challenging
for the PhyloCode approach. You have two clade concepts, each with three
children, at time 0:

0.clade1 with three children:
0.clade1_child1,
0.clade1_child2,
0.clade1_child3.

Then also 0.clade2 with three children:
0.clade2_child4,
0.clade2_child5,
0.clade2_child6.

Suppose that the PhyloCode node-based identity of 0.clade1 is set as the
most recent common ancestor of 0.clade1_child1 and 0.clade1_child3.

Similarly, the identity of 0.clade2 is set as the 0.clade2_child4 and
0.clade2_child6 intersection (node).

At time = 1, new evidence/interpretation indicates that the respective
phylogenetic positions of child2 and child5 should be "inverted"; so we
obtain:

1.clade1 with children:
1.clade1_child1,
1.clade1_child5,
1.clade1_child3.

1.clade2 with children:
1.clade2_child4,
1.clade2_child2,
1.clade2_child6.

I believe under the PhyloCode application the clade definitions do not
change. But we do have two taxonomies here at t = 0 versus t = 1 whose
taxonomic content appears non-congruent at a more granular level. We could
even presume that the stated synapomorphies of 0.clade1/1.clade1 and
0.clade2/1.clade2 are "the same" (= identical text strings; I would prefer
to say they have congruent intensions). In the eyes of time = 1, the
properties of child2 and child5 at time = 0 had been misdescribed.
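
The example above can be sketched in code, under the assumption that a node-based definition is resolved as the smallest node containing both specifiers. The node ids (`n1`, `m1`, and so on) are hypothetical labels for the two tree versions; only the child names come from the example.

```python
# Hypothetical node ids (n*/m*) stand in for nodes of the two tree versions.
TREE_T0 = {
    "root0": ["n1", "n2"],
    "n1": ["child1", "child2", "child3"],
    "n2": ["child4", "child5", "child6"],
}
TREE_T1 = {
    "root1": ["m1", "m2"],
    "m1": ["child1", "child5", "child3"],
    "m2": ["child4", "child2", "child6"],
}

# Node-based definitions: each clade is the MRCA of two specifier tips.
DEFINITIONS = {"clade1": ("child1", "child3"), "clade2": ("child4", "child6")}

def tips_below(node, children):
    """Return the set of tips at or below a node."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    return set().union(*(tips_below(k, children) for k in kids))

def resolve(specifiers, children):
    """Return the smallest node whose tips include all specifiers."""
    candidates = [n for n in children if set(specifiers) <= tips_below(n, children)]
    return min(candidates, key=lambda n: len(tips_below(n, children)))

def content(specifiers, children):
    """Tips included in the clade that the definition picks out in this tree."""
    return tips_below(resolve(specifiers, children), children)

for name, spec in DEFINITIONS.items():
    before, after = content(spec, TREE_T0), content(spec, TREE_T1)
    print(name, "definition:", spec, "| congruent content:", before == after)
```

Both definitions resolve in both trees, yet the resolved content differs between t = 0 and t = 1, which is the non-congruence described above.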

Best, Nico


@jar398

jar398 Nov 2, 2014

Member

On Sat, Nov 1, 2014 at 4:36 PM, Matt notifications@github.com wrote:

> " It would help me if I could see a list of use cases, i.e. actual claims
> that one would want to express and store." <- Yes, this is likely the only
> way to really resolve this.
>
> IMO there are 2 worlds. 1) clades are asserted to exist vs. 2) clades are
> calculated from data. IMO OT is completely within 1.

I think what you mean by 1) is real-life clades and what is true about
them, and by 2) hypotheses about clades. IMO open tree, as an information
artifact, is squarely about both. This is because some of what it records
is true, and some of what it records is only hypothesized (could be true or
not).

In case 2) sometimes there will be a clade to which the hypothesis (such as
membership or traits) apply, and sometimes not.

Of course you don't have a claim that can be judged true or not unless you
have semantics that tells you what your data structures and calculations
mean. I think that is what we are talking about when it comes to
annotations.

> When you seek to persist clades/nodes across OTs, then you must ask, are
> you in world 1), or world 2)?

A clade will persist out in nature, unless we go out and make it extinct.
What I'm talking about is how a claim related to a node in version 1 of the
tree can be related to version 2 of the tree. So it's really about
'persistence' of access to claims through editions of the tree - regardless
of whether the claim, or the tree, is true or not.

I think it's better to have the annotations not expressed in terms of any
node in any edition of the synthetic tree, but rather to use phylocode or
'taxon concepts' to refer to clades. But if one wanted to interpret a node
in the synthetic tree as a clade, what I'm saying is that it's not at all
obvious how to do this. For me the obvious interpretation would be some
clade that contains the tips or samples below that node, and does not
contain the tips or samples not below that node, in that edition of the
tree. But there may be many such clades, or no such clade.

> For arguments sake I claim that unless you begin to calculate on data (or,
> more broadly annotations on nodes), you will always be in 1). Regardless of
> who states what about concept T at time X, if you can't recalculate based
> on the data, you're stuck with an assertion. This was the basis of my
> original observation in this thread, i.e. what then can you do that is not
> doable as defined in the phylocode (or maybe the phylocode doesn't work,
> but let's assume it does)?

I don't get this. Data would be support for a claim. So you're talking
about whether claims are supported or not. Phylocode is not about data or
claims, it's a way to refer to clades. You can use phylocode to make
unsupported claims, or to make supported claims.

> If OT agrees to happily exist in world 1) (which is just fine), then
> there are many things that are easily done without over complicating
> things. Entomologist don't think to themselves, "I've got this nagging
> doubt, humans just might be insects!". For many (all?) practical
> purposes they never, ever have to do this. They do real work, on a daily
> basis, without ever worrying about the definition of insects expanding to
> including humans. If they are doing an insect phylogeny they also don't
> have to worry about birds, lizards, fish, or squirrels. This suggests to me
> that there are real clades, that can be represented as nodes, in the OT,
> and that these can persist across versions. Do we need a robust logical
> framework for this level of assertion/claim? I love the idea, but maybe its
> over-engineering at some level.

The problem is not with the clear-cut cases like comparing humans to
insects or lizards. The problems come when you add a primitive insect-like
fossil and there is disagreement over whether it's an insect or not (i.e.
whether insect annotations should apply to it); or when you have a name
whose circumscription (or assignment to a clade, if there is one) 'changes'
from one edition of a source taxonomy to the next; or when a newer analysis
'moves' a taxon into or out of a clade (referenced using phylocode). We
have thousands of names whose definitions seriously conflict between source
taxonomies - one taxonomy says that a set of tips grouped by another
taxonomy is not a clade, and vice versa. (perhaps neither is right, but
both can't be.) We don't have enough information to know whether in each
case two different 'taxon concepts' were applied, or if the same 'taxon
concept' held but someone had a change of heart over whether a subgroup
satisfied that 'taxon concept'. And when a curator assigns an OTU to a
named taxon we don't know what's going through their head. It's keeping
track of what to do in these cases - and harder, explaining after the fact
how decisions were made so that errors can be tracked down and corrected -
that requires careful thought and engineering. There will be analogous
situations regarding phylogeny: you make an annotation on mrca(A,B)
assuming that C is in the clade and it turns out that a better hypothesis
is that C isn't in mrca(A,B), and maybe the annotation loses its support in
that case. This is going to happen a lot; it's not a disaster, but it's
going to be hard to know what's meant and to be transparent about what
happened when we advanced to newer tree hypotheses. Without a way to
explain how things ended up the way they are in the taxonomy or synthetic
tree, what we're doing isn't science.
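
One way the mrca(A,B) situation could be made checkable is to store, next to each annotation, the membership it assumed, and to re-test that assumption against every new tree. This is only a sketch: the `Annotation` record, its field names, and the two toy trees are hypothetical.

```python
from dataclasses import dataclass

def tip_sets(tree):
    """Tip set of every node in a nested-tuple tree; the node's own set comes last."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    collected, own = [], frozenset()
    for child in tree:
        child_sets = tip_sets(child)
        collected.extend(child_sets)
        own |= child_sets[-1]
    collected.append(own)
    return collected

def mrca_clade(tree, specifiers):
    """Tips of the smallest clade containing all specifiers (assumed present)."""
    return min((s for s in tip_sets(tree) if set(specifiers) <= s), key=len)

@dataclass
class Annotation:
    text: str
    specifiers: tuple           # the annotation is attached to mrca(specifiers)
    assumed_members: frozenset  # tips the annotator assumed to be in the clade

def check(annotation, tree):
    """Return (supported, missing): missing lists assumed members now outside the clade."""
    clade = mrca_clade(tree, annotation.specifiers)
    missing = annotation.assumed_members - clade
    return (not missing, sorted(missing))

OLD_TREE = ((("A", "C"), "B"), "D")  # C falls inside mrca(A, B)
NEW_TREE = ((("A", "B"), "C"), "D")  # C falls outside mrca(A, B)
note = Annotation("hypothetical trait claim", ("A", "B"), frozenset({"C"}))

print(check(note, OLD_TREE))
print(check(note, NEW_TREE))
```

Returning the list of missing members, rather than a bare boolean, is a design choice aimed at the transparency concern: it records why an annotation lost its support when the tree changed.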

Here's another example I'm struggling with: there are currently a couple of
species in OTT that are misclassified as crustaceans instead of molluscs.
When we fix this problem, there will be an incompatible 'change' in the
membership of Arthropoda. Does this mean that the new group should get a
new identifier? - after all its identity in some sense has changed. If so,
annotations and OTU mappings linked to the old id have no home in the tree.
It doesn't get a new id with the current taxonomy generator, which assumes
that names are tied uniquely to taxon concepts (with some exceptions), but
with a more principled system where groups are defined by membership or
phylogenetic hypotheses, it might. This would have an impact on OTU
mappings and annotation carryover. I don't have a good answer to this one,
but am working on ways to anchor the semantics of ids.
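
The trade-off in the Arthropoda example can be illustrated with two toy identifier schemes: one tied to the name, one derived from membership. The hashing scheme, the id prefixes, and the tip names are assumptions made up for this sketch and do not describe the current taxonomy generator.

```python
import hashlib

def membership_id(members):
    """Derive an identifier from the sorted member list (a hypothetical scheme)."""
    joined = "\n".join(sorted(members))
    return "m-" + hashlib.sha1(joined.encode("utf-8")).hexdigest()[:12]

# Placeholder tips; "misplaced_sp" stands for a species classified in the wrong group.
arthropoda_v1 = {"insect_sp", "crustacean_sp", "misplaced_sp"}
arthropoda_v2 = arthropoda_v1 - {"misplaced_sp"}

# Name-based ids stay put across versions; membership-based ids do not.
name_ids = {"v1": "name:Arthropoda", "v2": "name:Arthropoda"}
member_ids = {"v1": membership_id(arthropoda_v1), "v2": membership_id(arthropoda_v2)}

# Annotations made against version 1, keyed by each kind of id.
annotations_by_name = {name_ids["v1"]: ["hypothetical annotation"]}
annotations_by_membership = {member_ids["v1"]: ["hypothetical annotation"]}

# Look the annotations up again using the version 2 ids.
carried_by_name = annotations_by_name.get(name_ids["v2"], [])
carried_by_membership = annotations_by_membership.get(member_ids["v2"], [])

print("name-based carryover:", carried_by_name)
print("membership-based carryover:", carried_by_membership)
```

Under the name-based scheme the annotation carries over silently even though membership changed; under the membership-based scheme it has no home in version 2 unless some explicit mapping between the old and new ids is recorded.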

The problem of transparency for identifier semantics and annotation
carryover is real, and has to be solved regardless of whether we decide to
"overengineer" or not.

Jonathan


@mjy

mjy Nov 3, 2014

I think I likely confuse rather than add to this discussion, so I'll tiptoe away after this.

My world 1/2 distinction appears to confuse. IMO any proposed species, taxon, or clade is a hypothesis, so this does not factor into my distinction b/w one and two. 1/2 is a pragmatic distinction; it's related to how hypotheses burst into existence, then get referenced later on. World 2) is about how species and clade hypotheses are ("originally") defined; it pertains to data derived directly from instances of the (ultimately hypothesized) species/clades. For example, I might gather DNA, anatomical, and behavioral data and then run an algorithm on these data; based on those results I hypothesize the existence of taxa/clades. Later, in world 1), someone points to my hypothesis, assumes it's a good one, and does a new study. They do not compute on any of the data I used to define my original hypotheses. An OT in this world 2) would necessarily reference specimens, and the data directly tied to those specimens. It would then define classes (clades/taxa) that classify those specimens based on the outcomes of the analysis of the underlying data. (Supertree methods do not count as world 2; I assert this, rather than back it up.)

OT could do something similar to what happens in world two, but abstracted away from specimen data a layer or two. Annotations (= data that can be used to define clades) can be added to OTUs. Clades can be defined as classes that are bound to a quantitative calculation on those annotations, i.e. they classify OTUs. In this scenario identifiers are provided for these clades, and from tree to tree they repopulate based on the data that is available. Their definition remains the same, a calculation/algorithm; therefore there is no need to change identifiers. Transparency is not an issue: when someone asks why X is in Y, you point to the algorithm that placed X in Y. Want to tweak the calculation that defines a clade? Mint a new identifier; it's demonstrably different because it actually calculates on data.
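
A minimal sketch of a clade defined as a calculation on OTU annotations, assuming annotations are simple key/value records. The `ComputedClade` class, the `leg_count` annotation, and the identifiers are hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set

@dataclass(frozen=True)
class ComputedClade:
    """A clade whose identity is its rule, not its current membership."""
    identifier: str
    rule: Callable[[Dict[str, object]], bool]

    def populate(self, otus: Dict[str, Dict[str, object]]) -> Set[str]:
        """Recompute membership from whatever annotated OTUs are available."""
        return {name for name, ann in otus.items() if self.rule(ann)}

# Hypothetical rule: an OTU belongs to the clade if its annotation says six legs.
six_legged = ComputedClade("clade:six-legged-v1", lambda ann: ann.get("leg_count") == 6)

# Hypothetical annotated OTUs in two successive releases.
RELEASE_1 = {"otu_a": {"leg_count": 6}, "otu_b": {"leg_count": 8}}
RELEASE_2 = {**RELEASE_1, "otu_c": {"leg_count": 6}}

print(six_legged.identifier, sorted(six_legged.populate(RELEASE_1)))
print(six_legged.identifier, sorted(six_legged.populate(RELEASE_2)))

# Tweaking the calculation means minting a new identifier.
at_least_six = ComputedClade("clade:six-legged-v2", lambda ann: ann.get("leg_count", 0) >= 6)
```

The identifier stays fixed while membership repopulates from release to release, and the answer to "why is X in Y" is the rule itself.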

A final way of thinking about it. In OT there are publications, topologies, and taxa/clades. Now you want to add annotations to the system. The problem is that the system defines taxa/clades only via reference to a publication and topology. How can you expect a system to persist annotations on clades when the system does not define those clades based on those annotations?

@arlin

arlin Nov 3, 2014

On Nov 2, 2014, at 11:35 AM, Nico Franz notifications@github.com wrote:

0.clade1 with three children:
0.clade1_child1,
0.clade1_child2,
0.clade1_child3.

Then also 0.clade2 with three children:
0.clade2_child4,
0.clade2_child5,
0.clade2_child6.

Nico, I like the approach of specifying an example of how relationships change. This discussion would be helped by a set of concrete examples of attributions and changes that reflect the kinds of problems that are likely to arise (maybe the 80% rule could work here).

To me, this raises the issue of why we want to traffic in clade concepts that purport to be stable by virtue of referring to an external reality, when in reality they reflect a limited view that is likely to change in the future. SFAIK it is generally agreed in the ontology world that ontological statements may be asserted as true based on the best available knowledge, even when they are hypotheses with some uncertainty (the whole issue of describing a conceptual world of hypotheses or posterior distributions is a separate matter). The problem with clades is just that the uncertainty is high enough, and they are so likely to change, that we are all here having an explicit discussion about how knowledge can persist through these changes.

Is there some more generic way to assign attributes that sticks closer to the evidence?

Rather than pinning a label “blue” (for instance) on a clade based on some research publication, let's say that the publication assigns “blue” based on an ordered split, where the ingroup is { child1, child3 } and the outgroup is { child4, child6 }. This rule for assigning “blue” can persist, potentially through multiple tree topologies. In each case, we have to determine whether the topology is consistent with the split, and if so, how to apply the attribute.
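
One way to sketch that consistency check, assuming rooted trees written as nested tuples of leaf names and reading "consistent" as "some clade contains the whole ingroup and none of the outgroup". The function names and the third, conflicting tree are made up for illustration.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the leaf set of every internal node in the tree."""
    if isinstance(tree, str):
        return
    yield leaves(tree)
    for child in tree:
        yield from clades(child)


def smallest_clade_for_split(tree, ingroup, outgroup):
    """Smallest clade holding all of ingroup and none of outgroup, else None."""
    matches = [c for c in clades(tree) if ingroup <= c and not (outgroup & c)]
    return min(matches, key=len) if matches else None


ingroup = {"child1", "child3"}
outgroup = {"child4", "child6"}

# Topologies at time 0 and time 1 from the quoted example, plus a
# hypothetical topology that contradicts the split.
tree_t0 = (("child1", "child2", "child3"), ("child4", "child5", "child6"))
tree_t1 = (("child1", "child5", "child3"), ("child4", "child2", "child6"))
tree_conflict = (("child1", "child4", "child2"), ("child3", "child5", "child6"))

blue_t0 = smallest_clade_for_split(tree_t0, ingroup, outgroup)
blue_t1 = smallest_clade_for_split(tree_t1, ingroup, outgroup)
blue_conflict = smallest_clade_for_split(tree_conflict, ingroup, outgroup)
print(blue_t0, blue_t1, blue_conflict)
```

When the split is satisfied, the returned leaf set is a candidate target for “blue”; when it is not, the rule reports that the attribute cannot be placed rather than silently attaching it somewhere.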

We could back this up a further step, staying even closer to the evidence, and simply specify the method and evidence used in the research publication that attributes “blue” to clade1. We could say that “blue” is assigned by parsimony based on a particular distribution, e.g., ((((child1:blue,child2:blue),child4:purple),child5:red),child6:green). And again, we need a set of rules to know how to apply this when the topology is updated and when new members are added.
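
As a rough illustration of storing the method plus the tip states rather than the conclusion, here is the bottom-up pass of Fitch parsimony applied to the distribution above. Fitch is an assumption on my part (the comment only says "parsimony"), and the nested-tuple encoding is likewise just for the sketch.

```python
def fitch_downpass(tree, states):
    """Return (candidate state set at this node, changes needed in its subtree)."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree  # binary trees only in this sketch
    left_set, left_cost = fitch_downpass(left, states)
    right_set, right_cost = fitch_downpass(right, states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed here.
        return shared, left_cost + right_cost
    # Children disagree: keep all candidates and count one change.
    return left_set | right_set, left_cost + right_cost + 1


# ((((child1:blue,child2:blue),child4:purple),child5:red),child6:green)
tree = (((("child1", "child2"), "child4"), "child5"), "child6")
states = {
    "child1": "blue",
    "child2": "blue",
    "child4": "purple",
    "child5": "red",
    "child6": "green",
}

inner_set, inner_cost = fitch_downpass(tree[0][0][0], states)  # (child1, child2)
root_set, root_cost = fitch_downpass(tree, states)
print(inner_set, inner_cost, root_set, root_cost)
```

Only the (child1, child2) node is unambiguously “blue” on this pass; re-running the same calculation on an updated topology or with added members is what would let the attribution move with the evidence.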

I don’t think there is any way to avoid the need to implement some kind of complex rule-based system, where the rules are based on phylogenetic logic.

Arlin

Suppose that the PhyloCode node-based identity of 0.clade1 is set as the
most recent common ancestor of 0.clade1_child1 and 0.clade1_child3.

Similarly, the identity of 0.clade2 is set as the 0.clade2_child4 and
0.clade2_child6 intersection (node).

At time = 1, new evidence/interpretation indicates that the respective
phylogenetic positions of child2 and child5 should be "inverted"; so we
obtain:

1.clade1 with children:
1.clade1_child1,
1.clade1_child5,
1.clade1_child3.

1.clade2 with children:
1.clade2_child4,
1.clade2_child2,
1.clade2_child6.

I believe under the PhyloCode application the clade definitions do not
change. But we do have two taxonomies here at t = 0 versus t = 1 whose
taxonomic content appears non-congruent at a more granular level. We could
even presume that the stated synapomorphies of 0.clade1/1.clade1 and
0.clade2/1.clade2 are "the same" (= identical text strings; I would prefer
to say they have congruent intensions). In the eyes of time = 1, the
properties of child2 and child5 at time = 0 had been misdescribed.
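
The t = 0 versus t = 1 scenario can be sketched as follows, assuming trees are nested tuples of leaf names and a node-based definition is simply a set of specifiers whose most recent common ancestor picks out the clade. The helper names are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def mrca_clade(tree, specifiers):
    """Leaf set of the smallest subtree containing every specifier, else None."""
    if isinstance(tree, str):
        return {tree} if specifiers <= {tree} else None
    for child in tree:
        found = mrca_clade(child, specifiers)
        if found is not None:
            return found  # a deeper node already contains all specifiers
    here = leaves(tree)
    return here if specifiers <= here else None


# The node-based definition of clade1 is the same at both times.
clade1_definition = {"child1", "child3"}

tree_t0 = (("child1", "child2", "child3"), ("child4", "child5", "child6"))
tree_t1 = (("child1", "child5", "child3"), ("child4", "child2", "child6"))

clade1_t0 = mrca_clade(tree_t0, clade1_definition)
clade1_t1 = mrca_clade(tree_t1, clade1_definition)
moved = clade1_t0 ^ clade1_t1  # members present at one time but not the other
print(sorted(clade1_t0), sorted(clade1_t1), sorted(moved))
```

The definition is unchanged while the resolved membership differs by exactly child2 and child5, which is the non-congruence described above.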

Best, Nico

On Sat, Nov 1, 2014 at 1:36 PM, Matt <notifications@github.com> wrote:

" It would help me if I could see a list of use cases, i.e. actual claims
that one would want to express and store." <- Yes, this is likely the only
way to really resolve this.

IMO there are 2 worlds. 1) clades are asserted to exist vs. 2) clades are
calculated from data. IMO OT is completely within 1.

When you seek to persist clades/nodes across OTs, then you must ask, are
you in world 1), or world 2)?

For argument's sake I claim that unless you begin to calculate on data (or,
more broadly annotations on nodes), you will always be in 1). Regardless of
who states what about concept T at time X, if you can't recalculate based
on the data, you're stuck with an assertion. This was the basis of my
original observation in this thread, i.e. what then can you do that is not
doable as defined in the phylocode (or maybe the phylocode doesn't work,
but let's assume it does)?

If OT agrees to happily exist in world 1) (which is just fine), then there
are many things that are easily done without overcomplicating things.
Entomologists don't think to themselves, "I've got this nagging doubt,
humans just might be insects!". For many (all?) practical purposes
they never, ever have to do this. They do real work, on a daily basis,
without ever worrying about the definition of insects expanding to
include humans. If they are doing an insect phylogeny they also don't
have to worry about birds, lizards, fish, or squirrels. This suggests to me
that there are real clades, that can be represented as nodes, in the OT,
and that these can persist across versions. Do we need a robust logical
framework for this level of assertion/claim? I love the idea, but maybe it's
over-engineering at some level.

M



Arlin Stoltzfus (arlin@umd.edu)
Research Biologist, NIST; Fellow, IBBR; Adj. Assoc. Prof., UMCP
IBBR, 9600 Gudelsky Drive, Rockville, MD, 20850
tel: 240 314 6208; web: www.molevol.org

nfranz Nov 3, 2014

Thanks, all, I am trying to keep up the momentum (as time permits).

I thought Jonathan's crustacean/mollusc example is neat. Possibly neat
because it lays bare one's intuitions (should they exist) that identifiers
ought to be able - to a degree - to do the following work for us:

  1. Parse out (syntactically?) new information elements entering the
    pre-existing OTT environment (= expand the database infrastructure in a
    bit-level sense), and quite finely so.

  2. Reflect identity in name (taxonomic/clade).

  3. Express, to a decent degree of resolution, taxonomic/phylogenetic
    equivalence, and the lack thereof.

  4. Maybe even - express, to some degree, how much has "really changed", and
    whether it "matters".

I am not trying to set up a straw issue. I assume we will largely agree
that, as a whole, this is asking too much from a single set of identifiers.

But I do think that each of the above functions is tied to legitimate
expectations, or at least expectations worth considering. I wrote a little
bit about Jonathan's example here:
http://taxonbytes.org/taxonomic-concept-identification-reconciliation-open-tree-life-part-1/

Intersecting with this issue for me is the question about the "right
kind of logic". I tend to think the glass remains largely empty here. In
particular, I personally do not find it readily obvious that we should
have, in the context of evolving taxon/character concept hypotheses, a
logic system implementation that stipulates the referent of a class or
predicate as being constant in all possible worlds. There is an exchange
about the OBO way of doing things by Smith and Merrill, alluded to here:
http://www.applied-ontology.org/ontologicalrealism/
I tend to be in the Merrill camp; very stenographically -- domain needs require
domain-specific conceptualizations of "identity". Bottom line, OT may well
require new logic/representation development and implementation - I for one
can't say for sure that it won't. If one tried to feed Jonathan's example
into a standard Pizza-type ontology, I believe it would break the
consistency for the reasoner.

I think that leaves at least one more issue on the table - how can we
express "residual congruence"? This relates to Arlin's example of blue
versus purple standing in for an overarching kind of evidence/partition,
under which incongruent sets of taxon concepts can be variously
accommodated without losing the sense of continuity. I am working on this a
little too (but nothing is ready to show yet). Typically this means, from
the perspective of the kind of computational logic I am most familiar with,
that a "coverage constraint" must be relaxed. Slides 117-131 here (
http://www.slideshare.net/taxonbytes/franz-2014-explaining-taxonomys-legacy-to-computers-how-and-why)
illustrate the general point; one could substitute "PcarPeve_IC" for "male
terminalia configured in a certain synapomorphic way"; and at a higher
level 2006.PER and 2001.PER are congruent in spite of having non-congruent
sets of children.

Intuitively, we tend to think that property-centered definitions are
more stable at higher levels. Annotations can likely bear that out. It
doesn't necessarily follow (of course) that property-referencing concepts
behave "fundamentally differently" from taxon concepts across OTT revisions.

Best, Nico
