Term suggestions from ENVIRONMENTS EOL #100

Open
GoogleCodeExporter opened this Issue Mar 28, 2015 · 15 comments

Suggested: term addition, synonym clean-up
The list of terms:
https://docs.google.com/spreadsheet/ccc?key=0Ao1hXzPMk-h-dHcwdTdxbHREcFYzRldSTlVFQUxhU1E&om=true&richtext=false#gid=0

Rationale: review of EnvO, for freshwater biome branch

Relevant sources (e.g. provenance of synonyms):
Evangelos Pafilis
ENVIRONMENTS EOL.corpus curation

Original issue reported on code.google.com by lynn.sch...@gmail.com on 8 Apr 2014 at 6:15

Thanks for this!
Quick feedback:
Many of the proposed terms can be added, but anything that is an adjective
(e.g. shallow, dry, etc.) cannot be a class in itself.

Removing those we have:
ground
vegetation
rocky shore
tree: see issue
infralittoral
burrow
crevice
lowland


@Vangelis: Would you or your curators like to put forth definitions for these 
or recommend sources?

I agree that many synonyms should be made classes and will begin working on 
this.

Original comment by p.buttig...@gmail.com on 24 Apr 2014 at 1:55

Regarding superfluous or replicate synonym entries, see Issue 2. This is more of
a stylistic issue. Did this cause major issues in the EOL text mining?

Original comment by p.buttig...@gmail.com on 25 Apr 2014 at 8:20

  • Changed state: Started
Hi Pier, hi all!

Re: #1: Pier, I will go around the curator offices and ask for more information
on relevant resources.

Re: #2: Regarding replicate synonyms and text mining: there have been several
dictionary/stopword curation rounds that reduce the effect of such cases.
Mapping a word (or words) in the text to more than one EnvO term also helps to
mitigate the problem.

These synonyms caused more confusion during the manual EOL document annotation
(corpus curation). The curators were uncertain which EnvO term to choose for
the mapping. Might this also affect biologists looking up EnvO terms?

Stay tuned for more feedback



Original comment by vagpafi...@gmail.com on 29 Apr 2014 at 1:19

Many thanks!

To a degree, confusion is expected when handling environment terms: in 
practice, many tend to be poorly defined and used colloquially and/or loosely. 
I think it's largely this situation that leads to long lists of related 
synonyms etc. This situation affects everyone from submitters to curators and 
I'm not sure how much this can be reduced. 

The important thing is to look at the class definition (if present, if not, it 
would be good to propose one) and 1) select the class that best fits the case 
being annotated or 2) propose a new class. This would then be an accurate 
annotation referencing a defined and uniquely identified concept (through the 
ENVO ID), regardless of the class label (i.e. the term) itself. The synonyms 
could then serve to show the 'fuzziness' around the concept, which can be 
useful too.

We're aware that many ENVO classes need (improved) definitions, and we are 
happy to receive proposals. Would it be possible to get a list of EOL records 
annotated with a given class (or one of its synonyms)? This could be a good 
source of feedback to help us tune the class synonym lists.

Original comment by p.buttig...@gmail.com on 30 Apr 2014 at 9:09

pbuttigieg commented Apr 14, 2015

ground
vegetation (can you use "vegetated area"?)
rocky shore (added to envo-edit.owl by revision 280)
tree: see issue
infralittoral (infralittoral zone added to envo-edit.owl by revision 279)
burrow (added to envo-edit.owl by revision 280)
crevice (this exists: ENVO:01000294)
lowland (needs disambiguation, see here)

@pbuttigieg pbuttigieg added a commit that referenced this issue Apr 14, 2015

@pbuttigieg pbuttigieg Addressed Issue #100 d169024

pbuttigieg commented Jun 3, 2015

Hi @evangelospafilis, any new input?

pbuttigieg commented Sep 22, 2015

New strategy: use ENVIRONMENTS-EOL results to auto-generate habitat classes.
Input will look something like this:

EOL:212026 30241743;http://www.eol.org/voc/table_of_contents#Wikipedia subtropical ENVO:01000205
EOL:212026 30241743;http://www.eol.org/voc/table_of_contents#Wikipedia Ocean ENVO:00000015
EOL:212026 31083003;http://rs.tdwg.org/ontology/voc/SPMInfoItems#Habitat continental shelf ENVO:00000223
EOL:212026 31083003;http://rs.tdwg.org/ontology/voc/SPMInfoItems#Habitat Marine ENVO:00000447

or a count of ENVO classes:

1 ENVO:01000205 subtropical
1 ENVO:00000447 marine biome
1 ENVO:00000223 continental shelf
1 ENVO:00000015 ocean
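
As a rough illustration of how per-class counts like these could be derived from the annotation lines, here is a minimal Python sketch. It assumes the fields are tab-separated (taxon id, source, matched term, ENVO id); the name `count_envo_classes` is made up for this example and is not part of any existing tool.

```python
from collections import Counter

# Sample rows copied from the input shown above; tab separation is an assumption.
SAMPLE_ROWS = [
    "EOL:212026\t30241743;http://www.eol.org/voc/table_of_contents#Wikipedia\tsubtropical\tENVO:01000205",
    "EOL:212026\t30241743;http://www.eol.org/voc/table_of_contents#Wikipedia\tOcean\tENVO:00000015",
    "EOL:212026\t31083003;http://rs.tdwg.org/ontology/voc/SPMInfoItems#Habitat\tcontinental shelf\tENVO:00000223",
    "EOL:212026\t31083003;http://rs.tdwg.org/ontology/voc/SPMInfoItems#Habitat\tMarine\tENVO:00000447",
]


def count_envo_classes(rows):
    """Return {taxon id: Counter({ENVO id: number of hits})}."""
    counts = {}
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        taxon_id, envo_id = fields[0], fields[3]
        counts.setdefault(taxon_id, Counter())[envo_id] += 1
    return counts


if __name__ == "__main__":
    for taxon_id, counter in count_envo_classes(SAMPLE_ROWS).items():
        for envo_id, hits in counter.most_common():
            print(taxon_id, hits, envo_id)
```

Counting by ENVO id rather than by matched term means that "Marine" and "marine biome" style variants collapse onto the same class.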

@cmungall: Vangelis will provide the mapping of each EOL page to ENVO classes (similar to the quoted text above). We can then autopopulate classes such as "Heterodontus zebra habitat" defined with simple 'overlaps' relations. We may have to handle things like conditions as suggested below:

Heterodontus zebra habitat
habitat and
overlaps some (environmental system and has_condition some subtropical)
overlaps some continental shelf
overlaps some ocean
overlaps some marine biome
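
A minimal sketch of how such a class description could be assembled from a list of class labels, assuming we already know which labels denote conditions (here only "subtropical"). The function name and the `condition_labels` parameter are hypothetical; the output is plain text in the style of the expression above, not validated OWL.

```python
def habitat_class_expression(taxon_name, labels, condition_labels=frozenset()):
    """Build a text class expression using simple 'overlaps' relations.

    Labels listed in condition_labels are wrapped in an
    'environmental system and has_condition some ...' clause.
    """
    lines = [f"{taxon_name} habitat", "habitat and"]
    for label in labels:
        if label in condition_labels:
            lines.append(
                f"overlaps some (environmental system and has_condition some {label})"
            )
        else:
            lines.append(f"overlaps some {label}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        habitat_class_expression(
            "Heterodontus zebra",
            ["subtropical", "continental shelf", "ocean", "marine biome"],
            condition_labels={"subtropical"},
        )
    )
```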

Further, @cmungall, is there a way to add some sort of weight to relations? If the counts were not all '1' and continental shelf had 4 hits, for example, could we assert that this may be more important to the species at hand?
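
To make the weighting question concrete, here is a small sketch that turns hit counts into relative frequencies. This is only one possible reading of "weight" and says nothing about how it would be encoded in the ontology; the function name and the example counts (continental shelf with 4 hits) are illustrative assumptions.

```python
def relative_weights(hit_counts):
    """Convert {class label: hit count} into {class label: share of all hits}."""
    total = sum(hit_counts.values())
    if total == 0:
        return {}
    return {label: hits / total for label, hits in hit_counts.items()}


if __name__ == "__main__":
    # Hypothetical counts: continental shelf matched 4 times, the rest once.
    example = {"continental shelf": 4, "ocean": 1, "marine biome": 1, "subtropical": 1}
    for label, weight in sorted(relative_weights(example).items(), key=lambda kv: -kv[1]):
        print(f"{weight:.2f} {label}")
```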

pbuttigieg self-assigned this Sep 22, 2015

pbuttigieg added this to the 2015-12-01 milestone Sep 22, 2015

cmungall commented Sep 22, 2015

I assume this will be an experiment outside the main ENVO to begin with (this will create 1000s of classes). It should just be a few lines of Groovy code to make the ontology.

Weighting relations: I think the easiest way is more specific relations. For marine biome, this should be part_of for all sharks I'm aware of. This would have to come from prior knowledge (or some kind of statistical weighting of results). If we want to weight continental shelf higher, what does that mean? That the shelf is a causal hub in the environmental system that supports the shark? If so we can have a more specific chain relation. Or just promote to continental shelf environmental system.

Would we just do taxonomy leaves? For higher taxa we could use DL-Learner to learn the common features.

The ENV-EOL raw dataset has just been updated and can be found under
http://download.jensenlab.org/EOL/ =>
http://download.jensenlab.org/EOL/eol_env_annotations_noParentTerms.tar.gz (9.3M)
http://download.jensenlab.org/EOL/eol_env_annotations.tar.gz (36M)

The "noParentTerms" version lists the ENVO terms as they result from the term identification in text.

The larger dataset includes, in addition to the "noParentTerms" version, an extended version in which all parent terms (via IS_A and PART_OF traversal) are reported for every match.

This dataset is in-sync with the EOL (Encyclopedia of Life) text contents of 2015-09-16
and it contains terms for 227515 EOL taxa (both species and higher taxa)
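
For readers unfamiliar with the parent-term expansion mentioned above, the following sketch shows the general idea: starting from a matched term, follow IS_A and PART_OF edges upwards and report every term reached. The parent map and its `EX:` identifiers are placeholders invented for this example, not real ENVO classes, and this is not the tagger's actual code.

```python
from collections import deque

# Toy parent map: term -> list of (relation, parent). Placeholder identifiers only.
PARENTS = {
    "EX:0003": [("is_a", "EX:0002")],
    "EX:0002": [("part_of", "EX:0001")],
    "EX:0001": [],
}


def with_parent_terms(term_id, parents=PARENTS, relations=("is_a", "part_of")):
    """Return the term followed by all ancestors reachable via the given relations."""
    seen = [term_id]
    queue = deque([term_id])
    while queue:
        current = queue.popleft()
        for relation, parent in parents.get(current, []):
            if relation in relations and parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen


if __name__ == "__main__":
    print(with_parent_terms("EX:0003"))
```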

pbuttigieg commented Sep 23, 2015

Thanks @evangelospafilis! Which ENVO release was used for this?

pbuttigieg commented Sep 23, 2015

@cmungall

> Would we just do taxonomy leaves? For higher taxa we could use DL-Learner to learn the common features.

@evangelospafilis

> ...and it contains terms for 217963 EOL taxa (both species and higher taxa)

It would be very interesting to compare the results of the DL-Learner with the information mined from EOL's higher taxa descriptions. Do they match? We'll need to use a comparable taxonomy file to aggregate. This can and should be done downstream (not for the December milestone) and would be an interesting part of a project.

I would be a fan of dealing with the leaves and then aggregating up. This is unlikely to recreate taxonomy (e.g. the case of the Galapagos penguins).
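
A simple set-based sketch of "dealing with the leaves and then aggregating up": for each higher taxon, collect the union and the intersection of the ENVO classes annotated on its leaves. This is a stand-in for illustration, not DL-Learner; the taxon names are placeholders and the function name is hypothetical.

```python
def aggregate_up(children, leaf_classes, taxon):
    """Return (union, intersection) of ENVO ids over all leaves under taxon.

    children: {taxon: [child taxa]}; taxa without an entry are treated as leaves.
    leaf_classes: {leaf taxon: set of ENVO ids}.
    """
    kids = children.get(taxon, [])
    if not kids:
        classes = set(leaf_classes.get(taxon, ()))
        return classes, set(classes)
    union, common = set(), None
    for kid in kids:
        kid_union, kid_common = aggregate_up(children, leaf_classes, kid)
        union |= kid_union
        common = set(kid_common) if common is None else common & kid_common
    return union, common


if __name__ == "__main__":
    # Placeholder taxonomy and annotations for illustration only.
    children = {"genus_x": ["species_a", "species_b"]}
    leaf_classes = {
        "species_a": {"ENVO:00000015", "ENVO:00000223"},
        "species_b": {"ENVO:00000015"},
    }
    print(aggregate_up(children, leaf_classes, "genus_x"))
```

The intersection gives the features shared by every leaf, while the union shows everything seen anywhere in the group; comparing either against the terms mined from the higher taxon's own EOL description is left to downstream work.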

@pbuttigieg Which ENVO release was used for this?
The dictionary of the ENV-EOL tagger is based on

envo-basic.obo, format-version: 1.2, data-version: releases/2013-06-14
(with some small updates/tweaks since then)

cmungall commented Sep 23, 2015

> It would be very interesting to compare the results of the DL-Learner with the information mined from EOL's higher taxa descriptions. Do they match?

These may just turn out to be a union of the features of the child taxa.

> We'll need to use a comparable taxonomy file to aggregate. This can and should be done downstream (not for the December milestone) and would be an interesting part of a project.

Yes, this would be experimental. No dependencies on this?

> I would be a fan of dealing with the leaves and then aggregating up. This is unlikely to recreate taxonomy (e.g. the case of the Galapagos penguins).

The kind of thing I would hope to see come out of DL-Learner-type approaches would be:

penguin SubClassOf environed-in some overlaps some ('cold water environment' or fed-by some 'cool water current') and environment-of some teleost and ...

but this may require bringing in some geographic knowledge, possible species interactions too

The new ENV-EOL dataset is now available under:
http://download.jensenlab.org/EOL/
http://download.jensenlab.org/EOL/eol_env_annotations_noParentTerms.tar.gz
227583 EOL taxa are linked to ENVO terms

NB: the format has been modified: citation-related information has been added.
Each entry now reads:
eol taxon id TAB data object id ; text section type ; externalSourceDataObjectURL TAB matched term TAB envo identifier TAB full citation
e.g.
EOL:45858206 25111593;http://rs.tdwg.org/ontology/voc/SPMInfoItems#GeneralDescription;http://www.pensoft.net/journals/compcytogen/article/4320/abstract cleft ENVO:00000526 Gavrilov-Zimin I (2012) A contribution to the taxonomy, cytogenetics and reproductive biology of the genus Aclerda Signoret (Homoptera, Coccinea, Aclerdidae) CompCytogen 6(4): 389–395
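
A minimal parsing sketch for the entry layout described above, assuming the five top-level fields are separated by tabs and the second field holds three parts separated by semicolons. The `EnvEolEntry` type and `parse_entry` function are hypothetical names for this example.

```python
from typing import NamedTuple


class EnvEolEntry(NamedTuple):
    taxon_id: str
    data_object_id: str
    section_type: str
    source_url: str
    matched_term: str
    envo_id: str
    citation: str


def parse_entry(line):
    """Split one tab-separated entry into its named parts."""
    taxon_id, source, matched_term, envo_id, citation = line.rstrip("\n").split("\t", 4)
    # The source field is: data object id ; text section type ; external URL
    data_object_id, section_type, source_url = (
        part.strip() for part in source.split(";", 2)
    )
    return EnvEolEntry(
        taxon_id, data_object_id, section_type, source_url,
        matched_term, envo_id, citation,
    )


if __name__ == "__main__":
    example = "\t".join([
        "EOL:45858206",
        "25111593;http://rs.tdwg.org/ontology/voc/SPMInfoItems#GeneralDescription;"
        "http://www.pensoft.net/journals/compcytogen/article/4320/abstract",
        "cleft",
        "ENVO:00000526",
        "Gavrilov-Zimin I (2012) A contribution to the taxonomy, cytogenetics and "
        "reproductive biology of the genus Aclerda Signoret (Homoptera, Coccinea, "
        "Aclerdidae) CompCytogen 6(4): 389–395",
    ])
    print(parse_entry(example))
```

Lines with fewer than five tab-separated fields raise a `ValueError` here; a real loader would probably want to log and skip them instead.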

pbuttigieg added to In Progress in blueMarble Mar 29, 2017
