Get taxon match categories #39

Closed
peterdesmet opened this issue Jan 26, 2015 · 5 comments

Comments

@peterdesmet

Description

For a given dataset, I want to know how many records provide a taxon. I also want to know how many of those match the GBIF taxonomy and if there are any issues.

Outcome

dataset_key
taxon_not_provided
taxon_match_none
taxon_match_higherrank
taxon_match_fuzzy
taxon_match_complete

Terms we need

scientificName
genus
issues

Process

IF scientificName = "" AND genus = ""
    /* If scientificName is empty, GBIF builds a name with genus, specificEpithet, etc, see
       https://github.com/gbif/occurrence/blob/master/occurrence-processor/src/main/java/org/gbif/occurrence/processor/interpreting/TaxonomyInterpreter.java#L34
       If scientificName is empty, we can check for genus (no need to check other atomized fields)
       Note: TAXON_MATCH_NONE is applied for empty taxa (unless record was indexed before that
       issue was applied). */
    THEN category = "taxon_not_provided"
ELSEIF issues CONTAINS (TAXON_MATCH_NONE)
    THEN category = "taxon_match_none"
ELSEIF issues CONTAINS (TAXON_MATCH_HIGHERRANK)
    THEN category = "taxon_match_higherrank"
ELSEIF issues CONTAINS (TAXON_MATCH_FUZZY)
    THEN category = "taxon_match_fuzzy"
ELSE category = "taxon_match_complete"
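
The process above could be sketched in Python roughly as follows. This is only an illustration, not the project's actual code: the function name `taxon_match_category` and its parameters are hypothetical, `issues` is assumed to be an iterable of GBIF issue names, and a record is assumed to lack a taxon only when both scientificName and genus are empty (following the reasoning in the comment inside the pseudocode).

```python
def taxon_match_category(scientific_name, genus, issues):
    """Return exactly one taxon match category for a single record.

    Hypothetical helper: names and argument shapes are assumptions.
    """
    # Assumption: the taxon counts as "not provided" only when both
    # scientificName and genus are empty (or whitespace only).
    name_empty = not (scientific_name or "").strip()
    genus_empty = not (genus or "").strip()
    if name_empty and genus_empty:
        return "taxon_not_provided"

    issue_set = set(issues or [])
    # Order matters: the first matching issue decides the category.
    if "TAXON_MATCH_NONE" in issue_set:
        return "taxon_match_none"
    if "TAXON_MATCH_HIGHERRANK" in issue_set:
        return "taxon_match_higherrank"
    if "TAXON_MATCH_FUZZY" in issue_set:
        return "taxon_match_fuzzy"
    return "taxon_match_complete"
```

Because the checks run in a fixed order and return early, every record ends up in a single category, even if it carries several issues.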
@peterdesmet peterdesmet changed the title Get identification categories Get taxon match categories Jan 26, 2015
@peterdesmet peterdesmet added this to the Taxon match milestone Jan 26, 2015
@niconoe

niconoe commented Jan 28, 2015

The data extraction module now supports this metric. Sample report output:

{
  "8137b32e-f762-11e1-a439-00145eb45e9a": {
    "NUMBER_OF_RECORDS": 756426,
    "BASISOFRECORDS": {
      "UNKNOWN": 6,
      "OBSERVATION": 11660,
      "PRESERVED_SPECIMEN": 744760
    },
    "TAXON_MATCHES": {
      "TAXON_MATCH_HIGHERRANK": 63987,
      "TAXON_MATCH_FUZZY": 17272,
      "TAXON_MATCH_COMPLETE": 607443,
      "TAXON_NOT_PROVIDED": 67724
    }
  }
}

(I took the liberty of making the constants uppercase for the sake of consistency.)
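
For illustration, counters shaped like the TAXON_MATCHES block above could be tallied along these lines. This is a sketch under assumptions, not the extraction module itself: `build_report` is a hypothetical name, and the input is assumed to be one category string per record.

```python
from collections import Counter


def build_report(dataset_key, categories):
    """Tally per-record categories into a report dict for one dataset.

    Hypothetical helper: `categories` holds one category string per record.
    """
    # Uppercase the category names so they match the report's constants.
    counts = Counter(category.upper() for category in categories)
    return {
        dataset_key: {
            "NUMBER_OF_RECORDS": sum(counts.values()),
            # Only categories seen at least once appear here.
            "TAXON_MATCHES": dict(counts),
        }
    }
```

A category that never occurs is simply absent from the result, which is the same shape as the sample output, where one of the five categories does not appear.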

@bartaelterman

@niconoe I think we're missing TAXON_MATCH_NONE for records that have a taxon that could not be matched. At least, there is a taxon_match_none column in the CartoDB table created by @peterdesmet.

@bartaelterman

Test data has been written to CartoDB. Only taxon_match_none is still missing: all values for that column are set to 0.

@niconoe

niconoe commented Jan 29, 2015

Well, I just had a quick look and it seems the code supports it, but most records that trigger this issue at GBIF have no scientificName at all: http://www.gbif.org/occurrence/search?ISSUE=TAXON_MATCH_NONE

As I blindly implemented Peter's algorithm above, I think we will return TAXON_NOT_PROVIDED for those (unlike GBIF services, this algorithm puts each row in a single category. Is that desirable?). Also, the data extractor (so far) doesn't return a TAXON_MATCH_* counter at all if no record corresponds to it.

Should Peter's algorithm be changed? Would you like the report to contain TAXON_MATCH_NONE: 0 (instead of nothing)?

Thx !

@bartaelterman

OK, no, then everything is fine. The aggregator fills in zeros for tags that are not found, so there is no need to add that to the extractor.
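
The zero-filling could look roughly like this. It is a sketch only, not the aggregator's real code: `fill_missing_counters` and `TAXON_MATCH_KEYS` are hypothetical names, and the key list is assumed from the categories in the issue description.

```python
# Assumed list of counters, taken from the categories in the issue description.
TAXON_MATCH_KEYS = (
    "TAXON_NOT_PROVIDED",
    "TAXON_MATCH_NONE",
    "TAXON_MATCH_HIGHERRANK",
    "TAXON_MATCH_FUZZY",
    "TAXON_MATCH_COMPLETE",
)


def fill_missing_counters(taxon_matches):
    """Return a copy of the counters with absent keys set to 0."""
    return {key: taxon_matches.get(key, 0) for key in TAXON_MATCH_KEYS}
```

With this approach the extractor can keep omitting empty counters, and downstream tables still get a value for every column.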

@niconoe niconoe closed this as completed Feb 2, 2015