Get coordinate quality categories #23

Closed
1 task done
peterdesmet opened this issue Jan 16, 2015 · 9 comments

Comments

@peterdesmet
Member

Description

For a given dataset, I want to know how many records have coordinates. I also want to know how many of those are useful, have issues, and maybe what their precision is.

Outcome

dataset_key
coordinates_not_provided // Coordinates not provided
coordinates_major_issues // Coordinates with major issues
coordinates_minor_issues // Coordinates with minor issues
coordinates_valid  // Valid coordinates (all in WGS84)

Terms we need

decimalLatitude
decimalLongitude
issue

Questions

  • Flagging PRESUMED_SWAPPED_COORDINATE, PRESUMED_NEGATED_LATITUDE and PRESUMED_NEGATED_LONGITUDE as minor issues could be useful to the provider, but to the user the corrected coordinates are quite valuable. In which category would you group them?

Process

IF issue CONTAINS (
        COORDINATE_INVALID /* Can appear for invalid verbatim => no decimal coordinates */
        COORDINATE_OUT_OF_RANGE /* Can appear for invalid verbatim => no decimal coordinates */
        ZERO_COORDINATE
        COUNTRY_COORDINATE_MISMATCH
        /* I consider COUNTRY_COORDINATE_MISMATCH as a major issue,
           since it looks like GBIF only applies this when there are no country issues, 
           such as COUNTRY_INVALID */
    )
    THEN category = "coordinates_major_issues"
ELSEIF issue CONTAINS (
        GEODETIC_DATUM_INVALID /* Always followed by GEODETIC_DATUM_ASSUMED_WGS84,
            but it does indicate that the provider wanted to indicate the datum. */
        COORDINATE_REPROJECTION_FAILED /* Then GBIF just uses the original ones */
        COORDINATE_REPROJECTION_SUSPICIOUS /* Indicates successful coordinate reprojection
            according to provided datum, but which results in a datum shift larger 
            than 0.1 decimal degrees.*/
    )
    THEN category = "coordinates_minor_issues"
ELSEIF decimalLatitude = "" OR decimalLongitude = ""
    /* Not sure if we need to test for isNumber(), I think GBIF transforms those already */
    /* Also, this ELSEIF could appear between major and minor issues, as minor issues will always 
        have coordinates. I placed it here to have all issue checking first. */
    THEN category = "coordinates_not_provided"
ELSE category = "coordinates_valid"
    /* This can include issues like:
         GEODETIC_DATUM_ASSUMED_WGS84
         COORDINATE_REPROJECTED
         COORDINATE_ROUNDED (to 5 decimals)
         PRESUMED_SWAPPED_COORDINATE
         PRESUMED_NEGATED_LATITUDE
         PRESUMED_NEGATED_LONGITUDE
     Although these are issues, they are all corrected by GBIF and result in valuable WGS84 coordinates. */
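
The process above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the implementation that closed this issue: the function name `categorize`, the record layout (a dict with `decimalLatitude`, `decimalLongitude` and a list of `issue` flags), and treating a missing value the same as an empty string are all hypothetical.

```python
# Hypothetical sketch of the "Process" rules; names and record layout are assumptions.
MAJOR_ISSUES = {
    "COORDINATE_INVALID",
    "COORDINATE_OUT_OF_RANGE",
    "ZERO_COORDINATE",
    "COUNTRY_COORDINATE_MISMATCH",
}
MINOR_ISSUES = {
    "GEODETIC_DATUM_INVALID",
    "COORDINATE_REPROJECTION_FAILED",
    "COORDINATE_REPROJECTION_SUSPICIOUS",
}


def categorize(record):
    """Return the coordinate quality category for one occurrence record."""
    issues = set(record.get("issue") or [])
    # Major issues are checked first: they can occur without decimal coordinates.
    if issues & MAJOR_ISSUES:
        return "coordinates_major_issues"
    if issues & MINOR_ISSUES:
        return "coordinates_minor_issues"
    # An empty string or a missing value both count as "not provided" (assumption).
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat in ("", None) or lon in ("", None):
        return "coordinates_not_provided"
    # Everything else, including records GBIF corrected
    # (e.g. PRESUMED_SWAPPED_COORDINATE), is considered valid.
    return "coordinates_valid"
```

Checking the major issues before the empty-coordinate test mirrors the ordering in the process: a record flagged COORDINATE_INVALID with empty decimal coordinates ends up in the major issues category rather than in "not provided".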
@peterdesmet peterdesmet added this to the Term metrics milestone Jan 16, 2015
@peterdesmet
Member Author

Discovered that:

  1. If there are no decimal coordinates, GBIF interprets the verbatim coordinates. If it can interpret those, it populates decimalLatitude and decimalLongitude and applies COORDINATE_ROUNDED and GEODETIC_DATUM_ASSUMED_WGS84. Example
  2. When those coordinates cannot be interpreted, it might apply COORDINATE_INVALID and COORDINATE_OUT_OF_RANGE and leave the decimal coordinates empty. Example. That means we need to change the order in which things are interpreted. I'll do that in the body of the issue.

@peterdesmet
Member Author

Here's a breakdown for my test dataset:

  • coordinates not provided: 2601
  • coordinates with major issues: 128
  • coordinates with minor issues: 175 (205 if we include swapped/negated)
  • valuable coordinates (all in WGS84): 3065

Decimals for valuable coordinates:

decimals  records
1         177
2         352
3         113
4         658
5         700
6+        1101
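
The thread does not say how the decimals were counted. One possible way, sketched below, is to count the digits after the decimal point in the coordinate's text form and lump six or more into a single "6+" bucket. The function names are hypothetical, and trailing zeros are counted as given, which is an assumption.

```python
from collections import Counter


def count_decimals(value):
    """Count digits after the decimal point in a coordinate given as text."""
    text = str(value).strip()
    if "." not in text:
        return 0
    return len(text.split(".", 1)[1])


def decimals_histogram(values):
    """Tally coordinates per number of decimals, lumping 6 or more into '6+'."""
    histogram = Counter()
    for value in values:
        n = count_decimals(value)
        histogram["6+" if n >= 6 else str(n)] += 1
    return histogram
```

Passing the coordinates as strings rather than floats avoids losing or gaining digits through floating-point formatting.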

@peterdesmet peterdesmet changed the title Coordinates categories Coordinate quality categories Jan 16, 2015
@peterdesmet
Member Author

Note: I think we could almost use the count API for this, e.g. http://api.gbif.org/v1/occurrence/count?datasetKey=4ce8e3f9-2546-4af1-b28d-e2eadf05dfd4&issue=COUNTRY_COORDINATE_MISMATCH. The reason it won't work for all categories is that you can't search for records with decimalLatitude/decimalLongitude populated, nor use that in IF/ELSE logic.

@peterdesmet
Member Author

I think we can actually do a lot more of this with the regular occurrence search API:

http://api.gbif.org/v1/occurrence/search?datasetKey=4ce8e3f9-2546-4af1-b28d-e2eadf05dfd4&hasCoordinate=true&issue=COUNTRY_COORDINATE_MISMATCH => count = 50330

The biggest question is whether we can use negations (all coordinates with NO issues) and combine multiple issues. I'll ask Tim.
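
A minimal sketch of how such a query could be built and its count read from Python, using only the standard library. The helper names `build_count_query` and `fetch_count` are hypothetical, and passing `limit=0` to skip the result records is an assumption. The `count` field is the one quoted above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "http://api.gbif.org/v1/occurrence/search"


def build_count_query(dataset_key, issue=None, has_coordinate=None):
    """Build a search URL whose JSON response carries the matching record count."""
    # limit=0 (assumption): we only need the count, not the records themselves.
    params = {"datasetKey": dataset_key, "limit": 0}
    if has_coordinate is not None:
        params["hasCoordinate"] = "true" if has_coordinate else "false"
    if issue is not None:
        params["issue"] = issue
    return SEARCH_URL + "?" + urlencode(params)


def fetch_count(url):
    """Request the URL and return the 'count' field of the JSON response."""
    with urlopen(url) as response:
        return json.load(response)["count"]
```

For example, `fetch_count(build_count_query("4ce8e3f9-2546-4af1-b28d-e2eadf05dfd4", issue="COUNTRY_COORDINATE_MISMATCH", has_coordinate=True))` would send the same filters as the URL above.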

@peterdesmet
Member Author

Asked Tim: there are no OR and NOT operators in the API, only AND. I think that means we can't use it for this use case. :-(

@peterdesmet peterdesmet changed the title Coordinate quality categories Create coordinate quality categories Jan 19, 2015
@bartaelterman
Member

Concerning the PRESUMED_SWAPPED_COORDINATE, PRESUMED_NEGATED_LATITUDE, PRESUMED_NEGATED_LONGITUDE issues.

I think a record with these issues is ready for use, while records with the issues we categorized as minor are not (if you don't know the geodetic datum, you'll need to do some work to figure that out first). So I think records with these PRESUMED_... issues are usable, thus valid.

However, I agree that this information would be valuable to the data provider. They should be informed that GBIF fixed their coordinates, but maybe we can provide that information somewhere else.

@peterdesmet
Member Author

OK. We keep the categories as they are. We can discuss this in the documentation #32. All 4 fields are now available in CartoDB.

@peterdesmet peterdesmet changed the title Create coordinate quality categories Get coordinate quality categories Jan 26, 2015
@peterdesmet peterdesmet assigned niconoe and unassigned bartaelterman Jan 27, 2015
@niconoe
Member

niconoe commented Feb 2, 2015

@peterdesmet, @bartaelterman: I'm ready to implement. Is the algorithm above (in "Process") still valid? Are there adjustments to be made?

@peterdesmet
Member Author

@niconoe, the algorithm described in the issue body is still valid. No adjustments needed for now.

@niconoe niconoe closed this as completed in d8fac01 Feb 9, 2015