iDigBio Flags on Continent #1291

Closed
Jegelewicz opened this issue Oct 5, 2017 · 46 comments
Assignees
Labels
Aggregator issues e.g., GBIF, iDigBio, etc dwc terms This issue is related to Darwin Core terms Function-CodeTables Priority-Cancelled Issue as stated was not approved for implementation by the community.

Comments

@Jegelewicz
Member

As my data was recently ingested by iDigBio, I received a huge list of specimens flagged for various corrections (sigh). I wanted to bring this one to the group to see if we should be paying more attention to Darwin Core, or if it is just something to let iDigBio keep "correcting" for.

Some of my specimens on islands in the Pacific are flagged by iDigBio with "dwc_continent_replaced | Darwin Core Continent Corrected." One example is here:
https://www.idigbio.org/portal/records/89015b8e-d745-430c-b846-8b250b62afcb

Is Arctos not complying with Darwin Core or is this just an artifact of iDigBio? Do we need to do anything about it or do I just need to know that these flags are not a problem? My main concern is that users of iDigBio will view our data as less reliable with flags attached.

@dustymc dustymc added this to the Needs Discussion milestone Oct 5, 2017
@dustymc dustymc added Function-ExternalLinks Priority-High (Needed for work) High because this is causing a delay in important collection work.. labels Oct 5, 2017
@dustymc
Contributor

dustymc commented Oct 5, 2017

@DerekSikes noticed something similar in GBIF regarding dates.

Darwin Core is an exchange standard; Arctos isn't "complying" with any data standards because none exist.

I agree with your assessment: users' initial reaction to the flag will be "Arctos is broken," which is absolutely not the case.

@ekrimmel

ekrimmel commented Oct 5, 2017

We've done a bit of thinking about this internally. Right now there are some data quality flags from iDigBio that are useful because they correct objectively incorrect data, like a mismatch between coordinates and country due to a missing sign. Others, like your Pacific islands example, are subjective to how data are stored in Arctos vs. other models. Many of the objective DQ tests would flag errors that we don't have because Arctos/Dusty also catches them (e.g. "April 31st is a date that doesn't exist"). The subjective ones I don't think are worth our time to care about at this point, in particular because the DQ tests and methods iDigBio uses are in flux due to work being done in TDWG.
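The two "objective" checks mentioned above (a date that doesn't exist, and a coordinate/country mismatch caused by a missing sign) can be sketched roughly as follows. This is only an illustration, not how iDigBio or Arctos actually implement their tests; the bounding box and all function names are hypothetical, and the box is a rough approximation used purely for the example.

```python
from datetime import date

# Hypothetical, approximate bounding box (min_lat, max_lat, min_lon, max_lon).
# Illustrative only; a real check would use proper country/state shapes.
NEW_MEXICO_BBOX = (31.3, 37.0, -109.1, -103.0)


def is_real_date(year, month, day):
    """Return True if the calendar date exists (April 31st does not)."""
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True


def in_bbox(lat, lon, bbox):
    """Return True if the point falls inside the bounding box."""
    min_lat, max_lat, min_lon, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def suggest_sign_fix(lat, lon, bbox):
    """Try the original point, then each sign flip; return the first
    variant inside the box, or None if no variant fits."""
    for candidate in [(lat, lon), (-lat, lon), (lat, -lon), (-lat, -lon)]:
        if in_bbox(candidate[0], candidate[1], bbox):
            return candidate
    return None
```

A longitude entered as 106.65 instead of -106.65 would be caught by `suggest_sign_fix`, which is the kind of correction that is hard to argue with. The continent flags discussed in this issue are a different category because there is no single "right" answer to test against.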

The TDWG Biodiversity Data Quality task group has a few factions working on different aspects. One is trying to define a framework for what we even mean when we talk about data quality as a collections community. Another is getting all the aggregators, including iDigBio, to settle on a set of the same data quality tests to run on provider data and return flags for.

I don't actually think the flags are visibly negative enough to make users think "Arctos is broken." I would hope (although I guess hope is the operative word here), that people who are running analyses on or otherwise using aggregator data for something beyond browsing would notice that the flags are doing more standardizing than correcting, and that obviously different collections/databases use different but equally correct ways to say the same thing...

@Jegelewicz
Member Author

I think I know what is going on here now and it would be a change to higher geography. While at SPNHC, Robert Mesibov offered to review some Arctos data for me. He downloaded the MSB fish data from iDigBio and reviewed the RAW file. One of the issues he found was that all of the stuff coming from oceans had no water body and instead the body of water was in the DWC_Continent field.

In Darwin Core, Atlantic Ocean is a body of water, not a continent.

I thought that it would make sense to call the tectonic plate the "continent", but that isn't how iDigBio does it. They use political boundaries for continent.

So DMNS:Bird:18967 in Arctos shows a continent of "Atlantic Ocean" in Arctos and no associated water body.

and DMNS:Bird:18967 in iDigBio shows a continent of "Europe" and has the flag DWC Continent Replaced.

Strictly speaking, we are both wrong, but I doubt that anyone searching in iDigBio for Europe wants stuff from the South Georgia Islands. And when I search iDigBio for institution code "DMNS" plus water body "Atlantic Ocean" I get no results. At least anyone searching Arctos for stuff from the Continent/Ocean field for "Atlantic Ocean" will find this specimen (I tried it and it worked!).

All this being said, it seems to me that there needs to be a wider community discussion about Continent and Bodies of Water, but in the interest of making our stuff more searchable in iDigBio (and GBIF, I'm betting), I suggest that we add Water Body to higher geography, and for anything with a continent that is really a water body we add the correct name to the water body field. iDigBio will still replace our "Continent/Ocean" information, but the correct water body will get there, so people searching the oceans will find our stuff.
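As a rough sketch of that suggestion, an export step could copy an ocean name out of the continent slot and into the Darwin Core waterBody term. The record layout (a plain dict keyed by `continent` and `waterBody`) and the list of ocean names are assumptions made for illustration; this is not the Arctos export code.

```python
# Ocean names currently used in the Arctos Continent/Ocean field
# (assumed list for this example).
OCEAN_NAMES = {
    "Arctic Ocean",
    "Atlantic Ocean",
    "Indian Ocean",
    "North Atlantic Ocean",
    "North Pacific Ocean",
    "Pacific Ocean",
    "South Atlantic Ocean",
    "South Pacific Ocean",
    "Southern Ocean",
}


def split_continent_and_water_body(record):
    """Return a copy of the record in which an ocean stored as the
    continent is moved to waterBody. An existing waterBody is kept."""
    out = dict(record)
    continent = out.get("continent", "")
    if continent in OCEAN_NAMES:
        if not out.get("waterBody"):
            out["waterBody"] = continent
        # The continent is left blank rather than guessed at.
        out["continent"] = ""
    return out
```

Leaving the continent blank is a deliberate choice in this sketch: which continent, if any, an oceanic record belongs to is exactly the part that needs community discussion.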

@Jegelewicz Jegelewicz added this to To Do in Geography in Arctos via automation Sep 22, 2018
@Jegelewicz
Member Author

BTW, I added the whole continent/ocean issue to the TDWG data quality GitHub.

Darwin Core Continent and Water Body

@Jegelewicz Jegelewicz added Aggregator issues e.g., GBIF, iDigBio, etc dwc terms This issue is related to Darwin Core terms labels Sep 22, 2018
@dustymc
Contributor

dustymc commented Sep 23, 2018

This is an aggregator doing something indefensible (which you've explicitly permitted by licensing your data CC0). This isn't an Arctos issue (there is no standard of which I'm aware), and it's not a DWC issue (the data are being properly transported to the aggregators).

There's been a "community discussion" going on for 32 years (this is what TDWG was formed to do) with no resolution. What we NEED is a usable authority. Arctos could become that or plug into something else; both are technically trivial. (What's Kurator using?)

I dislike waterbody. I fail to see how the few miles of sometimes-wet sorta-ditch behind the farm (it's in Getty) is the same sort of data as states and counties.

@dustymc dustymc closed this as completed Sep 23, 2018
Geography in Arctos automation moved this from To Do to Done Sep 23, 2018
@dustymc
Contributor

dustymc commented Sep 24, 2018

woops

@dustymc dustymc reopened this Sep 24, 2018
@Jegelewicz
Member Author

@ArctosDB/geo-group , please read John W's response.

@Jegelewicz Jegelewicz moved this from Done to To Do in Geography in Arctos Sep 24, 2018
@dustymc
Contributor

dustymc commented Sep 24, 2018

We could (theoretically - it may push this into 'infrastructure-limited' territory) use a non-DWC vocab and translate. E.g., if y'all really like 'Central America' as a continent then we could push it and North America to 'North and Central America' on export. (Or maybe that's a horrible idea which just ensures that someone finding something in iDigBio can't find it in Arctos and vice versa.)
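A minimal sketch of that translate-on-export idea, using only the one example given here. The mapping table and function names are hypothetical; nothing like this exists in Arctos as far as this thread says.

```python
# Local (Arctos) continent value -> value pushed to aggregators.
# Only the example from this comment is included.
EXPORT_CONTINENT = {
    "Central America": "North and Central America",
    "North America": "North and Central America",
}


def continent_for_export(local_value):
    """Translate a local continent value; unmapped values pass through."""
    return EXPORT_CONTINENT.get(local_value, local_value)


def local_values_for(exported_value):
    """Reverse lookup: which local values could have produced this export?"""
    matches = sorted(k for k, v in EXPORT_CONTINENT.items() if v == exported_value)
    return matches or [exported_value]
```

The reverse lookup shows the downside: one exported value fans out to two local values, so a search term found in an aggregator no longer matches a single term in Arctos.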

if the location itself is not in the water, dwc:waterbody should be left empty, otherwise we end up with some incongruent assertions some day when the semantics become rigorously important.

#1107 - we regularly violate this principle and seem resistant to stopping that.

Continent: ...suggest The Getty Thesaurus of Geographic Names (TGN) as the source...Oceania...does not include the oceans.

Maybe that's correct and Oceania only refers to the dirt-parts??

dwc:waterbody is a lot more broad than dwc:continent, as it can include everything from a pond to an ocean. Some use it for drainage basin systems

I'd say that's just wrong (and that's why we've added "drainage" and not "waterbody" to the geography table). There's a LOT of stuff in "Cimarron River Drainage" which isn't anywhere near the Cimarron River (or any other water!).

And #1366 is still unanswered, but I don't think a pond is included within what we generally see as geography. Maybe that's an indication that trying to draw a line between geography and locality is not a useful thing to do.

And I'd like to amend my assertion above: what we NEED is a lookup service which turns shapes into whatever sort of text string anyone might want. (We already have that, but it's not very good, not very structured, and not very exposed - it just supports "any geog" queries, and it does so from points. We also have services to turn strings into coordinates, but that quickly becomes circular - at least sometimes, I'm inclined to support our current model which treats those coordinates as suggestions and relies on a person to accept them as "data.")
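To make the "shapes into text strings" idea concrete, here is a small sketch of a point lookup against named polygons using a standard ray-casting test. The shapes are made-up rectangles, the names are hypothetical, and the coordinates are treated as planar, so this ignores the antimeridian, the poles, and everything else a real spatial service handles.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: count how many polygon edges a ray running in
    the +x direction from the point crosses. Odd means inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the ray's y value can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def names_for_point(point, shapes):
    """Return every name whose shape contains the point, sorted."""
    return sorted(name for name, poly in shapes.items() if point_in_polygon(point, poly))


# Made-up overlapping rectangles standing in for real geography, given as (x, y).
EXAMPLE_SHAPES = {
    "Box A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "Box B": [(5, 5), (15, 5), (15, 15), (5, 15)],
}
```

Because shapes can overlap (a county, a drainage, an EEZ), the lookup returns a list rather than one answer, and the caller picks whichever kind of string it wants.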

@Jegelewicz
Member Author

See also tdwg/bdq#172

After looking into this - I have to agree that our current "Higher Geography" is misleading in searches.

DMNS:Bird:18967 provides a good example. Its higher geography is: Atlantic Ocean, United Kingdom, South Georgia & South Sandwich Islands, South Georgia Islands, South Georgia

As John W. points out, an island is not part of the ocean (a water body). iDigBio moves this specimen to:
Europe, United Kingdom, South Georgia & South Sandwich Islands, South Georgia Islands, South Georgia because the United Kingdom is in Europe.

If we were following the ISO 3166 codes, we would have a higher geography of:

AN GS SGS 239 South Georgia and the South Sandwich Islands (dependent state)

AN = Antarctica
GS = South Georgia and the South Sandwich Islands
SGS = South Georgia and the South Sandwich Islands
239 = South Georgia and the South Sandwich Islands

Which makes sense if you are searching by continent or country.

ISO 3166 would be far more stable than Wikipedia and we would stop the madness of finding Magellanic Penguins in the United Kingdom (which most certainly happens in Arctos).

@dustymc
Contributor

dustymc commented Sep 26, 2018

Here's your link - click "requery" on the "show/hide" widget to get a URL. http://arctos.database.museum/SpecimenResults.cfm?scientific_name=Spheniscus%20magellanicus&scientific_name_scope=currentID&scientific_name_match_type=startswith&country=United%20Kingdom

I don't really have a problem with those data - the UK is a political entity, not a place. More on that below...

I dislike ISO codes as they line up with our data; the intent/meaning is drastically different. We record (sometimes...) what was there when the specimen was collected (or georeferenced, or when the label was printed, or ...), ISO codes refer to something else, those don't always have much to do with each other, and we don't have the resources to update our data when something changes. "Yugoslavia" could refer to lots of shapes (https://www.youtube.com/watch?v=Ic5tBXESxl8) while ISO 3166-1:890 is 1) just https://en.wikipedia.org/wiki/Socialist_Federal_Republic_of_Yugoslavia#/media/File:Yugoslavia_1956-1990.svg, and 2) a withdrawn code.

because the United Kingdom is in Europe

One problem is that we (and GBIF, apparently) have a crazy mix of geography and politics in the data, and often no way to tell them apart. The UK is most certainly not (entirely) in Europe, nor does the name have any sort of spatiotemporal stability.

an island is not part of the ocean

That brings up the question of where exactly the island ends and the ocean begins. Mean high tide, the exclusive economic zone (for island nations), some arbitrary point established by some historical event, the place where the collector felt they were no longer close enough to the island to record that, ... ?

I'm not sure there's a One True Method for any of that which involves strings. It's all fairly trivial with georeferences - just ask some service capable of responding with the data you want. Theoretically anyway - hard to say what might happen with this input:

[Screenshot, 2018-09-26]

@Jegelewicz
Member Author

See tdwg/dwc-qa#128 (comment)

@Jegelewicz
Member Author

Taxonomy Committee had a brief discussion about this. People searching at VertNet, GBIF and iDigBio will not find some Arctos records due to mismatches between the Continents we use in Arctos and those they use (apparently a standard set); see tdwg/dwc-qa#128 (comment).

Although it would be a lot of work, I think we need to review all higher geography that uses an ocean as the "continent". As John W. pointed out, Hawaii is not part of the Pacific Ocean (it is not water) and if we are sticking with political divisions for higher geography, then Hawaii should be part of North America. see also #1291 (comment).

I also think we should consider how our continents map to those used by the aggregators:

| Arctos | Aggregators |
| --- | --- |
| Africa | Africa |
| Americas | |
| Antarctica | Antarctica |
| Arctic Ocean | |
| Asia | Asia |
| Atlantic Ocean | |
| Australia | Oceania |
| Central America | |
| Eurasia | |
| Europe | Europe |
| Indian Ocean | |
| North America | North America |
| North Atlantic Ocean | |
| North Pacific Ocean | |
| Pacific Ocean | |
| South America | South America |
| South Atlantic Ocean | |
| Southern Ocean | |
| South Pacific Ocean | |
| West Indies | |

Everything that we have in any of the oceans is likely lost in many searches of aggregators and that could be a lot of things.
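To put a number on that, here is a quick sketch that compares the two columns of the list above and reports which Arctos values have no aggregator counterpart. The values are copied from the list; the variable and function names are made up for the example.

```python
ARCTOS_CONTINENTS = [
    "Africa", "Americas", "Antarctica", "Arctic Ocean", "Asia",
    "Atlantic Ocean", "Australia", "Central America", "Eurasia", "Europe",
    "Indian Ocean", "North America", "North Atlantic Ocean",
    "North Pacific Ocean", "Pacific Ocean", "South America",
    "South Atlantic Ocean", "Southern Ocean", "South Pacific Ocean",
    "West Indies",
]

AGGREGATOR_CONTINENTS = {
    "Africa", "Antarctica", "Asia", "Europe",
    "North America", "Oceania", "South America",
}

# The only pair in the list that matches under a different name.
RENAMED = {"Australia": "Oceania"}


def unmatched_continents(arctos_values):
    """Return the Arctos values with no aggregator equivalent."""
    return [
        value for value in arctos_values
        if RENAMED.get(value, value) not in AGGREGATOR_CONTINENTS
    ]
```

Calling `unmatched_continents(ARCTOS_CONTINENTS)` returns the values an aggregator has nothing to line up with, which is every ocean plus Americas, Central America, Eurasia and West Indies.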

Actually, I find our continent/ocean list a bit perplexing...why did we decide to make the West Indies a continent?

The West Indies is a subregion of North America - https://en.wikipedia.org/wiki/West_Indies

How is that any different from "Patagonia"?

@Jegelewicz Jegelewicz removed the Priority-High (Needed for work) High because this is causing a delay in important collection work.. label Aug 20, 2020
@tucotuco

Hi folks, rather than rehash what I think are the issues with how GBIF interprets continent, I urge you to read the issue I presented to them, as it will explain a lot about why you see what you see in GBIF.

@tucotuco

implies all our specimens are marine

That sort of confounded assumption can be nothing but a recipe for bad inferences.

Careful everyone. The VertNet principle of best practice suggests how to do it, it does not say that everyone has done it, or that an assumption to that effect is sage or safe.

@dustymc
Contributor

dustymc commented Aug 21, 2020

how to do it

I think that's our primary question here.

  1. Given a blank slate, what's our geography model look like? (Actually not that radical of an idea - geography is just a foreign key from most of Arctos.)
  2. Given the model we should have and a dot on the map, how do we select appropriate geography?

Second is how aggregators and other not-us users interpret those data. The easy solution to that is to just share a model.

@tucotuco

To me it needs two parts, the shapes and the thesaurus that connects to it. One could approach geography from the spatio-temporal perspective or from the names perspective. You could do things like:

reverse geocoding: Tell me the standard administrative region names for this point (at this time). Here is an example that uses GADM - https://api.gbif-uat.org/v1/geocode/reverse?lat=48.17156&lng=1.18177.

get preferred name - I wanna search on the name of a place as I know it and let something translate that into the preferred name used in an index so I get everything I am looking for. This would take a combination of something like TGN (http://www.getty.edu/vow/TGNServlet?english=Y&find=Sudamerica&place=&page=1&nation=), which does have web services now, and an index that actually is standardized against the preferred names.
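For the reverse geocoding case, a request to the endpoint linked above could be built along these lines. The URL and the `lat`/`lng` parameters come from that link; the response shape assumed in `names_from_response` (a JSON list of objects with a `name` key) is an assumption and should be checked against the service. The sketch only builds the URL and parses an already-decoded response, so the actual HTTP call is left out.

```python
from urllib.parse import urlencode

# Endpoint taken from the example link in this comment (a test/UAT host).
GEOCODE_ENDPOINT = "https://api.gbif-uat.org/v1/geocode/reverse"


def reverse_geocode_url(lat, lng, endpoint=GEOCODE_ENDPOINT):
    """Build the reverse-geocoding URL for a point in decimal degrees."""
    if not -90 <= lat <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= lng <= 180:
        raise ValueError("longitude must be between -180 and 180")
    return endpoint + "?" + urlencode({"lat": lat, "lng": lng})


def names_from_response(payload):
    """Pull region names out of a decoded JSON response.
    ASSUMPTION: payload is a list of dicts that each have a 'name' key."""
    return [item["name"] for item in payload if "name" in item]
```

The preferred-name case would sit in front of a lookup like this: translate whatever the user typed into the preferred term first, then search an index that is standardized on those preferred terms.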

@sharpphyl

@dustymc
Let's see if I understand the above links. These are reverse geocoding of coordinates moving from a point within the US boundary out into the US Exclusive Economic Zone and beyond into the Pacific Ocean. Would this add the EEZs as part of higher geography and thus tie both to the political entity that controls the EEZ and the ocean it is in? That certainly has promise and I don't immediately see an issue. Would it improve how GBIF interprets our data? I think @mkoo has suggested using EEZs before.

@dustymc
Contributor

dustymc commented Aug 31, 2020

add the EEZs as part of higher geography

That's a possibility. I was thinking more radically, but I'm not sure how realistic anything is.

If we do something, we'd need to do something consistent. It looks like they end 'continent' right about the Golden Gate Bridge - you OK with that?

The Farallons are part of SF County; adopting enough of this would leave us with a transcontinental county, which doesn't seem ideal.

and thus tie both to the political entity that controls the EEZ and the ocean it is in?

Seems a bit optimistic, but maybe. Would be useful to see their basemap rather than trying to reverse engineer it.

Would it improve how GBIF interprets our data?

It might - presumably they built this for their own use.

@Jegelewicz
Member Author

move all that to either Europe or Asia

Russia really is big! Merging those to Eurasia is trivial. Splitting Russia is not. NULL continent may be less-evil than merges (or not, IDK)

There are only 3 HG entries with Eurasia, Russia

@campmlc

campmlc commented Sep 1, 2020

Create an Uber-geog level above continent just for Eurasia?

@tucotuco

tucotuco commented Sep 2, 2020 via email

@dustymc
Contributor

dustymc commented Sep 2, 2020

only 3 HG

I don't understand why that matters. The most precise information we have doesn't fit into the normal "hierarchy" (it's not, because the world isn't).

  • We could accept that continent-->country is two different kinds of THINGs and should not be expected to be consistent. This to me looks like the reality we should embrace.

  • We could lose the precision altogether and dump everything into Eurasia. That will toss out some data for "Russia, that's all we know" records, and won't do anything for Hawaii being inconveniently located.

  • We could do something truly evil - reject records which don't meet our expectations of how the world should have been put together or something.

@Jegelewicz
Member Author

We could accept that continent-->country is two different kinds of THINGs and should not be expected to be consistent. This to me looks like the reality we should embrace.

I agree that this is what we should be doing. The only issue arises when we have a locality = "Russia" (or does it?). In this case, I would suggest that HG = no higher geography and that "Russia" be included in Specific Locality, OR there should be two localities provided: one with HG = Asia, Russia and one with HG = Europe, Russia.

@Jegelewicz
Member Author

Also, I can figure out the 3 Russia HG in Eurasia and put them on the appropriate continent.

@dustymc
Contributor

dustymc commented Sep 2, 2020

HG = no higher geography and that "Russia" be included in Specific Locality

I think that's in my "evil" category - it's purposefully "demoting" data to meet our unrealistic expectations.

two localities

That works for search, might not be evil, still seems pretty janky to me.

figure out the 3 Russia HG in Eurasia

That does not seem possible.

One is a country that spans both.

One is a former, bigger, country that spans both.

One has this:

[Screenshot, 2020-09-02]

@Jegelewicz
Member Author

HG = no higher geography and that "Russia" be included in Specific Locality

I think that's in my "evil" category - it's purposefully "demoting" data to meet our unrealistic expectations.

I think that using Eurasia is every bit as evil.

two localities

That works for search, might not be evil, still seems pretty janky to me.

Janky, maybe, but it gets the job done (IMO - could be completely wrong).

figure out the 3 Russia HG in Eurasia

That does not seem possible.

One is a country that spans both.

See first comment above. We have "Asia, Russia" and "Europe, Russia". Assign two events with both localities to the records that use "Eurasia, Russia". BTW, I think some of these could have more appropriate HG

image

One is a former, bigger, country that spans both.

Aren't we supposed to be using "current" HG? Some of these could be made better, and for the rest "no higher geography" with Soviet Union in the spec loc seems not so evil, since they are just vague anyway.

image

One has this:

See fix as applied to "Russia". Also, pretty sure these could be sorted onto the correct continent, since they have coordinates...

image

@sharpphyl

It looks like they end 'continent' right about the Golden Gate Bridge - you OK with that?

It would be nice to have a bit of wiggle room so our coordinates could be 100' off shore and not create an out-of-bounds, but if we had EEZs to work with right off the bridge, it would probably be ok.

This issue has gained a lot of Where's Russia? influence so maybe the rest of this comment belongs elsewhere, but it's related to the question of how to deal with offshore locations.

A consortium of Museums (I don't think any are in Arctos) recently received a grant https://www.nsf.gov/awardsearch/showAward?AWD_ID=2001510&HistoricalAwards=false that is focused on geolocating specimens on the US eastern seaboard. Here is part of their proposal: This project will generate reliable geo-coordinate data for all covered specimen lots using a collaborative georeferencing project in GeoLocate. GeoLocate will add layers for bathymetric data, benthic habitat, and marine conservation areas. Incorporating bathymetry into GeoLocate to determine the extent of locations will also provide that capability for complex elevational data for terrestrial species....The data will be shared through public data repositories, including iDigBio, GBIF, OBIS, and the InvertEBase Symbiota portal.

I asked Dr. José Leal at the National Shell Museum, one of the participants, if, in addition to geolocating specimens more precisely, the project would result in a marine locality structure that could be used by other museums with specimens from similar locations. His reply: Yes, that is the idea. We have Nelson Rios from Geolocate as a PI in the grant, so some of the more technical questions will be resolved by him on this. For marine localities we'll be adding station coordinates (which is nothing new), but still need to resolve how to handle "stations" without coordinates ("off Cape Sable", etc.)

Not sure there's anything in the work they are doing that will be helpful for us, but I thought I'd add it to the stew just in case.

@dustymc
Contributor

dustymc commented Sep 2, 2020

Assign two events

Taken to extremes, would that require a "France, 1800" record to have about 80 determinations?

supposed to be using "current" HG?

That idea died an agonizing death under the pressure of reality; it's a nice ideal, but it would require a tremendous amount of work every time someone moves a border.

vague anyway

It's less vague than the alternatives.

Eurasia is every bit as evil.

It does not involve discarding data, so I have to disagree. Splitting Sverdlovsk Oblast or San Francisco County across two made-up pigeonholes doesn't seem terribly conducive to discovery, nor does dumping Norway and India into one made-up pigeonhole. I have no idea what we should do, but I do not think it will involve removing precision at any scale.

@Jegelewicz Jegelewicz added Priority-High (Needed for work) High because this is causing a delay in important collection work.. and removed Priority-Critical (Arctos is broken) Critical because it is breaking functionality. labels Jul 23, 2021
@Jegelewicz Jegelewicz added Priority-Cancelled Issue as stated was not approved for implementation by the community. and removed Priority-High (Needed for work) High because this is causing a delay in important collection work.. labels Nov 2, 2021
@Jegelewicz
Member Author

Closing as we are not addressing the original issue.

Geography in Arctos automation moved this from To Do to Done Nov 2, 2021
8 participants