Geography Proposal #3272

Closed
dustymc opened this issue Dec 3, 2020 · 35 comments
Labels
Enhancement I think this would make Arctos even awesomer! Function-Locality/Event/Georeferencing Help wanted I have a question on how to use Arctos

Comments

Contributor

dustymc commented Dec 3, 2020

Background

I can see no evidence that the recent efforts in geography cleanup have resulted in more discoverable catalog record data, which I presume to be a core use case for maintaining geography. It's still possible for data entry personnel to assign arbitrary geography to records, and without consistency, predictable geography text search results are not possible. See #3249 for example.

#3186 is a proposal to find more consistency in these data, but it will result in significantly reduced functionality in several areas. I don't see this as an acceptable tradeoff, and I don't think Curators will or should either.

Our current geography model does offer several valuable tools for georeferencing and confirming that georeferences fall within specified geography areas, but this still does not provide a consistent mechanism for locating cataloged records by geography.

Arctos has for some time been using various webservices to find coordinates for records without them, and to associate coordinates (both asserted and derived) with place names from various webservices. This is useful for search, but there is no formality or consistency in these data; they're just search strings.

Proposal

  1. Retain the existing geography model, which allows "traditional" curatorial assertions (which support various internal functions - organizing material by Quad, for example).

  2. Split the derived geography out into a separate, structured, formal table. This would allow consistent searching - all records from http://www.geonames.org/5880054/barrow.html would be discoverable as "United States","Alaska" and "North Slope" for example. For contrast, current data would require somewhere between three and 16 queries (depending on level) to find the desired "Barrow-ish" records.


 CONTINENT_OCEAN              | COUNTRY       | STATE_PROV | COUNTY              | QUAD          | SEA
------------------------------+---------------+------------+---------------------+---------------+--------------
 Arctic Ocean                 |               |            |                     |               | Beaufort Sea
 Arctic Ocean                 |               |            |                     |               | Chukchi Sea
 Arctic Ocean                 |               |            |                     |               |
 no higher geography recorded |               |            |                     |               |
 North America                | United States | Alaska     | North Slope Borough |               |
 North America                | United States | Alaska     |                     | Barrow        | Beaufort Sea
 North America                | United States | Alaska     |                     | Barrow        | Chukchi Sea
 North America                | United States | Alaska     |                     | Barrow        |
 North America                | United States | Alaska     |                     | Barter Island | Beaufort Sea
 North America                | United States | Alaska     |                     | Iliamna       |
 North America                | United States | Alaska     |                     | Meade River   |
 North America                | United States | Alaska     |                     | St. Lawrence  | Bering Sea
 North America                | United States | Alaska     |                     | Teshekpuk     |
 North America                | United States | Alaska     |                     |               | Beaufort Sea
 North America                | United States | Alaska     |                     |               | Chukchi Sea
 North America                | United States | Alaska     |                     |               |

Implications

This would immediately result in more discoverable (by virtue of consistency) data in Arctos. One query - rather than the currently-required 16 - would find records from Barrow.

Longer term, we could discuss making these data more visible, perhaps sharing them via DWC, etc. This is essentially an implementation of #3186 but as an enhancement rather than a replacement.

This approach also has significant future-proof qualities. A county's new name will become available for searching as soon as it's entered into a service we use, with no curatorial work involved. Using a new/better/specialized service would be a matter of making Arctos aware of it.

No changes would be required to catalog new material.

Future changes to "curatorial geography" would not be so wide-ranging; we might be able to more readily accommodate curatorial needs without reducing functionality to users.

In short, I think this would result in drastically more discoverable data with no additional curatorial work, and without asking Curators to give up anything. It would also retain all of the work we've put into cleaning and organizing geography.

Followup

This approach would rely on coordinates to retrieve the consistent geography data, and so I also propose that we make the derived coordinates more visible, and more available to collections who wish to use them, as an immediate followup. It would be trivial to create a georeferenced Specimen Event for cataloged records without one, for example. This would not be a particularly "good" georeference, but it would make any problems much more discoverable by providing a path to spatial tools, and could be flagged as automation in various ways (a new value in https://arctos.database.museum/info/ctDocumentation.cfm?table=ctverificationstatus is perhaps most "filter-able").

For scale, Arctos currently holds 688778 localities, 496467 (72%) of which have curatorial coordinate assertions. 668709 (97%) have service-derived coordinate assertions.

Related Issues

In no particular order. I got overwhelmed and gave up trying to better organize these, you can too! There are a few "themes" in these, but they're often broad and intermingled.

  1. Some Issues are incorporated in this proposal. There's nothing new here, it's just a no-compromises merger of existing ideas. Restructuring geography, incorporating various Standards and Services, and being a more involved member of the larger community are inevitable, for example.

  2. Some Issues become less important if not irrelevant under this proposal. Choosing curatorial functionality over discovery has little impact with this 2-part approach. Inconsistent data has a much shorter reach. Using "modern" geography is not as pressing, perhaps not even desirable. Lacking a universal definition of geography or idea of the goals is not necessary.

  3. Some Issues change very little, or not at all, under this. Adding spatial data will enable the same awesomeness under this proposal, for example.

Structure

Table formal_geography could take two general shapes.

A normalized structure would provide more flexibility, but is more difficult and expensive to query:

formal_geography_id serial
term varchar not null
rank varchar null
order int not null
source varchar not null
metadata various

This would support any number of terms of any rank (including none), and would generally be more capable of representing whatever comes in from Services (including that cool new thing which hasn't been built yet). It would also be expensive to query, difficult to access, impractical to flatten, and perhaps difficult to "translate" (eg, we end up with 12 ways of saying "country" from various sources).

A more flattened approach would serve the core use case of discoverability, could be treated like a spreadsheet for various purposes, but would not be completely faithful to service data.

formal_geography_id serial
term_1 varchar <--- map continent-level-ish data here
term_2 varchar <--- map country-level-ish data here
term_3 varchar <--- map state-level-ish data here
term_4 varchar <--- map county-level-ish data here
term_5 varchar <--- map municipality-level-ish data here
source varchar not null
metadata various

Both would require some way to tie to "core" or "curatorial" data (probably Locality). A linking table would provide a mechanism to tie many assertions to a locality, which seems necessary, and a mechanism to tie many localities to an assertion (which could reduce the data we must store, but I don't anticipate using this direction).

geo_link_id serial
formal_geography_id fkey-->formal_geography
locality_id fkey-->locality

An alternate would be adding locality_id fkey-->locality directly into the formal_geography table, which might make sense with the flatter version.
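A minimal sketch of the normalized shape plus the linking table, using Python's built-in sqlite3 only so it runs anywhere (the serial type above suggests the real thing would be PostgreSQL). The columns rank and order are renamed term_rank and term_order here to stay clear of SQL keywords, the metadata column is left out, and the locality table, its label column, all IDs, and all rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE locality (
    locality_id INTEGER PRIMARY KEY,
    label TEXT                       -- hypothetical stand-in for locality data
);
CREATE TABLE formal_geography (
    formal_geography_id INTEGER PRIMARY KEY,
    term TEXT NOT NULL,
    term_rank TEXT,                  -- "rank" in the proposal
    term_order INTEGER NOT NULL,     -- "order" in the proposal
    source TEXT NOT NULL
);
CREATE TABLE geo_link (
    geo_link_id INTEGER PRIMARY KEY,
    formal_geography_id INTEGER NOT NULL
        REFERENCES formal_geography (formal_geography_id),
    locality_id INTEGER NOT NULL REFERENCES locality (locality_id)
);
""")

# Made-up rows: localities 1 and 2 are "Barrow-ish", locality 3 is elsewhere in Alaska.
conn.executemany("INSERT INTO locality VALUES (?, ?)",
                 [(1, "example A"), (2, "example B"), (3, "example C")])
conn.executemany("INSERT INTO formal_geography VALUES (?, ?, ?, ?, ?)", [
    (10, "United States", "country", 1, "example service"),
    (11, "Alaska", "state", 2, "example service"),
    (12, "North Slope", "county", 3, "example service"),
])
conn.executemany(
    "INSERT INTO geo_link (formal_geography_id, locality_id) VALUES (?, ?)",
    [(10, 1), (11, 1), (12, 1), (10, 2), (11, 2), (12, 2), (10, 3), (11, 3)],
)


def localities_for_term(conn, term):
    """Return the IDs of all localities linked to the given term, in one query."""
    rows = conn.execute(
        """SELECT DISTINCT gl.locality_id
             FROM geo_link gl
             JOIN formal_geography fg
               ON fg.formal_geography_id = gl.formal_geography_id
            WHERE fg.term = ?
            ORDER BY gl.locality_id""",
        (term,),
    ).fetchall()
    return [r[0] for r in rows]


print(localities_for_term(conn, "North Slope"))
print(localities_for_term(conn, "Alaska"))
```

The link table is what lets several localities share one term row; with locality_id stored directly in formal_geography the join disappears but each term is repeated per locality.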

@dustymc dustymc added Function-Locality/Event/Georeferencing Enhancement I think this would make Arctos even awesomer! Help wanted I have a question on how to use Arctos Service-related labels Dec 3, 2020
@dustymc dustymc added this to the Needs Discussion milestone Dec 3, 2020
Contributor Author

dustymc commented Dec 3, 2020

Issues meeting:

  • allow eg "just use GADM" as geog for data entry-->don't assert anything, just pull from coordinates-at-source

has potential, implement, gather some data, expose internally and in limited scope (eg, from higher geog edit page), then analyze and decide how to proceed

AWG: Go


tucotuco commented Dec 3, 2020

This is a goldmine. I am going to blithely steal from it as I work on the Locality Services.

Contributor Author

dustymc commented Dec 3, 2020

> as I work on the Locality Services.

Build it, and we shall steal....


tucotuco commented Dec 3, 2020 via email

Contributor Author

dustymc commented Dec 9, 2020

I went with a fairly-normalized model, should be pretty easy to shuffle things around if it causes some sort of problem.

create table place_terms (
  place_term_id serial not null,
  locality_id bigint references locality(locality_id) on delete cascade,
  term_type varchar not null,
  term_value varchar not null,
  source varchar not null,
  last_date date default current_date
);

It's talking to Google, and keeping only

administrative_area_level_1,administrative_area_level_2,administrative_area_level_3,country

which are the only "geography-like" terms I could find in that particular API. That's easy to adjust if someone wants something else; Google seems to know a lot about rooftops...
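As a rough sketch of that filtering step, assuming a Geocoding-style JSON response in which each result carries address_components with long_name and types: the function name extract_place_terms and the sample response below are hypothetical and are not taken from the Arctos code.

```python
# Component types worth keeping as "geography-like" terms (from the list above).
KEEP_TYPES = (
    "country",
    "administrative_area_level_1",
    "administrative_area_level_2",
    "administrative_area_level_3",
)


def extract_place_terms(geocode_response, source="Google API"):
    """Return unique (term_type, term_value, source) tuples for kept types."""
    terms = []
    seen = set()
    for result in geocode_response.get("results", []):
        for component in result.get("address_components", []):
            for component_type in component.get("types", []):
                if component_type not in KEEP_TYPES:
                    continue
                key = (component_type, component["long_name"])
                if key not in seen:  # the same component repeats across results
                    seen.add(key)
                    terms.append((component_type, component["long_name"], source))
    return terms


# Hand-written sample in the assumed response shape; not real service output.
sample = {
    "results": [
        {"address_components": [
            {"long_name": "Jemez Springs", "short_name": "Jemez Springs",
             "types": ["locality", "political"]},
            {"long_name": "Sandoval County", "short_name": "Sandoval County",
             "types": ["administrative_area_level_2", "political"]},
            {"long_name": "New Mexico", "short_name": "NM",
             "types": ["administrative_area_level_1", "political"]},
            {"long_name": "United States", "short_name": "US",
             "types": ["country", "political"]},
        ]},
        {"address_components": [
            {"long_name": "New Mexico", "short_name": "NM",
             "types": ["administrative_area_level_1", "political"]},
        ]},
    ]
}

for term in extract_place_terms(sample):
    print(term)
```

Each returned tuple lines up with the term_type, term_value, and source columns of place_terms; adjusting what is kept means editing KEEP_TYPES.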

Plugging in to other APIs should be trivial, so if anyone knows of anything that'll take coordinates and return something that someone might consider geography, please let me know about it.

http://test.arctos.database.museum/place.cfm?action=detail&locality_id=1178173 looks like....

Screen Shot 2020-12-08 at 4 45 25 PM

It would be pretty easy to use those terms and/or ranks in search, assert them instead of or alongside "curatorial geography," or whatever turns out to be handy.
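One way search by those terms and ranks might look, sketched with sqlite3 standing in for the real database. The table mirrors place_terms above (serial becomes INTEGER PRIMARY KEY, the foreign key is dropped); the index, the find_localities helper, the locality IDs, and every row are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE place_terms (
    place_term_id INTEGER PRIMARY KEY,
    locality_id INTEGER NOT NULL,
    term_type TEXT NOT NULL,
    term_value TEXT NOT NULL,
    source TEXT NOT NULL,
    last_date TEXT DEFAULT CURRENT_DATE
);
CREATE INDEX ix_place_terms_lookup ON place_terms (term_type, term_value);
""")

# Made-up rows: 101 and 102 share a borough, 103 is elsewhere in the same state.
L1, L2 = "administrative_area_level_1", "administrative_area_level_2"
rows = []
for locality_id, borough in [(101, "North Slope Borough"),
                             (102, "North Slope Borough"),
                             (103, "Kodiak Island Borough")]:
    rows += [
        (locality_id, "country", "United States", "Google API"),
        (locality_id, L1, "Alaska", "Google API"),
        (locality_id, L2, borough, "Google API"),
    ]
conn.executemany(
    "INSERT INTO place_terms (locality_id, term_type, term_value, source) "
    "VALUES (?, ?, ?, ?)",
    rows,
)


def find_localities(conn, criteria):
    """Return localities matching every term_type -> term_value pair in criteria."""
    if not criteria:
        return []
    # Only placeholders are interpolated into the SQL; values are bound as parameters.
    clause = " OR ".join("(term_type = ? AND term_value = ?)" for _ in criteria)
    params = [item for pair in criteria.items() for item in pair]
    sql = f"""SELECT locality_id
                FROM place_terms
               WHERE {clause}
               GROUP BY locality_id
              HAVING COUNT(DISTINCT term_type) = ?
               ORDER BY locality_id"""
    return [r[0] for r in conn.execute(sql, params + [len(criteria)])]


print(find_localities(conn, {L1: "Alaska", L2: "North Slope Borough"}))
print(find_localities(conn, {"country": "United States"}))
```

Because every locality is described with the same small set of term types, a single query replaces the many curatorial-geography variants that would otherwise need to be searched.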

It won't be very interesting until some data are gathered. @mkoo if we have the bandwidth I could temporarily be more aggressive with the cacher after this goes to production, which might happen in a couple hours.


tucotuco commented Dec 9, 2020 via email

Contributor Author

dustymc commented Dec 9, 2020

Thx - I did eventually remember that...

Screen Shot 2020-12-09 at 1 41 01 PM

I've got it set to grab everything for now - I suspect we'll end up filtering and deleting some stuff at some point. Given the (vague and potential) intent of this, perhaps it's best to preemptively reject everything with distance>0?
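If rejecting distance>0 turns out to be the way to go, the filter itself is small. This sketch assumes each item returned by the reverse-geocode service is an object with type, title, and distance keys; the helper name and the sample values (including the non-zero distance) are made up.

```python
def containing_places(items):
    """Keep only places that contain the point, i.e. items with distance == 0.

    Items without a distance are treated as unknown and rejected.
    """
    kept = []
    for item in items:
        distance = item.get("distance")
        if distance is not None and float(distance) == 0.0:
            kept.append((item["type"], item["title"]))
    return kept


# Hand-written sample in the assumed shape; not real service output.
sample_items = [
    {"type": "GADM0", "title": "United States", "distance": 0.0},
    {"type": "GADM1", "title": "New Mexico", "distance": 0.0},
    {"type": "GADM2", "title": "Sandoval", "distance": 0.0},
    {"type": "GADM2", "title": "Bernalillo", "distance": 0.02},  # nearby, not containing
    {"type": "SeaVoX", "title": "NORTH AMERICA MAINLAND"},       # no distance given
]

print(containing_places(sample_items))
```

The tradeoff is that a point sitting just outside a polygon because of coordinate error loses that term entirely, so it may be worth keeping the raw distance alongside the term rather than discarding the row.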

Contributor Author

dustymc commented Dec 10, 2020

For the followup of making generated coordinates more visible, there's a new operator button on specimen detail for no-coordinate events. Two clicks...

Screen Shot 2020-12-09 at 3 57 18 PM

Screen Shot 2020-12-09 at 3 57 28 PM

...and...

Screen Shot 2020-12-09 at 4 02 36 PM

... happens. It's not a great georeference - there is no error calculation - but I've clicked the button perhaps 50 times and nothing meaningfully "wrong" has happened. (Maybe I'm bad at picking test cases!) There is a map available before the second click, should anyone want to review it before clicking - this is simply a new path to an old tool. The georeference will need further work to be suitable for all use cases, but it also makes the record available to spatial tools where it can be more efficiently improved; even horribly incorrect georeferences seem like an improvement from that perspective.

I'd be happy to talk about further lowering the bar, should anyone or everyone want magical coordinates without the clicking.


tucotuco commented Dec 10, 2020 via email

Contributor Author

dustymc commented Dec 10, 2020

> too simple.

Sounds scary, but I guess we could give it a try....

Done, in production, cache-checker-thingee is running a little harder than normal @mkoo

Contributor Author

dustymc commented Dec 10, 2020

This has processed ~20K localities so far, there's perhaps enough data for patterns to begin emerging.

https://arctos.database.museum/place.cfm?action=detail&locality_id=1116141 had just finished when I checked in, seems fairly normal.

Locality terms:

        term_value        |          term_type          |   source   
--------------------------+-----------------------------+------------
 United States of America | Political                   | GBIF API
 United States            | GADM0                       | GBIF API
 New Mexico               | GADM1                       | GBIF API
 Sandoval                 | GADM2                       | GBIF API
 NORTH AMERICA MAINLAND   | SeaVoX                      | GBIF API
 New Mexico               | WGSRPD                      | GBIF API
 United States            | country                     | Google API
 New Mexico               | administrative_area_level_1 | Google API
 Sandoval County          | administrative_area_level_2 | Google API
 Sandoval County          | political                   | Google API
 New Mexico               | political                   | Google API
 United States            | political                   | Google API
 Jemez Springs            | political                   | Google API

GBIF:GADM0, GBIF:GADM1, and GBIF:GADM2 pretty consistently form country:state/province:county; they seem like a suitable solution to #3186. Google:country, Google:administrative_area_level_1, and Google:administrative_area_level_2 could serve the same purpose. [Dis]agreement between those things could be a useful metric.
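A sketch of what such an agreement metric could look like, fed with the rows from the table above. The pairing of ranks, the suffix stripping (so "Sandoval" matches "Sandoval County"), and the function names are all assumptions; nothing like this is described as existing in Arctos.

```python
# Which GBIF term_type is compared with which Google term_type (assumed pairing).
RANK_PAIRS = [
    ("GADM0", "country"),
    ("GADM1", "administrative_area_level_1"),
    ("GADM2", "administrative_area_level_2"),
]
SUFFIXES = (" county", " borough", " parish")  # assumed, far from exhaustive


def normalize(name):
    """Lowercase a place name and strip a trailing administrative suffix."""
    cleaned = name.strip().lower()
    for suffix in SUFFIXES:
        if cleaned.endswith(suffix):
            return cleaned[: -len(suffix)]
    return cleaned


def agreement(terms):
    """Fraction of comparable rank pairs on which GBIF and Google agree.

    terms is an iterable of (term_value, term_type, source) rows.
    Returns None when no rank has values from both services.
    """
    by_key = {}
    for value, term_type, source in terms:
        by_key.setdefault((source, term_type), set()).add(normalize(value))
    compared = matched = 0
    for gbif_type, google_type in RANK_PAIRS:
        gbif_names = by_key.get(("GBIF API", gbif_type))
        google_names = by_key.get(("Google API", google_type))
        if gbif_names and google_names:
            compared += 1
            if gbif_names & google_names:
                matched += 1
    return matched / compared if compared else None


locality_terms = [
    ("United States", "GADM0", "GBIF API"),
    ("New Mexico", "GADM1", "GBIF API"),
    ("Sandoval", "GADM2", "GBIF API"),
    ("United States", "country", "Google API"),
    ("New Mexico", "administrative_area_level_1", "Google API"),
    ("Sandoval County", "administrative_area_level_2", "Google API"),
]
print(agreement(locality_terms))
```

A score below 1 would flag localities where the two services place the same coordinates in different units, which is a cheap signal that the coordinates or the polygons deserve a look.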

This does seem capable of providing a consistent, limited set of search parameters which will return ALL (or the 97% I can get coordinates for) items from a placename.

The "all localities" report @mkoo asked for is a decent reflection of the all-localities map.

Screen Shot 2020-12-10 at 8 47 52 AM

        term_value        |          term_type          |   source   
--------------------------+-----------------------------+------------
 United States of America | Political                   | GBIF API
 United States            | GADM0                       | GBIF API
 Michigan                 | GADM1                       | GBIF API
 New Mexico               | GADM1                       | GBIF API
 Los Alamos               | GADM2                       | GBIF API
 Sandoval                 | GADM2                       | GBIF API
 Roosevelt                | GADM2                       | GBIF API
 Otero                    | GADM2                       | GBIF API
 Santa Fe                 | GADM2                       | GBIF API
 Taos                     | GADM2                       | GBIF API
 Rio Arriba               | GADM2                       | GBIF API
 Wexford                  | GADM2                       | GBIF API
 NORTH AMERICA MAINLAND   | SeaVoX                      | GBIF API
 New Mexico               | WGSRPD                      | GBIF API
 Michigan                 | WGSRPD                      | GBIF API
 United States            | country                     | Google API
 New Mexico               | administrative_area_level_1 | Google API
 Michigan                 | administrative_area_level_1 | Google API
 Colfax County            | administrative_area_level_2 | Google API
 Roosevelt County         | administrative_area_level_2 | Google API
 Sandoval County          | administrative_area_level_2 | Google API
 Los Alamos County        | administrative_area_level_2 | Google API
 Otero County             | administrative_area_level_2 | Google API
 Santa Fe County          | administrative_area_level_2 | Google API
 Taos County              | administrative_area_level_2 | Google API
 Wexford County           | administrative_area_level_2 | Google API
 Rio Arriba County        | administrative_area_level_2 | Google API
 Boon Township            | administrative_area_level_3 | Google API
 Sandoval County          | political                   | Google API
 San Ildefonso Pueblo     | political                   | Google API
 San Luis                 | political                   | Google API
 San Pedro                | political                   | Google API
 Santa Fe                 | political                   | Google API
 Santa Fe County          | political                   | Google API
 San Ysidro               | political                   | Google API
 Taos                     | political                   | Google API
 Taos County              | political                   | Google API
 United States            | political                   | Google API
 Wexford County           | political                   | Google API
 Algodones                | political                   | Google API
 White Rock               | political                   | Google API
 Angel Fire               | political                   | Google API
 Boon                     | political                   | Google API
 Boon Township            | political                   | Google API
 Budaghers                | political                   | Google API
 Cloudcroft               | political                   | Google API
 Cochiti Lake             | political                   | Google API
 Colfax County            | political                   | Google API
 Corrales                 | political                   | Google API
 Coyote                   | political                   | Google API
 Cuba                     | political                   | Google API
 Golden                   | political                   | Google API
 Jemez Pueblo             | political                   | Google API
 Jemez Springs            | political                   | Google API
 La Jara                  | political                   | Google API
 Los Alamos               | political                   | Google API
 Los Alamos County        | political                   | Google API
 Mescalero                | political                   | Google API
 Michigan                 | political                   | Google API
 New Mexico               | political                   | Google API
 Otero County             | political                   | Google API
 Pep                      | political                   | Google API
 Placitas                 | political                   | Google API
 Questa                   | political                   | Google API
 Rio Arriba County        | political                   | Google API
 Rio Rancho               | political                   | Google API
 Roosevelt County         | political                   | Google API
 Sandia Park              | political                   | Google API

They're both all over the place; they might be useful for demonstrating that we need funding to resolve #1679, but they're not useful for addressing spatial questions.

There's some limited oceanic data in GBIF - https://arctos.database.museum/place.cfm?action=detail&locality_id=80080 is the first "mostly wet" locality I stumbled across, the service seems to be at least as useful as the asserted data. I think the important point for this is that figuring out marine things isn't an Arctos problem under this model, it's a community problem. If GBIF (who certainly has far more resources than Arctos) does something clever it'll magically find its way in to Arctos, if someone else does something we should be able to plug in to their API. @sharpphyl

This seems to be working far better than I'd expected. I suggest we begin thinking about how to make it available in the UIs, how to distinguish it from "curatorial geography," and perhaps even how to share it back to GBIF via DWC (which should stop the flagging that seems to annoy some users).

Contributor Author

dustymc commented Dec 10, 2020

https://arctos.database.museum/place.cfm?action=detail&locality_id=10824871 is interesting.

There's no WKT for the drainage-in-county.

Without something like #3108 (which would get at "in county" but not "in drainage") it's difficult to say if the coordinates are reasonable or not.

GBIF is returning "Bernalillo" for GADM2, strongly suggesting that the coordinate/curatorial geography alignment is in fact not reasonable.

While not a replacement for better WKT, this looks like it will expose useful ways of detecting low-quality data.

Contributor Author

dustymc commented Dec 11, 2020

Scattered links to place detail around a bit
Indexed the table
Added "Standardized Place Name" to specimendetail

Screen Shot 2020-12-10 at 3 27 48 PM

with some light styling to separate it from "data"

@Jegelewicz
Member

Nice.

Re: standardized place name - It isn't really the place name, it is the geography, right? Why smaller? Maybe find some other way to separate it, or just call it "Service Asserted Geography"? Also, how about a "more" link to that? Possible?

Maybe "Higher Geography" should be titled "Curatorial Asserted Higher Geography"? Or maybe we just need a section here that is "Curatorial Asserted" and another that is "Service Asserted" or something like that.

Contributor Author

dustymc commented Dec 11, 2020

> it is the geography

For now - yea, more or less, I think, whatever that means.....

Potentially, it's whatever we find at some place - certainly marine (no geo) stuff, maybe there's something cool in Google's rooftop data, whatever. I'm struggling to find a name that might accommodate that, suggestions greatly appreciated.

> "more" link

There are 2 in the area that will get you there. The one with locality is the more relevant, that may or may not say something useful about the label.

Screen Shot 2020-12-11 at 7 21 39 AM

> "Curatorial Asserted Higher Geography"

That's what it IS in my view, but we use higher_geog[raphy] in many places, and I don't want this to turn into something that someone finds offensive - I think that might be a little overly aggressive.

> Service Asserted

It's "Service-Derived" in /place - "Asserted" might be better - accurate, but does everyone know what that means?

@Jegelewicz
Member

> There are 2 in the area that will get you there. The one with locality is the more relevant, that may or may not say something useful about the label.

Those "more" links take you to more of the same thing. This is probably a bad example because the HG and the "Standardized Place Name" are essentially the same, but if the SPN was different from the HG, then I would assume that "more" would be a different set of stuff - No?

> "Curatorial Asserted Higher Geography"
>
> That's what it IS in my view, but we use higher_geog[raphy] in many places, and I don't want this to turn into something that someone finds offensive - I think that might be a little overly aggressive.

Verbatim?

> Service Asserted
>
> It's "Service-Derived" in /place - "Asserted" might be better - accurate, but does everyone know what that means?

Service Derived seems good.

Contributor Author

dustymc commented Dec 11, 2020

> different set of stuff

It will anyway - /place will have a table

Screen Shot 2020-12-11 at 7 59 19 AM

SpecimenDetail is just pulling a few terms from the data that make that table and concatenating them into a hopefully-familiar form.

Contributor Author

dustymc commented Jan 19, 2021

See https://arctos.database.museum/info/reviewAnnotation.cfm?ANNOTATION_GROUP_ID=37714

The webservice data is pulling in a nearby county, in this case unnecessarily/incorrectly. Can/should we do anything about that?

Contributor Author

dustymc commented Feb 2, 2021

This is now searchable in https://arctos.database.museum/SpecimenSearch.cfm

Contributor Author

dustymc commented May 28, 2021

How the webservices work is changing a bit (unless @mkoo has a dramatic change of heart!). This is running in test, will probably be in production tonight. The data will take some time to catch up.

GeoLocate is now the primary source of coordinate-from-text data, and it generally returns NULL (translation: "I have no idea what you're talking about") for variations of "No specific locality recorded". When I get a NULL return from GeoLocate I replace locality with the most precise available term from geography, which I think generally all comes together as an accidentally more sophisticated way of ignoring "No specific locality recorded". (I increment to the next geography "field" if that doesn't work, see below.)

"Most precise available term from geography" is currently "feature,quad,island,island_group,drainage,sea" - geography isn't very consistent at scale, and I don't think there's a "correct" ordering of those terms (or other stuff in the table), but I can easily rearrange them if anyone has better ideas.

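The fallback described above can be sketched roughly as follows. The geolocate argument is a hypothetical stand-in for the real GeoLocate call (which presumably also takes higher-geography parameters), simplified here to a function that takes one text string and returns coordinates or None; the stub and its coordinates are placeholders, not real data.

```python
# Order in which geography fields are tried when the locality text fails.
FALLBACK_ORDER = ("feature", "quad", "island", "island_group", "drainage", "sea")


def derive_coordinates(locality_text, geography, geolocate):
    """Try the locality text first, then each geography field in order.

    Returns (coordinates, field_used), or (None, None) if nothing resolves.
    """
    coordinates = geolocate(locality_text)
    if coordinates is not None:
        return coordinates, "locality"
    for field in FALLBACK_ORDER:
        term = geography.get(field)
        if not term:
            continue  # field is empty for this geography, move to the next one
        coordinates = geolocate(term)
        if coordinates is not None:
            return coordinates, field
    return None, None


# Stub service: only knows one place name; coordinates are placeholders.
KNOWN = {"Teshekpuk": (1.0, 2.0)}


def fake_geolocate(text):
    return KNOWN.get(text)


geography = {"feature": None, "quad": "Teshekpuk", "sea": "Beaufort Sea"}
print(derive_coordinates("No specific locality recorded", geography, fake_geolocate))
```

Reordering the fallback is a one-line change to FALLBACK_ORDER, and returning the field that was used makes it possible to record how each derived coordinate was obtained.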
I am now being more explicit in source. The locality detail page now looks like...

Screen Shot 2021-05-28 at 8 41 06 AM

note "asserted" (from curatorially-supplied coordinates) and "derived" (from coordinates I've produced from the text data).

The catalog record now looks like...

Screen Shot 2021-05-28 at 8 42 08 AM

  • The label is more distinct from verbatim locality
  • There's a distinct style (easy to change, should be developed and applied to all non-asserted data)
  • There's a mouseover with an explanation (also easy to change)

I don't think any of this is incompatible with the idea of "categorizing" localities (from a couple comments up); that would add another dimension to what we can use to detect conflicting data, and would still be useful (eg in ignoring terrestrial, overly precise, whatever terms) if we do want to assert a "standardized" place name at some point.

Contributor Author

dustymc commented Feb 18, 2022

I'm closing this. We're pulling in standardized geography data and it's available for search; it is not completely impossible to predictably find things by geography terms in Arctos, which has always been the primary goal. If a collection wants to take that farther, a new Issue can be opened.


5 participants