higher geography lookup is slow #2874

Closed
mvzhuang opened this issue Jul 3, 2020 · 24 comments
Assignees
Labels
Bug Arctos is not performing as it should. Component Loader Things involved in Round Five of the component loader discussions Function-DataEntry/Bulkloading Priority-Normal (Not urgent) Normal because this needs to get done but not immediately.

Comments

@mvzhuang

mvzhuang commented Jul 3, 2020

Issue Documentation is http://handbook.arctosdb.org/how_to/How-to-Use-Issues-in-Arctos.html

Describe the bug
higher geography lookup cleaning tool isn't working

To Reproduce

  1. Reports/Services
  2. higher geography lookup
    uploaded higher geography lookup for data cleaning and getting this error
    Tried it with old files that worked before and it's still throwing the same error
    http://arctos.database.museum/DataServices/geog_lookup.cfm?action=validate

Expected behavior
for selection of higher geography to show up

Screenshots
image

Data
If this involves external data, attach the actual data that caused the problem. Do not attach a transformation or subset. You may ZIP most formats to attach, or request a Box email address for very large files.

Desktop (please complete the following information):

  • OS: [e.g. iOS]
  • Browser [e.g. chrome, safari]
  • Version [e.g. 22]

Additional context
Add any other context about the problem here.
highergeog.xlsx

Priority
Github isn't letting me choose a label right now...

@mvzhuang
Author

mvzhuang commented Jul 8, 2020

@dustymc Dusty, I'm for some reason unable to add labels to issues. Did something change in permissions or something?

@dustymc dustymc added this to the Next Task milestone Jul 8, 2020
@dustymc
Contributor

dustymc commented Jul 8, 2020

@mkoo @Jegelewicz should Arctos Users have Write on ArctosDB/Arctos or is that some other Team (which @mvzhuang should be a part of)?

https://github.com/orgs/ArctosDB/teams/arctos-users/repositories

@mkoo
Member

mkoo commented Jul 8, 2020 via email

@mvzhuang mvzhuang added the Bug Arctos is not performing as it should. label Jul 8, 2020
@mvzhuang mvzhuang added Function-DataEntry/Bulkloading Priority-Normal (Not urgent) Normal because this needs to get done but not immediately. labels Jul 8, 2020
@mvzhuang
Author

mvzhuang commented Jul 8, 2020

Yay labels are fixed for me! Thanks!

@mkoo
Member

mkoo commented Jul 8, 2020

ok then fixed for Arctos Users group then!
thx for the issue

@mkoo mkoo closed this as completed Jul 8, 2020
@Jegelewicz Jegelewicz reopened this Jul 8, 2020
@Jegelewicz
Member

Yes labels work, but @dustymc still needs to resolve the issue....

@dustymc
Contributor

dustymc commented Jul 15, 2020

The original issue is fixed, but stripGeogRanks isn't performing adequately, and it's going to take some time to somehow address that.

Needs prioritized.

@dustymc
Contributor

dustymc commented Jul 16, 2020

Looks like PG's generated columns would serve this purpose, but that only exists in PG12 and my test box is PG11.

@dustymc
Contributor

dustymc commented Jul 27, 2020

Blocked by https://github.com/ArctosDB/internal/issues/65, going back to needs discussion

@dustymc
Contributor

dustymc commented Sep 9, 2020

Played with this some more. The issue seems to be that geography has grown a great deal, largely from the addition of "subquad" data in quad, and partly from e.g. #1278 ("minor" features are treated as geography).

I've reduced the defaults on the form so it's more functional, but it remains slow, albeit probably still orders of magnitude faster than not having the form.

Two obvious possibilities:

  1. Rethink geography - can we resolve "define geography" (#1366), and will that resolution result in fewer things being treated as geography?
  2. Do the heavy lifting at create/update - add "stripped_{thing}" columns (default stripgeogranks() ) for every column, and cache in them the data the service uses to predict intention. (PG12 has a very nice mechanism for this, but as noted above we don't have a PG12 test environment, so "now" is a bit optimistic unless we want to fall back to a more kludgy mechanism.)
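A rough sketch of what option 2 amounts to, in Python rather than in the database: normalize once when a row is written, store the result next to the raw value, and let lookups hit the cached value. Everything here is an assumption for illustration: `GeogTable`, `strip_ranks`, the three-word rank list, and the `stripped_county` key are hypothetical stand-ins, not the actual geog_auth_rec schema or the real stripgeogranks().

```python
import re

# Hypothetical, tiny rank list; the real stripgeogranks() rules are not shown in this thread.
RANK_WORDS = {"county", "province", "district"}


def strip_ranks(value: str) -> str:
    """Stand-in normalizer: lowercase, drop rank words, collapse whitespace."""
    words = re.findall(r"\w+", value.lower())
    return " ".join(w for w in words if w not in RANK_WORDS)


class GeogTable:
    """Toy stand-in for a geography table that caches stripped values on write."""

    def __init__(self):
        self.rows = []          # each row keeps the raw value and its stripped copy
        self.by_stripped = {}   # stripped county -> indexes into self.rows

    def insert(self, country: str, state_prov: str, county: str) -> None:
        row = {
            "country": country,
            "state_prov": state_prov,
            "county": county,
            # The normalization cost is paid once, at create/update time.
            "stripped_county": strip_ranks(county),
        }
        self.rows.append(row)
        self.by_stripped.setdefault(row["stripped_county"], []).append(len(self.rows) - 1)

    def find_county(self, messy: str):
        # Only the incoming value is normalized here; stored rows are not re-processed.
        return [self.rows[i] for i in self.by_stripped.get(strip_ranks(messy), [])]


table = GeogTable()
table.insert("United States", "California", "Tehama County")
hits = table.find_county("TEHAMA")
```

The trade-off is the one described above: extra stored data in exchange for not running the function over every row on every lookup.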

@dustymc dustymc changed the title from "higher geography lookup not working" to "higher geography lookup is slow" Sep 9, 2020
@dustymc dustymc added the Blocked Issue cannot be addressed until another Issue (which should be linked) is addressed. label Oct 2, 2020
@Jegelewicz
Member

@dustymc is this only an issue for the various components? So if I use option 2 and the strings I enter are only compared to the concatenated higher geog strings, would that be less problematic?

@dustymc
Contributor

dustymc commented Oct 13, 2021

I'm not sure, it probably is faster, but it's also a LOT less likely to figure things out when comparing big disorganized strings.

@Jegelewicz
Member

@dustymc Maybe we make the first step "is this string there?"

So, when I have

North America, Bering Sea, United States, Alaska, Pribilof Islands Quad, Pribilof Islands, Saint Paul Island

and that is already there - no further work is required, just say "in Arctos". If it isn't there, just say "FAIL" kinda the way the taxonomy name checker works. What this thing is currently doing is not going to be useful in any big set of data. I have 39 HGs and it returns them 2 at a time after about 5 minutes of processing - that means hitting refresh 20 times and waiting 100 minutes!
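A minimal sketch of that "is this string there?" first pass, assuming the existing higher geography strings have been pulled into a plain list. The function name `check_higher_geog` and the whitespace collapsing are assumptions added for illustration; the status strings are the ones suggested in the comment above.

```python
def check_higher_geog(incoming, known):
    """Report, per incoming string, whether it already exists verbatim."""

    def collapse(text):
        # Assumption: runs of whitespace should not cause a miss.
        return " ".join(text.split())

    known_set = {collapse(k) for k in known}
    return {hg: ("in Arctos" if collapse(hg) in known_set else "FAIL") for hg in incoming}


known = [
    "North America, Bering Sea, United States, Alaska, "
    "Pribilof Islands Quad, Pribilof Islands, Saint Paul Island",
]
incoming = [
    "North America, Bering Sea, United States, Alaska, "
    "Pribilof Islands Quad, Pribilof Islands, Saint Paul Island",
    "North America, United States, Califronia, Tehama County",
]
report = check_higher_geog(incoming, known)
```

Only the strings reported as "FAIL" would then need the slower suggestion step.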

@Jegelewicz
Member

And the last refresh I did gave me this:
image

What am I supposed to do here?

@Jegelewicz
Member

I mean, I see the misspelling in California - why is Tehama County the problem?

@dustymc
Contributor

dustymc commented Oct 14, 2021

"is this string there?"

You can probably just pull table geog_auth_rec for now - or not, I'm not sure, I can get it out if you can't.

What am I supposed to do here?

Type to pick - it's suggesting what it knows (or choking in the attempt, or something).

big set of data

I've cleaned a couple million records with it, but yea it's not ideal like it is. First question is whether we bother trying (and continue failing) to standardize geography at all. If we do, then we need to decide what "geography" means - the bajillion not-quite-quads (and waterbodies and maybe other stuff) are pluggin' the toobs, so we move them, or do a better job of organizing them, or cache more aggressively, or SOMETHING. If we get through all that, the "component loader" model (or something like it) does a good job of dealing with limited processors.

@dustymc
Contributor

dustymc commented Oct 19, 2021

Merging #1105 here - if we keep this, these need to be added to stripgeogranks:

Autonomous
and
Area
Atoll

canton
changwat
County
Counties
Census

Division
District

Hsien

Krai
kray

Municipo
Municipality

Oblast
of

Province
Prefecture

Region
Regional

state

United

Xiàn

accented characters (??)
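One way the list above could be applied, as a hedged Python sketch rather than the actual stripgeogranks implementation (which is a database function not shown in this thread). The term set is copied from the list; folding accents with `unicodedata` so that "Xiàn" and "xian" compare equal is an assumption about how the "accented characters (??)" item might be handled.

```python
import re
import unicodedata

# Terms from the list above, lowercased and accent-folded.
RANK_TERMS = {
    "autonomous", "and", "area", "atoll", "canton", "changwat", "county",
    "counties", "census", "division", "district", "hsien", "krai", "kray",
    "municipo", "municipality", "oblast", "of", "province", "prefecture",
    "region", "regional", "state", "united", "xian",
}


def fold_accents(text: str) -> str:
    """Decompose characters and drop combining marks (e.g. 'à' becomes 'a')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_geog_ranks(value: str) -> str:
    """Lowercase, fold accents, and remove rank terms as whole words."""
    folded = fold_accents(value).lower()
    words = re.findall(r"[a-z0-9]+", folded)
    return " ".join(w for w in words if w not in RANK_TERMS)


examples = {
    "Tehama County": strip_geog_ranks("Tehama County"),
    "Primorskiy Kray": strip_geog_ranks("Primorskiy Kray"),
    "Province of Québec": strip_geog_ranks("Province of Québec"),
}
```

Note that stripping whole words such as "United" and "of" changes more than rank suffixes ("United States" becomes "states"), so the stripped value is only useful for matching, not for display.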

@Jegelewicz
Member

@dustymc can we please make this better? See https://github.com/ArctosDB/data-migration/issues/1147

@dustymc dustymc modified the milestones: Needs Discussion, Next Task Mar 24, 2022
@dustymc
Contributor

dustymc commented Mar 24, 2022

Yep, the component loader ecosystem gets around my problems, I'll go next task.

@Jegelewicz
Member

Loaded the Bell file at 6 PM MDT; at 6:04 this was returned:

image

At that rate, it will take me like 20 hours hitting refresh every 4 minutes to check the whole list of higher geography for the Bell mammals....

@dustymc dustymc added Component Loader Things involved in Round Five of the component loader discussions and removed Blocked Issue cannot be addressed until another Issue (which should be linked) is addressed. labels May 18, 2022
@dustymc dustymc modified the milestones: Next Task, Active Development Jun 21, 2022
@dustymc
Contributor

dustymc commented Jun 22, 2022

Next release.

Even the component loader wasn't able to handle the function-manipulated data at a reasonable rate, so I rebuilt stripGeogRanks and added generated stripped_{field} terms to geog_auth_rec. It's some junk to store, but I think we can afford that (it's tiny compared to spatial data), and processing is now reasonably fast.

The loader returns up to 10 possible matches, and a status value that will hopefully help sort them out. "Just use the first" is probably a mostly-sorta-defensible position for eg, an incoming collection - it likely won't be WRONG most of the time, but it will probably not be of quite the right precision for lots of data.
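To make the "up to 10 matches plus a status" shape concrete, here is a small Python sketch. It is not the loader's algorithm: it swaps in `difflib` string similarity from the standard library, and the function name `suggest`, the status values ("exact", "fuzzy", "no_match"), and the 0.6 cutoff are all assumptions for illustration.

```python
import difflib


def _norm(text: str) -> str:
    """Collapse whitespace and lowercase so trivial differences do not block a match."""
    return " ".join(text.split()).lower()


def suggest(incoming, known, limit=10, cutoff=0.6):
    """Return (status, matches) with at most `limit` candidates, best first."""
    by_key = {_norm(k): k for k in known}
    key = _norm(incoming)
    close = difflib.get_close_matches(key, list(by_key), n=limit, cutoff=cutoff)
    if key in by_key:
        status = "exact"
        # Force the exact match into first position, then the remaining candidates.
        close = [key] + [k for k in close if k != key]
    elif close:
        status = "fuzzy"
    else:
        status = "no_match"
    return status, [by_key[k] for k in close[:limit]]


known = [
    "North America, United States, Wyoming, Park County",
    "North America, United States, Wyoming, Yellowstone National Park",
    "North America, United States, Wyoming, Park County, Missouri River",
]
status, matches = suggest("North America, United States, Wyoming, Park  County", known)
```

With a status like this, "just use the first" can be limited to rows where the status says the first candidate is an exact match, leaving the rest for review.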

@Jegelewicz (or anybody else) if you've got any "raw" data - the uglier the better - please pass it along, there's room for lots of tuning.

@Jegelewicz
Member

try this
geography test.csv

@dustymc
Contributor

dustymc commented Jun 22, 2022

thx, script is a little smarter than it used to be.

cf_temp_geog_lookup_download.csv.zip

@Jegelewicz
Member

Betta, but what the heck? Shouldn't North America, United States, Texas, Aransas County also appear here?

image

Also, can the first column hold the closest match?

HIGHER_GEOG: North America, United States, Wyoming, Park  County
HG_1: North America, United States, Wyoming, Yellowstone National Park
HG_2: North America, United States, Wyoming, Park County, Missouri River
HG_3: North America, United States, Wyoming, Uinta County, Colorado River
HG_4: North America, United States, Wyoming, Crook County, Missouri River
HG_5: North America, United States, Wyoming, Teton County, Missouri River
HG_6: North America, United States, Wyoming, Uinta County, Missouri River
HG_7: North America, United States, Wyoming, Albany County, Missouri River
HG_8: North America, United States, Wyoming, Platte County, Missouri River
HG_9: North America, United States, Wyoming, Carbon County, Missouri River
HG_10: North America, United States, Wyoming, Weston County, Missouri River

North America, United States, Wyoming, Park County exists - the other stuff is nice, but knowing there is an exact match is task number one and the exact match didn't even make the list?

@dustymc dustymc closed this as completed Jun 23, 2022

4 participants