
Feature Request - more data entry/bulkloader columns #5193

Closed
dustymc opened this issue Oct 20, 2022 · 85 comments
Labels
Enhancement I think this would make Arctos even awesomer! Priority-High (Needed for work) High because this is causing a delay in important collection work.

Comments

@dustymc
Contributor

dustymc commented Oct 20, 2022

This may be dead

The replacement proposal is #6171 (more columns, but also rename everything)


Implementation will begin after 2023-04-14.

All change or addition requests must be received before then.


Briefly discussed by AWG, consensus is that lots of columns is a workable idea. https://docs.google.com/document/d/1VEUSR-8UK0-9WeFOyiJDRit9UCIDJbq-HMpXUbBRvm8/edit# @Jegelewicz @ccicero @Nicole-Ridgwell-NMMNHS @ebraker @wellerjes @mkoo @genevieve-anderegg @campmlc @atrox10

Whatever's decided here will be in effect for (some time - 5 years, maybe?) - please pass this on to anyone who might care.

Current table is always https://arctos.database.museum/tblbrowse.cfm?tbl=bulkloader

CSV:
temp_maybe_new_bulkloader(8).csv

Summary: https://docs.google.com/spreadsheets/d/1ssNP_jAiOok7TYIPq8b-OKcLCw9NngtCElPwbS6ajFo/edit#gid=0

Column Count: 1133


lastedit: add identification_order_
lastedit: two identifications


MAYBE:

encumbrances (#703) - ask is not clear so requirements are not clear


TODO

DOCUMENTATION! (merge from #5196)


OLDSTUFF

bumping to 15 attributes for #5210

AWG discussion suggests 2 part attributes aren't sufficient; try 4, and move to the format in #5193 (comment)

Identifications: #4416

event attributes (#4230) - each requires 7 columns

Generalizing, we've maybe been overly conservative about adding stuff - perhaps we don't need to be anymore? No idea what Excel supports; PG supports ~a thousand columns, let's see if we can use 'em.

Parts are currently 7 columns, one of which is preservation.

Part attributes require 6 columns.

Attributes (see #5210) require 7 columns.

Identifiers (see #5164) are currently one column.

I'm not merging, but #5120 should be resolved here too - is that magic-mapping of existing columns, use the existing locality attributes, ???

Coordinate-stuff should be better arranged, see #4716

QUESTION: Should we also add taxon concepts? Hard "no" vote from DLM - it's not being used, there's no way of knowing if the shape will change.

@dustymc dustymc added the Enhancement I think this would make Arctos even awesomer! label Oct 20, 2022
@dustymc dustymc added this to the Needs Discussion milestone Oct 20, 2022
@dustymc
Contributor Author

dustymc commented Oct 20, 2022

Can we have fewer parts - do we really need 12?

@Jegelewicz
Member

Remove the preservation attribute shortcut and allow for at least two attributes per part in the bulkloader.

@dustymc
Contributor Author

dustymc commented Oct 20, 2022

defaults are https://handbook.arctosdb.org/documentation/bulkloader.html

[ doc ] This is a shortcut to creating a part attribute of type preservation. Attribute date will default to current_date and determiner will default to enteredAgent
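
For anyone wondering what the shortcut amounts to in practice, here is a minimal illustrative sketch of the documented behavior - the field names are assumptions for illustration, not the actual Arctos schema:

```python
from datetime import date

def expand_preservation_shortcut(preservation_value, entered_agent):
    """Illustrative only: the preservation shortcut column becomes a single
    part attribute of type 'preservation' with defaulted date and determiner."""
    return {
        "part_attribute_type": "preservation",
        "part_attribute_value": preservation_value,
        "part_attribute_date": date.today().isoformat(),  # doc: defaults to current_date
        "part_attribute_determiner": entered_agent,       # doc: defaults to enteredAgent
    }
```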

@ewommack

We need more columns (maybe?) for (stuff?)?

Yes please. I feel all museums want more stuff.

Remove the preservation attribute shortcut and allow for at least two attributes per part in the bulkloader.

What does this mean? I like being able to select the preservation and then get it to load without having to go through a second time. I probably just do not understand what the shortcut does.

@dustymc dustymc changed the title Feature Request - more data entry parts columns Feature Request - more data entry/bulkloader columns Oct 25, 2022
@campmlc

campmlc commented Oct 25, 2022

I'm happy with removing preservation attribute shortcut from the bulkloader. We should keep the option in the data entry forms - but it is really common to have two preservation types - e.g. 95% ethanol and frozen, or fixed in 95% EtOH and preserved in 70% EtOH, and we need to make it as easy as possible to capture this info.
Overall, having more columns in the bulkloader would make everyone's life easier and lead to better data - Yay!

@dustymc
Contributor Author

dustymc commented Oct 25, 2022

removing preservation attribute shortcut from the bulkloader. We should keep the option in the data entry forms

That's not possible - the data entry form is just(ish) a UI for the catalog record bulkloader.

more columns

I need to know exactly what this means.

@Jegelewicz
Member

I like @campmlc's idea of a limit to the number of columns in any given bulkload file, but allowing them to be any combination of columns (maybe I want 12 parts with no part attributes, but she wants one part with 5 part attributes each). Possible?

@campmlc

campmlc commented Oct 25, 2022

I just meant removing the limit on the number of attributes etc.
For the data entry form, we currently have a workaround where there is a "preservation" field default in the parts table, but if needed, the "extras" menu can be opened to allow additional data to be entered for the part attribute bulkloader. I think we should leave the "preservation" default - and ideally put in an "add more" link to show there can be more than one preservation attribute?

@dustymc
Contributor Author

dustymc commented Oct 25, 2022

Possible?

Not really - I mean, I suppose I can refuse to deal with more than 86 columns or something, but why? My hard limit is what I can get PG to accept, which is something less than 1600 (depending on some techy details). If y'all can deal with WHATEVER, as long as it's below that, then I can too. I can't deal with anything above that, at least not in a simple table structure. (I could potentially parse to component loaders or something, but anyone who could navigate that probably doesn't need it.)

removing the limit

For flat data, the only way to do that is for you to tell me exactly what you want - we either do or do not have a column called 'part_75_attribute_16,' there is no way for a flat object to just take whatever comes.
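
To make the column arithmetic concrete, a rough back-of-the-envelope sketch - the per-section widths for parts (7), part attributes (6), and attributes (7) come from the first comment; the PG ceiling, the identification width, and the fixed-column count are assumptions:

```python
PG_COLUMN_LIMIT = 1600  # assumed hard ceiling, per the comment above

def total_columns(n_parts=12, attrs_per_part=2, n_attributes=15,
                  n_identifications=2, n_other_ids=8, fixed_columns=150):
    """All knobs here are hypothetical, not the real builder settings."""
    parts = n_parts * (7 + attrs_per_part * 6)  # 7 part columns + 6 per part attribute
    attributes = n_attributes * 7               # 7 columns per record attribute
    identifications = n_identifications * 10    # made-up width per identification
    other_ids = n_other_ids * 3                 # type, value, issued_by
    return fixed_columns + parts + attributes + identifications + other_ids

n = total_columns()
print(n, "columns -", "fits" if n < PG_COLUMN_LIMIT else "does NOT fit", "under the assumed PG limit")
```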

"extras" menu can be opened

That's the "ish" above - the data entry form is also a UI to a bunch of component loaders, and those are not flat - you can happily add your 947th part, I don't care or need to know, it'll just work.

(And I thought that had solved all of this, but here we are anyway - this needs clear instructions from y'all to proceed.)

@campmlc

campmlc commented Oct 25, 2022 via email

@dustymc
Contributor Author

dustymc commented Oct 25, 2022

field names could be the same

Let's call this an administrative problem. If there's a difference it's because someone's asked for it; I would LOVE to have a policy to point at while refusing to relabel the next time this comes up.

between both forms

There is not really a bulkloader 'form' - it's very purposefully done in the most portable manner possible (CSV), and there's been an API for a very long time so there's no real reason to use any Arctos form. If we're doing this, the table columns have to be the controlling element.

Related, columns are hard to change/rename. We can mitigate that by picking good names in this issue. I'll start:

  • rename collection_object_id to key, and either disallow uploading it or make it a text datatype (the default assumption for the generic data-driven sqlldr)

@dustymc
Contributor Author

dustymc commented Nov 2, 2022

Here's an attempt at a table which addresses the issues mentioned, I hope, maybe. Implementing this as attached would require a few other issues to be addressed; I'm hoping we can get this and everything attached to it as one big release (anything else is going to require rebuilding the same complex things over and over).

First considerations:

  1. Is this usable?
  2. What's missing? I think this should include pushing through any potential new development that would affect this; rebuilding the bulkloader is a Big Deal, let's do whatever lets us not do this again for a while.
  3. What's organized/named/structured/sorted/etc. incorrectly? The new proposed part attribute naming schema in particular could use a close look (maybe part_n_attribute_m is better??).
  4. Which of your favorite tools does this kill? Seems to work for everything I use...

If we get past the above, Eventually:

  1. Migration path. I'm about 99% sure that there should be none, I'll just dump the old bulkloader as CSV and assist in moving to this structure by request. There's stuff from 2011 in there, it's never meant to be used for long-term storage, we'll never have a better chance to clean up.
  2. UI - lots of new stuff in here, the data entry screen will require a lot of updates. It would be fabulous if @campmlc 's 'just use column names' suggestion above could be adopted/rejected before that.
  3. Triggers and calculations and constraints and such - that all needs to be rewritten to use this, and some of it still needs to be postgresified; that can be cleaned up while rebuilding
  4. Handler - this will require a major rewrite of the loader-upper tool; it needs to be transmogrified into a function anyway

Current table in first comment.

@dustymc
Contributor Author

dustymc commented Nov 3, 2022

Consider pulling #4707 into this as well - it's going to change a lot of the vocabulary and targeting - although we probably DO want to retain part_condition (but make it optional) for magicking bare-bones 'condition report' part attributes. And does that mean we need more than 2 attributes per part? A goal of this should be stability; I suggest that's sufficient reason to bump to 3 part attributes.

@Jegelewicz
Member

These columns and all other similar columns need to be in order, with all "1" values coming first, in order from left to right. This is how humans would understand it - and is necessary for humans to avoid making mistakes, given the number of columns involved.
The problems I see in the current csv are having the "Identification_2" columns show up before the "Identification_1" columns (in addition to the naming problem for these columns referenced above), and also in the ATTRIBUTES columns, where attributes 1-15 are inserted after PART_12, and the remaining attributes 16 and on show back up again after PART_ATTRIBUTE_METHOD_20_4.
We need to standardize going forward so that all associated columns, whenever they are added in 5 years, appear in order and adjacent, from lowest (1) to highest, left to right.

Possibly the same issue as above?
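
One mechanical way to get that ordering (a sketch, not how the builder actually sorts anything): key each header on its non-numeric stem plus its embedded indices, so _2 follows _1 and _10 follows _9 instead of sorting lexicographically.

```python
import re

def group_sort_key(header):
    """Sort key: the header with its index numbers blanked out, then the numeric
    indices themselves, so related columns cluster and run 1, 2, ... n."""
    stem = re.sub(r"\d+", "#", header)
    indices = [int(x) for x in re.findall(r"\d+", header)]
    return (stem, indices)

headers = ["IDENTIFICATION_2_ATTRIBUTE_1", "IDENTIFICATION_1_ATTRIBUTE_1",
           "PART_10_NAME", "PART_2_NAME", "PART_1_NAME"]
print(sorted(headers, key=group_sort_key))
# ['IDENTIFICATION_1_ATTRIBUTE_1', 'IDENTIFICATION_2_ATTRIBUTE_1',
#  'PART_1_NAME', 'PART_2_NAME', 'PART_10_NAME']
```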

@Jegelewicz
Member

Make the formatting of the column numbering consistent. Currently we have ID_MADE_BY_AGENT_2_3; but we also have IDENTIFICATION_2_ATTRIBUTE_1. For me, the latter system is much clearer. So I propose changing the former and all similar column headers to the format: ID_2_MADE_BY_AGENT_3, as above. Also, PART_ATTRIBUTE_VALUE_1_1 should be PART_1_ATTRIBUTE_VALUE_1, etc.

This seems like a reasonable and good idea - see also #6103 (comment) where I suggested something of this nature.
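
For what it's worth, the rename itself is mechanical - a hedged sketch (it only handles the simple trailing _i_j case; real headers would need checking against the actual CSV):

```python
import re

# Only handles the simple trailing _i_j pattern; real headers should be checked
# against the actual CSV before doing anything like this for real.
OLD_STYLE = re.compile(r"^(ID|PART)_([A-Z_]+?)_(\d+)_(\d+)$")

def to_proposed_style(header):
    """ID_MADE_BY_AGENT_2_3 -> ID_2_MADE_BY_AGENT_3,
    PART_ATTRIBUTE_VALUE_1_1 -> PART_1_ATTRIBUTE_VALUE_1;
    anything that doesn't match passes through unchanged."""
    m = OLD_STYLE.match(header)
    if not m:
        return header
    prefix, field, parent_idx, child_idx = m.groups()
    return f"{prefix}_{parent_idx}_{field}_{child_idx}"

for h in ["ID_MADE_BY_AGENT_2_3", "PART_ATTRIBUTE_VALUE_1_1", "IDENTIFICATION_2_ATTRIBUTE_1"]:
    print(h, "->", to_proposed_style(h))
```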

@Jegelewicz
Member

Currently event verification status is forced to be unverified for a bulkload. If we are keeping this constraint, there is no reason to have this column in the bulkloader, since nothing else will load.
I personally suggest we abandon this constraint, as I doubt there is a single collection with the time and collector and staff resources to incorporate updating all these values after load into their workflows. If we know that an event is accepted during data entry, we should be able to enter that. Otherwise we end up with entire collections full of "unverified" values merely because of the bulkloader constraint, not because other information is unavailable. This defeats the purpose of recording these data.

This has its own issue - posting this there for posterity.

@Jegelewicz
Member

Since we are revamping the bulkloader - one more critical related request regarding order of columns:

Please set up the csv download from the Browse and Edit page so that its columns are in the same order as the columns displayed in Browse and Edit (and the same order as in the proposed csv file here).
I am having to download data with errors and DELETE from Browse and Edit. The downloaded csv has columns in a different order, e.g. dec lat degrees etc. at the end of the file and separated from the other georeferencing columns - which makes comparing to my original file and fixing errors for reload challenging.

This definitely makes fixing things ourselves much easier and I support making the order of columns consistent however we can, if we can. This includes

Bulkloader Builder
Browse and Edit
Browse and Edit Download csv

I do not know the technical difficulties involved (which may be too difficult to overcome), but if naming columns appropriately can facilitate sorting that makes sense, perhaps that is where we should be looking to improve.
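
Until the download order is fixed, here is a workaround sketch for lining a downloaded csv up against another file's header order (file names here are placeholders, not real paths):

```python
import csv

def reorder_csv(in_path, out_path, reference_header):
    """Rewrite in_path with columns in reference_header order (illustrative only).
    Columns missing from the input are written empty; extras are dropped."""
    with open(in_path, newline="") as f:
        rows = list(csv.DictReader(f))
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=reference_header, extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            writer.writerow({col: row.get(col, "") for col in reference_header})

# e.g. reorder the Browse and Edit download to match your original bulkload file's header:
# reorder_csv("browse_and_edit_download.csv", "reordered.csv",
#             reference_header=list(csv.reader(open("my_original_bulkload.csv")))[0])
```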

@campmlc

campmlc commented Apr 17, 2023

I think the problem is that @campmlc is looking at the columns in the summary - and I didn't truly update them, I just tacked on additions. I'm sorry for this, but a whole new sheet means assigning categories to 1133 columns all over again. Please review the template that Dusty entered in the first comment to see what his proposal for actual column headers is currently.

Yes - is there a different csv list of column headers? Sorry, I was looking at the summary . . .

@Jegelewicz
Member

is there a different csv list of column headers?

CSV:
temp_maybe_new_bulkloader(8).csv

Dusty's actual list can always be found here - #5193 (comment)

@campmlc

campmlc commented Apr 17, 2023

So I guess I scrolled through and carefully examined 1133 column headers in the wrong file :(
Well, I'm glad it's not that one!
I'll look over the actual one and provide comments.

@Jegelewicz
Member

I am truly sorry for that - I just don't have time to re-categorize everything every time the csv changes. :(

@campmlc

campmlc commented Apr 17, 2023

I'll look over temp_maybe_new_bulkloader(8).csv and provide updated comments.
And no worries.

@campmlc

campmlc commented Apr 17, 2023

A couple of comments on what I hope is the correct csv this time:

  1. Move all OTHER_ID_NUM_TYPE to the first column before OTHER_ID_NUM_VALUE and before issued by. I need to be able to write: "NK" as the type followed by "12345" as the number, so I can keep track of what identifier is what in the row. In the current order, we have the number and issued by first, followed by the type - please swap so that type is first to the left, as this is how humans would read the data. This is also consistent with how we record attributes, with the "type" field first, followed by the value.

  2. Swap the order of the ORIG_LAT_LONG_UNITS and the GEOREFERENCE_PROTOCOL columns, so that the latlong units column is directly adjacent to the various declat, declong etc values that need to be selected based on the units value. For example, the columns used for decimal degrees are different from the columns used for deg min sec - and you need to know the value in the orig lat long units column to determine what to use. So they should be visible and clustered as a unit.

  3. Do we really need "ORIG" in front of the lat long units and elevation units? Do these columns really reflect only the "original" values of these units? Or is this legacy? We don't have "ORIG" in front of depth units . . .
    This isn't breaking anything, really- but they could be confusing if they don't actually mean what they imply they do, and if so, they are also making the headers unnecessarily longer and harder to search for. Also if you are searching for units values in the long list, it helps to get rid of extra and confusing terms.

  4. Ditto re: previous question about verification status being necessary here. bulkloader and verification status #4982

  5. Same previous request to standardize format so that PART_ATTRIBUTE_TYPE_1_1 becomes PART_1_ATTRIBUTE_TYPE_1 etc across the board.

  6. To be totally consistent for current and future users, ATTRIBUTE_1 should really be ATTRIBUTE_TYPE_1, to go with ATTRIBUTE_VALUE_1, just sayin . . .

1, 2, and 5 above are the only really critical things that need fixing, in my view.

Otherwise looks good?
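
For item 2 above, the dependency could even be checked mechanically - a sketch with guessed column headers and unit vocabulary (not the real bulkloader names):

```python
# Column names and unit vocabulary here are guesses for illustration,
# not the real bulkloader headers or code-table values.
REQUIRED_BY_UNITS = {
    "decimal degrees": ["DEC_LAT", "DEC_LONG"],
    "deg. min. sec.": ["LAT_DEG", "LAT_MIN", "LAT_SEC", "LONG_DEG", "LONG_MIN", "LONG_SEC"],
    "degrees dec. minutes": ["LAT_DEG", "DEC_LAT_MIN", "LONG_DEG", "DEC_LONG_MIN"],
}

def missing_coordinate_columns(row):
    """Return the coordinate columns implied by the units column that are empty in this row."""
    units = (row.get("LAT_LONG_UNITS") or "").strip().lower()
    required = REQUIRED_BY_UNITS.get(units, [])
    return [col for col in required if not (row.get(col) or "").strip()]
```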

@Jegelewicz
Member

Do we really need "ORIG" in front of the lat long units and elevation units? Do these columns really reflect only the "original" values of these units? Or is this legacy? We don't have "ORIG" in front of depth units . . .
This isn't breaking anything, really- but they could be confusing if they don't actually mean what they imply they do, and if so, they are also making the headers unnecessarily longer and harder to search for. Also if you are searching for units values in the long list, it helps to get rid of extra and confusing terms.

I vote we drop the "ORIG"

@Jegelewicz
Member

To be totally consistent for current and future users, ATTRIBUTE_1 should really be ATTRIBUTE_TYPE_1, to go with ATTRIBUTE_VALUE_1, just sayin . . .

I agree with adding "TYPE" to this to make the expected contents more clear.

@Jegelewicz
Member

Move all OTHER_ID_NUM_TYPE to the first column before OTHER_ID_NUM_VALUE and before issued by. I need to be able to write: "NK" as the type followed by "12345" as the number, so I can keep track of what identifier is what in the row. In the current order, we have the number and issued by first, followed by the type - please swap so that type is first to the left, as this is how humans would read the data. This is also consistent with how we record attributes, with the "type" field first, followed by the value.

Not all users will care about the type and the issuer may be more important (so those humans will want issued by to come first). This is one of those places where we have two ways of doing things and some people prefer one while others prefer the other. However, I don't really care so much about the order and if nobody actively opposes this proposal, then I think that putting them in the order issued_by, type, value should make this workable for most?

@campmlc

campmlc commented Apr 17, 2023 via email

@Jegelewicz
Member

for MSB, we need the order to be type, value, issued by.

Not to be a broken record - but this is what MSB needs, and others may "need" issued by then number, so this one just isn't super clear-cut in my opinion.

@campmlc

campmlc commented Apr 17, 2023 via email

@Jegelewicz
Member

I don't understand why. I suggested the order

issued_by, type, value

Which puts type right next to value but allows for those who want issued_by next to value to see things that way too.

However, I'm not even sure any of this placement is possible? #5193 (comment)

@campmlc

campmlc commented Apr 18, 2023

Apologies, I've had dinner now and can think clearly. Putting issued by first is fine, as long as ID TYPE and ID VALUE are together, so that we have the same UI as we currently see in the catalog record page.

@campmlc

campmlc commented Apr 19, 2023

Just confirming what the status is on this?

  1. Move all OTHER_ID_NUM_TYPE to the first column before OTHER_ID_NUM_VALUE and before issued by. I need to be able to write: "NK" as the type followed by "12345" as the number, so I can keep track of what identifier is what in the row. In the current order, we have the number and issued by first, followed by the type - please swap so that type is first to the left, as this is how humans would read the data. This is also consistent with how we record attributes, with the "type" field first, followed by the value.
  2. Swap the order of the ORIG_LAT_LONG_UNITS and the GEOREFERENCE_PROTOCOL columns, so that the latlong units column is directly adjacent to the various declat, declong etc values that need to be selected based on the units value. For example, the columns used for decimal degrees are different from the columns used for deg min sec - and you need to know the value in the orig lat long units column to determine what to use. So they should be visible and clustered as a unit.
  3. Do we really need "ORIG" in front of the lat long units and elevation units? Do these columns really reflect only the "original" values of these units? Or is this legacy? We don't have "ORIG" in front of depth units . . .
    This isn't breaking anything, really- but they could be confusing if they don't actually mean what they imply they do, and if so, they are also making the headers unnecessarily longer and harder to search for. Also if you are searching for units values in the long list, it helps to get rid of extra and confusing terms.
  4. Ditto re: previous question about verification status being necessary here. bulkloader and verification status #4982
  5. Same previous request to standardize format so that PART_ATTRIBUTE_TYPE_1_1 becomes PART_1_ATTRIBUTE_TYPE_1 etc across the board.
  6. To be totally consistent for current and future users, ATTRIBUTE_1 should really be ATTRIBUTE_TYPE_1, to go with ATTRIBUTE_VALUE_1, just sayin . . .

1, 2, and 5 above are the only really critical things that need fixing, in my view.

Otherwise looks good?

@dustymc
Contributor Author

dustymc commented Apr 20, 2023

Code table working group discussing now.

Consensus is full rebuild, don't go out of the way to preserve any column names.

Still not clear if we have too many columns at the moment - @mkoo ??

Not at all clear how we actually finalize this and begin development (https://github.com/ArctosDB/internal/issues/258#issuecomment-1515440177)

@mkoo
Member

mkoo commented Apr 20, 2023

What's the summary of the meeting? (Or where is it?) Just trying to understand the consensus and needs...

@Jegelewicz
Member

See https://docs.google.com/spreadsheets/d/1qstLM0xpW8gkkEnRxUpZWOZGtJkv2NTu8zIgKxv-nYc/edit?usp=sharing

BUT - also I will be posting an issue regarding localities/georeferencing that could alter this and I think we should hash it out soon.

@dustymc
Contributor Author

dustymc commented Apr 20, 2023

temp_maybe_new_bulk_cols.csv.zip
temp_maybe_new_bulkloader(9).csv.zip

I'll put this somewhere more "official" at some point, but for now I think this is primarily to make sure I've properly understood the proposal. temp_maybe_new_bulkloader is what the bulkloader may become, temp_maybe_new_bulk_cols is a transposed version (but still built from the builder code, so there should be no differences - use whichever makes the most sense to you).

I think there's still some question as to how many columns need to be in here, but that should just be a matter of adjusting some variable ('number_parts' or similar), which is an easy exercise.
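
A toy sketch of that "adjust one variable" idea - bumping 'number_parts' regenerates the whole flat header list (column names and widths here are illustrative, not the real builder output):

```python
# A toy version of the "adjust one variable" idea; column names and widths are
# illustrative, not the real builder output.
def build_headers(number_parts=12, attributes_per_part=2, number_attributes=15):
    headers = []
    for p in range(1, number_parts + 1):
        headers += [f"PART_{p}_NAME", f"PART_{p}_COUNT", f"PART_{p}_DISPOSITION"]
        for a in range(1, attributes_per_part + 1):
            headers += [f"PART_{p}_ATTRIBUTE_TYPE_{a}",
                        f"PART_{p}_ATTRIBUTE_VALUE_{a}",
                        f"PART_{p}_ATTRIBUTE_UNITS_{a}"]
    for a in range(1, number_attributes + 1):
        headers += [f"ATTRIBUTE_TYPE_{a}", f"ATTRIBUTE_VALUE_{a}", f"ATTRIBUTE_UNITS_{a}"]
    return headers

print(len(build_headers()))                 # column count with the default knobs
print(len(build_headers(number_parts=20)))  # bump number_parts and regenerate
```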

I'm feeling completely overwhelmed, and at the moment I think the locality stuff should probably be skipped - I don't see sorting that out now-ish, and putting the bulkloader builder off yet again doesn't seem like a great idea. (And the rebuild is going to take significantly more time if it involves rebuilding basically everything, so the stretch is already getting stretched.)

Rebuilding the bulkloader because we have a new model seems a completely different thing than rebuilding the bulkloader because someone decided they need one more thing. I also like the idea of stability. I'm not sure how to balance those.

I think some of the locality concerns also involve DWC, which might be a simple mapping adjustment - after we've rebuilt the bulkloader to fully incorporate #5120.

@dustymc
Contributor Author

dustymc commented Apr 28, 2023

This has become something else, closing.

@dustymc dustymc closed this as completed Apr 28, 2023
@Jegelewicz
Member

@dustymc In today's AWG - it was made clear that two things are missing from the columns listed in the file above

associated species
identification confidence - is this becoming identification rank? or are we also missing this and rank from the new list?
