UCM Agent Bulkload Request #7894

Closed
javanveldhuizen opened this issue Jun 24, 2024 · 18 comments

@javanveldhuizen

cf_temp_pre_bulk_agent_download_version ready.csv
Please bulkload the agents in the attached file.

Note: The file should be results from the Agent Prebulkload Tool. If the file is too large for Github attachments, comment here and an email address or shared file space will be provided to you.

@javanveldhuizen javanveldhuizen added Priority-High (Needed for work) High because this is causing a delay in important collection work.. Function-Agents labels Jun 24, 2024
@Jegelewicz
Member

S-C Lee
C-C Chen
C-P Chen
J-T Chao

and others with a dash in preferred name. First names should not include punctuation other than a period. Are we sure these are people and should they be:

S. C. Lee
C. C. Chen
C. P. Chen
J. T. Chao

?

R/V Soyo-Maru

Is not a person, but a research vessel? If so, this may be added as an organization.

A. C. Burrill
R. C. Burrill

This really feels like someone somewhere mis-transcribed an A for an R or the other way around?

W. F. Halliday
W. R. Halliday

Ditto for the F and R in these two

Mr. A. E. Collins
Mrs. A. E. Collins

And the D and K here

D. A. Han
K. A. Han

Add the "spouse of" relationship between these after they are added?

Will Eberle-Taylor
Nick Eberle-Taylor
Quinn Eberle-Taylor

I assume these people are related? Do we know how?

Can I be convinced that these are really not the same person?

William W. Hay
W. Hay

Or these two?

Norman E. A. Hinds
Norman E. C. Hinds

All of the "not the same as" relationships require a method and determiner.

I am not trying to be obstructionist, but it seems like there is still some cleanup that could be done before we add these agents? I stopped looking at the near matches, so there are probably others I would add to the categories above.

@javanveldhuizen
Author

No worries. Thanks for catching those. Updated agent list attached:
cf_temp_pre_bulk_agent_download_final version.csv

@javanveldhuizen
Author

@dustymc Thanks for including me in the #7649 issue. Maybe we should pare our list down so that the only agents that get uploaded are ones that have full names (i.e., no initial) or have one or more attributes that distinguish them (make them unique) from other agents? So, for instance, if we have a J. Smith, the only way we can upload that person as an agent is if we have an attribute, say "child of", linked to that agent. Would that work?
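
A minimal sketch of that filter, assuming each row has been reduced to a preferred name plus a list of attribute types. The field names `preferred_name` and `attributes` and the helpers `has_initial` and `is_loadable` are hypothetical; they are not part of the Agent Prebulkload Tool or of Arctos.

```python
import re

# An "initial" here is a single letter, optionally followed by a period ("J." or "J").
INITIAL = re.compile(r"[A-Za-z]\.?")

def has_initial(name):
    """Return True if any whitespace-separated part of the name is a bare initial."""
    return any(INITIAL.fullmatch(part) for part in name.split())

def is_loadable(agent):
    """Keep agents with full names, or agents with at least one distinguishing attribute."""
    return not has_initial(agent["preferred_name"]) or bool(agent["attributes"])

# Hypothetical rows for illustration only.
agents = [
    {"preferred_name": "J. Smith", "attributes": []},
    {"preferred_name": "J. Smith", "attributes": ["child of"]},
    {"preferred_name": "Barbara Waleis", "attributes": []},
]
loadable = [a for a in agents if is_loadable(a)]
```

Under these assumptions the bare J. Smith is held back and the other two rows pass. Note that a middle initial (as in Barbara T. Waters) also counts as an initial in this sketch, which may be stricter than intended.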

@Jegelewicz
Member

So, for instance if we have a J. Smith the only way we can upload that person as an agent is if we had an attribute, say "child of", linked to that agent. Would that work?

That will help, but the ones I am struggling with include things like

Barbara Waleis which feels like it may be a mistranscription of Barbara T. Waters

Charles A. Nelson feels like a mistranscription of Charles D. Nelson (or perhaps it is the other way around, A and D can look very similar when written or maybe these ARE two different people, but I have no way to decide that)

Chin-Tsong Lewis and Chin-Tsong Lo - one of these must be a misspelling, an alternate name for the same person, or are they related people?

You may have no way to figure out if my "feelings" are justified, but if you do, it might be good to get things like this sorted before making agents.

As before, I did not peruse the entire list to look for these internal issues, but there are probably others! Do not take this as a summary of everything that I think needs review - just ideas for looking at the data you have in-house even before comparisons to Arctos agents.
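
One way to look for these internal near matches before comparing against Arctos agents is a pairwise string-similarity pass. This is only a sketch using Python's standard `difflib`; the `near_matches` helper and the 0.9 threshold are assumptions for illustration, and a flagged pair still needs a human to decide whether it is one person or two.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_matches(names, threshold=0.9):
    """Return pairs of distinct names whose similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
            pairs.append((a, b))
    return pairs

# Names taken from the comments above.
names = [
    "Charles A. Nelson",
    "Charles D. Nelson",
    "A. C. Burrill",
    "R. C. Burrill",
    "Barbara Waleis",
]
flagged = near_matches(names)
```

With this threshold the Nelson and Burrill pairs are flagged because each differs by a single initial. Lowering the threshold catches looser matches at the cost of more pairs to review.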

@javanveldhuizen
Author

javanveldhuizen commented Jul 1, 2024

Barbara Waleis which feels like it may be a mistranscription of Barbara T. Waters

I can confirm that Barbara Waleis and Barbara T. Waters are two different people. Waleis is a collector from the 1930s, while Waters is a collector from the 1980s.


The others are all agents for the invert zoo collection, which will need to be checked by @Krmartin3 when she gets back from vacation. I can say that the Chinese do use hyphenated first names. So, Arctos may need to figure that one out, but I'll let Kelly chime in when she is back.

Charles A. Nelson feels like a mistranscription of Charles D. Nelson (or perhaps it is the other way around, A and D can look very similar when written or maybe these ARE two different people, but I have no way to decide that)

Chin-Tsong Lewis and Chin-Tsong Lo - one of these must be a misspelling, an alternate name for the same person, or are they related people?


In the meantime, I'm going to pull all of invert zoo's agents from the sheet, as I think most of the issues are coming from that side (sorry Kelly). I'll reupload a new sheet of agents here in a bit.

@javanveldhuizen
Author

@Jegelewicz new list of agents attached
cf_temp_pre_bulk_agent_vert paleo agents only.csv

@dustymc
Contributor

dustymc commented Jul 1, 2024

@javanveldhuizen the dates in that CSV have been mangled (probably by Excel?).

@javanveldhuizen
Author

javanveldhuizen commented Jul 2, 2024

@dustymc Interesting, the dates look fine on my end.
Screenshot 2024-07-02 075148

Should I use a different program to edit the CSV instead?

@javanveldhuizen
Author

@dustymc Ok. I edited the CSV using Notepad and changed all the dates into the desired format: yyyy-mm-dd. Let me know if that doesn't work.

cf_temp_pre_bulk_agent_vert paleo agents only.csv
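
A quick way to check a CSV for dates that are not in yyyy-mm-dd form before attaching it is sketched below. The column name `created_date`, the sample rows, and the `bad_dates` helper are assumptions for illustration; the sketch only checks full dates and does not reflect how the Arctos pre-loader validates anything.

```python
import csv
import io
import re
from datetime import datetime

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def is_iso_date(value):
    """Return True only for zero-padded, real calendar dates in yyyy-mm-dd form."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True

def bad_dates(csv_text, date_columns):
    """Yield (row_number, column, value) for non-empty values that are not yyyy-mm-dd."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column in date_columns:
            value = (row.get(column) or "").strip()
            if value and not is_iso_date(value):
                yield row_number, column, value

# Hypothetical two-row sample; the second date is the kind of value a spreadsheet may write.
sample = "preferred_name,created_date\nBill Simpson,2024-07-01\nScott Parker,7/1/24\n"
problems = list(bad_dates(sample, ["created_date"]))
```

To run it against a real file, read the file as text (for example with `open(path, newline="", encoding="utf-8").read()`) and pass the actual date column names.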

@dustymc
Contributor

dustymc commented Jul 2, 2024

look fine

Yea, but they don't SAVE fine (eg unambiguously), which is why we require CSV.

https://handbook.arctosdb.org/how_to/How-to-Excel-for-Arctos.html#dates (I wrote the 'eat your data' bits but not the niceties at the top!)

Thanks, I've got those in the pre-loader.

The first thing in my view is "Humboldt Museum" - surely that's https://arctos.database.museum/agent/21336826 or https://arctos.database.museum/agent/21348575??

@javanveldhuizen
Author

The first thing in my view is "Humboldt Museum" - surely that's https://arctos.database.museum/agent/21336826 or https://arctos.database.museum/agent/21348575??

It's actually neither of those. The specimens I have tied to the Humboldt Museum were donated to us by a researcher at the Humboldt-Universität zu Berlin. What's unclear is whether these were actually part of the museum at that university, which later became the Museum fuer Naturkunde der Humboldt-Universitaet Berlin, or part of a researcher's lab collection. I kept it as Humboldt Museum until I could fully untangle it. Feel free to delete it from the list if you feel that it is not an appropriate agent.

@javanveldhuizen
Author

@dustymc Here is the agent sheet again with the Humboldt Museum removed
cf_temp_pre_bulk_agent_vert paleo agents only.csv

@dustymc
Contributor

dustymc commented Jul 2, 2024

you feel

Ugh, that should not be the path, @ArctosDB/agents-committee HELP!

Lacking further guidance, that seems a somewhat defensible position to me (and a remark would be useful, if that's not already there).

I loaded data to https://docs.google.com/spreadsheets/d/1it7JgDc0Fxnccn5yD_bO6kdYFjPRrbJhqptOVAOu3G8/edit?gid=907589706#gid=907589706

Again an "interesting" situation on the first line!

Screenshot 2024-07-02 at 08 58 39

First your agent will load, then Arctos will run....

arctosprod@arctos>> select getAgentID('David Taylor');
 getagentid 
------------
   21333592

except two results will be returned - this one and the one just created - which will result in an error. Maybe that's somehow my problem, but I'm not quite sure how to address it. https://arctos.database.museum/agent/21333592 will always be unambiguous, but isn't great for humans to work with in a spreadsheet.
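
The failure mode can be illustrated with a toy lookup that, like the call above, expects a name to resolve to exactly one agent. This is not the Arctos `getAgentID` implementation; `resolve_agent_id`, `AgentLookupError`, and the second agent ID are hypothetical stand-ins.

```python
class AgentLookupError(Exception):
    """Raised when a name does not resolve to exactly one agent."""

def resolve_agent_id(name, agents):
    """Return the single agent_id whose preferred name equals `name`, else raise."""
    matches = [agent_id for agent_id, preferred_name in agents if preferred_name == name]
    if len(matches) != 1:
        raise AgentLookupError(f"{name!r} matched {len(matches)} agents: {matches}")
    return matches[0]

# Before the load: one David Taylor, so the name is unambiguous.
existing = [(21333592, "David Taylor")]
before = resolve_agent_id("David Taylor", existing)

# After the load: a second David Taylor exists (99999999 is a made-up placeholder ID).
after_load = existing + [(99999999, "David Taylor")]
try:
    resolve_agent_id("David Taylor", after_load)
    ambiguous = False
except AgentLookupError:
    ambiguous = True
```

Referring to the agent by ID avoids the problem, which is why the agent URL stays unambiguous even when the name does not.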

Beyond that, I don't know how to proceed. (I'd use verbatim agents as a first pass so we don't have to guess from strings, but I seem to have lost that argument!)

person Sarah E. Rieboldt attribute match: first+last variants Sarah Rieboldt person first name Sarah middle name E. last name Rieboldt not the same as Sarah Reiboldt 2024-07-01 Jacob Van Veldhuizen dlm

person Bill Simpson attribute match: first+last variants William Simpson person first name Bill last name Simpson not the same as William Simpson 2024-07-01 Jacob Van Veldhuizen dlm

organization Brigham Young University Museum of Paleontology attribute match: aka Brigham Young University Life Science Museum organization aka BYU Wikidata https://www.wikidata.org/wiki/Q4836911 not the same as Brigham Young University Life Science Museum 2024-07-01 Jacob Van Veldhuizen dlm

look pretty suspicious (and maybe that's OK, I don't know, this should still not be my call @ArctosDB/agents-committee !!)

I didn't scroll very far, just enough to grab a couple examples.

I don't see any super-obvious duplicates or mistyped agents or such in the file. I REALLY don't want this to be my call (see above, I'd do something entirely different!), and the ~30 flagged by the checker could definitely use careful review, but loading this doesn't seem unreasonable.

@Jegelewicz @mkoo thoughts??

@dustymc dustymc added this to the Needs Discussion milestone Jul 2, 2024
@javanveldhuizen
Author

javanveldhuizen commented Jul 2, 2024

@dustymc

arctosprod@arctos>> select getAgentID('David Taylor');
 getagentid 
------------
   21333592

I have deleted David Taylor from my list and will make him a verbatim agent for now until that issue is fixed. I can confirm that the David Taylor already in Arctos is not the same David Taylor in my data.

person Sarah E. Rieboldt attribute match: first+last variants Sarah Rieboldt person first name Sarah middle name E. last name Rieboldt not the same as Sarah Reiboldt 2024-07-01 Jacob Van Veldhuizen dlm

For some reason Sarah Reiboldt keeps reappearing in this list even though I keep deleting it. Anyway, I've deleted it once again and I can confirm that the Sarah Reiboldt already in Arctos is the same Sarah Reiboldt in my data.

person Bill Simpson attribute match: first+last variants William Simpson person first name Bill last name Simpson not the same as William Simpson 2024-07-01 Jacob Van Veldhuizen dlm

The Bill Simpson I have in my data is an amateur collector in the Denver area and not the William Simpson already in Arctos. These are two separate people, as indicated by the "not the same as" attribute.

organization Brigham Young University Museum of Paleontology attribute match: aka Brigham Young University Life Science Museum organization aka BYU Wikidata https://www.wikidata.org/wiki/Q4836911 not the same as Brigham Young University Life Science Museum 2024-07-01 Jacob Van Veldhuizen dlm

The BYU Museum of Paleontology and the BYU Life Science Museum are two different organizations. Here are their websites so you can confirm:


New list here:
cf_temp_pre_bulk_agent_vert paleo agents only.csv

@Jegelewicz Jegelewicz removed their assignment Jul 2, 2024
@dustymc
Contributor

dustymc commented Jul 2, 2024

David Taylor

You can also just create the agent manually (where everything involves IDs instead of strings).

as indicated by the "not the same as" attribute

Sorry, I didn't look very carefully (was aiming for general considerations, not specifics!), thanks!

New list

running....

https://docs.google.com/spreadsheets/d/1SBF83EZncUko6u1KkVzbQdhaPGDULnNVNKuSEn6Leak/edit?usp=sharing

I suppose I should just load that??? @mkoo

@dustymc
Contributor

dustymc commented Jul 3, 2024

@javanveldhuizen I found a problem on my end and am rolling a partial load back, but during that I noticed

Ward Scientific
Wards National Science

in these data. Surely those are both duplicates of https://arctos.database.museum/agent/21293521?

@javanveldhuizen
Author

@dustymc I deleted those agents. They need some verification. New list here:
cf_temp_pre_bulk_agent_vert paleo agents only.csv

@dustymc
Contributor

dustymc commented Jul 3, 2024

Done and blamed on you @javanveldhuizen

There's one full-duplicate low-data copy of another low-data agent that maybe ought to have something done with it.

 agent_id | agent_type | preferred_agent_name |       creator        |        created_date        
----------+------------+----------------------+----------------------+----------------------------
 21354938 | person     | Scott Parker         | Jacob Van Veldhuizen | 2024-07-03 14:40:17.101114
 21257771 | person     | Scott Parker         | unknown              | 2013-12-16 21:49:31
(2 rows)

and one that errored out

cf_temp_agent_download(3).csv

@dustymc dustymc closed this as completed Jul 3, 2024