Re-harvesting ICPSR datasets #63

Open
jggautier opened this issue Feb 13, 2020 · 7 comments


@jggautier
Collaborator

jggautier commented Feb 13, 2020

Now that IQSS/dataverse#4964 is technically resolved, Harvard Dataverse should be able to harvest ICPSR dataset metadata over OAI-PMH (to keep the records up to date) and the dataset title links will point to the ICPSR dataset pages.

  • Can someone confirm that it's okay to delete the existing harvesting client listed on Harvard Dataverse's "Manage Harvesting Clients" page, to remove the 8k+ stale records? I don't know how that client was created, but the delete button on the "Manage Harvesting Clients" page looks nice and clickable...

    After that's done, I'd like to create a new client and schedule a harvesting job for fresh ICPSR datasets.

  • There's an Archive Type called "ICPSR"


    It looks like the "ICPSR" Archive Type should be used when harvesting from ICPSR. Otherwise, with oai_ddi25 as the metadata format, the dataset title links point to the ICPSR homepage (see this dataverse on Demo Dataverse), and with oai_dc as the metadata format, no records are harvested (all 10k+ records fail to harvest). I think this should be documented.

@jggautier
Collaborator Author

jggautier commented Dec 16, 2020

Error setting up harvesting client on UNC Dataverse and Demo Dataverse (but not Harvard Dataverse)

Documentation of ICPSR's OAI-PMH feeds is at https://www.icpsr.umich.edu/web/pages/membership/or/metdata/oai.html.

Thu-Mai at Odum/UNC let us know today (see RT support email) and I've confirmed on Demo Dataverse that Dataverse shows the following error when we try the first step of creating a harvesting client using the server URL https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies:

https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies: Invalid URL. Failed to establish connection and receive a valid server response.


Demo Dataverse reports the same error when I try to use the second of ICPSR's two documented Server URLs, https://www.icpsr.umich.edu/icpsrweb/neutral/oai/citations

UNC is running Dataverse version 4.16. Demo Dataverse is running 5.13. (Both show this error.)
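
To rule out the ICPSR server itself, the same URL can be checked outside of Dataverse. Below is a minimal sketch, assuming that the first step of creating a harvesting client amounts to something like an OAI-PMH `Identify` request (I haven't confirmed what Dataverse actually sends). The function names are hypothetical and not part of Dataverse.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_identify_url(base_url):
    """Return the Identify request URL for an OAI-PMH base URL."""
    return f"{base_url}?{urlencode({'verb': 'Identify'})}"


def is_valid_identify_response(xml_payload):
    """True if the payload (str or bytes) is an OAI-PMH Identify response."""
    try:
        root = ET.fromstring(xml_payload)
    except ET.ParseError:
        return False  # not XML at all, e.g. an HTML error page
    if root.tag != OAI_NS + "OAI-PMH":
        return False
    if root.find(OAI_NS + "error") is not None:
        return False  # the server answered with an OAI-PMH error
    return root.find(OAI_NS + "Identify") is not None


def check_server(base_url, timeout=30.0):
    """Fetch the Identify response; network errors propagate to the caller."""
    with urllib.request.urlopen(build_identify_url(base_url), timeout=timeout) as resp:
        return is_valid_identify_response(resp.read())
```

Calling `check_server("https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies")` from the machines that host UNC Dataverse and Demo Dataverse might show whether the failure is on the network side or in Dataverse's own validation.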

On Harvard Dataverse, also running 5.13, I don't get this error and I am able to get past all 4 steps for creating a harvesting client. I haven't started a harvesting run on Harvard Dataverse because:

Dataverse misses 3,573 of the 10,890 records

Every time I've been able to harvest ICPSR's metadata (using the oai_ddi25 metadata format and ICPSR "archive type"), more recently by creating a Dataverse instance on AWS, Dataverse fails to harvest 3,573 records. It gets the other 7,317 records.

I compared the DDI 2.5 metadata of a few records that Dataverse could and could not harvest, and the only difference I could see is that records that Dataverse failed to harvest have file level metadata (in DDI's fileDscr element) while records that Dataverse was able to harvest do not include file level metadata. Also, when ICPSR's DDI 2.5 metadata does include file level metadata, it doesn't include the fileDscr element's ID or URI attributes.

So maybe the Dataverse code that imports DDI metadata from ICPSR's OAI-PMH feed doesn't like seeing file level metadata or expects to see more metadata about the files (like the ID and/or URI)?
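
The manual comparison above could be repeated across many records with a small script. This is only a sketch: it counts `fileDscr` elements in a DDI record and how many of them lack the `ID` or `URI` attributes. It matches elements by local name so it doesn't depend on the exact namespace ICPSR uses, and `summarize_file_metadata` is a hypothetical helper, not Dataverse code.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def summarize_file_metadata(ddi_xml):
    """Count fileDscr elements and those missing ID or URI attributes."""
    root = ET.fromstring(ddi_xml)
    file_dscrs = [el for el in root.iter() if local_name(el.tag) == "fileDscr"]
    return {
        "file_dscr_count": len(file_dscrs),
        "missing_id": sum(1 for el in file_dscrs if "ID" not in el.attrib),
        "missing_uri": sum(1 for el in file_dscrs if "URI" not in el.attrib),
    }
```

If every one of the 3,573 failed records reports a non-zero `file_dscr_count` and every harvested record reports zero, that would support the file-level metadata theory; it wouldn't prove it.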

Questions

  • Why are UNC Dataverse and Demo Dataverse reporting an error on the first step of creating a harvesting client, while Harvard Dataverse and instances I create on AWS aren't?
  • Why is Dataverse skipping about a third of the records in ICPSR's OAI-PMH feed?
  • What should Dataverse be doing with file-level metadata in general and how is its inclusion or exclusion from being indexed affecting data discoverability?
  • Once we can confirm that the Dataverse software will successfully harvest ICPSR's metadata, can we remove the dataset metadata in Harvard Dataverse (https://dataverse.harvard.edu/dataverse/icpsr) and start using ICPSR's OAI-PMH feed to regularly harvest ICPSR metadata into Harvard Dataverse? At least some of the datasets in https://dataverse.harvard.edu/dataverse/icpsr are also in ICPSR's OAI-PMH feed, but I haven't done a thorough check to confirm that all of them are.
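
For the thorough check mentioned in the last question, one option is to collect the identifiers from the feed's `ListIdentifiers` responses and compare them with the identifiers of the records already in Harvard Dataverse. The sketch below parses one page of a `ListIdentifiers` response and computes the difference. It assumes the already-harvested records can be mapped to the same identifier form the feed uses, which I haven't verified, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def parse_list_identifiers(xml_payload):
    """Return (identifiers, resumption_token) for one ListIdentifiers page.

    Headers marked status="deleted" are skipped. The token is None when
    the page is the last one.
    """
    root = ET.fromstring(xml_payload)
    identifiers = set()
    for header in root.iter(OAI_NS + "header"):
        if header.get("status") == "deleted":
            continue
        ident = header.findtext(OAI_NS + "identifier")
        if ident:
            identifiers.add(ident.strip())
    token = root.findtext(f"{OAI_NS}ListIdentifiers/{OAI_NS}resumptionToken")
    token = token.strip() if token else None
    return identifiers, token or None


def missing_from_feed(harvested_ids, feed_ids):
    """Identifiers present in the harvested dataverse but not in the feed."""
    return set(harvested_ids) - set(feed_ids)
```

Repeating the request with each returned resumption token until it comes back empty would give the full set of feed identifiers to pass to `missing_from_feed`.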

@jggautier
Collaborator Author

jggautier commented Apr 21, 2022

Because we've tried to harvest ICPSR's DDI Codebook exports, and I wrote that harvesting failures might be related to file-level metadata, I wonder if the recently opened GitHub issue at IQSS/dataverse#8629 is related.

Also, folks from ICPSR are doing research to improve how ICPSR exports the metadata of their datasets, including exports in their OAI-PMH feeds. An archivist at ICPSR asked me to consider taking a survey on behalf of the Harvard Dataverse Repository (or share it with those who can take the survey), but others in the community have reported issues with harvesting from ICPSR, such as the repository managed at UNC, so I thought I'd share their survey more widely, including in this issue.

The survey deadline is May 31, 2022. The email and the survey link are below.

Greetings, Julian; I hope this message finds you well. My name is Mike Shallcross and I am an archivist at the Inter-university Consortium for Political and Social Research (ICPSR). We are exploring how to make improvements to our metadata exports (e.g., copies of metadata records for our data collections) and prioritize new features and functionality associated with the export process.

We are therefore reaching out to ICPSR users as well as libraries and service providers to better understand how well our downloadable metadata records meet the community's needs (including those of Harvard Dataverse) and also gain insight into preferred metadata standards, file formats, and functionality related to metadata record exports.

The survey is available here: https://umich.qualtrics.com/jfe/form/SV_3fN70KNbCJ6iuuG

Please feel free to share the above link with other individuals in your organization who might be better-positioned to respond.

The survey should take around 5-10 minutes to complete and we are asking participants to submit their responses by May 31, 2022. If you have any questions, please reach out to me at shallcro@umich.edu. Thank you very much for your time and consideration; we sincerely appreciate your input!

With best regards,

Mike Shallcross

--
Mike Shallcross
Associate Archivist
Pronouns: he/him/his
Inter-university Consortium for Political and Social Research
University of Michigan

@jggautier
Collaborator Author

jggautier commented Apr 21, 2022

So I think it would be great if some of the responses to that survey could be informed by future attempts to re-harvest ICPSR metadata as the Dataverse team works on the related harvesting issues that @pdurbin and @landreev have been listing and looking into.

Maybe we won't know enough by the survey deadline, which is basically a month from now. But in case we do, I'm going to create a Google Doc with the survey's questions so that interested folks can collaborate on answers.

@jggautier
Collaborator Author

A Google Doc with the survey is at https://docs.google.com/document/d/1_qmnh_acoTX_MF8zDnOW4HW9BtO7smyjksqMb1XlbRY.

In a post in the Dataverse Community Forums and in the metadata interest group channel in the Dataverse Community Slack I wrote about the survey, the Google Doc, and how it might be helpful for any Dataverse community members who answer the survey to share their answers.

@landreev
Collaborator

OK, I'll take a look at the doc.
I expect that we will have to do some dev. work to make our DDI import code able to parse and process these ICPSR-produced DDIs. It's a very rich format where things can be encoded in different ways. Our parser was written specifically to understand DDI produced by other Dataverses. (A while ago we added some extra cases and fixes specifically to accommodate ICPSR records, but it appears that they have completely reworked how they create their DDI records, so we'll need to do this again.)

@landreev
Collaborator

What should we do with the current harvested ICPSR dataverse in prod., with all its stale records, etc.? Should we just wipe it clean until we can re-harvest from scratch?

@jggautier
Collaborator Author

jggautier commented Apr 25, 2022

I'm not sure about wiping before we can re-harvest. Maybe in the meantime there's still some value in having the stale records discoverable in the Harvard repo. I think a lot of records are missing, but at least some of the records that are in https://dataverse.harvard.edu/dataverse/icpsr still point to datasets.
