Re-harvesting ICPSR datasets #63

jggautier · 2020-02-13T21:58:09Z

Now that IQSS/dataverse#4964 is technically resolved, Harvard Dataverse should be able to harvest ICPSR dataset metadata over OAI-PMH (to keep the records up to date) and the dataset title links will point to the ICPSR dataset pages.

Can someone confirm that it's okay to delete the existing harvesting client listed on Harvard Dataverse's "Manage Harvesting Clients" page, to remove the 8k+ stale records? I don't know how that client was created, but the delete button on the "Manage Harvesting Clients" page looks nice and clickable...

After that's done, I'd like to create a new client and schedule a harvesting job for fresh ICPSR datasets.
There's an Archive Type called "ICPSR"

It looks like the Archive Type called "ICPSR" should be used when harvesting from ICPSR. Otherwise, with oai_ddi25 as the metadata format, the dataset title links point to the ICPSR homepage (see this dataverse on Demo Dataverse), and with oai_dc as the metadata format, no records are harvested (all 10k+ records fail to harvest). I think this should be documented.

jggautier · 2020-12-16T20:02:30Z

Error setting up harvesting client on UNC Dataverse and Demo Dataverse (but not Harvard Dataverse)

Documentation of ICPSR's OAI-PMH feeds is at https://www.icpsr.umich.edu/web/pages/membership/or/metdata/oai.html.

Thu-Mai at Odum/UNC let us know today (see RT support email) and I've confirmed on Demo Dataverse that Dataverse shows the following error when we try the first step of creating a harvesting client using the server URL https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies:

https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies: Invalid URL. Failed to establish connection and receive a valid server response.

Demo Dataverse reports the same error when I try to use the second of ICPSR's two documented Server URLs, https://www.icpsr.umich.edu/icpsrweb/neutral/oai/citations

UNC is running Dataverse version 4.16. Demo Dataverse is running 5.13. (Both show this error.)
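
One way to check, outside of Dataverse, whether a server URL answers a basic OAI-PMH request is to send it an Identify request and read the repository name from the response. The sketch below is only that independent check. It does not reproduce whatever validation Dataverse runs in the first step of creating a harvesting client, and the helper names (identify_url, repository_name, check_server) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def identify_url(base_url: str) -> str:
    """Build the OAI-PMH Identify request URL for a server base URL."""
    return f"{base_url}?{urlencode({'verb': 'Identify'})}"


def repository_name(identify_xml: str):
    """Return repositoryName from an Identify response, or None if it is absent
    (for example, when the server answered with an OAI-PMH error element)."""
    root = ET.fromstring(identify_xml)
    node = root.find(f"{OAI_NS}Identify/{OAI_NS}repositoryName")
    return node.text if node is not None else None


def check_server(base_url: str, timeout: float = 30.0):
    """Hypothetical helper: fetch the Identify response over the network.
    Raises urllib.error.URLError (or a subclass) if the connection fails."""
    with urlopen(identify_url(base_url), timeout=timeout) as response:
        return repository_name(response.read().decode("utf-8"))
```

Calling check_server("https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies") from the machine that hosts a Dataverse installation might help show whether the failure comes from the network path between that machine and ICPSR or from Dataverse's own handling of the response.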

On Harvard Dataverse, also running 5.13, I don't get this error and I am able to get past all 4 steps for creating a harvesting client. I haven't started a harvesting run on Harvard Dataverse because:

we need to clear the stale metadata records in https://dataverse.harvard.edu/dataverse/icpsr
when I tested harvesting ICPSR metadata into Demo Dataverse, Demo Dataverse harvested only two-thirds of all records...

Dataverse misses 3,573 of the 10,890 records

Every time I've been able to harvest ICPSR's metadata (using the oai_ddi25 metadata format and the ICPSR "archive type"), most recently on a Dataverse instance I created on AWS, Dataverse has failed to harvest 3,573 records. It gets the other 7,317 records.
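
To find out which records are being missed, and not only how many, the identifiers in the feed could be listed independently and compared with what Dataverse harvested. The sketch below pages through OAI-PMH ListIdentifiers responses by following the resumptionToken, as the OAI-PMH 2.0 protocol describes. The function name and the fetch callable are assumptions for illustration: fetch is any function that takes the query parameters and returns the response XML as a string, so this block makes no network calls itself.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_identifiers(fetch: Callable[[Dict[str, str]], str],
                     metadata_prefix: str) -> List[str]:
    """Collect identifiers of non-deleted records from a ListIdentifiers feed.

    `fetch` is a caller-supplied (hypothetical) function that sends the given
    query parameters to the OAI-PMH server and returns the response body.
    """
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    identifiers: List[str] = []
    while True:
        root = ET.fromstring(fetch(params))
        for header in root.iter(f"{OAI_NS}header"):
            # Deleted records are flagged with status="deleted" on the header.
            if header.get("status") != "deleted":
                identifiers.append(header.findtext(f"{OAI_NS}identifier"))
        token = root.findtext(f".//{OAI_NS}resumptionToken")
        if not token:
            # A missing or empty resumptionToken marks the end of the list.
            break
        # Follow-up requests carry only the verb and the resumptionToken.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
    return identifiers
```

If the harvested identifiers can be exported from Dataverse, something like sorted(set(feed_ids) - set(harvested_ids)) would give the missing records, which should number 3,573 if the counts above (7,317 harvested out of 10,890) hold.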

I compared the DDI 2.5 metadata of a few records that Dataverse could and could not harvest. The only difference I could see is that the records Dataverse failed to harvest have file-level metadata (in DDI's fileDscr element), while the records Dataverse was able to harvest do not. Also, when ICPSR's DDI 2.5 metadata does include file-level metadata, it doesn't include the fileDscr element's ID or URI attributes.

So maybe the Dataverse code that imports DDI metadata from ICPSR's OAI-PMH feed doesn't like seeing file-level metadata, or expects to see more metadata about the files (like the ID and/or URI)?
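
The manual comparison above covered only a few records. A small script could run the same check across every record in the feed: does the DDI contain fileDscr elements, and do they carry ID or URI attributes? This is a minimal sketch, not Dataverse's import code. The function names are made up, and it matches elements by local name so that it does not depend on which DDI namespace a record declares.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def file_dscr_summary(ddi_xml: str) -> List[Dict[str, Optional[str]]]:
    """Return one entry per fileDscr element in a DDI Codebook record,
    with its ID and URI attribute values (None when an attribute is absent).
    An empty list means the record has no file-level metadata."""
    root = ET.fromstring(ddi_xml)
    return [
        {"ID": element.get("ID"), "URI": element.get("URI")}
        for element in root.iter()
        if isinstance(element.tag, str) and local_name(element.tag) == "fileDscr"
    ]
```

Running this over the records that failed and the records that succeeded would show whether the presence of fileDscr really separates the two groups, or whether that was only true of the handful of records compared by hand.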

Questions

Why are UNC Dataverse and Demo Dataverse reporting an error on the first step of creating a harvesting client, while Harvard Dataverse and instances I create on AWS aren't?
Why is Dataverse skipping about a third of the records in ICPSR's OAI-PMH feed?
What should Dataverse be doing with file-level metadata in general and how is its inclusion or exclusion from being indexed affecting data discoverability?
Once we can confirm that the Dataverse software will successfully harvest ICPSR's metadata, can we remove the dataset metadata in Harvard Dataverse (https://dataverse.harvard.edu/dataverse/icpsr) and start using ICPSR's OAI-PMH feed to regularly harvest ICPSR metadata into Harvard Dataverse? At least some of the datasets in https://dataverse.harvard.edu/dataverse/icpsr are also in ICPSR's OAI-PMH feed, but I haven't done a thorough check to confirm that all of them are.

jggautier · 2022-04-21T15:14:06Z

Because we've tried to harvest ICPSR's DDI Codebook exports, and I wrote that harvesting failures might be related to file-level metadata, I wonder if the recently opened GitHub issue at IQSS/dataverse#8629 is related.

Also, folks from ICPSR are doing research to improve how ICPSR exports the metadata of their datasets, including exports in their OAI-PMH feeds. An archivist at ICPSR asked me to consider taking a survey on behalf of the Harvard Dataverse Repository (or share it with those who can take the survey), but others in the community have reported issues with harvesting from ICPSR, such as the repository managed at UNC, so I thought I'd share their survey more widely, including in this issue.

The survey deadline is May 31, 2022. The email and the survey link are below.

Greetings, Julian; I hope this message finds you well. My name is Mike Shallcross and I am an archivist at the Inter-university Consortium for Political and Social Research (ICPSR). We are exploring how to make improvements to our metadata exports (e.g., copies of metadata records for our data collections) and prioritize new features and functionality associated with the export process.

We are therefore reaching out to ICPSR users as well as libraries and service providers to better understand how well our downloadable metadata records meet the community's needs (including those of Harvard Dataverse) and also gain insight into preferred metadata standards, file formats, and functionality related to metadata record exports.

The survey is available here: https://umich.qualtrics.com/jfe/form/SV_3fN70KNbCJ6iuuG

Please feel free to share the above link with other individuals in your organization who might be better-positioned to respond.

The survey should take around 5-10 minutes to complete and we are asking participants to submit their responses by May 31, 2022. If you have any questions, please reach out to me at shallcro@umich.edu. Thank you very much for your time and consideration; we sincerely appreciate your input!

With best regards,

Mike Shallcross

--
Mike Shallcross
Associate Archivist
Pronouns: he/him/his
Inter-university Consortium for Political and Social Research
University of Michigan

jggautier · 2022-04-21T17:38:37Z

So I think it would be great if some of the responses to that survey could be informed by future attempts to re-harvest ICPSR metadata as the Dataverse team works on the related harvesting issues that @pdurbin and @landreev have been listing and looking into.

Maybe we won't know enough by the survey deadline, which is basically a month from now. But in case that's possible, I'm going to create a Google Doc with the survey's questions so that interested folks can collaborate on answers.

jggautier · 2022-04-25T21:02:47Z

A Google Doc with the survey is at https://docs.google.com/document/d/1_qmnh_acoTX_MF8zDnOW4HW9BtO7smyjksqMb1XlbRY.

In a post in the Dataverse Community Forums and in the metadata interest group channel in the Dataverse Community Slack I wrote about the survey, the Google Doc, and how it might be helpful for any Dataverse community members who answer the survey to share their answers.

landreev · 2022-04-25T21:11:52Z

OK, I'll take a look at the doc.
I expect that we will have to do some dev work to make our DDI import code able to parse and process these ICPSR-produced DDIs. It's a very rich format where things can be encoded in different ways. Our parser was written specifically to understand DDI produced by other Dataverses. (A while ago we added some extra cases and fixes specifically to accommodate ICPSR records, but it appears that they have completely reworked how they create their DDI records, so we'll need to do this again.)

landreev · 2022-04-25T21:13:38Z

What should we do with the current harvested ICPSR dataverse in prod, with all its stale records, etc.? Should we just wipe it clean until we can re-harvest from scratch?

jggautier · 2022-04-25T21:27:44Z

I'm not sure about wiping before we can re-harvest. Maybe in the meantime there's still some value in having the stale records discoverable in the Harvard repo. I think a lot of records are missing, but at least some of the records that are in https://dataverse.harvard.edu/dataverse/icpsr still point to datasets.

jggautier mentioned this issue Jul 27, 2020

Harvesting zenodo client fail IQSS/dataverse#5050

Closed

This was referenced Jan 4, 2021

Error setting up harvesting client for ICPSR on UNC Dataverse and Demo Dataverse IQSS/dataverse#7497

Closed

When Dataverse can harvest ICPSR metadata, it misses a third of records IQSS/dataverse#7498

Closed

jggautier mentioned this issue Apr 18, 2022

Spike: Inventory and prioritize all existing Harvesting related issues IQSS/dataverse-pm#24

Closed

3 tasks

jggautier mentioned this issue Nov 17, 2022

Spike: Review if inability to harvest all records from certain repositories will be or has been resolved when other harvesting-related issues are addressed #92

Closed

jggautier mentioned this issue Dec 3, 2022

Re-harvesting datasets from Roper Center #204

Open

jggautier mentioned this issue Feb 8, 2023

Spike: Figure out what to do with broken links in harvested metadata from Harvard Geospatial Library #211

Open

cmbz mentioned this issue Feb 7, 2024

GREI 3: HDV Task - Improve OAI-PMH Harvesting IQSS/dataverse-pm#171

Open

32 tasks
